# 🎓 File2Learning - AI Model Training on Google Colab

## 📚 Difficulty Classifier Training Pipeline

**Model**: DistilBERT-based Text Difficulty Classifier (A1-C2 CEFR levels)

**GPU**: Tesla T4 (16GB VRAM) - free on Google Colab

**Training Time**: ~8-12 minutes

---

### 🚀 Quick Start Guide:
1. **Runtime** → **Change runtime type** → **GPU** (T4 or V100)
2. **Run All** (Runtime → Run all) or run the cells one at a time
3. Wait for training to finish (~10 minutes)
4. Download the model to your local machine

---


## 🔧 Step 1: Setup Environment & GPU Check


In [None]:
# Check GPU availability
import torch
import os

print("="*70)
print("🔍 GPU Information")
print("="*70)

if torch.cuda.is_available():
    print(f"✅ GPU Available: {torch.cuda.get_device_name(0)}")
    print(f"📊 GPU Memory: {torch.cuda.get_device_properties(0).total_memory / 1024**3:.2f} GB")
    print(f"🔢 CUDA Version: {torch.version.cuda}")
    print(f"🐍 PyTorch Version: {torch.__version__}")
else:
    print("❌ GPU NOT AVAILABLE!")
    print("⚠️  Go to Runtime → Change runtime type → GPU")

print("="*70)
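
The check above depends on PyTorch being importable. As a complementary sketch, the GPU can also be queried through the `nvidia-smi` command-line tool, which is normally present on GPU runtimes. The helper name `query_gpus` is hypothetical, and the code assumes that a missing `nvidia-smi` means no GPU is attached.

```python
import shutil
import subprocess


def query_gpus():
    """Return a list of (name, total_memory) strings, or [] if no GPU is visible."""
    # nvidia-smi only exists when an NVIDIA driver is installed.
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,memory.total", "--format=csv,noheader"],
        capture_output=True,
        text=True,
    )
    if result.returncode != 0:
        return []
    gpus = []
    for line in result.stdout.strip().splitlines():
        # Each line looks like "<name>, <memory>"; split on the first comma.
        name, _, memory = line.partition(",")
        gpus.append((name.strip(), memory.strip()))
    return gpus


gpus = query_gpus()
if gpus:
    for name, memory in gpus:
        print(f"GPU: {name} ({memory})")
else:
    print("No GPU visible. Go to Runtime → Change runtime type → GPU")
```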


## 💾 Step 2: Mount Google Drive (Optional)

**If you want to save the model to Google Drive**, uncomment and run this cell:


In [None]:
# from google.colab import drive
# drive.mount('/content/drive')

# # Create output directory in Drive
# DRIVE_OUTPUT_DIR = '/content/drive/MyDrive/File2Learning_Models'
# os.makedirs(DRIVE_OUTPUT_DIR, exist_ok=True)
# print(f"✅ Google Drive mounted! Models will be saved to: {DRIVE_OUTPUT_DIR}")
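
Because mounting Drive is optional, later cells may need to decide where to write. A minimal sketch of that decision, assuming the mount point and folder name from the cell above; the helper `resolve_output_dir` and the local fallback `models` are hypothetical choices for illustration.

```python
from pathlib import Path


def resolve_output_dir(drive_root="/content/drive/MyDrive",
                       folder="File2Learning_Models",
                       fallback="models"):
    """Pick the Drive folder when Drive is mounted, else a local folder."""
    root = Path(drive_root)
    # The mount point only exists as a directory after drive.mount() succeeded.
    return root / folder if root.is_dir() else Path(fallback)


output_dir = resolve_output_dir()
# Create the folder before writing into it:
# output_dir.mkdir(parents=True, exist_ok=True)
print(f"Models would be saved to: {output_dir}")
```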


## 📁 Step 3: Upload Project Files

**Choose one of the two options:**

### **Option A: Upload from your local machine** (Recommended)
1. Zip the entire `backend/` folder into `backend.zip`
2. Upload the zip file and extract it


In [None]:
# Option A: Upload ZIP file
from google.colab import files
import zipfile

print("📤 Upload backend.zip file...")
uploaded = files.upload()

# Extract
for filename in uploaded.keys():
    if filename.endswith('.zip'):
        print(f"📦 Extracting {filename}...")
        with zipfile.ZipFile(filename, 'r') as zip_ref:
            zip_ref.extractall('/content/')
        print("✅ Extraction complete!")

# Change to backend directory
%cd /content/backend
!pwd
!ls -la
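
The `%cd /content/backend` line only works if the archive unpacks into a top-level `backend/` folder. A small sketch of a check that could run before extraction; the helper name `has_top_level_folder` is hypothetical, and the demo builds a throwaway archive rather than using a real upload.

```python
import os
import tempfile
import zipfile


def has_top_level_folder(zip_path, folder="backend"):
    """True when at least one archive entry lives under `folder/`."""
    with zipfile.ZipFile(zip_path) as archive:
        return any(name.startswith(folder + "/") for name in archive.namelist())


# Demo on a throwaway archive so the check can be tried without uploading anything.
with tempfile.TemporaryDirectory() as tmp:
    demo = os.path.join(tmp, "demo.zip")
    with zipfile.ZipFile(demo, "w") as archive:
        archive.writestr("backend/train_ai_model.py", "")
    print(has_top_level_folder(demo))          # True
    print(has_top_level_folder(demo, "src"))   # False
```

If the check fails, the zip was probably created from inside `backend/` instead of from its parent folder.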


### **Option B: Clone from GitHub** (if you have already pushed the code to GitHub)


In [None]:
# # Option B: Clone from GitHub
# !git clone https://github.com/YOUR_USERNAME/File2Learning.git
# %cd File2Learning/backend
# !pwd
# !ls -la


## 📦 Step 4: Install Dependencies

Install all packages required for AI training


In [None]:
print("📦 Installing AI dependencies...")
print("⏳ This may take 2-3 minutes...\n")

# Install core packages
!pip install -q torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
!pip install -q transformers==4.36.0 tokenizers==0.15.0
!pip install -q accelerate==0.25.0
!pip install -q pandas numpy scikit-learn
!pip install -q matplotlib seaborn plotly
!pip install -q tqdm

print("\n✅ All dependencies installed!")

# Verify installation
import transformers
import torch
print(f"\n📚 Transformers version: {transformers.__version__}")
print(f"🔥 PyTorch version: {torch.__version__}")
print(f"🎮 CUDA available: {torch.cuda.is_available()}")
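
A module imported before the install can still report its old version in the same session. As a hedged complement to the version printout above, the installed distributions can be compared against the pins from the install cell without importing them. The helper name `find_mismatches` is hypothetical.

```python
from importlib import metadata

# Pins copied from the install cell above.
PINNED = {"transformers": "4.36.0", "tokenizers": "0.15.0", "accelerate": "0.25.0"}


def find_mismatches(pins):
    """Map package name -> (expected, installed) for every pin that is not met."""
    mismatches = {}
    for package, expected in pins.items():
        try:
            installed = metadata.version(package)
        except metadata.PackageNotFoundError:
            installed = None  # the package is not installed at all
        if installed != expected:
            mismatches[package] = (expected, installed)
    return mismatches


for package, (expected, installed) in find_mismatches(PINNED).items():
    print(f"{package}: expected {expected}, found {installed or 'not installed'}")
```

If a mismatch is reported after installing, restarting the runtime and re-running the cells usually clears it.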


## 🔍 Step 5: Verify Project Structure

Check that all required files are present


In [None]:
import os
from pathlib import Path

print("🔍 Verifying project structure...\n")

required_files = [
    'train_ai_model.py',
    'app/ai/models/difficulty_classifier.py',
    'app/ai/training/train_difficulty.py',
    'app/ai/datasets/collect_data.py',
    'app/ai/utils/data_preprocessing.py',
]

all_good = True
for file in required_files:
    if Path(file).exists():
        print(f"✅ {file}")
    else:
        print(f"❌ {file} - MISSING!")
        all_good = False

if all_good:
    print("\n🎉 All required files present!")
else:
    print("\n⚠️  Some files are missing. Please check your upload.")

# Check if dataset exists
dataset_path = Path('app/ai/datasets/raw_dataset.json')
if dataset_path.exists():
    import json
    with open(dataset_path) as f:
        data = json.load(f)
    print(f"\n📊 Dataset found: {data.get('num_samples', 0)} samples")
else:
    print("\n⚠️  Dataset not found. Will generate synthetic dataset.")


## ⚙️ Step 6: Training Configuration

Configuration tuned for the T4 GPU (16GB VRAM)


In [None]:
# Training configuration for Google Colab T4
TRAINING_CONFIG = {
    'batch_size': 16,        # Raised from 8 (local) to 16 because the T4 has 16GB VRAM
    'num_epochs': 3,         # Unchanged
    'learning_rate': 2e-5,   # Unchanged
    'max_length': 512,       # Unchanged
    'warmup_steps': 500,     # Unchanged
    'device': 'cuda' if torch.cuda.is_available() else 'cpu'
}

print("⚙️ Training Configuration for Google Colab")
print("="*70)
for key, value in TRAINING_CONFIG.items():
    print(f"  {key:20s}: {value}")
print("="*70)
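
Whether `warmup_steps: 500` is reasonable depends on how many optimizer steps the run has in total. A small arithmetic sketch, assuming one optimizer step per batch and no gradient accumulation; `num_samples` is a made-up training-set size, not a figure from this project.

```python
import math


def total_training_steps(num_samples, batch_size, num_epochs):
    """Number of optimizer steps, assuming one step per batch."""
    return math.ceil(num_samples / batch_size) * num_epochs


# Hypothetical training-set size, for illustration only.
num_samples = 6000
steps = total_training_steps(num_samples, batch_size=16, num_epochs=3)
warmup_steps = 500
print(f"total steps: {steps}, warmup share: {warmup_steps / steps:.0%}")
```

With these illustrative numbers the run has 1125 steps, so warmup would cover about 44% of training. On a much smaller dataset the warmup could exceed the whole run, in which case the learning rate would never reach its configured value.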


## 📊 Step 7: Collect Training Data

Generate a synthetic dataset (or use an existing one)


In [None]:
print("📊 Step 7: Collecting training data...")
print("="*70)

!python -m app.ai.datasets.collect_data

print("\n✅ Data collection complete!")


## 🚀 Step 8: Train the Model!

**Main training process** - this is the most important step!

Expected time: **~8-12 minutes** on a T4 GPU

### What happens:
1. Load the dataset and preprocess it
2. Initialize the DistilBERT model
3. Train for 3 epochs
4. Save the best model based on validation F1 score
5. Generate training curves and a confusion matrix
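
Step 4 of the list, keeping the best model by validation F1, can be sketched in plain Python. This is not the project's training code: it assumes macro-averaged F1 (the notebook does not say which averaging is used), and the predictions are toy values.

```python
def macro_f1(y_true, y_pred, num_classes=6):
    """Unweighted mean of per-class F1 scores (classes with no samples score 0)."""
    scores = []
    for cls in range(num_classes):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p != cls)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / num_classes


def best_epoch(val_predictions, y_true):
    """Return (epoch_number, f1) for the epoch with the highest validation F1."""
    best = (None, -1.0)
    for epoch, y_pred in enumerate(val_predictions, start=1):
        score = macro_f1(y_true, y_pred)
        if score > best[1]:  # strict '>' keeps the earliest epoch on ties
            best = (epoch, score)  # a real loop would save the checkpoint here
    return best


# Toy validation labels (0..5 stand for A1..C2) and per-epoch predictions.
y_true = [0, 1, 2, 3, 4, 5]
val_predictions = [
    [0, 1, 2, 3, 4, 4],  # epoch 1: one mistake
    [0, 1, 2, 3, 4, 5],  # epoch 2: all correct
    [0, 0, 2, 3, 4, 5],  # epoch 3: one mistake
]
print(best_epoch(val_predictions, y_true))  # (2, 1.0)
```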


In [None]:
import time

print("🚀 Starting AI Model Training...")
print("="*70)
print("⏱️  Estimated time: 8-12 minutes on T4 GPU")
print("📊 You'll see progress bars for each epoch")
print("="*70)
print()

start_time = time.time()

# Run training
!python -m app.ai.training.train_difficulty

end_time = time.time()
duration = end_time - start_time

print("\n" + "="*70)
print("✅ Training Complete!")
print(f"⏱️  Total time: {duration/60:.2f} minutes ({duration:.0f} seconds)")
print("="*70)


## 📈 Step 9: View Training Results

Visualize the training curves and confusion matrix


In [None]:
from IPython.display import Image, display
import os

print("📈 Training Results Visualization")
print("="*70)

# Display training curves
curves_path = 'models/difficulty_classifier/training_curves.png'
if os.path.exists(curves_path):
    print("\n📊 Training Curves:")
    display(Image(filename=curves_path))
else:
    print(f"⚠️  Training curves not found at {curves_path}")

# Display confusion matrix
cm_path = 'models/difficulty_classifier/confusion_matrix.png'
if os.path.exists(cm_path):
    print("\n🎯 Confusion Matrix:")
    display(Image(filename=cm_path))
else:
    print(f"⚠️  Confusion matrix not found at {cm_path}")

# List all generated files
print("\n📂 Generated Files:")
!ls -lh models/difficulty_classifier/


## 🧪 Step 10: Test Model Inference

Test the model on a few sample texts


In [None]:
import torch
from transformers import DistilBertTokenizer
import sys
from pathlib import Path

# Import model class
sys.path.append(str(Path.cwd()))
from app.ai.models.difficulty_classifier import DifficultyClassifier

print("🧪 Testing Model Inference")
print("="*70)

# Load model
device = 'cuda' if torch.cuda.is_available() else 'cpu'
model_path = 'models/difficulty_classifier/best_model.pt'

print(f"📥 Loading model from {model_path}...")
model = DifficultyClassifier.load_model(model_path, device=device)
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')

print("✅ Model loaded!\n")

# Test samples
test_texts = [
    "I have a cat. It is black.",  # A1
    "Last week I went to the park. The weather was nice.",  # A2
    "Learning a new language requires dedication and consistent practice.",  # B1
    "The implementation of new technologies has fundamentally transformed businesses.",  # B2
    "The paradigmatic shift in environmental policy necessitates comprehensive reevaluation.",  # C1
    "The epistemological implications fundamentally challenge deterministic paradigms.",  # C2
]

print("🔍 Testing sample texts:\n")

for i, text in enumerate(test_texts, 1):
    # Tokenize
    encoding = tokenizer(
        text,
        add_special_tokens=True,
        max_length=512,
        padding='max_length',
        truncation=True,
        return_tensors='pt'
    )
    
    input_ids = encoding['input_ids'].to(device)
    attention_mask = encoding['attention_mask'].to(device)
    
    # Predict
    result = model.predict_text(input_ids, attention_mask)
    
    # Rank the class probabilities and keep the three most likely levels
    top3 = sorted(result['probabilities'].items(), key=lambda x: x[1], reverse=True)[:3]

    print(f"Text {i}: {text[:60]}...")
    print(f"  ➡️  Predicted: {result['level']} (Confidence: {result['confidence']:.2%})")
    print(f"  📊 Top 3: {', '.join(f'{k}:{v:.1%}' for k, v in top3)}")
    print()

print("="*70)
print("✅ Inference test complete!")
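
The cell above reads `level`, `confidence`, and `probabilities` from the result of `predict_text`, but how that method computes them is not shown here. A minimal sketch of one common way to get such values, assuming a softmax over six logits and the label order A1 to C2; both are assumptions, and the logits are made up.

```python
import math

# Assumed label order; the project's classifier may index levels differently.
LEVELS = ["A1", "A2", "B1", "B2", "C1", "C2"]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def summarize(logits, top_k=3):
    """Build a dict with the keys the inference cell reads, plus a top-k ranking."""
    probabilities = dict(zip(LEVELS, softmax(logits)))
    ranked = sorted(probabilities.items(), key=lambda item: item[1], reverse=True)
    level, confidence = ranked[0]
    return {
        "level": level,
        "confidence": confidence,
        "probabilities": probabilities,
        "top": ranked[:top_k],
    }


result = summarize([0.1, 0.3, 2.5, 1.2, -0.4, -1.0])  # made-up logits
print(result["level"], f"{result['confidence']:.2%}")
```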


## 💾 Step 11: Download Trained Model

Download the model and results to your local machine


In [None]:
from google.colab import files
import shutil
import os

print("💾 Preparing files for download...")
print("="*70)

# Create zip file with all results
output_dir = 'models/difficulty_classifier'
zip_filename = 'file2learning_trained_model'

# Zip the model directory
shutil.make_archive(zip_filename, 'zip', output_dir)

zip_file = f"{zip_filename}.zip"
print(f"\n📦 Created {zip_file}")
print("\nContents:")
!unzip -l {zip_file}

print("\n⬇️  Downloading...")
files.download(zip_file)

print("\n✅ Download complete!")
print("\n📋 Next steps:")
print("  1. Extract the zip file")
print("  2. Copy contents to your local: backend/models/difficulty_classifier/")
print("  3. Test the model in your local project")
print("="*70)
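
Large downloads from a browser session can be cut short. As an optional sketch, a SHA-256 digest computed in Colab can be compared with one computed locally after the download. The helper name `sha256_of` is hypothetical; the zip name is the one used in the cell above.

```python
import hashlib
import os


def sha256_of(path, chunk_size=1 << 20):
    """Hex SHA-256 digest of a file, read in chunks to keep memory use flat."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


zip_file = "file2learning_trained_model.zip"  # name used in the download cell
if os.path.exists(zip_file):
    print(f"{zip_file}: {sha256_of(zip_file)}")
```

Running the same function on the downloaded file should print the same digest.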


---

## 🎉 Training Complete!

### 📊 Summary

You have successfully trained the **Difficulty Classifier** with:
- ✅ Model: DistilBERT (66M parameters)
- ✅ Task: 6-class classification (A1, A2, B1, B2, C1, C2)
- ✅ GPU: Google Colab T4 (16GB VRAM)
- ✅ Dataset: Synthetic + OneStop English Corpus

### 📂 Output Files
- `best_model.pt` - Trained model weights
- `training_curves.png` - Loss/Accuracy/F1 curves
- `confusion_matrix.png` - Model performance visualization
- Checkpoint files for each epoch

### 🔄 Next Steps
1. Download the model to your local project
2. Test the model in the application
3. Integrate it into the document processing pipeline
4. Fine-tune with real user data if needed

### 💡 Tips
- To retrain with different parameters, adjust the config in **Step 6**
- To train on a larger dataset, add more data in `collect_data.py`
- The model can improve over time as real user data becomes available

---

### 📞 Troubleshooting

**Common Issues:**

1. **GPU Not Available** → Runtime → Change runtime type → GPU
2. **Out of Memory** → Reduce `batch_size` from 16 to 8
3. **Files Not Found** → Re-check the upload in Step 3
4. **Import Errors** → Re-run Step 4 (Install dependencies)

---

**Happy Training! 🚀**
