# 🎓 File2Learning - AI Model Training on Google Colab

## 📚 Difficulty Classifier Training Pipeline

**Model**: DistilBERT-based Text Difficulty Classifier (A1-C2 CEFR levels)

**GPU**: Tesla T4 (16GB VRAM) - free on Google Colab

**Training Time**: ~8-12 minutes

---

### 🚀 Quick Start Guide:
1. **Runtime** → **Change runtime type** → **GPU** (T4 or V100)
2. **Run All** (Runtime → Run all), or run the cells one at a time
3. Wait for training to finish (~10 minutes)
4. Download the model to your local machine

---


## 🔧 Step 1: Setup Environment & GPU Check


In [None]:
# Check GPU availability
import torch
import os

print("="*70)
print("🔍 GPU Information")
print("="*70)

if torch.cuda.is_available():
    print(f"✅ GPU Available: {torch.cuda.get_device_name(0)}")
    print(f"📊 GPU Memory: {torch.cuda.get_device_properties(0).total_memory / 1024**3:.2f} GB")
    print(f"🔢 CUDA Version: {torch.version.cuda}")
    print(f"🐍 PyTorch Version: {torch.__version__}")
else:
    print("❌ GPU NOT AVAILABLE!")
    print("⚠️  Go to Runtime → Change runtime type → GPU")

print("="*70)


## 💾 Step 2: Mount Google Drive (Optional)

**If you want to save the model to Google Drive**, uncomment and run this cell:


In [None]:
# from google.colab import drive
# drive.mount('/content/drive')

# # Create output directory in Drive
# DRIVE_OUTPUT_DIR = '/content/drive/MyDrive/File2Learning_Models'
# os.makedirs(DRIVE_OUTPUT_DIR, exist_ok=True)
# print(f"✅ Google Drive mounted! Models will be saved to: {DRIVE_OUTPUT_DIR}")
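The cell above only mounts Drive and creates the target folder; nothing is copied there yet. A minimal sketch of a backup helper follows, to be called once training has produced `models/difficulty_classifier`. The function name `backup_to_drive` is hypothetical and not part of the project.

```python
import shutil
from pathlib import Path


def backup_to_drive(src_dir, drive_dir):
    """Copy src_dir into drive_dir/<src name>, overwriting files that already exist there."""
    src = Path(src_dir)
    if not src.is_dir():
        raise FileNotFoundError(f"Nothing to back up: {src} does not exist")
    dst = Path(drive_dir) / src.name
    # dirs_exist_ok lets repeated runs refresh an earlier backup (Python 3.8+).
    shutil.copytree(src, dst, dirs_exist_ok=True)
    return dst


# Example call, to run only after training and after mounting Drive:
# backup_to_drive('models/difficulty_classifier', DRIVE_OUTPUT_DIR)
```

Copying into Drive guards against losing the results if the Colab runtime disconnects before the download step.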


## 📁 Step 3: Upload Project Files

**Choose one of the two options:**

### **Option A: Upload from your local machine** (Recommended)
1. Zip the entire `backend/` folder into `backend.zip`
2. Upload the zip file and extract it


In [None]:
# Option A: Upload ZIP file
from google.colab import files
import zipfile

print("📤 Upload backend.zip file...")
uploaded = files.upload()

# Extract
for filename in uploaded.keys():
    if filename.endswith('.zip'):
        print(f"📦 Extracting {filename}...")
        with zipfile.ZipFile(filename, 'r') as zip_ref:
            zip_ref.extractall('/content/')
        print("✅ Extraction complete!")

# Change to backend directory
%cd /content/backend
!pwd
!ls -la
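The `%cd /content/backend` step only works if the archive has a top-level `backend/` folder. The sketch below checks that before extraction. The helper names are hypothetical; the only assumption about the project is the folder name given in Option A.

```python
import zipfile


def top_level_entries(zip_path):
    """Return the set of first path components found in the archive."""
    with zipfile.ZipFile(zip_path) as zf:
        return {name.split('/', 1)[0] for name in zf.namelist() if name}


def has_backend_root(zip_path, folder='backend'):
    """True if at least one archive member lives under `folder/`."""
    with zipfile.ZipFile(zip_path) as zf:
        return any(name.startswith(folder + '/') for name in zf.namelist())


# Example call on the uploaded file:
# if not has_backend_root('backend.zip'):
#     print("Unexpected layout:", top_level_entries('backend.zip'))
```

If the check fails, the folder contents were probably zipped without the enclosing `backend/` directory, and the files will land directly in `/content/`.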


### **Option B: Clone from GitHub** (if you have already pushed the code to GitHub)


In [None]:
# # Option B: Clone from GitHub
# !git clone https://github.com/YOUR_USERNAME/File2Learning.git
# %cd File2Learning/backend
# !pwd
# !ls -la


## 📦 Step 4: Install Dependencies

Install all the packages required for AI training


In [None]:
print("📦 Installing AI dependencies...")
print("⏳ This may take 2-3 minutes...\n")

# Install core packages
!pip install -q torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
!pip install -q transformers==4.36.0 tokenizers==0.15.0
!pip install -q accelerate==0.25.0
!pip install -q pandas numpy scikit-learn
!pip install -q matplotlib seaborn plotly
!pip install -q tqdm

print("\n✅ All dependencies installed!")

# Verify installation
import transformers
import torch
print(f"\n📚 Transformers version: {transformers.__version__}")
print(f"🔥 PyTorch version: {torch.__version__}")
print(f"🎮 CUDA available: {torch.cuda.is_available()}")


## 🔍 Step 5: Verify Project Structure

Check that all required files are present


In [None]:
import os
from pathlib import Path

print("🔍 Verifying project structure...\n")

required_files = [
    'train_ai_model.py',
    'app/ai/models/difficulty_classifier.py',
    'app/ai/training/train_difficulty.py',
    'app/ai/datasets/collect_data.py',
    'app/ai/utils/data_preprocessing.py',
]

all_good = True
for file in required_files:
    if Path(file).exists():
        print(f"✅ {file}")
    else:
        print(f"❌ {file} - MISSING!")
        all_good = False

if all_good:
    print("\n🎉 All required files present!")
else:
    print("\n⚠️  Some files are missing. Please check your upload.")

# Check if dataset exists
dataset_path = Path('app/ai/datasets/raw_dataset.json')
if dataset_path.exists():
    import json
    with open(dataset_path, encoding='utf-8') as f:
        data = json.load(f)
    print(f"\n📊 Dataset found: {data.get('num_samples', 0)} samples")
else:
    print("\n⚠️  Dataset not found. Will generate synthetic dataset.")


## ⚙️ Step 6: Training Configuration

Configuration tuned for the T4 GPU (16GB VRAM)


In [None]:
# Training configuration for Google Colab T4
TRAINING_CONFIG = {
    'batch_size': 16,        # Raised from 8 (local) to 16 because the T4 has 16GB VRAM
    'num_epochs': 3,         # Unchanged
    'learning_rate': 2e-5,   # Unchanged
    'max_length': 512,       # Unchanged
    'warmup_steps': 500,     # Unchanged
    'device': 'cuda' if torch.cuda.is_available() else 'cpu'
}

print("⚙️ Training Configuration for Google Colab")
print("="*70)
for key, value in TRAINING_CONFIG.items():
    print(f"  {key:20s}: {value}")
print("="*70)
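It can help to check how `warmup_steps` compares with the total number of optimizer steps. The sketch below does that arithmetic under two assumptions: one optimizer step per batch (no gradient accumulation) and warmup counted in optimizer steps. The figure of 2,000 training samples is invented for illustration, and `schedule_summary` is a hypothetical helper.

```python
import math


def schedule_summary(num_samples, batch_size, num_epochs, warmup_steps):
    """Relate warmup_steps to the total number of optimizer steps."""
    steps_per_epoch = math.ceil(num_samples / batch_size)
    total_steps = steps_per_epoch * num_epochs
    return {
        "steps_per_epoch": steps_per_epoch,
        "total_steps": total_steps,
        "warmup_fraction": warmup_steps / total_steps,
    }


# num_samples=2000 is a made-up value; substitute the real training set size.
summary = schedule_summary(num_samples=2000, batch_size=16,
                           num_epochs=3, warmup_steps=500)
print(summary)
```

With these illustrative numbers there are 125 steps per epoch and 375 steps in total, so a 500-step warmup would never finish. For small datasets a lower `warmup_steps` may therefore be worth trying.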


## 📊 Step 7: Collect Training Data

Generate a synthetic dataset (or use an existing one)


In [None]:
print("📊 Step 7: Collecting training data...")
print("="*70)

!python -m app.ai.datasets.collect_data

print("\n✅ Data collection complete!")


## 🚀 Step 8: Train the Model!

**Main training process** - this is the most important step!

Expected time: **~8-12 minutes** on a T4 GPU

### What happens:
1. Load the dataset and preprocess it
2. Initialize DistilBERT model
3. Train for 3 epochs
4. Save the best model based on the validation F1 score
5. Generate training curves and a confusion matrix
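The "save the best model" step can be sketched in plain Python as follows. This is an illustration, not the code in `train_difficulty.py`. It assumes macro-averaged F1 (the script may use another averaging), and `macro_f1` and `BestCheckpoint` are hypothetical names.

```python
def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of per-class F1 scores."""
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        # A class with no true or predicted samples contributes 0.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


class BestCheckpoint:
    """Remember the epoch with the highest validation F1 seen so far."""

    def __init__(self):
        self.best_f1 = float("-inf")
        self.best_epoch = None

    def update(self, epoch, f1):
        """Return True when this epoch beats the best so far (the caller would save here)."""
        if f1 > self.best_f1:
            self.best_f1, self.best_epoch = f1, epoch
            return True
        return False


tracker = BestCheckpoint()
for epoch, val_f1 in enumerate([0.52, 0.61, 0.58], start=1):  # invented scores
    if tracker.update(epoch, val_f1):
        print(f"epoch {epoch}: new best F1 {val_f1:.2f}, would save best_model.pt")
```

Selecting on validation F1 rather than on the last epoch means a later epoch that overfits does not overwrite a better earlier checkpoint.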


In [None]:
import time

print("🚀 Starting AI Model Training...")
print("="*70)
print("⏱️  Estimated time: 8-12 minutes on T4 GPU")
print("📊 You'll see progress bars for each epoch")
print("="*70)
print()

start_time = time.time()

# Run training
!python -m app.ai.training.train_difficulty

end_time = time.time()
duration = end_time - start_time

print("\n" + "="*70)
print("✅ Training Complete!")
print(f"⏱️  Total time: {duration/60:.2f} minutes ({duration:.0f} seconds)")
print("="*70)


## 📈 Step 9: View Training Results

Visualize the training curves and the confusion matrix


In [None]:
from IPython.display import Image, display
import os

print("📈 Training Results Visualization")
print("="*70)

# Display training curves
curves_path = 'models/difficulty_classifier/training_curves.png'
if os.path.exists(curves_path):
    print("\n📊 Training Curves:")
    display(Image(filename=curves_path))
else:
    print(f"⚠️  Training curves not found at {curves_path}")

# Display confusion matrix
cm_path = 'models/difficulty_classifier/confusion_matrix.png'
if os.path.exists(cm_path):
    print("\n🎯 Confusion Matrix:")
    display(Image(filename=cm_path))
else:
    print(f"⚠️  Confusion matrix not found at {cm_path}")

# List all generated files
print("\n📂 Generated Files:")
!ls -lh models/difficulty_classifier/


## 🧪 Step 10: Test Model Inference

Test the model on a few sample texts


In [None]:
import torch
from transformers import DistilBertTokenizer
import sys
from pathlib import Path

# Import model class
sys.path.append(str(Path.cwd()))
from app.ai.models.difficulty_classifier import DifficultyClassifier

print("🧪 Testing Model Inference")
print("="*70)

# Load model
device = 'cuda' if torch.cuda.is_available() else 'cpu'
model_path = 'models/difficulty_classifier/best_model.pt'

print(f"📥 Loading model from {model_path}...")
model = DifficultyClassifier.load_model(model_path, device=device)
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')

print("✅ Model loaded!\n")

# Test samples
test_texts = [
    "I have a cat. It is black.",  # A1
    "Last week I went to the park. The weather was nice.",  # A2
    "Learning a new language requires dedication and consistent practice.",  # B1
    "The implementation of new technologies has fundamentally transformed businesses.",  # B2
    "The paradigmatic shift in environmental policy necessitates comprehensive reevaluation.",  # C1
    "The epistemological implications fundamentally challenge deterministic paradigms.",  # C2
]

print("🔍 Testing sample texts:\n")

for i, text in enumerate(test_texts, 1):
    # Tokenize
    encoding = tokenizer(
        text,
        add_special_tokens=True,
        max_length=512,
        padding='max_length',
        truncation=True,
        return_tensors='pt'
    )
    
    input_ids = encoding['input_ids'].to(device)
    attention_mask = encoding['attention_mask'].to(device)
    
    # Predict
    result = model.predict_text(input_ids, attention_mask)
    
    print(f"Text {i}: {text[:60]}...")
    print(f"  ➡️  Predicted: {result['level']} (Confidence: {result['confidence']:.2%})")
    top3 = sorted(result['probabilities'].items(), key=lambda x: x[1], reverse=True)[:3]
    print(f"  📊 Top 3: {', '.join(f'{k}:{v:.1%}' for k, v in top3)}")
    print()

print("="*70)
print("✅ Inference test complete!")
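The cell above reads `level`, `confidence`, and `probabilities` from the dictionary returned by `predict_text`. As a rough sketch of how such a dictionary could be built from six raw class scores, the code below applies a softmax in plain Python. How `DifficultyClassifier` actually computes its result is not shown here, and both the A1 to C2 index order and the name `logits_to_result` are assumptions.

```python
import math

# Assumed mapping from class index to CEFR level.
CEFR_LEVELS = ["A1", "A2", "B1", "B2", "C1", "C2"]


def logits_to_result(logits, levels=CEFR_LEVELS):
    """Turn raw class scores into a level, a confidence, and per-level probabilities."""
    peak = max(logits)
    # Subtracting the maximum keeps exp() from overflowing on large scores.
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = {level: e / total for level, e in zip(levels, exps)}
    best = max(probs, key=probs.get)
    return {"level": best, "confidence": probs[best], "probabilities": probs}


example = logits_to_result([0.1, 0.3, 2.5, 1.2, -0.4, -1.0])  # invented scores
print(example["level"], f"{example['confidence']:.2%}")
```

The confidence is simply the probability of the predicted level, so a low value signals that neighbouring levels scored almost as high.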


## 💾 Step 11: Download Trained Model

Download the model and results to your local machine


In [None]:
from google.colab import files
import shutil
import os

print("💾 Preparing files for download...")
print("="*70)

# Create zip file with all results
output_dir = 'models/difficulty_classifier'
zip_filename = 'file2learning_trained_model'

# Zip the model directory
shutil.make_archive(zip_filename, 'zip', output_dir)

zip_file = f"{zip_filename}.zip"
print(f"\n📦 Created {zip_file}")
print("\nContents:")
!unzip -l {zip_file}

print("\n⬇️  Downloading...")
files.download(zip_file)

print("\n✅ Download complete!")
print("\n📋 Next steps:")
print("  1. Extract the zip file")
print("  2. Copy contents to your local: backend/models/difficulty_classifier/")
print("  3. Test the model in your local project")
print("="*70)
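Before relying on the downloaded archive, it may be worth confirming that it contains the files listed under Output Files. The sketch below does that with the standard `zipfile` module. `missing_from_archive` is a hypothetical helper, and it assumes the files sit at the top level of the archive, which is what `shutil.make_archive` produces when given `output_dir` as the root.

```python
import os
import zipfile

# File names taken from the Output Files list in this notebook.
EXPECTED_FILES = ["best_model.pt", "training_curves.png", "confusion_matrix.png"]


def missing_from_archive(zip_path, expected=EXPECTED_FILES):
    """Return the expected file names that are absent from the archive."""
    with zipfile.ZipFile(zip_path) as zf:
        names = set(zf.namelist())
    return [name for name in expected if name not in names]


archive = "file2learning_trained_model.zip"
if os.path.exists(archive):
    missing = missing_from_archive(archive)
    print("All expected files present" if not missing else f"Missing: {missing}")
```

A missing `best_model.pt` usually means training stopped before the first validation pass, so the training log is the next place to look.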


---

## 🎉 Training Complete!

### 📊 Summary

You have successfully trained the **Difficulty Classifier** with:
- ✅ Model: DistilBERT (66M parameters)
- ✅ Task: 6-class classification (A1, A2, B1, B2, C1, C2)
- ✅ GPU: Google Colab T4 (16GB VRAM)
- ✅ Dataset: Synthetic + OneStop English Corpus

### 📂 Output Files
- `best_model.pt` - Trained model weights
- `training_curves.png` - Loss/Accuracy/F1 curves
- `confusion_matrix.png` - Model performance visualization
- Checkpoint files for each epoch

### 🔄 Next Steps
1. Download the model to your local project
2. Test the model in the application
3. Integrate it into the document processing pipeline
4. Fine-tune with real user data if needed

### 💡 Tips
- To retrain with different parameters, edit the config in **Step 6**
- To train on a larger dataset, add more data in `collect_data.py`
- The model can improve over time as real user data becomes available

---

### 📞 Troubleshooting

**Common Issues:**

1. **GPU Not Available** → Runtime → Change runtime type → GPU
2. **Out of Memory** → Reduce batch_size from 16 to 8
3. **Files Not Found** → Re-check the upload in Step 3
4. **Import Errors** → Re-run Step 4 (Install dependencies)

---

**Happy Training! 🚀**
