# Project 24: Quality of Experience (QoE) Prediction for Video Streaming

**Objective:** Build a machine learning model that predicts the QoE category (Good, Fair, or Poor) of a video streaming session from network performance metrics such as throughput, packet loss, and jitter.

**Dataset Source:** Kaggle - "YouTube UGC Video Quality & Network Dataset"

**Model:** RandomForestClassifier - robust ensemble method for QoE category classification

## 1. Setup Kaggle API and Download Data

In [None]:
import os

if not os.path.exists('/root/.kaggle/kaggle.json'):
    print("--- Setting up Kaggle API ---")
    !pip install -q kaggle
    from google.colab import files
    print("\nPlease upload your kaggle.json file:")
    uploaded = files.upload()
    if 'kaggle.json' not in uploaded:
        # Raise instead of calling exit(), which can shut down the notebook runtime
        raise FileNotFoundError("kaggle.json was not uploaded.")
    !mkdir -p ~/.kaggle && cp kaggle.json ~/.kaggle/ && chmod 600 ~/.kaggle/kaggle.json
else:
    print("Kaggle API already configured.")

In [None]:
print("\n--- Downloading YouTube UGC Dataset from Kaggle ---")
!kaggle datasets download -d mlekhi/youtube-ugc-video-quality-network-dataset

print("\n--- Unzipping the dataset ---")
!unzip -q youtube-ugc-video-quality-network-dataset.zip -d youtube_qoe
print("Dataset setup complete.")

## 2. Load and Prepare the Data

In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import classification_report, confusion_matrix
import matplotlib.pyplot as plt
import seaborn as sns

print("\n--- Loading and Preprocessing Data ---")

In [None]:
try:
    # The dataset is split into network conditions and video quality, we need to merge them.
    df_net = pd.read_csv('youtube_qoe/network_features.csv')
    df_vid = pd.read_csv('youtube_qoe/video_features.csv')
    # Merge on the common 'session_id' column
    df = pd.merge(df_net, df_vid, on='session_id')
    print("Successfully loaded and merged datasets.")
except FileNotFoundError as e:
    print(f"Error: Could not find dataset files. {e}")
    # Re-raise so the notebook stops here; every later cell depends on df
    raise

# Drop the identifier and VMAF (a direct quality score, too close to the target)
df = df.drop(columns=['session_id', 'vmaf'])
# Drop incomplete rows first so NaN comparisons cannot hide a constant column
df = df.dropna()
# Drop zero-variance columns: a single unique value carries no signal
df = df.loc[:, df.nunique() > 1]

print(f"Dataset shape after merging and cleaning: {df.shape}")
print(f"\nDataset columns: {list(df.columns)}")
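
The inner merge on `session_id` silently drops sessions that appear in only one file and duplicates rows if an ID repeats. A minimal sketch of auditing the join, using toy frames in place of the real CSVs (the `bitrate` column and all values are assumptions for illustration):

```python
import pandas as pd

# Toy stand-ins for network_features.csv and video_features.csv (hypothetical values)
net = pd.DataFrame({"session_id": [1, 2, 3], "throughput_avg": [8.5, 3.2, 1.1]})
vid = pd.DataFrame({"session_id": [2, 3, 4], "bitrate": [2500, 1200, 800]})

# validate raises MergeError if session_id repeats on either side;
# indicator adds a _merge column recording where each row came from
audit = pd.merge(net, vid, on="session_id", how="outer",
                 indicator=True, validate="one_to_one")
merge_counts = audit["_merge"].value_counts().to_dict()
print(merge_counts)

# Keep only sessions present in both files, matching the inner join
matched = audit[audit["_merge"] == "both"].drop(columns="_merge")
```

A large `left_only` or `right_only` count would suggest that the two files cover different sessions and that the merged dataset is smaller than expected.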

## 3. Feature Engineering: Creating a QoE Target Label

In [None]:
print("\n--- Engineering a QoE Target Label ---")

# We define a simple rule-based QoE label from the playback events in the data.
# 'Poor': At least one stall (buffering) event.
# 'Fair': No stalls, but more than 2 resolution changes.
# 'Good': No stalls and at most 2 resolution changes.
def get_qoe_label(row):
    if row['stalls'] > 0:
        return 'Poor'
    elif row['resolution_changes'] > 2: # More than 2 changes is noticeable
        return 'Fair'
    else:
        return 'Good'

df['qoe_label'] = df.apply(get_qoe_label, axis=1)

# Drop the original columns used to create the label
df = df.drop(columns=['stalls', 'resolution_changes'])

print("QoE Label Distribution:")
print(df['qoe_label'].value_counts())
print("\nDataset sample with new label:")
print(df.head())
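
The row-wise `apply` is easy to read but slow on large frames. As a sketch of the same rule in vectorized form, `np.select` evaluates the conditions in order, so a stall takes priority over resolution changes. The helper name `label_qoe` and the toy data are assumptions for illustration:

```python
import numpy as np
import pandas as pd

def label_qoe(frame, max_resolution_changes=2):
    """Vectorized rule: any stall -> Poor, many resolution changes -> Fair, else Good."""
    conditions = [
        frame["stalls"] > 0,
        frame["resolution_changes"] > max_resolution_changes,
    ]
    # np.select uses the first matching condition per row, so 'Poor' wins over 'Fair'
    values = np.select(conditions, ["Poor", "Fair"], default="Good")
    return pd.Series(values, index=frame.index)

toy = pd.DataFrame({"stalls": [0, 0, 2, 1], "resolution_changes": [0, 3, 5, 0]})
labels = label_qoe(toy)
print(labels.tolist())
```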

## 4. Exploratory Data Analysis

In [None]:
# QoE distribution visualization
plt.figure(figsize=(8, 6))
qoe_counts = df['qoe_label'].value_counts()
plt.pie(qoe_counts.values, labels=qoe_counts.index, autopct='%1.1f%%', startangle=90)
plt.title('Distribution of QoE Categories')
plt.show()

# Mean of the first five numeric columns by QoE category
numeric_cols = df.select_dtypes(include=[np.number]).columns
print("\nMean of First Five Numeric Features by QoE Category:")
print(df.groupby('qoe_label')[list(numeric_cols[:5])].mean())

## 5. Data Splitting and Encoding

In [None]:
print("\n--- Splitting and Encoding Data ---")

X = df.drop(columns=['qoe_label'])
y = df['qoe_label']

# Encode the string labels
le = LabelEncoder()
y_encoded = le.fit_transform(y)

print(f"Label encoding: {dict(zip(le.classes_, le.transform(le.classes_)))}")

# Stratified split to maintain class proportions
X_train, X_test, y_train, y_test = train_test_split(X, y_encoded, test_size=0.3, random_state=42, stratify=y_encoded)
print(f"X_train shape: {X_train.shape}, X_test shape: {X_test.shape}")
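
To confirm that `stratify` kept the class proportions, the class shares of the two partitions can be compared directly. This sketch uses synthetic labels with an assumed 70/20/10 imbalance rather than the real dataset:

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
# Toy imbalanced labels: 70% class 0, 20% class 1, 10% class 2
y_toy = np.repeat([0, 1, 2], [700, 200, 100])
X_toy = rng.normal(size=(len(y_toy), 3))

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=42, stratify=y_toy
)

# Class shares should agree to within rounding in both partitions
train_share = np.bincount(y_tr) / len(y_tr)
test_share = np.bincount(y_te) / len(y_te)
print("train:", train_share.round(3), "test:", test_share.round(3))
```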

## 6. Model Training

In [None]:
print("\n--- Model Training ---")
# Use class_weight='balanced' to handle the imbalance in QoE categories
model = RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1, class_weight='balanced')

print("Training the RandomForestClassifier...")
model.fit(X_train, y_train)
print("Training complete.")
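
A single train/test split can give a noisy estimate when the minority classes are small. One option, sketched here on synthetic data from `make_classification` (not the QoE dataset), is stratified k-fold cross-validation scored by macro-averaged recall:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic three-class problem with an assumed 70/20/10 class imbalance
X_toy, y_toy = make_classification(
    n_samples=600, n_features=8, n_informative=5, n_classes=3,
    weights=[0.7, 0.2, 0.1], random_state=42,
)

clf = RandomForestClassifier(n_estimators=50, random_state=42, class_weight="balanced")
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# recall_macro weights every class equally, so minority classes are not drowned out
scores = cross_val_score(clf, X_toy, y_toy, cv=cv, scoring="recall_macro")
print(f"Macro recall: {scores.mean():.3f} +/- {scores.std():.3f}")
```

The spread across folds gives a rough sense of how much the reported recall depends on which sessions land in the test set.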

## 7. Model Evaluation

In [None]:
print("\n--- Model Evaluation ---")
y_pred = model.predict(X_test)

print("\nClassification Report (Focus on Recall for 'Poor' and 'Fair'):")
print(classification_report(y_test, y_pred, target_names=le.classes_))

print("\nConfusion Matrix:")
cm = confusion_matrix(y_test, y_pred)
plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='viridis', xticklabels=le.classes_, yticklabels=le.classes_)
plt.title('Confusion Matrix for QoE Prediction')
plt.ylabel('Actual QoE')
plt.xlabel('Predicted QoE')
plt.show()
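
The recall and precision values in the classification report can be read straight off the confusion matrix. A small sketch with a hypothetical matrix (the counts are invented for illustration, not results from this model):

```python
import numpy as np

# Hypothetical confusion matrix: rows are actual classes, columns are predictions,
# in LabelEncoder's alphabetical order ('Fair', 'Good', 'Poor')
classes = ["Fair", "Good", "Poor"]
cm_toy = np.array([
    [30, 15, 5],
    [10, 180, 10],
    [4, 6, 40],
])

correct = np.diag(cm_toy)
recall = correct / cm_toy.sum(axis=1)     # correct / number of actual samples
precision = correct / cm_toy.sum(axis=0)  # correct / number of predicted samples

for name, r, p in zip(classes, recall, precision):
    print(f"{name}: recall={r:.2f}, precision={p:.2f}")
```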

## 8. Feature Importance Analysis

In [None]:
print("\n--- Feature Importance: What network conditions impact QoE most? ---")

importances = model.feature_importances_
indices = np.argsort(importances)[-15:]
features = X.columns

plt.figure(figsize=(12, 8))
plt.title('Top 15 Feature Importances for Predicting Video QoE')
plt.barh(range(len(indices)), importances[indices], color='g', align='center')
plt.yticks(range(len(indices)), [features[i] for i in indices])
plt.xlabel('Relative Importance')
plt.show()

# Print top 10 most important features
print("\nTop 10 Most Important Features:")
feature_importance_df = pd.DataFrame({
    'Feature': features,
    'Importance': importances
}).sort_values('Importance', ascending=False)

print(feature_importance_df.head(10))
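
`feature_importances_` is impurity-based: it is computed on the training data and can favor features with many distinct values. As a cross-check, scikit-learn's `permutation_importance` measures the score drop on held-out data when one feature is shuffled. A sketch on synthetic data, where the first three columns are informative and the rest are noise by construction:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# shuffle=False keeps the 3 informative features in columns 0-2
X_toy, y_toy = make_classification(
    n_samples=400, n_features=6, n_informative=3, n_redundant=0,
    shuffle=False, random_state=42,
)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.3, random_state=42)

clf = RandomForestClassifier(n_estimators=50, random_state=42).fit(X_tr, y_tr)

# Shuffle each feature 5 times on the test set and record the mean accuracy drop
result = permutation_importance(clf, X_te, y_te, n_repeats=5, random_state=42)
for i, (mean, std) in enumerate(zip(result.importances_mean, result.importances_std)):
    print(f"feature_{i}: {mean:.3f} +/- {std:.3f}")
```

If the two rankings disagree strongly on the real data, the permutation result is usually the safer guide to what the model relies on at prediction time.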

## 9. Real-Time QoE Prediction Function

In [None]:
def predict_qoe_risk(network_metrics_dict, model, label_encoder, feature_columns):
    """
    Predict the QoE category for a streaming session.

    Args:
        network_metrics_dict: Dict of network performance indicators
        model: Trained RandomForest model
        label_encoder: Fitted LabelEncoder
        feature_columns: List of expected feature column names

    Returns:
        Tuple of (predicted QoE category, probability of that category)
    """
    # Build a one-row frame with the training column names and order.
    # Features missing from the dict default to 0, which may lie outside the
    # training range, so supply every feature for a meaningful prediction.
    row = {col: network_metrics_dict.get(col, 0.0) for col in feature_columns}
    features = pd.DataFrame([row], columns=list(feature_columns))

    # A single predict_proba call yields both the category and its probability
    proba = model.predict_proba(features)[0]
    best = int(np.argmax(proba))

    # model.classes_ holds the encoded labels in predict_proba column order
    qoe_category = label_encoder.inverse_transform([model.classes_[best]])[0]
    return qoe_category, proba[best]

# Example usage for monitoring
print("\n--- Example QoE Predictions ---")

# Simulate different network conditions
scenarios = [
    {
        'name': 'Good Network Conditions',
        'metrics': {
            'throughput_avg': 8.5,
            'throughput_std': 0.3,
            'packet_loss_rate': 0.001,
            'rtt_avg': 20,
            'jitter': 5
        }
    },
    {
        'name': 'Moderate Network Issues', 
        'metrics': {
            'throughput_avg': 3.2,
            'throughput_std': 1.2,
            'packet_loss_rate': 0.02,
            'rtt_avg': 80,
            'jitter': 25
        }
    },
    {
        'name': 'Poor Network Conditions',
        'metrics': {
            'throughput_avg': 1.1,
            'throughput_std': 0.9,
            'packet_loss_rate': 0.08,
            'rtt_avg': 150,
            'jitter': 50
        }
    }
]

for scenario in scenarios:
    try:
        qoe_pred, confidence = predict_qoe_risk(
            scenario['metrics'], model, le, X.columns
        )
        print(f"{scenario['name']}: {qoe_pred} (confidence: {confidence:.2%})")
    except Exception as e:
        print(f"{scenario['name']}: Error in prediction - {str(e)}")
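
The scenario dictionaries list only five metrics, so any other feature the model was trained on is filled with a default, and any key that does not match a training column is ignored. A sketch of a pre-check that reports both cases; the helper name `split_metrics` and the feature list are hypothetical:

```python
def split_metrics(metrics, feature_columns):
    """Return (missing, unexpected) feature names for a metrics dict."""
    expected = list(feature_columns)
    expected_set = set(expected)
    # Features the model expects but the caller did not supply
    missing = [col for col in expected if col not in metrics]
    # Keys the caller supplied that the model never saw during training
    unexpected = [key for key in metrics if key not in expected_set]
    return missing, unexpected

# Hypothetical training columns and an incomplete, partly misnamed input
feature_columns = ["throughput_avg", "throughput_std", "packet_loss_rate", "rtt_avg"]
metrics = {"throughput_avg": 3.2, "rtt_avg": 80, "jitter_ms": 25}

missing, unexpected = split_metrics(metrics, feature_columns)
print("Missing:", missing)
print("Unexpected:", unexpected)
```

Rejecting or logging requests with a non-empty `missing` list avoids reporting a confident prediction that rests mostly on default values.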

## 10. Model Interpretation and Business Insights

In [None]:
# Analyze prediction patterns by feature ranges
print("\n--- QoE Prediction Patterns Analysis ---")

# Get prediction probabilities for the test set
y_proba = model.predict_proba(X_test)

# Create a dataframe with predictions and key features
analysis_df = X_test.copy()
analysis_df['actual_qoe'] = le.inverse_transform(y_test)
analysis_df['predicted_qoe'] = le.inverse_transform(y_pred)
analysis_df['prediction_confidence'] = y_proba.max(axis=1)

# Show high-confidence correct predictions
correct_predictions = analysis_df[analysis_df['actual_qoe'] == analysis_df['predicted_qoe']]
high_confidence_correct = correct_predictions[correct_predictions['prediction_confidence'] > 0.9]

print(f"High-confidence correct predictions: {len(high_confidence_correct)} out of {len(analysis_df)}")
print(f"Average confidence on correct predictions: {correct_predictions['prediction_confidence'].mean():.3f}")

# Analyze misclassifications
misclassified = analysis_df[analysis_df['actual_qoe'] != analysis_df['predicted_qoe']]
print(f"\nMisclassified samples: {len(misclassified)} out of {len(analysis_df)}")
if len(misclassified) > 0:
    print(f"Average confidence on misclassifications: {misclassified['prediction_confidence'].mean():.3f}")
    print("\nMost common misclassification patterns:")
    print(misclassified.groupby(['actual_qoe', 'predicted_qoe']).size().sort_values(ascending=False).head())
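
Average confidence alone does not show whether confidence tracks correctness. One way to check is to bucket predictions by confidence and compare accuracy per bucket. The values below are invented for illustration, and the bucket edges are an arbitrary choice:

```python
import pandas as pd

# Hypothetical per-sample confidence and correctness flags
toy = pd.DataFrame({
    "confidence": [0.95, 0.92, 0.85, 0.72, 0.65, 0.55, 0.45, 0.97],
    "correct":    [True, True, True, False, True, False, False, True],
})

# Group predictions into low, medium, and high confidence buckets
toy["bucket"] = pd.cut(toy["confidence"], bins=[0.0, 0.6, 0.8, 1.0])
by_bucket = toy.groupby("bucket", observed=True)["correct"].agg(["mean", "count"])
print(by_bucket)
```

On the real test set, accuracy that rises with the confidence bucket suggests the probabilities are usable as a ranking signal; flat accuracy suggests they are not.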

## 11. Conclusion

In [None]:
print("\n--- Conclusion ---")
print("A RandomForest model was trained to predict the user's Quality of Experience category from network conditions.")
print("Key Takeaways:")
print("- Check the recall for the 'Poor' category in the classification report above. This is the most important metric for a network provider, because high recall means the model proactively identifies most sessions that will result in a frustrated user.")
print("- The feature importance plot gives network engineers actionable insight. Throughput metrics and their variability are expected to dominate, with packet-level statistics such as packet reordering and latency also contributing; verify this against the plot above.")
print("- A network provider (like an ISP or a mobile carrier) could integrate this model into their monitoring systems. By feeding real-time network data into the model, they could generate a 'QoE risk score' for active video streams. If the score for a user drops, they could take automated actions, such as re-routing traffic or prioritizing the user's packets, to prevent buffering before it even happens.")
print("\nBusiness Applications:")
print("- Proactive quality management to prevent user churn")
print("- Dynamic traffic prioritization based on QoE predictions")
print("- Real-time network optimization and resource allocation")
print("- Customer support automation for quality-related issues")
print("- Infrastructure planning based on QoE impact analysis")
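
The 'QoE risk score' mentioned in the conclusion is not defined in this notebook. One possible definition, offered only as an assumption, is the expected severity under the model's class probabilities, with hypothetical weights of 0 for Good, 0.5 for Fair, and 1 for Poor:

```python
import numpy as np

# LabelEncoder sorts class names alphabetically, which fixes the predict_proba column order
classes = np.array(["Fair", "Good", "Poor"])
# Assumed severity weights, not derived from the dataset
severity = {"Good": 0.0, "Fair": 0.5, "Poor": 1.0}

def qoe_risk_score(proba, classes, severity):
    """Expected severity per session: sum over classes of P(class) * weight."""
    weights = np.array([severity[c] for c in classes])
    return np.asarray(proba) @ weights

# Hypothetical predict_proba output for two sessions
proba_toy = np.array([
    [0.1, 0.8, 0.1],   # mostly 'Good'
    [0.2, 0.1, 0.7],   # mostly 'Poor'
])
scores = qoe_risk_score(proba_toy, classes, severity)
print(scores)
```

A score near 0 indicates a session predicted to be fine and a score near 1 indicates likely stalling. The alert threshold would need tuning against the provider's tolerance for false alarms.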