## Machine Learning: Completion Prediction

This notebook fits a logistic regression model, chosen for its interpretability,
to weekly behavioral features in order to predict each learner's completion risk.

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
import pandas as pd

# Load engineered weekly features
data = pd.read_csv('/content/weekly_behavioral_features.csv')

# Define target variable
data["completed"] = (data["modules_completed"] > 0).astype(int)

# Select features
X = data[["sessions", "avg_progress", "avg_quiz_score", "confidence_score"]]
y = data["completed"]

In [None]:
# Reuses `data` and the imports from the previous cell
feature_cols = ["sessions", "avg_progress", "avg_quiz_score", "confidence_score"]

# Get unique user_ids
unique_users = data['user_id'].unique()

# Split user_ids (not rows) into training and testing sets: about 80% of users
# go to training, and all of a learner's weeks stay on the same side of the split
train_users, test_users = train_test_split(
    unique_users, train_size=0.8, random_state=42
)

# Filter the data based on user_ids for training and testing sets
train_mask = data['user_id'].isin(train_users)
test_mask = data['user_id'].isin(test_users)
X_train, y_train = data.loc[train_mask, feature_cols], data.loc[train_mask, "completed"]
X_test, y_test = data.loc[test_mask, feature_cols], data.loc[test_mask, "completed"]

# Train logistic regression model
model = LogisticRegression()
model.fit(X_train, y_train)

In [None]:
# Evaluate model performance
y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.62      0.80      0.70        10
           1       0.67      0.44      0.53         9

    accuracy                           0.63        19
   macro avg       0.64      0.62      0.61        19
weighted avg       0.64      0.63      0.62        19
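
The report above is consistent with a confusion matrix of 8 and 2 in the first row (true class 0) and 5 and 4 in the second row (true class 1). These counts are inferred from the rounded precision, recall and support values; they are not a saved notebook output. A minimal sketch that recomputes the reported metrics from those assumed counts:

```python
# Counts inferred from the rounded classification report (an assumption, not notebook output).
tn, fp = 8, 2  # true class 0: 8 predicted as 0, 2 predicted as 1
fn, tp = 5, 4  # true class 1: 5 predicted as 0, 4 predicted as 1

# Per-class precision and recall
precision_0 = tn / (tn + fn)
recall_0 = tn / (tn + fp)
precision_1 = tp / (tp + fp)
recall_1 = tp / (tp + fn)

# F1 is the harmonic mean of precision and recall
f1_0 = 2 * precision_0 * recall_0 / (precision_0 + recall_0)
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)

# Accuracy is the share of all rows predicted correctly
accuracy = (tn + tp) / (tn + fp + fn + tp)

print(f"class 0: {precision_0:.2f} {recall_0:.2f} {f1_0:.2f}")  # class 0: 0.62 0.80 0.70
print(f"class 1: {precision_1:.2f} {recall_1:.2f} {f1_1:.2f}")  # class 1: 0.67 0.44 0.53
print(f"accuracy: {accuracy:.2f}")                              # accuracy: 0.63
```

Read this way, 5 of the 9 learners who completed a module were predicted as non-completers, which is what the 0.44 recall for class 1 reflects.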



In [None]:
# Combine test features, true labels and model outputs into a single DataFrame for download
final_test_data = X_test.copy()
final_test_data['completed'] = y_test
final_test_data['predicted_completed'] = model.predict(X_test)
# Column 1 of predict_proba is the probability of class 1 (completed)
final_test_data['completion_probability'] = model.predict_proba(X_test)[:, 1]

# Get the user_id for the test data using the index
test_user_ids = data.loc[X_test.index, 'user_id']

# Add user_id to the final_test_data and set it as the index
# (a unique key only if each test user has exactly one row)
final_test_data.insert(0, 'user_id', test_user_ids)
final_test_data = final_test_data.set_index('user_id')

# Save the DataFrame to a CSV file
output_filename = 'predicted_completion_data.csv'
final_test_data.to_csv(output_filename)

print(f'Final test data with user_id as primary key saved to {output_filename}')

Final test data with user_id as primary key saved to predicted_completion_data.csv


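Logistic regression was selected for its interpretability and its suitability
for small datasets: each feature receives a single coefficient whose sign shows
whether it raises or lowers the predicted odds of completion. Through
`predict_proba`, the model also yields a completion probability for each row,
which the web platform can consume directly. On the 19 held-out rows it reaches
0.63 accuracy, with recall of 0.80 for non-completers and 0.44 for completers.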
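
To make the interpretability and probability claims concrete, here is a minimal sketch of reading coefficients and completion probabilities from a fitted `LogisticRegression`. It runs on synthetic data: the `demo` table, its value ranges, and the labelling rule are illustrative assumptions, not the notebook's dataset. Only the four feature names are taken from the cells above.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

feature_cols = ["sessions", "avg_progress", "avg_quiz_score", "confidence_score"]

# Synthetic stand-in for the weekly feature table (illustrative values only).
rng = np.random.default_rng(42)
n_rows = 60
demo = pd.DataFrame({
    "sessions": rng.integers(0, 10, size=n_rows),
    "avg_progress": rng.uniform(0, 1, size=n_rows),
    "avg_quiz_score": rng.uniform(0, 100, size=n_rows),
    "confidence_score": rng.uniform(1, 5, size=n_rows),
})
# Hypothetical labelling rule, used only so that both classes are present.
demo["completed"] = (demo["avg_progress"] > 0.5).astype(int)

demo_model = LogisticRegression(max_iter=1000)
demo_model.fit(demo[feature_cols], demo["completed"])

# One coefficient per feature: a positive value raises the log-odds of completion.
# Magnitudes depend on each feature's scale, so compare them only after standardizing.
coefficients = pd.Series(demo_model.coef_[0], index=feature_cols)
print(coefficients)

# predict_proba columns follow demo_model.classes_, so look up class 1 explicitly.
completed_col = list(demo_model.classes_).index(1)
completion_probability = demo_model.predict_proba(demo[feature_cols])[:, completed_col]
print(completion_probability[:5])
```

In the notebook itself, the same two steps would apply to `model` and `X_test` instead of the synthetic `demo_model` and `demo`.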

## Web Platform Integration

The weekly behavioral features support the following platform metrics:
- Weekly Active Learners
- Completion Rates
- Confidence Trends
- At-Risk Learner Identification

The model's predicted completion probabilities provide early-warning indicators for learner dropout.
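
One way the platform might turn completion probabilities into an at-risk list is a simple threshold, sketched below. The `scored` table, the user ids, the probability values, and the 0.5 cut-off are all hypothetical; in practice the probabilities would come from the fitted model, and the threshold would be tuned against how many missed dropouts are acceptable.

```python
import pandas as pd

# Hypothetical scored rows; real values would come from the model's predict_proba output.
scored = pd.DataFrame({
    "user_id": ["u1", "u2", "u3", "u4"],
    "completion_probability": [0.82, 0.35, 0.51, 0.12],
})

# Assumed cut-off: learners below this completion probability are flagged.
AT_RISK_THRESHOLD = 0.5

scored["at_risk"] = scored["completion_probability"] < AT_RISK_THRESHOLD
at_risk_users = scored.loc[scored["at_risk"], "user_id"].tolist()

print(at_risk_users)  # ['u2', 'u4']
```

Lowering the threshold flags fewer learners, and raising it catches more potential dropouts at the cost of more false alarms.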