 Key Parameters Used in the Dashboard

| **Category**             | **Column Name and Description**                                                 |
|--------------------------|----------------------------------------------------------------------------------|
| **Academic Outcome**     | `final_score` – Overall final grade in the course                               |
| **Attendance**           | `attendance_final_score` – Final attendance-based performance score             |
| **Assignments**          | `unit_3_assignment_-_train_decision_trees_after_data_preparation`               |
|                          | `unit_5_assignment_-_model_selection_for_knn`                                   |
| **Engagement**           | `lab_attendance_overall_grade` – Grade derived from lab presence and participation |
|                          | `attendance_current_score` – Real-time attendance measure                       |
| **Professional Skills**  | `your_github_profile_readme` – Score for setting up a professional GitHub page  |
|                          | `your_elevator_pitch_(video_recording)` – Score for communication readiness     |
| **Support & Experience** | `my_tas_respond_quickly_outside_of_lab` – Survey rating on TA responsiveness     |
| **Student Sentiment**    | `how_likely_are_you_to_recommend_the_break_through_tech_ai_program_to_a_friend_or_fellow_student?` – NPS-style indicator of satisfaction and potential dropout risk |
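
The names in the table are snake_case, while the code cells below index the DataFrame by the original survey wording. A minimal sketch of one way to map raw headers to names like those in the table; the helper `to_snake_case`, the sample headers, and the rule of dropping a trailing period are assumptions for illustration, not the dashboard's actual preprocessing.

```python
import pandas as pd


def to_snake_case(name: str) -> str:
    """Lowercase a header and join its words with underscores."""
    # Assumption: a trailing period is dropped, other punctuation is kept.
    return "_".join(name.strip().rstrip(".").lower().split())


# Hypothetical raw headers, used only to demonstrate the mapping.
raw = pd.DataFrame(columns=[
    "Final Score",
    "My TAs respond quickly outside of lab.",
    "Your Elevator Pitch (Video Recording)",
])

# rename() accepts a callable and applies it to every column label.
normalized = raw.rename(columns=to_snake_case)
print(list(normalized.columns))
```

Applying one such mapping right after loading would let the table and the code refer to columns by the same names.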




In [2]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report

# Load dataset
file_path = "ML_Foundations_Gradebook_median_merged.xlsx"
df = pd.read_excel(file_path)

# Drop identifier and section columns that carry no predictive signal
df = df.drop(columns=["id", "sis_user_id", "sis_login_id", "section"], errors='ignore')

# Define a proxy target (Dropout: 1, Retained: 0) from the recommendation rating and the number of missing values
def infer_dropout(row):
    if row['How likely are you to recommend the Break Through Tech AI Program to a friend or fellow student? (1 = Very Unlikely, 10 = Highly Likely)'] < 5:
        return 1  # Dropout
    if row.isnull().sum() > 10:  # Many missing values as indicator
        return 1  # Dropout
    return 0  # Retained

df['dropout'] = df.apply(infer_dropout, axis=1)

# Select key predictors
engagement_features = [
    "My TAs respond quickly outside of lab.",
    "My TAs clearly answer my questions.",
    "My TAs provide valuable feedback that helps me learn and grow as an engineer.",
    "I feel safe to ask questions or make mistakes in front of my TAs."
]

performance_features = [col for col in df.columns if "check_your_knowledge" in col]

# Encode categorical survey responses
encoder = LabelEncoder()
for col in engagement_features:
    df[col] = df[col].astype(str)
    df[col] = encoder.fit_transform(df[col])

# Prepare dataset for ML
X = df[engagement_features + performance_features].fillna(0)
y = df['dropout']

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Scale features
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Train a Random Forest model
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Predictions
y_pred = model.predict(X_test)

# Evaluate model
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Classification Report:\n", classification_report(y_test, y_pred))


Accuracy: 0.9529411764705882
Classification Report:
               precision    recall  f1-score   support

           0       0.00      0.00      0.00         8
           1       0.95      1.00      0.98       162

    accuracy                           0.95       170
   macro avg       0.48      0.50      0.49       170
weighted avg       0.91      0.95      0.93       170
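
The report above is consistent with the model predicting class 1 for every test row: recall is 1.00 for class 1 and 0.00 for class 0. A minimal sketch, with labels rebuilt from the support column and an assumed always-predict-1 baseline, shows that such a baseline reaches the same accuracy while balanced accuracy stays at chance level. The names `y_true` and `y_majority` are illustrative.

```python
import numpy as np
from sklearn.metrics import (
    accuracy_score,
    balanced_accuracy_score,
    confusion_matrix,
)

# Labels rebuilt from the support column: 8 rows of class 0, 162 rows of class 1.
y_true = np.array([0] * 8 + [1] * 162)

# Assumed baseline: always predict the majority class (1).
y_majority = np.ones_like(y_true)

acc = accuracy_score(y_true, y_majority)
bal_acc = balanced_accuracy_score(y_true, y_majority)
cm = confusion_matrix(y_true, y_majority)

print(f"Accuracy:          {acc:.4f}")
print(f"Balanced accuracy: {bal_acc:.4f}")
print("Confusion matrix (rows = true, columns = predicted):")
print(cm)
```

Because plain accuracy cannot separate the trained model from this baseline, a metric such as balanced accuracy or per-class recall is likely a more informative headline number here.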





In [3]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report

# Load dataset
file_path = "ML_Foundations_Gradebook_median_merged.xlsx"
df = pd.read_excel(file_path)

# Drop identifier and section columns that carry no predictive signal
df = df.drop(columns=["id", "sis_user_id", "sis_login_id", "section"], errors='ignore')

# Define a proxy target (Dropout: 1, Retained: 0) from the recommendation rating and the number of missing values
def infer_dropout(row):
    if row['How likely are you to recommend the Break Through Tech AI Program to a friend or fellow student? (1 = Very Unlikely, 10 = Highly Likely)'] < 5:
        return 1  # Dropout
    if row.isnull().sum() > 10:  # Many missing values as indicator
        return 1  # Dropout
    return 0  # Retained

df['dropout'] = df.apply(infer_dropout, axis=1)

# Select key predictors
engagement_features = [
    "My TAs respond quickly outside of lab.",
    "My TAs clearly answer my questions.",
    "My TAs provide valuable feedback that helps me learn and grow as an engineer.",
    "I feel safe to ask questions or make mistakes in front of my TAs."
]

performance_features = [col for col in df.columns if "check_your_knowledge" in col]

# Encode categorical survey responses
encoder = LabelEncoder()
for col in engagement_features:
    df[col] = df[col].astype(str)
    df[col] = encoder.fit_transform(df[col])

# Prepare dataset for ML
X = df[engagement_features + performance_features].fillna(0)
y = df['dropout']

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Scale features
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Train a Random Forest model
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Predictions
y_pred = model.predict(X_test)

# Evaluate model
accuracy = accuracy_score(y_test, y_pred)
report = classification_report(y_test, y_pred)

print("Accuracy:", accuracy)
print("Classification Report:\n", report)

# Display summary in Markdown format
summary = f'''
## Model Summary

| Component            | Details |
|----------------------|---------|
| **Algorithm**       | Random Forest Classifier |
| **Feature Set**     | Engagement Survey Responses, Knowledge Check Scores |
| **Target Variable** | Dropout (1) vs Retained (0) |
| **Data Preprocessing** | Categorical Encoding, Feature Scaling |
| **Train-Test Split** | 80%-20% |
| **Evaluation Metrics** | Accuracy, Precision, Recall, F1-score |
| **Model Accuracy**  | {accuracy:.4f} |

### Classification Report
```
{report}
```
'''

with open("model_summary.md", "w") as f:
    f.write(summary)

print("Model summary saved as 'model_summary.md'")


Accuracy: 0.9529411764705882
Classification Report:
               precision    recall  f1-score   support

           0       0.00      0.00      0.00         8
           1       0.95      1.00      0.98       162

    accuracy                           0.95       170
   macro avg       0.48      0.50      0.49       170
weighted avg       0.91      0.95      0.93       170

Model summary saved as 'model_summary.md'
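
With only 8 of 170 test rows in class 0, an unstratified random split can leave either side with very few minority examples. A minimal sketch on synthetic data of a stratified split that keeps the class shares aligned; `X_demo`, `y_demo`, and the 40/810 class counts are made up for illustration and are not taken from the gradebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)

# Hypothetical labels with roughly 5% of rows in class 0.
y_demo = np.array([0] * 40 + [1] * 810)
X_demo = rng.normal(size=(len(y_demo), 3))

# stratify=y_demo keeps the class proportions the same in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

train_share = (y_tr == 0).mean()
test_share = (y_te == 0).mean()
print(f"Class 0 share in train: {train_share:.3f}")
print(f"Class 0 share in test:  {test_share:.3f}")
```

Passing `class_weight="balanced"` to `RandomForestClassifier` is another option worth trying alongside stratification, though whether it helps on this dataset would need to be checked.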





In [5]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.cluster import KMeans
from sklearn.metrics import accuracy_score, classification_report, silhouette_score

# Load dataset
file_path = "ML_Foundations_Gradebook_median_merged.xlsx"
df = pd.read_excel(file_path)

# Drop identifier and section columns that carry no predictive signal
df = df.drop(columns=["id", "sis_user_id", "sis_login_id", "section"], errors='ignore')

# Define a proxy target (Dropout: 1, Retained: 0) from the recommendation rating and the number of missing values
def infer_dropout(row):
    if row['How likely are you to recommend the Break Through Tech AI Program to a friend or fellow student? (1 = Very Unlikely, 10 = Highly Likely)'] < 5:
        return 1  # Dropout
    if row.isnull().sum() > 10:  # Many missing values as indicator
        return 1  # Dropout
    return 0  # Retained

df['dropout'] = df.apply(infer_dropout, axis=1)

# Select key predictors
engagement_features = [
    "My TAs respond quickly outside of lab.",
    "My TAs clearly answer my questions.",
    "My TAs provide valuable feedback that helps me learn and grow as an engineer.",
    "I feel safe to ask questions or make mistakes in front of my TAs."
]

performance_features = [col for col in df.columns if "check_your_knowledge" in col]

# Encode categorical survey responses
encoder = LabelEncoder()
for col in engagement_features:
    df[col] = df[col].astype(str)
    df[col] = encoder.fit_transform(df[col])

# Prepare dataset for ML
X = df[engagement_features + performance_features].fillna(0)
y = df['dropout']

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Scale features
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Train a Random Forest model
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Predictions
y_pred = model.predict(X_test)

# Evaluate model
accuracy = accuracy_score(y_test, y_pred)
report = classification_report(y_test, y_pred)

print("Accuracy:", accuracy)
print("Classification Report:\n", report)

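# Internship Matching using Clustering
internship_features = performance_features + engagement_features  # Features relevant for matching
X_internship = df[internship_features].fillna(0)
# Use a separate scaler so the one fitted on the training split is not overwritten
cluster_scaler = StandardScaler()
X_scaled = cluster_scaler.fit_transform(X_internship)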

# K-Means Clustering for student grouping
kmeans = KMeans(n_clusters=5, random_state=42, n_init=10)
kmeans.fit(X_scaled)
df['internship_cluster'] = kmeans.labels_

silhouette = silhouette_score(X_scaled, kmeans.labels_)
print(f"Silhouette Score for Clustering: {silhouette:.4f}")

# Display summary in Markdown format
summary = f'''
## Model Summary

| Component            | Details |
|----------------------|---------|
| **Algorithm**       | Random Forest Classifier (Dropout Prediction), K-Means (Internship Matching) |
| **Feature Set**     | Engagement Survey Responses, Knowledge Check Scores |
| **Target Variable** | Dropout (1) vs Retained (0) |
| **Data Preprocessing** | Categorical Encoding, Feature Scaling |
| **Train-Test Split** | 80%-20% |
| **Evaluation Metrics** | Accuracy, Precision, Recall, F1-score (Dropout Prediction) |
| **Model Accuracy**  | {accuracy:.4f} |
| **Clustering Method** | K-Means for Internship Matching |
| **Silhouette Score** | {silhouette:.4f} |

### Classification Report
```
{report}
```
'''

with open("model_summary_team.md", "w") as f:
    f.write(summary)

print("Model summary saved as 'model_summary_team.md'")


Accuracy: 0.9529411764705882
Classification Report:
               precision    recall  f1-score   support

           0       0.00      0.00      0.00         8
           1       0.95      1.00      0.98       162

    accuracy                           0.95       170
   macro avg       0.48      0.50      0.49       170
weighted avg       0.91      0.95      0.93       170

Silhouette Score for Clustering: 0.0797
Model summary saved as 'model_summary_team.md'
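
A silhouette score of 0.0797 is close to zero, which suggests the five clusters overlap heavily. A minimal sketch of comparing several values of `n_clusters` by silhouette score before fixing one; it runs on synthetic blobs, and `X_demo`, the range 2 to 8, and the blob settings are assumptions for illustration only.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the scaled student feature matrix.
X_demo, _ = make_blobs(n_samples=300, centers=3, cluster_std=1.0, random_state=42)
X_demo_scaled = StandardScaler().fit_transform(X_demo)

# Fit K-Means for each candidate k and record the silhouette score.
scores = {}
for k in range(2, 9):
    labels = KMeans(n_clusters=k, random_state=42, n_init=10).fit_predict(X_demo_scaled)
    scores[k] = silhouette_score(X_demo_scaled, labels)

best_k = max(scores, key=scores.get)
for k, score in scores.items():
    print(f"k={k}: silhouette={score:.4f}")
print("Best k by silhouette:", best_k)
```

If every candidate scores near zero on the real features, the feature set itself may not separate students into distinct groups, and a different choice of features or a different matching approach could be worth exploring.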


