## Machine Learning Techniques for Tabular Data


#### Requirements:
- We're using the same social media dataset as last time [LINK](https://www.kaggle.com/datasets/adilshamim8/social-media-addiction-vs-relationships)
    - I recommend putting the .csv file into a folder called `Data` to keep things organized
- Scikit-learn supplies great machine learning tools. If it is not already installed: `pip install scikit-learn`
- Other requirements: `pandas`, `numpy`, `matplotlib`, `seaborn`, `xgboost`, `shap`

In [None]:
import pandas as pd
import sklearn as skl
import numpy as np
from matplotlib import pyplot as plt
import seaborn as sns
import os

### Load the data

In [None]:
df = pd.read_csv('Data/Student_Social_Media.csv')
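# Only the last expression in a cell is displayed automatically, so print the preview explicitly
print(df.head())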

# Dataset information
print("\nDataset information:")
df.info()

# Summary statistics
print("\nSummary statistics of numerical variables:")
df.describe()

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np

def quick_data_viz(df, figsize=(14, 10), max_categories_box=10):
    """
    Minimalistic exploratory visualization.
    
    - Numeric (continuous-like): histogram+KDE and boxplot
    - Numeric (discrete/ordinal with few unique values): discrete histogram and countplot instead of boxplot
    - Categorical: countplot (`Country` is skipped)
    """
    # Identify column types
    numeric_cols = df.select_dtypes(include=['int64','float64']).columns.drop('Student_ID', errors='ignore')
    categorical_cols = df.select_dtypes(include=['object']).columns
    
    # Split numeric into continuous vs discrete-ordinal
    continuous_cols = [c for c in numeric_cols if df[c].nunique() > max_categories_box]
    discrete_cols   = [c for c in numeric_cols if df[c].nunique() <= max_categories_box]
    
    # === Continuous variables ===
    if continuous_cols:
        fig, axes = plt.subplots(len(continuous_cols), 2, figsize=(figsize[0], 3*len(continuous_cols)))
        axes = np.array(axes).reshape(len(continuous_cols), 2)
        for i, col in enumerate(continuous_cols):
            sns.histplot(df[col], kde=True, ax=axes[i,0], color='steelblue')
            axes[i,0].set_title(f"{col} distribution", fontsize=10)
            axes[i,0].set_xlabel("")
            
            sns.boxplot(x=df[col], ax=axes[i,1], color='lightgrey')
            axes[i,1].set_title(f"{col} boxplot", fontsize=10)
            axes[i,1].set_xlabel("")
        plt.tight_layout()
        plt.show()
    
    # === Discrete numeric variables ===
    if discrete_cols:
        fig, axes = plt.subplots(len(discrete_cols), 2, figsize=(figsize[0], 3*len(discrete_cols)))
        axes = np.array(axes).reshape(len(discrete_cols), 2)
        for i, col in enumerate(discrete_cols):
            sns.histplot(df[col], kde=False, ax=axes[i,0], color='steelblue', discrete=True)
            axes[i,0].set_title(f"{col} distribution", fontsize=10)
            axes[i,0].set_xlabel("")
            
            sns.countplot(x=df[col], ax=axes[i,1], color='lightgrey', order=sorted(df[col].unique()))
            axes[i,1].set_title(f"{col} counts", fontsize=10)
            axes[i,1].set_xlabel("")
        plt.tight_layout()
        plt.show()
    
    # === Categorical variables ===
    # Drop Country up front so it does not leave an empty panel in the grid
    plot_cols = [c for c in categorical_cols if c != 'Country']
    if plot_cols:
        ncols = 3
        nrows = (len(plot_cols) + ncols - 1) // ncols
        fig, axes = plt.subplots(nrows, ncols, figsize=(figsize[0], nrows*3))
        axes = np.array(axes).flatten()
        for i, col in enumerate(plot_cols):
            sns.countplot(y=df[col], ax=axes[i], color='steelblue', order=df[col].value_counts().index)
            axes[i].set_title(f"{col}", fontsize=10)
        # Hide any unused panels
        for j in range(len(plot_cols), len(axes)):
            axes[j].axis("off")
        plt.tight_layout()
        plt.show()

In [None]:
quick_data_viz(df)

In [None]:
# --- Core Libraries ---
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# --- Preprocessing ---
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder, OrdinalEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

# --- Models ---
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from xgboost import XGBRegressor

# --- Evaluation ---
from sklearn.metrics import mean_squared_error, r2_score

# --- Set plot style ---
plt.style.use('seaborn-v0_8-whitegrid')

### Data Preprocessing

Raw data is rarely ready for modeling. We need to preprocess it. This involves:

1.  **Cleaning the data**: Luckily, the data we have is pretty clean (no NaN, no huge outliers, no mixed datatypes).
2.  **Separating features and target**.
3.  **Handling categorical variables**: Most ML models expect numeric inputs, so text categories must be encoded as numbers.
4.  **Scaling numerical features**: Ensures that features with larger scales don't dominate the model.


We will use `scikit-learn`'s `ColumnTransformer` to create a clean, reusable preprocessing step, and later wrap it in a `Pipeline` together with each model. This is best practice: because the pipeline fits the preprocessing on the training data only, it prevents data leakage (information from the test set influencing training) and makes the workflow reproducible.

We'll perform three main transformations:

  * **One-Hot Encoding**: For nominal categorical features (where order doesn't matter) like `Gender` or `Country`. It creates a new binary column for each category.
  * **Ordinal Encoding**: For ordinal categorical features (where order matters) like `Academic_Level`. It converts categories into integer values (e.g., High School=0, Undergraduate=1, Graduate=2, PhD=3).
  * **Standard Scaling**: For numerical features. It transforms the data to have a mean of 0 and a standard deviation of 1. The formula is:
    $$z = \frac{x - \mu}{\sigma}$$
    Where $x$ is the original value, $\mu$ is the mean, and $\sigma$ is the standard deviation.
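
To make the three transformations concrete, here is a minimal sketch on a tiny made-up table. The values in `toy` are hypothetical and are not taken from the course dataset; only the column names mirror it.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, StandardScaler

# Hypothetical toy data, not taken from the course dataset
toy = pd.DataFrame({
    "Gender": ["Female", "Male", "Female"],
    "Academic_Level": ["Graduate", "High School", "PhD"],
    "Age": [20.0, 22.0, 24.0],
})

# One-hot: one binary column per category (categories are sorted alphabetically)
onehot = OneHotEncoder().fit_transform(toy[["Gender"]]).toarray()

# Ordinal: the integers follow the order of the list we pass in
levels = ["High School", "Undergraduate", "Graduate", "PhD"]
ordinal = OrdinalEncoder(categories=[levels]).fit_transform(toy[["Academic_Level"]])

# Standard scaling: z = (x - mean) / std, using the population std (ddof=0)
scaled = StandardScaler().fit_transform(toy[["Age"]])
manual = (toy["Age"] - toy["Age"].mean()) / toy["Age"].std(ddof=0)

print("One-hot Gender:\n", onehot)
print("Ordinal Academic_Level:", ordinal.ravel())
print("Scaled Age:", np.round(scaled.ravel(), 3))
print("Matches the formula:", np.allclose(scaled.ravel(), manual))
```

Note that `StandardScaler` divides by the population standard deviation, so comparing against pandas requires `ddof=0`.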

In [None]:
# 1. Define features (X) and target (y)
# We drop Student_ID as it's an identifier and not a predictive feature.
X = df.drop(['Student_ID', 'Addicted_Score'], axis=1)
y = df['Addicted_Score']

In [None]:
# 2. Identify different column types
numerical_features = X.select_dtypes(include=np.number).columns.tolist()
nominal_categorical_features = ['Gender', 'Country', 'Most_Used_Platform', 'Relationship_Status']
ordinal_categorical_features = ['Academic_Level', 'Affects_Academic_Performance']

# Define the order for ordinal features
academic_levels = ['High School', 'Undergraduate', 'Graduate', 'PhD']
academic_impact = ['No', 'Maybe', 'Yes']

In [None]:
# 3. Create the preprocessing pipelines for each data type
numerical_transformer = StandardScaler()

# For nominal features, handle unknown categories that might appear in test data
nominal_transformer = OneHotEncoder(handle_unknown='ignore')

ordinal_transformer = OrdinalEncoder(categories=[academic_levels, academic_impact])

In [None]:
# 4. Use ColumnTransformer to apply different transformations to different columns
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_features),
        ('nom', nominal_transformer, nominal_categorical_features),
        ('ord', ordinal_transformer, ordinal_categorical_features)
    ],
    remainder='passthrough' # Keep other columns (if any), though we've handled all
)


In practice, we often use cross-validation to get the most out of the data. Today, we'll just do a single train/test split.
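
For reference, cross-validation could look roughly like the sketch below. It uses synthetic data from `make_regression` as a stand-in (an assumption for illustration, not the course dataset) and a hypothetical `demo_pipeline`. Because the scaler sits inside the pipeline, it is refit on the training portion of every fold.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption: not the course dataset)
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=42)

demo_pipeline = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("regressor", LinearRegression()),
])

# 5-fold CV: each fold is held out once while the pipeline is fit on the other four
scores = cross_val_score(demo_pipeline, X_demo, y_demo, cv=5, scoring="r2")
print(f"R-squared per fold: {np.round(scores, 3)}")
print(f"Mean R-squared: {scores.mean():.3f}")
```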

In [None]:
# 5. Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print(f"Training set shape: {X_train.shape}")
print(f"Testing set shape: {X_test.shape}")

### Model Training

Now we'll train three different regression models to see which performs best on our data.

1.  **Linear Regression**: A simple baseline model.
2.  **Random Forest**: A powerful and popular ensemble model.
3.  **XGBoost**: A highly efficient gradient-boosting model, often a top performer in competitions.

See last week's lecture for more detail on the pros and cons of the above models.

In [None]:
# --- Model 1: Linear Regression ---
# A simple model that finds the best linear relationship between features and target.
# Equation: y = β₀ + β₁x₁ + β₂x₂ + ... + βₙxₙ

lr_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('regressor', LinearRegression())
])
lr_pipeline.fit(X_train, y_train)
print("Linear Regression model trained.")
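
As a small illustration of the equation in the comment above, the coefficients can also be recovered with ordinary least squares in plain NumPy. This is a sketch on synthetic data; the "true" coefficients below are assumptions chosen for the demo.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 2))
# Assumed relationship for the demo: y = 1 + 2*x1 - 3*x2 + small noise
y_demo = 1.0 + 2.0 * X_demo[:, 0] - 3.0 * X_demo[:, 1] + rng.normal(scale=0.1, size=50)

# Least squares by hand: prepend a column of ones so the first coefficient is the intercept
X_design = np.column_stack([np.ones(len(X_demo)), X_demo])
beta, *_ = np.linalg.lstsq(X_design, y_demo, rcond=None)

lr = LinearRegression().fit(X_demo, y_demo)
print("By hand:", np.round(beta, 3))
print("sklearn:", np.round(np.r_[lr.intercept_, lr.coef_], 3))
```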

In [None]:
# --- Model 2: Random Forest ---
# An ensemble model that builds many decision trees and averages their predictions.
# This helps reduce overfitting and improves accuracy.

rf_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('regressor', RandomForestRegressor(n_estimators=100, random_state=42))
])
rf_pipeline.fit(X_train, y_train)
print("Random Forest model trained.")
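
The averaging described in the comment above can be checked directly. The sketch below uses synthetic data (an assumption, not the course dataset) and compares the forest's prediction with the mean of its individual trees.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in data (assumption: not the course dataset)
X_demo, y_demo = make_regression(n_samples=100, n_features=4, noise=5.0, random_state=0)
forest = RandomForestRegressor(n_estimators=10, random_state=0).fit(X_demo, y_demo)

# Predict with each individual tree, then average across trees
per_tree = np.stack([tree.predict(X_demo[:5]) for tree in forest.estimators_])
manual_average = per_tree.mean(axis=0)

print("Per-tree predictions shape:", per_tree.shape)
print("Average equals forest.predict:", np.allclose(manual_average, forest.predict(X_demo[:5])))
```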

In [None]:
# --- Model 3: XGBoost ---
# A gradient boosting model that builds trees sequentially, with each new tree
# correcting the errors of the previous one. Highly effective.

xgb_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('regressor', XGBRegressor(n_estimators=100, random_state=42))
])
xgb_pipeline.fit(X_train, y_train)
print("XGBoost model trained.")
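
To illustrate the idea of each new tree correcting the errors of the previous ones, here is a deliberately simplified boosting loop for squared error, built from small scikit-learn trees. It is a teaching sketch on synthetic data, not how XGBoost is implemented (XGBoost adds regularization and other refinements); the `learning_rate` and tree depth are arbitrary choices.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X_demo = rng.uniform(-3, 3, size=(200, 1))
y_demo = np.sin(X_demo[:, 0]) + rng.normal(scale=0.1, size=200)

learning_rate = 0.1
prediction = np.full_like(y_demo, y_demo.mean())  # start from the mean of the target
mse_history = [np.mean((y_demo - prediction) ** 2)]

for _ in range(50):
    residuals = y_demo - prediction  # what the current ensemble still gets wrong
    small_tree = DecisionTreeRegressor(max_depth=2).fit(X_demo, residuals)
    prediction += learning_rate * small_tree.predict(X_demo)  # take a small corrective step
    mse_history.append(np.mean((y_demo - prediction) ** 2))

print(f"Training MSE: start={mse_history[0]:.3f}, end={mse_history[-1]:.3f}")
```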

### Model Evaluation

Training is done, but how good are our models? We need to evaluate their performance on the **test set**, which is data the models never saw during training.

For regression, we'll use two common metrics:

  * **Mean Squared Error (MSE)**: The average of the squared differences between the predicted and actual values. It penalizes larger errors more heavily.
    $$\text{MSE} = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2$$
  * **R-squared ($R^2$)**: The proportion of the variance in the target variable that is predictable from the features. An $R^2$ of 1 is a perfect prediction, 0 is no better than always predicting the mean, and negative values are worse than that.
    $$R^2 = 1 - \frac{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}{\sum_{i=1}^{n}(y_i - \bar{y})^2}$$
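
Both formulas are easy to verify by hand. The sketch below uses four made-up values (hypothetical, for illustration only) and compares the manual results with scikit-learn's.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical values for illustration
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.5, 5.0, 8.0, 9.5])

# MSE: average of the squared errors
mse_manual = np.mean((y_true - y_hat) ** 2)

# R-squared: 1 minus (residual sum of squares / total sum of squares)
ss_res = np.sum((y_true - y_hat) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(f"MSE: manual={mse_manual:.3f}, sklearn={mean_squared_error(y_true, y_hat):.3f}")
print(f"R2:  manual={r2_manual:.3f}, sklearn={r2_score(y_true, y_hat):.3f}")
```

Here the squared errors sum to 1.5, so the MSE is 1.5 / 4 = 0.375, and with a total sum of squares of 20 the $R^2$ is 1 - 1.5 / 20 = 0.925.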

In [None]:
results = {}

models = {
    "Linear Regression": lr_pipeline,
    "Random Forest": rf_pipeline,
    "XGBoost": xgb_pipeline
}

for name, pipeline in models.items():
    y_pred = pipeline.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    r2 = r2_score(y_test, y_pred)
    results[name] = {'MSE': mse, 'R2': r2}
    print(f"--- {name} ---")
    print(f"  MSE: {mse:.2f}")
    print(f"  R-squared: {r2:.2f}\n")

### Visualizing the regression

In [None]:
# Use the XGBoost model to get predictions on the test set
# (swap in whichever pipeline scored best in the evaluation above)
y_pred_xgb = xgb_pipeline.predict(X_test)

# Create a scatter plot of actual vs. predicted values
plt.figure(figsize=(8, 8))
sns.scatterplot(x=y_test, y=y_pred_xgb, alpha=0.7, label='Model Predictions')

# Add a line for perfect predictions (y=x)
max_val = max(y_test.max(), y_pred_xgb.max())
min_val = min(y_test.min(), y_pred_xgb.min())
plt.plot([min_val, max_val], [min_val, max_val], 'r--', lw=2, label='Perfect Prediction')

# Add labels and title
plt.title('Actual vs. Predicted Addiction Score (XGBoost)', fontsize=16)
plt.xlabel('Actual Score', fontsize=12)
plt.ylabel('Predicted Score', fontsize=12)
plt.legend()
plt.grid(True)
plt.show()

### Exercise

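Use one of the models to predict the addiction score of the student below. Hint: the pipelines expect a `DataFrame`, so wrap the dictionary in a one-row `pd.DataFrame` first.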

In [None]:
new_student_data = {
    'Age': 21,
    'Gender': 'Female',
    'Academic_Level': 'Graduate',
    'Country': 'USA',
    'Avg_Daily_Usage_Hours': 6.5,
    'Most_Used_Platform': 'TikTok',
    'Affects_Academic_Performance': 'Yes',
    'Sleep_Hours_Per_Night': 5.0,
    'Mental_Health_Score': 3,
    'Relationship_Status': 'Single',
    'Conflicts_Over_Social_Media': 7
}

### Predicting non-continuous variables

In [None]:
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score, classification_report
from sklearn.preprocessing import LabelBinarizer

X_class = df.drop(['Student_ID', 'Addicted_Score', 'Affects_Academic_Performance', 'Sleep_Hours_Per_Night'], axis=1)
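lb = LabelBinarizer()
# For a two-class target, LabelBinarizer returns a column of shape (n, 1);
# flatten it to the 1-D label array that scikit-learn classifiers expect
y_class = lb.fit_transform(df['Affects_Academic_Performance']).ravel()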


numerical_features_class = X_class.select_dtypes(include=np.number).columns.tolist()
nominal_categorical_features_class = ['Gender', 'Country', 'Most_Used_Platform', 'Relationship_Status']
ordinal_categorical_features_class = ['Academic_Level']
academic_levels = ['High School', 'Undergraduate', 'Graduate', 'PhD']

preprocessor_class = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numerical_features_class),
        ('nom', OneHotEncoder(handle_unknown='ignore'), nominal_categorical_features_class),
        ('ord', OrdinalEncoder(categories=[academic_levels]), ordinal_categorical_features_class)
    ],
    remainder='passthrough'
)

X_train_c, X_test_c, y_train_c, y_test_c = train_test_split(X_class, y_class, test_size=0.2, random_state=42, stratify=y_class)

rf_class_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor_class),
    ('classifier', RandomForestClassifier(n_estimators=100, random_state=42))
])
xgb_class_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor_class),
    # 'logloss' is the binary log-loss metric; `use_label_encoder` is deprecated in recent XGBoost releases
    ('classifier', XGBClassifier(random_state=42, eval_metric='logloss'))
])

print("Training Classification Models...")
rf_class_pipeline.fit(X_train_c, y_train_c)
xgb_class_pipeline.fit(X_train_c, y_train_c)
print("Done.")

In [None]:
# Evaluate the Random Forest Classifier
y_pred_rf = rf_class_pipeline.predict(X_test_c)
print("--- Random Forest Classifier Results ---")
print(f"Accuracy: {accuracy_score(y_test_c, y_pred_rf):.2f}")
print(classification_report(y_test_c, y_pred_rf))

# Evaluate the XGBoost Classifier
y_pred_xgb = xgb_class_pipeline.predict(X_test_c)
print("\n--- XGBoost Classifier Results ---")
print(f"Accuracy: {accuracy_score(y_test_c, y_pred_xgb):.2f}")
print(classification_report(y_test_c, y_pred_xgb))

### Quick look at SHAP

In [None]:
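import shap

# Pull the fitted preprocessing step and the fitted model out of the pipeline
fitted_preprocessor = xgb_class_pipeline.named_steps['preprocessor']
model = xgb_class_pipeline.named_steps['classifier']

# SHAP needs the same transformed features the model was trained on
X_test_transformed = fitted_preprocessor.transform(X_test_c)
feature_names = fitted_preprocessor.get_feature_names_out()

explainer = shap.TreeExplainer(model)
shap_values = explainer(X_test_transformed)

shap.summary_plot(shap_values, X_test_transformed, feature_names=feature_names)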


In [None]:
shap.plots.beeswarm(shap_values, max_display=15)
shap.plots.waterfall(shap_values[0])
