# 4. Simplified Predictive Analysis

This notebook demonstrates a simplified machine learning process for predicting exam scores, as outlined in the "Análise Preditiva" (Predictive Analysis) section of the portfolio.

In [None]:
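# Core libraries used throughout the notebook (np, pd, plt and sns are referenced in later cells)
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# `df` must hold the cleaned student DataFrame before the modeling cells run.
# If we saved the cleaned data, we could load it here instead:
# df = pd.read_parquet('cleaned_student_data.parquet')

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score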

## Linear Regression Model and Preprocessing Details for JavaScript Implementation

The following cells extract and display the trained parameters of the Linear Regression pipeline (`lr_pipeline`) and its preprocessing steps, as a reference for reimplementing the same model in JavaScript. They depend on `lr_pipeline`, which is fitted in the "Model Training and Evaluation" section further down, so run those cells first.
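
One way to hand these parameters to a JavaScript implementation is to serialize them as JSON. The sketch below is only an illustration: the `build_model_export` helper, the JSON layout, and every numeric value are hypothetical placeholders, not outputs of this notebook. In practice the arguments would come from `lr_model.intercept_`, `lr_model.coef_`, and the fitted transformers, with NumPy arrays converted via `.tolist()`.

```python
import json


def build_model_export(intercept, feature_names, coefficients,
                       numerical_means, numerical_scales, categories):
    """Bundle model parameters into a JSON-serializable dict (hypothetical layout)."""
    if len(feature_names) != len(coefficients):
        raise ValueError("feature_names and coefficients must have the same length")
    return {
        "intercept": float(intercept),
        # Keep names and weights paired so the JavaScript side cannot misalign them.
        "coefficients": {name: float(c) for name, c in zip(feature_names, coefficients)},
        "numerical": {
            name: {"mean": float(numerical_means[name]),
                   "scale": float(numerical_scales[name])}
            for name in numerical_means
        },
        # Category order matters: index 0 is the one removed by drop='first'.
        "categories": {name: list(values) for name, values in categories.items()},
    }


# Placeholder values for illustration only; they are not results from this notebook.
export = build_model_export(
    intercept=70.0,
    feature_names=["num__study_hours_per_day", "cat__gender_Male", "cat__gender_Other"],
    coefficients=[4.0, 0.5, -0.25],
    numerical_means={"study_hours_per_day": 3.5},
    numerical_scales={"study_hours_per_day": 1.5},
    categories={"gender": ["Female", "Male", "Other"]},
)
model_json = json.dumps(export, indent=2)
print(model_json)
```

Storing coefficients keyed by feature name, rather than as a bare array, makes the JavaScript side less sensitive to ordering mistakes.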

### Extracted Model and Preprocessor Objects

In [None]:
# Retrieve the trained LinearRegression model and ColumnTransformer preprocessor
lr_model = lr_pipeline.named_steps['regressor']
preprocessor_from_pipeline = lr_pipeline.named_steps['preprocessor']

print("LinearRegression model retrieved:", lr_model)
print("ColumnTransformer preprocessor retrieved:", preprocessor_from_pipeline)

### Linear Regression: Intercept and Coefficients

In [None]:
print("Linear Regression Intercept (Bias):")
print(lr_model.intercept_)

print("\nLinear Regression Coefficients (Weights):")
print(lr_model.coef_)

print("\nFeature names corresponding to the coefficients (after all preprocessing):")
final_feature_names = preprocessor_from_pipeline.get_feature_names_out()
print(final_feature_names)

print("\nCoefficients with their corresponding feature names:")
for feature, coef in zip(final_feature_names, lr_model.coef_):
    print(f"{feature}: {coef:.4f}")

The **Intercept** is the predicted value of the target when every model input is zero. After preprocessing, that corresponds to a student whose numerical features all sit at their training means and whose categorical features all take the dropped (reference) category.
The **Coefficients** represent the change in the predicted target for a one-unit change in the corresponding feature, holding all other features constant. For scaled numerical features, a one-unit change in the scaled value equals one standard deviation in the original units. For one-hot encoded categorical features, the coefficient is the change relative to the dropped category.
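
The interpretation above can be checked with a small worked example. The sketch below uses made-up parameters (the `predict` helper and all numbers are hypothetical, not values from the trained model) to show that the baseline prediction equals the intercept and that each coefficient is the change produced by moving one input by one unit.

```python
def predict(intercept, coefficients, features):
    """Linear prediction: intercept plus the sum of coefficient * feature value."""
    return intercept + sum(coefficients[name] * features[name] for name in coefficients)


# Hypothetical parameters for illustration only.
intercept = 70.0
coefficients = {"num__study_hours_per_day": 4.0, "cat__gender_Male": 0.5}

# Scaled value 0 means "at the training mean"; all dummies 0 means the reference category.
baseline = predict(intercept, coefficients,
                   {"num__study_hours_per_day": 0.0, "cat__gender_Male": 0})

# Raising the scaled feature by 1 (one standard deviation in raw units) adds its coefficient.
one_std_up = predict(intercept, coefficients,
                     {"num__study_hours_per_day": 1.0, "cat__gender_Male": 0})

# Switching from the reference category to 'Male' adds that dummy's coefficient.
male = predict(intercept, coefficients,
               {"num__study_hours_per_day": 0.0, "cat__gender_Male": 1})

print(baseline, one_std_up - baseline, male - baseline)  # prints: 70.0 4.0 0.5
```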

### Preprocessing Details: Numerical Features (StandardScaler)

In [None]:
numerical_transformer_from_pipeline = preprocessor_from_pipeline.named_transformers_['num']
# Original numerical feature names are read from the ColumnTransformer's 'transformers' list (as passed to its constructor)
original_numerical_features = [t[2] for t in preprocessor_from_pipeline.transformers if t[0] == 'num'][0]

print("Numerical Features Scaled:", original_numerical_features)
print("\nStandardScaler Mean (for each numerical feature):")
print(numerical_transformer_from_pipeline.mean_)
print("\nStandardScaler Scale (Standard Deviation, for each numerical feature):")
print(numerical_transformer_from_pipeline.scale_)

print("\nNumerical features with their scaling parameters (mean, std_dev):")
for i, feature_name in enumerate(original_numerical_features):
    print(f"{feature_name}: mean={numerical_transformer_from_pipeline.mean_[i]:.4f}, std_dev={numerical_transformer_from_pipeline.scale_[i]:.4f}")

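For each numerical feature, the `mean_` and `scale_` from the `StandardScaler` are shown. `scale_` is the population standard deviation of the training data (computed with `ddof=0`), not the sample standard deviation. These values standardize each feature using the formula `z = (x - mean) / scale`, and a JavaScript port should reuse the exported values rather than recompute them.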
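
As a minimal sketch of the standardization step, the example below applies the formula to a made-up training column (the values and the `standardize` helper are hypothetical). It assumes `scale_` is the population standard deviation, which is why `statistics.pstdev` is used rather than `statistics.stdev`.

```python
import statistics

# Hypothetical training values for one numerical feature.
train_values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

mean = statistics.fmean(train_values)
scale = statistics.pstdev(train_values)       # population std (ddof=0)
sample_std = statistics.stdev(train_values)   # sample std (ddof=1), a different number


def standardize(x, mean, scale):
    """Apply z = (x - mean) / scale."""
    return (x - mean) / scale


print(f"mean={mean}, scale={scale}, sample_std={sample_std:.4f}")
print("z for x=9.0:", standardize(9.0, mean, scale))
```

Using the sample standard deviation by mistake would shift every scaled value slightly, and with it the final prediction.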

### Preprocessing Details: Categorical Features (OneHotEncoder)

In [None]:
categorical_transformer_from_pipeline = preprocessor_from_pipeline.named_transformers_['cat']
# Original categorical feature names are read from the ColumnTransformer's 'transformers' list (as passed to its constructor)
original_categorical_features = [t[2] for t in preprocessor_from_pipeline.transformers if t[0] == 'cat'][0]

print("Categorical Features OneHotEncoded:", original_categorical_features)

print("\nOneHotEncoder Categories (for each original categorical feature):")
for i, feature_name in enumerate(original_categorical_features):
    print(f"Feature '{feature_name}': {categorical_transformer_from_pipeline.categories_[i]}")

print("\nOutput feature names from OneHotEncoder (after dropping 'first'):")
# These are the names of the columns generated by the OneHotEncoder
# The order matches the order of coefficients for the categorical part
ohe_feature_names = categorical_transformer_from_pipeline.get_feature_names_out(original_categorical_features)
print(ohe_feature_names)

For each categorical feature, `categories_` lists all unique categories found during training, in sorted order.
Since `drop='first'` was used in `OneHotEncoder`, the first category in each list is dropped to avoid multicollinearity. The `get_feature_names_out()` method returns the names of the generated columns, which shows which categories were kept (and therefore which were dropped).

#### Detailed Example of `drop='first'` for JavaScript Implementation:

Understanding how `drop='first'` affects the final features is crucial for the JavaScript implementation. Here's how it works for a couple of example features using the categories extracted from the training data:

1.  **`gender`**:
    *   Original categories found by `OneHotEncoder` (from `categorical_transformer_from_pipeline.categories_[0]`): `['Female', 'Male', 'Other']` (order matters).
    *   `drop='first'` means 'Female' (the first in the list) is dropped as the reference category.
    *   The resulting features used in the model are `cat__gender_Male` and `cat__gender_Other`.
    *   **If input `gender` is 'Female'**: `cat__gender_Male` = 0, `cat__gender_Other` = 0.
    *   **If input `gender` is 'Male'**: `cat__gender_Male` = 1, `cat__gender_Other` = 0.
    *   **If input `gender` is 'Other'**: `cat__gender_Male` = 0, `cat__gender_Other` = 1.

2.  **`diet_quality`**:
    *   Original categories found by `OneHotEncoder` (from `categorical_transformer_from_pipeline.categories_[2]` assuming it's the third categorical feature): `['Fair', 'Good', 'Poor']` (order matters).
    *   `drop='first'` means 'Fair' (the first in this list) is dropped.
    *   The resulting features are `cat__diet_quality_Good` and `cat__diet_quality_Poor`.
    *   **If input `diet_quality` is 'Fair'**: `cat__diet_quality_Good` = 0, `cat__diet_quality_Poor` = 0.
    *   **If input `diet_quality` is 'Good'**: `cat__diet_quality_Good` = 1, `cat__diet_quality_Poor` = 0.
    *   **If input `diet_quality` is 'Poor'**: `cat__diet_quality_Good` = 0, `cat__diet_quality_Poor` = 1.

This logic needs to be replicated precisely in JavaScript: for each categorical input, determine which binary columns are generated and set them to 0 or 1 based on the input value and the dropped category for that feature. The order of categories shown by `categorical_transformer_from_pipeline.categories_` for each feature dictates which category is 'first'.
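
The encoding rule described above can be sketched as a small function that is easy to port to JavaScript. The `one_hot_drop_first` helper is hypothetical. The category list is the `gender` example from this section, and the `cat__` prefix follows the feature names shown earlier.

```python
def one_hot_drop_first(feature, value, categories):
    """Encode one categorical value, skipping the first (reference) category.

    `categories` is the full list in training order; index 0 is the dropped one.
    """
    return {
        f"cat__{feature}_{category}": 1 if value == category else 0
        for category in categories[1:]
    }


gender_categories = ["Female", "Male", "Other"]

print(one_hot_drop_first("gender", "Female", gender_categories))
print(one_hot_drop_first("gender", "Other", gender_categories))

# A value never seen in training also yields all zeros here, so it is
# indistinguishable from the reference category in this sketch.
print(one_hot_drop_first("gender", "Unknown", gender_categories))
```

The all-zeros result for an unseen value appears to match how scikit-learn behaves when `handle_unknown='ignore'` is combined with `drop`, but a JavaScript implementation may prefer to reject unknown values explicitly.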

### Python Example: Manual Prediction for a Single Instance

This cell demonstrates how to manually calculate a prediction for a single raw input instance using the extracted model parameters. This serves as a Python counterpart to the JavaScript prediction logic, ensuring clarity on the transformation and calculation steps.

In [None]:
import numpy as np

# 1. Sample Raw Input (same as used for JavaScript simulator testing)
sample_input = {
    'age': 20,
    'study_hours_per_day': 4.0,
    'social_media_hours': 2.0,
    'netflix_hours': 2.0,
    'attendance_percentage': 80,
    'sleep_hours': 7.0,
    'exercise_frequency': 3,
    'mental_health_rating': 5,
    'gender': 'Female',
    'part_time_job': 'No',
    'diet_quality': 'Fair',
    'parental_education_level': 'High School',
    'internet_quality': 'Average',
    'extracurricular_participation': 'No'
}

print(f"Sample Raw Input: {sample_input}\n")

# 2. Use Extracted Model Parameters (ensure these variables are set from previous cells in the notebook execution flow)
# These variables are assumed to be populated from earlier cells where they were extracted from the pipeline.
intercept = lr_model.intercept_ 
coefficients_list = lr_model.coef_
feature_names_ordered_by_coefs = preprocessor_from_pipeline.get_feature_names_out()
coefficients = dict(zip(feature_names_ordered_by_coefs, coefficients_list))

num_transformer = preprocessor_from_pipeline.named_transformers_['num']
original_num_features_in_order = [t[2] for t in preprocessor_from_pipeline.transformers if t[0] == 'num'][0]
numerical_means = dict(zip(original_num_features_in_order, num_transformer.mean_))
numerical_std_devs = dict(zip(original_num_features_in_order, num_transformer.scale_))

cat_transformer = preprocessor_from_pipeline.named_transformers_['cat']
original_cat_features_in_order = [t[2] for t in preprocessor_from_pipeline.transformers if t[0] == 'cat'][0]
ohe_categories_map = {}
for i, feature_name in enumerate(original_cat_features_in_order):
    ohe_categories_map[feature_name] = cat_transformer.categories_[i].tolist()

print("--- Model Parameters Used (from previously executed cells) ---")
print(f"Intercept: {intercept}")
# Build the previews first: doubled braces inside an f-string print literal braces,
# so the original comprehensions were never evaluated.
first_coefficients = dict(list(coefficients.items())[:3])
first_means = dict(list(numerical_means.items())[:3])
first_std_devs = dict(list(numerical_std_devs.items())[:3])
gender_categories_preview = {'gender': ohe_categories_map.get('gender')}
print(f"First 3 Coefficients: {first_coefficients}")
print(f"Numerical Features (used by StandardScaler): {original_num_features_in_order}")
print(f"Numerical Means (first 3): {first_means}")
print(f"Numerical Std Devs (first 3): {first_std_devs}")
print(f"Categorical Features (used by OneHotEncoder): {original_cat_features_in_order}")
print(f"OHE Categories Map (e.g., gender): {gender_categories_preview}")
print(f"Final Feature Order for Coefficients (first 5): {list(feature_names_ordered_by_coefs[:5])}... Total: {len(feature_names_ordered_by_coefs)}\n")

# 3. Manual Preprocessing for the Sample Input
processed_feature_values = {} # This will store the final values for each feature name in `feature_names_ordered_by_coefs`

# a. Scale Numerical Features
print("--- Processing Numerical Features ---")
for original_feature_name in original_num_features_in_order:
    raw_value = sample_input[original_feature_name]
    mean = numerical_means[original_feature_name]
    std_dev = numerical_std_devs[original_feature_name]
    scaled_value = (raw_value - mean) / std_dev
    # The final feature name for numerical features is 'num__' + original_feature_name
    final_num_feature_name = f"num__{original_feature_name}"
    processed_feature_values[final_num_feature_name] = scaled_value
    print(f"Processed {final_num_feature_name}: ({raw_value} - {mean:.2f}) / {std_dev:.2f} = {scaled_value:.6f}")

# b. One-Hot Encode Categorical Features (respecting drop='first')
print("\n--- Processing Categorical Features ---")
for original_feature_name in original_cat_features_in_order:
    raw_value = sample_input[original_feature_name]
    categories_for_this_feature = ohe_categories_map[original_feature_name]
    dropped_category = categories_for_this_feature[0] # First category is dropped
    
    # Iterate through all categories found during training for this feature, *except* the dropped one
    for category_from_training in categories_for_this_feature[1:]:
        # The final feature name for OHE features is 'cat__' + original_feature_name + '_' + category_value
        final_ohe_feature_name = f"cat__{original_feature_name}_{category_from_training}"
        if raw_value == category_from_training:
            processed_feature_values[final_ohe_feature_name] = 1
        else:
            processed_feature_values[final_ohe_feature_name] = 0
        print(f"Processed {final_ohe_feature_name}: (input '{raw_value}') -> {processed_feature_values[final_ohe_feature_name]}")
    
    # If the raw_value was the dropped_category, all corresponding OHE features should be 0.
    # This is implicitly handled because only non-dropped categories generate feature names.

# 4. Construct Final Feature Vector in Correct Order (as per feature_names_ordered_by_coefs)
feature_vector_for_prediction = []
print("\n--- Constructing Final Feature Vector (in model's expected order) ---")
for final_feature_name in feature_names_ordered_by_coefs:
    # Every name returned by get_feature_names_out() should have been produced by the
    # manual steps above. Dropped categories (e.g. 'Female' for gender) never appear
    # as feature names, so they need no entry here.
    value_to_append = processed_feature_values.get(final_feature_name)
    if value_to_append is None:
        # A missing name signals a mismatch between the manual preprocessing and the
        # fitted preprocessor. Fall back to 0, but treat the warning as a bug to investigate.
        print(f"Warning: Feature {final_feature_name} not found in manually processed features. Using 0.")
        value_to_append = 0
    feature_vector_for_prediction.append(value_to_append)

print(f"Final Ordered Feature Vector (first 5 elements): {np.array(feature_vector_for_prediction[:5])}")
print(f"Full feature vector has {len(feature_vector_for_prediction)} elements.")

# 5. Calculate the Dot Product and Add Intercept
manual_prediction = intercept
for i, final_feature_name in enumerate(feature_names_ordered_by_coefs):
    manual_prediction += coefficients[final_feature_name] * feature_vector_for_prediction[i]

# 6. Print Results
print(f"\nRaw Manual Prediction: {manual_prediction}")

# Clamp and round the prediction as in the JavaScript simulator
clamped_manual_prediction = max(0, min(100, manual_prediction))
rounded_manual_prediction = round(clamped_manual_prediction, 2)
print(f"Clamped (0-100) and Rounded Manual Prediction: {rounded_manual_prediction}")

print("\nThis manual calculation should match the JavaScript simulator's output for the same input, \nand also the result from the 'Test Plan' step if the same input and parameters were used.")

### Final Check: All Preprocessed Feature Names in Order

This list shows all feature names in the exact order they are fed into the linear regression model after all preprocessing. This order should match the order of the model's coefficients.
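
Because a misordered vector produces a wrong prediction without raising an error, it can help to assemble the vector by name and fail loudly on any gap. The sketch below is one hedged way to do that. The helpers `build_feature_vector` and `dot_with_intercept`, and all names and numbers, are hypothetical.

```python
def build_feature_vector(ordered_names, values_by_name):
    """Return values in the model's feature order, failing loudly on any gap."""
    missing = [name for name in ordered_names if name not in values_by_name]
    if missing:
        raise KeyError(f"No value computed for features: {missing}")
    return [values_by_name[name] for name in ordered_names]


def dot_with_intercept(intercept, coefficients, vector):
    """Compute intercept + sum(coef_i * x_i); lengths must agree."""
    if len(coefficients) != len(vector):
        raise ValueError("coefficient and feature vector lengths differ")
    return intercept + sum(c * x for c, x in zip(coefficients, vector))


# Hypothetical names and numbers for illustration only.
ordered_names = ["num__age", "num__sleep_hours", "cat__gender_Male"]
values = {"cat__gender_Male": 1, "num__age": 0.5, "num__sleep_hours": -1.0}

vector = build_feature_vector(ordered_names, values)
print(vector)
print(dot_with_intercept(70.0, [2.0, 1.0, 0.5], vector))
```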

In [None]:
print("All feature names from the preprocessor (in order):")
all_processed_feature_names = preprocessor_from_pipeline.get_feature_names_out()
print(all_processed_feature_names)

print("\nNumber of coefficients:", len(lr_model.coef_))
print("Number of preprocessed features:", len(all_processed_feature_names))

## Feature Selection and Preprocessing

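Based on the EDA and the HTML feature importance chart, the strongest candidate predictors are:
`study_hours_per_day`, `attendance_percentage`, `mental_health_rating`, `sleep_hours`, `internet_quality`, `parental_education_level`.
The code below does not restrict the model to these columns. It keeps every column except `student_id` and the target `exam_score`, then splits the remaining columns into numerical and categorical groups by dtype.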

In [None]:
# Drop student_id if it's still there
if 'student_id' in df.columns:
    df_model = df.drop('student_id', axis=1).copy()
else:
    df_model = df.copy()

X = df_model.drop('exam_score', axis=1)
y = df_model['exam_score']

# Identify numerical and categorical features for the model
numerical_features = X.select_dtypes(include=np.number).columns.tolist()
categorical_features = X.select_dtypes(include=['category', 'object']).columns.tolist()

print("Numerical Features:", numerical_features)
print("Categorical Features:", categorical_features)

In [None]:
# Create preprocessing pipelines for numerical and categorical features
numerical_transformer = StandardScaler()
categorical_transformer = OneHotEncoder(handle_unknown='ignore', drop='first') # drop='first' to avoid multicollinearity

# Create a preprocessor object using ColumnTransformer
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_features),
        ('cat', categorical_transformer, categorical_features)
    ])

## Train-Test Split

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

## Model Training and Evaluation

### 1. Linear Regression

In [None]:
lr_pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                              ('regressor', LinearRegression())])

lr_pipeline.fit(X_train, y_train)
y_pred_lr = lr_pipeline.predict(X_test)

rmse_lr = np.sqrt(mean_squared_error(y_test, y_pred_lr))
r2_lr = r2_score(y_test, y_pred_lr)

print("Linear Regression Performance:")
print(f"RMSE: {rmse_lr:.2f}")
print(f"R²: {r2_lr:.2f}")

### 2. Random Forest Regressor

In [None]:
rf_pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                             ('regressor', RandomForestRegressor(n_estimators=100, random_state=42, n_jobs=-1))])

rf_pipeline.fit(X_train, y_train)
y_pred_rf = rf_pipeline.predict(X_test)

rmse_rf = np.sqrt(mean_squared_error(y_test, y_pred_rf))
r2_rf = r2_score(y_test, y_pred_rf)

print("Random Forest Regressor Performance:")
print(f"RMSE: {rmse_rf:.2f}")
print(f"R²: {r2_rf:.2f}")

*Note: The HTML reports an RMSE of 8.52 (LR) and 6.95 (RF), and an R² of 0.78 (LR) and 0.85 (RF). The values computed here depend on the dataset, the train/test split, and the exact features used, so they may differ. The provided CSV is synthetic, and results in this range are achievable with it.*
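
For reference, the two metrics reported above can be computed by hand. The sketch below uses made-up scores, and the `rmse` and `r2` helpers are hypothetical stand-ins for the definitions that `mean_squared_error` and `r2_score` implement.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error: sqrt(mean((y - y_hat)^2))."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Hypothetical exam scores and predictions for illustration only.
y_true = [60.0, 70.0, 80.0, 90.0]
y_pred = [62.0, 68.0, 83.0, 87.0]

print(f"RMSE: {rmse(y_true, y_pred):.2f}")
print(f"R²: {r2(y_true, y_pred):.3f}")
```

RMSE is expressed in exam-score points, while R² is the fraction of the variance in the scores that the model explains.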

## Feature Importance (Random Forest Example)

In [None]:
# Get feature names after one-hot encoding
feature_names_out = rf_pipeline.named_steps['preprocessor'].get_feature_names_out()

importances = rf_pipeline.named_steps['regressor'].feature_importances_
feature_importance_df = pd.DataFrame({'feature': feature_names_out, 'importance': importances})
feature_importance_df = feature_importance_df.sort_values('importance', ascending=False)

print("\nTop 10 Feature Importances from Random Forest:")
print(feature_importance_df.head(10))

plt.figure(figsize=(10, 8))
sns.barplot(x='importance', y='feature', data=feature_importance_df.head(10), palette='mako') # Top 10
plt.title('Top 10 Feature Importances (Random Forest)')
plt.xlabel('Importance')
plt.ylabel('Feature')
plt.tight_layout()
plt.show()

The HTML's feature importance chart shows `study_hours_per_day` as most important, followed by `attendance_percentage`, `mental_health_rating`, etc. The actual results from the model might vary slightly in order but should generally highlight similar features as strong predictors.