In [1]:
import os
import sys
import pandas as pd
import lightgbm as lgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, r2_score

# Add the project root to the path to import modules from 'src'
project_root = os.path.dirname(os.getcwd())
if project_root not in sys.path:
    sys.path.append(project_root)

# Loading the data
PROCESSED_DATA_PATH = os.path.join(project_root, 'data', 'processed', 'final_data.parquet')
df_parquet = pd.read_parquet(PROCESSED_DATA_PATH)

print(f"Complete dataset loaded with {len(df_parquet)} rows.")
df_parquet.head()

Complete dataset loaded with 349923 rows.


| | edicao | co_ies | no_ies | sg_ies | no_campus | co_curso | no_curso | ds_grau | ds_turno | ds_mod_concorrencia | qt_vagas_concorrencia | nu_notacorte | qt_inscricao | chave_curso | nota_edicao_anterior | vagas_edicao_anterior | tendencia_nota | inscritos_edicao_anterior | demanda_anterior |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2024 | 1027 | UNIVERSIDADE ESTADUAL DO NORTE FLUMINENSE DARC... | UENF | CAMPUS - CAMPOS DOS GOYTACAZES - PARQUE CALIF... | 101984 | CIÊNCIA DA COMPUTAÇÃO | BACHARELADO | INTEGRAL | - CANDIDATO (S) COM DEFICIÊNCIA, OU FILHOS DE ... | 1.0 | 661.06 | 4.0 | 1027_101984_BACHARELADO_INTEGRAL | | | 0.0 | | 0.0 |
| 1 | 2019_1 | 1027 | UNIVERSIDADE ESTADUAL DO NORTE FLUMINENSE DARC... | UENF | CAMPUS - CAMPOS DOS GOYTACAZES - PARQUE CALIF... | 101984 | CIÊNCIA DA COMPUTAÇÃO | BACHARELADO | INTEGRAL | AMPLA CONCORRÊNCIA | 14.0 | 677.73 | 157.0 | 1027_101984_BACHARELADO_INTEGRAL | | | 0.0 | | 0.0 |
| 2 | 2020_1 | 1027 | UNIVERSIDADE ESTADUAL DO NORTE FLUMINENSE DARC... | UENF | CAMPUS - CAMPOS DOS GOYTACAZES - PARQUE CALIF... | 101984 | CIÊNCIA DA COMPUTAÇÃO | BACHARELADO | INTEGRAL | AMPLA CONCORRÊNCIA | 14.0 | 669.42 | 91.0 | 1027_101984_BACHARELADO_INTEGRAL | 677.73 | 14.0 | 0.0 | 157.0 | 10.466667 |
| 3 | 2021_1 | 1027 | UNIVERSIDADE ESTADUAL DO NORTE FLUMINENSE DARC... | UENF | CAMPUS - CAMPOS DOS GOYTACAZES - PARQUE CALIF... | 101984 | CIÊNCIA DA COMPUTAÇÃO | BACHARELADO | INTEGRAL | AMPLA CONCORRÊNCIA | 14.0 | 684.28 | 106.0 | 1027_101984_BACHARELADO_INTEGRAL | 669.42 | 14.0 | -8.31 | 91.0 | 6.066667 |
| 4 | 2022_1 | 1027 | UNIVERSIDADE ESTADUAL DO NORTE FLUMINENSE DARC... | UENF | CAMPUS - CAMPOS DOS GOYTACAZES - PARQUE CALIF... | 101984 | CIÊNCIA DA COMPUTAÇÃO | BACHARELADO | INTEGRAL | AMPLA CONCORRÊNCIA | 14.0 | 693.64 | 112.0 | 1027_101984_BACHARELADO_INTEGRAL | 684.28 | 14.0 | 14.86 | 106.0 | 7.066667 |


## Baseline

The first approach was to train a single model for all universities, courses, and admission categories. The hypothesis was that the model could learn general patterns from the entire dataset.
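A useful reference point for any model here is a persistence baseline: predict each cutoff score with the previous edition's cutoff (`nota_edicao_anterior`). The sketch below is not part of the original experiment; the helper name `persistence_mae` is hypothetical, and the toy rows reuse the four cutoffs from the sample shown above.

```python
import pandas as pd


def persistence_mae(df, target="nu_notacorte", previous="nota_edicao_anterior"):
    """MAE obtained by predicting each cutoff with the previous edition's cutoff."""
    # Rows without a previous edition cannot be scored by this baseline.
    valid = df.dropna(subset=[target, previous])
    return (valid[target] - valid[previous]).abs().mean()


# Toy rows (hypothetical frame built from the sample values printed earlier).
toy = pd.DataFrame({
    "nu_notacorte": [669.42, 684.28, 693.64, 661.06],
    "nota_edicao_anterior": [677.73, 669.42, 684.28, None],
})

print(f"Persistence MAE on the toy rows: {persistence_mae(toy):.2f}")
```

Running `persistence_mae` on the same test rows used for a model gives a number the model's MAE should beat before it is considered useful.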

In [2]:
# Preparing data for the generalist model
df_generalist = df_parquet.dropna(subset=['nota_edicao_anterior', 'vagas_edicao_anterior'])

TARGET = 'nu_notacorte'
features_cols = ['sg_ies', 'no_curso', 'ds_grau', 'ds_turno', 'ds_mod_concorrencia', 'qt_vagas_concorrencia', 'nota_edicao_anterior', 'vagas_edicao_anterior']
categorical_features = ['sg_ies', 'no_curso', 'ds_grau', 'ds_turno', 'ds_mod_concorrencia']

X = df_generalist[features_cols].copy()
y = df_generalist[TARGET]

for col in categorical_features:
    X[col] = X[col].astype('category')

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

lgbm_generalist = lgb.LGBMRegressor(random_state=42)
lgbm_generalist.fit(X_train, y_train)

preds = lgbm_generalist.predict(X_test)
mae = mean_absolute_error(y_test, preds)

print(f"MAE of the Generalist Model: {mae:.2f}")

[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.001899 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 1356
[LightGBM] [Info] Number of data points in the train set: 161511, number of used features: 8
[LightGBM] [Info] Start training from score 529.131110
MAE of the Generalist Model: 63.81


### Baseline Analysis

An MAE of approximately **64 points** is too high for the model to be useful in practice. The working hypothesis is that the variability and noise from the many affirmative action categories, several of which have very little historical data, are degrading what the model learns.
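One way to probe this hypothesis before changing the scope is to break the error down by admission category. The sketch below is an illustration only: `mae_by_group` is a hypothetical helper, and the label `COTA A` and all numbers in the toy data are made up.

```python
import pandas as pd


def mae_by_group(y_true, y_pred, groups):
    """Return MAE and row count per group, worst group first."""
    errors = pd.DataFrame({
        "group": list(groups),
        "abs_error": [abs(t - p) for t, p in zip(y_true, y_pred)],
    })
    summary = errors.groupby("group")["abs_error"].agg(mae="mean", n_rows="size")
    return summary.sort_values("mae", ascending=False)


# Toy data: 'COTA A' is a made-up category label used only for illustration.
toy_groups = ["AMPLA CONCORRÊNCIA", "AMPLA CONCORRÊNCIA", "COTA A", "COTA A"]
toy_true = [680.0, 700.0, 600.0, 0.0]
toy_pred = [670.0, 690.0, 640.0, 560.0]

summary = mae_by_group(toy_true, toy_pred, toy_groups)
print(summary)
```

In the notebook, the equivalent call would be `mae_by_group(y_test, preds, X_test['ds_mod_concorrencia'])` right after the generalist model is evaluated. Categories with a high MAE and few rows would support the noise hypothesis.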

## Focus on General Admission

The new strategy is to simplify the problem, focusing on the most robust and competitive subset of data: General Admission (`Ampla Concorrência`) with a minimum number of available spots.

In [3]:
# Filtering for the specialist model
df_specialist_raw = df_parquet.query("`ds_mod_concorrencia` == 'AMPLA CONCORRÊNCIA' and `qt_vagas_concorrencia` >= 10").copy()
df_specialist = df_specialist_raw.dropna(subset=['nota_edicao_anterior', 'vagas_edicao_anterior'])

features_cols_spec = ['sg_ies', 'no_curso', 'ds_grau', 'ds_turno', 'qt_vagas_concorrencia', 'nota_edicao_anterior', 'vagas_edicao_anterior']
categorical_features_spec = ['sg_ies', 'no_curso', 'ds_grau', 'ds_turno']

X = df_specialist[features_cols_spec].copy()
y = df_specialist[TARGET]

for col in categorical_features_spec:
    X[col] = X[col].astype('category')

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

lgbm_specialist = lgb.LGBMRegressor(random_state=42)
lgbm_specialist.fit(X_train, y_train)

preds = lgbm_specialist.predict(X_test)
mae = mean_absolute_error(y_test, preds)

print(f"MAE of the Specialist Model: {mae:.2f}")

[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000232 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 994
[LightGBM] [Info] Number of data points in the train set: 26990, number of used features: 7
[LightGBM] [Info] Start training from score 633.563826
MAE of the Specialist Model: 29.60


### General Admission Analysis

The MAE dropped from 63.81 to approximately **30 points**, a reduction of over 50%. This supports the hypothesis that noise from the other admission categories was the main issue, and it shows that narrowing the problem's scope was the right decision.
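The size of the reduction can be checked directly from the two printed MAEs. Note that the two numbers come from different test sets (all categories versus General Admission only), so a stricter comparison would score the generalist model only on the General Admission rows of its own test set. The sketch below shows both steps; `mae_on_subset` is a hypothetical helper and the toy values are made up.

```python
import pandas as pd

# MAEs printed by the generalist and specialist cells.
mae_generalist = 63.81
mae_specialist = 29.60
relative_reduction = (mae_generalist - mae_specialist) / mae_generalist
print(f"Relative MAE reduction: {relative_reduction:.1%}")


def mae_on_subset(y_true, y_pred, mask):
    """MAE restricted to the rows where mask is True."""
    y_true = pd.Series(list(y_true), dtype=float)
    y_pred = pd.Series(list(y_pred), dtype=float)
    mask = pd.Series(list(mask), dtype=bool)
    return (y_true[mask] - y_pred[mask]).abs().mean()


# Toy example: only the first two rows belong to the subset of interest.
toy_mae = mae_on_subset([680.0, 700.0, 600.0], [670.0, 690.0, 640.0], [True, True, False])
print(f"MAE on the toy subset: {toy_mae:.2f}")
```

For the generalist model, the mask would be `X_test['ds_mod_concorrencia'] == 'AMPLA CONCORRÊNCIA'`, evaluated while the generalist `X_test`, `y_test`, and `preds` are still in scope.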

## Error Analysis

Even with the improvement, let's investigate where the model still makes its biggest mistakes.

In [4]:
df_errors = X_test.copy()
df_errors['nota_real'] = y_test
df_errors['nota_prevista'] = preds
df_errors['erro_absoluto'] = abs(df_errors['nota_real'] - df_errors['nota_prevista'])

df_worst_errors = df_errors.sort_values(by='erro_absoluto', ascending=False)

print("--- TOP 20 BIGGEST ERRORS OF THE SPECIALIST MODEL ---")
print(df_worst_errors[['sg_ies', 'no_curso', 'nota_edicao_anterior', 'nota_real', 'nota_prevista', 'erro_absoluto']].head(20))

--- TOP 20 BIGGEST ERRORS OF THE SPECIALIST MODEL ---
                sg_ies                                  no_curso  \
271856           UTFPR                       ENGENHARIA ELÉTRICA   
41679             IFCE                                   QUÍMICA   
166445            UFRN                               ESTATÍSTICA   
231748           UFRGS  INTERDISCIPLINAR EM CIÊNCIA E TECNOLOGIA   
290418        CEFET/MG                       ENGENHARIA ELÉTRICA   
272291           UTFPR                    ENGENHARIA DE PRODUÇÃO   
222979            UFPB                                    FÍSICA   
22940            UNILA                                   QUÍMICA   
139724  IF CATARINENSE        ENGENHARIA DE CONTROLE E AUTOMAÇÃO   
49441             IFES         CIÊNCIA E TECNOLOGIA DE ALIMENTOS   
80466              UFJ                       CIÊNCIAS BIOLÓGICAS   
159418            UEPB                        QUÍMICA INDUSTRIAL   
158566            UEPB                                 GEOGRAF

### Error Analysis Findings

The error analysis revealed a clear pattern: many of the largest errors occurred in cases where `nota_real` (the actual score) was `0.0`. A cutoff of zero most likely reflects a data anomaly (no candidates, or a data entry error) rather than a real competitive cutoff score. These rows are therefore treated as a data quality issue and removed from both the training and evaluation datasets.
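The pattern can be quantified rather than read off the table. The sketch below assumes an error frame with the same column names as `df_errors` (`nota_real`, `erro_absoluto`); the helper `zero_target_summary` is hypothetical and the toy values are made up.

```python
import pandas as pd


def zero_target_summary(df_errors, actual="nota_real", error="erro_absoluto", top_n=20):
    """Compare the share of zero targets overall with their share among the worst errors."""
    is_zero = df_errors[actual] == 0
    worst = df_errors.nlargest(top_n, error)
    return {
        "zero_rows": int(is_zero.sum()),
        "zero_share_overall": float(is_zero.mean()),
        "zero_share_in_worst": float((worst[actual] == 0).mean()),
    }


# Toy error frame with two zero-score rows.
toy_errors = pd.DataFrame({
    "nota_real": [0.0, 650.0, 700.0, 0.0, 620.0],
    "nota_prevista": [640.0, 655.0, 690.0, 600.0, 618.0],
})
toy_errors["erro_absoluto"] = (toy_errors["nota_real"] - toy_errors["nota_prevista"]).abs()

report = zero_target_summary(toy_errors, top_n=2)
print(report)
```

A `zero_share_in_worst` far above `zero_share_overall` indicates that zero scores are concentrated among the largest errors, which is the situation described above.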

## Final Model

The final model combines the specialist filter, the removal of zero scores, three additional lag features (`tendencia_nota`, `demanda_anterior`, and `inscritos_edicao_anterior`), and the best hyperparameters found during experimentation (`n_estimators=5000`, `learning_rate=0.01`).
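Three of the features used by the final model (`tendencia_nota`, `demanda_anterior`, `inscritos_edicao_anterior`) are derived columns already present in the loaded table. Their construction is not shown in this notebook. The sketch below is an inference from the five sample rows printed earlier, not the project's actual feature code: in those rows, `tendencia_nota` matches the difference between the two preceding cutoffs, and `demanda_anterior` matches previous applicants divided by previous spots plus one (157 / 15 = 10.466667). Grouping by `chave_curso` alone is also an assumption.

```python
import pandas as pd

# The four General Admission rows from the sample output.
toy = pd.DataFrame({
    "chave_curso": ["1027_101984_BACHARELADO_INTEGRAL"] * 4,
    "edicao": ["2019_1", "2020_1", "2021_1", "2022_1"],
    "qt_vagas_concorrencia": [14.0, 14.0, 14.0, 14.0],
    "nu_notacorte": [677.73, 669.42, 684.28, 693.64],
    "qt_inscricao": [157.0, 91.0, 106.0, 112.0],
})

# Lags must be computed in chronological order within each course.
toy = toy.sort_values(["chave_curso", "edicao"]).reset_index(drop=True)
by_course = toy.groupby("chave_curso")

toy["nota_edicao_anterior"] = by_course["nu_notacorte"].shift(1)
toy["vagas_edicao_anterior"] = by_course["qt_vagas_concorrencia"].shift(1)
toy["inscritos_edicao_anterior"] = by_course["qt_inscricao"].shift(1)

# Assumed definition: change between the two preceding cutoffs, 0 when unknown.
toy["tendencia_nota"] = (
    toy["nota_edicao_anterior"] - by_course["nu_notacorte"].shift(2)
).fillna(0.0)

# Assumed definition: previous applicants per (previous spots + 1), 0 when unknown.
toy["demanda_anterior"] = (
    toy["inscritos_edicao_anterior"] / (toy["vagas_edicao_anterior"] + 1)
).fillna(0.0)

print(toy[["edicao", "nota_edicao_anterior", "tendencia_nota", "demanda_anterior"]])
```

All three features depend only on earlier editions, so they do not use the target of the row they describe.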

In [5]:
# 1. Remove zero cutoff scores, treated as data anomalies in the error analysis
df_cleaned = df_parquet[df_parquet['nu_notacorte'] != 0].copy()

# 2. Specialist model filter
df_final_data = df_cleaned.query("`ds_mod_concorrencia` == 'AMPLA CONCORRÊNCIA' and `qt_vagas_concorrencia` >= 10").copy()
df_model = df_final_data.dropna(subset=['nota_edicao_anterior', 'vagas_edicao_anterior'])

# 3. Preparing X and y
TARGET = 'nu_notacorte'
features_cols_final = ['sg_ies', 'no_curso', 'ds_grau', 'ds_turno', 'qt_vagas_concorrencia', 'nota_edicao_anterior', 'vagas_edicao_anterior', 'tendencia_nota', 'demanda_anterior', 'inscritos_edicao_anterior']
categorical_features_final = ['sg_ies', 'no_curso', 'ds_grau', 'ds_turno']

X = df_model[features_cols_final].copy()
y = df_model[TARGET]

for col in categorical_features_final:
    X[col] = X[col].astype('category')

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 4. Training with optimized parameters
best_params = {
    'n_estimators': 5000, 
    'learning_rate': 0.01,
    'random_state': 42
}

lgbm_final = lgb.LGBMRegressor(**best_params)
lgbm_final.fit(X_train, y_train)

# 5. Final Evaluation
final_preds = lgbm_final.predict(X_test)
final_mae = mean_absolute_error(y_test, final_preds)
final_r2 = r2_score(y_test, final_preds)

print("--- FINAL MODEL EVALUATION ---")
print(f"MAE (Mean Absolute Error): {final_mae:.2f}")
print(f"R² (R-Squared): {final_r2:.2f}")

[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000813 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 1755
[LightGBM] [Info] Number of data points in the train set: 26512, number of used features: 10
[LightGBM] [Info] Start training from score 645.184139
--- FINAL MODEL EVALUATION ---
MAE (Mean Absolute Error): 15.95
R² (R-Squared): 0.89


## Conclusion

The MAE went from approximately 64 points for the generalist baseline to approximately **16 points** (with an R² of 0.89) for the final model. The process demonstrated that the most impactful steps were:

1. **Refining the Problem Scope:** Switching from a generalist model to a specialist model focused on General Admission was the most critical decision.
2. **Error Analysis and Data Cleaning:** Identifying and removing outliers (zero-scores) based on error analysis was crucial to obtaining a realistic performance metric.
3. **Hyperparameter Tuning:** Tuning the model's parameters (`n_estimators=5000`, `learning_rate=0.01`), applied together with the three added lag features, provided the final performance gain.
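One caveat that may be worth checking: every model above is evaluated with a random `train_test_split`, which mixes editions, so rows from later editions can end up in training while earlier ones are in the test set. A holdout on the most recent edition is closer to how the model would be used. The sketch below is a suggestion, not part of the original experiment. The helper `split_by_latest_edition` is hypothetical, and it assumes that edition labels such as `2019_1` and `2024` sort chronologically as strings, which holds for the values shown in the sample.

```python
import pandas as pd


def split_by_latest_edition(df, edition_col="edicao"):
    """Hold out the most recent edition as the test set; train on all earlier ones."""
    labels = df[edition_col].astype(str)
    # Assumption: lexicographic order of the labels matches chronological order.
    latest = sorted(labels.unique())[-1]
    is_test = labels == latest
    return df[~is_test], df[is_test]


# Toy frame using the edition labels and cutoffs from the sample output.
toy = pd.DataFrame({
    "edicao": ["2019_1", "2020_1", "2021_1", "2022_1", "2024"],
    "nu_notacorte": [677.73, 669.42, 684.28, 693.64, 661.06],
})

train_df, test_df = split_by_latest_edition(toy)
print(f"Train editions: {sorted(train_df['edicao'])}")
print(f"Test editions: {sorted(test_df['edicao'])}")
```

Because `edicao` is not one of the model features, the split would be applied to `df_model` before selecting the feature columns.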