# Final Project - Introduction to Data Science
### Topic: Predicting Student Pass/Fail Outcomes with Machine Learning

This project predicts whether a student passes, using study hours, attendance, exam grades, internet access, and family support as features. It trains two machine learning models, a logistic regression and a random forest, on simulated data.

In [1]:
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, accuracy_score

In [2]:
# Generate simulated data (200 students, fixed seed for reproducibility)
np.random.seed(42)
data = pd.DataFrame({
    'horas estudo': np.random.normal(3, 1, 200).clip(0, 10),
    'frequencia %': np.random.normal(85, 10, 200).clip(50, 100),
    'nota P1': np.random.normal(6, 1.5, 200).clip(0, 10),
    'nota P2': np.random.normal(6, 1.5, 200).clip(0, 10),
    'tem acesso internet': np.random.choice([1, 0], size=200, p=[0.8, 0.2]),
    'apoio familiar': np.random.choice([1, 0], size=200, p=[0.7, 0.3]),
})
media_final = (data['nota P1'] + data['nota P2']) / 2
data['aprovado'] = ((media_final >= 6) & (data['frequencia %'] >= 75)).astype(int)
data.head()

|   | horas estudo | frequencia % | nota P1 | nota P2 | tem acesso internet | apoio familiar | aprovado |
|---|---|---|---|---|---|---|---|
| 0 | 3.496714 | 88.577874 | 3.608359 | 7.135483 | 1 | 0 | 0 |
| 1 | 2.861736 | 90.607845 | 5.100937 | 4.616752 | 0 | 1 | 0 |
| 2 | 3.647689 | 95.830512 | 6.007866 | 7.304409 | 1 | 0 | 1 |
| 3 | 4.52303 | 95.538021 | 6.070471 | 8.033457 | 1 | 1 | 1 |
| 4 | 2.765847 | 71.223306 | 5.324902 | 6.620152 | 1 | 1 | 0 |
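
The label rule in the cell above uses only `nota P1`, `nota P2`, and `frequencia %`; the other three columns are drawn independently and never enter it. The sketch below isolates that rule in a helper and inspects the class balance and the correlation of the unused columns with the label. The helper name `rotular_aprovacao` and the list `unused` are hypothetical, introduced only for illustration.

```python
import numpy as np
import pandas as pd

# Same seed and call order as the notebook cell, so the frame should match it.
np.random.seed(42)
data = pd.DataFrame({
    'horas estudo': np.random.normal(3, 1, 200).clip(0, 10),
    'frequencia %': np.random.normal(85, 10, 200).clip(50, 100),
    'nota P1': np.random.normal(6, 1.5, 200).clip(0, 10),
    'nota P2': np.random.normal(6, 1.5, 200).clip(0, 10),
    'tem acesso internet': np.random.choice([1, 0], size=200, p=[0.8, 0.2]),
    'apoio familiar': np.random.choice([1, 0], size=200, p=[0.7, 0.3]),
})


def rotular_aprovacao(df):
    """Hypothetical helper: pass if mean grade >= 6 and attendance >= 75%."""
    media_final = (df['nota P1'] + df['nota P2']) / 2
    return ((media_final >= 6) & (df['frequencia %'] >= 75)).astype(int)


data['aprovado'] = rotular_aprovacao(data)

# Share of failed (0) and passed (1) students.
print(data['aprovado'].value_counts(normalize=True))

# Correlation between the label and the columns that the rule ignores.
unused = ['horas estudo', 'tem acesso internet', 'apoio familiar']
print(data[unused + ['aprovado']].corr()['aprovado'].drop('aprovado'))
```

Because the label is a deterministic function of three features, any model score on this data measures how well the model recovers the rule, not how well it would predict outcomes for real students.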


In [3]:
# Split the data: 70% for training, 30% for testing
X = data.drop(columns=['aprovado'])
y = data['aprovado']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Fit the scaler on the training set only, then apply it to the test set
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
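
One possible alternative to keeping separate scaled and unscaled arrays is to bundle the scaler and the classifier in a scikit-learn pipeline, so the scaler is always fitted on whatever data the model is fitted on. The sketch below is not the notebook's method. It assumes the same simulated data, and `stratify=y` is an addition that keeps the class ratio equal in both splits; the name `log_pipeline` is hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Rebuild the simulated data with the notebook's seed and call order.
np.random.seed(42)
data = pd.DataFrame({
    'horas estudo': np.random.normal(3, 1, 200).clip(0, 10),
    'frequencia %': np.random.normal(85, 10, 200).clip(50, 100),
    'nota P1': np.random.normal(6, 1.5, 200).clip(0, 10),
    'nota P2': np.random.normal(6, 1.5, 200).clip(0, 10),
    'tem acesso internet': np.random.choice([1, 0], size=200, p=[0.8, 0.2]),
    'apoio familiar': np.random.choice([1, 0], size=200, p=[0.7, 0.3]),
})
media_final = (data['nota P1'] + data['nota P2']) / 2
data['aprovado'] = ((media_final >= 6) & (data['frequencia %'] >= 75)).astype(int)

X = data.drop(columns=['aprovado'])
y = data['aprovado']

# stratify=y is an assumption added here; the notebook splits without it.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# The pipeline scales inside fit/predict, so no *_scaled copies are needed.
log_pipeline = make_pipeline(StandardScaler(), LogisticRegression())
log_pipeline.fit(X_train, y_train)

test_accuracy = log_pipeline.score(X_test, y_test)
print(f"Pipeline test accuracy: {test_accuracy:.2f}")
```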

In [4]:
# Train the models
log_model = LogisticRegression()
rf_model = RandomForestClassifier(random_state=42)
# Logistic regression uses the scaled features; tree-based models do not need scaling
log_model.fit(X_train_scaled, y_train)
rf_model.fit(X_train, y_train)
# Predict on the held-out test set
y_pred_log = log_model.predict(X_test_scaled)
y_pred_rf = rf_model.predict(X_test)
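
With only 200 rows, a single 70/30 split leaves 60 test samples, so the scores can shift noticeably with the split. A rough way to gauge that variation is k-fold cross-validation, sketched below under the assumption of 5 stratified, shuffled folds (the notebook itself does not cross-validate). The names `cv`, `models`, and `cv_scores` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Rebuild the simulated data with the notebook's seed and call order.
np.random.seed(42)
data = pd.DataFrame({
    'horas estudo': np.random.normal(3, 1, 200).clip(0, 10),
    'frequencia %': np.random.normal(85, 10, 200).clip(50, 100),
    'nota P1': np.random.normal(6, 1.5, 200).clip(0, 10),
    'nota P2': np.random.normal(6, 1.5, 200).clip(0, 10),
    'tem acesso internet': np.random.choice([1, 0], size=200, p=[0.8, 0.2]),
    'apoio familiar': np.random.choice([1, 0], size=200, p=[0.7, 0.3]),
})
media_final = (data['nota P1'] + data['nota P2']) / 2
data['aprovado'] = ((media_final >= 6) & (data['frequencia %'] >= 75)).astype(int)

X = data.drop(columns=['aprovado'])
y = data['aprovado']

# Five stratified folds; the fold count is an assumption for this sketch.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

models = {
    # Scaling sits inside the pipeline so each fold fits its own scaler.
    'Logistic Regression': make_pipeline(StandardScaler(), LogisticRegression()),
    'Random Forest': RandomForestClassifier(random_state=42),
}

# Default scoring for classifiers is accuracy.
cv_scores = {name: cross_val_score(model, X, y, cv=cv) for name, model in models.items()}

for name, scores in cv_scores.items():
    print(f"{name}: mean accuracy {scores.mean():.2f} (std {scores.std():.2f})")
```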

In [5]:
# Evaluate the models
print("Logistic Regression:")
print(classification_report(y_test, y_pred_log))
print("\nRandom Forest:")
print(classification_report(y_test, y_pred_rf))

Logistic Regression:
              precision    recall  f1-score   support

           0       0.78      0.94      0.85        31
           1       0.91      0.72      0.81        29

    accuracy                           0.83        60
   macro avg       0.85      0.83      0.83        60
weighted avg       0.85      0.83      0.83        60


Random Forest:
              precision    recall  f1-score   support

           0       0.97      0.97      0.97        31
           1       0.97      0.97      0.97        29

    accuracy                           0.97        60
   macro avg       0.97      0.97      0.97        60
weighted avg       0.97      0.97      0.97        60
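
The reports above summarise each class with precision and recall. To see the raw error counts and which features the random forest relies on, a confusion matrix and the fitted model's `feature_importances_` can be inspected, as in the sketch below. It assumes the same simulated data and split as the notebook; the names `cm` and `importances` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# Rebuild the simulated data with the notebook's seed and call order.
np.random.seed(42)
data = pd.DataFrame({
    'horas estudo': np.random.normal(3, 1, 200).clip(0, 10),
    'frequencia %': np.random.normal(85, 10, 200).clip(50, 100),
    'nota P1': np.random.normal(6, 1.5, 200).clip(0, 10),
    'nota P2': np.random.normal(6, 1.5, 200).clip(0, 10),
    'tem acesso internet': np.random.choice([1, 0], size=200, p=[0.8, 0.2]),
    'apoio familiar': np.random.choice([1, 0], size=200, p=[0.7, 0.3]),
})
media_final = (data['nota P1'] + data['nota P2']) / 2
data['aprovado'] = ((media_final >= 6) & (data['frequencia %'] >= 75)).astype(int)

X = data.drop(columns=['aprovado'])
y = data['aprovado']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

rf_model = RandomForestClassifier(random_state=42)
rf_model.fit(X_train, y_train)

# Rows are true classes (0 = failed, 1 = passed); columns are predictions.
cm = confusion_matrix(y_test, rf_model.predict(X_test), labels=[0, 1])
print(cm)

# Impurity-based importances, one per feature, summing to 1.
importances = pd.Series(rf_model.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False))
```

Since the label is computed from the two grades and attendance, those three columns would be expected to dominate the ranking, which makes this a quick sanity check on the simulated data.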

