# Logistic Regression Model

In this notebook, we wanted to see how accurate a logistic regression model we could build using the most significant variables from our exploratory analysis. The goal of this model is to predict whether a student will graduate given their age at enrollment, whether they are a debtor, and whether their tuition fees are up to date.

In [5]:
from pathlib import Path

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from sklearn.preprocessing import StandardScaler

# Project directories
data_dir = Path("data")
outputs_dir = Path("outputs")

# Loading and Cleaning Data

In [6]:
# Load data
student_data = pd.read_csv(data_dir/"student_data.csv", sep = ";" )
# Select columns
model_sd = student_data[['Debtor', 'Age at enrollment', 'Tuition fees up to date', 'Target']]
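# Remove enrolled students; .copy() gives an independent DataFrame so the
# assignment below does not write to a slice of student_data
model_sd = model_sd[model_sd["Target"] != "Enrolled"].copy()

# Map Target column values to integers (Graduate = 1, Dropout = 0)
model_sd["Target"] = model_sd["Target"].map({"Graduate": 1, "Dropout": 0})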


# Building Model

In [None]:
# Separating Features and Target
X = model_sd.drop("Target", axis = 1)
y = model_sd["Target"]

# Training and Test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 67)

# Scale model variables
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Train model
model = LogisticRegression()
model.fit(X_train, y_train)

# Predictions
y_pred = model.predict(X_test)

# Evaluate Model

In [9]:
# Evaluate model
accuracy = accuracy_score(y_test, y_pred)
matrix = confusion_matrix(y_test, y_pred)
report = classification_report(y_test, y_pred)
print(f"Accuracy: {accuracy}")
# Start multi-line outputs on their own line so rows and columns stay aligned
print(f"Confusion Matrix:\n{matrix}")
print(f"Classification Report:\n{report}")

Accuracy: 0.7245179063360881
Confusion Matrix:
[[119 181]
 [ 19 407]]
Classification Report:
              precision    recall  f1-score   support

           0       0.86      0.40      0.54       300
           1       0.69      0.96      0.80       426

    accuracy                           0.72       726
   macro avg       0.78      0.68      0.67       726
weighted avg       0.76      0.72      0.70       726



# Analysis

With an accuracy of 72%, our model proved capable but far from perfect. The confusion matrix gives a more informative picture of where it worked well: of the 426 graduates in the test set, only 19 were misclassified as dropouts. On the other hand, the model misclassified more dropouts as graduates (181) than it correctly labeled (119). In other words, and as the precision and recall confirm, the model was very good at finding true graduates (recall of 0.96) but labeled too many students as graduates as a consequence (precision of 0.69), and it caught only 40% of the actual dropouts.

These results are certainly affected by the randomness of our sampled data and of our training and test split, but they still leave us with an informative conclusion: our selected student factors (age at enrollment, whether they are a debtor, and whether their tuition fees are up to date) are better suited to a model that predicts whether a student will graduate than to one that identifies students at risk of dropping out. This follows from the model's bias toward predicting that a student will graduate and its failure to effectively find the students who would go on to drop out.
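
To make the reasoning above easier to check, the sketch below recomputes the accuracy, precision, and recall by hand from the confusion matrix printed in the evaluation cell. The variable names are our own, and the "always predict graduate" baseline is an addition for context rather than something computed elsewhere in this notebook. It assumes scikit-learn's convention that rows are true classes and columns are predicted classes.

```python
# Confusion matrix from the evaluation cell.
# Rows = true class, columns = predicted class (0 = Dropout, 1 = Graduate).
matrix = [[119, 181],
          [19, 407]]

# Unpack in scikit-learn's order: true negatives, false positives,
# false negatives, true positives (with Graduate as the positive class).
(tn, fp), (fn, tp) = matrix
total = tn + fp + fn + tp

accuracy = (tp + tn) / total

# Graduate (class 1) metrics
graduate_precision = tp / (tp + fp)   # of predicted graduates, how many graduated
graduate_recall = tp / (tp + fn)      # of true graduates, how many were found

# Dropout (class 0) metrics
dropout_precision = tn / (tn + fn)    # of predicted dropouts, how many dropped out
dropout_recall = tn / (tn + fp)       # of true dropouts, how many were found

# Illustrative baseline (our assumption, not part of the original analysis):
# accuracy obtained by labeling every test student a graduate.
baseline_accuracy = (tp + fn) / total

print(f"Accuracy:           {accuracy:.3f}")
print(f"Baseline accuracy:  {baseline_accuracy:.3f}")
print(f"Graduate precision: {graduate_precision:.3f}, recall: {graduate_recall:.3f}")
print(f"Dropout precision:  {dropout_precision:.3f}, recall: {dropout_recall:.3f}")
```

On this test split, the baseline of labeling everyone a graduate would score about 0.59, so the model's 0.72 is a real but modest improvement, and the dropout recall of about 0.40 is where most of the remaining error sits.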