# Logistic regression and dummy variables

In this notebook, we derive additional features from the existing ones. We then build a classifier that predicts whether a student will pass or fail their exam, using a logistic regression on the number of hours they spent studying, their class attendance and the number of hours they slept the night before the exam.

FYI, the dataset is synthetic: I do not have any means of monitoring the time you spend studying for my course.

In [None]:
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

In [None]:
df = pd.read_csv("./synthetic_student_data.csv")
df.head()

In [None]:
df.describe()

We are going to create a feature that equals 1 if the student has a grade of 10 or higher and 0 otherwise. This feature encodes whether the student passes or fails the exam, and it is the target we will predict.

In [None]:
df["succeeds"] = (df.grade >= 10).astype(int)
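
As a quick sanity check of the thresholding above, here is a minimal sketch on a hypothetical toy table (the `toy` DataFrame and its four grades are made up for illustration; they are not taken from the dataset). It also shows one way to look at the share of each class, which is worth knowing before interpreting an accuracy score.

```python
import pandas as pd

# Hypothetical toy grades, standing in for the `grade` column of the dataset
toy = pd.DataFrame({"grade": [4.0, 9.5, 10.0, 15.0]})

# Same thresholding as above: the boolean comparison is converted to 1/0
toy["succeeds"] = (toy["grade"] >= 10).astype(int)

# Proportion of each class; a strong imbalance would make accuracy misleading
class_share = toy["succeeds"].value_counts(normalize=True)

print(toy)
print(class_share)
```

Note that a grade of exactly 10 is counted as a pass, because the comparison uses `>=`.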

In [None]:
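# Columns used as inputs of the model
input_features = ["hours_studied", "sleep_hours", "class_attendance"]
X = df[input_features].to_numpy()
Y = df["succeeds"].to_numpy()
# Keep 20% of the rows for testing; random_state makes the split reproducible
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=0)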

Until now, we have performed each step of our pre-processing and model training individually. This approach has several drawbacks, for example:

- The code can become very redundant and lengthy, which makes debugging harder
- The scikit-learn objects created (scaler, model, etc.) are stored in separate variables, which must each be applied again, in the right order, at prediction time
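
To make these issues concrete, here is a rough sketch of the step-by-step approach. It runs on hypothetical random data (`X_demo`, `y_demo` and the 80/20 slicing are assumptions made for illustration, not the student dataset).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Hypothetical random data: 3 features and a binary target
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 0).astype(int)
X_demo_train, X_demo_test = X_demo[:80], X_demo[80:]
y_demo_train = y_demo[:80]

# Three separate objects, each fitted on the training data only
poly = PolynomialFeatures(degree=2, include_bias=False)
scaler = StandardScaler()
logreg = LogisticRegression()

X_tr = scaler.fit_transform(poly.fit_transform(X_demo_train))
logreg.fit(X_tr, y_demo_train)

# At prediction time, every transformation must be replayed in the same order
X_te = scaler.transform(poly.transform(X_demo_test))
manual_pred = logreg.predict(X_te)
print(manual_pred)
```

Forgetting one of the `transform` calls, or calling `fit_transform` on the test data by mistake, is an easy error to make with this layout.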

Here, we will instantiate a Pipeline, which combines the different steps into a single object.

In [None]:
model = Pipeline([
    ("poly", PolynomialFeatures(degree=2, include_bias=False)),
    ("scaler", StandardScaler()),
    ("logreg", LogisticRegression(penalty="l2", random_state=12345))
])

Our pipeline combines the creation of polynomial features, the scaling of features and the logistic regression on the pre-processed data in a single object, which can then be reused to make predictions directly on raw data.

In [None]:
model.fit(X_train, Y_train)

In [None]:
Y_pred = model.predict(X_test)
Y_pred

In [None]:
print(accuracy_score(Y_test, Y_pred))

In [None]:
cm = confusion_matrix(Y_test, Y_pred)

print(cm)

In [None]:
print(classification_report(Y_test, Y_pred))