# Lab 04 - Extended Exercises on Classification and Pipelines

In [38]:
import numpy as np
import pandas as pd

from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import adjusted_mutual_info_score
from sklearn.linear_model import Lasso
from sklearn.feature_selection import SelectFromModel, mutual_info_classif
from sklearn import model_selection
from sklearn.metrics import roc_auc_score, balanced_accuracy_score, confusion_matrix, ConfusionMatrixDisplay, silhouette_score
from sklearn.model_selection import train_test_split

# Data directory
DATA_DIR = "./../../data/"

You are the Senior Data Scientist at a learning platform called LernTime. You have noticed that many users stop using the platform, and you want to increase user retention. To this end, you decide to build a model that predicts whether a student will stop using the platform.

Your data science team built a data frame in which each row contains the aggregated features per student (calculated over the first 5 weeks of interactions) and the feature `dropout` indicates whether the student stopped using the platform (1) or not (0) before week 10.

The dataframe is stored in the file `lerntime_dropout.csv` and contains the following features:
- `video_time`: total video time (in minutes)
- `num_sessions`: total number of sessions
- `num_quizzes`: total number of quiz attempts
- `reading_time`: total theory reading time
- `previous_knowledge`: standardized previous knowledge
- `browser_speed`: standardized browser speed
- `device`: whether the student logged in using a smartphone (1) or a computer (-1)
- `topics`: the topics covered by the user
- `education`: current level of education (0: middle school, 1: high school, 2: bachelor, 3: master, 4: Ph.D.).
- `dropout`: whether the student stopped using the platform (1) or not (0) before week 10.

The newest data scientist on the team built two models with excellent performance. As a Senior Data Scientist, you are suspicious of the results and decide to review the code.

Your task is to:

a) Identify the mistakes. In the first cell, add a comment above each line in which you identify an error and explain the error.

b) In the second cell, you must correct the code.

In [1]:
import requests

exec(requests.get("https://courdier.pythonanywhere.com/get-send-code").content)

npt_config = {
    'session_name': 'lab-04',
    'session_owner': 'mlbd',
    'sender_name': input("Your name: "),
}

Your name:  Paola


In [4]:
# Use an f-string so that DATA_DIR is interpolated into the path
df = pd.read_csv(f'{DATA_DIR}lerntime_dropout.csv')

y = df['dropout']
X = df[['video_time', 'num_sessions', 'num_quizzes', 'reading_time',
       'previous_knowledge', 'browser_speed']]

### Task A) Identify the mistakes in the code 
In the following cell, add a comment above each line in which you identify an error and explain why it is erroneous.
Please start each of your comments with `#ERROR:`. For example:

`#ERROR: the RMSE of the model is printed instead of the AUC`

`print("The AUC of the model is: {}".format(rmse))          `

You may assume that: 
- all the features are continuous and numerical.
- the features have already been cleaned and processed. 

In [33]:
## 1. Scale the features
scaler = StandardScaler()
X = scaler.fit_transform(X)

## 2. Feature selection (Lasso)
print(X.shape)
lasso = Lasso(alpha=0.1, random_state=0).fit(X, y)
selector = SelectFromModel(lasso, prefit = True)
X = selector.transform(X)
print(X.shape)

## 3. Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state=0)

## Model 1
clf = RandomForestClassifier(n_estimators=10, max_depth=1, random_state=0)
clf.fit(X,y)
preds = clf.predict(X_test)
print("Score model 1: {}".format(np.round(adjusted_mutual_info_score(preds, y_test), 2)))

## Model 2
clf = RandomForestClassifier(n_estimators=1000, max_depth=None, random_state=0)
clf.fit(X,y)
preds = clf.predict(X_test)
print("Score model 2: {}".format(np.round(adjusted_mutual_info_score(preds, y_test), 2)))


## Discussion
# Our second model achieved perfect results with unseen data and outperforms the first model.
## This is because we increased the number of estimators.

(300, 3)
(300, 3)
Score model 1: 0.05
Score model 2: 1.0
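
A perfect score on "unseen" data is a common symptom of evaluating on rows the model has already been fitted on. The following is a minimal sketch of that effect, not part of the original lab: it uses pure-noise features and random labels (`X_noise` and `y_noise` are hypothetical names), so no model can genuinely predict the target. A fully grown random forest fitted on all rows still scores almost perfectly on the "test" rows, while the same forest fitted only on the training rows does not.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import train_test_split

# Assumption: synthetic noise data, unrelated to the LernTime dataset.
rng = np.random.default_rng(0)
X_noise = rng.normal(size=(300, 3))      # features carry no signal
y_noise = rng.integers(0, 2, size=300)   # labels are random coin flips

X_tr, X_te, y_tr, y_te = train_test_split(
    X_noise, y_noise, test_size=0.2, random_state=0
)

# Leaky evaluation: the forest is fitted on ALL rows, including the test rows.
leaky = RandomForestClassifier(n_estimators=200, max_depth=None, random_state=0)
leaky.fit(X_noise, y_noise)
leaky_score = balanced_accuracy_score(y_te, leaky.predict(X_te))

# Honest evaluation: the forest only ever sees the training rows.
honest = RandomForestClassifier(n_estimators=200, max_depth=None, random_state=0)
honest.fit(X_tr, y_tr)
honest_score = balanced_accuracy_score(y_te, honest.predict(X_te))

print("Leaky score:  {}".format(np.round(leaky_score, 2)))
print("Honest score: {}".format(np.round(honest_score, 2)))
```

Because the labels are random, the gap between the two scores can only come from the forest memorizing the test rows during fitting.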


In [2]:
answer = """
COPY THE ERRORS IDENTIFIED HERE
"""

send(answer, 1) 

<Response [200]>

### Task B) Correct the code 
Correct all the erroneous code in the following cell.

In [34]:
## 1. Scale the features
scaler = StandardScaler()
X = scaler.fit_transform(X)

## 2. Feature selection (Lasso)
print(X.shape)
lasso = Lasso(alpha=0.1, random_state=0).fit(X, y)
selector = SelectFromModel(lasso, prefit = True)
X = selector.transform(X)
print(X.shape)

## 3. Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state=0)

## Model 1
clf = RandomForestClassifier(n_estimators=10, max_depth=1, random_state=0)
clf.fit(X,y)
preds = clf.predict(X_test)
print("Score model 1: {}".format(np.round(adjusted_mutual_info_score(preds, y_test), 2)))

## Model 2
clf = RandomForestClassifier(n_estimators=1000, max_depth=None, random_state=0)
clf.fit(X,y)
preds = clf.predict(X_test)
print("Score model 2: {}".format(np.round(adjusted_mutual_info_score(preds, y_test), 2)))


## Discussion
# Our second model achieved perfect results with unseen data and outperforms the first model.
## This is because we increased the number of estimators.

(240, 3)
(240, 3)
Score model 1: 0.9
Score model 2: 0.81
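
For reference, one way to order the steps so that no information from the test rows reaches the preprocessing or the model is sketched below. This is a hedged illustration on synthetic data generated with `make_classification` (an assumption, as are the names `X_demo` and `y_demo`), not the official solution: the data is split first, the scaler is fitted on the training rows only, and the model is scored with classification metrics.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import balanced_accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Assumption: synthetic stand-in with 300 rows and 6 numeric features.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=6, n_informative=3, random_state=0
)

# 1. Split first, so the test rows stay untouched by any fitting step.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo
)

# 2. Fit the scaler on the training rows only, then apply it to both sets.
scaler = StandardScaler().fit(X_tr)
X_tr_s = scaler.transform(X_tr)
X_te_s = scaler.transform(X_te)

# 3. Fit the classifier on the training rows only.
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_tr_s, y_tr)

# 4. Score on the held-out rows with classification metrics.
bal_acc = balanced_accuracy_score(y_te, clf.predict(X_te_s))
auc = roc_auc_score(y_te, clf.predict_proba(X_te_s)[:, 1])
print("Balanced accuracy: {}".format(np.round(bal_acc, 2)))
print("ROC AUC: {}".format(np.round(auc, 2)))
```

Note that `roc_auc_score` is given the predicted probability of the positive class rather than the hard class predictions.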


In [None]:
answer = """
COPY OUTPUT OF MODEL SCORE HERE
"""

send(answer, 2) 

### Task C) Re-write your code using pipelines.
Hint: Go over sklearn-pipeline-introduction.
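
As a starting point, the sketch below shows one possible shape for such a pipeline combined with a grid search. It is an illustration under stated assumptions rather than the expected answer: the data is synthetic (`make_classification`), the step names (`scale`, `select`, `clf`), the parameter grid, and the Lasso `alpha` are all hypothetical choices, and `SelectFromModel` is configured with `threshold=-np.inf, max_features=3` so that it always keeps the three features with the largest absolute Lasso coefficients.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: synthetic stand-in with 300 rows and 6 numeric features.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=6, n_informative=3, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo
)

# Every step is re-fitted on the training folds only during cross-validation.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("select", SelectFromModel(
        Lasso(alpha=0.01, random_state=0),
        threshold=-np.inf,   # disable the threshold, select by rank only
        max_features=3,      # keep the 3 largest |coefficients|
    )),
    ("clf", RandomForestClassifier(random_state=0)),
])

# Hypothetical grid; parameters are addressed as <step name>__<parameter>.
param_grid = {
    "clf__n_estimators": [10, 100],
    "clf__max_depth": [1, None],
}

search = GridSearchCV(pipe, param_grid, scoring="roc_auc", cv=5)
search.fit(X_tr, y_tr)

test_auc = search.score(X_te, y_te)  # ROC AUC of the refitted best pipeline
print("Best parameters: {}".format(search.best_params_))
print("Best CV score: {}".format(np.round(search.best_score_, 2)))
print("Test score: {}".format(np.round(test_auc, 2)))
```

Wrapping the scaler and the selector inside the pipeline means the grid search never fits them on validation or test rows, which is the main reason to prefer a pipeline over manual preprocessing here.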

In [None]:
answer = """
WRITE THE BEST PARAMETERS AND SCORE HERE
"""

send(answer, 3) 

In [None]:
answer = """
DO THE RESULTS DIFFER, WHY? 
"""

send(answer, 4) 