# Practical Machine Learning with scikit-learn

## Assignment

In this part of the day we will build a machine learning model together, using Python and the scikit-learn library.

Our team prepared a dataset of 100 instances, one row per student. It was aggregated from the logs of an e-learning system that offers learning material and exams.

Our goal today is to build a model that predicts, given a student's actions, whether that student will pass or fail the final exam (column `FinalResult`).


## scikit-learn
scikit-learn is one of the most prominent Python libraries for machine learning:

* Contains many state-of-the-art machine learning algorithms
* Wide range of evaluation measures and techniques
* Offers comprehensive documentation about each algorithm
* Widely used, and a wealth of tutorials and code snippets are available
* Works well with numpy, scipy, pandas, matplotlib,...

In [None]:
# Global imports and settings

import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

## 1. Load data

If you already have the data loaded in a pandas DataFrame, you can skip this part. If not, we first have to load the data.

* Load the "flatfile 1.csv" file into a pandas DataFrame.

In [None]:
data = #Your code

### Explore the data

In part 1 today you already worked with the data and performed an Exploratory Data Analysis. This is often a crucial part of the process, since you need to understand what you are working with before you change it. You can do some more exploring here to see which variables the data contains and what they look like.

In [None]:
# Some examples

# data.shape

# data.describe()

# data["UnqVideosWatched"][0:50]

# Your code here

## 2. Data processing

The data needs to be processed before we train a model on it, because some models only accept data in a certain form. Let's make our data model-ready.

### Missing values

The first thing to do when processing your data is to check for missing values. The easiest way to do this is with pandas.

* Calling `.isnull()` on a dataframe returns a dataframe of the same shape in which every cell is a Boolean: `True` if the value is missing, `False` otherwise.
* If you sum those Booleans per column, you get the number of missing values in each column.
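As a minimal sketch of this counting step, the example below uses a small made-up dataframe rather than the workshop data; the values and the choice of columns are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not the workshop dataset.
toy = pd.DataFrame({
    "Quiz1": [70, np.nan, 55],
    "Quiz2": [np.nan, np.nan, 80],
})

# isnull() gives a Boolean frame; sum() counts the True values per column.
missing_per_column = toy.isnull().sum()
print(missing_per_column)
```

The result is a Series indexed by column name, so a column with a count of 0 has no missing values.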

In [None]:
# Check how many missing values there are

data.#Your code

We need to decide what to do with the missing data. 

In this scenario a missing value means that a test was not taken at all, so it does not make sense to fill in the average score. That would most likely distort the distribution of the test scores. The fact that someone did not take a test is also valuable information in itself. By imputing an average, you lose that information and introduce a bias.

* Let's put a 0 for every test that was not taken, so we keep the information that it was skipped. This can be done with the `.fillna()` function.
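The sketch below shows how `.fillna()` behaves on a small made-up dataframe (the values are assumptions, not the workshop data). Note that `.fillna()` returns a new dataframe, so the result has to be assigned to a variable.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with two untaken quizzes for the second student.
toy = pd.DataFrame({
    "Quiz1": [70, np.nan, 55],
    "Quiz2": [np.nan, np.nan, 80],
})

# Replace every missing value with 0 and keep the result.
filled = toy.fillna(0)

# Total number of missing values left in the whole frame.
remaining_missing = int(filled.isnull().sum().sum())
print(filled)
print("Missing values left:", remaining_missing)
```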

In [None]:
# Transform all empty values to 0's 

data.#Your code here

# Check if there are missing values left
data.isnull().sum()

### Binary encoding

As you may have seen, the FinalResult column holds a binary value, either "Pass" or "Fail". Machine learning models expect numbers, so we encode it as a 1 or 0 column instead.

* With pandas, turn every "Pass" in FinalResult into a 1.
* Do the same for "Fail", but turn those into 0.
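One possible way to do such an encoding is sketched below on a made-up Series. It uses `Series.map` with a dictionary, which is a different technique from the `.replace()` calls in the exercise cell; treat it as an alternative, not as the intended solution.

```python
import pandas as pd

# Hypothetical toy column, not the workshop data.
results = pd.Series(["Pass", "Fail", "Pass"], name="FinalResult")

# Map each label to its numeric code in one step.
encoded = results.map({"Pass": 1, "Fail": 0})
print(encoded.tolist())
```

Any label that is missing from the dictionary would become NaN, which makes unexpected values easy to spot.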

In [None]:
data['FinalResult']

In [None]:
data['FinalResult'] = data["FinalResult"].replace(#Your code here
data['FinalResult'] = data["FinalResult"].replace(#Your code here
data['FinalResult']

### Additional features

We can come up with all kinds of different features. Let's add at least two more.

* The average score of all quizzes combined (AvgQuizGrade). Calling `mean()` with `axis=1` on a selection of columns gives the mean of those columns for each row. Another option is to sum all the quiz values and divide by the number of quizzes.

*Question:* Why would we use one over the other?


* An overall "activity" score that combines UnqArticlesRead and UnqVideosWatched
* Optional: Getting features from the date column: Weekday/Weekend/DaysFromNow?
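The sketch below illustrates row-wise feature construction on a made-up dataframe with only three quiz columns. Adding the two activity counts is just one assumed way of "combining" them, and the name `ActivityScore` is hypothetical.

```python
import pandas as pd

# Hypothetical toy data with three quizzes instead of ten.
toy = pd.DataFrame({
    "Quiz1": [60, 80],
    "Quiz2": [70, 90],
    "Quiz3": [80, 100],
    "UnqArticlesRead": [3, 10],
    "UnqVideosWatched": [5, 2],
})
quiz_columns = ["Quiz1", "Quiz2", "Quiz3"]

# axis=1 computes the mean across columns, once per row.
toy["AvgQuizGrade"] = toy[quiz_columns].mean(axis=1)

# Assumed combination: a plain sum of the two activity counts.
toy["ActivityScore"] = toy["UnqArticlesRead"] + toy["UnqVideosWatched"]
print(toy[["AvgQuizGrade", "ActivityScore"]])
```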

In [None]:
# AvgQuizGrade
quiz_column_list = ['Quiz1', 'Quiz2', 'Quiz3', 'Quiz4', 'Quiz5', 'Quiz6', 'Quiz7', 'Quiz8', 'Quiz9', 'Quiz10']

data["AvgQuizGrade"] = data[quiz_column_list].#Your code here
data["AvgQuizGrade"]

In [None]:
# OveralActivityScore

data["OveralActivityScore"] = #Your code here
data["OveralActivityScore"]

### Feature engineering with dates

The FirstLoginDate is currently read as a string. We first need to convert it to a datetime object and can then apply the transformations we want.

In [None]:
# Date columns

from datetime import datetime

data["FirstLoginDate"] = pd.to_datetime(data["FirstLoginDate"])
data["FirstLoginDate"]

Using `.dt.weekday` (the `dt` stands for datetime) on your date column returns the day of the week as a number from 0 (Monday) to 6 (Sunday). Use this to check whether the day is a weekday or a weekend day, and store the results in these two new columns.
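A small sketch of the weekday check on four made-up dates (a Friday through a Monday in January 2024). In pandas, `.dt.weekday` numbers the days from Monday=0 to Sunday=6, so values of 5 and 6 mark the weekend. The variable names are assumptions.

```python
import pandas as pd

# Hypothetical dates: Friday, Saturday, Sunday, Monday.
dates = pd.Series(pd.to_datetime(
    ["2024-01-05", "2024-01-06", "2024-01-07", "2024-01-08"]
))

weekday_number = dates.dt.weekday   # Monday=0 ... Sunday=6
is_weekend = weekday_number >= 5    # Saturday or Sunday
is_weekday = ~is_weekend            # logical negation of the weekend flag

print(weekday_number.tolist())
print(is_weekend.tolist())
```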

In [None]:
data['WeekdayLoginDate'] = #Your code here
data['WeekendLoginDate'] = #Your code here

### Days till now

Let's assume the exam is today. Students who logged in for the first time only recently might not have had enough time, or enough motivation, to study. Under this hypothesis, the number of days since the first login tells us something about a student's motivation.

* Subtract the date of the first login from the current date. Then use `.dt.days` to get the number of days.
* Let's also drop the FirstLoginDate column, since it's not needed anymore.
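The date arithmetic can be sketched as follows on two made-up login dates. A fixed reference date stands in for `datetime.now()` so that the numbers are reproducible; both the dates and the variable names are assumptions.

```python
import pandas as pd

# Hypothetical first-login dates.
first_login = pd.Series(pd.to_datetime(["2024-02-20", "2024-01-01"]))

# Fixed stand-in for the current date, chosen for reproducibility.
reference_date = pd.Timestamp("2024-03-01")

# Subtracting gives timedeltas; .dt.days extracts whole days as integers.
days_since_login = (reference_date - first_login).dt.days
print(days_since_login.tolist())
```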

In [None]:
current_date = datetime.now()

data['days_till_now'] = #Your code here

data = data.drop('FirstLoginDate', axis=1)

data['days_till_now']

## Splitting your data into X and Y

In supervised machine learning, one part of your data serves as the features and another part as the value to predict (the target). We call these X and y respectively.

* Assign the features to the X variable: all columns except the target. Here we leave out two columns: 'FinalResult' (the target) and 'CourseGrade'.
* Assign the output/target column to the y variable.
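A minimal sketch of such a split on a made-up dataframe with only two feature columns. It uses `DataFrame.drop(columns=...)`, which is one of several ways to select everything except a few columns.

```python
import pandas as pd

# Hypothetical toy data, not the workshop dataset.
toy = pd.DataFrame({
    "UnqVideosWatched": [5, 2, 9],
    "AvgQuizGrade": [70.0, 40.0, 85.0],
    "CourseGrade": [72, 38, 90],
    "FinalResult": [1, 0, 1],
})

# Features: every column except the two that are left out.
X = toy.drop(columns=["FinalResult", "CourseGrade"])

# Target: the single column we want to predict.
y = toy["FinalResult"]
print(list(X.columns), y.tolist())
```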

In [None]:
# Separate features and target variable

X = #Your code

y = #Your code

## 3. Splitting Data into Training and Test Sets

Let's suppose these 100 instances are all the data we have.

We cannot use all 100 instances to train the model, since then we would have no "unseen" instances left to test it on. Thus, we need to split the data: one part becomes the training set and the other part the test set.

`train_test_split` from sklearn splits the data randomly into a training part and a test part. The `test_size` parameter sets the fraction of the data that goes into the test set.

* What % should we set the test_size at?
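To see what `test_size` does to the shapes, here is a sketch on 20 made-up rows. The value 0.25 is an arbitrary choice for the illustration, not a recommendation for the exercise.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 20 rows, 2 features, alternating labels.
X = np.arange(40).reshape(20, 2)
y = np.array([0, 1] * 10)

# 25% of the rows go to the test set; random_state makes the split repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)
print(X_train.shape, X_test.shape)
```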

In [None]:
from sklearn.model_selection import train_test_split

test_size = #Your % here in 0.X format.  

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size, random_state=0)

## 4. Model selection & training

All scikit-learn estimators follow the same interface:

```python
class SupervisedEstimator(...):
    def __init__(self, hyperparam, ...):

    def fit(self, X, y):   # Fit/model the training data
        ...                # given data X and targets y
        return self
     
    def predict(self, X):  # Make predictions
        ...                # on unseen data X  
        return y_pred
    
    def score(self, X, y): # Predict and compare to true
        ...                # labels y                
        return score
```
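Because the interface is shared, the same three calls work for different estimators. The sketch below assumes synthetic data from `make_classification` and scores on the training data purely to show the calls; it is not a proper evaluation.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data, not the workshop dataset.
X, y = make_classification(n_samples=100, n_features=5, random_state=0)

scores = {}
for estimator in (LogisticRegression(max_iter=1000),
                  DecisionTreeClassifier(random_state=0)):
    estimator.fit(X, y)                  # learn from data and targets
    y_pred = estimator.predict(X)        # one prediction per row
    scores[type(estimator).__name__] = estimator.score(X, y)

print(scores)
```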

### Model selection

Selecting a model is an art in itself. It's a balance between complexity, interpretability, available resources, and trial and error.

Some examples for classification:

* Logistic Regression: Good for binary classification problems, especially as a baseline model.

* Decision Trees: Useful for interpretability and when handling categorical features.

* Random Forest and Ensemble Methods: Offer improvements over single decision trees, reducing the risk of overfitting.

* Support Vector Machines (SVM): Effective in high-dimensional spaces and with various kernel functions.

* Neural Networks: Suitable for complex problems, particularly with large amounts of data and high computational capacity.

### Decision tree

The first model we'll build is one of the most interpretable models there is. We are going to train a decision tree classifier.

When building ML systems, we do not tell the system how it should classify the instances. We give it data, provide the expected output, and ask the algorithm for its best guess at a formula that gets from one to the other.

If you did this by hand, you would look at the data, write if/else statements, and keep splitting the data until you get something that gives a decent enough answer. This is roughly what the decision tree classifier does as well. The output is an if/else structure that is easy to understand.

In [None]:
from sklearn.tree import DecisionTreeClassifier

# Initialize the model
model = DecisionTreeClassifier(random_state=0)

Now we have an untrained model. We need to train it by calling `model.fit` with our training data `X_train` and the matching targets `y_train`.

In [None]:
# Train the model

model.fit(# Your code here)

Now we have a trained model!

Because we only have 100 instances and a simple classifier, this happened quickly. With complex models that have many nodes (neural networks) and a lot of data, training can take months, even on specialized hardware (TPUs).

We can now create an arbitrary instance and predict whether that student will pass (1) or fail (0) the exam.

Play with the values in this instance to find out when the model predicts a fail (0).

In [None]:
random_instance = pd.DataFrame(columns=X.columns)
random_instance.loc[0] = [8246314, 22, 8, 23, 43, 17, 9, 4, 56, 25, 71, 70, 100, 84, 67, 54, 44, 61, 0, 72, 62.3, 60, True, True, 267]
random_instance

In [None]:
prediction = model.predict(random_instance)
prediction[0]

*Hint:* a suspect might be "QuizReviewed".

# Break: 10 minutes coffee/toilet

## 5. Model Evaluation

### Predict all test examples

Instead of giving it just one example, pass all the X_test examples to the `.predict` function.

In [None]:
y_pred = model.predict(#Your code here)
    
print("Test set predictions:\n {}".format(y_pred))

The `score` function computes the fraction of correct predictions. This is also called the accuracy.
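To make the definition concrete, the sketch below computes accuracy by hand on five made-up labels and compares it with `accuracy_score` from sklearn. The label values are assumptions for illustration.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical true labels and predictions; one prediction is wrong.
y_true = np.array([1, 0, 1, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0])

# Fraction of positions where prediction and truth agree.
manual_accuracy = (y_true == y_pred).mean()
library_accuracy = accuracy_score(y_true, y_pred)

print("Manual: {:.2f}, sklearn: {:.2f}".format(manual_accuracy, library_accuracy))
```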

### Accuracy of the model

In [None]:
print("Score: {:.2f}".format(model.score(X_test, y_test) ))

### Other evaluation metrics

Accuracy alone is not enough to judge a model's performance. Other metrics highlight different aspects of it, and the ones used here are all derived from the confusion matrix.

Trivia: "The name [confusion matrix] stems from the fact that it makes it easy to see whether the system is confusing two classes"

Let's calculate the confusion matrix and the evaluation metrics.

* The sklearn function to use is `confusion_matrix`, which takes your test set targets (y_test) and your predictions (y_pred) as inputs.
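The layout of the matrix is easiest to see on a tiny example. The sketch below uses ten made-up labels; in sklearn the rows are the true classes and the columns the predicted classes, so for binary 0/1 labels `ravel()` yields TN, FP, FN, TP in that order.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels: 4 actual positives, 6 actual negatives.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]

cm = confusion_matrix(y_true, y_pred)

# Rows = true class, columns = predicted class (labels sorted: 0, 1).
tn, fp, fn, tp = cm.ravel()
print(cm)
print("TN={}, FP={}, FN={}, TP={}".format(tn, fp, fn, tp))
```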

In [None]:
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

conf_matrix = #Your code here

Let's plot the confusion matrix so you can see how your model performed at classifying the test instances.

In [None]:
disp = ConfusionMatrixDisplay(confusion_matrix=conf_matrix, display_labels=model.classes_)
disp.plot()
plt.show()

*Question*: Can you interpret this plot? What does it say?

## Precision & Recall

Precision measures the proportion of correctly predicted positive observations to the total predicted positives. It answers the question: "Of all instances classified as positive, how many are actually positive?"

Formula: Precision = True Positives / (True Positives + False Positives)

Recall (or sensitivity) measures the proportion of correctly predicted positive observations out of all observations that are actually positive. It answers the question: "Of all actual positives, how many were correctly identified?"

Formula: Recall = True Positives / (True Positives + False Negatives)

Calculate the precision and recall yourself. You can use sklearn imports, or calculate them from the confusion matrix.
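Both routes can be sketched on made-up labels as below. The label values are assumptions; `precision_score` and `recall_score` are the sklearn functions for these two metrics.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical labels: TP=3, FN=1, FP=2, TN=4.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

# Manual route, straight from the formulas.
manual_precision = tp / (tp + fp)
manual_recall = tp / (tp + fn)

# Library route.
library_precision = precision_score(y_true, y_pred)
library_recall = recall_score(y_true, y_pred)

print("Precision: {:.2f}, Recall: {:.2f}".format(manual_precision, manual_recall))
```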

In [None]:
# Calculate Precision and Recall

# Your code here

## F1 score

The F1 score is the harmonic mean of precision and recall. On imbalanced datasets it is often a more informative measure than accuracy.

\begin{equation}
    F1 = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}
\end{equation}

Calculate the F1 score using sklearn imports or manually using precision and recall
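As a sketch of the formula in action, the example below computes F1 from precision and recall on made-up labels and checks it against `f1_score` from sklearn. The labels are assumptions chosen so that precision (0.6) and recall (0.75) differ.

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Hypothetical labels: TP=3, FN=1, FP=2, TN=4.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]

precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)

# Harmonic mean of precision and recall.
manual_f1 = 2 * precision * recall / (precision + recall)
library_f1 = f1_score(y_true, y_pred)

print("F1: {:.3f}".format(manual_f1))
```

The harmonic mean sits closer to the lower of the two values than the ordinary average would.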

In [None]:
# Calculate the F1 score

# Your code here

## Visualizing the trained decision tree

You've trained a decision tree classifier which can predict instances. But how does it do this? 

The most important quality of a decision tree is that it's interpretable. We can plot the tree that the model uses to classify instances.

* The top node is called the "root". It splits on the feature that best separates the classes in the full training set, so it is often considered the most important feature.
* The decisions are called "branches".
* The end nodes are called "leaves".
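Besides a plot, the same structure can be printed as text with `export_text` from `sklearn.tree`. The sketch below trains a shallow tree on synthetic data; the feature names `f0` to `f3` and the depth limit are assumptions for the illustration.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic stand-in data with hypothetical feature names.
X, y = make_classification(n_samples=100, n_features=4, random_state=0)
feature_names = ["f0", "f1", "f2", "f3"]

# A shallow tree keeps the printed rules short.
small_tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

# Each indented line is a branch; lines with "class:" are leaves.
rules = export_text(small_tree, feature_names=feature_names)
print(rules)
```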

In [None]:
from sklearn import tree

plt.figure(figsize=(15,15))

tree.plot_tree(model, feature_names=X.columns, fontsize=15)

plt.show()

Again using the confusion matrix we can derive other metrics like the True Positive Rate or the False Positive Rate.

The False Positive Rate is the number of false positives divided by the total number of actual negatives, while the True Positive Rate is the number of true positives divided by the total number of actual positives.

The True Positive Rate (TPR), also known as sensitivity or recall, is calculated as:
\begin{equation}
    TPR = \frac{TP}{TP + FN}
\end{equation}
where \( TP \) represents the number of true positives and \( FN \) represents the number of false negatives.


The False Positive Rate (FPR) is calculated as:
\begin{equation}
    FPR = \frac{FP}{FP + TN}
\end{equation}
where \( FP \) represents the number of false positives and \( TN \) represents the number of true negatives.
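The two rates can be read off a confusion matrix directly. The sketch below uses made-up labels; the values are assumptions for illustration only.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels: 4 actual positives, 6 actual negatives.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

tpr = tp / (tp + fn)   # share of actual positives that were found
fpr = fp / (fp + tn)   # share of actual negatives that were flagged

print("TPR: {:.2f}, FPR: {:.2f}".format(tpr, fpr))
```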


## AUC-ROC curve

The AUC-ROC curve is a performance measurement for classification problems at various threshold settings.

ROC (Receiver Operating Characteristic) is a probability curve, and AUC (Area Under the Curve) represents the degree or measure of separability between the classes.

In [None]:
from sklearn.metrics import roc_curve, roc_auc_score

In [None]:
y_pred_proba = model.predict_proba(X_test)[:, 1]

In [None]:
fpr, tpr, thresholds = roc_curve(y_test, y_pred_proba)
auc_score = roc_auc_score(y_test, y_pred_proba)

In [None]:
plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, color='blue', lw=2, label='ROC curve (area = %0.2f)' % auc_score)
plt.plot([0, 1], [0, 1], color='darkgray', lw=2, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic (ROC) Curve')
plt.legend(loc="lower right")
plt.show()

AUC is a useful metric to compare your model to other models and works well for imbalanced datasets. It measures how well the model discriminates between positive and negative classes. 
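One way to read the AUC value: it equals the fraction of (positive, negative) pairs in which the positive instance receives the higher score, with ties counted as half. The sketch below checks this on four made-up scores; the numbers and variable names are assumptions.

```python
from itertools import product
from sklearn.metrics import roc_auc_score

# Hypothetical labels and predicted probabilities for the positive class.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

positives = [s for s, label in zip(scores, y_true) if label == 1]
negatives = [s for s, label in zip(scores, y_true) if label == 0]

# Count pairs where the positive outranks the negative; ties count half.
wins = sum(
    1.0 if p > n else 0.5 if p == n else 0.0
    for p, n in product(positives, negatives)
)
pairwise_auc = wins / (len(positives) * len(negatives))
library_auc = roc_auc_score(y_true, scores)

print("Pairwise: {:.2f}, sklearn: {:.2f}".format(pairwise_auc, library_auc))
```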

## 6. Hyperparameter tuning

The classifier above, the DecisionTreeClassifier, was trained with its default parameters. Check the documentation of this classifier on the scikit-learn website:
* https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html

You will see that this classifier has more than 10 parameters that you can set. Examples are "max_depth", "criterion" and "min_samples_leaf".
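A grid search tries every combination of the values you list. The sketch below uses `ParameterGrid` from sklearn on a deliberately tiny, made-up grid to show how the combinations are enumerated; the grid values are assumptions.

```python
from sklearn.model_selection import ParameterGrid

# Hypothetical small grid: 2 values for each of 2 parameters.
small_grid = {
    "max_depth": [3, 5],
    "criterion": ["gini", "entropy"],
}

# Every combination of one value per parameter.
candidates = list(ParameterGrid(small_grid))
for params in candidates:
    print(params)
print("Number of candidates:", len(candidates))
```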


In [None]:
from sklearn.model_selection import GridSearchCV

# Define the parameter grid
param_grid = {
    'max_depth': [None, 3, 4, 5, 6, 7],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'criterion': ['gini', 'entropy', 'log_loss']
}

In [None]:
grid_search = GridSearchCV(estimator=model, param_grid=param_grid, cv=15, n_jobs=-1, verbose=2, scoring='accuracy')

In [None]:
grid_search.fit(X_train, y_train)

*Question:* How did it come up with 162 candidates?

In [None]:
best_params = grid_search.best_params_
print("Best Parameters:", best_params)

## 7. Model refinement

In [None]:
# Initialize the model with the best parameters
refined_model = DecisionTreeClassifier(**best_params, random_state=0)

# Train the model
refined_model.fit(X_train, y_train)

# Evaluate the refined model
accuracy_refined_model = refined_model.score(X_test, y_test)

print("Accuracy score of refined model: {:.2f}".format(accuracy_refined_model))

*Question:* Do you know what cv stands for in GridSearchCV?

*Question:* Can you think why the refined model might give a lower accuracy than our basic model?

### Goal: generalizability vs accuracy. Overfitting vs underfitting 
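A rough sketch of the trade-off, assuming noisy synthetic data from `make_classification` (20% of labels flipped) instead of the workshop data. It compares an unrestricted tree with a shallow one on both the training and the test set; an unrestricted tree can memorize the training data, and a large gap between its training and test accuracy hints at overfitting.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, deliberately noisy stand-in data.
X, y = make_classification(n_samples=300, n_features=10, flip_y=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

results = {}
for depth in (None, 3):   # None = grow until all leaves are pure
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    tree.fit(X_train, y_train)
    results[depth] = {
        "train": tree.score(X_train, y_train),
        "test": tree.score(X_test, y_test),
    }
    print("max_depth={}: train={:.2f}, test={:.2f}".format(
        depth, results[depth]["train"], results[depth]["test"]))
```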

## Optional: Repeat above steps, but for a different classifier, and evaluate which is better

Explain why 'CourseGrade' was taken out as a feature, and how it can be used for linear regression.

Try one of the following:
* Linear regression
* Random forest
* Support Vector Machines

Look up in the sklearn documentation how to use them.