# Machine learning basics

**Prediction** 

- Categorical
- Point estimate


**Categorical**

Categorical prediction, also known as classification, is a supervised machine learning task that aims to predict a categorical label (or class) for a given input. The labels are discrete and unordered, for example the fruit types "Apple," "Orange," and "Banana." After training, the model can be used to predict the labels of new data.

![alt text](./images/binary_decision_problem.jpg "Title")

![alt text](./images/4_fields.jpg "Title")

Several algorithms can be used for categorical prediction, including:

-    Logistic regression: Logistic regression models the probability of an input belonging to a particular class.

-    Decision trees: A tree-based model that can be used for binary and multi-class classification. Decision trees use a set of if-then-else rules to make predictions.

-    Random forests: An ensemble method that combines multiple decision trees to improve the accuracy of the predictions.

-    Support Vector Machines (SVMs): A linear model that can be used for binary and multi-class classification. SVMs find the best boundary (or "hyperplane") that separates the classes. Note that kernel SVMs extend the method to non-linear boundaries.

-    Neural networks: A type of model that is inspired by the structure of the human brain and can be used for binary and multi-class classification. Neural networks are useful for problems with many input features or a complex decision boundary.
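
As a rough illustration of the list above, the following sketch fits one model of each type on simulated data. The toy data, the variable names (`X_toy`, `y_toy`, `toy_accuracy`), and all hyperparameters are assumptions made for this example and are not part of the course dataset. The point is only that all scikit-learn classifiers share the same `fit` / `score` interface.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
# Toy data (assumption): one feature, two groups centred at -1 and +1
X_toy = np.concatenate([rng.normal(-1, 1, 100), rng.normal(1, 1, 100)]).reshape(-1, 1)
y_toy = np.array(["NoProblem"] * 100 + ["Problem"] * 100)

models = {
    "Logistic regression": LogisticRegression(),
    "Decision tree": DecisionTreeClassifier(max_depth=3, random_state=0),
    "Random forest": RandomForestClassifier(n_estimators=50, random_state=0),
    "SVM (linear kernel)": SVC(kernel="linear"),
    "Neural network": MLPClassifier(hidden_layer_sizes=(8,), max_iter=2000, random_state=0),
}

# Every classifier is fitted and scored with the same two calls
toy_accuracy = {}
for name, model in models.items():
    model.fit(X_toy, y_toy)
    toy_accuracy[name] = model.score(X_toy, y_toy)
    print(f"{name}: {toy_accuracy[name]:.2f}")
```

Note that these accuracies are computed on the same data the models were fitted on, so they say nothing yet about performance on new data.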


In [1]:
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split, KFold, cross_val_score
from sklearn.metrics import accuracy_score

In [2]:
# Works on MyBinder or locally #
df = pd.read_csv("./datasets/Group_A_B.csv")


In [3]:
# Works only on COLAB #


def read_csv_from_github(url):
    import requests
    from io import StringIO
    response = requests.get(url)
    data = response.text
    return pd.read_csv(StringIO(data))


url = 'https://github.com/bgagl/ML_Individual_Differences/raw/5b70d36362172bb50d5be984e8c97526dda26bd2/datasets/Group_A_B.csv'
df = read_csv_from_github(url=url)
df.head()



In [None]:
df

In [None]:
plt.hist([df["NoProblem"], df["Problem"]], bins=20, histtype='bar', color=['b','r'], label=['No Problem', 'Problem'])
plt.legend()
plt.xlabel('Measure')
plt.ylabel('Frequency')
plt.show()

**Exercise** Find the optimal threshold value that divides No Problem and Problem

**Metrics**

- Accuracy
![alt text](./images/accuracy.jpg "Title")

- Specificity
![alt text](./images/specificity.jpg "Title")

- Precision
![alt text](./images/precision.jpg "Title")

- True positive rate
![alt text](./images/true-positive-rate.jpg "Title")

- False positive rate
![alt text](./images/false-positive-rate.jpg "Title")
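
The metrics listed above can be sketched as one small helper that works on the four counts of a confusion matrix. The function name `classification_metrics` and the example counts are made up for illustration; the formulas are the standard textbook definitions, which the figures are assumed to show as well.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute common classification metrics from confusion-matrix counts."""
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "specificity": tn / (tn + fp),          # true negative rate
        "precision": tp / (tp + fp),
        "true_positive_rate": tp / (tp + fn),   # also called sensitivity or recall
        "false_positive_rate": fp / (fp + tn),  # equals 1 - specificity
    }

# Hypothetical counts, chosen only to make the arithmetic easy to follow
example = classification_metrics(tp=40, tn=45, fp=5, fn=10)
for metric, value in example.items():
    print(f"{metric}: {value:.3f}")
```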


Let's apply a decision boundary (i.e., cutoff value) and calculate the accuracy. Here we work through an example with a boundary at `Measure == 0`, which is a so-called rule. Note that rule-based systems are the simplest prediction models; their advantage is that they are highly transparent and easy to explain.

1. We must convert our data frame from wide to long format. This is a central preprocessing step, as all machine learning models we use here work only with long-format data.

In [None]:
df_long = pd.melt(df, id_vars=None, value_vars=['NoProblem', 'Problem'], var_name='Group', value_name='Measure')
df_long
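
To make the reshaping concrete, here is a minimal sketch on a tiny made-up frame (the values and the names `wide` and `long` are assumptions for illustration only). Each column of the wide frame becomes a block of rows in the long frame, with the former column name stored in `Group`.

```python
import pandas as pd

# Wide format: one column per group, one row per pair of observations
wide = pd.DataFrame({"NoProblem": [-1.2, 0.3], "Problem": [0.9, 2.1]})

# Long format: one row per observation, with the group as its own column
long = pd.melt(wide, value_vars=["NoProblem", "Problem"],
               var_name="Group", value_name="Measure")
print(long)
```

Note that the long frame is sorted by group: all `NoProblem` rows come first, followed by all `Problem` rows.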

2. Then we apply the rule to our data and predict, for each person, the group they belong to given their measure.

In [None]:
# Default to "NoProblem", then label every case above the boundary as "Problem"
df_long["Predicted Group"] = "NoProblem"
df_long.loc[df_long["Measure"] > 0, "Predicted Group"] = "Problem"
df_long

3. We compare the predicted group with the actual group and compute the accuracy metric.

In [None]:
# Predicted "Problem" and actually "Problem"
true_positive = len(df_long[(df_long["Predicted Group"] == "Problem") & (df_long["Group"] == "Problem")])
true_positive

In [None]:
# Predicted "NoProblem" and actually "NoProblem"
true_negative = len(df_long[(df_long["Predicted Group"] == "NoProblem") & (df_long["Group"] == "NoProblem")])
true_negative

In [None]:
# Predicted "Problem" but actually "NoProblem"
false_positive = len(df_long[(df_long["Predicted Group"] == "Problem") & (df_long["Group"] == "NoProblem")])
false_positive

In [None]:
# Predicted "NoProblem" but actually "Problem"
false_negative = len(df_long[(df_long["Predicted Group"] == "NoProblem") & (df_long["Group"] == "Problem")])
false_negative

In [None]:
# Confusion matrix: rows are the predictions, columns are the actual groups
pd.DataFrame(
    {
        "Predicted": ["Problem", "No Problem"],
        "Group: Problem": [true_positive, false_negative],
        "Group: No Problem": [false_positive, true_negative],
    }
)

**Accuracy**
![alt text](./images/accuracy.jpg "Title")

In [None]:
(true_positive+true_negative) / (true_positive+true_negative+false_negative+false_positive)

**False positive rate**
![alt text](./images/false-positive-rate.jpg "Title")


In [None]:
false_positive / (true_negative+false_positive)
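
As a cross-check for hand-computed counts, scikit-learn's `confusion_matrix` returns the same four numbers. The labels in this sketch are a small made-up example (`y_true` and `y_hat` are hypothetical names); with `labels=["NoProblem", "Problem"]` the flattened matrix has the order true negatives, false positives, false negatives, true positives.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical actual and predicted labels for six cases
y_true = ["Problem", "Problem", "Problem", "NoProblem", "NoProblem", "NoProblem"]
y_hat = ["Problem", "Problem", "NoProblem", "NoProblem", "NoProblem", "Problem"]

# Rows are the actual groups, columns the predicted groups, in the order given by `labels`
tn, fp, fn, tp = confusion_matrix(y_true, y_hat, labels=["NoProblem", "Problem"]).ravel()

fpr = fp / (fp + tn)  # share of actual "NoProblem" cases predicted as "Problem"
print(tn, fp, fn, tp, fpr)
```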

**Exercise** Apply your best-guess decision boundary to the dataset and estimate the accuracy and the specificity.

**First Machine Learning Example: Logistic Regression**

Let's use a logistic regression model to learn the optimal decision boundary for our problem. 

![alt text](./images/log_reg.jpg "Title")

Remember the overall framework. As a first step, we fit the model and predict the existing labels. If this does not yield an accuracy above .5 (chance level when both groups are equally large), the current measure is not predictive of our labels. If the accuracy lies between .5 and 1, we know there is a relation between our measure and our labels.


![alt text](./images/ml_basics.jpg "Title")


First, we need to do a data transformation.

In [None]:
X = df_long["Measure"].values.reshape(-1, 1)
y = df_long["Group"]

Define the model type and fit the model on the data. 

In [None]:
clf = LogisticRegression()
clf.fit(X, y)

After that, we predict the labels based on our measure and the fitted model. To estimate the metrics, we store the predictions in our data frame.

In [None]:
df_long["Predicted Group: Model"] = clf.predict(X)
df_long

Now we estimate the accuracy of the model predictions to compare the fitted model to our rule-based model.

In [None]:
# Compare predicted and actual groups with combined boolean masks
predicted = df_long["Predicted Group: Model"]
actual = df_long["Group"]
true_positive = ((predicted == "Problem") & (actual == "Problem")).sum()
false_positive = ((predicted == "Problem") & (actual == "NoProblem")).sum()
true_negative = ((predicted == "NoProblem") & (actual == "NoProblem")).sum()
false_negative = ((predicted == "NoProblem") & (actual == "Problem")).sum()
(true_positive+true_negative) / (true_positive+true_negative+false_negative+false_positive)

The `score()` method of the classifier computes the same accuracy.

In [None]:
clf.score(X, y)

So in comparison to our rule-based model, the accuracy of the fitted model is higher.

To get the boundary of our model, we can look at, e.g., the minimum measure among the cases the model predicts as "Problem".

In [None]:
min(df_long["Measure"][df_long["Predicted Group: Model"]=="Problem"])
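
With a single feature, the boundary can also be read directly from the fitted coefficients: the predicted probability is 0.5 where `intercept + coef * x = 0`, i.e., at `x = -intercept / coef`. The following is a minimal sketch on simulated data; the toy data and the names `toy_clf` and `boundary` are assumptions for this example.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
# Toy data (assumption): "Problem" cases have higher measures on average
X_toy = np.concatenate([rng.normal(-1, 1, 200), rng.normal(2, 1, 200)]).reshape(-1, 1)
y_toy = np.array(["NoProblem"] * 200 + ["Problem"] * 200)

toy_clf = LogisticRegression().fit(X_toy, y_toy)

# Point where the predicted probability of "Problem" crosses 0.5
boundary = -toy_clf.intercept_[0] / toy_clf.coef_[0, 0]
print(boundary)

# Predictions flip from "NoProblem" to "Problem" at the boundary
print(toy_clf.predict([[boundary - 0.01], [boundary + 0.01]]))
```

Unlike the minimum of the predicted "Problem" cases, this value does not depend on which measures happen to occur in the sample.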

To get a better picture, we can now look at the histogram again, showing the result of the model prediction. 

In [None]:
plt.hist([df_long["Measure"][df_long["Predicted Group: Model"]=="NoProblem"], 
         df_long["Measure"][df_long["Predicted Group: Model"]=="Problem"]], 
         bins=20, histtype='bar', color=['b','r'], label=['No Problem', 'Problem'])
plt.legend()
plt.xlabel('Measure')
plt.ylabel('Frequency')
plt.show()

The value of `0.87` is the best solution the model could find, in the sense that it yields the highest accuracy.

**Exercise** Think about this approach. What problems might it have?

**Train-Test split**

The train-test split is an essential step in the machine-learning process because it allows you to evaluate your model's performance on unseen data.
When you train a machine learning model, you use a dataset to fit the model's parameters to the data. However, evaluating the model on the same data it was trained on can yield a misleadingly high accuracy, because the model has seen these data before. The underlying problem is known as overfitting: the model has learned the noise in the training data and may therefore perform poorly on new, unseen data.
To overcome this problem, you can split your data into training and test sets. The training set is used to fit the model's parameters, while the test set is used to evaluate the model's performance on unseen data. This allows you to estimate the model's performance on new data and compare different models' performance.

To implement this, we use the `train_test_split` function from sklearn. The `test_size` parameter defines the proportion of the data that goes into the test set (i.e., `0.2` is 20%).

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
print(len(y_train))
len(y_test)
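
Two optional arguments of `train_test_split` are often useful: `random_state` makes the split reproducible, and `stratify` keeps the group proportions equal in the training and test sets. The sketch below uses made-up data and hypothetical variable names (`X_tr`, `X_te`, `y_tr`, `y_te`) so that it does not overwrite the split above.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (assumption): 50 cases per group
X_toy = np.arange(100).reshape(-1, 1)
y_toy = np.array(["NoProblem"] * 50 + ["Problem"] * 50)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy,
    test_size=0.2,
    random_state=0,   # same split on every run
    stratify=y_toy,   # same group proportions in both sets
)

print(len(y_tr), len(y_te))
print((y_te == "Problem").sum(), (y_te == "NoProblem").sum())
```

Without `random_state`, every run produces a different split, so the scores change slightly from run to run.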

Then we fit the model only on the train data and score it on the test data.

In [None]:
clf.fit(X_train, y_train)

print(clf.score(X_train, y_train))
print(clf.score(X_test, y_test))

The score on the training data indicates whether we can learn something from the data, and the score on the test data indicates whether the model generalizes to new data.


So let's do this again with a different classification algorithm, so that we can compare models. The `RandomForestClassifier` class lets us fit a random forest model.

In [None]:
clf = RandomForestClassifier(n_estimators=100, max_depth=3, random_state=0)

And now we can look at the model performance again.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
clf.fit(X_train, y_train)

print(clf.score(X_train, y_train))
print(clf.score(X_test, y_test))

**Exercise** Define a decision tree classifier `DecisionTreeClassifier()` and compare the performance to the other two models.

**Cross Validation**

Cross-validation allows you to evaluate your model's performance on multiple subsets of your data rather than just a single train-test split. It therefore provides a more robust estimate of your model's performance and reduces the risk of relying on an overfitted model (see the regularization figure below).

![alt text](./images/regularization.jpg "Title")

Cross validation is a good and easy method to reduce the possibility of overfitting. 

![alt text](./images/cv_figure.jpg "Title")

Define the cross-validation method. Here we use the `KFold` method. The parameter `n_splits` specifies the number of equally sized splits (e.g., `n_splits=5` divides the data into five equally sized parts). Setting `shuffle=True` draws the cases for each part at random; without it, `KFold` takes consecutive blocks of rows, which is a problem here because the long-format data is sorted by group.

In [None]:
cv = KFold(n_splits=5, shuffle=True, random_state=0)
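
One detail worth checking is which rows end up in each fold. By default, `KFold` does not shuffle and simply cuts the data into consecutive blocks. The sketch below compares both settings on ten made-up cases that are sorted by group; the names `test_folds`, `X_toy`, and `y_toy` are assumptions for this example.

```python
import numpy as np
from sklearn.model_selection import KFold

# Toy data (assumption): ten cases, sorted by group as after pd.melt
X_toy = np.arange(10).reshape(-1, 1)
y_toy = np.array(["NoProblem"] * 5 + ["Problem"] * 5)

splitters = {
    "no shuffle": KFold(n_splits=5),
    "shuffle": KFold(n_splits=5, shuffle=True, random_state=0),
}

# Collect the test indices of every fold for both settings
test_folds = {}
for name, splitter in splitters.items():
    test_folds[name] = [test_idx.tolist() for _, test_idx in splitter.split(X_toy)]
    print(name, test_folds[name])
```

Without shuffling, most test folds here contain cases from a single group only, which makes the per-fold scores hard to interpret.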

Compute the cross-validated accuracy scores. `cross_val_score` returns one score per fold; the arithmetic mean across the five folds is calculated below.

In [None]:
scores = cross_val_score(clf, X, y, cv=cv, n_jobs=-1)

In [None]:
clf

Print the mean and standard deviation of the accuracy scores.

In [None]:
print(np.mean(scores), np.std(scores))
scores

**Exercise** Do the Cross-Validation for the three methods we used above and present the scores in a box-plot side by side.

In [None]:
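# Collect the cross-validation scores per model (add further models as columns)
df_scores = pd.DataFrame({'Random Forest': scores})
plt.boxplot(df_scores.values)
plt.xticks(range(1, len(df_scores.columns) + 1), df_scores.columns.tolist())
plt.xlabel('Model')
plt.ylabel('Accuracy')
plt.ylim(0.5, 1)
plt.show()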

In a final example, as in the figure above, we first split the data into a training and a test set and then run a five-fold cross-validation within the training set. The model fitted on the training set is then used to categorize the test data.

First we define a classifier. 

In [None]:
clf = RandomForestClassifier(n_estimators=100, max_depth=10, random_state=0)

Do the train/test split with 80%/20% of the data.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

Next, we train the classifier without cross-validation and score the fitted random forest on the training data to learn whether the data holds patterns that can be learned.

In [None]:
clf.fit(X_train, y_train)
noCV_score = clf.score(X_train, y_train)
print("Score without CV: {}".format(noCV_score))

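The next step is to run a five-fold cross-validation on the training data. Note that `cross_val_score` fits copies of the classifier on each fold and only returns the scores; `clf` itself stays as fitted in the previous step.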

In [None]:
scores = cross_val_score(clf, X_train, y_train, cv=5)

print("Cross-validation scores: {}".format(scores))
print("Average score: {:.2f}".format(scores.mean()))
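
It helps to know what `cross_val_score` does with the classifier it receives: it fits clones of it on the folds and returns only the scores, so the object passed in is not fitted by the call. The following sketch shows this on simulated data; the toy data and the names `toy_clf`, `cv_scores`, and `fitted_by_cv` are assumptions for this example.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
# Toy data (assumption): one feature, two groups coded as 0 and 1
X_toy = np.concatenate([rng.normal(-1, 1, 100), rng.normal(1, 1, 100)]).reshape(-1, 1)
y_toy = np.array([0] * 100 + [1] * 100)

toy_clf = LogisticRegression()
cv_scores = cross_val_score(toy_clf, X_toy, y_toy, cv=5)

# Fitted estimators carry learned attributes such as `coef_`; this one does not yet
fitted_by_cv = hasattr(toy_clf, "coef_")
print(cv_scores, fitted_by_cv)

# To predict new data, the classifier still has to be fitted explicitly
toy_clf.fit(X_toy, y_toy)
fitted_after_fit = hasattr(toy_clf, "coef_")
print(fitted_after_fit)
```

The cross-validation scores are therefore an estimate of how well this type of model generalizes, not a property of one particular fitted model.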

The final step is to score the model on the test data.

In [None]:
y_pred = clf.predict(X_test)
print("Score on test data with CV model: {}".format(accuracy_score(y_test, y_pred)))

In the end, we have three accuracy estimates: the accuracy on the training data without cross-validation, the cross-validated accuracy on the training data, and the accuracy on the left-out test data.

**Exercise** Describe in one or two sentences what we learn from the three accuracy measures.

- 
-
-