<a href="https://colab.research.google.com/github/d-tomas/transform4europe/blob/main/notebooks/supervised_learning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Supervised learning

In this *notebook* we will develop two supervised learning systems: banknote authentication (*classification*) and house price prediction (*regression*).

## Initial setup

In [None]:
# Libraries

import matplotlib.pyplot as plt  # To make plots
import numpy as np  # Obtain unique values in a vector
import pandas as pd
from sklearn.metrics import accuracy_score  # Calculate the accuracy of the classifier
from sklearn.model_selection import train_test_split  # Split the dataset in train and test
from sklearn.metrics import confusion_matrix  # Obtain the confusion matrix
from sklearn.metrics import mean_absolute_error  # Mean Absolute Error (MAE) for regression
from sklearn.svm import SVC  # Support Vector Machines algorithm
from sklearn.tree import DecisionTreeClassifier  # Decision tree algorithm
from sklearn.naive_bayes import MultinomialNB  # Naïve Bayes algorithm
from sklearn.neural_network import MLPClassifier  # Neural Networks algorithm
from sklearn.neighbors import KNeighborsClassifier  # k-NN algorithm
from xgboost import XGBRegressor  # XGBoost regression algorithm
import seaborn as sns  # Heat map visualization

# Download the dataset to train and test the classification system
!wget https://raw.githubusercontent.com/d-tomas/transform4europe/main/datasets/banknote_authentication.csv
# Download the dataset to train and test the regression system
!wget https://raw.githubusercontent.com/d-tomas/transform4europe/main/datasets/houses.csv

## Classification

Let's create a classifier that predicts whether a banknote is authentic from a set of measurements taken from a photograph of it.

It is a binary (2-class) classification problem. There are 1,372 observations with four input variables (*features*) and one output variable (*class*). The variable names are as follows:

* Variance of Wavelet Transformed image (continuous)
* Skewness of Wavelet Transformed image (continuous)
* Kurtosis of Wavelet Transformed image (continuous)
* Entropy of image (continuous)
* Class (`0` for authentic, `1` for inauthentic)

The classes are not perfectly balanced: there are 762 negative (`0`) and 610 positive (`1`) samples.
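
Because the classes are not evenly split, it helps to know what accuracy a trivial classifier would reach by always predicting the majority class. The following is a minimal sketch that uses only the counts quoted above; the variable names are illustrative and not part of the notebook.

```python
# Class counts quoted in the text above
class_counts = {0: 762, 1: 610}

total = sum(class_counts.values())  # Total number of observations
majority_class = max(class_counts, key=class_counts.get)  # Most frequent class

# Accuracy of a classifier that always predicts the majority class
baseline_accuracy = class_counts[majority_class] / total

print('Total samples: {}'.format(total))
print('Majority class: {}'.format(majority_class))
print('Baseline accuracy: {:.2%}'.format(baseline_accuracy))
```

Any trained model should clearly beat this baseline on the full dataset to be considered useful.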

In [None]:
# Let's see what the training corpus looks like

!head banknote_authentication.csv

In [None]:
# Load the data for classification

data_classification = pd.read_csv('banknote_authentication.csv')
data_classification

In [None]:
# Create the classifier for banknote authenticity prediction

y = data_classification['Class']  # Store the class to predict
X = data_classification.drop(labels='Class', axis=1)  # Store all the features but the class

# Split the dataset into train (80%) and test (20%)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

# Use SVM as algorithm for classification
model = SVC(kernel = 'linear')

# You can try other models! Uncomment the one you want to use
# model = DecisionTreeClassifier()  # Decision tree
# model = KNeighborsClassifier()  # k-NN
# model = MLPClassifier()  # Neural network
# model = MultinomialNB()  # Naïve Bayes (needs non-negative features, so it fails on this dataset unless the features are rescaled)

# Train the model
model.fit(X_train, y_train)

# Do the prediction on the test set
predictions = model.predict(X_test)

# Calculate the accuracy of the algorithm (fraction of correct predictions)
print('Accuracy: {:.2%}\n'.format(accuracy_score(y_test, predictions)))
print('Confusion matrix:')

plt.figure(figsize=(8, 6))
sns.heatmap(confusion_matrix(y_test, predictions), annot=True, fmt='d', linewidths=3)  # fmt='d' shows the counts as integers
plt.yticks(rotation=0)
plt.show()
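
The value printed by the cell above is the *accuracy*, which is a different metric from *precision*. As a minimal sketch of the difference, the snippet below derives accuracy, precision, and recall from a 2x2 confusion matrix laid out as scikit-learn does (rows are true classes, columns are predicted classes). The counts are hypothetical and are not the results of the model trained above.

```python
# Hypothetical confusion matrix: rows = true class, columns = predicted class
# [[TN, FP],
#  [FN, TP]]
matrix = [[150, 3],
          [2, 120]]

tn, fp = matrix[0]
fn, tp = matrix[1]

accuracy = (tp + tn) / (tp + tn + fp + fn)  # Correct predictions over all predictions
precision = tp / (tp + fp)  # Of the samples predicted as 1, how many really are 1
recall = tp / (tp + fn)  # Of the samples that really are 1, how many were found

print('Accuracy: {:.2%}'.format(accuracy))
print('Precision: {:.2%}'.format(precision))
print('Recall: {:.2%}'.format(recall))
```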


In [None]:
# If we want to try the model with a new input

new_input = [[-1.37056, -2.87730, 5.4474, 0.24179]]  # New input instance (all features but no class)
model.predict(new_input)  # Predict the class for the new input (0 or 1)
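
When a model has been trained on a `DataFrame`, recent scikit-learn versions may warn if the new input is a plain list without column names. One way to avoid this, sketched below, is to wrap the new instance in a `DataFrame` with the same columns. The example is self-contained: it trains a throwaway model on synthetic data, and the column names are assumptions rather than the real headers of `banknote_authentication.csv`.

```python
import numpy as np
import pandas as pd
from sklearn.svm import SVC

# Hypothetical column names; the real ones come from the CSV header
feature_names = ['variance', 'skewness', 'kurtosis', 'entropy']

# Synthetic, clearly separated data standing in for the real dataset
rng = np.random.default_rng(1)
X_demo = pd.DataFrame(
    np.vstack([rng.normal(-3, 1, size=(50, 4)), rng.normal(3, 1, size=(50, 4))]),
    columns=feature_names,
)
y_demo = np.array([0] * 50 + [1] * 50)

demo_model = SVC(kernel='linear').fit(X_demo, y_demo)

# Wrap the new instance in a DataFrame with the same columns used for training
new_demo_input = pd.DataFrame([[-3.1, -2.9, -3.3, -2.7]], columns=feature_names)
predicted_class = int(demo_model.predict(new_demo_input)[0])
print(predicted_class)
```

With the notebook's own model, the equivalent would be to build the `DataFrame` with `columns=X.columns`.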

## Regression

We are going to build a system for **predicting house prices**. The system is trained on a *dataset* of 1,460 houses, each described by 80 features plus its selling price (the target to predict).

In [None]:
# Load the data for regression

data_regression = pd.read_csv('houses.csv')
data_regression

In [None]:
# Show info and data types for each column

data_regression.info()
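
Columns that `info()` reports with a non-numeric type have to be encoded as numbers before training. A common option is one-hot encoding, which creates one indicator column per category. The sketch below applies `pd.get_dummies` to a tiny hypothetical frame; the column names and values are made up for illustration and are not taken from `houses.csv`.

```python
import pandas as pd

# Hypothetical toy data with one numeric and one categorical column
toy = pd.DataFrame({
    'LotArea': [8450, 9600, 11250],
    'Street': ['Pave', 'Grvl', 'Pave'],
})

# One indicator column is created for each distinct value of 'Street'
encoded = pd.get_dummies(toy, columns=['Street'])

print(list(encoded.columns))
print(encoded)
```

Numeric columns such as `LotArea` are left untouched, while `Street` is replaced by one column per category.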

In [None]:
# Build the regression model

y = data_regression['SalePrice']  # Target to predict (the price of the houses)
X = data_regression.drop(labels='SalePrice', axis=1)  # All the features of each house (but its price)

# Categorical variables (those that are not numbers) must be converted into numerical values
# Use the one-hot encoding technique
X = pd.get_dummies(X)

# Split the dataset into train (80%) and test (20%)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

# Create the model, tuning some parameters to improve the performance
model = XGBRegressor(colsample_bytree=0.6, learning_rate=0.015, max_depth=4, min_child_weight=3, n_estimators=3000, subsample=0.75, random_state=1)
model.fit(X_train, y_train)  # Train the model

# Predict on the test split
predictions = model.predict(X_test)

# Calculate the error of the algorithm (MAE)
# The lower this value, the better
print("MAE: {:,.0f}".format(mean_absolute_error(y_test, predictions)))
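
The MAE is the average of the absolute differences between the real and the predicted prices, so it is expressed in the same unit as `SalePrice`. As a minimal sketch, the snippet below computes it by hand for three hypothetical houses; the prices are invented for illustration and are not outputs of the model above.

```python
# Hypothetical real and predicted prices for three houses
real_prices = [200000, 150000, 310000]
predicted_prices = [210000, 140000, 300000]

# Absolute error of each prediction
absolute_errors = [abs(real - predicted) for real, predicted in zip(real_prices, predicted_prices)]

# Mean Absolute Error: the average of the absolute errors
mae = sum(absolute_errors) / len(absolute_errors)

print("MAE: {:,.0f}".format(mae))
```

Here every prediction is off by 10,000, so the MAE is 10,000 as well.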

## References

* [Banknote authentication dataset](https://archive.ics.uci.edu/ml/datasets/banknote+authentication)
* [House prices dataset](https://www.kaggle.com/c/house-prices-advanced-regression-techniques)
