# Session 8 - Introduction to Machine Learning

Goal: the price is the one variable that is difficult to guess, so we will try to learn it. It is a numeric variable, and because it takes a very large number of distinct values we can treat it as continuous. Predicting a continuous value is a regression problem.

Simple case: a two-class problem, where we categorize the price instead of predicting its exact value.

In [None]:
import pandas as pd
from sklearn import tree
import matplotlib.pyplot as plt
import numpy as np

In [None]:
df = pd.read_csv("UTSEUS-anjuke-real-estate-baoshan.csv")

In [None]:
df.head()

## Clean df 

Remove id, address (we already have longitude and latitude), onesquaremeter, tags, district, and neighborhood. The target variable is "price". We predict it by learning rules from df using decision trees.

In [None]:
X = df[['longitude', 'latitude', 'bedroom', 'room', 'surface']]
Y = df['price']

In [None]:
#plt.hist(Y, 100)

In [None]:
medianY = np.median(Y)

Binary classifier

In [None]:
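# Label each listing relative to the median price (builds a new string Series instead of overwriting numeric values)
Z = pd.Series(np.where(Y < medianY, "cheap", "expensive"), index=Y.index, name="class")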

In [None]:
df2 = pd.DataFrame.copy(X)
df2['class'] = Z
df2.head()

# 1. Decision Tree

In [None]:
clf = tree.DecisionTreeClassifier()
clf = clf.fit(X,Z)

In [None]:
clf.predict([[121, 31, 3, 5, 200]])

Shuffle the df and split it into 2 parts: train and test

In [None]:
from sklearn.model_selection import train_test_split
def splitTrainTest(df, testSize = 0.3, nameColumnClass = 'class'):
    train, test = train_test_split(df, test_size=testSize, shuffle=True)
    result = [train.drop(columns=[nameColumnClass]), train[nameColumnClass], test.drop(columns=[nameColumnClass]), test[nameColumnClass]]
    return result

In [None]:
from sklearn.model_selection import train_test_split
train, test = train_test_split(df, test_size=0.3)

In [None]:
X_train, Z_train, X_test, Z_test = splitTrainTest(df2)  # by default the test part represents 30% of the data

In [None]:
X_train.head() #df with features

In [None]:
Z_train.head()

In [None]:
len(test)+len(train)

Compute the model

In [None]:
clf = tree.DecisionTreeClassifier()
clf = clf.fit(X_train, Z_train)

In [None]:
prediction = clf.predict(X_test)

In [None]:
np.mean(prediction == Z_test)

We plot the decision tree

In [None]:
from sklearn import tree
clf = tree.DecisionTreeClassifier(max_depth=2)
clf = clf.fit(X_train, Z_train)
plt.figure(figsize=(15,10))
tree.plot_tree(clf)
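
The rules learned by a tree can also be read as text. The cell below is a minimal sketch on synthetic data, not the Anjuke dataset: the `*_demo` names, the two features, and the 100 m² threshold used to generate labels are all assumptions made for illustration. It relies on `tree.export_text`, which prints one line per split and leaf.

```python
import numpy as np
from sklearn import tree

rng = np.random.default_rng(0)

# Synthetic stand-in for the real features (assumption, for illustration only)
surface_demo = rng.uniform(30, 200, 200)
bedroom_demo = rng.integers(1, 5, 200)
X_demo = np.column_stack([surface_demo, bedroom_demo])

# Hypothetical labelling rule: flats under 100 m2 are "cheap"
Z_demo = np.where(surface_demo < 100, "cheap", "expensive")

demo_clf = tree.DecisionTreeClassifier(max_depth=2, random_state=0)
demo_clf = demo_clf.fit(X_demo, Z_demo)

# Text version of the fitted tree: each line is a split or a leaf
rules = tree.export_text(demo_clf, feature_names=["surface", "bedroom"])
print(rules)
```

With the notebook's own model, the same call would take `clf` and the column names of `X_train`.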

# 2. Cross-validation

In [None]:
plt.plot(X["latitude"])

If we don't shuffle the dataset, the result of the cross-validation is not representative, because each fold then covers only a contiguous slice of the rows.

In [None]:
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import ShuffleSplit
clf = tree.DecisionTreeClassifier(max_depth=5)  # a deeper tree can score better, but be careful not to overfit
scores = cross_val_score(clf, X, Z, cv=ShuffleSplit(n_splits=5))
scores
np.mean(scores)
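
The effect of shuffling can be sketched on an extreme synthetic case, where the rows are stored sorted by class. This is an illustration under assumptions (the `*_demo`-style data, the single feature, and the choice of 2 unshuffled folds are not from the dataset above): with `KFold` and no shuffling, each training fold contains only one class, so the model cannot predict the other one.

```python
import numpy as np
from sklearn import tree
from sklearn.model_selection import KFold, ShuffleSplit, cross_val_score

rng = np.random.default_rng(0)

# Synthetic data stored in sorted order: 100 "cheap" rows, then 100 "expensive" rows
labels = np.array(["cheap"] * 100 + ["expensive"] * 100)
features = np.concatenate(
    [rng.normal(0.0, 0.1, 100), rng.normal(1.0, 0.1, 100)]
).reshape(-1, 1)

clf_demo = tree.DecisionTreeClassifier(max_depth=2, random_state=0)

# Unshuffled folds: each training half holds a single class
ordered_scores = cross_val_score(clf_demo, features, labels, cv=KFold(n_splits=2))

# Shuffled splits: both classes appear in every training set
shuffled_scores = cross_val_score(
    clf_demo, features, labels, cv=ShuffleSplit(n_splits=5, random_state=0)
)

print(np.mean(ordered_scores), np.mean(shuffled_scores))
```

Real data is rarely sorted this cleanly, but any ordering in the rows (by location, by date of listing) can bias unshuffled folds in the same direction.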

### Try to get the best parameters - GridSearchCV

In [None]:
# We want to determine the best max_depth
from sklearn.model_selection import GridSearchCV
param_grid = {'max_depth': [2,4,6,8,10,20,40,100]}
search = GridSearchCV(tree.DecisionTreeClassifier(), param_grid, cv= ShuffleSplit(n_splits=5))
search.fit(X,Z)

We get the max_depth

In [None]:
search.best_estimator_
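
Besides `best_estimator_`, a fitted `GridSearchCV` exposes `best_params_`, `best_score_`, and `cv_results_`, which make it easier to see how close the candidates are to each other. The sketch below uses synthetic data and a shorter grid; `X_demo`, `Z_demo`, and `demo_search` are hypothetical names, not part of the notebook.

```python
import numpy as np
from sklearn import tree
from sklearn.model_selection import GridSearchCV, ShuffleSplit

rng = np.random.default_rng(0)

# Synthetic two-feature problem (assumption, for illustration only)
X_demo = rng.uniform(0, 1, size=(300, 2))
Z_demo = np.where(X_demo[:, 0] + X_demo[:, 1] > 1, "expensive", "cheap")

param_grid_demo = {"max_depth": [2, 4, 6, 8]}
demo_search = GridSearchCV(
    tree.DecisionTreeClassifier(random_state=0),
    param_grid_demo,
    cv=ShuffleSplit(n_splits=5, random_state=0),
)
demo_search.fit(X_demo, Z_demo)

best_depth = demo_search.best_params_["max_depth"]          # selected value
best_score = demo_search.best_score_                        # its mean CV accuracy
mean_scores = demo_search.cv_results_["mean_test_score"]    # one score per candidate

for depth, score in zip(param_grid_demo["max_depth"], mean_scores):
    print(depth, round(score, 3))
```

If several depths score almost the same, the smaller one is usually the safer choice against overfitting.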

In [None]:
# The search selected max_depth=20; values around 20 could be explored afterwards
clf = tree.DecisionTreeClassifier(max_depth=20)
scores = cross_val_score(clf, X,Z, cv=ShuffleSplit(n_splits=5))
np.mean(scores)

### Best model for the decision tree:

In [None]:
clf = tree.DecisionTreeClassifier(max_depth=20)
clf = clf.fit(X_train, Z_train)
prediction = clf.predict(X_test)
np.mean(prediction == Z_test)

## Test other models to see whether any performs better than the decision tree

### 1. Random forest

In [None]:
from sklearn.ensemble import RandomForestClassifier
rfc = RandomForestClassifier(max_depth=25, random_state=0, n_estimators=100)
rfc = rfc.fit(X_train, Z_train)  # fit the forest itself, not the earlier decision tree
prediction = rfc.predict(X_test)
np.mean(prediction == Z_test)

### 2. K-Nearest neighbors
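
A minimal sketch of how a k-nearest-neighbors classifier could be tried, shown on synthetic data rather than the Anjuke dataset. The scaling step, `n_neighbors=5`, and every `*_demo` name are assumptions: k-NN relies on distances, so features with very different ranges (latitude versus surface) are standardized first.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in for two features with very different scales (assumption)
surface_demo = rng.uniform(30, 200, 300)
latitude_demo = rng.uniform(31.2, 31.5, 300)
X_demo = np.column_stack([latitude_demo, surface_demo])

# Hypothetical labelling rule: flats under 100 m2 are "cheap"
Z_demo = np.where(surface_demo < 100, "cheap", "expensive")

X_tr, X_te, Z_tr, Z_te = train_test_split(
    X_demo, Z_demo, test_size=0.3, random_state=0
)

# Standardize the features, then classify by majority vote among the 5 nearest neighbors
knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
knn.fit(X_tr, Z_tr)

knn_accuracy = np.mean(knn.predict(X_te) == Z_te)
print(knn_accuracy)
```

On the real data, `X_train`, `Z_train`, `X_test`, and `Z_test` would take the place of the synthetic split, and `n_neighbors` could be tuned with `GridSearchCV` like `max_depth` above.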

### 3. Support vector machines

## Multi-class classifier

In [None]:
df['price'].describe()

In [None]:
# Compute the quartiles once, then label each price range
q1, q2, q3 = np.quantile(Y, [0.25, 0.5, 0.75])
Z = Y.astype(object)  # object dtype so that string labels can be stored
Z[Y <= q1] = "Very Cheap"  # could also use class 1, 2, 3, 4
Z[(Y > q1) & (Y <= q2)] = "Cheap"
Z[(Y > q2) & (Y <= q3)] = "Expensive"
Z[Y > q3] = "Very Expensive"

In [None]:
Z

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, Z_train, Z_test = train_test_split(X, Z, test_size=0.33, shuffle=True)

In [None]:
Z_train.head()

In [None]:
# We  want to determine the best C
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC
param_grid = {'C': [0.1, 1, 10, 100],  # how much you want to penalize the errors
             "decision_function_shape": ['ovr', 'ovo']}
search = GridSearchCV(SVC(kernel="linear"), param_grid, cv= 5)
search.fit(X_train, Z_train)

### Boosting

In [None]:
# We want to determine the best n_estimators
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import AdaBoostClassifier
param_grid = {'n_estimators': [10, 50, 100]}
clf = AdaBoostClassifier()
search = GridSearchCV(clf, param_grid, cv= 5)
search.fit(X_train, Z_train)

In [None]:
#best model
clf = AdaBoostClassifier(n_estimators=50)
clf = clf.fit(X_train, Z_train)
prediction = clf.predict(X_test)
np.mean(prediction == Z_test)

In [None]:
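Z_pred = clf.predict(X_test)  # predictions of the AdaBoost model fitted above
np.mean(Z_pred == Z_test)

In [None]:
# Confusion matrix: rows are predicted classes, columns are true classes
labels = ["Very Cheap", "Cheap", "Expensive", "Very Expensive"]
label_index = {label: i for i, label in enumerate(labels)}
conf_matrix = np.zeros((4, 4), dtype=int)
for pred, true in zip(Z_pred, Z_test):
    conf_matrix[label_index[pred], label_index[true]] += 1
conf_matrix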
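
The same kind of table can be obtained with `sklearn.metrics.confusion_matrix`. The sketch below uses six hand-written labels (`true_demo` and `pred_demo` are hypothetical, not model output). Note the layout: scikit-learn puts the true classes on the rows and the predicted classes on the columns, which is the transpose of a loop that indexes rows by predicted class.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

class_order = ["Very Cheap", "Cheap", "Expensive", "Very Expensive"]

# Hand-written example labels (assumption, for illustration only)
true_demo = np.array(
    ["Very Cheap", "Cheap", "Cheap", "Expensive", "Very Expensive", "Very Expensive"]
)
pred_demo = np.array(
    ["Very Cheap", "Cheap", "Expensive", "Expensive", "Very Expensive", "Expensive"]
)

# Rows follow the true class, columns the predicted class, both in class_order
cm = confusion_matrix(true_demo, pred_demo, labels=class_order)
print(cm)

# The diagonal counts the correct predictions, so accuracy is trace / total
accuracy_demo = np.trace(cm) / cm.sum()
print(accuracy_demo)
```

Passing `labels=` fixes the row and column order; without it, the classes are sorted alphabetically.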