## Classification

In this notebook we will see some basic classification methods:
- Logistic regression
- Support vector machine
- Decision tree
- Random forests


We will use the following libraries:
- `sklearn`: the main library for machine learning algorithms
- `scipy`
- `pandas`
- `numpy`
- `matplotlib`
- `seaborn`




In [None]:
from sklearn import datasets
from sklearn import linear_model, svm, tree
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from scipy.cluster.hierarchy import dendrogram, linkage 
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns


%matplotlib inline




## The breast cancer data
 

In [None]:
# Load dataset
data = datasets.load_breast_cancer()
print(data.DESCR)

With this dataset, we will try to classify each tumor as benign or malignant. The tumor type is the target variable.

In [None]:
cancer_x=pd.DataFrame(data['data'],columns=data.feature_names) #these will be x in our model
cancer_y=data.target # these will be y in our model

In [None]:
cancer=pd.DataFrame(data['data'],columns=data.feature_names)
cancer["cancer10"]=data.target   # save the target as a column of 0 (malignant) and 1 (benign)
cancer["cancer"]=data.target_names[data.target]   # save the target as a second column with the labels 'malignant' or 'benign'
cancer.head()

In [None]:
cancer.describe()

In [None]:
sns.heatmap(cancer.corr(numeric_only=True))  # correlations are only defined for numeric columns

Based on the results above and some further data exploration, describe some patterns in the breast cancer dataset.
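
One possible way to start the exploration is to rank the features by their correlation with the target and to check a pair of features that look redundant in the heatmap. This is only a sketch: the names `features`, `target` and `corr_with_target` are introduced here for illustration and are not part of the notebook.

```python
import pandas as pd
from sklearn import datasets

data = datasets.load_breast_cancer()
features = pd.DataFrame(data.data, columns=data.feature_names)

# Pearson correlation of each feature with the 0/1 target
# (in this dataset 0 = malignant and 1 = benign).
target = pd.Series(data.target, name="target")
corr_with_target = features.corrwith(target).sort_values()

# The most negative values belong to features that tend to be larger for malignant tumors.
print(corr_with_target.head(5))

# Size-related features are almost collinear, for example radius and perimeter.
radius_perimeter = features["mean radius"].corr(features["mean perimeter"])
print(round(radius_perimeter, 3))
```

A feature with a strongly negative correlation here is one whose large values are associated with the malignant class, since malignant is coded as 0.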

## Classification

### Creating dataset

The first step is to create a training and a test dataset. This can be done automatically with the `sklearn` library. We will set 20% of the data apart for testing and use the remaining 80% for training.

In [None]:
train_x, test_x,train_y,test_y = train_test_split(cancer_x,cancer_y,test_size=0.20, random_state=40)

In [None]:
train_x.shape[0]  # number of rows in the training set

In [None]:
test_x.shape[0]  # number of rows in the test set
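
As a variation on the split above, the sketch below adds `stratify=y`, which keeps the proportion of benign and malignant tumors roughly equal in both parts. Stratifying is an assumption added here and is not used in the notebook's own split; the variable names are also illustrative.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split

X, y = datasets.load_breast_cancer(return_X_y=True)

# Same 80/20 split as above, but stratified on the target.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=40, stratify=y
)

# Number of rows in each part.
print(X_train.shape[0], X_test.shape[0])

# Fraction of benign tumors (class 1) in each part.
print(round(y_train.mean(), 3), round(y_test.mean(), 3))
```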

### Logistic regression



Logistic regression is a type of Generalized Linear Model (GLM) that models a binary response variable $y$. Here $X$ contains all the explanatory variables and $y$ is the tumor type.

1) Build the model with the `train` dataset

In [None]:
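# multi_class is deprecated in recent scikit-learn releases and has no effect on a binary target;
# a larger max_iter lets lbfgs converge on the unscaled features
LR = linear_model.LogisticRegression(random_state=0, solver='lbfgs', max_iter=10000).fit(train_x, train_y)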

2) Predict the classification with the test dataset 

In [None]:
LR.predict(test_x)  

3) Measure the accuracy of the model, first on the training dataset and then on the test dataset

In [None]:
round(LR.score(train_x,train_y), 4) 

In [None]:
LR.score(test_x,test_y)
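
To make the link between the linear model and the predicted class explicit, the sketch below recomputes the predicted probabilities by hand from the fitted coefficients, using the sigmoid $p = 1 / (1 + e^{-z})$ with $z = Xw + b$. The features are standardized first, which is an addition to the workflow above (it helps the solver converge), and all variable names are illustrative.

```python
import numpy as np
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = datasets.load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=40
)

# Standardize with statistics learned on the training set only.
scaler = StandardScaler().fit(X_train)
Xs_train, Xs_test = scaler.transform(X_train), scaler.transform(X_test)

model = LogisticRegression(random_state=0).fit(Xs_train, y_train)

# Linear score z = X w + b, mapped to P(y = 1 | x) by the sigmoid.
z = Xs_test @ model.coef_.ravel() + model.intercept_[0]
p_manual = 1.0 / (1.0 + np.exp(-z))

# Compare with the probabilities reported by scikit-learn.
p_sklearn = model.predict_proba(Xs_test)[:, 1]
print(np.allclose(p_manual, p_sklearn))

# A positive z corresponds to p > 0.5, so the sign of z gives the predicted class.
manual_class = (z > 0).astype(int)
print(np.array_equal(manual_class, model.predict(Xs_test)))
```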

### Support Vector Machine

Support-vector machines are supervised learning models, with associated learning algorithms, used for classification and regression analysis.

In `sklearn` the workflow is the same as for logistic regression:

1) Fit the model to the `train` dataset

In [None]:
SVM = svm.LinearSVC().fit(train_x, train_y)  

2) Predict the classification with the test dataset 

In [None]:
SVM.predict(test_x)  

3) Measure the accuracy of the prediction with this model on the test dataset

In [None]:
SVM.score(test_x,test_y)
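
A linear SVM classifies a sample by the side of the separating hyperplane it falls on, which `decision_function` exposes as a signed score. The sketch below checks this, and wraps the model in a pipeline with `StandardScaler`, since linear SVMs are sensitive to feature scale. The scaling step and the name `svm_pipeline` are assumptions added here, not part of the notebook.

```python
import numpy as np
from sklearn import datasets, svm
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = datasets.load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=40
)

# Scale the features, then fit a linear SVM on the scaled data.
svm_pipeline = make_pipeline(StandardScaler(), svm.LinearSVC(random_state=0))
svm_pipeline.fit(X_train, y_train)

# Signed score of each test sample relative to the separating hyperplane.
scores = svm_pipeline.decision_function(X_test)

# Positive scores are assigned to class 1, the others to class 0.
from_scores = (scores > 0).astype(int)
print(np.array_equal(from_scores, svm_pipeline.predict(X_test)))

print(round(svm_pipeline.score(X_test, y_test), 4))
```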

### Decision trees

A decision tree is a model built from successive binary splits on the explanatory variables, where the splitting variables and thresholds are inferred from the training data. The `DecisionTreeClassifier` implements a CART algorithm.


In [None]:
clf= tree.DecisionTreeClassifier().fit(train_x,train_y)

In [None]:
clf.predict(test_x)

In [None]:
clf.score(test_x,test_y)

To plot the tree, try the following (it requires `pydotplus` and Graphviz, and there may be conflicts between library versions):

In [None]:
from io import StringIO  # sklearn.externals.six has been removed from recent scikit-learn releases
from IPython.display import Image  
from sklearn.tree import export_graphviz
import pydotplus

dot_data = StringIO()

export_graphviz(clf, out_file=dot_data,  
                filled=True, rounded=True,
                special_characters=True)

graph = pydotplus.graph_from_dot_data(dot_data.getvalue())  
Image(graph.create_png())
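
If Graphviz is not available, the splits can also be inspected as plain text with `sklearn.tree.export_text`, which needs no extra dependencies. This is a minimal sketch: the `max_depth=3` limit and the name `small_tree` are choices made here to keep the printout short, not settings from the notebook.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

data = datasets.load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.20, random_state=40
)

# A shallow tree (depth limit chosen for readability) fitted on the training set.
small_tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)

# One line per split or leaf; indentation shows the depth in the tree.
tree_text = export_text(small_tree, feature_names=list(data.feature_names))
print(tree_text)
```

Each path from the top of the printout to a `class:` line is one decision rule of the tree.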

### Random Forests

A random forest is an ensemble method for classification, regression and other tasks that operates by constructing a multitude of decision trees and combining their predictions.

1) Build model

In [None]:
RF = RandomForestClassifier(n_estimators=100, max_depth=2, random_state=0).fit(train_x, train_y)  

2) Predict

In [None]:
RF.predict(test_x) 

3) Measure accuracy

In [None]:
RF.score(test_x,test_y)
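
To illustrate how the trees are combined, the sketch below averages the class probabilities predicted by the individual trees and compares the result with the forest's own `predict_proba`. It uses the same hyperparameters as above on NumPy arrays; the names `forest`, `per_tree` and `averaged` are illustrative.

```python
import numpy as np
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = datasets.load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=40
)

forest = RandomForestClassifier(
    n_estimators=100, max_depth=2, random_state=0
).fit(X_train, y_train)

# Class probabilities from every individual tree: shape (n_trees, n_samples, n_classes).
per_tree = np.stack([t.predict_proba(X_test) for t in forest.estimators_])

# The forest's probability is the mean over its trees.
averaged = per_tree.mean(axis=0)
print(np.allclose(averaged, forest.predict_proba(X_test)))

# The predicted class is the one with the highest averaged probability.
from_average = forest.classes_[averaged.argmax(axis=1)]
print(np.array_equal(from_average, forest.predict(X_test)))
```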

## Comparison

For each model, write down the accuracy on the test dataset to compare them:
- Logistic regression:
- SVM:
- Decision tree:
- Random forest:
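
The comparison can also be collected programmatically. The sketch below refits the four models on the same split and stores their test accuracy in a `pandas` Series. It is a sketch under assumptions: the two linear models are wrapped in a `StandardScaler` pipeline, and the trees get a fixed `random_state`, so the numbers may differ slightly from the cells above.

```python
import pandas as pd
from sklearn import datasets, svm, tree
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = datasets.load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=40
)

# One entry per model; the names are only labels for the results table.
models = {
    "Logistic regression": make_pipeline(StandardScaler(), LogisticRegression(random_state=0)),
    "SVM": make_pipeline(StandardScaler(), svm.LinearSVC(random_state=0)),
    "Decision tree": tree.DecisionTreeClassifier(random_state=0),
    "Random forest": RandomForestClassifier(n_estimators=100, max_depth=2, random_state=0),
}

# Fit each model on the training set and score it on the test set.
accuracy = pd.Series(
    {name: model.fit(X_train, y_train).score(X_test, y_test)
     for name, model in models.items()},
    name="test accuracy",
)
print(accuracy.round(4).sort_values(ascending=False))
```

With a single 80/20 split, small differences between models depend on the random seed, so they should not be over-interpreted.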

## Exercise

Using the digits dataset from `sklearn`, perform a classification to infer the digits.

In [None]:
data_digits=datasets.load_digits()
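
Before fitting anything, it may help to look at how the digits data is organized. The short sketch below only inspects the dataset and leaves the classification itself to the exercise.

```python
from sklearn import datasets

data_digits = datasets.load_digits()

# Each row of `data` is one 8x8 image flattened into 64 pixel intensities.
print(data_digits.data.shape)

# The same images kept as 8x8 arrays, convenient for plotting.
print(data_digits.images.shape)

# The classes to predict: the digits 0 to 9.
print(data_digits.target_names)
```

Unlike the breast cancer data, the target here has ten classes, so this is a multiclass problem.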