# Lecture 8

### Note: If you don't need to standardize the data, don't do it! Standardizing rescales every feature to unit variance, so you kill the effect of the variance in your data

In [8]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

In [9]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.tree import export_graphviz
from sklearn.model_selection import train_test_split

In [24]:
spam = pd.read_csv(r"C:\Users\olive\Documents\GitHub\Computational-Applied-Statistics\Week 7\spamdata.csv")
spam.columns = spam.columns.str.strip()
X = spam.values[:, :57]
y = spam.values[:, -1]

In [25]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1)

In [26]:
dt10 = DecisionTreeClassifier(max_depth=10)
dt10.fit(X_train, y_train)

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=10,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')

In [27]:
y10_pred = dt10.predict(X_test)

In [28]:
from sklearn.metrics import accuracy_score

In [29]:
accuracy_score(y_test, y10_pred)

0.8915401301518439

### Bagging sends "similar" data to each classifier, not the same data. Each classifier gets a bootstrap sample: rows drawn at random, with replacement, from the original dataset
    - Because the sampling is with replacement, the same observation can appear more than once in one sub-classifier's sample - nothing ensures that rows are not repeated
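To make the row resampling concrete, here is a minimal sketch of one bootstrap draw using only NumPy. The dataset size (10 rows) and the fixed seed are assumptions chosen for illustration; they are not from the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed (assumption) so the draw is repeatable
n_rows = 10                     # toy dataset size (assumption)
row_ids = np.arange(n_rows)

# One bootstrap sample: same size as the original, drawn WITH replacement,
# so a row can show up several times and some rows may not show up at all.
boot_ids = rng.choice(row_ids, size=n_rows, replace=True)

# Rows that were never drawn are "out-of-bag" for this sub-classifier.
oob_ids = np.setdiff1d(row_ids, boot_ids)

print("bootstrap sample:", np.sort(boot_ids))
print("unique rows used:", np.unique(boot_ids).size, "of", n_rows)
print("out-of-bag rows: ", oob_ids)
```

Each sub-classifier in a bagged ensemble would be trained on its own draw like `boot_ids`, which is why the classifiers see similar but not identical data.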

### If you resample just the rows, it's bagging - but if you also sample the features, it's a Random Forest

### Sometimes a single tree fits the data about as well as an ensemble - in this case use the single tree. You lose interpretability when you introduce ensemble methods

### Boosting isn't much different from bagging, except that the classifiers are trained one after another. Boosting recognizes that a good classifier classifies difficult data well: it gives more weight to the "difficult" (previously misclassified) datapoints so the ensemble gets better at classifying complicated data, while assuming that the easy datapoints will remain easy to classify

### Weighting the classifiers accomplishes this as well: put a higher weight on a classifier that handles the complicated data better. This is the more commonly used approach (e.g. AdaBoost)
    - AdaBoost is a special case of gradient boosting in which the loss function is exponential
    - Make sure you're using the correct loss function
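The boosting notes above can be sketched with scikit-learn's `AdaBoostClassifier` and `GradientBoostingClassifier`. This is an illustration on synthetic data from `make_classification`, not the spam dataset; the sample size, number of estimators, and seeds are assumptions. Setting `loss="exponential"` on the gradient-boosting model is the case that corresponds to AdaBoost.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Synthetic binary classification data (assumption: stands in for real data).
Xb, yb = make_classification(n_samples=500, n_features=20, random_state=0)
Xb_train, Xb_test, yb_train, yb_test = train_test_split(
    Xb, yb, test_size=0.2, random_state=0
)

# AdaBoost: reweights misclassified points and weights each weak learner.
ada = AdaBoostClassifier(n_estimators=50, random_state=0)
ada.fit(Xb_train, yb_train)
ada_acc = ada.score(Xb_test, yb_test)

# Gradient boosting with the exponential loss (binary classification only).
gb_exp = GradientBoostingClassifier(
    loss="exponential", n_estimators=50, random_state=0
)
gb_exp.fit(Xb_train, yb_train)
gb_acc = gb_exp.score(Xb_test, yb_test)

print("AdaBoost test accuracy:                   ", ada_acc)
print("Gradient boosting (exp. loss) test accuracy:", gb_acc)
```

The two models use different weak learners by default (depth-1 stumps vs. depth-3 regression trees), so their scores are not expected to match exactly.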

### Top Kaggle Winners:
    - Boosting
    - Random Forest
    - SVM

# Random Forest

- An ensemble of decision trees where both the rows and the features are resampled
- Make sure no column appears twice within a feature subset (don't duplicate columns, this will destroy your classifier)
    - This is the difference between resampling rows and resampling features: rows can be duplicated, features should not be
- If you're worried about collinearity in your data, keep the number of features given to each classifier small
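The difference between row and feature resampling can be sketched as follows. The data is synthetic and the sizes, seed, and the choice of `sqrt(n_features)` features are assumptions for illustration. Note that scikit-learn's `RandomForestClassifier` draws the feature subset at each split (controlled by `max_features`) rather than once per tree.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n_rows, n_features = 200, 16   # toy sizes (assumption)
Xf, yf = make_classification(
    n_samples=n_rows, n_features=n_features, random_state=0
)

# Rows: sampled WITH replacement, so duplicates are allowed.
row_idx = rng.choice(n_rows, size=n_rows, replace=True)

# Features: sampled WITHOUT replacement, so no column is repeated.
k = int(np.sqrt(n_features))
col_idx = rng.choice(n_features, size=k, replace=False)
print("feature subset:", np.sort(col_idx))

# The same two ideas expressed through the estimator's parameters:
# bootstrap=True resamples rows, max_features limits features per split.
rf_demo = RandomForestClassifier(
    n_estimators=50, max_features="sqrt", bootstrap=True, random_state=0
)
rf_demo.fit(Xf, yf)
print("trees in the forest:", len(rf_demo.estimators_))
```

Lowering `max_features` is the knob that keeps the number of features seen by each split small.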

Demo of Random Forest:
    https://cs.stanford.edu/~karpathy/svmjs/demo/demoforest.html

In [30]:
from sklearn.ensemble import RandomForestClassifier

In [34]:
rf = RandomForestClassifier(n_estimators=100)
rf.fit(X_train, y_train)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)

In [35]:
y_pred = rf.predict(X_test)

In [36]:
accuracy_score(y_test, y_pred)

0.9479392624728851

### Notice the improved accuracy over the single decision tree (about 0.948 vs. 0.892)

### For cross-validation - re-split (with shuffling), train, and evaluate on the test set at least 30 times so you can come up with a meaningful average for your accuracy score
- 30 repetitions is the usual rule of thumb for applying the central limit theorem to infer the mean of the accuracy
- This also allows us to make a confidence interval for our accuracy score
- This allows you to report results accurately without giving someone else the exact test set you used
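The repeated-split procedure above could look like the sketch below. It uses synthetic data from `make_classification` rather than the spam data, and the model size, the seeds, and the normal-approximation multiplier 1.96 for a 95% interval are assumptions. Because the repeated splits reuse the same rows, the scores are not fully independent, so the interval is only approximate.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real dataset (assumption).
Xr, yr = make_classification(n_samples=300, n_features=10, random_state=0)

n_repeats = 30
scores = []
for seed in range(n_repeats):
    # A fresh shuffled split on every repetition.
    Xr_train, Xr_test, yr_train, yr_test = train_test_split(
        Xr, yr, test_size=0.1, shuffle=True, random_state=seed
    )
    model = RandomForestClassifier(n_estimators=25, random_state=seed)
    model.fit(Xr_train, yr_train)
    scores.append(accuracy_score(yr_test, model.predict(Xr_test)))

scores = np.array(scores)
mean_acc = scores.mean()
# Standard error of the mean, then an approximate 95% confidence interval.
sem = scores.std(ddof=1) / np.sqrt(n_repeats)
ci_low, ci_high = mean_acc - 1.96 * sem, mean_acc + 1.96 * sem

print(f"mean accuracy: {mean_acc:.3f}")
print(f"approx. 95% CI: [{ci_low:.3f}, {ci_high:.3f}]")
```

Reporting the mean together with the interval describes the model's performance without tying the result to one particular test set.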

# SVM

- A linear SVM often struggles in low dimensions because the data is rarely linearly separable (a hard-margin SVM requires separability, since it needs to find the margin that separates the data best)
- When you increase the number of features, the data becomes more and more separable
- You can also enforce separability by using non-linear kernels
    - Use the RBF kernel: a type of non-linear kernel
- Basically a regression with a different loss function
    - Instead of squared error loss, the SVM uses the hinge loss
- Don't be confused - you can have a "neural network" kernel for SVM, but that just means the kernel contains a sigmoid. This is not a real neural network
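As a small illustration of the kernel point above, the sketch below fits a linear and an RBF `SVC` on synthetic data from `make_circles`, where one class forms a ring around the other. The dataset parameters and `C=1.0` are assumptions chosen for the demonstration.

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# 2-D data in which no straight line can separate the two classes.
X_circ, y_circ = make_circles(
    n_samples=300, noise=0.05, factor=0.4, random_state=0
)

lin_svm = SVC(kernel="linear", C=1.0).fit(X_circ, y_circ)
rbf_svm = SVC(kernel="rbf", C=1.0).fit(X_circ, y_circ)

# Training accuracy is enough here to show whether the classes were separated.
lin_acc = lin_svm.score(X_circ, y_circ)
rbf_acc = rbf_svm.score(X_circ, y_circ)

print("linear kernel training accuracy:", lin_acc)
print("RBF kernel training accuracy:   ", rbf_acc)
```

The RBF kernel implicitly maps the points into a higher-dimensional space, which is the same effect as adding features to make the data separable.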

In [37]:
from sklearn.svm import SVC
sv = SVC(C=10)
sv.fit(X_train, y_train)
# accuracy_score expects (y_true, y_pred)
accuracy_score(y_test, sv.predict(X_test))

0.841648590021692

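### The "C" in the SVM controls how strictly the margin is enforced: it is the penalty on points that fall inside the margin or on the wrong side of it. A larger C tolerates fewer violations (narrower margin), a smaller C tolerates more (wider margin). Without allowing such violations there would be no solution for the ideal hyperplane when the data is not separable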
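A rough way to see the effect of `C` is to fit the same model with several values and compare the number of support vectors. The sketch below uses synthetic, deliberately overlapping data (`flip_y=0.1` randomly reassigns about 10% of the labels); the dataset, the linear kernel, and the three `C` values are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Overlapping classes: a hard margin would have no feasible solution here.
X_c, y_c = make_classification(
    n_samples=300, n_features=5, n_informative=3, n_redundant=0,
    flip_y=0.1, random_state=0,
)

results = {}
for C in (0.01, 1.0, 100.0):
    svm = SVC(kernel="linear", C=C).fit(X_c, y_c)
    results[C] = {
        # Points on or inside the margin end up as support vectors.
        "n_support_vectors": int(svm.n_support_.sum()),
        "train_accuracy": svm.score(X_c, y_c),
    }
    print(f"C={C:>6}: {results[C]}")
```

A smaller `C` typically gives a wider margin and therefore more support vectors, but the exact counts depend on the data.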