## Self-check Assignment (Not graded)

## Import Libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

## Get the Data

We'll use the built-in breast cancer dataset from scikit-learn. We can obtain it with the load function:

In [None]:
from sklearn.datasets import load_breast_cancer

In [None]:
cancerdata = load_breast_cancer()

The dataset is returned as a dictionary-like object:

In [None]:
cancerdata.keys()

We can grab information and arrays out of this dictionary to set up our data frame and understanding of the features:

In [None]:
print(cancerdata['DESCR'])

In [None]:
cancerdata['feature_names']
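To make the dictionary-style access above concrete, here is a small sketch of what a few of the keys hold. It reloads the dataset so that it runs on its own; the variable names `same_object`, `n_samples` and `n_features` are just illustrative.

```python
from sklearn.datasets import load_breast_cancer

cancerdata = load_breast_cancer()

# Key access and attribute access return the same underlying array.
same_object = cancerdata['data'] is cancerdata.data

# 569 samples, each described by 30 numeric features.
n_samples, n_features = cancerdata['data'].shape
print(n_samples, n_features)  # 569 30

# target_names maps label 0 -> 'malignant' and label 1 -> 'benign'.
print(list(cancerdata['target_names']))
```

Keep the label order in mind when reading the confusion matrices later on: a target value of 1 means benign here.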

## Set up DataFrame

In [None]:
df = pd.DataFrame(cancerdata['data'],columns=cancerdata['feature_names'])
df.info()

In [None]:
cancerdata['target']

In [None]:
# Note: in this dataset, target 0 = malignant and 1 = benign
df_target = pd.DataFrame(cancerdata['target'], columns=['Cancer'])
print(df_target.shape)

In [None]:
df.head()

Normally, you would start with exploratory data analysis. But we will skip that part for now and go straight to getting a prediction.

Let's apply train-test-split.

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(df, np.ravel(df_target), test_size=0.30, random_state=101)

Why did I use np.ravel() there? Google it.
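If you want a hint before searching, here is a minimal sketch on a toy one-column DataFrame (`toy_target` is a made-up stand-in for `df_target`). It only shows the shape change; the reason it matters is that scikit-learn estimators expect the target `y` as a 1-D array.

```python
import numpy as np
import pandas as pd

# A one-column DataFrame, shaped like df_target in this notebook.
toy_target = pd.DataFrame({'Cancer': [0, 1, 1, 0]})
print(toy_target.shape)  # (4, 1): a 2-D column vector

# np.ravel flattens it into a plain 1-D array.
flat = np.ravel(toy_target)
print(flat.shape)  # (4,)
```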

# Using Support Vector Machines (SVM): 

Initialize the model:

In [None]:
from sklearn.svm import SVC
model = SVC(kernel = 'rbf')
model.fit(X_train,y_train)

Our SVM Classifier model's training is complete. <br>
Now let's predict using the trained model.

In [None]:
predictions = model.predict(X_test)

In [None]:
from sklearn.metrics import classification_report,confusion_matrix

print("Confusion Matrix: \n", confusion_matrix(y_test,predictions))
print("-"*80)
print("Classification Report: \n\n", classification_report(y_test,predictions))
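To see how the numbers in the classification report relate to the confusion matrix, here is a small sketch on made-up labels (`y_true` and `y_pred` are toy data, not the notebook's results). For a binary problem, scikit-learn lays the matrix out as `[[TN, FP], [FN, TP]]`.

```python
from sklearn.metrics import confusion_matrix

# Toy labels, purely for illustration.
y_true = [0, 0, 0, 1, 1, 1, 1, 1]
y_pred = [0, 0, 1, 1, 1, 1, 1, 0]

cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()   # share of all predictions that are correct
precision = tp / (tp + fp)        # of the predicted 1s, how many are really 1
recall = tp / (tp + fn)           # of the real 1s, how many were found
print(accuracy, precision, recall)
```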

Now, consider the hyperparameters C and gamma. <br>
The best prediction accuracy is reached only for specific values of these parameters, and checking the accuracy by hand for so many different combinations of values is tedious. <br>
So, we can search for good parameters using a grid search!

# Gridsearch

Finding the right parameters (like what C or gamma values to use) is a tricky task! But luckily, we can be a little lazy and just try a bunch of combinations and see what works best! This idea of creating a 'grid' of parameters and trying out all the possible combinations is called a grid search. The method is common enough that scikit-learn has it built in as GridSearchCV! The CV stands for cross-validation.

GridSearchCV takes a dictionary that describes the parameters that should be tried and a model to train. The grid of parameters is defined as a dictionary, where the keys are the parameters and the values are the settings to be tested. 

In [None]:
param_grid = {'C': [0.1,1, 10, 100, 1000], 'gamma': [1,0.1,0.01,0.001,0.0001]} 
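To get a feel for how much work this grid implies, the sketch below expands it into explicit combinations using only the standard library. The fold count of 5 is an assumption: it is the default `cv` in recent scikit-learn versions when none is passed.

```python
from itertools import product

param_grid = {'C': [0.1, 1, 10, 100, 1000], 'gamma': [1, 0.1, 0.01, 0.001, 0.0001]}

# Every pairing of one C value with one gamma value.
combinations = [dict(zip(param_grid, values)) for values in product(*param_grid.values())]
print(len(combinations))  # 25

n_folds = 5  # assumption: default number of cross-validation folds
total_fits = len(combinations) * n_folds
print(total_fits)  # 125 model fits, plus one final refit
```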

In [None]:
from sklearn.model_selection import GridSearchCV

One of the great things about GridSearchCV is that it is a meta-estimator. It takes an estimator like SVC, and creates a new estimator, that behaves exactly the same - in this case, like a classifier.

Personal Tip: <br>
GridSearch might take a lot of time depending on the size of your dataset and the number of parameters you want to test. <br>
Your cell may run for a long time. To confirm that nothing is wrong, I'd recommend setting verbose to a non-zero number: the higher the number, the more text output you get describing the progress.

In [None]:
# Note: despite the "rfc" in its name, this grid search wraps an SVC
improved_rfc = GridSearchCV(SVC(),param_grid,refit=True,verbose=3)

What fit does is a bit more involved than usual. First, it runs a cross-validation loop over every parameter combination in the grid to find the best one. Once it has the best combination, it runs fit again on all the data passed to fit (without cross-validation) to build a single new model using the best parameter setting.
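The two steps can be sketched by hand roughly as follows. This is a simplified illustration, not GridSearchCV's actual implementation: it uses a small hypothetical list of `candidates` and the full dataset to keep it quick and self-contained.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

# Small hypothetical grid, just to keep the sketch fast.
candidates = [{'C': c, 'gamma': g} for c in (1, 10) for g in (0.001, 0.0001)]

best_score, best_params = -1.0, None
for params in candidates:
    # Step 1: score each combination with 5-fold cross-validation.
    score = cross_val_score(SVC(**params), X, y, cv=5).mean()
    if score > best_score:
        best_score, best_params = score, params

# Step 2: refit one model on all the data with the winning combination.
final_model = SVC(**best_params).fit(X, y)
print(best_params, round(best_score, 3))
```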

In [None]:
# May take a while!
improved_rfc.fit(X_train,y_train)

You can inspect the best parameters found by GridSearchCV in the `best_params_` attribute, and the best estimator in the `best_estimator_` attribute:

In [None]:
improved_rfc.best_params_

In [None]:
improved_rfc.best_estimator_

Then you can re-run predictions on this grid object just like you would with a normal model.

In [None]:
improved_rfc_predictions = improved_rfc.predict(X_test)
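Calling predict on the grid object works because, with refit=True, it hands the call over to the refitted best estimator. A small sketch to check this, using a reduced hypothetical grid (`small_grid`) so that it runs quickly:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=101)

# Reduced grid, chosen only to keep the example fast.
small_grid = {'C': [1, 10], 'gamma': [0.001, 0.0001]}
grid = GridSearchCV(SVC(), small_grid, refit=True)
grid.fit(X_train, y_train)

# Predictions from the grid object and from its best estimator agree.
same = np.array_equal(grid.predict(X_test),
                      grid.best_estimator_.predict(X_test))
print(grid.best_params_, same)
```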

In [None]:
print("Confusion Matrix: \n", confusion_matrix(y_test, improved_rfc_predictions))
print("-"*80)
print("Classification Report: \n\n", classification_report(y_test, improved_rfc_predictions))

**Note**: If the parameter grid has many combinations or the dataset is large, GridSearchCV can be computationally expensive. <br>In that case, you might want to look up "**RandomizedSearchCV**", which samples a fixed number of combinations instead of trying them all.
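In scikit-learn the randomized variant is the `RandomizedSearchCV` class. Here is a minimal sketch of how it could be used with the same C and gamma values as above. The settings `n_iter=8` and `cv=3` are arbitrary choices for illustration, and the full dataset is used to keep the example self-contained.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

param_distributions = {'C': [0.1, 1, 10, 100, 1000],
                       'gamma': [1, 0.1, 0.01, 0.001, 0.0001]}

# Try only 8 randomly sampled combinations instead of all 25.
search = RandomizedSearchCV(SVC(), param_distributions,
                            n_iter=8, cv=3, random_state=0)
search.fit(X, y)

print(len(search.cv_results_['params']))  # 8
print(search.best_params_)
```

The trade-off is that a random sample can miss the best combination, but the cost is fixed by n_iter rather than by the size of the grid.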

gg.

____

# Using Random Forest Classifier

Tree-based models are one of the most powerful sets of machine learning algorithms.
They use a series of if-then rules to generate predictions from one or more decision trees.
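To see what those if-then rules look like, you can print a small tree as text. This sketch trains a deliberately shallow tree (`max_depth=2` is an arbitrary choice so the output stays short) and uses scikit-learn's `export_text` helper.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier, export_text

data = load_breast_cancer()

# A very shallow tree, so the printed rules stay readable.
shallow_tree = DecisionTreeClassifier(max_depth=2, random_state=0)
shallow_tree.fit(data.data, data.target)

# Each line is a threshold test on one feature; leaves show the predicted class.
rules = export_text(shallow_tree, feature_names=list(data.feature_names))
print(rules)
```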


In [None]:
print(df.info())
df.head(2)

It is a scikit-learn convention: estimators accept matrices of numbers, not strings or other data types. This allows them to be agnostic to data type - each estimator can handle tabular, text data, images, etc. But it means you need to convert your data to numbers.

Since all the columns in our DataFrame have numeric values, we can apply the Random Forest Classifier directly.<br>
If there had been some categorical features with categories as strings, we would have to use OneHotEncoder or a similar technique to convert them to a numeric format. I'd urge you to Google it yourself.
<br>Search: **pd.get_dummies()**
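As a starting point for that search, here is a tiny sketch on a made-up DataFrame (the `size_mm` and `shape` columns are invented for illustration and are not part of the cancer dataset).

```python
import pandas as pd

# Toy data with one numeric and one string-valued categorical column.
toy = pd.DataFrame({'size_mm': [12.0, 20.5, 7.3],
                    'shape': ['round', 'oval', 'round']})

# Each category becomes its own indicator column.
encoded = pd.get_dummies(toy, columns=['shape'])
print(encoded.columns.tolist())  # ['size_mm', 'shape_oval', 'shape_round']
```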

In [None]:
df_target.head()

In [None]:
X_train, X_test, y_train, y_test = train_test_split(df, df_target, test_size=0.30, random_state = 42)

## Training a Decision Tree Model

Let's start by training a single decision tree first!

**Import DecisionTreeClassifier**

In [None]:
from sklearn.tree import DecisionTreeClassifier

dtree = DecisionTreeClassifier()
dtree.fit(X_train,y_train)

In [None]:
predn = dtree.predict(X_test)
from sklearn.metrics import classification_report,confusion_matrix

print("Confusion Matrix: \n", confusion_matrix(y_test, predn))
print("-"*80)
print("Classification Report: \n\n", classification_report(y_test,predn))

Now, let's understand what a Random Forest is:

### Random Forest:
Imagine that you are a contestant on the show Kaun Banega Crorepati.
You come across a really tough question.<br>
Luckily, you still have the Audience Poll lifeline left.<br>
Since you aren't sure about the answer, you might prefer to aggregate the guesses of many audience members rather than go with your own guess alone.<br>
Many weak guesses together make a strong guess. <br>

Think of the contestant as a single decision tree and the audience as a group of decision trees or a "Forest".
One individual tree might not be a great predictor. But, combining the predictions of many trees gives us a pretty good model!
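Here is a rough sketch of that "audience poll" in code. One detail worth knowing: scikit-learn's forest averages the class probabilities predicted by its trees rather than counting hard votes. The forest size of 25 is an arbitrary choice for this example.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)
forest = RandomForestClassifier(n_estimators=25, random_state=0).fit(X, y)

# Ask every tree in the forest for its class probabilities and average them.
mean_proba = np.mean([tree.predict_proba(X) for tree in forest.estimators_], axis=0)

# The class with the highest average probability is the forest's answer.
manual_pred = forest.classes_[np.argmax(mean_proba, axis=1)]
agrees = np.array_equal(manual_pred, forest.predict(X))
print(agrees)
```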

Let's try to implement it.

In [None]:
from sklearn.ensemble import RandomForestClassifier

rfc = RandomForestClassifier(n_estimators=600)
rfc.fit(X_train, y_train.values.ravel())  # flatten the one-column target to 1-D

In [None]:
predn = rfc.predict(X_test)

from sklearn.metrics import classification_report,confusion_matrix

print("Confusion Matrix: \n", confusion_matrix(y_test, predn))
print("-"*80)
print("Classification Report: \n\n", classification_report(y_test,predn))

You should see an improvement with the Random Forest approach compared to a single Decision Tree model.<br> This tends to be even more obvious when you work with larger and more complex datasets.

___

Try applying GridSearchCV to the Random Forest model and see if you are able to increase the accuracy any further. <br>If you can't decide on the appropriate parameter values for the grid, try Googling it :)

I've done it below for you. Check it only after you have given it a shot. You'll learn much more that way!

In [None]:
param_grid = {    
    'n_estimators': [100, 200, 500],
    'max_features': ['sqrt', 'log2'],  # 'auto' was removed in scikit-learn 1.3; for classifiers it meant 'sqrt'
    'max_depth' : [4,5,6,7,8],
    'criterion' :['gini', 'entropy']
}

In [None]:
CV_rfc = GridSearchCV(estimator=rfc, param_grid=param_grid, cv= 5, verbose= 3)
CV_rfc.fit(X_train, y_train.values.ravel())

In [None]:
CV_rfc.best_params_

In [None]:
new_rfc = CV_rfc.best_estimator_
# With the default refit=True this estimator is already fitted on X_train; refitting is optional
new_rfc.fit(X_train, y_train.values.ravel())
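A side note on best_estimator_: GridSearchCV's refit parameter defaults to True, so the best estimator comes back already trained on everything passed to fit. The sketch below checks this with a tiny hypothetical grid (`tiny_grid`) and scikit-learn's `check_is_fitted` helper.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.utils.validation import check_is_fitted

X, y = load_breast_cancer(return_X_y=True)

# Tiny grid, chosen only so the example finishes quickly.
tiny_grid = {'n_estimators': [10, 20], 'max_depth': [4, 6]}
search = GridSearchCV(RandomForestClassifier(random_state=0), tiny_grid, cv=3)
search.fit(X, y)

# Raises NotFittedError if the estimator had not been fitted.
check_is_fitted(search.best_estimator_)

# So it can predict right away, without another call to fit.
ready_to_predict = search.best_estimator_.predict(X[:5])
print(ready_to_predict)
```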

In [None]:
pred = new_rfc.predict(X_test)

from sklearn.metrics import classification_report,confusion_matrix

print("Confusion Matrix: \n", confusion_matrix(y_test, pred))
print("-"*80)
print("Classification Report: \n\n", classification_report(y_test,pred))

gg.