<a href="https://colab.research.google.com/github/clementbowe14/ml-class/blob/main/labs/Working_Version_of_WorkingWithData.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Working with Data

#### Part of the [Inquiryum Machine Learning Fundamentals Course](http://inquiryum.com/machine-learning/)

In the examples we have been working with so far, all the columns had numerical data. For example, the violet classification data looked like:
    
    
Sepal Length|Sepal Width|Petal Length|Petal Width|Class
:--: | :--: |:--: |:--: |:--: 
5.3|3.7|1.5|0.2|Iris-setosa
5.0|3.3|1.4|0.2|Iris-setosa
5.0|2.0|3.5|1.0|Iris-versicolor
5.9|3.0|4.2|1.5|Iris-versicolor
6.3|3.4|5.6|2.4|Iris-virginica
6.4|3.1|5.5|1.8|Iris-virginica

Notice that all the feature columns had numeric data. This isn't always the case. In addition to **numeric** data, datasets often contain **categorical data**. A column that contains **categorical data** means that the values are from a limited set of values. For example:

![](https://raw.githubusercontent.com/zacharski/ml-class/master/labs/pics/rotten.png)

Movie | Tomato Rating | Genre | Rating | Length 
:---: | :---: | :---: | :---: | :---:  
First Man | 88 | Drama | PG-13 | 138
Can You Ever Forgive Me | 98 | Drama | R | 107
The Girl in the Spider's Web | 41 | Drama | R | -99
Free Solo | 99 | Documentary | PG-13 | 97
The Grinch | 57 | Animation | PG | 86
Overlord | 80 | Action | R | 109
Christopher Robin | 71 | Comedy | PG | -99
Ant Man and the Wasp  |  88 | Science Fiction | PG-13 | 118

Numeric columns like `Tomato Rating` and `Length` are fine as is, but the columns `Genre` and `Rating` are problematic for machine learning. Those columns contain categorical data which again means that the values of those columns are from a limited set of possibilities. Modern machine learning algorithms are designed to handle only numeric and boolean (True, False) data. So, as a preprocessing step, we will need to convert the categorical columns to numeric. One solution would be simply to map each categorical value to an integer. So drama is 1, documentary 2 etc:

index | genre
 :--: | :--:
 1 | Drama
 2 | Documentary
 3 | Animation
 4| Action
 5 | Comedy
 6 | Science Fiction

Using this scheme we can convert the original data to:

Movie | Tomato Rating | Genre | Rating | Length 
:---: | :---: | :---: | :---: | :---:  
First Man | 88 | 1 | 1 | 138
Can You Ever Forgive Me | 98 | 1 | 2 | 107
The Girl in the Spider's Web | 41 | 1 | 2 | -99
Free Solo | 99 | 2 | 1 | 97
The Grinch | 57 | 3 | 3 | 86
Overlord | 80 | 4 | 2 | 109
Christopher Robin | 71 | 5 | 3 | -99
Ant Man and the Wasp  |  88 | 6 | 1 | 118


But this solution is problematic in a different way. Integers infer  both an ordering and a distance where 2 is closer to 1 than 4. 

![](https://raw.githubusercontent.com/zacharski/ml-class/master/labs/pics/GenreLine.png)

Since in the genre column 1 is drama, 2 is documentary, and 4 is action, our scheme implies that dramas are closer to documentaries than they are to action films, which is clearly not the case. This problem also exists in the rating column (PG,PG13, R). Mapping the categories to integers in a different way will not fix this problem. No matter how clever we are in making this mapping, the problem will still exist. **So clearly this method is not the way to go**!
![](https://raw.githubusercontent.com/zacharski/datamining-guide/master/labs/pics/divider.png)
### One Hot Encoding
The solution is to do what is called one hot encoding. Our original table looked like:


Movie | Tomato Rating | Genre | Rating | Length 
:---: | :---: | :---: | :---: | :---:  
First Man | 88 | Drama | PG-13 | 138
Can You Ever Forgive Me | 98 | Drama | R | 107
The Girl in the Spider's Web | 41 | Drama | R | -99
Free Solo | 99 | Documentary | PG-13 | 97
The Grinch | 57 | Animation | PG | 86
Overlord | 80 | Action | R | 109
Christopher Robin | 71 | Comedy | PG | -99
Ant Man and the Wasp  |  88 | Science Fiction | PG-13 | 118

So, for example, we had the categorical column genre with the possible values drama, documentary, animation, action, comedy and science fiction. Instead of one column with those values, we are going to convert it to a form where each value is its own column.

![](https://raw.githubusercontent.com/zacharski/ml-class/master/labs/pics/normalize2.jpg)


If that data instance is of that value then we would put a **one** in the column, otherwise we would put a zero. For example, since *The Girl in the Spider's Web* is a drama, we would put a 1 in the drama column and a zero in the animation column. So we would convert

Movie | Genre 
:---: | :---: 
First Man | Drama 
Can You Ever Forgive Me | Drama
The Girl in the Spider's Web |  Drama 
Free Solo |  Documentary 
The Grinch |  Animation 
Overlord |  Action
Christopher Robin |  Comedy
Ant Man and the Wasp  |   Science Fiction

to

Movie | Drama | Documentary | Animation | Action | Comedy | Science Fiction
:--: | :--: | :--: | :--: | :--: | :--: | :--: 
First Man | 1 | 0 | 0| 0| 0 | 0 
Can You Ever Forgive Me | 1 | 0 | 0| 0| 0 | 0 
The Girl in the Spider's Web | 1 | 0 | 0| 0| 0 | 0 
Free Solo | 0 | 1 | 0| 0| 0 | 0 
The Grinch | 0 | 0 | 1| 0| 0 | 0 
Overlord | 0 | 0 | 0| 1| 0 | 0 
Christopher Robin | 0 | 0 | 0| 0| 1 | 0 
Ant Man and the Wasp | 0 | 0 | 0| 0| 0 | 1 

Notice that the movie *First Man* has a one in the drama column and zeroes elsewhere. The movie *Free Solo* has a one in the documentary column and zeroes elsewhere.
This is the prefered way of converting categorical data (when we work with text we will see other options). An added benefit to this approach is now an instance can be of multiple categories. For example, we may want to categorize *Ant Man and the Wasp* as both a comedy and science fiction, and that is easy to do in this scheme:

Movie | Drama | Documentary | Animation | Action | Comedy | Science Fiction
:--: | :--: | :--: | :--: | :--: | :--: | :--: 
First Man | 1 | 0 | 0| 0| 0 | 0 
Can You Ever Forgive Me | 1 | 0 | 0| 0| 0 | 0 
The Girl in the Spider's Web | 1 | 0 | 0| 0| 0 | 0 
Free Solo | 0 | 1 | 0| 0| 0 | 0 
The Grinch | 0 | 0 | 1| 0| 0 | 0 
Overlord | 0 | 0 | 0| 1| 0 | 0 
Christopher Robin | 0 | 0 | 0| 0| 1 | 0 
Ant Man and the Wasp | 0 | 0 | 0| 0| 1 | 1 



If we one-hot encoded all the categorical columns in our original dataset it would look like:

Movie            | Tomato Rating | Action | Animation | Comedy | Documentary | Drama | Science Fiction | PG | PG-13 | R | Length 
:---: | :---: | :---: | :---: | :---: |  :---: |  :---: |  :---: |  :---: |  :---: |  :---: |  :---: 
First Man        | 88            |  0     |    0      |   0    | 0           | 1     | 0    | 0 | 1  |    0| 138
Can You Ever Forgive Me | 98 |      0     |    0      |   0    | 0           | 1     | 0    | 0 | 0  |   1|   107
The Girl in the Spider's Web | 41 |  0     |    0      |   0    | 0           | 1     | 0    | 0 | 1  |    0|    -99
Free Solo | 99 |  0     |    0      |   0    | 1           | 0     | 0    | 0 | 1  |    0|   97
The Grinch | 57 |  0     |    1      |   0    | 0           | 0     | 0    | 1 | 0  |    0| 86
Overlord | 80 |  1     |    0      |   0    | 0           | 0     | 0    | 0 | 1  |    0|  109
Christopher Robin | 71  |  0     |    0      |   1   | 0           | 0     | 0    | 1 | 0  |    0| -99
Ant Man and the Wasp   |  0    |    0      |   0    | 0       | 0    | 0     | 1    | 0 | 1  |    0| 118



### Coding
Let's investigate this a bit with a coding example. 


In [14]:
import pandas as pd
bike = pd.read_csv('https://raw.githubusercontent.com/zacharski/ml-class/master/data/bike.csv')
bike = bike.set_index('Day')
bike


Unnamed: 0_level_0,Outlook,Temperature,Humidity,Wind,Bike
Day,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
1,Sunny,Hot,High,Weak,No
2,Sunny,Hot,High,Strong,No
3,Overcast,Hot,High,Weak,Yes
4,Rain,Mild,High,Weak,Yes
5,Rain,Cool,Normal,Weak,Yes
6,Rain,Cool,Normal,Strong,No
7,Overcast,Cool,Normal,Strong,Yes
8,Sunny,Mild,High,Weak,No
9,Sunny,Cool,Normal,Weak,Yes
10,Rain,Mild,Normal,Weak,Yes


Here we are trying to predict whether someone will mountain bike or not based on the outlook, temperature, humidity, and wind. 
Let's forge ahead and see if we can build a decision tree classifier:

In [15]:
from sklearn import tree
clf = tree.DecisionTreeClassifier(criterion='entropy')
clf.fit(bike[['Outlook', 'Temperature', 'Humidity', 'Wind']], bike['Bike'])

ValueError: ignored

And we see that doesn't work. We get the error:

```
ValueError: could not convert string to float: 'Sunny'
```

We need to one-hot encode these categorical columns.  Here is how to convert the Outlook column. The steps are

1. Create a new Dataframe of the one-hot encoded values for the Outlook column.
2. Drop the Outlook column from the original Dataframe.
3. Join the new one-hot encoded Dataframe to the original.
![](https://raw.githubusercontent.com/zacharski/datamining-guide/master/labs/pics/divider.png)

#### 1. Create the new Dataframe

In [16]:
one_hot = pd.get_dummies(bike['Outlook'])
one_hot

Unnamed: 0_level_0,Overcast,Rain,Sunny
Day,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
1,0,0,1
2,0,0,1
3,1,0,0
4,0,1,0
5,0,1,0
6,0,1,0
7,1,0,0
8,0,0,1
9,0,0,1
10,0,1,0


Nice.
![](https://raw.githubusercontent.com/zacharski/datamining-guide/master/labs/pics/divider.png)

#### 2. Drop the outlook column from the original Dataframe:


In [17]:
bike = bike.drop('Outlook', axis=1)


![](https://raw.githubusercontent.com/zacharski/datamining-guide/master/labs/pics/divider.png)

#### 3. join the one-hot encoded Dataframe to the original

In [18]:
bike = bike.join(one_hot)
bike

Unnamed: 0_level_0,Temperature,Humidity,Wind,Bike,Overcast,Rain,Sunny
Day,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
1,Hot,High,Weak,No,0,0,1
2,Hot,High,Strong,No,0,0,1
3,Hot,High,Weak,Yes,1,0,0
4,Mild,High,Weak,Yes,0,1,0
5,Cool,Normal,Weak,Yes,0,1,0
6,Cool,Normal,Strong,No,0,1,0
7,Cool,Normal,Strong,Yes,1,0,0
8,Mild,High,Weak,No,0,0,1
9,Cool,Normal,Weak,Yes,0,0,1
10,Mild,Normal,Weak,Yes,0,1,0


It is simple, but a little tedious. Let's finish up encoding the other columns:


In [19]:
one_hot = pd.get_dummies(bike['Temperature'])
bike = bike.drop('Temperature', axis=1)
bike = bike.join(one_hot)
one_hot = pd.get_dummies(bike['Humidity'])
bike = bike.drop('Humidity', axis=1)
bike = bike.join(one_hot)
one_hot = pd.get_dummies(bike['Wind'])
bike = bike.drop('Wind', axis=1)
bike = bike.join(one_hot)

bike

Unnamed: 0_level_0,Bike,Overcast,Rain,Sunny,Cool,Hot,Mild,High,Normal,Strong,Weak
Day,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1
1,No,0,0,1,0,1,0,1,0,0,1
2,No,0,0,1,0,1,0,1,0,1,0
3,Yes,1,0,0,0,1,0,1,0,0,1
4,Yes,0,1,0,0,0,1,1,0,0,1
5,Yes,0,1,0,1,0,0,0,1,0,1
6,No,0,1,0,1,0,0,0,1,1,0
7,Yes,1,0,0,1,0,0,0,1,1,0
8,No,0,0,1,0,0,1,1,0,0,1
9,Yes,0,0,1,1,0,0,0,1,0,1
10,Yes,0,1,0,0,0,1,0,1,0,1


Great! Now we can train our classifier. I will just cut and paste the previous `clf.fit` and ...

In [20]:
clf.fit(bike[['Outlook', 'Temperature', 'Humidity', 'Wind']], bike['Bike'])

KeyError: ignored

Well that didn't work. The clf.fit instruction was

```
clf.fit(bike[['Outlook', 'Temperature', 'Humidity', 'Wind']], bike['Bike'])
```
So we instruct it to use the Outlook, Temperature, Humidity, and Wind columns, but we just deleted them. Instead we have the following columns:

In [21]:
list(bike.columns)

['Bike',
 'Overcast',
 'Rain',
 'Sunny',
 'Cool',
 'Hot',
 'Mild',
 'High',
 'Normal',
 'Strong',
 'Weak']

Using that list let's divide up our data into the label (what we are trying to predict) and the features (what we are using to make the prediction).

In [22]:
fColumns = list(bike.columns)
fColumns.remove('Bike')
bike_features = bike[fColumns]
bike_features

Unnamed: 0_level_0,Overcast,Rain,Sunny,Cool,Hot,Mild,High,Normal,Strong,Weak
Day,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
1,0,0,1,0,1,0,1,0,0,1
2,0,0,1,0,1,0,1,0,1,0
3,1,0,0,0,1,0,1,0,0,1
4,0,1,0,0,0,1,1,0,0,1
5,0,1,0,1,0,0,0,1,0,1
6,0,1,0,1,0,0,0,1,1,0
7,1,0,0,1,0,0,0,1,1,0
8,0,0,1,0,0,1,1,0,0,1
9,0,0,1,1,0,0,0,1,0,1
10,0,1,0,0,0,1,0,1,0,1


and now the label:

In [23]:
bike_labels = bike[['Bike']]
bike_labels

Unnamed: 0_level_0,Bike
Day,Unnamed: 1_level_1
1,No
2,No
3,Yes
4,Yes
5,Yes
6,No
7,Yes
8,No
9,Yes
10,Yes


Now, finally, we can train our decision tree classifier.

In [24]:
clf.fit(bike_features, bike_labels)

DecisionTreeClassifier(criterion='entropy')

As you can see, preparing the data, can actually take a longer time than running the machine learning component.

### `get_dummies` not the only way

There are other methods to one-hot encode a dataset. For example, sklearn has a class, [OneHotEncoder](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html?highlight=one%20hot%20encorder), which might be a better option for many machine learning tasks. The reason I selected `get_dummies` for this notebook was a pedagogical one---`get_dummies` is a bit more transparent and you get a better sense of what one-hot-encoding does.

![](https://raw.githubusercontent.com/zacharski/datamining-guide/master/labs/pics/divider.png)

### The `One Hot Encover` method


### Conditionals for munging data

Let's say we have this small DataFrame

In [19]:
bike = pd.read_csv('https://raw.githubusercontent.com/zacharski/ml-class/master/data/bike.csv')
bike = bike.set_index('Day')
bike

Unnamed: 0_level_0,Outlook,Temperature,Humidity,Wind,Bike
Day,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
1,Sunny,Hot,High,Weak,No
2,Sunny,Hot,High,Strong,No
3,Overcast,Hot,High,Weak,Yes
4,Rain,Mild,High,Weak,Yes
5,Rain,Cool,Normal,Weak,Yes
6,Rain,Cool,Normal,Strong,No
7,Overcast,Cool,Normal,Strong,Yes
8,Sunny,Mild,High,Weak,No
9,Sunny,Cool,Normal,Weak,Yes
10,Rain,Mild,Normal,Weak,Yes


In [32]:
from sklearn.preprocessing import OneHotEncoder
enc = OneHotEncoder(handle_unknown='ignore')
bikeHot = enc.fit_transform(bike)
bikeHot

<14x22 sparse matrix of type '<class 'numpy.float64'>'
	with 154 stored elements in Compressed Sparse Row format>

Notice that the line 

```
bikeHot
```
did not print the matrix as we expected. Instead it instructs us that `bikeHot` is a space matrix. In our previous Rotten Tomato example, and the bike one, there were mostly zeros in the matrix. For large matrices, these zeros take a  lot of room. Instead we can just record the non-zero values using this sparce matrix representation. When we use `print` we can see this representation:

In [33]:
print(bikeHot)

  (0, 0)	1.0
  (0, 2)	1.0
  (0, 4)	1.0
  (0, 7)	1.0
  (0, 8)	1.0
  (0, 11)	1.0
  (0, 12)	1.0
  (0, 15)	1.0
  (0, 16)	1.0
  (0, 18)	1.0
  (0, 21)	1.0
  (1, 0)	1.0
  (1, 2)	1.0
  (1, 4)	1.0
  (1, 7)	1.0
  (1, 8)	1.0
  (1, 11)	1.0
  (1, 12)	1.0
  (1, 15)	1.0
  (1, 16)	1.0
  (1, 19)	1.0
  (1, 20)	1.0
  (2, 1)	1.0
  (2, 3)	1.0
  (2, 4)	1.0
  :	:
  (11, 16)	1.0
  (11, 19)	1.0
  (11, 20)	1.0
  (12, 1)	1.0
  (12, 3)	1.0
  (12, 4)	1.0
  (12, 6)	1.0
  (12, 8)	1.0
  (12, 11)	1.0
  (12, 12)	1.0
  (12, 14)	1.0
  (12, 17)	1.0
  (12, 18)	1.0
  (12, 21)	1.0
  (13, 0)	1.0
  (13, 2)	1.0
  (13, 5)	1.0
  (13, 6)	1.0
  (13, 8)	1.0
  (13, 10)	1.0
  (13, 13)	1.0
  (13, 15)	1.0
  (13, 16)	1.0
  (13, 19)	1.0
  (13, 20)	1.0


![](https://raw.githubusercontent.com/zacharski/datamining-guide/master/labs/pics/divider.png)

Let's look at an example that has a column with only two possible values ...

In [None]:
from pandas import DataFrame
students = DataFrame({'name': ['Ann', 'Ben', 'Clara', 'Danielle', 'Eric', 'Akash'],
                     'sex':   ['f', 'm', 'f', 'f', 'm', 'm'],
                     'age': [21, 18, 23, 19, 20, 21]})
students

The column sex is categorical so we need to convert it. We could 
* one-hot-encode it and have two columns: f and m. 
* one-hot-encode it, have two columns: f and m, and then delete one of those columns
* create a column female and populate it correctly using a conditional

All three are fine options with the last 2 slightly better since they reduce the dimensionality. Let's see how we can do the last one using a lambda expression:

In [None]:
students['female'] =  students['sex'].apply(lambda x: True if x == 'f' else False)
students  = students.drop('sex', axis=1) 
students

That's great! 

Now suppose we think that whether or not a person is under 20 is relevant for our machine learning task. We can use the same type of lambda expression to create this new column:


In [None]:
students['under20'] =  students['age'].apply(lambda x: True if x < 20 else False)
students

As you can see, working with machine learning involves working on a pipeline of various processes--we don't start with the machine learning algorithm. Before leaping into the ML algorithm, takes some time to explore the data, decide if it needs to be cleaned in any way, one-hot-encoded, if some features are not needed or if new ones need to be added.

## Hyperparameters.

When we train a machine learning model (using `fit` in this case), the model learns a set of **parameters**. For example, in decision trees, one parameter is the depth of the tree. The depth isn't determined until the `fit` method finishes. The important point is that parameters are what the model learns on its own from analyzing the training dataset and not something we adjust.

In contrast **hyperparameters** are things we determine and not determined by the algorithm. Hyperparameters are set before the model looks at the training data--in our case before `fit`.  For decision trees there are a number of these hyperparameters. We already saw two: `max_depth` controls the size of the tree and `criterion`. Adjusting one hyperparameter may improve the accuracy of your classifier or it may worsen it. 

We have already learned that we shouldn't test our model using the same data that we trained on. Why not? Because the model is already tuned to the specific instances in our training data. In a kNN classifier, it may memorize every instance in our training data--*Gabby Douglas who is 49 inches tall and weighs 90 pounds is a gymnast*.  If we test using that same data, the accuracy will tend to be higher than if we tested using data the classifier has never seen before. Again, if we told the algorithm someone who is 49 inches tall and 90 pounds is a gymnast, we shouldn't find it surprising that if we asked it what sport does someone play who is 49 inches and 90 pounds, and the algorithm predicts *gymnast*. We want to see if the algorithm learned or generalized something from processing the dataset. In some previous labs, we reserved 20% of the original data to test on and used 80% for training. 

Now let's imagine a process where we will adjust hyperparameters to improve the accuracy of our model. So we build a classifier with one setting of the hyperparameters and build another with a different setting for the hyperparameters and see which one is more accurate. One approach might be:

1. Use 80% of the data to train on.
2. Test the classifier using the 20% test set and get the accuracy.
3. Adjust a hyperparameter and create a new classifier
4. Use the same 80% of the data to train the new classifier
5. Test the classifier using the 20% test set and get the accuracy.
6. Keep repeating this to find the value of the hyperparameter that performs the best.
7. The accuracy of your classifier will be the highest one obtained from evaluating the 20% test set.

The problem with this approach is that since we are tuning the hyperparameters based on the accuracy on the test set, some of the information about the test set is leaking into our classifier. Let me explain about information leaking into the classifier.

Let's look at our example of categorizing athletes into one of three categories: gymnast, basketball player, and marathoner. Leilani
Mitchell is not in the training set but is in the test set. She is 5 foot 5 inches tall and weighs 138. Initially, she was among the instances in the test set that were misclassified. We kept adjusting the hyperparameters until we improved accuracy and now she is correctly classified as a basketball player. So we tuned our classifier to work well with her and others in the test set. That is what we mean by information from the test set leaking into the classifier.

So again, we may get an arbitrary higher accuracy that is not reflective of the algorithm's performance on unseen data.

**So what can we do?**

The solution is to divide the original dataset into three:

1. the training set which we use to train our model.
2. the validation set which we use to test our model so we can adjust the hyperparameters
3. the test set which we use to perform an evaluation of the final model fit on the training set. We make our final adjustment of the hyperparameters **before** we evaluate the model using the test set.

There are many ways to divide up the original data into these three sets. For example, maybe 20% is reserved for the test set, 20% for validation and 60% for training.  However, there is a slightly better way.

### Cross Validation
For cross validation we are going to divide the dataset (typically just the training dataset) into roughly equal sized buckets. Typically we would divide the data into 10 parts and this is called 10-fold cross validation. To reiterate, with this method we have one data set which we divide randomly into 10 parts. We use 9 of
those parts for training and reserve one tenth for validation. We repeat this procedure 10 times
each time reserving a different tenth for validation.

Let’s look at an example. Suppose we want to build a classifier that just answers yes or no to
the question *Is this person a professional basketball player?* And our data consists of information
about 500 basketball players and 500 non-basketball players. 

#### Step 1. Divide the data into 10 bucks.

![](https://raw.githubusercontent.com/zacharski/ml-class/master/labs/pics/buckets.png)
We put 50 basketball players and 50 non-players in each bucket so each bucket contains information on 100 individuals.

#### Step 2. We iterate through the following steps 10 times

1. During each iteration hold back one of the buckets. For iteration 1, we will hold back bucket 1, iteration 2, bucket 2, and so on.
2. We will train the classifier with data from the other buckets. (during the first iteration we will train with the data in buckets 2 through 10)
3. We will validate the classifier we just built using data from the bucket we held back and save the results. In our case these results might be: 35 of the basketball players were classified correctly and  29 of the non basketball players were classified correctly. 


#### Step 3. we sum up the results.
Once we finish the ten iterations we sum the results. Perhaps we find that 937 of the 1,000 individuals were categorized correctly. 

#### Summary
Using cross-validation, every instance in our data is used in training and, in a different iteration, in validation. This results in a less biased model. By **bias** we mean that the algorithm is less accurate due to it not taking into account all relevant information in the data. With cross-validation we typically train on a larger percentage of the data than we would if we set aside a fixed validation set. One small disadvantage is that it now take 10 times as long to run.


### Leave One Out
Here is a suggestion from Lucy:
![](https://raw.githubusercontent.com/zacharski/ml-class/master/labs/pics/nfold.png)

In the machine learning literature, n-fold cross validation (where n is the number of samples
in our data set) is called leave-one-out. Lucy above, already mentioned one benefit of leave-one-out—
at every iteration we are using the largest possible amount of our data for training. The other
benefit is that it is deterministic.

#### What do we mean by ‘deterministic’?

Suppose Lucy spends an intense 80 hour week creating and coding a new classifier. It is
Friday and she is exhausted so she asks two of her colleagues (Emily and Li) to evaluate the
classifier over the weekend. She gives each of them the classifier and the same dataset and
asks them to use 10-fold cross validation. On Monday she asks for the results ..

![](https://raw.githubusercontent.com/zacharski/ml-class/master/labs/pics/nfoldwomen2.png)

Hmm. They did not get the same results. Did Emily or Li make a mistake? Not necessarily. In
10-fold cross validation we place the data randomly into 10 buckets. Since there is this
random element, it is likely that Emily and Li did not divide the data into buckets in exactly
the same way. In fact, it is highly unlikely that they did. So when they train the classifier, they
are not using exactly the same data and when they test this classifier they are using different
test sets. So it is quite logical that they would get different results. This result has nothing to
do with the fact that two different people were performing the evaluation. If Lucy herself ran
10-fold cross validation twice, she too would get slightly different results. The reason we get
different results is that there is a random component to placing the data into buckets. So 10-
fold cross validation is called non-deterministic because when we run the test again we are
not guaranteed to get the same result. In contrast, the leave-one-out method is deterministic.
Every time we use leave-one-out on the same classifier and the same data we will get the
same result. That is a good thing!

#### The disadvantages of leave-one-out
The main disadvantage of leave-one-out is the computational expense of the method.
Consider a modest-sized dataset of 10,000 instances and that it takes one minute to train a
classifier. For 10-fold cross validation we will spend 10 minutes in training. In leave-one-out
we will spend 16 hours in training. If our dataset contains 10 million entries the total time
spent in training would nearly be two years. Eeeks!

![](https://raw.githubusercontent.com/zacharski/ml-class/master/labs/pics/twoyears.png)


The other disadvantage of leave-one-out is related to stratification.



#### Stratification.
Let us return to the example of building a classifier that predicts
what sport a woman plays (basketball, gymnastics, or track). When training the classifier we
want the training data to be representative and contain data from all three classes. Suppose
we assign data to the training set in a completely random way. It is possible that no
basketball players would be included in the training set and because of this, the resulting
classifier would not be very good at classifying basketball players. Or consider creating a data
set of 100 athletes. First we go to the Women’s NBA website and write down the info on 33
basketball players; next we go to Wikipedia and get 33 women who competed in gymnastics
at the 2012 Olympics and write that down; finally, we go again to Wikipedia to get
information on women who competed in track at the Olympics and record data for 34 people.
So our dataset looks like this:

![](https://raw.githubusercontent.com/zacharski/ml-class/master/labs/pics/womensports.png)

Let’s say we are doing 10-fold cross validation. We start at the beginning of the list and put
every ten people in a different bucket. In this case we have 10 basketball players in both the
first and second buckets. The third bucket has both basketball players and gymnasts. The
fourth and fifth buckets solely contain gymnasts and so on. None of our buckets are
representative of the dataset as a whole and you would be correct in thinking this would skew
our results. The preferred method of assigning instances to buckets is to make sure that the
classes (basketball players, gymnasts, marathoners) are represented in the same proportions
as they are in the complete dataset. Since one-third of the complete dataset consists of
basketball players, one-third of the entries in each bucket should also be basketball players.
And one-third the entries should be gymnasts and one-third marathoners. This is called
stratification and this is a good thing. The problem with the leave-one-out evaluation
method is that necessarily all the test sets are non-stratified since they contain only one
instance. In sum, while leave-one-out may be appropriate for very small datasets, 10-fold
cross validation is by far the most popular choice.

### Coding
Let's see how we can use cross validation using the Iris dataset.

<img src="https://upload.wikimedia.org/wikipedia/commons/1/1e/IMG_7911-Iris_virginica.jpg" width="250" />

First, let's load the dataset:


In [None]:
iris = pd.read_csv('https://raw.githubusercontent.com/zacharski/ml-class/master/data/iris.csv')
iris


Now let's divide this into a training set and a test set using an 80-20 split.

In [None]:
from sklearn.model_selection import train_test_split
iris_train, iris_test = train_test_split(iris, test_size = 0.2)
iris_train

#### 10 fold cross validation on iris_train
First, to make things as clear as possible, we will split the iris_train dataset into the features and the labels:

In [None]:
iris_train_features = iris_train[['Sepal Length', 'Sepal Width', 'Petal Length', 'Petal Width']]
iris_train_labels = iris_train[['Class']]

Let's also create an instance of a decision tree classifier

In [None]:
from sklearn import tree
clf = tree.DecisionTreeClassifier(criterion='entropy')

### The cross validation steps

#### Step 1. Import  cross_val_score

In [None]:
from sklearn.model_selection import cross_val_score

#### Step 2. run cross validation

In [None]:
scores = cross_val_score(clf, iris_train_features, iris_train_labels, cv=10)

`cv=10` specified that we perform 10-fold cross validation. the function returns a 10 element array, where each element is the accuracy of that fold. Let's take a look:

In [None]:
print(scores)
print("The average accuracy is %5.3f" % (scores.mean()))


So `scores` contains the accuracy for each of the 10 runs. In my case it was:

```
[1.         0.83333333 1.         0.91666667 0.91666667 0.91666667
 0.91666667 0.91666667 1.         0.91666667]
The average accuracy is 0.933
```
So the best runs were 100% accurate and the worst was 83%. The average accuracy was 93%
![](https://raw.githubusercontent.com/zacharski/datamining-guide/master/labs/pics/torchdivide.png)


# <font color='#EE4C2C'>You Try ...</font> 
## <font color='#EE4C2C'>Pima Indian Diabetes Dataset</font> 

We have covered a lot of material and now is your chance to practice it using the Pima Indians Diabetes Data we used before. The data file is at 

[https://raw.githubusercontent.com/zacharski/ml-class/master/data/pima-indians-diabetes.csv](https://raw.githubusercontent.com/zacharski/ml-class/master/data/pima-indians-diabetes.csv)

The data file does not contain a header row. Of course you can name the columns whatever you want, but I used:
```
['pregnant', 'glucose', 'bp', 'skinfold', 'insulin', 'bmi', 'pedigree', 'age', 'diabetes']
```
## <font color='#EE4C2C'>1. Load in the data file</font> 
So load in the data file and let's reserve 20% for `pima_test` and 80% for `pima_train`.



In [28]:
# TO DO
data = pd.read_csv('https://raw.githubusercontent.com/zacharski/ml-class/master/data/pima-indians-diabetes.csv')
data.columns = ['pregnant', 'glucose', 'bp', 'skinfold', 'insulin', 'bmi', 'pedigree', 'age', 'diabetes']
data

Unnamed: 0,pregnant,glucose,bp,skinfold,insulin,bmi,pedigree,age,diabetes
0,1,85,66,29,0,26.6,0.351,31,0
1,8,183,64,0,0,23.3,0.672,32,1
2,1,89,66,23,94,28.1,0.167,21,0
3,0,137,40,35,168,43.1,2.288,33,1
4,5,116,74,0,0,25.6,0.201,30,0
...,...,...,...,...,...,...,...,...,...
762,10,101,76,48,180,32.9,0.171,63,0
763,2,122,70,27,0,36.8,0.340,27,0
764,5,121,72,23,112,26.2,0.245,30,0
765,1,126,60,0,0,30.1,0.349,47,1


## <font color='#EE4C2C'>2. creating separate data structures for the features and labels</font> 



Next, for convenience let's create 2 DataFrames and 2 Series. The DataFrames are:

* `pima_train_features` will contain the feature columns from `pima_train` 
* `pima_test_features` will contain the feature columns from `pima_test`

The Series are:

* `pima_train_labels` will contain the `diabetes` column
* `pima_test_labels` will also contain the `diabetes` column

In [42]:
# TO DO
from sklearn.model_selection import train_test_split

pima_train, pima_test = train_test_split(data, test_size=0.2)
pima_train_labels = pima_train['diabetes']
pima_test_labels = pima_test['diabetes']
pima_train_features = pima_train[['pregnant', 'glucose', 'bp', 'skinfold', 'insulin', 'bmi', 'pedigree', 'age']]
pima_test_features = pima_test[['pregnant', 'glucose', 'bp', 'skinfold', 'insulin', 'bmi', 'pedigree', 'age']]

In [43]:
pima_train_features

Unnamed: 0,pregnant,glucose,bp,skinfold,insulin,bmi,pedigree,age
555,1,97,70,40,0,38.1,0.218,30
179,6,87,80,0,0,23.2,0.084,32
427,0,135,94,46,145,40.6,0.284,26
423,8,151,78,32,210,42.9,0.516,36
733,2,105,75,0,0,23.3,0.560,53
...,...,...,...,...,...,...,...,...
280,10,129,76,28,122,35.9,0.280,39
38,4,111,72,47,207,37.1,1.390,56
602,7,150,78,29,126,35.2,0.692,54
418,3,129,64,29,115,26.4,0.219,28


In [44]:
pima_train_labels

555    0
179    0
427    0
423    1
733    0
      ..
280    0
38     1
602    1
418    1
653    0
Name: diabetes, Length: 613, dtype: int64

## <font color='#EE4C2C'>3. Exploring hyperparameters: max_depth</font> 


We are interested in seeing which has higher accuracy:

1. a classifier unconstrained for max_depth 
2. a classifier with max_depth of 4

Create 2 decision tree classifiers: `clf` which is unconstrained for depth and `clf4` which has a max_depth of 4.

In [45]:
# TO DO
clf4 = tree.DecisionTreeClassifier(criterion='entropy', max_depth=4)
clf = tree.DecisionTreeClassifier(criterion='entropy')

### using 10-fold cross validation get the average accuracy of `clf`

## <font color='#EE4C2C'>4. Using 10-fold cross validation get the average accuracy of `clf4`</font> 



In [46]:
# TO DO
from sklearn.model_selection import cross_val_score

In [47]:
scores = cross_val_score(clf, pima_train_features, pima_train_labels, cv=10)
scores4 = cross_val_score(clf4, pima_train_features, pima_train_labels, cv=10)

print(f'The average accuracy of tree classifier max depth 4 was {scores4.mean()}')
print(f'The average accuracy of tree classifier with undefined max depth was {scores.mean()}')

The average accuracy of tree classifier max depth 4 was 0.7537281861448968
The average accuracy of tree classifier with undefined max depth was 0.7226335272342675


**which** has better accuracy, the one unconstrained for depth or the one whose max_depth is 4?
The unconstrained depth achieved an accuracy of 74.72% while the classifier with a max depth of 4 had a 72.43% accuracy.



## <font color='#EE4C2C'>5. Using the entire training set, train a new classifier with the best setting for the max_depth hyperparameter</font> 



In [48]:
# TO DO
pima_classifier = tree.DecisionTreeClassifier(criterion='entropy')

## <font color='#EE4C2C'>6. Finally, using the test set what is the accuracy?</font> 

In [49]:
from sklearn.metrics import accuracy_score
# TO DO
pima_classifier.fit(pima_train_features, pima_train_labels)
predictions = pima_classifier.predict(pima_test_features)
accuracy_score(predictions, pima_test_labels)

0.7532467532467533

## Automation

Let's say we want to find the best settings for `max_depth`and we will check out the values, 3, 4, 5, 6, ...12 and the best for `min_samples_split` and we will try 2, 3, 4, 5. That makes 10 values for `max_depth` and 4 for `min_samples_split`. That makes 40 different classifiers and it would be time consuming to do that by hand. Fortunately, we can automate the process using [GridSearchCV](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html?highlight=gridsearchcv#sklearn.model_selection.GridSearchCV). 

First we will import the module:

In [None]:
from sklearn.model_selection import GridSearchCV

Now we are going to specify the values we want to test. For `max_depth` we want 3, 4, 5, 6, ... 12 and for `min_samples_split` we want  2, 3, 4, 5:

In [None]:
hyperparam_grid = [
    {'max_depth': [3, 4, 5, 6, 7, 8, 9, 10, 11, 12], 
     'min_samples_split': [2,3,4, 5]}
  ]


Next, let's create a decision tree classifier:

In [None]:

clf = tree.DecisionTreeClassifier(criterion='entropy')


#### now create a grid search object

In [None]:
grid_search = GridSearchCV(clf, hyperparam_grid, cv=10)

When we create the object we pass in:

* the classifer - in our case `clf`
* the Python dictionary containing the hyperparameters we want to evaluate. In our case `hyperparam_grid`
* how many bins we are using. In our case 10: `cv=10`

#### now perform `fit`

In [None]:
grid_search.fit(pima_train_features, pima_train_labels)

When `grid_search` runs, it creates 40 different classifiers and runs 10-fold cross validation on each of them. We can ask `grid_search` what were the parameters of the classifier with the highest accuracy:

In [None]:
grid_search.best_params_

We can also ask `grid_search` to return the best classifier so we can use it to make predictions.

In [None]:
predictions = grid_search.best_estimator_.predict(pima_test_features)

In [None]:

accuracy_score(pima_test_labels, predictions)

As you can see, grid search is extremely helpful in tuning a classifier to work well with a particular problem.

