# Overview

This is a replication of an [Anaconda tutorial](https://know.anaconda.com/rs/387-XNW-688/images/ML.html). Like the original, this tutorial uses the [Learning about Humans learning ML.csv](https://goo.gl/WgTQMX) dataset, which is copyrighted © Anaconda Inc. 2018. A copy is also [saved locally](https://github.com/adamrossnelson/MacLearn/blob/master/whimsical/LearningAboutHumansLearningML.csv).

## This Tutorial Is In k Parts

### Part I

To provide additional commentary, this replicated tutorial is split into separate parts. This notebook is the first part, which covers loading, cleaning, preparing, transforming, encoding, splitting, training, testing, and evaluating a decision tree.

### Part II

The second part (forthcoming) will take a closer look at evaluating the model by examining feature importance, tuning feature selection, and tuning parameter choices.

## Import Statements

In [1]:
import pandas as pd
import numpy as np
import warnings
warnings.simplefilter("ignore")

## Load, Clean, Inspect Data

In [2]:
fname = "LearningAboutHumansLearningML.csv"
humans = pd.read_csv(fname)

humans.head(2)

Unnamed: 0,Timestamp,Favorite programming language,Favorite Monty Python movie,Years of Python experience,Have used Scikit-learn,Age,"In the Terminator franchise, did you root for the humans or the machines?",Which is the better game?,Years of post-secondary education (e.g. BA=4; Ph.D.=10),How successful has this tutorial been so far?
0,4/8/2018 8:34:08,Python,Monty Python's Life of Brian,20.0,Yep!,53,Skynet is a WINNER!,"Tic-tac-toe (Br. Eng. ""noughts and crosses"")",12,8
1,4/8/2018 9:57:15,Python,Monty Python and the Holy Grail,4.0,Yep!,33,Team Humans!,Chess,5,9


In [3]:
# Remove timestamp
humans.drop('Timestamp', axis=1, inplace=True)

# Generate 'Education' variable based on survey response.
humans['Education'] = (humans[
    'Years of post-secondary education (e.g. BA=4; Ph.D.=10)']
                       .str.replace(r'.*=', '', regex=True)
                       .astype(int))

# Remove original survey response.
humans.drop('Years of post-secondary education (e.g. BA=4; Ph.D.=10)', 
            axis=1, inplace=True)
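
To see what the regular expression in the cell above does, here is a minimal sketch on a few hypothetical responses (the values below are made up for illustration and are not taken from the survey). It assumes a pandas version in which `Series.str.replace` accepts the `regex` argument.

```python
import pandas as pd

# Hypothetical responses; the real survey column is not reproduced here.
raw = pd.Series(["4", "BA=4", "Ph.D.=10", "12"])

# The greedy pattern '.*=' removes everything up to and including the
# last '=' sign, leaving only the trailing number.
# regex=True is passed explicitly because newer pandas versions treat the
# pattern as a literal string by default.
cleaned = raw.str.replace(r".*=", "", regex=True).astype(int)

print(cleaned.tolist())
```

Responses that are already plain numbers pass through unchanged, which is why the same expression can be applied to the whole column.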

In [4]:
humans.head(2)

Unnamed: 0,Favorite programming language,Favorite Monty Python movie,Years of Python experience,Have used Scikit-learn,Age,"In the Terminator franchise, did you root for the humans or the machines?",Which is the better game?,How successful has this tutorial been so far?,Education
0,Python,Monty Python's Life of Brian,20.0,Yep!,53,Skynet is a WINNER!,"Tic-tac-toe (Br. Eng. ""noughts and crosses"")",8,12
1,Python,Monty Python and the Holy Grail,4.0,Yep!,33,Team Humans!,Chess,9,5


In [5]:
# Initial review of numeric data
humans.describe()

Unnamed: 0,Years of Python experience,Age,How successful has this tutorial been so far?,Education
count,116.0,116.0,116.0,116.0
mean,4.19569,36.586207,7.051724,6.172414
std,5.136187,13.260644,2.229622,3.467303
min,0.0,3.0,1.0,-10.0
25%,1.0,28.0,5.0,4.0
50%,3.0,34.0,8.0,6.0
75%,5.0,43.25,9.0,8.0
max,27.0,99.0,10.0,23.0


## Steps Not In Original Tutorial

In [6]:
# A value of -10 years education looks like a data entry problem.
humans['Education'] = humans['Education'].replace(-10,10)

# Ages < 20 are also likely data entry errors: missing first digit (add 10).
humans['Age'] = np.where(humans['Age'] < 20, 
                         humans['Age'] + 10,
                         humans['Age'])

# Age == 99 is likely a placeholder. Replace with the median (34).
humans['Age'] = np.where(humans['Age'] == 99, 34,
                         humans['Age'])
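
The cell above hard-codes the median age of 34. As a hedged alternative, the median could be computed from the data itself. The sketch below uses a small hypothetical series of ages (not the survey data) and applies the same two rules.

```python
import numpy as np
import pandas as pd

# Hypothetical ages, including a 3 (likely a dropped digit) and a 99 placeholder.
ages = pd.Series([3, 28, 34, 43, 99])

# Add 10 to ages below 20, mirroring the rule used above.
ages = pd.Series(np.where(ages < 20, ages + 10, ages))

# Compute the median from the plausible values instead of hard-coding it.
median_age = ages[ages != 99].median()

# Replace the placeholder with the computed median.
ages = ages.replace(99, int(median_age))

print(ages.tolist())
```

Computing the median this way keeps the cleaning step valid if the dataset is later updated, at the cost of one extra line.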

In [7]:
humans.describe(include=['object', 'int', 'float'])

Unnamed: 0,Favorite programming language,Favorite Monty Python movie,Years of Python experience,Have used Scikit-learn,Age,"In the Terminator franchise, did you root for the humans or the machines?",Which is the better game?,How successful has this tutorial been so far?,Education
count,116,116,116.0,116,116.0,116,116,116.0,116.0
unique,7,6,,2,,2,4,,
top,Python,Monty Python and the Holy Grail,,Yep!,,Team Humans!,Chess,,
freq,94,57,,80,,88,69,,
mean,,,4.19569,,36.284483,,,7.051724,6.344828
std,,,5.136187,,11.339662,,,2.229622,3.137718
min,,,0.0,,13.0,,,1.0,0.0
25%,,,1.0,,28.0,,,5.0,4.0
50%,,,3.0,,34.0,,,8.0,6.0
75%,,,5.0,,42.25,,,9.0,8.0


## Data Preparation

This section uses [one-hot encoding](https://en.wikipedia.org/wiki/One-hot) (link to Wikipedia). I also share (without comment) the following work from Medium.com:

[What is One Hot Encoding and How to Do It](https://hackernoon.com/what-is-one-hot-encoding-why-and-when-do-you-have-to-use-it-e3c6186d008f).

[Basics of one hot encoding using numpy, sklearn, Keras, and Tensorflow](https://medium.com/@pemagrg/one-hot-encoding-129ccc293cda).

[What is One Hot Encoding? Why And When do you have to use it?](https://hackernoon.com/what-is-one-hot-encoding-why-and-when-do-you-have-to-use-it-e3c6186d008f).

Or see the explanation given below.

In [8]:
# Using pandas.get_dummies()
human_dummies = pd.get_dummies(humans)

# Results, as displayed from tutorial
list(human_dummies.columns)

['Years of Python experience',
 'Age',
 'How successful has this tutorial been so far?',
 'Education',
 'Favorite programming language_C++',
 'Favorite programming language_JavaScript',
 'Favorite programming language_MATLAB',
 'Favorite programming language_Python',
 'Favorite programming language_R',
 'Favorite programming language_Scala',
 'Favorite programming language_Whitespace',
 'Favorite Monty Python movie_And Now for Something Completely Different',
 'Favorite Monty Python movie_Monty Python Live at the Hollywood Bowl',
 'Favorite Monty Python movie_Monty Python and the Holy Grail',
 "Favorite Monty Python movie_Monty Python's Life of Brian",
 "Favorite Monty Python movie_Monty Python's The Meaning of Life",
 'Favorite Monty Python movie_Time Bandits',
 'Have used Scikit-learn_Nope.',
 'Have used Scikit-learn_Yep!',
 'In the Terminator franchise, did you root for the humans or the machines?_Skynet is a WINNER!',
 'In the Terminator franchise, did you root for the humans or th

### One-Hot Encoding, Explanation

Performing one-hot encoding with `pandas.get_dummies()` returns a new data frame. Above, the tutorial displays the new data frame's columns. Notice that columns which held numeric variables remain unchanged. Columns that contained nominal values have been expanded: there is a new column for each of the available nominal values. The following code compares a few observations (numbers 8 & 9) from the original data frame and the new data frame.

In [9]:
humans[['Age',
        'Education',
        'Favorite programming language',
        'How successful has this tutorial been so far?']][8:10]

Unnamed: 0,Age,Education,Favorite programming language,How successful has this tutorial been so far?
8,34,10,Python,10
9,32,5,R,4


In [10]:
human_dummies[['Age',
               'Education',
               'Favorite programming language_Python',
               'Favorite programming language_R',
               'How successful has this tutorial been so far?']][8:10]

Unnamed: 0,Age,Education,Favorite programming language_Python,Favorite programming language_R,How successful has this tutorial been so far?
8,34,10,1,0,10
9,32,5,0,1,4


Note that for observation 8, the survey respondent indicated `Python` as a favorite programming language. For observation 9, the respondent indicated `R` as a favorite. In `humans` the data exists as text. In `human_dummies` the data exists as one-hot encoded binary codes.
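
The same behavior can be reproduced on a tiny, self-contained example. The data frame below is hypothetical (two rows shaped like observations 8 and 9, with a shortened column name). Passing `dtype=int` is a choice made here so the dummies display as 1/0; recent pandas versions otherwise return booleans.

```python
import pandas as pd

# Hypothetical two-row frame: one numeric column, one nominal column.
toy = pd.DataFrame({"Age": [34, 32], "Language": ["Python", "R"]})

# Numeric columns pass through; each nominal value becomes its own column.
toy_dummies = pd.get_dummies(toy, dtype=int)

print(list(toy_dummies.columns))
print(toy_dummies)
```

Each new column is named `<original column>_<value>`, which is the same naming pattern seen in the column list printed earlier.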

## Choosing Targets & Features

For folks with beginner (or even intermediate and advanced) backgrounds in statistics (but not machine learning) the terms target and features are simple to understand. Targets are your outcome or dependent variable. Features are the predictor or independent variables. Put another way, targets = left-hand variables; features = right-hand variables or x variables. Conveniently, in machine learning targets are often denoted as lowercase `y`, and features as uppercase `X`.

In [11]:
# Create feature matrix. Keep everything in humans but the target.
X = human_dummies.drop('How successful has this tutorial been so far?', axis=1)

In [12]:
# Create target vector. 
# Create new boolean Series that is True when rating was >= median of 8.
y = human_dummies['How successful has this tutorial been so far?'] >= 8
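
As a small illustration of how the comparison above turns a numeric rating into a binary target, here is a sketch on hypothetical ratings (not the survey responses). The mean of a boolean Series gives the share of positive cases.

```python
import pandas as pd

# Hypothetical tutorial ratings on the 1-10 scale.
ratings = pd.Series([10, 4, 8, 7, 9])

# True when the rating is at or above the threshold of 8.
y_toy = ratings >= 8

print(y_toy.tolist())
# Share of positive (True) cases in the target.
print(y_toy.mean())
```

Checking the share of positives is a useful habit, because a model that always predicts the majority class will score at least that well on accuracy.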

## Splitting, Training & Testing

Best practice usually calls for training and testing on different data sets. This can be accomplished by splitting the available data.

Scikit-learn's `train_test_split` function accomplishes this task in one line of code:

### Splitting

In [13]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
print("Training features/target:", X_train.shape, y_train.shape)
print("Testing features/target:", X_test.shape, y_test.shape)

Training features/target: (87, 24) (87,)
Testing features/target: (29, 24) (29,)
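
The 87/29 split follows from the function's default test share of 25%. The sketch below reproduces the shapes on placeholder arrays of the same size (all zeros, with an alternating hypothetical target), and also shows the optional `stratify` argument, which keeps the class balance similar in both halves. Whether stratifying is appropriate here is a judgment call and not something the original tutorial does.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data with the same dimensions as X and y above.
X_toy = np.zeros((116, 24))
y_toy = np.arange(116) % 2 == 0  # hypothetical target: 58 True, 58 False

# Default split: 25% of rows are held out for testing.
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, random_state=1)
print(X_tr.shape, X_te.shape)

# Optional: stratify on the target so both halves keep a similar class mix.
X_tr_s, X_te_s, y_tr_s, y_te_s = train_test_split(
    X_toy, y_toy, test_size=0.25, stratify=y_toy, random_state=1)
print(y_tr_s.mean(), y_te_s.mean())
```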


### Training & Testing A Random Forest

In [14]:
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(n_estimators=10, random_state=0)
rf.fit(X_train, y_train)
rf.score(X_test, y_test)

0.5172413793103449

### Training & Testing A Decision Tree

In [15]:
from sklearn.tree import DecisionTreeClassifier
tree = DecisionTreeClassifier(max_depth=7, random_state=0)
tree.fit(X_train, y_train)

# Return the mean accuracy on the given test data and labels.
tree.score(X_test, y_test)

0.4827586206896552

To understand how this score is calculated, generate a dummy that is 1 when the prediction is accurate, 0 otherwise. Then display the mean value of that dummy. The core of this procedure, which is to create a new variable `isCorrect`, is:

```Python
pd.DataFrame({'Actual':y_test,
              'Predicted':tree.predict(X_test),
              'isCorrect':np.where(y_test == tree.predict(X_test), 1, 0)})
```
The `DataFrame.describe()` method will give the mean value of `isCorrect`.

In [16]:
pd.DataFrame({'Actual':y_test,
              'Predicted':tree.predict(X_test),
              'isCorrect':np.where(y_test == tree.predict(X_test), 1, 0)}).describe()[1:2]

Unnamed: 0,isCorrect
mean,0.482759


With a mean value of `0.482759`, a simplistic explanation of this prediction is that it was about as good as a coin flip: correct about half the time. The original tutorial produced a higher score (`0.58620`).

### Further Inspecting & Evaluating Results

In [17]:
# Display side-by-side predicted and actual outcomes.
pd.DataFrame({'Actual':y_test,
              'Predicted':tree.predict(X_test),
              'isCorrect':np.where(y_test == tree.predict(X_test), 1, 0)}).head()

Unnamed: 0,Actual,Predicted,isCorrect
95,False,False,1
44,False,False,1
56,True,False,0
97,False,False,1
69,True,False,0
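
Using only the five rows displayed above, the accuracy calculation can be checked by hand. The sketch below re-enters those five actual and predicted values and compares the manual mean with scikit-learn's `accuracy_score`.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# The five rows shown above (observations 95, 44, 56, 97, 69).
actual = np.array([False, False, True, False, True])
predicted = np.array([False, False, False, False, False])

# Accuracy is the share of rows where prediction and outcome agree.
manual_accuracy = (actual == predicted).mean()
library_accuracy = accuracy_score(actual, predicted)

print(manual_accuracy, library_accuracy)
```

Three of the five rows are correct, so both values are 0.6 for this small slice. The full test set of 29 rows gives the `0.482759` reported earlier.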


In [18]:
# Evaluate Using conrpt.py
# https://github.com/adamrossnelson/conrpt/blob/master/conrpt.py
import conrpt

In [19]:
conrpt.conrpt(pd.DataFrame({'Actual':y_test,
              'Predicted':tree.predict(X_test)}))


Notes: ObservedPos: 17, ObservedNeg: 12, & ObservedTot: 29, Prevalence: 58.621


Unnamed: 0,Results,Perfect,Predicted,25coin,50coin,75coin
0,TestedPos,17.0,6.0,10.0,14.0,20.0
1,TestedNeg,12.0,23.0,19.0,15.0,9.0
2,TestedTot,29.0,29.0,29.0,29.0,29.0
3,TruePos,17.0,4.0,4.0,6.0,11.0
4,TrueNeg,12.0,10.0,6.0,4.0,3.0
5,FalesPos,0.0,13.0,13.0,11.0,6.0
6,FalseNeg,0.0,2.0,6.0,8.0,9.0
7,Sensitivity,1.0,0.235,0.235,0.353,0.647
8,Specificity,1.0,0.833,0.5,0.333,0.25
9,PosPredVal,1.0,0.235,0.235,0.353,0.647


The `0.482759` score from above can be found in the `conrpt` output next to the `CorrectRt` (Correct Rate) row. With a score near `.5` it might be tempting to dismiss the model as not useful. However, as seen above, this model's correct rate is higher than the simulated random coin flips also reported by `conrpt`. A correct rate of `1.0` would indicate a perfect prediction.

Another valuable measure of quality is `ROCArea`, which in this case is `0.534`. The ROC Area reports the area under the receiver operating characteristic curve. This curve is calculated by plotting the true positive rate against the false positive rate at various threshold settings. A score near `.5` means the model separates positive from negative cases about as well as random guessing. At this `.5` result it might be tempting to dismiss the model as not useful. However, as seen above, this model is outperforming all three random coins. As with the `CorrectRt`, an `ROCArea` of `1.0` would represent a perfect prediction.
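
When only hard True/False predictions are available, the area under the ROC curve reduces to the average of sensitivity and specificity. The sketch below illustrates this with scikit-learn's `roc_auc_score`. The confusion counts (4 true positives, 13 false negatives, 10 true negatives, 2 false positives) are an assumption, back-solved from the 17 observed positives, 12 observed negatives, sensitivity of `0.235`, and specificity of `0.833` reported above; they are not read directly from the `conrpt` output.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Assumed counts, back-solved from the reported rates (see lead-in).
tp, fn, tn, fp = 4, 13, 10, 2

# Rebuild label arrays that match those counts.
y_true = np.array([1] * (tp + fn) + [0] * (tn + fp))
y_pred = np.array([1] * tp + [0] * fn + [0] * tn + [1] * fp)

sensitivity = tp / (tp + fn)
specificity = tn / (tn + fp)

auc = roc_auc_score(y_true, y_pred)
print(round(auc, 3))
print(round((sensitivity + specificity) / 2, 3))
```

Under these assumed counts both lines print the same value, which rounds to the `0.534` shown for `ROCArea`. Passing predicted probabilities (for example from `tree.predict_proba`) instead of hard labels would generally give a different, more informative area.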

Interpreting the `F1Score` is less direct. Generally a higher F1Score is considered better*. Likewise a higher `MattCorCoef` is also considered better*. To read more about interpreting measures of fit I recommend the scikit-learn [Model evaluation: quantifying the quality of predictions](https://scikit-learn.org/stable/modules/model_evaluation.html) documentation.


*The meaning of 'better' is not the same from scientist to scientist. Your definition is likely different than mine. Further, my criteria for 'better' or 'goodness' might be different from use-case to use-case.
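
For readers who want to compute these two measures directly, scikit-learn provides `f1_score` and `matthews_corrcoef`. The sketch below uses a small hypothetical set of labels (not the tutorial's test set) and checks the library results against the textbook formulas.

```python
from math import sqrt
from sklearn.metrics import f1_score, matthews_corrcoef

# Hypothetical labels: 2 true positives, 1 false negative,
# 2 true negatives, 1 false positive.
y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 1]
tp, fn, tn, fp = 2, 1, 2, 1

# F1 is the harmonic mean of precision and recall.
f1_manual = 2 * tp / (2 * tp + fp + fn)

# Matthews correlation ranges from -1 to 1; 0 is chance-level agreement.
mcc_manual = (tp * tn - fp * fn) / sqrt(
    (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))

print(f1_score(y_true, y_pred), f1_manual)
print(matthews_corrcoef(y_true, y_pred), mcc_manual)
```

Unlike accuracy, the Matthews coefficient uses all four cells of the confusion matrix, so it tends to be less flattering when a model simply predicts the majority class.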

In [20]:
conrpt.display_keywords()


    Keywords, Terminology, & Calculations - Quick References

    Prevalence  = ObservedPos/ObservedTot
    Specificity = TrueNeg/ObservedNeg           Sensitivity = TruePos/ObservedPos
    PosPredVal  = TruePos/(TruePos+FalsePos)    NegPredVal  = TrueNeg/(TrueNeg+FalseNeg)
    FalsePosRt  = FalsePos/ObservedNeg          FalseNegRt  = FalseNeg/(FalseNeg+TruePos)
    CorrectRt   = (TruePos+TrueNeg)/TestedTot   IncorrectRt = (FalsePos+FalseNeg)/TestedTot

    FalsePos    = Type I Error                  FalseNeg    = Type II Error
    FalsePosRt  = Inverse Specificity           FalseNegRt  = Inverse Sensitivity
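
The quick-reference formulas above translate directly into a few lines of Python. The helper below, `confusion_rates`, is a hypothetical function written for this illustration and is not part of `conrpt`. The example counts are an assumption, chosen to be consistent with the 17 observed positives, 12 observed negatives, and 14 of 29 correct predictions reported earlier; they may not match every row of the `conrpt` table.

```python
def confusion_rates(tp, tn, fp, fn):
    """Return the quick-reference rates from raw confusion-matrix counts."""
    observed_pos = tp + fn
    observed_neg = tn + fp
    total = observed_pos + observed_neg
    return {
        "Prevalence": observed_pos / total,
        "Sensitivity": tp / observed_pos,
        "Specificity": tn / observed_neg,
        "PosPredVal": tp / (tp + fp),
        "NegPredVal": tn / (tn + fn),
        "FalsePosRt": fp / observed_neg,
        "FalseNegRt": fn / (fn + tp),
        "CorrectRt": (tp + tn) / total,
        "IncorrectRt": (fp + fn) / total,
    }

# Assumed counts (see lead-in); not read directly from conrpt.
rates = confusion_rates(tp=4, tn=10, fp=2, fn=13)
for name, value in rates.items():
    print(f"{name:12s}{value:.3f}")
```

Note that `FalsePosRt` equals one minus `Specificity`, and `FalseNegRt` equals one minus `Sensitivity`, which is what the "Inverse" lines in the reference are describing.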
