# Overview

Notes and code taken from this [Anaconda tutorial](https://know.anaconda.com/rs/387-XNW-688/images/ML.html).

This tutorial uses the [Learning about Humans learning ML.csv](https://goo.gl/WgTQMX) dataset, which is Copyright © Anaconda Inc. 2018. A copy is also [saved locally](https://github.com/adamrossnelson/MacLearn/blob/master/whimsical/LearningAboutHumansLearningML.csv).

## Import Statements

In [10]:
import pandas as pd
import numpy as np
import warnings
warnings.simplefilter("ignore")

## Load, Clean, Inspect Data

In [None]:
fname = "LearningAboutHumansLearningML.csv"
humans = pd.read_csv(fname)

humans.head(2)

In [5]:
# Remove timestamp
humans.drop('Timestamp', axis=1, inplace=True)

# Generate 'Education' variable based on survey response.
humans['Education'] = (humans[
    'Years of post-secondary education (e.g. BA=4; Ph.D.=10)']
                       .str.replace(r'.*=', '', regex=True)
                       .astype(int))

# Remove original survey response.
humans.drop('Years of post-secondary education (e.g. BA=4; Ph.D.=10)', 
            axis=1, inplace=True)
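
The regular expression in the cell above can be tried on a few made-up responses. This is a minimal sketch: the values in `responses` are hypothetical and are not taken from the dataset. It assumes some respondents typed the example text (such as `BA=4`) literally instead of a bare number.

```python
import pandas as pd

# Hypothetical survey responses (not from the real dataset).
responses = pd.Series(["4", "BA=4", "Ph.D.=10", "6"])

# Remove everything up to and including the last '=' and convert to int.
# regex=True is passed explicitly because pandas 2.0 changed the default
# for Series.str.replace to regex=False.
education = responses.str.replace(r".*=", "", regex=True).astype(int)

print(education.tolist())
```

Responses that are already bare numbers pass through unchanged because the pattern finds nothing to remove.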

In [6]:
humans.head(2)

Unnamed: 0,Favorite programming language,Favorite Monty Python movie,Years of Python experience,Have used Scikit-learn,Age,"In the Terminator franchise, did you root for the humans or the machines?",Which is the better game?,How successful has this tutorial been so far?,Education
0,Python,Monty Python's Life of Brian,20.0,Yep!,53,Skynet is a WINNER!,"Tic-tac-toe (Br. Eng. ""noughts and crosses"")",8,12
1,Python,Monty Python and the Holy Grail,4.0,Yep!,33,Team Humans!,Chess,9,5


In [7]:
# Initial review of numeric data
humans.describe()

Unnamed: 0,Years of Python experience,Age,How successful has this tutorial been so far?,Education
count,116.0,116.0,116.0,116.0
mean,4.19569,36.586207,7.051724,6.172414
std,5.136187,13.260644,2.229622,3.467303
min,0.0,3.0,1.0,-10.0
25%,1.0,28.0,5.0,4.0
50%,3.0,34.0,8.0,6.0
75%,5.0,43.25,9.0,8.0
max,27.0,99.0,10.0,23.0


## Steps Not In Original Tutorial

In [11]:
# A value of -10 years education looks like a data entry problem.
humans['Education'] = humans['Education'].replace(-10,10)

# Ages < 20 are also likely data entry errors. Missing first digit (add 10).
humans['Age'] = np.where(humans['Age'] < 20, 
                         humans['Age'] + 10,
                         humans['Age'])

# Age == 99 is likely a placeholder. Replace with median.
humans['Age'] = np.where(humans['Age'] == 99, 34,
                         humans['Age'])
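
The same `np.where` pattern can be sketched on a toy frame. The ages below are hypothetical, and computing the median from the remaining rows (instead of typing in 34) is an illustrative variation, not a step from the original tutorial.

```python
import numpy as np
import pandas as pd

# Hypothetical ages, including a likely typo (3) and a likely placeholder (99).
toy = pd.DataFrame({"Age": [3, 28, 34, 43, 50, 99]})

# Ages under 20: assume a missing first digit and add 10.
toy["Age"] = np.where(toy["Age"] < 20, toy["Age"] + 10, toy["Age"])

# Compute the median from the rows that are not the placeholder value.
median_age = toy.loc[toy["Age"] != 99, "Age"].median()

# Age == 99: treat as a placeholder and replace it with that median.
# np.where returns floats here (the median is a float), so cast back to int.
toy["Age"] = np.where(toy["Age"] == 99, median_age, toy["Age"]).astype(int)

print(toy["Age"].tolist())
```

Deriving the replacement value from the data keeps the cleaning step valid if the dataset is later updated.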

In [12]:
humans.describe(include=['object', 'int', 'float'])

Unnamed: 0,Favorite programming language,Favorite Monty Python movie,Years of Python experience,Have used Scikit-learn,Age,"In the Terminator franchise, did you root for the humans or the machines?",Which is the better game?,How successful has this tutorial been so far?,Education
count,116,116,116.0,116,116.0,116,116,116.0,116.0
unique,7,6,,2,,2,4,,
top,Python,Monty Python and the Holy Grail,,Yep!,,Team Humans!,Chess,,
freq,94,57,,80,,88,69,,
mean,,,4.19569,,36.284483,,,7.051724,6.344828
std,,,5.136187,,11.339662,,,2.229622,3.137718
min,,,0.0,,13.0,,,1.0,0.0
25%,,,1.0,,28.0,,,5.0,4.0
50%,,,3.0,,34.0,,,8.0,6.0
75%,,,5.0,,42.25,,,9.0,8.0


## Data Preparation

This section uses [one-hot encoding](https://en.wikipedia.org/wiki/One-hot) (link to Wikipedia). I also share (without comment) related work from Medium.com:

[What is One Hot Encoding and How to Do It](https://hackernoon.com/what-is-one-hot-encoding-why-and-when-do-you-have-to-use-it-e3c6186d008f).

[Basics of one hot encoding using numpy, sklearn, Keras, and Tensorflow](https://medium.com/@pemagrg/one-hot-encoding-129ccc293cda).

[What is One Hot Encoding? Why And When do you have to use it?](https://hackernoon.com/what-is-one-hot-encoding-why-and-when-do-you-have-to-use-it-e3c6186d008f).

Or see the explanation given below.

In [13]:
# Using pandas.get_dummies()
human_dummies = pd.get_dummies(humans)

# Results, as displayed from tutorial
list(human_dummies.columns)

['Years of Python experience',
 'Age',
 'How successful has this tutorial been so far?',
 'Education',
 'Favorite programming language_C++',
 'Favorite programming language_JavaScript',
 'Favorite programming language_MATLAB',
 'Favorite programming language_Python',
 'Favorite programming language_R',
 'Favorite programming language_Scala',
 'Favorite programming language_Whitespace',
 'Favorite Monty Python movie_And Now for Something Completely Different',
 'Favorite Monty Python movie_Monty Python Live at the Hollywood Bowl',
 'Favorite Monty Python movie_Monty Python and the Holy Grail',
 "Favorite Monty Python movie_Monty Python's Life of Brian",
 "Favorite Monty Python movie_Monty Python's The Meaning of Life",
 'Favorite Monty Python movie_Time Bandits',
 'Have used Scikit-learn_Nope.',
 'Have used Scikit-learn_Yep!',
 'In the Terminator franchise, did you root for the humans or the machines?_Skynet is a WINNER!',
 'In the Terminator franchise, did you root for the humans or th

### One-Hot Encoding, Explanation

Performing one-hot encoding with `pandas.get_dummies()` returns a new data frame. Above, the tutorial displays the new data frame's columns. Notice that the columns that held numeric variables remain unchanged, while the columns that contained nominal values have been expanded: there is a new column for each of the available nominal values. The code below compares a few observations (numbers 8 and 9) from the original data frame and the new data frame.

In [19]:
humans[['Age',
        'Education',
        'Favorite programming language']][8:10]

Unnamed: 0,Age,Education,Favorite programming language
8,34,10,Python
9,32,5,R


In [20]:
human_dummies[['Age',
               'Education',
               'Favorite programming language_Python',
               'Favorite programming language_R']][8:10]

Unnamed: 0,Age,Education,Favorite programming language_Python,Favorite programming language_R
8,34,10,1,0
9,32,5,0,1


Note that for observation 8, the survey respondent indicated `Python` as a favorite programming language. For observation 9, the respondent indicated `R` as a favorite. In `humans` the data exists as text. In `human_dummies` the data exists as one-hot encoded binary codes.
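
The behavior described in this section can be reproduced on a two-row toy frame. This is a sketch with hypothetical rows that mirror observations 8 and 9. The `dtype=int` argument is an addition not used in the tutorial's cell: pandas 2.0 and later return boolean indicator columns by default, so passing `dtype=int` is one way to get the 1/0 values shown in the output above.

```python
import pandas as pd

# Hypothetical two-row frame mirroring observations 8 and 9.
toy = pd.DataFrame({
    "Age": [34, 32],
    "Favorite programming language": ["Python", "R"],
})

# Numeric columns pass through unchanged; each nominal value gets its own column.
# dtype=int requests 1/0 indicators instead of the True/False default in pandas >= 2.0.
toy_dummies = pd.get_dummies(toy, dtype=int)

print(list(toy_dummies.columns))
print(toy_dummies)
```

Only the values present in the data get a column, so this toy frame produces two language columns where the full dataset produces seven.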