## Pre-processing Section for Machine Learning A-Z
Also, really good resource and introduction to ML on youtube: https://www.youtube.com/watch?v=hd1W4CyPX58&t=385s

**Terminology:**

**Supervised learning** - we know both the data (iris measurements) and the outcome (the species of iris), and try to learn the relationship between them.

**Unsupervised learning** - we only know the data/measurements (no outcome), so we try to cluster the data into meaningful categories/groups.

Each row is an **observation** (aka sample, example, instance, record).

Each column is a **feature** (aka predictor, attribute, independent variable, input, regressor, covariate).

Each value we are predicting is the **response** (aka target, outcome, label, dependent variable).

**Classification** - supervised learning in which the response is categorical (a finite, unordered set of values: is an email spam or ham? which iris species is this? etc.).

**Regression** - supervised learning in which the response is ordered and continuous (price of a house, height of a person, etc.).

Example calls:
* iris.feature_names
* iris.target
* iris.target_names
* iris.data
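
The calls above assume an `iris` object already exists. A minimal sketch of where it can come from, assuming scikit-learn's bundled copy of the dataset (`load_iris`) is the one meant here:

```python
from sklearn.datasets import load_iris

# load_iris() returns a container whose attributes match the calls listed above
iris = load_iris()

print(iris.feature_names)   # names of the four measurement columns
print(iris.target_names)    # names of the three species
print(iris.data.shape)      # (150, 4): 150 observations, 4 features
print(iris.target.shape)    # (150,): one encoded species per observation
```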

## Requirements for working with data in scikit-learn
1. Features and response are **separate objects**
2. Features and response should be **numeric**
3. Features and response should be **NumPy arrays**
4. Features and response should have **specific shapes**
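
As a rough check, the four requirements can be verified one by one. This sketch assumes the iris dataset from `sklearn.datasets`; the variable names `X` and `y` are only the usual convention:

```python
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()
X, y = iris.data, iris.target   # 1. features and response are separate objects

# 2. both are numeric
assert np.issubdtype(X.dtype, np.number) and np.issubdtype(y.dtype, np.number)
# 3. both are NumPy arrays
assert isinstance(X, np.ndarray) and isinstance(y, np.ndarray)
# 4. X is 2-D (n_samples, n_features); y is 1-D with one entry per row of X
assert X.ndim == 2 and y.ndim == 1 and X.shape[0] == y.shape[0]

print(X.shape, y.shape)
```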

In [2]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import os
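# Imputer was removed from sklearn.preprocessing in scikit-learn 0.22;
# SimpleImputer in sklearn.impute is its replacement
from sklearn.impute import SimpleImputer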

In [3]:
# Before importing the dataset, check the working directory
# (Data.csv is read relative to it)
os.getcwd()

'/Users/gaylonalfano/Documents/Python/Machine Learning A-Z/Part 1 - Data Preprocessing/Section 2 -------------------- Part 1 - Data Preprocessing --------------------'

In [4]:
dataset = pd.read_csv("Data.csv")

In [5]:
dataset

Unnamed: 0,Country,Age,Salary,Purchased
0,France,44.0,72000.0,No
1,Spain,27.0,48000.0,Yes
2,Germany,30.0,54000.0,No
3,Spain,38.0,61000.0,No
4,Germany,40.0,,Yes
5,France,35.0,58000.0,Yes
6,Spain,,52000.0,No
7,France,48.0,79000.0,Yes
8,Germany,50.0,83000.0,No
9,France,37.0,67000.0,Yes


In [6]:
dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 4 columns):
Country      10 non-null object
Age          9 non-null float64
Salary       9 non-null float64
Purchased    10 non-null object
dtypes: float64(2), object(2)
memory usage: 400.0+ bytes


## Important - Need to distinguish the matrix of features from the dependent variable
Independent variables = Country, Age, Salary; dependent variable = Purchased

In [79]:
# Create a matrix/array of independent vars
X = dataset.iloc[:, :-1].values

In [80]:
X

array([['France', 44.0, 72000.0],
       ['Spain', 27.0, 48000.0],
       ['Germany', 30.0, 54000.0],
       ['Spain', 38.0, 61000.0],
       ['Germany', 40.0, nan],
       ['France', 35.0, 58000.0],
       ['Spain', nan, 52000.0],
       ['France', 48.0, 79000.0],
       ['Germany', 50.0, 83000.0],
       ['France', 37.0, 67000.0]], dtype=object)

### Practice slicing from np.array

In [81]:
X[:, 0] # countries
X[0, 2] # 72000
X[0:4, 0:2]
X[0, :]

array(['France', 44.0, 72000.0], dtype=object)
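
Only the last expression in a cell is displayed, which is why just one array shows up above. A small sketch of what each slice returns, using a hypothetical three-row copy (`X_demo`) of the matrix:

```python
import numpy as np

# First three rows of X, copied from the output shown earlier
X_demo = np.array([['France', 44.0, 72000.0],
                   ['Spain', 27.0, 48000.0],
                   ['Germany', 30.0, 54000.0]], dtype=object)

first_col = X_demo[:, 0]    # all rows, column 0 -> 1-D array of countries
one_value = X_demo[0, 2]    # row 0, column 2 -> a single value (72000.0)
block = X_demo[0:2, 0:2]    # rows 0-1, columns 0-1 -> 2-D array
first_row = X_demo[0, :]    # row 0, all columns -> 1-D array

print(first_col.shape, block.shape, first_row.shape)   # (3,) (2, 2) (3,)
```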

In [82]:
type(X)

numpy.ndarray

In [83]:
X  # returns np.array

array([['France', 44.0, 72000.0],
       ['Spain', 27.0, 48000.0],
       ['Germany', 30.0, 54000.0],
       ['Spain', 38.0, 61000.0],
       ['Germany', 40.0, nan],
       ['France', 35.0, 58000.0],
       ['Spain', nan, 52000.0],
       ['France', 48.0, 79000.0],
       ['Germany', 50.0, 83000.0],
       ['France', 37.0, 67000.0]], dtype=object)

In [84]:
# Create the dependent variable vector, y
y = dataset.iloc[:, -1].values

In [85]:
y

array(['No', 'Yes', 'No', 'No', 'Yes', 'Yes', 'No', 'Yes', 'No', 'Yes'],
      dtype=object)
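
For anyone following along without Data.csv, here is a sketch that rebuilds the same ten rows inline (values copied from the dataset output above) and selects the columns by name rather than by position. Selecting by name is just one option; it is an assumption on my part that it is less brittle if columns get reordered.

```python
import io
import pandas as pd

# The ten rows of Data.csv as displayed earlier (empty field = missing value)
csv_text = """Country,Age,Salary,Purchased
France,44,72000,No
Spain,27,48000,Yes
Germany,30,54000,No
Spain,38,61000,No
Germany,40,,Yes
France,35,58000,Yes
Spain,,52000,No
France,48,79000,Yes
Germany,50,83000,No
France,37,67000,Yes
"""
df = pd.read_csv(io.StringIO(csv_text))

X_demo = df[["Country", "Age", "Salary"]].to_numpy()   # matrix of features
y_demo = df["Purchased"].to_numpy()                    # dependent variable vector

print(X_demo.shape, y_demo.shape)   # (10, 3) (10,)
```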

## Object-oriented Programming Basics
A class is the model of something we want to build. For example, if we make a house construction plan that gathers the instructions on how to build a house, then this construction plan is the class.

An object is an instance of the class. So if we take that same example of the house construction plan, then an object is simply a house. A house (the object) that was built by following the instructions of the construction plan (the class).
And therefore there can be many objects of the same class, because we can build many houses from the construction plan.

A method is a tool we can use on the object to complete a specific action. So in this same example, a tool can be to open the main door of the house if a guest is coming. A method can also be seen as a function that is applied onto the object, takes some inputs (that were defined in the class) and returns some output.
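
The house analogy can be written out as a small sketch. The `House` class and its `open_main_door` method are made up purely for illustration:

```python
class House:
    """The construction plan (class): describes how every house is built."""

    def __init__(self, address):
        self.address = address
        self.door_open = False

    def open_main_door(self, guest):
        """A method: an action performed on one particular house."""
        self.door_open = True
        return f"Door at {self.address} opened for {guest}"


# Two objects (houses) built from the same class (plan)
house_a = House("1 Main St")
house_b = House("2 Main St")

message = house_a.open_main_door("Alice")
print(message)                                # Door at 1 Main St opened for Alice
print(house_a.door_open, house_b.door_open)   # True False
```

The same vocabulary applies to scikit-learn in the cells that follow: LabelEncoder is a class, labelencoder_X is an object of that class, and .fit_transform() is a method called on the object.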

## Working with missing data / NaN / nulls. Why do we have to .fit()?
Here's a good explanation: the imputer fills missing values with a statistic (e.g. mean, median, ...) of the data. To avoid data leakage during cross-validation, it computes the statistic on the training data during the fit, stores it, and applies it to the test data during the transform.

# Imputer was removed from sklearn.preprocessing in scikit-learn 0.22;
# SimpleImputer in sklearn.impute is its replacement
from sklearn.impute import SimpleImputer
obj = SimpleImputer(strategy='mean')

obj.fit([[1, 2, 3], [2, 3, 4]])
print(obj.statistics_)
#array([ 1.5,  2.5,  3.5])

X = obj.transform([[4, np.nan, 6], [5, 6, np.nan]])
print(X)
#array([[ 4. ,  2.5,  6. ],
#[ 5. ,  6. ,  3.5]])

You can do both steps in one if your train and test data are identical, using fit_transform.

X = obj.fit_transform([[1, 2, np.nan], [2, 3, 4]])
print(X)
#array([[ 1. ,  2. ,  4. ],
#[ 2. ,  3. ,  4. ]])

This data leakage issue is important: the data distribution may change from the training data to the testing data, and you don't want information from the testing data to already be present during the fit.

See the doc for more information about cross-validation.
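
A sketch of the leakage point with made-up numbers: the mean is learned from the training rows only, so an extreme value in the test rows cannot influence what gets filled in. It assumes `SimpleImputer` from `sklearn.impute`, the class that replaces the older `Imputer` in current scikit-learn releases.

```python
import numpy as np
from sklearn.impute import SimpleImputer

X_train = np.array([[1.0], [2.0], [3.0]])   # training mean is 2.0
X_test = np.array([[np.nan], [100.0]])      # test set contains an outlier

imputer = SimpleImputer(strategy="mean")
imputer.fit(X_train)                        # statistic learned from train only
X_test_filled = imputer.transform(X_test)   # NaN becomes 2.0; 100.0 plays no part

print(imputer.statistics_)   # [2.]
print(X_test_filled.ravel())
```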

In [86]:
# We could remove the observations with missing values (not recommended).
# Instead, take the MEAN of each column and use it to fill in the nulls.
# Imputer was removed in scikit-learn 0.22; SimpleImputer (sklearn.impute)
# replaces it and always works column-wise, so there is no axis argument.
from sklearn.impute import SimpleImputer

imputer = SimpleImputer(missing_values = np.nan, strategy = 'mean')

# .fit() computes the means of the Age and Salary columns (indexes 1 and 2)
# and returns the fitted imputer, so a single call is enough
imputer = imputer.fit(X[:, 1:3])

# After fitting, replace the missing values with the
# means of those columns using .transform()
X[:, 1:3] = imputer.transform(X[:, 1:3])

In [29]:
X

array([['France', 44.0, 72000.0],
       ['Spain', 27.0, 48000.0],
       ['Germany', 30.0, 54000.0],
       ['Spain', 38.0, 61000.0],
       ['Germany', 40.0, 63777.77777777778],
       ['France', 35.0, 58000.0],
       ['Spain', 38.77777777777778, 52000.0],
       ['France', 48.0, 79000.0],
       ['Germany', 50.0, 83000.0],
       ['France', 37.0, 67000.0]], dtype=object)

## How to convert categorical values to ENCODED values with sklearn.preprocessing LabelEncoder, using the .fit_transform() method
1. First, pass the original array/series to the .fit_transform() method, which returns the encoded array
2. Then store/save the encoded values. You can replace the original text values ("France", "Germany", "Spain", etc.) with the encoded values (0, 1, 2, etc.). Not sure if you should just overwrite the original column or add a new column to the matrix??

In [87]:
# Need to convert Country and Purchased columns to
# categories. Can I also use Pandas for this??
from sklearn.preprocessing import LabelEncoder

In [88]:
# First step: create an object of the LabelEncoder class.
# Then we use this object on the categorical
# variables in the matrix.
labelencoder_X = LabelEncoder()

# To apply the labelencoder object to our column of
# countries, we call its .fit_transform() method.
# It fits the object to the first column of X and returns
# the ENCODED values: an integer array with one code per
# country. (These integer codes are not yet the dummy
# variables used in regression.)
labelencoder_X.fit_transform(X[:, 0])

array([0, 2, 1, 2, 1, 0, 2, 0, 1, 0])
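
To see which code belongs to which country, a fitted LabelEncoder exposes `classes_` and `inverse_transform`. A small sketch on a hypothetical four-value array:

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

countries = np.array(["France", "Spain", "Germany", "Spain"])

encoder = LabelEncoder()
codes = encoder.fit_transform(countries)
decoded = encoder.inverse_transform(codes)

print(encoder.classes_)   # ['France' 'Germany' 'Spain'] (sorted; position = code)
print(codes)              # [0 2 1 2]
print(decoded)            # back to the original strings
```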

In [89]:
# Next, need to save encoded values back into the
# original matrix. Tutorial just replaces the values.
X[:, 0] = labelencoder_X.fit_transform(X[:, 0])

# What about trying to add a new column instead of
# overwriting the original? With pandas DF, you just 
# add a new column df['Col'], but need to test with
# np.matrix

# X[:, 3] = np.arange(len(X)) - How to add column?
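
On the "how to add a column?" question: one possible answer, sketched on a small hypothetical array, is `np.column_stack`. Assigning to a column index that does not exist raises an IndexError, so a new array has to be built instead.

```python
import numpy as np

X_demo = np.array([['France', 44.0],
                   ['Spain', 27.0],
                   ['Germany', 30.0]], dtype=object)
codes = np.array([0, 2, 1])   # e.g. the LabelEncoder output for the countries

# X_demo[:, 2] = codes would fail: the array only has columns 0 and 1.
# np.column_stack returns a NEW array with the extra column appended.
X_with_codes = np.column_stack([X_demo, codes])

print(X_with_codes.shape)   # (3, 3)
```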

In [90]:
# Returns the encoded values. However, since they're numeric
# we need to add dummy vars so the machine/pc doesn't
# think that 2 is greater than 0 (Spain > France)
X

array([[0, 44.0, 72000.0],
       [2, 27.0, 48000.0],
       [1, 30.0, 54000.0],
       [2, 38.0, 61000.0],
       [1, 40.0, 63777.77777777778],
       [0, 35.0, 58000.0],
       [2, 38.77777777777778, 52000.0],
       [0, 48.0, 79000.0],
       [1, 50.0, 83000.0],
       [0, 37.0, 67000.0]], dtype=object)

## Returns the encoded values. However, since they're numeric we need to add dummy vars so the machine learning equations don't think that 2 is greater than 0 (Spain > France). We need to create dummy vars using the OneHotEncoder class
from sklearn.preprocessing import OneHotEncoder


## Can do the same thing in Pandas using get_dummies() method/function
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.get_dummies.html
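
A minimal sketch of the pandas route on a hypothetical four-row frame. `dtype=int` is passed so the dummies show up as 0/1 rather than booleans:

```python
import pandas as pd

df = pd.DataFrame({"Country": ["France", "Spain", "Germany", "Spain"],
                   "Age": [44.0, 27.0, 30.0, 38.0]})

# One 0/1 column per country; the original Country column is dropped
dummies = pd.get_dummies(df, columns=["Country"], dtype=int)

print(dummies.columns.tolist())
# ['Age', 'Country_France', 'Country_Germany', 'Country_Spain']
```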

In [91]:
# In sklearn, there is another class, OneHotEncoder, that essentially
# helps create the dummy var columns (1 or 0 for each categorical var)
from sklearn.preprocessing import OneHotEncoder

In [94]:
# The categorical_features argument was removed in scikit-learn 0.22;
# ColumnTransformer now selects which column to encode (index 0)
from sklearn.compose import ColumnTransformer
onehotencoder = ColumnTransformer([("country", OneHotEncoder(), [0])],
                                  remainder = "passthrough",
                                  sparse_threshold = 0)

# Fit to the whole matrix X: column 0 is replaced by the dummy columns
# (placed first) and Age/Salary are passed through unchanged.
# sparse_threshold = 0 returns a dense array, so .toarray() is not needed;
# .astype(float) converts the object array to a numeric one
X = onehotencoder.fit_transform(X).astype(float)

In [95]:
X

array([[0.00000000e+00, 1.00000000e+00, 0.00000000e+00, 0.00000000e+00,
        4.40000000e+01, 7.20000000e+04],
       [1.00000000e+00, 0.00000000e+00, 0.00000000e+00, 1.00000000e+00,
        2.70000000e+01, 4.80000000e+04],
       [1.00000000e+00, 0.00000000e+00, 1.00000000e+00, 0.00000000e+00,
        3.00000000e+01, 5.40000000e+04],
       [1.00000000e+00, 0.00000000e+00, 0.00000000e+00, 1.00000000e+00,
        3.80000000e+01, 6.10000000e+04],
       [1.00000000e+00, 0.00000000e+00, 1.00000000e+00, 0.00000000e+00,
        4.00000000e+01, 6.37777778e+04],
       [0.00000000e+00, 1.00000000e+00, 0.00000000e+00, 0.00000000e+00,
        3.50000000e+01, 5.80000000e+04],
       [1.00000000e+00, 0.00000000e+00, 0.00000000e+00, 1.00000000e+00,
        3.87777778e+01, 5.20000000e+04],
       [0.00000000e+00, 1.00000000e+00, 0.00000000e+00, 0.00000000e+00,
        4.80000000e+01, 7.90000000e+04],
       [1.00000000e+00, 0.00000000e+00, 1.00000000e+00, 0.00000000e+00,
        5.00000000e+01, 

### NOTE: The difference between OneHotEncoder and LabelEncoder is that OHE is used to ensure that the machine learning algorithms do not attribute any order/ranking to the encoded categorical variables (1, 0, 2, etc.). So, if the variables are countries (France, China, USA, etc.), then there is no logical order or ranking, so we should use OHE. However, for T-shirt sizes S, M, L, there is an order/rank, so we should NOT use OHE, since we want the algorithm to know/understand the order between these categorical variables.
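
One caveat for the ordered case: LabelEncoder assigns codes alphabetically, which would give L < M < S. A sketch of one way to keep the real order, assuming scikit-learn's `OrdinalEncoder` with an explicit category list (the sizes array is made up):

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

sizes = np.array([["M"], ["S"], ["L"], ["M"]])   # 2-D: a single feature column

# Spell out the order explicitly instead of relying on alphabetical sorting
encoder = OrdinalEncoder(categories=[["S", "M", "L"]])
codes = encoder.fit_transform(sizes)

print(codes.ravel())   # [1. 0. 2. 1.]
```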

In [98]:
# To encode the dependent variable "Purchased" (yes or no),
# we only need the LabelEncoder: because it is the dependent
# variable, the algorithms treat it as a category and do not
# read an order/ranking into 0 and 1

labelencoder_y = LabelEncoder()
y = labelencoder_y.fit_transform(y)
# transforms y into an encoded variable vector/array

In [97]:
y

array([0, 1, 0, 0, 1, 1, 0, 1, 0, 1])

## NOTE for Section 2-15, change from sklearn.cross_validation import train_test_split TO from sklearn.model_selection import train_test_split

## Split the dataset into a Training set and a Test set using the train_test_split function from sklearn.model_selection. Why?
The model/algorithm learns to do something on your dataset by finding correlations in it. If it learns those correlations too closely, it performs well on the data it has seen but cannot be applied to an entirely new dataset. This is called overfitting your model.

So we have to build two sets, TRAIN and TEST: the model learns from the training set, and we check its performance on the test set, which it has not seen.
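
A sketch of why the held-out test set matters, using synthetic data where the labels have nothing to do with the features. The decision tree is only an assumption for illustration (it is not the course's model); it memorizes the training rows, and only the test score reveals that nothing general was learned.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 5))       # random features
y = rng.integers(0, 2, size=400)    # labels unrelated to the features

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
train_acc = model.score(X_train, y_train)   # 1.0: the tree memorized the training rows
test_acc = model.score(X_test, y_test)      # much lower on rows it has not seen

print(train_acc, test_acc)
```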

In [103]:
from sklearn.model_selection import train_test_split

# Build the training and test sets. Use test_size to set the portion of
# your dataset assigned to the test set. Typically a good
# size is 0.2 or 0.25 (20% or 25% of your dataset).
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2,
                                                   random_state = 0)

In [107]:
X_train, X_test, y_train, y_test

(array([[1.00000000e+00, 0.00000000e+00, 1.00000000e+00, 0.00000000e+00,
         4.00000000e+01, 6.37777778e+04],
        [0.00000000e+00, 1.00000000e+00, 0.00000000e+00, 0.00000000e+00,
         3.70000000e+01, 6.70000000e+04],
        [1.00000000e+00, 0.00000000e+00, 0.00000000e+00, 1.00000000e+00,
         2.70000000e+01, 4.80000000e+04],
        [1.00000000e+00, 0.00000000e+00, 0.00000000e+00, 1.00000000e+00,
         3.87777778e+01, 5.20000000e+04],
        [0.00000000e+00, 1.00000000e+00, 0.00000000e+00, 0.00000000e+00,
         4.80000000e+01, 7.90000000e+04],
        [1.00000000e+00, 0.00000000e+00, 0.00000000e+00, 1.00000000e+00,
         3.80000000e+01, 6.10000000e+04],
        [0.00000000e+00, 1.00000000e+00, 0.00000000e+00, 0.00000000e+00,
         4.40000000e+01, 7.20000000e+04],
        [0.00000000e+00, 1.00000000e+00, 0.00000000e+00, 0.00000000e+00,
         3.50000000e+01, 5.80000000e+04]]),
 array([[1.0e+00, 0.0e+00, 1.0e+00, 0.0e+00, 3.0e+01, 5.4e+04],
        [1.0e+

In [108]:
y_train

array([1, 1, 1, 0, 1, 0, 0, 1])

## Feature Scaling - Section 2-17. Using StandardScaler class. Note that most libraries do not require you to apply feature scaling yourself (it is taken care of for you).
Looking at Age and Salary, the variables are not on the same scale. This causes issues for many machine learning models because they rely on the Euclidean distance between two observations P1 and P2. Since Salary has a much wider range than Age, the Euclidean distance will be dominated by Salary (the squared Salary difference dwarfs the squared Age difference).

That is why we need to use feature scaling on the data. Even if a model is not based on Euclidean distance (e.g., a decision tree), the course still applies feature scaling, because the algorithm will converge much faster (without it, training can run slowly).
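
The arithmetic behind the Euclidean distance argument, worked out for the first two rows of the dataset (France and Spain). Standardizing just these two points is a toy setup chosen for illustration, not what the notebook does with the full training set:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

p1 = np.array([44.0, 72000.0])   # (Age, Salary) of row 0
p2 = np.array([27.0, 48000.0])   # (Age, Salary) of row 1

sq_diff = (p1 - p2) ** 2                    # 289 for Age, 576,000,000 for Salary
distance = np.sqrt(sq_diff.sum())           # roughly 24000: almost pure Salary
salary_share = sq_diff[1] / sq_diff.sum()   # fraction of the squared distance due to Salary

# After standardizing, each feature contributes equally for these two points
scaled = StandardScaler().fit_transform(np.vstack([p1, p2]))
sq_diff_scaled = (scaled[0] - scaled[1]) ** 2

print(salary_share)
print(sq_diff_scaled)
```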

In [111]:
from sklearn.preprocessing import StandardScaler
#Create an object of the StandardScaler class
sc_X = StandardScaler()

# Then need to fit and transform on TRAIN set but only 
# transform on TEST set
X_train = sc_X.fit_transform(X_train)
X_test = sc_X.transform(X_test)

### Should you feature scale your dummy variables?
This is widely debated. It depends on context and on how much interpretability you want to keep in your model. It won't break your model if you don't scale them. But for now, we will scale the dummy variables.
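
For the other side of the debate, here is a sketch of scaling only Age and Salary while leaving the dummy columns as 0/1. It assumes scikit-learn's `ColumnTransformer`; the small matrix and its column layout are hypothetical.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

# Columns 0-2: country dummies; columns 3-4: Age and Salary
X_demo = np.array([[1, 0, 0, 44.0, 72000.0],
                   [0, 0, 1, 27.0, 48000.0],
                   [0, 1, 0, 30.0, 54000.0],
                   [0, 0, 1, 38.0, 61000.0]])

scale_numeric_only = ColumnTransformer(
    [("scale", StandardScaler(), [3, 4])],
    remainder="passthrough",   # leave the dummy columns untouched
)
X_scaled = scale_numeric_only.fit_transform(X_demo)

# Note the column order changes: the scaled columns come first,
# the passed-through dummy columns are appended after them
print(X_scaled.shape)   # (4, 5)
```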

In [112]:
X_train

array([[ 1.        , -1.        ,  2.64575131, -0.77459667,  0.26306757,
         0.12381479],
       [-1.        ,  1.        , -0.37796447, -0.77459667, -0.25350148,
         0.46175632],
       [ 1.        , -1.        , -0.37796447,  1.29099445, -1.97539832,
        -1.53093341],
       [ 1.        , -1.        , -0.37796447,  1.29099445,  0.05261351,
        -1.11141978],
       [-1.        ,  1.        , -0.37796447, -0.77459667,  1.64058505,
         1.7202972 ],
       [ 1.        , -1.        , -0.37796447,  1.29099445, -0.0813118 ,
        -0.16751412],
       [-1.        ,  1.        , -0.37796447, -0.77459667,  0.95182631,
         0.98614835],
       [-1.        ,  1.        , -0.37796447, -0.77459667, -0.59788085,
        -0.48214934]])

In [113]:
# It's important to fit the scaler on X_train only and then use it to
# transform X_test, so that both are scaled on the same basis.
X_test

array([[ 1.        , -1.        ,  2.64575131, -0.77459667, -1.45882927,
        -0.90166297],
       [ 1.        , -1.        ,  2.64575131, -0.77459667,  1.98496442,
         2.13981082]])

In [114]:
# Do we need to apply feature scaling to y (dependent variable)?
# No, don't need to apply feature scaling. But if the dependent variable 
# takes a huge range of values, then you will need to feature scale it
# But for this y, the value range is small since it's an encoded
# categorical variable (yes or no; 1 or 0)
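
For the case where the dependent variable does take a wide range of values (a regression target), a possible sketch is below. The salary-like values and the name `sc_y` are made up. StandardScaler expects 2-D input, hence the reshape.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

y_reg = np.array([48000.0, 54000.0, 61000.0, 72000.0])   # hypothetical regression target

sc_y = StandardScaler()
y_scaled = sc_y.fit_transform(y_reg.reshape(-1, 1)).ravel()        # column in, flat array out
y_back = sc_y.inverse_transform(y_scaled.reshape(-1, 1)).ravel()   # undo the scaling

print(y_scaled)
print(y_back)
```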

In [117]:
dataset

Unnamed: 0,Country,Age,Salary,Purchased
0,France,44.0,72000.0,No
1,Spain,27.0,48000.0,Yes
2,Germany,30.0,54000.0,No
3,Spain,38.0,61000.0,No
4,Germany,40.0,,Yes
5,France,35.0,58000.0,Yes
6,Spain,,52000.0,No
7,France,48.0,79000.0,Yes
8,Germany,50.0,83000.0,No
9,France,37.0,67000.0,Yes


In [118]:
y

array([0, 1, 0, 0, 1, 1, 0, 1, 0, 1])

## Data Preprocessing Overview
1. Get the dataset
2. Importing the libraries
3. Importing the Dataset
4. Dealing with Missing Data
5. Encoding Categorical Data
6. Splitting the Dataset into the Training and Test sets
7. Feature scaling
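
The seven steps can also be chained in one object. This is a sketch under assumptions, not the course's template: it uses scikit-learn's `Pipeline` and `ColumnTransformer`, rebuilds the dataset inline from the rows shown earlier, and imputes with the training-set mean only (the notebook imputes before splitting).

```python
import io
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, StandardScaler

csv_text = """Country,Age,Salary,Purchased
France,44,72000,No
Spain,27,48000,Yes
Germany,30,54000,No
Spain,38,61000,No
Germany,40,,Yes
France,35,58000,Yes
Spain,,52000,No
France,48,79000,Yes
Germany,50,83000,No
France,37,67000,Yes
"""
df = pd.read_csv(io.StringIO(csv_text))              # steps 1-3
X = df[["Country", "Age", "Salary"]]
y = LabelEncoder().fit_transform(df["Purchased"])    # step 5 (response)

numeric = Pipeline([("impute", SimpleImputer(strategy="mean")),   # step 4
                    ("scale", StandardScaler())])                 # step 7
preprocess = ColumnTransformer(
    [("country", OneHotEncoder(handle_unknown="ignore"), ["Country"]),  # step 5
     ("numeric", numeric, ["Age", "Salary"])],
    sparse_threshold=0,   # always return a dense array
)

X_train, X_test, y_train, y_test = train_test_split(              # step 6
    X, y, test_size=0.2, random_state=0)
X_train_ready = preprocess.fit_transform(X_train)   # fit on the training rows only
X_test_ready = preprocess.transform(X_test)         # reuse the training statistics

print(X_train_ready.shape, X_test_ready.shape)
```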

In [119]:
X

array([[0.00000000e+00, 1.00000000e+00, 0.00000000e+00, 0.00000000e+00,
        4.40000000e+01, 7.20000000e+04],
       [1.00000000e+00, 0.00000000e+00, 0.00000000e+00, 1.00000000e+00,
        2.70000000e+01, 4.80000000e+04],
       [1.00000000e+00, 0.00000000e+00, 1.00000000e+00, 0.00000000e+00,
        3.00000000e+01, 5.40000000e+04],
       [1.00000000e+00, 0.00000000e+00, 0.00000000e+00, 1.00000000e+00,
        3.80000000e+01, 6.10000000e+04],
       [1.00000000e+00, 0.00000000e+00, 1.00000000e+00, 0.00000000e+00,
        4.00000000e+01, 6.37777778e+04],
       [0.00000000e+00, 1.00000000e+00, 0.00000000e+00, 0.00000000e+00,
        3.50000000e+01, 5.80000000e+04],
       [1.00000000e+00, 0.00000000e+00, 0.00000000e+00, 1.00000000e+00,
        3.87777778e+01, 5.20000000e+04],
       [0.00000000e+00, 1.00000000e+00, 0.00000000e+00, 0.00000000e+00,
        4.80000000e+01, 7.90000000e+04],
       [1.00000000e+00, 0.00000000e+00, 1.00000000e+00, 0.00000000e+00,
        5.00000000e+01, 

## Data Preprocessing Template used for this class:
1. Get the dataset
2. Importing the libraries
3. Importing the Dataset
4. ~~Dealing with Missing Data~~
5. ~~Encoding Categorical Data~~
6. Splitting the Dataset into the Training and Test sets
7. Feature scaling

In [115]:
"""# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# Importing the dataset
dataset = pd.read_csv("Data.csv")
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, 3].values

# Taking care of missing data
# (SimpleImputer replaces Imputer, removed in scikit-learn 0.22)
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(missing_values = np.nan, strategy = "mean")
imputer = imputer.fit(X[:, 1:3])
X[:, 1:3] = imputer.transform(X[:, 1:3])

# Encoding categorical data
# (ColumnTransformer replaces the removed categorical_features argument)
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
labelencoder_X = LabelEncoder()
X[:, 0] = labelencoder_X.fit_transform(X[:, 0])
onehotencoder = ColumnTransformer([("country", OneHotEncoder(), [0])],
                                  remainder = "passthrough", sparse_threshold = 0)
X = onehotencoder.fit_transform(X).astype(float)
labelencoder_y = LabelEncoder()
y = labelencoder_y.fit_transform(y)

# Splitting the dataset into the Training and Test sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)

# Feature scaling
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)
X_test = sc_X.transform(X_test)"""