**Notes from Professor Xin Tong's DSO 530 class at USC**

## 1. Dealing with Missing Data

### 1.1 Identifying Missing Values in Tabular Data

In [2]:
import pandas as pd
from io import StringIO

In [12]:
df = pd.read_csv("Housing.csv")

In [13]:
df.head(5)

Unnamed: 0,crim,zn,river,rm,ptratio,medv
0,0.00632,18.0,0,6.575,15.3,24.0
1,0.02731,0.0,0,6.421,17.8,21.6
2,0.02729,0.0,0,7.185,17.8,34.7
3,0.03237,0.0,0,6.998,18.7,33.4
4,0.06905,0.0,0,7.147,18.7,36.2


In [16]:
# Return a dataframe with boolean values that indicate whether a cell 
# contains value (False) or if data is missing (True)

df.isnull()

Unnamed: 0,crim,zn,river,rm,ptratio,medv
0,False,False,False,False,False,False
1,False,False,False,False,False,False
2,False,False,False,False,False,False
3,False,False,False,False,False,False
4,False,False,False,False,False,False
...,...,...,...,...,...,...
501,False,False,False,False,False,False
502,False,False,False,False,False,False
503,False,False,False,False,False,False
504,False,False,False,False,False,False


In [17]:
# Using sum method, we can return number of missing values per column

df.isnull().sum()

crim       0
zn         0
river      0
rm         0
ptratio    0
medv       0
dtype: int64

In [18]:
# Although scikit-learn was developed for working with NumPy arrays, 
# it can sometimes be more convenient to preprocess using pandas' DF

# We can always access the underlying NumPy array of a DF using 
# values attribute before feeding into scikit-learn estimator

df.values

array([[6.3200e-03, 1.8000e+01, 0.0000e+00, 6.5750e+00, 1.5300e+01,
        2.4000e+01],
       [2.7310e-02, 0.0000e+00, 0.0000e+00, 6.4210e+00, 1.7800e+01,
        2.1600e+01],
       [2.7290e-02, 0.0000e+00, 0.0000e+00, 7.1850e+00, 1.7800e+01,
        3.4700e+01],
       ...,
       [6.0760e-02, 0.0000e+00, 0.0000e+00, 6.9760e+00, 2.1000e+01,
        2.3900e+01],
       [1.0959e-01, 0.0000e+00, 0.0000e+00, 6.7940e+00, 2.1000e+01,
        2.2000e+01],
       [4.7410e-02, 0.0000e+00, 0.0000e+00, 6.0300e+00, 2.1000e+01,
        1.1900e+01]])

### 1.2 Eliminating Observations or Features with Missing Values

In [19]:
# Easiest way to deal with missing data is to remove features (columns)
# or observations (rows) 

# Rows with missing values can be dropped with dropna method

# df.dropna(axis=0)   # axis=0 drops rows; axis=1 would drop columns

In [21]:
# Only drop rows where all columns are NaN
# (i.e., a row where every value is NaN)

# df.dropna(how='all')

In [22]:
# Drop rows that have fewer than 4 non-NaN values

# df.dropna(thresh=4)

In [23]:
# Only drop rows where NaN appear in specific columns (here: column 'C')

# df.dropna(subset=['C'])

But we might end up removing too many observations, or lose valuable information that our classifier needs to discriminate between classes!
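The `dropna` variants above are commented out because the Housing data has no missing values. A minimal sketch on a hypothetical toy DataFrame (the frame `toy` and its values are made up for illustration) shows what each call keeps:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame (not the Housing data) with a few NaN cells
toy = pd.DataFrame({
    "A": [1.0, 5.0, np.nan, np.nan],
    "B": [2.0, 6.0, np.nan, 8.0],
    "C": [3.0, np.nan, np.nan, 9.0],
    "D": [4.0, 8.0, np.nan, np.nan],
})

rows_complete = toy.dropna(axis=0)     # keep only rows without any NaN
cols_complete = toy.dropna(axis=1)     # keep only columns without any NaN
not_all_nan = toy.dropna(how="all")    # drop only rows where every value is NaN
at_least_3 = toy.dropna(thresh=3)      # keep rows with at least 3 non-NaN values
c_present = toy.dropna(subset=["C"])   # drop rows where column 'C' is NaN

print(len(rows_complete), cols_complete.shape[1], len(not_all_nan),
      len(at_least_3), len(c_present))
```

Note that `axis=1` removes every column here, because row 2 is NaN everywhere. This is the "removing too much" risk in its most extreme form.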

### 1.3 Imputing Missing Values 

##### 1.3.1 Use sklearn.impute.SimpleImputer

In [25]:
# Mean Imputation, replace NaN value with mean value of entire column

import numpy as np 
from sklearn.impute import SimpleImputer

imp = SimpleImputer(missing_values = np.nan, strategy='mean')
imp = imp.fit(df.values)
imputed_data = imp.transform(df.values)
imputed_data

# Other options for strategy parameter are median, constant or 
# most_frequent (useful for categorical values)

# When strategy = 'constant', there is another parameter called fill_value
# which is used to replace all occurrences of missing_values. If left to 
# default, fill_value = 0 for numerical data and "missing_value" for strings/
# object data types

array([[6.3200e-03, 1.8000e+01, 0.0000e+00, 6.5750e+00, 1.5300e+01,
        2.4000e+01],
       [2.7310e-02, 0.0000e+00, 0.0000e+00, 6.4210e+00, 1.7800e+01,
        2.1600e+01],
       [2.7290e-02, 0.0000e+00, 0.0000e+00, 7.1850e+00, 1.7800e+01,
        3.4700e+01],
       ...,
       [6.0760e-02, 0.0000e+00, 0.0000e+00, 6.9760e+00, 2.1000e+01,
        2.3900e+01],
       [1.0959e-01, 0.0000e+00, 0.0000e+00, 6.7940e+00, 2.1000e+01,
        2.2000e+01],
       [4.7410e-02, 0.0000e+00, 0.0000e+00, 6.0300e+00, 2.1000e+01,
        1.1900e+01]])
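Since the Housing data has nothing to impute, the other `strategy` options are easier to see on a small example. The sketch below uses a made-up single column (`col` is hypothetical) with one missing entry:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical toy column; the entry at index 3 is missing
col = np.array([[1.0], [1.0], [3.0], [np.nan], [10.0]])

median_imp = SimpleImputer(strategy="median").fit_transform(col)
mode_imp = SimpleImputer(strategy="most_frequent").fit_transform(col)
const_imp = SimpleImputer(strategy="constant", fill_value=-1).fit_transform(col)

print(median_imp[3, 0])  # median of the observed values 1, 1, 3, 10
print(mode_imp[3, 0])    # most frequent observed value
print(const_imp[3, 0])   # the fill_value we passed in
```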

##### 1.3.2 Use sklearn.impute.KNNImputer

Uses k-Nearest Neighbors approach 

- By default, a Euclidean distance metric that supports missing values, nan_euclidean_distances, is used to find the nearest neighbors 

- Each observation's missing values are imputed using mean values from n_neighbors nearest neighbors found in training data set

In [28]:
# Replace NaN, encoded as np.nan, using mean feature value of 2 nearest 
# neighbors of samples

import numpy as np 
from sklearn.impute import KNNImputer

X = [ [1,2,np.nan], [3,4,3], [np.nan, np.nan, 5], [8,8,7] ]
df1 = pd.DataFrame(X, columns = ['A', 'B', 'C'])
df1

Unnamed: 0,A,B,C
0,1.0,2.0,
1,3.0,4.0,3.0
2,,,5.0
3,8.0,8.0,7.0


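- If two rows have no coordinate where both values are present, the distance between them is undefined (NaN), so they are effectively infinitely far apart.
- E.g. rows 0 and 2: A0 is present but A2 is missing, B0 is present but B2 is missing, and C0 is missing.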

In [29]:
# Set n_neighbors to 2, number of neighboring observations to use 
# Default is 5

# Default "weights" = "uniform": all points in each neighborhood are weighted equally
# Can set to "distance": closer neighbors have greater influence

imputer = KNNImputer(n_neighbors=2, weights='uniform')
imputer.fit_transform(X) # will get np array

array([[1. , 2. , 5. ],
       [3. , 4. , 3. ],
       [5.5, 6. , 5. ],
       [8. , 8. , 7. ]])

The distances used when "weights" = "distance" are scaled by a fraction to compensate for coordinates that had to be skipped:
- total # coordinates in each vector (e.g. 3) / # coordinates present in both vectors (e.g. 2)
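To make the scaling concrete, here is a small sketch that computes the pairwise distances for the same `X` with scikit-learn's `nan_euclidean_distances` and reproduces one entry by hand. The variable names `D` and `manual_01` are only for illustration:

```python
import numpy as np
from sklearn.metrics.pairwise import nan_euclidean_distances

X = np.array([[1, 2, np.nan], [3, 4, 3], [np.nan, np.nan, 5], [8, 8, 7]])
D = nan_euclidean_distances(X, X)

# Rows 0 and 1 share coordinates A and B, so the sum of squared differences
# is scaled by 3 / 2 (total coordinates / coordinates present in both rows)
manual_01 = np.sqrt(3 / 2 * ((1 - 3) ** 2 + (2 - 4) ** 2))

print(D[0, 1], manual_01)  # the two values should agree
print(D[0, 2])             # rows 0 and 2 share no present coordinate
```

Rows 1 and 3 are therefore the two usable neighbors of row 0, which is consistent with the imputed value (3 + 7) / 2 = 5 in the output above.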

### 1.4 Understanding scikit-learn Estimator API

SimpleImputer and KNNImputer classes belong to "transformer classes" in scikit-learn, which are used for data transformation.

##### Two essential methods: fit and transform

- **Fit:** learn parameters from the training data 
- **Transform:** uses those parameters to transform data

Any data array that is transformed needs to have the same number of features as the data array used to fit the model.

In [30]:
# Train data

data_train = {'A': [3,2,np.nan,4,3],
             
             'B': [3,np.nan,4,4,5],
             'C': [np.nan, 4.8, 5.1, 4.9, 5.2],
             'D': [6,7,9,np.nan,10]}
df2_train = pd.DataFrame(data = data_train)
df2_train

Unnamed: 0,A,B,C,D
0,3.0,3.0,,6.0
1,2.0,,4.8,7.0
2,,4.0,5.1,9.0
3,4.0,4.0,4.9,
4,3.0,5.0,5.2,10.0


In [31]:
# Mean of df2_train

df2_train.mean(axis=0)

A    3.0
B    4.0
C    5.0
D    8.0
dtype: float64

In [32]:
# Use fit to build model imp2 by calculating the mean value of each column in training data
# Use transform to impute missing data in training data

imp2 = SimpleImputer(missing_values = np.nan, strategy = "mean")
imp2 = imp2.fit(df2_train)

imputed_df2_train = imp2.transform(df2_train)
imputed_df2_train

array([[ 3. ,  3. ,  5. ,  6. ],
       [ 2. ,  4. ,  4.8,  7. ],
       [ 3. ,  4. ,  5.1,  9. ],
       [ 4. ,  4. ,  4.9,  8. ],
       [ 3. ,  5. ,  5.2, 10. ]])

In [34]:
# Test data:

data_test = {'A': [2,np.nan,3], 
            'B': [4,5,np.nan],
            'C': [np.nan, 4,5],
            'D': [7,8,np.nan]}

df2_test = pd.DataFrame(data = data_test)
df2_test

Unnamed: 0,A,B,C,D
0,2.0,4.0,,7.0
1,,5.0,4.0,8.0
2,3.0,,5.0,


In [35]:
# Mean of df2_test

df2_test.mean(axis=0)

A    2.5
B    4.5
C    4.5
D    7.5
dtype: float64

In [36]:
# Use imp2.transform to impute missing data in test data 

# Imputed values are means of the *training* data not test data 

imputed_df2_test = imp2.transform(df2_test)
imputed_df2_test

array([[2., 4., 5., 7.],
       [3., 5., 4., 8.],
       [3., 4., 5., 8.]])
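One way to confirm where the imputed values come from is to inspect the fitted imputer's `statistics_` attribute, which holds the per-column fill values learned in `fit`. The sketch below is self-contained and uses only columns A and B from the data above:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

train = pd.DataFrame({"A": [3, 2, np.nan, 4, 3], "B": [3, np.nan, 4, 4, 5]})
test = pd.DataFrame({"A": [2, np.nan, 3], "B": [4, 5, np.nan]})

imp = SimpleImputer(strategy="mean").fit(train)
print(imp.statistics_)        # column means learned from the training data

filled = imp.transform(test)  # test NaNs are replaced by the training means
print(filled)
```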

## 2. Handling Categorical Data

### 2.1 Nominal and Ordinal Features

- Nominal: doesn't imply any order (e.g. t-shirt color, since it doesn't make sense to say red > blue)
- Ordinal: can be sorted/ordered (e.g. t-shirt size, XL > L > M > S)

##### 2.1.1 Creating an Example Dataset

In [37]:
import pandas as pd
df = pd.DataFrame([['green', 'M', 10.1, 'class1'],
                   ['red', 'L', 13.5, 'class2'],
                   ['blue', 'XL', 15.3, 'class1']])
df.columns = ['color', 'size', 'price', 'classlabel']
df

Unnamed: 0,color,size,price,classlabel
0,green,M,10.1,class1
1,red,L,13.5,class2
2,blue,XL,15.3,class1


The DF contains a nominal feature (color), an ordinal feature (size), and a numerical feature (price). 

### 2.2 Mapping Ordinal Features

In [38]:
# Define mapping manually

size_mapping = {'XL':3, 
                'L': 2,
                'M':1}
df['size'] = df['size'].map(size_mapping)
df

Unnamed: 0,color,size,price,classlabel
0,green,1,10.1,class1
1,red,2,13.5,class2
2,blue,3,15.3,class1


In [39]:
# Define reverse-mapping dictionary to transform integers back to original string representation

inv_size_mapping = {v:k for k,v in size_mapping.items()}
df['size'].map(inv_size_mapping)

0     M
1     L
2    XL
Name: size, dtype: object
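As an alternative to a hand-written dictionary (not the approach used in these notes), scikit-learn's `OrdinalEncoder` accepts an explicit category order. A minimal sketch, assuming the order M < L < XL; note that its codes start at 0 rather than 1:

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

sizes = pd.DataFrame({"size": ["M", "L", "XL"]})

# The order of the list defines the order of the integer codes
enc = OrdinalEncoder(categories=[["M", "L", "XL"]])
codes = enc.fit_transform(sizes)          # M -> 0.0, L -> 1.0, XL -> 2.0
restored = enc.inverse_transform(codes)   # back to the original strings

print(codes.ravel(), restored.ravel())
```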

### 2.3 Encoding Class Labels

In [41]:
# Good practice to provide class labels as integer arrays 
# Since class labels are not ordinal, it doesn't matter which integer number is assigned to the string label

from sklearn.preprocessing import LabelEncoder

class_le = LabelEncoder()
y = class_le.fit_transform(df['classlabel'].values)
y

array([0, 1, 0])

In [42]:
df['classlabel']=y
df

Unnamed: 0,color,size,price,classlabel
0,green,1,10.1,0
1,red,2,13.5,1
2,blue,3,15.3,0


In [43]:
# Use inverse_transform to change classlabel back to original string 
# Reverse mapping

df['classlabel'] = class_le.inverse_transform(y)
df

Unnamed: 0,color,size,price,classlabel
0,green,1,10.1,class1
1,red,2,13.5,class2
2,blue,3,15.3,class1


In [44]:
# fit_transform method is a shortcut to call fit and transform separately 

class_le2 = LabelEncoder()
class_le2 = class_le2.fit(df['classlabel'].values)

y2 = class_le2.transform(df['classlabel'].values)
y2

array([0, 1, 0])
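To see which integer was assigned to which label, the fitted encoder's `classes_` attribute lists the labels in code order. A small sketch (the dictionary `mapping` is just a convenience built for illustration):

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

labels = np.array(["class1", "class2", "class1"])
le = LabelEncoder().fit(labels)

# classes_ is sorted; the position of a label is its integer code
mapping = {str(label): idx for idx, label in enumerate(le.classes_)}
print(mapping)
```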

### 2.4 Performing One-Hot Encoding on Nominal Features

scikit-learn's estimators for classification treat class labels as categorical *nominal* data 

In [45]:
X = df[['color', 'size', 'price']].values
X

array([['green', 1, 10.1],
       ['red', 2, 13.5],
       ['blue', 3, 15.3]], dtype=object)

In [46]:
color_le = LabelEncoder()
X[:, 0] = color_le.fit_transform(X[:, 0])
X

array([[1, 1, 10.1],
       [2, 2, 13.5],
       [0, 3, 15.3]], dtype=object)

The first column of the NumPy array X now holds new color values, where blue = 0, green = 1, red = 2

**BUT, a learning algorithm will now assume that green > blue, red > green** 

Therefore, we will use one-hot encoding, creating a new dummy feature for each value in the nominal feature column. For example, blue is encoded as blue=1, green=0, red=0.

**OneHotEncoder and ColumnTransformer Function**: 

In [50]:
# ColumnTransformer specifies which column to transform 

from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

X = df[['color', 'size', 'price']].values
ct = ColumnTransformer([("color", OneHotEncoder(), [0])], remainder = 'passthrough')
# the name "color" is only a label for this transformer step; it seems redundant, but ColumnTransformer requires one!

X = ct.fit_transform(X)

X

array([[0.0, 1.0, 0.0, 1, 10.1],
       [0.0, 0.0, 1.0, 2, 13.5],
       [1.0, 0.0, 0.0, 3, 15.3]], dtype=object)

**get_dummies method**: converts only string columns and leaves all other columns unchanged

In [49]:
# one-hot encoding via pandas

pd.get_dummies(df[['price', 'color', 'size']])

Unnamed: 0,price,size,color_blue,color_green,color_red
0,10.1,1,0,1,0
1,13.5,2,0,0,1
2,15.3,3,1,0,0


**BUT, using a one-hot encoded dataset introduces multicollinearity. If features are highly correlated, matrices are computationally difficult to invert, leading to numerically unstable estimates.**

**Therefore, to reduce correlation among variables, we can remove one dummy column. No important information is lost, since the removed category can be inferred from the others. Example: color_green=0, color_red=0 means the observation is blue.**

**For a nominal variable with K categories, we need K-1 dummies.**
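The collinearity can be checked numerically. The sketch below assumes a model with an intercept column (an assumption, since the notes do not specify the model): with all K dummies the columns sum to the intercept, so the design matrix loses a rank, while K-1 dummies keep it full rank. The color values are made up.

```python
import numpy as np
import pandas as pd

colors = pd.Series(["green", "red", "blue", "red", "green", "blue"], name="color")
full = pd.get_dummies(colors).to_numpy(dtype=float)                      # K = 3 dummies
reduced = pd.get_dummies(colors, drop_first=True).to_numpy(dtype=float)  # K - 1 = 2

intercept = np.ones((len(colors), 1))
X_full = np.hstack([intercept, full])        # 4 columns
X_reduced = np.hstack([intercept, reduced])  # 3 columns

rank_full = np.linalg.matrix_rank(X_full)
rank_reduced = np.linalg.matrix_rank(X_reduced)
print(rank_full, X_full.shape[1])        # rank is smaller than the column count
print(rank_reduced, X_reduced.shape[1])  # rank equals the column count
```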

In [51]:
# multicollinearity guard in get_dummies

In [52]:
pd.get_dummies(df[['price', 'color', 'size']], drop_first = True)

Unnamed: 0,price,size,color_green,color_red
0,10.1,1,1,0
1,13.5,2,0,1
2,15.3,3,0,0


In [53]:
# OneHotEncoder and ColumnTransformer function

In [55]:
X = df[['color', 'size', 'price']].values
ct = ColumnTransformer([("color", OneHotEncoder(drop = 'first'), [0])], remainder = 'passthrough')

X = ct.fit_transform(X)
X

array([[1.0, 0.0, 1, 10.1],
       [0.0, 1.0, 2, 13.5],
       [0.0, 0.0, 3, 15.3]], dtype=object)

## 3. Partitioning a Dataset into Separate Training and Test Sets

Before we apply our model to new measurements, we need to know whether it works, i.e., whether we should trust its predictions. 

We cannot evaluate the model on the same data we used to build it. Our model can simply remember the training set and always predict the correct label for any point in it. 

- Training data/training set
- Test data/test set/hold-out set

In [63]:
df_wine = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data', header=None)

df_wine.columns = ['Class label', 'Alcohol', 'Malic acid', 'Ash', 
                  'Alcalinity of ash', 'Magnesium', 'Total phenols', 
                  'Flavanoids', 'Nonflavanoid phenols', 'Proanthocyanins',
                  'Color intensity', 'Hue', 'OD280/OD315 of diluted wines',
                  'Proline']

print('Class labels', np.unique(df_wine['Class label']))
df_wine.head()

Class labels [1 2 3]


Unnamed: 0,Class label,Alcohol,Malic acid,Ash,Alcalinity of ash,Magnesium,Total phenols,Flavanoids,Nonflavanoid phenols,Proanthocyanins,Color intensity,Hue,OD280/OD315 of diluted wines,Proline
0,1,14.23,1.71,2.43,15.6,127,2.8,3.06,0.28,2.29,5.64,1.04,3.92,1065
1,1,13.2,1.78,2.14,11.2,100,2.65,2.76,0.26,1.28,4.38,1.05,3.4,1050
2,1,13.16,2.36,2.67,18.6,101,2.8,3.24,0.3,2.81,5.68,1.03,3.17,1185
3,1,14.37,1.95,2.5,16.8,113,3.85,3.49,0.24,2.18,7.8,0.86,3.45,1480
4,1,13.24,2.59,2.87,21.0,118,2.8,2.69,0.39,1.82,4.32,1.04,2.93,735


In [65]:
# Convenient way to separate training and test is to use train_test_split
# from scikit-learn

from sklearn.model_selection import train_test_split

X,y = df_wine.iloc[:, 1:].values, df_wine.iloc[:, 0].values

X_train, X_test, y_train, y_test = train_test_split(X,y,
                                                   test_size = 0.3,
                                                   random_state=0,
                                                   stratify = y)

# X: columns 2-14 (the 13 features)
# y: column 1 (the class label)
# test_size: 0.3 assigns 30% to X_test and y_test
# and 70% to X_train and y_train (default is 0.25)
# random_state: set random seed
# stratify = y: ensures both training and test have same class proportions
# as original dataset

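Don't allocate too much data to the test set, since that withholds information the learning algorithm could benefit from. On the other hand, the smaller the test set, the more inaccurate the estimate of the generalization error. The most common splits are 60:40, 70:30, or 80:20. 

For larger datasets, 90:10 or 99:1 splits are also common and appropriate. 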

After training and evaluation, it's common practice to retrain a classifier on the entire dataset, since it can improve the predictive performance of the model.
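The effect of `stratify` is easiest to see on imbalanced labels. The sketch below uses hypothetical synthetic labels (60/30/10 samples per class) instead of the wine data, so that it runs without a download:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 60, 30 and 10 samples of classes 1, 2, 3
y = np.repeat([1, 2, 3], [60, 30, 10])
X = np.arange(len(y)).reshape(-1, 1)   # placeholder feature column

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)

test_counts = np.bincount(y_te)[1:]    # samples per class in the test set
train_counts = np.bincount(y_tr)[1:]   # samples per class in the training set
print(test_counts / len(y_te))         # class proportions match the full data
print(train_counts / len(y_tr))
```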

## 4. Bringing Features onto the Same Scale

Bringing different features onto the same scale: normalization and standardization

**Normalization**: rescaling of features to the range [0, 1], which is a special case of min-max scaling

In [66]:
from sklearn.preprocessing import MinMaxScaler

mms = MinMaxScaler()
X_train_norm = mms.fit_transform(X_train)
X_test_norm = mms.transform(X_test)

*If xmin in the test data is smaller than xmin in the training data, the transformed test point will have a value smaller than 0, because the scaler reuses the training min and max*

In [67]:
ex_train = np.array([3,4,5,6,7,8]).reshape(6,1)
ex_train

array([[3],
       [4],
       [5],
       [6],
       [7],
       [8]])

In [68]:
mms_ex = MinMaxScaler()
ex_train_norm = mms_ex.fit_transform(ex_train)
ex_train_norm

array([[0. ],
       [0.2],
       [0.4],
       [0.6],
       [0.8],
       [1. ]])

In [69]:
ex_test = np.array([2,3,4,5]).reshape(4,1)
ex_test

array([[2],
       [3],
       [4],
       [5]])

In [70]:
ex_test_norm = mms_ex.transform(ex_test)
ex_test_norm

array([[-0.2],
       [ 0. ],
       [ 0.2],
       [ 0.4]])
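The negative value follows directly from the min-max formula, x_norm = (x - xmin) / (xmax - xmin), with xmin and xmax taken from the training data. A minimal sketch reproducing the result by hand (the names `x_min`, `x_max` and `manual_test_norm` are only for illustration):

```python
import numpy as np

ex_train = np.array([3, 4, 5, 6, 7, 8], dtype=float)
ex_test = np.array([2, 3, 4, 5], dtype=float)

# Min and max come from the training data only
x_min, x_max = ex_train.min(), ex_train.max()

manual_test_norm = (ex_test - x_min) / (x_max - x_min)
print(manual_test_norm)  # first entry is (2 - 3) / (8 - 3) = -0.2
```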

**Standardization**: center feature columns at mean 0 with standard deviation 1. 

In [71]:
from sklearn.preprocessing import StandardScaler

stdsc = StandardScaler()
X_train_std = stdsc.fit_transform(X_train)
X_test_std = stdsc.transform(X_test)
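As with MinMaxScaler, the mean and standard deviation are learned from the training data and reused for the test data. A small sketch on made-up one-column arrays, checking `StandardScaler` against the formula (x - mean) / std, where std is the population standard deviation (`ddof=0`, NumPy's default):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

ex_train = np.array([[3.0], [4.0], [5.0], [6.0], [7.0], [8.0]])
ex_test = np.array([[2.0], [5.5]])

scaler = StandardScaler().fit(ex_train)
scaled = scaler.transform(ex_test)

# Same computation by hand, using the training mean and training std
manual = (ex_test - ex_train.mean(axis=0)) / ex_train.std(axis=0)
print(scaled.ravel(), manual.ravel())
```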