UC Irvine's Adult Data Set Notebook
============================

This Jupyter notebook contains the code used during the OUAIxHacklahoma workshop for Fall 2021 (December 1st, 2021). Credit goes to the University of California, Irvine for collecting and organizing the provided ["Adult Data Set"](https://archive.ics.uci.edu/ml/datasets/Adult) dataset.

Objectives
--------------

This Jupyter notebook implements a solution that classifies whether an adult makes _more than_ or _less than_ \$50,000 a year. The dataset provides several attributes from which to predict this. The label attribute in this dataset is __income__. Unknown data values are marked with a `?` symbol.

In [None]:
import numpy as np
import pandas as pd

Data Obtention
============

The UCI Adult dataset is provided in the datasets folder. We load it with pandas' `read_csv` function. Other formats are supported as well; read more in the pandas [documentation](https://pandas.pydata.org/docs/reference/io.html).

In [None]:
adult = pd.read_csv('datasets/adult.csv')

Data Discovery
============

__It is important for us to gain an understanding of our data before performing any kind of learning or analysis.__ By "Discovery" we mean understanding what attributes and data types the dataset includes. This also includes identifying useful and useless data attributes (such as dates or indexes). At this stage we want to focus on getting a grasp of what needs to be done next to achieve our goals.

### Look at the dataset's attributes.

In [None]:
adult.columns

### Look at the first and last 7 entries.

In [None]:
adult.head(7)

In [None]:
adult.tail(7)

### See what kind of data the dataset contains.
Note that workclass, education, marital.status, occupation, relationship, race, sex, native.country, and income are of type `object`. Since we are using a CSV, we know that these columns hold text, most probably _categorical_ data.

Be careful with other file formats such as JSON or pickle: whole Python objects could be stored in the dataset, which would require further data processing.

In [None]:
adult.info()

We can further probe into categorical data by using the `value_counts()` function. This gives us a `Series` that maps each value to the number of times it is found in the data. Play around with the other suspected categorical attributes to learn more about them.

In [None]:
adult['education'].value_counts()
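
As a small illustration of what `value_counts()` returns, the sketch below uses a made-up column rather than the real dataset. The `normalize=True` option is not used elsewhere in this notebook; it is shown here only because relative frequencies are often easier to compare than raw counts.

```python
import pandas as pd

# Toy stand-in for a categorical column; these values are made up for illustration.
education = pd.Series(
    ["HS-grad", "Bachelors", "HS-grad", "Masters", "HS-grad", "Bachelors"],
    name="education",
)

counts = education.value_counts()                # absolute counts, most frequent first
shares = education.value_counts(normalize=True)  # relative frequencies that sum to 1

print(counts)
print(shares.round(2))
```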

## Dataset Statistics

The `describe()` function provided by pandas gives us summary statistics (count, mean, standard deviation, minimum, quartiles, and maximum) for each attribute. This information can be useful to gain a better understanding of the dataset itself.

In [None]:
adult.describe()

Note that only __6__ attributes are displayed. This is because `describe()` only summarizes numerical columns by default.
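
The non-numerical columns can be summarized too. The sketch below assumes a tiny made-up frame (not the real dataset) and asks `describe()` to skip the numerical columns instead, which reports the count, the number of unique values, the most frequent value, and its frequency.

```python
import numpy as np
import pandas as pd

# Toy frame with one numerical and two text columns (values are illustrative only).
toy = pd.DataFrame({
    "age": [25, 38, 52, 41],
    "workclass": ["Private", "Private", "Self-emp", "?"],
    "income": ["<=50K", "<=50K", ">50K", ">50K"],
})

numeric_summary = toy.describe()                    # numerical columns only
text_summary = toy.describe(exclude=[np.number])    # count, unique, top, freq

print(numeric_summary)
print(text_summary)
```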

#### Plotting the dataset

We can gain further knowledge by plotting the dataset. Pay close attention to the distribution! The `hist()` function allows us to plot a histogram. These are very useful when dealing with data that may have different statistical distributions. The `bins` parameter allows us to choose how many "rectangles" to include in a single histogram. The `figsize` parameter sets the size of the figure as (width, height) in inches.

In [None]:
adult.hist(bins=50, figsize=(20, 15))

# Dataset Preparation

Once we have gained a better understanding of the kind of data that we are dealing with, it is important to avoid making assumptions about the dataset. Certain attributes may not influence the final prediction the way we expect. This is crucial when seeking the best machine learning model.

When working with machine learning, it is very important for us to prepare the dataset for training. This is accomplished by first separating the data into two parts: _Training_ and _Testing_. Doing this will allow us to maintain a neutral position on the dataset and avoid outside bias from interfering with the learning process. Additionally, we will be using the testing dataset after we perform learning in order for us to test how well our model does with unseen data. At the end of the day, we care about a model that can do well with unseen data rather than just the data that it has learned already.

`StratifiedShuffleSplit` is a commonly used technique for separating the training and testing subsets. It uses a categorical attribute to keep each category represented in the same proportion in both subsets. In data science, an __80/20__ split is a very common choice: we keep 80% of the data for training and 20% for testing.

### How do we choose which attribute to split on?

Expert advice is best when deciding which attribute to use. In general, it is a good idea to stratify on an attribute that is likely to be very important for learning, since it is beneficial to keep that attribute as well represented as possible in both subsets.

In [None]:
from sklearn.model_selection import StratifiedShuffleSplit

attrib = 'education'
split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
for train_i, test_i in split.split(adult, adult[attrib]):
    strat_train_set = adult.loc[train_i]
    strat_test_set = adult.loc[test_i]

A look at the split data.

In [None]:
strat_train_set.head()

In [None]:
strat_test_set.head()

We can see that the distribution of samples across categories is preserved in both subsets.

In [None]:
# Train dataset
strat_train_set[attrib].value_counts().plot(figsize=(10, 5))

In [None]:
# Test dataset
strat_test_set[attrib].value_counts().plot(figsize=(10, 5))

In [None]:
# The distribution is kept if the graphs of the other datasets resemble this one.
adult[attrib].value_counts().plot(figsize=(10, 5))
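
The same check can be done numerically instead of visually. The following is a minimal sketch on made-up data (100 rows with a 60/30/10 category mix, not the real dataset) that compares the category proportions of the full, training, and testing sets side by side.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit

# Toy data: 100 rows with an imbalanced categorical column (made up for illustration).
toy = pd.DataFrame({
    "education": ["HS-grad"] * 60 + ["Bachelors"] * 30 + ["Masters"] * 10,
    "value": np.arange(100),
})

split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_i, test_i = next(split.split(toy, toy["education"]))
train, test = toy.iloc[train_i], toy.iloc[test_i]

# One column per dataset; rows are aligned on the category name.
proportions = pd.DataFrame({
    "full": toy["education"].value_counts(normalize=True),
    "train": train["education"].value_counts(normalize=True),
    "test": test["education"].value_counts(normalize=True),
})
print(proportions)
```

If stratification worked, the three columns should be nearly identical for every category.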

From now on, __only use the training dataset__ to perform any further data analysis.

## Correlations

In data science we look for correlations within the data that could give the machine learning model better insights, so it is a good idea to check which attributes may be the most useful. Note that this only works with numerical data; we cannot apply a simple Pearson correlation to attributes that are categorical or of `object` type.

The most widely used correlation measure is the Pearson correlation. Our label attribute (income) is of type `object`, which, as explained earlier, most likely means that we are dealing with a categorical label. We can create a temporary numerical representation of the categorical label by using `OrdinalEncoder`.

In [None]:
from sklearn.preprocessing import OrdinalEncoder

oe = OrdinalEncoder()
strat_train_set['income_ord'] = oe.fit_transform(strat_train_set[['income']])

strat_train_set['income_ord']

Note how `education.num` has a correlation of 0.33 with the label.

In [None]:
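# Create a correlation matrix from the numerical columns only
# (newer pandas versions raise an error when text columns are included)
corr_matrix = strat_train_set.select_dtypes(include='number').corr()
corr_matrix['income_ord'].sort_values(ascending=False)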

With the label encoded by `OrdinalEncoder`, we can see how each numerical attribute correlates with it.
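
To make the measure itself concrete, here is a minimal sketch of the Pearson correlation written out by hand: the covariance of the two variables divided by the product of their standard deviations. The `pearson` helper and the sample values are made up for illustration; in practice `DataFrame.corr()` does this for every pair of columns.

```python
import numpy as np

def pearson(x, y):
    """Pearson correlation: covariance divided by the product of standard deviations."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_centered = x - x.mean()
    y_centered = y - y.mean()
    return (x_centered * y_centered).sum() / np.sqrt(
        (x_centered ** 2).sum() * (y_centered ** 2).sum()
    )

# Made-up values: years of education and a 0/1 encoded income label.
education_num = [9, 10, 13, 14, 9, 16, 13, 10]
income_ord = [0, 0, 1, 1, 0, 1, 0, 0]

r = pearson(education_num, income_ord)
print(round(r, 3))
```

The result always lies between -1 and 1; values near 0 indicate little linear relationship.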

### Scatter Matrix

We can further visualize any correlations among the data attributes by using a scatter matrix. This can help us better observe the above correlations.

In [None]:
from pandas.plotting import scatter_matrix

attributes = ['age', 'capital.gain', 'hours.per.week', 'capital.loss', 'education.num', 'fnlwgt', 'income_ord']
scatter_matrix(strat_train_set[attributes], figsize=(12, 8))

# Attribute Engineering

Often, it is a good idea to try combining attributes to enrich the data with information that the individual attributes do not provide on their own. This is particularly common for numerical attributes. __Avoid using the label when combining features.__ This is important since the learning model may obtain information about the label indirectly through the combined attribute. Further, the label will not be available when we actually deploy the model (since we are trying to predict the label to begin with!).

In [None]:
# Work on a copy so that we can experiment safely with our dataset.
attr_eng_train_set = strat_train_set.copy()
attr_eng_train_set['cap.gain.per.week.hour'] = attr_eng_train_set['capital.gain']/attr_eng_train_set['hours.per.week']
attr_eng_train_set['cap.loss.per.week.hour'] = attr_eng_train_set['capital.loss']/attr_eng_train_set['hours.per.week']
attr_eng_train_set['cap.gain.per.education.num'] = attr_eng_train_set['capital.gain']/attr_eng_train_set['education.num']

Note that `cap.gain.per.week.hour` and `cap.loss.per.week.hour` do not introduce more correlation. On the other hand, `cap.gain.per.education.num` shows a slightly higher correlation than the other two attributes.

In [None]:
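# Use the numerical columns only (newer pandas versions raise an error on text columns)
corr_matrix = attr_eng_train_set.select_dtypes(include='number').corr()
corr_matrix['income_ord'].sort_values(ascending=False)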

In [None]:
attr_eng_train_set.describe()

Since we engineered the new attribute on a copy of the stratified training set only, we need to add it to the dataset as a whole. In this case, we add `cap.gain.per.education.num`, given that it adds more correlation. Additionally, we redo the stratified shuffle split to account for the fact that `education.num` seems to be the best predictor explored so far.

Further inspection of `education.num` reveals that the attribute, although numerical, is also a categorical attribute. Because of this, we can use it for the StratifiedShuffleSplit.

In [None]:
adult['education.num'].value_counts()

In [None]:
adult['cap.gain.per.education.num'] = adult['capital.gain']/adult['education.num']
attrib = 'education.num'
split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
for train_i, test_i in split.split(adult, adult[attrib]):
    strat_train_set = adult.loc[train_i]
    strat_test_set = adult.loc[test_i]

# Dataset Preparation

Some attributes may have values that are not useful for learning. In particular, be wary of undefined or invalid values. Oftentimes these manifest themselves as `NaN` values. In our case, they are displayed as `?`, as explained in the dataset's description. This is why it is important to read a dataset's description before using it: authors follow different conventions when crafting datasets. The three offending attributes in our case are: `workclass`, `occupation`, `native.country`.

In [None]:
adult['workclass'].value_counts()

In [None]:
adult['occupation'].value_counts()

In [None]:
adult['native.country'].value_counts()

There are different ways one can deal with missing data:

1. Remove offending samples.
  - Good when only a small number of samples are affected.
2. Remove attribute.
  - Good when too many values are missing and we have plenty of attributes.
3. Set value.
  - Good when there are high correlations with other attributes, when there are enough samples that we can use the mean or median, or when it is OK to set them to 0.
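
The three strategies above can be sketched on a made-up frame as follows. This is only an illustration under stated assumptions: the rows are invented, the `?` markers are first converted to `NaN` so that pandas' missing-value helpers apply, and the "set value" step uses the most frequent category, which is one possible choice for text columns.

```python
import numpy as np
import pandas as pd

# Toy frame using `?` for unknown values, as in the dataset (rows are made up).
toy = pd.DataFrame({
    "age": [25, 38, 52, 41, 30],
    "workclass": ["Private", "?", "Private", "Self-emp", "Private"],
    "occupation": ["Sales", "?", "Tech-support", "Sales", "?"],
})

# Turn the `?` markers into real missing values so pandas can reason about them.
toy = toy.replace("?", np.nan)

# 1. Remove offending samples.
dropped_rows = toy.dropna()

# 2. Remove the attribute with the most missing values.
dropped_col = toy.drop(columns=["occupation"])

# 3. Set a value: here the most frequent category of each column.
filled = toy.fillna(toy.mode().iloc[0])

print(len(toy), len(dropped_rows))
print(filled)
```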
  
In our case, there are several things to consider:

1. `workclass` and `occupation` share the same samples that have missing data. If we remove those particular samples, we clean both attributes without having to lose too many samples.
2. `native.country` is a categorical type. Additionally, there is a small number of offending samples. Either labeling them as USA or removing them would be a reasonable action. In this case, we will remove them.

In [None]:
# Note that many samples have both workclass and occupation missing!
adult[adult['workclass'] == '?']

In [None]:
adult[adult['native.country'] == '?']

#### Clean `workclass`, `occupation`, `native.country`

We keep only the samples that do not have `?` as values. We only lose __2399__ data samples (~7%).

In [None]:
adult = adult[adult['workclass'] != '?']
adult = adult[adult['occupation'] != '?']
adult = adult[adult['native.country'] != '?']
adult.reset_index(inplace=True, drop=True)

In [None]:
adult

In [None]:
attrib = 'education.num'
split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
for train_i, test_i in split.split(adult, adult[attrib]):
    strat_train_set = adult.loc[train_i]
    strat_test_set = adult.loc[test_i]
strat_train_set.reset_index(inplace=True, drop=True)
strat_test_set.reset_index(inplace=True, drop=True)

## Dealing with Categorical Data

Often, machine learning models need a special representation of categorical attributes. The reason lies in their training algorithms, which use optimization strategies that require numerical values. Often, it is enough for us to replace categorical values with integers. The categorical attributes we will be dealing with are: `workclass`, `education`, `education.num`, `marital.status`, `occupation`, `relationship`, `race`, `sex`, `native.country`. Additionally, the label also needs to be encoded.

In [None]:
strat_train_set.head()

### Encoding Categorical Attributes

There are several common ways to deal with categorical attributes. The first is to encode each category as a number. In this manner, pre-school education could be given a value of 0, elementary education a value of 1, middle-school education a value of 2, and high-school education a value of 3. This can be achieved by using either `OrdinalEncoder` or `LabelEncoder` from scikit-learn (`LabelEncoder` is intended for the label column).

Another common option is one-hot encoding, available as `OneHotEncoder`. One-hot encoding uses several columns to encode categorical data in bit-string format. For example, encoding the education attribute would use 4 columns (following the example in the previous paragraph). Each column represents whether a sample belongs to one particular category. If the first column has a 1 and all other columns have a 0, the education level is pre-school. If the third column has a 1 and all other columns have a 0, the education level for that sample is middle-school.

The benefit of using one-hot encoding is that by not placing a numerical value on the categories, we do not create a relationship among them where one may not exist. For example, giving a higher value to single compared to married in marital status could cause the machine learning algorithm to pick up a nonexistent relationship among attributes. One-hot encoding alleviates this problem.
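
The four-level education example can be sketched directly. The frame below is made up for illustration; passing `categories` is an assumption made here to pin the column order to the order used in the text, since scikit-learn otherwise sorts the categories.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

levels = ["pre-school", "elementary", "middle-school", "high-school"]

# Toy column following the four-level example in the text.
toy = pd.DataFrame(
    {"education": ["pre-school", "middle-school", "high-school", "elementary"]}
)

# Passing `categories` fixes the column order; by default scikit-learn sorts them.
encoder = OneHotEncoder(categories=[levels])
one_hot = encoder.fit_transform(toy[["education"]]).toarray()

print(pd.DataFrame(one_hot, columns=levels, index=toy["education"]))
```

Each row contains exactly one 1, in the column that matches its category.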

In [None]:
from sklearn.preprocessing import OneHotEncoder
cat_attrs = ['workclass', 'education', 'education.num', 'marital.status', 'occupation', 
             'relationship', 'race', 'sex', 'native.country']

ohe = OneHotEncoder()
ohe.fit(adult[cat_attrs])
ohe_train = pd.DataFrame(ohe.transform(strat_train_set[cat_attrs]).toarray())
ohe_test = pd.DataFrame(ohe.transform(strat_test_set[cat_attrs]).toarray())

# Drop categorical data
strat_train_set.drop(cat_attrs, axis=1, inplace=True)
strat_test_set.drop(cat_attrs, axis=1, inplace=True)

# Replace with one-hot encoded data
strat_train_set = strat_train_set.join(ohe_train)
strat_test_set = strat_test_set.join(ohe_test)

# Encode income as an ordinal value.
# Fit on the training set and reuse the same mapping for the test set.
oe = OrdinalEncoder()
strat_train_set['income'] = oe.fit_transform(strat_train_set[['income']])
strat_test_set['income'] = oe.transform(strat_test_set[['income']])

# Keep column names as strings
strat_train_set.columns = strat_train_set.columns.astype(str)
strat_test_set.columns = strat_test_set.columns.astype(str)



In [None]:
strat_train_set.head()

In [None]:
strat_test_set.head()

## Feature Scaling

Sometimes, data comes in widely varying scales. For example, our dataset attributes `fnlwgt`, `capital.gain`, and `cap.gain.per.education.num` are widely distributed. This is evident when we look at the output of `describe()` and note the difference between the max and min statistics. These large numbers can make machine learning models take longer than expected to learn a solution. To make learning easier, it is good practice to "standardize" or "normalize" our data. There are two main ways to achieve this:

- min-max scaling:
  - We subtract the minimum and divide by the difference between the maximum and minimum values to keep data between 0 and 1.
- standardization:
  - We subtract the mean to "de-mean" the data. Then, we divide by the standard deviation to scale values to unit variance.

Note that the main difference is that min-max scaling gives values between 0 and 1, while standardization does not bound the values.

There is no particular rule to when to use each type of scaling. It is a good idea to try both and compare how much each improves data prediction.
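
Both transformations are short enough to write out by hand. The sketch below uses made-up values with a wide spread; it mirrors what `MinMaxScaler` and `StandardScaler` compute with their default settings on a single column.

```python
import numpy as np

# Made-up values with a wide spread, similar in spirit to `capital.gain`.
x = np.array([0.0, 0.0, 2000.0, 5000.0, 99999.0])

# Min-max scaling: (x - min) / (max - min), so values land in [0, 1].
min_max = (x - x.min()) / (x.max() - x.min())

# Standardization: (x - mean) / std, giving zero mean and unit variance.
standardized = (x - x.mean()) / x.std()

print(min_max.round(3))
print(standardized.round(3))
```

Note how a single large value squeezes the remaining min-max values toward 0, which is one reason to compare both approaches.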

In [None]:
adult.describe()

When performing the scaling, one may be tempted to scale the whole dataset. This __shall be avoided__. We split the dataset into training and testing subsets that should be independent of each other, and computing scaling statistics on the dataset as a whole leaks information from the testing data into training. For the same reason, the scaler is fit on the training data only and then reused, unchanged, to transform the testing data; this is also how unseen data would be scaled once the model is deployed. We care about this because we want a model that performs well regardless of whether the data has been seen previously or not.

In [None]:
from sklearn.preprocessing import StandardScaler, MinMaxScaler


to_scale = ['fnlwgt', 'capital.gain', 'cap.gain.per.education.num']
# Fit a single scaler on the training data only, then apply it to both datasets
ss = StandardScaler()
# mm = MinMaxScaler()

strat_train_set[to_scale] = ss.fit_transform(strat_train_set[to_scale])
strat_test_set[to_scale] = ss.transform(strat_test_set[to_scale])

In [None]:
strat_train_set.describe()

# Model Selection

There is a plethora of different machine learning models. Depending on the application, some models may be better than others. In this case, we are dealing with a classification problem. Because of this, some recommended models to try are:

- Support Vector Machine
- Logistic Regression
- Naive Bayes
- K-Nearest Neighbors
- Decision Trees
- Random Forests
- Stochastic Gradient Descent

For training, we remove the labels from the training data. Leaving them in the dataset would be like handing a class the answers to a test. It is common practice to use an X variable for the de-labeled data and y for the labels.

In [None]:
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import SGDClassifier

model = LogisticRegression()

# Remove labels.
X = strat_train_set.drop(['income'], axis=1)
# Put labels in different variable
y = strat_train_set['income']

# Here is where the magic happens!
model.fit(X, y)

To compare models, it is a good idea to pick an evaluation metric. Many metrics exist for different circumstances. Some of the most used are root mean squared error (RMSE), mean absolute error (MAE), and scikit-learn's classification accuracy score. RMSE and MAE are primarily regression metrics; since we are dealing with a classification task, the accuracy score is our best bet.

In [None]:
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import accuracy_score

train_pred = model.predict(X)
test_X = strat_test_set.drop(['income'], axis=1)
test_pred = model.predict(test_X)

test_y = strat_test_set['income']

train_rmse = np.sqrt(mean_squared_error(y, train_pred))
test_rmse = np.sqrt(mean_squared_error(test_y, test_pred))

train_mae = mean_absolute_error(y, train_pred)
test_mae = mean_absolute_error(test_y, test_pred)

train_as = accuracy_score(y, train_pred)
test_as = accuracy_score(test_y, test_pred)

In [None]:
print(f"Training RMSE: {train_rmse} | Testing RMSE: {test_rmse}")

In [None]:
print(f"Training MAE: {train_mae} | Testing MAE: {test_mae}")

In [None]:
print(f"Training Accuracy: {train_as} | Testing Accuracy: {test_as}")
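
For labels encoded as 0 and 1, the three metrics are closely tied: every wrong prediction contributes an error of exactly 1, so the MAE equals the fraction of wrong predictions (1 minus accuracy), and the RMSE is the square root of that fraction. The sketch below checks this on made-up labels and predictions.

```python
import numpy as np

# Made-up 0/1 labels and predictions for ten samples.
y_true = np.array([0, 0, 1, 1, 0, 1, 0, 0, 1, 0])
y_pred = np.array([0, 1, 1, 0, 0, 1, 0, 0, 0, 0])

accuracy = np.mean(y_true == y_pred)
mae = np.mean(np.abs(y_true - y_pred))
rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))

print(f"accuracy={accuracy:.3f} mae={mae:.3f} rmse={rmse:.3f}")
print(np.isclose(mae, 1 - accuracy), np.isclose(rmse, np.sqrt(mae)))
```

This is why accuracy alone is enough to rank classifiers here; the other two scores carry the same information.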

# Model Fine Tuning

As can be seen, the scores obtained above are not phenomenal. The model definitely learns something, but the scores reflect that when the model is put to the test, it still struggles. There can be many reasons for this, but one of the main ones is that the model has not been fine-tuned. Machine learning models come with preset hyperparameters that may or may not be the best for a particular dataset. Because of this, it is a good idea to test different values; often, this helps the model perform better. Check out scikit-learn's documentation to find out which parameters you can tune. Below, we switch to a random forest and search over its number of trees.

In [None]:
from sklearn.model_selection import GridSearchCV

param_grid = [
    {
        'n_estimators': [50, 75, 100, 125, 150],
        'criterion': ['gini', ]
    },
]
model = RandomForestClassifier()
grid_search = GridSearchCV(model, param_grid, cv=10, scoring='accuracy', return_train_score=True)
grid_search.fit(X, y)
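
After fitting, the search object records how every parameter combination performed. The following is a minimal sketch on synthetic data (generated with `make_classification`, not the Adult dataset) with a deliberately small grid and `cv=3` so that it runs quickly; the variable names are illustrative only.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the training data; sizes and grid are kept small to run quickly.
X_toy, y_toy = make_classification(n_samples=200, n_features=8, random_state=42)

param_grid = {"n_estimators": [5, 10, 20]}
search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=3,
    scoring="accuracy",
    return_train_score=True,
)
search.fit(X_toy, y_toy)

# One row per parameter combination, with mean scores across the folds.
results = pd.DataFrame(search.cv_results_)[
    ["param_n_estimators", "mean_train_score", "mean_test_score", "rank_test_score"]
]
print(search.best_params_)
print(results)
```

A large gap between `mean_train_score` and `mean_test_score` is a hint that the model is overfitting.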

In [None]:
best_model = grid_search.best_estimator_

train_pred = best_model.predict(strat_train_set.drop(['income'], axis=1))
test_pred = best_model.predict(strat_test_set.drop(['income'], axis=1))

train_rmse = np.sqrt(mean_squared_error(strat_train_set['income'], train_pred))
test_rmse = np.sqrt(mean_squared_error(strat_test_set['income'], test_pred))

train_mae = mean_absolute_error(strat_train_set['income'], train_pred)
test_mae = mean_absolute_error(strat_test_set['income'], test_pred)

train_as = accuracy_score(strat_train_set['income'], train_pred)
test_as = accuracy_score(strat_test_set['income'], test_pred)

In [None]:
print(f"Training RMSE: {train_rmse} | Testing RMSE: {test_rmse}")

In [None]:
print(f"Training MAE: {train_mae} | Testing MAE: {test_mae}")

In [None]:
print(f"Training Accuracy: {train_as} | Testing Accuracy: {test_as}")