# Dataset & Linear Classifier

The [Adult Data Set][1] is a census income dataset from the UCI Machine Learning Repository. It can be used to predict whether an individual makes more than $50k per year. It is made up of census data from 1994 and has about 48 thousand examples and 14 features.

[1]: https://archive.ics.uci.edu/ml/datasets/adult

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn import linear_model
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import accuracy_score

## Understanding the Dataset

Let's first load the training and test files.

In [2]:
data_folder = "../Data"
train_file = "/adult.data.txt"
test_file = "/adult.test.txt"
cols = ["age", "workclass", "fnlwgt", "education", "education-num", "marital-status", "occupation", "relationship", "race", "sex", "capital-gain", "capital-loss", "hours-per-week", "native-country", "class"]

train_df = pd.read_csv(data_folder + train_file, names=cols, header=None)
test_df  = pd.read_csv(data_folder + test_file, names=cols, skiprows=1)

In [3]:
train_df.head()

,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,class
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K


In [4]:
train_df.describe()

,age,fnlwgt,education-num,capital-gain,capital-loss,hours-per-week
count,32561.0,32561.0,32561.0,32561.0,32561.0,32561.0
mean,38.581647,189778.4,10.080679,1077.648844,87.30383,40.437456
std,13.640433,105550.0,2.57272,7385.292085,402.960219,12.347429
min,17.0,12285.0,1.0,0.0,0.0,1.0
25%,28.0,117827.0,9.0,0.0,0.0,40.0
50%,37.0,178356.0,10.0,0.0,0.0,40.0
75%,48.0,237051.0,12.0,0.0,0.0,45.0
max,90.0,1484705.0,16.0,99999.0,4356.0,99.0


Now let's see how many missing values there are in each column.

In [5]:
train_df.isnull().sum(axis=0)

age               0
workclass         0
fnlwgt            0
education         0
education-num     0
marital-status    0
occupation        0
relationship      0
race              0
sex               0
capital-gain      0
capital-loss      0
hours-per-week    0
native-country    0
class             0
dtype: int64
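
A zero count here does not necessarily mean the data is complete: `isnull()` only flags NaN entries, and the Adult files mark unknown values with a `?` string instead. The sketch below counts such placeholders per column on a small made-up frame. `count_placeholder_missing` and `toy_df` are hypothetical names, and the leading spaces assume the values were read without stripping whitespace.

```python
import pandas as pd


def count_placeholder_missing(df, placeholder="?"):
    """Count, per column, the cells whose stripped text equals the placeholder."""
    return df.apply(lambda col: (col.astype(str).str.strip() == placeholder).sum())


# Toy stand-in for train_df (illustrative values only, not real rows)
toy_df = pd.DataFrame({
    "age": [39, 50, 38],
    "workclass": [" State-gov", " ?", " Private"],
    "occupation": [" ?", " ?", " Sales"],
})

placeholder_counts = count_placeholder_missing(toy_df)
print(placeholder_counts)
```

Calling the same helper on `train_df` would show which columns carry `?` entries before they are one-hot encoded.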

## Preprocessing

In [6]:
# One Hot Encoding
categorical_cols = ["workclass", "education", "marital-status", "occupation", "relationship", "race", "sex", "native-country"]
train_df = pd.get_dummies(train_df, columns=categorical_cols)
test_df = pd.get_dummies(test_df, columns=categorical_cols)
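
Encoding the two frames separately can leave them with different columns whenever a category appears in only one file. A minimal sketch of one way to align them, using made-up frames (`toy_train`, `toy_test` and their values are illustrative assumptions, not rows from the dataset):

```python
import pandas as pd

toy_train = pd.DataFrame({
    "age": [39, 50, 28],
    "native-country": ["United-States", "Cuba", "Holand-Netherlands"],
})
toy_test = pd.DataFrame({
    "age": [25, 38],
    "native-country": ["United-States", "Cuba"],
})

toy_train_enc = pd.get_dummies(toy_train, columns=["native-country"])
toy_test_enc = pd.get_dummies(toy_test, columns=["native-country"])

# The test frame has no Holand-Netherlands column yet; reindex adds it,
# filled with 0, and puts the columns in the same order as the training frame.
toy_test_aligned = toy_test_enc.reindex(columns=toy_train_enc.columns, fill_value=0)

print(toy_test_enc.shape, toy_test_aligned.shape)
```

With aligned columns, a model fit on the training matrix can be applied directly to the test matrix.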

In [7]:
# convert class to 0 or 1
train_df["class"] = train_df["class"].astype('category')
train_df["class"] = train_df["class"].cat.codes
test_df["class"]  = test_df["class"].astype('category')
test_df["class"]  = test_df["class"].cat.codes
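
`cat.codes` numbers the categories in sorted order, so the conversion relies on `<=50K` sorting before `>50K` in both files. The sketch below checks that on hand-written label strings; the leading spaces and the trailing period on the test labels are assumptions about how the raw files spell them.

```python
import pandas as pd

# Hand-written stand-ins for the two label columns
train_labels = pd.Series([" <=50K", " >50K", " <=50K"])
test_labels = pd.Series([" <=50K.", " >50K.", " >50K."])

# Categories are sorted lexicographically, so "<=50K" gets code 0 and ">50K" code 1
train_codes = train_labels.astype("category").cat.codes
test_codes = test_labels.astype("category").cat.codes

print(train_codes.tolist())
print(test_codes.tolist())
```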

## Linear Classification
First, split each frame into a feature matrix and a label vector.

In [8]:
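# DataFrame.as_matrix() was removed in pandas 1.0; to_numpy() is its replacement.
# dtype=float keeps the feature matrix numeric even if get_dummies yields boolean columns.
X_train = train_df.drop("class", axis=1).to_numpy(dtype=float)
y_train = train_df["class"].to_numpy()
X_test = test_df.drop("class", axis=1).to_numpy(dtype=float)
y_test = test_df["class"].to_numpy()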

We will now create the linear classifier, fit it, and then compute the accuracy on each fold of a 10-fold cross-validation.

In [9]:
clf = linear_model.RidgeClassifier()
clf.fit(X_train, y_train) 
n_folds = 10
scores = cross_val_score(clf, X_train, y_train, cv=n_folds)
scores

array([0.83236107, 0.83937346, 0.84490172, 0.82985258, 0.84336609,
       0.83753071, 0.83753071, 0.84213759, 0.84121622, 0.84398034])
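
The ten fold scores are easier to compare with later models when reduced to a mean and a spread. A small sketch using the values printed above (`cv_scores`, `mean_score` and `std_score` are names introduced here for illustration):

```python
import numpy as np

# Fold accuracies copied from the output above
cv_scores = np.array([0.83236107, 0.83937346, 0.84490172, 0.82985258, 0.84336609,
                      0.83753071, 0.83753071, 0.84213759, 0.84121622, 0.84398034])

mean_score = cv_scores.mean()
std_score = cv_scores.std()
print("Accuracy: %.4f +/- %.4f" % (mean_score, std_score))
```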

Finally, let's see the accuracy on the test set, estimated with 10-fold cross-validation over the test data.

In [10]:
y_pred = cross_val_predict(clf, X_test, y_test, cv=n_folds)
accuracy_score(y_test, y_pred)

0.8420244456728703
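
Note that `cross_val_predict` refits the classifier on folds of whatever data it is given, so the figure above never uses the model fit on the training set. A more conventional held-out evaluation fits once on training data and predicts on unseen data. The sketch below shows that pattern on synthetic data from `make_classification`, since it assumes training and test matrices with identical columns; all variable names are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import RidgeClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the census features and labels
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=0)

# Fit once on the training split, then score on data the model has not seen
held_out_clf = RidgeClassifier().fit(X_tr, y_tr)
held_out_pred = held_out_clf.predict(X_te)
held_out_acc = accuracy_score(y_te, held_out_pred)
print(held_out_acc)
```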



## Gaussian Process Classification
Let's use 1000 random training examples to speed things up, since the cost of fitting a GP grows roughly cubically with the number of examples.

In [11]:
combineX_trainY_train = np.column_stack([X_train, y_train])
totalSamples = len(combineX_trainY_train)
subsetSize = 1000
subset = combineX_trainY_train[np.random.choice(combineX_trainY_train.shape[0], subsetSize, replace=False), :]

While we're at it, let's generate a test subset as well. We'll use 1000 test samples.

In [12]:
combineX_testY_test = np.column_stack([X_test, y_test])
totalSamples_test = len(combineX_testY_test)
subsetSize_test = 1000
subset_test = combineX_testY_test[np.random.choice(combineX_testY_test.shape[0], subsetSize_test, replace=False), :]
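
The subsets above change on every run because the random draw is unseeded. A hedged alternative, not what the notebook does, is to draw row indices from a seeded generator and index the features and labels directly, which also avoids stacking and re-splitting the arrays. The demo arrays are made up, and `np.random.default_rng` assumes NumPy 1.17 or later.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # fixed seed makes the draw repeatable

# Made-up stand-ins for X_train and y_train: row i starts with 5*i, label is i % 2
X_demo = np.arange(50, dtype=float).reshape(10, 5)
y_demo = np.arange(10) % 2

subset_size_demo = 4
idx = rng.choice(X_demo.shape[0], size=subset_size_demo, replace=False)

# Index both arrays with the same rows so samples and labels stay paired
X_sub, y_sub = X_demo[idx], y_demo[idx]
print(X_sub.shape, y_sub.shape)
```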

Now that we've picked a subset of our data, split it back into samples and labels and fit a GP classifier.


In [13]:
y_train_subset = subset[:,-1]
X_train_subset = np.delete(subset, -1, axis = 1)

y_test_subset = subset_test[:,-1]
X_test_subset = np.delete(subset_test, -1, axis = 1)

# Time to fit a GP classifier
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF, Matern

GP_RBF = GaussianProcessClassifier(kernel = 1.0 * RBF(length_scale=1.0))
GP_Matern = GaussianProcessClassifier(kernel = Matern(length_scale=2, nu=3/2))

GP_RBF.fit(X_train_subset, y_train_subset)



GP_Matern.fit(X_train_subset, y_train_subset)

RBF_Scores = cross_val_score(GP_RBF, X_train_subset, y_train_subset, cv=n_folds)
RBF_Scores

array([0.75247525, 0.75247525, 0.75      , 0.75      , 0.75      ,
       0.76      , 0.75      , 0.75      , 0.72727273, 0.73737374])

In [14]:
Matern_Scores = cross_val_score(GP_Matern, X_train_subset, y_train_subset, cv=n_folds)
Matern_Scores

array([0.76237624, 0.75247525, 0.77      , 0.76      , 0.77      ,
       0.77      , 0.76      , 0.76      , 0.75757576, 0.77777778])
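
The features here span very different ranges (in the `describe()` output, `fnlwgt` reaches about 1.5 million while `education-num` tops out at 16), and RBF and Matern kernels with a single length scale are sensitive to that. One option, which the notebook does not use, is to standardize the features inside a pipeline so that scaling is refit on each training fold. A minimal sketch on synthetic data, with every name below introduced for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Small synthetic problem; one column is inflated to mimic a wide-range feature
X_toy, y_toy = make_classification(n_samples=120, n_features=5, random_state=0)
X_toy[:, 0] *= 1e5

# The scaler is fit on the training part of each fold only
scaled_gp = make_pipeline(
    StandardScaler(),
    GaussianProcessClassifier(kernel=1.0 * RBF(length_scale=1.0), random_state=0),
)
scaled_scores = cross_val_score(scaled_gp, X_toy, y_toy, cv=5)
print(scaled_scores)
```

Whether this helps on the census subset is an open question that would need to be checked by running it.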

In [15]:
RBF_Y_Pred = cross_val_predict(GP_RBF, X_test_subset, y_test_subset, cv = n_folds)

In [16]:
accuracy_score(y_test_subset, RBF_Y_Pred)

0.742

In [17]:
Matern_Y_Pred = cross_val_predict(GP_Matern, X_test_subset, y_test_subset, cv = n_folds)

In [18]:
accuracy_score(y_test_subset, Matern_Y_Pred)

0.756

In [19]:
## Compare Log Marginal Likelihood

In [20]:
print("Log Marginal Likelihood (GP_RBF): %.3f"
      % GP_RBF.log_marginal_likelihood(GP_RBF.kernel_.theta))

Log Marginal Likelihood (GP_RBF): -693.147


In [21]:
print("Log Marginal Likelihood (GP_Matern): %.3f"
      % GP_Matern.log_marginal_likelihood(GP_Matern.kernel_.theta))

Log Marginal Likelihood (GP_Matern): -556.709
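
The RBF value deserves a second look: -693.147 coincides with 1000 · ln(0.5), the log-likelihood of assigning probability 0.5 to each of the 1000 training labels. That is consistent with, though not proof of, the optimized RBF kernel having collapsed to an uninformative fit, while the Matern kernel's higher value of -556.709 suggests it explains the training subset somewhat better. The arithmetic can be checked directly (`n_train_subset` mirrors `subsetSize` from the notebook):

```python
import math

n_train_subset = 1000  # same as subsetSize above

# Log-likelihood of predicting probability 0.5 for every binary label
chance_level_lml = n_train_subset * math.log(0.5)
print("%.3f" % chance_level_lml)
```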
