# Dataset & Linear Classifier

The [Adult Data Set][1] is a census income dataset from the UCI Machine Learning Repository. It is commonly used to predict whether an individual earns more than $50k per year. It is drawn from 1994 census data and contains about 48 thousand examples with 14 features.

[1]: https://archive.ics.uci.edu/ml/datasets/adult

In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn import linear_model
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import accuracy_score

## Understanding the Dataset

Let's first load the data.

In [2]:
data_folder = "../Data"
train_file = "/adult.data.txt"
test_file = "/adult.test.txt"
cols = ["age", "workclass", "fnlwgt", "education", "education-num", "marital-status", "occupation", "relationship", "race", "sex", "capital-gain", "capital-loss", "hours-per-week", "native-country", "class"]

train_df = pd.read_csv(data_folder + train_file, names=cols, header=None)
test_df  = pd.read_csv(data_folder + test_file, names=cols, skiprows=1)

In [3]:
train_df.head()

In [4]:
train_df.describe()

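Now let's see how many missing values there are. Note that `isnull()` only counts NaN entries, so unknown values that the raw file records as `?` are not included in this count.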

In [5]:
train_df.isnull().sum(axis=0)

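Because this dataset marks unknown entries with `?` rather than leaving them empty, a plain `isnull()` count can report zero. The sketch below uses a few made-up rows in the same comma-plus-space layout to show one way of making those entries visible as NaN. The toy rows and the names `toy_cols` and `toy_df` are illustrative assumptions, not part of the notebook.

```python
import io
import pandas as pd

# Made-up rows that mimic the raw layout: fields separated by ", " and "?" for unknowns.
raw = io.StringIO(
    "39, State-gov, Bachelors\n"
    "50, ?, HS-grad\n"
    "38, Private, ?\n"
    "28, ?, Masters\n"
)
toy_cols = ["age", "workclass", "education"]

# skipinitialspace drops the blank after each comma,
# and na_values tells pandas to read "?" as NaN.
toy_df = pd.read_csv(raw, names=toy_cols, header=None,
                     skipinitialspace=True, na_values="?")

# Per-column count of missing entries.
missing_per_column = toy_df.isnull().sum(axis=0)
print(missing_per_column)
```

The same two `read_csv` arguments could be passed when loading the real files, at the cost of having to decide how to handle the resulting NaN values before fitting a model.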

## Preprocessing

In [6]:
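# One-hot encoding of the categorical columns
categorical_cols = ["workclass", "education", "marital-status", "occupation", "relationship", "race", "sex", "native-country"]
train_df = pd.get_dummies(train_df, columns=categorical_cols)
test_df = pd.get_dummies(test_df, columns=categorical_cols)
# Give the test set exactly the training columns; a category that appears
# in only one split would otherwise change the number of features.
test_df = test_df.reindex(columns=train_df.columns, fill_value=0)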

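One-hot encoding each split on its own can produce different columns when a category occurs in only one of them. The toy sketch below illustrates the mismatch and one possible way to align the frames with `reindex`. The frames `toy_train` and `toy_test` and their values are made up for illustration.

```python
import pandas as pd

# Made-up data: the category "NL" appears only in the training split.
toy_train = pd.DataFrame({"age": [25, 40, 31], "country": ["US", "NL", "US"]})
toy_test = pd.DataFrame({"age": [52, 19], "country": ["US", "US"]})

train_enc = pd.get_dummies(toy_train, columns=["country"])
test_enc = pd.get_dummies(toy_test, columns=["country"])

# The encoded test frame lacks the "country_NL" column.
print(list(train_enc.columns))
print(list(test_enc.columns))

# Reindex the test frame to the training columns, filling absent indicators with 0.
test_aligned = test_enc.reindex(columns=train_enc.columns, fill_value=0)
print(list(test_aligned.columns))
```

Reindexing to the training columns also drops any indicator that exists only in the test split, which is usually what a model fitted on the training features expects.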

In [7]:
# convert class to 0 or 1
train_df["class"] = train_df["class"].astype('category')
train_df["class"] = train_df["class"].cat.codes
test_df["class"]  = test_df["class"].astype('category')
test_df["class"]  = test_df["class"].cat.codes

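Category codes are assigned in sorted order of the label strings, so the mapping is implicit. As a hedged alternative, the sketch below makes the mapping explicit. It assumes the labels look like ` <=50K` and ` >50K` in the training file and carry a trailing period in the test file. The helper `encode_labels` and the sample series are hypothetical.

```python
import pandas as pd

# Made-up label samples in the assumed formats of the two files.
train_labels = pd.Series([" <=50K", " >50K", " <=50K"])
test_labels = pd.Series([" <=50K.", " >50K.", " >50K."])

# Explicit mapping: 1 means income above $50k.
label_map = {"<=50K": 0, ">50K": 1}

def encode_labels(labels):
    """Strip whitespace and a trailing period, then map to 0/1."""
    cleaned = labels.str.strip().str.rstrip(".")
    return cleaned.map(label_map)

y_train_toy = encode_labels(train_labels)
y_test_toy = encode_labels(test_labels)
print(y_train_toy.tolist(), y_test_toy.tolist())
```

Any label that does not match the mapping becomes NaN, which makes unexpected label spellings easy to spot.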

## Linear Classification
First, separate the features from the labels.

In [8]:
# DataFrame.as_matrix() was removed from pandas; to_numpy() is the replacement.
X_train = train_df.drop("class", axis=1).to_numpy()
y_train = train_df["class"].to_numpy()
X_test = test_df.drop("class", axis=1).to_numpy()
y_test = test_df["class"].to_numpy()

We will now create the linear classifier, fit it, and then compute its accuracy on each fold of a 10-fold cross-validation.

In [9]:
clf = linear_model.RidgeClassifier()
clf.fit(X_train, y_train) 
n_folds = 10
scores = cross_val_score(clf, X_train, y_train, cv=n_folds)
scores

array([0.83236107, 0.83937346, 0.84490172, 0.82985258, 0.84336609,
       0.83753071, 0.83753071, 0.84213759, 0.84121622, 0.84398034])
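To make the fold-by-fold scores less of a black box, here is a minimal sketch of what `cross_val_score` does for a classifier when `cv` is an integer: it splits the data with stratified folds, fits a fresh copy of the model on each training part, and scores it on the held-out part. The synthetic data from `make_classification` and the names `X_toy`, `y_toy`, and `manual_scores` are assumptions for illustration.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.linear_model import RidgeClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for the real features and labels.
X_toy, y_toy = make_classification(n_samples=500, n_features=10, random_state=0)
model = RidgeClassifier()

# For classifiers, an integer cv corresponds to unshuffled stratified folds.
folds = StratifiedKFold(n_splits=10)

manual_scores = []
for train_idx, val_idx in folds.split(X_toy, y_toy):
    fold_model = clone(model)  # fresh, unfitted copy for every fold
    fold_model.fit(X_toy[train_idx], y_toy[train_idx])
    manual_scores.append(fold_model.score(X_toy[val_idx], y_toy[val_idx]))
manual_scores = np.array(manual_scores)

library_scores = cross_val_score(model, X_toy, y_toy, cv=10)
print(manual_scores.mean(), library_scores.mean())
```

Since every fold trains a clone, fitting the classifier beforehand has no effect on the cross-validation scores.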

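Finally, let's see the accuracy on the test set. Note that `cross_val_predict` refits the classifier on folds of the test data, so this is a cross-validated accuracy within the test set.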

In [10]:
y_pred = cross_val_predict(clf, X_test, y_test, cv=n_folds)
accuracy_score(y_test, y_pred)

0.8420244456728703
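The cell above cross-validates within the test set, so the classifier fitted on the training data is never scored on it. A common alternative is to fit once on the training split and predict the held-out split. The sketch below shows that pattern on synthetic data. The data and the names `holdout_clf` and `holdout_accuracy` are illustrative assumptions, and no result for the Adult data is implied.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import RidgeClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the train and test splits.
X_toy, y_toy = make_classification(n_samples=600, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=0
)

# Fit on the training split only, then score on data the model has not seen.
holdout_clf = RidgeClassifier().fit(X_tr, y_tr)
holdout_accuracy = accuracy_score(y_te, holdout_clf.predict(X_te))
print(holdout_accuracy)
```

This requires the training and test feature matrices to have the same columns in the same order.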



## Gaussian Process Classification
Let's use 1% of our training data to speed things up (326 random samples).

In [12]:
combineX_trainY_train = np.column_stack([X_train, y_train])
totalSamples = len(combineX_trainY_train)
subsetSize = int(round(totalSamples * 0.01))
subset = combineX_trainY_train[np.random.choice(combineX_trainY_train.shape[0], subsetSize, replace=False), :]
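The draw above changes on every run because the global random state is not seeded. As a hedged sketch, the same 1% subsample can be made reproducible with a seeded generator, and indexing features and labels with the same row indices avoids stacking them into one array. The toy arrays and the seed value are assumptions.

```python
import numpy as np

# Seeded generator so the same rows are drawn on every run (seed chosen arbitrarily).
rng = np.random.default_rng(seed=0)

# Toy stand-ins: 1000 samples with 2 features, and alternating 0/1 labels.
X_toy = np.arange(2000).reshape(1000, 2)
y_toy = np.arange(1000) % 2

# Draw 1% of the row indices without replacement.
subset_size = int(round(len(X_toy) * 0.01))
row_idx = rng.choice(len(X_toy), size=subset_size, replace=False)

# Index both arrays with the same rows so samples and labels stay paired.
X_sub, y_sub = X_toy[row_idx], y_toy[row_idx]
print(X_sub.shape, y_sub.shape)
```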

While we're at it, let's generate a test subset.
We'll use 2% of the test data, because that also gives 326 samples.

In [None]:
combineX_testY_test = np.column_stack([X_test, y_test])
totalSamples_test = len(combineX_testY_test)
subsetSize_test = int(round(totalSamples_test * 0.02))
subset_test = combineX_testY_test[np.random.choice(combineX_testY_test.shape[0], subsetSize_test, replace=False), :]

Now that we've picked a subset of our data, split it into samples and labels again and fit a GP classifier.


In [1]:
y_train_subset = subset[:,-1]
X_train_subset = np.delete(subset, -1, axis = 1)

y_test_subset = subset_test[:,-1]
X_test_subset = np.delete(subset_test, -1, axis = 1)

# Time to fit a GP classifier
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF, Matern

GP_RBF = GaussianProcessClassifier(kernel = 1.0 * RBF(length_scale=1.0))
GP_Matern = GaussianProcessClassifier(kernel = Matern(length_scale=2, nu=3/2))

GP_RBF.fit(X_train_subset, y_train_subset)
GP_Matern.fit(X_train_subset, y_train_subset)

RBF_Scores = cross_val_score(GP_RBF, X_train_subset, y_train_subset, cv=n_folds)

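Kernels such as RBF and Matern depend on distances between samples, so a column with a very wide numeric range can dominate the kernel when the length scale starts at 1.0. The notebook does not scale its features. The sketch below shows, on synthetic data, one way to standardize inside a pipeline so that scaling is refit on each cross-validation fold. The data, the inflated first column, and the name `gp_pipeline` are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Small synthetic problem; GP classification scales poorly with sample count.
X_toy, y_toy = make_classification(n_samples=120, n_features=5, random_state=0)
X_toy[:, 0] *= 1e5  # mimic one column with a much larger range than the rest

# The scaler is fitted on each training fold only, then applied to the held-out fold.
gp_pipeline = make_pipeline(
    StandardScaler(),
    GaussianProcessClassifier(kernel=1.0 * RBF(length_scale=1.0), random_state=0),
)

gp_scores = cross_val_score(gp_pipeline, X_toy, y_toy, cv=5)
print(gp_scores.mean())
```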

In [None]:
RBF_Scores  # display the per-fold scores for the RBF kernel

In [None]:
Matern_Scores = cross_val_score(GP_Matern, X_train_subset, y_train_subset, cv=n_folds)

In [None]:
Matern_Scores  # display the per-fold scores for the Matern kernel

In [None]:
RBF_Y_Pred = cross_val_predict(GP_RBF, X_test_subset, y_test_subset, cv = n_folds)

In [None]:
accuracy_score(y_test_subset, RBF_Y_Pred)

In [None]:
Matern_Y_Pred = cross_val_predict(GP_Matern, X_test_subset, y_test_subset, cv = n_folds)

In [None]:
accuracy_score(y_test_subset, Matern_Y_Pred)