# Data Pre-Processing

## Introduction
We have already prepared the cleaned dataframe in the clustering phase, so it can now be imported and split into separate dataframes for use with our predictors, Major_Occupation and Education_Attainment. I will also perform one-hot encoding on the selected columns for use with classification.

In [65]:
import pandas as pd

In [66]:
df = pd.read_csv('../resources/originalCleanedDataFrame.csv')
df.head()

df.index

RangeIndex(start=0, stop=10108, step=1)

Next, we prepare two dataframes, one for Major_Occupation and one for Education_Attainment, each with the original feature selection we used in the clustering phase.

In [67]:
# Copy the education feature set; it is used later when reading classification results.
educationFeatureSubset = df[['Education_Attainment', 'Years_on_Internet', 'Web_Ordering','Age', 'Not_Purchasing_Too_complicated', 'Not_Purchasing_Prefer_people','Not_Purchasing_Privacy', 'Not_Purchasing_Security', 'Not_Purchasing_Easier_locally']].copy()
# Copy the occupation feature set; it is used later when reading classification results.
occupationFeatureSubset = df[['Major_Occupation', 'Years_on_Internet', 'Web_Ordering','Age', 'Not_Purchasing_Too_complicated', 'Not_Purchasing_Prefer_people','Not_Purchasing_Privacy', 'Not_Purchasing_Security', 'Not_Purchasing_Easier_locally']].copy()

# Assign the index so that it matches that of the original df
# (direct assignment works across pandas versions; set_axis no longer accepts inplace in pandas 2.x)
educationFeatureSubset.index = df.index
occupationFeatureSubset.index = df.index

# Replace 'Not_Say' (a non-integer Age response) with 0.
educationFeatureSubset = educationFeatureSubset.replace('Not_Say', 0)
occupationFeatureSubset = occupationFeatureSubset.replace('Not_Say', 0)

#EXPORT to CSV the original feature subset for use with reading classification results later in other algorithms.
educationFeatureSubset.to_csv('educationFeatureSubset.csv', encoding='utf-8')
occupationFeatureSubset.to_csv('occupationFeatureSubset.csv', encoding='utf-8')

| | Major_Occupation | Years_on_Internet | Web_Ordering | Age | Not_Purchasing_Too_complicated | Not_Purchasing_Prefer_people | Not_Purchasing_Privacy | Not_Purchasing_Security | Not_Purchasing_Easier_locally |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Professional | 1-3_yr | Yes | 41 | 0 | 0 | 0 | 0 | 0 |
| 1 | Education | Under_6_mo | Yes | 28 | 0 | 0 | 0 | 1 | 0 |
| 2 | Computer | 1-3_yr | Yes | 25 | 0 | 0 | 1 | 0 | 1 |
| 3 | Professional | 1-3_yr | Yes | 28 | 0 | 0 | 0 | 0 | 0 |
| 4 | Education | 1-3_yr | Yes | 17 | 0 | 0 | 0 | 0 | 0 |


## Data Preparation for use in Classification


### Data Encoding
This section uses one-hot encoding to convert the categorical string columns into numeric indicator columns, and coerces the remaining columns to numeric, so that all values can be used with classification algorithms.
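As a small illustration of what one-hot encoding produces, the sketch below encodes a toy column with `pandas.get_dummies`. The `toy` frame and its values are assumptions made up for this example, and `get_dummies` is used here only for brevity; the notebook itself uses scikit-learn's `OneHotEncoder`.

```python
import pandas as pd

# Toy data (hypothetical values, not taken from the survey dataset).
toy = pd.DataFrame({'Web_Ordering': ['Yes', 'No', 'Yes', 'Dont_Know']})

# One indicator column per distinct category; exactly one of them is 1 in each row.
encoded = pd.get_dummies(toy['Web_Ordering'], prefix='Web_Ordering', dtype=int)
print(encoded)
```

Each original string value becomes a row with a single 1 in the column for its category, which is the numeric form most classifiers expect.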

In [68]:
from sklearn.preprocessing import OneHotEncoder
import random

featureStringCols = ['Years_on_Internet', 'Web_Ordering']
featureBoolCols = ['Not_Purchasing_Too_complicated', 'Not_Purchasing_Prefer_people',
                                'Not_Purchasing_Privacy', 'Not_Purchasing_Security', 'Not_Purchasing_Easier_locally']
featureIntCols = ['Age']

for col in featureIntCols:
  df[col] = pd.to_numeric(df[col], errors='coerce')

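for col in featureBoolCols:
  # The flags are read from the CSV as integers 0/1, so mapping string keys ('0', '1')
  # would turn every value into NaN; coerce to numeric instead.
  df[col] = pd.to_numeric(df[col], errors='coerce')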

ohe = dict()

# Seed the RNG so the run is repeatable. Sampling with k equal to the list length
# keeps every feature and only shuffles the column order.
random.seed(42)
originalStrCols = random.sample(featureStringCols,k=2)
print(originalStrCols)
sampledBoolCols = random.sample(featureBoolCols,k=5)
print(sampledBoolCols)
sampledIntCols = random.sample(featureIntCols,k=1)
print(sampledIntCols)

# Create an empty dataframe
featureSub = pd.DataFrame()

for col in originalStrCols:
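  # Use the default (sparse) output and densify it; the keyword for dense output
  # differs across scikit-learn versions (sparse vs. sparse_output).
  ohe[col] = OneHotEncoder()
  X = ohe[col].fit_transform(df[col].values.reshape(-1,1)).toarray()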
  # See https://stackoverflow.com/a/4843172
  dfOneHot = pd.DataFrame(X, columns = [col+'-'+str(int(i)) for i in range(X.shape[1])])
  featureSub = pd.concat([featureSub, dfOneHot], axis=1)

# Assign the index so that it matches that of the original df
featureSub.index = df.index

# Add in the sampledBoolcols
featureSub = pd.concat([featureSub, df[sampledBoolCols]], axis=1)

# Add in the sampledIntcols
featureSub = pd.concat([featureSub, df[sampledIntCols]], axis=1)

# Fill remaining NaN values with the mean of their column
featureSub.fillna(featureSub.mean(), inplace=True)

['Years_on_Internet', 'Web_Ordering']
['Not_Purchasing_Privacy', 'Not_Purchasing_Prefer_people', 'Not_Purchasing_Too_complicated', 'Not_Purchasing_Easier_locally', 'Not_Purchasing_Security']
['Age']


In [69]:
featureSub.head()

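| | Years_on_Internet-0 | Years_on_Internet-1 | Years_on_Internet-2 | Years_on_Internet-3 | Years_on_Internet-4 | Web_Ordering-0 | Web_Ordering-1 | Web_Ordering-2 | Not_Purchasing_Privacy | Not_Purchasing_Prefer_people | Not_Purchasing_Too_complicated | Not_Purchasing_Easier_locally | Not_Purchasing_Security | Age |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | NaN | NaN | NaN | NaN | NaN | 41.0 |
| 1 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | NaN | NaN | NaN | NaN | NaN | 28.0 |
| 2 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | NaN | NaN | NaN | NaN | NaN | 25.0 |
| 3 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | NaN | NaN | NaN | NaN | NaN | 28.0 |
| 4 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | NaN | NaN | NaN | NaN | NaN | 17.0 |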
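
The five Not_Purchasing_* columns in the output above hold no values (NaN). The sketch below shows one way such a result can arise, assuming the flags are stored as integers, which is what `read_csv` typically yields for 0/1 data. The `flags` series is a hypothetical stand-in, not data from the survey file.

```python
import pandas as pd

# Hypothetical flag column stored as integers (an assumption about the CSV contents).
flags = pd.Series([0, 1, 0], name='Not_Purchasing_Privacy')

# String keys never match integer values, so every entry becomes NaN.
mapped_with_str_keys = flags.map({'0': 0, '1': 1})

# The mean of an all-NaN column is itself NaN, so a mean fill cannot repair it.
still_empty = mapped_with_str_keys.fillna(mapped_with_str_keys.mean())

# Coercing to numeric handles integer values and numeric strings alike.
coerced = pd.to_numeric(flags, errors='coerce')

print(mapped_with_str_keys.isna().all(), still_empty.isna().all(), coerced.tolist())
```

Checking `df[col].dtype` before choosing the mapping keys is a cheap way to catch this kind of mismatch early.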


In [70]:
#EXPORT to CSV the chosen feature subset to work with in other classification algorithms
featureSub.to_csv('oheTransformedData.csv', encoding='utf-8', index=False)

We now have our data prepared: an interesting feature set has been selected, its values have been transformed to numeric form for use with the three classification algorithms, and the new dataframe has been exported for analysis.
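
Before handing the exported file to a classifier, it can be worth confirming that the frame really is fully numeric and free of missing values. The following is a minimal sketch of such a check; `check_numeric_frame` and the `sample` frame are hypothetical names introduced here for illustration and are not part of the original notebook.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

def check_numeric_frame(frame):
    """Return the names of columns that are non-numeric or still contain NaN."""
    problems = []
    for col in frame.columns:
        if not is_numeric_dtype(frame[col]) or frame[col].isna().any():
            problems.append(col)
    return problems

# Hypothetical stand-in for the exported feature frame.
sample = pd.DataFrame({
    'Age': [41.0, 28.0, 25.0],
    'Web_Ordering-0': [0.0, 0.0, 1.0],
    'Not_Purchasing_Privacy': [float('nan')] * 3,
})

print(check_numeric_frame(sample))
```

An empty list means every column is numeric and complete; any names returned point to columns that need attention before training.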