<a id='preprocessing'></a>
# Pre-Processing

This step prepares the data for the clustering and classification tasks that follow.

**Note:** The notebooks were created independently of one another and then combined into one final notebook. As a result, some imports and configuration code may appear more than once in the final notebook.

### Initial

In [1]:
data_FOLDER = './data'

In [2]:
import pandas as pd
import numpy as np

## Organise imports and configs
import pickle

# the encoders
from sklearn.preprocessing import OrdinalEncoder
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import LabelEncoder
from sklearn.compose import ColumnTransformer

In [3]:
# Use a context manager so the file handle is closed after loading
with open(data_FOLDER + "/data.pickle", "rb") as f:
    df = pickle.load(f)

In [4]:
df.head()

who,Age,Community_Membership_Family,Community_Membership_Hobbies,Community_Membership_None,Community_Membership_Other,Community_Membership_Political,Community_Membership_Professional,Community_Membership_Religious,Community_Membership_Support,Disability_Cognitive,...,Who_Pays_for_Access_Self,Who_Pays_for_Access_Work,hasFalsified,isSingle,Concerns,isTechnical,isMSUser,Household_Income,Months_on_Internet,Adoption
93819,41.0,0,0,1,0,0,0,0,0,0,...,1,0,0,0,Regulation,1,0,103648.0,20.0,Early_Adopter
95708,28.0,0,0,0,0,0,0,0,1,0,...,1,0,1,1,Usability,0,1,409.0,1.0,Early_Majority
97218,25.0,1,1,0,0,0,1,0,0,0,...,1,1,1,0,Usability,1,1,109012.0,30.0,Early_Adopter
91627,28.0,0,0,0,1,0,0,0,0,0,...,1,0,1,0,Regulation,1,1,58024.0,28.0,Early_Adopter
49906,17.0,0,0,0,0,1,1,0,1,0,...,1,0,1,1,Usability,1,1,3657.0,27.0,Early_Adopter


---

### Column Types

Determine which columns hold which type of data. The script below sorts the column names into lists (numeric, Boolean, ordinal and nominal) so that each group can then be pre-processed appropriately.

In [5]:
for j in ['hasFalsified', 'isSingle', 'isTechnical', 'isMSUser']:
    df[j] = df[j].astype('int64')


# Routine to sort the columns into lists by data type

Numeric = []
Categorical = []
Boolean = []
Ordinal = []
Nominal = []
Targets = ["Education_Attainment", "Major_Occupation"]
TargetsF = []

columns = list(df)
  
for i in columns:
    # Avoid shadowing the built-in type() function
    col_dtype = df[i].dtype

    if col_dtype == "float64":
        Numeric.append(i)

    # Categorical is a pandas dtype, not a NumPy one
    # https://pandas.pydata.org/pandas-docs/version/0.17.0/categorical.html
    elif hasattr(df[i], 'cat'):
        # The Categorical list is built up later, during encoding
        if df[i].cat.ordered:
            Ordinal.append(i)
        else:
            Nominal.append(i)

    elif col_dtype == "int64":
        Boolean.append(i)
        
        
        
# Take a look
print("Numeric:" + str(Numeric) +"\n")
print("Boolean:" + str(Boolean) +"\n")
print("Categorical:" + str(Categorical) +"\n")
print("Categorical Ordinal:" + str(Ordinal) +"\n")
print("Categorical Nominal:" + str(Nominal) +"\n")


Numeric:['Age', 'Household_Income', 'Months_on_Internet']

Boolean:['Community_Membership_Family', 'Community_Membership_Hobbies', 'Community_Membership_None', 'Community_Membership_Other', 'Community_Membership_Political', 'Community_Membership_Professional', 'Community_Membership_Religious', 'Community_Membership_Support', 'Disability_Cognitive', 'Disability_Hearing', 'Disability_Motor', 'Disability_Not_Impaired', 'Disability_Not_Say', 'Disability_Vision', 'Opinion_1_Pro', 'Who_Pays_for_Access_Dont_Know', 'Who_Pays_for_Access_Other', 'Who_Pays_for_Access_Parents', 'Who_Pays_for_Access_School', 'Who_Pays_for_Access_Self', 'Who_Pays_for_Access_Work', 'hasFalsified', 'isSingle', 'isTechnical', 'isMSUser']

Categorical:[]

Categorical Ordinal:['Education_Attainment', 'Adoption']

Categorical Nominal:['Major_Occupation', 'Concerns']
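
As a rough cross-check of the routine above, the same grouping can be sketched with `DataFrame.select_dtypes`. The `toy` frame and its values are hypothetical stand-ins, not rows from the survey data, and the lowercase list names are chosen here only to avoid clashing with the notebook's own lists.

```python
import pandas as pd

# Hypothetical miniature frame that mimics the dtypes in the notebook
toy = pd.DataFrame({
    "Age": pd.Series([41.0, 28.0], dtype="float64"),
    "isSingle": pd.Series([0, 1], dtype="int64"),
    "Adoption": pd.Categorical(["a", "b"], categories=["a", "b"], ordered=True),
    "Concerns": pd.Categorical(["x", "y"]),
})

# Select column names by dtype instead of looping over every column
numeric = list(toy.select_dtypes(include="float64").columns)
boolean = list(toy.select_dtypes(include="int64").columns)

# Split the categorical columns by whether their categories are ordered
categorical = toy.select_dtypes(include="category")
ordinal = [c for c in categorical if toy[c].cat.ordered]
nominal = [c for c in categorical if not toy[c].cat.ordered]

print(numeric, boolean, ordinal, nominal)
```

This is only an alternative formulation; it should produce the same four groups as the loop when the dtypes are exactly `float64`, `int64` and `category`.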



---

### Encode Categorical Features

Encoding categorical data is another crucial pre-processing step. Machine learning models generally expect numerical input, so categorical features must be converted to a numerical representation. This conversion is called encoding.

The right transformation depends on the kind of categorical data. sklearn provides OrdinalEncoder for ordinal data and OneHotEncoder for nominal data.
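
The difference between the two encoders can be illustrated with a minimal sketch. The `toy` frame, its column names and its values are hypothetical and are not taken from the survey data.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Hypothetical toy data: one ordinal and one nominal feature
toy = pd.DataFrame({
    "size": ["small", "large", "medium", "small"],   # ordinal
    "colour": ["red", "blue", "red", "green"],        # nominal
})

# Ordinal: pass the category order explicitly so the codes respect it
ordinal = OrdinalEncoder(categories=[["small", "medium", "large"]])
size_codes = ordinal.fit_transform(toy[["size"]]).ravel()

# Nominal: one indicator column per category, with no implied order.
# The default output is a sparse matrix, so convert it to a dense array.
onehot = OneHotEncoder()
colour_matrix = onehot.fit_transform(toy[["colour"]]).toarray()

print(size_codes)
print(onehot.categories_[0])
print(colour_matrix)
```

The ordinal feature becomes a single column of ordered codes, while the nominal feature becomes one 0/1 column per category.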

---

### Ordinal Categorical Features

OrdinalEncoder is used to encode the ordinal categorical features.

In [6]:
for i in Ordinal:

    # Extract the categories, in order, as a plain list
    ordered_cats = list(df[i].cat.categories)

    # Instantiate an encoder that maintains the order while encoding
    encoder = OrdinalEncoder(categories=[ordered_cats])

    # Encode with the ordinal encoder (fit, then transform);
    # ravel() flattens the (n, 1) result into a single column
    df[i] = encoder.fit_transform(df[[i]]).ravel()

    # Record the column as either a target or a categorical feature
    if i in Targets:
        TargetsF.append(i)
    else:
        Categorical.append(i)
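
Because the encoder is given the categories in their existing order, its output should match the integer codes that pandas already stores for an ordered categorical. The sketch below checks this on a hypothetical series; the level names are invented for illustration.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical ordered categorical, not taken from the survey data
levels = ["low", "medium", "high"]
s = pd.Series(
    pd.Categorical(["medium", "high", "low"], categories=levels, ordered=True),
    name="level",
)

# Encode with OrdinalEncoder, using the category order stored in the series
encoder = OrdinalEncoder(categories=[list(s.cat.categories)])
sk_codes = encoder.fit_transform(s.to_frame()).ravel().astype("int64")

# pandas keeps the same position-based codes on the .cat accessor
pd_codes = s.cat.codes.to_numpy()

print(sk_codes, pd_codes)
```

Under these assumptions `s.cat.codes` is a lighter-weight alternative, though it marks missing values as -1, whereas OrdinalEncoder has its own handling for them.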

---

### Nominal Categorical Features

OneHotEncoder is used to encode the nominal categorical features.

In [7]:
# (Butler and Murphy, 2021)

for col in Nominal:

    # Older scikit-learn versions of the one-hot encoder accepted only
    # numerical values, so string categories are label encoded first.
    if df[col].cat.categories.dtype == 'object':
        le = LabelEncoder()
        df[col] = le.fit_transform(df[col])

    # Instantiate a new one-hot encoder that returns a dense array.
    # The argument is sparse_output in scikit-learn 1.2 and later;
    # older versions call it sparse.
    ohe = OneHotEncoder(sparse_output=False)

    # One-hot encode to an array
    encoded = ohe.fit_transform(df[col].values.reshape(-1, 1))

    # Wrap the array in a dataframe whose index matches the original df
    dfSub = pd.DataFrame(
        encoded,
        columns=[col + '-' + str(i) for i in range(encoded.shape[1])],
        index=df.index,
    )

    # Record the new columns as either targets or categorical features
    if col in Targets:
        TargetsF = TargetsF + list(dfSub.columns)
    else:
        Categorical = Categorical + list(dfSub.columns)

    # Concatenate the encoded columns with the main df on the index
    df = pd.concat([df, dfSub], axis=1)

    # Delete the original column
    del df[col]
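
For comparison, a similar result can be sketched with `pd.get_dummies`, which one-hot encodes and preserves the index in a single call. This is an alternative to the loop above, not the approach used in this notebook; the `toy` frame is hypothetical, and the resulting columns are named after the category labels rather than numbered.

```python
import pandas as pd

# Hypothetical miniature frame with one nominal column
toy = pd.DataFrame(
    {
        "Age": [41.0, 28.0, 25.0],
        "Concerns": pd.Categorical(["Regulation", "Usability", "Regulation"]),
    },
    index=[101, 102, 103],
)

# One indicator column per category; the original index is kept
dummies = pd.get_dummies(
    toy["Concerns"], prefix="Concerns", prefix_sep="-", dtype="int64"
)

# Replace the original column with its indicator columns
encoded = pd.concat([toy.drop(columns="Concerns"), dummies], axis=1)

print(encoded)
```

One trade-off is that `get_dummies` does not remember the categories it saw, so a fitted OneHotEncoder is usually the safer choice when the same encoding must be applied to new data later.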

---

### Boolean

For the time being, the Boolean columns are kept as int64, which is easier to use with the classifiers. They can be modified later if necessary.

---

## Rearrange the Dataframe

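Rearrange the columns in the dataframe so that the target columns come first, which makes them easier to find. After one-hot encoding, the targets occupy the first six columns (Education_Attainment followed by the five Major_Occupation columns).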

In [8]:
cols = TargetsF + Numeric + Boolean + Categorical
df = df[cols]

In [9]:
columns = list(df)

# The encoders return float64 values; cast every column except
# the genuinely numeric ones back to int64
for i in columns:
    if i not in Numeric:
        if df[i].dtype == 'float64':
            df[i] = df[i].astype('int64')
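
The same cast can be sketched without an explicit assignment loop by passing a column-to-dtype mapping to `DataFrame.astype`. The `toy` frame and the `numeric` list below are hypothetical stand-ins for `df` and `Numeric`.

```python
import pandas as pd

# Hypothetical frame: one genuinely numeric column and two encoded columns
toy = pd.DataFrame({
    "Age": [41.0, 28.0],
    "Adoption": [1.0, 2.0],
    "Concerns-0": [0.0, 1.0],
})
numeric = ["Age"]

# Collect the float64 columns that are not genuinely numeric
to_int = [c for c in toy.columns if c not in numeric and toy[c].dtype == "float64"]

# Cast them all in one call; columns not in the mapping are left unchanged
toy = toy.astype({c: "int64" for c in to_int})

print(toy.dtypes)
```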

In [10]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 10108 entries, 93819 to 92223
Data columns (total 38 columns):
 #   Column                             Non-Null Count  Dtype  
---  ------                             --------------  -----  
 0   Education_Attainment               10108 non-null  int64  
 1   Major_Occupation-0                 10108 non-null  int64  
 2   Major_Occupation-1                 10108 non-null  int64  
 3   Major_Occupation-2                 10108 non-null  int64  
 4   Major_Occupation-3                 10108 non-null  int64  
 5   Major_Occupation-4                 10108 non-null  int64  
 6   Age                                10108 non-null  float64
 7   Household_Income                   10108 non-null  float64
 8   Months_on_Internet                 10108 non-null  float64
 9   Community_Membership_Family        10108 non-null  int64  
 10  Community_Membership_Hobbies       10108 non-null  int64  
 11  Community_Membership_None          10108 non-null  int64  

In [11]:
df.head()

who,Education_Attainment,Major_Occupation-0,Major_Occupation-1,Major_Occupation-2,Major_Occupation-3,Major_Occupation-4,Age,Household_Income,Months_on_Internet,Community_Membership_Family,...,Who_Pays_for_Access_Self,Who_Pays_for_Access_Work,hasFalsified,isSingle,isTechnical,isMSUser,Adoption,Concerns-0,Concerns-1,Concerns-2
93819,0,0,0,0,0,1,41.0,103648.0,20.0,0,...,1,0,0,0,1,0,1,0,1,0
95708,2,0,1,0,0,0,28.0,409.0,1.0,0,...,1,0,1,1,0,1,2,0,0,1
97218,1,1,0,0,0,0,25.0,109012.0,30.0,1,...,1,1,1,0,1,1,1,0,0,1
91627,1,0,0,0,0,1,28.0,58024.0,28.0,0,...,1,0,1,0,1,1,1,0,1,0
49906,3,0,1,0,0,0,17.0,3657.0,27.0,0,...,1,0,1,1,1,1,1,0,0,1


---

### Save the Data Frame to Binary File

The pre-processed dataframe is saved under a new name, *pre-processed.pickle*, so that the original *data.pickle* file is not overwritten.

In [12]:
# Save the pandas dataframe to a pickle file

df.to_pickle(data_FOLDER+'/pre-processed.pickle')

# note that this will overwrite any existing file
# in the data directory called 'pre-processed.pickle'

# https://ianlondon.github.io/blog/pickling-basics/
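
A quick way to confirm that the save worked is to read the file back and compare it with the dataframe in memory. The sketch below does this on a hypothetical `toy` frame in a temporary directory, so it does not touch the real data folder.

```python
import os
import tempfile

import pandas as pd

# Hypothetical frame standing in for the pre-processed dataframe
toy = pd.DataFrame(
    {"Education_Attainment": [0, 2], "Age": [41.0, 28.0]},
    index=[93819, 95708],
)

# Write to a temporary directory, then read the file back
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "pre-processed.pickle")
    toy.to_pickle(path)
    restored = pd.read_pickle(path)

# equals() compares values, dtypes and index
round_trip_ok = restored.equals(toy)
print(round_trip_ok)
```

Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.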