# Step 4 : Pre-processing for ML : Stage 1

## 2. Data Transformation : Encoding Categorical Variables
In this section, I used three methods to re-code the categorical variables:
- Ordinal Encoding
- Dummy Encoding
- Label Encoding

After encoding, I combined the results into a single DataFrame.

Install:
pip install category_encoders

### Import Data

In [1]:
# Imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
# Read the csv file
fjdf = pd.read_csv("Findjob_no_outliers.csv")

In [3]:
fjdf.columns

Index(['YEARSERV', 'RANK', 'MARRACTIV', 'PARACTIV', 'COMBAT', 'INJURED',
       'MILTOCIV', 'CIVADJ', 'VABENEFITS', 'CIVADJPROBa', 'CIVADJPROBb',
       'CIVADJISSc', 'CIVADJISSe', 'LOOKJOB', 'SCHOOL', 'MILJOBSKILLS',
       'FINDJOB', 'MILHELPJOB', 'TRAUMA1', 'F_AGECAT', 'F_SEX', 'F_EDUCCAT2',
       'F_HISP', 'F_RACETHN', 'F_MARITAL', 'Branch', 'p_income'],
      dtype='object')

In [4]:
len(fjdf.columns)

27

In [5]:
len(fjdf)

241

In [6]:
fjdf.shape

(241, 27)

### Check each variable

In [7]:
# Ordinal
fjdf["YEARSERV"].unique()

array(['>20 years', '5-9 years', '3-4 years', '< 2 years', '10-19 years'],
      dtype=object)

In [8]:
# Ordinal
fjdf["RANK"].unique()

array(['E1-E6', 'WO1-WO5', 'E7-E9', 'O1-O10'], dtype=object)

In [9]:
# Binary
fjdf['MARRACTIV'].unique()

array(['Married', 'Not Married'], dtype=object)

In [10]:
# Binary
fjdf['PARACTIV'].unique()

array(['Yes', 'No'], dtype=object)

In [11]:
# Binary
fjdf['COMBAT'].unique()

array(['Yes', 'No'], dtype=object)

In [12]:
fjdf['INJURED'].unique()

array(['Not injured', 'Injured out of combat', 'Injured in combat'],
      dtype=object)

In [13]:
# Ordinal
fjdf['MILTOCIV'].unique()

array(['Somewhat well', 'Not well at all', 'Not too well', 'Very well'],
      dtype=object)

In [14]:
# Ordinal
fjdf['CIVADJ'].unique()

array(['Very Easy', 'Somewhat Easy', 'Somewhat difficult',
       'Very difficult'], dtype=object)

In [15]:
# Binary
fjdf['VABENEFITS'].unique()

array(['Yes', 'No'], dtype=object)

In [16]:
# Binary
fjdf['CIVADJPROBa'].unique()

array(['Yes', 'No'], dtype=object)

In [17]:
# Binary
fjdf['CIVADJPROBb'].unique()

array(['No', 'Yes'], dtype=object)

In [18]:
# Ordinal
fjdf['CIVADJISSc'].unique()

array(['Frequently', 'Sometimes', 'Seldom', 'Never'], dtype=object)

In [19]:
# Ordinal
fjdf['CIVADJISSe'].unique()

array(['Never', 'Seldom', 'Sometimes', 'Frequently'], dtype=object)

In [20]:
# Binary
fjdf['LOOKJOB'].unique()

array(['Yes(right away)', 'Yes(Not right away)'], dtype=object)

In [21]:
# Binary
fjdf['SCHOOL'].unique()

array(['No', 'Yes'], dtype=object)

In [22]:
# Ordinal
fjdf['MILJOBSKILLS'].unique()

array(['fairly useful', 'Very useful', 'not too useful',
       'not useful at all'], dtype=object)

In [23]:
# Binary ordinal
fjdf['FINDJOB'].unique()

array(['<6 months', '>6 months'], dtype=object)

In [24]:
# Ordinal
fjdf['MILHELPJOB'].unique()

array(['Helped a lot', 'Helped a bit', 'hurt a bit', 'Neither',
       'hurt a lot'], dtype=object)

In [25]:
# Binary
fjdf['TRAUMA1'].unique()

array(['No', 'Yes'], dtype=object)

In [26]:
# Ordinal
fjdf['F_AGECAT'].unique()

array(['65+', '50-64', '30-49', '18-29'], dtype=object)

In [27]:
# Binary
fjdf['F_SEX'].unique()

array(['Male', 'Female'], dtype=object)

In [28]:
# Ordinal
fjdf['F_EDUCCAT2'].unique()

array(['High school degree', 'MS/PHD degree', 'Associate degree',
       'Some college(no degree)', 'Bachelors degree'], dtype=object)

In [29]:
# Binary
fjdf['F_HISP'].unique()

array(['No', 'Yes'], dtype=object)

In [30]:
fjdf['F_RACETHN'].unique()

array(['White', 'Black', 'Asian', 'Mixed race'], dtype=object)

In [31]:
fjdf['F_MARITAL'].unique()

array(['Married/Live with a partner', 'Divorced/Seperated/Widowed',
       'Never Married'], dtype=object)

In [32]:
fjdf['Branch'].unique()

array(['Muti-Branch', 'Army', 'Navy', 'Air_Force', 'Marines',
       'Coast_Guard'], dtype=object)
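
The cell-by-cell checks above can also be condensed into one loop. The following is a minimal sketch on a toy DataFrame; `toy` and `unique_values` are illustrative names that do not appear in the notebook, and the toy rows are made up for the example.

```python
import pandas as pd

# Toy stand-in for fjdf (made-up rows, for illustration only)
toy = pd.DataFrame({
    "COMBAT": ["Yes", "No", "Yes"],
    "RANK": ["E1-E6", "O1-O10", "E7-E9"],
    "p_income": [31500, 75000, 44000],
})

# Collect the distinct values of every non-numeric column in one pass
unique_values = {
    col: sorted(toy[col].unique())
    for col in toy.select_dtypes(exclude="number").columns
}

for col, values in unique_values.items():
    print(f"{col} ({len(values)} levels): {values}")
```

Running the same loop on `fjdf` would list the levels of all 26 categorical columns at once, which makes it easier to decide which encoding each column needs.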

### Encoding Categorical Data - Ordinal Encoding

1. Steps: 
- Import OrdinalEncoder and LabelEncoder from sklearn.preprocessing
- Create a list of all columns that belong to this category
- For each column, make a list that specifies the order of its levels (lowest to highest), then pass all of these lists to the "categories" parameter of OrdinalEncoder
- Call fit_transform on the columns
- Convert the resulting floats to integers

2. The 11 columns in this category are:

'FINDJOB','YEARSERV','RANK','MILTOCIV','CIVADJ','CIVADJISSc','CIVADJISSe','MILJOBSKILLS','MILHELPJOB',
'F_AGECAT','F_EDUCCAT2'

3. Code example:

Create two category lists, one per variable:
- Variable SA : satisfaction = ['Low','Medium','High']
- Variable SI : size = ['S','M','L']

encoder = OrdinalEncoder(categories = [satisfaction,size])
result = encoder.fit_transform(df[['SA','SI']])
result = result.astype(int)
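
A runnable version of this example might look like the sketch below. The `df` contents are made up for illustration; only the column names `SA` and `SI` and the two category lists come from the example above.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Made-up data for the two example variables
df = pd.DataFrame({
    "SA": ["Low", "High", "Medium"],
    "SI": ["M", "S", "L"],
})

# Each list gives the levels from lowest to highest
satisfaction = ["Low", "Medium", "High"]
size = ["S", "M", "L"]

# The order of the lists must match the order of the columns passed in
encoder = OrdinalEncoder(categories=[satisfaction, size])
result = encoder.fit_transform(df[["SA", "SI"]])
result = result.astype(int)  # fit_transform returns floats

print(result)
```

Each value is replaced by its position in the corresponding list, so 'Low' becomes 0 and 'High' becomes 2.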

In [33]:
# Import
from sklearn import preprocessing
from sklearn.preprocessing import OrdinalEncoder
from sklearn.preprocessing import LabelEncoder

In [34]:
# Create a list of all columns that belong to the ordinal encoding category
or_index = ['FINDJOB','YEARSERV','RANK','MILTOCIV','CIVADJ','CIVADJISSc','CIVADJISSe'
            ,'MILJOBSKILLS','MILHELPJOB','F_AGECAT','F_EDUCCAT2']

#### FINDJOB : How long it takes you to find a job after you left (Label Y)
Label : '>6 months','<6 months' (so '>6 months' is coded 0 and '<6 months' is coded 1)

In [35]:
fjdf['FINDJOB'].unique()

array(['<6 months', '>6 months'], dtype=object)

In [36]:
# Create a list of labels in sequence
findjob = ['>6 months','<6 months']

#### YEARSERV : How many years served
Label: '< 2 years','3-4 years','5-9 years','10-19 years','>20 years'

In [37]:
fjdf['YEARSERV'].unique()

array(['>20 years', '5-9 years', '3-4 years', '< 2 years', '10-19 years'],
      dtype=object)

In [38]:
# Create a list of labels in sequence
year = ['< 2 years','3-4 years','5-9 years','10-19 years','>20 years']

#### RANK : Rank in military
Label: 'E1-E6','E7-E9','WO1-WO5','O1-O10'

In [39]:
fjdf['RANK'].unique()

array(['E1-E6', 'WO1-WO5', 'E7-E9', 'O1-O10'], dtype=object)

In [40]:
# Create a list of labels in sequence
rank = ['E1-E6','E7-E9','WO1-WO5','O1-O10']

#### MILTOCIV : How well did the military prepare you for the transition to CIVILIAN life (Higher the better)
Label :'Not well at all','Not too well','Somewhat well','Very well'

In [41]:
fjdf['MILTOCIV'].unique()

array(['Somewhat well', 'Not well at all', 'Not too well', 'Very well'],
      dtype=object)

In [42]:
# Create a list of labels in sequence
miltociv = ['Not well at all','Not too well','Somewhat well','Very well']

#### CIVADJ : How easy to adjust to civilian life
Label: 'Very difficult','Somewhat difficult','Somewhat Easy','Very Easy'

In [43]:
fjdf['CIVADJ'].unique()

array(['Very Easy', 'Somewhat Easy', 'Somewhat difficult',
       'Very difficult'], dtype=object)

In [44]:
# Create a list of labels in sequence
civadj = ['Very difficult','Somewhat difficult','Somewhat Easy','Very Easy']

#### CIVADJISSc : Few years after you left, how optimistic you feel about your future (Higher the better)
Label : 'Never','Seldom','Sometimes','Frequently'

In [45]:
fjdf['CIVADJISSc'].unique()

array(['Frequently', 'Sometimes', 'Seldom', 'Never'], dtype=object)

In [46]:
# Create a list of labels in sequence
civadjissc = ['Never','Seldom','Sometimes','Frequently']

#### CIVADJISSe : Few years after you left, how often you have difficulty dealing with the lack of structure in civilian life
Label : 'Never','Seldom','Sometimes','Frequently'


In [47]:
fjdf['CIVADJISSe'].unique()

array(['Never', 'Seldom', 'Sometimes', 'Frequently'], dtype=object)

In [48]:
# Create a list of labels in sequence
civadjisse = ['Never','Seldom','Sometimes','Frequently']

#### MILJOBSKILLS : How useful are military skills for civilian jobs
Label : 'not useful at all','not too useful','fairly useful','Very useful'

In [49]:
fjdf['MILJOBSKILLS'].unique()

array(['fairly useful', 'Very useful', 'not too useful',
       'not useful at all'], dtype=object)

In [50]:
# Create a list of labels in sequence
miljobskills = ['not useful at all','not too useful','fairly useful','Very useful']

#### MILHELPJOB : Does serving in the military help or hurt when getting a civilian job (the higher the better)
Label : 'hurt a lot','hurt a bit','Neither','Helped a bit','Helped a lot'

In [51]:
fjdf['MILHELPJOB'].unique()

array(['Helped a lot', 'Helped a bit', 'hurt a bit', 'Neither',
       'hurt a lot'], dtype=object)

In [52]:
# Create a list of labels in sequence
milhelpjob = ['hurt a lot','hurt a bit','Neither','Helped a bit','Helped a lot']

#### F_AGECAT
Label : '18-29','30-49','50-64','65+'

In [53]:
fjdf['F_AGECAT'].unique()

array(['65+', '50-64', '30-49', '18-29'], dtype=object)

In [54]:
# Create a list of labels in sequence
f_agecat = ['18-29','30-49','50-64','65+']

#### F_EDUCCAT2
Label : 'High school degree','Some college(no degree)','Associate degree','Bachelors degree','MS/PHD degree'

In [55]:
fjdf['F_EDUCCAT2'].unique()

array(['High school degree', 'MS/PHD degree', 'Associate degree',
       'Some college(no degree)', 'Bachelors degree'], dtype=object)

In [56]:
# Create a list of labels in sequence
f_educcat2 = ['High school degree','Some college(no degree)',
              'Associate degree','Bachelors degree','MS/PHD degree']

#### Pass all lists to OrdinalEncoder

In [57]:
# Pass in all the lists we created above
or_enc = OrdinalEncoder(categories = [findjob,year,rank,miltociv,civadj,
                                      civadjissc,civadjisse,miljobskills,
                                      milhelpjob,f_agecat,f_educcat2
                                     ])

In [58]:
# Fit and transform the ordinal columns
# (or_index lists them in the same order as the category lists above)
or_result = or_enc.fit_transform(fjdf[or_index])
# Change float array to all integers
or_result = or_result.astype(int)

In [59]:
# Reuse fjdf's index so the rows line up in the later pd.concat;
# with a mismatched index, concat would misalign rows and introduce NaN values
ordf = pd.DataFrame(or_result,columns=or_index,index=fjdf.index)
ordf

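| | FINDJOB | YEARSERV | RANK | MILTOCIV | CIVADJ | CIVADJISSc | CIVADJISSe | MILJOBSKILLS | MILHELPJOB | F_AGECAT | F_EDUCCAT2 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 4 | 0 | 2 | 3 | 3 | 0 | 2 | 4 | 3 | 0 |
| 1 | 0 | 4 | 2 | 2 | 2 | 3 | 0 | 3 | 3 | 2 | 4 |
| 2 | 1 | 4 | 1 | 2 | 2 | 2 | 1 | 3 | 4 | 2 | 2 |
| 3 | 1 | 4 | 3 | 2 | 1 | 2 | 2 | 2 | 4 | 2 | 4 |
| 4 | 1 | 4 | 0 | 2 | 2 | 2 | 0 | 2 | 4 | 2 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 236 | 0 | 2 | 1 | 1 | 2 | 2 | 1 | 1 | 2 | 1 | 3 |
| 237 | 1 | 1 | 0 | 3 | 2 | 3 | 3 | 0 | 2 | 1 | 0 |
| 238 | 1 | 4 | 0 | 2 | 1 | 3 | 0 | 1 | 3 | 2 | 1 |
| 239 | 1 | 4 | 3 | 2 | 1 | 3 | 2 | 1 | 2 | 2 | 3 |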
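
The index matters when building `ordf` because `pd.concat` with `axis=1` aligns rows by index label, not by position. The sketch below shows the effect on a toy example; `source`, `codes`, `misaligned` and `aligned` are illustrative names, and the non-consecutive index is an assumption made to show the problem (it would arise, for example, after dropping rows without resetting the index).

```python
import numpy as np
import pandas as pd

# Toy source frame whose index has gaps (assumed for illustration)
source = pd.DataFrame({"p_income": [100, 200, 300]}, index=[0, 2, 5])

# Encoder output is a plain array with no index information
codes = np.array([[1], [0], [1]])

# Without index=..., the new frame gets the default index 0, 1, 2
misaligned = pd.concat(
    [pd.DataFrame(codes, columns=["FINDJOB"]), source], axis=1
)

# Reusing the source index keeps each code on its original row
aligned = pd.concat(
    [pd.DataFrame(codes, columns=["FINDJOB"], index=source.index), source],
    axis=1,
)

print(misaligned)
print(aligned)
```

In the misaligned result the two indexes are unioned, so extra rows appear and some cells are NaN. When the source index is already 0 to n-1, both versions give the same result, so passing `index=fjdf.index` is a cheap safeguard.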


In [60]:
# Check whether the codes above match the original strings
fjdf[or_index].head()

| | FINDJOB | YEARSERV | RANK | MILTOCIV | CIVADJ | CIVADJISSc | CIVADJISSe | MILJOBSKILLS | MILHELPJOB | F_AGECAT | F_EDUCCAT2 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | <6 months | >20 years | E1-E6 | Somewhat well | Very Easy | Frequently | Never | fairly useful | Helped a lot | 65+ | High school degree |
| 1 | >6 months | >20 years | WO1-WO5 | Somewhat well | Somewhat Easy | Frequently | Never | Very useful | Helped a bit | 50-64 | MS/PHD degree |
| 2 | <6 months | >20 years | E7-E9 | Somewhat well | Somewhat Easy | Sometimes | Seldom | Very useful | Helped a lot | 50-64 | Associate degree |
| 3 | <6 months | >20 years | O1-O10 | Somewhat well | Somewhat difficult | Sometimes | Sometimes | fairly useful | Helped a lot | 50-64 | MS/PHD degree |
| 4 | <6 months | >20 years | E1-E6 | Somewhat well | Somewhat Easy | Sometimes | Never | fairly useful | Helped a lot | 50-64 | Some college(no degree) |
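
Besides comparing the first rows by eye, the encoding can be checked programmatically by decoding the integers back to strings with `inverse_transform`. This is a minimal sketch on toy data with two of the columns; `toy`, `enc`, `decoded` and `roundtrip_ok` are illustrative names.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Made-up rows for two of the ordinal columns
toy = pd.DataFrame({
    "FINDJOB": ["<6 months", ">6 months", "<6 months"],
    "F_AGECAT": ["65+", "18-29", "50-64"],
})

# Same level orderings as in the notebook
findjob = [">6 months", "<6 months"]
f_agecat = ["18-29", "30-49", "50-64", "65+"]

enc = OrdinalEncoder(categories=[findjob, f_agecat])
codes = enc.fit_transform(toy)

# Decode the integer codes back to the original labels
decoded = pd.DataFrame(
    enc.inverse_transform(codes), columns=toy.columns, index=toy.index
)

# True when every decoded label equals the original label
roundtrip_ok = bool((decoded.to_numpy() == toy.to_numpy()).all())
print(roundtrip_ok)
```

The round trip confirms that no label was lost or mistyped, but it does not confirm that the chosen order of levels is the intended one; that still needs the manual comparison above.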


### Encoding Categorical Data - Dummy Encoding
1. Steps: 
- Create a list of all variables we want to convert
- Pass the corresponding columns to pd.get_dummies
- Set "drop_first" to "True" so that each binary variable becomes a single 0/1 column instead of two redundant columns

2. The 11 columns in this category are:
'MARRACTIV','PARACTIV','COMBAT','VABENEFITS','CIVADJPROBa','CIVADJPROBb','LOOKJOB','SCHOOL','TRAUMA1','F_SEX','F_HISP'
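
The effect of "drop_first" is easiest to see on a small example. The sketch below uses made-up rows for two of the binary columns; `toy` and `dummies` are illustrative names. Recent pandas versions return boolean columns from `get_dummies` by default, so `dtype=int` is passed here to get 0/1 values like the output shown in this notebook.

```python
import pandas as pd

# Made-up rows for two binary columns
toy = pd.DataFrame({
    "COMBAT": ["Yes", "No", "Yes"],
    "F_SEX": ["Male", "Female", "Male"],
})

# drop_first=True drops the first level (in sorted order) of each column,
# leaving one indicator column per binary variable
dummies = pd.get_dummies(toy, drop_first=True, dtype=int)

print(dummies)
```

Without `drop_first=True`, each variable would produce two columns (for example `COMBAT_No` and `COMBAT_Yes`) that carry the same information.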

In [61]:
dummy_list = ['MARRACTIV','PARACTIV','COMBAT','VABENEFITS','CIVADJPROBa', 'CIVADJPROBb'
              ,'LOOKJOB','SCHOOL','TRAUMA1','F_SEX','F_HISP',]

In [62]:
len(dummy_list)

11

In [63]:
# drop_first=True : drop the first level of each variable, which is redundant for a binary column
dummies_done = pd.get_dummies(fjdf[dummy_list],drop_first= True)
dummies_done

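| | MARRACTIV_Not Married | PARACTIV_Yes | COMBAT_Yes | VABENEFITS_Yes | CIVADJPROBa_Yes | CIVADJPROBb_Yes | LOOKJOB_Yes(right away) | SCHOOL_Yes | TRAUMA1_Yes | F_SEX_Male | F_HISP_Yes |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 1 | 1 | 1 | 0 | 1 | 0 | 0 | 1 | 0 |
| 1 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 0 |
| 2 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 |
| 3 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 1 | 1 | 0 |
| 4 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 236 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 1 | 1 | 1 | 0 |
| 237 | 1 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 |
| 238 | 1 | 0 | 1 | 1 | 1 | 0 | 1 | 0 | 1 | 1 | 0 |
| 239 | 0 | 1 | 1 | 1 | 0 | 0 | 1 | 0 | 1 | 1 | 0 |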


### Encoding Categorical Data - Label Encoding
1. The 4 columns in this category (nominal variables with more than two levels) are:
'INJURED','F_RACETHN','F_MARITAL','Branch'

In [64]:
# Create an instance of LabelEncoder
label_encoder = preprocessing.LabelEncoder()

In [65]:
# Copy data from these columns
ldf = fjdf[['INJURED','F_RACETHN','F_MARITAL','Branch']].copy()

# Fit transform
ldf['INJURED'] = label_encoder.fit_transform(fjdf['INJURED'])
# Print the labels in sequence
print(f"INJURED : {label_encoder.classes_}\n")

ldf['F_RACETHN'] = label_encoder.fit_transform(fjdf['F_RACETHN'])
print(f"F_RACETHN : {label_encoder.classes_}\n")

ldf['F_MARITAL'] = label_encoder.fit_transform(fjdf['F_MARITAL'])
print(f"F_MARITAL : {label_encoder.classes_}\n")

ldf['Branch'] = label_encoder.fit_transform(fjdf['Branch'])
print(f"Branch : {label_encoder.classes_}\n")

INJURED : ['Injured in combat' 'Injured out of combat' 'Not injured']

F_RACETHN : ['Asian' 'Black' 'Mixed race' 'White']

F_MARITAL : ['Divorced/Seperated/Widowed' 'Married/Live with a partner'
 'Never Married']

Branch : ['Air_Force' 'Army' 'Coast_Guard' 'Marines' 'Muti-Branch' 'Navy']
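
Because one LabelEncoder instance is refitted for each column, only the last column's classes remain available afterwards. One possible alternative, sketched below on toy data, is to loop over the columns and store each label-to-code mapping in a dictionary; `toy`, `encoded` and `mappings` are illustrative names, not part of the notebook.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Made-up rows for two of the nominal columns
toy = pd.DataFrame({
    "INJURED": ["Not injured", "Injured in combat", "Not injured"],
    "F_RACETHN": ["White", "Black", "Asian"],
})

encoded = toy.copy()
mappings = {}

for col in toy.columns:
    le = LabelEncoder()  # a fresh encoder per column
    encoded[col] = le.fit_transform(toy[col])
    # classes_ is sorted, so a label's position is its integer code
    mappings[col] = {str(label): code for code, label in enumerate(le.classes_)}

print(mappings)
```

Note that these codes follow alphabetical order and carry no real ranking, which is worth keeping in mind for models that treat the integers as ordered values.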



In [66]:
label_done = ldf[['INJURED','F_RACETHN','F_MARITAL','Branch']]
label_done

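| | INJURED | F_RACETHN | F_MARITAL | Branch |
|---|---|---|---|---|
| 0 | 2 | 3 | 1 | 4 |
| 1 | 2 | 3 | 1 | 1 |
| 2 | 2 | 3 | 1 | 5 |
| 3 | 2 | 3 | 1 | 0 |
| 4 | 1 | 1 | 1 | 3 |
| ... | ... | ... | ... | ... |
| 236 | 2 | 3 | 0 | 5 |
| 237 | 2 | 3 | 0 | 5 |
| 238 | 2 | 3 | 1 | 5 |
| 239 | 2 | 3 | 1 | 3 |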


### Combine the three encoded datasets

In [67]:
# Combine the three encoded DataFrames with p_income (the only numeric variable)
encoded_fjdf = pd.concat([ordf,dummies_done,label_done,fjdf['p_income']],axis=1)
encoded_fjdf

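| | FINDJOB | YEARSERV | RANK | MILTOCIV | CIVADJ | CIVADJISSc | CIVADJISSe | MILJOBSKILLS | MILHELPJOB | F_AGECAT | ... | LOOKJOB_Yes(right away) | SCHOOL_Yes | TRAUMA1_Yes | F_SEX_Male | F_HISP_Yes | INJURED | F_RACETHN | F_MARITAL | Branch | p_income |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 4 | 0 | 2 | 3 | 3 | 0 | 2 | 4 | 3 | ... | 1 | 0 | 0 | 1 | 0 | 2 | 3 | 1 | 4 | 31500 |
| 1 | 0 | 4 | 2 | 2 | 2 | 3 | 0 | 3 | 3 | 2 | ... | 0 | 0 | 1 | 1 | 0 | 2 | 3 | 1 | 1 | 75000 |
| 2 | 1 | 4 | 1 | 2 | 2 | 2 | 1 | 3 | 4 | 2 | ... | 1 | 0 | 0 | 1 | 0 | 2 | 3 | 1 | 5 | 31500 |
| 3 | 1 | 4 | 3 | 2 | 1 | 2 | 2 | 2 | 4 | 2 | ... | 1 | 0 | 1 | 1 | 0 | 2 | 3 | 1 | 0 | 62500 |
| 4 | 1 | 4 | 0 | 2 | 2 | 2 | 0 | 2 | 4 | 2 | ... | 0 | 1 | 0 | 1 | 0 | 1 | 1 | 1 | 3 | 44000 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 236 | 0 | 2 | 1 | 1 | 2 | 2 | 1 | 1 | 2 | 1 | ... | 0 | 1 | 1 | 1 | 0 | 2 | 3 | 0 | 5 | 63000 |
| 237 | 1 | 1 | 0 | 3 | 2 | 3 | 3 | 0 | 2 | 1 | ... | 1 | 0 | 1 | 0 | 0 | 2 | 3 | 0 | 5 | 25000 |
| 238 | 1 | 4 | 0 | 2 | 1 | 3 | 0 | 1 | 3 | 2 | ... | 1 | 0 | 1 | 1 | 0 | 2 | 3 | 1 | 5 | 31500 |
| 239 | 1 | 4 | 3 | 2 | 1 | 3 | 2 | 1 | 2 | 2 | ... | 1 | 0 | 1 | 1 | 0 | 2 | 3 | 1 | 3 | 62500 |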


In [68]:
# The column count matches the original data (27)
len(encoded_fjdf.columns)

27

### Save file

In [69]:
encoded_fjdf.to_csv("Findjob_encoded.csv",index=False)