In [1]:
import numpy as np
import pandas as pd

## Read the `Wage` File

In [2]:
wage_df = pd.read_csv("wage.csv")

In [3]:
wage_df.head(3)

Unnamed: 0,year,age,maritl,race,education,region,jobclass,health,health_ins,logwage,wage
0,2006,18,1. Never Married,1. White,1. < HS Grad,2. Middle Atlantic,1. Industrial,1. <=Good,2. No,4.318063,75.043154
1,2004,24,1. Never Married,1. White,4. College Grad,2. Middle Atlantic,2. Information,2. >=Very Good,2. No,4.255273,70.47602
2,2003,45,2. Married,1. White,3. Some College,2. Middle Atlantic,1. Industrial,1. <=Good,1. Yes,4.875061,130.982177


In [4]:
wage_df.dtypes #everything except year, age, logwage and wage is categorical (dtype object)

year            int64
age             int64
maritl         object
race           object
education      object
region         object
jobclass       object
health         object
health_ins     object
logwage       float64
wage          float64
dtype: object

### Check unique values in the `jobclass` column

In [5]:
wage_df.jobclass.unique() #only two -- easy to replace

array(['1. Industrial', '2. Information'], dtype=object)

In [6]:
wage_df["job_information"] =  (wage_df["jobclass"] == "2. Information").astype(int) #so that =1 means information

In [7]:
wage_df.drop(['jobclass', 'logwage', 'region', 'year'], axis=1, inplace=True) #drop jobclass (now encoded as job_information), plus logwage, region and year, which we do not use as predictors

### Check unique values in the `health` column

In [8]:
wage_df.health.unique() #only two -- easy to replace

array(['1. <=Good', '2. >=Very Good'], dtype=object)

In [9]:
wage_df["health"] =  (wage_df["health"] == "2. >=Very Good").astype(int) #so that =1 means very good health

### Apply the same for `health_ins`

In [16]:
wage_df.health_ins.unique() #only two -- easy to replace

array(['2. No', '1. Yes'], dtype=object)

In [17]:
wage_df["health_ins"] =  (wage_df["health_ins"] == "1. Yes").astype(int) #so that =1 means has a health insurance

### Check unique values in the `maritl` column

In [10]:
wage_df.maritl.unique() #we cannot make this 1-2-3-4-5 as this is nominal

array(['1. Never Married', '2. Married', '4. Divorced', '3. Widowed',
       '5. Separated'], dtype=object)

In [11]:
one_hot = pd.get_dummies(wage_df.maritl, prefix='marriage')

In [12]:
wage_df = wage_df.join(one_hot)

In [13]:
wage_df.drop(['maritl', 'marriage_1. Never Married'], axis=1, inplace=True)

In [14]:
wage_df.columns = [*wage_df.columns[:-4], 'marriage_yes',
                   'marriage_widowed', 'marriage_divorced', 'marriage_separated'] #we drop one -- why?
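One way to answer the "why?" in the comment above: the full set of dummies always sums to 1 in every row, which duplicates the intercept column, so the coefficients are no longer uniquely determined. The sketch below uses a small toy frame (hypothetical rows, not taken from `wage.csv`) and checks the rank of the design matrix with and without the baseline dummy. Passing `dtype=int` is an assumption added here so that the dummies come out as 0/1, since recent pandas versions return booleans by default.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical rows, not taken from wage.csv)
toy = pd.DataFrame({"maritl": ["1. Never Married", "2. Married", "4. Divorced",
                               "2. Married", "3. Widowed", "5. Separated"]})

# dtype=int forces 0/1 columns instead of booleans
full = pd.get_dummies(toy["maritl"], prefix="marriage", dtype=int)

# Each row belongs to exactly one category, so the dummies sum to 1 in every row
row_sums = full.sum(axis=1)

# Design matrix with an intercept and ALL five dummies (6 columns)
X_full = np.column_stack([np.ones(len(full)), full.to_numpy()])

# Design matrix with an intercept and four dummies (baseline "Never Married" dropped)
X_reduced = np.column_stack([np.ones(len(full)), full.iloc[:, 1:].to_numpy()])

rank_full = np.linalg.matrix_rank(X_full)
rank_reduced = np.linalg.matrix_rank(X_reduced)

print("columns:", X_full.shape[1], "rank:", rank_full)        # rank is below the column count
print("columns:", X_reduced.shape[1], "rank:", rank_reduced)  # full column rank
```

With the baseline dropped, each remaining dummy coefficient reads as a difference from the "Never Married" group.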

### Decide what to do for `education`

In [20]:
wage_df.education.unique() #it looks like we can take these as ordinal categories

array(['1. < HS Grad', '4. College Grad', '3. Some College', '2. HS Grad',
       '5. Advanced Degree'], dtype=object)

In [23]:
wage_df["education"] = wage_df["education"].str[0].astype(int) #keep the leading digit (1-5) as the ordinal level
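The leading-digit trick works here only because the labels already carry their order. As a hedged alternative, an explicit mapping makes the chosen ordering visible; the dictionary `explicit_levels` below is a name introduced for illustration. Note that either version treats the steps between levels as equally spaced, which is a modelling assumption rather than a property of the data.

```python
import pandas as pd

# The five labels shown in the output above
education = pd.Series(["1. < HS Grad", "4. College Grad", "3. Some College",
                       "2. HS Grad", "5. Advanced Degree"])

# Approach used in the notebook: take the leading digit of each label
level_from_prefix = education.str[0].astype(int)

# Alternative sketch: spell the ordering out (hypothetical helper dictionary)
explicit_levels = {
    "1. < HS Grad": 1,
    "2. HS Grad": 2,
    "3. Some College": 3,
    "4. College Grad": 4,
    "5. Advanced Degree": 5,
}
level_from_map = education.map(explicit_levels)

print(level_from_prefix.tolist())
print(level_from_map.tolist())
```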

### Decide what to do for `race`

In [27]:
wage_df.race.unique() #these are nominal categories

array(['1. White', '3. Asian', '4. Other', '2. Black'], dtype=object)

In [28]:
one_hot = pd.get_dummies(wage_df.race, prefix='race')

In [30]:
wage_df = wage_df.join(one_hot)

In [32]:
wage_df.drop(['race', 'race_4. Other'], axis=1, inplace=True)

In [34]:
wage_df.columns = [*wage_df.columns[:-3], 'race_white', 'race_black', 'race_asian'] #we drop one -- why?
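The positional rename above relies on the dummy columns being the last ones in the frame and appearing in sorted order. A name-based rename is a little more robust to column reordering. The following is a minimal sketch on a toy frame (hypothetical rows), with `dtype=int` added as an assumption so the dummies are 0/1 rather than booleans.

```python
import pandas as pd

# Toy frame (hypothetical rows) with the four race labels from the output above
toy = pd.DataFrame({"race": ["1. White", "3. Asian", "4. Other", "2. Black"],
                    "age": [18, 24, 45, 43]})

one_hot = pd.get_dummies(toy["race"], prefix="race", dtype=int)

# Drop the baseline category, then rename the remaining columns by name
one_hot = one_hot.drop(columns="race_4. Other").rename(columns={
    "race_1. White": "race_white",
    "race_2. Black": "race_black",
    "race_3. Asian": "race_asian",
})

toy = toy.drop(columns="race").join(one_hot)
print(toy.columns.tolist())
```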

In [37]:
wage_df.head()

Unnamed: 0,age,education,health,health_ins,wage,job_information,marriage_yes,marriage_widowed,marriage_divorced,marriage_separated,race_white,race_black,race_asian
0,18,1,0,0,75.043154,0,0,0,0,0,1,0,0
1,24,4,1,0,70.47602,1,0,0,0,0,1,0,0
2,45,3,0,1,130.982177,0,1,0,0,0,1,0,0
3,43,4,1,1,154.685293,1,1,0,0,0,0,0,1
4,50,2,0,1,75.043154,1,0,0,1,0,1,0,0


#### All categorical columns are now numeric: the binary ones are 0/1 indicators, the ordinal one (`education`) is coded with positive integers, and the nominal ones (`maritl`, `race`) are encoded with dummies.

### Now apply a linear regression and interpret coefficients

In [40]:
from sklearn.linear_model import LinearRegression

In [42]:
reg = LinearRegression().fit(wage_df.drop("wage", axis=1), wage_df.wage)

In [57]:
df_coeff = pd.DataFrame({'predictors':wage_df.drop("wage", axis=1).columns, 'coefficients':reg.coef_})

In [58]:
df_coeff

Unnamed: 0,predictors,coefficients
0,age,0.303748
1,education,13.344081
2,health,6.60423
3,health_ins,16.996045
4,job_information,3.785646
5,marriage_yes,17.179947
6,marriage_widowed,0.972707
7,marriage_divorced,3.491899
8,marriage_separated,12.132549
9,race_white,4.734301


#### Some observations
- A married person is expected to earn 17.18 units more than a person who was never married (the dropped baseline category), if everything else remains the same.
- Each additional education level brings, on average, about 13.3 units more wage, holding everything else fixed.
- Insured people earn, on average, about 17 units more.
- Widowed people earn slightly more (about 0.97 units) than those who were never married (again, given everything else is the same, and this result is in *expectation*).
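The "everything else remains the same" reading can be checked directly: for two people who differ only in one predictor, the gap between their predictions equals that predictor's coefficient. The sketch below uses synthetic data (generated here for illustration, not the `wage.csv` values) and a reduced set of three predictors.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic data (illustrative only, not the Wage dataset)
rng = np.random.default_rng(0)
n = 200
X = pd.DataFrame({
    "age": rng.integers(18, 65, size=n),
    "education": rng.integers(1, 6, size=n),
    "marriage_yes": rng.integers(0, 2, size=n),
})
y = (20 + 0.3 * X["age"] + 13 * X["education"]
     + 17 * X["marriage_yes"] + rng.normal(0, 5, size=n))

reg = LinearRegression().fit(X, y)
coef = dict(zip(X.columns, reg.coef_))

# Two hypothetical people who differ only in marital status
people = pd.DataFrame({"age": [40, 40], "education": [3, 3], "marriage_yes": [0, 1]})
pred = reg.predict(people)
gap = pred[1] - pred[0]

print("prediction gap:          ", gap)
print("marriage_yes coefficient:", coef["marriage_yes"])
```

The two printed numbers agree up to floating-point error, which is what the interpretation of a linear regression coefficient amounts to.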

#### Discussion on ethics: 
- If we are trying to understand which groups of people earn less than others, using these coefficients is generally fine. However, if we are going to make decisions, such as whether or not to give loans to some people, it is typically not ethical to use information such as race or gender.