# Required assignment 8.2: Converting binary and categorical predictors using Python
When working with machine learning in Python, converting categorical values to numerical formats is essential, as most algorithms require numerical input. Two commonly used techniques for this are **label encoding** and **one-hot encoding**.
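
To make the distinction concrete, here is a minimal sketch on a made-up `color` column (the column and its values are illustrative and not part of `wage.csv`). It uses pandas category codes as a stand-in for label encoding; scikit-learn's `LabelEncoder` produces integer codes in the same spirit.

```python
import pandas as pd

# Hypothetical toy data, used only to illustrate the two encodings
toy = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

# Label encoding: one integer code per category (categories are sorted alphabetically)
label_encoded = toy["color"].astype("category").cat.codes

# One-hot encoding: one indicator column per category
one_hot = pd.get_dummies(toy, columns=["color"], dtype=int)

print(label_encoded.tolist())
print(one_hot)
```

Integer codes imply an order among the categories, which suits ordinal data. Indicator columns imply no order, which suits nominal data.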

In [12]:
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder


## Read the `wage` file

The data set used for this assignment is `wage.csv`. This data set has a combination of ordinal, nominal and numerical data.

In [13]:
wage_df = pd.read_csv("data/wage.csv")

In [14]:
wage_df.head(3)

Unnamed: 0,year,age,maritl,race,education,region,jobclass,health,health_ins,logwage,wage
0,2006,18,1. Never Married,1. White,1. < HS Grad,2. Middle Atlantic,1. Industrial,1. <=Good,2. No,4.318063,75.043154
1,2004,24,1. Never Married,1. White,4. College Grad,2. Middle Atlantic,2. Information,2. >=Very Good,2. No,4.255273,70.47602
2,2003,45,2. Married,1. White,3. Some College,2. Middle Atlantic,1. Industrial,1. <=Good,1. Yes,4.875061,130.982177


In [15]:
wage_df.dtypes  # every column except year, age, logwage and wage is categorical (dtype object)

year            int64
age             int64
maritl         object
race           object
education      object
region         object
jobclass       object
health         object
health_ins     object
logwage       float64
wage          float64
dtype: object

### Question 1: 

Identify the unique values of each categorical attribute and share your answers.

1. `ans1` stores the unique values of the `jobclass` attribute.
2. `ans2` stores the unique values of the `health` attribute.
3. `ans3` stores the unique values of the `health_ins` attribute.
4. `ans4` stores the unique values of the `maritl` attribute.
5. `ans5` stores the unique values of the `education` attribute.
6. `ans6` stores the unique values of the `race` attribute.
7. `ans7` stores the unique values of the `region` attribute.

In [16]:
###GRADED CELL
ans1 = None
ans2 = None
ans3 = None
ans4 = None
ans5 = None
ans6 = None
ans7 = None
# YOUR CODE HERE
#raise NotImplementedError()

# Unique values for each categorical attribute
ans1 = wage_df.jobclass.unique()
ans2 = wage_df.health.unique()
ans3 = wage_df.health_ins.unique()
ans4 = wage_df.maritl.unique()
ans5 = wage_df.education.unique()
ans6 = wage_df.race.unique()
ans7 = wage_df.region.unique()

print("The unique values of jobclass are", ans1)
print("The unique values of health are", ans2)
print("The unique values of health_ins are", ans3)
print("The unique values of maritl are", ans4)
print("The unique values of education are", ans5)
print("The unique values of race are", ans6)
print("The unique values of region are", ans7)

The unique values of jobclass are ['1. Industrial' '2. Information']
The unique values of health are ['1. <=Good' '2. >=Very Good']
The unique values of health_ins are ['2. No' '1. Yes']
The unique values of maritl are ['1. Never Married' '2. Married' '4. Divorced' '3. Widowed' '5. Separated']
The unique values of education are ['1. < HS Grad' '4. College Grad' '3. Some College' '2. HS Grad'
 '5. Advanced Degree']
The unique values of race are ['1. White' '3. Asian' '4. Other' '2. Black']
The unique values of region are ['2. Middle Atlantic']
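
The same information can be gathered in one pass instead of one attribute at a time. The sketch below runs on a small stand-in frame (`toy_df` is hypothetical; with the real data you would use `wage_df`) and collects the unique values of every non-numeric column into a dictionary.

```python
import pandas as pd

# Hypothetical three-row stand-in for wage_df
toy_df = pd.DataFrame({
    "age": [18, 24, 45],
    "jobclass": ["1. Industrial", "2. Information", "1. Industrial"],
    "health_ins": ["2. No", "2. No", "1. Yes"],
})

# Select every column that is not numeric, then collect its unique values
non_numeric_cols = toy_df.select_dtypes(exclude="number").columns
unique_values = {col: toy_df[col].unique().tolist() for col in non_numeric_cols}

for col, values in unique_values.items():
    print(f"{col}: {values}")
```

`unique()` returns values in order of first appearance, which is why the labels are not sorted.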


The categorical columns listed in `categorical_cols` are encoded using `pd.get_dummies()`.

For details, see the [pandas.get_dummies documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.get_dummies.html).
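
As a rough illustration of what `drop_first=True` changes, the sketch below encodes a hypothetical three-row frame with and without it. The constant `region` column mirrors the single unique value of `region` printed above.

```python
import pandas as pd

# Hypothetical three-row stand-in for wage_df
toy_df = pd.DataFrame({
    "age": [18, 45, 43],
    "maritl": ["1. Never Married", "2. Married", "2. Married"],
    "region": ["2. Middle Atlantic"] * 3,  # constant column with a single level
})

# One indicator column per level
full = pd.get_dummies(toy_df, columns=["maritl", "region"])

# The first level of each column is dropped and acts as the baseline
reduced = pd.get_dummies(toy_df, columns=["maritl", "region"], drop_first=True)

print(list(full.columns))
print(list(reduced.columns))
```

With `drop_first=True`, a column that has only one level contributes no indicator columns at all, which is consistent with `region` not appearing in the encoded output later in this notebook.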

In [17]:
categorical_cols = ['maritl', 'race',  'region', 'jobclass']


### Question 2:

- Use one-hot encoding to encode `categorical_cols`.

- Store the encoded values in `wage_df_encoded`.

Hint: Remember to use `drop_first=True`. 

In [18]:
###GRADED CELL
wage_df_encoded = None
# YOUR CODE HERE
#raise NotImplementedError()

wage_df_encoded = pd.get_dummies(wage_df, columns=categorical_cols, drop_first=True)

print(wage_df_encoded.head())

   year  age        education          health health_ins   logwage  \
0  2006   18     1. < HS Grad       1. <=Good      2. No  4.318063   
1  2004   24  4. College Grad  2. >=Very Good      2. No  4.255273   
2  2003   45  3. Some College       1. <=Good     1. Yes  4.875061   
3  2003   43  4. College Grad  2. >=Very Good     1. Yes  5.041393   
4  2005   50       2. HS Grad       1. <=Good     1. Yes  4.318063   

         wage  maritl_2. Married  maritl_3. Widowed  maritl_4. Divorced  \
0   75.043154              False              False               False   
1   70.476020              False              False               False   
2  130.982177               True              False               False   
3  154.685293               True              False               False   
4   75.043154              False              False                True   

   maritl_5. Separated  race_2. Black  race_3. Asian  race_4. Other  \
0                False          False          False     

In `wage_df_encoded`, the columns `education`, `health` and `health_ins` are ordinal. Therefore, they are removed before further processing.



In [19]:
wage_df_encoded_new = wage_df_encoded.drop(['education', 'health', 'health_ins'], axis=1)
print(wage_df_encoded_new.head())

   year  age   logwage        wage  maritl_2. Married  maritl_3. Widowed  \
0  2006   18  4.318063   75.043154              False              False   
1  2004   24  4.255273   70.476020              False              False   
2  2003   45  4.875061  130.982177               True              False   
3  2003   43  5.041393  154.685293               True              False   
4  2005   50  4.318063   75.043154              False              False   

   maritl_4. Divorced  maritl_5. Separated  race_2. Black  race_3. Asian  \
0               False                False          False          False   
1               False                False          False          False   
2               False                False          False          False   
3               False                False          False           True   
4                True                False          False          False   

   race_4. Other  jobclass_2. Information  
0          False                    False 

The ordinal variables `education`, `health` and `health_ins` can instead be encoded as integers that preserve their order.

In [20]:
# Mappings defined for education, health and health_ins
education_map = {
    '1. < HS Grad': 1,
    '2. HS Grad': 2,
    '3. Some College': 3,
    '4. College Grad': 4,
    '5. Advanced Degree': 5
}

health_map = {
    '1. <=Good': 1,
    '2. >=Very Good': 2,
    '3. Excellent': 3  # not among the unique values of health printed earlier, so unused here
}

health_ins_map = {
    '1. Yes': 1,
    '2. No': 0
}

# Apply mappings
wage_df['education'] = wage_df['education'].map(education_map)
wage_df['health'] = wage_df['health'].map(health_map)
wage_df['health_ins'] = wage_df['health_ins'].map(health_ins_map)

# One-hot encode remaining categorical columns
categorical_cols = ['maritl', 'race', 'region', 'jobclass']
wage_df_encoded = pd.get_dummies(wage_df, columns=categorical_cols, drop_first=True)

print(wage_df_encoded.head())


   year  age  education  health  health_ins   logwage        wage  \
0  2006   18          1       1           0  4.318063   75.043154   
1  2004   24          4       2           0  4.255273   70.476020   
2  2003   45          3       1           1  4.875061  130.982177   
3  2003   43          4       2           1  5.041393  154.685293   
4  2005   50          2       1           1  4.318063   75.043154   

   maritl_2. Married  maritl_3. Widowed  maritl_4. Divorced  \
0              False              False               False   
1              False              False               False   
2               True              False               False   
3               True              False               False   
4              False              False                True   

   maritl_5. Separated  race_2. Black  race_3. Asian  race_4. Other  \
0                False          False          False          False   
1                False          False          False          Fa
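
One thing worth checking with dictionary-based mappings is coverage: `Series.map()` returns `NaN` for any label that is not a key of the dictionary. The sketch below shows this on a hypothetical series (`toy_health` and its `"unknown"` label are made up for illustration).

```python
import pandas as pd

# Hypothetical column containing one label that the mapping does not cover
toy_health = pd.Series(["1. <=Good", "2. >=Very Good", "unknown"])
toy_health_map = {"1. <=Good": 1, "2. >=Very Good": 2}

mapped = toy_health.map(toy_health_map)

# Labels present in the data but absent from the dictionary
unmapped_labels = sorted(set(toy_health.unique()) - set(toy_health_map))

print(mapped.tolist())
print(unmapped_labels)
```

The same effect appears if a mapping cell is run twice on the same frame: after the first run the column holds integers, which are not keys of the dictionary, so the second run turns the column into `NaN`.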

### Question 3:

Can the data set `wage_df_encoded` be used for training a linear regression model?

Provide your input to `ans3a`.

In [22]:
###GRADED CELL
ans3a = None
# YOUR CODE HERE
#raise NotImplementedError()

ans3a = "No"


Linear regression models operate on numeric features (integers or floats), so the Boolean columns (True/False) produced by one-hot encoding should be converted to 1/0.

### Question 4:
Select the columns whose dtype is `bool` and convert them to `int`.

Hint: Use `.select_dtypes(include = 'bool').columns` to select the columns with Boolean data types and store them in the variable `bool_cols`.

Use `.astype(int)` to convert the `bool_cols` columns to integer data types and save the result to `wage_df_encoded[bool_cols]`.

In [23]:
###GRADED CELL
bool_cols = None

# YOUR CODE HERE
#raise NotImplementedError()

bool_cols = wage_df_encoded.select_dtypes(include='bool').columns
wage_df_encoded[bool_cols] = wage_df_encoded[bool_cols].astype(int)

print(wage_df_encoded.head())

   year  age  education  health  health_ins   logwage        wage  \
0  2006   18          1       1           0  4.318063   75.043154   
1  2004   24          4       2           0  4.255273   70.476020   
2  2003   45          3       1           1  4.875061  130.982177   
3  2003   43          4       2           1  5.041393  154.685293   
4  2005   50          2       1           1  4.318063   75.043154   

   maritl_2. Married  maritl_3. Widowed  maritl_4. Divorced  \
0                  0                  0                   0   
1                  0                  0                   0   
2                  1                  0                   0   
3                  1                  0                   0   
4                  0                  0                   1   

   maritl_5. Separated  race_2. Black  race_3. Asian  race_4. Other  \
0                    0              0              0              0   
1                    0              0              0            
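
An alternative to converting after the fact, sketched here on a hypothetical one-column frame: `pd.get_dummies()` accepts a `dtype` argument, so the indicator columns can be created as integers directly.

```python
import pandas as pd

# Hypothetical stand-in for a single nominal column of wage_df
toy_df = pd.DataFrame({"jobclass": ["1. Industrial", "2. Information", "1. Industrial"]})

# dtype=int makes the indicator columns integers instead of booleans
encoded = pd.get_dummies(toy_df, columns=["jobclass"], drop_first=True, dtype=int)

print(encoded)
print(encoded.dtypes)
```

Either route should give the same 0/1 columns; the two-step version in this notebook follows the wording of the question.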

All categorical variables have now been converted to numerical values: ordinal features were mapped to positive integers, while nominal features were one-hot encoded using dummy variables.

Next, let's check whether any columns contain null values and view the list of column names.

In [24]:
print(wage_df_encoded.isnull().sum())

year                       0
age                        0
education                  0
health                     0
health_ins                 0
logwage                    0
wage                       0
maritl_2. Married          0
maritl_3. Widowed          0
maritl_4. Divorced         0
maritl_5. Separated        0
race_2. Black              0
race_3. Asian              0
race_4. Other              0
jobclass_2. Information    0
dtype: int64
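
For wider frames it can help to list the column names explicitly and print only the columns that contain nulls. A minimal sketch on a hypothetical frame with one missing value:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with a single missing age
toy_df = pd.DataFrame({
    "age": [18, 24, np.nan],
    "logwage": [4.318063, 4.255273, 4.875061],
})

# Column names as a plain Python list
print(toy_df.columns.tolist())

# Null count per column, filtered to the columns that have any
null_counts = toy_df.isnull().sum()
print(null_counts[null_counts > 0])
```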


Next, fit a linear regression model and interpret its coefficients.

In [25]:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

In [34]:
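# Drop rows with a missing target value and work on an independent copy
wage_df_encoded_clean = wage_df_encoded.dropna(subset=['logwage']).copy()
# errors="ignore" makes this drop a no-op when no column is labelled None
wage_df_encoded_clean = wage_df_encoded_clean.drop(columns=[None], errors="ignore")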
print(wage_df_encoded_clean.head())

   year  age  education  health  health_ins   logwage        wage  \
0  2006   18          1       1           0  4.318063   75.043154   
1  2004   24          4       2           0  4.255273   70.476020   
2  2003   45          3       1           1  4.875061  130.982177   
3  2003   43          4       2           1  5.041393  154.685293   
4  2005   50          2       1           1  4.318063   75.043154   

   maritl_2. Married  maritl_3. Widowed  maritl_4. Divorced  \
0                  0                  0                   0   
1                  0                  0                   0   
2                  1                  0                   0   
3                  1                  0                   0   
4                  0                  0                   1   

   maritl_5. Separated  race_2. Black  race_3. Asian  race_4. Other  \
0                    0              0              0              0   
1                    0              0              0            

### Question 5:

The data set is now ready to be used with a linear regression model.

1. Remove the columns [`wage`, `logwage`] from the data set to define features `X`.

2. Set the target variable `y` to the `logwage` column.

3. Split the data set into training and testing sets using the `train_test_split()` function with `test_size=0.2`.
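
Step 1 removes `wage` along with `logwage` because, judging from the rows displayed earlier, `logwage` appears to be the natural logarithm of `wage`. Leaving `wage` in `X` would effectively hand the model its own target. A quick check on the three rows shown by `wage_df.head(3)`:

```python
import numpy as np

# wage and logwage for the first three rows, copied from the wage_df.head(3) output
wage = np.array([75.043154, 70.47602, 130.982177])
logwage = np.array([4.318063, 4.255273, 4.875061])

# Compare log(wage) with logwage, allowing for the rounding of the printed values
is_log_of_wage = np.allclose(np.log(wage), logwage, atol=1e-5)
print(is_log_of_wage)
```

This only confirms the relationship for the rows shown; on the full data the same comparison could be run on the `wage` and `logwage` columns of `wage_df`.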

In [47]:
###GRADED CELL
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

model = None

X = None
y = None
X_train, X_test, y_train, y_test = None, None, None, None

# YOUR CODE HERE
#raise NotImplementedError()

model = LinearRegression()

X = wage_df_encoded_clean.drop(columns=['wage', 'logwage'])
y = wage_df_encoded_clean['logwage']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


In [48]:

# Fit the model on the training set
model.fit(X_train, y_train)
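
The held-out test set exists so that the fitted model can be evaluated on data it has not seen. The sketch below shows the full split, fit and score sequence on synthetic data (the `_demo` names and the data-generating formula are assumptions made for illustration, not part of the assignment).

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic data (hypothetical): y depends linearly on two features plus small noise
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 2))
y_demo = 2.0 * X_demo[:, 0] - 1.0 * X_demo[:, 1] + rng.normal(scale=0.1, size=200)

# Hold out 20% of the rows for testing
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

demo_model = LinearRegression().fit(X_tr, y_tr)

# score() returns the coefficient of determination (R^2) on the data passed in
r2_test = demo_model.score(X_te, y_te)
print(f"Test R^2: {r2_test:.3f}")
```

With the notebook's own variables, the equivalent call would be `model.score(X_test, y_test)`.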

### Question 6:

- Compute the coefficients of the model and display them for the different features.

- Assign your output to `df_coeff_`.

Hint: Use a data frame and display the features and coefficients by assigning `X_train.columns` to features and `model.coef_` to coefficients.

In [50]:
###GRADED CELL
df_coeff_ = None

# YOUR CODE HERE
#raise NotImplementedError()

df_coeff_ = pd.DataFrame({'Features':X_train.columns, 'coefficients':model.coef_})

print(df_coeff_)

                   Features  coefficients
0                      year      0.012593
1                       age      0.003261
2                 education      0.105917
3                    health      0.063914
4                health_ins      0.189283
5         maritl_2. Married      0.158026
6         maritl_3. Widowed      0.049417
7        maritl_4. Divorced      0.032757
8       maritl_5. Separated      0.122358
9             race_2. Black     -0.051681
10            race_3. Asian     -0.013841
11            race_4. Other     -0.055469
12  jobclass_2. Information      0.025797


This table lists the features used in the linear regression model along with their coefficients, which reflect each feature's estimated effect on the target variable (`logwage`).

- Positive coefficients indicate that the feature increases the predicted logwage.

- Negative coefficients indicate a decrease in the predicted logwage.

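The magnitude of each coefficient shows how much `logwage` changes with a one-unit increase in that feature, assuming that all other features remain constant. For a dummy variable, a one-unit increase means moving from the dropped baseline category to the category named in the column.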
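
Because the target is a logarithm, a coefficient can also be read as an approximate percentage change in `wage`. The sketch below assumes `logwage` is the natural log of `wage` and converts a few coefficients copied from the table above.

```python
import numpy as np
import pandas as pd

# A few coefficients copied from the coefficient table
coefficients = pd.Series({
    "education": 0.105917,
    "maritl_2. Married": 0.158026,
    "race_2. Black": -0.051681,
})

# For a natural-log target, a one-unit increase multiplies the predicted wage
# by exp(coefficient); subtracting 1 gives the relative change
pct_change_in_wage = (np.exp(coefficients) - 1) * 100
print(pct_change_in_wage.round(1))
```

Under that assumption, each additional `education` level corresponds to a predicted wage roughly 11% higher, holding the other features constant.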