In [None]:
# Import necessary libraries
import pandas as pd
from sklearn.linear_model import LinearRegression

# Categorical Variables

Until now, we have assumed that the predictor variables can only be quantitative. But what if a dataset contains a qualitative variable? How do we process these variables so they can be used in a model?

Let’s create a small example dataset:

In [None]:
data = {
    'Age (X1)': [42, 24, 47, 50, 60],
    'Monthly Income (X2)': [7313, 17747, 22845, 18552, 14439],
    'Gender (X3)': ['Female', 'Female', 'Male', 'Female', 'Male'],
    'Total Spend (Y)': [4198.385084, 4134.976648, 5166.614455, 7784.447676, 3254.160485]
}
df = pd.DataFrame.from_dict(data)
df

The column `Gender (X3)` is categorical, with `Male` and `Female` as the possible values. To handle this, we can define a dummy variable *X<sub>3</sub>*:

![](https://latex.codecogs.com/gif.latex?X_3%20%3D%20%5Cbegin%7Bcases%7D1%5Ctext%7B%20if%20customer%20is%20female%7D%20%5C%5C%200%5Ctext%7B%20if%20customer%20is%20male%7D%20%5Cend%7Bcases%7D)
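
As a minimal sketch, this definition can be applied by hand with a boolean comparison. The `gender` series below is a standalone copy of the `Gender (X3)` column, and the names `gender` and `x3` are chosen here for illustration only.

```python
import pandas as pd

# Standalone copy of the 'Gender (X3)' column from the example dataset
gender = pd.Series(['Female', 'Female', 'Male', 'Female', 'Male'], name='Gender (X3)')

# X3 = 1 if the customer is female, 0 if the customer is male
x3 = (gender == 'Female').astype(int)
print(x3.tolist())
```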

Our linear regression model becomes:

![](https://latex.codecogs.com/gif.latex?Y_e%20%3D%20%5Cbegin%7Bcases%7D%20%5Calpha%20+%20%5Cbeta_1X_1%20+%20%5Cbeta_2X_2%20+%20%5Cbeta_3%20%5Ctext%7B%20if%20customer%20is%20female%7D%20%5C%5C%20%5Calpha%20+%20%5Cbeta_1X_1%20+%20%5Cbeta_2X_2%20%5Ctext%7B%20%5C%20%5C%20%5C%20%5C%20%5C%20%5C%20%5C%20if%20customer%20is%20male%7D%20%5Cend%7Bcases%7D)

In Python, we can create dummy variables with the `pandas` function `get_dummies`:

In [None]:
dummy_gender = pd.get_dummies(df['Gender (X3)'], prefix='Sex')
dummy_gender

If `Gender` is `Female`, the variable `Sex_Female` is encoded as `1` (for True) and `Sex_Male` is `0` (for False). If `Gender` is `Male`, it is encoded as the opposite. 
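
Note that pandas 2.0 and later return boolean `True`/`False` columns from `get_dummies` by default rather than `1`/`0`. The sketch below rebuilds the gender column so it runs on its own, and passes `dtype=int` to get integer indicators instead.

```python
import pandas as pd

# Standalone copy of the 'Gender (X3)' column from the example dataset
gender = pd.Series(['Female', 'Female', 'Male', 'Female', 'Male'], name='Gender (X3)')

# dtype=int forces 1/0 indicators instead of True/False
dummy_gender = pd.get_dummies(gender, prefix='Sex', dtype=int)
print(dummy_gender)
```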

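Now you can join this back to the original dataframe and build a linear regression model as usual, using `Sex_Female` and `Sex_Male` as predictors in place of `Gender (X3)`. Note that the two columns carry the same information (`Sex_Male` = 1 − `Sex_Female`), so the model equation above needs only one of them. Including both still fits, but the individual dummy coefficients are then not uniquely determined.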

In [None]:
# Append Sex_Female and Sex_Male columns and drop Gender column
df_new = df.join(dummy_gender).drop(columns=['Gender (X3)'])
df_new

In [None]:
# Create linear regression model using scikit-learn
predictors = ['Age (X1)', 'Monthly Income (X2)', 'Sex_Female', 'Sex_Male']
X = df_new[predictors]
Y = df_new['Total Spend (Y)']
lm = LinearRegression()
lm.fit(X, Y)
print(f'alpha = {lm.intercept_}')
print(f'betas = {lm.coef_}')
print(f'R2 = {lm.score(X, Y)}')
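
The model equation earlier uses a single dummy *X<sub>3</sub>*, while the fit above uses both `Sex_Female` and `Sex_Male`. As a rough sketch (the names `lm_both`, `lm_one` and `beta3` are introduced here for illustration), the snippet below fits both versions on the same data and compares their predictions. With only `Sex_Female` in the model, its coefficient plays the role of β<sub>3</sub>: the estimated difference in spend for female versus male customers, holding age and income fixed.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Same example dataset as above, rebuilt so this cell runs on its own
data = {
    'Age (X1)': [42, 24, 47, 50, 60],
    'Monthly Income (X2)': [7313, 17747, 22845, 18552, 14439],
    'Gender (X3)': ['Female', 'Female', 'Male', 'Female', 'Male'],
    'Total Spend (Y)': [4198.385084, 4134.976648, 5166.614455, 7784.447676, 3254.160485]
}
df = pd.DataFrame(data)
df['Sex_Female'] = (df['Gender (X3)'] == 'Female').astype(int)
df['Sex_Male'] = 1 - df['Sex_Female']

Y = df['Total Spend (Y)']
base = ['Age (X1)', 'Monthly Income (X2)']
X_both = df[base + ['Sex_Female', 'Sex_Male']]  # both dummies (redundant)
X_one = df[base + ['Sex_Female']]               # single dummy, as in the equation

lm_both = LinearRegression().fit(X_both, Y)
lm_one = LinearRegression().fit(X_one, Y)

# The two models should give the same predictions
pred_both = lm_both.predict(X_both)
pred_one = lm_one.predict(X_one)
print(np.allclose(pred_both, pred_one))

# Coefficient on the single dummy, i.e. beta_3 in the model equation
beta3 = lm_one.coef_[-1]
print(f'beta3 = {beta3}')
```

`pd.get_dummies(..., drop_first=True)` is another way to keep a single dummy, but it drops the first category in sorted order (`Female` here), so the remaining column would be `Sex_Male` and the sign of the coefficient would flip relative to the equation.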

Great! Now we know how to handle missing values, outliers, and categorical variables. In our final step, we will discuss how to handle non-linear relationships! 

Return to the notebook directory in Jupyter by pressing `File` > `Open…` in the toolbar at the top, then open the notebook called `3.3 Non-linear transformations.ipynb`.