Welcome to Preceptor Practice Session 10! 

In [1]:
import YData 


#YData.download_practice_code(10)              # Without Answers. 
#YData.download_practice_code(10, True)        # With Answers. 



import pandas as pd 
import numpy as np

#YData.download_data("loan_data.csv")


In [2]:
loans = pd.read_csv("loan_data.csv")
loans = loans.drop(columns = "loan_percent_income")
display(loans.head(3), loans.shape)

Unnamed: 0,person_age,person_gender,person_education,person_income,person_emp_exp,person_home_ownership,loan_amnt,loan_intent,loan_int_rate,cb_person_cred_hist_length,credit_score,previous_loan_defaults_on_file,loan_status
0,22.0,female,Master,71948.0,0,RENT,35000.0,PERSONAL,16.02,3.0,561,No,1
1,21.0,female,High School,12282.0,0,OWN,1000.0,EDUCATION,11.14,2.0,504,Yes,0
2,25.0,female,High School,12438.0,3,MORTGAGE,5500.0,MEDICAL,12.87,3.0,635,No,1


(45000, 13)

## 1. Linear regression

In regression, we try to predict a quantitative variable y from a set of features X. 

Let's explore this by predicting the person's income (in USD) from other quantitative features in the above dataset. 

- You are free to choose your own set of features; I am following closely the setup from the class.
- Remark: When working with salaries, one usually applies a log transform. You can use the `np.log()` function for this. 
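
As a minimal sketch of the log transform mentioned in the remark, the snippet below applies `np.log()` to a handful of made-up income values (the numbers are illustrative and are not taken from `loan_data.csv`). It also shows that `np.exp()` undoes the transform, which is one way predictions made on the log scale could be mapped back to USD.

```python
import numpy as np

# Hypothetical incomes in USD (illustrative values only, not from loan_data.csv)
incomes = np.array([12_000.0, 45_000.0, 80_000.0, 1_200_000.0])

# The log transform compresses the long right tail of income data
log_incomes = np.log(incomes)

# np.exp() inverts the transform, so log-scale values can be converted back to USD
recovered = np.exp(log_incomes)

print(log_incomes.round(2))
print(np.allclose(recovered, incomes))
```

With the notebook's data, the analogous step would be something like `np.log(loans['person_income'])`, assuming every income in the column is strictly positive.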


In [3]:
# get the features and the labels

X_features = loans[[
                    'loan_amnt', 
                    'person_emp_exp',
                    'credit_score'
                   ]]

y_salary = loans['person_income']




- Let's use scikit-learn to generate training and test data (as we did previously for our KNN classifier). 

In [4]:
from sklearn.model_selection import train_test_split

# split data into a training and test set

X_train, X_test, y_train, y_test = train_test_split(X_features,  
                                                    y_salary, 
                                                    random_state = 0)

print(X_train.shape)
print(X_test.shape)

X_train.head(5)


(33750, 3)
(11250, 3)


Unnamed: 0,loan_amnt,person_emp_exp,credit_score
34761,15912.0,3,645
17827,1500.0,3,610
8937,8000.0,2,572
26508,14500.0,10,597
36846,6000.0,8,565


- We can now create a new linear regression model, fit it to data, and make predictions. The method names are again very similar to what we used for the KNN classifier (i.e., the `fit()` and `predict()` methods). 

In [5]:
from sklearn.linear_model import LinearRegression

# create a new linear regression model
linear_model = LinearRegression()

In [6]:
# fit the model to our training data

linear_model.fit(X_train, y_train)

In [7]:
# make predictions of the salaries on the test data

predictions = linear_model.predict(X_test)
predictions[0:5]

array([121936.68858977,  83081.28740642,  77116.35322228,  81225.87801384,
        64403.79526245])

- We can assess the accuracy of our predictions using the root mean squared error (RMSE), which is defined as: 

$$RMSE = \sqrt{ \frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}$$

- Here $\hat{y}_i$ is the prediction made by our linear model for the $i$-th point in the test data (i.e., the predicted income), $y_i$ is the actual income for that point, and $n$ is the number of points in the test set.
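
To make the formula concrete, here is a small worked example on four made-up actual and predicted incomes (illustrative values only). It follows the definition step by step: residuals, squared residuals, their mean, and finally the square root.

```python
import numpy as np

# Made-up actual and predicted incomes in USD (illustrative only)
y_actual = np.array([50_000.0, 60_000.0, 70_000.0, 80_000.0])
y_predicted = np.array([52_000.0, 57_000.0, 70_000.0, 86_000.0])

residuals = y_actual - y_predicted      # [-2000, 3000, 0, -6000]
mse = np.mean(residuals ** 2)           # (4e6 + 9e6 + 0 + 36e6) / 4 = 12.25e6
rmse = np.sqrt(mse)                     # back on the original scale (USD)

print(rmse)  # 3500.0
```

Because of the squaring, the single residual of 6000 contributes more to the result than the other three residuals combined.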


In [8]:
# test the RMSE on the test data

RMSE = np.sqrt(np.mean((y_test - predictions)**2))

RMSE   


84763.8003991134

- We can also use scikit-learn's `mean_squared_error()` to get the MSE, and we can use `cross_val_score()` to run k-fold cross-validation (again, in a very similar way to what we did for our KNN classifier). 

In [9]:
from sklearn.metrics import mean_squared_error

# Use scikit-learn's mean_squared_error() function to get the MSE, then take the square root to get the RMSE
np.sqrt(mean_squared_error(y_test, predictions))


84763.8003991134

In [10]:
# using cross-validation
from sklearn.model_selection import cross_val_score

linear_model = LinearRegression()

scores = cross_val_score(linear_model, 
                         X_features,  
                         y_salary, 
                         cv = 5, 
                         scoring='neg_mean_squared_error')

np.sqrt(np.mean(-1 * scores))

77266.7406390397
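
The `'neg_mean_squared_error'` scorer returns the negated MSE for each fold (so that larger scores are better), which is why the cell above multiplies the scores by -1 before averaging and taking the square root. The sketch below reproduces the same idea by hand with NumPy on synthetic data. The helper name `kfold_rmse` is hypothetical, the folds are contiguous and unshuffled, and the model is fit with `np.linalg.lstsq` rather than scikit-learn, so treat it as an illustration of the procedure rather than a replacement for `cross_val_score()`.

```python
import numpy as np

def kfold_rmse(X, y, k=5):
    """Hypothetical helper: k-fold cross-validated RMSE for ordinary least squares."""
    n = len(y)
    X1 = np.column_stack([np.ones(n), X])        # prepend a column of 1s for the intercept
    folds = np.array_split(np.arange(n), k)      # k contiguous folds, no shuffling
    fold_mse = []
    for test_idx in folds:
        train_mask = np.ones(n, dtype=bool)
        train_mask[test_idx] = False             # hold out the current fold
        coefs, *_ = np.linalg.lstsq(X1[train_mask], y[train_mask], rcond=None)
        errors = y[test_idx] - X1[test_idx] @ coefs
        fold_mse.append(np.mean(errors ** 2))
    # average the per-fold MSEs, then take the square root
    return np.sqrt(np.mean(fold_mse))

# Synthetic data: y = 3 + 2*x1 - x2 + noise with standard deviation 0.5
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = 3 + 2 * X[:, 0] - X[:, 1] + rng.normal(scale=0.5, size=200)

cv_rmse = kfold_rmse(X, y, k=5)
print(cv_rmse)
```

On this synthetic data the cross-validated RMSE should land close to the noise standard deviation of 0.5, since the linear model matches how the data were generated.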

- What does RMSE measure? 
- Answer: 


### Regression model equation

In linear regression, our predicted $\hat{y}$ values are given by the equation: $\hat{y} = b_0 + b_1 x_1 + ... + b_k x_k$.

Let's fill out this equation for predicting salary. 

To do this, let's start by extracting the intercept ($b_0$) and slope coefficients (the $b_i$'s) from our scikit-learn model.


In [11]:

# fit the linear regression model to our training data
linear_model.fit(X_train, y_train)

# get the intercept and slope coefficients
sklearn_intercept = linear_model.intercept_
sklearn_coefficients = linear_model.coef_

# print out the coefficient values
(sklearn_intercept, sklearn_coefficients)  


(40883.04610068797, array([ 3.07899709e+00,  2.06997650e+03, -1.82767685e+00]))

- Given these coefficient values, can you write out the regression equation for predicting a person's salary? 
- Answer: 

#### Writing our own prediction function

Let's also write our own function called `get_predictions(b0_intercept, b_coefficients, X_data)` that takes the coefficient values and X values and returns predicted $\hat{y}$ values for each X value. In particular, the arguments to the function are:

1. `b0_intercept`: The linear regression intercept
2. `b_coefficients`: The linear regression slope coefficients
3. `X_data`: The X data values 

The returned value is a numpy ndarray of predictions for each X data point. 


In [12]:
# write a function to get the predictions
def get_predictions(b0_intercept, b_coefficients, X_data):
    return np.sum(X_data.to_numpy() * b_coefficients, axis = 1) + b0_intercept 


# get the predicted values on the test data
predicted_vals = get_predictions(sklearn_intercept, sklearn_coefficients, X_test)


# check that it matches the scikit-learn predictions
predicted_vals_sklearn = linear_model.predict(X_test)
predicted_vals == predicted_vals_sklearn


array([ True,  True,  True, ...,  True,  True,  True])
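
The element-wise `==` check returns `True` everywhere here, but exact equality between floating-point arrays can fail when the same quantity is computed with operations in a different order, so a tolerance-based check such as `np.allclose()` is often safer. The sketch below also shows that the row-wise sum in `get_predictions()` is a matrix-vector product. The function name `get_predictions_matmul` and all the numbers are made up for illustration.

```python
import numpy as np

def get_predictions_matmul(b0_intercept, b_coefficients, X_data):
    """Hypothetical variant of get_predictions() using a matrix-vector product."""
    return np.asarray(X_data) @ np.asarray(b_coefficients) + b0_intercept

# Made-up intercept, slopes, and feature rows (illustrative only)
b0 = 10.0
b = np.array([2.0, -1.0, 0.5])
X = np.array([[1.0, 2.0, 4.0],
              [0.0, 3.0, 2.0]])

preds_matmul = get_predictions_matmul(b0, b, X)
preds_sum = np.sum(X * b, axis=1) + b0      # same computation as get_predictions()

print(preds_matmul)                          # [12.  8.]
print(np.allclose(preds_matmul, preds_sum))  # True
```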

## Inference on regression coefficients

We can also run inference procedures on our regression model using the statsmodels package. In particular, we can run hypothesis tests and create confidence intervals for our regression coefficients. 

When running a hypothesis test, our hypotheses are:

$H_0: \beta_i = 0$  
$H_A: \beta_i \ne 0$
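
As a rough sketch of where the `coef`, `std err`, `t`, and `P>|t|` columns of a regression summary come from, the code below computes them by hand on synthetic data in which the true coefficient on `x2` is 0. It assumes the standard OLS formulas with constant error variance, and it uses a normal approximation for the p-values and the 95% intervals (reasonable for large n), so its numbers may differ slightly from those of a package that uses the t distribution.

```python
import math
import numpy as np

rng = np.random.default_rng(1)
n = 5000
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)                        # true coefficient on x2 is 0
y = 1.0 + 2.0 * x1 + rng.normal(size=n)

X = np.column_stack([np.ones(n), x1, x2])      # the column of 1s gives the intercept
XtX_inv = np.linalg.inv(X.T @ X)
coef = XtX_inv @ X.T @ y                       # least-squares estimates b_0, b_1, b_2

residuals = y - X @ coef
dof = n - X.shape[1]                           # residual degrees of freedom
sigma2 = residuals @ residuals / dof           # estimated noise variance
std_err = np.sqrt(sigma2 * np.diag(XtX_inv))   # standard error of each coefficient

t_stats = coef / std_err
# two-sided p-values via a normal approximation to the t distribution (assumes large n)
p_values = np.array([math.erfc(abs(t) / math.sqrt(2)) for t in t_stats])

# approximate 95% confidence intervals
ci_low, ci_high = coef - 1.96 * std_err, coef + 1.96 * std_err

for name, b, t, p in zip(["const", "x1", "x2"], coef, t_stats, p_values):
    print(f"{name:>5}: coef={b:.3f}  t={t:.2f}  p={p:.3f}")
```

A small p-value for a coefficient is evidence against $H_0: \beta_i = 0$; here that should be the case for `x1`, whose true coefficient is 2.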


In [13]:
# Hypothesis test on regression coefficients - which coefficients are statistically significantly different from zero? 
# (and confidence interval)

import statsmodels.api as sm

# add a constant value of 1 to our data
X_train_with_constant = sm.add_constant(X_train) 

# fit the linear regression model using the OLS function
sm_linear_model = sm.OLS(y_train, X_train_with_constant).fit()

# get information on the regression coefficients found
print(sm_linear_model.summary())


                            OLS Regression Results                            
Dep. Variable:          person_income   R-squared:                       0.092
Model:                            OLS   Adj. R-squared:                  0.092
Method:                 Least Squares   F-statistic:                     1144.
Date:                Thu, 21 Nov 2024   Prob (F-statistic):               0.00
Time:                        16:17:21   Log-Likelihood:            -4.2624e+05
No. Observations:               33750   AIC:                         8.525e+05
Df Residuals:                   33746   BIC:                         8.525e+05
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
                     coef    std err          t      P>|t|      [0.025      0.975]
----------------------------------------------------------------------------------
const           4.088e+04   5118.944      7.

- Would you drop `credit_score`? Why?
- Answer: 

# 2. Linear Regression with categorical features. 

In [14]:
# Let us recall our dataset. 

loans.tail(3)

Unnamed: 0,person_age,person_gender,person_education,person_income,person_emp_exp,person_home_ownership,loan_amnt,loan_intent,loan_int_rate,cb_person_cred_hist_length,credit_score,previous_loan_defaults_on_file,loan_status
44997,33.0,male,Associate,56942.0,7,RENT,2771.0,DEBTCONSOLIDATION,10.02,10.0,668,No,1
44998,29.0,male,Bachelor,33164.0,4,RENT,12000.0,EDUCATION,13.23,6.0,604,No,1
44999,24.0,male,High School,51609.0,1,RENT,6665.0,DEBTCONSOLIDATION,17.05,3.0,628,No,1


In [15]:
# Let us code Male = 0 and Female = 1. 

loans["gender"] = 1
gender_bool = loans["person_gender"] == "male"
loans.loc[gender_bool, 'gender'] = 0 
loans.head()




Unnamed: 0,person_age,person_gender,person_education,person_income,person_emp_exp,person_home_ownership,loan_amnt,loan_intent,loan_int_rate,cb_person_cred_hist_length,credit_score,previous_loan_defaults_on_file,loan_status,gender
0,22.0,female,Master,71948.0,0,RENT,35000.0,PERSONAL,16.02,3.0,561,No,1,1
1,21.0,female,High School,12282.0,0,OWN,1000.0,EDUCATION,11.14,2.0,504,Yes,0,1
2,25.0,female,High School,12438.0,3,MORTGAGE,5500.0,MEDICAL,12.87,3.0,635,No,1,1
3,23.0,female,Bachelor,79753.0,0,RENT,35000.0,MEDICAL,15.23,2.0,675,No,1,1
4,24.0,male,Master,66135.0,1,RENT,35000.0,MEDICAL,14.27,4.0,586,No,1,0
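
The 0/1 coding can also be written in a single step with a boolean comparison, and a categorical column with more than two levels can be expanded into indicator columns with `pd.get_dummies()`. The sketch below uses a tiny made-up frame that reuses two column names from `loans`; it assumes `person_gender` contains only the values `"male"` and `"female"`.

```python
import pandas as pd

# Tiny made-up frame reusing two column names from `loans` (values are illustrative)
toy = pd.DataFrame({
    "person_gender": ["female", "male", "male", "female"],
    "person_home_ownership": ["RENT", "OWN", "MORTGAGE", "RENT"],
})

# Female = 1, male = 0, in a single step
toy["gender"] = (toy["person_gender"] == "female").astype(int)

# One indicator column per category; drop_first=True drops one level as the baseline
home_dummies = pd.get_dummies(toy["person_home_ownership"], prefix="home", drop_first=True)

print(toy["gender"].tolist())        # [1, 0, 0, 1]
print(list(home_dummies.columns))    # ['home_OWN', 'home_RENT']
```

Dropping one level keeps the indicator columns from summing to the constant column, which matters once an intercept is included in the regression.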


## Run linear regression with your choice of features, making sure to include `gender` as one of them

In [16]:
# get the features and the labels

X_features_2 = loans[[
                    'loan_amnt', 
                    'person_emp_exp',
                    'gender'
                   ]]

y_salary_2 = loans['person_income']




In [17]:
from sklearn.model_selection import train_test_split

# split data into a training and test set

X_train, X_test, y_train, y_test = train_test_split(X_features_2,  
                                                    y_salary_2, 
                                                    random_state = 0)

print(X_train.shape)
print(X_test.shape)

X_train.head(5)


(33750, 3)
(11250, 3)


Unnamed: 0,loan_amnt,person_emp_exp,gender
34761,15912.0,3,0
17827,1500.0,3,1
8937,8000.0,2,0
26508,14500.0,10,1
36846,6000.0,8,1


In [18]:
from sklearn.linear_model import LinearRegression

# create a new linear regression model
linear_model = LinearRegression()

In [19]:
# fit the model to our training data

linear_model.fit(X_train, y_train)

In [20]:
# Hypothesis test on regression coefficients - which coefficients are statistically significantly different from zero? 
# (and confidence interval)

import statsmodels.api as sm

# add a constant value of 1 to our data
X_train_with_constant = sm.add_constant(X_train) 

# fit the linear regression model using the OLS function
sm_linear_model = sm.OLS(y_train, X_train_with_constant).fit()

# get information on the regression coefficients found
print(sm_linear_model.summary())


                            OLS Regression Results                            
Dep. Variable:          person_income   R-squared:                       0.092
Model:                            OLS   Adj. R-squared:                  0.092
Method:                 Least Squares   F-statistic:                     1145.
Date:                Thu, 21 Nov 2024   Prob (F-statistic):               0.00
Time:                        16:17:57   Log-Likelihood:            -4.2624e+05
No. Observations:               33750   AIC:                         8.525e+05
Df Residuals:                   33746   BIC:                         8.525e+05
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
                     coef    std err          t      P>|t|      [0.025      0.975]
----------------------------------------------------------------------------------
const           4.038e+04    887.554     45.

- The `gender` coefficient is $-1390.6578$. What does this mean?
- Answer: 

- What is the p-value associated with the `gender` coefficient, and how should we interpret it?
- Answer: 

# 3. Open Ended: Build a predictive model to predict salaries using the dataset `loans` from above. 