# THIS FILE IS IN THE HANDOUTS FOLDER. COPY IT INTO YOUR CLASS NOTES

- [**Read the chapter on the website!**](https://ledatascifi.github.io/ledatascifi-2024/content/05/02_reg.html) It contains a lot of extra information we won't cover in class extensively.
- After reading that, I recommend [this webpage as a complementary place to get additional intuition.](https://aeturrell.github.io/coding-for-economists/econmt-regression.html)

## ASAP

[Declare your team and project interests in the project sheet](https://docs.google.com/spreadsheets/d/1SMetWKgI3JdhFdBwihDgIY3BDubvihkJCWbgLcCe4fs/edit?usp=sharing)

TODO UPDATE LINK

# Today: Regression

We start our machine learning applications with regression for a few simple reasons:
- Regression is a fundamental method for estimating how a variable ("y") relates to each of many ("X") variables, holding the others fixed. 
- But the coefficients obtained can also be used to generate predictions. 
- _Note: The focus in this section is on the RELATIONSHIP paradigm_
- Many issues that confront researchers have well understood solutions when regression is the model being used. 
- Regression coefficients are easy to interpret.
- https://twitter.com/seanjtaylor/status/1550326602105466880


  
## Objectives

1. You can fit a regression with `statsmodels` or `sklearn`
    - statsmodels: Nicer result tables, usually easier to specify the regression model
    - sklearn: Easier to use within a prediction/ML exercise
2. You can view your model's results visually or numerically with either method
3. The focus today is on the _mechanics_ of running regressions, viewing the output, and using the estimation's output objects.

![](https://media.giphy.com/media/yoJC2K6rCzwNY2EngA/giphy.gif)


In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
import statsmodels.api as sm
from sklearn.linear_model import LinearRegression
from statsmodels.formula.api import ols as sm_ols
import matplotlib.pyplot as plt


## Data

First, we load the data. 

**This is a new dataset, so we should do some data exploration!** Things students should try:
- `describe()`: look for any impossible values
- `value_counts()` on any categorical variables
- didn't we have a community function to start the EDA?
- correlation heat map
- look for outliers for all variables, and within pairplots
- print out and explore many sections of the data manually (in Excel or Spyder) to get familiar and check for data consistency issues


In [2]:
url = 'https://github.com/LeDataSciFi/data/raw/main/Fannie%20Mae/Fannie_Mae_Plus_Data.gzip?raw=true'
fannie_mae = pd.read_csv(url,compression='gzip') 

## Task 1

Spend 5 minutes exploring the data and jot down what you learn about the data. 

- This dataset is about loans from Fannie Mae
   - Unit of observation: loan
   - Key is probably Loan_ID (it's not!)
   - 135k rows
   - 36 vars
- Vars
   - what you learn from .info(), .describe()
   - Debt to income ratio: In whole numbers (0-100). Monthly, for the borrower at origination
   - Coborrowinfo missing (a lot?)
   - Which vars are categorical?
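
A minimal sketch of these first checks, run on a tiny made-up frame so it stands alone. The column names here (`loan_id`, `channel`, `dti`, `rate`) are hypothetical stand-ins, not the Fannie Mae column names; in class, swap in `fannie_mae` and its columns.

```python
import numpy as np
import pandas as pd

# hypothetical toy data standing in for the loan file
toy = pd.DataFrame({
    'loan_id': [101, 102, 102, 104, 105],      # 102 appears twice on purpose
    'channel': ['B', 'C', 'C', 'R', None],     # categorical, one missing
    'dti':     [22.0, 26.0, 26.0, np.nan, 41.0],
    'rate':    [6.875, 5.875, 5.875, 6.0, 6.375],
})

print(toy.describe())                             # ranges: any impossible values?
print(toy['channel'].value_counts(dropna=False))  # levels of a categorical variable

n_dupe_ids = toy['loan_id'].duplicated().sum()    # is the candidate key unique?
missing_share = toy.isna().mean()                 # fraction missing, per column
corr = toy.corr(numeric_only=True)                # input for a correlation heat map

print(n_dupe_ids)
print(missing_share)
```

If you want the heat map, `corr` is the kind of table you would hand to `sns.heatmap`.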



In [11]:
# Loan_Identifier is not unique
len(fannie_mae) - len(fannie_mae['Loan_Identifier'].drop_duplicates())

9675

In [6]:
slice = fannie_mae.head(300) # abcd via var inspector
fannie_mae.head(20) # just look at whats here (see less)

Unnamed: 0,Loan_Identifier,Origination_Channel,Seller_Name,Original_Interest_Rate,Original_UPB,Original_Loan_Term,Original_LTV_(OLTV),Original_Combined_LTV_(CLTV),Number_of_Borrowers,Original_Debt_to_Income_Ratio,...,Qdate,rGDP,TCMR,POILWTIUSDM,TTLCONS,DEXUSEU,BOPGSTB,GOLDAMGBD228NLBM,CSUSHPISA,MSPUS
0,973373000000.0,B,OTHER,6.875,32000.0,360.0,90.0,90.0,1.0,22.0,...,2007-01-01,0.9,4.722632,59.257,1138752.0,1.308021,-58478.0,665.1025,184.601,257400.0
1,927620000000.0,B,"PNC BANK, N.A.",5.875,200000.0,360.0,80.0,80.0,2.0,26.0,...,2007-01-01,0.9,4.722632,59.257,1138752.0,1.308021,-58478.0,665.1025,184.601,257400.0
2,717667000000.0,B,OTHER,6.25,122000.0,180.0,80.0,80.0,2.0,31.0,...,2007-01-01,0.9,4.722632,59.257,1138752.0,1.308021,-58478.0,665.1025,184.601,257400.0
3,988951000000.0,C,AMTRUST BANK,6.0,67000.0,180.0,77.0,77.0,2.0,17.0,...,2007-01-01,0.9,4.722632,59.257,1138752.0,1.308021,-58478.0,665.1025,184.601,257400.0
4,190885000000.0,R,OTHER,5.875,50000.0,180.0,41.0,41.0,2.0,10.0,...,2007-01-01,0.9,4.722632,59.257,1138752.0,1.308021,-58478.0,665.1025,184.601,257400.0
5,753371000000.0,R,OTHER,6.375,160000.0,360.0,95.0,95.0,3.0,28.0,...,2007-01-01,0.9,4.722632,59.257,1138752.0,1.308021,-58478.0,665.1025,184.601,257400.0
6,703811000000.0,C,"CITIMORTGAGE, INC.",5.875,176000.0,180.0,80.0,90.0,2.0,22.0,...,2007-01-01,0.9,4.722632,59.257,1138752.0,1.308021,-58478.0,665.1025,184.601,257400.0
7,794714000000.0,B,FIRST TENNESSEE BANK NATIONAL ASSOCIATION,7.0,294000.0,360.0,80.0,80.0,2.0,28.0,...,2007-01-01,0.9,4.722632,59.257,1138752.0,1.308021,-58478.0,665.1025,184.601,257400.0
8,578907000000.0,C,"BANK OF AMERICA, N.A.",6.375,128000.0,360.0,66.0,66.0,1.0,41.0,...,2007-01-01,0.9,4.722632,59.257,1138752.0,1.308021,-58478.0,665.1025,184.601,257400.0
9,826444000000.0,C,"BANK OF AMERICA, N.A.",6.375,115000.0,360.0,35.0,35.0,2.0,20.0,...,2007-01-01,0.9,4.722632,59.257,1138752.0,1.308021,-58478.0,665.1025,184.601,257400.0


In [8]:
fannie_mae.shape

(135038, 36)

In [7]:
fannie_mae.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 135038 entries, 0 to 135037
Data columns (total 36 columns):
 #   Column                                   Non-Null Count   Dtype  
---  ------                                   --------------   -----  
 0   Loan_Identifier                          135038 non-null  float64
 1   Origination_Channel                      135038 non-null  object 
 2   Seller_Name                              135038 non-null  object 
 3   Original_Interest_Rate                   135038 non-null  float64
 4   Original_UPB                             135038 non-null  float64
 5   Original_Loan_Term                       135038 non-null  float64
 6   Original_LTV_(OLTV)                      135038 non-null  float64
 7   Original_Combined_LTV_(CLTV)             134007 non-null  float64
 8   Number_of_Borrowers                      135007 non-null  float64
 9   Original_Debt_to_Income_Ratio            132396 non-null  float64
 10  Borrower_Credit_Score_at_Origina

In [4]:
fannie_mae['Original_Debt_to_Income_Ratio'].describe()

count    132396.000000
mean         33.298733
std          11.508698
min           1.000000
25%          25.000000
50%          33.000000
75%          42.000000
max          64.000000
Name: Original_Debt_to_Income_Ratio, dtype: float64

## Clean the data and create variables we will use

These variables are pretty straightforward:

In [12]:
fannie_mae = (fannie_mae
                  # create variables
                  .assign(l_credscore = np.log(fannie_mae['Borrower_Credit_Score_at_Origination']),
                          l_LTV = np.log(fannie_mae['Original_LTV_(OLTV)']),
                          Origination_Date = lambda x: pd.to_datetime(x['Origination_Date']),
                          Origination_Year = lambda x: x['Origination_Date'].dt.year,
                          const = 1,
                          great = fannie_mae['Borrower_Credit_Score_at_Origination'] >= 800
                         )
              
             )

Credit rating is a number between 0 and 850. But in some analyses, it might make sense to have categories of credit ratings (e.g. bad to good). I borrowed [these cutoffs from Experian.](https://www.experian.com/blogs/ask-experian/infographic-what-are-the-different-scoring-ranges/)

In [13]:
# create a categorical bin var with "pd.cut()"

fannie_mae['creditbins']= pd.cut(fannie_mae['Borrower_Credit_Score_at_Origination'],
                                 [0,579,669,739,799,850],
                                 labels=['Very Poor','Fair','Good','Very Good','Exceptional'])

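Here is the variable that was created. I notice that 669 (right on the threshold of a bin) goes into the "Fair" bin instead of "Good". That is because `pd.cut` closes each bin on the right by default (`right=True`), so the bins are (0, 579], (579, 669], and so on.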

In [14]:
fannie_mae.loc[:5,['Borrower_Credit_Score_at_Origination','creditbins']]

Unnamed: 0,Borrower_Credit_Score_at_Origination,creditbins
0,669.0,Fair
1,693.0,Good
2,741.0,Very Good
3,804.0,Exceptional
4,658.0,Fair
5,665.0,Fair
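
To see why 669 lands in "Fair", here is a small sketch comparing the default right-closed bins to left-closed bins (`right=False`). The scores are made up to sit on and next to the cutoffs; the edges and labels are the ones used above.

```python
import pandas as pd

scores = pd.Series([579, 580, 669, 670, 800])   # made-up scores near the cutoffs
edges = [0, 579, 669, 739, 799, 850]
labels = ['Very Poor', 'Fair', 'Good', 'Very Good', 'Exceptional']

# default (right=True): bins are (0, 579], (579, 669], ...
right_closed = pd.cut(scores, edges, labels=labels)

# right=False: bins are [0, 579), [579, 669), ...
left_closed = pd.cut(scores, edges, labels=labels, right=False)

print(pd.DataFrame({'score': scores,
                    'right_closed': right_closed,
                    'left_closed': left_closed}))
```

With `right=False`, a score sitting exactly on an edge moves up one bin (669 becomes "Good"), but a score of exactly 850 would fall outside the last bin and become NaN, so the top edge would need adjusting.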


In [16]:
# pd.cut took the credit score, a number between 0 and 850,
# and changed it to bins. I labeled the bins explicitly

fannie_mae['creditbins'].value_counts(dropna=False) / len(fannie_mae)

creditbins
Very Good      0.472867
Good           0.292799
Exceptional    0.117663
Fair           0.107822
Very Poor      0.004725
NaN            0.004125
Name: count, dtype: float64

## Exercises with statsmodels

- **For all problems: y is the interest rate of the loan**
- I recommend the _statsmodels formula_ method on the website

Pseudocode for using statsmodels to run a regression:
```python
model = sm_ols(<formula>, data=<dataframe>)
result=model.fit()

# to print regression output: result.summary()
# get predicted values (yhat): result.predict()
# get regression residuals (uhat): result.resid
```
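
Here is the pseudocode above filled in on made-up data, as a sketch you can run without the loan file. The frame `toy` and its columns `x` and `y` are hypothetical; the true intercept (2) and slope (3) are baked into the simulation so you can check the estimates against them.

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols as sm_ols

# simulate y = 2 + 3*x + noise
rng = np.random.default_rng(0)
toy = pd.DataFrame({'x': rng.normal(size=500)})
toy['y'] = 2 + 3 * toy['x'] + rng.normal(scale=0.5, size=500)

model = sm_ols('y ~ x', data=toy)   # set up the model
result = model.fit()                # estimate it

print(result.params)      # coefficients, indexed by name
yhat = result.predict()   # predicted values for the estimation sample
uhat = result.resid       # residuals: y minus yhat
```

Calling `result.summary()` on this object gives the same style of table as the cells below.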

### Q1: Starter regressions

A. Regress y on the credit score (student demo): $y=\beta_0 + \beta_1*\text{Credit Score}$
- _I'll show 2 ways: the pseudocode and the one-liner_

B. Regress y on the **natural log** of the credit score: $y=\beta_0 + \beta_1*log(\text{Credit Score})$
- _I'll show two ways to do this_

C. Regress y on the **natural log** of the loan-to-value

D. Regress y on the natural log of the loan-to-value and the natural log of the credit score: $y=\beta_0 + \beta_1*log(\text{LTV}) + \beta_2*log(\text{Credit Score})$

In [18]:
# 1a: formula is a string : 'y ~ x1'   means y = a+b*x1 + error

model = sm_ols('Original_Interest_Rate ~ Borrower_Credit_Score_at_Origination', data=fannie_mae)
result=model.fit()
result.summary()

# the one line version - runs and prints output, nothing saved 
# will use this today
sm_ols('Original_Interest_Rate ~ Borrower_Credit_Score_at_Origination', 
       data=fannie_mae).fit().summary()


0,1,2,3
Dep. Variable:,Original_Interest_Rate,R-squared:,0.126
Model:,OLS,Adj. R-squared:,0.126
Method:,Least Squares,F-statistic:,19380.0
Date:,"Wed, 20 Mar 2024",Prob (F-statistic):,0.0
Time:,12:53:05,Log-Likelihood:,-215750.0
No. Observations:,134481,AIC:,431500.0
Df Residuals:,134479,BIC:,431500.0
Df Model:,1,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,11.5819,0.046,253.270,0.000,11.492,11.671
Borrower_Credit_Score_at_Origination,-0.0086,6.14e-05,-139.198,0.000,-0.009,-0.008

0,1,2,3
Omnibus:,2660.479,Durbin-Watson:,0.397
Prob(Omnibus):,0.0,Jarque-Bera (JB):,2660.737
Skew:,0.321,Prob(JB):,0.0
Kurtosis:,2.75,Cond. No.,10400.0


In [20]:
#1b

# option 1: create the variable and use it
sm_ols('Original_Interest_Rate ~ l_credscore', 
       data=fannie_mae).fit().summary()

# option 2: formula string can do some math! (py and numpy)
sm_ols('Original_Interest_Rate ~ np.log(Borrower_Credit_Score_at_Origination)', 
       data=fannie_mae).fit().summary()

0,1,2,3
Dep. Variable:,Original_Interest_Rate,R-squared:,0.124
Model:,OLS,Adj. R-squared:,0.124
Method:,Least Squares,F-statistic:,19060.0
Date:,"Wed, 20 Mar 2024",Prob (F-statistic):,0.0
Time:,12:56:30,Log-Likelihood:,-215890.0
No. Observations:,134481,AIC:,431800.0
Df Residuals:,134479,BIC:,431800.0
Df Model:,1,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,45.3715,0.291,156.057,0.000,44.802,45.941
np.log(Borrower_Credit_Score_at_Origination),-6.0750,0.044,-138.067,0.000,-6.161,-5.989

0,1,2,3
Omnibus:,2741.277,Durbin-Watson:,0.394
Prob(Omnibus):,0.0,Jarque-Bera (JB):,2737.156
Skew:,0.325,Prob(JB):,0.0
Kurtosis:,2.744,Cond. No.,598.0
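
A quick worked reading of the two slopes so far, as a sketch. The coefficients are copied from the Q1a and Q1b tables; the "100 points" and "1%" changes are just illustrative choices, and these are associations in this sample, not causal effects.

```python
import numpy as np

b_level = -0.0086   # coefficient on the credit score (Q1a)
b_log = -6.0750     # coefficient on log(credit score) (Q1b)

# Q1a: predicted change in the rate for a 100-point higher score
change_100_points = b_level * 100

# Q1b: predicted change in the rate for a 1% higher score,
# because log(1.01 * score) - log(score) = log(1.01)
change_1_percent = b_log * np.log(1.01)

print(round(change_100_points, 2), round(change_1_percent, 3))
```

Both numbers are in the units of the interest rate variable, so a 100-point higher score goes with a rate about 0.86 lower, and a 1% higher score goes with a rate about 0.06 lower.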


In [23]:
#1c

# option 1: create the variable and use it
sm_ols('Original_Interest_Rate ~ l_LTV', 
       data=fannie_mae).fit().summary()

# option 2: formula string can do some math! (py and numpy)
# if the variable name is "bad" (spaces, weird characters),
# wrap it in Q("<varname>") 
sm_ols('Original_Interest_Rate ~ np.log(Q("Original_LTV_(OLTV)"))', 
       data=fannie_mae).fit().summary()

0,1,2,3
Dep. Variable:,Original_Interest_Rate,R-squared:,0.007
Model:,OLS,Adj. R-squared:,0.007
Method:,Least Squares,F-statistic:,1010.0
Date:,"Wed, 20 Mar 2024",Prob (F-statistic):,8.409999999999999e-221
Time:,12:59:08,Log-Likelihood:,-225480.0
No. Observations:,135038,AIC:,451000.0
Df Residuals:,135036,BIC:,451000.0
Df Model:,1,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,3.7603,0.047,80.622,0.000,3.669,3.852
"np.log(Q(""Original_LTV_(OLTV)""))",0.3513,0.011,31.779,0.000,0.330,0.373

0,1,2,3
Omnibus:,4889.29,Durbin-Watson:,0.214
Prob(Omnibus):,0.0,Jarque-Bera (JB):,3115.913
Skew:,0.245,Prob(JB):,0.0
Kurtosis:,2.439,Cond. No.,59.4
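
A small check, on made-up data, that the two options really are the same regression. The column name `LTV_(pct)` is a hypothetical "bad" name chosen to need `Q()`; everything else mirrors the cell above.

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols as sm_ols

rng = np.random.default_rng(1)
toy = pd.DataFrame({'LTV_(pct)': rng.uniform(20, 97, size=300)})
toy['rate'] = 4 + 0.35 * np.log(toy['LTV_(pct)']) + rng.normal(scale=0.3, size=300)

# option 1: build the logged variable first, then use its (clean) name
toy['l_ltv'] = np.log(toy['LTV_(pct)'])
fit_precomputed = sm_ols('rate ~ l_ltv', data=toy).fit()

# option 2: do the math inside the formula; Q() protects the awkward name
fit_in_formula = sm_ols('rate ~ np.log(Q("LTV_(pct)"))', data=toy).fit()

same = np.allclose(fit_precomputed.params.values, fit_in_formula.params.values)
print(same)
```

The only difference is the label on the coefficient in the output table, which is why option 1 can be nicer when the name is long.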


In [None]:
# D. Regress y on the natural log of the loan-to-value and the natural log of the credit score:
# $y=\beta_0 + \beta_1*log(\text{LTV}) + \beta_2*log(\text{Credit Score})$

In [25]:

sm_ols('Original_Interest_Rate ~ l_LTV +  l_credscore', 
       data=fannie_mae).fit().summary()


0,1,2,3
Dep. Variable:,Original_Interest_Rate,R-squared:,0.126
Model:,OLS,Adj. R-squared:,0.126
Method:,Least Squares,F-statistic:,9656.0
Date:,"Wed, 20 Mar 2024",Prob (F-statistic):,0.0
Time:,13:01:19,Log-Likelihood:,-215780.0
No. Observations:,134481,AIC:,431600.0
Df Residuals:,134478,BIC:,431600.0
Df Model:,2,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,44.1324,0.302,145.949,0.000,43.540,44.725
l_LTV,0.1546,0.010,14.765,0.000,0.134,0.175
l_credscore,-5.9859,0.044,-134.888,0.000,-6.073,-5.899

0,1,2,3
Omnibus:,2793.369,Durbin-Watson:,0.386
Prob(Omnibus):,0.0,Jarque-Bera (JB):,2743.99
Skew:,0.321,Prob(JB):,0.0
Kurtosis:,2.72,Cond. No.,735.0


### Q2: Best practices: Look at the outputs every time

Let's talk about the outputs you see and should look at EVERY time you run a regression:
- Number of obs
- R-squared 
- Adjusted R-squared
- Coef 
- Std error, t value, p value ("P>|t|")
- Std error options:
    - `.fit(cov_type="HC2")`
    - `.fit(cov_type="cluster", cov_kwds={"groups": df["industry"]})`
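
A sketch of what those standard error options change, on simulated data. The frame `toy`, its `group` column, and the heteroskedastic noise are all made up for illustration; the `cov_type` and `cov_kwds` arguments are the ones listed above.

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols as sm_ols

rng = np.random.default_rng(2)
n = 400
toy = pd.DataFrame({'x': rng.normal(size=n),
                    'group': rng.integers(0, 20, size=n)})   # hypothetical cluster id
# noise gets larger when |x| is larger (heteroskedasticity)
toy['y'] = 1 + 2 * toy['x'] + rng.normal(size=n) * (1 + toy['x'].abs())

model = sm_ols('y ~ x', data=toy)
plain = model.fit()                                   # default: nonrobust
robust = model.fit(cov_type='HC2')                    # heteroskedasticity-robust
clustered = model.fit(cov_type='cluster',
                      cov_kwds={'groups': toy['group']})

# same coefficients every time; only the standard errors move
print(pd.DataFrame({'nonrobust': plain.bse,
                    'HC2': robust.bse,
                    'cluster': clustered.bse}))
```

The point to take away: the choice of `cov_type` never changes the coefficients, only the standard errors (and so the t and p values).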

### Q3: Regressions with transformations

We are talking about "linear" regression. "What that means is that the model is linear in the regressors: but it doesn’t mean that those regressors can't be some kind of non-linear transform of the original features $x_i$." The most common transformations are logging variables, interaction terms, and polynomial terms.

We already did log transformations above. 

An interaction term simply means one regressor is two variables multiplied:
- $y=\beta_0 + \beta_1 x_1 + \beta_2 x_1 x_2$
- $y=\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2 $

Polynomial terms might look like:
- $y=\beta_0 + \beta_1 x_1 + \beta_2 x_1^2$

A. Regress y on the credit score and the credit score squared. 

B. Regress y on the natural log of the loan-to-value, the natural log of the credit score, and the interaction of LTV and credit score. 



In [34]:
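# 3a (these cells use the log credit score, l_credscore, as the regressor)

# 3 equivalent ways; only the last summary in the cell is displayed

# way 1: numpy's power function inside the formula
sm_ols('Original_Interest_Rate ~ l_credscore +  np.power(l_credscore,2)', 
       data=fannie_mae).fit().summary()

# way 2: python's built-in pow() inside the formula
sm_ols('Original_Interest_Rate ~ l_credscore +  pow(l_credscore,2)', 
       data=fannie_mae).fit().summary()

# way 3: create x2 manually
sm_ols('Original_Interest_Rate ~ l_credscore +  l_cred2', 
       data=fannie_mae.assign(l_cred2 = fannie_mae['l_credscore']* fannie_mae['l_credscore'])
      ).fit().summary()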


0,1,2,3
Dep. Variable:,Original_Interest_Rate,R-squared:,0.127
Model:,OLS,Adj. R-squared:,0.127
Method:,Least Squares,F-statistic:,9824.0
Date:,"Wed, 20 Mar 2024",Prob (F-statistic):,0.0
Time:,13:11:18,Log-Likelihood:,-215630.0
No. Observations:,134481,AIC:,431300.0
Df Residuals:,134478,BIC:,431300.0
Df Model:,2,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,-372.2506,18.442,-20.185,0.000,-408.396,-336.106
l_credscore,121.0978,5.615,21.566,0.000,110.092,132.103
l_cred2,-9.6800,0.427,-22.649,0.000,-10.518,-8.842

0,1,2,3
Omnibus:,2476.597,Durbin-Watson:,0.398
Prob(Omnibus):,0.0,Jarque-Bera (JB):,2473.711
Skew:,0.309,Prob(JB):,0.0
Kurtosis:,2.758,Cond. No.,260000.0


In [38]:
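# 3b: 3 equivalent ways to add an interaction

# "a*b" expands to a + b + a:b, so repeating the main effects is harmless
sm_ols('Original_Interest_Rate ~ l_credscore +  l_LTV + l_credscore*l_LTV', 
       data=fannie_mae).fit().summary()

# "a:b" is ONLY the interaction term, so list the main effects yourself
sm_ols('Original_Interest_Rate ~ l_credscore +  l_LTV + l_credscore:l_LTV', 
       data=fannie_mae).fit().summary()

# shortest: let "*" add both main effects and the interaction
sm_ols('Original_Interest_Rate ~ l_credscore*l_LTV', 
       data=fannie_mae).fit().summary()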

0,1,2,3
Dep. Variable:,Original_Interest_Rate,R-squared:,0.127
Model:,OLS,Adj. R-squared:,0.127
Method:,Least Squares,F-statistic:,6521.0
Date:,"Wed, 20 Mar 2024",Prob (F-statistic):,0.0
Time:,13:13:14,Log-Likelihood:,-215670.0
No. Observations:,134481,AIC:,431300.0
Df Residuals:,134477,BIC:,431400.0
Df Model:,3,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,-16.8119,4.111,-4.090,0.000,-24.869,-8.755
l_credscore,3.2155,0.621,5.182,0.000,1.999,4.432
l_LTV,14.6120,0.973,15.024,0.000,12.706,16.518
l_credscore:l_LTV,-2.1830,0.147,-14.866,0.000,-2.471,-1.895

0,1,2,3
Omnibus:,2756.628,Durbin-Watson:,0.389
Prob(Omnibus):,0.0,Jarque-Bera (JB):,2719.875
Skew:,0.321,Prob(JB):,0.0
Kurtosis:,2.727,Cond. No.,37700.0
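
If you want to confirm what a formula expands to without reading a whole summary table, one option is to look at the model's regressor names before fitting. This is a sketch on a made-up frame with hypothetical columns `a`, `b`, and `y`.

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols as sm_ols

rng = np.random.default_rng(3)
toy = pd.DataFrame({'a': rng.normal(size=50), 'b': rng.normal(size=50)})
toy['y'] = rng.normal(size=50)

star = sm_ols('y ~ a*b', data=toy).exog_names            # main effects + interaction
spelled = sm_ols('y ~ a + b + a:b', data=toy).exog_names # same thing, written out
colon_only = sm_ols('y ~ a + a:b', data=toy).exog_names  # no main effect for b

print(star)
print(spelled)
print(colon_only)
```

The first two lists match, and the third is missing `b`, which is the first interaction equation in the Q3 text (one main effect plus the product term).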


### Q4: Dummy and categorical variables

A. Regress y on the dummy variable for a great credit score.

B. Regress y on the categorical variable we created for credit bins.

C. (Advanced, optional, after class exercise): High dimensional fixed effects. This basically means "a categorical variable with LOTS of values". [See this discussion.](https://aeturrell.github.io/coding-for-economists/econmt-regression.html#high-dimensional-fixed-effects-aka-absorbing-regression)

In [39]:
fannie_mae['great'].value_counts()

great
False    119149
True      15889
Name: count, dtype: int64

In [40]:
sm_ols('Original_Interest_Rate ~ great', 
       data=fannie_mae).fit().summary()


0,1,2,3
Dep. Variable:,Original_Interest_Rate,R-squared:,0.05
Model:,OLS,Adj. R-squared:,0.05
Method:,Least Squares,F-statistic:,7048.0
Date:,"Wed, 20 Mar 2024",Prob (F-statistic):,0.0
Time:,13:13:55,Log-Likelihood:,-222550.0
No. Observations:,135038,AIC:,445100.0
Df Residuals:,135036,BIC:,445100.0
Df Model:,1,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,5.3433,0.004,1466.712,0.000,5.336,5.350
great[T.True],-0.8916,0.011,-83.951,0.000,-0.912,-0.871

0,1,2,3
Omnibus:,2948.608,Durbin-Watson:,0.305
Prob(Omnibus):,0.0,Jarque-Bera (JB):,2309.413
Skew:,0.239,Prob(JB):,0.0
Kurtosis:,2.572,Cond. No.,3.15
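
With a single dummy regressor, OLS is just comparing group means: the intercept is the average y when the dummy is False, and the dummy's coefficient is the difference in averages. A sketch on simulated data (the frame `toy` and its numbers are made up; the column name `great` copies the one above):

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols as sm_ols

rng = np.random.default_rng(4)
toy = pd.DataFrame({'great': rng.random(1000) < 0.15})   # True for about 15% of rows
toy['rate'] = 5.3 - 0.9 * toy['great'] + rng.normal(scale=0.5, size=1000)

result = sm_ols('rate ~ great', data=toy).fit()
means = toy.groupby('great')['rate'].mean()

print(result.params)
print(means)
print(means.loc[True] - means.loc[False])   # matches the great[T.True] coefficient
```

Read the real table the same way: 5.3433 is the average rate for loans without a great score, and loans with a great score average 0.8916 lower.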


In [41]:
fannie_mae['creditbins'].value_counts()

creditbins
Very Good      63855
Good           39539
Exceptional    15889
Fair           14560
Very Poor        638
Name: count, dtype: int64

In [44]:
# it works but not recommended
sm_ols('Original_Interest_Rate ~ creditbins', 
       data=fannie_mae).fit().summary()

# numerical vars are treated as numerical UNLESS you tell it they are categorical with C()

sm_ols('Original_Interest_Rate ~ C(Number_of_units)', 
       data=fannie_mae).fit().summary()


0,1,2,3
Dep. Variable:,Original_Interest_Rate,R-squared:,0.003
Model:,OLS,Adj. R-squared:,0.003
Method:,Least Squares,F-statistic:,145.4
Date:,"Wed, 20 Mar 2024",Prob (F-statistic):,4.2299999999999997e-94
Time:,13:17:13,Log-Likelihood:,-225770.0
No. Observations:,135038,AIC:,451500.0
Df Residuals:,135034,BIC:,451600.0
Df Model:,3,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,5.2267,0.004,1472.492,0.000,5.220,5.234
C(Number_of_units)[T.2.0],0.4922,0.026,18.947,0.000,0.441,0.543
C(Number_of_units)[T.3.0],0.3016,0.059,5.106,0.000,0.186,0.417
C(Number_of_units)[T.4.0],0.4656,0.063,7.432,0.000,0.343,0.588

0,1,2,3
Omnibus:,4314.209,Durbin-Watson:,0.227
Prob(Omnibus):,0.0,Jarque-Bera (JB):,3025.793
Skew:,0.26,Prob(JB):,0.0
Kurtosis:,2.483,Cond. No.,17.9
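
To see what `C()` changes, here is a sketch that runs the same made-up variable both ways. The column `units` and the simulated rates are hypothetical; the point is only how many coefficients each version estimates.

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols as sm_ols

rng = np.random.default_rng(5)
toy = pd.DataFrame({'units': np.tile([1.0, 2.0, 3.0, 4.0], 100)})
toy['rate'] = 5.2 + 0.4 * (toy['units'] > 1) + rng.normal(scale=0.5, size=400)

as_number = sm_ols('rate ~ units', data=toy).fit()       # one slope for units
as_category = sm_ols('rate ~ C(units)', data=toy).fit()  # one dummy per non-base level

print(list(as_number.params.index))
print(list(as_category.params.index))
```

Treated as a number, each extra unit is forced to have the same effect. Treated as a category, each level gets its own shift relative to the omitted base level (here 1.0), which is how the table above should be read.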


### Q5: Summarize what you've learned so far



### Q6: Plot the regression

_If time is tight: I'll do it._

Plot 1:
- Plot a scatterplot: Plot as X the credit score variable. As Y, use our y.
- On top of that, lineplots:
    - Rerun Q1a's reg and plot the yhat values. 
    - Let's talk about this.
    - Rerun Q1b's reg and plot the yhat values.
    - Compare to the prior line.
    
Plot 2:
- Plot a scatterplot: Plot as X the credit score variable. As Y, use our y.
- On top of that, lineplots:
    - Rerun Q4b's reg and plot the yhat values, hued by credit bin
  
Plot 3:
- Plot a scatterplot: Plot as X the credit score variable. As Y, use our y.
- On top of that, lineplots:
    - Rerun Q4b's reg BUT WITH credit score as a variable and plot the yhat values, hued by credit bin  
    
_Note: statsmodels has some useful plotting functions. My favs are influence_plot (can be slow) and plot_partregress_grid._
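
A rough sketch of the mechanics behind Plot 1, using simulated data so it runs on its own. The frame `toy`, its `score` and `rate` columns, and the simulated relationship are all made up; with the real data you would use `fannie_mae` and drop rows with missing credit scores first so that x and yhat line up.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.formula.api import ols as sm_ols

rng = np.random.default_rng(6)
# sort by x so the fitted lines are drawn left to right
toy = pd.DataFrame({'score': rng.uniform(550, 830, size=300)}).sort_values('score')
toy['rate'] = 45 - 6 * np.log(toy['score']) + rng.normal(scale=0.4, size=300)

yhat_level = sm_ols('rate ~ score', data=toy).fit().predict()       # like Q1a
yhat_log = sm_ols('rate ~ np.log(score)', data=toy).fit().predict() # like Q1b

fig, ax = plt.subplots()
ax.scatter(toy['score'], toy['rate'], s=8, alpha=0.4, label='loans (toy data)')
ax.plot(toy['score'], yhat_level, color='C1', label='yhat: rate ~ score')
ax.plot(toy['score'], yhat_log, color='C2', label='yhat: rate ~ log(score)')
ax.set_xlabel('credit score')
ax.set_ylabel('interest rate')
ax.legend()
```

The first line is straight by construction; the second curves because the model is linear in log(score), not in score.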

## Regression with SKLEARN

I don't like running regressions in `sklearn` usually. The main reason to do so is if you're doing a typical ML task that sklearn excels in (meaning: "pipelines", which is a term you'll understand later in the course) or if you know you're going to be using other sklearn models anyways (in which case, you'll already be doing the set up for sklearn).

But I want to run at least one regression in SKLEARN for you so you can see how the mechanics are similar, and how they differ. We will cover sklearn more in future classes.

Pseudocode for a reg in sklearn is similar. The differences:
1. A little more work setting up the data
1. `.fit()` gets the data passed to it 
1. The `results` object is different than statsmodels'

```python

# 1. import the "class" of model from sklearn

from sklearn.linear_model import LinearRegression

# 2. arrange the data - more work than statsmodels

# Issue: sklearn doesn't work with missing values, so drop any obs with missing values
# replace vars_in_your_reg with a list of variables you want to use, including y
subset = df[vars_in_your_reg].dropna()

# explicitly set up the y variable and the X variables you want
y = subset['y'] # whatever the y variable is
X = subset[['X1','X2']] # list the X vars

# 3. set up the model ("instantiate the model")
# every class of models has "hyperparameters" that control how you want the model to work
# below, fit_intercept=True is a "hyperparameter" for OLS models 
# hyperparameters are the things inside the parentheses of the model class when you declare it

model = LinearRegression(fit_intercept=True)
result = model.fit(X, y) # in sklearn, you put X and y inside fit!!!

# the result object is different in sklearn
# result.intercept_ (the constant in the model)
# result.coef_ (the coefficients on the X vars)

```
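
The same steps filled in on made-up data, as a sketch you can run without the loan file. The frame `toy` and its columns `X1`, `X2`, and `y` are hypothetical, and a few missing values are planted on purpose so the `dropna()` step has something to do.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# simulate y = 1 + 2*X1 - 0.5*X2 + noise
rng = np.random.default_rng(7)
toy = pd.DataFrame({'X1': rng.normal(size=200), 'X2': rng.normal(size=200)})
toy['y'] = 1.0 + 2.0 * toy['X1'] - 0.5 * toy['X2'] + rng.normal(scale=0.1, size=200)
toy.loc[:4, 'X2'] = np.nan   # plant 5 missing values

# arrange the data: keep only the vars in the reg, drop incomplete rows
subset = toy[['y', 'X1', 'X2']].dropna()
y = subset['y']
X = subset[['X1', 'X2']]

# instantiate, then fit with X and y
model = LinearRegression(fit_intercept=True)
result = model.fit(X, y)   # fit() returns the fitted model object itself

print(result.intercept_)   # the constant
print(result.coef_)        # one coefficient per column of X, in column order
yhat = result.predict(X)   # predicted values
```

Notice what you do not get: no summary table, standard errors, or p values. That is the main reason statsmodels is nicer when the goal is to study the relationship.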


## Q7: STUDENT DEMO - regressions **using sklearn**

A. Regress the interest rate on the natural log of the loan-to-value using the sklearn method.

B. Regress the interest rate on the natural log of the loan-to-value using the sklearn method.