# Static



By now you're familiar with the typical use of linear regression: you're given a dataframe with several features, each with a set of observations. You take one feature (your y) and try to predict it from the others (your x's).

This usually works, at least to some extent. This morning, though, we're going to set up a rather strange problem for linear regression: we will see what happens when y and the x's are totally unrelated.

Recall that regression follows the formula 
$$
y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \epsilon
$$ 
When we "fit" an OLS model, we give the model all of the x's and the y and ask it to find the best betas, meaning the ones that minimize the sum of squared residuals.
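To make "finding the best betas" concrete, here is a minimal sketch that is not part of the original exercise. It builds data from known betas and recovers them with NumPy's least-squares solver. The sample size, the chosen betas, and the noise scale are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed chosen only for reproducibility

n_obs = 500
true_betas = np.array([2.0, -1.0, 0.5])  # hypothetical beta_0, beta_1, beta_2
X = rng.normal(size=(n_obs, 2))           # two illustrative features
eps = rng.normal(scale=0.1, size=n_obs)   # the noise term epsilon

# y = beta_0 + beta_1 * x_1 + beta_2 * x_2 + epsilon
y = true_betas[0] + X @ true_betas[1:] + eps

# Prepend a column of ones so the intercept beta_0 is estimated too.
design = np.column_stack([np.ones(n_obs), X])
est_betas, *_ = np.linalg.lstsq(design, y, rcond=None)
print(est_betas)  # close to true_betas, up to noise
```

When the x's really do drive y, the estimates land near the betas that generated the data.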

This time, however, we're going to generate data in which all the true betas are zero
$$
\beta_0=\beta_1=\dots=0
$$
and then fit an OLS model.
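As a small illustration of what zero betas mean for the data itself (a sketch with assumed shapes and standard normal noise, not the exercise solution): plugging zeros into the formula leaves y equal to the noise term, no matter what the x's are.

```python
import numpy as np

rng = np.random.default_rng(1)  # seed chosen only for reproducibility

n_obs, n_features = 200, 20
X = rng.normal(size=(n_obs, n_features))  # features, generated independently
beta_0 = 0.0
betas = np.zeros(n_features)              # beta_1 = ... = beta_20 = 0
eps = rng.normal(size=n_obs)              # the noise term epsilon

# The regression formula with every beta set to zero.
y = beta_0 + X @ betas + eps

print(np.allclose(y, eps))  # True: y is nothing but the noise term
```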

Before we do this, stop and think for a second:

- What do you expect the model to do? What will the betas/r-squared/p-values that it finds look like?
- What do you think the model *should* do? Is that different from what you think it *will* do?

![](https://upload.wikimedia.org/wikipedia/commons/5/5a/No_Signal_23.JPG)

In [71]:
import numpy as np
import pandas as pd
import statsmodels.api as sm
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from collections import defaultdict

# Part 1

Generate simulation data. We want 200 points (observations) for the y feature and for each of 20 x features. In other words, make sure `y.shape == (200, 1)` and `x.shape == (200, 20)`. The x's should be generated randomly and independently of each other, and the y should be generated randomly and independently of the x's.

Use statsmodels to fit an OLS model to your data. Are the results as you expected? Do any betas have $p<0.05$? If not, re-run the model until one does.

In [51]:
target = np.random.randn(200, 1)

In [113]:
# 200 observations of 20 independent integer noise features, drawn in one call.
data = np.random.randint(500, size=(200, 20))

In [114]:
df = pd.DataFrame(data)

In [115]:
df_y = pd.DataFrame(target)
df_y.rename(columns={0:"target"}, inplace=True)

In [116]:
df = pd.concat([df, df_y], axis=1)

In [117]:
df

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,11,12,13,14,15,16,17,18,19,target
0,475,153,71,298,110,141,19,352,415,274,...,301,328,235,2,207,118,32,124,22,-2.126509
1,182,271,485,491,228,40,235,350,64,56,...,206,419,97,227,11,492,87,328,111,0.309988
2,101,167,253,386,1,169,41,471,89,242,...,361,192,134,329,404,95,131,77,324,0.010627
3,393,417,446,276,57,279,49,438,8,306,...,348,88,88,213,52,356,310,143,97,-1.476309
4,336,147,120,93,63,172,229,302,138,311,...,6,260,50,58,292,241,128,3,418,-0.599998
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
195,353,428,201,354,333,208,215,381,24,53,...,344,295,325,413,270,229,239,80,157,0.943571
196,9,326,184,15,323,102,32,476,374,415,...,18,286,413,315,34,302,70,42,71,-0.904040
197,133,436,468,444,297,416,414,391,100,451,...,299,231,128,63,197,330,444,9,68,-1.680813
198,272,153,375,337,165,249,395,301,484,0,...,422,372,353,249,19,423,65,370,451,1.288991


In [118]:
X = df.iloc[:, 0:-1]
y = df.iloc[:, -1:]

In [119]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [120]:
std = StandardScaler()
std.fit(X_train.values)
X_train_scaled = std.transform(X_train.values)

In [121]:
# No sm.add_constant here, so the model has no intercept (hence the "uncentered" R-squared below).
model = sm.OLS(y_train, X_train_scaled)

In [126]:
fit = model.fit()
fit.summary()

0,1,2,3
Dep. Variable:,target,R-squared (uncentered):,0.13
Model:,OLS,Adj. R-squared (uncentered):,0.005
Method:,Least Squares,F-statistic:,1.043
Date:,"Fri, 31 Jul 2020",Prob (F-statistic):,0.417
Time:,09:44:39,Log-Likelihood:,-228.66
No. Observations:,160,AIC:,497.3
Df Residuals:,140,BIC:,558.8
Df Model:,20,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
x1,0.0903,0.091,0.989,0.324,-0.090,0.271
x2,-0.1597,0.092,-1.733,0.085,-0.342,0.022
x3,0.0269,0.090,0.299,0.766,-0.151,0.205
x4,0.0407,0.094,0.435,0.665,-0.144,0.226
x5,0.2135,0.092,2.313,0.022,0.031,0.396
x6,-0.0394,0.092,-0.427,0.670,-0.222,0.143
x7,0.1078,0.089,1.216,0.226,-0.067,0.283
x8,0.0492,0.093,0.527,0.599,-0.135,0.234
x9,0.0909,0.091,0.994,0.322,-0.090,0.272

0,1,2,3
Omnibus:,0.747,Durbin-Watson:,1.889
Prob(Omnibus):,0.688,Jarque-Bera (JB):,0.823
Skew:,0.042,Prob(JB):,0.663
Kurtosis:,2.659,Cond. No.,1.8
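The summary above shows x5 with $p = 0.022$ even though y is pure noise. A quick back-of-the-envelope calculation suggests why that should not be surprising. The sketch below assumes the 20 coefficient tests behave as if they were independent, which is only approximately true.

```python
alpha = 0.05   # significance threshold used in the exercise
n_tests = 20   # one t-test per coefficient

# Assumption: the 20 tests are treated as independent, which is roughly
# reasonable here because the features were generated independently.
p_at_least_one = 1 - (1 - alpha) ** n_tests
expected_false_positives = alpha * n_tests

print(round(p_at_least_one, 3))   # 0.642
print(expected_false_positives)   # 1.0
```

Under that assumption, roughly two runs out of three will show at least one "significant" beta, and on average one beta per run will cross the threshold by chance alone.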


# Part 2

Now, automate the process! Run the above analysis, but vary the number of x's from 1 to 200. Log the R-squared and adjusted R-squared for each case and plot them.
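Adjusted R-squared rescales the unexplained share of variance by the degrees of freedom, so it penalizes every extra coefficient. As a rough check, the sketch below plugs the Part 1 numbers (160 training observations, 20 coefficients, no intercept, R-squared of 0.13) into the usual formula. The helper `adjusted_r2` is a hypothetical name introduced here for illustration.

```python
def adjusted_r2(r2, n_obs, n_params, has_intercept=True):
    """Adjusted R-squared; n_params counts every fitted coefficient,
    including the intercept when there is one."""
    df_total = n_obs - 1 if has_intercept else n_obs
    df_resid = n_obs - n_params
    return 1 - (1 - r2) * df_total / df_resid

# Values read off the Part 1 summary (that model was fit without an intercept).
adj = adjusted_r2(0.13, n_obs=160, n_params=20, has_intercept=False)
print(round(adj, 4))  # 0.0057
```

This lands close to the 0.005 shown in the summary. An exact match is not expected, because the summary prints R-squared in rounded form.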

In [134]:
n_obs = 200
r2, r2_adj = [], []
for n_features in range(1, 201):
    # Draw fresh noise each round: n_features independent x's and an unrelated y.
    X = pd.DataFrame(np.random.randint(500, size=(n_obs, n_features)))
    y = pd.DataFrame(np.random.randn(n_obs, 1), columns=["target"])
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    std = StandardScaler()
    X_train_scaled = std.fit_transform(X_train.values)
    # As in Part 1, no constant is added, so these are uncentered R-squared values.
    fit = sm.OLS(y_train, X_train_scaled).fit()
    r2.append(fit.rsquared)
    r2_adj.append(fit.rsquared_adj)
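For the plot that Part 2 asks for, a sketch along the following lines might help. It is not the notebook's method. It computes R-squared and adjusted R-squared directly with NumPy rather than statsmodels, includes an intercept, and reuses one noise matrix so that each model nests the previous one. The function names `r2_sweep` and `plot_r2`, the seed, and the use of matplotlib (not imported in this notebook) are all assumptions. The sweep stops at 198 features by default because, with an intercept and 200 observations, the adjusted R-squared denominator reaches zero at 199.

```python
import numpy as np

def r2_sweep(n_obs=200, max_features=198, seed=0):
    """Fit OLS with an intercept on pure noise, using the first p features
    for p = 1..max_features, and record R-squared and adjusted R-squared."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n_obs, max_features))  # noise features
    y = rng.normal(size=n_obs)                  # target unrelated to X
    tss = np.sum((y - y.mean()) ** 2)           # total sum of squares

    counts = np.arange(1, max_features + 1)
    r2 = np.empty(len(counts))
    r2_adj = np.empty(len(counts))
    for i, p in enumerate(counts):
        design = np.column_stack([np.ones(n_obs), X[:, :p]])
        betas, *_ = np.linalg.lstsq(design, y, rcond=None)
        rss = np.sum((y - design @ betas) ** 2)  # residual sum of squares
        r2[i] = 1 - rss / tss
        r2_adj[i] = 1 - (1 - r2[i]) * (n_obs - 1) / (n_obs - p - 1)
    return counts, r2, r2_adj

def plot_r2(counts, r2, r2_adj):
    """Plot both scores against the number of features (needs matplotlib)."""
    import matplotlib.pyplot as plt
    fig, ax = plt.subplots()
    ax.plot(counts, r2, label="R-squared")
    ax.plot(counts, r2_adj, label="Adjusted R-squared")
    ax.set_xlabel("Number of noise features")
    ax.set_ylabel("Score")
    ax.legend()
    return fig

counts, r2, r2_adj = r2_sweep()
```

Calling `plot_r2(counts, r2, r2_adj)` draws the two curves. Because the models are nested, R-squared can only go up as features are added, even though every feature is noise, while adjusted R-squared never exceeds it.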