# Static



By now you're familiar with the typical use of linear regression: you're given a dataframe with a bunch of features and a set of observations for each feature. You then take one feature (your y) and try to predict it based on other features (your x's). 

This generally works to a certain extent. This morning, though, we're going to set up a rather strange problem for linear regression. We will see what happens when y and the x's are totally unrelated.

Recall that regression follows the formula 
$$
y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \epsilon
$$ 
When we "fit" an OLS model we are giving the model all of the x's and the y and asking it to find the best betas.
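As a quick illustration of what "finding the best betas" means, here is a minimal sketch that uses NumPy's least-squares solver rather than statsmodels. The true betas, the noise scale, and the seed are made-up values chosen only for this example.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility

n_obs = 500
true_betas = np.array([1.0, 2.0, -3.0])  # hypothetical beta_0, beta_1, beta_2

# Design matrix: a column of ones (intercept) plus two random features
X = np.column_stack([np.ones(n_obs), rng.standard_normal((n_obs, 2))])
epsilon = 0.5 * rng.standard_normal(n_obs)  # the noise term
y = X @ true_betas + epsilon

# OLS picks the betas that minimize the sum of squared residuals
estimated_betas, *_ = np.linalg.lstsq(X, y, rcond=None)
print(estimated_betas)
```

When there is real signal, as here, the estimates should land close to the betas used to generate the data.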

This time, however, we're going to generate data in which all the true betas are zero
$$
\beta_0=\beta_1=\dots=0
$$
and then fit an OLS model.
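
Written out, this choice collapses the regression formula to pure noise (a restatement of the setup above, with nothing new assumed):
$$
y = 0 + 0 \cdot x_1 + 0 \cdot x_2 + \dots + \epsilon = \epsilon
$$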

Before we do this, stop and think for a second:

- What do you expect the model to do? What will the betas/r-squared/p-values that it finds look like?
> $R^2$ should stay near zero: there is variability in the target, but the x's carry no information about it, so the model cannot account for any of that variability.
- What do you think the model *should* do? Is that different from what you think it *will* do?
> In practice, the model gets better results than expected because of randomness: with enough features, some are bound to show apparent patterns by chance.

![](https://upload.wikimedia.org/wikipedia/commons/5/5a/No_Signal_23.JPG)

# Part 1

Generate simulation data. We want 200 points (observations) for the y feature and for each of 20 x features. In other words, make sure `y.shape == (200,1)` and `x.shape == (200,20)`. The x's should be randomly generated independently of each other, and y should be randomly generated independently of the x's.

Use statsmodels to fit an OLS model to your data. Are the results as you expected? Do you have any betas with a $p<0.05$? If not, re-run the model until you do.
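
To get a feel for how often a "significant" beta shows up by chance, here is a rough NumPy-only sketch (not the statsmodels workflow the exercise asks for). It assumes Gaussian noise, adds an intercept, and swaps p-values for a hardcoded two-sided critical value of about 1.97 for the t-statistic at the 5% level with 179 residual degrees of freedom. The function name `count_significant`, the seed, and the number of trials are all hypothetical choices for illustration.

```python
import numpy as np

def count_significant(n_obs=200, n_features=20, t_crit=1.973, rng=None):
    """Fit OLS on pure noise; count slopes whose |t| exceeds t_crit."""
    rng = np.random.default_rng() if rng is None else rng
    # Intercept column plus independent random features
    X = np.column_stack([np.ones(n_obs), rng.standard_normal((n_obs, n_features))])
    y = rng.standard_normal(n_obs)  # generated independently of X

    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    dof = n_obs - X.shape[1]                 # residual degrees of freedom
    sigma2 = resid @ resid / dof             # estimated noise variance
    cov = sigma2 * np.linalg.inv(X.T @ X)    # covariance of the estimates
    t_stats = beta / np.sqrt(np.diag(cov))
    return int(np.sum(np.abs(t_stats[1:]) > t_crit))  # skip the intercept

rng = np.random.default_rng(0)  # arbitrary seed
counts = np.array([count_significant(rng=rng) for _ in range(500)])
frac_any = np.mean(counts > 0)
print("average number of 'significant' betas per fit:", counts.mean())
print("fraction of fits with at least one:", frac_any)
```

With 20 unrelated features and a 5% threshold, roughly 20 × 0.05 = 1 false positive per fit is expected. If the tests were fully independent, the chance of at least one would be 1 − 0.95^20 ≈ 0.64, so needing to re-run the model a couple of times is not surprising.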

In [2]:
import numpy as np
import statsmodels.api as sm
import statsmodels.formula.api as smf
import pandas as pd
import patsy  # patsy takes a dataframe and a formula and turns them into design matrices,
# in a format that statsmodels (or sklearn) understands

In [None]:
num_rows = 200
num_features = 20
# y and the x's are drawn independently, so the true betas are all zero
y = np.random.randint(0, 200, num_rows)
x = np.random.randint(0, 200, [num_rows, num_features])
x_name_list = ['X' + str(item) for item in range(1, num_features + 1)]
x_df = pd.DataFrame(x, columns=x_name_list)
x_df.insert(loc=0, column='Y', value=y)
df = x_df

In [None]:
def p_hacking(num_rows):
   # y and the x's are drawn independently, so any "significant" beta is a fluke
   y=np.random.randint(0,200,200)
   x=np.random.randint(0,200,[200,20])
   x_name_list=list()
   for item in range(1,21):
       x_name_list.append('X' + str(item))
   x_df=pd.DataFrame(x,columns = x_name_list)

   x_df.insert(loc=0, column='Y', value=y)
   df=x_df.iloc[:num_rows,:]
   # Create your feature matrix (X) and target vector (y)
   y, x = patsy.dmatrices('Y ~ ' + ' + '.join(x_name_list), data=df, return_type="dataframe")

   # Create your model
   model = sm.OLS(y, x)

   # Fit your model to your training set
   fit = model.fit()

   # Return the fitted results so the caller can inspect fit.summary(),
   # fit.pvalues, or np.log(np.abs(fit.rsquared_adj))
   return fit

In [None]:
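output = []
# Stop at 199 features: with 200 features and 200 rows the residual degrees of
# freedom are zero, so adjusted r2 is undefined
for i in range(1,200):
    y = np.random.randn(200,1)
    x = np.random.randn(200,i)
    model = sm.OLS(y,x)  # note: no intercept column is added here
    fit = model.fit()
    output.append([fit.rsquared,fit.rsquared_adj])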

In [None]:
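import matplotlib.pyplot as plt

# Each row of output is [r2, r2-adj], so this draws one line per metric
plt.plot(range(1, len(output) + 1), output)
plt.xlabel('number of x features')
plt.legend(['r2', 'r2-adj'])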

In [None]:
# Create your feature matrix (X) and target vector (y)
y, X = patsy.dmatrices('Y ~ X1 + X2 + X3 + X4 + X5 + X6', data=df, return_type="dataframe")

# Create your model
model = sm.OLS(y, X)

# Fit your model to your training set
fit = model.fit()

# Print summary statistics of the model's performance
fit.summary()

# Part 2

Now, automate the process! Run the above analysis but vary the number of x's from 1 to 200. Log the r2 and r2-adj for each case and plot them.
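
As a sanity check on what the plot should look like, here is a NumPy-only sketch (an alternative to the statsmodels loop, not a replacement for it). It assumes Gaussian noise and a model with an intercept. Under those assumptions the expected $R^2$ on pure noise is $p/(n-1)$ for $p$ features and $n$ observations, while adjusted $R^2$ should hover around zero. The feature counts and seed are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed
n_obs = 200
feature_counts = [1, 10, 50, 100, 150]  # illustrative subset of 1..200

results = []
for p in feature_counts:
    # Fresh pure-noise data for each feature count
    X = np.column_stack([np.ones(n_obs), rng.standard_normal((n_obs, p))])
    y = rng.standard_normal(n_obs)

    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    ss_res = resid @ resid
    ss_tot = np.sum((y - y.mean()) ** 2)

    r2 = 1 - ss_res / ss_tot
    r2_adj = 1 - (1 - r2) * (n_obs - 1) / (n_obs - p - 1)
    expected_r2 = p / (n_obs - 1)  # expected R2 when y is unrelated to X
    results.append((p, r2, r2_adj, expected_r2))

for p, r2, r2_adj, expected_r2 in results:
    print(f"{p:>3} features: R2={r2:.3f}  adj R2={r2_adj:.3f}  expected R2={expected_r2:.3f}")
```

The raw $R^2$ climbs steadily as features are added even though none of them carries any signal, which is why adjusted $R^2$ penalizes the feature count.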