# HW 07 (Nov 14th) - Avina Chan

**4) R-squared vs. P-values**

**R-Squared**: With an R-squared value of 0.176, the model explains 17.6% of the variation in the dependent variable (HP), leaving 82.4% unexplained.

**P-Values**: Many coefficients in this model are larger than 10 in magnitude and have very small p-values, which is strong evidence against the null hypothesis that those coefficients are zero.

While these two statements may seem contradictory, they actually address different aspects of the model.

- **R-Squared**: The value of 0.176 tells us that the model as a whole explains only 17.6% of the variability in the dependent variable. So, while the model may be statistically significant, most of the variability in the outcome remains unexplained.

- **P-Values**: The small p-values show that there is strong evidence that specific predictors are associated with the outcome, and the large coefficients describe the estimated size of those associations. Statistical significance speaks to the strength of the evidence, not to how much of the outcome the predictors explain, so it can coexist with a low overall model fit (R-squared).

The difference between the two statements is that R-squared measures how well the entire model explains the outcome, while p-values measure the evidence that individual coefficients differ from zero and the coefficients estimate how the average outcome changes with each predictor, regardless of the model’s overall performance.
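The distinction can be illustrated with a small simulation. This is only a sketch on made-up data, not the Pokémon model: the slope, noise level, and seed are arbitrary assumptions, and the p-value uses a normal approximation to the t distribution rather than the exact value `statsmodels` would report.

```python
import math
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible
n = 800                         # same number of rows as the Pokémon data
x = rng.normal(size=n)
y = 2.0 + 0.5 * x + rng.normal(scale=1.15, size=n)  # real slope, lots of noise

# Least-squares fit of y = b0 + b1 * x
X = np.column_stack([np.ones(n), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
residuals = y - X @ beta

# R-squared: share of the variance in y that the fitted line accounts for
r_squared = 1 - residuals @ residuals / np.sum((y - y.mean()) ** 2)

# t-statistic for the slope: estimate divided by its standard error
sigma2 = residuals @ residuals / (n - 2)
cov_beta = sigma2 * np.linalg.inv(X.T @ X)
t_slope = beta[1] / math.sqrt(cov_beta[1, 1])

# Two-sided p-value, using a normal approximation to the t distribution
p_slope = math.erfc(abs(t_slope) / math.sqrt(2))

print(f"R-squared: {r_squared:.3f}")
print(f"slope t-statistic: {t_slope:.2f}, approximate p-value: {p_slope:.2e}")
```

With a large sample, even a relationship that explains a small share of the variance can be estimated precisely enough to produce a tiny p-value.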

In [2]:
import pandas as pd

url = "https://raw.githubusercontent.com/KeithGalli/pandas/master/pokemon_data.csv"
# Note: the non-raw URL fails with pd.read_csv: https://github.com/KeithGalli/pandas/blob/master/pokemon_data.csv
pokeaman = pd.read_csv(url) 
pokeaman

Unnamed: 0,#,Name,Type 1,Type 2,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Generation,Legendary
0,1,Bulbasaur,Grass,Poison,45,49,49,65,65,45,1,False
1,2,Ivysaur,Grass,Poison,60,62,63,80,80,60,1,False
2,3,Venusaur,Grass,Poison,80,82,83,100,100,80,1,False
3,3,VenusaurMega Venusaur,Grass,Poison,80,100,123,122,120,80,1,False
4,4,Charmander,Fire,,39,52,43,60,50,65,1,False
...,...,...,...,...,...,...,...,...,...,...,...,...
795,719,Diancie,Rock,Fairy,50,100,150,100,150,50,6,True
796,719,DiancieMega Diancie,Rock,Fairy,50,160,110,160,110,110,6,True
797,720,HoopaHoopa Confined,Psychic,Ghost,80,110,60,150,130,70,6,True
798,720,HoopaHoopa Unbound,Psychic,Dark,80,160,60,170,130,80,6,True


In [3]:
import statsmodels.formula.api as smf

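# Additive model: one Sp. Def slope shared by all generations (specified but not fit here)
model1_spec = smf.ols(formula='HP ~ Q("Sp. Def") + C(Generation)', data=pokeaman)
# Interaction model written out term by term...
model2_spec = smf.ols(formula='HP ~ Q("Sp. Def") + C(Generation) + Q("Sp. Def"):C(Generation)', data=pokeaman)
# ...and the equivalent shorthand: 'a * b' expands to 'a + b + a:b'
model2_spec = smf.ols(formula='HP ~ Q("Sp. Def") * C(Generation)', data=pokeaman)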

model2_fit = model2_spec.fit()
model2_fit.summary()

Dep. Variable:,HP,R-squared:,0.176
Model:,OLS,Adj. R-squared:,0.164
Method:,Least Squares,F-statistic:,15.27
Date:,"Fri, 15 Nov 2024",Prob (F-statistic):,3.5e-27
Time:,03:03:28,Log-Likelihood:,-3649.4
No. Observations:,800,AIC:,7323.0
Df Residuals:,788,BIC:,7379.0
Df Model:,11,,
Covariance Type:,nonrobust,,

,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,26.8971,5.246,5.127,0.000,16.599,37.195
C(Generation)[T.2],20.0449,7.821,2.563,0.011,4.692,35.398
C(Generation)[T.3],21.3662,6.998,3.053,0.002,7.629,35.103
C(Generation)[T.4],31.9575,8.235,3.881,0.000,15.793,48.122
C(Generation)[T.5],9.4926,7.883,1.204,0.229,-5.982,24.968
C(Generation)[T.6],22.2693,8.709,2.557,0.011,5.173,39.366
"Q(""Sp. Def"")",0.5634,0.071,7.906,0.000,0.423,0.703
"Q(""Sp. Def""):C(Generation)[T.2]",-0.2350,0.101,-2.316,0.021,-0.434,-0.036
"Q(""Sp. Def""):C(Generation)[T.3]",-0.3067,0.093,-3.300,0.001,-0.489,-0.124

Omnibus:,337.229,Durbin-Watson:,1.505
Prob(Omnibus):,0.0,Jarque-Bera (JB):,2871.522
Skew:,1.684,Prob(JB):,0.0
Kurtosis:,11.649,Cond. No.,1400.0
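As a reading aid for the coefficient table above, the following sketch combines the reported coefficients into predictions. It covers only Generations 1 and 2, whose coefficients appear in the table; the helper name `predicted_hp` and the example Sp. Def value of 65 are illustrative choices, not part of the original analysis.

```python
# Coefficients copied from the summary table above (Generation 1 is the baseline)
intercept = 26.8971
gen2_shift = 20.0449          # C(Generation)[T.2]
sp_def_slope = 0.5634         # Q("Sp. Def")
gen2_slope_shift = -0.2350    # Q("Sp. Def"):C(Generation)[T.2]


def predicted_hp(sp_def, generation):
    """Predicted HP for Generation 1 or 2 from the fitted coefficients."""
    if generation == 1:
        return intercept + sp_def_slope * sp_def
    if generation == 2:
        # The interaction term changes both the intercept and the slope
        return (intercept + gen2_shift) + (sp_def_slope + gen2_slope_shift) * sp_def
    raise ValueError("only Generations 1 and 2 are covered in this sketch")


# The Sp. Def slope for Generation 2 is the baseline slope plus the interaction
gen2_slope = sp_def_slope + gen2_slope_shift

print(f"Generation 2 slope: {gen2_slope:.4f}")
print(f"Predicted HP, Sp. Def 65, Gen 1: {predicted_hp(65, 1):.2f}")
print(f"Predicted HP, Sp. Def 65, Gen 2: {predicted_hp(65, 2):.2f}")
```

The interaction coefficients are therefore differences in slope relative to Generation 1, not slopes in their own right.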


**7) Development of models** <br>
**Development of model5_linear_form from model3_fit and model4_fit**:  
Model 3 uses only Attack and Defense to predict HP, which doesn't fully capture the complexity of the data. The hypothesis tests show that these variables have a significant association with HP, but the in-sample R-squared (0.1477) is low, so the model explains little of the variance in HP. This indicates that the model is likely underfitting by not accounting for enough of the structure in the relationship.

Model 4 attempts to improve on this by adding more predictors and every interaction among them. The in-sample R-squared (0.4671) is much higher than Model 3's, but the out-of-sample R-squared (0.0025) collapses compared to Model 3's (0.2121), which indicates severe overfitting. The model fits the training data better but fails to generalize to the test data.

Model 5 drops the interaction terms from Model 4 and uses additive terms only, while adding the categorical predictors Generation, Type 1, and Type 2, so it still includes more predictors than Model 3. Its in-sample R-squared (0.3920) and out-of-sample R-squared (0.3002) are much closer together, and its out-of-sample performance is far better than Model 4's. This demonstrates improved generalizability: Model 5 represents a more balanced approach.

In [4]:
import numpy as np
from sklearn.model_selection import train_test_split

fifty_fifty_split_size = int(pokeaman.shape[0]*0.5)

# Replace NaN (in the "Type 2" column) with "None"
pokeaman.fillna('None', inplace=True)

np.random.seed(130)
pokeaman_train,pokeaman_test = \
  train_test_split(pokeaman, train_size=fifty_fifty_split_size)
pokeaman_train
model_spec3 = smf.ols(formula='HP ~ Attack + Defense', 
                      data=pokeaman_train)
model3_fit = model_spec3.fit()
model3_fit.summary()
yhat_model3 = model3_fit.predict(pokeaman_test)
y = pokeaman_test.HP
print("'In sample' R-squared:    ", model3_fit.rsquared)
print("'Out of sample' R-squared:", np.corrcoef(y,yhat_model3)[0,1]**2)

'In sample' R-squared:     0.14771558304519894
'Out of sample' R-squared: 0.21208501873920738


In [5]:
model4_linear_form = 'HP ~ Attack * Defense * Speed * Legendary'
model4_linear_form += ' * Q("Sp. Def") * Q("Sp. Atk")'
# DO NOT try adding '* C(Generation) * C(Q("Type 1")) * C(Q("Type 2"))'
# That's 6*18*19 = 2052 possible interaction combinations...
# ...a huge number that will blow up your computer

model4_spec = smf.ols(formula=model4_linear_form, data=pokeaman_train)
model4_fit = model4_spec.fit()
model4_fit.summary()
yhat_model4 = model4_fit.predict(pokeaman_test)
y = pokeaman_test.HP
print("'In sample' R-squared:    ", model4_fit.rsquared)
print("'Out of sample' R-squared:", np.corrcoef(y,yhat_model4)[0,1]**2)

'In sample' R-squared:     0.46709442115833855
'Out of sample' R-squared: 0.002485342598992873
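To see why Model 4 overfits, it helps to count its coefficients. The sketch below enumerates the terms that a full `*` expansion of six predictors produces, assuming each predictor contributes a single column (Legendary is a True/False variable, so it does).

```python
from itertools import combinations

predictors = ["Attack", "Defense", "Speed", "Legendary", "Sp. Def", "Sp. Atk"]

# A full '*' expansion includes every non-empty subset of the predictors:
# main effects, two-way interactions, ..., up to the single six-way interaction.
terms = [":".join(subset)
         for size in range(1, len(predictors) + 1)
         for subset in combinations(predictors, size)]

n_coefficients = len(terms) + 1   # plus the intercept
n_train = 400                     # half of the 800 rows

print("coefficients:", n_coefficients)
print("training rows per coefficient:", n_train / n_coefficients)
```

With 64 coefficients estimated from 400 training rows, there are only about six observations per coefficient, which leaves plenty of room to fit noise.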


In [6]:
# Here's something a little more reasonable...
model5_linear_form = 'HP ~ Attack + Defense + Speed + Legendary'
model5_linear_form += ' + Q("Sp. Def") + Q("Sp. Atk")'
model5_linear_form += ' + C(Generation) + C(Q("Type 1")) + C(Q("Type 2"))'

model5_spec = smf.ols(formula=model5_linear_form, data=pokeaman_train)
model5_fit = model5_spec.fit()
model5_fit.summary()
yhat_model5 = model5_fit.predict(pokeaman_test)
y = pokeaman_test.HP
print("'In sample' R-squared:    ", model5_fit.rsquared)
print("'Out of sample' R-squared:", np.corrcoef(y,yhat_model5)[0,1]**2)

'In sample' R-squared:     0.3920134083531893
'Out of sample' R-squared: 0.30015614488652215


**Development of model6_linear_form from model5_linear_form**: Model 6 trims Model 5 down to a more targeted set of predictors. It keeps four continuous variables (Attack, Speed, Special Defense, and Special Attack) and replaces the full categorical terms with only the indicator variables that were significant in the previous model: Type 1 equal to Normal or Water, and Generation equal to 2 or 5. The in-sample R-squared for Model 6 (0.3326) is lower than Model 5's (0.3920), as expected with fewer predictors, while the out-of-sample R-squared (0.2957) is essentially unchanged from Model 5's (0.3002). The gap between in-sample and out-of-sample performance therefore narrows, which suggests less overfitting in a much simpler model.

In [7]:
# Here's something a little more reasonable...
model6_linear_form = 'HP ~ Attack + Speed + Q("Sp. Def") + Q("Sp. Atk")'
# And here we'll add the significant indicators from the previous model
# https://chatgpt.com/share/81ab88df-4f07-49f9-a44a-de0cfd89c67c
model6_linear_form += ' + I(Q("Type 1")=="Normal")'
model6_linear_form += ' + I(Q("Type 1")=="Water")'
model6_linear_form += ' + I(Generation==2)'
model6_linear_form += ' + I(Generation==5)'

model6_spec = smf.ols(formula=model6_linear_form, data=pokeaman_train)
model6_fit = model6_spec.fit()
model6_fit.summary()
yhat_model6 = model6_fit.predict(pokeaman_test)
y = pokeaman_test.HP
print("'In sample' R-squared:    ", model6_fit.rsquared)
print("'Out of sample' R-squared:", np.corrcoef(y,yhat_model6)[0,1]**2)

'In sample' R-squared:     0.3326310334310908
'Out of sample' R-squared: 0.29572460427079933


**Development of model7_linear_form from model6_linear_form**: Model 7 builds on Model 6 by adding all interaction terms among Attack, Speed, Special Defense, and Special Attack, while keeping the same four indicator variables. These interactions let the model capture more complex relationships at the cost of added complexity. Both the in-sample R-squared (0.3782 vs. 0.3326) and the out-of-sample R-squared (0.3506 vs. 0.2957) improve on Model 6. Centering and scaling the continuous variables reduces the condition number, which addresses the numerical side of multicollinearity; that adjustment is not part of the cell shown below, and because it only reparameterizes the same model it should not by itself change the R-squared values.
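The effect of centering and scaling on an interaction model can be sketched on synthetic data. The two variables below are hypothetical stand-ins for continuous stats, not the Pokémon columns, and the design matrix is built by hand with `numpy` rather than through a formula.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 400
a = rng.normal(loc=80, scale=30, size=n)   # hypothetical stand-in for one stat
b = rng.normal(loc=70, scale=25, size=n)   # hypothetical stand-in for another stat
y = 30 + 0.2 * a + 0.1 * b + 0.001 * a * b + rng.normal(scale=10, size=n)


def design(u, v):
    """Design matrix for 'y ~ u * v': intercept, main effects, interaction."""
    return np.column_stack([np.ones_like(u), u, v, u * v])


def standardize(u):
    """Center to mean 0 and scale to standard deviation 1."""
    return (u - u.mean()) / u.std()


X_raw = design(a, b)
X_std = design(standardize(a), standardize(b))

# Condition number of each design matrix
cond_raw = np.linalg.cond(X_raw)
cond_std = np.linalg.cond(X_std)

# Least-squares fitted values from each parameterization
fitted_raw = X_raw @ np.linalg.lstsq(X_raw, y, rcond=None)[0]
fitted_std = X_std @ np.linalg.lstsq(X_std, y, rcond=None)[0]

print(f"condition number, raw:          {cond_raw:.1f}")
print(f"condition number, standardized: {cond_std:.1f}")
print("same fitted values:", np.allclose(fitted_raw, fitted_std))
```

The condition number drops sharply after standardizing, yet the fitted values are the same, so the predictions and R-squared are unaffected in this setup.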

In [8]:
model7_linear_form = 'HP ~ Attack * Speed * Q("Sp. Def") * Q("Sp. Atk")'
model7_linear_form += ' + I(Q("Type 1")=="Normal")'
model7_linear_form += ' + I(Q("Type 1")=="Water")'
model7_linear_form += ' + I(Generation==2)'
model7_linear_form += ' + I(Generation==5)'

model7_spec = smf.ols(formula=model7_linear_form, data=pokeaman_train)
model7_fit = model7_spec.fit()
model7_fit.summary()
yhat_model7 = model7_fit.predict(pokeaman_test)
y = pokeaman_test.HP
print("'In sample' R-squared:    ", model7_fit.rsquared)
print("'Out of sample' R-squared:", np.corrcoef(y,yhat_model7)[0,1]**2)

'In sample' R-squared:     0.37818209127432456
'Out of sample' R-squared: 0.35055389205977444


**9) Explanation of Illustration**

In [11]:
model7_gen1_predict_future = smf.ols(formula=model7_linear_form,
                                   data=pokeaman[pokeaman.Generation==1])
model7_gen1_predict_future_fit = model7_gen1_predict_future.fit()
print("'In sample' R-squared:    ", model7_fit.rsquared, "(original)")
y = pokeaman_test.HP
print("'Out of sample' R-squared:", np.corrcoef(y,yhat_model7)[0,1]**2, "(original)")
print("'In sample' R-squared:    ", model7_gen1_predict_future_fit.rsquared, "(gen1_predict_future)")
y = pokeaman[pokeaman.Generation!=1].HP
yhat = model7_gen1_predict_future_fit.predict(pokeaman[pokeaman.Generation!=1])
print("'Out of sample' R-squared:", np.corrcoef(y,yhat)[0,1]**2, "(gen1_predict_future)")

'In sample' R-squared:     0.37818209127432456 (original)
'Out of sample' R-squared: 0.35055389205977444 (original)
'In sample' R-squared:     0.5726118179916575 (gen1_predict_future)
'Out of sample' R-squared: 0.11151363354803218 (gen1_predict_future)


In [13]:
model7_gen1to5_predict_future = smf.ols(formula=model7_linear_form,
                                   data=pokeaman[pokeaman.Generation!=6])
model7_gen1to5_predict_future_fit = model7_gen1to5_predict_future.fit()
print("'In sample' R-squared:    ", model7_fit.rsquared, "(original)")
y = pokeaman_test.HP
print("'Out of sample' R-squared:", np.corrcoef(y,yhat_model7)[0,1]**2, "(original)")
print("'In sample' R-squared:    ", model7_gen1to5_predict_future_fit.rsquared, "(gen1to5_predict_future)")
y = pokeaman[pokeaman.Generation==6].HP
yhat = model7_gen1to5_predict_future_fit.predict(pokeaman[pokeaman.Generation==6])
print("'Out of sample' R-squared:", np.corrcoef(y,yhat)[0,1]**2, "(gen1to5_predict_future)")

'In sample' R-squared:     0.37818209127432456 (original)
'Out of sample' R-squared: 0.35055389205977444 (original)
'In sample' R-squared:     0.3904756578094535 (gen1to5_predict_future)
'Out of sample' R-squared: 0.23394915464343125 (gen1to5_predict_future)


In [14]:
model6_gen1_predict_future = smf.ols(formula=model6_linear_form,
                                   data=pokeaman[pokeaman.Generation==1])
model6_gen1_predict_future_fit = model6_gen1_predict_future.fit()
print("'In sample' R-squared:    ", model6_fit.rsquared, "(original)")
y = pokeaman_test.HP
print("'Out of sample' R-squared:", np.corrcoef(y,yhat_model6)[0,1]**2, "(original)")
print("'In sample' R-squared:    ", model6_gen1_predict_future_fit.rsquared, "(gen1_predict_future)")
y = pokeaman[pokeaman.Generation!=1].HP
yhat = model6_gen1_predict_future_fit.predict(pokeaman[pokeaman.Generation!=1])
print("'Out of sample' R-squared:", np.corrcoef(y,yhat)[0,1]**2, "(gen1_predict_future)")

'In sample' R-squared:     0.3326310334310908 (original)
'Out of sample' R-squared: 0.29572460427079933 (original)
'In sample' R-squared:     0.4433880517727282 (gen1_predict_future)
'Out of sample' R-squared: 0.1932858534276128 (gen1_predict_future)


In [15]:
model6_gen1to5_predict_future = smf.ols(formula=model6_linear_form,
                                   data=pokeaman[pokeaman.Generation!=6])
model6_gen1to5_predict_future_fit = model6_gen1to5_predict_future.fit()
print("'In sample' R-squared:    ", model6_fit.rsquared, "(original)")
y = pokeaman_test.HP
print("'Out of sample' R-squared:", np.corrcoef(y,yhat_model6)[0,1]**2, "(original)")
print("'In sample' R-squared:    ", model6_gen1to5_predict_future_fit.rsquared, "(gen1to5_predict_future)")
y = pokeaman[pokeaman.Generation==6].HP
yhat = model6_gen1to5_predict_future_fit.predict(pokeaman[pokeaman.Generation==6])
print("'Out of sample' R-squared:", np.corrcoef(y,yhat)[0,1]**2, "(gen1to5_predict_future)")

'In sample' R-squared:     0.3326310334310908 (original)
'Out of sample' R-squared: 0.29572460427079933 (original)
'In sample' R-squared:     0.33517279824114776 (gen1to5_predict_future)
'Out of sample' R-squared: 0.26262690178799936 (gen1to5_predict_future)



The code above refits model7 and model6 on generation-based subsets of the Pokémon data. In the first setup each model is trained on Generation 1 only and tested on Generations 2 to 6; in the second it is trained on Generations 1 to 5 and tested on Generation 6. In-sample and out-of-sample R-squared values are reported alongside those from the original random 50/50 split. Splitting by generation simulates how the models would perform as new data becomes available over time. <br>
<br>
The two models being compared are model7, which is more complex, and model6, which is simpler. The interactions in model7 give it a better in-sample fit but make it more prone to overfitting: trained on Generation 1 alone, its in-sample R-squared is 0.5726 while its out-of-sample R-squared falls to 0.1115. model6 is simpler, and its coefficients are supported by stronger evidence according to their p-values. In both generation-based splits it generalizes better than model7 (0.1933 vs. 0.1115 out of sample when trained on Generation 1, and 0.2626 vs. 0.2339 when trained on Generations 1 to 5), even though model7 did better on the original random split (0.3506 vs. 0.2957). <br>
<br>
Overall, model7 achieves the higher in-sample R-squared and the better out-of-sample performance on the random split, but it carries a greater risk of overfitting when future data differs from the training data. Simpler models like model6 may be preferred for their more stable generalization and ease of interpretation. In real-world scenarios, where predictions are made for data that arrives later, simpler models may often be more beneficial.<br>
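The comparison can be summarized by the gap between in-sample and out-of-sample R-squared for each model and split. This sketch only tabulates the values printed by the cells above, rounded to four decimals; the split labels are shorthand introduced here.

```python
# (in-sample, out-of-sample) R-squared values copied from the printed outputs above
results = {
    ("model7", "random 50/50 split"): (0.3782, 0.3506),
    ("model7", "train Gen 1"): (0.5726, 0.1115),
    ("model7", "train Gen 1-5"): (0.3905, 0.2339),
    ("model6", "random 50/50 split"): (0.3326, 0.2957),
    ("model6", "train Gen 1"): (0.4434, 0.1933),
    ("model6", "train Gen 1-5"): (0.3352, 0.2626),
}

# A larger gap means the in-sample fit overstates how well the model generalizes
gaps = {key: round(in_r2 - out_r2, 4) for key, (in_r2, out_r2) in results.items()}

for (model, split), gap in sorted(gaps.items(), key=lambda item: item[1], reverse=True):
    print(f"{model:7s} {split:20s} gap = {gap:.4f}")
```

In both generation-based splits the gap is smaller for model6 than for model7, which is the pattern the discussion above relies on.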

CHAT GPT: https://chatgpt.com/share/6736bd60-746c-8001-8f41-27115857ad75
Here’s a summary of the key points from our conversation:

- **Code Explanation:** You asked for an explanation of Python code that involves fitting linear regression models using `statsmodels` and calculating R-squared values for both in-sample and out-of-sample data. The code is evaluating how well the model trained on Generation 1 data generalizes to data from other generations.
  
  1. **Model Fitting:** A linear model is fit using only Generation 1 data.
  2. **In-sample R-squared:** The R-squared value for the model's fit on its training data (in-sample).
  3. **Out-of-sample R-squared (original model):** Evaluates the performance of the original model on a separate test set.
  4. **In-sample and Out-of-sample R-squared (Generation 1 model):** The model trained on Generation 1 data is evaluated both on its training set and on other generations' data.

- **Purpose of R-squared:**
  - **In-sample R-squared** shows how well the model explains the variance in the data it was trained on.
  - **Out-of-sample R-squared** shows how well the model generalizes to unseen data.

- **General Insight:** The code helps assess whether the model overfits to the training data and how effectively it can make predictions for new, unseen data.
