# 1

#### 1. Difference Between Simple and Multiple Linear Regression

- **Simple Linear Regression:** Models the relationship between a single predictor variable `X` and an outcome variable `Y`.
  - **Linear Form**: `Y = beta_0 + beta_1 * X + epsilon`
  - **Explanation**: This equation shows `Y` as a function of a single predictor, `X`, with `beta_0` as the intercept and `beta_1` as the slope.

- **Multiple Linear Regression:** Models the relationship between multiple predictors `X_1, X_2, ..., X_n` and `Y`.
  - **Linear Form**: `Y = beta_0 + beta_1 * X_1 + beta_2 * X_2 + ... + beta_n * X_n + epsilon`
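  - **Explanation**: Multiple Linear Regression includes several predictors, so the model can account for additional factors influencing `Y`. Each coefficient `beta_j` is the expected change in `Y` for a one-unit increase in `X_j` with the other predictors held constant. This can improve predictions when the added predictors carry real information about `Y`.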

- **Benefit of Multiple Linear Regression**: It captures the influence of multiple predictors on the outcome, providing a more comprehensive understanding and potentially better predictive accuracy.
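As a rough illustration of the benefit described above, the sketch below fits both forms by least squares on synthetic data. The data, the seed, and the helper name `fit_r_squared` are assumptions made for this example only.

```python
import numpy as np

# Synthetic data (an assumption for illustration): Y depends on two predictors.
rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 1.0 + 2.0 * x1 + 3.0 * x2 + rng.normal(size=n)

def fit_r_squared(X, y):
    """Least-squares fit; returns (coefficients, in-sample R-squared)."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    residuals = y - X @ beta
    r2 = 1 - np.sum(residuals**2) / np.sum((y - y.mean())**2)
    return beta, r2

ones = np.ones(n)
# Simple linear regression: intercept + x1 only.
beta_simple, r2_simple = fit_r_squared(np.column_stack([ones, x1]), y)
# Multiple linear regression: intercept + x1 + x2.
beta_multiple, r2_multiple = fit_r_squared(np.column_stack([ones, x1, x2]), y)

print("simple:  ", beta_simple.round(2), round(r2_simple, 3))
print("multiple:", beta_multiple.round(2), round(r2_multiple, 3))
```

On data like this the in-sample R-squared of the multiple model cannot be lower than that of the simple model, because the simple model is a special case of it. Whether the gain carries over to new data is a separate question.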

---

#### 2. Continuous vs. Indicator Variables in Simple Linear Regression

- **Continuous Variable**: A variable like age or income that can take on any value within a range.
  - **Linear Form**: `Y = beta_0 + beta_1 * X + epsilon`
  - **Explanation**: Here, `X` is continuous, and `beta_1` quantifies the effect of each unit increase in `X` on `Y`.

- **Indicator Variable**: A binary variable (0 or 1) indicating the presence of a specific condition (e.g., gender: male/female).
  - **Linear Form**: `Y = beta_0 + beta_1 * 1(X) + epsilon`
  - **Explanation**: The indicator variable, `1(X)`, shifts `Y` by `beta_1` when the condition holds (e.g., male vs. female).
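A small sketch of the indicator case, using a hypothetical 0/1 `group` variable and synthetic outcomes: with a single indicator as the only predictor, the fitted intercept is the mean of the baseline group and the fitted `beta_1` is the difference between the two group means.

```python
import numpy as np

rng = np.random.default_rng(1)
group = rng.integers(0, 2, size=100)          # hypothetical 0/1 indicator
y = 5.0 + 2.0 * group + rng.normal(size=100)  # synthetic outcome

# Design matrix: a column of ones (intercept) and the indicator.
X = np.column_stack([np.ones(len(y)), group])
beta0, beta1 = np.linalg.lstsq(X, y, rcond=None)[0]

mean0 = y[group == 0].mean()
mean1 = y[group == 1].mean()
print(beta0, mean0)          # intercept vs. mean of the baseline group
print(beta1, mean1 - mean0)  # indicator coefficient vs. difference in means
```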

---

#### 3. Effect of Adding an Indicator Variable Alongside a Continuous Variable in Multiple Linear Regression

- When adding an indicator variable `1(X_2)` alongside a continuous variable `X_1`, Multiple Linear Regression can model different baseline levels of `Y` based on the indicator while adjusting for the continuous predictor.
  - **Linear Form**: `Y = beta_0 + beta_1 * X_1 + beta_2 * 1(X_2) + epsilon`
  - **Explanation**: This model accounts for both the continuous effect of `X_1` and the shift caused by `X_2`. For example, `X_1` could be age and `X_2` could represent gender; this model shows age’s effect on `Y`, adjusted for gender.

---

#### 4. Effect of Adding an Interaction Between a Continuous and an Indicator Variable in Multiple Linear Regression

- An interaction term between a continuous `X_1` and an indicator `1(X_2)` allows the slope of `X_1`’s effect on `Y` to differ based on the indicator variable.
  - **Linear Form**: `Y = beta_0 + beta_1 * X_1 + beta_2 * 1(X_2) + beta_3 * (X_1 * 1(X_2)) + epsilon`
  - **Explanation**: Here, `beta_3` captures the interaction effect. If `X_1` represents age and `X_2` gender, the model allows age to impact `Y` differently based on gender.
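To make the roles of the coefficients concrete, here is a minimal sketch on hypothetical, noise-free data, so the fit recovers the chosen coefficients. The group with indicator 0 has intercept `beta_0` and slope `beta_1`; the group with indicator 1 has intercept `beta_0 + beta_2` and slope `beta_1 + beta_3`.

```python
import numpy as np

# Hypothetical, noise-free data so the coefficients are recovered by the fit.
x1 = np.tile(np.arange(10.0), 2)   # continuous predictor X_1
d = np.repeat([0.0, 1.0], 10)      # indicator 1(X_2)
y = 1.0 + 2.0 * x1 + 3.0 * d + 0.5 * x1 * d

# Columns: intercept, X_1, indicator, interaction.
X = np.column_stack([np.ones_like(x1), x1, d, x1 * d])
b0, b1, b2, b3 = np.linalg.lstsq(X, y, rcond=None)[0]

print("indicator = 0: intercept", b0, "slope", b1)
print("indicator = 1: intercept", b0 + b2, "slope", b1 + b3)
```

Dropping the last column gives the additive model of section 3, in which both groups share the slope `beta_1` and differ only in intercept.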

---

#### 5. Multiple Linear Regression Model Based Solely on Indicator Variables Derived from a Non-Binary Categorical Variable

- When using only indicator variables from a non-binary categorical variable (e.g., race with multiple categories), Multiple Linear Regression represents the variable with one binary indicator for each category except a chosen reference category.
  - **Linear Form**: `Y = beta_0 + beta_1 * 1(X = B) + beta_2 * 1(X = C) + ... + beta_{K-1} * 1(X = K) + epsilon`, for `K` categories labeled `A, B, ..., K` with `A` as the reference
  - **Explanation**: Each indicator represents a category (e.g., `1(X = B)` is 1 if race = B, otherwise 0). One category (the “reference” category, here `A`) is omitted because including all indicators together with the intercept would make the predictors perfectly collinear. `beta_0` is then the mean of `Y` in the reference category, and each remaining coefficient is the difference between its category's mean and the reference mean.
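The encoding can be sketched with `pandas.get_dummies`, here on a made-up three-category data frame. The column names and values are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data: a three-level categorical variable and an outcome.
df = pd.DataFrame({
    "group": ["A", "A", "B", "B", "C", "C"],
    "y":     [1.0, 3.0, 4.0, 6.0, 9.0, 11.0],
})

# drop_first=True omits the first category ("A"), which becomes the reference.
indicators = pd.get_dummies(df["group"], drop_first=True, dtype=float)
X = np.column_stack([np.ones(len(df)), indicators.to_numpy()])
beta = np.linalg.lstsq(X, df["y"].to_numpy(), rcond=None)[0]

group_means = df.groupby("group")["y"].mean()
print(list(indicators.columns))  # the K-1 indicator columns
print(beta)                      # intercept, then one offset per indicator
print(group_means)               # compare: reference mean and the other means
```

Here the group means are 2, 5, and 10, so the fitted intercept matches the mean of category A and the two offsets match the differences 3 and 8.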


NotebookLM:

### Indicator Variables and Model Behavior in Multiple Linear Regression

**Introducing a single indicator variable alongside a continuous variable in a multiple linear regression model changes the model's behavior.** Instead of representing a single straight line (as in simple linear regression), the model now represents two parallel lines. Each line represents the relationship between the continuous predictor and the outcome for one of the groups defined by the indicator variable.

Consider these linear forms:

*   **Simple Linear Regression**:

    $$Y_i = \beta_0 + \beta_1 x_i + \epsilon_i$$

    where:
    * $Y_i$ represents the outcome variable for observation $i$.
    * $\beta_0$ is the intercept.
    * $\beta_1$ is the slope coefficient for the continuous predictor, $x_i$.
    * $\epsilon_i$ is the error term.

*   **Multiple Linear Regression** with an **indicator variable**:

    $$Y_i = \beta_0 + \beta_1 x_i + \beta_2 1_{[k_i = \text{"B"}]}(k_i) + \epsilon_i$$

    where:
    * $k_i$ is the group label of observation $i$ (for example "A" or "B"); it is a separate variable from the continuous predictor $x_i$.
    * $1_{[k_i = \text{"B"}]}(k_i)$ is the indicator variable. This variable takes a value of 1 if observation $i$ belongs to group "B" and 0 otherwise.
    * $\beta_2$ is the coefficient for the indicator variable, quantifying the difference in the average outcome between the two groups when the continuous predictor is held constant.

**The indicator variable effectively introduces a constant "offset" for the group it represents, resulting in two parallel lines with the same slope but different intercepts.**

**Adding an interaction term further modifies the model's behavior.**  The interaction term, denoted as $x_i \times 1_{[k_i = \text{"B"}]}(k_i)$, allows the slope of the relationship between the continuous predictor and the outcome to differ between the two groups. This means the lines representing the relationship are no longer constrained to be parallel and may converge, diverge, or intersect within the range of the data.

Here is the linear form with the interaction term:

$$Y_i = \beta_0 + \beta_1 x_i + \beta_2 1_{[k_i = \text{"B"}]}(k_i) + \beta_3 x_i \times 1_{[k_i = \text{"B"}]}(k_i)  + \epsilon_i$$

* $\beta_3$ captures the difference in slopes between the two groups.
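
Substituting the two possible values of the indicator into the interaction model makes the two lines explicit. This is a worked restatement of the form above, nothing beyond it:

$$\text{indicator} = 0: \quad Y_i = \beta_0 + \beta_1 x_i + \epsilon_i$$

$$\text{indicator} = 1: \quad Y_i = (\beta_0 + \beta_2) + (\beta_1 + \beta_3)\, x_i + \epsilon_i$$

Setting $\beta_3 = 0$ gives both groups the slope $\beta_1$, which recovers the parallel-lines model with intercepts $\beta_0$ and $\beta_0 + \beta_2$.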

**When a multiple linear regression model is based solely on indicator variables derived from a non-binary categorical variable, the model represents a series of parallel horizontal lines.** Each line represents a category of the categorical variable.  

For instance, for a categorical variable $k_i$ with $K$ levels, the model will have $K-1$ indicator variables, and the linear form will be:

$$Y_i = \beta_0 + \beta_1 1_{[k_i = \text{"B"}]}(k_i) + \beta_2 1_{[k_i = \text{"C"}]}(k_i) + \dots + \beta_{K-1} 1_{[k_i = \text{"K"}]}(k_i) + \epsilon_i$$

Each indicator variable acts as a binary encoding for the categorical variable, taking the value 1 if the observation belongs to the corresponding category and 0 otherwise. **This model allows for comparisons between the different groups represented by the categorical variable: $\beta_0$ is the average outcome in the omitted baseline category, and each remaining coefficient is the difference between its category's average and that baseline.**

**Key takeaways**:

*   **Indicator variables in multiple linear regression models allow for comparisons between different groups or categories.**
*   **The interaction between a continuous and an indicator variable allows the relationship between the continuous predictor and the outcome to vary across groups.**
*   **A multiple linear regression model based solely on indicator variables represents a series of parallel horizontal lines, each corresponding to a category of the categorical variable.**

# 2

#### Outcome and Predictor Variables

* **Outcome Variable:** The desired outcome the company wants to predict is **sales**. The goal is to understand how advertising spending affects sales.

* **Predictor Variables:** The factors influencing sales that the company can manipulate are:
    * **TV Advertising Spending:** The amount invested in TV ads.
    * **Online Advertising Spending:** The amount invested in online ads.

#### Meaningful Interactions
The effectiveness of one advertising medium might depend on the spending on the other, creating a synergistic or antagonistic effect. For example:

* **Synergy:** High online advertising might reinforce the message of a TV ad, leading to greater sales than expected from simply adding the effects of each medium.
* **Antagonism:** Excessive use of both mediums might saturate the market, resulting in diminishing returns.

#### Linear Forms and Predictions

* **Yᵢ**: Sales for observation **i**.
* **xᵢ**: TV ad spending for observation **i**.
* **zᵢ**: Online ad spending for observation **i**.
* **εᵢ**: Error term representing random variability.
* **β₀**: Intercept, representing baseline sales.
* **β₁**: Coefficient for TV ad spending.
* **β₂**: Coefficient for online ad spending.
* **β₃**: Coefficient for the interaction term.

The linear forms for predicting sales are:

* **Without the interaction:**

    Yᵢ = β₀ + β₁ * xᵢ + β₂ * zᵢ + εᵢ

* **With the interaction:**

    Yᵢ = β₀ + β₁ * xᵢ + β₂ * zᵢ + β₃ * (xᵢ * zᵢ) + εᵢ

#### Making Predictions:

1. **Estimate Coefficients:** Use regression analysis to determine the values of **β₀**, **β₁**, **β₂**, and **β₃** from historical data.

2. **Plug in Values:** Substitute the values for TV and online ad spending (**xᵢ** and **zᵢ**) into the appropriate formula.

3. **Calculate Predicted Sales:** Perform the arithmetic in the equation to obtain the predicted sales (**Yᵢ**).
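
The three steps above can be sketched as a small function. The coefficient values below are hypothetical placeholders, not estimates from any data, and `predict_sales` is a name introduced only for this example.

```python
def predict_sales(tv, online, b0, b1, b2, b3=0.0):
    """Predicted sales; b3=0 gives the model without the interaction."""
    return b0 + b1 * tv + b2 * online + b3 * tv * online

# Hypothetical coefficients, not estimates from any real data.
b0, b1, b2, b3 = 10.0, 0.5, 0.25, 0.01

additive = predict_sales(tv=100, online=40, b0=b0, b1=b1, b2=b2)
interaction = predict_sales(tv=100, online=40, b0=b0, b1=b1, b2=b2, b3=b3)
print(additive)      # 10 + 0.5*100 + 0.25*40
print(interaction)   # the same, plus 0.01 * 100 * 40
```

With these placeholder values the two predictions differ by exactly the interaction contribution `b3 * tv * online`.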

#### Differences in Model Predictions

The primary distinction between the models lies in how they handle the relationship between the predictor variables:

* **Model without Interaction:** It assumes an independent, **additive** relationship between TV and online advertising. The impact of each variable on sales is constant, regardless of the other variable's value. Predictions reflect the sum of each predictor's individual effect.

* **Model with Interaction:** It accounts for a potential **synergistic or antagonistic relationship** between the advertising mediums. The effect of one medium on sales varies depending on the spending on the other medium. Predictions adjust for the combined effect of the two variables working together.
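
One way to see the difference is to write out how predicted sales change with TV spending in each model, using the symbols defined earlier. This is a worked restatement of the two linear forms:

$$\frac{\partial\, E[Y_i]}{\partial x_i} = \beta_1 \quad \text{(without the interaction)}$$

$$\frac{\partial\, E[Y_i]}{\partial x_i} = \beta_1 + \beta_3 z_i \quad \text{(with the interaction)}$$

A positive $\beta_3$ corresponds to synergy (TV spending is worth more when online spending is higher), and a negative $\beta_3$ corresponds to antagonism.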

#### Binary Predictor Variables
* **xᵢ = 1** if TV ad spending is "high", and **xᵢ = 0** if it's "low".
* **zᵢ = 1** if online ad spending is "high", and **zᵢ = 0** if it's "low".

* **Without the interaction:**
    Yᵢ = β₀ + β₁ * xᵢ + β₂ * zᵢ + εᵢ

* **With the interaction:**
    Yᵢ = β₀ + β₁ * xᵢ + β₂ * zᵢ + β₃ * (xᵢ * zᵢ) + εᵢ

* **β₀** represents baseline sales when both TV and online ad spending are "low".
* **β₁** represents the effect on sales of increasing TV spending from "low" to "high", keeping online spending constant.
* **β₂** represents the effect on sales of increasing online spending from "low" to "high", keeping TV spending constant.
* **β₃** captures the interaction, measuring the additional effect on sales when both TV and online spending are "high" compared to the sum of their individual effects.
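
With binary predictors, the interaction model has one expected value per combination of spending levels. Plugging each combination into the form above gives:

$$
\begin{aligned}
x_i = 0,\ z_i = 0: &\quad E[Y_i] = \beta_0 \\
x_i = 1,\ z_i = 0: &\quad E[Y_i] = \beta_0 + \beta_1 \\
x_i = 0,\ z_i = 1: &\quad E[Y_i] = \beta_0 + \beta_2 \\
x_i = 1,\ z_i = 1: &\quad E[Y_i] = \beta_0 + \beta_1 + \beta_2 + \beta_3
\end{aligned}
$$

The model without the interaction is the special case $\beta_3 = 0$, which forces the "high/high" cell to equal the baseline plus the two separate effects.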

#### Making Predictions with Binary Variables:

The process for making predictions remains the same, but you now plug in 0 or 1 for **xᵢ** and **zᵢ** based on whether the corresponding advertising spending is "low" or "high". The differences in predictions between models with and without the interaction term are similar to the continuous case. The model with the interaction adjusts the prediction based on the combined effect of both advertising categories, while the model without interaction assumes a purely additive relationship.

It's important to note that the choice between using continuous or binary predictor variables depends on the specific context of your data and research question. Continuous variables provide more granular information, while binary variables offer a simplified representation. The decision should be based on what best represents the underlying relationship between advertising spending and sales in your scenario.

___ 
NotebookLM:
## Key Takeaways from Linear Regression Discussion

The main point of the discussion is to illustrate **how linear regression works** and how to **interpret the results**. The response emphasizes the importance of understanding the relationship between **predictor variables** (what influences the outcome) and the **outcome variable** (what you're trying to predict). 

### Simple vs. Multiple Linear Regression

*   **Simple Linear Regression** involves one predictor variable, while **Multiple Linear Regression** utilizes two or more. Using multiple predictors lets the model account for more of the factors that influence the outcome, which can improve fit, though adding predictors also increases the risk of overfitting.

### Interaction Terms: Synergy and Antagonism

The concept of **interactions** is crucial. It describes how the combined effect of two predictor variables can be more (synergy) or less (antagonism) than simply adding their individual effects.

### Model Interpretation and Validation

The response highlights the importance of:

*   **Understanding coefficients**: Each coefficient in the linear model represents the change in the outcome variable for a one-unit increase in the corresponding predictor variable, while holding other predictors constant.
*   **Distinguishing between theoretical and fitted models**: The **theoretical model** is the assumed relationship between variables, while the **fitted model** is based on the observed data. 
*   **Assessing model performance**: R-squared is a key metric to evaluate how well the model fits the data. A higher R-squared indicates a better fit.
*   **Validating model generalizability**: It is crucial to check if the model can accurately predict outcomes for new data (out-of-sample performance) rather than just fitting the data it was trained on. 

###  Linear Forms and Equations

The response provides the **linear forms** (equations) for different scenarios, showcasing how to represent the relationship between variables mathematically.

### Understanding the Assumptions of Linear Regression

The discussion emphasizes the importance of understanding the **assumptions** underlying linear regression, such as linearity, normality of errors, and independence of errors. Violating these assumptions can affect the accuracy of your results. 

### Continuous vs. Binary Predictor Variables

It is explained that the choice between continuous and binary predictor variables depends on the context and the research question. Continuous variables offer more detailed information, while binary variables simplify the representation. 

### Application to the Course Project

The response connects these concepts to the **Canadian Social Connection Survey**, encouraging you to think about how linear regression could be used to analyze the data. This includes choosing relevant outcome and predictor variables, considering potential interactions, and being aware of the assumptions of the model. 


# 3

In [1]:
# Here's an example of how you can do this
import pandas as pd
import statsmodels.formula.api as smf

url = "https://raw.githubusercontent.com/KeithGalli/pandas/master/pokemon_data.csv"
pokeaman = pd.read_csv(url).fillna('None')

pokeaman['str8fyre'] = (pokeaman['Type 1']=='Fire').astype(int)
linear_model_specification_formula = \
'str8fyre ~ Attack*Legendary + Defense*I(Q("Type 2")=="None") + C(Generation)'
log_reg_fit = smf.logit(linear_model_specification_formula, data=pokeaman).fit()
log_reg_fit.summary()

Optimization terminated successfully.
         Current function value: 0.228109
         Iterations 8


Dep. Variable:,str8fyre,No. Observations:,800.0
Model:,Logit,Df Residuals:,788.0
Method:,MLE,Df Model:,11.0
Date:,"Tue, 12 Nov 2024",Pseudo R-squ.:,0.05156
Time:,04:21:08,Log-Likelihood:,-182.49
converged:,True,LL-Null:,-192.41
Covariance Type:,nonrobust,LLR p-value:,0.04757

,coef,std err,z,P>|z|,[0.025,0.975]
Intercept,-3.2644,0.714,-4.572,0.000,-4.664,-1.865
Legendary[T.True],4.3478,2.179,1.996,0.046,0.078,8.618
"I(Q(""Type 2"") == ""None"")[T.True]",1.5432,0.853,1.810,0.070,-0.128,3.215
C(Generation)[T.2],-0.0574,0.468,-0.123,0.902,-0.975,0.861
C(Generation)[T.3],-0.6480,0.466,-1.390,0.164,-1.561,0.265
C(Generation)[T.4],-0.8255,0.545,-1.516,0.130,-1.893,0.242
C(Generation)[T.5],-0.5375,0.449,-1.198,0.231,-1.417,0.342
C(Generation)[T.6],0.3213,0.477,0.673,0.501,-0.614,1.257
Attack,0.0172,0.006,3.086,0.002,0.006,0.028


# 4

In [2]:
import pandas as pd

url = "https://raw.githubusercontent.com/KeithGalli/pandas/master/pokemon_data.csv"
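# The GitHub page URL https://github.com/KeithGalli/pandas/blob/master/pokemon_data.csv fails here:
# it serves an HTML page rather than the raw CSV, so use the raw.githubusercontent.com URL above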
pokeaman = pd.read_csv(url) 
pokeaman

Unnamed: 0,#,Name,Type 1,Type 2,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Generation,Legendary
0,1,Bulbasaur,Grass,Poison,45,49,49,65,65,45,1,False
1,2,Ivysaur,Grass,Poison,60,62,63,80,80,60,1,False
2,3,Venusaur,Grass,Poison,80,82,83,100,100,80,1,False
3,3,VenusaurMega Venusaur,Grass,Poison,80,100,123,122,120,80,1,False
4,4,Charmander,Fire,,39,52,43,60,50,65,1,False
...,...,...,...,...,...,...,...,...,...,...,...,...
795,719,Diancie,Rock,Fairy,50,100,150,100,150,50,6,True
796,719,DiancieMega Diancie,Rock,Fairy,50,160,110,160,110,110,6,True
797,720,HoopaHoopa Confined,Psychic,Ghost,80,110,60,150,130,70,6,True
798,720,HoopaHoopa Unbound,Psychic,Dark,80,160,60,170,130,80,6,True


In [3]:
import statsmodels.formula.api as smf

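# Model 1: additive, a common Sp. Def slope with a separate offset per Generation
model1_spec = smf.ols(formula='HP ~ Q("Sp. Def") + C(Generation)', data=pokeaman)
# Model 2: adds the interaction so each Generation gets its own Sp. Def slope
model2_spec = smf.ols(formula='HP ~ Q("Sp. Def") + C(Generation) + Q("Sp. Def"):C(Generation)', data=pokeaman)
# Equivalent shorthand: `a * b` expands to `a + b + a:b`, so this reassignment specifies the same model
model2_spec = smf.ols(formula='HP ~ Q("Sp. Def") * C(Generation)', data=pokeaman)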

model2_fit = model2_spec.fit()
model2_fit.summary()

Dep. Variable:,HP,R-squared:,0.176
Model:,OLS,Adj. R-squared:,0.164
Method:,Least Squares,F-statistic:,15.27
Date:,"Tue, 12 Nov 2024",Prob (F-statistic):,3.5e-27
Time:,04:52:04,Log-Likelihood:,-3649.4
No. Observations:,800,AIC:,7323.0
Df Residuals:,788,BIC:,7379.0
Df Model:,11,,
Covariance Type:,nonrobust,,

,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,26.8971,5.246,5.127,0.000,16.599,37.195
C(Generation)[T.2],20.0449,7.821,2.563,0.011,4.692,35.398
C(Generation)[T.3],21.3662,6.998,3.053,0.002,7.629,35.103
C(Generation)[T.4],31.9575,8.235,3.881,0.000,15.793,48.122
C(Generation)[T.5],9.4926,7.883,1.204,0.229,-5.982,24.968
C(Generation)[T.6],22.2693,8.709,2.557,0.011,5.173,39.366
"Q(""Sp. Def"")",0.5634,0.071,7.906,0.000,0.423,0.703
"Q(""Sp. Def""):C(Generation)[T.2]",-0.2350,0.101,-2.316,0.021,-0.434,-0.036
"Q(""Sp. Def""):C(Generation)[T.3]",-0.3067,0.093,-3.300,0.001,-0.489,-0.124

Omnibus:,337.229,Durbin-Watson:,1.505
Prob(Omnibus):,0.0,Jarque-Bera (JB):,2871.522
Skew:,1.684,Prob(JB):,0.0
Kurtosis:,11.649,Cond. No.,1400.0


## Resolving the Apparent Contradiction Between R-squared and Hypothesis Testing

The apparent contradiction between the low \( R^2 \) value (17.6%) and the strong evidence against the null hypothesis (as indicated by low p-values for many coefficients) can be resolved by recognizing that these metrics assess different aspects of the model.

### Interpreting \( R^2 \) and Explanatory Power
The \( R^2 \) value of 17.6% tells us that only about 17.6% of the variability in the outcome variable \( y \) (in this case, HP) is explained by the model. This metric speaks to the overall explanatory power of the model. When \( R^2 \) is low, it suggests that the model does not capture a large proportion of the variability in the response variable. This can happen when:

- **Outcome Variability**: The outcome \( y \) is highly variable or complex, possibly influenced by many factors outside of those included in the model.
- **Relevant but Unincluded Predictors**: Important predictor variables are omitted from the model.

A low \( R^2 \) can be common in models with a high degree of noise or where the predictors only capture some aspects of the response variable’s behavior.

### Interpreting Coefficients and Hypothesis Testing
Despite the low \( R^2 \), strong evidence against the null hypothesis for certain coefficients (indicated by low p-values) suggests that there are statistically significant relationships between the outcome variable and certain predictors, even if these predictors only explain a small portion of the overall variation in \( y \). In this model:

- **Significance of Coefficients**: A low p-value for a coefficient is evidence against the null hypothesis that the coefficient is zero, given the other terms in the model. Several coefficients here have low p-values (e.g., `Q("Sp. Def")` and the Generation 2, 3, and 4 offsets), while others do not (the Generation 5 offset has p = 0.229). A coefficient can be estimated precisely enough to be distinguished from zero even when the model leaves most of the variability in \( y \) unexplained.
- **Magnitude of Coefficients**: Several Generation offsets are larger than 10 (for example, 31.96 for Generation 4). Because the model includes the interaction, each offset is the estimated difference in average HP from Generation 1 when Sp. Def is 0, so it should not be read as a typical difference between generations. Coefficient size also depends on the units of the predictor, so a large coefficient does not by itself imply that the model captures much of the variability in \( y \).

### Why \( R^2 \) and p-values Are Not Contradictory
- **Different Aspects of Model Fit**: \( R^2 \) measures the model’s explanatory power, whereas p-values reflect the statistical significance of individual predictors. A predictor can have a significant effect on \( y \) (small p-value) without the model necessarily explaining a large portion of the variation in \( y \).
- **Addressing Different Hypotheses**: \( R^2 \) focuses on how well the model as a whole accounts for the outcome's variation, whereas p-values assess whether each predictor's effect on the outcome differs significantly from zero (null hypothesis).
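
A quick simulation can show both properties at once. The sketch below uses synthetic data (the effect size, sample size, and seed are assumptions) in which a weak but real slope is buried in noise: the R-squared is low, yet the t statistic for the slope is large because the sample is big.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 5000
x = rng.normal(size=n)
y = 0.2 * x + rng.normal(size=n)   # weak but real effect, lots of noise

X = np.column_stack([np.ones(n), x])
beta = np.linalg.lstsq(X, y, rcond=None)[0]
residuals = y - X @ beta

# Proportion of variation explained by the fitted line.
r_squared = 1 - np.sum(residuals**2) / np.sum((y - y.mean())**2)

# Standard error of the slope from the usual OLS formula.
sigma2 = np.sum(residuals**2) / (n - 2)
cov_beta = sigma2 * np.linalg.inv(X.T @ X)
t_slope = beta[1] / np.sqrt(cov_beta[1, 1])

print("R-squared:", round(r_squared, 3))
print("t statistic for slope:", round(t_slope, 1))
```

A t statistic this far from zero corresponds to a very small p-value, even though the line explains only a few percent of the variation.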

### Example Context with Model
In the model specification (`HP ~ Q("Sp. Def") * C(Generation)`), the interaction term between Sp. Def and Generation allows each generation to have its own slope for Sp. Def. This interaction structure yields significant interaction coefficients for some generations, indicating that the association between Sp. Def and HP differs across generations. However, because the predictors collectively explain only 17.6% of the total variance in HP, we know there are other unexplained factors influencing HP beyond Sp. Def and Generation.
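
The generation-specific slopes can be read off the summary table by adding each interaction coefficient to the baseline slope. The sketch below uses only the coefficient rows visible in the output above (Generation 1 is the baseline under the default treatment coding); the variable names are introduced for this example.

```python
# Coefficients copied from the summary table above (only the rows shown there).
slope_gen1 = 0.5634                      # Q("Sp. Def"): slope for baseline Generation 1
interaction = {2: -0.2350, 3: -0.3067}   # Q("Sp. Def"):C(Generation)[T.k]

# Slope for generation k = baseline slope + interaction coefficient for k.
slopes = {1: slope_gen1}
for gen, shift in interaction.items():
    slopes[gen] = slope_gen1 + shift

for gen, slope in slopes.items():
    print(f"Generation {gen}: Sp. Def slope = {slope:.4f}")
```

So the estimated Sp. Def slope is flatter in Generations 2 and 3 than in Generation 1.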

### Conclusion
Low \( R^2 \) and significant coefficients (small p-values) are not contradictory but instead reveal different insights:
- The model's predictive/explanatory power is limited (low \( R^2 \)).
- Certain predictors, nonetheless, have a statistically significant effect on the outcome (low p-values).

These aspects together suggest that while some predictors have meaningful associations with the outcome, a substantial amount of variation in the response variable remains unexplained by this model.

chat: https://chatgpt.com/share/6732e417-5550-800b-8b68-3f9675df9484

summary: A low R-squared (17.6%) indicates that the model explains only a small portion of the variation in the outcome, while low p-values for many coefficients provide evidence of relationships between those predictors and the outcome. This isn’t contradictory: R-squared reflects overall explanatory power, while p-values indicate the statistical significance of individual coefficients. So, while certain predictors are significantly associated with the outcome, much of its variability remains unexplained.


# 5

In [9]:
import numpy as np
from sklearn.model_selection import train_test_split

fifty_fifty_split_size = int(pokeaman.shape[0]*0.5)

# Replace NaN (missing values, which occur in the "Type 2" column) with the string "None"
pokeaman.fillna('None', inplace=True)

np.random.seed(130)
pokeaman_train,pokeaman_test = \
  train_test_split(pokeaman, train_size=fifty_fifty_split_size)
pokeaman_train

Unnamed: 0,#,Name,Type 1,Type 2,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Generation,Legendary
370,338,Solrock,Rock,Psychic,70,95,85,55,65,70,3,False
6,6,Charizard,Fire,Flying,78,84,78,109,85,100,1,False
242,224,Octillery,Water,,75,105,75,105,75,45,2,False
661,600,Klang,Steel,,60,80,95,70,85,50,5,False
288,265,Wurmple,Bug,,45,45,35,20,30,20,3,False
...,...,...,...,...,...,...,...,...,...,...,...,...
522,471,Glaceon,Ice,,65,60,110,130,95,65,4,False
243,225,Delibird,Ice,Flying,45,55,45,65,45,75,2,False
797,720,HoopaHoopa Confined,Psychic,Ghost,80,110,60,150,130,70,6,True
117,109,Koffing,Poison,,40,65,95,60,45,35,1,False


1: 
- Purpose: Prepares the data by filling missing values in the "Type 2" column with 'None', sets a random seed for reproducibility, and splits the data into a 50-50 train-test split.
- Key Points:
1. Ensures that missing values in the "Type 2" column won’t interfere with modeling.
2. Uses a random seed to make results consistent across runs.
3. The split assigns half of the dataset to training and half to testing.
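
For intuition about what a seeded 50-50 split does, here is a NumPy-only sketch. It uses a different random number generator from `train_test_split`, so it would not select the same rows as the cell above; the row count of 800 is taken from the data shown earlier.

```python
import numpy as np

n_rows = 800                       # row count reported for the dataset above
rng = np.random.default_rng(130)   # fixed seed, so the shuffle is repeatable
shuffled = rng.permutation(n_rows)

# First half of the shuffled row positions for training, the rest for testing.
train_idx = shuffled[: n_rows // 2]
test_idx = shuffled[n_rows // 2 :]
print(len(train_idx), len(test_idx))
```

With a DataFrame, the two index arrays could be passed to `.iloc[...]` to obtain the training and testing rows.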

In [5]:
model_spec3 = smf.ols(formula='HP ~ Attack + Defense', 
                      data=pokeaman_train)
model3_fit = model_spec3.fit()
model3_fit.summary()

Dep. Variable:,HP,R-squared:,0.148
Model:,OLS,Adj. R-squared:,0.143
Method:,Least Squares,F-statistic:,34.4
Date:,"Tue, 12 Nov 2024",Prob (F-statistic):,1.66e-14
Time:,05:16:30,Log-Likelihood:,-1832.6
No. Observations:,400,AIC:,3671.0
Df Residuals:,397,BIC:,3683.0
Df Model:,2,,
Covariance Type:,nonrobust,,

,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,42.5882,3.580,11.897,0.000,35.551,49.626
Attack,0.2472,0.041,6.051,0.000,0.167,0.327
Defense,0.1001,0.045,2.201,0.028,0.011,0.190

Omnibus:,284.299,Durbin-Watson:,2.006
Prob(Omnibus):,0.0,Jarque-Bera (JB):,5870.841
Skew:,2.72,Prob(JB):,0.0
Kurtosis:,20.963,Cond. No.,343.0


2:
- Purpose: Specifies a multiple linear regression model with two predictors to predict HP based on Attack and Defense.
- Key Points:
1. HP is the dependent variable, with Attack and Defense as predictors.
2. .fit() estimates the regression coefficients, and .summary() provides the model’s performance metrics, including R-squared, p-values for predictors, and coefficients.
3. With only two additive terms, the model is a good starting point to understand the relationship between HP, Attack, and Defense.

In [10]:
yhat_model3 = model3_fit.predict(pokeaman_test)
y = pokeaman_test.HP
print("'In sample' R-squared:    ", model3_fit.rsquared)
print("'Out of sample' R-squared:", np.corrcoef(y,yhat_model3)[0,1]**2)

'In sample' R-squared:     0.14771558304519894
'Out of sample' R-squared: 0.21208501873920738


3: 
- Purpose: Evaluates the model’s performance on both training (in-sample) and testing (out-of-sample) data using R2.
- Key Points:
1. model3_fit.rsquared shows the R2 on training data (in-sample).
2. Squaring the correlation between actual (y) and predicted values (yhat_model3) on the test data provides out-of-sample R2, indicating generalizability.
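3. An out-of-sample R-squared that is much lower than the in-sample value suggests overfitting. Here the out-of-sample value (0.212) is higher than the in-sample value (0.148), so this comparison gives no sign of overfitting for this model.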
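
One caveat about the squared-correlation version of out-of-sample R-squared is that it ignores any shift or rescaling of the predictions. The sketch below contrasts it with the `1 - SSE/SST` definition on a small hypothetical example; both function names are introduced here for illustration.

```python
import numpy as np

def r2_corr(y, yhat):
    """Squared correlation between actual and predicted values."""
    return np.corrcoef(y, yhat)[0, 1] ** 2

def r2_sse(y, yhat):
    """1 - SSE/SST, which can be negative for poor predictions."""
    return 1 - np.sum((y - yhat) ** 2) / np.sum((y - y.mean()) ** 2)

# Hypothetical example: predictions that track y perfectly but are badly miscalibrated.
y = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
yhat = 2 * y + 3

print(r2_corr(y, yhat))   # close to 1: the ordering and spacing are right
print(r2_sse(y, yhat))    # negative: the predicted values themselves are far off
```

For an ordinary least squares fit with an intercept, the two definitions agree on the training data, so the distinction matters mainly out of sample.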

In [7]:
model4_linear_form = 'HP ~ Attack * Defense * Speed * Legendary'
model4_linear_form += ' * Q("Sp. Def") * Q("Sp. Atk")'
# DO NOT try adding '* C(Generation) * C(Q("Type 1")) * C(Q("Type 2"))'
# That's 6*18*19 = 2052 possible interaction combinations...
# ...a huge number that will blow up your computer

model4_spec = smf.ols(formula=model4_linear_form, data=pokeaman_train)
model4_fit = model4_spec.fit()
model4_fit.summary()

Dep. Variable:,HP,R-squared:,0.467
Model:,OLS,Adj. R-squared:,0.369
Method:,Least Squares,F-statistic:,4.764
Date:,"Tue, 12 Nov 2024",Prob (F-statistic):,4.23e-21
Time:,05:16:48,Log-Likelihood:,-1738.6
No. Observations:,400,AIC:,3603.0
Df Residuals:,337,BIC:,3855.0
Df Model:,62,,
Covariance Type:,nonrobust,,

,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,521.5715,130.273,4.004,0.000,265.322,777.821
Legendary[T.True],-6.1179,2.846,-2.150,0.032,-11.716,-0.520
Attack,-8.1938,2.329,-3.518,0.000,-12.775,-3.612
Attack:Legendary[T.True],-1224.9610,545.105,-2.247,0.025,-2297.199,-152.723
Defense,-6.1989,2.174,-2.851,0.005,-10.475,-1.923
Defense:Legendary[T.True],-102.4030,96.565,-1.060,0.290,-292.350,87.544
Attack:Defense,0.0985,0.033,2.982,0.003,0.034,0.164
Attack:Defense:Legendary[T.True],14.6361,6.267,2.336,0.020,2.310,26.963
Speed,-7.2261,2.178,-3.318,0.001,-11.511,-2.942

Omnibus:,214.307,Durbin-Watson:,1.992
Prob(Omnibus):,0.0,Jarque-Bera (JB):,2354.664
Skew:,2.026,Prob(JB):,0.0
Kurtosis:,14.174,Cond. No.,1.2e+16


4. 
- Purpose: Specifies and fits a complex linear regression model with multiple interaction terms among variables, aiming to capture nuanced relationships affecting HP.
- Key Points:
1. '*' in the formula generates every main effect together with every interaction among the listed predictors, up to the highest-order interaction.
2. This model excludes interactions with categorical variables like Generation and Type to prevent an unmanageable number of terms.
3. The complexity makes it powerful for capturing interactions but risks overfitting and can be computationally intensive.
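
To get a feel for how quickly `*` expands, the sketch below enumerates the terms implied by chaining six predictors with `*`. It only counts term names and does not call the formula machinery itself. Each of these predictors is numeric or a single True/False indicator, so each term corresponds to one column.

```python
from itertools import combinations

predictors = ["Attack", "Defense", "Speed", "Legendary", "Sp. Def", "Sp. Atk"]

# Every non-empty subset of the predictors becomes one term (main effect or interaction).
terms = [":".join(combo)
         for order in range(1, len(predictors) + 1)
         for combo in combinations(predictors, order)]

print(len(terms))   # 2**6 - 1 = 63 terms, plus the intercept
print(terms[:8])
```

That is 63 terms fitted to 400 training rows. The summary above reports Df Model 62 rather than 63, which may indicate that some of these columns are linearly dependent in the training data; that reading is an inference, not something the output states.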

In [8]:
yhat_model4 = model4_fit.predict(pokeaman_test)
y = pokeaman_test.HP
print("'In sample' R-squared:    ", model4_fit.rsquared)
print("'Out of sample' R-squared:", np.corrcoef(y,yhat_model4)[0,1]**2)

'In sample' R-squared:     0.46709442115833855
'Out of sample' R-squared: 0.002485342598992873


5.
- Purpose: This code evaluates the complex model’s ability to generalize to the test data by comparing its in-sample and out-of-sample performance. The in-sample R-squared (0.467) is far higher than the out-of-sample value (0.002), which indicates that the complex model overfits the training data and does not generalize.

- Key Points:
1. Prediction: The model (model4_fit) is used to predict HP on the test set (pokeaman_test).
2. In-Sample R2 : The in-sample R2 is calculated from the training data, as done previously.
3. Out-of-Sample R2: The out-of-sample R2 is computed using the correlation between the predicted (yhat_model4) and actual (y) values on the test data.
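
The same pattern can be reproduced on purely synthetic data. In the sketch below (all sizes and the seed are assumptions), the outcome is unrelated to every feature, yet a model with many terms relative to the number of training rows still achieves a high in-sample R-squared, while the out-of-sample value stays near zero.

```python
import numpy as np

rng = np.random.default_rng(7)
n_train, n_test, n_features = 50, 200, 40

# Synthetic assumption: the outcome is unrelated to every feature.
X_train = rng.normal(size=(n_train, n_features))
X_test = rng.normal(size=(n_test, n_features))
y_train = rng.normal(size=n_train)
y_test = rng.normal(size=n_test)

def add_intercept(X):
    """Prepend a column of ones to the feature matrix."""
    return np.column_stack([np.ones(len(X)), X])

beta = np.linalg.lstsq(add_intercept(X_train), y_train, rcond=None)[0]

# Squared correlation between actual and predicted values, as in the cells above.
r2_in = np.corrcoef(y_train, add_intercept(X_train) @ beta)[0, 1] ** 2
r2_out = np.corrcoef(y_test, add_intercept(X_test) @ beta)[0, 1] ** 2

print("in sample:    ", round(r2_in, 3))
print("out of sample:", round(r2_out, 3))
```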

chat: https://chatgpt.com/share/6732e5dd-e914-800b-916a-f399c8810b94

summary: 

Data Preprocessing & Splitting: Split the pokeaman dataset into training and test sets (50-50), handling missing values by replacing them with 'None'.

Baseline Linear Regression Model: A multiple linear regression model was built to predict HP using Attack and Defense, followed by displaying a summary with model details like coefficients and R-squared.

Model Evaluation: Calculated both "in-sample" and "out-of-sample" R-squared to assess model performance, checking for overfitting.

Complex Model with Interactions: A more complex regression model was specified, involving interactions between multiple variables like Attack, Defense, Speed, and others, and then fit to the training data.

Evaluation of Complex Model: Predicted HP for the test data and computed R-squared for both the training and test sets, comparing performance to check for overfitting or generalization.

Each part involved building, fitting, and evaluating predictive models using linear regression techniques, with an emphasis on checking how well the models generalize to unseen data.

# 6

In [11]:
# "Cond. No." WAS 343.0 WITHOUT centering and scaling
model3_fit.summary() 

Dep. Variable:,HP,R-squared:,0.148
Model:,OLS,Adj. R-squared:,0.143
Method:,Least Squares,F-statistic:,34.4
Date:,"Tue, 12 Nov 2024",Prob (F-statistic):,1.66e-14
Time:,05:39:38,Log-Likelihood:,-1832.6
No. Observations:,400,AIC:,3671.0
Df Residuals:,397,BIC:,3683.0
Df Model:,2,,
Covariance Type:,nonrobust,,

,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,42.5882,3.580,11.897,0.000,35.551,49.626
Attack,0.2472,0.041,6.051,0.000,0.167,0.327
Defense,0.1001,0.045,2.201,0.028,0.011,0.190

Omnibus:,284.299,Durbin-Watson:,2.006
Prob(Omnibus):,0.0,Jarque-Bera (JB):,5870.841
Skew:,2.72,Prob(JB):,0.0
Kurtosis:,20.963,Cond. No.,343.0


In [12]:
from patsy import center, scale

model3_linear_form_center_scale = \
  'HP ~ scale(center(Attack)) + scale(center(Defense))' 
model_spec3_center_scale = smf.ols(formula=model3_linear_form_center_scale,
                                   data=pokeaman_train)
model3_center_scale_fit = model_spec3_center_scale.fit()
model3_center_scale_fit.summary()
# "Cond. No." is NOW 1.66 due to centering and scaling

Dep. Variable:,HP,R-squared:,0.148
Model:,OLS,Adj. R-squared:,0.143
Method:,Least Squares,F-statistic:,34.4
Date:,"Tue, 12 Nov 2024",Prob (F-statistic):,1.66e-14
Time:,05:39:58,Log-Likelihood:,-1832.6
No. Observations:,400,AIC:,3671.0
Df Residuals:,397,BIC:,3683.0
Df Model:,2,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,69.3025,1.186,58.439,0.000,66.971,71.634
scale(center(Attack)),8.1099,1.340,6.051,0.000,5.475,10.745
scale(center(Defense)),2.9496,1.340,2.201,0.028,0.315,5.585

0,1,2,3
Omnibus:,284.299,Durbin-Watson:,2.006
Prob(Omnibus):,0.0,Jarque-Bera (JB):,5870.841
Skew:,2.72,Prob(JB):,0.0
Kurtosis:,20.963,Cond. No.,1.66


In [13]:
model4_linear_form_CS = 'HP ~ scale(center(Attack)) * scale(center(Defense))'
model4_linear_form_CS += ' * scale(center(Speed)) * Legendary' 
model4_linear_form_CS += ' * scale(center(Q("Sp. Def"))) * scale(center(Q("Sp. Atk")))'
# Legendary is an indicator, so we don't center and scale that

model4_CS_spec = smf.ols(formula=model4_linear_form_CS, data=pokeaman_train)
model4_CS_fit = model4_CS_spec.fit()
model4_CS_fit.summary().tables[-1]  # Cond. No. is 1.54e+16 (about 15,400,000,000,000,000)

# The condition number is still bad even after centering and scaling

0,1,2,3
Omnibus:,214.307,Durbin-Watson:,1.992
Prob(Omnibus):,0.0,Jarque-Bera (JB):,2354.663
Skew:,2.026,Prob(JB):,0.0
Kurtosis:,14.174,Cond. No.,1.54e+16


In [14]:
# Just as the condition number was very bad to start with
model4_fit.summary().tables[-1]  # Cond. No. is 12,000,000,000,000,000


0,1,2,3
Omnibus:,214.307,Durbin-Watson:,1.992
Prob(Omnibus):,0.0,Jarque-Bera (JB):,2354.664
Skew:,2.026,Prob(JB):,0.0
Kurtosis:,14.174,Cond. No.,1.2e+16


The **linear form** in `model4_linear_form` defines the relationships between predictors (input variables) and the outcome (dependent variable) in a regression model. It specifies how each predictor or combination of predictors relates to the outcome.

#### Step 1: Creating Predictor Variables in the "Design Matrix" (`model4_spec.exog`)
The **design matrix** (`model4_spec.exog`) consists of the predictor variables (independent variables) used to predict the outcome variable (`model4_spec.endog`). Each column corresponds to a predictor variable, and the rows represent individual observations or data points.

In the centered and scaled version, `model4_linear_form_CS`, every continuous predictor is wrapped in `scale(center(...))`:

`HP ~ scale(center(Attack)) * scale(center(Defense)) * scale(center(Speed)) * Legendary * scale(center(Q("Sp. Def"))) * scale(center(Q("Sp. Atk")))`

These transformations standardize the continuous variables (e.g., `Attack`, `Defense`) in two steps, while the indicator `Legendary` is left as is:

- **Centered Variables**: Subtract the mean from each predictor variable.
- **Scaled Variables**: Divide each centered variable by its standard deviation, resulting in variables with unit variance.

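In a formula, `a * b` expands to `a + b + a:b`. A term like `scale(center(Attack)) * scale(center(Defense))` therefore contributes both main-effect columns plus a new column equal to their product, which represents the interaction between `Attack` and `Defense`. Each additional factor in the product roughly doubles the number of columns, increasing the number of predictors and creating more complex relationships.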
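To make the size of this design matrix concrete, the sketch below enumerates the terms that a full `*` product of six predictors expands to. It uses only the Python standard library and shortened, illustrative labels in place of the real `scale(center(...))` column names. It assumes each predictor contributes a single column, which holds for numeric columns and for a two-level indicator such as `Legendary`.

```python
from itertools import combinations

# Shortened labels standing in for the six predictors in the formula
# (illustrative names only; the real column names include scale(center(...))).
predictors = ["Attack", "Defense", "Speed", "Legendary", "Sp. Def", "Sp. Atk"]

# A full product a * b * ... expands to every main effect and every
# interaction of every order, i.e. every non-empty subset of the predictors.
terms = [
    ":".join(subset)
    for order in range(1, len(predictors) + 1)
    for subset in combinations(predictors, order)
]

n_columns = 1 + len(terms)  # plus one column for the intercept
print(n_columns)            # number of design-matrix columns
print(terms[:8])            # main effects first, then two-way interactions
```

Under that one-column-per-predictor assumption, the six-way product yields 63 terms plus an intercept, which is a large number of coefficients to estimate from 400 training rows.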

#### Step 2: Multicollinearity in the Design Matrix (`model4_spec.exog`)
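**Multicollinearity** occurs when predictor variables in the design matrix are highly correlated. This can be assessed by checking the correlation between the columns (predictors) using `np.corrcoef(model4_spec.exog, rowvar=False)`; by default `np.corrcoef` treats rows, not columns, as variables. The constant intercept column has zero variance, so its correlations are undefined (`nan`).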

When some predictors are highly correlated, they provide redundant information to the model. For example, if `Attack`, `Defense`, and `Speed` are highly correlated, it becomes difficult for the model to distinguish their individual effects on the outcome. The columns are then **nearly linearly dependent**, which makes the coefficient estimates poorly determined.
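The following is a minimal sketch of both diagnostics on synthetic data (not the Pokémon data): the column correlation matrix and the condition number of a design matrix. The variable names are made up for illustration. Note that `np.corrcoef` treats rows as variables unless `rowvar=False` is passed, and that `np.linalg.cond` here is the ratio of the largest to the smallest singular value, the same kind of quantity as the `Cond. No.` shown in the summaries.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 400

# Synthetic predictors: x2 is almost a copy of x1,
# while x3 is generated independently.
x1 = rng.normal(size=n)
x2 = x1 + 0.01 * rng.normal(size=n)
x3 = rng.normal(size=n)

# Design matrix with an intercept column, laid out like `exog`:
# rows are observations, columns are predictors.
X = np.column_stack([np.ones(n), x1, x2, x3])

# rowvar=False treats columns as variables; the constant intercept column
# is dropped first because its correlation is undefined.
corr = np.corrcoef(X[:, 1:], rowvar=False)
print(np.round(corr, 3))

# Condition number with and without the redundant column x2.
cond_with_x2 = np.linalg.cond(X)
cond_without_x2 = np.linalg.cond(X[:, [0, 1, 3]])
print(cond_with_x2, cond_without_x2)
```

Dropping the near-duplicate column brings the condition number down sharply, which is the pattern to look for in a real design matrix.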

#### Step 3: How Multicollinearity Affects Generalization (Out-of-Sample Predictions)
When multicollinearity is present, the model experiences instability in estimating the coefficients. Since predictors are highly correlated, small changes in the data can lead to large fluctuations in the estimated coefficients. This causes the model to be highly sensitive to the specific training data, leading to **overfitting**.

**Overfitting** occurs when the model captures the noise or idiosyncrasies of the training data, rather than the true underlying patterns. As a result, the model may perform well on the training data but poorly on new, unseen data (i.e., it fails to generalize).

In the case of `model4`, the **condition number** is extremely large (about 1.2e16 for the raw predictors and 1.54e16 after centering and scaling). This indicates severe multicollinearity that standardizing does not remove, and it suggests that the model coefficients are unstable. This instability leads to poor generalization and a lack of out-of-sample prediction accuracy. Even if the model fits the training data well, it cannot reliably predict new data because the relationships between the predictors are too complex and unreliable.
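The instability described above can be illustrated with a small simulation. This is a sketch on synthetic data, with a made-up helper `slope_spread`: it refits the same two-predictor regression on many simulated datasets and measures how much the fitted coefficient of `x1` varies when `x2` is nearly a copy of `x1`, compared with when the two are only moderately correlated.

```python
import numpy as np

rng = np.random.default_rng(1)
n, reps = 100, 500

def slope_spread(noise_in_x2):
    """Std. dev. of the fitted x1 coefficient across simulated datasets."""
    slopes = np.empty(reps)
    for r in range(reps):
        x1 = rng.normal(size=n)
        # Small noise_in_x2 makes x2 nearly identical to x1 (collinear).
        x2 = x1 + noise_in_x2 * rng.normal(size=n)
        y = 1.0 + 2.0 * x1 + 2.0 * x2 + rng.normal(size=n)
        X = np.column_stack([np.ones(n), x1, x2])
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        slopes[r] = beta[1]
    return slopes.std()

spread_collinear = slope_spread(noise_in_x2=0.05)  # x2 almost equals x1
spread_moderate = slope_spread(noise_in_x2=1.0)    # x2 only moderately correlated
print(spread_collinear, spread_moderate)
```

The true coefficient is 2.0 in both settings; only the spread of the estimates differs, which is why a collinear model can fit one training set well and still predict poorly on another.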

### Conclusion
To summarize:

- The **linear form** in `model4_linear_form` creates predictor variables through transformations (centering and scaling) and interactions, which are represented as columns in the design matrix (`model4_spec.exog`).
- **Multicollinearity** arises when these predictors are highly correlated, leading to redundancy and instability in the model's coefficient estimates.
- This multicollinearity, reflected in the large condition number, contributes to **overfitting**, where the model captures noise in the training data that does not generalize well to unseen data.

Thus, the presence of multicollinearity in the design matrix makes the model less reliable for out-of-sample predictions, as it becomes overly tuned to the specific patterns of the training data rather than generalizable, underlying relationships.

chat: https://chatgpt.com/share/6732f03a-d010-800b-b8be-b747afb664d9

The model4_linear_form specifies predictor variables, including transformations (centering, scaling) and interactions, which become columns in the design matrix (model4_spec.exog). Multicollinearity occurs when these predictors are highly correlated, leading to redundancy and instability in estimating the model coefficients. This instability causes overfitting, where the model fits the training data well but fails to generalize to new data. A high condition number in the design matrix indicates severe multicollinearity, contributing to poor out-of-sample generalization.


# 7

In [15]:
# Here's something a little more reasonable...
model5_linear_form = 'HP ~ Attack + Defense + Speed + Legendary'
model5_linear_form += ' + Q("Sp. Def") + Q("Sp. Atk")'
model5_linear_form += ' + C(Generation) + C(Q("Type 1")) + C(Q("Type 2"))'

model5_spec = smf.ols(formula=model5_linear_form, data=pokeaman_train)
model5_fit = model5_spec.fit()
model5_fit.summary()

0,1,2,3
Dep. Variable:,HP,R-squared:,0.392
Model:,OLS,Adj. R-squared:,0.313
Method:,Least Squares,F-statistic:,4.948
Date:,"Tue, 12 Nov 2024",Prob (F-statistic):,9.48e-19
Time:,06:07:13,Log-Likelihood:,-1765.0
No. Observations:,400,AIC:,3624.0
Df Residuals:,353,BIC:,3812.0
Df Model:,46,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,10.1046,14.957,0.676,0.500,-19.312,39.521
Legendary[T.True],-3.2717,4.943,-0.662,0.508,-12.992,6.449
C(Generation)[T.2],9.2938,4.015,2.315,0.021,1.398,17.189
C(Generation)[T.3],2.3150,3.915,0.591,0.555,-5.385,10.015
C(Generation)[T.4],4.8353,4.149,1.165,0.245,-3.325,12.995
C(Generation)[T.5],11.4838,3.960,2.900,0.004,3.696,19.272
C(Generation)[T.6],4.9206,4.746,1.037,0.300,-4.413,14.254
"C(Q(""Type 1""))[T.Dark]",-1.4155,6.936,-0.204,0.838,-15.057,12.226
"C(Q(""Type 1""))[T.Dragon]",0.8509,6.900,0.123,0.902,-12.720,14.422

0,1,2,3
Omnibus:,286.476,Durbin-Watson:,1.917
Prob(Omnibus):,0.0,Jarque-Bera (JB):,5187.327
Skew:,2.807,Prob(JB):,0.0
Kurtosis:,19.725,Cond. No.,9210.0


In [16]:
yhat_model5 = model5_fit.predict(pokeaman_test)
y = pokeaman_test.HP
print("'In sample' R-squared:    ", model5_fit.rsquared)
print("'Out of sample' R-squared:", np.corrcoef(y,yhat_model5)[0,1]**2)

'In sample' R-squared:     0.3920134083531893
'Out of sample' R-squared: 0.30015614488652215


In [17]:
# Here's something a little more reasonable...
model6_linear_form = 'HP ~ Attack + Speed + Q("Sp. Def") + Q("Sp. Atk")'
# And here we'll add the significant indicators from the previous model
# https://chatgpt.com/share/81ab88df-4f07-49f9-a44a-de0cfd89c67c
model6_linear_form += ' + I(Q("Type 1")=="Normal")'
model6_linear_form += ' + I(Q("Type 1")=="Water")'
model6_linear_form += ' + I(Generation==2)'
model6_linear_form += ' + I(Generation==5)'

model6_spec = smf.ols(formula=model6_linear_form, data=pokeaman_train)
model6_fit = model6_spec.fit()
model6_fit.summary()

0,1,2,3
Dep. Variable:,HP,R-squared:,0.333
Model:,OLS,Adj. R-squared:,0.319
Method:,Least Squares,F-statistic:,24.36
Date:,"Tue, 12 Nov 2024",Prob (F-statistic):,2.25e-30
Time:,06:07:59,Log-Likelihood:,-1783.6
No. Observations:,400,AIC:,3585.0
Df Residuals:,391,BIC:,3621.0
Df Model:,8,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,22.8587,3.876,5.897,0.000,15.238,30.479
"I(Q(""Type 1"") == ""Normal"")[T.True]",17.5594,3.339,5.258,0.000,10.994,24.125
"I(Q(""Type 1"") == ""Water"")[T.True]",9.0301,3.172,2.847,0.005,2.794,15.266
I(Generation == 2)[T.True],6.5293,2.949,2.214,0.027,0.732,12.327
I(Generation == 5)[T.True],8.4406,2.711,3.114,0.002,3.112,13.770
Attack,0.2454,0.037,6.639,0.000,0.173,0.318
Speed,-0.1370,0.045,-3.028,0.003,-0.226,-0.048
"Q(""Sp. Def"")",0.3002,0.045,6.662,0.000,0.212,0.389
"Q(""Sp. Atk"")",0.1192,0.042,2.828,0.005,0.036,0.202

0,1,2,3
Omnibus:,271.29,Durbin-Watson:,1.999
Prob(Omnibus):,0.0,Jarque-Bera (JB):,4238.692
Skew:,2.651,Prob(JB):,0.0
Kurtosis:,18.04,Cond. No.,618.0


In [18]:
yhat_model6 = model6_fit.predict(pokeaman_test)
y = pokeaman_test.HP
print("'In sample' R-squared:    ", model6_fit.rsquared)
print("'Out of sample' R-squared:", np.corrcoef(y,yhat_model6)[0,1]**2)

'In sample' R-squared:     0.3326310334310908
'Out of sample' R-squared: 0.29572460427079933


In [19]:
# And here's a slight change that seems to perhaps improve prediction...
model7_linear_form = 'HP ~ Attack * Speed * Q("Sp. Def") * Q("Sp. Atk")'
model7_linear_form += ' + I(Q("Type 1")=="Normal")'
model7_linear_form += ' + I(Q("Type 1")=="Water")'
model7_linear_form += ' + I(Generation==2)'
model7_linear_form += ' + I(Generation==5)'

model7_spec = smf.ols(formula=model7_linear_form, data=pokeaman_train)
model7_fit = model7_spec.fit()
model7_fit.summary()

0,1,2,3
Dep. Variable:,HP,R-squared:,0.378
Model:,OLS,Adj. R-squared:,0.347
Method:,Least Squares,F-statistic:,12.16
Date:,"Tue, 12 Nov 2024",Prob (F-statistic):,4.20e-29
Time:,06:08:09,Log-Likelihood:,-1769.5
No. Observations:,400,AIC:,3579.0
Df Residuals:,380,BIC:,3659.0
Df Model:,19,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,95.1698,34.781,2.736,0.007,26.783,163.556
"I(Q(""Type 1"") == ""Normal"")[T.True]",18.3653,3.373,5.445,0.000,11.733,24.997
"I(Q(""Type 1"") == ""Water"")[T.True]",9.2913,3.140,2.959,0.003,3.117,15.466
I(Generation == 2)[T.True],7.0711,2.950,2.397,0.017,1.271,12.871
I(Generation == 5)[T.True],7.8557,2.687,2.923,0.004,2.572,13.140
Attack,-0.6975,0.458,-1.523,0.129,-1.598,0.203
Speed,-1.8147,0.554,-3.274,0.001,-2.905,-0.725
Attack:Speed,0.0189,0.007,2.882,0.004,0.006,0.032
"Q(""Sp. Def"")",-0.5532,0.546,-1.013,0.312,-1.627,0.521

0,1,2,3
Omnibus:,252.3,Durbin-Watson:,1.953
Prob(Omnibus):,0.0,Jarque-Bera (JB):,3474.611
Skew:,2.438,Prob(JB):,0.0
Kurtosis:,16.59,Cond. No.,2340000000.0


In [20]:
yhat_model7 = model7_fit.predict(pokeaman_test)
y = pokeaman_test.HP
print("'In sample' R-squared:    ", model7_fit.rsquared)
print("'Out of sample' R-squared:", np.corrcoef(y,yhat_model7)[0,1]**2)

'In sample' R-squared:     0.37818209127432456
'Out of sample' R-squared: 0.35055389205977444


In [21]:
# And here's a slight change that seems to perhaps improve prediction...
model7_linear_form_CS = 'HP ~ scale(center(Attack)) * scale(center(Speed))'
model7_linear_form_CS += ' * scale(center(Q("Sp. Def"))) * scale(center(Q("Sp. Atk")))'
# We DO NOT center and scale indicator variables
model7_linear_form_CS += ' + I(Q("Type 1")=="Normal")'
model7_linear_form_CS += ' + I(Q("Type 1")=="Water")'
model7_linear_form_CS += ' + I(Generation==2)'
model7_linear_form_CS += ' + I(Generation==5)'

model7_CS_spec = smf.ols(formula=model7_linear_form_CS, data=pokeaman_train)
model7_CS_fit = model7_CS_spec.fit()
model7_CS_fit.summary().tables[-1] 
# "Cond. No." is NOW 15.4 due to centering and scaling

0,1,2,3
Omnibus:,252.3,Durbin-Watson:,1.953
Prob(Omnibus):,0.0,Jarque-Bera (JB):,3474.611
Skew:,2.438,Prob(JB):,0.0
Kurtosis:,16.59,Cond. No.,15.4


In [22]:
# "Cond. No." WAS 2,340,000,000 WITHOUT centering and scaling
model7_fit.summary().tables[-1]

0,1,2,3
Omnibus:,252.3,Durbin-Watson:,1.953
Prob(Omnibus):,0.0,Jarque-Bera (JB):,3474.611
Skew:,2.438,Prob(JB):,0.0
Kurtosis:,16.59,Cond. No.,2340000000.0


#### 1. `model5_linear_form`
- **Changes**:
  - Basic predictors: `Attack`, `Defense`, `Speed`, `Legendary`, `Sp. Def`, `Sp. Atk`, `Generation`, `Type 1`, and `Type 2`.
  
- **Rationale**:
  - This initial model includes a wide set of features to capture the relationship between Pokémon attributes and HP. With 46 model degrees of freedom for 400 observations, most of them from the categorical predictors, it is prone to multicollinearity and overfitting: the in-sample R-squared is 0.392 but the out-of-sample value is only 0.300.

---

#### 2. `model6_linear_form`
- **Changes**:
  - Removed predictors: `Defense`, `Legendary`, and `Type 2`.
  - Replaced the full categorical `Type 1` and `Generation` terms with indicator variables for the levels that were significant in `model5` (`Type 1` equal to "Normal" or "Water", `Generation` equal to 2 or 5).
  
- **Rationale**:
  - Feature selection keeps the more significant predictors and simplifies the model from 46 to 8 model degrees of freedom. In-sample R-squared drops from 0.392 to 0.333, but out-of-sample R-squared is almost unchanged (0.300 vs. 0.296), so the gap between the two narrows and the risk of overfitting is reduced. The retained indicators capture the category-level differences in HP that mattered in `model5`.

---

#### 3. `model7_linear_form`
- **Changes**:
  - Added interaction terms between `Attack`, `Speed`, `Sp. Def`, and `Sp. Atk`.
  - Kept `Type 1` and `Generation` indicators.
  
- **Rationale**:
  - Interaction terms enable the model to capture complex relationships between predictors. For instance, the effect of `Attack` on HP might depend on `Speed` or `Sp. Def`. This improves the model's ability to explain variance in HP and enhances predictive accuracy by modeling these interactions.

---

#### 4. `model7_linear_form_CS` (Centered and Scaled)
- **Changes**:
  - Centered and scaled continuous variables: `Attack`, `Speed`, `Sp. Def`, and `Sp. Atk`.
  - Kept `Type 1` and `Generation` unscaled (as categorical variables).
  
- **Rationale**:
  - Centering and scaling standardize the continuous variables, which removes the artificial multicollinearity that arises when raw predictors with large means are multiplied together in interaction terms. The condition number falls from 2,340,000,000 to 15.4. The fit itself appears unchanged (the residual diagnostics in the two summaries are identical); what changes is the numerical stability and the interpretation of the coefficients, which are now expressed per standard deviation around the mean. The indicator variables are left unscaled because they encode category membership rather than a continuous measure.
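The effect of centering and scaling on an interaction model can be checked on synthetic data. In the sketch below, a hand-written `standardize` helper stands in for `scale(center(...))`, and the data and the `design` helper are made up for illustration. It compares the condition number of the raw and standardized design matrices and checks whether the fitted values differ.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 400

# Synthetic stand-ins for two stat columns with means far from zero.
x1 = rng.normal(loc=75, scale=30, size=n)
x2 = rng.normal(loc=70, scale=25, size=n)
y = 20 + 0.3 * x1 + 0.2 * x2 + 0.001 * x1 * x2 + rng.normal(scale=10, size=n)

def standardize(x):
    # Subtract the mean, then divide by the standard deviation.
    return (x - x.mean()) / x.std()

def design(a, b):
    # Intercept, two main effects, and their interaction.
    return np.column_stack([np.ones(len(a)), a, b, a * b])

X_raw = design(x1, x2)
X_cs = design(standardize(x1), standardize(x2))

cond_raw = np.linalg.cond(X_raw)
cond_cs = np.linalg.cond(X_cs)

# Fitted values from each parameterization.
fit_raw = X_raw @ np.linalg.lstsq(X_raw, y, rcond=None)[0]
fit_cs = X_cs @ np.linalg.lstsq(X_cs, y, rcond=None)[0]

print(cond_raw, cond_cs)
print(np.allclose(fit_raw, fit_cs))
```

Both design matrices span the same column space, so the fitted values agree while the condition number differs by orders of magnitude. This is consistent with the identical diagnostics reported for `model7_fit` and `model7_CS_fit`.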

---

#### Improvements

- **`model5` to `model6`**:
  - Feature selection and simplification. Out-of-sample R-squared stays about the same (0.300 vs. 0.296) while the in-sample value drops (0.392 to 0.333), so the simpler model generalizes more consistently.

- **`model6` to `model7`**:
  - Addition of interaction terms to capture dependencies between predictors, raising out-of-sample R-squared from 0.296 to 0.351.

- **`model7` to `model7_CS`**:
  - Centering and scaling reduce the condition number from 2,340,000,000 to 15.4, improving numerical stability while leaving the residual diagnostics unchanged.
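The in-sample and out-of-sample values printed in the cells above can be put side by side to compare the generalization gap of each model. The snippet below only does arithmetic on those reported numbers; the dictionary layout is an illustrative choice.

```python
# (in-sample, out-of-sample) R-squared values copied from the outputs above.
reported = {
    "model5": (0.3920134083531893, 0.30015614488652215),
    "model6": (0.3326310334310908, 0.29572460427079933),
    "model7": (0.37818209127432456, 0.35055389205977444),
}

# Generalization gap: how much R-squared is lost on the test data.
gaps = {name: ins - oos for name, (ins, oos) in reported.items()}

for name, gap in gaps.items():
    ins, oos = reported[name]
    print(f"{name}: in={ins:.3f} out={oos:.3f} gap={gap:.3f}")
```

On this particular split, `model5` has the largest gap and `model7` the smallest. A single split is noisy, so these differences should be read with caution.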

chat: https://chatgpt.com/share/6732f292-32a4-800b-823d-5f8363ec1c65

Key Points:

- **Model Refinement**:
  - The models are built step by step, starting from a broad additive form (`model5_linear_form`) that is then pruned and extended.
  - In `model6_linear_form`, the predictor set is cut down to the continuous predictors and the indicator variables (specific Pokémon types and generations) that were significant in the previous model.
  - `model7_linear_form` introduces interaction terms among the continuous predictors to capture more complex relationships, which raises the out-of-sample R-squared from about 0.296 to about 0.351.
- **Centering and Scaling**:
  - In `model7_linear_form_CS`, centering and scaling are applied to the continuous variables. The condition number drops from 2,340,000,000 to 15.4, which suggests that most of the apparent multicollinearity in `model7` came from the scale of the predictors and their products.
- **Multicollinearity**:
  - The condition number is used throughout as the multicollinearity diagnostic. Its large reduction after centering and scaling indicates better numerical stability and fewer multicollinearity concerns.
- **Overall Development**:
  - The progression from `model5` to `model7` aims to balance accuracy, complexity, and generalizability by refining the variable set and addressing issues like multicollinearity. The final form (`model7_linear_form_CS`) is a standardized version of `model7`: its residual diagnostics are identical, so the gain is in stability and interpretability rather than in predictive power.

In essence, the models evolve to better capture the relationships between predictors and the target variable (HP), with each step aiming for better out-of-sample accuracy or less multicollinearity, ultimately leading to a more generalizable and reliable model.


# 8

In [32]:
import numpy as np
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
import statsmodels.formula.api as smf
from sklearn.model_selection import train_test_split

# Assume "songs" is your dataset
# Define model formula for linear regression
linear_form = 'danceability ~ energy * loudness + energy * mode'

# Set the number of repetitions and initialize arrays to store R-squared values
reps = 100
in_sample_Rsquared = np.zeros(reps)
out_of_sample_Rsquared = np.zeros(reps)

# Generate a sample dataset with random values for testing purposes
np.random.seed(0)  # For reproducible results
songs = pd.DataFrame({
    'danceability': np.random.rand(62),  # 62 rows as implied by 50-50 split with 31 in training
    'energy': np.random.rand(62),
    'loudness': np.random.rand(62),
    'mode': np.random.randint(0, 2, 62)  # Binary variable, as mode often is
})

# Loop through different random train-test splits
for i in range(reps):
    # Split the data randomly into training and testing sets (50-50 split with 31 samples in training)
    songs_training_data, songs_testing_data = train_test_split(songs, train_size=31)
    
    # Fit the model on the training data
    final_model_fit = smf.ols(formula=linear_form, data=songs_training_data).fit()
    
    # Record the in-sample R-squared value
    in_sample_Rsquared[i] = final_model_fit.rsquared
    
    # Calculate the out-of-sample R-squared on the test data
    predictions = final_model_fit.predict(songs_testing_data)
    out_of_sample_Rsquared[i] = np.corrcoef(songs_testing_data['danceability'], predictions)[0, 1]**2

# Create a DataFrame to store the results for visualization
df = pd.DataFrame({
    "In Sample Performance (Rsquared)": in_sample_Rsquared,
    "Out of Sample Performance (Rsquared)": out_of_sample_Rsquared
})

# Generate a scatter plot comparing in-sample and out-of-sample R-squared values
fig = px.scatter(df, x="In Sample Performance (Rsquared)", y="Out of Sample Performance (Rsquared)",
                 title="In-Sample vs Out-of-Sample Performance (R-squared)")
# Add a y=x reference line to compare in-sample and out-of-sample values directly
fig.add_trace(go.Scatter(x=[0, 1], y=[0, 1], name="y=x", line_shape='linear'))

# Show the plot
fig.show()


### Purpose:
- Investigate how **in-sample** (training) and **out-of-sample** (test) model performance vary with random data splits using R-squared.

### Key Components:
1. **Repeated Data Splitting**: 
   - Randomly splits data into training and test sets for each loop iteration.
2. **Model Fitting**: 
   - Fits a linear regression model to the training set.
3. **R-squared Calculation**:
   - In-sample R-squared (on training data).
   - Out-of-sample R-squared (on test data).
4. **Visualization**: 
   - Scatter plot of in-sample vs out-of-sample R-squared with a y=x line.

### Interpretation:
1. **Model Variability**: 
   - Shows how performance can change with different data splits.
2. **Overfitting**: 
   - High in-sample but low out-of-sample R-squared = overfitting.
3. **Out-of-Sample > In-Sample**:
   - Occasional cases may indicate sample variance, not underfitting.
4. **Generalization**:
   - Highlights the importance of not relying on a single data split and considering cross-validation.

### Conclusion:
- The code explores model performance variability, helping to understand overfitting, sample variance, and the importance of robust evaluation methods like cross-validation.
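Cross-validation, mentioned above as a more robust alternative to a single split, can be sketched in a few lines. The example below is a minimal k-fold loop on synthetic data (a stand-in, not the `songs` dataset) using plain NumPy least squares rather than `smf.ols`. It uses the same out-of-sample metric as the code above, the squared correlation between outcomes and predictions.

```python
import numpy as np

rng = np.random.default_rng(3)
n, k = 60, 5

# Synthetic data with one predictor and a known linear relationship.
x = rng.normal(size=n)
y = 0.5 + 0.8 * x + rng.normal(size=n)
X = np.column_stack([np.ones(n), x])

# Shuffle the row indices once and cut them into k folds.
folds = np.array_split(rng.permutation(n), k)

fold_r2 = []
for test_idx in folds:
    # Train on every row that is not in the current test fold.
    train_idx = np.setdiff1d(np.arange(n), test_idx)
    beta, *_ = np.linalg.lstsq(X[train_idx], y[train_idx], rcond=None)
    yhat = X[test_idx] @ beta
    # Out-of-sample metric: squared correlation on the held-out fold.
    fold_r2.append(np.corrcoef(y[test_idx], yhat)[0, 1] ** 2)

mean_r2 = float(np.mean(fold_r2))
print(np.round(fold_r2, 3), round(mean_r2, 3))
```

Each row is used for testing exactly once, so the average over folds is less dependent on one lucky or unlucky split than a single train-test division.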

chat: https://chatgpt.com/share/6732fa3a-5860-8012-9c4c-570ba7e51c72

In this chat, we discussed code that demonstrates how to analyze and visualize the generalization of a linear regression model on different train-test splits of a dataset.

Goal of the Code: The code repeatedly splits the data, trains the model, and compares "in-sample" (training) and "out-of-sample" (testing) R-squared values across different splits. This helps illustrate model performance variability and detect overfitting or underfitting patterns.

Explanation of the Code: We went through the code, line-by-line, covering:

- How data is split and stored.
- How the model is fit and performance metrics are calculated.
- How results are visualized using a scatter plot.

Error Resolution: You encountered an error (`NameError: name 'songs' is not defined`), which we resolved by either creating a sample dataset or loading a real dataset if available.

Overall, this exercise helps to understand model stability and performance variability across multiple data splits.

# 9

In [33]:
model7_gen1_predict_future = smf.ols(formula=model7_linear_form,
                                   data=pokeaman[pokeaman.Generation==1])
model7_gen1_predict_future_fit = model7_gen1_predict_future.fit()
print("'In sample' R-squared:    ", model7_fit.rsquared, "(original)")
y = pokeaman_test.HP
print("'Out of sample' R-squared:", np.corrcoef(y,yhat_model7)[0,1]**2, "(original)")
print("'In sample' R-squared:    ", model7_gen1_predict_future_fit.rsquared, "(gen1_predict_future)")
y = pokeaman[pokeaman.Generation!=1].HP
yhat = model7_gen1_predict_future_fit.predict(pokeaman[pokeaman.Generation!=1])
print("'Out of sample' R-squared:", np.corrcoef(y,yhat)[0,1]**2, "(gen1_predict_future)")

'In sample' R-squared:     0.37818209127432456 (original)
'Out of sample' R-squared: 0.35055389205977444 (original)
'In sample' R-squared:     0.5726118179916575 (gen1_predict_future)
'Out of sample' R-squared: 0.11151363354803218 (gen1_predict_future)


In [34]:
model7_gen1to5_predict_future = smf.ols(formula=model7_linear_form,
                                   data=pokeaman[pokeaman.Generation!=6])
model7_gen1to5_predict_future_fit = model7_gen1to5_predict_future.fit()
print("'In sample' R-squared:    ", model7_fit.rsquared, "(original)")
y = pokeaman_test.HP
print("'Out of sample' R-squared:", np.corrcoef(y,yhat_model7)[0,1]**2, "(original)")
print("'In sample' R-squared:    ", model7_gen1to5_predict_future_fit.rsquared, "(gen1to5_predict_future)")
y = pokeaman[pokeaman.Generation==6].HP
yhat = model7_gen1to5_predict_future_fit.predict(pokeaman[pokeaman.Generation==6])
print("'Out of sample' R-squared:", np.corrcoef(y,yhat)[0,1]**2, "(gen1to5_predict_future)")

'In sample' R-squared:     0.37818209127432456 (original)
'Out of sample' R-squared: 0.35055389205977444 (original)
'In sample' R-squared:     0.3904756578094535 (gen1to5_predict_future)
'Out of sample' R-squared: 0.23394915464343125 (gen1to5_predict_future)


In [35]:
model6_gen1_predict_future = smf.ols(formula=model6_linear_form,
                                   data=pokeaman[pokeaman.Generation==1])
model6_gen1_predict_future_fit = model6_gen1_predict_future.fit()
print("'In sample' R-squared:    ", model6_fit.rsquared, "(original)")
y = pokeaman_test.HP
print("'Out of sample' R-squared:", np.corrcoef(y,yhat_model6)[0,1]**2, "(original)")
print("'In sample' R-squared:    ", model6_gen1_predict_future_fit.rsquared, "(gen1_predict_future)")
y = pokeaman[pokeaman.Generation!=1].HP
yhat = model6_gen1_predict_future_fit.predict(pokeaman[pokeaman.Generation!=1])
print("'Out of sample' R-squared:", np.corrcoef(y,yhat)[0,1]**2, "(gen1_predict_future)")

'In sample' R-squared:     0.3326310334310908 (original)
'Out of sample' R-squared: 0.29572460427079933 (original)
'In sample' R-squared:     0.4433880517727282 (gen1_predict_future)
'Out of sample' R-squared: 0.1932858534276128 (gen1_predict_future)


In [36]:
model6_gen1to5_predict_future = smf.ols(formula=model6_linear_form,
                                   data=pokeaman[pokeaman.Generation!=6])
model6_gen1to5_predict_future_fit = model6_gen1to5_predict_future.fit()
print("'In sample' R-squared:    ", model6_fit.rsquared, "(original)")
y = pokeaman_test.HP
print("'Out of sample' R-squared:", np.corrcoef(y,yhat_model6)[0,1]**2, "(original)")
print("'In sample' R-squared:    ", model6_gen1to5_predict_future_fit.rsquared, "(gen1to5_predict_future)")
y = pokeaman[pokeaman.Generation==6].HP
yhat = model6_gen1to5_predict_future_fit.predict(pokeaman[pokeaman.Generation==6])
print("'Out of sample' R-squared:", np.corrcoef(y,yhat)[0,1]**2, "(gen1to5_predict_future)")

'In sample' R-squared:     0.3326310334310908 (original)
'Out of sample' R-squared: 0.29572460427079933 (original)
'In sample' R-squared:     0.33517279824114776 (gen1to5_predict_future)
'Out of sample' R-squared: 0.26262690178799936 (gen1to5_predict_future)


### Complexity vs. Simplicity:
- **model7_fit** is more complex due to additional interaction terms (combinations of multiple variables).
- These interaction terms might capture detailed relationships but can lead to **overfitting**.
- **model6_fit** is simpler and may generalize better to new data.

### Coefficient Significance:
- Examine each model's `.summary()` output to check for significant coefficients (variables and interactions).
- **model7_fit** has several coefficients that are not significant at the 0.05 level (for example `Attack`, p = 0.129, and `Sp. Def`, p = 0.312), which makes its individual associations harder to interpret.
- In **model6_fit** every coefficient has p < 0.05, indicating stronger evidence for each association.

### In-Sample and Out-of-Sample Performance:
- The R-squared values for both models are compared for **in-sample** (training data) and **out-of-sample** (new data) performance.
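- On the random train-test split, **model7_fit** has the higher out-of-sample R-squared (0.351 vs. 0.296 for **model6_fit**), and both models lose little between in-sample and out-of-sample.
- On the generation-based splits, **model6_fit** is more consistent: trained on Generation 1 it goes from 0.443 in sample to 0.193 out of sample, while **model7_fit** goes from 0.573 to 0.112, indicating overfitting.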
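The out-of-sample values above are squared correlations between observed and predicted HP. For an in-sample OLS fit with an intercept, this matches the usual `1 - SSE/SST` definition, but on new data the two can differ, because correlation ignores any systematic offset or rescaling of the predictions. The sketch below uses made-up numbers and two hypothetical helper functions to show the difference.

```python
import numpy as np

def r2_squared_corr(y, yhat):
    # The definition used in the cells above.
    return np.corrcoef(y, yhat)[0, 1] ** 2

def r2_one_minus_sse_sst(y, yhat):
    # 1 - (residual sum of squares) / (total sum of squares).
    y, yhat = np.asarray(y, float), np.asarray(yhat, float)
    return 1 - np.sum((y - yhat) ** 2) / np.sum((y - y.mean()) ** 2)

# Made-up test outcomes, with predictions that are systematically 20 too high.
y_test = np.array([40.0, 55.0, 60.0, 75.0, 90.0])
yhat_shifted = y_test + 20.0

corr_r2 = r2_squared_corr(y_test, yhat_shifted)
sse_r2 = r2_one_minus_sse_sst(y_test, yhat_shifted)
print(corr_r2, sse_r2)
```

The shifted predictions are perfectly correlated with the outcomes yet have a negative `1 - SSE/SST`. The squared-correlation values reported here are therefore best read as a measure of association rather than of prediction error.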

### Sequential Data Analysis:
- The models are tested on data separated by generations, simulating predicting future outcomes based on past data.
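- **model6_fit** holds up better across generations: trained on Generations 1 to 5 and tested on Generation 6, it scores 0.263 out of sample, compared with 0.234 for **model7_fit**.
- **model7_fit** is more prone to overfitting when the training data are limited (for example, Generation 1 only), which limits its use on new, unseen data.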

### Final Takeaway:
- **Simpler models** (like **model6_fit**) often offer **better interpretability** and **generalizability**, even when performance is similar to more complex models.
- Favoring interpretability over slight performance gains can lead to more reliable and understandable models.

chat: https://chatgpt.com/share/6732fbd8-a550-8012-ac73-cf853cdd5694

This chat explains the code used to evaluate two regression models, model6_fit and model7_fit, on their performance with different data splits. The main steps involve:

- Training Models: Both models are trained on data from specific generations.
- In-Sample and Out-of-Sample Evaluation: The R-squared values for each model are calculated for both the training (in-sample) and testing (out-of-sample) data.
- Comparison of Models: The simpler model6_fit and the more complex model7_fit are compared in terms of generalizability, with model6_fit generally showing more stable and reliable performance.