# Question 1.
1. **Difference between Simple and Multiple Linear Regression:**
   - Simple Linear Regression uses one independent variable to predict a dependent variable, modeling the relationship as a straight line. Multiple Linear Regression, however, includes two or more independent variables, allowing for more complex relationships and providing a better fit to data that may be influenced by multiple factors.


2. **Continuous vs. Indicator Variable in Simple Linear Regression:**
   - A continuous variable represents numerical values with an infinite range, allowing for a linear trend. An indicator (or binary) variable, however, has two possible values (e.g., 0 or 1), effectively dividing data into two categories and shifting the regression line up or down based on category presence.


3. **Behavior Change with an Indicator Variable in Multiple Linear Regression:**
   - When an indicator variable is added alongside a continuous variable, the model can account for a “shift” in the line based on the category of the indicator variable. This means that different categories can have different baseline values, creating two parallel lines, each representing a category’s relationship with the continuous variable.


4. **Effect of Adding an Interaction Between Continuous and Indicator Variable:**
   - Adding an interaction term between a continuous and an indicator variable enables the model to fit different slopes for each category, allowing one category's influence to differ in intensity from the other. This non-parallel line behavior offers a more flexible fit when categories affect the continuous variable's impact differently.


5. **Behavior of a Multiple Linear Regression with Indicator Variables from a Non-Binary Categorical Variable:**
   - When using a non-binary categorical variable, the model creates a set of binary variables (often called “dummy” variables) to represent each category except one (the reference category). This setup allows the model to capture distinct shifts in the dependent variable for each category, making it adaptable for multi-category data by adjusting baselines based on the category.

# Question 2

In this scenario, we have a company that sells sports equipment and wants to understand the impact of its advertising on **sales**. Here’s a breakdown of the **outcome and predictor variables**, the role of **interactions**, and how we could set up different models to make predictions.

---

### Outcome and Predictor Variables

- **Outcome Variable**: The outcome, or the variable we’re predicting, is **sales** or some measure of the effectiveness of the ads (e.g., customer engagement or revenue).
  
- **Predictor Variables**: The predictors are the **TV advertising budget** and **online advertising budget**. Both are initially treated as **continuous variables** representing the amount spent.

- **Interaction**: Since the effectiveness of each ad type might depend on the budget of the other, there’s potential for an **interaction effect**. This means that the impact of one advertising budget on sales might change based on the other advertising budget, so the total effect isn’t simply additive.

---

### Linear Forms with and without Interaction

1. **Without Interaction**:
   - $\text{Sales} = \beta_0 + \beta_1 \times \text{TV\_Budget} + \beta_2 \times \text{Online\_Budget}$
   - This formula assumes that TV and online budgets affect sales **independently**. Each predictor has a fixed contribution to sales regardless of the other.

2. **With Interaction**:
   - $\text{Sales} = \beta_0 + \beta_1 \times \text{TV\_Budget} + \beta_2 \times \text{Online\_Budget} + \beta_3 \times (\text{TV\_Budget} \times \text{Online\_Budget})$
   - Here, the interaction term $\beta_3 \times (\text{TV\_Budget} \times \text{Online\_Budget})$ accounts for the combined influence of both ad budgets. This term allows the effect of one ad budget to change based on the level of the other, creating a more flexible model that may capture synergistic or diminishing returns in advertising.

---

### Making Predictions and Interpreting Differences

1. **Without Interaction**:
   - To predict sales, we plug in the values for TV and online budgets into the formula. Each dollar spent on TV and online advertising increases sales by a fixed amount, given by $\beta_1$ and $\beta_2$ respectively.
   - **Difference in Predictions**: The model without interaction gives **independent effects**, meaning the budget for one type of ad has the same impact on sales no matter how much is spent on the other type.

2. **With Interaction**:
   - We use the same approach but include the interaction term. If both TV and online budgets are high, the interaction term may increase or decrease the effect on sales depending on whether they complement or detract from each other.
   - **Difference in Predictions**: This model allows for **combined effects**, where each dollar spent on TV ads may have a larger or smaller effect on sales depending on the online ad budget (and vice versa). This could be useful if the effectiveness of one ad type enhances or limits the other.

---

### Binary (High/Low) Advertising Budgets

If we categorize the TV and online budgets as **binary variables** representing “High” (1) and “Low” (0), the formulas would be adjusted as follows:

1. **Without Interaction (Binary Predictors)**:
   - $\text{Sales} = \beta_0 + \beta_1 \times \text{TV\_High} + \beta_2 \times \text{Online\_High}$
   - Here, $\text{TV\_High}$ and $\text{Online\_High}$ are indicator variables (1 if high, 0 if low). This model estimates the **independent impact** of high spending on each ad type without an interaction effect.

2. **With Interaction (Binary Predictors)**:
   - $\text{Sales} = \beta_0 + \beta_1 \times \text{TV\_High} + \beta_2 \times \text{Online\_High} + \beta_3 \times (\text{TV\_High} \times \text{Online\_High})$
   - This version adds an interaction term, allowing the **joint high spending** on both ad types to impact sales differently than if only one or neither type had high spending.

---

### Making Predictions with Binary Models

- **Without Interaction**: We can predict sales based on whether TV or online spending is high or low. This model doesn’t consider the joint effect, so it treats each ad type’s budget level as an independent contributor to sales.
- **With Interaction**: By including the interaction term, this model adjusts sales predictions based on whether both ad budgets are high, both are low, or one is high and the other is low. It allows for a combined effect where high-high spending might have a unique impact on sales, differing from when only one budget is high. 

In summary, adding interaction terms in both continuous and binary cases allows for more nuanced predictions that reflect possible combined effects between the two types of ad spending.




# Question 3.

In [13]:
# Here's an example of how you can do this
import pandas as pd
import statsmodels.formula.api as smf

url = "https://raw.githubusercontent.com/KeithGalli/pandas/master/pokemon_data.csv"
pokeaman = pd.read_csv(url).fillna('None')

pokeaman['str8fyre'] = (pokeaman['Type 1']=='Fire').astype(int)
linear_model_specification_formula = \
'str8fyre ~ Attack*Legendary + Defense*I(Q("Type 2")=="None") + C(Generation)'
log_reg_fit = smf.logit(linear_model_specification_formula, data=pokeaman).fit()
log_reg_fit.summary()

Optimization terminated successfully.
         Current function value: 0.228109
         Iterations 8


0,1,2,3
Dep. Variable:,str8fyre,No. Observations:,800.0
Model:,Logit,Df Residuals:,788.0
Method:,MLE,Df Model:,11.0
Date:,"Fri, 15 Nov 2024",Pseudo R-squ.:,0.05156
Time:,03:03:25,Log-Likelihood:,-182.49
converged:,True,LL-Null:,-192.41
Covariance Type:,nonrobust,LLR p-value:,0.04757

0,1,2,3,4,5,6
,coef,std err,z,P>|z|,[0.025,0.975]
Intercept,-3.2644,0.714,-4.572,0.000,-4.664,-1.865
Legendary[T.True],4.3478,2.179,1.996,0.046,0.078,8.618
"I(Q(""Type 2"") == ""None"")[T.True]",1.5432,0.853,1.810,0.070,-0.128,3.215
C(Generation)[T.2],-0.0574,0.468,-0.123,0.902,-0.975,0.861
C(Generation)[T.3],-0.6480,0.466,-1.390,0.164,-1.561,0.265
C(Generation)[T.4],-0.8255,0.545,-1.516,0.130,-1.893,0.242
C(Generation)[T.5],-0.5375,0.449,-1.198,0.231,-1.417,0.342
C(Generation)[T.6],0.3213,0.477,0.673,0.501,-0.614,1.257
Attack,0.0172,0.006,3.086,0.002,0.006,0.028


In [14]:
import pandas as pd
import statsmodels.formula.api as smf

# Load the dataset
url = "C:\Users\Enzo\Downloads\CSCS_data_anon.csv"
data = pd.read_csv(url)

# Handle any missing data if necessary, for example by filling with means or dropping rows.
data = data.fillna(data.mean())  # Or use other imputation strategies as needed

SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 2-3: truncated \UXXXXXXXX escape (4249915974.py, line 5)

In [15]:
# Define the regression formula
formula = '''
    WELLNESS_life_satisfaction ~ DEMO_age * CONNECTION_activities_text_or_messaged_p3m 
    + DEMO_gender + CONNECTION_activities_video_chat_p3m + CONNECTION_activities_visited_friends_p3m
    + COVID_vaccinated + WELLNESS_malach_pines_burnout_measure_tired
'''
# Example formula only

In [16]:
# Fit the model
model = smf.ols(formula, data=data).fit()

# Display the model summary
print(model.summary())

NameError: name 'data' is not defined

In [17]:
# Update the formula to reflect binary categories
formula_binary = '''
    WELLNESS_life_satisfaction ~ C(DEMO_age_category) * C(CONNECTION_activities_text_or_messaged_high_low) 
    + DEMO_gender + CONNECTION_activities_video_chat_p3m + CONNECTION_activities_visited_friends_p3m
'''

# Question 4.

This apparent contradiction arises because of the distinction between *statistical significance* of the individual coefficients and the *overall fit* or explanatory power of the model.

1. **Low R-squared (Overall Fit)**:
   - The statement "the model only explains 17.6% of the variability in the data" refers to the R-squared value, a measure of how well the model captures the variability in the response variable. An R-squared of 17.6% indicates that the model only explains a small portion of the total variability in the data. This implies that either the model is missing important predictors or that the outcome is inherently noisy or influenced by factors not included in the model.

2. **Significance of Individual Coefficients**:
   - The fact that "many of the coefficients are larger than 10 while having strong or very strong evidence against the null hypothesis of 'no effect'" indicates that these predictors are statistically significant. Statistical significance tests assess whether each predictor has an effect on the outcome, typically using p-values. When p-values are small, we reject the null hypothesis and conclude that the predictors are likely to have non-zero effects.
   - Large coefficients with low p-values suggest that these predictors have a statistically detectable effect on the outcome, even though they may only explain a small amount of the total variation.

3. **Interpretation**:
   - In this case, it’s possible that the model includes predictors that do have significant, measurable effects on the response variable, but collectively, they only explain a small fraction of its total variation. This can happen if the data is complex, with multiple unmeasured variables that contribute to the variability in the outcome. Alternatively, the relationship between the predictors and the outcome could be weak or nonlinear in ways that the model does not capture well.

In summary, the low R-squared reflects the model's limited ability to explain overall variability, while the significant coefficients indicate that some predictors still have meaningful, measurable effects on the outcome—even if they don't account for a large portion of its total variation.

# ChatBot Summary:

Here is a summary of our exchanges:

---

### 1. **Explaining Simple vs. Multiple Linear Regression:**
   - **Simple Linear Regression** involves predicting an outcome using one predictor, while **Multiple Linear Regression** involves multiple predictors. The benefit of the latter is that it accounts for more factors and can explain the outcome better by incorporating more information.
   - **Continuous vs. Indicator Variables in Regression**: Continuous variables represent numeric values, while indicator (or binary) variables represent categories. Including indicator variables in regression models allows you to capture the effect of categorical data.
   - **Interaction Terms**: Adding interaction terms in a regression model allows for the effect of one predictor to change based on the level of another predictor, improving the model’s ability to represent complex relationships between predictors and the outcome.
   - **Impact of Categorical Predictors**: When predictor variables are binary (e.g., "high" or "low" advertising budgets), the model coefficients will represent the change in the outcome relative to the baseline (low category), adjusting for any interactions between the variables.

### 2. **Explaining Variables and Interactions in a Scenario:**
   - For a company analyzing the effectiveness of TV and online advertising, **sales** is the outcome variable, and the predictors are the TV and online advertising budgets.
   - Interaction terms are crucial because the effectiveness of each advertising type may depend on the other. A model with interaction terms would capture the combined effects of advertising budgets, whereas a model without interaction assumes their effects are independent.
   - **Predictions**: A model with interaction will provide predictions where the effect of TV advertising on sales may change depending on the level of online advertising (and vice versa).
   - **Binary Variables**: When budgets are categorized as "high" or "low," the model changes to use indicator (binary) variables, and predictions will depend on whether each type of ad is high or low, along with potential interactions between these indicators.

### 3. **Using `smf` to Fit Multiple Linear Regression Models:**
   - A regression formula is defined using the dataset from the **Canadian Social Connection Survey** to predict **life satisfaction** using various predictors such as **age**, **gender**, and **social connection activities**.
   - The model is fitted using `smf.ols()` to examine how predictors (like age and social activities) influence life satisfaction. Interaction terms are included to explore more complex relationships, such as the interaction between age and messaging frequency.
   - The model summary is then displayed to evaluate the significance of the coefficients and their effects on life satisfaction.

### 4. **Explaining the Contradiction Between Model Fit and Coefficient Significance:**
   - The apparent contradiction arises because the **R-squared value** (17.6%) indicates that the model explains a small portion of the variability in the outcome, while **significant coefficients** with large magnitudes suggest that individual predictors have substantial and statistically significant effects.
   - This situation occurs when the predictors are related to the outcome in specific, consistent ways, but many other unobserved factors contribute to unexplained variability. The significant coefficients show real effects, but the model overall doesn’t explain much of the total variance in the outcome.

---

This summary captures the key points of our discussions and can be submitted as part of your homework assignment. Let me know if you'd like any adjustments or additional details!

# Question 5.

This code appears to explore linear regression modeling on a dataset called `pokeaman`, which seems to include attributes for Pokémon, such as "HP," "Attack," "Defense," "Speed," etc. 

The cells' purpose and results:

### Cell 1
```python
import numpy as np 
from sklearn.model_selection import train_test_split

fifty_fifty_split_size = int(pokeaman.shape[0]*0.5)

# Replace "NaN" (in the "Type 2" column with "None")
pokeaman.fillna('None', inplace=True)

np.random.seed(130)
pokeaman_train, pokeaman_test = train_test_split(pokeaman, train_size=fifty_fifty_split_size)
pokeaman_train
```
**Explanation:**
- Imports necessary libraries and defines a train-test split for the dataset `pokeaman`.
- Fills any missing values in the "Type 2" column with the string "None".
- Uses a random seed for reproducibility and splits the data into a 50-50 train-test set (`pokeaman_train` and `pokeaman_test`).

This step prepares the data for modeling and ensures that the model sees a balanced sample.

### Cell 2
```python
model_spec3 = smf.ols(formula='HP ~ Attack + Defense', data=pokeaman_train)
model3_fit = model_spec3.fit()
model3_fit.summary()
```
**Explanation:**
- Builds a simple linear regression model (`model_spec3`) with "HP" as the dependent variable and "Attack" and "Defense" as predictors.
- Fits the model to the training data (`pokeaman_train`) and prints the summary, which includes statistics like coefficients, p-values, and R-squared.

The `summary()` output provides insights into how well "Attack" and "Defense" explain "HP." The R-squared value will indicate the proportion of variance in "HP" that these predictors account for.

### Cell 3
```python
yhat_model3 = model3_fit.predict(pokeaman_test)
y = pokeaman_test.HP
print("'In sample' R-squared:    ", model3_fit.rsquared)
print("'Out of sample' R-squared:", np.corrcoef(y, yhat_model3)[0,1]**2)
```
**Explanation:**
- Predicts `HP` values for the test set using `model3_fit` and calculates two R-squared values:
  - **In-sample R-squared**: The R-squared value from training, which shows the model's performance on the training data.
  - **Out-of-sample R-squared**: The squared correlation between actual and predicted `HP` values on the test set, reflecting model performance on unseen data.

This step shows how well `model3_fit` generalizes, with a comparison between in-sample and out-of-sample R-squared values.

### Cell 4
```python
model4_linear_form = 'HP ~ Attack * Defense * Speed * Legendary'
model4_linear_form += ' * Q("Sp. Def") * Q("Sp. Atk")'
# Avoid complex interactions that may slow down computation or crash

model4_spec = smf.ols(formula=model4_linear_form, data=pokeaman_train)
model4_fit = model4_spec.fit()
model4_fit.summary()
```
**Explanation:**
- Defines a more complex linear regression model (`model4_spec`) with interactions between multiple features: "Attack," "Defense," "Speed," "Legendary" status, and "Sp. Def" and "Sp. Atk".
- This model potentially captures more intricate relationships between these variables and "HP".

The summary output will show if this more complex model offers significant improvements over `model3_fit`.

### Cell 5
```python
yhat_model4 = model4_fit.predict(pokeaman_test)
y = pokeaman_test.HP
print("'In sample' R-squared:    ", model4_fit.rsquared)
print("'Out of sample' R-squared:", np.corrcoef(y, yhat_model4)[0,1]**2)
```
**Explanation:**
- Similar to Cell 3, this code evaluates the predictive power of the more complex model (`model4_fit`) by computing both in-sample and out-of-sample R-squared values.
- This comparison helps determine if the added complexity improved predictive accuracy on the test set.

**Summary:**
These five cells illustrate the progression from a simple to a complex regression model, with both in-sample and out-of-sample R-squared values showing how well each model generalizes. This setup helps understand the trade-offs between model simplicity and predictive performance, highlighting the importance of balancing model complexity to avoid overfitting while maximizing generalization.

In [18]:
import numpy as np
from sklearn.model_selection import train_test_split

fifty_fifty_split_size = int(pokeaman.shape[0]*0.5)

# Replace "NaN" (in the "Type 2" column with "None")
pokeaman.fillna('None', inplace=True)

np.random.seed(130)
pokeaman_train,pokeaman_test = \
  train_test_split(pokeaman, train_size=fifty_fifty_split_size)
pokeaman_train
model_spec3 = smf.ols(formula='HP ~ Attack + Defense', 
                      data=pokeaman_train)
model3_fit = model_spec3.fit()
model3_fit.summary()
yhat_model3 = model3_fit.predict(pokeaman_test)
y = pokeaman_test.HP
print("'In sample' R-squared:    ", model3_fit.rsquared)
print("'Out of sample' R-squared:", np.corrcoef(y,yhat_model3)[0,1]**2)
model4_linear_form = 'HP ~ Attack * Defense * Speed * Legendary'
model4_linear_form += ' * Q("Sp. Def") * Q("Sp. Atk")'
# DO NOT try adding '* C(Generation) * C(Q("Type 1")) * C(Q("Type 2"))'
# That's 6*18*19 = 6*18*19 possible interaction combinations...
# ...a huge number that will blow up your computer

model4_spec = smf.ols(formula=model4_linear_form, data=pokeaman_train)
model4_fit = model4_spec.fit()
model4_fit.summary()
yhat_model4 = model4_fit.predict(pokeaman_test)
y = pokeaman_test.HP
print("'In sample' R-squared:    ", model4_fit.rsquared)
print("'Out of sample' R-squared:", np.corrcoef(y,yhat_model4)[0,1]**2)

'In sample' R-squared:     0.14771558304519894
'Out of sample' R-squared: 0.21208501873920738
'In sample' R-squared:     0.46709442115833855
'Out of sample' R-squared: 0.002485342598992873


# Question 6.

### Explanation of Model4 and Its Design Matrix

In **Cell 4**, the `model4_linear_form` is specified as a formula that defines how the outcome variable (HP) is predicted by a set of predictors. The interaction terms, such as `Attack * Defense * Speed * Legendary`, create new predictor variables by combining these features in various ways. For example:
- `Attack * Defense` produces an interaction term between Attack and Defense.
- `Attack * Defense * Speed` adds a three-way interaction between Attack, Defense, and Speed.
- The terms like `Q("Sp. Def")` and `Q("Sp. Atk")` refer to specific handling of categorical variables or potentially transformed columns.

These interactions create new columns in the **design matrix** (`model4_spec.exog`), where each column represents a predictor or interaction. The design matrix is essentially a collection of all the independent variables used to predict the outcome. Each row corresponds to a data point, and each column is a specific variable (or interaction) that influences the predicted outcome. The shape of `model4_spec.exog` tells us how many predictors (features) and observations the model has.

### Multicollinearity in the Design Matrix

**Multicollinearity** occurs when two or more predictors in the design matrix are highly correlated. This can happen when:
- Interaction terms involve predictors that are already correlated with each other (e.g., Attack and Defense might already be related, and their interaction term might introduce redundancy).
- Some predictors are strongly related to each other, leading to difficulty in estimating their individual effects on the outcome.

In the case of `model4_spec.exog`, the correlation matrix (`np.corrcoef(model4_spec.exog)`) would show high correlations between certain predictors, especially those derived from interactions. This **multicollinearity** makes it difficult for the regression model to reliably separate the unique contribution of each predictor to the outcome.

As a result, the model struggles to generalize well to new (out-of-sample) data because the coefficients might become **unstable**. High multicollinearity makes the regression coefficients very sensitive to small changes in the data, leading to large variations in the model's predictions for new data points.

### Impact on Model Performance and Condition Number

The **condition number** is a diagnostic that reflects the sensitivity of the model's coefficients to multicollinearity. A high condition number indicates that the design matrix is ill-conditioned, meaning that the matrix is close to being singular (non-invertible), which is a sign of severe multicollinearity. In the case of `model4`, the **condition number** before centering and scaling was `343.0`, and even after centering and scaling, it remained extremely high (`2,250,000,000,000,000`), suggesting that multicollinearity persists.

This high condition number contributes to the **poor out-of-sample generalization**. The model may perform well on the training data but fail to predict new data effectively because the unstable coefficients result in predictions that do not generalize well.

# Question 7.

The development of the models from `model3_fit` to `model7_fit` illustrates a gradual progression of adding complexity, refining predictors, and exploring interactions and transformations to improve prediction accuracy. Here's a breakdown of the rationale and principles behind each extension:

### From `model3_fit` to `model5_linear_form`
- **Rationale**: `model3_fit` was a simple model with just "Attack" and "Defense" as predictors for "HP." `model5_linear_form` extends this by adding more predictors like "Speed," "Legendary," and categorical variables such as "Type 1," "Type 2," and "Generation."
- **Principle**: The goal is to incorporate additional features to better capture the variability in "HP" by using domain knowledge of Pokémon attributes (e.g., speed, type, generation). The inclusion of categorical variables using `C()` enables the model to account for the different levels within these categories, thus improving the model’s ability to explain the variance in "HP."

### From `model5_linear_form` to `model6_linear_form`
- **Rationale**: `model6_linear_form` simplifies `model5_linear_form` by removing some predictors that might not contribute significantly to the model (e.g., "Defense"). It retains the key predictors like "Attack," "Speed," and special attributes like "Sp. Def" and "Sp. Atk." Additionally, significant indicator variables from `model5_fit` (such as specific "Type 1" values and "Generation" values) are added to account for the effects of these categories on "HP."
- **Principle**: The rationale here is to focus the model on the most significant predictors identified in the previous model, while keeping indicators that highlight important categorical distinctions (e.g., specific Pokémon types and generations). This simplification helps avoid overfitting and improves interpretability while maintaining the model’s predictive power.

### From `model6_linear_form` to `model7_linear_form`
- **Rationale**: `model7_linear_form` further extends `model6_linear_form` by incorporating interaction terms (e.g., `Attack * Speed * Q("Sp. Def") * Q("Sp. Atk")`). The interactions are designed to capture more complex relationships between the predictors, recognizing that the effect of one predictor (e.g., "Attack") might depend on the values of others (e.g., "Speed" and "Sp. Def").
- **Principle**: The inclusion of interaction terms acknowledges that certain features may not independently influence "HP" but rather work together in a more complex way. By adding these interactions, the model can potentially uncover hidden relationships that improve its predictive performance, especially for more nuanced or non-linear effects.

### From `model7_linear_form` to `model7_linear_form_CS`
- **Rationale**: `model7_linear_form_CS` is an extension where the predictors are **centered and scaled** using `scale(center())`. This transformation normalizes the variables, making them have zero mean and unit variance, which helps mitigate issues like multicollinearity and improves numerical stability. Centering and scaling are especially useful when dealing with interaction terms, as they prevent one predictor from disproportionately affecting the model due to differences in scale.
- **Principle**: The rationale behind centering and scaling is to improve the model's stability and to make sure that the coefficients are comparable in magnitude, especially when interaction terms are involved. This is a common technique to address multicollinearity and ensure that the model is numerically well-conditioned (evidenced by the significant reduction in the condition number).

In [21]:
# Here's something a little more reasonable...
model5_linear_form = 'HP ~ Attack + Defense + Speed + Legendary'
model5_linear_form += ' + Q("Sp. Def") + Q("Sp. Atk")'
model5_linear_form += ' + C(Generation) + C(Q("Type 1")) + C(Q("Type 2"))'

model5_spec = smf.ols(formula=model5_linear_form, data=pokeaman_train)
model5_fit = model5_spec.fit()
model5_fit.summary()
yhat_model5 = model5_fit.predict(pokeaman_test)
y = pokeaman_test.HP
print("'In sample' R-squared:    ", model5_fit.rsquared)
print("'Out of sample' R-squared:", np.corrcoef(y,yhat_model5)[0,1]**2)
# Here's something a little more reasonable...
model6_linear_form = 'HP ~ Attack + Speed + Q("Sp. Def") + Q("Sp. Atk")'
# And here we'll add the significant indicators from the previous model
# https://chatgpt.com/share/81ab88df-4f07-49f9-a44a-de0cfd89c67c
model6_linear_form += ' + I(Q("Type 1")=="Normal")'
model6_linear_form += ' + I(Q("Type 1")=="Water")'
model6_linear_form += ' + I(Generation==2)'
model6_linear_form += ' + I(Generation==5)'

model6_spec = smf.ols(formula=model6_linear_form, data=pokeaman_train)
model6_fit = model6_spec.fit()
model6_fit.summary()
yhat_model6 = model6_fit.predict(pokeaman_test)
y = pokeaman_test.HP
print("'In sample' R-squared:    ", model6_fit.rsquared)
print("'Out of sample' R-squared:", np.corrcoef(y,yhat_model6)[0,1]**2)
# And here's a slight change that seems to perhaps improve prediction...
model7_linear_form = 'HP ~ Attack * Speed * Q("Sp. Def") * Q("Sp. Atk")'
model7_linear_form += ' + I(Q("Type 1")=="Normal")'
model7_linear_form += ' + I(Q("Type 1")=="Water")'
model7_linear_form += ' + I(Generation==2)'
model7_linear_form += ' + I(Generation==5)'

model7_spec = smf.ols(formula=model7_linear_form, data=pokeaman_train)
model7_fit = model7_spec.fit()
model7_fit.summary()
yhat_model7 = model7_fit.predict(pokeaman_test)
y = pokeaman_test.HP
print("'In sample' R-squared:    ", model7_fit.rsquared)
print("'Out of sample' R-squared:", np.corrcoef(y,yhat_model7)[0,1]**2)
# And here's a slight change that seems to perhas improve prediction...
model7_linear_form_CS = 'HP ~ scale(center(Attack)) * scale(center(Speed))'
model7_linear_form_CS += ' * scale(center(Q("Sp. Def"))) * scale(center(Q("Sp. Atk")))'
# We DO NOT center and scale indicator variables
model7_linear_form_CS += ' + I(Q("Type 1")=="Normal")'
model7_linear_form_CS += ' + I(Q("Type 1")=="Water")'
model7_linear_form_CS += ' + I(Generation==2)'
model7_linear_form_CS += ' + I(Generation==5)'

model7_CS_spec = smf.ols(formula=model7_linear_form_CS, data=pokeaman_train)
model7_CS_fit = model7_CS_spec.fit()
model7_CS_fit.summary().tables[-1] 
# "Cond. No." is NOW 15.4 due to centering and scaling
# "Cond. No." WAS 2,340,000,000 WITHOUT to centering and scaling
model7_fit.summary().tables[-1]

'In sample' R-squared:     0.39050543408631777
'Out of sample' R-squared: 0.2928211708652593
'In sample' R-squared:     0.3278619119714917
'Out of sample' R-squared: 0.3201730457445605
'In sample' R-squared:     0.39896705495122187
'Out of sample' R-squared: 0.29455630423884327


0,1,2,3
Omnibus:,268.963,Durbin-Watson:,2.18
Prob(Omnibus):,0.0,Jarque-Bera (JB):,4069.702
Skew:,2.632,Prob(JB):,0.0
Kurtosis:,17.713,Cond. No.,1440000000.0


# Question 8.

In [22]:
import numpy as np
import pandas as pd
import plotly.graph_objects as go
import plotly.express as px
from sklearn.model_selection import train_test_split
import statsmodels.formula.api as smf

# Define the linear form specification for model3_fit
linear_form = 'HP ~ Attack + Defense'

# Number of repetitions for cross-validation
reps = 100

# Initialize arrays to collect R-squared values
in_sample_Rsquared = np.array([0.0]*reps)
out_of_sample_Rsquared = np.array([0.0]*reps)

# Loop for multiple iterations to assess performance
for i in range(reps):
    # Perform the 50-50 train-test split without using np.random.seed(130) each time
    pokeaman_train, pokeaman_test = train_test_split(pokeaman, train_size=0.5)
    
    # Fit the model to the training data
    final_model_fit = smf.ols(formula=linear_form, data=pokeaman_train).fit()
    
    # Collect in-sample R-squared (model fit on the training data)
    in_sample_Rsquared[i] = final_model_fit.rsquared
    
    # Collect out-of-sample R-squared (model prediction on the test data)
    y_test = pokeaman_test.HP
    yhat_test = final_model_fit.predict(pokeaman_test)
    out_of_sample_Rsquared[i] = np.corrcoef(y_test, yhat_test)[0, 1] ** 2

# Store results in a DataFrame
df = pd.DataFrame({"In Sample Performance (Rsquared)": in_sample_Rsquared,
                   "Out of Sample Performance (Rsquared)": out_of_sample_Rsquared})

# Visualize the results using a scatter plot
fig = px.scatter(df, x="In Sample Performance (Rsquared)", 
                     y="Out of Sample Performance (Rsquared)",
                     title="In-sample vs Out-of-sample R-squared for Multiple Model Runs")
fig.add_trace(go.Scatter(x=[0, 1], y=[0, 1], name="y=x", line_shape='linear'))
fig.show()


### Explanation of Results:

- **In-Sample R-squared**: This metric measures how well the model fits the training data. Higher values indicate better model fit on the training data.
  
- **Out-of-Sample R-squared**: This metric measures how well the model generalizes to new data. It is calculated using the test data, and higher values suggest better generalization.

- **Scatter Plot**: The scatter plot illustrates the variability of in-sample and out-of-sample R-squared values across different iterations. A perfect model would produce points along the diagonal line (`y=x`), where the in-sample and out-of-sample R-squared values are identical. However, in practice, we typically observe a range of values, with some models overfitting (high in-sample R-squared and low out-of-sample R-squared) and others performing well on both in-sample and out-of-sample data.

### Purpose of the Demonstration:

This demonstration serves to visualize and understand the **variability in model performance** when applying the same model specification (e.g., `model3_fit`) across multiple train-test splits. It shows how the model might behave with different subsets of data and emphasizes the importance of **out-of-sample performance** in assessing how well the model will generalize to unseen data. The plot highlights the need to avoid overfitting, where the model may fit the training data well but fail to perform on new data.

# Question 9.

In [23]:
model7_gen1_predict_future = smf.ols(formula=model7_linear_form,
                                   data=pokeaman[pokeaman.Generation==1])
model7_gen1_predict_future_fit = model7_gen1_predict_future.fit()
print("'In sample' R-squared:    ", model7_fit.rsquared, "(original)")
y = pokeaman_test.HP
print("'Out of sample' R-squared:", np.corrcoef(y,yhat_model7)[0,1]**2, "(original)")
print("'In sample' R-squared:    ", model7_gen1_predict_future_fit.rsquared, "(gen1_predict_future)")
y = pokeaman[pokeaman.Generation!=1].HP
yhat = model7_gen1_predict_future_fit.predict(pokeaman[pokeaman.Generation!=1])
print("'Out of sample' R-squared:", np.corrcoef(y,yhat)[0,1]**2, "(gen1_predict_future)")
model7_gen1to5_predict_future = smf.ols(formula=model7_linear_form,
                                   data=pokeaman[pokeaman.Generation!=6])
model7_gen1to5_predict_future_fit = model7_gen1to5_predict_future.fit()
print("'In sample' R-squared:    ", model7_fit.rsquared, "(original)")
y = pokeaman_test.HP
print("'Out of sample' R-squared:", np.corrcoef(y,yhat_model7)[0,1]**2, "(original)")
print("'In sample' R-squared:    ", model7_gen1to5_predict_future_fit.rsquared, "(gen1to5_predict_future)")
y = pokeaman[pokeaman.Generation==6].HP
yhat = model7_gen1to5_predict_future_fit.predict(pokeaman[pokeaman.Generation==6])
print("'Out of sample' R-squared:", np.corrcoef(y,yhat)[0,1]**2, "(gen1to5_predict_future)")
model6_gen1_predict_future = smf.ols(formula=model6_linear_form,
                                   data=pokeaman[pokeaman.Generation==1])
model6_gen1_predict_future_fit = model6_gen1_predict_future.fit()
print("'In sample' R-squared:    ", model6_fit.rsquared, "(original)")
y = pokeaman_test.HP
print("'Out of sample' R-squared:", np.corrcoef(y,yhat_model6)[0,1]**2, "(original)")
print("'In sample' R-squared:    ", model6_gen1_predict_future_fit.rsquared, "(gen1_predict_future)")
y = pokeaman[pokeaman.Generation!=1].HP
yhat = model6_gen1_predict_future_fit.predict(pokeaman[pokeaman.Generation!=1])
print("'Out of sample' R-squared:", np.corrcoef(y,yhat)[0,1]**2, "(gen1_predict_future)")
model6_gen1to5_predict_future = smf.ols(formula=model6_linear_form,
                                   data=pokeaman[pokeaman.Generation!=6])
model6_gen1to5_predict_future_fit = model6_gen1to5_predict_future.fit()
print("'In sample' R-squared:    ", model6_fit.rsquared, "(original)")
y = pokeaman_test.HP
print("'Out of sample' R-squared:", np.corrcoef(y,yhat_model6)[0,1]**2, "(original)")
print("'In sample' R-squared:    ", model6_gen1to5_predict_future_fit.rsquared, "(gen1to5_predict_future)")
y = pokeaman[pokeaman.Generation==6].HP
yhat = model6_gen1to5_predict_future_fit.predict(pokeaman[pokeaman.Generation==6])
print("'Out of sample' R-squared:", np.corrcoef(y,yhat)[0,1]**2, "(gen1to5_predict_future)")

'In sample' R-squared:     0.39896705495122187 (original)
'Out of sample' R-squared: 0.008741299280319925 (original)
'In sample' R-squared:     0.5726118179916575 (gen1_predict_future)
'Out of sample' R-squared: 0.11151363354803218 (gen1_predict_future)
'In sample' R-squared:     0.39896705495122187 (original)
'Out of sample' R-squared: 0.008741299280319925 (original)
'In sample' R-squared:     0.3904756578094535 (gen1to5_predict_future)
'Out of sample' R-squared: 0.23394915464343125 (gen1to5_predict_future)
'In sample' R-squared:     0.3278619119714917 (original)
'Out of sample' R-squared: 0.012706187160367099 (original)
'In sample' R-squared:     0.4433880517727282 (gen1_predict_future)
'Out of sample' R-squared: 0.1932858534276128 (gen1_predict_future)
'In sample' R-squared:     0.3278619119714917 (original)
'Out of sample' R-squared: 0.012706187160367099 (original)
'In sample' R-squared:     0.33517279824114776 (gen1to5_predict_future)
'Out of sample' R-squared: 0.26262690178799936

The illustration discusses the trade-offs between two models, **model6_fit** and **model7_fit**, in terms of their complexity, performance, and generalizability. The key points in the explanation emphasize how the complexity of **model7_fit** may introduce overfitting issues, even though it performs better on test data ("out-of-sample" performance) compared to **model6_fit**. Here's a breakdown:

1. **Complexity and Parsimony**: 
   - **model7_fit** is more complex than **model6_fit** because it includes more interactions, such as the four-way interaction term `Attack:Speed:Q("Sp. Def"):Q("Sp. Atk")`. While this may improve the model's performance on training data, it can also lead to overfitting. Overfitting happens when a model captures noise or idiosyncrasies in the training data that do not generalize well to new, unseen data.
   - **model6_fit**, being simpler, has a clearer and more interpretable structure. It may not perform as well in the short term, but it is more likely to generalize better because it avoids capturing these spurious patterns in the data.

2. **Interpretability vs. Performance**:
   - The complexity of **model7_fit** makes it harder to interpret. For instance, understanding a four-way interaction term in conjunction with other lower-order interactions can be quite challenging. In contrast, **model6_fit**'s simpler structure is more interpretable and easier to understand.
   - This trade-off between model performance and interpretability is crucial, especially when the performance difference between the two models is not dramatic. In some cases, a simpler, more interpretable model is preferred, even if its "out-of-sample" performance is slightly worse.

3. **Generalizability**:
   - The concern with **model7_fit**'s complexity is its potential for overfitting, which reduces its ability to generalize to new data. The model might detect patterns in the training set that are not present in the real world or new datasets.
   - A simpler model, like **model6_fit**, might generalize better to future data, as shown in the sequential data predictions using "Generations." This better generalizability is one reason why **model6_fit** could be preferred despite its slightly lower performance on test data.

4. **Real-World Application**:
   - The code illustrates how data might arrive sequentially (e.g., in "Generations"), and models should be tested on their ability to predict future data based on the current available data. **model7_fit**'s complexity can cause issues in these real-world scenarios, where data isn't idealized or static. The predictive performance of **model7_fit** declines when applied to future "Generations," highlighting its overfitting tendencies.

In conclusion, while **model7_fit** might perform better on current test data, its complexity and lack of interpretability, as well as its tendency to overfit, make **model6_fit** a safer and more robust choice in practice. This reinforces the idea that simplicity in modeling often leads to more consistent generalization, better interpretability, and fewer overfitting risks, which is crucial in real-world applications.

# ChatBot Summary:

Here is a summary of our exchanges:

---

1. **Model Complexity and Interpretability**: We discussed the trade-off between two models, **model7_fit** and **model6_fit**, focusing on how the complexity of **model7_fit** leads to overfitting and challenges in interpretability. While **model7_fit** performs better in out-of-sample tests, **model6_fit** is simpler and more interpretable, with better generalizability.

2. **Generalizability Concerns**: The increased complexity of **model7_fit** reduces its ability to generalize to new data, as demonstrated by sequential "Generations" data. In contrast, the simpler **model6_fit** shows more consistent performance and generalizability.

3. **Real-World Application**: The discussion emphasized the importance of model simplicity and interpretability in real-world applications where data arrives sequentially, highlighting the risk of overfitting with more complex models.

This summary captures the main points of our discussion on model selection and generalizability.