<div style="background-color:#f8f9fa;
    padding:20px;
    border:2px solid #360084;
    border-radius:8px;
    margin:10px 0;">
    <div style="background-color:#fafdf0;
        line-height:2;
        text-align:center;
        border:2px solid #360084;
        padding:15px;
        border-radius:5px;">
        <div style="color:#008B8B;
            font-size:24pt;
            font-weight:700;">
            Simple Linear Regression - Practice Notebook
        </div>
    </div> 
    <div style="margin-top:15px; padding:10px;">
        <strong>Course:</strong> Econometrics<br>
        <strong>Topic:</strong> Simple Linear Regression<br>
        <strong>Author:</strong> Dr. Saad Laouadi<br>
        <strong>Date:</strong> June 2025<br>
        <strong>Difficulty:</strong> Beginner<br>
        <strong>Dataset:</strong> Advertising (TV vs Sales)<br>
        <strong>Learning Objectives:</strong>
        <ul style="margin:5px 0; padding-left:20px;">
            <li>Understand simple linear regression concepts</li>
            <li>Fit OLS models using statsmodels</li>
            <li>Interpret regression results</li>
            <li>Analyze residuals and model diagnostics</li>
        </ul>
    </div>
</div>

---

## Instructions

Complete the following exercises step by step. Read each instruction carefully and fill in the missing code where indicated. Use the advertising dataset to explore the relationship between TV advertising spending and sales.

### Exercise 1: Environment Setup

Set up your working environment by importing the necessary libraries.

In [None]:
# Import required libraries for data analysis and visualization
import os
from pathlib import Path

import pandas as pd
import numpy as np
import scipy
import skimpy

import statsmodels.api as sm
import statsmodels.formula.api as smf
from statsmodels.api import OLS

import patsy
import plotnine as pl
import matplotlib.pyplot as plt
import seaborn as sns

# Set pandas display options
pd.set_option('display.max_columns', None)
pd.set_option('display.expand_frame_repr', False)

# Set plotting style
plt.style.use('seaborn-v0_8-whitegrid')
sns.set_style('whitegrid')

# Print environment setup confirmation
print("*"*52)
print("Environment Setup".center(52))
print("*"*52)

# Load watermark extension for version information
%reload_ext watermark 
%watermark -iv -ud -v -a "Your Name"
print("*"*52)

In [None]:
# Import required libraries for data analysis and visualization
# Hint: You'll need os, pathlib, pandas, numpy, scipy, skimpy
import _____ 
from pathlib import _____
import _____ as pd
import _____ as np
import _____
import _____

# Import statsmodels components
import _____ as sm
import _____._____.api as smf
from statsmodels.api import _____

# Import visualization libraries
import _____.pyplot as plt
import _____ as sns

# Set pandas display options
pd.set_option('display.max_columns', None)
pd.set_option('display.expand_frame_repr', False)

# Set plotting styles
plt.style.use('_____')
sns.set_style('_____')

# Print environment setup confirmation
print("*"*52)
print(_____.center(52))
print("*"*52)

# Load watermark extension and display version information
%reload_ext _____ 
%watermark -iv -ud -v -a "_____"  # Replace with your name
print("*"*52)

### Exercise 2: Data Loading and Exploration

Load the advertising dataset and explore its structure.

In [None]:
# Load the advertising dataset
# Note: Update the path to your dataset location
data_path = Path("../../datasets/misc/advertising.csv")
advert = pd.read_csv(data_path)

# Display basic information about the dataset
# Use the appropriate method to show data types and non-null counts
advert.info()

In [None]:
# Load the advertising dataset
# Note: Update the path to your dataset location
data_path = Path("../../datasets/misc/advertising.csv")
advert = pd.read_csv(data_path)

# Display basic information about the dataset
advert._____()

### Exercise 3: Data Visualization - Pairplot

Create pairwise plots to visualize relationships between variables.

In [None]:
# Create pairplot showing TV, Radio, Newspaper vs Sales
# TODO: Complete the pairplot function with appropriate parameters
sns.pairplot(data=advert,
             x_vars=[_____, _____, _____],  # Fill in the x variables
             y_vars=_____,                   # Fill in the y variable
             aspect=1.1, 
             height=4.5)
plt.show()

### Exercise 4: Correlation Analysis

Examine correlations between variables.

In [None]:
# Calculate and display correlation matrix for the dataset
correlation_matrix = advert._____()
print(correlation_matrix)

# Create correlation heatmap with annotations
sns.heatmap(_____, annot=_____)
plt.show()
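
Each off-diagonal entry of a correlation matrix is a Pearson correlation between two columns. The sketch below computes one such value from its definition and compares it with the matrix entry. It uses synthetic data rather than the advertising dataset; the variable names and the numbers used to generate the data are assumptions made for illustration.

```python
import numpy as np

# Synthetic data (hypothetical, not the advertising dataset)
rng = np.random.default_rng(0)
x = rng.uniform(0, 100, size=200)
y = 2.0 + 0.5 * x + rng.normal(0, 5, size=200)

# Pearson correlation from its definition:
# sum of cross-deviations divided by the product of the deviation norms
x_dev = x - x.mean()
y_dev = y - y.mean()
r_manual = np.sum(x_dev * y_dev) / np.sqrt(np.sum(x_dev ** 2) * np.sum(y_dev ** 2))

# The same value sits off the diagonal of the 2x2 correlation matrix
corr_matrix = np.corrcoef(x, y)

print(f"Manual r: {r_manual:.4f}")
print(f"Matrix r: {corr_matrix[0, 1]:.4f}")
```

The diagonal entries are always 1, because every variable is perfectly correlated with itself.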

### Exercise 5: Variable Selection

Select the independent and dependent variables for regression.

In [None]:
# Set X and Y variables for simple linear regression
# Select TV as the independent variable (X) and Sales as the dependent variable (Y)

X = advert[_____]
Y = advert[_____]

# Check the shape of variables
print(f"The X shape: {X.shape}")
print(f"The Y shape: {Y.shape}")

### Exercise 6: Model Preparation (Method 1)

Prepare the data and fit an OLS regression using the matrix approach.

In [None]:
# Add constant term to X for intercept
X_with_const = sm._____(X)

# Check the new shape
print(f"X with constant shape: {X_with_const.shape}")

# Create OLS model with Y and X_with_const
model = OLS(_____, _____)

# Fit the model and store results
result = model._____()

# Display coefficient table only
print(result.summary().tables[1])
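
To see why the matrix approach needs an explicit constant, it may help to build the design matrix by hand. The sketch below uses NumPy only and synthetic data; the true intercept (7.0) and slope (0.05) are assumed values chosen for illustration, and `np.linalg.lstsq` stands in for the estimation step rather than reproducing what statsmodels does internally.

```python
import numpy as np

# Synthetic data with a known intercept and slope (assumed values)
rng = np.random.default_rng(42)
x = rng.uniform(0, 300, size=200)
y = 7.0 + 0.05 * x + rng.normal(0, 1.0, size=200)

# Design matrix: a column of ones (the constant) next to the predictor.
# Without the column of ones, the fitted line is forced through the origin.
X_design = np.column_stack([np.ones_like(x), x])
print(f"Design matrix shape: {X_design.shape}")

# Least-squares solution of X_design @ beta = y (in the least-squares sense)
beta, *_ = np.linalg.lstsq(X_design, y, rcond=None)
intercept_hat, slope_hat = beta

print(f"Estimated intercept: {intercept_hat:.3f}")
print(f"Estimated slope: {slope_hat:.4f}")
```

The shape grows from one column to two, which is the same change you should see after adding the constant in the exercise.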

### Exercise 7: Model Fitting (Method 2 - Formula API)

Use the formula interface for easier model specification.

In [None]:
# Create model using formula interface
# Hint: Use OLS.from_formula with "Sales ~ TV" formula
model_formula = OLS.from_formula(_____, _____)

# Fit the model
result_formula = model_formula._____()

# Display complete summary
print(result_formula.summary())
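
As a minimal sketch of the formula interface on data other than the advertising set, the example below fits a line to a synthetic DataFrame. The column names `x` and `y` and the generating coefficients are hypothetical. It uses `smf.ols`, which is the lowercase alias for `OLS.from_formula`.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic DataFrame with hypothetical column names
rng = np.random.default_rng(1)
df = pd.DataFrame({"x": rng.uniform(0, 100, size=150)})
df["y"] = 3.0 + 0.8 * df["x"] + rng.normal(0, 2.0, size=150)

# The formula "y ~ x" adds the intercept automatically,
# so no constant column has to be created by hand
fit = smf.ols("y ~ x", data=df).fit()

# Parameters come back as a pandas Series indexed by term name
print(fit.params)
```

Because the parameters are labelled by term name, individual coefficients can be selected by label instead of by position.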

### Exercise 8: Exploring Results Structure

Understand the structure of regression results.

In [None]:
# Explore summary components
# Hint: Print all non-private attributes of the results summary
summary = result_formula.summary()
components = [comp for comp in dir(summary) if not comp.startswith('_')]
print("Summary components:", components)

# Check number of tables in summary
print(f"Number of tables: {len(summary.tables)}")

# Print each table separately
for i, table in enumerate(summary.tables, start=1):
    print(f"\nTable {i}:")
    print(table)

# Print extra notes
print("\nExtra text:")
print(summary.extra_txt)

### Exercise 9: Parameter Extraction

Extract and work with model coefficients.

In [None]:
# Extract model parameters from the fitted model
params = result_formula._____
print("Model parameters:")
print(params)
print(f"Parameter type: {type(params)}")

# Extract individual coefficients: the intercept and the slope
intercept = params.loc[_____]
slope = params.loc[_____]

print(f"Intercept: {intercept}")
print(f"Slope: {slope}")

### Exercise 10: Model Visualization

Create visualizations of the fitted model.

In [None]:
# Plot 1: Scatter plot with fitted line (manual)
plt.figure(figsize=(10, 6))
plt.scatter(advert[_____], advert[_____])
plt.plot(advert[_____], _____ + _____ * advert[_____], 'g', linewidth=2)
plt.xlabel('TV Advertising')
plt.ylabel('Sales')
plt.title('Sales vs TV Advertising with Fitted Line')
plt.show()

# Plot 2: Create regression plot using seaborn
sns.regplot(data=advert,
           x=_____,
           y=_____,
           ci=None)
plt.title('Sales vs TV Advertising (Seaborn)')
plt.show()

### Exercise 11: Predictions and Residuals

Calculate predicted values and residuals.

In [None]:
# Generate predictions using the fitted model
# Note: a formula-based model builds its own intercept, so pass the original DataFrame
y_hat = result_formula._____(advert)
print("First 10 predictions:")
print(y_hat.head(10))

# Create a DataFrame with actual and predicted Y (residuals are added below)
comparison_data = pd.DataFrame({
    "Y_actual": _____,
    "Y_predicted": _____
})

# Calculate residuals (errors) as actual - predicted
comparison_data['Residuals'] = comparison_data[_____] - comparison_data[_____]

print("Comparison data:")
print(comparison_data.head(10))
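
A useful sanity check on residuals: when the model includes an intercept, OLS residuals sum to zero and are uncorrelated with the predictor, up to floating-point error. The sketch below illustrates this on synthetic data; the generating values are assumptions, and `np.polyfit` is used here as a stand-in for the statsmodels fit.

```python
import numpy as np

# Synthetic data (hypothetical values, not the advertising dataset)
rng = np.random.default_rng(7)
x = rng.uniform(0, 50, size=100)
y = 1.5 + 2.0 * x + rng.normal(0, 3.0, size=100)

# Fit a straight line; for deg=1, np.polyfit returns [slope, intercept]
slope_hat, intercept_hat = np.polyfit(x, y, deg=1)

# Predicted values and residuals (actual minus predicted)
y_pred = intercept_hat + slope_hat * x
resid = y - y_pred

# Both quantities should be zero up to rounding error
print(f"Sum of residuals: {resid.sum():.2e}")
print(f"Sum of residuals * x: {np.sum(resid * x):.2e}")
```

If either quantity is far from zero in your own results, the predictions and the actual values are probably misaligned or come from different models.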

### Exercise 12: Residual Analysis

Analyze residuals to check model assumptions.

In [None]:
# Plot 1: Distribution of residuals
# Create histogram of residuals with KDE
sns.displot(comparison_data[_____],
           kde=True,
           aspect=1.3)
plt.title("Distribution of Residuals")
plt.show()

# Plot 2: Residuals vs TV advertising (the predictor)
# Create scatter plot of TV vs residuals
plt.figure(figsize=(10, 6))
plt.scatter(advert[_____], comparison_data[_____])
plt.axhline(y=0, color='red', linestyle='--')
plt.xlabel('TV Advertising')
plt.ylabel('Residuals')
plt.title("Residuals vs TV Advertising")
plt.show()
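
A residual scatter plot is often read by eye, but a rough numeric check can back up the impression. One assumption of linear regression is constant residual variance; the sketch below compares the residual spread in the lower and upper halves of the predictor. The data are synthetic and deliberately generated with noise that grows with `x`, so the split by the median is an illustrative heuristic, not a formal test.

```python
import numpy as np

# Synthetic data whose noise grows with x (heteroscedastic by construction)
rng = np.random.default_rng(3)
n = 400
x = rng.uniform(1, 100, size=n)
y = 5.0 + 0.3 * x + rng.normal(0, 0.05 * x, size=n)

# Fit a straight line and compute residuals
slope_hat, intercept_hat = np.polyfit(x, y, deg=1)
resid = y - (intercept_hat + slope_hat * x)

# Compare residual spread below and above the median of x
x_median = np.median(x)
low = resid[x <= x_median]
high = resid[x > x_median]

print(f"Residual std (low x):  {low.std():.3f}")
print(f"Residual std (high x): {high.std():.3f}")
```

A clearly larger spread in one half corresponds to the funnel shape you would look for in the residual plot.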

### Exercise 13: Model Interpretation

Answer the following questions based on your results:

1. **What is the estimated relationship between TV advertising and sales?**
   - Write your interpretation here: _____

2. **What does the R-squared value tell us?**
   - Write your interpretation here: _____

3. **Are the coefficients statistically significant?**
   - Write your analysis here: _____

4. **What do the residual plots suggest about model assumptions?**
   - Write your observations here: _____

### Exercise 14: Additional Calculations

Perform additional model diagnostics.

In [None]:
# Calculate key statistics manually
# =================================
# Calculate R-squared, RMSE, and other metrics
actual = comparison_data['Y_actual']
predicted = comparison_data['Y_predicted']
residuals = comparison_data['Residuals']

# R-squared calculation
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((actual - np.mean(actual)) ** 2)
r_squared_manual = 1 - (ss_res / ss_tot)

# RMSE calculation
rmse = np.sqrt(np.mean(residuals ** 2))

# Mean Absolute Error
mae = np.mean(np.abs(residuals))

print(f"Manual R-squared: {r_squared_manual:.4f}")
print(f"Model R-squared: {result_formula.rsquared:.4f}")
print(f"RMSE: {rmse:.4f}")
print(f"MAE: {mae:.4f}")
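
For reference, the quantities computed in the cell above correspond to the following formulas, where $y_i$ is the actual value, $\hat{y}_i$ the predicted value, $\bar{y}$ the mean of the actual values, and $n$ the number of observations:

$$
R^2 = 1 - \frac{SS_{res}}{SS_{tot}} = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}
$$

$$
\text{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}
\qquad
\text{MAE} = \frac{1}{n} \sum_{i=1}^{n} \left| y_i - \hat{y}_i \right|
$$

Note that this RMSE divides by $n$, matching `np.mean`. The residual standard error reported by statsmodels divides by the residual degrees of freedom ($n - 2$ in simple linear regression), so the two values will differ slightly.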

---

## Reflection Questions

After completing all exercises, reflect on the following:

1. How strong is the relationship between TV advertising and sales?
2. What assumptions of linear regression can you verify from your analysis?
3. What would you recommend to improve this model?
4. How would you explain the business implications of your findings?

---

## Submission Instructions

1. Complete all blanks (`_____`) and TODO items in the code cells
2. Fill in your interpretations for Exercise 13
3. Answer the reflection questions
4. Ensure all code runs without errors
5. Save your notebook with your name in the filename

**Good luck with your analysis!**