In [2]:
import pandas as pd
import numpy as np

import statsmodels.api as sm

In a command-line environment:

pwd (Print Working Directory): This command displays the full path of the current directory you are in. It’s useful for knowing exactly where you are in the directory structure.

ls (List): This command lists the files and folders in the current directory. It shows you the contents of the directory you’re in, helping you see what’s available to work with.

### Simple Linear Regression:

1. **Overview**: 
   - In this section, we will build a simple linear regression model in Python.
   - The goal is to understand how the model works, evaluate its quality, and use it for predictions.

2. **Simple Linear Regression**:
   - Simple linear regression uses one feature to predict the target value.
   - The term "simple" refers to the use of just one feature. When multiple features are used, it's called multiple linear regression.

3. **Model Basics**:
   - Simple linear regression fits a line through data points on a scatter plot.
   - Example: Predicting **price** (y-axis) based on **carat weight** (x-axis).
   - After fitting the regression model, you get a line that best fits the data, though it may not be a perfect fit.

4. **Linear Regression Equation**:
   - The equation used in simple linear regression helps describe the relationship between the feature and the target variable.
   - We will visualize how the model finds the "line of best fit."

5. **Model Evaluation**:
   - After building the regression model, we will evaluate its accuracy using statistical tests and metrics.

6. **Visualizing the Model**:
   - The model’s performance is often visualized with a scatter plot and the regression line.
   - The line may not pass through all data points, but it should follow the overall trend.

7. **Training and Testing**:
   - When we build a model, we need to split the data into **training** and **testing** datasets.
   - The training dataset is used to train the model, while the test dataset is used to evaluate its performance.

8. **Next Steps**:
   - We will continue improving the model and learn how to interpret the summary statistics to assess its effectiveness.
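
The training/testing split described in item 7 can be sketched with NumPy alone. This is a minimal illustration on synthetic data; the variable names (`carat`, `price`), the 80/20 ratio, and the generated values are assumptions for the example, not the course's actual split.

```python
import numpy as np

# Synthetic stand-in data (hypothetical): 100 carat weights with noisy prices.
rng = np.random.default_rng(0)
carat = rng.uniform(0.2, 3.0, size=100)
price = -2000 + 7500 * carat + rng.normal(0, 500, size=100)

# Shuffle the row indices, then hold out the first 20% of them as the test set.
indices = rng.permutation(len(carat))
n_test = int(0.2 * len(carat))
test_idx, train_idx = indices[:n_test], indices[n_test:]

X_train, y_train = carat[train_idx], price[train_idx]
X_test, y_test = carat[test_idx], price[test_idx]

print(len(X_train), len(X_test))
```

The model would be fit on `X_train`/`y_train` only, and `X_test`/`y_test` would be kept aside for evaluation.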



### Simple Linear Regression Model:

1. **Linear Relationship**:
   - The linear regression model describes the relationship between the feature (X) and the target (Y).
   - **Simple Linear Regression** uses just one feature to make predictions.

2. **Model Equation**:
   - The equation for simple linear regression is:  
     $$ \hat{y} = \beta_0 + \beta_1 X $$

### Explanation of Each Component:
- **$\hat{y}$**: This is the predicted value of the dependent variable \( y \) based on the model.
- **$\beta_0$**: This is the **intercept**. It represents the value of \( y \) when \( X = 0 \). In other words, it’s where the line crosses the y-axis.
- **$\beta_1$**: This is the **slope** or **coefficient** for \( X \). It shows how much \( y \) is expected to change for a one-unit increase in \( X \).
- **$X$**: This is the independent variable (or predictor) whose effect on \( y \) we want to model.

---

### Error in Real Life

In reality, there is always some error, meaning the model may not perfectly predict the target value.  
The equation becomes:

$$
Y = \hat{Y} + \epsilon
$$

where $\epsilon$ (epsilon) is the error term that accounts for the difference between the actual value ($Y$) and the predicted value ($\hat{Y}$).

### Error Term:
- $\epsilon$ (error) represents the difference between the actual and predicted values.
- The goal is to minimize this error so that the predicted values ($\hat{Y}$) are as close as possible to the actual values ($Y$).

### Minimizing Error:
In real life, it’s rare for all points to fall perfectly on the line.  
The aim is to make the error (or residual) as small as possible, ensuring that the predicted values are close to the actual values.

### Summary:
- The top equation shows the ideal prediction using the model.
- The equation with the error term shows what happens in reality, with some variance from the target values due to error.
- The goal is to minimize the difference between the predicted and actual values.
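
To make the two equations concrete, the sketch below computes $\hat{Y}$ from a line and then recovers the error term as $\epsilon = Y - \hat{Y}$. The coefficients and the four observations are made-up values chosen only for illustration.

```python
import numpy as np

# Hypothetical coefficients and observations, for illustration only.
beta_0, beta_1 = 12.0, 0.4
X = np.array([10.0, 20.0, 30.0, 40.0])
Y = np.array([17.0, 19.5, 24.0, 27.0])

Y_hat = beta_0 + beta_1 * X   # ideal prediction from the model equation
epsilon = Y - Y_hat           # error term: actual minus predicted

print("Predicted:", Y_hat)
print("Errors:   ", epsilon)
```

Adding `epsilon` back to `Y_hat` reproduces `Y` exactly, which is all the second equation says.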


### Least Squared Error and Linear Regression

1. **Error and Prediction**:
   - **Error** is the difference between the actual value of the target (\(Y\)) and the predicted value **(\(\hat{Y}\))** from the model.

2. **Least Squared Error (LSE)**:
   - **Least squared error** is the method used to fit a linear regression model by finding the line that best represents the relationship between the feature (\(X\)) and target (\(Y\)).
   - This method minimizes the sum of the squared differences (errors) between the actual and predicted values.

3. **Why Squared Error?**:
   - Squaring the errors makes them positive and prevents large errors in one direction from canceling out small errors in the other direction.
   - It also simplifies the math, making it easier to solve.

4. **Impact of Outliers**:
   - A drawback of squared errors is that **outliers** (extreme values) can greatly influence the line, potentially distorting the fit.

5. **Ordinary Least Squares (OLS)**:
   - **OLS** is another name for the traditional linear regression method we're using here. It's common in the industry.
   - Other methods exist, but they're less commonly used in general practice.

6. **Visualizing Least Squared Error**:
   - Imagine a scatter plot where the target (\(Y\)) values are plotted against the feature (\(X\)).
   - A very basic model might predict the average of \(Y\) (e.g., a constant value), which gives a simple line. However, the error (the distance from actual \(Y\) to the predicted \(\hat{Y}\)) is large.
   - By drawing a better line based on knowledge of \(X\), we can reduce the error, but it still won’t be optimal.
   - Using **linear regression** helps find the best line, minimizing the total squared error.

7. **How Linear Regression Fits the Model**:
   - Linear regression calculates the best line by adjusting the slope and intercept to minimize the sum of squared errors.
   - For example, if the line is:
   $$
   \hat{Y} = 12 + 0.4X
   $$
   This represents the best line with the least error, where the sum of squared errors is the smallest.

8. **In Summary**:
   - Linear regression works by minimizing the total squared error, which gives us the optimal slope and intercept for the line that best fits the data.
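
The comparison in item 6 (mean-only model, hand-drawn line, least squares line) can be checked numerically. The sketch below uses synthetic data generated around the example line $\hat{Y} = 12 + 0.4X$ and the standard closed-form solution for simple regression; the data, the noise level, and the "guessed" line are assumptions for the example.

```python
import numpy as np

# Synthetic data scattered around Y = 12 + 0.4 X (hypothetical values).
rng = np.random.default_rng(42)
X = rng.uniform(0, 50, size=200)
Y = 12 + 0.4 * X + rng.normal(0, 2, size=200)

def sse(y, y_hat):
    """Sum of squared errors between actual and predicted values."""
    return float(np.sum((y - y_hat) ** 2))

# 1) Baseline: predict the mean of Y for every observation.
sse_mean = sse(Y, np.full_like(Y, Y.mean()))

# 2) A hand-picked line that uses X but is not optimal.
sse_guess = sse(Y, 10 + 0.5 * X)

# 3) Closed-form least squares slope and intercept for one feature.
slope = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean()) ** 2)
intercept = Y.mean() - slope * X.mean()
sse_ols = sse(Y, intercept + slope * X)

print(f"SSE mean-only: {sse_mean:.1f}")
print(f"SSE guessed:   {sse_guess:.1f}")
print(f"SSE OLS:       {sse_ols:.1f} (slope={slope:.3f}, intercept={intercept:.3f})")
```

No other straight line can produce a smaller SSE on this data than the least squares line, which is why its SSE is the lowest of the three.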


### Regression Libraries in Python:

1. **Different Regression Implementations**:
   - Python offers various ways to implement regression, including coding it from scratch using basic Python and **NumPy**. However, for this course, we’ll focus on two popular libraries: **Statsmodels** and **scikit-learn**.

2. **Statsmodels**:
   - **Statsmodels** is great for **inference** (understanding the relationship between variables). It’s especially useful for those familiar with tools like SAS, R, or Excel for regression.
   - Statsmodels provides detailed outputs, such as coefficient estimates and statistical tests, which help analyze the regression.
   - **Drawback**: It’s harder to use in production and doesn’t always include the latest algorithms for regression.

3. **Scikit-learn**:
   - **Scikit-learn** is ideal for **prediction**. It’s a popular library for machine learning in Python and offers a variety of regression algorithms.
   - It allows easy comparison of different regression models and is well-suited for production deployment.
   - **Drawback**: It doesn’t provide as detailed statistical outputs as Statsmodels. It focuses more on prediction and model performance.

4. **Comparison Between Statsmodels and Scikit-learn**:
   - Both libraries use the same mathematical principles and return the same regression equation.
   - Statsmodels gives a lot of information upfront (e.g., statistical tests, coefficient estimates).
   - Scikit-learn is simpler and focuses on prediction, but you need to explicitly ask for outputs like coefficients.

5. **Regression Model Fitting**:
   - We start by importing data processing libraries like **Pandas** and **NumPy**.
   - Then, import **Statsmodels** (as `sm`) and **scikit-learn's** linear regression model.
   - Data is split into features (X) and target (Y). In **Statsmodels**, we call `sm.OLS` (Ordinary Least Squares), fit the model, and then use the `summary()` method to get detailed results.
   - In **scikit-learn**, we create a `LinearRegression()` estimator and call `fit()`. It doesn’t have a `summary()` method, but you can access the coefficients directly through `intercept_` and `coef_`.

6. **Key Differences**:
   - **Statsmodels** asks for the target (Y) first when fitting the model, whereas **scikit-learn** expects the features (X) first.
   - Statsmodels automatically provides a summary of the regression model, while in scikit-learn, you must request the outputs explicitly.

7. **Summary**:
   - We’ll start with Statsmodels to explore detailed model outputs, then gradually transition to **scikit-learn** for its flexibility and suitability in prediction and production scenarios.

In [8]:
pwd

'C:\\Users\\ymona\\PythonScripts\\Python Data Science Regression & Forecasting\\Demo Notebooks'

In [11]:
ls

 Volume in drive C is Windows
 Volume Serial Number is 4A1D-0E57

 Directory of C:\Users\ymona\PythonScripts\Python Data Science Regression & Forecasting\Demo Notebooks

11/10/2024  09:59 PM    <DIR>          .
11/10/2024  06:11 PM    <DIR>          ..
11/10/2024  06:11 PM    <DIR>          .ipynb_checkpoints
11/10/2024  06:11 PM           806,443 01_EDA_Demos.ipynb
11/10/2024  09:59 PM           111,507 02_Regression_Demos.ipynb
11/10/2024  06:11 PM           274,967 02_Simple_Regression_Case_Study.ipynb
11/10/2024  06:11 PM           147,298 03_multiple_regression_demos.ipynb
11/10/2024  06:11 PM           714,819 04_Assumptions_Demos.ipynb
11/10/2024  06:11 PM            15,630 05_Validating_Testing_Demos.ipynb
11/10/2024  06:11 PM           105,491 06_feature_engineering_demos.ipynb
11/10/2024  06:11 PM            49,314 07_regularized_regression_demos.ipynb
11/10/2024  06:11 PM           486,064 08_time_series_demos.ipynb
               9 File(s)      2,711,533 bytes
             

In [6]:
diamonds = pd.read_csv("../Data/Diamonds Prices2022.csv")

diamonds.head()

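| Unnamed: 0 | carat | cut | color | clarity | depth | table | price | x | y | z |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.23 | Ideal | E | SI2 | 61.5 | 55.0 | 326 | 3.95 | 3.98 | 2.43 |
| 1 | 0.21 | Premium | E | SI1 | 59.8 | 61.0 | 326 | 3.89 | 3.84 | 2.31 |
| 2 | 0.23 | Good | E | VS1 | 56.9 | 65.0 | 327 | 4.05 | 4.07 | 2.31 |
| 3 | 0.29 | Premium | I | VS2 | 62.4 | 58.0 | 334 | 4.2 | 4.23 | 2.63 |
| 4 | 0.31 | Good | J | SI2 | 63.3 | 58.0 | 335 | 4.34 | 4.35 | 2.75 |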


### Fitting a Regression in sklearn

In [26]:
from sklearn.linear_model import LinearRegression

X = diamonds[["carat"]]
y = diamonds["price"]

lr = LinearRegression().fit(X, y)

print(f"Intercept: {lr.intercept_}")
print(f"Coefficients: {lr.coef_}")


Intercept: -2256.3950475375823
Coefficients: [7756.43615951]


### Fitting a Regression in Statsmodels

In [11]:
X = sm.add_constant(diamonds["carat"])
y = diamonds["price"]

model = sm.OLS(y, X).fit()

model.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | price | R-squared: | 0.849 |
| Model: | OLS | Adj. R-squared: | 0.849 |
| Method: | Least Squares | F-statistic: | 304100.0 |
| Date: | Tue, 08 Aug 2023 | Prob (F-statistic): | 0.0 |
| Time: | 11:03:41 | Log-Likelihood: | -472760.0 |
| No. Observations: | 53943 | AIC: | 945500.0 |
| Df Residuals: | 53941 | BIC: | 945500.0 |
| Df Model: | 1 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | -2256.3950 | 13.055 | -172.840 | 0.000 | -2281.983 | -2230.807 |
| carat | 7756.4362 | 14.066 | 551.423 | 0.000 | 7728.866 | 7784.006 |

| | | | |
|---|---|---|---|
| Omnibus: | 14027.005 | Durbin-Watson: | 0.986 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 153060.389 |
| Skew: | 0.939 | Prob(JB): | 0.0 |
| Kurtosis: | 11.036 | Cond. No. | 3.65 |


###  Building Regression Models with Statsmodels

1. **Basic Setup**:
   - To build a regression model in **Statsmodels**, we need to import the library and set up our data.
   - **Steps**:
     - Import `statsmodels.api` as `sm`.
     - Create a data frame `X` for the features (predictor variables) and add a constant column.
     - Create a data frame or series `Y` for the target variable.
     - Use `sm.OLS()` to set up the model, passing `Y` for the target and `X` for the features, then call the `fit()` method to build the model.
     - Store the model in a variable (e.g., `model`) and use `model.summary()` to view the model output.

2. **Adding a Constant**:
   - **Why Add a Constant?** `sm.OLS` does not include an intercept unless the feature data contains a constant column, so by default the fitted line is forced through the origin (0,0). Adding a constant allows the model to calculate a y-intercept that’s not necessarily zero.
   - The constant is simply a column in the features data frame where every value is 1. This helps the model compute the intercept term (y-intercept).
   - Without adding a constant, the model forces the line to go through the origin, which might not be appropriate in most cases.

3. **How to Add a Constant**:
   - You can manually add a constant column, but it's easier and more reliable to use the built-in `add_constant()` function in Statsmodels.

4. **Interpreting the Model Summary**:
   - The summary output can be overwhelming, but it’s divided into three main sections:
     1. **Model Summary Statistics** (e.g., R-squared, F-statistic)
     2. **Variable (Coefficient) Summary Statistics** (e.g., coefficient estimates, p-values)
     3. **Residual (Error) Statistics** (e.g., Omnibus, Durbin-Watson, Jarque-Bera, skew, kurtosis)
   - It's common not to look at every detail of the summary but to focus on key sections relevant to model performance.

5. **Common Issues**:
   - If you don’t add a constant, the R-squared value will be labeled as “Uncentered,” indicating the model does not have an intercept.
   - Once you add a constant, the model will provide a proper intercept term.

6. **Code Example**:
   - Import necessary libraries: `pandas`, `numpy`, and `statsmodels.api`.
   - Prepare the data by selecting the relevant features (e.g., `carat` as `X` and `price` as `Y`).
   - Fit the model with `sm.OLS(Y, X).fit()`.
   - If you don't add a constant, the model will display "Uncentered" in the summary.
   - Add a constant using `sm.add_constant(X)` to correct the model and include the intercept.

7. **Storing the Model**:
   - It's helpful to store the fitted model in a variable (e.g., `model = sm.OLS(Y, X).fit()`) so you can access different parts of the model output later.

8. **Next Steps**:
   - After building the model, we will dive deeper into interpreting the outputs and understanding key metrics like coefficients and statistical tests.

## Interpreting Linear Regression Coefficients

### Linear Regression Overview:
Linear regression is widely used due to its simplicity and ease of interpretation. It's a valuable tool for understanding relationships between variables. The key to interpreting the model lies in the **coefficients** (often referred to as the "Coef" column) in the variable summary statistics.

### Model Setup:
In this example, we are predicting the price of a diamond based on its carat weight. The regression equation can be written as:

$$
y = \text{intercept} + (\text{slope} \times \text{carat weight})
$$

Where:
- **Intercept (constant)** = -2256 (the price when the carat weight is zero)
- **Slope (carat coefficient)** = 7756 (the price increase for each additional carat)

### Interpreting Coefficients:
- **Intercept**: The intercept term suggests that a diamond with zero carat weight would cost **-2256**. While this is a theoretical value, it may not be meaningful in a real-world context (since diamonds can't have zero carat weight).
  
- **Slope (Carat Coefficient)**: The slope of **7756** indicates that each additional carat is associated with an average increase of **7756** in the predicted price. This is a meaningful relationship that directly connects carat weight with price.

### Caution with Causality:
While the model suggests a relationship between carat weight and price, we must be cautious about claiming causality. This model reflects **correlation**, not causality. There could be other unaccounted factors (such as diamond cut, color, or clarity) influencing the price. Including these factors in the model could change the coefficients and provide a more complete picture.

### Practical Considerations:
- The **intercept** term may be less meaningful in practical terms, especially if it represents a scenario that doesn't occur in reality (like a diamond with zero carat weight).
- However, the **slope (carat coefficient)** is the more important coefficient, as it describes the actual relationship between carat weight and price.

### Next Steps:
Once you've built the model, you can use it to make predictions on new data, such as predicting the price of a diamond based on its carat weight. The model will allow you to estimate how much the price increases as the carat weight increases, giving you valuable insights for pricing diamonds.


### Making Predictions

In [18]:
model.predict([1, 2])

array([13256.47727148])

In [13]:
new_diamonds = pd.DataFrame({"carat": [0, .1, .3, .5, 1, 2, 3, 5]})

In [14]:
model.predict(sm.add_constant(new_diamonds))

0    -2256.395048
1    -1480.751432
2       70.535800
3     1621.823032
4     5500.041112
5    13256.477271
6    21012.913431
7    36525.785750
dtype: float64

### Notes on Using the Predict Method for Model Predictions

1. **Purpose of the Predict Method**:
   - The `predict` method is used to make predictions with a fitted model, whether for individual data points or entire datasets.
   - It's especially useful when you want to make predictions for multiple data points (data frames).

2. **Predicting a Single Data Point**:
   - For a single prediction, you can pass the data directly to the `predict` method.
   - Example:
     - Model coefficients: intercept = -2256, carat coefficient = 7756.
     - To predict the price of a 1.5-carat diamond:
       - Input: `1` (constant), `1.5` (carat weight).
       - Equation: `-2256.395 + (7756.436 × 1.5) ≈ 9378.26` (with the coefficients rounded to whole numbers, `-2256 + (7756 × 1.5) = 9378`).
       - The predicted price for a 1.5-carat diamond is about $9,378.26.

3. **Predicting for Multiple Data Points**:
   - You can create a data frame with multiple values (e.g., carat weights) and pass it to the `predict` method to get predictions for all of them.
   - Example: A data frame with carat values like 0.5, 1, 1.5, 2, 2.5 will generate predictions for each.

4. **Handling Constant Term**:
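   - When predicting, the input must include the constant column of ones, in the same position it had when the model was fit. `sm.add_constant()` adds it for you.
   - Without the constant, the input has one column fewer than the model expects, so `predict` will typically raise a shape error rather than return prices.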

5. **Example of Predicting with a Data Frame**:
   - You can predict for multiple carat values (e.g., 0.5, 1, 1.5, etc.) using a data frame.
   - After adding the constant column, passing this data frame into the `predict` method will return the predicted prices.

6. **Handling Small Values**:
   - If the carat values are very small, the model can predict negative prices because the negative intercept outweighs the slope term. In the output above, 0 and 0.1 carats give negative predictions, and 0.3 carats gives only about $70.
   - This is an indication that a straight line does not fit very small diamonds well; the predictions are more plausible for larger carat weights.

7. **Model Accuracy – R Squared**:
   - After making predictions, it's important to measure how well the model fits the data using accuracy metrics like R squared, which will be discussed next.
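
The predictions above can be reproduced by hand from the regression equation, which is a useful check on what `predict` is doing. The sketch below hard-codes the intercept and slope printed earlier in this notebook and applies them to the same carat values; nothing is refit here.

```python
import numpy as np

# Coefficients copied from the fitted model output shown earlier in this notebook.
intercept = -2256.3950475375823
slope = 7756.43615951

# Same carat values as the new_diamonds data frame.
carats = np.array([0, 0.1, 0.3, 0.5, 1, 2, 3, 5])
predicted_prices = intercept + slope * carats

for c, p in zip(carats, predicted_prices):
    print(f"{c:>4} carat -> {p:,.2f}")

# Single prediction for the 1.5-carat example in the notes.
price_1_5 = intercept + slope * 1.5
print(f"1.5 carat -> {price_1_5:,.2f}")
```

The values match the `model.predict(...)` output, which confirms that the constant column simply multiplies the intercept by 1 for every row.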

## R-Squared: The Coefficient of Determination

Many of you have likely already heard the term **R-squared**, and that's because it's one of the most important metrics in regression. R-squared (also known as the **coefficient of determination**) is a measure that tells us how well our model is predicting the target variable compared to simply using its mean.

### What Does R-Squared Tell Us?

- R-squared measures how much better the model is at predicting the target compared to using just the mean. 
- If you remember from the line of best fit, we started with the mean and adjusted the slope of our line to reduce the sum of squares.

For an OLS model with an intercept, evaluated on the data it was fit to, **R-squared** will always be between **0 and 1**:
- **R-squared = 0**: The model does no better than predicting the mean. It's equivalent to using the mean as the prediction for all observations.
- **R-squared = 1**: The model explains **100%** of the variance in the data—this would be a perfect fit.

### How R-Squared Relates to Simple Regression

In the case of **simple (single-variable) regression**, **R-squared** is equal to the square of the **correlation coefficient** between the independent variable (X) and the dependent variable (Y). 

In other words:
$$
R^2 = \left( \text{correlation coefficient} \right)^2
$$
For example, if we calculate the correlation between **carat weight (X)** and **price (Y)** and square it, we get an **R-squared** of 0.849. This means our model explains 84.9% of the variation in price around its mean.

### R-Squared Formula and Deeper Understanding

The formula for **R-squared** is:

$$
R^2 = 1 - \frac{SSE}{SST}
$$

Where:
- **SSE** is the **Sum of Squared Errors**, which represents the squared distance between the observed values and the values predicted by our model. It reflects the variance of the data that is **not explained** by the model.
- **SST** is the **Sum of Squared Total**, which represents the total variance between the observed values and the mean of the target variable.

#### Breaking It Down:

- **SSE**: This measures the error in the model. The smaller the SSE, the better the model fits the data.
- **SST**: This measures the total variance in the target variable **Y**. If you just predicted the mean of **Y** for all data points, you would have this as the total variance.
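
Both routes to R-squared (the formula $1 - SSE/SST$ and the squared correlation) can be compared directly. The sketch below uses synthetic data and `np.polyfit` for the least squares line; the data and names are assumptions for the example.

```python
import numpy as np

# Synthetic data with a linear trend plus noise (hypothetical values).
rng = np.random.default_rng(3)
x = rng.uniform(0.2, 3.0, size=400)
y = -2000 + 7500 * x + rng.normal(0, 1500, size=400)

# Least squares line (polyfit returns the slope first, then the intercept).
slope, intercept = np.polyfit(x, y, 1)
y_hat = intercept + slope * x

sse = np.sum((y - y_hat) ** 2)      # variation left unexplained by the model
sst = np.sum((y - y.mean()) ** 2)   # total variation around the mean
r_squared = 1 - sse / sst

# In simple regression, R-squared equals the squared correlation of x and y.
r = np.corrcoef(x, y)[0, 1]

print(f"R-squared from SSE/SST: {r_squared:.4f}")
print(f"Correlation squared:    {r ** 2:.4f}")
```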

### What Happens at R-Squared = 0?

- If **R-squared = 0**, it means our model is doing no better than using the mean value for every prediction. In this case, we could simply use the mean to make predictions and the model would perform just as well.

### Relative Importance of R-Squared

- **R-squared** should always be interpreted relative to the **data** and the **problem** at hand:
  - In fields like **sports analytics**, an **R-squared of 0.05** might be considered excellent.
  - In more precise fields like **physics** or **engineering**, where you're testing theoretical models, an **R-squared of 0.95** might indicate that the model or theory needs further refinement.

Even a small improvement over the mean can have significant practical value depending on the field and the specific use case.

### Conclusion

R-squared is a key metric that tells us how well our model fits the data. However, it’s important to consider the context and use it along with other evaluation methods to get a full picture of model performance.

Now, let's shift gears and look at how **hypothesis testing** fits into the bigger picture of model evaluation.


### Hypothesis Testing in Regression

1. **What is Hypothesis Testing?**:
   - Hypothesis tests are used to help make decisions or draw conclusions based on data.
   - For example, testing if one group performed better than another in a test.

2. **Purpose in Regression**:
   - While we don’t need to be experts in hypothesis testing to build regression models, we need to understand the tests included in the outputs.
   - Regression models include hypothesis tests like the **F test** to determine whether the model is significantly better than using the mean of the target variable.

3. **F Test**:
   - The F test checks if the model adds value or if the results are just random noise.
   - It’s about determining whether the model is useful or not. If the F test shows a significant result, we can conclude that the model is better than just predicting the mean.

4. **Steps in a Hypothesis Test**:
   - **State the null and alternative hypothesis**:
     - **Null hypothesis** (H0): The model is not useful (i.e., no improvement over the mean).
     - **Alternative hypothesis** (Ha): The model is useful (i.e., it’s better than the mean).
   - **Set a significance level (alpha)**: This determines the threshold for making decisions. A lower alpha means it’s harder to reject the null hypothesis, while a higher alpha makes it easier.
   - **Calculate the test statistic and p-value**: These help assess the strength of the evidence against the null hypothesis.
   - **Draw a conclusion**:
     - If **p-value ≤ alpha**, reject the null hypothesis, meaning the model is likely useful.
     - If **p-value > alpha**, don’t reject the null hypothesis, meaning the model might not be useful and needs further adjustments (e.g., more data or features).

5. **F Test Hypotheses**:
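   - **Null hypothesis (H0)**: all slope coefficients are zero; in simple regression, $\beta_1 = 0$ (the model is no better than the mean).
   - **Alternative hypothesis (Ha)**: at least one slope coefficient is not zero; in simple regression, $\beta_1 \neq 0$ (the model is better than the mean).
   - The goal is to reject the null hypothesis and show that the model is adding value.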

6. **Significance Level (Alpha)**:
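   - Alpha is the probability we are willing to accept of rejecting the null hypothesis when it is actually true (a false positive).
   - The industry standard is often **alpha = 0.05**, meaning we accept a 5% chance of concluding the model is useful when it is not.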
   - Some industries, like pharmaceuticals, use stricter thresholds (e.g., **alpha = 0.01 or 0.001**) to avoid errors that could have severe consequences.

7. **Conclusion**:
   - The hypothesis test, especially the F test, helps evaluate whether a regression model is meaningful or if it’s just producing random results.
   - The significance level helps control the risk of making incorrect decisions about the model’s effectiveness.

## F-statistic and P-value: Key Concepts in Model Evaluation

The **F-statistic** is an important measure in regression analysis that helps us understand the **overall significance** of the model. Specifically, it represents the ratio of the **variance explained by the model** to the **variance not explained** by the model. In simple terms, it tells us how well our model is fitting the data compared to a model that predicts only the mean.

### Relationship Between R-squared and F-statistic

The F-statistic is closely related to **R-squared**, as both give us an idea of how well the model explains the variability in the target variable (Y). However, while **R-squared** looks at how much of the variance is explained by the model, the **F-statistic** evaluates the **overall significance** of the regression model, considering all variables together.

### What Does the F-statistic Tell Us?

- The **F-statistic** tells us if the variability explained by the model is significantly larger than the variability left unexplained.
- A **high F-statistic** indicates that the model explains a significant portion of the variability in the target variable, making it a good model overall.
- A **low F-statistic** means the model does not explain much more variability than a model that would predict just the mean of the target variable.

### P-value for the F-statistic

The **P-value** associated with the F-statistic is the **probability** of seeing an F-statistic at least this large if the null hypothesis were true, that is, if the model were really no better than a simple mean model.

- If the **P-value** is **low** (typically less than **0.05**), it suggests that the model is statistically significant, and we can confidently say that the predictors (independent variables) in the model are related to the target variable.
- If the **P-value** is **high** (greater than **0.05**), it suggests that the model does not provide enough evidence to reject the null hypothesis, meaning the predictors are not statistically significant.

### Interpreting the F-statistic and P-value in Practice

Let's take a look at the **model summary**:

- **F-statistic**: about 304,100 (shown as `304100.0` in the summary), which is a **very high value**. This suggests that the model explains a lot of the variance in the target variable.
- **P-value for F-statistic**: Reported as 0.0 (near zero), which means an F-statistic this large would be extremely unlikely if carat weight had no linear relationship with price.

#### Significance of the P-value

- Since the **P-value** is near **zero** (far less than the significance threshold, typically **0.05**), we can reject the **null hypothesis**.
- This means that the model is statistically significant and that the predictors (such as **carat weight**) are good predictors of the target variable (**diamond price**).

### Conclusion: Hypothesis Testing

- If the **P-value** is **less than alpha** (0.05), we **reject the null hypothesis**, meaning the model is significantly better than using just the mean.
- If the **P-value** is **greater than alpha**, we **fail to reject** the null hypothesis, meaning the model may not provide a significant improvement over the mean.

In our case, with a **P-value near zero**, we can confidently reject the null hypothesis and conclude that **carat weight is a good predictor** of the price of a diamond.

### Summary:
- **F-statistic**: Tells us the overall significance of the model by comparing explained vs. unexplained variance.
- **P-value**: Helps us determine if the model is significantly better than predicting just the mean. If the P-value is small (less than 0.05), we reject the null hypothesis and conclude that the model is meaningful.
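
For readers who want to see where the two numbers come from, the sketch below computes the F-statistic as explained variance over unexplained variance and looks up its p-value in the F distribution. It uses synthetic data and `scipy.stats.f.sf`; the data, sample size, and names are assumptions for the example, and Statsmodels reports the same quantities in its summary.

```python
import numpy as np
from scipy import stats

# Synthetic data (hypothetical values): n observations, k = 1 feature.
rng = np.random.default_rng(4)
n, k = 200, 1
x = rng.uniform(0.2, 3.0, size=n)
y = -2000 + 7500 * x + rng.normal(0, 1500, size=n)

slope, intercept = np.polyfit(x, y, 1)
y_hat = intercept + slope * x

sse = np.sum((y - y_hat) ** 2)      # unexplained variation
sst = np.sum((y - y.mean()) ** 2)   # total variation
ssr = sst - sse                     # variation explained by the model

# F = (explained variation per model df) / (unexplained variation per residual df)
f_stat = (ssr / k) / (sse / (n - k - 1))

# P-value: probability of an F at least this large if the slope were truly zero.
p_value = stats.f.sf(f_stat, k, n - k - 1)

# The same F-statistic written in terms of R-squared.
r_squared = 1 - sse / sst
f_from_r2 = (r_squared / k) / ((1 - r_squared) / (n - k - 1))

print(f"F-statistic: {f_stat:.1f}, p-value: {p_value:.3g}")
```

The last two lines of the calculation show why R-squared and the F-statistic move together: for a fixed sample size, a higher R-squared always gives a higher F.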


### Residual Plot

In [55]:
import seaborn as sns

sns.scatterplot(x=model.predict(), y=model.resid)


## Evaluating Model Fit Using Residuals

In addition to metrics like $R^2$ and hypothesis tests like the **F-test**, another very useful way to evaluate how well a model fits is through **visualization**.

### Visualizing Residuals:

Residual plots show how well a model performs across the range of predictions. Ideally, we want to see that these residual plots:
- Are **centered around zero**,
- Are roughly **normally distributed**,
- Do not have any **clear pattern**.

In these plots:
- The **x-axis** represents the predicted values of the target variable.
- The **y-axis** represents the **residuals** or errors of the predictions.

If the residual is equal to zero, that means we perfectly predicted the value, and points that fall along the zero line fall exactly on the regression line.

### Understanding Residuals:

For example, if the predicted value is 30 and the residual is 10, that indicates that:
- We **under-predicted** by 10 units, meaning the actual value is 10 units higher than the predicted value.

### Relation to Sum of Squared Error:

This ties directly into the **sum of squared errors (SSE)**, which is the value the model is trying to minimize when fitting the regression line.

By visually inspecting residuals, if we see any clear patterns, this might indicate areas where we could improve the model and further reduce the error.

### Residual Plot Example:

In **Statsmodels**, the `model.resid` attribute holds a series with the residuals, which are calculated as:

$$
\text{Residual} = Y - \hat{Y}
$$

Where:
- $Y$ is the actual value,
- $\hat{Y}$ is the predicted value.

For example:
- Actual price = \$666,
- Predicted price = \$846,
- Residual = \$666 - \$846 = -180.

This negative residual indicates **over-prediction**, meaning the model has predicted too high.
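
The sign convention is easy to check in code. In the sketch below, the first actual/predicted pair is the \$666 vs. \$846 example above; the other two pairs are made-up values added to show a positive and a zero residual.

```python
import numpy as np

# First pair is the example above; the other two are hypothetical.
actual = np.array([666.0, 1200.0, 950.0])
predicted = np.array([846.0, 1100.0, 950.0])

residuals = actual - predicted   # Residual = Y - Y_hat

# Negative residual -> over-prediction, positive residual -> under-prediction.
labels = np.where(residuals > 0, "under-prediction",
                  np.where(residuals < 0, "over-prediction", "exact"))

for a, p, r, label in zip(actual, predicted, residuals, labels):
    print(f"actual={a:.0f} predicted={p:.0f} residual={r:+.0f} ({label})")
```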

### Interpreting the Residual Plot:

When plotting all the residuals, you might observe a strange shape. Some of this is caused by the fact that diamond prices in the dataset are capped at roughly \$20,000.

- **Positive residuals** indicate **under-prediction** (the actual value is greater than the predicted value).
- **Negative residuals** indicate **over-prediction** (the predicted value is higher than the actual value).

As we observe the plot:
- For smaller diamonds, the model performs well and the residuals are small.
- As the price increases (for larger diamonds), the model begins to **over-predict**, especially for diamonds near the upper end of the price range. This suggests that the model isn't capturing the characteristics of these larger diamonds well.

### Next Steps:

We'll use the residual plot as part of our **model diagnostics** later in the course. Even for simple regressions, residual plots can quickly help us visualize:
- Where the model performs well (residuals close to zero),
- Where the model doesn’t fit so well, which can inform decision-making.

In the next section, we'll go through a **case study** that combines all of these steps to see how they work together.



In the context of linear regression, residuals are the differences between the actual values $Y$ and the predicted values $\hat{Y}$ for each data point. Residuals are calculated as:

$$
\text{Residual} = Y - \hat{Y}
$$

- **Positive residuals** ($Y - \hat{Y} > 0$) indicate under-prediction because the actual value $Y$ is larger than the predicted value $\hat{Y}$. In other words, the model has predicted too low for that particular observation.

- **Negative residuals** ($Y - \hat{Y} < 0$) indicate over-prediction because the actual value $Y$ is smaller than the predicted value $\hat{Y}$. This means the model has predicted too high for that particular observation.

### Intuition:
- **Under-prediction** (positive residual) occurs when the model's prediction is too small compared to the true value, and thus the residual is positive.
- **Over-prediction** (negative residual) occurs when the model's prediction is too large compared to the true value, and thus the residual is negative.

### In summary:
- **Positive residual**: The model underestimates the target value.
- **Negative residual**: The model overestimates the target value.
