#**Loading the Dataset**


In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline


***%matplotlib inline***

This is a Jupyter Notebook magic command (not standard Python).
It tells Jupyter to display Matplotlib plots directly inside the notebook, right below the code cell.
Without it, plots might open in a separate window, depending on your environment.

In [None]:
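# load_boston was removed from scikit-learn in version 1.2, so the raw file is read directly.
# from sklearn.datasets import load_boston
data_url = "http://lib.stat.cmu.edu/datasets/boston"
# The first 22 lines are a text header; after that, each record is split across two lines.
raw_df = pd.read_csv(data_url, sep=r"\s+", skiprows=22, header=None)
# Even rows hold the first 11 features, odd rows hold the last 2 features plus the target.
data = np.hstack([raw_df.values[::2, :], raw_df.values[1::2, :2]])
target = raw_df.values[1::2, 2]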

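***In scikit-learn***, built-in datasets such as Iris (and, before its removal, Boston) are returned as a sklearn.utils.Bunch object, a dictionary-like container whose keys can also be read as attributes.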

In [None]:
# Wrap the dataset into a sklearn.utils.Bunch object
# so it behaves like sklearn's built-in datasets (e.g., load_iris).
# - data: feature matrix (506 samples × 13 features)
# - target: median house value in $1000s
# - feature_names: names of the 13 features
# - DESCR: short description of the dataset
from sklearn.utils import Bunch
boston = Bunch(
    data = data,
    target = target,
    feature_names = [
        "CRIM", "ZN", "INDUS", "CHAS", "NOX", "RM",
        "AGE", "DIS", "RAD", "TAX", "PTRATIO", "B", "LSTAT"
    ],
    DESCR="Boston Housing dataset"
)


In [None]:
print(boston.data)
pd.DataFrame(boston.data)

#**Preparing the Dataset**







In [None]:
dataset = pd.DataFrame(boston.data,columns=boston.feature_names)
dataset["PRICE"] = boston.target
dataset.head()

In [None]:
dataset.info()

In [None]:
## Summarizing the stats of the data
dataset.describe()

In [None]:
## Check for missing values
dataset.isnull().sum()  # number of missing values in each column


##**Analyzing The Correlated Features**

**Why is correlation so important in a linear regression problem?**

This goes to the heart of linear regression.

Correlation is important in linear regression because:

🔹 1. Linear regression assumes a linear relationship

Linear regression tries to fit a straight line between the **independent variable(s) (X)** and **the dependent variable (y)**.

If there’s no correlation (or very weak), a straight line won’t describe the data well → predictions will be poor.

If there’s a strong correlation, linear regression can capture that relationship effectively.

🔹 2. Correlation helps identify useful predictors

Variables highly correlated with the target are usually better predictors.

If correlation between X and y is close to 0, including that variable may not improve the model.

🔹 3. Correlation reveals multicollinearity

When two independent variables are highly correlated with each other, it causes multicollinearity.

In regression, this makes coefficient estimates unstable and hard to interpret (the model can’t tell which variable really explains the change in y).

Example: in the Boston dataset, TAX and RAD are highly correlated. Including both can confuse the model.

🔹 4. Helps with feature selection & interpretation

Correlation analysis is often the first step before regression:

Which features matter most for predicting y?

Are some features redundant because they are strongly correlated with others?

🔹 5. Relation to R² (coefficient of determination)

In simple linear regression (one feature, fitted with an intercept), the square of the Pearson correlation coefficient r is exactly the R² value on the data used for the fit.

This means correlation directly tells you how much of the variance in y is explained by x.
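The identity in point 5 can be checked numerically. The sketch below uses synthetic data (the slope, intercept, and noise level are arbitrary assumptions, not values from the Boston dataset) and compares the squared Pearson correlation with the R² of a one-feature fit, both computed on the same data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Synthetic one-feature data (assumed values, for illustration only)
rng = np.random.default_rng(42)
x = rng.uniform(0, 10, size=100)
y = 3.0 * x + 2.0 + rng.normal(scale=2.0, size=100)

# Pearson correlation between x and y
r = np.corrcoef(x, y)[0, 1]

# R² of a simple linear regression fitted and scored on the same data
X_col = x.reshape(-1, 1)
model = LinearRegression().fit(X_col, y)
r2 = r2_score(y, model.predict(X_col))

print(r ** 2, r2)  # the two numbers agree up to floating-point error
```

The equality holds only for a single feature with an intercept and when R² is measured on the fitting data; on a held-out test set the two values can differ.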

**Correlation ranges from -1 to 1:**

Close to 1 → strong positive correlation

Close to -1 → strong negative correlation

Close to 0 → weak or no correlation

**In short:**

Correlation gives a first indication of whether a linear model is appropriate.

Strong correlation with the target → likely a good predictor.

No correlation with the target → that feature alone adds little to a linear model.

High correlation between predictors → beware of multicollinearity.
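As a rough sketch of the multicollinearity check described above, the snippet below lists predictor pairs whose absolute correlation exceeds a threshold. The toy DataFrame, its column names, and the 0.8 cutoff are assumptions made for illustration; on the notebook's data the same steps would be applied to the feature columns of dataset.

```python
import numpy as np
import pandas as pd

# Toy predictors: "b" is almost a copy of "a", "c" is unrelated (assumed data)
rng = np.random.default_rng(0)
n = 200
a = rng.normal(size=n)
toy = pd.DataFrame({
    "a": a,
    "b": a + rng.normal(scale=0.1, size=n),
    "c": rng.normal(size=n),
})

# Absolute correlation matrix between predictors
corr = toy.corr().abs()

# Keep only the upper triangle so each pair is reported once
upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(upper).stack()

# Pairs above the (assumed) 0.8 threshold are candidates for multicollinearity
high_pairs = pairs[pairs > 0.8]
print(high_pairs)
```

A pair that shows up here, like TAX and RAD in the Boston data, is a hint to look more closely before interpreting the individual coefficients.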

In [None]:
### Exploratory Data Analysis
## Correlation
dataset.corr() # corr_matrix
dataset.corr()['PRICE'].sort_values(ascending=False)



***seaborn (aliased as sns)***



> seaborn is a Python library for statistical data visualization, built on top of Matplotlib.

It makes plots easier to create and prettier than raw Matplotlib.

Common uses:

Correlation heatmaps (sns.heatmap)

Pairplots (sns.pairplot)

Boxplots, violin plots, regression plots, etc.

In [None]:
# Visualize the pairwise relationships
import seaborn as sns
# pairplot draws a grid of scatter plots for every pair of columns in the DataFrame
sns.pairplot(dataset)


In [None]:
# regplot shows the relationship between two variables together with a fitted regression line
sns.regplot(x='RM', y='PRICE', data=dataset)

In [None]:
plt.scatter(dataset['RM'],dataset['PRICE'])
plt.xlabel('RM')
plt.ylabel('PRICE')

In [None]:
## independent and dependent features
X = dataset.iloc[:,:-1] #dataframe.iloc[rows, columns]
y = dataset.iloc[:,-1]

In [None]:
## train test split
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.3,random_state=42)

***StandardScaler (sklearn.preprocessing)***

Purpose: Standardize features so each has mean = 0 and standard deviation = 1.

Why: Helps machine learning models perform better, especially those sensitive to feature scale (linear regression with regularization, SVM, KNN, etc.).

Steps:

Import: from sklearn.preprocessing import StandardScaler

Create scaler object: scaler = StandardScaler()

Fit & transform data: X_scaled = scaler.fit_transform(X)

fit → computes mean & std of each feature

transform → standardizes the features:

$$X_{\text{scaled}} = \frac{X - \text{mean}}{\text{std}}$$


Effect: Each column of X_scaled now has mean 0 and standard deviation 1.


##StandardScaler: Detailed Explanation

Purpose:

Standardize features so each has mean = 0 and standard deviation = 1.

Helps ML algorithms (like linear regression, SVM, KNN, regularized models) perform better and avoid bias due to feature scale differences.

1️⃣ **Fit on training data**

scaler.fit(X_train)

fit() calculates and stores statistics from training data:

μ_j → mean of feature j

σ_j → standard deviation of feature j

These are stored in scaler attributes:

**scaler.mean_** → mean of each feature

**scaler.scale_** → standard deviation of each feature

**Key:** Fit is done only on training data to avoid leaking information from the test set.

2️⃣ **Transform training and test data**

X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

Uses the stored mean and std from training data to scale any new data (train or test).

**Why not fit on test data?**

Fitting on test data would compute a different mean and std, shifting the scale and introducing data leakage.

Always use training statistics to ensure consistency and correct model evaluation.

3️⃣ **Workflow Summary**

Split data → X_train, X_test

Initialize scaler → scaler = StandardScaler()

Fit scaler on training data → scaler.fit(X_train)

Transform training data → X_train_scaled = scaler.transform(X_train)

Transform test data → X_test_scaled = scaler.transform(X_test)

After scaling:

Each feature has mean 0 and std 1 (training set)

Test set is scaled consistently with training data

4️⃣ **Benefits**

Ensures features are on the same scale → faster convergence for gradient-based algorithms.

Makes coefficients in linear regression directly comparable in magnitude, since every feature is measured in standard deviations.

Essential for regularized models (Ridge, Lasso).
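The workflow above can be sketched end to end on synthetic data. The two made-up features and their scales are assumptions for illustration; the point is that the scaler's statistics come from the training split only, so the training set ends up with mean 0 and std 1 while the test set is only approximately standardized.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Two synthetic features on very different scales (assumed values)
rng = np.random.default_rng(0)
X = rng.normal(loc=[50.0, 0.5], scale=[10.0, 0.1], size=(200, 2))

X_train, X_test = train_test_split(X, test_size=0.3, random_state=42)

# Fit on the training split only, then reuse the stored statistics
scaler = StandardScaler()
scaler.fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(scaler.mean_, scaler.scale_)      # per-feature mean and std from training data
print(X_train_scaled.mean(axis=0))      # approximately 0 for each feature
print(X_train_scaled.std(axis=0))       # approximately 1 for each feature
print(X_test_scaled.mean(axis=0))       # close to 0, but generally not exactly 0

# transform() is just (X - mean_) / scale_ with the training statistics
manual = (X_test - scaler.mean_) / scaler.scale_
```

Comparing manual with X_test_scaled confirms that transform applies the stored training mean and standard deviation to new data.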

In [None]:
## Standardize the dataset
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()

In [None]:
X_train = scaler.fit_transform(X_train)

In [None]:
X_test = scaler.transform(X_test)

#**Model Training**

In [None]:
from sklearn.linear_model import LinearRegression

In [None]:
regression = LinearRegression()
regression.fit(X_train,y_train)

In [None]:
#Coefficients and the intercept
print(regression.coef_)
print(regression.intercept_)


In [None]:
# Parameters the model was trained with
regression.get_params()

In [None]:
### Prediction with Test Data
reg_pred =  regression.predict(X_test)
reg_pred


In [None]:
# Scatter plot of actual values vs. predictions
plt.scatter(y_test,reg_pred)
plt.xlabel('y_test')
plt.ylabel('pred')

In [None]:
# Residuals: actual values minus predictions
residuls = y_test - reg_pred
residuls

***Why plot residuals?***

In linear regression, residuals should ideally follow a normal distribution centered around 0.

Plotting residuals helps check assumptions of linear regression:

Linearity: No systematic patterns

Homoscedasticity: Equal variance across predictions

Normality: Residuals approximately bell-shaped

In [None]:
# Plot the distribution of the residuals
sns.displot(residuls,kind='kde')

####***Residuals vs. predictions scatter plot:***

This is another key diagnostic tool in regression. Let’s break it down:

plt.scatter(reg_pred, residuls)


**Why we plot this**

This plot helps check assumptions of linear regression:

Linearity: Residuals should be randomly scattered around 0 (no pattern).

Homoscedasticity: Spread of residuals should be roughly the same across all predicted values.

Outliers: Points far from 0 are potential outliers.

✅ Ideal pattern: cloud of points, evenly spread, centered at 0.

❌ Bad signs: clear patterns (such as curves or funnel shapes), which mean the model assumptions are violated.
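To make the "funnel shape" warning concrete, here is a crude numeric check on simulated residuals. Everything in it is an assumption for illustration: the data are synthetic, and spread_ratio is a hypothetical helper, not a standard statistical test. It compares the residual spread in the upper half of the predictions with the spread in the lower half.

```python
import numpy as np

rng = np.random.default_rng(1)
pred = np.linspace(1.0, 10.0, 400)

# Case 1: constant spread (homoscedastic residuals)
res_ok = rng.normal(scale=1.0, size=pred.size)
# Case 2: spread grows with the prediction (funnel shape)
res_funnel = rng.normal(scale=0.3 * pred, size=pred.size)

def spread_ratio(pred, res):
    """Std of residuals for the larger predictions divided by std for the smaller ones."""
    order = np.argsort(pred)
    half = pred.size // 2
    low, high = res[order[:half]], res[order[half:]]
    return high.std() / low.std()

ratio_ok = spread_ratio(pred, res_ok)          # expected to be near 1
ratio_funnel = spread_ratio(pred, res_funnel)  # expected to be clearly above 1
print(ratio_ok, ratio_funnel)
```

A ratio far from 1 suggests the spread is not constant; the scatter plot remains the primary diagnostic, and this number is only a quick sanity check.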

In [None]:
## Scatter plot of predictions vs. residuals
plt.scatter(reg_pred,residuls)
plt.xlabel('pred')
plt.ylabel('res')

##**Error Metrics**

To judge how good the model is, we compare the predictions (reg_pred) with the real values (y_test).
For that, we use error metrics.

**Metrics we are using**

1- MAE (Mean Absolute Error)


> Formula:

$$\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} \lvert y_i - \hat{y}_i \rvert$$

> It’s the average absolute error. Easy to understand: “On average, the model is off by this much.”

2- MSE (Mean Squared Error)
>Formula:

$$\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$

> Similar to MAE, but it squares the errors. This punishes big mistakes more.

3- RMSE (Root Mean Squared Error)
>Formula:

$$\mathrm{RMSE} = \sqrt{\mathrm{MSE}}$$



>It’s just the square root of MSE, which brings the error back to the same units as your target variable.
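The three formulas can be worked through by hand on a tiny example. The four values below are made up for illustration; the same arithmetic is what mean_absolute_error and mean_squared_error perform on y_test and reg_pred.

```python
import numpy as np

# Made-up actual values and predictions (for illustration only)
y_true = np.array([3.0, 5.0, 2.0, 7.0])
y_hat = np.array([2.0, 5.0, 4.0, 8.0])

errors = y_true - y_hat            # [ 1,  0, -2, -1]
mae = np.mean(np.abs(errors))      # (1 + 0 + 2 + 1) / 4 = 1.0
mse = np.mean(errors ** 2)         # (1 + 0 + 4 + 1) / 4 = 1.5
rmse = np.sqrt(mse)                # sqrt(1.5), about 1.2247

print(mae, mse, rmse)
```

Note how the single error of 2 contributes 4 to the MSE sum, which is why MSE and RMSE react more strongly to large mistakes than MAE does.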



In [None]:
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error
print(mean_absolute_error(y_test,reg_pred))
print(mean_squared_error(y_test,reg_pred))
print(np.sqrt(mean_squared_error(y_test,reg_pred)))

###**R square and adjusted R square**
####**R square**
>Formula : R2 = 1 - SSR / SST

R^2 : coefficient of determination

SSR : sum of squares of residuals

SST : total sum of squares

####**Adjusted R2**
>Formula : Adjusted R2 = 1 - [(1-R2)*(n-1)/(n-k-1)]

n : number of observations

k : number of predictor variables
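A small helper makes the adjusted R² formula easy to reuse. The function name adjusted_r2 is hypothetical, and the R² of 0.70 in the example is an invented value, not a result from this notebook; n = 152 and k = 13 are chosen to match a 30% test split of 506 rows with 13 features.

```python
def adjusted_r2(r2, n, k):
    """Adjusted R² = 1 - (1 - R²) * (n - 1) / (n - k - 1).

    r2: ordinary R² score
    n:  number of observations
    k:  number of predictor variables
    """
    if n - k - 1 <= 0:
        raise ValueError("need more observations than predictors plus one")
    return 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)

# Illustrative values only: 0.70 is an assumed R², not the notebook's score
example = adjusted_r2(0.70, n=152, k=13)
print(round(example, 4))  # 0.6717
```

Adjusted R² is never larger than R² when k ≥ 1, and the gap widens as more predictors are added relative to the number of observations.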

In [None]:
from sklearn.metrics import r2_score
score = r2_score(y_test,reg_pred)
score

In [None]:
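# Adjusted R2 = 1 - (1 - R2) * (n - 1) / (n - k - 1)
n = len(y_test)        # number of observations
k = X_test.shape[1]    # number of predictor variables
1 - (1 - score) * (n - 1) / (n - k - 1)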

#**New Data Prediction**

In [None]:
boston.data[0].reshape(1,-1).shape
# reshape(rows, columns)
# 1 means we want 1 row.
# -1 tells NumPy: “Automatically calculate the number of columns based on the original size.”
#boston.data[2].shape #(rows,columns)

In [None]:
new_sample = scaler.transform(boston.data[0].reshape(1,-1))  # scale the new data with the training statistics
regression.predict(new_sample)

#**Pickling The Model file For Deployment**

>***What is Pickling?***

Pickling means saving (serializing) a Python object — such as your trained ML model — to a binary file on disk so you can reuse it later without retraining.

The saved file is called a pickle file and usually has the extension .pkl or .sav.

>***Why is it important?***

Because after you train a model:

You don’t want to retrain it every time you want to use it.

You might want to deploy it in a web app, desktop app, or API.

Pickling allows you to save the trained model once, and later load it instantly to make predictions.

>***Library used: pickle***

Python’s built-in module pickle is used for this.

🔹 Example

import pickle

> Save (Pickle) the model

pickle.dump(regression, open('regmodel.pkl', 'wb'))

Explanation:

pickle.dump(obj, file) saves an object to a file.

- 'wb' means write binary mode.
- 'regmodel.pkl' is the filename.

>Load (Unpickle) the model later

loaded_model = pickle.load(open('regmodel.pkl', 'rb'))

> Use it for prediction

predictions = loaded_model.predict(X_test)

Explanation:

- 'rb' means read binary mode.
- loaded_model is now your trained model, ready to use.

pickle is the Python library that performs these two actions:

>***Serialization*** = converting a Python object into a binary format (bytes)
so it can be saved or transmitted.

>***Deserialization*** = converting that binary data back into the original object.
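Here is a minimal round-trip sketch of serialization and deserialization. The toy model, the synthetic data, and the file name toy_model.pkl are assumptions for illustration. It uses with blocks so the file is closed automatically, and writes to a temporary directory to avoid touching the notebook's own files.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.linear_model import LinearRegression

# Train a small model on synthetic data (assumed values)
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 4.0
model = LinearRegression().fit(X, y)

path = os.path.join(tempfile.mkdtemp(), "toy_model.pkl")

# Serialization: Python object -> bytes on disk
with open(path, "wb") as f:
    pickle.dump(model, f)

# Deserialization: bytes on disk -> Python object
with open(path, "rb") as f:
    restored = pickle.load(f)

# The restored model gives the same predictions as the original
same = np.allclose(model.predict(X), restored.predict(X))
print(same)
```

Two practical cautions: only unpickle files you trust, because loading a pickle can execute arbitrary code, and if the model was trained on scaled features the fitted scaler needs to be saved as well so new data can be transformed the same way.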

In [None]:
import pickle

In [None]:
with open('regmodel.pkl', 'wb') as f:  # the file is closed automatically after writing
    pickle.dump(regression, f)

In [None]:
with open('regmodel.pkl', 'rb') as f:  # the file is closed automatically after reading
    pickle_model = pickle.load(f)

In [None]:
## Prediction with the reloaded model
pickle_model.predict(scaler.transform(boston.data[0].reshape(1,-1)))
# Residual for this sample (prediction minus actual value):
# pickle_model.predict(scaler.transform(boston.data[0].reshape(1,-1))) - boston.target[0]
# Signs that the model is "not ok":
# - residuals much larger than your RMSE (e.g., 20 or 30)
# - a consistent positive/negative pattern (model always too high or too low)
# In that case the model might not be well-fitted (maybe it is missing a key feature, or needs nonlinear terms).