In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

In [2]:
from sklearn.datasets import fetch_california_housing

california_housing = fetch_california_housing()
print(california_housing.DESCR)

.. _california_housing_dataset:

California Housing dataset
--------------------------

**Data Set Characteristics:**

:Number of Instances: 20640

:Number of Attributes: 8 numeric, predictive attributes and the target

:Attribute Information:
    - MedInc        median income in block group
    - HouseAge      median house age in block group
    - AveRooms      average number of rooms per household
    - AveBedrms     average number of bedrooms per household
    - Population    block group population
    - AveOccup      average number of household members
    - Latitude      block group latitude
    - Longitude     block group longitude

:Missing Attribute Values: None

This dataset was obtained from the StatLib repository.
https://www.dcc.fc.up.pt/~ltorgo/Regression/cal_housing.html

The target variable is the median house value for California districts,
expressed in hundreds of thousands of dollars ($100,000).

This dataset was derived from the 1990 U.S. census, using one row per ce

In [3]:
california_housing_df = pd.DataFrame(california_housing.data, columns=california_housing.feature_names)
california_housing_df['target'] = california_housing.target

display(california_housing_df.head())

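|   | MedInc | HouseAge | AveRooms | AveBedrms | Population | AveOccup | Latitude | Longitude | target |
|---|--------|----------|----------|-----------|------------|----------|----------|-----------|--------|
| 0 | 8.3252 | 41.0 | 6.984127 | 1.02381 | 322.0 | 2.555556 | 37.88 | -122.23 | 4.526 |
| 1 | 8.3014 | 21.0 | 6.238137 | 0.97188 | 2401.0 | 2.109842 | 37.86 | -122.22 | 3.585 |
| 2 | 7.2574 | 52.0 | 8.288136 | 1.073446 | 496.0 | 2.80226 | 37.85 | -122.24 | 3.521 |
| 3 | 5.6431 | 52.0 | 5.817352 | 1.073059 | 558.0 | 2.547945 | 37.85 | -122.25 | 3.413 |
| 4 | 3.8462 | 52.0 | 6.281853 | 1.081081 | 565.0 | 2.181467 | 37.85 | -122.25 | 3.422 |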


In [4]:
X = california_housing_df.drop('target', axis=1)
y = california_housing_df['target']

print("Features (X) shape:", X.shape)
print("Target (y) shape:", y.shape)

Features (X) shape: (20640, 8)
Target (y) shape: (20640,)


In [5]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print("Training features shape:", X_train.shape)
print("Testing features shape:", X_test.shape)
print("Training target shape:", y_train.shape)
print("Testing target shape:", y_test.shape)

Training features shape: (16512, 8)
Testing features shape: (4128, 8)
Training target shape: (16512,)
Testing target shape: (4128,)


# Task
Standardize the California housing dataset and save the scaled data to a file using pickle.

## Standardize the data

### Subtask:
Use `StandardScaler` from `sklearn.preprocessing` to standardize the training and testing feature sets (`X_train` and `X_test`).


**Reasoning**:
Standardize the training and testing feature sets using StandardScaler.



In [6]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print("Scaled training features shape:", X_train_scaled.shape)
print("Scaled testing features shape:", X_test_scaled.shape)

Scaled training features shape: (16512, 8)
Scaled testing features shape: (4128, 8)
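What `StandardScaler` does in the cell above can be sketched with plain NumPy, assuming its default settings: each column is centered on the training mean and divided by the training standard deviation (population form, `ddof=0`), and the same training statistics are reused for the test set. The arrays below are synthetic stand-ins, not the housing data, and the variable names are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins for X_train / X_test (hypothetical values, 3 features).
train = rng.normal(loc=[5.0, 30.0, 1000.0], scale=[2.0, 10.0, 500.0], size=(200, 3))
test = rng.normal(loc=[5.0, 30.0, 1000.0], scale=[2.0, 10.0, 500.0], size=(50, 3))

# "fit": learn per-column statistics from the training set only.
mean_ = train.mean(axis=0)
scale_ = train.std(axis=0)  # population std (ddof=0), the StandardScaler default

# "transform": apply the training statistics to both sets.
train_scaled = (train - mean_) / scale_
test_scaled = (test - mean_) / scale_

print(train_scaled.mean(axis=0).round(6))  # approximately 0 in every column
print(train_scaled.std(axis=0).round(6))   # approximately 1 in every column
print(test_scaled.mean(axis=0).round(3))   # near 0, but generally not exactly 0
```

Fitting on the training set alone keeps information about the test set out of the preprocessing step, which is why the test set is only transformed.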


## Summary:

### Data Analysis Key Findings

*   The training and testing feature sets were successfully standardized using `StandardScaler`.
*   The shapes of the scaled training and testing feature sets are (16512, 8) and (4128, 8) respectively.
*   The scaled training and testing feature sets were successfully saved to `X_train_scaled.pkl` and `X_test_scaled.pkl` files using the `pickle` library.

### Insights or Next Steps

*   The scaled data is ready for use in machine learning model training and evaluation.
*   The saved pickle files can be easily loaded later for further analysis or model deployment.


# Task
Train a linear regression model on the California housing dataset, evaluate its performance, and summarize the results.

## Import linear regression

### Subtask:
Import the `LinearRegression` model from `sklearn.linear_model`.


**Reasoning**:
Import the LinearRegression model from sklearn.linear_model.



In [8]:
from sklearn.linear_model import LinearRegression

## Initialize and train model

### Subtask:
Initialize a `LinearRegression` model and train it on the scaled training data (`X_train_scaled` and `y_train`).


**Reasoning**:
Initialize and train the Linear Regression model.



In [9]:
model = LinearRegression()
model.fit(X_train_scaled, y_train)
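As a rough illustration of what `model.fit` computes, `LinearRegression` solves an ordinary least squares problem: it picks the coefficients and intercept that minimize the sum of squared residuals. The sketch below reproduces that with `np.linalg.lstsq` on synthetic data; `X_demo`, `y_demo`, and the "true" coefficients are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic regression problem (hypothetical): 3 features, known coefficients.
X_demo = rng.normal(size=(500, 3))
true_coef = np.array([0.8, -0.3, 0.1])
true_intercept = 2.0
y_demo = X_demo @ true_coef + true_intercept + rng.normal(scale=0.01, size=500)

# Append a column of ones so the intercept is estimated with the coefficients.
A = np.hstack([X_demo, np.ones((X_demo.shape[0], 1))])
solution, *_ = np.linalg.lstsq(A, y_demo, rcond=None)
coef, intercept = solution[:-1], solution[-1]

print("coefficients:", coef.round(2))
print("intercept:", round(float(intercept), 2))
```

After fitting the real model, the corresponding values are available as `model.coef_` and `model.intercept_`.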

## Evaluate model

### Subtask:
Evaluate the trained model's performance on the scaled testing data (`X_test_scaled` and `y_test`) using relevant metrics like Mean Squared Error or R-squared.


**Reasoning**:
Evaluate the trained model's performance on the scaled testing data using Mean Squared Error and R-squared.



In [10]:
from sklearn.metrics import mean_squared_error, r2_score

y_pred = model.predict(X_test_scaled)

mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print("Mean Squared Error (MSE):", mse)
print("R-squared (R2) Score:", r2)

Mean Squared Error (MSE): 0.5558915986952442
R-squared (R2) Score: 0.575787706032451


## Summary:

### Data Analysis Key Findings

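*   The Mean Squared Error (MSE) of the linear regression model on the scaled testing data is approximately 0.556. Because the target is expressed in units of 100,000 US dollars, this corresponds to a root mean squared error of about 0.746, or roughly 74,600 US dollars.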
*   The R-squared (R2) score of the model on the scaled testing data is approximately 0.576.

### Insights or Next Steps

*   The R-squared score of 0.576 indicates that the linear regression model explains about 57.6% of the variance in median house values on the test set, which is a moderate fit.
*   Further analysis could involve exploring other regression models or techniques to potentially improve performance, such as polynomial regression or ensemble methods.


In [7]:
import pickle

with open('X_train_scaled.pkl', 'wb') as f:
    pickle.dump(X_train_scaled, f)

with open('X_test_scaled.pkl', 'wb') as f:
    pickle.dump(X_test_scaled, f)

print("Scaled data saved to X_train_scaled.pkl and X_test_scaled.pkl")

Scaled data saved to X_train_scaled.pkl and X_test_scaled.pkl
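Loading the files back is the mirror image of the cell above. The sketch below round-trips a small stand-in array through `pickle` in a temporary directory and checks that nothing changed; `demo_scaled` and the file name are placeholders, not the notebook's actual files.

```python
import os
import pickle
import tempfile

import numpy as np

# Hypothetical stand-in for X_train_scaled.
demo_scaled = np.arange(12, dtype=float).reshape(4, 3)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_scaled.pkl")

    # Write in binary mode, exactly as in the cell above.
    with open(path, "wb") as f:
        pickle.dump(demo_scaled, f)

    # Read back in binary mode; pickle.load rebuilds the NumPy array.
    with open(path, "rb") as f:
        restored = pickle.load(f)

print(restored.shape)
print(np.array_equal(demo_scaled, restored))
```

Unpickling executes code embedded in the file, so `pickle.load` should only be used on files from a trusted source.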


In [11]:
import pandas as pd


new_data = {
    'MedInc': [5.0, 3.5, 6.8],
    'HouseAge': [15.0, 30.0, 10.0],
    'AveRooms': [5.5, 4.2, 7.1],
    'AveBedrms': [1.0, 0.9, 1.1],
    'Population': [1000.0, 1500.0, 800.0],
    'AveOccup': [2.8, 3.1, 2.5],
    'Latitude': [34.0, 36.0, 33.5],
    'Longitude': [-118.0, -120.0, -117.5]
}

new_data_df = pd.DataFrame(new_data)

display(new_data_df)

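|   | MedInc | HouseAge | AveRooms | AveBedrms | Population | AveOccup | Latitude | Longitude |
|---|--------|----------|----------|-----------|------------|----------|----------|-----------|
| 0 | 5.0 | 15.0 | 5.5 | 1.0 | 1000.0 | 2.8 | 34.0 | -118.0 |
| 1 | 3.5 | 30.0 | 4.2 | 0.9 | 1500.0 | 3.1 | 36.0 | -120.0 |
| 2 | 6.8 | 10.0 | 7.1 | 1.1 | 800.0 | 2.5 | 33.5 | -117.5 |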
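The notebook builds `new_data_df` but does not show a prediction for it. A minimal sketch of that step follows, assuming the new rows must pass through the already fitted scaler (`transform`, never `fit_transform`) with columns in the training order. To stay self-contained it trains a throwaway scaler and model on synthetic data, so `demo_scaler`, `demo_model`, and the printed predictions are placeholders with no real meaning.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

feature_names = ['MedInc', 'HouseAge', 'AveRooms', 'AveBedrms',
                 'Population', 'AveOccup', 'Latitude', 'Longitude']

# Synthetic stand-in training data (hypothetical; use X_train / y_train in the notebook).
rng = np.random.default_rng(0)
demo_X_train = pd.DataFrame(rng.normal(size=(100, 8)), columns=feature_names)
demo_y_train = 0.5 * demo_X_train['MedInc'] + 2.0

demo_scaler = StandardScaler().fit(demo_X_train)
demo_model = LinearRegression().fit(demo_scaler.transform(demo_X_train), demo_y_train)

# New rows, deliberately listed in a different column order.
new_rows = pd.DataFrame({
    'Longitude': [-118.0, -120.0, -117.5],
    'Latitude': [34.0, 36.0, 33.5],
    'MedInc': [5.0, 3.5, 6.8],
    'HouseAge': [15.0, 30.0, 10.0],
    'AveRooms': [5.5, 4.2, 7.1],
    'AveBedrms': [1.0, 0.9, 1.1],
    'Population': [1000.0, 1500.0, 800.0],
    'AveOccup': [2.8, 3.1, 2.5],
})

# Restore the training column order, then scale with the fitted scaler and predict.
new_rows = new_rows[feature_names]
new_scaled = demo_scaler.transform(new_rows)
new_pred = demo_model.predict(new_scaled)
print(new_pred.shape)
```

In the notebook itself the equivalent call would be `model.predict(scaler.transform(new_data_df))`, since `new_data_df` already uses the training column order.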


## Deployment of the model file

In [17]:
# Pickling serializes the trained LinearRegression object to a file, preserving its state,
# so it can be loaded later for deployment or further use without retraining.
import pickle

# Save the trained model to regmodel.pkl in binary write mode.
with open('regmodel.pkl', 'wb') as f:
    pickle.dump(model, f)

# Load the model back in binary read mode; this deserializes the byte stream
# into a Python object, which is assigned to pickled_model.
with open('regmodel.pkl', 'rb') as f:
    pickled_model = pickle.load(f)

## Prediction
# california_housing.data[0].reshape(1, -1) takes the first row of the original data
# array and reshapes it to 2D, as scaler.transform requires.
# scaler.transform(...) scales that row with the scaler fitted on the training data
# (the scaler was fitted on a DataFrame, so a bare array may trigger a feature-names warning).
# pickled_model.predict(...) then predicts on the scaled row.
# The result is printed explicitly so that it becomes the cell's visible output.
print(pickled_model.predict(scaler.transform(california_housing.data[0].reshape(1, -1))))
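The cell above pickles only the model, so the fitted `scaler` must still be in memory (or be saved separately) when a prediction is made. One possible alternative, sketched here under assumptions rather than taken from the notebook, is to wrap the scaler and the model in a scikit-learn `Pipeline` and pickle that single object. The data and names (`demo_X`, `demo_y`, `pipeline`) are synthetic placeholders.

```python
import pickle

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (hypothetical): 8 features, exactly linear target.
rng = np.random.default_rng(1)
demo_X = rng.normal(loc=10.0, scale=3.0, size=(200, 8))
demo_y = demo_X @ np.arange(1, 9) + 5.0

# The pipeline scales the input and then applies the regression in one call.
pipeline = make_pipeline(StandardScaler(), LinearRegression())
pipeline.fit(demo_X, demo_y)

# Serialize to bytes and restore; pickle.dump / pickle.load do the same via a file.
payload = pickle.dumps(pipeline)
restored = pickle.loads(payload)

# The restored pipeline accepts raw, unscaled rows directly.
raw_row = demo_X[0].reshape(1, -1)
print(np.allclose(pipeline.predict(raw_row), restored.predict(raw_row)))
```

With this layout the deployment code needs one file and one `predict` call, and the scaling step cannot be skipped by mistake. A pickled estimator should be loaded with the same scikit-learn version that created it.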