# Project to understand Linear regression, Classification and Regression in Python

We will use the Boston housing dataset, which older versions of scikit-learn shipped as `load_boston` (removed in scikit-learn 1.2); here it is downloaded from its original source instead. The dataset describes housing in suburban areas of Boston: the median value of owner-occupied homes, along with attributes such as crime rate, accessibility to highways, and socioeconomic factors. It is commonly used for regression tasks that predict house prices from these features.

**Data source:** [lib.stat.cmu.edu/datasets/boston](http://lib.stat.cmu.edu/datasets/boston)


The Boston house-price data of Harrison, D. and Rubinfeld, D.L., 'Hedonic prices and the demand for clean air', J. Environ. Economics & Management, vol. 5, 81-102, 1978. Used in Belsley, Kuh & Welsch, 'Regression diagnostics ...', Wiley, 1980. N.B. Various transformations are used in the table on pages 244-261 of the latter.

 Variables in order:
- **CRIM:**    per capita crime rate by town
- **ZN:**       proportion of residential land zoned for lots over 25,000 sq.ft.
- **INDUS:**    proportion of non-retail business acres per town
- **CHAS:**     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
- **NOX:**      nitric oxides concentration (parts per 10 million)
- **RM:**       average number of rooms per dwelling
- **AGE:**      proportion of owner-occupied units built prior to 1940
- **DIS:**      weighted distances to five Boston employment centres
- **RAD:**      index of accessibility to radial highways
- **TAX:**      full-value property-tax rate per $10,000
- **PTRATIO:**  pupil-teacher ratio by town
- **B:**        1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
- **LSTAT:**    % lower status of the population
- **MEDV:**     Median value of owner-occupied homes in $1000's


In [1]:
import pandas as pd
import numpy as np 
import matplotlib.pyplot as plt 
import seaborn as sns
import requests

%matplotlib inline

In [2]:
# URL of the dataset
url = "http://lib.stat.cmu.edu/datasets/boston"

# Send a GET request to fetch the data
response = requests.get(url)

# Check if the request was successful
if response.status_code == 200:
    # Decode the content of the response
    content = response.content.decode("utf-8")
    
    # Save the content to a CSV file
    with open("boston_dataset.csv", "w") as f:
        f.write(content)

    print("Data saved successfully as boston_dataset.csv")

else:
    print("Failed to fetch data from the URL")


Data saved successfully as boston_dataset.csv


In [3]:
import pandas as pd

# Define column names
column_names = [
    "CRIM", "ZN", "INDUS", "CHAS", "NOX", 
    "RM", "AGE", "DIS", "RAD", "TAX", 
    "PTRATIO", "B", "LSTAT", "MEDV"
]

# Read the file and skip the 22-line descriptive header
with open("boston_dataset.csv", "r") as file:
    lines = file.readlines()[22:]

# Each record is wrapped over two physical lines; merge every pair into one row
data = []
for i in range(0, len(lines), 2):
    row_data = lines[i].strip().split()[:13]  # Values from the first line of the pair (at most 13)
    row_data += lines[i+1].strip().split()   # Remaining values from the second line
    data.append(row_data)

# Create DataFrame (str.split() returns strings, so the values are not numeric yet)
df = pd.DataFrame(data, columns=column_names)

# Display the DataFrame
df.head()


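| | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | MEDV |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.00632 | 18.0 | 2.31 | 0 | 0.538 | 6.575 | 65.2 | 4.09 | 1 | 296.0 | 15.3 | 396.9 | 4.98 | 24.0 |
| 1 | 0.02731 | 0.0 | 7.07 | 0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2 | 242.0 | 17.8 | 396.9 | 9.14 | 21.6 |
| 2 | 0.02729 | 0.0 | 7.07 | 0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2 | 242.0 | 17.8 | 392.83 | 4.03 | 34.7 |
| 3 | 0.03237 | 0.0 | 2.18 | 0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3 | 222.0 | 18.7 | 394.63 | 2.94 | 33.4 |
| 4 | 0.06905 | 0.0 | 2.18 | 0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3 | 222.0 | 18.7 | 396.9 | 5.33 | 36.2 |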
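The pairwise merge above can also be sketched with pandas and NumPy. This is a minimal illustration, not the notebook's method: `sample` is a hypothetical two-record stand-in for the downloaded file, and it assumes each record is wrapped as 11 values on one line followed by 3 on the next. For the real file one would additionally pass `skiprows=22` to skip the header.

```python
import io

import numpy as np
import pandas as pd

column_names = [
    "CRIM", "ZN", "INDUS", "CHAS", "NOX",
    "RM", "AGE", "DIS", "RAD", "TAX",
    "PTRATIO", "B", "LSTAT", "MEDV",
]

# Hypothetical stand-in for the file body: two records, each wrapped over two lines
sample = (
    "0.00632 18.00 2.310 0 0.5380 6.5750 65.20 4.0900 1 296.0 15.30\n"
    "396.90 4.98 24.00\n"
    "0.02731 0.00 7.070 0 0.4690 6.4210 78.90 4.9671 2 242.0 17.80\n"
    "396.90 9.14 21.60\n"
)

# Whitespace-separated parse; short lines are padded with NaN on the right
raw = pd.read_csv(io.StringIO(sample), sep=r"\s+", header=None)

# Even rows hold the first 11 values, odd rows hold the last 3
merged = np.hstack([raw.values[::2, :], raw.values[1::2, :3]])
df_demo = pd.DataFrame(merged, columns=column_names)

print(df_demo.shape)
print(df_demo.dtypes.unique())
```

A side effect of this route is that the values arrive as floats rather than strings, so no separate type conversion is needed.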


In [4]:
type(df)

pandas.core.frame.DataFrame

### Dividing the dataset into independent and dependent variables

In [5]:
# Set "MEDV" as the target column
target_column = "MEDV"
X = df.drop(columns=[target_column])  # Independent variables (features)
y = df[target_column]  # Dependent variable (target)

# Rename the target column to "Price"
y = y.rename("Price")

In [6]:
y

0      24.00
1      21.60
2      34.70
3      33.40
4      36.20
       ...  
501    22.40
502    20.60
503    23.90
504    22.00
505    11.90
Name: Price, Length: 506, dtype: object
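The `dtype: object` in the output above suggests the values are still stored as strings, since they came from `str.split()`. scikit-learn converted them without complaint in this notebook, but an explicit conversion is safer. The snippet below is a minimal sketch on a hypothetical two-row frame (`raw_demo` is not part of the notebook); on the real data the same call would be applied to `df`.

```python
import pandas as pd

# Hypothetical frame of string values, mimicking what str.split() produces
raw_demo = pd.DataFrame({"RM": ["6.575", "6.421"], "MEDV": ["24.00", "21.60"]})

# Convert every column to a numeric dtype; non-numeric text would raise an error
numeric_demo = raw_demo.apply(pd.to_numeric)

print(numeric_demo.dtypes)
print(numeric_demo["MEDV"].mean())
```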

In [7]:
X

| | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.00632 | 18.00 | 2.310 | 0 | 0.5380 | 6.5750 | 65.20 | 4.0900 | 1 | 296.0 | 15.30 | 396.90 | 4.98 |
| 1 | 0.02731 | 0.00 | 7.070 | 0 | 0.4690 | 6.4210 | 78.90 | 4.9671 | 2 | 242.0 | 17.80 | 396.90 | 9.14 |
| 2 | 0.02729 | 0.00 | 7.070 | 0 | 0.4690 | 7.1850 | 61.10 | 4.9671 | 2 | 242.0 | 17.80 | 392.83 | 4.03 |
| 3 | 0.03237 | 0.00 | 2.180 | 0 | 0.4580 | 6.9980 | 45.80 | 6.0622 | 3 | 222.0 | 18.70 | 394.63 | 2.94 |
| 4 | 0.06905 | 0.00 | 2.180 | 0 | 0.4580 | 7.1470 | 54.20 | 6.0622 | 3 | 222.0 | 18.70 | 396.90 | 5.33 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 501 | 0.06263 | 0.00 | 11.930 | 0 | 0.5730 | 6.5930 | 69.10 | 2.4786 | 1 | 273.0 | 21.00 | 391.99 | 9.67 |
| 502 | 0.04527 | 0.00 | 11.930 | 0 | 0.5730 | 6.1200 | 76.70 | 2.2875 | 1 | 273.0 | 21.00 | 396.90 | 9.08 |
| 503 | 0.06076 | 0.00 | 11.930 | 0 | 0.5730 | 6.9760 | 91.00 | 2.1675 | 1 | 273.0 | 21.00 | 396.90 | 5.64 |
| 504 | 0.10959 | 0.00 | 11.930 | 0 | 0.5730 | 6.7940 | 89.30 | 2.3889 | 1 | 273.0 | 21.00 | 393.45 | 6.48 |


## Linear regression

In [8]:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

lin_rag = LinearRegression()
# scikit-learn maximizes scores, so MSE is reported negated: values closer to 0 are better
mse = cross_val_score(lin_rag, X, y, scoring='neg_mean_squared_error', cv=5)
print(mse)

[-12.46030057 -26.04862111 -33.07413798 -80.76237112 -33.31360656]


In [9]:
mean_mse = np.mean(mse)
print(mean_mse)

-37.13180746769923
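The negative sign comes from the `neg_mean_squared_error` scorer, which negates the MSE so that larger scores are better. A small sketch of turning the fold scores printed above into a positive MSE and an RMSE follows; the array is copied from that output, and the variable names are illustrative. Note that this takes the square root of the mean MSE, which is not the same as averaging per-fold RMSEs.

```python
import numpy as np

# Fold scores copied from the cross_val_score output above
neg_mse = np.array([-12.46030057, -26.04862111, -33.07413798,
                    -80.76237112, -33.31360656])

mse_per_fold = -neg_mse           # flip the sign to recover the actual MSE
mean_mse = mse_per_fold.mean()    # average MSE across the 5 folds
rmse = np.sqrt(mean_mse)          # same units as the target

print(f"mean MSE = {mean_mse:.2f}")
print(f"RMSE     = {rmse:.2f}")
```

Because MEDV is expressed in $1000's, an RMSE of roughly 6.09 corresponds to a typical error of about $6,090.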


In [10]:
from sklearn.model_selection import train_test_split
# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [11]:
fit = lin_rag.fit(X_train, y_train)

In [12]:
# Make predictions on the test set using the trained model
predictions = lin_rag.predict(X_test)
predictions

array([28.99672362, 36.02556534, 14.81694405, 25.03197915, 18.76987992,
       23.25442929, 17.66253818, 14.34119   , 23.01320703, 20.63245597,
       24.90850512, 18.63883645, -6.08842184, 21.75834668, 19.23922576,
       26.19319733, 20.64773313,  5.79472718, 40.50033966, 17.61289074,
       27.24909479, 30.06625441, 11.34179277, 24.16077616, 17.86058499,
       15.83609765, 22.78148106, 14.57704449, 22.43626052, 19.19631835,
       22.43383455, 25.21979081, 25.93909562, 17.70162434, 16.76911711,
       16.95125411, 31.23340153, 20.13246729, 23.76579011, 24.6322925 ,
       13.94204955, 32.25576301, 42.67251161, 17.32745046, 27.27618614,
       16.99310991, 14.07009109, 25.90341861, 20.29485982, 29.95339638,
       21.28860173, 34.34451856, 16.04739105, 26.22562412, 39.53939798,
       22.57950697, 18.84531367, 32.72531661, 25.0673037 , 12.88628956,
       22.68221908, 30.48287757, 31.52626806, 15.90148607, 20.22094826,
       16.71089812, 20.52384893, 25.96356264, 30.61607978, 11.59
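The predictions are only printed here, not scored. One way to score them against the held-out targets is sketched below, assuming synthetic data from `make_regression` in place of the housing frame; every name ending in `_demo` is illustrative. With the notebook's variables, the same two metric calls would take `y_test` and `predictions`.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in with 13 features, like the housing data
X_demo, y_demo = make_regression(n_samples=200, n_features=13,
                                 noise=10.0, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo,
                                          test_size=0.2, random_state=42)

model_demo = LinearRegression().fit(X_tr, y_tr)
pred_demo = model_demo.predict(X_te)

# Compare predictions with the held-out targets
test_mse = mean_squared_error(y_te, pred_demo)
test_r2 = r2_score(y_te, pred_demo)

print(f"test MSE = {test_mse:.2f}")
print(f"test R^2 = {test_r2:.3f}")
```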

### [Ridge regression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Ridge.html)

`class sklearn.linear_model.Ridge(alpha=1.0, *, fit_intercept=True, copy_X=True, max_iter=None, tol=0.0001, solver='auto', positive=False, random_state=None)`

- **alpha** (float or ndarray of shape (n_targets,), default=1.0): constant that multiplies the L2 term, controlling regularization strength. `alpha` must be a non-negative float, i.e. in [0, inf).
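To make the effect of `alpha` concrete, the sketch below fits `Ridge` with three penalty strengths and prints the Euclidean norm of the coefficients. It uses synthetic data from `make_regression` rather than the housing data, and the alpha values are chosen only for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge

# Synthetic regression problem (not the housing data)
X_demo, y_demo = make_regression(n_samples=100, n_features=5,
                                 noise=10.0, random_state=0)

norms = []
for alpha in [0.01, 1.0, 100.0]:
    model = Ridge(alpha=alpha).fit(X_demo, y_demo)
    norms.append(np.linalg.norm(model.coef_))  # L2 norm of the fitted weights
    print(f"alpha={alpha:>6}: ||coef||_2 = {norms[-1]:.3f}")
```

A larger `alpha` penalizes large weights more heavily, so the coefficient norm shrinks as `alpha` grows.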

In [13]:
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

ridge = Ridge()
params = {'alpha': [1e-15, 1e-10, 1e-8, 1e-3, 1e-2, 1, 5, 10, 20]}
ridge_regressor = GridSearchCV(ridge, params, scoring='neg_mean_squared_error', cv=5)
ridge_regressor.fit(X, y)

In [15]:
print(ridge_regressor.best_params_)
print(ridge_regressor.best_score_)

{'alpha': 20}
-32.38025025182519


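The cross-validated MSE drops from about 37.13 to 32.38. The scores are negated, so the value closer to zero is better:

- Linear regression: -37.13
- Ridge regression (alpha = 20): -32.38

Because the best alpha, 20, is the largest value in the grid, an even larger alpha may score better.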

### Lasso regression

In [16]:
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV

lasso = Lasso()
params = {'alpha': [1e-15, 1e-10, 1e-8, 1e-3, 1e-2, 1, 5, 10, 20]}
lasso_regressor = GridSearchCV(lasso, params, scoring='neg_mean_squared_error', cv=5)
lasso_regressor.fit(X, y)

  model = cd_fast.enet_coordinate_descent(


In [17]:
print(lasso_regressor.best_params_)
print(lasso_regressor.best_score_)

{'alpha': 1}
-35.53158022069486
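The `cd_fast.enet_coordinate_descent` output printed by the grid search looks like the tail of convergence warnings; this is an assumption, since the warning text itself is not shown. Lasso often struggles to converge with near-zero alphas on unscaled features. One possible remedy, sketched on synthetic data, is to standardize the features inside a pipeline and raise `max_iter`. The grid below and all `_demo` names are illustrative, not the notebook's.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in with 13 features (not the housing data)
X_demo, y_demo = make_regression(n_samples=200, n_features=13,
                                 noise=15.0, random_state=0)

# Scaling happens inside each CV fold, so no test-fold statistics leak into training
pipeline = make_pipeline(StandardScaler(), Lasso(max_iter=10_000))

# make_pipeline names the step "lasso", hence the "lasso__alpha" key
param_grid = {"lasso__alpha": [1e-3, 1e-2, 1e-1, 1, 5, 10, 20]}
search = GridSearchCV(pipeline, param_grid,
                      scoring="neg_mean_squared_error", cv=5)
search.fit(X_demo, y_demo)

print(search.best_params_)
print(search.best_score_)
```

Note that alpha values are not directly comparable before and after scaling, because the penalty acts on coefficients whose size depends on the feature scale.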


Next, we try more alpha values:

In [None]:
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV

lasso =Lasso()
params = {'alpha':[1e-15,1e-10,1e-8,1e-3,1e-2,1,5,10,20,]}
lasso_regressor = GridSearchCV(lasso, params, scoring='neg_mean_squared_error', cv=5)
lasso_regressor.fit(X,y)