## Choosing the right estimator for my problem

Some things to note:
- Classification problem - predicting a category
- Regression problem - predicting a number
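
As a rough check of which kind of problem you have, scikit-learn's `type_of_target` utility reports how it interprets a target array. A minimal sketch, assuming two small made-up targets (the values are illustrative only):

```python
from sklearn.utils.multiclass import type_of_target

# Hypothetical targets, made up for illustration
category_target = ["cat", "dog", "cat", "bird"]  # labels -> classification
number_target = [4.526, 3.585, 3.521, 0.923]     # real values -> regression

category_kind = type_of_target(category_target)
number_kind = type_of_target(number_target)

print(category_kind)  # multiclass
print(number_kind)    # continuous
```

Treat the result as a hint rather than a verdict: a numeric target made up only of whole numbers is reported as a class target, even if you intend to do regression.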


The scikit-learn machine learning map can help narrow down the choice: https://scikit-learn.org/stable/machine_learning_map.html


In [2]:
import pandas as pd
import numpy as np

Characteristics of the California housing dataset used below:

Number of instances: 20640

Number of attributes: 8 numeric, predictive attributes and the target

Attribute information:
- MedInc - median income in block group
- HouseAge - median house age in block group
- AveRooms - average number of rooms per household
- AveBedrms - average number of bedrooms per household
- Population - block group population
- AveOccup - average number of household members
- Latitude - block group latitude
- Longitude - block group longitude

Missing attribute values: None

In [3]:
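from sklearn.datasets import fetch_california_housing

# Despite the name, this is a dict-like Bunch object, not a DataFrame;
# the DataFrame is built from it in a later cell
housing_df = fetch_california_housing()

housing_df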

{'data': array([[   8.3252    ,   41.        ,    6.98412698, ...,    2.55555556,
           37.88      , -122.23      ],
        [   8.3014    ,   21.        ,    6.23813708, ...,    2.10984183,
           37.86      , -122.22      ],
        [   7.2574    ,   52.        ,    8.28813559, ...,    2.80225989,
           37.85      , -122.24      ],
        ...,
        [   1.7       ,   17.        ,    5.20554273, ...,    2.3256351 ,
           39.43      , -121.22      ],
        [   1.8672    ,   18.        ,    5.32951289, ...,    2.12320917,
           39.43      , -121.32      ],
        [   2.3886    ,   16.        ,    5.25471698, ...,    2.61698113,
           39.37      , -121.24      ]]),
 'target': array([4.526, 3.585, 3.521, ..., 0.923, 0.847, 0.894]),
 'frame': None,
 'target_names': ['MedHouseVal'],
 'feature_names': ['MedInc',
  'HouseAge',
  'AveRooms',
  'AveBedrms',
  'Population',
  'AveOccup',
  'Latitude',
  'Longitude'],
 'DESCR': '.. _california_housing_dataset:\n

In [8]:
df = pd.DataFrame(housing_df["data"], columns=housing_df["feature_names"])
df.head()

| | MedInc | HouseAge | AveRooms | AveBedrms | Population | AveOccup | Latitude | Longitude |
|---|---|---|---|---|---|---|---|---|
| 0 | 8.3252 | 41.0 | 6.984127 | 1.02381 | 322.0 | 2.555556 | 37.88 | -122.23 |
| 1 | 8.3014 | 21.0 | 6.238137 | 0.97188 | 2401.0 | 2.109842 | 37.86 | -122.22 |
| 2 | 7.2574 | 52.0 | 8.288136 | 1.073446 | 496.0 | 2.80226 | 37.85 | -122.24 |
| 3 | 5.6431 | 52.0 | 5.817352 | 1.073059 | 558.0 | 2.547945 | 37.85 | -122.25 |
| 4 | 3.8462 | 52.0 | 6.281853 | 1.081081 | 565.0 | 2.181467 | 37.85 | -122.25 |


In [14]:
# The target is the median house value for California districts,
# expressed in hundreds of thousands of dollars ($100,000)
df["MedHouseVal"] = housing_df["target"]
df.head()

| | MedInc | HouseAge | AveRooms | AveBedrms | Population | AveOccup | Latitude | Longitude | MedHouseVal |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 8.3252 | 41.0 | 6.984127 | 1.02381 | 322.0 | 2.555556 | 37.88 | -122.23 | 4.526 |
| 1 | 8.3014 | 21.0 | 6.238137 | 0.97188 | 2401.0 | 2.109842 | 37.86 | -122.22 | 3.585 |
| 2 | 7.2574 | 52.0 | 8.288136 | 1.073446 | 496.0 | 2.80226 | 37.85 | -122.24 | 3.521 |
| 3 | 5.6431 | 52.0 | 5.817352 | 1.073059 | 558.0 | 2.547945 | 37.85 | -122.25 | 3.413 |
| 4 | 3.8462 | 52.0 | 6.281853 | 1.081081 | 565.0 | 2.181467 | 37.85 | -122.25 | 3.422 |
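
Because `MedHouseVal` is expressed in units of $100,000, converting a value to dollars is a single multiplication. A small sketch using the first target value from the table above (the helper name `to_dollars` is hypothetical, not part of pandas or scikit-learn):

```python
def to_dollars(med_house_val):
    """Convert a MedHouseVal figure (units of $100,000) to whole dollars."""
    return round(med_house_val * 100_000)


first_row_value = 4.526  # MedHouseVal of row 0 in the table above
first_row_dollars = to_dollars(first_row_value)
print(f"${first_row_dollars:,}")  # prints $452,600
```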


In [35]:
len(df)

20640

In [None]:
df.info()

In [17]:
# Set up the random seed so the train/test split below is reproducible
np.random.seed(42)

In [33]:
from sklearn.model_selection import train_test_split

# Separate the features (X) from the target (y)
X = df.drop("MedHouseVal", axis=1)
y = df["MedHouseVal"]

# Split the data into train and test sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
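
`train_test_split` also accepts a `random_state` argument, which pins the shuffle for that one call instead of relying on NumPy's global seed. A minimal sketch on stand-in data with the same number of rows as the housing frame (the arrays and the `_demo` names are placeholders, not the real features):

```python
import numpy as np
from sklearn.model_selection import train_test_split

n_rows = 20640  # same row count as the housing DataFrame
X_demo = np.arange(n_rows * 2).reshape(n_rows, 2)  # placeholder features
y_demo = np.arange(n_rows)                         # placeholder target

# Passing random_state makes this particular split repeatable
split_a = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)
split_b = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

X_tr, X_te, y_tr, y_te = split_a
print(len(X_tr), len(X_te))  # 16512 4128

# Both calls select exactly the same test rows
same_split = np.array_equal(y_te, split_b[3])
print(same_split)
```

With `test_size=0.2`, 4128 of the 20640 rows are held out for testing and the remaining 16512 are used for training.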

In [57]:
from sklearn.linear_model import Lasso, Ridge

# Candidate regression estimators, both with default hyperparameters
models = {
    "Ridge": Ridge(),
    "Lasso": Lasso(),
}

In [63]:
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error

# Fit each model on the training set, then evaluate it on the held-out test set
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)

    r2 = r2_score(y_test, y_pred)              # higher is better (max 1.0)
    mae = mean_absolute_error(y_test, y_pred)  # lower is better
    mse = mean_squared_error(y_test, y_pred)   # lower is better

    print(f"Model: {name}")
    print(f" R² Score: {r2:.4f}")
    print(f" MAE: {mae:.4f}")
    print(f" MSE: {mse:.4f}\n")



Model: Ridge
 R² Score: 0.5759
 MAE: 0.5332
 MSE: 0.5558

Model: Lasso
 R² Score: 0.2842
 MAE: 0.7616
 MSE: 0.9380
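
To make the three numbers above concrete, here is a sketch that computes them by hand for four made-up predictions. The function names and the sample values are hypothetical; the formulas are the standard definitions of MAE, MSE and R².

```python
def mae(y_true, y_pred):
    """Mean absolute error: the average size of the errors."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mse(y_true, y_pred):
    """Mean squared error: the average squared error (punishes big misses)."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r2(y_true, y_pred):
    """R²: 1 minus (model's squared error / squared error of predicting the mean)."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Made-up values for illustration
y_true_demo = [3.0, -0.5, 2.0, 7.0]
y_pred_demo = [2.5, 0.0, 2.0, 8.0]

print(mae(y_true_demo, y_pred_demo))           # 0.5
print(mse(y_true_demo, y_pred_demo))           # 0.375
print(round(r2(y_true_demo, y_pred_demo), 4))  # 0.9486
```

Lower MAE and MSE are better and higher R² is better, which is why Ridge comes out ahead of Lasso on all three metrics in the output above.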



In [45]:
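# Coefficient of determination, more commonly known as R-squared (R²):
# the proportion of the variance in the target that the model explains
# (1.0 is a perfect fit; 0.0 is no better than always predicting the mean).
# After the loop above, `model` is bound to the last estimator (Lasso),
# so select Ridge explicitly.
models["Ridge"].score(X_test, y_test)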

0.5758549611440131
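
For scikit-learn regressors, `.score()` returns the same R² that `r2_score` computes from the predictions. A quick check on synthetic data (the data and the names `X_demo`, `y_demo` and `ridge_demo` are made up for this sketch):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score

# Synthetic regression data: a linear signal plus a little noise
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)

# Train on the first 150 rows, evaluate on the last 50
ridge_demo = Ridge().fit(X_demo[:150], y_demo[:150])

score_via_method = ridge_demo.score(X_demo[150:], y_demo[150:])
score_via_metric = r2_score(y_demo[150:], ridge_demo.predict(X_demo[150:]))

scores_match = bool(np.isclose(score_via_method, score_via_metric))
print(scores_match)
```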