# Introduction to Scikit-Learn (Sklearn)

This notebook demonstrates some of the most useful functions of the beautiful Scikit-Learn library.
What we are going to cover:

0. An end-to-end Scikit-Learn workflow
1. Getting the data ready
2. Choosing the right estimator/algorithm for our problem
3. Fitting the model/algorithm and using it to make predictions on our data
4. Evaluating a model
5. Improving a model
6. Saving and loading a trained model
7. Putting it all together

<img src="Precision,Recall,F1-score,Support.png"/>

<img src="Accuracy.png"/>

<img src="Precision,Recall,F1-score,Support,Accuracy Meaing.png"/>

## 0. An end-to-end Scikit-Learn workflow

In [1]:
# 1. Get the data ready
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
heart_disease = pd.read_csv("heart-disease.csv")
heart_disease

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2,1
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2,1
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2,1
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
298,57,0,0,140,241,0,1,123,1,0.2,1,0,3,0
299,45,1,3,110,264,0,1,132,0,1.2,1,0,3,0
300,68,1,0,144,193,1,1,141,0,3.4,1,2,3,0
301,57,1,0,130,131,0,1,115,1,1.2,1,1,3,0


In [2]:
# Create X (features matrix)
X=heart_disease.drop("target", axis=1)

# Create y (labels)
y = heart_disease["target"]

In [3]:
# 2. Choose the right model and hyperparameters
import sklearn
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier()

# we will keep the default hyperparameters
clf.get_params()

{'bootstrap': True,
 'ccp_alpha': 0.0,
 'class_weight': None,
 'criterion': 'gini',
 'max_depth': None,
 'max_features': 'sqrt',
 'max_leaf_nodes': None,
 'max_samples': None,
 'min_impurity_decrease': 0.0,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'monotonic_cst': None,
 'n_estimators': 100,
 'n_jobs': None,
 'oob_score': False,
 'random_state': None,
 'verbose': 0,
 'warm_start': False}

In [4]:
X.shape, y.shape

((303, 13), (303,))

<img src="cross_validation_grid_search_workflow.png" width="400"/>

In [5]:
# 3. Fit the model to the training data
from sklearn.model_selection import train_test_split

# split the data into train and test sets
X_train, X_test, y_train, y_test=train_test_split(X,y,test_size=0.2)
X_train.shape,X_test.shape,y_train.shape,y_test.shape, type(X_train),type(X_test),type(y_train),type(y_test)

((242, 13),
 (61, 13),
 (242,),
 (61,),
 pandas.core.frame.DataFrame,
 pandas.core.frame.DataFrame,
 pandas.core.series.Series,
 pandas.core.series.Series)
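
The split above is random, so the rows in each set (and every score computed afterwards) change from run to run. Below is a minimal sketch of pinning the split with `random_state` and keeping the class ratio similar in both sets with `stratify`. The data is a synthetic stand-in generated with `make_classification`, not the heart-disease table, and all `*_demo` names are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the same shape as the heart-disease features (303 x 13)
X_demo, y_demo = make_classification(n_samples=303, n_features=13, random_state=0)

# random_state fixes the shuffle; stratify keeps the class ratio similar in both sets
split_a = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo)
split_b = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo)

X_train_demo, X_test_demo, y_train_demo, y_test_demo = split_a

# Two calls with the same random_state return identical splits
same_split = all((a == b).all() for a, b in zip(split_a, split_b))
print(X_train_demo.shape, X_test_demo.shape, same_split)
```

Passing `random_state` to `train_test_split` makes the split itself repeatable, independently of any global NumPy seed set elsewhere in the notebook.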

In [6]:
# Fit the model to the training data

clf.fit(X_train,y_train)

RandomForestClassifier()


In [7]:
# Make a prediction
y_prads = clf.predict(X_test)
y_prads

array([1, 0, 1, 0, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 0, 1, 0, 1,
       1, 1, 0, 1, 0, 1, 0, 0, 1, 1, 0, 1, 0, 1, 0, 1, 1, 1, 0, 1, 0, 1,
       0, 1, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1])

In [8]:
y_test

40     1
279    0
87     1
274    0
37     1
      ..
129    1
236    0
18     1
73     1
3      1
Name: target, Length: 61, dtype: int64

In [9]:
# 4. Evaluate the model on the training data and the test data
# For a classifier, score() returns the mean accuracy (a fraction between 0 and 1), not an error
clf.score(X_train,y_train),clf.score(X_test,y_test)

(1.0, 0.8032786885245902)

In [10]:
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
print(classification_report(y_true=y_test,y_pred=y_prads))

              precision    recall  f1-score   support

           0       0.80      0.74      0.77        27
           1       0.81      0.85      0.83        34

    accuracy                           0.80        61
   macro avg       0.80      0.80      0.80        61
weighted avg       0.80      0.80      0.80        61



In [11]:
confusion_matrix(y_true=y_test,y_pred=y_prads)

array([[20,  7],
       [ 5, 29]])

In [12]:
accuracy_score(y_test,y_prads)

0.8032786885245902
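
The numbers in the classification report and the accuracy score can be re-derived from the confusion matrix printed above. In Scikit-Learn's convention rows are true labels and columns are predicted labels, so the row sums (27 and 34) match the `support` column. The sketch below uses only the four counts from that matrix; the variable names are illustrative.

```python
# Confusion matrix from the cell above: rows are true labels, columns are predictions
tn, fp = 20, 7   # true class 0: 20 predicted as 0, 7 predicted as 1
fn, tp = 5, 29   # true class 1: 5 predicted as 0, 29 predicted as 1

accuracy = (tp + tn) / (tp + tn + fp + fn)

# Class 1 ("has heart disease") treated as the positive class
precision_1 = tp / (tp + fp)
recall_1 = tp / (tp + fn)
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)

# Class 0 treated as the positive class
precision_0 = tn / (tn + fn)
recall_0 = tn / (tn + fp)
f1_0 = 2 * precision_0 * recall_0 / (precision_0 + recall_0)

print(f"accuracy    = {accuracy:.4f}")
print(f"class 1: precision={precision_1:.2f} recall={recall_1:.2f} f1={f1_1:.2f}")
print(f"class 0: precision={precision_0:.2f} recall={recall_0:.2f} f1={f1_0:.2f}")
```

Precision asks "of the samples predicted as this class, how many were right?", while recall asks "of the samples that truly belong to this class, how many were found?".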

<img src="n_estimators.png" />

In [13]:
# 5. Improve a model
# Try different values of n_estimators

np.random.seed(42)
for i in range(10,100,10):
    print(f"Trying model with {i} estimators...")
    clf = RandomForestClassifier(n_estimators=i).fit(X_train,y_train)
    print(f'Model accuracy on test set: {clf.score(X_test,y_test)*100:.2f} %')
    print(" ")

Trying model with 10 estimators...
Model accuracy on test set: 81.97 %
 
Trying model with 20 estimators...
Model accuracy on test set: 83.61 %
 
Trying model with 30 estimators...
Model accuracy on test set: 78.69 %
 
Trying model with 40 estimators...
Model accuracy on test set: 78.69 %
 
Trying model with 50 estimators...
Model accuracy on test set: 81.97 %
 
Trying model with 60 estimators...
Model accuracy on test set: 81.97 %
 
Trying model with 70 estimators...
Model accuracy on test set: 80.33 %
 
Trying model with 80 estimators...
Model accuracy on test set: 77.05 %
 
Trying model with 90 estimators...
Model accuracy on test set: 78.69 %
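
With only 61 test rows, a single sample is worth about 1.6 percentage points, so the differences between the runs above may be mostly noise. Picking `n_estimators` by test-set accuracy also lets the test set influence the model. A hedged sketch of one alternative is to compare settings with cross-validation on the training data and touch the test set only once at the end. The data here is synthetic (`make_classification`), and the `*_demo` names and the candidate values are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic stand-in for the heart-disease data
X_demo, y_demo = make_classification(n_samples=303, n_features=13, random_state=0)
X_train_demo, X_test_demo, y_train_demo, y_test_demo = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

cv_means = {}
for n in (10, 50, 100):
    clf_demo = RandomForestClassifier(n_estimators=n, random_state=42)
    # 5-fold cross-validation on the training data only; the test set stays untouched
    scores = cross_val_score(clf_demo, X_train_demo, y_train_demo, cv=5)
    cv_means[n] = scores.mean()
    print(f"{n:>3} estimators: mean CV accuracy {scores.mean():.3f} (+/- {scores.std():.3f})")

# Refit the best setting on all training rows, then evaluate once on the test set
best_n = max(cv_means, key=cv_means.get)
final_clf = RandomForestClassifier(n_estimators=best_n, random_state=42)
final_clf.fit(X_train_demo, y_train_demo)
print(f"Chosen n_estimators={best_n}, test accuracy {final_clf.score(X_test_demo, y_test_demo):.3f}")
```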
 


In [14]:
# 6. Save a model and load it

import pickle

# Use a context manager so the file handle is closed after writing
with open("random_forest_model_1.pkl","wb") as file:
    pickle.dump(clf,file)

In [15]:
with open("random_forest_model_1.pkl","rb") as file:
    loaded_model = pickle.load(file)

In [16]:
loaded_model.score(X_test,y_test)

0.7868852459016393
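
The reloaded model scores 78.69%, the same as the 90-estimator run, because `clf` was last reassigned inside the `n_estimators` loop. A quick way to check that a save/load round trip preserved a model is to compare predictions before and after. The sketch below does this with a synthetic dataset and a temporary directory; the file name and the `*_demo` names are hypothetical.

```python
import pickle
import tempfile
from pathlib import Path

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data and a small forest, just to have something to save
X_demo, y_demo = make_classification(n_samples=200, n_features=13, random_state=0)
clf_demo = RandomForestClassifier(n_estimators=20, random_state=42).fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = Path(tmp_dir) / "demo_model.pkl"  # hypothetical file name

    # "with" closes the file handle even if dump/load raises
    with open(model_path, "wb") as file:
        pickle.dump(clf_demo, file)
    with open(model_path, "rb") as file:
        loaded_demo = pickle.load(file)

# The reloaded model should reproduce the original predictions exactly
same_predictions = bool((loaded_demo.predict(X_demo) == clf_demo.predict(X_demo)).all())
print(same_predictions)
```

Pickle files should only be loaded from trusted sources, and ideally with the same Scikit-Learn version that wrote them.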

## 1. Getting our data ready to be used with machine learning 

Three main things we have to do:

    1. Split the data into features and labels (usually `X` & `y`)
    2. Fill (also called imputing) or disregard missing values
    3. Convert non-numerical values to numerical values (also called feature encoding)


In [17]:
heart_disease.head()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2,1
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2,1
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2,1
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2,1


In [18]:
X = heart_disease.drop("target", axis =1)
y = heart_disease["target"]

X.shape, y.shape

((303, 13), (303,))

In [19]:
# split the data into training and test sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.2)

In [20]:
X_train.shape,X_test.shape, y_train.shape,y_test.shape

((242, 13), (61, 13), (242,), (61,))

In [21]:
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier().fit(X_train,y_train)
clf

RandomForestClassifier()


In [22]:
clf.score(X_test,y_test)

0.8852459016393442

### 1.1 Make sure it's all numerical (car sales data)

Clean up -> Transform -> Numerical

In [23]:
car_sales = pd.read_csv("car-sales-extended.csv")
car_sales.head()

Unnamed: 0,Make,Colour,Odometer (KM),Doors,Price
0,Honda,White,35431,4,15323
1,BMW,Blue,192714,5,19943
2,Honda,White,84714,4,28343
3,Toyota,White,154365,4,13434
4,Nissan,Blue,181577,3,14043


In [24]:
len(car_sales)

1000

In [25]:
car_sales.dtypes

Make             object
Colour           object
Odometer (KM)     int64
Doors             int64
Price             int64
dtype: object

In [26]:
# Split into X/y

X = car_sales.drop("Price",axis=1)
y = car_sales["Price"]
X.shape, y.shape

((1000, 4), (1000,))

In [27]:
# Split into training and test sets
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2)
X_train.shape,X_test.shape,y_train.shape,y_test.shape

((800, 4), (200, 4), (800,), (200,))

In [28]:
# Turn the categories into numbers
# ("Doors" is numeric but has only a few distinct values, so it is treated as categorical too)
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
catagorial_feature = ["Make","Colour","Doors"]
One_hot = OneHotEncoder()
transformer= ColumnTransformer([("One_hot",One_hot,catagorial_feature)],
                               remainder="passthrough")

transformed_X = transformer.fit_transform(X)
transformed_X


array([[0.00000e+00, 1.00000e+00, 0.00000e+00, ..., 1.00000e+00,
        0.00000e+00, 3.54310e+04],
       [1.00000e+00, 0.00000e+00, 0.00000e+00, ..., 0.00000e+00,
        1.00000e+00, 1.92714e+05],
       [0.00000e+00, 1.00000e+00, 0.00000e+00, ..., 1.00000e+00,
        0.00000e+00, 8.47140e+04],
       ...,
       [0.00000e+00, 0.00000e+00, 1.00000e+00, ..., 1.00000e+00,
        0.00000e+00, 6.66040e+04],
       [0.00000e+00, 1.00000e+00, 0.00000e+00, ..., 1.00000e+00,
        0.00000e+00, 2.15883e+05],
       [0.00000e+00, 0.00000e+00, 0.00000e+00, ..., 1.00000e+00,
        0.00000e+00, 2.48360e+05]], shape=(1000, 13))

In [29]:
X

Unnamed: 0,Make,Colour,Odometer (KM),Doors
0,Honda,White,35431,4
1,BMW,Blue,192714,5
2,Honda,White,84714,4
3,Toyota,White,154365,4
4,Nissan,Blue,181577,3
...,...,...,...,...
995,Toyota,Black,35820,4
996,Nissan,White,155144,3
997,Nissan,Blue,66604,4
998,Honda,White,215883,4


In [30]:
pd.DataFrame(transformed_X)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12
0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,35431.0
1,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,192714.0
2,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,84714.0
3,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,154365.0
4,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,181577.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
995,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,35820.0
996,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,155144.0
997,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,66604.0
998,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,215883.0


In [31]:
dummies = pd.get_dummies(car_sales[["Make","Colour","Doors"]],dtype=int)
dummies

Unnamed: 0,Doors,Make_BMW,Make_Honda,Make_Nissan,Make_Toyota,Colour_Black,Colour_Blue,Colour_Green,Colour_Red,Colour_White
0,4,0,1,0,0,0,0,0,0,1
1,5,1,0,0,0,0,1,0,0,0
2,4,0,1,0,0,0,0,0,0,1
3,4,0,0,0,1,0,0,0,0,1
4,3,0,0,1,0,0,1,0,0,0
...,...,...,...,...,...,...,...,...,...,...
995,4,0,0,0,1,1,0,0,0,0
996,3,0,0,1,0,0,0,0,0,1
997,4,0,0,1,0,0,1,0,0,0
998,4,0,1,0,0,0,0,0,0,1
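
Note that `Doors` was left as a single numeric column in the output above: by default `pd.get_dummies` only encodes object, string and category columns. A small sketch of forcing it to encode `Doors` as well, using the `columns` parameter on a hand-made three-row table (the `*_demo` and `*_dummies` names are illustrative):

```python
import pandas as pd

cars_demo = pd.DataFrame({
    "Make": ["Honda", "BMW", "Toyota"],
    "Colour": ["White", "Blue", "White"],
    "Doors": [4, 5, 4],
})

# Default: only text columns are encoded, so "Doors" passes through as a number
default_dummies = pd.get_dummies(cars_demo, dtype=int)

# Listing the columns explicitly encodes "Doors" as a category too
all_dummies = pd.get_dummies(cars_demo, columns=["Make", "Colour", "Doors"], dtype=int)

print(default_dummies.columns.tolist())
print(all_dummies.columns.tolist())
```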


In [32]:
# Let's refit the model
np.random.seed(42)
from sklearn.ensemble import RandomForestRegressor
model = RandomForestRegressor()
X_train,X_test,y_train,y_test=train_test_split(transformed_X,y,test_size=0.2)

model.fit(X_train,y_train)

RandomForestRegressor()


In [33]:
X.head()

Unnamed: 0,Make,Colour,Odometer (KM),Doors
0,Honda,White,35431,4
1,BMW,Blue,192714,5
2,Honda,White,84714,4
3,Toyota,White,154365,4
4,Nissan,Blue,181577,3


In [34]:
y.head()

0    15323
1    19943
2    28343
3    13434
4    14043
Name: Price, dtype: int64

In [35]:
model.score(X_test,y_test)

0.3235867221569877
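
For a regressor, `score()` returns the coefficient of determination R^2 rather than accuracy: 1.0 is a perfect fit, 0.0 is no better than always predicting the mean, and negative values are worse than that. The sketch below computes R^2 by hand and compares it with `r2_score`. The true values are the first five prices shown in `y.head()`; the predictions are invented purely for illustration.

```python
import numpy as np
from sklearn.metrics import r2_score

# First five prices from y.head() and invented predictions (illustration only)
y_true_demo = np.array([15323.0, 19943.0, 28343.0, 13434.0, 14043.0])
y_pred_demo = np.array([17000.0, 18500.0, 24000.0, 15000.0, 16000.0])

ss_res = np.sum((y_true_demo - y_pred_demo) ** 2)         # squared error of the predictions
ss_tot = np.sum((y_true_demo - y_true_demo.mean()) ** 2)  # squared error of always predicting the mean
r2_manual = 1 - ss_res / ss_tot

# Predicting the mean for every sample gives R^2 = 0 by construction
r2_baseline = r2_score(y_true_demo, np.full_like(y_true_demo, y_true_demo.mean()))

print(r2_manual, r2_score(y_true_demo, y_pred_demo), r2_baseline)
```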

### 1.2 What if there are missing values in the data?

1. Fill them with some value (also known as imputation)
2. Remove the samples with missing data altogether

In [36]:
car_sales= pd.read_csv("car-sales-extended-missing-data.csv")
car_sales.head()

Unnamed: 0,Make,Colour,Odometer (KM),Doors,Price
0,Honda,White,35431.0,4.0,15323.0
1,BMW,Blue,192714.0,5.0,19943.0
2,Honda,White,84714.0,4.0,28343.0
3,Toyota,White,154365.0,4.0,13434.0
4,Nissan,Blue,181577.0,3.0,14043.0


In [37]:
car_sales.isna().sum()

Make             49
Colour           50
Odometer (KM)    50
Doors            50
Price            50
dtype: int64

# Option 1: Fill missing data with Pandas

In [38]:
# Fill the categorical columns with "missing", "Odometer (KM)" with its mean and "Doors" with 4
car_sales["Make"]=car_sales["Make"].fillna("missing")
car_sales["Colour"]=car_sales["Colour"].fillna("missing")
car_sales["Odometer (KM)"]=car_sales["Odometer (KM)"].fillna(car_sales["Odometer (KM)"].mean())
car_sales["Doors"]=car_sales["Doors"].fillna(4)

car_sales.isna().sum()

Make              0
Colour            0
Odometer (KM)     0
Doors             0
Price            50
dtype: int64

In [39]:
car_sales.dropna(inplace=True)

In [40]:
car_sales.isna().sum()

Make             0
Colour           0
Odometer (KM)    0
Doors            0
Price            0
dtype: int64

In [41]:
car_sales.shape

(950, 5)

In [42]:
X = car_sales.drop("Price", axis=1)
y =car_sales["Price"]

In [43]:
# Let's try and convert our data to numbers

from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
model = RandomForestRegressor()
catagorial_feature = ["Make", "Colour", "Doors"]
transformer = ColumnTransformer([("One_hot", OneHotEncoder(),catagorial_feature)],remainder="passthrough")
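# NOTE: car_sales still contains "Price", so the target ends up in the feature matrix
# (it is the last column of the output below); passing X instead would avoid this leakage
transformed_X = transformer.fit_transform(car_sales)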
transformed_X


array([[0.00000e+00, 1.00000e+00, 0.00000e+00, ..., 0.00000e+00,
        3.54310e+04, 1.53230e+04],
       [1.00000e+00, 0.00000e+00, 0.00000e+00, ..., 1.00000e+00,
        1.92714e+05, 1.99430e+04],
       [0.00000e+00, 1.00000e+00, 0.00000e+00, ..., 0.00000e+00,
        8.47140e+04, 2.83430e+04],
       ...,
       [0.00000e+00, 0.00000e+00, 1.00000e+00, ..., 0.00000e+00,
        6.66040e+04, 3.15700e+04],
       [0.00000e+00, 1.00000e+00, 0.00000e+00, ..., 0.00000e+00,
        2.15883e+05, 4.00100e+03],
       [0.00000e+00, 0.00000e+00, 0.00000e+00, ..., 0.00000e+00,
        2.48360e+05, 1.27320e+04]], shape=(950, 16))

In [44]:
from sklearn.model_selection import train_test_split
X_train,X_test, y_train, y_test = train_test_split(transformed_X,y,test_size=0.2)
X_train.shape,X_test.shape, y_train.shape, y_test.shape 

((760, 16), (190, 16), (760,), (190,))

In [45]:
model.fit(X_train,y_train)

RandomForestRegressor()


In [46]:
model.score(X_test,y_test)

0.9999491016126744
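
A score this close to 1.0 deserves a sanity check. The array printed by `transformer.fit_transform(car_sales)` has 16 columns and its last column holds the prices (15323, 19943, ...), so the model was given the target as an input feature. The sketch below shows the same effect on synthetic data in which the features are pure noise; the data, the helper `holdout_r2` and the other names are all assumptions for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
features = rng.normal(size=(500, 3))            # pure noise, unrelated to the target
target = rng.uniform(5_000, 50_000, size=500)   # hypothetical "prices"

def holdout_r2(X_demo, y_demo):
    """Fit a small forest on 80% of the rows and return R^2 on the remaining 20%."""
    X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)
    model_demo = RandomForestRegressor(n_estimators=50, random_state=42).fit(X_tr, y_tr)
    return model_demo.score(X_te, y_te)

clean_r2 = holdout_r2(features, target)
# Appending the target as an extra column lets the model simply read off the answer
leaky_r2 = holdout_r2(np.column_stack([features, target]), target)

print(f"without target column: {clean_r2:.3f}")
print(f"with target column:    {leaky_r2:.3f}")
```

Checking the shape of the feature matrix against the expected number of encoded columns is a cheap guard against this kind of leakage.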

# Option 2: Fill missing values with Scikit-Learn

In [47]:
car_sales= pd.read_csv("car-sales-extended-missing-data.csv")
car_sales.head()

Unnamed: 0,Make,Colour,Odometer (KM),Doors,Price
0,Honda,White,35431.0,4.0,15323.0
1,BMW,Blue,192714.0,5.0,19943.0
2,Honda,White,84714.0,4.0,28343.0
3,Toyota,White,154365.0,4.0,13434.0
4,Nissan,Blue,181577.0,3.0,14043.0


In [48]:
car_sales.isna().sum()

Make             49
Colour           50
Odometer (KM)    50
Doors            50
Price            50
dtype: int64

In [49]:
car_sales.dropna(subset="Price",inplace=True)

In [50]:
car_sales.isna().sum()

Make             47
Colour           46
Odometer (KM)    48
Doors            47
Price             0
dtype: int64

In [51]:
# split in X & y
X = car_sales.drop("Price",axis =1)
y = car_sales["Price"]
X.shape, y.shape

((950, 4), (950,))

In [52]:
# Take care of the missing data with Scikit-learn
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer

# Fill categorical values with "missing", "Doors" with 4 and numerical values with the mean
cat_imputer = SimpleImputer(strategy="constant",fill_value="missing")
door_imputer = SimpleImputer(strategy="constant", fill_value=4)
num_imputer = SimpleImputer(strategy="mean")

# Define columns
cat_feature=["Make","Colour"]
door_feature=["Doors"]
num_feature=["Odometer (KM)"]

# Create an imputer (something that fills missing data)
imputer= ColumnTransformer([
    ("cat_feature",cat_imputer,cat_feature),
    ("door_feature",door_imputer,door_feature),
    ("num_feature",num_imputer,num_feature)
    
])
filled_X = imputer.fit_transform(X)
filled_X

array([['Honda', 'White', 4.0, 35431.0],
       ['BMW', 'Blue', 5.0, 192714.0],
       ['Honda', 'White', 4.0, 84714.0],
       ...,
       ['Nissan', 'Blue', 4.0, 66604.0],
       ['Honda', 'White', 4.0, 215883.0],
       ['Toyota', 'Blue', 4.0, 248360.0]], shape=(950, 4), dtype=object)

In [53]:
# The imputer outputs columns in the order Make, Colour, Doors, Odometer (KM)
car_sales_filled = pd.DataFrame(filled_X,columns=["Make","Colour","Doors","Odometer (KM)"])
car_sales_filled.isna().sum()

Make             0
Colour           0
Doors            0
Odometer (KM)    0
dtype: int64

In [54]:

from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
model = RandomForestRegressor()
catagorial_feature = ["Make", "Colour", "Doors"]
transformer = ColumnTransformer([("One_hot", OneHotEncoder(),catagorial_feature)],remainder="passthrough")
transformed_X = transformer.fit_transform(car_sales_filled)
transformed_X


<Compressed Sparse Row sparse matrix of dtype 'float64'
	with 3800 stored elements and shape (950, 15)>

In [55]:
# Now we have our data as numbers and filled (no missing values)
# Let's fit a model

np.random.seed(42)
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(transformed_X, y, test_size=0.2)
model = RandomForestRegressor()
model.fit(X_train,y_train)
model.score(X_test,y_test)

0.21990196728583944
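
In the cells above the imputer and the encoder are fitted on all rows before the train/test split, so the mean used to fill `Odometer (KM)` is computed partly from test rows. One possible alternative, sketched below, is to wrap imputation, encoding and the model in a single `Pipeline` so that everything is learned from the training rows only. The data is a synthetic stand-in for the car sales table, and the step names and `*_demo` names are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Small synthetic stand-in for the car sales data (not the real CSV)
rng = np.random.default_rng(42)
n_rows = 200
cars_demo = pd.DataFrame({
    "Make": rng.choice(["Honda", "BMW", "Toyota", "Nissan"], n_rows),
    "Colour": rng.choice(["White", "Blue", "Black"], n_rows),
    "Odometer (KM)": rng.integers(10_000, 250_000, n_rows).astype(float),
    "Doors": rng.choice([3.0, 4.0, 5.0], n_rows),
})
price_demo = 40_000 - 0.1 * cars_demo["Odometer (KM)"] + rng.normal(0, 2_000, n_rows)

# Knock out a few values so the imputers have something to do
cars_demo.loc[::15, "Make"] = np.nan
cars_demo.loc[::20, "Odometer (KM)"] = np.nan
cars_demo.loc[::25, "Doors"] = np.nan

# Same fill rules as above: "missing" for text, 4 for doors, mean for odometer
preprocess = ColumnTransformer([
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="constant", fill_value="missing")),
        ("one_hot", OneHotEncoder(handle_unknown="ignore")),
    ]), ["Make", "Colour"]),
    ("door", Pipeline([
        ("impute", SimpleImputer(strategy="constant", fill_value=4)),
        ("one_hot", OneHotEncoder(handle_unknown="ignore")),
    ]), ["Doors"]),
    ("num", SimpleImputer(strategy="mean"), ["Odometer (KM)"]),
])
model_demo = Pipeline([
    ("preprocess", preprocess),
    ("regressor", RandomForestRegressor(n_estimators=50, random_state=42)),
])

X_train_demo, X_test_demo, y_train_demo, y_test_demo = train_test_split(
    cars_demo, price_demo, test_size=0.2, random_state=42
)

# fit() learns fill values and categories from the training rows only
model_demo.fit(X_train_demo, y_train_demo)
r2_holdout = model_demo.score(X_test_demo, y_test_demo)
print(f"R^2 on held-out rows: {r2_holdout:.3f}")
```

Because the preprocessing lives inside the pipeline, calling `predict` or `score` on new rows applies the same fill values and encoding automatically.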

## 2. Choosing the right estimator/algorithm for our problem

Scikit-Learn uses "estimator" as another term for a machine learning model or algorithm.

* Classification - predicting whether a sample is one thing or another
* Regression - predicting a number

### 2.2 Picking a machine learning model for a regression problem

In [56]:
boston_df = pd.read_csv("boston.csv")
boston_df

Unnamed: 0,TOWN,TRACT,LON,LAT,MEDV,CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,RAD,TAX,PTRATIO
0,Nahant,2011,-70.9550,42.2550,24.0,0.00632,18.0,2.31,0,0.538,6.575,65.2,4.0900,1,296,15.3
1,Swampscott,2021,-70.9500,42.2875,21.6,0.02731,0.0,7.07,0,0.469,6.421,78.9,4.9671,2,242,17.8
2,Swampscott,2022,-70.9360,42.2830,34.7,0.02729,0.0,7.07,0,0.469,7.185,61.1,4.9671,2,242,17.8
3,Marblehead,2031,-70.9280,42.2930,33.4,0.03237,0.0,2.18,0,0.458,6.998,45.8,6.0622,3,222,18.7
4,Marblehead,2032,-70.9220,42.2980,36.2,0.06905,0.0,2.18,0,0.458,7.147,54.2,6.0622,3,222,18.7
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
501,Winthrop,1801,-70.9860,42.2312,22.4,0.06263,0.0,11.93,0,0.573,6.593,69.1,2.4786,1,273,21.0
502,Winthrop,1802,-70.9910,42.2275,20.6,0.04527,0.0,11.93,0,0.573,6.120,76.7,2.2875,1,273,21.0
503,Winthrop,1803,-70.9948,42.2260,23.9,0.06076,0.0,11.93,0,0.573,6.976,91.0,2.1675,1,273,21.0
504,Winthrop,1804,-70.9875,42.2240,22.0,0.10959,0.0,11.93,0,0.573,6.794,89.3,2.3889,1,273,21.0


In [57]:
from sklearn.linear_model import Ridge
np.random.seed(42)
boston_df.rename(columns = {"PTRATIO":"target"},inplace=True)
X=boston_df.drop("target",axis=1)
y=boston_df["target"]

In [58]:
X.shape,y.shape

((506, 15), (506,))

In [59]:
boston_df.dtypes

TOWN       object
TRACT       int64
LON       float64
LAT       float64
MEDV      float64
CRIM      float64
ZN        float64
INDUS     float64
CHAS        int64
NOX       float64
RM        float64
AGE       float64
DIS       float64
RAD         int64
TAX         int64
target    float64
dtype: object

In [60]:
from sklearn.linear_model import Ridge
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
catagorial_feature = ["TOWN"]
transformer = ColumnTransformer([("One_hot", OneHotEncoder(),catagorial_feature)],remainder="passthrough")
transformed_X = transformer.fit_transform(X)
transformed_X
np.random.seed(42)
X_train,X_test,y_train,y_test = train_test_split(transformed_X,y,test_size=0.2)
model=Ridge()
model.fit(X_train,y_train)
model.score(X_test,y_test)

0.7063616126957761

How do we improve the score? What if Ridge is not working well enough?

The Scikit-Learn machine learning map suggests other estimators to try:

https://scikit-learn.org/stable/machine_learning_map.html

In [61]:
# Improving the model 
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
catagorial_feature = ["TOWN"]
transformer = ColumnTransformer([("One_hot", OneHotEncoder(),catagorial_feature)],remainder="passthrough")
transformed_X = transformer.fit_transform(X)
transformed_X
np.random.seed(42)
X_train,X_test,y_train,y_test = train_test_split(transformed_X,y,test_size=0.2)
model=RandomForestRegressor()
model.fit(X_train,y_train)
model.score(X_test,y_test)

0.9158148625639175
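
Both scores above come from a single train/test split, so part of the gap between Ridge and the random forest could be down to which rows landed in the test set. A hedged sketch of a steadier comparison is to score each candidate with cross-validation. The data below is synthetic (`make_friedman1`, a non-linear regression problem), not the Boston table, and the dictionary names are illustrative.

```python
from sklearn.datasets import make_friedman1
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Synthetic regression data with non-linear structure (not the Boston table)
X_demo, y_demo = make_friedman1(n_samples=300, noise=1.0, random_state=42)

candidates = {
    "Ridge": Ridge(),
    "RandomForestRegressor": RandomForestRegressor(n_estimators=50, random_state=42),
}

cv_r2 = {}
for name, estimator in candidates.items():
    # 5-fold cross-validation: every row is used for testing exactly once
    scores = cross_val_score(estimator, X_demo, y_demo, cv=5, scoring="r2")
    cv_r2[name] = scores.mean()
    print(f"{name}: mean R^2 {scores.mean():.3f} (+/- {scores.std():.3f})")
```

Which estimator wins depends on the data, so the same loop can be extended with any other candidates the machine learning map suggests.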