## Introduction to Scikit-Learn (SKLearn)
This notebook demonstrates some of the most useful functions of the Scikit-Learn library.

What we will cover:
1. An end-to-end Scikit-Learn workflow
2. Getting the data ready
3. Choosing the right estimator/algorithm for your problem
4. Fitting the model/estimator to the data and using it to make predictions
5. Evaluating a model
6. Improving a model
7. Saving and loading a trained model
8. Putting it all together

Note: This notebook is a simplified version and doesn't include all the details of the full workflow. For a more comprehensive understanding, refer to the official Scikit-Learn documentation.




#### Relevant Packages 

In [3]:
# import relevant Packages 
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt 


#### Getting the Data Ready 

In [29]:
# Import the data into a pandas DataFrame

heart_disease = pd.read_csv("heart-disease.csv")
X = heart_disease.drop("target", axis=1)
y = heart_disease["target"]


In [30]:
# Split the data into training and test sets
# train_test_split returns the splits in this order: X_train, X_test, y_train, y_test
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
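The order in which the split is unpacked is easy to get wrong, so here is a minimal sketch on made-up toy data (the arrays below are illustrative and not part of the heart disease dataset). It only checks the return order and the resulting shapes.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (hypothetical): 10 samples, 2 features, label i belongs to row i
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.arange(10)

# Return order: X_train, X_test, y_train, y_test
X_toy_train, X_toy_test, y_toy_train, y_toy_test = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42
)

print(X_toy_train.shape, X_toy_test.shape)  # (8, 2) (2, 2)
print(y_toy_train.shape, y_toy_test.shape)  # (8,) (2,)
```

With `test_size=0.2`, 2 of the 10 rows go to the test set and the remaining 8 to the training set. Features and labels stay aligned row by row after the shuffle.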

#### Make sure all the data is numerical 

In [31]:
car_sales = pd.read_csv("car-sales-extended.csv")
car_sales.head()

Unnamed: 0,Make,Colour,Odometer (KM),Doors,Price
0,Honda,White,35431,4,15323
1,BMW,Blue,192714,5,19943
2,Honda,White,84714,4,28343
3,Toyota,White,154365,4,13434
4,Nissan,Blue,181577,3,14043


In [6]:
car_sales.dtypes

Make             object
Colour           object
Odometer (KM)     int64
Doors             int64
Price             int64
dtype: object

In [32]:
# Split the data into X,y 
X = car_sales.drop("Price", axis=1)
y = car_sales["Price"]

# Split the data into training and test sets
X_train , X_test, y_train, y_test = train_test_split(X,y, test_size=0.2 )

In [33]:
# Build a machine learning model
from sklearn.ensemble import RandomForestRegressor
model = RandomForestRegressor()
model.fit(X_train, y_train)
model.score(X_test, y_test)


# This cell raises an error because the features still contain strings:
# ValueError: could not convert string to float: 'Honda'

ValueError: could not convert string to float: 'Toyota'

#### Turn the Categories into Numbers (Convert Non-Numerical Data into Numerical Data)

In [34]:
# Using Scikit-Learn modules
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

# "Doors" is numeric but only takes a few distinct values, so it is treated as a category
categorical_features = ["Make", "Colour", "Doors"]
one_hot = OneHotEncoder()
transformer = ColumnTransformer([("one_hot",
                                  one_hot,
                                  categorical_features)],
                                remainder="passthrough")
transformed_X = transformer.fit_transform(X)
X_Trans = pd.DataFrame(transformed_X)

In [35]:
# Try to fit the model again
# Set up new training and test sets
np.random.seed(42)
X_train, X_test, y_train,y_test = train_test_split(X_Trans, y, test_size= 0.2)
model.fit(X_train , y_train)
model.score(X_test, y_test)



0.3235867221569877

#### What if there are missing values?
1. Fill them with some value (also known as imputation)
2. Remove the samples with missing values altogether
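The two options can be compared on a tiny example. This is a sketch on a made-up four-row DataFrame (not the car sales data), showing how much data each option keeps.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): one missing value in each column
toy = pd.DataFrame({
    "Colour": ["White", None, "Blue", "Red"],
    "Odometer (KM)": [1000.0, 2000.0, np.nan, 3000.0],
    "Price": [5000.0, 4000.0, 3000.0, np.nan],
})

# Option 1: fill (impute) the missing feature values
filled = toy.copy()
filled["Colour"] = filled["Colour"].fillna("missing")
filled["Odometer (KM)"] = filled["Odometer (KM)"].fillna(filled["Odometer (KM)"].mean())

# Option 2: drop every row that has any missing value
dropped = toy.dropna()

print(filled)
print(len(toy), len(dropped))  # 4 1
```

Filling keeps all four rows, while dropping keeps only the single complete row. On small datasets that difference can matter a lot.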

In [36]:
# Import the car sales data with missing values
car_sales_missing = pd.read_csv("car-sales-extended-missing-data.csv")


In [37]:
# How many missing values do we have?
car_sales_missing.isna().sum()

Make             49
Colour           50
Odometer (KM)    50
Doors            50
Price            50
dtype: int64

In [38]:
len(car_sales_missing)

1000

In [14]:
X = car_sales_missing.drop("Price", axis = 1)
y = car_sales_missing["Price"]

# Convert non-numerical values to numeric
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

categorical_features = ["Make", "Colour", "Doors"]
one_hot = OneHotEncoder()
transformer = ColumnTransformer([("One-hot",
                                one_hot,
                                categorical_features)],
                                remainder="passthrough")
# Note: this transforms the whole DataFrame, so the "Price" column is passed through too;
# pass X instead to encode the features only
transformed_X = transformer.fit_transform(car_sales_missing)
transformed_X

<Compressed Sparse Row sparse matrix of dtype 'float64'
	with 5000 stored elements and shape (1000, 17)>

In [11]:
car_sales_missing.head()

Unnamed: 0,Make,Colour,Odometer (KM),Doors,Price
0,Honda,White,35431.0,4.0,15323.0
1,BMW,Blue,192714.0,5.0,19943.0
2,Honda,White,84714.0,4.0,28343.0
3,Toyota,White,154365.0,4.0,13434.0
4,Nissan,Blue,181577.0,3.0,14043.0


#### Option 1: Fill missing data with pandas

In [None]:
# Fill the "Make" column
car_sales_missing["Make"] = car_sales_missing["Make"].fillna("missing")

# Fill the "Colour" column
car_sales_missing["Colour"] = car_sales_missing["Colour"].fillna("missing")

# Fill the "Odometer (KM)" column with the column mean
car_sales_missing["Odometer (KM)"] = car_sales_missing["Odometer (KM)"].fillna(car_sales_missing["Odometer (KM)"].mean())

# Fill the "Doors" column
car_sales_missing["Doors"] = car_sales_missing["Doors"].fillna(4)


In [None]:
# Check the DataFrame again for the number of missing values
car_sales_missing.isna().sum()

Make              0
Colour            0
Odometer (KM)     0
Doors             0
Price            50
dtype: int64

Since "Price" is the value we are predicting, we should not fill in its missing values.
Instead, we remove the rows where "Price" is missing so the model only learns from real labels.

In [16]:
# Remove rows with a missing Price value
car_sales_missing.dropna(inplace = True)
car_sales_missing.isna().sum()

Make             0
Colour           0
Odometer (KM)    0
Doors            0
Price            0
dtype: int64

In [17]:
X  = car_sales_missing.drop("Price", axis = 1)
y = car_sales_missing["Price"]

In [18]:
# Convert non-numerical values to numeric
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

categorical_features = ["Make", "Colour", "Doors"]
one_hot = OneHotEncoder()
transformer = ColumnTransformer([("One-hot",
                                one_hot,
                                categorical_features)],
                                remainder="passthrough")
# Note: this transforms the whole DataFrame, so the "Price" column is passed through too;
# pass X instead to encode the features only
transformed_X = transformer.fit_transform(car_sales_missing)
transformed_X

array([[0.00000e+00, 1.00000e+00, 0.00000e+00, ..., 0.00000e+00,
        3.54310e+04, 1.53230e+04],
       [1.00000e+00, 0.00000e+00, 0.00000e+00, ..., 1.00000e+00,
        1.92714e+05, 1.99430e+04],
       [0.00000e+00, 1.00000e+00, 0.00000e+00, ..., 0.00000e+00,
        8.47140e+04, 2.83430e+04],
       ...,
       [0.00000e+00, 0.00000e+00, 1.00000e+00, ..., 0.00000e+00,
        6.66040e+04, 3.15700e+04],
       [0.00000e+00, 1.00000e+00, 0.00000e+00, ..., 0.00000e+00,
        2.15883e+05, 4.00100e+03],
       [0.00000e+00, 0.00000e+00, 0.00000e+00, ..., 0.00000e+00,
        2.48360e+05, 1.27320e+04]], shape=(950, 16))

#### Reload the data and drop the rows with no labels

In [62]:
car_sales_missing = pd.read_csv("car-sales-extended-missing-data.csv")
car_sales_missing.isna().sum()

Make             49
Colour           50
Odometer (KM)    50
Doors            50
Price            50
dtype: int64

In [64]:
# Drop the rows with no labels
car_sales_missing.dropna(subset=["Price"], inplace=True)
car_sales_missing.isna().sum()
car_sales_missing

Unnamed: 0,Make,Colour,Odometer (KM),Doors,Price
0,Honda,White,35431.0,4.0,15323.0
1,BMW,Blue,192714.0,5.0,19943.0
2,Honda,White,84714.0,4.0,28343.0
3,Toyota,White,154365.0,4.0,13434.0
4,Nissan,Blue,181577.0,3.0,14043.0
...,...,...,...,...,...
995,Toyota,Black,35820.0,4.0,32042.0
996,,White,155144.0,3.0,5716.0
997,Nissan,Blue,66604.0,4.0,31570.0
998,Honda,White,215883.0,4.0,4001.0


In [None]:
# Split into X and y
X = car_sales_missing.drop("Price", axis=1)
y = car_sales_missing["Price"]
y.shape


(950,)

#### Option 2: Fill missing values with Scikit-Learn
The strategy parameter of SimpleImputer controls how missing values are filled. The possible options are:
* 'mean': Replace missing values with the mean of the column (numeric only).
* 'median': Replace missing values with the median of the column (numeric only).
* 'most_frequent': Replace missing values with the most frequent value in the column (works for categorical or numerical data).
* 'constant': Replace missing values with a constant that you specify using fill_value.
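The effect of each strategy is easiest to see on a single column. The sketch below uses a made-up numeric column with one missing value (the numbers are illustrative only) and records what each strategy fills in.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy column (hypothetical) with one missing value at row index 3
col = np.array([[1.0], [2.0], [2.0], [np.nan], [10.0]])

fills = {}
for strategy in ["mean", "median", "most_frequent"]:
    imputer = SimpleImputer(strategy=strategy)
    fills[strategy] = imputer.fit_transform(col)[3, 0]

# "constant" needs an explicit fill_value
constant_imputer = SimpleImputer(strategy="constant", fill_value=4.0)
fills["constant"] = constant_imputer.fit_transform(col)[3, 0]

print(fills)
```

Here the mean is 3.75 because the outlier 10.0 pulls it up, while the median and the most frequent value are both 2.0. That difference is one reason to look at a column's distribution before picking a strategy.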

In [None]:
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer  # applies a transformer to selected columns

# Fill categorical values with "missing", "Doors" with 4, and numerical values with the mean
cat_imputer = SimpleImputer(strategy="constant", fill_value="missing")
door_imputer = SimpleImputer(strategy="constant", fill_value=4)
num_imputer = SimpleImputer(strategy="mean")

# Define the columns
cat_features = ["Make", "Colour"]
door_feature = ["Doors"]
num_feature = ["Odometer (KM)"]

# Create an imputer that applies each strategy to its own columns
imputer = ColumnTransformer([("cat_imputer", cat_imputer, cat_features),
                             ("door_imputer", door_imputer, door_feature),
                             ("num_imputer", num_imputer, num_feature)])

# Transform the data
filled_X = imputer.fit_transform(X)

car_sales_filled = pd.DataFrame(filled_X, columns=["Make", "Colour", "Doors", "Odometer (KM)"])
# The data now has no missing values
car_sales_filled.shape

(950, 4)

In [68]:
# Convert non-numerical values to numeric
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

categorical_features = ["Make","Colour","Doors"]
one_hot = OneHotEncoder()
transformer = ColumnTransformer([("One-hot",
                                one_hot,
                                categorical_features)],
                                remainder= "passthrough")
transformed_X = transformer.fit_transform(car_sales_filled)
transformed_X.shape
# The data is ready: all categories are now numeric

(950, 15)

In [None]:
np.random.seed(42)
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(transformed_X,y,test_size=0.2)
model = RandomForestRegressor()
model.fit(X_train,y_train)
model.score(X_test,y_test)

0.21990196728583944
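The cells above fit the imputer and the encoder on the full dataset before splitting it. One common alternative is to split first and wrap the preprocessing and the model in a `Pipeline`, so the fill values and categories are learned from the training set only. The following is a sketch under stated assumptions: the data is randomly generated to resemble the car sales columns, so the score it prints says nothing about the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Synthetic stand-in for the car sales data (hypothetical values)
rng = np.random.default_rng(42)
n = 200
toy = pd.DataFrame({
    "Make": pd.Series(rng.choice(["Honda", "BMW", "Toyota", "Nissan"], size=n), dtype=object),
    "Colour": pd.Series(rng.choice(["White", "Blue", "Red"], size=n), dtype=object),
    "Doors": rng.choice([3.0, 4.0, 5.0], size=n),
    "Odometer (KM)": rng.integers(10_000, 250_000, size=n).astype(float),
})
toy["Price"] = 30_000 - 0.08 * toy["Odometer (KM)"] + rng.normal(0, 1_000, size=n)

# Knock out some feature values to imitate missing data
toy.loc[toy.index[::10], "Make"] = np.nan
toy.loc[toy.index[5::10], "Odometer (KM)"] = np.nan
toy.loc[toy.index[3::10], "Doors"] = np.nan

# Impute, then one-hot encode, the categorical columns
categorical = Pipeline([
    ("impute", SimpleImputer(strategy="constant", fill_value="missing")),
    ("one_hot", OneHotEncoder(handle_unknown="ignore")),
])
doors = Pipeline([
    ("impute", SimpleImputer(strategy="constant", fill_value=4)),
    ("one_hot", OneHotEncoder(handle_unknown="ignore")),
])
preprocess = ColumnTransformer([
    ("cat", categorical, ["Make", "Colour"]),
    ("door", doors, ["Doors"]),
    ("num", SimpleImputer(strategy="mean"), ["Odometer (KM)"]),
])

pipeline = Pipeline([
    ("preprocess", preprocess),
    ("regressor", RandomForestRegressor(random_state=42)),
])

X_toy = toy.drop("Price", axis=1)
y_toy = toy["Price"]
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2, random_state=42)

# fit() learns imputation values and categories from the training split only
pipeline.fit(X_tr, y_tr)
pipeline_score = pipeline.score(X_te, y_te)
print(pipeline_score)
```

Because the whole chain is one estimator, the same `fit`, `predict` and `score` calls used elsewhere in this notebook apply to it unchanged.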

## 2.0 Choosing the Right Estimator or Algorithm for the Problem
#### Before you choose a model, figure out what type of problem you are dealing with.
* Classification: Predicting whether a sample is "A" or "B"
* Regression: Predicting a number


#### 2.1 Picking an ML Algorithm for a Regression Problem
* Note: for regressors, .score returns R^2, the coefficient of determination
* A perfect score is 1.0
* The score can be negative, because a model can be arbitrarily worse than always predicting the mean of the target
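The three notes above can be checked by computing R^2 by hand as 1 - SS_res / SS_tot and comparing it with `sklearn.metrics.r2_score`. This is a minimal sketch on made-up numbers, where `r2_manual` is a helper written for this illustration.

```python
import numpy as np
from sklearn.metrics import r2_score

def r2_manual(y_true, y_pred):
    """R^2 = 1 - (residual sum of squares) / (total sum of squares)."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1 - ss_res / ss_tot

y_true = np.array([1.0, 2.0, 3.0, 4.0])

perfect = y_true.copy()                          # exact predictions
mean_only = np.full_like(y_true, y_true.mean())  # always predict the mean
reversed_pred = y_true[::-1].copy()              # worse than predicting the mean

print(r2_manual(y_true, perfect))        # 1.0
print(r2_manual(y_true, mean_only))      # 0.0
print(r2_manual(y_true, reversed_pred))  # -3.0
```

A score of 0.0 corresponds to a model that does no better than predicting the mean, and anything below that is worse than that baseline.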
##### Step 1: Check the Scikit-Learn machine learning map: https://scikit-learn.org/stable/machine_learning_map.html

In [85]:
simulated_pavement = pd.read_csv("simulated_pavement.csv")
simulated_pavement.head()

Unnamed: 0,Age,AADT,Climate_Index,Asphalt_Thickness,Air_Voids,Binder_Grade,IRI
0,7,3195,3.047813,197.348431,5.920157,64,2.032052
1,20,26571,1.646559,281.219758,3.190865,76,3.329051
2,15,15922,5.340894,186.878873,5.264149,76,3.37856
3,11,5758,4.8483,170.015682,3.634586,70,2.205523
4,8,22502,6.92436,229.020672,3.480659,76,2.643641


In [86]:
# How many samples 
len(simulated_pavement)

100

In [90]:
# Try the Ridge regression model
from sklearn.linear_model import Ridge

# Setup a random seed 
np.random.seed(42)

# Create the data 
X = simulated_pavement.drop("IRI", axis=1)
y = simulated_pavement["IRI"]

# Split the data into Training set and Test set 
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size= 0.2)

# Instantiate the Ridge Model 
model = Ridge()
model.fit(X_train,y_train)

# Check the score of the Ridge Model on test data
model.score(X_test, y_test)


0.5760743265613899

### How do we improve the score?
* What if Ridge regression isn't working well enough?
* Let's refer back to the map: https://scikit-learn.org/stable/machine_learning_map.html


In [91]:
# Check the RandomForestRegressor
from sklearn.ensemble import RandomForestRegressor

# Set up the random seed
np.random.seed(42)

# Create the data
X = simulated_pavement.drop("IRI", axis=1)
y = simulated_pavement["IRI"]

# Split the data into a training set and a test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Instantiate the RandomForestRegressor model
rf_model = RandomForestRegressor()
rf_model.fit(X_train, y_train)

# Check the score of the RandomForestRegressor
rf_model.score(X_test, y_test)

0.4309822736775186
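With only 100 samples, a single 80/20 split leaves 20 rows for testing, so one score per model is a noisy basis for comparison. One way to get a steadier picture is k-fold cross-validation with `cross_val_score`. The sketch below uses synthetic data from `make_regression` as a stand-in (it is not the pavement dataset), so the printed numbers are illustrative only.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Synthetic regression data (hypothetical): 100 samples, 6 features
X_demo, y_demo = make_regression(
    n_samples=100, n_features=6, noise=10.0, random_state=42
)

candidates = {
    "Ridge": Ridge(),
    "RandomForestRegressor": RandomForestRegressor(random_state=42),
}

cv_scores = {}
for name, estimator in candidates.items():
    # For regressors the default scoring is the estimator's .score, i.e. R^2
    scores = cross_val_score(estimator, X_demo, y_demo, cv=5)
    cv_scores[name] = scores
    print(f"{name}: mean R^2 = {scores.mean():.3f} (std {scores.std():.3f})")
```

Each model is trained and scored five times on different folds, and the spread of the five scores indicates how much a single split could mislead.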

### Choosing an Estimator for a Classification Problem 

In [8]:
# Get the data ready
import pandas as pd
heart_disease = pd.read_csv("heart-disease.csv")
heart_disease.head()

In [4]:
# Get the classifier from sklearn
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Get the data ready
X = heart_disease.drop("target", axis=1)
y = heart_disease["target"]

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Instantiate the RandomForestClassifier model
clf = RandomForestClassifier()
clf.fit(X_train, y_train)

# Get the score (mean accuracy for classifiers)
clf.score(X_test, y_test)
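The same classification workflow can be tried without the CSV file. This sketch swaps in synthetic data from `make_classification` (an assumption made for illustration, not the heart disease dataset) and shows that, for classifiers, `.score` is the mean accuracy on the given data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic binary classification data (hypothetical)
X_demo, y_demo = make_classification(n_samples=300, n_features=13, random_state=42)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

demo_clf = RandomForestClassifier(random_state=42)
demo_clf.fit(X_tr, y_tr)

accuracy = demo_clf.score(X_te, y_te)   # fraction of correct predictions
preds = demo_clf.predict(X_te)          # predicted class labels
probs = demo_clf.predict_proba(X_te[:5])  # class probabilities for 5 samples

print(accuracy)
print((preds == y_te).mean())  # same value as accuracy
print(probs)
```

`predict_proba` returns one column per class, which is useful when a confidence estimate matters more than a hard label.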

In [10]:

result = 0 
for i in range(5):
    result += 2*i

print(result)


20
