## Choosing the right estimator/algorithm for our problem

Scikit-learn uses *estimator* as another term for a machine learning model or algorithm.

* Classification - predicting whether a sample is one thing or another
* Regression - predicting a value
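
To make the distinction concrete, here is a minimal sketch on synthetic data. The data comes from `make_classification` and `make_regression`, which are not part of this notebook and are used here only as an assumption for illustration. It shows that both kinds of estimator share the same `fit()`/`predict()` interface but return different kinds of output.

```python
from sklearn.datasets import make_classification, make_regression
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

# Synthetic data, used purely for illustration (not the notebook's datasets)
X_clf, y_clf = make_classification(n_samples=200, n_features=5, random_state=42)
X_reg, y_reg = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=42)

# Both estimators are trained with the same fit() call
classifier = RandomForestClassifier(n_estimators=20, random_state=42).fit(X_clf, y_clf)
regressor = RandomForestRegressor(n_estimators=20, random_state=42).fit(X_reg, y_reg)

# A classifier predicts discrete labels, a regressor predicts continuous values
class_preds = classifier.predict(X_clf[:5])
value_preds = regressor.predict(X_reg[:5])
print(class_preds)
print(value_preds)
```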

### Picking a machine learning model for a regression problem

In [3]:
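# Note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2,
# so this cell requires an older scikit-learn release
from sklearn.datasets import load_boston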
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score, mean_absolute_error
import pandas as pd
import numpy as np 

boston = load_boston()

In [2]:
boston_df = pd.DataFrame(boston["data"], columns=boston["feature_names"])

boston_df["target"] = pd.Series(boston["target"])

boston_df.head()

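| | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.00632 | 18.0 | 2.31 | 0.0 | 0.538 | 6.575 | 65.2 | 4.09 | 1.0 | 296.0 | 15.3 | 396.9 | 4.98 | 24.0 |
| 1 | 0.02731 | 0.0 | 7.07 | 0.0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2.0 | 242.0 | 17.8 | 396.9 | 9.14 | 21.6 |
| 2 | 0.02729 | 0.0 | 7.07 | 0.0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2.0 | 242.0 | 17.8 | 392.83 | 4.03 | 34.7 |
| 3 | 0.03237 | 0.0 | 2.18 | 0.0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3.0 | 222.0 | 18.7 | 394.63 | 2.94 | 33.4 |
| 4 | 0.06905 | 0.0 | 2.18 | 0.0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3.0 | 222.0 | 18.7 | 396.9 | 5.33 | 36.2 |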


In [3]:
len(boston_df)

506

In [4]:
np.random.seed(42)

X = boston_df.drop("target", axis=1)

y = boston_df.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

model = Ridge()

model.fit(X_train, y_train)

model.score(X_test, y_test)

0.6662221670168519

In [5]:
np.random.seed(42)

X = boston_df.drop("target", axis = 1)

y = boston_df.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

model = RandomForestRegressor()

model.fit(X_train, y_train)

model.score(X_test, y_test)

0.8654448653350507
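
For regressors, `score()` returns the coefficient of determination (R²) on the data it is given, which is why the two numbers above can be compared directly. The sketch below checks this against `r2_score` on synthetic data; the dataset, the `random_state` values, and `n_estimators=50` are assumptions chosen for illustration, not settings from this notebook.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic regression data (illustrative assumption, not the Boston data)
X, y = make_regression(n_samples=300, n_features=8, noise=15.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

candidates = [
    ("Ridge", Ridge()),
    ("RandomForestRegressor", RandomForestRegressor(n_estimators=50, random_state=42)),
]

results = {}
for name, estimator in candidates:
    estimator.fit(X_train, y_train)
    # score() on a regressor is R^2 computed on the given data
    score = estimator.score(X_test, y_test)
    # The same value computed explicitly from the predictions
    r2 = r2_score(y_test, estimator.predict(X_test))
    results[name] = (score, r2)
    print(f"{name}: score()={score:.4f}, r2_score={r2:.4f}")
```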

### Choosing an estimator for a classification problem


In [6]:
heart_disease = pd.read_csv(r"C:\Users\cos_9\PycharmProjects\machine_learning_and_data_science_bootcamp\resources\heart-disease.csv")

heart_disease.head()

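| | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 63 | 1 | 3 | 145 | 233 | 1 | 0 | 150 | 0 | 2.3 | 0 | 0 | 1 | 1 |
| 1 | 37 | 1 | 2 | 130 | 250 | 0 | 1 | 187 | 0 | 3.5 | 0 | 0 | 2 | 1 |
| 2 | 41 | 0 | 1 | 130 | 204 | 0 | 0 | 172 | 0 | 1.4 | 2 | 0 | 2 | 1 |
| 3 | 56 | 1 | 1 | 120 | 236 | 0 | 1 | 178 | 0 | 0.8 | 2 | 0 | 2 | 1 |
| 4 | 57 | 0 | 0 | 120 | 354 | 0 | 1 | 163 | 1 | 0.6 | 2 | 0 | 2 | 1 |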


In [7]:
np.random.seed(42)

X = heart_disease.drop("target", axis=1)

y = heart_disease.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

model = LinearSVC()

model.fit(X_train, y_train)

model.score(X_test, y_test)



0.8688524590163934

In [8]:
model = RandomForestClassifier()

model.fit(X_train, y_train)

model.score(X_test, y_test)

0.8524590163934426

### Making predictions

In [9]:
y_preds = model.predict(X_test)

#### Compare predictions to actual data

In [10]:
np.mean(y_preds == np.array(y_test))

0.8524590163934426

In [11]:
accuracy_score(y_test, y_preds)

0.8524590163934426

##### Making predictions with predict_proba()

In [15]:
# predict_proba() returns the estimated probability of each class label for every sample

model.predict_proba(X_test[:5])

array([[0.89, 0.11],
       [0.49, 0.51],
       [0.44, 0.56],
       [0.84, 0.16],
       [0.18, 0.82]])

In [16]:
model.predict(X_test[:5])

array([0, 1, 1, 0, 1], dtype=int64)
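
Comparing the two outputs above, each predicted label is the class with the highest probability in the corresponding row (for example, `[0.49, 0.51]` becomes `1`). The following sketch checks that relationship for a `RandomForestClassifier` on synthetic data; the dataset and hyperparameters are illustrative assumptions rather than the heart disease setup.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic binary classification data (illustrative assumption)
X, y = make_classification(n_samples=300, n_features=6, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = RandomForestClassifier(n_estimators=50, random_state=42).fit(X_train, y_train)

# One row per sample, one column per class (columns follow model.classes_)
probabilities = model.predict_proba(X_test)

# Pick the most probable class for each sample
labels_from_proba = model.classes_[np.argmax(probabilities, axis=1)]
labels = model.predict(X_test)

print(probabilities[:5])
print(labels[:5])
```

Because the columns follow the order of `model.classes_`, indexing `classes_` with the arg-max is safer than assuming the column index is the label itself.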

In [17]:
heart_disease.target.value_counts()

1    165
0    138
Name: target, dtype: int64

`predict()` can also be used for regression models

In [19]:
np.random.seed(42)


X = boston_df.drop("target", axis=1)
y = boston_df.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2 )

model = RandomForestRegressor()

model.fit(X_train, y_train)

model.score(X_test, y_test)

0.8654448653350507

In [23]:
model.predict(X_train[:3])

array([12.691, 20.124, 19.866])

In [22]:
model.predict([[14, 0, 20, 0, 0.7, 5, 95, 2, 23, 555, 19, 330, 23]])

array([10.622])

In [26]:
y_preds = model.predict(X_test)

mean_absolute_error(y_test, y_preds)

2.136382352941176
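
Mean absolute error is the average of the absolute differences between predictions and true values, so the result above says the predictions are off by roughly 2.1 target units on average. A minimal sketch of the calculation, using a handful of made-up values (hypothetical numbers, not taken from the model above):

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Hypothetical true values and predictions, for illustration only
y_true = np.array([24.0, 21.6, 34.7, 33.4])
y_pred = np.array([25.0, 20.6, 33.7, 35.4])

# MAE by hand: mean of the absolute errors
manual_mae = np.mean(np.abs(y_true - y_pred))

# The same quantity from scikit-learn
sklearn_mae = mean_absolute_error(y_true, y_pred)

print(manual_mae, sklearn_mae)
```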

In [27]:
# Regressors predict continuous values, so they do not provide predict_proba()
model.predict_proba(X_test[:10])

AttributeError: 'RandomForestRegressor' object has no attribute 'predict_proba'
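
Rather than triggering the error, one way to find out whether an estimator offers class probabilities is to check for the method with `hasattr`. This is a small sketch, not something the notebook itself does. It also shows that not every classifier has the method: `LinearSVC`, used earlier, does not.

```python
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.svm import LinearSVC

estimators = [RandomForestClassifier(), RandomForestRegressor(), LinearSVC()]

# Map each estimator's class name to whether it exposes predict_proba()
support = {type(est).__name__: hasattr(est, "predict_proba") for est in estimators}
print(support)
```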

Some things to note:
* Scikit-learn refers to machine learning models and algorithms as estimators
* Classification problem - predicting a category (heart disease or not)
    * Sometimes you'll see `clf` (short for classifier) used as the variable name for a classification estimator
* Regression problem - predicting a number (selling price of a car)

If you are working on a machine learning problem with scikit-learn and are not sure which model to use, follow the estimator-selection map [here](https://scikit-learn.org/stable/tutorial/machine_learning_map/index.html)

In [1]:
# Using the California housing dataset
from sklearn.datasets import fetch_california_housing

housing = fetch_california_housing()

In [2]:
housing

{'data': array([[   8.3252    ,   41.        ,    6.98412698, ...,    2.55555556,
           37.88      , -122.23      ],
        [   8.3014    ,   21.        ,    6.23813708, ...,    2.10984183,
           37.86      , -122.22      ],
        [   7.2574    ,   52.        ,    8.28813559, ...,    2.80225989,
           37.85      , -122.24      ],
        ...,
        [   1.7       ,   17.        ,    5.20554273, ...,    2.3256351 ,
           39.43      , -121.22      ],
        [   1.8672    ,   18.        ,    5.32951289, ...,    2.12320917,
           39.43      , -121.32      ],
        [   2.3886    ,   16.        ,    5.25471698, ...,    2.61698113,
           39.37      , -121.24      ]]),
 'target': array([4.526, 3.585, 3.521, ..., 0.923, 0.847, 0.894]),
 'frame': None,
 'target_names': ['MedHouseVal'],
 'feature_names': ['MedInc',
  'HouseAge',
  'AveRooms',
  'AveBedrms',
  'Population',
  'AveOccup',
  'Latitude',
  'Longitude'],
 'DESCR': '.. _california_housing_dataset:\n

In [6]:
df = pd.DataFrame(housing["data"], columns=housing["feature_names"])

In [8]:
df["MedHouseVal"] = housing["target"]

In [9]:
df.head()

| | MedInc | HouseAge | AveRooms | AveBedrms | Population | AveOccup | Latitude | Longitude | MedHouseVal |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 8.3252 | 41.0 | 6.984127 | 1.02381 | 322.0 | 2.555556 | 37.88 | -122.23 | 4.526 |
| 1 | 8.3014 | 21.0 | 6.238137 | 0.97188 | 2401.0 | 2.109842 | 37.86 | -122.22 | 3.585 |
| 2 | 7.2574 | 52.0 | 8.288136 | 1.073446 | 496.0 | 2.80226 | 37.85 | -122.24 | 3.521 |
| 3 | 5.6431 | 52.0 | 5.817352 | 1.073059 | 558.0 | 2.547945 | 37.85 | -122.25 | 3.413 |
| 4 | 3.8462 | 52.0 | 6.281853 | 1.081081 | 565.0 | 2.181467 | 37.85 | -122.25 | 3.422 |


In [12]:
np.random.seed(42)

X = df.drop("MedHouseVal", axis=1)
y = df.loc[:, "MedHouseVal"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

In [14]:
from sklearn import linear_model

reg = linear_model.Ridge(alpha=0.5)
reg.fit(X_train, y_train)

print(f"Coefficient of determination: {reg.score(X_test, y_test)}")

Coefficient of determination: 0.5758213996714421


In [15]:
# trying the same with Lasso

reg = linear_model.Lasso(alpha=0.1)

reg.fit(X_train, y_train)

print(f"Coefficient of determination: {reg.score(X_test, y_test)}")

Coefficient of determination: 0.5318167610318159


In [17]:
# Trying support vector regression (SVR uses an RBF kernel by default, not a linear one)
from sklearn import svm
reg = svm.SVR()

reg.fit(X_train, y_train)

print(f"Coefficient of determination: {reg.score(X_test, y_test)}")

Coefficient of determination: -0.01648536010717372

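
A negative score means the SVR did worse than always predicting the mean of the target. One plausible explanation, which is an assumption and has not been verified on this dataset, is that kernel-based SVR is sensitive to feature scale, and the housing features differ widely in magnitude (compare `Population` with `AveBedrms`). The sketch below illustrates the effect on synthetic data with one informative small-scale feature and one irrelevant large-scale feature; all names and values in it are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

rng = np.random.default_rng(42)
n_samples = 400

# Hypothetical features: one drives the target, the other only adds scale
informative = rng.normal(size=n_samples)            # magnitude around 1
large_scale = rng.normal(size=n_samples) * 1000.0   # magnitude around 1000, irrelevant
X = np.column_stack([informative, large_scale])
y = 3.0 * informative + rng.normal(scale=0.1, size=n_samples)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# SVR on raw features versus SVR on standardized features
raw_svr = SVR().fit(X_train, y_train)
scaled_svr = make_pipeline(StandardScaler(), SVR()).fit(X_train, y_train)

raw_r2 = raw_svr.score(X_test, y_test)
scaled_r2 = scaled_svr.score(X_test, y_test)
print(f"raw R^2: {raw_r2:.3f}, scaled R^2: {scaled_r2:.3f}")
```

Putting the scaler inside a pipeline means it is fitted on the training split only, so no information from the test split leaks into preprocessing.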

How about we try an ensemble model (a combination of smaller models that aims to make better predictions than any single model)?

Using a random forest regressor, which is an ensemble of decision trees:

In [21]:
# Import RandomForestRegressor
from sklearn.ensemble import RandomForestRegressor

# Setup random seed

np.random.seed(42)

X = df.drop("MedHouseVal", axis=1)
y = df.loc[:, "MedHouseVal"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

clf = RandomForestRegressor(n_estimators=200)

clf.fit(X_train, y_train)


print(f"Coefficient of determination: {clf.score(X_test, y_test)}")

Coefficient of determination: 0.8068836508775645
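
To illustrate the ensemble idea, the sketch below checks that a fitted `RandomForestRegressor` predicts the average of its individual decision trees, which are available in the `estimators_` attribute. It uses synthetic data and a small forest (`n_estimators=25`); both are assumptions made to keep the example quick, not the settings used above.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic regression data (illustrative assumption)
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=42)
forest = RandomForestRegressor(n_estimators=25, random_state=42).fit(X, y)

# Predictions of every individual tree for the first five samples
tree_preds = np.array([tree.predict(X[:5]) for tree in forest.estimators_])

# Averaging the trees reproduces the forest's own prediction
mean_of_trees = tree_preds.mean(axis=0)
forest_preds = forest.predict(X[:5])

print(mean_of_trees)
print(forest_preds)
```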


`predict()` can also be used for regression models

In [23]:
y_preds = clf.predict(X_test)

In [25]:
y_preds[:10]

array([0.49123   , 0.72785   , 4.93415665, 2.561385  , 2.29763   ,
       1.63672505, 2.271205  , 1.66016   , 2.486845  , 4.8543831 ])

In [26]:
from sklearn.metrics import mean_absolute_error

mean_absolute_error(y_test, y_preds)

0.32583425476017464

## Evaluating a machine learning model

There are three methods for evaluating a machine learning model:
1. Estimator's `score()` method
2. The `scoring` parameter
3. Problem-specific metric functions

You can read more about these [here](https://scikit-learn.org/stable/modules/model_evaluation.html)
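
As a rough sketch of how the three approaches look side by side, the example below evaluates one classifier with each of them. The synthetic dataset, `cv=5`, and `scoring="accuracy"` are assumptions picked for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic binary classification data (illustrative assumption)
X, y = make_classification(n_samples=300, n_features=6, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = RandomForestClassifier(n_estimators=50, random_state=42).fit(X_train, y_train)

# 1. Estimator's score() method (mean accuracy for classifiers)
score_method = model.score(X_test, y_test)

# 2. The scoring parameter, here passed to cross_val_score
#    (cross_val_score fits fresh clones of the model on each fold)
cv_scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")

# 3. A problem-specific metric function
metric_value = accuracy_score(y_test, model.predict(X_test))

print(score_method, cv_scores.mean(), metric_value)
```

Approaches 1 and 3 give the same number here because both compute accuracy on the same test split; the cross-validated scores differ because each fold trains and tests on different data.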

In [None]:
from sklearn.ensemble import RandomForestClassifier

np.random.