<font size="+5">#03 | Model Selection. Decision Tree vs Support Vector Machines vs Logistic Regression</font>

- Subscribe to my [Blog ↗](https://blog.pythonassembly.com/)
- Let's keep in touch on [LinkedIn ↗](https://www.linkedin.com/in/jsulopz) 😄

# Discipline to Search for Solutions on Google

> Apply the following steps when **looking for solutions on Google**:
>
> 1. **Necessity**: How to load an Excel file in Python?
> 2. **Search on Google**: by keywords
>   - `load excel python`
>   - ~~how to load excel in python~~
> 3. **Solution**: What's the `function()` that loads an Excel file in Python?
>   - A function is to programming what the atom is to physics.
>   - Every time you want to do something in programming,
>   - **you will need a `function()`** to do it.
>   - Therefore, out of all the words you see on a website,
>   - you must **spot the parentheses `()`**,
>   - because they indicate the presence of a `function()`.

# Load the Data

Load the dataset from [CIS](https://www.cis.es/cis/opencms/ES/index.html) by executing the lines of code below:
> - The goal of this dataset is
> - to predict the `internet_usage` of **people** (rows)
> - based on their **sociodemographic characteristics** (columns)

In [63]:
import pandas as pd

url = 'https://raw.githubusercontent.com/py-thocrates/data/main/internet_usage_spain.csv'

df = pd.read_csv(url).sample(1000)
df.head()

Unnamed: 0,internet_usage,sex,age,education
134,0,Male,50,No studies
1479,1,Male,26,PhD
1593,1,Female,40,Higher Level
2101,1,Female,22,PhD
1725,0,Female,65,High School


# Build & Compare Models

## `DecisionTreeClassifier()` Model in Python

In [64]:
%%HTML
<iframe width="560" height="315" src="https://www.youtube.com/embed/7VeUPuFGJHk" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>

> - Build the model `model.fit()`
> - And see how good it is `model.score()`

In [65]:
clf.fit(X, y)

NameError: name 'clf' is not defined

In [66]:
from sklearn.tree import DecisionTreeClassifier

In [67]:
model = DecisionTreeClassifier()

In [68]:
model.fit()

TypeError: fit() missing 2 required positional arguments: 'X' and 'y'

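In [69]:
y = df.internet_usage                    # target: the column we want to predict
X = df.drop('internet_usage', axis=1)    # features: every other column

In [70]:
X = df.drop(columns='internet_usage')    # equivalent, more explicit syntax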

In [72]:
pd.get_dummies(X)

Unnamed: 0,age,sex_Female,sex_Male,education_Elementary,education_High School,education_Higher Level,education_No studies,education_PhD,education_University
134,50,0,1,0,0,0,1,0,0
1479,26,0,1,0,0,0,0,1,0
1593,40,1,0,0,0,1,0,0,0
2101,22,1,0,0,0,0,0,1,0
1725,65,1,0,0,1,0,0,0,0
...,...,...,...,...,...,...,...,...,...
555,52,1,0,1,0,0,0,0,0
312,34,1,0,1,0,0,0,0,0
1782,51,1,0,0,0,0,1,0,0
1032,21,1,0,0,1,0,0,0,0


In [73]:
X = pd.get_dummies(data=X, drop_first=True)
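
To see what `drop_first=True` changes, here is a minimal sketch on a made-up three-row table. The `toy` DataFrame and its values are assumptions for illustration only; they are not rows from the CIS dataset.

```python
import pandas as pd

# Hypothetical toy table, not taken from the CIS dataset
toy = pd.DataFrame({
    'age': [50, 26, 40],
    'sex': ['Male', 'Male', 'Female'],
})

# Without drop_first: one column per category (sex_Female and sex_Male)
full = pd.get_dummies(toy)

# With drop_first=True: the first category (sex_Female) is dropped,
# because it can be deduced from the remaining column
reduced = pd.get_dummies(toy, drop_first=True)

print(list(full.columns))     # ['age', 'sex_Female', 'sex_Male']
print(list(reduced.columns))  # ['age', 'sex_Male']
```

Numeric columns such as `age` pass through untouched; only the text columns are turned into dummy columns.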

In [74]:
model.fit(X, y)

DecisionTreeClassifier()

In [75]:
model.score(X, y)

0.875

## `SVC()` Model in Python

In [76]:
%%HTML
<iframe width="560" height="315" src="https://www.youtube.com/embed/efR1C6CvhmE" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>

> - Build the model `model.fit()`
> - And see how good it is `model.score()`

In [77]:
from sklearn.svm import SVC

In [78]:
model = SVC()

In [79]:
model.fit(X, y)

SVC()

In [80]:
model.score(X,y)

0.786

## `LogisticRegression()` Model in Python

In [81]:
%%HTML
<iframe width="560" height="315" src="https://www.youtube.com/embed/yIYKR4sgzI8" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>

> - Build the model `model.fit()`
> - And see how good it is `model.score()`

In [82]:
from sklearn.linear_model import LogisticRegression

In [83]:
model = LogisticRegression(max_iter=1000)

In [84]:
model.fit(X, y)

LogisticRegression(max_iter=1000)

In [85]:
model.score(X, y)

0.834

# Function to Automate Lines of Code

> - We kept repeating the same code:

```python
model.fit(X, y)
model.score(X, y)
```

> - Why not turn these lines into a `function()`
> - to automate the process,
> - so that you would only need

```python
calcular_precision(model=dt)

calcular_precision(model=sv)

calcular_precision(model=lr)
```

> - to calculate the `accuracy`?

## Make a Procedure Sample for `DecisionTreeClassifier()`

In [86]:
model = DecisionTreeClassifier()
model.fit(X, y)
model.score(X, y)

0.875

## Code Thinking

> 1. Think of the function's `result`
> 2. Store that `object` in a variable
> 3. `return` the `result` at the end
> 4. **Indent the body** of the function to the right
> 5. `def`ine the `function():`
> 6. Think of what will change when you execute the function with `different models`
> 7. Locate the **`variable` that you will change**
> 8. Turn it into the `parameter` of the `function()`

## Automate the Procedure into a `function()`

In [87]:
def calcular_precision(model):
    model.fit(X, y)
    result = model.score(X, y)

    return result

## `DecisionTreeClassifier()` Accuracy

In [88]:
dt = DecisionTreeClassifier()
calcular_precision(model = dt)

0.875

## `SVC()` Accuracy

In [89]:
sv = SVC()
calcular_precision(model = sv)

0.786

## `LogisticRegression()` Accuracy

In [90]:
lr = LogisticRegression(max_iter = 1000)
calcular_precision(model = lr)

0.834

# Which is the Best Model?

> Which model has the **highest accuracy**?

## University Access Exams Analogy

> Let's **imagine**:
>
> 1. You have a `math exam` on Saturday
> 2. Today is Monday
> 3. You want to **find out whether you need to study more** for the math exam
> 4. How do you gauge your `math level`?
> 5. Well, you've got **100 questions `X` with 100 solutions `y`** from past years' exams
> 6. You may study the 100 questions with 100 solutions `fit(questions, solutions)`
> 7. Then, you may do a `mock exam` with the same 100 questions `predict(questions)`
> 8. And compare `your_solutions` with the `real_solutions`
> 9. You've got **90/100 correct answers** `accuracy` in the mock exam
> 10. You think you are **prepared for the math exam**
> 11. And when you do **the real exam on Saturday, the mark is 40/100**
> 12. Why? How could we have prevented this?
> 13. **Solution**: split the 100 questions into `70 train` to study and `30 test` for the mock exam.
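
The gap between the mock exam and the real exam can be reproduced in code. The sketch below is only an illustration under assumptions: it uses synthetic data from `make_classification` (not the CIS dataset), and the names `questions`, `solutions`, `studied_accuracy` and `unseen_accuracy` are made up to follow the analogy.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for past exam questions; flip_y adds noisy labels
questions, solutions = make_classification(
    n_samples=1000, n_features=10, flip_y=0.2, random_state=0)

# Keep 30% of the questions aside: the student never studies them
q_train, q_test, s_train, s_test = train_test_split(
    questions, solutions, test_size=0.30, random_state=0)

student = DecisionTreeClassifier(random_state=0)
student.fit(q_train, s_train)

# Accuracy on questions already studied vs. questions never seen
studied_accuracy = student.score(q_train, s_train)
unseen_accuracy = student.score(q_test, s_test)

print(studied_accuracy, unseen_accuracy)
```

The score on the studied questions is the optimistic "mock exam with the same 100 questions"; the score on the unseen questions is closer to what the real exam would give.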

# `train_test_split()` the Data

In [91]:
from sklearn.model_selection import train_test_split

In [95]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42)
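
As a quick check of what `test_size` and `random_state` do, the following sketch splits a made-up array with 1000 rows twice. `X_demo` and `y_demo` are hypothetical placeholders, not the notebook's `X` and `y`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data with the same number of rows as the sample above
X_demo = np.arange(1000).reshape(-1, 1)
y_demo = np.arange(1000) % 2

# Same arguments, same seed: the two splits should be identical
split_a = train_test_split(X_demo, y_demo, test_size=0.30, random_state=42)
split_b = train_test_split(X_demo, y_demo, test_size=0.30, random_state=42)

X_train_a, X_test_a, y_train_a, y_test_a = split_a
X_train_b, X_test_b, y_train_b, y_test_b = split_b

print(len(X_train_a), len(X_test_a))       # 700 300
print(np.array_equal(X_test_a, X_test_b))  # True
```

Fixing `random_state` makes the split reproducible, so every model is later evaluated on exactly the same test rows.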

In [96]:
X

Unnamed: 0,age,sex_Male,education_High School,education_Higher Level,education_No studies,education_PhD,education_University
134,50,1,0,0,1,0,0
1479,26,1,0,0,0,1,0
1593,40,0,0,1,0,0,0
2101,22,0,0,0,0,1,0
1725,65,0,1,0,0,0,0
...,...,...,...,...,...,...,...
555,52,0,0,0,0,0,0
312,34,0,0,0,0,0,0
1782,51,0,0,0,1,0,0
1032,21,0,1,0,0,0,0


In [97]:
X_train

Unnamed: 0,age,sex_Male,education_High School,education_Higher Level,education_No studies,education_PhD,education_University
1874,63,1,0,0,0,0,0
1437,54,0,0,0,0,0,0
495,30,0,1,0,0,0,0
564,41,0,0,0,0,0,1
1219,72,0,0,0,1,0,0
...,...,...,...,...,...,...,...
806,46,1,1,0,0,0,0
2033,44,1,0,0,0,1,0
734,44,0,0,0,0,0,0
1332,27,0,0,0,0,1,0


In [98]:
y_train

1874    0
1437    1
495     1
564     1
1219    0
       ..
806     1
2033    1
734     0
1332    1
1288    1
Name: internet_usage, Length: 700, dtype: int64

In [99]:
X_test

Unnamed: 0,age,sex_Male,education_High School,education_Higher Level,education_No studies,education_PhD,education_University
1408,55,0,0,0,0,0,0
947,18,0,0,0,0,0,0
680,28,1,0,0,0,0,0
1449,65,0,0,0,0,0,0
1761,29,1,1,0,0,0,0
...,...,...,...,...,...,...,...
530,34,0,0,1,0,0,0
463,82,0,0,0,0,0,0
1858,25,0,0,0,0,0,0
730,37,1,0,0,0,0,1


In [100]:
y_test

1408    1
947     1
680     1
1449    0
1761    1
       ..
530     1
463     0
1858    1
730     1
1208    1
Name: internet_usage, Length: 300, dtype: int64

> 1. **`fit()` the model with `Train Data`**
>
> - `model.fit(70%questions, 70%solutions)`

In [101]:
model = DecisionTreeClassifier()

In [102]:
model.fit(X_train, y_train)

DecisionTreeClassifier()

> 2. **`.predict()` answers with `Test Data` (mock exam)**
>
> - `your_solutions = model.predict(30%questions)`

In [104]:
your_solutions = model.predict(X_test)

> 3. **Compare `your_solutions` with the `correct answers` from the mock exam**
>
> - `your_solutions == real_solutions`?

In [105]:
your_solutions == y_test

1408    False
947      True
680     False
1449     True
1761     True
        ...  
530      True
463      True
1858     True
730     False
1208    False
Name: internet_usage, Length: 300, dtype: bool

In [106]:
(your_solutions == y_test).sum()

225

In [107]:
(your_solutions == y_test).sum()/300

0.75

In [108]:
(your_solutions == y_test).mean()

0.75

In [109]:
model.score(X_test, y_test)

0.75
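
The cells above compute the same number in several ways. As a self-contained sketch, assuming synthetic data from `make_classification` instead of the CIS dataset (all variable names here are illustrative), the manual comparison, `accuracy_score()` and `model.score()` can be checked against each other:

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data, used only to keep the example runnable on its own
X_demo, y_demo = make_classification(n_samples=500, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.30, random_state=0)

tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
predictions = tree.predict(X_te)

manual = (predictions == y_te).mean()            # share of correct answers
with_metric = accuracy_score(y_te, predictions)  # same idea, via sklearn.metrics
with_score = tree.score(X_te, y_te)              # predicts and compares in one call

print(manual, with_metric, with_score)
```

For classifiers, `score()` reports the mean accuracy, which is why the three values agree.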

# Optimize All Models & Compare Again

## Make a Procedure Sample for `DecisionTreeClassifier()`

In [111]:
model = DecisionTreeClassifier()
model.fit(X_train, y_train)

DecisionTreeClassifier()

In [113]:
y_pred = model.predict(X_test)

In [116]:
y_pred

array([0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0,
       0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0,
       0, 1, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1,
       1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0,
       1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 0,
       0, 0, 0, 1, 1, 1, 1, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 1,
       0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 0,
       1, 0, 1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 1, 0, 1, 0,
       0, 0, 0, 0, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 0, 1,
       0, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0,
       0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1, 1, 0,
       1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1,
       1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       1, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0])

In [118]:
comp = y_pred == y_test

In [119]:
comp

1408    False
947      True
680     False
1449     True
1761     True
        ...  
530      True
463      True
1858     True
730     False
1208    False
Name: internet_usage, Length: 300, dtype: bool

In [123]:
model = DecisionTreeClassifier()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
comp = y_pred == y_test
comp.mean()

0.75

## Automate the Procedure into a `function()`

> 1. Think of the function's `result`
> 2. Store that `object` in a variable
> 3. `return` the `result` at the end
> 4. **Indent the body** of the function to the right
> 5. `def`ine the `function():`
> 6. Think of what will change when you execute the function with `different models`
> 7. Locate the **`variable` that you will change**
> 8. Turn it into the `parameter` of the `function()`

In [144]:
def calcular_precision(model):
    
    model = DecisionTreeClassifier()
    model.fit(X_train, y_train)
    
    y_pred = model.predict(X_test)
    comp = y_pred == y_test
    result = comp.mean()

    return result 

## `DecisionTreeClassifier()` Accuracy

In [145]:
calcular_precision(dt)

0.7533333333333333

## `SVC()` Accuracy

In [149]:
sv

SVC()

In [150]:
calcular_precision(model = sv)

0.7466666666666667

## `LogisticRegression()` Accuracy

In [147]:
calcular_precision(lr)

0.7466666666666667

> - All three scores are almost identical because the function above overwrites its `model` parameter with a new `DecisionTreeClassifier()`.
> - Remove that line so the function fits the model you pass in:

In [151]:
def calcular_precision(model):

    model.fit(X_train, y_train)
    
    y_pred = model.predict(X_test)
    comp = y_pred == y_test
    result = comp.mean()

    return result 

## `DecisionTreeClassifier()` Accuracy

In [152]:
calcular_precision(dt)

0.75

## `SVC()` Accuracy

In [153]:
calcular_precision(model = sv)

0.8033333333333333

## `LogisticRegression()` Accuracy

In [154]:
calcular_precision(lr)

0.8466666666666667
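
One possible variation, shown here as a sketch rather than the notebook's own method, is to pass the data to the function as parameters instead of reading `X_train` and `X_test` from the notebook's global variables, and then loop over the models. The helper name `accuracy_on_test`, the `models` dictionary and the synthetic data are assumptions made for this example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier


def accuracy_on_test(model, X_train, X_test, y_train, y_test):
    """Fit on the train split and return the accuracy on the test split."""
    model.fit(X_train, y_train)
    return model.score(X_test, y_test)


# Synthetic data so the example runs without the CIS dataset
X_demo, y_demo = make_classification(n_samples=500, random_state=0)
# train_test_split returns X_train, X_test, y_train, y_test in that order
split = train_test_split(X_demo, y_demo, test_size=0.30, random_state=0)

models = {
    'DecisionTreeClassifier': DecisionTreeClassifier(random_state=0),
    'SVC': SVC(),
    'LogisticRegression': LogisticRegression(max_iter=1000),
}

# Every model is fitted and scored on exactly the same split
scores = {name: accuracy_on_test(model, *split) for name, model in models.items()}
best = max(scores, key=scores.get)

print(scores)
print(best)
```

Because the split is an explicit argument, the function cannot silently pick up a stale `X_train` left over from an earlier cell.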

# Which is the Best Model with `train_test_split()`?

> Which model has the **highest accuracy**?

# Reflect

> - Banks deploy models to predict the **probability that a customer will repay a loan**
> - If the bank had used the `DecisionTreeClassifier()` instead of the `LogisticRegression()`,
> - what would have happened?
> - Is `train_test_split()` always required to compare models?