<font size="+5">#07. Model Selection. Decision Tree vs Support Vector Machines vs Logistic Regression</font>

- Book + Private Lessons [Here ↗](https://sotastica.com/reservar)
- Subscribe to my [Blog ↗](https://blog.pythonassembly.com/)
- Let's keep in touch on [LinkedIn ↗](https://www.linkedin.com/in/jsulopz) 😄

# Load the Data

Load the dataset from [CIS](https://www.cis.es/cis/opencms/ES/index.html) by executing the lines of code below:
> - The goal of this dataset is
> - To predict `internet_usage` of **people** (rows)
> - Based on their **socio-demographical characteristics** (columns)

In [1]:
import pandas as pd

url = 'https://raw.githubusercontent.com/py-thocrates/data/main/internet_usage_spain.csv'

df = pd.read_csv(url)
df.head()

Unnamed: 0,internet_usage,sex,age,education
0,0,Female,66,Elementary
1,1,Male,72,Elementary
2,1,Male,48,University
3,0,Male,59,PhD
4,1,Female,44,PhD


# Build & Compare Models

## `DecisionTreeClassifier()` Model in Python

In [2]:
%%HTML
<iframe width="560" height="315" src="https://www.youtube.com/embed/7VeUPuFGJHk" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>

> - Build the model `model.fit()`
> - And see how good it is `model.score()`

In [3]:
fit()

NameError: name 'fit' is not defined

In [4]:
model.fit()

NameError: name 'model' is not defined

`model = ?`

In [5]:
from sklearn.tree import DecisionTreeClassifier

In [10]:
import pandas as pd

In [13]:
df = pd.get_dummies(df, drop_first=True)

In [16]:
df

Unnamed: 0,internet_usage,age,sex_Male,education_High School,education_Higher Level,education_No studies,education_PhD,education_University
0,0,66,0,0,0,0,0,0
1,1,72,1,0,0,0,0,0
2,1,48,1,0,0,0,0,1
3,0,59,1,0,0,0,1,0
4,1,44,0,0,0,0,1,0
...,...,...,...,...,...,...,...,...
2450,1,43,1,0,0,0,0,0
2451,1,18,0,1,0,0,0,0
2452,0,54,0,0,0,0,0,0
2453,1,31,1,1,0,0,0,0
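To see what `pd.get_dummies(..., drop_first=True)` does to the text columns, here is a minimal sketch on a tiny DataFrame. The three rows are invented for illustration and are not taken from the CIS dataset; `toy` and `toy_dummies` are hypothetical names.

```python
import pandas as pd

# Toy data, invented for illustration only (not the CIS dataset)
toy = pd.DataFrame({
    'age': [66, 72, 48],
    'sex': ['Female', 'Male', 'Male'],
    'education': ['Elementary', 'PhD', 'University'],
})

# Each text column becomes one indicator column per category.
# drop_first=True drops the first category (in sorted order) of each
# text column, so 'Female' and 'Elementary' become the implicit baseline.
toy_dummies = pd.get_dummies(toy, drop_first=True)

print(toy_dummies.columns.tolist())
```

Numeric columns such as `age` pass through untouched; only the text columns are expanded, which is why the models can later be fitted on `X`.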


In [17]:
X = df.drop(columns='internet_usage')
y = df.internet_usage

In [18]:
model = DecisionTreeClassifier()

In [19]:
model.fit(X,y)

DecisionTreeClassifier()

In [20]:
model.score(X,y)

0.859877800407332
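For a classifier, `model.score()` returns the accuracy: the share of rows where the prediction matches the real value. The sketch below checks this by hand on a few invented rows; `X_toy`, `y_toy` and `toy_model` are hypothetical names, and the data is not from the CIS dataset.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Toy data, invented for illustration: one feature (age) and a 0/1 target
X_toy = np.array([[18], [25], [40], [55], [70], [85]])
y_toy = np.array([1, 1, 1, 0, 0, 0])

toy_model = DecisionTreeClassifier(random_state=0)
toy_model.fit(X_toy, y_toy)

# Accuracy as reported by scikit-learn
score = toy_model.score(X_toy, y_toy)

# The same number computed by hand: mean of (prediction == real value)
manual = (toy_model.predict(X_toy) == y_toy).mean()

print(score, manual)
```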

## `SVC()` Model in Python

In [1]:
%%HTML
<iframe width="560" height="315" src="https://www.youtube.com/embed/efR1C6CvhmE" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>

> - Build the model `model.fit()`
> - And see how good it is `model.score()`

In [21]:
from sklearn.svm import SVC

In [22]:
model = SVC()

In [23]:
model.fit(X,y)

SVC()

In [24]:
model.score(X,y)

0.7837067209775967

## `LogisticRegression()` Model in Python

In [4]:
%%HTML
<iframe width="560" height="315" src="https://www.youtube.com/embed/yIYKR4sgzI8" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>

> - Build the model `model.fit()`
> - And see how good it is `model.score()`

In [26]:
from sklearn.linear_model import LogisticRegression

In [30]:
model = LogisticRegression(max_iter=1000)

In [31]:
model.fit(X,y)

LogisticRegression(max_iter=1000)

In [32]:
model.score(X,y)

0.8334012219959267

# Function to Automate Lines of Code

> - We kept repeating the same code:

```python
model.fit()
model.score()
```

> - Why not turn the lines into a `function()`
> - To automate the process?
> - In a way that you would just need

```python
calculate_accuracy(model=dt)

calculate_accuracy(model=sv)

calculate_accuracy(model=lr)
```

> - To calculate the `accuracy`

## Make a Procedure Sample for `DecisionTreeClassifier()`

In [33]:
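def calculate_accuracy(model):

    # Train the model on the explanatory variables X and the target y
    model.fit(X,y)

    # score() returns the accuracy (share of correct predictions)
    accuracy = model.score(X,y)

    return accuracy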

## Code Thinking

> 1. Think of the function's `result`
> 2. Store that `object` to a variable
> 3. `return` the `result` at the end
> 4. **Indent the body** of the function to the right
> 5. `def`ine the `function():`
> 6. Think of what's gonna change when you execute the function with `different models`
> 7. Locate the **`variable` that you will change**
> 8. Turn it into the `parameter` of the `function()`

## Automate the Procedure into a `function()`

In [34]:
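def calculate_accuracy(model):

    # Train the model on the explanatory variables X and the target y
    model.fit(X,y)

    # score() returns the accuracy (share of correct predictions)
    accuracy = model.score(X,y)

    return accuracy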

## `DecisionTreeClassifier()` Accuracy

In [35]:
dt = DecisionTreeClassifier()
calculate_accuracy(dt)

0.859877800407332

## `SVC()` Accuracy

In [36]:
sv = SVC()
calculate_accuracy(sv)

0.7837067209775967

## `LogisticRegression()` Accuracy

In [38]:
lr = LogisticRegression()
calculate_accuracy(lr)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


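0.8325865580448065

> - The warning means the solver stopped at its default iteration limit before converging
> - The earlier cell with `LogisticRegression(max_iter=1000)` ran without this warning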
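The three calls can also be run in one loop. This is a minimal sketch, not the notebook's own code: it uses synthetic data from `make_classification` as a stand-in for `X` and `y`, and it passes the data in as parameters instead of reading global variables. The names `X_demo`, `y_demo`, `models` and `accuracies` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for X and y (assumption: not the CIS data)
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

def calculate_accuracy(model, X, y):
    # Fit on the data and return the accuracy on that same data
    model.fit(X, y)
    return model.score(X, y)

# One entry per model to compare
models = {
    'DecisionTreeClassifier': DecisionTreeClassifier(),
    'SVC': SVC(),
    'LogisticRegression': LogisticRegression(max_iter=1000),
}

accuracies = {name: calculate_accuracy(m, X_demo, y_demo) for name, m in models.items()}

for name, acc in accuracies.items():
    print(name, round(acc, 3))
```

Adding a fourth model then only means adding one more entry to the dictionary.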

# Which is the Best Model?

> Which model has the **highest accuracy**?

## University Access Exams Analogy

> Let's **imagine**:
>
> 1. You have a `math exam` on Saturday
> 2. Today is Monday
> 3. You want to **calculate if you need to study more** for the math exam
> 4. How do you calibrate your `math level`?
> 5. Well, you've got **100 questions `X` with 100 solutions `y`** from past years' exams
> 6. You may study the 100 questions with 100 solutions `fit(questions, solutions)`
> 7. Then, you may do a `mock exam` with the 100 questions `predict(questions)`
> 8. And compare `your_solutions` with the `real_solutions`
> 9. You've got **90/100 correct answers** `accuracy` in the mock exam
> 10. You think you are **prepared for the math exam**
> 11. And when you do **the real exam on Saturday, the mark is 40/100**
> 12. Why? How could we have prevented this?
> 13. **Solution**: split the 100 questions into
> - `70 train` to study & `30 test` for the mock exam.
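The analogy can be reproduced in code. The sketch below is an illustration under assumptions: it uses noisy synthetic data from `make_classification` (not the CIS data) and hypothetical names such as `X_tr` and `tree`. A decision tree that memorises the questions it studied scores perfectly on them and worse on questions it has never seen.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with 20% of labels flipped at random (assumption: stand-in
# for the exam questions, not the CIS dataset)
X_demo, y_demo = make_classification(
    n_samples=500, n_features=10, flip_y=0.2, random_state=0
)

# 70% of the questions to study, 30% kept aside for the mock exam
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.30, random_state=42
)

tree = DecisionTreeClassifier(random_state=0)
tree.fit(X_tr, y_tr)

train_acc = tree.score(X_tr, y_tr)  # mock exam with the studied questions
test_acc = tree.score(X_te, y_te)   # mock exam with unseen questions

print(train_acc, test_acc)
```

The gap between the two numbers is the 90/100 versus 40/100 surprise from the analogy.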

# `train_test_split()` the Data

In [39]:
from sklearn.model_selection import train_test_split

In [47]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=42)

In [48]:
df

Unnamed: 0,internet_usage,age,sex_Male,education_High School,education_Higher Level,education_No studies,education_PhD,education_University
0,0,66,0,0,0,0,0,0
1,1,72,1,0,0,0,0,0
2,1,48,1,0,0,0,0,1
3,0,59,1,0,0,0,1,0
4,1,44,0,0,0,0,1,0
...,...,...,...,...,...,...,...,...
2450,1,43,1,0,0,0,0,0
2451,1,18,0,1,0,0,0,0
2452,0,54,0,0,0,0,0,0
2453,1,31,1,1,0,0,0,0


In [49]:
X_train

Unnamed: 0,age,sex_Male,education_High School,education_Higher Level,education_No studies,education_PhD,education_University
198,54,1,0,0,0,0,0
561,50,1,0,0,0,1,0
685,26,1,0,0,0,0,1
1321,62,1,0,0,0,0,1
590,86,0,0,0,1,0,0
...,...,...,...,...,...,...,...
1638,37,0,1,0,0,0,0
1095,35,0,0,1,0,0,0
1130,58,0,0,0,0,0,0
1294,52,0,0,0,0,0,0


In [50]:
X_test

Unnamed: 0,age,sex_Male,education_High School,education_Higher Level,education_No studies,education_PhD,education_University
1598,52,0,0,0,0,0,0
620,48,1,0,1,0,0,0
1266,53,0,0,0,0,1,0
649,43,1,0,0,0,0,1
1908,43,1,0,1,0,0,0
...,...,...,...,...,...,...,...
377,28,1,0,0,0,1,0
535,35,0,0,1,0,0,0
1535,23,1,1,0,0,0,0
902,38,1,0,0,0,0,1
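To check what `train_test_split()` returns, here is a minimal sketch with 100 made-up "questions" and "solutions", as in the exam analogy. The arrays and the names `q_train`, `q_test`, `s_train`, `s_test` are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 100 made-up questions (one feature each) and their 0/1 solutions
questions = np.arange(100).reshape(-1, 1)
solutions = np.arange(100) % 2

# test_size=0.30 keeps 30% of the rows for the mock exam;
# random_state fixes the shuffle so the split is reproducible
q_train, q_test, s_train, s_test = train_test_split(
    questions, solutions, test_size=0.30, random_state=42
)

print(len(q_train), len(q_test))  # 70 30

# No question appears in both parts
overlap = set(q_train.ravel()) & set(q_test.ravel())
print(len(overlap))  # 0
```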


> 1. **`fit()` the model with `Train Data`**
>
> - `model.fit(70%questions, 70%solutions)`

In [51]:
model.fit(X_train, y_train)

LogisticRegression(max_iter=1000)

> 2. **`.predict()` answers with `Test Data` (mock exam)**
>
> - `your_solutions = model.predict(30%questions)`

In [56]:
your_solutions = model.predict(X_test)

> 3. **Compare `your_solutions` with `correct answers` from mock exam**
>
> - `your_solutions == real_solutions`?

In [57]:
y_test == your_solutions

1598    True
620     True
1266    True
649     True
1908    True
        ... 
377     True
535     True
1535    True
902     True
1967    True
Name: internet_usage, Length: 737, dtype: bool

In [58]:
(y_test == your_solutions).mean()

0.8548168249660787
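The manual comparison above should give the same number as scikit-learn's `accuracy_score()` and as `model.score()` called on the test data. The sketch below checks this on synthetic data, which is an assumption standing in for the CIS data; `clf`, `manual`, `with_metric` and `with_score` are hypothetical names.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for X and y (assumption: not the CIS data)
X_demo, y_demo = make_classification(n_samples=300, n_features=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.30, random_state=42
)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
predictions = clf.predict(X_te)

# Three ways to compute the accuracy on the test data
manual = (y_te == predictions).mean()
with_metric = accuracy_score(y_te, predictions)
with_score = clf.score(X_te, y_te)

print(manual, with_metric, with_score)
```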

# Optimize All Models & Compare Again

## Make a Procedure Sample for `DecisionTreeClassifier()`

In [64]:
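def calcular_precision(model):

    # Study with the train data only
    model.fit(X_train, y_train)

    # Mock exam with the test data
    your_solutions = model.predict(X_test)

    # Accuracy: share of predictions that match the real solutions
    accuracy = (y_test == your_solutions).mean()

    return accuracy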

## Automate the Procedure into a `function()`

In [66]:
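def calcular_precision(model):

    # Study with the train data only
    model.fit(X_train, y_train)

    # Mock exam with the test data
    your_solutions = model.predict(X_test)

    # Accuracy: share of predictions that match the real solutions
    accuracy = (y_test == your_solutions).mean()

    return accuracy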

## `DecisionTreeClassifier()` Accuracy

In [67]:
calcular_precision(dt)

0.8032564450474898

## `SVC()` Accuracy

In [68]:
calcular_precision(sv)

0.7788331071913162

## `LogisticRegression()` Accuracy

In [69]:
calcular_precision(lr)

0.8548168249660787

# Which is the Best Model with `train_test_split()`?

> Which model has the **highest accuracy**?
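One way to answer is to put the numbers side by side. The sketch below copies the accuracies printed in this notebook into a DataFrame; the column names `full_data_accuracy`, `test_accuracy` and `gap` are hypothetical labels chosen for this illustration.

```python
import pandas as pd

# Accuracies copied from the outputs reported in this notebook
results = pd.DataFrame(
    {
        # fit and score on all the data
        'full_data_accuracy': [0.859877800407332, 0.7837067209775967, 0.8325865580448065],
        # fit on train data, score on test data
        'test_accuracy': [0.8032564450474898, 0.7788331071913162, 0.8548168249660787],
    },
    index=['DecisionTreeClassifier', 'SVC', 'LogisticRegression'],
)

# How much accuracy each model loses on data it has not seen
results['gap'] = results['full_data_accuracy'] - results['test_accuracy']

best = results['test_accuracy'].idxmax()

print(results.sort_values('test_accuracy', ascending=False))
print(best)  # LogisticRegression
```

The decision tree ranks first on the data it was fitted on but shows the largest gap once it is scored on test data.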

# Reflect

> - Banks deploy models to predict the **probability that a customer will repay a loan**
> - If the bank had used the `DecisionTreeClassifier()` instead of the `LogisticRegression()`
> - What would have happened?
> - Is `train_test_split()` always required to compare models?
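As a hint for the last question: a single split is not the only option. scikit-learn also offers `cross_val_score()`, which repeats the train/test idea over several splits. The sketch below is only an illustration on synthetic data (an assumption, not the CIS data), with hypothetical names `scores` and `mean_accuracy`.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for X and y (assumption: not the CIS data)
X_demo, y_demo = make_classification(n_samples=300, n_features=5, random_state=0)

# cv=5: the data is split into 5 parts; each part is used once as test data
# while the model is fitted on the other 4 parts
scores = cross_val_score(LogisticRegression(max_iter=1000), X_demo, y_demo, cv=5)

mean_accuracy = scores.mean()

print(scores)
print(mean_accuracy)
```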