# Contents <a id='back'></a>
1. [Introduction](#introduccion)
2. [Objective](#objetivo)
3. [Data and Libraries](#datos-y-librerias)
    1. [Libraries](#librerias)
    2. [Data](#datos)
    3. [Understanding the Data](#understanding-the-data)
    4. [Data Preparation](#preparacion-de-los-datos)
4. [Model: Decision Tree](#decision-tree)
5. [Model: Random Forest](#random-forest)
6. [Model: Logistic Regression](#regresion-logistica)
7. [Testing](#testeo)
8. [Conclusion](#conclusion)

<a id="introduccion"></a>
## Introduction

In this project, we will create a model that can analyze customer behavior and recommend one of Megaline's new plans:

- Smart
- Ultra

To achieve this, we will analyze the data using 3 types of models:

- Decision Tree
- Random Forest
- Logistic Regression

For each of these models, we will explore and calibrate their hyperparameters to seek the highest possible accuracy.

<a id="objetivo"></a>
## Objective

Develop a model that can analyze customer behavior and recommend one of Megaline's new plans:

- Smart
- Ultra

<a id="datos-y-librerias"></a>
## Data and Libraries

<a id="librerias"></a>
### Libraries

In [2]:
# We load the libraries that will be useful for our analysis.
import pandas as pd
from sklearn import set_config
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score 
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression 
from sklearn.preprocessing import OrdinalEncoder 
from sklearn.preprocessing import StandardScaler 

<a id="datos"></a>
### Data

In [4]:
# We load the dataset (a single table of user behavior).
try:
    df = pd.read_csv('C:\\Users\\Alejandro\\Downloads\\users_behavior.csv')
except FileNotFoundError:
    print("The file was not found in the specified location.")
    raise  # stop here: every cell below needs `df`

[Back to Contents](#back)

<a id="understanding-the-data"></a>
### Understanding the Data

Let's quickly inspect the data to understand what we're training our model with.

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3214 entries, 0 to 3213
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   calls     3214 non-null   float64
 1   minutes   3214 non-null   float64
 2   messages  3214 non-null   float64
 3   mb_used   3214 non-null   float64
 4   is_ultra  3214 non-null   int64  
dtypes: float64(4), int64(1)
memory usage: 125.7 KB


In [4]:
df.head(10)

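|   | calls | minutes | messages | mb_used | is_ultra |
|---|-------|---------|----------|---------|----------|
| 0 | 40.0 | 311.9 | 83.0 | 19915.42 | 0 |
| 1 | 85.0 | 516.75 | 56.0 | 22696.96 | 0 |
| 2 | 77.0 | 467.66 | 86.0 | 21060.45 | 0 |
| 3 | 106.0 | 745.53 | 81.0 | 8437.39 | 1 |
| 4 | 66.0 | 418.74 | 1.0 | 14502.75 | 0 |
| 5 | 58.0 | 344.56 | 21.0 | 15823.37 | 0 |
| 6 | 57.0 | 431.64 | 20.0 | 3738.9 | 1 |
| 7 | 15.0 | 132.4 | 6.0 | 21911.6 | 0 |
| 8 | 7.0 | 43.39 | 3.0 | 2538.67 | 1 |
| 9 | 90.0 | 665.41 | 38.0 | 17358.61 | 0 |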


Let's take advantage of the fact that this DataFrame consists exclusively of numbers and describe it statistically.

In [5]:
df.describe()

|       | calls | minutes | messages | mb_used | is_ultra |
|-------|-------|---------|----------|---------|----------|
| count | 3214.0 | 3214.0 | 3214.0 | 3214.0 | 3214.0 |
| mean | 63.038892 | 438.208787 | 38.281269 | 17207.673836 | 0.306472 |
| std | 33.236368 | 234.569872 | 36.148326 | 7570.968246 | 0.4611 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 40.0 | 274.575 | 9.0 | 12491.9025 | 0.0 |
| 50% | 62.0 | 430.6 | 30.0 | 16943.235 | 0.0 |
| 75% | 82.0 | 571.9275 | 57.0 | 21424.7 | 1.0 |
| max | 244.0 | 1632.06 | 224.0 | 49745.73 | 1.0 |
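
The `mean` of `is_ultra` above (0.306472) is the share of Ultra users, which gives a useful reference point for the accuracies reported later: a model that always predicts the majority class (Smart) is already right for every Smart user. A minimal sketch of that arithmetic follows; the variable names are illustrative and not part of the original notebook.

```python
# Share of Ultra users, taken from the `mean` row of df.describe().
ultra_share = 0.306472

# A model that always predicts the majority class is correct for every
# user in that class and wrong for everyone else.
baseline_accuracy = max(ultra_share, 1 - ultra_share)

print(f"Majority-class baseline accuracy: {baseline_accuracy:.3f}")
```

Any model worth keeping should clearly beat this baseline of roughly 69%.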


[Back to Contents](#back)

<a id="preparacion-de-los-datos"></a>
### Data Preparation

DataFrames often have missing values, but `df.info()` shows that this one does not. We still need to prepare the data before training, validating, and testing our models. There are 2 steps:

- Split the data into separate groups: 20% for testing, 20% for validation, and the remaining 60% for training. We do this in two passes, first holding out 20% for testing and then taking 25% of what remains for validation (25% of 80% is 20% of the total).
- Create targets and features for training, validation, and testing. The last column, `is_ultra`, is the target; the other four columns are the features.

In [6]:
# First, we separate the test group
df_rest, df_test = train_test_split(df, test_size=0.2, random_state=54321)

In [7]:
print(df_rest.shape)
print(df_test.shape)

(2571, 5)
(643, 5)


In [8]:
# Then we split the rest into training and validation (25% of the remaining 80% = 20% of the total)
df_train, df_valid = train_test_split(df_rest, test_size=0.25, random_state=54321)

In [9]:
# We create targets and features for our model
features_train = df_train.drop(['is_ultra'], axis=1)
target_train = df_train['is_ultra']
features_valid = df_valid.drop(['is_ultra'], axis=1)
target_valid = df_valid['is_ultra']
features_test = df_test.drop(['is_ultra'], axis=1)
target_test = df_test['is_ultra']

In [10]:
# Let's check the final sizes of our DataFrames
print('Training:')
print(features_train.shape)
print(target_train.shape)
print()
print('Validation:')
print(features_valid.shape)
print(target_valid.shape)
print()
print('Testing:')
print(features_test.shape)
print(target_test.shape)

Training:
(1928, 4)
(1928,)

Validation:
(643, 4)
(643,)

Testing:
(643, 4)
(643,)


Since every model uses the same data, it is convenient to build these sets once, up front.
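
Because only about 31% of users are on Ultra, a purely random split can leave the three groups with slightly different class ratios. The sketch below shows one possible variation, not the approach used in this notebook: the same two-pass 60/20/20 split with the `stratify` argument of `train_test_split`. It runs on a synthetic stand-in DataFrame (`demo`, an assumption made so the example is self-contained).

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data: 1000 rows, roughly 30% in class 1.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "mb_used": rng.normal(17000, 7500, size=1000),
    "is_ultra": (rng.random(1000) < 0.3).astype(int),
})

# First pass: hold out 20% for testing, preserving the class ratio.
demo_rest, demo_test = train_test_split(
    demo, test_size=0.2, random_state=54321, stratify=demo["is_ultra"]
)

# Second pass: 25% of the remaining 80% equals 20% of the full dataset.
demo_train, demo_valid = train_test_split(
    demo_rest, test_size=0.25, random_state=54321, stratify=demo_rest["is_ultra"]
)

# Report the size and the class-1 share of each group.
for name, part in [("train", demo_train), ("valid", demo_valid), ("test", demo_test)]:
    print(name, len(part), round(part["is_ultra"].mean(), 3))
```

With stratification, the share of class 1 is nearly identical in all three groups, which makes validation and test accuracies easier to compare.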

[Back to Contents](#back)

<a id="decision-tree"></a>
## Model: Decision Tree

We will start with the decision tree, tuning its maximum depth (`max_depth`) and the minimum number of samples required in each leaf (`min_samples_leaf`). Because a single tree trains very quickly, we can afford to search over both hyperparameters at once.

In [11]:
best_tree = 0 # best validation accuracy found so far
best_depth = 0
best_leaf = 0
score = 0
for depth in range(1, 20): # candidate values for max_depth
    for leaf in range(1, 20): # candidate values for min_samples_leaf
        tree = DecisionTreeClassifier(random_state=54321, max_depth=depth, min_samples_leaf=leaf) # configure the tree
        tree.fit(features_train, target_train) # train the model on the training set
        score = tree.score(features_valid, target_valid) # calculate the accuracy score on the validation set
        if score > best_tree:
            best_tree = score # save the best accuracy score on the validation set
            best_depth = depth # save the depth corresponding to the best accuracy score
            best_leaf = leaf # save the min_samples_leaf corresponding to the best accuracy score

print("Accuracy of the best model on the validation set (depth = {}): {}".format(best_depth, best_tree))

# refit a tree with the best hyperparameters found
final_tree = DecisionTreeClassifier(random_state=54321, max_depth=best_depth, min_samples_leaf=best_leaf)
final_tree.fit(features_train, target_train)
final_tree.fit(features_train, target_train)

Accuracy of the best model on the validation set (depth = 7): 0.8429237947122862


DecisionTreeClassifier(max_depth=7, min_samples_leaf=13, random_state=54321)

[Back to Contents](#back)

Our best tree has a maximum depth of 7 and requires at least 13 samples in each leaf. This model gives us a validation accuracy of 84.3%.
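
The same kind of search can also be written with scikit-learn's `GridSearchCV`. Note that this is a different technique from the loops above: it scores each combination with k-fold cross-validation instead of a single validation set. The sketch below is only an illustration under stated assumptions: it uses synthetic data from `make_classification` in place of `features_train` and `target_train`, and a smaller grid to keep it fast.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the training data (4 numeric features, 2 classes).
X_demo, y_demo = make_classification(
    n_samples=500, n_features=4, n_informative=3, n_redundant=0,
    random_state=54321,
)

# Candidate values for each hyperparameter (smaller than the notebook's grid).
param_grid = {
    "max_depth": list(range(1, 8)),
    "min_samples_leaf": [1, 5, 10, 15],
}

# Every combination is scored by 5-fold cross-validated accuracy.
search = GridSearchCV(
    DecisionTreeClassifier(random_state=54321),
    param_grid,
    scoring="accuracy",
    cv=5,
)
search.fit(X_demo, y_demo)

print(search.best_params_)
print(round(search.best_score_, 3))
```

After fitting, `search.best_estimator_` holds a tree refit on all the data passed to `fit` with the winning hyperparameters.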

<a id="random-forest"></a>
## Model: Random Forest

Now that we have a good tree model, let's build a forest and search over the number of trees as well as their depth and leaf size. This takes considerably more time, since the search now covers three hyperparameters, but it can improve the accuracy of our model.

In [12]:
best_forest = 0 # best validation accuracy found so far
best_est = 0
score = 0
best_leaf = 0
best_depth = 0
for est in range(1, 20): # candidate values for n_estimators
    for depth in range(1, 20): # candidate values for max_depth
        for leaf in range(1, 20): # candidate values for min_samples_leaf
            forest = RandomForestClassifier(random_state=54321, n_estimators=est, max_depth=depth, min_samples_leaf=leaf) # configure the forest
            forest.fit(features_train, target_train) # train the model on the training set
            score = forest.score(features_valid, target_valid) # calculate the accuracy score on the validation set
            if score > best_forest:
                best_forest = score # save the best accuracy score on the validation set
                best_est = est # save the number of estimators corresponding to the best accuracy score
                best_leaf = leaf # save the min_samples_leaf corresponding to the best accuracy score
                best_depth = depth # save the depth corresponding to the best accuracy score

print("Accuracy of the best model on the validation set (n_estimators = {}): {}".format(best_est, best_forest))

# refit a forest with the best hyperparameters found
final_forest = RandomForestClassifier(random_state=54321, n_estimators=best_est, max_depth=best_depth, min_samples_leaf=best_leaf)

final_forest.fit(features_train, target_train)

Accuracy of the best model on the validation set (n_estimators = 3): 0.8506998444790047


RandomForestClassifier(max_depth=11, min_samples_leaf=3, n_estimators=3,
                       random_state=54321)

The accuracy increased only slightly, to about 85.1% on the validation set. Our final model has 3 trees, a maximum depth of 11, and at least 3 samples per leaf. The considerable depth might indicate overfitting; we'll see how this model performs in the testing phase.

It could be argued that the processing time required is not worth the small gain achieved, but once found, this model does perform better.
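
To put the processing-time concern in numbers, the sketch below counts how many models each search trains, using the same ranges as the loops above. It is plain arithmetic over the search grids; the variable names are illustrative.

```python
from itertools import product

# The same hyperparameter ranges used in the search loops.
depths = range(1, 20)
leaves = range(1, 20)
estimators = range(1, 20)

# Each combination means one model trained and scored.
tree_fits = len(list(product(depths, leaves)))
forest_fits = len(list(product(estimators, depths, leaves)))

print(tree_fits, forest_fits)
```

The forest search trains 19 times as many models as the tree search, and each forest contains up to 19 trees, which is why it is noticeably slower.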

[Back to Contents](#back)

<a id="regresion-logistica"></a>
## Model: Logistic Regression

For completeness, we will also try logistic regression, although we do not expect it to outperform the previous models.

In [13]:
reg = LogisticRegression(random_state=54321, solver='liblinear')
reg.fit(features_train,target_train)
reg.score(features_valid,target_valid)

0.776049766718507

This model has little room for calibration, and even though various solvers were tested, its accuracy barely reaches 77.6%. We will continue with our forest model since it has the highest accuracy.
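
One calibration option not explored above is feature scaling: `mb_used` is orders of magnitude larger than the other columns, and regularized logistic regression is sensitive to feature scale. The following is a minimal sketch, assuming synthetic data in place of the real features, of how `StandardScaler` and `LogisticRegression` could be chained with `make_pipeline`. Whether this improves accuracy on the real data is not something the notebook tested.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real features and target.
X_demo, y_demo = make_classification(
    n_samples=500, n_features=4, n_informative=3, n_redundant=0,
    random_state=54321,
)
# Stretch one column to mimic a feature on a much larger scale.
X_demo[:, 3] *= 10000

X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=54321
)

# The scaler is fit on the training data only, then applied before the model.
scaled_reg = make_pipeline(
    StandardScaler(),
    LogisticRegression(random_state=54321, solver="liblinear"),
)
scaled_reg.fit(X_tr, y_tr)

valid_accuracy = scaled_reg.score(X_va, y_va)
print(round(valid_accuracy, 3))
```

Wrapping both steps in a pipeline keeps the validation data out of the scaler's statistics, so the validation score stays an honest estimate.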

[Back to Contents](#back)

<a id="testeo"></a>
## Testing

Since two of our models have very similar validation accuracies, we will test both. For each one, we retrain the model on `df_rest`, which contains the original training and validation data, and then evaluate it on `df_test`, which no model has seen so far. If everything goes well, we expect an accuracy above 84%.

First, let's create the features and target for `df_rest`.

In [14]:
features_rest = df_rest.drop(['is_ultra'], axis=1)
target_rest = df_rest['is_ultra']

In [15]:
final_forest= RandomForestClassifier(max_depth=11, min_samples_leaf=3, n_estimators=3, random_state=54321)
final_forest.fit(features_rest,target_rest)

score = final_forest.score(features_test, target_test)
print(score)

0.7807153965785381


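The accuracy of this model has dropped by 7 percentage points, from 85.1% on the validation set to 78.1% on the test set. This is likely due to overfitting: the hyperparameters were chosen as the best of thousands of combinations on a single validation set, so the validation score was optimistic. We will test the decision tree model before taking further action.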

In [16]:
final_tree = DecisionTreeClassifier(max_depth=7, min_samples_leaf=13, random_state=54321)
final_tree.fit(features_rest, target_rest)

score = final_tree.score(features_test, target_test)
print(score)

0.7729393468118196


The accuracy of the tree model has dropped as well, from 84.3% on the validation set to 77.3% on the test set, while the forest model maintains its slight advantage.
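
To judge how much weight that slight advantage can carry, it helps to estimate the uncertainty of an accuracy measured on only 643 test rows. The sketch below is a rough check, assuming a normal approximation to the binomial distribution; it uses the two test accuracies printed above and also converts each one back into a count of correctly classified rows.

```python
import math

n_test = 643  # rows in df_test
test_accuracy = {
    "forest": 0.7807153965785381,
    "tree": 0.7729393468118196,
}

half_widths = {}
correct_rows = {}
for name, acc in test_accuracy.items():
    # Standard error of a proportion measured on n_test samples.
    std_error = math.sqrt(acc * (1 - acc) / n_test)
    # Approximate 95% interval half-width (normal approximation).
    half_widths[name] = 1.96 * std_error
    # Number of test rows the model classified correctly.
    correct_rows[name] = round(acc * n_test)
    print(f"{name}: {acc:.3f} +/- {half_widths[name]:.3f} ({correct_rows[name]} rows correct)")
```

Each interval is about 3 percentage points wide on either side, while the two models differ by less than 1 point (5 test rows), so this test set alone cannot reliably separate them.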

There should be a way to reduce this gap between validation and test accuracy. For now, despite attempts to change other settings, we have not been able to obtain a model with higher test accuracy than these.
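
One commonly used way to get a less optimistic estimate during tuning is k-fold cross-validation, which averages accuracy over several train/validation splits instead of relying on a single one. This is a sketch of that idea with `cross_val_score`, not something the notebook ran; it uses synthetic data as a stand-in for `features_rest` and `target_rest`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the combined training and validation data.
X_demo, y_demo = make_classification(
    n_samples=500, n_features=4, n_informative=3, n_redundant=0,
    random_state=54321,
)

forest = RandomForestClassifier(
    max_depth=11, min_samples_leaf=3, n_estimators=3, random_state=54321
)

# Accuracy on each of 5 folds; every row is used for validation exactly once.
scores = cross_val_score(forest, X_demo, y_demo, cv=5, scoring="accuracy")

print(scores.round(3))
print(round(scores.mean(), 3), round(scores.std(), 3))
```

The spread across folds shows how much a single validation score can vary by chance, which is the kind of variation that can make one hyperparameter combination look better than it really is.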

[Back to Contents](#back)

<a id="conclusion"></a>
## Conclusion

After calibrating and comparing our 3 models, we are left with 2 potential options:

1. A decision tree with a maximum depth of 7 and at least 13 samples per leaf, achieving 77.3% accuracy in testing. Its main advantage is speed: it trains and tunes quickly, at the cost of slightly lower accuracy than Option 2. Code:

   - `final_tree = DecisionTreeClassifier(max_depth=7, min_samples_leaf=13, random_state=54321)`

2. A random forest of 3 trees, with a maximum depth of 11 and at least 3 samples per leaf, achieving 78.1% accuracy in testing. This model takes more time to tune, but at large data scales the difference in accuracy with Option 1 may start to matter. Code:

   - `final_forest = RandomForestClassifier(max_depth=11, min_samples_leaf=3, n_estimators=3, random_state=54321)`


[Back to Contents](#back)