# 💻 Development of a model

#### Development of a model that recommends the appropriate mobile phone plan based on customer behavior.

In [1]:
#Import library
import pandas as pd

In [2]:
#Load dataset into a DataFrame
users = pd.read_csv('/datasets/users_behavior.csv')

## 🔎 EDA (Exploratory Data Analysis)

In [3]:
users.info()
users.describe()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3214 entries, 0 to 3213
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   calls     3214 non-null   float64
 1   minutes   3214 non-null   float64
 2   messages  3214 non-null   float64
 3   mb_used   3214 non-null   float64
 4   is_ultra  3214 non-null   int64  
dtypes: float64(4), int64(1)
memory usage: 125.7 KB


| | calls | minutes | messages | mb_used | is_ultra |
|---|---|---|---|---|---|
| count | 3214.0 | 3214.0 | 3214.0 | 3214.0 | 3214.0 |
| mean | 63.038892 | 438.208787 | 38.281269 | 17207.673836 | 0.306472 |
| std | 33.236368 | 234.569872 | 36.148326 | 7570.968246 | 0.4611 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 40.0 | 274.575 | 9.0 | 12491.9025 | 0.0 |
| 50% | 62.0 | 430.6 | 30.0 | 16943.235 | 0.0 |
| 75% | 82.0 | 571.9275 | 57.0 | 21424.7 | 1.0 |
| max | 244.0 | 1632.06 | 224.0 | 49745.73 | 1.0 |


### **Missing Values**
At first glance, there appear to be no missing values.

In [4]:
users.isna().sum()

calls       0
minutes     0
messages    0
mb_used     0
is_ultra    0
dtype: int64

*We can confirm there are no missing values*

However, we must also check for records with no activity at all.

If a user has 0 calls, 0 minutes, 0 messages, and 0 MB, this likely indicates a registration error, an inactive account, or incorrectly loaded data. Such records could distort the model.

In [5]:
#Filter records where all consumption is 0
zero_usage = users[
    (users['calls'] == 0) &
    (users['minutes'] == 0) &
    (users['messages'] == 0) &
    (users['mb_used'] == 0)]
print(zero_usage)

Empty DataFrame
Columns: [calls, minutes, messages, mb_used, is_ultra]
Index: []


*We can confirm that there are no records with zero activity.*

### **Value Types**
The `calls` and `messages` columns are stored as floats. This does not affect model development, but because they hold whole-number counts, they are converted to integers to reflect their discrete nature. The conversion is only safe if every value is already integer-valued, since `astype(int)` truncates any fractional part.

In [6]:
users['calls'] = users['calls'].astype(int)
users['messages'] = users['messages'].astype(int)
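
Before casting, it can be worth confirming that the float columns really contain whole numbers only. The following is a minimal sketch on a small hypothetical sample (the `sample` DataFrame and the `is_integer_valued` helper are illustrative names, not part of the notebook); the same check could be applied to `users`.

```python
import pandas as pd

# Hypothetical sample standing in for the `users` DataFrame
sample = pd.DataFrame({
    "calls": [40.0, 85.0, 77.0],
    "messages": [83.0, 56.0, 86.0],
    "minutes": [311.9, 516.75, 467.66],
})


def is_integer_valued(series):
    """Return True when no value in the series has a fractional part."""
    return bool((series % 1 == 0).all())


# Cast only the columns that pass the check
for column in ["calls", "messages"]:
    if is_integer_valued(sample[column]):
        sample[column] = sample[column].astype(int)

# Summary of which columns hold whole numbers only
checks = {column: is_integer_valued(sample[column]) for column in sample.columns}
print(checks)
print(sample.dtypes)
```

In this sample, `minutes` fails the check and stays a float, which is the behavior one would want for a genuinely continuous measurement.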

### **Duplicated Values**

In [7]:
users.duplicated().sum()

0

*There are no duplicate rows in the dataset, so all 3,214 records are kept for analysis.*

### **Plan Balance**

In [8]:
users['is_ultra'].value_counts()

0    2229
1     985
Name: is_ultra, dtype: int64

In [9]:
users['is_ultra'].value_counts(normalize=True)

0    0.693528
1    0.306472
Name: is_ultra, dtype: float64

This indicates that the dataset is moderately imbalanced, but not severely so.

If we build a dummy model that always predicts "Smart", it would achieve an accuracy of 0.6935. 

Since the target accuracy is ≥ 0.75, the model must clearly beat this baseline to meet the required threshold.
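
The baseline figure follows directly from the class counts. As a small arithmetic sketch (variable names are illustrative), the majority-class share and its distance from the 0.75 target can be computed as follows:

```python
# Class counts taken from the value_counts() output above
smart_count = 2229
ultra_count = 985
total = smart_count + ultra_count

# Accuracy of a model that always predicts the majority class ("Smart")
baseline_accuracy = smart_count / total

# Distance between that baseline and the required threshold
threshold = 0.75
gap = threshold - baseline_accuracy

print(f"Baseline accuracy: {baseline_accuracy:.4f}")
print(f"Gap to threshold:  {gap:.4f}")
```

The gap is a little under six percentage points, which is the improvement any candidate model has to deliver.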

In [10]:
users.groupby('is_ultra').mean()

| is_ultra | calls | minutes | messages | mb_used |
|---|---|---|---|---|
| 0 | 58.463437 | 405.942952 | 33.384029 | 16208.466949 |
| 1 | 73.392893 | 511.224569 | 49.363452 | 19468.823228 |


*Observations:*
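- Average consumption is higher on the Ultra plan for every feature. This is consistent with what would be expected: Ultra is the "premium" plan.
- The relative gap differs by feature: it is largest for messages (about 48% higher on Ultra) and smallest for mb_used (about 20% higher), so some variables may be more discriminating than others.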
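
To make the comparison between features concrete, the relative difference between the two plan means can be computed from the table above. This is only a rough sketch: the dictionaries below copy the reported means by hand, and a gap in means ignores the spread within each plan, so it is an indication rather than a measure of predictive power.

```python
# Per-plan means copied from the groupby output above
smart_means = {"calls": 58.463437, "minutes": 405.942952,
               "messages": 33.384029, "mb_used": 16208.466949}
ultra_means = {"calls": 73.392893, "minutes": 511.224569,
               "messages": 49.363452, "mb_used": 19468.823228}

# Relative increase of the Ultra mean over the Smart mean, per feature
relative_gap = {
    feature: ultra_means[feature] / smart_means[feature] - 1
    for feature in smart_means
}

# Print features from the largest to the smallest relative gap
for feature, gap in sorted(relative_gap.items(), key=lambda item: item[1], reverse=True):
    print(f"{feature:>8}: {gap:.1%}")
```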

## ⚖️ Segmentation

In [11]:
#Import library
from sklearn.model_selection import train_test_split

#Define features and target
features = users.drop(columns=['is_ultra'])
target = users['is_ultra']

#Split Train/Other (60%/40%)
#stratify keeps the plan proportions of the source dataset in each subset,
#which matters for imbalanced classification even when the imbalance is moderate.
features_train, features_other, target_train, target_other = train_test_split(
    features, target, test_size=0.4, stratify=target, random_state=54321)

#Split Other → Validation/Test (50%/50%): the 40% becomes 20% + 20%
features_valid, features_test, target_valid, target_test = train_test_split(
    features_other, target_other, test_size=0.5, stratify=target_other, random_state=54321)

Given that the distribution of plans in the dataset is 69% Smart and 31% Ultra, the proportion of classes in each set is maintained through stratification.

Since no separate test set is provided, the dataset is divided into three parts with a **3:1:1** ratio:

**60% for training, 20% for validation, and 20% for testing.**

This strategy allows for a more accurate estimation of the model's actual performance, as it provides sufficient data for both validation and final evaluation.
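
The effect of the two-step stratified split can be checked by looking at the size and class share of each subset. The sketch below uses a synthetic target that only reproduces the class counts of the dataset (2,229 Smart and 985 Ultra); the placeholder feature column and the variable names are assumptions made for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: only the class counts match the real dataset,
# the single feature column is a placeholder.
target = np.array([0] * 2229 + [1] * 985)
features = np.arange(len(target)).reshape(-1, 1)

# Same two-step split as above: 60% train, then 20% validation / 20% test
f_train, f_other, t_train, t_other = train_test_split(
    features, target, test_size=0.4, stratify=target, random_state=54321)
f_valid, f_test, t_valid, t_test = train_test_split(
    f_other, t_other, test_size=0.5, stratify=t_other, random_state=54321)

# Number of rows and Ultra share in each subset
split_report = {
    name: (len(t), round(float(t.mean()), 3))
    for name, t in [("train", t_train), ("valid", t_valid), ("test", t_test)]
}
for name, (size, ultra_share) in split_report.items():
    print(f"{name:>5}: {size} rows, Ultra share = {ultra_share}")
```

Each subset should show an Ultra share close to the 0.306 of the full dataset, which is what stratification is meant to guarantee.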

## 📊 Model Evaluation

### 🎯 DummyClassifier (baseline)

In [12]:
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

dummy = DummyClassifier(strategy='most_frequent')  #Predicts the majority class

#Train the model in the training set
dummy.fit(features_train, target_train)

#Accuracy in validation
accuracy_dummy = dummy.score(features_valid, target_valid)
print("Accuracy Dummy (baseline):", accuracy_dummy)

Accuracy Dummy (baseline): 0.6936236391912908


*This gives the minimum accuracy that the model must exceed (baseline)*

### 📈 Logistic Regression (Classification Algorithm)

In [13]:
from sklearn.linear_model import LogisticRegression

#Initialize the logistic regression constructor with the parameters random_state=54321 and solver='liblinear'
logreg = LogisticRegression(random_state=54321, solver='liblinear')

#Train the model in the training set
logreg.fit(features_train, target_train)

#Accuracy in validation
accuracy_logreg = logreg.score(features_valid, target_valid)
print("Accuracy Logistic Regression:", accuracy_logreg)

Accuracy Logistic Regression: 0.7169517884914464


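*Accuracy improved over the baseline (0.717 vs. 0.694), but it is still below the 0.75 threshold. Overfitting is unlikely for a linear model with four features, although training accuracy was not measured here.*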
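
A common way to check for overfitting is to compare accuracy on the training set with accuracy on the validation set: a large gap suggests the model has memorized the training data. The sketch below illustrates the idea on synthetic data generated with `make_classification` (not the `users_behavior` dataset); the `train_valid_accuracy` helper is a hypothetical name introduced for this example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data used purely for illustration
X, y = make_classification(n_samples=1000, n_features=4, n_informative=3,
                           n_redundant=1, random_state=54321)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=54321)


def train_valid_accuracy(model):
    """Fit the model and return (training accuracy, validation accuracy)."""
    model.fit(X_train, y_train)
    return model.score(X_train, y_train), model.score(X_valid, y_valid)


# Linear model, unrestricted tree, and depth-limited tree
logreg_train, logreg_valid = train_valid_accuracy(
    LogisticRegression(random_state=54321, solver='liblinear'))
deep_train, deep_valid = train_valid_accuracy(
    DecisionTreeClassifier(random_state=54321))
shallow_train, shallow_valid = train_valid_accuracy(
    DecisionTreeClassifier(max_depth=3, random_state=54321))

print(f"Logistic Regression: train={logreg_train:.3f} valid={logreg_valid:.3f}")
print(f"Unrestricted tree:   train={deep_train:.3f} valid={deep_valid:.3f}")
print(f"Tree (max_depth=3):  train={shallow_train:.3f} valid={shallow_valid:.3f}")
```

Applied to the notebook, the same comparison would call `score` on `features_train` and `features_valid` for each fitted model.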

### 🌳 Decision Tree

In [14]:
from sklearn.tree import DecisionTreeClassifier

dtc = DecisionTreeClassifier(random_state=54321)

#Train the model in the training set
dtc.fit(features_train, target_train)

# Accuracy in validation
accuracy_dtc = dtc.score(features_valid, target_valid)
print("Accuracy Decision Tree:", accuracy_dtc)


Accuracy Decision Tree: 0.7293934681181959


*The accuracy of the Decision Tree is slightly better; however, it remains below the 75% accuracy threshold.*

### üå≥üå≥üå≥Random Forest

In [15]:
from sklearn.ensemble import RandomForestClassifier

rfc = RandomForestClassifier(random_state=54321)

#Train the model in the training set
rfc.fit(features_train, target_train)

# Accuracy in validation
accuracy_rfc = rfc.score(features_valid, target_valid)
print("Accuracy Random Forest:", accuracy_rfc)

Accuracy Random Forest: 0.8009331259720062


*Random Forest usually improves accuracy because it averages many trees and reduces overfitting.*

✅ **This model has the best accuracy, exceeding the 75% accuracy threshold.**

### 🆚 Comparison of models

In [16]:
print(f"DummyClassifier (baseline): {accuracy_dummy:.3f}")
print(f"Logistic Regression: {accuracy_logreg:.3f}")
print(f"Decision Tree: {accuracy_dtc:.3f}")
print(f"Random Forest: {accuracy_rfc:.3f}")

DummyClassifier (baseline): 0.694
Logistic Regression: 0.717
Decision Tree: 0.729
Random Forest: 0.801


## 🔢 Hyperparameter tuning for 🌳 Decision Tree and 🌳🌳🌳 Random Forest models

The hyperparameters of both models are tuned on the validation set to find the best version of each. The selected model is then evaluated once on the test dataset.

### Decision Tree Classifier

In [50]:
best_dtc_score = 0
best_max_depth = None

for depth in range(1, 11):
    #Configure the max_depth
    dtc = DecisionTreeClassifier(max_depth=depth, random_state=54321)
    #Train the model in the training set
    dtc.fit(features_train, target_train)
    #Calculate the accuracy score
    score = dtc.score(features_valid, target_valid)
    #Print score
    print(f"max_depth = {depth} : {score:3f}")

    #Save best score
    if score > best_dtc_score:
        best_dtc_score = score
        best_max_depth = depth

print("\nThe accuracy of the best Decision Tree model in the validation set has a max_depth of {}: {:.3f}".format(best_max_depth, best_dtc_score))

max_depth = 1 : 0.749611
max_depth = 2 : 0.785381
max_depth = 3 : 0.790047
max_depth = 4 : 0.790047
max_depth = 5 : 0.777605
max_depth = 6 : 0.783826
max_depth = 7 : 0.791602
max_depth = 8 : 0.790047
max_depth = 9 : 0.785381
max_depth = 10 : 0.772939

The accuracy of the best Decision Tree model in the validation set has a max_depth of 7: 0.792


***Observations:***

In the Decision Tree Classifier model, the default max_depth is *None* (no depth limit), which achieved an accuracy score of 0.729 on the validation set.

The most accurate Decision Tree Classifier model is the one with max_depth = 7 (0.792). Limiting the depth appears to reduce overfitting compared with the unrestricted tree, although depths 3, 4, and 8 score within 0.002 of the best, so a shallower tree would perform almost as well.
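
To put the differences between depths into perspective, the accuracies can be converted back into counts of correctly classified validation rows. This sketch assumes the validation set holds 643 rows (20% of the 3,214 records); the dictionary copies the scores printed by the tuning loop.

```python
# Validation accuracies copied from the tuning output above
depth_accuracy = {
    1: 0.749611, 2: 0.785381, 3: 0.790047, 4: 0.790047, 5: 0.777605,
    6: 0.783826, 7: 0.791602, 8: 0.790047, 9: 0.785381, 10: 0.772939,
}

# Assumption: the validation set holds 643 rows (20% of the 3,214 records)
n_valid = 643

# Number of correctly classified validation rows for each depth
correct = {depth: round(acc * n_valid) for depth, acc in depth_accuracy.items()}
best_depth = max(correct, key=correct.get)

for depth, hits in correct.items():
    print(f"max_depth = {depth:>2}: {hits} correct "
          f"({hits - correct[best_depth]:+d} vs best)")
```

Under that assumption, max_depth = 7 beats max_depth = 3 by a single validation row, so the choice between them rests on very little evidence.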

### Random Forest Classifier

In [49]:
#Random Forest Classifier
best_rfc_score = 0
best_n_est = None

for est in range(10, 101, 10):
    rfc = RandomForestClassifier(random_state=54321, n_estimators=est)
    rfc.fit(features_train, target_train)
    score = rfc.score(features_valid, target_valid)
    
    print(f"n_estimators = {est} : {score:.3f}")
    
    if score > best_rfc_score:
        best_rfc_score = score
        best_n_est = est

print(f"\nThe accuracy of the best Random Forest model in the validation set has {best_n_est} n_estimators: {best_rfc_score:.3f}")

n_estimators = 10 : 0.788
n_estimators = 20 : 0.787
n_estimators = 30 : 0.795
n_estimators = 40 : 0.796
n_estimators = 50 : 0.801
n_estimators = 60 : 0.801
n_estimators = 70 : 0.802
n_estimators = 80 : 0.806
n_estimators = 90 : 0.804
n_estimators = 100 : 0.801

The accuracy of the best Random Forest model in the validation set has 80 n_estimators: 0.806


***Observations:***

In the Random Forest Classifier model, the default n_estimators hyperparameter is 100, which achieved an accuracy score of 0.801.

The most accurate Random Forest Classifier model is the one with 80 n_estimators, as it achieves the highest accuracy on the validation set (0.806).

### Conclusion


Based on hyperparameter tuning on the validation set, the Random Forest model with n_estimators = 80 is selected as the best-performing model (0.806).

The tuned Decision Tree (0.792) provides a simpler and more interpretable alternative, while Logistic Regression offers insight into the relative effect of features but with clearly lower accuracy (0.717).

All models exceed the baseline established by DummyClassifier (0.694), confirming that they effectively learn from user behavior to predict the optimal mobile plan.

## 🏁 Testing (Test dataset)

The performance of the selected model (**Random Forest Classifier with 80 estimators**) will be evaluated using the test dataset.

It is important to note that the training dataset together with the validation dataset will be used to train the model, while the test dataset will be reserved solely for evaluation.

An analogy to illustrate this strategy is the exam process:
- Training: Studying with textbooks
- Validation: Taking practice exams to determine the best study strategy
- Test: Sitting the actual final exam

In [55]:
#Combine train + validation
features_final = pd.concat([features_train, features_valid])
target_final = pd.concat([target_train, target_valid])

#Train the best model
best_model = RandomForestClassifier(random_state=54321, n_estimators=best_n_est)
best_model.fit(features_final, target_final)

#Evaluate in TEST
predictions_test = best_model.predict(features_test)
test_accuracy = accuracy_score(target_test, predictions_test)

print(f"Final accuracy on test dataset: {test_accuracy:.3f}")

Final accuracy on test dataset: 0.784


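With a test accuracy of 0.784, the model exceeds the required threshold of 0.75.

The test accuracy is about two points below the validation accuracy (0.806). A small drop is expected, because n_estimators was chosen to maximize the validation score, and the gap does not point to serious overfitting.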
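
Because the test set is fairly small, the reported accuracy carries some sampling uncertainty. As a rough sketch, a normal-approximation (Wald) 95% confidence interval can be computed from the reported accuracy; the test-set size of 643 rows is an assumption based on the 20% split of 3,214 records.

```python
import math

# Reported test accuracy; the test-set size is an assumption (20% of 3,214 records)
test_accuracy = 0.784
n_test = 643

# Standard error of a proportion under the normal approximation
standard_error = math.sqrt(test_accuracy * (1 - test_accuracy) / n_test)

# Approximate 95% confidence interval (z = 1.96)
z = 1.96
lower = test_accuracy - z * standard_error
upper = test_accuracy + z * standard_error

print(f"Standard error: {standard_error:.4f}")
print(f"95% interval:   [{lower:.3f}, {upper:.3f}]")
```

Under these assumptions the lower bound sits only just above 0.75, so the threshold is met, but not by a wide margin.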

Overall, the Random Forest remains the most effective model for predicting each user's plan based on their behavior.