### Author: Allan Jeeboo 
### Preferred Name: Vyncent S. A. van der Wolvenhuizen 
### Affiliation: Student at TripleTen 
### Email: vanderwolvenhuizen.vyncent@proton.me
### Date Started: 2025-03-12 
### Last Updated: 2025-03-17 16:17

# 1.0 Introduction
This project places us in the role of developing a model that analyzes subscribers' behavior for the telecommunications company Megaline. This model will recommend one of their newer plans: Smart or Ultra. 

We have access to behavior data about subscribers who have already switched to the new plans (from the project for the Statistical Data Analysis sprint). For this classification task, we need to develop a model that will pick the right plan. Since we’ve already performed the data preprocessing step, we can move straight to creating the model. 

We'll create a model with the highest possible accuracy. In this project, the threshold for accuracy is 0.75. We'll check the accuracy using the test dataset.

### 1.1 Module & Data Import

In [1]:
import pandas as pd 
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
from sklearn.ensemble import RandomForestClassifier 
from sklearn.linear_model import LogisticRegression

df = pd.read_csv('users_behavior.csv')

display(df.head()) 
print(f'rows, columns: {df.shape}')

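|   | calls | minutes | messages | mb_used | is_ultra |
|---|-------|---------|----------|---------|----------|
| 0 | 40.0 | 311.9 | 83.0 | 19915.42 | 0 |
| 1 | 85.0 | 516.75 | 56.0 | 22696.96 | 0 |
| 2 | 77.0 | 467.66 | 86.0 | 21060.45 | 0 |
| 3 | 106.0 | 745.53 | 81.0 | 8437.39 | 1 |
| 4 | 66.0 | 418.74 | 1.0 | 14502.75 | 0 |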


rows, columns: (3214, 5)


### 1.2 Data Description
This dataset contains monthly behavior information regarding users.
- calls: Number of calls.
- minutes: Total call duration in minutes.
- messages: Number of text messages.
- mb_used: Internet traffic used in MB.
- is_ultra: Plan for the current month (Ultra - 1, Smart - 0).

# 2.0 Model Creation

In this section we'll be testing three types of models:  
- Decision Tree Classifier 
- Random Forest Classifier 
- Logistic Regression

We need to establish our features and target. Because we want to predict which plan a user is on, "is_ultra" will be our target. Although "is_ultra" is stored as 0 or 1, the values label two categories rather than measure a quantity, so this is a classification task and we'll use classification models. Every column except "is_ultra" will serve as a feature.

### 2.1 Decision Tree Classifier

First we need to establish the features and target. Afterwards, we'll hold out a test set and a validation set with train_test_split, each with a size of 0.2. The intent was a 3:1:1 split, but as the cell is written both calls split the full data set with the same random_state, so the validation and test sets contain the same 643 rows and the effective split is 80:20. Throughout this project, we'll pass the arbitrary value 12345 to the random_state parameter.

In [2]:
features = df.drop('is_ultra', axis= 1) 
target = df['is_ultra']
# Hold out 20% of the rows as the test set.
train_valid, test = train_test_split(df, test_size= 0.2, random_state= 12345)
# Note: this call splits the full data set again with the same random_state,
# so features_valid and target_valid cover the same rows as test.
features_train, features_valid, target_train, target_valid = train_test_split(features, target, test_size= 0.2, random_state= 12345)

print('feature rows, columns:', features.shape)
print('target rows:', target.shape)
print('test rows, columns:', test.shape)

feature rows, columns: (3214, 4)
target rows: (3214,)
test rows, columns: (643, 5)
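
A 3:1:1 split with disjoint validation and test sets can be sketched as below. This is a minimal illustration, not the cell used for the results in this notebook: the helper name `three_way_split` and the synthetic `demo` frame are assumptions made for the example. The idea is to split off the test set first, then carve the validation set out of the remaining rows only.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split


def three_way_split(df, target_col, random_state=12345):
    """Split df into train/validation/test parts in a 3:1:1 ratio (hypothetical helper)."""
    features = df.drop(target_col, axis=1)
    target = df[target_col]
    # Hold out 20% of all rows as the test set.
    features_rest, features_test, target_rest, target_test = train_test_split(
        features, target, test_size=0.2, random_state=random_state
    )
    # Take 25% of the remaining 80% (20% of the total) as the validation set.
    features_train, features_valid, target_train, target_valid = train_test_split(
        features_rest, target_rest, test_size=0.25, random_state=random_state
    )
    return (features_train, target_train), (features_valid, target_valid), (features_test, target_test)


# Synthetic stand-in for users_behavior.csv (made-up values, same column names).
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    'calls': rng.integers(0, 200, 1000),
    'minutes': rng.uniform(0, 1500, 1000),
    'messages': rng.integers(0, 200, 1000),
    'mb_used': rng.uniform(0, 40000, 1000),
    'is_ultra': rng.integers(0, 2, 1000),
})

train, valid, test_part = three_way_split(demo, 'is_ultra')
print(len(train[0]), len(valid[0]), len(test_part[0]))  # prints 600 200 200
```

Because the second call only sees the rows left after the first, no row can land in both the validation and the test set.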


Now let's tune the model and see which depth yields the best accuracy.

In [3]:
best_dtc_accuracy = 0 
best_dtc_depth = 0 
best_dtc_model = None

for depth in range(1, 10): 
    decision_tree_model = DecisionTreeClassifier(random_state= 12345, max_depth= depth) 
    decision_tree_model.fit(features_train, target_train)
    predictions_valid = decision_tree_model.predict(features_valid) 
    accuracy_valid = accuracy_score(target_valid, predictions_valid)
    print(f'Validation accuracy for depth {depth}: {accuracy_valid}')
    # Keep the model with the highest validation accuracy seen so far.
    if accuracy_valid > best_dtc_accuracy: 
        best_dtc_accuracy = accuracy_valid
        best_dtc_depth = depth
        best_dtc_model = decision_tree_model

test_features = test.drop('is_ultra', axis= 1)
test_target = test['is_ultra']
# Note: decision_tree_model is the last model fitted in the loop (depth 9),
# not best_dtc_model, so the train and test accuracies below describe depth 9.
train_predictions = decision_tree_model.predict(features_train)
test_predictions = decision_tree_model.predict(test_features)

print(f'Accuracy on the train set: {accuracy_score(target_train, train_predictions)}')
print(f'Best Decision Tree Classifier accuracy: {best_dtc_accuracy} with depth {best_dtc_depth}')
print(f'Accuracy on the test set: {accuracy_score(test_target, test_predictions)}')

Validation accuracy for depth 1: 0.7480559875583204
Validation accuracy for depth 2: 0.7807153965785381
Validation accuracy for depth 3: 0.7838258164852255
Validation accuracy for depth 4: 0.7791601866251944
Validation accuracy for depth 5: 0.7853810264385692
Validation accuracy for depth 6: 0.7822706065318819
Validation accuracy for depth 7: 0.7791601866251944
Validation accuracy for depth 8: 0.7729393468118196
Validation accuracy for depth 9: 0.7791601866251944
Accuracy on the train set: 0.8638661999222093
Best Decision Tree Classifier accuracy: 0.7853810264385692 with depth 5
Accuracy on the test set: 0.7791601866251944


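Our sets have an accuracy, rounded to three decimal places, of:  
- Training: 86.387%
- Validation: 78.538%
- Test: 77.916%

The training and test figures come from the last model fitted in the loop (depth 9), because that is the model the cell evaluates after tuning. Usually the test and validation sets have about the same value, and the difference here is 0.622 percentage points. The test figure equals the depth-9 validation accuracy exactly, because the test and validation sets contain the same rows in this split, so it is not an independent check.
In addition to that, a max depth of 5 yields the best validation accuracy.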
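
To report train and test accuracy for the tuned tree rather than the last one fitted, the stored best model can be used for the final predictions. The sketch below shows the pattern on synthetic data from `make_classification`. The data, the split, and the function name `tune_tree_depth` are assumptions for illustration, so its numbers say nothing about the Megaline results.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier


def tune_tree_depth(X_train, y_train, X_valid, y_valid, depths=range(1, 10)):
    """Return the tree, depth, and validation accuracy of the best depth (hypothetical helper)."""
    best_model, best_depth, best_accuracy = None, 0, 0.0
    for depth in depths:
        model = DecisionTreeClassifier(random_state=12345, max_depth=depth)
        model.fit(X_train, y_train)
        accuracy = accuracy_score(y_valid, model.predict(X_valid))
        if accuracy > best_accuracy:
            best_model, best_depth, best_accuracy = model, depth, accuracy
    return best_model, best_depth, best_accuracy


# Synthetic stand-in data with four features, like the notebook's feature set.
X, y = make_classification(
    n_samples=1000, n_features=4, n_informative=3, n_redundant=1, random_state=12345
)
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=12345)
X_train, X_valid, y_train, y_valid = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=12345
)

best_model, best_depth, best_accuracy = tune_tree_depth(X_train, y_train, X_valid, y_valid)
# Evaluate the selected model, not the last one fitted in the loop.
test_accuracy = accuracy_score(y_test, best_model.predict(X_test))
print(f'Best depth: {best_depth}, validation: {best_accuracy:.3f}, test: {test_accuracy:.3f}')
```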

### 2.2 Random Forest Classifier

Random Forest Classifiers usually reach a higher accuracy than a single decision tree because they train a specified number of trees, each on a different bootstrap sample of the training data, and combine the trees' predictions by voting. The downside is speed: building and querying many trees takes longer than using one.
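
How the trees' predictions are combined can be checked directly. In scikit-learn, `RandomForestClassifier` averages the class probabilities predicted by its individual trees and picks the class with the highest average, which is a soft form of majority voting. The sketch below reproduces that by hand on synthetic data (an assumption for illustration; it is not the Megaline data).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic two-class data with four features (illustrative only).
X, y = make_classification(
    n_samples=500, n_features=4, n_informative=3, n_redundant=1, random_state=12345
)
forest = RandomForestClassifier(n_estimators=25, random_state=12345).fit(X, y)

# Average the per-tree class probabilities, then take the most probable class.
tree_probabilities = np.stack([tree.predict_proba(X) for tree in forest.estimators_])
average_probabilities = tree_probabilities.mean(axis=0)
manual_predictions = forest.classes_[np.argmax(average_probabilities, axis=1)]

matches = np.array_equal(manual_predictions, forest.predict(X))
print(f'Manual vote matches forest.predict: {matches}')
```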

In [4]:
best_score = 0 
best_est = 0 
for est in range(1, 100): 
    random_forest_model = RandomForestClassifier(random_state= 12345, n_estimators= est) 
    random_forest_model.fit(features_train, target_train)
    score_valid = random_forest_model.score(features_valid, target_valid)
    if score_valid > best_score: 
        best_score = score_valid
        best_est = est

print("Accuracy of the best model on the validation set (n_estimators = {}): {}".format(best_est, best_score))

final_model = RandomForestClassifier(random_state=12345, n_estimators=best_est) # refit with the best n_estimators found above
final_model.fit(features_train, target_train)
train_predictions = final_model.predict(features_train)
test_predictions = final_model.predict(test_features)
print(f'Accuracy on the train set: {accuracy_score(target_train, train_predictions)}')
print(f'Accuracy on the test set: {accuracy_score(test_target, test_predictions)}')

Accuracy of the best model on the validation set (n_estimators = 20): 0.7884914463452566
Accuracy on the train set: 0.9906651108518086
Accuracy on the test set: 0.7884914463452566


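Our sets have an accuracy, rounded to three decimal places, of:  
- Training: 99.067%
- Validation: 78.849%
- Test: 78.849%

The test accuracy is identical to the validation accuracy because both sets contain the same rows in this split. In addition to that, the best value in the searched range of 1 to 99 is n_estimators= 20. 

A Random Forest typically yields a higher validation accuracy than a Decision Tree, and it does so here by a narrow margin (78.849% versus 78.538%). The training accuracy of 99.067% is far above the validation accuracy, which suggests that the forest overfits the training set.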

### 2.3 Logistic Regression

In [5]:
logistic_model = LogisticRegression(random_state= 12345, solver= 'liblinear')
logistic_model.fit(features_train, target_train)  # train model on training set
score_train = logistic_model.score(features_train, target_train) # calculate accuracy score on training set  
score_valid = logistic_model.score(features_valid, target_valid) # calculate accuracy score on validation set  

print(
    "Accuracy of the logistic regression model on the training set:",
    score_train,
)
print(
    "Accuracy of the logistic regression model on the validation set:",
    score_valid,
)

Accuracy of the logistic regression model on the training set: 0.7016725009723843
Accuracy of the logistic regression model on the validation set: 0.702954898911353


Our sets have an accuracy, rounded to three decimal places, of:  
- Training: 70.167%
- Validation: 70.295%

The logistic regression model has the lowest accuracy of all tested model types.

# 3.0 Conclusion

Through this project, we have tested three models and they've yielded the following validation accuracies:
 - Decision Tree Classifier: 78.538%. A max depth of 5 yielded the best results.
 - Random Forest Classifier: 78.849%. n_estimators= 20 yielded the best results. 
 - Logistic Regression: 70.295% 

Validation accuracy reflects how often the model predicts the correct plan for users it was not trained on. Because there are two possible plans, our sanity-check threshold is 50%, the accuracy expected from guessing at random. All three models exceed it, but the Logistic Regression model falls short of the 75% threshold set at the beginning of this project. The Random Forest Classifier therefore appears to be the best model for recommending plans to Megaline customers, although its margin over the Decision Tree is small. One downside that needs to be reiterated is that, because it evaluates multiple trees, it is the slowest of the three.
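
A stricter sanity check than the 50% coin flip is a model that always predicts the most frequent class; its accuracy equals the share of the majority class, which is above 50% whenever the classes are imbalanced. The sketch below shows this with scikit-learn's `DummyClassifier`. The labels are synthetic with an assumed share of roughly 30% positives, so the printed figure is not the Megaline baseline.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Hypothetical labels: about 30% of users on the positive class (assumed for illustration).
rng = np.random.default_rng(12345)
y = (rng.random(1000) < 0.3).astype(int)
X = np.zeros((len(y), 1))  # DummyClassifier ignores the feature values

baseline = DummyClassifier(strategy='most_frequent').fit(X, y)
baseline_accuracy = accuracy_score(y, baseline.predict(X))

# The baseline accuracy is simply the share of the majority class.
majority_share = max(np.mean(y), 1 - np.mean(y))
print(f'Baseline accuracy: {baseline_accuracy:.3f}, majority share: {majority_share:.3f}')
```

On the real data, fitting the same baseline on the training target and scoring it on the validation target would give the figure to compare the three models against.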