### Author: Allan Jeeboo 
### Preferred Name: Vyncent S. A. van der Wolvenhuizen 
### Affiliation: Student at TripleTen 
### Email: vanderwolvenhuizen.vyncent@proton.me
### Date Started: 2025-03-12 
### Last Updated: 2025-03-15 16:44

# 1.0 Introduction
In this project, we develop a model for the telecommunications company Megaline that analyzes subscriber behavior and recommends one of its newer plans: Smart or Ultra.

We have access to behavior data about subscribers who have already switched to the new plans (from the project for the Statistical Data Analysis sprint). For this classification task, we need to develop a model that will pick the right plan. Since we’ve already performed the data preprocessing step, we can move straight to creating the model. 

We aim for the highest possible accuracy. The minimum acceptable accuracy for this project is 0.75, measured on the test dataset.
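To make the accuracy target concrete, here is a minimal sketch of how accuracy is computed: the share of predictions that match the true labels. The labels below are hypothetical and exist only for illustration; the helper `accuracy` and the constant `ACCURACY_THRESHOLD` are names introduced here, not part of the project code.

```python
def accuracy(y_true, y_pred):
    """Return the fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Hypothetical labels for illustration only (1 = Ultra, 0 = Smart)
y_true = [0, 0, 1, 0, 1, 0, 0, 1]
y_pred = [0, 0, 1, 0, 0, 0, 1, 1]

ACCURACY_THRESHOLD = 0.75  # threshold stated for this project

acc = accuracy(y_true, y_pred)  # 6 of 8 predictions match
print(f"accuracy = {acc:.2f}, meets threshold: {acc >= ACCURACY_THRESHOLD}")
```

This is the same quantity that `sklearn.metrics.accuracy_score` reports for the models below.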

### 1.1 Module & Data Import

In [38]:
import pandas as pd 
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
from sklearn.ensemble import RandomForestClassifier 
from sklearn.linear_model import LogisticRegression

df = pd.read_csv('users_behavior.csv')

display(df.head()) 
print(f'rows, columns: {df.shape}')

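|   | calls | minutes | messages | mb_used | is_ultra |
|---|-------|---------|----------|---------|----------|
| 0 | 40.0 | 311.9 | 83.0 | 19915.42 | 0 |
| 1 | 85.0 | 516.75 | 56.0 | 22696.96 | 0 |
| 2 | 77.0 | 467.66 | 86.0 | 21060.45 | 0 |
| 3 | 106.0 | 745.53 | 81.0 | 8437.39 | 1 |
| 4 | 66.0 | 418.74 | 1.0 | 14502.75 | 0 |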


rows, columns: (3214, 5)


### 1.2 Data Description
This dataset contains monthly behavior information regarding users.
- calls: Number of calls.
- minutes: Total call duration in minutes.
- messages: Number of text messages.
- mb_used: Internet traffic used in MB.
- is_ultra: Plan for the current month (Ultra - 1, Smart - 0).
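Before training anything, it can help to know how accurate a trivial model would be if it always predicted the most common plan. The sketch below computes that majority-class baseline. It uses only the five rows shown in the preview above, so the resulting number is illustrative and should not be read as the class balance of the full dataset; the names `sample`, `class_share`, and `baseline_accuracy` are introduced here for the example.

```python
import pandas as pd

# The five preview rows from df.head(); NOT the full dataset
sample = pd.DataFrame({
    "calls": [40.0, 85.0, 77.0, 106.0, 66.0],
    "minutes": [311.9, 516.75, 467.66, 745.53, 418.74],
    "messages": [83.0, 56.0, 86.0, 81.0, 1.0],
    "mb_used": [19915.42, 22696.96, 21060.45, 8437.39, 14502.75],
    "is_ultra": [0, 0, 0, 1, 0],
})

# Share of each plan in the target column
class_share = sample["is_ultra"].value_counts(normalize=True)

# Accuracy of always predicting the most frequent plan
majority_class = class_share.idxmax()
baseline_accuracy = class_share.max()

print(f"majority class: {majority_class}, baseline accuracy: {baseline_accuracy:.2f}")
```

Running the same three lines on the full `df` would show how far a trained model needs to be above this baseline for its accuracy to be meaningful.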

# 2.0 Model Creation

### 2.1 Decision Tree Classifier

In [41]:
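features = df.drop('is_ultra', axis=1)
target = df['is_ultra']
# Hold out 20% of the rows as the test set
train_valid, test = train_test_split(df, test_size=0.2, random_state=12345)
# Caution: the split below draws from the full df rather than from train_valid,
# so rows in `test` can also appear in features_train or features_valid
features_train, features_valid, target_train, target_valid = train_test_split(
    features, target, test_size=0.25, random_state=12345)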
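In the cell above, `features` and `target` come from the full `df`, so rows in `test` may also end up in the training set, which would inflate test accuracy. One way to avoid that is to make the second split on `train_valid` only. The following is a minimal sketch under that assumption, using a synthetic stand-in frame (`demo`) with random values so that it runs without `users_behavior.csv`; it is not the split that produced the results in this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for df (assumption: same column names, random values)
rng = np.random.default_rng(12345)
n_rows = 1000
demo = pd.DataFrame({
    "calls": rng.integers(0, 200, n_rows),
    "minutes": rng.uniform(0, 1500, n_rows),
    "messages": rng.integers(0, 200, n_rows),
    "mb_used": rng.uniform(0, 40000, n_rows),
    "is_ultra": rng.integers(0, 2, n_rows),
})

# Step 1: hold out 20% of all rows as the test set
train_valid, test = train_test_split(demo, test_size=0.2, random_state=12345)

# Step 2: split the remaining 80% into train (60% of total) and validation (20% of total)
train, valid = train_test_split(train_valid, test_size=0.25, random_state=12345)

# Separate features and target for each subset
features_train, target_train = train.drop("is_ultra", axis=1), train["is_ultra"]
features_valid, target_valid = valid.drop("is_ultra", axis=1), valid["is_ultra"]
features_test, target_test = test.drop("is_ultra", axis=1), test["is_ultra"]

# The three subsets should not share any rows
shared_rows = (set(train.index) & set(test.index)) | (set(valid.index) & set(test.index))
print(len(train), len(valid), len(test), "shared rows:", len(shared_rows))
```

With this layout, the test set is touched only once, after model selection on the validation set is finished.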

In [56]:
for depth in range(1, 10):
    decision_tree_model = DecisionTreeClassifier(random_state=12345, max_depth=depth)
    decision_tree_model.fit(features_train, target_train)
    predictions_valid = decision_tree_model.predict(features_valid)
    # accuracy_score and model.score report the same validation accuracy
    print(f'Accuracy for depth {depth}: {accuracy_score(target_valid, predictions_valid)}')
    print(f'decision_tree_model.score {depth}: {decision_tree_model.score(features_valid, target_valid)}')

Accuracy for depth 1: 0.75
decision_tree_model.score 1: 0.75
Accuracy for depth 2: 0.7835820895522388
decision_tree_model.score 2: 0.7835820895522388
Accuracy for depth 3: 0.7885572139303483
decision_tree_model.score 3: 0.7885572139303483
Accuracy for depth 4: 0.7810945273631841
decision_tree_model.score 4: 0.7810945273631841
Accuracy for depth 5: 0.7810945273631841
decision_tree_model.score 5: 0.7810945273631841
Accuracy for depth 6: 0.7661691542288557
decision_tree_model.score 6: 0.7661691542288557
Accuracy for depth 7: 0.7898009950248757
decision_tree_model.score 7: 0.7898009950248757
Accuracy for depth 8: 0.7885572139303483
decision_tree_model.score 8: 0.7885572139303483
Accuracy for depth 9: 0.7885572139303483
decision_tree_model.score 9: 0.7885572139303483


In [None]:
test_features = test.drop('is_ultra', axis=1)
test_target = test['is_ultra']
# decision_tree_model is the last model fitted in the loop above (max_depth=9)
train_predictions = decision_tree_model.predict(features_train)
test_predictions = decision_tree_model.predict(test_features)

print(f'Accuracy on the train set: {accuracy_score(target_train, train_predictions)}')
print(f'Accuracy on the test set: {accuracy_score(test_target, test_predictions)}')
# TODO: add validation accuracy

Accuracy on the train set: 0.8688796680497926
Accuracy on the test set: 0.7884914463452566


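On the validation set, a max depth of 7 yields the highest accuracy (about 0.790), narrowly ahead of depths 3, 8, and 9 (about 0.789). Note that the train and test accuracies above come from the last model fitted in the loop (max depth 9), not from the depth-7 model.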
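To evaluate the depth that actually won on the validation set, the loop can keep a reference to the best model as it goes. The sketch below shows one way to do this. It runs on synthetic data from `make_classification` (an assumption made so the example is self-contained), so its scores say nothing about the Megaline data; `best_depth`, `best_score`, and `best_model` are names introduced here.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption): 4 numeric features, binary target
X, y = make_classification(n_samples=1000, n_features=4, n_informative=3,
                           n_redundant=0, random_state=12345)

# 60% train, 20% validation, 20% test, with no shared rows
X_train_valid, X_test, y_train_valid, y_test = train_test_split(
    X, y, test_size=0.2, random_state=12345)
X_train, X_valid, y_train, y_valid = train_test_split(
    X_train_valid, y_train_valid, test_size=0.25, random_state=12345)

best_depth, best_score, best_model = None, 0.0, None
for depth in range(1, 10):
    model = DecisionTreeClassifier(random_state=12345, max_depth=depth)
    model.fit(X_train, y_train)
    score = model.score(X_valid, y_valid)
    # Keep the model only if it beats the best validation accuracy so far
    if score > best_score:
        best_depth, best_score, best_model = depth, score, model

# Evaluate the selected model once on the held-out test set
test_accuracy = best_model.score(X_test, y_test)
print(f"best depth: {best_depth}, validation: {best_score:.3f}, test: {test_accuracy:.3f}")
```

Because the comparison uses a strict `>`, ties are resolved in favor of the shallower tree, which is usually the simpler choice.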

### 2.2 Random Forest Classifier

In [52]:
best_score = 0 
best_est = 0 
for est in range(1, 31): 
    random_forest_model = RandomForestClassifier(random_state= 12345, n_estimators= est) 
    random_forest_model.fit(features_train, target_train)
    score_valid = random_forest_model.score(features_valid, target_valid)
    if score_valid > best_score: 
        best_score = score_valid
        best_est = est

print(f'Accuracy of the best model on the validation set (n_estimators = {best_est}): {best_score}')

# Refit with the best n_estimators, using the same random_state as the search
final_model = RandomForestClassifier(random_state=12345, n_estimators=best_est)
final_model.fit(features_train, target_train)

Accuracy of the best model on the validation set (n_estimators = 28): 0.7922885572139303


In [53]:
test_features = test.drop('is_ultra', axis=1)
test_target = test['is_ultra']
# Evaluate the random forest (final_model), not the decision tree from section 2.1
train_predictions = final_model.predict(features_train)
test_predictions = final_model.predict(test_features)

print(f'Accuracy on the train set: {accuracy_score(target_train, train_predictions)}')
print(f'Accuracy on the test set: {accuracy_score(test_target, test_predictions)}')


### 2.3 Logistic Regression

In [48]:
logistic_model = LogisticRegression(random_state=12345, solver='liblinear')
logistic_model.fit(features_train, target_train)  # train the model on the training set
score_train = logistic_model.score(features_train, target_train)  # accuracy on the training set
score_valid = logistic_model.score(features_valid, target_valid)  # accuracy on the validation set

print(
    "Accuracy of the logistic regression model on the training set:",
    score_train,
)
print(
    "Accuracy of the logistic regression model on the validation set:",
    score_valid,
)

Accuracy of the logistic regression model on the training set: 0.74149377593361
Accuracy of the logistic regression model on the validation set: 0.753731343283582
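The three model families above can also be compared in one pass: fit each on the training set, score each on the validation set, and run only the winner on the test set. The sketch below illustrates that workflow. The hyperparameters `max_depth=7` and `n_estimators=28` are taken from the validation results above; the data is synthetic (an assumption made so the example is self-contained), so the printed scores do not reflect the Megaline data. The names `models`, `valid_scores`, and `best_name` are introduced here.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption): 4 numeric features, binary target
X, y = make_classification(n_samples=1000, n_features=4, n_informative=3,
                           n_redundant=0, random_state=12345)
X_train_valid, X_test, y_train_valid, y_test = train_test_split(
    X, y, test_size=0.2, random_state=12345)
X_train, X_valid, y_train, y_valid = train_test_split(
    X_train_valid, y_train_valid, test_size=0.25, random_state=12345)

# One candidate per model family
models = {
    "decision_tree": DecisionTreeClassifier(random_state=12345, max_depth=7),
    "random_forest": RandomForestClassifier(random_state=12345, n_estimators=28),
    "logistic_regression": LogisticRegression(random_state=12345, solver="liblinear"),
}

# Fit each model on the training set and score it on the validation set
valid_scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    valid_scores[name] = model.score(X_valid, y_valid)
    print(f"{name}: validation accuracy = {valid_scores[name]:.3f}")

# Select by validation accuracy, then touch the test set exactly once
best_name = max(valid_scores, key=valid_scores.get)
test_accuracy = models[best_name].score(X_test, y_test)
print(f"selected: {best_name}, test accuracy = {test_accuracy:.3f}")
```

Selecting on the validation set and reporting a single test score keeps the test accuracy an honest estimate of performance on unseen subscribers.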
