# Developing a Prepaid Package Classification Model for Megaline Customers

<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#1-Project-Description" data-toc-modified-id="1-Project-Description-1">1 Project Description</a></span><ul class="toc-item"><li><span><a href="#1.1-Objective" data-toc-modified-id="1.1-Objective-1.1">1.1 Objective</a></span></li><li><span><a href="#1.2-Stages-of-Project-Completion" data-toc-modified-id="1.2-Stages-of-Project-Completion-1.2">1.2 Stages of Project Completion</a></span></li></ul></li><li><span><a href="#2-Data-Description" data-toc-modified-id="2-Data-Description-2">2 Data Description</a></span></li><li><span><a href="#3-Creating-a-Classification-Model" data-toc-modified-id="3-Creating-a-Classification-Model-3">3 Creating a Classification Model</a></span><ul class="toc-item"><li><span><a href="#3.1-Import-All-Libraries" data-toc-modified-id="3.1-Import-All-Libraries-3.1">3.1 Import All Libraries</a></span></li><li><span><a href="#3.2-Open-and-Read-Data" data-toc-modified-id="3.2-Open-and-Read-Data-3.2">3.2 Open and Read Data</a></span></li><li><span><a href="#3.3-Split-Dataset" data-toc-modified-id="3.3-Split-Dataset-3.3">3.3 Split Dataset</a></span></li><li><span><a href="#3.4-Create-Models" data-toc-modified-id="3.4-Create-Models-3.4">3.4 Create Models</a></span><ul class="toc-item"><li><span><a href="#3.4.1-Decision-Tree" data-toc-modified-id="3.4.1-Decision-Tree-3.4.1">3.4.1 Decision Tree</a></span></li><li><span><a href="#3.4.2-Random-Forest" data-toc-modified-id="3.4.2-Random-Forest-3.4.2">3.4.2 Random Forest</a></span></li><li><span><a href="#3.4.3-Logistic-Regression" data-toc-modified-id="3.4.3-Logistic-Regression-3.4.3">3.4.3 Logistic Regression</a></span></li></ul></li><li><span><a href="#3.5-Testing-the-Model" data-toc-modified-id="3.5-Testing-the-Model-3.5">3.5 Testing the Model</a></span></li><li><span><a href="#3.6-Sanity-Check" data-toc-modified-id="3.6-Sanity-Check-3.6">3.6 Sanity Check</a></span></li></ul></li><li><span><a href="#4-Conclusion" 
data-toc-modified-id="4-Conclusion-4">4 Conclusion</a></span></li></ul></div>

## 1 Project Description

Megaline, a mobile operator, is unsatisfied with the fact that many of its customers are still using outdated packages. The company aims to develop a model that can analyze consumer behavior and recommend either the Smart or Ultra packages as a solution.

### 1.1 Objective

The objectives of this project include:
- Analyzing the behavior patterns of Megaline customers who have switched to the latest package.
- Developing machine learning models to study the behavior of these users.
- Utilizing the models to provide suitable package recommendations for customers who have not yet adopted the latest package.

### 1.2 Stages of Project Completion

Having previously conducted statistical data analysis on Megaline customer data in the Statistical Data Analysis course project, we will now proceed directly to the modeling stage, assuming that the data preprocessing step has been completed.

The objective of this classification task is to develop a model that can accurately recommend the appropriate package for Megaline customers who have not yet switched to the latest package.

Our goal is to create a model with the highest possible accuracy, with a minimum threshold of 0.75 for accuracy in this project.

## 2 Data Description

To build this model, we will utilize the preprocessed data available in the `users_behavior.csv` file.

Each observation in the dataset provides monthly behavioral information for a single user, including the following variables:

- `calls`: the number of calls made by the user
- `minutes`: the total duration of calls in minutes
- `messages`: the number of text messages sent by the user
- `mb_used`: the amount of internet usage traffic in megabytes (MB)
- `is_ultra`: the package subscribed by the user for the current month (Ultra - 1, Smart - 0)

## 3 Creating a Classification Model

### 3.1 Import All Libraries

In [1]:
# Import All Libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split 
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier 
from sklearn.linear_model import LogisticRegression 

from sklearn.metrics import accuracy_score
import warnings
warnings.filterwarnings("ignore")

### 3.2 Open and Read Data

In [2]:
# Open the data
df = pd.read_csv('users_behavior.csv')
df.head()

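|   | calls | minutes | messages | mb_used | is_ultra |
|---|-------|---------|----------|---------|----------|
| 0 | 40.0 | 311.9 | 83.0 | 19915.42 | 0 |
| 1 | 85.0 | 516.75 | 56.0 | 22696.96 | 0 |
| 2 | 77.0 | 467.66 | 86.0 | 21060.45 | 0 |
| 3 | 106.0 | 745.53 | 81.0 | 8437.39 | 1 |
| 4 | 66.0 | 418.74 | 1.0 | 14502.75 | 0 |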


In [3]:
# Checking dataset shape
df.shape

(3214, 5)

In [4]:
# Checking the data description
df.describe()

|       | calls | minutes | messages | mb_used | is_ultra |
|-------|-------|---------|----------|---------|----------|
| count | 3214.0 | 3214.0 | 3214.0 | 3214.0 | 3214.0 |
| mean | 63.038892 | 438.208787 | 38.281269 | 17207.673836 | 0.306472 |
| std | 33.236368 | 234.569872 | 36.148326 | 7570.968246 | 0.4611 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 40.0 | 274.575 | 9.0 | 12491.9025 | 0.0 |
| 50% | 62.0 | 430.6 | 30.0 | 16943.235 | 0.0 |
| 75% | 82.0 | 571.9275 | 57.0 | 21424.7 | 1.0 |
| max | 244.0 | 1632.06 | 224.0 | 49745.73 | 1.0 |


In [5]:
# Checking the general information of the data
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3214 entries, 0 to 3213
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   calls     3214 non-null   float64
 1   minutes   3214 non-null   float64
 2   messages  3214 non-null   float64
 3   mb_used   3214 non-null   float64
 4   is_ultra  3214 non-null   int64  
dtypes: float64(4), int64(1)
memory usage: 125.7 KB


**Findings:**
- The dataset is clean and does not contain any missing values.
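- There are 5 columns (4 features and the `is_ultra` target) and 3,214 rows (observations) in the dataset.
- For `calls`, `minutes`, and `mb_used`, the mean and median are within about 2% of each other, which suggests roughly symmetric distributions. `messages` is right-skewed (mean 38.3 vs. median 30.0), and the maximum of every feature lies well above its 75th percentile, so some heavy users are present.
- The mean of `is_ultra` is 0.306, so roughly 31% of the observations belong to the Ultra package and 69% to Smart.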
- Overall, the dataset provides sufficient information for the modeling process.
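
The mean-versus-median comparison can be made explicit with a small calculation. The sketch below copies the `mean` and `50%` rows from the `df.describe()` output above and computes the relative gap between them; the `summary` frame and the `relative_gap_pct` column are names introduced here for illustration only.

```python
import pandas as pd

# Mean and median (50%) values copied from the df.describe() output above
summary = pd.DataFrame(
    {
        "mean": [63.038892, 438.208787, 38.281269, 17207.673836],
        "median": [62.0, 430.6, 30.0, 16943.235],
    },
    index=["calls", "minutes", "messages", "mb_used"],
)

# Relative gap between mean and median, in percent of the median
summary["relative_gap_pct"] = (
    (summary["mean"] - summary["median"]) / summary["median"] * 100
)

print(summary.round(2))
```

A gap of a few percent suggests a roughly symmetric distribution, while a large positive gap points to a right-skewed feature. On the full dataset, `df.skew()` would give a complementary view.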

### 3.3 Split Dataset

We have a single dataset, so we need to divide it into three groups: the `training set`, `validation set`, and `test set`.

We will divide the data in a 3:1:1 ratio, a common convention: 60% of the data goes to the `training set`, 20% to the `validation set`, and the remaining 20% to the `test set`.

A single call to `train_test_split()` from `scikit-learn` splits the data into only two parts, so we will instead shuffle the rows with `df.sample(frac=1)` and use `numpy.split` to cut the shuffled data into all three sets at once.

In [6]:
# Split the dataset
train, validate, test = \
              np.split(df.sample(frac=1, random_state=42), 
                       [int(.6*len(df)), int(.8*len(df))])

In [7]:
# Examine the shape of the dataset
print(train.shape)
print(validate.shape)
print(test.shape)

(1928, 5)
(643, 5)
(643, 5)
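
The same 60/20/20 split can also be obtained by calling `train_test_split()` twice, which additionally allows stratifying on the target so that each subset keeps the same Smart/Ultra ratio. This is an alternative sketch, not the method used in this notebook; it runs on a synthetic stand-in frame (`demo_df`, with made-up values) so that it is self-contained, and the names `train_part`, `valid_part`, and `test_part` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for users_behavior.csv (made-up values, same column names)
rng = np.random.default_rng(42)
n_rows = 1000
demo_df = pd.DataFrame({
    "calls": rng.integers(0, 200, n_rows).astype(float),
    "minutes": rng.uniform(0, 1500, n_rows),
    "messages": rng.integers(0, 200, n_rows).astype(float),
    "mb_used": rng.uniform(0, 45000, n_rows),
    "is_ultra": rng.binomial(1, 0.3, n_rows),
})

# Step 1: hold out 20% as the test set, preserving the class ratio
train_valid, test_part = train_test_split(
    demo_df, test_size=0.2, random_state=42, stratify=demo_df["is_ultra"]
)

# Step 2: take 25% of the remaining 80% (= 20% of the total) for validation
train_part, valid_part = train_test_split(
    train_valid, test_size=0.25, random_state=42,
    stratify=train_valid["is_ultra"],
)

print(train_part.shape, valid_part.shape, test_part.shape)
print(train_part["is_ultra"].mean(), test_part["is_ultra"].mean())
```

Stratification matters mostly for small or imbalanced datasets, where a purely random cut can leave the validation or test set with a noticeably different class mix.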


All three datasets have been prepared, and now it is time to separate the data into features and targets. Our target variable is the `is_ultra` column, while the remaining columns will serve as the features.

In [8]:
# Split the 'train' dataset
train_features = train.drop(['is_ultra'], axis=1)
train_target = train['is_ultra']

In [9]:
# Split the 'validate' dataset
validate_features = validate.drop(['is_ultra'], axis=1)
validate_target = validate['is_ultra']

In [10]:
# Split the 'test' dataset
test_features = test.drop(['is_ultra'], axis=1)
test_target = test['is_ultra']

### 3.4 Create Models

We will create three models: `decision tree`, `random forest`, and `logistic regression`.

The training process will involve using the `train_features` and `train_target` datasets. We will then validate the models using the `validate_features` and `validate_target` datasets. Through experimentation, we will explore the models both with and without hyperparameter tuning to identify the one with the highest accuracy.

Following the analysis, we will conduct a sanity check to assess the validity and reasonability of the chosen model.

#### 3.4.1 Decision Tree

The decision tree model has the hyperparameter `max_depth`, which caps the depth of the tree, i.e. the maximum number of successive splits between the root and a leaf. We will test depths from 1 to 10 and look for the one with the highest validation accuracy. We will also set the hyperparameter `random_state` to 12345 for reproducible results.

To assess the presence of overfitting or underfitting, we will create two models: one without hyperparameter tuning and one with hyperparameter tuning. This comparison will allow us to evaluate the impact of hyperparameter optimization on model performance.

In [11]:
# No hyperparameter tuning
dt_model = DecisionTreeClassifier()
dt_model.fit(train_features, train_target)

dt_predict_train = dt_model.predict(train_features)
dt_predict_valid = dt_model.predict(validate_features)

print("Model accuracy score using train data is:", accuracy_score(train_target, dt_predict_train) * 100)
print("Model accuracy score using validation data is:", accuracy_score(validate_target, dt_predict_valid) * 100)

Model accuracy score using train data is: 100.0
Model accuracy score using validation data is: 72.78382581648522


The training accuracy (100%) is far above the validation accuracy (72.78%), which indicates that the unconstrained tree overfits the training data. We will therefore tune its hyperparameters.

In [12]:
# With hyperparameter tuning
for depth in range(1, 11):
    dtree_model = DecisionTreeClassifier(random_state=12345, max_depth=depth)
    dtree_model.fit(train_features, train_target)
    
    predict_train = dtree_model.predict(train_features) 
    predict_valid = dtree_model.predict(validate_features) 
    
    acc_train = accuracy_score(train_target, predict_train) * 100
    acc_valid = accuracy_score(validate_target, predict_valid) * 100
        
    print('max_depth =', depth, ': ', end='')
    print(f'Train data accuracy is {acc_train} and validation data accuracy is {acc_valid}')

max_depth = 1 : Train data accuracy is 75.4149377593361 and validation data accuracy is 72.31726283048211
max_depth = 2 : Train data accuracy is 78.7344398340249 and validation data accuracy is 76.98289269051321
max_depth = 3 : Train data accuracy is 79.92738589211619 and validation data accuracy is 78.53810264385692
max_depth = 4 : Train data accuracy is 79.9792531120332 and validation data accuracy is 78.53810264385692
max_depth = 5 : Train data accuracy is 81.58713692946058 and validation data accuracy is 78.53810264385692
max_depth = 6 : Train data accuracy is 82.88381742738589 and validation data accuracy is 78.69362363919129
max_depth = 7 : Train data accuracy is 83.97302904564316 and validation data accuracy is 78.0715396578538
max_depth = 8 : Train data accuracy is 85.47717842323651 and validation data accuracy is 78.53810264385692
max_depth = 9 : Train data accuracy is 86.35892116182573 and validation data accuracy is 79.00466562986003
max_depth = 10 : Train data accuracy is 8

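The highest validation accuracy, 79.16%, was obtained with a `max_depth` of 10, where the training accuracy was 87.65%. Note that the validation accuracy improves only slightly beyond `max_depth` = 3 (78.54%), while the gap between training and validation accuracy keeps widening as the tree gets deeper.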
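
Rather than reading the best depth off the printed log, the loop can record its results and select the depth programmatically. The following is a minimal sketch on synthetic data generated with `make_classification` (a stand-in, not the Megaline data); the `results` list and the `best_*` names are introduced here for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data: 4 features, roughly 70/30 class balance
X, y = make_classification(
    n_samples=1000, n_features=4, n_informative=3, n_redundant=1,
    weights=[0.7, 0.3], random_state=0,
)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

results = []  # one (depth, train accuracy, validation accuracy) tuple per depth
for depth in range(1, 11):
    tree = DecisionTreeClassifier(random_state=12345, max_depth=depth)
    tree.fit(X_train, y_train)
    results.append((
        depth,
        accuracy_score(y_train, tree.predict(X_train)),
        accuracy_score(y_valid, tree.predict(X_valid)),
    ))

# Select by validation accuracy; max() keeps the first maximum it meets,
# so the shallowest depth wins a tie
best_depth, best_train_acc, best_valid_acc = max(results, key=lambda r: r[2])

print(f"best max_depth={best_depth}: train={best_train_acc:.3f}, "
      f"valid={best_valid_acc:.3f}, gap={best_train_acc - best_valid_acc:.3f}")
```

Printing the train-validation gap next to the chosen depth makes it easier to judge whether a small gain in validation accuracy is worth a much deeper, more overfitted tree.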

#### 3.4.2 Random Forest

In the random forest model, in addition to the depth of each tree, we can set the number of trees using `n_estimators`. We will test 10 to 100 trees in steps of 10, with the `random_state` parameter set to 54321.

We will create models without and with hyperparameter tuning to determine the best number of trees for this model.

In [13]:
# No hyperparameter tuning
rforest = RandomForestClassifier() 
rforest.fit(train_features, train_target) 

rf_predict_train = rforest.predict(train_features)
rf_predict_valid = rforest.predict(validate_features)

print("Model accuracy score using train data is:", accuracy_score(train_target, rf_predict_train) * 100)
print("Model accuracy score using validation data is:", accuracy_score(validate_target, rf_predict_valid) * 100)

Model accuracy score using train data is: 100.0
Model accuracy score using validation data is: 80.87091757387248


We have observed a significant accuracy gap in the random forest model, indicating overfitting. To address this issue, we will apply hyperparameter tuning. Based on previous models, we found that setting `max_depth` to 10 yielded the best accuracy. Therefore, in this model, we will focus on tuning the `n_estimators` parameter to find the optimal number of trees.

In [14]:
# With hyperparameter tuning

for est in range(10, 101, 10):
    rf_model = RandomForestClassifier(random_state=54321, max_depth=10, n_estimators=est)
    rf_model.fit(train_features, train_target)

    rf_predict_train_tuning = rf_model.predict(train_features)
    rf_predict_valid_tuning = rf_model.predict(validate_features)

    acc_train = accuracy_score(train_target, rf_predict_train_tuning) * 100
    acc_valid = accuracy_score(validate_target, rf_predict_valid_tuning) * 100

    # Format with f-strings instead of .round(), which plain Python floats do not have
    print('n_estimators =', est, ': ', end='')
    print(f'Train data accuracy is {acc_train:.2f} and validation data accuracy is {acc_valid:.2f}')

n_estimators = 10 : Train data accuracy is 88.95 and validation data accuracy is 80.25
n_estimators = 20 : Train data accuracy is 88.95 and validation data accuracy is 80.56
n_estimators = 30 : Train data accuracy is 89.16 and validation data accuracy is 80.72
n_estimators = 40 : Train data accuracy is 89.26 and validation data accuracy is 80.09
n_estimators = 50 : Train data accuracy is 89.21 and validation data accuracy is 79.78
n_estimators = 60 : Train data accuracy is 89.47 and validation data accuracy is 80.09
n_estimators = 70 : Train data accuracy is 89.26 and validation data accuracy is 80.25
n_estimators = 80 : Train data accuracy is 89.26 and validation data accuracy is 80.56
n_estimators = 90 : Train data accuracy is 89.21 and validation data accuracy is 80.25
n_estimators = 100 : Train data accuracy is 89.16 and validation data accuracy is 80.25


In [15]:
# Best model
best_score = 0
best_est = 0
for est in range(10, 101, 10): 
    rf_model = RandomForestClassifier(random_state=54321, max_depth=10, n_estimators=est) 
    rf_model.fit(train_features, train_target) 
    score = rf_model.score(validate_features, validate_target) 
    if score > best_score:
        best_score = score
        best_est = est

print("The best accuracy model is at n_estimators {} = {}".format(best_est, best_score))

The best accuracy model is at n_estimators 30 = 0.807153965785381


Our random forest model yielded the highest accuracy with 30 trees, achieving an accuracy rate of 89.16% for the train data and 80.72% for the validation data.
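
The search above fixes `max_depth` at 10 (the best value for a single tree) and varies only `n_estimators`. Since the best depth for a forest need not match the best depth for a single tree, one possible extension is to search both parameters together on the validation set. The sketch below does this on synthetic stand-in data; the grid values and the `best_params` name are assumptions made for illustration, not settings taken from this project.

```python
from itertools import product

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 4 features, roughly 70/30 class balance
X, y = make_classification(
    n_samples=1000, n_features=4, n_informative=3, n_redundant=1,
    weights=[0.7, 0.3], random_state=0,
)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

best_params, best_valid_acc = None, 0.0
# Small illustrative grid: every combination of depth and number of trees
for depth, n_trees in product((4, 6, 8, 10), (10, 30, 50)):
    forest = RandomForestClassifier(
        random_state=54321, max_depth=depth, n_estimators=n_trees
    )
    forest.fit(X_train, y_train)
    valid_acc = forest.score(X_valid, y_valid)
    if valid_acc > best_valid_acc:
        best_params, best_valid_acc = (depth, n_trees), valid_acc

print(f"best (max_depth, n_estimators) = {best_params}, "
      f"validation accuracy = {best_valid_acc:.3f}")
```

Because every candidate is scored on the same validation set, the winning score is slightly optimistic; the final check on the untouched test set remains necessary.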

#### 3.4.3 Logistic Regression

We now move to the third model, logistic regression. Unlike the previous two models, it has no tree depth or number-of-trees hyperparameters. We only set the `solver` parameter, using `liblinear`, and set the `random_state` parameter to 54321.

In [16]:
# Create a logistic regression model
lr_model = LogisticRegression(random_state=54321, solver='liblinear') 
lr_model.fit(train_features, train_target) 
score_train = lr_model.score(train_features, train_target) 
score_valid = lr_model.score(validate_features, validate_target) 

print("Accuracy of logistic regression model based on training set:", score_train)
print("Accuracy of logistic regression model based on validation set:", score_valid)

Accuracy of logistic regression model based on training set: 0.7448132780082988
Accuracy of logistic regression model based on validation set: 0.7309486780715396
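
The features span very different ranges (`mb_used` reaches tens of thousands while `messages` stays in the low hundreds), and regularized logistic regression can be sensitive to this. As a hedged sketch that is not part of the original analysis, the example below compares an unscaled model with a `StandardScaler` pipeline on synthetic data; the labelling rule and all values are made up for illustration, so the scores say nothing about the Megaline data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in features on very different scales (made-up values)
rng = np.random.default_rng(0)
n_rows = 1000
minutes = rng.uniform(0, 1500, n_rows)
messages = rng.uniform(0, 200, n_rows)
mb_used = rng.uniform(0, 45000, n_rows)
X = np.column_stack([minutes, messages, mb_used])

# Hypothetical labelling rule: heavy overall usage is labelled 1 (Ultra)
y = ((minutes / 1500 + messages / 200 + mb_used / 45000) > 1.8).astype(int)

X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

raw_model = LogisticRegression(random_state=54321, solver="liblinear")
scaled_model = make_pipeline(
    StandardScaler(),  # fitted on the training data only, inside the pipeline
    LogisticRegression(random_state=54321, solver="liblinear"),
)

raw_score = raw_model.fit(X_train, y_train).score(X_valid, y_valid)
scaled_score = scaled_model.fit(X_train, y_train).score(X_valid, y_valid)

print(f"unscaled: {raw_score:.3f}, scaled: {scaled_score:.3f}")
```

Putting the scaler inside the pipeline ensures that the scaling statistics come from the training set alone, so no information leaks from the validation data.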


**Findings**

Based on validation accuracy, the random forest model achieved the highest score, 80.72%, using 30 trees with `max_depth` set to 10.

The decision tree model came close at 79.16% with a single tree of the same depth (`max_depth` = 10). The logistic regression model was the weakest, with a validation accuracy of 73.09%, which is below the 0.75 threshold.

Considering these findings, we chose the random forest model. It needs 30 trees instead of one, but the extra cost is manageable at this dataset size, and it improves validation accuracy by about 1.6 percentage points over the decision tree.

### 3.5 Testing the Model

We now evaluate the selected model on the held-out test set (`test_features` and `test_target`).

In [17]:
# Create the final model
final_model = RandomForestClassifier(random_state=54321, max_depth=10, n_estimators=30)
final_model.fit(train_features, train_target)

In [18]:
# Make predictions on the test_features dataset
test_predictions = final_model.predict(test_features)

In [19]:
# Let's check the accuracy
print('Accuracy score:', accuracy_score(test_target, test_predictions))

Accuracy score: 0.8102643856920684


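The test accuracy is 81.03%, slightly above the validation accuracy (80.72%) and above the 0.75 threshold set for this project. It remains below the training accuracy (89.16%), as expected, and the close agreement between validation and test results suggests that the model generalizes reasonably well.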

### 3.6 Sanity Check

To conduct a sanity check, we compare the model's accuracy with the class balance of the target. A trivial model that always predicts the majority class achieves an accuracy equal to that class's share of the data, so our model is only useful if its accuracy is clearly above this baseline.

In [20]:
# Counts the number of Smart (0) vs Ultra (1) on test_target
test_target.value_counts()

0    444
1    199
Name: is_ultra, dtype: int64

In [21]:
# Percentage of users
test_target.value_counts() / test_target.shape[0] * 100

0    69.051322
1    30.948678
Name: is_ultra, dtype: float64

The `test_target` dataset consists of 69.05% Smart customers, so the classes are imbalanced and a model that always predicts Smart would already be 69.05% accurate.

Our model achieved an accuracy of 81.03%, about 12 percentage points above this baseline. This indicates that the model learns from the features rather than simply predicting the majority class.
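
The majority-class comparison can also be run as code with scikit-learn's `DummyClassifier`, which turns the baseline into a model scored the same way as the real one. The sketch below uses synthetic stand-in data rather than the Megaline dataset, so its numbers are illustrative only; the variable names are introduced here.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 4 features, roughly 70/30 class balance
X, y = make_classification(
    n_samples=1000, n_features=4, n_informative=3, n_redundant=1,
    weights=[0.7, 0.3], random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# Baseline: always predict the most frequent class seen during training
baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)

# Candidate model with the hyperparameters chosen in this notebook
forest = RandomForestClassifier(
    random_state=54321, max_depth=10, n_estimators=30
).fit(X_train, y_train)

baseline_acc = baseline.score(X_test, y_test)
forest_acc = forest.score(X_test, y_test)

print(f"baseline: {baseline_acc:.3f}, random forest: {forest_acc:.3f}, "
      f"lift: {forest_acc - baseline_acc:.3f}")
```

The model passes the sanity check when the lift is clearly positive; a lift near zero would mean the model adds little beyond guessing the majority class.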

Here are the predicted counts of Smart (0) and Ultra (1) subscribers generated by our model:

In [22]:
test_prediction_df = pd.DataFrame(test_predictions)

In [23]:
test_prediction_df.value_counts() / test_prediction_df.shape[0] * 100

0    78.693624
1    21.306376
dtype: float64

Our model assigns the Smart plan to 78.69% of the customers in the test dataset, compared with an actual share of 69.05% in `test_target`. The model therefore over-predicts Smart and recommends Ultra less often (21.31%) than it actually occurs (30.95%).

Even so, the model predicts far more Smart than Ultra customers, which matches the direction of the imbalance in `test_target`.
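
Comparing predicted and actual class shares shows the overall tilt of the model, but not which customers are misclassified. A confusion matrix breaks the errors down per class. The sketch below uses ten hypothetical labels and predictions (0 = Smart, 1 = Ultra) purely for illustration; in the notebook, `test_target` and `test_predictions` would take their place.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical labels and predictions (0 = Smart, 1 = Ultra), illustration only
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 1, 1, 1])
y_pred = np.array([0, 0, 0, 0, 0, 0, 1, 0, 0, 1])

# Rows are actual classes, columns are predicted classes
cm = confusion_matrix(y_true, y_pred, labels=[0, 1])
tn, fp, fn, tp = cm.ravel()

# Share of each actual class that the model labels correctly
smart_recall = tn / (tn + fp)
ultra_recall = tp / (tp + fn)

print(cm)
print(f"Smart recall: {smart_recall:.2f}, Ultra recall: {ultra_recall:.2f}")
print(f"Predicted Ultra share: {y_pred.mean():.2f}, "
      f"actual Ultra share: {y_true.mean():.2f}")
```

A low Ultra recall combined with a high Smart recall is the pattern one would expect when a model predicts the majority class more often than it occurs.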

## 4 Conclusion

We evaluated three models: decision tree, random forest, and logistic regression. The random forest (`max_depth` = 10, `n_estimators` = 30) had the highest validation accuracy (80.72%) and reached 81.03% on the test set, which exceeds both the project threshold of 0.75 and the 69.05% majority-class baseline used in the sanity check.

Based on these results, we recommend that Megaline use the random forest model to make package recommendations. One caveat is that the model recommends Smart more often than Smart actually occurs in the test data, so some Ultra users will be assigned to Smart.

With this model, Megaline can analyze customer behavior and recommend the Smart or Ultra package to customers who have not yet switched to the latest packages, which may improve customer satisfaction and adoption of the recommended package.