# Prediction Model for User Cellular Package Using Machine Learning <a id='intro'></a>

A mobile operator, `Megaline`, is dissatisfied because many of its customers still use legacy packages. The company wants a model that analyzes customer behavior and recommends one of its newest packages: `Smart` or `Ultra`. `Megaline` has a dataset describing the behavior of users of the `Smart` and `Ultra` packages. We will train several `machine learning models` to choose a package from that behavior (`number of calls`, `call duration`, `number of messages`, `data traffic used`). The `accuracy` threshold for this project is `0.75`.

The project sets out to answer the following questions:
- Which `machine learning` algorithm works best on the `Megaline` dataset?
- Which `hyperparameter` values are best for that model?
- Does the selected `machine learning model` pass the `sanity check`?
- Can the selected `machine learning model` make predictions for arbitrary data samples?

# Content <a id='back'></a>

* [Intro](#intro)
* [Content](#back)
* [Stage 1. Preparing Dataset](#cont_1)
     * [1.1 Loading Library](#cont_2)
     * [1.2 Load Dataset](#cont_3)
     * [1.3 Checking for Duplication](#cont_4)
     * [1.4 Changing Data Type](#cont_5)
* [Stage 2. Creating a Machine Learning Model](#cont_6)
     * [2.1 Splitting Dataset](#cont_7)
     * [2.2 Train and Test Machine Learning Algorithms](#cont_8)
         * [2.2.1 Decision Tree Classification Algorithm](#cont_9)
         * [2.2.2 Random Forest Classification Algorithm](#cont_10)
         * [2.2.3 Logistic Regression Algorithm](#cont_11)
     * [2.3 Model With Best Algorithm](#cont_12)
     * [2.4 Testing Model Eligibility (Sanity Check)](#cont_13)
* [Stage 3. Machine Learning Model Application](#cont_14)
* [Stage 4. General Conclusion](#cont_15)

# Preparing Dataset <a id='cont_1'></a>

The first step is to prepare the dataset: load the required libraries, load the dataset into the project, inspect sample data, and check for missing values, duplicates and data types.

## Load Libraries <a id='cont_2'></a>

Next we will load the required libraries. Here we need only two: `pandas` to process the dataset and `scikit-learn` for `machine learning` modeling. Let's load both.

In [35]:
# load libraries
import pandas as pd
from sklearn.model_selection import train_test_split 
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

## Load Dataset <a id='cont_3'></a>

Let's load the `Megaline` dataset into the project using the `pandas` library.

In [36]:
# load the megaline dataset
df_megaline = pd.read_csv('users_behavior.csv')

Next we will display information and sample data from the `Megaline` dataset.

In [37]:
# check data information
print(df_megaline.info())

# check sample data
df_megaline.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3214 entries, 0 to 3213
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   calls     3214 non-null   float64
 1   minutes   3214 non-null   float64
 2   messages  3214 non-null   float64
 3   mb_used   3214 non-null   float64
 4   is_ultra  3214 non-null   int64  
dtypes: float64(4), int64(1)
memory usage: 125.7 KB
None


| | calls | minutes | messages | mb_used | is_ultra |
|---|---|---|---|---|---|
| 0 | 40.0 | 311.9 | 83.0 | 19915.42 | 0 |
| 1 | 85.0 | 516.75 | 56.0 | 22696.96 | 0 |
| 2 | 77.0 | 467.66 | 86.0 | 21060.45 | 0 |
| 3 | 106.0 | 745.53 | 81.0 | 8437.39 | 1 |
| 4 | 66.0 | 418.74 | 1.0 | 14502.75 | 0 |


The dataset contains the following fields:
- `calls` is the number of calls
- `minutes` is the total call duration in minutes
- `messages` is the number of messages
- `mb_used` is the amount of data used in MB
- `is_ultra` is the package the user is on: `1` for `ultra`, `0` for `smart`

The information above shows that the dataset has `3214 rows` and `5 columns` with no `missing values`. The `calls` and `messages` columns hold counts but are stored as `float64`; we will convert them to integers later.

## Check for Duplication <a id='cont_4'></a>

Next we will check the dataset for duplicate rows. Many exact copies of the same row would over-weight those observations during training, and copies that land in different splits would make the validation and test scores look better than they are.

In [38]:
# check for duplicates
df_megaline.duplicated().sum()

0

The dataset contains no duplicate rows.
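As a small illustration of what `duplicated()` counts, the sketch below uses a toy frame with hypothetical values (not the Megaline data) that contains one exact duplicate row, and shows how such rows could be dropped if any were found.

```python
import pandas as pd

# Toy frame (hypothetical values) with one exact duplicate of the first row
toy = pd.DataFrame({
    "calls": [40, 85, 40],
    "minutes": [311.90, 516.75, 311.90],
    "is_ultra": [0, 0, 0],
})

# duplicated() marks every row that is identical to an earlier row
n_duplicates = int(toy.duplicated().sum())

# drop_duplicates() keeps the first occurrence of each row
deduplicated = toy.drop_duplicates().reset_index(drop=True)

print("duplicates:", n_duplicates)
print("rows after dropping:", len(deduplicated))
```

Since the Megaline dataset reports `0` duplicates, no rows need to be dropped here.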

## Changing Data Type <a id='cont_5'></a>

Based on previous observations, we will change the data type for the `calls` and `messages` columns from `float` to `integer`.

In [39]:
# change the data type of the calls column to integer
df_megaline['calls'] = df_megaline['calls'].astype('int')

# change the message column data type to integer
df_megaline['messages'] = df_megaline['messages'].astype('int')

# check the new data type
df_megaline.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3214 entries, 0 to 3213
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   calls     3214 non-null   int32  
 1   minutes   3214 non-null   float64
 2   messages  3214 non-null   int32  
 3   mb_used   3214 non-null   float64
 4   is_ultra  3214 non-null   int64  
dtypes: float64(2), int32(2), int64(1)
memory usage: 100.6 KB


The `calls` and `messages` fields have been successfully converted to `integer` data type.
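Casting `float` to `integer` truncates any fractional part, so it can be worth confirming that the columns hold only whole numbers before converting. The following is a minimal sketch on a toy frame with hypothetical values; the explicit `int64` target is an assumption made here to avoid the platform-dependent width of `'int'` (the output above shows `int32`).

```python
import pandas as pd

# Toy frame (hypothetical values) mimicking float-typed count columns
toy = pd.DataFrame({
    "calls": [40.0, 85.0, 77.0],
    "messages": [83.0, 56.0, 86.0],
})

count_columns = ["calls", "messages"]

# True for a column only if every value equals its rounded value
is_whole = (toy[count_columns] == toy[count_columns].round()).all()
print(is_whole)

# Convert only when no information would be lost
if is_whole.all():
    toy = toy.astype({column: "int64" for column in count_columns})

print(toy.dtypes)
```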

# Create Machine Learning Models <a id='cont_6'></a>

Building a good `machine learning model` takes several steps: splitting the dataset, choosing an algorithm, testing the algorithm, and `tuning hyperparameters`.

We will test several `machine learning algorithms` and choose the most effective one to use. Some of these algorithms include:
- Decision Tree Classification Algorithm
- Random Forest Classification Algorithm
- Logistic Regression Algorithm

## Splitting Dataset <a id='cont_7'></a>

Here the `target` is the recommended package (`is_ultra`), and the `features` are `calls`, `minutes`, `messages` and `mb_used`.

Since we only have one dataset, we split it into three parts to build the `machine learning model`. Because the first call below passes `test_size=0.6`, the resulting split is about `40%` for `training`, `30%` for `validation` and `30%` for `testing`, as the shapes printed below confirm.

In [40]:
# split the dataset into training, validation and test sets

# test_size=0.6 holds out 60% of the rows for (validation + test), leaving 40% for training
df_train, df_temp = train_test_split(df_megaline, test_size=0.6, random_state=147)

# split the held-out rows in half: 50% for validation and 50% for test
df_validation, df_test = train_test_split(df_temp, test_size=0.5, random_state=147)

In [41]:
# divide the training dataset into features and targets
features_train = df_train.drop(['is_ultra'],axis=1)
target_train = df_train['is_ultra']

# split the validation dataset into features and targets
features_valid = df_validation.drop(['is_ultra'],axis=1)
target_valid = df_validation['is_ultra']

# divide the test dataset into features and targets
features_test = df_test.drop(['is_ultra'],axis=1)
target_test = df_test['is_ultra']

# displays the shape of the training and validation
print('features_train:',features_train.shape)
print('target_train:',target_train.shape,'\n')

print('features_valid:',features_valid.shape)
print('target_valid:',target_valid.shape,'\n')

print('features_test:',features_test.shape)
print('target_test:',target_test.shape)

features_train: (1285, 4)
target_train: (1285,) 

features_valid: (964, 4)
target_valid: (964,) 

features_test: (965, 4)
target_test: (965,)
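The set sizes follow directly from `test_size`, which is the share of rows held out, not the share kept for training. As a sketch, the hypothetical helper `three_way_split` below wraps the two `train_test_split` calls and is run on a synthetic frame with the same number of rows as the Megaline dataset, once with `0.6` held out and once with `0.4`.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split


def three_way_split(df, holdout_frac, random_state=147):
    """Split df into train, validation and test sets (hypothetical helper).

    holdout_frac is the share of rows kept out of training; the held-out
    rows are then divided equally between validation and test.
    """
    train, holdout = train_test_split(
        df, test_size=holdout_frac, random_state=random_state
    )
    valid, test = train_test_split(
        holdout, test_size=0.5, random_state=random_state
    )
    return train, valid, test


# Synthetic stand-in with the same row count as the Megaline dataset
toy = pd.DataFrame({"x": np.arange(3214)})

sizes = {}
for frac in (0.6, 0.4):
    train, valid, test = three_way_split(toy, holdout_frac=frac)
    sizes[frac] = (len(train), len(valid), len(test))
    print(f"holdout_frac={frac}: train/valid/test = {sizes[frac]}")
```

With `holdout_frac=0.6` the sizes match the shapes printed above; a smaller hold-out such as `0.4` leaves more rows for training.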


## Train and Test Machine Learning Algorithms <a id='cont_8'></a>

The target `is_ultra` is a binary label and the features are numeric, so the `machine learning model` we need is of the `supervised learning - classification` type. Let's train and test the aforementioned models:

### Decision Tree Classification Algorithm <a id='cont_9'></a>

Next we will test the `decision tree classification algorithm`, trying a range of tree depths (`max_depth`) to find the best value of this `hyperparameter`.

In [42]:
# decision tree algorithm experiment

# create a temporary variable
best_result = 0
best_depth = 0

# testing the depth of the decision tree model (depth -> 1 ~ 50)
for depth in range(1, 51):
     # create a decision tree model
     model = DecisionTreeClassifier(random_state=147, max_depth=depth)
     # train the model using features and target train
     model.fit(features_train, target_train)
     # calculate accuracy using features and target validation
     result = model.score(features_valid,target_valid)
     if result > best_result:
         best_model = model
         best_result = result
         best_depth = depth

# displays the output
print("Best Model Accuracy (Decision Tree)")
print(f"Accuracy: {best_result:.5f}")
print(f"Depth\t: {best_depth}")

Best Model Accuracy (Decision Tree)
Accuracy: 0.79876
Depth	: 3


The best validation accuracy is `79.88%`, obtained at a tree `depth` of `3`.
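A shallow best depth is typical when deeper trees start to memorize the training set. One way to see this is to record training and validation accuracy side by side for each depth. The sketch below does so on synthetic data from `make_classification` (an assumption for illustration, not the Megaline dataset), so its numbers say nothing about the real results.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with four features (not the Megaline dataset)
X, y = make_classification(
    n_samples=600, n_features=4, n_informative=3, n_redundant=1,
    random_state=147,
)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.4, random_state=147
)

# One (depth, train accuracy, validation accuracy) tuple per depth
scores = []
for depth in range(1, 11):
    model = DecisionTreeClassifier(random_state=147, max_depth=depth)
    model.fit(X_train, y_train)
    scores.append(
        (depth, model.score(X_train, y_train), model.score(X_valid, y_valid))
    )

# max() returns the first maximum, so ties go to the shallower tree
best_depth, _, best_valid = max(scores, key=lambda row: row[2])

for depth, train_acc, valid_acc in scores:
    print(f"depth={depth:2d}  train={train_acc:.3f}  valid={valid_acc:.3f}")
print("best depth on validation:", best_depth)
```

A growing gap between the two columns as depth increases is the usual sign of overfitting.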

### Random Forest Classification Algorithm <a id='cont_10'></a>

In the same way, we will test the `random forest classification algorithm`. The `hyperparameters` we are `tuning` are the number of trees (`n_estimators`) and the depth of each tree (`max_depth`).

We will try tree depths from `1 to 10` and tree counts from `1 to 15` to find the best `hyperparameters`.

In [43]:
# Experimental random forest algorithm
# tree depth: depth -> 1 ~ 10
# number of trees: n_estimators -> 1 ~ 15

# create a temporary variable
best_result = 0
best_est = 0
best_depth = 0

# testing the depth of the model and the number of trees
# setting the number of trees
for est in range(1, 16):
     # setting the amount of tree depth
     for depth in range (1, 11):
         # create a random forest classifier model
         model = RandomForestClassifier(random_state=147, n_estimators=est, max_depth=depth)
         # train the model using features and target train
         model.fit(features_train, target_train)
         # calculate accuracy using features and target validation
         result = model.score(features_valid,target_valid)
         if result > best_result:
             best_result = result
             best_est = est
             best_depth = depth

# displays the output
print("Best Model Accuracy (Random Forest)")
print(f"Accuracy\t: {best_result:.5f}")
print(f"Depth\t\t: {best_depth}")
print(f"N_Estimators\t: {best_est}")

Best Model Accuracy (Random Forest)
Accuracy	: 0.81017
Depth		: 7
N_Estimators	: 6


We reach a validation accuracy of `81.02%` with a tree depth of only `7` and only `6` trees.

Let's widen the `hyperparameter` ranges to see whether we can get a better accuracy score.

We will now try tree depths from `1 to 20` and tree counts from `1 to 66 (in increments of 5)` to find the best `hyperparameters`.

In [44]:
# Experimental random forest algorithm
# tree depth: depth -> 1 ~ 20
# number of trees: n_estimators -> 1 ~ 66 {increment 5}

# create temporary
best_result = 0
best_est = 0
best_depth = 0

# testing the depth of the model and the number of trees
# setting the number of trees
for est in range(1, 71, 5):
     # setting the amount of tree depth
     for depth in range(1, 21):
         # create a random forest classifier model
         model = RandomForestClassifier(random_state=147, n_estimators=est, max_depth=depth)
         # train the model using features and target train
         model.fit(features_train, target_train)
         # calculate accuracy using features and target validation
         result = model.score(features_valid,target_valid)
         if result > best_result:
             best_result = result
             best_est = est
             best_depth = depth

# displays the output
print("Best Model Accuracy (Random Forest)")
print(f"Accuracy\t: {best_result:.5f}")
print(f"Depth\t\t: {best_depth}")
print(f"N_Estimators\t: {best_est}")

Best Model Accuracy (Random Forest)
Accuracy	: 0.81535
Depth		: 9
N_Estimators	: 51


This time the best validation accuracy is `81.54%`, at a tree depth of `9` with `51` trees.

Widening the `hyperparameter` ranges does not raise accuracy significantly: the gain is about half a percentage point (`81.02%` to `81.54%`) in exchange for a much larger forest. So we keep the smaller configuration as the best `hyperparameters` for the `random forest` algorithm: a tree depth of `7` and `6` trees.
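The preference for a smaller forest when accuracy is nearly equal can also be built into the search itself. The sketch below is one possible way to do that, not the method used above: it flattens the nested loops with scikit-learn's `ParameterGrid` and breaks accuracy ties in favor of fewer and shallower trees. The data is synthetic and the grid values are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import ParameterGrid, train_test_split

# Synthetic stand-in data with four features (not the Megaline dataset)
X, y = make_classification(
    n_samples=400, n_features=4, n_informative=3, n_redundant=1,
    random_state=147,
)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.4, random_state=147
)

# Illustrative grid; ParameterGrid yields one dict per combination
grid = ParameterGrid({"n_estimators": [1, 6, 11], "max_depth": [3, 5, 7]})

results = []
for params in grid:
    model = RandomForestClassifier(random_state=147, **params)
    model.fit(X_train, y_train)
    results.append({**params, "accuracy": model.score(X_valid, y_valid)})

# Highest accuracy first; ties go to fewer trees, then to shallower trees
results.sort(key=lambda r: (-r["accuracy"], r["n_estimators"], r["max_depth"]))
best = results[0]
print(best)
```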

### Logistic Regression Algorithm <a id='cont_11'></a>

In the same way, we will test the `logistic regression algorithm` using the `liblinear` `solver`.

In [45]:
# logistic regression algorithm experiments

# create a logistic regression model
model = LogisticRegression(random_state=147, solver='liblinear')

# train the model using features and target train
model.fit(features_train, target_train)

# calculate accuracy using features and target validation
result = model.score(features_valid,target_valid)

# displays the output
print("Best Model Accuracy (Logistic Regression)")
print(f"Accuracy: {result:.5f}")

Best Model Accuracy (Logistic Regression)
Accuracy: 0.72822


Here the validation accuracy is only `72.82%`, lower than both previous algorithms and below the `0.75` threshold.
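Regularized logistic regression can be sensitive to features on very different scales, which is the case for columns such as `calls` and `mb_used`. As a hedged sketch of one thing that could be tried, the code below compares a plain model with a `StandardScaler` pipeline on synthetic data whose columns are deliberately rescaled. Whether scaling would help on the Megaline data is an assumption that was not tested in this project.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (not the Megaline dataset)
X, y = make_classification(
    n_samples=600, n_features=4, n_informative=3, n_redundant=1,
    random_state=147,
)
# Put the columns on very different scales, loosely like calls vs. mb_used
X = X * [1, 10, 100, 10000]

X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.4, random_state=147
)

# Plain model on the raw features
plain = LogisticRegression(random_state=147, solver="liblinear")
plain.fit(X_train, y_train)
plain_acc = plain.score(X_valid, y_valid)

# Same model, but each feature is standardized first
scaled = make_pipeline(
    StandardScaler(),
    LogisticRegression(random_state=147, solver="liblinear"),
)
scaled.fit(X_train, y_train)
scaled_acc = scaled.score(X_valid, y_valid)

print(f"plain:  {plain_acc:.3f}")
print(f"scaled: {scaled_acc:.3f}")
```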

## Model With Best Algorithm <a id='cont_12'></a>

From the previous tests, we choose the `Random Forest Classification Algorithm` with a tree depth (`max_depth`) of `7` and `6` trees (`n_estimators`).

So far the model has been trained on only about `40%` of the dataset (`1285` of `3214` rows). Retraining it on the training and validation sets together raises that share to about `70%`, which we expect to help the model predict better.

Let's combine the training and validation datasets into one dataset.

In [46]:
# train the model using training and validation datasets to get more accurate results using random forests

# combine training and validation datasets
merge_df = pd.concat([df_train,df_validation],axis=0)
print('New Dataset:',merge_df.shape)

# split the final dataset into features and targets
features = merge_df.drop(['is_ultra'],axis=1)
target = merge_df['is_ultra']

# max_depth = 7
# n_estimators = 6
best_model = RandomForestClassifier(random_state=147, n_estimators=6, max_depth=7)

# train the best model
best_model.fit(features, target)

New Dataset: (2249, 5)


## Testing the Feasibility of the Model (Sanity Check) <a id='cont_13'></a>

Let's run the `sanity check` on the `testing dataset` created earlier, which the best model never saw during training.

In [47]:
# model feasibility test (sanity check)
accuracy = best_model.score(features_test,target_test)

# displays the output
print("Best_model accuracy:",accuracy)

Best_model accuracy: 0.8072538860103627


The test accuracy is `80.73%`, close to the `81.02%` obtained on validation and still above the `75%` accuracy threshold.
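A common complement to this check is to compare the model against a trivial baseline that always predicts the most frequent class. The sketch below shows the mechanics with scikit-learn's `DummyClassifier` on synthetic labels with a hypothetical 70/30 class balance. The class balance of the Megaline data is not shown in this project, so no baseline figure is claimed for it.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Synthetic labels (hypothetical balance): 70% class 0, 30% class 1
y_train = np.array([0] * 70 + [1] * 30)
y_test = np.array([0] * 14 + [1] * 6)

# DummyClassifier ignores the features, so zero-filled arrays are enough
X_train = np.zeros((len(y_train), 4))
X_test = np.zeros((len(y_test), 4))

# Always predicts the most frequent class seen during fit
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_train, y_train)

baseline_accuracy = baseline.score(X_test, y_test)
print("baseline accuracy:", baseline_accuracy)
```

On the real data, the same baseline would be fitted on `features` and `target` and scored on `features_test` and `target_test`; a useful model should score clearly above it.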

# Application of Machine Learning Model <a id='cont_14'></a>

Here we will create an arbitrary dataset of users and use the model to find out which package suits each of them, given the following features.

In [48]:
# create an arbitrary dataframe to test the selected model
data_test = pd.DataFrame({
    'calls':[70,20,50,100,90],
    'minutes':[100,85,300,250,30],
    'messages':[50,35,300,60,500],
    'mb_used':[10000,500,7000,3000,1000]
})

# display dataframes
data_test

| | calls | minutes | messages | mb_used |
|---|---|---|---|---|
| 0 | 70 | 100 | 50 | 10000 |
| 1 | 20 | 85 | 35 | 500 |
| 2 | 50 | 300 | 300 | 7000 |
| 3 | 100 | 250 | 60 | 3000 |
| 4 | 90 | 30 | 500 | 1000 |


Let's predict the packages for these five users.

In [49]:
# predict data_test
best_model.predict(data_test)

array([0, 0, 1, 1, 1], dtype=int64)

The model recommends the `smart` package for 2 of these users and the `ultra` package for 3 of them.
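To make the output easier to read, the numeric labels can be mapped back to package names. The sketch below uses the prediction array copied from the output above; the `package_names` mapping follows the `is_ultra` definition (`1` is `ultra`, `0` is `smart`).

```python
import numpy as np
import pandas as pd

# Predictions copied from the output above
predictions = np.array([0, 0, 1, 1, 1])

# Mapping taken from the is_ultra field definition
package_names = {0: "smart", 1: "ultra"}

# One recommended package name per user
recommended = pd.Series(predictions).map(package_names)

# Number of users per recommended package
counts = recommended.value_counts()

print(recommended.tolist())
print(counts.to_dict())
```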

# General Conclusion <a id='cont_15'></a>

In this project we loaded the necessary libraries, prepared the dataset, split it, and trained and tested `machine learning models` that predict which package to recommend based on user behavior. The conclusions are as follows:
- With `test_size=0.6` in the first split, the dataset was divided into about `40%` for `training` (`1285` rows), `30%` for `validation` (`964` rows) and `30%` for `testing` (`965` rows).
- The algorithms tested were `Decision Tree Classification`, `Random Forest Classification` and `Logistic Regression`.
- The best algorithm is the `Random Forest Classification Algorithm` with a tree `depth` of `7` and `6` trees, which reached a validation accuracy of about `81%`.
- In the `Sanity Check` on the test set, the model scored `80.73%`, close to its validation accuracy and above the `75%` threshold.