# Machine learning for a phone plan recommender system

Mobile carrier Megaline has found out that many of their subscribers use legacy plans. They want to develop a model that would analyze subscribers' behavior and recommend one of Megaline's newer plans: Smart or Ultra. You have access to behavior data about subscribers who have already switched to the new plans (from the project for the Statistical Data Analysis course). For this classification task, you need to develop a model that will pick the right plan. Since you’ve already performed the data preprocessing step, you can move straight to creating the model.

Develop a model with the highest possible accuracy. In this project, the threshold for accuracy is 0.75. Check the accuracy using the test dataset.

## Data description

Every observation in the dataset contains monthly behavior information about one user. The information given is as follows:

 - `calls`: number of calls,
 - `minutes`: total call duration in minutes,
 - `messages`: number of text messages,
 - `mb_used`: Internet traffic used in MB,
 - `is_ultra`: plan for the current month (Ultra = 1, Smart = 0).

## Objectives

The objectives of this project are to:
- Develop a model that analyzes subscribers' behavior
- Build a phone plan recommender system that recommends the right plan based on subscribers' behavior

<hr>

 # Table of contents

<div class="alert alert-block alert-info" style="margin-top: 20px">
    <ol>
        <li><a href="#open_the_data">Open the data file and study the general information</a></li>
        <li><a href="#data_splitting">Split the source data</a></li>
        <li><a href="#investigate_models">Investigate the quality of different models</a></li>
        <li><a href="#check_quality">Check model quality</a></li>
        <li><a href="#sanity_check">Sanity check the model</a></li>
        <li><a href="#overall_conclusion">Overall conclusion</a></li>
    </ol>
</div>
<br>
<hr>

<div id="open_the_data">
    <h2>Open the data file and study the general information</h2> 
</div>

We require the following libraries: *pandas* and *numpy* for data preprocessing and manipulation, and *Scikit-Learn* for building and evaluating our models.

In [1]:
# import pandas and numpy for data preprocessing and manipulation
import numpy as np
import pandas as pd

# import train_test_split to split data
from sklearn.model_selection import train_test_split

# import machine learning module from the sklearn library
from sklearn.tree import DecisionTreeClassifier # import decision tree classifier
from sklearn.linear_model import LogisticRegression # import logistic regression 
from sklearn.ensemble import RandomForestClassifier # import random forest algorithm

# import metrics for evaluating model accuracy
from sklearn.metrics import accuracy_score

print('Project libraries have been successfully imported!')

Project libraries have been successfully imported!


In [2]:
# read the data (fall back to a local copy if the remote file is unavailable)
try:
    df = pd.read_csv('https://code.s3.yandex.net/datasets/users_behavior.csv')
except Exception:
    df = pd.read_csv('C:/Users/hotty/Desktop/Practicum by Yandex/Projects/Introduction to Machine Learning/users_behavior.csv')
print('Data has been read correctly!')

Data has been read correctly!


In [3]:
# function to report the share and count of null values in each column
def get_percent_of_na(df, num):
    """
    Print, for every column that has nulls, the share of nulls
    (with `num` decimal places) and the number of null values.
    """
    count = 0
    s = df.isna().mean()  # fraction of nulls per column
    for column, percent in zip(s.index, s.values):
        num_of_nulls = df[column].isna().sum()
        if num_of_nulls == 0:
            continue
        count += 1
        print('Column {} has {:.{}%} nulls ({} values)'.format(column, percent, num, num_of_nulls))
    if count != 0:
        print("\033[1m" + 'There are {} columns with NA.'.format(count) + "\033[0m")
    else:
        print()
        print("\033[1m" + 'There are no columns with NA.' + "\033[0m")
        
# function to display general information about the dataset
def get_info(df):
    """
    Display general information about the dataset using head(), info(),
    describe(), the shape attribute, and duplicated().
    """
    print("\033[1m" + '-'*100 + "\033[0m")
    print('Head:')
    print()
    display(df.head())
    print('-'*100)
    print('Info:')
    print()
    df.info()  # info() prints its report itself and returns None
    print('-'*100)
    print('Describe:')
    print()
    display(df.describe())
    print('-'*100)
    print('Columns with nulls:')
    get_percent_of_na(df, 4)  # prints its report itself and returns None
    print('-'*100)
    print('Shape:')
    print(df.shape)
    print('-'*100)
    print('Duplicated:')
    print("\033[1m" + 'We have {} duplicated rows.\n'.format(df.duplicated().sum()) + "\033[0m")
    print()

In [4]:
# study the general information about the dataset 
print('General information about the dataframe')
get_info(df)

General information about the dataframe

----------------------------------------------------------------------------------------------------
Head:

|   | calls | minutes | messages | mb_used | is_ultra |
|---|-------|---------|----------|---------|----------|
| 0 | 40.0 | 311.9 | 83.0 | 19915.42 | 0 |
| 1 | 85.0 | 516.75 | 56.0 | 22696.96 | 0 |
| 2 | 77.0 | 467.66 | 86.0 | 21060.45 | 0 |
| 3 | 106.0 | 745.53 | 81.0 | 8437.39 | 1 |
| 4 | 66.0 | 418.74 | 1.0 | 14502.75 | 0 |

----------------------------------------------------------------------------------------------------
Info:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3214 entries, 0 to 3213
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   calls     3214 non-null   float64
 1   minutes   3214 non-null   float64
 2   messages  3214 non-null   float64
 3   mb_used   3214 non-null   float64
 4   is_ultra  3214 non-null   int64  
dtypes: float64(4), int64(1)
memory usage: 125.7 KB

----------------------------------------------------------------------------------------------------
Describe:

|       | calls | minutes | messages | mb_used | is_ultra |
|-------|-------|---------|----------|---------|----------|
| count | 3214.0 | 3214.0 | 3214.0 | 3214.0 | 3214.0 |
| mean | 63.038892 | 438.208787 | 38.281269 | 17207.673836 | 0.306472 |
| std | 33.236368 | 234.569872 | 36.148326 | 7570.968246 | 0.4611 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 40.0 | 274.575 | 9.0 | 12491.9025 | 0.0 |
| 50% | 62.0 | 430.6 | 30.0 | 16943.235 | 0.0 |
| 75% | 82.0 | 571.9275 | 57.0 | 21424.7 | 1.0 |
| max | 244.0 | 1632.06 | 224.0 | 49745.73 | 1.0 |

----------------------------------------------------------------------------------------------------
Columns with nulls:

There are no columns with NA.

----------------------------------------------------------------------------------------------------
Shape:
(3214, 5)

----------------------------------------------------------------------------------------------------
Duplicated:
We have 0 duplicated rows.



**Conclusion**

Because the data were already preprocessed, there are no duplicated rows or missing values, as expected. About 31% of the observations are on the Ultra plan (the mean of `is_ultra` is 0.306), so the two classes are not balanced. The data is ready for modeling, so we start by splitting the source dataset into training, validation, and test sets in a 3:1:1 ratio (60% training, 20% validation, 20% test).

<div id="data_splitting">
    <h2>Split the source data</h2> 
</div>

To split the data into training, validation, and test sets, we call `sklearn.model_selection.train_test_split` twice: first to separate the test set (20%) from the rest, and then to split the remaining 80% into training and validation sets.

In [5]:
# split data into training and testing 
df_train, df_test = train_test_split(df, test_size=0.20, random_state=12345)

# split train data into validation and train 
df_train, df_valid = train_test_split(df_train, test_size=0.25, random_state=12345) # 0.25 * 0.80 = 0.20 for validation size

In [6]:
# display the size of each split
print('The train set now contains {} rows, about 60% of the data'.format(df_train.shape[0]))
print('The valid set now contains {} rows, about 20% of the data'.format(df_valid.shape[0]))
print('The test set now contains {} rows, about 20% of the data'.format(df_test.shape[0]))

The train set now contains 1928 rows, about 60% of the data
The valid set now contains 643 rows, about 20% of the data
The test set now contains 643 rows, about 20% of the data


In [7]:
# declare variables for features and target feature
features_train = df_train.drop(['is_ultra'], axis=1)
target_train = df_train['is_ultra']
features_valid = df_valid.drop(['is_ultra'], axis=1)
target_valid = df_valid['is_ultra']
features_test = df_test.drop(['is_ultra'], axis=1)
target_test = df_test['is_ultra']

print('-'*30)
print('Train features :', features_train.shape)
print('Train target   :',target_train.shape)
print('Valid features :',features_valid.shape)
print('Valid target   :',target_valid.shape)
print('Test features  :',features_test.shape)
print('Test target    :',target_test.shape)

------------------------------
Train features : (1928, 4)
Train target   : (1928,)
Valid features : (643, 4)
Valid target   : (643,)
Test features  : (643, 4)
Test target    : (643,)


**Conclusion**

The data is now split three ways: 60% training (1928 rows), 20% validation (643 rows), and 20% test (643 rows).
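The two-step split can be wrapped in a small helper and checked for size. The sketch below is only an illustration: `split_60_20_20` is a hypothetical helper that is not part of the notebook, and `demo` is a synthetic frame with the same number of rows as the real data, not the Megaline dataset itself.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data: same row count, hypothetical values
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    'mb_used': rng.uniform(0, 50000, size=3214),
    'is_ultra': rng.integers(0, 2, size=3214),
})

def split_60_20_20(frame, seed=12345):
    """Split a frame into train/valid/test parts of roughly 60/20/20."""
    train_valid, test = train_test_split(frame, test_size=0.20, random_state=seed)
    # 0.25 of the remaining 80% equals 20% of the original data
    train, valid = train_test_split(train_valid, test_size=0.25, random_state=seed)
    return train, valid, test

train, valid, test = split_60_20_20(demo)
for name, part in [('train', train), ('valid', valid), ('test', test)]:
    print(name, len(part), round(len(part) / len(demo), 3))
```

For 3214 rows this yields the same 1928/643/643 row counts as the cells above, and the three parts do not share any rows.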

<div id="investigate_models">
    <h2>Investigate the quality of different models</h2> 
</div>

#### Model development

In this section, we build and compare several models. Since this is a classification task, we use a decision tree classifier, logistic regression, and a random forest.

#### Decision Tree Classifier

In [8]:
# create the decision tree classifier
def decision_tree_classifier(X_train, y_train, X_valid, y_valid):
    """
    This is a decision tree classifier function developed to train 
    the model, make prediction on train and validation dataset, print
    model accuracy for training and validation datasets
    """
    # create a loop for max_depth from 1 to 5 
    for depth in range(1, 6):
        model = DecisionTreeClassifier(random_state=12345, max_depth = depth) # create an instance of a class
        model.fit(X_train, y_train) # train the model
        train_predictions = model.predict(X_train) # make predictions on train set
        predictions_valid = model.predict(X_valid) # make predictions on validation set
        print('Max depth and accuracy for decision tree classifier')
        print('-'*40)
        print("\033[1m" + 'max_depth = {}'.format(depth) + "\033[0m")
        print('Training set:', accuracy_score(y_train, train_predictions))
        print('Validation set:', accuracy_score(y_valid, predictions_valid))
        print()

In [9]:
# determine accuracy for decision tree classifier
decision_tree_classifier(features_train, target_train, features_valid, target_valid)

Max depth and accuracy for decision tree classifier
----------------------------------------
max_depth = 1
Training set: 0.758298755186722
Validation set: 0.7387247278382582

Max depth and accuracy for decision tree classifier
----------------------------------------
max_depth = 2
Training set: 0.79201244813278
Validation set: 0.7573872472783826

Max depth and accuracy for decision tree classifier
----------------------------------------
max_depth = 3
Training set: 0.8117219917012448
Validation set: 0.7651632970451011

Max depth and accuracy for decision tree classifier
----------------------------------------
max_depth = 4
Training set: 0.8205394190871369
Validation set: 0.7636080870917574

Max depth and accuracy for decision tree classifier
----------------------------------------
max_depth = 5
Training set: 0.8272821576763485
Validation set: 0.7589424572317263



The decision tree classifier learns decision rules from the training data and uses them to predict the right plan. We looped over the `max_depth` hyperparameter from 1 to 5 to see which depth gives the best accuracy. Validation accuracy rises until `max_depth` of 3 and then declines slightly, while training accuracy keeps rising, which suggests that deeper trees start to overfit. At `max_depth` of 3, the accuracy is 81.17% for the training set and 76.52% for the validation set.
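Reading the best depth off the printed scores can also be done programmatically. The following is a minimal sketch, assuming synthetic data from `make_classification` in place of the Megaline data; `best_depth` is a hypothetical helper, not a function from the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with four features, like the real feature set
X, y = make_classification(n_samples=1000, n_features=4, n_informative=3,
                           n_redundant=0, random_state=12345)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=12345)

def best_depth(X_train, y_train, X_valid, y_valid, depths=range(1, 6)):
    """Return the depth with the highest validation accuracy and all scores."""
    scores = {}
    for depth in depths:
        model = DecisionTreeClassifier(random_state=12345, max_depth=depth)
        model.fit(X_train, y_train)
        scores[depth] = accuracy_score(y_valid, model.predict(X_valid))
    # max() returns the first maximum, so ties go to the shallower tree
    best = max(scores, key=scores.get)
    return best, scores

best, scores = best_depth(X_train, y_train, X_valid, y_valid)
print('Best max_depth:', best)
print('Validation accuracy by depth:', scores)
```

In the notebook, the same helper could be called with `features_train`, `target_train`, `features_valid`, and `target_valid`. Selecting on the validation score rather than the training score matters here, because training accuracy keeps growing with depth.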

#### Logistic Regression Model

In [10]:
# create the logistic regression model
def logistic_regression(X_train, y_train, X_valid, y_valid):
    """
    This is a logistic regression model function developed to train
    the model, make prediction on train and validation dataset, print
    model accuracy for training and validation datasets
    """
    model = LogisticRegression(random_state=12345, solver='liblinear')
    model.fit(X_train, y_train) # train the model
    train_predictions = model.predict(X_train) # make predictions on train set
    predictions_valid = model.predict(X_valid) # make predictions on validation set
    print('Accuracy for logistic regression model')
    print('-'*40)
    print('Training set:', accuracy_score(y_train, train_predictions))
    print('Validation set:', accuracy_score(y_valid, predictions_valid))

In [11]:
# determine accuracy for logistic regression model
logistic_regression(features_train, target_train, features_valid, target_valid)

Accuracy for logistic regression model
----------------------------------------
Training set: 0.7028008298755186
Validation set: 0.6998444790046656


Logistic regression is a simple model that is quick to train, but its accuracy here is lower than the decision tree's: 70.28% for the training set and 69.98% for the validation set, which is below the 0.75 threshold.
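One thing that may be worth trying, though the notebook does not do it, is standardizing the features before fitting, since `mb_used` is on a much larger scale than the other columns and regularized logistic regression can be sensitive to that. The sketch below shows the mechanics only, on synthetic data with hypothetical values; it makes no claim about how much the Megaline scores would change.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in with features on very different scales (hypothetical values)
rng = np.random.default_rng(12345)
n = 1000
minutes = rng.uniform(0, 1600, size=n)
mb_used = rng.uniform(0, 50000, size=n)
X = np.column_stack([minutes, mb_used])
# Hypothetical rule: heavier usage makes label 1 more likely
y = ((minutes / 1600 + mb_used / 50000 + rng.normal(0, 0.3, size=n)) > 1.0).astype(int)

X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=12345)

# The scaler is fitted on the training data only, then applied to validation data
scaled_model = make_pipeline(
    StandardScaler(),
    LogisticRegression(random_state=12345, solver='liblinear'),
)
scaled_model.fit(X_train, y_train)
scaled_accuracy = accuracy_score(y_valid, scaled_model.predict(X_valid))
print('Validation accuracy with scaling:', scaled_accuracy)
```

Putting the scaler inside a pipeline keeps the validation data out of the scaling statistics, so the validation score stays an honest estimate.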

#### Random Forest Classifier

In [12]:
# create the random forest classifier model
def random_forest_classifier(X_train, y_train, X_valid, y_valid):
    """
    This is a random forest classifier function developed to train
    the model, make prediction on train and validation dataset, print
    model accuracy for training and validation datasets
    """
    model = RandomForestClassifier(random_state=12345, n_estimators=5)
    model.fit(X_train, y_train) # train the model
    train_predictions = model.predict(X_train) # make predictions on train set
    predictions_valid = model.predict(X_valid) # make predictions on validation set
    print('Accuracy for random forest classifier')
    print('-'*40)
    print('Training set:', accuracy_score(y_train, train_predictions))
    print('Validation set:', accuracy_score(y_valid, predictions_valid))

In [13]:
# determine accuracy for random forest classifier
random_forest_classifier(features_train, target_train, features_valid, target_valid)

Accuracy for random forest classifier
----------------------------------------
Training set: 0.970954356846473
Validation set: 0.7620528771384136


For the random forest classifier, we fix `random_state` so that the results are reproducible, and we set the number of trees in the forest with the `n_estimators=5` hyperparameter. With `n_estimators` of 5, the random forest classifier gave an accuracy of 97.10% for the training data and 76.21% for the validation data. The large gap between the two scores suggests that the forest overfits the training data.
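The notebook tries a single value of `n_estimators`. A sweep over several values, scored on the validation set, could look like the sketch below. It assumes synthetic data from `make_classification`, and the candidate values `(5, 10, 20, 40)` are arbitrary choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with four features, like the real feature set
X, y = make_classification(n_samples=1000, n_features=4, n_informative=3,
                           n_redundant=0, random_state=12345)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=12345)

forest_scores = {}
for n_trees in (5, 10, 20, 40):  # arbitrary candidate values
    forest = RandomForestClassifier(random_state=12345, n_estimators=n_trees)
    forest.fit(X_train, y_train)
    forest_scores[n_trees] = accuracy_score(y_valid, forest.predict(X_valid))

# Pick the forest size with the highest validation accuracy
best_n_trees = max(forest_scores, key=forest_scores.get)
print('Best n_estimators:', best_n_trees)
print('Validation accuracy by n_estimators:', forest_scores)
```

The same loop could be run on `features_train` and `features_valid`, optionally combined with a `max_depth` limit to reduce the overfitting seen above.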

**Conclusion**

Comparing the models, the random forest has by far the highest training accuracy (97.10%) and a validation accuracy of 76.21%. The decision tree with `max_depth` of 3 reaches a comparable validation accuracy (76.52%) with a much smaller gap to its training score (81.17%). Both clear the 0.75 threshold on the validation set. Logistic regression was the least accurate model, with 70.28% for the training set and 69.98% for the validation set. We proceed to use the random forest classifier to make predictions on the unseen test data.

<div id="check_quality">
    <h2>Check model quality</h2> 
</div>

#### Model testing

In [14]:
# Testing the random forest classifier model quality
model = RandomForestClassifier(random_state=12345, n_estimators=5)
model.fit(features_train, target_train) # train the model
test_predictions = model.predict(features_test) # make predictions on test set    

print('Test set:', accuracy_score(target_test, test_predictions))

Test set: 0.7807153965785381


Using the random forest classifier, we evaluated the model on the test set and obtained an **accuracy score of 78.07%**, which is above the 0.75 threshold.

<div id="sanity_check">
    <h2>Sanity check the model</h2> 
</div>
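One common way to sanity check a classifier is to compare it with a constant baseline that always predicts the most frequent class. Since the mean of `is_ultra` is 0.306, always predicting Smart would be right for about 69% of the whole dataset, and the random forest's test accuracy of 0.78 is above that. The sketch below shows how such a baseline could be computed with `DummyClassifier`; the labels are synthetic, drawn with roughly the same Ultra share, so the printed number is not a result for the real test split.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Hypothetical labels with roughly the 31% Ultra share reported by describe()
rng = np.random.default_rng(12345)
y_train = (rng.random(1928) < 0.306).astype(int)
y_test = (rng.random(643) < 0.306).astype(int)
# DummyClassifier ignores the features, so placeholders are enough here
X_train = np.zeros((len(y_train), 1))
X_test = np.zeros((len(y_test), 1))

baseline = DummyClassifier(strategy='most_frequent')
baseline.fit(X_train, y_train)
baseline_accuracy = accuracy_score(y_test, baseline.predict(X_test))
print('Baseline accuracy:', baseline_accuracy)
```

In the notebook, the baseline would be fitted on `features_train` and `target_train` and scored on `features_test` and `target_test`; a model passes the sanity check if its test accuracy is clearly higher than this baseline.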

<div id="overall_conclusion">
    <h2>Overall conclusion</h2> 
</div>