## Introduction

Mobile carrier Megaline has found out that many of their subscribers use legacy plans. They want to develop a model that would analyze subscribers' behavior and recommend one of Megaline's newer plans: Smart or Ultra.
You have access to behavior data about subscribers who have already switched to the new plans. For this classification task, you need to develop a model that will pick the right plan.

Develop a model with the highest possible accuracy. Check the accuracy using the test dataset.

## Dataset description ('users_behavior.csv'):
Every observation in the dataset contains monthly behavior information about one user. The information given is as follows:
- `calls` — number of calls,
- `minutes` — total call duration in minutes,
- `messages` — number of text messages,
- `mb_used` — Internet traffic used in MB,
- `is_ultra` — plan for the current month (Ultra - 1, Smart - 0).

## Project overview

The goal of this project is to evaluate three machine learning classification models (Decision Tree, Random Forest, and Logistic Regression) and identify the best model and hyperparameters for predicting which plan suits a Megaline subscriber. The dataset in 'users_behavior.csv' is split into training, validation, and test sets: the training and validation sets are used to fit and tune the models, and the test set is used for a final check of their quality.
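
The three-way split can be sketched on stand-in data. The frame below is synthetic: it reuses the column names and the 3,214-row count of 'users_behavior.csv', but the values are random and the seed is arbitrary. Only the resulting set sizes carry over to the real data.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for users_behavior.csv: same columns and row count,
# random values. This is NOT the real data.
rng = np.random.default_rng(0)
n_rows = 3214
demo = pd.DataFrame({
    "calls": rng.integers(0, 245, n_rows),
    "minutes": rng.uniform(0, 1633, n_rows),
    "messages": rng.integers(0, 225, n_rows),
    "mb_used": rng.uniform(0, 49746, n_rows),
    "is_ultra": rng.integers(0, 2, n_rows),
})

features = demo.drop("is_ultra", axis=1)
target = demo["is_ultra"]

# Step 1: hold out 40% of the rows.
features_train, features_rest, target_train, target_rest = train_test_split(
    features, target, test_size=0.4, random_state=0
)
# Step 2: split the held-out 40% in half, giving validation and test sets.
features_valid, features_test, target_valid, target_test = train_test_split(
    features_rest, target_rest, test_size=0.5, random_state=0
)

split_sizes = (len(features_train), len(features_valid), len(features_test))
print(split_sizes)  # (1928, 643, 643)
```

With 3,214 rows this gives 1,928 training rows and 643 rows each for validation and testing, so a single observation is worth about 0.16 percentage points of accuracy on either held-out set.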

In [1]:
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

In [2]:
# Read in the dataset into df
df = pd.read_csv('users_behavior.csv')

In [3]:
# Print out info on the dataset
df.info()
display(df.head())
df.describe()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3214 entries, 0 to 3213
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   calls     3214 non-null   float64
 1   minutes   3214 non-null   float64
 2   messages  3214 non-null   float64
 3   mb_used   3214 non-null   float64
 4   is_ultra  3214 non-null   int64  
dtypes: float64(4), int64(1)
memory usage: 125.7 KB


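| | calls | minutes | messages | mb_used | is_ultra |
| --- | --- | --- | --- | --- | --- |
| 0 | 40.0 | 311.9 | 83.0 | 19915.42 | 0 |
| 1 | 85.0 | 516.75 | 56.0 | 22696.96 | 0 |
| 2 | 77.0 | 467.66 | 86.0 | 21060.45 | 0 |
| 3 | 106.0 | 745.53 | 81.0 | 8437.39 | 1 |
| 4 | 66.0 | 418.74 | 1.0 | 14502.75 | 0 |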


| | calls | minutes | messages | mb_used | is_ultra |
| --- | --- | --- | --- | --- | --- |
| count | 3214.0 | 3214.0 | 3214.0 | 3214.0 | 3214.0 |
| mean | 63.038892 | 438.208787 | 38.281269 | 17207.673836 | 0.306472 |
| std | 33.236368 | 234.569872 | 36.148326 | 7570.968246 | 0.4611 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 40.0 | 274.575 | 9.0 | 12491.9025 | 0.0 |
| 50% | 62.0 | 430.6 | 30.0 | 16943.235 | 0.0 |
| 75% | 82.0 | 571.9275 | 57.0 | 21424.7 | 1.0 |
| max | 244.0 | 1632.06 | 224.0 | 49745.73 | 1.0 |


In [4]:
# features: calls, minutes, messages, mb_used
# target: is_ultra
features = df.drop('is_ultra', axis=1)
target = df['is_ultra']

r_state = 48144495 # random state value

# Split into training, validation, and test sets
# 60% training, 20% validation, 20% test
features_train, features_split, target_train, target_split = train_test_split(features, target, test_size=0.4, random_state=r_state)
features_valid, features_test, target_valid, target_test = train_test_split(features_split, target_split, test_size=0.5, random_state=r_state)

In [5]:
# Evaluate and optimize hyperparameters for decision tree model
best_score_tree = 0
best_depth_tree = 0
for depth in range(1, 11):
    modelTree = DecisionTreeClassifier(random_state=r_state, max_depth=depth)
    modelTree.fit(features_train, target_train)
    score_tree = modelTree.score(features_valid, target_valid)
    if score_tree > best_score_tree:
        best_depth_tree = depth
        best_score_tree = score_tree

print(f'Best score = {best_score_tree} with best depth: {best_depth_tree}')

Best score = 0.7947122861586314 with best depth: 3


In [6]:
# Evaluate and optimize hyperparameters for random forest model
best_est = 0
best_score_forest = 0
best_depth_forest = 0
for est in range(10, 51, 10):
    for depth in range(1, 21):
        modelForest = RandomForestClassifier(n_estimators=est, random_state=r_state, max_depth=depth)
        modelForest.fit(features_train, target_train)
        score_forest = modelForest.score(features_valid, target_valid)
        if score_forest > best_score_forest:
            best_depth_forest = depth
            best_score_forest = score_forest
            best_est = est

print(f'Best score = {best_score_forest}, with best depth: {best_depth_forest}, with n_estimators: {best_est}')

Best score = 0.8118195956454122, with best depth: 10, with n_estimators: 40


In [7]:
# Evaluate logistic regression model with different solvers
solvers = ['lbfgs', 'liblinear']
best_solver = ''
best_score_log = 0
for solv in solvers:
    modelLog = LogisticRegression(random_state=r_state, solver=solv)
    modelLog.fit(features_train, target_train)
    score_log = modelLog.score(features_valid, target_valid)
    if score_log > best_score_log:
        best_solver = solv
        best_score_log = score_log

print(f'Best score = {best_score_log} with solver: {best_solver}')

Best score = 0.7122861586314152 with solver: lbfgs


## Results and discussion

Decision tree models with `max_depth` from 1 to 10 were evaluated. The best one had a max depth of 3 and a validation accuracy of 0.79.

Random forest models were evaluated with `n_estimators` from 10 to 50 in steps of 10 and, for each value, `max_depth` from 1 to 20. The best random forest had a max depth of 10, 40 estimators, and a validation accuracy of 0.81.

Logistic regression models were evaluated with the solvers 'lbfgs' and 'liblinear'. The best validation accuracy, 0.71, was achieved with 'lbfgs'.

Ranked by validation accuracy with these hyperparameters, the models are: Random Forest > Decision Tree > Logistic Regression.
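
The three search loops share one pattern: fit each candidate on the training set, score it on the validation set, and keep the best. A minimal sketch of that pattern as a single helper follows. The helper name `best_on_validation` and the synthetic data from `make_classification` are assumptions made for illustration; they are not part of the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier


def best_on_validation(make_model, param_grid, X_train, y_train, X_valid, y_valid):
    """Return (best_params, best_score) judged by validation accuracy.

    Hypothetical helper. Ties keep the earliest candidate because the
    comparison is a strict `>`, as in the notebook's loops.
    """
    best_params, best_score = None, -1.0
    for params in param_grid:
        model = make_model(**params).fit(X_train, y_train)
        score = model.score(X_valid, y_valid)  # accuracy for classifiers
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Synthetic four-feature binary problem standing in for the real data.
X, y = make_classification(
    n_samples=1000, n_features=4, n_informative=3, n_redundant=1, random_state=0
)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Same depth range as the decision tree search: 1 through 10.
grid = [{"max_depth": depth, "random_state": 0} for depth in range(1, 11)]
best_params, best_score = best_on_validation(
    DecisionTreeClassifier, grid, X_train, y_train, X_valid, y_valid
)
print(best_params, best_score)
```

The same helper would cover the random forest search by passing `RandomForestClassifier` and a grid of `n_estimators` and `max_depth` combinations.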

In [8]:
# Dummy classifier
model_dummy = DummyClassifier(strategy='prior').fit(features_train, target_train)
predictions_dummy = model_dummy.predict(features_test)
acc_dummy = accuracy_score(target_test, predictions_dummy)

print(f'Accuracy of minimal quality model: {acc_dummy}')

Accuracy of minimal quality model: 0.6827371695178849


In [9]:
# Evaluate the decision tree with the tuned max_depth on the test set
best_modelTree = DecisionTreeClassifier(random_state=r_state, max_depth=best_depth_tree)
best_modelTree.fit(features_train, target_train)
predictions_tree = best_modelTree.predict(features_test)
acc_tree = accuracy_score(target_test, predictions_tree)

print(f'Accuracy of decision tree model on test set: {acc_tree}')

Accuracy of decision tree model on test set: 0.8118195956454122


In [10]:
# Evaluate the random forest with the tuned hyperparameters on the test set
best_modelForest = RandomForestClassifier(n_estimators=best_est, random_state=r_state, max_depth=best_depth_forest)
best_modelForest.fit(features_train, target_train)
predictions_forest = best_modelForest.predict(features_test)
acc_forest = accuracy_score(target_test, predictions_forest)

print(f'Accuracy of random forest model on test set: {acc_forest}')

Accuracy of random forest model on test set: 0.8040435458786936


In [11]:
# Evaluate the logistic regression with the best solver on the test set
best_modelLog = LogisticRegression(random_state=r_state, solver=best_solver)
best_modelLog.fit(features_train, target_train)
predictions_log = best_modelLog.predict(features_test)
acc_log = accuracy_score(target_test, predictions_log)

print(f'Accuracy of logistic regression model on test set: {acc_log}')

Accuracy of logistic regression model on test set: 0.6842923794712286


## Results and discussion on test set

Decision Tree model accuracy on test set is 0.81.

Random Forest model accuracy on test set is 0.80.

Logistic Regression model accuracy on test set is 0.68, essentially the same as the dummy baseline (0.684 vs. 0.683). With these settings, logistic regression adds almost nothing over always predicting the majority class, so it is not a reliable model to use.
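
One thing the notebook does not try is feature scaling. The summary statistics show `mb_used` in the tens of thousands while `messages` is in the tens, and logistic regression can be sensitive to such differences. The sketch below shows how scaling could be added with a pipeline. It runs on synthetic data with one column deliberately stretched, so it only shows the mechanics; whether scaling improves accuracy on 'users_behavior.csv' is untested here.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic four-feature binary problem (assumption, not the real data).
X, y = make_classification(
    n_samples=1000, n_features=4, n_informative=3, n_redundant=1, random_state=0
)
# Stretch one column to mimic features on very different scales.
X[:, 3] *= 10_000

X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# The scaler is fitted on the training data only, inside the pipeline,
# and then applied to whatever data is passed to score() or predict().
scaled_log = make_pipeline(
    StandardScaler(),
    LogisticRegression(solver="lbfgs", random_state=0),
)
scaled_log.fit(X_train, y_train)
valid_accuracy = scaled_log.score(X_valid, y_valid)
print(f"Validation accuracy with scaling: {valid_accuracy:.3f}")
```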

On the test set the ranking becomes Decision Tree > Random Forest > Logistic Regression. The decision tree's accuracy rose from 0.79 on the validation set to 0.81 on the test set, while the random forest's fell from 0.81 to 0.80. The gap between the two on the test set is under one percentage point, which is too small to call either model clearly better. Both the decision tree and the random forest exceed 75% accuracy on the test set and beat the dummy baseline (0.68) by more than 12 percentage points.
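
To put these differences in perspective, the reported accuracies can be converted back into counts of correctly classified test rows. The sketch below assumes a test set of 643 rows. That figure is derived from the 3,214-row dataset and the two-step 60/20/20 split; the notebook does not print it. The standard error uses the simple binomial approximation, which is a rough guide rather than a formal test.

```python
import math

# Derived assumption: 20% of 3,214 rows under the two-step split.
n_test = 643

# Test accuracies as printed in the notebook outputs.
test_accuracy = {
    "decision tree": 0.8118195956454122,
    "random forest": 0.8040435458786936,
    "logistic regression": 0.6842923794712286,
    "dummy baseline": 0.6827371695178849,
}

# Number of correctly classified test rows behind each accuracy.
correct = {name: round(acc * n_test) for name, acc in test_accuracy.items()}


def standard_error(accuracy, n):
    """Binomial standard error of an accuracy measured on n rows."""
    return math.sqrt(accuracy * (1 - accuracy) / n)


tree_vs_forest = correct["decision tree"] - correct["random forest"]
log_vs_dummy = correct["logistic regression"] - correct["dummy baseline"]
se_tree = standard_error(test_accuracy["decision tree"], n_test)

print(correct)
print(tree_vs_forest)  # 5
print(log_vs_dummy)    # 1
print(round(se_tree, 4))
```

Under these assumptions the decision tree and the random forest differ by 5 test rows, and logistic regression beats the dummy baseline by a single row. The standard error of the decision tree's accuracy is roughly 1.5 percentage points, larger than its lead over the random forest.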