# Megaline

Mobile carrier Megaline has found out that many of their subscribers use legacy plans. They want to develop a model that would analyze subscribers' behavior and recommend one of Megaline's newer plans: Smart or Ultra.

The goal is a classification model that picks the right plan for each customer with the highest possible accuracy. In this project, the accuracy threshold is 0.75.

## Preparation 

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.model_selection import TimeSeriesSplit, GridSearchCV

from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.dummy import DummyClassifier

In [2]:
df = pd.read_csv('/datasets/users_behavior.csv')

In [3]:
# Previewing the first rows of df
df.head()

Unnamed: 0,calls,minutes,messages,mb_used,is_ultra
0,40.0,311.9,83.0,19915.42,0
1,85.0,516.75,56.0,22696.96,0
2,77.0,467.66,86.0,21060.45,0
3,106.0,745.53,81.0,8437.39,1
4,66.0,418.74,1.0,14502.75,0


In [4]:
# Checking general info on df
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3214 entries, 0 to 3213
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   calls     3214 non-null   float64
 1   minutes   3214 non-null   float64
 2   messages  3214 non-null   float64
 3   mb_used   3214 non-null   float64
 4   is_ultra  3214 non-null   int64  
dtypes: float64(4), int64(1)
memory usage: 125.7 KB


In [5]:
# Convert unnecessary floats into ints
df['calls'] = df['calls'].astype(np.int64)
df['messages'] = df['messages'].astype(np.int64)
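
Before casting floats to integers it can help to confirm that no value carries a fractional part, since `astype` would silently truncate it. The sketch below is a minimal illustration on a hypothetical three-row frame (`sample` and the helper `is_whole_number_column` are assumptions, not part of the original notebook).

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the count columns of users_behavior.csv
sample = pd.DataFrame({
    'calls': [40.0, 85.0, 77.0],
    'messages': [83.0, 56.0, 86.0],
})


def is_whole_number_column(series: pd.Series) -> bool:
    """Return True when every value in the series has no fractional part."""
    return bool(np.isclose(series % 1, 0).all())


# Check each column that is about to be cast to int64
count_columns = ['calls', 'messages']
lossless = {col: is_whole_number_column(sample[col]) for col in count_columns}
print(lossless)

# Cast only when the conversion loses no information
for col in count_columns:
    if lossless[col]:
        sample[col] = sample[col].astype(np.int64)
```

If a column fails the check, rounding or keeping the float type is a safer choice than truncating.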

In [6]:
# Checking NaNs
print('Amount of NaNs:', df.isna().sum().sum())

Amount of NaNs: 0


In [7]:
# Checking duplicates
print('Total Duplicate Rows:', df.duplicated().sum())

Total Duplicate Rows: 0


In [8]:
# Creating model variables and checking their sizes for errors
target_train, target_test, features_train, features_test = train_test_split(
    df['is_ultra'], df.drop(['is_ultra'], axis=1), train_size=0.8, test_size=0.1)

print('Training set shapes:', features_train.shape, target_train.shape)
print('Testing set shapes:', features_test.shape, target_test.shape)

Training set shapes: (2571, 4) (2571,)
Testing set shapes: (322, 4) (322,)


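Here we split the data into training and test sets. There's no need for a separate validation set since GridSearchCV cross-validates on the training set. Note that `train_size=0.8` and `test_size=0.1` together use only 90% of the rows (2571 + 322 of 3214), so the remaining 321 rows end up in neither set.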
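
As a point of comparison, here is a minimal sketch of a split that uses every row, fixes the random seed, and keeps the class ratio equal in both parts. The synthetic `demo` frame, the 80/20 ratio, and `random_state=123` are assumptions chosen for illustration; only the column names come from the dataset.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the same column names as users_behavior.csv
rng = np.random.default_rng(123)
n_rows = 1000
demo = pd.DataFrame({
    'calls': rng.integers(0, 200, n_rows),
    'minutes': rng.uniform(0, 1500, n_rows),
    'messages': rng.integers(0, 200, n_rows),
    'mb_used': rng.uniform(0, 40000, n_rows),
    'is_ultra': (rng.random(n_rows) < 0.3).astype(int),
})

features = demo.drop(columns='is_ultra')
target = demo['is_ultra']

# test_size alone is enough: the training set receives the remaining 80%.
# stratify=target keeps the Smart/Ultra ratio the same in both parts.
features_train, features_test, target_train, target_test = train_test_split(
    features, target, test_size=0.2, random_state=123, stratify=target)

print('Training set shapes:', features_train.shape, target_train.shape)
print('Testing set shapes:', features_test.shape, target_test.shape)
```

Fixing `random_state` makes the reported scores reproducible between runs.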

## Analysis

In [9]:
# is_ultra == 0 marks the Smart plan, is_ultra == 1 marks the Ultra plan
print('{:.02%} of customers use Smart plan'.format(len(df.loc[df['is_ultra'] == 0]) / len(df)))
print('{:.02%} of customers use Ultra plan'.format(len(df.loc[df['is_ultra'] == 1]) / len(df)))

69.35% of customers use Smart plan
30.65% of customers use Ultra plan


In [10]:
print('Average amount of messages sent per month {:}'.format(df['messages'].mean().round(1)))
print('Max amount of messages sent per month by a customer {:}'.format(df['messages'].max()))

Average amount of messages sent per month 38.3
Max amount of messages sent per month by a customer 224


An average of 38 messages sent per month seems low and could indicate an older customer base. The maximum of 224 messages per month also looks implausible. I looked for reliable statistics to back this up and found many sites, but none credible enough to cite. Even so, the figures remain suspicious and could indicate a problem with the service, or an issue with how messages were counted for this dataset.
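
One way to put numbers behind that suspicion is to look at the shape of the distribution rather than only the mean and maximum. The sketch below uses a small hypothetical `messages` series (not the real data) and the common 1.5 × IQR rule to flag unusually high counts; the rule is an illustrative choice, not something the dataset prescribes.

```python
import pandas as pd

# Hypothetical monthly message counts, for illustration only
messages = pd.Series([0, 0, 3, 12, 25, 38, 41, 60, 95, 224], name='messages')

# Median and upper percentiles show how typical the maximum is
summary = messages.quantile([0.5, 0.9, 0.99])

# Share of customers who sent no messages at all
share_zero = (messages == 0).mean()

# Flag values above the upper fence of the 1.5 * IQR rule
q1, q3 = messages.quantile([0.25, 0.75])
upper_fence = q3 + 1.5 * (q3 - q1)
outliers = messages[messages > upper_fence]

print(summary)
print('Share of zero-message customers:', share_zero)
print('Values above the upper fence:', outliers.tolist())
```

A large share of zeros would pull the mean down, which is worth ruling out before drawing conclusions about the customer base.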

## Creating ML Models

In [11]:
best_accuracy = []

### Decision Tree

In [12]:
# Tuning Decision Tree classifier parameters using GridSearchCV
param_grid1 = {
    # 'auto' dropped: for classifiers it was an alias of 'sqrt' and newer scikit-learn releases reject it
    'max_features': ['sqrt', 'log2'],
    'max_depth': [int(x) for x in np.linspace(3, 15, num = 5)],
    'criterion' : ['gini', 'entropy']
}

ccv = TimeSeriesSplit(n_splits=2)
DTC_rfc = GridSearchCV(estimator=DecisionTreeClassifier(random_state=123), param_grid=param_grid1, cv=ccv)
DTC_rfc.fit(features_train, target_train)
predictions_test = DTC_rfc.predict(features_test)
best_accuracy.append(accuracy_score(target_test, predictions_test))

CPU times: user 368 ms, sys: 3.35 ms, total: 371 ms
Wall time: 370 ms


### Random Forest

In [13]:
# Tuning Random Forest classifier parameters using GridSearchCV
param_grid2 = {
    'n_estimators': [int(x) for x in np.linspace(start = 100, stop = 300, num = 5)],
    # 'auto' dropped: for classifiers it was an alias of 'sqrt' and newer scikit-learn releases reject it
    'max_features': ['sqrt'],
    'max_depth': [int(x) for x in np.linspace(3, 15, num = 5)],
}

ccv = TimeSeriesSplit(n_splits=2)
RFR_rfc = GridSearchCV(estimator=RandomForestClassifier(random_state=123), param_grid=param_grid2, cv=ccv)
RFR_rfc.fit(features_train, target_train)
predictions_test = RFR_rfc.predict(features_test)
best_accuracy.append(accuracy_score(target_test, predictions_test))

CPU times: user 45.3 s, sys: 386 ms, total: 45.7 s
Wall time: 46 s


### Logistic Regression

In [14]:
# Using Logistic Regression with GridSearchCV for the cross-validation
param_grid3 = {}

ccv = TimeSeriesSplit(n_splits=2)
LR = GridSearchCV(estimator=LogisticRegression(random_state=123), param_grid=param_grid3, cv=ccv)
LR.fit(features_train, target_train)
predictions_test = LR.predict(features_test)
best_accuracy.append(accuracy_score(target_test, predictions_test))

CPU times: user 218 ms, sys: 394 ms, total: 611 ms
Wall time: 558 ms


In [15]:
best_accuracy_df = pd.DataFrame(best_accuracy, index = ['Decision Tree', 'Random Forest', 'Logistic Regression'])
best_accuracy_df.columns = ['Accuracy Score']
best_accuracy_df.sort_values(by='Accuracy Score', ascending=False)

Unnamed: 0,Accuracy Score
Random Forest,0.798137
Decision Tree,0.78882
Logistic Regression,0.708075


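Here we use GridSearchCV to tune each of the three classifiers on the training set and then compare their accuracy on the test set. Random Forest scores highest (0.798), narrowly ahead of Decision Tree (0.789). Both clear the 0.75 threshold, while Logistic Regression (0.708) does not.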
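
The cells above pass `TimeSeriesSplit` to GridSearchCV, a splitter designed for time-ordered data. Since the rows here carry no timestamps, one alternative is stratified K-fold cross-validation, picking the winner by its cross-validated score and touching the test set only once. The sketch below illustrates that idea on synthetic data from `make_classification`; the grids, fold count, and seeds are assumptions, and this is not the notebook's actual method.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in for the four behaviour features
X, y = make_classification(n_samples=600, n_features=4, n_informative=3,
                           n_redundant=1, weights=[0.7, 0.3], random_state=123)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=123)

# Shuffled stratified folds keep the class ratio equal in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=123)

# Deliberately small, hypothetical grids to keep the example fast
candidates = {
    'Decision Tree': (DecisionTreeClassifier(random_state=123),
                      {'max_depth': [3, 6, 9]}),
    'Random Forest': (RandomForestClassifier(random_state=123),
                      {'n_estimators': [50], 'max_depth': [3, 6, 9]}),
}

searches = {}
for name, (estimator, grid) in candidates.items():
    search = GridSearchCV(estimator, grid, cv=cv, scoring='accuracy')
    search.fit(X_train, y_train)
    searches[name] = search
    print(name, 'CV accuracy:', round(search.best_score_, 3), search.best_params_)

# Select on cross-validated accuracy, then evaluate on the test set once
best_name = max(searches, key=lambda name: searches[name].best_score_)
test_accuracy = searches[best_name].score(X_test, y_test)
print('Selected:', best_name, 'test accuracy:', round(test_accuracy, 3))
```

Selecting on `best_score_` keeps the test set as an untouched estimate of how the chosen model performs on new customers.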

In [16]:
dummy_model = DummyClassifier(strategy="most_frequent")
dummy_model.fit(features_train, target_train)
prediction_test = dummy_model.predict(features_test)
print('Dummy Model Accuracy:', accuracy_score(target_test, prediction_test))

Dummy Model Accuracy: 0.6583850931677019


Here we sanity-check the chosen model against a baseline that always predicts the most frequent class. The Random Forest's test accuracy (0.798) is well above the baseline's 0.658, so the model has learned more than the class imbalance alone.
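
To make the baseline concrete: a `most_frequent` dummy scores exactly the share of the test set that belongs to the training set's majority class. The sketch below shows this on hypothetical label arrays (the 70/30 and 13/7 counts are made up for illustration).

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Hypothetical labels: 0 = Smart (majority), 1 = Ultra
y_train = np.array([0] * 70 + [1] * 30)
y_test = np.array([0] * 13 + [1] * 7)

# The dummy ignores the features, so placeholder columns are enough
X_train = np.zeros((len(y_train), 1))
X_test = np.zeros((len(y_test), 1))

dummy = DummyClassifier(strategy='most_frequent').fit(X_train, y_train)
baseline_accuracy = accuracy_score(y_test, dummy.predict(X_test))

# Share of the majority class (0) in the test labels
majority_share = np.mean(y_test == 0)

print('Baseline accuracy:', baseline_accuracy)
print('Majority-class share in test set:', majority_share)
```

Any useful model has to beat this number, which is why it is a stricter reference point than a coin flip on imbalanced data.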

## Conclusion

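We've chosen the Random Forest classifier because it was the most accurate of the three models, with a test accuracy of 0.798, above the project threshold of 0.75. The sanity check shows it also beats the most-frequent-class baseline (0.658), so of the models tested it is the best choice for recommending plans to Megaline subscribers.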