# Selecting the right mobile plan

In this project, we will use machine learning algorithms to develop a model that analyzes subscribers' behavior and recommends one of Megaline's newer plans: Smart or Ultra.

First, we will load the libraries and the data that we will use in this project.

In [1]:
# Loading all required libraries
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier 
from sklearn.linear_model import LogisticRegression 
from sklearn.model_selection import train_test_split 
from sklearn.metrics import accuracy_score 

In [2]:
# Loading the data file into a DataFrame
df = pd.read_csv('/datasets/users_behavior.csv')

We will display general data info and a sample of the data.

In [3]:
# printing the general/summary information about the DataFrame
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3214 entries, 0 to 3213
Data columns (total 5 columns):
calls       3214 non-null float64
minutes     3214 non-null float64
messages    3214 non-null float64
mb_used     3214 non-null float64
is_ultra    3214 non-null int64
dtypes: float64(4), int64(1)
memory usage: 125.7 KB


In [4]:
# printing a sample of data
df.head(10)

   calls  minutes  messages   mb_used  is_ultra
0   40.0   311.90      83.0  19915.42         0
1   85.0   516.75      56.0  22696.96         0
2   77.0   467.66      86.0  21060.45         0
3  106.0   745.53      81.0   8437.39         1
4   66.0   418.74       1.0  14502.75         0
5   58.0   344.56      21.0  15823.37         0
6   57.0   431.64      20.0   3738.90         1
7   15.0   132.40       6.0  21911.60         0
8    7.0    43.39       3.0   2538.67         1
9   90.0   665.41      38.0  17358.61         0


The data were preprocessed beforehand. As the summary shows, every column has an appropriate numeric type and there are no missing values, so we can proceed to building the models.
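The same checks can also be run programmatically. The sketch below is only an illustration: the small `sample` frame is a hypothetical stand-in built from the first four rows shown above, and the variable names are not part of the original notebook.

```python
import pandas as pd

# Hypothetical miniature of the data set (first four rows of the sample above).
sample = pd.DataFrame({
    'calls': [40.0, 85.0, 77.0, 106.0],
    'minutes': [311.90, 516.75, 467.66, 745.53],
    'messages': [83.0, 56.0, 86.0, 81.0],
    'mb_used': [19915.42, 22696.96, 21060.45, 8437.39],
    'is_ultra': [0, 0, 0, 1],
})

# Number of missing values per column; all zeros means nothing to impute.
missing_per_column = sample.isna().sum()

# Number of fully duplicated rows.
duplicate_rows = int(sample.duplicated().sum())

# Share of each class in the target; an imbalanced target makes
# plain accuracy harder to interpret.
class_shares = sample['is_ultra'].value_counts(normalize=True)

print(missing_per_column)
print(duplicate_rows)
print(class_shares)
```

Running the same three checks on the full `df` would show whether the classes are balanced, which matters later when accuracy is compared with a baseline.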

Before building the models, we will split the data into training, validation, and test sets: 60% of the original data set for training and 20% each for validation and testing. We do this in two steps: first we hold out 20% for the test set, then we take 25% of the remaining 80% (that is, 20% of the original) for validation.

In [5]:
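# splitting the data into training, validation, and test sets
# step 1: hold out 20% of the data for the test set
df_train_valid, df_test = train_test_split(df, test_size=0.2, random_state=12345)
# step 2: 25% of the remaining 80% (20% of the original) goes to validation
df_train, df_valid = train_test_split(df_train_valid, test_size=0.25, random_state=12345)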

Let's print the shape of each data set to verify that the data was split in the correct proportions.

In [6]:
print(df_train.shape)
print(df_valid.shape)
print(df_test.shape)

(1928, 5)
(643, 5)
(643, 5)


The data sets have the expected proportions: 1928, 643, and 643 rows are roughly 60%, 20%, and 20% of the 3214 observations.
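The arithmetic behind the two-step split can be checked on a stand-in frame. This is a minimal sketch, assuming a hypothetical `data` frame with the same number of rows as the real data set; the column `x` and the alternating target are made up for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in frame with 3214 rows, like the data set above.
data = pd.DataFrame({'x': range(3214), 'is_ultra': [0, 1] * 1607})

# Step 1: hold out 20% of all rows for the test set.
rest, test = train_test_split(data, test_size=0.2, random_state=12345)
# Step 2: take 25% of the remaining 80%, i.e. 0.25 * 0.8 = 20% of all rows.
train, valid = train_test_split(rest, test_size=0.25, random_state=12345)

# Share of the original rows that ended up in each part.
shares = {name: len(part) / len(data)
          for name, part in [('train', train), ('valid', valid), ('test', test)]}
print(shares)
```

If the class shares should be the same in every part, `train_test_split` also accepts a `stratify` argument; whether that would change the results here has not been checked.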

Next, for each set, we will separate the features from the target.

In [7]:
# putting features and target into separate sets
train_features = df_train.drop(['is_ultra'], axis=1)
train_target = df_train['is_ultra']
valid_features = df_valid.drop(['is_ultra'], axis=1)
valid_target = df_valid['is_ultra']
test_features = df_test.drop(['is_ultra'], axis=1)
test_target = df_test['is_ultra']

As the next step, we will investigate the quality of different models by changing hyperparameters. We will start with the Decision Tree and train it with maximum depths from 2 to 99, evaluating each model on the validation data set.

In [8]:
# creating an empty dictionary for storing the accuracy of each model
decision_tree_quality = {}

# declaring variables for storing the best max_depth and the best accuracy
best_max_depth = 0
best_accuracy = 0

# training models with maximum depths from 2 to 99 (range excludes the upper bound)
for i in range(2, 100):
    model = DecisionTreeClassifier(random_state=12345, max_depth=i)
    model.fit(train_features, train_target)
    # calculating the accuracy of the model on the validation data set
    predictions = model.predict(valid_features)
    accuracy = accuracy_score(valid_target, predictions)
    # if this accuracy beats the previous best, updating best max_depth and best accuracy
    if accuracy > best_accuracy:
        best_max_depth = i
        best_accuracy = accuracy
    # adding the model accuracy to the dictionary of results
    decision_tree_quality['max_depth=' + str(i)] = 'accuracy - ' + str(accuracy)

# printing the best accuracy and the accuracy of all models
print(f'Best accuracy {best_accuracy} is achieved with max_depth={best_max_depth}')
decision_tree_quality

Best accuracy 0.7744945567651633 is achieved with max_depth=7


{'max_depth=2': 'accuracy - 0.7573872472783826',
 'max_depth=3': 'accuracy - 0.7651632970451011',
 'max_depth=4': 'accuracy - 0.7636080870917574',
 'max_depth=5': 'accuracy - 0.7589424572317263',
 'max_depth=6': 'accuracy - 0.7573872472783826',
 'max_depth=7': 'accuracy - 0.7744945567651633',
 'max_depth=8': 'accuracy - 0.7667185069984448',
 'max_depth=9': 'accuracy - 0.7620528771384136',
 'max_depth=10': 'accuracy - 0.7713841368584758',
 'max_depth=11': 'accuracy - 0.7589424572317263',
 'max_depth=12': 'accuracy - 0.7558320373250389',
 'max_depth=13': 'accuracy - 0.749611197511664',
 'max_depth=14': 'accuracy - 0.7573872472783826',
 'max_depth=15': 'accuracy - 0.7527216174183515',
 'max_depth=16': 'accuracy - 0.749611197511664',
 'max_depth=17': 'accuracy - 0.7387247278382582',
 'max_depth=18': 'accuracy - 0.7418351477449455',
 'max_depth=19': 'accuracy - 0.7356143079315708',
 'max_depth=20': 'accuracy - 0.7293934681181959',
 'max_depth=21': 'accuracy - 0.7325038880248833',
 ...

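As can be seen, the best validation accuracy, about 0.774, is achieved with max_depth=7. Accuracy tends to decline at larger depths, which suggests that deeper trees overfit the training data.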
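One way to see how tree depth relates to overfitting is to record the training accuracy next to the validation accuracy for each depth. The sketch below uses synthetic data from `make_classification` as a stand-in (an assumption, not the Megaline data set), and the depth range 2 to 20 is an arbitrary choice for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with some label noise (not the Megaline data set).
X, y = make_classification(n_samples=1000, n_features=4, n_informative=3,
                           n_redundant=0, flip_y=0.2, random_state=12345)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=12345)

# One (max_depth, training accuracy, validation accuracy) tuple per model.
scores = []
for depth in range(2, 21):
    tree = DecisionTreeClassifier(random_state=12345, max_depth=depth)
    tree.fit(X_train, y_train)
    scores.append((depth,
                   tree.score(X_train, y_train),
                   tree.score(X_valid, y_valid)))

# Pick the depth with the highest validation accuracy.
best_depth, best_train_acc, best_valid_acc = max(scores, key=lambda row: row[2])
print(f'best max_depth={best_depth}, '
      f'train={best_train_acc:.3f}, valid={best_valid_acc:.3f}')
```

A training accuracy that keeps rising with depth while the validation accuracy stalls or falls is the usual sign that the extra depth is fitting noise.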

Next, we will train the Random Forest with different numbers of estimators, from 2 to 99, and evaluate each model on the validation data set.

In [9]:
# creating an empty dictionary for storing the accuracy of each model
random_forest_quality = {}

# declaring variables for storing the best n_estimators and the best accuracy
best_n_estimators = 0
best_accuracy = 0

# training models with numbers of estimators from 2 to 99 (range excludes the upper bound)
for i in range(2, 100):
    model = RandomForestClassifier(random_state=12345, n_estimators=i)
    model.fit(train_features, train_target)
    # calculating the accuracy of the model on the validation data set
    predictions = model.predict(valid_features)
    accuracy = accuracy_score(valid_target, predictions)
    # if this accuracy beats the previous best, updating best n_estimators and best accuracy
    if accuracy > best_accuracy:
        best_n_estimators = i
        best_accuracy = accuracy
    # adding the model accuracy to the dictionary of results
    random_forest_quality['n_estimators=' + str(i)] = 'accuracy - ' + str(accuracy)

# printing the best accuracy and the accuracy of all models
print(f'Best accuracy {best_accuracy} is achieved with n_estimators={best_n_estimators}')
random_forest_quality

Best accuracy 0.7993779160186625 is achieved with n_estimators=65


{'n_estimators=2': 'accuracy - 0.7573872472783826',
 'n_estimators=3': 'accuracy - 0.744945567651633',
 'n_estimators=4': 'accuracy - 0.7651632970451011',
 'n_estimators=5': 'accuracy - 0.7620528771384136',
 'n_estimators=6': 'accuracy - 0.7698289269051322',
 'n_estimators=7': 'accuracy - 0.7713841368584758',
 'n_estimators=8': 'accuracy - 0.7869362363919129',
 'n_estimators=9': 'accuracy - 0.7838258164852255',
 'n_estimators=10': 'accuracy - 0.7884914463452566',
 'n_estimators=11': 'accuracy - 0.7807153965785381',
 'n_estimators=12': 'accuracy - 0.7822706065318819',
 'n_estimators=13': 'accuracy - 0.7776049766718507',
 'n_estimators=14': 'accuracy - 0.7853810264385692',
 'n_estimators=15': 'accuracy - 0.7838258164852255',
 'n_estimators=16': 'accuracy - 0.7838258164852255',
 'n_estimators=17': 'accuracy - 0.7776049766718507',
 'n_estimators=18': 'accuracy - 0.7869362363919129',
 'n_estimators=19': 'accuracy - 0.7869362363919129',
 'n_estimators=20': 'accuracy - 0.7900466562986003',
 ...

As can be seen, the best validation accuracy, about 0.799, is achieved with n_estimators=65.
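The search above varies only the number of estimators. As a possible extension, two hyperparameters can be varied together. This is a minimal sketch on synthetic stand-in data (an assumption, not the Megaline data set); the grid values are arbitrary and chosen only to keep the example fast.

```python
from itertools import product

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (not the Megaline data set).
X, y = make_classification(n_samples=600, n_features=4, n_informative=3,
                           n_redundant=0, random_state=12345)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=12345)

# Illustrative grids; the values are arbitrary choices for this sketch.
n_estimators_grid = [10, 30, 50]
max_depth_grid = [3, 6, 9]

# Validation accuracy for every (n_estimators, max_depth) combination.
results = {}
for n_estimators, max_depth in product(n_estimators_grid, max_depth_grid):
    forest = RandomForestClassifier(random_state=12345,
                                    n_estimators=n_estimators,
                                    max_depth=max_depth)
    forest.fit(X_train, y_train)
    results[(n_estimators, max_depth)] = forest.score(X_valid, y_valid)

# Combination with the highest validation accuracy.
best_params = max(results, key=results.get)
print(f'best (n_estimators, max_depth)={best_params}, '
      f'accuracy={results[best_params]:.3f}')
```

Storing numeric accuracies keyed by the parameter tuple, rather than formatted strings, makes it easy to pick the best combination with `max`.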

Finally, we will train Logistic Regression with the liblinear solver and evaluate the model on the validation data set.

In [10]:
#training the model
model = LogisticRegression(random_state=12345, solver='liblinear') 
model.fit(train_features, train_target) 

#calculating accuracy of the model on the validation data set 
predictions=model.predict(valid_features)
accuracy = accuracy_score(valid_target, predictions)

#printing accuracy
accuracy

0.6967340590979783

Based on the above calculations, our best model is Random Forest with 65 estimators (validation accuracy of about 0.799), followed by the Decision Tree with max_depth=7 (about 0.774). Logistic Regression is the least accurate (about 0.697).
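Linear models can be sensitive to features measured on very different scales, as call counts and megabytes are here. One option that could be tried is to standardize the features before fitting. The sketch below only illustrates the mechanics on synthetic stand-in data with artificially rescaled columns (an assumption); it makes no claim about how scaling would affect the results on the real data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (not the Megaline data set).
X, y = make_classification(n_samples=600, n_features=4, n_informative=3,
                           n_redundant=0, random_state=12345)
# Put the columns on very different scales, loosely mimicking counts vs. megabytes.
X = X * np.array([1.0, 10.0, 1.0, 1000.0])
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=12345)

# Logistic Regression on the raw features.
raw_model = LogisticRegression(random_state=12345, solver='liblinear')
raw_model.fit(X_train, y_train)
raw_accuracy = raw_model.score(X_valid, y_valid)

# Same model, but with features standardized inside a pipeline, so the
# scaler is fitted on the training data only.
scaled_model = make_pipeline(
    StandardScaler(),
    LogisticRegression(random_state=12345, solver='liblinear'))
scaled_model.fit(X_train, y_train)
scaled_accuracy = scaled_model.score(X_valid, y_valid)

print(f'raw: {raw_accuracy:.3f}, scaled: {scaled_accuracy:.3f}')
```

Wrapping the scaler and the model in one pipeline keeps validation data out of the scaling statistics.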

In the next step, we will check the quality of our best model using the test set.

In [11]:
#training the model
model = RandomForestClassifier(random_state=12345, n_estimators=65) 
model.fit(train_features, train_target) 

#calculating accuracy of the model on the test data set
predictions=model.predict(test_features)
accuracy = accuracy_score(test_target, predictions)

#printing accuracy
print(f'Accuracy: {accuracy}')

Accuracy: 0.7900466562986003


As can be seen, the accuracy of our best model on the test data set (about 0.790) is only slightly lower than on the validation data set (about 0.799), which suggests that the model generalizes reasonably well to unseen data.
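To judge the fit more fully, the accuracy on the training set can be reported next to the validation and test accuracies. The following is a minimal sketch on synthetic stand-in data (an assumption, not the Megaline data set); the split mirrors the 60/20/20 scheme used above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with some label noise (not the Megaline data set).
X, y = make_classification(n_samples=1000, n_features=4, n_informative=3,
                           n_redundant=0, flip_y=0.2, random_state=12345)

# 60/20/20 split, done in two steps as in the notebook.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=12345)
X_train, X_valid, y_train, y_valid = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=12345)

forest = RandomForestClassifier(random_state=12345, n_estimators=65)
forest.fit(X_train, y_train)

# Accuracy of the same fitted model on each of the three sets.
accuracies = {
    'train': forest.score(X_train, y_train),
    'valid': forest.score(X_valid, y_valid),
    'test': forest.score(X_test, y_test),
}
print(accuracies)
```

Similar validation and test accuracies indicate that hyperparameter selection did not overfit the validation set, while a large gap between training and test accuracy would point to overfitting of the training data.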

As a final step, we will run a sanity check to ensure that the model's accuracy is better than chance. As the baseline, we will assume a model that assigns "0" or "1" at random, with the probability of each label equal to its share in the test set. (A model that guessed with a fixed 50/50 chance would be correct half of the time regardless of the class balance.) The expected accuracy of this baseline is calculated as: ((number of 1s) ⋅ (portion of correctly guessed 1s) + (number of 0s) ⋅ (portion of correctly guessed 0s)) / (number of observations). The portion of correctly guessed 1s equals the probability of guessing 1, which is (number of 1s) / (number of observations). The portion of correctly guessed 0s is calculated similarly.

In [12]:
# getting the number of observations
number_of_observation = len(df_test)

# getting the number of 1s (Ultra) and 0s (Smart)
number_of_ultra = len(df_test[df_test['is_ultra'] == 1])
number_of_smart = len(df_test[df_test['is_ultra'] == 0])

# calculating the expected accuracy of the random baseline
print((number_of_ultra * (number_of_ultra / number_of_observation) + number_of_smart * (number_of_smart / number_of_observation)) / number_of_observation)


0.576189566306848


This confirms that our best model, with a test accuracy of about 0.790, performs better than the random baseline (about 0.576).
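The baseline calculation above can be written as a small helper, alongside a second common baseline that always predicts the most frequent class. This is a sketch with hypothetical function names and a toy label list; neither is part of the original notebook.

```python
from collections import Counter


def proportional_guess_accuracy(labels):
    """Expected accuracy when each class is guessed with probability equal to its share."""
    n = len(labels)
    # For each class: (share of the class) * (probability of guessing it).
    return sum((count / n) ** 2 for count in Counter(labels).values())


def majority_class_accuracy(labels):
    """Accuracy of always predicting the most frequent class."""
    n = len(labels)
    return max(Counter(labels).values()) / n


# Toy example: 7 zeros and 3 ones.
toy_labels = [0] * 7 + [1] * 3
print(proportional_guess_accuracy(toy_labels))  # 0.7**2 + 0.3**2, about 0.58
print(majority_class_accuracy(toy_labels))      # 7 / 10
```

With imbalanced classes the majority-class baseline is the stricter of the two, so it is a useful second reference point. scikit-learn's `DummyClassifier` offers comparable `stratified` and `most_frequent` strategies.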

# General conclusion

In this project, we used machine learning algorithms to develop a model that analyzes subscribers' behavior and recommends one of Megaline's newer plans: Smart or Ultra.

First, we inspected the data and confirmed that all columns have the correct type and that there are no missing values. Then, we split the data into training, validation, and test sets: 60% of the original data set for training and 20% each for validation and testing.

Using the training and validation data sets, we investigated the quality of different models by changing hyperparameters. In this case, the best model was Random Forest with 65 estimators, and the least accurate model was Logistic Regression. Next, we checked the quality of our best model using the test set. Its accuracy on the test data set (about 0.790) was only slightly lower than on the validation data set (about 0.799), which suggests that the model generalizes reasonably well.

Finally, we ran a sanity check on the best model and confirmed that its accuracy is better than that of a random baseline.