# Project Description

Mobile carrier Megaline has found out that many of their subscribers use legacy plans. They want to develop a model that would analyze subscribers' behavior and recommend one of Megaline's newer plans: Smart or Ultra. You have access to behavior data about subscribers who have already switched to the new plans (from the project for the Statistical Data Analysis course). For this classification task, you need to develop a model that will pick the right plan. Since you’ve already performed the data preprocessing step, you can move straight to creating the model. Develop a model with the highest possible accuracy. In this project, the threshold for accuracy is 0.75. Check the accuracy using the test dataset.

# Data Analysis and Library Loading

In [2]:
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score

In [3]:
df = pd.read_csv('https://code.s3.yandex.net/datasets/users_behavior.csv')


In [4]:
df.isna().sum()
df.info()

df['messages'] = df['messages'].astype(int) 
df['calls'] = df['calls'].astype(int) 

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3214 entries, 0 to 3213
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   calls     3214 non-null   float64
 1   minutes   3214 non-null   float64
 2   messages  3214 non-null   float64
 3   mb_used   3214 non-null   float64
 4   is_ultra  3214 non-null   int64  
dtypes: float64(4), int64(1)
memory usage: 125.7 KB


I checked for missing values to see whether any cleaning was needed. Every column has 3214 non-null entries, so this dataset needs no fixing.

# Data Splitting

In [5]:
# Split the data into training, validation and test sets.
# First hold out 25% of the rows for testing, then take 25% of the remainder for validation
# (roughly 56% training, 19% validation, 25% test).
features = df.drop(columns=['is_ultra'])
target =  df['is_ultra']
features_train, features_test, target_train, target_test = train_test_split(features, target, test_size=0.25, random_state=12345)
features_train, features_valid, target_train, target_valid = train_test_split(features_train, target_train, test_size=0.25, random_state=12345 )
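
The effect of the two-step split can be checked with a quick size calculation. The sketch below is a minimal illustration that assumes only the row count reported by df.info() (3214) and uses a stand-in index array in place of the real features. The helper name split_sizes is hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split


def split_sizes(n_rows, test_size=0.25, valid_size=0.25, seed=12345):
    """Return (train, valid, test) row counts for a two-step split."""
    rows = np.arange(n_rows)  # stand-in for the real feature table (assumption)
    # Step 1: hold out the test set.
    rest, test = train_test_split(rows, test_size=test_size, random_state=seed)
    # Step 2: carve the validation set out of what remains.
    train, valid = train_test_split(rest, test_size=valid_size, random_state=seed)
    return len(train), len(valid), len(test)


train_n, valid_n, test_n = split_sizes(3214)
print(train_n, valid_n, test_n)
```

The three counts always add up to the original row count, and the validation share is a quarter of the remaining 75%, not a quarter of the full dataset.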

# Model Tuning

In [23]:
# Logistic Regression Model
# Note: the model is fit and scored on the same validation split,
# so this accuracy is an optimistic estimate rather than a true validation score.
logistic = LogisticRegression(random_state=54321, solver='liblinear')
logistic.fit(features_valid, target_valid)

print('The Logistic Regression Model has accuracy of', logistic.score(features_valid, target_valid) * 100)

The Logistic Regression Model has accuracy of 70.64676616915423


In [24]:
# Decision Tree Model (max_depth=2)
tree = DecisionTreeClassifier(random_state=12345, max_depth=2, splitter='best', class_weight=None)
tree.fit(features_valid, target_valid)

print('The Decision Tree Model has accuracy of', tree.score(features_valid, target_valid) * 100)

The Decision Tree Model has accuracy of 78.93864013266997


In [25]:
# Decision Tree Model (max_depth=5)
tree = DecisionTreeClassifier(random_state=54321, max_depth=5, splitter='best', class_weight=None)
tree.fit(features_valid, target_valid)

print('The Decision Tree Model has accuracy of', tree.score(features_valid, target_valid) * 100)

The Decision Tree Model has accuracy of 85.57213930348259


In [26]:
# Random Forest Model (max_depth=1)
forest = RandomForestClassifier(random_state=12345, max_depth=1, min_samples_leaf=1, criterion='gini')
forest.fit(features_valid, target_valid)

print('The Random Forest Model has accuracy of', forest.score(features_valid, target_valid) * 100)

The Random Forest Model has accuracy of 76.61691542288557


In [27]:
# Random Forest Model (max_depth=7)
forest = RandomForestClassifier(random_state=12345, max_depth=7, min_samples_leaf=1, criterion='gini')
forest.fit(features_valid, target_valid)

print('The Random Forest Model has accuracy of', forest.score(features_valid, target_valid) * 100)

The Random Forest Model has accuracy of 90.54726368159204


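The Logistic Regression model has the lowest accuracy of the three model types (about 70.6%). The deeper tree-based models score higher: the Decision Tree reaches about 85.6% at a max depth of 5, and the Random Forest reaches about 90.5% at a max depth of 7. Because each model was fit and scored on the same validation split, these figures overstate how well the models generalize; the test set gives the more honest comparison.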
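
For an unbiased validation score, a model is usually fit on the training split and scored on the validation split. The following is a minimal sketch of that pattern for tuning max_depth. It runs on synthetic stand-in data from make_classification rather than the Megaline dataset, and the helper name tune_forest_depth is hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption): 4 features, binary target.
X, y = make_classification(n_samples=1000, n_features=4, n_informative=3,
                           n_redundant=1, random_state=12345)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=12345)


def tune_forest_depth(depths, seed=12345):
    """Fit on the training split, score on the validation split."""
    scores = {}
    for depth in depths:
        model = RandomForestClassifier(max_depth=depth, random_state=seed)
        model.fit(X_train, y_train)                    # learn from training rows only
        scores[depth] = model.score(X_valid, y_valid)  # accuracy on rows not used for fitting
    best = max(scores, key=scores.get)                 # depth with the highest validation accuracy
    return best, scores


best_depth, valid_scores = tune_forest_depth(range(1, 11))
print(best_depth, round(valid_scores[best_depth], 3))
```

With the project data, features_train and target_train would take the place of X_train and y_train, and features_valid and target_valid the place of X_valid and y_valid.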

# Testing Quality - Test Set

In [21]:
features_test_accuracy = features_test
predictions_test_accuracy = forest.predict(features_test_accuracy)
quality = accuracy_score(target_test, predictions_test_accuracy)
quality

print('The Random Forest Model has accuracy of', quality * 100,)

The Random Forest Model has accuracy of 79.47761194029852


The threshold for accuracy in this project is 0.75. The Random Forest model reaches an accuracy of about 79.5% on the test set, which qualifies. Out of curiosity, I also checked the Decision Tree model (that run is not shown here). It performed well, but the Random Forest slightly outperformed it.

In [28]:
features_test_accuracy = features_test
predictions_test_accuracy = forest.predict(features_test_accuracy)
quality = accuracy_score(target_test, predictions_test_accuracy)


In [30]:
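# Note: precision_score expects (y_true, y_pred). Here the predictions are passed first,
# so the value computed is the recall for the Ultra class, not its precision.
precision = precision_score(forest.predict(features_test_accuracy), target_test)

print('The Random Forest Model has the precision of', precision * 100)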

The Random Forest Model has the precision of 48.13278008298755


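The reported precision of the Random Forest model is very low. However, precision_score expects the true labels first and the predictions second. With the arguments in the order used above, the 48.1% figure is actually the recall for the Ultra class: the model identifies fewer than half of the actual Ultra subscribers in the test set.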
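
The argument order of precision_score matters, because the function takes the true labels first and the predictions second. The sketch below uses a tiny hand-made example (not project data) to show that reversing the arguments returns the recall instead of the precision.

```python
from sklearn.metrics import precision_score, recall_score

# Tiny hand-made example (not project data): 1 = Ultra, 0 = Smart.
y_true = [1, 1, 1, 1, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0]

precision = precision_score(y_true, y_pred)  # TP / (TP + FP) = 2 / 3
recall = recall_score(y_true, y_pred)        # TP / (TP + FN) = 2 / 4
swapped = precision_score(y_pred, y_true)    # arguments reversed: equals recall

print(precision, recall, swapped)
```

Here two of the three predicted Ultra labels are correct (precision 2/3), while only two of the four actual Ultra rows are found (recall 1/2), and the reversed call reproduces the recall.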

# Conclusion

The dataset had enough relevant data for the tree-based models to clear the 0.75 accuracy threshold, while Logistic Regression fell short at about 70.6%.

Decision Tree and Random Forest clearly outperformed Logistic Regression on the validation set. On the test set, Random Forest slightly outperformed the Decision Tree, with an accuracy of about 79.5%.