I have a dataset from the mobile carrier Megaline, and my goal is to develop a model with the highest possible accuracy that recommends the best plan based on each customer's behavior.
First, I will import the libraries needed for the task.

In [127]:
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

Next, I will import the dataset and take a look at it.

In [128]:
df = pd.read_csv('/datasets/users_behavior.csv')

In [129]:
print(df.shape)

(3214, 5)


In [130]:
display(df.head(5))

Unnamed: 0,calls,minutes,messages,mb_used,is_ultra
0,40.0,311.9,83.0,19915.42,0
1,85.0,516.75,56.0,22696.96,0
2,77.0,467.66,86.0,21060.45,0
3,106.0,745.53,81.0,8437.39,1
4,66.0,418.74,1.0,14502.75,0


I will check for duplicate rows in the dataframe. An empty result means there are none.

In [131]:
duplicates = df[df.duplicated()]
duplicates

Unnamed: 0,calls,minutes,messages,mb_used,is_ultra
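
As a small illustration of how this duplicate check behaves, the sketch below runs it on a made-up three-row frame in which the third row repeats the first. The frame `toy` and its values are assumptions for demonstration only and are not taken from the Megaline data.

```python
import pandas as pd

# Toy stand-in for the Megaline data; the values are invented for illustration.
toy = pd.DataFrame({
    "calls": [40.0, 85.0, 40.0],
    "minutes": [311.9, 516.75, 311.9],
    "is_ultra": [0, 0, 0],
})

# duplicated() flags every row that repeats an earlier row; sum() counts them.
n_duplicates = int(toy.duplicated().sum())
print("Duplicate rows:", n_duplicates)

# drop_duplicates() keeps the first occurrence of each repeated row.
deduplicated = toy.drop_duplicates().reset_index(drop=True)
print("Shape after dropping duplicates:", deduplicated.shape)
```

Counting with `duplicated().sum()` gives a single number that is easier to read than an empty table when the dataframe has no duplicates.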


Now I will separate the features from the target variable, 'is_ultra'.

In [132]:
x = df.drop('is_ultra', axis=1)
y = df['is_ultra']

I will show the shape of the features and the target variable.

In [133]:
print(x.shape)
print(y.shape)

(3214, 4)
(3214,)


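I will split the data into training, validation, and test sets with the "train_test_split" function. I first hold out 40% of the rows (test_size=0.4) and then split that hold-out in half (test_size=0.5), which gives a 60/20/20 split. This keeps most of the data for training while leaving enough rows for validation and testing. I will use '54321' for the random_state so the split can be reproduced.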

In [134]:
x_train, x_temp, y_train, y_temp = train_test_split(x, y, test_size=0.4, random_state=54321)
x_valid, x_test, y_valid, y_test = train_test_split(x_temp, y_temp, test_size=0.5, random_state=54321)
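
To make the arithmetic of the two-step split concrete, here is a minimal sketch that applies the same two calls to placeholder arrays with 3214 rows and prints the resulting proportions. The arrays `features` and `target` are stand-ins, not the real data; only the row count matches the dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays with the same number of rows as the dataset (3214).
features = np.arange(3214 * 4).reshape(3214, 4)
target = np.arange(3214) % 2

# Step 1: hold out 40% of the rows.
f_train, f_temp, t_train, t_temp = train_test_split(
    features, target, test_size=0.4, random_state=54321
)
# Step 2: split the hold-out in half into validation and test sets.
f_valid, f_test, t_valid, t_test = train_test_split(
    f_temp, t_temp, test_size=0.5, random_state=54321
)

# Report the size and share of each part.
for name, part in [("train", f_train), ("valid", f_valid), ("test", f_test)]:
    print(name, len(part), round(len(part) / len(features), 2))
```

The second call works on the 40% hold-out, so its test_size of 0.5 corresponds to 20% of the full dataset.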

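The main model I will be using is the RandomForestClassifier, with the random_state parameter set to '54321' so the results can be reproduced. I chose this model after experimenting with the LogisticRegression model, which only reached an accuracy of about 0.735, and the random forest proved to be more accurate. For comparison, I will also create a LogisticRegression model and a DecisionTreeClassifier model with the same random_state.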

In [135]:
rfc_model = RandomForestClassifier(random_state=54321)

In [136]:
lr_model = LogisticRegression(random_state=54321)

In [137]:
dtc_model = DecisionTreeClassifier(random_state=54321)

I will check the dimensions of the train, valid, and test sets.

In [138]:
print("Features (x_train):", x_train.shape)
print("Target Variable (y_train):", y_train.shape)

print("\nFeatures (x_valid):", x_valid.shape)
print("Target Variable (y_valid):", y_valid.shape)

print("\nFeatures (x_test):", x_test.shape)
print("Target Variable (y_test):", y_test.shape)

Features (x_train): (1928, 4)
Target Variable (y_train): (1928,)

Features (x_valid): (643, 4)
Target Variable (y_valid): (643,)

Features (x_test): (643, 4)
Target Variable (y_test): (643,)


Now I will train each model with the 'fit' method on the training data and make predictions on the validation set with 'predict'. I will then compute the validation accuracy of each model, and finally evaluate the best one on the test set.

In [139]:
rfc_model.fit(x_train, y_train)
y_prediction_rfc = rfc_model.predict(x_valid)

In [140]:
valid_accuracy_rfc = accuracy_score(y_valid, y_prediction_rfc)
print("Random Forest Classifier Validation Accuracy:", valid_accuracy_rfc)

Random Forest Classifier Validation Accuracy: 0.7822706065318819


In [141]:
lr_model.fit(x_train, y_train)
y_prediction_lr = lr_model.predict(x_valid)

In [142]:
valid_accuracy_lr = accuracy_score(y_valid, y_prediction_lr)
print("Logistic Regression Validation Accuracy:", valid_accuracy_lr)

Logistic Regression Validation Accuracy: 0.6749611197511665


In [143]:
dtc_model.fit(x_train, y_train)
y_prediction_dtc = dtc_model.predict(x_valid)

In [144]:
valid_accuracy_dtc = accuracy_score(y_valid, y_prediction_dtc)
print("Decision Tree Classifier Validation Accuracy:", valid_accuracy_dtc)

Decision Tree Classifier Validation Accuracy: 0.687402799377916
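
The fit, predict, and score steps repeated above for each model can also be written as one loop. The following is a rough sketch on synthetic data from `make_classification`, not on the Megaline dataset, so its scores say nothing about the results in this notebook. The names `candidates`, `scores`, and `best_name`, and the `max_iter=1000` setting for LogisticRegression, are assumptions made for this example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with four features, like the real feature table.
X_demo, y_demo = make_classification(
    n_samples=500, n_features=4, n_informative=3, n_redundant=1,
    random_state=54321,
)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=54321
)

# Candidate models; max_iter=1000 is an assumed setting to help convergence.
candidates = {
    "Random Forest": RandomForestClassifier(random_state=54321),
    "Logistic Regression": LogisticRegression(random_state=54321, max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(random_state=54321),
}

# Fit each model on the training part and score it on the validation part.
scores = {}
for name, model in candidates.items():
    model.fit(X_tr, y_tr)
    scores[name] = accuracy_score(y_va, model.predict(X_va))
    print(f"{name} validation accuracy: {scores[name]:.3f}")

# Select the model with the highest validation accuracy.
best_name = max(scores, key=scores.get)
print("Best on validation:", best_name)
```

Selecting the model on validation accuracy and touching the test set only once for the chosen model keeps the test score an honest estimate.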


In [145]:
accuracy_rfc = accuracy_score(y_test, rfc_model.predict(x_test))
print("Random Forest Classifier Test Set Accuracy:", accuracy_rfc)

Random Forest Classifier Test Set Accuracy: 0.8102643856920684


In conclusion, after comparing the validation accuracy of the three models, the model with the highest validation accuracy is the Random Forest Classifier at about 0.782, ahead of the Decision Tree Classifier (about 0.687) and Logistic Regression (about 0.675). When I evaluate the Random Forest Classifier on the test set, it reaches an accuracy of about 0.81.