# Mobile Plan Recommendation

We have behavioral data for customers who have already switched to the new mobile plans. The goal of this project is to build a classification model that recommends the appropriate plan as accurately as possible. The project counts as successful if the model reaches an accuracy of at least 0.75 on the test dataset.

The model will be used by the mobile operator "Megaline," which aims to create a system capable of analyzing customer behavior and offering users of archived tariffs a new tariff: "Smart" or "Ultra."

## Let's study the dataframe

In [1]:
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.dummy import DummyClassifier

In [2]:
df = pd.read_csv('https://code.s3.yandex.net/datasets/users_behavior.csv')

In [3]:
# a function to display the general info (uses its `data` argument, not the global `df`)
def preprocess(data):
    duplicates = f'Duplicate rows: {data.duplicated().sum()}'
    missing = f'Percent of missing values by column:\n{data.isna().mean().round(4) * 100}'
    print(data.info(), data.head(10), duplicates, missing, sep='\n' + '-' * 50 + '\n')

preprocess(df)


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3214 entries, 0 to 3213
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   calls     3214 non-null   float64
 1   minutes   3214 non-null   float64
 2   messages  3214 non-null   float64
 3   mb_used   3214 non-null   float64
 4   is_ultra  3214 non-null   int64  
dtypes: float64(4), int64(1)
memory usage: 125.7 KB
None
--------------------------------------------------
   calls  minutes  messages   mb_used  is_ultra
0   40.0   311.90      83.0  19915.42         0
1   85.0   516.75      56.0  22696.96         0
2   77.0   467.66      86.0  21060.45         0
3  106.0   745.53      81.0   8437.39         1
4   66.0   418.74       1.0  14502.75         0
5   58.0   344.56      21.0  15823.37         0
6   57.0   431.64      20.0   3738.90         1
7   15.0   132.40       6.0  21911.60         0
8    7.0    43.39       3.0   2538.67         1
9   90.0   665.41      38.0  1735

In [4]:
# calls and messages are whole-number counts stored as floats, so convert them to Int64
for col in ['calls', 'messages']:
    df[col] = df[col].astype('Int64')

## Splitting the dataframe into training, testing and validation samples

In [5]:
# a helper that separates the feature columns from the target column
def feature_target_split(data, column):
    features = data.drop([column], axis=1)
    target = data[column]

    return features, target

In [6]:
# splitting into train and test
df_train, df_test = train_test_split(df, test_size=0.2, random_state=12345)

features_test, target_test = feature_target_split(data=df_test, column='is_ultra')

In [7]:
# splitting into train and valid
df_train, df_valid = train_test_split(df_train, test_size=0.25, random_state=12345)

features_train, target_train = feature_target_split(data=df_train, column='is_ultra')
features_valid, target_valid = feature_target_split(data=df_valid, column='is_ultra')

In [8]:
print(f'Original dataframe size: {df.shape}', f'Training sample size: {df_train.shape}', f'Testing sample size: {df_test.shape}', f'Validation sample size: {df_valid.shape}', sep="\n")

Original dataframe size: (3214, 5)
Training sample size: (1928, 5)
Testing sample size: (643, 5)
Validation sample size: (643, 5)
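The two-step split works because taking 25% of the remaining 80% gives 20% of the original data, which yields a 60/20/20 division. The sketch below wraps that arithmetic in one helper. The function name `three_way_split`, its parameters, and the synthetic `demo` frame are illustrative assumptions, and the `stratify` argument is an optional addition that the notebook itself does not use.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split


def three_way_split(data, target_col, test_frac=0.2, valid_frac=0.2, seed=12345):
    """Split a DataFrame into train/valid/test parts, stratified by the target."""
    # First carve off the test part.
    rest, test = train_test_split(
        data, test_size=test_frac, random_state=seed, stratify=data[target_col]
    )
    # valid_frac is a share of the full data, so rescale it to the remainder.
    rel_valid = valid_frac / (1 - test_frac)
    train, valid = train_test_split(
        rest, test_size=rel_valid, random_state=seed, stratify=rest[target_col]
    )
    return train, valid, test


# Synthetic stand-in with the same number of rows as the dataset above.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "minutes": rng.normal(400, 100, 3214),
    "is_ultra": rng.integers(0, 2, 3214),
})
train, valid, test = three_way_split(demo, "is_ultra")
print(len(train), len(valid), len(test))
```

Stratifying keeps the share of each class roughly equal across the three parts, which can matter when one plan is noticeably more common than the other.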


## Finding the best model

In [9]:
# Decision Tree model
best_accuracy = 0
best_depth = 0
for depth in range(1, 16):
    model = DecisionTreeClassifier(random_state=12345, max_depth=depth)
    model.fit(features_train, target_train)
    accuracy = model.score(features_valid, target_valid)
    if accuracy > best_accuracy:
        best_accuracy = accuracy
        best_depth = depth

print(f'Best accuracy: {best_accuracy}', f'Best depth: {best_depth}', sep='\n')

Best accuracy: 0.7744945567651633
Best depth: 7


In [10]:
# Random Forest
best_accuracy = 0
best_est = 0
best_depth = 0
for est in range(10, 51, 10):
    for depth in range(1, 11):
        model = RandomForestClassifier(random_state=12345, n_estimators=est, max_depth=depth)
        model.fit(features_train, target_train)
        accuracy = model.score(features_valid, target_valid)
        if accuracy > best_accuracy:
            best_accuracy = accuracy
            best_est = est
            best_depth = depth

print(f'Best accuracy: {best_accuracy}', f'Best depth: {best_depth}', f'Number of trees: {best_est}', sep='\n')

Best accuracy: 0.7978227060653188
Best depth: 10
Number of trees: 50
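The nested loops above are a manual grid search scored on a single validation split. As a point of comparison, here is a minimal sketch of the same idea with scikit-learn's `GridSearchCV`. Note that it swaps in k-fold cross-validation instead of the fixed validation sample, and it runs on synthetic data from `make_classification`, so the grid values and printed results are illustrative only and say nothing about the real dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in; the real features would be calls, minutes, messages, mb_used.
X, y = make_classification(
    n_samples=500, n_features=4, n_informative=3, n_redundant=1, random_state=12345
)

# A small, assumed grid (coarser than the loops above to keep the run short).
param_grid = {
    "n_estimators": [10, 30, 50],
    "max_depth": [2, 5, 10],
}
search = GridSearchCV(
    RandomForestClassifier(random_state=12345),
    param_grid,
    scoring="accuracy",
    cv=3,  # 3-fold cross-validation instead of one validation split
)
search.fit(X, y)

print(search.best_params_)
print(round(search.best_score_, 3))
```

Cross-validation averages the score over several folds, so the chosen parameters depend less on how one particular validation sample happened to be drawn.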


In [11]:
# Logistic Regression
model = LogisticRegression()
model.fit(features_train, target_train)
accuracy = model.score(features_valid, target_valid)
print("Best accuracy:", accuracy)

Best accuracy: 0.7262830482115086
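The features here live on very different scales (`mb_used` is in the tens of thousands, `messages` in the tens), and `LogisticRegression` applies L2 regularization by default, which is sensitive to feature scale. A common thing to try is standardizing the features first. The sketch below shows the pattern on synthetic data with one artificially inflated column. The `max_iter=1000` setting is an assumption, and whether scaling would change the accuracy on the real data is not something this example can tell.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(
    n_samples=600, n_features=4, n_informative=3, n_redundant=1, random_state=12345
)
# Inflate one column to mimic a feature measured in much larger units (like mb_used).
X[:, 3] *= 10_000

X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=12345
)

# The scaler is fitted on the training part only, then applied to validation data.
scaled_logreg = make_pipeline(
    StandardScaler(),
    LogisticRegression(max_iter=1000, random_state=12345),
)
scaled_logreg.fit(X_train, y_train)
valid_accuracy = scaled_logreg.score(X_valid, y_valid)
print(valid_accuracy)
```

Putting the scaler inside the pipeline keeps validation and test data out of the scaling statistics.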


The Random Forest classifier has the highest accuracy on the validation sample (about 0.798). Let's see its results on the testing sample.

## Testing the best model

In [12]:
model = RandomForestClassifier(max_depth=7, n_estimators=40, min_samples_leaf=3, random_state=12345)
model.fit(features_train, target_train)
accuracy = model.score(features_test, target_test)
print("Accuracy:", accuracy)

Accuracy: 0.7900466562986003


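The accuracy on the testing sample (about 0.790) is comparable to the best accuracy on the validation sample (about 0.798) and exceeds the 0.75 target. Note that the tested model uses `max_depth=7`, `n_estimators=40`, and `min_samples_leaf=3` rather than the depth of 10 and 50 trees reported by the search above.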
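Because the final model's hyperparameters are typed in by hand, they can drift from what the search actually found. One way to avoid that, sketched here on synthetic data, is to store the winning parameters in a dictionary during the search and rebuild the final model from it. The variable names (`best_params`, `final_model`) and the reduced grid are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(
    n_samples=600, n_features=4, n_informative=3, n_redundant=1, random_state=12345
)
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=12345
)
X_train, X_valid, y_train, y_valid = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=12345
)

best_params, best_accuracy = None, 0.0
for est in (10, 30, 50):
    for depth in (2, 5, 10):
        params = {"n_estimators": est, "max_depth": depth}
        model = RandomForestClassifier(random_state=12345, **params)
        model.fit(X_train, y_train)
        accuracy = model.score(X_valid, y_valid)
        if accuracy > best_accuracy:
            best_params, best_accuracy = params, accuracy

# Rebuild the final model from the stored parameters, so nothing is retyped by hand.
final_model = RandomForestClassifier(random_state=12345, **best_params)
final_model.fit(X_train, y_train)
test_accuracy = final_model.score(X_test, y_test)
print(best_params, test_accuracy)
```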

## Checking the model's adequacy

In [13]:
dummy_clf = DummyClassifier(strategy='most_frequent')
dummy_clf.fit(features_train, target_train)

dummy_valid = dummy_clf.score(features_valid, target_valid)
dummy_test = dummy_clf.score(features_test, target_test)

print(f"Dummy model's result on the testing sample: {dummy_test}")

Dummy model's result on the testing sample: 0.6951788491446346


The baseline model, which always predicts the most frequent class, scores about 0.695 on the test dataset. This is lower than the accuracy of the models considered above, which indicates that they have learned more than the class balance alone.
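The baseline score has a simple interpretation: with `strategy='most_frequent'`, the classifier always predicts the majority class seen during training, so its accuracy equals the share of that class in the evaluated sample. The sketch below checks this on made-up labels. The 30% positive rate is an arbitrary assumption, not the class balance of the real data.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

rng = np.random.default_rng(12345)
# Hypothetical labels with roughly 30% positives, standing in for is_ultra.
y_train = (rng.random(1000) < 0.3).astype(int)
y_test = (rng.random(400) < 0.3).astype(int)
# DummyClassifier ignores the feature values, so zeros are enough here.
X_train = np.zeros((1000, 1))
X_test = np.zeros((400, 1))

baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)

# The majority class is determined on the training labels only.
majority_class = np.bincount(y_train).argmax()
baseline_accuracy = baseline.score(X_test, y_test)
majority_share = np.mean(y_test == majority_class)
print(baseline_accuracy, majority_share)
```

Any useful model has to beat this number, since it can be reached without looking at the features at all.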

## Conclusion

In this study, we selected a classification model for recommending the appropriate tariff to the mobile operator's clients. We compared three algorithms: Decision Tree, Random Forest, and Logistic Regression. Random Forest performed best on the validation sample, reaching an accuracy of about 0.798 with a tree depth of 10 and 50 trees. The Random Forest evaluated on the testing sample (`max_depth=7`, `n_estimators=40`, `min_samples_leaf=3`) reached an accuracy of about 0.790, which is above the 0.75 target and above the 0.695 scored by the most-frequent-class baseline.