# Introduction to machine learning. Recommendation of tariffs

## Introduction

### Main goal

We have behavioral data on customers who have already switched to the new tariffs. The task is to build a classification model that recommends a suitable tariff to new users. The data needs no preprocessing, because I already did that in the project for the "Statistical Data Analysis" chapter.

### Metric used

The model must reach an *accuracy* of at least 0.75.

## Loading the necessary libraries and data

Importing libraries.

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.dummy import DummyClassifier

Opening a dataset and creating a variable.

In [2]:
df = pd.read_csv('/datasets/users_behavior.csv')

Studying data.

In [3]:
df.head()

Unnamed: 0,calls,minutes,messages,mb_used,is_ultra
0,40.0,311.9,83.0,19915.42,0
1,85.0,516.75,56.0,22696.96,0
2,77.0,467.66,86.0,21060.45,0
3,106.0,745.53,81.0,8437.39,1
4,66.0,418.74,1.0,14502.75,0


In [4]:
df.shape

(3214, 5)

## Splitting the data into samples

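I split the data into training, validation, and test samples. The two calls below hold out 20% of the data for testing and then 20% of the remainder for validation, which gives roughly 64% / 16% / 20% rather than an exact 3:1:1 split. The training sample is used to fit the model, the validation sample to select hyperparameters, and the test sample for the final check of the model.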

In [5]:
df_train_valid, df_test = train_test_split(df, test_size=0.2, random_state=12345)
df_train, df_valid = train_test_split(df_train_valid, test_size=0.2, random_state=12345)

In [6]:
df_train.shape, df_test.shape, df_valid.shape

((2056, 5), (643, 5), (515, 5))
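The printed shapes follow directly from the two test_size values. Below is a small arithmetic sketch, assuming the held-out part of each split is rounded up to a whole row (an assumption that reproduces the shapes above). The function name `split_sizes` is illustrative and not part of scikit-learn.

```python
import math

def split_sizes(n_rows, test_frac, valid_frac):
    """Row counts after two successive splits: test first, then validation."""
    n_test = math.ceil(n_rows * test_frac)    # first split: hold out the test part
    n_rest = n_rows - n_test                  # rows left for training + validation
    n_valid = math.ceil(n_rest * valid_frac)  # second split: validation from the remainder
    n_train = n_rest - n_valid
    return n_train, n_valid, n_test

n_train, n_valid, n_test = split_sizes(3214, 0.2, 0.2)
print(n_train, n_valid, n_test)

# Share of the whole dataset that ends up in each sample
shares = [round(n / 3214, 2) for n in (n_train, n_valid, n_test)]
print(shares)
```

The shares come out near 64% / 16% / 20%. For an exact 60/20/20 split the second fraction would have to be 0.25: `split_sizes(3214, 0.2, 0.25)` gives `(1928, 643, 643)`.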

Separating the features from the target in each sample.

In [7]:
features_train = df_train.drop(['is_ultra'], axis=1)
target_train = df_train['is_ultra']

features_valid = df_valid.drop(['is_ultra'], axis=1)
target_valid = df_valid['is_ultra']

features_test = df_test.drop(['is_ultra'], axis=1)
target_test = df_test['is_ultra']
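The three pairs of assignments above follow one pattern, so they could be wrapped in a small helper. This is only a sketch: `split_xy` is a hypothetical name, and the toy frame reuses the first four rows shown by df.head() instead of loading the real file.

```python
import pandas as pd

def split_xy(frame, target_column="is_ultra"):
    """Return (features, target) for a sample that contains the target column."""
    features = frame.drop(columns=[target_column])
    target = frame[target_column]
    return features, target

# Toy frame built from the first rows printed earlier in the notebook
toy = pd.DataFrame({
    "calls": [40.0, 85.0, 77.0, 106.0],
    "minutes": [311.9, 516.75, 467.66, 745.53],
    "messages": [83.0, 56.0, 86.0, 81.0],
    "mb_used": [19915.42, 22696.96, 21060.45, 8437.39],
    "is_ultra": [0, 0, 0, 1],
})

features, target = split_xy(toy)
print(list(features.columns))
print(target.tolist())
```

With such a helper, each sample would need a single line, for example `features_train, target_train = split_xy(df_train)`.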

## Model research

Since the target feature is a categorical variable, we face the task of classification, not regression.

First, we try a decision tree. In a loop, we vary its most important hyperparameter, max_depth, from 1 to 14 and check which depth gives the highest accuracy on the validation sample.

In [8]:
for depth in range(1, 15):
    model = DecisionTreeClassifier(random_state=12345, max_depth=depth)
    model.fit(features_train, target_train)
    predictions_valid = model.predict(features_valid)
    print("max_depth =", depth, ": ", end='')
    print(accuracy_score(target_valid, predictions_valid))

max_depth = 1 : 0.7223300970873786
max_depth = 2 : 0.7475728155339806
max_depth = 3 : 0.7553398058252427
max_depth = 4 : 0.7533980582524272
max_depth = 5 : 0.7572815533980582
max_depth = 6 : 0.7611650485436893
max_depth = 7 : 0.7650485436893204
max_depth = 8 : 0.7631067961165049
max_depth = 9 : 0.7533980582524272
max_depth = 10 : 0.7592233009708738
max_depth = 11 : 0.7533980582524272
max_depth = 12 : 0.7495145631067961
max_depth = 13 : 0.7339805825242719
max_depth = 14 : 0.7339805825242719


The highest validation accuracy (0.765) is reached at max_depth=7.
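Reading the maximum off a printed list is error-prone. Here is a minimal sketch that stores the validation scores in a dictionary and picks the best depth programmatically. The values are copied from the output above and rounded to four decimals; the name `scores_by_depth` is introduced here for illustration.

```python
# Validation accuracy per max_depth, copied from the printed output (rounded)
scores_by_depth = {
    1: 0.7223, 2: 0.7476, 3: 0.7553, 4: 0.7534, 5: 0.7573,
    6: 0.7612, 7: 0.7650, 8: 0.7631, 9: 0.7534, 10: 0.7592,
    11: 0.7534, 12: 0.7495, 13: 0.7340, 14: 0.7340,
}

# max() over the keys, ranked by their score, returns the best depth
best_depth = max(scores_by_depth, key=scores_by_depth.get)
print(best_depth, scores_by_depth[best_depth])
```

Inside the real loop, the dictionary would be filled with `scores_by_depth[depth] = accuracy_score(target_valid, predictions_valid)` instead of printing each value.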

Now let's try a random forest. In a loop, we look for the value of n_estimators that gives the highest score(), which for classifiers is the accuracy.

In [9]:
best_model = None
best_result = 0
# range() excludes its upper bound, so 15 is needed to include n_estimators=14
for est in range(1, 15):
    model = RandomForestClassifier(random_state=12345, n_estimators=est)
    model.fit(features_train, target_train)
    # score() returns the accuracy on the validation sample
    result = model.score(features_valid, target_valid)
    if result > best_result:
        best_model = model
        best_result = result

print("Accuracy of the best model on the validation sample:", best_result)

Accuracy of the best model on the validation sample: 0.7728155339805826


Increasing n_estimators tends to improve the quality of the model, but the training time grows as well. With n_estimators=14 the validation accuracy is 0.7728, while with n_estimators=100 it is 0.7883. Such a large increase in the number of trees improves accuracy by only about 1.5 percentage points, so we can stop at n_estimators=14.
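The search loop keeps the best model but does not record which number of trees produced it. The hedged sketch below records that value as well. It runs on synthetic data from `make_classification` rather than on the tariff dataset, so its scores say nothing about the real task, and the names `best_est`, `X_train`, and `X_valid` are introduced here for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the tariff data (assumption: 4 numeric features, binary target)
X, y = make_classification(
    n_samples=500, n_features=4, n_informative=3, n_redundant=1, random_state=12345
)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.2, random_state=12345
)

best_est, best_result = None, 0.0
for est in range(1, 15):  # 1..14 inclusive
    model = RandomForestClassifier(random_state=12345, n_estimators=est)
    model.fit(X_train, y_train)
    result = model.score(X_valid, y_valid)  # accuracy on the validation part
    if result > best_result:  # strict ">" keeps the smallest forest among ties
        best_est, best_result = est, result

print("best n_estimators:", best_est, "validation accuracy:", best_result)
```

Keeping the strict comparison means that, when several forest sizes tie, the smallest and therefore fastest one is reported.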

Now let's move on to logistic regression. We set solver='lbfgs', which is one of the most common solvers and suitable for most problems, and cap the number of iterations at max_iter=1000.

In [10]:
model = LogisticRegression(random_state=12345, solver='lbfgs', max_iter=1000) 
model.fit(features_train, target_train)
print(model.score(features_valid, target_valid))

0.7165048543689321


**Conclusion:** Of the three models, the random forest gives the highest validation accuracy (0.773, against 0.765 for the decision tree and 0.717 for logistic regression). We therefore take the random forest with n_estimators=14 to the test stage.

## Checking the model on a test sample

In [11]:
model = RandomForestClassifier(random_state=12345, n_estimators=14)
model.fit(features_train, target_train)
predictions_test = model.predict(features_test)
print(accuracy_score(target_test, predictions_test))

0.7838258164852255


The test accuracy is 0.78, which exceeds the target of 0.75. The task is completed.
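Accuracy is simply the share of correct predictions. Here is a small sketch with made-up labels (not the project data) that computes it by hand; `manual_accuracy` is a hypothetical helper, not a scikit-learn function.

```python
def manual_accuracy(y_true, y_pred):
    """Share of positions where the prediction equals the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("inputs must have the same length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

# Toy labels, invented for illustration
y_true = [0, 0, 1, 1, 0, 1, 0, 0]
y_pred = [0, 0, 1, 0, 0, 1, 1, 0]
print(manual_accuracy(y_true, y_pred))  # 6 of 8 correct -> 0.75

# The test score above is consistent with 504 correct answers out of 643 rows
print(504 / 643)
```

Seen this way, the test result means that the model chose the right tariff for 504 of the 643 users in the test sample.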

## Checking the model for adequacy

We will use the dummy classifier (sklearn.dummy.DummyClassifier) as a baseline. With strategy='most_frequent' it ignores the features and always predicts the most frequent class in the training target. The accuracy of our trained model must be higher than that of this baseline.

In [12]:
model = DummyClassifier(strategy='most_frequent', random_state=0)
model.fit(features_train, target_train)
model.score(features_test, target_test)

0.6951788491446346

Our model's accuracy (0.78) is higher than the baseline's (0.70), so we can consider it adequate.
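The baseline score can also be understood without any model: it is the share of test rows that belong to the class that was most frequent in training. The sketch below shows this with toy labels invented for illustration; `majority_baseline_accuracy` is a hypothetical helper.

```python
from collections import Counter

def majority_baseline_accuracy(train_labels, test_labels):
    """Predict the most frequent training class for every test row."""
    majority_class, _ = Counter(train_labels).most_common(1)[0]
    hits = sum(label == majority_class for label in test_labels)
    return majority_class, hits / len(test_labels)

train_labels = [0, 0, 0, 1, 0, 1, 0, 1, 0, 0]  # 7 zeros, 3 ones
test_labels = [0, 1, 0, 0, 1]                  # 3 zeros, 2 ones
print(majority_baseline_accuracy(train_labels, test_labels))  # (0, 0.6)
```

On the real test sample, the baseline score of 0.695 corresponds to 447 of the 643 test rows falling into the majority class.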

## Conclusion

Comparing several machine learning models shows that the best one for predicting user tariffs is the random forest with n_estimators=14. Its accuracy on the test sample reaches 0.78.