# Tariff recommendation

We have data on the behavior of customers who have already switched to the Smart and Ultra tariffs. The task is to build a classification model that selects the appropriate tariff for a customer. No data preprocessing is required because it has already been done.

The `accuracy` of the model must be at least 0.75.
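
To make the target metric concrete, here is a minimal sketch of how accuracy is computed: the share of predictions that match the true labels. The label lists below are hypothetical and are not taken from the dataset; they only illustrate the 0.75 threshold.

```python
from sklearn.metrics import accuracy_score

# Hypothetical labels for eight users (1 = "Ultra", 0 = "Smart"); not real data.
y_true = [0, 0, 1, 1, 0, 1, 0, 0]
y_pred = [0, 0, 1, 0, 0, 1, 1, 0]

# Accuracy by hand: number of matching labels divided by the number of labels.
manual_accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# The same value from scikit-learn (true labels first, predictions second).
library_accuracy = accuracy_score(y_true, y_pred)

# Check against the project requirement of at least 0.75.
meets_threshold = library_accuracy >= 0.75
print(manual_accuracy, library_accuracy, meets_threshold)
```

Six of the eight hypothetical predictions match, so this toy example sits exactly on the required threshold.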

## Opening and examining the data

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

In [2]:
# Try the platform path first, then fall back to a local copy of the file.
try:
    df = pd.read_csv('/datasets/users_behavior.csv')
except FileNotFoundError:
    df = pd.read_csv('users_behavior.csv')

In [3]:
def info(df):
    df.info()
    print(100*'=')
    display(df.describe())
    display(df.head())
    display(df.shape)
info(df) 

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3214 entries, 0 to 3213
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   calls     3214 non-null   float64
 1   minutes   3214 non-null   float64
 2   messages  3214 non-null   float64
 3   mb_used   3214 non-null   float64
 4   is_ultra  3214 non-null   int64  
dtypes: float64(4), int64(1)
memory usage: 125.7 KB


,calls,minutes,messages,mb_used,is_ultra
count,3214.0,3214.0,3214.0,3214.0,3214.0
mean,63.038892,438.208787,38.281269,17207.673836,0.306472
std,33.236368,234.569872,36.148326,7570.968246,0.4611
min,0.0,0.0,0.0,0.0,0.0
25%,40.0,274.575,9.0,12491.9025,0.0
50%,62.0,430.6,30.0,16943.235,0.0
75%,82.0,571.9275,57.0,21424.7,1.0
max,244.0,1632.06,224.0,49745.73,1.0


,calls,minutes,messages,mb_used,is_ultra
0,40.0,311.9,83.0,19915.42,0
1,85.0,516.75,56.0,22696.96,0
2,77.0,467.66,86.0,21060.45,0
3,106.0,745.53,81.0,8437.39,1
4,66.0,418.74,1.0,14502.75,0


(3214, 5)

Each row in the dataset describes the behavior of one user over one month. The columns are:

- `calls` - number of calls,
- `minutes` - total duration of calls in minutes,
- `messages` - number of SMS messages,
- `mb_used` - Internet traffic used, in MB,
- `is_ultra` - which tariff was used during the month ("Ultra" - 1, "Smart" - 0).

The data is already preprocessed, so we can move on to data analysis and machine learning.

## Dividing the Data

In [4]:
features = df.drop('is_ultra', axis=1)
target = df['is_ultra']

In [5]:

features_train, features_test, target_train, target_test = train_test_split(features, 
                                                                            target, 
                                                                            test_size=0.4, 
                                                                            random_state=12345,
                                                                            stratify=target
                                                                           ) 

In [6]:
features_valid, features_test, target_valid, target_test = train_test_split(features_test, 
                                                                            target_test, 
                                                                            test_size=0.5, 
                                                                            random_state=12345,
                                                                            stratify=target_test
                                                                           ) 

The data is split into training, validation and test sets in the ratio 3:1:1 (60%, 20% and 20%), stratified by the target so that each set keeps the same share of "Ultra" users.
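
The two-step split can be checked with a small self-contained sketch. It uses a synthetic frame with the same number of rows as the dataset (3214); the random values and the `toy`, `X_rest` and `y_rest` names are assumptions made for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data: same row count, made-up values.
rng = np.random.default_rng(12345)
n_rows = 3214
toy = pd.DataFrame({
    "calls": rng.integers(0, 245, n_rows),
    "is_ultra": (rng.random(n_rows) < 0.3).astype(int),
})
X = toy.drop("is_ultra", axis=1)
y = toy["is_ultra"]

# Step 1: hold out 40% of the rows, keeping the class balance.
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.4, random_state=12345, stratify=y
)
# Step 2: halve the held-out part into validation and test sets.
X_valid, X_test, y_valid, y_test = train_test_split(
    X_rest, y_rest, test_size=0.5, random_state=12345, stratify=y_rest
)

sizes = {"train": len(X_train), "valid": len(X_valid), "test": len(X_test)}
shares = {name: round(size / n_rows, 2) for name, size in sizes.items()}
print(sizes)
print(shares)
```

With 3214 rows this yields 1928, 643 and 643 rows, which matches the 3:1:1 ratio and the sample sizes printed in the notebook.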

In [7]:
print('Training sample size', features_train.shape)
print('Validation sample size', features_valid.shape)

Training sample size (1928, 4)
Validation sample size (643, 4)


The data has been split, so we can proceed to comparing models.

## Analysing the models

In [8]:
# Try tree depths 1 to 5 and keep the model with the best validation accuracy.
best_model_tree = None
best_result_tree = 0
for depth in range(1, 6):
    model_tree = DecisionTreeClassifier(random_state=12345, max_depth=depth)
    model_tree.fit(features_train, target_train)
    predictions = model_tree.predict(features_valid)
    result_tree = accuracy_score(target_valid, predictions)
    if result_tree > best_result_tree:
        best_model_tree = model_tree
        best_result_tree = result_tree

print('Accuracy of the best result of Decision Tree:', best_result_tree)


Accuracy of the best result of Decision Tree: 0.7853810264385692


In [9]:
%%time
# Search over the number of trees and the tree depth;
# keep the forest with the best validation accuracy.
best_model_forest = None
best_result = 0
best_est = 0
best_depth = 0
for est in range(10, 51, 10):
    for depth in range(1, 11):
        model_forest = RandomForestClassifier(random_state=12345, n_estimators=est, max_depth=depth)
        model_forest.fit(features_train, target_train)
        predictions_valid = model_forest.predict(features_valid)
        result = accuracy_score(target_valid, predictions_valid)
        if result > best_result:
            best_model_forest = model_forest
            best_result = result
            best_est = est
            best_depth = depth

print("Accuracy of the best Random Forest model on the validation set:", best_result, "Number of trees:", best_est)

Accuracy of the best Random Forest model on the validation set: 0.8211508553654744 Number of trees: 40
Wall time: 5.5 s


In [10]:
model_lr = LogisticRegression(random_state=12345)
model_lr.fit(features_train, target_train) 
predictions = model_lr.predict(features_valid) 
result = accuracy_score(target_valid, predictions) 

print("Accuracy of Logistic Regression Models on the Validation Set:", result)

Accuracy of Logistic Regression Models on the Validation Set: 0.7387247278382582


Of the three models compared, the Random Forest achieved the highest `accuracy` on the validation set. Let's check its result on the test set.
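
The comparison can be summarised with a short sketch that ranks the validation accuracies printed above and checks them against the 0.75 requirement. The `validation_accuracy` dictionary and the other names are illustrative, not part of the notebook.

```python
# Validation accuracies copied from the cell outputs above.
validation_accuracy = {
    "Decision Tree": 0.7853810264385692,
    "Random Forest": 0.8211508553654744,
    "Logistic Regression": 0.7387247278382582,
}
required_accuracy = 0.75

# Pick the model with the highest validation accuracy.
best_name = max(validation_accuracy, key=validation_accuracy.get)

# List the models that clear the required threshold on the validation set.
passing = [name for name, acc in validation_accuracy.items() if acc >= required_accuracy]

print(best_name)
print(passing)
```

On these numbers the Random Forest ranks first, and Logistic Regression is the only model below the threshold on the validation set.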

## Checking the results on the test set

In [11]:
predictions_test_forest = best_model_forest.predict(features_test)
result_test_forest = accuracy_score(target_test, predictions_test_forest)
print('Accuracy of Random Forest Models on a Test Set:', result_test_forest)

Accuracy of Random Forest Models on a Test Set: 0.8087091757387247


On the test set, the Random Forest model reaches an `accuracy` of 0.8087091757387247, which is above the required 0.75.

## Checking models for adequacy

In [12]:
dummy_cl = DummyClassifier(strategy="most_frequent", random_state=0)
dummy_cl.fit(features_train, target_train)
dummy_cl.score(features_test, target_test)

0.6936236391912908

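A baseline that always predicts the most frequent class scores about 0.694 on the test set, while the Random Forest scores about 0.809. The model clearly beats the baseline, so it can be considered adequate.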
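
The baseline score can be cross-checked by hand. A most-frequent-class predictor is right exactly as often as the majority class occurs, so its accuracy follows from the class share. The sketch below uses the mean of `is_ultra` from the `describe()` output, which covers the whole dataset rather than only the test set, so it is an approximation of the score above; the variable names are illustrative.

```python
# Share of "Ultra" users: mean of is_ultra from the describe() output (full dataset).
ultra_share = 0.306472

# A classifier that always predicts the majority class is correct
# for every row of that class and wrong for every other row.
baseline_accuracy = max(ultra_share, 1 - ultra_share)

# Test-set accuracy of the Random Forest, copied from the cell output.
forest_test_accuracy = 0.8087091757387247

# How far the model is above the trivial baseline.
margin = forest_test_accuracy - baseline_accuracy
print(round(baseline_accuracy, 4), round(margin, 4))
```

The hand-computed baseline of roughly 0.6935 agrees with the `DummyClassifier` score, and the Random Forest is about 0.115 above it.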

## Summary

- Random Forest, Decision Tree and Logistic Regression models have been explored.
- The highest validation `accuracy`, 0.8211508553654744, was shown by the Random Forest model.
- On the test set, the Random Forest model showed an `accuracy` of 0.8087091757387247, above the required 0.75.
- The Random Forest model passed the adequacy check against a most-frequent-class baseline (0.6936236391912908).