# Tariff recommendation

You have data on the behavior of customers who have already switched to these tariffs (from the project in the “Statistical Data Analysis” course). The task is to build a classification model that selects the appropriate tariff. No data preprocessing is needed, since it has already been done.

Build a model with the highest possible *accuracy*. To pass the project, the share of correct answers must be at least 0.75. Check *accuracy* on the test sample yourself.
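
For reference, *accuracy* is the share of predictions that match the true labels. The sketch below spells that out on made-up labels; the helper name `accuracy` is hypothetical and is not part of the project, which uses `accuracy_score` from scikit-learn.

```python
def accuracy(y_true, y_pred):
    """Share of positions where the prediction equals the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    hits = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return hits / len(y_true)


# Made-up labels for illustration: three of four predictions are correct.
y_true = [0, 0, 1, 1]
y_pred = [0, 1, 1, 1]
score = accuracy(y_true, y_pred)
print(score)  # 0.75
```

A score of 0.75 on this toy data would sit exactly at the project threshold.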

In [1]:
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

df = pd.read_csv('users_behavior.csv')
print(df.head())
print(df.info())

   calls  minutes  messages   mb_used  is_ultra
0   40.0   311.90      83.0  19915.42         0
1   85.0   516.75      56.0  22696.96         0
2   77.0   467.66      86.0  21060.45         0
3  106.0   745.53      81.0   8437.39         1
4   66.0   418.74       1.0  14502.75         0
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3214 entries, 0 to 3213
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   calls     3214 non-null   float64
 1   minutes   3214 non-null   float64
 2   messages  3214 non-null   float64
 3   mb_used   3214 non-null   float64
 4   is_ultra  3214 non-null   int64  
dtypes: float64(4), int64(1)
memory usage: 125.7 KB
None


## Splitting the data

Let's split the data in a ratio of 3:1:1 (60% training, 20% validation, 20% test):

In [2]:
features = df.drop(['is_ultra'], axis=1)
target = df['is_ultra']

# Allocate 20% of the data to the test sample
features_train, features_test, target_train, target_test = train_test_split(
    features, target, test_size=0.2, random_state=12345)
# Allocate 25% of the remaining data (20% of the total) to the validation sample
features_train, features_valid, target_train, target_valid = train_test_split(
    features_train, target_train, test_size=0.25, random_state=12345)
print(features_train.shape)
print(features_valid.shape)
print(features_test.shape)
print(target_train.shape)
print(target_valid.shape)
print(target_test.shape)

(1928, 4)
(643, 4)
(643, 4)
(1928,)
(643,)
(643,)
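
The printed shapes follow from how the two splits round. As a sketch, assuming `train_test_split` rounds a fractional `test_size` up to a whole number of rows, the hypothetical helper `split_sizes` below mirrors that arithmetic for the 3214 rows in the dataset:

```python
import math


def split_sizes(n_samples, test_size):
    """Return (n_train, n_test), rounding the held-out part up."""
    n_test = math.ceil(test_size * n_samples)
    return n_samples - n_test, n_test


# First split: 20% of all rows go to the test sample.
n_rest, n_test = split_sizes(3214, 0.2)
# Second split: 25% of the remaining rows go to the validation sample.
n_train, n_valid = split_sizes(n_rest, 0.25)
print(n_train, n_valid, n_test)  # 1928 643 643
```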


## Exploring models

Let's train decision tree models with a maximum depth of 1 to 5, a minimum number of samples per leaf of 1 to 10, and both split criteria (gini and entropy), and compare them on the validation sample:

In [3]:
# entropy criterion
best_result = 0
best_depth = 0
best_samples_leaf = 0
for leaf in range(1, 11):
    for depth in range(1, 6):
        model = DecisionTreeClassifier(random_state=12345, min_samples_leaf=leaf,
                                       criterion='entropy', max_depth=depth)
        model.fit(features_train, target_train)
        predictions = model.predict(features_valid)
        result = accuracy_score(target_valid, predictions)
        # Strict '>' keeps the first combination that reaches the best accuracy
        if result > best_result:
            best_depth = depth
            best_result = result
            best_samples_leaf = leaf

print("Accuracy of the best model:", best_result, "Depth:", best_depth,
      "Minimum number of samples in a leaf:", best_samples_leaf)

Accuracy of the best model: 0.7698289269051322 Depth: 5 Minimum number of samples in a leaf: 5


In [4]:
# gini criterion
best_result = 0
best_depth = 0
best_samples_leaf = 0
for leaf in range(1, 11):
    for depth in range(1, 6):
        model = DecisionTreeClassifier(random_state=12345, min_samples_leaf=leaf,
                                       criterion='gini', max_depth=depth)
        model.fit(features_train, target_train)
        predictions = model.predict(features_valid)
        result = accuracy_score(target_valid, predictions)
        # Strict '>' keeps the first combination that reaches the best accuracy
        if result > best_result:
            best_depth = depth
            best_result = result
            best_samples_leaf = leaf

print("Accuracy of the best model:", best_result, "Depth:", best_depth,
      "Minimum number of samples in a leaf:", best_samples_leaf)

Accuracy of the best model: 0.7698289269051322 Depth: 4 Minimum number of samples in a leaf: 8
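
Both tree searches reach the same validation accuracy with different hyperparameters. Because the comparison is a strict `>`, the reported combination is simply the first one, in loop order, to reach the top score. The sketch below isolates that search pattern. The helper `search_best` and the scorer `toy_score` are hypothetical stand-ins (the toy scorer does not train any model), so the numbers are illustrative only.

```python
from itertools import product


def search_best(grid, evaluate):
    """Try every combination in grid; return (best_params, best_score)."""
    names = list(grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = evaluate(params)
        # Strict '>' means the first of several tied combinations is kept.
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_score(params):
    """Stand-in for 'fit a model and return validation accuracy'."""
    return 0.77 if params["max_depth"] >= 4 else 0.70


grid = {"min_samples_leaf": range(1, 4), "max_depth": range(1, 6)}
best_params, best_score = search_best(grid, toy_score)
print(best_params, best_score)
```

Several combinations tie at 0.77 here, and the search reports the first one it meets, so a different loop order could report different hyperparameters for the same score.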


Let's train random forest models with a maximum depth of 1 to 10, 10 to 50 trees (in steps of 10), a minimum number of samples per leaf of 1 to 10, and both split criteria (gini and entropy), and compare them on the validation sample:

In [5]:
#  gini criterion
best_model = None
best_result = 0
best_est = 0
best_depth = 0
best_samples_leaf=0
for leaf in range(1, 11):
    for est in range(10, 51, 10):
        for depth in range (1, 11):
            model = RandomForestClassifier(random_state=12345, min_samples_leaf=leaf, 
                                           criterion='gini', n_estimators=est, max_depth=depth) 
            model.fit(features_train, target_train) 
            predictions = model.predict(features_valid) 
            result = accuracy_score(target_valid, predictions) 
            if result > best_result:
                best_model = model
                best_result = result
                best_est = est
                best_depth = depth
                best_samples_leaf=leaf

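# Print best_depth rather than the loop variable depth, which is always 10
# after the loops finish, so the depth in the recorded output is the last value tried
print("Accuracy of the best model:", best_result, "Number of estimators:", best_est,
      "Maximum depth:", best_depth, "Minimum number of samples in a leaf:", best_samples_leaf)

Accuracy of the best model: 0.7978227060653188 Number of estimators: 50 Maximum depth: 10 Minimum number of samples in a leaf: 1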


In [6]:
#  entropy criterion
best_model = None
best_result = 0
best_est = 0
best_depth = 0
best_samples_leaf=0
for leaf in range(1, 11):
    for est in range(10, 51, 10):
        for depth in range (1, 11):
            model = RandomForestClassifier(random_state=12345, min_samples_leaf=leaf, 
                                           criterion='entropy', n_estimators=est, max_depth=depth) 
            model.fit(features_train, target_train) 
            predictions = model.predict(features_valid) 
            result = accuracy_score(target_valid, predictions) 
            if result > best_result:
                best_model = model
                best_result = result
                best_est = est
                best_depth = depth
                best_samples_leaf=leaf

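# Print best_depth rather than the loop variable depth, which is always 10
# after the loops finish, so the depth in the recorded output is the last value tried
print("Accuracy of the best model:", best_result, "Number of estimators:", best_est,
      "Maximum depth:", best_depth, "Minimum number of samples in a leaf:", best_samples_leaf)

Accuracy of the best model: 0.7978227060653188 Number of estimators: 10 Maximum depth: 10 Minimum number of samples in a leaf: 1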
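
The two random forest cells differ only in the criterion, so they could be folded into one search. The sketch below does this with scikit-learn's `ParameterGrid`. It runs on synthetic data from `make_classification` (an assumption made so the example is self-contained, not the project's dataset) and uses a deliberately small grid, so its output says nothing about the results above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import ParameterGrid, train_test_split

# Synthetic stand-in data: 4 features and a binary target.
X, y = make_classification(n_samples=300, n_features=4, n_informative=3,
                           n_redundant=1, random_state=12345)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=12345)

# Small illustrative grid; the criterion is just one more hyperparameter.
grid = ParameterGrid({
    "criterion": ["gini", "entropy"],
    "n_estimators": [10, 20],
    "max_depth": [2, 4],
})

best_params, best_score = None, -1.0
for params in grid:
    model = RandomForestClassifier(random_state=12345, **params)
    model.fit(X_train, y_train)
    score = accuracy_score(y_valid, model.predict(X_valid))
    if score > best_score:
        best_params, best_score = params, score

print(best_params, best_score)
```

Storing the whole `params` dictionary for the best model avoids printing a loop variable by mistake.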


Let's train logistic regression models with different solvers ('lbfgs', 'liblinear', 'newton-cg') and compare them on the validation sample:

In [7]:
solvers = ['lbfgs', 'liblinear', 'newton-cg']
best_solver = None
best_result = 0
for solv in solvers:
    model = LogisticRegression(solver=solv)
    model.fit(features_train, target_train)
    predictions = model.predict(features_valid)
    result = accuracy_score(target_valid, predictions)
    # Update both the score and the solver; otherwise the last solver tried
    # is always reported, as in the recorded output below
    if result > best_result:
        best_result = result
        best_solver = solv
print("Accuracy of the best model:", best_result, "Algorithm:", best_solver)

Accuracy of the best model: 0.7262830482115086 Algorithm: newton-cg




Conclusion: random forests gave the best validation accuracy (0.798), ahead of decision trees (0.770) and logistic regression (0.726 for 'newton-cg'). Two random forest configurations tied: the entropy criterion with 10 trees and the gini criterion with 50 trees, both with a reported depth of 10. We will choose between them based on the test sample.

## Checking the model on the test sample

In [8]:
model = RandomForestClassifier(random_state=12345,criterion='entropy', n_estimators=10, max_depth=10) 
model.fit(features_train, target_train) 
predictions = model.predict(features_test) 
result = accuracy_score(target_test, predictions)
print("Accuracy of the model:", result)

Accuracy of the model: 0.7869362363919129


In [9]:
model = RandomForestClassifier(random_state=12345,criterion='gini', n_estimators=50, max_depth=10) 
model.fit(features_train, target_train) 
predictions = model.predict(features_test) 
result = accuracy_score(target_test, predictions)
print("Accuracy of the model:", result)

Accuracy of the model: 0.7993779160186625


Conclusion: on the test sample, the random forest with the gini criterion, 50 trees and a depth of 10 performed best. Its accuracy (0.7993779160186625) exceeds the 0.75 threshold set by the task, so the model can be used to predict which tariff to offer to users.
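
To put the two test accuracies in perspective, they can be converted back into counts of correct predictions. The sketch below uses only numbers already printed in this notebook (643 test rows and the two accuracy values); the variable names are illustrative.

```python
n_test = 643  # size of the test sample, from the shapes printed earlier

accuracies = {
    "entropy, 10 trees": 0.7869362363919129,
    "gini, 50 trees": 0.7993779160186625,
}

# accuracy = correct / n_test, so correct = accuracy * n_test
correct = {name: round(acc * n_test) for name, acc in accuracies.items()}
gap = correct["gini, 50 trees"] - correct["entropy, 10 trees"]
print(correct)  # {'entropy, 10 trees': 506, 'gini, 50 trees': 514}
print(gap)      # 8
```

The two models differ by 8 correct answers out of 643, which is a small gap and may partly reflect the particular split.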

## Checking the model for adequacy

Let's check how well a baseline that always predicts the most frequent class performs:

In [10]:
from sklearn.dummy import DummyClassifier
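# Make the baseline explicit: always predict the most frequent class in the training data
model = DummyClassifier(strategy='most_frequent', random_state=12345)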
model.fit(features_train, target_train) 
predictions = model.predict(features_test) 
result = accuracy_score(target_test, predictions)
print("Accuracy of the model:", result)

Accuracy of the model: 0.6951788491446346


The accuracy of our random forest model (0.799) is higher than that of the baseline that always predicts the most frequent class (0.695), so using the model is justified.
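
The baseline idea can also be written out by hand: find the most frequent label in the training data and score it against the test labels. This is a minimal sketch on made-up labels; the helper name `majority_baseline_accuracy` is hypothetical and the data is not taken from the project.

```python
from collections import Counter


def majority_baseline_accuracy(train_labels, test_labels):
    """Always predict the most frequent training label; return it with its test accuracy."""
    majority_class, _ = Counter(train_labels).most_common(1)[0]
    hits = sum(1 for label in test_labels if label == majority_class)
    return majority_class, hits / len(test_labels)


# Made-up labels for illustration: class 0 dominates the training data.
train_labels = [0, 0, 0, 1, 0, 1, 0]
test_labels = [0, 1, 0, 0]
majority_class, baseline = majority_baseline_accuracy(train_labels, test_labels)
print(majority_class, baseline)  # 0 0.75
```

A trained model is only worth using if it beats this kind of constant prediction on the same test sample.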