# Tariff recommendation

## Data Description
Each object in the dataset is information about one user's behavior for one month. Known:
- calls - number of calls,
- minutes - total duration of calls in minutes,
- messages - number of sms messages,
- mb_used - used Internet traffic in Mb,
- is_ultra - what tariff was used during the month ("Ultra" - 1, "Smart" - 0).

## Opening data

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

In [2]:
df = pd.read_csv('/project/datasets/users_behavior.csv')

In [3]:
df.head()

,calls,minutes,messages,mb_used,is_ultra
0,40.0,311.9,83.0,19915.42,0
1,85.0,516.75,56.0,22696.96,0
2,77.0,467.66,86.0,21060.45,0
3,106.0,745.53,81.0,8437.39,1
4,66.0,418.74,1.0,14502.75,0


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3214 entries, 0 to 3213
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   calls     3214 non-null   float64
 1   minutes   3214 non-null   float64
 2   messages  3214 non-null   float64
 3   mb_used   3214 non-null   float64
 4   is_ultra  3214 non-null   int64  
dtypes: float64(4), int64(1)
memory usage: 125.7 KB


The data contain 3214 rows and 5 columns with no missing values, so no data preprocessing is required.

In [5]:
df['is_ultra'].value_counts()

0    2229
1     985
Name: is_ultra, dtype: int64

The classes are imbalanced: "Smart" (0) accounts for 2229 rows and "Ultra" (1) for 985.
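To put a number on the imbalance, the class shares can be computed from the counts printed above. This is a minimal sketch in plain Python; the `counts` dictionary is simply the `value_counts()` output typed in by hand, not a new measurement.

```python
# Class counts copied from the value_counts() output above.
counts = {0: 2229, 1: 985}  # 0 = "Smart", 1 = "Ultra"

total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}

for label, share in shares.items():
    print(f"is_ultra = {label}: {share:.3f}")
# is_ultra = 0: 0.694
# is_ultra = 1: 0.306
```

The share of the larger class (about 0.69) is the accuracy that a model would reach by always predicting "Smart", so it is a useful floor to keep in mind when reading the scores below.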

## Preparation of the data

To train, validate, and test the models, we divide the provided data into three samples in a 3:1:1 ratio (60% training, 20% validation, 20% test).

In [6]:
test = df.sample(frac=0.2,random_state=42).copy()

In [7]:
train_full = df[~df.index.isin(test.index)].copy()

In [8]:
features = train_full.drop(['is_ultra'], axis = 1)
target = train_full['is_ultra']
features_train, features_valid, target_train, target_valid = train_test_split(features,
                                                                              target,test_size=0.25, 
                                                                              stratify = target,
                                                                              random_state=42)


In [9]:
features_test = test.drop(['is_ultra'], axis = 1)
target_test = test['is_ultra']
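The sample sizes implied by this two-step split can be checked with simple arithmetic. The sketch below assumes that `df.sample(frac=...)` rounds `frac * n` to the nearest integer and that `train_test_split` takes the ceiling of `test_size * n` for the held-out part; both are my understanding of the libraries' behavior rather than something the notebook states.

```python
import math

n_rows = 3214  # from df.info()

# Step 1: df.sample(frac=0.2) sets aside the test sample
n_test = round(0.2 * n_rows)
n_train_full = n_rows - n_test

# Step 2: train_test_split(test_size=0.25) splits the remainder
n_valid = math.ceil(0.25 * n_train_full)
n_train = n_train_full - n_valid

print(n_train, n_valid, n_test)
# 1928 643 643
```

These sizes correspond to roughly 60%, 20%, and 20% of the rows, which is the 3:1:1 ratio. Note that only the second step is stratified by `is_ultra`; the test sample is drawn at random.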

### Decision Tree

Train a model based on the `DecisionTreeClassifier` algorithm on the training sample, determine the optimum depth and the accuracy on the validation sample, and save the model in the `best_tree_model` variable.

In [10]:
best_tree_model = None
best_result = 0
best_depth = None
for depth in range(1, 6):
    model = DecisionTreeClassifier(random_state=42, max_depth=depth)
    model.fit(features_train, target_train)
    result = model.score(features_valid, target_valid)
    # Keep the already fitted model with the highest validation accuracy
    if result > best_result:
        best_tree_model = model
        best_result = result
        best_depth = depth
print("max_depth =", best_depth, ": ", end='')
print("Accuracy of the best model on the validation sample:", best_result)

max_depth = 3 : Accuracy of the best model on the validation sample: 0.80248833592535


The best decision tree result on the validation sample was obtained with depth 3; the accuracy score was 0.80.

### Random Forest

Train a model based on the `RandomForestClassifier` algorithm on the training sample, determine the optimum depth and number of trees and the accuracy on the validation sample, and store the model in the `best_forest_model` variable.

In [11]:
best_forest_model = None
best_result = 0
best_depth = None
best_est = None
# Grid search over the number of trees (1..50) and the depth (1..9)
for est in range(1, 51):
    for depth in range(1, 10):
        model = RandomForestClassifier(random_state=42, n_estimators=est, max_depth=depth)
        model.fit(features_train, target_train)
        result = model.score(features_valid, target_valid)
        # Keep the already fitted model with the highest validation accuracy
        if result > best_result:
            best_forest_model = model
            best_result = result
            best_depth = depth
            best_est = est
print("max_depth =", best_depth, ": ", end='')
print("est =", best_est, ": ", end='')
print("Accuracy of the best model on the validation sample:", best_result)

max_depth = 8 : est = 14 : Accuracy of the best model on the validation sample: 0.8242612752721618


The best result on the validation sample was shown by the model with 14 trees and depth 8. The accuracy score was 0.82.

### Logistic regression


Train a model based on the `LogisticRegression` algorithm on the training sample and obtain the accuracy on the validation sample.

In [12]:
regmodel = LogisticRegression(random_state=42, solver='lbfgs')
regmodel.fit(features_train, target_train)
result_valid = regmodel.score(features_valid, target_valid)

print("Accuracy of the logistic regression model on the validation sample:", result_valid)


Accuracy of the logistic regression model on the validation sample: 0.7558320373250389


The accuracy score is 0.76
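The notebook imports `StandardScaler` but fits the logistic regression on unscaled features, even though the columns are on very different scales (tens of calls versus tens of thousands of Mb). A possible variant is sketched below: a pipeline that standardizes the features before the regression. The data here are synthetic stand-ins generated with NumPy, and the names `X`, `y`, and `scaled_model` are hypothetical; whether scaling changes the accuracy on `users_behavior.csv` is not measured in this notebook.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
n = 400

# Synthetic stand-in for two features on very different scales
# (hypothetical data, not the real dataset)
X = np.column_stack([
    rng.normal(60, 30, n),        # roughly "calls"-sized values
    rng.normal(17000, 7000, n),   # roughly "mb_used"-sized values
])
y = (X[:, 0] / 30 + X[:, 1] / 7000 + rng.normal(0, 1, n) > 4.5).astype(int)

X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# The scaler is fitted on the training part only, then applied to validation data
scaled_model = make_pipeline(
    StandardScaler(),
    LogisticRegression(random_state=42, solver="lbfgs"),
)
scaled_model.fit(X_train, y_train)
valid_accuracy = scaled_model.score(X_valid, y_valid)
print("Validation accuracy with scaling:", valid_accuracy)
```

Wrapping the scaler and the model in one pipeline keeps the validation and test samples out of the scaler's statistics, which is the main reason to prefer it over scaling the whole dataset up front.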

Let's check the models with test data.

In [13]:
best_tree_model.score(features_test,target_test)

0.8040435458786936

In [14]:
best_forest_model.score(features_test,target_test)

0.8164852255054432

In [15]:
regmodel.score(features_test, target_test) 

0.7620528771384136

The value of the accuracy score for the decision tree, random forest, and logistic regression models on the test data was 0.80, 0.82, and 0.76, respectively.

## Checking the sanity of the models

Let us check the sanity of the models by comparing their `accuracy` with that of a constant model that always predicts the most frequent class (0).

In [16]:
target_test.value_counts()

0    455
1    188
Name: is_ultra, dtype: int64

In [17]:
pred_constant_test = pd.Series([0]*len(target_test))
print(accuracy_score(target_test, pred_constant_test))

0.7076205287713841


The accuracy of the constant model (0.71) is lower than the test accuracy of all three trained models (0.80, 0.82, and 0.76).
The models passed the sanity test.
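The same baseline can also be expressed with scikit-learn's `DummyClassifier`, which is one common way to write a constant model. The sketch below rebuilds the test labels from the counts shown above (455 zeros and 188 ones) instead of loading the dataset, and uses a placeholder feature matrix because the dummy model ignores its inputs. Fitting the baseline on the test labels is done here only to mirror the hand-written check; in practice it would normally be fitted on the training sample.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Test labels rebuilt from target_test.value_counts(): 455 zeros, 188 ones
y_test = np.array([0] * 455 + [1] * 188)
# Placeholder features: DummyClassifier does not look at them
X_placeholder = np.zeros((len(y_test), 1))

baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_placeholder, y_test)
baseline_accuracy = baseline.score(X_placeholder, y_test)

print(baseline_accuracy)  # equals 455 / 643, about 0.7076
```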

## General conclusion

Based on the data provided, the models were trained and the hyperparameters that provide the highest value of accuracy were established.
The values of the accuracy for the models in the test samples were as follows:
- Decision Tree - 0.80 
- Random forest - 0.82
- Logistic regression - 0.76

All three models outperform the constant model (0.71 on the test sample) and therefore pass the sanity test.