# Tariff recommendation

You have data on the behavior of customers who have already switched to these tariffs (from the project for the "Statistical Data Analysis" course). Build a classification model that selects the appropriate tariff. No data preprocessing is required, since it has already been done.

Build the model with the highest possible `accuracy`. To pass the project, the share of correct answers must reach at least 0.75. Check `accuracy` on the test set yourself.

## 1. Open and explore data

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score


df = pd.read_csv('users_behavior.csv')

In [2]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3214 entries, 0 to 3213
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   calls     3214 non-null   float64
 1   minutes   3214 non-null   float64
 2   messages  3214 non-null   float64
 3   mb_used   3214 non-null   float64
 4   is_ultra  3214 non-null   int64  
dtypes: float64(4), int64(1)
memory usage: 125.7 KB


In [3]:
# Optional: messages could be cast to an integer type, since a message count cannot be fractional
# df['messages'] = df['messages'].astype(int)

**Conclusion:** The dataset has 3214 rows and 5 columns with no missing values. Four columns are numeric features and `is_ultra` is the binary target, so the data is ready for modelling.
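Because accuracy is the target metric, it can also be useful to look at the class balance of `is_ultra` at this stage. The sketch below shows one way to do that on a small made-up frame; the values are illustrative and are not taken from `users_behavior.csv`.

```python
import pandas as pd

# Hypothetical toy frame with the same column names as the real dataset
toy = pd.DataFrame({
    "calls": [10.0, 20.0, 30.0, 40.0, 50.0],
    "minutes": [100.0, 200.0, 300.0, 400.0, 500.0],
    "messages": [5.0, 0.0, 12.0, 7.0, 3.0],
    "mb_used": [10000.0, 15000.0, 20000.0, 25000.0, 30000.0],
    "is_ultra": [0, 0, 0, 1, 0],
})

# Share of each class in the target column
class_share = toy["is_ultra"].value_counts(normalize=True)
print(class_share)

# Number of missing values per column (all zeros means nothing to impute)
missing = toy.isna().sum()
print(missing)
```

On the real data the same two calls would be applied to `df` instead of `toy`.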

## 2. Split the data into samples

In [4]:
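# Hold out 40% of the data, then split the held-out part in half:
# 60% training, 20% test, 20% validation
df_learn, df_test = train_test_split(df, test_size=0.4, random_state=12345)
df_test, df_valid = train_test_split(df_test, test_size=0.5, random_state=12345)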

features_learn = df_learn.drop('is_ultra', axis = 1)
target_learn = df_learn['is_ultra']

features_test = df_test.drop('is_ultra', axis = 1)
target_test = df_test['is_ultra']

features_valid = df_valid.drop('is_ultra', axis = 1)
target_valid = df_valid['is_ultra']

**Conclusion:** The data is split into training, test, and validation samples in a 3:1:1 ratio (60% / 20% / 20%).
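The split above is purely random. An optional variation, not used in this project, is to pass `stratify` so that each sample keeps roughly the same class proportions. A minimal sketch on synthetic data (the `toy` frame and its values are assumptions made for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data: 1000 rows, roughly 30% positive class
rng = np.random.default_rng(12345)
toy = pd.DataFrame({
    "minutes": rng.uniform(0, 1000, size=1000),
    "is_ultra": (rng.uniform(size=1000) < 0.3).astype(int),
})

# First cut: 60% training, 40% held out, keeping class proportions
toy_train, toy_rest = train_test_split(
    toy, test_size=0.4, random_state=12345, stratify=toy["is_ultra"]
)
# Second cut: halve the held-out part into 20% validation and 20% test
toy_valid, toy_test = train_test_split(
    toy_rest, test_size=0.5, random_state=12345, stratify=toy_rest["is_ultra"]
)

sizes = (len(toy_train), len(toy_valid), len(toy_test))
print(sizes)
```

With 1000 rows this gives samples of 600, 200, and 200 rows, which is the same 3:1:1 ratio.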

## 3. Explore Models

#### 3.1 RandomForest

In [5]:
best_model_forest = None
best_score_forest = 0
best_depth_forest = 0
best_est_forest = 0
for est in range(1, 20):
    for depth in range(1,20):
        model_RF = RandomForestClassifier(max_depth=depth, random_state = 12345, n_estimators=est)
        model_RF.fit(features_learn, target_learn)
        prediction_valid = model_RF.predict(features_valid)
        result = accuracy_score(target_valid, prediction_valid)
        if result > best_score_forest:
            best_score_forest = result
            best_model_forest = model_RF
            best_depth_forest = depth
            best_est_forest = est
        
print(f'Accuracy of the best model on the validation set: {best_score_forest}, estimators: {best_est_forest}, max_depth: {best_depth_forest}' )

Accuracy of the best model on the validation set: 0.8149300155520995, estimators: 14, max_depth: 9
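The nested loop above searches the hyperparameters by hand against a single validation set. A related approach, shown here only as a sketch, is `GridSearchCV`, which scores each combination with cross-validation instead of a fixed validation sample. The synthetic data and the small grid are assumptions chosen to keep the example quick, so its result is not comparable to the number above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the four real features
X, y = make_classification(
    n_samples=300, n_features=4, n_informative=3, n_redundant=1,
    random_state=12345,
)

# A deliberately small grid; the notebook loop covers ranges 1..19
param_grid = {"n_estimators": [5, 10, 15], "max_depth": [3, 6, 9]}

search = GridSearchCV(
    RandomForestClassifier(random_state=12345),
    param_grid,
    scoring="accuracy",
    cv=3,  # 3-fold cross-validation replaces the separate validation set
)
search.fit(X, y)

print(search.best_params_, round(search.best_score_, 3))
```

After fitting, `search.best_estimator_` holds the model refitted on all the data passed to `fit`.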


#### 3.2 LogisticRegression

In [6]:
model_LR = LogisticRegression(random_state = 12345) 
model_LR.fit(features_learn, target_learn)
result = model_LR.score(features_valid, target_valid)
print("Accuracy of the logistic regression model on the validation set:", result)

Accuracy of the logistic regression model on the validation set: 0.6842923794712286
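Logistic regression is sensitive to feature scale, and here the features live on very different scales (for example `mb_used` versus `messages`). One thing that could be tried, though it was not done in this project, is standardizing the features in a pipeline. The sketch below uses synthetic data, so it says nothing about how much the real score would change.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data with four features
X, y = make_classification(
    n_samples=500, n_features=4, n_informative=3, n_redundant=1,
    random_state=12345,
)
# Put one column on a much larger scale, loosely mimicking mb_used
X[:, 3] *= 10_000

X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=12345
)

# The scaler is fitted on the training part only and reused for validation
scaled_lr = make_pipeline(
    StandardScaler(),
    LogisticRegression(random_state=12345, max_iter=1000),
)
scaled_lr.fit(X_train, y_train)

valid_accuracy = scaled_lr.score(X_valid, y_valid)
print(f"Validation accuracy with scaling: {valid_accuracy:.3f}")
```

Keeping the scaler inside the pipeline avoids leaking validation statistics into training.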


#### 3.3 DecisionTreeClassifier

In [7]:
best_model_tree = None
best_score_tree = 0
best_depth_tree = 0

for depth in range(1, 10):
    model_DT = DecisionTreeClassifier(max_depth=depth, random_state = 12345)
    model_DT.fit(features_learn, target_learn)
    prediction_valid = model_DT.predict(features_valid)
    accuracy_tree = accuracy_score(target_valid, prediction_valid)
    if accuracy_tree > best_score_tree:
        best_score_tree = accuracy_tree
        best_model_tree = model_DT
        best_depth_tree = depth
        
        print(f'max_depth = {best_depth_tree} - Accuracy: {best_score_tree}')


max_depth = 1 - Accuracy: 0.7356143079315708
max_depth = 2 - Accuracy: 0.7744945567651633
max_depth = 3 - Accuracy: 0.7791601866251944
max_depth = 5 - Accuracy: 0.7838258164852255
max_depth = 7 - Accuracy: 0.7993779160186625


**Conclusion:** On the validation set, the random forest gives the highest accuracy (0.815 with 14 estimators and `max_depth` 9), followed by the decision tree (0.799 with `max_depth` 7). Logistic regression gives the lowest (0.684).

## 4. Check the model on the test set

#### 4.1 LogisticRegression

In [8]:
predict_testLR = model_LR.predict(features_test)
accuracy_testLR = accuracy_score(target_test, predict_testLR)
print(f'Accuracy of the Logistic Regression Model on the Test Set - {accuracy_testLR}')

Accuracy of the Logistic Regression Model on the Test Set - 0.7107309486780715


#### 4.2 RandomForest

In [9]:
predict_testRF = best_model_forest.predict(features_test)
accuracy_testRF = accuracy_score(target_test, predict_testRF)
print(f'Accuracy of the Random Forest model on the test set - {accuracy_testRF}')

Accuracy of the Random Forest model on the test set - 0.7900466562986003


#### 4.3 DecisionTreeClassifier

In [10]:
predict_testDT = best_model_tree.predict(features_test)
accuracy_testDT = accuracy_score(target_test, predict_testDT)
print(f'Accuracy of the Decision Tree model on the test set - {accuracy_testDT}')

Accuracy of the Decision Tree model on the test set - 0.7822706065318819


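**Conclusion:** On the test set the models score 0.790 (random forest), 0.782 (decision tree), and 0.711 (logistic regression). These values are close to the validation results, so the models do not appear to be heavily overfitted. The random forest and the decision tree both clear the required 0.75 threshold; logistic regression does not.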

## 5. Check models for adequacy

In [11]:
from sklearn.dummy import DummyClassifier
dummy = DummyClassifier(strategy='most_frequent', random_state=12345).fit(features_learn, target_learn)
dummy_pred = dummy.predict(features_test)
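print('Accuracy of the dummy model on the test set:', accuracy_score(target_test, dummy_pred))

Accuracy of the dummy model on the test set: 0.7060653188180405

**Conclusion:** The random forest (0.790) and the decision tree (0.782) beat the constant baseline (0.706) on the test set, so they pass the adequacy check. Logistic regression (0.711) is only marginally better than the baseline.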
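With `strategy='most_frequent'` the dummy model always predicts the majority class seen in training, so its accuracy equals the share of that class in the sample being scored. A small sketch with made-up labels (the arrays below are hypothetical, not the project data):

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Hypothetical labels: class 0 is the majority in both samples
y_train = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 0])
y_test = np.array([0, 0, 0, 0, 0, 0, 0, 1, 1, 1])

# The dummy model ignores the features, so placeholders are enough
X_train = np.zeros((len(y_train), 1))
X_test = np.zeros((len(y_test), 1))

dummy = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
baseline_accuracy = accuracy_score(y_test, dummy.predict(X_test))

# Share of the majority class (0) among the test labels
majority_share = np.mean(y_test == 0)
print(baseline_accuracy, majority_share)
```

Both printed numbers coincide, which is why a useful model needs to score noticeably above the majority-class share.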


## Conclusions
   - We opened the dataset, split it into training, validation, and test samples, and studied three types of models with different hyperparameters.
   - The `RandomForest` model works best. The test set confirmed this: it reached an accuracy of 0.790, above both the required 0.75 and the dummy baseline of 0.706.