# Mobile plans personal offer

At your disposal is data on the behavior of customers who have already switched to these plans (from the project of the course “Statistical Data Analysis”). We need to build a model for the classification task that will select the appropriate tariff. There is no need for data preprocessing - you have already done it.

Build a model with the highest possible *accuracy* value. To pass the project successfully, you need to bring the percentage of correct answers to at least 0.75. Check *accuracy* on a test sample yourself.

## Open and analyze the database

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.metrics import mean_squared_error

In [2]:
data=pd.read_csv('users_behavior.csv')
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3214 entries, 0 to 3213
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   calls     3214 non-null   float64
 1   minutes   3214 non-null   float64
 2   messages  3214 non-null   float64
 3   mb_used   3214 non-null   float64
 4   is_ultra  3214 non-null   int64  
dtypes: float64(4), int64(1)
memory usage: 125.7 KB


In [3]:
data.head()

Unnamed: 0,calls,minutes,messages,mb_used,is_ultra
0,40.0,311.9,83.0,19915.42,0
1,85.0,516.75,56.0,22696.96,0
2,77.0,467.66,86.0,21060.45,0
3,106.0,745.53,81.0,8437.39,1
4,66.0,418.74,1.0,14502.75,0


**Conclusion**
1. The features are the columns calls, minutes, messages and mb_used, and is_ultra is the target feature.
2. Preprocessing was done efficiently, there were no missing values

## Divide data into 3 samples

In [4]:
train, valid_and_test=train_test_split(data, test_size=0.4, random_state=12345)
valid, test=train_test_split(valid_and_test, test_size=0.5, random_state=12345)

train_features = train.drop(['is_ultra'], axis=1)
train_target = train['is_ultra']
valid_features = valid.drop(['is_ultra'], axis=1)
valid_target = valid['is_ultra']
valid_features = valid.drop(['is_ultra'], axis=1)
valid_target = valid['is_ultra']
test_features = test.drop(['is_ultra'], axis=1)
test_target = test['is_ultra']

print('Size of training sample:', train.shape[0])
print('Size of validation sample:', valid.shape[0])
print('Size of test sample:', test.shape[0])

Size of training sample: 1928
Size of validation sample: 643
Size of test sample: 643


**Conclusion**

We divided the source data into training, validation and test samples in the classic proportion 60/20/20

## Explore models

**Decision Tree**

In [5]:
best_result=0
best_est=0
best_split=0
best_leaf=0
array=['gini', 'entropy']

model = DecisionTreeClassifier(random_state=12345)
model.fit(train_features, train_target)
result = model.score(valid_features, valid_target)
print("Decision Tree accuracy (before hyperparameters improvement)=", result)

Decision Tree accuracy (before hyperparameters improvement)= 0.713841368584759


In [6]:
best_model_tree=None
best_depth=0
best_result=0

for depth in range(1, 20):
    for criteria in array:
        model=DecisionTreeClassifier(random_state=12345, max_depth=depth, criterion=criteria)
        model.fit(train_features, train_target)
        valid_predictions=model.predict(valid_features)
        result=accuracy_score(valid_target, valid_predictions)
        if result > best_result:
            best_model_tree = model
            best_result = model.score(valid_features, valid_target)
            best_depth=depth
            best_criterion=criteria
            
for split in range(2, 30):
    model = RandomForestClassifier(random_state=12345, max_depth=best_depth, criterion=best_criterion,\
                                              min_samples_split=split)
    model.fit(train_features, train_target)
    result = model.score(valid_features, valid_target)
    if result>best_result:
        best_model_tree=model
        best_result=result
        best_split=split

for leaf in range (2, 15):
    model = RandomForestClassifier(random_state=12345, max_depth=best_depth, criterion=best_criterion,\
                                              min_samples_split=best_split, min_samples_leaf=leaf)
    model.fit(train_features, train_target)
    result = model.score(valid_features, valid_target)
    if result>best_result:
        best_model_tree=model 
        best_result=result
        best_leaf=leaf
            
print("Decision tree accuracy:", best_result)
print ("max_depth=", best_depth)
print ("criterion=", criteria)
print ("min_samples_split=", best_split)
print ("min_samples_leaf=", best_leaf)

Decision tree accuracy: 0.7884914463452566
max_depth= 3
criterion= entropy
min_samples_split= 2
min_samples_leaf= 0


**Random forest**

In [7]:
best_model_forest=None
model = RandomForestClassifier(random_state=12345)
model.fit(train_features, train_target)
result = model.score(valid_features, valid_target)
print("Random Forest accuracy (before hyperparameters improvement)=", result)

Random Forest accuracy (before hyperparameters improvement)= 0.7853810264385692


In [8]:
best_result=0
for est in range (1, 25):
    for criteria in array:
        model = RandomForestClassifier(random_state=12345, n_estimators=est, criterion=criteria)
        model.fit(train_features, train_target)
        result = model.score(valid_features, valid_target)
        if result>best_result:
            best_model_forest=model
            best_result=result
            best_est=est
            best_criterion=criteria
for split in range(2, 30):
    model = RandomForestClassifier(random_state=12345, n_estimators=best_est, criterion=best_criterion,\
                                              min_samples_split=split)
    model.fit(train_features, train_target)
    result = model.score(valid_features, valid_target)
    if result>best_result:
        best_model_forest=model
        best_result=result
        best_split=split

for leaf in range (2, 15):
    model = RandomForestClassifier(random_state=12345, n_estimators=best_est, criterion=best_criterion,\
                                              min_samples_split=best_split, min_samples_leaf=leaf)
    model.fit(train_features, train_target)
    result = model.score(valid_features, valid_target)
    if result>best_result:
        best_model_forest=model
        best_result=result
        best_leaf=leaf
            
print("Random Forest accuracy=", best_result)
print ("n_estimators=", best_est)
print ("criterion=", criteria)
print ("min_samples_split=", best_split)
print ("min_samples_leaf=", best_leaf)

Random Forest accuracy= 0.8009331259720062
n_estimators= 23
criterion= entropy
min_samples_split= 23
min_samples_leaf= 8


**Conclusion**

Thanks to changing the hyperparameters, it was possible to improve accuracy by 2.1 percentage

**Logistic resgression**

In [9]:
array_warm_start=[True, False]
array_multi_class=['auto', 'ovr', 'multinomial']
best_result=0

for warm in array_warm_start:
    for multi in array_multi_class:
        model = LogisticRegression(random_state = 12345, solver = 'lbfgs', max_iter=1000, warm_start=warm, multi_class=multi)
        model.fit(train_features,train_target)
        valid_predictions=model.predict(valid_features)
        result = accuracy_score(valid_target, valid_predictions)
        print(result)
        if result>best_result:
            best_model_regression=model
            best_result=result
            warm_starter=warm
            multi_classer=multi
    
print ('Logistic Regression accuracy=', best_result, 'multi_class:', multi_classer)
best_model_regression

0.7107309486780715
0.7107309486780715
0.7558320373250389
0.7107309486780715
0.7107309486780715
0.7558320373250389
Logistic Regression accuracy= 0.7558320373250389 multi_class: multinomial


## Test the model on a test sample

In [10]:
result_tree = best_model_tree.score (test_features, test_target)
result_forest=best_model_forest.score (test_features, test_target)
result_regression=best_model_regression.score (test_features, test_target)

print("Accuracy Decision Tree на тестовой выборке:", result_tree)
print("Accuracy Random Forest на тестовой выборке:", result_forest)
print("Accuracy Logistic Regression на тестовой выборке:", result_regression)

Accuracy Decision Tree на тестовой выборке: 0.7791601866251944
Accuracy Random Forest на тестовой выборке: 0.80248833592535
Accuracy Logistic Regression на тестовой выборке: 0.7387247278382582


**Conclusion**

The Decision Tree and Random Forest models surpassed 0.75. Since the Decision Tree works faster than the Random Forest, and the share of correct answers is approximately the same, we choose the first method.

## Check the models for adequacy

In [11]:
from sklearn.dummy import DummyClassifier
dummy = DummyClassifier(strategy='most_frequent', random_state=12345)
dummy.fit(train_features, train_target)
output=dummy.score(test_features, test_target)
output

0.6842923794712286

**Conclusion**

The simplest dummy model gives an accuracy of 0.698, which is lower than that of the model we obtained. This means that a model based on the random forest method makes sense.

## Conclusion

1. Data file processed
2. The source data is divided into 3 samples: training, validation and test
3. 3 models were studied: Decision Tree, Random Forest and Logistic Regression
4. Selected hyperparameters that improve the model
5. The quality of the resulting model was checked on a test sample
6. We checked the model for sanity. Our model is more effective than the basic one