# Plan recommender

You have access to behavior data about subscribers who have already switched to the new plans (from the project for the Statistical Data Analysis course). For this classification task, you need to develop a model that will pick the right plan. Since you’ve already performed the data preprocessing step, you can move straight to creating the model.

Develop a model with the highest possible *accuracy*. In this project, the threshold for accuracy is 0.75. Check the *accuracy* using the test dataset.

In [2]:
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# 1. Loading data

In [4]:
df = pd.read_csv('users_behavior.csv')
df.head()

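| | calls | minutes | messages | mb_used | is_ultra |
|---|---|---|---|---|---|
| 0 | 40.0 | 311.90 | 83.0 | 19915.42 | 0 |
| 1 | 85.0 | 516.75 | 56.0 | 22696.96 | 0 |
| 2 | 77.0 | 467.66 | 86.0 | 21060.45 | 0 |
| 3 | 106.0 | 745.53 | 81.0 | 8437.39 | 1 |
| 4 | 66.0 | 418.74 | 1.0 | 14502.75 | 0 |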


In [35]:
# df.info() prints its summary itself and returns None, so no print() is needed
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3214 entries, 0 to 3213
Data columns (total 5 columns):
calls       3214 non-null float64
minutes     3214 non-null float64
messages    3214 non-null float64
mb_used     3214 non-null float64
is_ultra    3214 non-null int64
dtypes: float64(4), int64(1)
memory usage: 125.6 KB


# 2. Splitting data into sets

In [10]:
# Hold out 20% of the rows for testing, then take 25% of the remainder
# for validation: 60/20/20 train/valid/test overall.
train_valid, test = train_test_split(df, test_size=0.2)
train, valid = train_test_split(train_valid, test_size=0.25)

features_train = train.drop(['is_ultra'], axis=1)
target_train = train['is_ultra']
features_valid = valid.drop(['is_ultra'], axis=1)
target_valid = valid['is_ultra']
features_test = test.drop(['is_ultra'], axis=1)
target_test = test['is_ultra']

print(features_train.shape)
print(features_valid.shape)
print(features_test.shape)

(1928, 4)
(643, 4)
(643, 4)
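
The split above does not fix `random_state`, so the rows that land in each set change from run to run. Below is a minimal sketch of a reproducible, stratified 60/20/20 split. The synthetic DataFrame `demo_df`, the seed value, and the use of `stratify` are assumptions made for illustration and are not part of the original notebook.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for users_behavior.csv (assumption: same column names,
# random values), so the sketch runs without the real file.
rng = np.random.default_rng(0)
n_rows = 3214
demo_df = pd.DataFrame({
    "calls": rng.integers(0, 200, n_rows).astype(float),
    "minutes": rng.uniform(0, 1500, n_rows),
    "messages": rng.integers(0, 200, n_rows).astype(float),
    "mb_used": rng.uniform(0, 45000, n_rows),
    "is_ultra": (rng.random(n_rows) < 0.3).astype(int),
})

SEED = 12345  # assumption: any fixed integer makes the split repeatable

# Step 1: hold out 20% of all rows as the test set.
demo_train_valid, demo_test = train_test_split(
    demo_df, test_size=0.2, random_state=SEED, stratify=demo_df["is_ultra"]
)
# Step 2: 25% of the remaining 80% equals 20% of the total,
# which leaves 60% of the rows for training.
demo_train, demo_valid = train_test_split(
    demo_train_valid,
    test_size=0.25,
    random_state=SEED,
    stratify=demo_train_valid["is_ultra"],
)

print(demo_train.shape, demo_valid.shape, demo_test.shape)
```

Stratifying on `is_ultra` keeps the class share roughly equal across the three sets, which may matter here because the classes are imbalanced.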


# 3. Tuning models

In [29]:
print("Decision Tree")
for depth in range(1, 11):
    model = DecisionTreeClassifier(max_depth=depth, random_state=12345)
    model.fit(features_train, target_train)
    print("max_depth =", depth)
    print("Train:", model.score(features_train, target_train))
    print("Valid:", model.score(features_valid, target_valid))

Decision Tree
max_depth = 1
Train: 0.7515560165975104
Valid: 0.749611197511664
max_depth = 2
Train: 0.7863070539419087
Valid: 0.7807153965785381
max_depth = 3
Train: 0.8018672199170125
Valid: 0.7916018662519441
max_depth = 4
Train: 0.8112033195020747
Valid: 0.7916018662519441
max_depth = 5
Train: 0.8184647302904564
Valid: 0.7947122861586314
max_depth = 6
Train: 0.8298755186721992
Valid: 0.7962674961119751
max_depth = 7
Train: 0.8423236514522822
Valid: 0.7993779160186625
max_depth = 8
Train: 0.8568464730290456
Valid: 0.807153965785381
max_depth = 9
Train: 0.8661825726141079
Valid: 0.80248833592535
max_depth = 10
Train: 0.8755186721991701
Valid: 0.7900466562986003
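
Rather than reading the best depth off the printout, the choice can be made programmatically. The sketch below reuses the accuracies printed above (copied by hand into a dictionary, whose name `tree_scores` is hypothetical) and picks the depth with the highest validation accuracy, then reports the train/validation gap as a rough overfitting indicator.

```python
# (train accuracy, validation accuracy) per max_depth, copied from the output above.
tree_scores = {
    1: (0.7515560165975104, 0.749611197511664),
    2: (0.7863070539419087, 0.7807153965785381),
    3: (0.8018672199170125, 0.7916018662519441),
    4: (0.8112033195020747, 0.7916018662519441),
    5: (0.8184647302904564, 0.7947122861586314),
    6: (0.8298755186721992, 0.7962674961119751),
    7: (0.8423236514522822, 0.7993779160186625),
    8: (0.8568464730290456, 0.807153965785381),
    9: (0.8661825726141079, 0.80248833592535),
    10: (0.8755186721991701, 0.7900466562986003),
}

# Select the depth by validation accuracy only; training accuracy keeps
# rising with depth and would always favor the deepest tree.
best_depth = max(tree_scores, key=lambda depth: tree_scores[depth][1])
best_train, best_valid = tree_scores[best_depth]
gap = best_train - best_valid

print("best max_depth:", best_depth)
print("validation accuracy:", round(best_valid, 4))
print("train/valid gap:", round(gap, 4))
```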


In [18]:
print("Random Forest")
for estim in range(10, 101, 10):
    model = RandomForestClassifier(n_estimators=estim, random_state=12345)
    model.fit(features_train, target_train)
    print("n_estimators =", estim)
    print("Train:", model.score(features_train, target_train))
    print("Valid:", model.score(features_valid, target_valid))

Random Forest
n_estimators = 10
Train: 0.9813278008298755
Valid: 0.8055987558320373
n_estimators = 20
Train: 0.9901452282157677
Valid: 0.8087091757387247
n_estimators = 30
Train: 0.9953319502074689
Valid: 0.807153965785381
n_estimators = 40
Train: 0.9989626556016598
Valid: 0.8055987558320373
n_estimators = 50
Train: 0.9989626556016598
Valid: 0.7993779160186625
n_estimators = 60
Train: 1.0
Valid: 0.8102643856920684
n_estimators = 70
Train: 1.0
Valid: 0.8055987558320373
n_estimators = 80
Train: 1.0
Valid: 0.8164852255054432
n_estimators = 90
Train: 1.0
Valid: 0.8102643856920684
n_estimators = 100
Train: 1.0
Valid: 0.8149300155520995
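
The validation accuracies above move up and down by one or two percentage points as `n_estimators` changes. A rough way to judge whether such differences are meaningful is the binomial standard error of an accuracy measured on 643 validation rows. The sketch below is an approximation that assumes independent predictions; the variable names are hypothetical.

```python
import math

n_valid = 643  # number of validation rows, from the split above

# Validation accuracy per n_estimators, copied from the output above.
rf_valid = {
    10: 0.8055987558320373,
    20: 0.8087091757387247,
    30: 0.807153965785381,
    40: 0.8055987558320373,
    50: 0.7993779160186625,
    60: 0.8102643856920684,
    70: 0.8055987558320373,
    80: 0.8164852255054432,
    90: 0.8102643856920684,
    100: 0.8149300155520995,
}

best_n = max(rf_valid, key=rf_valid.get)
p = rf_valid[best_n]

# Binomial standard error of an accuracy estimate: sqrt(p * (1 - p) / n).
std_err = math.sqrt(p * (1 - p) / n_valid)

# Range of validation accuracies across all tried forest sizes.
spread = max(rf_valid.values()) - min(rf_valid.values())

print("best n_estimators:", best_n)
print("standard error:", round(std_err, 4))
print("spread across settings:", round(spread, 4))
```

If the spread is comparable to the standard error, the ranking of the settings is plausibly noise, and the choice of `n_estimators` matters less than the printout suggests.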


In [19]:
print("Logistic Regression")
model = LogisticRegression(random_state=12345)
model.fit(features_train, target_train)
print("Train:", model.score(features_train, target_train))
print("Valid:", model.score(features_valid, target_valid))

Logistic Regression
Train: 0.7100622406639004
Valid: 0.7076205287713841




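### Findings

- Logistic regression performs worst (about 0.71 on both training and validation), but it does not overfit: the two scores are nearly equal.
- The decision tree overfits as depth grows (at `max_depth=10`, 0.88 on training vs. 0.79 on validation), but its best validation accuracy is higher: 0.807 at `max_depth=8`.
- The random forest overfits even more (training accuracy reaches 1.0), yet its validation accuracy is slightly higher than the decision tree's: 0.816 at `n_estimators=80`.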

# 4. Testing model

In [20]:
# Hyperparameters are chosen, so retrain on train + validation rows before the final test
features_full_train = train_valid.drop(['is_ultra'], axis=1)
target_full_train = train_valid['is_ultra']

In [28]:
model = RandomForestClassifier(n_estimators=80, random_state=12345)
model.fit(features_full_train, target_full_train)
model.score(features_test, target_test)

0.7869362363919129

# 5. Additional task: sanity check

In [7]:
df['is_ultra'].value_counts() / df.shape[0]

0    0.693528
1    0.306472
Name: is_ultra, dtype: float64

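The sanity-check baseline is a constant model that always predicts the majority class (`is_ultra = 0`); it scores about 0.69 accuracy. Logistic regression (0.71 on validation) barely beats it, so it has learned little. The random forest (0.79 on test) clears both the baseline and the 0.75 threshold.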
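
The baseline comparison can be made explicit with a few lines of plain Python. In the sketch below, the class counts are derived from the shares printed above (0.693528 and 0.306472 of 3214 rows give 2229 and 985), and the model scores are copied from earlier outputs; the variable names are hypothetical.

```python
from collections import Counter

# Labels rebuilt from the class shares above: 2229 zeros and 985 ones.
labels = [0] * 2229 + [1] * 985

# A constant model that always predicts the most frequent class is right
# exactly as often as that class occurs.
majority_class, majority_count = Counter(labels).most_common(1)[0]
baseline_accuracy = majority_count / len(labels)

# Scores copied from the notebook outputs.
model_scores = {
    "logistic regression (valid)": 0.7076205287713841,
    "random forest (test)": 0.7869362363919129,
}

print("majority class:", majority_class)
print("baseline accuracy:", round(baseline_accuracy, 4))
for name, score in model_scores.items():
    print(name, "gain over baseline:", round(score - baseline_accuracy, 4))
```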