In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Step 1: Open the file and study the general information

In [2]:
data = pd.read_csv('/datasets/users_behavior.csv')

data.info()
data.head(10)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3214 entries, 0 to 3213
Data columns (total 5 columns):
calls       3214 non-null float64
minutes     3214 non-null float64
messages    3214 non-null float64
mb_used     3214 non-null float64
is_ultra    3214 non-null int64
dtypes: float64(4), int64(1)
memory usage: 125.7 KB


| | calls | minutes | messages | mb_used | is_ultra |
|---|---|---|---|---|---|
| 0 | 40.0 | 311.9 | 83.0 | 19915.42 | 0 |
| 1 | 85.0 | 516.75 | 56.0 | 22696.96 | 0 |
| 2 | 77.0 | 467.66 | 86.0 | 21060.45 | 0 |
| 3 | 106.0 | 745.53 | 81.0 | 8437.39 | 1 |
| 4 | 66.0 | 418.74 | 1.0 | 14502.75 | 0 |
| 5 | 58.0 | 344.56 | 21.0 | 15823.37 | 0 |
| 6 | 57.0 | 431.64 | 20.0 | 3738.9 | 1 |
| 7 | 15.0 | 132.4 | 6.0 | 21911.6 | 0 |
| 8 | 7.0 | 43.39 | 3.0 | 2538.67 | 1 |
| 9 | 90.0 | 665.41 | 38.0 | 17358.61 | 0 |


Our goal is to build a model that predicts which plan best suits a customer. Since we don't have a separate test dataset, we'll split the data we have into **training, validation, and test datasets** with a 3:1:1 ratio. Our **features** will be **calls, minutes, messages, and mb_used**, and our **target** will be **is_ultra**.

# Step 2: Split the data

In [3]:
#split 3 ways - training, validation, and test.
#train_test_split only splits two ways, so it is called twice:
#first 60/40, then the 40% remainder is halved into validation and test.
df_train, df_temp = train_test_split(data, test_size=0.40, random_state=12345)
df_valid, df_test = train_test_split(df_temp, test_size=0.50, random_state=12345)

print(
df_train.shape,
df_valid.shape,
df_test.shape)

(1928, 5) (643, 5) (643, 5)


Here we can see that we split our data as needed, with a 3:1:1 ratio. We can now test the different models to see which has the best result.
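
The row counts above can be reproduced by hand. The sketch below assumes that, for a fractional `test_size`, `train_test_split` rounds the test part up and gives the remainder to the training part; that rounding rule is my understanding of the library's behavior, not something stated in this notebook, and `split_sizes` is a hypothetical helper written only for illustration.

```python
import math


def split_sizes(n_samples, test_size):
    """Return (n_first, n_second) for a fractional test_size.

    Assumption: the second (test) part is rounded up and the first part
    receives whatever is left.
    """
    n_second = math.ceil(test_size * n_samples)
    return n_samples - n_second, n_second


n_total = 3214  # rows reported by data.info()

# First call: 60% training, 40% held out.
n_train, n_temp = split_sizes(n_total, 0.40)

# Second call: the held-out part is halved into validation and test.
n_valid, n_test = split_sizes(n_temp, 0.50)

print(n_train, n_valid, n_test)
for name, n in [('train', n_train), ('valid', n_valid), ('test', n_test)]:
    print(f'{name}: {n / n_total:.1%} of the data')
```

Under that assumption the sizes come out as 1928, 643, and 643, which matches the shapes printed above and corresponds to roughly 60%, 20%, and 20% of the rows.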

# Step 3: Train and validate models

We'll first create the features and target for both the training and validation data.

In [4]:
#features and target for training
features_train = df_train.drop('is_ultra', axis=1)
target_train = df_train['is_ultra']

#features and target for validation
features_valid = df_valid.drop('is_ultra',axis=1)
target_valid = df_valid['is_ultra']

## Decision Tree Classifier

We'll first investigate the quality of the decision tree model by changing the max_depth hyperparameter.

In [5]:
for depth in range(1,10):
    #create model
    dt_model = DecisionTreeClassifier(random_state=123, max_depth=depth)

    #train model
    dt_model.fit(features_train,target_train)
    
    #make prediction
    dt_prediction = dt_model.predict(features_valid)
    
    print('Accuracy with max depth of', depth, ':',accuracy_score(target_valid, dt_prediction))
    


Accuracy with max depth of 1 : 0.7542768273716952
Accuracy with max depth of 2 : 0.7822706065318819
Accuracy with max depth of 3 : 0.7853810264385692
Accuracy with max depth of 4 : 0.7791601866251944
Accuracy with max depth of 5 : 0.7791601866251944
Accuracy with max depth of 6 : 0.7838258164852255
Accuracy with max depth of 7 : 0.7822706065318819
Accuracy with max depth of 8 : 0.7807153965785381
Accuracy with max depth of 9 : 0.7869362363919129


Accuracy with the decision tree ranges from about 75% to 79%. A max depth of 9 gives the highest accuracy, 78.69%. However, a max depth of 2 is less than half a percentage point behind, so for the sake of time efficiency, if we are to use a decision tree, we can stick with a **maximum depth of 2**, with **78.23% accuracy**.
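
The choice of depth can also be made programmatically. This is a minimal sketch that reuses the validation accuracies printed above and picks the shallowest tree whose accuracy is within a tolerance of the best one. The tolerance of 0.005 (half a percentage point) is an assumption chosen for illustration, not a value from the original analysis.

```python
# Validation accuracies copied from the output above (max_depth -> accuracy).
accuracy_by_depth = {
    1: 0.7542768273716952,
    2: 0.7822706065318819,
    3: 0.7853810264385692,
    4: 0.7791601866251944,
    5: 0.7791601866251944,
    6: 0.7838258164852255,
    7: 0.7822706065318819,
    8: 0.7807153965785381,
    9: 0.7869362363919129,
}

# Depth with the highest validation accuracy.
best_depth = max(accuracy_by_depth, key=accuracy_by_depth.get)
best_accuracy = accuracy_by_depth[best_depth]

# Assumed tolerance: accept any depth within half a percentage point of the best.
tolerance = 0.005
simplest_depth = min(
    depth
    for depth, accuracy in accuracy_by_depth.items()
    if best_accuracy - accuracy <= tolerance
)

print('Most accurate depth:', best_depth)
print('Shallowest depth within tolerance:', simplest_depth)
```

With this tolerance the rule selects the same depth of 2 chosen above; a tighter tolerance would favor a deeper tree.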

## Random Forest Classifier

Here we'll test the random forest model, changing the hyperparameter that specifies the number of trees.

In [6]:
for trees in range(10,100,10):
    #create model
    rf_model = RandomForestClassifier(random_state=123,n_estimators=trees)

    #train model
    rf_model.fit(features_train,target_train)
    
    #make prediction
    rf_prediction = rf_model.predict(features_valid)
    
    print('Accuracy with', trees, 'trees:',accuracy_score(target_valid, rf_prediction))

Accuracy with 10 trees: 0.7884914463452566
Accuracy with 20 trees: 0.7931570762052877
Accuracy with 30 trees: 0.7916018662519441
Accuracy with 40 trees: 0.7978227060653188
Accuracy with 50 trees: 0.7947122861586314
Accuracy with 60 trees: 0.7916018662519441
Accuracy with 70 trees: 0.7947122861586314
Accuracy with 80 trees: 0.7916018662519441
Accuracy with 90 trees: 0.7900466562986003


As we might have expected, accuracy improves with the random forest model. Our results hover around 79%, and even with just **10** trees the accuracy, **78.85%**, already exceeds our best decision tree result (78.69%). Our best result is with **40** trees, at **79.78%**.
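
To put these percentage differences in perspective, the accuracies can be converted back into counts of correctly classified customers. A minimal sketch, assuming the validation set holds 643 rows (the shape printed in Step 2) and using the accuracies reported above; the variable names are illustrative.

```python
n_valid = 643  # validation rows, from the shapes printed in Step 2

# Validation accuracies copied from the outputs above.
reported = {
    'best decision tree (depth 9)': 0.7869362363919129,
    'random forest, 10 trees': 0.7884914463452566,
    'random forest, 40 trees': 0.7978227060653188,
}

# accuracy = correct / total, so correct = accuracy * total (rounded to
# absorb floating-point error).
correct = {name: round(acc * n_valid) for name, acc in reported.items()}
for name, count in correct.items():
    print(f'{name}: {count} of {n_valid} correct')

extra = correct['random forest, 40 trees'] - correct['best decision tree (depth 9)']
print('Extra correct predictions from 40 trees vs. the best tree:', extra)
```

Each validation sample is worth about 0.16 percentage points, so the gaps between neighboring settings amount to only a handful of customers.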

## Logistic Regression

Next, let's test the logistic regression model.

In [12]:
#create model
lr_model = LogisticRegression(solver='liblinear')
 
#train model
lr_model.fit(features_train, target_train)

#make prediction
lr_prediction = lr_model.predict(features_valid)

print('Accuracy using logistic regression:',accuracy_score(target_valid, lr_prediction))

Accuracy using logistic regression: 0.7589424572317263


Our accuracy with logistic regression sits at **75.89%**. 

Comparing the models, logistic regression is the least accurate, though it is also the simplest and typically the fastest to train. If we want a fast yet more accurate solution, a decision tree with a depth of 2 is a better choice. The random forest is the most accurate, but we trade some time efficiency for that accuracy.
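
The speed comparison was not measured in this notebook, so here is one way it could be checked. The sketch times fitting plus prediction for the three model types with `time.perf_counter`. It runs on synthetic data from `make_classification` as a stand-in (an assumption, since the original CSV is not bundled here), so the timings and accuracies it prints say nothing about the real dataset.

```python
import time

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption): 4 numeric features, binary target.
X, y = make_classification(
    n_samples=1000, n_features=4, n_informative=3, n_redundant=1,
    random_state=12345,
)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=12345
)

# Same model settings as the cells above.
candidates = {
    'logistic regression': LogisticRegression(solver='liblinear'),
    'decision tree (depth 2)': DecisionTreeClassifier(random_state=123, max_depth=2),
    'random forest (40 trees)': RandomForestClassifier(random_state=123, n_estimators=40),
}

results = {}
for name, model in candidates.items():
    start = time.perf_counter()
    model.fit(X_train, y_train)
    predictions = model.predict(X_valid)
    elapsed = time.perf_counter() - start
    results[name] = (elapsed, accuracy_score(y_valid, predictions))
    print(f'{name}: {elapsed:.4f} s, accuracy {results[name][1]:.3f}')
```

Swapping the synthetic arrays for `features_train`, `target_train`, `features_valid`, and `target_valid` would give timings for the actual data; a single run is noisy, so repeating the measurement a few times is advisable.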

# Step 4: Test model

Since we're trying to create a model that gives us the highest accuracy, we'll use a random forest model with 40 trees, and check its quality.

In [9]:
features_test = df_test.drop('is_ultra', axis=1)
target_test = df_test['is_ultra']

model = RandomForestClassifier(random_state=123,n_estimators=40)
model.fit(features_train,target_train)
prediction = model.predict(features_test)

print("Our model's accuracy using the test data is:", accuracy_score(target_test,prediction))

Our model's accuracy using the test data is: 0.7884914463452566


Our accuracy is 78.85%, similar to the results we achieved when validating our model.
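
One rough way to judge whether "similar" is justified is to compare the gap between the validation and test accuracy with the sampling noise expected from a test set of this size. The sketch below uses the normal-approximation standard error of a proportion, sqrt(p(1 - p) / n). That formula is a technique I am bringing in for illustration; it is not part of the original analysis, and it treats the test rows as independent samples.

```python
import math

n_test = 643                          # test rows, from the shapes printed in Step 2
test_accuracy = 0.7884914463452566    # random forest, 40 trees, test data
valid_accuracy = 0.7978227060653188   # random forest, 40 trees, validation data

# Normal-approximation standard error of an accuracy estimated from n_test samples.
standard_error = math.sqrt(test_accuracy * (1 - test_accuracy) / n_test)

# Drop in accuracy when moving from validation to test data.
gap = valid_accuracy - test_accuracy

print(f'Standard error of the test accuracy: {standard_error:.4f}')
print(f'Validation minus test accuracy:      {gap:.4f}')
print('Gap within one standard error:', gap < standard_error)
```

Under these assumptions the drop from validation to test is smaller than one standard error, which is consistent with ordinary sampling variation rather than overfitting to the validation set.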