## Project description
Mobile carrier Megaline has found out that many of their subscribers use legacy plans. They want to develop a model that would analyze subscribers' behavior and recommend one of Megaline's newer plans: Smart or Ultra.
You have access to behavior data about subscribers who have already switched to the new plans (from the project for the Statistical Data Analysis course). For this classification task, you need to develop a model that will pick the right plan. Since you’ve already performed the data preprocessing step, you can move straight to creating the model.
Develop a model with the highest possible accuracy. In this project, the threshold for accuracy is 0.75. Check the accuracy using the test dataset.

### Outline
- Exploratory Data Analysis
- Supervised Machine Learning
    - Decision Tree 
    - Random Forest 
    
    
### Description of data

Every observation in the dataset contains monthly behavior information about one user. The information given is as follows:
- calls: number of calls,
- minutes: total call duration in minutes,
- messages: number of text messages,
- mb_used: Internet traffic used in MB,
- is_ultra: plan for the current month (Ultra - 1, Smart - 0).


## Exploratory Data Analysis

In [1]:
# Initialize
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split


In [2]:
# Load data
df = pd.read_csv('/datasets/users_behavior.csv')

# General information
df.info()
df.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3214 entries, 0 to 3213
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   calls     3214 non-null   float64
 1   minutes   3214 non-null   float64
 2   messages  3214 non-null   float64
 3   mb_used   3214 non-null   float64
 4   is_ultra  3214 non-null   int64  
dtypes: float64(4), int64(1)
memory usage: 125.7 KB


| | calls | minutes | messages | mb_used | is_ultra |
|---|---|---|---|---|---|
| 0 | 40.0 | 311.9 | 83.0 | 19915.42 | 0 |
| 1 | 85.0 | 516.75 | 56.0 | 22696.96 | 0 |
| 2 | 77.0 | 467.66 | 86.0 | 21060.45 | 0 |
| 3 | 106.0 | 745.53 | 81.0 | 8437.39 | 1 |
| 4 | 66.0 | 418.74 | 1.0 | 14502.75 | 0 |


This dataset contains information on 3214 Megaline subscribers. The minutes and mb_used columns contain decimal values, so they can remain floats. The calls and messages columns contain only whole numbers, so they will be converted to integers.
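Before casting, it can be worth confirming that a float column really holds only whole numbers, because `astype(int)` silently truncates any fractional part. The following is a minimal sketch of such a check. It assumes a small stand-in frame named `sample` (a hypothetical name) built from the five rows shown in the preview above, not the full dataset.

```python
import pandas as pd

# Stand-in data: the calls and messages values from the five previewed rows
sample = pd.DataFrame({
    'calls': [40.0, 85.0, 77.0, 106.0, 66.0],
    'messages': [83.0, 56.0, 86.0, 81.0, 1.0],
})

# For each column, True when no value has a fractional part
is_whole = (sample % 1 == 0).all()
print(is_whole)

# Cast only the columns that passed the check
whole_cols = is_whole[is_whole].index.tolist()
sample = sample.astype({col: int for col in whole_cols})
print(sample.dtypes)
```

On the real data, the same check would be run on `df[['calls', 'messages']]` before the conversion.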

In [3]:
# Converting datatypes
df['calls'] = df['calls'].astype(int)
df['messages'] = df['messages'].astype(int)
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3214 entries, 0 to 3213
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   calls     3214 non-null   int64  
 1   minutes   3214 non-null   float64
 2   messages  3214 non-null   int64  
 3   mb_used   3214 non-null   float64
 4   is_ultra  3214 non-null   int64  
dtypes: float64(2), int64(3)
memory usage: 125.7 KB


In [4]:
# Descriptive statistics
df.describe()

| | calls | minutes | messages | mb_used | is_ultra |
|---|---|---|---|---|---|
| count | 3214.0 | 3214.0 | 3214.0 | 3214.0 | 3214.0 |
| mean | 63.038892 | 438.208787 | 38.281269 | 17207.673836 | 0.306472 |
| std | 33.236368 | 234.569872 | 36.148326 | 7570.968246 | 0.4611 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 40.0 | 274.575 | 9.0 | 12491.9025 | 0.0 |
| 50% | 62.0 | 430.6 | 30.0 | 16943.235 | 0.0 |
| 75% | 82.0 | 571.9275 | 57.0 | 21424.7 | 1.0 |
| max | 244.0 | 1632.06 | 224.0 | 49745.73 | 1.0 |


In [5]:
# Find null values
df.isna().sum()

calls       0
minutes     0
messages    0
mb_used     0
is_ultra    0
dtype: int64

In [6]:
# Percentage of subscribers in each plan
df.is_ultra.value_counts(normalize=True)

0    0.693528
1    0.306472
Name: is_ultra, dtype: float64

There are no missing values within this dataset. A majority of Megaline clients (69%) are subscribed to the Smart Plan.
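Because the classes are imbalanced, a model that always predicts the majority plan already scores well on accuracy. The sketch below works out that trivial baseline from the class shares printed above. It describes the full dataset, so the share within the test split may differ slightly.

```python
# Class shares copied from the value_counts(normalize=True) output above
class_shares = {0: 0.693528, 1: 0.306472}  # 0 = Smart, 1 = Ultra

# A constant model always predicts the most common class,
# so its accuracy equals that class's share of the data
majority_class = max(class_shares, key=class_shares.get)
baseline_accuracy = class_shares[majority_class]

print(majority_class, baseline_accuracy)
```

The 0.75 threshold therefore sits roughly six percentage points above what a constant "Smart" prediction would achieve.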

## Supervised Machine Learning

Classification tasks deal with categorical targets (e.g. determining the animal species in a picture). Our target has two categories, the Ultra (1) and Smart (0) plans, so this is a binary classification task. A decision tree and a random forest will be tested to see which is more accurate.

In [7]:

features = df.drop(columns=['is_ultra'])
target = df['is_ultra']

# Split the data into training, validation, and test sets:
# 20% is held out for testing, then 20% of the remainder for validation
features_train, features_test, target_train, target_test = train_test_split(features, target, test_size=0.2, random_state=12345)
features_train, features_valid, target_train, target_valid = train_test_split(features_train, target_train, test_size=0.20, random_state=12345)

In [8]:
# Sanity check

print(features_train.shape)
print(target_train.shape)
print(features_test.shape)
print(target_test.shape)
print(features_valid.shape)
print(target_valid.shape)

(2056, 4)
(2056,)
(643, 4)
(643,)
(515, 4)
(515,)
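The shapes above follow from applying `test_size=0.2` twice. The arithmetic can be sketched as follows, on the assumption that the held-out size is rounded up at each split, which reproduces the printed shapes.

```python
import math

n_total = 3214       # rows in the dataset
test_fraction = 0.2  # test_size used in both train_test_split calls

# First split: hold out the test set
n_test = math.ceil(n_total * test_fraction)
n_remaining = n_total - n_test

# Second split: take the validation set out of the remainder
n_valid = math.ceil(n_remaining * test_fraction)
n_train = n_remaining - n_valid

print(n_train, n_valid, n_test)

# Share of the full dataset that ends up in each subset
shares = [round(n / n_total, 2) for n in (n_train, n_valid, n_test)]
print(shares)
```

The result is roughly a 64% / 16% / 20% split between training, validation, and test data.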


### Decision Tree

In [15]:
# Tuning hyperparameters
for depth in range(1, 10):
    # Create a model with the given max_depth
    model = DecisionTreeClassifier(max_depth=depth, random_state=12345)

    # Train the model on the training set
    model.fit(features_train, target_train)

    # Find the predictions using the validation set
    predictions_valid = model.predict(features_valid)

    print("max_depth =", depth, ": ", end='')
    print(accuracy_score(target_valid, predictions_valid))

max_depth = 1 : 0.7223300970873786
max_depth = 2 : 0.7475728155339806
max_depth = 3 : 0.7553398058252427
max_depth = 4 : 0.7533980582524272
max_depth = 5 : 0.7572815533980582
max_depth = 6 : 0.7611650485436893
max_depth = 7 : 0.7650485436893204
max_depth = 8 : 0.7631067961165049
max_depth = 9 : 0.7533980582524272


A max depth of 7 yields the highest accuracy for the validation set.
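The best depth can also be picked programmatically rather than by reading the printout. This short sketch uses the validation accuracies copied from the output above, with the dictionary name `valid_accuracy` chosen here for illustration. It also shows how narrow the winning margin is in terms of validation rows.

```python
# Validation accuracies copied from the output above, keyed by max_depth
valid_accuracy = {
    1: 0.7223300970873786,
    2: 0.7475728155339806,
    3: 0.7553398058252427,
    4: 0.7533980582524272,
    5: 0.7572815533980582,
    6: 0.7611650485436893,
    7: 0.7650485436893204,
    8: 0.7631067961165049,
    9: 0.7533980582524272,
}
n_valid = 515  # rows in the validation set

# Depth with the highest validation accuracy
best_depth = max(valid_accuracy, key=valid_accuracy.get)

# Gap between the best and second-best depth, expressed in validation rows
top_two = sorted(valid_accuracy.values(), reverse=True)[:2]
margin_rows = round((top_two[0] - top_two[1]) * n_valid)

print(best_depth, margin_rows)
```

A margin of a single validation row suggests that depths 7 and 8 are practically equivalent on this split.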

In [10]:
# Train final tree model
tree_model = DecisionTreeClassifier(max_depth=7,random_state=12345)
tree_model.fit(features_train,target_train)

# Predictions
decision_prediction_test = tree_model.predict(features_test)
tree_train_predictions = tree_model.predict(features_train)

# Accuracy
print('Training set accuracy:',accuracy_score(target_train,tree_train_predictions))
print('Test set accuracy:',accuracy_score(target_test,decision_prediction_test))

Training set accuracy: 0.8516536964980544
Test set accuracy: 0.7916018662519441


The best decision tree model (max_depth=7) reached 79.2% accuracy on the test set, compared with 85.2% on the training set.

### Random Forest 

In [16]:
# Tuning hyperparameters

best_score = 0
best_est = 0
for est in range(1, 50):  # try 1 to 49 trees
    model = RandomForestClassifier(random_state=54321, n_estimators=est)  # set number of trees
    model.fit(features_train, target_train)  # train model on training set
    score = model.score(features_valid, target_valid)  # calculate accuracy score on validation set
    if score > best_score:
        best_score = score  # save best accuracy score on validation set
        best_est = est  # save number of estimators corresponding to best accuracy score

print("Accuracy of the best model on the validation set (n_estimators = {}): {}".format(best_est, best_score))



Accuracy of the best model on the validation set (n_estimators = 28): 0.7883495145631068
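The search above refits a complete forest for every value of `n_estimators`, which adds up to 1 + 2 + ... + 49 = 1225 trees. One possible way to cut that cost is scikit-learn's `warm_start` option, which keeps the trees already grown and only adds the missing ones. The sketch below illustrates the idea on synthetic stand-in data from `make_classification`, not the Megaline dataset. The trees grown this way are not guaranteed to match those of forests fitted from scratch, so the selected `n_estimators` may differ from the value reported above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with four features (an assumption for illustration)
X, y = make_classification(n_samples=500, n_features=4, n_informative=3,
                           n_redundant=1, random_state=0)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.2, random_state=0)

# warm_start=True makes each fit() reuse the existing trees
forest = RandomForestClassifier(n_estimators=1, warm_start=True,
                                random_state=54321)

best_score, best_est = 0.0, 0
for est in range(1, 50):
    forest.set_params(n_estimators=est)
    forest.fit(X_train, y_train)  # grows only the trees that are missing
    score = forest.score(X_valid, y_valid)
    if score > best_score:
        best_score, best_est = score, est

print(best_est, best_score)
print(len(forest.estimators_))  # total trees grown during the whole search
```

With this approach the whole search grows 49 trees in total instead of 1225.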


In [17]:
# Train final forest model
forest_model = RandomForestClassifier(random_state=54321, n_estimators=28) # change n_estimators to get best model
forest_model.fit(features_train, target_train)

# Predictions
forest_train_predictions = forest_model.predict(features_train)
forest_prediction_test = forest_model.predict(features_test)

# Accuracy
print('Training set accuracy:',accuracy_score(target_train,forest_train_predictions))
print('Test set accuracy:',accuracy_score(target_test,forest_prediction_test))



Training set accuracy: 0.995136186770428
Test set accuracy: 0.7853810264385692


The best random forest model (n_estimators=28) reached 78.5% accuracy on the test set. Its training accuracy of 99.5% points to substantial overfitting.

## Conclusion

The decision tree and random forest performed almost identically, reaching 79.2% and 78.5% accuracy on the test set respectively, both above the 0.75 threshold. The decision tree was slightly more accurate and is also cheaper to tune, since the random forest search refits an entire forest for every candidate value of n_estimators.
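To put the size of that difference in perspective, the sketch below converts it into a number of test rows and compares it with a rough standard error for each accuracy. It uses a simple binomial approximation that treats test rows as independent and ignores that both models were scored on the same rows, so it is only an indication, not a formal test.

```python
import math

n_test = 643  # rows in the test set
tree_acc = 0.7916018662519441    # decision tree test accuracy from above
forest_acc = 0.7853810264385692  # random forest test accuracy from above


def standard_error(acc, n):
    """Binomial standard error of an accuracy measured on n rows."""
    return math.sqrt(acc * (1 - acc) / n)


# Accuracy difference, also expressed as a number of test rows
diff = tree_acc - forest_acc
diff_in_rows = round(diff * n_test)

tree_se = standard_error(tree_acc, n_test)
forest_se = standard_error(forest_acc, n_test)

print(diff, diff_in_rows)
print(tree_se, forest_se)
```

The two models differ by only a handful of test rows, and the gap is smaller than the standard error of either accuracy, so the ranking could easily flip on a different split.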