# Machine Learning Project

## This project develops a machine learning classification model that recommends one of two Megaline plans, Ultra or Smart, to each customer based on their behavior. The goal is a minimum accuracy of 75% for the best model. I will compare the three classification models shown in the course: Logistic Regression, Decision Tree Classifier, and Random Forest Classifier.

In [1]:
# Import necessary packages
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

In [2]:
# Reading in data set and having a look at its contents
df = pd.read_csv('/datasets/users_behavior.csv')
df.head()

   calls  minutes  messages   mb_used  is_ultra
0   40.0   311.90      83.0  19915.42         0
1   85.0   516.75      56.0  22696.96         0
2   77.0   467.66      86.0  21060.45         0
3  106.0   745.53      81.0   8437.39         1
4   66.0   418.74       1.0  14502.75         0


In [3]:
# Detailed look at the data. There are no missing values and no duplicate rows
df.info()
df.duplicated().value_counts()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3214 entries, 0 to 3213
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   calls     3214 non-null   float64
 1   minutes   3214 non-null   float64
 2   messages  3214 non-null   float64
 3   mb_used   3214 non-null   float64
 4   is_ultra  3214 non-null   int64  
dtypes: float64(4), int64(1)
memory usage: 125.7 KB


False    3214
dtype: int64

In [4]:
# Splitting data into training, validation, and testing sets

# Creating features list and target list
features = df.drop(['is_ultra'], axis=1)
target = df['is_ultra']

# Splitting into training (60%) and temporary (40%). Used random_state=12345 as shown in the sprint.
features_train, features_temp, target_train, target_temp = train_test_split(features, target, test_size=0.4, random_state=12345)

# Splitting the temporary set into validation (20%) and test (20%)
features_valid, features_test, target_valid, target_test = train_test_split(features_temp, target_temp, test_size=0.5, random_state=12345)

# Verify the sizes of the splits following the traditional 3:1:1 split.
print(f'Training set: {features_train.shape}, {target_train.shape}')
print(f'Validation set: {features_valid.shape}, {target_valid.shape}')
print(f'Test set: {features_test.shape}, {target_test.shape}')



Training set: (1928, 4), (1928,)
Validation set: (643, 4), (643,)
Test set: (643, 4), (643,)
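
The split sizes printed above follow from simple rounding. As a rough check, the arithmetic can be reproduced with the standard library alone. This sketch assumes that the held-out portion of each split is rounded up to a whole number of rows (my understanding of how `train_test_split` behaves, not something stated in the sprint); `split_sizes` is a hypothetical helper name.

```python
import math


def split_sizes(n_rows, temp_fraction=0.4, test_fraction_of_temp=0.5):
    """Return (train, valid, test) row counts for a two-step split.

    Assumption: the held-out part of each split is rounded up,
    and the remaining rows go to the other part.
    """
    n_temp = math.ceil(n_rows * temp_fraction)  # rows held out of training
    n_train = n_rows - n_temp
    n_test = math.ceil(n_temp * test_fraction_of_temp)
    n_valid = n_temp - n_test
    return n_train, n_valid, n_test


print(split_sizes(3214))  # (1928, 643, 643)
```

With 3214 rows this gives 1928, 643, and 643 rows, the same sizes as the shapes printed by the cell.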


In [5]:
# Investigating the quality of the LogisticRegression Classifier
# First attempt: no intercept term (fit_intercept=False) and max_iter=17
regress_model = LogisticRegression(random_state=12345, solver='liblinear', fit_intercept=False, max_iter=17)
regress_model.fit(features_train, target_train)  # train model on training set

regress_pred = regress_model.predict(features_valid) # get model predictions on validation set
regress_acc = accuracy_score(target_valid, regress_pred) # get accuracy of the model on the validation set
print("Logistic Regression accuracy:", regress_acc)


print()
# Second attempt: with an intercept term (fit_intercept=True) and max_iter=31
regress_model = LogisticRegression(random_state=12345, solver='liblinear', fit_intercept=True, max_iter=31)
regress_model.fit(features_train, target_train)  # train model on training set

regress_pred = regress_model.predict(features_valid) # get model predictions on validation set
regress_acc = accuracy_score(target_valid, regress_pred) # get accuracy of the model on the validation set
print("Logistic Regression accuracy:", regress_acc)

Logistic Regression accuracy: 0.713841368584759

Logistic Regression accuracy: 0.7589424572317263


For the Logistic Regression classifier, I tried changing the fit_intercept and max_iter parameters. With fit_intercept set to True, the validation accuracy rises from about 0.714 to about 0.759. This makes sense: without an intercept the decision boundary is forced to pass through the origin, while an intercept lets the model shift the boundary to fit the data better. For max_iter, 17 (without an intercept) and 31 (with an intercept) are the minimum numbers of iterations the solver needs to converge; with fewer iterations, a message about failed convergence appears. Any value above 17 and 31, respectively, gives the same accuracy. The resulting accuracy of about 0.759 meets the 75% threshold outlined in the project description, but other models may be more accurate.
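
The effect of the intercept can be illustrated with a one-feature toy example. This is only a sketch: the data is made up, the weights are picked by hand rather than fitted, and `predict` and `accuracy` are hypothetical helper names, not part of scikit-learn.

```python
def predict(x, weight, intercept=0.0):
    """Predict class 1 when the linear score weight * x + intercept is positive."""
    return 1 if weight * x + intercept > 0 else 0


def accuracy(xs, ys, weight, intercept=0.0):
    """Share of points whose predicted class matches the label."""
    hits = sum(predict(x, weight, intercept) == y for x, y in zip(xs, ys))
    return hits / len(xs)


# Toy data (made up for illustration): every feature value is positive,
# and the two classes separate around x = 5.
xs = [1, 2, 3, 4, 6, 7, 8, 9]
ys = [0, 0, 0, 0, 1, 1, 1, 1]

# Without an intercept the boundary is fixed at x = 0, so all positive
# inputs fall on the same side and receive the same label.
acc_no_intercept = max(accuracy(xs, ys, w) for w in (-1.0, 1.0))

# With an intercept the boundary can move to x = 5.
acc_with_intercept = accuracy(xs, ys, weight=1.0, intercept=-5.0)

print(acc_no_intercept, acc_with_intercept)  # 0.5 1.0
```

The features in this data set (calls, minutes, messages, mb_used) are also non-negative, which is one plausible reason the intercept matters here, although with four features the effect is less extreme than in this one-dimensional case.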

In [6]:
# Investigating the quality of the DecisionTree Classifier

# Initializing the best_acc and best_depth values to store the best accuracy and the corresponding max depth
best_acc = 0
best_depth = 0
# Loop to change the max_depth parameter to find the optimal depth for highest accuracy
for depth in range(1, 31):
    tree_model = DecisionTreeClassifier(max_depth=depth, random_state=12345, min_samples_split=20)  # create a model with max_depth=depth and min_samples_split=20

    # train the model on training set
    tree_model.fit(features_train, target_train)

    tree_pred = tree_model.predict(features_valid) # find the predictions using validation set
    tree_acc = accuracy_score(target_valid, tree_pred) # get accuracy of the model on the validation set
    
    # Saving the best accuracy and corresponding max_depth
    if tree_acc > best_acc:
        best_acc = tree_acc
        best_depth = depth
    
    print("max_depth =", depth, ": ", end='')
    print(tree_acc)
print("Best accuracy is at max depth of", best_depth, "with an accuracy of", best_acc)

max_depth = 1 : 0.7542768273716952
max_depth = 2 : 0.7822706065318819
max_depth = 3 : 0.7853810264385692
max_depth = 4 : 0.7791601866251944
max_depth = 5 : 0.7791601866251944
max_depth = 6 : 0.7884914463452566
max_depth = 7 : 0.7853810264385692
max_depth = 8 : 0.7791601866251944
max_depth = 9 : 0.7838258164852255
max_depth = 10 : 0.7744945567651633
max_depth = 11 : 0.7729393468118196
max_depth = 12 : 0.7620528771384136
max_depth = 13 : 0.7589424572317263
max_depth = 14 : 0.7620528771384136
max_depth = 15 : 0.7573872472783826
max_depth = 16 : 0.7558320373250389
max_depth = 17 : 0.7558320373250389
max_depth = 18 : 0.7542768273716952
max_depth = 19 : 0.7542768273716952
max_depth = 20 : 0.7558320373250389
max_depth = 21 : 0.7558320373250389
max_depth = 22 : 0.7558320373250389
max_depth = 23 : 0.7558320373250389
max_depth = 24 : 0.7558320373250389
max_depth = 25 : 0.7558320373250389
max_depth = 26 : 0.7558320373250389
max_depth = 27 : 0.7558320373250389
max_depth = 28 : 0.7558320373250389
m

For the Decision Tree Classifier, I changed the max_depth and min_samples_split parameters. I tried every max_depth from 1 to 30 and found that, with min_samples_split=20, the highest validation accuracy was about 0.788 at max_depth=6. Other values of min_samples_split, combined with the same depths, resulted in accuracies around 0.75. Beyond a depth of about 10 the validation accuracy falls back toward 0.755, which suggests that deeper trees start to overfit the training set.
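
Trying several values of two parameters by hand can be replaced with one loop over every combination. The sketch below shows the pattern with the standard library only. `grid_search` and `toy_score` are hypothetical names, and `toy_score` is a stand-in scoring function, not a real model: it is simply built to peak at max_depth=6 and min_samples_split=20 so the example runs without any data.

```python
from itertools import product


def grid_search(param_grid, evaluate):
    """Return (best_params, best_score) over every combination in param_grid.

    param_grid maps parameter names to iterables of candidate values;
    evaluate is any callable that takes a params dict and returns a score.
    Ties keep the first combination found, like the `>` check in the loop above.
    """
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = evaluate(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Stand-in scorer (hypothetical, NOT a real model): peaks at depth 6, split 20.
def toy_score(params):
    return -abs(params["max_depth"] - 6) - abs(params["min_samples_split"] - 20) / 10


grid = {"max_depth": range(1, 31), "min_samples_split": [2, 10, 20, 40]}
best_params, best_score = grid_search(grid, toy_score)
print(best_params, best_score)
```

With the real data, `evaluate` could fit `DecisionTreeClassifier(random_state=12345, **params)` on the training set and return `accuracy_score` on the validation set.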

In [7]:
# Investigating the quality of the RandomForest Classifier

# Initializing best_acc, best_est, and best_depth to store the best accuracy and the corresponding hyperparameters
best_acc = 0
best_est = 0
best_depth = 0
# Loop over n_estimators (1 to 41) and max_depth (1 to 10) to find the combination with the highest accuracy
for est in range(1, 42):
    for depth in range(1, 11):

        forest_model = RandomForestClassifier(random_state=12345, n_estimators=est, max_depth=depth)  # create a model with n_estimators=est and max_depth=depth
        forest_model.fit(features_train, target_train)  # train model on training set
        forest_pred = forest_model.predict(features_valid)  # find the predictions using validation set
        forest_acc = accuracy_score(target_valid, forest_pred)  # get accuracy of model on validation set

        # Store the best accuracy and the corresponding n_estimators and max_depth
        if forest_acc > best_acc:
            best_acc = forest_acc
            best_est = est
            best_depth = depth

        # Note: the value logged after "depth:" is best_depth (the best depth found so far), not the current depth
        print("est =", est, ":", "depth:", best_depth, end=' : ')
        print(forest_acc)
print("Best accuracy is at", best_est, "estimators", "and max depth", best_depth, "with an accuracy of", best_acc)

est = 1 : depth: 1 : 0.7542768273716952
est = 1 : depth: 2 : 0.7853810264385692
est = 1 : depth: 2 : 0.7853810264385692
est = 1 : depth: 2 : 0.7744945567651633
est = 1 : depth: 2 : 0.776049766718507
est = 1 : depth: 2 : 0.7853810264385692
est = 1 : depth: 2 : 0.7776049766718507
est = 1 : depth: 2 : 0.7776049766718507
est = 1 : depth: 2 : 0.7791601866251944
est = 1 : depth: 2 : 0.7729393468118196
est = 2 : depth: 2 : 0.7371695178849145
est = 2 : depth: 2 : 0.7822706065318819
est = 2 : depth: 2 : 0.7853810264385692
est = 2 : depth: 2 : 0.7776049766718507
est = 2 : depth: 2 : 0.7729393468118196
est = 2 : depth: 2 : 0.7853810264385692
est = 2 : depth: 2 : 0.7776049766718507
est = 2 : depth: 2 : 0.7807153965785381
est = 2 : depth: 2 : 0.7667185069984448
est = 2 : depth: 2 : 0.7698289269051322
est = 3 : depth: 2 : 0.7465007776049767
est = 3 : depth: 2 : 0.7776049766718507
est = 3 : depth: 2 : 0.7853810264385692
est = 3 : depth: 2 : 0.776049766718507
est = 3 : depth: 5 : 0.7884914463452566
es

est = 21 : depth: 6 : 0.7884914463452566
est = 21 : depth: 6 : 0.7993779160186625
est = 21 : depth: 6 : 0.8009331259720062
est = 21 : depth: 6 : 0.7962674961119751
est = 21 : depth: 6 : 0.7931570762052877
est = 21 : depth: 6 : 0.7947122861586314
est = 22 : depth: 6 : 0.7636080870917574
est = 22 : depth: 6 : 0.7853810264385692
est = 22 : depth: 6 : 0.7884914463452566
est = 22 : depth: 6 : 0.7884914463452566
est = 22 : depth: 6 : 0.7900466562986003
est = 22 : depth: 6 : 0.7993779160186625
est = 22 : depth: 6 : 0.7993779160186625
est = 22 : depth: 6 : 0.7978227060653188
est = 22 : depth: 6 : 0.7916018662519441
est = 22 : depth: 6 : 0.7916018662519441
est = 23 : depth: 6 : 0.7511664074650077
est = 23 : depth: 6 : 0.7853810264385692
est = 23 : depth: 6 : 0.7884914463452566
est = 23 : depth: 6 : 0.7853810264385692
est = 23 : depth: 6 : 0.7884914463452566
est = 23 : depth: 6 : 0.8009331259720062
est = 23 : depth: 6 : 0.8009331259720062
est = 23 : depth: 6 : 0.7962674961119751
est = 23 : depth

est = 41 : depth: 8 : 0.80248833592535
est = 41 : depth: 8 : 0.807153965785381
est = 41 : depth: 8 : 0.7931570762052877
est = 41 : depth: 8 : 0.7947122861586314
Best accuracy is at 40 estimators and max depth 8 with an accuracy of 0.8087091757387247


For the Random Forest Classifier, I adjusted the n_estimators and max_depth parameters, trying every combination of n_estimators from 1 to 41 and max_depth from 1 to 10. Ultimately, I found that n_estimators = 40 and max_depth = 8 gives the highest validation accuracy, about 0.81. The other combinations visible in the output range from roughly 0.74 to 0.81. Out of the three models, the Random Forest Classifier has the highest validation accuracy, so I will use this model for further analysis.

In [8]:
# Checking quality of the best Random Forest Model on the test set
forest_model_best = RandomForestClassifier(random_state=12345, n_estimators=40, max_depth=8) # use the best n_estimators and max_depth found on the validation set
forest_model_best.fit(features_train, target_train) # train the model on training data

forest_test_pred = forest_model_best.predict(features_test) # find the predictions using test set

forest_test_acc = accuracy_score(target_test, forest_test_pred) # get accuracy of model on test set
print('Test Accuracy:', forest_test_acc)

Test Accuracy: 0.7962674961119751


Running the best model on the test set gives an accuracy of about 0.796, which is close to the 0.809 it reached on the validation set. This means the model recommends the correct plan for about 79.6% of the customers in the test set, which exceeds the 75% accuracy baseline that Megaline requires.
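
To judge how close "close" is, the test accuracy can be given a rough uncertainty. The sketch below uses the binomial standard error, which assumes the test rows are independent draws; that is a simplification, and `accuracy_standard_error` is a hypothetical helper name. The accuracies and the set size are the ones reported in the outputs above.

```python
import math


def accuracy_standard_error(accuracy, n_samples):
    """Binomial standard error of an accuracy measured on n_samples rows."""
    return math.sqrt(accuracy * (1 - accuracy) / n_samples)


test_acc = 0.7962674961119751   # test accuracy reported above
valid_acc = 0.8087091757387247  # validation accuracy of the selected model
n_rows = 643                    # rows in each of the validation and test sets

se = accuracy_standard_error(test_acc, n_rows)
difference = valid_acc - test_acc

print(f"standard error of test accuracy: {se:.4f}")
print(f"validation minus test accuracy:  {difference:.4f}")
```

Under these assumptions the standard error is about 0.016, so the validation-to-test drop of about 0.012 is smaller than one standard error.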

In [9]:
# Sanity check
forest_train_acc = forest_model_best.score(features_train, target_train)
forest_val_acc = forest_model_best.score(features_valid, target_valid)
forest_test_acc = forest_model_best.score(features_test, target_test)
print('Training Accuracy:', forest_train_acc)
print('Validation Accuracy:', forest_val_acc)
print('Test Accuracy:', forest_test_acc)

Training Accuracy: 0.875
Validation Accuracy: 0.8087091757387247
Test Accuracy: 0.7962674961119751


Comparing the accuracy scores of the best random forest model on the training, validation, and test sets, there is some overfitting: the training accuracy (0.875) is noticeably higher than the validation accuracy (about 0.809). This could be due to max_depth being set high enough for the trees to fit noise in the training set. The validation and test accuracies are similar (about 0.809 and 0.796), which suggests that the model generalizes reasonably well to unseen data and that its performance is stable across the held-out sets.
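
The same comparison can be expressed in rows rather than fractions. This small sketch only uses the accuracies and set sizes printed above; the dictionary layout is my own choice for illustration.

```python
# Accuracies and set sizes as reported in the notebook output above.
results = {
    "training": (0.875, 1928),
    "validation": (0.8087091757387247, 643),
    "test": (0.7962674961119751, 643),
}

# Accuracy times set size gives the number of correctly classified rows.
correct = {name: round(acc * n) for name, (acc, n) in results.items()}
print(correct)  # {'training': 1687, 'validation': 520, 'test': 512}

# Gap between training and validation accuracy, a rough overfitting signal.
train_valid_gap = results["training"][0] - results["validation"][0]
print(f"train - validation gap: {train_valid_gap:.3f}")  # 0.066

# Validation and test sets have the same size, so counts compare directly.
print("rows of difference:", correct["validation"] - correct["test"])  # 8
```

The validation and test results differ by only 8 rows out of 643, while the training accuracy is about 6.6 percentage points above the validation accuracy.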

Overall, out of the three classification models (Logistic Regression, Decision Tree Classifier, and Random Forest Classifier), the Random Forest Classifier has the highest accuracy (about 0.809 on the validation set and 79.6% on the test set) and is also relatively stable. There is some overfitting in the model, but it still exceeds the 75% accuracy threshold and recommends the correct plan to Megaline customers about 79.6% of the time on the test set. Ideally, a company would want that number to be as high as possible to maximize customer satisfaction. To get the best model, I tuned max_depth in the Decision Tree Classifier and the Random Forest Classifier, and n_estimators in the Random Forest Classifier, since these values are easy to vary in a for loop. The other two models also scored above the 75% threshold on the validation set. However, a random forest averages many trees, which generally makes it less prone to overfitting than a single decision tree, and it can capture non-linear patterns that a logistic regression cannot, so it is the better model to choose here.
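
A check that is not part of this notebook is a comparison against a trivial baseline that always predicts the most common plan. The sketch below shows how that could be computed; `majority_baseline` is a hypothetical helper, and the labels in the example are made up, not taken from the data set.

```python
from collections import Counter


def majority_baseline(labels):
    """Accuracy of always predicting the most frequent label."""
    counts = Counter(labels)
    most_common_count = counts.most_common(1)[0][1]
    return most_common_count / len(labels)


# Made-up labels for illustration only: 7 zeros and 3 ones.
example_labels = [0, 0, 0, 1, 0, 1, 0, 0, 1, 0]
print(majority_baseline(example_labels))  # 0.7
```

Applied to the real data, `majority_baseline(target_test)` would show how much of the 79.6% test accuracy goes beyond simply recommending the more common plan to everyone.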