Megaline wants to develop a model that analyzes subscriber behavior and recommends 
a Smart or Ultra plan. With access to behavior data about subscribers, we need to develop
a model that picks the right plan. 

Goal: Develop a model with highest possible accuracy. The threshold for accuracy is 0.75. 
Check the accuracy using the test dataset. 

In [1]:
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import RandomForestRegressor	
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
from sklearn.metrics import mean_squared_error

In [2]:
#Open and look through the data file. Path to the file:datasets/users_behavior.csv

try:
    df = pd.read_csv('/datasets/users_behavior.csv')

except FileNotFoundError:
    df = pd.read_csv('/datasets/users_behavior.csv')

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3214 entries, 0 to 3213
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   calls     3214 non-null   float64
 1   minutes   3214 non-null   float64
 2   messages  3214 non-null   float64
 3   mb_used   3214 non-null   float64
 4   is_ultra  3214 non-null   int64  
dtypes: float64(4), int64(1)
memory usage: 125.7 KB


In [4]:
duplicates = df[df.duplicated()]
duplicates

Unnamed: 0,calls,minutes,messages,mb_used,is_ultra


In [5]:
df['is_ultra'].unique()

array([0, 1])

Classification problem to determine wheter a plan is ultra

In [6]:
#Grab dependent and independent variables
features = df.drop(['is_ultra'], axis=1)
target = df['is_ultra']

In [7]:
#60% train and other 40% split into validation and test
features_train, features_valid, target_train, target_valid = train_test_split(
    features, target, test_size=0.40, random_state=12345) 


features_test, features_valid, target_test, target_valid = train_test_split(
    features_valid, target_valid, test_size=0.50, random_state=12345) 

print(features_train.shape)
print(target_train.shape)
print(features_valid.shape)
print(target_valid.shape)
print(features_test.shape)
print(target_test.shape)

(1928, 4)
(1928,)
(643, 4)
(643,)
(643, 4)
(643,)


In [8]:
best_score = 0
best_est = 0
for est in range(1, 50):
    model = RandomForestClassifier(random_state=54321, n_estimators=est, max_depth=12)
    model.fit(features_train, target_train)
    score = model.score(features_valid, target_valid)
    if score > best_score:
        best_score = score
        best_est = est

print("Accuracy of the best model on the validation set (n_estimators = {}): {}".format(best_est, best_score))

Accuracy of the best model on the validation set (n_estimators = 47): 0.8149300155520995


In [9]:
print(model.score(features_test, target_test))

0.8009331259720062


Random forest classifier has a higher accuracy when compared to decision trees and logistic regression. Test accuracy decreased when compared to the validation set. 

In [10]:
model = LogisticRegression(random_state=54321, solver='liblinear') # initialize logistic regression constructor with parameters random_state=54321 and solver='liblinear'
model.fit(features_train, target_train)  # train model on training set
score_train = model.score(features_train, target_train)  
score_valid = model.score(features_valid, target_valid)  
score_test = model.score(features_test, target_test)  

print(
    "Accuracy of the logistic regression model on the training set:",
    score_train,
)
print(
    "Accuracy of the logistic regression model on the validation set:",
    score_valid,
)

print(
    "Accuracy of the logistic regression model on the test set:",
    score_test,
)

Accuracy of the logistic regression model on the training set: 0.7505186721991701
Accuracy of the logistic regression model on the validation set: 0.7402799377916018
Accuracy of the logistic regression model on the test set: 0.7589424572317263


In [11]:
best_model = None
best_result = 0
for depth in range(1, 15):
	model = DecisionTreeClassifier(random_state=12345, max_depth= depth) # create a model with the given depth
	model.fit(features_train, target_train) # train the model
	predictions = model.predict(features_valid) # get the model's predictions
	result = accuracy_score(target_valid, predictions) # calculate the accuracy
	if result > best_result:
		best_model = model
		best_result = result
        
print("Accuracy of the best model:", best_result)

Accuracy of the best model: 0.7993779160186625


In [12]:
predictions = best_model.predict(features_test)
print(accuracy_score(target_test, predictions))

0.7822706065318819


Conclusion: 
    
I've used three models: random forest, logistic regression, and a decision tree. Adjusting the random forest hyperparameter's max depth 12 achieves the highest accuracy. The original data set was converted into 60% training, 20% validation, and 20% test. Ultimately, I reached 80% accuracy on the test set.