
# Project description
Mobile carrier Megaline has found out that many of their subscribers use legacy plans. They want to develop a model that would analyze subscribers' behavior and recommend one of Megaline's newer plans: Smart or Ultra.
You have access to behavior data about subscribers who have already switched to the new plans (from the project for the Statistical Data Analysis course). For this classification task, you need to develop a model that will pick the right plan. Since you’ve already performed the data preprocessing step, you can move straight to creating the model.
Develop a model with the highest possible accuracy. In this project, the threshold for accuracy is 0.75. Check the accuracy using the test dataset.






# Data description
Every observation in the dataset contains monthly behavior information about one user.

The information given is as follows:


calls — number of calls,


minutes — total call duration in minutes,


messages — number of text messages,


mb_used — Internet traffic used in MB,


is_ultra — plan for the current month (Ultra - 1, Smart - 0).

In [1]:
#Let us load all relevant packages and import the data 
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import precision_score

In [2]:
#import data
df = pd.read_csv('https://bit.ly/UsersBehaviourTelco')



In [3]:
#Let's see if we have any NA values and study the data types of the dataframe
df.isna().sum()
df.info()
#Calls and messages are counts, so they don't need to be float; convert them to int
df['messages'] = df['messages'].astype(int)
df['calls'] = df['calls'].astype(int)




<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3214 entries, 0 to 3213
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   calls     3214 non-null   float64
 1   minutes   3214 non-null   float64
 2   messages  3214 non-null   float64
 3   mb_used   3214 non-null   float64
 4   is_ultra  3214 non-null   int64  
dtypes: float64(4), int64(1)
memory usage: 125.7 KB


<div class="alert alert-block alert-success">
<b>Success:</b> Data loading and initial analysis are well done.
</div>

We see that there are no NA values, so we didn't have to do any extra processing work. Calls and messages don't need to be float so I converted them to int. 

# Dataset splitting

In [4]:
#Now, we will split the data into training, validation and testing sets. Of the base dataset,
#I will hold out 20% for testing, then take 20% of the remainder for validation (roughly 64/16/20).
features = df.drop(columns=['is_ultra'])
target =  df['is_ultra']
features_train, features_test, target_train, target_test = train_test_split(features, target, test_size=0.20, random_state=12345)
features_train, features_valid, target_train, target_valid = train_test_split(features_train, target_train, test_size=0.2, random_state=12345 )


In [5]:
print(len(features_train))
print(len(target_train))
print(len(features_test))
print(len(target_test))
print(len(features_valid))
print(len(target_valid))


2056
2056
643
643
515
515


As a sanity check that the split worked correctly, I printed the sizes of the training, testing and validation sets: 2,056, 643 and 515 rows, which add up to the 3,214 rows in the dataset.
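The expected sizes can also be worked out by hand. Below is a minimal sketch of that arithmetic, assuming that `train_test_split` rounds the test portion up when given a fraction (which is consistent with the counts printed above). The helper `split_sizes` is hypothetical and not part of the notebook.

```python
import math

def split_sizes(n_rows, test_fraction):
    """Return (n_train, n_test), rounding the test portion up (assumed rounding rule)."""
    n_test = math.ceil(n_rows * test_fraction)
    return n_rows - n_test, n_test

n_total = 3214  # number of rows reported by df.info()

# First split: hold out 20% of all rows for testing
n_rest, n_test = split_sizes(n_total, 0.20)

# Second split: take 20% of the remaining rows for validation
n_train, n_valid = split_sizes(n_rest, 0.20)

print(n_train, n_valid, n_test)

# Share of the full dataset that ends up in each set
for name, size in [("train", n_train), ("valid", n_valid), ("test", n_test)]:
    print(name, round(size / n_total, 2))
```

Because the second split takes 20% of the remaining 80%, the final proportions are roughly 64% training, 16% validation and 20% testing.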

<div class="alert alert-block alert-success">
<b>Success:</b> Data splitting was done well. Great that you print out information about sets after splitting!
</div>

# Hyperparameter tuning 

I will investigate the quality of different models by changing hyperparameters, and then briefly describe the findings of my study.

In [6]:
#First, let's fit a Logistic Regression model to the training data
LogRegMod = LogisticRegression(random_state=12345, solver='liblinear') 
LogRegMod.fit(features_train, target_train) 
LogRegMod.score(features_train, target_train)


0.745136186770428

We see that logistic regression has a training accuracy of 74.5%, which is not too bad (though just under the 0.75 threshold), but we can definitely do better with other models.
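Whether an accuracy figure is "not too bad" depends on how balanced the two plans are: a model that always predicts the more common plan sets a floor that any real model should beat. Here is a minimal sketch of that baseline. The function name and the example labels are made up for illustration and are not the class balance of this dataset.

```python
from collections import Counter

def majority_baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most frequent label."""
    counts = Counter(labels)
    most_common_count = counts.most_common(1)[0][1]
    return most_common_count / len(labels)

# Hypothetical labels: 7 Smart (0) and 3 Ultra (1)
example_labels = [0] * 7 + [1] * 3
baseline = majority_baseline_accuracy(example_labels)
print(baseline)
```

In the notebook, the same function could be applied to `target_train` to see how far above the floor each model lands.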

In [7]:
#Now, let us fit a decision tree model
DecTreeMod = DecisionTreeClassifier(random_state=12345, max_depth=5)

DecTreeMod.fit(features_train, target_train)
DecTreeMod.score(features_train, target_train)

0.828307392996109

As an initial guess, I decided to see how a model with max depth 5 would do. At 82.8% on the training data, it already has better accuracy than the logistic regression. Let's see if there are better hyperparameters by doing an exhaustive search with the sklearn class 'GridSearchCV'.
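Note that the scores so far are measured on the training data itself. Below is a minimal sketch of comparing tree depths on a held-out validation split instead. It uses synthetic data from `make_classification` as a stand-in (an assumption, so the numbers it prints say nothing about the Megaline data), and the depth range of 1 to 10 is arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption): 4 numeric features, binary target
X, y = make_classification(
    n_samples=1000, n_features=4, n_informative=3, n_redundant=1,
    random_state=12345,
)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.2, random_state=12345
)

best_depth, best_score = None, 0.0
for depth in range(1, 11):
    model = DecisionTreeClassifier(random_state=12345, max_depth=depth)
    model.fit(X_train, y_train)
    # Score on data the model has not seen during fitting
    score = model.score(X_valid, y_valid)
    if score > best_score:
        best_depth, best_score = depth, score

print(best_depth, round(best_score, 3))
```

In the notebook, `features_train`, `features_valid`, `target_train` and `target_valid` would take the place of the synthetic arrays.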

In [8]:
depth_param = {'max_depth':range(1,25)}
DecTreeMod = DecisionTreeClassifier(random_state=12345)
DecTreeModOpt = GridSearchCV(DecTreeMod,depth_param)
DecTreeModOpt.fit(features_train, target_train)
DecTreeModOpt.score(features_train, target_train)
print(DecTreeModOpt.best_estimator_)


DecisionTreeClassifier(max_depth=4, random_state=12345)


Searching depths from 1 to 24 (an arbitrarily chosen upper bound), we see that the best cross-validated accuracy is obtained at a depth of 4. Since we know that a shallow tree is good enough, let's narrow the parameter space that GridSearchCV needs to search for the Random Forest model to depths below 10. Having more than 50 estimators also doesn't feel parsimonious, so, based purely on intuition, I will cap the number of estimators just below 50.

In [9]:
depth_param = {'max_depth':range(1,10), 'n_estimators':range(1,50)}
RandForestMod = RandomForestClassifier(random_state=12345)
RandForestOpt = GridSearchCV(RandForestMod,depth_param)
RandForestOpt.fit(features_train, target_train)
print(RandForestOpt.best_estimator_)
RandForestOpt.score(features_train, target_train)

RandomForestClassifier(max_depth=7, n_estimators=42, random_state=12345)


0.8599221789883269

This took a very long time to run. If run time is a priority, the parameter space needs to be greatly reduced. Here, we see that a max depth of 7 worked best, along with 42 estimators. 42 estimators seems like a lot, so the company needs to evaluate whether this marginal increase in accuracy is worth the cost of a bulkier model.
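The run time follows directly from the size of the grid. Here is a rough sketch of the count, assuming GridSearchCV's default 5-fold cross-validation plus one final refit on the best candidate. The coarser grid is a hypothetical alternative, not one that was run here.

```python
from itertools import product

def count_fits(param_grid, cv_folds=5):
    """Number of model fits: every parameter combination on every fold, plus one refit."""
    n_candidates = len(list(product(*param_grid.values())))
    return n_candidates * cv_folds + 1

# Grid used for the Random Forest search above
full_grid = {'max_depth': range(1, 10), 'n_estimators': range(1, 50)}

# Hypothetical coarser grid: every second depth, every tenth forest size
coarse_grid = {'max_depth': range(2, 10, 2), 'n_estimators': range(10, 50, 10)}

print(count_fits(full_grid))
print(count_fits(coarse_grid))
```

Stepping through the ranges instead of trying every value cuts the number of fits by more than an order of magnitude, at the cost of possibly missing the exact optimum.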

<div class="alert alert-block alert-success">
<b>Success:</b> The hyperparameter tuning was done well! GridSearchCV was used correctly.
</div>

# Check the quality of the model using the test set

In [10]:
features_test_accuracy = features_test
predictions_test_accuracy = RandForestOpt.predict(features_test_accuracy)
quality = accuracy_score(target_test, predictions_test_accuracy)
quality

0.7884914463452566

In this project, the threshold for accuracy is 0.75. We see that the Random Forest model has a test accuracy of 78.8%, which clears the threshold. We don't strictly need to check the other models since this one had better accuracy on the training data, but out of curiosity, let's see how the decision tree model, our second best one, does on the test data.

In [11]:
features_test_accuracy = features_test
predictions_test_accuracy = DecTreeModOpt.predict(features_test_accuracy)
quality = accuracy_score(target_test, predictions_test_accuracy)
quality

0.7884914463452566

The decision tree model also clears the threshold: in the output above it scores 78.8% on the test set, the same as the Random Forest, so the forest's advantage on the training data does not show up here. Now, let's look into the precision scores out of curiosity.

In [12]:
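#Note: precision_score expects (y_true, y_pred); with the predictions passed first, the value returned is the recall for the Ultra class
precision = precision_score(RandForestOpt.predict(features_test_accuracy), target_test)
precision

0.46938775510204084

In [13]:
#Same argument order as above, so this is also the recall for the Ultra class
precision = precision_score(DecTreeModOpt.predict(features_test_accuracy), target_test)
precision

0.4642857142857143

Both of our models, unfortunately, score very low here. Because the predictions were passed as the first argument, these values are the recall for the Ultra class rather than its precision: each model identifies fewer than half of the subscribers who are actually on Ultra, so the chance of recommending Smart when the plan is actually Ultra (a false negative) is quite high.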
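The effect of the argument order can be checked by hand; scikit-learn's `precision_score` takes the true labels first and the predictions second. Below is a minimal pure-Python sketch with made-up labels. The helper functions and both label lists are illustrative and not taken from the notebook.

```python
def precision(y_true, y_pred, positive=1):
    """Share of predicted positives that are truly positive."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    predicted_pos = sum(1 for p in y_pred if p == positive)
    return tp / predicted_pos if predicted_pos else 0.0

def recall(y_true, y_pred, positive=1):
    """Share of true positives that the model found."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    actual_pos = sum(1 for t in y_true if t == positive)
    return tp / actual_pos if actual_pos else 0.0

# Hypothetical labels: 4 Ultra (1) and 6 Smart (0) subscribers
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

print(precision(y_true, y_pred))  # correct argument order
print(recall(y_true, y_pred))
print(precision(y_pred, y_true))  # swapped order reproduces the recall
```

Swapping the arguments exchanges the roles of false positives and false negatives, which is why the swapped call returns the recall.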

# Conclusion

With 3,214 observations and a clear enough pattern in the data, all three classification algorithms I implemented performed decently. How well the models identify Ultra subscribers is a concern, but since the requirement was to focus on accuracy, we don't need to explore that for now. The random forest and decision tree differed only slightly in accuracy, but tuning the hyperparameters of the random forest classifier is far more computationally intensive.

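Finally, despite our very liberal hyperparameter ranges, we managed to avoid severe overfitting. We know this because both tuned models cleared the 0.75 threshold on the test dataset (about 0.79), although this is below the Random Forest's training accuracy (about 0.86).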

<div class="alert alert-block alert-success">
<b>Success:</b> Testing was processed in a good way!
</div>