# Classification Task: Picking the right plan for Megaline users

In this task, we will develop a model that recommends the right plan for users, based on behavior data from subscribers who have already switched to one of the new plans: Smart or Ultra. We want a model with an accuracy of at least 75%. Since this is a classification task, we will test the Decision Tree, Random Forest, and Logistic Regression classifiers.

# Table of Contents
1. [General Information](#step1)
2. [Splitting into training, validation, and test sets](#step2)
3. [Testing Models](#step3)
   1. [Decision Tree](#step3_1)
   2. [Random Forest](#step3_2)
   3. [Logistic Regression](#step3_3)    
4. [Quality Check using the Test set](#step4)
5. [The Sanity Check: Model vs. Chance](#step5)
6. [Conclusion](#step6)

## General Information <a name="step1"></a>

Let us first import the libraries and modules we need.

In [1]:
import pandas as pd #for dealing with dataframes
from sklearn.tree import DecisionTreeClassifier #to deal with Decision Tree Models
from sklearn.ensemble import RandomForestClassifier #to deal with Random Forest Models
from sklearn.linear_model import LogisticRegression #to deal with Logistic Regression Models
from sklearn.model_selection import train_test_split #to be able to split datasets
from sklearn.metrics import accuracy_score #to be able to calculate model accuracy

We can then read our dataset

In [26]:
df=pd.read_csv('/datasets/users_behavior.csv')
#reads and converts our csv file into a pandas dataframe called df
df.head()#first 5 rows

| | calls | minutes | messages | mb_used | is_ultra |
|---|---|---|---|---|---|
| 0 | 40.0 | 311.9 | 83.0 | 19915.42 | 0 |
| 1 | 85.0 | 516.75 | 56.0 | 22696.96 | 0 |
| 2 | 77.0 | 467.66 | 86.0 | 21060.45 | 0 |
| 3 | 106.0 | 745.53 | 81.0 | 8437.39 | 1 |
| 4 | 66.0 | 418.74 | 1.0 | 14502.75 | 0 |


In [27]:
df.info()#displays general information about our dataframe 

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3214 entries, 0 to 3213
Data columns (total 5 columns):
calls       3214 non-null float64
minutes     3214 non-null float64
messages    3214 non-null float64
mb_used     3214 non-null float64
is_ultra    3214 non-null int64
dtypes: float64(4), int64(1)
memory usage: 125.7 KB


Our features are the calls, minutes, messages, and mb_used columns. Our models will use them to predict the target, which is the is_ultra column.

## Splitting into training, validation, and test sets <a name="step2"></a>

To do this we use the train_test_split() function, which splits a dataset into two parts. Since we need three sets, we call it twice. The training, validation, and test sets should hold 60%, 20%, and 20% of the original dataset respectively.

In [3]:
df_train, df2 = train_test_split(df, test_size=0.4, random_state=12345)
#splits df into df_train (60%) and df2 (40%)
df_valid, df_test = train_test_split(df2, test_size=0.5, random_state=12345)
#splits df2 into df_valid(50% of df2, so 20% of df) and df_test(50% of df2, so 20% of df)
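As a quick check of the split arithmetic, the sketch below reproduces the three set sizes from the row count alone (3214, as reported by df.info()). It assumes the held-out part of each split is rounded up to a whole row; the variable names are illustrative and not part of the notebook.

```python
import math

n_rows = 3214  # total rows reported by df.info()

# First split: test_size=0.4 holds out 40% of the rows (assumed rounded up).
n_holdout = math.ceil(0.4 * n_rows)
n_train = n_rows - n_holdout

# Second split: test_size=0.5 halves the held-out rows.
n_test = math.ceil(0.5 * n_holdout)
n_valid = n_holdout - n_test

print(n_train, n_valid, n_test)
print(round(n_train / n_rows, 3), round(n_valid / n_rows, 3), round(n_test / n_rows, 3))
```

The second line of output shows how close the actual shares come to the intended 60/20/20 proportions.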

We can view the general information of each split set

In [4]:
df_train.info() #training set info

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1928 entries, 3027 to 482
Data columns (total 5 columns):
calls       1928 non-null float64
minutes     1928 non-null float64
messages    1928 non-null float64
mb_used     1928 non-null float64
is_ultra    1928 non-null int64
dtypes: float64(4), int64(1)
memory usage: 90.4 KB


In [5]:
df_valid.info() #validation set info

<class 'pandas.core.frame.DataFrame'>
Int64Index: 643 entries, 1386 to 3197
Data columns (total 5 columns):
calls       643 non-null float64
minutes     643 non-null float64
messages    643 non-null float64
mb_used     643 non-null float64
is_ultra    643 non-null int64
dtypes: float64(4), int64(1)
memory usage: 30.1 KB


In [6]:
df_test.info() #test set info

<class 'pandas.core.frame.DataFrame'>
Int64Index: 643 entries, 160 to 2313
Data columns (total 5 columns):
calls       643 non-null float64
minutes     643 non-null float64
messages    643 non-null float64
mb_used     643 non-null float64
is_ultra    643 non-null int64
dtypes: float64(4), int64(1)
memory usage: 30.1 KB


We have successfully split our original dataset into a training set (60%), a validation set (20%), and a test set (20%). We now need to define the features and target for each set. For the features, we take the whole dataframe and drop the target column. The target is the is_ultra column.

In [7]:
train_features = df_train.drop('is_ultra', axis=1)
#defines the training features by dropping the target column from the training set

train_target = df_train['is_ultra']#defines the training target as the is_ultra column of the training set

valid_features = df_valid.drop('is_ultra', axis=1)
#defines the validation features by dropping the target column from the validation set

valid_target = df_valid['is_ultra']
#defines the validation target as the is_ultra column of the validation set

test_features = df_test.drop('is_ultra', axis=1)
#defines the test features by dropping the target column from the test set

test_target = df_test['is_ultra']#defines the test target as the is_ultra column of the test set

We have clearly defined the features and targets for our sets. Now we can test models

## Testing Models <a name="step3"></a>

As we stated earlier, we will test the Decision Tree, Random Forest, and Logistic Regression classifiers. Each model is trained on the training set (using the fit() method) and then evaluated on the validation set: we generate predictions from the validation features (using the predict() method) and compare them to the actual validation target. For each model, we will tweak hyperparameters to get a higher accuracy score, which is the metric we use to choose the best model to move forward with.
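Accuracy is simply the share of predictions that match the actual labels. The small sketch below computes it by hand for five made-up users; the labels are hypothetical and chosen only to illustrate the calculation that accuracy_score performs.

```python
# Hypothetical labels for five users (1 = Ultra, 0 = Smart); not taken from the dataset.
actual = [0, 0, 1, 1, 0]
predicted = [0, 1, 1, 0, 0]

# Count the positions where the prediction equals the actual label.
matches = sum(a == p for a, p in zip(actual, predicted))
accuracy = matches / len(actual)

print(matches, 'of', len(actual), 'correct, accuracy =', accuracy)
```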

### Decision Tree <a name="step3_1"></a>

Here, we will use the DecisionTreeClassifier class. We will set 2 hyperparameters: random_state and max_depth. random_state has to be the same across the board, so we give it a fixed value (12345). max_depth, however, is the hyperparameter we will tune. We loop through a range of values for max_depth (in this case, 1 to 10) and print the accuracy score for each.

In [8]:
for i in range(1, 11): #loops through values of i from 1 to 10
    dt_model = DecisionTreeClassifier(random_state=12345, max_depth=i)
    #creates a Decision Tree model with the max_depth value
    dt_model.fit(train_features, train_target)
    #trains the model using the features and target of the training set
    dt_valid_pred=dt_model.predict(valid_features)
    #gets predictions from the model using the features of the validation set
    print('Max depth', i, 'accuracy =', accuracy_score(valid_target, dt_valid_pred))
    #prints the accuracy score by comparing the predictions to the target of the validation set

Max depth 1 accuracy = 0.7542768273716952
Max depth 2 accuracy = 0.7822706065318819
Max depth 3 accuracy = 0.7853810264385692
Max depth 4 accuracy = 0.7791601866251944
Max depth 5 accuracy = 0.7791601866251944
Max depth 6 accuracy = 0.7838258164852255
Max depth 7 accuracy = 0.7822706065318819
Max depth 8 accuracy = 0.7791601866251944
Max depth 9 accuracy = 0.7822706065318819
Max depth 10 accuracy = 0.7744945567651633


From the results, the best Decision Tree model is the one with max_depth=3, since it has the highest accuracy score (78.54%).
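Instead of reading the winner off the printed list, the loop can keep track of the best depth itself. The sketch below shows one way to do that. It runs on synthetic stand-in data from make_classification (an assumption, so that the block is self-contained), so its result will differ from the scores above.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption): 4 features and a binary target.
X, y = make_classification(n_samples=600, n_features=4, n_informative=3,
                           n_redundant=1, random_state=12345)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=12345)

best_depth, best_accuracy = None, 0.0
for depth in range(1, 11):
    model = DecisionTreeClassifier(random_state=12345, max_depth=depth)
    model.fit(X_train, y_train)
    accuracy = accuracy_score(y_valid, model.predict(X_valid))
    # Strict '>' keeps the shallowest tree when two depths tie.
    if accuracy > best_accuracy:
        best_depth, best_accuracy = depth, accuracy

print('Best max_depth:', best_depth, 'accuracy:', round(best_accuracy, 4))
```

Preferring the shallowest tree on ties is a design choice: a simpler tree with the same validation accuracy is less likely to be overfitting.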

### Random Forest <a name="step3_2"></a>

We will use the RandomForestClassifier class. Our random_state hyperparameter stays the same as before. The hyperparameters we will tune are max_depth and n_estimators. We first create an empty list. Then we loop through values of max_depth and, within that loop, through values of n_estimators. This gives us one model for every combination of max_depth and n_estimators. We store the models in the list and then choose the one with the highest accuracy score.

In [9]:
rf = []#empty list
for i in range(1, 11):#loops through values of i from 1 to 10 for max_depth
    for j in range(10, 101, 10):#loops through values of j from 10 to 100 with a step of 10 for n_estimators
        rf_model = RandomForestClassifier(random_state=12345, max_depth=i, n_estimators=j)
        #creates a random forest model
        rf_model.fit(train_features, train_target)
        #trains the model using the features and target of the training set
        rf.append(rf_model)#adds model to the list
    
print(max(rf, key=lambda rf_model: accuracy_score(valid_target, rf_model.predict(valid_features))))
#prints the model from the list with the highest accuracy score based on predictions made using the 
#features of the validation set and the actual target of the validation set

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
                       max_depth=8, max_features='auto', max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=40,
                       n_jobs=None, oob_score=False, random_state=12345,
                       verbose=0, warm_start=False)


Our result tells us that the best random forest model is the one with max_depth=8 and n_estimators=40. We will call it best_rf and compute its accuracy score.

In [10]:
best_rf=RandomForestClassifier(random_state=12345, max_depth=8, n_estimators=40)
best_rf.fit(train_features, train_target)
best_rf_pred=best_rf.predict(valid_features)
print(accuracy_score(valid_target, best_rf_pred))

0.8087091757387247


Our best random forest classifier has an accuracy of about 81%
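The search above keeps every fitted forest in a list and then refits the winner to read off its score. A lighter variant, sketched below, records the best score and its hyperparameters as the loops run. It uses synthetic stand-in data and a much smaller grid (both assumptions, to keep the example self-contained and quick), so the chosen values are illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): 4 features and a binary target.
X, y = make_classification(n_samples=600, n_features=4, n_informative=3,
                           n_redundant=1, random_state=12345)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=12345)

best_score, best_params, best_model = 0.0, None, None
for depth in (2, 4, 8):           # reduced grid, for illustration only
    for n_trees in (10, 20, 40):
        model = RandomForestClassifier(random_state=12345, max_depth=depth,
                                       n_estimators=n_trees)
        model.fit(X_train, y_train)
        score = accuracy_score(y_valid, model.predict(X_valid))
        if score > best_score:
            # Keep the score, the hyperparameters, and the fitted model together.
            best_score, best_params, best_model = score, (depth, n_trees), model

print('Best (max_depth, n_estimators):', best_params, 'accuracy:', round(best_score, 4))
```

Because the fitted winner is kept in best_model, there is no need to create and train it a second time.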

### Logistic Regression <a name="step3_3"></a>

We will use the LogisticRegression class. Again, our random_state stays the same. However, the max_depth and n_estimators hyperparameters don't apply here. All we need to do is set a solver. We will use 'liblinear'.

In [11]:
lr_model = LogisticRegression(random_state=12345, solver='liblinear')
lr_model.fit(train_features, train_target)
lr_valid_pred=lr_model.predict(valid_features)
print('Logistic Regression Accuracy =', accuracy_score(valid_target, lr_valid_pred))

Logistic Regression Accuracy = 0.7589424572317263


The accuracy of our Logistic Regression model is about 76%

- **Conclusion: Who was the best?**

The best model we have come up with is the Random Forest model with max_depth=8 and n_estimators=40 (accuracy score 80.9%). In second place is the Decision Tree model with max_depth=3 (accuracy score 78.5%). In last place is Logistic Regression (accuracy score 75.9%).

## Quality Check using the Test set <a name="step4"></a>

We will now use our best model on the test set. Before that, we need to retrain the model on the training and validation sets combined. To combine those sets, we can use the pd.concat function, which takes a list of the sets involved as its argument. Setting the parameter axis=0 stacks them vertically.

In [12]:
train_final = pd.concat([df_train, df_valid], axis=0)#vertically stacks the training and validation sets
train_final.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2571 entries, 3027 to 3197
Data columns (total 5 columns):
calls       2571 non-null float64
minutes     2571 non-null float64
messages    2571 non-null float64
mb_used     2571 non-null float64
is_ultra    2571 non-null int64
dtypes: float64(4), int64(1)
memory usage: 120.5 KB
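To see what vertical stacking does on a small scale, the sketch below concatenates two tiny frames built from a few of the rows shown by df.head(). The frame names are illustrative and only two columns are kept for brevity.

```python
import pandas as pd

# Two small frames with the same columns but different row labels.
train_part = pd.DataFrame({'calls': [40.0, 85.0], 'is_ultra': [0, 0]}, index=[0, 1])
valid_part = pd.DataFrame({'calls': [106.0], 'is_ultra': [1]}, index=[3])

# axis=0 appends the rows of the second frame below the first.
combined = pd.concat([train_part, valid_part], axis=0)

print(combined.shape)        # rows add up, columns stay the same
print(list(combined.index))  # original row labels are preserved
```

The row labels are carried over unchanged, which is why the combined index above runs from 3027 to 3197 rather than starting at 0.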


We can now define the features and target.

In [13]:
train_final_features = train_final.drop('is_ultra', axis=1)
train_final_target = train_final['is_ultra']

Now we can train (fit() method) using the new features and target, make predictions (predict() method) using the features from the test set, and get an accuracy score. Remember that we called our best model 'best_rf' 

In [21]:
best_rf.fit(train_final_features, train_final_target)
best_rf_pred=best_rf.predict(test_features)
print(accuracy_score(test_target, best_rf_pred))

0.7993779160186625


We have an accuracy score of about 80%, which is over the 75% threshold for our project

## The Sanity Check: Model vs. Chance <a name="step5"></a>

To sanity-check our model, we compare it to a trivial baseline: the accuracy we would get by predicting the same class for every user. To do so, let us first see how many Smart and Ultra clients there are in the test set.

In [23]:
smart_target=(test_target == 0) #series holding True values for smart clients
ultra_target=(test_target == 1) #series holding True values for ultra clients

print('Number of smart clients:', smart_target.sum())
print('Number of ultra clients:', ultra_target.sum())

Number of smart clients: 440
Number of ultra clients: 203


We have many more Smart clients, so we use them for the baseline. Suppose we had a naive classifier that predicted 0 (i.e. the Smart plan) for every user in the test set. What would its accuracy score be?

In [25]:
smart_chance=smart_target.sum()/len(test_target)
print('Accuracy of random smart classifier:', smart_chance)

Accuracy of random smart classifier: 0.6842923794712286


This naive classifier would have an accuracy score of 68.4%, which is well below the roughly 80% that our random forest classifier achieved. So our model passes the sanity check.
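The comparison boils down to a few lines of arithmetic on the class counts. The sketch below recomputes the baseline from the counts printed above and measures the gap to the test accuracy reported earlier; the variable names are illustrative.

```python
# Class counts in the test set, taken from the output above.
n_smart, n_ultra = 440, 203
n_total = n_smart + n_ultra

# Always predicting the larger class is right exactly that often.
baseline_accuracy = max(n_smart, n_ultra) / n_total

# Test accuracy of the random forest, as printed earlier.
model_accuracy = 0.7993779160186625

print('Baseline accuracy:', round(baseline_accuracy, 4))
print('Improvement over baseline:', round(model_accuracy - baseline_accuracy, 4))
```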

## Conclusion <a name="step6"></a>

We split our data into training, validation, and test sets. We tested three models and found that the random forest classifier performed best, scoring about 81% accuracy on the validation set and about 80% on the test set. It also passed our sanity check.