**Review**
  
Hi, my name is Dmitry and I will be reviewing your project.
  
You can find my comments in colored markdown cells:
  
<div class="alert alert-success">
  If everything is done successfully.
</div>
  
<div class="alert alert-warning">
  If I have some (optional) suggestions, or questions to think about, or general comments.
</div>
  
<div class="alert alert-danger">
  If a section requires some corrections. Work can't be accepted with red comments.
</div>
  
Please don't remove my comments, as it will make further review iterations much harder for me.
  
Feel free to reply to my comments or ask questions using the following template:
  
<div class="alert alert-info">
  For your comments and questions.
</div>
  
First of all, thank you for turning in the project! You did an excellent job! The project is accepted. Keep up the good work on the next sprint!

# Research on Megaline

Mobile carrier Megaline has found out that many of their subscribers use legacy plans.<br>
They want to develop a model that would analyze subscribers' behavior and recommend one of Megaline's newer plans: Smart or Ultra.<br>
You have access to behavior data about subscribers who have already switched to the new plans.<br>
For this classification task, you need to develop a model that will pick the right plan. <br>
<b>The Task:</b><br>
Develop a model with the highest possible accuracy.<br>
In this project, the threshold for accuracy is 0.75.<br>
Check the accuracy using the test dataset.

<b>Data description</b><br>
Every observation in the dataset contains monthly behavior information about one user.<br>
The information given is as follows:<br>
- calls — number of calls,<br>
- minutes — total call duration in minutes,<br>
- messages — number of text messages,<br>
- mb_used — Internet traffic used in MB,<br>
- is_ultra — plan for the current month (Ultra - 1, Smart - 0).

## Open and look through the data file.

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier 
from sklearn.linear_model import LogisticRegression
from sklearn.dummy import DummyClassifier
import pprint

In [2]:
#Read csv file
users_behavior = pd.read_csv('https://code.s3.yandex.net/datasets/users_behavior.csv') 
#General information
users_behavior.info()
users_behavior.sample(10)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3214 entries, 0 to 3213
Data columns (total 5 columns):
calls       3214 non-null float64
minutes     3214 non-null float64
messages    3214 non-null float64
mb_used     3214 non-null float64
is_ultra    3214 non-null int64
dtypes: float64(4), int64(1)
memory usage: 125.7 KB


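|      | calls | minutes | messages | mb_used  | is_ultra |
|------|-------|---------|----------|----------|----------|
| 1296 | 43.0  | 310.01  | 0.0      | 11486.69 | 1        |
| 2495 | 74.0  | 506.5   | 51.0     | 13174.87 | 0        |
| 2680 | 176.0 | 1205.57 | 74.0     | 33903.3  | 1        |
| 3060 | 42.0  | 277.25  | 49.0     | 15483.11 | 0        |
| 485  | 88.0  | 640.63  | 27.0     | 21251.05 | 0        |
| 2298 | 64.0  | 475.68  | 20.0     | 8540.27  | 0        |
| 883  | 5.0   | 38.41   | 1.0      | 9707.95  | 0        |
| 2715 | 82.0  | 595.63  | 42.0     | 11437.75 | 0        |
| 3151 | 84.0  | 622.6   | 32.0     | 16318.74 | 0        |
| 3036 | 16.0  | 100.87  | 8.0      | 8622.56  | 0        |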


<div class="alert alert-success">
<b>Reviewer's comment</b>
  
Alright, the data was loaded and examined!
  
</div>

## Split the source data into a training set, a validation set, and a test set.

Split the data into three sets:<br>
training set - 60%<br>
validation set - 20%<br>
test set - 20%

In [3]:
# fixed random_state so that the splits and models are reproducible
RANDOM_STATE = 12345

In [4]:
# hold out 20% of the data as the validation set
df_train, df_valid = train_test_split(users_behavior, test_size=0.2, random_state=RANDOM_STATE)
# hold out 25% of the remaining 80% (20% of the total) as the test set
df_train, df_test = train_test_split(df_train, test_size=0.25, random_state=RANDOM_STATE)

In [5]:
df_train.shape

(1928, 5)

In [6]:
df_valid.shape

(643, 5)

In [7]:
df_test.shape

(643, 5)
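
The three shapes follow from how the two `test_size` fractions compound. The sketch below reproduces the arithmetic in plain Python. It assumes that `train_test_split` rounds the held-out part up and gives the remainder to the training part, which is my understanding of scikit-learn's behavior when only `test_size` is given. `split_sizes` is a hypothetical helper, not a library function.

```python
import math

def split_sizes(n_samples, test_size):
    """Return (n_train, n_test) for a fractional test_size.

    Assumption: the held-out part is rounded up and the rest goes to training.
    """
    n_test = math.ceil(test_size * n_samples)
    return n_samples - n_test, n_test

n_total = 3214  # rows reported by users_behavior.info()

# First split: hold out 20% for validation
n_rest, n_valid = split_sizes(n_total, 0.2)
# Second split: hold out 25% of the remainder for the test set
n_train, n_test = split_sizes(n_rest, 0.25)

print(n_train, n_valid, n_test)
for name, size in [("train", n_train), ("valid", n_valid), ("test", n_test)]:
    print(name, round(size / n_total, 3))
```

Taking 25% of the remaining 80% yields 20% of the original data, which is why the second call uses `test_size=0.25` and not `0.2`.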

<div class="alert alert-success">
<b>Reviewer's comment</b>
  
Great, the data was split into train, validation and test. The proportions are reasonable.
  
</div>

##  Investigate the quality of different models by changing hyperparameters. Briefly describe the findings of the study.

In [8]:
# list of features columns
features_train = df_train.drop(['is_ultra'], axis=1)
features_valid = df_valid.drop(['is_ultra'], axis=1)

# Target column
target_train = df_train['is_ultra']
target_valid = df_valid['is_ultra']

### DecisionTree

In [9]:
# Store the validation accuracy for each max_depth
result_dict = {}

# Try max_depth from 1 to n - 1
n = 30
for depth in range(1, n):
    model = DecisionTreeClassifier(random_state=RANDOM_STATE, max_depth=depth)
    # train the model
    model.fit(features_train, target_train)
    # find the predictions using the validation set
    predictions_valid = model.predict(features_valid)

    # Add item to dictionary
    result_dict[depth] = accuracy_score(target_valid, predictions_valid)

# Sort by accuracy, then by depth, highest first
sorted_results = sorted(result_dict.items(), key=lambda kv: (kv[1], kv[0]), reverse=True)

# Print the results in descending order of accuracy
pprint.pprint(sorted_results)


[(7, 0.7884914463452566),
 (5, 0.7884914463452566),
 (4, 0.7869362363919129),
 (3, 0.7869362363919129),
 (2, 0.7838258164852255),
 (8, 0.7807153965785381),
 (6, 0.7791601866251944),
 (9, 0.7776049766718507),
 (10, 0.7713841368584758),
 (12, 0.7667185069984448),
 (11, 0.7651632970451011),
 (14, 0.7542768273716952),
 (13, 0.749611197511664),
 (1, 0.7480559875583204),
 (17, 0.744945567651633),
 (18, 0.7433903576982893),
 (16, 0.7433903576982893),
 (20, 0.7418351477449455),
 (15, 0.7418351477449455),
 (27, 0.7371695178849145),
 (24, 0.7325038880248833),
 (29, 0.7309486780715396),
 (28, 0.7309486780715396),
 (19, 0.7278382581648523),
 (26, 0.7262830482115086),
 (21, 0.7247278382581649),
 (23, 0.7231726283048211),
 (22, 0.7231726283048211),
 (25, 0.7216174183514774)]


### Conclusion

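max_depth=5 and max_depth=7 tie for the best validation accuracy (0.788). Accuracy declines for deeper trees, which suggests overfitting.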
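
Because depths 5 and 7 score identically on the validation set, choosing between them needs a tie-break. One option, shown here as a sketch and not as the notebook's method, is to prefer the shallower tree because it is the simpler model. The accuracies are copied from the output above; `pick_best` is a hypothetical helper.

```python
# Validation accuracies copied from the sweep output (top entries only)
result_dict = {
    2: 0.7838258164852255,
    3: 0.7869362363919129,
    4: 0.7869362363919129,
    5: 0.7884914463452566,
    7: 0.7884914463452566,
}

def pick_best(results):
    """Return the key with the highest score; on ties prefer the smallest key."""
    return min(results, key=lambda k: (-results[k], k))

best_depth = pick_best(result_dict)
print(best_depth, result_dict[best_depth])
```

The `sorted(..., reverse=True)` call in the cell lists depth 7 first only because the reversed sort also breaks the accuracy tie in favor of the larger key.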

### Random Forest

In [10]:
# Store the validation accuracy for each n_estimators
result_dict = {}

# Try n_estimators from 1 to n - 1
n = 30
for n_estimator in range(1, n):
    model = RandomForestClassifier(random_state=RANDOM_STATE, n_estimators=n_estimator)
    # train the model
    model.fit(features_train, target_train)
    # find the predictions using the validation set
    predictions_valid = model.predict(features_valid)

    # Add item to dictionary
    result_dict[n_estimator] = accuracy_score(target_valid, predictions_valid)

# Sort by accuracy, then by n_estimators, highest first
sorted_results = sorted(result_dict.items(), key=lambda kv: (kv[1], kv[0]), reverse=True)

# Print the results in descending order of accuracy
pprint.pprint(sorted_results)


[(28, 0.7916018662519441),
 (13, 0.7916018662519441),
 (15, 0.7900466562986003),
 (14, 0.7884914463452566),
 (29, 0.7869362363919129),
 (27, 0.7869362363919129),
 (10, 0.7869362363919129),
 (18, 0.7853810264385692),
 (26, 0.7838258164852255),
 (25, 0.7838258164852255),
 (24, 0.7838258164852255),
 (22, 0.7838258164852255),
 (19, 0.7838258164852255),
 (17, 0.7838258164852255),
 (16, 0.7838258164852255),
 (12, 0.7822706065318819),
 (11, 0.7822706065318819),
 (23, 0.7807153965785381),
 (21, 0.7807153965785381),
 (5, 0.7807153965785381),
 (20, 0.7791601866251944),
 (9, 0.7791601866251944),
 (8, 0.776049766718507),
 (6, 0.7744945567651633),
 (7, 0.7729393468118196),
 (3, 0.7729393468118196),
 (4, 0.7667185069984448),
 (2, 0.7542768273716952),
 (1, 0.7340590979782271)]


###  Conclusion

n_estimators=13 and n_estimators=28 tie for the best validation accuracy (0.792).
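
It helps to translate these accuracies back into counts of correctly classified validation rows. With 643 rows in the validation set, one extra correct prediction moves accuracy by about 0.0016, so the top settings differ by only a row or two. The sketch below uses values copied from the output above.

```python
N_VALID = 643  # rows in df_valid, from df_valid.shape

# Validation accuracies copied from the Random Forest sweep
accuracies = {
    13: 0.7916018662519441,
    15: 0.7900466562986003,
    14: 0.7884914463452566,
    10: 0.7869362363919129,
}

# Number of correctly classified validation rows for each setting
correct = {n: round(acc * N_VALID) for n, acc in accuracies.items()}
print(correct)

# Accuracy change caused by a single validation row
print(round(1 / N_VALID, 4))
```

Differences this small are likely within what a different `random_state` could produce, so `n_estimators=13` is probably best read as one of several roughly equivalent settings.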

### Logistic Regression

In [17]:
# Store the validation accuracy for each solver
result_dict = {}
solvers_list = ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga']

# Try each solver in turn
for solver in solvers_list:
    model = LogisticRegression(random_state=RANDOM_STATE, solver=solver, max_iter=350)
    # train the model
    model.fit(features_train, target_train)
    # find the predictions using the validation set
    predictions_valid = model.predict(features_valid)

    # Add item to dictionary
    result_dict[solver] = accuracy_score(target_valid, predictions_valid)

# Sort by accuracy, then by solver name, highest first
sorted_results = sorted(result_dict.items(), key=lambda kv: (kv[1], kv[0]), reverse=True)

# Print the results in descending order of accuracy
pprint.pprint(sorted_results)


newton-cg
lbfgs
liblinear
sag
saga




[('newton-cg', 0.7589424572317263),
 ('lbfgs', 0.7589424572317263),
 ('liblinear', 0.702954898911353),
 ('saga', 0.6967340590979783),
 ('sag', 0.6967340590979783)]




solver='newton-cg' and solver='lbfgs' tie for the best validation accuracy (0.759).
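
The gap between `sag`/`saga` and the other solvers may come from feature scale. The scikit-learn documentation notes that these two solvers converge quickly only when features have roughly the same scale, and here `mb_used` is in the tens of thousands while `messages` is in the tens. Below is a minimal sketch of scaling inside a pipeline, using synthetic data generated for illustration (not the Megaline dataset).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data for illustration only: two features on very different scales
rng = np.random.default_rng(12345)
base = rng.normal(size=(400, 2))
X = base * np.array([1.0, 10000.0])
y = (base[:, 0] + base[:, 1] > 0).astype(int)

X_train, X_valid = X[:300], X[300:]
y_train, y_valid = y[:300], y[300:]

# Standardize the features before fitting so that saga sees comparable scales
scaled_model = make_pipeline(
    StandardScaler(),
    LogisticRegression(solver="saga", max_iter=350, random_state=12345),
)
scaled_model.fit(X_train, y_train)
scaled_accuracy = scaled_model.score(X_valid, y_valid)
print("saga with scaling:", scaled_accuracy)
```

Putting the scaler inside the pipeline means it is fitted on the training rows only, so no information from the validation set leaks into the model.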

## Conclusion

- DecisionTree with max_depth=5 or 7: validation accuracy of 0.7884914463452566 <br>
- Random Forest with n_estimators=13: validation accuracy of 0.7916018662519441<br>
- Logistic Regression with solver='newton-cg': validation accuracy of 0.7589424572317263<br>
<b>  Random Forest has the best result on the validation set </b>

<div class="alert alert-success">
<b>Reviewer's comment</b>
  
Great, you tried three different models and tuned their hyperparameters using the validation set
  
</div>

## Check the quality of the model using the test set.

In [12]:
# list of features columns
features_test = df_test.drop(['is_ultra'], axis=1)

# Target column
target_test = df_test['is_ultra']


###  DecisionTree

In [13]:
model = DecisionTreeClassifier(random_state=RANDOM_STATE, max_depth=5)
# train the model 
model.fit(features_train, target_train)

train_predictions = model.predict(features_train)
train_accuracy = accuracy_score(target_train, train_predictions) 

predictions_test = model.predict(features_test)
accuracy_test = accuracy_score(target_test, predictions_test) 
print('Accuracy')
print('Training set:', train_accuracy )
print('Test set:', accuracy_test )

Accuracy
Training set: 0.8272821576763485
Test set: 0.7589424572317263


###  Random Forest

In [14]:
model = RandomForestClassifier(random_state=RANDOM_STATE, n_estimators=13)
# train the model 
model.fit(features_train, target_train)

train_predictions = model.predict(features_train)
train_accuracy = accuracy_score(target_train, train_predictions) 

predictions_test = model.predict(features_test)
accuracy_test = accuracy_score(target_test, predictions_test) 
print('Accuracy')
print('Training set:', train_accuracy )
print('Test set:', accuracy_test )

Accuracy
Training set: 0.9901452282157677
Test set: 0.7776049766718507


###  Logistic Regression

In [15]:
model = LogisticRegression(random_state=RANDOM_STATE, solver='newton-cg', max_iter=350) 
# train the model 
model.fit(features_train, target_train)

train_predictions = model.predict(features_train)
train_accuracy = accuracy_score(target_train, train_predictions) 

predictions_test = model.predict(features_test)
accuracy_test = accuracy_score(target_test, predictions_test) 
print('Accuracy')
print('Training set:', train_accuracy )
print('Test set:', accuracy_test )

Accuracy
Training set: 0.7510373443983402
Test set: 0.7262830482115086




### Conclusion

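- Random Forest gives the best test accuracy, 0.7776049766718507 (training accuracy 0.9901452282157677)
- DecisionTree gives a test accuracy of 0.7589424572317263 (training accuracy 0.8272821576763485)
- Logistic Regression gives a test accuracy of 0.7262830482115086 (training accuracy 0.7510373443983402), which is below the 0.75 threshold
- Every model scores higher on the training set than on the test set; the gap is largest for Random Forest, which suggests overfitting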
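
The comparison between training and test accuracy can be made explicit by computing the gap for each model. The numbers below are copied from the cell outputs above, and the 0.75 threshold comes from the task statement.

```python
THRESHOLD = 0.75  # accuracy threshold from the task statement

# (training accuracy, test accuracy) copied from the outputs above
scores = {
    "DecisionTree": (0.8272821576763485, 0.7589424572317263),
    "Random Forest": (0.9901452282157677, 0.7776049766718507),
    "Logistic Regression": (0.7510373443983402, 0.7262830482115086),
}

for name, (train_acc, test_acc) in scores.items():
    gap = train_acc - test_acc
    verdict = "meets" if test_acc >= THRESHOLD else "misses"
    print(f"{name}: test={test_acc:.4f}, gap={gap:.4f}, {verdict} the threshold")

# Models whose test accuracy reaches the threshold
passing = [name for name, (_, test_acc) in scores.items() if test_acc >= THRESHOLD]
# Model with the largest train/test gap
largest_gap = max(scores, key=lambda name: scores[name][0] - scores[name][1])
```

A large gap between training and test accuracy is a common sign of overfitting; limiting `max_depth` in the forest is one thing that could be tried to narrow it.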

<div class="alert alert-success">
<b>Reviewer's comment</b>
  
Final models were evaluated on the test set
  
</div>

## Additional task: sanity check the model. 

In [16]:
dummy_model = DummyClassifier(strategy="most_frequent")
dummy_model.fit(features_train, target_train)

dummy_accuracy = dummy_model.score(features_train, target_train)

print('Training set Accuracy:', dummy_accuracy )

Training set Accuracy: 0.6945020746887967


### Conclusion

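The dummy model, which always predicts the most frequent class, gave a training-set accuracy of 0.6945020746887967.<br>
This is lower than both the training and the test accuracy of every model we built.<br>
<b>Conclusion:</b> our models learn more than the class balance, so they pass the sanity check.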
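
`DummyClassifier(strategy="most_frequent")` always predicts the class that was most frequent in the training target, so its accuracy on any set equals the share of that class in that set. The cell above scores it on the training set; for a like-for-like comparison with the test accuracies, the baseline would be scored on `target_test` instead. Below is a plain-Python sketch of the same rule with made-up labels (`majority_baseline_accuracy` is a hypothetical helper).

```python
from collections import Counter

def majority_baseline_accuracy(train_labels, eval_labels):
    """Accuracy of always predicting the most frequent training label."""
    majority_label, _ = Counter(train_labels).most_common(1)[0]
    hits = sum(1 for label in eval_labels if label == majority_label)
    return hits / len(eval_labels)

# Made-up labels for illustration (0 = Smart, 1 = Ultra)
train_labels = [0, 0, 0, 1, 0, 1, 0, 0, 1, 0]
test_labels = [0, 1, 0, 0, 1]

baseline = majority_baseline_accuracy(train_labels, test_labels)
print(baseline)
```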

<div class="alert alert-success">
<b>Reviewer's comment</b>
  
Yep, it's important to have a simple baseline to make sure that our models learn something non-trivial!
  
</div>