# Introduction to Regression

## 1. Defining the Question

### a) Specifying the Question

Develop a model that analyzes subscribers' behavior and recommends
one of Megaline's newer plans: Smart or Ultra.

### b) Defining the Metric for Success

The model should reach the highest possible accuracy. For this project, the minimum
acceptable accuracy is 0.75.
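
Accuracy here is the share of subscribers for whom the predicted plan matches the actual one. A minimal sketch of the check against the 0.75 threshold, in plain Python: the helper `accuracy`, the constant `ACCURACY_THRESHOLD`, and both label lists are made up for illustration, and the coding 1 = Ultra, 0 = Smart is an assumption based on the `is_ultra` column name.

```python
# Hypothetical helper: fraction of predictions that match the true labels.
def accuracy(y_true, y_pred):
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    matches = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return matches / len(y_true)

ACCURACY_THRESHOLD = 0.75  # project requirement

# Made-up labels (assumed coding: 1 = Ultra, 0 = Smart)
y_true = [0, 0, 1, 0, 1, 0, 0, 1]
y_pred = [0, 0, 1, 0, 0, 0, 1, 1]

score = accuracy(y_true, y_pred)          # 6 of 8 predictions match
print(score, score >= ACCURACY_THRESHOLD)  # 0.75 True
```

In the notebook itself the same quantity comes from scikit-learn's `accuracy_score`.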

### c) Understanding the Context

Mobile carrier Megaline has found out that many of their subscribers use legacy plans.
They want to develop a model that would analyze subscribers' behavior and recommend
one of Megaline's newer plans: Smart or Ultra.
You have access to behavior data about subscribers who have already switched to the
new plans (from the project for the Statistical Data Analysis course). For this
classification task, you need to develop a model that will pick the right plan. Since you’ve
already performed the data preprocessing step, you can move straight to creating the
model.



### d) Recording the Experimental Design

Describe the steps/approach that you will use to answer the given question.



1.   How did you look into the data after downloading it?
2.   Have you correctly split the data into train, validation, and test sets?
3.   How have you chosen the sets' sizes?
4.   Did you evaluate the quality of the models correctly?
5.   What models and hyperparameters did you use?
6.   Did you test the models correctly?
7.   What is your accuracy score?
8.   Have you stuck to the project structure and kept the code neat?

### e) Data Relevance

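How relevant was the provided data?

Very relevant: each row records a subscriber's usage (`calls`, `minutes`, `messages`, `mb_used`) together with the plan they are on (`is_ultra`), which is what the model needs in order to learn plan recommendations.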

## 2. Reading the Data

In [None]:
# Importing our libraries 
# ---
import pandas as pd
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

In [None]:
# Load the data
# ---
# Dataset URL: https://bit.ly/UsersBehaviourTelco
# ---
df = pd.read_csv('https://bit.ly/UsersBehaviourTelco')

In [None]:
# Checking the first 5 rows of data
# ---
#
df.head()

Unnamed: 0,calls,minutes,messages,mb_used,is_ultra
0,40.0,311.9,83.0,19915.42,0
1,85.0,516.75,56.0,22696.96,0
2,77.0,467.66,86.0,21060.45,0
3,106.0,745.53,81.0,8437.39,1
4,66.0,418.74,1.0,14502.75,0


In [None]:
# Checking the last 5 rows of data
# ---
#
df.tail()

Unnamed: 0,calls,minutes,messages,mb_used,is_ultra
3209,122.0,910.98,20.0,35124.9,1
3210,25.0,190.36,0.0,3275.61,0
3211,97.0,634.44,70.0,13974.06,0
3212,64.0,462.32,90.0,31239.78,0
3213,80.0,566.09,6.0,29480.52,1


In [None]:
# Sample 10 rows of data
# ---
#
df.sample(10)

Unnamed: 0,calls,minutes,messages,mb_used,is_ultra
974,66.0,465.75,6.0,16832.32,0
336,11.0,70.5,0.0,2042.74,0
819,62.0,462.48,32.0,14589.41,0
1156,40.0,259.23,20.0,19102.33,0
2220,72.0,512.5,0.0,21285.79,0
1628,85.0,509.61,4.0,26802.84,0
3209,122.0,910.98,20.0,35124.9,1
626,38.0,300.0,65.0,19261.1,0
2481,77.0,551.43,56.0,7694.58,0
2078,99.0,654.24,45.0,14256.68,0


In [None]:
# Checking number of rows and columns
# ---
#  
df.shape

(3214, 5)

In [None]:
# Checking datatypes
# ---
df.dtypes

calls       float64
minutes     float64
messages    float64
mb_used     float64
is_ultra      int64
dtype: object

Record your observations below:

*   Four of the five variables (`calls`, `minutes`, `messages`, `mb_used`) are `float64`; the target `is_ultra` is `int64`.
*   The data has 3214 rows and 5 columns.



## 3. Data Preparation

### Performing Data Cleaning

In [None]:
# Checking missing entries of all the variables. 
# ---
df.isnull().sum()

calls       0
minutes     0
messages    0
mb_used     0
is_ultra    0
dtype: int64

We observe the following from our dataset:

*   There are no missing values in any column.



In [None]:
# Standardizing the dataset, i.e. variable renaming
# Convert all column names to lowercase, then check the first five rows to confirm
df.columns = df.columns.str.lower()
df.head()

Unnamed: 0,calls,minutes,messages,mb_used,is_ultra
0,40.0,311.9,83.0,19915.42,0
1,85.0,516.75,56.0,22696.96,0
2,77.0,467.66,86.0,21060.45,0
3,106.0,745.53,81.0,8437.39,1
4,66.0,418.74,1.0,14502.75,0


We observe the following from our dataset:

*   We converted all column names to lowercase and checked the first five rows to confirm. The names were already lowercase, so the output is unchanged.



In [None]:
# Checking how many duplicate rows are there in the data
# ---
df.duplicated().sum()

0

We observe the following from our dataset:

*   There are no duplicates in our data



In [None]:
# Checking if any of the columns are all null
# ---
df.isnull().all(axis = 0)

calls       False
minutes     False
messages    False
mb_used     False
is_ultra    False
dtype: bool

We observe the following from our dataset:

*   None of the columns contains all null values



In [None]:
# Checking if any of the rows are all null
# ---
df.isnull().all(axis=1).sum()

0

We observe the following from our dataset:

*   No row contains completely null values



In [None]:
# Creating a copy of our dataframe
# ---
df_clean = df.copy()
df_clean.head()

Unnamed: 0,calls,minutes,messages,mb_used,is_ultra
0,40.0,311.9,83.0,19915.42,0
1,85.0,516.75,56.0,22696.96,0
2,77.0,467.66,86.0,21060.45,0
3,106.0,745.53,81.0,8437.39,1
4,66.0,418.74,1.0,14502.75,0


In [None]:
df_clean.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3214 entries, 0 to 3213
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   calls     3214 non-null   float64
 1   minutes   3214 non-null   float64
 2   messages  3214 non-null   float64
 3   mb_used   3214 non-null   float64
 4   is_ultra  3214 non-null   int64  
dtypes: float64(4), int64(1)
memory usage: 125.7 KB


In [None]:
df_clean.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
calls,3214.0,63.038892,33.236368,0.0,40.0,62.0,82.0,244.0
minutes,3214.0,438.208787,234.569872,0.0,274.575,430.6,571.9275,1632.06
messages,3214.0,38.281269,36.148326,0.0,9.0,30.0,57.0,224.0
mb_used,3214.0,17207.673836,7570.968246,0.0,12491.9025,16943.235,21424.7,49745.73
is_ultra,3214.0,0.306472,0.4611,0.0,0.0,0.0,1.0,1.0
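
The `is_ultra` mean in the summary above is the share of subscribers on Ultra, and it fixes the accuracy of the simplest possible model: always recommending the more common plan. A short worked calculation, assuming `is_ultra` is 1 for Ultra and 0 for Smart (the variable names are illustrative):

```python
# Share of Ultra subscribers, taken from the `mean` of is_ultra in describe()
ultra_share = 0.306472
smart_share = 1 - ultra_share

# A model that always predicts the majority class is right for that share of rows
baseline_accuracy = max(ultra_share, smart_share)

print(round(baseline_accuracy, 4))  # 0.6935
```

Any trained model should beat this figure; the 0.75 threshold sits roughly six percentage points above it.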


In [None]:
df_clean.columns

Index(['calls', 'minutes', 'messages', 'mb_used', 'is_ultra'], dtype='object')

In [None]:
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

### Splitting the Data

In [None]:
# Split the data 80:10:10 into train, validation, and test sets
X = df_clean.drop(columns=['is_ultra'])
y = df_clean['is_ultra']

# First, keep 80% of the rows for training and hold out the remaining 20%
# (random_state is fixed so the split is reproducible)
X_train, X_rem, y_train, y_rem = train_test_split(
    X, y, train_size=0.8, random_state=12345)

# The validation and test sets should be equal in size (10% each of the overall data),
# so the held-out rows are split in half with test_size=0.5
X_valid, X_test, y_valid, y_test = train_test_split(
    X_rem, y_rem, test_size=0.5, random_state=12345)

print(X_train.shape, y_train.shape)
print(X_valid.shape, y_valid.shape)
print(X_test.shape, y_test.shape)

(2571, 4) (2571,)
(321, 4) (321,)
(322, 4) (322,)
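
The set sizes printed above follow directly from the two split fractions. A small arithmetic sketch, assuming scikit-learn rounds a fractional `train_size` down and a fractional `test_size` up (an assumption that is consistent with the shapes shown); the variable names are illustrative:

```python
import math

n_rows = 3214  # rows in df_clean

# First split: train_size=0.8 gives the training set, the rest is held out
n_train = math.floor(0.8 * n_rows)
n_rem = n_rows - n_train

# Second split: test_size=0.5 of the held-out rows gives the test set,
# and whatever is left becomes the validation set
n_test = math.ceil(0.5 * n_rem)
n_valid = n_rem - n_test

print(n_train, n_valid, n_test)  # 2571 321 322
print(round(n_train / n_rows, 3), round(n_valid / n_rows, 3), round(n_test / n_rows, 3))
```

The second line of output confirms that the three sets are close to 80%, 10%, and 10% of the data.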

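### Decision Tree

In [None]:
model = DecisionTreeClassifier(random_state=12345, max_depth=3, class_weight=None)

# Fit once on the training set, then score all three sets with that same model
model.fit(X_train, y_train)
train_predictions = model.predict(X_train)
test_predictions = model.predict(X_test)
valid_predictions = model.predict(X_valid)

print('Accuracy')
print('Training set:', accuracy_score(y_train, train_predictions))
print('Test set:', accuracy_score(y_test, test_predictions))
print('Valid set:', accuracy_score(y_valid, valid_predictions))

Accuracy
Training set: 0.7950213924542979
Test set: 0.8167701863354038
Valid set: 0.838006230529595

Note: the scores above were recorded from a run that refit the model on each set before predicting on it, so all three are in-sample accuracies. With the single fit on the training set shown in this cell, the test and validation scores are held-out estimates and will differ.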


### Random Forest

In [None]:
model = RandomForestClassifier(random_state=12345, n_estimators=3)

# Fit once on the training set, then score all three sets with that same model
model.fit(X_train, y_train)
train_predictions = model.predict(X_train)
test_predictions = model.predict(X_test)
valid_predictions = model.predict(X_valid)

print('Accuracy')
print('Training set:', accuracy_score(y_train, train_predictions))
print('Test set:', accuracy_score(y_test, test_predictions))
print('Valid set:', accuracy_score(y_valid, valid_predictions))

Accuracy
Training set: 0.9548813691170751
Test set: 0.9472049689440993
Valid set: 0.956386292834891

Note: the scores above were recorded from a run that refit the model on each set before predicting on it, so all three are in-sample accuracies. With the single fit on the training set shown in this cell, the test and validation scores are held-out estimates and will differ.
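
Accuracy measured on the same rows a model was fitted on can overstate how the model will do on unseen subscribers, which is why the validation and test sets must stay out of `fit`. A minimal sketch of that gap, using a hypothetical `MemorizingClassifier` and made-up rows rather than the Megaline data (all names and values here are illustrative assumptions):

```python
class MemorizingClassifier:
    """Hypothetical toy model: stores the rows it was fitted on and predicts
    the label of the closest stored row (1-nearest-neighbour on one feature)."""

    def fit(self, X, y):
        self.X_ = list(X)
        self.y_ = list(y)
        return self

    def predict(self, X):
        preds = []
        for x in X:
            nearest = min(range(len(self.X_)), key=lambda i: abs(self.X_[i] - x))
            preds.append(self.y_[nearest])
        return preds


def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Made-up data: one usage feature, assumed coding 1 = Ultra, 0 = Smart
X_seen, y_seen = [5, 9, 14, 18, 22, 30], [0, 0, 1, 0, 1, 1]
X_unseen, y_unseen = [6, 13, 17, 19, 25, 28], [0, 0, 0, 1, 1, 1]

toy_model = MemorizingClassifier().fit(X_seen, y_seen)  # fitted on the "seen" rows only

in_sample = accuracy(y_seen, toy_model.predict(X_seen))      # rows used in fit
held_out = accuracy(y_unseen, toy_model.predict(X_unseen))   # rows never used in fit
print(in_sample, held_out)
```

The toy model scores perfectly on the rows it memorized and noticeably lower on the rows it never saw; only the second number says anything about new subscribers.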


### Logistic Regression

In [None]:
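from sklearn.linear_model import LogisticRegression

model = LogisticRegression(random_state=12345, solver='liblinear')

# Fit once on the training set, then score all three sets with that same model
model.fit(X_train, y_train)
train_predictions = model.predict(X_train)
test_predictions = model.predict(X_test)
valid_predictions = model.predict(X_valid)

print('Accuracy')
print('Training set:', accuracy_score(y_train, train_predictions))
print('Test set:', accuracy_score(y_test, test_predictions))
print('Valid set:', accuracy_score(y_valid, valid_predictions))

Accuracy
Training set: 0.7370672889926099
Test set: 0.7422360248447205
Valid set: 0.8037383177570093

Note: the scores above were recorded from a run that refit the model on each set before predicting on it, so all three are in-sample accuracies. With the single fit on the training set shown in this cell, the test and validation scores are held-out estimates and will differ.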


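### Conclusion

Of the three models, the random forest has the highest recorded accuracy on every set (about 0.95), followed by the decision tree (0.80 to 0.84) and logistic regression (0.74 to 0.80). The random forest and the decision tree clear the 0.75 threshold on all three sets; logistic regression falls short of it on the training and test sets. These scores were recorded with the model refit on each set, so they should be confirmed with a model fitted on the training set only before the random forest is recommended.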