<h2><center>Jupyter Notebook</center></h2>

Jupyter notebooks can be very handy as it helps in combining code with the text. We can create presentations, or technical documents using notebook. Default jupyter notebook can be difficult to work with, particularly with large document with a lot of sections and subsections. Or even codes become difficult to read if the code includes lots of functions, loops etc.

If you use jupyter notebook, please install the extension: nbextension. <a href="https://jupyter-contrib-nbextensions.readthedocs.io/en/latest/install.html">check here for installations</a>

After you install the extension, a new option will come up as shown Figure 1 (a). Click on it to and select the extension you want in your notebook as I have selected some in Figure 1(b). Once clicked, you can use those extensions.


<img src="notebook.png" width="1000">

<h2><center>Pandas Dataframe</center></h2>

Pandas dataframe is very useful tool for preparing your data for building machine learning/ analytics models. You can think of pandas dataframe as a table (2-d matrix). All the operations in pandas are performed in parallel (when you have millions of rows of data, <b>do not attempt to run loop</b>, always think as how you can process it using pandas.

This is just a short starter on Pandas. We cannot cover all the possible operations but please go through their documentation <a href='https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html'>here</a>. You can see all the possible operations you can do using pandas.

As you go along, please print the data to see what is happening with different operations

In [19]:
'''
- whenever we import any library in python, we use "as" when we want to provide a shortcut for that library import
  for example, import numpy as np etc. Its just a short, otherwise, we will have to write numpy or pandas everywhere
- when importing a library from python, we sometimes write as "from sklearn import random_projection", here from the library 
  of sklearn, we just want to import the module (function) we would like to import, otherwise, code becomes heavy unnecessary
'''
import pandas as pd 
import numpy as np    # along with pandas, numpy can be used for different operations

<h3>Panda Operation</h3>
    
We first read the data. Whenever we read the data in jupyter notebook, the data is stored in RAM of the computer (local memory). Thus if the data is too big and RAM of the computer is low, code may not be able to load the data and will "memory error".

<h3>One hot encoding</h3>

We are going to use a churn data (you can check the data). The data has information on user's of a telecom industry and they want to build a model if the user will churn or not. 

Majority of the columns are binary (0 or 1), except six columns. In three columns, tenure, MonthlyCharges, TotalCharges, the data is ratio data. In three columns, 'InternetService','Contract','PaymentMethod', have integer data but it is not binary, has two-three options. It is easier much easier for ML models to convert data columns into <b>one hot vector</b> where the number of options in a columns in not huge. 

One hot vector converts categorical data into binary data. For example, in a dataset, gender of user may be specified as "male", "female" or "other". Since ML models work with numbers, we need to convert it into numbers. We can assign 1 to female, 2 to male and 3 to others. However, for example, in linear regression, it does not make sense as why one category of gender should have higher value than other. We can create three new columns: gender_female, gender_male, gender_other. In these columns, if a user is male, the entry in gender_female = 0, gender_male =1 and gender_other = 0. There are advantages of one hot encoding:

1. it becomes much easier to interpret the results when we use one hot encoder
2. it can handle non-linearity (for example, 1,2,3 in gender does not make sense)
3. one disadvantage is that it makes the optimization problem difficult to solve (as the variables have to be integer). But current optimization methods are sophisticated enough to handle this.
4. personally, I use one hot vector wherever I can (with that I mean where the number of options in a column in not very high)

For example, data in tenure column can also be converted into one hot vector (since the maximum is 72 and minimum is 1), we will need 72 new columns (increasing the number of variables increases the demand for data). Also, increasing value of tenure "makes sense" (or it is advantageous to keep it as it is). 

<h3>Reading the Data</h3>

In [29]:
# reading data through pandas: it is similar to how we read SQL data
data = pd.read_csv('churn.csv')

# print the shape of data (many numpy functions work in pandas also : shape is a numpy function)
print('shape',data.shape)

# it will print the top n rows of the data, default value of n = 5
n = 3
print(data.head(n))

# printing the names of the columns
print(data.columns)

shape (7043, 21)
   customerID  gender  SeniorCitizen  Partner  Dependents  tenure  \
0  7590-VHVEG       0              0        1           0       1   
1  5575-GNVDE       1              0        0           0      34   
2  3668-QPYBK       1              0        0           0       2   

   PhoneService  MultipleLines  InternetService  OnlineSecurity  ...  \
0             0              0                1               0  ...   
1             1              0                1               1  ...   
2             1              0                1               1  ...   

   DeviceProtection  TechSupport  StreamingTV  StreamingMovies  Contract  \
0                 0            0            0                0         1   
1                 1            0            0                0         2   
2                 0            0            0                0         1   

   PaperlessBilling  PaymentMethod  MonthlyCharges  TotalCharges Churn  
0                 1              2     

<h3>Converting to one hot encoding</h3>

The three columns which had few integer options, we will convert them into one hot vector.

In [22]:
columns_to_change = ['InternetService','Contract','PaymentMethod']

for c in columns_to_change:
    oneHot = pd.get_dummies(data[c],prefix=c)  # prefix tells how new columns be named (prefex_values is how it is renamed)
    data   = data.drop(c,axis = 1)             # we dont need the old column, THIS IS HOW A COLUMN IS DROPPED
    data   = pd.concat([data,oneHot],axis=1)   # THIS IS HOW TWO DATAFRAMES ARE MERGED (INNER JOIN), axis 1 = columns

In [23]:
# here oneHot was the new dataframe with one hot vectors (it has the index values same as original data)
# check for the first row that the value for PaymentMethod was 2, so column PaymentMethod_2 = 1, rest are all 0

oneHot.head(5)

Unnamed: 0,PaymentMethod_0,PaymentMethod_1,PaymentMethod_2,PaymentMethod_3
0,0,0,1,0
1,0,0,0,1
2,0,0,0,1
3,1,0,0,0
4,0,0,1,0


In [30]:
'''
adding a new column
suppoe we want to add a new column as multiplication of two columns
we want to find number of female sennior citizens (Lets assume that 1 = male, 0 = female)
ALL THE OPERATIONS ARE SUPER FAST AS THEY ARE CARRIED OUT IN PARALLEL
'''

# this line first creates a new column with values (1-data['gender']) and multiplies it with seniorCitizen
data['senior_female'] = (1-data['gender'])*data['SeniorCitizen']

print("female and senior:",sum(data['senior_female']), "total users:", len(data))

# deleting a column
# deleting toalCharges as it is roughly multiplication of tenure and monthlyCharges
del data['TotalCharges']

# print and check the columnwise statistics np.sum() or np.mean, np.median: numpy and padas go hand in hand
print(np.mean(data))

female and senior: 568 total users: 7043
gender               0.504756
SeniorCitizen        0.162147
Partner              0.483033
Dependents           0.299588
tenure              32.371149
PhoneService         0.903166
MultipleLines        0.421837
InternetService      1.222916
OnlineSecurity       0.286668
OnlineBackup         0.344881
DeviceProtection     0.343888
TechSupport          0.290217
StreamingTV          0.384353
StreamingMovies      0.387903
Contract             1.690473
PaperlessBilling     0.592219
PaymentMethod        1.574329
MonthlyCharges      64.761692
Churn                0.265370
senior_female        0.080647
dtype: float64


<h3>Accessing DataFrame data</h3>

In [17]:
# Columns of the data frame can be accessed by two methods
# 1. method 1
customerIds = data['customerID']

# 2. method 2
customerIds = data.customerID           # make sure that the column name has no black spaces, otherwise it will not work

# For accessing a unique value (from one row and one column), it can accessed as
# 1. method 1
user_2_id = data.iloc[1, 0]             # data.iloc[row_number, column_number]

# 2. method 2
user_2_id = data.iloc[1]['customerID']  # data.iloc[row_number][column_name]

<h3>Applying functions in pandas dataframes</h3>

In [33]:
'''
- there are two methods using which we can perform actions on a column
- these actions are by default run in parallel so it is fastest method
- checking if the tenure is more than 25 (first we create a new column and then use function)

it uses lambda keywork to perform action on every datapoint in the column
'''

# method 1
data['is_tenure_25'] = 1   # creating a new column with all values all 1
data['is_tenure_25'] = data['tenure'].apply(lambda x:1 if x > 25 else 0)

print('number of users with tenure > 25:',sum(data['is_tenure_25']))

# using method 1, only small operations can be performed. If we want to perform complicated operations, we can create function

def check_if_25(x):
    if x > 25:
        return(1)
    else:
        return(0)

data['is_tenure_25'] = 1   # creating a new column with all values all 1
data['is_tenure_25'] = data['tenure'].apply(lambda x:check_if_25(x))

print('number of users with tenure > 25:',sum(data['is_tenure_25']))

number of users with tenure > 25: 3754
number of users with tenure > 25: 3754


<h3>Selecting some data rows from the set</h3>

Sometimes, we need to filter the data (for example, removing outliers in dataset). We can use the selection method to select the rows we want. Lets build an example where we will select users only when tenure is greater than 25

In [35]:
# data_tenure will have data with tenure value greater than 25

# in this method "data['is_tenure_25']==1" finds the index which satisfy the condition
data_tenure = data[data['is_tenure_25']==1]

# we can directly write any condition
data_tenure = data[data['tenure']>25]

# selecting senior female citizens
data_senior_females = data[(data['gender']==0) & (data['SeniorCitizen']==1)]

print(data_senior_females.head(5))   # check it will only have data where the conditions were satisfied

    customerID  gender  SeniorCitizen  Partner  Dependents  tenure  \
30  3841-NFECX       0              1        1           0      71   
50  8012-SOUDQ       0              1        0           0      43   
52  6575-SUVOI       0              1        1           0      25   
53  7495-OOKFY       0              1        1           0       8   
54  4667-QONEA       0              1        1           1      60   

    PhoneService  MultipleLines  InternetService  OnlineSecurity  ...  \
30             1              1                2               1  ...   
50             1              1                2               0  ...   
52             1              1                1               1  ...   
53             1              1                2               0  ...   
54             1              0                1               1  ...   

    TechSupport  StreamingTV  StreamingMovies  Contract  PaperlessBilling  \
30            1            0                0         3        

<h3>WORD OF CAUTION</h3>

If you doing some operation that is changing the fundamental structure of the original data, always create a copy. DataFrames act like an array in python. If you create a new variable, say data2 = data, and modify data2, data1 will also be modified. therefore, use the copy function to create copies, it will not alter the main data.mad

In [39]:
# creating a copy
data_copy = pd.DataFrame.copy(data)

<h3>SQL operations</h3>

All the possible SQL operations can be performed using pandas, example, JOIN, APPEND, finding unique rows, aggregating the data (for example, mean or sum using a key, say, userID), groupBY, countBy, prderBy etc.

Refer to some of the links:
https://medium.com/jbennetcodes/how-to-rewrite-your-sql-queries-in-pandas-and-more-149d341fc53e
https://towardsdatascience.com/sql-and-pandas-268f634a4f5d

In [48]:
# just as an example, create a new data by sorting based on the tenure (highest to lowest)
data_copy = data_copy.sort_values(by=['tenure'], ascending=False)

# sorting based on multiple conditions (sorting first by tenure, then by monthly payment)
data_copy = data_copy.sort_values(by=['tenure','MonthlyCharges'], ascending=False)
data_copy.head(3)

Unnamed: 0,customerID,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,...,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,Churn,senior_female,is_tenure_25
4586,7569-NMZYQ,0,0,1,1,72,1,1,2,1,...,1,1,1,3,1,0,118.75,0,0,1
6118,9924-JPRMC,1,0,0,0,72,1,1,2,1,...,1,1,1,3,1,2,118.2,0,0,1
4610,2889-FPWRM,1,0,1,0,72,1,1,2,1,...,1,1,1,2,1,0,117.8,1,0,1


Next we discuss, how to use an existing ML library for building ML models.

<h2><center>Building ML models is Python</center></h2>

All the programming languages have different libraries for machine learning. There are a lot of developers and they create their own library for different machine learning models. For example, scikit learn or statsmodels. We can use the machine learning models we need from any of these libraries.

However, Scikit-learn is the most widely used python library for machine learning in python. Given that different experts are involved in building the libraries for these ML models, these can be trusted. Scikit-learn is very easy to use.

Please install scikit-learn if you haven't already. If you use Anaconda, it is already installed in it. Scikit-learn provides a list of <b>classical machine learning</b> models. It also provides all the parameters that are required in the model (these are called hyper-parameters - how to select the best parameters is a research in itself but if we have an understanding of how these parameters might affect our prediction, we can make intelligent guesses on the parameter values to select).

We will explain it though an example in logistic regression.

Scikit-learn is very easy to use because irrespective of the model, same functions are used everywhere (for example fit(X,Y) or predict(X) etc).

<h3>Building the data</h3>

Use the data we have, after one hot encoding and construct the training and test sets.

In [53]:
X = pd.DataFrame.copy(data)
del X['Churn']
del X['customerID'] # it has to be deleted

Y = data['Churn']

<h3>Importing libraries</h3>

In [65]:
'''
------------- understanding scikit-learn library -----------
just google scikit-learn logistic regression or any model, it will take you the page for that model

import the models we want (example, train_test_split and logistic regression)
we can think of sklearn as a big library, which has sub libraries called model_selection, linear_model 
                                                                           which has the sub-sub library we need
                                                                           
Check the page for all the inputs the algorithm demands, the outputs which can be obtained
'''

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression, Lasso
from sklearn.metrics import accuracy_score, r2_score

In [54]:
# building test and train set
# https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html 
# thik link takes us to the inputs train_test function needs, it will return our data sets
# check the library for what do test_size or random_state means

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=42)

# check that sum of two sets = data size, also the dimensions (columns) should not change
print(X.shape,X_train.shape,X_test.shape,Y_train.shape,Y_test.shape)

(7043, 20) (5634, 20) (1409, 20) (5634,) (1409,)


<h3>Logistic Regression</h3>

In [57]:
'''
now that the data is ready, we can use Logistic regression function to get us the model results
https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html

check the link for all the hyperparameters required in logistic regression (gives the following)

class sklearn.linear_model.LogisticRegression(penalty='l2', dual=False, tol=0.0001, C=1.0, fit_intercept=True, 
                intercept_scaling=1, class_weight=None, random_state=None, solver='lbfgs', max_iter=100, 
                multi_class='auto', verbose=0, warm_start=False, n_jobs=None, l1_ratio=None)
                
read about weach of the these parameters. for example, if the data is unbalanced, we can use class weight to give more 
preferrence to the class with lower instances (for example, if 3% churn, it is important to give more weight to the 
data points where there is churn: it is an idea like SMOTE for unbalanced dataset). If we do not set these parameters, 
it will be set as the default value as given in the link.
'''

# setting the parameters. You can add or remove any parameter from the slogistic regression function
# if not set by us, it will be set as default

parameter = {'penalty':'l2', 'solver':'lbfgs', 'max_iter':1090, 'verbose':0, 'n_jobs':-1}

model       = LogisticRegression(**parameter).fit(X_train, Y_train)
y_pred      = model.predict(X_test)                 # predicting on the test data set
predictions = [round(value) for value in y_pred]    # converting probability to binary
accuracy    = accuracy_score(Y_test, predictions)   # finding the accuracy
print(accuracy)

0.8190205819730305


In [62]:
# setting the weight parameters (weight for churn = total users/total churns) to give more weight to churn
# how to add the weight parameter into the set of parameters we use.

# dictionary
weight = {1:len(Y_train)/sum(Y_train),0:1}

parameter = {'penalty':'l2', 'solver':'lbfgs', 'max_iter':100, 'verbose':0, 'n_jobs':-1,'class_weight':weight}

model       = LogisticRegression(**parameter).fit(X_train, Y_train)
y_pred      = model.predict(X_test)                 # predicting on the test data set
predictions = [round(value) for value in y_pred]    # converting probability to binary
accuracy    = accuracy_score(Y_test, predictions)   # finding the accuracy
print(accuracy)

# in any IDE, we can write model. and then click tab button on keyboard. it will show all the options available for output

0.7189496096522356


<h3>Lasso</h3>

Similar to any scikit-learn logistic regression model, we can also check it for other models.

In [63]:
'''
https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Lasso.html

Go to the link, read on all the hyperparameters, as what do they mean and how should we select the values for them

class sklearn.linear_model.Lasso(alpha=1.0, fit_intercept=True, normalize=False, precompute=False, copy_X=True, 
                        max_iter=1000, tol=0.0001, warm_start=False, positive=False,
                        random_state=None, selection='cyclic')[source]
        
'''

# lets create a new example where we predict tenure (from the same data using lasso)

X = pd.DataFrame.copy(data)
del X['tenure']
del X['customerID'] # it has to be deleted

Y = data['tenure']

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=42)

# check that sum of two sets = data size, also the dimensions (columns) should not change
print(X.shape,X_train.shape,X_test.shape,Y_train.shape,Y_test.shape)

(7043, 20) (5634, 20) (1409, 20) (5634,) (1409,)


In [69]:
parameter = {'alpha':0.1, 'max_iter':1000}

model      = Lasso(**parameter).fit(X_train, Y_train)
y_pred     = model.predict(X_test)                 # predicting on the test data set
r_score    = r2_score(Y_test, predictions)         # finding the accuracy
print(r_score)

# running this gives a r-square value of -1.63 (this counters the common misconception that r-square lies between 0 and 1)
# r-square value can lie from negative infinity to 1

-1.6377098483404007
