## Let’s get started

I’ll start by importing modules and loading the data set into the Python environment:

In [13]:
import pandas as pd
import numpy as np
data = pd.read_csv("../data/train.csv", index_col="Loan_ID")

# #1 – Boolean Indexing

What do you do if you want to filter the values of a column based on conditions on another set of columns? For instance, we want a list of all females who are not graduates and got a loan. Boolean indexing can help here. You can use the following code:

In [14]:
#>
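
As a minimal sketch of the idea, the filter can be built from one boolean mask per condition. The toy frame below stands in for the loan data; its values, and the ‘Education’ column with a "Not Graduate" label, are assumptions made for illustration.

```python
import pandas as pd

# Toy stand-in for the loan data (values are made up for illustration)
toy = pd.DataFrame(
    {
        "Gender": ["Female", "Male", "Female", "Female"],
        "Education": ["Not Graduate", "Graduate", "Graduate", "Not Graduate"],
        "Loan_Status": ["Y", "Y", "N", "N"],
    },
    index=pd.Index(["LP001", "LP002", "LP003", "LP004"], name="Loan_ID"),
)

# One boolean mask per condition, combined with & (each wrapped in parentheses)
mask = (
    (toy["Gender"] == "Female")
    & (toy["Education"] == "Not Graduate")
    & (toy["Loan_Status"] == "Y")
)

# .loc takes the row mask and the list of columns to return
selected = toy.loc[mask, ["Gender", "Education", "Loan_Status"]]
print(selected)
```

Only the rows where all three conditions hold survive the filter.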

# #2 – Apply Function

Apply is one of the most commonly used functions for playing with data and creating new variables. It passes each row or column of a data frame to a function and returns the result. The function can be either built-in or user-defined. For instance, here it can be used to find the number of missing values in each row and column.

In [15]:
#Create a new function:
def num_missing(x):
  return sum(x.isnull())

#Applying per column:
print ("Missing values per column:")
#>

Missing values per column:


In [16]:
#Applying per row:
print("\nMissing values per row:")
#>


Missing values per row:


Thus we get the desired result.

Note: The head() function is used in the second output because it contains many rows.
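
The two apply calls can be sketched on a small made-up frame (the values below are illustrative, not taken from the loan data). Here axis=0 hands each column to the function and axis=1 hands it each row.

```python
import numpy as np
import pandas as pd

def num_missing(x):
    # Count the missing entries in one row or column
    return x.isnull().sum()

# Toy frame with a few gaps (values are made up for illustration)
toy = pd.DataFrame(
    {
        "Gender": ["Male", None, "Female"],
        "LoanAmount": [120.0, np.nan, np.nan],
        "Married": ["Yes", "No", "Yes"],
    }
)

per_column = toy.apply(num_missing, axis=0)  # one count per column
per_row = toy.apply(num_missing, axis=1)     # one count per row

print("Missing values per column:")
print(per_column)
print("\nMissing values per row:")
print(per_row.head())
```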

# #3 – Imputing missing values

‘fillna()’ does it in one go. It is used for updating missing values with the overall mean/mode/median of the column. Let’s impute the ‘Gender’, ‘Married’ and ‘Self_Employed’ columns with their respective modes.

In [17]:
#First we import a function to determine the mode
from scipy.stats import mode
mode(data['Gender'])

ModeResult(mode=array(['Male'], dtype=object), count=array([489]))

This returns both the mode and its count. Remember that the mode can be an array, as several values can share the highest frequency. We will always take the first one by default, using:

In [18]:
mode(data['Gender']).mode[0]

'Male'

Now we can fill the missing values and check using num_missing:



In [19]:
#Fill nan values of Gender, Married, and Self_employed with its mode
#>
#>
#>

#Now check the #missing values again to confirm:
#>
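
A minimal sketch of the fill step, on a made-up frame. Note the swap: it uses pandas’ own Series.mode() rather than scipy.stats.mode, on the assumption that newer SciPy releases may reject non-numeric input; the idea of filling with the first mode is the same.

```python
import numpy as np
import pandas as pd

# Toy frame with gaps in the three categorical columns (made-up values)
toy = pd.DataFrame(
    {
        "Gender": ["Male", "Male", np.nan, "Female"],
        "Married": ["Yes", np.nan, "Yes", "No"],
        "Self_Employed": ["No", "No", np.nan, "Yes"],
    }
)

for col in ["Gender", "Married", "Self_Employed"]:
    # Series.mode() ignores NaN and returns the most frequent value(s)
    most_frequent = toy[col].mode()[0]
    toy[col] = toy[col].fillna(most_frequent)

# Check the number of missing values again to confirm
remaining = toy.isnull().sum()
print(remaining)
```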

Hence, it is confirmed that the missing values are imputed. Please note that this is the most primitive form of imputation. More sophisticated techniques include modeling the missing values and using grouped averages (mean/mode/median). I’ll cover those in my next articles.

# #4 – Pivot Table

Pandas can be used to create MS Excel style pivot tables. For instance, in this case, a key column is “LoanAmount”, which has missing values. We can impute it using the mean amount of each ‘Gender’, ‘Married’ and ‘Self_Employed’ group. The mean ‘LoanAmount’ of each group can be determined as:

In [20]:
data.pivot_table?

In [21]:
#Determine pivot table
#>
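
One way to sketch the group means is with pivot_table on a made-up frame. The name group_means and the data values are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the loan data (made-up values)
toy = pd.DataFrame(
    {
        "Gender": ["Male", "Male", "Female", "Female", "Male"],
        "Married": ["Yes", "Yes", "No", "No", "Yes"],
        "Self_Employed": ["No", "No", "No", "No", "No"],
        "LoanAmount": [100.0, 200.0, 80.0, np.nan, np.nan],
    }
)

# Mean LoanAmount for every (Gender, Married, Self_Employed) combination;
# missing amounts are skipped when the mean is computed
group_means = toy.pivot_table(
    values=["LoanAmount"],
    index=["Gender", "Married", "Self_Employed"],
    aggfunc="mean",
)
print(group_means)
```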

# #5 – Multi-Indexing

If you notice the output of step #4, it has a strange property. Each index is made up of a combination of 3 values. This is called Multi-Indexing. It helps in performing operations really fast.

Continuing the example from #4, we have the values for each group but they have not been imputed yet. This can be done using the various techniques learned till now.

In [22]:
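#impute_grps must already hold the group means from #4 (one row per Gender/Married/Self_Employed group)
#iterate only through rows with missing LoanAmount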
for i,row in data.loc[data['LoanAmount'].isnull(),:].iterrows():
  ind = tuple([row['Gender'],row['Married'],row['Self_Employed']])
  data.loc[i,'LoanAmount'] = impute_grps.loc[ind].values[0]

#Now check the #missing values again to confirm:
print(data.apply(num_missing, axis=0))

NameError: name 'impute_grps' is not defined

Note:

Multi-index requires a tuple for defining groups of indices in a loc statement. This is why a tuple is built inside the loop.
The .values[0] suffix is required because, by default, a Series element is returned whose index does not match that of the dataframe. In this case, a direct assignment gives an error.
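
The loop above can be tried end to end on a made-up frame. This sketch assumes impute_grps is the pivot table of group means, which is how the loop uses it; the data values are illustrative.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the loan data (made-up values)
toy = pd.DataFrame(
    {
        "Gender": ["Male", "Male", "Female", "Female", "Male"],
        "Married": ["Yes", "Yes", "No", "No", "Yes"],
        "Self_Employed": ["No", "No", "No", "No", "No"],
        "LoanAmount": [100.0, 200.0, 80.0, np.nan, np.nan],
    }
)

# Assumption: impute_grps holds the mean LoanAmount per group
impute_grps = toy.pivot_table(
    values=["LoanAmount"],
    index=["Gender", "Married", "Self_Employed"],
    aggfunc="mean",
)

# Iterate only through rows with a missing LoanAmount
for i, row in toy.loc[toy["LoanAmount"].isnull(), :].iterrows():
    # The tuple selects one row of the multi-indexed table
    ind = tuple([row["Gender"], row["Married"], row["Self_Employed"]])
    toy.loc[i, "LoanAmount"] = impute_grps.loc[ind].values[0]

print(toy["LoanAmount"])
```

Each missing amount is replaced by the mean of its own group, so the two gaps become 80.0 and 150.0 here.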

# #6 – Crosstab

This function is used to get an initial “feel” (view) of the data. Here, we can validate some basic hypotheses. For instance, in this case, “Credit_History” is expected to affect the loan status significantly. This can be tested using cross-tabulation as shown below:



In [23]:
pd.crosstab?

In [24]:
#>
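
A small sketch of the cross-tabulation on made-up data (the counts below are illustrative and are not the real loan figures). Passing margins=True adds the "All" row and column totals.

```python
import pandas as pd

# Toy stand-in for the two columns (made-up values)
toy = pd.DataFrame(
    {
        "Credit_History": [1.0, 1.0, 1.0, 0.0, 0.0, 1.0],
        "Loan_Status": ["Y", "Y", "N", "N", "N", "Y"],
    }
)

# Rows: Credit_History, columns: Loan_Status, plus totals under "All"
counts = pd.crosstab(toy["Credit_History"], toy["Loan_Status"], margins=True)
print(counts)
```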


These are absolute numbers. But percentages can be more intuitive for drawing quick insights. We can get them using the apply function:

In [27]:
#> def percConvert(ser):

pd.crosstab(data["Credit_History"],data["Loan_Status"],margins=True).apply(percConvert, axis=1)

NameError: name 'percConvert' is not defined
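
One possible shape for such a helper, sketched on made-up data: divide each row by its own total. The function name perc_convert and its body are assumptions for illustration, not the original definition.

```python
import pandas as pd

def perc_convert(ser):
    # Divide every entry in the row by the row total, which margins=True
    # places in the last position ("All")
    return ser / float(ser.iloc[-1])

# Toy stand-in for the two columns (made-up values)
toy = pd.DataFrame(
    {
        "Credit_History": [1.0, 1.0, 1.0, 0.0, 0.0, 1.0],
        "Loan_Status": ["Y", "Y", "N", "N", "N", "Y"],
    }
)

counts = pd.crosstab(toy["Credit_History"], toy["Loan_Status"], margins=True)
shares = counts.apply(perc_convert, axis=1)  # axis=1: one row at a time
print(shares)
```

pd.crosstab also accepts a normalize argument, which may be a shorter route to the same row shares.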

Now, it is evident that people with a credit history have much higher chances of getting a loan, as 80% of people with a credit history got a loan compared to only 9% of those without one.

But that’s not it. It tells an interesting story. Since I know that having a credit history is super important, what if I predict the loan status to be Y for those with a credit history and N otherwise? Surprisingly, we’ll be right 82+378=460 times out of 614, which is a whopping 75%!

I won’t blame you if you’re wondering why the hell we need statistical models. But trust me, increasing the accuracy by even 0.001% beyond this mark is a challenging task. Would you take this challenge?

Note: 75% is on the train set. The test set will be slightly different but close. Also, I hope this gives some intuition into why even a 0.05% increase in accuracy can result in a jump of 500 ranks on the Kaggle leaderboard.



# #7 – Merge DataFrames

Merging dataframes becomes essential when we have information coming from different sources to be collated. Consider a hypothetical case where the average property rates (INR per sq meter) are available for different property types. Let’s define a dataframe as:

In [None]:
prop_rates = pd.DataFrame([1000, 5000, 12000], index=['Rural','Semiurban','Urban'],columns=['rates'])
prop_rates

Now we can merge this information with the original dataframe on 'Property_Area':

In [None]:
#>
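
A sketch of the merge on a made-up frame: the left side is joined on its ‘Property_Area’ column and the right side on its index. The toy rows and the name data_merged are assumptions for illustration.

```python
import pandas as pd

prop_rates = pd.DataFrame(
    [1000, 5000, 12000],
    index=["Rural", "Semiurban", "Urban"],
    columns=["rates"],
)

# Toy stand-in for the loan data (made-up values)
toy = pd.DataFrame(
    {
        "Property_Area": ["Urban", "Rural", "Urban", "Semiurban"],
        "Loan_Status": ["Y", "N", "Y", "Y"],
    },
    index=pd.Index(["LP001", "LP002", "LP003", "LP004"], name="Loan_ID"),
)

# Match the Property_Area column on the left with the index on the right
data_merged = toy.merge(
    right=prop_rates,
    how="inner",
    left_on="Property_Area",
    right_index=True,
    sort=False,
)
print(data_merged)
```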


The pivot table validates the successful merge operation. Note that the ‘values’ argument is irrelevant here because we are simply counting the values.

In [None]:
#>
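
The check can be sketched by counting rows per (‘Property_Area’, ‘rates’) pair: if the merge worked, each area appears with exactly one rate. The toy rows and the names data_merged and check are assumptions for illustration.

```python
import pandas as pd

prop_rates = pd.DataFrame(
    [1000, 5000, 12000],
    index=["Rural", "Semiurban", "Urban"],
    columns=["rates"],
)

# Toy stand-in for the loan data (made-up values)
toy = pd.DataFrame(
    {
        "Property_Area": ["Urban", "Rural", "Urban", "Semiurban"],
        "Loan_Status": ["Y", "N", "Y", "Y"],
    }
)

data_merged = toy.merge(
    right=prop_rates, how="inner", left_on="Property_Area", right_index=True
)

# len simply counts rows, so any column would do for 'values'
check = data_merged.pivot_table(
    values="Loan_Status",
    index=["Property_Area", "rates"],
    aggfunc=len,
)
print(check)
```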

# #8 – Sorting DataFrames
Pandas allows easy sorting based on multiple columns. Here we sort in ascending order on ApplicantIncome and CoapplicantIncome:

In [None]:
#>
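
A minimal sketch with sort_values on made-up incomes. Ties in the first column are broken by the second; the name data_sorted is an assumption for illustration.

```python
import pandas as pd

# Toy stand-in for the two income columns (made-up values)
toy = pd.DataFrame(
    {
        "ApplicantIncome": [5000, 3000, 5000, 3000],
        "CoapplicantIncome": [0.0, 1500.0, 2000.0, 500.0],
    },
    index=pd.Index(["LP001", "LP002", "LP003", "LP004"], name="Loan_ID"),
)

# Sort by ApplicantIncome first, then by CoapplicantIncome within ties
data_sorted = toy.sort_values(
    ["ApplicantIncome", "CoapplicantIncome"], ascending=True
)
print(data_sorted)
```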

Note: The Pandas “sort” function is deprecated and has been removed from recent releases. We should use “sort_values” instead.

# #9 – Plotting (Boxplot & Histogram)

Many of you might be unaware that boxplots and histograms can be plotted directly from Pandas, so calling matplotlib separately is not necessary. It’s just a 1-line command. For instance, if we want to compare the distribution of ApplicantIncome by Loan_Status:

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline
#>


Similarly, a histogram of ApplicantIncome by Loan_Status:

In [None]:
#>
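
Both plots can be sketched as one-liners on a made-up frame. The incomes below are illustrative, and the Agg backend is chosen only so the sketch runs without a display; in a notebook, %matplotlib inline plays that role.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, an assumption for running outside a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Toy stand-in for the loan data (made-up values)
toy = pd.DataFrame(
    {
        "ApplicantIncome": [2500, 4000, 5200, 3100, 6100, 4500, 3800, 7000],
        "Loan_Status": ["Y", "Y", "N", "N", "Y", "N", "Y", "N"],
    }
)

# One box per Loan_Status value
toy.boxplot(column="ApplicantIncome", by="Loan_Status")

# One histogram panel per Loan_Status value
toy.hist(column="ApplicantIncome", by="Loan_Status", bins=5)

print(len(plt.get_fignums()), "figures created")
```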


This shows that income is not a big deciding factor on its own, as there is no appreciable difference between the people who received the loan and those who were denied it.