# DATA MANIPULATION USING PYTHON

Article: https://www.analyticsvidhya.com/blog/2016/01/12-pandas-techniques-python-data-manipulation/

Dataset: Loan Prediction Dataset

1. Boolean Indexing in Pandas
2. Apply Function in Pandas
3. Imputing missing values using Pandas
4. Pivot Table in Pandas
5. Multi-Indexing in Pandas Dataframe
6. Pandas Crosstab
7. Merge Pandas DataFrames
8. Sorting Pandas DataFrames
9. Plotting (Boxplot & Histogram) with Pandas
10. Cut function for binning
11. Coding nominal data using Pandas
12. Iterating over rows of a Pandas Dataframe


In [3]:
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings("ignore")

In [4]:
loan = pd.read_csv(r"E:\Jupyter Notebook\Dataset\Loan Dataset/train.csv")
loan.sample(5)

| | Loan_ID | Gender | Married | Dependents | Education | Self_Employed | ApplicantIncome | CoapplicantIncome | LoanAmount | Loan_Amount_Term | Credit_History | Property_Area | Loan_Status |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 611 | LP002983 | Male | Yes | 1 | Graduate | No | 8072 | 240.0 | 253.0 | 360.0 | 1.0 | Urban | Y |
| 18 | LP001038 | Male | Yes | 0 | Not Graduate | No | 4887 | 0.0 | 133.0 | 360.0 | 1.0 | Rural | N |
| 168 | LP001579 | Male | No | 0 | Graduate | No | 2237 | 0.0 | 63.0 | 480.0 | 0.0 | Semiurban | N |
| 187 | LP001643 | Male | Yes | 0 | Graduate | No | 2383 | 2138.0 | 58.0 | 360.0 | NaN | Rural | Y |
| 270 | LP001888 | Female | No | 0 | Graduate | No | 3237 | 0.0 | 30.0 | 360.0 | 1.0 | Urban | Y |


## 1 – Boolean Indexing in Pandas

What do you do if you want to filter the values of one column based on conditions on another set of columns in a Pandas DataFrame? For instance, suppose we want a list of all female applicants who are not graduates and whose loan was approved. Boolean indexing can help here.

In [10]:
loan.loc[(loan.Gender == "Female") & (loan.Education == "Not Graduate") & (loan["Loan_Status"] == "Y"), ["Gender", "Education", "Loan_Status"]]

| | Gender | Education | Loan_Status |
|---|---|---|---|
| 50 | Female | Not Graduate | Y |
| 197 | Female | Not Graduate | Y |
| 205 | Female | Not Graduate | Y |
| 279 | Female | Not Graduate | Y |
| 403 | Female | Not Graduate | Y |
| 407 | Female | Not Graduate | Y |
| 439 | Female | Not Graduate | Y |
| 463 | Female | Not Graduate | Y |
| 468 | Female | Not Graduate | Y |
| 480 | Female | Not Graduate | Y |
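To make the mechanics of the filter explicit, here is a minimal sketch on a small toy DataFrame. The rows and the variable names (`toy`, `is_female`, `not_graduate`, `approved`, `mask`) are made up for illustration and are not taken from the loan dataset. It shows that each condition is just a boolean Series, and that the conditions are combined element-wise with `&` before being passed to `.loc`.

```python
import pandas as pd

# Toy data: hypothetical rows, not taken from the loan dataset
toy = pd.DataFrame({
    "Gender": ["Female", "Male", "Female", "Female"],
    "Education": ["Not Graduate", "Graduate", "Graduate", "Not Graduate"],
    "Loan_Status": ["Y", "Y", "N", "N"],
})

# Each comparison yields a boolean Series aligned with the rows of the frame
is_female = toy["Gender"] == "Female"
not_graduate = toy["Education"] == "Not Graduate"
approved = toy["Loan_Status"] == "Y"

# Combine conditions element-wise with & (and); | (or) and ~ (not) work the same way
mask = is_female & not_graduate & approved

# .loc takes the row mask first and the list of columns to keep second
selected = toy.loc[mask, ["Gender", "Education", "Loan_Status"]]
print(selected)

# Summing a boolean mask counts the matching rows
print("Matching rows:", mask.sum())
```

Naming the individual masks is a style choice. When the comparisons are written inline, as in the cell above, each one needs its own parentheses because `&` binds more tightly than `==`.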


## 2 – Apply Function in Pandas

Apply is one of the most commonly used Pandas functions for manipulating a DataFrame and creating new variables. It passes each row or column of the DataFrame to a function and returns the result.

The function can be either built-in or user-defined. For instance, here it is used to count the missing values in each row and column.

In [13]:
# Count the missing values in a Series (one row or one column):
def num_missing(x):
    return x.isnull().sum()


# Apply per column:
print("Missing values per column:")
print(loan.apply(num_missing, axis=0))    # axis=0 applies the function to each column


# Apply per row:
print("\nMissing values per row:")
print(loan.apply(num_missing, axis=1).head())    # axis=1 applies the function to each row

Missing values per column:
Loan_ID               0
Gender               13
Married               3
Dependents           15
Education             0
Self_Employed        32
ApplicantIncome       0
CoapplicantIncome     0
LoanAmount           22
Loan_Amount_Term     14
Credit_History       50
Property_Area         0
Loan_Status           0
dtype: int64

Missing values per row:
0    1
1    0
2    0
3    0
4    0
dtype: int64
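The same pattern can be tried on a small toy DataFrame, which also makes it possible to show the "creating new variables" use mentioned above. This is only a sketch: the values and the `TotalIncome` column are hypothetical and do not come from the loan dataset. It also compares the `apply` result with the vectorised `isnull().sum()`, which gives the same per-column counts without a custom function.

```python
import numpy as np
import pandas as pd

# Toy data: hypothetical values, not taken from the loan dataset
toy = pd.DataFrame({
    "ApplicantIncome": [5000, 3000, 4000],
    "CoapplicantIncome": [0.0, 1500.0, np.nan],
    "LoanAmount": [np.nan, 120.0, np.nan],
})

def num_missing(x):
    # x is one column (axis=0) or one row (axis=1), passed in as a Series
    return x.isnull().sum()

per_column = toy.apply(num_missing, axis=0)  # one count per column
per_row = toy.apply(num_missing, axis=1)     # one count per row

# Vectorised equivalent of the per-column count
vectorised_per_column = toy.isnull().sum()

# Creating a new variable row by row with a lambda (TotalIncome is a made-up name)
toy["TotalIncome"] = toy.apply(
    lambda row: row["ApplicantIncome"] + row["CoapplicantIncome"], axis=1
)

print(per_column)
print(per_row)
print(toy)
```

For simple arithmetic like the sum above, adding the two columns directly (`toy["ApplicantIncome"] + toy["CoapplicantIncome"]`) is usually faster; row-wise `apply` is most useful when the logic does not vectorise easily.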


## 3 – Imputing missing values using Pandas

In [29]:
from scipy.stats import mode

# Most frequent value (mode) of the Gender column:
print(mode(loan['Gender']))

# Fill the missing values of three categorical columns with their mode:
loan['Gender'].fillna(mode(loan['Gender']).mode[0], inplace=True)

loan['Married'].fillna(mode(loan['Married']).mode[0], inplace=True)

loan['Self_Employed'].fillna(mode(loan['Self_Employed']).mode[0], inplace=True)

# Now check the number of missing values again to confirm:
print(loan.apply(num_missing, axis=0))

ModeResult(mode=array(['Male'], dtype=object), count=array([502]))
Loan_ID               0
Gender                0
Married               0
Dependents           15
Education             0
Self_Employed         0
ApplicantIncome       0
CoapplicantIncome     0
LoanAmount           22
Loan_Amount_Term     14
Credit_History       50
Property_Area         0
Loan_Status           0
dtype: int64
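The cell above relies on `scipy.stats.mode` accepting a column of strings and on in-place `fillna` on a selected column, both of which may behave differently or raise warnings on newer SciPy and pandas releases. As a hedged alternative, the sketch below does the same mode imputation with pandas alone, using `Series.mode()` and plain assignment. The toy rows are hypothetical and not taken from the loan dataset.

```python
import numpy as np
import pandas as pd

# Toy data: hypothetical rows, not taken from the loan dataset
toy = pd.DataFrame({
    "Gender": ["Male", "Female", np.nan, "Male"],
    "Married": ["Yes", np.nan, "Yes", "No"],
    "Self_Employed": ["No", "No", np.nan, "Yes"],
})

for col in ["Gender", "Married", "Self_Employed"]:
    # Series.mode() ignores NaN by default and returns the modes sorted;
    # [0] picks the first one, which also settles ties
    most_frequent = toy[col].mode()[0]
    # Assign the result back instead of using inplace=True on a column selection
    toy[col] = toy[col].fillna(most_frequent)

# Confirm that no missing values remain
print(toy.isnull().sum())
```

Applied to the notebook's data, the same loop would run over `loan` instead of `toy`; the output shown above comes from the original SciPy-based cell, not from this sketch.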
