# Data science process
- Problem definition
- Data acquisition
- Data preparation and EDA (Exploratory Data Analysis)
- Feature Engineering and Feature extraction
- Model planning
- Model building
- Model Evaluation
- Model Deployment


## (1) Problem Definition
- A housing finance company offers home loans.
- The dataset is used to identify the customer segments that are eligible for a loan, so that the company can target these customers specifically.

## (2) Data Acquisition


**The dataset contains the following 13 features**

* Loan_ID: Unique loan ID
* Gender: Male / Female
* Married: Applicant married (Y/N)
* Dependents: Number of dependents
* Education: Applicant education (Graduate / Not Graduate)
* Self_Employed: Self-employed (Y/N)
* ApplicantIncome: Applicant income
* CoapplicantIncome: Co-applicant income
* LoanAmount: Loan amount in thousands
* Loan_Amount_Term: Term of loan in months
* Credit_History: Credit history meets guidelines
* Property_Area: Urban / Semiurban / Rural
* Loan_Status: (Target) Loan approved (Y/N)

## (3) Data Preparation
- Data Exploration
- Data Cleaning
- Missing value treatments
- Outlier handling
- Data encoding techniques
- Feature engineering
- Challenges of high dimensionality
- Dimension reduction using principal component analysis

In [2]:
#Loading Packages
import pandas as pd 
import numpy as np      

import seaborn as sns                 
import matplotlib.pyplot as plt       
%matplotlib inline 
#pip install plotly
import plotly.express as px

import warnings  
warnings.filterwarnings("ignore")

In [3]:
data=pd.read_csv("loan.csv")
data.head(3)

FileNotFoundError: [Errno 2] No such file or directory: 'loan.csv'

## Data Exploration

In [None]:
data.shape

In [None]:
print('number of rows', data.shape[0])
print('number of columns',data.shape[1])

In [None]:
data.columns

In [None]:
data.info()

In [None]:
data.dtypes

In [None]:
print('integer type data', data.select_dtypes(include='int64').shape)

In [None]:
# how many columns are of integer type
print('number of columns of integer type', data.select_dtypes(include='int64').shape[1])
# how many columns are of object type
print('number of columns of object type', data.select_dtypes(include='object').shape[1])
# how many columns are of float type
print('number of columns of float type', data.select_dtypes(include='float64').shape[1])

In [None]:
# Summary of numeric (int and float) data

data.describe()

In [None]:
# Summary of object data

data.describe(include="object")

In [None]:
# Aggregation by grouping
# Frequency

data['Gender'].value_counts()

In [None]:
# Relative Frequency

data['Gender'].value_counts(normalize=True)

In [None]:
# Identify duplicate rows
# data[data.duplicated()] displays the duplicated rows themselves;
# summing the boolean mask counts them
data.duplicated().sum()
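
To see what `duplicated()` flags, here is a minimal sketch on a made-up frame. The `toy` DataFrame and its values are hypothetical and are not taken from the loan data.

```python
import pandas as pd

# Hypothetical toy frame: the third row repeats the first one exactly
toy = pd.DataFrame({"Gender": ["Male", "Female", "Male"],
                    "LoanAmount": [120, 90, 120]})

dup_mask = toy.duplicated()            # True for every repeat of an earlier row
n_duplicates = int(dup_mask.sum())     # number of duplicate rows
deduplicated = toy.drop_duplicates()   # keeps the first occurrence of each row

print(n_duplicates)
print(deduplicated)
```

Summing the mask gives a count, whereas summing the selected rows would add up their column values instead.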

## Data Visualization

## Univariate Analysis

## Numerical Attribute Analysis: Histogram

In [None]:
data['ApplicantIncome'].hist(bins=50,color='red')
plt.show()

In [None]:
data.hist(figsize=(8,8))
plt.show()

In [None]:
sns.boxplot(data=data, x='ApplicantIncome',color="red")

In [None]:
px.box(data,x='ApplicantIncome')

In [None]:
# identify the unique values
data['Gender'].unique()

In [None]:
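# Loan_ID is only an identifier, so drop it before listing the categories
data.drop(['Loan_ID'],inplace=True,axis=1)
# unique values of every object-type column
{column:list(data[column].unique()) for column in data.select_dtypes('object').columns}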

## Bivariate Analysis using visualization

## Stacked Histogram

In [None]:
Gdr=pd.crosstab(data['Gender'],data['Loan_Status'])
Gdr

In [None]:
Gdr.div(Gdr.sum(1).astype(float),axis=0).plot(kind="bar",stacked=True)

In [None]:
Edu=pd.crosstab(data['Education'],data['Loan_Status'])
Edu

In [None]:
140+340

In [None]:
140/480

In [None]:
340/480

In [None]:
Edu.div(Edu.sum(1).astype(float),axis=0)
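
The row-wise relative frequencies computed in the cell above can also be requested directly from `pd.crosstab` through its `normalize='index'` option. A minimal sketch on made-up data follows; the `toy` frame is illustrative only.

```python
import pandas as pd

# Hypothetical toy data with the same two columns as the loan dataset
toy = pd.DataFrame({
    "Education": ["Graduate", "Graduate", "Graduate", "Not Graduate", "Not Graduate"],
    "Loan_Status": ["Y", "Y", "N", "Y", "N"],
})

counts = pd.crosstab(toy["Education"], toy["Loan_Status"])

# Manual version: divide every row by its row total
manual = counts.div(counts.sum(axis=1), axis=0)

# Same result in a single call
direct = pd.crosstab(toy["Education"], toy["Loan_Status"], normalize="index")

print(direct)
```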

In [None]:

Edu.div(Edu.sum(1).astype(float),axis=0).plot(kind="bar",stacked=True)
plt.xlabel("Education")
plt.ylabel("Relative frequency")

In [None]:
data.groupby('Gender')['Loan_Status'].count()

In [None]:
data.groupby('Gender')['Loan_Status'].count().plot(kind='bar')

## Data Cleaning

## Missing Value 


In [None]:
# Identify the missing values under each column
data.isnull().sum()

In [None]:
# Total missing values
data.isnull().sum().sum()

In [None]:
# % of missing cells among all cells (rows x columns) in the dataset
(data.isnull().sum().sum()/data.size)*100

In [None]:
def missing_data(data):
    total = data.isnull().sum().sort_values(ascending = False)
    percent = (data.isnull().sum()/data.isnull().count()*100).sort_values(ascending = False)
    return pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
missing_data(data)

## Impute the missing data

In [None]:
# Handling missing data by percentage missing (rules of thumb; adjust using domain knowledge)

# <5% - remove the rows which are having missing values (Deletion)

# 5% to 10% - impute using mean, median (numeric data) or mode(non-numeric data)
# 5% to 10% - forward fill/backward fill (is used for time-series data)

# 10% to 20% - regression, KNN imputation, interpolation

# >20% - drop that column / attribute based on the relevance 
# >20% - factorization, random techniques
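
The rules of thumb above can be written as a small lookup. This is only a sketch: the function name `suggest_missing_strategy` is hypothetical, the thresholds are the ones listed in the comments, and the handling of the exact boundary values (5, 10, 20) is an assumption.

```python
def suggest_missing_strategy(percent_missing: float) -> str:
    """Map a column's percentage of missing values to the rule of thumb above."""
    if percent_missing < 5:
        return "delete rows"
    if percent_missing <= 10:
        return "mean/median/mode or forward/backward fill"
    if percent_missing <= 20:
        return "regression, KNN imputation or interpolation"
    return "consider dropping the column"


for pct in (2.0, 8.0, 15.0, 35.0):
    print(pct, "->", suggest_missing_strategy(pct))
```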

In [None]:
# Impute with mean provided outliers are not there
data.boxplot(figsize=(8,8))

In [None]:
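# use median to impute the missing values for numeric attributes
# (assigning the result back avoids the chained-assignment warning that
#  fillna(..., inplace=True) on a single column raises in recent pandas)
data['LoanAmount']=data['LoanAmount'].fillna(data['LoanAmount'].median())
data['Credit_History']=data['Credit_History'].fillna(data['Credit_History'].median())
data['Loan_Amount_Term']=data['Loan_Amount_Term'].fillna(data['Loan_Amount_Term'].median())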

In [None]:
data.isnull().sum()

In [None]:
data.info()

In [None]:
# use mode to impute the missing values for object type attributes
# (mode() returns a Series, so [0] picks the most frequent value)
data['Gender']=data['Gender'].fillna(data['Gender'].mode()[0])
data['Married']=data['Married'].fillna(data['Married'].mode()[0])
data['Dependents']=data['Dependents'].fillna(data['Dependents'].mode()[0])
data['Self_Employed']=data['Self_Employed'].fillna(data['Self_Employed'].mode()[0])

In [None]:
data.isnull().sum()

In [None]:
d=pd.DataFrame([10,20,30,30,10])
d.mode()[0]

In [None]:
d=pd.DataFrame([10,20,30,30,10,np.nan,np.nan])
d

In [None]:
# column 0 has two modes (10 and 30); mode() sorts them, so [0] picks 10
d[0]=d[0].fillna(d[0].mode()[0])
d

In [None]:
from sklearn.impute import SimpleImputer

# Create the array
scores = np.array([15, np.nan, 18, 7, 13, 16, 11, 21, 5, 15, 10, 9, np.nan])


# Create a SimpleImputer object with the median strategy
imputer = SimpleImputer(strategy='median')

# Fit and transform the imputer on the scores array
# (SimpleImputer expects 2-D input, hence the reshape)
imputed_scores = imputer.fit_transform(scores.reshape(-1,1))


# Print the imputed scores in one dimension
print(imputed_scores.flatten())

## Outlier detection ways
- box plot or histogram
- Z-score
- IQR (Inter Quartile Range)

In [None]:
scores=np.array([15, 101, 18, 7, 13, 16, 11, 21, 5, 15, 10, 9,-201])

In [None]:
scores

In [None]:
# Box plot
sns.boxplot(scores)

In [None]:
# histogram
sns.histplot(scores)

In [None]:
# Z-score
# Z=(x-mean)/std_dev

scores=np.array([15, 101, 18, 7, 13, 16, 11, 21, 5, 15, 10, 9,-201])
scores=np.sort(scores)
mean=np.mean(scores)
std_dev=np.std(scores)
z_scores=(scores-mean)/std_dev
print(z_scores)

outliers=scores[np.abs(z_scores)>=1.5]
print("outliers",outliers)


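- A common rule of thumb: if the z-score of a value is less than -3 or greater than +3, the value is considered an outlier. The example above uses a looser cutoff of 1.5 because, in a sample this small, the two extreme values inflate the standard deviation.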
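
The effect of the cutoff can be checked on the same `scores` array. The sketch below wraps the calculation in a helper; the name `zscore_outliers` is hypothetical.

```python
import numpy as np


def zscore_outliers(values, threshold=3.0):
    """Return the values whose absolute z-score is at least `threshold`."""
    values = np.asarray(values, dtype=float)
    # Population standard deviation, the default of np.std
    z = (values - values.mean()) / values.std()
    return values[np.abs(z) >= threshold]


scores = np.array([15, 101, 18, 7, 13, 16, 11, 21, 5, 15, 10, 9, -201])

strict = zscore_outliers(scores, threshold=3.0)   # only -201 is flagged
loose = zscore_outliers(scores, threshold=1.5)    # 101 is flagged as well

print("threshold 3.0:", strict)
print("threshold 1.5:", loose)
```

With the cutoff of 3 only -201 is reported, because its own size pulls the standard deviation up to roughly 63.5 and leaves 101 with a z-score of about 1.54.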

In [None]:
sns.boxplot(scores)

In [None]:
# IQR
# IQR=Q3-Q1
Q1=np.quantile(scores,0.25)
Q3=np.quantile(scores,0.75)
IQR=Q3-Q1

UB=Q3+1.5*IQR
LB=Q1-1.5*IQR

print(Q1,Q3,LB,UB)

outliers=scores[(scores<=LB) | (scores>=UB)]
print(outliers)
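
The same fences can be computed by a reusable helper whose multiplier is a parameter, which makes it easy to compare the usual 1.5 x IQR rule with wider fences. A minimal sketch; `iqr_bounds` and its parameter `k` are hypothetical names.

```python
import numpy as np


def iqr_bounds(values, k=1.5):
    """Return (lower, upper) fences Q1 - k*IQR and Q3 + k*IQR."""
    q1, q3 = np.quantile(values, [0.25, 0.75])
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


scores = np.array([15, 101, 18, 7, 13, 16, 11, 21, 5, 15, 10, 9, -201])

lb, ub = iqr_bounds(scores)                    # standard 1.5 * IQR fences
wide_lb, wide_ub = iqr_bounds(scores, k=3.0)   # wider fences for extreme values only

outliers = scores[(scores <= lb) | (scores >= ub)]
print(lb, ub, outliers)
print(wide_lb, wide_ub)
```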

In [None]:
# Handling Outliers
# Remove the outliers considering the data imbalance
# Extending the IQR 
# Mean/Median used for imputation
# use percentiles (10th , 90th) for imputing
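
One reading of the percentile option above is capping (sometimes called winsorizing): values below the 10th percentile are raised to it and values above the 90th percentile are lowered to it. The sketch below assumes that interpretation and reuses the `scores` example.

```python
import numpy as np

scores = np.array([15, 101, 18, 7, 13, 16, 11, 21, 5, 15, 10, 9, -201], dtype=float)

# 10th and 90th percentiles serve as the lower and upper caps
p10, p90 = np.percentile(scores, [10, 90])

# Values outside [p10, p90] are replaced by the nearest cap; the rest stay unchanged
capped = np.clip(scores, p10, p90)

print(p10, p90)
print(capped)
```

Unlike deletion, capping keeps the number of observations unchanged.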

In [None]:
sns.boxplot(data=data['LoanAmount'])

In [None]:
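# Remove the outliers
# IQR=Q3-Q1; the fences below use 3*IQR (the "extended IQR" option) instead of 1.5*IQR,
# so only extreme values are flagged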
Q1=np.quantile(data['LoanAmount'],0.25)
Q3=np.quantile(data['LoanAmount'],0.75)
IQR=Q3-Q1

UB=Q3+3*IQR
LB=Q1-3*IQR

#print(Q1,Q3,LB,UB)

outliers=data[(data['LoanAmount']<=LB) | (data['LoanAmount']>=UB)]
print(len(outliers))

In [None]:
not_outliers=data[(data['LoanAmount']>LB) & (data['LoanAmount']<UB)]
print(len(not_outliers))

In [None]:
sns.boxplot(data=not_outliers['LoanAmount'])

## How to handle outliers

In [None]:
# Remove the outliers: if they are due to data entry errors or are not part of the data distribution
# Transformations: use a log transform when the data is skewed
# Binning / discretization: group the data into bins
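
A minimal sketch of the last two options on made-up values. The `amounts` series and the bin edges are assumptions chosen for illustration, not values from the loan data.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed amounts with one very large value
amounts = pd.Series([50, 80, 100, 120, 150, 200, 700], name="LoanAmount")

# Log transform: compresses large values, which reduces right skew
log_amounts = np.log(amounts)

# Binning: replace each value with the label of the interval it falls into
# (intervals are right-inclusive: (0, 100], (100, 200], (200, inf))
bins = pd.cut(amounts, bins=[0, 100, 200, np.inf], labels=["low", "medium", "high"])

print(amounts.skew(), log_amounts.skew())
print(bins.tolist())
```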

In [None]:
# Box plot 
px.box(data,x='LoanAmount')

In [None]:
fig=px.histogram(data,x='LoanAmount')
fig.show()

In [None]:
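# distplot is deprecated in recent seaborn releases; histplot with kde=True is the replacement
sns.histplot(data['LoanAmount'],kde=True)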

In [None]:
# Handle outliers with log transform
# work on a copy so that the original data is not modified by the transform
data1=data.copy()
data1['LoanAmount'].head(3)

In [None]:
data1['LoanAmount']=np.log(data1['LoanAmount'])
data1['LoanAmount'].head(3)

In [None]:
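# distplot is deprecated in recent seaborn releases; histplot with kde=True is the replacement
sns.histplot(data1['LoanAmount'],kde=True)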

In [None]:
# Remove the outliers
# IQR=Q3-Q1
Q1=np.quantile(data1['LoanAmount'],0.25)
Q3=np.quantile(data1['LoanAmount'],0.75)
IQR=Q3-Q1

UB=Q3+3*IQR
LB=Q1-3*IQR

#print(Q1,Q3,LB,UB)

outliers=data1[(data1['LoanAmount']<=LB) | (data1['LoanAmount']>=UB)]
print(len(outliers))

## Discrepancies / Inconsistencies in data

In [None]:
# Replace 3+ with 4
data['Dependents'].value_counts()

In [None]:
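# keep the replacement a string ('4') so the column stays a single type and
# still matches the string keys of Dependents_map used for encoding later
data['Dependents']=data['Dependents'].replace('3+','4')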

In [None]:
data.head(3)

In [None]:
data['Dependents'].value_counts()

# Feature Engineering

- Create a new feature / attribute
- Transform the existing feature 
- A feature can be numeric or categorical

## Variable Transformation
- Categorical Variable Transformations: Encoding Techniques
- Numeric Variable Transformations: Standardization and Normalization
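
The two numeric transformations named above can be sketched with NumPy. The `income` values are made up; standardization is taken here to mean the z-score and normalization to mean min-max scaling to [0, 1], which is one common convention and an assumption about what the list intends.

```python
import numpy as np

# Hypothetical income values
income = np.array([2000.0, 3000.0, 4000.0, 6000.0, 10000.0])

# Standardization (z-score): result has mean 0 and standard deviation 1
standardized = (income - income.mean()) / income.std()

# Normalization (min-max): result lies in the range [0, 1]
normalized = (income - income.min()) / (income.max() - income.min())

print(standardized)
print(normalized)
```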

## Categorical Variable Transformations: Encoding Techniques
- one-hot encoding
- count encoding
- label encoding
- ordinal encoding

## One-hot encoding
- Each category is represented as a binary vector with 0s and a single 1.
- One-hot encoding is preferred in situations where there is no ordinal relationship between categories, and each category is treated as independent.

In [None]:
#pip install -U scikit-learn

In [None]:
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
df=pd.DataFrame({'colors':['red','Green','blue','Green','red']})
encd=OneHotEncoder()
encoded_data=encd.fit_transform(df[['colors']])
encoded_data=pd.DataFrame(encoded_data.toarray(),columns=encd.get_feature_names_out())
print(encoded_data)

In [None]:
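# Expected output (OneHotEncoder sorts the categories, uppercase before lowercase;
# the values are printed as floats):
# colors_Green  colors_blue  colors_red
# 0             0            1
# 1             0            0
# 0             1            0
# 1             0            0
# 0             0            1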

In [None]:
#pip install category_encoders


## Count Encoding
- Count encoding converts a categorical variable into numerical format by replacing each category with its frequency count.

In [None]:
# Count Encoding
import pandas as pd
import category_encoders as ce
df=pd.DataFrame({'colors':['red','Green','blue','Green','red']})
encd = ce.CountEncoder()
encoded_data=encd.fit_transform(df['colors'])
encoded_data=pd.DataFrame(encoded_data,columns=encd.get_feature_names_out())
print(encoded_data)

In [None]:
# color    count
# red      2
# Green    2
# blue     1
# Green    2
# red      2
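
The same counts can be produced with pandas alone, which is convenient when `category_encoders` is not installed. A minimal sketch; the column name `colors_count` is a hypothetical choice.

```python
import pandas as pd

df = pd.DataFrame({'colors': ['red', 'Green', 'blue', 'Green', 'red']})

# Frequency of each category
counts = df['colors'].value_counts()

# Replace every category by its frequency
df['colors_count'] = df['colors'].map(counts)

print(df)
```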


## Label Encoding
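- Label encoding assigns a unique integer label to each category. scikit-learn's LabelEncoder numbers the categories in sorted order, so the integers do not reflect any natural ordering of the categories.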

In [None]:
from sklearn.preprocessing import LabelEncoder
df=pd.DataFrame({'Size': ["Small","Medium","Large","ExtraLarge"]})
df

In [None]:
# initialize the encoder
encd=LabelEncoder()

In [None]:
# Fit and transform the data
encd_data=encd.fit_transform(df['Size'])

In [None]:
df['Encd_data']=encd_data
print(df)
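
To see which integer each size receives, the fitted encoder's `classes_` attribute can be inspected. A short sketch on the same four labels; the `mapping` dictionary is built here only for display.

```python
from sklearn.preprocessing import LabelEncoder

sizes = ["Small", "Medium", "Large", "ExtraLarge"]

encd = LabelEncoder()
codes = encd.fit_transform(sizes)

# classes_ holds the labels in sorted order; the position is the assigned code
mapping = {str(label): code for code, label in enumerate(encd.classes_)}

print(mapping)
print(codes)
```

Because the codes follow alphabetical order, "Small" receives the largest code here, which is why ordered categories are better handled with an explicit ordering.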

## Ordinal Encoding
- Ordinal encoding converts categorical values into numbers while preserving their order.

In [None]:
df1=pd.DataFrame({'FeedBack': ["Average","Good","Average","Poor","Excellent","Good","Verypoor","Good","Average","Poor","Excellent","Verypoor"]})
df1

# The scale is: Verypoor:0, Poor:1, Average:2, Good:3, Excellent:4

In [None]:
# Define the order
order=["Verypoor","Poor","Average","Good","Excellent"]

In [None]:
# convert the categorical with specified order
df1['FeedBack_codes']=pd.Categorical(df1["FeedBack"], categories=order, ordered=True).codes
print(df1)

## Apply Encoding techniques on loan dataset 
- to convert categorical to numerical

In [None]:
data.head(5)

In [None]:
data.dtypes

In [None]:
# Binary Encoding Technique using map() function
# work on a copy so that the original data keeps its text labels
data1=data.copy()
Gender_map={'Male':0,'Female':1}                
data1['Gender'] = data1['Gender'].map(Gender_map)    
data1['Gender'].head(5)

In [None]:
data1["Property_Area"].value_counts()

In [None]:
Married_map={"Yes":1,"No":0}               
Dependents_map={"0":0,"1":1,"2":2,"3":3,"4":4}            
Education_map={"Graduate":1,"Not Graduate":0}             
Self_Employed_map={"Yes":1,"No":0}         
Property_Area_map={"Rural":0,"Urban":1,"Semiurban":2}         
Loan_Status_map={'Y':1,'N':0}   

In [None]:
data1['Married'] = data1['Married'].map(Married_map) 
data1['Dependents'] = data1['Dependents'].map(Dependents_map) 
data1['Education'] = data1['Education'].map(Education_map) 
data1['Self_Employed'] = data1['Self_Employed'].map(Self_Employed_map) 
data1['Property_Area'] = data1['Property_Area'].map(Property_Area_map) 
data1['Loan_Status'] = data1['Loan_Status'].map(Loan_Status_map) 

In [None]:
data1.head(5)
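
Mapping `Property_Area` to 0, 1, 2 gives the areas an order they do not naturally have. Following the one-hot encoding section above, an alternative is `pd.get_dummies`. The sketch below uses a made-up `toy` frame rather than the loan data.

```python
import pandas as pd

# Hypothetical sample of the Property_Area column
toy = pd.DataFrame({"Property_Area": ["Urban", "Rural", "Semiurban", "Urban"]})

# One 0/1 column per category; columns follow the sorted category names
dummies = pd.get_dummies(toy["Property_Area"], prefix="Property_Area", dtype=int)

print(dummies)
```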