# Domain 1: HealthCare

## CONTEXT
Medical research university X is conducting in-depth research on patients with certain conditions.
The university has an internal AI team. For confidentiality, the client has masked the patients' details and
conditions and supplies separate datasets to the AI team, which is to develop an AI/ML model that predicts the
condition of a patient from the received test results.

## Data Description
The data consists of biomechanical features of the patients according to their current
conditions. Each patient is represented in the data set by six biomechanical attributes derived from the shape and
orientation of the condition to their body part, plus a class label (the target):
1. P_incidence
2. P_tilt
3. L_angle
4. S_slope
5. P_radius
6. S_Degree
7. Class (target)

# 1. Import packages and warehouse data

In [28]:
# Import packages
import numpy as np
import pandas as pd
# from PIL import ImageFont, ImageDraw, Image
import seaborn as sns # For Data Visualization
import matplotlib.pyplot as plt # Necessary module for plotting purpose
# plt.rcParams["patch.force_edgecolor"] = True
%matplotlib inline

# from plotly.offline import init_notebook_mode, iplot
# init_notebook_mode(connected=True)
# import plotly.graph_objs as go

import warnings
warnings.filterwarnings("ignore");

In [29]:
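# Mount Google Drive when running on Google Colab; skip the mount elsewhere
try:
    from google.colab import drive
    drive.mount('/content/drive')
except ModuleNotFoundError:
    print("google.colab is not available; skipping the Google Drive mount.")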

In [None]:
#importing data files
data1 = pd.read_csv(r"Part1 - Normal.csv")
data2 = pd.read_csv(r"Part1 - Type_H.csv")
data3 = pd.read_csv(r"Part1 - Type_S.csv")

In [None]:
data1.head(5)

In [None]:
data1.shape

In [None]:
data1['Class'].value_counts()

In [None]:
data2.head()

In [None]:
data2['Class'].value_counts()

In [None]:
data2.shape

In [None]:
data3.head()

In [None]:
data3['Class'].value_counts()

In [None]:
print("The size and shape of the first dataset is",data1.size,'and',data1.shape,'respectively.')
print("The size and shape of the second dataset is",data2.size,'and',data2.shape,'respectively.')
print("The size and shape of the third dataset is",data3.size,'and',data3.shape,'respectively.')

In [None]:
# merging all three data files in one dataframe
data_main=pd.concat([data1, data2, data3],ignore_index=True, axis=0)
data_main.head()

In [None]:
print("The final size and shape of the merged dataset is",data_main.size,'and',data_main.shape,'respectively.')

In [None]:
data_main.head()

# 2. Data Cleansing

In [None]:
data_main.info()

In [None]:
data_main.dtypes

**Observation: P_incidence, P_tilt, L_angle, S_slope, P_radius, S_Degree are decimal numbers and hence float64 is a correct datatype for them. Class is a categorical variable with string values and hence object datatype.**

In [None]:
data_main.isna().sum().sum()  #getting the total null values in the dataset

**The dataset is clean and has no null values.**


In [None]:
data_main['Class'].value_counts()

**There are six distinct values in the "Class" attribute, but "tp_s" is the same as "Type_S", "Nrmal" is the same as "Normal" and "type_h" is the same as "Type_H". These labels have to be made consistent.**

In [None]:
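# Assign the result back; inplace=True on a column selection is unreliable in recent pandas versions
data_main['Class'] = data_main['Class'].replace(['tp_s', 'Nrmal','type_h'], ['Type_S','Normal','Type_H'])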

In [None]:
data_main['Class'].value_counts()

**As can be seen above, the Class labels are now consistent.**
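
The label clean-up above can also be written with an explicit mapping plus a check that nothing unexpected remains. The following is a minimal sketch on hypothetical toy labels (the `labels` series and the `expected` set are assumptions for illustration, not the project data):

```python
import pandas as pd

# Hypothetical toy labels mimicking the inconsistent spellings seen above
labels = pd.Series(['Normal', 'Nrmal', 'Type_H', 'type_h', 'Type_S', 'tp_s'])

# Explicit mapping from each misspelt label to its canonical form
canonical = {'tp_s': 'Type_S', 'Nrmal': 'Normal', 'type_h': 'Type_H'}

# replace() leaves values that are not keys of the mapping untouched
cleaned = labels.replace(canonical)

# Guard against labels that the mapping did not anticipate
expected = {'Normal', 'Type_H', 'Type_S'}
unexpected = set(cleaned.unique()) - expected

print(cleaned.value_counts().to_dict())
print("Unexpected labels:", unexpected)
```

An empty `unexpected` set is a quick confirmation that every row now carries one of the three intended classes.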

# 3. Data analysis & visualisation

In [None]:
data_main.describe().round(2)  #doing statistical analysis on the dataset

**Observations:**

**1. The mean and median for P_incidence, P_tilt, L_angle, S_slope and P_radius are almost equal.**

**2. P_tilt and S_Degree have negative values.**


## 3.1. Univariate analysis

In [None]:
plt.figure(figsize=(20,6))
plt.margins(y=0.3)
plt.subplot(1,3,1)
sns.distplot(data_main['P_incidence'],color='red')

plt.subplot(1,3,2)
sns.distplot(data_main['P_tilt'],color='blue').set(ylabel=None)

plt.subplot(1,3,3)
sns.distplot(data_main['L_angle'],color = 'green').set(ylabel=None)

plt.suptitle("Distribution plots for numerical attributes P_incidence, P_tilt and L_angle")

plt.figure(figsize=(20,7))
plt.subplot(1,3,1)
sns.boxplot(y=data_main['P_incidence'],color='red',width = 0.5)

plt.subplot(1,3,2)
sns.boxplot(y=data_main['P_tilt'],color='blue', width = 0.5)

plt.subplot(1,3,3)
sns.boxplot(y=data_main['L_angle'],color = 'green', width = 0.5)

plt.suptitle("Box plots for numerical attributes P_incidence, P_tilt and L_angle")


**Observations:**

**1. The attribute P_incidence has 3 outliers and most of the values lie below 70.**

**2. The attribute P_tilt has several outliers with one outlier being below the lower whisker. Most of the values lie between 10 and 25. The distribution looks normal.**

**3. The attribute L_angle has 1 outlier and most of the values lie between 40 and 60. The distribution looks normal.**

In [None]:
plt.figure(figsize=(20,6))
plt.margins(y=0.3)
plt.subplot(1,3,1)
sns.distplot(data_main['S_slope'],color='red')

plt.subplot(1,3,2)
sns.distplot(data_main['P_radius'],color='blue').set(ylabel=None)

plt.subplot(1,3,3)
sns.distplot(data_main['S_Degree'],color = 'green').set(ylabel=None)

plt.suptitle("Distribution plots for numerical attributes S_slope, P_radius and S_Degree")

plt.figure(figsize=(20,7))
plt.subplot(1,3,1)
sns.boxplot(y=data_main['S_slope'],color='red',width = 0.5)

plt.subplot(1,3,2)
sns.boxplot(y=data_main['P_radius'],color='blue', width = 0.5)

plt.subplot(1,3,3)
sns.boxplot(y=data_main['S_Degree'],color = 'green', width = 0.5)

plt.suptitle("Box plots for numerical attributes S_slope, P_radius and S_Degree")


**Observations:**

**1. The attribute S_slope is right-skewed with one outlier; most of the data lie below 60.**

**2. The attribute P_radius looks like it follows normal distribution with outliers both beyond upper whisker and below lower whisker.**

**3. The attribute S_Degree is right skewed and has various outliers.**

**Now let us figure out the number of outliers in each column.**

In [None]:
col_names = ['P_incidence','P_tilt','L_angle','S_slope','P_radius','S_Degree']
data_no_outliers = data_main.copy(deep=True)  # copy of the original dataframe from which the outlier rows are removed
for col in col_names:
    q25, q75 = np.percentile(data_main[col], 25), np.percentile(data_main[col], 75)
    iqr = q75 - q25
    threshold = 1.5 * iqr
    lower, upper = q25 - threshold, q75 + threshold
    # flag the values outside the IQR fences, computed on the full dataset
    is_outlier = (data_main[col] < lower) | (data_main[col] > upper)
    # drop the flagged rows from the copy; rows already dropped for another column are ignored
    data_no_outliers.drop(index=data_main.index[is_outlier], errors='ignore', inplace=True)
    print('Total Number of outliers in the attribute', col, 'is', is_outlier.sum())

In [None]:
print('The shape of the dataset without any outlier is',data_no_outliers.shape)
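
The 1.5 × IQR rule used above can be packaged as a small reusable helper. This is a sketch under stated assumptions: `iqr_outlier_mask` and the toy frame are hypothetical names introduced here for illustration, and the fences are computed per column exactly as in the loop above.

```python
import pandas as pd

def iqr_outlier_mask(series, k=1.5):
    """Return a boolean mask that is True where a value lies outside the IQR fences."""
    q25, q75 = series.quantile(0.25), series.quantile(0.75)
    iqr = q75 - q25
    lower, upper = q25 - k * iqr, q75 + k * iqr
    return (series < lower) | (series > upper)

# Hypothetical toy data: each column holds one obvious extreme value
toy = pd.DataFrame({
    'a': [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 100.0],
    'b': [10.0, 11.0, 12.0, 13.0, 14.0, 15.0, -50.0, 16.0],
})

masks = toy.apply(iqr_outlier_mask)    # one boolean column per attribute
outlier_counts = masks.sum()           # number of outliers per column
toy_clean = toy[~masks.any(axis=1)]    # keep rows with no outlier in any column

print(outlier_counts.to_dict())
print(toy_clean.shape)
```

Working with boolean masks keeps the counting step and the row-removal step separate, which makes it easy to report the outliers without committing to dropping them.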

In [None]:
plt.figure(figsize=(10,5))
plt.subplots_adjust(wspace=0.5)
plt.subplot(1,2,1)
sns.countplot(x=data_main['Class'],order = data_main['Class'].value_counts().index)

plt.subplot(1,2,2)
data_main['Class'].value_counts().plot.pie(autopct='%1.1f%%')

plt.suptitle('Distribution of Class attribute with outliers')

plt.figure(figsize=(10,5))
plt.subplots_adjust(wspace=0.5)
plt.subplot(1,2,1)
sns.countplot(x=data_no_outliers['Class'],order = data_no_outliers['Class'].value_counts().index)

plt.subplot(1,2,2)
data_no_outliers['Class'].value_counts().plot.pie(autopct='%1.1f%%')

plt.suptitle('Distribution of Class attribute without outliers')


**Observations:**

**1. With outliers: Type_S is the most frequent class (48.4%), followed by Normal (32.3%) and then Type_H (19.4%).**

**2. Without outliers: Type_S is still the most frequent class (43.7%), followed by Normal (35.5%) and then Type_H (20.8%).**

**3. The Class attribute is imbalanced, both with and without the outliers.**

## 3.2. Bivariate analysis

In [None]:
# correlation map
# doing analysis including the outliers
f, ax = plt.subplots(figsize=(10,10))
sns.heatmap(data_main[['P_incidence','P_tilt','L_angle','S_slope','P_radius','S_Degree']].corr(), annot=True, linewidths=0.5, cmap="RdPu", fmt=".2f", ax = ax)
plt.title("Correlation Map",fontsize=20)
plt.show()

**Observations:**

**1. P_incidence has good positive correlation with P_tilt(0.63), L_angle(0.72), S_slope(0.81) and S_Degree(0.64) but a low negative correlation with P_radius(-0.25).**

**2. P_tilt has good positive correlation with P_incidence(0.63), low positive correlation with L_angle(0.43) and S_Degree(0.40) but almost no correlation with S_slope(0.06) and P_radius(0.03).**

**3. L_angle has good positive correlation with P_incidence(0.72), S_slope(0.60) and S_Degree(0.53), low positive correlation with P_tilt(0.43) but almost no correlation with P_radius(-0.08).**

**4. S_slope has good positive correlation with P_incidence(0.81), L_angle(0.60) and S_Degree(0.52), low negative correlation with P_radius(-0.34) and almost no correlation with P_tilt(0.06).**

**5. P_radius has low negative correlation with P_incidence(-0.25) and S_slope(-0.34) and almost no correlation with P_tilt(0.03), L_angle(-0.08) and S_Degree(-0.03).**

**6. S_Degree has good positive correlation with P_incidence(0.64), L_angle(0.53) and S_slope(0.52), low positive correlation with P_tilt(0.40) and almost no correlation with P_radius(-0.03).**
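
Reading every pair off a heatmap by eye is error-prone, so a ranked list of pairwise correlations can help. The sketch below uses hypothetical synthetic columns (`x`, `y`, `z`), not the patient data; it keeps the upper triangle of the correlation matrix so that each pair appears once, then sorts by absolute value.

```python
import numpy as np
import pandas as pd

# Hypothetical synthetic data: y is strongly tied to x, z is independent noise
rng = np.random.default_rng(0)
x = rng.normal(size=200)
toy = pd.DataFrame({
    'x': x,
    'y': 2 * x + rng.normal(scale=0.1, size=200),
    'z': rng.normal(size=200),
})

corr = toy.corr()

# Keep only the strict upper triangle: drops the diagonal and the mirrored duplicates
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# One row per attribute pair, strongest absolute correlation first
pairs = upper.stack().dropna().sort_values(key=abs, ascending=False)
print(pairs.round(2))
```

Applied to the six biomechanical columns, the same three lines would list the fifteen attribute pairs in order of strength.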

In [None]:
# doing pairplot including the outliers
sns.pairplot(data=data_main,hue="Class",palette="Set1",diag_kind='hist')
plt.suptitle("Pair Plot of Data",fontsize=20)
plt.show()   # pair plot of all attributes, coloured by Class

**Observations:**

**1. Patients with high values of P_incidence, P_tilt, L_angle, S_slope and S_Degree tend to fall in the Type_S class.**

**2. The Normal and Type_H classes overlap for most of the data.**

# Domain 2: Banking and Finance

# 1. Importing data

In [None]:
data1 = pd.read_csv('/content/drive/My Drive/Corizo/Class 4/Part2 - Data1.csv')
data2 = pd.read_csv('/content/drive/My Drive/Corizo/Class 4/Part2 -Data2.csv')

In [None]:
print("The size and shape of the first dataset is",data1.size,'and',data1.shape,'respectively.')
print("The size and shape of the second dataset is",data2.size,'and',data2.shape,'respectively.')

In [None]:
data1.head()

In [None]:
data2.head()

Since 'ID' is the common attribute between the two datasets, I merged the two datasets on 'ID'.

In [None]:
data_concat = data1.merge(data2, on='ID')
#data_concat.drop(['ID'],axis=1,inplace=True)
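
Before relying on a merge like the one above, it can be worth checking that the key is unique on both sides and that no IDs are silently lost. This is a minimal sketch on hypothetical toy tables (`left`, `right` and their values are assumptions); `validate` and `indicator` are standard `DataFrame.merge` parameters.

```python
import pandas as pd

# Hypothetical toy tables sharing an 'ID' key
left = pd.DataFrame({'ID': [1, 2, 3], 'Age': [25, 40, 58]})
right = pd.DataFrame({'ID': [1, 2, 4], 'LoanOnCard': [0.0, 1.0, 0.0]})

# validate='one_to_one' raises a MergeError if an ID is duplicated on either side
inner = left.merge(right, on='ID', how='inner', validate='one_to_one')

# An outer merge with indicator=True adds a '_merge' column showing where each row came from
audit = left.merge(right, on='ID', how='outer', indicator=True)
unmatched = audit[audit['_merge'] != 'both']

print(inner)
print(unmatched[['ID', '_merge']])
```

If `unmatched` is empty for the real files, the default inner merge keeps every customer.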

In [None]:
print("The size and shape of the final dataset is",data_concat.size,'and',data_concat.shape,'respectively.')

In [None]:
data_concat.head()

By closely observing the data and description given about each column attribute, it can be said that:

1. <b>Numeric data columns (Interval or Ratio) are</b>: Age, CustomerSince, HighestSpend, MonthlyAverageSpend, Mortgage

2. <b>Ordinal Categorical columns are</b>: HiddenScore, Level

3. <b>Nominal Categorical columns are</b>: ID, ZipCode, Security, FixedDepositAccount, InternetBanking, CreditCard, LoanOnCard. ZipCode is stored as an integer, but it is a label rather than a quantity.

# 2. Data cleansing

In [None]:
data_concat.dtypes

Two attributes, <b>MonthlyAverageSpend</b> and <b>LoanOnCard</b>, are of data type <b>float64</b>. The rest of the attributes are of type <b>int64</b>. The datatypes look correct, as all the data entries are numeric and are therefore either int64 or float64.

In [None]:
data_concat.head()

<b>Note that the ID column is of no use in building the model. Hence, it can be dropped.</b>

In [None]:
data_concat.drop('ID',axis=1,inplace=True)

In [None]:
data_concat.head()

In [None]:
data_concat.isnull().sum()

In [None]:
data_concat['LoanOnCard'].value_counts(dropna=False)

<b>Since the number of '0's in the attribute LoanOnCard is far greater than the number of '1's, I filled the NaN values with the mode (0).</b>

In [None]:
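# Assign the result back; inplace=True on a column selection is unreliable in recent pandas versions
data_concat['LoanOnCard'] = data_concat['LoanOnCard'].fillna(value=0)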
print("The total NaN values in the dataframe are",data_concat.isnull().sum().sum())
print("The final shape of the dataframe after data cleansing is",data_concat.shape)
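
The mode used for the fill can be computed rather than typed in by hand. A minimal sketch on a hypothetical toy column (the values are assumptions for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical toy column with two missing entries
toy = pd.DataFrame({'LoanOnCard': [0.0, 0.0, 1.0, np.nan, 0.0, np.nan]})

# mode() ignores NaN by default and returns a Series; take its first entry
mode_value = toy['LoanOnCard'].mode().iloc[0]

# Fill the gaps, then convert to int now that no NaN remains
toy['LoanOnCard'] = toy['LoanOnCard'].fillna(mode_value).astype(int)

print(mode_value)
print(toy['LoanOnCard'].tolist())
```

Filling a target column with its majority class pushes the affected rows towards that class, so when only a handful of rows are missing, dropping them is a reasonable alternative.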

In [None]:
data_concat.describe().T

In [None]:
data_concat[data_concat['CustomerSince']<0].shape[0]  # number of rows with a negative CustomerSince

In [None]:
data_concat[data_concat['CustomerSince']<0]['CustomerSince'].value_counts()

<b>There are 52 negative entries in the CustomerSince column.</b>

There are three unique negative entries in the CustomerSince column: -1, -2 and -3.

**Let us clean the CustomerSince column by replacing the negative entries with appropriate values.**

To choose appropriate values, let us take a cue from the correlation of the CustomerSince attribute with the other attributes.

In [None]:
# Correlation matrix of all attributes, shown as a heatmap
plt.figure(figsize=(12,10))
sns.heatmap(data_concat.corr(),cmap='YlGnBu',annot=True)

**Observations:**

**1. Age and CustomerSince attributes are highly correlated (0.99).**

**2. MonthlyAverageSpend has good correlation with HighestSpend(0.65) which is quite intuitive.**

**3. The LoanOnCard attribute has good correlation with HighestSpend, MonthlyAverageSpend and FixedDepositAccount but close to zero correlation with the other attributes.**

In [None]:
# Let us find the unique ages which have -1, -2 and -3 entries in the CustomerSince column
val_cal = data_concat[data_concat['CustomerSince'] == -1]['Age'].value_counts()
print(val_cal)

In [None]:
# For each negative CustomerSince value, take the ages at which it occurs, compute the mean of the
# positive CustomerSince values recorded for those ages, and use that mean to replace the negative entries
data_concat['CustomerSince'] = data_concat['CustomerSince'].astype(float)  # the means are generally not integers

for j in [-1, -2, -3]:
    neg_rows = data_concat['CustomerSince'] == j   # rows where CustomerSince equals -1, -2 or -3
    ages = data_concat.loc[neg_rows, 'Age'].unique()   # ages at which this negative value occurs
    valid_rows = data_concat['Age'].isin(ages) & (data_concat['CustomerSince'] > 0)
    data_concat.loc[neg_rows, 'CustomerSince'] = data_concat.loc[valid_rows, 'CustomerSince'].mean()

In [None]:
data_concat['CustomerSince'].describe()

#### As can be seen, the minimum value of the CustomerSince attribute is no longer negative. The dataset is now clean to work with.
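
Because Age and CustomerSince are so strongly correlated, a finer-grained variant is to fill each negative entry with the mean tenure of customers of the same age. Note that this differs from the cell above, which pools the ages for each negative value and ignores zero tenures. A sketch on hypothetical toy data:

```python
import pandas as pd

# Hypothetical toy data: one invalid (negative) tenure per age
toy = pd.DataFrame({
    'Age':           [25, 25, 25, 30, 30, 30],
    'CustomerSince': [ 1,  3, -1,  5,  7, -2],
})

# Treat negative tenures as missing; use floats so that means can be stored
tenure = toy['CustomerSince'].astype(float).where(toy['CustomerSince'] >= 0)

# Mean of the valid tenures within each age, broadcast back to every row
per_age_mean = tenure.groupby(toy['Age']).transform('mean')

# Only the missing entries are replaced; valid values are left as they were
toy['CustomerSince'] = tenure.fillna(per_age_mean)
print(toy)
```

If an age has no valid tenure at all, its mean is NaN and the entry stays missing, which makes such cases easy to spot afterwards.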

# 3. Data analysis and visualization

## a. Univariate Analysis

In [None]:
#Age, CustomerSince, HighestSpend, MonthlyAverageSpend, Mortgage

plt.figure(figsize=(20,5))
plt.margins(y=0.3)
plt.subplot(1,5,1)
sns.distplot(data_concat['Age'],color='red')

plt.subplot(1,5,2)
sns.distplot(data_concat['CustomerSince'],color='blue').set(ylabel=None)

plt.subplot(1,5,3)
sns.distplot(data_concat['HighestSpend'],color = 'green').set(ylabel=None)

plt.subplot(1,5,4)
sns.distplot(data_concat['MonthlyAverageSpend'],color = 'magenta').set(ylabel=None)

plt.subplot(1,5,5)
sns.distplot(data_concat['Mortgage'],color = 'pink').set(ylabel=None)

plt.suptitle('Distribution of Age, CustomerSince, HighestSpend, MonthlyAverageSpend and Mortgage attribute')

plt.figure(figsize=(22,5))
plt.subplot(1,5,1)
sns.boxplot(y=data_concat['Age'],color='red',width = 0.5)

plt.subplot(1,5,2)
sns.boxplot(y=data_concat['CustomerSince'],color='blue', width = 0.5)

plt.subplot(1,5,3)
sns.boxplot(y=data_concat['HighestSpend'],color = 'green', width = 0.5)

plt.subplot(1,5,4)
sns.boxplot(y=data_concat['MonthlyAverageSpend'],color = 'magenta', width = 0.5)

plt.subplot(1,5,5)
sns.boxplot(y=data_concat['Mortgage'],color = 'pink', width = 0.5)

**Observations:**

1. The <b>Age</b> attribute has no outliers and most of the customers are between 35 and 55 years old.
2. The <b>CustomerSince</b> attribute has no outliers and most of the customers have been with the bank for 10 to 30 years.
3. The <b>HighestSpend</b> attribute is right-skewed with some outliers (customers with a highest spend of more than 180 units); most of the HighestSpend values lie between 50 and 100 units.
4. The <b>MonthlyAverageSpend</b> attribute is right-skewed with many outliers (customers with a monthly average spend of more than 5 units); most of the values lie below 3.
5. The <b>Mortgage</b> attribute is strongly right-skewed: many customers have a Mortgage of 0, so the bulk of the distribution sits at zero and the non-zero mortgages form a long right tail, much of which is flagged as outliers. There is also no lower whisker, which means that at least 25% of the observed values are 0, so the lower quartile is also 0.

In [None]:
# ZipCode has many unique values, so we plot its overall distribution instead of a countplot
plt.figure(figsize=(18,10))
sns.distplot(data_concat['ZipCode'],color='darkgreen').set(title='ZipCode distribution')
print("The unique values in the attribute ZipCode are ",data_concat['ZipCode'].nunique())

**The attribute ZipCode has 467 unique values, distributed as shown above. With this many distinct values and no meaningful numeric ordering, ZipCode is unlikely to be a useful feature as it stands, so it is removed from the dataset.**

In [None]:
data_concat.drop('ZipCode',axis=1,inplace=True)

In [None]:
plt.figure(figsize=(10,5))
plt.subplot(1,2,1)
sns.countplot(data = data_concat, x= 'HiddenScore', order = data_concat['HiddenScore'].value_counts().index)

plt.subplot(1,2,2)
sns.countplot(data = data_concat, x= 'Level',order = data_concat['Level'].value_counts().index).set(ylabel=None)

plt.figure(figsize=(10,5))
plt.subplot(1,2,1)
data_concat['HiddenScore'].value_counts().plot.pie(autopct='%1.1f%%')

plt.subplot(1,2,2)
data_concat['Level'].value_counts().plot.pie(autopct='%1.1f%%')

**Observations:**

1. The <b>HiddenScore</b> attribute shows a nearly equal share for categories 2 and 4 (close to 25% each), with customers having a HiddenScore of 1 being the most common (30%) and customers with a HiddenScore of 3 the least common (20%).
2. The <b>Level</b> attribute has more Level 1 customers (42%) than Level 2 or Level 3 customers, which have nearly equal shares (close to 30% each).

In [None]:
# Binary attributes: Security, FixedDepositAccount, InternetBanking, CreditCard, LoanOnCard

plt.figure(figsize=(20,5))
plt.margins(y=0.3)

plt.subplot(1,5,1)
sns.countplot(data = data_concat, x='Security')

plt.subplot(1,5,2)
sns.countplot(data = data_concat, x= 'FixedDepositAccount').set(ylabel=None)

plt.subplot(1,5,3)
sns.countplot(data = data_concat, x= 'InternetBanking').set(ylabel=None)

plt.subplot(1,5,4)
sns.countplot(data = data_concat, x='CreditCard').set(ylabel=None)

plt.subplot(1,5,5)
sns.countplot(data = data_concat, x='LoanOnCard').set(ylabel=None)

plt.figure(figsize=(20,5))
plt.margins(y=0.3)

plt.subplot(1,5,1)
data_concat['Security'].value_counts().plot.pie(autopct='%1.1f%%',shadow=True)

plt.subplot(1,5,2)
data_concat['FixedDepositAccount'].value_counts().plot.pie(autopct='%1.1f%%',shadow=True)

plt.subplot(1,5,3)
data_concat['InternetBanking'].value_counts().plot.pie(autopct='%1.1f%%',shadow=True)

plt.subplot(1,5,4)
data_concat['CreditCard'].value_counts().plot.pie(autopct='%1.1f%%',shadow=True)

plt.subplot(1,5,5)
data_concat['LoanOnCard'].value_counts().plot.pie(autopct='%1.1f%%',shadow=True)

**Observations:**

1. A very high share of customers (90%) do not have a <b>Security</b> asset with the bank.
2. A very high share of customers (94%) do not have a <b>FixedDepositAccount</b> with the bank.
3. Almost 60% of customers use <b>InternetBanking</b>.
4. A high share of customers (71%) do not use a <b>CreditCard</b>.
5. A very high share of customers (90%) do not have a <b>LoanOnCard</b>.



## b. Bivariate analysis

<u>1. Analyzing the Age attribute</u>

In [None]:
plt.figure(figsize=(12,8))
sns.distplot(data_concat[data_concat['LoanOnCard'] == 0]['Age'],kde=True, color='b', label='LoanOnCard=0')
sns.distplot(data_concat[data_concat['LoanOnCard'] == 1]['Age'],kde=True, color='r',label='LoanOnCard=1')
plt.legend()
plt.title("Distribution of Age attribute with customer division based on Loan")

In [None]:
age_cut = pd.cut(data_concat['Age'],bins=[20,30,40,50,60])
pd.crosstab(age_cut,data_concat['LoanOnCard']).apply(lambda r: r/r.sum()*100, axis=1)

**Observations :**

**1. From above table as well as distribution plot of Age attribute, one can observe that most of the customers lie in the age group of 30 to 60.**

**2. Also, one can observe that 10.5% of the customers in the age group 20-30 have bought LoanOnCard from the bank, while the age groups (30-40) and (40-50) have a conversion rate of around 9.5% and the age group (50-60) has a conversion rate of 8.7%.**
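
The per-bin conversion rates quoted here come from binning a numeric attribute and normalising each row of the cross-tabulation. A minimal sketch of that computation on hypothetical toy data, using `normalize='index'` in place of the `apply(lambda ...)` step:

```python
import pandas as pd

# Hypothetical toy data: eight customers, two of whom bought the loan
toy = pd.DataFrame({
    'Age':        [22, 28, 35, 38, 45, 47, 55, 58],
    'LoanOnCard': [ 1,  0,  0,  0,  1,  0,  0,  0],
})

# Bins are right-closed by default: (20, 30], (30, 40], (40, 50], (50, 60]
age_cut = pd.cut(toy['Age'], bins=[20, 30, 40, 50, 60])

# normalize='index' divides each row by its own total, giving per-bin shares
rates = pd.crosstab(age_cut, toy['LoanOnCard'], normalize='index') * 100
print(rates)
```

Because the bins are right-closed, a value equal to the lowest edge falls outside the first bin and is left out of the table unless `include_lowest=True` is passed to `pd.cut`. This matters for attributes whose bins start at 0 and that contain many zeros.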

<u>2. Analyzing the CustomerSince attribute</u>

In [None]:
plt.figure(figsize=(12,8))
sns.distplot(data_concat[data_concat['LoanOnCard'] == 0]['CustomerSince'],kde=True, color='b', label='LoanOnCard=0')
sns.distplot(data_concat[data_concat['LoanOnCard'] == 1]['CustomerSince'],kde=True, color='r',label='LoanOnCard=1')
plt.legend()
plt.title("CustomerSince Distribution")

In [None]:
exp_cut = pd.cut(data_concat['CustomerSince'],bins=[0,10,20,30,40,50])
pd.crosstab(exp_cut,data_concat['LoanOnCard']).apply(lambda r: r/r.sum()*100, axis=1)

**Observation : Customers with CustomerSince in the range 40-50 show a good conversion rate of almost 13% for buying the LoanOnCard, and those in the range 0-10 a healthy rate of about 10.3%, while in the ranges (10-20), (20-30) and (30-40) years the rate is around 9%.**

<u>3. Analyzing the HighestSpend attribute</u>

In [None]:
plt.figure(figsize=(12,8))
sns.distplot(data_concat[data_concat['LoanOnCard'] == 0]['HighestSpend'],kde=True, color='b', label='LoanOnCard=0')
sns.distplot(data_concat[data_concat['LoanOnCard'] == 1]['HighestSpend'],kde=True, color='r',label='LoanOnCard=1')
plt.legend()
plt.title("HighestSpend Distribution")

In [None]:
inc_cut = pd.cut(data_concat['HighestSpend'],bins=[0,50,100,150,200,250])
pd.crosstab(inc_cut,data_concat['LoanOnCard']).apply(lambda r: r/r.sum()*100, axis=1)

**Observation : No customer with HighestSpend < 50 units opted for the LoanOnCard, whereas half of the customers with HighestSpend in the range of 150 to 200 units purchased LoanOnCard. Customers in the ranges (100 to 150) and (200 to 250) units showed conversion rates of about 28.6% and 18.8%, respectively.**

<u>4. Analyzing the HiddenScore attribute</u>

In [None]:
# Since HiddenScore is an ordinal categorical variable, we will use countplot
sns.countplot(x='HiddenScore',hue='LoanOnCard',data=data_concat).set(title='HiddenScore distribution')

In [None]:
pd.crosstab(data_concat['HiddenScore'],data_concat['LoanOnCard']).apply(lambda r: r/r.sum()*100, axis=1)

**Observation : About 13% customers with HiddenScore of 3 and 11% customers with HiddenScore of 4, purchased LoanOnCards from the bank compared to 8% customers with HiddenScore of 2 and 7% customers with HiddenScore of 1.**

<u>5. Analyzing the MonthlyAverageSpend attribute</u>

In [None]:
plt.figure(figsize=(12,8))
sns.distplot(data_concat[data_concat['LoanOnCard'] == 0]['MonthlyAverageSpend'],kde=True, color='b', label='LoanOnCard=0')
sns.distplot(data_concat[data_concat['LoanOnCard'] == 1]['MonthlyAverageSpend'],kde=True, color='r',label='LoanOnCard=1')
plt.legend()
plt.title("MonthlyAverageSpend Distribution")

In [None]:
MonthlyAverageSpend_cut = pd.cut(data_concat['MonthlyAverageSpend'],bins=[0,2,4,6,8,10])
pd.crosstab(MonthlyAverageSpend_cut,data_concat['LoanOnCard']).apply(lambda r: r/r.sum()*100, axis=1)

**Observation : One can see from the distribution plot that customers with a higher average monthly spend on credit cards show a greater tendency to buy the LoanOnCard. Thus, MonthlyAverageSpend shows good correlation with LoanOnCard. Customers with average credit card spending in the range of 4 to 6 units show a conversion rate of around 47%.**

<u>6. Analyzing the Level attribute</u>

In [None]:
sns.countplot(data = data_concat, x='Level', hue='LoanOnCard')

In [None]:
pd.crosstab(data_concat['Level'],data_concat['LoanOnCard']).apply(lambda r: r/r.sum()*100, axis=1)

**Observation : Customers with Levels 2 and 3 show a good conversion rate of about 13% and 13.7%, respectively.**

<u>7. Analyzing the Mortgage attribute</u>

In [None]:
plt.figure(figsize=(8,8))
sns.distplot(data_concat[data_concat['LoanOnCard'] == 0]['Mortgage'],kde=True, color='b', label='LoanOnCard=0')
sns.distplot(data_concat[data_concat['LoanOnCard'] == 1]['Mortgage'],kde=True, color='r',label='LoanOnCard=1')
plt.legend()
plt.title("Mortgage Distribution")

In [None]:
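# Note: pd.cut bins are right-closed by default, so Mortgage == 0 falls outside (0, 100] and those customers are left out of the table
mort_cut = pd.cut(data_concat['Mortgage'],bins=[0,100,200,300,400,500,600])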
pd.crosstab(mort_cut,data_concat['LoanOnCard']).apply(lambda r: r/r.sum()*100, axis=1)

**Observations :**

**1. Customers having house Mortgage value in the ranges (300 to 400), (400 to 500) and (500 to 600) show good tendency to buy the LoanOnCards.**

**2. HighestSpend, MonthlyAverageSpend, Mortgage histograms are not normally distributed**

<u>8. Analyzing the binary attributes Security, FixedDepositAccount, InternetBanking and CreditCard</u>

Since they are nominal variables, we will use count plots and cross-tabulations for the analysis.

In [None]:
plt.figure(figsize=(12,12))
plt.subplot(2,2,1)
sns.countplot(data = data_concat, x='Security', hue='LoanOnCard')
plt.subplot(2,2,2)
sns.countplot(data = data_concat, x='FixedDepositAccount', hue='LoanOnCard')
plt.subplot(2,2,3)
sns.countplot(data = data_concat, x='InternetBanking', hue='LoanOnCard')
plt.subplot(2,2,4)
sns.countplot(data = data_concat, x='CreditCard', hue='LoanOnCard')
plt.suptitle('Distribution of binary attributes Security, FixedDepositAccount, InternetBanking and CreditCard',x=0.5,y=0.90)

In [None]:
pd.crosstab(data_concat['Security'],data_concat['LoanOnCard']).apply(lambda r: r/r.sum()*100, axis=1)

**Observation : Customers with Security have slightly higher percentage of buying the LoanOnCard than the customers with no Security**

In [None]:
pd.crosstab(data_concat['FixedDepositAccount'],data_concat['LoanOnCard']).apply(lambda r: r/r.sum()*100, axis=1)

**Observation : Customers with a FixedDepositAccount are far more likely to buy the LoanOnCard (46.4%) than customers without a FixedDepositAccount (7.2%).**

In [None]:
pd.crosstab(data_concat['InternetBanking'],data_concat['LoanOnCard']).apply(lambda r: r/r.sum()*100, axis=1)

**Observation : InternetBanking has no apparent effect on buying the LoanOnCard: for both kinds of customers, those who use InternetBanking and those who do not, the percentage who bought LoanOnCard is about the same, around 9.5%.**

In [None]:
pd.crosstab(data_concat['CreditCard'],data_concat['LoanOnCard']).apply(lambda r: r/r.sum()*100, axis=1)

**Observation : As with the InternetBanking attribute, whether a customer uses a credit card has no apparent effect on buying the LoanOnCard.**

In [None]:
g=sns.FacetGrid(data=data_concat,row='Level',col='HiddenScore',hue='LoanOnCard').map(plt.hist,'HighestSpend').add_legend()
g.fig.suptitle('Distribution of Highest Spend for different levels and hiddenscores', x= 0.5, y= 1.05)

**Observation : Irrespective of their HighestSpend, Level 1 customers with a HiddenScore of 1 or 2 generally do not opt for LoanOnCard.**

In [None]:
sns.boxplot(x='Level',y='HighestSpend',hue='LoanOnCard',data=data_concat).set(title='Box plot of Highest Spend for each level')

**Observation : In each Level category, one can see that customers with higher HighestSpend tend to buy LoanOnCard.**

In [None]:
sns.boxplot(x='HiddenScore',y='HighestSpend',hue='LoanOnCard',data=data_concat).set(title ='Box plot of Highest Spend for each HiddenScore')

**Observation : For customers with HiddenScore of 1,2,3 or 4, higher HighestSpend is an important factor to buy LoanOnCard.**

In [None]:
# pairplot of numerical attributes
sns.pairplot(data_concat[['Age','CustomerSince','HighestSpend','MonthlyAverageSpend', 'Mortgage', 'LoanOnCard']],hue='LoanOnCard',diag_kind='hist')

**Observations :**

**1. Age has a positive linear relationship with CustomerSince.**

**2. No other clear linear relationship stands out among the remaining numerical attributes in the pair plot.**