## Pandas in Python
- Pandas is an open-source Python library that provides high-performance tools for data manipulation and analysis, built on its powerful data structures.
- Using Pandas, we can carry out the five typical steps of data processing and analysis, regardless of where the data comes from: load, prepare, manipulate, model, and analyze.
- Python with Pandas is used in a wide range of academic and commercial fields, including finance, economics, statistics, and analytics.

**How to install pandas?**

1. Install it from a notebook cell with `!pip install pandas` (drop the leading `!` when running the command in a terminal).
2. Import it under the conventional alias `pd` with `import pandas as pd`.

In [1]:
## import pandas library as pd
import pandas as pd

## Case Study - Data Analysis using Pandas 

### Objective
Let us explore the Customer Churn dataset by performing data analysis using the pandas library.
We will apply the methods and functions we have learnt so far and draw some meaningful insights from the given dataset.

### Dataset Description

#### Domain: 
Finance and Banking

#### Context:
- The dataset contains details of the customers of a banking organization.

#### Content:
- The columns describe each customer's estimated salary, age, gender, and so on, aiming to give a complete profile of the customer.
- Link to the dataset: https://www.kaggle.com/datasets/shubh0799/churn-modelling

### Load the dataset

In [6]:
df = pd.read_csv('Churn_Modelling.csv')
df.head()

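| | RowNumber | CustomerId | Surname | CreditScore | Geography | Gender | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary | Exited |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 15634602 | Hargrave | 619 | France | Female | 42 | 2 | 0.0 | 1 | 1 | 1 | 101348.88 | 1 |
| 1 | 2 | 15647311 | Hill | 608 | Spain | Female | 41 | 1 | 83807.86 | 1 | 0 | 1 | 112542.58 | 0 |
| 2 | 3 | 15619304 | Onio | 502 | France | Female | 42 | 8 | 159660.8 | 3 | 1 | 0 | 113931.57 | 1 |
| 3 | 4 | 15701354 | Boni | 699 | France | Female | 39 | 1 | 0.0 | 2 | 0 | 0 | 93826.63 | 0 |
| 4 | 5 | 15737888 | Mitchell | 850 | Spain | Female | 43 | 2 | 125510.82 | 1 | 1 | 1 | 79084.1 | 0 |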


### 1. Check the basic information of the dataset

In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 14 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   RowNumber        10000 non-null  int64  
 1   CustomerId       10000 non-null  int64  
 2   Surname          10000 non-null  object 
 3   CreditScore      10000 non-null  int64  
 4   Geography        10000 non-null  object 
 5   Gender           10000 non-null  object 
 6   Age              10000 non-null  int64  
 7   Tenure           10000 non-null  int64  
 8   Balance          10000 non-null  float64
 9   NumOfProducts    10000 non-null  int64  
 10  HasCrCard        10000 non-null  int64  
 11  IsActiveMember   10000 non-null  int64  
 12  EstimatedSalary  10000 non-null  float64
 13  Exited           10000 non-null  int64  
dtypes: float64(2), int64(9), object(3)
memory usage: 1.1+ MB


- We can observe that there are 14 columns and 10000 rows in the dataset.
- There are no missing values: every column has 10000 non-null entries.
- Surname, Geography, and Gender are categorical columns with the object datatype; the remaining columns are numeric.
- RowNumber and CustomerId are unique identifiers.
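
One way to double-check which columns are text and which are numeric is `select_dtypes`. The sketch below runs on a small stand-in frame called `toy` (a hypothetical name, with two rows copied from the `head()` output) rather than on the full dataset.

```python
import pandas as pd

# Small stand-in frame; `toy` is a hypothetical name, not part of the case study
toy = pd.DataFrame({
    "Surname": ["Hill", "Onio"],
    "Geography": ["Spain", "France"],
    "Gender": ["Female", "Female"],
    "Age": [41, 42],
    "Balance": [83807.86, 159660.80],
})

# Split the columns into numeric and non-numeric (text) groups
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
text_cols = toy.select_dtypes(exclude="number").columns.tolist()

print("numeric:", numeric_cols)
print("text:", text_cols)
```

Applied to `df`, the same two calls should separate the three object columns from the eleven numeric ones.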

### 2. Drop the redundant columns.

In [8]:
## dropping 'RowNumber', 'CustomerId' and 'Surname'
df.drop(['RowNumber','CustomerId','Surname'],axis=1,inplace=True)

- The axis parameter is set to 1 to drop columns; axis=0 (the default) drops rows.
- The inplace parameter is set to True so that df itself is modified; without it, drop() returns a new DataFrame and leaves df unchanged.
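
As an alternative to `axis=1` with `inplace=True`, the `columns=` keyword can be used and the result assigned to a new variable. This is only a sketch on a made-up frame; `toy` and `trimmed` are hypothetical names.

```python
import pandas as pd

# Made-up frame for illustration; `toy` and `trimmed` are hypothetical names
toy = pd.DataFrame({
    "RowNumber": [1, 2],
    "CustomerId": [15634602, 15647311],
    "CreditScore": [619, 608],
})

# columns=[...] is equivalent to passing the labels with axis=1.
# Without inplace=True, drop() returns a new DataFrame and `toy` is untouched.
trimmed = toy.drop(columns=["RowNumber", "CustomerId"])

print(trimmed.columns.tolist())
print(toy.columns.tolist())
```

Assigning the result keeps the original frame available, which can help when a cell is re-run several times in a notebook.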

### 3. Check for the duplicate records in the dataset.

In [9]:
df[df.duplicated()]

| | CreditScore | Geography | Gender | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary | Exited |
|---|---|---|---|---|---|---|---|---|---|---|---|


- The result is empty, so we do not have any duplicate records in the dataset.
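
To see what `duplicated()` does when duplicates are present, here is a minimal sketch on a made-up frame (`toy` is a hypothetical name) in which the third row repeats the first.

```python
import pandas as pd

# Made-up frame; row 2 is an exact copy of row 0
toy = pd.DataFrame({
    "CreditScore": [619, 608, 619],
    "Geography": ["France", "Spain", "France"],
})

# duplicated() flags rows that repeat an earlier row
n_duplicates = int(toy.duplicated().sum())

# drop_duplicates() keeps the first occurrence by default
deduped = toy.drop_duplicates()

print(n_duplicates)
print(deduped)
```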

### 4. Check for missing values in the dataset

In [10]:
df.isnull().sum()

CreditScore        0
Geography          0
Gender             0
Age                0
Tenure             0
Balance            0
NumOfProducts      0
HasCrCard          0
IsActiveMember     0
EstimatedSalary    0
Exited             0
dtype: int64

- We do not have any missing values in the dataset.
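
Since this dataset has no gaps, the following sketch shows the same check on a made-up frame (`toy`, a hypothetical name) that does contain missing values.

```python
import numpy as np
import pandas as pd

# Made-up frame with two missing entries
toy = pd.DataFrame({
    "Age": [42, np.nan, 39],
    "Balance": [0.0, 83807.86, np.nan],
    "Exited": [1, 0, 0],
})

# Missing values per column, and the overall total
missing_per_column = toy.isnull().sum()
total_missing = int(missing_per_column.sum())

print(missing_per_column)
print(total_missing)
```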

### 5. Check the statistical summary of the dataset.

In [11]:
df.describe()

| | CreditScore | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary | Exited |
|---|---|---|---|---|---|---|---|---|---|
| count | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 |
| mean | 650.5288 | 38.9218 | 5.0128 | 76485.889288 | 1.5302 | 0.7055 | 0.5151 | 100090.239881 | 0.2037 |
| std | 96.653299 | 10.487806 | 2.892174 | 62397.405202 | 0.581654 | 0.45584 | 0.499797 | 57510.492818 | 0.402769 |
| min | 350.0 | 18.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 11.58 | 0.0 |
| 25% | 584.0 | 32.0 | 3.0 | 0.0 | 1.0 | 0.0 | 0.0 | 51002.11 | 0.0 |
| 50% | 652.0 | 37.0 | 5.0 | 97198.54 | 1.0 | 1.0 | 1.0 | 100193.915 | 0.0 |
| 75% | 718.0 | 44.0 | 7.0 | 127644.24 | 2.0 | 1.0 | 1.0 | 149388.2475 | 0.0 |
| max | 850.0 | 92.0 | 10.0 | 250898.09 | 4.0 | 1.0 | 1.0 | 199992.48 | 1.0 |


- We can observe that the minimum credit score is 350 and the maximum is 850.
- The minimum customer age is 18 and the maximum is 92.
- The average Tenure of the customers is around 5.
- The minimum balance is 0 and the average is around 76,486.

In [12]:
df.describe(include='O')

| | Geography | Gender |
|---|---|---|
| count | 10000 | 10000 |
| unique | 3 | 2 |
| top | France | Male |
| freq | 5014 | 5457 |


- We can observe that there are 3 unique Geographies and the most frequently occurring one is France (5014 customers).
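
`describe(include='O')` reports only the top category. To see the count for every category, `value_counts()` is a common companion; the sketch below uses a made-up column (`toy` is a hypothetical name).

```python
import pandas as pd

# Made-up categorical column for illustration
toy = pd.DataFrame({
    "Geography": ["France", "Spain", "France", "Germany", "France"],
})

counts = toy["Geography"].value_counts()   # frequency of each category
n_unique = toy["Geography"].nunique()      # matches the 'unique' row of describe
most_common = counts.idxmax()              # matches the 'top' row of describe

print(counts)
print(n_unique, most_common)
```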

### 6. Select customers who live in 'Spain' and have churned.

In [13]:
df[(df['Geography']=='Spain') & (df['Exited']==1)]

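| | CreditScore | Geography | Gender | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary | Exited |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 5 | 645 | Spain | Male | 44 | 8 | 113755.78 | 2 | 1 | 0 | 149756.71 | 1 |
| 22 | 510 | Spain | Female | 38 | 4 | 0.00 | 1 | 1 | 0 | 118913.53 | 1 |
| 30 | 591 | Spain | Female | 39 | 3 | 0.00 | 3 | 1 | 0 | 140469.38 | 1 |
| 58 | 511 | Spain | Female | 66 | 4 | 0.00 | 1 | 1 | 0 | 1643.11 | 1 |
| 86 | 750 | Spain | Male | 22 | 3 | 121681.82 | 1 | 1 | 0 | 128643.35 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 9718 | 710 | Spain | Male | 62 | 3 | 131078.42 | 2 | 1 | 0 | 119348.76 | 1 |
| 9756 | 648 | Spain | Female | 43 | 7 | 81153.82 | 1 | 1 | 1 | 144532.85 | 1 |
| 9800 | 762 | Spain | Female | 35 | 3 | 119349.69 | 3 | 1 | 1 | 47114.18 | 1 |
| 9852 | 501 | Spain | Male | 43 | 6 | 104533.24 | 1 | 0 | 0 | 81123.59 | 1 |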


- We have 413 customers who live in Spain and have churned.
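
The count of matching rows can be read directly from the boolean mask instead of from the displayed table. Below is a minimal sketch on made-up data (`toy`, `mask`, and `n_matches` are hypothetical names).

```python
import pandas as pd

# Made-up frame for illustration
toy = pd.DataFrame({
    "Geography": ["Spain", "France", "Spain", "Spain"],
    "Exited": [1, 1, 0, 1],
})

# Each condition needs its own parentheses because & binds tighter than ==
mask = (toy["Geography"] == "Spain") & (toy["Exited"] == 1)

n_matches = int(mask.sum())   # True counts as 1, so the sum is the row count
subset = toy[mask]            # the matching rows themselves

print(n_matches)
print(subset)
```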

### 7. Select the customers whose credit score is equal to 850 or 350

In [14]:
df[df['CreditScore'].isin([850,350])]

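| | CreditScore | Geography | Gender | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary | Exited |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 4 | 850 | Spain | Female | 43 | 2 | 125510.82 | 1 | 1 | 1 | 79084.10 | 0 |
| 38 | 850 | France | Male | 36 | 7 | 0.00 | 1 | 1 | 1 | 40812.90 | 0 |
| 180 | 850 | Spain | Female | 45 | 2 | 122311.21 | 1 | 1 | 1 | 19482.50 | 0 |
| 200 | 850 | Spain | Male | 30 | 2 | 141040.01 | 1 | 1 | 1 | 5978.20 | 0 |
| 223 | 850 | France | Male | 33 | 10 | 0.00 | 1 | 1 | 0 | 4861.72 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 9624 | 350 | France | Female | 40 | 0 | 111098.85 | 1 | 1 | 1 | 172321.21 | 1 |
| 9646 | 850 | Spain | Male | 71 | 10 | 69608.14 | 1 | 1 | 0 | 97893.40 | 1 |
| 9688 | 850 | France | Male | 68 | 5 | 169445.40 | 1 | 1 | 1 | 186335.07 | 0 |
| 9931 | 850 | France | Female | 34 | 6 | 101266.51 | 1 | 1 | 0 | 33501.98 | 0 |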


- We have around 238 customers whose credit score is equal to 850 or 350
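
To break such a selection down by value, the `isin` mask can be combined with `value_counts()`. This is a sketch on made-up scores, with `toy`, `mask`, and `per_score` as hypothetical names.

```python
import pandas as pd

# Made-up credit scores for illustration
toy = pd.DataFrame({"CreditScore": [850, 619, 350, 850, 700]})

# isin() tests membership in a list of values, like several == joined with |
mask = toy["CreditScore"].isin([850, 350])

n_matches = int(mask.sum())                                # rows at either extreme
per_score = toy.loc[mask, "CreditScore"].value_counts()    # rows per extreme value

print(n_matches)
print(per_score)
```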

### 8. Group the customers by geography and gender, then compute the average churn rate for each group

In [15]:
df.groupby(['Geography','Gender'])[['Exited']].mean()

| Geography | Gender | Exited |
|---|---|---|
| France | Female | 0.20345 |
| France | Male | 0.127134 |
| Germany | Female | 0.375524 |
| Germany | Male | 0.278116 |
| Spain | Female | 0.212121 |
| Spain | Male | 0.131124 |


### 9. Group the customers by geography and gender, then compute the average churn rate and the number of customers in each group

In [16]:
df.groupby(['Geography','Gender'])[['Exited']].agg(['mean','count'])

| Geography | Gender | Exited (mean) | Exited (count) |
|---|---|---|---|
| France | Female | 0.20345 | 2261 |
| France | Male | 0.127134 | 2753 |
| Germany | Female | 0.375524 | 1193 |
| Germany | Male | 0.278116 | 1316 |
| Spain | Female | 0.212121 | 1089 |
| Spain | Male | 0.131124 | 1388 |
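
Passing a list to `agg` produces two-level column names. If flat, descriptive names are preferred, named aggregation is one option. The sketch below uses made-up data, and the output names `churn_rate` and `customers` are hypothetical choices.

```python
import pandas as pd

# Made-up frame for illustration
toy = pd.DataFrame({
    "Geography": ["France", "France", "Spain", "Spain", "Spain"],
    "Gender": ["Female", "Male", "Female", "Female", "Male"],
    "Exited": [1, 0, 0, 1, 0],
})

# Named aggregation: output_name=(input_column, aggregation)
summary = toy.groupby(["Geography", "Gender"]).agg(
    churn_rate=("Exited", "mean"),
    customers=("Exited", "count"),
)

print(summary)
```

The result has single-level columns, which can be easier to sort or filter afterwards.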


### 10. Retrieve the number of customers churned in each country.

In [17]:
df.groupby('Geography')[['Exited']].sum()

| Geography | Exited |
|---|---|
| France | 810 |
| Germany | 814 |
| Spain | 413 |


### 11. Retrieve the percentage of customers who have churned

In [18]:
df['Exited'].value_counts(normalize=True)*100

0    79.63
1    20.37
Name: Exited, dtype: float64

- 20.37% of customers have churned and 79.63% have not.
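
Because `Exited` holds only 0 and 1, its mean is the churned fraction, which gives a quick cross-check of the `value_counts(normalize=True)` result. Here is a minimal sketch on made-up data (`toy` is a hypothetical name).

```python
import pandas as pd

# Made-up binary column: one churned customer out of four
toy = pd.DataFrame({"Exited": [1, 0, 0, 0]})

# Percentage of each class via normalized counts
shares = toy["Exited"].value_counts(normalize=True) * 100

# The mean of a 0/1 column is the fraction of ones
churn_pct = toy["Exited"].mean() * 100

print(shares)
print(churn_pct)
```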

### 12. Retrieve the average Tenure of the customers who have churned

In [19]:
df[df['Exited']==1]['Tenure'].mean()

4.932744231713304
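
The same figure can be obtained with a single `.loc` call, and `groupby` yields the average for churned and retained customers side by side. This is a sketch on made-up data, with `toy` as a hypothetical name.

```python
import pandas as pd

# Made-up frame for illustration
toy = pd.DataFrame({
    "Exited": [1, 0, 1, 0],
    "Tenure": [2, 8, 4, 6],
})

# Row filter and column selection in one .loc call
churned_tenure = toy.loc[toy["Exited"] == 1, "Tenure"].mean()

# Average tenure for both groups at once
tenure_by_status = toy.groupby("Exited")["Tenure"].mean()

print(churned_tenure)
print(tenure_by_status)
```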

### 13. Access the rows where row labels are equal to 983 and 666

In [20]:
## accessing records based on the row labels 983 and 666
df.loc[[983,666]]

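| | CreditScore | Geography | Gender | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary | Exited |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 983 | 614 | Germany | Female | 35 | 6 | 128100.28 | 1 | 0 | 0 | 69454.24 | 1 |
| 666 | 559 | France | Female | 31 | 3 | 127070.73 | 1 | 0 | 1 | 160941.78 | 0 |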


### 14. Access the record where row label is 6890 and column name is Exited

In [21]:
## accessing records by passing row and column labels
df.loc[[6890],['Exited']]

| | Exited |
|---|---|
| 6890 | 1 |


### 15. Access the first three records from the first three columns based on integer index location

In [22]:
## accessing records based on row and column integer index location
df.iloc[:3,:3]

| | CreditScore | Geography | Gender |
|---|---|---|---|
| 0 | 619 | France | Female |
| 1 | 608 | Spain | Female |
| 2 | 502 | France | Female |
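
With the default RangeIndex, row labels and integer positions coincide, so `loc` and `iloc` can look interchangeable. The sketch below uses a made-up frame with a non-default index (`toy` is a hypothetical name) to show where they differ.

```python
import pandas as pd

# Made-up frame whose labels (10, 20, 30, 40) differ from positions (0, 1, 2, 3)
toy = pd.DataFrame({"CreditScore": [619, 608, 502, 699]}, index=[10, 20, 30, 40])

by_label = toy.loc[20, "CreditScore"]   # row labelled 20
by_position = toy.iloc[1, 0]            # second row, first column

# loc slices include the end label; iloc slices exclude the end position
label_slice = toy.loc[10:30]
position_slice = toy.iloc[0:2]

print(by_label, by_position)
print(len(label_slice), len(position_slice))
```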


### 16. Create a pivot table using columns 'Geography' and 'Balance'

In [23]:
## by default pivot table calculates the mean of the balance 
pd.pivot_table(df,index=['Geography'],values='Balance') 

| Geography | Balance |
|---|---|
| France | 62092.636516 |
| Germany | 119730.116134 |
| Spain | 61818.147763 |


In [27]:
## find the maximum balance of the customers in each country using a pivot table
## a different aggregate function can be passed using the parameter 'aggfunc'
pd.pivot_table(df,index=['Geography'],values='Balance',aggfunc='max') 

| Geography | Balance |
|---|---|
| France | 238387.56 |
| Germany | 214346.96 |
| Spain | 250898.09 |
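
`aggfunc` also accepts a list, and a pivot table with a single index column gives the same numbers as a `groupby`. The sketch below uses made-up balances; `toy`, `pivot`, and `grouped` are hypothetical names.

```python
import pandas as pd

# Made-up balances for illustration
toy = pd.DataFrame({
    "Geography": ["France", "France", "Spain", "Spain"],
    "Balance": [0.0, 100.0, 50.0, 150.0],
})

# Several aggregations at once; columns become (aggfunc, value) pairs
pivot = pd.pivot_table(toy, index="Geography", values="Balance",
                       aggfunc=["mean", "max"])

# The equivalent groupby, with flat column names
grouped = toy.groupby("Geography")["Balance"].agg(["mean", "max"])

print(pivot)
print(grouped)
```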


### Happy Learning:)