<a href="https://colab.research.google.com/github/adhang/data-science-digitalskola/blob/update/08.%20Advanced%20Pandas%20Dataframe/Learn%20-%20Advanced%20Pandas%20Dataframe.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Advanced Dataframe
Author: Adhang Muntaha Muhammad

[![LinkedIn](https://img.shields.io/badge/linkedin-0077B5?style=for-the-badge&logo=linkedin&logoColor=white&link=https://www.linkedin.com/in/adhangmuntaha/)](https://www.linkedin.com/in/adhangmuntaha/)
[![GitHub](https://img.shields.io/badge/github-121011?style=for-the-badge&logo=github&logoColor=white&link=https://github.com/adhang)](https://github.com/adhang)
[![Kaggle](https://img.shields.io/badge/kaggle-20BEFF?style=for-the-badge&logo=kaggle&logoColor=white&link=https://www.kaggle.com/adhang)](https://www.kaggle.com/adhang)
[![Tableau](https://img.shields.io/badge/tableau-E97627?style=for-the-badge&logo=tableau&logoColor=white&link=https://public.tableau.com/app/profile/adhang)](https://public.tableau.com/app/profile/adhang)
___
**Contents**
- Indexing Dataframe
- Dropping Columns
- Joining Dataframes
- Concatenating Dataframes
- Appending Dataframes
- Pivot Table
- Melting Dataframes
- Lambda Function on Dataframes

# Importing Libraries

In [2]:
import pandas as pd

# Reading Dataset
For this notebook, I will use my mentor's dataset from GitHub. You can check his work [here](https://github.com/densaiko).

In [12]:
file_path = 'https://raw.githubusercontent.com/densaiko/data_science_learning/main/dataset/insurance.csv'

data = pd.read_csv(file_path)
data.head()

Unnamed: 0,age,sex,bmi,children,smoker,region,charges
0,19,female,27.9,0,yes,southwest,16884.924
1,18,male,33.77,1,no,southeast,1725.5523
2,28,male,33.0,3,no,southeast,4449.462
3,33,male,22.705,0,no,northwest,21984.47061
4,32,male,28.88,0,no,northwest,3866.8552


# Indexing Dataframe

## Default Index
An index identifies a row (record). The default index runs from 0 to n - 1, where n is the total number of rows.

For example, let's see the dataset size using `shape`.

In [4]:
data.shape

(1338, 7)

As we can see, there are 1338 rows and 7 columns. Let's see the first 5 rows.

In [5]:
data.head()

Unnamed: 0,age,sex,bmi,children,smoker,region,charges
0,19,female,27.9,0,yes,southwest,16884.924
1,18,male,33.77,1,no,southeast,1725.5523
2,28,male,33.0,3,no,southeast,4449.462
3,33,male,22.705,0,no,northwest,21984.47061
4,32,male,28.88,0,no,northwest,3866.8552


This output shows that the index runs from 0 to 4 for the first five rows. What about the last index?

In [6]:
data.tail()

Unnamed: 0,age,sex,bmi,children,smoker,region,charges
1333,50,male,30.97,3,no,northwest,10600.5483
1334,18,female,31.92,0,no,northeast,2205.9808
1335,18,female,36.85,0,no,southeast,1629.8335
1336,21,female,25.8,0,no,southwest,2007.945
1337,61,female,29.07,0,yes,northwest,29141.3603


Now, we know the last 5 rows. The last row has 1337 as the index. So, an index ranging from 0 to 1337 means it has 1338 rows.
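
The relationship between the default index and the row count can be checked directly. The snippet below is a minimal sketch on a small made-up frame (`toy` and its values are illustrative, not taken from the insurance dataset), so it runs without downloading anything.

```python
import pandas as pd

# Toy frame with made-up values, standing in for the real dataset
toy = pd.DataFrame({'age': [19, 18, 28], 'sex': ['female', 'male', 'male']})

print(toy.index)              # RangeIndex(start=0, stop=3, step=1)

first_label = toy.index[0]    # first label of the default index
last_label = toy.index[-1]    # last label of the default index
n_rows = len(toy)             # total number of rows

# The last label is always one less than the row count
print(first_label, last_label, n_rows)
```

The same check on `data` should give 0, 1337, and 1338.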

## Column as Index
Can we change the index using a column? Yes, absolutely. We can use the `set_index()` method to do this.

Let's say we will use the `sex` column as the index.

In [20]:
data.set_index('sex')

sex,age,bmi,children,smoker,region,charges
female,19,27.900,0,yes,southwest,16884.92400
male,18,33.770,1,no,southeast,1725.55230
male,28,33.000,3,no,southeast,4449.46200
male,33,22.705,0,no,northwest,21984.47061
male,32,28.880,0,no,northwest,3866.85520
...,...,...,...,...,...,...
male,50,30.970,3,no,northwest,10600.54830
female,18,31.920,0,no,northeast,2205.98080
female,18,36.850,0,no,southeast,1629.83350
female,21,25.800,0,no,southwest,2007.94500


Note that this call only displays the result; `data` itself is unchanged. To keep the new index, either use `inplace=True` or assign the result to a variable (a new one or the same one). Like this:

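```
# Pick ONE of the options below; once `sex` is the index, a second call raises a KeyError.

# option 1: modify the original dataframe in place
data.set_index('sex', inplace=True)

# option 2: assign to a new variable
new_data = data.set_index('sex')

# option 3: reassign to the same variable
data = data.set_index('sex')
```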
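
To see that `set_index()` returns a new dataframe rather than changing the original, here is a minimal sketch on a made-up frame (`toy` and `indexed` are illustrative names, not part of the notebook).

```python
import pandas as pd

# Toy frame with made-up values
toy = pd.DataFrame({'sex': ['female', 'male'], 'age': [19, 18]})

indexed = toy.set_index('sex')    # returns a new dataframe

print(toy.columns.tolist())       # ['sex', 'age'] (original is untouched)
print(indexed.columns.tolist())   # ['age'] ('sex' moved into the index)
print(indexed.index.name)         # sex
```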



## Multiple Index
We can set multiple columns as the index. It will create a multi-index.

In [18]:
data.set_index(['sex','smoker'])

sex,smoker,age,bmi,children,region,charges
female,yes,19,27.900,0,southwest,16884.92400
male,no,18,33.770,1,southeast,1725.55230
male,no,28,33.000,3,southeast,4449.46200
male,no,33,22.705,0,northwest,21984.47061
male,no,32,28.880,0,northwest,3866.85520
...,...,...,...,...,...,...
male,no,50,30.970,3,northwest,10600.54830
female,no,18,31.920,0,northeast,2205.98080
female,no,18,36.850,0,southeast,1629.83350
female,no,21,25.800,0,southwest,2007.94500


Here, `sex` and `smoker` together form the index.
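
One practical use of a multi-index is selecting rows by one or both levels with `.loc`. The following is a minimal sketch on a made-up frame (`toy`, `multi`, and the values are illustrative); the index is sorted first, which is generally recommended before label-based selection on a multi-index.

```python
import pandas as pd

# Toy frame with made-up values; each (sex, smoker) pair appears once
toy = pd.DataFrame({
    'sex': ['female', 'male', 'male', 'female'],
    'smoker': ['yes', 'no', 'yes', 'no'],
    'age': [19, 18, 28, 33],
})

multi = toy.set_index(['sex', 'smoker']).sort_index()

print(list(multi.index.names))    # ['sex', 'smoker']

# Select by the first level only: all rows where sex == 'male'
male_rows = multi.loc['male']

# Select by both levels: the age for the (male, yes) combination
male_smoker_age = multi.loc[('male', 'yes'), 'age']
print(male_smoker_age)            # 28
```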

## Reset Index
Let's say our dataframe has an index that is confusing (or not ordered). We can reset the index using `reset_index()`.

To demonstrate this, I will create a new dataframe using random sampling. So, if you re-run this notebook, you may get a different result.

In [19]:
data_5 = data.sample(n=5)
data_5

Unnamed: 0,age,sex,bmi,children,smoker,region,charges
833,58,male,34.39,0,no,northwest,11743.9341
1065,42,female,25.3,1,no,southwest,7045.499
976,48,male,40.15,0,no,southeast,7804.1605
721,53,male,36.6,3,no,southwest,11264.541
785,35,female,27.7,3,no,southwest,6414.178


This dataframe contains i
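
As a minimal sketch of what `reset_index()` does to an unordered index like the one above, the snippet below builds a small frame whose labels mimic a random sample (`toy`, `kept`, and `dropped` are illustrative names). By default the old labels are kept in a new column called `index`; passing `drop=True` discards them.

```python
import pandas as pd

# Toy frame with an unordered index, similar to a random sample
toy = pd.DataFrame({'age': [58, 42, 48]}, index=[833, 1065, 976])

kept = toy.reset_index()               # old labels move to an 'index' column
dropped = toy.reset_index(drop=True)   # old labels are discarded

print(kept.columns.tolist())      # ['index', 'age']
print(dropped.index.tolist())     # [0, 1, 2]
```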

# Dropping Columns

In [None]:
drop_1 = data.drop('charges', axis=1)
drop_1.head()

Unnamed: 0,age,sex,bmi,children,smoker,region
0,19,female,27.9,0,yes,southwest
1,18,male,33.77,1,no,southeast
2,28,male,33.0,3,no,southeast
3,33,male,22.705,0,no,northwest
4,32,male,28.88,0,no,northwest


In [None]:
drop_2 = data.drop(['age','sex'], axis=1)
drop_2.head()

Unnamed: 0,bmi,children,smoker,region,charges
0,27.9,0,yes,southwest,16884.924
1,33.77,1,no,southeast,1725.5523
2,33.0,3,no,southeast,4449.462
3,22.705,0,no,northwest,21984.47061
4,28.88,0,no,northwest,3866.8552


In [None]:
drop_row = data.drop(1, axis=0)
drop_row.head()

Unnamed: 0,age,sex,bmi,children,smoker,region,charges
0,19,female,27.9,0,yes,southwest,16884.924
2,28,male,33.0,3,no,southeast,4449.462
3,33,male,22.705,0,no,northwest,21984.47061
4,32,male,28.88,0,no,northwest,3866.8552
5,31,female,25.74,0,no,southeast,3756.6216


In [None]:
data_sex = data.set_index('sex')
data_sex.head()

sex,age,bmi,children,smoker,region,charges
female,19,27.9,0,yes,southwest,16884.924
male,18,33.77,1,no,southeast,1725.5523
male,28,33.0,3,no,southeast,4449.462
male,33,22.705,0,no,northwest,21984.47061
male,32,28.88,0,no,northwest,3866.8552


In [None]:
data_sex_drop = data_sex.drop('female', axis=0)
data_sex_drop.head()

sex,age,bmi,children,smoker,region,charges
male,18,33.77,1,no,southeast,1725.5523
male,28,33.0,3,no,southeast,4449.462
male,33,22.705,0,no,northwest,21984.47061
male,32,28.88,0,no,northwest,3866.8552
male,37,29.83,2,no,northeast,6406.4107
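
The cells above use `axis=1` to drop columns and `axis=0` to drop rows by index label. `drop()` also accepts the `columns=` and `index=` keywords, which some readers find easier to follow. A minimal sketch on a made-up frame (`toy` and its values are illustrative):

```python
import pandas as pd

# Toy frame with made-up values
toy = pd.DataFrame({
    'age': [19, 18, 28],
    'bmi': [27.9, 33.77, 33.0],
    'charges': [16884.924, 1725.5523, 4449.462],
})

no_charges = toy.drop(columns='charges')   # same as toy.drop('charges', axis=1)
no_row_1 = toy.drop(index=1)               # same as toy.drop(1, axis=0)

print(no_charges.columns.tolist())   # ['age', 'bmi']
print(no_row_1.index.tolist())       # [0, 2]
```

As with `set_index()`, `drop()` returns a new dataframe, so `toy` still has all three columns and rows afterwards.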


# Joining Dataframes

In [None]:
data_left = data.sample(n=5)
data_right = data.sample(n=10)

display(data_left.head())
display(data_right.head())

Unnamed: 0,age,sex,bmi,children,smoker,region,charges
28,23,male,17.385,1,no,northwest,2775.19215
82,22,male,37.62,1,yes,southeast,37165.1638
283,55,female,32.395,1,no,northeast,11879.10405
761,23,male,35.2,1,no,southwest,2416.955
1044,55,male,35.245,1,no,northeast,11394.06555


Unnamed: 0,age,sex,bmi,children,smoker,region,charges
1174,29,male,32.11,2,no,northwest,4433.9159
576,22,male,26.84,0,no,southeast,1664.9996
377,24,male,40.15,0,yes,southeast,38126.2465
1332,52,female,44.7,3,no,southwest,11411.685
1223,20,female,24.42,0,yes,southeast,26125.67477


In [None]:
data_left.join(data_right, lsuffix='_first', rsuffix='_second')

Unnamed: 0,age_first,sex_first,bmi_first,children_first,smoker_first,region_first,charges_first,age_second,sex_second,bmi_second,children_second,smoker_second,region_second,charges_second
28,23,male,17.385,1,no,northwest,2775.19215,,,,,,,
82,22,male,37.62,1,yes,southeast,37165.1638,,,,,,,
283,55,female,32.395,1,no,northeast,11879.10405,,,,,,,
761,23,male,35.2,1,no,southwest,2416.955,,,,,,,
1044,55,male,35.245,1,no,northeast,11394.06555,,,,,,,
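
In the output above, every `_second` column is empty. `join()` performs a left join on the index by default, so a likely explanation is that the two random samples share no row labels. The sketch below uses two made-up frames (`left`, `right`, and their values are illustrative) that share exactly one label, to show where values do and do not get filled in.

```python
import pandas as pd

# Two toy frames; only label 82 appears in both
left = pd.DataFrame({'age': [23, 22]}, index=[28, 82])
right = pd.DataFrame({'age': [22, 29]}, index=[82, 1174])

# Left join on the index: all rows of `left` are kept,
# and `right` only contributes values where the labels match
joined = left.join(right, lsuffix='_first', rsuffix='_second')

print(joined.index.tolist())      # [28, 82]
print(joined.columns.tolist())    # ['age_first', 'age_second']
```

Here row 82 gets a value in `age_second`, while row 28 gets `NaN` because `right` has no row with that label.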


# Concatenating Dataframes

In [None]:
pd.concat([data_left, data_right], axis=1)

Unnamed: 0,age,sex,bmi,children,smoker,region,charges,age.1,sex.1,bmi.1,children.1,smoker.1,region.1,charges.1
28,23.0,male,17.385,1.0,no,northwest,2775.19215,,,,,,,
82,22.0,male,37.62,1.0,yes,southeast,37165.1638,,,,,,,
283,55.0,female,32.395,1.0,no,northeast,11879.10405,,,,,,,
377,,,,,,,,24.0,male,40.15,0.0,yes,southeast,38126.2465
576,,,,,,,,22.0,male,26.84,0.0,no,southeast,1664.9996
683,,,,,,,,53.0,male,24.32,0.0,no,northwest,9863.4718
761,23.0,male,35.2,1.0,no,southwest,2416.955,,,,,,,
1044,55.0,male,35.245,1.0,no,northeast,11394.06555,,,,,,,
1122,,,,,,,,53.0,female,36.86,3.0,yes,northwest,46661.4424
1169,,,,,,,,37.0,female,34.105,1.0,no,northwest,6112.35295


In [None]:
pd.concat([data_left, data_right], axis=0)

Unnamed: 0,age,sex,bmi,children,smoker,region,charges
28,23,male,17.385,1,no,northwest,2775.19215
82,22,male,37.62,1,yes,southeast,37165.1638
283,55,female,32.395,1,no,northeast,11879.10405
761,23,male,35.2,1,no,southwest,2416.955
1044,55,male,35.245,1,no,northeast,11394.06555
1174,29,male,32.11,2,no,northwest,4433.9159
576,22,male,26.84,0,no,southeast,1664.9996
377,24,male,40.15,0,yes,southeast,38126.2465
1332,52,female,44.7,3,no,southwest,11411.685
1223,20,female,24.42,0,yes,southeast,26125.67477
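
The two outputs above differ because `axis=1` places the frames side by side and aligns them on index labels, while `axis=0` stacks them and keeps the original labels. A minimal sketch on made-up frames (`top`, `bottom`, and their values are illustrative), including `ignore_index=True` for cases where fresh labels are wanted:

```python
import pandas as pd

# Two toy frames with no labels in common
top = pd.DataFrame({'age': [23, 22]}, index=[28, 82])
bottom = pd.DataFrame({'age': [29, 24]}, index=[1174, 377])

# Stack rows and keep the original labels
stacked = pd.concat([top, bottom], axis=0)

# Stack rows and renumber them from 0
renumbered = pd.concat([top, bottom], axis=0, ignore_index=True)

# Place side by side; labels missing from one frame are filled with NaN
side_by_side = pd.concat([top, bottom], axis=1)

print(stacked.index.tolist())      # [28, 82, 1174, 377]
print(renumbered.index.tolist())   # [0, 1, 2, 3]
print(side_by_side.shape)          # (4, 2)
```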


# Appending Dataframes

In [None]:
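# DataFrame.append() was removed in pandas 2.0; on older versions this is data_left.append(data_right)
pd.concat([data_left, data_right])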

Unnamed: 0,age,sex,bmi,children,smoker,region,charges
28,23,male,17.385,1,no,northwest,2775.19215
82,22,male,37.62,1,yes,southeast,37165.1638
283,55,female,32.395,1,no,northeast,11879.10405
761,23,male,35.2,1,no,southwest,2416.955
1044,55,male,35.245,1,no,northeast,11394.06555
1174,29,male,32.11,2,no,northwest,4433.9159
576,22,male,26.84,0,no,southeast,1664.9996
377,24,male,40.15,0,yes,southeast,38126.2465
1332,52,female,44.7,3,no,southwest,11411.685
1223,20,female,24.42,0,yes,southeast,26125.67477


# Pivot Table

In [None]:
data.set_index(['sex','smoker'])

sex,smoker,age,bmi,children,region,charges
female,yes,19,27.900,0,southwest,16884.92400
male,no,18,33.770,1,southeast,1725.55230
male,no,28,33.000,3,southeast,4449.46200
male,no,33,22.705,0,northwest,21984.47061
male,no,32,28.880,0,northwest,3866.85520
...,...,...,...,...,...,...
male,no,50,30.970,3,northwest,10600.54830
female,no,18,31.920,0,northeast,2205.98080
female,no,18,36.850,0,southeast,1629.83350
female,no,21,25.800,0,southwest,2007.94500


In [None]:
data.groupby(['sex','smoker','region']).age.mean().unstack()

,region,northeast,northwest,southeast,southwest
sex,smoker,,,,
female,no,39.840909,39.755556,39.071942,40.099291
female,yes,38.724138,38.827586,39.25,37.047619
male,no,39.216,38.568182,38.261194,40.277778
male,yes,37.868421,39.827586,40.054545,35.567568


In [None]:
pd.pivot_table(data, values='age', index=['sex','smoker'], columns='region', aggfunc='mean')

,region,northeast,northwest,southeast,southwest
sex,smoker,,,,
female,no,39.840909,39.755556,39.071942,40.099291
female,yes,38.724138,38.827586,39.25,37.047619
male,no,39.216,38.568182,38.261194,40.277778
male,yes,37.868421,39.827586,40.054545,35.567568


In [None]:
pd.pivot_table(data, values='age', index=['sex','smoker'], columns='region', aggfunc=['mean','max'])

,,mean,mean,mean,mean,max,max,max,max
,region,northeast,northwest,southeast,southwest,northeast,northwest,southeast,southwest
sex,smoker,,,,,,,,
female,no,39.840909,39.755556,39.071942,40.099291,64,64,64,64
female,yes,38.724138,38.827586,39.25,37.047619,63,64,64,64
male,no,39.216,38.568182,38.261194,40.277778,64,64,64,64
male,yes,37.868421,39.827586,40.054545,35.567568,62,62,64,61
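
The `groupby(...).mean().unstack()` cell and the first `pd.pivot_table(...)` cell above produce the same table. The sketch below reproduces that comparison on a small made-up frame (`toy` and its values are illustrative) where the means are easy to verify by hand. Combinations with no rows show up as `NaN`.

```python
import pandas as pd

# Toy frame with made-up values; there is no (female, southwest) row
toy = pd.DataFrame({
    'sex': ['female', 'female', 'male', 'male'],
    'region': ['northeast', 'northeast', 'northeast', 'southwest'],
    'age': [20, 40, 30, 50],
})

# Mean age per sex (rows) and region (columns), two ways
by_pivot = pd.pivot_table(toy, values='age', index='sex',
                          columns='region', aggfunc='mean')
by_groupby = toy.groupby(['sex', 'region'])['age'].mean().unstack()

print(by_pivot)
print(by_groupby)
```

In both results, the female/northeast cell is the mean of 20 and 40, and the female/southwest cell is empty.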
