---   
 <img align="left" width="75" height="75"  src="https://upload.wikimedia.org/wikipedia/en/c/c8/University_of_the_Punjab_logo.png"> 

<h1 align="center">Department of Data Science</h1>
<h1 align="center">Course: Tools and Techniques for Data Science</h1>

---
<h3><div align="right">Instructor: Muhammad Arif Butt, Ph.D.</div></h3>    

<h1 align="center">Lecture 3.22 (Pandas-14)</h1>

## _Reshape and Transform Dataframe.ipynb_

## Learning agenda of this notebook
Often, we need to transform data into some other specific shape before analysis, depending on what kind of analysis we want to perform with the data. This notebook covers:
1. Reshape Data Using `pivot()` and `pivot_table()` methods
2. Reshape Data Using `melt()` method
3. Reshape Data Using `crosstab()` method

## 1. Reshaping Data Using `df.pivot()` and `df.pivot_table()` Methods
- The Pandas `df.pivot()` and `df.pivot_table()` methods allow us to look at the same data in different layouts.
- In Data Science lingo, this is called reshaping or transforming a dataset in order to glean information.
- At first these methods seem to behave the same way, but understanding where they differ (mainly in how they handle duplicate entries) will save some frustration.
- Each method takes the initial dataframe together with the `index`, `columns`, and `values` you want to see, and returns a new dataframe.
- Basically, they select the data of interest and turn row values into column headers.

**```df.pivot(index=None, columns=None, values=None)```**

Where,
- `index`: Column to use as new dataframe's index. If None, uses existing index.
- `columns`: Column to use to make new dataframe columns.
- `values`:  Column(s) to use for populating new frame's values. 

Read more about `pd.pivot()`: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.pivot.html


**`df.pivot_table(index=None, columns=None, values=None, aggfunc='mean', fill_value=None)`**

Where,
- `index`: Column(s) whose values are grouped to form the new dataframe's index.
- `columns`: Column(s) whose values are grouped to form the new dataframe's columns.
- `values`: Column(s) to aggregate when populating the new frame's values.
- `aggfunc`: Function used to combine all values that fall into the same cell. The default is `'mean'`.
- `fill_value`: Value to replace missing values with in the resulting table.



Read more about `pd.pivot_table()`: https://pandas.pydata.org/docs/reference/api/pandas.pivot_table.html
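
The `fill_value` argument described above is not used in the examples that follow, so here is a minimal sketch of its effect. The `toy` dataframe is made up for illustration (it is not one of the lecture datasets) and deliberately has no Karachi reading for `21/06/2021`.

```python
import pandas as pd

# Hypothetical toy data: Karachi has no reading for 21/06/2021
toy = pd.DataFrame({
    'date': ['20/06/2021', '21/06/2021', '20/06/2021'],
    'city': ['Lahore', 'Lahore', 'Karachi'],
    'temperature': [38, 41, 33],
})

# Without fill_value, the missing (Karachi, 21/06/2021) cell is NaN
with_nan = toy.pivot_table(index='city', columns='date', values='temperature')

# With fill_value=0, the same cell is filled with 0 instead
filled = toy.pivot_table(index='city', columns='date', values='temperature',
                         fill_value=0)

print(with_nan)
print(filled)
```

Whether 0 is a sensible replacement depends on the data; for temperatures it would usually be misleading, so it is used here only to make the effect visible.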

### DataSet 1:

In [115]:
import pandas as pd
df = pd.read_csv('datasets/pivot_weather1.csv')
df

Unnamed: 0,date,city,temperature,humidity
0,20/06/2021,Lahore,38,76
1,21/06/2021,Lahore,41,80
2,22/06/2021,Lahore,39,85
3,20/06/2021,Muree,17,71
4,21/06/2021,Muree,15,70
5,22/06/2021,Muree,18,74
6,20/06/2021,Karachi,33,93
7,21/06/2021,Karachi,30,91
8,22/06/2021,Karachi,35,90


**Suppose we want to have one record for each city, containing temperature and humidity for each date**

In [88]:
df1 = df.pivot(index='city', columns='date')
df1

Unnamed: 0_level_0,temperature,temperature,temperature,humidity,humidity,humidity
date,20/06/2021,21/06/2021,22/06/2021,20/06/2021,21/06/2021,22/06/2021
city,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2
Karachi,33,30,35,93,91,90
Lahore,38,41,39,76,80,85
Muree,17,15,18,71,70,74


Let us repeat the same using `pivot_table()`

In [79]:
df1 = df.pivot_table(index='city', columns='date')
df1

Unnamed: 0_level_0,humidity,humidity,humidity,temperature,temperature,temperature
date,20/06/2021,21/06/2021,22/06/2021,20/06/2021,21/06/2021,22/06/2021
city,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2
Karachi,93,91,90,33,30,35
Lahore,76,80,85,38,41,39
Muree,71,70,74,17,15,18


- By setting `index='city'`, the city column becomes the leftmost column (the index), with one row per unique city.
- By setting `columns='date'`, the values from the date column become the column headers.
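
Because no `values` argument was given, the result has two levels of column headers (the measurement name on top, the date below it). The sketch below shows one way to read data back out of such a frame. It rebuilds DataSet 1 inline under the assumed name `weather` so that it runs without the CSV file.

```python
import pandas as pd

# DataSet 1 rebuilt inline (the name `weather` is only for this sketch)
weather = pd.DataFrame({
    'date': ['20/06/2021', '21/06/2021', '22/06/2021'] * 3,
    'city': ['Lahore'] * 3 + ['Muree'] * 3 + ['Karachi'] * 3,
    'temperature': [38, 41, 39, 17, 15, 18, 33, 30, 35],
    'humidity': [76, 80, 85, 71, 70, 74, 93, 91, 90],
})

wide = weather.pivot(index='city', columns='date')

# Selecting the top-level label returns a sub-frame with one column per date
temps = wide['temperature']

# A (top level, second level) tuple addresses a single column
lahore_21 = wide.loc['Lahore', ('temperature', '21/06/2021')]

print(temps)
print(lahore_21)
```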


**Suppose we just want to see the temperature in the output. This can be achieved by setting the `values` argument to the name of the column**

In [80]:
df1 = df.pivot_table(index='city', columns='date', values='temperature')
df1

date,20/06/2021,21/06/2021,22/06/2021
city,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Karachi,33,30,35
Lahore,38,41,39
Muree,17,15,18


In [81]:
df1 = df.pivot(index='city', columns='date', values='temperature')
df1

date,20/06/2021,21/06/2021,22/06/2021
city,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Karachi,33,30,35
Lahore,38,41,39
Muree,17,15,18


**Suppose we want to have one record for each date, containing temperature and humidity for each city**

In [82]:
df1 = df.pivot(index='date', columns='city')
df1

Unnamed: 0_level_0,temperature,temperature,temperature,humidity,humidity,humidity
city,Karachi,Lahore,Muree,Karachi,Lahore,Muree
date,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2
20/06/2021,33,38,17,93,76,71
21/06/2021,30,41,15,91,80,70
22/06/2021,35,39,18,90,85,74


In [83]:
df1 = df.pivot_table(index='date', columns='city')
df1

Unnamed: 0_level_0,humidity,humidity,humidity,temperature,temperature,temperature
city,Karachi,Lahore,Muree,Karachi,Lahore,Muree
date,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2
20/06/2021,93,76,71,33,38,17
21/06/2021,91,80,70,30,41,15
22/06/2021,90,85,74,35,39,18


### DataSet 2:

In [100]:
import pandas as pd
df = pd.read_csv('datasets/pivot_weather2.csv')
df

Unnamed: 0,date,city,temperature,humidity
0,20/06/2021,Lahore,38,76
1,20/06/2021,Lahore,40,75
2,21/06/2021,Lahore,39,79
3,21/06/2021,Lahore,37,74
4,20/06/2021,Muree,15,88
5,20/06/2021,Muree,17,90
6,21/06/2021,Muree,10,93
7,21/06/2021,Muree,8,91


Note that in this dataset the combination of date and city is not unique: each (date, city) pair appears twice.

In [105]:
#df1 = df.pivot(index='date', columns='city')
#df1

- When we set the index to `date` and the columns to `city`, `pivot()` tries to place exactly one value in the cell for row `20/06/2021` and column `Lahore`.
- In this dataset there are two rows with the date `20/06/2021` and the city `Lahore`, so the method doesn't know which value to put into that cell.
- It therefore raises `ValueError: Index contains duplicate entries, cannot reshape`.


- `pivot()` and `pivot_table()` give the same result only if the data allows it. If the index/column combination of interest can contain duplicate entries, you need to aggregate the data with `pivot_table()`, because `pivot()` fails with the duplicate-entries error.
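
The failure described above can be reproduced and detected in advance. The following is a small sketch on a made-up frame named `dup` (an assumption for illustration, mirroring the Lahore rows of DataSet 2): `duplicated()` reports whether any (date, city) pair repeats, and the `try`/`except` captures the error raised by `pivot()`.

```python
import pandas as pd

# Hypothetical frame in which every (date, city) pair occurs twice
dup = pd.DataFrame({
    'date': ['20/06/2021', '20/06/2021', '21/06/2021', '21/06/2021'],
    'city': ['Lahore', 'Lahore', 'Lahore', 'Lahore'],
    'temperature': [38, 40, 39, 37],
})

# True if at least one (date, city) combination appears more than once
has_duplicates = bool(dup.duplicated(subset=['date', 'city']).any())

# pivot() cannot decide which value to keep, so it raises a ValueError
try:
    dup.pivot(index='date', columns='city', values='temperature')
    error_message = None
except ValueError as err:
    error_message = str(err)

print(has_duplicates)
print(error_message)
```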


- Let us try to do the same using the `pivot_table()` method.
- In `pivot_table()`, the `aggfunc` argument (default `'mean'`) decides how duplicate entries are combined into a single cell value.

In [106]:
df1 = df.pivot_table(index='date', columns='city')
df1

Unnamed: 0_level_0,humidity,humidity,temperature,temperature
city,Lahore,Muree,Lahore,Muree
date,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
20/06/2021,75.5,89.0,39,16
21/06/2021,76.5,92.0,38,9


The default value to the `aggfunc` argument is 'mean', and you can explicitly pass any other function name

In [108]:
df1 = df.pivot_table(index='date', columns='city', aggfunc='sum')
df1

Unnamed: 0_level_0,humidity,humidity,temperature,temperature
city,Lahore,Muree,Lahore,Muree
date,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
20/06/2021,151,178,78,32
21/06/2021,153,184,76,18
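
Besides a single function name, `aggfunc` also accepts a dictionary that maps each value column to its own aggregation. The sketch below rebuilds DataSet 2 inline under the assumed name `weather2` and takes the maximum temperature but the mean humidity.

```python
import pandas as pd

# DataSet 2 rebuilt inline (the name `weather2` is only for this sketch)
weather2 = pd.DataFrame({
    'date': ['20/06/2021', '20/06/2021', '21/06/2021', '21/06/2021'] * 2,
    'city': ['Lahore'] * 4 + ['Muree'] * 4,
    'temperature': [38, 40, 39, 37, 15, 17, 10, 8],
    'humidity': [76, 75, 79, 74, 88, 90, 93, 91],
})

# A different aggregation for each value column
summary = weather2.pivot_table(
    index='date',
    columns='city',
    values=['temperature', 'humidity'],
    aggfunc={'temperature': 'max', 'humidity': 'mean'},
)

print(summary)
```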


### DataSet 3:

In [110]:
import pandas as pd
df = pd.read_csv('datasets/pivot_std1.csv')
df

Unnamed: 0,gender,sport,age,height,weight
0,male,cricket,22,72,200
1,female,cricket,21,72,130
2,female,basketball,23,73,150
3,male,basketball,21,75,175
4,female,cricket,20,68,170


**Suppose we want to have one record for each gender, containing age, height and weight for each sport**

In [111]:
df1 = df.pivot_table(index='gender', columns='sport')
df1

Unnamed: 0_level_0,age,age,height,height,weight,weight
sport,basketball,cricket,basketball,cricket,basketball,cricket
gender,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2
female,23.0,20.5,73,70,150,150
male,21.0,22.0,75,72,175,200


When we try to repeat the same using `pivot()`, we get a ValueError: Index contains duplicate entries, cannot reshape

In [96]:
#df1 = df.pivot(index='gender', columns='sport')
#df1

- When we set the index to `gender` and the columns to `sport`, `pivot()` expects each (gender, sport) pair to occur at most once.
- In this case there are two rows which have `female` and play `cricket`.
- The `pivot()` method doesn't know which value to put into that cell.
- So it raises `ValueError: Index contains duplicate entries, cannot reshape`.
- The `pivot_table()` method uses its `aggfunc` argument (default `'mean'`) to decide this.

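**Use of the `margins` argument of the `pivot_table()` method:** setting `margins=True` adds an `All` row and an `All` column holding the aggregate over all rows and columns.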

In [112]:
df.pivot_table(index='gender', columns='sport', margins=True)

Unnamed: 0_level_0,age,age,age,height,height,height,weight,weight,weight
sport,basketball,cricket,All,basketball,cricket,All,basketball,cricket,All
gender,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2
female,23.0,20.5,21.333333,73,70.0,71.0,150.0,150.0,150.0
male,21.0,22.0,21.5,75,72.0,73.5,175.0,200.0,187.5
All,22.0,21.0,21.4,74,70.666667,72.0,162.5,166.666667,165.0
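
One detail worth checking in the output above is that the `All` entries are computed from the original rows, not by averaging the cell averages. The sketch below rebuilds the weight column of DataSet 3 inline under the assumed name `students`, and also uses the `margins_name` argument to rename the `All` label.

```python
import pandas as pd

# Gender, sport and weight from DataSet 3, rebuilt inline for this sketch
students = pd.DataFrame({
    'gender': ['male', 'female', 'female', 'male', 'female'],
    'sport': ['cricket', 'cricket', 'basketball', 'basketball', 'cricket'],
    'weight': [200, 130, 150, 175, 170],
})

table = students.pivot_table(index='gender', columns='sport', values='weight',
                             margins=True, margins_name='Overall')
print(table)

# The cricket margin is the mean of the three cricket rows (200, 130, 170),
# not the mean of the two cell means (150 and 200)
cricket_margin = table.loc['Overall', 'cricket']
print(cricket_margin)
```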


### DataSet 4:

In [154]:
import pandas as pd
df = pd.read_csv('datasets/waterneed.csv')
df

Unnamed: 0,uniq_id,animal,water_need,speed
0,1001,elephant,500,20
1,1002,elephant,600,25
2,1003,elephant,350,29
3,1004,tiger,300,60
4,1005,tiger,320,65
5,1006,tiger,330,70
6,1007,tiger,290,69
7,1008,tiger,310,72
8,1009,zebra,200,75
9,1010,zebra,220,77


- **The `pivot()` method requires at least the `columns` argument; if `index` is not given, the existing index is used.**

- **The `pivot_table()` method, on the contrary, can work with the `index` argument alone; the remaining columns are then aggregated using the default mean function.**

In [157]:
df1 = df.pivot_table(index='animal')
df1

Unnamed: 0_level_0,speed,uniq_id,water_need
animal,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
elephant,24.666667,1002.0,483.333333
kangaroo,19.333333,1021.0,416.666667
lion,66.25,1017.5,477.5
tiger,67.2,1006.0,310.0
zebra,77.285714,1012.0,184.285714
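
Calling `pivot_table()` with only `index` gives the same numbers as grouping by that column and taking the mean. The sketch below compares the two on a small made-up frame named `animals` (a subset of the rows shown above, used here as an assumption so the example runs without the CSV file).

```python
import pandas as pd

# A few rows in the style of DataSet 4 (hypothetical subset)
animals = pd.DataFrame({
    'animal': ['elephant', 'elephant', 'elephant', 'tiger', 'tiger', 'zebra'],
    'water_need': [500, 600, 350, 300, 320, 200],
    'speed': [20, 25, 29, 60, 65, 75],
})

# Mean of every remaining column per animal, two ways
by_pivot = animals.pivot_table(index='animal')
by_group = animals.groupby('animal').mean()

print(by_pivot)
print(by_group)
```

The only visible difference is the column order: `pivot_table()` sorts the column names, while `groupby()` keeps the original order.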


You can apply an aggregation function to the new dataframe as well, for example to compute the average speed (here, the mean of the per-animal mean speeds)

In [138]:
df1['speed'].agg('mean')

50.94714285714285

In [127]:
# You can also perform aggregation on several columns to summarize the data
df1[['speed','water_need']].agg('mean')

speed          50.947143
water_need    374.357143
dtype: float64

**Multilevel indexing:** You can perform multi-level indexing by passing the columns as a list to the `index` argument of `pivot_table()`

In [130]:
df1 = df.pivot_table(index=['animal','uniq_id'])
df1

Unnamed: 0_level_0,Unnamed: 1_level_0,speed,water_need
animal,uniq_id,Unnamed: 2_level_1,Unnamed: 3_level_1
elephant,1001,20,500
elephant,1002,25,600
elephant,1003,29,350
kangaroo,1020,19,410
kangaroo,1021,22,430
kangaroo,1022,17,410
lion,1016,66,420
lion,1017,67,600
lion,1018,68,500
lion,1019,64,390
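
Rows of a multi-level index can be selected with `.loc`, either by the outer label alone or by a tuple of labels. This is a minimal sketch on a made-up subset of the data named `animals` (an assumption for illustration).

```python
import pandas as pd

# A few rows in the style of DataSet 4 (hypothetical subset)
animals = pd.DataFrame({
    'uniq_id': [1001, 1002, 1003, 1004, 1005],
    'animal': ['elephant', 'elephant', 'elephant', 'tiger', 'tiger'],
    'water_need': [500, 600, 350, 300, 320],
    'speed': [20, 25, 29, 60, 65],
})

nested = animals.pivot_table(index=['animal', 'uniq_id'])

# Outer label only: all elephant rows, indexed by uniq_id
elephants = nested.loc['elephant']

# Tuple of (outer, inner) labels plus a column name: a single value
water = nested.loc[('tiger', 1004), 'water_need']

print(elephants)
print(water)
```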


## 2. Reshaping Data Using `df.melt()` Method
https://www.youtube.com/watch?v=oY62o-tBHF4&list=PLeo1K3hjS3uuASpe-1LjfG5f14Bnozjwy&index=11

The Pandas `melt()` function is used to change a DataFrame from wide format to long format. One or more columns work as identifiers, and all the remaining columns are unpivoted to the row axis, leaving only two non-identifier columns: `variable` and `value`.
In other words, `pd.melt()` reshapes a DataFrame so that one or more columns are identifier variables (`id_vars`), while all other columns are considered measured variables (`value_vars`). Its signature is:
```
pandas.melt(frame, id_vars=None, value_vars=None, var_name=None, value_name='value', ignore_index=True)
```
Where,
- `id_vars`: tuple, list, or ndarray, optional  (Column(s) to use as identifier variables)
- `value_vars`: tuple, list, or ndarray, optional (Column(s) to unpivot. If not specified, uses all columns that are not set as `id_vars`)
- `var_name`: Name to use for the ‘variable’ column. If None it uses frame.columns.name or ‘variable’.
- `value_name`: Name to use for the ‘value’ column.
- `ignore_index`: bool, default True (If True, original index is ignored. If False, the original index is retained.)

In [148]:
# Reading data from 'datasets/sample2.csv' file
import numpy as np
import pandas as pd
df = pd.read_csv('datasets/sample2.csv')
df

Unnamed: 0,group,math,english,urdu
0,group A,55,45,75
1,group B,35,70,43
2,group C,67,65,69
3,group D,35,45,78
4,group E,27,39,56


This dataframe has groups with their marks in math, english and urdu respectively. Transform the dataframe in such a way that the subjects appear in rows instead of columns

In [149]:
df1 = pd.melt(df, id_vars = ['group'])
df1

Unnamed: 0,group,variable,value
0,group A,math,55
1,group B,math,35
2,group C,math,67
3,group D,math,35
4,group E,math,27
5,group A,english,45
6,group B,english,70
7,group C,english,65
8,group D,english,45
9,group E,english,39


You can change the names of the new columns, for example replacing `variable` and `value` with more meaningful names

In [150]:
df1 = pd.melt(df, id_vars = ['group'], var_name = 'subjects', value_name = 'marks')
df1

Unnamed: 0,group,subjects,marks
0,group A,math,55
1,group B,math,35
2,group C,math,67
3,group D,math,35
4,group E,math,27
5,group A,english,45
6,group B,english,70
7,group C,english,65
8,group D,english,45
9,group E,english,39
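
The `value_vars` parameter listed in the signature restricts which columns are unpivoted. The sketch below rebuilds the marks data inline under the assumed name `marks` and melts only the math and urdu columns, so english is dropped from the result.

```python
import pandas as pd

# The marks data rebuilt inline (the name `marks` is only for this sketch)
marks = pd.DataFrame({
    'group': ['group A', 'group B', 'group C', 'group D', 'group E'],
    'math': [55, 35, 67, 35, 27],
    'english': [45, 70, 65, 45, 39],
    'urdu': [75, 43, 69, 78, 56],
})

# Unpivot only two of the three subject columns
subset = pd.melt(marks, id_vars=['group'], value_vars=['math', 'urdu'],
                 var_name='subjects', value_name='marks')

print(subset)
```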


**Perform Aggregation:** You can apply aggregation function on the new dataframe as well, such as compute the average marks

In [151]:
df1['marks'].agg('mean')

53.6

**Filter Data:** You can now perform various operations on this new dataframe. For example, to keep only the results for the english subject:

In [152]:
df1[df1['subjects'] == 'english']

Unnamed: 0,group,subjects,marks
5,group A,english,45
6,group B,english,70
7,group C,english,65
8,group D,english,45
9,group E,english,39


You can compute the average marks of all groups in english subject

In [153]:
# compute the average marks of all the groups in the english subject
df1[df1['subjects'] == 'english' ].marks.agg('mean')

52.8
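
Since `melt()` goes from wide to long and `pivot()` goes from long to wide, the two can undo each other. The following sketch, on an inline copy of the marks data under the assumed name `marks`, melts the frame and then pivots it back. Note that `pivot()` returns the subject columns in alphabetical order.

```python
import pandas as pd

# The marks data rebuilt inline (the name `marks` is only for this sketch)
marks = pd.DataFrame({
    'group': ['group A', 'group B', 'group C', 'group D', 'group E'],
    'math': [55, 35, 67, 35, 27],
    'english': [45, 70, 65, 45, 39],
    'urdu': [75, 43, 69, 78, 56],
})

# Wide to long
long_df = pd.melt(marks, id_vars=['group'], var_name='subjects', value_name='marks')

# Long back to wide: one row per group, one column per subject
wide_again = long_df.pivot(index='group', columns='subjects', values='marks')

# Turn the index back into a regular column and drop the 'subjects' axis label
wide_again = wide_again.reset_index()
wide_again.columns.name = None

print(wide_again)
```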

## 3. Reshaping Data Using `pd.crosstab()` Function

https://www.youtube.com/watch?v=I_kUj-MfYys&list=PLeo1K3hjS3uuASpe-1LjfG5f14Bnozjwy&index=13

`pd.crosstab()` computes a simple cross tabulation of two (or more) factors. By default it computes a frequency table of the factors, unless an array of values and an aggregation function are passed.
```
pandas.crosstab(index, 
                columns, 
                aggfunc=None,
                values=None,
                margins=False, 
                normalize=False)
```
Where,
- `index`: array-like, Series, or list of arrays/Series (Values to group by in the rows)
- `columns`: array-like, Series, or list of arrays/Series (Values to group by in the columns)
- `values`: array-like, optional (Array of values to aggregate according to the factors. Requires `aggfunc` to be specified)
- `aggfunc`: function, optional (If specified, requires `values` to be specified as well)
- `margins`: bool, default False (Add row/column margins, i.e. subtotals)
- `normalize`: bool, {'all', 'index', 'columns'}, or {0, 1}, default False (Normalize by dividing all values by the sum of values)

In [143]:
# Reading data from 'datasets/sample1.csv' file
import numpy as np
import pandas as pd
df = pd.read_csv('datasets/sample1.csv')
df.head()

Unnamed: 0,city,gender,age
0,Lahore,male,35
1,Lahore,male,40
2,Lahore,female,70
3,Lahore,female,25
4,Lahore,female,33


Use the `city` column for the rows and the `gender` column for the columns. `crosstab()` then computes the frequency of each gender, i.e. how many males and females there are in each city

In [144]:
pd.crosstab(df.city, df.gender)

gender,female,male
city,Unnamed: 1_level_1,Unnamed: 2_level_1
Islamabad,2,2
Karachi,1,2
Lahore,3,2
Multan,1,1


In [145]:
# you can also get the row and column totals (total per city and total males/females) by setting the margins argument to True
pd.crosstab(df.city, df.gender, margins=True)

gender,female,male,All
city,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Islamabad,2,2,4
Karachi,1,2,3
Lahore,3,2,5
Multan,1,1,2
All,7,7,14


In [146]:
# instead of getting frequencies as whole numbers, you can also calculate the proportion of males and females in each city (each row sums to 1)
pd.crosstab(df.city, df.gender, normalize='index')

gender,female,male
city,Unnamed: 1_level_1,Unnamed: 2_level_1
Islamabad,0.5,0.5
Karachi,0.333333,0.666667
Lahore,0.6,0.4
Multan,0.5,0.5
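
The `normalize` argument also accepts `'columns'` and `'all'`. The sketch below compares the three settings on a small made-up frame named `people` (an assumption for illustration, not the contents of `sample1.csv`).

```python
import pandas as pd

# Hypothetical data: 3 people in Lahore, 2 in Karachi
people = pd.DataFrame({
    'city': ['Lahore', 'Lahore', 'Lahore', 'Karachi', 'Karachi'],
    'gender': ['male', 'female', 'female', 'male', 'male'],
})

# Each row sums to 1: share of each gender within a city
row_share = pd.crosstab(people.city, people.gender, normalize='index')

# Each column sums to 1: share of each city within a gender
col_share = pd.crosstab(people.city, people.gender, normalize='columns')

# The whole table sums to 1: share of each (city, gender) cell overall
all_share = pd.crosstab(people.city, people.gender, normalize='all')

print(row_share)
print(col_share)
print(all_share)
```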


#### d. Aggregate DataFrame

In [147]:
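# Average age of males and females in different cities
# set the values argument to the age column, and perform the mean aggregate function on it
# (the string 'mean' is used instead of np.mean, which triggers a FutureWarning in recent pandas versions)
pd.crosstab(df.city, df.gender, values=df.age, aggfunc='mean')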

gender,female,male
city,Unnamed: 1_level_1,Unnamed: 2_level_1
Islamabad,15.0,35.0
Karachi,24.0,47.5
Lahore,42.666667,37.5
Multan,65.0,24.0
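
When `values` and `aggfunc` are given, `crosstab()` produces the same table as `pivot_table()` with the matching arguments. The sketch below checks this on a small made-up frame named `people` (an assumption for illustration, not the contents of `sample1.csv`).

```python
import pandas as pd

# Hypothetical data with an age column
people = pd.DataFrame({
    'city': ['Lahore', 'Lahore', 'Lahore', 'Karachi', 'Karachi'],
    'gender': ['male', 'female', 'female', 'male', 'female'],
    'age': [35, 70, 25, 40, 24],
})

# Mean age per (city, gender) using crosstab ...
by_crosstab = pd.crosstab(people.city, people.gender,
                          values=people.age, aggfunc='mean')

# ... and the same table using pivot_table
by_pivot = people.pivot_table(index='city', columns='gender',
                              values='age', aggfunc='mean')

print(by_crosstab)
print(by_pivot)
```

A practical difference is the input: `crosstab()` takes arrays or Series, while `pivot_table()` takes column names of an existing dataframe.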
