---   
 <img align="left" width="75" height="75"  src="https://upload.wikimedia.org/wikipedia/en/c/c8/University_of_the_Punjab_logo.png"> 

<h1 align="center">Department of Data Science</h1>
<h1 align="center">Course: Tools and Techniques for Data Science</h1>

---
<h3><div align="right">Instructor: Muhammad Arif Butt, Ph.D.</div></h3>    

<h1 align="center">Pandas</h1>

## _13-Reshape and Transform Data.ipynb_

## Learning agenda of this notebook


1. What is meant by reshaping a DataFrame?
    - Reshape Data Using `melt()` Function
    - Reshape Data Using `pivot_table()`
    - Reshape Data Using `crosstab()`

## 1. Reshape/Transform Data
We often need to transform data into a different shape before analysis. The target shape depends on the kind of analysis you want to perform on your data.

### 1. Reshape Data Using `melt()` Function

#### a. Prepare DataFrame

In [18]:
# Reading data from 'datasets/sample2.csv' file
import numpy as np
import pandas as pd
df = pd.read_csv('datasets/sample2.csv')
df

Unnamed: 0,group,math,english,urdu
0,group A,55,45,75
1,group B,35,70,43
2,group C,67,65,69
3,group D,35,45,78
4,group E,27,39,56


#### b. Transform data
The `melt()` function reshapes a DataFrame from wide to long format: one or more columns are kept as identifier variables (`id_vars`), while all other columns are treated as measured variables (`value_vars`) and unpivoted into rows. Its signature is:
```
pandas.melt(
        frame, 
        id_vars=None, 
        value_vars=None, 
        var_name=None, 
        value_name='value',
        ignore_index=True)
```
Where,
- id_vars: tuple, list, or ndarray, optional  (Column(s) to use as identifier variables)
- value_vars: tuple, list, or ndarray, optional (Column(s) to unpivot. If not specified, uses all columns that are not set as id_vars)
- var_name: Name to use for the ‘variable’ column. If None it uses frame.columns.name or ‘variable’.
- value_name: Name to use for the ‘value’ column.
- ignore_index: bool, default True (If True, original index is ignored. If False, the original index is retained.)
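
The parameters above can be tried on a small frame before applying them to the real dataset. The sketch below is only an illustration: the frame `scores` is built inline (it is an assumption, not the contents of the CSV file), and it shows how `value_vars` limits which columns are unpivoted and how `ignore_index=False` keeps the original row labels.

```python
import pandas as pd

# Illustrative frame built inline (an assumption, not read from the CSV file)
scores = pd.DataFrame({
    "group": ["group A", "group B"],
    "math": [55, 35],
    "english": [45, 70],
    "urdu": [75, 43],
})

# Unpivot only two of the three subject columns; 'urdu' is left out of the result
long_scores = pd.melt(
    scores,
    id_vars=["group"],
    value_vars=["math", "english"],
    var_name="subject",
    value_name="marks",
    ignore_index=False,  # keep the original row labels (0 and 1) instead of renumbering
)
print(long_scores)
```

With `ignore_index=False` each original row label appears once per melted column, which can help when you need to trace a long-format row back to its source row.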

In [19]:
# Our dataframe has groups with marks in math, english and urdu respectively
# Transform the dataframe in such a way that subjects appear in rows instead of columns
df1 = pd.melt(df, id_vars = ['group'])
df1

Unnamed: 0,group,variable,value
0,group A,math,55
1,group B,math,35
2,group C,math,67
3,group D,math,35
4,group E,math,27
5,group A,english,45
6,group B,english,70
7,group C,english,65
8,group D,english,45
9,group E,english,39


In [20]:
# you can change the name of columns
# for example, replace the variable and value with some meaningful names
df1 = pd.melt(df, id_vars = ['group'], var_name = 'subjects', value_name = 'marks')
df1

Unnamed: 0,group,subjects,marks
0,group A,math,55
1,group B,math,35
2,group C,math,67
3,group D,math,35
4,group E,math,27
5,group A,english,45
6,group B,english,70
7,group C,english,65
8,group D,english,45
9,group E,english,39
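
A melted frame can be turned back into its wide form when needed. The following is a minimal sketch, assuming an inline frame named `wide` (hypothetical, not the CSV data); it uses `DataFrame.pivot()`, which is one possible inverse of `melt()` when each (group, subject) pair occurs only once.

```python
import pandas as pd

# Illustrative wide frame (an assumption, not read from the CSV file)
wide = pd.DataFrame({
    "group": ["group A", "group B", "group C"],
    "math": [55, 35, 67],
    "english": [45, 70, 65],
})

# Wide -> long: one row per (group, subject) pair
long_df = wide.melt(id_vars="group", var_name="subjects", value_name="marks")

# Long -> wide: one row per group, one column per subject
restored = long_df.pivot(index="group", columns="subjects", values="marks")
restored = restored.reset_index()   # move 'group' back into a regular column
restored.columns.name = None        # drop the leftover 'subjects' axis label
print(restored)
```

Note that `pivot()` sorts the new columns alphabetically, so the column order of `restored` may differ from the original frame.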


#### c. Filter Data

In [21]:
# now you can perform various operations on this dataframe
# for example you want to filter the results only for english subject
df1[df1['subjects'] == 'english']

Unnamed: 0,group,subjects,marks
5,group A,english,45
6,group B,english,70
7,group C,english,65
8,group D,english,45
9,group E,english,39


#### d. Perform Aggregation

In [22]:
# you can apply aggregation function on the new dataframe as well, such as compute the average marks
df1['marks'].agg('mean')

53.6

In [23]:
# compute the total marks of all the groups in urdu subject
df1[df1['subjects'] == 'urdu' ].marks.agg('sum')

321
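
Instead of filtering one subject at a time, the long format also lets you aggregate every subject in a single step. The sketch below rebuilds the five groups shown earlier inline (so it does not depend on the CSV file) and uses `groupby()`; the names `scores`, `long_scores` and `summary` are chosen here only for illustration.

```python
import pandas as pd

# The five groups from the sample output above, rebuilt inline for a self-contained example
scores = pd.DataFrame({
    "group": ["group A", "group B", "group C", "group D", "group E"],
    "math": [55, 35, 67, 35, 27],
    "english": [45, 70, 65, 45, 39],
    "urdu": [75, 43, 69, 78, 56],
})
long_scores = scores.melt(id_vars="group", var_name="subjects", value_name="marks")

# Total and average marks for every subject at once
summary = long_scores.groupby("subjects")["marks"].agg(["sum", "mean"])
print(summary)
```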

### 2. Reshape Data Using `pivot_table()`

#### a. Prepare DataFrame

In [24]:
# Reading data from 'datasets/sample.csv' file
import numpy as np
import pandas as pd
df = pd.read_csv('datasets/sample.csv')
df.head()

Unnamed: 0,uniq_id,animal,water_need,speed
0,1001,elephant,500,20
1,1002,elephant,600,25
2,1003,elephant,350,29
3,1004,tiger,300,60
4,1005,tiger,320,65


#### b. Transform Data
The `pivot_table()` method creates a spreadsheet-style pivot table as a DataFrame.
The levels in the pivot table are stored in MultiIndex objects (hierarchical indexes) on the index and columns of the resulting DataFrame. Its signature is:
```
DataFrame.pivot_table( 
                    index=None, 
                    columns=None, 
                    aggfunc='mean', 
                    margins=False)
```
Where,
- index: column, Grouper, array, or list of the previous (Keys to group by on the pivot table index)
- columns: column, Grouper, array, or list of the previous (Keys to group by on the pivot table columns)
- aggfunc: function, list of functions, dict, default 'mean'
- margins: bool, default False (Add row/column totals, e.g. subtotals and grand totals)
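
To see how `aggfunc` and `margins` behave, here is a minimal sketch on an inline frame that mirrors the first five rows of the sample data (the frame `animals` is an assumption for illustration). It also uses the `values` parameter of `pivot_table()`, which is not listed above and selects the column(s) to aggregate.

```python
import pandas as pd

# Illustrative frame mirroring the first five sample rows (built inline, not read from the CSV)
animals = pd.DataFrame({
    "uniq_id": [1001, 1002, 1003, 1004, 1005],
    "animal": ["elephant", "elephant", "elephant", "tiger", "tiger"],
    "water_need": [500, 600, 350, 300, 320],
    "speed": [20, 25, 29, 60, 65],
})

# A dict lets each column use its own aggregation function
summary = animals.pivot_table(
    index="animal",
    values=["water_need", "speed"],
    aggfunc={"water_need": "sum", "speed": "max"},
)
print(summary)

# margins=True appends an 'All' row computed over the whole column
with_totals = animals.pivot_table(
    index="animal", values="speed", aggfunc="mean", margins=True
)
print(with_totals)
```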

In [25]:
# you can transform the dataframe as desired using pivot_table
# pass the column name that you want to use as an index
# it will use the animal column as an index and summarize the remaining columns (mean by default)
df.pivot_table(index='animal')

Unnamed: 0_level_0,speed,uniq_id,water_need
animal,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
elephant,24.666667,1002.0,483.333333
kangaroo,19.333333,1021.0,416.666667
lion,66.25,1017.5,477.5
tiger,67.2,1006.0,310.0
zebra,77.285714,1012.0,184.285714


In [27]:
# you can build a multi-level index by passing a list of columns to pivot_table

d1 = df.pivot_table(index=['animal','uniq_id'])
d1

Unnamed: 0_level_0,Unnamed: 1_level_0,speed,water_need
animal,uniq_id,Unnamed: 2_level_1,Unnamed: 3_level_1
elephant,1001,20,500
elephant,1002,25,600
elephant,1003,29,350
kangaroo,1020,19,410
kangaroo,1021,22,430
kangaroo,1022,17,410
lion,1016,66,420
lion,1017,67,600
lion,1018,68,500
lion,1019,64,390


#### c. Aggregate DataFrame

In [26]:
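# You can also perform aggregation to summarize data
# d1 is indexed by (animal, uniq_id), so group on the 'animal' level to get per-animal means
d1[['speed','water_need']].groupby(level='animal').mean()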


Unnamed: 0_level_0,speed,water_need
animal,Unnamed: 1_level_1,Unnamed: 2_level_1
elephant,24.666667,483.333333
kangaroo,19.333333,416.666667
lion,66.25,477.5
tiger,67.2,310.0
zebra,77.285714,184.285714


### 3. Reshape Data Using `crosstab()`
A contingency table (also known as a cross tabulation or crosstab) is a table in matrix format that displays the (multivariate) frequency distribution of the variables.

#### a. Prepare DataFrame

In [27]:
# Reading data from 'datasets/sample1.csv' file
import numpy as np
import pandas as pd
df = pd.read_csv('datasets/sample1.csv')
df.head()

Unnamed: 0,city,gender,age
0,Lahore,male,35
1,Lahore,male,40
2,Lahore,female,70
3,Lahore,female,25
4,Lahore,female,33


#### b. Transform Data
`crosstab()` computes a simple cross tabulation of two (or more) factors. By default it computes a frequency table of the factors unless an array of values and an aggregation function are passed. Its signature is:
```
pandas.crosstab(index, 
                columns, 
                values=None,
                aggfunc=None,
                margins=False, 
                normalize=False)
```
Where,
- index: array-like, Series, or list of arrays/Series (Values to group by in the rows)
- columns: array-like, Series, or list of arrays/Series (Values to group by in the columns)
- values: array-like, optional (Array of values to aggregate according to the factors. Requires aggfunc be specified)
- aggfunc: function, optional (If specified, requires values be specified as well)
- margins: bool, default False (Add row/column margins (subtotals))
- normalize: bool, {‘all’, ‘index’, ‘columns’}, or {0,1}, default False (Normalize by dividing all values by the sum of values)

In [28]:
# use column 'city' for the rows and 'gender' for the columns
# crosstab counts how many males and females there are in each city
pd.crosstab(df.city, df.gender)

gender,female,male
city,Unnamed: 1_level_1,Unnamed: 2_level_1
Islamabad,2,2
Karachi,1,2
Lahore,3,2
Multan,1,1


In [29]:
# you can also get the total count of males and females in each city by setting the margins argument to True
pd.crosstab(df.city, df.gender, margins=True)

gender,female,male,All
city,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Islamabad,2,2,4
Karachi,1,2,3
Lahore,3,2,5
Multan,1,1,2
All,7,7,14


In [30]:
# instead of whole-number frequencies, you can compute the proportion of males and females in each city (each row sums to 1)
pd.crosstab(df.city, df.gender, normalize='index')

gender,female,male
city,Unnamed: 1_level_1,Unnamed: 2_level_1
Islamabad,0.5,0.5
Karachi,0.333333,0.666667
Lahore,0.6,0.4
Multan,0.5,0.5
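
The `normalize` argument accepts other values besides `'index'`. The sketch below compares the three string options on a small inline frame; the frame `people` is made up for illustration and is not the contents of the CSV file.

```python
import pandas as pd

# Made-up data for illustration (not the contents of the CSV file)
people = pd.DataFrame({
    "city": ["Lahore", "Lahore", "Lahore", "Karachi", "Karachi"],
    "gender": ["male", "female", "female", "male", "male"],
})

# 'index': each row (city) sums to 1
by_row = pd.crosstab(people.city, people.gender, normalize="index")

# 'columns': each column (gender) sums to 1
by_col = pd.crosstab(people.city, people.gender, normalize="columns")

# 'all': the whole table sums to 1
overall = pd.crosstab(people.city, people.gender, normalize="all")

print(by_row)
print(by_col)
print(overall)
```

Use `'index'` to compare the gender split within each city, `'columns'` to see how each gender is distributed across cities, and `'all'` for each cell's share of the total.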


#### c. Aggregate DataFrame

In [31]:
# Average age of males and females in different cities
# pass the age column as values and apply the mean aggregate function to it
pd.crosstab(df.city, df.gender, values=df.age, aggfunc='mean')

gender,female,male
city,Unnamed: 1_level_1,Unnamed: 2_level_1
Islamabad,15.0,35.0
Karachi,24.0,47.5
Lahore,42.666667,37.5
Multan,65.0,24.0
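
When `values` and `aggfunc` are passed, `crosstab()` gives the same kind of result as `pivot_table()` with both `index` and `columns` set. The sketch below checks this on a small inline frame; the data in `people` is made up for illustration and is not taken from the CSV file.

```python
import pandas as pd

# Made-up data for illustration (not the contents of the CSV file)
people = pd.DataFrame({
    "city": ["Lahore", "Lahore", "Lahore", "Karachi", "Karachi"],
    "gender": ["male", "female", "female", "male", "female"],
    "age": [35, 70, 25, 40, 24],
})

# Mean age per city and gender using crosstab
via_crosstab = pd.crosstab(
    people.city, people.gender, values=people.age, aggfunc="mean"
)

# The same summary using pivot_table
via_pivot = people.pivot_table(
    index="city", columns="gender", values="age", aggfunc="mean"
)

print(via_crosstab)
print(via_pivot)
```

`crosstab()` is convenient when the factors are separate arrays or Series, while `pivot_table()` reads more naturally when everything lives in one DataFrame.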
