<div style="color:white;
           display:fill;
           border-radius:5px;
           background-color:#5642C5;
           font-size:200%;
           font-family:Arial;letter-spacing:0.5px">

<p width = 20%, style="padding: 10px;
              color:white;">
Pandas: Reshaping the data
              
</p>
</div>

Data Science Cohort Live NYC Nov 2023
<p>Phase 1</p>
<br>
<br>

<div align = "right">
<img src="Images/flatiron-school-logo.png" align = "right" width="200"/>
</div>
    
    

Data in the fires of the forge!


<figure><center><img src = "Images/conanbarbarian.png" width = 500></center>
</figure>



Transformation:
    
<figure><center><img src = "Images/conanthebarbarian5.png" width = 500></center>
</figure>

A suitably shaped dataset makes it easier to visualize the data and answer your questions.

#### Basic Ideas of Data Shaping in Pandas
1. Wide vs. Long Formats


<div>
    <center><img src="Images/hw_wide.png" align = "center" width="200"/></center>
    <center>Wide format</center>
</div>
    

<div>
        <center><img src="Images/hw_long.png" align = "center" width="200"/></center>
    <center>Long format</center>
</div>

2. Multi-indexing (Hierarchical indexing)
- We saw this when grouping by multiple columns.
- Very useful for quickly exploring responses conditioned on different factors.
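
As a reminder of where a multi-index comes from, here is a minimal sketch with made-up numbers (the `toy` frame and its values are illustrative assumptions, not course data). Grouping on two columns returns a result indexed by both of them:

```python
import pandas as pd

# Made-up measurements, purely illustrative.
toy = pd.DataFrame({
    "name": ["John", "John", "Melinda", "Melinda"],
    "attribute": ["Height", "Weight", "Height", "Weight"],
    "value": [182, 78, 130, 52],
})

# Grouping on two columns yields a result indexed by both of them.
grouped = toy.groupby(["name", "attribute"])["value"].mean()

index_type = type(grouped.index).__name__   # name of the index class
level_names = list(grouped.index.names)     # one name per index level

print(grouped)
print(index_type, level_names)

# A full key (one label per level) picks out a single entry.
john_height = grouped.loc[("John", "Height")]
print(john_height)
```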

<div>
    <center><img src="Images/hw_multi.png" align = "center" width="200"/></center> <br>
    <center>Multi-indexed Dataframe</center>
</div>
    

<div>
    <center><img src="Images/hw_multi_swap.png" align = "center" width="200"/></center>
    <br>
    <center>After index level swap</center>
</div>

#### Pivoting

- Convert from a long to a wide format:

   - DataFrame.pivot(index, columns, values):
  
 One column becomes the index, and the values in another column become the labels for the new columns.
 
 Best to see an example:

In [None]:
import pandas as pd

names = ['John', 'Christopher', 'Melinda']
# Heights first, then weights, listed in the same name order
value_list = [182, 160, 130, 78, 67, 52]
# Building from a dict keeps 'value' numeric (a mixed NumPy array would cast everything to strings)
physical_data = pd.DataFrame({'name': names * 2,
                              'attribute': ['Height'] * 3 + ['Weight'] * 3,
                              'value': value_list})

physical_data.head()

This is long form. Use pivot to convert to wide format:

In [None]:
wide_form = physical_data.pivot(index = 'name', columns = 'attribute', values = 'value')
wide_form

#### Melting: the inverse of pivoting.

- Take data from wide to long format.
- pd.melt(dataframe, id_vars, value_vars, var_name, value_name)

In [None]:
wide_form.reset_index(inplace = True)
wide_form

In [None]:
pd.melt(wide_form, 
        id_vars = ['name'], 
        value_vars = ['Height', 'Weight'])
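
The `var_name` and `value_name` arguments listed above label the two columns that `melt` creates. A minimal sketch on a small made-up table (the `wide` frame here is an illustrative assumption) also shows that pivoting the melted result recovers the wide layout:

```python
import pandas as pd

# Illustrative wide table (values are made up for this sketch).
wide = pd.DataFrame({
    "name": ["John", "Melinda"],
    "Height": [182, 130],
    "Weight": [78, 52],
})

# var_name / value_name label the two columns that melt creates.
long = pd.melt(wide, id_vars=["name"], value_vars=["Height", "Weight"],
               var_name="attribute", value_name="value")

# Pivoting the melted frame recovers the wide layout.
round_trip = (long.pivot(index="name", columns="attribute", values="value")
                  .reset_index()
                  .rename_axis(columns=None))  # drop the leftover 'attribute' axis label

print(long)
print(round_trip)
```

Without `var_name` and `value_name`, the new columns get the default labels `variable` and `value`.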

#### Pivot Tables

- Use when the index/column pairs you want to pivot on have non-unique entries (plain pivot raises an error on duplicates).
- E.g., temperature as a function of position X, Y for a given month, with multiple measurements at each X, Y.
- We want the average of these measurements at each X, Y in pivoted form:

    - df.pivot_table(..., aggfunc = __)
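
To make the difference concrete, here is a small sketch with made-up readings (the `readings` frame and its numbers are assumptions for illustration). Position (1, 2) is measured twice, so `pivot` refuses to reshape while `pivot_table` averages the duplicates:

```python
import pandas as pd

# Made-up readings: position (1, 2) was measured twice.
readings = pd.DataFrame({
    "X": [1, 1, 2],
    "Y": [2, 2, 2],
    "temp": [10.0, 14.0, 20.0],
})

# pivot cannot decide which duplicate to keep and raises ValueError.
pivot_failed = False
try:
    readings.pivot(index="X", columns="Y", values="temp")
except ValueError as err:
    pivot_failed = True
    print("pivot failed:", err)

# pivot_table reduces the duplicates with aggfunc instead.
mean_temp = readings.pivot_table(index="X", columns="Y", values="temp", aggfunc="mean")
print(mean_temp)
```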
    

Forest fire dataset:

Looks at temperature logged at various X, Y positions in a forest over several months.

In [None]:
forest_df = pd.read_csv('Data/forestfires.csv', usecols = ['X', 'Y', 'month', 'day', 'temp'])
inamonth_df = forest_df[(forest_df['month'] == 'mar')]

inamonth_df.head(10)

Average temperature at (X, Y) positions for March. Organized as pivot table:

In [None]:
inamonth_df.pivot_table(index = 'X', columns = 'Y', values = 'temp', aggfunc = 'mean')

#### Multi-indexing
- Setting multiple columns as index
- Setting hierarchies.
- Accessing data in multi-indexed DataFrames.

Airfoil noise dataset:
- Various factors affecting sound amplitude off of airplane wings.

In [None]:
colnames = ['Frequency [Hz]', 'Angle of attack [deg]', 'Chord length [m]',
            'Free-stream velocity [m/s]', 'Suction side thickness [m]', 'Sound volume [dB]']
airfoil_df = pd.read_csv('Data/airfoil_self_noise.dat', delimiter = '\t', header = None, names = colnames)

airfoil_df.head()

Setting multiple attributes as indices gives us flexibility in addressing the data.
- How does sound amplitude depend just on frequency, stream velocity, and angle of attack?
- Create a hierarchical MultiIndex:

In [None]:
col_subset = ['Frequency [Hz]', 'Free-stream velocity [m/s]', 'Angle of attack [deg]', 'Sound volume [dB]']
# keep these four columns, then move the first three into the index
airfoil_df = airfoil_df[col_subset].set_index(col_subset[0:3])
airfoil_df.head()

The columns are now index levels, but the rows are not yet sorted by those levels:
- The .sort_index() method orders the rows level by level, which makes the hierarchy visible and label-based slicing reliable.

In [None]:
airfoil_df = airfoil_df.sort_index()
airfoil_df
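
A minimal sketch of what sorting changes, using made-up rows and shortened, hypothetical level names (`freq`, `speed`). The index already has two levels after `set_index`; sorting only reorders the rows so that slices over index labels work:

```python
import pandas as pd

# Made-up rows, deliberately out of order.
toy = pd.DataFrame({
    "freq": [2000, 1000, 2000, 1000],
    "speed": [55.5, 31.7, 31.7, 55.5],
    "volume": [120.0, 125.0, 118.0, 127.0],
}).set_index(["freq", "speed"])

n_levels = toy.index.nlevels                          # levels exist before sorting
sorted_before = toy.index.is_monotonic_increasing     # rows are not in sorted order

sorted_toy = toy.sort_index()
sorted_after = sorted_toy.index.is_monotonic_increasing

print(n_levels, sorted_before, sorted_after)

# Range slices over index labels are reliable once the index is sorted.
low_freq = sorted_toy.loc[1000:1500]
print(low_freq)
```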

#### Accessing via the .loc accessor on multi-indices
- DataFrame.loc[first_level_index, columns]
- DataFrame.loc[(first_level_index, second_level_index, third_level_index), columns]

In [None]:
# at frequency = 1000 Hz
airfoil_df.loc[1000, :]

In [None]:
# sound vol vs angle of attack
# fixed at 1000 Hz, 55.5 m/s stream velocity
airfoil_df.loc[(1000, 55.5)]
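
The same access patterns on a small made-up frame (values and the shortened level names `freq` and `speed` are illustrative assumptions). The sketch also shows two ways to fix an inner level while leaving the outer one free, `pd.IndexSlice` and `DataFrame.xs`:

```python
import pandas as pd

# Made-up values on a small sorted two-level index.
toy = pd.DataFrame(
    {"volume": [125.0, 127.0, 118.0, 120.0]},
    index=pd.MultiIndex.from_product(
        [[1000, 2000], [31.7, 55.5]], names=["freq", "speed"]
    ),
)

# Full key plus a column label: a single value.
one_value = toy.loc[(1000, 55.5), "volume"]

# Partial key: all speeds at 1000 Hz; the freq level is dropped.
at_1000 = toy.loc[1000]

# Fix an inner level while leaving the outer one free.
idx = pd.IndexSlice
at_speed = toy.loc[idx[:, 55.5], :]

# xs selects the same rows and drops the selected level.
at_speed_xs = toy.xs(55.5, level="speed")

print(one_value)
print(at_1000)
print(at_speed)
print(at_speed_xs)
```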

Swapping level hierarchy:
- Look at measurement/response keeping one variable fixed and varying another.
- Swapping level hierarchy switches which we keep fixed and which we vary.
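
A toy illustration of the idea, with made-up values and hypothetical short level names (`freq`, `speed`). Before the swap a partial key fixes the frequency; after the swap the same kind of key fixes the speed:

```python
import pandas as pd

# Made-up values; level names are shortened stand-ins.
toy = pd.DataFrame(
    {"volume": [125.0, 127.0, 118.0, 120.0]},
    index=pd.MultiIndex.from_product(
        [[1000, 2000], [31.7, 55.5]], names=["freq", "speed"]
    ),
)

# Outer level is freq: fix a frequency, vary speed.
by_freq = toy.loc[1000]

# After the swap, speed is the outer level: fix a speed, vary frequency.
swapped = toy.swaplevel("freq", "speed").sort_index()
by_speed = swapped.loc[55.5]

print(by_freq)
print(by_speed)
```

Calling `.sort_index()` after the swap matters, since swapping levels leaves the rows in their old order.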


In [None]:
swapped_df = airfoil_df.swaplevel('Free-stream velocity [m/s]', 'Angle of attack [deg]').sort_index()

In [None]:
swapped_df.head()

In [None]:
# sound vol vs stream velocity
# fixed at 1000 Hz, 7.3 deg angle of attack
swapped_df.loc[(1000, 7.3)]

At fixed angle of attack, sound volume increases with airflow speed at 1 kHz.

Multi-indexing opens up many possibilities for data manipulation.

You are strongly encouraged to look at the supplementary material and the pandas documentation!