# **`Reshaping`**

Data isn't always given to us in the format that's most convenient for our analysis.
Therefore, we need to be able to restructure data into both wide and long formats,
depending on the analysis we want to perform. In long format, each row holds a single
observation of a single variable; in wide format, each variable gets its own column.
For many analyses, we will want wide format data so that we can look at the summary
statistics easily and share our results in that format.

In this section, we will explore transposing, pivoting, and melting data.
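As a quick illustration of the two layouts, here is a minimal sketch that builds the same readings in both formats. The names `toy_long` and `toy_wide` are hypothetical and used only for illustration; the values are the first few readings shown later in this section.

```python
import pandas as pd

# Hypothetical toy data in long format: one row per (date, datatype) reading.
toy_long = pd.DataFrame({
    "date": ["2018-10-01", "2018-10-01", "2018-10-02", "2018-10-02"],
    "datatype": ["TMAX", "TMIN", "TMAX", "TMIN"],
    "temp_C": [21.1, 8.9, 23.9, 13.9],
})

# The same information in wide format: one row per date, one column per datatype.
toy_wide = pd.DataFrame({
    "date": ["2018-10-01", "2018-10-02"],
    "TMAX": [21.1, 23.9],
    "TMIN": [8.9, 13.9],
})

print(toy_long.shape)  # (4, 3)
print(toy_wide.shape)  # (2, 3)
```

Both frames carry four temperature readings; only the arrangement of rows and columns differs.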

In [6]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [12]:
long_df = (
    pd.read_csv("../data/long_data.csv", usecols=["date", "datatype", "value"])
    .rename(columns={"value": "temp_C"})
    .assign(
        date=lambda x: pd.to_datetime(x["date"]),
        temp_F=lambda x: (x["temp_C"] * 9 / 5) + 32,  # Celsius to Fahrenheit
    )
)
long_df

,datatype,date,temp_C,temp_F
0,TMAX,2018-10-01,21.1,69.98
1,TMIN,2018-10-01,8.9,48.02
2,TOBS,2018-10-01,13.9,57.02
3,TMAX,2018-10-02,23.9,75.02
4,TMIN,2018-10-02,13.9,57.02
...,...,...,...,...
88,TMIN,2018-10-30,2.2,35.96
89,TOBS,2018-10-30,5.0,41.00
90,TMAX,2018-10-31,12.2,53.96
91,TMIN,2018-10-31,0.0,32.00


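## **`Transposing DataFrames`**

Transposing swaps the rows and columns of a DataFrame. The `T` attribute returns the
transposed frame, so after moving `date` into the index, each date becomes a column
label and the former columns become rows: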

In [14]:
long_df.set_index("date").head().T

date,2018-10-01,2018-10-01,2018-10-01,2018-10-02,2018-10-02
datatype,TMAX,TMIN,TOBS,TMAX,TMIN
temp_C,21.1,8.9,13.9,23.9,13.9
temp_F,69.98,48.02,57.02,75.02,57.02
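The effect of `T` is easier to see on a tiny frame. The following is a minimal sketch on hypothetical toy data (the name `toy` is an assumption; the values come from the first two rows above):

```python
import pandas as pd

# Hypothetical toy frame: rows are measurement types, columns are units.
toy = pd.DataFrame(
    {"temp_C": [21.1, 8.9], "temp_F": [69.98, 48.02]},
    index=["TMAX", "TMIN"],
)

# .T swaps the axes: the old column labels become the index and vice versa.
transposed = toy.T

print(list(transposed.index))    # ['temp_C', 'temp_F']
print(list(transposed.columns))  # ['TMAX', 'TMIN']
```

Transposing twice returns a frame with the original layout.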


## **`Pivoting DataFrames`**

We pivot our data to go from long format to wide format. The `pivot()` method
performs this restructuring of our `DataFrame` object. To pivot, we need to tell pandas
which column will become the row index (the `index` argument), which column contains
what will become the column names in wide format (the `columns` argument), and which
column currently holds the values (the `values` argument):

In [15]:
pivot_df = long_df.pivot(index="date", columns="datatype", values="temp_C")

In [17]:
pivot_df.head()

datatype,TMAX,TMIN,TOBS
date,,,
2018-10-01,21.1,8.9,13.9
2018-10-02,23.9,13.9,17.2
2018-10-03,25.0,15.6,16.1
2018-10-04,22.8,11.7,11.7
2018-10-05,23.3,11.7,18.9


In [18]:
pivot_df.describe()

datatype,TMAX,TMIN,TOBS
count,31.0,31.0,31.0
mean,16.829032,7.56129,10.022581
std,5.714962,6.513252,6.59655
min,7.8,-1.1,-1.1
25%,12.75,2.5,5.55
50%,16.1,6.7,8.3
75%,21.95,13.6,16.1
max,26.7,17.8,21.7


In [21]:
pivot_df = long_df.pivot(index="date", columns="datatype", values=["temp_C", "temp_F"])
pivot_df.head()

,temp_C,temp_C,temp_C,temp_F,temp_F,temp_F
datatype,TMAX,TMIN,TOBS,TMAX,TMIN,TOBS
date,,,,,,
2018-10-01,21.1,8.9,13.9,69.98,48.02,57.02
2018-10-02,23.9,13.9,17.2,75.02,57.02,62.96
2018-10-03,25.0,15.6,16.1,77.0,60.08,60.98
2018-10-04,22.8,11.7,11.7,73.04,53.06,53.06
2018-10-05,23.3,11.7,18.9,73.94,53.06,66.02
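Passing a list to `values` gives the result two levels of column labels, as the output above shows. The following is a minimal sketch of how such columns can be selected, assuming a small hypothetical frame (`toy_long`) in place of the CSV data:

```python
import pandas as pd

# Hypothetical toy data standing in for long_df.
toy_long = pd.DataFrame({
    "date": pd.to_datetime(["2018-10-01", "2018-10-01", "2018-10-02", "2018-10-02"]),
    "datatype": ["TMAX", "TMIN", "TMAX", "TMIN"],
    "temp_C": [21.1, 8.9, 23.9, 13.9],
})
toy_long["temp_F"] = toy_long["temp_C"] * 9 / 5 + 32

toy_pivot = toy_long.pivot(
    index="date", columns="datatype", values=["temp_C", "temp_F"]
)

# Selecting the top level returns every datatype column beneath it.
fahrenheit = toy_pivot["temp_F"]

# A (top level, second level) tuple selects a single column as a Series.
tmin_f = toy_pivot[("temp_F", "TMIN")]

print(fahrenheit)
print(tmin_f)
```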


We have been working with a single index throughout this chapter; however, we can create
an index from any number of columns with `set_index()`. This gives us an index of
type **MultiIndex**, where the outermost level corresponds to the first element in the list
provided to `set_index()`:

In [24]:
multi_index_df = long_df.set_index(["date", "datatype"])

multi_index_df.head().index

MultiIndex([('2018-10-01', 'TMAX'),
            ('2018-10-01', 'TMIN'),
            ('2018-10-01', 'TOBS'),
            ('2018-10-02', 'TMAX'),
            ('2018-10-02', 'TMIN')],
           names=['date', 'datatype'])

In [25]:
multi_index_df.head()

date,datatype,temp_C,temp_F
2018-10-01,TMAX,21.1,69.98
2018-10-01,TMIN,8.9,48.02
2018-10-01,TOBS,13.9,57.02
2018-10-02,TMAX,23.9,75.02
2018-10-02,TMIN,13.9,57.02
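To see how the order of the list passed to `set_index()` determines the order of the index levels, here is a minimal sketch on hypothetical toy data (`toy_long`, `by_date`, and `by_type` are illustrative names):

```python
import pandas as pd

# Hypothetical toy data standing in for long_df.
toy_long = pd.DataFrame({
    "date": ["2018-10-01", "2018-10-01", "2018-10-02", "2018-10-02"],
    "datatype": ["TMAX", "TMIN", "TMAX", "TMIN"],
    "temp_C": [21.1, 8.9, 23.9, 13.9],
})

# The first element of the list becomes the outermost index level.
by_date = toy_long.set_index(["date", "datatype"])
by_type = toy_long.set_index(["datatype", "date"])

print(list(by_date.index.names))  # ['date', 'datatype']
print(list(by_type.index.names))  # ['datatype', 'date']

# A tuple with one label per level selects a single row.
print(by_date.loc[("2018-10-01", "TMIN"), "temp_C"])  # 8.9
```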


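Note that we now have two levels in the index.
The `pivot()` method builds its result from columns, not from an existing index, so when
the data already has a multi-level index, we should use the `unstack()` method instead.
By default, `unstack()` moves the innermost index level (here, `datatype`) to the columns: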

In [27]:
unstacked_df = multi_index_df.unstack()
unstacked_df.head()

,temp_C,temp_C,temp_C,temp_F,temp_F,temp_F
datatype,TMAX,TMIN,TOBS,TMAX,TMIN,TOBS
date,,,,,,
2018-10-01,21.1,8.9,13.9,69.98,48.02,57.02
2018-10-02,23.9,13.9,17.2,75.02,57.02,62.96
2018-10-03,25.0,15.6,16.1,77.0,60.08,60.98
2018-10-04,22.8,11.7,11.7,73.04,53.06,53.06
2018-10-05,23.3,11.7,18.9,73.94,53.06,66.02
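The output above matches the earlier result of `pivot()` with two value columns. The following is a minimal sketch, on hypothetical toy data, that checks the two routes against each other:

```python
import pandas as pd

# Hypothetical toy data standing in for long_df.
toy_long = pd.DataFrame({
    "date": ["2018-10-01", "2018-10-01", "2018-10-02", "2018-10-02"],
    "datatype": ["TMAX", "TMIN", "TMAX", "TMIN"],
    "temp_C": [21.1, 8.9, 23.9, 13.9],
    "temp_F": [69.98, 48.02, 75.02, 57.02],
})

# Route 1: build a MultiIndex, then move its innermost level to the columns.
via_unstack = toy_long.set_index(["date", "datatype"]).unstack()

# Route 2: pivot directly from the columns.
via_pivot = toy_long.pivot(
    index="date", columns="datatype", values=["temp_C", "temp_F"]
)

# Compare the column labels and the underlying values of the two results.
same_columns = list(via_unstack.columns) == list(via_pivot.columns)
same_values = (via_unstack.to_numpy() == via_pivot.to_numpy()).all()
print(same_columns, same_values)
```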


## **`Melting DataFrames`**

To go from wide format to long format, we need to melt the data. Melting undoes a pivot.
For this example, we will read in the data from the `wide_data.csv` file:

In [29]:
wide_df = pd.read_csv("../data/wide_data.csv")
wide_df.head()

,date,TMAX,TMIN,TOBS
0,2018-10-01,21.1,8.9,13.9
1,2018-10-02,23.9,13.9,17.2
2,2018-10-03,25.0,15.6,16.1
3,2018-10-04,22.8,11.7,11.7
4,2018-10-05,23.3,11.7,18.9


In [32]:
melted_df = wide_df.melt(id_vars="date", 
                         value_vars=["TMAX", "TMIN", "TOBS"], 
                         value_name="Temp_C", 
                         var_name="measurement")

In [33]:
melted_df.head()

,date,measurement,Temp_C
0,2018-10-01,TMAX,21.1
1,2018-10-02,TMAX,23.9
2,2018-10-03,TMAX,25.0
3,2018-10-04,TMAX,22.8
4,2018-10-05,TMAX,23.3
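In the call above, `id_vars` names the column(s) that identify each row and are repeated for every reading, `value_vars` lists the columns to stack, and `var_name` and `value_name` label the two new columns. The following is a minimal sketch on hypothetical toy data (`toy_wide` is an illustrative name) that also pivots the melted result back to wide format:

```python
import pandas as pd

# Hypothetical toy data standing in for wide_df.
toy_wide = pd.DataFrame({
    "date": ["2018-10-01", "2018-10-02"],
    "TMAX": [21.1, 23.9],
    "TMIN": [8.9, 13.9],
})

toy_melted = toy_wide.melt(
    id_vars="date",               # kept as an identifier on every row
    value_vars=["TMAX", "TMIN"],  # columns stacked into rows
    var_name="measurement",       # new column holding the old column names
    value_name="temp_C",          # new column holding the readings
)

# 2 dates x 2 measurement columns gives 4 rows.
print(toy_melted.shape)  # (4, 3)

# Pivoting the melted frame restores one column per measurement.
round_trip = toy_melted.pivot(index="date", columns="measurement", values="temp_C")
print(round_trip)
```

The rows for the first entry in `value_vars` come first, which is why the head of the melted output above shows only `TMAX`.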
