# Reshape your dataframe to wide and long formats

Inspired by [Ted Petrou's Minimally Sufficient Pandas](https://medium.com/dunder-data/minimally-sufficient-pandas-a8e67f2a2428), this section follows its guidance on reshaping.
I feel strongly that Minimally Sufficient Pandas is a useful guide for those who want to become more effective at data analysis without getting lost in the syntax.

In [1]:
import pandas as pd

In [2]:
local_relative_path = "./source/data_processed/wellbore_exploration_all_clean_names.csv"
wellbore_exploration_all = pd.read_csv(local_relative_path)
wellbore_exploration_all.head()

Unnamed: 0,wellbore_name,well,drilling_operator,production_licence,purpose,status,content,well_type,sub_sea,entry_date,...,npdid_wellbore,dsc_npdid_discovery,npdid_field,npdid_facility_drilling,npdid_wellbore_reclass,l_npdid_production_licence,npdid_site_survey,date_updated,date_updated_max,datesync_npd
0,1/2-1,1/2-1,Phillips Petroleum Norsk AS,143,WILDCAT,P&A,OIL,EXPLORATION,NO,20.03.1989,...,1382,43814.0,3437650.0,296245.0,0,21956.0,,03.10.2019,03.10.2019,22.11.2019
1,1/2-2,1/2-2,Paladin Resources Norge AS,143 CS,WILDCAT,P&A,OIL SHOWS,EXPLORATION,NO,14.12.2005,...,5192,,,278245.0,0,2424919.0,,03.10.2019,03.10.2019,22.11.2019
2,1/3-1,1/3-1,A/S Norske Shell,011,WILDCAT,P&A,GAS,EXPLORATION,NO,06.07.1968,...,154,43820.0,,288604.0,0,20844.0,,03.10.2019,03.10.2019,22.11.2019
3,1/3-2,1/3-2,A/S Norske Shell,011,WILDCAT,P&A,DRY,EXPLORATION,NO,14.05.1969,...,165,,,288847.0,0,20844.0,,03.10.2019,03.10.2019,22.11.2019
4,1/3-3,1/3-3,Elf Petroleum Norge AS,065,WILDCAT,P&A,OIL,EXPLORATION,NO,22.08.1982,...,87,43826.0,1028599.0,288334.0,0,21316.0,,03.10.2019,03.10.2019,22.11.2019


## Long to wide dataframe format with `pivot_table`

Let's use the `pivot_table` method to reshape this data so that the values of `purpose` become column names and `water_depth` supplies their respective values.

Guidance: consider using only `pivot_table` and not `pivot`.

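`pivot_table` can accomplish everything that `pivot` can. It always applies an aggregation function (`aggfunc`, which defaults to `"mean"`), so even when there is nothing to aggregate an aggregation function is still involved; with a single value per index/column pair, `"mean"` simply returns that value.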
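To make the guidance concrete, here is a minimal sketch on a small, made-up dataframe (the well names and depths are hypothetical, not taken from the dataset). It compares `pivot` and `pivot_table` when every index/column pair is unique, and then shows what happens once a duplicate pair appears.

```python
import pandas as pd

# Hypothetical toy data: one row per wellbore, so no real aggregation is needed.
toy = pd.DataFrame({
    "wellbore_name": ["A-1", "A-2", "A-3"],
    "purpose": ["WILDCAT", "APPRAISAL", "WILDCAT"],
    "water_depth": [72.0, 68.0, 74.0],
})

wide_pivot = toy.pivot(index="wellbore_name", columns="purpose", values="water_depth")
wide_pivot_table = toy.pivot_table(
    index="wellbore_name", columns="purpose", values="water_depth", aggfunc="mean"
)

# With unique pairs both methods should hold the same values.
same_values = wide_pivot.fillna(-1).to_dict() == wide_pivot_table.fillna(-1).to_dict()

# Add a second (hypothetical) measurement for the same wellbore and purpose.
extra = pd.DataFrame(
    {"wellbore_name": ["A-1"], "purpose": ["WILDCAT"], "water_depth": [76.0]}
)
dupes = pd.concat([toy, extra], ignore_index=True)

# `pivot` cannot reshape duplicate index/column pairs and raises a ValueError.
try:
    dupes.pivot(index="wellbore_name", columns="purpose", values="water_depth")
    pivot_failed = False
except ValueError:
    pivot_failed = True

# `pivot_table` aggregates the duplicates instead (here: the mean of 72 and 76).
averaged = dupes.pivot_table(
    index="wellbore_name", columns="purpose", values="water_depth", aggfunc="mean"
).loc["A-1", "WILDCAT"]
```

The practical consequence is that `pivot_table` keeps working when the data unexpectedly contains repeated pairs, at the cost of silently aggregating them, so it is worth checking for duplicates when that would be a data error.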

**Before:**

In [3]:
(wellbore_exploration_all
     .filter(items=["wellbore_name", "purpose", "water_depth"])
     .sample(10))

Unnamed: 0,wellbore_name,purpose,water_depth
771,25/11-14 S,APPRAISAL,127.0
1773,7120/1-3,WILDCAT,342.0
246,7/11-12 A,WILDCAT,72.0
1292,35/3-3,WILDCAT,259.0
1683,6605/8-2,WILDCAT,818.0
1341,35/11-7,WILDCAT,355.0
1848,7219/12-2 S,WILDCAT,338.0
1246,34/10-37,WILDCAT,140.0
778,25/11-19 SR,APPRAISAL,129.0
1638,6507/7-13,WILDCAT,381.0


The `purpose` column has 3 distinct values (WILDCAT, APPRAISAL and WILDCAT-CCS), so we expect 3 new columns after applying the `pivot_table` method.

In [4]:
wellbore_exploration_all["purpose"].value_counts()

WILDCAT        1226
APPRAISAL       695
WILDCAT-CCS       1
Name: purpose, dtype: int64
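The expectation about the number of new columns can be checked programmatically. The sketch below uses a hypothetical four-row dataframe (invented names and depths) and compares the number of distinct `purpose` values with the number of columns produced by `pivot_table`.

```python
import pandas as pd

# Hypothetical toy data with three distinct purposes.
toy = pd.DataFrame({
    "wellbore_name": ["A-1", "A-2", "A-3", "A-4"],
    "purpose": ["WILDCAT", "APPRAISAL", "WILDCAT", "WILDCAT-CCS"],
    "water_depth": [72.0, 68.0, 74.0, 300.0],
})

# Number of distinct values in the column we are about to pivot.
expected_new_columns = toy["purpose"].nunique()

wide = toy.pivot_table(
    index="wellbore_name", columns="purpose", values="water_depth", aggfunc="mean"
)

# Each distinct purpose becomes one column (the index is not counted in shape[1]).
actual_new_columns = wide.shape[1]
```

Note that this holds only as long as every category has at least one non-missing value: by default `pivot_table` drops columns whose entries are all `NaN`.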

In [5]:
f"Dataframe has {wellbore_exploration_all.shape[0]} rows and {wellbore_exploration_all.shape[1]} columns"

'Dataframe has 1922 rows and 87 columns'

**After:**

In [6]:
df_pivoted = (wellbore_exploration_all
    .filter(items=["wellbore_name", "purpose", "water_depth"])
    .pivot_table(
        index="wellbore_name",
        columns="purpose", # Column(s) we want to pivot.
        values="water_depth", # Column with values that we want to have in our new pivoted columns.
        aggfunc="mean" # Even when there is nothing to aggregate, we need to provide an aggregation function.
    )
    .reset_index()
    )

#df_pivoted.rename_axis("", axis="columns", inplace=True)

df_pivoted.head()

purpose,wellbore_name,APPRAISAL,WILDCAT,WILDCAT-CCS
0,1/2-1,,72.0,
1,1/2-2,,74.0,
2,1/3-1,,71.0,
3,1/3-10,72.0,,
4,1/3-10 A,72.0,,


In [7]:
f"Dataframe has {df_pivoted.shape[0]} rows and {df_pivoted.shape[1]} columns"

'Dataframe has 1922 rows and 4 columns'

## Wide to long dataframe format with `melt`

Now we will go back to the original **long** format.
In addition, we will drop the rows where `water_depth_m` is missing (`NaN`) with `.dropna`.

In [8]:
(df_pivoted
    .melt(
        id_vars='wellbore_name',                            # Column(s) to use as identifier variables.
        value_vars=['APPRAISAL', 'WILDCAT', 'WILDCAT-CCS'], # Column(s) to unpivot.
        var_name="purpose",                                 # Name to use for the 'variable' column.
        value_name='water_depth_m'                          # Name to use for the 'value' column.
    )
    .dropna(subset=["water_depth_m"])
    )

Unnamed: 0,wellbore_name,purpose,water_depth_m
3,1/3-10,APPRAISAL,72.0
4,1/3-10 A,APPRAISAL,72.0
12,1/3-7,APPRAISAL,72.0
14,1/3-9 S,APPRAISAL,68.0
22,1/6-3,APPRAISAL,69.0
...,...,...,...
3840,9/4-3,WILDCAT,72.0
3841,9/4-4,WILDCAT,78.0
3842,9/4-5,WILDCAT,77.0
3843,9/8-1,WILDCAT,68.0


In [9]:
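# The melted result above was not assigned to a variable, so this still reports the shape of df_pivoted.
f"Dataframe has {df_pivoted.shape[0]} rows and {df_pivoted.shape[1]} columns"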

'Dataframe has 1922 rows and 4 columns'