## Recreating the Analysis of Piao et al. (Nature, 2008)

In [58]:
from ccgcrv import ccg_filter
from ccgcrv import ccg_dates
from matplotlib import pyplot as plt
import pandas as pd
import numpy as np
from copy import copy, deepcopy

### Step 1: data retrieval

CO2 data from the stations used in Piao et al. (and others) were retrieved from two sources: NOAA-ESRL and ICOS. The selection of Piao stations was copied to: /Users/moyanofe/Work/Augsburg/Research/CO2/data/piao_2008

The data for the Piao stations were then merged into 5 files:
- piao2008_co2_flask_event.csv
- piao2008_co2_flask_monthly.csv
- piao2008_co2_insitu_daily.csv
- piao2008_co2_insitu_hourly.csv
- piao2008_co2_insitu_monthly.csv

### Step 2: CO2 data preprocessing

Note: Piao et al. (2008) used flask data. No information is available about the frequency used (daily vs. monthly).

1. Create a daily dataset by aggregating the event data and adding it to the daily file
2. Create monthly data by aggregating the daily data and adding them to the monthly file

In [59]:
# Read in the data
co2_fm = pd.read_csv('../data/piao_2008/piao2008_co2_flask_monthly.csv')
co2_fe = pd.read_csv('../data/piao_2008/piao2008_co2_flask_event.csv')

In [60]:
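# Get daily averages from event data and remove the averaged hour column.
# numeric_only=True skips any non-numeric columns, which newer pandas versions refuse to average.
co2_fd = co2_fe.groupby(['station','year','month','day'], as_index=False).mean(numeric_only=True)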
co2_fd.drop(['hour'], axis=1, inplace=True)

In [61]:
# Get monthly averages (co2_fm2) and standard deviations (co2_fm3) from daily data,
# then remove the averaged day and stdev columns from the averages
co2_fm2 = co2_fd.groupby(['station','year','month'], as_index=False).mean()
co2_fm3 = co2_fd.groupby(['station','year','month'], as_index=False).std()
co2_fm2.drop(['day', 'stdev'], axis=1, inplace=True)
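
The two-stage aggregation above (event to daily, then daily to monthly) weights every sampled day equally, which is not the same as averaging all events of a month directly. The following is a minimal sketch on made-up data, not the actual station records: the station code `AAA`, the values, and the `n_days` column are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical toy event data: two samples on day 1, one sample on day 2.
events = pd.DataFrame({
    'station': ['AAA', 'AAA', 'AAA'],
    'year': [2000, 2000, 2000],
    'month': [1, 1, 1],
    'day': [1, 1, 2],
    'hour': [10, 14, 12],
    'co2': [370.0, 372.0, 380.0],
})

# Stage 1: event -> daily mean; the averaged hour column carries no meaning, so drop it.
daily = (events
         .groupby(['station', 'year', 'month', 'day'], as_index=False)
         .mean(numeric_only=True)
         .drop(columns='hour'))

# Stage 2: daily -> monthly mean, standard deviation and day count in one pass.
monthly = (daily
           .groupby(['station', 'year', 'month'], as_index=False)
           .agg(co2=('co2', 'mean'), stdev=('co2', 'std'), n_days=('co2', 'count')))

two_stage_mean = monthly.loc[0, 'co2']        # mean of the daily means
single_stage_mean = events['co2'].mean()      # mean of all events, for comparison
print(two_stage_mean, single_stage_mean)
```

Here the two-stage mean (375.5) differs from the direct event mean (374.0) because day 1 contributes two samples. Keeping a day count next to each monthly value may also help to identify months whose average rests on very few samples.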

**Merge and select data**

Merge the two monthly datasets, using the original monthly values where available and falling back to the calculated averages otherwise. This avoids many of the outliers that appear in the calculated averages.

In [62]:
# Merge monthly dataframes
co2_fm1 = deepcopy(co2_fm)
co2_fm1['key'] = co2_fm1['station'] + co2_fm1['year'].astype(str) + co2_fm1['month'].astype(str)
co2_fm2['key'] = co2_fm2['station'] + co2_fm2['year'].astype(str) + co2_fm2['month'].astype(str)
co2_fm = co2_fm1.join(co2_fm2.set_index('key'), on='key', how='outer', lsuffix='_orig', rsuffix='_calc')
co2_fm.reset_index(inplace=True)
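
As a possible alternative to building a concatenated string key, the same outer join can be sketched with `pd.merge` on the three key columns directly. This is only an illustration on made-up frames (`orig` and `calc` stand in for `co2_fm1` and `co2_fm2`; the station code and values are assumptions). Note that, unlike the join above, the key columns are shared and therefore not suffixed.

```python
import pandas as pd

# Hypothetical stand-ins for the original and the calculated monthly data.
orig = pd.DataFrame({'station': ['AAA', 'AAA'], 'year': [2000, 2000],
                     'month': [1, 2], 'co2': [370.0, 371.0]})
calc = pd.DataFrame({'station': ['AAA', 'AAA'], 'year': [2000, 2000],
                     'month': [2, 3], 'co2': [371.4, 372.0]})

# Outer merge on the key columns; only the overlapping non-key column (co2) gets suffixes.
# indicator=True adds a '_merge' column recording which frame each row came from.
merged = pd.merge(orig, calc, on=['station', 'year', 'month'], how='outer',
                  suffixes=('_orig', '_calc'), indicator=True)
merged = merged.sort_values(['station', 'year', 'month']).reset_index(drop=True)
print(merged)
```

The `_merge` column makes it easy to count how many months exist only in the original file, only in the calculated averages, or in both.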

In [63]:
# Restructure and select the data

co2_fm['station'] = co2_fm['station_orig']
co2_fm['year'] = co2_fm['year_orig']
co2_fm['month'] = co2_fm['month_orig']
co2_fm['co2'] = co2_fm['co2_orig']

is_na = co2_fm['station_orig'].isna()
co2_fm.loc[is_na, 'station'] = co2_fm['station_calc'].loc[is_na]
is_na = co2_fm['year_orig'].isna()
co2_fm.loc[is_na, 'year'] = co2_fm['year_calc'].loc[is_na]
is_na = co2_fm['month_orig'].isna()
co2_fm.loc[is_na, 'month'] = co2_fm['month_calc'].loc[is_na]
is_na = co2_fm['co2_orig'].isna()
co2_fm.loc[is_na, 'co2'] = co2_fm['co2_calc'].loc[is_na]

co2_fm.drop(['station_orig', 'station_calc', 'year_orig', 'year_calc', 'month_orig', 'month_calc', 'key'], axis=1, inplace=True)
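
The "use the original, fall back to the calculated value" selection above can also be written more compactly with `Series.fillna`. The sketch below uses a made-up merged frame with only two column pairs; the station code and values are assumptions, and the loop is an illustration of the pattern rather than a replacement for the cell above.

```python
import numpy as np
import pandas as pd

# Hypothetical merged frame with suffixed column pairs, as produced by an outer join.
merged = pd.DataFrame({
    'station_orig': ['AAA', 'AAA', np.nan],
    'station_calc': [np.nan, 'AAA', 'AAA'],
    'co2_orig': [370.0, 371.0, np.nan],
    'co2_calc': [np.nan, 371.4, 372.0],
})

# Take the original value where present, otherwise the calculated one.
for col in ['station', 'co2']:
    merged[col] = merged[f'{col}_orig'].fillna(merged[f'{col}_calc'])

print(merged[['station', 'co2']])
```

In the second row both values exist, so the original (371.0) wins over the calculated average (371.4), which matches the selection rule stated above.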