# Tidying the NASA climate data


The data fetching/wrangling script can be found at [fetch_wrangle.py](fetch_wrangle.py).



## Global temperature data

The context and data for this section come from NASA's page, [Climate Change: Vital Signs of the Planet: Global Temperature](https://climate.nasa.gov/vital-signs/global-temperature/).





### The setup


In [149]:
import pandas as pd
from pathlib import Path
DATAPATH = Path('./datastash')

In [150]:
srcpath = DATAPATH.joinpath('wrangled', 'global_temps.csv')
df = pd.read_csv(srcpath)

In [151]:
df.head()

Unnamed: 0,year,annual_mean,lowess
0,1880,-0.19,-0.11
1,1881,-0.1,-0.14
2,1882,-0.1,-0.17
3,1883,-0.19,-0.21
4,1884,-0.28,-0.24


In [152]:
df.tail()

Unnamed: 0,year,annual_mean,lowess
133,2013,0.64,0.71
134,2014,0.73,0.77
135,2015,0.86,0.83
136,2016,0.99,0.89
137,2017,0.9,0.95


### Tidying


Melt: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.melt.html

Sort: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.sort_values.html

Reset the index: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.reset_index.html

Write to CSV: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.to_csv.html


In [153]:
tidydf = pd.melt(df, id_vars=['year'], 
                     value_vars=['annual_mean', 'lowess'],
                     var_name='type',
                     value_name='temperature_anomaly_celsius'
                ).sort_values(by=['year', 'type'])

In [154]:
tidydf.head()


Unnamed: 0,year,type,temperature_anomaly_celsius
0,1880,annual_mean,-0.19
138,1880,lowess,-0.11
1,1881,annual_mean,-0.1
139,1881,lowess,-0.14
2,1882,annual_mean,-0.1


In [155]:
tidydf.tail()

Unnamed: 0,year,type,temperature_anomaly_celsius
273,2015,lowess,0.83
136,2016,annual_mean,0.99
274,2016,lowess,0.89
137,2017,annual_mean,0.9
275,2017,lowess,0.95
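
The row labels in the output above (0, 138, 1, 139, ...) are left over from the melted order, because sorting does not renumber rows. A minimal sketch of the full melt, sort, and reset-index chain on a toy frame follows. The frame `wide` and the name `long` are illustrative only; the values are copied from the `head()` output earlier in this notebook.

```python
import pandas as pd

# Toy wide table with the same column names as global_temps.csv
# (hypothetical stand-in for the real file).
wide = pd.DataFrame({
    'year': [1880, 1881],
    'annual_mean': [-0.19, -0.10],
    'lowess': [-0.11, -0.14],
})

long = (
    pd.melt(wide, id_vars=['year'],
            value_vars=['annual_mean', 'lowess'],
            var_name='type',
            value_name='temperature_anomaly_celsius')
    .sort_values(by=['year', 'type'])
    .reset_index(drop=True)  # renumber rows 0..n-1 after sorting
)
print(long)
```

Because the CSV is written with `index=False`, resetting the index does not change the saved file; it only makes the in-memory frame easier to read.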


In [168]:
dest_path = DATAPATH.joinpath('tidied', srcpath.name)
dest_path.parent.mkdir(exist_ok=True)
tidydf.to_csv(dest_path, index=False)
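
One way to sanity-check a tidied table is to pivot it back to wide form and compare it with the input. The sketch below does this on a small hand-built frame rather than the real file; `tidy` and `wide_again` are hypothetical names, and the values are taken from the outputs shown above.

```python
import pandas as pd

# A few tidy rows in the same shape as the tidied global_temps.csv
tidy = pd.DataFrame({
    'year': [1880, 1880, 1881, 1881],
    'type': ['annual_mean', 'lowess', 'annual_mean', 'lowess'],
    'temperature_anomaly_celsius': [-0.19, -0.11, -0.10, -0.14],
})

# pivot is the inverse of melt: one column per value of `type`
wide_again = (
    tidy.pivot(index='year', columns='type',
               values='temperature_anomaly_celsius')
    .reset_index()
)
wide_again.columns.name = None  # drop the leftover 'type' axis label
print(wide_again)
```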

## Carbon dioxide parts per million

Context and data: [Climate Change: Vital Signs of the Planet: Carbon Dioxide](https://climate.nasa.gov/vital-signs/carbon-dioxide/)

In [157]:
import pandas as pd
from pathlib import Path
DATAPATH = Path('./datastash')

srcpath = DATAPATH.joinpath('wrangled', 'co2.csv')
df = pd.read_csv(srcpath)

In [158]:
df.head()

Unnamed: 0,year,month,decimal_date,average,interpolated,trend,days
0,1958,3,1958.208,315.71,315.71,314.62,
1,1958,4,1958.292,317.45,317.45,315.29,
2,1958,5,1958.375,317.5,317.5,314.71,
3,1958,6,1958.458,,317.1,314.85,
4,1958,7,1958.542,315.86,315.86,314.98,


In [159]:
df.tail()

Unnamed: 0,year,month,decimal_date,average,interpolated,trend,days
717,2017,12,2017.958,406.82,406.82,407.53,31.0
718,2018,1,2018.042,407.98,407.98,407.74,29.0
719,2018,2,2018.125,408.35,408.35,407.62,28.0
720,2018,3,2018.208,409.46,409.46,408.02,28.0
721,2018,4,2018.292,410.26,410.26,407.45,21.0


### More wrangling

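For this file, we'll do a little more work. Some of the metadata, such as `decimal_date` or `days`, isn't especially interesting, so the first idea was to drop those columns. On second thought, they are kept, and the `drop` call is left commented out: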

https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.drop.html


In [160]:
# on second thought, don't drop stuff
# df = df.drop(labels=['decimal_date', 'days'], axis='columns')
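
For reference, here is a minimal sketch of what the commented-out `drop` would do, run on a toy frame instead of the real data. The frame `co2` and the name `slim` are hypothetical; `columns=[...]` is equivalent to `labels=[...], axis='columns'`.

```python
import pandas as pd

# Toy frame with a subset of the co2.csv columns
co2 = pd.DataFrame({
    'year': [1958, 1958],
    'month': [3, 4],
    'decimal_date': [1958.208, 1958.292],
    'average': [315.71, 317.45],
    'days': [None, None],
})

# drop returns a new frame; `co2` itself is left unchanged
slim = co2.drop(columns=['decimal_date', 'days'])
print(list(slim.columns))
```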

### Add a new column

https://pandas.pydata.org/pandas-docs/stable/generated/pandas.to_datetime.html

https://stackoverflow.com/questions/19350806/how-to-convert-columns-into-one-datetime-column-in-pandas/37103131

In [165]:
# assemble a timestamp from the year and month columns; day is fixed at 1
df['yearmonth'] = pd.to_datetime({'year': df['year'], 'month': df['month'], 'day': 1})

df.head()

Unnamed: 0,year,month,decimal_date,average,interpolated,trend,days,yearmonth
0,1958,3,1958.208,315.71,315.71,314.62,,1958-03-01
1,1958,4,1958.292,317.45,317.45,315.29,,1958-04-01
2,1958,5,1958.375,317.5,317.5,314.71,,1958-05-01
3,1958,6,1958.458,,317.1,314.85,,1958-06-01
4,1958,7,1958.542,315.86,315.86,314.98,,1958-07-01
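
`pd.to_datetime` can assemble timestamps from a frame (or dict) whose columns are named `year`, `month`, and `day`. A small sketch on toy data, with `parts`, `stamps`, and `labels` as illustrative names:

```python
import pandas as pd

# Only year and month are known, so a constant day=1 is added
parts = pd.DataFrame({'year': [1958, 1958], 'month': [3, 4]})
stamps = pd.to_datetime(parts.assign(day=1))
print(stamps)

# Format back to a compact year-month label if one is needed
labels = stamps.dt.strftime('%Y-%m')
print(labels.tolist())
```

Storing a real datetime column, rather than a string label, lets plotting and resampling tools treat the series as a time axis.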


In [166]:
# tidying

tidydf = pd.melt(df, id_vars=['year', 'month', 'yearmonth', 'decimal_date', 'days'], 
                     value_vars=['average', 'interpolated', 'trend'],
                     var_name='type',
                     value_name='ppm',
                ).sort_values(by=['year', 'month', 'type'])

In [163]:
tidydf.head()

Unnamed: 0,year,month,yearmonth,decimal_date,days,type,ppm
0,1958,3,1958-03-01,1958.208,,average,315.71
722,1958,3,1958-03-01,1958.208,,interpolated,315.71
1444,1958,3,1958-03-01,1958.208,,trend,314.62
1,1958,4,1958-04-01,1958.292,,average,317.45
723,1958,4,1958-04-01,1958.292,,interpolated,317.45


In [167]:
dest_path = DATAPATH.joinpath('tidied', srcpath.name)
dest_path.parent.mkdir(exist_ok=True)
tidydf.to_csv(dest_path, index=False)
