# Example of working with GHCN data from Romania

[Global Historical Climatology Network](http://www.ncdc.noaa.gov/ghcnm/) (GHCN) weather station data from Romania.

N.B. Parts of this work will only function with Pandas version > 0.19.0

In [None]:
# Load some libraries
import os
import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
from tqdm import tqdm

matplotlib.style.use('ggplot')
%matplotlib inline

## Step 1: data preparation

From our Station Data files we need to create:
* one single data structure
* Date indexed data (with only one index for all the datasets)
* Station names as column identifiers

We will use the [Pandas library](http://pandas.pydata.org/pandas-docs/stable/index.html), as it is perfectly suited to our task. We have read it in above with the alias **pd**.

Before we try and read all the data, let's test a procedure with one single station file.

In [None]:
# Read a station data file with Pandas
test_data = pd.read_csv("Data/station_data/BUM00015502_VIDIN_BU_.csv")
test_data.head()

Looks good, but the dates should be the index rather than a column, and they should be datetime objects rather than plain integers (we get much more functionality that way).

In [None]:
# Make a list of datetime values out of the integer dates using an explicit loop
dates = []
for date in test_data['DATE']:
    dates.append(pd.to_datetime(str(date), format="%Y%m%d"))
# The above could have been done more concisely using a list comprehension

# Next set the new list as an index, and remove the old column from the dataset
test_data.index = dates
test_data = test_data.drop(['DATE','PRCP'], axis=1)

test_data.head()
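
The date conversion can also be done in one vectorised call instead of a loop. The following is a minimal sketch on a small made-up frame; the column names follow the station files, but the values are illustrative and not taken from the data.

```python
import pandas as pd

# Toy frame mimicking the station-file layout (values are made up for illustration)
toy = pd.DataFrame({"DATE": [19500101, 19500102, 19500103],
                    "PRCP": [0.0, 1.2, -9999.0],
                    "TAVG": [-2.5, -1.0, 0.5]})

# Convert all integer dates to datetimes in a single call and use them as the index
toy.index = pd.to_datetime(toy["DATE"].astype(str), format="%Y%m%d")

# Drop the columns we no longer need
toy = toy.drop(["DATE", "PRCP"], axis=1)
print(toy.head())
```

With a datetime index in place, label-based selection by date (for example `toy.loc["1950-01-02"]`) becomes available.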

Great! We can plot a simple preview of the data to make sure it looks good.

In [None]:
test_data.plot()

The preview plot is messy because our data are not contiguous, but as a quick check everything seems more or less fine.

Reading a single file is straightforward. But we want to do some exploratory analysis on measurements from multiple stations, so we need to read all the station data together into one consistent data object.

In [None]:
# Make a small tool (function) to help with the work

def station_name(fname):
    """Return the station ID from a path/filename.csv string"""
    tmp = os.path.basename(fname)   # strip the directory part in a cross-platform way
    return tmp.split('_')[0]
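
A quick check of the helper on the test file name used earlier. The function is repeated here so the snippet runs by itself; using `os.path.basename` for the path handling is a choice made for this sketch. It should print the ID `BUM00015502`.

```python
import os

def station_name(fname):
    """Return the station ID from a path/filename.csv string"""
    tmp = os.path.basename(fname)   # strip the directory part
    return tmp.split('_')[0]        # the ID is everything before the first underscore

print(station_name("Data/station_data/BUM00015502_VIDIN_BU_.csv"))
```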

In [None]:
# If we were on a Mac or Linux system, we could get the file list via a bash command
flist = !ls Data/station_data/*.csv

In [None]:
# But this will break on Windows. To make our code cross-platform we use a Python
# library to find all the files instead. This is much better than hard-coding the files!

frames = [] # an empty list to hold each data object as it is loaded

mypath = 'Data/station_data/'          # Set path to data
for item in tqdm(os.listdir(mypath)):        # Find all files in that path and loop over them
    if item.endswith('.csv'):          # If the file is a csv type do something...
        fname = os.path.join(mypath, item)
        station = station_name(fname)
        #print('\rReading data from station', station, end='')
        tmp = pd.read_csv(fname)
        dates = pd.to_datetime(tmp['DATE'].astype(str), format="%Y%m%d")  # integer dates -> datetimes
        tmp.index = dates
        tmp = tmp.drop(['DATE','PRCP'], axis=1) # get rid of date and precipitation columns
        tmp.columns = [station]     # Re-name TAVG to be the station name
        frames.append(tmp)
#print('\rDone reading data.')
print("{0} GHCN files read".format(len(frames)))
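
Another cross-platform way to collect the file list is the standard-library `glob` module, which does the `.csv` filtering for us. This is only a sketch: the helper name `list_station_files` is hypothetical, and it returns an empty list if the directory does not exist.

```python
import glob
import os

def list_station_files(path):
    """Return a sorted list of the .csv files found directly under `path`."""
    return sorted(glob.glob(os.path.join(path, "*.csv")))

# Same data directory as above, built without hard-coding a path separator
flist = list_station_files(os.path.join("Data", "station_data"))
print("{0} csv files found".format(len(flist)))
```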

In [None]:
df = pd.concat(frames, axis=1)  # Join all the separate data column-wise into one object, aligned on date
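
To see how `pd.concat` lines single-station frames up on a shared date index, here is a sketch with two made-up frames. It assumes the frames are joined column-wise (`axis=1`), which gives one column per station; the station names and values are placeholders.

```python
import pandas as pd

# Two toy single-column frames with partly overlapping dates (made-up values)
a = pd.DataFrame({"STATION_A": [1.0, 2.0, 3.0]},
                 index=pd.to_datetime(["2000-01-01", "2000-01-02", "2000-01-03"]))
b = pd.DataFrame({"STATION_B": [10.0, 20.0]},
                 index=pd.to_datetime(["2000-01-02", "2000-01-04"]))

# Column-wise concat: the result's index is the union of the dates,
# and dates missing from a station are filled with NaN
wide = pd.concat([a, b], axis=1)
print(wide)
```

Note that a column-wise concat of this kind expects each frame to have no duplicated dates in its own index.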

## Step 2: Cleaning the dataset for analysis

Now we have created a dataframe **df** holding all the station data with one coherent time index.

This abstraction will do much of the work for us...

In [None]:
# First let's see how long these data run for in time
print("minimum date:", min(df.index).date())
print("maximum date:", max(df.index).date())

In [None]:
# Now let's look at a statistical description of these data
df.describe()

There is a clear problem with these stats. Most of these data seem to include a missing-value marker of `-9999.0`.
To proceed we should replace it with a missing data type that we can operate with: `np.nan`.

In [None]:
# we can replace all -9999.0 values with np.nan like this

df[df == -9999.0] = np.nan
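
The same substitution can be written with `DataFrame.replace`, which returns a new frame rather than modifying the original. A minimal sketch on made-up values:

```python
import numpy as np
import pandas as pd

# Toy frame containing the -9999.0 missing-value marker (values are made up)
toy = pd.DataFrame({"STATION_A": [1.5, -9999.0, 3.0],
                    "STATION_B": [-9999.0, -9999.0, 4.0]})

# Swap the marker for NaN; the original frame is left untouched
clean = toy.replace(-9999.0, np.nan)

# count() reports the number of valid (non-NaN) values per column
print(clean.count())
```

`pd.read_csv` also accepts an `na_values` argument, so the marker could instead be converted to NaN when each file is loaded.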

In [None]:
# Now, the dataframe values seem reasonable, except we can see there are many series which are empty.
# They were just full of missing values for whatever reason.

df.describe()

In [None]:
# It looks like we can simply filter out stations that now have a low count (e.g. no more than 10,000 valid values).

limit = 10000

for key in df:
    if df[key].count() <= limit:
        print('removing', key,'from df object.')
        df = df.drop([key], axis=1)
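
The loop above can also be expressed as a single boolean column selection. This is a sketch on a made-up frame with a deliberately tiny, hypothetical `limit` so the effect is visible:

```python
import numpy as np
import pandas as pd

# Toy frame: STATION_B has only one valid value (values are made up)
toy = pd.DataFrame({"STATION_A": [1.0, 2.0, 3.0, 4.0],
                    "STATION_B": [np.nan, np.nan, 5.0, np.nan]})

limit = 2  # hypothetical threshold for this small example

# count() gives the valid values per column; keep only columns above the limit
kept = toy.loc[:, toy.count() > limit]
print(list(kept.columns))
```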

In [None]:
# Much better! Finally a clean df object, that we can work from.

df.describe()

## Step 3. Analysis
Individually, the station data is still patchy and potentially poor. It also looks like there are still some bad values...

In [None]:
for key in df:
    plt.plot(df[key], lw=1., alpha=0.5)
plt.title('Individual stations')
plt.show()

Aggregated, though, it gives a much clearer picture.

In [None]:
df.mean(axis=1).plot(title='Romanian TAVG', lw=0.1)


In [None]:
# make the average data a new dataframe, and strip out nan values for working ease

df_mean = pd.DataFrame(df.mean(axis=1), columns=['mean_temp'])  # make a new df object
df_mean = df_mean.dropna()                                      # remove missing values

In [None]:
#df_mean.rolling(300, center=True).mean().plot(color='k', lw=0.01)

In [None]:
#df_mean


# Day of year mean...

# Deseasonalise data with DOY mean

# Examine average temperature anomalies
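
The three notes above are left as an outline. One possible way to carry them out is sketched below on a synthetic daily series standing in for the mean temperature. The data, the variable names and the choice of a plain day-of-year mean as the climatology are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic daily series: a seasonal cycle plus noise (not real station data)
idx = pd.date_range("2000-01-01", "2003-12-31", freq="D")
doy = np.asarray(idx.dayofyear)
rng = np.random.default_rng(0)
values = 10 + 12 * np.sin(2 * np.pi * (doy - 105) / 365.25) + rng.normal(0, 2, len(idx))
series = pd.Series(values, index=idx, name="mean_temp")

# Day-of-year mean: average of all values that share the same day of year,
# broadcast back onto the original dates
doy_mean = series.groupby(series.index.dayofyear).transform("mean")

# Deseasonalised data (anomalies): observation minus its day-of-year mean
anomaly = series - doy_mean
print(anomaly.describe())
```

By construction the anomalies average to zero within each day of year, so what remains is the departure from the typical value for that time of year.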

## Step 4: Using the data for something useful!

Based on historical Romanian average temperature anomalies, how does a given value rank?

Requires an average temperature and a date as input.

In [None]:
#df.mean(axis=1).plot.hist(bins=150, density=True, xlim=[-30,40])

In [None]:
# Extract a ranked list of the valid values

tmp = df.mean(axis=1, skipna=True)
tmp = tmp[tmp.notnull()]
ranked = np.array(sorted(tmp.values))

plt.plot(ranked)

In [None]:
a,b,c = plt.hist(ranked, bins=150, cumulative=True, density=True)  # 'normed' was removed from newer matplotlib; 'density' replaces it

In [None]:
# Needs to be anomalies
x = 30
pop = len(ranked)
print("{0} values in population".format(pop))
mask = x < ranked
larger = len(ranked[mask])
print("{0} values are greater than {1}C".format(larger, x))
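
Because `ranked` is already sorted, the count above can also be turned into a fraction with a binary search. The helper below is a hypothetical sketch (the name `fraction_above` is not part of the notebook), shown on a small made-up array rather than the real data.

```python
import numpy as np

def fraction_above(ranked, x):
    """Fraction of values in the sorted array `ranked` strictly greater than x."""
    ranked = np.asarray(ranked)
    # side="right" returns the number of values less than or equal to x
    n_le = np.searchsorted(ranked, x, side="right")
    return (len(ranked) - n_le) / len(ranked)

# Made-up sorted values standing in for `ranked`
demo = np.array([-5.0, 0.0, 10.0, 20.0, 35.0])
print(fraction_above(demo, 30))   # one of the five values exceeds 30
```

As the comment in the cell above notes, a meaningful ranking would need to be applied to anomalies rather than raw temperatures.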