#### Global Historical Climatology Network Dataset

Problems to solve (variables are stored in both rows and columns):

  - the tmin and tmax variable names are stored as data in a single column
  - the actual values are scattered across the day columns

In [None]:
import pandas as pd

messy = pd.read_csv('./data/weather-raw.csv')
messy

The data is sparse. We cannot simply throw away the missing values here, because we would lose the 4 rows that contain the information. So we melt the dataframe first.

In [None]:
molten = pd.melt(messy,
                 id_vars=['id', 'year', 'month', 'element'],
                 var_name='day')
# Once every day is a row of its own, the missing observations can be dropped.
molten = molten.dropna().reset_index(drop=True)
molten
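
The reasoning behind melting before dropping can be checked on a miniature table. The following is only a sketch: the frame `toy`, the station id `S1`, and all values are made up for illustration and are not taken from `weather-raw.csv`.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the weather table: one row per
# (id, year, month, element) and one column per day, mostly empty.
toy = pd.DataFrame({
    "id": ["S1", "S1"],
    "year": [2010, 2010],
    "month": [1, 1],
    "element": ["tmax", "tmin"],
    "d1": [np.nan, np.nan],
    "d2": [27.8, 14.5],
    "d3": [np.nan, np.nan],
})

# Dropping missing values on the wide table removes every row,
# because each row has at least one empty day column.
print(len(toy.dropna()))

# melt() first stacks the day columns into (day, value) pairs ...
toy_molten = pd.melt(toy,
                     id_vars=["id", "year", "month", "element"],
                     var_name="day")

# ... after which only the truly empty observations are dropped.
toy_molten = toy_molten.dropna().reset_index(drop=True)
print(toy_molten)
```

In this toy case the wide table loses both rows under `dropna()`, while the molten table keeps the two measurements recorded for day `d2`.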

The dataframe is not tidy yet. The "element" column contains variable names, and one variable, "date", is split across three columns: "year", "month", and "day". We will fix the latter problem first.

In [None]:
def make_date(row):
    # Drop the one-character prefix of the day label, then format as YYYY-MM-DD.
    return "%d-%02d-%02d" % (row['year'], row['month'], int(row['day'][1:]))

molten['date'] = molten.apply(make_date, axis=1)
molten = molten[['id', 'element', 'value', 'date']]
molten
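
As a possible alternative to the row-wise `apply()`, the three parts can be assembled in one vectorized call. This sketch assumes the day labels consist of a one-character prefix followed by the day number (the same assumption the function above makes); the frame `toy_molten` and its values are made up. Note that it produces real datetime values rather than strings, which is a deliberate difference from the cell above.

```python
import pandas as pd

# Hypothetical molten frame with the same columns as above.
toy_molten = pd.DataFrame({
    "id": ["S1", "S1"],
    "year": [2010, 2010],
    "month": [1, 1],
    "element": ["tmax", "tmin"],
    "day": ["d2", "d2"],
    "value": [27.8, 14.5],
})

# pd.to_datetime() accepts a frame with 'year', 'month' and 'day' columns.
parts = pd.DataFrame({
    "year": toy_molten["year"],
    "month": toy_molten["month"],
    "day": toy_molten["day"].str[1:].astype(int),  # strip the prefix
})
toy_molten["date"] = pd.to_datetime(parts)

toy_molten = toy_molten[["id", "element", "value", "date"]]
print(toy_molten)
```

Keeping the dates as datetimes makes later sorting and resampling easier, at the cost of a slightly different display than the string version.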

Now we just have to pivot the "element" column:

In [None]:
tidy = molten.pivot(index='date', columns='element', values='value')
tidy

But now we have lost the 'id' column. The trick is to move 'id' into the index with the groupby() function and apply the pivot() function inside each group.

In [None]:
tidy = molten.groupby('id').apply(pd.DataFrame.pivot,
                                 index='date',
                                 columns='element',
                                 values='value')
tidy

So we have 'id' back, but we would like to have it as a column:

In [None]:
tidy.reset_index(inplace=True)
tidy
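
The same result can probably be reached without groupby() by putting 'id', 'date' and 'element' into the index and unstacking 'element'. This is a sketch on made-up data (the frame `toy` and its values are hypothetical), and it assumes each (id, date, element) combination occurs only once, since unstack() raises an error on duplicates.

```python
import pandas as pd

# Hypothetical molten frame: one row per (id, date, element).
toy = pd.DataFrame({
    "id": ["S1", "S1", "S1", "S1"],
    "element": ["tmax", "tmin", "tmax", "tmin"],
    "value": [27.8, 14.5, 27.3, 14.4],
    "date": ["2010-01-30", "2010-01-30", "2010-02-02", "2010-02-02"],
})

toy_tidy = (
    toy.set_index(["id", "date", "element"])["value"]
       .unstack("element")   # one column per element (tmax, tmin)
       .reset_index()        # turn 'id' and 'date' back into columns
)
toy_tidy.columns.name = None  # drop the leftover 'element' axis label
print(toy_tidy)
```

The result has one row per station and date, with `tmax` and `tmin` as ordinary columns.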

#### One type in multiple tables

Here the problems are the following:

  - the data is spread across multiple tables/files
  - the "year" variable is present, but in the file name

In [None]:
import glob
import re

def extract_year(string):
    # The greedy '.+' makes this capture the last four consecutive digits.
    match = re.match(r".+(\d{4})", string)
    if match is not None:
        return match.group(1)

path = './data'
all_files = glob.glob(path + "/201*-baby-names-illinois.csv")
df_list = []
for file_ in all_files:
    df = pd.read_csv(file_, index_col=None, header=0)
    df.columns = df.columns.str.lower()  # normalize the column names
    df["year"] = extract_year(file_)     # the year exists only in the file name
    df_list.append(df)

df = pd.concat(df_list)
df.head(10)
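
To try the pattern without the original files, the sketch below writes two tiny CSV files to a temporary directory and combines them. The column names (`Rank`, `Name`, `Frequency`) and the rows are placeholders, not the contents of the real baby-names files. Unlike the cell above, this variant matches only against the file name (not the whole path) and stores the year as an integer.

```python
import re
import tempfile
from pathlib import Path

import pandas as pd

def extract_year(filename):
    """Return the first run of four digits in a file name as an int, or None."""
    match = re.search(r"(\d{4})", filename)
    return int(match.group(1)) if match else None

with tempfile.TemporaryDirectory() as tmp:
    # Two made-up files that mimic the naming scheme used above.
    for year, name in [(2014, "Alpha"), (2015, "Beta")]:
        Path(tmp, f"{year}-baby-names-illinois.csv").write_text(
            f"Rank,Name,Frequency\n1,{name},100\n"
        )

    frames = []
    for path in sorted(Path(tmp).glob("201*-baby-names-illinois.csv")):
        part = pd.read_csv(path)
        part.columns = part.columns.str.lower()
        part["year"] = extract_year(path.name)  # ignore digits in directories
        frames.append(part)

# ignore_index=True gives the combined table a fresh 0..n-1 index.
combined = pd.concat(frames, ignore_index=True)
print(combined)
```

Matching against `path.name` avoids picking up digits that happen to appear in a directory name, and `ignore_index=True` prevents repeated index labels in the combined table.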