# Week 04 Assignment weather data

Welcome to week four of the course Programming 1. This week you will organise your data into the required format and apply smoothing. In this assignment we will work with weather data from the KNMI. A subset of the weather data is available to you in the file `KNMI_20181231.csv`. The data consists of daily weather measurements from several stations over several years. Your task is to make a plot similar to the plot below.


<img src="images/weather.png" alt="drawing" width="400"/>


Furthermore, the plot needs the following enhancements:

1. proper titles and ticks
2. a slider widget selecting a particular year or all years
3. smoothed lines
4. legends

Use your creativity. Consider colors, alpha settings, sizes etc. 

Learning outcomes

- load, inspect and clean a dataset 
- reformat dataframes
- apply smoothing techniques
- visualize timeseries data

The assignment consists of 6 parts:

- [part 1: load the data](#0)
- [part 2: clean the data](#1)
- [part 3: reformat data](#2)
- [part 4: smooth the data](#3)
- [part 5: visualize the data](#4)
- [part 6: Challenge](#5)

Parts 1 to 5 are mandatory; part 6 is optional (bonus).
To pass the assignment you need a score of at least 60%.


---

<a name='0'></a>
## Part 1: Load the data

Load either `KNMI_20181231.csv` or `KNMI_20181231.txt.tsv`. The column headers contain spaces and are not self-explanatory, so rename them to more readable ones. Select the data from station 270, and keep only the mean, minimum and maximum temperature. The data should look something like this:


In [1]:
import pandas as pd
import re

In [2]:
with open('../data/KNMI_20181231.txt.tsv', 'r') as f:
    data = f.readlines()

In [3]:
start_index = None

for i, line in enumerate(data):
    if 'STN,YYYYMMDD' in line:
        start_index = i + 1
        
print('The data we need starts at index {}.'.format(start_index))

The data we need starts at index 64.


In [4]:
def extract_data(data):
    """ 
    Loop through the lines in the file, add them to the dictionary, and
    convert the dictionary to a DataFrame as this is the fast method.
    https://stackoverflow.com/questions/57000903/what-is-the-fastest-and-most-efficient-way-to-append-rows-to-a-dataframe
    """
    dataset = {
        'STN': [],
        'Date': [],
        'Tmean': [],
        'Tmm': [],
        'Tmax': []
    }
    
    stn_date_pattern = '(270),([0-9]+),'
    numerical_pattern = '-{0,1}[0-9]+'

    for line in data:
        res = re.findall(stn_date_pattern, line)
        if res:
            stn, date = res[0]
            try:
                tg, tn, tx, sq, dr, rh = re.findall(numerical_pattern, line)[2:] # skip stn and date
                dataset['STN'].append(stn)
                dataset['Date'].append(date)
                dataset['Tmean'].append(tg)
                dataset['Tmm'].append(tn)
                dataset['Tmax'].append(tx)
            except ValueError:
                print('Expected six values for station {}.'.format(stn))
                
    return pd.DataFrame(dataset)
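
As an alternative to the regex loop above, the same selection can probably be done with `pandas.read_csv`. The sketch below is a minimal illustration, not the required solution. It assumes that the header line contains `STN,YYYYMMDD` and that the remaining columns are named `TG`, `TN`, `TX`, `SQ`, `DR` and `RH` (inferred from the variable names in `extract_data`). The sample text is a synthetic stand-in for the real file: the station 270 temperatures come from the outputs in this notebook, while the other values and the station 280 row are placeholders.

```python
import io

import pandas as pd

# Synthetic stand-in for the KNMI file (layout is an assumption).
raw = """\
# STN,YYYYMMDD,   TG,   TN,   TX,   SQ,   DR,   RH
  270,20000101,   42,   -4,   79,    0,   10,    5
  270,20000102,   55,   33,   74,    0,   20,   12
  280,20000101,   40,   -2,   77,    0,   15,    3
"""

lines = raw.splitlines()

# Locate the header line and turn it into clean column names.
header_idx = next(i for i, line in enumerate(lines) if 'STN,YYYYMMDD' in line)
names = [name.strip() for name in lines[header_idx].lstrip('# ').split(',')]

# Parse everything after the header; skipinitialspace drops the padding.
body = '\n'.join(line.strip() for line in lines[header_idx + 1:])
df_raw = pd.read_csv(io.StringIO(body), names=names, skipinitialspace=True)

# Rename to the column names used in this notebook and select station 270.
renamed = df_raw.rename(
    columns={'YYYYMMDD': 'Date', 'TG': 'Tmean', 'TN': 'Tmm', 'TX': 'Tmax'}
)
station = renamed.loc[renamed['STN'] == 270,
                      ['STN', 'Date', 'Tmean', 'Tmm', 'Tmax']]
print(station)
```

With the real file, `raw` would be replaced by the file contents; the rest should carry over if the layout assumption holds.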

---

<a name='1'></a>
## Part 2: Clean the data

The data is not clean. There are empty cells in the dataframe that need to be replaced with NaNs, and the temperature is given in units of 0.1 degrees Celsius, which needs to be converted to degrees. The date field needs a datetime format. For visualization convenience we would like to remove the leap days (29 February), so that every year has 365 days. Carry out the cleaning.

In [5]:
df = extract_data(data[start_index:]).astype({
    'STN': 'int',
    'Date': 'datetime64[ns]',
    'Tmean': 'float',
    'Tmm': 'float',
    'Tmax': 'float'
})
df.head()

| | STN | Date | Tmean | Tmm | Tmax |
|---|---|---|---|---|---|
| 0 | 270 | 2000-01-01 | 42.0 | -4.0 | 79.0 |
| 1 | 270 | 2000-01-02 | 55.0 | 33.0 | 74.0 |
| 2 | 270 | 2000-01-03 | 74.0 | 49.0 | 89.0 |
| 3 | 270 | 2000-01-04 | 46.0 | 22.0 | 75.0 |
| 4 | 270 | 2000-01-05 | 41.0 | 14.0 | 56.0 |


In [6]:
# Multiply temperatures by 0.1, as the temperature was in units of 0.1.
df[['Tmean', 'Tmm', 'Tmax']] = df[['Tmean', 'Tmm', 'Tmax']].multiply(0.1)
df.head()

| | STN | Date | Tmean | Tmm | Tmax |
|---|---|---|---|---|---|
| 0 | 270 | 2000-01-01 | 4.2 | -0.4 | 7.9 |
| 1 | 270 | 2000-01-02 | 5.5 | 3.3 | 7.4 |
| 2 | 270 | 2000-01-03 | 7.4 | 4.9 | 8.9 |
| 3 | 270 | 2000-01-04 | 4.6 | 2.2 | 7.5 |
| 4 | 270 | 2000-01-05 | 4.1 | 1.4 | 5.6 |


In [7]:
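# Drop every row that falls in a leap year (this removes the whole year,
# which is stricter than dropping only 29 February).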
df = df[~df['Date'].dt.is_leap_year]
df.head()

| | STN | Date | Tmean | Tmm | Tmax |
|---|---|---|---|---|---|
| 366 | 270 | 2001-01-01 | 2.1 | 0.4 | 3.8 |
| 367 | 270 | 2001-01-02 | 6.3 | 3.7 | 8.5 |
| 368 | 270 | 2001-01-03 | 5.3 | 2.6 | 7.9 |
| 369 | 270 | 2001-01-04 | 6.5 | 4.6 | 7.9 |
| 370 | 270 | 2001-01-05 | 6.6 | 5.6 | 8.3 |


In [8]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 5110 entries, 366 to 6939
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype         
---  ------  --------------  -----         
 0   STN     5110 non-null   int64         
 1   Date    5110 non-null   datetime64[ns]
 2   Tmean   5110 non-null   float64       
 3   Tmm     5110 non-null   float64       
 4   Tmax    5110 non-null   float64       
dtypes: datetime64[ns](1), float64(3), int64(1)
memory usage: 239.5 KB


<details>    
<summary>
    <font size="3" color="darkgreen"><b>Hints</b></font>
</summary>
<ul><li>pd.to_datetime(df['Date'].astype(str), format='%Y%m%d')</li>
    <li>regex for empty cells = `^\s*$` </li>
    <li>remove month == 2 & day == 29</li> 
</ul>
</details>
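
The hints above can be sketched as follows. This is a minimal illustration on a small synthetic frame (the rows are made up for the example), showing one way to combine the three hints: blank cells become NaN, the integer date becomes a datetime, and only 29 February is dropped.

```python
import numpy as np
import pandas as pd

# Synthetic example rows; the third temperature is a blank cell.
raw = pd.DataFrame({
    'Date': [20160228, 20160229, 20160301, 20160302],
    'Tmean': ['42', '61', '   ', '55'],
})

# Hint 2: cells that are empty or whitespace-only become NaN.
clean = raw.replace(r'^\s*$', np.nan, regex=True)

# Hint 1: parse the YYYYMMDD integers as dates.
clean['Date'] = pd.to_datetime(clean['Date'].astype(str), format='%Y%m%d')

# Convert from units of 0.1 degrees Celsius to degrees Celsius.
clean['Tmean'] = clean['Tmean'].astype(float) * 0.1

# Hint 3: drop only the leap day, keeping the rest of the leap year.
is_leap_day = (clean['Date'].dt.month == 2) & (clean['Date'].dt.day == 29)
clean = clean[~is_leap_day]
print(clean)
```

Dropping only the leap day keeps the other 365 days of a leap year available for the per-day extremes in part 3.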

### Expected outcome

---

<a name='2'></a>
## Part 3: Reformat your data

First we will split the data into the data from 2018 and the data from before 2018. It is best to store these in two separate dataframes.
Next, for the pre-2018 data we need the minimum and the maximum value for each calendar day. For example, we look for the lowest value among all 1 January minimum temperatures, regardless of the year. Create a dataframe with 365 days containing the overall minimum and the overall maximum per day.


In [9]:
df_pre = df[df['Date'].dt.year < 2018].set_index('Date')
df_after = df[df['Date'].dt.year >= 2018].set_index('Date')

In [10]:
df_pre.head()

| Date | STN | Tmean | Tmm | Tmax |
|---|---|---|---|---|
| 2001-01-01 | 270 | 2.1 | 0.4 | 3.8 |
| 2001-01-02 | 270 | 6.3 | 3.7 | 8.5 |
| 2001-01-03 | 270 | 5.3 | 2.6 | 7.9 |
| 2001-01-04 | 270 | 6.5 | 4.6 | 7.9 |
| 2001-01-05 | 270 | 6.6 | 5.6 | 8.3 |


In [11]:
def month_day(df_multipleyears):
    intersect = ['Tmm', 'Tmax']
    
    df = df_multipleyears.copy()
    df = df.groupby(df.index.strftime('%m-%d')).agg({'Tmm':'min', 'Tmax':'max'})    
    df['date'] = df.index
    # n must be passed by keyword; recent pandas versions reject it positionally.
    df[['month', 'day']] = df['date'].str.split('-', n=1, expand=True).astype('int')
    df = df.set_index([df['month'], df['day']]).sort_index()
    
    return df[df.columns.intersection(intersect)]

In [12]:
def test_reformed(df):
    return month_day(df)
    

test_reformed(df_pre)

| month | day | Tmm | Tmax |
|---|---|---|---|
| 1 | 1 | -5.8 | 10.1 |
| 1 | 2 | -7.5 | 10.2 |
| 1 | 3 | -12.6 | 10.7 |
| 1 | 4 | -6.7 | 9.8 |
| 1 | 5 | -6.2 | 9.4 |
| ... | ... | ... | ... |
| 12 | 27 | -6.0 | 11.7 |
| 12 | 28 | -7.4 | 10.9 |
| 12 | 29 | -7.3 | 8.6 |
| 12 | 30 | -6.7 | 11.1 |


<details>    
<summary>
    <font size="3" color="darkgreen"><b>Hints</b></font>
</summary>
<ul><li>use the dt.month and dt.day to groupby</li>
</ul>
</details>
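
The hint above can be sketched as follows: group directly on the month and day of a `DatetimeIndex` instead of formatting and splitting strings. This is a minimal illustration, assuming a frame indexed by date with the columns `Tmm` and `Tmax`. The 2001 rows reuse values from the outputs above; the 2002 rows are synthetic.

```python
import pandas as pd

dates = pd.to_datetime(['2001-01-01', '2001-01-02', '2002-01-01', '2002-01-02'])
df_demo = pd.DataFrame(
    {'Tmm': [0.4, 3.7, -5.8, 1.0], 'Tmax': [3.8, 8.5, 10.1, 6.0]},
    index=dates,
)

# Group on (month, day) so that the same calendar day of every year
# ends up in one group, then take the lowest minimum and highest maximum.
extremes = (
    df_demo.groupby([df_demo.index.month, df_demo.index.day])
    .agg({'Tmm': 'min', 'Tmax': 'max'})
    .rename_axis(['month', 'day'])
)
print(extremes)
```

The result has one row per calendar day, indexed by `(month, day)`, which matches the layout shown in the solution output above.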

### Expected outcome
Note: the layout or names may differ, but the length should be 365 and the minimum values should be the same.

---

<a name='3'></a>
## Part 4: Smooth the data

Make a function that takes an array or a dataframe column and returns an array of smoothed data. Explain in words why you chose a particular smoothing algorithm.
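
One possible shape for such a function is sketched below. It uses a centered rolling mean, which is only one of several reasonable choices; the function name `smooth` and the default window of 7 days are assumptions made for this illustration.

```python
import numpy as np
import pandas as pd


def smooth(values, window=7):
    """Return a centered rolling mean of `values` as a NumPy array.

    min_periods=1 keeps the output the same length as the input: near the
    edges the mean is taken over the points that are available.
    """
    series = pd.Series(np.asarray(values, dtype=float))
    return series.rolling(window, center=True, min_periods=1).mean().to_numpy()


print(smooth([1, 2, 3, 4, 5], window=3))
```

A centered window does not shift peaks in time, whereas a trailing window or an exponential average lags behind the data. Which trade-off is preferable depends on the purpose of the plot.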


In [13]:
# I smooth the data with an exponential moving average (EMA), because the
# previous day is expected to influence the current day more than days
# further in the past. With com=0.5 the smoothing factor is
# alpha = 1 / (1 + com) = 2/3, so the smoothing is light.
# The first `periods` values (7, i.e. one week) are kept unsmoothed.
values_df = test_reformed(df_pre)

def smooth_data(df, col, periods):
    df = df.copy()
    start_vals = df[col].iloc[:periods]
    
    df['ewm'] = df[col].ewm(com=0.5, min_periods=periods).mean()
    # Assign through a single .iloc call on the frame; chained indexing
    # (df['ewm'].iloc[...] = ...) may write to a temporary copy.
    df.iloc[:periods, df.columns.get_loc('ewm')] = start_vals.to_numpy()
    df['ewm'] = df['ewm'].round(2)
    return df

In [14]:
smoothed_df = smooth_data(values_df, 'Tmm', 7)

In [15]:
smoothed_df

| month | day | Tmm | Tmax | ewm |
|---|---|---|---|---|
| 1 | 1 | -5.8 | 10.1 | -5.80 |
| 1 | 2 | -7.5 | 10.2 | -7.50 |
| 1 | 3 | -12.6 | 10.7 | -12.60 |
| 1 | 4 | -6.7 | 9.8 | -6.70 |
| 1 | 5 | -6.2 | 9.4 | -6.20 |
| ... | ... | ... | ... | ... |
| 12 | 27 | -6.0 | 11.7 | -5.72 |
| 12 | 28 | -7.4 | 10.9 | -6.84 |
| 12 | 29 | -7.3 | 8.6 | -7.15 |
| 12 | 30 | -6.7 | 11.1 | -6.85 |
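
To see what `ewm(com=0.5)` computes, the sketch below reproduces it by hand for the first four `Tmm` values from the table above. It assumes the pandas default `adjust=True`, under which each output is a weighted mean with weights `(1 - alpha) ** i` for the value `i` days back, where `alpha = 1 / (1 + com)`. The helper name `ewm_adjusted` is made up for this illustration.

```python
import numpy as np
import pandas as pd

values = pd.Series([-5.8, -7.5, -12.6, -6.7])
com = 0.5
alpha = 1 / (1 + com)  # 2/3: the most recent day dominates


def ewm_adjusted(series, alpha):
    """Weighted mean over all values so far, newest value weighted most."""
    data = series.to_numpy()
    out = []
    for t in range(len(data)):
        # Exponents t, t-1, ..., 0: the oldest value gets the smallest weight.
        weights = (1 - alpha) ** np.arange(t, -1, -1)
        out.append(np.sum(weights * data[:t + 1]) / weights.sum())
    return np.array(out)


manual = ewm_adjusted(values, alpha)
pandas_result = values.ewm(com=com).mean().to_numpy()
print(np.allclose(manual, pandas_result))
```

Because each step back is weighted by a factor of 1/3, values more than a few days old contribute very little. A larger `com` would give a visibly smoother line.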


<a name='4'></a>
## Part 5: Visualize the data

Plot the mean temperature of the year 2018. Create a shaded band with the ultimate minimum values and the ultimate maximum values from the multi-year dataset. Add labels, titles and legends. Use proper ranges. Be creative to make the plot attractive. 



<details>    
<summary>
    <font size="3" color="darkgreen"><b>Hints</b></font>
</summary>
<ul><li>use from bokeh.models import Band</li>
    <li>use ColumnDataSource to parse data arrays</li>
    <li>look for xaxis tick formatters</li>
</ul>
</details>
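
The hints above can be sketched as follows. This is a minimal illustration rather than a finished plot: it draws the 2018 mean temperature as a line and the multi-year extremes as a shaded `Band`, using the first five days of the values shown in the outputs earlier in this notebook. The column names `date`, `mean`, `lower` and `upper` are chosen for this example only.

```python
import pandas as pd
from bokeh.models import Band, ColumnDataSource
from bokeh.plotting import figure

# First five days: 2018 mean temperature plus the multi-year extremes.
source = ColumnDataSource(data={
    'date': pd.date_range('2018-01-01', periods=5, freq='D'),
    'mean': [6.0, 5.6, 7.5, 7.3, 6.0],
    'lower': [-5.8, -7.5, -12.6, -6.7, -6.2],
    'upper': [10.1, 10.2, 10.7, 9.8, 9.4],
})

p = figure(title='Mean temperature 2018 with multi-year extremes',
           x_axis_type='datetime', width=800, height=400)
p.line(x='date', y='mean', source=source, legend_label='Mean 2018')

# The band is an annotation: it reads its bounds from the same source.
band = Band(base='date', lower='lower', upper='upper', source=source,
            level='underlay', fill_alpha=0.3, fill_color='lightgrey')
p.add_layout(band)

p.xaxis.axis_label = 'Date'
p.yaxis.axis_label = 'Temperature (degrees Celsius)'
```

Calling `show(p)` would render the figure. For the full plot, the extremes per calendar day from part 3 would need to be aligned with the 2018 dates before building the source.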

---

In [16]:
from bokeh.models import ColumnDataSource, Band, Legend, HoverTool
from bokeh.models.tools import CustomJSHover
from bokeh.plotting import figure, show, output_notebook, gridplot
from bokeh.themes import built_in_themes
from bokeh.io import curdoc

curdoc().theme = 'dark_minimal'

output_notebook()

In [17]:
df_after.head()

| Date | STN | Tmean | Tmm | Tmax |
|---|---|---|---|---|
| 2018-01-01 | 270 | 6.0 | 4.0 | 7.9 |
| 2018-01-02 | 270 | 5.6 | 3.1 | 7.5 |
| 2018-01-03 | 270 | 7.5 | 5.3 | 9.2 |
| 2018-01-04 | 270 | 7.3 | 5.8 | 9.1 |
| 2018-01-05 | 270 | 6.0 | 4.0 | 7.6 |


In [19]:
df_after = smooth_data(df_after, 'Tmean', 7)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self._setitem_single_block(indexer, value, name)


In [20]:
df_after.head()

| Date | STN | Tmean | Tmm | Tmax | ewm |
|---|---|---|---|---|---|
| 2018-01-01 | 270 | 6.0 | 4.0 | 7.9 | 6.0 |
| 2018-01-02 | 270 | 5.6 | 3.1 | 7.5 | 5.6 |
| 2018-01-03 | 270 | 7.5 | 5.3 | 9.2 | 7.5 |
| 2018-01-04 | 270 | 7.3 | 5.8 | 9.1 | 7.3 |
| 2018-01-05 | 270 | 6.0 | 4.0 | 7.6 | 6.0 |


In [29]:
source = ColumnDataSource(df_after)

lowest = df_after['Tmean'].min()
highest = df_after['Tmean'].max()

p = figure(title='Average temperature in 2018', x_axis_type='datetime', width=1000)
p.line(x='Date', y='ewm', source=source, legend_label='Average temperature')

band = Band(base='Date', lower='Tmm', upper='Tmax', source=source, level='underlay',
           fill_alpha=0.1, fill_color='lightgrey', line_width=1, line_color='salmon')
p.add_layout(band)

p.title.text = "Temperature 2018 Smoothed"
p.xaxis.axis_label = 'Date'
p.yaxis.axis_label = 'Degrees Celsius'

p.legend.location = 'top_left'

p.add_tools(
    HoverTool(
        show_arrow=False,
        line_policy='next',
        tooltips=[
            ('Avg', '@Tmean'),
            ('Max', '@Tmax'),
            ('Min', '@Tmm')
        ],
    )
)

show(p)

<a name='5'></a>
## Part 6: Challenge

Make a widget in which you can select the year range for the multiyear set. Add this to your layout to make the plot interactive. Add another widget to select or deselect the smoother. Inspiration: https://demo.bokeh.org/weather
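
A possible starting point for the date widget is sketched below. It is a minimal illustration under stated assumptions, not a full solution: the slider only zooms the x-axis through a `CustomJS` callback, which works in a standalone notebook. Recomputing the multi-year extremes for the selected years would, as far as I can tell, need Python-side logic such as a Bokeh server callback. The data values are the first five 2001 rows shown in this notebook.

```python
from datetime import datetime

from bokeh.layouts import column
from bokeh.models import ColumnDataSource, CustomJS, DateRangeSlider
from bokeh.plotting import figure

start, end = datetime(2001, 1, 1), datetime(2001, 1, 5)
source = ColumnDataSource(data={
    'date': [datetime(2001, 1, day) for day in range(1, 6)],
    'Tmean': [2.1, 6.3, 5.3, 6.5, 6.6],
})

# An explicit x_range gives the callback a range object to update.
p = figure(x_axis_type='datetime', x_range=(start, end), width=800, height=300)
p.line(x='date', y='Tmean', source=source)

slider = DateRangeSlider(value=(start, end), start=start, end=end,
                         title='Date range')

# Runs in the browser: cb_obj is the slider, and its value holds the two
# selected dates as timestamps in milliseconds.
callback = CustomJS(args=dict(x_range=p.x_range), code="""
    x_range.start = cb_obj.value[0];
    x_range.end = cb_obj.value[1];
""")
slider.js_on_change('value', callback)

layout = column(slider, p)
```

Calling `show(layout)` would display the slider above the plot.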

In [35]:
# challenge set
ch_df = df.reset_index(drop=True)
ch_df = ch_df.set_index('Date')

In [36]:
ch_df.head()

| Date | STN | Tmean | Tmm | Tmax |
|---|---|---|---|---|
| 2001-01-01 | 270 | 2.1 | 0.4 | 3.8 |
| 2001-01-02 | 270 | 6.3 | 3.7 | 8.5 |
| 2001-01-03 | 270 | 5.3 | 2.6 | 7.9 |
| 2001-01-04 | 270 | 6.5 | 4.6 | 7.9 |
| 2001-01-05 | 270 | 6.6 | 5.6 | 8.3 |


In [46]:
ch_df.index[0]

Timestamp('2001-01-01 00:00:00')

In [48]:
from bokeh.models import ColumnDataSource, DateRangeSlider, Select, CustomJS

In [39]:
# Creating smoothing options
smoothing_option = 'Moving Average'
smoothing_options = ['Moving Average', 'Exponential Moving Average', 'None']
smoothing_select = Select(value=smoothing_option, title='Smoothing option', options=smoothing_options)

In [50]:
start_date = ch_df.index[0]
med_date = ch_df.index[500] # random
end_date = ch_df.index[-1]

# Placeholder callback without client-side logic yet.
callback = CustomJS(code="")

date_range_slider = DateRangeSlider(value=(start_date, med_date),
                                    start=start_date, end=end_date)
date_range_slider.js_on_change("value", callback)

show(date_range_slider)

In [51]:
source = ColumnDataSource(ch_df)

In [None]:
plot = figure(width=800, height=800)