# Assignment weather data solution

In this assignment we will work with weather data from the KNMI. A subset of the data is available in the file `KNMI_20181231.csv`. The data consists of daily weather observations from several stations over several years. Your task is to build a tidy dataframe ready for further processing.


Learning outcomes

- load, inspect and clean a dataset 

The assignment consists of the following parts:

- [part 1: load the data](#0)
- [part 2: clean the data](#1)


---

<a name='0'></a>
## Part 1: Load the data

Load the dataset from either `KNMI_20181231.csv` or `KNMI_20181231.txt.tsv`. The column headers contain spaces and are not self-explanatory; replace them with more readable names. Select the data from station 270 and keep only the mean, minimum and maximum temperature (alongside the station and date columns). The data should look something like this:


In [1]:
import pandas as pd
import numpy as np

df = pd.read_csv("../data/KNMI_20181231.txt.tsv",
                 sep=",", 
                 usecols=[0, 1, 2, 3, 4], 
                 names=["station", "Date", "Tmean", "Tmin", "Tmax"], 
                 comment="#", 
                 low_memory=False)
df = df[df.station == 270]
df

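| | station | Date | Tmean | Tmin | Tmax |
|---|---|---|---|---|---|
| 97641 | 270 | 20000101 | 42 | -4 | 79 |
| 97642 | 270 | 20000102 | 55 | 33 | 74 |
| 97643 | 270 | 20000103 | 74 | 49 | 89 |
| 97644 | 270 | 20000104 | 46 | 22 | 75 |
| 97645 | 270 | 20000105 | 41 | 14 | 56 |
| ... | ... | ... | ... | ... | ... |
| 104576 | 270 | 20181227 | 57 | 53 | 62 |
| 104577 | 270 | 20181228 | 71 | 58 | 81 |
| 104578 | 270 | 20181229 | 85 | 69 | 102 |
| 104579 | 270 | 20181230 | 80 | 68 | 90 |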
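
The loading step can also be tried on a small in-memory sample before touching the real file. The sketch below is only an illustration: the sample text is synthetic, and its layout (a commented header line followed by comma-separated, space-padded values) is an assumption about the file, not a copy of it. The names `sample`, `raw` and `station_270` are made up for this example.

```python
import io

import pandas as pd

# Synthetic sample: the layout is an assumption, not the real file content.
sample = io.StringIO(
    "# STN,YYYYMMDD,   TG,   TN,   TX\n"
    "270,20000101,   42,   -4,   79\n"
    "270,20000102,   55,   33,   74\n"
    "280,20000101,   40,   -6,   75\n"
)

raw = pd.read_csv(
    sample,
    comment="#",            # skip the commented header line
    header=None,            # no usable header row remains after that
    names=["station", "Date", "Tmean", "Tmin", "Tmax"],
    skipinitialspace=True,  # drop the padding after each delimiter
)

# Keep only the rows of station 270; .copy() makes later edits safe.
station_270 = raw[raw["station"] == 270].copy()
print(station_270)
```

Passing `names` replaces the padded original headers with readable ones in a single step, so no separate rename is needed.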


---

<a name='1'></a>
## Part 2: Clean the data

The data is not clean. Empty cells in the dataframe need to be replaced with NaN, and the temperatures are stored in tenths of a degree Celsius, so they need to be converted to degrees. The date field needs a datetime format. For visualization convenience we would like to remove the leap day (29 February). Carry out the cleaning.

In [2]:
# replace empty or whitespace-only cells with NaN
df = df.replace(r"^\s*$", np.nan, regex=True)
# convert the date column from YYYYMMDD to datetime
df.Date = pd.to_datetime(df['Date'].astype(str), format='%Y%m%d')
# convert temperatures from tenths of a degree to degrees Celsius
df.Tmean = pd.to_numeric(df.Tmean, errors='coerce') / 10
df.Tmin = pd.to_numeric(df.Tmin, errors='coerce') / 10
df.Tmax = pd.to_numeric(df.Tmax, errors='coerce') / 10
# remove the leap day (29 February)
df = df[~((df.Date.dt.month == 2) & (df.Date.dt.day == 29))]
df

| | station | Date | Tmean | Tmin | Tmax |
|---|---|---|---|---|---|
| 97641 | 270 | 2000-01-01 | 4.2 | -0.4 | 7.9 |
| 97642 | 270 | 2000-01-02 | 5.5 | 3.3 | 7.4 |
| 97643 | 270 | 2000-01-03 | 7.4 | 4.9 | 8.9 |
| 97644 | 270 | 2000-01-04 | 4.6 | 2.2 | 7.5 |
| 97645 | 270 | 2000-01-05 | 4.1 | 1.4 | 5.6 |
| ... | ... | ... | ... | ... | ... |
| 104576 | 270 | 2018-12-27 | 5.7 | 5.3 | 6.2 |
| 104577 | 270 | 2018-12-28 | 7.1 | 5.8 | 8.1 |
| 104578 | 270 | 2018-12-29 | 8.5 | 6.9 | 10.2 |
| 104579 | 270 | 2018-12-30 | 8.0 | 6.8 | 9.0 |


<details>    
<summary>
    <font size="3" color="darkgreen"><b>Hints</b></font>
</summary>
<ul><li><code>pd.to_datetime(df['Date'].astype(str), format='%Y%m%d')</code></li>
    <li>regex for empty cells: <code>^\s*$</code></li>
    <li>remove rows where <code>month == 2 &amp; day == 29</code></li>
</ul>
</details>
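
The three hints can be checked together on a tiny frame. This is a minimal sketch: the `demo` frame and its values are hypothetical and chosen only so that one cell is blank and one row falls on 29 February.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature frame: temperatures as strings in tenths of a degree.
demo = pd.DataFrame({
    "station": [270, 270, 270],
    "Date": [20000228, 20000229, 20000301],
    "Tmean": ["  ", "50", "55"],
})

# Blank or whitespace-only cells become NaN.
demo = demo.replace(r"^\s*$", np.nan, regex=True)
# YYYYMMDD integers become proper datetimes.
demo["Date"] = pd.to_datetime(demo["Date"].astype(str), format="%Y%m%d")
# Tenths of a degree become degrees Celsius.
demo["Tmean"] = pd.to_numeric(demo["Tmean"], errors="coerce") / 10

# Drop the leap day so every year has the same number of days.
is_leap_day = (demo["Date"].dt.month == 2) & (demo["Date"].dt.day == 29)
demo = demo[~is_leap_day]
print(demo)
```

Building the boolean mask `is_leap_day` first and negating it afterwards keeps the filter readable. Note that the row with the missing temperature is kept, since only the leap day is removed.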

### Expected outcome

---