# Data Processing Task

The goal of this task is to convert the weather data to Parquet format, setting the row group
size to a value appropriate for this data.

The converted data should be queryable to answer the following questions:

- Which date was the hottest day?
- What was the temperature on that day?
- In which region was the hottest day?
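
As a rough sketch of how these three questions could be answered with pandas, the snippet below picks the row with the highest `ScreenTemperature`. It assumes "hottest day" means the single highest reading (not a daily average), and it runs on only five rows copied from the `df.head()` output shown later in this notebook, so the result describes that sample and not the full dataset. The helper name `hottest_observation` is hypothetical.

```python
import pandas as pd

# Five rows copied from the df.head() output in this notebook (not the full dataset).
sample = pd.DataFrame({
    "ObservationDate": pd.to_datetime(["2016-02-01T00:00:00"] * 5),
    "ScreenTemperature": [2.1, 0.1, 2.8, 1.6, 9.8],
    "Region": ["Orkney & Shetland"] * 4 + ["Highland & Eilean Siar"],
})


def hottest_observation(df):
    """Return (date, temperature, region) for the highest ScreenTemperature reading."""
    # idxmax gives the index label of the first maximum; NaN readings are skipped.
    row = df.loc[df["ScreenTemperature"].idxmax()]
    return row["ObservationDate"].date(), row["ScreenTemperature"], row["Region"]


date, temperature, region = hottest_observation(sample)
print(date, temperature, region)
```

The same function could be applied to a dataframe read back from the parquet file, provided it still has these three columns.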

The steps followed are:


1. Retrieving data from source
2. Exploring data
3. Preprocessing data
4. Storing dataframe to parquet file format
5. Testing


### 1. Retrieving data from source

First, we import the usual data analysis libraries.

In [1]:
import numpy as np
import pandas as pd
from pandas import Series, DataFrame

# Import the names directly; a later `import datetime` would rebind `datetime` to the module
from datetime import datetime, timedelta

Retrieve the dataset from the CSV file into a pandas DataFrame.

In [2]:
file = 'src/weather.20160201.csv'
try:
    df = pd.read_csv(file)
except Exception as e:
    print(f'The dataset could not be read from {file}: {e}')
    raise  # re-raise so later cells do not run with df undefined

### 2. Exploring data

In [3]:
df.head()

| | ForecastSiteCode | ObservationTime | ObservationDate | WindDirection | WindSpeed | WindGust | Visibility | ScreenTemperature | Pressure | SignificantWeatherCode | SiteName | Latitude | Longitude | Region | Country |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3002 | 0 | 2016-02-01T00:00:00 | 12 | 8 | | 30000.0 | 2.1 | 997.0 | 8 | BALTASOUND (3002) | 60.749 | -0.854 | Orkney & Shetland | SCOTLAND |
| 1 | 3005 | 0 | 2016-02-01T00:00:00 | 10 | 2 | | 35000.0 | 0.1 | 997.0 | 7 | LERWICK (S. SCREEN) (3005) | 60.139 | -1.183 | Orkney & Shetland | SCOTLAND |
| 2 | 3008 | 0 | 2016-02-01T00:00:00 | 8 | 6 | | 50000.0 | 2.8 | 997.0 | -99 | FAIR ISLE (3008) | 59.53 | -1.63 | Orkney & Shetland | |
| 3 | 3017 | 0 | 2016-02-01T00:00:00 | 6 | 8 | | 40000.0 | 1.6 | 996.0 | 8 | KIRKWALL (3017) | 58.954 | -2.9 | Orkney & Shetland | SCOTLAND |
| 4 | 3023 | 0 | 2016-02-01T00:00:00 | 10 | 30 | 37.0 | 2600.0 | 9.8 | 991.0 | 11 | SOUTH UIST RANGE (3023) | 57.358 | -7.397 | Highland & Eilean Siar | SCOTLAND |


Observation 1: No measurement units are given for 'WindDirection', 'WindSpeed', 'WindGust', 'Visibility', 'ScreenTemperature' or 'Pressure'.

Units assumed in order to interpret the data:

- WindDirection [°]
- WindSpeed [km/h]
- WindGust [km/h]
- Visibility [m]
- ScreenTemperature [unit not stated; the values look like °C]
- Pressure [hPa; values near 997 would be implausible in Pa]

Check the column data types.

In [4]:
print('Column data types:')
df.dtypes

Column data types:


ForecastSiteCode            int64
ObservationTime             int64
ObservationDate            object
WindDirection               int64
WindSpeed                   int64
WindGust                  float64
Visibility                float64
ScreenTemperature         float64
Pressure                  float64
SignificantWeatherCode      int64
SiteName                   object
Latitude                  float64
Longitude                 float64
Region                     object
Country                    object
dtype: object

In [5]:
type(df['ObservationDate'][0]) # dtype 'object' here means the values are Python strings

str

### 3. Preprocessing data

Observation 2: ObservationTime and ObservationDate are stored as integer and string data types respectively.

Let's merge both into a single datetime column, treating ObservationTime as the hour of the day.

In [6]:
# Convert ObservationDate into datetime
df['ObservationDate'] = pd.to_datetime(df['ObservationDate'])

In [7]:
# Create a new datetime column merging the date and time of ObservationDate and ObservationTime columns
df['ObservationDateTime'] = df.apply(lambda row: row.ObservationDate + timedelta(hours = row.ObservationTime), axis='columns')
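
The row-wise `apply` above works but calls a Python function once per row. A possible vectorized alternative is sketched below, assuming (as the cell above does) that ObservationTime holds whole hours. The two sample rows are illustrative; the value 13 does not come from the dataset.

```python
import pandas as pd

# Illustrative sample: the date string format matches the notebook, the hour 13 is made up.
sample = pd.DataFrame({
    "ObservationDate": ["2016-02-01T00:00:00", "2016-02-01T00:00:00"],
    "ObservationTime": [0, 13],
})

# Parse the strings once, convert the hour column to timedeltas, then add column-wise.
dates = pd.to_datetime(sample["ObservationDate"])
hours = pd.to_timedelta(sample["ObservationTime"], unit="h")
sample["ObservationDateTime"] = dates + hours

print(sample["ObservationDateTime"])
```

Both approaches should produce the same column; the vectorized form is usually faster on large frames because the addition happens on whole columns.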

In [8]:
# Drop both columns from dataframe
df = df.drop(['ObservationDate','ObservationTime'], axis='columns')

### 4. Storing dataframe to parquet file format

In [9]:
try:
    df.to_parquet('output/weather_data.parquet', engine='fastparquet')
except Exception as e:
    print(f'The dataframe could not be written to the parquet file: {e}')
    raise  # re-raise so the tests below do not read a stale or missing file
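
The task statement asks for the row group to be set explicitly, which the cell above leaves at the engine default. A minimal sketch of one way to do this follows. It assumes the `row_group_offsets` keyword of `fastparquet.write`, which pandas forwards through `to_parquet`'s extra keyword arguments. The helper names and the default of 50,000 rows per group are hypothetical and not tuned for this dataset.

```python
def row_group_starts(n_rows, rows_per_group):
    """Return the row index at which each row group starts."""
    if rows_per_group <= 0:
        raise ValueError("rows_per_group must be positive")
    # An empty frame still gets a single (empty) group starting at row 0.
    return list(range(0, n_rows, rows_per_group)) or [0]


def write_with_row_groups(df, path, rows_per_group=50_000):
    """Write df to parquet with explicit row group boundaries (fastparquet engine)."""
    offsets = row_group_starts(len(df), rows_per_group)
    # Assumption: pandas passes row_group_offsets on to fastparquet.write unchanged.
    df.to_parquet(path, engine="fastparquet", row_group_offsets=offsets)


# Ten rows split into groups of four start at rows 0, 4 and 8.
print(row_group_starts(10, 4))
```

With the dataframe from this notebook, the call would look like `write_with_row_groups(df, 'output/weather_data.parquet')`. Smaller row groups let a reader skip more data when filtering, at the cost of more metadata, so the right size depends on how the file will be queried.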
    
    

### 5. Testing

Check that the stored data are equal to the source dataframe.

In [10]:
# Retrieve dataset from parquet file
df_test = pd.read_parquet('output/weather_data.parquet')

In [27]:
# Check whether the two dataframes are equal
if df.equals(df_test):
    print('OK: Dataset correctly stored')
else:
    raise Exception('KO: Dataset incorrectly stored')

OK: Dataset correctly stored
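
`df.equals` only returns True or False, so a failed check does not say what differs. As a possible complement, `pd.testing.assert_frame_equal` raises an `AssertionError` that describes the mismatch. The sketch below uses two small made-up frames in place of `df` and `df_test`.

```python
import pandas as pd

# Made-up stand-ins for the source dataframe and the one read back from parquet.
expected = pd.DataFrame({
    "ScreenTemperature": [2.1, 0.1],
    "Region": ["Orkney & Shetland", "Orkney & Shetland"],
})
actual = expected.copy()

# Identical frames pass silently.
pd.testing.assert_frame_equal(expected, actual)

# Introduce a difference to see what the failure report looks like.
actual.loc[1, "ScreenTemperature"] = 0.2
try:
    pd.testing.assert_frame_equal(expected, actual)
    mismatch_report = None
except AssertionError as err:
    mismatch_report = str(err)

print(mismatch_report)
```

In this notebook the equivalent call would be `pd.testing.assert_frame_equal(df, df_test)`.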
