# Exercise Set 4: Data Structuring 1

*Afternoon, August 13, 2019*

In this Exercise Set we will apply some of the basic things we have learned with pandas.

#### Load modules
We begin by loading relevant packages.

In [1]:
import numpy as np
import pandas as pd

##  Exercise Section 4.1: Weather, part 1

Some data sources are open and easy to collect data from. They can be 'scraped' as is, and they are already in a table format. This section is the first of three that work with weather data; the follow-ups are Exercise Sections 6.1 and 7.1. Our source is the National Oceanic and Atmospheric Administration (NOAA), which has a global data collection going back a couple of centuries. This collection is called the Global Historical Climatology Network (GHCN). A description of GHCN can be found [here](https://www1.ncdc.noaa.gov/pub/data/ghcn/daily/by_year/readme.txt).


> **Ex. 4.1.1:** Use Pandas' CSV reader to fetch daily weather data from 1864 for various stations - available [here](https://www1.ncdc.noaa.gov/pub/data/ghcn/daily/by_year/).

> *Hint 1*: for compressed files you may need to specify the keyword `compression`.

> *Hint 2*: keyword `header` can be specified as the CSV has no column names.

> *Hint 3*: Specify the path as the URL linking directly to the 1864 file.

In [2]:
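# [Answer to Ex. 4.1.1]
url = 'https://www1.ncdc.noaa.gov/pub/data/ghcn/daily/by_year/1864.csv.gz'
# Caution: header=0 uses the first data row as column names (visible in the
# printed header below), so that observation is lost; header=None would keep it.
test = pd.read_csv(url, header=0, compression='gzip')
print(test)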

       ITE00100550  18640101  TMAX   10 Unnamed: 4 Unnamed: 5  E  Unnamed: 7
0      ITE00100550  18640101  TMIN  -23        NaN        NaN  E         NaN
1      ITE00100550  18640101  PRCP   25        NaN        NaN  E         NaN
2      ASN00079028  18640101  PRCP    0        NaN        NaN  a         NaN
3      USC00064757  18640101  PRCP  119        NaN        NaN  F         NaN
4      SF000208660  18640101  PRCP    0        NaN        NaN  I         NaN
5      ASN00089000  18640101  PRCP    0        NaN        NaN  a         NaN
6      SWE00100003  18640101  PRCP    0        NaN        NaN  E         NaN
7      ASN00086071  18640101  TMAX  214        NaN        NaN  a         NaN
8      ASN00086071  18640101  TMIN  101        NaN        NaN  a         NaN
9      ASN00086071  18640101  PRCP    0        NaN        NaN  a         NaN
10     USP00CA0003  18640101  PRCP    0        NaN        NaN  F         NaN
11     USC00189674  18640101  PRCP    0        NaN        NaN  F         NaN
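
The effect of `header` can be checked without downloading anything. The sketch below is only an illustration: it builds a tiny gzip-compressed CSV in memory (three rows reconstructed from the output above) and reads it back with `header=None`, so the first row stays in the data. The column names in `cols` are illustrative choices and an assumption of this example; the file itself has none.

```python
import gzip
import io

import pandas as pd

# Three headerless rows in the by-year layout, reconstructed from the output above.
raw = (
    "ITE00100550,18640101,TMAX,10,,,E,\n"
    "ITE00100550,18640101,TMIN,-23,,,E,\n"
    "ITE00100550,18640101,PRCP,25,,,E,\n"
)
compressed = io.BytesIO(gzip.compress(raw.encode("utf-8")))

# Illustrative column names (an assumption; the file has no header row).
cols = ["StationId", "Date", "ObsType", "ObsValue",
        "MFlag", "QFlag", "SFlag", "ObsTime"]

# header=None tells pandas that every row is data; names supplies the labels.
sample = pd.read_csv(compressed, header=None, names=cols, compression="gzip")
print(sample.iloc[:, :4])
```

With `header=None` all three rows survive, including the `TMAX` observation that `header=0` would turn into column labels.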


> **Ex. 4.1.2:** Structure your weather DataFrame by using only the relevant columns (station identifier, date, observation type, observation value) and rename them. Make sure observations are correctly formatted (how many decimals should we add? one?).

> *Hint:* rename can be done with `df.columns=COLS` where `COLS` is a list of column names.


In [10]:
# [Answer to Ex. 4.1.2]

# The yearly files are formatted so that every observation
# (i.e., station/year/month/day/element/observation time) is represented by a single row
# with the following fields:

# station identifier (GHCN Daily Identification Number)
# date (yyyymmdd; where yyyy=year; mm=month; and, dd=day)
# observation type (see ftp://ftp.ncdc.noaa.gov/pub/data/ghcn/daily/readme.txt for definitions)
# observation value (see ftp://ftp.ncdc.noaa.gov/pub/data/ghcn/daily/readme.txt for units)
# observation time (if available, as hhmm where hh=hour and mm=minutes in local time)

columns = ['StationId', 'Date', 'ObsType', 'ObsValue']

# Keep the first four columns; .copy() makes df independent of test.
df = test.iloc[:, :4].copy()
df.columns = columns
print(df.head())
print(df.tail())

     StationId      Date ObsType  ObsValue
0  ITE00100550  18640101    TMIN       -23
1  ITE00100550  18640101    PRCP        25
2  ASN00079028  18640101    PRCP         0
3  USC00064757  18640101    PRCP       119
4  SF000208660  18640101    PRCP         0
         StationId      Date ObsType  ObsValue
27343  UK000056225  18641231    PRCP         3
27344  ASN00026026  18641231    PRCP         0
27345  ASN00089049  18641231    PRCP         0
27346  SZ000006717  18641231    TMAX       -62
27347  SZ000006717  18641231    TMIN      -105
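
The question about decimals refers to the units: GHCN daily reports temperatures in tenths of degrees Celsius and precipitation in tenths of millimetres, so one decimal is needed after dividing by 10. A minimal sketch of that step follows, using a small hand-typed frame (`df_small`, a hypothetical stand-in with rows copied from the output above) rather than the full data. Parsing the dates is an optional extra that the exercise does not ask for.

```python
import pandas as pd

# Small stand-in for the structured weather DataFrame (rows from the output above).
df_small = pd.DataFrame({
    "StationId": ["ITE00100550", "ITE00100550", "USC00064757"],
    "Date": [18640101, 18640101, 18640101],
    "ObsType": ["TMIN", "PRCP", "PRCP"],
    "ObsValue": [-23, 25, 119],
})

# Tenths of a unit -> whole units with one decimal (-23 becomes -2.3).
df_small["ObsValue"] = df_small["ObsValue"] / 10

# Optional: parse the yyyymmdd integers into proper timestamps.
df_small["Date"] = pd.to_datetime(df_small["Date"].astype(str), format="%Y%m%d")

print(df_small)
```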



> **Ex. 4.1.3:**  Select data for the station `ITE00100550` and only observations for maximal temperature. Make a copy of the DataFrame. Explain in one or two sentences how copying works.

> *Hint 1*: the `&` operator works elementwise on boolean series (like `and` in core Python).

> *Hint 2*: copying of the dataframe is done with the `copy` method for DataFrames.

In [4]:
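# [Answer to Ex. 4.1.3]
# .copy() returns a new DataFrame with its own data, so later changes to the
# selection (such as adding a column) do not affect df, and vice versa.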
station_id = 'ITE00100550'
observation_type = 'TMAX'
ite0010055_maxtmp_obs = df.loc[(df['StationId'] == station_id) & (df['ObsType'] == observation_type)].copy()
print(ite0010055_maxtmp_obs.head())
print(ite0010055_maxtmp_obs.tail())

       StationId      Date ObsType  ObsValue
74   ITE00100550  18640102    TMAX         8
151  ITE00100550  18640103    TMAX       -28
226  ITE00100550  18640104    TMAX         0
304  ITE00100550  18640105    TMAX       -19
382  ITE00100550  18640106    TMAX       -13
         StationId      Date ObsType  ObsValue
26967  ITE00100550  18641227    TMAX        20
27042  ITE00100550  18641228    TMAX        63
27118  ITE00100550  18641229    TMAX        71
27195  ITE00100550  18641230    TMAX        50
27271  ITE00100550  18641231    TMAX        33
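
How copying behaves can be seen on a toy example. The sketch below uses a made-up three-row frame (`original` and `tmax_only` are hypothetical names) to show that a selection followed by `.copy()` is independent of the frame it came from.

```python
import pandas as pd

original = pd.DataFrame({
    "ObsType": ["TMAX", "TMIN", "TMAX"],
    "ObsValue": [10, -23, 8],
})

# Boolean selection followed by .copy() gives an independent DataFrame.
tmax_only = original.loc[original["ObsType"] == "TMAX"].copy()

# Changing the copy leaves the original untouched.
tmax_only["ObsValue"] = tmax_only["ObsValue"] + 100

print(original["ObsValue"].tolist())   # [10, -23, 8]
print(tmax_only["ObsValue"].tolist())  # [110, 108]
```

Without `.copy()`, pandas may warn on such an assignment because it cannot always tell whether the selection shares data with the original.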


> **Ex. 4.1.4:** Make a new column called `TMAX_F` where you have converted the temperature variables to Fahrenheit. 

> *Hint*: Conversion is $F = 32 + 1.8*C$ where $F$ is Fahrenheit and $C$ is Celsius.

In [5]:
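# [Answer to Ex. 4.1.4]
# Caution: GHCN stores temperatures in tenths of degrees Celsius, so ObsValue
# should be divided by 10 before conversion; applied to the raw values, as here,
# the formula overstates the result (e.g. 63 means 6.3 °C, not 63 °C).
def c_to_f(c: int) -> float:
    return 32 + 1.8 * c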

ite0010055_maxtmp_obs['TMAX_F'] = ite0010055_maxtmp_obs['ObsValue'].apply(c_to_f)
print(ite0010055_maxtmp_obs.head())
print(ite0010055_maxtmp_obs.tail())

       StationId      Date ObsType  ObsValue  TMAX_F
74   ITE00100550  18640102    TMAX         8    46.4
151  ITE00100550  18640103    TMAX       -28   -18.4
226  ITE00100550  18640104    TMAX         0    32.0
304  ITE00100550  18640105    TMAX       -19    -2.2
382  ITE00100550  18640106    TMAX       -13     8.6
         StationId      Date ObsType  ObsValue  TMAX_F
26967  ITE00100550  18641227    TMAX        20    68.0
27042  ITE00100550  18641228    TMAX        63   145.4
27118  ITE00100550  18641229    TMAX        71   159.8
27195  ITE00100550  18641230    TMAX        50   122.0
27271  ITE00100550  18641231    TMAX        33    91.4
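
As an alternative sketch, the conversion can be written with vectorized arithmetic instead of `apply`, scaling the raw values to degrees Celsius first. It assumes, as the GHCN description states, that `ObsValue` holds tenths of degrees Celsius; the frame `tmax` and the column `TMAX_C` are names introduced here for illustration.

```python
import pandas as pd

# First three raw TMAX values from the output above (tenths of degrees Celsius).
tmax = pd.DataFrame({"ObsValue": [8, -28, 0]})

# Scale to degrees Celsius, then convert; Series arithmetic is elementwise.
tmax["TMAX_C"] = tmax["ObsValue"] / 10
tmax["TMAX_F"] = 32 + 1.8 * tmax["TMAX_C"]

print(tmax.round(2))
```

On these rows the result is about 33.44, 26.96 and 32.0 degrees Fahrenheit, which is plausible for winter days, unlike values above 100.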


> **Ex 4.1.5:**  Inspect the indices. Do they follow the sequence of natural numbers, 0, 1, 2, ...? If not, reset the index and make sure to drop the old one.

In [11]:
# [Answer to Ex. 4.1.5]
ite0010055_maxtmp_obs = ite0010055_maxtmp_obs.reset_index(drop=True)
print(ite0010055_maxtmp_obs.head())
print(ite0010055_maxtmp_obs.tail())

     StationId      Date ObsType  ObsValue  TMAX_F
0  ITE00100550  18640102    TMAX         8    46.4
1  ITE00100550  18640103    TMAX       -28   -18.4
2  ITE00100550  18640104    TMAX         0    32.0
3  ITE00100550  18640105    TMAX       -19    -2.2
4  ITE00100550  18640106    TMAX       -13     8.6
       StationId      Date ObsType  ObsValue  TMAX_F
360  ITE00100550  18641227    TMAX        20    68.0
361  ITE00100550  18641228    TMAX        63   145.4
362  ITE00100550  18641229    TMAX        71   159.8
363  ITE00100550  18641230    TMAX        50   122.0
364  ITE00100550  18641231    TMAX        33    91.4
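
What "dropping the old index" means is easiest to see side by side. In this small sketch (the frame `s` is a made-up example that reuses three index labels from the earlier output), `reset_index()` keeps the old labels as a column, while `reset_index(drop=True)` discards them.

```python
import pandas as pd

s = pd.DataFrame({"ObsValue": [8, -28, 0]}, index=[74, 151, 226])

kept = s.reset_index()              # old labels become a column named "index"
dropped = s.reset_index(drop=True)  # old labels are discarded

print(kept.columns.tolist())   # ['index', 'ObsValue']
print(dropped.index.tolist())  # [0, 1, 2]
```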


> **Ex 4.1.6:** Make a new DataFrame where you have sorted by the maximum temperature. What is the date for the first and last observations?

In [26]:
# [Answer to Ex. 4.1.6]
obs_sorted = ite0010055_maxtmp_obs.sort_values(by=['ObsValue'])
#print(obs_sorted.iloc[0, :])
#print(obs_sorted.iloc[-1, :])
print(obs_sorted.loc[:, 'Date'].head(1))
print(obs_sorted.loc[:, 'Date'].tail(1))

print()

# Alternate using iloc
date_column = 1
print(obs_sorted.iloc[0, date_column])
print(obs_sorted.iloc[-1, date_column])


15    18640117
Name: Date, dtype: int64
220    18640809
Name: Date, dtype: int64

18640117
18640809
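
If only the dates of the extremes are needed, sorting the whole frame is not required. A possible alternative, sketched on a toy frame, uses `idxmin` and `idxmax` to find the row labels of the smallest and largest value. The dates come from the output above, but the temperature values in the last two rows are made up for illustration.

```python
import pandas as pd

obs = pd.DataFrame({
    "Date": [18640102, 18640117, 18640809],
    "ObsValue": [8, -50, 300],  # last two values are invented for this example
})

# idxmin/idxmax return the index label of the first minimum/maximum.
coldest_date = obs.loc[obs["ObsValue"].idxmin(), "Date"]
hottest_date = obs.loc[obs["ObsValue"].idxmax(), "Date"]

print(coldest_date, hottest_date)
```

When several days share the same extreme value, `idxmin` and `idxmax` return the first one in index order, whereas the row that ends up first or last after `sort_values` depends on the sorting algorithm.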


> **Ex 4.1.7:** CSV files: save your DataFrame as a CSV file. What does the `index` argument do?

> Try to save the file using a relative path and an absolute path. With a relative path you only specify the file name, which saves the file in the folder you are currently working in. With an absolute path you specify the whole path, which allows you to save the file in a folder of your choice.

In [8]:
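# [Answer to Ex. 4.1.7]
import os

# expanduser('~') resolves the home directory on any platform.
home_path = os.path.expanduser('~')
abs_path = os.path.join(home_path, 'development', 'SummerSchool', 'sds_assignments', 'lecture_exercises')

# index=True (the default) writes the row labels as an extra first column;
# pass index=False to leave them out.
ite0010055_maxtmp_obs.to_csv('ite0010055_maxtmp_obs.csv')  # relative path
df.to_csv(os.path.join(abs_path, '1864_tmps.csv'))  # absolute path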
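
The role of `index` can be inspected without writing any file, since `to_csv` returns the CSV text as a string when no path is given. The sketch below uses a two-row toy frame (`frame` is a hypothetical name) and compares the two settings.

```python
import pandas as pd

frame = pd.DataFrame({"Date": [18640102, 18640103], "ObsValue": [8, -28]})

with_index = frame.to_csv()                # index=True is the default
without_index = frame.to_csv(index=False)  # row labels are left out

print(with_index)
print(without_index)
```

With the default, the header line starts with an empty field and every row begins with its index label; with `index=False` only the actual columns are written.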

> **(Bonus) Ex. 4.1.8**: A very compact way of writing code and making lists in Python is called list comprehensions. Depending on what you are doing, list comprehensions can be more or less efficient than, for example, vectorized operations using NumPy.

> Read about list comprehensions online, and use one to make a list with the numbers from 0 to a million (10\*\*6), adding 3 to each element. Do the same using NumPy, and time both methods. Which method is faster?

> *Hint*: Use the `timeit` module for timing each method.

In [9]:
# [Answer to Ex. 4.1.8]
import timeit

# Each statement is run 100 times; the NumPy import goes in setup so it is not timed.
list_comp_time = timeit.timeit('[num + 3 for num in range(0, 1000000)]', number=100)
np_arange_time = timeit.timeit('numpy.arange(0, 1000000) + 3', setup='import numpy', number=100)

print(list_comp_time)
print(np_arange_time)

6.188841584999977
0.17127553000000262
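
In the run above NumPy is roughly 36 times faster (about 6.19 s against 0.17 s for 100 repetitions). A slightly more defensive way to time the two approaches, sketched below, passes functions to `timeit.repeat` and takes the minimum over several repetitions, which tends to be less sensitive to background noise. The function names and the repetition counts are arbitrary choices for this example, and the timings will differ between machines.

```python
import timeit

import numpy as np

N = 10**6

def with_list_comprehension():
    return [num + 3 for num in range(N)]

def with_numpy():
    return np.arange(N) + 3

# Both approaches produce the same numbers.
assert with_list_comprehension()[:3] == with_numpy()[:3].tolist()

# timeit.repeat returns one total time per repetition; keep the smallest.
list_time = min(timeit.repeat(with_list_comprehension, number=5, repeat=3))
numpy_time = min(timeit.repeat(with_numpy, number=5, repeat=3))

print(f"list comprehension: {list_time:.3f} s, NumPy: {numpy_time:.3f} s")
```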
