# Exercise Set 4: Data Structuring 1

*Afternoon, August 13, 2019*

In this Exercise Set we will apply some of the basic things we have learned with pandas.

#### Load modules
We begin by loading relevant packages.

In [1]:
import numpy as np
import pandas as pd

##  Exercise Section 4.1: Weather, part 1

Some data sources are open and easy to collect data from. They can be 'scraped' as is, and they are already in a table format. This set of exercises is the first of three parts that work with weather data; the follow-ups are Exercise Sections 6.1 and 7.1. Our source is the National Oceanic and Atmospheric Administration (NOAA), which maintains a global data collection going back a couple of centuries. This collection is called the Global Historical Climatology Network (GHCN). A description of GHCN can be found [here](https://www1.ncdc.noaa.gov/pub/data/ghcn/daily/by_year/readme.txt).


> **Ex. 4.1.1:** Use pandas' CSV reader to fetch daily weather data from 1864 for various stations, available [here](https://www1.ncdc.noaa.gov/pub/data/ghcn/daily/by_year/).

> *Hint 1*: for compressed files you may need to specify the keyword `compression`.

> *Hint 2*: the keyword `header` should be specified, as the CSV has no column names.

> *Hint 3*: Specify the path as the URL linking directly to the 1864 file.

In [2]:
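# [Answer to Ex. 4.1.1]
url = 'https://www1.ncdc.noaa.gov/pub/data/ghcn/daily/by_year/1864.csv.gz'
# Note: header=0 treats the first record as column labels (see the output below);
# header=None would keep that record as data.
test = pd.read_csv(url, header=0, compression='gzip')
print(test)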

       ITE00100550  18640101  TMAX   10 Unnamed: 4 Unnamed: 5  E  Unnamed: 7
0      ITE00100550  18640101  TMIN  -23        NaN        NaN  E         NaN
1      ITE00100550  18640101  PRCP   25        NaN        NaN  E         NaN
2      ASN00079028  18640101  PRCP    0        NaN        NaN  a         NaN
3      USC00064757  18640101  PRCP  119        NaN        NaN  F         NaN
4      SF000208660  18640101  PRCP    0        NaN        NaN  I         NaN
5      ASN00089000  18640101  PRCP    0        NaN        NaN  a         NaN
6      SWE00100003  18640101  PRCP    0        NaN        NaN  E         NaN
7      ASN00086071  18640101  TMAX  214        NaN        NaN  a         NaN
8      ASN00086071  18640101  TMIN  101        NaN        NaN  a         NaN
9      ASN00086071  18640101  PRCP    0        NaN        NaN  a         NaN
10     USP00CA0003  18640101  PRCP    0        NaN        NaN  F         NaN
11     USC00189674  18640101  PRCP    0        NaN        NaN  F         NaN
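
In the printout above, the first record (`ITE00100550, 18640101, TMAX, 10`) has ended up as the column labels, which is the effect of `header=0`. A minimal sketch of the alternative, `header=None` combined with `names`, is shown here on a small in-memory sample so that it runs without network access. The sample rows are copied from the output above; the column names in `cols` are illustrative choices and not part of the GHCN file.

```python
import io
import pandas as pd

# Three records copied from the printed output above (the file has no header line).
sample = (
    "ITE00100550,18640101,TMAX,10,,,E,\n"
    "ITE00100550,18640101,TMIN,-23,,,E,\n"
    "ITE00100550,18640101,PRCP,25,,,E,\n"
)

# Illustrative names; the last four are placeholders for the columns we do not use.
cols = ['StationId', 'Date', 'ObsType', 'ObsValue', 'c4', 'c5', 'c6', 'c7']

# header=None tells pandas that the first line is data, not column labels.
sample_df = pd.read_csv(io.StringIO(sample), header=None, names=cols)
print(sample_df[['StationId', 'Date', 'ObsType', 'ObsValue']])
```

With this call the `TMAX` record stays in the data as row 0 instead of being consumed as a header.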


> **Ex. 4.1.2:** Structure your weather DataFrame by keeping only the relevant columns (station identifier, date, observation type, observation value) and renaming them. Make sure observations are correctly formatted (how many decimals should we add? one?).

> *Hint:* rename can be done with `df.columns=COLS` where `COLS` is a list of column names.


In [3]:
# [Answer to Ex. 4.1.2]

# The yearly files are formatted so that every observation
# (i.e., station/year/month/day/element/observation time) is represented by a single row
# with the following fields:

# station identifier (GHCN Daily Identification Number)
# date (yyyymmdd; where yyyy=year; mm=month; and, dd=day)
# observation type (see ftp://ftp.ncdc.noaa.gov/pub/data/ghcn/daily/readme.txt for definitions)
# observation value (see ftp://ftp.ncdc.noaa.gov/pub/data/ghcn/daily/readme.txt for units)
# observation time (if available, as hhmm where hh=hour and mm=minutes in local time)

columns = ['StationId', 'Date', 'ObsType', 'ObsValue']

# Keep the first four columns; .copy() makes df independent of test
df = test.iloc[:, :4].copy()
df.columns = columns
print(df)

         StationId      Date ObsType  ObsValue
0      ITE00100550  18640101    TMIN       -23
1      ITE00100550  18640101    PRCP        25
2      ASN00079028  18640101    PRCP         0
3      USC00064757  18640101    PRCP       119
4      SF000208660  18640101    PRCP         0
5      ASN00089000  18640101    PRCP         0
6      SWE00100003  18640101    PRCP         0
7      ASN00086071  18640101    TMAX       214
8      ASN00086071  18640101    TMIN       101
9      ASN00086071  18640101    PRCP         0
10     USP00CA0003  18640101    PRCP         0
11     USC00189674  18640101    PRCP         0
12     USC00144559  18640101    PRCP         0
13     USC00144559  18640101    SNOW         0
14     CA006158350  18640101    TMAX        11
15     CA006158350  18640101    TMIN      -133
16     CA006158350  18640101    PRCP         5
17     CA006158350  18640101    SNOW         5
18     HRE00105189  18640101    PRCP       189
19     ASN00067054  18640101    PRCP        61
...
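
On the question of decimals: according to the GHCN-Daily readme, temperature (`TMAX`, `TMIN`) and precipitation (`PRCP`) values are stored in tenths of a unit (tenths of °C and tenths of mm), while `SNOW` is in whole mm. Below is a minimal sketch of the rescaling, using toy rows that mirror the printout above. The frame `toy` and the mask `tenths` are illustrative names, and the list of rescaled types is an assumption that should be checked against the readme.

```python
import pandas as pd

# Toy rows mirroring the structure of the printout above.
toy = pd.DataFrame({
    'StationId': ['ITE00100550', 'ITE00100550', 'USC00144559'],
    'Date': [18640101, 18640101, 18640101],
    'ObsType': ['TMIN', 'PRCP', 'SNOW'],
    'ObsValue': [-23, 25, 0],
})

# Assumption (GHCN-Daily readme): these observation types are stored in tenths of a unit.
tenths = toy['ObsType'].isin(['TMAX', 'TMIN', 'PRCP'])

# Convert to float first so that one decimal can be represented.
toy['ObsValue'] = toy['ObsValue'].astype(float)
toy.loc[tenths, 'ObsValue'] = toy.loc[tenths, 'ObsValue'] / 10
print(toy)
```

Under this assumption the raw value `-23` for `TMIN` corresponds to -2.3 °C.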


> **Ex. 4.1.3:** Select data for the station `ITE00100550` and only observations for maximal temperature. Make a copy of the DataFrame. Explain in one or two sentences how copying works.

> *Hint 1*: the `&` operator works elementwise on boolean series (like `and` in core Python).

> *Hint 2*: copying of the dataframe is done with the `copy` method for DataFrames.

In [4]:
# [Answer to Ex. 4.1.3]
station_id = 'ITE00100550'
observation_type = 'TMAX'
ite0010055_maxtmp_obs = df.loc[(df['StationId'] == station_id) & (df['ObsType'] == observation_type)].copy()
print(ite0010055_maxtmp_obs)

         StationId      Date ObsType  ObsValue
74     ITE00100550  18640102    TMAX         8
151    ITE00100550  18640103    TMAX       -28
226    ITE00100550  18640104    TMAX         0
304    ITE00100550  18640105    TMAX       -19
382    ITE00100550  18640106    TMAX       -13
459    ITE00100550  18640107    TMAX        -4
537    ITE00100550  18640108    TMAX        13
617    ITE00100550  18640109    TMAX        13
694    ITE00100550  18640110    TMAX         6
769    ITE00100550  18640111    TMAX       -15
846    ITE00100550  18640112    TMAX       -25
923    ITE00100550  18640113    TMAX       -43
1001   ITE00100550  18640114    TMAX       -50
1079   ITE00100550  18640115    TMAX       -31
1157   ITE00100550  18640116    TMAX       -25
1235   ITE00100550  18640117    TMAX       -63
1310   ITE00100550  18640118    TMAX       -50
1387   ITE00100550  18640119    TMAX       -16
1464   ITE00100550  18640120    TMAX        -9
1541   ITE00100550  18640121    TMAX        -4
...
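
The exercise also asks how copying works. A short sketch of the difference between assigning a second name and calling `copy` follows; the frame `original` and its values are toy data, and the variable names are illustrative.

```python
import pandas as pd

original = pd.DataFrame({'ObsValue': [8, -28, 0]})

alias = original               # a second name bound to the same object
independent = original.copy()  # a new object with its own data (deep copy by default)

# Changing the copy leaves the original untouched.
independent.loc[0, 'ObsValue'] = 999

print(alias is original)
print(independent is original)
print(original.loc[0, 'ObsValue'])
```

In other words, plain assignment only creates another reference, whereas `copy` duplicates the data, so later edits to the selection cannot leak back into `df`.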

> **Ex. 4.1.4:** Make a new column called `TMAX_F` where you have converted the temperature variables to Fahrenheit. 

> *Hint*: Conversion is $F = 32 + 1.8*C$ where $F$ is Fahrenheit and $C$ is Celsius.

In [5]:
# [Answer to Ex. 4.1.4]
def c_to_f(c: int) -> float:
    return 32 + 1.8 * c

ite0010055_maxtmp_obs['TMAX_F'] = ite0010055_maxtmp_obs['ObsValue'].apply(c_to_f)
print(ite0010055_maxtmp_obs)

         StationId      Date ObsType  ObsValue  TMAX_F
74     ITE00100550  18640102    TMAX         8    46.4
151    ITE00100550  18640103    TMAX       -28   -18.4
226    ITE00100550  18640104    TMAX         0    32.0
304    ITE00100550  18640105    TMAX       -19    -2.2
382    ITE00100550  18640106    TMAX       -13     8.6
459    ITE00100550  18640107    TMAX        -4    24.8
537    ITE00100550  18640108    TMAX        13    55.4
617    ITE00100550  18640109    TMAX        13    55.4
694    ITE00100550  18640110    TMAX         6    42.8
769    ITE00100550  18640111    TMAX       -15     5.0
846    ITE00100550  18640112    TMAX       -25   -13.0
923    ITE00100550  18640113    TMAX       -43   -45.4
1001   ITE00100550  18640114    TMAX       -50   -58.0
1079   ITE00100550  18640115    TMAX       -31   -23.8
1157   ITE00100550  18640116    TMAX       -25   -13.0
1235   ITE00100550  18640117    TMAX       -63   -81.4
1310   ITE00100550  18640118    TMAX       -50   -58.0
...
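
The hint's formula can also be applied to a whole column at once, since arithmetic on a pandas Series is elementwise. The sketch below additionally assumes that the raw `TMAX` values are tenths of degrees Celsius (as the GHCN-Daily readme states), so it divides by 10 before converting; the column name `TMAX_C` is a hypothetical helper. Under that assumption the results differ from the printout above, which converts the raw values directly.

```python
import pandas as pd

# Raw values taken from the first rows of the output above.
obs = pd.DataFrame({'ObsValue': [8, -28, 0]})

# Assumption: raw TMAX values are tenths of degrees Celsius.
obs['TMAX_C'] = obs['ObsValue'] / 10

# Elementwise arithmetic: F = 32 + 1.8 * C, with no explicit loop or apply.
obs['TMAX_F'] = 32 + 1.8 * obs['TMAX_C']
print(obs)
```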

> **Ex 4.1.5:** Inspect the indices: do they follow the sequence of natural numbers 0, 1, 2, ...? If not, reset the index and make sure to drop the old one.

In [6]:
# [Answer to Ex. 4.1.5]
ite0010055_maxtmp_obs = ite0010055_maxtmp_obs.reset_index(drop=True)
print(ite0010055_maxtmp_obs)

       StationId      Date ObsType  ObsValue  TMAX_F
0    ITE00100550  18640102    TMAX         8    46.4
1    ITE00100550  18640103    TMAX       -28   -18.4
2    ITE00100550  18640104    TMAX         0    32.0
3    ITE00100550  18640105    TMAX       -19    -2.2
4    ITE00100550  18640106    TMAX       -13     8.6
5    ITE00100550  18640107    TMAX        -4    24.8
6    ITE00100550  18640108    TMAX        13    55.4
7    ITE00100550  18640109    TMAX        13    55.4
8    ITE00100550  18640110    TMAX         6    42.8
9    ITE00100550  18640111    TMAX       -15     5.0
10   ITE00100550  18640112    TMAX       -25   -13.0
11   ITE00100550  18640113    TMAX       -43   -45.4
12   ITE00100550  18640114    TMAX       -50   -58.0
13   ITE00100550  18640115    TMAX       -31   -23.8
14   ITE00100550  18640116    TMAX       -25   -13.0
15   ITE00100550  18640117    TMAX       -63   -81.4
16   ITE00100550  18640118    TMAX       -50   -58.0
17   ITE00100550  18640119    TMAX       -16     3.2
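
The "inspect" part of the exercise can be done programmatically. This is one possible sketch, using a toy frame whose index labels (74, 151, 226) are taken from the selection printed earlier; `subset` and `is_sequential` are illustrative names.

```python
import pandas as pd

# Toy frame with the gapped index left behind by a row selection.
subset = pd.DataFrame({'ObsValue': [8, -28, 0]}, index=[74, 151, 226])

# Compare the index against 0, 1, 2, ... of the same length.
is_sequential = subset.index.equals(pd.RangeIndex(len(subset)))
print(is_sequential)

# drop=True discards the old index instead of keeping it as a new column.
subset = subset.reset_index(drop=True)
print(subset.index.equals(pd.RangeIndex(len(subset))))
```

Without `drop=True`, the old labels would be kept in a column named `index`.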

> **Ex 4.1.6:** Make a new DataFrame where you have sorted by the maximum temperature. What is the date for the first and last observations?

In [7]:
# [Answer to Ex. 4.1.6]
obs_sorted = ite0010055_maxtmp_obs.sort_values(by=['ObsValue'])
#print(obs_sorted.iloc[0, :])
#print(obs_sorted.iloc[-1, :])
print(obs_sorted.loc[:, 'Date'].head(1))
print(obs_sorted.loc[:, 'Date'].tail(1))

15    18640117
Name: Date, dtype: int64
220    18640809
Name: Date, dtype: int64
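
The dates above are printed as integers in `yyyymmdd` form. As a hedged sketch, they can be parsed into proper dates before sorting, which makes the first and last observations easier to read. The frame `toy` is illustrative: the dates and the value `-63` come from the outputs above, while `300` is a made-up value used only so that the last row sorts to the end.

```python
import pandas as pd

toy = pd.DataFrame({
    'Date': [18640102, 18640117, 18640809],
    'ObsValue': [8, -63, 300],  # 300 is a made-up value for illustration
})

# Parse yyyymmdd integers into timestamps.
toy['Date'] = pd.to_datetime(toy['Date'].astype(str), format='%Y%m%d')

# Sort ascending by the observed value; the first row is the coldest, the last the warmest.
toy_sorted = toy.sort_values(by='ObsValue')
coldest = toy_sorted.iloc[0]
warmest = toy_sorted.iloc[-1]
print(coldest['Date'].date(), warmest['Date'].date())
```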


> **Ex 4.1.7:** CSV files: save your DataFrame as a CSV file. What does the `index` argument do?

> Try to save the file using a relative path and an absolute path. With a relative path you only specify the file name, which saves the file in the folder you are currently working in. With an absolute path you specify the whole path, which lets you save the file in a folder of your choice.

In [8]:
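# [Answer to Ex. 4.1.7]
import os

# expanduser('~') resolves the home directory on Unix-like systems and on Windows
home_path = os.path.expanduser('~')
abs_path = os.path.join(home_path, 'development', 'SummerSchool', 'sds_assignments', 'lecture_exercises')

# Relative path: the file is written to the current working directory
ite0010055_maxtmp_obs.to_csv('ite0010055_maxtmp_obs.csv')
# Absolute path: the file is written to abs_path (the folder must already exist)
df.to_csv(os.path.join(abs_path, '1864_tmps.csv'))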
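
As for the `index` argument: a small sketch of its effect, using toy data. When no path is passed, `to_csv` returns the CSV text as a string, which makes the difference easy to see without writing files; `toy`, `with_index` and `without_index` are illustrative names.

```python
import pandas as pd

toy = pd.DataFrame({'Date': [18640102, 18640103], 'ObsValue': [8, -28]})

# With no path given, to_csv returns the CSV text instead of writing a file.
with_index = toy.to_csv()                # default index=True: row labels are the first column
without_index = toy.to_csv(index=False)  # row labels are left out

print(with_index)
print(without_index)
```

With the default `index=True`, the row labels are written as an unnamed first column, which reappears as `Unnamed: 0` when the file is read back without `index_col`.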

> **(Bonus) Ex. 4.1.8**: List comprehensions are a very compact way of writing code and making lists in Python. Depending on what you are doing, list comprehensions can be more or less efficient than, for example, vectorized operations using NumPy.

> Read about list comprehensions online, and use one to make a list with the numbers from 0 to a million (10\*\*6), adding 3 to each element. Do the same using NumPy, and time both methods. Which method is faster?

> *Hint*: Use the `timeit` module for timing each method.

In [21]:
import timeit

# Each statement is executed 100 times; timeit returns the total time in seconds.
list_comp_time = timeit.timeit('[num + 3 for num in range(0, 1000000)]', number=100)
# The import goes in `setup` so that it is not part of the timed statement.
np_arange_time = timeit.timeit('numpy.arange(0, 1000000) + 3', setup='import numpy', number=100)

print(list_comp_time)
print(np_arange_time)

6.224539359998744
0.2249275160011166
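
Before comparing timings, it can be worth confirming that both approaches compute the same numbers. The sketch below does that and times plain functions instead of code strings, which `timeit` also accepts. The function names are illustrative, and `n` is set to 10\*\*5 (smaller than in the exercise) only so that the sketch runs quickly.

```python
import timeit
import numpy as np

n = 10**5  # smaller than the exercise's 10**6 so the sketch runs quickly

def with_list_comprehension():
    return [num + 3 for num in range(n)]

def with_numpy():
    return np.arange(n) + 3

# Both approaches should yield the same numbers.
same = with_list_comprehension() == with_numpy().tolist()

# timeit accepts callables; taking the minimum over repeats reduces noise.
list_time = min(timeit.repeat(with_list_comprehension, number=5, repeat=3))
numpy_time = min(timeit.repeat(with_numpy, number=5, repeat=3))
print(same, list_time, numpy_time)
```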
