> **Note:** Every week, you will be solving exercises posed in a Jupyter notebook that looks like this one. Because you are cloning a GitHub repository that only we can push to, you should **NOT EDIT** any of the files you pull from GitHub. Instead, either make a new notebook and write your solutions in there, or **make a copy of this notebook and save it somewhere else** on your computer, outside the `tsds` folder that you cloned, and write your answers in the copy. If you edit the notebook you pulled from GitHub, those edits (possibly your solutions to the exercises) may be overwritten and lost the next time you pull from GitHub. This is important, so don't hesitate to ask if it is unclear.

# Week 2b: Data structuring

*Thursday, February 15, 2018*

In this part of today's session you will be working with traffic data from Copenhagen Municipality. Note that this part is quite long. The reason is that there is a lot of catching up and recap from our summer course.

The municipality has made the data openly available through the `opendata.dk` platform. We will use the data from [traffic counters](http://data.kk.dk/dataset/faste-trafiktaellinger) to construct a dataset of hourly traffic. We will use this data to get basic insights into the development of traffic over time and relate it to weather. The gist here is to practice a very important skill in Data Science: being able to quickly fetch data from the web and structure it so that you can work with it. Scraping usually gets a bit more advanced than what we will do today, but the following exercises should give you a taste of how it works. The bulk of these exercises, however, revolves around using the *Pandas* library to structure and analyze data.

An overview of today's exercises:
* 2b.1: Get some traffic data
* 2b.2: Structure your dataset
* 2b.3: String data, selection and rotation
* 2b.4: Structure temporal data
* 2b.5: Statistical descriptions of traffic data
* 2b.6 (extra): Working with weather station data from NOAA
* 2b.7 (extra): Further learning

*Note for R-users*: Pandas is a lot like *R*, so if you are a shark at that there's no need to relearn [things you already know](https://pandas.pydata.org/pandas-docs/stable/comparison_with_r.html).

## Exercises

### Part 2b.1: Get some traffic data

What follows is a simple scraping exercise where you (1) collect urls for datasets from the webpage listing data on [traffic counters](http://data.kk.dk/dataset/faste-trafiktaellinger) and (2) use these urls to load the data into one dataframe.

#### Scrape dataset urls

> **Ex. 2b.1.1**: Using the `requests` module, extract the html markup of the webpage and store it as a string in a new variable.

In [1]:
# [Answer to Ex. 2b.1.1]

> **Ex. 2b.1.2**: Using the `re` module, extract a list of all the urls in the html string and store them in a new variable.

> *Hint: Try using the `re.findall` method. You may want to Google around to figure out how to do this. Protip: searching for something along the lines of "extract all links in html regex python" and hitting the first StackOverflow link will probably get you farther than reading elaborate documentation.*
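
As a rough illustration of how `re.findall` behaves, here is a minimal sketch on a made-up html string. The markup and urls below are hypothetical and only stand in for the real page; the pattern is one of several that could work and is not meant as the definitive solution.

```python
import re

# Hypothetical html snippet (an assumption for illustration, not the real page)
html = '''
<a href="http://example.com/data/file_2013.xlsx">2013</a>
<a href="http://example.com/about">About</a>
<a href='http://example.com/data/file_2014.xlsx'>2014</a>
'''

# Capture whatever sits between the quotes of each href attribute.
# The non-greedy (.*?) stops at the first closing quote it meets.
urls = re.findall(r'href=["\'](.*?)["\']', html)
print(urls)
```

`re.findall` returns a plain list of the captured groups, so the result can be filtered with an ordinary list comprehension afterwards.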

In [14]:
# [Answer to Ex. 2b.1.2]

> **Ex. 2b.1.3**: Create a new variable that only contains the links to traffic data.

In [16]:
# [Answer to Ex. 2b.1.3]

#### Load everything into a single dataframe

> **Ex. 2b.1.4**: Using the `pd.read_excel` function, load the datasets into a list. Your resulting variable should hold a list of Pandas dataframes.

> *Note: you may want to set the `skiprows=` keyword argument.*

In [17]:
# [Answer to Ex. 2b.1.4]

> **Ex. 2b.1.5**: Merge the list of dataframes into a single dataframe.

> *Hint: try using pandas' `concat` function.*
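
To see what `pd.concat` does with a list of dataframes, consider this small sketch. The two toy dataframes and their column names are assumptions made up for the example.

```python
import pandas as pd

# Two toy dataframes with the same columns (hypothetical data)
frames = [
    pd.DataFrame({'road': ['A', 'B'], 'count': [10, 20]}),
    pd.DataFrame({'road': ['C'], 'count': [30]}),
]

# Stack the dataframes on top of each other (row-wise is the default)
combined = pd.concat(frames)
print(combined)
```

Note that each dataframe keeps its own row labels, so the combined index contains repeated values. That is worth keeping in mind for the next part.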

In [15]:
# [Answer to Ex. 2b.1.5]

### Part 2b.2: Structure your dataset

If you successfully completed the previous part, you should now have a dataframe with about 183,397 rows (if your number of rows is close but not the same, worry not: it matters little in the following). Well done! But the data is still in no shape for analysis, so we must clean it up a little.

183,397 rows (and 30 columns) is a lot of data. ~3.3 MB by my back-of-the-envelope calculations, so not "Big Data", but still enough to make your CPU heat up if you don't use it carefully. Pandas is built to handle fairly large dataframes and has advanced functionality to perform very fast operations even when the size of your data grows huge. So instead of working with basic Python, we recommend working with the built-in `pandas` procedures, as they are constructed to be fast on dataframes.

*Nerd fact: the reason `pandas` is much faster than pure Python is that dataframe operations are carried out by code written in lower-level programming languages (namely C and C++), which run many times faster than Python. Those languages are faster because they demand a higher level of explicitness, which also makes them more difficult to learn and navigate.*
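
The difference between the two styles can be sketched as follows. The series is made-up data, and the example only shows that both approaches give the same result; in a notebook you could time each one with `%timeit` to compare speed on your own machine.

```python
import pandas as pd

# Toy series of 100,000 integers (hypothetical data)
s = pd.Series(range(100_000))

# Pure Python style: visit the values one by one in an explicit loop
total_loop = 0
for value in s:
    total_loop += value * 2

# Pandas style: one vectorized expression, executed in compiled code
total_vectorized = (s * 2).sum()

print(total_loop == total_vectorized)
```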

#### Tidy indices and columns

Remember `numpy` arrays from last week? Unlike these, `pandas` dataframes have the advantage that columns and rows can be labeled. These labels are referred to respectively as *row indices* and *column names*. We start out with formatting the indices and altering the column names. 

> **Ex. 2b.2.1**: Reset the row indices of your dataframe so the first index is 0 and the last is the number of rows in your dataframe minus one.

> *Hint: Check out the `reset_index` method for dataframes.*
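
A minimal sketch of `reset_index` on a toy dataframe with a messy index (the data is made up). The `drop=True` argument is a choice made here so that the old index is discarded rather than kept as a new column.

```python
import pandas as pd

# Toy dataframe whose row labels repeat (hypothetical data)
df = pd.DataFrame({'x': [1, 2, 3]}, index=[0, 1, 0])

# Replace the index with 0, 1, 2, ...; drop=True throws the old labels away
df_reset = df.reset_index(drop=True)
print(df_reset)
```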

In [1]:
# [Answer to Ex. 2b.2.1]

> **Ex. 2b.2.2**: The column called `Spor` is superfluous. Delete it.

> *Hint: try using the `drop` method. What do the keyword arguments `inplace=` and `axis=` do?*
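
The two keyword arguments can be explored on a toy dataframe like the one below. The column names are made up for the example.

```python
import pandas as pd

# Toy dataframe (hypothetical data)
df = pd.DataFrame({'a': [1, 2], 'b': [3, 4], 'c': [5, 6]})

# axis=1 means "look for the label among the columns";
# without inplace, drop returns a new dataframe and leaves df untouched
without_b = df.drop('b', axis=1)

# inplace=True modifies df itself, and the call returns None
df.drop('c', axis=1, inplace=True)

print(list(without_b.columns), list(df.columns))
```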

In [2]:
# [Answer to Ex. 2b.2.2]

> **Ex. 2b.2.3**: Rename variables from Danish to English using the dictionary below.

> *Hint: this is possible using the dataframes' `rename` method.*
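
For reference, this is roughly how `rename` works with a dictionary. The dataframe and the mapping below are hypothetical and unrelated to the traffic data.

```python
import pandas as pd

# Toy dataframe with Danish column names (hypothetical data)
df = pd.DataFrame({'Navn': ['x'], 'Antal': [1]})

# Keys are the old names, values are the new names
translation = {'Navn': 'name', 'Antal': 'count'}

# columns= tells rename to apply the mapping to column names, not row labels
renamed = df.rename(columns=translation)
print(list(renamed.columns))
```

Names that do not appear in the dictionary are left unchanged, and the original dataframe is not modified unless you reassign the result.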

In [3]:
# [Answer to Ex. 2b.2.3]

dk_to_uk = {
    'Vejnavn':'road_name',
    '(UTM32)':'UTM32_north',
    '(UTM32).1':'UTM32_east',
    'Dato':'date',
    'Vej-Id':'road_id'
}

#### Mind your memory

Pandas is quite efficient. For example, when you create a new dataframe by selecting from or manipulating an old one, the new object does not necessarily get its own copy of the data. Since memory is a precious resource, the values in the new dataframe may instead be *references* to the values in the old one. This is great for performance, but if you for whatever reason change some of the values in your old dataframe, values in the new one may also change, and we don't want that! Luckily, we can break this dependency.

> **Ex. 2b.2.4**: Break the dependencies of the dataframe that resulted from Ex. 2b.2.3. Delete all other dataframes. 

> *Hint: try using the dataframes' `copy` method.*
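
A small sketch of what `copy` guarantees, using made-up data: after copying, changes to the original do not reach the copy. The `del` statement shown at the end is one way to remove a name you no longer need.

```python
import pandas as pd

# Toy dataframe (hypothetical data)
original = pd.DataFrame({'x': [1, 2, 3]})

# copy() gives the new dataframe its own, independent data
independent = original.copy()

# Changing the original does not touch the copy
original.loc[0, 'x'] = 99
print(independent.loc[0, 'x'])

# Remove the name `original`; only `independent` remains available
del original
```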

In [4]:
# [Answer to Ex. 2b.2.4]

### Part 2b.3: String data, selection and rotation
Once you have structured appropriately, something that you will want to do again and again is **selecting subsets of the data**. Specifically, it means that you select specific rows in the dataset based on some column values.

#### Basic operations: selecting and subsetting

> **Ex. 2b.3.1**: Create a new column in the dataframe called `total` that is `True` when the last letter of `road_id` is T and otherwise `False`.

> *Hint1: try using the `str` accessor of pandas series/columns to access the string elements, e.g. "data.road_id.str[2]".*

> *Hint2: you can use the equal operator `==` on series/columns.*
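
Both hints can be tried out on a toy series first. The codes below are made up and do not follow the format of `road_id`.

```python
import pandas as pd

# Toy series of short string codes (hypothetical data)
codes = pd.Series(['AB1', 'CD2', 'EF1'])

# .str[2] picks the character at position 2 from every string
third = codes.str[2]

# == compares every element with '1' and returns a boolean series
is_one = third == '1'
print(is_one.tolist())
```

Negative positions count from the end of each string, just as with ordinary Python strings.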

In [5]:
# [Answer to Ex. 2b.3.1]

> **Ex. 2b.3.2**: Select rows where `total` is True. Delete all the remaining observations.

> *Hint: try to get inspiration from this [Stack Overflow question](https://stackoverflow.com/questions/17071871/select-rows-from-a-dataframe-based-on-values-in-a-column-in-pandas).*

In [6]:
# [Answer to Ex. 2b.3.2]

> **Ex. 2b.3.3**: Make two datasets based on the lists of columns below. Call the dataset with spatial columns `data_geo` and the other `data`.

In [7]:
# [Answer to Ex. 2b.3.3]

# Columns for `data_geo`, stored in `spatial_columns`
spatial_columns = ['road_name', 'UTM32_north', 'UTM32_east']

# Columns for `data`, stored in `select_columns`
hours = ['kl.%s-%s' % (str(h).zfill(2), str(h+1).zfill(2)) for h in range(24)]
select_columns = ['road_name', 'date'] + hours

> **Ex. 2b.3.4**: Drop the duplicate rows in `data_geo`.
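
The dataframe methods `duplicated` and `drop_duplicates` are the relevant tools here. A minimal sketch on made-up data:

```python
import pandas as pd

# Toy dataframe where the first two rows are identical (hypothetical data)
df = pd.DataFrame({'road': ['A', 'A', 'B'], 'north': [1, 1, 2]})

# duplicated() flags every row that repeats an earlier row
flags = df.duplicated()

# drop_duplicates() keeps only the first occurrence of each distinct row
unique_rows = df.drop_duplicates()
print(unique_rows)
```

The surviving rows keep their original row labels, so you may want to reset the index afterwards.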

In [18]:
# [Answer to Ex. 2b.3.4]

#### Formatting: wide and narrow format

When talking about two-dimensional data (matrices, tables or dataframes, we can call it many things), we can say that it is in either *wide* or *long* format (see explanation [here](https://en.wikipedia.org/wiki/Wide_and_narrow_data); "narrow" and "long" are used interchangeably). In Pandas we can use the commands `stack` and `unstack` to move between these formats.

The wide format has the advantage that it often requires less storage and is easier to read when printed. On the other hand the long format can be easier for modelling, because each observation has its own row. Turns out that the latter is what we most often need.

> **Ex. 2b.3.5**: Turn the dataset from wide to long so hourly data is now vertically stacked. Store this dataset in a dataframe called `data`. Name the column with hourly information `hour_period`. Your resulting dataframe should look something like [this](http://ulfaslak.com/tsds/ex_235_example.png).

> *Hint: pandas' `melt` function may be of use.*
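
To get a feel for `melt`, here is a sketch on a tiny wide table. All names in it (`station`, `morning`, `evening`, `period`, `count`) are assumptions for illustration only.

```python
import pandas as pd

# Toy wide table: one row per station, one column per period (hypothetical data)
wide = pd.DataFrame({
    'station': ['A', 'B'],
    'morning': [5, 7],
    'evening': [9, 11],
})

# id_vars stay as they are; every other column is stacked vertically.
# var_name names the column holding the old column names,
# value_name names the column holding the values.
long = pd.melt(wide, id_vars=['station'], var_name='period', value_name='count')
print(long)
```

Each station now appears once per period, so two rows and two value columns become four rows.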

In [9]:
# [Answer to Ex. 2b.3.5]

#### Categorical data

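Categorical data can contain Python objects, usually strings. A categorical column stores each distinct value only once and represents every row by a small integer code. This is smart if you have variables with string observations that are long and often repeated, e.g. road names.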

> **Ex. 2b.3.6**: Convert the *type* of the `road_name` column to categorical.

> *Hint: The method `astype` for series/columns may be of use.* 
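
The memory argument can be checked on a toy series of long, repeated strings. The names below are made up, and the exact byte counts will depend on your pandas version, so the sketch only compares the two.

```python
import pandas as pd

# Toy series: two long strings, each repeated 1000 times (hypothetical data)
names = pd.Series(['Long Street Name Avenue'] * 1000
                  + ['Another Long Road Name'] * 1000)

# Convert to the categorical type
as_category = names.astype('category')

# deep=True also counts the memory used by the strings themselves
mem_before = names.memory_usage(deep=True)
mem_after = as_category.memory_usage(deep=True)
print(mem_before, mem_after)
```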

In [10]:
# [Answer to Ex. 2b.3.6]

### Part 2b.4: Structure temporal data

Pandas has native support for working with temporal data. This is handy, as 'big data' often has time stamps which we can make Pandas aware of. Once we have encoded temporal data, it can be used to extract information such as the hour, second, etc.

> **Ex. 2b.4.1**: Create a new column called `hour` which contains the hour-of-day for each row.

In [11]:
# [Answer to Ex. 2b.4.1]

> **Ex. 2b.4.2**: Create a new column called `time`, that contains the time of the row in `datetime` format. Delete the old temporal columns (hour, hour_period, date) to save memory.

> *Hint: try making an intermediary series of strings that has all temporal information for the row; then use pandas `to_datetime` function where you can specify the format of the date string.*
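
The approach in the hint can be sketched like this. The date strings and their `day.month.year` layout are assumptions for the example; check how dates are actually written in your data and adjust the format string accordingly.

```python
import pandas as pd

# Toy date strings and hours (hypothetical data and date layout)
dates = pd.Series(['15.02.2018', '16.02.2018'])
hours = pd.Series([7, 18])

# Intermediary series of strings, e.g. '15.02.2018 07'
stamp_strings = dates + ' ' + hours.astype(str).str.zfill(2)

# format= spells out how the string is laid out: day.month.year hour
times = pd.to_datetime(stamp_strings, format='%d.%m.%Y %H')
print(times)
```

Passing an explicit `format` avoids ambiguity between day-first and month-first dates.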

In [10]:
# [Answer to Ex. 2b.4.2]

> **Ex. 2b.4.3**: Using your `time` column make a new column called `weekday` which stores the weekday (in values between 0 and 6) of the corresponding `datetime`.

> *Hint: try using the `dt` accessor of the series called `time`; `dt` has some relevant attributes itself.*
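
A short sketch of the `dt` accessor on two made-up timestamps. Weekdays are counted from Monday (0) to Sunday (6).

```python
import pandas as pd

# Toy datetime series: a Thursday and a Sunday (hypothetical data)
times = pd.Series(pd.to_datetime(['2018-02-15 08:00', '2018-02-18 20:00']))

# Attributes under .dt are computed for every element at once
weekday = times.dt.weekday
month = times.dt.month
print(weekday.tolist(), month.tolist())
```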

In [10]:
# [Answer to Ex. 2b.4.3]

> **Ex. 2b.4.4**: What other things can `dt` be used to compute? Try to compute week- and month number.

In [12]:
# [Answer to Ex. 2b.4.4]

### Part 2b.5: Statistical descriptions of traffic data

> **Ex. 2b.5.1**: Print the "descriptive statistics" of the `traffic` column.

> *Hint: Use the `describe` method of pandas dataframes.*

In [10]:
# [Answer to Ex. 2b.5.1]

> **Ex. 2b.5.2**: Which road has the highest average traffic?

> *Hint: Start with a `groupby('road_name')` operation on `data`.*
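
The general split-apply-combine pattern behind `groupby` looks like this on a toy dataframe. The column names `group` and `value` are made up for the example.

```python
import pandas as pd

# Toy dataframe with two groups (hypothetical data)
df = pd.DataFrame({
    'group': ['a', 'a', 'b', 'b'],
    'value': [1, 3, 10, 20],
})

# Split the rows by group, then take the mean of `value` within each group
means = df.groupby('group')['value'].mean()

# idxmax returns the index label (here: the group) with the largest value
top_group = means.idxmax()
print(means)
print(top_group)
```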

In [10]:
# [Answer to Ex. 2b.5.2]

> **Ex. 2b.5.3**: Compute the annual average road traffic during day hours (9-17). Which station had the least traffic in 2013? Which station has seen the highest growth in traffic from 2013 to 2014?

In [10]:
# [Answer to Ex. 2b.5.3]

## Additional exercises and further learning

This final exercise is an old exercise from our summer course that we recommend you finish. It includes an exercise on joining different datasets into one.

### Part 2b.6: Working with weather station data from NOAA

> **Ex. 2b.6.1**: Do the in class exercises from the SDS course [here](https://abjer.github.io/sds/slides/in_class_exercise.ipynb). Note that the solution is available in the lecture [slides](https://abjer.github.io/sds/slides/plotting.pdf)/[notebook](https://abjer.github.io/sds/slides/plotting.ipynb).

In [13]:
# [Answer to Ex. 2b.6.1]

### Part 2b.7: Further learning

Many important topics for DataFrames have been skipped. These include:

- Copying data in python: deep vs. shallow - `copy` method for dataframes   
- Working with duplicates: dataframe methods `duplicated`, `drop_duplicates`
- Working with timeseries methods for dataframe e.g. `diff`, `shift`, `resample`, `rolling`
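
As a starting point for the time series methods in the list above, here is a minimal sketch on a made-up daily series. The window size and resampling frequency are arbitrary choices for illustration.

```python
import pandas as pd

# Toy daily series over six days (hypothetical data)
idx = pd.date_range('2018-01-01', periods=6, freq='D')
s = pd.Series([1, 2, 4, 7, 11, 16], index=idx)

# diff: change from the previous observation (first value is NaN)
change = s.diff()

# shift: move the values one step forward in time
previous = s.shift(1)

# rolling: moving average over a window of two observations
smooth = s.rolling(window=2).mean()

# resample: aggregate into two-day bins and sum within each bin
two_day = s.resample('2D').sum()
print(two_day)
```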