### Attribution:

This notebook was modified from Debsankha Manik's notebook 01-data-wrangling.ipynb, part of the GGNB Data Science course held at the University of Goettingen (2019).

# Data Wrangling


* Often the first step of a data science project.
* The datasets you want to work with may come from different software, devices, or even be recorded by hand.
* Before your *data science tool* can make sense of them, some "wrangling" is often necessary.

In [None]:
import numpy as np
import pandas as pd

from urllib import request

## CSV, i.e. the "comma-separated values" format

### Using `pd.read_csv`

In [None]:
url = "http://download-data.deutschebahn.com/static/datasets/stationsdaten/DBSuS-Uebersicht_Bahnhoefe-Stand2016-07.csv"

In [None]:
response = request.urlopen(url)
data = response.read()
bf_st = data.decode('utf-8')

In [None]:
import io
buf = io.StringIO(bf_st)

In [None]:
df = pd.read_csv(buf)

### Oops, let's try again, ignoring badly formed lines

In [None]:
import io
buf = io.StringIO(bf_st)
# error_bad_lines was deprecated in pandas 1.3 and removed in 2.0; on_bad_lines='skip' replaces it
df = pd.read_csv(buf, on_bad_lines='skip')
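
To see what skipping badly formed lines does, here is a minimal sketch on a made-up three-column sample (the string `sample` is hypothetical, not taken from the Deutsche Bahn file). It assumes pandas 1.3 or newer, where `on_bad_lines='skip'` is the replacement for `error_bad_lines=False`.

```python
import io

import pandas as pd

# Hypothetical sample: the second data row has one field too many.
sample = "a,b,c\n1,2,3\n4,5,6,7\n8,9,10\n"

# With default settings the malformed row makes the parser raise.
try:
    pd.read_csv(io.StringIO(sample))
    raised = False
except pd.errors.ParserError:
    raised = True

# on_bad_lines='skip' silently drops the offending row instead.
skipped = pd.read_csv(io.StringIO(sample), on_bad_lines="skip")

print(raised)
print(skipped)
```

Note that skipping hides the problem rather than solving it: rows are lost without any error, so it is worth checking `df.shape` afterwards.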

In [None]:
df.head()

In [None]:
df.shape, df.columns

Pandas did not recognize the columns, because we need to...

### Specify the delimiter

In [None]:
buf = io.StringIO(bf_st)
df = pd.read_csv(buf, delimiter=';', on_bad_lines='skip')

In [None]:
df.head()

In [None]:
df.columns
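
The effect of the delimiter can be reproduced on a small sample. The sketch below uses made-up placeholder rows; only the column names `Bundesland`, `Station` and `PLZ` appear in the real file.

```python
import io

import pandas as pd

# Hypothetical semicolon-separated sample with placeholder values.
sample = (
    "Bundesland;Station;PLZ\n"
    "Land A;Station X;11111\n"
    "Land B;Station Y;22222\n"
)

# Default delimiter (comma): every line is read as one single field.
wrong = pd.read_csv(io.StringIO(sample))

# Correct delimiter: three separate columns.
right = pd.read_csv(io.StringIO(sample), delimiter=";")

print(wrong.columns.tolist())
print(right.columns.tolist())
```

With the default comma delimiter, pandas finds no separator and packs each whole line into a single column, which is why the columns were not recognized above.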

### Reading only certain columns

In [None]:
buf = io.StringIO(bf_st)
df = pd.read_csv(buf, delimiter=';', on_bad_lines='skip', usecols=['Bundesland', 'Station', 'PLZ'])
df.head()

### Supplying your own column information

In [None]:
url = "http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data"
response = request.urlopen(url)
data = response.read()
census_st = data.decode('utf-8')
buf = io.StringIO(census_st)

In [None]:
df = pd.read_csv(buf)

In [None]:
df.head()

In [None]:
buf = io.StringIO(census_st)
df = pd.read_csv(buf, usecols=[0, 1, 3], names=['age', 'job', 'education'])
df.head()
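
Why `names` is needed for a file without a header row can be sketched on a tiny made-up sample (the values below are placeholders, not rows from the census file). Without `names`, pandas treats the first data row as the header; with `names` plus `usecols`, the given names are assigned to the selected column positions.

```python
import io

import pandas as pd

# Hypothetical headerless sample with four columns.
sample = "30,clerk,111,college\n45,farmer,222,school\n"

# Without names: the first data row is consumed as the header.
no_names = pd.read_csv(io.StringIO(sample))

# With names and usecols: keep columns 0, 1 and 3 and label them.
named = pd.read_csv(
    io.StringIO(sample),
    usecols=[0, 1, 3],
    names=["age", "job", "education"],
)

print(no_names.columns.tolist())
print(named)
```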

#### Converting columns to desired datatype

In [None]:
buf = io.StringIO(census_st)
df = pd.read_csv(buf, usecols=[0, 1, 3], names=['age', 'job', 'education'],
                 dtype={'age': float, 'job': str, 'education': str})
df.head()

#### Changing datatype is also possible afterwards:

In [None]:
# np.str was removed in NumPy 1.24; the built-in str does the same job
df.astype({'age': int, 'job': str, 'education': str}).head()
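
One detail worth keeping in mind: `astype` returns a new DataFrame and leaves the original untouched, so the result has to be assigned if the conversion should persist. A minimal sketch with a small made-up frame:

```python
import pandas as pd

# Hypothetical frame with ages stored as floats.
df_demo = pd.DataFrame({"age": [30.0, 45.0], "job": ["clerk", "farmer"]})

# astype returns a converted copy; df_demo itself is not modified.
converted = df_demo.astype({"age": int})

print(df_demo["age"].dtype)
print(converted["age"].dtype)
```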

### Handling datetimes

In [None]:
df = pd.read_csv('../data/amazonianBirds_climate.csv')
df.head()

#### Merge `date` and `time` columns

In [None]:
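# Note: combining columns via parse_dates is deprecated since pandas 2.2.
# error_bad_lines was removed in pandas 2.0; on_bad_lines='skip' replaces it.
df = pd.read_csv('../data/amazonian_birds.csv',
                 parse_dates={'datetime': [1, 2]}, on_bad_lines='skip')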
df.head()

In [None]:
df.dtypes
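
An alternative way to merge the two columns, which does not rely on `parse_dates` combining columns, is to join the strings and call `pd.to_datetime`. The sketch below assumes the columns are literally named `date` and `time` and uses a made-up sample; the `id` column and all values are hypothetical.

```python
import io

import pandas as pd

# Hypothetical sample with separate date and time columns at positions 1 and 2.
sample = (
    "id,date,time,latitude\n"
    "1,2013-07-03,06:15:00,-3.1\n"
    "2,2013-07-03,18:40:00,-3.2\n"
)

merged = pd.read_csv(io.StringIO(sample))

# Join the two string columns and parse the result as timestamps.
merged["datetime"] = pd.to_datetime(merged["date"] + " " + merged["time"])
merged = merged.drop(columns=["date", "time"])

print(merged.dtypes)
```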

In [None]:
df['datetime'] - df['datetime'].shift()
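
Subtracting the shifted column yields the time elapsed between consecutive rows. A small sketch on made-up timestamps shows the shape of the result; `Series.diff()` is used here as an equivalent shortcut.

```python
import pandas as pd

# Hypothetical timestamps, already sorted in time.
ts = pd.Series(pd.to_datetime([
    "2013-07-01 06:00",
    "2013-07-01 06:30",
    "2013-07-02 06:30",
]))

# Difference to the previous row; the first entry has no predecessor (NaT).
gaps = ts - ts.shift()

# diff() computes the same thing in one call.
gaps_diff = ts.diff()

print(gaps)
```

The result is a Series of `Timedelta` values, so the usual statistics such as `gaps.max()` or `gaps.mean()` work on it.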

In [None]:
x = df[(df['datetime']>'2013-07-03') & (df['datetime']<'2013-07-15')]

In [None]:
print(len(x))
x.head()
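
When a datetime column is compared with a date string, the string is interpreted as midnight at the start of that day. The following sketch, on made-up rows, illustrates what the two strict comparisons keep and what they drop.

```python
import pandas as pd

# Hypothetical rows around the boundaries of the selected window.
demo = pd.DataFrame({
    "datetime": pd.to_datetime([
        "2013-07-02 12:00",  # before the window
        "2013-07-03 00:00",  # exactly on the lower bound
        "2013-07-10 08:00",  # inside the window
        "2013-07-15 09:00",  # on the upper-bound day, after midnight
    ]),
    "value": range(4),
})

# Each comparison yields a boolean Series; & combines them element-wise.
mask = (demo["datetime"] > "2013-07-03") & (demo["datetime"] < "2013-07-15")
selected = demo[mask]

print(selected)
```

Only the row inside the window survives: the lower bound itself is excluded by `>`, and every observation made during 15 July is excluded because the upper bound means `2013-07-15 00:00`.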

In [None]:
import matplotlib.pyplot as plt
import numpy as np

In [None]:
from matplotlib.dates import DayLocator, HourLocator, DateFormatter

fig, ax = plt.subplots()

# plot_date is deprecated since Matplotlib 3.9; plot handles datetime values directly
ax.plot(x['datetime'], x['latitude'], 'o')

# format ticks
ax.xaxis.set_major_locator(DayLocator())  # major tick every day
# HourLocator takes the hour or sequence of hours you want to
# tick, not the base multiple
ax.xaxis.set_minor_locator(HourLocator(np.arange(0, 25, 6)))  # minor tick every 6 hours
ax.xaxis.set_major_formatter(DateFormatter('%m-%d'))  # format of the date label

ax.fmt_xdata = DateFormatter('%Y-%m-%d %H:%M:%S')
fig.autofmt_xdate() # rotates x labels
