## What is pandas good for?

Working with (large) data sets and creating automated data processes.

Pandas is used extensively to prepare data in data science (machine learning, data analytics, ...).

**Examples**: 
* **Import and export** data into standard formats (CSV, Excel, LaTeX, ...).
* Combine with Numpy for **advanced computations** or Matplotlib for **visualisations**.
* Calculate **statistics** and answer questions about the data, like
  * What's the average, median, max, or min of each column?
  * Does column A correlate with column B?
  * What does the distribution of data in column C look like?
* **Clean** up data (e.g. fill out missing information and fix inconsistent formatting) and **merge** multiple data sets into one common dataset.
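
The questions in the list above map onto a handful of DataFrame methods. The following is a minimal sketch on a small made-up table; the column names A, B, C and all values are illustrative assumptions, not data from this notebook.

```python
import pandas as pd

# Hypothetical data, made up for illustration only
df_demo = pd.DataFrame({
    "A": [1.0, 2.0, 3.0, 4.0],
    "B": [2.0, 4.0, 6.0, 8.0],   # B is exactly 2*A, so A and B correlate perfectly
    "C": [5.0, 3.0, 3.0, 1.0],
})

column_means = df_demo.mean()                 # average of each column
column_medians = df_demo.median()             # median of each column
corr_ab = df_demo["A"].corr(df_demo["B"])     # Pearson correlation between A and B
c_counts = df_demo["C"].value_counts()        # how often each value occurs in C

print(column_means)
print(corr_ab)
```

For a quick overview of count, mean, standard deviation, min, max and quartiles in one call, `df_demo.describe()` can be used as well.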


<img src="data/pressure.png" style="width: 1000px;"/>

In [None]:
import pandas as pd
import pylab as pl

First, a short recap of the video session.

The two fundamental data-structures in pandas are Series and DataFrame:

In [None]:
s = pd.Series([1,2,3])
s

In [None]:
s = pd.Series([1,2,3], index=['a','b','c'])
s

In [None]:
dic = {'a':1, 'b':2, 'c':3}
s = pd.Series(dic)
s

In [None]:
s['a']

In [None]:
dic = {'a':[1,2], 'b':[3,4], 'c':[5,6]}
s = pd.DataFrame(dic)
s

In [None]:
s['a']

In [None]:
s['a'][0]

In [None]:
s.columns

# Reading data from file
Now assume that we have some pressure data obtained from a sensor, as shown below
<img src="data/pressure.png" style="width: 1000px;"/>

In [None]:
df = pd.read_csv('data/pressure.csv')

In [None]:
df

In [None]:
t = df['t']
p = df['p']

In [None]:
p

In [None]:
pl.plot(t,p)
pl.show()

This way of extracting data from the DataFrame is useful for further computations with t and p. If you only want to plot the data, the DataFrame has its own plot method:

In [None]:
df.plot()
pl.show()

In [None]:
df.plot('t','p')
pl.show()

# How to write data to csv

In [None]:
t = pl.linspace(0,2*pl.pi,200)
p = pl.sin(2*pl.pi*t)

In [None]:
pl.plot(t,p)
pl.show()

In [None]:
data = pl.array([t,p])

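The array data has shape (2, 200): one row per variable. A DataFrame expects one row per sample and one column per variable, so when dealing with table data you should always check whether the matrix needs a .transpose():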

In [None]:
df = pd.DataFrame(data.transpose(), columns=['t','p'])
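
To see why the transpose is needed, it can help to look at the array shapes. The sketch below uses numpy directly (the `pl.` functions used in this notebook are the same numpy functions) and short made-up arrays named `t_demo` and `p_demo`; it also shows a dict-based construction as one possible alternative that avoids the transpose.

```python
import numpy as np
import pandas as pd

# Short hypothetical arrays so the shapes are easy to read
t_demo = np.linspace(0, 1, 5)
p_demo = t_demo**2

stacked = np.array([t_demo, p_demo])   # shape (2, 5): one ROW per variable
print(stacked.shape)

# A DataFrame wants one row per sample and one column per variable,
# so we transpose to shape (5, 2) before naming the columns
df_demo = pd.DataFrame(stacked.transpose(), columns=["t", "p"])
print(df_demo.shape)

# Alternative: build the frame from a dict of columns, no transpose needed
df_from_dict = pd.DataFrame({"t": t_demo, "p": p_demo})
```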

In [None]:
df

In [None]:
df.to_csv('pressure_computed.csv')
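
By default, to_csv also writes the row index as an extra first column. Whether that is wanted depends on the use case; the sketch below compares the default with `index=False` on a small made-up table, writing to an in-memory buffer (`io.StringIO`) instead of a file so that nothing is created on disk.

```python
import io
import pandas as pd

# Hypothetical table, made up for illustration
df_demo = pd.DataFrame({"t": [0.0, 0.5, 1.0], "p": [0.0, 0.25, 1.0]})

# Default: the row index is written as an extra, unnamed first column
with_index = io.StringIO()
df_demo.to_csv(with_index)
header_with_index = with_index.getvalue().splitlines()[0]

# index=False writes only the data columns
without_index = io.StringIO()
df_demo.to_csv(without_index, index=False)
header_without_index = without_index.getvalue().splitlines()[0]

print(header_with_index)
print(header_without_index)

# Reading the second variant back gives the original columns again
without_index.seek(0)
df_back = pd.read_csv(without_index)
```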

#### Adding a column to the existing DataFrame:

In [None]:
v = pl.cos(2*pl.pi*t)

In [None]:
df['v'] = v

In [None]:
df

It is possible to create an empty DataFrame and just add the columns whenever you like:

In [None]:
empty_df = pd.DataFrame()

In [None]:
empty_df['t'] = t
empty_df['p'] = p
empty_df['v'] = v

In [None]:
empty_df

In [None]:
empty_df.plot('t',['p','v']) # pl.plot(t,p,t,v) in matplotlib

**Exercise** 
1. Create uniformly sampled time points between 0 and 30.
2. Generate positional data in the xy-plane given by [0.4*t + cos(t), sin(t)].
3. Create a DataFrame consisting of the three columns t, x and y.
4. Plot x versus y, first using the matplotlib plot function and then the DataFrame plot method.

Velocity data can be computed by $v_{x_i} = \frac{x_{i+1} - x_i}{t_{i+1} - t_i}$, $v_{y_i} = \frac{y_{i+1} - y_i}{t_{i+1} - t_i}$

5. Compute the velocity data for x and y and add them as columns in the DataFrame.
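
As a hint for the velocity formula, here is a minimal sketch of a forward difference on a different, made-up motion (constant speed 3 along x), not the exercise data. The names `t_demo`, `x_demo` and `vx` are illustrative. Note that the difference quotient gives one value fewer than there are time points, so the sketch pads with NaN before adding the column; other padding choices are possible.

```python
import numpy as np
import pandas as pd

# Hypothetical motion with constant speed 3, so every difference quotient should be 3
t_demo = np.linspace(0, 2, 5)
x_demo = 3.0 * t_demo

# np.diff returns neighbouring differences a[i+1] - a[i], one value fewer than the input
vx = np.diff(x_demo) / np.diff(t_demo)

df_demo = pd.DataFrame({"t": t_demo, "x": x_demo})
# The last time point has no "next" sample, so pad with NaN to match the column length
df_demo["vx"] = np.append(vx, np.nan)
print(df_demo)
```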



## A real-world example: Oslo bysykkel data
We go to https://oslobysykkel.no/apne-data/historisk (you can also find the page by searching for "oslo bysykkel data historisk" on Google) and download the September data as CSV. 

In [None]:
import pandas as pd
import pylab as pl
trips = pd.read_csv('data/bysykkel/trips-2021.9.1-2021.9.30.csv')

In [None]:
trips

We can work with the data using normal pylab (and numpy) functions:

In [None]:
pl.hist(trips['duration'], range=[0,1500])

We can also use DataFrame built-in functions:

In [None]:
trips.sort_values('duration')

# Exercise
1. Make a scatter plot showing the position of the stations in Oslo. It is OK to plot a station several times. Use matplotlib or the built-in DataFrame.plot.scatter.

2. (Bonus) Make a scatter plot with circles of different sizes, where the size depends on how popular a station is (i.e. how many trips were started at the given station).






In [None]:
# solution to 1

Let's see if we can find information about how popular the different start stations are

In [None]:
trips['start_station_id']

Let's first try the numpy-way:

In [None]:
stations = pl.unique(trips['start_station_id'])

In [None]:
stations

In [None]:
stations[0] == trips['start_station_id']   # find out if trips started at the given station

In [None]:
(stations[0] == trips['start_station_id']).sum()   # sum all trips that started at the given station

Now we generalize the line above to create a list with the number of trips for each station:

In [None]:
number_of_trips = [(station == trips['start_station_id']).sum() for station in stations]

In [None]:
number_of_trips

Now let's try some pandas.
If we only want to count the trips per station, we can use .value_counts():

In [None]:
number_of_trips_pandas = trips['start_station_id'].value_counts()

In [None]:
number_of_trips_pandas
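
The numpy approach and .value_counts() give the same counts, but in a different order: the unique ids come out sorted by id, while value_counts sorts by count (largest first) and keeps the id as index. A minimal sketch on made-up station ids (the Series `start_ids` is hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical station ids, one entry per trip
start_ids = pd.Series([7, 3, 7, 5, 3, 7], name="start_station_id")

# numpy way: unique ids (sorted ascending) and one boolean sum per id
ids = np.unique(start_ids)
counts_numpy = [int((station == start_ids).sum()) for station in ids]

# pandas way: sorted by count, largest first, with the station id as index
counts_pandas = start_ids.value_counts()

# sort_index puts the pandas result in the same order as np.unique
counts_pandas_sorted = counts_pandas.sort_index()
print(counts_numpy)
print(counts_pandas_sorted.tolist())
```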

Now suppose we want all the information we can get about each start station, not only the number of trips. To group the data by start_station_id and count the trips, while keeping the other columns that describe the start station, we can use groupby():

In [None]:
station_data = trips.groupby(['start_station_id','start_station_name',
                       'start_station_description', 'start_station_latitude',
                       'start_station_longitude']).count()
station_data

In [None]:
station_data = station_data.reset_index()
station_data

In [None]:
station_data = station_data.drop(columns=station_data.columns[-7:])
station_data

In [None]:
station_data = station_data.rename(columns={'started_at':'started_trips'})
station_data = station_data.set_index('start_station_id')
station_data
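
The grouping steps above can be sketched on a tiny made-up table to see what each step does. The table `trips_demo` and its values are hypothetical; as a small variation, the sketch drops the unneeded count column by name rather than by position.

```python
import pandas as pd

# Hypothetical miniature version of the trips table
trips_demo = pd.DataFrame({
    "started_at": ["08:00", "08:05", "09:00", "09:30"],
    "duration": [300, 450, 600, 200],
    "start_station_id": [1, 2, 1, 1],
    "start_station_name": ["Alpha", "Beta", "Alpha", "Alpha"],
})

# Group by the columns that describe a station; count() then counts the
# non-missing entries of every remaining column within each group
grouped = trips_demo.groupby(["start_station_id", "start_station_name"]).count()

# reset_index turns the group keys back into ordinary columns
stations_demo = grouped.reset_index()

# Keep one count column, give it a descriptive name, and index by station id
stations_demo = stations_demo.drop(columns=["duration"])
stations_demo = stations_demo.rename(columns={"started_at": "started_trips"})
stations_demo = stations_demo.set_index("start_station_id")
print(stations_demo)
```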

In [None]:
station_data.sort_values('started_trips', ascending=False)

In [None]:
ended_trips = trips['end_station_id'].value_counts()
ended_trips

In [None]:
station_data['ended_trips'] = ended_trips
station_data.sort_values('started_trips', ascending=False)
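
Assigning a Series as a new column matches values on the index labels (here the station id), not on row position. A station that never appears as an end station therefore gets NaN. The sketch below illustrates this on made-up data; filling the gap with 0 is one possible choice, not something the notebook itself does.

```python
import pandas as pd

# Hypothetical station table indexed by station id
stations_demo = pd.DataFrame(
    {"started_trips": [3, 1, 2]},
    index=pd.Index([1, 2, 3], name="start_station_id"),
)

# Hypothetical end-station ids, one per trip; station 2 never appears
end_ids = pd.Series([3, 1, 3, 1, 1])
ended = end_ids.value_counts()   # index = station id, value = number of trips

# Assignment matches on index labels, not on position
stations_demo["ended_trips"] = ended
missing_before_fill = int(stations_demo["ended_trips"].isna().sum())

# Station 2 has no match and is NaN; replace it with 0 if a plain count is wanted
stations_demo["ended_trips"] = stations_demo["ended_trips"].fillna(0).astype(int)
print(stations_demo)
```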

# Plotting on a map with ipyleaflet and HTML

We saw that the scatterplot could be used to plot stations on a map:

In [None]:
station_data.plot.scatter('start_station_longitude', 'start_station_latitude')

We now have tools to plot the most popular bike stations as bigger circles

In [None]:
station_data.plot.scatter('start_station_longitude', 'start_station_latitude', s='started_trips')

### ipywidgets/HTML and ipyleaflet are useful tools to visualize data on maps

In [None]:
from ipywidgets import HTML
from ipyleaflet import Map, Marker, basemaps, basemap_to_tiles, Circle, Polyline

In [None]:
oslo_center = 59.9127, 10.7461    # NB: ipyleaflet uses lat-long order (i.e. y, x) when specifying coordinates

In [None]:
oslo_map = Map(center=oslo_center, zoom=13)

In [None]:
oslo_map

In [None]:
oslo_map.save('data/raw_oslo_map.html')  # if interactive view is not possible inline try to open this in your browser

We can add layers to our map with a marker function. The function is written such that, for a given row in the DataFrame (i.e. a given station), it adds one marker to the map:

In [None]:
def add_markers(row):
    center = row['start_station_latitude'], row['start_station_longitude']
    marker = Circle(location=center, radius=int(0.04*row["started_trips"]), color = 'green')
    oslo_map.add_layer(marker)

In [None]:
station_data.apply(add_markers, axis=1)
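
To see what apply with axis=1 does without needing ipyleaflet, here is a minimal sketch in which a plain Python list stands in for the map. The table `stations_demo`, the list `circles` and the function `describe_marker` are hypothetical names used only for this illustration.

```python
import pandas as pd

# Hypothetical station table
stations_demo = pd.DataFrame({
    "start_station_latitude": [59.91, 59.92],
    "start_station_longitude": [10.75, 10.76],
    "started_trips": [100, 250],
})

circles = []   # stand-in for the map: we just collect what would be drawn

def describe_marker(row):
    # row is a Series holding one station; its values are accessed by column name
    center = (row["start_station_latitude"], row["start_station_longitude"])
    radius = int(0.04 * row["started_trips"])
    circles.append((center, radius))

# axis=1 calls the function once per row (axis=0 would pass whole columns)
result = stations_demo.apply(describe_marker, axis=1)
print(circles)
```

Since the function returns nothing, apply produces a Series of None values. Assigning it to a variable, as done here with `result`, keeps the notebook from printing it.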

# Exercise

Note: If you have issues installing ipyleaflet or ipywidgets with pip, just use pl.scatter() or pl.plot() instead.

1) Create the DataFrame station_data as described in the lecture

2) Make a similar plot of the Oslo map with the most popular end-stations as red circles

3) Add the following line as the last line in your add_markers function: marker.popup = HTML(f"{row['start_station_name']} Trips started: {row['started_trips']}"). You can also add line breaks within the string with the HTML tag `<br>`.

4) Try to make an Oslo map showing both started trips and ended trips in the same map

5) Make a map showing which stations are most popular going from Stensgata

In [None]:
%reset
import pandas as pd
import pylab as pl

trips = pd.read_csv("data/bysykkel/trips-2021.9.1-2021.9.30.csv")
# Select the single count column by name instead of dropping columns by position,
# since the number of leftover columns depends on how many columns are grouped
station_data = trips.groupby(['start_station_id', 'start_station_longitude', 
                    'start_station_latitude', 'start_station_name'])['started_at'].count()
station_data = station_data.reset_index()
station_data = station_data.rename(columns={'started_at':'started_trips'})
station_data = station_data.set_index('start_station_id')
station_data['ended_trips'] = trips['end_station_id'].value_counts()

from ipywidgets import HTML
from ipyleaflet import Map, Marker, basemaps, basemap_to_tiles, Circle, Polyline
oslo_center = 59.9127, 10.7461    # NB: ipyleaflet uses lat-long order (i.e. y, x) when specifying coordinates
oslo_map = Map(center=oslo_center, zoom=13)
def add_markers(row):
    center = row['start_station_latitude'], row['start_station_longitude']
    marker = Circle(location=center, radius=int(0.04*row["started_trips"]), color = 'green')
    oslo_map.add_layer(marker)
    
station_data.apply(add_markers, axis=1)

In [None]:
oslo_map