# Introduction to pandas

## The dataframe

* Core to pandas is a data structure called the dataframe (`DataFrame`).
* In principle it is a table-like structure:
 * Named columns with arbitrary types
 * Indices to conveniently select, filter and aggregate rows
 

    Index      | columns
    -------------------------------------------------------
    date       | temperature | humidity | description
    -------------------------------------------------------
    2018-08-15 | 36.6        | 0.8      | "Hot like always"
    2018-08-16 | 40.6        | 0.9      | "Even hotter"
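
The table above can be written down directly in pandas. This is only a minimal sketch: the variable name `weather` is made up for illustration, and the values are the two rows shown above.

```python
import pandas as pd

# Build the example table; "weather" is a hypothetical variable name.
weather = pd.DataFrame(
    {
        "temperature": [36.6, 40.6],
        "humidity": [0.8, 0.9],
        "description": ["Hot like always", "Even hotter"],
    },
    index=pd.to_datetime(["2018-08-15", "2018-08-16"]),
)
weather.index.name = "date"

# Every column keeps its own type
print(weather.dtypes)

# The index selects a row by label, the column name selects the cell
hottest = weather.loc[pd.Timestamp("2018-08-16"), "temperature"]
print(hottest)
```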

In [None]:
# the most common imports
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline

Dataframes can be constructed and loaded from a wide variety of sources:
* directly from e.g. a dict, a list of lists, ...
* file formats: CSV, Parquet
* from a DB via an SQL query
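
As a rough sketch of these routes, the snippet below builds the same small frame from a list of lists, from CSV text and from an SQL query. The in-memory buffer, the in-memory SQLite database and the table name `demo` are stand-ins chosen for illustration, not part of the taxi data. Parquet works analogously through `pd.read_parquet`, which needs an extra engine such as pyarrow installed, so it is left out here.

```python
import io
import sqlite3

import pandas as pd

# From a list of lists, with explicit column names
from_lists = pd.DataFrame([[1, "a"], [2, "b"]], columns=["a", "b"])

# From CSV; an in-memory buffer stands in for a real file path
csv_text = "a,b\n1,a\n2,b\n"
from_csv = pd.read_csv(io.StringIO(csv_text))

# From a database via an SQL query; in-memory SQLite is used for illustration
conn = sqlite3.connect(":memory:")
from_lists.to_sql("demo", conn, index=False)  # "demo" is a hypothetical table name
from_sql = pd.read_sql_query("SELECT a, b FROM demo WHERE a > 1", conn)
conn.close()

print(from_sql)
```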

In [None]:
df = pd.DataFrame({"a": [1,2,3]*3, "b": ['a','b','c']*3, "c": [0.3,0.4,None]*3})
df

In [None]:
# access like a dictionary
df['b']

In [None]:
# access like an attribute (tab completion!)
df.a

In [None]:
# multiple columns
df[['b','c']]

In [None]:
# first 5 rows
df[:5]

In [None]:
# 5 rows in the middle
df[2:7] # slicing

In [None]:
# combine it (dice it!)
df[['b','c']][2:7]
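
Chaining two `[]` lookups is fine for reading data. The same selection can also be written in a single step with `.loc` (by label) or `.iloc` (by position). A small sketch on the toy frame from above; note that a `.loc` slice includes its end label, while a positional slice does not.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3] * 3, "b": ["a", "b", "c"] * 3, "c": [0.3, 0.4, None] * 3})

# Chained form: pick columns first, then slice rows by position
chained = df[["b", "c"]][2:7]

# Label-based: rows labelled 2 through 6 (end label included)
by_label = df.loc[2:6, ["b", "c"]]

# Position-based: rows 2 to 6 (end position 7 excluded), columns 1 and 2
by_position = df.iloc[2:7, [1, 2]]

print(by_label)
```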

In [None]:
# filtering is very important!
df['c'].notnull()

In [None]:
df['c']>0.3

In [None]:
# now we can use this boolean mask to keep only the rows where it is True
df[df['c']>0.3]

In [None]:
# we can also build complex logical combinations
df[((df['c'].notnull())&(df['a']==2))|(df['b']=='a')]
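
Long conditions are easier to read when each mask gets a name. The sketch below rebuilds the filter from above on the same toy frame; the mask names are made up for illustration. Masks are combined with `&`, `|` and `~`, because Python's `and`, `or` and `not` raise a `ValueError` on a Series.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3] * 3, "b": ["a", "b", "c"] * 3, "c": [0.3, 0.4, None] * 3})

# One named boolean mask per condition
has_c = df["c"].notnull()
a_is_two = df["a"] == 2
b_is_a = df["b"] == "a"

# Same logic as the one-liner above; the parentheses make the grouping explicit
selected = df[(has_c & a_is_two) | b_is_a]

# ~ negates a mask: everything that was not selected
rest = df[~((has_c & a_is_two) | b_is_a)]

print(selected)
```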

# Data exploration

Now let's explore a real dataset: the NYC Taxi and Limousine Commission trip dataset
* free download at https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page
* 10 GB per year for the yellow cabs alone

Let's explore it

In [None]:
# load the csv, but only the first 100000 rows (otherwise it's not fun)
df = pd.read_csv('../../taxi-data-csv/yellow_tripdata_2017-01.csv', nrows=100000)
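
Besides `nrows`, `read_csv` has two more options that help with files of this size: `usecols` loads only some columns, and `chunksize` yields the file piece by piece. The sketch below uses a tiny in-memory CSV with made-up values as a stand-in for the real file; only the column names `trip_distance` and `tip_amount` are taken from the dataset.

```python
import io

import pandas as pd

# Tiny stand-in for the real CSV file (values are made up)
csv_text = (
    "trip_distance,tip_amount,extra\n"
    "1.0,0.5,x\n"
    "2.5,1.0,y\n"
    "0.7,0.0,z\n"
    "3.2,2.0,w\n"
)

# usecols: load only the columns that are needed
small = pd.read_csv(io.StringIO(csv_text), usecols=["trip_distance", "tip_amount"])

# chunksize: iterate over the file in pieces and accumulate a statistic
total_tip = 0.0
n_rows = 0
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2):
    total_tip += chunk["tip_amount"].sum()
    n_rows += len(chunk)

mean_tip = total_tip / n_rows
print(mean_tip)
```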

In [None]:
# a very first step is simply to look at the data
df

In [None]:
# also nice is describe for a variety of summary statistics
df.describe()

In [None]:
# or only the first few rows
df.head()

In [None]:
# some predefined statistics functions are very handy
df.tip_amount.mean()

In [None]:
df['tip_amount'].std()

In [None]:
# but also plotting is (very) easy 
df.tip_amount.hist()

In [None]:
# more bins!
df.tip_amount.hist(bins=100)

In [None]:
# these are the things you have to look up on Stack Overflow...
fig, ax = plt.subplots()
df.tip_amount.hist(ax=ax, bins=100, bottom=0.1)
ax.set_yscale('log')
# negative tips?

In [None]:
# non-zero tips (below 10)
fig, ax = plt.subplots()
df[(df.tip_amount>0)&(df.tip_amount<10)].tip_amount.hist(ax=ax, bins=100, bottom=0.1)

In [None]:
# 2D scatter plot is a very powerful visualization (but often very expensive)
plt.scatter(df.trip_distance, df.tip_amount, s=1)

In [None]:
# often one wants to "zoom in"; we know how to do that: filtering!
df_cut = df[(df.tip_amount>0)&(df.tip_amount<30)&(df.trip_distance<10)]
plt.scatter(df_cut.trip_distance, df_cut.tip_amount, s=1)

In [None]:
# but more useful is a so-called "profile plot" (binned means), which seaborn provides via regplot
import seaborn as sns
sns.regplot(x=df_cut.trip_distance, y=df_cut.tip_amount, x_bins=15, fit_reg=False)
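
The numbers behind a profile plot can also be computed by hand: bin x, then take the mean of y in each bin. The sketch below does this on synthetic data with a built-in linear trend; the frame `demo` and its values are invented for illustration. It is similar in spirit to what `regplot` draws with `x_bins`, but seaborn's binning differs in detail, so the values will not match exactly.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data: tips grow with distance, plus noise
rng = np.random.default_rng(0)
demo = pd.DataFrame({"trip_distance": rng.uniform(0, 10, 1000)})
demo["tip_amount"] = 0.5 * demo["trip_distance"] + rng.normal(0, 1, 1000)

# Bin x into 5 equal-width intervals
bins = pd.cut(demo["trip_distance"], 5)

# Per bin: mean of y, its standard error, and the number of points
profile = demo.groupby(bins, observed=False)["tip_amount"].agg(["mean", "sem", "count"])
print(profile)
```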

In [None]:
# not so useful, but awesome looking (very slow)
f, ax = plt.subplots(figsize=(6, 6))
cmap = sns.cubehelix_palette(as_cmap=True, dark=0, light=1, reverse=True)
# newer seaborn releases expect x=/y= keywords, levels= and fill= (formerly n_levels= and shade=)
sns.kdeplot(x=df_cut.trip_distance, y=df_cut.tip_amount, cmap=cmap, levels=60, fill=True);


In [None]:
# a heatmap (2D histogram) is also an interesting visualization; here we build it ourselves from binned counts
df_cut = df[(df.tip_amount>3)&(df.tip_amount<15)&(df.trip_distance<15)]
# observed=False keeps empty bins in the result (as count 0), so the grid stays complete
heatmap_df = df_cut.groupby([pd.cut(df_cut.trip_distance, 20), pd.cut(df_cut.tip_amount, 20)], observed=False).tip_amount.count()

In [None]:
heatmap_df

In [None]:
# unstacking is a very important transformation, but I only get it right by trial and error
heatmap_df.unstack()
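
What `unstack` does is easiest to see on a tiny example. It takes the innermost level of a two-level row index and turns it into columns, so a long list of counts becomes a 2D table. The labels and counts below are made up for illustration.

```python
import pandas as pd

# A Series with a two-level index, like the result of a groupby on two keys
counts = pd.Series(
    [5, 2, 1, 7],
    index=pd.MultiIndex.from_product(
        [["short", "long"], ["low", "high"]], names=["distance", "tip"]
    ),
)

# The inner level ("tip") becomes the columns; "distance" stays as the row index
table = counts.unstack()
print(table)
```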

In [None]:
# now plot the 2d-matrix as a heatmap
ax = sns.heatmap(heatmap_df.unstack())

In [None]:
from matplotlib.colors import LogNorm
# log color scale; vmin must be positive, so clamp it to 1 in case some bins are empty
ax = sns.heatmap(heatmap_df.unstack(), norm=LogNorm(vmin=max(heatmap_df.min(), 1), vmax=heatmap_df.max()))
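
As an alternative to building the counts with `groupby` and `pd.cut`, NumPy's `histogram2d` returns the same kind of binned 2D counts in one call. The sketch below runs on random stand-in data, not on the taxi data, and only computes the counts without plotting them.

```python
import numpy as np

# Random stand-in data for two numeric columns
rng = np.random.default_rng(1)
x = rng.uniform(0, 15, 500)
y = rng.uniform(3, 15, 500)

# 20 x 20 bins; counts[i, j] is the number of points in x-bin i and y-bin j
counts, x_edges, y_edges = np.histogram2d(x, y, bins=20)
print(counts.shape)
```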

# Now it's your turn!
Explore a bit further:
* what other variables are there?
* what data types are they?
* is data missing?
* can you spot interesting correlations?
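
As a starting point, the calls below answer these questions on the small toy frame from the first part; applying them to the taxi data is left to you. This is only a sketch of one possible approach.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3] * 3, "b": ["a", "b", "c"] * 3, "c": [0.3, 0.4, None] * 3})

# Which variables are there, and what types do they have?
print(df.dtypes)

# Is data missing? Count the missing values per column
missing = df.isnull().sum()
print(missing)

# Pairwise correlations between the numeric columns
corr = df.select_dtypes("number").corr()
print(corr)
```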
