<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Pandas" data-toc-modified-id="Pandas-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Pandas</a></span><ul class="toc-item"><li><span><a href="#Pandas-data-formats" data-toc-modified-id="Pandas-data-formats-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>Pandas data formats</a></span><ul class="toc-item"><li><span><a href="#Pandas-Series:" data-toc-modified-id="Pandas-Series:-1.1.1"><span class="toc-item-num">1.1.1&nbsp;&nbsp;</span>Pandas Series:</a></span></li><li><span><a href="#Pandas-Dataframe" data-toc-modified-id="Pandas-Dataframe-1.1.2"><span class="toc-item-num">1.1.2&nbsp;&nbsp;</span>Pandas Dataframe</a></span><ul class="toc-item"><li><span><a href="#select-data-regions-in-DataFrame:" data-toc-modified-id="select-data-regions-in-DataFrame:-1.1.2.1"><span class="toc-item-num">1.1.2.1&nbsp;&nbsp;</span>select data regions in DataFrame:</a></span></li></ul></li><li><span><a href="#Change-content-of-dataframe" data-toc-modified-id="Change-content-of-dataframe-1.1.3"><span class="toc-item-num">1.1.3&nbsp;&nbsp;</span>Change content of dataframe</a></span></li><li><span><a href="#Functions-can-be-applied-to-content" data-toc-modified-id="Functions-can-be-applied-to-content-1.1.4"><span class="toc-item-num">1.1.4&nbsp;&nbsp;</span>Functions can be applied to content</a></span></li></ul></li><li><span><a href="#First-practical-example:-Old-faithful-data" data-toc-modified-id="First-practical-example:-Old-faithful-data-1.2"><span class="toc-item-num">1.2&nbsp;&nbsp;</span>First practical example: Old faithful data</a></span></li><li><span><a href="#Handling-missing-data-or-null-values-in-Pandas" data-toc-modified-id="Handling-missing-data-or-null-values-in-Pandas-1.3"><span class="toc-item-num">1.3&nbsp;&nbsp;</span>Handling missing data or null-values in Pandas</a></span><ul class="toc-item"><li><span><a href="#Python-None-object" data-toc-modified-id="Python-None-object-1.3.1"><span 
class="toc-item-num">1.3.1&nbsp;&nbsp;</span>Python None object</a></span></li><li><span><a href="#Python-NaN" data-toc-modified-id="Python-NaN-1.3.2"><span class="toc-item-num">1.3.2&nbsp;&nbsp;</span>Python NaN</a></span></li><li><span><a href="#NaN-and-None-in-Pandas" data-toc-modified-id="NaN-and-None-in-Pandas-1.3.3"><span class="toc-item-num">1.3.3&nbsp;&nbsp;</span>NaN and None in Pandas</a></span><ul class="toc-item"><li><span><a href="#Example-with-real-data" data-toc-modified-id="Example-with-real-data-1.3.3.1"><span class="toc-item-num">1.3.3.1&nbsp;&nbsp;</span>Example with real data</a></span></li></ul></li></ul></li><li><span><a href="#Aggregations-in-Panda" data-toc-modified-id="Aggregations-in-Panda-1.4"><span class="toc-item-num">1.4&nbsp;&nbsp;</span>Aggregations in Panda</a></span><ul class="toc-item"><li><span><a href="#Grouping-in-Pandas" data-toc-modified-id="Grouping-in-Pandas-1.4.1"><span class="toc-item-num">1.4.1&nbsp;&nbsp;</span>Grouping in Pandas</a></span></li></ul></li><li><span><a href="#Second-Example----Planets-Data" data-toc-modified-id="Second-Example----Planets-Data-1.5"><span class="toc-item-num">1.5&nbsp;&nbsp;</span>Second Example -  Planets Data</a></span></li><li><span><a href="#Third-Example---Spread-sheet-data" data-toc-modified-id="Third-Example---Spread-sheet-data-1.6"><span class="toc-item-num">1.6&nbsp;&nbsp;</span>Third Example - Spread sheet data</a></span></li><li><span><a href="#Fourth-Example---Time-series-data" data-toc-modified-id="Fourth-Example---Time-series-data-1.7"><span class="toc-item-num">1.7&nbsp;&nbsp;</span>Fourth Example - Time series data</a></span></li></ul></li></ul></div>

## Pandas


Pandas is a Python library for data analysis. It provides:

* I/O tools to handle many different data formats, such as *txt, csv, xls, json, hdf5, sql, ...*
* Tools to manipulate (filter, fix, extract, extend, ...) the data
* Special visualization tools (on top of matplotlib)
* Many features to combine, group, select, and aggregate specific properties of the data

A few well-chosen pandas instructions can carry out complex data-mining procedures that would otherwise take substantial programming effort.




### Pandas data formats

Pandas builds on two special array types, which extend numpy arrays.

* Series: one-dimensional array, can contain different data types, and supports flexible indexing
* DataFrame: two-dimensional array, basically a table with rows and columns, can also contain different data types and has very flexible indexing




In [1]:
# Pandas Setup: 
import pandas as pd
# and usually as well:
import numpy as np
import matplotlib.pyplot as plt

#### Pandas Series:

In [None]:
obj=pd.Series([4,3,5,-6])
print(type(obj))
print(obj)

In [None]:
# Pandas series is not just array but has associated index
print(obj.values)
print(obj.index)

In [None]:
type(obj.values)

In [None]:
obj[2]

In [None]:
# default index is just integer range
#
# But can specify different index explicitly:

obj=pd.Series([4,3,5,-6],index=['a','b','c','d'])
print(obj)
print(obj.index)

In [None]:
# can use explicit index 'a'-'d' for range
obj['a':'c'] # watch out, last element included 

In [None]:
# or implicit numerical index:
obj[0:2] # watch out, last element not included 
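
The two slicing conventions are easier to keep apart with the explicit accessors `loc` (labels) and `iloc` (positions). A minimal sketch, using a made-up series `s` that mirrors the one above:

```python
import pandas as pd

# Hypothetical example series with a string index
s = pd.Series([4, 3, 5, -6], index=['a', 'b', 'c', 'd'])

by_label = s.loc['a':'c']    # label slice: end label 'c' is included
by_position = s.iloc[0:2]    # positional slice: end position 2 is excluded

print(by_label)
print(by_position)
```

Using `loc` and `iloc` makes the intent explicit, which helps in particular when the index itself consists of integers.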

In [None]:
# or change index later on:
obj.index=['I','II','III','IV']
obj

In [None]:
# Usual numpy array operations can be used for Pandas series

obj[obj>3] # select elements

In [None]:
obj**2 # square

In [None]:
np.exp(obj) # exp fct

#### Pandas Dataframe

The 2D version: a DataFrame has both a row index and a column index.


In [None]:
# example creation
# list of date strings as row index
# (to_native_types() was removed in pandas 2.0; strftime gives the same strings)
dates = pd.date_range('20130101', periods=6).strftime('%Y-%m-%d')
print(dates)
# list of chars as column index
cols=list('ABCD')
# data as numpy 2d array filled with random numbers:
mydata = np.random.randn(6,4)
df = pd.DataFrame( mydata, index=dates, columns=cols)

print(df)

##### select data regions in DataFrame:

In [None]:
# column
df['A'] # column index 

In [None]:
df.A # in most cases also this notation works to select col 

In [None]:
# select row with loc and explicit index
print(df.loc['2013-01-05'])

In [None]:
# select row with iloc and implicit index
print(df.iloc[4])

In [None]:
# element with col and row
df['B'].loc['2013-01-05']


In [None]:
# or more direct
df['B']['2013-01-05']

In [None]:
# or range of rows
df['A']['2013-01-01':'2013-01-04']

In [None]:
# or implicit row index
df['B'][0:3] 

In [None]:
df['A']['2013-01-01':'2013-01-05']

***
Selecting a range of columns requires `loc` or `iloc`:

In [None]:
df['B':'C',:] # naive attempt does not work

In [None]:
df.loc[:,'B':'C'] # df.loc allows to select arbitrary regions with key index

In [None]:
df.iloc[:,1:3] # equivalent with numerical col index

In [None]:
df.iloc[1:4,2:4] # arbitrary sub-range
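
As a cross-check of the two accessors, the following sketch selects the same region once by label and once by position. The table `small` and its labels are made up for illustration; the values are deterministic so the result is easy to verify by eye.

```python
import numpy as np
import pandas as pd

# Small deterministic table (values 0..11) with hypothetical labels
small = pd.DataFrame(np.arange(12).reshape(3, 4),
                     index=['r0', 'r1', 'r2'], columns=list('ABCD'))

by_label = small.loc['r0':'r1', 'B':'C']   # both end labels are included
by_position = small.iloc[0:2, 1:3]         # end positions are excluded

print(by_label)
print(by_label.equals(by_position))
```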

In [None]:
# transpose row <--> columns
df.T 

#### Change content of dataframe
Many nice ways to modify content, some examples:

In [None]:
df3 = df.copy() # independent copy
#df3['2013-01-04']
df3

In [None]:
# delete row
df4=df3.drop('2013-01-04')
df4

In [None]:
# df3 unchanged
df3

In [None]:
# delete column, need to specify axis
# and set inplace to make change in Dataframe
df3.drop('A',axis=1,inplace=True)

In [None]:
df3

In [None]:
# add new column
df4 = df.copy()
df4['E'] = ['one', 'one','two','three','four','three']
df4

In [None]:
# select specific rows 
df4[df4['E'].isin(['two','four'])]

#### Functions can be applied to content

In [None]:
df

In [None]:
df.apply(np.cumsum) # cumulative sum down each column

In [None]:
df.apply(np.exp) # exponential

In [None]:
df.mean()

In [None]:
df.sum()

... and much more.
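
The direction in which `apply` works is set by its `axis` argument. A small sketch with a made-up table `t`, chosen so that the results can be checked by hand:

```python
import numpy as np
import pandas as pd

# Hypothetical example table
t = pd.DataFrame({'A': [1, 2, 3], 'B': [10, 20, 30]})

col_cumsum = t.apply(np.cumsum)                    # default axis=0: function receives each column
row_sum = t.apply(np.sum, axis=1)                  # axis=1: function receives each row
col_range = t.apply(lambda c: c.max() - c.min())   # custom function per column

print(col_cumsum)
print(row_sum)
print(col_range)
```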

### First practical example: Old faithful data 
*(Eruptions of the Old Faithful geyser in Yellowstone National Park)*

Redo it with Pandas ...

In [None]:
# analyse old faithful data with pandas
import pandas as pd
import matplotlib.pyplot as plt
# can read data directly from web
# interpret as csv format 
# and import as Pandas dataframe 
d=pd.read_csv('https://people.sc.fsu.edu/~jburkardt/data/csv/faithful.csv')


In [None]:
d.info() # extract general info on dataframe

In [None]:
d.head() # inspect first lines

In [None]:
d.describe() # some basic statistic for each column

In [None]:
d.columns

In [None]:
# lengthy column names are impractical, rename
d.columns=['Index','El','Ew']
d.head()

In [None]:
print (d.Ew) # print all rows for Ew

In [None]:
# make plots directly with Pandas
d.El.plot.hist() # can call directly histogram for Pandas column
plt.xlabel('El')
fig=plt.figure() # trick to make new hist plot
d.Ew.plot.hist()
plt.xlabel('Ew')
d.plot.scatter('Ew','El'); # and also scatter plot

### Handling missing data or null-values in Pandas

#### Python None object
Python has a special object `None`, which is often used to mark a variable as empty or unassigned.

This can also be used or happen with Numpy/Pandas 

In [None]:
# 
vals1 = np.array([1, None, 3, 4])
vals1

In [None]:
# results in object-array with very limited use,
# many numpy funcs/operations broken
vals1**2

#### Python NaN
NaN (Not-a-Number) is a special floating-point value which indicates that a value cannot be treated as a regular number. It is specified in the IEEE floating-point standard and is widely used in programming.

Also defined for numpy (`np.nan`)
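
A few properties of NaN are worth seeing once. The sketch below uses a made-up array `x`; `np.isnan` and `np.nansum` are standard numpy functions.

```python
import numpy as np

x = np.array([1.0, np.nan, 3.0, 4.0])

print(np.nan == np.nan)   # NaN compares unequal even to itself
print(np.isnan(x))        # element-wise test for NaN
print(x.sum())            # ordinary aggregations propagate NaN
print(np.nansum(x))       # NaN-aware variant skips the missing value
```

Because of the first property, missing values have to be located with `np.isnan` (or the pandas equivalents) rather than with `==`.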


In [None]:
vals2 = np.array([1, np.nan, 3, 4]) 
vals2.dtype

In [None]:
# results in float array which can be used
# just no operation with the nan element  
vals2**2

####  NaN and None in Pandas

Pandas does not make that strict distinction but treats ``NaN`` and ``None`` basically the same:

In [None]:
# pandas converts to nan and float array
a=pd.Series([1, np.nan, 2, None])

In [None]:
a**2

Pandas has special functions to efficiently handle None/NaN values:

- ``isnull()``: Generate a boolean mask indicating missing values
- ``notnull()``: Opposite of ``isnull()``
- ``dropna()``: Return a filtered version of the data
- ``fillna()``: Return a copy of the data with missing values filled or imputed

In [None]:
data = pd.Series([1, np.nan, 'hello', None])
print (data)

In [None]:
print (data.isnull()) # boolean output

In [None]:
print (data[data.notnull()]) # select only non-missing values

In [None]:
print (data.dropna()) # does the same

In [None]:
print (data.fillna('missing, sorry')) # fill something instead

For  **Dataframes** there are many more options for dropping (both axes) and filling, see chapter *Handling Missing Data* in Book 
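
A few of these DataFrame options can be sketched as follows. The table `m` is made up for illustration; `axis`, `thresh` and filling with per-column means are only a selection of what `dropna` and `fillna` offer.

```python
import numpy as np
import pandas as pd

# Hypothetical table with some missing entries
m = pd.DataFrame({'x': [1.0, np.nan, 3.0],
                  'y': [np.nan, np.nan, 6.0],
                  'z': [7.0, 8.0, 9.0]})

rows_complete = m.dropna()                 # drop rows containing any NaN
cols_complete = m.dropna(axis='columns')   # drop columns containing any NaN
rows_min2 = m.dropna(thresh=2)             # keep rows with at least 2 non-NaN values
filled = m.fillna(m.mean())                # fill each column with its own mean

print(rows_complete)
print(cols_complete)
print(rows_min2)
print(filled)
```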

##### Example with real data
'planets' data (discussed further below) has many missing entries

Needs seaborn package:

Use alternative python setup: 

`module load anaconda3/2019.07 ` in shell


In [None]:
import seaborn as sns
planets = sns.load_dataset('planets')
planets.info()

In [None]:
planets.describe()

In [None]:
planets.mass.plot.hist()

### Aggregations in Panda

In [None]:
# simple df with gaussian random nums
df = pd.DataFrame({'A': np.random.randn(5),
                   'B': np.random.randn(5), 'C': np.random.randn(5)})
df

In [None]:
df.describe() # gives column-wise statistics

These **aggregations** can also be called explicitly:

In [None]:
print (df.A.mean(), df.A.std(), df.B.max())

By default aggregation is column-wise but one can also do row-wise:

In [None]:
df.mean(axis='columns') # gives mean value by row

The following table summarizes some other built-in Pandas aggregations:

| Aggregation              | Description                     |
|--------------------------|---------------------------------|
| ``count()``              | Total number of items           |
| ``first()``, ``last()``  | First and last item             |
| ``mean()``, ``median()`` | Mean and median                 |
| ``min()``, ``max()``     | Minimum and maximum             |
| ``std()``, ``var()``     | Standard deviation and variance |
| ``mad()``                | Mean absolute deviation (removed in pandas 2.0) |
| ``prod()``               | Product of all items            |
| ``sum()``                | Sum of all items                |

These are all methods of ``DataFrame`` and ``Series`` objects.
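
Several aggregations can also be requested in one call with `agg`. A minimal sketch on a made-up table `t`:

```python
import pandas as pd

# Hypothetical example table
t = pd.DataFrame({'A': [1.0, 2.0, 3.0, 4.0],
                  'B': [10.0, 20.0, 30.0, 40.0]})

summary = t.agg(['min', 'max', 'mean'])       # same aggregations for every column
mixed = t.agg({'A': 'sum', 'B': 'median'})    # a different aggregation per column

print(summary)
print(mixed)
```

The first form returns a DataFrame with one row per aggregation, the second a Series with one entry per column.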

#### Grouping in Pandas
A frequent use case is that one is not just interested in the overall aggregation, but in aggregations computed separately for subsets of the data, split according to other criteria or keys.

A simple example:

In [None]:
df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                   'data': np.random.randn(6)}, columns=['key', 'data'])
df

Now one might be interested not just in overall aggregation (e.g. mean) but in aggregation separate for each key. 
Can in principle be done by selecting specific rows:

In [None]:
print (df[df.key=='A'])
df[df.key=='A'].mean(numeric_only=True) # restrict to numeric columns, 'key' holds strings

A much more powerful alternative is `groupby`. It automatically splits the dataframe into sub-frames, to which arbitrary aggregations can then be applied:

In [None]:
df.groupby('key')

 `groupby` results in special object which allows subsequent aggregation calls:

In [None]:
df.groupby('key').mean()

In [None]:
df.groupby('key').describe()

**Illustrative figure to show effect of groupby**
* Sequence of
  * Split
  * Apply
  * Combine

![](figures/03.08-split-apply-combine.png)
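
The three steps can also be spelled out by hand and compared with `groupby`. This is only an illustrative sketch on a made-up table `g` with fixed values, not how pandas implements it internally:

```python
import pandas as pd

# Hypothetical table with fixed values, so the sums are easy to check
g = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                  'data': [0, 1, 2, 3, 4, 5]})

# Split: select the rows of each key; Apply: sum them; Combine: collect into a Series
manual = pd.Series({k: g.loc[g['key'] == k, 'data'].sum()
                    for k in sorted(g['key'].unique())})

# The same result in one step
grouped = g.groupby('key')['data'].sum()

print(manual)
print(grouped)
```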

### Second Example -  Planets Data

Planets data comes with the  [Seaborn package](http://seaborn.pydata.org/).
It gives information on planets that astronomers have discovered around other stars (known as *extrasolar planets* or *exoplanets* for short). It can be downloaded with a simple Seaborn command:

In [None]:
import seaborn as sns
planets = sns.load_dataset('planets')
planets.shape

There is no big magic behind this: it is a simple csv file, read via Pandas from a web server and stored as a DataFrame.

In [None]:
# do usual inspection
planets.info()
planets.head()

In [None]:
planets.describe()

In [None]:
# some missing data, might consider to remove these lines
planets.dropna().describe()

In [None]:
planets.groupby('method').mean()

In [None]:
# or only specific columns
planets.groupby('method')['orbital_period'].mean()

In [None]:
# again describe can be applied on sub-split
planets.groupby('method')['year'].describe()

**Of course we can also plot**

In [None]:
planets.plot.scatter('mass','orbital_period')

The per-method summaries give useful information on the dataset, e.g. Radial Velocity is the most frequent method, followed by Transit, although most Transit discoveries fall in the last few years covered by the dataset, ...

### Third Example - Spread sheet data
**WLCG Computing Resources**, see
https://wlcg-rebus.cern.ch/apps/pledges/resources/

In [None]:
# json input
#dwlcg = pd.read_json('https://wlcg-rebus.cern.ch/apps/pledges/resources/2016/all/json')
dwlcg = pd.read_json('http://www-static.etp.physik.uni-muenchen.de/kurs/Computing/sw/source/wlcg.json')
dwlcg.info()

In [None]:
dwlcg.head()

In [None]:
dwlcg.describe() # not very meaningful, data too inhomogeneous

In [None]:
dwlcg.ATLAS.mean() # select col ATLAS and try to get mean -> error
# 

This gives an error:  
Empty fields were filled with empty strings (`""`), so columns that should hold numeric data contain a mix of numbers and strings and are treated as the generic *object* data type.
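
The effect can be reproduced on a tiny made-up table. The names `raw`, `site` and `cpu` are hypothetical and not taken from the WLCG data; `errors='coerce'` is an optional choice that turns anything unparseable into NaN.

```python
import pandas as pd

# Hypothetical column with an empty string in place of a missing number
raw = pd.DataFrame({'site': ['s1', 's2', 's3'],
                    'cpu': [100, '', 250]})
print(raw['cpu'].dtype)      # mixed content is stored as object

# Convert to a numeric column; the empty string becomes NaN
fixed = pd.to_numeric(raw['cpu'], errors='coerce')
print(fixed.dtype)
print(fixed.count(), fixed.mean())
```

Once the column is numeric, `count()` and `mean()` skip the missing entry automatically.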

In [None]:
dwlcg.info()

In [None]:
# fix by enforcing conversion into numeric field
for s in ['ATLAS','CMS','ALICE','LHCb']:
    dwlcg[s]=pd.to_numeric(dwlcg[s]) # convert the whole column at once
    print (s, dwlcg[s].count(), dwlcg[s].mean())


In [None]:
dwlcg.info()

In [None]:
dwlcg.ATLAS.mean() # now mean works ok

Still not meaningful, since it mixes different types of resources

In [None]:
dwlcg.groupby(['PledgeType'])['ATLAS'].mean()

The same for all experiments (all numeric columns):

In [None]:
dwlcg.groupby(['PledgeType']).mean(numeric_only=True) # for all

Further breakdown by country:

In [None]:
dwlcg.groupby(['PledgeType','Country'])['ATLAS'].sum()

In [None]:
# can also be further  used:
r = dwlcg.groupby(['PledgeType','Country'])['ATLAS'].sum()
print (r['CPU'])
print ('CPU UK:', r['CPU','UK'])
r[:,'UK']


In [None]:
dr=r.unstack() # back to regular dataframe

In [None]:
dr.head()

In [None]:
r.head()

**Remark:**  
Such a spreadsheet can also be regarded as multi-dimensional data format:
1. Country
1. Federation
1. Experiment
1. PledgeType

For practical reasons it is stored and filled as a 2D table.
But with groupby one can basically get back these extra dimensions.
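
A minimal sketch of this idea, on a made-up table `pledges` that only borrows the column names from above (the numbers are invented): grouping by two keys yields a Series with a two-level index, and `unstack` turns one level into columns.

```python
import pandas as pd

# Hypothetical flat 2D table; values are invented for illustration
pledges = pd.DataFrame({
    'PledgeType': ['CPU', 'CPU', 'Disk', 'Disk', 'CPU'],
    'Country':    ['UK', 'DE', 'UK', 'DE', 'UK'],
    'ATLAS':      [10, 20, 30, 40, 5],
})

r = pledges.groupby(['PledgeType', 'Country'])['ATLAS'].sum()  # two-level index
table = r.unstack()                                            # Country becomes the column index

print(r)
print(table)
print('CPU UK:', r['CPU', 'UK'])
```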

### Fourth Example - Time series data
Example on weather/climate data:

The Deutscher Wetterdienst (DWD, the German weather service) has an archive with many decades of daily measurements from weather stations all over Germany: http://www.dwd.de/DE/leistungen/klimadatendeutschland/klarchivtagmonat.html.
One example is the dataset from the Zugspitze station, which dates back to August 1, 1900.

The analysis of that data is instructive, e.g. to look for effects of climate change.

Pandas provides a few special methods which are very useful for time-series analysis:
- easy selection of time-periods
- resampling/averaging of data over arbitrary intervals
  - days, weeks, months, years...
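
Both features can be tried on a synthetic series before turning to the real data. In this sketch the series `temp` is made up (a simple ramp, not weather data), and `'MS'` (month start) is just one possible resampling interval.

```python
import numpy as np
import pandas as pd

# Synthetic daily series over two years; values are a simple ramp 0, 1, 2, ...
days = pd.date_range('2000-01-01', '2001-12-31', freq='D')
temp = pd.Series(np.arange(len(days), dtype=float), index=days)

year_2000 = temp.loc['2000']            # select one year by partial date string
monthly = temp.resample('MS').mean()    # average over each calendar month

print(len(year_2000))
print(monthly.head())
```

This only works because the index is a `DatetimeIndex`; with dates stored as plain strings neither the selection by year nor the resampling would be available.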

In [None]:
import matplotlib.pyplot as plt
import pandas as pd
import datetime
import numpy as np
# csv data file from the zip file downloaded from
# http://www.dwd.de/DE/leistungen/klimadatendeutschland/klarchivtagmonat.html
# use the measurement date (MESS_DATUM) as index and parse it as a date (not just a string)
#df=pd.read_csv('http://www-static.etp.physik.uni-muenchen.de/kurs/Computing/sw/source/produkt_klima_Tageswerte_19000801_20151231_05792.txt',';\s*',index_col='MESS_DATUM',parse_dates=['MESS_DATUM'],skipfooter=1)

# sep must be passed as keyword in pandas >= 2.0; raw string avoids escape warnings
df=pd.read_csv('http://www-static.etp.physik.uni-muenchen.de/kurs/Computing/sw/source/produkt_klima_tag_19000801_20181231_05792.txt',sep=r';\s*',index_col='MESS_DATUM',parse_dates=['MESS_DATUM'],engine='python')
print (df.size)
print (df.columns)
df.info()

# cryptic abbreviations...
# TMK = mean temperature
# SHK_TAG = snow depth
# ..
# for details see http://www-static.etp.physik.uni-muenchen.de/kurs/Computing/sw/source/Metadaten_Parameter_klima_tag_05792.html
#


In [None]:
df.head()

In [None]:
df['TMK']

df['TMK'].plot();
# strong fluctuations, no trend visible

In [None]:
# only for one year
df['TMK']['1901'].plot();
# crucial to have Index recognized as date 
# to allow selection of time period ('1901'), etc

In [None]:
# since 2005
df['TMK']['2005':].plot()

In [None]:
# re-sample over year
dfy=df['TMK'].resample('A').mean()
dfy.plot()


In [None]:
# avg per quarter
dfq=df['TMK'].resample('Q-NOV').mean()

# Winter
dfq[dfq.index.quarter==1].plot()

# Summer
plt.figure()
dfq[dfq.index.quarter==3].plot()

# since 1970
plt.figure()
dfq[(dfq.index.quarter==3) & (dfq.index.year>1970)].plot()
