# Pandas for real-world data
The previous notebook introduced NumPy arrays, which provide an efficient data structure for clean, numerical data. In practice, however, data is often messier: values may be missing and several data types may be mixed. Moreover, accessing data only by integer indices can be cumbersome; more meaningful indices are desirable.
**Pandas** is a newer Python package that builds on top of NumPy arrays and extends them with many features that make managing practical data much easier. Pandas provides

* methods to cope with missing data and different data types,
* methods to label columns and rows for comfortable data access,
* functions that are familiar to users of database and spreadsheet frameworks, such as complex queries, filters, joins and pivots,
* methods for comfortable data visualisation (on top of Matplotlib).

The fundamental data structures of Pandas are _Series_ and _Dataframes_. Both of them use a third important structure, the _Index_. The basics of these data structures are introduced in this notebook.


Import Pandas and check version:

In [7]:
import pandas as pd
print pd.__version__

0.17.1


## Basics of Pandas Series
A Pandas Series can be considered a 1-dimensional NumPy array with an explicitly configurable index.
### Construction of Pandas Series

In [8]:
S1=pd.Series(data=[10,20,30,40])
print "Pandas Series object:    \n",S1
print "Values of series object: ",S1.values
print "Type of series values:   ",type(S1.values)
print "Index of series object:  ",S1.index
print "Type of series index:    ",type(S1.index)

Pandas Series object:    
0    10
1    20
2    30
3    40
dtype: int64
Values of series object:  [10 20 30 40]
Type of series values:    <type 'numpy.ndarray'>
Index of series object:   Int64Index([0, 1, 2, 3], dtype='int64')
Type of series index:     <class 'pandas.core.index.Int64Index'>


In [9]:
S2=pd.Series(index=["2014","2015","2016","2017"],data=[19.5,20.3,18.7,17.0])
print "Pandas Series object:    \n",S2
print "Values of series object: ",S2.values
print "Type of series values:   ",type(S2.values)
print "Index of series object:  ",S2.index
print "Type of series index:    ",type(S2.index)

Pandas Series object:    
2014    19.5
2015    20.3
2016    18.7
2017    17.0
dtype: float64
Values of series object:  [ 19.5  20.3  18.7  17. ]
Type of series values:    <type 'numpy.ndarray'>
Index of series object:   Index([u'2014', u'2015', u'2016', u'2017'], dtype='object')
Type of series index:     <class 'pandas.core.index.Index'>


Pandas series can be directly generated from Python dictionaries:

In [10]:
RegisteredUsersDict={"June":14789,"July":15511,"August":15517,"September":16012}
print RegisteredUsersDict

{'September': 16012, 'July': 15511, 'June': 14789, 'August': 15517}


In [11]:
RegisteredUsersSeries=pd.Series(RegisteredUsersDict)
print RegisteredUsersSeries

August       15517
July         15511
June         14789
September    16012
dtype: int64
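
The output above lists the keys in alphabetical order. As a hedged side note, recent pandas versions running on Python 3 are expected to keep the insertion order of the dictionary instead. The following sketch (written for Python 3, unlike the cells of this notebook; the variable names are chosen for illustration only) shows how `sort_index()` yields the alphabetical order in either case:

```python
import pandas as pd

# Same dictionary as in the cell above.
registered_users = {"June": 14789, "July": 15511, "August": 15517, "September": 16012}

# Build the series directly from the dictionary; the keys become the index.
users_series = pd.Series(registered_users)
print(users_series)

# sort_index() returns a copy ordered by index label (alphabetical here),
# which matches the order of the output shown above.
users_sorted = users_series.sort_index()
print(users_sorted)
```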


### Accessing Pandas Series data
Accessing single elements:

In [12]:
print S1[1]
print S2["2015"]

20
20.3


A data structure with an explicitly defined index is already available in Python: the _dictionary_. However, pandas Series provide more capabilities, for example slicing over a key range:

In [13]:
print "Slice of S1:\n",S1[1:3] 
print "\nSlice of S2:\n",S2["2015":"2017"]

Slice of S1:
1    20
2    30
dtype: int64

Slice of S2:
2015    20.3
2016    18.7
2017    17.0
dtype: float64


Even though the index has been defined explicitly, slices can still be taken by integer position:

In [14]:
print S2[1:3]

2015    20.3
2016    18.7
dtype: float64


The possibility to access elements both by the explicitly defined index and by the implicit integer position may cause confusion, in particular if the explicitly defined index also contains integers. Therefore it is recommended to access elements by _.loc[]_ and _.iloc[]_. The former provides access by the explicitly defined index and the latter by the implicit integer position:

In [15]:
S2.loc["2015":"2017"]

2015    20.3
2016    18.7
2017    17.0
dtype: float64

In [16]:
S2.iloc[1:3]

2015    20.3
2016    18.7
dtype: float64
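
The difference between the two accessors becomes most visible when the explicit index itself consists of integers. The following is a minimal sketch in Python 3; the series `s` is a hypothetical example and not part of the notebook:

```python
import pandas as pd

# Hypothetical series whose explicit index consists of integers.
s = pd.Series([10, 20, 30], index=[2, 3, 4])

by_label = s.loc[2]       # label 2 -> first element
by_position = s.iloc[2]   # position 2 -> third element
print(by_label, by_position)

# Label slices include the end label, positional slices exclude the end position.
label_slice = s.loc[2:3]
position_slice = s.iloc[2:3]
print(len(label_slice), len(position_slice))
```

The same asymmetry explains why `S2.loc["2015":"2017"]` returns three elements above while `S2.iloc[1:3]` returns only two.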

Masked access:

In [17]:
print S2[(S2<20) & (S2>18)] 

2014    19.5
2016    18.7
dtype: float64


Add a new element to the series:

In [18]:
print "Before: \n",S2
S2["2018"]=17.7
print "\nAfter: \n",S2

Before: 
2014    19.5
2015    20.3
2016    18.7
2017    17.0
dtype: float64

After: 
2014    19.5
2015    20.3
2016    18.7
2017    17.0
2018    17.7
dtype: float64


## Basics of Pandas Dataframes

Pandas Dataframes can be considered to be 2-dimensional NumPy arrays with an explicitly configurable row index and column labels.

### Construction of Pandas Dataframes
Create dataframe from nested Python list:

In [19]:
DF1=pd.DataFrame(data=[[1,2,3],[4,5,6]])
print "Pandas dataframe object:    \n",DF1
print "\nValues of dataframe object: \n",DF1.values
print "\nType of dataframe values:   ",type(DF1.values)
print "Index of dataframe object:  ",DF1.index
print "Type of dataframe index:    ",type(DF1.index)
print "Columns of dataframe object:",DF1.columns

Pandas dataframe object:    
   0  1  2
0  1  2  3
1  4  5  6

Values of dataframe object: 
[[1 2 3]
 [4 5 6]]

Type of dataframe values:    <type 'numpy.ndarray'>
Index of dataframe object:   Int64Index([0, 1], dtype='int64')
Type of dataframe index:     <class 'pandas.core.index.Int64Index'>
Columns of dataframe object: Int64Index([0, 1, 2], dtype='int64')


Create dataframe with explicitly defined index and labeled column names:

In [20]:
DF2=pd.DataFrame(index=["peter","paul","mary"],columns=["gender","age"],data=[["male",23],["male",31],["female",25]])
print "Pandas dataframe object:    \n",DF2
print "\nValues of dataframe object: \n",DF2.values
print "\nType of dataframe values:   ",type(DF2.values)
print "Index of dataframe object:  ",DF2.index
print "Type of dataframe index:    ",type(DF2.index)
print "Columns of dataframe object:",DF2.columns

Pandas dataframe object:    
       gender  age
peter    male   23
paul     male   31
mary   female   25

Values of dataframe object: 
[['male' 23L]
 ['male' 31L]
 ['female' 25L]]

Type of dataframe values:    <type 'numpy.ndarray'>
Index of dataframe object:   Index([u'peter', u'paul', u'mary'], dtype='object')
Type of dataframe index:     <class 'pandas.core.index.Index'>
Columns of dataframe object: Index([u'gender', u'age'], dtype='object')
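
Unlike a plain NumPy array, a dataframe keeps one data type per column. The sketch below (Python 3; the name `people` is chosen for illustration) rebuilds the dataframe from the cell above and inspects the column types. The exact type reported for the text column may depend on the pandas version, so only the integer column and the mixed array are commented on:

```python
import pandas as pd

people = pd.DataFrame(
    index=["peter", "paul", "mary"],
    columns=["gender", "age"],
    data=[["male", 23], ["male", 31], ["female", 25]],
)

# One dtype per column: text for 'gender', integers for 'age'.
print(people.dtypes)

# The combined 2-dimensional array has to hold both kinds of values,
# so it falls back to the generic object dtype.
print(people.values.dtype)
```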


Create dataframe from Numpy array:

In [21]:
import numpy as np
arr= np.random.randint(0,10,(3,7))
DF3=pd.DataFrame(arr)
print "Pandas dataframe object:    \n",DF3
print "\nValues of dataframe object: \n",DF3.values
print "\nType of dataframe values:   ",type(DF3.values)
print "Index of dataframe object:  ",DF3.index
print "Type of dataframe index:    ",type(DF3.index)
print "Columns of dataframe object:",DF3.columns

Pandas dataframe object:    
   0  1  2  3  4  5  6
0  2  5  5  8  1  0  9
1  4  8  5  4  5  6  3
2  8  7  3  6  0  7  6

Values of dataframe object: 
[[2 5 5 8 1 0 9]
 [4 8 5 4 5 6 3]
 [8 7 3 6 0 7 6]]

Type of dataframe values:    <type 'numpy.ndarray'>
Index of dataframe object:   Int64Index([0, 1, 2], dtype='int64')
Type of dataframe index:     <class 'pandas.core.index.Int64Index'>
Columns of dataframe object: Int64Index([0, 1, 2, 3, 4, 5, 6], dtype='int64')


Rename index and columns of dataframe:

In [22]:
DF3.columns=["A","B","C","D","E","F","G"]
DF3.index= ["user1","user2","user3"]
print DF3

       A  B  C  D  E  F  G
user1  2  5  5  8  1  0  9
user2  4  8  5  4  5  6  3
user3  8  7  3  6  0  7  6


Insert a new column into an existing dataframe:

In [23]:
hometown=['new york','san diego','Boston']
DF2["home"]=hometown
print DF2

       gender  age       home
peter    male   23   new york
paul     male   31  san diego
mary   female   25     Boston


In the previous section the Pandas series _RegisteredUsersSeries_ has been defined. From such a series object dataframes can be constructed as follows:

In [24]:
RegisteredUsersDF=pd.DataFrame(RegisteredUsersSeries,columns=["Users"])
print RegisteredUsersDF

           Users
August     15517
July       15511
June       14789
September  16012


Create a dataframe from a list of dictionaries. Note that the dataframe can be built even though not all dictionaries have the same keys; missing entries are filled with _NaN_.

In [25]:
personDict=[{"name":"ben","age":13},{"name":"lucia","age":15},{"name":"lynn","age":11},{"name":"eve"}]
print personDict

[{'age': 13, 'name': 'ben'}, {'age': 15, 'name': 'lucia'}, {'age': 11, 'name': 'lynn'}, {'name': 'eve'}]


In [26]:
personDF=pd.DataFrame(personDict)
print personDF

   age   name
0   13    ben
1   15  lucia
2   11   lynn
3  NaN    eve
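
A short Python 3 sketch of the same construction, assuming a pandas version that provides `isna()`; it shows how the missing entry can be located and that the `age` column becomes a floating-point column because _NaN_ is a float value:

```python
import pandas as pd

persons = [
    {"name": "ben", "age": 13},
    {"name": "lucia", "age": 15},
    {"name": "lynn", "age": 11},
    {"name": "eve"},  # no 'age' key
]
person_df = pd.DataFrame(persons)

# True where the value is missing.
missing_age = person_df["age"].isna()
print(missing_age.tolist())

# The column is stored as float because NaN cannot be held in an integer column.
print(person_df["age"].dtype)
```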


### Accessing Pandas Dataframe data

A single column of a dataframe can be accessed by specifying the name of the column, for example:

In [27]:
print DF2["home"]

peter     new york
paul     san diego
mary        Boston
Name: home, dtype: object


or alternatively:

In [28]:
print DF2.home

peter     new york
paul     san diego
mary        Boston
Name: home, dtype: object


However, __a single row cannot be accessed in this way__:

In [29]:
# DF2["peter"] or DF2[0] raises an error
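
The error mentioned in the comment can be made visible as follows. This is a Python 3 sketch on a small stand-in dataframe (the name `people` is hypothetical); square brackets with a single label look up a column, so a row label leads to a `KeyError`:

```python
import pandas as pd

people = pd.DataFrame(
    {"gender": ["male", "male", "female"], "age": [23, 31, 25]},
    index=["peter", "paul", "mary"],
)

try:
    people["peter"]          # 'peter' is a row label, not a column name
    row_lookup_failed = False
except KeyError:
    row_lookup_failed = True
print(row_lookup_failed)

# Row access by label works with .loc.
row = people.loc["peter"]
print(row["age"])
```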

A possibility to access a single row would be

In [30]:
print DF2["peter":"peter"]

      gender  age      home
peter   male   23  new york


In [31]:
print DF2[0:1]

      gender  age      home
peter   male   23  new york


As already mentioned in the context of Pandas Series objects, this type of indexing is confusing and it is recommended to apply _.loc[]_ and _.iloc[]_ instead.

Access of single row:

In [32]:
print DF2.loc["peter"]

gender        male
age             23
home      new york
Name: peter, dtype: object


or:

In [33]:
print DF2.iloc[0]

gender        male
age             23
home      new york
Name: peter, dtype: object


Access a single element by row and column:

In [34]:
print DF2.loc["peter","home"]

new york


or:

In [35]:
print DF2.iloc[0,2]

new york


Access a subframe of the dataframe:

In [36]:
print DF2.loc[["peter","mary"],["age","home"]]

       age      home
peter   23  new york
mary    25    Boston


or:

In [37]:
print DF2.iloc[[0,2],[1,2]]

       age      home
peter   23  new york
mary    25    Boston


## Time Ranges in Pandas
Data is often associated with date and time stamps. In particular, time-series data consists of a series of uni- or multivariate data instances, each labeled with a unique date-time stamp. In Pandas, date-time ranges can be created with the _date_range()_ function as shown below. The first parameter defines the start of the date-time range, the _periods_ parameter defines the number of instances and the _freq_ parameter defines the interval between successive date-time stamps.

Date-time range with a frequency of 2 days:

In [38]:
dates1 = pd.date_range('20161003', periods=6,freq="2D")
print dates1

DatetimeIndex(['2016-10-03', '2016-10-05', '2016-10-07', '2016-10-09',
               '2016-10-11', '2016-10-13'],
              dtype='datetime64[ns]', freq='2D')


Date-time range with a frequency of 45 minutes:

In [39]:
dates2 = pd.date_range('20161003104500', periods=4,freq="45Min")
print dates2

DatetimeIndex(['2016-10-03 10:45:00', '2016-10-03 11:30:00',
               '2016-10-03 12:15:00', '2016-10-03 13:00:00'],
              dtype='datetime64[ns]', freq='45T')
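
For readers on Python 3 and a recent pandas release, the two ranges can be sketched as below. The frequency alias `"45min"` is used here on the assumption that it is accepted by both older and newer versions; the variable names are illustrative:

```python
import pandas as pd

# Six stamps, two days apart.
every_two_days = pd.date_range("2016-10-03", periods=6, freq="2D")

# Four stamps, 45 minutes apart.
every_45_min = pd.date_range("2016-10-03 10:45:00", periods=4, freq="45min")

# The distance between neighbouring stamps equals the chosen frequency.
step = every_45_min[1] - every_45_min[0]
print(every_two_days[-1], every_45_min[-1], step)
```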


The elements of a date-time range are _timestamps_. Single _Timestamp_ objects can be created as follows:

In [40]:
dec23=pd.Timestamp(pd.datetime(2016,12,23,9,15,0))
jan9=pd.Timestamp(pd.datetime(2017,1,9,10,30,0))
print dec23
print jan9

2016-12-23 09:15:00
2017-01-09 10:30:00


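Operations such as adding a time delta to a given timestamp or calculating the number of days between two timestamps can be performed as shown below. In the pandas version used here, adding the integer 20 to an element of _dates1_ shifts it by 20 multiples of the range's frequency (2 days), i.e. by 40 days: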

In [41]:
print type(dates1[1])
print dates1[1]+20
datediff1=dates1[4]-dates1[1]
print datediff1
datediff2=jan9-dec23
print datediff2
print datediff1 + datediff2

<class 'pandas.tslib.Timestamp'>
2016-11-14 00:00:00
6 days 00:00:00
17 days 01:15:00
23 days 01:15:00
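
Adding a bare integer to a timestamp depends on the frequency attached to it and may not be supported in newer pandas releases. A more explicit variant, sketched here in Python 3 under that assumption, states the offset as a `pd.Timedelta`:

```python
import pandas as pd

dec23 = pd.Timestamp("2016-12-23 09:15:00")
jan9 = pd.Timestamp("2017-01-09 10:30:00")

# Explicit offset: 20 steps of 2 days each, i.e. 40 days.
shifted = pd.Timestamp("2016-10-05") + pd.Timedelta(days=40)

# Subtracting two timestamps yields a Timedelta.
diff = jan9 - dec23
print(shifted, diff, diff.days)
```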


In Pandas, date-time ranges are often used as the index of series or dataframes:

In [42]:
TS1=pd.Series(index=dates1,data=np.arange(10,16))
print TS1

2016-10-03    10
2016-10-05    11
2016-10-07    12
2016-10-09    13
2016-10-11    14
2016-10-13    15
Freq: 2D, dtype: int32


In [43]:
dates2 = pd.date_range('20160929', periods=5,freq="2D")
TS2=pd.Series(index=dates2,data=np.arange(5))
print TS2

2016-09-29    0
2016-10-01    1
2016-10-03    2
2016-10-05    3
2016-10-07    4
Freq: 2D, dtype: int32


In [44]:
dates3 = pd.date_range('20161003', periods=5,freq="10H30MIN")
TDF1=pd.DataFrame(index=dates3,data=np.random.randint(100,200,(5,3)),columns=["C1","C2","C3"])
print TDF1

                      C1   C2   C3
2016-10-03 00:00:00  168  177  113
2016-10-03 10:30:00  118  117  181
2016-10-03 21:00:00  196  126  188
2016-10-04 07:30:00  191  157  107
2016-10-04 18:00:00  199  139  107


In [45]:
dates4 = pd.date_range('20161003110000', periods=10,freq="10S")
TDF2=pd.DataFrame(index=dates4,data=np.random.randint(100,200,(10,3)),columns=["C3","C4","C5"])
print TDF2

                      C3   C4   C5
2016-10-03 11:00:00  149  140  100
2016-10-03 11:00:10  115  148  179
2016-10-03 11:00:20  164  189  171
2016-10-03 11:00:30  148  155  180
2016-10-03 11:00:40  155  195  178
2016-10-03 11:00:50  187  155  163
2016-10-03 11:01:00  187  125  123
2016-10-03 11:01:10  126  106  154
2016-10-03 11:01:20  193  118  161
2016-10-03 11:01:30  143  100  183


## Combining Series and Dataframes
Series and dataframes can be combined by the methods _combine()_ and _combine_first()_. The application of _combine_first()_ is demonstrated below. If two series _S1_ and _S2_ are combined by _S1.combine_first(S2)_, the result is a new Series object whose index is the union of the index elements of _S1_ and _S2_. For each element of the new index the corresponding value is
* the value of _S1_ at this index element, if _S1_ contains a non-missing value there,
* the value of _S2_ at this index element otherwise, i.e. if _S1_ does not contain this index element or holds _NaN_ there.

If two dataframes _DF1_ and _DF2_ are combined by _DF1.combine_first(DF2)_, the result is a new Dataframe object whose index is the union of the index elements of _DF1_ and _DF2_ and whose columns are the union of the columns of _DF1_ and _DF2_. For each position in the resulting dataframe the value is
* the value of _DF1_ at this index/column position, if _DF1_ contains a non-missing value there,
* the value of _DF2_ at this index/column position, if _DF1_ has no value there but _DF2_ does,
* _NaN_ if neither _DF1_ nor _DF2_ has a value at this index/column position.
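
The rules above can be checked on a deliberately tiny example. The following Python 3 sketch uses hypothetical objects `s1`, `s2`, `df1` and `df2` that are unrelated to the variables of this notebook:

```python
import numpy as np
import pandas as pd

# Series: s1 has a gap at 'b' and no entry for 'd'.
s1 = pd.Series([10.0, np.nan, 12.0], index=["a", "b", "c"])
s2 = pd.Series([0.0, 1.0, 2.0], index=["b", "c", "d"])
combined = s1.combine_first(s2)
print(combined)

# Dataframes with different columns and partly different rows.
df1 = pd.DataFrame({"C1": [1.0, 2.0]}, index=["x", "y"])
df2 = pd.DataFrame({"C2": [5.0, 6.0]}, index=["y", "z"])
combined_df = df1.combine_first(df2)
print(combined_df)
```

In `combined`, the value at `b` comes from `s2` because `s1` holds _NaN_ there, while the value at `c` stays that of `s1`. In `combined_df`, positions that neither input covers, such as row `x` in column `C2`, remain _NaN_.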

In [46]:
print "\nTime Series TS1=\n",TS1
print "\nTime Series TS2=\n",TS2
TS3=TS1.combine_first(TS2)
print "\nCombination of TS1 and TS2 is TS3=\n",TS3
print "-"*30
print "\nDataframe TDF1=\n",TDF1
print "\nDataframe TDF2=\n",TDF2
TSDF1=TDF2.combine_first(TDF1)
print "\nCombination of TDF1 and TDF2 is TSDF1=\n",TSDF1


Time Series TS1=
2016-10-03    10
2016-10-05    11
2016-10-07    12
2016-10-09    13
2016-10-11    14
2016-10-13    15
Freq: 2D, dtype: int32

Time Series TS2=
2016-09-29    0
2016-10-01    1
2016-10-03    2
2016-10-05    3
2016-10-07    4
Freq: 2D, dtype: int32

Combination of TS1 and TS2 is TS3=
2016-09-29     0
2016-10-01     1
2016-10-03    10
2016-10-05    11
2016-10-07    12
2016-10-09    13
2016-10-11    14
2016-10-13    15
Freq: 2D, dtype: float64
------------------------------

Dataframe TDF1=
                      C1   C2   C3
2016-10-03 00:00:00  168  177  113
2016-10-03 10:30:00  118  117  181
2016-10-03 21:00:00  196  126  188
2016-10-04 07:30:00  191  157  107
2016-10-04 18:00:00  199  139  107

Dataframe TDF2=
                      C3   C4   C5
2016-10-03 11:00:00  149  140  100
2016-10-03 11:00:10  115  148  179
2016-10-03 11:00:20  164  189  171
2016-10-03 11:00:30  148  155  180
2016-10-03 11:00:40  155  195  178
2016-10-03 11:00:50  187  155  163
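2016-10-03 11:01:00  187  125  123
2016-10-03 11:01:10  126  106  154
2016-10-03 11:01:20  193  118  161
2016-10-03 11:01:30  143  100  183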

## Handling of Missing Data
One of the main features of _Pandas_ is its capability to manage missing data. As shown in the example above, Series and Dataframe elements without assigned data are represented by _NaN_, which is actually NumPy's _numpy.nan_ value. A boolean mask that identifies all dataframe elements with missing data (NaN values) can be calculated as follows. It contains _True_ at all _NaN_ positions.

In [47]:
print TSDF1.isnull()

                        C1     C2     C3     C4     C5
2016-10-03 00:00:00  False  False  False   True   True
2016-10-03 10:30:00  False  False  False   True   True
2016-10-03 11:00:00   True   True  False  False  False
2016-10-03 11:00:10   True   True  False  False  False
2016-10-03 11:00:20   True   True  False  False  False
2016-10-03 11:00:30   True   True  False  False  False
2016-10-03 11:00:40   True   True  False  False  False
2016-10-03 11:00:50   True   True  False  False  False
2016-10-03 11:01:00   True   True  False  False  False
2016-10-03 11:01:10   True   True  False  False  False
2016-10-03 11:01:20   True   True  False  False  False
2016-10-03 11:01:30   True   True  False  False  False
2016-10-03 21:00:00  False  False  False   True   True
2016-10-04 07:30:00  False  False  False   True   True
2016-10-04 18:00:00  False  False  False   True   True
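
Because _True_ counts as 1 in a sum, the mask can also be used to count missing values. A minimal Python 3 sketch on a hypothetical dataframe `df`, assuming a pandas version that offers `isna()` as an alias of `isnull()`:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "C1": [1.0, np.nan, 3.0],
    "C2": [np.nan, np.nan, 6.0],
    "C3": [7.0, 8.0, 9.0],
})

# Number of missing values per column.
missing_per_column = df.isna().sum()
print(missing_per_column)

# Total number of missing values in the dataframe.
missing_total = int(missing_per_column.sum())
print(missing_total)
```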


In combination with _all()_ and _any()_, the _isnull()_ method can also be used to check, per column (_axis=0_) or per row (_axis=1_), whether all elements or at least one element is missing.

In [48]:
print "Are all elements in a column of TSDF1 NaN?\n",TSDF1.isnull().all(axis=0)
print "\nIs any element in a column of TSDF1 NaN?\n",TSDF1.isnull().any(axis=0)
print "\nAre all elements in a row of TSDF1 NaN?\n",TSDF1.isnull().all(axis=1)
print "\nIs any element in a row of TSDF1 NaN?\n",TSDF1.isnull().any(axis=1)

Are all elements in a column of TSDF1 NaN?
C1    False
C2    False
C3    False
C4    False
C5    False
dtype: bool

Is any element in a column of TSDF1 NaN?
C1     True
C2     True
C3    False
C4     True
C5     True
dtype: bool

Are all elements in a row of TSDF1 NaN?
2016-10-03 00:00:00    False
2016-10-03 10:30:00    False
2016-10-03 11:00:00    False
2016-10-03 11:00:10    False
2016-10-03 11:00:20    False
2016-10-03 11:00:30    False
2016-10-03 11:00:40    False
2016-10-03 11:00:50    False
2016-10-03 11:01:00    False
2016-10-03 11:01:10    False
2016-10-03 11:01:20    False
2016-10-03 11:01:30    False
2016-10-03 21:00:00    False
2016-10-04 07:30:00    False
2016-10-04 18:00:00    False
dtype: bool

Is any element in a row of TSDF1 NaN?
2016-10-03 00:00:00    True
2016-10-03 10:30:00    True
2016-10-03 11:00:00    True
2016-10-03 11:00:10    True
2016-10-03 11:00:20    True
2016-10-03 11:00:30    True
2016-10-03 11:00:40    True
2016-10-03 11:00:50    True
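2016-10-03 11:01:00    True
2016-10-03 11:01:10    True
2016-10-03 11:01:20    True
2016-10-03 11:01:30    True
2016-10-03 21:00:00    True
2016-10-04 07:30:00    True
2016-10-04 18:00:00    True
dtype: bool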

The _dropna()_-method can be applied to drop all columns or rows, in which at least one or in which all elements are NaN. The following use of _dropna()_ drops all columns, which contain at least one _NaN_.

In [49]:
TSDF2=TSDF1.copy()
print TSDF2.dropna(axis=1,how="any")

                      C3
2016-10-03 00:00:00  113
2016-10-03 10:30:00  181
2016-10-03 11:00:00  149
2016-10-03 11:00:10  115
2016-10-03 11:00:20  164
2016-10-03 11:00:30  148
2016-10-03 11:00:40  155
2016-10-03 11:00:50  187
2016-10-03 11:01:00  187
2016-10-03 11:01:10  126
2016-10-03 11:01:20  193
2016-10-03 11:01:30  143
2016-10-03 21:00:00  188
2016-10-04 07:30:00  107
2016-10-04 18:00:00  107
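
The effect of the _how_ and _axis_ parameters is easier to see on a small example. The sketch below is written in Python 3 and uses a hypothetical dataframe `df` in which one row is entirely missing:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"C1": [1.0, np.nan, np.nan], "C2": [2.0, np.nan, 5.0]},
    index=["r0", "r1", "r2"],
)

# Drop rows that contain at least one NaN: only r0 survives.
rows_any = df.dropna(axis=0, how="any")

# Drop rows in which all values are NaN: only r1 is removed.
rows_all = df.dropna(axis=0, how="all")

# Drop columns that contain at least one NaN: no column survives here.
cols_any = df.dropna(axis=1, how="any")

print(list(rows_any.index), list(rows_all.index), list(cols_any.columns))
```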


_NaN_-values can be replaced by any other value using the _fillna()_ method:

In [50]:
TSDF3=TSDF1.fillna(value=0.0)
print TSDF3

                      C1   C2   C3   C4   C5
2016-10-03 00:00:00  168  177  113    0    0
2016-10-03 10:30:00  118  117  181    0    0
2016-10-03 11:00:00    0    0  149  140  100
2016-10-03 11:00:10    0    0  115  148  179
2016-10-03 11:00:20    0    0  164  189  171
2016-10-03 11:00:30    0    0  148  155  180
2016-10-03 11:00:40    0    0  155  195  178
2016-10-03 11:00:50    0    0  187  155  163
2016-10-03 11:01:00    0    0  187  125  123
2016-10-03 11:01:10    0    0  126  106  154
2016-10-03 11:01:20    0    0  193  118  161
2016-10-03 11:01:30    0    0  143  100  183
2016-10-03 21:00:00  196  126  188    0    0
2016-10-04 07:30:00  191  157  107    0    0
2016-10-04 18:00:00  199  139  107    0    0
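
A constant is not the only possible replacement. As a hedged illustration in Python 3 (the dataframe `df` is hypothetical), missing values can also be replaced by the column mean or by the last valid value above them:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"C1": [1.0, np.nan, 3.0], "C2": [np.nan, 4.0, np.nan]})

# Replace every NaN by a constant.
filled_zero = df.fillna(0.0)

# Replace NaN by the mean of the respective column.
filled_mean = df.fillna(df.mean())

# Propagate the last valid value downwards; a leading NaN stays NaN.
filled_forward = df.ffill()

print(filled_zero, filled_mean, filled_forward, sep="\n\n")
```

Which strategy is appropriate depends on the data; filling with 0 as in the cell above is convenient here but would distort statistics such as column means.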


## Split, Concatenate and Join
Pandas dataframes can be split into parts using the common slicing approach as shown below.

In [51]:
parts=[TSDF3[:2],TSDF3[2:12],TSDF3[12:]]
for i,p in enumerate(parts):
    print "\nPart %1d\n"%i,p


Part 0
                      C1   C2   C3  C4  C5
2016-10-03 00:00:00  168  177  113   0   0
2016-10-03 10:30:00  118  117  181   0   0

Part 1
                     C1  C2   C3   C4   C5
2016-10-03 11:00:00   0   0  149  140  100
2016-10-03 11:00:10   0   0  115  148  179
2016-10-03 11:00:20   0   0  164  189  171
2016-10-03 11:00:30   0   0  148  155  180
2016-10-03 11:00:40   0   0  155  195  178
2016-10-03 11:00:50   0   0  187  155  163
2016-10-03 11:01:00   0   0  187  125  123
2016-10-03 11:01:10   0   0  126  106  154
2016-10-03 11:01:20   0   0  193  118  161
2016-10-03 11:01:30   0   0  143  100  183

Part 2
                      C1   C2   C3  C4  C5
2016-10-03 21:00:00  196  126  188   0   0
2016-10-04 07:30:00  191  157  107   0   0
2016-10-04 18:00:00  199  139  107   0   0


Several dataframes can be concatenated. To do this, the parts are collected in a Python list, which is passed to the Pandas function _concat()_. Vertical concatenation is obtained by setting the parameter _axis=0_; for horizontal concatenation this parameter must be _1_.

In [52]:
All=pd.concat(parts,axis=0)
print All

                      C1   C2   C3   C4   C5
2016-10-03 00:00:00  168  177  113    0    0
2016-10-03 10:30:00  118  117  181    0    0
2016-10-03 11:00:00    0    0  149  140  100
2016-10-03 11:00:10    0    0  115  148  179
2016-10-03 11:00:20    0    0  164  189  171
2016-10-03 11:00:30    0    0  148  155  180
2016-10-03 11:00:40    0    0  155  195  178
2016-10-03 11:00:50    0    0  187  155  163
2016-10-03 11:01:00    0    0  187  125  123
2016-10-03 11:01:10    0    0  126  106  154
2016-10-03 11:01:20    0    0  193  118  161
2016-10-03 11:01:30    0    0  143  100  183
2016-10-03 21:00:00  196  126  188    0    0
2016-10-04 07:30:00  191  157  107    0    0
2016-10-04 18:00:00  199  139  107    0    0
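
Horizontal concatenation is not shown in the cell above. The following Python 3 sketch contrasts both directions on two hypothetical dataframes, `left` and `right`, whose row labels only partly overlap:

```python
import pandas as pd

left = pd.DataFrame({"C1": [1, 2]}, index=["a", "b"])
right = pd.DataFrame({"C2": [3, 4]}, index=["b", "c"])

# axis=0: rows are stacked; columns missing in one part are filled with NaN.
stacked = pd.concat([left, right], axis=0)

# axis=1: columns are placed side by side; rows are aligned on the index.
side_by_side = pd.concat([left, right], axis=1)

print(stacked)
print(side_by_side)
```

With `axis=1` only row `b` has values in both columns, because it is the only label present in both inputs.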


SQL-style joins can be performed on pandas dataframes with the _merge()_ function. Its _on_ parameter takes a list whose elements are the keys on which the join is performed. The keys must be column names that exist in both dataframes. The _how_ parameter specifies the type of join, which defines how the new dataframe is built when some key combinations do not exist in both dataframes:
* _inner_: The joined dataframe contains only rows whose key combinations exist in both dataframes.
* _outer_: The joined dataframe contains all rows whose key combinations exist in the left, the right or both dataframes.
* _left_: The joined dataframe contains all rows whose key combinations exist in the left dataframe.
* _right_: The joined dataframe contains all rows whose key combinations exist in the right dataframe.

Examples of all join-types are demonstrated below.

In [53]:
group1=pd.DataFrame({"firstname":["peter","paul","mary"],"familyname":["aman","bman","cman"],"age":[21,18,23],"gender":["m","m","f"]})
print group1
group2=pd.DataFrame({"firstname":["peter","paul","mary"],"familyname":["aman","bman","dman"],"home":["new york","boston","florida"],"phone":[1234,4789,9856]})
print group2

   age familyname firstname gender
0   21       aman     peter      m
1   18       bman      paul      m
2   23       cman      mary      f
  familyname firstname      home  phone
0       aman     peter  new york   1234
1       bman      paul    boston   4789
2       dman      mary   florida   9856


In [54]:
groupInner=pd.merge(group1, group2, on=['firstname','familyname'],how="inner")
print groupInner

   age familyname firstname gender      home  phone
0   21       aman     peter      m  new york   1234
1   18       bman      paul      m    boston   4789


In [55]:
groupOuter=pd.merge(group1, group2, on=['firstname','familyname'],how="outer")
print groupOuter

   age familyname firstname gender      home  phone
0   21       aman     peter      m  new york   1234
1   18       bman      paul      m    boston   4789
2   23       cman      mary      f       NaN    NaN
3  NaN       dman      mary    NaN   florida   9856


In [56]:
groupLeft=pd.merge(group1, group2, on=['firstname','familyname'],how="left")
print groupLeft

   age familyname firstname gender      home  phone
0   21       aman     peter      m  new york   1234
1   18       bman      paul      m    boston   4789
2   23       cman      mary      f       NaN    NaN


In [57]:
groupRight=pd.merge(group1, group2, on=['firstname','familyname'],how="right")
print groupRight

   age familyname firstname gender      home  phone
0   21       aman     peter      m  new york   1234
1   18       bman      paul      m    boston   4789
2  NaN       dman      mary    NaN   florida   9856
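
To see which input each row of a join comes from, `merge()` accepts an `indicator` argument that adds a `_merge` column. The sketch below (Python 3) rebuilds reduced versions of `group1` and `group2`; the dictionary `origin` is only a helper for this illustration:

```python
import pandas as pd

group1 = pd.DataFrame({
    "firstname": ["peter", "paul", "mary"],
    "familyname": ["aman", "bman", "cman"],
    "age": [21, 18, 23],
})
group2 = pd.DataFrame({
    "firstname": ["peter", "paul", "mary"],
    "familyname": ["aman", "bman", "dman"],
    "phone": [1234, 4789, 9856],
})

# Outer join with an extra '_merge' column: 'both', 'left_only' or 'right_only'.
merged = pd.merge(group1, group2, on=["firstname", "familyname"],
                  how="outer", indicator=True)
print(merged)

# Map each key combination to the origin of its row.
origin = {
    (first, family): str(source)
    for first, family, source in zip(
        merged["firstname"], merged["familyname"], merged["_merge"]
    )
}
print(origin)
```

Rows marked `left_only` are exactly those that a right join would drop, and vice versa.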


## Simple Operations on Pandas Series and Dataframes

### Python operations

In [58]:
print DF1
DF2=3*DF1
print DF2
print DF1.add(DF2)

   0  1  2
0  1  2  3
1  4  5  6
    0   1   2
0   3   6   9
1  12  15  18
    0   1   2
0   4   8  12
1  16  20  24
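
In the cell above both operands share the same index and columns. When they do not, pandas aligns the operands by label before computing. A minimal Python 3 sketch with two hypothetical series `a` and `b`:

```python
import pandas as pd

a = pd.Series([1, 2, 3], index=["x", "y", "z"])
b = pd.Series([10, 20], index=["y", "z"])

# Labels are aligned first; 'x' has no partner in b, so the result is NaN there.
plain_sum = a + b

# add() with fill_value treats the missing partner as 0 instead.
filled_sum = a.add(b, fill_value=0)

print(plain_sum)
print(filled_sum)
```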


### Numpy operations on Pandas Series and Dataframes 

In [59]:
print S1

0    10
1    20
2    30
3    40
dtype: int64


In [60]:
print DF1

   0  1  2
0  1  2  3
1  4  5  6


In [61]:
print "-"*40
print "\nlog2 of series values:\n",np.log2(S1)
print "\nlog2 of dataframe values:\n",np.log2(DF1)

print "-"*40
print "\nSine of series values:\n",np.sin(S1)
print "\nSine of dataframe values:\n",np.sin(DF1)

print "-"*40
print "\nSecond power of series values:\n",np.power(S1,2)
print "\nSecond power of dataframe values:\n",np.power(DF1,2)

----------------------------------------

log2 of series values:
0    3.321928
1    4.321928
2    4.906891
3    5.321928
dtype: float64

log2 of dataframe values:
   0         1         2
0  0  1.000000  1.584963
1  2  2.321928  2.584963
----------------------------------------

Sine of series values:
0   -0.544021
1    0.912945
2   -0.988032
3    0.745113
dtype: float64

Sine of dataframe values:
          0         1         2
0  0.841471  0.909297  0.141120
1 -0.756802 -0.958924 -0.279415
----------------------------------------

Second power of series values:
0     100
1     400
2     900
3    1600
dtype: int64

Second power of dataframe values:
    0   1   2
0   1   4   9
1  16  25  36


In [62]:
print "Which series values are in the specified range:\n",TS2.between(3,8)
print "\nMaximum value in series:\n",TS2.max()
print "\nIndex of maximum value\n",TS2.idxmax()

Which series values are in the specified range:
2016-09-29    False
2016-10-01    False
2016-10-03    False
2016-10-05     True
2016-10-07     True
Freq: 2D, dtype: bool

Maximum value in series:
4

Index of maximum value
2016-10-07 00:00:00
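
Whether the label or the integer position of the maximum is wanted should be stated explicitly, since the two differ for a labeled index. The following Python 3 sketch assumes a pandas version that provides `to_numpy()`; the names are illustrative:

```python
import numpy as np
import pandas as pd

dates = pd.date_range("2016-09-29", periods=5, freq="2D")
ts = pd.Series(np.arange(5), index=dates)

# Label (here: timestamp) at which the maximum occurs.
label_of_max = ts.idxmax()

# Integer position of the maximum, computed on the underlying NumPy array.
position_of_max = int(np.argmax(ts.to_numpy()))

# between() includes both boundaries.
in_range = ts.between(3, 8)

print(label_of_max, position_of_max)
print(in_range)
```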


## Read from and Write to Files

### CSV File IO

In [80]:
print groupRight

   age familyname firstname gender      home  phone
0   21       aman     peter      m  new york   1234
1   18       bman      paul      m    boston   4789
2  NaN       dman      mary    NaN   florida   9856


In [81]:
groupRight.to_csv("groupRight.csv",sep=",",encoding="utf-8")

In [82]:
newgroupRight=pd.read_csv("groupRight.csv",sep=",",encoding="utf-8",index_col=0)

In [83]:
print newgroupRight

   age familyname firstname gender      home  phone
0   21       aman     peter      m  new york   1234
1   18       bman      paul      m    boston   4789
2  NaN       dman      mary    NaN   florida   9856
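
The role of `index_col=0` in the round trip can be illustrated without touching the file system. The Python 3 sketch below writes to an in-memory text buffer instead of a file; the dataframe `group` is a reduced, hypothetical stand-in for `groupRight`:

```python
import io

import pandas as pd

group = pd.DataFrame({"firstname": ["peter", "paul", "mary"],
                      "age": [21.0, 18.0, None]})

# Write the dataframe, including its index, as CSV text.
buffer = io.StringIO()
group.to_csv(buffer, sep=",")
csv_text = buffer.getvalue()

# With index_col=0 the first CSV column is restored as the index.
with_index = pd.read_csv(io.StringIO(csv_text), sep=",", index_col=0)

# Without it, the stored index shows up as an extra, unnamed data column.
without_index = pd.read_csv(io.StringIO(csv_text), sep=",")

print(list(with_index.columns))
print(list(without_index.columns))
```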


### Write Data to SQLite
In order to write to and read from databases a *sqlalchemy*-engine must be created, which provides the connection to the database:  

In [84]:
from sqlalchemy import create_engine # database connection
disk_engine = create_engine('sqlite:///dmTutorial.db')
from IPython.display import display

In the following code snippet, data is read from a .csv file into a Pandas dataframe and written from the dataframe to the database. Note that it is not necessary to import the entire .csv file at once. Instead, the user can define the size of the chunks that are read into the dataframe. Importing and processing data in chunks is recommended for very large amounts of data. The file in the example below is very small, so chunk-by-chunk processing is not necessary here; the example merely demonstrates how it works.

In [85]:
csvfilename="groupRight.csv"
tablename=csvfilename[:-4]
print "Name of table is: ",tablename
chunksize = 2
index_start = 1
for df in pd.read_csv(csvfilename, chunksize=chunksize, iterator=True, encoding='utf-8'):
    df.index += index_start
    df.to_sql(tablename, disk_engine, if_exists='append')
    index_start = df.index[-1] + 1

Name of table is:  groupRight


### Access Data from SQLite 

In [86]:
df = pd.read_sql_query('SELECT * FROM groupRight', disk_engine)
display(df)

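| | index | Unnamed: 0 | age | familyname | firstname | gender | home | phone |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 21.0 | aman | peter | m | new york | 1234 |
| 1 | 2 | 1 | 18.0 | bman | paul | m | boston | 4789 |
| 2 | 3 | 2 | | dman | mary | | florida | 9856 |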


In [88]:
dfq = pd.read_sql_query('SELECT age,firstname,familyname FROM groupRight WHERE age >20', disk_engine)
display(dfq.head())

| | age | firstname | familyname |
|---|---|---|---|
| 0 | 21 | peter | aman |
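
The chunked write and the subsequent query can also be tried without a database file and without SQLAlchemy. The Python 3 sketch below is a variant, not the setup used in this notebook: it assumes that pandas accepts a plain `sqlite3` connection, uses an in-memory database, and the CSV text and the table name `people` are made up for the illustration:

```python
import io
import sqlite3

import pandas as pd

# Small CSV input held in memory; the last row has no age.
csv_text = "firstname,familyname,age\npeter,aman,21\npaul,bman,18\nmary,dman,\n"

# In-memory SQLite database (assumption: used in place of the SQLAlchemy engine).
connection = sqlite3.connect(":memory:")

# Read the CSV in chunks of two rows and append each chunk to the table.
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2):
    chunk.to_sql("people", connection, if_exists="append", index=False)

# Parameterised query: the threshold is passed separately from the SQL text.
adults = pd.read_sql_query(
    "SELECT firstname, age FROM people WHERE age > ?", connection, params=(20,)
)
row_count = pd.read_sql_query(
    "SELECT COUNT(*) AS n FROM people", connection
)["n"].iloc[0]
connection.close()

print(adults)
print(row_count)
```

Passing `index=False` to `to_sql()` keeps the dataframe index out of the table, which avoids extra columns such as the `index` column visible in the query result above.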
