# Introduction to Pandas

pandas is a high-performance, open-source library for data analysis in Python. Wes McKinney began developing it in 2008.

It has the following key features:

* Data structures with labeled axes. Rows and columns can be referred to by integer position or by name.

* Data structures that handle both time series and non-time series data. Indexing on datetime fields is built into Pandas and supports querying by date range or by partial date strings.

* Easy merging of data from various sources. In most data analysis scenarios we have several datasets (perhaps one per region, country or year) that need to be merged for a comprehensive analysis.

* Ability to handle missing data. It is common for a dataset to have missing entries, incorrect data (e.g. string entries in an otherwise numeric column) or wrong entries (e.g. mislabelled data). Pandas makes it easy to handle such cases.

* Ability to filter, slice, group, reorder and reshape datasets. Coupled with the tabular data structure and indexes, this makes Pandas very powerful for data analysis.

* Ability to pivot data. 

* Arithmetic operations and reductions, which make Pandas very useful for scientific computing. Easy integration with libraries such as NumPy and SciPy makes it an essential tool for scientific computing and machine learning.

Pandas is built on top of NumPy, so it inherits much of NumPy's performance, especially for numerical and scientific computing.
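As a rough illustration of three of the features listed above (date-based indexing, missing data handling and merging), here is a minimal sketch. The data, column names and variable names are made up for illustration and are not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical daily readings with one missing value
dates = pd.date_range("2020-01-30", periods=4, freq="D")
readings = pd.Series([1.0, np.nan, 3.0, 4.0], index=dates)

# Partial date string: select every row that falls in February 2020
feb = readings.loc["2020-02"]

# Missing data: replace NaN with a default value
filled = readings.fillna(0.0)

# Merging: combine two small tables on a shared key column
left = pd.DataFrame({"region": ["east", "west"], "sales": [10, 20]})
right = pd.DataFrame({"region": ["east", "west"], "stores": [3, 5]})
merged = pd.merge(left, right, on="region")

print(feb)
print(filled)
print(merged)
```

Each of these topics has many more options; the sketch only shows the simplest call for each.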

# Pandas Data Structures

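Pandas has historically provided three data structures:

* Series    - a 1D NumPy array with an index
* DataFrame - an Excel-like tabular data structure. It's a 2D array with Series as columns, sharing a common index.
* Panel     - 3D tables (deprecated in pandas 0.20 and removed in pandas 0.25, so current releases provide only Series and DataFrame)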

Indexes are sequences of labels. They are immutable, and each index has a single data type.
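To make the immutability point concrete, the following minimal sketch tries to change one label in place and then replaces the whole index instead. The variable names are illustrative only.

```python
import pandas as pd

labels = pd.Index(["a", "b", "c"])

# Index objects do not support item assignment
try:
    labels[0] = "z"
    modified = True
except TypeError:
    modified = False

print(modified)

# The index of a Series can still be replaced as a whole with a new one
s = pd.Series([1, 2, 3], index=labels)
s.index = ["x", "y", "z"]
print(list(s.index))
```

Replacing the index creates a new Index object; the original `labels` object is left unchanged.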

Let's start by importing pandas and numpy, and then look at the two primary data structures: Series and DataFrame.

In [1]:
# Import our two essential libraries - numpy and pandas
import pandas as pd
import numpy as np


# Pandas Series

A Series is a one-dimensional array-like object containing an array of data (of any NumPy data type) and an associated array of data labels, called its index.

Another way to think about a Series is as a fixed-length, ordered dict, as it is a mapping
of index values to data values.
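The dict analogy can be sketched as follows. The ticker symbols and prices are hypothetical values chosen only for this example.

```python
import pandas as pd

# Hypothetical prices keyed by ticker symbol
prices_by_ticker = pd.Series({"AAA": 10.0, "BBB": 12.5, "CCC": 9.0})

has_bbb = "BBB" in prices_by_ticker      # membership test checks the index labels
bbb_price = prices_by_ticker["BBB"]      # lookup by label, like a dict key
as_dict = prices_by_ticker.to_dict()     # convert back to a plain dict

print(has_bbb, bbb_price)
print(as_dict)
```

Unlike a plain dict, the Series also supports vectorised arithmetic and positional slicing.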

In [2]:
prices = [10, 10.5, 12, 11.5, 9]
stockPrices = pd.Series(prices)
print(stockPrices)

0    10.0
1    10.5
2    12.0
3    11.5
4     9.0
dtype: float64


The string representation of a Series shows the index on the left and the values on the right. The default index consists of the integers 0 to N - 1 where N is the length of the data.

Next, we will look at two attributes of the Series object: values and index.


In [3]:
# Let's get the values of the series object
stockPrices.values

array([ 10. ,  10.5,  12. ,  11.5,   9. ])

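Note that the Series above was printed with the default numeric index.

In [7]:
# Let's get some more information about the index for this series.
# This cell was run after the labeled index was assigned in In [4] below,
# so the output shows the day names instead of the default integers.
stockPrices.index

Index(['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday'], dtype='object', name='Day')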


To create a Series with an index identifying each data point, we can specify an index array.

In [4]:
# Now instead of the default numeric index, we specify an index of labels for each day
stockPrices = pd.Series(prices, index=['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday'])
stockPrices.index.name = 'Day'
print(stockPrices)

Day
Monday       10.0
Tuesday      10.5
Wednesday    12.0
Thursday     11.5
Friday        9.0
dtype: float64
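With a labeled index in place, individual values can be read either by label or by position. The sketch below rebuilds the same `stockPrices` Series so that it runs on its own; `by_label`, `by_position` and `midweek` are names introduced here for illustration.

```python
import pandas as pd

prices = [10, 10.5, 12, 11.5, 9]
days = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday']
stockPrices = pd.Series(prices, index=days)

by_label = stockPrices.loc['Wednesday']           # label-based lookup
by_position = stockPrices.iloc[2]                 # position-based lookup
midweek = stockPrices.loc['Tuesday':'Thursday']   # a label slice includes both ends

print(by_label, by_position)
print(midweek)
```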


Let's see how we can slice a Pandas Series:

In [40]:
data = np.arange(5.)
s = pd.Series(data)

print ("\n ======== Pandas Series =========== \n")
print(s)

# Using array slicing notation on Series
# Selecting the first 3 items in the series
print("\n Slicing of numpy array: \n", data[:3])
print("\n Slicing of pandas series: \n", s[:3])

# Selecting the 2nd and 3rd items (positions 1 and 2; the end position is excluded)
print("\n Selecting elements from numpy array: \n", data[1:3])
print("\n Selecting elements from pandas series: \n",s[1:3])



0    0
1    1
2    2
3    3
4    4
dtype: float64

 Slicing of numpy array: 
 [ 0.  1.  2.]

 Slicing of pandas series: 
 0    0
1    1
2    2
dtype: float64

 Selecting elements from numpy array: 
 [ 1.  2.]

 Selecting elements from pandas series: 
 1    1
2    2
dtype: float64
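When the index itself consists of integers, it can be unclear whether a slice refers to positions or labels. A minimal sketch, assuming the same five-element Series as above, that makes the choice explicit with `iloc` and `loc`:

```python
import numpy as np
import pandas as pd

s = pd.Series(np.arange(5.))

positional = s.iloc[1:3]   # positions 1 and 2; the end position is excluded
by_label = s.loc[1:3]      # labels 1, 2 and 3; the end label is included

print(positional)
print(by_label)
```

Using `iloc` or `loc` explicitly avoids relying on how plain `s[...]` interprets integer slices.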


In [33]:
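# Pandas Series with a labeled index
s = pd.Series(np.arange(5.), index=['a', 'b','c','d','e'])
s.index.name = 'idx'   # name the index; it appears as 'idx' in the printed output
print ("\n ======== Pandas Series =========== \n", s)


 ======== Pandas Series =========== 
 idx
a    0
b    1
c    2
d    3
e    4
dtype: float64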


In [20]:
# Applying a calculation to all the series elements
s = s * 5
print(s)

idx
a     0
b     5
c    10
d    15
e    20
dtype: float64


In [28]:
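# Selecting values by index label; labels missing from the index ('f') come back as NaN.
# reindex is used because s[[...]] with a missing label raises KeyError in pandas 1.0+.
# The output was produced after In [21] (next cell) had run, hence c=100 and e=200.
print(s.reindex(['b','c','e','f']))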

idx
b      5
c    100
e    200
f    NaN
dtype: float64


In [21]:
# Applying a calculation to some of the elements using index
evenidx = s[s %2 == 0].index
s[evenidx] = s[evenidx]*10
print(s)

idx
a      0
b      5
c    100
d     15
e    200
dtype: float64
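The same update can also be written with a boolean mask and `loc`, without extracting the index first. This is a sketch that rebuilds the Series (values 0, 5, 10, 15, 20) so it runs on its own; `mask` is a name introduced here.

```python
import numpy as np
import pandas as pd

s = pd.Series(np.arange(5.) * 5, index=['a', 'b', 'c', 'd', 'e'])

# Boolean mask that is True where the value is even
mask = s % 2 == 0

# Update only the selected elements in place
s.loc[mask] = s.loc[mask] * 10
print(s)
```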


In [23]:
# Re-indexing the pandas series object
s2 = s.reindex(['a','b','d','c','e'])
print(s2)

idx
a      0
b      5
d     15
c    100
e    200
dtype: float64


In [24]:
# Reindexing elements and filling in missing values
s3 = s.reindex(['a','b','d','c','e','f','g'], fill_value=0)
print(s3)

idx
a      0
b      5
d     15
c    100
e    200
f      0
g      0
dtype: float64


In [70]:
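# Series arithmetic: values are aligned on index labels before the operation,
# so labels present in only one Series produce NaN unless a fill value is given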
s1 = pd.Series([7.3, -2.5, 3.4, 1.5], index=['a', 'c', 'd', 'e'])
s2 = pd.Series([-2.1, 3.6, -1.5, 4, 3.1], index=['a', 'c', 'e', 'f', 'g'])

print("\n S1:\n{0} \n S2:\n{1}".format(s1,s2))
print("Addition without fill values")
print(s1+s2)

print("With fillvalue=0 specified")
print (s1.add(s2, fill_value=0))


 S1:
a    7.3
c   -2.5
d    3.4
e    1.5
dtype: float64 
 S2:
a   -2.1
c    3.6
e   -1.5
f    4.0
g    3.1
dtype: float64
Addition without fill values
a    5.2
c    1.1
d    NaN
e    0.0
f    NaN
g    NaN
dtype: float64
With fillvalue=0 specified
a    5.2
c    1.1
d    3.4
e    0.0
f    4.0
g    3.1
dtype: float64


### Applying functions on a series

In [25]:
from collections import Counter

s = pd.Series([1,2,3,2,1,3,4,5,3,2,1])

# Applying a function on the entire series
cnt = Counter(s)
print("\nElement Count: \n", cnt)


# Applying a function on each element of the series
# map works element-wise on a Series
print ("\n Transforming a series using element wise processing")
print (s.map(lambda x: x if (x < 2 or x > 5) else 0) )

# apply with a scalar function also works element-wise and gives the same result
print (s.apply(lambda x: x if (x < 2 or x > 5) else 0) )



Element Count: 
 Counter({1: 3, 2: 3, 3: 3, 4: 1, 5: 1})

 Transforming a series using element wise processing
0     1
1     0
2     0
3     0
4     1
5     0
6     0
7     0
8     0
9     0
10    1
dtype: int64
0     1
1     0
2     0
3     0
4     1
5     0
6     0
7     0
8     0
9     0
10    1
dtype: int64
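For simple cases like these, Pandas also offers built-in, vectorised alternatives to `Counter` and `map`. The sketch below uses `value_counts` and `where` on the same data; it is one possible approach, not the only one.

```python
import pandas as pd

s = pd.Series([1, 2, 3, 2, 1, 3, 4, 5, 3, 2, 1])

# Count occurrences of each value, ordered by the value itself
counts = s.value_counts().sort_index()

# Keep values where the condition holds, replace the rest with 0
transformed = s.where((s < 2) | (s > 5), 0)

print(counts)
print(transformed)
```

`where` evaluates the condition on the whole Series at once, which usually scales better than calling a Python function per element.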



# Pandas DataFrame


In [2]:
# Creating a dataframe
data = np.arange(9).reshape((3,3))

df = pd.DataFrame(data=data, columns= ['a', 'b', 'c'])
df.index.name = 'idx'
print ("\n ======== DataFrame =========== \n", df)


      a  b  c
idx         
0    0  1  2
1    3  4  5
2    6  7  8


### Using row and column indexes (names or positions) to slice a DataFrame

In [21]:
# Accessing rows - we can use loc or iloc
# loc selects rows (or columns) by label, i.e. by the values in the index
# iloc selects rows (or columns) by integer position, so it accepts only integers
print ("\n Printing 3rd row using iloc: \n", df.iloc[2])
print ("\n Printing 3rd row using loc: \n", df.loc[2])

print("\n Extracting rows 1 and 2 from the original dataframe")
print(df.loc[1:2])

print("\n Accessing column index. When its just one column a series is returned")
print(df['a'])

print("\n Extracting a set of columns from a dataframe")
# Select more than one column by passing a list of column names
newDf = df[['a','b']]
print(newDf)


 Printing 3rd row using iloc: 
 a    6
b    7
c    8
Name: 2, dtype: int64

 Printing 3rd row using loc: 
 a    6
b    7
c    8
Name: 2, dtype: int64

 Extracting rows 1 and 2 from the original dataframe
     a  b  c
idx         
1    3  4  5
2    6  7  8

 Accessing column index. When its just one column a series is returned
idx
0    0
1    3
2    6
Name: a, dtype: int64

 Extracting a set of columns from a dataframe
     a  b
idx      
0    0  1
1    3  4
2    6  7
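Rows and columns can also be selected in a single step. The following sketch, which rebuilds the same 3x3 DataFrame, shows the label-based and position-based forms side by side; the result variable names are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(9).reshape((3, 3)), columns=['a', 'b', 'c'])

by_label = df.loc[1:2, ['a', 'b']]   # rows labeled 1 to 2 (inclusive), columns a and b
by_position = df.iloc[1:3, 0:2]      # row positions 1-2, column positions 0-1
single_value = df.at[2, 'c']         # fast access to one scalar by label

print(by_label)
print(by_position)
print(single_value)
```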


In [41]:
# ---------- Creating DF from dictionary of arrays ---- #
# Each dict key becomes a column name
df = pd.DataFrame({ 
   'name': ['John', 'Kelly', 'Bob'],
   'score': [ 77, 26, 35],
   'age' : [35, 37, 29]
})

# Select a subset of columns (the names must match the dict keys exactly)
print (df[['name', 'age']])

    name  age
0   John   35
1  Kelly   37
2    Bob   29


In [55]:
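# ---------- Creating DF from dict of dictionaries ---- #
# Note: The outer dict keys become columns and the inner dict keys become the row index.
# The output below comes from an older pandas release that sorted the labels;
# recent releases keep the insertion order instead, so the ordering may differ.
df = pd.DataFrame( data = {
      'JohnK' :  { 'name': 'John', 'score' : 77,  'age': 35 },
      'KellyC':  { 'name': 'Kelly', 'score' : 26,  'age': 37 },
      'BobM':    { 'name': 'Bob', 'score' : 355,  'age': 29 } 
})
print (df)

      BobM JohnK KellyC
age     29    35     37
name   Bob  John  Kelly
score  355    77     26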
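If one row per person is preferred, the same nested dict can be loaded with the outer keys as the row index. This is a minimal sketch using `DataFrame.from_dict` with `orient='index'`; the names `records` and `people` are introduced here for illustration.

```python
import pandas as pd

records = {
    'JohnK':  {'name': 'John',  'score': 77,  'age': 35},
    'KellyC': {'name': 'Kelly', 'score': 26,  'age': 37},
    'BobM':   {'name': 'Bob',   'score': 355, 'age': 29},
}

# orient='index' turns the outer keys into row labels and the inner keys into columns
people = pd.DataFrame.from_dict(records, orient='index')
print(people)
```

Transposing the original DataFrame with `df.T` gives a similar layout, although the column dtypes may then be generic objects.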


In [63]:
df = pd.DataFrame({ 
   'name': ['John', 'Kelly', 'Bob'],
   'score': [ 77, 26, 35],
   'age' : [35, 37, 29]
})

# Filtering rows based on boolean array
print( df.loc[df['age']>30, ['name','score']] )

    name  score
0   John     77
1  Kelly     26
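Several conditions can be combined with `&` (and) and `|` (or). Each condition needs its own parentheses because of operator precedence. A short sketch on the same data, with illustrative result names:

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['John', 'Kelly', 'Bob'],
    'score': [77, 26, 35],
    'age': [35, 37, 29]
})

# Rows where both conditions hold
older_high = df.loc[(df['age'] > 30) & (df['score'] > 50), 'name']

# Rows where at least one condition holds
either = df.loc[(df['age'] < 30) | (df['score'] > 50), ['name']]

print(older_high)
print(either)
```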


In [74]:
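df = pd.DataFrame(np.arange(12).reshape(4,3), columns=list('abc'))
print(df)

# Take the first row as a Series (.ix has been removed from pandas; use .iloc or .loc)
s = df.iloc[0]

# Subtracting a Series from a DataFrame matches the Series index against the
# column labels and broadcasts the operation down the rows
print(df-s)

   a   b   c
0  0   1   2
1  3   4   5
2  6   7   8
3  9  10  11
   a  b  c
0  0  0  0
1  3  3  3
2  6  6  6
3  9  9  9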
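To broadcast along the other direction, that is to subtract a column from every column, the arithmetic methods accept an `axis` argument. A minimal sketch using `sub` on the same DataFrame; `first_col` and `diff` are illustrative names.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12).reshape(4, 3), columns=list('abc'))

first_col = df['a']

# axis=0 aligns the Series with the row index instead of the column labels
diff = df.sub(first_col, axis=0)
print(diff)
```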


### Applying functions on a DataFrame

In [9]:
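# Applying a function to each column or each row

df = pd.DataFrame(data=np.arange(9).reshape(3,3), columns=list('abc'), index=['r1','r2','r3'])
print(df)

# Transformation function: range (max - min) of the Series it receives
trnf = lambda x: x.max() - x.min()

# Default axis=0: the function receives each column, giving one result per column
print ("\nMax-min of each column (axis=0)\n")
print (df.apply(trnf))

# axis=1: the function receives each row, giving one result per row
print ("\nMax-min of each row (axis=1)\n")
print (df.apply(trnf, axis=1))


    a  b  c
r1  0  1  2
r2  3  4  5
r3  6  7  8

Max-min of each column (axis=0)

a    6
b    6
c    6
dtype: int64

Max-min of each row (axis=1)

r1    2
r2    2
r3    2
dtype: int64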


In [11]:
# Finding both the min and max of each column or each row

def minMax(x):
    # Returning a Series makes apply build a DataFrame from the results
    return pd.Series([x.min(), x.max()], index=['min', 'max'])

print("\n Original DataFrame\n")
print(df)

# Default axis=0: one min/max pair per column
print ("\nMax and min of each column (axis=0)\n")
print (df.apply(minMax))

# axis=1: one min/max pair per row
print ("\nMax and min of each row (axis=1)\n")
print (df.apply(minMax, axis=1))


 Original DataFrame

    a  b  c
r1  0  1  2
r2  3  4  5
r3  6  7  8

Max and min of each column (axis=0)

     a  b  c
min  0  1  2
max  6  7  8

Max and min of each row (axis=1)

    min  max
r1    0    2
r2    3    5
r3    6    8
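For common reductions such as min and max, the built-in DataFrame methods can often replace `apply`. The sketch below reproduces the results above with `max`, `min` and `agg`; the result variable names are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(9).reshape(3, 3), columns=list('abc'),
                  index=['r1', 'r2', 'r3'])

# Range of each column (reduces down the rows)
col_range = df.max() - df.min()

# Range of each row (reduces across the columns)
row_range = df.max(axis=1) - df.min(axis=1)

# Several reductions at once, one result row per function
col_min_max = df.agg(['min', 'max'])

print(col_range)
print(row_range)
print(col_min_max)
```

The built-in reductions are vectorised, so they are usually faster than `apply` with a Python function on larger DataFrames.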
