## Getting started with pandas



-   pandas is primarily designed to work with tabular and heterogeneous data
-   Adopts parts of NumPy's functionality (array-based computations)



In [1]:
import numpy as np
import pandas as pd
from pandas import Series, DataFrame

## pandas data structures: Series



A Series is a 1-D object containing

-   a sequence of values
-   and an associated array of data labels (its **index**)



In [2]:
obj = Series([1, 2, 3, 4])

In [5]:
type(obj.values)

numpy.ndarray

In [4]:
obj.index

RangeIndex(start=0, stop=4, step=1)

We can also have labels for each data point



In [6]:
obj2 = pd.Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])
obj2

d    4
b    7
a   -5
c    3
dtype: int64

In [7]:
obj2['a']

-5

How do we select a set of values (e.g. 'd' and 'c')?



In [9]:
obj2[['d', 'c']]

d    4
c    3
dtype: int64

### Using NumPy-like operations



Filtering with a boolean array, scalar multiplication, and applying math functions all preserve the link between index and value



In [12]:
#print(obj2)
#obj2[obj2 > 0]

In [13]:
obj2 * 2 + np.exp(obj2)

d      62.598150
b    1110.633158
a      -9.993262
c      26.085537
dtype: float64
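
A minimal sketch of the three operations named above, rebuilding `obj2` with the same values. The names `positive`, `doubled`, and `exps` are illustrative only. In each case the result keeps the original index labels.

```python
import numpy as np
import pandas as pd

obj2 = pd.Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])

# Boolean filtering: keep only the entries where the condition holds
positive = obj2[obj2 > 0]

# Scalar multiplication is applied element-wise
doubled = obj2 * 2

# NumPy ufuncs accept a Series and return a Series with the same index
exps = np.exp(obj2)

print(positive.index.tolist())   # ['d', 'b', 'c']
print(doubled['a'])              # -10
print(exps.index.tolist())       # ['d', 'b', 'a', 'c']
```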

### Converting from other data structures



From a dictionary



In [14]:
sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
obj3 = pd.Series(sdata)
obj3

Ohio      35000
Texas     71000
Oregon    16000
Utah       5000
dtype: int64

In [15]:
states = ['California', 'Ohio', 'Oregon', 'Texas']
obj4 = pd.Series(sdata, index=states)
obj4

California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
dtype: float64

In [16]:
obj4.isnull()

California     True
Ohio          False
Oregon        False
Texas         False
dtype: bool

### Data alignment in Series objects



In arithmetic operations, pandas automatically aligns by index label



In [20]:
#obj3

In [21]:
#obj4

In [19]:
obj3 + obj4

California         NaN
Ohio           70000.0
Oregon         32000.0
Texas         142000.0
Utah               NaN
dtype: float64
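
One way to picture the alignment is as a reindex of both operands onto the union of their indexes, followed by an element-wise sum. The sketch below spells that out by hand; `union` and `manual` are illustrative names, and this is only a mental model, not a description of pandas internals.

```python
import pandas as pd

sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
obj3 = pd.Series(sdata)
obj4 = pd.Series(sdata, index=['California', 'Ohio', 'Oregon', 'Texas'])

# Build the union of both indexes, then reindex each operand onto it
union = obj3.index.union(obj4.index)
manual = obj3.reindex(union) + obj4.reindex(union)

# Compare the hand-built result with the automatic alignment
print(manual.equals(obj3 + obj4))
```

Labels that appear in only one operand (`California`, `Utah`) end up as NaN, because one side of the sum has no value for them.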

## pandas data structure: DataFrame



-   A DataFrame represents a rectangular table of data
-   Has both a row index and a column index; each column can have a different type

There are many ways to construct a DataFrame, but the most common are from a dict or from a NumPy array



In [22]:
data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'],
'year': [2000, 2001, 2002, 2001, 2002, 2003],
'pop': [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}
frame = pd.DataFrame(data)
frame

    state  year  pop
0    Ohio  2000  1.5
1    Ohio  2001  1.7
2    Ohio  2002  3.6
3  Nevada  2001  2.4
4  Nevada  2002  2.9
5  Nevada  2003  3.2
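
The cell above builds a DataFrame from a dict. For comparison, here is a small sketch of the NumPy-array route; the array contents and the name `frame_from_array` are made up for illustration.

```python
import numpy as np
import pandas as pd

# A 2-D array holds a single dtype, so every column of the result shares it
arr = np.array([[2000, 1.5],
                [2001, 1.7],
                [2002, 3.6]])

# Column labels must be supplied; otherwise they default to 0, 1, ...
frame_from_array = pd.DataFrame(arr, columns=['year', 'pop'])
print(frame_from_array.dtypes)   # both columns are float64
```

Because the array is homogeneous, `year` becomes a float column here, whereas the dict route lets each column keep its own type.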


We can pass the columns explicitly; any column that is not found in the dict is filled with missing values (NaN)



In [23]:
frame2 = pd.DataFrame(data, columns=['year', 'state', 'pop', 'debt'],
                        index=['one', 'two', 'three', 'four', 'five', 'six'])
frame2

       year   state  pop debt
one    2000    Ohio  1.5  NaN
two    2001    Ohio  1.7  NaN
three  2002    Ohio  3.6  NaN
four   2001  Nevada  2.4  NaN
five   2002  Nevada  2.9  NaN
six    2003  Nevada  3.2  NaN


### Retrieving columns and rows



In [24]:
frame2.columns

Index(['year', 'state', 'pop', 'debt'], dtype='object')

In [28]:
#frame2['state']

In [27]:
frame2.year

one      2000
two      2001
three    2002
four     2001
five     2002
six      2003
Name: year, dtype: int64

Retrieving rows can be done with the `loc` indexer, which selects by index label



In [30]:
type(frame2.loc['three'])

pandas.core.series.Series
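
A short sketch of what a retrieved row looks like, using a cut-down frame (`small` is an illustrative name). It also uses `iloc`, the position-based counterpart of `loc`.

```python
import pandas as pd

data = {'state': ['Ohio', 'Ohio', 'Nevada'],
        'year': [2000, 2001, 2001],
        'pop': [1.5, 1.7, 2.4]}
small = pd.DataFrame(data, index=['one', 'two', 'three'])

row_by_label = small.loc['two']      # select by index label
row_by_position = small.iloc[1]      # select by integer position

# Either way, the row is a Series indexed by the DataFrame's column labels
print(row_by_label.index.tolist())   # ['state', 'year', 'pop']
print(row_by_label['year'])          # 2001
```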

### Modifying columns and rows



In [34]:
frame2['debt'] = np.arange(6.)
frame2

       year   state  pop  debt
one    2000    Ohio  1.5   0.0
two    2001    Ohio  1.7   1.0
three  2002    Ohio  3.6   2.0
four   2001  Nevada  2.4   3.0
five   2002  Nevada  2.9   4.0
six    2003  Nevada  3.2   5.0


If you assign a Series, its labels will be realigned exactly to
the DataFrame’s index, inserting missing values in any holes



In [35]:
val = pd.Series([-1.2, -1.5, -1.7], index=['two', 'four', 'five'])
frame2['debt'] = val
frame2

       year   state  pop  debt
one    2000    Ohio  1.5   NaN
two    2001    Ohio  1.7  -1.2
three  2002    Ohio  3.6   NaN
four   2001  Nevada  2.4  -1.5
five   2002  Nevada  2.9  -1.7
six    2003  Nevada  3.2   NaN


Assigning a column that doesn't exist will create a new column



In [36]:
frame2['eastern'] = (frame2.state == 'Ohio')
frame2

       year   state  pop  debt  eastern
one    2000    Ohio  1.5   NaN     True
two    2001    Ohio  1.7  -1.2     True
three  2002    Ohio  3.6   NaN     True
four   2001  Nevada  2.4  -1.5    False
five   2002  Nevada  2.9  -1.7    False
six    2003  Nevada  3.2   NaN    False


We can also delete columns



In [37]:
del frame2['eastern']
frame2

       year   state  pop  debt
one    2000    Ohio  1.5   NaN
two    2001    Ohio  1.7  -1.2
three  2002    Ohio  3.6   NaN
four   2001  Nevada  2.4  -1.5
five   2002  Nevada  2.9  -1.7
six    2003  Nevada  3.2   NaN


### What happens if we pass a nested dict of dicts?



In [39]:
pop = {'Nevada': {2001: 2.4, 2002: 2.9},
       'Ohio': {2000: 1.5, 2001: 1.7, 2002: 3.6}}
frame3 = pd.DataFrame(pop)
frame3

      Nevada  Ohio
2000     NaN   1.5
2001     2.4   1.7
2002     2.9   3.6
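
To make the rule explicit: the outer dict keys become the columns and the inner keys become the row labels, with NaN wherever an inner key is missing. A small sketch that checks this on the same data:

```python
import pandas as pd

pop = {'Nevada': {2001: 2.4, 2002: 2.9},
       'Ohio': {2000: 1.5, 2001: 1.7, 2002: 3.6}}
frame3 = pd.DataFrame(pop)

# Outer keys -> columns, inner keys -> row labels
print(frame3.columns.tolist())      # ['Nevada', 'Ohio']
print(sorted(frame3.index))         # [2000, 2001, 2002]

# Nevada has no entry for 2000, so that cell is NaN
print(frame3.loc[2000, 'Nevada'])   # nan

# Transposing swaps the roles of rows and columns
print(frame3.T.index.tolist())      # ['Nevada', 'Ohio']
```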


### Possible inputs to DataFrame constructor



![img](images/dataframe_constructor.png)



## pandas data structure: Index objects



-   pandas Index objects hold the axis labels and other metadata
-   Any array or other sequence of labels passed as an index is converted to an Index object



In [40]:
obj = pd.Series(range(3), index=['a', 'b', 'c'])
index = obj.index
index

Index(['a', 'b', 'c'], dtype='object')

In [41]:
index[1:]

Index(['b', 'c'], dtype='object')

In [42]:
index[1] = 'd'

TypeError: Index does not support mutable operations

Immutability makes it safer to share Index objects among data structures



In [43]:
labels = pd.Index(np.arange(3))
labels

Int64Index([0, 1, 2], dtype='int64')

In [44]:
obj2 = pd.Series([1.5, -2.5, 0], index=labels)
obj2

0    1.5
1   -2.5
2    0.0
dtype: float64

pandas Index objects behave like sets but can have duplicate labels



In [45]:
'Ohio' in frame3.columns

True

In [46]:
dup_labels = pd.Index(['foo', 'foo', 'bar', 'bar'])
dup_labels

Index(['foo', 'foo', 'bar', 'bar'], dtype='object')
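
A brief sketch of the set-like behaviour and of what duplicate labels mean in practice. The indexes `left` and `right` and the Series `s` are made up for illustration.

```python
import pandas as pd

left = pd.Index(['a', 'b', 'c'])
right = pd.Index(['b', 'c', 'd'])

# Set-style operations return new Index objects
print(left.intersection(right).tolist())   # ['b', 'c']
print(left.union(right).tolist())          # ['a', 'b', 'c', 'd']

# Unlike a Python set, an Index may hold repeated labels
dup_labels = pd.Index(['foo', 'foo', 'bar', 'bar'])
print(dup_labels.is_unique)                # False

# Selecting a duplicated label returns every matching entry
s = pd.Series(range(4), index=dup_labels)
print(s['foo'].tolist())                   # [0, 1]
```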

### Some Index methods and properties



![img](images/pandas_index.png)



## Essential pandas functionality



For each of these commands write out what you think their functionality is and possible use cases



### `reindex`



In [49]:
obj = pd.Series([4.5, 7.2, -5.3, 3.6], index=['d', 'b', 'a', 'c'])
obj2 = obj.reindex(['a', 'b', 'c', 'd', 'e'])

In [54]:
obj3 = pd.Series(['blue', 'purple', 'yellow'], index=[0, 2, 4])
#obj3.reindex(range(6), method='ffill')
#obj3.reindex(range(6), method='bfill')
obj3.reindex(range(6), method='bfill', fill_value='missing')

0       blue
1     purple
2     purple
3     yellow
4     yellow
5    missing
dtype: object
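
For checking your answer: a sketch of what `reindex` does to the first Series above, plus the forward-fill variant that is commented out in the cell. The name `filled` is illustrative.

```python
import pandas as pd

obj = pd.Series([4.5, 7.2, -5.3, 3.6], index=['d', 'b', 'a', 'c'])

# Existing labels are rearranged; the new label 'e' has no data, so it is NaN
obj2 = obj.reindex(['a', 'b', 'c', 'd', 'e'])
print(obj2.tolist()[:4])     # [-5.3, 7.2, 3.6, 4.5]
print(obj2.isna()['e'])      # True

# For ordered indexes, method='ffill' carries the last seen value forward
obj3 = pd.Series(['blue', 'purple', 'yellow'], index=[0, 2, 4])
filled = obj3.reindex(range(6), method='ffill')
print(filled.tolist())
# ['blue', 'blue', 'purple', 'purple', 'yellow', 'yellow']
```

Note that `reindex` returns a new object; `obj` itself is left unchanged.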

### `drop`



In [58]:
obj = pd.Series(np.arange(5.), index=['a', 'b', 'c', 'd', 'e'])
new_obj = obj.drop('c')
obj.drop('c', inplace=True)

In [70]:
data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])
#data.drop(['Colorado', 'Ohio'])
#data.drop('two')
data.drop([c for c in data.index if c != 'Colorado'])

          one  two  three  four
Colorado    4    5      6     7
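
The commented-out `data.drop('two')` fails because `drop` looks in the row labels by default and `'two'` is a column. A sketch of dropping along each axis, with illustrative result names:

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])

# By default drop looks for labels in the row index
fewer_rows = data.drop(['Colorado', 'Ohio'])

# To drop a column, name the axis (or use the columns= keyword)
fewer_cols = data.drop('two', axis='columns')
same_cols = data.drop(columns=['two'])

print(fewer_rows.index.tolist())     # ['Utah', 'New York']
print(fewer_cols.columns.tolist())   # ['one', 'three', 'four']
print(data.shape)                    # (4, 4): the original is unchanged
```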


### `[]`



In [71]:
obj = pd.Series(np.arange(4.), index=['a', 'b', 'c', 'd'])
obj['b']
obj[1]  # positional fallback on a labeled index; deprecated in recent pandas, prefer obj.iloc[1]
#obj[['b', 'a', 'd']] 

1.0

In [73]:
data['two']
data[:2]
data[data < 5] = 0
data

          one  two  three  four
Ohio        0    0      0     0
Colorado    0    5      6     7
Utah        8    9     10    11
New York   12   13     14    15


### `loc`, `iloc`



In [74]:
data.loc['Colorado', ['two', 'three']]

two      5
three    6
Name: Colorado, dtype: int64

In [75]:
data.iloc[2, [3, 0, 1]]

four    11
one      8
two      9
Name: Utah, dtype: int64
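
One difference between the two indexers that is easy to miss is how slices treat their end point. A small sketch on a freshly built copy of `data` (names `by_label` and `by_position` are illustrative):

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])

# Label slices with loc include the end label
by_label = data.loc['Ohio':'Utah', 'one':'two']

# Integer slices with iloc exclude the end position, like Python lists
by_position = data.iloc[0:2, 0:2]

print(by_label.shape)      # (3, 2)
print(by_position.shape)   # (2, 2)
```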

### Indexing options



![img](images/pdloc1.png)

![img](images/pdloc2.png)



### Arithmetic and data alignment



When you add two objects whose indexes differ, the index of the result is the union of the two indexes; labels present in only one of the objects yield NaN



In [76]:
s1 = pd.Series([7.3, -2.5, 3.4, 1.5], index=['a', 'c', 'd', 'e'])
s1

a    7.3
c   -2.5
d    3.4
e    1.5
dtype: float64

In [77]:
s2 = pd.Series([-2.1, 3.6, -1.5, 4, 3.1],
               index=['a', 'c', 'e', 'f', 'g'])
s2

a   -2.1
c    3.6
e   -1.5
f    4.0
g    3.1
dtype: float64

In [78]:
s1 + s2

a    5.2
c    1.1
d    NaN
e    0.0
f    NaN
g    NaN
dtype: float64
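
If the NaN values are not wanted, the arithmetic methods accept a `fill_value` that stands in for a label missing from one side. A minimal sketch with the same two Series (`total` is an illustrative name):

```python
import pandas as pd

s1 = pd.Series([7.3, -2.5, 3.4, 1.5], index=['a', 'c', 'd', 'e'])
s2 = pd.Series([-2.1, 3.6, -1.5, 4, 3.1], index=['a', 'c', 'e', 'f', 'g'])

# A label missing from one operand is treated as 0 before adding
total = s1.add(s2, fill_value=0)

print(total.index.tolist())     # ['a', 'c', 'd', 'e', 'f', 'g']
print(total['d'], total['f'])   # 3.4 4.0
```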

For DataFrames, alignment is performed on both the rows and the columns



In [79]:
df1 = pd.DataFrame(np.arange(9.).reshape((3, 3)), columns=list('bcd'),
                   index=['Ohio', 'Texas', 'Colorado'])
df2 = pd.DataFrame(np.arange(12.).reshape((4, 3)), columns=list('bde'),
                   index=['Utah', 'Ohio', 'Texas', 'Oregon'])

In [80]:
df1

            b    c    d
Ohio      0.0  1.0  2.0
Texas     3.0  4.0  5.0
Colorado  6.0  7.0  8.0


In [81]:
df2

          b     d     e
Utah    0.0   1.0   2.0
Ohio    3.0   4.0   5.0
Texas   6.0   7.0   8.0
Oregon  9.0  10.0  11.0


In [82]:
df1 + df2

            b   c     d   e
Colorado  NaN NaN   NaN NaN
Ohio      3.0 NaN   6.0 NaN
Oregon    NaN NaN   NaN NaN
Texas     9.0 NaN  12.0 NaN
Utah      NaN NaN   NaN NaN


#### Does broadcasting work?



In [83]:
frame = pd.DataFrame(np.arange(12.).reshape((4, 3)),
                     columns=list('bde'),
                     index=['Utah', 'Ohio', 'Texas', 'Oregon'])
series = frame.iloc[0]

In [84]:
frame

          b     d     e
Utah    0.0   1.0   2.0
Ohio    3.0   4.0   5.0
Texas   6.0   7.0   8.0
Oregon  9.0  10.0  11.0


In [85]:
series

b    0.0
d    1.0
e    2.0
Name: Utah, dtype: float64

In [86]:
frame - series

          b    d    e
Utah    0.0  0.0  0.0
Ohio    3.0  3.0  3.0
Texas   6.0  6.0  6.0
Oregon  9.0  9.0  9.0
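
By default the Series index is matched against the DataFrame's columns and the operation is broadcast down the rows. To broadcast across the columns instead, matching on the row labels, the arithmetic methods take an `axis` argument. A sketch, with `col` and `result` as illustrative names:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(12.).reshape((4, 3)),
                     columns=list('bde'),
                     index=['Utah', 'Ohio', 'Texas', 'Oregon'])

# A column Series is indexed by the row labels
col = frame['d']

# axis='index' matches on the row labels and broadcasts across the columns
result = frame.sub(col, axis='index')
print(result.loc['Ohio'].tolist())   # [-1.0, 0.0, 1.0]
```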


### Function application and mapping



In [None]:
frame = pd.DataFrame(np.random.randn(4, 3), columns=list('bde'),
                     index=['Utah', 'Ohio', 'Texas', 'Oregon'])

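Let's apply a NumPy *ufunc*. The outputs shown in this section come from the `np.arange`-based `frame` defined earlier, since the random-valued cell above was not executed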



In [87]:
np.abs(frame)

          b     d     e
Utah    0.0   1.0   2.0
Ohio    3.0   4.0   5.0
Texas   6.0   7.0   8.0
Oregon  9.0  10.0  11.0


Write a lambda function and apply it with `frame.apply(f)`



In [91]:
frame.apply(lambda x: x - 2)

          b    d    e
Utah   -2.0 -1.0  0.0
Ohio    1.0  2.0  3.0
Texas   4.0  5.0  6.0
Oregon  7.0  8.0  9.0
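
`apply` passes each column (or each row) to the function as a Series, so the function can also reduce it to a single value. A sketch using a hypothetical `spread` function on the `np.arange`-based frame:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(12.).reshape((4, 3)),
                     columns=list('bde'),
                     index=['Utah', 'Ohio', 'Texas', 'Oregon'])

# Hypothetical helper: range of the values in one column or row
spread = lambda x: x.max() - x.min()

per_column = frame.apply(spread)                 # one value per column
per_row = frame.apply(spread, axis='columns')    # one value per row

print(per_column.tolist())   # [9.0, 9.0, 9.0]
print(per_row.tolist())      # [2.0, 2.0, 2.0, 2.0]
```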


#### `applymap`



In [None]:
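fmt = lambda x: '%.2f' % x   # named fmt to avoid shadowing the built-in format
frame.applymap(fmt)          # element-wise; deprecated since pandas 2.1 in favor of frame.map(fmt)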

### Sorting



In [None]:
obj = pd.Series(range(4), index=['d', 'a', 'b', 'c'])
obj.sort_index()
obj.sort_values()

#### Sorting DataFrames



In [None]:
frame = pd.DataFrame({'b': [4, 7, -3, 2], 'a': [0, 1, 0, 1]})
frame

In [None]:
frame.sort_values(by='b')

In [None]:
frame.sort_values(by=['a', 'b'])
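
Sort direction can be set per key with `ascending`, and `sort_index` can also order the column labels. A short sketch on the same frame (`ordered` is an illustrative name):

```python
import pandas as pd

frame = pd.DataFrame({'b': [4, 7, -3, 2], 'a': [0, 1, 0, 1]})

# Sort by 'a' ascending, then by 'b' descending within each value of 'a'
ordered = frame.sort_values(by=['a', 'b'], ascending=[True, False])
print(ordered['b'].tolist())    # [4, -3, 7, 2]

# sort_index with axis='columns' orders the column labels instead of the rows
print(frame.sort_index(axis='columns').columns.tolist())   # ['a', 'b']
```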

## Descriptive statistics



A number of common mathematical and statistical methods are available for Series and DataFrames



In [None]:
df = pd.DataFrame([[1.4, np.nan], [7.1, -4.5],
                   [np.nan, np.nan], [0.75, -1.3]],
                  index=['a', 'b', 'c', 'd'],
                  columns=['one', 'two'])
df

In [None]:
df.sum()

In [None]:
df.mean(axis='columns', skipna=False)

In [None]:
df.describe()
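
Since the cells above were not run, here is a sketch of how missing values affect these reductions on the same `df`. The names `col_sums` and `row_means` are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1.4, np.nan], [7.1, -4.5],
                   [np.nan, np.nan], [0.75, -1.3]],
                  index=['a', 'b', 'c', 'd'],
                  columns=['one', 'two'])

# NaN values are skipped by default, so 'one' sums 1.4 + 7.1 + 0.75
col_sums = df.sum()

# With skipna=False, any NaN in a row makes that row's mean NaN
row_means = df.mean(axis='columns', skipna=False)
print(row_means.isna().tolist())   # [True, False, True, False]

# idxmax returns the label of the largest value in each column
print(df.idxmax().tolist())        # ['b', 'd']
```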

### Descriptive and summary statistics



![img](images/pdsummary.png)



## Homework



You will need to read some CSV data



In [None]:
euro12 = pd.read_csv('Euro_2012_stats.csv')
euro12