In [4]:
from pandas import Series, DataFrame
import pandas as pd
import numpy as np



# Introduction to pandas Data Structures

## Series

A Series is a one-dimensional array-like object containing an array of data (of any NumPy data type) and an associated array of data labels, called its index.

In [6]:
obj = Series([4, 7, -5, 3])
obj

0    4
1    7
2   -5
3    3
dtype: int64

Get Series values and index attributes:

In [8]:
obj.values

array([ 4,  7, -5,  3])

In [9]:
obj.index

RangeIndex(start=0, stop=4, step=1)

Create a Series with an index identifying each data point:

In [10]:
obj2 = Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])
obj2

d    4
b    7
a   -5
c    3
dtype: int64

In [11]:
obj2.index

Index(['d', 'b', 'a', 'c'], dtype='object')

Index values can be used to select single values or a set of values.

In [12]:
obj2['a']

-5

In [13]:
obj2[['c', 'a', 'd']]

c    3
a   -5
d    4
dtype: int64

NumPy operations such as boolean filtering, scalar multiplication, and math functions preserve the index-value link.  
Filtering with a boolean array:

In [14]:
obj2[obj2 > 0]

d    4
b    7
c    3
dtype: int64

Scalar multiplication:

In [15]:
obj2 * 2

d     8
b    14
a   -10
c     6
dtype: int64

Applying math functions:

In [16]:
np.exp(obj2)

d      54.598150
b    1096.633158
a       0.006738
c      20.085537
dtype: float64

A Series can be thought of as a fixed-length, ordered dict, since it maps index values to data values.  
It can be substituted into many functions that expect a dict.

In [17]:
'b' in obj2

True
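As a small sketch of the dict analogy, the snippet below rebuilds obj2 and tries a few dict-style operations. The variable names (has_b, missing, as_dict) are made up for illustration; the point to notice is that membership tests look at index labels, not at values.

```python
import pandas as pd

obj2 = pd.Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])

# Membership tests check the index labels, like keys in a dict.
has_b = 'b' in obj2   # label present
has_e = 'e' in obj2   # label absent
has_7 = 7 in obj2     # 7 is a value, not a label

# get() returns a default instead of raising KeyError, as dict.get does.
missing = obj2.get('e', 0)

# to_dict() converts the Series back into a plain dict keyed by label.
as_dict = obj2.to_dict()
print(has_b, has_e, has_7, missing, as_dict)
```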

Create a Series from a Python dict by passing the dict to the constructor:

In [18]:
sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
obj3 = Series(sdata)
obj3

Ohio      35000
Texas     71000
Oregon    16000
Utah       5000
dtype: int64

The dict keys become the index of the Series.

When an index is passed along with the dict, values are matched to it by key. Labels with no matching key get NaN, and keys that are not in the index are dropped:

In [19]:
states = ['California', 'Ohio', 'Oregon', 'Texas']
obj4 = Series(sdata, index=states)
obj4

California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
dtype: float64
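The NaN for 'California' marks a missing value, and 'Utah' is dropped because it is not in states. One way to locate such entries, sketched here with illustrative variable names, is the isna and notna methods:

```python
import pandas as pd

sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
states = ['California', 'Ohio', 'Oregon', 'Texas']
obj4 = pd.Series(sdata, index=states)

# Boolean mask that is True where a value is missing.
missing_mask = obj4.isna()

# Labels whose value is missing, and the count of non-missing values.
missing_labels = list(obj4[missing_mask].index)
present_count = int(obj4.notna().sum())
print(missing_labels, present_count)
```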

A critical Series feature for many applications is that it automatically aligns differently-indexed data in arithmetic operations:

In [20]:
obj3 + obj4

California         NaN
Ohio           70000.0
Oregon         32000.0
Texas         142000.0
Utah               NaN
dtype: float64
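Labels found in only one operand produce NaN, as 'Utah' does above. If a missing label should instead count as zero, the add method with fill_value is one option. This is a sketch, not the only approach; note that a label missing on both sides ('California' here) still yields NaN.

```python
import pandas as pd

sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
states = ['California', 'Ohio', 'Oregon', 'Texas']
obj3 = pd.Series(sdata)
obj4 = pd.Series(sdata, index=states)

# fill_value replaces a value that is missing on one side only.
total = obj3.add(obj4, fill_value=0)
print(total)
```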

Both the Series object and its index have a 'name' attribute, which integrates with other key areas of pandas functionality:

In [21]:
obj4.name = 'population'
obj4.index.name = 'state'
obj4

state
California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
Name: population, dtype: float64

A Series's index can be altered in place by assignment:

In [22]:
obj.index = ['Bob', 'Steve', 'Jeff', 'Ryan']
obj

Bob      4
Steve    7
Jeff    -5
Ryan     3
dtype: int64
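The new index has to contain one label per element. The following sketch (the error variable is just for illustration) shows that a list of the wrong length is rejected with a ValueError:

```python
import pandas as pd

obj = pd.Series([4, 7, -5, 3])

try:
    # Three labels for four values: the lengths do not match.
    obj.index = ['Bob', 'Steve', 'Jeff']
    error = None
except ValueError as exc:
    error = exc

print(type(error).__name__)
```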

## DataFrame
### Creating DataFrames

Create a DataFrame from a dict of equal-length lists or NumPy arrays:

In [24]:
data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}
frame = DataFrame(data)
frame

| | state | year | pop |
|---|---|---|---|
| 0 | Ohio | 2000 | 1.5 |
| 1 | Ohio | 2001 | 1.7 |
| 2 | Ohio | 2002 | 3.6 |
| 3 | Nevada | 2001 | 2.4 |
| 4 | Nevada | 2002 | 2.9 |


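This results in a DataFrame with an index assigned automatically, as with Series. In the pandas version shown here, the columns keep the insertion order of the dict keys (older versions sorted them alphabetically).  
Specifying a sequence of columns arranges the columns in exactly that order: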

In [25]:
DataFrame(data, columns=['year', 'state', 'pop'])

| | year | state | pop |
|---|---|---|---|
| 0 | 2000 | Ohio | 1.5 |
| 1 | 2001 | Ohio | 1.7 |
| 2 | 2002 | Ohio | 3.6 |
| 3 | 2001 | Nevada | 2.4 |
| 4 | 2002 | Nevada | 2.9 |
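Column order can also be changed after construction. A minimal sketch, assuming a recent pandas and Python where dict insertion order is preserved: selecting with a list of column names returns a new DataFrame in that order and leaves the original untouched. The name reordered is illustrative.

```python
import pandas as pd

data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}
frame = pd.DataFrame(data)

# Selecting with a list of names yields the columns in the listed order.
reordered = frame[['pop', 'state', 'year']]
print(list(frame.columns), list(reordered.columns))
```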


If a name passed in 'columns' is not a key in the data, that column appears filled with NaN values:

In [26]:
frame2 = DataFrame(data, columns=['year', 'state', 'pop', 'debt'],
                     index=['one', 'two', 'three', 'four', 'five'])
frame2

| | year | state | pop | debt |
|---|---|---|---|---|
| one | 2000 | Ohio | 1.5 | NaN |
| two | 2001 | Ohio | 1.7 | NaN |
| three | 2002 | Ohio | 3.6 | NaN |
| four | 2001 | Nevada | 2.4 | NaN |
| five | 2002 | Nevada | 2.9 | NaN |


In [27]:
frame2.columns

Index(['year', 'state', 'pop', 'debt'], dtype='object')

### Retrieving Data from DataFrames

By dict-like notation:

In [28]:
frame2['state']

one        Ohio
two        Ohio
three      Ohio
four     Nevada
five     Nevada
Name: state, dtype: object

By attribute:

In [29]:
frame2.year

one      2000
two      2001
three    2002
four     2001
five     2002
Name: year, dtype: int64

Rows can be retrieved by label with the 'loc' indexer (or by integer position with 'iloc'):

In [31]:
frame2.loc['three']

year     2002
state    Ohio
pop       3.6
debt      NaN
Name: three, dtype: object
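For comparison, here is a short sketch of label-based and position-based row access side by side. It rebuilds frame2 as defined earlier; by_label, by_position, and subset are illustrative names.

```python
import pandas as pd

data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}
frame2 = pd.DataFrame(data, columns=['year', 'state', 'pop', 'debt'],
                      index=['one', 'two', 'three', 'four', 'five'])

# loc selects by label, iloc by integer position (0-based).
by_label = frame2.loc['three']
by_position = frame2.iloc[2]

# loc also accepts lists of row labels and column labels together.
subset = frame2.loc[['one', 'five'], ['year', 'pop']]
print(by_label['year'], by_position['year'], subset.shape)
```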

### Modify Data by Assignment

In [32]:
frame2['debt'] = 16.5
frame2

| | year | state | pop | debt |
|---|---|---|---|---|
| one | 2000 | Ohio | 1.5 | 16.5 |
| two | 2001 | Ohio | 1.7 | 16.5 |
| three | 2002 | Ohio | 3.6 | 16.5 |
| four | 2001 | Nevada | 2.4 | 16.5 |
| five | 2002 | Nevada | 2.9 | 16.5 |


In [34]:
frame2['debt'] = np.arange(5.)
frame2

| | year | state | pop | debt |
|---|---|---|---|---|
| one | 2000 | Ohio | 1.5 | 0.0 |
| two | 2001 | Ohio | 1.7 | 1.0 |
| three | 2002 | Ohio | 3.6 | 2.0 |
| four | 2001 | Nevada | 2.4 | 3.0 |
| five | 2002 | Nevada | 2.9 | 4.0 |


When a list or array is assigned to a column, its length must match the number of rows in the DataFrame.  
When a Series is assigned, its labels are aligned to the DataFrame's index, and rows with no matching label are filled with missing values:

In [35]:
val = Series([-1.2, -1.5, -1.7], index=['two', 'four', 'five'])
frame2['debt'] = val
frame2

| | year | state | pop | debt |
|---|---|---|---|---|
| one | 2000 | Ohio | 1.5 | NaN |
| two | 2001 | Ohio | 1.7 | -1.2 |
| three | 2002 | Ohio | 3.6 | NaN |
| four | 2001 | Nevada | 2.4 | -1.5 |
| five | 2002 | Nevada | 2.9 | -1.7 |
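The two assignment rules can be contrasted in a small sketch. The label 'six' is a made-up label that does not exist in frame2, used only to show that unmatched Series labels are ignored, while a list of the wrong length raises a ValueError.

```python
import pandas as pd

data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}
frame2 = pd.DataFrame(data, columns=['year', 'state', 'pop', 'debt'],
                      index=['one', 'two', 'three', 'four', 'five'])

try:
    # A list is assigned by position, so 3 values for 5 rows is an error.
    frame2['debt'] = [1.0, 2.0, 3.0]
    error = None
except ValueError as exc:
    error = exc

# A Series is assigned by label: 'two' matches, 'six' is silently dropped.
val = pd.Series([-1.2, 9.9], index=['two', 'six'])
frame2['debt'] = val
print(type(error).__name__)
print(frame2['debt'])
```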


Assigning a column that doesn't exist creates a new column:

In [36]:
frame2['eastern'] = frame2.state == 'Ohio'
frame2

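| | year | state | pop | debt | eastern |
|---|---|---|---|---|---|
| one | 2000 | Ohio | 1.5 | NaN | True |
| two | 2001 | Ohio | 1.7 | -1.2 | True |
| three | 2002 | Ohio | 3.6 | NaN | True |
| four | 2001 | Nevada | 2.4 | -1.5 | False |
| five | 2002 | Nevada | 2.9 | -1.7 | False |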


Columns can be deleted with the 'del' keyword, as with a dict:

In [37]:
del frame2['eastern']
frame2.columns

Index(['year', 'state', 'pop', 'debt'], dtype='object')

Another common form of data is a nested dict of dicts:

In [38]:
pop = {'Nevada': {2001: 2.4, 2002: 2.9},
       'Ohio': {2000: 1.5, 2001: 1.7, 2002: 3.6}}

When this is passed to DataFrame, the outer keys become the columns and the inner keys become the row index:

In [39]:
frame3 = DataFrame(pop)
frame3

| | Nevada | Ohio |
|---|---|---|
| 2001 | 2.4 | 1.7 |
| 2002 | 2.9 | 3.6 |
| 2000 | NaN | 1.5 |


The result can be transposed:

In [40]:
frame3.T

| | 2001 | 2002 | 2000 |
|---|---|---|---|
| Nevada | 2.4 | 2.9 | NaN |
| Ohio | 1.7 | 3.6 | 1.5 |
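The rows of frame3 appear in the order 2001, 2002, 2000, which is the order in which the inner keys were first encountered. If a particular row order is wanted, two options are sketched below: sorting the index afterwards, or passing an explicit index to the constructor. The names sorted_frame and explicit_frame are illustrative.

```python
import pandas as pd

pop = {'Nevada': {2001: 2.4, 2002: 2.9},
       'Ohio': {2000: 1.5, 2001: 1.7, 2002: 3.6}}

# Option 1: build the frame, then sort the row labels.
sorted_frame = pd.DataFrame(pop).sort_index()

# Option 2: state the row labels up front; missing entries become NaN.
explicit_frame = pd.DataFrame(pop, index=[2000, 2001, 2002])
print(sorted_frame)
```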


Dicts of Series are treated the same way:

In [41]:
pdata = {'Ohio': frame3['Ohio'][:-1],
         'Nevada': frame3['Nevada'][:2]}
DataFrame(pdata)

| | Ohio | Nevada |
|---|---|---|
| 2001 | 1.7 | 2.4 |
| 2002 | 3.6 | 2.9 |


In [42]:
frame3.index.name = 'year'; frame3.columns.name = 'state'
frame3

| state | Nevada | Ohio |
|---|---|---|
| year | | |
| 2001 | 2.4 | 1.7 |
| 2002 | 2.9 | 3.6 |
| 2000 | NaN | 1.5 |


The 'values' attribute returns the data contained in the DataFrame as a 2D ndarray:

In [43]:
frame3.values

array([[2.4, 1.7],
       [2.9, 3.6],
       [nan, 1.5]])
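The array above is float64 because both columns hold floats. When the columns have different types, the resulting array needs a dtype that can hold all of them, which is typically object. The sketch below uses to_numpy, the method form of the same conversion; homogeneous and mixed are illustrative names.

```python
import numpy as np
import pandas as pd

pop = {'Nevada': {2001: 2.4, 2002: 2.9},
       'Ohio': {2000: 1.5, 2001: 1.7, 2002: 3.6}}
data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}

# All-float columns give a float64 array.
homogeneous = pd.DataFrame(pop).to_numpy()

# Strings, ints, and floats together fall back to dtype object.
mixed = pd.DataFrame(data).to_numpy()
print(homogeneous.dtype, mixed.dtype)
```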

## Index Objects
pandas’s Index objects are responsible for holding the axis labels and other metadata (like the axis name or names). Any array or other sequence of labels used when constructing a Series or DataFrame is internally converted to an Index:

In [44]:
obj = Series(range(3), index=['a', 'b', 'c'])
index = obj.index
index

Index(['a', 'b', 'c'], dtype='object')

In [46]:
index[1:]

Index(['b', 'c'], dtype='object')

Index objects are immutable and can't be modified by the user. This is important so that Index objects can be safely shared among data structures:

In [47]:
index = pd.Index(np.arange(3))
obj2 = Series([1.5, -2.5, 0], index=index)
obj2.index is index

True
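A brief sketch of what immutability means in practice: item assignment on an Index raises a TypeError, while methods such as insert return a new Index and leave the original unchanged. The variable names are illustrative.

```python
import pandas as pd

index = pd.Index(['a', 'b', 'c'])

try:
    # In-place modification is not supported.
    index[1] = 'd'
    error = None
except TypeError as exc:
    error = exc

# insert() builds a new Index; the original keeps its labels.
new_index = index.insert(1, 'd')
print(type(error).__name__, list(index), list(new_index))
```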

There are multiple kinds of Index objects and each has a number of methods and properties for set logic and answering other common questions about the data it contains:

In [50]:
pd.options.display.max_colwidth = 100

methods = ['append', 'difference', 'intersection', 'union', 'isin', 'delete', 'drop', 'insert', 'is_monotonic_increasing', 'is_unique']
description = ['Concatenate with additional Index objects, producing a new Index',
               'Compute set difference as an Index',
               'Compute set intersection',
               'Compute set union',
               'Compute boolean array indicating whether each value is contained in the passed collection',
               'Compute new Index with element at index i deleted',
               'Compute new Index by deleting passed values',
               'Compute new Index by inserting element at index i',
               'Returns True if each element is greater than or equal to the previous element',
               'Returns True if the Index has no duplicate values']
index_objects = DataFrame({'Method': methods, 'Description': description})
index_objects

| | Method | Description |
|---|---|---|
| 0 | append | Concatenate with additional Index objects, producing a new Index |
| 1 | difference | Compute set difference as an Index |
| 2 | intersection | Compute set intersection |
| 3 | union | Compute set union |
| 4 | isin | Compute boolean array indicating whether each value is contained in the passed collection |
| 5 | delete | Compute new Index with element at index i deleted |
| 6 | drop | Compute new Index by deleting passed values |
| 7 | insert | Compute new Index by inserting element at index i |
| 8 | is_monotonic_increasing | Returns True if each element is greater than or equal to the previous element |
| 9 | is_unique | Returns True if the Index has no duplicate values |
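To make the table concrete, here is a sketch that applies several of these methods to two small, made-up Index objects (left and right). It assumes a recent pandas release, where the set difference method is named difference and the monotonicity property is is_monotonic_increasing.

```python
import pandas as pd

left = pd.Index(['a', 'b', 'c'])
right = pd.Index(['b', 'c', 'd'])

# Set logic: each call returns a new Index.
union = left.union(right)
common = left.intersection(right)
only_left = left.difference(right)

# Membership test against an arbitrary collection, as a boolean array.
mask = left.isin(['a', 'c'])

# append keeps duplicates, so the result is no longer unique.
appended = left.append(right)

# drop removes by label, delete removes by position.
dropped = left.drop(['a'])
deleted = left.delete(0)

print(list(union), list(common), list(only_left))
print(list(mask), appended.is_unique, left.is_monotonic_increasing)
```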


# Essential Functionality