# Pandas Objects: Series, DataFrame, Index

In [91]:
import pandas as pd

In [92]:
pd.__version__

'0.23.4'

## Series

A Series is an extended version of a NumPy array: it holds a sequence of values together with an explicitly defined index. The index can be numeric or consist of arbitrary labels such as strings, and its values do not have to be sequential.
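To make the point about non-sequential indices concrete, here is a minimal sketch. The values and the out-of-order integer labels are hypothetical and chosen only for illustration; `loc` looks up by label, `iloc` by position.

```python
import pandas as pd

# Hypothetical example values; the integer labels are deliberately out of order
s = pd.Series([0.25, 0.5, 0.75, 1.0], index=[2, 5, 3, 7])

by_label = s.loc[5]      # look up by explicit label 5
by_position = s.iloc[0]  # look up by implicit position 0

print(by_label)     # 0.5
print(by_position)  # 0.25
```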

### Creating a Series from an array

In [93]:
values = [0.25, 0.5, 0.75, 1.0]

In [94]:
index = ['a', 'b', 'c', 'd']

In [95]:
data = pd.Series(values, index=index)
data

a    0.25
b    0.50
c    0.75
d    1.00
dtype: float64

In [96]:
data.index

Index(['a', 'b', 'c', 'd'], dtype='object')

In [97]:
data.values

array([0.25, 0.5 , 0.75, 1.  ])

In [98]:
# Slicing by implicit position (the stop position is excluded)
data[1:3]

b    0.50
c    0.75
dtype: float64

In [99]:
data['a']

0.25

### Creating a Series from a dictionary

In [100]:
population_dict = {'California': 38332521,
                   'Texas': 26448193,
                   'New York': 19651127,
                   'Florida': 19552860,
                   'Illinois': 12882135}

In [101]:
population = pd.Series(population_dict)
population

California    38332521
Texas         26448193
New York      19651127
Florida       19552860
Illinois      12882135
dtype: int64

In [102]:
# Access by index
population['California']

38332521

In [103]:
# Slicing by explicit index labels (the stop label is included)
population['California':'New York']

California    38332521
Texas         26448193
New York      19651127
dtype: int64
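The difference between the two slicing styles is easy to overlook, so here is a small sketch that puts them side by side. It rebuilds the same `population` Series; the names `by_label` and `by_position` are introduced only for this illustration.

```python
import pandas as pd

population = pd.Series({'California': 38332521,
                        'Texas': 26448193,
                        'New York': 19651127,
                        'Florida': 19552860,
                        'Illinois': 12882135})

# Label-based slice: the stop label 'New York' is included
by_label = population.loc['California':'New York']

# Position-based slice: the stop position 2 is excluded
by_position = population.iloc[0:2]

print(len(by_label))     # 3
print(len(by_position))  # 2
```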

## DataFrame

If a Series is an analog of a one-dimensional array with flexible indices, a DataFrame is an analog of a two-dimensional array with both flexible row indices and flexible column names. Just as you might think of a two-dimensional array as an ordered sequence of aligned one-dimensional columns, you can think of a DataFrame as a sequence of aligned Series objects. Here, by “aligned” we mean that they share the same index.

### Constructing a DataFrame from two Series objects that share the same index

In [104]:
area_dict = {'California': 423967, 'Texas': 695662, 'New York': 141297, 'Florida': 170312, 'Illinois': 149995}

In [105]:
area = pd.Series(area_dict)
area

California    423967
Texas         695662
New York      141297
Florida       170312
Illinois      149995
dtype: int64

In [106]:
states = pd.DataFrame({'population':population, 'area':area})
states

            population    area
California    38332521  423967
Texas         26448193  695662
New York      19651127  141297
Florida       19552860  170312
Illinois      12882135  149995


In [107]:
states.index

Index(['California', 'Texas', 'New York', 'Florida', 'Illinois'], dtype='object')

In [108]:
states.values

array([[38332521,   423967],
       [26448193,   695662],
       [19651127,   141297],
       [19552860,   170312],
       [12882135,   149995]])

We can also think of a DataFrame as a specialization of a dictionary. Where
a dictionary maps a key to a value, a DataFrame maps a column name to a Series of
column data:

In [109]:
states['area']

California    423967
Texas         695662
New York      141297
Florida       170312
Illinois      149995
Name: area, dtype: int64
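The dictionary analogy can be checked directly. The sketch below rebuilds `states` and shows that membership tests and `keys()` operate on column names, not on row labels.

```python
import pandas as pd

population = pd.Series({'California': 38332521, 'Texas': 26448193,
                        'New York': 19651127, 'Florida': 19552860,
                        'Illinois': 12882135})
area = pd.Series({'California': 423967, 'Texas': 695662,
                  'New York': 141297, 'Florida': 170312,
                  'Illinois': 149995})
states = pd.DataFrame({'population': population, 'area': area})

# Like a dict, a DataFrame treats its column names as keys
has_area = 'area' in states
has_california = 'California' in states  # row labels are not keys
column_names = list(states.keys())

print(has_area)        # True
print(has_california)  # False
print(column_names)    # ['population', 'area']
```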

### A DataFrame can be constructed from a list of dictionaries

In [110]:
# Note that missing keys are filled with NaN values automatically
pd.DataFrame([{'a': 1, 'b': 2}, {'b': 3, 'c': 4}])

     a  b    c
0  1.0  2  NaN
1  NaN  3  4.0
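A side effect of the NaN filling is worth seeing once: columns that receive a NaN are stored as floating point, which is why `a` and `c` print as `1.0` and `4.0` while `b` stays an integer. A minimal sketch (the variable names are introduced here for illustration):

```python
import pandas as pd

df = pd.DataFrame([{'a': 1, 'b': 2}, {'b': 3, 'c': 4}])

# Columns with a missing key are upcast to float so they can hold NaN
dtype_kinds = {col: df[col].dtype.kind for col in df.columns}
missing_per_column = df.isnull().sum()

print(df.dtypes)
print(missing_per_column)
```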


### A DataFrame can be constructed from a two-dimensional NumPy array

In [111]:
import numpy as np
pd.DataFrame(np.random.rand(3, 2), columns=['foo', 'bar'], index=['a', 'b', 'c'])

        foo       bar
a  0.731920  0.212701
b  0.405440  0.147698
c  0.303749  0.337508


## Pandas Index object

Index can be thought of either as an immutable array or as an ordered set (technically a multiset, as Index objects may contain repeated values).

In [112]:
# Constructing an Index out of an array
ind = pd.Index([1,3,5,7,11])
ind

Int64Index([1, 3, 5, 7, 11], dtype='int64')
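Both views of an Index (array-like and immutable) can be illustrated with a short sketch. Reading by position and slicing work as with an array, while item assignment is rejected; the `raised` flag is a helper introduced only for this example.

```python
import pandas as pd

ind = pd.Index([1, 3, 5, 7, 11])

# Array-like reads work as expected
second = ind[1]
every_other = ind[::2]

# Item assignment is not supported because an Index is immutable
raised = False
try:
    ind[1] = 0
except TypeError as err:
    raised = True
    print('TypeError:', err)

print(second)             # 3
print(list(every_other))  # [1, 5, 11]
```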

Pandas objects are designed to facilitate operations such as joins across datasets, which depend on many aspects of set arithmetic. The Index object follows many of the conventions used by Python’s built-in set data structure, so that unions, intersections, differences, and other combinations can be computed in a familiar way:

In [113]:
indA = pd.Index([1, 3, 5, 7, 9])
indB = pd.Index([2, 3, 5, 7, 11])

In [114]:
# Index intersection
indA & indB

Int64Index([3, 5, 7], dtype='int64')

In [115]:
# Union
indA | indB

Int64Index([1, 2, 3, 5, 7, 9, 11], dtype='int64')

In [116]:
# XOR (symmetric difference)
indA ^ indB

Int64Index([1, 2, 9, 11], dtype='int64')
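The operator forms above reflect the pandas version used in this notebook (0.23.4). Newer pandas releases are understood to have moved away from using `&`, `|`, and `^` for set operations on Index objects, so the named methods are a more portable spelling. A sketch of the same three operations using methods:

```python
import pandas as pd

indA = pd.Index([1, 3, 5, 7, 9])
indB = pd.Index([2, 3, 5, 7, 11])

# Method equivalents of the set operations
common = indA.intersection(indB)
combined = indA.union(indB)
only_in_one = indA.symmetric_difference(indB)

print(list(common))       # [3, 5, 7]
print(list(combined))     # [1, 2, 3, 5, 7, 9, 11]
print(list(only_in_one))  # [1, 2, 9, 11]
```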

## Data selection

### The "loc" attribute allows indexing and slicing that always references the explicit index

In [117]:
population.loc['California']

38332521

In [118]:
states.loc['California']

population    38332521
area            423967
Name: California, dtype: int64

### The "iloc" attribute allows indexing and slicing that always references the implicit integer position

In [119]:
population.iloc[0]

38332521

In [120]:
states.iloc[0]

population    38332521
area            423967
Name: California, dtype: int64
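The distinction matters most when the explicit index is itself made of integers, because labels and positions can then disagree. A minimal sketch with a hypothetical Series whose labels are 1, 3, and 5:

```python
import pandas as pd

# Hypothetical data: integer labels that differ from the positions 0, 1, 2
s = pd.Series(['a', 'b', 'c'], index=[1, 3, 5])

label_item = s.loc[1]        # label 1 -> 'a'
position_item = s.iloc[1]    # position 1 -> 'b'
label_slice = s.loc[1:3]     # labels 1 through 3, stop included
position_slice = s.iloc[1:3] # positions 1 and 2, stop excluded

print(label_item, position_item)  # a b
print(list(label_slice))          # ['a', 'b']
print(list(position_slice))       # ['b', 'c']
```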

### Data selection in DataFrame

In [121]:
states.area

California    423967
Texas         695662
New York      141297
Florida       170312
Illinois      149995
Name: area, dtype: int64

In [122]:
states['area']

California    423967
Texas         695662
New York      141297
Florida       170312
Illinois      149995
Name: area, dtype: int64

In [131]:
# dictionary-style syntax can also be used to modify the object, in this case to add a new column
states['density'] = states['population']/states['area']
states

            population    area     density
California    38332521  423967   90.413926
Texas         26448193  695662   38.018740
New York      19651127  141297  139.076746
Florida       19552860  170312  114.806121
Illinois      12882135  149995   85.883763


In [124]:
states.values

array([[3.83325210e+07, 4.23967000e+05, 9.04139261e+01],
       [2.64481930e+07, 6.95662000e+05, 3.80187404e+01],
       [1.96511270e+07, 1.41297000e+05, 1.39076746e+02],
       [1.95528600e+07, 1.70312000e+05, 1.14806121e+02],
       [1.28821350e+07, 1.49995000e+05, 8.58837628e+01]])

In [138]:
# Selecting ranges of rows and columns by explicit index

states.loc['California':'New York', 'area':'density']

              area     density
California  423967   90.413926
Texas       695662   38.018740
New York    141297  139.076746


In [126]:
# Masking using explicit index

states.loc[states.density > 100, ['population', 'density']]

          population     density
New York    19651127  139.076746
Florida     19552860  114.806121


In [133]:
# Selecting ranges of rows and columns by implicit index

states.iloc[:3, [1,2]]

              area     density
California  423967   90.413926
Texas       695662   38.018740
New York    141297  139.076746


### DataFrame transposition

In [128]:
states

            population    area     density
California    38332521  423967   90.413926
Texas         26448193  695662   38.018740
New York      19651127  141297  139.076746
Florida       19552860  170312  114.806121
Illinois      12882135  149995   85.883763


In [129]:
states.T

            California       Texas    New York     Florida    Illinois
population  38332520.0  26448190.0  19651130.0  19552860.0  12882140.0
area          423967.0    695662.0    141297.0    170312.0    149995.0
density       90.41393    38.01874    139.0767    114.8061    85.88376


## DataFrame operations

### Index alignment in DataFrame

In [159]:
rng = np.random.RandomState(42)
A = pd.DataFrame(rng.randint(0, 20, (2,2)), columns=['a','b'])
A

    a   b
0   6  19
1  14  10


In [145]:
B = pd.DataFrame(rng.randint(0, 20, (3,3)), columns=['a','b','c'])
B

    a   b   c
0   7   6  18
1  10  10   3
2   7   2   1


In [148]:
C = A * B
C

       a      b   c
0   42.0  114.0 NaN
1  140.0  100.0 NaN
2    NaN    NaN NaN


In [158]:
# Slicing with a step (every other element)
C.iloc[0, ::2]

a    42.0
c     NaN
Name: 0, dtype: float64
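When NaN is not the desired result for positions that exist in only one operand, the method form of the operator accepts a `fill_value`. The sketch below hard-codes the values of `A` and `B` from the outputs above rather than regenerating them, and uses 1 as the fill value, which is an arbitrary choice for illustration (it is the neutral element for multiplication).

```python
import pandas as pd

# Values copied from the outputs of A and B shown above
A = pd.DataFrame([[6, 19], [14, 10]], columns=['a', 'b'])
B = pd.DataFrame([[7, 6, 18], [10, 10, 3], [7, 2, 1]], columns=['a', 'b', 'c'])

# Positions missing from A are treated as 1 before multiplying
C_filled = A.mul(B, fill_value=1)
print(C_filled)
```

Because `B` covers every position in the aligned result, no NaN remains here. A position missing from both operands would still come out as NaN.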

## Hierarchical data and MultiIndex

In [161]:
# raw data
index = [('California', 2000), ('California', 2010),('New York', 2000), ('New York', 2010),('Texas', 2000), ('Texas', 2010)]
populations = [33871648, 37253956,18976457, 19378102,20851820, 25145561]

In [164]:
# naive indexing
pop = pd.Series(populations, index=index)
pop

California  2000    33871648
            2010    37253956
New York    2000    18976457
            2010    19378102
Texas       2000    20851820
            2010    25145561
dtype: int64

In [163]:
# create a MultiIndex
index = pd.MultiIndex.from_tuples(index)
index

MultiIndex(levels=[['California', 'New York', 'Texas'], [2000, 2010]],
           labels=[[0, 0, 1, 1, 2, 2], [0, 1, 0, 1, 0, 1]])

In [166]:
# hierarchical structure
pop = pop.reindex(index)
pop

California  2000    33871648
            2010    37253956
New York    2000    18976457
            2010    19378102
Texas       2000    20851820
            2010    25145561
dtype: int64

In [169]:
# we can now extract data by index constituents
pop[:,2010]

California    37253956
New York      19378102
Texas         25145561
dtype: int64
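The same selection can be spelled more explicitly. The sketch below rebuilds the Series with `pd.MultiIndex.from_product` (an alternative constructor chosen here for brevity, not the one used above) and selects the 2010 values with `loc` and with `xs`. It also takes a partial slice on the first level.

```python
import pandas as pd

index = pd.MultiIndex.from_product([['California', 'New York', 'Texas'],
                                    [2000, 2010]])
pop = pd.Series([33871648, 37253956, 18976457,
                 19378102, 20851820, 25145561], index=index)

# All states, second level equal to 2010
pop_2010 = pop.loc[:, 2010]

# Cross-section on the second level gives the same values
pop_2010_xs = pop.xs(2010, level=1)

# Partial slice on the first level (the index is sorted, stop label included)
two_states = pop.loc['California':'New York']

print(pop_2010)
print(len(two_states))  # 4
```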

In [171]:
pop['California']

2000    33871648
2010    37253956
dtype: int64

In [173]:
# the Series above can be unstacked into a conventional 2D DataFrame
pop_df = pop.unstack()
pop_df

                2000      2010
California  33871648  37253956
New York    18976457  19378102
Texas       20851820  25145561


In [174]:
# we can convert this DataFrame back to a multi-indexed Series using a stack operation
pop_df.stack()

California  2000    33871648
            2010    37253956
New York    2000    18976457
            2010    19378102
Texas       2000    20851820
            2010    25145561
dtype: int64
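`unstack` also accepts a `level` argument that selects which index level becomes the columns. The following sketch rebuilds the Series (again using `pd.MultiIndex.from_product` for brevity), pivots on each level, and checks that stacking restores the original values; the variable names are introduced only for this illustration.

```python
import pandas as pd

index = pd.MultiIndex.from_product([['California', 'New York', 'Texas'],
                                    [2000, 2010]])
pop = pd.Series([33871648, 37253956, 18976457,
                 19378102, 20851820, 25145561], index=index)

pop_by_state = pop.unstack()        # rows: states, columns: years (default: last level)
pop_by_year = pop.unstack(level=0)  # rows: years, columns: states
roundtrip = pop_by_state.stack()    # back to a multi-indexed Series

print(pop_by_year)
```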

In [177]:
# We can add another column by building a DataFrame from the multi-indexed Series

pop_df = pd.DataFrame({'total': pop,'under18': [9267089, 9284094, 4687374, 4318033, 5906301, 6879014]})
pop_df

                    total  under18
California 2000  33871648  9267089
           2010  37253956  9284094
New York   2000  18976457  4687374
           2010  19378102  4318033
Texas      2000  20851820  5906301
           2010  25145561  6879014


In [179]:
# and then calculate statistics, e.g. the fraction of people under 18 years old

fraction_under_18yo = pop_df['under18'] / pop_df['total']
fraction_under_18yo

California  2000    0.273594
            2010    0.249211
New York    2000    0.247010
            2010    0.222831
Texas       2000    0.283251
            2010    0.273568
dtype: float64
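A per-group statistic like this is often easier to read as a table with one row per state. The sketch below assumes the share is defined as `under18` divided by `total` and unstacks the result by year; `share_under18` and `share_by_year` are names introduced only for this example.

```python
import pandas as pd

index = pd.MultiIndex.from_product([['California', 'New York', 'Texas'],
                                    [2000, 2010]])
pop = pd.Series([33871648, 37253956, 18976457,
                 19378102, 20851820, 25145561], index=index)
pop_df = pd.DataFrame({'total': pop,
                       'under18': [9267089, 9284094, 4687374,
                                   4318033, 5906301, 6879014]})

# Assumed definition: people under 18 as a share of the total population
share_under18 = pop_df['under18'] / pop_df['total']

# One row per state, one column per year
share_by_year = share_under18.unstack()
print(share_by_year)
```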