# Intro to Data Structures

Pandas is built on top of NumPy, which is what makes its array operations fast, and it has two main data structures: Series and DataFrames.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# use the full option name; the short 'max_columns' is ambiguous in newer pandas
pd.set_option('display.max_columns', 50)
%matplotlib inline

## Series

Series in pandas are one-dimensional objects (think arrays, lists, or columns in a table). A Series assigns a labeled index to each item. The default labels are the integers 0 to N - 1, where N is the length of the Series.

In [3]:
# create a Series with an arbitrary list, items are inserted inside []
s = pd.Series([123, 'Testing', 6.28, -1222334456, 'Potatoes'])
s

0            123
1        Testing
2           6.28
3    -1222334456
4       Potatoes
dtype: object
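
To make the default labels and the `dtype: object` line concrete, here is a small sketch. The `floats` Series is a made-up extra for comparison and is not part of the original notebook.

```python
import pandas as pd

s = pd.Series([123, 'Testing', 6.28, -1222334456, 'Potatoes'])

# The default index runs from 0 to len(s) - 1.
print(list(s.index))    # [0, 1, 2, 3, 4]

# Mixed Python types fall back to the generic object dtype.
print(s.dtype)          # object

# Hypothetical comparison: a list of floats only gets a numeric dtype.
floats = pd.Series([1.5, 2.5, 3.5])
print(floats.dtype)     # float64
```

Numeric dtypes are what let pandas hand the work to NumPy, so it usually pays to keep a Series to one type.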

You can also specify an index label for each item in the series:

In [4]:
s = pd.Series([123, 'Testing', 6.28, -1222334456, 'Potatoes'],
             index=['A','B','C','D','E'])
s

A            123
B        Testing
C           6.28
D    -1222334456
E       Potatoes
dtype: object

You can also construct a Series from a Python dictionary, using the keys as labels (cool!):

In [5]:
d = {'Chicago': 1000, 'New York': 1300, 'Portland': 900, 'San Francisco': 1100,
     'Austin': 450, 'Boston': None}
cities = pd.Series(d)
cities

Austin            450.0
Boston              NaN
Chicago          1000.0
New York         1300.0
Portland          900.0
San Francisco    1100.0
dtype: float64
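
Two things are worth noticing in that output: `None` became `NaN`, and the integers became floats so the whole Series could share one numeric dtype. A minimal sketch with a hypothetical two-city dictionary shows the same behavior. Note also that recent pandas versions keep the dictionary's insertion order instead of sorting the keys, so the row order you see may differ from the sorted output above.

```python
import pandas as pd

# Hypothetical two-city dictionary; None marks a missing value.
d_small = {'Boston': None, 'Chicago': 1000}
small = pd.Series(d_small)

# None is stored as NaN, which forces a float dtype.
print(small.dtype)
print(small.isnull()['Boston'])   # True
print(small['Chicago'])           # 1000.0
```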

You can then select items in a Series directly by their index label:

In [6]:
cities['New York']

1300.0

In [8]:
# note the double brackets: the outer [] selects, the inner [] is a list of labels
cities[['New York', 'Portland', 'Chicago']]

New York    1300.0
Portland     900.0
Chicago     1000.0
dtype: float64
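
Plain `[]` selection works by label here. If you want to be explicit about label versus position, pandas also offers `.loc` and `.iloc`. The sketch below uses a cut-down, three-city version of `cities` (an assumption made to keep the example self-contained).

```python
import pandas as pd

# Cut-down version of the cities Series, for illustration only.
cities = pd.Series({'Chicago': 1000.0, 'New York': 1300.0, 'Portland': 900.0})

# .loc selects by label, the same lookup as cities['New York'].
print(cities.loc['New York'])    # 1300.0

# .iloc selects by integer position, whatever the labels are.
print(cities.iloc[0])            # 1000.0

# A list of labels returns a new Series in the order requested.
subset = cities.loc[['Portland', 'Chicago']]
print(list(subset.index))        # ['Portland', 'Chicago']
```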

You can also select items using boolean logic:

In [9]:
cities[cities < 1000]

Austin      450.0
Portland    900.0
dtype: float64

That looks a bit weird, but `cities < 1000` actually returns a Series of True/False values, one per city, marking whether that city's value is less than 1000. We then pass that Series of True/False values to `cities` to get just the matching values.

In [10]:
less_than_1000 = cities < 1000
print(less_than_1000)
print('\n')
print(cities[less_than_1000])

Austin            True
Boston           False
Chicago          False
New York         False
Portland          True
San Francisco    False
dtype: bool


Austin      450.0
Portland    900.0
dtype: float64
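
Boolean Series can be combined before they are used as a filter. This is a sketch, not part of the original notebook: it rebuilds the starting `cities` values and uses `&` (and) and `~` (not). Each comparison needs its own parentheses, because `&` binds more tightly than `<`.

```python
import numpy as np
import pandas as pd

cities = pd.Series({'Austin': 450.0, 'Boston': np.nan, 'Chicago': 1000.0,
                    'New York': 1300.0, 'Portland': 900.0,
                    'San Francisco': 1100.0})

# Both conditions must hold for a city to be kept.
mid_sized = cities[(cities >= 900) & (cities <= 1100)]
print(list(mid_sized.index))   # ['Chicago', 'Portland', 'San Francisco']

# Comparisons against NaN are always False, so negating the mask
# lets Boston through even though it has no value.
not_small = cities[~(cities < 1000)]
print(list(not_small.index))   # ['Boston', 'Chicago', 'New York', 'San Francisco']
```

The Boston case is a common surprise: `~(cities < 1000)` is not the same filter as `cities >= 1000` when missing values are present.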


We can also edit Series values after creation:

In [11]:
# changing based on the index
print('Old value:', cities['Chicago'])
cities['Chicago'] = 1400
print('New value:', cities['Chicago'])

Old value: 1000.0
New value: 1400.0


In [16]:
# changing values using boolean logic
print(cities[cities < 1000])
print('\n')
cities[cities < 1000] = 750

print(cities[cities < 1000])

Austin      450.0
Portland    900.0
dtype: float64


Austin      750.0
Portland    750.0
dtype: float64
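
The assignment above changes `cities` in place. If you would rather keep the original untouched, one option is `Series.mask`, which returns a new Series with the matching values replaced. The sketch below assumes a four-city version of `cities` and shows both styles.

```python
import numpy as np
import pandas as pd

cities = pd.Series({'Austin': 450.0, 'Boston': np.nan,
                    'Chicago': 1400.0, 'Portland': 900.0})

# Non-mutating: mask() returns a copy with values replaced where the condition is True.
capped = cities.mask(cities < 1000, 750)
print(cities['Austin'])   # 450.0 (original unchanged)
print(capped['Austin'])   # 750.0

# Mutating: the same edit written with an explicit .loc.
cities.loc[cities < 1000] = 750
print(cities['Portland'])  # 750.0
```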


Pandas allows for idiomatic Python (rad). For example, let's see if an item is in our Series:

In [17]:
print('Seattle' in cities)
print('San Francisco' in cities)

False
True
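
One catch: `in` checks the index labels, not the values, much like `in` on a dictionary checks keys. A small sketch with a two-city Series (an assumption for brevity) shows the difference and two ways to test the values instead.

```python
import pandas as pd

cities = pd.Series({'Austin': 750.0, 'Chicago': 1400.0})

# `in` looks at the labels...
print('Austin' in cities)           # True
# ...so a value is not found this way.
print(750.0 in cities)              # False

# To test the values, look at .values or use isin().
print(750.0 in cities.values)       # True
print(cities.isin([750.0]).any())   # True
```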


Math can also be done on Series using scalars and functions:

In [18]:
# divide city values by 3
cities / 3

Austin           250.000000
Boston                  NaN
Chicago          466.666667
New York         433.333333
Portland         250.000000
San Francisco    366.666667
dtype: float64

In [19]:
# square city values
np.square(cities)

Austin            562500.0
Boston                 NaN
Chicago          1960000.0
New York         1690000.0
Portland          562500.0
San Francisco    1210000.0
dtype: float64
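
In both outputs the labels are kept and Boston stays `NaN`: operations run element by element and missing values pass straight through. Here is a minimal sketch of the same idea with made-up values chosen so the results are easy to check by hand.

```python
import numpy as np
import pandas as pd

# Hypothetical values, picked to give round results.
values = pd.Series({'Austin': 400.0, 'Boston': np.nan, 'Chicago': 900.0})

# Scalar arithmetic applies to every element.
doubled = values * 2
print(doubled['Austin'])            # 800.0

# NumPy functions return a Series with the same labels.
roots = np.sqrt(values)
print(type(roots).__name__)         # Series
print(roots['Chicago'])             # 30.0

# Missing values stay missing.
print(np.isnan(roots['Boston']))    # True
```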

You can also add or subtract two Series. The index of the result is the union of the two indexes: where a label appears in both Series, the values are added or subtracted; where a label appears in only one of them, the result holds NaN for that label.

In [20]:
print(cities[['Chicago', 'New York', 'Portland']])
print('\n')
print(cities[['Austin', 'New York']])
print('\n')
print(cities[['Chicago', 'New York', 'Portland']] + cities[['Austin', 'New York']])

Chicago     1400.0
New York    1300.0
Portland     750.0
dtype: float64


Austin       750.0
New York    1300.0
dtype: float64


Austin         NaN
Chicago        NaN
New York    2600.0
Portland       NaN
dtype: float64
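
If you would rather treat a missing label as zero than get `NaN`, the `add` method takes a `fill_value` argument. This sketch rebuilds the two slices from the cell above as standalone Series named `a` and `b` (names chosen here for illustration).

```python
import pandas as pd

a = pd.Series({'Chicago': 1400.0, 'New York': 1300.0, 'Portland': 750.0})
b = pd.Series({'Austin': 750.0, 'New York': 1300.0})

# Plain + leaves NaN wherever a label is missing from one side.
plain = a + b
print(plain.isnull().sum())     # 3

# add() with fill_value=0 substitutes 0 for the missing side before adding.
total = a.add(b, fill_value=0)
print(total['Austin'])          # 750.0
print(total['New York'])        # 2600.0
```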


Null (NaN) checking can be performed on Series using `isnull` and `notnull`.

In [21]:
# returns a boolean series indicating which values aren't NULL
cities.notnull()

Austin            True
Boston           False
Chicago           True
New York          True
Portland          True
San Francisco     True
dtype: bool

In [22]:
# use boolean logic to grab the NULL cities
print(cities[cities.isnull()])

Boston   NaN
dtype: float64
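
Once you have found the nulls you usually want to do something about them. Two common options are `fillna`, which replaces them, and `dropna`, which removes them. The sketch below uses a three-city version of `cities`; both methods return a new Series and leave the original alone.

```python
import numpy as np
import pandas as pd

cities = pd.Series({'Austin': 750.0, 'Boston': np.nan, 'Chicago': 1400.0})

# Replace missing values with a default.
filled = cities.fillna(0)
print(filled['Boston'])           # 0.0

# Or drop the rows with missing values entirely.
dropped = cities.dropna()
print(list(dropped.index))        # ['Austin', 'Chicago']

# The original still has its NaN.
print(cities.isnull().sum())      # 1
```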


## Data Frames

That's it for Series, let's move on to Data Frames.

A DataFrame is a tabular data structure made up of rows and columns. You can also think of a DataFrame as a group of Series objects that share a row index, with each Series stored under a column name.
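
To see the "group of Series" idea in action, you can build a DataFrame from a dictionary of Series. This is only a sketch: the column names `first` and `second` and all of the numbers are made up for illustration.

```python
import pandas as pd

# Hypothetical Series with partly overlapping labels.
first = pd.Series({'Austin': 750.0, 'Chicago': 1400.0})
second = pd.Series({'Chicago': 4.0, 'Portland': 5.0})

# Each dictionary key becomes a column name; the row index is the union of labels.
df = pd.DataFrame({'first': first, 'second': second})
print(df)

# Pulling a column back out gives you a Series again.
print(type(df['first']).__name__)      # Series
print(df.loc['Chicago', 'first'])      # 1400.0

# Labels missing from one Series are filled with NaN in that column.
print(pd.isnull(df.loc['Portland', 'first']))   # True
```

The labels are aligned the same way as when adding two Series: the rows are the union of both indexes, and gaps become `NaN`.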