# Agenda

1. Pandas in general
2. Series
    - Creating 
    - Retrieving from them
    - Methods
    - Working with `nan`
    - Broadcasting
    - Indexes
3. Data frames
    - Creating
    - Retrieving from them
    - Applying methods across all columns
4. Reading data from files
    - CSV
    - (Excel, a little)
    - JSON
    - Retrieving from the network

# Pandas data structures

There are two main data structures you need to know about with Pandas:

- `Series` -- basically a 1D NumPy array
- `DataFrame` -- basically a 2D NumPy array

The rows of a data frame are going to be equivalent to the rows of a NumPy array.

The columns of a data frame are all going to be Series objects.
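
We'll get to data frames later, but here's a quick preview of that last point. This is only a sketch: the names `df` and `col` and the values in them are made up for illustration.

```python
import pandas as pd

# A tiny, made-up data frame: two columns, three rows
df = pd.DataFrame({'x': [10, 20, 30],
                   'y': [1.5, 2.5, 3.5]})

col = df['x']              # pull out one column by its name

print(type(col))           # the column is a Series
print(col.ndim, df.ndim)   # 1 dimension for the series, 2 for the data frame
```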

# Installing pandas

You can install pandas, like many other Python packages, with `pip` (a command-line program, not to be run inside of Python or Jupyter):

    python3 -m pip install -U pandas
    
Once it's installed, we'll want to load it into our program.

In [2]:
import numpy as np
import pandas as pd
from pandas import Series, DataFrame

In [3]:
# let's create a Pandas Series

s = Series([10, 20, 30, 40, 50, 60, 70])

In [4]:
type(s)

pandas.core.series.Series

In [5]:
s

0    10
1    20
2    30
3    40
4    50
5    60
6    70
dtype: int64

In [6]:
# I can do many of the same things with a series that I did with NumPy arrays

s[0]

10

In [7]:
s[3]

40

In [8]:
s.sum()

280

In [9]:
s.mean()

40.0

In [10]:
s = Series([10, 20, 30, 40, 50, 60, 70], dtype=np.float64)

In [11]:
s

0    10.0
1    20.0
2    30.0
3    40.0
4    50.0
5    60.0
6    70.0
dtype: float64

In [12]:
# behind the scenes, we can find a NumPy array
s.values

array([10., 20., 30., 40., 50., 60., 70.])
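
Besides the `.values` attribute, a series also has a `to_numpy()` method that hands back the underlying data as a NumPy array. Here's a minimal sketch; `s_demo` and `arr` are names I made up for this example.

```python
import numpy as np
import pandas as pd

s_demo = pd.Series([10, 20, 30], dtype=np.float64)

arr = s_demo.to_numpy()   # the data as a plain NumPy array, without the index

print(type(arr))
print(arr.dtype)
```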

In [13]:
# I can create a Pandas Series based on a 1D NumPy array

np.random.seed(0)
s = Series(np.random.randint(0, 100, 10))

s

0    44
1    47
2    64
3    67
4    67
5     9
6    83
7    21
8    36
9    87
dtype: int64

In [14]:
s.min()

9

In [15]:
s.max()

87

In [16]:
s.mean()

52.5

In [17]:
s.std()

25.67424130654432

In [18]:
s.sum()

525

In [20]:
s.count()   # how many non-nan values are there in this series?

10
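
Our series has no `nan` values, so `count` and the length are the same here. To see the difference, here's a small sketch with a made-up series (`s_nan`) that does contain `nan`:

```python
import numpy as np
import pandas as pd

s_nan = pd.Series([10, np.nan, 30, np.nan, 50])

print(len(s_nan))      # 5: every element, nan included
print(s_nan.count())   # 3: only the non-nan elements
print(s_nan.sum())     # 90.0: nan values are skipped by default
```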

In [21]:
# can I create a boolean series, and use it as a mask index? YES!

s < 30

0    False
1    False
2    False
3    False
4    False
5     True
6    False
7     True
8    False
9    False
dtype: bool

In [22]:
s[s<30]  # notice that when I filter the elements of s, the items I get back retain their original indexes

5     9
7    21
dtype: int64

In [23]:
s[s<s.mean()]  # retrieve values less than the mean

0    44
1    47
5     9
7    21
8    36
dtype: int64
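
Just as with NumPy, boolean series can be combined with `&` (and), `|` (or), and `~` (not). Here's a sketch using the same ten values, typed in by hand; the names `s_demo`, `middle`, `extremes`, and `not_small` are mine.

```python
import pandas as pd

s_demo = pd.Series([44, 47, 64, 67, 67, 9, 83, 21, 36, 87])

# Each comparison needs its own parentheses, because & and | bind
# more tightly than < and >
middle = s_demo[(s_demo > 20) & (s_demo < 50)]     # both conditions must hold
extremes = s_demo[(s_demo < 20) | (s_demo > 80)]   # either condition may hold
not_small = s_demo[~(s_demo < 30)]                 # flip the mask with ~

print(middle)
print(extremes)
print(not_small)
```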

# Ways to retrieve from a series

1. We've now seen that we can retrieve from a series using `[]`, just as we did with NumPy arrays.  But I'd like you to break this habit, ASAP.  (When we get to data frames, this will make more sense, and you won't regret it.)

2. We can also retrieve from a series using the `.loc` accessor and `[]`.  This lets us retrieve via the index.

3. We can also retrieve from a series using the `.iloc` accessor and `[]`.  This lets us retrieve via the position in the series, regardless of the index, starting at 0.

In [24]:
a = s[s<s.mean()]

In [25]:
a   # this is a series containing 5 elements.

0    44
1    47
5     9
7    21
8    36
dtype: int64

In [26]:
# I can retrieve via the index using just [].  Again: Probably not a good idea, but it will work!
a[5]

9

In [27]:
a[8]

36

In [28]:
a[0]

44

In [29]:
a[[1, 7]]   # fancy indexing works here, too!

1    47
7    21
dtype: int64

In [30]:
# Better than [] alone is retrieving with .loc, which lets us specify the index

a.loc[5]

9

In [31]:
a.loc[8]

36

In [32]:
a.loc[0]

44

In [33]:
a.loc[[1,7]]

1    47
7    21
dtype: int64

In [36]:
# Another way to retrieve is with .iloc, where we specify the numeric position in the series,
# starting with 0.

a.iloc[3]

21

In [37]:
a

0    44
1    47
5     9
7    21
8    36
dtype: int64

In [38]:
a.loc[7]    # same as a.iloc[3]

21

In [39]:
a.iloc[2]

9

In [40]:
a.loc[2]

KeyError: 2
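
If you aren't sure whether a label exists, you can check the index before asking `.loc` for it. A minimal sketch, rebuilding the same filtered series by hand under the made-up name `a_demo`:

```python
import pandas as pd

a_demo = pd.Series([44, 47, 9, 21, 36], index=[0, 1, 5, 7, 8])

print(2 in a_demo.index)   # False: no element has the label 2
print(5 in a_demo.index)   # True

try:
    a_demo.loc[2]          # asking for a missing label raises KeyError
except KeyError:
    print('no such label')

print(a_demo.iloc[2])      # position 2 does exist: it holds the value 9
```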

In [41]:
a

0    44
1    47
5     9
7    21
8    36
dtype: int64

In [43]:
# is there a way for me to set the index when I create a series?
# yes!

s = Series([10, 20, 30, 40, 50],
          index=[2,4,6,8,10])      # index and values must be of the same length!

In [44]:
s

2     10
4     20
6     30
8     40
10    50
dtype: int64

In [45]:
s.loc[4]  # this retrieves the item whose index is 4

20

In [46]:
s.iloc[4]  # this retrieves the item whose position is 4

50

In [47]:
s = Series([10, 20, 30, 40, 50],
          index=list('abcde'))    # indexes can be strings, including 1-character strings!

s

a    10
b    20
c    30
d    40
e    50
dtype: int64

In [48]:
s.loc['d']

40

In [49]:
s.iloc[3]

40

In [51]:
# we can retrieve multiple values with fancy indexing
# we'll get a series back as a result (passing a list always returns a series, even if the list has one item)

s.loc[['b', 'd']] # fancy indexing with strings

b    20
d    40
dtype: int64

In [52]:
# I can even use a slice!

s.loc['b':'d']   # in general, Python slices go from start up to (but not including) end -- but label slices with .loc are INCLUSIVE

b    20
c    30
d    40
dtype: int64
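
The inclusive behavior applies to label slices with `.loc`. Position slices with `.iloc` follow the usual Python rule and stop before the end. Here's a small sketch of the two side by side; `s_demo`, `by_label`, and `by_position` are names I made up.

```python
import pandas as pd

s_demo = pd.Series([10, 20, 30, 40, 50], index=list('abcde'))

by_label = s_demo.loc['b':'d']    # labels b, c, d: the end label is included
by_position = s_demo.iloc[1:3]    # positions 1 and 2: the end position is excluded

print(by_label)
print(by_position)
```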

# Exercises: Series

1. Create a series containing 10 random integers from 0-100, with index `a` - `j`.
2. Retrieve the item at index `b`.
3. Retrieve the items at indexes `c`, `d`, and `f`.
4. What is the mean of the items at indexes `a`, `e`, `g`, and `h`?
5. What is the sum of the even values?
6. What is the sum of the items whose positions (`.iloc`) are even?

In [56]:
np.random.seed(0)
s = Series(np.random.randint(0, 100, 10),
          index=list('abcdefghij'))
s

a    44
b    47
c    64
d    67
e    67
f     9
g    83
h    21
i    36
j    87
dtype: int64

In [58]:
# retrieve the item at index b

s.loc['b']

47

In [59]:
# Retrieve the items at indexes c, d, and f.

s.loc[['c', 'd', 'f']]   # fancy indexing -- pass a list of indexes to .loc[]

c    64
d    67
f     9
dtype: int64

In [60]:
# Retrieve the items at indexes c, d, and e

s.loc['c':'e']  # one set of square brackets here

c    64
d    67
e    67
dtype: int64

In [63]:
# What is the mean of the items at indexes a, e, g, and h?

s.loc[['a', 'e', 'g', 'h']].mean()

53.75

In [66]:
# What is the sum of the even values?

s.loc[s%2==0].sum()

144

In [67]:
# What is the sum of the items whose positions (.iloc) are even?

s.iloc[[0,2,4,6,8]]

a    44
c    64
e    67
g    83
i    36
dtype: int64

In [69]:
s.iloc[range(0, 10, 2)]   # from 0, up to (not including) 10, skipping by 2

a    44
c    64
e    67
g    83
i    36
dtype: int64

In [70]:
s.iloc[range(0, 10, 2)].sum()

294
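
Another way to get the even positions is a slice with a step, which saves us from having to know the length of the series. A sketch with the same values typed in by hand (the names `s_demo`, `even_positions`, and `odd_positions` are mine):

```python
import pandas as pd

s_demo = pd.Series([44, 47, 64, 67, 67, 9, 83, 21, 36, 87],
                   index=list('abcdefghij'))

even_positions = s_demo.iloc[::2]    # positions 0, 2, 4, ... to the end
odd_positions = s_demo.iloc[1::2]    # positions 1, 3, 5, ... to the end

print(even_positions.sum())
print(odd_positions.sum())
```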

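# How is Pandas different from NumPy?

NumPy combines two arrays by position. Pandas combines two series by index: it matches up the labels first, and any label that appears on only one side gives us `nan` in the result.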


In [71]:
np.random.seed(0)
s1 = Series(np.random.randint(0, 100, 10), 
           index=list('abcdefghij'))
s2 = Series(np.random.randint(0, 100, 10), 
           index=list('defghijklm'))


In [72]:
s1

a    44
b    47
c    64
d    67
e    67
f     9
g    83
h    21
i    36
j    87
dtype: int64

In [73]:
s2

d    70
e    88
f    88
g    12
h    58
i    65
j    39
k    87
l    46
m    88
dtype: int64

In [74]:
s1 + s1  # add it to itself

a     88
b     94
c    128
d    134
e    134
f     18
g    166
h     42
i     72
j    174
dtype: int64

In [75]:
s1 + s2  # what happens here?

a      NaN
b      NaN
c      NaN
d    137.0
e    155.0
f     97.0
g     95.0
h     79.0
i    101.0
j    126.0
k      NaN
l      NaN
m      NaN
dtype: float64

In [76]:
s3 = Series([10, 20, 30, 40], index=list('abcd'))

s1 + s3

a     54.0
b     67.0
c     94.0
d    107.0
e      NaN
f      NaN
g      NaN
h      NaN
i      NaN
j      NaN
dtype: float64
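
The matching is done by index label, not by position. One way to convince yourself is to add two series that have the same labels in a different order, and compare that with adding the underlying NumPy arrays. This is a sketch with made-up values; `left`, `right`, `total`, and `by_position` are my own names.

```python
import pandas as pd

left = pd.Series([1, 2, 3], index=list('abc'))
right = pd.Series([100, 200, 300], index=list('cba'))   # same labels, reversed order

total = left + right                               # Pandas pairs a with a, b with b, c with c
by_position = left.to_numpy() + right.to_numpy()   # NumPy pairs first with first, and so on

print(total.loc['a'], total.loc['b'], total.loc['c'])
print(by_position)
```

Every label found a partner here, so no `nan` appears. In the examples above the results switched to `float64` because `nan` is a float value.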

In [77]:
# how can I get around this, if I want to?

s1 + s2   # Python translates this to s1.__add__(s2).  Pandas also gives us an equivalent method, s1.add(s2)

a      NaN
b      NaN
c      NaN
d    137.0
e    155.0
f     97.0
g     95.0
h     79.0
i    101.0
j    126.0
k      NaN
l      NaN
m      NaN
dtype: float64

In [78]:
s1.add(s2)

a      NaN
b      NaN
c      NaN
d    137.0
e    155.0
f     97.0
g     95.0
h     79.0
i    101.0
j    126.0
k      NaN
l      NaN
m      NaN
dtype: float64

In [79]:
# we can pass a value to the "fill_value" keyword argument in s1.add, telling it what
# value to use if the index isn't matched on the other side.

s1.add(s2, fill_value=0)

a     44.0
b     47.0
c     64.0
d    137.0
e    155.0
f     97.0
g     95.0
h     79.0
i    101.0
j    126.0
k     87.0
l     46.0
m     88.0
dtype: float64
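
Notice that the result is still `float64`, even though no `nan` is left. If you want integers back, you can convert with `astype`. Here's a small sketch with made-up series (`x` and `y`); it is one way to do it, not the only one.

```python
import pandas as pd

x = pd.Series([1, 2, 3], index=list('abc'))
y = pd.Series([10, 20, 30], index=list('bcd'))

plain = x + y                      # a and d have no partner, so they become nan
filled = x.add(y, fill_value=0)    # a missing partner is treated as 0
as_ints = filled.astype('int64')   # safe only because no nan remains

print(plain)
print(filled)
print(as_ints)
```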

# Exercises: Indexing and series

1. Create two series, each with 10 random integers from 0-100.  The first should have an index of a-j. The second should have an index of b-k.
2. Before adding them together, predict which indexes will give you `nan` results. Then add them and check.
3. Add them together, with 0 being a fill value.  Can you use a different fill value?
4. Multiply them together, with 1 being a fill value. 
5. Sum all of the even numbers together from both of the series.

In [None]:
np.random.seed(0)

s1 = Series(np.random.randint(0, 100, 10), 
           index=list('abcdefghij'))   # exercise 1: the first series has index a-j