# Agenda

1. Pandas in general
2. Series
    - Creating
    - Retrieving from them
    - Methods
    - Working with `nan`
    - Broadcasting
    - Indexes
3. Data frames
    - Creating
    - Retrieving from them
    - Applying methods across all columns
4. Reading data from files
    - CSV
    - (Excel, a little)
    - JSON
    - Retrieving from the network

# Pandas data structures

There are two main data structures you need to know about with Pandas:

- `Series` -- basically a 1D NumPy array
- `DataFrame` -- basically a 2D NumPy array

The rows of a data frame correspond to the rows of a 2D NumPy array.

Each column of a data frame is a `Series` object.
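
A minimal sketch of how the two structures relate, using a made-up data frame (the column names `a` and `b` and their values are only for illustration):

```python
import pandas as pd

# hypothetical data: three rows, two columns
df = pd.DataFrame({'a': [10, 20, 30], 'b': [1.5, 2.5, 3.5]})

col = df['a']            # retrieving one column gives back a Series

print(type(col))         # the Series class
print(df.shape)          # (rows, columns)
print(col.to_numpy())    # the column's values as a 1D NumPy array
```

One difference from a plain 2D NumPy array: each column is its own series, so each column can have its own dtype.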

# Installing pandas

You can install pandas, like many other Python packages, with `pip`. This is a command-line program, so run it in a terminal rather than inside Python or Jupyter:

    python3 -m pip install -U pandas
    
Once it is installed, we'll want to load it into our program.

In [2]:
import numpy as np
import pandas as pd
from pandas import Series, DataFrame

In [3]:
# let's create a Pandas Series

s = Series([10, 20, 30, 40, 50, 60, 70])

In [4]:
type(s)

pandas.core.series.Series

In [5]:
s

0    10
1    20
2    30
3    40
4    50
5    60
6    70
dtype: int64

In [6]:
# I can do many of the same things with a series that I did with NumPy arrays

s[0]

10

In [7]:
s[3]

40

In [8]:
s.sum()

280

In [9]:
s.mean()

40.0

In [10]:
s = Series([10, 20, 30, 40, 50, 60, 70], dtype=np.float64)

In [11]:
s

0    10.0
1    20.0
2    30.0
3    40.0
4    50.0
5    60.0
6    70.0
dtype: float64

In [12]:
# behind the scenes, we can find a NumPy array
s.values

array([10., 20., 30., 40., 50., 60., 70.])

In [13]:
# I can create a Pandas Series based on a 1D NumPy array

np.random.seed(0)
s = Series(np.random.randint(0, 100, 10))

s

0    44
1    47
2    64
3    67
4    67
5     9
6    83
7    21
8    36
9    87
dtype: int64

In [14]:
s.min()

9

In [15]:
s.max()

87

In [16]:
s.mean()

52.5

In [17]:
s.std()

25.67424130654432

In [18]:
s.sum()

525

In [20]:
s.count()   # how many non-nan values are there in this series?

10

In [21]:
# can I create a boolean series, and use it as a mask index? YES!

s < 30

0    False
1    False
2    False
3    False
4    False
5     True
6    False
7     True
8    False
9    False
dtype: bool

In [22]:
s[s<30]  # notice that when I filter the elements of s, the items I get back retain their original indexes

5     9
7    21
dtype: int64

In [23]:
s[s<s.mean()]  # retrieve values less than the mean

0    44
1    47
5     9
7    21
8    36
dtype: int64
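
Here is a small sketch of the same masking idea with the mask stored in a variable. The values are typed in by hand to match the series above, and the name `mask` is just for illustration:

```python
import pandas as pd

# same values as the random series above, typed in by hand
s = pd.Series([44, 47, 64, 67, 67, 9, 83, 21, 36, 87])

mask = s < 30      # boolean series, one True/False per element
small = s[mask]    # keeps only the elements where mask is True

print(mask.sum())          # number of True values in the mask
print(list(small.index))   # the original index labels are retained
```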

# Ways to retrieve from a series

1. We've now seen that we can retrieve from a series using `[]`, just as we did with NumPy arrays.  But I'd like you to break this habit, ASAP.  (When we get to data frames, this will make more sense, and you won't regret it.)

2. We can also retrieve from a series using the `.loc` accessor and `[]`.  This lets us retrieve via the index.

3. We can also retrieve from a series using the `.iloc` accessor and `[]`.  This lets us retrieve via the position in the series, starting at 0, regardless of the index.
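
The difference between the last two is easiest to see on a series whose index labels don't match its positions. This is a sketch with a made-up series `t` (the values and the shuffled index are assumptions for illustration only):

```python
import pandas as pd

# hypothetical series: the index labels are deliberately out of order
t = pd.Series([100, 200, 300], index=[2, 0, 1])

by_label = t.loc[0]       # the element whose index label is 0
by_position = t.iloc[0]   # the first element, whatever its label

print(by_label)
print(by_position)
```

With the default index (0, 1, 2, ...), labels and positions coincide, which is why the difference is easy to miss.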

In [24]:
a = s[s<s.mean()]

In [25]:
a   # this is a series containing 5 elements.

0    44
1    47
5     9
7    21
8    36
dtype: int64

In [26]:
# I can retrieve via the index using just [].  Again: Probably not a good idea, but it will work!
a[5]

9

In [27]:
a[8]

36

In [28]:
a[0]

44

In [29]:
a[[1, 7]]   # fancy indexing works here, too!

1    47
7    21
dtype: int64

In [30]:
# Better than [] alone is retrieving with .loc, which lets us specify the index

a.loc[5]

9

In [31]:
a.loc[8]

36

In [32]:
a.loc[0]

44

In [33]:
a.loc[[1,7]]

1    47
7    21
dtype: int64

In [36]:
# Another way to retrieve is with .iloc, where we specify the numeric position in the series,
# starting with 0.

a.iloc[3]

21

In [37]:
a

0    44
1    47
5     9
7    21
8    36
dtype: int64

In [38]:
a.loc[7] # same as a.iloc[3]

21
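
To connect the two accessors, this sketch looks up the label stored at a given position and confirms that both routes reach the same value. It rebuilds `a` by hand with the same values and index shown above; `a.index` is the index object of the series.

```python
import pandas as pd

# the filtered series from above, rebuilt by hand
a = pd.Series([44, 47, 9, 21, 36], index=[0, 1, 5, 7, 8])

label = a.index[3]                 # the label stored at position 3
print(label)
print(a.loc[label] == a.iloc[3])   # both retrieve the same element
```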