# Agenda

1. Recap
2. Address book
3. More on reading from and writing to files
4. Cleaning data with `nan` and interpolating
5. Analysis with data frames
    - Cutting and categorizing
    - Sorting
    - Grouping
    - Concatenating data frames
    - Joining data frames
    

# Recap

When we use Pandas, we're mainly using two different data structures:

- Series, which is basically a 1D NumPy array with a nice set of wrappers around it.  Each series has a single dtype.  Pandas usually infers the dtype correctly from the data, but you can set it explicitly, just as you did with NumPy arrays.
- Data frame, which is basically a glorified 2D NumPy array.  Each column in a data frame is a separate series, which means that each column has its own dtype.
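
Here is a minimal sketch of the dtype behavior described above. The names `s`, `s_float`, and `df_demo` are made up for illustration and aren't part of the session below.

```python
import numpy as np
import pandas as pd

# Pandas infers the dtype from the data...
s = pd.Series([10, 20, 30])

# ...but we can also set it explicitly, just as with a NumPy array
s_float = pd.Series([10, 20, 30], dtype=np.float64)

# each column of a data frame is its own series, with its own dtype
df_demo = pd.DataFrame({'n': [1, 2, 3],
                        'x': [1.5, 2.5, 3.5]})

print(s.dtype)         # an integer dtype, inferred
print(s_float.dtype)   # float64, because we asked for it
print(df_demo.dtypes)  # one dtype per column
```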

Both a series and a data frame have an *index*, which describes the rows. An index can contain any type of values at all -- integers, strings, dates, or anything else.  Integers and strings are most common.  The values can even repeat.
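
To see what a repeating index means in practice, here is a small sketch. The series `s_rep` is hypothetical and exists only for this illustration.

```python
import pandas as pd

# a string index in which the label 'a' appears twice
s_rep = pd.Series([10, 20, 30], index=['a', 'b', 'a'])

# .loc with a repeated label returns every matching row, as a series
print(s_rep.loc['a'])

# .loc with a unique label returns a single value
print(s_rep.loc['b'])
```

A repeated label is legal, but it does mean that `.loc` may hand back a series where you expected a single value.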

A data frame, in addition to an index, has a `columns` attribute, which holds the names of the columns.

We can retrieve values from either a series or a data frame by index label using `.loc`, or by numeric position using `.iloc`.
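
As a quick sketch of the difference between label-based and position-based retrieval, assuming a small, made-up data frame called `df_small`:

```python
import pandas as pd

df_small = pd.DataFrame({'a': [10, 20, 30],
                         'b': [40, 50, 60]},
                        index=['x', 'y', 'z'])

print(df_small.index)     # the row labels
print(df_small.columns)   # the column names

by_label = df_small.loc['y']      # the row whose index label is 'y'
by_position = df_small.iloc[1]    # the row at position 1 (the second row)

# both selectors accept a row and a column
print(df_small.loc['y', 'b'])     # by labels
print(df_small.iloc[1, 1])        # by positions; same cell
```

Here the two approaches land on the same row only because `'y'` happens to sit at position 1.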

In [1]:
import numpy as np
import pandas as pd
from pandas import Series, DataFrame

In [3]:
df = DataFrame(np.random.randint(0, 1000, [5,6]),
              index=list('vwxyz'),      # rows
              columns=list('abcdef'))   # columns
df

,a,b,c,d,e,f
v,772,582,393,320,11,773
w,400,535,723,139,423,244
x,475,892,999,438,333,382
y,610,323,559,372,365,336
z,770,201,77,18,935,138


In [4]:
# I can retrieve an entire row via .loc and an index

df.loc['x']

a    475
b    892
c    999
d    438
e    333
f    382
Name: x, dtype: int64

In [5]:
df.loc['x', 'd']   # retrieve row x, column d

438

In [6]:
df.loc['x', 'd'] = 12.34
df   # the dtype for d has changed - now it's np.float64

,a,b,c,d,e,f
v,772,582,393,320.0,11,773
w,400,535,723,139.0,423,244
x,475,892,999,12.34,333,382
y,610,323,559,372.0,365,336
z,770,201,77,18.0,935,138


In [7]:
df.dtypes  # show me all dtypes for all columns

a      int64
b      int64
c      int64
d    float64
e      int64
f      int64
dtype: object

In [8]:
# d is now a float64 column
# but what if I retrieve row x?

df.loc['x']   # the dtype of this row is float64, because Pandas needs a single dtype that can hold all of the row's values

a    475.00
b    892.00
c    999.00
d     12.34
e    333.00
f    382.00
Name: x, dtype: float64
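
The same "find one dtype for the whole row" rule can be sketched with a made-up frame, `df_mixed`, that also has a string column. This is only an illustration; the frame and its column names are not part of the session.

```python
import pandas as pd

df_mixed = pd.DataFrame({'i': [1, 2],
                         'f': [1.5, 2.5],
                         's': ['u', 'v']},
                        index=['x', 'y'])

# a row drawn from int and float columns: float64 can hold both
numeric_row = df_mixed[['i', 'f']].loc['x']
print(numeric_row.dtype)

# a row that also includes a string: only object can hold all three
full_row = df_mixed.loc['x']
print(full_row.dtype)
```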

In [10]:
# what if I want to find all of the elements of column b that are even?

df['b']%2

v    0
w    1
x    0
y    1
z    1
Name: b, dtype: int64

In [11]:
df['b']%2 == 0   # the remainder is 0 if the numbers are even

v     True
w    False
x     True
y    False
z    False
Name: b, dtype: bool

In [12]:
# I can apply this boolean series as a mask index on df['b']
# in this way, I can get a new series, containing all of the values of df['b']
# that are even

#        apply this boolean series as a mask
df['b'][df['b']%2 == 0]

v    582
x    892
Name: b, dtype: int64

In [13]:
# what if we apply our mask index not only to df['b'], but to all of df?

# this will show me all of the rows of the data frame
# (all columns) where b is even 
# aka: it'll only show us rows v and x of df
df[df['b']%2 == 0]

,a,b,c,d,e,f
v,772,582,393,320.0,11,773
x,475,892,999,12.34,333,382


In [14]:
# if we use .loc, and don't directly apply [] to df, we 
# can then also specify which columns we want

# that's because df.loc has the syntax of
# df.loc[ROW_SELECTOR, COLUMN_SELECTOR]
# if you don't select columns explicitly, then you get all of them.

df.loc[df['b']%2 == 0]

,a,b,c,d,e,f
v,772,582,393,320.0,11,773
x,475,892,999,12.34,333,382


In [15]:
# this shows all rows of df
# where df['b'] is even
# and only column 'c'

df.loc[df['b']%2 == 0, 'c']

v    393
x    999
Name: c, dtype: int64

In [16]:
# all rows of df
# where df['b'] is even
# and only columns c and e

df.loc[df['b']%2 == 0, ['c', 'e']]

,c,e
v,393,11
x,999,333


In [None]:
# show m