# Agenda

1. Recap
2. Address book
3. More on reading from and writing to files
4. Cleaning data with `nan` and interpolating
5. Analysis with data frames
    - Cutting and categorizing
    - Sorting
    - Grouping
    - Concatenating data frames together
    - Joining data frames
    

# Recap

When we use Pandas, we're mainly using two different data structures:

- Series, which is basically a 1D NumPy array with a nice set of wrappers around it.  Each series has a single dtype.  Pandas usually infers the dtype correctly, but you can set it explicitly, just as you did with NumPy arrays.
- Data frame, which is basically a glorified 2D NumPy array.  Each column in a data frame is a separate series, which means that each column has a separate dtype.  
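
As a small illustration of the dtype points above, here is a sketch with made-up values; the names `s`, `s_small`, and `df_demo` are just for this example:

```python
import numpy as np
from pandas import Series, DataFrame

# Pandas infers an integer dtype from a list of Python integers...
s = Series([10, 20, 30])

# ...but we can request a dtype explicitly, just as with np.array
s_small = Series([10, 20, 30], dtype=np.int8)

# each column of a data frame is its own series, with its own dtype
df_demo = DataFrame({'count': [1, 2, 3],
                     'price': [1.5, 2.5, 3.5]})

print(s.dtype, s_small.dtype)
print(df_demo.dtypes)
```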

Both a series and a data frame have an *index*, which describes the rows. An index can contain any type of values at all -- integers, strings, dates, or anything else.  Integers and strings are most common.  The values can even repeat.
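
To see what a repeating index means in practice, here is a minimal sketch (the series `s` and its values are invented for the example):

```python
from pandas import Series

# the label 'a' appears twice in the index
s = Series([10, 20, 30, 40], index=['a', 'b', 'a', 'c'])

# a label that appears once gives back a single value
print(s.loc['b'])

# a label that appears more than once gives back a series with every match
print(s.loc['a'])
```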

In addition to an index, a data frame has a `columns` attribute, which holds the names of the columns.

We can retrieve from either a series or from a data frame via the index using `.loc`.  Or we can use the numeric position using `.iloc`.
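
A quick side-by-side of the two, using a tiny data frame made up for this sketch (`df_demo` is not part of the session below):

```python
from pandas import DataFrame

df_demo = DataFrame([[1, 2], [3, 4], [5, 6]],
                    index=list('xyz'),
                    columns=list('ab'))

by_label = df_demo.loc['y']        # the row whose index label is 'y'
by_position = df_demo.iloc[1]      # the row at position 1 (the second row)

# both selectors found the same row
print(by_label.equals(by_position))

# the same idea works for a single cell: labels with .loc, positions with .iloc
print(df_demo.loc['z', 'b'], df_demo.iloc[2, 1])
```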

In [1]:
import numpy as np
import pandas as pd
from pandas import Series, DataFrame

In [3]:
df = DataFrame(np.random.randint(0, 1000, [5,6]),
              index=list('vwxyz'),      # rows
              columns=list('abcdef'))   # columns
df

Unnamed: 0,a,b,c,d,e,f
v,772,582,393,320,11,773
w,400,535,723,139,423,244
x,475,892,999,438,333,382
y,610,323,559,372,365,336
z,770,201,77,18,935,138


In [4]:
# I can retrieve an entire row via .loc and an index

df.loc['x']

a    475
b    892
c    999
d    438
e    333
f    382
Name: x, dtype: int64

In [5]:
df.loc['x', 'd']   # retrieve row x, column d

438

In [6]:
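# note: Pandas 2.1 and later warn about this kind of implicit upcast;
# converting first with df['d'] = df['d'].astype(float) avoids the warning
df.loc['x', 'd'] = 12.34
df   # the dtype for d has changed - now it's np.float64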

Unnamed: 0,a,b,c,d,e,f
v,772,582,393,320.0,11,773
w,400,535,723,139.0,423,244
x,475,892,999,12.34,333,382
y,610,323,559,372.0,365,336
z,770,201,77,18.0,935,138


In [7]:
df.dtypes  # show me all dtypes for all columns

a      int64
b      int64
c      int64
d    float64
e      int64
f      int64
dtype: object

In [8]:
# d is now a float64 column
# but what if I retrieve row x?

df.loc['x']   # the dtype of this row is float64, because Pandas needs to find a type that's good for all values

a    475.00
b    892.00
c    999.00
d     12.34
e    333.00
f    382.00
Name: x, dtype: float64
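
The same "find one dtype that fits every value" rule can be sketched on a small, made-up data frame. Here `df_demo` and its columns are assumptions for the example, and the string column is included only to show the extreme case:

```python
from pandas import DataFrame

df_demo = DataFrame({'a': [1, 2],
                     'd': [0.5, 1.5],
                     'g': ['duck', 'goose']},
                    index=['x', 'y'])

# a row drawn from int64 and float64 columns is upcast to float64
numeric_row = df_demo[['a', 'd']].loc['x']
print(numeric_row.dtype)

# once a string column is involved, the only dtype that fits is object
mixed_row = df_demo.loc['x']
print(mixed_row.dtype)
```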

In [10]:
# what if I want to find all of the elements of column b that are even?

df['b']%2

v    0
w    1
x    0
y    1
z    1
Name: b, dtype: int64

In [11]:
df['b']%2 == 0   # the remainder is 0 if the numbers are even

v     True
w    False
x     True
y    False
z    False
Name: b, dtype: bool

In [12]:
# I can apply this boolean series as a mask index on df['b']
# in this way, I can get a new series, containing all of the values of df['b']
# that are even

#        apply this boolean series as a mask
df['b'][df['b']%2 == 0]

v    582
x    892
Name: b, dtype: int64

In [13]:
# what if we apply our mask index not only to df['b'], but to all of df?

# this will show me all of the rows of the data frame
# (all columns) where b is even 
# aka: it'll only show us rows v and x of df
df[df['b']%2 == 0]

Unnamed: 0,a,b,c,d,e,f
v,772,582,393,320.0,11,773
x,475,892,999,12.34,333,382


In [14]:
# if we use .loc, and don't directly apply [] to df, we 
# can then also specify which columns we want

# that's because df.loc has the syntax of
# df.loc[ROW_SELECTOR, COLUMN_SELECTOR]
# if you don't select columns explicitly, then you get all of them.

df.loc[df['b']%2 == 0]

Unnamed: 0,a,b,c,d,e,f
v,772,582,393,320.0,11,773
x,475,892,999,12.34,333,382


In [15]:
# this shows all rows of df
# where df['b'] is even
# and only column 'c'

df.loc[df['b']%2 == 0, 'c']

v    393
x    999
Name: c, dtype: int64
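
The two styles we've seen so far, `[]` followed by a mask and a single `.loc` call, retrieve the same values. A minimal sketch on made-up data (`df_demo` and `is_even` are names chosen for this example) also shows why `.loc` is the form to reach for when assigning:

```python
from pandas import DataFrame

df_demo = DataFrame({'b': [582, 535, 892], 'c': [393, 723, 999]},
                    index=list('vwx'))

is_even = df_demo['b'] % 2 == 0          # boolean series, one value per row

chained = df_demo['c'][is_even]          # pick the column, then apply the mask
via_loc = df_demo.loc[is_even, 'c']      # row and column selectors together
print(chained.equals(via_loc))

# for assignment, the single .loc call is the dependable form
df_demo.loc[is_even, 'c'] = 0
print(df_demo)
```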

In [16]:
# all rows of df
# where df['b'] is even
# and only columns c and e

df.loc[df['b']%2 == 0, ['c', 'e']]

Unnamed: 0,c,e
v,393,11
x,999,333


In [19]:
# show me all rows of df
# where df['c'] < df['c'].mean()
# and only columns a and d

df.loc[df['c'] < df['c'].mean(), ['a', 'd']]

Unnamed: 0,a,d
v,772,320.0
z,770,18.0


In [20]:
df.describe()

Unnamed: 0,a,b,c,d,e,f
count,5.0,5.0,5.0,5.0,5.0,5.0
mean,605.4,506.6,550.2,172.268,413.4,374.6
std,168.866219,265.577296,346.40612,167.48131,332.750357,241.50735
min,400.0,201.0,77.0,12.34,11.0,138.0
25%,475.0,323.0,393.0,18.0,333.0,244.0
50%,610.0,535.0,559.0,139.0,365.0,336.0
75%,770.0,582.0,723.0,320.0,423.0,382.0
max,772.0,892.0,999.0,372.0,935.0,773.0


In [21]:
df.mean()

a    605.400
b    506.600
c    550.200
d    172.268
e    413.400
f    374.600
dtype: float64

In [22]:
df.sum()

a    3027.00
b    2533.00
c    2751.00
d     861.34
e    2067.00
f    1873.00
dtype: float64

In [23]:
df.max()

a    772.0
b    892.0
c    999.0
d    372.0
e    935.0
f    773.0
dtype: float64

In [24]:
df

Unnamed: 0,a,b,c,d,e,f
v,772,582,393,320.0,11,773
w,400,535,723,139.0,423,244
x,475,892,999,12.34,333,382
y,610,323,559,372.0,365,336
z,770,201,77,18.0,935,138


In [25]:
df['g'] = ['duck', 'duck', 'duck', 'duck', 'goose']

In [26]:
df

Unnamed: 0,a,b,c,d,e,f,g
v,772,582,393,320.0,11,773,duck
w,400,535,723,139.0,423,244,duck
x,475,892,999,12.34,333,382,duck
y,610,323,559,372.0,365,336,duck
z,770,201,77,18.0,935,138,goose


In [27]:
# describe() summarizes only the numeric columns, so the new column g is left out
df.describe()

Unnamed: 0,a,b,c,d,e,f
count,5.0,5.0,5.0,5.0,5.0,5.0
mean,605.4,506.6,550.2,172.268,413.4,374.6
std,168.866219,265.577296,346.40612,167.48131,332.750357,241.50735
min,400.0,201.0,77.0,12.34,11.0,138.0
25%,475.0,323.0,393.0,18.0,333.0,244.0
50%,610.0,535.0,559.0,139.0,365.0,336.0
75%,770.0,582.0,723.0,320.0,423.0,382.0
max,772.0,892.0,999.0,372.0,935.0,773.0


In [28]:
df['g'].describe()

count        5
unique       2
top       duck
freq         4
Name: g, dtype: object
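
If we want the numeric and non-numeric summaries in one table, `describe` accepts `include='all'`. A short sketch on made-up data (`df_demo` is just for this example):

```python
from pandas import DataFrame

df_demo = DataFrame({'a': [772, 400, 475],
                     'g': ['duck', 'duck', 'goose']})

# by default, only the numeric column is summarized
numeric_only = df_demo.describe()
print(numeric_only.columns.tolist())

# include='all' adds the non-numeric columns; statistics that don't
# apply to a column show up as NaN
everything = df_demo.describe(include='all')
print(everything)
```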

# Exercise: Address book

1. Create a data frame in which you have a few friends and family members. Every person in the data frame will have the following columns:
    - `firstname`
    - `lastname`
    - `age`
2. Create the data frame with about 7-10 people.
3. What is the average age of people in your address book?
4. Show the first and last names of people whose ages are above average.
5. Show people (name and age) whose first name is shorter than the average first-name length.

In [29]:
df = DataFrame([['a', 'b', 10],
                ['c', 'd', 20],
                ['e', 'f', 30]])
df

Unnamed: 0,0,1,2
0,a,b,10
1,c,d,20
2,e,f,30


In [32]:
df = DataFrame([['Reuven', 'Lerner', 51],
                ['Atara', 'Lerner-Friedman', 21],
                ['Shikma', 'Lerner-Friedman', 19],
                ['Amotz', 'Lerner-Friedman', 16],
                ['John', 'Smith', 35],
                ['David', 'Cohen', 60],
                ['Sarah', 'Friedman', 59]                
               ],
              columns='firstname lastname age'.split())     # ['firstname', 'lastname', 'age']

In [33]:
df

Unnamed: 0,firstname,lastname,age
0,Reuven,Lerner,51
1,Atara,Lerner-Friedman,21
2,Shikma,Lerner-Friedman,19
3,Amotz,Lerner-Friedman,16
4,John,Smith,35
5,David,Cohen,60
6,Sarah,Friedman,59


In [35]:
# what's the average age of people in my address book?

df['age'].mean()

37.285714285714285

In [36]:
# what are the names of people whose ages are above average

df['age'] > df['age'].mean()   # this returns a boolean series

0     True
1    False
2    False
3    False
4    False
5     True
6     True
Name: age, dtype: bool

In [37]:
#        row selectors via a mask/boolean index
df.loc[df['age'] > df['age'].mean()]

Unnamed: 0,firstname,lastname,age
0,Reuven,Lerner,51
5,David,Cohen,60
6,Sarah,Friedman,59


In [38]:
#       row selector                    ,  column selector
df.loc[df['age'] > df['age'].mean(),      ['firstname', 'lastname']  ]

Unnamed: 0,firstname,lastname
0,Reuven,Lerner
5,David,Cohen
6,Sarah,Friedman


In [41]:
# find all people
# whose first name is shorter than the average first name

df['firstname'].str.len().mean()

5.142857142857143

In [42]:
df['firstname'].str.len() < df['firstname'].str.len().mean()

0    False
1     True
2    False
3     True
4     True
5     True
6     True
Name: firstname, dtype: bool

In [44]:
# find all rows
# where the first name is shorter than the average first name
# all columns (so we don't need a column selector)

df.loc[df['firstname'].str.len() < df['firstname'].str.len().mean()]

Unnamed: 0,firstname,lastname,age
1,Atara,Lerner-Friedman,21
3,Amotz,Lerner-Friedman,16
4,John,Smith,35
5,David,Cohen,60
6,Sarah,Friedman,59


In [45]:
# earlier, I created the data frame as a list of lists

df = DataFrame([['Reuven', 'Lerner', 51],
                ['Atara', 'Lerner-Friedman', 21],
                ['Shikma', 'Lerner-Friedman', 19],
                ['Amotz', 'Lerner-Friedman', 16],
                ['John', 'Smith', 35],
                ['David', 'Cohen', 60],
                ['Sarah', 'Friedman', 59]                
               ],
              columns='firstname lastname age'.split())     # ['firstname', 'lastname', 'age']

In [46]:
# I can also create this data frame from a list of dicts
# each dictionary represents one row
# the keys are the column names, and the values are the cell values

# because the keys supply the column names, we don't need to pass columns=


df = DataFrame([{'firstname':'Reuven', 'lastname':'Lerner', 'age':51},
               {'firstname':'Atara', 'lastname':'Lerner-Friedman', 'age':21},
               {'firstname':'Shikma', 'lastname':'Lerner-Friedman', 'age':19},
               {'firstname':'Amotz', 'lastname':'Lerner-Friedman', 'age':16}
               ])

In [47]:
df

Unnamed: 0,firstname,lastname,age
0,Reuven,Lerner,51
1,Atara,Lerner-Friedman,21
2,Shikma,Lerner-Friedman,19
3,Amotz,Lerner-Friedman,16
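
Two related constructions, sketched here with a subset of the same people: a dict of lists, where each key is a column rather than a row, and a list of dicts in which one row is missing a key. The names `df_cols` and `df_rows` are just for this example.

```python
from pandas import DataFrame

# dict of lists: each key is a column name, each list is that column's values
df_cols = DataFrame({'firstname': ['Reuven', 'Atara'],
                     'lastname': ['Lerner', 'Lerner-Friedman'],
                     'age': [51, 21]})

# list of dicts in which the second row lacks 'age': the gap is filled with NaN,
# which also turns the age column into floats
df_rows = DataFrame([{'firstname': 'Reuven', 'age': 51},
                     {'firstname': 'Atara'}])

print(df_cols)
print(df_rows)
```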


# Reading from and writing to files

Yesterday, we saw that we can read from CSV, Excel, and feather files.  We can write to them, as well.

We saw a few of the parameters we can specify when reading from a CSV file:
- `sep`, the separator, defaulting to `,`
- `usecols`, a list of column names that we want to include in our data frame
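
As a reminder of how those two parameters behave, here is a minimal sketch that reads from an in-memory string rather than a real file. The semicolon-separated data and the name `df_demo` are made up for the example.

```python
import io
import pandas as pd

# a small semicolon-separated "file" held in memory
data = io.StringIO("name;country;active\n"
                   "135 Airways;United States;N\n"
                   "1Time Airline;South Africa;Y\n")

df_demo = pd.read_csv(data,
                      sep=';',                      # fields are separated by ; rather than ,
                      usecols=['name', 'country'])  # keep only these two columns
print(df_demo)
```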

In [49]:
!head airlines.dat

1,"Private flight",\N,"-","N/A","","","Y" 
2,"135 Airways",\N,"","GNL","GENERAL","United States","N"
3,"1Time Airline",\N,"1T","RNX","NEXTIME","South Africa","Y"
4,"2 Sqn No 1 Elementary Flying Training School",\N,"","WYT","","United Kingdom","N"
5,"213 Flight Unit",\N,"","TFU","","Russia","N"
6,"223 Flight Unit State Airline",\N,"","CHD","CHKALOVSK-AVIA","Russia","N"
7,"224th Flight Unit",\N,"","TTF","CARGO UNIT","Russia","N"
8,"247 Jet Ltd",\N,"","TWF","CLOUD RUNNER","United Kingdom","N"
9,"3D Aviation",\N,"","SEC","SECUREX","United States","N"
10,"40-Mile Air",\N,"Q5","MLA","MILE-AIR","United States","Y"


In [50]:
# read in data about every airline in the world
df = pd.read_csv('airlines.dat')

In [51]:
df.head()

Unnamed: 0,1,Private flight,\N,-,N/A,Unnamed: 5,Unnamed: 6,Y
0,2,135 Airways,\N,,GNL,GENERAL,United States,N
1,3,1Time Airline,\N,1T,RNX,NEXTIME,South Africa,Y
2,4,2 Sqn No 1 Elementary Flying Training School,\N,,WYT,,United Kingdom,N
3,5,213 Flight Unit,\N,,TFU,,Russia,N
4,6,223 Flight Unit State Airline,\N,,CHD,CHKALOVSK-AVIA,Russia,N


In [60]:
# if the CSV file doesn't start with column names, we need to:
# (1) tell it not to use the first row as columns, so we don't lose data
# (2) name the columns ourselves

# the file has eight fields per row, and we pass seven names, so Pandas
# uses the leftmost field (the numeric ID) as the index

df = pd.read_csv('airlines.dat', 
                header=None,   # the first row of the file is *not* a header row
                names=['name', 'junk', '2code', '3code', 'formal name', 'country', 'morejunk'])

In [61]:
df.head()

Unnamed: 0,name,junk,2code,3code,formal name,country,morejunk
1,Private flight,\N,-,,,,Y
2,135 Airways,\N,,GNL,GENERAL,United States,N
3,1Time Airline,\N,1T,RNX,NEXTIME,South Africa,Y
4,2 Sqn No 1 Elementary Flying Training School,\N,,WYT,,United Kingdom,N
5,213 Flight Unit,\N,,TFU,,Russia,N
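
How many names we pass matters. The sketch below uses a three-field, in-memory sample (made up, loosely modeled on the airline rows) to show what I'd expect, at least with the Pandas behavior seen in this notebook: with one name per field, every field is a regular column; with fewer names than fields, the leftmost field is used as the index.

```python
import io
import pandas as pd

sample = '2,"135 Airways","GNL"\n3,"1Time Airline","RNX"\n'

# one name per field: every field becomes a regular column
df_all = pd.read_csv(io.StringIO(sample), header=None,
                     names=['id', 'name', '3code'])

# one name fewer than there are fields: the leftmost field becomes the index
df_short = pd.read_csv(io.StringIO(sample), header=None,
                       names=['name', '3code'])

print(df_all)
print(df_short)
```

Counting the fields in one line of the file against the length of `names` is a quick check whenever the values appear shifted into the wrong columns.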
