# PANDAS

## Like R in Python but better

Today we'll cover
- What are DataFrames  
  - Indices vs. Columns  
  - Setting the index  
  - Reference by name/Reference by position  
- Basic operations on DataFrames  
  - Filter  
  - Iterate  
  - Transform to other datatypes  
- Advanced operations  
  - join, merge, append  
  - Multi index  

# Prologue

## DataTypes 

### Basic Types
- Int
  - 1
  - 2
  - 3
- Float
  - 2.2222
  - 3.2333
  - 3.14359
- Char
  - '2'
  - 'a'
  - 'b'
  - '"'

### Structured Types
- List
  - `(a -> (b -> (c -> ...)))`
- Array  
  - `[a, b, c]`
- Hashtable
  - `[(1 -> 'a'), (2 -> 'b'), (3 -> 'c')]`

### Abstractions
- Char + List = String
- Array + List = ArrayList (Creative, right?)

# Python is a liar
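- Strings are secretly sequences of characters (immutable ones)
- Lists are secretly arrays (resizable ones, not linked lists)
- Dicts are openly hashtables
- Almost everything else (objects, modules, namespaces) is secretly hashtables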
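For a feel of how these types behave in practice, here is a minimal sketch that uses only built-in Python types. The variable names are made up for illustration.

```python
# A str can be indexed and iterated like a sequence of characters
word = "abc"
first_char = word[0]
chars = list(word)

# A list is looked up by position
letters = ['a', 'b', 'c']
second = letters[1]

# A dict is looked up by key (this is the hashtable)
table = {1: 'a', 2: 'b', 3: 'c'}
by_key = table[2]
```

Position-based lookup versus key-based lookup is the distinction that comes back later as rows-by-position versus rows-by-label.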

# What is a DataFrame?

# A bunch of hashtables 

# What is special about them?

In [104]:
import numpy as np
import pandas as pd
from pprint import pprint
# Let's say we want to collect information about particular days
# and access that information based on the date

# We want to know the day of the week it was
day_of_week = {
    pd.Timestamp('20200101'): 'Wednesday',
    pd.Timestamp('20200102'): 'Thursday',
    pd.Timestamp('20200103'): 'Friday',
    pd.Timestamp('20200104'): 'Saturday',
    pd.Timestamp('20200105'): 'Sunday'
}

day_of_week

{Timestamp('2020-01-01 00:00:00'): 'Wednesday',
 Timestamp('2020-01-02 00:00:00'): 'Thursday',
 Timestamp('2020-01-03 00:00:00'): 'Friday',
 Timestamp('2020-01-04 00:00:00'): 'Saturday',
 Timestamp('2020-01-05 00:00:00'): 'Sunday'}

In [105]:
# We also want to know the high temperature
high_temp = {
    pd.Timestamp('20200101'): 48,
    pd.Timestamp('20200102'): 54,
    pd.Timestamp('20200103'): 45,
    pd.Timestamp('20200104'): 61,
    pd.Timestamp('20200105'): 55
}
high_temp

{Timestamp('2020-01-01 00:00:00'): 48,
 Timestamp('2020-01-02 00:00:00'): 54,
 Timestamp('2020-01-03 00:00:00'): 45,
 Timestamp('2020-01-04 00:00:00'): 61,
 Timestamp('2020-01-05 00:00:00'): 55}

In [106]:
# And the low temperature
low_temp = {
    pd.Timestamp('20200101'): 30,
    pd.Timestamp('20200102'): 38,
    pd.Timestamp('20200103'): 33,
    pd.Timestamp('20200104'): 45,
    pd.Timestamp('20200105'): 30
}
low_temp

{Timestamp('2020-01-01 00:00:00'): 30,
 Timestamp('2020-01-02 00:00:00'): 38,
 Timestamp('2020-01-03 00:00:00'): 33,
 Timestamp('2020-01-04 00:00:00'): 45,
 Timestamp('2020-01-05 00:00:00'): 30}

In [107]:
# This quickly gets difficult to work with
# Let's say you want to compare weekday and weekend high temps
# You'd have to do something terrible like the following
weekends = []
weekdays = []

# Collect the info from one dict
for timestamp, day in day_of_week.items():
    if day == 'Saturday' or day == 'Sunday':
        weekends.append(timestamp)
    else:
        weekdays.append(timestamp)

weekend_temps = []
weekday_temps = []
# Use it to collect info from another dict
for timestamp in weekends:
    weekend_temps.append(high_temp[timestamp])
for timestamp in weekdays:
    weekday_temps.append(high_temp[timestamp])

print("Weekend average {}".format(np.mean(weekend_temps)))
print("Weekday average {}".format(np.mean(weekday_temps)))

Weekend average 58.0
Weekday average 49.0


In [108]:
## You want to access the data all in the same place
df = pd.DataFrame({'day_of_week': day_of_week,
                   'high_temp': high_temp,
                   'low_temp': low_temp})
df

           day_of_week  high_temp  low_temp
2020-01-01   Wednesday         48        30
2020-01-02    Thursday         54        38
2020-01-03      Friday         45        33
2020-01-04    Saturday         61        45
2020-01-05      Sunday         55        30


In [109]:
# How do you access it?
# Through the index (row labels) and the columns (column labels)
print(df.index)
print(df.columns)

DatetimeIndex(['2020-01-01', '2020-01-02', '2020-01-03', '2020-01-04',
               '2020-01-05'],
              dtype='datetime64[ns]', freq=None)
Index(['day_of_week', 'high_temp', 'low_temp'], dtype='object')
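The index and the columns are what make "reference by name" and "reference by position" possible. Here is a minimal sketch, assuming the same five days of data rebuilt from scratch so the cell runs on its own. `.loc` looks values up by label and `.iloc` by zero-based position.

```python
import pandas as pd

dates = pd.date_range('2020-01-01', periods=5)
df = pd.DataFrame({
    'day_of_week': ['Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday'],
    'high_temp': [48, 54, 45, 61, 55],
    'low_temp': [30, 38, 33, 45, 30],
}, index=dates)

# Reference by name: row label first, then column label
by_name = df.loc[pd.Timestamp('2020-01-04'), 'high_temp']

# Reference by position: 4th row, 2nd column (counting from zero)
by_position = df.iloc[3, 1]

# A whole row comes back as a Series keyed by column name
row = df.loc[pd.Timestamp('2020-01-04')]
```

Both lookups land on the same cell, Saturday's high temperature.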


In [110]:
# So how would we find the weekday/weekend temps?
weekends = ['Saturday', 'Sunday']
# groupby sorts the group keys, so False (weekdays) comes before True (weekends)
weekday_data, weekend_data = df.groupby(df.day_of_week.isin(weekends))

print("Weekend average {}".format(weekend_data[1].high_temp.mean()))
print("Weekday average {}".format(weekday_data[1].high_temp.mean()))

Weekend average 58.0
Weekday average 49.0
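Grouping is not the only way to get these two numbers. As a sketch of the "Filter" operation from the outline, a boolean mask can pick the rows directly. The frame is rebuilt here so the cell runs on its own, and the names `weekend_avg` and `weekday_avg` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'day_of_week': ['Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday'],
    'high_temp': [48, 54, 45, 61, 55],
    'low_temp': [30, 38, 33, 45, 30],
}, index=pd.date_range('2020-01-01', periods=5))

# True for weekend rows, False for weekday rows
is_weekend = df['day_of_week'].isin(['Saturday', 'Sunday'])

# Select rows with the mask (or its negation) and one column by name
weekend_avg = df.loc[is_weekend, 'high_temp'].mean()
weekday_avg = df.loc[~is_weekend, 'high_temp'].mean()

print("Weekend average {}".format(weekend_avg))
print("Weekday average {}".format(weekday_avg))
```

Because each mask is named for what it selects, there is no group order to remember.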


In [111]:
# Step 1: Accessing the day_of_week information
day_of_week = df.day_of_week
day_of_week

2020-01-01    Wednesday
2020-01-02     Thursday
2020-01-03       Friday
2020-01-04     Saturday
2020-01-05       Sunday
Name: day_of_week, dtype: object

In [112]:
# Step 2: Determining if it's a weekend
is_weekend = day_of_week.isin(weekends)
is_weekend

2020-01-01    False
2020-01-02    False
2020-01-03    False
2020-01-04     True
2020-01-05     True
Name: day_of_week, dtype: bool

In [113]:
# Step 3: Group the data based on those True/False values
group1, group2 = df.groupby(is_weekend)
# Iterating over a groupby yields one tuple per group, e.g. (group_value, dataframe)
# In this case group_value is True or False but it could be any value
print(group1[0])
print("-----------")
print(group1[1])


False
-----------
           day_of_week  high_temp  low_temp
2020-01-01   Wednesday         48        30
2020-01-02    Thursday         54        38
2020-01-03      Friday         45        33
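Unpacking into `group1, group2` relies on knowing which group comes first. A sketch of a more explicit alternative is to loop over the groupby and store each group under its key. The frame is rebuilt so the cell runs on its own, and `groups` is an illustrative name.

```python
import pandas as pd

df = pd.DataFrame({
    'day_of_week': ['Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday'],
    'high_temp': [48, 54, 45, 61, 55],
    'low_temp': [30, 38, 33, 45, 30],
}, index=pd.date_range('2020-01-01', periods=5))

is_weekend = df['day_of_week'].isin(['Saturday', 'Sunday'])

# Each iteration yields (group_value, dataframe); keep them in a dict by key
groups = {}
for group_value, group_df in df.groupby(is_weekend):
    groups[bool(group_value)] = group_df

weekend_rows = groups[True]
weekday_rows = groups[False]
```

Looking groups up by `True` or `False` makes it hard to mix up weekends and weekdays.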


In [114]:
# Step 4: Get the high_temp from the group
high_temps = group1[1].high_temp
high_temps

2020-01-01    48
2020-01-02    54
2020-01-03    45
Name: high_temp, dtype: int64

In [115]:
# Step 5: Calculate the average
high_temps.mean()

49.0
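Steps 1 through 5 can also be chained into a single expression. This is a sketch rather than the only way to do it: it maps the True/False values to the made-up labels 'weekend' and 'weekday' so the result is easy to read, and it rebuilds the frame so the cell runs on its own.

```python
import pandas as pd

df = pd.DataFrame({
    'day_of_week': ['Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday'],
    'high_temp': [48, 54, 45, 61, 55],
    'low_temp': [30, 38, 33, 45, 30],
}, index=pd.date_range('2020-01-01', periods=5))

# Steps 1 and 2: pick the column and flag the weekend rows
is_weekend = df['day_of_week'].isin(['Saturday', 'Sunday'])

# Give the groups readable labels instead of True/False
kind = is_weekend.map({True: 'weekend', False: 'weekday'})

# Steps 3 to 5: group, select high_temp, and average every group at once
averages = df.groupby(kind)['high_temp'].mean()
print(averages)
```

The result is a Series with one average per group, so both answers come back together.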

# Adding to a DataFrame
Finding the average temperature was very exciting, but it's left a few things unanswered.

Who are the days for?

In [116]:
for_whom = {
    'Sunday': 'The girls',
    'Monday': 'The birds',
    'Tuesday': 'The non-binary',
    'Wednesday': 'The camels',
    'Thursday': 'The dogs',
    'Friday': 'The cats',
    'Saturday': 'The boys',
}
for_whom

{'Sunday': 'The girls',
 'Monday': 'The birds',
 'Tuesday': 'The non-binary',
 'Wednesday': 'The camels',
 'Thursday': 'The dogs',
 'Friday': 'The cats',
 'Saturday': 'The boys'}

In [117]:
for_whom = pd.Series(for_whom, name='for_whom')
for_whom

Sunday            The girls
Monday            The birds
Tuesday      The non-binary
Wednesday        The camels
Thursday           The dogs
Friday             The cats
Saturday           The boys
Name: for_whom, dtype: object

In [118]:
# Note: DataFrame.append was removed in pandas 2.0, so this cell needs pandas < 2.0
df.append(for_whom)

                     day_of_week  high_temp  low_temp    Friday     Monday  Saturday     Sunday  Thursday         Tuesday   Wednesday
2020-01-01 00:00:00    Wednesday       48.0      30.0       NaN        NaN       NaN        NaN       NaN             NaN         NaN
2020-01-02 00:00:00     Thursday       54.0      38.0       NaN        NaN       NaN        NaN       NaN             NaN         NaN
2020-01-03 00:00:00       Friday       45.0      33.0       NaN        NaN       NaN        NaN       NaN             NaN         NaN
2020-01-04 00:00:00     Saturday       61.0      45.0       NaN        NaN       NaN        NaN       NaN             NaN         NaN
2020-01-05 00:00:00       Sunday       55.0      30.0       NaN        NaN       NaN        NaN       NaN             NaN         NaN
for_whom                     NaN        NaN       NaN  The cats  The birds  The boys  The girls  The dogs  The non-binary  The camels


That's not right

In [119]:
# We need to transform the data frame so pandas can match the new data to the original
idf = df.set_index('day_of_week')
idf

             high_temp  low_temp
day_of_week
Wednesday           48        30
Thursday            54        38
Friday              45        33
Saturday            61        45
Sunday              55        30
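On its own, `set_index` does not throw the column label away. The label becomes the name of the index, and `reset_index` turns it back into a column. Here is a small sketch with the frame rebuilt so the cell runs on its own. `restored` is an illustrative name.

```python
import pandas as pd

df = pd.DataFrame({
    'day_of_week': ['Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday'],
    'high_temp': [48, 54, 45, 61, 55],
    'low_temp': [30, 38, 33, 45, 30],
}, index=pd.date_range('2020-01-01', periods=5))

# set_index returns a new DataFrame; df itself is left unchanged
idf = df.set_index('day_of_week')
print(idf.index.name)

# reset_index moves the named index back into an ordinary column
restored = idf.reset_index()
print(list(restored.columns))
```

Note that the original date index is not kept by `set_index`, so `restored` is numbered from 0 instead.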


In [120]:
idf.append(for_whom)

             high_temp  low_temp    Friday     Monday  Saturday     Sunday  Thursday         Tuesday   Wednesday
day_of_week
Wednesday         48.0      30.0       NaN        NaN       NaN        NaN       NaN             NaN         NaN
Thursday          54.0      38.0       NaN        NaN       NaN        NaN       NaN             NaN         NaN
Friday            45.0      33.0       NaN        NaN       NaN        NaN       NaN             NaN         NaN
Saturday          61.0      45.0       NaN        NaN       NaN        NaN       NaN             NaN         NaN
Sunday            55.0      30.0       NaN        NaN       NaN        NaN       NaN             NaN         NaN
for_whom           NaN       NaN  The cats  The birds  The boys  The girls  The dogs  The non-binary  The camels


Still not right

In [121]:
idft = idf.T
idft

day_of_week  Wednesday  Thursday  Friday  Saturday  Sunday
high_temp           48        54      45        61      55
low_temp            30        38      33        45      30


In [122]:
df = idft.append(for_whom)
df

            Wednesday  Thursday    Friday  Saturday     Sunday     Monday         Tuesday
high_temp          48        54        45        61         55        NaN             NaN
low_temp           30        38        33        45         30        NaN             NaN
for_whom   The camels  The dogs  The cats  The boys  The girls  The birds  The non-binary


That's better
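`DataFrame.append` was removed in pandas 2.0, so the cell above only runs on older versions. As a sketch of one way to get the same shape on current pandas, the named Series can be turned into a one-row frame and stacked with `pd.concat`. The transposed frame is rebuilt by hand here so the cell runs on its own, and `new_row` and `combined` are illustrative names.

```python
import pandas as pd

# The transposed frame: one column per day, one row per measurement
idft = pd.DataFrame(
    {'Wednesday': [48, 30], 'Thursday': [54, 38], 'Friday': [45, 33],
     'Saturday': [61, 45], 'Sunday': [55, 30]},
    index=['high_temp', 'low_temp'])

for_whom = pd.Series({
    'Sunday': 'The girls', 'Monday': 'The birds', 'Tuesday': 'The non-binary',
    'Wednesday': 'The camels', 'Thursday': 'The dogs', 'Friday': 'The cats',
    'Saturday': 'The boys',
}, name='for_whom')

# to_frame() makes a one-column frame named 'for_whom'; .T flips it into one row
new_row = for_whom.to_frame().T

# concat stacks the rows and lines the values up by column label
combined = pd.concat([idft, new_row])
print(combined)
```

Days that only appear in `for_whom` (Monday and Tuesday) become new columns, with NaN in the temperature rows.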

In [123]:
# To get it back to the previous format we undo the transformations
df = df.T
df

           high_temp  low_temp        for_whom
Wednesday       48.0      30.0      The camels
Thursday        54.0      38.0        The dogs
Friday          45.0      33.0        The cats
Saturday        61.0      45.0        The boys
Sunday          55.0      30.0       The girls
Monday           NaN       NaN       The birds
Tuesday          NaN       NaN  The non-binary


In [124]:
df = df.reset_index()
df

       index  high_temp  low_temp        for_whom
0  Wednesday       48.0      30.0      The camels
1   Thursday       54.0      38.0        The dogs
2     Friday       45.0      33.0        The cats
3   Saturday       61.0      45.0        The boys
4     Sunday       55.0      30.0       The girls
5     Monday        NaN       NaN       The birds
6    Tuesday        NaN       NaN  The non-binary


Notice that the weekdays now have the column label 'index'. The original label, 'day_of_week', survived set_index as the index name but was dropped by the later append.
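For comparison, here is a sketch of a shorter route to a `for_whom` column, assuming we only care about days that are already in the frame (so Monday and Tuesday are not added). `Series.map` looks each weekday up in the dict, and the dates and the `day_of_week` label stay intact.

```python
import pandas as pd

df = pd.DataFrame({
    'day_of_week': ['Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday'],
    'high_temp': [48, 54, 45, 61, 55],
    'low_temp': [30, 38, 33, 45, 30],
}, index=pd.date_range('2020-01-01', periods=5))

for_whom = {
    'Sunday': 'The girls', 'Monday': 'The birds', 'Tuesday': 'The non-binary',
    'Wednesday': 'The camels', 'Thursday': 'The dogs', 'Friday': 'The cats',
    'Saturday': 'The boys',
}

# Look up every row's weekday in the dict and store the result as a new column
df['for_whom'] = df['day_of_week'].map(for_whom)
print(df)
```

The trade-off is that this works like a lookup from the existing rows, while the transpose-and-append route above also brings in days with no temperature data.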