1. Restart kernel, Run All 
2. Modify C7 (0 rerun cells)
3. Add new cell between C5 and C6, move C5 prints (6 rerun cells)
4. Modify C2 (5 rerun cells)

# Birth analysis from split annotations 
- Given a dataset of the number of births by name and year, computes the fraction of births with names containing “lesl” (case-insensitive) that fall to each sex, per year of birth
- Code [Original Python Script]: https://github.com/weld-project/split-annotations/blob/master/python/benchmarks/birth_analysis/birth_analysis.py
- Data: https://github.com/weld-project/split-annotations/blame/master/python/benchmarks/datasets/birth_analysis/babynames.txt.gz
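
To make the computation described above concrete, here is a minimal sketch on a made-up toy table. The rows and the variable names (`toy`, `lesl`, `counts`, `fractions`) are illustrative assumptions, not part of the benchmark data, and the top-1000 step is left out for brevity.

```python
import pandas as pd

# Toy stand-in for the babynames data; these rows are invented for illustration.
toy = pd.DataFrame(
    {
        "year": [1990, 1990, 1990, 1991, 1991],
        "sex": ["F", "M", "F", "F", "M"],
        "name": ["Leslie", "Leslie", "Mary", "Lesley", "Leslie"],
        "births": [30, 10, 500, 20, 20],
    }
)

# Keep rows whose name contains "lesl" (case-insensitive), mirroring the notebook's filter.
lesl = toy[toy["name"].str.lower().str.contains("lesl")]

# Sum births with years as rows and sexes as columns.
counts = lesl.pivot_table(values="births", index="year", columns="sex", aggfunc="sum")

# Divide each row by its row total so the F and M shares add up to 1 within a year.
fractions = counts.div(counts.sum(axis=1), axis=0)
print(fractions)
```

Each row of `fractions` is one year, and its entries give the share of "lesl"-like births recorded for each sex in that year.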

## Notes: 
- Refactored into a more notebook style 
- The pipeline is groupby -> sort -> filter (by name and uniqueness, which behaves more like a merge?) -> sum, so it may be effective to organize the sort directly after the groupby
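
One possible reading of the sort/groupby note, sketched on made-up data: sort the whole frame once by `births`, then take the first `n` rows of each `(year, sex)` group with `head`. This is an assumption about what a reorganized pipeline could look like, not the notebook's method; the toy rows and `n = 1` are illustrative (the notebook keeps 1000 rows per group), and tie-breaking between equal `births` values may differ from a per-group sort.

```python
import pandas as pd

# Invented rows for illustration only.
toy = pd.DataFrame(
    {
        "year": [1990, 1990, 1990, 1991, 1991, 1991],
        "sex": ["F", "F", "M", "F", "M", "M"],
        "name": ["Mary", "Leslie", "John", "Ann", "Leslie", "Bob"],
        "births": [500, 30, 400, 100, 20, 300],
    }
)

n = 1  # illustrative; the notebook uses 1000

top_n = (
    toy.sort_values("births", ascending=False)   # one global sort, largest first
    .groupby(["year", "sex"], sort=False)        # group without re-sorting the keys
    .head(n)                                     # first n rows of each group in sorted order
    .sort_values(["year", "sex", "births"], ascending=[True, True, False])
    .reset_index(drop=True)
)
print(top_n)
```

Whether this is faster than the per-group `apply` would have to be measured on the real data.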

In [None]:
import argparse
import pandas as pd
import sys
import time

In [None]:
years = range(1880, 2011)
pieces = []
columns = ['year', 'sex', 'name', 'births']

In [None]:
filename = './data/babynames.txt'
print("File:", filename)

print("Reading data...")
names = pd.read_csv(filename, names=columns)
print("done.")

File: ./data/babynames.txt
Reading data...
done.


In [None]:
e2e_start = time.time()
start0 = time.time()
grouped = names.groupby(['year', 'sex']) #  Groups the data by year and sex 
end0 = time.time()
print("GroupBy:", end0 - start0)

GroupBy: 0.003265857696533203


In [None]:
start0 = end0

top1000 = grouped.apply(lambda group: group.sort_values(by='births', ascending=False)[0:1000])
top1000.reset_index(inplace=True, drop=True)

end0 = time.time()
print("Apply:", end0-start0)
print("Elements in top1000:", len(top1000))

Apply: 0.5344250202178955
Elements in top1000: 267877


In [None]:
start1 = time.time()
all_names = pd.Series(top1000.name.unique()) # find all unique names 
lesley_like = all_names[all_names.str.lower().str.contains('lesl')]
filtered = top1000[top1000.name.isin(lesley_like)] # filter 
table = filtered.pivot_table('births', index='year',
                             columns='sex', aggfunc='sum') # births summed by year and sex

table = table.div(table.sum(1), axis=0) # Normalize by dividing each row / total_births
end1 = time.time()
result = table
print("Analysis:", end1 - start1)

Analysis: 0.0894629955291748


In [None]:
e2e_end = time.time()
print("Total time:", e2e_end - e2e_start)

print(top1000['births'].sum())

Total time: 0.6431000232696533
304919459
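
The cells above time each step with paired `time.time()` calls. As a sketch of an alternative, the same measurements could be collected with a small context manager; `timed` and `timings` are hypothetical names introduced here, and `time.perf_counter()` is swapped in for `time.time()` because it is intended for measuring short intervals.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results):
    """Store the elapsed seconds of the enclosed block in results[label]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results[label] = time.perf_counter() - start


timings = {}

# Stand-in workload; in the notebook this would wrap a step such as the groupby.
with timed("GroupBy", timings):
    total = sum(range(1000))

for label, seconds in timings.items():
    print(f"{label}: {seconds}")
```

Because each block records its own start time, a step's measurement would not include time spent between cell executions.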
