# Concatenating Tables with Set-Like Operations

One of the two way of combining two tables is to stack one table on top of the other.  When stacking two tables on top of one another, we need to decide

1. If we combine columns based on position or name (and if combining by name, what do we do with mismatches?)
2. How to decide which rows to keep.  In this case, we will take some guidance from SQL clauses.

## Three Types of Operations

* **Union:** Keeps rows from either table.
* **Intersection:** Only keeps common columns
* **Set Difference/Except:** Keep rows from the left table *except* those in the right table.

## Set Operations in Action 

<img src="./img/table_verbs_set.gif" width=800>

## All Operations Match by Position

All operations

* Match columns by position
* Require same number/type of columns

## Distinct Versus All

* **UNION/INTERSECT/SET DIFFERENCE** are **DISTINCT**
    * Only keeps distinct rows, removing duplicates.
* **UNION ALL/INTERSECT ALL/SET DIFFERENCE ALL**
    * Keeps duplicate rows

In [2]:
import pandas as pd
from dfply import *

In [3]:
sales_may = pd.read_csv('./data/auto_sales_may.csv')
sales_may

Unnamed: 0.1,Unnamed: 0,Salesperson,Compact,Sedan,SUV,Truck
0,0,Ann,22,18,15,12
1,1,Bob,20,14,6,24
2,2,Yolanda,19,10,28,17
3,3,Xerxes,11,27,17,9


In [4]:
sales_apr = pd.read_csv('./data/auto_sales_apr.csv')
sales_apr

Unnamed: 0.1,Unnamed: 0,Salesperson,Compact,Sedan,SUV,Truck
0,0,Ann,22,18,15,12
1,1,Bob,19,12,17,20
2,2,Yolanda,19,8,32,15
3,3,Xerxes,12,23,18,9


## Unions with `dfply`

Use `left_table >> union(right_table)`

In [5]:
sales_may >> union(sales_apr)

  stacked = df.append(other)


Unnamed: 0.1,Unnamed: 0,Salesperson,Compact,Sedan,SUV,Truck
0,0,Ann,22,18,15,12
1,1,Bob,20,14,6,24
2,2,Yolanda,19,10,28,17
3,3,Xerxes,11,27,17,9
1,1,Bob,19,12,17,20
2,2,Yolanda,19,8,32,15
3,3,Xerxes,12,23,18,9


## `dfply.union` is distinct

Since Ann have the same sales each month, her row only included one row.  Note that we can use `keep='last'` to `keep='first'` to determine which row is kept.

In [6]:
sales_may >> union(sales_apr, keep='last')

  stacked = df.append(other)


Unnamed: 0.1,Unnamed: 0,Salesperson,Compact,Sedan,SUV,Truck
1,1,Bob,20,14,6,24
2,2,Yolanda,19,10,28,17
3,3,Xerxes,11,27,17,9
0,0,Ann,22,18,15,12
1,1,Bob,19,12,17,20
2,2,Yolanda,19,8,32,15
3,3,Xerxes,12,23,18,9


In [7]:
sales_may >> union(sales_apr, keep='first')

  stacked = df.append(other)


Unnamed: 0.1,Unnamed: 0,Salesperson,Compact,Sedan,SUV,Truck
0,0,Ann,22,18,15,12
1,1,Bob,20,14,6,24
2,2,Yolanda,19,10,28,17
3,3,Xerxes,11,27,17,9
1,1,Bob,19,12,17,20
2,2,Yolanda,19,8,32,15
3,3,Xerxes,12,23,18,9


## Making `union_all`

We can use `pd.concat` to perform a `UNION ALL`

In [8]:
from more_dfply import union_all
sales_may >> union_all(sales_apr)

Unnamed: 0.1,Unnamed: 0,Salesperson,Compact,Sedan,SUV,Truck
0,0,Ann,22,18,15,12
1,1,Bob,20,14,6,24
2,2,Yolanda,19,10,28,17
3,3,Xerxes,11,27,17,9
4,0,Ann,22,18,15,12
5,1,Bob,19,12,17,20
6,2,Yolanda,19,8,32,15
7,3,Xerxes,12,23,18,9


## Adding a month column

Another way to keep both of Ann's sales rows is adding a month column (which we should probably do anyway).

In [9]:
sales_may >> mutate(month = 'May') >> union(sales_apr >> mutate(month = 'April'))

  stacked = df.append(other)


Unnamed: 0.1,Unnamed: 0,Salesperson,Compact,Sedan,SUV,Truck,month
0,0,Ann,22,18,15,12,May
1,1,Bob,20,14,6,24,May
2,2,Yolanda,19,10,28,17,May
3,3,Xerxes,11,27,17,9,May
0,0,Ann,22,18,15,12,April
1,1,Bob,19,12,17,20,April
2,2,Yolanda,19,8,32,15,April
3,3,Xerxes,12,23,18,9,April


## Finding common rows with `dfply.intersect`

In [10]:
sales_may >> intersect(sales_apr)

Unnamed: 0.1,Unnamed: 0,Salesperson,Compact,Sedan,SUV,Truck
0,0,Ann,22,18,15,12


## Finding rows unique to the left table.

Use `left_table >> dfply.set_diff(right_table)`

In [11]:
sales_may >> set_diff(sales_apr)

Unnamed: 0.1,Unnamed: 0,Salesperson,Compact,Sedan,SUV,Truck
1,1,Bob,20,14,6,24
2,2,Yolanda,19,10,28,17
3,3,Xerxes,11,27,17,9


# Working with many and/or large files

In this section, we will take a look at techniques for working with many files, as well as large files.

In [12]:
import pandas as pd
from dfply import *

## Baseball data

We will be using the [Baseball Databank](https://github.com/chadwickbureau/baseballdatabank), make sure you have these data cloned into `./data/baseball`.

!git clone https://github.com/chadwickbureau/baseballdatabank.git ./data/baseball

## Working with many files.

* Use `glob.glob` to find all files that match a pattern
* Convert all files to `pd.DataFrames`
* Store the `df` in a list or dictionary

## What the heck is a `glob`

`glob.glob`

* Takes a path regular expression
* Returns a list of files that match the patterm
* Relative paths!

## Store in `dict` or `list`?

* Natural sequence/order? $\rightarrow$ `list`
    *  Example: Lakes data and years are a natural sequence
* Easier to refer by name? $\rightarrow$ `dict`
    * Baseball files have no order and easier to refer to by name

## Example 1 - Using `glob` to read and combine the sales data

Using `glob` with a `list` to automate reading an combining files 

#### Step 1 - Get the file names

In [13]:
from glob import glob
sales_files = glob('./data/auto_sales_*.csv')
sales_files

['./data/auto_sales_apr.csv', './data/auto_sales_may.csv']

#### Step 2 - Read the files into a list of data frames

In [14]:
sales_by_month = [pd.read_csv(f) for f in sales_files]

 #### Inspect each data from with head

In [15]:
[df.head(2) for df in sales_by_month]

[   Unnamed: 0 Salesperson  Compact  Sedan  SUV  Truck
 0           0         Ann       22     18   15     12
 1           1         Bob       19     12   17     20,
    Unnamed: 0 Salesperson  Compact  Sedan  SUV  Truck
 0           0         Ann       22     18   15     12
 1           1         Bob       20     14    6     24]

#### Step 3 - Pull off the month from the file names and repackage as a `dict`

In [16]:
import re

MONTH_RE = re.compile(r'^\./data/auto_sales_([a-zA-Z_]*)\.csv$')
get_month = lambda p: MONTH_RE.match(p).group(1) 
month_names = lambda files: [get_month(p) for p in files]
month_names(sales_files)

['apr', 'may']

In [17]:
month_name_and_file = list(zip(month_names(sales_files), sales_files))
month_name_and_file

[('apr', './data/auto_sales_apr.csv'), ('may', './data/auto_sales_may.csv')]

#### Now repackage with a `list` comprehension

Note that we will need the month name later, so we are storing it in a `tuple` with the data frame for now.

In [18]:
sales_by_month = [(mon,pd.read_csv(file)) for mon, file in month_name_and_file]
[(mon, df.head(2)) for mon, df in sales_by_month]

[('apr',
     Unnamed: 0 Salesperson  Compact  Sedan  SUV  Truck
  0           0         Ann       22     18   15     12
  1           1         Bob       19     12   17     20),
 ('may',
     Unnamed: 0 Salesperson  Compact  Sedan  SUV  Truck
  0           0         Ann       22     18   15     12
  1           1         Bob       20     14    6     24)]

#### Step 4 - Add a month column to each file

Notice that we need to put the `dfply` pipe *inside* the `list` comprehension to allow access to the names.

In [20]:
sale_files_with_month = [(df  >> mutate(month = mon))
                         for mon, df in sales_by_month] 

[df.head(2) for df in sale_files_with_month]

[   Unnamed: 0 Salesperson  Compact  Sedan  SUV  Truck month
 0           0         Ann       22     18   15     12   apr
 1           1         Bob       19     12   17     20   apr,
    Unnamed: 0 Salesperson  Compact  Sedan  SUV  Truck month
 0           0         Ann       22     18   15     12   may
 1           1         Bob       20     14    6     24   may]

#### Step 5 - Combine the files using `pd.concat`

Note that `pd.concat` is `dfply.union_all`

In [31]:
?pd.concat

In [57]:
combined_files = pd.concat(sale_files_with_month)
combined_files

Unnamed: 0.1,Unnamed: 0,Salesperson,Compact,Sedan,SUV,Truck,month
0,0,Ann,22,18,15,12,apr
1,1,Bob,19,12,17,20,apr
2,2,Yolanda,19,8,32,15,apr
3,3,Xerxes,12,23,18,9,apr
0,0,Ann,22,18,15,12,may
1,1,Bob,20,14,6,24,may
2,2,Yolanda,19,10,28,17,may
3,3,Xerxes,11,27,17,9,may


## <font color="red"> Exercise 2.10.1</font>

In the data folder, you will find 6 files that contain a sample 100,000 rows from the uber data for the month apr14-sep14.  Perform the following tasks:

1. Use `glob` to get all 6 file paths.
2. Use a regular expression to create a `lambda` function that pulls the month from the files.
3. Read the 6 data frames into a `list` of `tuples` containing the month name corresponding data frame.
4. Add the month column each data frame using a pipe inside of a comprehension.
5. Use `pd.concat` to combine these 6 data frames into one combined `df`

In [22]:
pd.set_option('display.max_columns', None)

In [23]:
import glob
paths = glob.glob("./data/uber/uber-trip-data/uber-raw-data-???14.csv")
paths

['./data/uber/uber-trip-data/uber-raw-data-apr14.csv',
 './data/uber/uber-trip-data/uber-raw-data-aug14.csv',
 './data/uber/uber-trip-data/uber-raw-data-jul14.csv',
 './data/uber/uber-trip-data/uber-raw-data-jun14.csv',
 './data/uber/uber-trip-data/uber-raw-data-may14.csv',
 './data/uber/uber-trip-data/uber-raw-data-sep14.csv']

In [25]:
import re
month_extractor = re.compile("([a-z]{3})14")
extract_month = lambda s: month_extractor.search(s).group(1)

In [26]:
extract_month("uber-raw-data-jul14.csv")

'jul'

In [29]:
month_df = [(extract_month(filepath), pd.read_csv(filepath)) for filepath in paths]
month_df

[('apr',
                   Date/Time      Lat      Lon    Base
  0         4/1/2014 0:11:00  40.7690 -73.9549  B02512
  1         4/1/2014 0:17:00  40.7267 -74.0345  B02512
  2         4/1/2014 0:21:00  40.7316 -73.9873  B02512
  3         4/1/2014 0:28:00  40.7588 -73.9776  B02512
  4         4/1/2014 0:33:00  40.7594 -73.9722  B02512
  ...                    ...      ...      ...     ...
  564511  4/30/2014 23:22:00  40.7640 -73.9744  B02764
  564512  4/30/2014 23:26:00  40.7629 -73.9672  B02764
  564513  4/30/2014 23:31:00  40.7443 -73.9889  B02764
  564514  4/30/2014 23:32:00  40.6756 -73.9405  B02764
  564515  4/30/2014 23:48:00  40.6880 -73.9608  B02764
  
  [564516 rows x 4 columns]),
 ('aug',
                   Date/Time      Lat      Lon    Base
  0         8/1/2014 0:03:00  40.7366 -73.9906  B02512
  1         8/1/2014 0:09:00  40.7260 -73.9918  B02512
  2         8/1/2014 0:12:00  40.7209 -74.0507  B02512
  3         8/1/2014 0:12:00  40.7387 -73.9856  B02512
  4         8/

In [30]:
dfs_with_month = [df >> mutate(month = mon) for mon, df in month_df]
dfs_with_month

[                 Date/Time      Lat      Lon    Base month
 0         4/1/2014 0:11:00  40.7690 -73.9549  B02512   apr
 1         4/1/2014 0:17:00  40.7267 -74.0345  B02512   apr
 2         4/1/2014 0:21:00  40.7316 -73.9873  B02512   apr
 3         4/1/2014 0:28:00  40.7588 -73.9776  B02512   apr
 4         4/1/2014 0:33:00  40.7594 -73.9722  B02512   apr
 ...                    ...      ...      ...     ...   ...
 564511  4/30/2014 23:22:00  40.7640 -73.9744  B02764   apr
 564512  4/30/2014 23:26:00  40.7629 -73.9672  B02764   apr
 564513  4/30/2014 23:31:00  40.7443 -73.9889  B02764   apr
 564514  4/30/2014 23:32:00  40.6756 -73.9405  B02764   apr
 564515  4/30/2014 23:48:00  40.6880 -73.9608  B02764   apr
 
 [564516 rows x 5 columns],
                  Date/Time      Lat      Lon    Base month
 0         8/1/2014 0:03:00  40.7366 -73.9906  B02512   aug
 1         8/1/2014 0:09:00  40.7260 -73.9918  B02512   aug
 2         8/1/2014 0:12:00  40.7209 -74.0507  B02512   aug
 3        

In [34]:
uber_data_combined = pd.concat(dfs_with_month)
uber_data_combined.sample(10)

Unnamed: 0,Date/Time,Lat,Lon,Base,month
930052,9/17/2014 9:59:00,40.699,-73.9805,B02764,sep
487528,8/20/2014 21:41:00,40.7299,-73.9885,B02617,aug
231512,8/28/2014 19:21:00,40.6626,-73.9328,B02598,aug
123143,6/11/2014 20:55:00,40.678,-73.9706,B02598,jun
94200,5/8/2014 7:46:00,40.745,-74.0024,B02598,may
361078,5/17/2014 17:58:00,40.7751,-73.9591,B02617,may
243879,9/26/2014 19:16:00,40.6902,-74.1785,B02598,sep
835512,9/28/2014 12:56:00,40.7662,-73.9813,B02682,sep
548495,7/28/2014 17:45:00,40.7526,-73.9795,B02617,jul
857058,9/3/2014 9:33:00,40.7526,-74.0075,B02764,sep


## Example 2 - Reading and joining the baseball database using `dict`

**Task:** Collect the number of total hits for each batters in the 2010 season join on their first and last name.

In the second example, we will store the data frames in a `dict`, which will make it easier to join the files by ne

#### Step 1 - Get the files names

In [35]:
from glob import glob
files = glob('./data/baseball/core/*.csv')
files

['./data/baseball/core/AllstarFull.csv',
 './data/baseball/core/Appearances.csv',
 './data/baseball/core/AwardsManagers.csv',
 './data/baseball/core/AwardsPlayers.csv',
 './data/baseball/core/AwardsShareManagers.csv',
 './data/baseball/core/AwardsSharePlayers.csv',
 './data/baseball/core/Batting.csv',
 './data/baseball/core/BattingPost.csv',
 './data/baseball/core/CollegePlaying.csv',
 './data/baseball/core/Fielding.csv',
 './data/baseball/core/FieldingOF.csv',
 './data/baseball/core/FieldingOFsplit.csv',
 './data/baseball/core/FieldingPost.csv',
 './data/baseball/core/HallOfFame.csv',
 './data/baseball/core/HomeGames.csv',
 './data/baseball/core/Managers.csv',
 './data/baseball/core/ManagersHalf.csv',
 './data/baseball/core/Parks.csv',
 './data/baseball/core/People.csv',
 './data/baseball/core/Pitching.csv',
 './data/baseball/core/PitchingPost.csv',
 './data/baseball/core/Salaries.csv',
 './data/baseball/core/Schools.csv',
 './data/baseball/core/SeriesPost.csv',
 './data/baseball/core

* Only need the `Batting.csv` and `People.csv`.  
* Narrow with a RegEx

In [51]:
import re
needed_file = re.compile(r'./data/baseball/core/(Batting|People).csv')

needed_files = [f for f in files if needed_file.match(f)]
needed_files

['./data/baseball/core/Batting.csv', './data/baseball/core/People.csv']

#### Step 2 - Make helper functions to get the name from path

In [39]:
import re
FILE_NAME_RE = re.compile(r'^\./data/baseball/core/([a-zA-Z_]*)\.csv$')
file_name = lambda p: FILE_NAME_RE.match(p).group(1) 
file_names = lambda files: [file_name(p) for p in files]
file_names(needed_files)

['Batting', 'People']

#### Step 3 - Use a comprehension to read in all files

**Note:** The data is small (< 10mb total) so it is safe to read all at once.

In [52]:
dfs = {name:pd.read_csv(path) for name, path in zip(file_names(needed_files), needed_files)}
dfs['Batting'].head()

Unnamed: 0,playerID,yearID,stint,teamID,lgID,G,AB,R,H,2B,3B,HR,RBI,SB,CS,BB,SO,IBB,HBP,SH,SF,GIDP
0,abercda01,1871,1,TRO,,1,4,0,0,0,0,0,0.0,0.0,0.0,0,0.0,,,,,0.0
1,addybo01,1871,1,RC1,,25,118,30,32,6,0,0,13.0,8.0,1.0,4,0.0,,,,,0.0
2,allisar01,1871,1,CL1,,29,137,28,40,4,5,0,19.0,3.0,1.0,2,5.0,,,,,1.0
3,allisdo01,1871,1,WS3,,27,133,28,44,10,2,2,27.0,1.0,1.0,0,2.0,,,,,0.0
4,ansonca01,1871,1,RC1,,25,120,29,39,11,3,0,16.0,6.0,2.0,2,1.0,,,,,0.0


In [93]:
dfs['People'].head()

Unnamed: 0,playerID,birthYear,birthMonth,birthDay,birthCountry,birthState,birthCity,deathYear,deathMonth,deathDay,...,nameLast,nameGiven,weight,height,bats,throws,debut,finalGame,retroID,bbrefID
0,aardsda01,1981.0,12.0,27.0,USA,CO,Denver,,,,...,Aardsma,David Allan,215.0,75.0,R,R,2004-04-06,2015-08-23,aardd001,aardsda01
1,aaronha01,1934.0,2.0,5.0,USA,AL,Mobile,,,,...,Aaron,Henry Louis,180.0,72.0,R,R,1954-04-13,1976-10-03,aaroh101,aaronha01
2,aaronto01,1939.0,8.0,5.0,USA,AL,Mobile,1984.0,8.0,16.0,...,Aaron,Tommie Lee,190.0,75.0,R,R,1962-04-10,1971-09-26,aarot101,aaronto01
3,aasedo01,1954.0,9.0,8.0,USA,CA,Orange,,,,...,Aase,Donald William,190.0,75.0,R,R,1977-07-26,1990-10-03,aased001,aasedo01
4,abadan01,1972.0,8.0,25.0,USA,FL,Palm Beach,,,,...,Abad,Fausto Andres,184.0,73.0,L,L,2001-09-10,2006-04-13,abada001,abadan01


#### Step 4 - Preprocess each file.

In [104]:
# Filter, select, and aggregate hits for 2010.
hits_in_2010_raw = (dfs['Batting']
                   >> select(X.yearID, X.playerID, X.H)
                   >> filter_by(X.yearID == 2010)
                   >> group_by(X.playerID)
                   >> summarise(total_hits = mean(X.H))
                   )
hits_in_2010_raw.head(2)

Unnamed: 0,playerID,total_hits
0,aardsda01,0.0
1,abadfe01,0.0


In [53]:
# Grab the first and last names from People.

player_names = (dfs['People']
         >> select(X.playerID, X.nameFirst, X.nameLast))
player_names.head(2)

Unnamed: 0,playerID,nameFirst,nameLast
0,aardsda01,David,Aardsma
1,aaronha01,Hank,Aaron


#### Step 4 -- Join the tables

In [109]:
hits_in_2010 = (hits_in_2010_raw 
                >> left_join(player_names, by='playerID')
                >> drop(X.playerID)
               )
hits_in_2010.head()

Unnamed: 0,total_hits,nameFirst,nameLast
0,0.0,David,Aardsma
1,0.0,Fernando,Abad
2,146.0,Bobby,Abreu
3,45.0,Tony,Abreu
4,0.0,Jeremy,Accardo


## <font color="red"> Exercise 2.10.2 </font>

We want to get the total hits allowed for all pitchers during the 2000-2010 seasons.  Use `glob` and a `dict` to collect this information into a table that includes the players first and last names.

In [37]:
pitching_files = glob("./data/baseball/core/Pitching*.csv")
pitching_files

['./data/baseball/core/Pitching.csv', './data/baseball/core/PitchingPost.csv']

In [40]:
pitching_df_dict = { file_name(file) : pd.read_csv(file) for file in pitching_files}
pitching_df_dict

{'Pitching':         playerID  yearID  stint teamID lgID   W   L   G  GS  CG  SHO  SV  \
 0      bechtge01    1871      1    PH1  NaN   1   2   3   3   2    0   0   
 1      brainas01    1871      1    WS3  NaN  12  15  30  30  30    0   0   
 2      fergubo01    1871      1    NY2  NaN   0   0   1   0   0    0   0   
 3      fishech01    1871      1    RC1  NaN   4  16  24  24  22    1   0   
 4      fleetfr01    1871      1    NY2  NaN   0   1   1   1   1    0   0   
 ...          ...     ...    ...    ...  ...  ..  ..  ..  ..  ..  ...  ..   
 47623  zamorda01    2019      1    NYN   NL   0   1  17   0   0    0   0   
 47624  zeuchtj01    2019      1    TOR   AL   1   2   5   3   0    0   0   
 47625  zimmejo02    2019      1    DET   AL   1  13  23  23   0    0   0   
 47626  zimmeky01    2019      1    KCA   AL   0   1  15   0   0    0   0   
 47627  zobribe01    2019      1    CHN   NL   0   0   1   0   0    0   0   
 
        IPouts    H   ER  HR  BB  SO  BAOpp    ERA  IBB  WP  H

In [42]:
# are the columns the same?
set(pitching_df_dict["Pitching"].columns).difference(set(pitching_df_dict["PitchingPost"].columns))

{'stint'}

In [44]:
combined_pitching_data = pd.concat([df for df in pitching_df_dict.values()])
combined_pitching_data.head()

Unnamed: 0,playerID,yearID,stint,teamID,lgID,W,L,G,GS,CG,SHO,SV,IPouts,H,ER,HR,BB,SO,BAOpp,ERA,IBB,WP,HBP,BK,BFP,GF,R,SH,SF,GIDP,round
0,bechtge01,1871,1.0,PH1,,1,2,3,3,2,0,0,78,43,23,0,11,1,,7.96,,7.0,,0.0,146.0,0,42,,,,
1,brainas01,1871,1.0,WS3,,12,15,30,30,30,0,0,792,361,132,4,37,13,,4.5,,7.0,,0.0,1291.0,0,292,,,,
2,fergubo01,1871,1.0,NY2,,0,0,1,0,0,0,0,3,8,3,0,0,0,,27.0,,2.0,,0.0,14.0,0,9,,,,
3,fishech01,1871,1.0,RC1,,4,16,24,24,22,1,0,639,295,103,3,31,15,,4.35,,20.0,,0.0,1080.0,1,257,,,,
4,fleetfr01,1871,1.0,NY2,,0,1,1,1,1,0,0,27,20,10,0,3,0,,10.0,,0.0,,0.0,57.0,0,21,,,,


In [55]:
# total hits per year
(combined_pitching_data
    >> filter_by((X.yearID >= 2000) & (X.yearID <= 2010))
    >> group_by(X.playerID, X.yearID)
    >> summarize(hits_allowed = X.H.sum())
    >> inner_join(player_names, by="playerID")
    >> select("nameFirst", "nameLast", "yearID", "hits_allowed")
)

Unnamed: 0,nameFirst,nameLast,yearID,hits_allowed
0,David,Aardsma,2004,20
1,David,Aardsma,2006,41
2,David,Aardsma,2007,39
3,David,Aardsma,2008,49
4,David,Aardsma,2009,49
...,...,...,...,...
6902,Joel,Zumaya,2006,58
6903,Joel,Zumaya,2007,23
6904,Joel,Zumaya,2008,24
6905,Joel,Zumaya,2009,34


In [57]:
# sum over entire period
(combined_pitching_data
    >> filter_by((X.yearID >= 2000) & (X.yearID <= 2010))
    >> group_by(X.playerID)
    >> summarize(hits_allowed = X.H.sum())
    >> inner_join(player_names, by="playerID")
    >> select("nameFirst", "nameLast", "hits_allowed")
)

Unnamed: 0,nameFirst,nameLast,hits_allowed
0,David,Aardsma,231
1,Fernando,Abad,14
2,Paul,Abbott,519
3,Winston,Abreu,57
4,Jeremy,Accardo,203
...,...,...,...
1866,Jeff,Zimmerman,128
1867,Jordan,Zimmermann,126
1868,Charlie,Zink,11
1869,Barry,Zito,1993


In [None]:
# questions:
# Why use a dictionary? do we care about the filenames?
# We meant Pitching and PitchingPost (postseason), right? That was why we used the glob?
# Hits per pitcher per year (i.e. multiple rows per person) or sum over the period?
# Output (whatever order): pitcher first and last name, year?, total hits allowed