# INFO 3402 – Week 03: Combining and Validating

[Brian C. Keegan, Ph.D.](http://brianckeegan.com/)  
[Assistant Professor, Department of Information Science](https://www.colorado.edu/cmci/people/information-science/brian-c-keegan)  
University of Colorado Boulder  

Copyright and distributed under an [MIT License](https://opensource.org/licenses/MIT)  

In [1]:
import numpy as np
import pandas as pd
import os

pd.options.display.max_columns = 200
pd.options.display.float_format = '{:,.2f}'.format

## Background
Baseball is my favorite sport (go Red Sox!) in part because there is a long tradition of collecting statistical data about baseball games. So we're going to take a break from last week's very happy data about death statistics to work with baseball data this week.

The [Retrosheet game logs](http://www.retrosheet.org/gamelogs/index.html) ([documentation](https://www.retrosheet.org/gamelogs/glfields.txt)) and [Lahman Database](http://www.seanlahman.com/baseball-archive/statistics/) are the two most famous baseball datasets. There are more proprietary datasets out there, but much of the data you see on sites like [Baseball-Reference](http://www.baseball-reference.com) is [derived from](https://www.baseball-reference.com/about/coverage.shtml) the Retrosheet and Lahman data.

Download the "Retrosheet.zip" and "Lahman.zip" files from Canvas. Unzip each file inside the class directory where you keep your notebooks, into directories named something like "Retrosheet" and "Lahman".

You could alternatively download "1871-2021 Game Logs" and "2020 – comma-delimited version" from their respective websites. Note the Lahman CSV file data from the website is nested within a deeper file structure than the "Lahman.zip" file I shared on Canvas. If this is confusing you, just use the zip files from Canvas. 

## Learning Objectives
The other reason we're going to work with these data is because they will help us develop our skills around joining, combining, and validating data. We are going to focus on two pandas functions this week: [`concat`](https://pandas.pydata.org/docs/reference/api/pandas.concat.html) and [`merge`](https://pandas.pydata.org/docs/reference/api/pandas.merge.html).

At a high level, a `concat` lets us add rows on top of each other with **columns** that are the same or similar. (Image from [User Guide](https://pandas.pydata.org/docs/user_guide/merging.html))

![Example of concat](https://pandas.pydata.org/docs/_images/merging_concat_basic.png)
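To make the picture concrete, here is a minimal sketch of a `concat` on two tiny DataFrames. The team codes and win totals are made-up illustration values, not data from the files we use later.

```python
import pandas as pd

# Two toy DataFrames with the same columns (hypothetical values)
first_half = pd.DataFrame({'team': ['BOS', 'NYA'], 'wins': [50, 48]})
second_half = pd.DataFrame({'team': ['BOS', 'NYA'], 'wins': [45, 51]})

# Stack the rows on top of each other; ignore_index=True renumbers the index from 0
stacked = pd.concat([first_half, second_half], ignore_index=True)

# The row counts should add up and the columns should be unchanged
print(stacked.shape)
print(stacked.columns.tolist())
```

The number of rows after a `concat` should equal the sum of the rows going in, which is the kind of expectation we will keep checking.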

A `merge` lets us add columns next to each other with **rows** that are the same or similar. (Image from [User Guide](https://pandas.pydata.org/docs/user_guide/merging.html))

![Example of merge](https://pandas.pydata.org/docs/_images/merging_merge_on_key.png)
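Here is a similarly small sketch of a `merge`. The `players` and `teams` tables and their values are hypothetical, and the choice of a left merge is just one option for illustration.

```python
import pandas as pd

# Hypothetical player and team tables that share a "teamID" key column
players = pd.DataFrame({
    'playerID': ['p1', 'p2', 'p3'],
    'teamID': ['BOS', 'NYA', 'BOS'],
})
teams = pd.DataFrame({
    'teamID': ['BOS', 'NYA'],
    'city': ['Boston', 'New York'],
})

# Left merge: keep every player row and attach the matching team columns
merged = pd.merge(players, teams, on='teamID', how='left')
print(merged)
```

The result has the same three rows as `players` plus the new "city" column from `teams`.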

It's not enough to simply run these functions: we also need to *validate* their output to make sure they are not unexpectedly including or excluding data. These functions can be misused in ways that don't throw exceptions or errors but still produce unreliable, biased, and/or incomplete data. So it is *essential* that we develop a validation practice around them.
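As one possible shape for that validation practice, `merge` has `validate` and `indicator` parameters. This is a sketch on hypothetical tables where one key on each side has no match, not a complete checklist.

```python
import pandas as pd

# Hypothetical tables: "SEA" has no team row and "CHN" has no player row
players = pd.DataFrame({
    'playerID': ['p1', 'p2', 'p3'],
    'teamID': ['BOS', 'NYA', 'SEA'],
})
teams = pd.DataFrame({
    'teamID': ['BOS', 'NYA', 'CHN'],
    'city': ['Boston', 'New York', 'Chicago'],
})

# validate raises an error if "teamID" is duplicated in the right-hand table;
# indicator adds a "_merge" column recording where each row came from
checked = pd.merge(
    players, teams,
    on='teamID', how='outer',
    validate='many_to_one', indicator=True,
)

# Count matched and unmatched rows
counts = checked['_merge'].value_counts()
print(counts)
```

Rows labeled `left_only` or `right_only` are the ones that an inner merge would have silently dropped.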

Definitely make sure to read these Getting Started tutorials and User Guides:

* [Getting Started - How to combine data from multiple tables?](https://pandas.pydata.org/docs/getting_started/intro_tutorials/08_combine_dataframes.html)
* [User Guide - Database-style DataFrame or named Series joining/merging](https://pandas.pydata.org/docs/user_guide/merging.html#database-style-dataframe-or-named-series-joining-merging)
* Chan, L. (2021). "[Python Tricks: How to Check Table Merging with Pandas](https://towardsdatascience.com/python-tricks-how-to-check-table-merging-with-pandas-cae6b9b1d540)." *Towards Data Science*. Medium.

### Raw NBConvert cells

We've primarily used either Code or Markdown cells so far. We will use "Raw" cells as well in the Weekly Assignment. This is mostly a stylistic choice that helps us with grading. But you can also convert a "Code" cell into a "Raw" cell if you want to preserve some code without accidentally running it.

(1) Change the cell type below to a "Raw" cell using any of:
* the dropdown menu in the toolbar
* navigating to Cell > Cell Type in the menu
* using the "R" keyboard shortcut

(2) And write your name in the cell.

### Set and Reset index

`.reset_index()` ([docs](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.reset_index.html), [User Guide](https://pandas.pydata.org/docs/user_guide/indexing.html#set-reset-index)) turns an index into columns. Or you can turn a column into an index with `.set_index()` ([docs](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.set_index.html), [User Guide](https://pandas.pydata.org/docs/user_guide/indexing.html#set-reset-index)).
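A minimal round trip on a toy DataFrame shows both directions. The column names mirror the Lahman data, but the frame itself is made up for illustration.

```python
import pandas as pd

toy = pd.DataFrame({'yearID': [2018, 2019], 'HR': [208, 245]})

# "yearID" moves from an ordinary column into the index
indexed = toy.set_index('yearID')

# ...and back out of the index into an ordinary column
restored = indexed.reset_index()

print(indexed.index.name)
print(restored.columns.tolist())
```

Neither method changes the DataFrame in place by default, so assign the result to a variable if you want to keep it.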

#### Mini-exercise
Read in the "Batting.csv" file from the Lahman folder.

In [2]:
batting_df = pd.read_csv('./Lahman/Batting.csv')

batting_df.head()

Unnamed: 0,playerID,yearID,stint,teamID,lgID,G,AB,R,H,2B,3B,HR,RBI,SB,CS,BB,SO,IBB,HBP,SH,SF,GIDP
0,abercda01,1871,1,TRO,,1,4,0,0,0,0,0,0.0,0.0,0.0,0,0.0,,,,,0.0
1,addybo01,1871,1,RC1,,25,118,30,32,6,0,0,13.0,8.0,1.0,4,0.0,,,,,0.0
2,allisar01,1871,1,CL1,,29,137,28,40,4,5,0,19.0,3.0,1.0,2,5.0,,,,,1.0
3,allisdo01,1871,1,WS3,,27,133,28,44,10,2,2,27.0,1.0,1.0,0,2.0,,,,,0.0
4,ansonca01,1871,1,RC1,,25,120,29,39,11,3,0,16.0,6.0,2.0,2,1.0,,,,,0.0


Make a pivot_table or groupby-aggregation of total "HR"s by "yearID" and "teamID". The solution below puts "yearID" in the index and "teamID" in the columns.

In [3]:
team_annual_hrs = pd.pivot_table(
    data = batting_df,
    index = 'yearID',
    columns = 'teamID',
    values = 'HR',
    aggfunc = 'sum'
)

team_annual_hrs

teamID,ALT,ANA,ARI,ATL,BAL,BFN,BFP,BL1,BL2,BL3,BL4,BLA,BLF,BLN,BLU,BOS,BR1,BR2,BR3,BR4,BRF,BRO,BRP,BS1,BS2,BSN,BSP,BSU,BUF,CAL,CH1,CH2,CHA,CHF,CHN,CHP,CHU,CIN,CL1,CL2,CL3,CL4,CL5,CL6,CLE,CLP,CN1,CN2,CN3,CNU,COL,DET,DTN,ELI,FLO,FW1,HAR,HOU,HR1,IN1,IN2,IN3,IND,KC1,KC2,KCA,KCF,KCN,KCU,KEO,LAA,LAN,LS1,LS2,LS3,MIA,MID,MIL,MIN,ML1,ML2,ML3,ML4,MLA,MLU,MON,NEW,NH1,NY1,NY2,NY3,NY4,NYA,NYN,NYP,OAK,PH1,PH2,PH3,PH4,PHA,PHI,PHN,PHP,PHU,PIT,PRO,PT1,PTF,PTP,RC1,RC2,RIC,SDN,SE1,SEA,SFN,SL1,SL2,SL3,SL4,SL5,SLA,SLF,SLN,SLU,SPU,SR1,SR2,TBA,TEX,TL1,TL2,TOR,TRN,TRO,WAS,WIL,WOR,WS1,WS2,WS3,WS4,WS5,WS6,WS7,WS8,WS9,WSU
1871,,,,,,,,,,,,,,,,,,,,,,,,3.00,,,,,,,10.00,,,,,,,,7.00,,,,,,,,,,,,,,,,,2.00,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,1.00,,,,,,,9.00,,,,,,,,,,,,,,3.00,,,,,,,,,,,,,,,,,,,,,,,,,6.00,,,,,,6.00,,,,,,,
1872,,,,,,,,14.00,,,,,,,,,0.00,1.00,,,,,,7.00,,,,,,,,,,,,,,,0.00,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,2.00,,,,,,,,,,,,,4.00,,,,,,,4.00,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,5.00,,,,,,0.00,0.00,,,,,,
1873,,,,,,,,9.00,,,0.00,,,,,,,6.00,,,,,,13.00,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,0.00,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,5.00,,,,,,,4.00,8.00,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,2.00,,,,,
1874,,,,,,,,1.00,,,,,,,,,,1.00,,,,,,17.00,,,,,,,,4.00,,,,,,,,,,,,,,,,,,,,,,,,,,,2.00,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,7.00,,,,,,,6.00,2.00,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
1875,,,,,,,,,,,,,,,,,,2.00,,,,,,15.00,,,,,,,,0.00,,,,,,,,,,,,,,,,,,,,,,,,,,,2.00,,,,,,,,,,,0.00,,,,,,,,,,,,,,,,,,2.00,,7.00,,,,,,,7.00,5.00,0.00,,,,,,,,,,,,,,,,,,,0.00,0.00,,,,,,,,,,,,,,,,,,,,,,,,,,0.00,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2016,,,190.00,122.00,253.00,,,,,,,,,,,208.00,,,,,,,,,,,,,,,,,168.00,,199.00,,,164.00,,,,,,,185.00,,,,,,204.00,211.00,,,,,,198.00,,,,,,,,147.00,,,,,156.00,189.00,,,,128.00,,194.00,200.00,,,,,,,,,,,,,,183.00,218.00,,169.00,,,,,,161.00,,,,153.00,,,,,,,,177.00,,223.00,130.00,,,,,,,,225.00,,,,,216.00,215.00,,,221.00,,,203.00,,,,,,,,,,,,
2017,,,220.00,165.00,232.00,,,,,,,,,,,168.00,,,,,,,,,,,,,,,,,186.00,,223.00,,,219.00,,,,,,,212.00,,,,,,192.00,187.00,,,,,,238.00,,,,,,,,193.00,,,,,186.00,221.00,,,,194.00,,224.00,206.00,,,,,,,,,,,,,,241.00,224.00,,234.00,,,,,,174.00,,,,151.00,,,,,,,,189.00,,200.00,128.00,,,,,,,,196.00,,,,,228.00,237.00,,,222.00,,,215.00,,,,,,,,,,,,
2018,,,176.00,175.00,188.00,,,,,,,,,,,208.00,,,,,,,,,,,,,,,,,182.00,,167.00,,,172.00,,,,,,,216.00,,,,,,210.00,135.00,,,,,,205.00,,,,,,,,155.00,,,,,214.00,235.00,,,,128.00,,218.00,166.00,,,,,,,,,,,,,,267.00,170.00,,227.00,,,,,,186.00,,,,157.00,,,,,,,,162.00,,176.00,133.00,,,,,,,,205.00,,,,,150.00,194.00,,,217.00,,,191.00,,,,,,,,,,,,
2019,,,220.00,249.00,213.00,,,,,,,,,,,245.00,,,,,,,,,,,,,,,,,182.00,,256.00,,,227.00,,,,,,,223.00,,,,,,224.00,149.00,,,,,,288.00,,,,,,,,162.00,,,,,220.00,279.00,,,,146.00,,250.00,307.00,,,,,,,,,,,,,,306.00,242.00,,257.00,,,,,,215.00,,,,163.00,,,,,,,,219.00,,239.00,167.00,,,,,,,,210.00,,,,,217.00,223.00,,,247.00,,,231.00,,,,,,,,,,,,
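The exercise mentions a groupby-aggregation as an alternative to `pivot_table`. A rough sketch of that route, using a tiny hypothetical stand-in for `batting_df`, could look like this:

```python
import pandas as pd

# Hypothetical stand-in for batting_df: one row per player stint
toy_batting = pd.DataFrame({
    'yearID': [2018, 2018, 2018, 2019],
    'teamID': ['BOS', 'BOS', 'NYA', 'BOS'],
    'HR': [43, 32, 27, 36],
})

# Long format: one total per (yearID, teamID) pair, stored under a two-level MultiIndex
long_hrs = toy_batting.groupby(['yearID', 'teamID'])['HR'].sum()

# Wide format: the same totals with teams spread across the columns, like the pivot table
wide_hrs = long_hrs.unstack('teamID')

print(long_hrs)
print(wide_hrs)
```

The wide version fills in missing values wherever a team has no rows for a year, which is why the pivot table above is mostly empty cells.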


Reset the index.

In [4]:
reset = team_annual_hrs.reset_index()
reset

teamID,yearID,ALT,ANA,ARI,ATL,BAL,BFN,BFP,BL1,BL2,BL3,BL4,BLA,BLF,BLN,BLU,BOS,BR1,BR2,BR3,BR4,BRF,BRO,BRP,BS1,BS2,BSN,BSP,BSU,BUF,CAL,CH1,CH2,CHA,CHF,CHN,CHP,CHU,CIN,CL1,CL2,CL3,CL4,CL5,CL6,CLE,CLP,CN1,CN2,CN3,CNU,COL,DET,DTN,ELI,FLO,FW1,HAR,HOU,HR1,IN1,IN2,IN3,IND,KC1,KC2,KCA,KCF,KCN,KCU,KEO,LAA,LAN,LS1,LS2,LS3,MIA,MID,MIL,MIN,ML1,ML2,ML3,ML4,MLA,MLU,MON,NEW,NH1,NY1,NY2,NY3,NY4,NYA,NYN,NYP,OAK,PH1,PH2,PH3,PH4,PHA,PHI,PHN,PHP,PHU,PIT,PRO,PT1,PTF,PTP,RC1,RC2,RIC,SDN,SE1,SEA,SFN,SL1,SL2,SL3,SL4,SL5,SLA,SLF,SLN,SLU,SPU,SR1,SR2,TBA,TEX,TL1,TL2,TOR,TRN,TRO,WAS,WIL,WOR,WS1,WS2,WS3,WS4,WS5,WS6,WS7,WS8,WS9,WSU
0,1871,,,,,,,,,,,,,,,,,,,,,,,,3.00,,,,,,,10.00,,,,,,,,7.00,,,,,,,,,,,,,,,,,2.00,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,1.00,,,,,,,9.00,,,,,,,,,,,,,,3.00,,,,,,,,,,,,,,,,,,,,,,,,,6.00,,,,,,6.00,,,,,,,
1,1872,,,,,,,,14.00,,,,,,,,,0.00,1.00,,,,,,7.00,,,,,,,,,,,,,,,0.00,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,2.00,,,,,,,,,,,,,4.00,,,,,,,4.00,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,5.00,,,,,,0.00,0.00,,,,,,
2,1873,,,,,,,,9.00,,,0.00,,,,,,,6.00,,,,,,13.00,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,0.00,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,5.00,,,,,,,4.00,8.00,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,2.00,,,,,
3,1874,,,,,,,,1.00,,,,,,,,,,1.00,,,,,,17.00,,,,,,,,4.00,,,,,,,,,,,,,,,,,,,,,,,,,,,2.00,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,7.00,,,,,,,6.00,2.00,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
4,1875,,,,,,,,,,,,,,,,,,2.00,,,,,,15.00,,,,,,,,0.00,,,,,,,,,,,,,,,,,,,,,,,,,,,2.00,,,,,,,,,,,0.00,,,,,,,,,,,,,,,,,,2.00,,7.00,,,,,,,7.00,5.00,0.00,,,,,,,,,,,,,,,,,,,0.00,0.00,,,,,,,,,,,,,,,,,,,,,,,,,,0.00,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
145,2016,,,190.00,122.00,253.00,,,,,,,,,,,208.00,,,,,,,,,,,,,,,,,168.00,,199.00,,,164.00,,,,,,,185.00,,,,,,204.00,211.00,,,,,,198.00,,,,,,,,147.00,,,,,156.00,189.00,,,,128.00,,194.00,200.00,,,,,,,,,,,,,,183.00,218.00,,169.00,,,,,,161.00,,,,153.00,,,,,,,,177.00,,223.00,130.00,,,,,,,,225.00,,,,,216.00,215.00,,,221.00,,,203.00,,,,,,,,,,,,
146,2017,,,220.00,165.00,232.00,,,,,,,,,,,168.00,,,,,,,,,,,,,,,,,186.00,,223.00,,,219.00,,,,,,,212.00,,,,,,192.00,187.00,,,,,,238.00,,,,,,,,193.00,,,,,186.00,221.00,,,,194.00,,224.00,206.00,,,,,,,,,,,,,,241.00,224.00,,234.00,,,,,,174.00,,,,151.00,,,,,,,,189.00,,200.00,128.00,,,,,,,,196.00,,,,,228.00,237.00,,,222.00,,,215.00,,,,,,,,,,,,
147,2018,,,176.00,175.00,188.00,,,,,,,,,,,208.00,,,,,,,,,,,,,,,,,182.00,,167.00,,,172.00,,,,,,,216.00,,,,,,210.00,135.00,,,,,,205.00,,,,,,,,155.00,,,,,214.00,235.00,,,,128.00,,218.00,166.00,,,,,,,,,,,,,,267.00,170.00,,227.00,,,,,,186.00,,,,157.00,,,,,,,,162.00,,176.00,133.00,,,,,,,,205.00,,,,,150.00,194.00,,,217.00,,,191.00,,,,,,,,,,,,
148,2019,,,220.00,249.00,213.00,,,,,,,,,,,245.00,,,,,,,,,,,,,,,,,182.00,,256.00,,,227.00,,,,,,,223.00,,,,,,224.00,149.00,,,,,,288.00,,,,,,,,162.00,,,,,220.00,279.00,,,,146.00,,250.00,307.00,,,,,,,,,,,,,,306.00,242.00,,257.00,,,,,,215.00,,,,163.00,,,,,,,,219.00,,239.00,167.00,,,,,,,,210.00,,,,,217.00,223.00,,,247.00,,,231.00,,,,,,,,,,,,


Set the index back to "yearID" by passing the column name to `.set_index()`. Passing a list of column names instead would create a MultiIndex.

In [5]:
reset.set_index('yearID')

teamID,ALT,ANA,ARI,ATL,BAL,BFN,BFP,BL1,BL2,BL3,BL4,BLA,BLF,BLN,BLU,BOS,BR1,BR2,BR3,BR4,BRF,BRO,BRP,BS1,BS2,BSN,BSP,BSU,BUF,CAL,CH1,CH2,CHA,CHF,CHN,CHP,CHU,CIN,CL1,CL2,CL3,CL4,CL5,CL6,CLE,CLP,CN1,CN2,CN3,CNU,COL,DET,DTN,ELI,FLO,FW1,HAR,HOU,HR1,IN1,IN2,IN3,IND,KC1,KC2,KCA,KCF,KCN,KCU,KEO,LAA,LAN,LS1,LS2,LS3,MIA,MID,MIL,MIN,ML1,ML2,ML3,ML4,MLA,MLU,MON,NEW,NH1,NY1,NY2,NY3,NY4,NYA,NYN,NYP,OAK,PH1,PH2,PH3,PH4,PHA,PHI,PHN,PHP,PHU,PIT,PRO,PT1,PTF,PTP,RC1,RC2,RIC,SDN,SE1,SEA,SFN,SL1,SL2,SL3,SL4,SL5,SLA,SLF,SLN,SLU,SPU,SR1,SR2,TBA,TEX,TL1,TL2,TOR,TRN,TRO,WAS,WIL,WOR,WS1,WS2,WS3,WS4,WS5,WS6,WS7,WS8,WS9,WSU
1871,,,,,,,,,,,,,,,,,,,,,,,,3.00,,,,,,,10.00,,,,,,,,7.00,,,,,,,,,,,,,,,,,2.00,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,1.00,,,,,,,9.00,,,,,,,,,,,,,,3.00,,,,,,,,,,,,,,,,,,,,,,,,,6.00,,,,,,6.00,,,,,,,
1872,,,,,,,,14.00,,,,,,,,,0.00,1.00,,,,,,7.00,,,,,,,,,,,,,,,0.00,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,2.00,,,,,,,,,,,,,4.00,,,,,,,4.00,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,5.00,,,,,,0.00,0.00,,,,,,
1873,,,,,,,,9.00,,,0.00,,,,,,,6.00,,,,,,13.00,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,0.00,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,5.00,,,,,,,4.00,8.00,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,2.00,,,,,
1874,,,,,,,,1.00,,,,,,,,,,1.00,,,,,,17.00,,,,,,,,4.00,,,,,,,,,,,,,,,,,,,,,,,,,,,2.00,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,7.00,,,,,,,6.00,2.00,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
1875,,,,,,,,,,,,,,,,,,2.00,,,,,,15.00,,,,,,,,0.00,,,,,,,,,,,,,,,,,,,,,,,,,,,2.00,,,,,,,,,,,0.00,,,,,,,,,,,,,,,,,,2.00,,7.00,,,,,,,7.00,5.00,0.00,,,,,,,,,,,,,,,,,,,0.00,0.00,,,,,,,,,,,,,,,,,,,,,,,,,,0.00,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2016,,,190.00,122.00,253.00,,,,,,,,,,,208.00,,,,,,,,,,,,,,,,,168.00,,199.00,,,164.00,,,,,,,185.00,,,,,,204.00,211.00,,,,,,198.00,,,,,,,,147.00,,,,,156.00,189.00,,,,128.00,,194.00,200.00,,,,,,,,,,,,,,183.00,218.00,,169.00,,,,,,161.00,,,,153.00,,,,,,,,177.00,,223.00,130.00,,,,,,,,225.00,,,,,216.00,215.00,,,221.00,,,203.00,,,,,,,,,,,,
2017,,,220.00,165.00,232.00,,,,,,,,,,,168.00,,,,,,,,,,,,,,,,,186.00,,223.00,,,219.00,,,,,,,212.00,,,,,,192.00,187.00,,,,,,238.00,,,,,,,,193.00,,,,,186.00,221.00,,,,194.00,,224.00,206.00,,,,,,,,,,,,,,241.00,224.00,,234.00,,,,,,174.00,,,,151.00,,,,,,,,189.00,,200.00,128.00,,,,,,,,196.00,,,,,228.00,237.00,,,222.00,,,215.00,,,,,,,,,,,,
2018,,,176.00,175.00,188.00,,,,,,,,,,,208.00,,,,,,,,,,,,,,,,,182.00,,167.00,,,172.00,,,,,,,216.00,,,,,,210.00,135.00,,,,,,205.00,,,,,,,,155.00,,,,,214.00,235.00,,,,128.00,,218.00,166.00,,,,,,,,,,,,,,267.00,170.00,,227.00,,,,,,186.00,,,,157.00,,,,,,,,162.00,,176.00,133.00,,,,,,,,205.00,,,,,150.00,194.00,,,217.00,,,191.00,,,,,,,,,,,,
2019,,,220.00,249.00,213.00,,,,,,,,,,,245.00,,,,,,,,,,,,,,,,,182.00,,256.00,,,227.00,,,,,,,223.00,,,,,,224.00,149.00,,,,,,288.00,,,,,,,,162.00,,,,,220.00,279.00,,,,146.00,,250.00,307.00,,,,,,,,,,,,,,306.00,242.00,,257.00,,,,,,215.00,,,,163.00,,,,,,,,219.00,,239.00,167.00,,,,,,,,210.00,,,,,217.00,223.00,,,247.00,,,231.00,,,,,,,,,,,,
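Passing a list of column names to `.set_index()` produces a MultiIndex. Here is a minimal sketch on a small long-format DataFrame that is constructed just for illustration:

```python
import pandas as pd

# Small long-format table: one row per (yearID, teamID) pair
long_df = pd.DataFrame({
    'yearID': [2018, 2018, 2019],
    'teamID': ['BOS', 'NYA', 'BOS'],
    'HR': [208, 267, 245],
})

# A list of column names creates one index level per column
multi = long_df.set_index(['yearID', 'teamID'])

print(multi.index.names)

# Look up a single row with a tuple of index values
print(multi.loc[(2018, 'NYA'), 'HR'])
```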


## Concatenating

A `concat` lets us add rows on top of each other with **columns** that are the same or similar. (Image from [User Guide](https://pandas.pydata.org/docs/user_guide/merging.html))

![Example of concat](https://pandas.pydata.org/docs/_images/merging_concat_basic.png)

### Checking data shape
The "Retrosheet" game log files record each game played by all Major League teams in a season. The first file is from 1871, the last is from the 2021 season, and there is one for every season in between: 151 seasons in all. Each file should have a similar format in terms of columns, but let's check.

In [6]:
gamelog_1921 = pd.read_csv('Retrosheet/GL1921.TXT')
gamelog_1921.head(2)

Unnamed: 0,19210413,0,Wed,PHA,AL,1,NYA,AL.1,1.1,1.2,11,51,D,Unnamed: 13,Unnamed: 14,Unnamed: 15,NYC14,37000,Unnamed: 18,000000100,02000036x,30,3,0.1,1.3,0.2,1.4,0.3,0.4,0.5,1.5,0.6,2,0.7,0.8,0.9,0.10,3.1,2.1,7,7.1,0.11,0.12,24,11.1,1.6,0.13,0.14,0.15,39,17,5,1.7,1.8,11.2,2.2,0.16,1.9,1.10,0.17,6,0.18,0.19,0.20,0.21,8,1.11,1.12,1.13,0.22,0.23,27,15,0.24,0.25,0.26,0.27,dinnb101,Bill Dinneen,nalld901,Dick Nallin,Unnamed: 81,(none),wilsf901,Frank Wilson,Unnamed: 85,(none).1,Unnamed: 87,(none).2,mackc101,Connie Mack,huggm101,Miller Huggins,maysc101,Carl Mays,perrs101,Scott Perry,Unnamed: 97,(none).3,warda101,Aaron Ward,perrs101.1,Scott Perry.1,maysc101.1,Carl Mays.1,dykej101,Jimmy Dykes,4,wittw101,Whitey Witt,9,walkt101,Tillie Walker,7.2,brazf101,Frank Brazill,3.2,dugaj101,Joe Dugan,5.1,perkc101,Cy Perkins,2.3,welcf101,Frank Welch,8.1,gallc101,Chick Galloway,6.1,perrs101.2,Scott Perry.2,1.14,fewsc101,Chick Fewster,4.1,peckr101,Roger Peckinpaugh,6.2,ruthb101,Babe Ruth,7.3,pippw101,Wally Pipp,3.3,meusb101,Bob Meusel,9.1,bodip101,Ping Bodie,8.2,warda101.1,Aaron Ward.1,5.2,schaw101,Wally Schang,2.4,maysc101.2,Carl Mays.2,1.15,Unnamed: 159,Y
0,19210413,0,Wed,CLE,AL,1,SLA,AL,1,2,4,51,D,,,,STL07,15000.0,,2,00103000x,34,10,1,0,1,2,0,0,0,1,0,5,0,0,2,0,6,2,3,3,0,0,24,8,3,0,1,0,31,5,3,1,0,3,0,0,0,2,0,4,1,1,0,0,5,1,2,2,0,0,27,10,0,0,2,0,evanb901,Billy Evans,hildg101,George Hildebrand,,(none),,(none),,(none),,(none),speat101,Tris Speaker,fohll101,Lee Fohl,shocu101,Urban Shocker,coves101,Stan Coveleski,,(none),,(none),coves101,Stan Coveleski,shocu101,Urban Shocker,jamic101,Charlie Jamieson,7,johnd107,Doc Johnston,3,speat101,Tris Speaker,8,smite104,Elmer Smith,9,gardl101,Larry Gardner,5,sewej101,Joe Sewell,6,stepr101,Riggs Stephenson,4,oneis101,Steve O'Neill,2,coves101,Stan Coveleski,1,tobij101,Jack Tobin,9,gerbw101,Wally Gerber,6,sislg101,George Sisler,3,jacow101,Baby Doll Jacobson,8,willk101,Ken Williams,7,gleab101,Billy Gleason,4,lee-d101,Dud Lee,5,seveh101,Hank Severeid,2,shocu101,Urban Shocker,1,,Y
1,19210413,0,Wed,BOS,AL,1,WS1,AL,1,6,3,54,D,,,,WAS09,18000.0,120.0,110200110,120000000,36,15,0,3,0,5,3,0,1,0,-1,4,0,1,-1,0,7,1,3,0,1,0,26,14,1,0,2,0,33,9,0,1,0,2,0,0,0,1,-1,2,1,0,-1,0,4,3,5,0,0,0,27,11,1,0,1,0,connt901,Tommy Connolly,morig101,George Moriarty,,(none),,(none),,(none),,(none),duffh101,Hugh Duffy,mcbrg101,George McBride,jones104,Sad Sam Jones,johnw102,Walter Johnson,,(none),,(none),jones104,Sad Sam Jones,johnw102,Walter Johnson,vitto101,Ossie Vitt,5,foste103,Eddie Foster,4,menom101,Mike Menosky,7,hendt101,Tim Hendryx,9,mcins101,Stuffy McInnis,3,colls101,Shano Collins,8,scote101,Everett Scott,6,ruelm101,Muddy Ruel,2,jones104,Sad Sam Jones,1,judgj101,Joe Judge,3,milac101,Clyde Milan,9,rices101,Sam Rice,8,lewid101,Duffy Lewis,7,harrb106,Bucky Harris,4,shanh101,Howie Shanks,5,orouf101,Frank O'Rourke,6,piciv101,Val Picinich,2,johnw102,Walter Johnson,1,,D


The data doesn't have column names, and pandas assumes the first row contains them, so we need to tell `read_csv` that there is no header row.

In [7]:
gamelog_1921 = pd.read_csv('Retrosheet/GL1921.TXT',header=None)
gamelog_1921.head(2)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,99,100,101,102,103,104,105,106,107,108,109,110,111,112,113,114,115,116,117,118,119,120,121,122,123,124,125,126,127,128,129,130,131,132,133,134,135,136,137,138,139,140,141,142,143,144,145,146,147,148,149,150,151,152,153,154,155,156,157,158,159,160
0,19210413,0,Wed,PHA,AL,1,NYA,AL,1,1,11,51,D,,,,NYC14,37000.0,,100,02000036x,30,3,0,1,0,1,0,0,0,1,0,2,0,0,0,0,3,2,7,7,0,0,24,11,1,0,0,0,39,17,5,1,1,11,2,0,1,1,0,6,0,0,0,0,8,1,1,1,0,0,27,15,0,0,0,0,dinnb101,Bill Dinneen,nalld901,Dick Nallin,,(none),wilsf901,Frank Wilson,,(none),,(none),mackc101,Connie Mack,huggm101,Miller Huggins,maysc101,Carl Mays,perrs101,Scott Perry,,(none),warda101,Aaron Ward,perrs101,Scott Perry,maysc101,Carl Mays,dykej101,Jimmy Dykes,4,wittw101,Whitey Witt,9,walkt101,Tillie Walker,7,brazf101,Frank Brazill,3,dugaj101,Joe Dugan,5,perkc101,Cy Perkins,2,welcf101,Frank Welch,8,gallc101,Chick Galloway,6,perrs101,Scott Perry,1,fewsc101,Chick Fewster,4,peckr101,Roger Peckinpaugh,6,ruthb101,Babe Ruth,7,pippw101,Wally Pipp,3,meusb101,Bob Meusel,9,bodip101,Ping Bodie,8,warda101,Aaron Ward,5,schaw101,Wally Schang,2,maysc101,Carl Mays,1,,Y
1,19210413,0,Wed,CLE,AL,1,SLA,AL,1,2,4,51,D,,,,STL07,15000.0,,2,00103000x,34,10,1,0,1,2,0,0,0,1,0,5,0,0,2,0,6,2,3,3,0,0,24,8,3,0,1,0,31,5,3,1,0,3,0,0,0,2,0,4,1,1,0,0,5,1,2,2,0,0,27,10,0,0,2,0,evanb901,Billy Evans,hildg101,George Hildebrand,,(none),,(none),,(none),,(none),speat101,Tris Speaker,fohll101,Lee Fohl,shocu101,Urban Shocker,coves101,Stan Coveleski,,(none),,(none),coves101,Stan Coveleski,shocu101,Urban Shocker,jamic101,Charlie Jamieson,7,johnd107,Doc Johnston,3,speat101,Tris Speaker,8,smite104,Elmer Smith,9,gardl101,Larry Gardner,5,sewej101,Joe Sewell,6,stepr101,Riggs Stephenson,4,oneis101,Steve O'Neill,2,coves101,Stan Coveleski,1,tobij101,Jack Tobin,9,gerbw101,Wally Gerber,6,sislg101,George Sisler,3,jacow101,Baby Doll Jacobson,8,willk101,Ken Williams,7,gleab101,Billy Gleason,4,lee-d101,Dud Lee,5,seveh101,Hank Severeid,2,shocu101,Urban Shocker,1,,Y
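The effect of `header=None` is easy to see on a tiny in-memory example. The two rows below are a shortened stand-in for a game log file (only the first four fields), read through `io.StringIO` so no file is needed.

```python
import io
import pandas as pd

# Two header-less rows standing in for a game log file
raw = "19210413,0,Wed,PHA\n19210413,0,Wed,CLE\n"

# Default behavior: the first row is consumed as column names
with_default = pd.read_csv(io.StringIO(raw))

# header=None: every row is treated as data and columns are numbered from 0
without_header = pd.read_csv(io.StringIO(raw), header=None)

print(with_default.shape)
print(without_header.shape)
```

The default read silently loses one game per file, which is exactly the kind of error that throws no exception.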


Now the columns are 0 to 160, but the [documentation](https://www.retrosheet.org/gamelogs/glfields.txt) indexes these columns from 1. To avoid future confusion, let's fix this now.

In [8]:
gamelog_1921.columns = range(1,162)
gamelog_1921.head(2)

Unnamed: 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,99,100,101,102,103,104,105,106,107,108,109,110,111,112,113,114,115,116,117,118,119,120,121,122,123,124,125,126,127,128,129,130,131,132,133,134,135,136,137,138,139,140,141,142,143,144,145,146,147,148,149,150,151,152,153,154,155,156,157,158,159,160,161
0,19210413,0,Wed,PHA,AL,1,NYA,AL,1,1,11,51,D,,,,NYC14,37000.0,,100,02000036x,30,3,0,1,0,1,0,0,0,1,0,2,0,0,0,0,3,2,7,7,0,0,24,11,1,0,0,0,39,17,5,1,1,11,2,0,1,1,0,6,0,0,0,0,8,1,1,1,0,0,27,15,0,0,0,0,dinnb101,Bill Dinneen,nalld901,Dick Nallin,,(none),wilsf901,Frank Wilson,,(none),,(none),mackc101,Connie Mack,huggm101,Miller Huggins,maysc101,Carl Mays,perrs101,Scott Perry,,(none),warda101,Aaron Ward,perrs101,Scott Perry,maysc101,Carl Mays,dykej101,Jimmy Dykes,4,wittw101,Whitey Witt,9,walkt101,Tillie Walker,7,brazf101,Frank Brazill,3,dugaj101,Joe Dugan,5,perkc101,Cy Perkins,2,welcf101,Frank Welch,8,gallc101,Chick Galloway,6,perrs101,Scott Perry,1,fewsc101,Chick Fewster,4,peckr101,Roger Peckinpaugh,6,ruthb101,Babe Ruth,7,pippw101,Wally Pipp,3,meusb101,Bob Meusel,9,bodip101,Ping Bodie,8,warda101,Aaron Ward,5,schaw101,Wally Schang,2,maysc101,Carl Mays,1,,Y
1,19210413,0,Wed,CLE,AL,1,SLA,AL,1,2,4,51,D,,,,STL07,15000.0,,2,00103000x,34,10,1,0,1,2,0,0,0,1,0,5,0,0,2,0,6,2,3,3,0,0,24,8,3,0,1,0,31,5,3,1,0,3,0,0,0,2,0,4,1,1,0,0,5,1,2,2,0,0,27,10,0,0,2,0,evanb901,Billy Evans,hildg101,George Hildebrand,,(none),,(none),,(none),,(none),speat101,Tris Speaker,fohll101,Lee Fohl,shocu101,Urban Shocker,coves101,Stan Coveleski,,(none),,(none),coves101,Stan Coveleski,shocu101,Urban Shocker,jamic101,Charlie Jamieson,7,johnd107,Doc Johnston,3,speat101,Tris Speaker,8,smite104,Elmer Smith,9,gardl101,Larry Gardner,5,sewej101,Joe Sewell,6,stepr101,Riggs Stephenson,4,oneis101,Steve O'Neill,2,coves101,Stan Coveleski,1,tobij101,Jack Tobin,9,gerbw101,Wally Gerber,6,sislg101,George Sisler,3,jacow101,Baby Doll Jacobson,8,willk101,Ken Williams,7,gleab101,Billy Gleason,4,lee-d101,Dud Lee,5,seveh101,Hank Severeid,2,shocu101,Urban Shocker,1,,Y
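An alternative to relabeling the columns after reading is to pass the labels to `read_csv` through its `names` parameter. This is a sketch on a four-column in-memory stand-in, so the range here is `1` to `4` rather than `1` to `161`.

```python
import io
import pandas as pd

# Shortened, header-less stand-in for a game log file
raw = "19210413,0,Wed,PHA\n19210413,0,Wed,CLE\n"

# names= supplies the column labels; when names is given and header is not,
# read_csv treats the first row as data
labeled = pd.read_csv(io.StringIO(raw), names=list(range(1, 5)))

print(labeled.columns.tolist())
print(labeled.shape)
```

Either approach gives 1-based column labels that line up with the field numbers in the documentation.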


What is the "shape" of this data? The `.shape` attribute on a DataFrame returns the number of rows and columns as a tuple. The 1921 data has 1229 rows and 161 columns.

In [9]:
gamelog_1921.shape

(1229, 161)

Compare this to 2021: almost twice as many games were played (2429 rows), but there are still just 161 columns.

In [10]:
gamelog_2021 = pd.read_csv('Retrosheet/GL2021.TXT',header=None)
gamelog_2021.columns = range(1,162)
gamelog_2021.shape

(2429, 161)
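Because `.shape` is an ordinary tuple, it can be unpacked and compared in code instead of by eye. A small sketch with two made-up DataFrames standing in for two seasons:

```python
import pandas as pd

# Toy stand-ins for two seasons with different numbers of games
season_a = pd.DataFrame([[1, 2, 3]] * 4)   # 4 rows, 3 columns
season_b = pd.DataFrame([[4, 5, 6]] * 7)   # 7 rows, 3 columns

# Unpack each shape tuple into rows and columns
rows_a, cols_a = season_a.shape
rows_b, cols_b = season_b.shape

# Matching column counts is the precondition for stacking rows cleanly
same_columns = cols_a == cols_b

# If we stacked them, this is how many rows we should expect
expected_rows = rows_a + rows_b

print(same_columns, expected_rows)
```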

### Looping through files

For more on how to use `os.listdir`, see the section in the Appendix at the bottom. `os.listdir` takes a file directory path and returns a list of strings for the files in that directory. We are going to use it on the "Retrosheet" folder that contains 151 files for 151 baseball seasons. 

Use `os.listdir` to get a list of files in the directory. I wrap a `len` around this to get the length of the list and confirm it has 151 files like I expected.

In [11]:
len(os.listdir('Retrosheet/'))

151

We will use a loop to make sure that all 151 files have the same number of columns. 

In [12]:
# For each file name in the directory
for file in os.listdir('Retrosheet/'):
    
    # Gamelog file names start with "GL"; don't read in anything else
    if file.startswith('GL'):
        
        # Read the file into a DataFrame
        _df = pd.read_csv('Retrosheet/'+file,header=None)
        
        # Check if the DataFrame doesn't have 161 columns
        if _df.shape[1] != 161:
            
            # Print the name of the file
            print(file)

If you have the same data as me, nothing should print. All 151 files have exactly 161 columns. But it was worth testing that before proceeding to the next steps.
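Here is a minimal sketch of the same column check that you can run without the Retrosheet files. The temporary folder, the three tiny "GL" files, and the `expected_columns` value are all made up for illustration. It also wraps `os.listdir` in `sorted`, because `os.listdir` returns file names in arbitrary order.

```python
import os
import tempfile

import pandas as pd

# Hypothetical stand-in for the "Retrosheet" folder: three tiny game logs and one stray file
with tempfile.TemporaryDirectory() as folder:
    pd.DataFrame([[1, 2, 3]]).to_csv(os.path.join(folder, 'GL1901.TXT'), header=False, index=False)
    pd.DataFrame([[4, 5, 6]]).to_csv(os.path.join(folder, 'GL1900.TXT'), header=False, index=False)
    pd.DataFrame([[7, 8]]).to_csv(os.path.join(folder, 'GL1902.TXT'), header=False, index=False)
    with open(os.path.join(folder, 'notes.md'), 'w') as f:
        f.write('not a game log')

    # Made-up expectation for this toy data (the real files have 161 columns)
    expected_columns = 3

    # Keep only names that start with "GL", in sorted (chronological) order
    gamelog_files = sorted(f for f in os.listdir(folder) if f.startswith('GL'))

    # Collect the names of files whose column count is not what we expect
    bad_files = []
    for file in gamelog_files:
        _df = pd.read_csv(os.path.join(folder, file), header=None)
        if _df.shape[1] != expected_columns:
            bad_files.append(file)

print(gamelog_files)
print(bad_files)
```

Collecting the offending names in a list (rather than only printing them) makes it easy to assert that the list is empty before moving on.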

Now we can read all 151 files into Python's memory by slightly modifying the loop above. The DataFrames in `gamelog_list` are not yet concatenated into a single DataFrame; it's just a list of separate DataFrames.

In [13]:
# Empty container to store the DataFrames
gamelog_list = []

# For each file name in the directory
for file in os.listdir('Retrosheet/'):
    
    # Gamelog file names start with "GL"; don't read in anything else
    if file.startswith('GL'):
        
        # Read the file into a DataFrame
        _df = pd.read_csv('Retrosheet/'+file,header=None)
        _df.columns = range(1,162)
        
        # Append the DataFrame into the container
        gamelog_list.append(_df)

Develop a practice of *constantly* verifying the operations you're doing are preserving the size and shape of your data and generally matching your expectations. In our case, we had 151 files in the directory, so our `gamelog_list` should also have 151 objects.

In [14]:
len(gamelog_list)

151

### Using `concat`

Let's practice by concatenating two DataFrames. First, come up with some expectation of the data shape after combining. How many rows of data are in each DataFrame we're going to concatenate?

In [15]:
gamelog_list[-1]

Unnamed: 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,99,100,101,102,103,104,105,106,107,108,109,110,111,112,113,114,115,116,117,118,119,120,121,122,123,124,125,126,127,128,129,130,131,132,133,134,135,136,137,138,139,140,141,142,143,144,145,146,147,148,149,150,151,152,153,154,155,156,157,158,159,160,161
0,20050403,0,Sun,BOS,AL,1,NYA,AL,1,2,9,51,N,,,,NYC16,54818,199,010000001,01300203x,32,6,3,0,0,2,0,1,0,3,0,10,0,0,1,0,7,7,7,7,0,1,24,8,2,0,1,0,38,15,4,0,1,7,0,1,2,6,0,6,2,0,1,0,14,3,2,2,0,0,27,11,1,0,1,0,westj901,Joe West,gormb901,Brian Gorman,dimum901,Mike DiMuro,hallt901,Tom Hallion,,(none),,(none),frant001,Terry Francona,torrj101,Joe Torre,johnr005,Randy Johnson,welld001,David Wells,,(none),shefg001,Gary Sheffield,welld001,David Wells,johnr005,Randy Johnson,damoj001,Johnny Damon,8,rente001,Edgar Renteria,6,ramim002,Manny Ramirez,7,ortid001,David Ortiz,10,millk005,Kevin Millar,3,varij001,Jason Varitek,2,paytj001,Jay Payton,9,muelb001,Bill Mueller,5,bellm002,Mark Bellhorn,4,jeted001,Derek Jeter,6,rodra001,Alex Rodriguez,5,shefg001,Gary Sheffield,9,sierr001,Ruben Sierra,10,matsh001,Hideki Matsui,7,posaj001,Jorge Posada,2,giamj001,Jason Giambi,3,willb002,Bernie Williams,8,womat001,Tony Womack,4,,Y
1,20050404,0,Mon,OAK,AL,1,BAL,AL,1,0,4,51,D,,,,BAL12,48271,164,000000000,02110000x,36,8,1,0,0,0,0,0,0,1,0,4,0,0,0,0,10,3,4,4,0,0,24,7,2,0,1,0,30,7,1,0,1,4,0,1,0,4,0,4,0,0,1,0,7,4,0,0,0,0,27,11,1,0,0,0,demud901,Dana DeMuth,joycj901,Jim Joyce,fostm901,Marty Foster,diazl901,Laz Diaz,,(none),,(none),machk101,Ken Macha,mazzl001,Lee Mazzilli,loper001,Rodrigo Lopez,zitob001,Barry Zito,,(none),matol001,Luis Matos,zitob001,Barry Zito,loper001,Rodrigo Lopez,kotsm001,Mark Kotsay,8,kendj001,Jason Kendall,2,chave001,Eric Chavez,5,durae001,Erubiel Durazo,10,hatts001,Scott Hatteberg,3,byrne001,Eric Byrnes,7,crosb002,Bobby Crosby,6,swisn001,Nick Swisher,9,ellim001,Mark Ellis,4,robeb003,Brian Roberts,4,moram002,Melvin Mora,5,tejam001,Miguel Tejada,6,sosas001,Sammy Sosa,9,palmr001,Rafael Palmeiro,3,lopej001,Javy Lopez,2,gibbj002,Jay Gibbons,10,matol001,Luis Matos,8,bigbl001,Larry Bigbie,7,,Y
2,20050404,0,Mon,CLE,AL,1,CHA,AL,1,0,1,51,D,,,,CHI12,38141,111,000000000,00000010x,27,2,0,0,0,0,0,0,0,1,0,6,0,0,2,0,1,1,1,1,0,0,24,16,1,0,1,0,28,4,1,0,0,1,0,0,0,1,0,3,1,0,1,0,4,2,0,0,0,0,27,13,0,0,2,0,reedr901,Rick Reed,craft901,Terry Craft,barrt901,Ted Barrett,marqa901,Alfonso Marquez,,(none),,(none),wedge001,Eric Wedge,guilo001,Ozzie Guillen,buehm001,Mark Buehrle,westj001,Jake Westbrook,takas001,Shingo Takatsu,rowaa001,Aaron Rowand,westj001,Jake Westbrook,buehm001,Mark Buehrle,crisc001,Coco Crisp,8,bellr002,Ronnie Belliard,4,hafnt001,Travis Hafner,10,martv001,Victor Martinez,2,boona001,Aaron Boone,5,blakc001,Casey Blake,9,broub001,Ben Broussard,3,hernj001,Jose Hernandez,7,peraj001,Jhonny Peralta,6,podss001,Scott Podsednik,7,iguct001,Tadahito Iguchi,4,everc001,Carl Everett,10,konep001,Paul Konerko,3,dye-j001,Jermaine Dye,9,rowaa001,Aaron Rowand,8,piera001,A.J. Pierzynski,2,credj001,Joe Crede,5,uribj002,Juan Uribe,6,,Y
3,20050404,0,Mon,KCA,AL,1,DET,AL,1,2,11,51,D,,,,DET05,44105,162,000010010,03203012x,34,7,0,0,1,2,0,0,0,2,0,9,0,0,0,0,7,5,10,10,0,0,24,11,1,0,0,0,37,13,1,0,4,11,0,0,1,4,0,5,0,0,0,0,7,3,2,2,0,0,27,6,1,0,0,0,marsr901,Randy Marsh,vanol901,Larry Vanover,holbs901,Sam Holbrook,wolfj901,Jim Wolf,,(none),,(none),penat001,Tony Pena,trama001,Alan Trammell,bondj001,Jeremy Bonderman,limaj001,Jose Lima,,(none),yound001,Dmitri Young,limaj001,Jose Lima,bondj001,Jeremy Bonderman,dejed001,David DeJesus,8,gotar001,Ruben Gotay,4,sweem002,Mike Sweeney,3,pickc001,Calvin Pickering,10,staim001,Matt Stairs,9,berra001,Angel Berroa,6,longt002,Terrence Long,7,buckj001,John Buck,2,teahm001,Mark Teahen,5,infao001,Omar Infante,4,guilc001,Carlos Guillen,6,rodri001,Ivan Rodriguez,2,ordom001,Magglio Ordonez,9,yound001,Dmitri Young,10,whitr001,Rondell White,7,penac001,Carlos Pena,3,monrc001,Craig Monroe,8,ingeb001,Brandon Inge,5,,Y
4,20050404,0,Mon,MIN,AL,1,SEA,AL,1,1,5,51,D,,,,SEA03,46249,142,000010000,30200000x,33,5,1,0,0,1,0,0,1,0,0,7,1,0,0,0,6,2,4,4,0,0,24,10,1,0,0,0,29,5,0,0,2,5,0,0,0,0,0,5,0,1,0,0,0,4,0,0,0,0,27,9,1,0,0,0,froeb901,Bruce Froemming,wintm901,Mike Winters,mealj901,Jerry Meals,wendh902,Hunter Wendelstedt,,(none),,(none),gardr001,Ron Gardenhire,hargm001,Mike Hargrove,moyej001,Jamie Moyer,radkb001,Brad Radke,,(none),sexsr001,Richie Sexson,radkb001,Brad Radke,moyej001,Jamie Moyer,stews002,Shannon Stewart,7,bartj001,Jason Bartlett,6,mauej001,Joe Mauer,2,mornj001,Justin Morneau,3,huntt001,Torii Hunter,8,jonej003,Jacque Jones,9,fordl001,Lew Ford,10,cuddm001,Michael Cuddyer,5,rival001,Luis Rivas,4,suzui001,Ichiro Suzuki,9,reedj004,Jeremy Reed,8,belta001,Adrian Beltre,5,sexsr001,Richie Sexson,3,boonb002,Bret Boone,4,ibanr001,Raul Ibanez,10,winnr001,Randy Winn,7,olivm001,Miguel Olivo,2,valdw001,Wilson Valdez,6,,Y
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2426,20051002,0,Sun,MIL,NL,162,PIT,NL,162,1,3,51,D,,,,PIT08,23008,158,100000000,00030000x,32,9,1,0,0,1,1,0,1,1,0,7,0,0,3,0,7,3,3,3,0,0,24,9,0,0,2,0,31,9,3,0,2,3,0,0,0,3,0,4,0,0,1,0,7,3,1,1,0,0,27,12,0,0,3,0,delld901,Dusty Dellinger,randt901,Tony Randazzo,laynj901,Jerry Layne,dowda901,Adam Dowdy,,(none),,(none),yoste001,Ned Yost,mackp101,Pete Mackanin,dukez001,Zach Duke,capuc001,Chris Capuano,torrs001,Salomon Torres,mclon001,Nate McLouth,capuc001,Chris Capuano,dukez001,Zach Duke,clarb003,Brady Clark,8,hardj003,J.J. Hardy,6,overl001,Lyle Overbay,3,lee-c001,Carlos Lee,7,jenkg001,Geoff Jenkins,9,hallb001,Bill Hall,5,weekr001,Rickie Weeks,4,moelc001,Chad Moeller,2,capuc001,Chris Capuano,1,sancf001,Freddy Sanchez,4,wilsj002,Jack Wilson,6,bay-j001,Jason Bay,7,wilsc003,Craig Wilson,9,eldrb001,Brad Eldred,3,paulr001,Ronny Paulino,2,bautj002,Jose Bautista,5,mclon001,Nate McLouth,8,dukez001,Zach Duke,1,,Y
2427,20051002,0,Sun,LAN,NL,162,SDN,NL,162,1,3,51,D,,,,SAN02,37748,152,000000010,10001001x,31,4,1,0,1,1,0,0,0,2,0,12,0,0,0,0,5,4,2,2,1,0,24,10,2,0,1,0,32,9,3,0,0,2,0,0,0,3,0,3,3,1,0,0,8,4,1,1,0,0,27,2,0,0,0,0,timmt901,Tim Timmons,monte901,Ed Montague,emmep901,Paul Emmel,meric901,Chuck Meriwether,,(none),,(none),tracj101,Jim Tracy,bochb002,Bruce Bochy,eatoa001,Adam Eaton,desse001,Elmer Dessens,hofft001,Trevor Hoffman,lorem001,Mark Loretta,desse001,Elmer Dessens,eatoa001,Adam Eaton,aybaw001,Willy Aybar,4,repkj001,Jason Repko,8,perea001,Antonio Perez,6,edwam001,Mike Edwards,5,choih001,Hee-Seop Choi,3,wertj001,Jayson Werth,9,grabj001,Jason Grabowski,7,rosem001,Mike Rose,2,desse001,Elmer Dessens,1,youne001,Eric Young,8,lorem001,Mark Loretta,4,klesr001,Ryan Klesko,7,hernr002,Ramon Hernandez,2,sweem001,Mark Sweeney,3,greek002,Khalil Greene,6,johnb003,Ben Johnson,9,burrs001,Sean Burroughs,5,eatoa001,Adam Eaton,1,,Y
2428,20051002,0,Sun,ARI,NL,162,SFN,NL,162,1,3,51,D,,,,SFO03,40239,115,000001000,01100010x,33,6,1,0,0,1,0,0,0,0,0,4,0,0,0,0,5,2,2,2,0,0,24,12,1,0,0,0,33,9,0,1,1,3,1,0,0,1,0,6,1,0,0,0,8,1,1,1,0,0,27,11,0,0,0,0,millb901,Bill Miller,drakr901,Rob Drake,reilm901,Mike Reilly,kellj901,Jeff Kellogg,,(none),,(none),melvb001,Bob Melvin,alouf101,Felipe Alou,tomkb001,Brett Tomko,webbb001,Brandon Webb,,(none),aloum001,Moises Alou,webbb001,Brandon Webb,tomkb001,Brett Tomko,mccrq001,Quinton McCracken,7,greea001,Andy Green,4,tracc001,Chad Tracy,9,glaut001,Troy Glaus,5,grees001,Shawn Green,8,cinta001,Alex Cintron,6,jackc002,Conor Jackson,3,hillk002,Koyie Hill,2,webbb001,Brandon Webb,1,vizqo001,Omar Vizquel,6,snowj001,J.T. Snow,3,winnr001,Randy Winn,8,durhr001,Ray Durham,4,aloum001,Moises Alou,7,felip001,Pedro Feliz,5,ortmd001,Daniel Ortmeier,9,knoej001,Justin Knoedler,2,tomkb001,Brett Tomko,1,,Y
2429,20051002,0,Sun,CIN,NL,163,SLN,NL,162,5,7,51,D,,,,STL09,50434,191,023000000,01032010x,39,10,3,0,3,5,0,0,0,4,1,9,0,0,0,0,11,4,6,6,1,0,24,4,1,0,0,0,39,16,2,0,2,7,0,1,0,3,1,6,0,0,0,0,12,9,5,5,0,0,27,11,3,0,0,0,kulpr901,Ron Kulpa,cousd901,Derryl Cousins,reint901,Travis Reininger,davig901,Gerry Davis,,(none),,(none),narrj001,Jerry Narron,larut101,Tony LaRussa,thomb002,Brad Thompson,claub001,Brandon Claussen,isrij001,Jason Isringhausen,duncc002,Chris Duncan,claub001,Brandon Claussen,morrm001,Matt Morris,freer001,Ryan Freel,7,lopef001,Felipe Lopez,6,dunna001,Adam Dunn,3,keara001,Austin Kearns,9,valej004,Javier Valentin,2,encae001,Edwin Encarnacion,5,denoc001,Chris Denorfia,8,holba001,Aaron Holbert,4,claub001,Brandon Claussen,1,ecksd001,David Eckstein,6,tagus001,So Taguchi,8,pujoa001,Albert Pujols,3,sandr002,Reggie Sanders,7,walkl001,Larry Walker,9,grudm001,Mark Grudzielanek,4,moliy001,Yadier Molina,2,nunea001,Abraham Nunez,5,morrm001,Matt Morris,1,,Y


In [16]:
gamelog_list[-2].shape

(2429, 161)

In [17]:
gamelog_list[-1].shape

(2431, 161)

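`os.listdir` does not return files in any guaranteed order, so the last two DataFrames in `gamelog_list` are not necessarily the two most recent seasons. In the run shown here, the dates in column 1 identify them as the 2011 season (2429 rows) and the 2005 season (2431 rows). Concatenating both DataFrames should produce a new DataFrame with 2429 + 2431 = 4860 rows and 161 columns.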

In [18]:
gamelog_list[-2].shape[0] + gamelog_list[-1].shape[0]

4860

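To do the concatenation, we pass an object (a list or dictionary is most common) to the `concat` function's "objs" parameter. In this example, the last two objects in `gamelog_list` are the DataFrames we want to combine. (The name `gl_20_21` assumes these are the 2020 and 2021 seasons, which only holds if the file list is sorted, for example with `sorted(os.listdir('Retrosheet/'))`; in the output shown here they are 2011 and 2005.)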

In [19]:
gl_20_21 = pd.concat(
    objs = gamelog_list[-2:]
)

gl_20_21.shape

(4860, 161)

Identical, but even more explicit: make a list with the DataFrames in the second-to-last and last positions.

In [20]:
gl_20_21 = pd.concat(
    objs = [gamelog_list[-2],gamelog_list[-1]]
)

gl_20_21.shape

(4860, 161)
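The "state your expectation, then check it" habit can be sketched on toy data. The two small DataFrames below are made up and stand in for two seasons of game logs. The `assert` fails loudly if the concatenated shape does not match the expectation.

```python
import pandas as pd

# Made-up stand-ins for two seasons of game logs
season_a = pd.DataFrame({'date': [19000101, 19000102], 'home': ['AAA', 'BBB']})
season_b = pd.DataFrame({'date': [19010101, 19010102, 19010103], 'home': ['AAA', 'CCC', 'BBB']})

# Expectation: rows add up, columns stay the same
expected_rows = season_a.shape[0] + season_b.shape[0]
expected_columns = season_a.shape[1]

combined = pd.concat(objs=[season_a, season_b])

# Raises an AssertionError if the result does not match the expectation
assert combined.shape == (expected_rows, expected_columns)
print(combined.shape)
```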

### Managing indices in concatenated DataFrames
One interesting/frustrating consequence of concatenating DataFrames is the indices from each of the parent DataFrames are preserved. This is usually a problem because we'd like our index to be unique and not have duplicate values.

Get the row at the index named 13 with `.loc`.

In [21]:
gl_20_21.loc[13]

Unnamed: 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,99,100,101,102,103,104,105,106,107,108,109,110,111,112,113,114,115,116,117,118,119,120,121,122,123,124,125,126,127,128,129,130,131,132,133,134,135,136,137,138,139,140,141,142,143,144,145,146,147,148,149,150,151,152,153,154,155,156,157,158,159,160,161
13,20110401,0,Fri,SEA,AL,1,OAK,AL,1,6,2,54,N,,,,OAK01,36067,170,1002300,200000000,36,8,1,0,1,5,1,0,0,7,0,14,3,1,1,0,11,1,2,2,0,0,27,14,0,0,2,0,30,5,0,0,1,2,0,0,0,0,0,5,0,0,2,0,1,6,4,4,0,0,27,12,5,0,2,0,darlg901,Gary Darling,drecb901,Bruce Dreckman,emmep901,Paul Emmel,drakr901,Rob Drake,,(none),,(none),wedge001,Eric Wedge,gereb001,Bob Geren,hernf002,Felix Hernandez,bresc001,Craig Breslow,,(none),figgc001,Chone Figgins,hernf002,Felix Hernandez,cahit001,Trevor Cahill,suzui001,Ichiro Suzuki,9,figgc001,Chone Figgins,5,bradm001,Milton Bradley,7,custj001,Jack Cust,10,smoaj001,Justin Smoak,3,olivm001,Miguel Olivo,2,langr002,Ryan Langerhans,8,ryanb002,Brendan Ryan,6,wilsj002,Jack Wilson,4,crisc001,Coco Crisp,8,bartd001,Daric Barton,3,dejed001,David DeJesus,9,willj004,Josh Willingham,7,matsh001,Hideki Matsui,10,suzuk001,Kurt Suzuki,2,ellim001,Mark Ellis,4,kouzk001,Kevin Kouzmanoff,5,pennc001,Cliff Pennington,6,,Y
13,20050405,0,Tue,MIN,AL,2,SEA,AL,2,8,4,54,N,,,,SEA03,28373,181,70100,400000000,39,14,0,0,1,8,0,0,0,2,0,8,0,0,1,0,6,5,4,4,0,0,27,12,1,0,1,0,33,7,3,0,0,4,0,0,1,2,0,10,2,0,1,0,5,5,8,8,0,0,27,13,0,0,1,0,wintm901,Mike Winters,mealj901,Jerry Meals,wendh902,Hunter Wendelstedt,froeb901,Bruce Froemming,,(none),,(none),gardr001,Ron Gardenhire,hargm001,Mike Hargrove,santj003,Johan Santana,thorm001,Matt Thornton,,(none),huntt001,Torii Hunter,santj003,Johan Santana,mechg001,Gil Meche,stews002,Shannon Stewart,7,bartj001,Jason Bartlett,6,mauej001,Joe Mauer,2,mornj001,Justin Morneau,3,huntt001,Torii Hunter,8,jonej003,Jacque Jones,9,fordl001,Lew Ford,10,cuddm001,Michael Cuddyer,5,rival001,Luis Rivas,4,suzui001,Ichiro Suzuki,9,reedj004,Jeremy Reed,8,belta001,Adrian Beltre,5,sexsr001,Richie Sexson,3,boonb002,Bret Boone,4,ibanr001,Raul Ibanez,7,winnr001,Randy Winn,10,olivm001,Miguel Olivo,2,valdw001,Wilson Valdez,6,,Y


There are two rows named 13, one from each of the two DataFrames we concatenated (a 2011 game and a 2005 game in the output above). If we had concatenated all 151 DataFrames, one per season, there would be 151 rows with index 13.
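You don't have to stumble on repeated index labels by accident. Here is a small sketch, using made-up toy DataFrames, of two ways to check for them after a `concat`: `Index.is_unique` and `Index.duplicated()`.

```python
import pandas as pd

# Two toy DataFrames, each with its own default index starting at 0
first = pd.DataFrame({'value': ['a', 'b', 'c']})
second = pd.DataFrame({'value': ['d', 'e']})

stacked = pd.concat(objs=[first, second])

# The index is now 0, 1, 2, 0, 1, so it is not unique
print(stacked.index.is_unique)

# Number of index labels that repeat an earlier label
print(stacked.index.duplicated().sum())

# .loc with a repeated label returns every matching row
print(stacked.loc[1])
```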

Option one for dealing with the repeated indices is to use the `.reset_index()` method. We don't want the index to become a column in our concat'd DataFrame (like in the example above), we just want to replace it with a new index. Pass `True` to the "drop" parameter. Test that the new DataFrame only has one row at index 13.

In [22]:
gl_20_21_reset = gl_20_21.reset_index(drop=True)

gl_20_21_reset.loc[13]

1              20110401
2                     0
3                   Fri
4                   SEA
5                    AL
             ...       
157            pennc001
158    Cliff Pennington
159                   6
160                 NaN
161                   Y
Name: 13, Length: 161, dtype: object

Option two for dealing with the repeated indices is to use the "ignore_index" parameter inside `concat` when constructing the concat'd DataFrame. I'm lazy and would like to skip doing cleanup when possible, so this is my preferred option.

In [23]:
gl_20_21_ignore = pd.concat(
    objs = [gamelog_list[-2],gamelog_list[-1]],
    ignore_index = True
)

gl_20_21_ignore.loc[13]

1              20110401
2                     0
3                   Fri
4                   SEA
5                    AL
             ...       
157            pennc001
158    Cliff Pennington
159                   6
160                 NaN
161                   Y
Name: 13, Length: 161, dtype: object
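To confirm the two options give the same result, here is a minimal check on made-up toy DataFrames. `DataFrame.equals` compares both the values and the index.

```python
import pandas as pd

first = pd.DataFrame({'value': ['a', 'b', 'c']})
second = pd.DataFrame({'value': ['d', 'e']})

# Option one: concatenate, then throw away the old index
option_one = pd.concat(objs=[first, second]).reset_index(drop=True)

# Option two: ask concat to build a fresh index directly
option_two = pd.concat(objs=[first, second], ignore_index=True)

print(option_one.equals(option_two))
print(option_two.index.tolist())
```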

## Merging

A `merge` lets us place columns from two DataFrames next to each other by matching **rows** whose key values are the same or similar. This is more commonly known as a join, and pandas also has [`join`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.join.html) methods. But `merge` is the most general function, so we will start with it. (Image from [User Guide](https://pandas.pydata.org/docs/user_guide/merging.html))

![Example of merge](https://pandas.pydata.org/docs/_images/merging_merge_on_key.png)

Let's try to merge the "People" and "Batting" tables from the Lahman database together. Before we even attempt a join or merge, look closely at each table and be able to answer these questions.

* **Variables.** What variables do they share in common that could be used as keys in a join?
* **Coverage.** Do both tables cover the same ranges of time?
* **Duplicates.** Are there instances of multiple/repeated rows of these key variables? Why is that?
* **Strategy.** What is the most appropriate merging strategy to handle the different time windows and repeated dates?
* **Shape.** What should the data look like afterwards?  
  * How many rows? 
  * How many columns? 
  * Which values should repeat? 
  * Which values should be null? 
  * Which values should disappear?
  
If you do not have answers for these questions, you are not prepared to evaluate the quality of a join and are likely to make serious errors resulting in duplicated or dropped data.

Start by reading in the data, printing out the DataFrame shape, and inspecting the first few rows.

In [24]:
people_df = pd.read_csv('Lahman/People.csv')
print(people_df.shape)
people_df.head(2)

(20358, 24)


Unnamed: 0,playerID,birthYear,birthMonth,birthDay,birthCountry,birthState,birthCity,deathYear,deathMonth,deathDay,deathCountry,deathState,deathCity,nameFirst,nameLast,nameGiven,weight,height,bats,throws,debut,finalGame,retroID,bbrefID
0,aardsda01,1981.0,12.0,27.0,USA,CO,Denver,,,,,,,David,Aardsma,David Allan,215.0,75.0,R,R,2004-04-06,2015-08-23,aardd001,aardsda01
1,aaronha01,1934.0,2.0,5.0,USA,AL,Mobile,2021.0,1.0,22.0,USA,GA,Atlanta,Hank,Aaron,Henry Louis,180.0,72.0,R,R,1954-04-13,1976-10-03,aaroh101,aaronha01


In [25]:
batting_df = pd.read_csv('Lahman/Batting.csv')
print(batting_df.shape)
batting_df.head(2)

(108789, 22)


Unnamed: 0,playerID,yearID,stint,teamID,lgID,G,AB,R,H,2B,3B,HR,RBI,SB,CS,BB,SO,IBB,HBP,SH,SF,GIDP
0,abercda01,1871,1,TRO,,1,4,0,0,0,0,0,0.0,0.0,0.0,0,0.0,,,,,0.0
1,addybo01,1871,1,RC1,,25,118,30,32,6,0,0,13.0,8.0,1.0,4,0.0,,,,,0.0


### Variables?

What columns do these two DataFrames (`people_df` and `batting_df`) have in common that could be used as keys in a join? Note that these may not always be columns with identical names.

In [26]:
set(batting_df.columns) & set(people_df.columns)

{'playerID'}

Do some set operations to determine if there are values in one DataFrame that aren't in the other. Do some exploratory analysis to develop some explanation for why there are more values in one than the other.

In [27]:
len(set(batting_df['playerID']))

19898

In [28]:
len(set(people_df['playerID']))

20358

In [29]:
len(set(batting_df['playerID']) & set(people_df['playerID']))

19898
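The counts tell us how many IDs overlap, but not which ones are missing. Here is a sketch on made-up toy tables (the IDs are hypothetical) that uses a set difference and `.isin` to list the keys that appear in only one table.

```python
import pandas as pd

# Toy stand-ins for people_df and batting_df
people = pd.DataFrame({'playerID': ['p01', 'p02', 'p03', 'p04']})
batting = pd.DataFrame({
    'playerID': ['p01', 'p01', 'p02', 'p03'],
    'yearID': [1901, 1902, 1901, 1901],
})

# Keys that appear in one table but not the other
only_in_people = set(people['playerID']) - set(batting['playerID'])
only_in_batting = set(batting['playerID']) - set(people['playerID'])
print(only_in_people, only_in_batting)

# The same idea with .isin, which returns the full rows for inspection
unmatched_people = people[~people['playerID'].isin(batting['playerID'])]
print(unmatched_people)
```

In the real data, the intersection (19,898) equals the number of unique IDs in `batting_df`, so every batting ID is in `people_df`, and 20,358 - 19,898 = 460 IDs appear only in `people_df`.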

### Coverage?

Do these DataFrames cover the same ranges of time? If not, where does each start and stop?

In [30]:
batting_df['yearID'].describe()

count   108,789.00
mean      1,967.22
std          39.75
min       1,871.00
25%       1,937.00
50%       1,976.00
75%       2,001.00
max       2,020.00
Name: yearID, dtype: float64

In [31]:
people_df['birthYear'].describe()

count   20,247.00
mean     1,935.21
std         43.01
min      1,820.00
25%      1,897.00
50%      1,943.00
75%      1,974.00
max      2,001.00
Name: birthYear, dtype: float64
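Birth years are not directly comparable to season years. One possible refinement, sketched here on made-up rows, assumes the "debut" and "finalGame" columns shown in `people_df.head()` are a better proxy for when someone was active. It converts them with `pd.to_datetime` and pulls out the year.

```python
import pandas as pd

# Toy stand-ins; the third person has no debut or final game recorded
people = pd.DataFrame({
    'playerID': ['p01', 'p02', 'p03'],
    'debut': ['1901-04-18', '1950-05-01', None],
    'finalGame': ['1910-09-30', '1962-09-29', None],
})
batting = pd.DataFrame({'playerID': ['p01', 'p02'], 'yearID': [1901, 1950]})

# Convert the date strings and extract the year (missing dates become NaN)
debut_year = pd.to_datetime(people['debut']).dt.year
final_year = pd.to_datetime(people['finalGame']).dt.year

# Compare the time windows covered by each table
print(batting['yearID'].min(), batting['yearID'].max())
print(debut_year.min(), final_year.max())
```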

### Duplicates?

Are there instances of multiple/repeated rows of key variables? Why?

In [32]:
people_df['playerID'].value_counts()

aardsda01    1
odoulle01    1
ofarrbo01    1
oestero01    1
oeschjo01    1
            ..
gomezch02    1
gomezch01    1
gomezca01    1
gomezal01    1
zychto01     1
Name: playerID, Length: 20358, dtype: int64

In [33]:
batting_df['playerID'].value_counts()

mcguide01    31
henderi01    29
newsobo01    29
kaatji01     28
johnto01     28
             ..
weingel01     1
waltefr01     1
walleno01     1
walczed01     1
zuberty01     1
Name: playerID, Length: 19898, dtype: int64
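`value_counts` shows the repeats; `is_unique` and `duplicated` give a yes/no answer you can test. The sketch below uses made-up rows and assumes, based only on the column names, that the combination of "playerID", "yearID", and "stint" identifies a single batting row.

```python
import pandas as pd

people = pd.DataFrame({'playerID': ['p01', 'p02', 'p03']})
batting = pd.DataFrame({
    'playerID': ['p01', 'p01', 'p01', 'p02'],
    'yearID':   [1901, 1902, 1902, 1901],
    'stint':    [1, 1, 2, 1],
})

# One row per person: the key is unique in the people table
print(people['playerID'].is_unique)

# Many rows per person: the key repeats in the batting table
print(batting['playerID'].is_unique)

# Assumed composite key: no full (playerID, yearID, stint) combination repeats
print(batting.duplicated(subset=['playerID', 'yearID', 'stint']).any())
```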

### Strategy?

What is the most appropriate merging strategy to handle the different time windows and repeated dates? `people_df` has fewer rows than `batting_df` because `batting_df` has repeated observations of players across years and teams. Ideally, we want a DataFrame that preserves the information about each player's season, even if it means repeating biographical information. No data should be dropped because of non-overlapping dates. I expect some missing data, since there are more unique "playerID" values in `people_df` than in `batting_df`; these likely correspond to people with no batting records, such as some pitchers.

We'll start with left and right joins using `pd.merge`. You will want to pass five arguments to `pd.merge`:

* **left** - the left-hand DataFrame
* **right** - the right-hand DataFrame
* **left_on** - the name of the column containing the values to join on in the left DataFrame
* **right_on** - the name of the column containing the values to join on in the right DataFrame
* **how** - options include "left", "right", "inner" and "outer" - this is what we'll focus on for now!

This all suggests a few different strategies. Let's merge `people_df` as the left DataFrame and `batting_df` as the right DataFrame using each of these different strategies and see how the shape and values compare.
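Before running the joins on the real tables, it can help to see all four strategies side by side on data small enough to check by hand. The tables below are made up: "p03" appears only in `people` and "p09" appears only in `batting`.

```python
import pandas as pd

people = pd.DataFrame({
    'playerID': ['p01', 'p02', 'p03'],
    'nameLast': ['Able', 'Baker', 'Chen'],
})
batting = pd.DataFrame({
    'playerID': ['p01', 'p01', 'p02', 'p09'],
    'yearID': [1901, 1902, 1901, 1901],
    'H': [10, 12, 7, 3],
})

# Run the same merge with each strategy and record the resulting shape
shapes = {}
for how in ['left', 'right', 'inner', 'outer']:
    merged = pd.merge(
        left=people,
        right=batting,
        left_on='playerID',
        right_on='playerID',
        how=how,
    )
    shapes[how] = merged.shape
    print(how, merged.shape)
```

The inner join is the smallest (only "p01" and "p02" rows survive) and the outer join is the largest (it also keeps the unmatched "p03" and "p09" rows).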

#### Left join
A **left join** would preserve all the information in the left DataFrame while dropping information in the right DataFrame if the right's keys were not present in the left. In other words, you keep all the rows on the left but risk losing rows on the right.

![From W3 Schools](https://www.w3schools.com/sql/img_leftjoin.gif)

![](https://pandas.pydata.org/pandas-docs/stable/_images/merging_merge_on_key_left.png)

When we call `how='left'` in `pd.merge`, we make sure all the keys we're joining on that are present in the "left" DataFrame remain in the merged DataFrame. A left-join is ideal if you really trust and want to preserve all the data in your left DataFrame, even if it means losing data in your right DataFrame.
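You can predict the row count of a left join before running it. This sketch uses made-up tables and assumes the key is unique in the left DataFrame. Each matching right row produces one output row, and each left key with no match contributes one extra row filled with missing values.

```python
import pandas as pd

people = pd.DataFrame({
    'playerID': ['p01', 'p02', 'p03'],
    'nameLast': ['Able', 'Baker', 'Chen'],
})
batting = pd.DataFrame({
    'playerID': ['p01', 'p01', 'p02'],
    'yearID': [1901, 1902, 1901],
})

# Right rows whose key exists on the left
matched_rows = batting['playerID'].isin(people['playerID']).sum()

# Left keys with no partner on the right
unmatched_left = (~people['playerID'].isin(batting['playerID'])).sum()

expected_rows = matched_rows + unmatched_left

left = pd.merge(left=people, right=batting, left_on='playerID', right_on='playerID', how='left')
print(expected_rows, left.shape[0])

# The unmatched left rows are kept, with NaN in the right-hand columns
print(left[left['yearID'].isna()])
```

Applied to the counts reported earlier in this notebook, the same reasoning predicts 108,789 + (20,358 - 19,898) = 109,249 rows for the real left join.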

Make a DataFrame called `people_batting_left` by performing a left join using the "playerID" column as a key. What is the shape of the data? What values are repeated? What rows are missing?

In [34]:
people_batting_left = pd.merge(
    left = people_df,
    right = batting_df,
    left_on = 'playerID',
    right_on = 'playerID',
    how = 'left'
)
print(people_df.shape, batting_df.shape, people_batting_left.shape)
people_batting_left.head(9)

(20358, 24) (108789, 22) (109249, 45)


Unnamed: 0,playerID,birthYear,birthMonth,birthDay,birthCountry,birthState,birthCity,deathYear,deathMonth,deathDay,deathCountry,deathState,deathCity,nameFirst,nameLast,nameGiven,weight,height,bats,throws,debut,finalGame,retroID,bbrefID,yearID,stint,teamID,lgID,G,AB,R,H,2B,3B,HR,RBI,SB,CS,BB,SO,IBB,HBP,SH,SF,GIDP
0,aardsda01,1981.0,12.0,27.0,USA,CO,Denver,,,,,,,David,Aardsma,David Allan,215.0,75.0,R,R,2004-04-06,2015-08-23,aardd001,aardsda01,2004.0,1.0,SFN,NL,11.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,aardsda01,1981.0,12.0,27.0,USA,CO,Denver,,,,,,,David,Aardsma,David Allan,215.0,75.0,R,R,2004-04-06,2015-08-23,aardd001,aardsda01,2006.0,1.0,CHN,NL,45.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
2,aardsda01,1981.0,12.0,27.0,USA,CO,Denver,,,,,,,David,Aardsma,David Allan,215.0,75.0,R,R,2004-04-06,2015-08-23,aardd001,aardsda01,2007.0,1.0,CHA,AL,25.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,aardsda01,1981.0,12.0,27.0,USA,CO,Denver,,,,,,,David,Aardsma,David Allan,215.0,75.0,R,R,2004-04-06,2015-08-23,aardd001,aardsda01,2008.0,1.0,BOS,AL,47.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
4,aardsda01,1981.0,12.0,27.0,USA,CO,Denver,,,,,,,David,Aardsma,David Allan,215.0,75.0,R,R,2004-04-06,2015-08-23,aardd001,aardsda01,2009.0,1.0,SEA,AL,73.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,aardsda01,1981.0,12.0,27.0,USA,CO,Denver,,,,,,,David,Aardsma,David Allan,215.0,75.0,R,R,2004-04-06,2015-08-23,aardd001,aardsda01,2010.0,1.0,SEA,AL,53.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,aardsda01,1981.0,12.0,27.0,USA,CO,Denver,,,,,,,David,Aardsma,David Allan,215.0,75.0,R,R,2004-04-06,2015-08-23,aardd001,aardsda01,2012.0,1.0,NYA,AL,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7,aardsda01,1981.0,12.0,27.0,USA,CO,Denver,,,,,,,David,Aardsma,David Allan,215.0,75.0,R,R,2004-04-06,2015-08-23,aardd001,aardsda01,2013.0,1.0,NYN,NL,43.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8,aardsda01,1981.0,12.0,27.0,USA,CO,Denver,,,,,,,David,Aardsma,David Allan,215.0,75.0,R,R,2004-04-06,2015-08-23,aardd001,aardsda01,2015.0,1.0,ATL,NL,33.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0


#### Right join
A **right join** would preserve all the information in the right DataFrame while dropping information in the left DataFrame if the left's keys were not present in the right. In other words, you keep all the rows on the right but risk losing rows on the left.

![From W3 Schools](https://www.w3schools.com/sql/img_rightjoin.gif)

![](https://pandas.pydata.org/pandas-docs/stable/_images/merging_merge_on_key_right.png)

When we call `how='right'` in `pd.merge`, we make sure all the keys we're joining on that are present in the "right" DataFrame remain in the merged DataFrame. A right-join is ideal if you really trust and want to preserve all the data in your right DataFrame, even if it means losing data in your left DataFrame.
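A right join is a left join with the two DataFrames swapped. The sketch below, on made-up tables, shows that both produce the same number of rows and the same set of columns; only the column order differs.

```python
import pandas as pd

people = pd.DataFrame({
    'playerID': ['p01', 'p02', 'p03'],
    'nameLast': ['Able', 'Baker', 'Chen'],
})
batting = pd.DataFrame({
    'playerID': ['p01', 'p01', 'p02', 'p09'],
    'yearID': [1901, 1902, 1901, 1901],
    'H': [10, 12, 7, 3],
})

# Right join: keep every batting row
right = pd.merge(left=people, right=batting, left_on='playerID', right_on='playerID', how='right')

# Same idea expressed as a left join with the tables swapped
swapped_left = pd.merge(left=batting, right=people, left_on='playerID', right_on='playerID', how='left')

print(right.shape, swapped_left.shape)
print(list(right.columns))
print(list(swapped_left.columns))

# "p09" has no match in people, so its biographical column is missing
print(right[right['nameLast'].isna()])
```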

Make a DataFrame called `people_batting_right` by performing a right join using the "playerID" column as a key. What is the shape of the data? What values are repeated? What rows are missing?

In [35]:
people_batting_right = pd.merge(
    left = people_df,
    right = batting_df,
    left_on = 'playerID',
    right_on = 'playerID',
    how = 'right'
)
print(people_df.shape, batting_df.shape, people_batting_right.shape)
people_batting_right.head(9)

(20358, 24) (108789, 22) (108789, 45)


Unnamed: 0,playerID,birthYear,birthMonth,birthDay,birthCountry,birthState,birthCity,deathYear,deathMonth,deathDay,deathCountry,deathState,deathCity,nameFirst,nameLast,nameGiven,weight,height,bats,throws,debut,finalGame,retroID,bbrefID,yearID,stint,teamID,lgID,G,AB,R,H,2B,3B,HR,RBI,SB,CS,BB,SO,IBB,HBP,SH,SF,GIDP
0,abercda01,1850.0,1.0,2.0,USA,OK,Fort Towson,1939.0,11.0,11.0,USA,PA,Philadelphia,Frank,Abercrombie,Francis Patterson,,,,,1871-10-21,1871-10-21,aberd101,abercda01,1871,1,TRO,,1,4,0,0,0,0,0,0.0,0.0,0.0,0,0.0,,,,,0.0
1,addybo01,1842.0,2.0,,CAN,ON,Port Hope,1910.0,4.0,9.0,USA,ID,Pocatello,Bob,Addy,Robert Edward,160.0,68.0,L,L,1871-05-06,1877-10-06,addyb101,addybo01,1871,1,RC1,,25,118,30,32,6,0,0,13.0,8.0,1.0,4,0.0,,,,,0.0
2,allisar01,1849.0,1.0,29.0,USA,PA,Philadelphia,1916.0,2.0,25.0,USA,DC,Washington,Art,Allison,Arthur Algernon,150.0,68.0,,,1871-05-04,1876-10-05,allia101,allisar01,1871,1,CL1,,29,137,28,40,4,5,0,19.0,3.0,1.0,2,5.0,,,,,1.0
3,allisdo01,1846.0,7.0,12.0,USA,PA,Philadelphia,1916.0,12.0,19.0,USA,DC,Washington,Doug,Allison,Douglas L.,160.0,70.0,R,R,1871-05-05,1883-07-13,allid101,allisdo01,1871,1,WS3,,27,133,28,44,10,2,2,27.0,1.0,1.0,0,2.0,,,,,0.0
4,ansonca01,1852.0,4.0,17.0,USA,IA,Marshalltown,1922.0,4.0,14.0,USA,IL,Chicago,Cap,Anson,Adrian Constantine,227.0,72.0,R,R,1871-05-06,1897-10-03,ansoc101,ansonca01,1871,1,RC1,,25,120,29,39,11,3,0,16.0,6.0,2.0,2,1.0,,,,,0.0
5,armstbo01,1850.0,7.0,4.0,USA,MD,Baltimore,1917.0,12.0,3.0,USA,TX,Fort Worth,Robert,Armstrong,Robert Livingston,160.0,74.0,,,1871-06-26,1871-08-29,armsr101,armstsa01,1871,1,FW1,,12,49,9,11,2,1,0,5.0,0.0,1.0,0,1.0,,,,,0.0
6,barkeal01,1839.0,1.0,18.0,USA,IN,Lost Creek,1912.0,9.0,15.0,USA,IL,Rockford,Al,Barker,Alfred L.,162.0,72.0,,,1871-06-01,1871-06-01,barka101,barkeal01,1871,1,RC1,,1,4,0,1,0,0,0,2.0,0.0,0.0,1,0.0,,,,,0.0
7,barnero01,1850.0,5.0,8.0,USA,NY,Mount Morris,1915.0,2.0,5.0,USA,IL,Chicago,Ross,Barnes,Charles Roscoe,145.0,68.0,R,R,1871-05-05,1881-09-21,barnr102,barnero01,1871,1,BS1,,31,157,66,63,10,9,0,34.0,11.0,6.0,13,1.0,,,,,1.0
8,barrebi01,,,,USA,MD,Baltimore,,,,,,,Bill,Barrett,William,,,,,1871-07-08,1873-10-18,barrb102,barrebi01,1871,1,FW1,,1,5,1,1,1,0,0,1.0,0.0,0.0,0,0.0,,,,,0.0


#### Outer join
An **outer join** would preserve all the information in both the left and the right DataFrames even if it results in empty values where the data don't overlap. In other words, you keep all the rows of data in both but risk having missing values.

![From W3 Schools](https://www.w3schools.com/sql/img_fulljoin.gif)

![](https://pandas.pydata.org/pandas-docs/stable/_images/merging_merge_on_key_outer.png)

When we call `how='outer'` in `pd.merge`, we make sure all the keys we're joining on that are present in either the "left" or the "right" DataFrame remain in the merged DataFrame. An outer-join is ideal if you want to preserve all the data in both DataFrames, even if it means having missing values where the DataFrames don't overlap.
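With an outer join it is useful to know where each row came from. A minimal sketch on made-up tables: passing `indicator=True` to `pd.merge` adds a "_merge" column labeling each row as "left_only", "right_only", or "both".

```python
import pandas as pd

people = pd.DataFrame({
    'playerID': ['p01', 'p02', 'p03'],
    'nameLast': ['Able', 'Baker', 'Chen'],
})
batting = pd.DataFrame({
    'playerID': ['p01', 'p01', 'p02', 'p09'],
    'yearID': [1901, 1902, 1901, 1901],
})

outer = pd.merge(
    left=people,
    right=batting,
    left_on='playerID',
    right_on='playerID',
    how='outer',
    indicator=True,   # adds the "_merge" column
)

# How many rows matched, and how many came from only one side?
counts = outer['_merge'].value_counts()
print(counts)
```

Rows labeled "left_only" or "right_only" are exactly the ones that carry missing values from the other table.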

Make a DataFrame called `people_batting_outer` by performing an outer join using the "playerID" column as a key. What is the shape of the data? What values are repeated? What rows are missing?

In [36]:
people_batting_outer = pd.merge(
    left = people_df,
    right = batting_df,
    left_on = 'playerID',
    right_on = 'playerID',
    how = 'outer'
)
print(people_df.shape, batting_df.shape, people_batting_outer.shape)
people_batting_outer.head(9)

(20358, 24) (108789, 22) (109249, 45)


Unnamed: 0,playerID,birthYear,birthMonth,birthDay,birthCountry,birthState,birthCity,deathYear,deathMonth,deathDay,deathCountry,deathState,deathCity,nameFirst,nameLast,nameGiven,weight,height,bats,throws,debut,finalGame,retroID,bbrefID,yearID,stint,teamID,lgID,G,AB,R,H,2B,3B,HR,RBI,SB,CS,BB,SO,IBB,HBP,SH,SF,GIDP
0,aardsda01,1981.0,12.0,27.0,USA,CO,Denver,,,,,,,David,Aardsma,David Allan,215.0,75.0,R,R,2004-04-06,2015-08-23,aardd001,aardsda01,2004.0,1.0,SFN,NL,11.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,aardsda01,1981.0,12.0,27.0,USA,CO,Denver,,,,,,,David,Aardsma,David Allan,215.0,75.0,R,R,2004-04-06,2015-08-23,aardd001,aardsda01,2006.0,1.0,CHN,NL,45.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
2,aardsda01,1981.0,12.0,27.0,USA,CO,Denver,,,,,,,David,Aardsma,David Allan,215.0,75.0,R,R,2004-04-06,2015-08-23,aardd001,aardsda01,2007.0,1.0,CHA,AL,25.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,aardsda01,1981.0,12.0,27.0,USA,CO,Denver,,,,,,,David,Aardsma,David Allan,215.0,75.0,R,R,2004-04-06,2015-08-23,aardd001,aardsda01,2008.0,1.0,BOS,AL,47.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
4,aardsda01,1981.0,12.0,27.0,USA,CO,Denver,,,,,,,David,Aardsma,David Allan,215.0,75.0,R,R,2004-04-06,2015-08-23,aardd001,aardsda01,2009.0,1.0,SEA,AL,73.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,aardsda01,1981.0,12.0,27.0,USA,CO,Denver,,,,,,,David,Aardsma,David Allan,215.0,75.0,R,R,2004-04-06,2015-08-23,aardd001,aardsda01,2010.0,1.0,SEA,AL,53.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,aardsda01,1981.0,12.0,27.0,USA,CO,Denver,,,,,,,David,Aardsma,David Allan,215.0,75.0,R,R,2004-04-06,2015-08-23,aardd001,aardsda01,2012.0,1.0,NYA,AL,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7,aardsda01,1981.0,12.0,27.0,USA,CO,Denver,,,,,,,David,Aardsma,David Allan,215.0,75.0,R,R,2004-04-06,2015-08-23,aardd001,aardsda01,2013.0,1.0,NYN,NL,43.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8,aardsda01,1981.0,12.0,27.0,USA,CO,Denver,,,,,,,David,Aardsma,David Allan,215.0,75.0,R,R,2004-04-06,2015-08-23,aardd001,aardsda01,2015.0,1.0,ATL,NL,33.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
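
One way to answer "what rows are missing?" is the `indicator` parameter of `pd.merge`, which adds a `_merge` column recording whether each row's key came from the left table, the right table, or both. Here is a minimal sketch using made-up stand-ins (`toy_people` and `toy_batting` are hypothetical, not the Lahman tables).

```python
import pandas as pd

# Hypothetical toy tables standing in for people_df and batting_df
toy_people = pd.DataFrame({
    'playerID': ['a01', 'b01', 'c01'],
    'nameLast': ['Adams', 'Baker', 'Clark'],
})
toy_batting = pd.DataFrame({
    'playerID': ['a01', 'a01', 'b01'],
    'yearID': [2004, 2006, 2010],
    'HR': [0, 2, 5],
})

toy_outer = pd.merge(
    left = toy_people,
    right = toy_batting,
    on = 'playerID',
    how = 'outer',
    indicator = True   # adds a "_merge" column: left_only, right_only, or both
)

# Count how many rows came from each side
print(toy_outer['_merge'].value_counts())

# Rows whose key appeared in only one of the two tables
unmatched = toy_outer[toy_outer['_merge'] != 'both']
print(unmatched)
```

The unmatched row has missing values in the batting columns. Those missing values are also why integer columns like "yearID" show up as floats (2004.0) in the outer join output above.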


### Inner join
An **inner join** preserves only the rows whose keys appear in both the left and right DataFrames, even if that means dropping rows from either DataFrame that have no match in the other. In other words, you keep only the rows of data in common and lose everything else.

![From W3 Schools](https://www.w3schools.com/sql/img_innerjoin.gif)

![](https://pandas.pydata.org/pandas-docs/stable/_images/merging_merge_on_key_inner.png)

When we pass `how='inner'` to `pd.merge`, we make sure the *only* keys present in the merged DataFrame are those that are present in *both* the "left" and "right" DataFrames. An inner join is ideal if you only want the overlapping data, even if it means losing data from both the left and right DataFrames.
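
To see rows dropping from both sides, here is a small sketch with two made-up tables (`left_df`, `right_df`, and the "key" column are hypothetical) where each side has one key the other lacks.

```python
import pandas as pd

# Hypothetical toy tables: K0 only exists on the left, K3 only on the right
left_df = pd.DataFrame({'key': ['K0', 'K1', 'K2'], 'A': [1, 2, 3]})
right_df = pd.DataFrame({'key': ['K1', 'K2', 'K3'], 'B': [10, 20, 30]})

inner_df = pd.merge(
    left = left_df,
    right = right_df,
    on = 'key',
    how = 'inner'
)

# Only the keys found in both tables survive
print(inner_df)
```

Both K0 (left only) and K3 (right only) disappear from the result, so only the K1 and K2 rows remain.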

Make a DataFrame called `people_batting_inner` by performing an inner join using the "playerID" column as a key. What is the shape of the data? What values are repeated? What rows are missing?

In [37]:
people_batting_inner = pd.merge(
    left = people_df,
    right = batting_df,
    left_on = 'playerID',
    right_on = 'playerID',
    how = 'inner'
)
print(people_df.shape, batting_df.shape, people_batting_inner.shape)
people_batting_inner.head(9)

(20358, 24) (108789, 22) (108789, 45)


Unnamed: 0,playerID,birthYear,birthMonth,birthDay,birthCountry,birthState,birthCity,deathYear,deathMonth,deathDay,deathCountry,deathState,deathCity,nameFirst,nameLast,nameGiven,weight,height,bats,throws,debut,finalGame,retroID,bbrefID,yearID,stint,teamID,lgID,G,AB,R,H,2B,3B,HR,RBI,SB,CS,BB,SO,IBB,HBP,SH,SF,GIDP
0,aardsda01,1981.0,12.0,27.0,USA,CO,Denver,,,,,,,David,Aardsma,David Allan,215.0,75.0,R,R,2004-04-06,2015-08-23,aardd001,aardsda01,2004,1,SFN,NL,11,0,0,0,0,0,0,0.0,0.0,0.0,0,0.0,0.0,0.0,0.0,0.0,0.0
1,aardsda01,1981.0,12.0,27.0,USA,CO,Denver,,,,,,,David,Aardsma,David Allan,215.0,75.0,R,R,2004-04-06,2015-08-23,aardd001,aardsda01,2006,1,CHN,NL,45,2,0,0,0,0,0,0.0,0.0,0.0,0,0.0,0.0,0.0,1.0,0.0,0.0
2,aardsda01,1981.0,12.0,27.0,USA,CO,Denver,,,,,,,David,Aardsma,David Allan,215.0,75.0,R,R,2004-04-06,2015-08-23,aardd001,aardsda01,2007,1,CHA,AL,25,0,0,0,0,0,0,0.0,0.0,0.0,0,0.0,0.0,0.0,0.0,0.0,0.0
3,aardsda01,1981.0,12.0,27.0,USA,CO,Denver,,,,,,,David,Aardsma,David Allan,215.0,75.0,R,R,2004-04-06,2015-08-23,aardd001,aardsda01,2008,1,BOS,AL,47,1,0,0,0,0,0,0.0,0.0,0.0,0,1.0,0.0,0.0,0.0,0.0,0.0
4,aardsda01,1981.0,12.0,27.0,USA,CO,Denver,,,,,,,David,Aardsma,David Allan,215.0,75.0,R,R,2004-04-06,2015-08-23,aardd001,aardsda01,2009,1,SEA,AL,73,0,0,0,0,0,0,0.0,0.0,0.0,0,0.0,0.0,0.0,0.0,0.0,0.0
5,aardsda01,1981.0,12.0,27.0,USA,CO,Denver,,,,,,,David,Aardsma,David Allan,215.0,75.0,R,R,2004-04-06,2015-08-23,aardd001,aardsda01,2010,1,SEA,AL,53,0,0,0,0,0,0,0.0,0.0,0.0,0,0.0,0.0,0.0,0.0,0.0,0.0
6,aardsda01,1981.0,12.0,27.0,USA,CO,Denver,,,,,,,David,Aardsma,David Allan,215.0,75.0,R,R,2004-04-06,2015-08-23,aardd001,aardsda01,2012,1,NYA,AL,1,0,0,0,0,0,0,0.0,0.0,0.0,0,0.0,0.0,0.0,0.0,0.0,0.0
7,aardsda01,1981.0,12.0,27.0,USA,CO,Denver,,,,,,,David,Aardsma,David Allan,215.0,75.0,R,R,2004-04-06,2015-08-23,aardd001,aardsda01,2013,1,NYN,NL,43,0,0,0,0,0,0,0.0,0.0,0.0,0,0.0,0.0,0.0,0.0,0.0,0.0
8,aardsda01,1981.0,12.0,27.0,USA,CO,Denver,,,,,,,David,Aardsma,David Allan,215.0,75.0,R,R,2004-04-06,2015-08-23,aardd001,aardsda01,2015,1,ATL,NL,33,1,0,0,0,0,0,0.0,0.0,0.0,0,1.0,0.0,0.0,0.0,0.0,0.0


### Shape?

What should the data look like afterwards? How many rows? How many columns? Which values should repeat? Which values should be null? Which values should disappear?  
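
A quick way to build intuition is to predict the shape for each join type on a tiny example and then check yourself. The sketch below uses hypothetical toy tables again. The `validate` argument is optional, and here it asks pandas to raise an error if the left keys turn out not to be unique.

```python
import pandas as pd

# Hypothetical toy tables: one row per player, many rows per player
toy_people = pd.DataFrame({
    'playerID': ['a01', 'b01', 'c01'],
    'nameLast': ['Adams', 'Baker', 'Clark'],
})
toy_batting = pd.DataFrame({
    'playerID': ['a01', 'a01', 'b01'],
    'yearID': [2004, 2006, 2010],
    'HR': [0, 2, 5],
})

shapes = {}
for how in ['left', 'right', 'inner', 'outer']:
    merged = pd.merge(
        left = toy_people,
        right = toy_batting,
        on = 'playerID',
        how = how,
        validate = 'one_to_many'   # fail loudly if playerID repeats on the left
    )
    shapes[how] = merged.shape
    print(how, merged.shape)
```

The column count is always the columns of both tables minus the one shared key (2 + 3 - 1 = 4 here, and 24 + 22 - 1 = 45 for the real tables above). Only the row count changes with the join type.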

## Exercises

In [38]:
people_batting_left = pd.merge(
    left = people_df,
    right = batting_df,
    left_on = 'playerID',
    right_on = 'playerID',
    how = 'left'
)

people_batting_left.shape

(109249, 45)

In [39]:
people_batting_right = pd.merge(
    left = batting_df,
    right = people_df,
    left_on = 'playerID',
    right_on = 'playerID',
    how = 'right'
)

people_batting_right.shape

(109249, 45)

### Exercise 01: Do the big concatenation

Concatenate all 151 seasons of gamelog data in `gamelog_list` together. Make sure the index of this combined DataFrame is unique. Make sure to give the columns the right numeric labels. What is the `.shape` of the combined DataFrame?

### Exercise 02: Find some extreme matches. (Week 01 review)
The documentation explaining what each numeric column corresponds to is available [here](https://www.retrosheet.org/gamelogs/glfields.txt). Answer these questions.

* What team scored the most runs ever in a single game, against whom, and when?
* What was the longest game ever played? How long, with whom, and when?

### Exercise 03: Make a pivot table (Week 02 review)
The documentation explaining what each numeric column corresponds to is available [here](https://www.retrosheet.org/gamelogs/glfields.txt).
* Column 3 is the day of the week. 
* Column 13 is whether the game was a night or day game. 
* Column 18 is the attendance.

Make a pivot table of day of week and day/night time with average attendance as values. What time has the highest attendance? The lowest?

### Exercise 04: Which country produced the most Hall of Famers?

Join the "HallOfFame" and "People" tables together. Do an aggregation to count the number of Hall of Famers by birth country. Display the sorted list.

### Exercise 05: Three-way join

Which schools produce the highest-paid players? You will need to load and join the tables for "Salaries", "Schools", and "CollegePlaying" and come up with a strategy for joining all three.

Each row is now the salary of the player in a year and the college they attended. There are two yearIDs as well, so the merged DataFrame has \_x and \_y suffixes added to the column names. These are distinct years and definitely shouldn't be used in the join: 

* What is now "yearID_x" from `salaries_df` is the salary of the player in that year. 
* What is now "yearID_y" from `college_df` or `schools_college_df` is the year that player played for that school.
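
Here is a minimal sketch of where the suffixes come from, using made-up tables (`toy_salaries` and `toy_college` are hypothetical). The optional `suffixes` argument to `pd.merge` lets you pick more descriptive labels than the defaults.

```python
import pandas as pd

# Hypothetical toy tables that both have a "yearID" column with different meanings
toy_salaries = pd.DataFrame({
    'playerID': ['p1', 'p1'],
    'yearID': [2005, 2006],      # year the salary was paid
    'salary': [100, 200],
})
toy_college = pd.DataFrame({
    'playerID': ['p1'],
    'schoolID': ['s1'],
    'yearID': [1999],            # year the player played for the school
})

# Join only on playerID: the overlapping "yearID" columns get default suffixes
default_merge = pd.merge(toy_salaries, toy_college, on = 'playerID', how = 'inner')
print(default_merge.columns.tolist())

# Same join, but with labels that say which table each year came from
named_merge = pd.merge(
    toy_salaries,
    toy_college,
    on = 'playerID',
    how = 'inner',
    suffixes = ('_salary', '_college')
)
print(named_merge.columns.tolist())
```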

This next part is tricky and probably has a few different options and no single right answer. 

First I'll try something complicated but sensible. I'm going to group on schoolID and aggregate with salary as a sum, the number of unique player IDs, and the count of all the unique years.
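
As a rough sketch of that kind of aggregation (not necessarily the exact approach used in class), here is a groupby with named aggregations on a made-up merged table. The `toy_merged` data and the output column names are hypothetical.

```python
import pandas as pd

# Hypothetical stand-in for the three-way merged table
toy_merged = pd.DataFrame({
    'schoolID': ['s1', 's1', 's1', 's2'],
    'playerID': ['p1', 'p1', 'p2', 'p3'],
    'yearID_x': [2005, 2006, 2005, 2005],
    'salary': [100, 200, 50, 500],
})

school_agg = (
    toy_merged
    .groupby('schoolID')
    .agg(
        total_salary = ('salary', 'sum'),        # all salary earned by alumni
        n_players = ('playerID', 'nunique'),     # number of distinct players
        n_years = ('yearID_x', 'nunique'),       # number of distinct salary years
    )
    .sort_values('total_salary', ascending = False)
)

print(school_agg)
```

Keeping the player and year counts next to the salary total makes it easier to spot schools whose ranking depends on a single player.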

Maybe try something easy with groupby-aggregation to compare. This is definitely biased towards players who have enormous contracts or played for a long time.

How many players ever played for "Maple Woods Community College"? And who are they?

"pujolal01" is [Albert Pujols](https://en.wikipedia.org/wiki/Albert_Pujols), who comes in at \#2 for [career earnings](https://en.wikipedia.org/wiki/List_of_highest-paid_Major_League_Baseball_players#Career_earnings_as_of_the_end_of_the_2019_season) of \$285 million. He played just one season (1999) at Maple Woods CC (yearID_y), but this super-earner was enough to send the school to the top of the list.

## Appendix

There's nothing you need to do here, but there are some tips and hints in here.

### `os` library

The "os" library provides an interface for accessing parts of your operating system like the file directory (Finder in macOS or File Explorer in Windows). This makes it powerful... but dangerous! We want to use it when we have to read in many (dozens, hundreds of) files instead of writing out each `read_csv` statement one-by-one. Here is the [documentation](https://docs.python.org/3/library/os.html) and some [examples](https://towardsdatascience.com/7-common-file-system-operations-you-can-do-with-python-e4670c0d92f2). 

You could also use [pathlib](https://docs.python.org/3/library/pathlib.html) ([examples](https://towardsdatascience.com/why-you-should-start-using-pathlib-as-an-alternative-to-the-os-module-d9eccd994745)), which represents directories and files as objects with cool methods and attributes, but I'm an old-school geezer who just likes simple strings and lists. You're welcome to use pathlib if you prefer.

Start with importing the library.

In [86]:
import os 

If you're on a Mac or Linux, your directory separator character is probably "/". If you're on Windows, your directory separator character is probably "\\\\" but you should still be able to use "/".

In [87]:
os.sep

'/'

We really only need to do two things with "os":

1. Get the current working directory: `os.getcwd()`
2. Get a list of files in a directory : `os.listdir()`

`os.getcwd()` returns a string with the current working directory (CWD) of where the notebook is running. Your CWD will be different than mine.

In [88]:
os.getcwd()

'/Users/briankeegan/Dropbox/Courses/2022 Spring - INFO 3402/Week 03 - Combining and Validation'

`os.listdir()` returns a list of strings corresponding to the files and directories in the path I pass to the function. If I don't pass a directory path string, it defaults to the CWD. Here is a list of all the files in my CWD.

In [89]:
os.listdir()

['co_cannabis_sales_medical.zip',
 'Quiz.ipynb',
 '.DS_Store',
 'Week 03.pptx',
 'co_county_demographics.csv',
 'Retrosheet',
 'co_county_cannabis.csv',
 'Week 03.pdf',
 'Lahman.zip',
 'co_cannabis_sales_recreational.zip',
 'co_cannabis_sales',
 'Week 03 - Lecture - Solutions.ipynb',
 'covid19_counties.csv',
 'ColoradoCannabisSalesReport.zip',
 '.ipynb_checkpoints',
 'co_county_covid.csv',
 'Retrosheet.zip',
 'Lahman',
 'cbi_crime_month_county.csv',
 'Week 03 - Assignment - Solution.ipynb',
 'Week 03 - Assignment.ipynb',
 'co_county_crimes.csv']

Outside of this notebook, I used my File Explorer/Finder to unzip the Lahman database to a folder inside my CWD called "Lahman". Now I want to list the files in this folder.

I can list the files inside this folder in a few ways. The first is to use "relative" paths. Because the "Lahman" folder is inside my CWD, I can just pass its folder name.

In [90]:
os.listdir('Lahman')

['AwardsManagers.csv',
 'Managers.csv',
 'AwardsPlayers.csv',
 'Fielding.csv',
 'Salaries.csv',
 'Parks.csv',
 'Schools.csv',
 'People.csv',
 'PitchingPost.csv',
 'Teams.csv',
 'Appearances.csv',
 'AwardsSharePlayers.csv',
 'TeamsFranchises.csv',
 'Batting.csv',
 'ManagersHalf.csv',
 'FieldingOF.csv',
 'Pitching.csv',
 'CollegePlaying.csv',
 'HomeGames.csv',
 'HallOfFame.csv',
 'readme2014.txt',
 'AwardsShareManagers.csv',
 'BattingPost.csv',
 'TeamsHalf.csv',
 'SeriesPost.csv',
 'FieldingPost.csv',
 'AllstarFull.csv',
 'FieldingOFsplit.csv']

I could also pass the full CWD as a string and add the "Lahman" sub-directory.

In [91]:
os.listdir(os.getcwd() + '/Lahman')

['AwardsManagers.csv',
 'Managers.csv',
 'AwardsPlayers.csv',
 'Fielding.csv',
 'Salaries.csv',
 'Parks.csv',
 'Schools.csv',
 'People.csv',
 'PitchingPost.csv',
 'Teams.csv',
 'Appearances.csv',
 'AwardsSharePlayers.csv',
 'TeamsFranchises.csv',
 'Batting.csv',
 'ManagersHalf.csv',
 'FieldingOF.csv',
 'Pitching.csv',
 'CollegePlaying.csv',
 'HomeGames.csv',
 'HallOfFame.csv',
 'readme2014.txt',
 'AwardsShareManagers.csv',
 'BattingPost.csv',
 'TeamsHalf.csv',
 'SeriesPost.csv',
 'FieldingPost.csv',
 'AllstarFull.csv',
 'FieldingOFsplit.csv']
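
Instead of gluing path strings together with `+`, `os.path.join` builds the path with the right separator for your operating system. The sketch below combines it with `os.listdir` to read every CSV in a folder into a dictionary of DataFrames. To stay self-contained it writes a couple of made-up files into a temporary folder first, so `tmp_dir`, the file contents, and the `tables` dictionary are all hypothetical stand-ins for your own "Lahman" folder.

```python
import os
import tempfile
import pandas as pd

with tempfile.TemporaryDirectory() as tmp_dir:
    # Write two toy CSVs and one non-CSV file so there is something to list
    pd.DataFrame({'playerID': ['a01'], 'HR': [1]}).to_csv(
        os.path.join(tmp_dir, 'Batting.csv'), index = False)
    pd.DataFrame({'playerID': ['a01'], 'nameLast': ['Adams']}).to_csv(
        os.path.join(tmp_dir, 'People.csv'), index = False)
    with open(os.path.join(tmp_dir, 'readme.txt'), 'w') as f:
        f.write('not a table')

    # Loop over the directory listing and read only the CSV files
    tables = {}
    for filename in sorted(os.listdir(tmp_dir)):
        if filename.endswith('.csv'):
            table_name = os.path.splitext(filename)[0]   # drop the ".csv" extension
            tables[table_name] = pd.read_csv(os.path.join(tmp_dir, filename))

print(list(tables.keys()))
```

With the real data you would replace `tmp_dir` with the path to your "Lahman" folder and skip the file-writing step.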

#### Advanced
If I had a different strategy for organizing my folders and kept the data somewhere else than a sub-folder of where the notebook is running, I would need to write out the path to that directory.

In [96]:
# This command will almost certainly fail for you ...unless 
# (1) You have a Windows machine, (2) with a "Brian" user, (3) a "Data" folder, and (4) an unzipped "Lahman" folder there

os.listdir('C:/Users/Brian/Data/Lahman')


If I was more familiar with UNIX/Terminal syntax, I could also use the operating system's `curdir` and `sep` characters to navigate into the subfolder. First check what your operating system's characters are for the current directory and separators.

In [93]:
os.curdir

'.'

In [94]:
os.sep

'/'

In [95]:
# Windows has an altsep character
os.altsep

In [97]:
os.listdir('./Lahman')

['AwardsManagers.csv',
 'Managers.csv',
 'AwardsPlayers.csv',
 'Fielding.csv',
 'Salaries.csv',
 'Parks.csv',
 'Schools.csv',
 'People.csv',
 'PitchingPost.csv',
 'Teams.csv',
 'Appearances.csv',
 'AwardsSharePlayers.csv',
 'TeamsFranchises.csv',
 'Batting.csv',
 'ManagersHalf.csv',
 'FieldingOF.csv',
 'Pitching.csv',
 'CollegePlaying.csv',
 'HomeGames.csv',
 'HallOfFame.csv',
 'readme2014.txt',
 'AwardsShareManagers.csv',
 'BattingPost.csv',
 'TeamsHalf.csv',
 'SeriesPost.csv',
 'FieldingPost.csv',
 'AllstarFull.csv',
 'FieldingOFsplit.csv']
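
For comparison, here is a rough sketch of the same kind of directory listing with the pathlib alternative mentioned at the top of this appendix. It builds a throwaway folder with made-up files so it runs anywhere. The folder and file names are hypothetical.

```python
from pathlib import Path
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    # Path objects use "/" to join pieces, regardless of operating system
    data_dir = Path(tmp) / 'Lahman'
    data_dir.mkdir()

    # Create two toy files so there is something to find
    (data_dir / 'People.csv').write_text('playerID\naardsda01\n')
    (data_dir / 'readme2014.txt').write_text('notes')

    # .glob() filters by pattern, and .name gives the filename without the folder
    csv_names = sorted(p.name for p in data_dir.glob('*.csv'))

print(csv_names)
```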

### Accessing zip files directly

We may not want to have sub-directories with hundreds of files littering our file system for organizational, security, or space reasons. You can also use Python to read the files inside a zip archive directly into memory without extracting them to disk first. In this way, you can access the data inside the zip files without cluttering your computer with lots of directories and files.

You can use the [zipfile](https://docs.python.org/3/library/zipfile.html) ([tutorial](https://medium.com/dev-bits/ultimate-guide-for-working-with-i-o-streams-and-zip-archives-in-python-3-6f3cf96dca50)) module, which is part of the Python 3 standard library (no installation needed).

In [98]:
from zipfile import ZipFile
from io import BytesIO

# Store the files in a dictionary. Not ideal for accessing
lahman = dict()

# Use the with functionality to handle closing the file
# https://realpython.com/python-with-statement/#using-the-python-with-statement
with ZipFile('Lahman.zip') as input_zipfile:

    # The .namelist() method returns a list of filenames inside the zipfile
    for name in input_zipfile.namelist():
        
        # Only open CSV files
        if name.endswith('.csv'):

            # Split on the '.csv' and get the filename part of the string in the first element of the list
            noncsv_name = name.split('.csv')[0]
            
            # Zip file contents are read in as raw byte strings
            read_file = input_zipfile.read(name)
            
            # Wrap the raw bytes in BytesIO so read_csv can treat them like a file and read them into a DataFrame
            # Assign the DataFrame as a value keyed by the non-CSV filename in the lahman dictionary
            # https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html#dealing-with-unicode-data
            lahman[noncsv_name] = pd.read_csv(BytesIO(read_file))

Treat the dictionary like a database and access the tables using the filenames as keys.

In [99]:
lahman.keys()

dict_keys(['AllstarFull', 'Appearances', 'AwardsManagers', 'AwardsPlayers', 'AwardsShareManagers', 'AwardsSharePlayers', 'Batting', 'BattingPost', 'CollegePlaying', 'Fielding', 'FieldingOF', 'FieldingOFsplit', 'FieldingPost', 'HallOfFame', 'HomeGames', 'Managers', 'ManagersHalf', 'Parks', 'People', 'Pitching', 'PitchingPost', 'Salaries', 'Schools', 'SeriesPost', 'Teams', 'TeamsFranchises', 'TeamsHalf'])

In [100]:
lahman['People'].head(2)

Unnamed: 0,playerID,birthYear,birthMonth,birthDay,birthCountry,birthState,birthCity,deathYear,deathMonth,deathDay,deathCountry,deathState,deathCity,nameFirst,nameLast,nameGiven,weight,height,bats,throws,debut,finalGame,retroID,bbrefID
0,aardsda01,1981.0,12.0,27.0,USA,CO,Denver,,,,,,,David,Aardsma,David Allan,215.0,75.0,R,R,2004-04-06,2015-08-23,aardd001,aardsda01
1,aaronha01,1934.0,2.0,5.0,USA,AL,Mobile,2021.0,1.0,22.0,USA,GA,Atlanta,Hank,Aaron,Henry Louis,180.0,72.0,R,R,1954-04-13,1976-10-03,aaroh101,aaronha01
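
A slightly shorter variation on the zip-reading loop above is to open each archive member with `ZipFile.open`, which returns a file-like object that `read_csv` can consume directly. This is only a sketch: to keep it self-contained it builds a tiny zip archive in memory with made-up contents instead of using "Lahman.zip", and the names `zip_buffer` and `toy_tables` are hypothetical.

```python
import io
import pandas as pd
from zipfile import ZipFile

# Build a tiny zip archive in memory so the example runs without any files on disk
zip_buffer = io.BytesIO()
with ZipFile(zip_buffer, 'w') as toy_zip:
    toy_zip.writestr('People.csv', 'playerID,nameLast\naardsda01,Aardsma\n')
    toy_zip.writestr('readme.txt', 'not a table')

# Rewind the buffer, then read every CSV member straight into a DataFrame
zip_buffer.seek(0)
toy_tables = {}
with ZipFile(zip_buffer) as toy_zip:
    for name in toy_zip.namelist():
        if name.endswith('.csv'):
            # .open() gives a file-like object, so no separate BytesIO step is needed
            with toy_zip.open(name) as csv_file:
                toy_tables[name[:-len('.csv')]] = pd.read_csv(csv_file)

print(list(toy_tables.keys()))
print(toy_tables['People'])
```

With a zip file on disk you would pass its filename (like 'Lahman.zip') to `ZipFile` instead of the in-memory buffer.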
