# Pandas: Working with Data Frames

Data wrangling with real data nearly always involves combining data from multiple sources.

In this notebook we will experiment with ways to do this.

In [None]:
import numpy as np              # standard imports
import scipy as sc
import pandas as pd
import matplotlib.pyplot as plt # the plotting interface lives in matplotlib.pyplot
%matplotlib inline

## Read the Hall of Fame dataset from the Baseball-Databank

In [None]:
hall = pd.read_csv("../../baseballdatabank/core/HallOfFame.csv")

## List the first 5 rows of the hall Data Frame

In [None]:
hall.head()

## Show the shape of the hall Data Frame

In [None]:
hall.shape

Note that there are 4,120 records in the Hall of Fame table.

## Read the Master dataset from the Baseball-Databank

In [None]:
master = pd.read_csv("../../baseballdatabank/core/Master.csv")

## List the first 5 rows of the master Data Frame

In [None]:
master.head()

## Show the shape of the master Data Frame

In [None]:
master.shape

## List the columns of the master Data Frame

In [None]:
master.columns

## Merge the hall of fame data with the master Data Frame

In [None]:
mhf = pd.merge(master, hall)    # pandas merge; by default it joins on the columns the two Data Frames share

In [None]:
mhf.head()

In [None]:
mhf.shape                       # note that only rows whose playerID appears in both Data Frames are kept

## How did we get multiple rows for some players?

The first thing that jumps out is that in the merged Data Frame, some players have more than one row.  The number of rows matches the hall Data Frame, so let's examine hall to see if it has multiple records.

We'll subset hall by selecting only the rows for playerID 'adamsba01':

In [None]:
hall[hall.playerID=='adamsba01']

## Evidently, the Hall of Fame table has all players nominated, even if they weren't inducted

Poor B.A. Adams was apparently nominated 13 times, but never got enough votes to be inducted.
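
Rather than inspecting one player at a time, the number of rows per player can be counted directly. The sketch below uses a small made-up table (`toy_hall` and its IDs are hypothetical stand-ins, not the real Hall of Fame data) to show the idea with `value_counts`.

```python
import pandas as pd

# Hypothetical stand-in for the hall table (made-up IDs and years)
toy_hall = pd.DataFrame({
    "playerID": ["aaa01", "aaa01", "aaa01", "bbb01", "ccc01", "ccc01"],
    "yearid":   [1950, 1951, 1952, 1960, 1970, 1971],
    "inducted": ["N", "N", "N", "Y", "N", "Y"],
})

# Number of rows (ballot appearances) per player, largest count first
rows_per_player = toy_hall["playerID"].value_counts()
print(rows_per_player)

# IDs of players that appear on more than one row
repeat_ids = rows_per_player[rows_per_player > 1].index.tolist()
print(repeat_ids)
```

Applied to the real table, the same call would be `hall["playerID"].value_counts()`.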

## Eliminating duplicates

One thing we might consider is selecting only players who were inducted. Presumably, no one would be inducted twice, so this should solve the problem.

In [None]:
hall2 = hall[hall.inducted=='Y']
hall2.shape
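
The presumption that no one is inducted twice can be checked rather than assumed. A minimal sketch on made-up data (`toy_hall` and its IDs are hypothetical) filters to inducted rows and then tests whether playerID is unique.

```python
import pandas as pd

# Hypothetical stand-in for the hall table (made-up IDs)
toy_hall = pd.DataFrame({
    "playerID": ["aaa01", "aaa01", "bbb01", "ccc01", "ccc01"],
    "inducted": ["N", "N", "Y", "N", "Y"],
})

# Keep only the inducted rows, as in the cell above
toy_hall2 = toy_hall[toy_hall["inducted"] == "Y"]

# True when no playerID is repeated among the inducted rows
ids_are_unique = toy_hall2["playerID"].is_unique
print(ids_are_unique)

# Any repeated IDs would be listed here (empty for this toy data)
repeats = toy_hall2[toy_hall2["playerID"].duplicated(keep=False)]
print(len(repeats))
```

On the real data the equivalent check would be `hall2["playerID"].is_unique`.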

## This looks more reasonable

There are just over 300 inductees in the Baseball Hall of Fame.

Let's redo the merge.

In [None]:
mhf = pd.merge(master,hall2)
mhf.shape

In [None]:
mhf.head()

## Inner and Outer Merges and Joins

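The merge we just did matched master and hall2 on their shared column, playerID. (playerID is an ordinary column here, not the Data Frame index; by default pd.merge matches on the columns the two Data Frames have in common.)

By default, only rows whose playerID appears in both Data Frames are kept, that is, the intersection of the two sets of playerID values.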

This is called an inner merge.
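
To make the intersection behavior concrete, here is a small sketch on made-up tables (`toy_master`, `toy_hall2`, and all IDs and names in them are hypothetical). It names the key explicitly with `on="playerID"` instead of relying on the default.

```python
import pandas as pd

# Hypothetical stand-ins for master and hall2 (made-up values)
toy_master = pd.DataFrame({
    "playerID": ["aaa01", "bbb01", "ccc01", "ddd01"],
    "nameLast": ["Alpha", "Bravo", "Charlie", "Delta"],
})
toy_hall2 = pd.DataFrame({
    "playerID": ["bbb01", "ccc01", "zzz01"],
    "yearid":   [1960, 1971, 1980],
})

# how="inner" is the default; it is spelled out here for clarity
inner = pd.merge(toy_master, toy_hall2, on="playerID", how="inner")
print(inner)
print(inner.shape)
```

Only `bbb01` and `ccc01` occur in both toy tables, so the result has two rows.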

Alternatively, we could match records by playerID and keep rows with playerID in the union of the two sets of playerID values.

This is called an outer merge, and has to be explicitly specified.

In [None]:
mhf_outer = pd.merge(master, hall2, how='outer')
mhf_outer.shape

In [None]:
mhf_outer.head()

## Outer merge supplies missing values

An outer merge will usually result in missing values because some playerID values might not appear in hall2 (in fact, most will not).

For these players, the columns that would come from hall2 are filled with the missing-value marker NaN.
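
The following sketch shows where the NaN values come from, again on made-up tables (`toy_master`, `toy_hall2`, and their contents are hypothetical). It also passes `indicator=True`, which adds a `_merge` column recording whether each row came from the left table, the right table, or both.

```python
import pandas as pd

# Hypothetical stand-ins for master and hall2 (made-up values)
toy_master = pd.DataFrame({
    "playerID": ["aaa01", "bbb01", "ccc01", "ddd01"],
    "nameLast": ["Alpha", "Bravo", "Charlie", "Delta"],
})
toy_hall2 = pd.DataFrame({
    "playerID": ["bbb01", "ccc01", "zzz01"],
    "yearid":   [1960, 1971, 1980],
})

# Outer merge keeps every playerID found in either table
outer = pd.merge(toy_master, toy_hall2, on="playerID",
                 how="outer", indicator=True)
print(outer)

# Rows with no match in toy_hall2 have NaN in the yearid column
n_missing = outer["yearid"].isna().sum()
print(n_missing)

# Where each row came from: left_only, right_only, or both
print(outer["_merge"].value_counts())
```

Filtering on `outer["_merge"] == "left_only"` is one way to list the players who have no matching row in the right-hand table.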

In general, you need to be careful when merging and joining tables to avoid inadvertently generating extra rows.
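
One possible safeguard is the `validate` argument of `pd.merge`, which raises an error when the key is not unique in the way you expect. The sketch below uses made-up tables (`toy_master`, `toy_hall`, and their contents are hypothetical) in which one player has two rows on the right-hand side.

```python
import pandas as pd

# Hypothetical stand-ins: one row per player on the left,
# repeated rows for one player on the right
toy_master = pd.DataFrame({
    "playerID": ["aaa01", "bbb01"],
    "nameLast": ["Alpha", "Bravo"],
})
toy_hall = pd.DataFrame({
    "playerID": ["aaa01", "aaa01", "bbb01"],
    "yearid":   [1950, 1951, 1960],
})

# An unchecked merge silently produces more rows than toy_master has
merged = pd.merge(toy_master, toy_hall, on="playerID")
print(len(toy_master), len(merged))

# With validate="one_to_one", pandas refuses the merge instead
try:
    pd.merge(toy_master, toy_hall, on="playerID", validate="one_to_one")
    raised = False
except pd.errors.MergeError as err:
    raised = True
    print("merge rejected:", err)
```

Comparing the row count before and after a merge, as in the first `print`, is a simpler check that works without `validate`.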