# Data Mining - Lab 1
### Team 2 - Patricia Goresen, Jeffrey Lancon, Brychan Manry, George Sturrock
#### May 27, 2018
------

## Business Understanding

Since baseball's inception, statistical analysis has been an integral part of the game. Coaches, players, and fans can recite many of these statistics and often use them in coaching and roster-development decisions. Books and movies have been based on the pursuit of using baseball statistics to build the ultimate team. The practice even has its own name, sabermetrics: the empirical analysis of baseball, especially baseball statistics that measure in-game activity.

The book Moneyball: The Art of Winning an Unfair Game by Michael Lewis, and the film based on it starring Brad Pitt and Jonah Hill, tell the story of Billy Beane, General Manager of the Oakland Athletics, who used player and team sabermetrics to assemble a competitive baseball team on limited funding during the 2002 and 2003 seasons. By focusing on undervalued players, Beane fielded a competitive team with a salary budget less than half that of the Athletics' larger-market competitors. Despite the salary constraint, the Athletics made the playoffs in both 2002 and 2003.

The source of baseball data for this lab is the Sean Lahman Baseball Database [http://www.seanlahman.com/baseball-archive/statistics/]. Often cited as the most complete baseball database, the data set includes twenty-seven data tables and millions of records covering most non-proprietary baseball data pertaining to offense, fielding, pitching, payroll, player demographics, team statistics, manager data and much more. For this lab, the focus will be on team level data. The team level data will be explored graphically using standard Python libraries. After data exploration is complete, statistical methods can be employed to determine variable correlation and/or predict outcomes, using techniques such as principal component analysis (PCA) or logistic regression. Predictive models will be verified to ensure the output is useful. For example, a logistic regression model will have its assumptions validated, and competing models will be compared using standard measures such as Area Under the Curve (AUC), specificity, sensitivity and misclassification rate. The outcomes of this study could provide helpful insights for baseball fans who play fantasy sports, for baseball reporters, and for baseball managers seeking to build a more competitive team.
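The model-comparison measures named above (AUC, sensitivity, specificity and misclassification rate) can be computed directly from true labels, predicted labels and predicted scores. The following is a minimal sketch in plain Python. The labels and scores are made up for illustration and are not Lahman data, and the helper names `confusion_counts` and `auc_score` are hypothetical.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, tn, fn) for binary labels coded 0/1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, fp, tn, fn


def auc_score(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Made-up example: 1 = made the playoffs, 0 = did not (illustrative values only)
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
scores = [0.9, 0.4, 0.7, 0.3, 0.2, 0.6, 0.8, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # 0.5 cutoff is an assumption

tp, fp, tn, fn = confusion_counts(y_true, y_pred)
sensitivity = tp / (tp + fn)                 # share of actual positives caught
specificity = tn / (tn + fp)                 # share of actual negatives caught
misclassification = (fp + fn) / len(y_true)  # share of all rows predicted wrongly
auc = auc_score(y_true, scores)
print(sensitivity, specificity, misclassification, auc)
```

Sensitivity, specificity and misclassification rate depend on the chosen cutoff, while AUC summarizes ranking quality across all cutoffs, which is why the measures are usually reported together.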

------

## Data Types and Meaning

The base team data from the Lahman baseball database contains forty-eight different attributes. The team level data begins with the 1871 season and runs through the conclusion of the 2017 Major League Baseball season. The grain of the team data is at the season and team level. The attributes can be divided into three general categories: informational, offense and pitching. Informational attributes are descriptive elements about the team. These include unique team identifiers, league membership, division membership, post-season success indicators and home ballpark information. Conventional team offensive statistics such as total hits, at bats and home runs are also present. Team pitching statistics such as earned run average (ERA), saves and strikeouts are available for analysis as well.
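The stated grain (one row per season and team) can be confirmed by checking that no `yearID`/`teamID` pair repeats. The sketch below uses a small toy frame in place of `Teams.csv`; its values are illustrative only, and the same check would apply to the full table.

```python
import pandas as pd

# Toy stand-in for Teams.csv (values are illustrative, not real records)
toy = pd.DataFrame({
    "yearID": [1970, 1970, 1971, 1971],
    "teamID": ["ATL", "BAL", "ATL", "BAL"],
    "W": [76, 108, 82, 101],
})

# A season/team grain means each (yearID, teamID) pair appears exactly once.
# keep=False marks every row that shares its key with another row.
dup_mask = toy.duplicated(subset=["yearID", "teamID"], keep=False)
n_duplicates = int(dup_mask.sum())
print("Rows sharing a (yearID, teamID) key:", n_duplicates)
```

A count of zero supports the season-and-team grain; any non-zero count would point to rows worth inspecting before analysis.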

In [22]:
# LOAD CSV WITH DATA TYPES AND MEANING DEF FOR DISPLAY
import pandas as pd
from IPython.display import display, HTML
display(HTML(pd.read_csv("./docobjects/DataTypeandMeaning.csv").to_html(index=False)))

| Attribute | Type | Description |
|---|---|---|
| yearID | int64 | The professional baseball season. |
| lgID | object | The league to which the team was a member during the specified year. |
| teamID | object | Unique identifier for the professional baseball team. |
| franchID | object | Unique identifier for the professional baseball franchise. |
| divID | object | The division within the league to which the team belonged during the specified year. |
| Rank | int64 | The team's finishing rank in their division for the specified year. |
| G | int64 | The number of games played by the team during the specified year. |
| Ghome | float64 | The number of home games played by the team during the specified year. |
| W | int64 | Games won by the team during the year. |
| L | int64 | Games lost by the team during the year. |


------

## Data Quality

Overall, the team level data is a high-quality data set. Even so, the data will be examined for missing values, unique identifiers, consistency of team names over time and any other anomalies that arise. The data will be subset to only those records from 1970 onward. Professional baseball has grown and changed tremendously since 1871: the American League and the National League merged, divisions were introduced, teams went out of business, new teams were created, wild card playoffs were added, and strategy evolved. These changes make data from approximately 150 years ago likely too dated to be useful. Data quality analysis will be conducted on this subset of the team data.

In [20]:
import numpy as np
import os
import string
from matplotlib import pyplot as plt
import seaborn as sns

#Read Teams data file
teams = pd.read_csv('./sourcedata/Teams.csv')
#Select rows where year > 1969
teams1970 = teams[teams.yearID > 1969]
teams1970.head()

Unnamed: 0,yearID,lgID,teamID,franchID,divID,Rank,G,Ghome,W,L,...,DP,FP,name,park,attendance,BPF,PPF,teamIDBR,teamIDlahman45,teamIDretro
1541,1970,NL,ATL,ATL,W,5,162,81.0,76,86,...,118,0.977,Atlanta Braves,Atlanta-Fulton County Stadium,1078848.0,106,106,ATL,ATL,ATL
1542,1970,AL,BAL,BAL,E,1,162,81.0,108,54,...,148,0.981,Baltimore Orioles,Memorial Stadium,1057069.0,101,98,BAL,BAL,BAL
1543,1970,AL,BOS,BOS,E,3,162,81.0,87,75,...,131,0.974,Boston Red Sox,Fenway Park II,1595278.0,108,107,BOS,BOS,BOS
1544,1970,AL,CAL,ANA,W,3,162,81.0,86,76,...,169,0.98,California Angels,Anaheim Stadium,1077741.0,96,97,CAL,CAL,CAL
1545,1970,AL,CHA,CHW,W,6,162,84.0,56,106,...,187,0.975,Chicago White Sox,Comiskey Park,495355.0,101,102,CHW,CHA,CHA
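One way to examine the consistency of team names over time is to count the distinct `name` values recorded for each `franchID`. This is only a sketch on toy rows: the column names match `Teams.csv`, but the values are illustrative.

```python
import pandas as pd

# Toy rows; the column names match Teams.csv, the values are illustrative
toy = pd.DataFrame({
    "yearID":   [1996, 1997, 1996, 1997],
    "franchID": ["ANA", "ANA", "BOS", "BOS"],
    "name":     ["California Angels", "Anaheim Angels",
                 "Boston Red Sox", "Boston Red Sox"],
})

# Count distinct names per franchise; more than one signals a rename to review
names_per_franchise = toy.groupby("franchID")["name"].nunique()
renamed = names_per_franchise[names_per_franchise > 1]
print(renamed)
```

Grouping on `franchID` rather than `teamID` is a design choice: the franchise identifier is meant to stay stable when a team is renamed or relocated, so it is the more natural key for tracking one club across seasons.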


#### Identify Uniqueness Issues within Each Column

The following script shows there are no columns with the same value in every row. The four post-season indicator columns (DivWin, WCWin, LgWin and WSWin) each have only three unique values. This is expected: they are binary indicators, and because `nunique` is called with `dropna = False`, missing values count as a third value. Missing values in these columns will be examined later in this section. Conversely, no column has a different value in every row. The attendance column contains a large number of unique values. This is also expected, as season attendance in this subset ranges from a six-digit total (306,763) to a seven-digit total (4,483,350). In summary, there are no issues with non-uniqueness or over-uniqueness in the subset team data set.

In [21]:
#Count Unique Values for each column
teamUnique = teams1970.nunique(dropna = False)
print(teamUnique)

yearID            48  
lgID              2   
teamID            36  
franchID          30  
divID             3   
Rank              7   
G                 27  
Ghome             32  
W                 68  
L                 70  
DivWin            3   
WCWin             3   
LgWin             3   
WSWin             3   
R                 393 
AB                450 
H                 406 
2B                202 
3B                58  
HR                197 
BB                334 
SO                625 
SB                194 
CS                98  
HBP               88  
SF                56  
RA                416 
ER                390 
ERA               258 
CG                71  
SHO               25  
SV                56  
IPouts            309 
HA                431 
HRA               174 
BBA               325 
SOA               635 
E                 123 
DP                116 
FP                24  
name              36  
park              88  
attendance        1323
BPF        
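The two checks described in this subsection (columns holding a single value, and columns holding a different value in every row) can also be made explicit instead of being read off the printed counts. A minimal sketch on a toy frame with illustrative values:

```python
import pandas as pd

# Toy frame with one constant column and one all-unique column (illustrative)
toy = pd.DataFrame({
    "lgID": ["AL", "AL", "AL", "AL"],        # constant: same value in every row
    "teamID": ["BAL", "BOS", "CAL", "CHA"],  # all unique: differs in every row
    "W": [108, 87, 86, 87],                  # neither
})

# dropna=False counts NaN as its own value, as in the notebook cell above
counts = toy.nunique(dropna=False)
constant_cols = counts[counts == 1].index.tolist()
all_unique_cols = counts[counts == len(toy)].index.tolist()
print("Constant columns:", constant_cols)
print("All-unique columns:", all_unique_cols)
```

Applied to `teams1970`, both lists would be expected to come back empty if the conclusions in this subsection hold.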

In [42]:
#Print basic stats for attendance to address any concerns about this column being overly unique
pd.set_option('display.float_format', lambda x: '%0.2f' % x) # Suppress scientific notation
teams1970.attendance.describe()

count   1324.00   
mean    2049799.53
std     792309.99 
min     306763.00 
25%     1439223.75
50%     2001874.50
75%     2588625.00
max     4483350.00
Name: attendance, dtype: float64

#### Check for Columns with High Levels of Missing Data

Only four columns in the subset team data set have missing values, and none of them is missing in every row. The 28 missing values in each of "DivWin", "LgWin" and "WSWin" are due to the 1994 players' strike, which ended the season prematurely; no post-season games were played that year. The 640 missing values in "WCWin" arise because the wild card playoff format was not introduced until the 1995 season. These values are therefore legitimately missing rather than data-entry errors.

In [31]:
#Count missing values per column and show only columns with at least one
teamNullCols = teams1970.isnull().sum()
print(teamNullCols[teamNullCols > 0])

DivWin    28 
WCWin     640
LgWin     28 
WSWin     28 
dtype: int64
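The explanation for the missing values can be checked by listing the seasons in which each indicator is missing. The sketch below uses toy rows; the "Y"/"N" coding and all values are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Toy rows; "Y"/"N" coding and the values are assumptions for illustration
toy = pd.DataFrame({
    "yearID": [1993, 1993, 1994, 1994, 1995, 1995],
    "DivWin": ["Y", "N", np.nan, np.nan, "Y", "N"],
    "WCWin":  [np.nan, np.nan, np.nan, np.nan, "N", "Y"],
})

# For each indicator, list the seasons in which it is missing
missing_years = {
    col: sorted(toy.loc[toy[col].isnull(), "yearID"].unique().tolist())
    for col in ["DivWin", "WCWin"]
}
print(missing_years)
```

On the real subset, the strike explanation would predict that "DivWin", "LgWin" and "WSWin" are missing only in 1994, and that "WCWin" is missing only in seasons before 1995.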


#### Incorrect Values

{insert text regarding verification of statistics}