# Capstone Project: Vital Statistics of Professional Athletes

## Problem Statement:
### What impact does a career in professional sports have on the life expectancy of an athlete?

More specifically:
- Do different sports have different levels of impact on athletes' life expectancies?
- Are life expectancies further affected by the length of the playing career, the number of games played, or the positions they played?
- Can I build a model that predicts how many players will die in a given year, as well as the distribution of their ages when they died? Would this model be better than one based on vital statistics of the general population?

### Proposed Data Sources:

#### Athlete Statistics

- baseball-reference.com
- pro-football-reference.com
- hockey-reference.com

The football reference site has individual pages that list all the players who died in a given year. These pages also give the length of each career, the years played, and the total number of games played, along with each player's position and age at death. However, the football site lists coaches along with players.

The baseball reference site also has pages that list all the players who died in a given year. These pages list the length of careers and the number of games played, but not birth dates or ages. The site also has pages listing players born in a given year, but those do not list death dates. To compute age at death, I will need to combine all this information into one table.
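
A minimal sketch of how two such listings could be combined, assuming each listing carries some shared player identifier. The column names (`player_id`, `birth_date`, `death_date`) and the rows are illustrative placeholders, not the site's actual fields.

```python
import pandas as pd

# Illustrative stand-ins for a "born in year" listing and a "died in year" listing.
births = pd.DataFrame({
    "player_id": ["a01", "b02", "c03"],
    "birth_date": ["1902-05-14", "1913-02-08", "1950-07-01"],
})
deaths = pd.DataFrame({
    "player_id": ["a01", "b02"],
    "death_date": ["1970-02-21", "1970-02-15"],
    "games": [24, 34],
})

# An inner join keeps only players that appear in both listings.
merged = deaths.merge(births, on="player_id", how="inner")

# Approximate age in whole years: elapsed days divided by the average year length.
elapsed = pd.to_datetime(merged["death_date"]) - pd.to_datetime(merged["birth_date"])
merged["age"] = (elapsed.dt.days / 365.25).astype(int)

print(merged[["player_id", "birth_date", "death_date", "age"]])
```

Players present in only one listing drop out of an inner join, so a real run would want to check how many rows are lost.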

The hockey reference site lists players who died in a certain season, along with their ages, number of seasons and games played. However, the death dates of some players may be unknown.

There is also a basketball reference site, but it gives each player's death date only on that player's individual page. There are listings of players who were born in a given year, but the death dates aren't listed on those pages. Scraping for death dates might take considerably more work.

If time permits, I would like to collect data on additional sports such as soccer, tennis, golf, boxing, auto racing, or cycling.

#### General Statistics

One possible source for the age distribution of all deaths in the U.S. is
https://www.statista.com/statistics/241572/death-rate-by-age-and-sex-in-the-us/

### Collecting Player Data

In [4]:
# Import libraries
import numpy as np
import pandas as pd
import requests
import time

In [37]:
from datetime import datetime, timedelta

In [5]:
from bs4 import BeautifulSoup

#### NFL player data

In [19]:
# Collect all football players who died in a particular year
def football_players_who_died_in(year):
    url = "https://www.pro-football-reference.com/years/" + str(year) + "/deaths.htm"
    res = requests.get(url)

    soup = BeautifulSoup(res.content, 'lxml')
    table = soup.find('table')
    players = []
    for row in table.find_all('tr')[1:]:
        player = {}
        hidden = row.find('th', {'data-stat': 'player'})
        player['name'] = hidden.find('a').text
        player['position'] = row.find('td', {'data-stat': 'pos'}).text
        player['age'] = row.find('td', {'data-stat': 'age'}).text
        player['death date'] = row.find('td', {'data-stat': 'death_date_mod'}).text
        player['birth date'] = row.find('td', {'data-stat': 'birth_date_mod'}).text
        player['experience'] = row.find('td', {'data-stat': 'experience'}).text
        player['first year'] = row.find('td', {'data-stat': 'year_min'}).text    
        player['last year'] = row.find('td', {'data-stat': 'year_max'}).text
        player['games'] = row.find('td', {'data-stat': 'g'}).text
        player['all star'] = row.find('td', {'data-stat': 'pro_bowls'}).text
        players.append(player)
        
    return players

In [20]:
players = []
for year in range(1970, 2019):
    players.extend(football_players_who_died_in(year))
    time.sleep(1)

In [21]:
football_players = players

In [22]:
football_df = pd.DataFrame(football_players)
football_df.head()

Unnamed: 0,age,all star,birth date,death date,experience,first year,games,last year,name,position
0,67,0,5/14/1902,2/21/1970,3,1926,24,1928,Neely Allison,E
1,68,0,2/6/1902,11/30/1970,1,1925,1,1925,Paul Anderson,G
2,63,0,1/18/1907,3/9/1970,3,1931,23,1933,Corrie Artman,T
3,76,0,9/5/1893,6/7/1970,1,1926,1,1926,Ben Bangs,WB
4,57,0,2/8/1913,2/15/1970,3,1936,34,1938,Jeff Barrett,E


In [39]:
football_df.tail()

Unnamed: 0,age,all star,birth date,death date,experience,first year,games,last year,name,position
5584,65,0,11/7/1952,5/8/2018,4,1976,49,1980,Don Testerman,RB
5585,85,5,1/14/1932,1/6/2018,11,1955,133,1965,Frank Varrichione,T
5586,80,1,3/9/1938,8/27/2018,14,1964,194,1977,Bobby Walden,P
5587,79,0,3/2/1939,8/11/2018,1,1962,4,1962,Manch Wheeler,QB
5588,38,0,2/21/1980,8/1/2018,2,2004,18,2005,Taylor Whitley,G


In [26]:
football_df['age'].value_counts()

79     211
74     196
77     195
80     192
78     181
82     175
75     174
73     172
76     170
84     168
85     162
83     157
81     155
72     153
70     143
71     142
86     134
87     132
69     131
88     127
68     121
67     119
66     115
65     101
89      98
63      95
62      94
90      85
61      83
64      80
      ... 
37      22
39      21
96      21
36      19
95      19
38      17
40      17
26      16
28      15
33      14
27      14
29      13
25      13
35      12
24      12
34       9
31       9
97       8
32       8
98       7
99       6
23       6
30       5
         3
101      2
108      1
103      1
104      1
102      1
100      1
Name: age, Length: 84, dtype: int64

In [37]:
football_df[[len(x) > 2 for x in football_df['age']]]

Unnamed: 0,age,all star,birth date,death date,experience,first year,games,last year,name,position
2744,100,0,8/3/1896,5/26/1997,3.0,1921.0,22.0,1923.0,Ralph Horween,B
2745,108,0,9/4/1888,7/28/1997,2.0,1925.0,4.0,1926.0,Merle Hunter,T
2801,103,0,9/23/1893,5/24/1997,,,,,Babe Ruetz,Coach
2991,102,0,7/20/1897,10/29/1999,1.0,1920.0,3.0,1920.0,Tom Dickinson,E
4112,104,0,8/7/1903,10/29/2007,1.0,1928.0,5.0,1928.0,Sam Salemi,WB
4920,101,0,6/6/1912,10/11/2013,1.0,1938.0,6.0,1938.0,Johnny Kovatch,E
4951,101,0,5/17/1912,11/6/2013,7.0,1937.0,68.0,1946.0,Ace Parker,TB-DB-QB


#### NHL player data

In [139]:
# Collect all hockey players who died in a particular year
def hockey_players_who_died_in(year):
    # For 2005, I needed to save the HTML file locally and edit out a commented section
    if year == 2005:
        path = "./2004-05 NHL Deaths _ Hockey-Reference.com.html"
        with open(path) as f:
            soup = BeautifulSoup(f, 'lxml')
    else:
        url = "https://www.hockey-reference.com/leagues/NHL_" + str(year) + "_deaths.html"
        res = requests.get(url)
        soup = BeautifulSoup(res.content, 'lxml')
        
    table = soup.find('table', {'id': 'deaths'})
    
    players = []
    for row in table.find_all('tr'):
        #print(row)
        player = {}
        hidden = row.find('td', {'data-stat': 'name_full'})
        if hidden is None:
            continue
        player['name'] = hidden.find('a').text
        player['position'] = row.find('td', {'data-stat': 'pos'}).text
        player['age'] = row.find('td', {'data-stat': 'age_at_death'}).text
        player['death date'] = row.find('td', {'data-stat': 'death_date'}).text
        player['birth date'] = row.find('td', {'data-stat': 'birth_date'}).text
        player['first year'] = row.find('td', {'data-stat': 'year_min'}).text    
        player['last year'] = row.find('td', {'data-stat': 'year_max'}).text
        player['games'] = row.find('td', {'data-stat': 'games_played'}).text
        players.append(player)
        
    return players

In [142]:
hockey_players = []
for year in range(1970, 2019):
    hockey_players.extend(hockey_players_who_died_in(year))
    print(year)
    time.sleep(0.25)

1970
1971
1972
1973
1974
1975
1976
1977
1978
1979
1980
1981
1982
1983
1984
1985
1986
1987
1988
1989
1990
1991
1992
1993
1994
1995
1996
1997
1998
1999
2000
2001
2002
2003
2004
2005
2006
2007
2008
2009
2010
2011
2012
2013
2014
2015
2016
2017
2018


In [143]:
hockey_df = pd.DataFrame(hockey_players)
hockey_df.head()

Unnamed: 0,age,birth date,death date,first year,games,last year,name,position
0,62,1907-09-03,1969-09-14,1930,85,1940,Cliff Barton,RW
1,50,1919-04-13,1970-02-08,1941,215,1947,Paul Bibeault,G
2,71,1899-02-02,1970-07-05,1933,3,1933,Bob Davis,RW
3,61,1908-04-26,1969-11-13,1931,453,1940,Cecil Dillon,RW
4,73,1897-07-22,1970-08-09,1924,1,1924,Charles Fraser,D


In [144]:
hockey_df['position'].value_counts()

D       292
LW      198
RW      190
C       182
G        92
C/RW     16
C/LW     16
LW/C     15
LW/D     14
RW/D     12
W        11
D/LW      8
C/D       6
RW/C      6
D/RW      4
F         2
D/C       1
D/W       1
Name: position, dtype: int64
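
Many of the positions above are combinations such as `C/RW`. If positions are later compared, it may help to collapse them into a few groups. The sketch below assumes the first listed position is the primary one, which the source does not confirm, and the group labels are my own.

```python
import pandas as pd

# A few position strings like those in the value counts above.
positions = pd.Series(["D", "C/RW", "LW/D", "G", "W"])

# Take the first listed position as the primary one (an assumption).
primary = positions.str.split("/").str[0]

# Defense and goalie get their own groups; centers and wingers are forwards.
groups = primary.map({"D": "Defense", "G": "Goalie"}).fillna("Forward")

print(pd.DataFrame({"position": positions, "primary": primary, "group": groups}))
```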

In [145]:
hockey_df.tail()

Unnamed: 0,age,birth date,death date,first year,games,last year,name,position
1061,60,1957-11-29,2018-05-24,1983,62,1983,Bob Sullivan,LW
1062,64,1954-06-10,2018-08-17,1976,71,1978,Kurt Walker,D
1063,97,1920-11-03,2018-01-18,1950,14,1950,Chick Webster,C
1064,71,1946-02-18,2018-01-02,1969,52,1971,Jim Wiste,C
1065,49,1968-04-22,2017-12-10,1988,637,2000,Zarley Zalapski,D


In [146]:
hockey_df.to_csv('./hockey.csv')

In [108]:
football_df.to_csv('./football.csv')

#### Major League Baseball player data

In [42]:
def age(birth_string, death_string, format_string):
    # Approximate age in whole years: elapsed days divided by the average year length
    delta = datetime.strptime(death_string, format_string) - datetime.strptime(birth_string, format_string)
    return int(delta.days / 365.25)

In [44]:
age(died_2018[0]['birth date'], died_2018[0]['death date'], '%Y-%m-%d')

83
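
Dividing elapsed days by 365.25 can be off by one on or near a birthday. For example, 2001-01-01 to 2002-01-01 is 365 days, which truncates to 0 years. A possible alternative, sketched here under the hypothetical name `exact_age`, compares calendar fields directly.

```python
from datetime import datetime


def exact_age(birth_string, death_string, format_string="%Y-%m-%d"):
    """Return age in completed calendar years at the date of death."""
    birth = datetime.strptime(birth_string, format_string)
    death = datetime.strptime(death_string, format_string)
    years = death.year - birth.year
    # Subtract one if the birthday had not yet occurred in the year of death.
    if (death.month, death.day) < (birth.month, birth.day):
        years -= 1
    return years


print(exact_age("1934-09-21", "1998-09-09"))
print(exact_age("5/14/1902", "2/21/1970", "%m/%d/%Y"))
print(exact_age("2001-01-01", "2002-01-01"))
```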

In [55]:
def get_vital_baseball_stats(link, player):
    res = requests.get(link)
    
    soup = BeautifulSoup(res.content, 'lxml')
    birth_item = soup.find('span', {'id': 'necro-birth'})
    if birth_item is not None:
        player['birth date'] = birth_item.attrs['data-birth']
    death_item = soup.find('span', {'id': 'necro-death'})
    if death_item is not None:
        player['death date'] = death_item.attrs['data-death']
    if birth_item is not None and death_item is not None:
        player['age'] = age(player['birth date'], player['death date'], "%Y-%m-%d")

In [61]:
# Collect all baseball players who died in a particular year
def baseball_players_who_died_in(year):
    url = "https://www.baseball-reference.com/leagues/MLB/" + str(year) + "-deaths.shtml"
    res = requests.get(url)

    soup = BeautifulSoup(res.content, 'lxml')
    table = soup.find('table', {'id': 'misc_batting'})
    players = []
    for row in table.find_all('tr')[1:]:
        player = {}
        hidden = row.find('td', {'data-stat': 'player'})
        item = hidden.find('a')
        player['name'] = item.text
        #print(player['name'])
        link = item.attrs['href']
        full_link = "https://baseball-reference.com" + link
        get_vital_baseball_stats(full_link, player)
        #player['position'] = row.find('td', {'data-stat': 'pos'}).text
        #player['age'] = row.find('td', {'data-stat': 'age'}).text
        #player['death date'] = row.find('td', {'data-stat': 'deathdate'}).attrs['csk']
        #player['birth date'] = row.find('td', {'data-stat': 'birth_date_mod'}).text
        player['experience'] = row.find('td', {'data-stat': 'experience'}).text
        player['first year'] = row.find('td', {'data-stat': 'year_min'}).text    
        player['last year'] = row.find('td', {'data-stat': 'year_max'}).text
        player['games'] = row.find('td', {'data-stat': 'G'}).text
        player['all star'] = row.find('td', {'data-stat': 'allstar_games'}).text
        players.append(player)
        
    return players

In [48]:
baseball_players = []
for year in range(1970,2019):
    baseball_players.extend(baseball_players_who_died_in(year))
    print(year)

baseball_df = pd.DataFrame(baseball_players)

1970
1971
1972
1973
1974
1975
1976
1977
1978
1979
1980
1981
1982
1983
1984
1985
1986
1987
1988
1989
1990
1991
1992
1993
1994
1995
1996
1997
1998


AttributeError: 'NoneType' object has no attribute 'attrs'

In [50]:
baseball_players[-1]

{'name': 'Jerry Zimmerman',
 'birth date': '1934-09-21',
 'death date': '1998-09-09',
 'age': 63,
 'experience': '8',
 'first year': '1961',
 'last year': '1968',
 'games': '483',
 'all star': '0'}

In [51]:
baseball_df = pd.DataFrame(baseball_players)
baseball_df.to_csv('./baseball_players.csv')

In [57]:
died_1999 = baseball_players_who_died_in(1999)

Joe Adcock
Tex Aulds
Gene Baker
Larry Bearnarth
Allen Benson
Dick Bertell
Clay Bryant
Mike Budnick
Paul Burris
Paul Calvert
Ed Cole
Woody Davis
Joe DiMaggio
Len Dondero
Dutch Dotterer
Jim Dunn
Jim Dyck
Arnold Earley
Charlie English
Dick Errickson
Dee Fondy
Bob Garber
Greek George
Oscar Georgy
Johnny Gerlach
George Gill
Ray Goolsby
Paul Gregory
Bert Haas
Doug Hansen
Jay Heard
Wally Hebert
Randy Heflin
Clarence Heise
Tom Herrin
Earl Huckleberry
Catfish Hunter
Warren Huston
Ike Kahdot
Ray Katt
Eddie Kazak
Harry Kimberlin
Whitey Kurowski
Tim Layana
Bill Lohrman
Larry Loughlin
Gene Markland
Doc Marshall
Hector Martinez
Pat McLaughlin
Pete Milne
Vinegar Bend Mizell
Pat Mullin
Bob Patrick
Bill Peterman
Boots Poffenberger
Dave Pope
Carl Powis
Pee Wee Reese
Ken Robinson
Buck Rogers
Cliff Ross
Joe Rossi
Al Schroll
Bernie Snyder
Eddie Stanky
Carl Sumner
Roy Talcott
Ben Taylor
Birdie Tebbetts
Faye Throneberry
Paul Toth
Earl Turner
Harry Walker
Ace Williams
Johnnie Wittig
Whit Wyatt
Early Wynn
Norm

In [62]:
baseball_players.extend(died_1999)

In [63]:
for year in range(2000,2019):
    baseball_players.extend(baseball_players_who_died_in(year))
    print(year)

baseball_df = pd.DataFrame(baseball_players)

2000
2001
2002
2003
2004
2005
2006
2007
2008
2009
2010
2011
2012


SSLError: HTTPSConnectionPool(host='www.baseball-reference.com', port=443): Max retries exceeded with url: /players/m/mussiba01.shtml (Caused by SSLError(SSLError("bad handshake: SysCallError(10054, 'WSAECONNRESET')",),))
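
Rather than restarting the loop by hand after a dropped connection, the request could be wrapped in a retry helper. This is a minimal sketch, not what the notebook ran: `with_retries` and its parameters are hypothetical names, and the demo uses a stand-in function so it needs no network.

```python
import time


def with_retries(fetch, attempts=3, base_delay=1.0, retry_on=(Exception,), sleep=time.sleep):
    """Call fetch() until it succeeds or attempts run out, doubling the wait each time."""
    for attempt in range(attempts):
        try:
            return fetch()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: let the caller see the last error
            sleep(base_delay * 2 ** attempt)  # wait 1x, 2x, 4x, ... base_delay


# Demo with a stand-in function that fails twice before succeeding.
calls = {"count": 0}


def flaky_fetch():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("simulated connection reset")
    return "page contents"


waits = []  # record the delays instead of actually sleeping
result = with_retries(flaky_fetch, retry_on=(ConnectionError,), sleep=waits.append)
print(result, waits)
```

With `requests`, the call might look like `with_retries(lambda: requests.get(url, timeout=30), retry_on=(requests.exceptions.RequestException,))`. Adding a short `time.sleep` between players, as the football loop does between years, may also reduce the chance of the server resetting the connection.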

In [65]:
for year in range(2013, 2019):
    baseball_players.extend(baseball_players_who_died_in(year))
    print(year)

baseball_df = pd.DataFrame(baseball_players)

2013
2014
2015
2016
2017
2018


In [67]:
baseball_df.tail()

Unnamed: 0,age,all star,birth date,death date,experience,first year,games,last year,name
4481,76.0,0,1942-04-18,2018-06-05,8,1969,305,1976,Chuck Taylor
4482,91.0,0,1926-09-28,2018-08-18,2,1954,11,1955,Ozzie Van Brabant
4483,61.0,0,1957-08-01,2018-08-04,1,1978,7,1978,Myron White
4484,84.0,0,1933-09-26,2018-05-05,1,1956,1,1956,Roy Wright
4485,89.0,0,1928-06-03,2018-01-07,2,1951,20,1952,Dick Young


In [68]:
baseball_df.to_csv('./baseball.csv')

#### NBA player data

In [80]:
# If player is still alive, return False.
# Otherwise, get the NBA player's date of birth and death, and calculate age. Return True.

def get_vital_basketball_stats(link, player):
    #print(player['name'])
    res = requests.get(link)
    
    soup = BeautifulSoup(res.content, 'lxml')
    death_item = soup.find('span', {'id': 'necro-death'})
    if death_item is None:
        return False
    player['death date'] = death_item.attrs['data-death']
    birth_item = soup.find('span', {'id': 'necro-birth'})
    if birth_item is not None:
        player['birth date'] = birth_item.attrs['data-birth']
        try:
            player['age'] = age(player['birth date'], player['death date'], "%Y-%m-%d")
        except ValueError:
            # Partial dates such as '1967-01-00' cannot be parsed; leave age unset
            pass
    return True

In [76]:
# Collect all basketball players who were born in a particular year
def basketball_players_born_in(year):
    url = "https://www.basketball-reference.com/friv/birthyears.fcgi?year=" + str(year)
    res = requests.get(url)

    soup = BeautifulSoup(res.content, 'lxml')
    table = soup.find('table', {'id': 'stats'})
    if table is None:
        return []
    players = []
    for row in table.find_all('tr'):
        player = {}
        hidden = row.find('td', {'data-stat': 'player'})
        if hidden is None:
            continue
        item = hidden.find('a')
        player['name'] = item.text
        #print(player['name'])
        link = item.attrs['href']
        full_link = "https://basketball-reference.com" + link
        get_vital_basketball_stats(full_link, player)
        player['experience'] = row.find('td', {'data-stat': 'years'}).text
        #player['first year'] = row.find('td', {'data-stat': 'year_min'}).text    
        #player['last year'] = row.find('td', {'data-stat': 'year_max'}).text
        player['games'] = row.find('td', {'data-stat': 'g'}).text
        players.append(player)
        
    return players

In [81]:
basketball_players = []
for year in range(1902, 2000):
    basketball_players.extend(basketball_players_born_in(year))
    print(year)

1902
1903
1904
1905
1906
1907
1908
1909
1910
1911
1912
1913
1914
1915
1916
1917
1918
1919
1920
1921
1922
1923
1924
1925
1926
1927
1928
1929
1930
1931
1932
1933
1934
1935
1936
1937
1938
1939
1940
1941
1942
1943
1944
1945
1946
1947
1948
1949
1950
1951
1952
1953
1954
1955
1956
1957
1958
1959
1960
1961
1962
1963
1964
1965
1966
1967
1968
1969
1970
1971
1972
1973
1974
1975
1976
1977
1978
1979
1980
1981
1982
1983
1984
1985
1986
1987
1988
1989
1990
1991
1992
1993
1994
1995
1996
1997
1998
1999


In [82]:
basketball_df = pd.DataFrame(basketball_players)
basketball_df.to_csv('./basketball.csv')
basketball_df.tail()

Unnamed: 0,age,birth date,death date,experience,games,name
4573,,,,1,61,Terrance Ferguson
4574,,,,1,14,Markelle Fultz
4575,,,,1,63,Malik Monk
4576,,,,1,78,Frank Ntilikina
4577,,,,1,80,Jayson Tatum


In [83]:
basketball_df.dropna(axis=0, inplace=True)

In [84]:
basketball_df.tail()

Unnamed: 0,age,birth date,death date,experience,games,name
3545,25.0,1982-05-30,2007-08-17,5,303,Eddie Griffin
3561,35.0,1982-01-16,2017-10-20,3,136,Justin Reed
4122,27.0,1990-07-15,2018-07-06,2,24,Tyler Honeycutt
4137,26.0,1990-06-20,2017-02-11,1,6,Fab Melo
4269,23.0,1992-08-21,2016-05-28,1,14,Bryce Dejean-Jones


In [85]:
all_basketball_df = pd.read_csv('./all_basketball.csv')

In [96]:
deceased_basketball_df = all_basketball_df[all_basketball_df['death date'].notnull()]

In [97]:
deceased_basketball_df.head()

Unnamed: 0.1,Unnamed: 0,age,birth date,death date,experience,games,name
0,0,77.0,1902-01-30,1979-09-16,1,2,Nat Hickey
1,1,,1913-09-04,1967-01-00,1,30,Joe Fabel
2,2,92.0,1913-11-03,2006-03-14,1,6,Nat Frankel
3,3,99.0,1913-09-06,2013-03-25,1,2,Ben Goldfaden
4,4,74.0,1913-12-03,1988-03-21,1,23,Charley Shipp


In [98]:
deceased_basketball_df.to_csv('./basketball.csv')

In [99]:
deceased_basketball_df[deceased_basketball_df['age'].isnull()]

Unnamed: 0.1,Unnamed: 0,age,birth date,death date,experience,games,name
1,1,,1913-09-04,1967-01-00,1,30,Joe Fabel
8,8,,1915-09-07,1979-05-00,1,50,Wilbert Kautz
14,14,,1916-11-02,1988-08-00,1,51,Chet Carlisle
39,39,,1918-11-22,1970--00,2,20,Elmer Gainer
57,57,,1919-09-24,1974-10-00,1,21,Woody Grimshaw
64,64,,,1973--00,1,19,Howie McCarty
129,129,,1921-07-31,1968-12-00,1,13,Mel Hirsch
237,237,,1923-10-02,1980-09-00,1,54,Harold Brown
524,524,,1930-02-17,1966-01-00,1,11,Gene Dyker
549,549,,1930-08-29,1990-08-00,1,3,Skippy Whitaker


In [100]:
deceased_basketball_df.sort_values('death date', ascending=False).head(20)

Unnamed: 0.1,Unnamed: 0,age,birth date,death date,experience,games,name
784,784,78.0,1939-09-29,2018-09-27,8,548,Art Williams
831,831,77.0,1941-01-31,2018-07-12,10,670,Len Chappell
576,576,86.0,1931-07-13,2018-07-08,9,623,Frank Ramsey
1807,1807,62.0,1955-10-19,2018-07-08,10,673,Lonnie Shelton
4122,4122,27.0,1990-07-15,2018-07-06,2,24,Tyler Honeycutt
859,859,75.0,1942-07-28,2018-05-14,1,69,Howard Bayne
685,685,81.0,1936-06-26,2018-04-14,15,1122,Hal Greer
3474,3474,36.0,1981-05-29,2018-04-02,3,73,Alton Ford
434,434,90.0,1927-09-02,2018-03-10,1,65,Gene Rhodes
3339,3339,38.0,1979-05-23,2018-01-31,13,809,Rasual Butler


In [111]:
deceased_basketball_df[deceased_basketball_df['age'] < 40].sort_values('death date', ascending=False)

Unnamed: 0.1,Unnamed: 0,age,birth date,death date,experience,games,name
4122,4122,27.0,1990-07-15,2018-07-06,2,24,Tyler Honeycutt
3474,3474,36.0,1981-05-29,2018-04-02,3,73,Alton Ford
3339,3339,38.0,1979-05-23,2018-01-31,13,809,Rasual Butler
3561,3561,35.0,1982-01-16,2017-10-20,3,136,Justin Reed
4137,4137,26.0,1990-06-20,2017-02-11,1,6,Fab Melo
4269,4269,23.0,1992-08-21,2016-05-28,1,14,Bryce Dejean-Jones
3511,3511,34.0,1981-06-06,2015-06-29,2,87,Jackson Vroman
3242,3242,37.0,1977-02-01,2014-10-10,2,25,Lari Ketner
3271,3271,34.0,1977-02-01,2011-05-11,7,438,Robert Traylor
3158,3158,34.0,1975-11-04,2010-07-29,13,778,Lorenzen Wright
