# Capstone Project: Vital Statistics of Professional Athletes

## Problem Statement:
### What impact does a career in professional sports have on the life expectancy of an athlete?

More specifically:
- Do different sports have different levels of impact on athletes' life expectancies?
- Are life expectancies further affected by the length of the playing career, the number of games played, or the positions they played?
- Can I build a model that predicts how many players will die in a given year, as well as the distribution of their ages when they died? Would this model be better than one based on vital statistics of the general population?

### Proposed Data Sources:

#### Athlete Statistics

- baseball-reference.com
- pro-football-reference.com
- hockey-reference.com

The football reference site has individual pages that list all the players who died in a given year, along with the length of their careers, when they played, and the total number of games played. It also lists the players' positions and how old they were when they died. However, these pages list coaches along with players.

The baseball reference site also has pages that list all the players who died in a given year. While these pages list the length of careers and the number of games played, they do not list birth dates or ages. The site also has pages that list players born in a given year, but those pages do not list death dates. To compute ages, I will need to combine all this information into one table.
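
A minimal sketch of that combining step, assuming the births and deaths listings have each been loaded into a pandas DataFrame and that both share a `link` column identifying the player. The two sample rows are hypothetical placeholders, not scraped data.

```python
import pandas as pd

# Hypothetical placeholder rows; real tables would come from the scraped listings.
births = pd.DataFrame({
    "link": ["/players/a/example01.shtml", "/players/b/example02.shtml"],
    "birth date": ["1900-05-10", "1910-03-01"],
})
deaths = pd.DataFrame({
    "link": ["/players/a/example01.shtml"],
    "death date": ["1980-05-09"],
})

# Keep only players found in both tables (deceased players with a known birth date).
merged = births.merge(deaths, on="link", how="inner")

born = pd.to_datetime(merged["birth date"])
died = pd.to_datetime(merged["death date"])

# Age in completed years: subtract one if death came before that year's birthday.
before_birthday = (died.dt.month < born.dt.month) | (
    (died.dt.month == born.dt.month) & (died.dt.day < born.dt.day)
)
merged["age"] = died.dt.year - born.dt.year - before_birthday.astype(int)
```

Merging on the player link rather than the name avoids collisions between players who share a name.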

The hockey reference site lists players who died in a certain season, along with their ages, number of seasons and games played. However, the death dates of some players may be unknown.

There is also a basketball reference site, but it shows each player's death date only on his individual page. It has listings of players born in a given year, but death dates aren't included on those pages. Scraping for death dates might therefore take considerably more work.

If there is time, I would like to see if I could collect data on additional sports such as soccer, tennis, golf, boxing, auto racing, or cycling.

#### General Statistics

One possible source for the age distribution of all deaths in the U.S. is
https://www.statista.com/statistics/241572/death-rate-by-age-and-sex-in-the-us/

## Collecting Player Data

In [1]:
# Import libraries
import numpy as np
import pandas as pd
import requests
import time

from datetime import datetime, timedelta

from bs4 import BeautifulSoup

### NFL player data

#### Gather all football players who died between 1923 and 2018

In [25]:
# Collect all football players who died in a particular year
def football_players_who_died_in(year):
    url = "https://www.pro-football-reference.com/years/" + str(year) + "/deaths.htm"
    res = requests.get(url)
    if res.status_code != 200:
        return []
    
    soup = BeautifulSoup(res.content, 'lxml')
    table = soup.find('table')
    players = []
    for row in table.find_all('tr')[1:]:
        player = {}
        hidden = row.find('th', {'data-stat': 'player'})
        if hidden is None:
            break
        player['name'] = hidden.find('a').text
        player['position'] = row.find('td', {'data-stat': 'pos'}).text
        player['age'] = row.find('td', {'data-stat': 'age'}).text
        player['death year'] = year
        player['death date'] = row.find('td', {'data-stat': 'death_date_mod'}).text
        player['birth date'] = row.find('td', {'data-stat': 'birth_date_mod'}).text
        player['experience'] = row.find('td', {'data-stat': 'experience'}).text
        player['first year'] = row.find('td', {'data-stat': 'year_min'}).text    
        player['last year'] = row.find('td', {'data-stat': 'year_max'}).text
        player['games'] = row.find('td', {'data-stat': 'g'}).text
        player['all star'] = row.find('td', {'data-stat': 'pro_bowls'}).text
        player['link'] = hidden.find('a').attrs['href']
        players.append(player)
        
    return players

In [26]:
deceased_football_players = []
for year in range(1923, 2019):
    deceased_football_players.extend(football_players_who_died_in(year))
    print(year)
    #time.sleep(.1)

1923
1924
1925
1926
1927
1928
1929
1930
1931
1932
1933
1934
1935
1936
1937
1938
1939
1940
1941
1942
1943
1944
1945
1946
1947
1948
1949
1950
1951
1952
1953
1954
1955
1956
1957
1958
1959
1960
1961
1962
1963
1964
1965
1966
1967
1968
1969
1970
1971
1972
1973
1974
1975
1976
1977
1978
1979
1980
1981
1982
1983
1984
1985
1986
1987
1988
1989
1990
1991
1992
1993
1994
1995
1996
1997
1998
1999
2000
2001
2002
2003
2004
2005
2006
2007
2008
2009
2010
2011
2012
2013
2014
2015
2016
2017
2018
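
The loop above leaves its `time.sleep` call commented out. A minimal sketch of a request helper that pauses after every request and retries on a non-200 status follows. The name `fetch_politely`, the one-second default delay, and the retry count are illustrative assumptions, not values taken from the site's policies.

```python
import time


def fetch_politely(url, get=None, delay=1.0, retries=3):
    """Return the response body for `url`, or None if every attempt fails.

    `get` is any callable shaped like `requests.get`; it is injectable so the
    helper can be exercised without network access.
    """
    if get is None:
        import requests  # imported lazily; only needed for the default getter
        get = requests.get

    for _ in range(retries):
        res = get(url)
        # Pause after every request, successful or not, to limit the request rate.
        time.sleep(delay)
        if res.status_code == 200:
            return res.content
    return None
```

A scraper such as `football_players_who_died_in` could pass the returned bytes to `BeautifulSoup` and treat `None` as "no data for this year".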


In [27]:
# Create DataFrame
football_df = pd.DataFrame(deceased_football_players)
football_df.head()

Unnamed: 0,age,all star,birth date,death date,death year,experience,first year,games,last year,link,name,position
0,31.0,0,3/30/1892,6/6/1923,1923,2,1920,11,1921,/players/L/LaRoPa20.htm,Paul LaRosa,E
1,28.0,0,12/15/1895,9/9/1924,1924,1,1920,2,1920,/players/G/GepfSi20.htm,Sid Gepford,HB
2,23.0,0,9/28/1902,11/25/1925,1925,1,1925,1,1925,/players/H/HammCh20.htm,Ching Hammill,BB
3,34.0,0,8/7/1890,6/7/1925,1925,2,1921,9,1922,/players/W/WaldRa20.htm,Ralph Waldsmith,C-G
4,,0,,11/12/1926,1926,3,1924,22,1926,/players/F/FeisLo20.htm,Lou Feist,E-T-FB


Out of curiosity, which (recent) players lived to 100 or older? (The death listings include coaches, so one coach appears in the result.)

In [24]:
football_df[[len(x) > 2 for x in football_df['age']]]

Unnamed: 0,age,all star,birth date,death date,experience,first year,games,last year,link,name,position
3931,100,0,8/3/1896,5/26/1997,3.0,1921.0,22.0,1923.0,/players/H/HorwRa20.htm,Ralph Horween,B
3932,108,0,9/4/1888,7/28/1997,2.0,1925.0,4.0,1926.0,/players/H/HuntMe20.htm,Merle Hunter,T
3988,103,0,9/23/1893,5/24/1997,,,,,/coaches/RuetBa0.htm,Babe Ruetz,Coach
4178,102,0,7/20/1897,10/29/1999,1.0,1920.0,3.0,1920.0,/players/D/DickTo20.htm,Tom Dickinson,E
5299,104,0,8/7/1903,10/29/2007,1.0,1928.0,5.0,1928.0,/players/S/SaleSa20.htm,Sam Salemi,WB
6107,101,0,6/6/1912,10/11/2013,1.0,1938.0,6.0,1938.0,/players/K/KovaJo21.htm,Johnny Kovatch,E
6138,101,0,5/17/1912,11/6/2013,7.0,1937.0,68.0,1946.0,/players/P/ParkAc20.htm,Ace Parker,TB-DB-QB
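
The filter above works because the scraped ages are strings, so any age with three characters is 100 or more. A sketch of the same filter using a numeric conversion, on a few names and ages copied from the outputs above and written as strings the way the scraper collects them:

```python
import pandas as pd

# A few rows copied from the outputs above; ages are strings as scraped.
sample = pd.DataFrame({
    "name": ["Paul LaRosa", "Lou Feist", "Ralph Horween", "Merle Hunter"],
    "age": ["31", "", "100", "108"],
})

# errors="coerce" turns blank or malformed ages into NaN instead of raising.
age_years = pd.to_numeric(sample["age"], errors="coerce")

# NaN compares as False, so players with unknown ages are excluded.
centenarians = sample[age_years >= 100]
```

The numeric version keeps working if the ages are later stored as numbers, for example after a round trip through a CSV file.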


#### Collect all players born since 1900

In [33]:
# Collect all football players who were born in a particular year
def football_players_born_in(year):
    url = "https://www.pro-football-reference.com/years/" + str(year) + "/births.htm"
    res = requests.get(url)
    
    soup = BeautifulSoup(res.content, 'lxml')
    table = soup.find('table')
    players = []
    for row in table.find_all('tr')[1:]:
        player = {}
        hidden = row.find('th', {'data-stat': 'player'})
        if hidden is None:
            break
        player['name'] = hidden.find('a').text
        player['position'] = row.find('td', {'data-stat': 'pos'}).text
        player['birth year'] = year
        player['birth date'] = row.find('td', {'data-stat': 'birth_date_mod'}).text
        player['experience'] = row.find('td', {'data-stat': 'experience'}).text
        player['first year'] = row.find('td', {'data-stat': 'year_min'}).text    
        player['last year'] = row.find('td', {'data-stat': 'year_max'}).text
        player['games'] = row.find('td', {'data-stat': 'g'}).text
        player['all star'] = row.find('td', {'data-stat': 'pro_bowls'}).text
        player['link'] = hidden.find('a').attrs['href']
        players.append(player)
        
    return players

In [36]:
# Collect all football players
all_football_players = []
for year in range(1900, 1999):
    all_football_players.extend(football_players_born_in(year))
    print(year)
    #time.sleep(.1)

1900
1901
1902
1903
1904
1905
1906
1907
1908
1909
1910
1911
1912
1913
1914
1915
1916
1917
1918
1919
1920
1921
1922
1923
1924
1925
1926
1927
1928
1929
1930
1931
1932
1933
1934
1935
1936
1937
1938
1939
1940
1941
1942
1943
1944
1945
1946
1947
1948
1949
1950
1951
1952
1953
1954
1955
1956
1957
1958
1959
1960
1961
1962
1963
1964
1965
1966
1967
1968
1969
1970
1971
1972
1973
1974
1975
1976
1977
1978
1979
1980
1981
1982
1983
1984
1985
1986
1987
1988
1989
1990
1991
1992
1993
1994
1995
1996
1997
1998


In [39]:
all_football_players_df = pd.DataFrame(all_football_players)
all_football_players_df.head()

Unnamed: 0,all star,birth date,birth year,experience,first year,games,last year,link,name,position
0,0,11/13/1900,1900,4,1922,43,1925,/players/A/AndeEd20.htm,Eddie Anderson,E
1,0,7/12/1900,1900,1,1926,2,1926,/players/A/AshxJu20.htm,Juddy Ash,G
2,0,7/10/1900,1900,2,1924,3,1925,/players/A/AultCh20.htm,Chalmer Ault,T
3,0,12/9/1900,1900,1,1926,8,1926,/players/B/BabcSa20.htm,Sam Babcock,WB-FB-BB
4,0,11/4/1900,1900,5,1927,50,1931,/players/B/BakeBu20.htm,Bullet Baker,BB-WB-TB-HB


Save DataFrames to CSV files

In [40]:
all_football_players_df.to_csv('./all_football.csv')

In [41]:
football_df.to_csv('./deceased_football.csv')

#### Additional Data Gathering: Heights and Weights of Players

In [12]:
# Get the height and weight of a player from his individual page
def get_ht_wt(link, player):
    res = requests.get(link)
    
    soup = BeautifulSoup(res.content, 'lxml')
    player['height'] = soup.find('span', {'itemprop': 'height'}).text
    player['weight'] = soup.find('span', {'itemprop': 'weight'}).text
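
The height and weight are stored as raw text. A sketch of converting them to numbers for later analysis, assuming the site formats height as feet-inches (for example `6-2`) and weight as pounds with a unit suffix (for example `210lb`). Both formats are assumptions about the page and should be checked against the scraped values.

```python
import re


def height_to_inches(text):
    """Convert a 'feet-inches' string such as '6-2' to total inches, else None."""
    match = re.fullmatch(r"\s*(\d+)-(\d+)\s*", text or "")
    if match is None:
        return None
    feet, inches = map(int, match.groups())
    return feet * 12 + inches


def weight_to_pounds(text):
    """Extract the leading integer from a string such as '210lb', else None."""
    match = re.match(r"\s*(\d+)", text or "")
    return int(match.group(1)) if match else None
```

Returning `None` for blank or unexpected text lets a missing value become `NaN` in the resulting DataFrame rather than stopping the scrape.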

In [13]:
# This file was created later in the NFL analysis notebook
nfl_merged_df = pd.read_csv('./data/nfl_merged.csv')

In [15]:
# Gather heights and weights of players born between 1930 and 1945
football_players = []
for year in range(1930, 1946):
    print(year)
    links = nfl_merged_df[nfl_merged_df['birth year'] == year]['link']

    count = 0
    for link in links:
        full_link = "https://pro-football-reference.com" + link
        player = {}
        player['link'] = link
        get_ht_wt(full_link, player)
        football_players.append(player)
        count +=1
        if count % 50 == 0:
            print(count, "players scraped")

1930
50 players scraped
100 players scraped
1931
50 players scraped
100 players scraped
1932
50 players scraped
100 players scraped
1933
50 players scraped
1934
50 players scraped
100 players scraped
1935
50 players scraped
100 players scraped
1936
50 players scraped
100 players scraped
150 players scraped
1937
50 players scraped
100 players scraped
150 players scraped
1938
50 players scraped
100 players scraped
150 players scraped
200 players scraped
1939
50 players scraped
100 players scraped
150 players scraped
1940
50 players scraped
100 players scraped
150 players scraped
1941
50 players scraped
100 players scraped
150 players scraped
200 players scraped
1942
50 players scraped
100 players scraped
150 players scraped
200 players scraped
1943
50 players scraped
100 players scraped
150 players scraped
200 players scraped
1944
50 players scraped
100 players scraped
150 players scraped
200 players scraped
250 players scraped
1945
50 players scraped
100 players scraped
150 players scra

In [18]:
# Save height-weight dataframe
pd.DataFrame(football_players).to_csv('./data/nfl_ht_wt.csv')

### NHL Player data

#### Collect data on deceased players

In [51]:
# Collect all hockey players who died in a particular year
def hockey_players_who_died_in(year):
    # For 2005, I needed to save the HTML file locally and edit out a commented section
    if year == 2005:
        with open("./2004-05 NHL Deaths _ Hockey-Reference.com.html") as f:
            soup = BeautifulSoup(f, 'lxml')
    else:
        url = "https://www.hockey-reference.com/leagues/NHL_" + str(year) + "_deaths.html"
        res = requests.get(url)
        soup = BeautifulSoup(res.content, 'lxml')
        
    table = soup.find('table', {'id': 'deaths'})
    if table is None:
        return []
    
    players = []
    for row in table.find_all('tr'):
        #print(row)
        player = {}
        hidden = row.find('td', {'data-stat': 'name_full'})
        if hidden is None:
            continue
        player['name'] = hidden.find('a').text
        player['link'] = hidden.find('a').attrs['href']
        player['position'] = row.find('td', {'data-stat': 'pos'}).text
        player['age'] = row.find('td', {'data-stat': 'age_at_death'}).text
        player['death date'] = row.find('td', {'data-stat': 'death_date'}).text
        player['birth date'] = row.find('td', {'data-stat': 'birth_date'}).text
        player['first year'] = row.find('td', {'data-stat': 'year_min'}).text    
        player['last year'] = row.find('td', {'data-stat': 'year_max'}).text
        player['games'] = row.find('td', {'data-stat': 'games_played'}).text
        player['death season'] = year
        players.append(player)
        
    return players
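
The notebook does not say what the commented section in the 2005 page contained. If the problem was that the deaths table itself sat inside an HTML comment (an assumption), one alternative to editing the file by hand is to search the comments as well. The helper name `find_table_including_comments` is hypothetical.

```python
from bs4 import BeautifulSoup, Comment


def find_table_including_comments(html, table_id):
    """Return the <table> with the given id, also searching inside HTML comments."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", {"id": table_id})
    if table is not None:
        return table

    # Comments are stored as text nodes; re-parse each one as its own document.
    for comment in soup.find_all(string=lambda text: isinstance(text, Comment)):
        inner = BeautifulSoup(comment, "html.parser")
        table = inner.find("table", {"id": table_id})
        if table is not None:
            return table
    return None
```

This sketch uses the built-in `html.parser` so it has no extra dependency; the notebook's `lxml` parser could be substituted.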

In [46]:
# Collect all hockey players who were born in a particular year
def hockey_players_born_in(year):
    url = "https://www.hockey-reference.com/friv/birthyears.cgi?year=" + str(year)
    res = requests.get(url)

    soup = BeautifulSoup(res.content, 'lxml')
    table = soup.find('table', {'id': 'stats'})
    if table is None:
        return []
    players = []
    for row in table.find_all('tr'):
        player = {}
        hidden = row.find('td', {'data-stat': 'player'})
        if hidden is None:
            continue
        item = hidden.find('a')
        player['name'] = hidden.find('a').text
        player['position'] = row.find('td', {'data-stat': 'pos'}).text
        #print(player['name'])
        link = item.attrs['href']
        player['death date'] = row.find('td', {'data-stat': 'death_date'}).text
        player['birth date'] = row.find('td', {'data-stat': 'birth_date'}).text
        player['birth year'] = year
        player['first year'] = row.find('td', {'data-stat': 'year_min'}).text    
        player['last year'] = row.find('td', {'data-stat': 'year_max'}).text
        player['games'] = row.find('td', {'data-stat': 'games_played'}).text
        player['link'] = link
        players.append(player)
        
    return players

#### Gather data about all hockey players

In [49]:
all_hockey_players = []
for year in range(1879, 2001):
    all_hockey_players.extend(hockey_players_born_in(year))
    print(year)

1879
1880
1881
1882
1883
1884
1885
1886
1887
1888
1889
1890
1891
1892
1893
1894
1895
1896
1897
1898
1899
1900
1901
1902
1903
1904
1905
1906
1907
1908
1909
1910
1911
1912
1913
1914
1915
1916
1917
1918
1919
1920
1921
1922
1923
1924
1925
1926
1927
1928
1929
1930
1931
1932
1933
1934
1935
1936
1937
1938
1939
1940
1941
1942
1943
1944
1945
1946
1947
1948
1949
1950
1951
1952
1953
1954
1955
1956
1957
1958
1959
1960
1961
1962
1963
1964
1965
1966
1967
1968
1969
1970
1971
1972
1973
1974
1975
1976
1977
1978
1979
1980
1981
1982
1983
1984
1985
1986
1987
1988
1989
1990
1991
1992
1993
1994
1995
1996
1997
1998
1999
2000


In [50]:
all_hockey_df = pd.DataFrame(all_hockey_players)
all_hockey_df.head()

Unnamed: 0,birth date,birth year,death date,first year,games,last year,link,name,position
0,July 17,1879,"January 9, 1960",1918,18,1918,/players/l/lavioja01.html,Jack Laviolette,LW
1,May 3,1881,"April 5, 1919",1918,37,1919,/players/h/halljo01.html,Joe Hall,D
2,July 23,1881,"November 11, 1960",1918,20,1919,/players/l/lindsbe01.html,Bert Lindsay,G
3,May 29,1881,"November 25, 1933",1918,1,1918,/players/t/thompke01.html,Ken Thompson,F
4,December 31,1883,"June 1, 1960",1927,1,1927,/players/p/patrile01.html,Lester Patrick,D
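
In this table the birth date holds only the month and day, with the year in a separate column, while the death date is a full "Month day, year" string. A sketch of turning both into date objects, using the first row of the output above; the helper names are illustrative.

```python
from datetime import datetime


def full_date(month_day, year):
    """Combine a 'Month day' string and a year into a date, e.g. ('July 17', 1879)."""
    return datetime.strptime(f"{month_day} {year}", "%B %d %Y").date()


def parse_long_date(text):
    """Parse a 'Month day, year' string such as 'January 9, 1960'."""
    return datetime.strptime(text, "%B %d, %Y").date()


# First row of the output above: Jack Laviolette.
born = full_date("July 17", 1879)
died = parse_long_date("January 9, 1960")
```

Including the year in the string before parsing matters for February 29: parsing "February 29" on its own fails because `strptime` assumes the non-leap year 1900 when no year is given.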


#### Gather data on deceased hockey players

In [67]:
deceased_hockey_players = []
for year in range(1917, 2019):
    deceased_hockey_players.extend(hockey_players_who_died_in(year))
    print(year)
    #time.sleep(0.25)

1917
1918
1919
1920
1921
1922
1923
1924
1925
1926
1927
1928
1929
1930
1931
1932
1933
1934
1935
1936
1937
1938
1939
1940
1941
1942
1943
1944
1945
1946
1947
1948
1949
1950
1951
1952
1953
1954
1955
1956
1957
1958
1959
1960
1961
1962
1963
1964
1965
1966
1967
1968
1969
1970
1971
1972
1973
1974
1975
1976
1977
1978
1979
1980
1981
1982
1983
1984
1985
1986
1987
1988
1989
1990
1991
1992
1993
1994
1995
1996
1997
1998
1999
2000
2001
2002
2003
2004
2005
2006
2007
2008
2009
2010
2011
2012
2013
2014
2015
2016
2017
2018


In [68]:
deceased_hockey_df = pd.DataFrame(deceased_hockey_players)
deceased_hockey_df.tail()

Unnamed: 0,age,birth date,death date,death season,first year,games,last year,link,name,position
1288,60,1957-11-29,2018-05-24,2018,1983,62,1983,/players/s/sullibo01.html,Bob Sullivan,LW
1289,64,1954-06-10,2018-08-17,2018,1976,71,1978,/players/w/walkeku01.html,Kurt Walker,D
1290,97,1920-11-03,2018-01-18,2018,1950,14,1950,/players/w/webstjo01.html,Chick Webster,C
1291,71,1946-02-18,2018-01-02,2018,1969,52,1971,/players/w/wisteji01.html,Jim Wiste,C
1292,49,1968-04-22,2017-12-10,2018,1988,637,2000,/players/z/zalapza01.html,Zarley Zalapski,D


#### Save DataFrames

In [75]:
all_hockey_df.to_csv('./all_hockey.csv')
deceased_hockey_df.to_csv('./deceased_hockey.csv')

### Major League Baseball Player data

#### Vital Statistics from individual pages
These functions date from when I gathered vital stats from the individual player pages rather than combining the lists of players who were born or died in a given year. They are no longer used for gathering baseball player data, but they are used for gathering basketball player data later.

In [5]:
# Compute age from strings of a particular format
def age(birth_string, death_string, format_string):
    delta = datetime.strptime(death_string, format_string) - datetime.strptime(birth_string, format_string)
    return int(delta.days / 365.25)

In [44]:
# Test age function
age(died_2018[0]['birth date'], died_2018[0]['death date'], '%Y-%m-%d')

83
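
Dividing the day count by 365.25 is an approximation that can be one year low when the death falls exactly on a birthday. For example, 2001-01-01 to 2002-01-01 is 365 days, and 365 / 365.25 truncates to 0. A sketch of an exact alternative that compares calendar fields instead; the name `exact_age` is illustrative.

```python
from datetime import datetime


def exact_age(birth_string, death_string, format_string="%Y-%m-%d"):
    """Return the age in completed years between two date strings."""
    born = datetime.strptime(birth_string, format_string)
    died = datetime.strptime(death_string, format_string)
    # True (counted as 1) when the death precedes that year's birthday.
    had_no_birthday_yet = (died.month, died.day) < (born.month, born.day)
    return died.year - born.year - had_no_birthday_yet
```

For most rows the two functions agree, so the difference only matters for players who died on or very near a birthday.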

In [55]:
# Gather birth and death dates from a player's individual page
def get_vital_baseball_stats(link, player):
    res = requests.get(link)
    
    soup = BeautifulSoup(res.content, 'lxml')
    birth_item = soup.find('span', {'id': 'necro-birth'})
    if birth_item is not None:
        player['birth date'] = birth_item.attrs['data-birth']
    death_item = soup.find('span', {'id': 'necro-death'})
    if death_item is not None:
        player['death date'] = death_item.attrs['data-death']
    # Age can only be computed when both dates are present
    if (birth_item is not None) and (death_item is not None):
        player['age'] = age(player['birth date'], player['death date'], "%Y-%m-%d")

#### Gather players born/died in a particular year

In [90]:
# Collect all baseball players who died in a particular year
def baseball_players_who_died_in(year):
    url = "https://www.baseball-reference.com/leagues/MLB/" + str(year) + "-deaths.shtml"
    res = requests.get(url)

    soup = BeautifulSoup(res.content, 'lxml')
    table = soup.find('table', {'id': 'misc_batting'})
    players = []
    for row in table.find_all('tr')[1:]:
        player = {}
        hidden = row.find('td', {'data-stat': 'player'})
        item = hidden.find('a')
        player['name'] = hidden.find('a').text
        #print(player['name'])
        link = item.attrs['href']
        player['link'] = link
        player['death year'] = year
        try:
            player['death date'] = row.find('td', {'data-stat': 'deathdate'}).attrs['csk']
        except (AttributeError, KeyError):
            # Leave the death date out if the cell or its sort key is missing
            pass
        player['experience'] = row.find('td', {'data-stat': 'experience'}).text
        player['first year'] = row.find('td', {'data-stat': 'year_min'}).text    
        player['last year'] = row.find('td', {'data-stat': 'year_max'}).text
        player['games'] = row.find('td', {'data-stat': 'G'}).text
        player['all star'] = row.find('td', {'data-stat': 'allstar_games'}).text
        players.append(player)
        
    return players

In [93]:
# Collect all baseball players who were born in a particular year
def baseball_players_born_in(year):
    url = "https://www.baseball-reference.com/leagues/MLB/" + str(year) + "-births.shtml"
    res = requests.get(url)

    soup = BeautifulSoup(res.content, 'lxml')
    table = soup.find('table', {'id': 'misc_batting'})
    players = []
    for row in table.find_all('tr')[1:]:
        player = {}
        hidden = row.find('td', {'data-stat': 'player'})
        item = hidden.find('a')
        player['name'] = hidden.find('a').text
        #print(player['name'])
        link = item.attrs['href']
        player['link'] = link
        player['birth city'] = row.find('td', {'data-stat': 'birth_city'}).attrs['csk']
        player['birth date'] = row.find('td', {'data-stat': 'birthdate'}).attrs['csk']
        player['birth year'] = year
        player['experience'] = row.find('td', {'data-stat': 'experience'}).text
        player['first year'] = row.find('td', {'data-stat': 'year_min'}).text    
        player['last year'] = row.find('td', {'data-stat': 'year_max'}).text
        player['games'] = row.find('td', {'data-stat': 'G'}).text
        player['all star'] = row.find('td', {'data-stat': 'allstar_games'}).text
        players.append(player)
        
    return players

In [94]:
all_baseball_players = []
for year in range(1900,1999):
    all_baseball_players.extend(baseball_players_born_in(year))
    print(year)
    

1900
1901
1902
1903
1904
1905
1906
1907
1908
1909
1910
1911
1912
1913
1914
1915
1916
1917
1918
1919
1920
1921
1922
1923
1924
1925
1926
1927
1928
1929
1930
1931
1932
1933
1934
1935
1936
1937
1938
1939
1940
1941
1942
1943
1944
1945
1946
1947
1948
1949
1950
1951
1952
1953
1954
1955
1956
1957
1958
1959
1960
1961
1962
1963
1964
1965
1966
1967
1968
1969
1970
1971
1972
1973
1974
1975
1976
1977
1978
1979
1980
1981
1982
1983
1984
1985
1986
1987
1988
1989
1990
1991
1992
1993
1994
1995
1996
1997
1998


In [95]:
all_baseball_df = pd.DataFrame(all_baseball_players)
all_baseball_df.tail()

Unnamed: 0,all star,birth city,birth date,birth year,experience,first year,games,last year,link,name
14001,0,"United States, FL, Tampa",1997-01-17,1997,1,2018,28,2018,/players/t/tuckeky01.shtml,Kyle Tucker
14002,0,"Mexico, Sonora, Magdalena de Kino",1997-06-03,1997,1,2018,12,2018,/players/u/uriaslu01.shtml,Luis Urias
14003,0,"Colombia, Cartagena",1997-02-15,1997,1,2018,10,2018,/players/v/vilorme01.shtml,Meibrys Viloria
14004,0,"United States, NC, Durham",1997-12-20,1997,1,2018,3,2018,/players/w/wilsobr02.shtml,Bryse Wilson
14005,0,"Dominican Republic, Distrito Nacional, Santo D...",1998-10-25,1998,1,2018,116,2018,/players/s/sotoju01.shtml,Juan Soto


In [96]:
deceased_baseball_players = []
for year in range(1919,2019):
    deceased_baseball_players.extend(baseball_players_who_died_in(year))
    print(year)

deceased_baseball_df = pd.DataFrame(deceased_baseball_players)

1919
1920
1921
1922
1923
1924
1925
1926
1927
1928
1929
1930
1931
1932
1933
1934
1935
1936
1937
1938
1939
1940
1941
1942
1943
1944
1945
1946
1947
1948
1949
1950
1951
1952
1953
1954
1955
1956
1957
1958
1959
1960
1961
1962
1963
1964
1965
1966
1967
1968
1969
1970
1971
1972
1973
1974
1975
1976
1977
1978
1979
1980
1981
1982
1983
1984
1985
1986
1987
1988
1989
1990
1991
1992
1993
1994
1995
1996
1997
1998
1999
2000
2001
2002
2003
2004
2005
2006
2007
2008
2009
2010
2011
2012
2013
2014
2015
2016
2017
2018


In [99]:
deceased_baseball_df.tail()

Unnamed: 0,all star,death date,death year,experience,first year,games,last year,link,name
8577,0,2018-08-18,2018,2,1954,11,1955,/players/v/vanbroz01.shtml,Ozzie Van Brabant
8578,0,2018-06-01,2018,1,1955,1,1955,/players/v/vandufr01.shtml,Fred Van Dusen
8579,0,2018-08-04,2018,1,1978,7,1978,/players/w/whitemy01.shtml,Myron White
8580,0,2018-05-05,2018,1,1956,1,1956,/players/w/wrighro01.shtml,Roy Wright
8581,0,2018-01-07,2018,2,1951,20,1952,/players/y/youngdi01.shtml,Dick Young


In [100]:
all_baseball_df.to_csv('./all_baseball_players.csv')
deceased_baseball_df.to_csv('./deceased_baseball_players.csv')

#### Get height and weight of baseball players

In [3]:
# Gather height and weight from individual player page
def get_baseball_ht_wt(link, player):
    res = requests.get(link)
    
    soup = BeautifulSoup(res.content, 'lxml')
    player['height'] = soup.find('span', {'itemprop': 'height'}).text
    player['weight'] = soup.find('span', {'itemprop': 'weight'}).text

In [4]:
# This DF from the MLB Analysis notebook
mlb_merged_df = pd.read_csv('./data/mlb_merged.csv')

In [8]:
# Gather heights and weights of players born between 1931 and 1945
players = []
for year in range(1931, 1946):
    print(year)
    links = mlb_merged_df[mlb_merged_df['birth year'] == year]['link']

    count = 0
    for link in links:
        full_link = "https://baseball-reference.com" + link
        player = {}
        player['link'] = link
        get_baseball_ht_wt(full_link, player)
        players.append(player)
        count +=1
        if count % 50 == 0:
            print(count, "players scraped")

1931
50 players scraped
100 players scraped
1932
50 players scraped
1933
50 players scraped
1934
50 players scraped
1935
50 players scraped
1936
50 players scraped
100 players scraped
1937
50 players scraped
100 players scraped
1938
50 players scraped
100 players scraped
1939
50 players scraped
1940
50 players scraped
100 players scraped
1941
50 players scraped
100 players scraped
1942
50 players scraped
100 players scraped
1943
50 players scraped
100 players scraped
1944
50 players scraped
100 players scraped
1945
50 players scraped
100 players scraped


In [16]:
pd.DataFrame(players).to_csv('./data/mlb_ht_wt.csv')

### NBA Players
The basketball reference site only has pages that list NBA (or ABA) players born in a particular year; I found no pages that list players who died in a particular year. Death dates therefore had to be collected from each player's individual page.

In [6]:
# Get the NBA player's dates of birth and death from his individual page, and calculate his age at death.
# Return False if the player is still alive (no death date found); otherwise return True.

def get_vital_basketball_stats(link, player):
    #print(player['name'])
    res = requests.get(link)
    
    soup = BeautifulSoup(res.content, 'lxml')
    death_item = soup.find('span', {'id': 'necro-death'})
    if death_item is not None:
        player['death date'] = death_item.attrs['data-death']
    birth_item = soup.find('span', {'id': 'necro-birth'})
    if birth_item is not None:
        player['birth date'] = birth_item.attrs['data-birth']
    if (birth_item is not None) and (death_item is not None):
        try:
            player['age'] = age(player['birth date'], player['death date'], "%Y-%m-%d")
        except ValueError:
            # Skip the age if either date is incomplete or malformed
            pass
    return death_item is not None

In [8]:
# Collect all basketball players who were born in a particular year
def basketball_players_born_in(year):
    url = "https://www.basketball-reference.com/friv/birthyears.fcgi?year=" + str(year)
    res = requests.get(url)

    soup = BeautifulSoup(res.content, 'lxml')
    table = soup.find('table', {'id': 'stats'})
    if table is None:
        return []
    players = []
    for row in table.find_all('tr'):
        player = {}
        hidden = row.find('td', {'data-stat': 'player'})
        if hidden is None:
            continue
        item = hidden.find('a')
        player['name'] = hidden.find('a').text
        #print(player['name'])
        link = item.attrs['href']
        full_link = "https://basketball-reference.com" + link
        get_vital_basketball_stats(full_link, player)
        player['experience'] = row.find('td', {'data-stat': 'years'}).text
        player['games'] = row.find('td', {'data-stat': 'g'}).text
        player['link'] = link
        players.append(player)
        
    return players
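
Because `basketball_players_born_in` requests one page per player, rerunning the cell repeats every request. A sketch of a small on-disk cache that would make reruns cheap follows. The helper name `cached_fetch` and the `./page_cache` folder are hypothetical, and the download function is passed in rather than hard-coded.

```python
import hashlib
from pathlib import Path


def cached_fetch(url, fetch, cache_dir="./page_cache"):
    """Return page bytes for `url`, calling `fetch(url)` only on a cache miss.

    `fetch` is any callable that takes a URL and returns bytes, for example
    `lambda u: requests.get(u).content`.
    """
    folder = Path(cache_dir)
    folder.mkdir(parents=True, exist_ok=True)
    # Hash the URL to get a filesystem-safe file name.
    path = folder / (hashlib.sha256(url.encode("utf-8")).hexdigest() + ".html")
    if path.exists():
        return path.read_bytes()
    content = fetch(url)
    path.write_bytes(content)
    return content
```

The cached bytes can be handed to `BeautifulSoup` exactly as `res.content` is in the functions above.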

#### Gather data on all basketball players

In [9]:
basketball_players = []
for year in range(1902, 2000):
    basketball_players.extend(basketball_players_born_in(year))
    print(year)

1902
1903
1904
1905
1906
1907
1908
1909
1910
1911
1912
1913
1914
1915
1916
1917
1918
1919
1920
1921
1922
1923
1924
1925
1926
1927
1928
1929
1930
1931
1932
1933
1934
1935
1936
1937
1938
1939
1940
1941
1942
1943
1944
1945
1946
1947
1948
1949
1950
1951
1952
1953
1954
1955
1956
1957
1958
1959
1960
1961
1962
1963
1964
1965
1966
1967
1968
1969
1970
1971
1972
1973
1974
1975
1976
1977
1978
1979
1980
1981
1982
1983
1984
1985
1986
1987
1988
1989
1990
1991
1992
1993
1994
1995
1996
1997
1998
1999


In [10]:
basketball_df = pd.DataFrame(basketball_players)
basketball_df.to_csv('./basketball.csv')
basketball_df.tail()

Unnamed: 0,age,birth date,death date,experience,games,link,name
4573,,1998-05-17,,1,61,/players/f/fergute01.html,Terrance Ferguson
4574,,1998-05-29,,1,14,/players/f/fultzma01.html,Markelle Fultz
4575,,1998-02-04,,1,63,/players/m/monkma01.html,Malik Monk
4576,,1998-07-28,,1,78,/players/n/ntilila01.html,Frank Ntilikina
4577,,1998-03-03,,1,80,/players/t/tatumja01.html,Jayson Tatum


#### Gather players by country of birth
This data was gathered in case I wanted to restrict myself to analyzing players born in a certain country.

In [12]:
# Obtain players by place of birth
def nba_players_from(link, country="United States"):
    res = requests.get(link)
    
    soup = BeautifulSoup(res.content, 'lxml')
    table = soup.find('table', {'id': 'stats'})
    if table is None:
        return []
    players = []
    for row in table.find_all('tr'):
        player = {}
        hidden = row.find('td', {'data-stat': 'player'})
        if hidden is None:
            continue
        item = hidden.find('a')
        player['name'] = item.text
        player['link'] = item.attrs['href']
        #print(player['name'])
        player['experience'] = row.find('td', {'data-stat': 'years'}).text
        player['first year'] = row.find('td', {'data-stat': 'year_min'}).text    
        player['last year'] = row.find('td', {'data-stat': 'year_max'}).text
        player['games'] = row.find('td', {'data-stat': 'g'}).text
        player['country'] = country
        try:
            player['birth date'] = row.find('td', {'data-stat': 'birth_date'}).attrs['csk']
        except (AttributeError, KeyError):
            # Leave the birth date out if the cell or its sort key is missing
            pass
        players.append(player)
        
    return players

In [13]:
# Obtain players born in the United States
def nba_players_from_us():
    players = []
    home_url = "https://www.basketball-reference.com/friv/birthplaces.fcgi"
    res = requests.get(home_url)
    
    soup = BeautifulSoup(res.content, 'lxml')
    us_table = soup.find('div', {'id': 'birthplace_1'})
    for state in us_table.find_all('p'):
        print(state.find('a').text)
        link = state.find('a').attrs['href']
        full_link = "https://basketball-reference.com" + link
        players.extend(nba_players_from(full_link))
        
    return players

In [14]:
us_nba_players = nba_players_from_us()

Alabama
Alaska
Arizona
Arkansas
California
Colorado
Connecticut
Delaware
District of Columbia
Florida
Georgia
Hawaii
Idaho
Illinois
Indiana
Iowa
Kansas
Kentucky
Louisiana
Maine
Maryland
Massachusetts
Michigan
Minnesota
Mississippi
Missouri
Montana
Nebraska
Nevada
New Hampshire
New Jersey
New Mexico
New York
North Carolina
North Dakota
Ohio
Oklahoma
Oregon
Pennsylvania
Rhode Island
South Carolina
South Dakota
Tennessee
Texas
Utah
Virginia
Washington
West Virginia
Wisconsin
Wyoming


In [15]:
us_nba_df = pd.DataFrame(us_nba_players)

In [16]:
us_nba_df[us_nba_df['birth date'].isnull()]

Unnamed: 0,birth date,country,experience,first year,games,last year,link,name
3471,,United States,1,1968,70,1968,/players/s/spragbr01.html,Bruce Spraggins
3533,,United States,1,1968,25,1968,/players/s/stollra01.html,Randy Stoll


In [17]:
# Obtain players by place of birth from the rest of the world

def nba_players_from_world():
    players = []
    home_url = "https://www.basketball-reference.com/friv/birthplaces.fcgi"
    res = requests.get(home_url)
    
    soup = BeautifulSoup(res.content, 'lxml')
    world_table = soup.find('div', {'id': 'birthplace_2'})
    for country in world_table.find_all('p'):
        print(country.find('a').text)
        link = country.find('a').attrs['href']
        full_link = "https://basketball-reference.com" + link
        country_string = country.find('a').text
        players.extend(nba_players_from(full_link, country_string))
        
    return players

In [18]:
world_nba_df = pd.DataFrame(nba_players_from_world())

Argentina
Australia
Austria
Bahamas
Belgium
Bosnia and Herzegovina
Brazil
Bulgaria
Cameroon
Canada
Cape Verde
China
Croatia
Cuba
Czech Republic
Democratic Republic of the Congo
Denmark
Dominica
Dominican Republic
Egypt
Estonia
Finland
France
French Guiana
Gabon
Georgia
Germany
Ghana
Greece
Guadeloupe
Guyana
Haiti
Hungary
Iceland
Ireland
Islamic Republic of Iran
Israel
Italy
Jamaica
Japan
Latvia
Lebanon
Lithuania
Luxembourg
Mali
Martinique
Mexico
Montenegro
Morocco
Netherlands
New Zealand
Nigeria
Norway
Panama
Poland
Puerto Rico
Republic of Korea
Republic of Macedonia
Republic of the Congo
Romania
Russian Federation
Saint Lucia
Saint Vincent and the Grenadines
Senegal
Serbia
Slovakia
Slovenia
South Africa
South Sudan
Spain
Sweden
Switzerland
Taiwan
Trinidad and Tobago
Tunisia
Turkey
U.S. Virgin Islands
Ukraine
United Kingdom
United Republic of Tanzania
Uruguay
Venezuela


In [19]:
us_nba_df.shape, world_nba_df.shape

((3636, 8), (439, 8))

In [20]:
us_nba_df.to_csv('./us_nba.csv')

In [21]:
world_nba_df.to_csv('./world_nba.csv')

#### Gathering birth years
Gather the birth year of every player, keyed by the link to the player's page. These records will be merged with the birthplace tables later.

In [8]:
# Collect all basketball players who were born in a particular year
def basketball_players_birth_year(year):
    url = "https://www.basketball-reference.com/friv/birthyears.fcgi?year=" + str(year)
    res = requests.get(url)

    soup = BeautifulSoup(res.content, 'lxml')
    table = soup.find('table', {'id': 'stats'})
    if table is None:
        return []
    players = []
    for row in table.find_all('tr'):
        player = {}
        # Rows without a player cell (such as header rows) are skipped
        cell = row.find('td', {'data-stat': 'player'})
        if cell is None:
            continue
        # The link to the player's page identifies the player
        player['link'] = cell.find('a').attrs['href']
        player['birth year'] = year
        players.append(player)
        
    return players

In [9]:
nba_birth_years = []
for year in range(1902, 2000):
    print(year)
    nba_birth_years.extend(basketball_players_birth_year(year))
    
pd.DataFrame(nba_birth_years).to_csv('./nba_birth_years.csv', index=False)

1902
1903
1904
1905
1906
1907
1908
1909
1910
1911
1912
1913
1914
1915
1916
1917
1918
1919
1920
1921
1922
1923
1924
1925
1926
1927
1928
1929
1930
1931
1932
1933
1934
1935
1936
1937
1938
1939
1940
1941
1942
1943
1944
1945
1946
1947
1948
1949
1950
1951
1952
1953
1954
1955
1956
1957
1958
1959
1960
1961
1962
1963
1964
1965
1966
1967
1968
1969
1970
1971
1972
1973
1974
1975
1976
1977
1978
1979
1980
1981
1982
1983
1984
1985
1986
1987
1988
1989
1990
1991
1992
1993
1994
1995
1996
1997
1998
1999
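
The later merge could look something like the sketch below. It assumes the birth years are joined to the birthplace tables on the shared `link` column, which is the only column the two tables have in common. The player rows and links here are made up for illustration and are not real scraped records.

```python
import pandas as pd

# Hypothetical rows standing in for the birthplace tables (us_nba_df / world_nba_df)
birthplaces = pd.DataFrame([
    {'link': '/players/a/exampa01.html', 'name': 'Player A', 'country': 'United States'},
    {'link': '/players/b/exampb01.html', 'name': 'Player B', 'country': 'Canada'},
])

# Hypothetical rows in the shape returned by basketball_players_birth_year()
birth_years = pd.DataFrame([
    {'link': '/players/a/exampa01.html', 'birth year': 1950},
])

# A left join keeps every player, even those with no birth year on record
merged = birthplaces.merge(birth_years, on='link', how='left')

# Players whose birth year is still missing after the merge
missing = merged[merged['birth year'].isnull()]

print(merged)
print(missing['name'].tolist())
```

A left join is used here so that players without a recorded birth year remain visible and can be checked by hand, rather than silently dropping out of the table.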


## Gathering Vital Statistics of General Population

### Life Tables From Social Security Administration

In [52]:
# Get the life table from a particular URL:
def get_life_table(url):
    res = requests.get(url)
    
    soup = BeautifulSoup(res.content, 'lxml')
    table = soup.find('table')
    table2 = table.find('table')
    matrix_list = []
    # Skip the first two rows, then collect the text of each cell
    for row in table2.find_all('tr')[2:]:
        row_list = [entry.text for entry in row.find_all('td')]
        if not row_list:
            # Rows without <td> cells carry no data
            continue
        matrix_list.append(row_list)
    
    return matrix_list

In [57]:
# Gather life tables for 2013 to 2015; widen the range to start at 2004 for the
# full set. Life tables may be missing for some years.
for year in range(2013, 2016):
    url = "https://www.ssa.gov/oact/STATS/table4c6_" + str(year) + ".html"
    if year == 2015:
        # The 2015 table is served from the URL without a year suffix
        url = "https://www.ssa.gov/oact/STATS/table4c6.html"
    ssa_array = np.array(get_life_table(url))
    df = pd.DataFrame(ssa_array, columns=['age', 'male prob death', 'male num lives', 'male life exp', 
                                 'female prob death', 'female num lives', 'female life exp'])
    df.to_csv('./ssa_tables/ssa_' + str(year) + '.csv')

In [54]:
# Sample DataFrame of SSA Life Table
pd.DataFrame(ssa_array, columns=['age', 'male prob death', 'male num lives', 'male life exp', 
                                 'female prob death', 'female num lives', 'female life exp'])

Unnamed: 0,age,male prob death,male num lives,male life exp,female prob death,female num lives,female life exp
0,0,0.007566,100000,74.81,0.006156,100000,79.95
1,1,0.000522,99243,74.38,0.000416,99384,79.45
2,2,0.000358,99192,73.42,0.000257,99343,78.48
3,3,0.000255,99156,72.45,0.000181,99318,77.50
4,4,0.000204,99131,71.47,0.000155,99300,76.52
5,5,0.000184,99111,70.48,0.000147,99284,75.53
6,6,0.000174,99092,69.49,0.000142,99270,74.54
7,7,0.000163,99075,68.51,0.000137,99255,73.55
8,8,0.000143,99059,67.52,0.000129,99242,72.56
9,9,0.000117,99045,66.53,0.000117,99229,71.57


#### Augment life tables
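For each life table, we create a new column "male death pct" that estimates the fraction of the original cohort of 100,000 males that dies within a given year of age. We take the difference between consecutive entries of the "male num lives" column (after stripping the commas from its values) and divide by 100,000. Because each difference is taken against the previous row, the value in the row for age x counts the deaths between ages x-1 and x, and the first row is left empty.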

In [3]:
for year in range(2004, 2016):
    try:
        ssa = pd.read_csv('./ssa_tables/ssa_' + str(year) + '.csv')
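    except FileNotFoundError:
        # No life table was saved for this year
        continue
    # Strip thousands separators so the counts can be used as integers
    ssa['male num lives'] = ssa['male num lives'].astype(str).str.replace(',', '', regex=False).astype(int)
    # Fraction of the 100,000 cohort that died since the previous age
    ssa['male death pct'] = -1 * ssa['male num lives'].diff() / 100000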
    
    ssa.to_csv('./ssa_tables/death_pct_' + str(year) + '.csv')
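
As a quick check of the arithmetic, here is a minimal sketch that applies the same two steps to the first four "male num lives" values from the sample life table shown earlier. The values are written with thousands separators on the assumption that this is how some of the scraped tables store them.

```python
import pandas as pd

# First four rows of the sample table; the comma formatting is an assumption
ssa = pd.DataFrame({
    'age': [0, 1, 2, 3],
    'male num lives': ['100,000', '99,243', '99,192', '99,156'],
})

# Strip the thousands separators and convert to integers
ssa['male num lives'] = ssa['male num lives'].str.replace(',', '', regex=False).astype(int)

# Deaths since the previous age, as a fraction of the original 100,000
ssa['male death pct'] = -1 * ssa['male num lives'].diff() / 100000

print(ssa)
```

The row for age 1 holds (100000 - 99243) / 100000 = 0.00757, which agrees with the age-0 "male prob death" of 0.007566 up to rounding, since the whole cohort is still alive at age 0. The row for age 0 is empty because there is no previous row to difference against.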