<a href="https://colab.research.google.com/github/s017274/SportsReference/blob/main/Sports_Reference_Data_Project.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Sports Reference Data Intern Question**

Challenge: As soon as NBA teams make their draft selections, we want to input new entries into our database for the drafted players. In order to do so, we will need to know their biographical information (birthdates, schools attended, name pronunciation, shooting hand, Twitter handle, etc.) so that the bio section at the top of their new player pages will be as complete as possible. You have 1-2 weeks before the draft to research and prepare for draft night. 

**Process**
1. Scrape player names and schools from the ESPN Top 100 draft list.
2. Scrape position, height, weight, and hometown from Sports-Reference College Basketball.
3. Download data to a .csv file.



#### Set-up

In [95]:
#imports
import requests
import pandas as pd
from bs4 import BeautifulSoup

#### 1. Scrape player names and schools from the ESPN Top 100 draft list

In [None]:
#URLs for each page of the ESPN Top 100 best available players
URL_list = [
  "https://www.espn.com/nba/draft/bestavailable",
  "https://www.espn.com/nba/draft/bestavailable/_/position/ovr/page/2",
  "https://www.espn.com/nba/draft/bestavailable/_/position/ovr/page/3",
  "https://www.espn.com/nba/draft/bestavailable/_/position/ovr/page/4",
]

#Uses BeautifulSoup to parse the indicated webpage. Returns a list of lists (each inner list has two elements: player name and school).
def parse_html(pageURL):
  page = requests.get(pageURL)
  soup = BeautifulSoup(page.content, "html.parser")
  results = soup.find(id="draftcast-bestavailable")

  #if the draft table is missing (e.g., the page layout changed), return no players
  if results is None:
    return []

  #each player's name and school sit inside one of these divs
  names = results.find_all("div", class_="draftTable__playerInfo")

  players = []

  for element in names:
    player_html = element.find("span", class_="draftTable__headline draftTable__headline--player")
    school_html = element.find("span", class_="draftTable__headline draftTable__headline--school")
    players.append([player_html.text, school_html.text])

  return players

#empty list for all players to be added to
all_players = []

#loops through relevant URLs and adds the returned values to the all_players list
for url in URL_list:
  all_players = all_players + parse_html(url)

#prints the list of all 100 players
print(all_players)

[['Chet Holmgren', 'Gonzaga'], ['Paolo Banchero', 'Duke'], ['Jabari Smith', 'Auburn'], ['Jaden Ivey', 'Purdue'], ['Jalen Duren', 'Memphis'], ['Shaedon Sharpe', 'Kentucky'], ['Keegan Murray', 'Iowa'], ['Johnny Davis', 'Wisconsin'], ['Bennedict Mathurin', 'Arizona'], ['Jaden Hardy', ''], ['TyTy Washington Jr', 'Kentucky'], ['A.J. Griffin', 'Duke'], ['Jean Montero', ''], ['Trevor Keels', 'Duke'], ['Patrick Baldwin Jr', 'Milwaukee'], ['MarJon Beauchamp', ''], ['Kendall Brown', 'Baylor'], ['Kennedy Chandler', 'Tennessee'], ['Dyson Daniels', ''], ['Ochai Agbaji', 'Kansas'], ['JD Davison', 'Alabama'], ['Ousmane Dieng', ''], ['Mark Williams', 'Duke'], ['Wendell Moore Jr', 'Duke'], ['Blake Wesley', 'Notre Dame'], ['E.J. Liddell', 'Ohio State'], ['Hugo Besson', ''], ['Yannick Nzosa', ''], ['Nikola  Jovic', ''], ['Bryce McGowens', 'Nebraska'], ['Christian Koloko', 'Arizona'], ['Peyton Watson', 'UCLA'], ['Caleb Houstan', 'Michigan'], ['Christian Braun', 'Kansas'], ['Harrison Ingram', 'Stanford'], 
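
The parsing step relies on ESPN's element id and class names, so a missing container or tag would raise an `AttributeError`. As a minimal sketch, assuming the same markup as above (the sample HTML and the function name `parse_players` are illustrative and not part of the original notebook), the extraction can be exercised offline with explicit `None` checks:

```python
from bs4 import BeautifulSoup

# Hypothetical snippet that mimics the structure scraped above.
SAMPLE_HTML = """
<div id="draftcast-bestavailable">
  <div class="draftTable__playerInfo">
    <span class="draftTable__headline draftTable__headline--player">Chet Holmgren</span>
    <span class="draftTable__headline draftTable__headline--school">Gonzaga</span>
  </div>
  <div class="draftTable__playerInfo">
    <span class="draftTable__headline draftTable__headline--player">Jaden Hardy</span>
  </div>
</div>
"""

def parse_players(html):
    """Return [name, school] pairs, tolerating missing containers and tags."""
    soup = BeautifulSoup(html, "html.parser")
    container = soup.find(id="draftcast-bestavailable")
    if container is None:
        return []
    players = []
    for card in container.find_all("div", class_="draftTable__playerInfo"):
        name_tag = card.find("span", class_="draftTable__headline--player")
        school_tag = card.find("span", class_="draftTable__headline--school")
        if name_tag is None:
            continue  # skip cards without a player name
        school = school_tag.get_text(strip=True) if school_tag is not None else ""
        players.append([name_tag.get_text(strip=True), school])
    return players

print(parse_players(SAMPLE_HTML))
```

Testing against a saved snippet like this makes it easier to notice when the live page's structure changes before draft night.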

In [99]:
#convert list of lists to data frame
players_basic_df = pd.DataFrame(data=all_players, columns=["Name", "School"])
#check for well-formed df
players_basic_df.head()

Unnamed: 0,Name,School
0,Chet Holmgren,Gonzaga
1,Paolo Banchero,Duke
2,Jabari Smith,Auburn
3,Jaden Ivey,Purdue
4,Jalen Duren,Memphis


#### 2. Scrape position, height, weight, and hometown from Sports-Reference College Basketball

Note: This will not work for overseas and G League players, who have no Sports-Reference College Basketball profile. Their rows come back empty.
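
One way to act on this note is to separate the players before building URLs. The sketch below assumes that an empty school string in the ESPN list marks an overseas or G League player, which matches the scraped output above but is not guaranteed. The function name `split_by_school` is hypothetical.

```python
# Small sample in the same [name, school] shape as all_players.
sample_players = [
    ["Chet Holmgren", "Gonzaga"],
    ["Jaden Hardy", ""],
    ["Paolo Banchero", "Duke"],
    ["Dyson Daniels", ""],
]

def split_by_school(players):
    """Split players into (with a school, without a school)."""
    college = [p for p in players if p[1].strip()]
    other = [p for p in players if not p[1].strip()]
    return college, other

college_players, other_players = split_by_school(sample_players)
print(len(college_players), "college players;", len(other_players), "to research by hand")
```

The second list then doubles as a to-do list for manual research.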

Creates the list of URLs for Sports Reference player profiles

In [None]:
#function to create URL strings to access sports reference profiles
def createURL(player_list, base_URL):
  #create empty list
  URL_list = []
  #loop through player names
  for player in player_list:
    #modify name string to all lowercase, replace space with hyphen to replicate URL structure
    name = player[0]
    name = name.lower()
    name = name.replace(' ', '-')
    #add created URL to list
    URL_list.append(base_URL + name + "-1.html")
  #return full list
  return URL_list

#base url - basis of all sports reference profile URLs
base_URL = "https://www.sports-reference.com/cbb/players/"
#previously created list of Top 100 draftable players
player_list = all_players

#runs function to create URLs
URL_list2 = createURL(player_list, base_URL)
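
Names such as "A.J. Griffin" or "Nikola  Jovic" (with a double space) produce URLs containing periods or repeated hyphens under the simple replace above. The following is a sketch of a stricter slug builder. It assumes, without confirmation from the site, that Sports Reference slugs drop punctuation and accents, and that the first profile always ends in `-1`. The names `make_slug` and `profile_url` are illustrative.

```python
import re
import unicodedata

BASE_URL = "https://www.sports-reference.com/cbb/players/"

def make_slug(name):
    """Lowercase, strip accents and punctuation, and hyphenate a player name."""
    # Decompose accented characters and drop the combining marks.
    ascii_name = unicodedata.normalize("NFKD", name).encode("ascii", "ignore").decode("ascii")
    # Keep only letters, digits, whitespace, and hyphens (assumed slug rule).
    cleaned = re.sub(r"[^a-z0-9\s-]", "", ascii_name.lower())
    # Collapse runs of whitespace or hyphens into a single hyphen.
    return re.sub(r"[\s-]+", "-", cleaned.strip())

def profile_url(name, base_url=BASE_URL):
    """Build the assumed profile URL for the first player with this name."""
    return base_url + make_slug(name) + "-1.html"

print(profile_url("A.J. Griffin"))
```

Because the `-1` suffix points at the first player with a given name, any result should still be checked against the school from the ESPN list.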

Creates data frame by scraping biographical information from Sports Reference profiles

In [105]:
#function to scrape information from sports reference website
def parse_html_2(pageURL):
  #make a request to the url passed to the function
  page=requests.get(pageURL)
  #parses html and stores it in the 'soup' variable
  soup = BeautifulSoup(page.content, "html.parser")
  #identifies relevant section of HTML to scrape
  results=soup.find(id="meta")

  #if no relevant section is found, return an empty list of the correct dimension
  if results is None:
    return ["","","","",""]
  #otherwise, parse the relevant section and extract name, height, weight, hometown, and position
  else:
    #finds all lines with 'span' tags in the HTML
    parsed1 = results.find_all("span")
    #finds all lines with 'p' tags in the HTML
    parsed2 = results.find_all("p")

    #combines all lines of text
    parsed_final = parsed1 + parsed2

    #identifies names from HTML
    name = parsed_final[0].text
    #identifies height from HTML
    height = parsed_final[1].text
    #identifies weight from HTML
    weight = parsed_final[2].text

    #hometown string is found between "Hometown" and "High School"
    hometown_index = parsed_final[3].text.find("Hometown:") + 10
    high_school_index = parsed_final[3].text.find("High School:")
    hometown = parsed_final[3].text[hometown_index: high_school_index]

    #position string is found after "Position"
    position_index = parsed_final[3].text.find("Position:") + 10
    position = parsed_final[3].text[position_index:position_index+15]

    #removes new line characters and empty spaces in position and hometown strings
    position = position.replace("\n", "")
    position = position.replace(" ","")
    hometown = hometown.replace("\n", "")

    #catches cases where hometown extraction fails (e.g., a label is missing) and the slice returns an overly long string
    if len(hometown) > 40:
      hometown = "N/A"
    
    #combines all information in a list
    player_info = [name, height, weight, position, hometown]

    #returns list
    return player_info

#empty list to collect each player's information
all_player_info = []
#loops through the URL list created above
for URL in URL_list2:
  #calls the function and adds its return for each player in the loop
  all_player_info.append(parse_html_2(URL))

#checks the information gathered
print(len(all_player_info))
print(all_player_info)

#adds information to a data frame
players_df = pd.DataFrame(data=all_player_info, columns=["Name", "Height", "Weight", "Position", "Hometown"])
players_df.head()

100
[['Chet Holmgren', '7-0', '195lb', 'Center', 'Minneapolis, MN'], ['Paolo Banchero', '6-10', '250lb', 'Forward', 'Seattle, WA'], ['Jabari Smith', '6-11', '250lb', 'Center', 'N/A'], ['Jaden Ivey', '6-4', '200lb', 'Guard', 'South Bend, IN'], ['Jalen Duren', '6-11', '250lb', 'Center', 'Sharon Hill, PA'], ['', '', '', '', ''], ['Keegan Murray', '6-8', '215lb', 'Forward', 'Cedar Rapids, IA'], ['Johnny Davis', '6-2', '170lb', '', ''], ['Bennedict Mathurin', '6-7', '195lb', 'Guard', 'ion:    Guard  6-7,\xa0195lb\xa0(201cm,\xa088kg) '], ['', '', '', '', ''], ['', '', '', '', ''], ['', '', '', '', ''], ['', '', '', '', ''], ['Trevor Keels', '6-4', '221lb', 'Guard', 'Clinton, MD'], ['', '', '', '', ''], ['', '', '', '', ''], ['Kendall Brown', '6-8', '205lb', 'Guard', 'Cottage Grove, MN'], ['Kennedy Chandler', '6-0', '172lb', 'Guard', 'Memphis, TN'], ['', '', '', '', ''], ['Ochai Agbaji', '6-5', '210lb', 'Guard', 'Kansas City, MO'], ['JD Davison', '6-3', '195lb', 'Guard', 'Latohatchee, AL'], [

Unnamed: 0,Name,Height,Weight,Position,Hometown
0,Chet Holmgren,7-0,195lb,Center,"Minneapolis, MN"
1,Paolo Banchero,6-10,250lb,Forward,"Seattle, WA"
2,Jabari Smith,6-11,250lb,Center,
3,Jaden Ivey,6-4,200lb,Guard,"South Bend, IN"
4,Jalen Duren,6-11,250lb,Center,"Sharon Hill, PA"
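
The Bennedict Mathurin row in the printed list shows a hometown made of leftover "Position" text. A likely cause is that `str.find` returns -1 when a label is absent, so the slice boundaries end up in the wrong place and the length check does not catch it. A minimal sketch of a label-aware helper follows. The name `extract_between` and the sample text are hypothetical, and the real page text may be laid out differently.

```python
def extract_between(text, start_label, end_label, fallback="N/A"):
    """Return the text between two labels, or fallback if either label is missing."""
    start = text.find(start_label)
    if start == -1:
        return fallback
    start += len(start_label)
    # Search for the end label only after the start label.
    end = text.find(end_label, start)
    if end == -1:
        return fallback
    # Collapse newlines and repeated spaces into single spaces.
    return " ".join(text[start:end].split())

sample_text = "Position: Guard\n  Hometown: Seattle, WA\n  High School: O'Dea"
print(extract_between(sample_text, "Hometown:", "High School:"))
print(extract_between("Position: Guard", "Hometown:", "High School:"))
```

Returning the fallback when the end label is missing is a conservative choice. It can discard a valid hometown on pages that list no high school.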


In [106]:
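#merges player information with player names and schools on the shared "Name" column
#the default inner join keeps only players whose scraped profile name matches the ESPN name
merged_df = players_df.merge(players_basic_df, on="Name")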
merged_df

Unnamed: 0,Name,Height,Weight,Position,Hometown,School
0,Chet Holmgren,7-0,195lb,Center,"Minneapolis, MN",Gonzaga
1,Paolo Banchero,6-10,250lb,Forward,"Seattle, WA",Duke
2,Jabari Smith,6-11,250lb,Center,,Auburn
3,Jaden Ivey,6-4,200lb,Guard,"South Bend, IN",Purdue
4,Jalen Duren,6-11,250lb,Center,"Sharon Hill, PA",Memphis
...,...,...,...,...,...,...
57,Gabe Brown,7-0,230lb,Center,"Stony Brook, NY",Michigan State
58,Pete Nance,6-10,225lb,Forward,"Akron, OH",Northwestern
59,Alondes Williams,6-5,201lb,Guard,"Milwaukee, WI",Wake Forest
60,Azuolas Tubelis,6-11,245lb,Forward,"Vilnius, Lithuania",Arizona
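
The merged frame has 61 rows rather than 100 because `merge` defaults to an inner join, and rows with an empty scraped name match nothing. As a sketch on made-up sample rows (the frame names `basic` and `bio` are illustrative), a left join from the ESPN list with `indicator=True` keeps every player and flags those still missing biographical data:

```python
import pandas as pd

# Sample rows in the same shape as players_basic_df and players_df.
basic = pd.DataFrame({
    "Name": ["Chet Holmgren", "Jaden Hardy", "Paolo Banchero"],
    "School": ["Gonzaga", "", "Duke"],
})
bio = pd.DataFrame({
    "Name": ["Chet Holmgren", "", "Paolo Banchero"],
    "Height": ["7-0", "", "6-10"],
})

# Inner join (the default): players without a scraped profile disappear.
inner = bio.merge(basic, on="Name")

# Left join from the ESPN list: every player is kept, and _merge shows the source.
left = basic.merge(bio, on="Name", how="left", indicator=True)
missing = left.loc[left["_merge"] == "left_only", "Name"].tolist()

print(len(inner), len(left), missing)
```

The `missing` list identifies the players whose bios still need to be filled in by hand.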


In [97]:
#writes the data frame to a .csv file in the current working directory
merged_df.to_csv("players.csv")
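
By default `to_csv` also writes the row index as an unnamed first column, which shows up as `Unnamed: 0` when the file is read back. The sketch below, on a small sample frame and an in-memory buffer instead of a file, shows the effect of passing `index=False`:

```python
import io
import pandas as pd

df = pd.DataFrame({
    "Name": ["Chet Holmgren", "Paolo Banchero"],
    "School": ["Gonzaga", "Duke"],
})

# to_csv() with no path returns the CSV text, so no file is needed here.
with_index = pd.read_csv(io.StringIO(df.to_csv()))
without_index = pd.read_csv(io.StringIO(df.to_csv(index=False)))

print(list(with_index.columns))
print(list(without_index.columns))
```

Whether to keep the index is a matter of preference. Dropping it gives a file whose columns are exactly the biographical fields.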