<a href="https://colab.research.google.com/github/colinrsmall/ehm_roster_tools/blob/master/EP_Face_Scraper.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Instructions:

1. To add a league, copy an entry in the list below and replace its name and EliteProspects link with those of the league you want to scrape. Make sure that every entry except the last ends with a comma. The name you choose for an entry only affects how the output is named. You can get a league's URL by going to the league's homepage on EP and copying the URL for that page from your browser.

2. To change which season you're scraping for, change the season string following the list of leagues. The string should be of the format 'YYYY-YYYY' (such as '2019-2020' or '2017-2018').

3. If you want the scraper to print out links for players who are missing information on their EP page, change show_error_links to True.

4. When the settings in the cell below are correct, open the "Runtime" dropdown menu in the top bar and click "Run All".

5. Go to the section titled "Output"
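
Steps 1 and 2 can be sanity-checked before running anything. The following is a minimal sketch and not part of the original notebook: `check_settings` is a hypothetical helper, the second league entry (`'shl'` and its URL) is only an illustration of what an added entry might look like, and the URL prefix and consecutive-year checks are assumptions drawn from the examples above.

```python
import re

# Hypothetical example settings: each league entry is a (name, URL) tuple.
example_leagues = [
    ('ahl', 'https://www.eliteprospects.com/league/ahl'),
    ('shl', 'https://www.eliteprospects.com/league/shl'),  # assumed URL, for illustration
]
example_season = '2019-2020'


def check_settings(leagues, season):
    """Return a list of problems found; an empty list means the settings look fine."""
    problems = []
    for entry in leagues:
        if not (isinstance(entry, tuple) and len(entry) == 2):
            problems.append(f'Entry is not a (name, url) pair: {entry!r}')
            continue
        name, url = entry
        # Assumption: league pages live under this prefix.
        if not url.startswith('https://www.eliteprospects.com/league/'):
            problems.append(f'Unexpected league URL for {name!r}: {url}')
    match = re.fullmatch(r'(\d{4})-(\d{4})', season)
    if match is None:
        problems.append(f"Season should look like 'YYYY-YYYY', got {season!r}")
    elif int(match.group(2)) != int(match.group(1)) + 1:
        # Assumption: a season always spans two consecutive years.
        problems.append(f'Season years should be consecutive, got {season!r}')
    return problems


print(check_settings(example_leagues, example_season))
```

An empty list means nothing suspicious was found. A season such as `'2019-20'` would be reported because it does not match the 'YYYY-YYYY' format.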

In [1]:
leagues = [
           ('ahl', 'https://www.eliteprospects.com/league/ahl'),
]

season = '2019-2020'  # format: 'YYYY-YYYY'
show_error_links = True

# Expand this if you want to look at the code (optional)

In [2]:
# -p avoids an error if the folders already exist (e.g. when re-running the notebook)
!mkdir -p '/content/leagues/'
!mkdir -p '/content/faces/'

In [3]:
import requests, random, csv, traceback, time, urllib.request, os
from bs4 import BeautifulSoup
from tqdm.notebook import tqdm
from datetime import datetime
from google.colab import files

In [38]:
def get_name(player_page):
  # The header text holds the full name; split it into words
  name = player_page.find('div', class_='ep-entity-header__name').text.strip().split(' ')
  first_name = name[0]
  last_name = ' '.join(name[1:]).split('\n')[0].strip()
  return first_name, last_name


def get_dob(player_page):
  dob_search_text = """
                                    Date of Birth
                                """
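  dob_text = ""  # defined up front so the year-only fallback never hits an undefined name
  try:
    dob_text = player_page.find('div', text=dob_search_text).next_element.next_element.next_element.text.strip()
    # Full date listed ('%-d' and '%-m' drop leading zeros; they work on Linux, which Colab uses)
    dob = datetime.strptime(dob_text, '%b %d, %Y').strftime('%-d.%-m.%Y')
  except Exception:
    try:
      # Only the birth year is listed: default to January 1
      dob = datetime.strptime(dob_text, '%Y').strftime('1.1.%Y')
    except Exception:
      dob = ""
      first_name, last_name = get_name(player_page)
      print(f'Missing dob information: {first_name} {last_name}')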
  return dob



def scrape_player_page(link, league_name):
  player_page = requests.get(link, headers = {
        'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36',
        'Referer': 'https://google.com',
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
        'Accept-Encoding': 'gzip, deflate, br',
        'Accept-Language': 'en-US,en;q=0.9',
        'Pragma': 'no-cache',
    })
  # Name the parser explicitly so BeautifulSoup does not have to guess one
  player_page = BeautifulSoup(player_page.content, 'html.parser')

  first_name, last_name = get_name(player_page)

  dob = get_dob(player_page)

  player_image = player_page.select(".ep-entity-header__main-image")[0]['style'][25:-3]
  if "static" in player_image:
    print(f"No player image for {link}")
  else:
    urllib.request.urlretrieve("https://"+player_image, f"faces/{league_name}/{first_name}_{last_name}_{dob.replace('.', '_')}.jpg")
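
The date handling in `get_dob` can be tried in isolation, without fetching a page. The sketch below is a hypothetical stand-alone helper (`normalize_dob` is not part of the notebook) that assumes the same two input shapes the scraper handles, a full date such as 'Mar 07, 1995' or a bare year. It builds the output string by hand rather than with `'%-d'`, which is not available on every platform.

```python
from datetime import datetime


def normalize_dob(dob_text):
    """Convert 'Mon DD, YYYY' or 'YYYY' text to 'D.M.YYYY'; return '' if neither fits."""
    dob_text = dob_text.strip()
    try:
        parsed = datetime.strptime(dob_text, '%b %d, %Y')
    except ValueError:
        try:
            # A bare year parses to January 1 of that year.
            parsed = datetime.strptime(dob_text, '%Y')
        except ValueError:
            return ''
    # Format manually so no leading zeros appear, independent of platform.
    return f'{parsed.day}.{parsed.month}.{parsed.year}'


print(normalize_dob('Mar 07, 1995'))  # 7.3.1995
print(normalize_dob('1995'))          # 1.1.1995
```

Note that `'%b'` matches month abbreviations in the current locale, so this assumes an English locale.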

In [39]:
def scrape(draft = False):
  for league in tqdm(leagues, desc='Leagues'):
    os.makedirs(f"faces/{league[0]}/", exist_ok=True)
    # Get the league page's HTML and parse it with BeautifulSoup
    league_link = league[1]
    league_page = requests.get(league_link)
    league_page = BeautifulSoup(league_page.content, 'html.parser')
    # Collect team links from the standings table, dropping duplicates
    team_links = {team['href'] for team in league_page.select('table.standings.table-sortable > tbody > tr > .team > a')}

    for team_link in tqdm(team_links, desc='Teams'):
      team_page = requests.get(team_link)
      team_page = BeautifulSoup(team_page.content, 'html.parser')
      players = team_page.select('[data-sort-ajax-container="#roster"] > tbody > tr .txt-blue a[href]')
      player_links = [player['href'] for player in players]

      for link in tqdm(player_links, desc='Players', leave=False):
        try:
          scrape_player_page(link, league[0])
        except Exception as e:
          if "team-captaincy" not in link and show_error_links:
            traceback.print_exc()
            print(f'Missing player information for: {link}')
        time.sleep(random.random() * 3)

# Output

You should see three progress bars: one showing the progress through the leagues you want to scrape, one showing progress through all of the teams for a given league, and one showing progress through all of the players for a given team.

To download the .zip, click the folder icon in the bar at the top-left of the screen, then right-click the file "leagues.zip" and choose "Download".
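
The cells shown here save images under `faces/` but do not show a step that builds the archive. A minimal sketch of one way to produce it, assuming the images are in `/content/faces` and the archive should be `/content/leagues.zip`; `zip_folder` is a hypothetical helper, not part of the notebook.

```python
import os
import shutil


def zip_folder(folder, archive_name):
    """Zip the contents of `folder` into '<archive_name>.zip' and return the archive path."""
    if not os.path.isdir(folder):
        raise FileNotFoundError(f'No such folder: {folder}')
    # make_archive appends the '.zip' extension itself.
    return shutil.make_archive(archive_name, 'zip', root_dir=folder)


# Example usage in Colab (both paths are assumptions):
# zip_path = zip_folder('/content/faces', '/content/leagues')
# from google.colab import files
# files.download(zip_path)  # triggers a browser download instead of using the file pane
```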

In [40]:
scrape()
time.sleep(5)


No player image for https://www.eliteprospects.com/player/88706/ryan-culkin
No player image for https://www.eliteprospects.com/player/45281/maxim-lamarche
No player image for https://www.eliteprospects.com/player/88681/joe-cox


KeyboardInterrupt: ignored