# Crawling NBA teams and players

## 1. Introduction
I decided to implement a crawler that extracts information about NBA teams and players. The starting URL is 'http://www.espn.com/nba/players', which belongs to ESPN, an American company focused on sports broadcasting. I chose this site because I ran into some issues (described in section 5) with the official NBA site.

## 2. Issues with crawling and description of robots.txt
The main issues that can arise when crawling are:
- Technical limitations: if content is generated dynamically with JavaScript, a bot that only downloads the HTML may not be able to see it.
- Server overload: the website's servers can become slow or unresponsive if the load or the frequency of requests is too high.
- Duplicate content: if a bot crawls multiple versions of the same page, the results will contain duplicates.
- Security issues: a bot can get blocked if it violates the security measures of a website.
- Legal issues: crawling private or copyrighted information may be illegal or unethical.
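
The duplicate-content point can be mitigated by comparing URLs in a normalized form before requesting them. The following is only a minimal sketch: `normalize_url` and `unique_urls` are hypothetical helpers that are not part of the crawler in this notebook, and treating a trailing slash as irrelevant is an assumption that may not hold for every site.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_url(url):
    """Return a canonical form of url so that trivial variants compare equal."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    netloc = parts.netloc.lower()
    # Drop a trailing slash from the path, but keep the root path '/'
    # (assumption: the site serves the same page with and without it)
    path = parts.path.rstrip('/') or '/'
    # Discard the fragment, which is never sent to the server
    return urlunsplit((scheme, netloc, path, parts.query, ''))


def unique_urls(urls):
    """Yield each URL once, in order, skipping variants already seen."""
    seen = set()
    for url in urls:
        key = normalize_url(url)
        if key not in seen:
            seen.add(key)
            yield url


candidates = [
    'http://www.espn.com/nba/players',
    'http://WWW.ESPN.com/nba/players/',
    'http://www.espn.com/nba/players#top',
    'http://www.espn.com/nba/teams',
]
for url in unique_urls(candidates):
    print(url)
```

Using a set instead of a list for the already-seen URLs keeps the membership test fast as the number of pages grows.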

The robots.txt file of this site can be seen here: http://www.espn.com/robots.txt
It sets rules that are valid for all user-agents. First, there are many "Disallow" lines that specify the URL paths that should not be crawled (scoreboard, week, year, admin, boxscore, etc.). Then there are a few "Allow" lines that, conversely, specify paths that may be crawled even if they are under a disallowed parent directory. In this particular robots.txt, the "Allow" lines cover the AMP version of the website. Finally, the "Sitemap" line gives the location of the website's sitemap, a file that lists the pages the site offers to crawlers for discovery. In this case the sitemap is located at https://www.espn.com/sitemap.xml, and crawlers can use it to discover the pages of the website.
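
These rules can also be checked programmatically with the standard library's `urllib.robotparser`. The sketch below parses a small sample file; the contents of `SAMPLE_ROBOTS` are made up for illustration and are not copied from the real ESPN robots.txt, and `build_parser` is a hypothetical helper.

```python
from urllib.robotparser import RobotFileParser

# Illustrative rules only (an assumption, not the real ESPN file)
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /admin
Disallow: /nba/boxscore
Sitemap: https://www.espn.com/sitemap.xml
"""


def build_parser(robots_text):
    """Build a RobotFileParser from the text of a robots.txt file."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


parser = build_parser(SAMPLE_ROBOTS)

# A path that no rule mentions may be fetched
print(parser.can_fetch('*', 'http://www.espn.com/nba/players'))
# A path under a "Disallow" rule may not
print(parser.can_fetch('*', 'http://www.espn.com/nba/boxscore'))
# Sitemap locations declared in the file (Python 3.8+)
print(parser.site_maps())
```

Against the live site, the parser would instead be pointed at the real file with `set_url('http://www.espn.com/robots.txt')` followed by `read()`, and `can_fetch` could be called before each request.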

## 3. Input / Output
The input is the start URL and the output consists of two JSON files: one contains information about each NBA team, the other about each NBA player. The output data are contained in the files 'teams.json' and 'players.json'.
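
One caveat about the output format: a file that holds several JSON objects written one after another is not a single valid JSON document, so it cannot be read back with one `json.load` call. A possible alternative, sketched here under assumptions, is to collect the records in a list and write them once as a JSON array. `write_records` is a hypothetical helper and the two records are made-up placeholders, not crawled data.

```python
import io
import json


def write_records(records, fh):
    """Write all records to the open file fh as one JSON array."""
    json.dump(records, fh, indent=2)


# Made-up placeholder records standing in for crawled teams
teams = [
    {'name': 'Team A', 'record': '10-5'},
    {'name': 'Team B', 'record': '7-8'},
]

buffer = io.StringIO()  # stands in for open('teams.json', 'w')
write_records(teams, buffer)

# The whole output can be read back with a single call
loaded = json.loads(buffer.getvalue())
print(len(loaded))

# By contrast, two objects simply concatenated are rejected by the parser
try:
    json.loads('{"a": 1}{"b": 2}')
except json.JSONDecodeError as err:
    print('not valid JSON:', err.msg)
```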

## 4. Implementation

In [None]:
# Import required packages
import json
import time
import requests
from bs4 import BeautifulSoup

In [None]:
# Definition of some useful functions

# Return the text of the last <div> inside x, or 'NONE' if x contains no <div>
# (handles incomplete values, explained later in the 'issues' section)
def check_null_other(x):

    res = x.findAll('div')
    # findAll returns an empty list (never None) when nothing matches
    if not res:
        return 'NONE'
    return res[-1].getText()


# Return the text of the stat value inside x, or 'NONE' if it is missing
# (handles incomplete values, explained later in the 'issues' section)
def check_null_stats(x):

    res = x.find('div', class_='StatBlockInner__Value')
    if res is None:
        return 'NONE'
    return res.getText()


# Function to get a player's information
def crawl_player_info(page, fp):

    soup = BeautifulSoup(page.content, 'html.parser')

    # Name and surname (kept as plain strings, 'NONE' if missing)
    name_surname = ['NONE', 'NONE']
    header = soup.find('h1', class_='PlayerHeader__Name')
    if header is not None:
        # findAll returns an empty list (never None) when nothing matches
        spans = header.findAll('span')
        if len(spans) >= 2:
            name_surname = [spans[0].getText(), spans[1].getText()]

    # General info (positioned under the name): team name, jersey number and role
    team = jersey_num = position = 'NONE'
    info_s = soup.find('ul', class_='PlayerHeader__Team_Info')
    if info_s is not None:
        info_s = info_s.findAll('li')
        if len(info_s) >= 3:
            team = info_s[0].getText()
            jersey_num = info_s[1].getText()
            position = info_s[2].getText()

    # Statistics: points per game, rebounds per game,
    # assists per game, field goal percentage
    stats = ['NONE'] * 4
    stats_list = soup.find('ul', class_='StatBlock__Content')
    if stats_list is not None:
        values = [check_null_stats(s) for s in stats_list.findAll('li')]
        # Pad with 'NONE' so that four values are always available
        stats = (values + stats)[:4]

    # Other info: height, weight, birthdate/age, college,
    # which draft he was selected in, status of his career
    hw = ['NONE', 'NONE']
    birthdate = college = draft = status = 'NONE'
    other = soup.find('ul', class_='PlayerHeader__Bio_List')
    if other is not None:
        other = other.findAll('li')
        if len(other) >= 5:
            # Height and weight share one item, separated by a comma
            hw_parts = check_null_other(other[0]).split(',')
            if len(hw_parts) >= 2:
                hw = [hw_parts[0], hw_parts[1].strip()]
            birthdate = check_null_other(other[1])
            college = check_null_other(other[2])
            draft = check_null_other(other[3])
            status = check_null_other(other[4])

    player = {
        'name': name_surname[0],
        'surname': name_surname[1],
        'team': team,
        'jersey number': jersey_num,
        'role': position,
        'ppg': stats[0],
        'rpg': stats[1],
        'apg': stats[2],
        # Append the percent sign only when a value is present
        'fg%': stats[3] if stats[3] == 'NONE' else stats[3] + ' %',
        'height': hw[0],
        'weight': hw[1],
        'birthdate (age)': birthdate,
        'draft': draft,
        'college': college,
        'status': status,
    }
    json_player = json.dumps(player, indent=2)
    fp.write(json_player)

    return


# Function to get a team's information and the links to its players
def crawl_team_roster(page, ft, fp):

    soup = BeautifulSoup(page.content, 'html.parser')

    # Name of the team (city and name)
    # findAll returns an empty list (never None) when nothing matches
    name = soup.findAll('span', class_='db')
    if len(name) < 2:
        name = 'NONE'
    else:
        name = name[0].getText() + ' ' + name[1].getText()

    # Record (first item) and position (second item)
    position = 'NONE'
    record = 'NONE'
    record_items = soup.find('ul', class_='list flex ClubhouseHeader__Record n8 ml4')
    if record_items is not None:
        record_items = record_items.findAll('li')
        if len(record_items) >= 2:
            position = record_items[1].getText()
            record = record_items[0].getText()

    # Write in json file
    team = {
        'name': name,
        'record': record,
        'position': position,
    }
    json_team = json.dumps(team, indent=2)
    ft.write(json_team)

    # Now parse the information of all the team's players
    crawled = []
    links = soup.findAll('a', class_='AnchorLink')
    if not links:
        print("Players' links not found, try again!!")
        return

    # Keep only the links to player pages (anchors without an href are skipped);
    # [0::2] keeps every other link, presumably because each player is linked twice
    links = [l.get('href') for l in links if 'player/_/' in (l.get('href') or '')][0::2]

    for link in links:
        if link not in crawled:
            page = requests.get(link)
            if page.status_code == 200:
                crawl_player_info(page, fp)
                crawled.append(link)
            else:
                print('Code ' + str(page.status_code) + ' for link: ' + link)
            # Wait between requests to avoid overloading the server
            time.sleep(1)

    return

In [None]:
# Setup of variables
url = 'http://www.espn.com/nba/players'
start = requests.get(url)
soup = BeautifulSoup(start.content, 'html.parser')
visited = []

# Links of each team (anchors without an href are skipped)
links = [l.get('href') for l in soup.findAll('a') if 'name' in (l.get('href') or '')]
# Turn each team link into the link of its roster page
links = [l.replace('_', 'roster/_', 1) for l in links]

# json files where results will be written
# !!! directories have to be changed for the faculty gitlab !!!
# The 'with' statement closes both files even if an error occurs
with open("players1.json", "w") as fp, open("teams1.json", "w") as ft:
    for link in links:
        if link not in visited:
            page = requests.get(link)
            if page.status_code == 200:
                crawl_team_roster(page, ft, fp)
                visited.append(link)
            else:
                print('Code ' + str(page.status_code) + ' for link: ' + link)
            # Wait between requests to avoid overloading the server
            time.sleep(1)

## 5. Issues experienced and possible extensions/improvements

At first I tried to crawl this information from the official NBA site (https://www.nba.com/players), but I ran into three main problems:
- Sometimes the server sent empty pages, and even drastically reducing the frequency of requests (one every 10 seconds) didn't solve the problem.
- The page shows only the first 50 players; to see all of them you have to select the 'all' filter from a dropdown menu. Clicking on the filter doesn't change the URL, so I think the content is generated dynamically, and I don't know how to deal with that.
- Sometimes I got a 'TimeoutError' and I couldn't find its source.

With the ESPN site I didn't experience many issues, except that the elements of the page aren't always given a specific class/id, which forced me to nest calls to the 'find' method. The only important issue I faced is that some players have incomplete data (for example, a free agent doesn't currently have a jersey number, and a European player who didn't attend an American college has no college listed), so I had to check every single value and replace it with 'NONE' if missing. In any case, there are hundreds of players with 15 attributes each and only 36 'NONE' values in total, so the gaps are barely noticeable.
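
Intermittent failures such as the 'TimeoutError' mentioned above are often handled by retrying the request with an increasing delay. The following is a minimal sketch, not what this notebook does: `fetch_with_retry` is a hypothetical helper, and the `fetch` argument stands for any function that downloads a URL. The flaky function below only simulates a server that fails twice.

```python
import time


def fetch_with_retry(fetch, url, retries=3, delay=1.0, sleep=time.sleep):
    """Call fetch(url) until it succeeds or `retries` attempts have failed.

    The wait between attempts doubles each time: delay, 2*delay, 4*delay, ...
    """
    if retries < 1:
        raise ValueError('retries must be at least 1')
    last_error = None
    for attempt in range(retries):
        try:
            return fetch(url)
        except OSError as err:  # TimeoutError and ConnectionError are OSErrors
            last_error = err
            # Do not wait after the final failed attempt
            if attempt < retries - 1:
                sleep(delay * 2 ** attempt)
    raise last_error


# Simulated download that times out on the first two calls
attempts = []

def flaky_fetch(url):
    attempts.append(url)
    if len(attempts) < 3:
        raise TimeoutError('simulated timeout')
    return 'content of ' + url


# sleep is replaced with a no-op so the demonstration runs instantly
result = fetch_with_retry(flaky_fetch, 'http://www.espn.com/nba/players',
                          sleep=lambda seconds: None)
print(result, 'after', len(attempts), 'attempts')
```

With the requests library, `fetch` could be something like `lambda u: requests.get(u, timeout=10)`; the exceptions raised by requests derive from `OSError`, so the same `except` clause would cover them.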

To conclude, a possible extension could be crawling the players' data for all past seasons and analyzing their performance over the years, or finding out which team changed the most players over time. A possible improvement could be implementing a crawler able to deal with dynamically generated content.