# Assignment 1: Web Scraping

## Objective

Data scientists often need to crawl websites and turn the crawled data (HTML pages) into structured data (tables). Web scraping is therefore an essential skill for every data scientist. In this assignment, you will learn the following:


* How to use [requests](http://www.python-requests.org/en/master/) to download HTML pages from a website
* How to select content on a webpage with [lxml](http://lxml.de/)

You can use either a Spark DataFrame or a [pandas.DataFrame](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html) to do the assignment. pandas.DataFrame has a richer API, but it runs on a single machine and is not designed for distributed computing.

## Preliminary

If this is your first time writing a web scraper, you will need some basic knowledge of HTML, the DOM, and XPath. I found this to be a good resource: [https://data-lessons.github.io](https://data-lessons.github.io/library-webscraping-DEPRECATED/). Please take a look at:

* [Selecting content on a web page with XPath](https://data-lessons.github.io/library-webscraping-DEPRECATED/xpath/)
* [Web scraping using Python: requests and lxml](https://data-lessons.github.io/library-webscraping-DEPRECATED/04-lxml/)

Please let me know if you find a better resource. I'll share it with the other students.
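
As a quick warm-up before working through those lessons, the sketch below parses a small HTML snippet with lxml and selects nodes using XPath. The snippet, the names, and the links in it are made up for illustration; they are not taken from the SFU page, whose structure you will need to inspect yourself.

```python
import lxml.html

# A tiny, made-up page: the structure and values are illustrative only.
html = """
<html><body>
  <div class="person">
    <h4>Ada Example, Professor</h4>
    <p><a href="/people/ada.html">Profile</a></p>
  </div>
  <div class="person">
    <h4>Bob Sample, Lecturer</h4>
    <p><a href="/people/bob.html">Profile</a></p>
  </div>
</body></html>
"""

# Parse the HTML string into a tree of elements
tree = lxml.html.fromstring(html)

# '//' searches the whole document, whatever node the query starts from.
headings = tree.xpath("//div[@class='person']/h4/text()")

# A leading './/' restricts the search to descendants of the current node.
first_person = tree.xpath("//div[@class='person']")[0]
links_in_first = first_person.xpath(".//a/@href")
links_anywhere = first_person.xpath("//a/@href")

print(headings)
print(links_in_first)
print(links_anywhere)
```

The difference between `//` and `.//` matters whenever you run a query from a sub-element rather than from the root: here `links_in_first` holds one link, while `links_anywhere` holds both.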

## Overview

Imagine you are a data scientist working at SFU. One day, you want to analyze CS faculty data and answer two interesting questions:

1. Who are the CS faculty members?
2. What are their research interests?

To do so, the first thing is to figure out what data to collect.

## Task 1: SFU CS Faculty Members

You find that there is a web page on the CS school website that lists all the faculty members along with their basic information.

In Task 1, your job is to write a web scraper to extract the faculty information from this page: [https://www.sfu.ca/computing/people/faculty.html](https://www.sfu.ca/computing/people/faculty.html).




### (a) Crawling Web Page

A web page is essentially a file stored on a remote machine (called a web server). You can use [requests](http://www.python-requests.org/en/master/) to open such a file and read data from it. Please complete the following code to download the HTML page and save it as a text file (like [this](./faculty.txt)).

In [70]:
import requests
# 1. Download the webpage
response = requests.get('https://www.sfu.ca/computing/people/faculty.html')
response.raise_for_status()  # stop early if the download failed
# 2. Save it as a text file (named faculty.txt)
with open("faculty.txt", "w") as file:
    file.write(response.text)

### (b) Extracting Structured Data

An HTML page follows the Document Object Model (DOM), which models the page as a tree in which each node is an object representing a part of the page. The nodes can be searched and extracted programmatically using XPath. Please complete the following code to transform the above HTML page into a CSV file (like [this](./faculty_table.csv)).

In [71]:
import lxml.html 
import pandas as pd
import re

# Initialize regex patterns (raw strings, so that backslashes reach the regex engine unchanged)
area_pattern = r'Area:(.*)\n'
profile_pattern = r'^(.*|/).(computing/).*\.(html)$'
homepage_pattern = r'((^.*.(/~).*.)|(^.*.(people.html|/)$))'

# Read downloaded faculty.txt file
def read_file(filename):
    # The with-statement closes the file even if reading fails
    with open(filename, 'r') as file:
        return file.read()

#Parse tree and get the required sub section
def get_sub_tree(file_value):
    
    # Parse the HTML page as a tree structure
    tree = lxml.html.fromstring(file_value)
    
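    # Extract related content from the tree using a CSS selector and XPath
    title_selected_elem = tree.cssselect('div.parsys_column.cq-colctrl-lt0.people.faculty-list')[0]
    # The leading '.' keeps the search inside the selected element; a bare '//' would search the whole page
    selected_sub_tree = title_selected_elem.xpath(".//div[@class='text']")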
    return selected_sub_tree

# get name and rank value
def get_name_rank(element):
    name_rank_value = element.xpath("h4/text()")
    value = name_rank_value[0].split(',')
    # Strip the whitespace left around each part by the split
    name = value[0].strip()
    rank = value[1].strip()
    return name,rank

# get area value
def get_area(element):
    temp = element.text_content()
    result = re.search(area_pattern, temp)
    # re.search returns None when the element has no 'Area:' line
    area = result.group(1).strip() if result else ''
    return area

# filter out profile links
def filter_profile_values(link):
    if 'people.html' not in link:
        return re.search(profile_pattern, link)

# get profile value
def get_profile(profile_homepage):
    profile_list = list(filter(filter_profile_values, profile_homepage))
    if len(profile_list) < 1:
        profile = ''
    else:
        profile = profile_list.pop()
        profile = re.search(r'computing(.*?)html', profile).group(1)
        profile =  'http://www.sfu.ca/computing'+profile+'html'
    return profile

# get homepage value
def get_homepage(profile_homepage):
    homepage_list = list(filter(lambda k: re.search( homepage_pattern, k) , profile_homepage))
    if len(homepage_list) < 1:
        homepage = ''
    else:
        homepage = homepage_list[0]
    return homepage

def main(filename):
    
    # Read downloaded faculty.txt file
    file_value = read_file(filename)

    # Collect one row per faculty member with fields 'Name','Rank','Area','Profile','Homepage'
    rows = []
    
    # Get sub section
    sub_tree = get_sub_tree(file_value)
    
    # Iterate through each element of the sub section
    for element in sub_tree:
        
        # Getting required field values
        name, rank = get_name_rank(element)
        area = get_area(element)
        profile_homepage = element.xpath("p/a/@href")
        profile = get_profile(profile_homepage)
        homepage = get_homepage(profile_homepage)
        
        # Collect the row (DataFrame.append was removed in pandas 2.0)
        rows.append({'Name': name,'Rank': rank, 'Area': area, 'Profile': profile,'Homepage':homepage})
    
    # Build the dataframe once from the collected rows
    df = pd.DataFrame(rows, columns=['Name','Rank','Area','Profile','Homepage'])
    
    # Save the extracted content as a CSV file (named faculty_table.csv)
    df.to_csv('faculty_table.csv', encoding='latin-1', index=False)
    
if __name__ == '__main__':
    filename = 'faculty.txt'
    main(filename)
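
The area extraction above relies on a regular expression that captures whatever follows `Area:` up to the end of the line. The following minimal sketch shows that idea in isolation on a made-up string; `sample_text` and the helper name `extract_area` are hypothetical and only stand in for the text of one faculty entry.

```python
import re

# Made-up text in the shape of one entry: a heading line, an "Area:" line, then more text.
sample_text = "Ada Example, Professor\nArea: Databases; Data Mining\nProfile & Contact Information\n"

# '.' does not match a newline, so the group stops at the end of the "Area:" line.
area_pattern = r'Area:(.*)\n'

def extract_area(text):
    """Return the text after 'Area:' on the same line, or '' if there is none."""
    match = re.search(area_pattern, text)
    return match.group(1).strip() if match else ''

print(extract_area(sample_text))
print(repr(extract_area("No area listed here\n")))
```

Because `re.search` returns `None` when nothing matches, checking the result before calling `group` avoids an `AttributeError` on entries that have no `Area:` line.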

## Task 2: Research Interests

Suppose you want to know the research interests of each faculty member. However, the web page crawled above does not contain this information.

### (a) Crawling Web Page

You notice that this information can be found on each faculty member's profile page. For example, you can find the research interests of Dr. Jiannan Wang at [http://www.sfu.ca/computing/people/faculty/jiannanwang.html](http://www.sfu.ca/computing/people/faculty/jiannanwang.html).


Please complete the following code to download the profile pages and save them as text files. There are 60 faculty members, so you need to download 60 web pages in total.

In [72]:
import requests
import pandas as pd
import os

# Download the profile pages of the 60 faculty members
def download_faculty_profile_pages():
    
    # Read faculty_table.csv to a pandas dataframe
    table = pd.read_csv("faculty_table.csv",encoding='latin1')
    
    # Fetch all faculty profile links
    faculty_profile_links = table['Profile'].fillna('')
    
    # Create the output folder once, before the loop
    os.makedirs('./faculty_profile_pages', exist_ok=True)
    
    # Iterate through each profile link and save the output html in faculty_profile_pages folder.
    # The row index is used as the file name so that each page can be matched back to its row.
    for index, profile_link in enumerate(faculty_profile_links):
        if len(profile_link) > 0:
            page_name = './faculty_profile_pages/'+str(index)+'.txt'
            response = requests.get(profile_link)
            with open(page_name, "w") as f:
                # Save each page as a text file (the with-statement closes it)
                f.write(response.text)

download_faculty_profile_pages()



### (b) Extracting Structured Data

Please complete the following code to extract the research interests of each faculty member, and generate a file like [this](./faculty_more_table.csv).

In [74]:
import lxml.html 
import pandas as pd
import re
import requests
import unicodedata

def add_faculty_research():
    
    # Read faculty_table.csv to a pandas dataframe
    table = pd.read_csv("faculty_table.csv",encoding='latin1').fillna('')
    
    # Fetch all faculty profile links
    faculty_profile_links = table['Profile'].fillna('')
    
    # One entry per row of the table, in row order ('' when there is no profile page)
    all_research_interests = []
    
    # Iterate through each profile link; the row index matches the saved file name
    for index, profile_link in enumerate(faculty_profile_links):
        faculty_research_interest = ''
        if len(profile_link) > 0:
            
            # Read the profile html page with the current index value
            profile_path = './faculty_profile_pages/'+str(index)+'.txt'
            with open(profile_path,'r') as file:
                file_value = file.read()
            
            # Parse the HTML page as a tree structure
            tree = lxml.html.fromstring(file_value)
            
            # Extract related content from the tree using XPath
            research_interest_section = tree.xpath("//div[@class='text parbase section']/div[contains(translate(., 'RI', 'ri'),'research interests')]/ul/li")
            
            faculty_research_interest = []
            
            # Iterate through each faculty research_interest and perform some cleaning
            for research_interest in research_interest_section:
                normalize_research_interest = unicodedata.normalize("NFKD",research_interest.text_content())
                cleaned_research_interest = normalize_research_interest.replace('\n','')
                faculty_research_interest.append(cleaned_research_interest)
        
        all_research_interests.append(faculty_research_interest)
    
    # Add the research interests as a new column (DataFrame.set_value was removed in pandas 1.0)
    table['Research_Interests'] = pd.Series(all_research_interests, index=table.index)
    
    # Save the extracted content as a CSV file (named faculty_more_table.csv)
    table.to_csv('faculty_more_table.csv', encoding='latin-1', index=False)

add_faculty_research()
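
The cleaning step above uses `unicodedata.normalize` with the `NFKD` form. The short sketch below shows what that does to a made-up list item containing a non-breaking space, a character that often appears in text copied out of HTML. The helper name `clean_text` and the sample string are hypothetical.

```python
import unicodedata

def clean_text(raw):
    """Normalize compatibility characters, drop newlines, and trim the ends."""
    normalized = unicodedata.normalize("NFKD", raw)
    return normalized.replace('\n', '').strip()

# '\xa0' is a non-breaking space; NFKD replaces it with an ordinary space.
raw_item = "Data\xa0Mining\n"
print(repr(clean_text(raw_item)))

# NFKD also splits accented letters into a base letter plus a combining mark,
# so a single 'é' (U+00E9) becomes two code points.
print(len(clean_text("\u00e9")))
```

If you would rather keep accented letters as single characters, the `NFKC` form applies the same compatibility mapping and then recomposes them.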





## Submission

Complete the code in this [notebook](A1.ipynb), and submit it to the CourSys activity `Assignment 1`.