# Obtaining, parsing and structuring static HTML websites

In this notebook we will learn how to scrape basic static, i.e. non-interactive, HTML-based websites. We will
- obtain the raw HTML content using the `requests` module
- convert the raw HTML into a format that is easier to search, or parse, using the `BeautifulSoup` module
- learn how to identify the elements of interest in the raw HTML using the browser's inspect functionality and the CSS SelectorGadget
- construct a table, or dataframe, with the popular data analysis library `pandas` and store the output locally in a standard spreadsheet format

1. Open the Anaconda Prompt and install the module `requests`, e.g. via `pip install requests`

In [28]:
import requests

In [29]:
seed = 'https://www.uni-potsdam.de/de/'

2. What data type is the object `seed`? How can you check?

3. Are you allowed to scrape this URL? Hint: Check the site's `robots.txt`
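
One way to check `robots.txt` rules programmatically is Python's standard-library `urllib.robotparser`. The sketch below is only an illustration: the rules in `sample_robots_txt` and the `/private/` path are hypothetical and not taken from the university's site, so that the example runs without a network connection.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, used here so the example runs offline.
# For a real site, call parser.set_url("<site>/robots.txt") and parser.read().
sample_robots_txt = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(sample_robots_txt.splitlines())

# can_fetch(useragent, url) returns True if the rules allow the request.
allowed_home = parser.can_fetch("*", "https://www.uni-potsdam.de/de/")
allowed_private = parser.can_fetch("*", "https://www.uni-potsdam.de/private/page.html")
print(allowed_home, allowed_private)
```

The real rules have to be read from the site itself; the result above only reflects the made-up sample.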

In [30]:
html = requests.get(seed)

4. Was the request successful? How can you check the status? Hint: Explore the available attributes and methods using Jupyter's auto-complete functionality, i.e. type a dot after the object you're investigating and press <kbd>Tab</kbd>

5. Which attribute is most informative with respect to the actual content? How many characters long is the raw HTML?

6. Display the first 518 characters of the `html` object.

7. Display the meta information that came with the HTTP response, e.g. its date. Note that you can specify the `User-Agent` header that the server receives; servers may use it to tailor the response (the representation of the website), e.g. for desktop vs. mobile devices. If you do not specify it, `requests` sends a default value that identifies the client as `python-requests` followed by its version number. A browser, by contrast, typically reveals more, such as the operating system and browser version.
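
A custom `User-Agent` can be passed through the `headers` argument. The following sketch builds a request without sending it, so the headers can be inspected offline; the identifier `scraping-course-notebook/0.1` is a hypothetical placeholder, not a required value.

```python
import requests

# The default User-Agent string that requests would send.
print(requests.utils.default_user_agent())

# Hypothetical identifier; replace it with a string describing your own project.
custom_headers = {"User-Agent": "scraping-course-notebook/0.1"}

# Preparing a request builds it without sending anything over the network,
# which lets us inspect the headers that would be transmitted.
prepared = requests.Request(
    "GET", "https://www.uni-potsdam.de/de/", headers=custom_headers
).prepare()
print(prepared.headers["User-Agent"])
```

To actually send the request with these headers, `requests.get(seed, headers=custom_headers)` should work in the same way.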

The cell below saves the response's `text` attribute locally as an HTML file.

In [40]:
with open('Uni_Potsdam.html', 'w', encoding='utf-8') as f:
    
    f.write(html.text)

8. Install the module `BeautifulSoup` via `pip install beautifulsoup4`

In [42]:
from bs4 import BeautifulSoup

In [44]:
soup = BeautifulSoup(html.text, "html.parser")

9. Search the BeautifulSoup object `soup` for all hyperlinks. Hint: In an HTML document, elements that link to another page are marked by the tag `a` and follow the structure `<a href="..." ...>text</a>`. Hint: Use `soup`'s method `find_all()`, where the input argument is the tag name. What object type is the output? Can you iterate over it? How many hyperlink elements does the HTML file contain?

10. Convert the result of `find_all()` into a "plain" Python list containing the elements' **text** attributes by iterating over it. Hint: Instantiate an empty `list` object, write a for-loop and `append` each element's text to the list. You may also remove unwanted whitespace with the string method `strip`.
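
To get a feeling for `find_all()` before applying it to the full page, it may help to try it on a tiny document. The HTML in `sample_html` is a hypothetical miniature page made up for this sketch.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature page, used so the example runs without a network request.
sample_html = """
<html><body>
  <a href="/de/studium">Studium</a>
  <a href="https://example.org" title="External">Extern</a>
  <p>No link here</p>
</body></html>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

anchors = sample_soup.find_all("a")
print(type(anchors).__name__)  # name of the container type returned by find_all
print(len(anchors))            # number of <a> elements found

# The result can be iterated over like a list.
for a in anchors:
    print(a.get("href"), a.text)
```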

In [61]:
empty_list = []

for link in soup.find_all('a'):
    
    empty_list.append(link.text.strip())

#### Pro tip
Instead of explicitly writing a for-loop to extract specific values from a collection, you can combine Python's built-in `map` function with a `lambda` expression in a one-liner.

In [71]:
results_list = list(map(lambda x: x.text.strip(), soup.find_all('a')))
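
A list comprehension is another common way to write the same one-liner. The sketch below compares both variants on a small, made-up HTML snippet rather than on the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical two-link snippet with stray whitespace around the link texts.
sample_soup = BeautifulSoup('<a href="/a"> First </a><a href="/b">Second\n</a>', "html.parser")

via_map = list(map(lambda x: x.text.strip(), sample_soup.find_all("a")))
via_comprehension = [a.text.strip() for a in sample_soup.find_all("a")]

# Both approaches should produce the same list of cleaned link texts.
print(via_map == via_comprehension)
```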

11. Identify the element whose text is equal to "alle Artikel" ("all articles"). Return the element's position (`index`) within the list.

In [74]:
results_list.index('alle Artikel')

224

12. Obtain the value of this element's `href` attribute. It should be a URL pointing to the page where the news of Universität Potsdam is collected.

In [78]:
new_seed = soup.find_all('a')[224].get('href')

In [79]:
new_seed

'https://www.uni-potsdam.de/nachrichten.html'

13. Write a function that takes a string (e.g. a URL) as input and returns a ready-to-parse `BeautifulSoup` object.

In [80]:
def URL_to_BS(url):
    """Download the page at `url` and return it as a parsed BeautifulSoup object."""
    
    response = requests.get(url)
    soup = BeautifulSoup(response.text, "html.parser")
    
    return soup

In [81]:
news_soup = URL_to_BS(new_seed)

14. Open the `new_seed` URL in your browser and enable the CSS SelectorGadget. Highlight the box containing the first article. The other, similar boxes should be highlighted as well. Copy the identified CSS selector and search the `news_soup` object again, this time for elements matching the CSS selector you found (use `.select()` instead of `find_all()`). Store the subset of elements in a list. You can achieve all of this in one line of code. How many items does this list contain?

In [139]:
news_list = list(news_soup.select('.up-news-list-item'))

In [140]:
len(news_list)

10

15. Split the list's elements into their hyperlinks (`href` attribute) and titles (`title` attribute).

In [169]:
link_list = []
title_list = []

for item in news_list:
    
    # the first <a> element in each news box provides both the link and the title
    anchor = item.find("a")
    sub_link = anchor.get('href')
    sub_title = anchor.get('title')
    
    # keep only relative links and prepend the domain to make them absolute
    if isinstance(sub_link, str) and 'www' not in sub_link:
        
        link_list.append('https://www.uni-potsdam.de' + sub_link)
        title_list.append(sub_title)
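
As an alternative to concatenating the domain by hand, the standard library's `urllib.parse.urljoin` resolves relative links against a base URL and leaves absolute links unchanged. Note that this differs from the cell above, which skips absolute links entirely. The `href` values below are hypothetical examples.

```python
from urllib.parse import urljoin

base_url = "https://www.uni-potsdam.de"

# Hypothetical href values of the kinds a page may contain.
sample_hrefs = [
    "/de/nachrichten/detail/example-article",           # root-relative link
    "https://www.uni-potsdam.de/de/nachrichten/other",  # already absolute link
]

absolute_links = [urljoin(base_url, href) for href in sample_hrefs]
for link in absolute_links:
    print(link)
```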

In [170]:
lot = list(zip(title_list, link_list))

In [171]:
D = dict(lot)

In [184]:
import json

In [185]:
with open('Uni_Potsdam_dict.json', 'w', encoding='utf-8') as f:
    
    json.dump(D, f, ensure_ascii=False)

In [186]:
with open('Uni_Potsdam_dict.json', 'r', encoding='utf-8') as f:
    
    D_read = json.load(f)

In [187]:
D == D_read

True

## Pagination
You have probably realised that the articles presented on the first news page are not the University of Potsdam's entire collection. Your goal is to retrieve all articles available on the university's website, which you can do by applying your new knowledge repeatedly.

16. Figure out how many pages of articles there are in total. You can do this manually, e.g. by inspecting the URL as you click through the collection in your browser, or programmatically by writing a `while` loop that continues until some condition, such as a successful status returned by your request, is no longer met.
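
The programmatic variant could look roughly like the sketch below. It assumes that pages are numbered consecutively from 1 and that the server answers with a non-200 status once the pages run out; both are assumptions to verify for the real site. The function `count_pages`, the URL pattern and the stand-in `fake_status` are hypothetical names introduced for this illustration.

```python
def count_pages(get_status, url_template, max_pages=1000):
    """Count consecutive pages, starting at 1, that answer with status 200.

    get_status: callable that takes a URL and returns an HTTP status code,
                e.g. lambda url: requests.get(url).status_code
    url_template: hypothetical URL pattern with a {page} placeholder
    max_pages: safety cap so that the loop always terminates
    """
    page = 1
    while page <= max_pages and get_status(url_template.format(page=page)) == 200:
        page += 1
    return page - 1


# Offline stand-in for a server that has exactly 3 pages (assumption for the demo).
def fake_status(url):
    page = int(url.rsplit("=", 1)[1])
    return 200 if page <= 3 else 404


n_pages = count_pages(fake_status, "https://example.org/news?page={page}")
print(n_pages)
```

Some sites return status 200 with an empty list for out-of-range pages; in that case, checking whether `.select()` still finds any article boxes may be a more reliable stopping condition.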

17. Once you know the number of pages, write a loop that passes through them and appends each element to a list. Bonus: Record the number of the page on which each element was found. Hint: You can use Python's built-in `enumerate` function. Separate the hyperlinks and titles as above, combine them into tuples again, convert these into a dictionary and save it as a JSON file.
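
A minimal sketch of the bookkeeping with `enumerate`, assuming the scraping step has already produced one list of `(title, link)` tuples per page. The titles, links and the structure of `pages` are hypothetical placeholders.

```python
import json

# Hypothetical result of scraping two pages: one list of (title, link) tuples per page.
pages = [
    [("Article A", "https://example.org/a"), ("Article B", "https://example.org/b")],
    [("Article C", "https://example.org/c")],
]

records = {}
# start=1 makes the counter match the human-readable page number.
for page_number, items in enumerate(pages, start=1):
    for title, link in items:
        # Keep the page number next to the link so it can be traced later.
        records[title] = {"link": link, "page": page_number}

# Serialised to a string here; json.dump with an open file writes it to disk instead.
json_text = json.dumps(records, ensure_ascii=False, indent=2)
print(json_text)
```

Because titles serve as dictionary keys, two articles with the same title would overwrite each other; using the link as the key is one way around that.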

## Analysis
Actually, having a complete list of links is sufficient for the next task. We want to iterate over the entire article collection and conduct a simple analysis that covers the text, the images and the publication record.

18. Read in the JSON file you stored in step 17 and iterate over each hyperlink. In each iteration, obtain the HTML, parse it and identify the elements holding the publication date, the contact, the contact's email address, the image's hyperlink/reference and the length of the main text body. Define an appropriate data type for each field and append the fields **as a dictionary** to a list in each iteration. Convert the final list into a `pandas` dataframe and save it as a `.csv`.
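
The last part of this step, turning a list of dictionaries into a dataframe and a CSV, can be sketched as follows. The field names and all values in `rows` are hypothetical stand-ins for what you extract from the article pages.

```python
import io
import pandas as pd

# Hypothetical records, standing in for the values extracted from each article page.
rows = [
    {"publication_date": "2021-03-01", "contact": "A. Example", "email": "a@example.org",
     "image_url": "https://example.org/a.jpg", "text_length": 1200},
    {"publication_date": "2021-03-15", "contact": "B. Example", "email": "b@example.org",
     "image_url": None, "text_length": 800},
]

# Each dictionary becomes one row; the keys become the column names.
df = pd.DataFrame(rows)

# Written to an in-memory buffer here; pass a filename such as "articles.csv" instead.
buffer = io.StringIO()
df.to_csv(buffer, index=False)
print(buffer.getvalue())
```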

19. Convert the `publication_date` into a `pandas` `datetime` object and plot a time series of published articles on a daily basis. Bonus: Aggregate the time series into monthly frequency. In which month-year were most articles published?
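
A sketch of the date handling, assuming the dates have already been extracted as strings. The sample dates are made up and given in ISO format; the format used on the real site may differ and would then need to be passed to `pd.to_datetime` via its `format` argument.

```python
import pandas as pd

# Hypothetical publication dates in ISO format; the real site's format may differ.
df = pd.DataFrame(
    {"publication_date": ["2021-03-01", "2021-03-01", "2021-03-15", "2021-04-02"]}
)
df["publication_date"] = pd.to_datetime(df["publication_date"])

# Number of articles per day, in chronological order.
daily = df.groupby(df["publication_date"].dt.date).size()

# Number of articles per month, using monthly periods.
monthly = df.groupby(df["publication_date"].dt.to_period("M")).size()

print(daily)
print(monthly)
print(monthly.idxmax())  # the month with the most articles
```

Calling `daily.plot()` afterwards should draw the time series, provided `matplotlib` is installed.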