_____________________________________________________________
# Coursework 1 - Python Web Scraper
_____________________________________________________________
**Web scraping** is essentially about downloading structured data from the web, selecting some of it, and passing it on to another process. This can be very handy when you have no data available and want to source it from the web, or when you want to keep track of information that appears there (e.g., monitoring news, stock prices, etc.).

In this coursework, you will apply the knowledge you have gained throughout the course to create a web scraper module that you can later use to gather the data that interests you the most.
___________
## Install & Import Packages
___________
You will be using `Python 3` and you will also need to install these 2 packages:
- `requests` - for performing the HTTP requests
- `BeautifulSoup4` - for handling the HTML processing

The executable cells below will install these packages to your current conda environment and import all the necessary modules required for this coursework.

`Note` Keep in mind that you will later have to move the imports to their corresponding modules. The installs do not need to be repeated: they are permanent (but only for this conda environment).

In [1]:
# # Install a conda package in the current Jupyter kernel
# import sys
# !conda install --yes --prefix {sys.prefix} requests
# !conda install --yes --prefix {sys.prefix} beautifulsoup4

In [2]:
from requests import get
from requests.exceptions import RequestException
from contextlib import closing
from bs4 import BeautifulSoup

___________
## 01. Web Requests
___________
Your first task will be to develop a set of functions that will help you to download the chosen web page content using the `requests` package and its `.get()` method.

Essentially, your function `get_url_content(url)` should accept a single `url` (string) argument and make a `GET` request to that URL. If nothing goes wrong, you end up with the raw HTML content of the page you requested. You should also anticipate problems that may arise when making the request (e.g., a bad URL, the remote server being down, etc.) and return `None` in those cases.

You may also use `with closing(get(url, stream=True)) as resp: ...` to ensure that any network resources are freed when they go out of scope at the end of that `with` block. This is good practice that helps to prevent fatal errors and/or network timeouts.
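
A minimal sketch of how such a function might look. The 10-second timeout and the helper `is_good_response` are assumptions made for illustration and are not prescribed by the coursework. Any `RequestException`, a non-200 status, or a non-HTML content type is treated as a failure.

```python
from contextlib import closing

from requests import get
from requests.exceptions import RequestException


def is_good_response(resp):
    """Hypothetical helper: True if the response succeeded and looks like HTML."""
    content_type = resp.headers.get("Content-Type", "").lower()
    return resp.status_code == 200 and "html" in content_type


def get_url_content(url):
    """Return the raw HTML (bytes) found at `url`, or None if anything goes wrong."""
    try:
        # closing() guarantees the connection is released when the block ends.
        # The timeout value is an assumption; pick one that suits your use case.
        with closing(get(url, stream=True, timeout=10)) as resp:
            if is_good_response(resp):
                return resp.content
            return None
    except RequestException:
        # Covers malformed URLs, DNS failures, refused connections, timeouts, etc.
        return None
```

A malformed address such as `get_url_content('not-a-valid-url')` makes `get` raise a `RequestException` subclass before any request is sent, so the function returns `None`.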

`Important!` You may prototype with your code here, but, after you are finished, create a module (file) `scrapper.py` and move your code there. Now you can test your solution by importing your `get_url_content` method from the `scrapper` module and calling it with a URL argument:

    from scrapper import get_url_content
    raw_html = get_url_content('https://www.cnbc.com/stocks/')
    len(raw_html)
    600151
    
    no_html = get_url_content('https://realpython.com/blog/nope-not-gonna-find-it')
    no_html is None

`Note` For more info/examples on `requests`, check [this documentation](https://www.w3schools.com/python/module_requests.asp) or [this tutorial](https://realpython.com/python-requests/).

___________
## 02. Wrangling HTML With BeautifulSoup
___________
After you retrieve the raw HTML material, you will have to parse it in order to keep only the material that is relevant to you.

For this, you will be using the `BeautifulSoup` library (which you installed earlier). The `BeautifulSoup` constructor parses raw HTML strings and produces an object that mirrors the HTML document’s structure and includes numerous methods for selecting, viewing, and/or manipulating the content. For more info, you may want to check its [documentation](https://www.crummy.com/software/BeautifulSoup/bs4/doc/), but generally, the following methods should suffice for the present exercise:

- `find_all(element_tag, ...)`: return all HTML elements from a webpage that have a specified tag and/or some additional attributes. 
In order to get the first one, you may use `find()` instead.
    
- `get_text()`: extract the text from an HTML element.
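
As a quick illustration of these methods, the sketch below runs them on a small hand-written HTML string. The markup, the `headline` class, and the variable names are invented for this example.

```python
from bs4 import BeautifulSoup

# Made-up markup, used only to demonstrate the calls.
sample_html = """
<html><body>
  <h2 class="headline">Markets rally</h2>
  <h2 class="headline">Tech stocks dip</h2>
  <h2 class="sidebar">Weather</h2>
</body></html>
"""

soup = BeautifulSoup(sample_html, "html.parser")

all_h2 = soup.find_all("h2")                         # every <h2> element
headlines = soup.find_all("h2", class_="headline")   # only <h2 class="headline">
first_h2 = soup.find("h2")                           # just the first match

headline_texts = [tag.get_text(strip=True) for tag in headlines]

print(len(all_h2))          # 3
print(headline_texts)       # ['Markets rally', 'Tech stocks dip']
print(first_h2.get_text())  # Markets rally
```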

You should develop a method `parse_content(raw_html, element_tag, ...)` that:
1. Takes raw `HTML` content as input and passes it to the `BeautifulSoup` constructor.
2. Uses the `find_all()` method to retrieve a list of the elements of your choice (you might also add additional arguments to select a `class`, etc.).
3. Uses the `get_text()` method to retrieve the text content of all the elements retrieved in step 2.

The method should return a list of the texts retrieved from the elements, and an additional list of attributes related to them (e.g., dates, URL links, or anything else that you find useful).

    You may find a dictionary format more suitable if you decide to store multiple element attributes that do not always exist.
    
`Note!` - Feel free to use any other `BeautifulSoup` methods if you believe them to fit your approach better.
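
One possible shape for `parse_content`, offered as a sketch rather than the expected solution. It assumes that extra keyword arguments are forwarded to `find_all()` and that the attributes worth keeping are the element's `class` and the first link found in it. Both choices are illustrative, and missing values are stored as `None`.

```python
from bs4 import BeautifulSoup


def parse_content(raw_html, element_tag, **attrs):
    """Return (texts, details) for every `element_tag` element in `raw_html`.

    `details` holds one dictionary per element; the keys chosen here
    ("href" and "class") are only an example of what could be stored.
    """
    if raw_html is None:
        return [], []

    soup = BeautifulSoup(raw_html, "html.parser")
    texts, details = [], []
    for element in soup.find_all(element_tag, **attrs):
        texts.append(element.get_text(strip=True))
        # Use the element itself if it is a link, otherwise its first nested link.
        link = element if element.name == "a" else element.find("a")
        details.append({
            "href": link.get("href") if link is not None else None,
            "class": element.get("class"),
        })
    return texts, details
```

For example, `parse_content(raw_html, 'div', class_='story')` would collect the text of every `<div class="story">` together with the first link inside each one.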

`How to start?`
Decide upon a webpage and its field(s) that you want to scrape, and move your mouse cursor over any of those fields within the webpage. Click the right mouse button and choose `Inspect Element` (labelled `Inspect` in some browsers) from the context menu. You will now see the markup information that you need in order to access the relevant fields.
    
Move your code to the `scrapper.py` file, and test your `parse_content()` method as you did in the first exercise.

## 03. Combining Your Methods
Now combine the methods defined above into a single method that returns the text content of the elements of your choice at a given URL.

    `retrieve_text_url(url, element_tag, ...)`
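
The combination itself can be very short. In the sketch below, `get_url_content` and `parse_content` are condensed stand-ins included only so that the snippet runs on its own; in `scrapper.py` your own versions from sections 01 and 02 would be used instead. Returning an empty list when the download fails is an assumption, not a requirement.

```python
from contextlib import closing

from bs4 import BeautifulSoup
from requests import get
from requests.exceptions import RequestException


def get_url_content(url):
    """Condensed stand-in for the function from section 01."""
    try:
        with closing(get(url, timeout=10)) as resp:
            return resp.content if resp.status_code == 200 else None
    except RequestException:
        return None


def parse_content(raw_html, element_tag, **attrs):
    """Condensed stand-in for the function from section 02 (texts only)."""
    soup = BeautifulSoup(raw_html, "html.parser")
    return [el.get_text(strip=True) for el in soup.find_all(element_tag, **attrs)]


def retrieve_text_url(url, element_tag, **attrs):
    """Download `url` and return the texts of its `element_tag` elements."""
    raw_html = get_url_content(url)
    if raw_html is None:
        # Download failed: nothing to parse.
        return []
    return parse_content(raw_html, element_tag, **attrs)
```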
    
As previously, add this method to `scrapper.py` and test it below:

## 04. Finalizing Your Module
Now, add a `__main__` section to your `scrapper.py`, allowing the script to be run standalone from a terminal by calling:

    `python scrapper.py url element_tag`
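
One way to structure that section is with the standard-library `argparse` module, sketched below. The `retrieve_text_url` defined here is a placeholder so that the snippet runs by itself; `build_parser` and `main` are hypothetical names, and printing a usage line when no arguments are given is a design choice rather than a requirement.

```python
import argparse
import sys


def retrieve_text_url(url, element_tag):
    """Placeholder so the sketch runs alone; scrapper.py would hold the real one."""
    return [f"(placeholder) text of <{element_tag}> elements from {url}"]


def build_parser():
    """Describe the two positional arguments expected on the command line."""
    parser = argparse.ArgumentParser(
        description="Print the text of the chosen elements on a web page."
    )
    parser.add_argument("url", help="address of the page to scrape")
    parser.add_argument("element_tag", help="HTML tag to extract, e.g. h2")
    return parser


def main(argv):
    """Parse `argv` and print one retrieved text per line."""
    args = build_parser().parse_args(argv)
    for text in retrieve_text_url(args.url, args.element_tag):
        print(text)


if __name__ == "__main__":
    # Only runs when the file is executed directly, not when it is imported.
    if len(sys.argv) > 1:
        main(sys.argv[1:])
    else:
        build_parser().print_usage()
```

With this layout, `python scrapper.py https://www.cnbc.com/stocks/ h2` would print one line per matching element, while `from scrapper import retrieve_text_url` still works without triggering the command-line code.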

## 05. Scraper Extensions
Think of an extension to your scraper that would increase its functionality. Some examples are given below:

    Maybe it would be useful to also retrieve data from the sub-links within your elements? How would you do that?
    
    Or maybe this could be done more efficiently using a different package, e.g., Scrapy?
    
    What if you want to traverse a large list of webpages?
    
    Maybe it is possible to get around the `robot security` problem that may arise when scraping pages like Bloomberg news?
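
As a starting point for the first idea (sub-links within your elements), the sketch below collects the links found inside the selected elements and resolves relative addresses against the page URL with `urljoin`. The function name `extract_sub_links` and its parameters are hypothetical; the resulting URLs could then be passed back to `get_url_content`.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def extract_sub_links(raw_html, base_url, element_tag, **attrs):
    """Return absolute URLs of all links nested in the matching elements."""
    soup = BeautifulSoup(raw_html, "html.parser")
    links = []
    for element in soup.find_all(element_tag, **attrs):
        # href=True skips <a> tags that have no href attribute.
        for anchor in element.find_all("a", href=True):
            links.append(urljoin(base_url, anchor["href"]))
    return links
```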