# Practical 1: Data Acquisition

Machine learning algorithms require **a lot** of data; typically, the more the better. Of course, there are many pre-existing datasets that are often used for learning purposes, or as benchmarks for particular NLP tasks, such as SQuAD and GLUE. These datasets are often well studied and can simply be downloaded and used with minimal pre-processing.

However, applying NLP to a new problem or task will often require data to be gathered, processed and, if ground-truth labels are needed (e.g. for supervised learning), annotated. Indeed, data acquisition is often one of the most time-consuming and labour-intensive parts of any NLP project. Depending on the problem, the data could come from existing documents, be created by hand, or be drawn from the largest source of information: the internet. [Web scraping](https://en.wikipedia.org/wiki/Web_scraping) allows us to extract data from websites, so it is possible to obtain huge amounts of information. In fact, scraping was used to extract the ~500 billion token datasets used to train some of the largest SOTA language models, like GPT-3 ([Brown, T.B., et al. (2020)](https://papers.nips.cc/paper/2020/file/1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf)).

In this practical we will use web scraping to gather some movie reviews written by IMDB users, specifically for movies from the [1001 Movies You Must See Before You Die (2020 Edition)](https://www.imdb.com/list/ls052535080/) list. Then, we will annotate these reviews with a sentiment, positive or negative. In later practicals we will learn how to process this data and then build a model to classify the sentiment.

The objectives of this practical are:
1. Understand the process of web scraping to obtain data

2. Use existing tools to annotate data and manage data versioning

3. Consider the legal and ethical implications of web scraping and data acquisition in general

4. Produce a set of IMDB user reviews, annotated with positive or negative sentiment

## 1.0 Import libraries

Most of these Python libraries should already be familiar to you. For the web scraping we will use two specifically:

1. Requests - allows us to make HTTP requests for web pages, i.e. ask a web server to send a web page and its data.

2. [Beautiful Soup 4 (bs4)](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) - is a Python library for parsing and navigating HTML files. This makes the job of finding the data we want, from within a received page, much easier.

In [4]:
import os
import json
import time
import random
import requests
import pandas as pd
from collections import OrderedDict
from bs4 import BeautifulSoup

# Set the directory to the data folder
data_dir = os.path.join('..', 'data', 'imdb')

## 1.1 Get a list of movie names and the URL to their IMDB page

As previously stated, we will be getting reviews for movies in IMDB's curated list of [1001 Movies You Must See Before You Die (2020 Edition)](https://www.imdb.com/list/ls052535080/). If you follow the link you can see the list of movies for yourself.

The process of web scraping simply involves requesting a web page from a server and then extracting the data we are interested in. However, in reality we may not know the exact URL, or we may wish to scrape many web pages at once. In this case we have a list of movies and we need to find the links to each of their IMDB pages.

1. First we send a request for the list page. The response is the same information used by your browser to render the page. If you uncomment `print(response.content)` you can see the full response (it's pretty horrible), and `print(response.status_code)` tells us whether the request succeeded or there was an error.

2. Next we use Beautiful Soup to parse the response into a more manageable object ('soup'). Again, if you uncomment `print(soup.prettify())` you can see what this looks like (better but still horrible).

3. Now we can begin to parse the page's data to find the links to individual movies. If you open the page in your browser you can right click on a movie title and select 'inspect'. This will open the developer console and you should see that each movie title is actually a link (`a` tag) to the movie's page, and each of these is held within a header (`h3` tag) of class `lister-item-header`. So we can use bs4 to get a list of all the headers of this class.

4. Next we loop over all the movie headers, get the name and link from within the `a` tags, then store them in a dictionary: keys = movie name, values = full URL.

5. Finally print out the first 10 movies. You can see the movies all have a unique code at the end of the URL.
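Steps 2 to 4 can be tried offline on a small hand-written HTML fragment. The markup below is made up for illustration and only mimics the `h3` / `lister-item-header` structure described above; the real page is much larger and its markup may differ.

```python
from bs4 import BeautifulSoup

# Hand-written HTML that imitates the structure described above (not the real page)
html = """
<div>
  <h3 class="lister-item-header">
    <span>1.</span>
    <a href="/title/tt0000417/?ref_=ttls_li_tt">A Trip to the Moon</a>
  </h3>
  <h3 class="lister-item-header">
    <span>2.</span>
    <a href="/title/tt0000439/?ref_=ttls_li_tt">The Great Train Robbery</a>
  </h3>
  <h3 class="other-header"><a href="/ignored/">Not a movie</a></h3>
</div>
"""

soup = BeautifulSoup(html, 'html.parser')

# Only headers with the class we are interested in are returned
headers = soup.find_all('h3', {'class': 'lister-item-header'})

found = {}
for h in headers:
    link = h.find('a', href=True)
    # Link text is the movie name; drop the query string from the href
    found[link.get_text(strip=True)] = link['href'].split('?')[0]

for name, href in found.items():
    print(name + "\t" + href)
```

Because the third header has a different class, it is skipped, which is the same filtering we rely on for the real list page.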


<div class = "alert alert-block alert-info"><b>Note:</b> The list page only shows the first 100 of 1001 movies.<br>
We could use pagination to get the rest, but let's just stick with 100 for now.</div>
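As a rough sketch of what pagination could look like, the helper below builds one URL per page of the list. The function name `build_page_urls`, the `page` query parameter and the figure of 100 items per page are assumptions for illustration; check how the site actually paginates in your browser before relying on them.

```python
from urllib.parse import urlencode


def build_page_urls(list_url, total_items, items_per_page=100, page_param='page'):
    """Return one URL per page (hypothetical helper; the parameter name is assumed)."""
    # Ceiling division: 1001 items at 100 per page needs 11 pages
    num_pages = -(-total_items // items_per_page)
    return [list_url + '?' + urlencode({page_param: page})
            for page in range(1, num_pages + 1)]


page_urls = build_page_urls('https://www.imdb.com/list/ls052535080/', total_items=1001)
print(len(page_urls))
print(page_urls[0])
print(page_urls[-1])
```

Each of these URLs would then be requested and parsed in the same way as the first page, with a delay between requests.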

In [5]:
# Base IMDB url and 1001 Movies You Must See Before You Die (2020 Edition) url
imbd_url = 'https://www.imdb.com'
list_url = 'https://www.imdb.com/list/ls052535080/'
# header = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36"}

# Send http request to get the list page
response = requests.get(list_url)
# print(response.status_code)
# print(response.content)
soup = BeautifulSoup(response.content, 'html.parser')
# print(soup.prettify())

# Get the movie list
# Inspecting the page we can see that the movie names are links to their pages
# These header tags <h3> have the class "lister-item-header"
mov_headers = soup.find_all('h3', {'class': 'lister-item-header'})

# Now loop over the movies and store the name and link in a dictionary
movies = OrderedDict()
for h in mov_headers:
    link = h.find('a', href=True)

    # Movie name as key and link as value
    # We append to the base IMDB url to get the full link, and discard the query string
    movies[link.contents[0]] = imbd_url + link['href'].split('?')[0]

# Now display the first 10 movies
for movie_name in list(movies.keys())[:10]:
    print(movie_name + "\t" + movies[movie_name])

A Trip to the Moon	https://www.imdb.com/title/tt0000417/
The Great Train Robbery	https://www.imdb.com/title/tt0000439/
The Birth of a Nation	https://www.imdb.com/title/tt0004972/
Les vampires	https://www.imdb.com/title/tt0006206/
Intolerance: Love's Struggle Throughout the Ages	https://www.imdb.com/title/tt0006864/
The Cabinet of Dr. Caligari	https://www.imdb.com/title/tt0010323/
Broken Blossoms	https://www.imdb.com/title/tt0009968/
Within Our Gates	https://www.imdb.com/title/tt0011870/
Orphans of the Storm	https://www.imdb.com/title/tt0012532/
The Phantom Carriage	https://www.imdb.com/title/tt0012364/
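String concatenation and `split('?')` work fine for these links. A slightly more defensive variant, sketched here with the standard library's `urllib.parse`, also handles hrefs that are already absolute or that carry a fragment. `clean_movie_url` is a hypothetical helper and the example `href` values are illustrative.

```python
from urllib.parse import urljoin, urlsplit, urlunsplit


def clean_movie_url(base_url, href):
    """Resolve href against base_url and drop any query string or fragment."""
    absolute = urljoin(base_url, href)
    parts = urlsplit(absolute)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, '', ''))


url = clean_movie_url('https://www.imdb.com', '/title/tt0000417/?ref_=ttls_li_tt')
print(url)

# The unique movie code is the last non-empty segment of the path
movie_id = urlsplit(url).path.strip('/').split('/')[-1]
print(movie_id)
```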


## 1.2 Get the user reviews

Now that we have the links for each movie we can get their reviews. You might expect movies in this list to have only positive reviews. However, the third movie in the list, The Birth of a Nation, is highly controversial (and rightfully so) and has garnered some very negative reviews. For our purposes this means we will have a bit more balance between the sentiment classes. The process is similar to the previous step:

1. Loop over each movie and request its review page (movie_url + "reviews?").

2. Get the titles of the reviews and also the main texts.

3. Store these in a list of dictionaries, along with a unique review id.

4. Create a DataFrame to hold the review id, movie name, url, review title and text, and save it as a .csv file.

<div class = "alert alert-block alert-info"><b>Note:</b> The review page only shows the first 25 reviews.<br>
Again, we could use pagination to get the rest, but let's just stick with 25 for now.
</div>


<div class="alert alert-warning" role="alert">
<b>Legality of Web Scraping:</b> There are all kinds of <a href="https://www.blog.datahut.co/post/is-web-scraping-legal">legal and ethical considerations</a> surrounding web scraping, including copyright and the scraping of non-public data or data behind a login, such as on Facebook or LinkedIn.<br>

Notice that a time delay is added after each movie request has been processed. This reduces the number of requests per second and prevents repeated requests from overloading the server, or at least from creating unnecessary traffic. Excessive 'crawl rates' could violate "trespass to chattels" law, though for this use case it is unlikely. Still, it is worth being polite while scraping.
</div>
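One common way to be polite is to check a site's `robots.txt` before scraping. The sketch below parses a made-up `robots.txt` (not IMDB's) with the standard library's `urllib.robotparser`, so it runs without any network access. The user agent name `my-scraper` and the example.com URLs are placeholders.

```python
from urllib.robotparser import RobotFileParser

# A made-up robots.txt for illustration only
robots_txt = """\
User-agent: *
Disallow: /private/
Crawl-delay: 3
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# Ask whether our (placeholder) user agent may fetch each URL
allowed = parser.can_fetch('my-scraper', 'https://example.com/title/tt0000417/reviews')
blocked = parser.can_fetch('my-scraper', 'https://example.com/private/data')

# Crawl-delay is optional, so this can be None for other sites
crawl_delay = parser.crawl_delay('my-scraper')

# Wait at least the requested crawl delay, falling back to 2 seconds
wait_seconds = max(crawl_delay or 0, 2)

print(allowed, blocked, crawl_delay, wait_seconds)
```

For a live site you would typically call `set_url(...)` and `read()` on the parser instead of `parse(...)`, which does make a request. Note that `robots.txt` expresses the site owner's wishes for crawlers; it does not settle the legal questions above.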

In [6]:
# Let's get reviews for the first few movies
# Alternatively you could get all movies (time consuming), or randomly select a subset
num_movies = 5
movie_reviews = []
for movie_name in list(movies.keys())[:num_movies]:
    
    # Send http request to get the review page
    # Appending "reviews?" to the movie url gets the review page
    movie_url = movies[movie_name]
    response = requests.get(movie_url + 'reviews?')
    soup = BeautifulSoup(response.content, 'html.parser')

    # Get the review titles
    review_titles = soup.find_all('a', {'class': 'title'})
    titles = [t.text for t in review_titles]

    # Get the text of each review
    review_contents = soup.find_all('div', {'class': 'text'})
    reviews = [c.text for c in review_contents]

    # Add to the list of reviews
    for i, (title, review) in enumerate(zip(titles, reviews)):
        # Create unique review id from the movie id (avoid shadowing the built-in id())
        review_id = str(i) + '-' + movie_url.split('/')[-2]
        movie_reviews.append({'id': review_id, 'name': movie_name, 'url': movie_url, 'title': title, 'review': review})

    # Add a time delay to prevent excessive requests
    time.sleep(random.randint(2, 5))

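# Create a dataframe from the list of reviews and save to csv in the data folder
# (the annotation-merging code later reads the file from data_dir)
os.makedirs(data_dir, exist_ok=True)
reviews_df = pd.DataFrame(movie_reviews)
reviews_df.to_csv(os.path.join(data_dir, 'imdb_reviews_raw.csv'))
reviews_df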

Unnamed: 0,id,name,url,title,review
0,0-tt0000417,A Trip to the Moon,https://www.imdb.com/title/tt0000417/,Wonderfully imaginative and innovative\n,A group of scientists build a rocket and fly t...
1,1-tt0000417,A Trip to the Moon,https://www.imdb.com/title/tt0000417/,I can now say that I've seen a movie that's o...,Georges Méliès's 1902 masterpiece is not just ...
2,2-tt0000417,A Trip to the Moon,https://www.imdb.com/title/tt0000417/,Narrative Development: Magic\n,"""A Trip to the Moon"" is justly the most popula..."
3,3-tt0000417,A Trip to the Moon,https://www.imdb.com/title/tt0000417/,"Trip to the Moon, A\n","Trip to the Moon, A (1902) **** (out of 4) aka..."
4,4-tt0000417,A Trip to the Moon,https://www.imdb.com/title/tt0000417/,Tripping on the Moon.\n,Since seeing nods to the landmark work in Mart...
...,...,...,...,...,...
120,20-tt0006864,Intolerance: Love's Struggle Throughout the Ages,https://www.imdb.com/title/tt0006864/,"impressive for it's time as it is now, and ju...",The most remarkable thing about Intolerance wh...
121,21-tt0006864,Intolerance: Love's Struggle Throughout the Ages,https://www.imdb.com/title/tt0006864/,MOST ADVENTUROUS MOVIE EVER MADE\n,Only CITIZEN KANE and 2001:A SPACE ODYSSEY com...
122,22-tt0006864,Intolerance: Love's Struggle Throughout the Ages,https://www.imdb.com/title/tt0006864/,An Immortal Masterpiece. There's nothing in t...,Intolerance (1916) :\nBrief Review -An Immorta...
123,23-tt0006864,Intolerance: Love's Struggle Throughout the Ages,https://www.imdb.com/title/tt0006864/,immense silent experience\n,"Four strands here, sometimes merging (as in th..."
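The scraped titles in the output above end with a newline character. A minimal sketch of some light tidying and sanity checks is given below, using a tiny hand-built DataFrame rather than the real scrape. The rows reuse a few ids and titles from the output above; `sample_df` and the other variable names are illustrative.

```python
import pandas as pd

# A tiny stand-in for reviews_df, with the stray whitespace seen in the scraped titles
sample_df = pd.DataFrame([
    {'id': '0-tt0000417', 'name': 'A Trip to the Moon',
     'title': ' Wonderfully imaginative and innovative\n'},
    {'id': '2-tt0000417', 'name': 'A Trip to the Moon',
     'title': ' Narrative Development: Magic\n'},
    {'id': '21-tt0006864', 'name': "Intolerance: Love's Struggle Throughout the Ages",
     'title': ' MOST ADVENTUROUS MOVIE EVER MADE\n'},
])

# Remove leading and trailing whitespace (including the newline) from every title
sample_df['title'] = sample_df['title'].str.strip()

# The ids are used later to merge in the labels, so they should be unique
ids_are_unique = sample_df['id'].is_unique

# Count how many reviews were collected per movie
reviews_per_movie = sample_df.groupby('name').size()

print(sample_df['title'].tolist())
print(ids_are_unique)
print(reviews_per_movie)
```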


## Exercise: 1.3 Annotate sentiment labels for the reviews

We are going to be analysing the sentiment of these reviews, so we need to add some sentiment labels. Later we can use these as an extra test set to evaluate a classifier. We could do this manually, but it is much simpler to use existing tools. We suggest using [LightTag](https://www.lighttag.io/), which is a free text annotation tool.

1. You can use your UWE email to create a free educational LightTag account.

2. Following the LightTag introduction you can create an annotation schema with two classes 'positive' and 'negative'.

3. Upload the `imdb_reviews_raw.csv` file and choose which field you want to annotate and create a job.

4. You can also assign more annotators to your team. It is suggested you create a team for your group and share the workload. You should aim for ~100 reviews in total.

5. Once you have finished assigning labels you can download the dataset as a JSON file and use the code below to merge the labels with the reviews dataframe we created earlier.
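The merging code in this practical reads only a few fields from the exported JSON: a top-level `examples` list, each example's `metadata` (which carries the review `id`), and its `classifications`. The sketch below uses a hand-written stand-in containing just those fields; a real LightTag export will likely contain more, and the values here are made up.

```python
import json

# Hand-written stand-in for the export, with only the fields the merging code reads
export_text = """
{
  "examples": [
    {"metadata": {"id": "0-tt0000417"}, "classifications": [{"classname": "positive"}]},
    {"metadata": {"id": "1-tt0000417"}, "classifications": []},
    {"metadata": {"id": "2-tt0000417"}, "classifications": [{"classname": "negative"}]}
  ]
}
"""

data = json.loads(export_text)

# Map review id -> sentiment, skipping examples that were never labelled
labels = {}
for example in data['examples']:
    if example['classifications']:
        labels[example['metadata'].get('id')] = example['classifications'][0]['classname']

print(labels)
```

The second example has an empty `classifications` list, so it is skipped. This is why unlabelled reviews do not appear in the merged result.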

In [7]:
# Open the JSON file and load the data
with open(os.path.join(data_dir, 'annotated_imdb_reviews.json')) as json_data:
    data = json.load(json_data)

# Loop over each example (movie review) getting the id and classification
classifications = []
for example in data['examples']:
    if example['classifications']:
        classifications.append({'id': example['metadata'].get('id'), 'sentiment': example['classifications'][0]['classname']})

# Create a dataframe from the classifications and merge with the reviews
classes_df = pd.DataFrame(classifications)
reviews_df = pd.read_csv(os.path.join(data_dir, 'imdb_reviews_raw.csv'), index_col=0)
reviews_df = pd.merge(reviews_df, classes_df, on='id')

# Save the annotated reviews
reviews_df.to_csv(os.path.join(data_dir, 'imdb_reviews_raw.csv'))

reviews_df.head()

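It is worth knowing what `pd.merge(..., on='id')` does with reviews that have no label. By default it performs an inner join, so those rows are dropped. The sketch below shows this on made-up data and contrasts it with `how='left'`, which keeps every review; the small DataFrames are illustrative only.

```python
import pandas as pd

# Three made-up reviews, only two of which have been labelled
reviews = pd.DataFrame({'id': ['0-tt0000417', '1-tt0000417', '2-tt0000417'],
                        'title': ['First', 'Second', 'Third']})
labels = pd.DataFrame({'id': ['0-tt0000417', '2-tt0000417'],
                       'sentiment': ['positive', 'negative']})

# Default (inner) merge: only reviews with a label survive
inner = pd.merge(reviews, labels, on='id')

# Left merge: all reviews are kept, unlabelled ones get NaN for sentiment
left = pd.merge(reviews, labels, on='id', how='left')
unlabelled = left[left['sentiment'].isna()]

print(len(inner), len(left))
print(unlabelled['id'].tolist())
```

Which behaviour you want depends on the goal: the inner join gives a clean labelled test set, while the left join makes it easy to see which reviews still need annotating.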