__Exercise 1__ We want to scrape movie scripts from [IMSDB](https://imsdb.com/) and ratings from [Rotten Tomatoes](https://www.rottentomatoes.com/) in order to learn about their relationship.

For this exercise, you are allowed to use the following imports:

```python
import requests
import lxml.html as lx
import re
import time
import nltk
import numpy as np

import matplotlib.pyplot as plt # or your favorite alternative

from sklearn.preprocessing import normalize
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

```




__(a, i)__ Write a function that retrieves all links to movies in the database by scraping the _alphabetical_ section on [imsdb.com](https://imsdb.com/). _How many links did you find?_ __(ii)__ Write a function `fetch_script` that, given the link of a movie retrieved in (i), returns a dictionary that contains all relevant information about the movie:

```python
fetch_script('/Movie Scripts/10 Things I Hate About You Script.html')

>>> {'title': '10 Things I Hate About You',
     'writers': ['Karen McCullah Lutz', 'Kirsten Smith', 'William Shakespeare'],
     'genres': ['Comedy', 'Romance'],
     'date': 1997,
     'script': '...'}
```

The `script` field contains a string of the scraped script. Retrieve the information for all movies. _How many scripts did you retrieve?_
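Retrieving the information for all movies means calling `fetch_script` once per link, so it helps to pause between requests and to keep track of links that fail. The following is a minimal sketch under stated assumptions: the helper name `fetch_all`, its `delay` parameter, and the stand-in `fake_fetch` are hypothetical and exist only for illustration; the sketch assumes the fetch function returns a dictionary on success and `None` on failure.

```python
import time


def fetch_all(links, fetch, delay=1.0):
    """Call fetch(link) for every link, pausing `delay` seconds after each call.

    Returns (records, failed): the dictionaries that were retrieved and
    the links for which fetch returned None.
    """
    records, failed = [], []
    for link in links:
        result = fetch(link)
        if result is None:
            failed.append(link)
        else:
            records.append(result)
        time.sleep(delay)  # throttle requests to be polite to the server
    return records, failed


# Hypothetical stand-in for fetch_script so the sketch runs without network access
def fake_fetch(link):
    if "Missing" in link:
        return None
    return {"title": link, "script": "..."}


demo_links = ["/Movie Scripts/A Script.html", "/Movie Scripts/Missing Script.html"]
records, failed = fetch_all(demo_links, fake_fetch, delay=0)
print(f"Scripts retrieved: {len(records)}, failed: {len(failed)}")
```

With the real scraper, the call would look like `fetch_all(movie_links, fetch_script)`, and `len(records)` would answer the question of how many scripts were retrieved.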

In [1]:
import requests
import lxml.html as lx
import re
import time
import nltk
import numpy as np

import matplotlib.pyplot as plt # or your favorite alternative

from sklearn.preprocessing import normalize
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

In [2]:
# a, i

def get_movie_links():
    # Base URL for IMSDB alphabetical sections
    base_url = "https://www.imsdb.com/alpha.php"
    
    # Send request to the IMSDB alphabetical page
    response = requests.get(base_url)
    if response.status_code != 200:
        raise Exception(f"Failed to fetch page: {response.status_code}")
    
    # Parse the page using lxml
    tree = lx.fromstring(response.content)
    
    # Extract all movie links
    # Movie links are typically in anchor tags under a specific structure
    movie_links = tree.xpath("//a[contains(@href, '/Movie Scripts/')]/@href")
    
    # Keep the links relative (e.g. '/Movie Scripts/... Script.html'):
    # fetch_script expects this form and prepends the base URL itself
    return movie_links

# Retrieve and count the links
movie_links = get_movie_links()
print(f"Number of movie links found: {len(movie_links)}")

Number of movie links found: 1290


In [12]:
def fetch_script(link):
    base_url = "https://www.imsdb.com"
    url = base_url + link
    
    # Fetch the movie page
    response = requests.get(url)
    if response.status_code != 200:
        print(f"Failed to fetch: {url}")
        return None
    
    # Parse the HTML content
    tree = lx.fromstring(response.content)
    
    # Extract writers, making sure only to get those listed before the Genres section
    writers = tree.xpath('//b[text()="Writers"]/following-sibling::node()[self::a and following-sibling::b[1][text()="Genres"]]/text()')
    
    # Extract genres, ensuring it only captures the links directly after "Genres" and before another <b> tag
    genres = tree.xpath('//b[text()="Genres"]/following-sibling::a[following-sibling::b]/text()')
    
    # Return a dictionary with both writers and genres
    return {
        "writers": writers,
        "genres": genres,
    }

# Example usage
sample_link = '/Movie Scripts/10 Things I Hate About You Script.html'
movie_data = fetch_script(sample_link)
print(movie_data)  # This should now print both writers and genres in the desired format

{'writers': ['Karen McCullah Lutz', 'Kirsten Smith', 'William Shakespeare']}


In [26]:

def fetch_script(link):
    """Return title, writers, genres, script date and script text for one movie link."""
    base_url = "https://www.imsdb.com"
    url = base_url + link

    # Fetch and parse the movie's info page
    response = requests.get(url)
    if response.status_code != 200:
        print(f"Failed to fetch: {url}")
        return None
    tree = lx.fromstring(response.content)

    title = tree.xpath('//title/text()')[0]
    # Writers: <a> elements after "Writers" whose next <b> label is "Genres"
    writers = tree.xpath('//b[text()="Writers"]/following-sibling::node()[self::a and following-sibling::b[1][text()="Genres"]]/text()')
    # Genres: <a> elements after "Genres" that are still followed by a <b> label
    genres = tree.xpath('//b[text()="Genres"]/following-sibling::a[following-sibling::b]/text()')
    # Script date: the text node directly after the "Script Date" label
    script_date = tree.xpath('normalize-space(//b[text()="Script Date"]/following-sibling::text()[1])')

    # Some movies have no script available, so guard against a missing "Read" link
    script_links = tree.xpath('//a[contains(text(), "Read")]/@href')
    if not script_links:
        print(f"No script link found: {url}")
        return None
    script_url = base_url + script_links[0]

    script_response = requests.get(script_url)
    if script_response.status_code != 200:
        print(f"Failed to fetch script: {script_url}")
        return None
    script_tree = lx.fromstring(script_response.content)
    # The script text sits inside the page's <pre> element
    script_texts = script_tree.xpath('//pre/text()')
    script_text = ''.join(script_texts) if script_texts else "Script not found."
    return {
        "title": title,
        "writers": writers,
        "genres": genres,
        "script_date": script_date,
        "script": script_text
    }

link = '/Movie Scripts/10 Things I Hate About You Script.html'
fetch_movie = fetch_script(link)
print(fetch_movie)  

{'title': '10 Things I Hate About You Script at IMSDb.', 'writers': ['Karen McCullah Lutz', 'Kirsten Smith', 'William Shakespeare'], 'genres': ['Comedy', 'Romance'], 'script_date': ': November 1997', 'script': '\r\n\r\n\r\n\r\n\r\n\r\n                written by Karen McCullah Lutz & Kirsten Smith\r\n              based on \'Taming of the Shrew" by William Shakespeare\r\n          Revision November 12, 1997\r\n          Welcome to Padua High School,, your typical urban-suburban \r\n          high school in Portland, Oregon.  Smarties, Skids, Preppies, \r\n          Granolas. Loners, Lovers, the In and the Out Crowd rub sleep \r\n          out of their eyes and head for the main building.\r\n          KAT STRATFORD, eighteen, pretty -- but trying hard not to be \r\n          -- in a baggy granny dress and glasses, balances a cup of \r\n          coffee and a backpack as she climbs out of her battered, \r\n          baby blue \'75 Dodge Dart.\r\n          A stray SKATEBOARD clips her, cau
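
The printed record still differs from the format requested in (a, ii): the title carries the page suffix `Script at IMSDb.`, and the date is a raw string under the key `script_date` rather than an integer year under `date`. A minimal post-processing sketch follows; the helper name `clean_record` is hypothetical, and the regular expressions assume that the suffix and date strings always look like the ones in the output above.

```python
import re


def clean_record(raw):
    """Convert a scraped record into the format requested in the exercise."""
    # Strip the page-title suffix, e.g. '... Script at IMSDb.'
    title = re.sub(r"\s+Script at IMSDb\.?$", "", raw["title"]).strip()
    # Take the first four-digit year found in the raw date string, if any
    match = re.search(r"\b(?:19|20)\d{2}\b", raw.get("script_date", ""))
    return {
        "title": title,
        "writers": raw["writers"],
        "genres": raw["genres"],
        "date": int(match.group()) if match else None,
        "script": raw["script"],
    }


raw = {
    "title": "10 Things I Hate About You Script at IMSDb.",
    "writers": ["Karen McCullah Lutz", "Kirsten Smith", "William Shakespeare"],
    "genres": ["Comedy", "Romance"],
    "script_date": ": November 1997",
    "script": "...",
}
cleaned = clean_record(raw)
print(cleaned["title"], cleaned["date"])
```

Records without a recognisable year get `date = None` instead of raising an error, which keeps a full crawl from stopping on a single malformed page.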

__(b, i)__ Write your own tokenizer `myTokenizer` that 
- matches all consecutive runs of word-class characters (`\w`),
- lowercases all retrieved tokens,
- stems all tokens using `nltk.PorterStemmer`,
- and removes all stopwords in `nltk.corpus.stopwords.words("english")`.

Consider the collection of scripts as the corpus and each script as a document, and use `CountVectorizer` (from `sklearn.feature_extraction.text`) to compute a matrix of token frequencies per document. *How many tokens did you find in the whole corpus?*
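
The tokenizer steps can be sketched as follows. This is one possible reading, not a reference solution: it removes stopwords before stemming (so that stemmed forms such as `thi` for `this` do not slip past the stopword list), which is an assumption about the intended order. `DEMO_STOPWORDS` and the two demo documents are hypothetical stand-ins so the sketch runs without downloading NLTK data; the exercise asks for `nltk.corpus.stopwords.words("english")` instead.

```python
import re

import nltk
from sklearn.feature_extraction.text import CountVectorizer

stemmer = nltk.PorterStemmer()

# Tiny stand-in list; replace with set(nltk.corpus.stopwords.words("english"))
DEMO_STOPWORDS = {"the", "are", "a", "and", "of"}


def myTokenizer(text, stop_words=DEMO_STOPWORDS):
    tokens = re.findall(r"\w+", text)                     # runs of word characters
    tokens = [t.lower() for t in tokens]                  # lowercase
    tokens = [t for t in tokens if t not in stop_words]   # drop stopwords
    return [stemmer.stem(t) for t in tokens]              # stem what is left


docs = ["The cats are running.", "A cat and the running dogs!"]

# token_pattern=None because the custom tokenizer replaces the default pattern
vectorizer = CountVectorizer(tokenizer=myTokenizer, lowercase=False, token_pattern=None)
X = vectorizer.fit_transform(docs)  # sparse matrix: documents x distinct tokens

print(myTokenizer(docs[0]))
print("distinct tokens:", len(vectorizer.vocabulary_))
print("token occurrences:", int(X.sum()))
```

"How many tokens" can be read either as the vocabulary size (`len(vectorizer.vocabulary_)`) or as the total number of occurrences (`X.sum()`), so the sketch prints both.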

In [None]:
# b, i

__(c, i)__ We want to see if related movies exhibit a common pattern. To this end, remove tokens that appear in fewer than 40 documents (this way we hope to avoid picking up patterns that stem only from character names). *How many tokens are left?* __(ii)__ Compute the cosine similarity between all documents. Print the similarity submatrix of the movies

```python
['After School Special',
 'Armageddon',
 'Lord of the Rings: Fellowship of the Ring, The',
 'Lord of the Rings: Return of the King',
 'Top Gun']
```

using `matplotlib.pyplot.matshow` (or similar). Make sure to include the movie titles as axis ticks. 
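
The filtering and similarity steps can be sketched on a small count matrix. The helper names `filter_rare_tokens`, `cosine_similarity_matrix`, and `plot_similarity` are hypothetical, and the 3x3 `counts` array is made-up demo data. The sketch relies on the fact that, after L2-normalising each row with `normalize`, the product of the matrix with its transpose contains the pairwise cosine similarities.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import normalize


def filter_rare_tokens(X, min_docs):
    """Keep only the columns (tokens) that occur in at least min_docs rows (documents)."""
    doc_freq = np.asarray((X > 0).sum(axis=0)).ravel()
    keep = np.flatnonzero(doc_freq >= min_docs)
    return X[:, keep], keep


def cosine_similarity_matrix(X):
    """Pairwise cosine similarity between the rows of X (dense or sparse)."""
    Xn = normalize(X.astype(float))  # L2-normalise every row
    S = Xn @ Xn.T
    return S.toarray() if hasattr(S, "toarray") else np.asarray(S)


def plot_similarity(S, titles):
    """Show S with matshow and use the movie titles as axis ticks."""
    fig, ax = plt.subplots()
    im = ax.matshow(S, vmin=0, vmax=1)
    ax.set_xticks(range(len(titles)))
    ax.set_xticklabels(titles, rotation=90)
    ax.set_yticks(range(len(titles)))
    ax.set_yticklabels(titles)
    fig.colorbar(im)
    return fig


# Made-up counts: 3 documents x 3 tokens
counts = np.array([[2, 0, 1],
                   [1, 0, 1],
                   [0, 3, 1]])
filtered, kept = filter_rare_tokens(counts, min_docs=2)  # the exercise uses 40
S = cosine_similarity_matrix(filtered)
print("kept token columns:", kept)
print(np.round(S, 3))
```

For the requested submatrix, one option is to look up the row indices of the five titles, for example `idx = [titles.index(t) for t in wanted]`, and pass `S[np.ix_(idx, idx)]` together with `wanted` to `plot_similarity`; `titles` and `wanted` are assumed to be lists of title strings in the same order as the rows of the count matrix.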

In [None]:
# c, i

__(d, i)__ For this sub-exercise, consider only the documents that have a similarity score of `0.9` or larger. *How many scripts remain? Hint: I have between 0 and 50.* __(ii)__ Reformulate the similarity matrix into a distance matrix and use a hierarchical clustering method to group the remaining movies according to their corresponding distance. Use 

```python 
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster
```

and print the corresponding dendrogram. Then, run `fcluster` with an appropriate threshold that you determine visually from the dendrogram. __(iii)__ Print the clustered titles and briefly discuss whether the grouping seems reasonable. 
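
A minimal sketch of the selection and clustering steps, under stated assumptions: a document is kept if its similarity to at least one *other* document reaches the threshold (every document has similarity 1 with itself), distance is taken as `1 - similarity`, and average linkage is used. These are interpretation choices, not requirements of the exercise. The helper names and the 5x5 matrix `S` are hypothetical demo values.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform


def select_similar(S, threshold=0.9):
    """Indices of documents whose similarity to some other document is >= threshold."""
    off_diag = S.astype(float)          # astype returns a copy, S is left untouched
    np.fill_diagonal(off_diag, -np.inf)  # ignore self-similarity
    return np.flatnonzero(off_diag.max(axis=1) >= threshold)


def cluster_from_similarity(S, cut, method="average"):
    """Turn similarities into distances, build the linkage, and cut it at `cut`."""
    D = 1.0 - S
    D = np.clip((D + D.T) / 2.0, 0.0, None)  # enforce symmetry, no negative distances
    np.fill_diagonal(D, 0.0)
    Z = linkage(squareform(D, checks=False), method=method)
    labels = fcluster(Z, t=cut, criterion="distance")
    return Z, labels


# Made-up similarities: two tight pairs plus one document unlike the rest
S = np.array([[1.00, 0.95, 0.20, 0.10, 0.30],
              [0.95, 1.00, 0.15, 0.20, 0.30],
              [0.20, 0.15, 1.00, 0.92, 0.30],
              [0.10, 0.20, 0.92, 1.00, 0.30],
              [0.30, 0.30, 0.30, 0.30, 1.00]])

idx = select_similar(S, threshold=0.9)
Z, labels = cluster_from_similarity(S[np.ix_(idx, idx)], cut=0.5)
print("documents kept:", idx)
print("cluster labels:", labels)
```

Calling `dendrogram(Z, labels=[titles[i] for i in idx])` on the resulting linkage draws the tree (with `titles` assumed to be the list of movie titles); the `cut` value passed to `fcluster` would then be read off the dendrogram's distance axis rather than fixed in advance as it is here.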

In [None]:
# d, i subexercise

__(e, i)__ We want to learn about authors who write the best-rated movies according to [Rotten Tomatoes](https://www.rottentomatoes.com/). Write a function `getTomatometer` that takes a movie name as input and returns the `Tomatometer` score if available. 

```python
getTomatometer('10 Things I Hate About You')
> 0.71
```


Run the function on all movies you obtained in (a). Also include those movies for which no script is available. 
*For how many movies did you obtain the Tomatometer metric?* __(ii)__ For each author and the movies they wrote, compute the median Tomatometer rating. Use a data structure of this form: 

```python 
{'Karen McCullah Lutz': {
        'movies': ['10 Things I Hate About You', 'Legally Blonde'],
        'TomatoMedian': 0.715
    }
...}
```

*Print the best and worst five authors that have written at least two scripts.*
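
The aggregation per author can be sketched independently of the scraping. In the sketch below, `author_medians` and `rank_authors` are hypothetical helper names, `records` mimics the dictionaries returned in (a), and `scores` maps titles to Tomatometer values. The demo scores are placeholders chosen to reproduce the `0.715` from the example structure, not ratings retrieved from Rotten Tomatoes. Movies without a score are skipped, which is an assumption about how missing ratings should be handled.

```python
import numpy as np


def author_medians(records, scores):
    """Build {author: {'movies': [...], 'TomatoMedian': float}} from rated movies."""
    collected = {}
    for rec in records:
        score = scores.get(rec["title"])
        if score is None:  # no Tomatometer available for this movie
            continue
        for writer in rec["writers"]:
            entry = collected.setdefault(writer, {"movies": [], "scores": []})
            entry["movies"].append(rec["title"])
            entry["scores"].append(score)
    return {
        name: {"movies": e["movies"], "TomatoMedian": float(np.median(e["scores"]))}
        for name, e in collected.items()
    }


def rank_authors(authors, min_movies=2, k=5):
    """Return (best, worst) lists of (author, median) with at least min_movies rated movies."""
    eligible = [(name, info["TomatoMedian"]) for name, info in authors.items()
                if len(info["movies"]) >= min_movies]
    eligible.sort(key=lambda pair: pair[1], reverse=True)
    return eligible[:k], eligible[-k:][::-1]  # worst list starts with the lowest median


# Placeholder data for illustration only
records = [
    {"title": "10 Things I Hate About You",
     "writers": ["Karen McCullah Lutz", "Kirsten Smith", "William Shakespeare"]},
    {"title": "Legally Blonde",
     "writers": ["Karen McCullah Lutz", "Kirsten Smith"]},
]
scores = {"10 Things I Hate About You": 0.71, "Legally Blonde": 0.72}

authors = author_medians(records, scores)
best, worst = rank_authors(authors)
print(authors["Karen McCullah Lutz"])
print("best:", best)
print("worst:", worst)
```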



In [None]:
# e, i