![Final Lesson Exercise](images/Banner_FEX.png)

# Lesson #3: Data Science Introduction 
## Good movies

## About this assignment
In this assignment, you will explore information about good movies from the [IMDb web site](https://www.imdb.com/chart/top/), using two new capabilities: scraping and crawling.<br/>
You will do this per movie genre (genres such as *comedy* or *drama*).

You will *crawl* along web pages and *scrape* information about top-rated movies.

## Preceding Step - import modules (packages)
This step is necessary in order to use external packages. 

**Use the following libraries for the assignment, when needed**:

In [1]:
# ------------>>>>>>>> RUN THIS CODE CELL <<<<<<<<------------
# === CELL TYPE: IMPORTS AND SETUP 

import time      # for testing use only
import os         # for testing use only

import bs4
from bs4 import BeautifulSoup  
import pandas as pd
import scipy as sc
import numpy as np

## 1. Scraping 

### The HTML files in this assignment - important note:

In this assignment we will use cached web pages instead of working with live web pages. <br/>
This is done for the following reasons:<br/>
* To avoid any problems with IMDb blocking us for downloading the data many times
* To avoid any inconsistency in the data

We have cached (downloaded and saved) all the html files (in the [data](data) sub-folder). 
* The cached entry page is called `IMDb_Top_movies.html` and located in [`data/IMDb_Top_movies.html`](data/IMDb_Top_movies.html)

#### Viewing the HTML data files:
1. First, duplicate the assignment tab (in your browser).
   * You will work on your assignment in the original tab, and view the HTML files in the duplicated tab.
   * The duplicated tab should show the 'Jupyter' file explorer.
   * The following steps refer to actions in the duplicated tab.
2. Click the 'data' folder (in the 'Jupyter' file explorer).
3. Select the checkbox next to the file you want to view.
   * You should start with the entry page `IMDb_Top_movies.html`
4. Click the `view` button at the top of the page.
   * <u>Be careful: do not delete this file</u> (the delete button is near the view button).
   * After clicking view, the HTML file will be displayed (in another tab).

**Note**: After an HTML file is displayed, you can view the HTML code and 'inspect' its elements.

### 1.a. Load the Data
As mentioned above, we cached the web pages.<br/>
* You should refer to the cached HTML files, located in the [`data`](data) folder.

**For every HTML file, load a `BeautifulSoup` object**.<br/>
You will later use this object to *scrape* information from this web-page.

### Renaming the input file name - please read

**Important note - don't forget** - all the html files are cached and are located in a sub-folder called 'data'.<br/>
Thus, **you need to add a './data/' prefix to the given 'html_file_name' input parameter**,<br/> 
before loading the soup object.<br/>

For example, **if the name** of the input `html_file_name` parameter **is `'dummy_html.html'`,** you should **use `'./data/dummy_html.html'` instead**.
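
The prefix rule above can be sketched as a small helper. This is only an illustration, not the required solution; the names `DATA_DIR` and `to_data_path` are made up for this example.

```python
# Hypothetical constant: the sub-folder that holds the cached pages.
DATA_DIR = './data'


def to_data_path(html_file_name):
    """Return the given file name prefixed with the data sub-folder."""
    return DATA_DIR + '/' + html_file_name


print(to_data_path('dummy_html.html'))  # ./data/dummy_html.html
```

Keeping the prefix in one place means the rest of the code can keep using the short file names that appear in the HTML links.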


### Instructions
<u>method name</u>: <b>load_soup_object</b>
<pre>The following is expected:
--- Complete the 'load_soup_object' function to create and return a soup object 
    for a given html file.    
    Use BeautifulSoup(html_doc, 'html.parser')
</pre>    

#### Don't forget to add the './data/' prefix to the HTML file name, as mentioned above

<hr>

Hints:
* [open()](https://docs.python.org/3/library/functions.html#open)
* [Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#quick-start)
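
As a quick orientation before implementing the function, here is a minimal sketch of what a soup object gives you. It parses an in-memory string rather than a cached file, and the HTML content is invented for the example.

```python
from bs4 import BeautifulSoup

# A tiny in-memory page, used instead of a file so the example is self-contained.
html_doc = "<html><body><h1>Hello</h1><p>First paragraph</p></body></html>"

soup = BeautifulSoup(html_doc, 'html.parser')

print(type(soup).__name__)        # BeautifulSoup
print(soup.h1.get_text())         # Hello
print(soup.find('p').get_text())  # First paragraph
```

In the assignment, the string passed to `BeautifulSoup` would come from reading the cached HTML file instead.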

In [2]:
# 1.a.
# ------------>>>>>>>> RUN THIS CODE CELL <<<<<<<<------------
# --------  (run after implementation)
# === CODE TYPE: ANSWER 

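def load_soup_object(html_file_name):
    # All cached pages live in the 'data' sub-folder, so prefix the file name.
    file_path = './data/' + html_file_name
    # Assumes the cached pages are UTF-8 encoded; an explicit encoding avoids
    # depending on the platform default.
    with open(file_path, 'r', encoding='utf-8') as file:
        html_content = file.read()

    # Parse the HTML text with Python's built-in parser.
    soup = BeautifulSoup(html_content, 'html.parser')
    return soup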

In [3]:
# 1.a.
# ------------>>>>>>>> RUN THIS CODE CELL <<<<<<<<------------
# --------  (run after implementation)
# === CODE TYPE: SELF TESTING
# Use the following code to test your implementation:

dummy_html = 'dummy_html.html'
soup_dummy = load_soup_object(dummy_html)
soup_dummy


<!DOCTYPE html>

<html>
<body>
<h1>This is a dummy HTML</h1>
<p>It is used only to test your Soup object creation</p>
</body>
</html>

In [4]:
# 1.a.
# ------------>>>>>>>> RUN THIS CODE CELL <<<<<<<<------------
# --------  (run after implementation, if used)
# === CODE TYPE: SELF TESTING
# Add your additional tests here if needed:

###
### YOUR CODE HERE
###


In [5]:
# 1.a.
# ------------>>>>>>>> RUN THIS CODE CELL <<<<<<<<------------
# --------  (run only)
# === CODE TYPE: GRADED TEST 

print ("Part 1.a - Test 1 (0.3 points) - Sanity")
print ("\t--->Testing the implementation of 'load_soup_object' ...")

dummy_html = 'dummy_html.html'
soup_dummy = None
try:
    soup_dummy = load_soup_object(dummy_html)
except Exception as e:
    print ('You probably have a syntax error, we got the following exception:')
    print (str(e))
    raise

print ("Good Job!\nYou've passed the 1st test for the 'load_soup_object' function implementation :-)")

Part 1.a - Test 1 (0.3 points) - Sanity
	--->Testing the implementation of 'load_soup_object' ...
Good Job!
You've passed the 1st test for the 'load_soup_object' function implementation :-)


In [6]:
# 1.a.
# ------------>>>>>>>> RUN THIS CODE CELL <<<<<<<<------------
# --------  (run only)
# === CODE TYPE: GRADED TEST 

print ("Part 1.a - Test 2 (0.7 points)")
print ("\t---> - Testing the implementation of 'load_soup_object' ...")

dummy_html = 'dummy_html.html'
soup_dummy = None
try:
    soup_dummy = load_soup_object(dummy_html)
except Exception as e:
    print ('You probably have a syntax error, we got the following exception:')
    print (str(e))
    raise
    
assert type(soup_dummy)==bs4.BeautifulSoup, "Wrong returned object, expecting a 'soup' object"
assert soup_dummy.h1.get_text()=='This is a dummy HTML', "Soup object doesn't include expected content"

print ('-----------------')
print ('Dummy html:')
print(soup_dummy.prettify())
print ('-----------------\n')

print ('Seems that the soup object was created successfully!')
print ("Good Job!\nYou've passed the 2nd test for the 'load_soup_object' function implementation :-)")

Part 1.a - Test 2 (0.7 points)
	---> - Testing the implementation of 'load_soup_object' ...
-----------------
Dummy html:
<!DOCTYPE html>
<html>
 <body>
  <h1>
   This is a dummy HTML
  </h1>
  <p>
   It is used only to test your Soup object creation
  </p>
 </body>
</html>

-----------------

Seems that the soup object was created successfully!
Good Job!
You've passed the 2nd test for the 'load_soup_object' function implementation :-)


### 1.b. Extract IMDb movie genres
In this sub-section you will *scrape* a list of movie genres from the main HTML page.

You need to *scrape* all of the movie genres (such as 'adventure', 'musical' and so on) and the link to the top rated movies by these genres.<br />
For example, the link to the adventure page on this site is `"adventure.html"`.

### Instructions
<u>method name</u>: <b>scrape_movie_genre_links</b>
<pre>The following is expected:
--- Complete the 'scrape_movie_genre_links' function to scrape the movie 
    genre information described above,
    from a soup object corresponding to a given 'html_file_name' file. 
    
    You need to return a dataframe with the following columns:
    'genre_name', 'link_to_genre_page'
    
    Each row in the dataframe, should contain the information for 
    these 2 columns (as described below).
    
    Start by loading the soup object. Then use 'inspect element' to find out which HTML elements define a genre and its link.

</pre>

Below you can see a sample dataframe with 2 rows (obviously there are more links on the page):

| | genre_name | link_to_genre_page | 
| :- | :- | :- |
| 0 | Action | action.html | 
| 1 | Adventure | adventure.html | 
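
To illustrate the shape of the task, the sketch below builds such a dataframe from a toy HTML snippet. The markup and the `genre-item` class are invented for the example; the real page uses different elements, which you should find with 'inspect element'.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Toy markup: the tag structure and class name are assumptions for this example.
toy_html = """
<ul>
  <li class="genre-item"><a href="action.html"> Action </a></li>
  <li class="genre-item"><a href="adventure.html"> Adventure </a></li>
</ul>
"""

toy_soup = BeautifulSoup(toy_html, 'html.parser')

rows = []
for item in toy_soup.find_all('li', class_='genre-item'):
    link = item.find('a')
    rows.append({
        'genre_name': link.get_text().strip(),     # visible link text
        'link_to_genre_page': link.get('href'),    # target of the link
    })

toy_df = pd.DataFrame(rows, columns=['genre_name', 'link_to_genre_page'])
print(toy_df)
```

Collecting one dictionary per link and building the dataframe once at the end keeps the column names in a single place.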

In [7]:
# 1.b.
# ------------>>>>>>>> RUN THIS CODE CELL <<<<<<<<------------
# --------  (run after implementation)
# === CODE TYPE: ANSWER 

def scrape_movie_genre_links(html_file_name):
    soup = load_soup_object(html_file_name)
    genres = soup.find_all('li', {'class': 'subnav_item_main'})
    genres = [genre.find('a') for genre in genres]
    genre_data = [{'genre_name': genre_link.get_text().strip(), 'link_to_genre_page': genre_link.get('href').strip()} for genre_link in genres]

    genre_df = pd.DataFrame(genre_data)
    return genre_df


In [8]:
# 1.b.
# ------------>>>>>>>> RUN THIS CODE CELL <<<<<<<<------------
# --------  (run after implementation, if used)
# === CODE TYPE: SELF TESTING
# Add your additional tests here if needed:

html_file_name = 'IMDb_Top_movies.html'
genre_df = scrape_movie_genre_links(html_file_name)
genre_df


| | genre_name | link_to_genre_page |
| :- | :- | :- |
| 0 | Action | action.html |
| 1 | Adventure | adventure.html |
| 2 | Animation | animation.html |
| 3 | Biography | biography.html |
| 4 | Comedy | comedy.html |
| 5 | Crime | crime.html |
| 6 | Drama | drama.html |
| 7 | Family | family.html |
| 8 | Fantasy | fantasy.html |
| 9 | Film-Noir | film_noir.html |


In [9]:
# 1.b.
# ------------>>>>>>>> RUN THIS CODE CELL <<<<<<<<------------
# --------  (run after implementation, if used)
# === CODE TYPE: SELF TESTING
# Add your additional tests here if needed:

genre_df.shape


(21, 2)

In [10]:
# 1.b.
# ------------>>>>>>>> RUN THIS CODE CELL <<<<<<<<------------
# --------  (run only)
# === CODE TYPE: GRADED TEST 

print ("Part 1.b - Test 1 (0.5 points) - Sanity")
print ("\t--->Testing the implementation of 'scrape_movie_genre_links' ...")

html_file_name = 'IMDb_Top_movies.html'
genre_df = None

try:
    genre_df = scrape_movie_genre_links(html_file_name)
except Exception as e:
    print ('You probably have a syntax error, we got the following exception:')
    print (str(e))
    raise

print ("Good Job!\nYou've passed the 1st test for the 'scrape_movie_genre_links' function implementation :-)")

Part 1.b - Test 1 (0.5 points) - Sanity
	--->Testing the implementation of 'scrape_movie_genre_links' ...
Good Job!
You've passed the 1st test for the 'scrape_movie_genre_links' function implementation :-)


In [11]:
# 1.b.
# ------------>>>>>>>> RUN THIS CODE CELL <<<<<<<<------------
# --------  (run only)
# === CODE TYPE: GRADED TEST 

print ("Part 1.b - Test 2 (0.5 points)")
print ("\t---> - Testing the implementation of 'scrape_movie_genre_links' ...")

html_file_name = 'IMDb_Top_movies.html'
genre_df = None

try:
    genre_df = scrape_movie_genre_links(html_file_name)
    genre_names = [genre_name.strip() for genre_name in genre_df['genre_name'].values]
except Exception as e:
    print ('You probably have a syntax error, we got the following exception:')
    print (str(e))
    raise
    
assert 'Animation' in genre_names, 'missing categories'

print ("Good Job!\nYou've passed the 2nd test for the 'scrape_movie_genre_links' function implementation :-)")

genre_df.head()

Part 1.b - Test 2 (0.5 points)
	---> - Testing the implementation of 'scrape_movie_genre_links' ...
Good Job!
You've passed the 2nd test for the 'scrape_movie_genre_links' function implementation :-)


| | genre_name | link_to_genre_page |
| :- | :- | :- |
| 0 | Action | action.html |
| 1 | Adventure | adventure.html |
| 2 | Animation | animation.html |
| 3 | Biography | biography.html |
| 4 | Comedy | comedy.html |


In [12]:
# 1.b.
# ------------>>>>>>>> RUN THIS CODE CELL <<<<<<<<------------
# --------  (run only)
# === CODE TYPE: GRADED TEST 

print ("Part 1.b - Test 3 (0.5 points)")
print ("\t---> - Testing the implementation of 'scrape_movie_genre_links' ...")

html_file_name = 'IMDb_Top_movies.html'
genre_df = None
try:
    genre_df = scrape_movie_genre_links(html_file_name)
    genre_names = [genre_name.strip() for genre_name in genre_df['genre_name'].values]  
except Exception as e:
    print ('You probably have a syntax error, we got the following exception:')
    print (str(e))
    raise  
    
assert 21==genre_df.shape[0], "Wrong number of results for query"

print ("Good Job!\nYou've passed the 3rd test for the 'scrape_movie_genre_links' function implementation :-)")

Part 1.b - Test 3 (0.5 points)
	---> - Testing the implementation of 'scrape_movie_genre_links' ...
Good Job!
You've passed the 3rd test for the 'scrape_movie_genre_links' function implementation :-)


In [13]:
# 1.b.
# ------------>>>>>>>> RUN THIS CODE CELL <<<<<<<<------------
# --------  (run only)
# === CODE TYPE: GRADED TEST 

print ("Part 1.b - Test 4 (0.5 points)")
print ("\t--->Testing the implementation of 'scrape_movie_genre_links' ...")
print ("\n\t====> Full grading test - the following test can not be seen before submission")

###
### AUTOGRADER TEST - DO NOT REMOVE
###


Part 1.b - Test 4 (0.5 points)
	--->Testing the implementation of 'scrape_movie_genre_links' ...

	====> Full grading test - the following test can not be seen before submission


## 2. Scraping and crawling

### Extract top rated movies per genre
In this section you will need to extract information regarding the top rated movies, for a specific genre (such as *Drama*).<br/>

The link to the first page of top movies for a genre appears in the dataframe returned from the<br/> 'scrape_movie_genre_links' method (the previous method).<br/>

#### Important note - crawling (and then scraping info from next pages) is needed:
The first page (for instance *drama.html*), includes only the first chunk of the top movies, <br/>
and you are required to *crawl* to the next page in order to get more results.<br/>

The number of pages is given as a parameter, and it includes the first page of the top movies for that genre.<br/>
* For example, if n_pages=2, and genre is *Drama*, you need another page besides *drama.html*
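
The crawling idea can be sketched as a loop that follows a "next" link until enough pages were visited. This is a toy illustration: the pages are kept in a dictionary, and the markup and the `next-page` class are assumptions, not the structure of the cached IMDb pages.

```python
from bs4 import BeautifulSoup

# Toy "site": file name -> HTML content (all invented for this example).
toy_pages = {
    'toy.html': '<p>page 1</p><a class="next-page" href="toy_2.html">Next</a>',
    'toy_2.html': '<p>page 2</p><a class="next-page" href="toy_3.html">Next</a>',
    'toy_3.html': '<p>page 3</p>',
}


def crawl_pages(first_page, n_pages):
    """Return the names of up to n_pages pages, starting at first_page."""
    visited = []
    current = first_page
    while current is not None and len(visited) < n_pages:
        visited.append(current)
        soup = BeautifulSoup(toy_pages[current], 'html.parser')
        next_link = soup.find('a', class_='next-page')
        # Stop when the page has no link to a following page.
        current = next_link.get('href') if next_link is not None else None
    return visited


print(crawl_pages('toy.html', 2))  # ['toy.html', 'toy_2.html']
```

Note that `n_pages` counts the first page, so `n_pages=2` visits exactly one additional page.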

<hr> 

#### Information you are expected to scrape:
You will need to *scrape* the relevant information about the union of all of the movies in these pages.<br/>
You need to extract the following information for each of the top IMDb ranked movies:
1. Name of the movie
2. The release year
3. Genres - each movie can belong to several genres. These genres are mentioned in these web pages.
4. Rating for the movie
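
Scraped text usually needs light cleaning before it goes into a dataframe. The helpers below are a hedged sketch: the function names are hypothetical, and the input strings (including a year preceded by a qualifier such as `(I)`) are assumed examples of what scraped text may look like, not values taken from the cached pages.

```python
import re


def parse_year(year_text):
    """Extract a four-digit year from text such as '(1997)' or '(I) (2019)'."""
    match = re.search(r'\d{4}', year_text)
    return match.group() if match else None


def clean_genres(genre_text):
    """Normalize a comma-separated genre string to 'A, B, C' form."""
    parts = [name.strip() for name in genre_text.split(',')]
    return ', '.join(part for part in parts if part)


print(parse_year('(1997)'))                        # 1997
print(parse_year('(I) (2019)'))                    # 2019
print(clean_genres('\nComedy, Drama, Romance  '))  # Comedy, Drama, Romance
```

Both helpers return strings, so whether to convert the year or rating to numbers is left as a separate decision.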

### Instructions
<u>method name</u>: <b>load_top_rated_movies_per_genre</b>
<pre>The following is expected:
--- Complete the 'load_top_rated_movies_per_genre' function to scrape all the required 
    information for each of the top movies, as described above, for a specific genre, 
    given in the 'genre_url_address' parameter (for example: action.html)

    Use inspect element to find the link to the next page.
    Then, for each URL, crawl it and the next pages of the genre.
    You could expect between 1-5 pages (the 'n_pages' parameter) to scrape (including 
    'genre_url_address', the first page of the genre).
    
    Use the 'load_soup_object' method (which you have already implemented), to get a soup object for each of the 
    top rated movies web pages.
    
    You can create a single soup object using BeautifulSoup's append function
    (for additional information, see the documentation at:
    https://www.crummy.com/software/BeautifulSoup/bs4/doc/#append),
    or join the per-page lists in any other way.

    You need to return a dataframe with the following columns:
    'movie_name', 'release_year', 'genre_names', 'rating'
</pre>
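
One simple way to "join the lists" is to collect the rows from every page into a single Python list and build the dataframe once. The sketch below uses invented placeholder rows instead of scraped data.

```python
import pandas as pd

# Rows from two hypothetical pages (values invented for illustration).
page_1_rows = [
    {'movie_name': 'Movie A', 'release_year': '2001', 'genre_names': 'Drama', 'rating': '8.1'},
]
page_2_rows = [
    {'movie_name': 'Movie B', 'release_year': '2002', 'genre_names': 'Comedy', 'rating': '7.9'},
]

all_rows = []
for page_rows in (page_1_rows, page_2_rows):
    all_rows.extend(page_rows)  # keep page order: first page first

columns = ['movie_name', 'release_year', 'genre_names', 'rating']
movies_df = pd.DataFrame(all_rows, columns=columns)
print(movies_df.shape)  # (2, 4)
```

Because the rows are appended page by page, the dataframe index follows the ranking order across pages.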

In [14]:
# 2.
# ------------>>>>>>>> RUN THIS CODE CELL <<<<<<<<------------
# --------  (run after implementation)
# === CODE TYPE: ANSWER 

def load_top_rated_movies_per_genre(genre_url_address, n_pages):
    # Build the list of cached page names. Instead of following the 'next'
    # link, this relies on the naming of the cached files: each page holds
    # 50 movies, and page i (counting from 0) is '<genre>_start_<i*50+1>.html'.
    base_name = genre_url_address.rsplit('.', 1)[0]
    genre_pages = [genre_url_address]
    for i in range(1, n_pages):
        genre_pages.append(f'{base_name}_start_{i * 50 + 1}.html')

    movie_data = []
    for url in genre_pages:
        soup = load_soup_object(url)
        # Each movie sits in its own 'lister-item mode-advanced' <div>.
        movies = soup.find_all('div', {'class': 'lister-item mode-advanced'})
        for movie in movies:
            movie_name = movie.h3.a.get_text()
            # The year appears in parentheses, e.g. '(1997)'.
            release_year = movie.h3.find('span', {'class': 'lister-item-year'}).get_text().strip('()')
            genre_names = movie.find('span', {'class': 'genre'}).get_text().strip()
            rating = movie.find('div', {'class': 'inline-block ratings-imdb-rating'}).strong.get_text()
            movie_data.append({'movie_name': movie_name, 'release_year': release_year, 'genre_names': genre_names, 'rating': rating})

    movie_df = pd.DataFrame(movie_data)
    return movie_df


In [15]:
# 2.
# ------------>>>>>>>> RUN THIS CODE CELL <<<<<<<<------------
# --------  (run after implementation, if used)
# === CODE TYPE: SELF TESTING
# Add your additional tests here if needed:

comedy_movies_df = load_top_rated_movies_per_genre('comedy.html', 3)
comedy_movies_df
###
### YOUR CODE HERE
###


| | movie_name | release_year | genre_names | rating |
| :- | :- | :- | :- | :- |
| 0 | Hababam Sinifi | 1975 | Comedy, Drama | 9.3 |
| 1 | Dil Bechara | 2020 | Comedy, Drama, Romance | 8.8 |
| 2 | Parasite | 2019 | Comedy, Drama, Thriller | 8.6 |
| 3 | Life Is Beautiful | 1997 | Comedy, Drama, Romance | 8.6 |
| 4 | The Intouchables | 2011 | Biography, Comedy, Drama | 8.5 |
| ... | ... | ... | ... | ... |
| 145 | Evil Dead II | 1987 | Action, Comedy, Fantasy | 7.8 |
| 146 | Ferris Bueller's Day Off | 1986 | Comedy | 7.8 |
| 147 | Down by Law | 1986 | Comedy, Crime, Drama | 7.8 |
| 148 | The Goonies | 1985 | Adventure, Comedy, Family | 7.8 |


In [16]:
# 2.
# ------------>>>>>>>> RUN THIS CODE CELL <<<<<<<<------------
# --------  (run after implementation, if used)
# === CODE TYPE: SELF TESTING
# Add your additional tests here if needed:

comedy_movies_df.shape


(150, 4)

In [17]:
# 2.
# ------------>>>>>>>> RUN THIS CODE CELL <<<<<<<<------------
# --------  (run only)
# === CODE TYPE: GRADED TEST 

print ("Part 2 - Test 1 (0.5 points) - Sanity")
print ("\t--->Testing the implementation of 'load_top_rated_movies_per_genre' ...")

try:
    action_movies_df = load_top_rated_movies_per_genre('action.html', 2)
    comedy_movies_df = load_top_rated_movies_per_genre('comedy.html', 3)
    drama_movies_df  = load_top_rated_movies_per_genre('drama.html', 4)
except Exception as e:
    print ('You probably have a syntax error, we got the following exception:')
    print (str(e))
    raise

print ("Good Job!\nYou've passed the 1st test for the 'load_top_rated_movies_per_genre' function implementation :-)")

Part 2 - Test 1 (0.5 points) - Sanity
	--->Testing the implementation of 'load_top_rated_movies_per_genre' ...
Good Job!
You've passed the 1st test for the 'load_top_rated_movies_per_genre' function implementation :-)


In [18]:
# 2.
# ------------>>>>>>>> RUN THIS CODE CELL <<<<<<<<------------
# --------  (run only)
# === CODE TYPE: GRADED TEST 

print ("Part 2 - Test 2 (1 point)")
print ("\t--->Testing the implementation of 'load_top_rated_movies_per_genre' ...")

try:
    action_movies_df = load_top_rated_movies_per_genre('action.html', 2)
    comedy_movies_df = load_top_rated_movies_per_genre('comedy.html', 3)
    drama_movies_df = load_top_rated_movies_per_genre('drama.html', 4)
except Exception as e:
    print ('You probably have a syntax error, we got the following exception:')
    print (str(e))
    raise
    
assert action_movies_df['movie_name'].iloc[0]=='The Dark Knight','wrong top action movie name'
assert comedy_movies_df['genre_names'].iloc[69]=='Animation, Adventure, Comedy','wrong 70th rated comedy movie genre names'
assert drama_movies_df['movie_name'].iloc[109]=='The Wolf of Wall Street','wrong 110th rated drama movie name'

print ('Top rated action movie names and ratings:')
print (action_movies_df[['movie_name', 'rating']].head())
print ('\n----------------')
print ('Top rated comedy movie names and ratings:')
print (comedy_movies_df[['movie_name', 'rating']].head())
print ('\n----------------')
print ('Top rated drama movie names and ratings:')
print (drama_movies_df[['movie_name', 'rating']].head())

print ("Good Job!\nYou've passed the 2nd test for the 'load_top_rated_movies_per_genre' function implementation :-)")

Part 2 - Test 2 (1 point)
	--->Testing the implementation of 'load_top_rated_movies_per_genre' ...
Top rated action movie names and ratings:
                                          movie_name rating
0                                    The Dark Knight    9.0
1      The Lord of the Rings: The Return of the King    8.9
2                                    The Mountain II    8.8
3                                          Inception    8.8
4  The Lord of the Rings: The Fellowship of the Ring    8.8

----------------
Top rated comedy movie names and ratings:
          movie_name rating
0     Hababam Sinifi    9.3
1        Dil Bechara    8.8
2           Parasite    8.6
3  Life Is Beautiful    8.6
4   The Intouchables    8.5

----------------
Top rated drama movie names and ratings:
                 movie_name rating
0            Hababam Sinifi    9.3
1  The Shawshank Redemption    9.3
2             The Godfather    9.2
3           The Dark Knight    9.0
4    The Godfather: Part II    9.0
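Good Job!
You've passed the 2nd test for the 'load_top_rated_movies_per_genre' function implementation :-)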

In [19]:
# 2.
# ------------>>>>>>>> RUN THIS CODE CELL <<<<<<<<------------
# --------  (run only)
# === CODE TYPE: GRADED TEST 

print ("Part 2 - Test 3 (0.5 points)")
print ("\t--->Testing the implementation of 'load_top_rated_movies_per_genre' ...")

try:
    action_movies_df = load_top_rated_movies_per_genre('action.html', 5)
    comedy_movies_df = load_top_rated_movies_per_genre('comedy.html', 5)
    drama_movies_df  = load_top_rated_movies_per_genre('drama.html', 4)
except Exception as e:
    print ('You probably have a syntax error, we got the following exception:')
    print (str(e))
    raise
    
assert action_movies_df.shape[0]==250, 'Wrong number of results for action top movies'
assert comedy_movies_df.shape[0]==250, 'Wrong number of results for comedy top movies'

print ("Good Job!\nYou've passed the 3rd test for the 'load_top_rated_movies_per_genre' function implementation :-)")

Part 2 - Test 3 (0.5 points)
	--->Testing the implementation of 'load_top_rated_movies_per_genre' ...
Good Job!
You've passed the 3rd test for the 'load_top_rated_movies_per_genre' function implementation :-)


In [20]:
# 2.
# ------------>>>>>>>> RUN THIS CODE CELL <<<<<<<<------------
# --------  (run only)
# === CODE TYPE: GRADED TEST 

print ("Part 2 - Test 4 (1 point)")
print ("\t--->Testing the implementation of 'load_top_rated_movies_per_genre' ...")
print ("\n\t====> Full grading test - the following test can not be seen before submission")

###
### AUTOGRADER TEST - DO NOT REMOVE
###


Part 2 - Test 4 (1 point)
	--->Testing the implementation of 'load_top_rated_movies_per_genre' ...

	====> Full grading test - the following test can not be seen before submission
