## Python Programming - Web Scraping

_**Name: Aditya Ramesh Parab**_

### All imports

In [1]:
from bs4 import BeautifulSoup
import requests
import re
import json
import pandas as pd
import numpy as np
import string
import math
import random

### User Agents
<br>
A list of user agents to rotate through when building request headers.<br>
- Reference for user_agent_list: https://www.scrapehero.com/how-to-fake-and-rotate-user-agents-using-python-3/

In [2]:
# A few user agents to rotate between requests
user_agent_list = [
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_5) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/13.1.1 Safari/605.1.15',
'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:77.0) Gecko/20100101 Firefox/77.0',
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36',
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:77.0) Gecko/20100101 Firefox/77.0',
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36',
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36'
]
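
A minimal sketch of how a list like this can be rotated, one random choice per request. `SAMPLE_AGENTS` and `build_headers` are hypothetical names used only for illustration; they are not part of the notebook.

```python
import random

# Hypothetical short list standing in for user_agent_list above
SAMPLE_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:77.0) Gecko/20100101 Firefox/77.0',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:77.0) Gecko/20100101 Firefox/77.0',
]


def build_headers(agents):
    """Return a headers dict with a randomly chosen User-Agent."""
    return {'User-Agent': random.choice(agents)}


headers = build_headers(SAMPLE_AGENTS)
print(headers['User-Agent'])
```

The resulting dict is meant to be passed by keyword, as in `requests.get(url, headers=headers)`. The second positional argument of `requests.get` is `params`, so a dict passed positionally ends up in the query string instead of the headers.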

### Scrape and fetch IMDB Top 250 movie links <br>
The following function sends a GET request to 'https://www.imdb.com/chart/top/?ref_=nv_mv_250'.<br>
It scrapes all 250 movie links from the response, stores them in a list, and returns that list.

In [3]:
# Fetch the IMDB Top 250 chart and return a list of all movie links
def getMovies():
    # URL of top 250 IMDB movies
    url = 'https://www.imdb.com/chart/top/?ref_=nv_mv_250'
    
    # Fetching header randomly from the defined list
    headers = {'User-Agent': random.choice(user_agent_list)}
    page = requests.get(url, headers=headers)
    soup = BeautifulSoup(page.content, "html.parser")
#     print(soup)
    movies_scraped = soup.find_all('td', class_ = 'titleColumn')
    
    # Saving all movie links from scraped data
    movie_links = []
    for i in movies_scraped:
        hyperlink = i.find('a')
        url = hyperlink.get('href')
        url = 'https://www.imdb.com' + url
        movie_links.append(url)
    
    return movie_links
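
The link-building step above concatenates the site root with each relative `href`. A small sketch of the same step using `urllib.parse.urljoin`, which also copes with an `href` that is already absolute. The `href` values below are hypothetical placeholders, not real scraped data, and `to_absolute` is a name introduced only for this example.

```python
from urllib.parse import urljoin

BASE_URL = 'https://www.imdb.com'


def to_absolute(href, base=BASE_URL):
    """Resolve a relative or absolute href against the site root."""
    return urljoin(base, href)


# Hypothetical href values, as they might appear in the chart's <a> tags
sample_hrefs = ['/title/tt0000001/', 'https://www.imdb.com/title/tt0000002/']
movie_links = [to_absolute(h) for h in sample_hrefs]
print(movie_links)
```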

### Scrape required data accurately from json response <br>
The following function sends a GET request for every link in the movie_links list and scrapes the movie name, directors, and genre for each movie from the JSON content in the response, storing the data in a dictionary.<br>
A list of such dictionaries, _movies = []_, holds the scraped data for every movie. This list is then returned by the function.

In [4]:
# This function scrapes data from the JSON-LD <script> tag embedded in each movie page.
# The JSON object in that script carries the same data that the page displays.
def movie_data_from_script(movie_links, count=0):
    movies = []

    # Iterating over movies to extract each movie's details
    for link in movie_links:
#         print(link)
        headers = {'User-Agent': random.choice(user_agent_list)}
        page = requests.get(link, headers = headers)
        soup = BeautifulSoup(page.content, "html.parser")
#         print(soup)

        # Movie name
        title = soup.find('h1', attrs = {'data-testid':"hero-title-block__title"}).get_text()

        # The JSON content which consist of all the data displayed on the web page
        content_json = soup.find_all("script", {"type": "application/ld+json"})[0]
        jsn = json.loads(content_json.string)
#         print(json.dumps(jsn, indent=4))
        genre = jsn["genre"]

        director_list = jsn["director"]
        director = []
        for obj in director_list:
            director.append(obj["name"])

        movie_data = {
            "movie_title" : title,
            "director" : director,
            "genre" : genre
        }
        count = count+1

        movies.append(movie_data)

# Block written to test only 20 initial entries 
#         if count == 20:
#             break;
    
    return movies
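
The function above assumes `director` is a list of objects and `genre` is a list. A hedged sketch of a more defensive extraction, assuming a page might instead emit a single object or a single string for those fields. The JSON string is a made-up sample, and `names_from` and `as_list` are hypothetical helpers, not part of the notebook.

```python
import json


def names_from(value):
    """Return a list of names from a dict, a list of dicts, or None."""
    if value is None:
        return []
    if isinstance(value, dict):
        value = [value]
    return [item['name'] for item in value if 'name' in item]


def as_list(value):
    """Wrap a single string in a list; pass lists through; map None to []."""
    if value is None:
        return []
    if isinstance(value, str):
        return [value]
    return list(value)


# Made-up JSON-LD sample with a single director object and a single genre string
sample = ('{"@type": "Movie", "name": "Example Film", "genre": "Drama", '
          '"director": {"@type": "Person", "name": "Jane Doe"}}')
jsn = json.loads(sample)

movie_data = {
    "movie_title": jsn.get("name"),
    "director": names_from(jsn.get("director")),
    "genre": as_list(jsn.get("genre")),
}
print(movie_data)
```

Normalizing both fields to lists keeps the later `', '.join(...)` step from splitting a plain string into single characters.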

### Scrape required data from web page HTML/CSS tags <br>
The following function does the same job as the previous one; the only difference is that it reads the data from the actual HTML tags.<br>
It sends a GET request for every link in the movie_links list and scrapes the movie name, directors, and genre for each movie, storing the data in a dictionary.<br>
A list of such dictionaries, _movies = []_, holds the scraped data for every movie. This list is then returned by the function.

In [5]:
# This function scrapes data from actual HTML/CSS tags.
def movie_data_from_web_html(movie_links):
    movies = []
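    for link in movie_links:
        # Fetching header randomly from the defined list
        headers = {'User-Agent': random.choice(user_agent_list)}
        page = requests.get(link, headers = headers)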
        soup = BeautifulSoup(page.content, "html.parser")
        
        # Block to extract movie title
        title_row = soup.find('h1', attrs = {'data-testid':"hero-title-block__title"})
        title = title_row.get_text()
#         print(title)

        # Block to extract director data
        director_block = soup.find('div', class_="sc-fa02f843-0 fjLeDR")
        director_tags = director_block.find_all(class_='ipc-metadata-list-item__content-container')[0].find_all('a')
        director = []
        for i in director_tags:
            name = i.get_text()
            director.append(name)
#         print(director)

        # Block to extract genre data
        genre_block = soup.find('div',attrs = { 'data-testid' : 'genres'})
        genre_tags = genre_block.find_all('li')
        genres = []
        for i in genre_tags:
            genre = i.get_text()
            genres.append(genre)
#         print(genres)


        movie_dic = {
            "movie_title":title,
            "director": director,
            "genre": genres
        }
    
        movies.append(movie_dic)

    return movies

### Create Dataframe and export to CSV <br>
The data obtained from the functions above is converted into a Pandas DataFrame and exported to a .csv file.

In [6]:
# Call one of the scraping functions above, load the result into a Pandas dataframe, and export it to a csv file
def create_dataframe_save_csv(movie_links):
    # Toggle (comment/uncomment) below two lines to check functioning of above two functions
    df = pd.DataFrame(movie_data_from_script(movie_links))
#     df = pd.DataFrame(movie_data_from_web_html(movie_links))

    df.columns = ['Movie Title', 'Director', 'Genres']
    df.index = np.arange(1, len(df) + 1)
    df.index.name = 'Ranking'

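    delimiter = ', '

    # Join each list into one comma-separated string.
    # Assigning whole columns is reliable; rows yielded by iterrows() may be copies,
    # so writing to them is not guaranteed to update the dataframe.
    df['Director'] = df['Director'].apply(delimiter.join)
    df['Genres'] = df['Genres'].apply(delimiter.join)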

    df.to_csv("AdityaParab_movies.csv")

    return df

### Convert dataframe to dictionary
The given dataframe records are summarized into a dictionary for display purposes.

In [7]:
# Function to summarize the data sent in the records dataframe
def fetch_data_by_director(records, director_name):
    movies_directed = []
    genre_summary = {}
    
    if records is not None and len(records) > 0:
        for index, row in records.iterrows():
            movies_directed.append(string.capwords(row['Movie Title']))
            genres = [string.capwords(x.strip()) for x in row['Genres'].split(',')]
            for key in genres:
                if key not in genre_summary:
                    genre_summary[key] = 1
                else:
                    val = genre_summary[key] + 1
                    genre_summary[key] = val
                    
        result = {
            'movies' : movies_directed,
            'director' : director_name,
            'genres' : genre_summary
        }

        return result
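
The genre tally in the function above can also be written with `collections.Counter`. This is an alternative sketch, not the notebook's implementation; `summarize_genres` is a hypothetical name, and the sample strings mimic the upper-cased `Genres` column.

```python
from collections import Counter
import string


def summarize_genres(genre_strings):
    """Count genres across comma-separated strings, normalizing capitalization."""
    counts = Counter()
    for genres in genre_strings:
        counts.update(string.capwords(g.strip()) for g in genres.split(','))
    return dict(counts)


# Hypothetical values in the same shape as the 'Genres' column
sample = ['ACTION, CRIME, DRAMA', 'ACTION, SCI-FI']
print(summarize_genres(sample))
```

`Counter` also offers `most_common()`, which would give the genres sorted by frequency if the "most directed genres" line should be ordered.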
    

### Convert dictionary to String
Custom implementation of toString() for a dictionary.

In [8]:
# Function to convert a dictionary in printable format
def convert_dic_to_string(dic):
    # Convert the data into a presentable form
    movie_titles = ', '.join(dic['movies'])
    director = string.capwords(dic['director'])
    genres = ', '.join("{}: {}".format(k, v) for k, v in dic['genres'].items())

    result = {
        'movies' : movie_titles,
        'director' : director,
        'genres' : genres
    }

    return result

### Cosine Similarity Computation
Computing cosine similarity between two genre dictionaries

In [9]:
# Function to calculate cosine similarity between two genre dictionaries
def cosine_similarity(dic1,dic2):
    numerator = 0
    denominator1 = 0
    for key,val in dic1.items():
        numerator += val * dic2.get(key,0.0)
        denominator1 += val * val

    denominator2 = 0
    for val in dic2.values():
        denominator2 += val * val
        
    return numerator/math.sqrt(denominator1*denominator2)

# dir2 = {'drama': 22, 'war': 6, 'action': 10, 'crime': 4, 'thriller':9} 
# dir1 =  {'drama': 17, 'crime':20, 'action' : 31}

# print("{:.5f}".format(cosine_similarity(dir1,dir2)))

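The score is the dot product of the two genre-count vectors divided by the product of their Euclidean norms, with genres missing from one dictionary counted as 0. A self-contained sketch follows, using the sample dictionaries from the commented-out lines above. The zero-norm guard and the name `cosine_similarity_safe` are additions for illustration, not part of the original function.

```python
import math


def cosine_similarity_safe(dic1, dic2):
    """Cosine similarity of two {genre: count} dicts; 0.0 if either is all zeros or empty."""
    dot = sum(val * dic2.get(key, 0) for key, val in dic1.items())
    norm1 = math.sqrt(sum(val * val for val in dic1.values()))
    norm2 = math.sqrt(sum(val * val for val in dic2.values()))
    if norm1 == 0 or norm2 == 0:
        # Avoid ZeroDivisionError when a director has no genre counts
        return 0.0
    return dot / (norm1 * norm2)


dir1 = {'drama': 17, 'crime': 20, 'action': 31}
dir2 = {'drama': 22, 'war': 6, 'action': 10, 'crime': 4, 'thriller': 9}

print("{:.5f}".format(cosine_similarity_safe(dir1, dir2)))
```

For these two dictionaries the dot product is 17*22 + 20*4 + 31*10 = 764, and the squared norms are 1650 and 717, so the score is 764 / sqrt(1650 * 717).
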

### Main Driver Function
The main function takes a while to fetch all the data from the web and build the dataframe. A faster option is to comment out its first two lines, i.e.
<br>
_movie_links = getMovies()_<br>
_data = create_dataframe_save_csv(movie_links)_<br>
and use the csv file attached to the submission to create the dataframe directly.

In [11]:
# Main function
def menu_driven():
    movie_links = getMovies()
    data = create_dataframe_save_csv(movie_links)
    data = pd.read_csv('AdityaParab_movies.csv', index_col=0)

    # Upper-case the text columns so user input can be matched case-insensitively.
    # Whole columns are assigned because rows yielded by iterrows() may be copies.
    for col in ['Movie Title', 'Director', 'Genres']:
        data[col] = data[col].str.upper()

#     print(data)
    
    while True:
        # Ask the user what they want to check
        inp = input('\nWhat do you want to check on IMDB? (Please choose ‘movie’, ‘director’, or ‘comparison’)\nEnter ‘exit’ to quit\nYour choice: ')
       
        # If choice is Movie
        if inp.upper() == 'MOVIE':
            # Ask for movie name input
            mov_inp = input('Which movie do you want to check?\nYour choice:')
            
            # Check if user's entered movie is in the list
            if mov_inp.upper() in data['Movie Title'].values:
                # Fetch the dataframe record for the matched movie title
                record = data.loc[data['Movie Title'] == mov_inp.upper()]
               
                # Convert the data into a presentable form
                movie_title = string.capwords(record['Movie Title'].item())
                director = string.capwords(record['Director'].item())
                genres = string.capwords(record['Genres'].item())
                
                # Output
                print('\nThe director of movie ‘' + movie_title + '’ is', director + '.' +
                      '\nThe genre of the movie is', genres)
                
            else:
                print('\nThe movie you entered is not in the list. Please try some other movie.')

        # If choice is Director
        elif inp.upper() == 'DIRECTOR':
            print('DIRECTOR')
            # Ask for director's name input
            dir_inp = input('Who do you want to check?\nYour choice:')
            
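            # regex=False treats the input as plain text, so characters like '(' or '.' are matched literally
            filtered = data['Director'].str.contains(dir_inp, case=False, regex=False)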
            records = data[filtered]
            
            if records is not None and len(records) > 0:
                # Function call to fetch data from dataframe and get the data in representable format
                dic_dir = convert_dic_to_string(fetch_data_by_director(records, dir_inp))
                
                # Output
                print('\n'+ dic_dir['director'] + ' has directed ' + dic_dir['movies'] + '.'
                      '\nThe most directed genres are', dic_dir['genres'])
                
            else:
                print('\nThe director\'s name you entered is not in the list. Please try some other director.')
        
       
        # If choice is Comparison
        elif inp.upper() == 'COMPARISON':
            print('COMPARISON')
            # Ask for director's name input
            dir_inp1 = input('Who do you want to compare?\nEnter first Director:')
            dir_inp2 = input('Enter second Director:')
            
            if dir_inp1.upper() == dir_inp2.upper():
                print('\n*** Invalid input. You entered the same name for both directors. Please enter two different directors... try again ***')
            else:
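                # regex=False treats the input as plain text rather than a regular expression
                filter_dir1 = data['Director'].str.contains(dir_inp1, case=False, regex=False)
                filter_dir2 = data['Director'].str.contains(dir_inp2, case=False, regex=False)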

                records_dir1 = data[filter_dir1]
                records_dir2 = data[filter_dir2]

                if records_dir1 is not None and records_dir2 is not None and len(records_dir1)>0 and len(records_dir2)>0:
                    # Fetch the records from the dataframe and store them in dictionaries in raw format
                    dic_dir1 = fetch_data_by_director(records_dir1, dir_inp1)
                    dic_dir2 = fetch_data_by_director(records_dir2, dir_inp2)
                    cosine_sim = cosine_similarity(dic_dir1['genres'], dic_dir2['genres'])

                    # Converting fetched data to a presentable format
                    dic_dir1_strings = convert_dic_to_string(dic_dir1)
                    dic_dir2_strings = convert_dic_to_string(dic_dir2)
                    
                    # Output
                    print('\n'+ dic_dir1_strings['director'] + ' has directed ' + dic_dir1_strings['movies'] + '.'
                      '\nThe most directed genres are', dic_dir1_strings['genres']) 
                    print('\n'+ dic_dir2_strings['director'] + ' has directed ' + dic_dir2_strings['movies'] + '.'
                      '\nThe most directed genres are', dic_dir2_strings['genres']) 
                    print('\nBased on that, they have a cosine similarity score of {:.5f}'.format(cosine_sim))
                else:
                    print('\nThe directors you entered are not in the list. Please try some other directors.')

        # If choice is to exit the program
        elif inp.upper() == 'EXIT':
            print('\nGoodbye!')
            break
        
        # If choice is invalid
        else:
            print('\n*** Invalid choice. You can only enter ‘movie’, ‘director’, or ‘comparison’... try again ***')
            continue
#             inp = input('What do you want to check on IMDB? (Please choose ‘movie’, ‘director’, or ‘comparison’)\nEnter ‘exit’ to quit\nInput: ')

    
    

menu_driven()


What do you want to check on IMDB? (Please choose ‘movie’, ‘director’, or ‘comparison’)
Enter ‘exit’ to quit
Your choice: MoVIe
Which movie do you want to check?
Your choice:The Dark KNIGHT

The director of movie ‘The Dark Knight’ is Christopher Nolan.
The genre of the movie is Action, Crime, Drama

What do you want to check on IMDB? (Please choose ‘movie’, ‘director’, or ‘comparison’)
Enter ‘exit’ to quit
Your choice: dIRECtoR
DIRECTOR
Who do you want to check?
Your choice:Christopher Nolan

Christopher Nolan has directed The Dark Knight, Inception, Interstellar, The Prestige, Memento, The Dark Knight Rises, Batman Begins.
The most directed genres are Action: 4, Crime: 3, Drama: 5, Adventure: 2, Sci-fi: 2, Mystery: 2, Thriller: 2

What do you want to check on IMDB? (Please choose ‘movie’, ‘director’, or ‘comparison’)
Enter ‘exit’ to quit
Your choice: cOMpaRISOn
COMPARISON
Who do you want to compare?
Enter first Director:Christopher Nolan
Enter second Director:Steven Spielberg

Christ