# Instructions
         
For Part A, you need to scrape an IMDb web page to find the top movies sorted by user votes. For each movie, you need to pull:
- movie_id
- rank
- title 
- runtime
- year
- rating
- votes

The URL of a page that lists movies released between 2018 and 2020, sorted by number of votes, is:

https://www.imdb.com/search/title/?title_type=feature&release_date=2018-01-01,2020-12-31&sort=num_votes,desc

Please click the URL and investigate how you can pull movie_id, rank, title,... from the webpage. 


**You need to write code after where I have <span style="color:red">'''  Your code here ...    '''.</span>**

***
Now let’s look at the read_m_from_html_string(url, num_of_m=50) function in detail. The parameter “num_of_m” represents the number of top movies you want to retrieve. For example, read_m_from_html_string(url, 500) means that we want to extract the top 500 movies released between 2018 and 2020, sorted by users' votes.

This function returns a list of dictionaries. Each dictionary represents one of the top movies, which could look like the following:

{
  
    'movie_id': 'tt7286456',
    'rank': '1.',
    'title': 'Joker',
    'runtime': '2h 2m',
    'year': '2019',
    'rating': '8.4',
    'votes': '1,421,777',
}


After you implement “read_m_from_html_string”, which will return a list of top movies, you need to export the movies list to a csv file.


***

After you are done scraping the needed data, you should clean and transform it as needed to make it ready for enriching the given "Movies.csv" dataset.
***

Finally, export the enriched dataset to a CSV file:
Use the following naming convention: Project_3_PartA_Lastname.csv




In [115]:
import warnings
warnings.filterwarnings('ignore')
from bs4 import BeautifulSoup
import pandas as pd

***

## read_m_from_html_string

Inside this function, you need to write your code to pull the movie information from the provided Movies 500 HTML String text file.

For each movie, you need to pull :
- movie_id
- rank
- title 
- runtime
- year
- rating
- votes

To give an example of how to pull data from the web page HTML string, I have included the code to pull the movie_id.
You need to include your code to pull the other needed movie information (title, rank, year, ......). You should have no missing values in any of the collected fields.

The URL of a page that lists movies released between 2018 and 2020, sorted by number of votes, is: 

https://www.imdb.com/search/title/?title_type=feature&release_date=2018-01-01,2020-12-31&sort=num_votes,desc

Please click the URL and investigate how you can pull movie_id, rank, title,... from the webpage using the Inspect option.
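
As a rough illustration of what to look for with Inspect, here is a minimal sketch that runs `find` and `find_all` on a tiny hand-written HTML fragment. The class names are the ones used in the function below; the fragment's structure and the `?ref_=sr_t_1` part of the link are assumptions made for illustration, not a copy of the real page.

```python
from bs4 import BeautifulSoup

# Hand-written fragment that imitates one movie entry (structure is assumed).
html = """
<ul>
  <li class="ipc-metadata-list-summary-item">
    <a class="ipc-title-link-wrapper" href="/title/tt7286456/?ref_=sr_t_1">
      <h3 class="ipc-title__text">1. Joker</h3>
    </a>
    <span class="dli-title-metadata-item">2019</span>
    <span class="dli-title-metadata-item">2h 2m</span>
  </li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")

# find_all returns a list of every matching tag; the second argument is the CSS class.
items = soup.find_all("li", "ipc-metadata-list-summary-item")
first = items[0]

# find returns only the first matching tag inside the given element.
link = first.find("a", "ipc-title-link-wrapper")
heading = first.find("h3", "ipc-title__text").text

# The metadata spans come back in page order: year first, then runtime.
meta = [span.text.strip() for span in first.find_all("span", "dli-title-metadata-item")]

print(link["href"], heading, meta)
```

The same pattern (locate one entry, then search inside it) applies to each field you need to collect.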



In [118]:
# This function reads a number of movies from an HTML string. The default number is 50.
def read_m_from_html_string(url, num_of_m=50):
    
    print(url)
    
    with open('TopVoted_500_Movies_HTML.txt', 'r', encoding="utf8") as file:
        html_string = file.read()   # to read the html file as a string
        # I have included the Movies 500 HTML String.txt file in the project folder. Please take a look.
    
    # create a soup object
    soup = BeautifulSoup(html_string, "html.parser")
    
    '''
    Click the URL and investigate how you can pull movie_id, rank, title,... from the webpage.
    To investigate the html of a web page , For example:
    URL: https://www.imdb.com/search/title/?title_type=feature&release_date=2018-01-01,2020-12-31&sort=num_votes,desc
    Right-click anywhere on the webpage, and at the very bottom of the menu that pops up, 
    you will see "Inspect", Click on it.
    '''
    '''
    Fetch the div that contains all the movies. This can be done with the find and find_all functions.
    For example, find_all('div') returns all divs on the page. Both find and find_all
    accept two parameters: in the code below, 'div' is the tag name and
    'ipc-page-grid__item ipc-page-grid__item--span-2' is the value of the tag's class attribute.
    So soup.find('div', 'ipc-page-grid__item ipc-page-grid__item--span-2') explicitly says:
    I want to find a div with attribute class = 'ipc-page-grid__item ipc-page-grid__item--span-2'.
    
    Since we only need the one div with this class that holds the movie list, we can use find rather than find_all.
    find_all() returns a list of div tags, while find() returns only the first matching div.
   '''     
    movie_list = soup.find('div', 'ipc-page-grid__item ipc-page-grid__item--span-2') 
    # this div contains all the listed movies in the requested html web page string.
    
    list_movies = [] # initialize the function return value, which is a list of movies. 
                     # This list will contain the scraped data transformed to a structured format.
    
    # Use count to track the number of movies processed. It starts at 0: no movie has been processed yet.
    count = 0
    
    # Each movie is listed in an li with class 'ipc-metadata-list-summary-item'.
    divs = movie_list.find_all('li', 'ipc-metadata-list-summary-item')  # To find all the listed movies in the page.
    for d in divs:
        dict_each_movie = {}  # initialize the movie dictionary to store the movie information.

        # Pulling the movie_id
        try:
            movie_id= d.find('a', 'ipc-title-link-wrapper').attrs['href']
            movie_id= movie_id[7:16]
            
        except:
            movie_id=""
        finally:
            dict_each_movie["movie_id"] = movie_id
            print(movie_id)
            
        # Pulling the rank
        try:
            h3_element= d.find('h3', 'ipc-title__text')
            rank= h3_element.text.split('. ')[0]
            
        except:
            rank=""
        finally:
            dict_each_movie["rank"] = rank
            print(rank)
        
        # Pulling the title
        try:
            h3_element1= d.find('h3', 'ipc-title__text')
            title= h3_element1.text.split('. ')[1]
            
        except:
            title=""
        finally:
            dict_each_movie["title"] = title
            print(title)
            
        # Pulling the runtime
        try:
            runtime = ""
            runtime_element = d.find_all('span','dli-title-metadata-item')  # Fetch all span elements
            runtime = runtime_element[1].text.strip()
        except:
            runtime = ""  # Set to empty string if not found
        finally:
            dict_each_movie["runtime"] = runtime
            print("Runtime:", runtime)
            
        # Pulling the year
        try:
            year = ""
            year_element = d.find_all('span','dli-title-metadata-item')  # Fetch all span elements
            year = year_element[0].text.strip()
        except:
            year = ""  # Set to empty string if not found
        finally:
            dict_each_movie["year"] = year
            print("Year:", year)
        
        # Pulling the rating
          # the rating out of 10
        try:
            rating_element = d.find('span','ipc-rating-star--rating')
            rating = rating_element.text.strip()
            
        except:
            rating=""
        finally:
            dict_each_movie["rating"] = rating
            
        # Pulling the votes
        try:
            votes_element = d.find('span','ipc-rating-star--voteCount')
            votes = votes_element.text.strip().strip('()')
            
        except:
            votes=""
        finally:
            dict_each_movie["votes"] = votes
        
        list_movies.append(dict_each_movie)  # To add the movie information to the movies list.
        print(dict_each_movie)
        count +=1
        print('===============================')
        print()
        if count == num_of_m:
            break # to exit from the loop.

    return list_movies
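
Two of the extraction steps above are sensitive to the exact text they receive. The slice `movie_id[7:16]` assumes every id is exactly nine characters long, and `split('. ')` also splits at any later ". " inside a title, which is presumably why rank 5 appears as 'Once Upon a Time..' in the output below. A minimal sketch of more tolerant helpers follows; the function names `parse_movie_id` and `split_rank_title` are hypothetical, and the link format `/title/<id>/...` is an assumption inferred from the slice.

```python
import re

def parse_movie_id(href):
    """Return the title id (for example 'tt7286456') from a link like '/title/tt7286456/...'."""
    match = re.search(r"/title/(tt\d+)", href)
    return match.group(1) if match else ""

def split_rank_title(heading):
    """Split '5. Some Title' into ('5', 'Some Title') at the first '. ' only."""
    rank, sep, title = heading.partition(". ")
    # If there is no '. ' at all, treat the whole heading as the title.
    return (rank, title) if sep else ("", heading)

print(parse_movie_id("/title/tt7286456/?ref_=sr_t_1"))
print(split_rank_title("5. Once Upon a Time... in Hollywood"))
```

Because `partition` stops at the first separator, the rest of the heading is kept intact as the title.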


###  Call statement to scrape the TopVoted 500 movies
##### read_m_from_html_string(url,500)

In [121]:
url = "https://www.imdb.com/search/title/?title_type=feature&release_date=2018-01-01,2020-12-31&sort=num_votes,desc"

Movies_list = read_m_from_html_string(url,500)  #to read the topVoted 500 movies
Movies_list

https://www.imdb.com/search/title/?title_type=feature&release_date=2018-01-01,2020-12-31&sort=num_votes,desc
tt7286456
1
Joker
Runtime: 2h 2m
Year: 2019
{'movie_id': 'tt7286456', 'rank': '1', 'title': 'Joker', 'runtime': '2h 2m', 'year': '2019', 'rating': '8.4', 'votes': '1.6M'}

tt4154796
2
Avengers: Endgame
Runtime: 3h 1m
Year: 2019
{'movie_id': 'tt4154796', 'rank': '2', 'title': 'Avengers: Endgame', 'runtime': '3h 1m', 'year': '2019', 'rating': '8.4', 'votes': '1.3M'}

tt4154756
3
Avengers: Infinity War
Runtime: 2h 29m
Year: 2018
{'movie_id': 'tt4154756', 'rank': '3', 'title': 'Avengers: Infinity War', 'runtime': '2h 29m', 'year': '2018', 'rating': '8.4', 'votes': '1.2M'}

tt6751668
4
Parasite
Runtime: 2h 12m
Year: 2019
{'movie_id': 'tt6751668', 'rank': '4', 'title': 'Parasite', 'runtime': '2h 12m', 'year': '2019', 'rating': '8.5', 'votes': '1M'}

tt7131622
5
Once Upon a Time..
Runtime: 2h 41m
Year: 2019
{'movie_id': 'tt7131622', 'rank': '5', 'title': 'Once Upon a Time..', 'runtime'

[{'movie_id': 'tt7286456',
  'rank': '1',
  'title': 'Joker',
  'runtime': '2h 2m',
  'year': '2019',
  'rating': '8.4',
  'votes': '1.6M'},
 {'movie_id': 'tt4154796',
  'rank': '2',
  'title': 'Avengers: Endgame',
  'runtime': '3h 1m',
  'year': '2019',
  'rating': '8.4',
  'votes': '1.3M'},
 {'movie_id': 'tt4154756',
  'rank': '3',
  'title': 'Avengers: Infinity War',
  'runtime': '2h 29m',
  'year': '2018',
  'rating': '8.4',
  'votes': '1.2M'},
 {'movie_id': 'tt6751668',
  'rank': '4',
  'title': 'Parasite',
  'runtime': '2h 12m',
  'year': '2019',
  'rating': '8.5',
  'votes': '1M'},
 {'movie_id': 'tt7131622',
  'rank': '5',
  'title': 'Once Upon a Time..',
  'runtime': '2h 41m',
  'year': '2019',
  'rating': '7.6',
  'votes': '878K'},
 {'movie_id': 'tt1825683',
  'rank': '6',
  'title': 'Black Panther',
  'runtime': '2h 14m',
  'year': '2018',
  'rating': '7.3',
  'votes': '856K'},
 {'movie_id': 'tt8946378',
  'rank': '7',
  'title': 'Knives Out',
  'runtime': '2h 10m',
  'year':

In [123]:
print(Movies_list)

[{'movie_id': 'tt7286456', 'rank': '1', 'title': 'Joker', 'runtime': '2h 2m', 'year': '2019', 'rating': '8.4', 'votes': '1.6M'}, {'movie_id': 'tt4154796', 'rank': '2', 'title': 'Avengers: Endgame', 'runtime': '3h 1m', 'year': '2019', 'rating': '8.4', 'votes': '1.3M'}, {'movie_id': 'tt4154756', 'rank': '3', 'title': 'Avengers: Infinity War', 'runtime': '2h 29m', 'year': '2018', 'rating': '8.4', 'votes': '1.2M'}, {'movie_id': 'tt6751668', 'rank': '4', 'title': 'Parasite', 'runtime': '2h 12m', 'year': '2019', 'rating': '8.5', 'votes': '1M'}, {'movie_id': 'tt7131622', 'rank': '5', 'title': 'Once Upon a Time..', 'runtime': '2h 41m', 'year': '2019', 'rating': '7.6', 'votes': '878K'}, {'movie_id': 'tt1825683', 'rank': '6', 'title': 'Black Panther', 'runtime': '2h 14m', 'year': '2018', 'rating': '7.3', 'votes': '856K'}, {'movie_id': 'tt8946378', 'rank': '7', 'title': 'Knives Out', 'runtime': '2h 10m', 'year': '2019', 'rating': '7.9', 'votes': '796K'}, {'movie_id': 'tt8579674', 'rank': '8', 'ti

In [125]:
# to convert the movies list of dicts to a dataframe
df_movies = pd.DataFrame(Movies_list)
df_movies

| | movie_id | rank | title | runtime | year | rating | votes |
|---|---|---|---|---|---|---|---|
| 0 | tt7286456 | 1 | Joker | 2h 2m | 2019 | 8.4 | 1.6M |
| 1 | tt4154796 | 2 | Avengers: Endgame | 3h 1m | 2019 | 8.4 | 1.3M |
| 2 | tt4154756 | 3 | Avengers: Infinity War | 2h 29m | 2018 | 8.4 | 1.2M |
| 3 | tt6751668 | 4 | Parasite | 2h 12m | 2019 | 8.5 | 1M |
| 4 | tt7131622 | 5 | Once Upon a Time.. | 2h 41m | 2019 | 7.6 | 878K |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 495 | tt3089630 | 496 | Artemis Fowl | 1h 35m | 2020 | 4.3 | 31K |
| 496 | tt1308728 | 497 | The Happytime Murders | 1h 31m | 2018 | 5.5 | 31K |
| 497 | tt1138238 | 498 | The Dissident | 1h 59m | 2020 | 7.8 | 31K |
| 498 | tt1031014 | 499 | Fatman | 1h 40m | 2020 | 5.9 | 31K |


***
#  Export the collected movies to the IMDb_TopVoted_Group6.csv file.


In [128]:
df_movies.to_csv('IMDb_TopVoted_Group6.csv', index = False)

# Importing the given dataset "Movies.csv" to Pandas DataFrame called df1

In [131]:
# Importing the csv file to df1 and print the df1.

df1 = pd.read_csv ('Movies.csv')
df1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 500 entries, 0 to 499
Data columns (total 5 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   movie_id        500 non-null    object
 1   originalTitle   500 non-null    object
 2   description     500 non-null    object
 3   ratingCategory  497 non-null    object
 4   genres          500 non-null    object
dtypes: object(5)
memory usage: 19.7+ KB


# Import the scraped data from the IMDb_TopVoted_Group6.csv file to a Pandas DataFrame called df2

In [134]:
# You need to import the collected dataset "IMDb_TopVoted_Group6.csv" and print df2.
# To handle Latin characters that the csv file may contain,
# pass encoding="ISO-8859-1" to pd.read_csv().
# Example: df1 = pd.read_csv('thefilename.csv', encoding="ISO-8859-1")
# Using encoding="ISO-8859-1" avoids UnicodeDecodeError.

df2 = pd.read_csv('IMDb_TopVoted_Group6.csv', encoding="ISO-8859-1")
df2

| | movie_id | rank | title | runtime | year | rating | votes |
|---|---|---|---|---|---|---|---|
| 0 | tt7286456 | 1 | Joker | 2h 2m | 2019 | 8.4 | 1.6M |
| 1 | tt4154796 | 2 | Avengers: Endgame | 3h 1m | 2019 | 8.4 | 1.3M |
| 2 | tt4154756 | 3 | Avengers: Infinity War | 2h 29m | 2018 | 8.4 | 1.2M |
| 3 | tt6751668 | 4 | Parasite | 2h 12m | 2019 | 8.5 | 1M |
| 4 | tt7131622 | 5 | Once Upon a Time.. | 2h 41m | 2019 | 7.6 | 878K |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 495 | tt3089630 | 496 | Artemis Fowl | 1h 35m | 2020 | 4.3 | 31K |
| 496 | tt1308728 | 497 | The Happytime Murders | 1h 31m | 2018 | 5.5 | 31K |
| 497 | tt1138238 | 498 | The Dissident | 1h 59m | 2020 | 7.8 | 31K |
| 498 | tt1031014 | 499 | Fatman | 1h 40m | 2020 | 5.9 | 31K |


# Data cleansing and transformation for df2.

In [137]:
df2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 500 entries, 0 to 499
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   movie_id  500 non-null    object 
 1   rank      500 non-null    int64  
 2   title     500 non-null    object 
 3   runtime   500 non-null    object 
 4   year      500 non-null    int64  
 5   rating    500 non-null    float64
 6   votes     500 non-null    object 
dtypes: float64(1), int64(2), object(4)
memory usage: 27.5+ KB


In [139]:
df2.info()
df2.shape
df2_Rows = df2.shape[0]
df2_Columns = df2.shape[1]

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 500 entries, 0 to 499
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   movie_id  500 non-null    object 
 1   rank      500 non-null    int64  
 2   title     500 non-null    object 
 3   runtime   500 non-null    object 
 4   year      500 non-null    int64  
 5   rating    500 non-null    float64
 6   votes     500 non-null    object 
dtypes: float64(1), int64(2), object(4)
memory usage: 27.5+ KB


In [141]:
print(f"df2 dataset contains {df2_Rows} rows and {df2_Columns} columns.")

df2 dataset contains 500 rows and 7 columns.


In [143]:
df2.isnull().sum()

movie_id    0
rank        0
title       0
runtime     0
year        0
rating      0
votes       0
dtype: int64

In [145]:
# Select the rows that are exact duplicates of an earlier row
duplicate = df2[df2.duplicated()]
  
print("Duplicate Rows :")
duplicate

Duplicate Rows :


| | movie_id | rank | title | runtime | year | rating | votes |
|---|---|---|---|---|---|---|---|


---

## Data Profiling Summary 
### List of the issues that the data contains

Steps for Data Profiling and Transformation

1. Convert Data Types:
   - Ensure the rank and year columns are converted to integers for consistency and accurate numerical operations.

2. Normalize the votes Column:
   - Create a function to convert votes values into integers:
     - If the value contains M, multiply the numeric portion by 1,000,000.
     - If the value contains K, multiply the numeric portion by 1,000.
   - Ensure all votes values are stored as integers.

3. Standardize the runtime Column:
   - Rename the runtime column to runtimeMinutes for clarity.
   - Convert runtime values into minutes using a custom function:
     - Extract hours (h) and minutes (m), and calculate the total runtime in minutes.
   - Replace the original runtime format (e.g., 2h 2m) with the computed integer values.

4. Redundant Columns:
   - Drop the original runtime column after creating runtimeMinutes to avoid duplication.


---
---
---






# Data  Cleansing

In [150]:
# Cleaning and transforming df2
 # rank, year, and votes should have a numeric integer data type.
 # runtime column should be renamed to runtimeMinutes and the value should be in minutes, 
 # for example: 2h 2m should be 122
    
# 1. Convert 'rank' and 'year' to integers
df2['rank'] = df2['rank'].astype(int)
df2['year'] = df2['year'].astype(int)

# 2. Method that performs the action of converting 'votes' to integers and removing 'M' or 'K'
# This method will be called in the below lines of code
def convert_votes(votes):
    # Ensure the input is a string for processing
    votes = str(votes) if not pd.isna(votes) else ''
    if 'M' in votes:
        return int(float(votes.replace('M', '')) * 1000000)
    elif 'K' in votes:
        return int(float(votes.replace('K', '')) * 1000)
    elif votes.isdigit():  # Check if the value is a plain numeric string
        return int(votes)
    else:
        return 0  # Default value for invalid or missing data

# Apply the function to the 'votes' column
df2['votes'] = df2['votes'].apply(convert_votes)


# 3. Rename 'runtime' to 'runtimeMinutes' and convert to minutes
def convert_runtime_to_runtimeMinutes(runtime):
    hours, minutes = 0, 0
    if 'h' in runtime:
        hours = int(runtime.split('h')[0].strip())
        runtime = runtime.split('h')[1].strip()  # Keep only the part after hours
    if 'm' in runtime:
        minutes = int(runtime.split('m')[0].strip())
    return hours * 60 + minutes

df2['runtimeMinutes'] = df2['runtime'].apply(convert_runtime_to_runtimeMinutes)
df2.drop(columns=['runtime'], inplace=True)  # Drop the original 'runtime' column

# Final DataFrame
print(df2.info())
print(df2.head())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 500 entries, 0 to 499
Data columns (total 7 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   movie_id        500 non-null    object 
 1   rank            500 non-null    int32  
 2   title           500 non-null    object 
 3   year            500 non-null    int32  
 4   rating          500 non-null    float64
 5   votes           500 non-null    int64  
 6   runtimeMinutes  500 non-null    int64  
dtypes: float64(1), int32(2), int64(2), object(2)
memory usage: 23.6+ KB
None
    movie_id  rank                   title  year  rating    votes  \
0  tt7286456     1                   Joker  2019     8.4  1600000   
1  tt4154796     2       Avengers: Endgame  2019     8.4  1300000   
2  tt4154756     3  Avengers: Infinity War  2018     8.4  1200000   
3  tt6751668     4                Parasite  2019     8.5  1000000   
4  tt7131622     5      Once Upon a Time..  2019     7.6   878000   
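
The two row-by-row converters above can also be written with pandas string methods. The following is a minimal sketch of that alternative, not the approach used in this notebook. It runs on a small toy frame whose first three rows reuse values from the output above; the "45m" and "950" values are hypothetical and only exercise the no-hours and no-suffix cases.

```python
import pandas as pd

# Toy frame imitating the scraped columns; the last row is hypothetical.
sample = pd.DataFrame({
    "votes": ["1.6M", "878K", "31K", "950"],
    "runtime": ["2h 2m", "2h 41m", "1h 35m", "45m"],
})

# votes: capture the numeric part and an optional K/M suffix, then scale.
parts = sample["votes"].str.extract(r"^([\d.]+)([KM]?)$")
scale = parts[1].map({"K": 1_000, "M": 1_000_000, "": 1})
sample["votes"] = (parts[0].astype(float) * scale).round().astype(int)

# runtime: optional "<n>h" followed by optional "<n>m"; a missing part counts as 0.
hm = sample["runtime"].str.extract(r"^(?:(\d+)h)?\s*(?:(\d+)m)?$")
hm = hm.astype(float).fillna(0).astype(int)
sample["runtimeMinutes"] = hm[0] * 60 + hm[1]
sample = sample.drop(columns=["runtime"])

print(sample)
```

Note that abbreviated values such as 1.6M only carry two significant digits, so either approach yields rounded vote counts rather than the exact totals.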

# 	Enrich the given dataset (df1) by merging it to the scraped data (df2).

In [154]:
# Merge the two dataframes into one dataframe called df.

# Merge the datasets on the 'movie_id' column
df = pd.merge(df1, df2, on='movie_id', how='left')

# Display the structure of the enriched dataset
print(df.info())

# Drop rows with any missing value (unmatched movies or missing ratingCategory), then re-check
df.dropna(inplace=True)
print(df.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 500 entries, 0 to 499
Data columns (total 11 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   movie_id        500 non-null    object 
 1   originalTitle   500 non-null    object 
 2   description     500 non-null    object 
 3   ratingCategory  497 non-null    object 
 4   genres          500 non-null    object 
 5   rank            498 non-null    float64
 6   title           498 non-null    object 
 7   year            498 non-null    float64
 8   rating          498 non-null    float64
 9   votes           498 non-null    float64
 10  runtimeMinutes  498 non-null    float64
dtypes: float64(5), object(6)
memory usage: 43.1+ KB
None
<class 'pandas.core.frame.DataFrame'>
Index: 495 entries, 0 to 498
Data columns (total 11 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   movie_id        495 non-null    object 
 1   originalTitl
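
The info output shows that 2 of the 500 movies found no match in the scraped data and 3 have no ratingCategory, which accounts for the 5 rows removed by dropna. To see which rows are affected before dropping them, one option is the `indicator` argument of `pd.merge`. The sketch below uses two small made-up frames (the ids and values are hypothetical) rather than df1 and df2.

```python
import pandas as pd

# Made-up stand-ins for the given dataset (left) and the scraped dataset (right).
left = pd.DataFrame({
    "movie_id": ["tt0000001", "tt0000002", "tt0000003"],
    "ratingCategory": ["R", None, "PG-13"],
})
right = pd.DataFrame({"movie_id": ["tt0000001", "tt0000002"], "rank": [1, 2]})

# indicator=True adds a _merge column: 'both', 'left_only', or 'right_only'.
merged = pd.merge(left, right, on="movie_id", how="left", indicator=True)

# Ids present in the given dataset but missing from the scraped data.
unmatched = merged.loc[merged["_merge"] == "left_only", "movie_id"].tolist()

# Missing values per column, to see what dropna() would remove.
missing_per_column = merged.drop(columns="_merge").isna().sum()

print(unmatched)
print(missing_per_column)
```

Inspecting the unmatched ids can show whether the gap comes from the scraping step (for example a truncated id) or from movies that are simply not in the top 500.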

In [156]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 495 entries, 0 to 498
Data columns (total 11 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   movie_id        495 non-null    object 
 1   originalTitle   495 non-null    object 
 2   description     495 non-null    object 
 3   ratingCategory  495 non-null    object 
 4   genres          495 non-null    object 
 5   rank            495 non-null    float64
 6   title           495 non-null    object 
 7   year            495 non-null    float64
 8   rating          495 non-null    float64
 9   votes           495 non-null    float64
 10  runtimeMinutes  495 non-null    float64
dtypes: float64(5), object(6)
memory usage: 46.4+ KB


In [158]:
df['rank'] = df['rank'].astype(int)
df['year'] = df['year'].astype(int)
df['votes'] = df['votes'].astype(int)

In [160]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 495 entries, 0 to 498
Data columns (total 11 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   movie_id        495 non-null    object 
 1   originalTitle   495 non-null    object 
 2   description     495 non-null    object 
 3   ratingCategory  495 non-null    object 
 4   genres          495 non-null    object 
 5   rank            495 non-null    int32  
 6   title           495 non-null    object 
 7   year            495 non-null    int32  
 8   rating          495 non-null    float64
 9   votes           495 non-null    int32  
 10  runtimeMinutes  495 non-null    float64
dtypes: float64(2), int32(3), object(6)
memory usage: 40.6+ KB
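
In the output above, runtimeMinutes is still float64: the left merge introduced missing values, which turned the integer columns into floats, and only rank, year, and votes were cast back. A minimal sketch of casting several columns in one call follows, on a toy frame with values taken from the first two movies; whether runtimeMinutes should be an integer in the final file is an assumption based on the cleaning notes earlier.

```python
import pandas as pd

# Toy frame imitating the merged columns after dropna().
frame = pd.DataFrame({
    "rank": [1.0, 2.0],
    "runtimeMinutes": [122.0, 181.0],
    "rating": [8.4, 8.4],
})

# astype accepts a column-to-dtype mapping; columns not listed are left unchanged.
frame = frame.astype({"rank": "int64", "runtimeMinutes": "int64"})

print(frame.dtypes)
```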


# Rearrange the dataset fields to be listed in the following order:
movie_id, rank, title, originalTitle, description, year, votes, rating, runtimeMinutes, ratingCategory, genres

In [163]:
df = df[["movie_id" , "rank" , "title" , "originalTitle" , "description" , "year" , "votes" , "rating" , "runtimeMinutes" , "ratingCategory" , "genres"]]
df.head()



| | movie_id | rank | title | originalTitle | description | year | votes | rating | runtimeMinutes | ratingCategory | genres |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | tt7286456 | 1 | Joker | Joker | During the 1980s, a failed stand-up comedian i... | 2019 | 1600000 | 8.4 | 122.0 | R | Crime,Drama,Thriller |
| 1 | tt4154796 | 2 | Avengers: Endgame | Avengers: Endgame | After the devastating events of Avengers: Infi... | 2019 | 1300000 | 8.4 | 181.0 | PG-13 | Action,Adventure,Drama |
| 2 | tt4154756 | 3 | Avengers: Infinity War | Avengers: Infinity War | The Avengers and their allies must be willing ... | 2018 | 1200000 | 8.4 | 149.0 | PG-13 | Action,Adventure,Sci-Fi |
| 3 | tt6751668 | 4 | Parasite | Parasite | Greed and class discrimination threaten the ne... | 2019 | 1000000 | 8.5 | 132.0 | R | Drama,Thriller |
| 4 | tt1825683 | 6 | Black Panther | Black Panther | T'Challa, heir to the hidden but advanced king... | 2018 | 856000 | 7.3 | 134.0 | PG-13 | Action,Adventure,Sci-Fi |


# Export the enriched dataset to a CSV file:

In [166]:
# Use the following naming convention: 
#  Project_3_PartA_Lastname.csv
df.to_csv('Project_3_PartA_Group6.csv', index = False)

