# Instructions
         
For Part A, you need to scrape an IMDb web page to find the top movies sorted by user votes. For each movie, you need to pull:
- movie_id
- rank
- title 
- runtime
- year
- rating
- votes

The URL of a page that lists movies released between 2018 and 2020, sorted by number of votes, is:

https://www.imdb.com/search/title/?title_type=feature&release_date=2018-01-01,2020-12-31&sort=num_votes,desc

Please click the URL and investigate how you can pull movie_id, rank, title,... from the webpage. 


**Write your code after each <span style="color:red">'''  Your code here ...    '''</span> marker.**

***
Now let’s look at the read_m_from_html_string(url, num_of_m=50) function in detail. The parameter “num_of_m” is the number of top movies you want to retrieve. For example, read_m_from_html_string(url, 500) means that we want to extract the top 500 movies released between 2018 and 2020, sorted by users' votes.

This function returns a list of dictionaries. Each dictionary represents one of the top movies, which could look like the following:

{
  
    'movie_id': 'tt7286456',
    'rank': '1.',
    'title': 'Joker',
    'runtime': '2h 2m',
    'year': '2019',
    'rating': '8.4',
    'votes': '1,421,777',
}


After you implement “read_m_from_html_string”, which will return a list of top movies, you need to export the movies list to a csv file.


***

After you are done scraping the needed data, clean and transform it as needed to make it ready for enriching the given "Movies.csv" dataset.
***

Finally, export the enriched dataset to a CSV file:
Use the following naming convention: Project_3_PartA_Lastname.csv




In [1]:
import warnings
warnings.filterwarnings('ignore')
from bs4 import BeautifulSoup
import pandas as pd

***

## read_m_from_html_string

Inside this function, you need to write your code to pull the movies information from the provided Movies 500 HTML String text file.

For each movie, you need to pull :
- movie_id
- rank
- title 
- runtime
- year
- rating
- votes

To give an example of how to pull data from the web page HTML string, I have included the code to pull the movie_id.
You need to include your code to pull the other needed movie information (title, rank, year, ...). None of the collected fields should have missing values.
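
One way to confirm the no-missing-values requirement is to scan the returned list before exporting it. The sketch below is only an illustration: `find_missing`, `REQUIRED`, and the sample records are hypothetical names and made-up data, and the default field names are taken from the list above. If your function stores a field under a different key, pass that key list instead.

```python
# Hypothetical helper: report records whose fields are absent or empty.
REQUIRED = ("movie_id", "rank", "title", "runtime", "year", "rating", "votes")

def find_missing(movies, fields=REQUIRED):
    """Return (index, field) pairs for every absent or blank value."""
    problems = []
    for i, movie in enumerate(movies):
        for field in fields:
            value = movie.get(field)
            if value is None or str(value).strip() == "":
                problems.append((i, field))
    return problems

# Made-up sample: the second record has an empty rating.
sample = [
    {"movie_id": "tt0000001", "rank": "1", "title": "Example One",
     "runtime": "2h 2m", "year": "2019", "rating": "8.4", "votes": "1,000"},
    {"movie_id": "tt0000002", "rank": "2", "title": "Example Two",
     "runtime": "1h 30m", "year": "2018", "rating": "", "votes": "900"},
]
print(find_missing(sample))  # [(1, 'rating')]
```

An empty result means every record carries every field.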

The URL of a page that lists movies released between 2018 and 2020, sorted by number of votes, is:

https://www.imdb.com/search/title/?title_type=feature&release_date=2018-01-01,2020-12-31&sort=num_votes,desc

Please click the URL and investigate how you can pull movie_id, rank, title,... from the webpage using the Inspect option.
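
Once Inspect has shown which tags and classes hold the data, the same lookups can be tried on a small HTML string first. The markup below is made up and only imitates the shape of the listing (the IDs, titles, and `ref_` query strings are placeholders); the class names are the ones used later in this notebook.

```python
from bs4 import BeautifulSoup

# Made-up markup that only imitates the shape of the listing page.
html = """
<ul>
  <li class="ipc-metadata-list-summary-item">
    <a class="ipc-title-link-wrapper" href="/title/tt0000001/?ref_=sr_t_1">
      <h3 class="ipc-title__text">1. Example One</h3>
    </a>
  </li>
  <li class="ipc-metadata-list-summary-item">
    <a class="ipc-title-link-wrapper" href="/title/tt0000002/?ref_=sr_t_2">
      <h3 class="ipc-title__text">2. Example Two</h3>
    </a>
  </li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")

# find_all returns every match; find returns only the first (or None).
items = soup.find_all("li", "ipc-metadata-list-summary-item")
first_link = soup.find("a", "ipc-title-link-wrapper")

headings = [li.find("h3", "ipc-title__text").text for li in items]
print(headings)                           # ['1. Example One', '2. Example Two']
print(first_link["href"])                 # /title/tt0000001/?ref_=sr_t_1
print(soup.find("div", "no-such-class"))  # None
```

Because `find` returns `None` when nothing matches, calling `.text` on its result raises an `AttributeError`; that is the case the `try`/`except` blocks in the function guard against.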



In [2]:
# This function reads up to num_of_m movies from the saved HTML string of the page. The default value is 50.
def read_m_from_html_string(url, num_of_m=50):
    
    print(url)
    
    with open('TopVoted_500_Movies_HTML.txt', 'r', encoding="utf8") as file:
        html_string = file.read()   # read the HTML file as a string
        # I have included the Movies 500 HTML String.txt file in the project folder. Please take a look.
    
    # create a soup object
    soup = BeautifulSoup(html_string, "html.parser")
    
    '''
    Open the URL and investigate how you can pull movie_id, rank, title,... from the webpage.
    To investigate the HTML of a web page, for example:
    URL: https://www.imdb.com/search/title/?title_type=feature&release_date=2018-01-01,2020-12-31&sort=num_votes,desc
    right-click anywhere on the webpage and, at the very bottom of the menu that pops up,
    click "Inspect".
    '''
    '''
    Fetch the div that contains all the movies. This can be done with the find and find_all functions.
    For example, find_all('div') returns every div on the page. Both functions accept a second
    parameter: in the code below, 'div' is the tag name and
    'ipc-page-grid__item ipc-page-grid__item--span-2' is the value of the tag's class attribute.
    In other words, soup.find('div', 'ipc-page-grid__item ipc-page-grid__item--span-2') says:
    I want to find a div with attribute class = 'ipc-page-grid__item ipc-page-grid__item--span-2'.

    Since we need only that one div, we can use find rather than find_all.
    find_all returns a list of matching tags, while find returns only the first match.
    '''
    movie_list = soup.find('div', 'ipc-page-grid__item ipc-page-grid__item--span-2') 
    # this div contains all the listed movies in the requested html web page string.
    
    list_movies = [] # initialize the function return value, which is a list of movies.
                     # This list will contain the scraped data transformed to a structured format.
    
    # Use count to track the number of movies processed. It starts at 0: no movie has been processed yet.
    count = 0
    
    # Each movie is listed in an li with class 'ipc-metadata-list-summary-item'.
    divs = movie_list.find_all('li','ipc-metadata-list-summary-item') # find all the movies listed on the page
    for d in divs:
        dict_each_movie = {}  # initialize the movie dictionary to store the movie information.

        # Pulling the movie_id
        try:
            movie_id= d.find('a', 'ipc-title-link-wrapper').attrs['href']
            movie_id= movie_id[7:16]
            
        except Exception:
            movie_id=""
        finally:
            dict_each_movie["movie_id"] = movie_id
            print(movie_id)
            
        # Pulling the rank
        '''  Your code here ...    '''

        try:
            rank = d.find('h3','ipc-title__text').text
            rank= rank.split('. ')[0]
        
        except Exception:
            rank=""
        finally:
            dict_each_movie["rank"] = rank
            print (rank)        
        
        
        
        

        # Pulling the title
        '''  Your code here ...    '''

        try:
            title= d.find('h3','ipc-title__text').text
            title =title.split('. ', 1)[1]   # split only at the first '. ' so titles containing '. ' stay whole
        except Exception:
            title=""
        finally:
            dict_each_movie["title"] = title
            print (title)        
        
        
        
     
        
        # Pulling the runtime
        '''  Your code here ...    '''
      
        try:
            # the metadata spans hold the year first and the runtime second
            runtime= d.find_all('span','sc-479faa3c-8 bNrEFi dli-title-metadata-item')
            run_time = runtime[1].text   # the second metadata item is the runtime
            
        except Exception:
            run_time=""
        finally:
            dict_each_movie["run_time"] = run_time
            print (run_time)        
        
        
        
        # Pulling the year
        '''  Your code here ...    ''' 
        
        try:
            release_year= d.find_all('span','sc-479faa3c-8 bNrEFi dli-title-metadata-item')
            year = release_year[0].text   # the first metadata item is the release year
            
        except Exception:
            year=""
        finally:
            dict_each_movie["year"] = year
            print (year)        
        
        
                
        # Pulling the rating
          # the rating out of 10
        '''  Your code here ...    '''     
 
        try:
            rating = d.find('span','ipc-rating-star ipc-rating-star--base ipc-rating-star--imdb ratingGroup--imdb-rating').text
            rating = rating.split('(')[0]
            rating=rating[:-1]
            
        except Exception:
            rating=""
        finally:
            dict_each_movie["rating"] = rating
            print (rating)        
        
        
        
        
        # Pulling the votes
        '''  Your code here ...    '''
        try:
            votes = d.find('div','sc-21df249b-0 jmcDPS').text
            votes = votes.split('Votes')[1]
            # votes = votes[:-1]
        except Exception:
            votes=""
        finally:
            dict_each_movie["votes"] = votes
            print (votes)  
            
        
        
        list_movies.append(dict_each_movie)  # To add the movie information to the movies list.

        count +=1
        print('===============================')
        print()
        if count == num_of_m:
            break # to exit from the loop.

    return list_movies
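
The slice `movie_id[7:16]` keeps exactly nine characters of the href, and splitting the heading on `'. '` cuts at every occurrence of that separator. As an alternative sketch, a regular expression and `str.partition` avoid both position assumptions. The helper names and the example hrefs are hypothetical, and the pattern assumes hrefs of the form `/title/tt<digits>/...`. Note that the IDs are later used as the join key with Movies.csv, so whichever extraction is used has to produce IDs in the same form as that file.

```python
import re

def extract_movie_id(href):
    """Return the 'tt' identifier from an href, or '' if none is found."""
    match = re.search(r"/title/(tt\d+)", href)
    return match.group(1) if match else ""

def split_rank_title(heading):
    """Split '1. Joker' into ('1', 'Joker'), cutting only at the first '. '."""
    rank, sep, title = heading.partition(". ")
    return (rank, title) if sep else ("", heading)

print(extract_movie_id("/title/tt7286456/?ref_=sr_t_1"))  # tt7286456
print(split_rank_title("1. Joker"))                        # ('1', 'Joker')
print(split_rank_title("9. Mr. Example"))                  # ('9', 'Mr. Example')
```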


###  Call statement to scrape the TopVoted 500 movies
##### read_m_from_html_string(url,500)

In [3]:
url = "https://www.imdb.com/search/title/?title_type=feature&release_date=2018-01-01,2020-12-31&sort=num_votes,desc"

#to read the topVoted 500 movies
Movies_list = read_m_from_html_string(url,500)
Movies_list

https://www.imdb.com/search/title/?title_type=feature&release_date=2018-01-01,2020-12-31&sort=num_votes,desc
tt7286456
1
Joker
2h 2m
2019
8.4
1,422,218

tt4154796
2
Avengers: Endgame
3h 1m
2019
8.4
1,227,564

tt4154756
3
Avengers: Infinity War
2h 29m
2018
8.4
1,167,536

tt6751668
4
Parasite
2h 12m
2019
8.5
903,939

tt1825683
5
Black Panther
2h 14m
2018
7.3
820,837

tt7131622
6
Once Upon a Time in Hollywood
2h 41m
2019
7.6
809,786

tt8946378
7
Knives Out
2h 10m
2019
7.9
747,547

tt8579674
8
1917
1h 59m
2019
8.2
650,257

tt4633694
9
Spider-Man: Into the Spider-Verse
1h 57m
2018
8.4
641,118

tt5463162
10
Deadpool 2
1h 59m
2018
7.6
624,815

tt4154664
11
Captain Marvel
2h 3m
2019
6.8
597,379

tt1727824
12
Bohemian Rhapsody
2h 14m
2018
7.9
574,079

tt6723592
13
Tenet
2h 30m
2020
7.3
568,337

tt6644200
14
A Quiet Place
1h 30m
2018
7.5
567,953

tt6966692
15
Green Book
2h 10m
2018
8.2
542,702

tt6320628
16
Spider-Man: Far from Home
2h 9m
2019
7.4
537,741

tt1270797
17
Venom
1h 52m
2018
6.6
523,

33,411

tt9086228
454
Gretel & Hansel
1h 27m
2020
5.5
33,221

tt7098658
455
Raazi
2h 18m
2018
7.7
33,133

tt4986098
456
The Titan
1h 37m
2018
4.8
33,014

tt7946422
457
Prospect
1h 40m
2018
6.3
32,910

tt6269368
458
The Clovehitch Killer
1h 49m
2018
6.5
32,724

tt3317234
459
Game Over, Man!
1h 41m
2018
5.4
32,640

tt9477520
460
Asuran
2h 21m
2019
8.4
32,475

tt6987770
461
Destination Wedding
1h 27m
2018
6.0
32,400

tt6218358
462
Calibre
1h 41m
2018
6.8
32,359

tt9196192
463
Cuties
1h 36m
2020
3.6
32,353

tt7242142
464
Blindspotting
1h 35m
2018
7.4
32,333

tt7721800
465
Bharat
2h 30m
2019
4.7
32,076

tt4995776
466
The Red Sea Diving Resort
2h 9m
2019
6.6
31,786

tt3721964
467
Gringo
1h 51m
2018
6.1
31,711

tt6436726
468
7500
1h 33m
2019
6.3
31,700

tt6857166
469
Book Club
1h 44m
2018
6.1
31,661

tt1298789
470
American Murder: The Family Next Door
1h 23m
2020
7.2
31,528

tt5144174
471
The Dry
1h 57m
2020
6.8
31,257

tt7137380
472
Destroyer
2h 1m
2018
6.2
31,227

tt7430722
473
War
2h 31m
2

[{'movie_id': 'tt7286456',
  'rank': '1',
  'title': 'Joker',
  'run_time': '2h 2m',
  'year': '2019',
  'rating': '8.4',
  'votes': '1,422,218'},
 {'movie_id': 'tt4154796',
  'rank': '2',
  'title': 'Avengers: Endgame',
  'run_time': '3h 1m',
  'year': '2019',
  'rating': '8.4',
  'votes': '1,227,564'},
 {'movie_id': 'tt4154756',
  'rank': '3',
  'title': 'Avengers: Infinity War',
  'run_time': '2h 29m',
  'year': '2018',
  'rating': '8.4',
  'votes': '1,167,536'},
 {'movie_id': 'tt6751668',
  'rank': '4',
  'title': 'Parasite',
  'run_time': '2h 12m',
  'year': '2019',
  'rating': '8.5',
  'votes': '903,939'},
 {'movie_id': 'tt1825683',
  'rank': '5',
  'title': 'Black Panther',
  'run_time': '2h 14m',
  'year': '2018',
  'rating': '7.3',
  'votes': '820,837'},
 {'movie_id': 'tt7131622',
  'rank': '6',
  'title': 'Once Upon a Time in Hollywood',
  'run_time': '2h 41m',
  'year': '2019',
  'rating': '7.6',
  'votes': '809,786'},
 {'movie_id': 'tt8946378',
  'rank': '7',
  'title': 'Kn

In [4]:
# to convert the movies list of dicts to dataframe
df_movies = pd.DataFrame(Movies_list)
df_movies

Unnamed: 0,movie_id,rank,title,run_time,year,rating,votes
0,tt7286456,1,Joker,2h 2m,2019,8.4,1422218
1,tt4154796,2,Avengers: Endgame,3h 1m,2019,8.4,1227564
2,tt4154756,3,Avengers: Infinity War,2h 29m,2018,8.4,1167536
3,tt6751668,4,Parasite,2h 12m,2019,8.5,903939
4,tt1825683,5,Black Panther,2h 14m,2018,7.3,820837
...,...,...,...,...,...,...,...
495,tt9072352,496,Relic,1h 29m,2020,6.0,29282
496,tt1006569,497,Antebellum,1h 45m,2020,5.8,29199
497,tt8652728,498,Waves,2h 15m,2019,7.5,29103
498,tt7748244,499,Mortal World,1h 47m,2018,7.6,29052


***
#  To export the collected movies to IMDb_TopVoted.csv file.


In [5]:
df_movies.to_csv('IMDb_TopVoted.csv', index = False)

# Importing the given dataset "Movies.csv" to Pandas DataFrame called df1

In [6]:
# Importing the movies.csv file to df1 and print the df1.

df1= pd.read_csv('Movies.csv')

In [7]:
#Find Empty values in ratingCategory and Fill it with 'Not Rated'
df1['ratingCategory'] = df1['ratingCategory'].fillna('Not Rated')  # assign back instead of inplace on a column selection

In [8]:
#Replace 'Unrated' to 'Not Rated' to have uniform data in ratingCategory field
df1['ratingCategory']=df1['ratingCategory'].replace('Unrated','Not Rated')

In [9]:
#Replace "\N" with "-" to make it more meaningful
df1['genres']=df1['genres'].replace(r"\N","-")
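
The three cleaning steps above can be checked on a tiny frame before running them on the full dataset. This is a sketch on made-up rows; only the column names `ratingCategory` and `genres` come from the notebook.

```python
import pandas as pd

# Made-up rows covering each case: a missing rating, 'Unrated', and the \N marker.
toy = pd.DataFrame({
    "ratingCategory": ["R", None, "Unrated"],
    "genres": ["Drama", r"\N", "Action,Comedy"],
})

# Fill missing ratings, then fold 'Unrated' into the same label.
toy["ratingCategory"] = (
    toy["ratingCategory"].fillna("Not Rated").replace("Unrated", "Not Rated")
)
# Without regex=True, replace matches whole cell values only,
# so r"\N" is treated as the literal two-character string.
toy["genres"] = toy["genres"].replace(r"\N", "-")

print(toy)
```

A quick `toy["ratingCategory"].value_counts()` afterwards shows whether any unexpected labels remain.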

In [10]:
print(df1)

      movie_id                originalTitle  \
0    tt7286456                        Joker   
1    tt4154796            Avengers: Endgame   
2    tt4154756       Avengers: Infinity War   
3    tt6751668                     Parasite   
4    tt1825683                Black Panther   
..         ...                          ...   
495  tt9072352                        Relic   
496  tt1006569  Episode dated 9 August 2005   
497  tt8652728                        Waves   
498  tt7748244                 Mortal World   
499  tt6768578                       Dogman   

                                           description ratingCategory  \
0    During the 1980s, a failed stand-up comedian i...              R   
1    After the devastating events of Avengers: Infi...          PG-13   
2    The Avengers and their allies must be willing ...          PG-13   
3    Greed and class discrimination threaten the ne...              R   
4    T'Challa, heir to the hidden but ad

# Import the scraped data from the IMDb_TopVoted.csv file to Pandas DataFrame called df2

In [11]:
# You need to import the collected dataset "IMDb_TopVoted.csv" and print the df2.
# To handle Latin characters that the csv file may contain
# with no issue, use  encoding= "ISO-8859-1" with the pd.read_csv()
# Example: df1 = pd.read_csv('thefilename.csv', encoding= "ISO-8859-1") 
# Using encoding= "ISO-8859-1" will avoid Unicode-Decode-Errors.

df2= pd.read_csv("IMDb_TopVoted.csv", encoding= "ISO-8859-1")
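
ISO-8859-1 assigns a character to every byte, so decoding with it never raises an error, but it can silently garble text that was written in a different encoding. `to_csv` writes UTF-8 unless another encoding is passed, so a non-ASCII title saved by the earlier export cell would be affected. A minimal illustration with a made-up title:

```python
# 'é' written as UTF-8 occupies two bytes; ISO-8859-1 decodes each byte separately.
title = "Amélie"   # made-up example value
raw = title.encode("utf-8")

print(raw.decode("utf-8"))        # Amélie
print(raw.decode("iso-8859-1"))   # AmÃ©lie
```

If titles come out garbled after the import, reading the file with the same encoding it was written in is the usual fix.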



# Data cleansing and transformation for df2.

In [12]:
df2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 500 entries, 0 to 499
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   movie_id  500 non-null    object 
 1   rank      500 non-null    int64  
 2   title     500 non-null    object 
 3   run_time  500 non-null    object 
 4   year      500 non-null    int64  
 5   rating    500 non-null    float64
 6   votes     500 non-null    object 
dtypes: float64(1), int64(2), object(4)
memory usage: 27.5+ KB


In [13]:
# Cleaning and transforming df2
# rank, year, and votes should have a numeric integer data type.

#rank and year are already in int64 according to the top result
#converting votes to int
df2['votes']=df2['votes'].replace(',','',regex=True)
df2['votes']=df2['votes'].astype('int64')
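
The same conversion can be tried on a standalone Series. This sketch uses the string accessor with `regex=False`, which treats the comma as a literal character; the sample values are copied from the output above.

```python
import pandas as pd

votes = pd.Series(["1,422,218", "903,939", "28,823"])

# Strip the thousands separators, then cast to a 64-bit integer.
as_int = votes.str.replace(",", "", regex=False).astype("int64")

print(as_int.tolist())  # [1422218, 903939, 28823]
print(as_int.dtype)     # int64
```

`astype` raises a `ValueError` if any value still contains non-digit characters, which makes leftover formatting easy to spot.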


In [14]:
df2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 500 entries, 0 to 499
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   movie_id  500 non-null    object 
 1   rank      500 non-null    int64  
 2   title     500 non-null    object 
 3   run_time  500 non-null    object 
 4   year      500 non-null    int64  
 5   rating    500 non-null    float64
 6   votes     500 non-null    int64  
dtypes: float64(1), int64(3), object(3)
memory usage: 27.5+ KB


In [15]:
#converting runtime to minutes
# runtime column should be renamed to runtimeMinutes and the value should be in minutes, 
# for example: 2h 2m should be 122
df2['runtimeMinutes']=df2['run_time'].apply(lambda x: sum(int(num[:-1]) * (60 if 'h' in num else 1) for num in x.split()))
df2=df2.drop(columns=['run_time'])
print(df2)


      movie_id  rank                   title  year  rating    votes  \
0    tt7286456     1                   Joker  2019     8.4  1422218   
1    tt4154796     2       Avengers: Endgame  2019     8.4  1227564   
2    tt4154756     3  Avengers: Infinity War  2018     8.4  1167536   
3    tt6751668     4                Parasite  2019     8.5   903939   
4    tt1825683     5           Black Panther  2018     7.3   820837   
..         ...   ...                     ...   ...     ...      ...   
495  tt9072352   496                   Relic  2020     6.0    29282   
496  tt1006569   497              Antebellum  2020     5.8    29199   
497  tt8652728   498                   Waves  2019     7.5    29103   
498  tt7748244   499            Mortal World  2018     7.6    29052   
499  tt6768578   500                  Dogman  2018     7.2    28823   

     runtimeMinutes  
0               122  
1               181  
2               149  
3               132  
4               134  
..             
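
The one-line lambda in the cell above can also be written as a named function, which is easier to test and fails loudly on unexpected input. This is an alternative sketch, not the notebook's method; `runtime_to_minutes` is a hypothetical name, and the pattern assumes runtimes of the form `'2h 2m'`, `'45m'`, or `'3h'`.

```python
import re

def runtime_to_minutes(text):
    """Convert strings such as '2h 2m', '45m', or '3h' to whole minutes."""
    hours = re.search(r"(\d+)\s*h", text)
    minutes = re.search(r"(\d+)\s*m", text)
    if not hours and not minutes:
        raise ValueError(f"unrecognized runtime: {text!r}")
    total = int(hours.group(1)) * 60 if hours else 0
    total += int(minutes.group(1)) if minutes else 0
    return total

print(runtime_to_minutes("2h 2m"))  # 122
print(runtime_to_minutes("3h 1m"))  # 181
print(runtime_to_minutes("45m"))    # 45
```

It would be applied to a runtime column with `.apply(runtime_to_minutes)` in the same way as the lambda.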

In [16]:
#check if the data types are according to the requirement
df2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 500 entries, 0 to 499
Data columns (total 7 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   movie_id        500 non-null    object 
 1   rank            500 non-null    int64  
 2   title           500 non-null    object 
 3   year            500 non-null    int64  
 4   rating          500 non-null    float64
 5   votes           500 non-null    int64  
 6   runtimeMinutes  500 non-null    int64  
dtypes: float64(1), int64(4), object(2)
memory usage: 27.5+ KB


In [17]:
#check runtimeMinutes which is converted into minutes 
df2.head(20)

Unnamed: 0,movie_id,rank,title,year,rating,votes,runtimeMinutes
0,tt7286456,1,Joker,2019,8.4,1422218,122
1,tt4154796,2,Avengers: Endgame,2019,8.4,1227564,181
2,tt4154756,3,Avengers: Infinity War,2018,8.4,1167536,149
3,tt6751668,4,Parasite,2019,8.5,903939,132
4,tt1825683,5,Black Panther,2018,7.3,820837,134
5,tt7131622,6,Once Upon a Time in Hollywood,2019,7.6,809786,161
6,tt8946378,7,Knives Out,2019,7.9,747547,130
7,tt8579674,8,1917,2019,8.2,650257,119
8,tt4633694,9,Spider-Man: Into the Spider-Verse,2018,8.4,641118,117
9,tt5463162,10,Deadpool 2,2018,7.6,624815,119


# 	Enrich the given dataset (df1) by merging it to the scraped data (df2).

In [18]:
# Merge the two dataframes into one dataframe called df, based on movie_id
df=df1.merge(df2, on='movie_id')
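
Before relying on the merged frame, it can help to check that the join key is unique on both sides and to see which IDs fail to match. The sketch below uses two made-up frames with placeholder IDs; `validate` and `indicator` are standard `DataFrame.merge` parameters.

```python
import pandas as pd

# Made-up frames: one ID appears on both sides, the other two on one side only.
given = pd.DataFrame({"movie_id": ["tt0000001", "tt0000002"],
                      "genres": ["Drama", "Comedy"]})
scraped = pd.DataFrame({"movie_id": ["tt0000001", "tt0000003"],
                        "rank": [1, 2]})

# validate raises if movie_id repeats on either side;
# indicator adds a _merge column saying where each row came from.
checked = given.merge(scraped, on="movie_id", how="outer",
                      validate="one_to_one", indicator=True)
unmatched = checked.loc[checked["_merge"] != "both", ["movie_id", "_merge"]]
print(unmatched)
```

An empty `unmatched` frame means every movie in one dataset found its partner in the other.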


In [19]:
df

Unnamed: 0,movie_id,originalTitle,description,ratingCategory,genres,rank,title,year,rating,votes,runtimeMinutes
0,tt7286456,Joker,"During the 1980s, a failed stand-up comedian i...",R,"Crime,Drama,Thriller",1,Joker,2019,8.4,1422218,122
1,tt4154796,Avengers: Endgame,After the devastating events of Avengers: Infi...,PG-13,"Action,Adventure,Drama",2,Avengers: Endgame,2019,8.4,1227564,181
2,tt4154756,Avengers: Infinity War,The Avengers and their allies must be willing ...,PG-13,"Action,Adventure,Sci-Fi",3,Avengers: Infinity War,2018,8.4,1167536,149
3,tt6751668,Parasite,Greed and class discrimination threaten the ne...,R,"Drama,Thriller",4,Parasite,2019,8.5,903939,132
4,tt1825683,Black Panther,"T'Challa, heir to the hidden but advanced king...",PG-13,"Action,Adventure,Sci-Fi",5,Black Panther,2018,7.3,820837,134
...,...,...,...,...,...,...,...,...,...,...,...
495,tt9072352,Relic,"A daughter, mother and grandmother are haunted...",R,"Drama,Horror,Mystery",496,Relic,2020,6.0,29282,89
496,tt1006569,Episode dated 9 August 2005,Successful author Veronica Henley finds hersel...,R,News,497,Antebellum,2020,5.8,29199,105
497,tt8652728,Waves,Traces the journey of a suburban family - led ...,R,"Drama,Romance,Sport",498,Waves,2019,7.5,29103,135
498,tt7748244,Mortal World,Mermer Family lives a double life working at t...,R,"Action,Comedy,Crime",499,Mortal World,2018,7.6,29052,107


# Rearrange the dataset fields to be listed in the following order: 
movie_id , rank , title ,  originalTitle ,  description ,
          year ,  votes , rating ,  runtimeMinutes ,  ratingCategory ,  genres

In [20]:
# Rearrange the dataset fields as suggested.
df=df[['movie_id','rank','title','originalTitle','description','year','votes','rating','runtimeMinutes','ratingCategory','genres']]

In [21]:
#Check if it is rearranged
df

Unnamed: 0,movie_id,rank,title,originalTitle,description,year,votes,rating,runtimeMinutes,ratingCategory,genres
0,tt7286456,1,Joker,Joker,"During the 1980s, a failed stand-up comedian i...",2019,1422218,8.4,122,R,"Crime,Drama,Thriller"
1,tt4154796,2,Avengers: Endgame,Avengers: Endgame,After the devastating events of Avengers: Infi...,2019,1227564,8.4,181,PG-13,"Action,Adventure,Drama"
2,tt4154756,3,Avengers: Infinity War,Avengers: Infinity War,The Avengers and their allies must be willing ...,2018,1167536,8.4,149,PG-13,"Action,Adventure,Sci-Fi"
3,tt6751668,4,Parasite,Parasite,Greed and class discrimination threaten the ne...,2019,903939,8.5,132,R,"Drama,Thriller"
4,tt1825683,5,Black Panther,Black Panther,"T'Challa, heir to the hidden but advanced king...",2018,820837,7.3,134,PG-13,"Action,Adventure,Sci-Fi"
...,...,...,...,...,...,...,...,...,...,...,...
495,tt9072352,496,Relic,Relic,"A daughter, mother and grandmother are haunted...",2020,29282,6.0,89,R,"Drama,Horror,Mystery"
496,tt1006569,497,Antebellum,Episode dated 9 August 2005,Successful author Veronica Henley finds hersel...,2020,29199,5.8,105,R,News
497,tt8652728,498,Waves,Waves,Traces the journey of a suburban family - led ...,2019,29103,7.5,135,R,"Drama,Romance,Sport"
498,tt7748244,499,Mortal World,Mortal World,Mermer Family lives a double life working at t...,2018,29052,7.6,107,R,"Action,Comedy,Crime"


# Export the enriched dataset to a CSV file:

In [23]:
# Use the following naming convention: 
#  Project_3_PartA_Lastname.csv
df.to_csv('Project_3_PartA.csv', index = False,encoding="cp1252")


# Summary


    Step 1: Scraped the required data from the IMDb website.
    Step 2: Converted the movies list of dicts to a dataframe.
    Step 3: Exported the collected movies to IMDb_TopVoted.csv file.
    Step 4: Imported the movies.csv file to df1 and printed df1.
    Step 5: Found empty values in ratingCategory and filled it with 'Not Rated'
    Step 6: Replaced 'Unrated' to 'Not Rated' to have uniform data in ratingCategory field
    Step 7: Replaced "\N" with - to make it more meaningful
    Step 8: Imported the scraped data from the IMDb_TopVoted.csv file to Pandas DataFrame called df2
    Step 9: Performed Data cleansing and transformation for df2:
        - corrected the data types
        - converted the runtime to minutes and renamed the column
    Step 10: Merged the datasets.
    Step 11: Rearranged the dataset fields as suggested.
    Step 12: Exported the enriched dataset to a CSV file.