## Mounting

Run this cell to mount Google Drive:

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
!pip install scrapy

Collecting scrapy
  Downloading Scrapy-2.9.0-py2.py3-none-any.whl (277 kB)
Collecting Twisted>=18.9.0 (from scrapy)
  Downloading Twisted-22.10.0-py3-none-any.whl (3.1 MB)
Collecting cssselect>=0.9.1 (from scrapy)
  Downloading cssselect-1.2.0-py2.py3-none-any.whl (18 kB)
Collecting itemloaders>=1.0.1 (from scrapy)
  Downloading itemloaders-1.1.0-py3-none-any.whl (11 kB)
Collecting parsel>=1.5.0 (from scrapy)
  Downloading parsel-1.8.1-py2.py3-none-any.whl (17 kB)
Collecting pyOpenSSL>=21.0.0 (from scrapy)
  Downloading pyOpenSSL-23.2.0-py3-none-any.whl (59 kB)
Collecting queuelib>=1.4.2 (from scrapy)
  Downloading queue

In [None]:
%cd /content/drive/MyDrive/IMDB\ Project/Scraping


/content/drive/MyDrive/IMDB Project/Scraping


Now that I've navigated to the project folder, it's time to create my first Scrapy project, `ScraPy_Code_1`.

In [None]:
!scrapy startproject ScraPy_Code_1


Next, navigate to the project's `spiders` folder and write the new Scrapy spider: `basic_details_scraper.py`


In [None]:
%cd /content/drive/MyDrive/IMDB\ Project/Scraping/ScraPy_Code_1/ScraPy_Code_1/spiders


/content/drive/MyDrive/IMDB Project/Scraping/ScraPy_Code_1/ScraPy_Code_1/spiders


Now to write the Scrapy file! This file scrapes basic details for every movie returned by this search URL:

'https://www.imdb.com/search/title/?title_type=feature&user_rating=5.0,10.0&languages=en'

The search covers release years 1980 to 2023. Since the file was run on the 19th of July 2023, the dataset is current up to that date.

**The list is roughly 60000 titles long.**


Note that this search excludes movies with an overall user rating below 5 stars. The version of the spider saved in the cell below uses `user_rating=1.0,10.0` instead, sorts by US box office gross, and scrapes 2023 only.

In [None]:
%%writefile basic_details_scraper.py
import scrapy
import csv
import os
import logging

class IMDbMovieSpider(scrapy.Spider):
    name = 'imdb_basic_details_spider2'
    base_url = 'https://www.imdb.com/search/title/?title_type=feature&user_rating=1.0,10.0&languages=en&sort=boxoffice_gross_us,desc'
    output_directory = '/content/drive/MyDrive/IMDB Project/Scraping/scraped_data/ScraPy_Code_1.1_data'
    os.makedirs(output_directory, exist_ok=True)

    def start_requests(self):

        print("STARTING START REQUESTSSSS\n\n\n")

        # # Loop through 2-year intervals
        # for year in range(1996,2023,2):
        #     start_date = f'{year-1}-01-01'
        #     end_date = f'{year}-12-31'
        #     url = f'{self.base_url}&release_date={start_date},{end_date}&start=1'
        #     output_file = os.path.join(self.output_directory, f'movies_{year-1}_{year}.csv')


        # Set year to 2023
        year = 2023
        start_date = f'{year}-01-01'
        end_date = f'{year}-12-31'
        url = f'{self.base_url}&release_date={start_date},{end_date}&start=1'
        output_file = os.path.join(self.output_directory, f'movies_{year}.csv')


        # Log the URL and output file
        self.log(f'URL: {url}', level=logging.INFO)
        self.log(f'Output file: {output_file}', level=logging.INFO)

        # Write the headers to the file
        with open(output_file, 'w', newline='', encoding='utf-8') as file:
            writer = csv.writer(file, delimiter='|')
            writer.writerow(['Title', 'Gross', 'Details URL', 'Genres'])

        # Send the first request for the selected year
        yield scrapy.Request(url=url, callback=self.parse, meta={'output_file': output_file, 'start_year': year-1})

    def parse(self, response):
        print("STARTING PARSE \n\n\n\n")

        self.log('Visited %s' % response.url)

        output_file = response.meta['output_file']
        start_year = response.meta['start_year']

        for movie in response.css('div.lister-item'):
            title = movie.css('h3.lister-item-header a::text').get()
            gross = movie.css('p.sort-num_votes-visible span[name="nv"]:last-child::attr(data-value)').get()
            details_url = movie.css('h3.lister-item-header a::attr(href)').get()
            genres = movie.css('span.genre::text').get()



            # Count the number of rows in the file
            with open(output_file, 'r', newline='', encoding='utf-8') as file:
                reader = csv.reader(file, delimiter='|')
                row_count = sum(1 for row in reader)

            # Save to the appropriate CSV file
            with open(output_file, 'a', newline='', encoding='utf-8') as file:
                writer = csv.writer(file, delimiter='|')
                writer.writerow([title, gross, details_url, genres])


            # Print the number of rows saved to the file
            print(f'Saved {row_count} rows to {output_file}')



        # Pagination
        current_start = int(response.url.split('&start=')[1].split('&')[0])

        # Stop at the 10,000-result limit, or earlier if the page has no results
        if current_start >= 10000 or not response.css('div.lister-item'):
            return

        # Go to the next page, keeping the same release_date range as the current request
        next_start = current_start + 50
        next_page = response.url.replace(f'&start={current_start}', f'&start={next_start}')
        yield scrapy.Request(url=next_page, callback=self.parse, meta={'output_file': output_file, 'start_year': start_year})


Writing basic_details_scraper.py
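
The spider pages through the search results by raising the `start` parameter by 50 until it reaches the 10,000-result cap. Here is a minimal sketch of that step on its own, using `urllib.parse` instead of string splitting. The helper name `next_page_url` and the constants are hypothetical and are not part of the spider above.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

PAGE_SIZE = 50      # results per search page, as in the spider above
MAX_START = 10000   # stop once the result cap is reached


def next_page_url(url, page_size=PAGE_SIZE, max_start=MAX_START):
    """Return the URL of the next results page, or None when the cap is reached."""
    parts = urlsplit(url)
    params = dict(parse_qsl(parts.query, keep_blank_values=True))
    current_start = int(params.get('start', 1))
    if current_start >= max_start:
        return None
    params['start'] = str(current_start + page_size)
    # safe=',' keeps comma-separated ranges such as release_date unescaped
    return urlunsplit(parts._replace(query=urlencode(params, safe=',')))


first_page = ('https://www.imdb.com/search/title/?title_type=feature'
              '&release_date=2023-01-01,2023-12-31&start=1')
print(next_page_url(first_page))
```

Because the next URL is derived from the current one, the date range cannot drift between pages.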


In [None]:
#Now, to resave this list but order it by the box office

The Scrapy spider for the first section has been written! Next, navigate to the `spiders` folder and run the spider.

In [None]:
# Navigate to the spiders folder to view the spider codes
%cd /content/drive/MyDrive/IMDB\ Project/Scraping/ScraPy_Code_1/ScraPy_Code_1/spiders
!ls

/content/drive/MyDrive/IMDB Project/Scraping/ScraPy_Code_1/ScraPy_Code_1/spiders
basic_details_scraper2.py  basic_details_scraper.py  __init__.py  __pycache__


In [None]:
#Now to actually run the spider! The data will be saved in '/content/drive/MyDrive/IMDB Project/Scraping/scraped_data/ScraPy_Code_1.1_data'

%cd /content/drive/MyDrive/IMDB\ Project/Scraping/ScraPy_Code_1/ScraPy_Code_1/spiders

!scrapy runspider basic_details_scraper.py


Streaming output truncated to the last 5000 lines.
Saved 762 rows to /content/drive/MyDrive/IMDB Project/Scraping/scraped_data/ScraPy_Code_1.1_data/movies_2023.csv
Saved 763 rows to /content/drive/MyDrive/IMDB Project/Scraping/scraped_data/ScraPy_Code_1.1_data/movies_2023.csv
Saved 764 rows to /content/drive/MyDrive/IMDB Project/Scraping/scraped_data/ScraPy_Code_1.1_data/movies_2023.csv
Saved 765 rows to /content/drive/MyDrive/IMDB Project/Scraping/scraped_data/ScraPy_Code_1.1_data/movies_2023.csv
Saved 766 rows to /content/drive/MyDrive/IMDB Project/Scraping/scraped_data/ScraPy_Code_1.1_data/movies_2023.csv
Saved 767 rows to /content/drive/MyDrive/IMDB Project/Scraping/scraped_data/ScraPy_Code_1.1_data/movies_2023.csv
Saved 768 rows to /content/drive/MyDrive/IMDB Project/Scraping/scraped_data/ScraPy_Code_1.1_data/movies_2023.csv
Saved 769 rows to /content/drive/MyDrive/IMDB Project/Scraping/scraped_data/ScraPy_Code_1.1_data/movies_2023.csv
Saved 770 rows to /content/driv

In [None]:
import os

def check_file_exists(file_path):
    if os.path.isfile(file_path):
        print(f"The file '{file_path}' exists.")
    else:
        print(f"The file '{file_path}' does not exist.")

# Use the function to check if your file exists
check_file_exists('/content/drive/MyDrive/IMDB Project/Scraping/scraped_data/ScraPy_Code_1_data/movies_2016_2017.csv')


The file '/content/drive/MyDrive/IMDB Project/Scraping/scraped_data/ScraPy_Code_1_data/movies_2016_2017.csv' exists.


Alright! We now have the full list of movies, a few basic details for each, and most importantly the movie id, which will guide the more detailed scraper.

But first, we need to concatenate all of these files into one. We had to scrape in 2-year groups because each search is limited to 10K results.
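
To make the 2-year grouping concrete, here is a minimal sketch of how the date windows and their first-page search URLs could be generated. The helper names `two_year_windows` and `search_urls` are hypothetical. The base URL is the one quoted earlier in this notebook, and the `release_date` and `start` parameters follow the spider above.

```python
BASE_URL = ('https://www.imdb.com/search/title/'
            '?title_type=feature&user_rating=5.0,10.0&languages=en')


def two_year_windows(first_year, last_year):
    """Yield (start_date, end_date) strings covering two calendar years each."""
    for start in range(first_year, last_year + 1, 2):
        end = min(start + 1, last_year)  # the last window may be a single year
        yield f'{start}-01-01', f'{end}-12-31'


def search_urls(first_year, last_year, base_url=BASE_URL):
    """Yield the first-page search URL for every 2-year window."""
    for start_date, end_date in two_year_windows(first_year, last_year):
        yield f'{base_url}&release_date={start_date},{end_date}&start=1'


for window in two_year_windows(1980, 1985):
    print(window)
```

Each window is a separate search, so each one has its own 10K result allowance.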

In [None]:
import pandas as pd
import glob

# Get a list of all CSV files
csv_files = glob.glob('/content/drive/MyDrive/IMDB Project/Scraping/scraped_data/ScraPy_Code_1_data/*.csv')
csv_files

['/content/drive/MyDrive/IMDB Project/Scraping/scraped_data/ScraPy_Code_1_data/all_movie_ids_old.csv',
 '/content/drive/MyDrive/IMDB Project/Scraping/scraped_data/ScraPy_Code_1_data/movies_2014_2015.csv',
 '/content/drive/MyDrive/IMDB Project/Scraping/scraped_data/ScraPy_Code_1_data/movies_2002_2003.csv',
 '/content/drive/MyDrive/IMDB Project/Scraping/scraped_data/ScraPy_Code_1_data/movies_2016_2017.csv',
 '/content/drive/MyDrive/IMDB Project/Scraping/scraped_data/ScraPy_Code_1_data/movies_1992_1993.csv',
 '/content/drive/MyDrive/IMDB Project/Scraping/scraped_data/ScraPy_Code_1_data/movies_2012_2013.csv',
 '/content/drive/MyDrive/IMDB Project/Scraping/scraped_data/ScraPy_Code_1_data/movies_2004_2005.csv',
 '/content/drive/MyDrive/IMDB Project/Scraping/scraped_data/ScraPy_Code_1_data/movies_1996_1997.csv',
 '/content/drive/MyDrive/IMDB Project/Scraping/scraped_data/ScraPy_Code_1_data/movies_1998_1999.csv',
 '/content/drive/MyDrive/IMDB Project/Scraping/scraped_data/ScraPy_Code_1_data/mo

In [None]:
csv_files.remove('/content/drive/MyDrive/IMDB Project/Scraping/scraped_data/ScraPy_Code_1_data/all_movie_ids_old.csv')
csv_files.remove('/content/drive/MyDrive/IMDB Project/Scraping/scraped_data/ScraPy_Code_1_data/all_movie_ids2.csv')


In [None]:
csv_files

['/content/drive/MyDrive/IMDB Project/Scraping/scraped_data/ScraPy_Code_1_data/movies_2014_2015.csv',
 '/content/drive/MyDrive/IMDB Project/Scraping/scraped_data/ScraPy_Code_1_data/movies_2002_2003.csv',
 '/content/drive/MyDrive/IMDB Project/Scraping/scraped_data/ScraPy_Code_1_data/movies_2016_2017.csv',
 '/content/drive/MyDrive/IMDB Project/Scraping/scraped_data/ScraPy_Code_1_data/movies_1992_1993.csv',
 '/content/drive/MyDrive/IMDB Project/Scraping/scraped_data/ScraPy_Code_1_data/movies_2012_2013.csv',
 '/content/drive/MyDrive/IMDB Project/Scraping/scraped_data/ScraPy_Code_1_data/movies_2004_2005.csv',
 '/content/drive/MyDrive/IMDB Project/Scraping/scraped_data/ScraPy_Code_1_data/movies_1996_1997.csv',
 '/content/drive/MyDrive/IMDB Project/Scraping/scraped_data/ScraPy_Code_1_data/movies_1998_1999.csv',
 '/content/drive/MyDrive/IMDB Project/Scraping/scraped_data/ScraPy_Code_1_data/movies_1982_1983.csv',
 '/content/drive/MyDrive/IMDB Project/Scraping/scraped_data/ScraPy_Code_1_data/mov

In [None]:
import pandas as pd

# Iterate over each file
for file in csv_files:
    # Read the file
    df = pd.read_csv(file, header = None, nrows=3, sep='|')

    # Display the first row
    print(f"File: {file}")
    print(df)

    # Get the new column names
    new_columns = input("Enter the new column names, comma separated: ")
    new_columns = new_columns.split(",")

    # Read the file again with the new column names
    df = pd.read_csv(file, names=new_columns, header = None, sep='|')

    # Overwrite the original file
    df.to_csv(file, index=False)


File: /content/drive/MyDrive/IMDB Project/Scraping/scraped_data/ScraPy_Code_1_data/movies_2014_2015.csv
                       0                                                  1   \
0            Jason Bourne  https://m.media-amazon.com/images/S/sash/4Fyxw...   
1  Straight Outta Compton  https://m.media-amazon.com/images/S/sash/4Fyxw...   
2        Star Trek Beyond  https://m.media-amazon.com/images/S/sash/4Fyxw...   

                  2           3        4   \
0  /title/tt4196776/  (I) (2016)  123 min   
1  /title/tt1398426/      (2015)  147 min   
2  /title/tt2660888/      (2016)  122 min   

                                        5    6   7   \
0           \nAction, Thriller              6.6  58   
1  \nBiography, Drama, History              7.8  72   
2  \nAction, Adventure, Sci-Fi              7.0  68   

                                                  8   9            10  
0  The CIA's most dangerous former operative is d... NaN  162,434,410  
1                            

#### Compiling all the data into one csv

In [None]:
# Assuming csv_files is your list of csv file paths

# Loop through the CSV files and print the column names
for csv_file in csv_files:
    df = pd.read_csv(csv_file, nrows=0)
    print(f'Column names for {csv_file}: \n{df.columns.tolist()}\n')


Column names for /content/drive/MyDrive/IMDB Project/Scraping/scraped_data/ScraPy_Code_1_data/movies_2014_2015.csv: 
["'Title'", " 'image_url'", " 'details_url'", " 'date'", " 'duration'", " 'genres'", " 'rating'", " 'metascore'", " 'summary'", " 'votes'", " 'gross'"]

Column names for /content/drive/MyDrive/IMDB Project/Scraping/scraped_data/ScraPy_Code_1_data/movies_2002_2003.csv: 
["'Title'", " 'image_url'", " 'details_url'", " 'date'", " 'duration'", " 'genres'", " 'rating'", " 'metascore'", " 'summary'", " 'votes'", " 'gross'"]

Column names for /content/drive/MyDrive/IMDB Project/Scraping/scraped_data/ScraPy_Code_1_data/movies_2016_2017.csv: 
["'Title'", " 'image_url'", " 'details_url'", " 'date'", " 'duration'", " 'genres'", " 'rating'", " 'metascore'", " 'summary'", " 'votes'", " 'gross'"]

Column names for /content/drive/MyDrive/IMDB Project/Scraping/scraped_data/ScraPy_Code_1_data/movies_1992_1993.csv: 
["'Title'", " 'image_url'", " 'details_url'", " 'date'", " 'duration'", "

In [None]:
# Define the column names
column_names = ['Title', 'image_url', 'details_url', 'date', 'duration', 'genres', 'rating', 'metascore', 'summary', 'votes', 'gross']


In [None]:
all_data = pd.DataFrame()

In [None]:
# Read each CSV file into a dataframe and concatenate them all in one call
all_data = pd.concat([pd.read_csv(csv_file) for csv_file in csv_files], ignore_index=True)


In [None]:
all_data.shape

(21855, 11)

In [None]:
all_data.columns = all_data.columns.str.strip(' \'')
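
The quotes and leading spaces stripped here appear to come from the header names typed into the `input()` prompt earlier, which were split on commas without any cleanup. A minimal sketch of normalising that input up front follows. The helper name `parse_column_names` is hypothetical and is not used elsewhere in this notebook.

```python
def parse_column_names(raw):
    """Split a comma-separated header string and strip spaces and quotes from each name."""
    return [name.strip(" '\"") for name in raw.split(',')]


typed = "'Title', 'image_url', 'details_url'"
print(parse_column_names(typed))  # ['Title', 'image_url', 'details_url']
```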


In [None]:
all_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21855 entries, 0 to 21854
Data columns (total 11 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Title        21855 non-null  object 
 1   image_url    21855 non-null  object 
 2   details_url  21855 non-null  object 
 3   date         21855 non-null  object 
 4   duration     21284 non-null  object 
 5   genres       21711 non-null  object 
 6   rating       21855 non-null  float64
 7   metascore    10626 non-null  float64
 8   summary      21359 non-null  object 
 9   votes        0 non-null      float64
 10  gross        21855 non-null  object 
dtypes: float64(3), object(8)
memory usage: 1.8+ MB


In [None]:
# Remove trailing apostrophe from column names
all_data.columns = all_data.columns.str.rstrip('\'')

# Remove commas and convert 'gross' column to float;
# values that cannot be parsed become NaN
all_data['gross'] = pd.to_numeric(all_data['gross'].astype(str).str.replace(',', ''), errors='coerce')

# Sort the dataframe by 'gross' column in descending order (NaN values are placed last)
all_data.sort_values('gross', ascending=False, inplace=True)

all_data

Unnamed: 0,Title,image_url,details_url,date,duration,genres,rating,metascore,summary,votes,gross
13449,The Dark Knight,https://m.media-amazon.com/images/S/sash/4Fyxw...,/title/tt0468569/,(2008),152 min,"\nAction, Crime, Drama",9.0,84.0,When the menace known as the Joker wreaks havo...,,534858444.0
13450,Pirates of the Caribbean: Dead Man's Chest,https://m.media-amazon.com/images/S/sash/4Fyxw...,/title/tt0383574/,(2006),151 min,"\nAction, Adventure, Fantasy",7.3,53.0,Jack Sparrow races to recover the heart of Dav...,,423315812.0
13451,Spider-Man 3,https://m.media-amazon.com/images/S/sash/4Fyxw...,/title/tt0413300/,(2007),139 min,"\nAction, Adventure, Sci-Fi",6.3,59.0,A strange black entity from another world bond...,,336530303.0
13452,Shrek the Third,https://m.media-amazon.com/images/S/sash/4Fyxw...,/title/tt0413267/,(2007),93 min,"\nAnimation, Adventure, Comedy",6.1,58.0,Reluctantly designated as the heir to the land...,,320706665.0
13453,Transformers,https://m.media-amazon.com/images/S/sash/4Fyxw...,/title/tt0418279/,(2007),144 min,"\nAction, Adventure, Sci-Fi",7.0,61.0,An ancient struggle between two Cybertronian r...,,319246193.0
...,...,...,...,...,...,...,...,...,...,...,...
19050,Fireheart,https://m.media-amazon.com/images/S/sash/4Fyxw...,/title/tt8354218/,(2022),92 min,"\nAnimation, Adventure, Comedy",6.2,,Sixteen-year-old Georgia Nolan dreams of being...,,
19051,Late Night with the Devil,https://m.media-amazon.com/images/S/sash/4Fyxw...,/title/tt14966898/,(2023),86 min,\nHorror,8.3,71.0,A live television broadcast in 1977 goes horri...,,
19052,Alone Together,https://m.media-amazon.com/images/S/sash/4Fyxw...,/title/tt14584284/,(I) (2022),98 min,"\nDrama, Romance",5.4,52.0,Two strangers embroiled in bad relationships w...,,
19053,Real Love,https://m.media-amazon.com/images/S/sash/4Fyxw...,/title/tt27230149/,(II) (2023),87 min,"\nDrama, Romance",7.4,,It follows Kendra as she goes to an HBCU in No...,,
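
The `gross` values are scraped as strings with thousands separators, and the rows at the bottom of the table have no usable figure. A minimal standard-library sketch of the per-value conversion is shown here. The helper name `parse_gross` is hypothetical, and treating every unparseable value as NaN is an assumption about how missing figures should be handled.

```python
import math


def parse_gross(value):
    """Convert a gross figure such as '162,434,410' to a float; return NaN if it cannot be parsed."""
    if value is None:
        return math.nan
    text = str(value).replace(',', '').strip()
    try:
        return float(text)
    except ValueError:
        return math.nan


raw_values = ['162,434,410', 5000, '', None]
print([parse_gross(v) for v in raw_values])
```

Converting through `str` first means values that were already read as numbers are kept rather than lost.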


In [None]:
all_data.reset_index(drop=True, inplace=True)


In [None]:
all_data

Unnamed: 0,Title,image_url,details_url,date,duration,genres,rating,metascore,summary,votes,gross
0,The Dark Knight,https://m.media-amazon.com/images/S/sash/4Fyxw...,/title/tt0468569/,(2008),152 min,"\nAction, Crime, Drama",9.0,84.0,When the menace known as the Joker wreaks havo...,,534858444.0
1,Pirates of the Caribbean: Dead Man's Chest,https://m.media-amazon.com/images/S/sash/4Fyxw...,/title/tt0383574/,(2006),151 min,"\nAction, Adventure, Fantasy",7.3,53.0,Jack Sparrow races to recover the heart of Dav...,,423315812.0
2,Spider-Man 3,https://m.media-amazon.com/images/S/sash/4Fyxw...,/title/tt0413300/,(2007),139 min,"\nAction, Adventure, Sci-Fi",6.3,59.0,A strange black entity from another world bond...,,336530303.0
3,Shrek the Third,https://m.media-amazon.com/images/S/sash/4Fyxw...,/title/tt0413267/,(2007),93 min,"\nAnimation, Adventure, Comedy",6.1,58.0,Reluctantly designated as the heir to the land...,,320706665.0
4,Transformers,https://m.media-amazon.com/images/S/sash/4Fyxw...,/title/tt0418279/,(2007),144 min,"\nAction, Adventure, Sci-Fi",7.0,61.0,An ancient struggle between two Cybertronian r...,,319246193.0
...,...,...,...,...,...,...,...,...,...,...,...
21850,Fireheart,https://m.media-amazon.com/images/S/sash/4Fyxw...,/title/tt8354218/,(2022),92 min,"\nAnimation, Adventure, Comedy",6.2,,Sixteen-year-old Georgia Nolan dreams of being...,,
21851,Late Night with the Devil,https://m.media-amazon.com/images/S/sash/4Fyxw...,/title/tt14966898/,(2023),86 min,\nHorror,8.3,71.0,A live television broadcast in 1977 goes horri...,,
21852,Alone Together,https://m.media-amazon.com/images/S/sash/4Fyxw...,/title/tt14584284/,(I) (2022),98 min,"\nDrama, Romance",5.4,52.0,Two strangers embroiled in bad relationships w...,,
21853,Real Love,https://m.media-amazon.com/images/S/sash/4Fyxw...,/title/tt27230149/,(II) (2023),87 min,"\nDrama, Romance",7.4,,It follows Kendra as she goes to an HBCU in No...,,


In [None]:
#That's strange, time to find the really top grossing movies like Avatar and Avengers


In [None]:
# Write the concatenated dataframe to a new CSV file
all_data.to_csv('/content/drive/MyDrive/IMDB Project/Scraping/scraped_data/ScraPy_Code_1_data/all_movie_ids_final.csv', index=False)

## Okay! It's done! Now for the next step: Scraping ALL the important details, except for the actual user reviews:

In [None]:
%cd /content/drive/MyDrive/IMDB\ Project/Scraping
!scrapy startproject ScraPy_Code_2

/content/drive/MyDrive/IMDB Project/Scraping
New Scrapy project 'ScraPy_Code_2', using template directory '/usr/local/lib/python3.10/dist-packages/scrapy/templates/project', created in:
    /content/drive/MyDrive/IMDB Project/Scraping/ScraPy_Code_2

You can start your first spider with:
    cd ScraPy_Code_2
    scrapy genspider example example.com


In [None]:
%cd /content/drive/MyDrive/IMDB\ Project/Scraping/ScraPy_Code_2/ScraPy_Code_2/spiders

/content/drive/MyDrive/IMDB Project/Scraping/ScraPy_Code_2/ScraPy_Code_2/spiders
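
The spider in the next cell can resume an interrupted crawl: it reads the ids already written to the output file and drops those titles from the start URLs. Here is a minimal sketch of that filtering step without pandas. The helper name `remaining_urls` is hypothetical, and the `tt` plus digits pattern is the same one the spider uses to extract ids.

```python
import re


def remaining_urls(details_urls, scraped_ids):
    """Return the URLs whose IMDb id (tt followed by digits) has not been scraped yet."""
    done = set(scraped_ids)
    pending = []
    for url in details_urls:
        match = re.search(r'tt\d+', url)
        # URLs without a recognisable id are kept, so nothing is silently skipped
        if match is None or match.group(0) not in done:
            pending.append(url)
    return pending


urls = ['https://www.imdb.com/title/tt0468569/',
        'https://www.imdb.com/title/tt0383574/']
print(remaining_urls(urls, ['tt0468569']))
```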


In [None]:
%%writefile more_details_scraper.py

import pandas as pd
import scrapy
import datetime
import json
import os
from scrapy import signals
import csv


class csv_dialect(csv.Dialect):
    delimiter = ','
    quotechar = '"'
    doublequote = True
    skipinitialspace = False
    lineterminator = '\n'
    quoting = csv.QUOTE_ALL


class IMDbSpider(scrapy.Spider):
    name = 'scraping_test2'
    user_agent = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'
    output_directory = '/content/drive/MyDrive/IMDB Project/Scraping/scraped_data/ScraPy_Code_2.2_data'
    os.makedirs(output_directory, exist_ok=True)
    data = []

    # Try to read in the already scraped titles
    try:
        df_scraped = pd.read_csv('/content/drive/MyDrive/IMDB Project/Scraping/scraped_data/ScraPy_Code_2.2_data/more_details.csv', header=None)
    except (FileNotFoundError, pd.errors.EmptyDataError):
        # If the file is missing or empty, create an empty DataFrame
        df_scraped = pd.DataFrame()

    scraped_ids = df_scraped[0].tolist() if not df_scraped.empty else []

    # Read in the start URLs and remove already scraped titles
    df = pd.read_csv('/content/drive/MyDrive/IMDB Project/Scraping/scraped_data/ScraPy_Code_1_data/all_movie_ids_final.csv')
    df['details_url'] = 'https://www.imdb.com' + df['details_url']
    df['id'] = df['details_url'].str.extract(r'(tt\d+)')
    df = df[~df['id'].isin(scraped_ids)]
    start_urls = df['details_url'].tolist()

    # Record the start time
    start_time = datetime.datetime.now()

    # Initialize a counter
    url_count = 0

    def handle_error(self, failure):
        self.log(failure)

    def start_requests(self):
        for url in self.start_urls:
            imdb_id = url.split('/')[-2]
            output_file = os.path.join(self.output_directory, 'more_details_2.csv')
            yield scrapy.Request(url, headers={'User-Agent': self.user_agent}, meta={'imdb_id': imdb_id}, errback=self.handle_error, callback=self.parse)

    def parse(self, response):
        imdb_id = response.meta['imdb_id']


        # Increment the counter
        self.url_count += 1



        # Calculate the elapsed time and the average time per item
        elapsed_time = datetime.datetime.now() - self.start_time
        avg_time_per_item = elapsed_time / self.url_count

        # Estimate the remaining time
        remaining_items = len(self.start_urls) - self.url_count
        estimated_remaining_time = avg_time_per_item * remaining_items

        self.log(f'***********************TIME ESTIMATION ********************************* \n\n Processing {imdb_id} ({self.url_count}/{len(self.start_urls)}), estimated remaining time: {estimated_remaining_time}\n\n')




        # Initialize the meta dictionary with the imdb_id
        meta = {
            'imdb_id': imdb_id
        }
        # START OF PARSING CODE HERE
        # Scraping Main Details
        title = response.css('h1.sc-afe43def-0 span.sc-afe43def-1::text').get()
        director = response.css('li[data-testid="title-pc-principal-credit"] span:contains("Director") ~ div ul li a::text').get()
        writers = response.css('li[data-testid="title-pc-principal-credit"] span:contains("Writers") ~ div ul li a::text').getall()
        stars = response.css('li[data-testid="title-pc-principal-credit"] a:contains("Stars") ~ div ul li a::text').getall()
        user_reviews = response.css('ul[data-testid="reviewContent-all-reviews"] a:contains("User reviews") span.score::text').get()
        critic_reviews = response.css('ul[data-testid="reviewContent-all-reviews"] a:contains("Critic reviews") span.score::text').get()
        metascore = response.css('ul[data-testid="reviewContent-all-reviews"] a:contains("Metascore") span.score-meta::text').get()

        # Scraping Technical Specs
        tech_specs = response.css('div[data-testid="title-techspecs-section"]')
        runtime = tech_specs.css('li[data-testid="title-techspec_runtime"] div.ipc-metadata-list-item__content-container::text').getall()
        runtime = " ".join(runtime)  # Joining the scraped parts to form the complete runtime text
        sound_mix = tech_specs.css('li[data-testid="title-techspec_soundmix"] a::text').getall()
        # Scraping aspect ratio
        aspect_ratio = tech_specs.css('span.ipc-metadata-list-item__list-content-item::text').get()
        # Scraping Box Office Information
        budget = response.css('li[data-testid="title-boxoffice-budget"] span.ipc-metadata-list-item__list-content-item::text').get()
        gross_us_canada = response.css('li[data-testid="title-boxoffice-grossdomestic"] span.ipc-metadata-list-item__list-content-item::text').get()

        #
        opening_weekend_data = response.css('li[data-testid="title-boxoffice-openingweekenddomestic"] span.ipc-metadata-list-item__list-content-item::text').getall()
        if opening_weekend_data:
            opening_weekend_amount = opening_weekend_data[0]
            opening_weekend_date = opening_weekend_data[1] if len(opening_weekend_data) > 1 else None
        else:
            opening_weekend_amount = None
            opening_weekend_date = None
        #



        # opening_weekend_amount = response.css('li[data-testid="title-boxoffice-openingweekenddomestic"] span.ipc-metadata-list-item__list-content-item::text').getall()[0]
        # opening_weekend_date = response.css('li[data-testid="title-boxoffice-openingweekenddomestic"] span.ipc-metadata-list-item__list-content-item::text').getall()[1]
        opening_weekend_us_canada = f"{opening_weekend_amount}, {opening_weekend_date}"
        gross_worldwide = response.css('li[data-testid="title-boxoffice-cumulativeworldwidegross"] span.ipc-metadata-list-item__list-content-item::text').get()

        # Scraping Details Section
        release_date = response.css('li[data-testid="title-details-releasedate"] a.ipc-metadata-list-item__list-content-item--link::text').get()
        countries_of_origin = response.css('li[data-testid="title-details-origin"] a.ipc-metadata-list-item__list-content-item--link::text').getall()
        official_sites = response.css('li[data-testid="details-officialsites"] a.ipc-metadata-list-item__list-content-item--link::attr(href)').getall()
        languages = response.css('li[data-testid="title-details-languages"] a.ipc-metadata-list-item__list-content-item--link::text').getall()
        also_known_as = response.css('li[data-testid="title-details-akas"] span.ipc-metadata-list-item__list-content-item::text').get()
        filming_locations = response.css('li[data-testid="title-details-filminglocations"] a.ipc-metadata-list-item__list-content-item--link::text').get()
        production_companies = response.css('li[data-testid="title-details-companies"] a.ipc-metadata-list-item__list-content-item--link::text').getall()


        # Update the meta dictionary with new data
        meta.update({
            'title': title,
            'runtime': runtime,
            'sound_mix': sound_mix,
            'aspect_ratio': aspect_ratio,
            'budget': budget,
            'gross_us_canada': gross_us_canada,
            'opening_weekend_us_canada': opening_weekend_us_canada,
            'gross_worldwide': gross_worldwide,
            'release_date': release_date,
            'countries_of_origin': countries_of_origin,
            'official_sites': official_sites,
            'director': director,
            'writers': writers,
            'stars': stars,
            'user_reviews': user_reviews,
            'critic_reviews': critic_reviews,
            'metascore': metascore,
            'languages': languages,
            'also_known_as': also_known_as,
            'filming_locations': filming_locations,
            'production_companies': production_companies
        })


        print(f'Title: {title}')
        print(f'Director: {director}')

        yield response.follow(f'https://www.imdb.com/title/{imdb_id}/plotsummary', self.parse_plot_summary, meta=meta, errback=self.handle_error)

    def parse_plot_summary(self, response):
        imdb_id = response.meta['imdb_id']

        # Scraping plot summaries
        plot_summaries = response.css('div[data-testid="sub-section-summaries"] div.ipc-html-content-inner-div::text').getall()
        # Scraping synopsis
        synopsis = response.css('ul.meta-data-list-full div.ipc-html-content-inner-div::text').get()



        # Update the meta dictionary with new data
        response.meta.update({
            'plot_summaries': plot_summaries,
            'synopsis': synopsis
        })


        yield response.follow(f'https://www.imdb.com/title/{imdb_id}/reviews?ref_=tt_urv', self.parse_user_reviews, meta=response.meta, errback=self.handle_error)


    def parse_user_reviews(self, response):
        imdb_id = response.meta['imdb_id']
        review_blocks = response.css('.review-container')

        if not review_blocks:
            print("No review blocks found.")
            print("Going to Technical Specs")
            yield response.follow(f'https://www.imdb.com/title/{imdb_id}/technical/?ref_=tt_spec_sm',
                                self.parse_technical_specs,
                                meta=response.meta,
                                errback=self.handle_error)
            # Nothing to parse on this page, so stop here instead of falling through
            return

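        # Get reviews_data from response.meta; if it doesn't exist, start from an empty list
        reviews_data_str = response.meta.get('reviews_data', '[]')
        # Convert the string back to a list; literal_eval only parses Python literals, unlike eval
        import ast  # imported locally to keep this step self-contained
        reviews_data_list = ast.literal_eval(reviews_data_str)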
        reviewer_ratings = response.meta.get('reviewer_ratings', [])

        for block in review_blocks:
            review = block.css('.text.show-more__control::text').get()
            reviewer = block.css('.display-name-link a::text').get()
            rating = block.css('.ipl-ratings-bar span.rating-other-user-rating span::text').get()

            print(f"Reviewer: {reviewer}, Rating: {rating}")

            review_data = {  # dictionary to hold individual review data
                'review': review,
                'reviewer': reviewer,
                'rating': rating
            }

            reviews_data_list.append(str(review_data))  # append string representation of review_data dictionary to reviews_data_list

            reviewer_ratings.append({
                'reviewer': reviewer,
                'rating': rating
            })

        # Convert the list back to a string to store it in response.meta
        reviews_data_str = str(reviews_data_list)

        response.meta.update({
            'reviews_data': reviews_data_str,  # Update with the string representation of the list
            'reviewer_ratings': reviewer_ratings
        })

        key = response.css('.load-more-data::attr(data-key)').get()


        print(f"Key: {key}")

        if key:
            yield scrapy.Request(
                url = f'https://www.imdb.com/title/{imdb_id}/reviews/_ajax?ref_=undefined&paginationKey='+ key,
                callback=self.parse_user_reviews,
                meta=response.meta,
            )
        else:
            print("No key found.")
            yield response.follow(f'https://www.imdb.com/title/{imdb_id}/technical/?ref_=tt_spec_sm',
                                self.parse_technical_specs,
                                meta=response.meta,
                                errback=self.handle_error)

    def parse_technical_specs(self, response):
        imdb_id = response.meta['imdb_id']

        # Scraping Technical Specs
        runtime = response.css('li#runtime span.ipc-metadata-list-item__list-content-item::text').getall()
        sound_mix = response.css('li#soundmixes a.ipc-metadata-list-item__list-content-item--link::text').getall()
        color = response.css('li#colorations a.ipc-metadata-list-item__list-content-item--link::text').getall()
        aspect_ratio = response.css('li#aspectratio span.ipc-metadata-list-item__list-content-item::text').getall()
        camera = response.css('li#cameras span.ipc-metadata-list-item__list-content-item::text').getall()
        laboratory = response.css('li#laboratory span.ipc-metadata-list-item__list-content-item::text').getall()
        film_length = response.css('li#filmLength span.ipc-metadata-list-item__list-content-item::text').getall()
        negative_format = response.css('li#negativeFormat span.ipc-metadata-list-item__list-content-item::text').getall()
        cinematographic_process = response.css('li#process span.ipc-metadata-list-item__list-content-item::text').getall()
        printed_film_format = response.css('li#printedFormat span.ipc-metadata-list-item__list-content-item::text').getall()



        # Update the meta dictionary with new data
        response.meta.update({
            'runtime': runtime,
            'sound_mix': sound_mix,
            'color': color,
            'aspect_ratio': aspect_ratio,
            'camera': camera,
            'laboratory': laboratory,
            'film_length': film_length,
            'negative_format': negative_format,
            'cinematographic_process': cinematographic_process,
            'printed_film_format': printed_film_format
        })

        yield response.follow(f'https://www.imdb.com/title/{imdb_id}/externalreviews?ref_=tt_ov_rt', self.parse_external_reviews, meta=response.meta, errback=self.handle_error)

    def parse_external_reviews(self, response):
        imdb_id = response.meta['imdb_id']

        # Locate all review site blocks
        review_site_blocks = response.css('.ipc-metadata-list__item.ipc-metadata-list-item--link')

        # Lists to store review site names and URLs
        review_site_names = []
        review_site_urls = []

        # For each block, extract the reviewer site name and URL
        for block in review_site_blocks:
            review_site_name = block.css('a.ipc-metadata-list-item__label--link::text').get()
            review_site_url = block.css('a.ipc-metadata-list-item__label--link::attr(href)').get()

            # Add reviewer site name to the list
            review_site_names.append(review_site_name)

            # Add reviewer site URL to the list
            review_site_urls.append(review_site_url)

        # Update the meta dictionary with new data
        response.meta.update({
            'review_site_names': review_site_names,
            'review_site_urls': review_site_urls
        })


        # Convert the meta dictionary to a DataFrame
        meta_df = pd.DataFrame([response.meta])

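        # Write to CSV file. The second argument is an absolute path, so
        # os.path.join() ignores self.output_directory and returns it unchanged.
        output_file = os.path.join(self.output_directory, '/content/drive/MyDrive/IMDB Project/Scraping/scraped_data/ScraPy_Code_2.2_data/more_details.csv')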
        print(f"Writing data to {output_file}")  # Print the file path
        print(meta_df)  # Print the data that is being written
        meta_df.to_csv(output_file, mode='a', header=False, index=False, encoding='utf-8')

        # Yielding the scraped data
        yield response.meta

    def handle_error(self, failure):
        # Log all failures
        self.log(failure)
        # Yield the meta data
        yield failure.request.meta


Overwriting more_details_scraper.py
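
The spider above carries the collected reviews between requests as a string in `response.meta` and rebuilds the list on every reviews page. Here is a minimal sketch of that round trip, outside Scrapy, using `ast.literal_eval`. The helper name `append_review` and the sample reviews are made up for illustration; they are not part of the spider.

```python
import ast

def append_review(reviews_data_str, review_data):
    """Rebuild the list from its string form, append one review, re-serialize."""
    reviews = ast.literal_eval(reviews_data_str)  # parses literals only
    reviews.append(str(review_data))              # each review is stored as a string
    return str(reviews)

# Assumed starting value: the string form of an empty list.
state = str([])
state = append_review(state, {'review': 'Great film', 'reviewer': 'alice', 'rating': '9'})
state = append_review(state, {'review': None, 'reviewer': 'bob', 'rating': None})

# Decoding takes two steps: first the outer list, then each stored dictionary.
reviews = [ast.literal_eval(item) for item in ast.literal_eval(state)]
print(reviews[1]['reviewer'])
```

Missing values (a review with no rating comes back as `None`) survive the round trip, because `None` is a literal that `ast.literal_eval` understands.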


## Part 2 Execution

This is where the magic happens! This code runs the spider above, and the output shows which items have been scraped and which have not.

The important thing to note is that this ScraPy spider is tuned specifically to scrape certain details of movies from IMDb, limited to the titles in the list collected in Part 1!

In [None]:
# Navigate to the spiders folder to view the spider codes
%cd /content/drive/MyDrive/IMDB\ Project/Scraping/ScraPy_Code_2/ScraPy_Code_2/spiders
!ls

/content/drive/MyDrive/IMDB Project/Scraping/ScraPy_Code_2/ScraPy_Code_2/spiders
__init__.py  more_details.csv  more_details_scraper.py	__pycache__


In [None]:
# Now to actually run the spider! The data will be saved in '/content/drive/MyDrive/IMDB Project/Scraping/scraped_data/ScraPy_Code_2.2_data/more_details.csv'

%cd /content/drive/MyDrive/IMDB\ Project/Scraping/ScraPy_Code_2/ScraPy_Code_2/spiders

!scrapy runspider more_details_scraper.py




/content/drive/MyDrive/IMDB Project/Scraping/ScraPy_Code_2/ScraPy_Code_2/spiders
Traceback (most recent call last):
  File "/usr/local/bin/scrapy", line 8, in <module>
    sys.exit(execute())
  File "/usr/local/lib/python3.10/dist-packages/scrapy/cmdline.py", line 157, in execute
    cmd.crawler_process = CrawlerProcess(settings)
  File "/usr/local/lib/python3.10/dist-packages/scrapy/crawler.py", line 325, in __init__
    super().__init__(settings)
  File "/usr/local/lib/python3.10/dist-packages/scrapy/crawler.py", line 197, in __init__
    self.spider_loader = self._get_spider_loader(settings)
  File "/usr/local/lib/python3.10/dist-packages/scrapy/crawler.py", line 191, in _get_spider_loader
    return loader_cls.from_settings(settings.frozencopy())
  File "/usr/local/lib/python3.10/dist-packages/scrapy/spiderloader.py", line 69, in from_settings
    return cls(settings)
  File "/usr/local/lib/python3.10/dist-packages/scrapy/spiderloader.py", line 24, in __init__
    self._load_all_sp

In case you're wondering what this insanely long wall of text means, and why it's continuously moving, allow me to explain:

>`Processing tt2293003 (473/57619), estimated remaining time: 3 days, 5:06:39.038536`


This line shows how far along the spider is: the title currently being processed, its position in the list (473 of 57619), and an estimate of the time remaining. Note that the total may be slightly different for you, since your output file might already contain some movie titles.
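
The code that prints the progress line is not shown in this part, so the following is only a sketch of one way such an estimate could be computed: take the average time per title so far and multiply it by the number of titles left. The function name `estimate_remaining` and the one-hour elapsed time are assumptions for illustration.

```python
from datetime import timedelta

def estimate_remaining(processed, total, elapsed):
    """Estimate the time left, assuming each title takes the average time so far."""
    if processed <= 0:
        raise ValueError("processed must be positive")
    average = elapsed / processed          # timedelta per title
    return average * (total - processed)   # timedelta for the titles left

# Hypothetical numbers: 473 of 57619 titles done after one hour.
remaining = estimate_remaining(473, 57619, timedelta(hours=1))
print(f"Processing tt2293003 (473/57619), estimated remaining time: {remaining}")
```

Printing a `timedelta` produces the same `3 days, 5:06:39.038536` style of text as the line quoted above.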


> `Key:g4w6ddbmqyzdo6ic4oxwjnbuqpu44cj734kdv4pkb3d7ev35pjt6uds7ou4vvmjcb4dtjnwhi66hz34nvubrcdjplm2jo`



This is part of the spider's output: it prints the AJAX key extracted from the 'load more' button, which the spider uses to request the next 50 user reviews from the server.
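
As a small sketch of what happens with that key, the function below builds the follow-up URL in the same shape the spider requests. The URL pattern is taken from the spider code; the function name `next_reviews_url` and the short sample key are made up for illustration.

```python
from urllib.parse import urlencode

def next_reviews_url(imdb_id, key):
    """Return the URL of the next page of reviews, or None when there is no key."""
    if not key:
        return None  # no 'load more' button, so no further pages
    query = urlencode({'ref_': 'undefined', 'paginationKey': key})
    return f'https://www.imdb.com/title/{imdb_id}/reviews/_ajax?{query}'

print(next_reviews_url('tt3393786', 'abc123'))
print(next_reviews_url('tt3393786', None))
```

Using `urlencode` rather than plain string concatenation means a key containing special characters would still produce a valid URL.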


>`2023-07-19 12:20:31 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.imdb.com/title/tt3393786/reviews?ref_=tt_urv> (referer: https://www.imdb.com/title/tt3393786/plotsummary/)`

This simply means that the spider successfully crawled the URL shown (status 200) and is moving on to the next part of the title's details. In this case it came from the plot summary page (the referer) and is now on the reviews page.
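
If you want to pull these pieces out of a saved log yourself, a regular expression is enough. This is a rough sketch that assumes the line has exactly the shape quoted above; the function name `parse_crawled_line` is made up for illustration.

```python
import re

# Matches the "Crawled (status) <METHOD url> (referer: ...)" part of the line.
CRAWLED = re.compile(r'Crawled \((\d+)\) <(\w+) (\S+)> \(referer: (.*?)\)$')

def parse_crawled_line(line):
    """Return the status, method, URL and referer of a 'Crawled' log line, or None."""
    match = CRAWLED.search(line)
    if match is None:
        return None
    status, method, url, referer = match.groups()
    return {'status': int(status), 'method': method, 'url': url, 'referer': referer}

line = ("2023-07-19 12:20:31 [scrapy.core.engine] DEBUG: Crawled (200) "
        "<GET https://www.imdb.com/title/tt3393786/reviews?ref_=tt_urv> "
        "(referer: https://www.imdb.com/title/tt3393786/plotsummary/)")
info = parse_crawled_line(line)
print(info['status'], info['url'])
```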


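Run the code below if some of the scraped data is wrong or in an incorrect format: `on_bad_lines='skip'` drops the lines pandas cannot parse, and the cleaned data is written back to the same file. As written, the dropped lines are not saved anywhere, so keep a copy of the original file if you want to clean them later.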

In [None]:
import pandas as pd

# Load the dataframe
filename = '/content/drive/MyDrive/IMDB Project/Scraping/scraped_data/ScraPy_Code_2.2_data/more_details.csv'
df = pd.read_csv(filename, on_bad_lines='skip')

# Save the dataframe
df.to_csv(filename, index=False)
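
The cell above discards the lines it cannot parse. If you would rather keep them for later cleaning, here is a sketch that uses the standard `csv` module instead of pandas. It assumes a malformed row is one whose number of fields differs from the first row, which may not match exactly what pandas skips. The function name `split_rows` and the file names `clean.csv` and `rejected.csv` are made up for illustration.

```python
import csv
import os
import tempfile

def split_rows(src_path, good_path, bad_path):
    """Copy well-formed rows to good_path and the rest to bad_path; return both counts."""
    good = bad = 0
    with open(src_path, newline='', encoding='utf-8') as src, \
         open(good_path, 'w', newline='', encoding='utf-8') as good_f, \
         open(bad_path, 'w', newline='', encoding='utf-8') as bad_f:
        good_writer, bad_writer = csv.writer(good_f), csv.writer(bad_f)
        expected = None
        for row in csv.reader(src):
            if expected is None:
                expected = len(row)  # the first row sets the expected width
            if len(row) == expected:
                good_writer.writerow(row)
                good += 1
            else:
                bad_writer.writerow(row)
                bad += 1
    return good, bad

# Demo on a throwaway file; the real input would be the CSV on Drive.
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, 'more_details.csv')
    with open(src, 'w', newline='', encoding='utf-8') as f:
        f.write('tt001,Movie A,120\ntt002,Movie B\ntt003,Movie C,95\n')
    counts = split_rows(src, os.path.join(tmp, 'clean.csv'),
                        os.path.join(tmp, 'rejected.csv'))
print(counts)
```

Writing to two new files leaves the original CSV untouched, so nothing is lost if the width check turns out to be too strict.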