<a href="https://colab.research.google.com/github/daryllman/multithreaded-webscraper/blob/master/multithreaded_webscraping.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introduction
In this webscraping tutorial, we will explore **multithreading** in Python to speed up our webscraping tasks (up to 10x in this tutorial).  
We will be scraping movie data from the IMDB website - [Most Popular Movies](https://www.imdb.com/chart/moviemeter/?ref_=nv_mv_mpm) - a total of 100 movies, retrieving some details for each one (name, rating, date, plot text). Afterwards, we will save these into a .csv file.

Have fun :)

# Getting Started
Please make sure you run the 2 steps below:
- Mount your Google Drive (only needed for saving files to your Drive on Colab)
- Load the necessary libraries

## Google Drive Mounting
Let's mount our Google Drive, as the tutorial below involves saving a .csv file to it.
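
The cells in this notebook write into a `Colab Notebooks/assets` folder, and `open()` raises `FileNotFoundError` if that folder does not exist yet. Here is a minimal sketch of creating it first. The `base_dir` below is a temporary directory standing in for `/content/drive/My Drive` (an assumption, so the snippet runs outside Colab too); on Colab you would point it at your mounted Drive instead.

```python
import os
import tempfile

# Stand-in for '/content/drive/My Drive' so this runs anywhere (assumption)
base_dir = tempfile.mkdtemp()

# Build the folder path and create it, including missing parent folders
assets_dir = os.path.join(base_dir, "Colab Notebooks", "assets")
os.makedirs(assets_dir, exist_ok=True)  # no error if the folder already exists

# Write a small test file to confirm the folder is usable
test_file = os.path.join(assets_dir, "foo.txt")
with open(test_file, "w") as f:
    f.write("ok")

print(os.path.exists(test_file))
```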

In [None]:
# Mount to your google drive
from google.colab import drive
drive.mount('/content/drive')

In [None]:
# Check that your gdrive is properly mounted and ready to go
with open('/content/drive/My Drive/Colab Notebooks/assets/foo.txt', 'w') as f:
  f.write('If you see this printed in console, everything is good:)')
!cat /content/drive/My\ Drive/Colab\ Notebooks/assets/foo.txt

## Load necessary libraries
Using mainly: 
- Python Requests library (to retrieve html document)
- BeautifulSoup (great html parser)

In [None]:
import requests
from bs4 import BeautifulSoup as bs # pip install beautifulsoup4
import time
import csv
import random
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'} # global headers to be used for requests

# Webscraping Exploration
Let's explore the IMDB website and parse the HTML with BeautifulSoup to get the details we want.
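
To try the link-extraction logic without hitting the live site, here is a minimal sketch that runs the same kind of selectors against an inline HTML string. The markup in `sample_html` is made up and only mimics the structure this section assumes. The sketch also names the parser explicitly (`html.parser`, Python's built-in one) and guards against `find()` returning `None` when nothing matches.

```python
from bs4 import BeautifulSoup as bs

# Hypothetical markup, only mimicking the table structure assumed in this section
sample_html = """
<table data-caller-name="chart-moviemeter">
  <tbody>
    <tr><td><a href="/title/tt0000001/">Movie One</a></td></tr>
    <tr><td><a href="/title/tt0000002/">Movie Two</a></td></tr>
  </tbody>
</table>
"""

soup = bs(sample_html, "html.parser")

# find() returns None when nothing matches, so guard before chaining calls
table = soup.find("table", attrs={"data-caller-name": "chart-moviemeter"})
rows = table.find("tbody").find_all("tr") if table else []

# Keep only rows that contain a link, then build absolute URLs
movie_links = ["https://imdb.com" + row.find("a")["href"] for row in rows if row.find("a")]
print(movie_links)
```

If the page layout changes, `rows` simply ends up empty instead of raising an `AttributeError`, which makes a broken selector easier to spot.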

In [None]:
# Load the webpage
target_url = 'https://www.imdb.com/chart/moviemeter/?ref_=nv_mv_mpm' # IMDB Most Popular Movies

headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
r = requests.get(target_url, headers=headers)

# Convert to a beautiful soup object
soup = bs(r.content)

In [None]:
movies_table = soup.find("table", attrs={"data-caller-name": "chart-moviemeter"}).find("tbody")
movies_tablerows = movies_table.find_all("tr")
movie_links = ["https://imdb.com" + movie.find("a")["href"] for movie in movies_tablerows]
movie_links

In [None]:
movie = movie_links[0] # picking the first movie, just to see
movie_soup = bs(requests.get(movie, headers=headers).content)

In [None]:
# Extract some details from a single movie
title = movie_soup.find("div", attrs={"class": "title_wrapper"}).find("h1").get_text()
date = movie_soup.find("div", attrs={"class": "title_wrapper"}).find("a", attrs={"title": "See more release dates"}).get_text().strip()
rating = movie_soup.find("span", attrs={"itemprop": "ratingValue"}).get_text()
plot_text = movie_soup.find("div", attrs={"class": "summary_text"}).get_text().strip()
print(title, date, rating, plot_text) # If this print is okay, then we are good with this!

# Tutorials
Below is the complete code for **4 different ways of webscraping** data from IMDB's website:
1. Without Multithreading - Normal
2. Multithreading
3. Multithreading + Lock
4. Multithreading + Shared Queue

## 1. Without Multithreading - Normal
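
To see why the sequential version is slow without sending any requests, here is a minimal sketch where a hypothetical `fake_fetch` function stands in for the network call and simply sleeps. `DELAY` and the placeholder links are made-up values. With one page processed at a time, the total time is roughly the sum of all the individual delays.

```python
import time

DELAY = 0.05  # simulated network latency per page, in seconds (made-up value)

def fake_fetch(link):
    # Stand-in for requests.get(...): sleeps instead of doing real I/O
    time.sleep(DELAY)
    return "details for " + link

links = ["movie-%d" % i for i in range(10)]  # placeholder links

start = time.time()
results = [fake_fetch(link) for link in links]  # one page at a time
elapsed = time.time() - start

print("Sequential: %d pages in %.2fs" % (len(results), elapsed))
```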




In [None]:
def extract_movie_details(movie_link):
  time.sleep(random.uniform(0, 0.2))
  movie_soup = bs(requests.get(movie_link, headers=headers).content)
  title = movie_soup.find("div", attrs={"class": "title_wrapper"}).find("h1").get_text()
  date = movie_soup.find("div", attrs={"class": "title_wrapper"}).find("a", attrs={"title": "See more release dates"}).get_text().strip() if movie_soup.find("div", attrs={"class": "title_wrapper"}).find("a", attrs={"title": "See more release dates"}) else None
  rating = movie_soup.find("span", attrs={"itemprop": "ratingValue"}).get_text() if movie_soup.find("span", attrs={"itemprop": "ratingValue"}) else None
  plot_text = movie_soup.find("div", attrs={"class": "summary_text"}).get_text().strip() if movie_soup.find("div", attrs={"class": "summary_text"}) else None
  print(title, date, rating, plot_text) # If this print is okay, then we are good with this!
 
  with open('/content/drive/My Drive/Colab Notebooks/assets/movies.csv', mode='a') as f:
    movie_writer = csv.writer(f, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)
    movie_writer.writerow([title, date, rating, plot_text])



def extract_movies(soup):
  movies_table = soup.find("table", attrs={"data-caller-name": "chart-moviemeter"}).find("tbody")
  movies_tablerows = movies_table.find_all("tr")
  movie_links = ["https://imdb.com" + movie.find("a")["href"] for movie in movies_tablerows]
  
  for movie_link in movie_links:
    extract_movie_details(movie_link)




def main():
  start_time = time.time()
  popular_movies_url = 'https://www.imdb.com/chart/moviemeter/?ref_=nv_mv_mpm' # IMDB Most Popular Movies - 100 movies
  r = requests.get(popular_movies_url, headers=headers)
  soup = bs(r.content)
  
  # Main function to extract the 100 movies from IMDB Most Popular Movies
  extract_movies(soup)
  
  end_time = time.time()
  print("Total time taken: ", end_time-start_time)


main()


## 2. Multithreading
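
Scraping is mostly waiting on the network, so several threads can wait at the same time. Here is a minimal sketch of that idea using `ThreadPoolExecutor`, with a hypothetical `fake_fetch` that sleeps in place of a real request (`DELAY` and the placeholder links are made up). With 10 workers and 20 simulated pages, the run should take roughly two delays rather than twenty.

```python
import concurrent.futures
import time

DELAY = 0.05      # simulated network latency per page (made-up value)
MAX_THREADS = 10

def fake_fetch(link):
    # Stand-in for requests.get(...): sleeps instead of doing real I/O
    time.sleep(DELAY)
    return "details for " + link

links = ["movie-%d" % i for i in range(20)]  # placeholder links

start = time.time()
with concurrent.futures.ThreadPoolExecutor(max_workers=MAX_THREADS) as executor:
    # map() keeps results in input order; list() forces iteration,
    # so an exception raised inside a worker surfaces here
    results = list(executor.map(fake_fetch, links))
elapsed = time.time() - start

print("Threaded: %d pages in %.2fs" % (len(results), elapsed))
```

Note that `executor.map` only re-raises a worker's exception when its results are iterated. The cell below does not iterate them, so a failure inside `extract_movie_details` would pass silently there.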

In [None]:
import concurrent.futures

MAX_THREADS = 10

def extract_movie_details(movie_link):
  time.sleep(random.uniform(0, 0.2))
  movie_soup = bs(requests.get(movie_link, headers=headers).content)

  title = movie_soup.find("div", attrs={"class": "title_wrapper"}).find("h1").get_text()
  date = movie_soup.find("div", attrs={"class": "title_wrapper"}).find("a", attrs={"title": "See more release dates"}).get_text().strip() if movie_soup.find("div", attrs={"class": "title_wrapper"}).find("a", attrs={"title": "See more release dates"}) else None
  rating = movie_soup.find("span", attrs={"itemprop": "ratingValue"}).get_text() if movie_soup.find("span", attrs={"itemprop": "ratingValue"}) else None
  plot_text = movie_soup.find("div", attrs={"class": "summary_text"}).get_text().strip() if movie_soup.find("div", attrs={"class": "summary_text"}) else None
  print(title, date, rating, plot_text) # If this print is okay, then we are good with this!
 
  with open('/content/drive/My Drive/Colab Notebooks/assets/movies.csv', mode='a') as f:
    movie_writer = csv.writer(f, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)
    movie_writer.writerow([title, date, rating, plot_text])



def extract_movies(soup):
  movies_table = soup.find("table", attrs={"data-caller-name": "chart-moviemeter"}).find("tbody")
  movies_tablerows = movies_table.find_all("tr")
  movie_links = ["https://imdb.com" + movie.find("a")["href"] for movie in movies_tablerows]
  
  threads = max(1, min(MAX_THREADS, len(movie_links))) # never more threads than links, and at least 1
  with concurrent.futures.ThreadPoolExecutor(max_workers=threads) as executor:
    executor.map(extract_movie_details, movie_links)




def main():
  start_time = time.time()
  popular_movies_url = 'https://www.imdb.com/chart/moviemeter/?ref_=nv_mv_mpm' # IMDB Most Popular Movies - 100 movies
  headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
  r = requests.get(popular_movies_url, headers=headers)
  soup = bs(r.content)
  
  # Main function to extract the 100 movies from IMDB Most Popular Movies
  extract_movies(soup)
  
  end_time = time.time()
  print("Total time taken: ", end_time-start_time)


main()


## 3. Multithreading + Lock
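
When several threads append to the same file, their writes may interleave. Here is a minimal sketch of the locking pattern, assuming an in-memory `io.StringIO` buffer as a stand-in for `movies.csv`. The `save_row` function and the row contents are made up for illustration.

```python
import concurrent.futures
import csv
import io
import threading

buffer = io.StringIO()         # in-memory stand-in for movies.csv (assumption)
writer = csv.writer(buffer)
write_lock = threading.Lock()  # only one thread can hold this lock at a time

def save_row(i):
    row = ["Movie %d" % i, "2020", "7.5", "plot, with a comma"]
    with write_lock:  # the lock is released automatically when the block exits
        writer.writerow(row)

with concurrent.futures.ThreadPoolExecutor(max_workers=10) as executor:
    list(executor.map(save_row, range(100)))

# Read the buffer back to check that every row arrived intact
rows = list(csv.reader(io.StringIO(buffer.getvalue())))
print(len(rows), "rows written")
```

Only the write itself sits inside the `with write_lock:` block, so slow work such as fetching and parsing can still run in parallel.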

In [None]:
import concurrent.futures
import threading
csv_writer_lock = threading.Lock() # only one thread can hold this lock at one time

MAX_THREADS = 10

def extract_movie_details(movie_link):
  time.sleep(random.uniform(0, 0.2))
  movie_soup = bs(requests.get(movie_link, headers=headers).content)

  title = movie_soup.find("div", attrs={"class": "title_wrapper"}).find("h1").get_text()
  date = movie_soup.find("div", attrs={"class": "title_wrapper"}).find("a", attrs={"title": "See more release dates"}).get_text().strip() if movie_soup.find("div", attrs={"class": "title_wrapper"}).find("a", attrs={"title": "See more release dates"}) else None
  rating = movie_soup.find("span", attrs={"itemprop": "ratingValue"}).get_text() if movie_soup.find("span", attrs={"itemprop": "ratingValue"}) else None
  plot_text = movie_soup.find("div", attrs={"class": "summary_text"}).get_text().strip() if movie_soup.find("div", attrs={"class": "summary_text"}) else None
  print(title, date, rating, plot_text) # If this print is okay, then we are good with this!

  # Only one thread allowed to access the csv file at one time
  with csv_writer_lock:
    with open('/content/drive/My Drive/Colab Notebooks/assets/movies.csv', mode='a') as f:
      movie_writer = csv.writer(f, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)
      movie_writer.writerow([title, date, rating, plot_text])



def extract_movies(soup):
  movies_table = soup.find("table", attrs={"data-caller-name": "chart-moviemeter"}).find("tbody")
  movies_tablerows = movies_table.find_all("tr")
  movie_links = ["https://imdb.com" + movie.find("a")["href"] for movie in movies_tablerows]
  
  threads = max(1, min(MAX_THREADS, len(movie_links))) # never more threads than links, and at least 1
  with concurrent.futures.ThreadPoolExecutor(max_workers=threads) as executor:
    executor.map(extract_movie_details, movie_links)




def main():
  start_time = time.time()
  popular_movies_url = 'https://www.imdb.com/chart/moviemeter/?ref_=nv_mv_mpm' # IMDB Most Popular Movies - 100 movies
  headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
  r = requests.get(popular_movies_url, headers=headers)
  soup = bs(r.content)
  
  # Main function to extract the 100 movies from IMDB Most Popular Movies
  extract_movies(soup)
  
  end_time = time.time()
  print("Total time taken: ", end_time-start_time)


main()


## 4. Multithreading + Shared Queue
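
An alternative to locking is to let a single dedicated thread do all the writing, while the worker threads only put rows onto a shared queue. Here is a minimal sketch of that producer/consumer pattern, with a plain list standing in for the csv file. The names `row_queue`, `written`, `producer` and `consumer` are made up for this example, and `None` acts as the "no more work" sentinel.

```python
import concurrent.futures
from queue import Queue
from threading import Thread

row_queue = Queue()
written = []  # stand-in for the csv file (assumption)

def consumer():
    while True:
        item = row_queue.get()  # blocks until a producer puts something
        if item is None:        # sentinel: no more rows will arrive
            row_queue.task_done()
            break
        written.append(item)    # only this thread ever touches `written`
        row_queue.task_done()

def producer(i):
    row_queue.put({"title": "Movie %d" % i})

writer_thread = Thread(target=consumer, daemon=True)
writer_thread.start()

with concurrent.futures.ThreadPoolExecutor(max_workers=10) as executor:
    list(executor.map(producer, range(50)))

row_queue.put(None)   # all producers are done, so signal the end
writer_thread.join()  # wait until the consumer has drained the queue

print(len(written), "rows written")
```

Because `Queue.get()` blocks while the queue is empty, the consumer does not need to poll `empty()` in a loop.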

In [None]:
import concurrent.futures
from threading import Thread
from queue import Queue
MAX_THREADS = 10

extraction_queue = Queue()

def extraction():
  # Consumer loop: runs in a dedicated thread and is the only writer to the csv file
  while True:
    i = extraction_queue.get() # blocks until an item is available (no busy-waiting)

    if i is None: # sentinel: end of job
      extraction_queue.task_done()
      break

    # Row comes out of the queue; csv writing goes here
    with open('/content/drive/My Drive/Colab Notebooks/assets/movies.csv', mode='a') as f:
      movie_writer = csv.writer(f, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)
      movie_writer.writerow([i['title'], i['date'], i['rating'], i['plot_text']])
    print('Wrote to file: ', i['title'])
    extraction_queue.task_done()

def write_to_file(movie_details):
  extraction_queue.put(movie_details)

def extract_movie_details(movie_link):
  time.sleep(random.uniform(0, 0.2))
  movie_soup = bs(requests.get(movie_link, headers=headers).content)

  title = movie_soup.find("div", attrs={"class": "title_wrapper"}).find("h1").get_text()
  date = movie_soup.find("div", attrs={"class": "title_wrapper"}).find("a", attrs={"title": "See more release dates"}).get_text().strip() if movie_soup.find("div", attrs={"class": "title_wrapper"}).find("a", attrs={"title": "See more release dates"}) else None
  rating = movie_soup.find("span", attrs={"itemprop": "ratingValue"}).get_text() if movie_soup.find("span", attrs={"itemprop": "ratingValue"}) else None
  plot_text = movie_soup.find("div", attrs={"class": "summary_text"}).get_text().strip() if movie_soup.find("div", attrs={"class": "summary_text"}) else None
  print(title, date, rating, plot_text) # If this print is okay, then we are good with this!
  
  write_to_file({'title': title, 'date': date, 'rating': rating, 'plot_text': plot_text})



def extract_movies(soup):
  movies_table = soup.find("table", attrs={"data-caller-name": "chart-moviemeter"}).find("tbody")
  movies_tablerows = movies_table.find_all("tr")
  movie_links = ["https://imdb.com" + movie.find("a")["href"] for movie in movies_tablerows]
  
  threads = max(1, min(MAX_THREADS, len(movie_links))) # never more threads than links, and at least 1
  with concurrent.futures.ThreadPoolExecutor(max_workers=threads) as executor:
    executor.map(extract_movie_details, movie_links)





def main():
  start_time = time.time()

  # Start consumer thread - dedicated thread to write movie details to file
  consumer = Thread(target=extraction)
  consumer.daemon = True # setDaemon() is deprecated; set the attribute instead
  consumer.start()

  popular_movies_url = 'https://www.imdb.com/chart/moviemeter/?ref_=nv_mv_mpm' # IMDB Most Popular Movies - 100 movies
  headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
  r = requests.get(popular_movies_url, headers=headers)
  soup = bs(r.content)
  
  # Main function to extract the 100 movies from IMDB Most Popular Movies
  extract_movies(soup)

  # Signal to the consumer that there are no more rows to write
  extraction_queue.put(None)

  # Wait for the consumer thread to finish writing the remaining rows
  consumer.join()
  
  end_time = time.time()
  print("Total time taken: ", end_time-start_time)

main()


# Credits
- [Troy Fawkes](https://www.troyfawkes.com/learn-python-multithreading-queues-basics/)
- [StackOverflow d33tah](https://stackoverflow.com/questions/19387864/signal-the-end-of-jobs-on-the-queue)
- [StackOverflow GreenGodot](https://stackoverflow.com/questions/33107019/multiple-threads-writing-to-the-same-csv-in-python)