# Letterboxd Film Page Web Scraping

For my SQL project analyzing my film viewership and rating trends in the Letterboxd app, I needed descriptive data about each film I had rated.

Below, I use Beautiful Soup to scrape the Letterboxd page of each film on my ratings list for cast and crew information, studios, runtime, genres, the site-wide average rating, and the number of ratings. I compile this data into separate CSV files, one per attribute, for [**analysis in SQL**](https://github.com/andrewdkim7/portfolio/blob/165ad768ab75cb870a159fec8d5a97b22f2ef01e/SQL/LetterboxdSQLAnalysis.ipynb).

In [1]:
import pandas as pd

# import my personal letterboxd ratings csv
myratings = pd.read_csv('myratings.csv')

In [2]:
import requests
from bs4 import BeautifulSoup
import re

# create lists to collect the scraped data
directors = []
cast = []
studios = []
length = []
genres = []
rating = []
ratingcount = []

# iterate through each movie I've rated on Letterboxd
for url in myratings['Letterboxd URI']:

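  # request letterboxd movie page and create soup object
  # (the timeout keeps a stalled request from hanging the loop;
  # raise_for_status stops on an HTTP error instead of parsing an error page)
  r = requests.get(url, timeout=30)
  r.raise_for_status()
  soup = BeautifulSoup(r.content, 'html.parser')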

  # unique id from url (strip any trailing slash so the id is never empty)
  url_id = url.rstrip('/').split('/')[-1]

  # title
  title = soup.select('.js-widont')[0].text

  # directors
  directors.extend([(url_id, title, director.text) for director in soup.select('.directorlist span')])

  # cast
  ncast = [(url_id, title, actor.text) for actor in soup.select('#tab-cast a')]
  if (url_id, title,'Show All…') in ncast:
    ncast.remove((url_id, title,'Show All…'))
  cast.extend(ncast)

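  # studios (select_one returns None when nothing matches, so skip the film in that case)
  nstudiostag = soup.select_one('#tab-details .text-sluglist p')
  if nstudiostag is not None:
    studios.extend([(url_id, title, studio.text) for studio in nstudiostag.select('a')])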

  # length in minutes
  nlength = int(soup.select('.col-10 .text-footer')[0].text.split()[0])
  length.append((url_id, title, nlength))

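  # genres (select_one returns None when nothing matches, so skip the film in that case)
  ngenrestag = soup.select_one('#tab-genres .text-sluglist p')
  if ngenrestag is not None:
    genres.extend([(url_id, title, genre.text) for genre in ngenrestag.select('a')])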

  # rating out of 5 stars
  nrating = float(soup.find('meta', {'name': 'twitter:data2'})['content'][:4])
  rating.append((url_id, title, nrating))

  # number of ratings
  ratingcountre = re.search('"ratingCount":[0-9]+', str(soup.find('script', {'type': 'application/ld+json'})))
  ratingcount.append((url_id, title, int(ratingcountre[0].split(':')[-1])))
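
The rating count above is pulled out of the JSON-LD script with a regular expression. As an alternative, here is a minimal sketch that parses the script body as JSON instead. It assumes the count sits under `aggregateRating.ratingCount` (the schema.org layout) and that the JSON may be wrapped in comment markers. The helper `extract_rating_count` and the sample string are hypothetical and were not taken from a real page.

```python
import json

def extract_rating_count(script_text):
    """Return aggregateRating.ratingCount from a JSON-LD script body, or None."""
    # keep only the outermost {...} so any wrapper around the JSON is ignored
    start = script_text.find('{')
    end = script_text.rfind('}')
    if start == -1 or end == -1:
        return None
    data = json.loads(script_text[start:end + 1])
    # assumption: the count is nested under "aggregateRating"
    count = (data.get('aggregateRating') or {}).get('ratingCount')
    return int(count) if count is not None else None

# hypothetical script body, made up for illustration
sample = ('/* <![CDATA[ */ {"@type": "Movie", "name": "Example Film", '
          '"aggregateRating": {"ratingValue": 3.9, "ratingCount": 12345}} /* ]]> */')
print(extract_rating_count(sample))
```

Inside the loop, this could be called on the text of the `application/ld+json` script tag (for example its `.string`) in place of the `re.search` call. A film with no aggregate rating would then yield `None` rather than raising an error.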

In [3]:
# create dataframes from lists
directorsdf = pd.DataFrame(directors, columns = ['url_id', 'title', 'director'])
castdf = pd.DataFrame(cast, columns = ['url_id', 'title', 'actor'])
studiosdf = pd.DataFrame(studios, columns = ['url_id', 'title', 'studio'])
lengthdf = pd.DataFrame(length, columns = ['url_id', 'title', 'length'])
genresdf = pd.DataFrame(genres, columns = ['url_id', 'title', 'genre'])
ratingdf = pd.DataFrame(rating, columns = ['url_id', 'title', 'rating'])
ratingcountdf = pd.DataFrame(ratingcount, columns = ['url_id', 'title', 'rating_count'])

In [4]:
# export as csv files for analysis in sql
directorsdf.to_csv('directors.csv')
castdf.to_csv('cast.csv')
studiosdf.to_csv('studios.csv')
lengthdf.to_csv('length.csv')
genresdf.to_csv('genres.csv')
ratingdf.to_csv('rating.csv')
ratingcountdf.to_csv('ratingcount.csv')
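
To query the exported files with SQL, one option is to load each CSV into a database table. Below is a minimal sketch using Python's built-in `sqlite3` and `csv` modules. SQLite is an assumption here (the linked analysis may use a different engine), and `load_csv`, the `row_index` column name, and the sample rows are hypothetical. Because `to_csv` is called with default settings, each file starts with an unnamed index column, which the sketch renames.

```python
import csv
import io
import sqlite3

def load_csv(conn, table, csv_file):
    """Create `table` from an open CSV file and insert every row as text."""
    reader = csv.reader(csv_file)
    header = next(reader)
    # the DataFrame index is written under an empty header; give it a name
    columns = [name if name else 'row_index' for name in header]
    col_sql = ', '.join(f'"{c}"' for c in columns)
    placeholders = ', '.join('?' for _ in columns)
    conn.execute(f'CREATE TABLE "{table}" ({col_sql})')
    conn.executemany(f'INSERT INTO "{table}" VALUES ({placeholders})', reader)
    conn.commit()

# made-up rows in the same shape as genres.csv
sample = io.StringIO(
    ',url_id,title,genre\n'
    '0,abc1,Example Film,Drama\n'
    '1,abc1,Example Film,Comedy\n'
)
conn = sqlite3.connect(':memory:')
load_csv(conn, 'genres', sample)
rows = conn.execute(
    'SELECT genre, COUNT(*) FROM genres GROUP BY genre ORDER BY genre'
).fetchall()
print(rows)
```

With the real files, `sample` would be replaced by something like `open('genres.csv', newline='', encoding='utf-8')`, and `':memory:'` by a database file path.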