# Letterboxd Profile Analysis

As any avid movie-watcher knows, keeping track of the movies you watch gives you a personalized cinematic journal. The practice offers a sense of satisfaction and lets you reflect on how your interests and tastes evolve over time. Several websites allow you to do this, such as Letterboxd, IMDb, Criticker, and Reelgood. These websites are perfectly fine for keeping track of your movies and offer a large community of like-minded movie-watchers, but they don't provide much insight beyond that.

In this project, I'd like to focus on some of the quantifiable aspects of the movies I've watched, as well as some of my movie-watching habits. Letterboxd allows its users to download a folder of CSV files with rudimentary information, such as which movies were watched and when. I will use these CSVs, along with the TMDb API, to retrieve some more relevant, useful data.

Some specific questions I'd like answered are:
- Who are my favorite actors/directors?
- How much time have I spent watching movies?
- How often do I watch movies?
- What type of movies do I tend to gravitate towards?


## 1- First, let's read in and prepare our datasets.

In [1]:
import pandas as pd

df1 = pd.read_csv("/Users/erickpuntiel/Desktop/watched.csv")
df2 = pd.read_csv("/Users/erickpuntiel/Desktop/ratings.csv")

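## getting rid of redundant columns (Year is kept in both frames as a merge key)
df1 = df1.drop(columns=['Letterboxd URI'])
df2 = df2.drop(columns=['Letterboxd URI', 'Date'])

## merging dataframes on title and year, so that different movies
## sharing a title do not pick up each other's ratings
movies = df1.merge(df2, on=['Name', 'Year'], how='left')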
movies.head(15)

Unnamed: 0,Date,Name,Year,Rating
0,2021-01-16,La Haine,1995,5.0
1,2022-01-07,Oldboy,2003,3.0
2,2022-01-07,No Country for Old Men,2007,5.0
3,2022-01-07,Taxi Driver,1976,
4,2022-01-10,Lady Bird,2017,4.0
5,2022-01-13,Birdman or (The Unexpected Virtue of Ignorance),2014,3.0
6,2022-01-14,City of God,2002,4.0
7,2022-01-15,Donnie Darko,2001,2.0
8,2022-01-16,Lost in Translation,2003,5.0
9,2022-01-17,Whiplash,2014,4.0
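
Why the merge key matters can be seen on a toy example. The sketch below uses two made-up frames (the years and ratings are illustrative, not taken from my data) in which two different movies share a title. Merging on `Name` alone pairs every watched row with every rating of that title, while merging on `Name` and `Year` keeps one row per movie.

```python
import pandas as pd

# Hypothetical data: two different movies share the title "Oldboy".
watched = pd.DataFrame({
    "Name": ["Oldboy", "Oldboy", "Whiplash"],
    "Year": [2003, 2013, 2014],
})
ratings = pd.DataFrame({
    "Name": ["Oldboy", "Oldboy", "Whiplash"],
    "Year": [2003, 2013, 2014],
    "Rating": [3.0, 1.0, 4.0],
})

# Title-only merge: each "Oldboy" row matches both "Oldboy" ratings.
by_name = watched.merge(ratings.drop(columns="Year"), on="Name", how="left")

# Title-and-year merge: each movie matches only its own rating.
by_name_year = watched.merge(ratings, on=["Name", "Year"], how="left")

print(len(by_name), len(by_name_year))
```

Comparing the length of the merged frame with the length of the watched list is a quick way to spot this kind of duplication.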


This is pretty much all the relevant data that the Letterboxd CSVs provide. I'm going to use the TMDb API to retrieve more data, such as runtime, genre, original language, and country of origin.

## 2- Connect to the API, and search for IDs

In [2]:
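import os
import tmdbsimple as tmdb

## read the key from an environment variable rather than hardcoding it;
## the variable name TMDB_API_KEY is an assumption, any name works
tmdb.API_KEY = os.environ['TMDB_API_KEY']

search = tmdb.Search()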


tmdbID = []

## search by title and year to avoid any redundancies (movies with same title)
for t, y in zip(movies["Name"], movies["Year"]):
    movie_info = search.movie(query = t, year = y)
    if movie_info['results']:
        tmdbID.append(movie_info['results'][0]['id'])
    else:
        tmdbID.append(-1)

print(tmdbID[:15])

[406, 670, 6977, 103, 391713, 194662, 598, 141, 153, 244786, 27205, 38, 1359, 9100, 9424]
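
The lookup logic in the loop above can be isolated into a small function and checked without touching the network. This is only a minimal sketch: `first_result_id` is a hypothetical helper name, and the sample dictionaries are hand-written stand-ins that mimic the `results` list used in the loop.

```python
def first_result_id(response, missing=-1):
    """Return the ID of the first search hit, or `missing` if there are none."""
    results = response.get("results") or []
    return results[0]["id"] if results else missing


# Hand-written stand-ins for search responses (not real API output).
hit = {"results": [{"id": 406, "title": "La Haine"}]}
miss = {"results": []}

print(first_result_id(hit), first_result_id(miss))
```

Taking the first hit assumes the best match comes first, which is the same assumption the loop makes.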


## 3- Use each movie's ID to get more information

In [3]:
## runtime, language, country, genre
language, runtime, genre, country = [], [], [], []

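for movie_id in tmdbID:
    if movie_id == -1:
        ## no match was found for this movie, so every field is missing
        language.append(None)
        runtime.append(None)
        genre.append(None)
        country.append(None)
        continue

    movie_info = tmdb.Movies(movie_id).info()
    language.append(movie_info.get('original_language'))
    runtime.append(movie_info.get('runtime'))

    ## genres and production_countries are lists that can be empty,
    ## so only index into them when they contain something
    genres = movie_info.get('genres') or []
    genre.append(genres[0]['name'] if genres else None)
    countries = movie_info.get('production_countries') or []
    country.append(countries[0]['name'] if countries else None)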


movies['TMDB ID'] = tmdbID       
movies['Language'] = language
movies['Runtime'] = runtime
movies['Genre'] = genre
movies['Country'] = country

In [4]:
movies.head(15)

Unnamed: 0,Date,Name,Year,Rating,TMDB ID,Language,Runtime,Genre,Country
0,2021-01-16,La Haine,1995,5.0,406,fr,98.0,Drama,France
1,2022-01-07,Oldboy,2003,3.0,670,ko,120.0,Drama,South Korea
2,2022-01-07,No Country for Old Men,2007,5.0,6977,en,122.0,Crime,United States of America
3,2022-01-07,Taxi Driver,1976,,103,en,114.0,Crime,United States of America
4,2022-01-10,Lady Bird,2017,4.0,391713,en,94.0,Drama,United States of America
5,2022-01-13,Birdman or (The Unexpected Virtue of Ignorance),2014,3.0,194662,en,120.0,Drama,United States of America
6,2022-01-14,City of God,2002,4.0,598,pt,130.0,Drama,Brazil
7,2022-01-15,Donnie Darko,2001,2.0,141,en,114.0,Fantasy,United States of America
8,2022-01-16,Lost in Translation,2003,5.0,153,en,102.0,Drama,Japan
9,2022-01-17,Whiplash,2014,4.0,244786,en,107.0,Drama,United States of America


We have more information to work with, but I'd like to know less about the movies themselves and more about the people who worked on them. Unlike movie information, crew information is stored in .credits(), not in .info().


In [5]:
directors = []

for movie_id in tmdbID:
    if movie_id == -1:
        directors.append(["No director found"])
        continue

    credits = tmdb.Movies(movie_id).credits()
    ## always append exactly one entry per movie (an empty list if no
    ## director is credited) so that directors stays aligned with movies
    crew = (credits or {}).get('crew') or []
    directors.append([member['name'] for member in crew
                      if member.get('job') == 'Director'])
        
for director in directors[:15]:
    print(f"Director(s): {', '.join(director)}")

Director(s): Mathieu Kassovitz
Director(s): Park Chan-wook
Director(s): Joel Coen, Ethan Coen
Director(s): Martin Scorsese
Director(s): Greta Gerwig
Director(s): Alejandro González Iñárritu
Director(s): Fernando Meirelles
Director(s): Richard Kelly
Director(s): Sofia Coppola
Director(s): Damien Chazelle
Director(s): Christopher Nolan
Director(s): Michel Gondry
Director(s): Mary Harron
Director(s): Andrew Fleming
Director(s): David Nutter


Since actor information is also stored in .credits(), we can use a similar method to find the actors. Because there are typically more actors than directors working on a movie, I'm going to keep the top 5 billed actors rather than just the lead. While this might not do justice to movies with large ensemble casts, movies rarely have more than 5 protagonists/antagonists.

In [6]:
actors = []

for movie_id in tmdbID:
    if movie_id == -1:
        actors.append(["No actors found"])
        continue

    credits = tmdb.Movies(movie_id).credits()
    ## always append exactly one entry per movie so that actors stays
    ## aligned with movies; the slice keeps the first 5 cast entries
    cast = (credits or {}).get('cast') or []
    actors.append([member['name'] for member in cast[:5]])

for actor_names in actors[:15]:
    print(f"Top Actors: {', '.join(actor_names)}")

Top Actors: Vincent Cassel, Hubert Koundé, Saïd Taghmaoui, Abdel Ahmed Ghili, Souleymane Dicko
Top Actors: Choi Min-sik, Yoo Ji-tae, Kang Hye-jung, Kim Byeong-ok, Ji Dae-han
Top Actors: Tommy Lee Jones, Javier Bardem, Josh Brolin, Woody Harrelson, Kelly Macdonald
Top Actors: Robert De Niro, Jodie Foster, Cybill Shepherd, Harvey Keitel, Peter Boyle
Top Actors: Saoirse Ronan, Laurie Metcalf, Tracy Letts, Beanie Feldstein, Lucas Hedges
Top Actors: Michael Keaton, Emma Stone, Zach Galifianakis, Edward Norton, Andrea Riseborough
Top Actors: Alexandre Rodrigues, Leandro Firmino, Phellipe Haagensen, Douglas Silva, Jonathan Haagensen
Top Actors: Jake Gyllenhaal, Jena Malone, James Duval, Drew Barrymore, Beth Grant
Top Actors: Bill Murray, Scarlett Johansson, Giovanni Ribisi, Anna Faris, Akiko Takeshita
Top Actors: Miles Teller, J.K. Simmons, Paul Reiser, Melissa Benoist, Austin Stowell
Top Actors: Leonardo DiCaprio, Joseph Gordon-Levitt, Ken Watanabe, Tom Hardy, Elliot Page
Top Actors: Jim Car
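
Both loops above call .credits() once per movie, so every movie's credits are fetched twice. A minimal sketch of pulling the directors and the top-billed actors out of a single credits dictionary follows. `split_credits` is a hypothetical helper, and the sample dictionary is a hand-written, abbreviated stand-in containing only the `crew`, `cast`, `job`, and `name` keys used above.

```python
def split_credits(credits, top_n=5):
    """Return (directors, top_actors) from one credits dictionary."""
    crew = credits.get("crew") or []
    cast = credits.get("cast") or []
    directors = [m["name"] for m in crew if m.get("job") == "Director"]
    top_actors = [m["name"] for m in cast[:top_n]]
    return directors, top_actors


# Hand-written stand-in for a credits response (not real API output).
sample = {
    "crew": [
        {"name": "Joel Coen", "job": "Director"},
        {"name": "Ethan Coen", "job": "Director"},
        {"name": "Roger Deakins", "job": "Director of Photography"},
    ],
    "cast": [
        {"name": "Tommy Lee Jones"},
        {"name": "Javier Bardem"},
        {"name": "Josh Brolin"},
    ],
}

directors, top_actors = split_credits(sample, top_n=2)
print(directors, top_actors)
```

With a helper like this, one .credits() call per movie would be enough to fill both lists.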

In [7]:
movies['Director'] = directors
movies['Top 5 Actors'] = actors
movies.head(15)


Unnamed: 0,Date,Name,Year,Rating,TMDB ID,Language,Runtime,Genre,Country,Director,Top 5 Actors
0,2021-01-16,La Haine,1995,5.0,406,fr,98.0,Drama,France,[Mathieu Kassovitz],"[Vincent Cassel, Hubert Koundé, Saïd Taghmaoui..."
1,2022-01-07,Oldboy,2003,3.0,670,ko,120.0,Drama,South Korea,[Park Chan-wook],"[Choi Min-sik, Yoo Ji-tae, Kang Hye-jung, Kim ..."
2,2022-01-07,No Country for Old Men,2007,5.0,6977,en,122.0,Crime,United States of America,"[Joel Coen, Ethan Coen]","[Tommy Lee Jones, Javier Bardem, Josh Brolin, ..."
3,2022-01-07,Taxi Driver,1976,,103,en,114.0,Crime,United States of America,[Martin Scorsese],"[Robert De Niro, Jodie Foster, Cybill Shepherd..."
4,2022-01-10,Lady Bird,2017,4.0,391713,en,94.0,Drama,United States of America,[Greta Gerwig],"[Saoirse Ronan, Laurie Metcalf, Tracy Letts, B..."
5,2022-01-13,Birdman or (The Unexpected Virtue of Ignorance),2014,3.0,194662,en,120.0,Drama,United States of America,[Alejandro González Iñárritu],"[Michael Keaton, Emma Stone, Zach Galifianakis..."
6,2022-01-14,City of God,2002,4.0,598,pt,130.0,Drama,Brazil,[Fernando Meirelles],"[Alexandre Rodrigues, Leandro Firmino, Phellip..."
7,2022-01-15,Donnie Darko,2001,2.0,141,en,114.0,Fantasy,United States of America,[Richard Kelly],"[Jake Gyllenhaal, Jena Malone, James Duval, Dr..."
8,2022-01-16,Lost in Translation,2003,5.0,153,en,102.0,Drama,Japan,[Sofia Coppola],"[Bill Murray, Scarlett Johansson, Giovanni Rib..."
9,2022-01-17,Whiplash,2014,4.0,244786,en,107.0,Drama,United States of America,[Damien Chazelle],"[Miles Teller, J.K. Simmons, Paul Reiser, Meli..."


Now that we have more workable information, let's do some preliminary cleaning before heading to Excel.
- Manually fill in any movies whose data could not be retrieved from TMDb.
- Rather than keeping the abbreviation for the language, I want the full name of the language.
- Since runtimes are whole numbers of minutes, there's no point in keeping the decimal.
- Although some movies have several credited directors, I only want to keep the top-billed director.

## 4- Do some preliminary cleaning

In [8]:
## all movies whose data could not be found
print(movies[movies['TMDB ID'] == -1])

           Date                 Name  Year  Rating  TMDB ID Language  Runtime  \
377  2023-06-09  Jackals & Fireflies  2023     NaN       -1     None      NaN   

    Genre Country             Director       Top 5 Actors  
377  None    None  [No director found]  [No actors found]  


In [9]:
# manually filling in data
new_values = {'Director' : 'Charlie Kaufman', 'Top 5 Actors' : ['Eva H.D'],
'Language' : 'en', 'Runtime' : 20.0, 'Genre' : 'Drama',
  'Country' : 'United States of America'}

# set one cell at a time so each value lands in the column named by its key
for column, value in new_values.items():
    movies.at[377, column] = value
print(movies[movies['TMDB ID'] == -1])

           Date                 Name  Year  Rating  TMDB ID Language  Runtime  \
377  2023-06-09  Jackals & Fireflies  2023     NaN       -1       en     20.0   

     Genre                   Country         Director Top 5 Actors  
377  Drama  United States of America  Charlie Kaufman    [Eva H.D]  


In [10]:
print(movies['Language'].unique())
# dictionary of abbreviation : full name pairs
language_mapping = {
    'fr' : 'French', 'ko' : 'Korean',
    'en' : 'English', 'pt' : 'Portuguese',
    'ja' : 'Japanese', 'ru' : 'Russian',
    # TMDb labels 'cn' as Cantonese and 'zh' as Mandarin
    'cn' : 'Cantonese', 'es' : 'Spanish',
    'sv' : 'Swedish', 'zh' : 'Mandarin',
    'de' : 'German', 'el' : 'Greek',
    'fi' : 'Finnish', 'fa' : 'Persian'
}

# replace abbreviations with full names using the dictionary
movies['Language'] = movies['Language'].replace(language_mapping)

['fr' 'ko' 'en' 'pt' 'ja' 'ru' 'cn' 'es' 'sv' 'zh' 'de' 'el' 'fi' 'fa']


In [31]:
# dropping the decimal from the runtimes; the nullable Int64 dtype
# is used so that any missing runtime does not raise an error
movies['Runtime'] = movies['Runtime'].astype('Int64')

In [11]:
# only returning the top billed director
def top_billed(names):
    # the manually filled row stores the director as a plain string
    if isinstance(names, str):
        return names
    # first credited director, or None when the list is empty
    return names[0] if names else None

movies['Director'] = movies['Director'].apply(top_billed)
movies.head(15)

Unnamed: 0,Date,Name,Year,Rating,TMDB ID,Language,Runtime,Genre,Country,Director,Top 5 Actors
0,2021-01-16,La Haine,1995,5.0,406,French,98.0,Drama,France,Mathieu Kassovitz,"[Vincent Cassel, Hubert Koundé, Saïd Taghmaoui..."
1,2022-01-07,Oldboy,2003,3.0,670,Korean,120.0,Drama,South Korea,Park Chan-wook,"[Choi Min-sik, Yoo Ji-tae, Kang Hye-jung, Kim ..."
2,2022-01-07,No Country for Old Men,2007,5.0,6977,English,122.0,Crime,United States of America,Joel Coen,"[Tommy Lee Jones, Javier Bardem, Josh Brolin, ..."
3,2022-01-07,Taxi Driver,1976,,103,English,114.0,Crime,United States of America,Martin Scorsese,"[Robert De Niro, Jodie Foster, Cybill Shepherd..."
4,2022-01-10,Lady Bird,2017,4.0,391713,English,94.0,Drama,United States of America,Greta Gerwig,"[Saoirse Ronan, Laurie Metcalf, Tracy Letts, B..."
5,2022-01-13,Birdman or (The Unexpected Virtue of Ignorance),2014,3.0,194662,English,120.0,Drama,United States of America,Alejandro González Iñárritu,"[Michael Keaton, Emma Stone, Zach Galifianakis..."
6,2022-01-14,City of God,2002,4.0,598,Portuguese,130.0,Drama,Brazil,Fernando Meirelles,"[Alexandre Rodrigues, Leandro Firmino, Phellip..."
7,2022-01-15,Donnie Darko,2001,2.0,141,English,114.0,Fantasy,United States of America,Richard Kelly,"[Jake Gyllenhaal, Jena Malone, James Duval, Dr..."
8,2022-01-16,Lost in Translation,2003,5.0,153,English,102.0,Drama,Japan,Sofia Coppola,"[Bill Murray, Scarlett Johansson, Giovanni Rib..."
9,2022-01-17,Whiplash,2014,4.0,244786,English,107.0,Drama,United States of America,Damien Chazelle,"[Miles Teller, J.K. Simmons, Paul Reiser, Meli..."
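
The runtime conversion deserves a closer look, because a runtime may be missing for some movies and a plain integer column cannot hold a missing value. The sketch below uses made-up runtimes to show the difference between `astype(int)` and pandas' nullable `Int64` dtype.

```python
import pandas as pd

# Made-up runtimes in minutes; None stands for a movie with no runtime found.
runtimes = pd.Series([98.0, 120.0, None])

# A plain int conversion cannot represent the missing value and raises.
try:
    runtimes.astype(int)
    plain_int_failed = False
except (ValueError, TypeError):
    plain_int_failed = True

# The nullable Int64 dtype keeps the gap as <NA> and drops the decimals.
nullable = runtimes.astype("Int64")

print(plain_int_failed)
print(nullable.tolist())
```

If every movie has a runtime, both conversions give the same whole numbers, so the nullable dtype only matters when there are gaps.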


We can now head to Excel to do some further cleaning, as well as start answering some of my questions.

## 5- Save as a CSV, head to Excel

In [12]:
## index=False keeps the row numbers from being written as an extra unnamed column
movies.to_csv('letterboxd.csv', index=False)
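
One thing to watch for when saving: the 'Top 5 Actors' column holds Python lists, and `to_csv` writes each list as its text representation, brackets and quotes included. A minimal sketch of flattening the lists into plain comma-separated text first is shown below. The toy frame is made up for illustration, and writing to an in-memory buffer is only a way to inspect the resulting text.

```python
import io
import pandas as pd

# Toy frame with a list-valued column, mirroring the 'Top 5 Actors' column.
toy = pd.DataFrame({
    "Name": ["Whiplash"],
    "Top 5 Actors": [["Miles Teller", "J.K. Simmons"]],
})

# Join each list into one string so the cell reads cleanly in a spreadsheet.
flat = toy.copy()
flat["Top 5 Actors"] = flat["Top 5 Actors"].apply(", ".join)

# Write to an in-memory buffer instead of a file, just to inspect the text.
buffer = io.StringIO()
flat.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```

In Excel, a cell flattened this way can then be split on the comma if each actor needs a column of their own.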