<a href="https://colab.research.google.com/github/ilmuneraka/letterboxd-friends-ranker/blob/main/Letterboxd_Profile_Analyzer.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [2]:
from bs4 import BeautifulSoup
import requests
import pandas as pd

DOMAIN = "https://letterboxd.com"

# Preparing Necessary Functions
- We'll reuse some functions from my [previous notebook](https://colab.research.google.com/drive/1P1ApHz8nVAgaLlWcIz4kNnh6eq7DmHzS?usp=sharing)

In [3]:
def transform_ratings(some_str):
    """
    transforms raw star rating into float value
    :param: some_str: actual star rating
    :rtype: returns the float representation of the given star(s)
    """
    stars = {
        "★": 1,
        "★★": 2,
        "★★★": 3,
        "★★★★": 4,
        "★★★★★": 5,
        "½": 0.5,
        "★½": 1.5,
        "★★½": 2.5,
        "★★★½": 3.5,
        "★★★★½": 4.5
    }
    # anything that is not a known star string (e.g. an unrated film) maps to -1
    return stars.get(some_str, -1)

def scrape_films(username):
    """
    scrapes all films listed on a user's films pages
    :param: username: Letterboxd username
    :rtype: returns a dataframe with id, title, rating, liked, and link columns
    """
    movies_dict = {'id': [], 'title': [], 'rating': [], 'liked': [], 'link': []}
    url = DOMAIN + "/" + username + "/films/"
    url_page = requests.get(url)
    soup = BeautifulSoup(url_page.content, 'html.parser')

    # check number of pages (no pagination items means there is only one page)
    li_pagination = soup.find_all("li", {"class": "paginate-page"})
    n_pages = int(li_pagination[-1].find('a').get_text().strip()) if li_pagination else 1

    for i in range(n_pages):
        if i > 0:
            # the first page is already loaded above, later pages are fetched one by one
            url = DOMAIN + "/" + username + "/films/page/" + str(i+1)
            url_page = requests.get(url)
            soup = BeautifulSoup(url_page.content, 'html.parser')
        ul = soup.find("ul", {"class": "poster-list"})
        if ul is not None:
            for movie in ul.find_all("li"):
                movies_dict['id'].append(movie.find('div')['data-film-id'])
                movies_dict['title'].append(movie.find('img')['alt'])
                movies_dict['rating'].append(transform_ratings(movie.find('p', {"class": "poster-viewingdata"}).get_text().strip()))
                movies_dict['liked'].append(movie.find('span', {'class': 'like'}) is not None)
                movies_dict['link'].append(movie.find('div')['data-target-link'])

    df_film = pd.DataFrame(movies_dict)
    return df_film
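
As a side note on transform_ratings, the lookup table could also be replaced by a small computation: count the full stars and add 0.5 for a trailing half. The sketch below is an alternative I am suggesting, not the code used in the rest of this notebook, and the name `stars_to_float` is made up for illustration.

```python
def stars_to_float(star_str):
    """Convert a star string such as "★★★½" to 3.5; return -1 if it is not a rating."""
    star_str = star_str.strip()
    full = star_str.rstrip("½")             # the part before any trailing half
    n_halves = len(star_str) - len(full)    # how many "½" characters were removed
    # a valid rating is only full stars, followed by at most one half
    if not star_str or full.strip("★") or n_halves > 1:
        return -1
    value = len(full) + 0.5 * n_halves
    return float(value) if value <= 5 else -1


for raw in ["★", "★★★½", "½", "★★★★★", ""]:
    print(repr(raw), stars_to_float(raw))
```

Like the dictionary version, it returns -1 for an empty string, which is what an unrated film produces.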

# Getting The Movie Details
As you can see from the scrape_films function, we only scrape
- id
- title
- given rating
- liked
- link

This is because other details, such as genres and actors, are not available on a user's profile page. So, we need to visit each movie's own page to get them.

In [4]:
# first, we need to take a look at the HTML of a movie's page
# for example, let's try to scrape the Oscar-winning movie Everything Everywhere All at Once (2022)

url = "https://letterboxd.com/film/everything-everywhere-all-at-once/"
url_page = requests.get(url)
soup = BeautifulSoup(url_page.content, 'html.parser')
soup


<!DOCTYPE html>

<!--[if lt IE 7 ]> <html lang="en" class="ie6 lte9 lte8 lte7 lte6 no-js"> <![endif]-->
<!--[if IE 7 ]>    <html lang="en" class="ie7 lte9 lte8 lte7 no-js"> <![endif]-->
<!--[if IE 8 ]>    <html lang="en" class="ie8 lte9 lte8 no-js"> <![endif]-->
<!--[if IE 9 ]>    <html lang="en" class="ie9 lte9 no-js"> <![endif]-->
<!--[if (gt IE 9)|!(IE)]><!--> <html class="no-mobile no-js" id="html" lang="en"> <!--<![endif]-->
<head>
<meta charset="utf-8"/>
<meta content="width=1024" name="viewport"/>
<meta content="IE=edge,chrome=1" http-equiv="X-UA-Compatible"/>
<meta content="An aging Chinese immigrant is swept up in an insane adventure, where she alone can save what's important to her by connecting with the lives she could have led in other universes." name="description"/>
<meta content="video.movie" property="og:type"/>
<meta content="https://letterboxd.com/film/everything-everywhere-all-at-once/" property="og:url"/>
<meta content="Everything Everywhere All at Once (2022)" pro

Everything Everywhere All at Once (2022) Letterboxd page
![](https://drive.google.com/uc?export=view&id=14JiJGE38gm1-iqcIviAl2Oeymj6hvkN1)

We will try to scrape:
- Year
- Director (could be multiple)
- Cast (will be very likely multiple)
- Average Rating
- Number of watches
- Number of likes
- Genres (will be likely multiple)

From this page, we can obtain the year, director, cast, average rating, and genres.

The exact numbers of watches and likes have to be fetched from another page. Look for the 'film-stats' class in the HTML: it points to /esi/film/everything-everywhere-all-at-once/stats/
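
To make the stats step concrete, here is a minimal sketch of building the stats URL from a film link and pulling a count out of a tooltip title. The helper names and the sample title text are my own assumptions for illustration; the real wording on the stats page may differ, which is why the sketch searches for the first number instead of relying on word positions.

```python
import re

DOMAIN = "https://letterboxd.com"


def stats_url(link):
    """Build the stats URL from a film link such as "/film/some-film/"."""
    return DOMAIN + "/esi" + link + "stats/"


def count_from_title(title):
    """Extract the first integer from a title, ignoring thousands separators."""
    # non-breaking spaces are normalised first, as the scraper in this notebook also does
    match = re.search(r"\d[\d,]*", title.replace("\xa0", " "))
    if match is None:
        raise ValueError("no number found in: " + repr(title))
    return int(match.group().replace(",", ""))


# hypothetical title text, only meant to show the parsing
sample_title = "Watched by 1,527,067\xa0members"
print(stats_url("/film/everything-everywhere-all-at-once/"))
print(count_from_title(sample_title))
```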

## Organizing The Dataset
As you know, a movie can be directed by more than one person, the cast usually consists of many people, and a movie can have several genres too. So, we will not store those details in one dataframe. Instead, each multi-valued column goes into its own dataframe with a key column, and we join them when we need to, so it's basically a small relational database.

I will call the original dataframe df_rating; it stores the single-value columns: year, average rating, number of watches, and number of likes. The directors go into df_director, the actors into df_actor, and the genres into df_genre. I'm putting the illustration below for a better understanding of the dataset schema.

![](https://drive.google.com/uc?export=view&id=1Lq40gEWH0pu7IEir4UANWgPMQiLRni94)

Actually, we could store all of this in one table, since a pandas dataframe can hold lists, but I'm more comfortable with separate tables, so I will do it this way.
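
For comparison, here is a small sketch of how the two layouts relate. The ids and genres are made-up placeholders, not scraped data. `explode` turns a list column into one row per element, which is the same long format as a separate genre table.

```python
import pandas as pd

# one row per movie, with the genres kept as a list in a single column
df_wide = pd.DataFrame({
    "id": [1, 2],
    "genre": [["comedy", "drama", "crime"], ["animation", "family"]],
})

# one row per (movie, genre) pair, like a separate genre dataframe keyed by id
df_long = df_wide.explode("genre").reset_index(drop=True)

print(df_long)
print(df_long["genre"].value_counts())
```

With the long format, counting or grouping by genre needs no special handling of lists, which is the main reason separate tables are convenient here.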

In [5]:
# I will not give detailed steps to scrape each detail, I will give you my final function

# this function will take df_film produced by scrape_films function as input
# iterate the movies inside df_film to get all details from those movies
# and then return 4 different dataframes like shown above
def scrape_films_details(df_film):
    df_film = df_film[df_film['rating']!=-1].reset_index(drop=True)
    movies_rating = {}
    movies_rating['id'] = []
    movies_rating['avg_rating'] = []
    movies_rating['year'] = []
    movies_rating['watched_by'] = []
    movies_rating['liked_by'] = []
    
    movies_actor = {}
    movies_actor['id'] = []
    movies_actor['actor'] = []
    movies_actor['actor_link'] = []
    
    movies_director = {}
    movies_director['id'] = []
    movies_director['director'] = []
    movies_director['director_link'] = []
    
    movies_genre = {}
    movies_genre['id'] = []
    movies_genre['genre'] = []
    for link in df_film['link']:
        print('scraping details of '+df_film[df_film['link'] == link]['title'].values[0])
        
        id_movie = df_film[df_film['link'] == link]['id'].values[0]
        url_movie = DOMAIN + link
        url_movie_page = requests.get(url_movie)
        soup_movie = BeautifulSoup(url_movie_page.content, 'html.parser')
        # reset for every movie so a missing value is not silently copied from the previous movie
        rating, year = None, None
        for sc in soup_movie.findAll("script"):
            if sc.string != None:
                if "ratingValue" in sc.string:
                    rating = sc.string.split("ratingValue")[1].split(",")[0][2:]
                if "releaseYear" in sc.string:
                    year = sc.string.split("releaseYear")[1].split(",")[0][2:].replace('"','')
        url_stats = DOMAIN + "/esi" + link + "stats"
        url_stats_page = requests.get(url_stats)
        soup_stats = BeautifulSoup(url_stats_page.content, 'html.parser')
        watched_by = int(soup_stats.findAll('li')[0].find('a')['title'].replace(u'\xa0', u' ').split(" ")[2].replace(u',', u''))
        liked_by = int(soup_stats.findAll('li')[2].find('a')['title'].replace(u'\xa0', u' ').split(" ")[2].replace(u',', u''))
        movies_rating['id'].append(id_movie)
        movies_rating['avg_rating'].append(rating)
        movies_rating['year'].append(year)
        movies_rating['watched_by'].append(watched_by)
        movies_rating['liked_by'].append(liked_by)

        # finding the actors
        if (soup_movie.find('div', {'class':'cast-list'}) != None):
            for actor in soup_movie.find('div', {'class':'cast-list'}).findAll('a'):
                if actor.get_text().strip() != 'Show All…':
                    movies_actor['id'].append(id_movie)
                    movies_actor['actor'].append(actor.get_text().strip())
                    movies_actor['actor_link'].append(actor['href'])

        # finding the directors
        if (soup_movie.find('div', {'id':'tab-crew'}) != None):
            for director in soup_movie.find('div', {'id':'tab-crew'}).find('div').findAll('a'):
                movies_director['id'].append(id_movie)
                movies_director['director'].append(director.get_text().strip())
                movies_director['director_link'].append(director['href'])

        # finding the genres
        if (soup_movie.find('div', {'id':'tab-genres'}) != None):
            for genre in soup_movie.find('div', {'id':'tab-genres'}).find('div').findAll('a'):
                movies_genre['id'].append(id_movie)
                movies_genre['genre'].append(genre.get_text().strip())

    df_rating = pd.DataFrame(movies_rating)
    df_actor = pd.DataFrame(movies_actor)
    df_director = pd.DataFrame(movies_director)
    df_genre = pd.DataFrame(movies_genre)
    return df_rating, df_actor, df_director, df_genre
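
The function above extracts ratingValue by splitting the script text, which depends on the exact character layout. If the script that contains ratingValue is structured JSON (an assumption on my part; I have not verified this for every page), it could be parsed with the json module instead. The sample string below is hypothetical and only shaped like schema.org Movie markup.

```python
import json


def parse_embedded_json(script_text):
    """Parse the JSON object inside a script body, ignoring any wrapper around it."""
    start = script_text.find("{")
    end = script_text.rfind("}")
    if start == -1 or end == -1:
        return None
    return json.loads(script_text[start:end + 1])


# hypothetical script content, not copied from a real page
sample = '''
/* <![CDATA[ */
{"@type": "Movie", "name": "Some Film",
 "aggregateRating": {"@type": "AggregateRating", "ratingValue": 4.42, "ratingCount": 1000}}
/* ]]> */
'''

data = parse_embedded_json(sample)
avg_rating = data.get("aggregateRating", {}).get("ratingValue")
print(avg_rating)
```

A film without ratings would simply give `None` here instead of raising an error or reusing an old value.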

In [6]:
# let's try it
df_film = scrape_films('cacingpincang')

In [7]:
df_rating, df_actor, df_director, df_genre = scrape_films_details(df_film)

scraping details of A Man Called Otto
scraping details of Puss in Boots: The Last Wish
scraping details of The Menu
scraping details of Glass Onion: A Knives Out Mystery
scraping details of The Banshees of Inisherin
scraping details of Stealing Raden Saleh
scraping details of All or Nothing: Arsenal
scraping details of Nope
scraping details of Missing Home
scraping details of Elvis
scraping details of Men
scraping details of Srimulat: Hil Yang Mustahal – Babak Pertama
scraping details of Doctor Strange in the Multiverse of Madness
scraping details of Sonic the Hedgehog 2
scraping details of Everything Everywhere All at Once
scraping details of The Batman
scraping details of Neymar: The Perfect Chaos
scraping details of Spider-Man: No Way Home
scraping details of Nussa
scraping details of Clickbait
scraping details of Free Guy
scraping details of The Suicide Squad
scraping details of Red Rocket
scraping details of Cruella
scraping details of Wrath of Man
scraping details of Mortal Komba

In [8]:
df_rating.head()

Unnamed: 0,id,avg_rating,year,watched_by,liked_by
0,842221,3.62,2022,51696,15296
1,242285,4.27,2022,591453,272760
2,521323,3.67,2022,1065110,330250
3,586723,3.65,2022,1206120,370656
4,598882,4.1,2022,623489,233470


In [9]:
# now if we want to know the title too, merge it with the df_film, like this
pd.merge(df_film, df_rating).head()

Unnamed: 0,id,title,rating,liked,link,avg_rating,year,watched_by,liked_by
0,842221,A Man Called Otto,4.0,False,/film/a-man-called-otto/,3.62,2022,51696,15296
1,242285,Puss in Boots: The Last Wish,5.0,True,/film/puss-in-boots-the-last-wish/,4.27,2022,591453,272760
2,521323,The Menu,4.0,False,/film/the-menu-2022/,3.67,2022,1065110,330250
3,586723,Glass Onion: A Knives Out Mystery,4.0,False,/film/glass-onion-a-knives-out-mystery/,3.65,2022,1206120,370656
4,598882,The Banshees of Inisherin,5.0,True,/film/the-banshees-of-inisherin/,4.1,2022,623489,233470
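
A note on the merge above: when `on` is not given, `pd.merge` joins on all column names the two dataframes share, which here is only `id`, and it does an inner join by default. The sketch below spells that out on made-up rows. `validate` is optional and just checks that each id appears once on both sides.

```python
import pandas as pd

# made-up rows: film 2 is unrated (-1), so it has no row in the ratings table
df_film_demo = pd.DataFrame({
    "id": ["1", "2", "3"],
    "title": ["A", "B", "C"],
    "rating": [4.0, -1.0, 5.0],
})
df_rating_demo = pd.DataFrame({
    "id": ["1", "3"],
    "avg_rating": ["3.62", "4.27"],
})

merged = pd.merge(df_film_demo, df_rating_demo, on="id", how="inner", validate="one_to_one")
print(merged)
```

Because scrape_films_details skips unrated films, the inner join also drops them from the merged result.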


# Visualization & Analysis
- Visualization is a good way to understand the data. Here I will use Altair charts instead of Matplotlib

In [10]:
import altair as alt

## Year Movies were Released
I want to know which years my movies were released in, how wide the year span is, and what the oldest movie is

In [11]:
df_rating_merged = pd.merge(df_film, df_rating)

In [12]:
alt.Chart(df_rating_merged).mark_bar(tooltip=True).encode(
                alt.X("year:O", axis=alt.Axis(labelAngle=90)),
                y='count()',
                color=alt.Color('liked', scale=alt.Scale(domain=[True, False], range=["#ff8000", "#00b020"]))
            ).properties(
                    width=800,
                    height=300
                    )

In [13]:
df_rating_merged[df_rating_merged['year'] == '1957']

Unnamed: 0,id,title,rating,liked,link,avg_rating,year,watched_by,liked_by
234,51700,12 Angry Men,5.0,False,/film/12-angry-men/,4.52,1957,598862,226772


- Looks like I mostly watched movies that were released in 2019
- I mostly liked movies from 2020
- The oldest movie I've watched is 12 Angry Men which was released in 1957
- Next, I want to look at this by decade, so I will group the movies by decade and then visualize it

In [14]:
def decade_year(year):
    # e.g. 1957 -> "1950s"
    return str(year // 10 * 10) + "s"

df_rating_merged['decade'] = df_rating_merged['year'].astype(int).apply(decade_year)

In [15]:
alt.Chart(df_rating_merged).mark_bar(tooltip=True).encode(
                alt.X("decade:O", axis=alt.Axis(labelAngle=90)),
                y='count()',
                color=alt.Color('liked', scale=alt.Scale(domain=[True, False], range=["#ff8000", "#00b020"]))
                ).properties(
                    width=800,
                    height=300
                    )

- As expected, most of the movies I watched were released in the 2010s (105 movies)
- Unlike the number of movies, which peaked in the 2010s, my number of likes peaked in the 2000s (21 movies)

## My Ratings Compared to the Average Ratings
I want to know how my ratings compare to other users' ratings
- Did I give higher or lower ratings than the others?
- On which movies did I differ most from the crowd?

In [16]:
# to see if I gave higher or lower
df_rating_merged['difference'] = df_rating_merged['rating'].astype(float)-df_rating_merged['avg_rating'].astype(float)

# absolute values show how far my ratings are from the crowd regardless of direction
df_rating_merged['difference_abs'] = abs(df_rating_merged['difference'])

In [17]:
print("On average, I gave higher/lower ratings by {} points".format(round(df_rating_merged['difference'].mean(), 2)))
print("On average, my ratings are different from other users by {} points".format(round(df_rating_merged['difference_abs'].mean(), 2)))

On average, I gave higher/lower ratings by 0.29 points
On average, my ratings are different from other users by 0.51 points


- Looks like I gave slightly higher ratings than others by 0.29 points
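
The two numbers above answer different questions: the signed mean lets higher and lower ratings cancel each other out, while the mean of absolute differences does not. A minimal sketch using the five rows from the "biggest differences" table in this notebook (the variable names are mine):

```python
from statistics import mean

# (my rating, average rating) for The Ridiculous 6, A Dog's Purpose,
# Kicking & Screaming, Thunder Road, and Incendies
pairs = [(4.0, 1.77), (5.0, 3.27), (4.0, 2.7), (2.5, 3.8), (3.0, 4.28)]

differences = [mine - avg for mine, avg in pairs]
signed_mean = mean(differences)                     # positive and negative gaps cancel out
absolute_mean = mean(abs(d) for d in differences)   # every gap counts, whatever its direction

print(round(signed_mean, 2), round(absolute_mean, 2))
```

The absolute mean is always at least as large as the size of the signed mean, which matches the 0.51 versus 0.29 seen on the full data.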

In [82]:
# my average rating
print("I gave average rating of {} on {} movies, and I liked {} movies".format(round(df_rating_merged['rating'].mean() ,2), len(df_rating_merged), df_rating_merged['liked'].sum()))

I gave average rating of 3.89 on 235 movies, and I liked 70 movies


In [18]:
# top 5 movies with the biggest differences
df_rating_merged.sort_values('difference_abs', ascending=False).head()

Unnamed: 0,id,title,rating,liked,link,avg_rating,year,watched_by,liked_by,decade,difference,difference_abs
112,278289,The Ridiculous 6,4.0,False,/film/the-ridiculous-6/,1.77,2015,51479,3443,2010s,2.23,2.23
101,316316,A Dog's Purpose,5.0,True,/film/a-dogs-purpose/,3.27,2017,69302,11161,2010s,1.73,1.73
179,46659,Kicking & Screaming,4.0,False,/film/kicking-screaming/,2.7,2005,58333,5805,2000s,1.3,1.3
85,432335,Thunder Road,2.5,False,/film/thunder-road-2018/,3.8,2018,50834,16516,2010s,-1.3,1.3
149,18627,Incendies,3.0,False,/film/incendies/,4.28,2010,171297,61142,2010s,-1.28,1.28


Why do people dislike The Ridiculous 6, though? I know it's stupid, but it's still pretty entertaining for a comedy

In [19]:
# now we will visualize it
alt.Chart(df_rating_merged).mark_bar(tooltip=True).encode(
                alt.X("rating:O", axis=alt.Axis(labelAngle=0)),
                y='count()',
                color=alt.Color('liked', scale=alt.Scale(domain=[True, False], range=["#ff8000", "#00b020"]))
            ).properties(
                    width=500,
                    height=300
                    )

In [20]:
alt.Chart(df_rating_merged).mark_bar(tooltip=True).encode(
                alt.X("avg_rating", bin=True, axis=alt.Axis(labelAngle=0)),
                y='count()',
                color=alt.Color('liked', scale=alt.Scale(domain=[True, False], range=["#ff8000", "#00b020"]))
            ).properties(
                    width=500,
                    height=300
                    )

- Most of the movies I rated have average ratings between 3.5 and 4
- Now I wonder how strongly my ratings correlate with the average ratings, let's find out

In [21]:
df_rating_merged['rating'].astype(float).corr(df_rating_merged['avg_rating'].astype(float))

0.560793650054708
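
For reference, the default method of `Series.corr` is the Pearson correlation: the covariance of the two columns divided by the product of their standard deviations. Here is a plain-Python sketch of that formula; the function name is mine, and the sample numbers are the five rows from the "biggest differences" table, so the result is not comparable to the value above.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


my_ratings = [4.0, 5.0, 4.0, 2.5, 3.0]
avg_ratings = [1.77, 3.27, 2.7, 3.8, 4.28]
print(pearson(my_ratings, avg_ratings))
```

A value near 1 means the two ratings rise together, a value near -1 means they move in opposite directions, and a value near 0 means there is no linear relationship.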

In [22]:
alt.Chart(df_rating_merged).mark_circle(size=60).encode(
    x='rating:Q',
    y='avg_rating:Q',
    color=alt.Color('liked', scale=alt.Scale(domain=[True, False], range=["#ff8000", "#00b020"])),
    tooltip=['title', 'year', 'rating', 'avg_rating', 'difference']
).interactive()

## Popularity of My Movies, and Are They Likeable?
- Note that we also scraped the number of watches and likes for each of my movies, so we can use them to measure popularity and likeability
- To measure likeability, I will use the like-to-watch ratio: how many users liked the movie compared to how many watched it (number of likes / number of watches, computed per movie)
- I also want to know which movies are the least and most popular, and the least and most likeable
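
One detail worth showing is that the ratio is computed per movie. Averaging per-movie ratios is not the same as dividing total likes by total watches, because the pooled number is dominated by the most-watched movies. A small sketch with two rows taken from the tables in this notebook (Parasite and Chase):

```python
# (title, watched_by, liked_by)
movies = [("Parasite", 2430366, 1404042), ("Chase", 188, 11)]

# per-movie like-to-watch ratio, as used in this notebook
ratios = [liked / watched for _, watched, liked in movies]
mean_of_ratios = sum(ratios) / len(ratios)

# pooled ratio: total likes over total watches
pooled = sum(liked for _, _, liked in movies) / sum(watched for _, watched, _ in movies)

print([round(r, 3) for r in ratios])
print(round(mean_of_ratios, 3), round(pooled, 3))
```

The per-movie version treats an obscure movie and a blockbuster equally, which is what we want when describing a list of movies.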


In [23]:
df_rating_merged['ltw_ratio'] = df_rating_merged['liked_by']/df_rating_merged['watched_by']

In [24]:
pd.set_option('display.float_format', lambda x: '%.3f' % x)
df_rating_merged[['watched_by', 'ltw_ratio']].describe()

Unnamed: 0,watched_by,ltw_ratio
count,235.0,235.0
mean,560558.323,0.252
std,592909.134,0.099
min,188.0,0.059
25%,74873.5,0.175
50%,372246.0,0.243
75%,853414.5,0.328
max,2430366.0,0.578


- On average (mean), the movies I watched were watched by about 560,558 users and liked by 25.2% of the users who watched them
- From the description above, we can see that the watch counts are skewed (the mean, 560,558, is far above the median, 372,246), while the ltw_ratio values are more balanced (mean 0.252, median 0.243)
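
The gap between mean and median comes from a few very popular movies pulling the mean up. A minimal sketch with five watch counts taken from the tables in this notebook (Chase, Nussa, The Ridiculous 6, Puss in Boots: The Last Wish, and Parasite):

```python
from statistics import mean, median

watched_by = [188, 434, 51479, 591453, 2430366]

# the mean is pulled up by the single largest value, the median is not
print(mean(watched_by), median(watched_by))
```

This is why the median is usually the safer summary for counts like these.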

In [25]:
# my top 5 most popular movies
df_rating_merged.sort_values('watched_by', ascending=False).head()

Unnamed: 0,id,title,rating,liked,link,avg_rating,year,watched_by,liked_by,decade,difference,difference_abs,ltw_ratio
62,426406,Parasite,4.5,False,/film/parasite-2019/,4.57,2019,2430366,1404042,2010s,-0.07,0.07,0.578
56,406775,Joker,4.0,False,/film/joker-2019/,3.9,2019,2400675,984940,2010s,0.1,0.1,0.41
202,51568,Fight Club,5.0,True,/film/fight-club/,4.3,1999,2234416,978300,1990s,0.7,0.7,0.438
223,51444,Pulp Fiction,4.0,False,/film/pulp-fiction/,4.29,1994,2173711,959883,1990s,-0.29,0.29,0.442
55,475370,Knives Out,4.0,False,/film/knives-out-2019/,4.06,2019,2127610,962167,2010s,-0.06,0.06,0.452


In [26]:
# my top 5 least popular movies
df_rating_merged.sort_values('watched_by').head()

Unnamed: 0,id,title,rating,liked,link,avg_rating,year,watched_by,liked_by,decade,difference,difference_abs,ltw_ratio
64,532074,Chase,3.5,False,/film/chase-2019/,2.33,2019,188,11,2010s,1.17,1.17,0.059
18,606030,Nussa,3.5,False,/film/nussa/,3.64,2021,434,103,2020s,-0.14,0.14,0.237
27,597511,The Heartbreak Club,3.0,False,/film/the-heartbreak-club/,2.99,2021,567,68,2020s,0.01,0.01,0.12
11,830325,Srimulat: Hil Yang Mustahal – Babak Pertama,4.0,False,/film/srimulat-hil-yang-mustahal-babak-pertama/,3.44,2022,626,143,2020s,0.56,0.56,0.228
37,763562,Backstreet Rookie,3.5,False,/film/backstreet-rookie/,3.02,2020,1708,228,2020s,0.48,0.48,0.133


- Note that there are 3 Indonesian movies in this list. I think this shows that there are still very few Indonesian Letterboxd users, because Nussa is not that obscure

In [27]:
# my top 5 most likeable movies
df_rating_merged.sort_values('ltw_ratio', ascending=False).head()

Unnamed: 0,id,title,rating,liked,link,avg_rating,year,watched_by,liked_by,decade,difference,difference_abs,ltw_ratio
62,426406,Parasite,4.5,False,/film/parasite-2019/,4.57,2019,2430366,1404042,2010s,-0.07,0.07,0.578
14,474474,Everything Everywhere All at Once,4.5,True,/film/everything-everywhere-all-at-once/,4.42,2022,1527067,780065,2020s,0.08,0.08,0.511
1,242285,Puss in Boots: The Last Wish,5.0,True,/film/puss-in-boots-the-last-wish/,4.27,2022,591453,272760,2020s,0.73,0.73,0.461
55,475370,Knives Out,4.0,False,/film/knives-out-2019/,4.06,2019,2127610,962167,2010s,-0.06,0.06,0.452
100,353117,Get Out,4.5,True,/film/get-out-2017/,4.2,2017,2081680,934923,2010s,0.3,0.3,0.449


In [28]:
# my top 5 least likeable movies
df_rating_merged.sort_values('ltw_ratio').head()

Unnamed: 0,id,title,rating,liked,link,avg_rating,year,watched_by,liked_by,decade,difference,difference_abs,ltw_ratio
64,532074,Chase,3.5,False,/film/chase-2019/,2.33,2019,188,11,2010s,1.17,1.17,0.059
112,278289,The Ridiculous 6,4.0,False,/film/the-ridiculous-6/,1.77,2015,51479,3443,2010s,2.23,2.23,0.067
120,174952,Annabelle,3.0,False,/film/annabelle/,2.43,2014,258962,18997,2010s,0.57,0.57,0.073
176,46635,See No Evil,2.5,False,/film/see-no-evil-2006/,2.16,2006,14678,1140,2000s,0.34,0.34,0.078
124,139348,Let's Be Cops,3.0,False,/film/lets-be-cops/,2.56,2014,64350,5612,2010s,0.44,0.44,0.087


- We see some bad movies here
- Now I'm curious about the correlation of avg_rating and the ltw_ratio, let's find out

In [29]:
df_rating_merged['avg_rating'].astype(float).corr(df_rating_merged['ltw_ratio'])

0.8861304610658886

Really high correlation!

In [30]:
alt.Chart(df_rating_merged).mark_circle(size=60).encode(
    x='ltw_ratio:Q',
    y='avg_rating:Q',
    color=alt.Color('liked', scale=alt.Scale(domain=[True, False], range=["#ff8000", "#00b020"])),
    tooltip=['title', 'year', 'rating', 'avg_rating', 'ltw_ratio']
).interactive()

In [31]:
# now I will plot popularity against likeability
# to spot less popular movies with high likeability and popular movies with low likeability
alt.Chart(df_rating_merged).mark_circle(size=60).encode(
    x='ltw_ratio:Q',
    y='watched_by:Q',
    color=alt.Color('liked', scale=alt.Scale(domain=[True, False], range=["#ff8000", "#00b020"])),
    tooltip=['title', 'year', 'rating', 'avg_rating', 'watched_by', 'ltw_ratio']
).interactive()

- Since it's an interactive chart, try zooming in on movies with more than 1 million watches, and you'll see that Incredibles 2 (2018) is the least likeable of my popular movies
- Now zoom in on the less popular movies (fewer than 200k watches) and move right (ltw_ratio above 0.345). There I find Incendies (2010), Taste of Cherry (1997), Love Letter (1995), and Missing Home (2022): likeable but less popular movies
- Among the movies that I liked, the one with the lowest ltw_ratio is Johnny English Reborn (2011): only about 9% of users liked it

For the next visualization, I will group popularity and likeability into 4 levels each.

Popularity is determined by the number of watches.
- <= 10,000 -> very obscure
- 10,001 - 100,000 -> obscure
- 100,001 - 1,000,000 -> popular
- \> 1,000,000 -> very popular

Likeability is determined by the ratio of likes to watches.
- <= 0.1 -> rarely likeable
- \> 0.1 and <= 0.2 -> sometimes likeable
- \> 0.2 and <= 0.4 -> often likeable
- \> 0.4 -> usually likeable

In [32]:
def classify_popularity(watched_by):
    if (watched_by <= 10000):
        return "1 - very obscure"
    elif (watched_by <= 100000):
        return "2 - obscure"
    elif (watched_by <= 1000000):
        return "3 - popular"
    else:
        return "4 - very popular"

def classify_likeability(ltw_ratio):
    if (ltw_ratio <= 0.1):
        return "1 - rarely likeable"
    elif (ltw_ratio <= 0.2):
        return "2 - sometimes likeable"
    elif (ltw_ratio <= 0.4):
        return "3 - often likeable"
    else:
        return "4 - usually likeable"

df_rating_merged['popularity'] = df_rating_merged['watched_by'].apply(classify_popularity)
df_rating_merged['likeability'] = df_rating_merged['ltw_ratio'].apply(classify_likeability)
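
As an alternative to the if/elif chains above, the thresholds can be kept in a list and looked up with the standard bisect module. This is only a sketch of the same rule, with the labels copied from the cell above. The function name `classify` and the constant names are mine.

```python
from bisect import bisect_left

POPULARITY_BOUNDS = [10_000, 100_000, 1_000_000]
POPULARITY_LABELS = ["1 - very obscure", "2 - obscure", "3 - popular", "4 - very popular"]

LIKEABILITY_BOUNDS = [0.1, 0.2, 0.4]
LIKEABILITY_LABELS = ["1 - rarely likeable", "2 - sometimes likeable",
                      "3 - often likeable", "4 - usually likeable"]


def classify(value, bounds, labels):
    # bisect_left counts the bounds strictly below value,
    # so a value equal to a bound stays in the lower class (the "<=" rule)
    return labels[bisect_left(bounds, value)]


print(classify(188, POPULARITY_BOUNDS, POPULARITY_LABELS))
print(classify(2430366, POPULARITY_BOUNDS, POPULARITY_LABELS))
print(classify(0.578, LIKEABILITY_BOUNDS, LIKEABILITY_LABELS))
```

Changing a threshold then means editing one list instead of a chain of conditions.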

In [33]:
# now we will visualize it
alt.Chart(df_rating_merged).mark_bar(tooltip=True).encode(
                alt.X("popularity:O", axis=alt.Axis(labelAngle=0)),
                y='count()',
                color=alt.Color('liked', scale=alt.Scale(domain=[True, False], range=["#ff8000", "#00b020"]))
            ).properties(
                    width=500,
                    height=300
                    )

In [34]:
# now we will visualize it
alt.Chart(df_rating_merged).mark_bar(tooltip=True).encode(
                alt.X("likeability:O", axis=alt.Axis(labelAngle=0)),
                y='count()',
                color=alt.Color('liked', scale=alt.Scale(domain=[True, False], range=["#ff8000", "#00b020"]))
            ).properties(
                    width=500,
                    height=300
                    )

## Directors & Actors
Let's see which director and actor I watched most

In [35]:
df_director_merged = pd.merge(df_film, df_director)
df_actor_merged = pd.merge(df_film, df_actor)

In [36]:
# my top 10 most watched directors
df_director_merged['director'].value_counts().head(10)

Quentin Tarantino    8
Joel Coen            6
Guy Ritchie          5
Martin Scorsese      5
David Fincher        5
Steven Soderbergh    4
Martin McDonagh      4
Ethan Coen           4
Chad Stahelski       3
Edgar Wright         3
Name: director, dtype: int64

In [37]:
# my top 10 most watched actors
df_actor_merged['actor'].value_counts().head(10)

Brad Pitt            11
Tom Hanks             7
Andy García           7
Samuel L. Jackson     7
Matt Damon            7
Keanu Reeves          6
Eddie Marsan          6
Jack Black            6
Steve Buscemi         6
Casey Affleck         6
Name: actor, dtype: int64

- Wow, I didn't know I watched that many Quentin Tarantino movies
- I also didn't realise I watched so many movies starring Brad Pitt
- Now, I want to see which directors and actors I rated highest

In [38]:
df_temp_director = pd.merge(df_director_merged.groupby(['director', 'director_link']).agg({'liked':'sum', 'rating':'mean'}).reset_index(),
                            # rename_axis + reset_index(name=...) gives the columns ['director', 'count'] on both pandas 1.x and 2.x
                            df_director['director'].value_counts().rename_axis('director').reset_index(name='count'))
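
The cell above builds the summary in two steps: an aggregation and a separate count that are merged afterwards. As a sketch of an alternative, pandas named aggregation can produce all three columns in one groupby. The rows below are made up for illustration. Note that this version counts per (director, director_link) pair, while the cell above counts per director name.

```python
import pandas as pd

# made-up rows in the same shape as df_director_merged
df_demo = pd.DataFrame({
    "id": ["1", "2", "3", "4"],
    "director": ["Director A", "Director A", "Director B", "Director A"],
    "director_link": ["/director/a/", "/director/a/", "/director/b/", "/director/a/"],
    "liked": [True, False, True, True],
    "rating": [4.0, 3.5, 5.0, 4.5],
})

summary = (
    df_demo.groupby(["director", "director_link"])
    .agg(liked=("liked", "sum"), rating=("rating", "mean"), count=("id", "count"))
    .reset_index()
)
print(summary)
```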

In [39]:
df_temp_director.sort_values('count', ascending=False).head(10)

Unnamed: 0,director,director_link,liked,rating,count
145,Quentin Tarantino,/director/quentin-tarantino/,3,4.438,8
92,Joel Coen,/director/joel-coen/,3,4.083,6
47,David Fincher,/director/david-fincher/,1,3.9,5
69,Guy Ritchie,/director/guy-ritchie/,2,3.9,5
120,Martin Scorsese,/director/martin-scorsese/,1,3.9,5
173,Steven Soderbergh,/director/steven-soderbergh/,0,3.625,4
119,Martin McDonagh,/director/martin-mcdonagh/,3,4.625,4
58,Ethan Coen,/director/ethan-coen/,2,4.125,4
54,Edgar Wright,/director/edgar-wright/,2,4.167,3
27,Chad Stahelski,/director/chad-stahelski/,0,3.833,3


In [40]:
df_temp_director.sort_values('liked', ascending=False).head(10)

Unnamed: 0,director,director_link,liked,rating,count
92,Joel Coen,/director/joel-coen/,3,4.083,6
119,Martin McDonagh,/director/martin-mcdonagh/,3,4.625,4
145,Quentin Tarantino,/director/quentin-tarantino/,3,4.438,8
69,Guy Ritchie,/director/guy-ritchie/,2,3.9,5
54,Edgar Wright,/director/edgar-wright/,2,4.167,3
58,Ethan Coen,/director/ethan-coen/,2,4.125,4
82,Jeff Fowler,/director/jeff-fowler/,2,4.0,2
146,Rajkumar Hirani,/director/rajkumar-hirani/,2,4.75,2
70,Hannah Fidell,/director/hannah-fidell/,1,4.0,1
142,Peter Weir,/director/peter-weir/,1,4.5,2


- Among my most watched directors, the one I gave the highest average rating is Martin McDonagh. That's not surprising since I really like his work; it's a shame The Banshees of Inisherin didn't win any Oscars this year :(
- There are 3 directors with 3 movies each that I liked, and Martin McDonagh has the highest average rating among them, so I think this list confirms that he is my favorite director

In [41]:
df_temp_actor = pd.merge(df_actor_merged.groupby(['actor', 'actor_link']).agg({'liked':'sum', 'rating':'mean'}).reset_index(),
                            # rename_axis + reset_index(name=...) gives the columns ['actor', 'count'] on both pandas 1.x and 2.x
                            df_actor['actor'].value_counts().rename_axis('actor').reset_index(name='count'))

In [42]:
df_temp_actor.sort_values('count', ascending=False).head(10)

Unnamed: 0,actor,actor_link,liked,rating,count
1141,Brad Pitt,/actor/brad-pitt/,5,4.091,11
8222,Samuel L. Jackson,/actor/samuel-l-jackson/,3,4.143,7
9113,Tom Hanks,/actor/tom-hanks/,2,4.214,7
530,Andy García,/actor/andy-garcia/,0,3.286,7
6264,Matt Damon,/actor/matt-damon/,0,3.857,7
3818,Jack Black,/actor/jack-black-1/,2,3.75,6
3551,Harvey Keitel,/actor/harvey-keitel/,1,4.167,6
5123,Keanu Reeves,/actor/keanu-reeves/,0,3.583,6
8691,Steve Buscemi,/actor/steve-buscemi/,2,4.25,6
1480,Casey Affleck,/actor/casey-affleck/,0,3.833,6


In [43]:
df_temp_actor.sort_values('liked', ascending=False).head(10)

Unnamed: 0,actor,actor_link,liked,rating,count
1141,Brad Pitt,/actor/brad-pitt/,5,4.091,11
4376,Jim Carrey,/actor/jim-carrey/,5,4.083,6
3590,Helena Bonham Carter,/actor/helena-bonham-carter/,4,4.625,4
8191,Sam Rockwell,/actor/sam-rockwell/,4,4.5,5
1361,Caleb Landry Jones,/actor/caleb-landry-jones/,3,4.375,4
8222,Samuel L. Jackson,/actor/samuel-l-jackson/,3,4.143,7
1886,Colin Farrell,/actor/colin-farrell/,3,4.4,5
8679,Stephen Root,/actor/stephen-root/,3,4.167,3
9554,Woody Harrelson,/actor/woody-harrelson/,3,4.3,5
1187,Brendan Gleeson,/actor/brendan-gleeson/,3,4.375,4


- It looks like I gave high ratings to movies starring Helena Bonham Carter, and I liked all 4 of her movies that I watched; she's really an amazing actress
- I also like Jim Carrey, so it's fair to see him near the top of this list

In [56]:
df_temp_director = df_temp_director.sort_values('count', ascending=False).reset_index(drop=True)
n_director = df_temp_director.iloc[9]['count']
df_temp_director = df_temp_director[df_temp_director['count']>=n_director]
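
The cell above keeps every director whose count is at least the count of the 10th row, so ties at the cutoff are all kept. As a sketch of the same idea, `DataFrame.nlargest` with `keep='all'` does this in one call. The counts below are made up, and the cutoff is 3 instead of 10 to keep the example short.

```python
import pandas as pd

df_counts = pd.DataFrame({
    "director": ["A", "B", "C", "D", "E", "F"],
    "count": [8, 6, 5, 5, 5, 4],
})

# top 3 by count, keeping every row tied with the 3rd place
top = df_counts.nlargest(3, "count", keep="all")
print(top)
```

Here C, D, and E all share the third-highest count, so five rows come back instead of three.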

In [57]:
# let's try to visualize it
base = alt.Chart(df_director_merged[df_director_merged['director'].isin(df_temp_director['director'])]).encode(
                    alt.Y("director", sort=df_temp_director['director'].tolist(), axis=alt.Axis(labelAngle=0))
                )
            
area = base.mark_bar(tooltip=True).encode(
    alt.X('count()',
        axis=alt.Axis(title='Count of Records')),
        color=alt.Color('liked', scale=alt.Scale(domain=[True, False], range=["#ff8000", "#00b020"]))
)
line = alt.Chart(df_temp_director).mark_line(interpolate='monotone').encode(
    alt.Y("director", sort=df_temp_director['director'].tolist(), axis=alt.Axis(labelAngle=0)),
    alt.X('rating', axis=alt.Axis(title='Average Rating', titleColor='#40bcf4'), scale=alt.Scale(zero=False)),
    color=alt.Color(value="#40bcf4"),
)
alt.layer(area, line).resolve_scale(
                x = 'independent'
)

In [58]:
df_temp_actor = df_temp_actor.sort_values('count', ascending=False).reset_index(drop=True)
n_actor = df_temp_actor.iloc[9]['count']
df_temp_actor = df_temp_actor[df_temp_actor['count']>=n_actor]

In [59]:
base = alt.Chart(df_actor_merged[df_actor_merged['actor'].isin(df_temp_actor['actor'])]).encode(
                    alt.Y("actor", sort=df_temp_actor['actor'].tolist(), axis=alt.Axis(labelAngle=0))
                )
            
area = base.mark_bar(tooltip=True).encode(
    alt.X('count()',
        axis=alt.Axis(title='Count of Records')),
        color=alt.Color('liked', scale=alt.Scale(domain=[True, False], range=["#ff8000", "#00b020"]))
)
line = alt.Chart(df_temp_actor).mark_line(interpolate='monotone').encode(
    alt.Y("actor", sort=df_temp_actor['actor'].tolist(), axis=alt.Axis(labelAngle=0)),
    alt.X('rating', axis=alt.Axis(title='Average Rating', titleColor='#40bcf4'), scale=alt.Scale(zero=False)),
    color=alt.Color(value="#40bcf4"),
)
alt.layer(area, line).resolve_scale(
                x = 'independent'
)

## Genres
Let's find out which genre I watched the most

In [60]:
df_genre_merged = pd.merge(df_film, df_genre)
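
With no `on` argument, `pd.merge` joins on every column the two frames share and uses an inner join. Here is a small sketch that spells this out, assuming the shared key is `id` (the column layout of `df_genre` is an assumption here, and the toy frames are made up). The `validate` argument asks pandas to raise an error if a film id is duplicated on the film side.

```python
import pandas as pd

# Hypothetical stand-ins for df_film and df_genre
films = pd.DataFrame({
    "id": [1, 2, 3],
    "title": ["Film A", "Film B", "Film C"],
    "rating": [4.0, 3.5, 5.0],
    "liked": [True, False, True],
})
genres = pd.DataFrame({
    "id": [1, 1, 2, 4],
    "genre": ["drama", "crime", "comedy", "horror"],
})

# Explicit key, explicit join type, and a check that each film appears once on the left
merged = pd.merge(films, genres, on="id", how="inner", validate="one_to_many")
print(merged[["id", "title", "genre"]])
```

Film 3 has no genre row and genre row 4 has no film, so the inner join drops both. This is worth keeping in mind when comparing counts taken from the merged frame with counts taken from the genre table alone.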

In [62]:
df_genre['genre'].value_counts()

drama              112
comedy              98
thriller            76
crime               76
action              59
adventure           34
mystery             29
horror              23
family              19
fantasy             17
animation           15
science-fiction     14
romance             12
war                 12
history             11
documentary          7
western              5
music                4
tv-movie             1
Name: genre, dtype: int64

- Turns out my most-watched genre is drama, followed by comedy
- Let's give genres the same treatment as actors & directors to see which genre has the highest average rating and which genre I like the most

In [63]:
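# total likes and average rating per genre, joined with the number of films in each genre
# (groupby().size() yields the same column names on pandas 1.x and 2.x)
df_temp_genre = pd.merge(df_genre_merged.groupby(['genre']).agg({'liked':'sum', 'rating':'mean'}).reset_index(),
                            df_genre.groupby('genre').size().reset_index(name='count'))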
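
The per-genre summary can also be sketched as a single `groupby` with named aggregation, so no second frame or merge is needed. This assumes the merged frame has one row per film and genre pair, and it counts rows in the merged frame rather than in `df_genre`, so the counts can differ if some genre rows have no matching film. The toy data is made up.

```python
import pandas as pd

# Hypothetical stand-in for df_genre_merged: one row per (film, genre) pair
toy = pd.DataFrame({
    "id": [1, 1, 2, 3, 3],
    "genre": ["drama", "crime", "drama", "comedy", "drama"],
    "liked": [True, True, False, True, True],
    "rating": [4.0, 4.0, 3.0, 5.0, 5.0],
})

# One pass: total likes, average rating, and number of films per genre
summary = (
    toy.groupby("genre")
       .agg(liked=("liked", "sum"), rating=("rating", "mean"), count=("id", "size"))
       .reset_index()
)
print(summary.sort_values("count", ascending=False))
```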

In [66]:
df_temp_genre.sort_values('rating', ascending=False)

Unnamed: 0,genre,liked,rating,count
9,history,3,4.227,11
18,western,2,4.2,5
17,war,4,4.125,12
11,music,0,4.125,4
6,drama,40,4.089,112
12,mystery,10,3.966,29
4,crime,23,3.928,76
14,science-fiction,5,3.893,14
3,comedy,37,3.893,98
15,thriller,21,3.849,76


- It's true that I don't really fancy horror movies because I find them boring, so it's not surprising to see horror at the bottom of the list

In [67]:
df_temp_genre.sort_values('liked', ascending=False)

Unnamed: 0,genre,liked,rating,count
6,drama,40,4.089,112
3,comedy,37,3.893,98
4,crime,23,3.928,76
15,thriller,21,3.849,76
0,action,15,3.737,59
1,adventure,13,3.794,34
12,mystery,10,3.966,29
13,romance,6,3.833,12
7,family,6,3.737,19
8,fantasy,5,3.618,17


In [69]:
df_temp_genre = df_temp_genre.sort_values('count', ascending=False).reset_index(drop=True)

In [71]:
base = alt.Chart(df_genre_merged[df_genre_merged['genre'].isin(df_temp_genre['genre'])]).encode(
                    alt.X("genre", sort=df_temp_genre['genre'].tolist(), axis=alt.Axis(labelAngle=90))
                )

area = base.mark_bar(tooltip=True).encode(
    alt.Y('count()',
        axis=alt.Axis(title='Count of Records')),
        color=alt.Color('liked', scale=alt.Scale(domain=[True, False], range=["#ff8000", "#00b020"]))
)

line = alt.Chart(df_temp_genre).mark_line(interpolate='monotone').encode(
        alt.X('genre', sort=df_temp_genre['genre'].tolist()),
        alt.Y('rating',
            axis=alt.Axis(title='Average Rating', titleColor='#40bcf4'), scale=alt.Scale(zero=False)),
            color=alt.Color(value="#40bcf4")
    )

alt.layer(area, line).resolve_scale(
                y = 'independent'
).properties(width=800,
             height=300)

### Genre Combinations
It's common for a movie to have multiple genres, so let's dig deeper into genre combinations to get a more specific understanding of my movies

In [72]:
df_genre_combination = pd.DataFrame(columns=df_genre_merged.columns)
genres = df_temp_genre['genre'].tolist()
for i in range(len(genres)):
    for j in range(i+1, len(genres)):
        # .copy() gives an independent frame, so assigning to 'genre' below
        # no longer triggers pandas' SettingWithCopyWarning
        df_ha = df_genre_merged[(df_genre_merged['genre'] == genres[i]) | (df_genre_merged['genre'] == genres[j])].copy()
        if len(df_ha) != 0:
            df_ha['genre'] = genres[i] + " & " + genres[j]
            # a film id that shows up twice here has both genres of the pair
            df_ha = df_ha[df_ha.duplicated('id')]
            df_genre_combination = pd.concat([df_genre_combination, df_ha]).reset_index(drop=True)
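
As a cross-check on the nested loop, the same pairs can be sketched per film with `itertools.combinations`: group the genres of each film, order them by the overall genre ranking, and emit every two-genre pair. The toy frame, the `order` list, and the `genre_pairs` helper are hypothetical names used only for this illustration.

```python
from itertools import combinations

import pandas as pd

# Hypothetical stand-in for df_genre_merged (only the columns needed here)
toy = pd.DataFrame({
    "id": [1, 1, 1, 2, 2, 3],
    "genre": ["crime", "thriller", "drama", "crime", "thriller", "comedy"],
})

# Hypothetical genre ranking, playing the role of df_temp_genre['genre'].tolist()
order = ["drama", "crime", "thriller", "comedy"]
rank = {genre: position for position, genre in enumerate(order)}

def genre_pairs(genres):
    """Return 'a & b' labels for every pair of genres attached to one film."""
    ordered = sorted(set(genres), key=rank.get)
    return [a + " & " + b for a, b in combinations(ordered, 2)]

# One list of pair labels per film, flattened to one row per (film, pair);
# films with a single genre produce an empty list, which explode() turns into NaN
pairs = toy.groupby("id")["genre"].apply(genre_pairs).explode().dropna()
pair_counts = pairs.value_counts()
print(pair_counts)
```

Ordering each film's genres by the overall ranking keeps the labels consistent, so a pair is always written as "crime & thriller" and never as "thriller & crime".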

In [73]:
df_genre_combination['genre'].value_counts().head()

crime & thriller    43
drama & comedy      36
drama & crime       35
drama & thriller    33
comedy & crime      28
Name: genre, dtype: int64

In [74]:
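# total likes and average rating per genre pair, joined with the number of films in each pair
# (groupby().size() yields the same column names on pandas 1.x and 2.x)
df_temp_comb = pd.merge(df_genre_combination.groupby(['genre']).agg({'liked':'sum', 'rating':'mean'}).reset_index(),
                        df_genre_combination.groupby('genre').size().reset_index(name='count'))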

In [77]:
df_temp_comb.sort_values('count', ascending=False).head(20)

Unnamed: 0,genre,liked,rating,count
39,crime & thriller,12,3.86,43
44,drama & comedy,13,4.125,36
45,drama & crime,13,4.014,35
55,drama & thriller,13,4.061,33
21,comedy & crime,11,3.929,28
33,crime & action,5,3.741,27
18,comedy & action,11,3.75,26
77,thriller & action,5,3.667,24
19,comedy & adventure,11,3.864,22
0,action & adventure,6,3.618,17


- Wow, it turns out my most-watched genre combination is crime & thriller, while drama & comedy is in second place

In [78]:
# keep the 20 most frequent genre combinations, plus any tied with the 20th
df_temp_comb = df_temp_comb.sort_values('count', ascending=False).reset_index(drop=True)
n_genre = df_temp_comb.iloc[19]['count']
df_temp_comb = df_temp_comb[df_temp_comb['count']>=n_genre]

In [80]:
base = alt.Chart(df_genre_combination[df_genre_combination['genre'].isin(df_temp_comb['genre'])]).encode(
                        alt.X("genre", sort=df_temp_comb['genre'].tolist(), axis=alt.Axis(labelAngle=90))
                    )
area = base.mark_bar(tooltip=True).encode(
    alt.Y('count()',
        axis=alt.Axis(title='Count of Records')),
        color=alt.Color('liked', scale=alt.Scale(domain=[True, False], range=["#ff8000", "#00b020"]))
)
line = alt.Chart(df_temp_comb).mark_line(interpolate='monotone').encode(
        alt.X('genre', axis=alt.Axis(title='genre combination'), sort=df_temp_comb['genre'].tolist()),
        alt.Y('rating',
            axis=alt.Axis(title='Average Rating', titleColor='#40bcf4'), scale=alt.Scale(zero=False)),
            color=alt.Color(value="#40bcf4")
    )
alt.layer(area, line).resolve_scale(y = 'independent').properties(width=800, height=300)

- Horror once again appears with the lowest average rating among my top genre combinations, even when it's combined with thriller

# Deployment with Streamlit
I also deployed these functions as a Streamlit web app, so you can use it to get these analyses and visualizations

Try it here:
https://letterboxd-friends-ranker.streamlit.app