## Emoji Data
I've decided to look into how emojis are used in Letterboxd reviews. As a first step, I reran my web scraper and expanded my corpus from ~25,000 reviews to ~48,000 (stored in all2.csv). I actually did this halfway through writing this notebook, because I found myself wanting a bigger sample. Going into this project, I wanted to track which movies are associated with which emojis, along with some kind of average score metric that could serve as a rough form of sentiment analysis.

The next step is loading the bigger corpus and converting its review and title columns to lists.

In [2]:
import pandas as pd
# raw string so the backslashes in the Windows path are not treated as escape sequences
df = pd.read_csv(r'C:\Users\Connor\Desktop\370\letterboxdCorpus\all2.csv')
reviews = df.reviews.tolist()
titles = df.movie.tolist()
print(reviews[500])
print(titles[500])

9/10 Next time I see a rich person I'm breaking their arm     
triangle-of-sadness


I'm going to take an object-oriented approach, so I'm going to write a class whose instances each represent one emoji. The goal is to have a list of these objects, differentiated by which emoji they portray. Each object has its emoji value, like the cowboy🤠, and a list of every movie where that emoji was found in a review. Alongside the movie list is the score list, where the score of each review is held at the same index as its movie.

For example, this:
The Good, the Bad and the Ugly - 10/10 best 🤠 movie
creates:
    Emote
    value: 🤠
    related_movies[0] = The Good, the Bad and the Ugly
    related_scores[0] = 10
and subsequent cowboy-emoji-using reviews will add onto this object.
    
The class also has a method for printing out related movies and scores, and a method for averaging the related scores array.

In [2]:
class Emote:
    def __init__(self, value):
        self.value = value
        self.related_movies = []
        self.related_scores = []
    def add_movie(self, movie_title):
        self.related_movies.append(movie_title)
    def add_score(self, score):
        if type(score) is int:
            self.related_scores.append(score)
    def display(self, display_limit):
        print("Emoji: " + self.value)
        print("Used in the following reviews (limited to " + str(display_limit) + "): ")
        limit = min(display_limit, len(self.related_movies))
        for i in range(limit):
            print("\t"+ self.related_movies[i] + ": " + str(self.related_scores[i]) + "/10")
    def calc_avg_score(self):
        numerator = 0
        denom = 0
        for number in self.related_scores:
            if number != -1:
                numerator += number
                denom += 1
        try:
            return numerator/denom
        except ZeroDivisionError:
            return -1
    

Now that the class is defined, I'm going to flatten every review that contains an emoji into strings of the form emoji::movie-title::score and then sort them.

In [24]:
import emoji
object_list = []
organized_data = []
for review, title in zip(reviews, titles):
    # distinct_emoji_list reports each emoji at most once per review
    emoji_list = emoji.distinct_emoji_list(review)
    if not emoji_list:
        continue
    # the score is whatever precedes the first "/", e.g. the 9 in "9/10 ..."
    try:
        score = int(review.split("/")[0])
    except ValueError:
        score = "None"
    for face in emoji_list:
        organized_data.append(face + "::" + title + "::" + str(score))

organized_data.sort()
print(organized_data[0:100])
        

['1️⃣::a-nightmare-on-elm-street::7', '2️⃣::a-nightmare-on-elm-street::7', '3️⃣::a-nightmare-on-elm-street::7', '4️⃣::a-nightmare-on-elm-street::7', '5️⃣::a-nightmare-on-elm-street::7', '6️⃣::a-nightmare-on-elm-street::7', '7️⃣::a-nightmare-on-elm-street::7', '8️⃣::a-nightmare-on-elm-street::7', '9️⃣::a-nightmare-on-elm-street::7', '©::cremaster-3::10', '®::cremaster-3::10', '®::kong-skull-island::7', '®::olympus-has-fallen::5', '®::the-big-shave::8', '‼️::all-that-heaven-allows::7', '‼️::badlands::6', '‼️::cinderella-2015::8', '‼️::evil-dead-rise::8', '‼️::mona-lisa-and-the-blood-moon::8', '‼️::shin-godzilla::8', '‼️::spider-man-homecoming::8', '‼️::the-dark-knight::8', '‼️::the-killer-2023::5', '‼️::the-meyerowitz-stories-new-and-selected::9', '‼️::the-spirit-of-45::7', '‼️::wild-child::7', '™::1917::8', '™::8-half::10', '™::amistad::6', '™::an-unmarried-woman::10', '™::annabelle-creation::4', '™::ant-man-and-the-wasp::7', '™::apocalypse-now::9', '™::baby-driver::5', '™::belfast::7',
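A slightly stricter way to pull the score off the front of a review is a regex that only accepts a leading "N/10". Splitting on the first "/" works for most of the corpus, but an unrated review that happens to start with something like "24/7" would be read as a score of 24. This is just a sketch: `SCORE_PREFIX` and `split_score` are names made up for this example, not part of the pipeline above.

```python
import re

# Matches a leading "N/10" rating such as "9/10 ..." or "10/10 ...".
SCORE_PREFIX = re.compile(r"^\s*(\d{1,2})/10\b")

def split_score(review):
    """Return (score, text); score is None when there is no rating prefix."""
    match = SCORE_PREFIX.match(review)
    if match is None:
        return None, review.strip()
    return int(match.group(1)), review[match.end():].strip()

print(split_score("9/10 Next time I see a rich person I'm breaking their arm"))
print(split_score("24/7 chaos, no rating given"))
```

Returning `None` instead of the string "None" or -1 keeps "no score" from ever being mistaken for a number later on.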

And now split the above format by "::" to easily create all the emoji objects.

In [4]:
previous = ""
for line in organized_data:
    #[0] = face, [1] = title, [2] = score(string)
    traits = line.split("::")
    face = traits[0]
    title = traits[1]
    try:
        score = int(traits[2])
    except ValueError:
        score = -1
        #this is the case where they leave a review with no score. 
        #the -1 is intended to be ignored in calculations
    if face != previous:
        # organized_data is sorted, so a new face means a new Emote
        object_list.append(Emote(face))
    # the Emote for the current face is always the most recently created one
    object_list[-1].add_movie(title)
    object_list[-1].add_score(score)

    previous = face
    
len(object_list)


428
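The same grouping could also be done without sorting or tracking the previous emoji, by keying a dictionary on the emoji. Below is a minimal sketch of that alternative, not what the cell above does. The three sample lines are hypothetical (made-up titles and scores in the emoji::title::score format), and `by_emoji` is a name invented for the example.

```python
from collections import defaultdict

# Hypothetical sample in the same emoji::title::score format as organized_data.
sample = [
    "🤠::django-unchained::9",
    "😭::aftersun::10",
    "🤠::the-hateful-eight::None",
]

# emoji -> list of (title, score); score is None when the review had no rating.
by_emoji = defaultdict(list)
for line in sample:
    face, title, raw_score = line.split("::")
    score = int(raw_score) if raw_score.isdigit() else None
    by_emoji[face].append((title, score))

print(len(by_emoji))      # number of unique emojis in the sample
print(by_emoji["🤠"])
```

Keeping each title and score together in one tuple also means the two can never drift out of step, which is something the parallel related_movies/related_scores lists rely on the calling code to guarantee.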

So 428 unique emojis were used across the corpus. Part of me expected most emojis to be used at least once, but scrolling through my phone, there seem to be a lot more than 428 of them. Anyway, here are some random stats gathered by the display and avg score methods.

In [5]:
import random
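for i in range(10):
    # randrange excludes the upper bound; randint(0, 428) could return 428 and raise IndexError
    rand = random.randrange(len(object_list))
    object_list[rand].display(10)
    print("avg: " + str(object_list[rand].calc_avg_score()))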

Emoji: 🙏
Used in the following reviews (limited to 10): 
	a-nightmare-on-elm-street: 7/10
	a-summers-tale: 10/10
	frances-ha: 9/10
	raiders-of-the-lost-ark: 10/10
	the-deer-king: 7/10
	topsy-turvy: 7/10
avg: 8.333333333333334
Emoji: 🎉
Used in the following reviews (limited to 10): 
	enter-the-void: 8/10
	pans-labyrinth: 9/10
	the-rental: 6/10
avg: 7.666666666666667
Emoji: 🌷
Used in the following reviews (limited to 10): 
	le-bonheur: 10/10
avg: 10.0
Emoji: ♠
Used in the following reviews (limited to 10): 
	cremaster-3: 10/10
avg: 10.0
Emoji: 💖
Used in the following reviews (limited to 10): 
	downton-abbey: 8/10
	edward-scissorhands: 9/10
	el-camino-a-breaking-bad-movie: 6/10
	flavors-of-youth: 8/10
	hearts-beat-loud: 8/10
	hush-2001: 9/10
	inception: 10/10
	kedi-2016: 9/10
	midnight-mass-2021: -1/10
	notorious: 8/10
avg: 8.176470588235293
Emoji: 💣
Used in the following reviews (limited to 10): 
	suburbicon: 4/10
	the-super-mario-bros-movie: 7/10
avg: 5.5
Emoji: 👀
Used in the following 

Two burning questions: what is the most used emoji, and how many times have users put an emoji in their review?

In [22]:
#emoji frequency
freq = {}
summ = 0
for obj in object_list:
    freq[obj.value] = len(obj.related_movies)
    summ += len(obj.related_movies)
sorted_freq = sorted(freq.items(), key=lambda kv: kv[1], reverse=True)
print(sorted_freq)
print("Total emojis in corpus: " + str(summ))

[('😭', 104), ('❤️', 71), ('™', 63), ('😔', 41), ('🥺', 38), ('👍', 30), ('😌', 28), ('😳', 25), ('🤝', 25), ('😍', 24), ('💕', 23), ('👀', 22), ('👏', 22), ('😩', 22), ('🤠', 20), ('💖', 18), ('😂', 18), ('🤔', 18), ('🥰', 18), ('💀', 17), ('💞', 17), ('🔥', 17), ('✌️', 14), ('✨', 14), ('🎵', 14), ('👌', 14), ('™️', 13), ('🎶', 13), ('‼️', 12), ('👻', 12), ('😢', 12), ('💗', 11), ('💛', 11), ('💯', 11), ('💓', 10), ('😘', 10), ('🥳', 10), ('✅', 9), ('👄', 9), ('💔', 9), ('💘', 9), ('😎', 9), ('😬', 9), ('😱', 9), ('🤢', 9), ('🤭', 9), ('🥴', 9), ('🥵', 9), ('✊', 8), ('❤', 8), ('👉', 8), ('💙', 8), ('💚', 8), ('💜', 8), ('😁', 8), ('😜', 8), ('🙄', 8), ('✈️', 7), ('👈', 7), ('💝', 7), ('😡', 7), ('😪', 7), ('🥲', 7), ('✔', 6), ('🇺🇸', 6), ('👁', 6), ('👁️', 6), ('👏🏼', 6), ('💦', 6), ('😇', 6), ('😊', 6), ('😏', 6), ('😛', 6), ('😝', 6), ('😫', 6), ('😴', 6), ('🙏', 6), ('🤐', 6), ('☕', 5), ('❌', 5), ('🎄', 5), ('👅', 5), ('👎', 5), ('💅', 5), ('😀', 5), ('😋', 5), ('😐', 5), ('😙', 5), ('😵\u200d💫', 5), ('🤨', 5), ('🤪', 5), ('®', 4), ('☝️', 4), ('☺️', 4), ('❄️

Evidently, the most used emoji is the sobbing face, followed by the red heart and then the trademark symbol. Sobbing face makes sense: it can convey sadness, sobbing from laughter, or just general exaggerated misery, so it's a widely applicable response to movies. And red heart makes enough sense. But I had to look into the trademark symbol, and a lot of its usage turns out to be sarcastic. If you ctrl+f it in all2.csv, you'll see the weird ways people use the trademark emoji.

primeval,dirkh,"2/10 In the left corner we have Evil African Warlord™ killing innocent people!!!   In the right corner we have Crappy CGI Crocodile™ spilling lots of Crappy CGI Blood™ !!!   Caught in the middle are Tough Guy™, Hot Chick™, Science Guy™ and Black Guy™ accompanied by Weird Eccentric Guy™ trying to make sense of it all by doing silly things and running around a lot.  The best acting is done by the crocodile and the fact that this is all based on true events is just too silly to mention.     "

stoker,deathproof,10/10 this is My Movie™     

postcards-from-the-edge,brat,8/10 coincidence that they cast dennis quaid aka harrison ford Lite™ as carrie fisher's love interest?? i think not!!!! 👀 
8-half,usercillian,"10/10 bitch even that title is meta, consider me shook™     "

 
It's definitely some odd social media culture thing.
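One wrinkle in the frequency list: '™' and '™️' are counted as separate emojis, and so are '❤' and '❤️'. The usual cause is the invisible variation selector U+FE0F that some keyboards append to request emoji-style rendering. I'm assuming that's what is going on here. A minimal sketch of merging such variants, using the four counts from the output above (`normalize` is a helper invented for this example):

```python
from collections import Counter

VARIATION_SELECTOR_16 = "\ufe0f"  # requests emoji-style rendering of the previous character

def normalize(face):
    """Drop U+FE0F so text-style and emoji-style variants share one key."""
    return face.replace(VARIATION_SELECTOR_16, "")

# Counts copied from the frequency output above; the \ufe0f is written out explicitly.
raw_counts = {"™": 63, "™\ufe0f": 13, "❤\ufe0f": 71, "❤": 8}

merged = Counter()
for face, count in raw_counts.items():
    merged[normalize(face)] += count

print(merged.most_common())
```

Under that assumption the red heart would total 79 and the trademark symbol 76, so the top three stay in the same order behind the sobbing face's 104.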

Number of emojis in corpus: 1745, which I find greatly disappointing for 48,000 reviews. (Since distinct_emoji_list reports each emoji once per review, this is really a count of review/emoji pairs, so an emoji repeated within one review only counts once.) I suppose people tend to sit down and write their reviews on desktop rather than firing off a quick tweet-like response to a movie via the Letterboxd app. I also gathered the data mainly from the top users, who may have gained traction through beautifully written reviews that make an effort to avoid emojis.

So which emojis have generally high scores vs. low?

In [7]:
#emojis sorted by avg score
avgs_list = {}
for obj in object_list:
    #factoring out emojis used very little
    if len(obj.related_movies) >= 5:
        avgs_list[obj.value] = obj.calc_avg_score()
        
value_sorted = dict(sorted(avgs_list.items(), key=lambda item: item[1], reverse=True))
print(value_sorted)
        

{'✔': 8.666666666666666, '😢': 8.666666666666666, '🥲': 8.666666666666666, '🥵': 8.666666666666666, '💛': 8.545454545454545, '☕': 8.4, '😋': 8.4, '🙏': 8.333333333333334, '🥴': 8.333333333333334, '✌️': 8.214285714285714, '😵\u200d💫': 8.2, '💖': 8.176470588235293, '😇': 8.166666666666666, '🥰': 8.11111111111111, '🤠': 8.1, '❤': 8.0, '💚': 8.0, '🤨': 8.0, '🔥': 7.9375, '👌': 7.923076923076923, '👏': 7.909090909090909, '😱': 7.888888888888889, '💜': 7.875, '👀': 7.8, '💘': 7.75, '💙': 7.75, '😜': 7.75, '😳': 7.72, '😁': 7.714285714285714, '😩': 7.714285714285714, '👁️': 7.666666666666667, '👄': 7.666666666666667, '🥳': 7.666666666666667, '😌': 7.6, '💞': 7.588235294117647, '💯': 7.545454545454546, '✈️': 7.5, '✊': 7.5, '👏🏼': 7.5, '💓': 7.5, '💗': 7.5, '😭': 7.43, '‼️': 7.416666666666667, '👻': 7.416666666666667, '💕': 7.409090909090909, '🎄': 7.4, '💅': 7.4, '🎵': 7.357142857142857, '❤️': 7.338461538461538, '😎': 7.333333333333333, '😏': 7.333333333333333, '🤭': 7.333333333333333, '😘': 7.3, '👉': 7.285714285714286, '😡': 7.2857142857

I feel like what this data really says is that if a movie has an emoji in its review, then it's probably at least decent. What shocks me the most is that the thumbs down emoji got an average score of 5.25. Who in their right mind would use thumbs down in any review higher than a 3/10? I was really expecting some avg scores of ~2/10, but I suppose the law of averages strikes again. A -1 average in the output above just means that none of the reviews using that emoji had a score attached.
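One possible reason a "negative" emoji can land in the middle of the scale: with only a handful of rated reviews, one or two high scores (sarcasm, affection for a bad movie) pull the mean up a lot. Reporting the count and the median next to the mean makes that easier to spot. The sketch below uses made-up scores, not the actual thumbs down reviews, and `summarize` is a hypothetical helper rather than a method of the Emote class.

```python
from statistics import mean, median

NO_SCORE = -1  # same sentinel the notebook uses for unrated reviews

def summarize(scores):
    """Return (count, mean, median) over rated reviews, or None if there are none."""
    rated = [s for s in scores if s != NO_SCORE]
    if not rated:
        return None
    return len(rated), mean(rated), median(rated)

# Hypothetical scores: three low ratings, one 10/10, one unrated review.
print(summarize([2, 3, 3, 10, -1]))  # (4, 4.5, 3.0)
```

Here a single 10/10 moves the mean to 4.5 while the median stays at 3.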

Something I've also been thinking about for some reason is which emojis are used for Tarantino movies.

In [8]:
#emojis used for a certain movie
def emojis_in_movie(title):
    related_emojis = []
    for obj in object_list:
        if title in obj.related_movies:
            related_emojis.append(obj.value)
    return related_emojis

In [9]:
#see all2.csv for correct format of movie titles
print(emojis_in_movie("pulp-fiction"))
print(emojis_in_movie("reservoir-dogs"))
print(emojis_in_movie("kill-bill-vol-1"))
print(emojis_in_movie("kill-bill-vol-2"))
print(emojis_in_movie("django-unchained"))
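# the film's title is spelled "Inglourious Basterds"; if all2.csv spells the slug that way, this lookup is empty only because of the typo
print(emojis_in_movie("inglorious-basterds"))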
print(emojis_in_movie("jackie-brown"))
print(emojis_in_movie("once-upon-a-time-in-hollywood"))
print(emojis_in_movie("death-proof"))
print(emojis_in_movie("the-hateful-eight"))

[]
['😳']
[]
[]
['🤠']
[]
[]
['🎵', '😔']
[]
['🤠']


Ok, so no feet emojis like I was expecting. In fact, a disappointing lack of emojis in general. Out of everyone in the corpus who reviewed Pulp Fiction, no one wanted to put an emoji in their review???
At the very least, this data sort of tells you which of these movies are westerns.
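Since emojis_in_movie returns an empty list both for "no emojis" and for "no such title", a typo in a slug is easy to miss. A quick guard is to check the slug against the known titles and suggest near matches with difflib. This is only a sketch: `suggest_slugs` is a hypothetical helper and the `known` list is made up (in the notebook it would be something like set(titles)).

```python
import difflib

def suggest_slugs(query, known_slugs, limit=3):
    """Return slugs from known_slugs that closely resemble query."""
    return difflib.get_close_matches(query, known_slugs, n=limit, cutoff=0.8)

# Hypothetical slug list standing in for the titles column of all2.csv.
known = ["inglourious-basterds", "pulp-fiction", "jackie-brown", "kill-bill-vol-1"]

for query in ["inglorious-basterds", "pulp-fiction"]:
    if query not in known:
        print(query, "not found; did you mean", suggest_slugs(query, known))
```

With this toy list, only the misspelled slug triggers a message, and the suggestion is the correctly spelled one.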

In [10]:
#vector graph
#i didn't actually do one because I had errors associated with labels being emojis

I'm gonna see what gensim's Word2Vec most_similar can tell me. First I gotta tokenize everything with nltk and get the model trained.

In [4]:
#most similar word2Vec model
import nltk
from nltk.tokenize import word_tokenize

sentences = []
for review in reviews:
    # drop the leading score such as "9/10 " (assumes the prefix is about 5 characters long)
    lower2 = review[5:].lower()
    tokens2 = word_tokenize(lower2)
    sentences.append(tokens2)
print(sentences[0:5])

from gensim.models import Word2Vec
print(len(sentences))
model = Word2Vec(sentences = sentences,
                 window = 5, 
                 min_count =1,
                 vector_size = 5)

[['i', 'highly', 'recommend', 'everyone', 'get', 'drunk', 'with', 'friends', 'and', 'watch', 'this'], ['my', 'introduction', 'to', 'john', 'cena', "'s", 'absolutely', 'delicious', 'looking', 'biceps'], ['not', 'fast', 'enough', '!'], ['on', 'march', '24th', '2023', 'kanye', 'west', 'said', 'he', 'is', 'no', 'longer', 'anti', 'semitic', 'because', 'of', 'this', 'movie', 'and', 'jonah', 'hill'], ['why', 'are', 'old', 'people', 'having', 'fun', 'they', "'re", 'literally', 'gon', 'na', 'die', 'tomorrow', '😭']]
49074


In [26]:
print(model.wv.most_similar("🤠", topn= 20))

[('dolt', 0.9995066523551941), ('suffocates', 0.999189019203186), ('ohhhh', 0.998142421245575), ('rouge', 0.9977474212646484), ('paterson', 0.9975634217262268), ('43', 0.9974400997161865), ('landis', 0.9960612654685974), ('poppy', 0.9954481720924377), ('mirjam', 0.9954425096511841), ('89/100', 0.9950217604637146), ('psychoactive', 0.9949510097503662), ('relationships-', 0.9948994517326355), ('marking', 0.9944379925727844), ('metadoc', 0.9940840601921082), ('bbc', 0.9940271973609924), ('fondly', 0.9937567710876465), ('wanti', 0.9936219453811646), ('clings', 0.9932264089584351), ('firefighting', 0.9931585192680359), ('suppress', 0.9928309321403503)]


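Hmm, not seeing the correlation here. A couple of these words only appear once in the corpus. Maybe this function doesn't handle emojis well, though the model settings are a more likely culprit: with vector_size = 5 there are only five dimensions to work with, so lots of unrelated words end up with nearly identical directions (every similarity above is over 0.99), and min_count = 1 keeps words that appear only once, whose vectors are mostly noise.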
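As a sanity check that doesn't depend on a trained model, it's possible to just count which tokens show up near an emoji. This is plain co-occurrence counting, a much simpler technique than Word2Vec and not a replacement for it. `neighbors` is a helper invented for this sketch and the two token lists are toy data shaped like the tokenized reviews above.

```python
from collections import Counter

def neighbors(sentences, target, window=5):
    """Count tokens appearing within `window` positions of `target`."""
    counts = Counter()
    for tokens in sentences:
        for i, token in enumerate(tokens):
            if token != target:
                continue
            lo, hi = max(0, i - window), i + window + 1
            # count every nearby token except the target itself
            counts.update(t for t in tokens[lo:hi] if t != target)
    return counts

# Two toy token lists standing in for the real `sentences`.
toy = [
    ["can", "i", "get", "a", "yeeeehaw", "🤠"],
    ["best", "🤠", "movie", "yeeeehaw"],
]
print(neighbors(toy, "🤠", window=2).most_common(3))
```

On the real corpus the top results would mostly be stopwords like "the" and "i", so some filtering would be needed before reading anything into them.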

Lastly, let's just throw in some other nltk function calls.

In [31]:
words = []
for sentence in sentences:
    for item in sentence:
        words.append(item)
corpus = nltk.Text(words)

In [35]:
corpus.concordance("🤠", 100, 25)
corpus.similar("🤠")

Displaying 15 of 15 matches:
t for me , dear god i wish steve zahn was my dad 🤠 you ca n't , and i ca n't stress this enough , f
 cinematic history methinks can i get a yeeeehaw 🤠 martin scorsese literally wishes he made a plot 
 ! ! ! cowboy steven yeun please be my boyfriend 🤠 how can something be so beautiful yet so painful
o-yeon this was made for the girls and gays only 🤠 shantay , you stay park chan-wook more like park
st greek director this could never be made today 🤠 but only because laugh-out-loud studio comedies 
asleep three times but that 's on me , not him . 🤠 cat i am catsitting for jumped on bed during the
ring the sex scene and scared the shit out of me 🤠 this happened to me in 2019 feels like aidan qui
an stevens 😔 👌/|| _ _/¯ ¯ _ king of gay cinema 👋 🤠 || _ _/¯ ¯ _ hoping a king princess song never m
nd the underrated strength required to endure it 🤠 mesmerising sir ari aster sure does know how to 
ve her ! ! ! the ‘ d ' is silent hillbilly ! ! ! 🤠 phenomenal cinema ac

In [36]:
corpus.concordance("😭", 100, 25)
corpus.similar("😭")

Displaying 25 of 52 matches:
aving fun they 're literally gon na die tomorrow 😭 kids movie where dolls fight for their lives aft
rsten word for word😭😭😭 paulie stop kissing gibby 😭 there is so much that i loved about this holy sh
he director went from this to the last of us hbo 😭 if you use snapchat in 2023 you deserve to die t
idea in human history and they just went with it 😭 we need fewer women in stem ! ! ! going to a par
tial crisis this is currently putting me through 😭 me in the sky with diamonds they really made thi
 now how am i supposed to go to sleep after this 😭 i literally do n't know what to make of this but
t ca n't be that bad '' well actually yes it can 😭 this was so good except for the guy next to us l
amy dunne in gone girl ? ? ? ? ? not that ending 😭 i 'm never gon na think about jeffrey dahmer the
 apologizing for this painful shakespearean shit 😭 nwr really had an impact on ryan gosling huh i '
n depends diapers every single night out of fear 😭 `` this is so metaph