# 0. Introduction

The kind of classification we are interested in for this program is document classification, specifically of movie titles. Titles are the first thing we take in whenever something new is airing. Beyond the trailers, actors and directors involved, we read a name, and from there we decide whether it is worth a watch or further inquiry, or if it's a waste of our time. As an avid movie lover, I spend a lot of time reading about and watching movies, and an intriguing title is often what hooks me.

Specifically for this paper I will be coding supervised classification, in which a feature extractor converts each input into a feature set and a classifier learns from labelled examples to assign a label, like positive or negative (linguist89, 2025).

I will start by following the NLTK document classification exercise on its corpus of names, and then use this framework to run a similar experiment on my own data: the titles of 2024 feature films.

# 1. Packages to be installed

The first thing we do is import the tools needed for our program. The first package I will be installing is the most important one and the base for everything I will be doing: nltk (Natural Language Toolkit). I will then import the regular expression (re) and random modules to support the tools built into nltk.

The first package is nltk, the Natural Language Toolkit. We have used this platform in class to work with data from corpora and for language processing, and I find it to be a good tool for the scope of my paper (Hansen, Olsen and Enevoldsen, 2023). Please see the bibliography in the written portion of this exam for all references made in here.

The next package is pandas, which lets us manipulate and inspect dataframes. We have used it in both the first and second semester of the master's program, so I see it fit to use it again for my set of data.

Note that if there is a hashtag in front of the %pip command, please remove it before attempting to run the code. It is simply there so the packages do not download over and over.

A large portion of my code makes reference to homework we were assigned and worked with in class, which can be accessed through Stephan's github account (linguist89) or through this link: https://github.com/linguist89/compling25-exercises-week11. Please reach out to him or Ross if there are any issues accessing this worksheet.

In [388]:
%pip install nltk

Note: you may need to restart the kernel to use updated packages.


In [389]:
import nltk, re, random # nltk, regular expression, random module
from nltk import word_tokenize

Although NLTK has a corpus of movie-related items, it is limited to reviews, and these reviews do not contain the information relevant to my thesis, namely titles of movies, so I will move forward with the corpus containing names. From here, I will build on the gendered features we explored in class, and run similar code for the 2024 feature film titles.
The goal is to see where this model finds consistencies for the titles involved, and whether this is comparable to that of standard names.

An example of an inquiry could be: are titles that are deemed more likely to be feminine more positively reviewed?

This entire thing is a bit arbitrary, admittedly, but a curious case nonetheless.

# 2. Gendered name identification

This section is based largely on the code we worked with in class for week 11. This code builds on NLTK's corpus of names, and in class we coded to see if there was any reliability in assuming features of names being gendered. According to the homework notes "The returned dictionary, known as a feature set, maps from features' names to their values. Feature names are case-sensitive strings that typically provide a short human-readable description of the feature. Feature values are values with simple types, such as booleans, numbers, and strings." (linguist89, 2025)
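As a small illustration of that description, the sketch below builds a feature set with one string, one number, and one boolean feature. The function name `example_features` and the three features chosen are my own illustration and are not part of the class code.

```python
def example_features(word):
    """Return a feature set: a dict mapping feature names to simple values."""
    return {
        'last_letter': word[-1:].lower(),                       # string feature
        'length': len(word),                                    # numeric feature
        'starts_with_vowel': word[:1].lower() in set('aeiou'),  # boolean feature
    }

features = example_features('Shrek')
print(features)
```

Each key is a short, human-readable feature name, and each value is a simple type that a classifier can compare across names.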

For example, the last letter of the name Shrek is 'k', which is a feature the classifier must then map onto one of two labels, female or male. This whole thing is quite binary, and there is no in-between encoding for gender-neutral names, as this was not a part of the corpus nor something I am able to append to the corpus data.

Firstly, we write code which will extract the last letter of any given name (word)

In [390]:
def gender_features(word):
    # Extracting the last letter of the word
    return {'last_letter': word[-1:]}

In [391]:
gender_features('Shrek')

{'last_letter': 'k'}

I will also run my own name below, for fun

In [392]:
gender_features('Suzan')

{'last_letter': 'n'}

Now that we have a feature extractor, we can move on to importing the corpus of names from NLTK. The list is built with all male names first and all female names after them, so we shuffle it to make sure the training and test sets each contain a mix of both.

In [393]:
import nltk.corpus
nltk.download('names')
from nltk.corpus import names # we import names to access the names in the corpus
import random
names = ([(name, 'male') for name in names.words('male.txt')] +
         [(name, 'female') for name in names.words('female.txt')])
random.shuffle(names)
len(names) # 7944 names

[nltk_data] Downloading package names to
[nltk_data]     C:\Users\a\AppData\Roaming\nltk_data...
[nltk_data]   Package names is already up-to-date!


7944

Following what was done in class, we now use the feature extractor to process the data and divide it into two groups, one training set and one test set. The training set is used to train a new "naive Bayes" classifier (linguist89, 2025).

In [394]:
featuresets = [(gender_features(n), g) for (n,g) in names]
train_set, test_set = featuresets[500:], featuresets[:500] # test set is first 500 records
# train set is everything after the first 500 records, i.e. 7444 names. We train on far
# more data than we test
classifier = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, test_set)) # varies with the shuffle, e.g. 0.78 in this run


0.78


Let's see what happens if we get the first ten results from the training set, just to test that everything works.

In [395]:
train_set[:10] # first 10 records in the training set

[({'last_letter': 'y'}, 'female'),
 ({'last_letter': 'y'}, 'female'),
 ({'last_letter': 'y'}, 'male'),
 ({'last_letter': 'e'}, 'female'),
 ({'last_letter': 'a'}, 'female'),
 ({'last_letter': 's'}, 'male'),
 ({'last_letter': 'n'}, 'male'),
 ({'last_letter': 'e'}, 'female'),
 ({'last_letter': 'y'}, 'female'),
 ({'last_letter': 'n'}, 'female')]

And again for the test set

In [396]:
test_set[:10] # first 10 records in the test set

[({'last_letter': 'r'}, 'female'),
 ({'last_letter': 'e'}, 'male'),
 ({'last_letter': 's'}, 'male'),
 ({'last_letter': 'e'}, 'male'),
 ({'last_letter': 'e'}, 'male'),
 ({'last_letter': 'h'}, 'female'),
 ({'last_letter': 'y'}, 'female'),
 ({'last_letter': 'y'}, 'female'),
 ({'last_letter': 'e'}, 'female'),
 ({'last_letter': 'e'}, 'female')]

Both samples contain a mix of feminine and masculine labels, which is a good sign for us, although ten records per set are too few to rule out an overrepresentation of either.
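To put a number on the balance of labels rather than judging it by eye, the labels can be counted. The sketch below counts the ten test-set labels printed above; the variable names are my own, and the commented line at the end assumes the `test_set` variable from the earlier cell.

```python
from collections import Counter

# The ten test-set records printed above, reduced to their labels
test_labels = ['female', 'male', 'male', 'male', 'male',
               'female', 'female', 'female', 'female', 'female']

label_counts = Counter(test_labels)
print(label_counts)

# On the real data the same idea would be (assuming test_set from the cell above):
# Counter(label for _, label in test_set)
```

Counting the full split instead of the first ten records would give a more reliable picture of whether one label dominates.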

Let's test out some names

In [397]:
classifier.classify(gender_features('Anna')) 

'female'

In [398]:
classifier.classify(gender_features('John')) # classify a single name

'male'

What about a name we are sure is not a traditional name? I'm thinking of dear Smeagol AKA Gollum from Lord of the Rings. In the run shown below, the classifier labels both of his names as male, despite neither being present in the list of names from nltk.

Oddly, Smeagol is sometimes predicted to be female instead of male. This is because the names are shuffled randomly, so the training set differs between runs. Re-running the notebook from the shuffle cell onwards can change the prediction.

In [399]:
classifier.classify(gender_features('Gollum'))

'male'

In [400]:
classifier.classify(gender_features('Smeagol'))

'male'
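To see how close a call like Smeagol's is, the classifier can be asked for probabilities instead of a single label through `prob_classify`. The sketch below is a minimal illustration on hypothetical toy data, so it runs without the names corpus; `last_letter_feature`, `toy_names`, and `toy_classifier` are names I made up for this example.

```python
import nltk

def last_letter_feature(word):
    # Same idea as gender_features: the last letter is the only feature
    return {'last_letter': word[-1:].lower()}

# Hypothetical toy training data, far smaller than the names corpus
toy_names = [('Anna', 'female'), ('Maria', 'female'), ('Carol', 'female'),
             ('John', 'male'), ('Mark', 'male'), ('Paul', 'male')]
toy_train = [(last_letter_feature(name), label) for name, label in toy_names]
toy_classifier = nltk.NaiveBayesClassifier.train(toy_train)

# prob_classify returns a probability distribution over the labels
dist = toy_classifier.prob_classify(last_letter_feature('Smeagol'))
probabilities = {label: dist.prob(label) for label in dist.samples()}
print(probabilities)
```

When the two probabilities are close to each other, a small change in the training set can be enough to flip the predicted label.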

We continue by measuring the accuracy of the classifier against the data contained in the test set

In [401]:
nltk.classify.accuracy(classifier, test_set)

0.78

My test set came out with an accuracy of 0.78 in this run, a number that varies with the randomly generated test set.
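One way to make the split, and with it the accuracy, repeatable between runs is to shuffle with a fixed seed. The sketch below shows the idea on a toy list; the helper `shuffled_split` and the seed value are assumptions of mine and are not used elsewhere in this notebook.

```python
import random

def shuffled_split(items, test_size, seed):
    """Shuffle a copy of items with a fixed seed and split off a test set."""
    rng = random.Random(seed)   # private generator, leaves the global random state alone
    shuffled = list(items)
    rng.shuffle(shuffled)
    return shuffled[test_size:], shuffled[:test_size]   # (train, test)

toy_names = ['Anna', 'John', 'Maria', 'Mark', 'Carol', 'Paul', 'Suzan', 'Shrek']
train_a, test_a = shuffled_split(toy_names, test_size=2, seed=42)
train_b, test_b = shuffled_split(toy_names, test_size=2, seed=42)
print(test_a == test_b)   # same seed, same split
```

With a fixed seed the same names land in the test set every time, so differences in accuracy come from changes to the features and not from the shuffle.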

Let's see which features are the most informative for reaching this accuracy distribution:

In [402]:
classifier.show_most_informative_features(5)

Most Informative Features
             last_letter = 'a'            female : male   =     33.5 : 1.0
             last_letter = 'k'              male : female =     30.5 : 1.0
             last_letter = 'f'              male : female =     16.5 : 1.0
             last_letter = 'p'              male : female =     11.1 : 1.0
             last_letter = 'v'              male : female =     11.1 : 1.0
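The ratios in this table compare how likely a feature value is under one label versus the other. The sketch below computes such a ratio from plain counts on hypothetical toy data. It is a simplification: as far as I understand, NLTK smooths its probability estimates, so its numbers will not match a raw count ratio exactly.

```python
from collections import Counter

# Hypothetical toy data: (last letter, label) pairs, not taken from the names corpus
toy = ([('a', 'female')] * 6 + [('a', 'male')] * 1 +
       [('k', 'male')] * 4 + [('k', 'female')] * 1 +
       [('n', 'male')] * 5 + [('n', 'female')] * 3)

label_totals = Counter(label for _, label in toy)
pair_counts = Counter(toy)

def likelihood(letter, label):
    """Unsmoothed estimate of P(last_letter = letter | label)."""
    return pair_counts[(letter, label)] / label_totals[label]

ratio_a = likelihood('a', 'female') / likelihood('a', 'male')
print(f"last_letter = 'a'   female : male = {ratio_a:.1f} : 1.0")
```

A ratio far from 1.0 means the letter is much more common at the end of names with one label than the other, which is what makes the feature informative.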


With the most basic framework set, I will be continuing my program by importing my data and starting the relevant programming for my thesis below

## 2.1 Picking the right features
It is good to settle on the relevant features for this program early, so that our model produces information we can use to find patterns of interest, like the position of letters, the length of names, or other features that might be common to movies with better scores.

In [403]:
def gender_features2(name):
    features = {}
    features["firstletter"] = name[0].lower()
    features["lastletter"] = name[-1].lower()
    for letter in 'abcdefghijklmnopqrstuvwxyz':
        features["count(%s)" % letter] = name.lower().count(letter)
        features["has(%s)" % letter] = (letter in name.lower())
    return features
# We have 54 features: firstletter, lastletter, and a count and a has feature for each of the 26 letters.

In [404]:
#Let's test it out:
gender_features2('Shrek')

{'firstletter': 's',
 'lastletter': 'k',
 'count(a)': 0,
 'has(a)': False,
 'count(b)': 0,
 'has(b)': False,
 'count(c)': 0,
 'has(c)': False,
 'count(d)': 0,
 'has(d)': False,
 'count(e)': 1,
 'has(e)': True,
 'count(f)': 0,
 'has(f)': False,
 'count(g)': 0,
 'has(g)': False,
 'count(h)': 1,
 'has(h)': True,
 'count(i)': 0,
 'has(i)': False,
 'count(j)': 0,
 'has(j)': False,
 'count(k)': 1,
 'has(k)': True,
 'count(l)': 0,
 'has(l)': False,
 'count(m)': 0,
 'has(m)': False,
 'count(n)': 0,
 'has(n)': False,
 'count(o)': 0,
 'has(o)': False,
 'count(p)': 0,
 'has(p)': False,
 'count(q)': 0,
 'has(q)': False,
 'count(r)': 1,
 'has(r)': True,
 'count(s)': 1,
 'has(s)': True,
 'count(t)': 0,
 'has(t)': False,
 'count(u)': 0,
 'has(u)': False,
 'count(v)': 0,
 'has(v)': False,
 'count(w)': 0,
 'has(w)': False,
 'count(x)': 0,
 'has(x)': False,
 'count(y)': 0,
 'has(y)': False,
 'count(z)': 0,
 'has(z)': False}

I also want to make an alternative one that includes special characters, for the sake of the movies in the spreadsheet. Some of them contain special characters like : or !, and I think it's interesting to include them as they are. 

In [405]:
def gender_features_special(name):
    features = {}
    features["firstletter"] = name[0].lower()
    features["lastletter"] = name[-1].lower()
    features["has(the)"] = ('the' in name.lower())
    for letter in '1234567890abcdefghijklmnopqrstuvwxyz&#@!$%^&*+-=;:,.?':
        features["count(%s)" % letter] = name.lower().count(letter)
        features["has(%s)" % letter] = (letter in name.lower())
    return features


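The special alphabet contains 52 distinct characters: 10 digits, 26 letters, and 16 special characters ('&' appears twice in the string but is only stored once). Each character yields a count and a has feature, which gives 104 features. Together with firstletter, lastletter, and the 'the' feature, which I added because the word appears in many of the movie names, the total comes to 107 features.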

Let's compare the two by using the name Shrek with and without an exclamation mark in the two different models:

In [406]:
len(gender_features2("Shrek"))

54

In [407]:
len(gender_features2("Shrek!"))

54

In [408]:
len(gender_features_special("Shrek"))

107

In [409]:
len(gender_features_special("Shrek!"))

107

As seen above, adding the exclamation mark to Shrek changes the number of features in neither gender_features2 nor the special (extended) extractor. The number of features is fixed by the alphabet: every character in it gets a count and a has feature, whether or not it occurs in the name. Only the values of those features change.

In [411]:
#Let's prove this with another example:
gender_features2('Window123')
len(gender_features2('Window123'))

54

In [412]:
gender_features_special('Window123')
len((gender_features_special('Window123')))

107
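The feature counts reported in this section can be reproduced by counting the distinct characters in each alphabet. The sketch below does that arithmetic; the helper `expected_feature_count` is a name I made up for this check.

```python
letters = 'abcdefghijklmnopqrstuvwxyz'
special = '1234567890abcdefghijklmnopqrstuvwxyz&#@!$%^&*+-=;:,.?'

def expected_feature_count(alphabet, extra_features):
    """Two features (count and has) per distinct character, plus the extras."""
    return 2 * len(set(alphabet)) + extra_features

basic_count = expected_feature_count(letters, extra_features=2)    # firstletter, lastletter
special_count = expected_feature_count(special, extra_features=3)  # plus has(the)
print(basic_count, special_count)
```

Using `set` matters here because a character that is repeated in the alphabet string overwrites its own dictionary keys and therefore adds no new features.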

# 3. Classification of the 2024 movie title pool

Now that we have the feature extractor, we can move on to the relevant data for this paper. 

I put all of the titles of the 2024 releases from the .csv file called IMDB_LIST_266_massive into a list called movie_titles. Admittedly, I took the titles from the list I exported from IMDb and ran a Copilot script to automate turning each of them into an item in my list.

Note that the order below is seemingly random, and is not based on alphabetical order, popularity or release date. I only know that this is ranked by an arbitrary "list order", which I cannot find the parameters for. The same order is of course in the IMDb spreadsheet, and was imported in this order from IMDb itself.


In [413]:
movie_titles = [
    "Conclave",
    "Gladiator II",
    "The Brutalist",
    "Babygirl",
    "Anora",
    "A Complete Unknown",
    "The Substance",
    "Nosferatu",
    "Beetlejuice Beetlejuice",
    "Fight or Flight",
    "Speak No Evil",
    "Heretic",
    "Kraven the Hunter",
    "Wicked",
    "Twisters",
    "Deadpool & Wolverine",
    "The Order",
    "Paddington in Peru",
    "Dune: Part Two",
    "The Wild Robot",
    "Mufasa: The Lion King",
    "Alien: Romulus",
    "Trap",
    "A Real Pain",
    "I'm Still Here",
    "Longlegs",
    "The Ministry of Ungentlemanly Warfare",
    "Small Things Like These",
    "Smile 2",
    "Venom: The Last Dance",
    "Moana 2",
    "Flow",
    "The Beekeeper",
    "We Live in Time",
    "Blink Twice",
    "Sonic the Hedgehog 3",
    "Civil War",
    "Juror #2",
    "The Apprentice",
    "Parthenope",
    "It Ends with Us",
    "Furiosa: A Mad Max Saga",
    "Abigail",
    "The Count of Monte-Cristo",
    "Presence",
    "The Fall Guy",
    "Better Man",
    "Carry-On",
    "Wolfs",
    "MaXXXine",
    "Terrifier 3",
    "Megalopolis",
    "Subservience",
    "Road House",
    "Challengers",
    "Madame Web",
    "Oddity",
    "September 5",
    "Saturday Night",
    "Joker: Folie a Deux",
    "The Last Showgirl",
    "Cuckoo",
    "The Idea of You",
    "Inside Out 2",
    "Y2K",
    "Transformers One",
    "Borderlands",
    "Queer",
    "Kinds of Kindness",
    "Elevation",
    "Rebel Ridge",
    "The Crow",
    "The Room Next Door",
    "Godzilla x Kong: The New Empire",
    "Lucky Baskhar",
    "Salem's Lot",
    "Here",
    "Despicable Me 4",
    "A Different Man",
    "Horizon: An American Saga - Chapter 1",
    "Love Lies Bleeding",
    "Azrael",
    "Arcadian",
    "My Old Ass",
    "Caddo Lake",
    "Nickel Boys",
    "Miller's Girl",
    "Emilia Perez",
    "A Quiet Place: Day One",
    "Pushpa: The Rule - Part 2",
    "Argylle",
    "Bad Boys: Ride or Die",
    "IF",
    "Kingdom of the Planet of the Apes",
    "The Lord of the Rings: The War of the Rohirrim",
    "Drive-Away Dolls",
    "Kung Fu Panda 4",
    "It's What's Inside",
    "Fly Me to the Moon",
    "Red One",
    "Snack Shack",
    "The Six Triple Eight",
    "I Saw the TV Glow",
    "Ghostbusters: Frozen Empire",
    "Fighter",
    "Nightbitch",
    "The Watchers",
    "Maria",
    "The Killer's Game",
    "Damsel",
    "The Outrun",
    "The Instigators",
    "Monkey Man",
    "The Seed of the Sacred Fig",
    "Mean Girls",
    "The Girl with the Needle",
    "Immaculate",
    "Kneecap",
    "Memoir of a Snail",
    "Never Let Go",
    "Don't Move",
    "Maharaja",
    "Thelma",
    "Mothers' Instinct",
    "How to Make Millions Before Grandma Dies",
    "The Union",
    "Land of Bad",
    "In a Violent Nature",
    "All We Imagine as Light",
    "Hellboy: The Crooked Man",
    "Spaceman",
    "Greedy People",
    "Afraid",
    "Blitz",
    "The First Omen",
    "Solo Leveling: ReAwakening",
    "Ricky Stanicky",
    "Attack on Titan the Movie: The Last Attack",
    "Culpa Tuya",
    "Time Cut",
    "Apartment 7A",
    "Nr. 24",
    "Incoming",
    "Uglies",
    "Exhuma",
    "A Family Affair",
    "Beverly Hills Cop: Axel F",
    "The Deliverance",
    "The Killer",
    "Upgraded",
    "Apocalypse Z: The Beginning of the End",
    "Tarot",
    "Sleeping Dogs",
    "Lift",
    "The Garfield Movie",
    "Kalki 2898 AD",
    "Arthur the King",
    "Back to Black",
    "Unfrosted",
    "Da-di",
    "Rebel Moon - Part Two: The Scargiver",
    "Atlas",
    "Wallace & Gromit: Vengeance Most Fowl",
    "Jackpot!",
    "Night Swim",
    "Lonely Planet",
    "The Strangers: Chapter 1",
    "Under Paris",
    "Lisa Frankenstein",
    "Young Woman and the Sea",
    "Reagan",
    "Mother of the Bride",
    "Brothers",
    "Killer Heat",
    "The Tearsmith",
    "Sting",
    "Canary Black",
    "Kishkindha Kaandam",
    "Ordinary Angels",
    "No Way Up",
    "Look Back",
    "Scoop",
    "Winnie-the-Pooh: Blood and Honey 2",
    "Do Patti",
    "Bob Marley: One Love",
    "The Platform 2",
    "Manjummel Boys",
    "Trigger Warning",
    "Baby John",
    "Marco",
    "Amaran",
    "Meiyazhagan",
    "Orion and the Dark",
    "Stree 2: Sarkate Ka Aatank",
    "The Exorcism",
    "Imaginary",
    "Find Me Falling",
    "Our Little Secret",
    "Dear Santa",
    "Teri Baaton Mein Aisa Uljha Jiya",
    "Sikandar Ka Muqaddar",
    "Article 370",
    "Sector 36",
    "Sookshma Darshini",
    "Munjya",
    "The American Society of Magical Negroes",
    "Irish Wish",
    "Aavesham",
    "Role Play",
    "Khel Khel Mein",
    "My Spy: The Eternal City",
    "The Merry Gentlemen",
    "Badland Hunters",
    "Hot Frosty",
    "Chandu Champion",
    "Players",
    "Singham Again",
    "Jigra",
    "Shaitaan",
    "That Christmas",
    "Bramayugam",
    "Ulajh",
    "Premalu",
    "Code 8: Part II",
    "Kanguva",
    "Devara Part 1",
    "Maharaj",
    "The Goat Life",
    "The Greatest of All Time",
    "Vettaiyan",
    "Anweshippin Kandethum",
    "Maidaan",
    "Sarfira",
    "Crew",
    "Madgaon Express",
    "Space Cadet",
    "Bhool Bhulaiyaa 3",
    "Murder Mubarak",
    "Srikanth",
    "Vicky Vidya Ka Woh Wala Video",
    "The Sabarmati Report",
    "Viduthalai Part 2",
    "Amar Singh Chamkila",
    "Hanu Man",
    "Color of Victory",
    "Blackout",
    "Auron Mein Kahan Dum Tha",
    "Savi",
    "Guntur Kaaram",
    "Raayan",
    "Intoxicated by Love",
    "Agni",
    "Bad Newz",
    "Bade Miyan Chote Miyan",
    "Swatantrya Veer Savarkar",
    "Merry Christmas",
    "Do Aur Do Pyaar",
    "Yudhra",
    "Now or Never!",
    "Mr. & Mrs. Mahi",
    "Indian 2",
    "Love Li",
    "Wild Wild Punjab",
    "Martin",
    "Kaam Chalu Hai",
    "Sipahi"
]

Let's double check that all 266 titles are in here

In [414]:
len(movie_titles)

266

Now comes the most intensive part of this program: importing the dataframe from the spreadsheet (a .csv file) containing all of my data. There is a lot of data...

We start by importing pandas, a toolkit that allows us to work with data in a user-friendly way, and is great for tables with lots of data. As we have worked with it in this course and in a course in the previous semester, I am a bit more familiar with it than I am with nltk.

I will be using the official documentation from Pandas to guide my coding process, especially to make sure I do not make massive mistakes with my massive dataset (pandas documentation — pandas 2.2.3 documentation, 2024)

In [415]:
import pandas as pd
import os

# Check current working directory
print("Current working directory:", os.getcwd())

# List files in the current directory
print("Files in directory:", os.listdir())

# Update the path below if your file is not in the current directory
csv_path = "G:\\Mit drev\\Kandidaten\\Computational Linguistics\\EXAM\\IMDB_LIST_266_massive.csv"

rawdata = pd.read_csv(csv_path, sep=';', quoting=1, encoding='utf-8')
rawdata = rawdata.replace('*', '')  # cells that contain only '*' become empty strings
rawdata

Current working directory: g:\Mit drev\Kandidaten\Computational Linguistics\EXAM
Files in directory: ['IMDB_LIST_266.csv', 'imdb_list.csv', 'Suz_Compling_Code.ipynb', 'suzan_fwpportfolio.ipynb', 'imdb_list_of_2024_us_releases.csv', 'IMDB_LIST_266_with_genres.csv', 'IMDB_LIST_266_massive.csv', 'teis_fun_with_pandas.ipynb']


Unnamed: 0,Position,Const,Created,Modified,Description,Title,Original Title,Url,Title Type,IMDb Rating,...,Rotten Tomatoes,NVotesRT,Metacritic,NVotesMC,FM_IMDb,NVotesFM_IMDb,Letterboxd,NVotesLB,Flickmetrix_total,Notes
0,Position,Const,Created,Modified,Description,Title,Original Title,URL,Title Type,IMDb Rating,...,,,,,,,,,,
1,1,tt20215234,21-05-2025,21-05-2025,,Conclave,Conclave,https://www.imdb.com/title/tt20215234/,Movie,7.4,...,81,181,79,41,74,132086,78,1023425,79,
2,2,tt9218128,21-05-2025,21-05-2025,,Gladiator II,Gladiator II,https://www.imdb.com/title/tt9218128/,Movie,7.5,...,66,214,64,41,66,216494,66,964396,67,
3,3,tt8999762,21-05-2025,21-05-2025,,The Brutalist,The Brutalist,https://www.imdb.com/title/tt8999762/,Movie,7.6,...,88,175,90,41,75,68946,80,544891,84,
4,4,tt30057084,21-05-2025,21-05-2025,,Babygirl,Babygirl,https://www.imdb.com/title/tt30057084/,Movie,7.7,...,72,143,79,41,60,47394,58,553519,58,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
262,262,tt27051027,21-05-2025,21-05-2025,,Love Li,Love li,https://www.imdb.com/title/tt27051027/,Movie,7.265,...,,,,,,,,,,doesn't exist on the site
263,263,tt31514262,21-05-2025,21-05-2025,,Wild Wild Punjab,Wild Wild Punjab,https://www.imdb.com/title/tt31514262/,Movie,7.266,...,46,7,,,62,15447,58,1965,55,no mc
264,264,tt15334030,21-05-2025,21-05-2025,,Martin,Martin,https://www.imdb.com/title/tt15334030/,Movie,7.267,...,,,,,,,,,,doesn't exist on the site
265,265,tt30783769,21-05-2025,21-05-2025,,Kaam Chalu Hai,Kaam Chalu Hai,https://www.imdb.com/title/tt30783769/,Movie,7.268,...,,,,,,,,,,doesn't exist on the site


There is a whole row which just repeats the column headers of the dataframe. Let's remove it.

In [416]:
# Let us remove the row where Position is 'Position' (the header row accidentally included in the data)
rawdata = rawdata[rawdata['Position'] != 'Position']
rawdata = rawdata.reset_index(drop=True)
rawdata

Unnamed: 0,Position,Const,Created,Modified,Description,Title,Original Title,Url,Title Type,IMDb Rating,...,Rotten Tomatoes,NVotesRT,Metacritic,NVotesMC,FM_IMDb,NVotesFM_IMDb,Letterboxd,NVotesLB,Flickmetrix_total,Notes
0,1,tt20215234,21-05-2025,21-05-2025,,Conclave,Conclave,https://www.imdb.com/title/tt20215234/,Movie,7.4,...,81,181,79,41,74,132086,78,1023425,79,
1,2,tt9218128,21-05-2025,21-05-2025,,Gladiator II,Gladiator II,https://www.imdb.com/title/tt9218128/,Movie,7.5,...,66,214,64,41,66,216494,66,964396,67,
2,3,tt8999762,21-05-2025,21-05-2025,,The Brutalist,The Brutalist,https://www.imdb.com/title/tt8999762/,Movie,7.6,...,88,175,90,41,75,68946,80,544891,84,
3,4,tt30057084,21-05-2025,21-05-2025,,Babygirl,Babygirl,https://www.imdb.com/title/tt30057084/,Movie,7.7,...,72,143,79,41,60,47394,58,553519,58,
4,5,tt28607951,21-05-2025,21-05-2025,,Anora,Anora,https://www.imdb.com/title/tt28607951/,Movie,7.8,...,87,193,91,41,76,150547,78,1494086,82,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
261,262,tt27051027,21-05-2025,21-05-2025,,Love Li,Love li,https://www.imdb.com/title/tt27051027/,Movie,7.265,...,,,,,,,,,,doesn't exist on the site
262,263,tt31514262,21-05-2025,21-05-2025,,Wild Wild Punjab,Wild Wild Punjab,https://www.imdb.com/title/tt31514262/,Movie,7.266,...,46,7,,,62,15447,58,1965,55,no mc
263,264,tt15334030,21-05-2025,21-05-2025,,Martin,Martin,https://www.imdb.com/title/tt15334030/,Movie,7.267,...,,,,,,,,,,doesn't exist on the site
264,265,tt30783769,21-05-2025,21-05-2025,,Kaam Chalu Hai,Kaam Chalu Hai,https://www.imdb.com/title/tt30783769/,Movie,7.268,...,,,,,,,,,,doesn't exist on the site


Admittedly, this is overkill. BUT! It does provide us with scores and a point of reference for how good a movie is beyond the feature extractor I built earlier.

Also, I spent way too many hours inputting the data, so bear with me for wanting to display it for a short while.

Let's group the columns we will want to use in the initial experiment for now. I will come back to the rest of the data more in the discursive parts of this assignment.

In [417]:
df_small_beta = rawdata.groupby(['Position', 'Title', 'Runtime (mins)', 'Genres', 'Flickmetrix_total']).size().reset_index(name='Count')
# Only 'Runtime (mins)' needs a new name; the other columns keep theirs
df_small_beta = df_small_beta.rename(columns={'Runtime (mins)': 'Runtime'})
df_small_beta

Unnamed: 0,Position,Title,Runtime,Genres,Flickmetrix_total,Count
0,1,Conclave,120,"Drama, Thriller",79,1
1,10,Fight or Flight,102,"Action, Comedy",64,1
2,100,Red One,123,"Action, Adventure, Comedy, Fantasy, Mystery",56,1
3,101,Snack Shack,112,Comedy,68,1
4,102,The Six Triple Eight,127,"Drama, History, War",64,1
...,...,...,...,...,...,...
261,95,The Lord of the Rings: The War of the Rohirrim,134,"Animation, Action, Adventure, Drama, Fantasy",63,1
262,96,Drive-Away Dolls,84,"Action, Comedy, Thriller",57,1
263,97,Kung Fu Panda 4,94,"Animation, Action, Adventure, Comedy, Family, ...",63,1
264,98,It's What's Inside,103,"Comedy, Mystery, Sci-Fi, Thriller",68,1


Some of the titles were not on Flick Metrix, so let's sort those out. The cells without an input are either missing or contain only whitespace, which is why we use the .isnull method together with a check for blank strings. We can then drop them from the dataframe and store the result as a new cleaned dataframe.
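The reason for checking both missing values and blank strings is that `.isnull` on its own does not flag a cell that holds an empty string or a space. The sketch below shows this on a small series of hypothetical values; the variable names are my own.

```python
import pandas as pd

scores = pd.Series([None, '', ' ', '79'])  # hypothetical values, not from the spreadsheet

is_missing = scores.isnull()                     # flags only the truly missing value
is_blank = scores.astype(str).str.strip() == ''  # flags empty and whitespace-only strings
needs_removal = is_missing | is_blank
print(needs_removal.tolist())
```

Combining the two masks with `|` marks a row for removal if either condition holds.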

Admittedly, I have struggled with getting the September 5 title to be correct in both the Excel file and in my code, so I will exclude it as well.

In [418]:
# Rows to drop: 'Flickmetrix_total' is missing or empty, or Title is 'September 5' or 'sep-05'
drop_mask = (
    df_small_beta['Flickmetrix_total'].isnull() |
    (df_small_beta['Flickmetrix_total'].astype(str).str.strip() == '') |
    (df_small_beta['Title'].str.strip().str.lower().isin(['september 5', 'sep-05']))
)
missing_scores = df_small_beta[drop_mask]

# Print the titles of the removed rows
print("Removed rows (no Flickmetrix_total or problematic title):")
print(missing_scores[['Title', 'Flickmetrix_total']])

# Remove those rows from the dataframe
df_cleaned = df_small_beta[~drop_mask].reset_index(drop=True)

# Lastly, I remove the same titles from the movie_titles list
removed_titles = set(missing_scores['Title'].str.strip())
movie_titles = [title for title in movie_titles if title.strip() not in removed_titles]

Removed rows (no Flickmetrix_total or problematic title):
                             Title Flickmetrix_total
7                          Fighter                  
68                            Dìdi                  
101                          Marco                  
103                    Meiyazhagan                  
124                 Khel Khel Mein                  
153              Bhool Bhulaiyaa 3                  
155                       Srikanth                  
157  Vicky Vidya Ka Woh Wala Video                  
158           The Sabarmati Report                  
164       Auron Mein Kahan Dum Tha                  
165                           Savi                  
170                           Agni                  
175                Do Aur Do Pyaar                  
176                         Yudhra                  
179                Mr. & Mrs. Mahi                  
181                        Love Li                  
183                         Martin       

In [419]:
# Check if 'Martin' is in the movie_titles list
print('Martin' in movie_titles)
# Or, to get its index (will raise ValueError if not found):
# print(movie_titles.index('Savi'))

False


In [420]:
df_cleaned # It should now be cleaned and show 246 movies now, AKA 246 rows

Unnamed: 0,Position,Title,Runtime,Genres,Flickmetrix_total,Count
0,1,Conclave,120,"Drama, Thriller",79,1
1,10,Fight or Flight,102,"Action, Comedy",64,1
2,100,Red One,123,"Action, Adventure, Comedy, Fantasy, Mystery",56,1
3,101,Snack Shack,112,Comedy,68,1
4,102,The Six Triple Eight,127,"Drama, History, War",64,1
...,...,...,...,...,...,...
241,95,The Lord of the Rings: The War of the Rohirrim,134,"Animation, Action, Adventure, Drama, Fantasy",63,1
242,96,Drive-Away Dolls,84,"Action, Comedy, Thriller",57,1
243,97,Kung Fu Panda 4,94,"Animation, Action, Adventure, Comedy, Family, ...",63,1
244,98,It's What's Inside,103,"Comedy, Mystery, Sci-Fi, Thriller",68,1


## 3.1 Sorting

Now we can get to the fun part: sorting!

This subpart is more of a formality so that we can get different sets of sorting into respective dataframes. This makes the process of comparing different values and sorting orders easier later on.

The reason I focus on titles, runtime, and genres for this programming is that these are the most common things we look at when picking a movie to watch. Subcategorizing by actors and directors is a much more biased line of research, and far too expansive for this paper. For example, someone who likes movies from one director is more likely to watch the rest of the movies from that director, even if they have bad ratings.

We are all also different in our tastes and in our general moods. Some people love long movies, some will sigh if a movie is anything over 2 hours long. Some love horror movies, others find them repulsive and will want to steer clear of any. For this reason, I will keep these categories (mostly) separated for most of my coding going forward.

Let's start by making sure we have all 246 movies that contain a Flick Metrix score and see them in alphabetical order:

In [421]:
total = df_cleaned.value_counts(['Title', 'Flickmetrix_total']).sort_index(ascending=True)
print(total)
# And the length of the total just below, it should say 246
print(len(total))

Title                               Flickmetrix_total
A Complete Unknown                  76                   1
A Different Man                     78                   1
A Family Affair                     45                   1
A Quiet Place: Day One              78                   1
A Real Pain                         79                   1
                                                        ..
Wild Wild Punjab                    55                   1
Winnie-the-Pooh: Blood and Honey 2  45                   1
Wolfs                               64                   1
Y2K                                 47                   1
Young Woman and the Sea             74                   1
Name: count, Length: 246, dtype: int64
246


We start by making a dataframe sorted by title in alphabetical order (ascending):

In [422]:
flickmetrix_sorted = df_cleaned.sort_values((['Title','Flickmetrix_total']), ascending=True)
flickmetrix_sorted

Unnamed: 0,Position,Title,Runtime,Genres,Flickmetrix_total,Count
202,6,A Complete Unknown,141,"Biography, Drama, Music",76,1
223,79,A Different Man,112,"Comedy, Drama, Thriller",78,1
51,146,A Family Affair,111,"Comedy, Drama, Romance",45,1
234,89,A Quiet Place: Day One,99,"Drama, Horror, Sci-Fi, Thriller",78,1
149,24,A Real Pain,90,"Comedy, Drama",79,1
...,...,...,...,...,...,...
166,263,Wild Wild Punjab,109,"Comedy, Romance",55,1
91,183,Winnie-the-Pooh: Blood and Honey 2,93,"Horror, Thriller",45,1
191,49,Wolfs,108,"Crime, Thriller",64,1
208,65,Y2K,91,"Comedy, Horror, Sci-Fi",47,1


In [423]:
# If we do the same with genres, it shows the entire groups of genres, while we are interested in the individual ones.
genres_sorted = df_cleaned.sort_values((['Genres','Title']), ascending=True)
genres_sorted.value_counts('Genres')
genres_sorted

Unnamed: 0,Position,Title,Runtime,Genres,Flickmetrix_total,Count
238,92,Bad Boys: Ride or Die,115,"Action, Adventure, Comedy, Crime, Thriller",66,1
177,36,Sonic the Hedgehog 3,110,"Action, Adventure, Comedy, Family, Fantasy, Sc...",72,1
2,100,Red One,123,"Action, Adventure, Comedy, Fantasy, Mystery",56,1
66,16,Deadpool & Wolverine,128,"Action, Adventure, Comedy, Sci-Fi",73,1
210,67,Borderlands,101,"Action, Adventure, Comedy, Sci-Fi, Thriller",35,1
...,...,...,...,...,...,...
220,76,Salem's Lot,114,"Horror, Thriller",50,1
102,195,The Exorcism,95,"Horror, Thriller",44,1
91,183,Winnie-the-Pooh: Blood and Honey 2,93,"Horror, Thriller",45,1
176,35,Blink Twice,102,"Mystery, Thriller",68,1


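As some of these movies have multiple genres, I import defaultdict, a dictionary that creates a default value (here an empty set) for any key it has not seen before. This lets me collect the titles under each individual genre without first checking whether that genre is already in the dictionary.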

In [424]:
from collections import defaultdict

# Create a dictionary to store genres and the titles they appear in
genre_titles = defaultdict(set)

# We run a for loop to iterate over the dataframe and split genres by comma
for _, row in df_cleaned.iterrows():
    genres = [g.strip() for g in row['Genres'].split(',')]
    for genre in genres:
        genre_titles[genre].add(row['Title'])

# Print the count of titles per genre and example titles
for genre, titles in genre_titles.items():
    print(f"{genre}: {len(titles)} titles, e.g. {list(titles)[:10]}")

# And a DataFrame to summarize the genres and their titles
genre_titles_df = pd.DataFrame({
    'Genre': list(genre_titles.keys()),
    'Number of Titles': [len(titles) for titles in genre_titles.values()],
    'Titles': [', '.join(sorted(titles)) for titles in genre_titles.values()]
})

genre_titles_df

# We have two different outputs below: one printed list with the number of titles per genre and up to ten example titles,
# and one more easy-on-the-eye dataframe with genres, number of titles, and some of the titles.
# Neither output is sorted by count; the genres appear in the order in which they were first encountered.
# Note: I'm not sure how to make the dataframe show all the titles, so it only shows the first few titles in each genre.

Drama: 129 titles, e.g. ['Joker: Folie à Deux', 'Conclave', 'Love Lies Bleeding', 'A Complete Unknown', 'The Lord of the Rings: The War of the Rohirrim', 'It Ends with Us', 'Small Things Like These', 'Megalopolis', 'Horizon: An American Saga - Chapter 1', 'The Deliverance']
Thriller: 111 titles, e.g. ['Joker: Folie à Deux', 'Conclave', 'Kraven the Hunter', 'Abigail', 'The Greatest of All Time', 'Love Lies Bleeding', 'Smile 2', 'Bad Boys: Ride or Die', 'Beverly Hills Cop: Axel F', 'Longlegs']
Action: 82 titles, e.g. ['Kraven the Hunter', 'The Greatest of All Time', 'Role Play', 'Love Lies Bleeding', 'Bad Boys: Ride or Die', 'Beverly Hills Cop: Axel F', 'The Lord of the Rings: The War of the Rohirrim', 'Azrael', 'Elevation', "The Killer's Game"]
Comedy: 78 titles, e.g. ['Role Play', 'Bad Boys: Ride or Die', 'Beverly Hills Cop: Axel F', 'Paddington in Peru', 'Y2K', 'The Garfield Movie', 'Snack Shack', 'The American Society of Magical Negroes', "The Killer's Game", 'The Union']
Adventure: 

Unnamed: 0,Genre,Number of Titles,Titles
0,Drama,129,"A Complete Unknown, A Different Man, A Family ..."
1,Thriller,111,"A Different Man, A Quiet Place: Day One, Abiga..."
2,Action,82,"Aavesham, Amaran, Apocalypse Z: The Beginning ..."
3,Comedy,78,"A Different Man, A Family Affair, A Real Pain,..."
4,Adventure,46,"Arthur the King, Atlas, Attack on Titan the Mo..."
5,Fantasy,34,"Beetlejuice Beetlejuice, Better Man, Damsel, D..."
6,Mystery,32,"Afraid, Anweshippin Kandethum, Badland Hunters..."
7,History,16,"Blitz, Chandu Champion, Color of Victory, I'm ..."
8,War,6,"Amaran, Blitz, Chandu Champion, Nr. 24, The Mi..."
9,Sci-Fi,37,"A Quiet Place: Day One, Afraid, Alien: Romulus..."
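
The note in the cell above leaves open how to show every title in the dataframe. One possible approach, sketched here on a hypothetical miniature of `genre_titles` (the three genres and their titles are stand-ins, not the real data), is to lift pandas' `display.max_colwidth` limit temporarily and to sort by the number of titles explicitly:

```python
import pandas as pd

# Hypothetical miniature of genre_titles (genre -> set of titles), for illustration only
genre_titles = {
    'Thriller': {'Conclave', 'Drive-Away Dolls'},
    'Drama': {'Conclave', 'Challengers', 'The Brutalist'},
    'Sport': {'Challengers'},
}

summary = pd.DataFrame({
    'Genre': list(genre_titles.keys()),
    'Number of Titles': [len(titles) for titles in genre_titles.values()],
    'Titles': [', '.join(sorted(titles)) for titles in genre_titles.values()],
})

# Sort explicitly, since a dictionary only remembers insertion order
summary = summary.sort_values('Number of Titles', ascending=False).reset_index(drop=True)

# Inside this block, long cell contents are not cut off with "..."
with pd.option_context('display.max_colwidth', None):
    full_text = summary.to_string(index=False)

print(full_text)
```

In a notebook, calling `display(summary)` inside the same `with` block should give the untruncated table; `pd.set_option('display.max_colwidth', None)` would change the setting for the whole session instead.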


And lastly (for now), by runtime:

In [425]:
# The most important step here is to convert the runtime column to numeric, so we can sort it as such.
df_cleaned['Runtime'] = pd.to_numeric(df_cleaned['Runtime'], errors='coerce')
runtime_sorted = df_cleaned.sort_values('Runtime', ascending=False)
runtime_sorted
# If correctly done, The Brutalist should be a whopping 216 minutes long (3.6 hours),
# and the shortest movie should be Look Back with 58 minutes (just two minutes shy of an hour).

Unnamed: 0,Position,Title,Runtime,Genres,Flickmetrix_total,Count
170,3,The Brutalist,216,Drama,84,1
236,90,Pushpa: The Rule - Part 2,201,"Action, Crime, Drama, Thriller",59,1
139,229,The Greatest of All Time,183,"Action, Thriller",50,1
143,232,Maidaan,181,"Biography, Drama, History, Sport",69,1
225,80,Horizon: An American Saga - Chapter 1,181,"Drama, Western",57,1
...,...,...,...,...,...,...
37,133,Afraid,84,"Horror, Mystery, Sci-Fi, Thriller",43,1
187,45,Presence,84,"Drama, Horror, Thriller",67,1
242,96,Drive-Away Dolls,84,"Action, Comedy, Thriller",57,1
69,163,Wallace & Gromit: Vengeance Most Fowl,82,"Animation, Adventure, Comedy, Family, Sci-Fi",83,1
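
Why the numeric conversion matters can be seen on a small example. The sketch below uses a hypothetical Series of runtimes stored as text, including one unparseable value (`'n/a'`, an assumption made purely for illustration), and compares a text sort with a numeric sort after `pd.to_numeric(..., errors='coerce')`:

```python
import math

import pandas as pd

# Hypothetical runtimes stored as strings; 'n/a' stands in for a value that cannot be parsed
runtimes = pd.Series(['99', '216', '58', 'n/a'])

# Sorting text compares character by character, so '99' ranks above '216'
as_text = runtimes.sort_values(ascending=False).tolist()

# errors='coerce' turns unparseable entries into NaN instead of raising an error
as_numbers = pd.to_numeric(runtimes, errors='coerce').sort_values(ascending=False).tolist()

print(as_text)
print(as_numbers)
```

The coerced `'n/a'` becomes NaN and is placed at the end of the sorted result, so it cannot be mistaken for the longest or shortest film.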


In [426]:
# Let's test how it looks for a single movie, which we can select by filtering the dataframe with a boolean condition
tester1 = df_cleaned[df_cleaned['Title'] == 'Indian 2']

In [427]:
# Now three of my favorite movies from the list; the isin ("is in") method allows us to check for multiple values
tester1 = df_cleaned[df_cleaned['Title'].isin(['Challengers', 'Conclave', 'Drive-Away Dolls'])]
# We will sort this by Flickmetrix_total, descending
tester1.sort_values('Flickmetrix_total', ascending=False)


Unnamed: 0,Position,Title,Runtime,Genres,Flickmetrix_total,Count
0,1,Conclave,120,"Drama, Thriller",79,1
198,55,Challengers,131,"Comedy, Drama, Romance, Sport",76,1
242,96,Drive-Away Dolls,84,"Action, Comedy, Thriller",57,1


In [428]:
# Let us run tester1 by runtime in descending order
tester1_sorted = tester1.sort_values('Runtime', ascending=False)
tester1_sorted

Unnamed: 0,Position,Title,Runtime,Genres,Flickmetrix_total,Count
198,55,Challengers,131,"Comedy, Drama, Romance, Sport",76,1
0,1,Conclave,120,"Drama, Thriller",79,1
242,96,Drive-Away Dolls,84,"Action, Comedy, Thriller",57,1
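
The two filters used above work the same way under the hood: a comparison on a column produces one True/False value per row, and the dataframe keeps the rows marked True. A minimal sketch, using four titles and runtimes copied from the outputs above (the small `films` dataframe itself is a stand-in for `df_cleaned`):

```python
import pandas as pd

# Four rows copied from the outputs above; only the columns needed here
films = pd.DataFrame({
    'Title': ['Conclave', 'Challengers', 'Drive-Away Dolls', 'The Brutalist'],
    'Runtime': [120, 131, 84, 216],
})

# Comparing a column to a value gives one True/False per row (a boolean mask)
mask = films['Title'] == 'Conclave'
single = films[mask]

# isin builds the same kind of mask for several values at once
favorites = films[films['Title'].isin(['Challengers', 'Conclave', 'Drive-Away Dolls'])]
favorites_by_runtime = favorites.sort_values('Runtime', ascending=False)

print(mask.tolist())
print(favorites_by_runtime['Title'].tolist())
```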


To progress to the final stages, we needed four items: the list of movie titles, the runtimes, the genres, and the Flick Metrix scores. Now that we have all four, we can proceed to the next part, where we combine them with the gender features classifier we coded earlier and start to look for patterns between these four parameters and the predicted gender of each title.

I do this so that I am not just running the entire list of movies through the (quite arbitrary) name model, but also have a point of comparison for judging whether running it is legitimate in the first place.

# 4. Let's combine!

For this section we can finally begin to combine our classifier with the grunt work done on the data in the subdivided dataframes.

I will be importing Counter from Python's collections module, which helps with tallies, of which there will presumably be many for each letter and number in our specialized alphabet (collections — Container datatypes, no date).

In [429]:
# Extract the titles from df_cleaned
titles_cleaned = df_cleaned['Title'].tolist()

# Extract features for each title
title_features_cleaned = [gender_features(title) for title in titles_cleaned]

# Classify each title using the trained classifier
title_genders_cleaned = [classifier.classify(features) for features in title_features_cleaned]

# Count occurrences of each predicted gender
from collections import Counter
gender_counts_cleaned = Counter(title_genders_cleaned)
print("Predicted gender counts for df_cleaned titles:", gender_counts_cleaned)

Predicted gender counts for df_cleaned titles: Counter({'male': 148, 'female': 98})
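
To read these counts as shares of the 246 titles, the tally can be turned into percentages. A short sketch, re-entering the two counts printed above by hand (the variable names are my own):

```python
from collections import Counter

# The counts printed above, re-entered by hand
gender_counts = Counter({'male': 148, 'female': 98})

total = sum(gender_counts.values())

# Percentage of all titles per predicted label, rounded to one decimal
shares = {label: round(100 * count / total, 1) for label, count in gender_counts.items()}

print(total)
print(shares)
```

Roughly three in five titles are labelled "male", which is worth keeping in mind as a baseline when a top list later contains mostly "male" titles.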


Let's follow this up by creating a more comprehensive dataframe that includes the movie titles and their predicted genders:

In [430]:
# A DataFrame with titles and their predicted gender
# Use df_cleaned for titles and predicted gender
gender_df = pd.DataFrame({
    'Title': titles_cleaned,
    'Predicted Gender': title_genders_cleaned
})

# Merge with Flickmetrix scores from df_cleaned
gender_df = gender_df.merge(flickmetrix_sorted[['Title', 'Flickmetrix_total']], on='Title', how='left')

# And runtime
gender_df = gender_df.merge(runtime_sorted[['Title', 'Runtime']], on='Title', how='left')

# And genres
gender_df = gender_df.merge(genres_sorted[['Title', 'Genres']], on='Title', how='left')

# Convert Flickmetrix_total to numeric for sorting
gender_df['Flickmetrix_total'] = pd.to_numeric(gender_df['Flickmetrix_total'], errors='coerce')

# Separate female and male titles, sort by Flickmetrix score descending
female_titles = gender_df[gender_df['Predicted Gender'] == 'female'].sort_values('Flickmetrix_total', ascending=False).reset_index(drop=True)
male_titles = gender_df[gender_df['Predicted Gender'] == 'male'].sort_values('Flickmetrix_total', ascending=False).reset_index(drop=True)

# Finally, we can display the top 20 titles sorted by Flickmetrix score. This is just to show a baseline of the data.
gender_df.sort_values('Flickmetrix_total', ascending=False).head(20).style.format({'Flickmetrix_total': '{:.0f}'})

Unnamed: 0,Title,Predicted Gender,Flickmetrix_total,Runtime,Genres
42,Attack on Titan the Movie: The Last Attack,male,92,145,"Animation, Action, Adventure, Drama"
89,Look Back,male,88,58,"Animation, Drama"
108,The Wild Robot,male,87,102,"Animation, Sci-Fi"
156,I'm Still Here,female,87,137,"Biography, Drama, History"
173,Flow,male,86,85,"Animation, Adventure, Family, Fantasy"
98,Dune: Part Two,male,86,166,"Action, Adventure, Drama, Sci-Fi"
170,The Brutalist,male,84,216,Drama
69,Wallace & Gromit: Vengeance Most Fowl,male,83,82,"Animation, Adventure, Comedy, Family, Sci-Fi"
32,All We Imagine as Light,male,83,118,"Drama, Romance"
28,How to Make Millions Before Grandma Dies,male,82,125,"Comedy, Drama, Family"


At first glance it seems that movies with "male" titles have the highest scores. Let's explore why.

We start by wrangling the letters to see which ones are the most common overall. Then we move onto the most common last letter of words.

Looking at all of the letters is mostly for fun, but it does give us an idea of what to expect for the last letters in the code that follows.

In [431]:
# We first exclude all of the instances of 'the'
movie_titles_no_the = [re.sub(r'\bthe\b', '', title, flags=re.IGNORECASE).strip() for title in movie_titles]

# Display a few examples to verify
print(movie_titles[:10])
print(movie_titles_no_the[:10])

# Join all movie titles into a single string and convert to lowercase. We convert to lowercase to ensure uniformity in counting letters.
# Note: this joins the original movie_titles, so occurrences of 'the' are still included in the counts below.
all_letters = ''.join(movie_titles).lower()

# Count each letter
letter_counts = Counter(all_letters)

# Display the counts for each letter (a-z + special characters), sorted by count descending
for letter, count in sorted(letter_counts.items(), key=lambda x: x[1], reverse=True):
    print(f"{letter}: {count}")

# It will display the original titles, then the titles without 'the', and finally the counts of each letter in the titles, sorted in descending order.

['Conclave', 'Gladiator II', 'The Brutalist', 'Babygirl', 'Anora', 'A Complete Unknown', 'The Substance', 'Nosferatu', 'Beetlejuice Beetlejuice', 'Fight or Flight']
['Conclave', 'Gladiator II', 'Brutalist', 'Babygirl', 'Anora', 'A Complete Unknown', 'Substance', 'Nosferatu', 'Beetlejuice Beetlejuice', 'Fight or Flight']
 : 368
e: 346
a: 320
t: 224
i: 211
r: 202
n: 194
o: 179
l: 158
s: 143
h: 134
m: 111
d: 102
c: 83
g: 81
u: 69
p: 63
k: 61
y: 55
f: 53
b: 52
w: 43
v: 33
:: 26
j: 15
2: 13
x: 13
-: 9
': 8
z: 8
3: 4
q: 3
4: 3
1: 3
8: 3
&: 2
7: 2
!: 2
#: 1
5: 1
.: 1
9: 1
0: 1
6: 1
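
As written, the tally above is computed on `movie_titles`, so the filtered list `movie_titles_no_the` is printed but not counted. A minimal sketch of counting on the filtered titles instead, assuming only alphabetic characters are of interest; the helper `drop_the` is a hypothetical name, and the five titles are taken from the printed examples:

```python
import re
from collections import Counter

# Five titles from the printed examples above
titles = ['Conclave', 'Gladiator II', 'The Brutalist', 'The Substance', 'Fight or Flight']

def drop_the(title):
    """Remove the standalone word 'the' (any case) and tidy up leftover spaces."""
    without = re.sub(r'\bthe\b', ' ', title, flags=re.IGNORECASE)
    return ' '.join(without.split())

titles_no_the = [drop_the(title) for title in titles]

# Count letters only, ignoring spaces, digits and punctuation
letter_counts = Counter(ch for ch in ''.join(titles_no_the).lower() if ch.isalpha())

print(titles_no_the)
print(letter_counts.most_common(5))
```

Splitting and re-joining on whitespace also avoids the double space that a plain removal leaves in the middle of a title such as "The Lord of the Rings".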


We can't test for accuracy because none of the titles has an actual gender assigned, meaning that there are no true labels against which nltk could measure the classifier's predictions. However, we can still look at the most informative features.
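
To make this concrete: accuracy is the share of predictions that match a known label, so it is only defined when such labels exist. The sketch below is a plain-Python illustration, not nltk's own implementation, and both lists are hypothetical:

```python
# Hypothetical predictions and gold labels, for illustration only
predicted = ['male', 'female', 'female', 'male']
gold = ['male', 'female', 'male', 'male']

def accuracy(predicted, gold):
    """Share of predictions that match the gold label."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    matches = sum(p == g for p, g in zip(predicted, gold))
    return matches / len(gold)

print(accuracy(predicted, gold))
```

For the movie titles there is nothing to put in `gold`, which is why the analysis continues with the informative features instead.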

In [432]:
# Let us get the most informative last letters for movie titles in determining gender.

# Extract last letters and predicted genders for each movie title
last_letters = [features['last_letter'].lower() for features in movie_title_features]
gender_labels = movie_title_genders

# Count (last_letter, gender) pairs
pair_counts = Counter(zip(last_letters, gender_labels))

# For each last letter, compute the ratio of male to female predictions
letter_stats = {}
for letter in set(last_letters):
    male_count = pair_counts.get((letter, 'male'), 0)
    female_count = pair_counts.get((letter, 'female'), 0)
    total = male_count + female_count
    if total > 0:
        ratio = male_count / total
        letter_stats[letter] = {'male': male_count, 'female': female_count, 'ratio_male': ratio, 'total': total}

# Sort by informativeness: letters with high skew toward one gender and enough samples
informative_letters = sorted(
    letter_stats.items(),
    key=lambda x: abs(x[1]['ratio_male'] - 0.5) * x[1]['total'],
    reverse=True
)

# Display the top 10 most informative last letters
print("Most informative last letters for movie title gender prediction:")
for letter, stats in informative_letters[:10]:
    print(f"Last letter '{letter}': male={stats['male']}, female={stats['female']}, total={stats['total']}, male_ratio={stats['ratio_male']:.2f}")

Most informative last letters for movie title gender prediction:
Last letter 'e': male=0, female=40, total=40, male_ratio=0.00
Last letter 'n': male=30, female=0, total=30, male_ratio=1.00
Last letter 's': male=30, female=0, total=30, male_ratio=1.00
Last letter 'a': male=0, female=18, total=18, male_ratio=0.00
Last letter 't': male=17, female=0, total=17, male_ratio=1.00
Last letter 'r': male=14, female=0, total=14, male_ratio=1.00
Last letter 'i': male=0, female=11, total=11, male_ratio=0.00
Last letter 'g': male=9, female=0, total=9, male_ratio=1.00
Last letter '2': male=0, female=9, total=9, male_ratio=0.00
Last letter 'l': male=9, female=0, total=9, male_ratio=1.00


In [433]:
# In a dataframe:
informative_features_df = pd.DataFrame.from_dict(letter_stats, orient='index')
informative_features_df.index.name = 'last_letter'
informative_features_df = informative_features_df.reset_index()

# Sort by informativeness: letters with high skew toward one gender and enough samples
informative_features_df['informativeness'] = informative_features_df['total'] * abs(informative_features_df['ratio_male'] - 0.5) 
# We ratio it to male as we know it has a higher count than for female features
informative_features_df = informative_features_df.sort_values('informativeness', ascending=False)

# Print last letter and its statistics
for _, row in informative_features_df.iterrows():
    print(f"Last letter '{row['last_letter']}': male={row['male']}, female={row['female']}, total={row['total']}, male_ratio={row['ratio_male']:.2f}")

informative_features_df.sort_values('total', ascending=False).head(20)

#Note: the first column shows the letter based on the most informative features from the test set in the initial classifier model of names

Last letter 'e': male=0, female=40, total=40, male_ratio=0.00
Last letter 'n': male=30, female=0, total=30, male_ratio=1.00
Last letter 's': male=30, female=0, total=30, male_ratio=1.00
Last letter 'a': male=0, female=18, total=18, male_ratio=0.00
Last letter 't': male=17, female=0, total=17, male_ratio=1.00
Last letter 'r': male=14, female=0, total=14, male_ratio=1.00
Last letter 'i': male=0, female=11, total=11, male_ratio=0.00
Last letter 'g': male=9, female=0, total=9, male_ratio=1.00
Last letter '2': male=0, female=9, total=9, male_ratio=0.00
Last letter 'l': male=9, female=0, total=9, male_ratio=1.00
Last letter 'm': male=8, female=0, total=8, male_ratio=1.00
Last letter 'y': male=0, female=8, total=8, male_ratio=0.00
Last letter 'k': male=8, female=1, total=9, male_ratio=0.89
Last letter 'o': male=6, female=0, total=6, male_ratio=1.00
Last letter 'h': male=0, female=5, total=5, male_ratio=0.00
Last letter 'd': male=6, female=1, total=7, male_ratio=0.86
Last letter 'w': male=4, f

Unnamed: 0,last_letter,male,female,ratio_male,total,informativeness
26,e,0,40,0.0,40,20.0
4,n,30,0,1.0,30,15.0
30,s,30,0,1.0,30,15.0
17,a,0,18,0.0,18,9.0
3,t,17,0,1.0,17,8.5
2,r,14,0,1.0,14,7.0
19,i,0,11,0.0,11,5.5
1,g,9,0,1.0,9,4.5
9,2,0,9,0.0,9,4.5
29,l,9,0,1.0,9,4.5
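
The informativeness column can be checked by hand. The score is the distance of the male ratio from 0.5, multiplied by the number of titles, which algebraically reduces to half the absolute difference between the male and female counts. A small sketch of that arithmetic (the function name is my own; the counts are taken from the output above):

```python
import math

def informativeness(male, female):
    """|male_ratio - 0.5| * total, as computed in the cell above."""
    total = male + female
    if total == 0:
        return 0.0
    ratio_male = male / total
    return abs(ratio_male - 0.5) * total

# Counts taken from the output above
for letter, male, female in [('e', 0, 40), ('n', 30, 0), ('k', 8, 1)]:
    score = informativeness(male, female)
    # The same value, written as half the difference between the two counts
    shortcut = abs(male - female) / 2
    print(letter, round(score, 2), shortcut)
```

Because the score grows with the number of titles, a frequent letter that is fully skewed (such as 'e') ranks above a rare one with the same skew.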


In [434]:
last_letter_to_titles = defaultdict(list)
for title, features in zip(movie_titles, movie_title_features):
    last_letter = features['last_letter'].lower()
    last_letter_to_titles[last_letter].append(title)

# Print the last letter and the corresponding movie titles
for letter, titles in last_letter_to_titles.items():
    print(f"Last letter '{letter}': {titles}")

# Convert the letter_stats dictionary to a DataFrame
letter_stats_df = pd.DataFrame.from_dict(letter_stats, orient='index')
letter_stats_df.index.name = 'last_letter'
letter_stats_df = letter_stats_df.reset_index()
letter_stats_df = letter_stats_df.sort_values(by='total', ascending=False)

letter_stats_df

Last letter 'e': ['Conclave', 'The Substance', 'Beetlejuice Beetlejuice', 'Deadpool & Wolverine', "I'm Still Here", 'The Ministry of Ungentlemanly Warfare', 'Small Things Like These', 'Venom: The Last Dance', 'We Live in Time', 'Blink Twice', 'The Apprentice', 'Parthenope', 'Presence', 'MaXXXine', 'Subservience', 'Road House', 'Transformers One', 'Rebel Ridge', 'Godzilla x Kong: The New Empire', 'Here', 'Caddo Lake', 'A Quiet Place: Day One', 'Argylle', 'Bad Boys: Ride or Die', "It's What's Inside", 'Red One', 'Ghostbusters: Frozen Empire', 'Damsel', 'Immaculate', 'Kneecap', 'Maharaja', 'All We Imagine as Light', 'Afraid', 'The Killer', 'Kalki 2898 AD', 'Brothers', 'The Platform 2', 'Maidaan', 'Sarfira']
Last letter 'i': ['Gladiator II', 'Rebel Moon - Part Two: The Scargiver', 'Bob Marley: One Love', 'Irish Wish', 'The Goat Life', 'Wild Wild Punjab']
Last letter 't': ['The Brutalist', 'Fight or Flight', 'The Wild Robot', 'Saturday Night', "Salem's Lot", 'The Six Triple Eight', 'How to 

Unnamed: 0,last_letter,male,female,ratio_male,total
26,e,0,40,0.0,40
4,n,30,0,1.0,30
30,s,30,0,1.0,30
17,a,0,18,0.0,18
3,t,17,0,1.0,17
2,r,14,0,1.0,14
19,i,0,11,0.0,11
1,g,9,0,1.0,9
0,k,8,1,0.888889,9
9,2,0,9,0.0,9
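
In the printed lists above, some titles sit under a letter they do not end in (for example 'Damsel' and 'Kneecap' under 'e', and 'Irish Wish' under 'i'). One possible cause, which the output alone cannot confirm, is that `movie_titles` and `movie_title_features` are not in the same order when they are zipped together. A way to rule this out is to derive the last letter directly from each title, so that the pairing cannot drift. The sketch assumes the feature is simply the final character of the title; the helper `last_letter` is a hypothetical name:

```python
from collections import defaultdict

# Four titles from the output above
titles = ['Conclave', 'Damsel', 'Kneecap', 'Gladiator II']

def last_letter(title):
    """Final character of the title, lowercased (assumed to match the classifier feature)."""
    return title.strip()[-1].lower()

# Group titles by a letter computed from the title itself
last_letter_to_titles = defaultdict(list)
for title in titles:
    last_letter_to_titles[last_letter(title)].append(title)

for letter, grouped in sorted(last_letter_to_titles.items()):
    print(f"Last letter '{letter}': {grouped}")
```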


Now let's see what the movies that share the most common last letters have in common. I will take the 10 most common last letters into account.

In [446]:
# Find the 10 most common last letters in the cleaned movie titles

# Get last letters from cleaned titles
last_letters_cleaned = [features['last_letter'].lower() for features in title_features_cleaned]
last_letter_counts = Counter(last_letters_cleaned)
top_10_letters = [letter for letter, _ in last_letter_counts.most_common(10)]

# Prepare results
results = []

# Show all rows and columns when dataframes are displayed (set once, before the loop)
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)

for letter in top_10_letters:
    # Filter df_cleaned for titles ending with this last letter
    mask = [features['last_letter'].lower() == letter for features in title_features_cleaned]
    subset = df_cleaned[mask]
    # Calculate average runtime
    avg_runtime = int(round(subset['Runtime'].mean()))
    # Get all genres, split and count
    all_genres = subset['Genres'].str.split(',').explode().str.strip()
    genre_counts = all_genres.value_counts().head(5)  # Top 5 genres for brevity
    results.append({
        'last_letter': letter,
        'count': len(subset),
        'avg_runtime': avg_runtime,
        'top_genres': genre_counts.to_dict()
    })



# Display as DataFrame
pd.DataFrame(results)

Unnamed: 0,last_letter,count,avg_runtime,top_genres
0,e,40,115,"{'Drama': 20, 'Thriller': 19, 'Action': 12, 'C..."
1,s,30,116,"{'Drama': 17, 'Thriller': 15, 'Comedy': 13, 'A..."
2,n,27,128,"{'Drama': 15, 'Action': 14, 'Thriller': 12, 'C..."
3,t,16,112,"{'Drama': 8, 'Comedy': 6, 'Thriller': 6, 'Roma..."
4,a,16,132,"{'Drama': 11, 'Thriller': 6, 'Action': 5, 'Com..."
5,r,12,127,"{'Drama': 8, 'Thriller': 7, 'Action': 6, 'Crim..."
6,k,9,112,"{'Comedy': 5, 'Drama': 4, 'Animation': 3, 'Hor..."
7,l,9,99,"{'Drama': 5, 'Thriller': 4, 'Horror': 3, 'Acti..."
8,g,9,111,"{'Drama': 4, 'Thriller': 4, 'Adventure': 4, 'C..."
9,2,9,131,"{'Thriller': 7, 'Drama': 5, 'Horror': 3, 'Crim..."


In [457]:
# Let's print the full top_genres dictionary (the five most common genres) for each last letter
for entry in results:
    print(f"Last letter: {entry['last_letter']}, Top genres: {entry['top_genres']}")

Last letter: e, Top genres: {'Drama': 20, 'Thriller': 19, 'Action': 12, 'Comedy': 12, 'Sci-Fi': 10}
Last letter: s, Top genres: {'Drama': 17, 'Thriller': 15, 'Comedy': 13, 'Action': 10, 'Adventure': 8}
Last letter: n, Top genres: {'Drama': 15, 'Action': 14, 'Thriller': 12, 'Crime': 6, 'Comedy': 6}
Last letter: t, Top genres: {'Drama': 8, 'Comedy': 6, 'Thriller': 6, 'Romance': 4, 'Horror': 3}
Last letter: a, Top genres: {'Drama': 11, 'Thriller': 6, 'Action': 5, 'Comedy': 5, 'Romance': 5}
Last letter: r, Top genres: {'Drama': 8, 'Thriller': 7, 'Action': 6, 'Crime': 4, 'Biography': 2}
Last letter: k, Top genres: {'Comedy': 5, 'Drama': 4, 'Animation': 3, 'Horror': 3, 'Adventure': 2}
Last letter: l, Top genres: {'Drama': 5, 'Thriller': 4, 'Horror': 3, 'Action': 2, 'Adventure': 2}
Last letter: g, Top genres: {'Drama': 4, 'Thriller': 4, 'Adventure': 4, 'Crime': 3, 'Action': 3}
Last letter: 2, Top genres: {'Thriller': 7, 'Drama': 5, 'Horror': 3, 'Crime': 3, 'Action': 3}
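
The same per-letter summary can also be written with `groupby`, which avoids building a mask for each letter by hand. This is a sketch rather than a drop-in replacement: it uses five rows copied from earlier outputs, and it derives the last letter directly from the title instead of from `title_features_cleaned`, which is an assumption about what that feature contains:

```python
import pandas as pd

# Five rows copied from earlier outputs; a stand-in for df_cleaned
films = pd.DataFrame({
    'Title': ['Conclave', 'Blink Twice', 'Wolfs', 'Challengers', 'Drive-Away Dolls'],
    'Runtime': [120, 102, 108, 131, 84],
    'Genres': ['Drama, Thriller', 'Mystery, Thriller', 'Crime, Thriller',
               'Comedy, Drama, Romance, Sport', 'Action, Comedy, Thriller'],
})

# Assumed feature: the final character of the title, lowercased
films['last_letter'] = films['Title'].str.strip().str[-1].str.lower()

# Average runtime per last letter, rounded to whole minutes
avg_runtime = films.groupby('last_letter')['Runtime'].mean().round().astype(int)

# One row per (film, genre) pair, then a genre tally within each last letter
genre_counts = (
    films.assign(Genre=films['Genres'].str.split(','))
    .explode('Genre')
    .assign(Genre=lambda d: d['Genre'].str.strip())
    .groupby('last_letter')['Genre']
    .value_counts()
)

print(avg_runtime)
print(genre_counts)
```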


In [458]:
# Let's compare this to the sorting step, where we generated genre_titles_df with the number of titles per genre.
# Drama, Thriller and Action were the three most common genres overall, so it is unsurprising that they dominate most last-letter groups.
genre_titles_df[:10]

Unnamed: 0,Genre,Number of Titles,Titles
0,Drama,129,"A Complete Unknown, A Different Man, A Family ..."
1,Thriller,111,"A Different Man, A Quiet Place: Day One, Abiga..."
2,Action,82,"Aavesham, Amaran, Apocalypse Z: The Beginning ..."
3,Comedy,78,"A Different Man, A Family Affair, A Real Pain,..."
4,Adventure,46,"Arthur the King, Atlas, Attack on Titan the Mo..."
5,Fantasy,34,"Beetlejuice Beetlejuice, Better Man, Damsel, D..."
6,Mystery,32,"Afraid, Anweshippin Kandethum, Badland Hunters..."
7,History,16,"Blitz, Chandu Champion, Color of Victory, I'm ..."
8,War,6,"Amaran, Blitz, Chandu Champion, Nr. 24, The Mi..."
9,Sci-Fi,37,"A Quiet Place: Day One, Afraid, Alien: Romulus..."


Additionally, since we calculated the average runtime alongside the genre counts, we now have a complete overview of what we need.

The very last thing to do in this code is to compare the predicted gender of the titles with the averages of the other parameters.

Finally, we can see whether movie titles that are the highest rated are most commonly coded as male or as female.
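
One way to set up that comparison is to group by the predicted gender and summarize the scores per group. The sketch below uses only the ten rows visible in the top-scores output earlier in this section, so its numbers describe that small slice and not the full dataset; on the real data the same `groupby` would be applied to `gender_df`:

```python
import pandas as pd

# The ten rows shown in the top-scores output above (not the full dataset)
sample = pd.DataFrame({
    'Title': [
        'Attack on Titan the Movie: The Last Attack', 'Look Back', 'The Wild Robot',
        "I'm Still Here", 'Flow', 'Dune: Part Two', 'The Brutalist',
        'Wallace & Gromit: Vengeance Most Fowl', 'All We Imagine as Light',
        'How to Make Millions Before Grandma Dies',
    ],
    'Predicted Gender': ['male', 'male', 'male', 'female', 'male',
                         'male', 'male', 'male', 'male', 'male'],
    'Flickmetrix_total': [92, 88, 87, 87, 86, 86, 84, 83, 83, 82],
})

# Number of titles, mean score and median score per predicted label
summary = (
    sample.groupby('Predicted Gender')['Flickmetrix_total']
    .agg(['count', 'mean', 'median'])
    .round(2)
)

print(summary)
```

Since "male" labels outnumber "female" ones 148 to 98 overall, the count column should be read against that baseline; the mean and median per group are the fairer comparison.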