# 0. Introduction

The classification task this program focuses on is document classification, specifically the classification of movie titles. Titles are the first thing we take in whenever something new is airing. Beyond the trailers, actors and directors involved, we read a name, and from there we decide whether it is worth a watch or further inquiry, or whether it is a waste of our time. As an avid movie lover, I spend a lot of time reading about and watching movies, and an intriguing title is often what hooks me.

Specifically, for this paper I will be coding a supervised classifier: a feature extractor converts each input into a feature set, and the classifier learns from inputs whose labels, like positive or negative, are already known (linguist89, 2025). My thesis is that the movies in a list, and the most highly rated ones in particular, will share common features. To test this, I will run my titles through a gender predictor trained on nltk's names corpus, and see what averages there are for runtimes and genres among the most highly rated feature films in this data.

I will start by following nltk's document classification example on its names corpus, and then use this framework to run a similar experiment on my own data, the titles of 2024 feature films.

Note: I use terminology that may differ, but means the same: code, program, feature predictor, etc. 

# 1. Packages to be installed

The first thing we do is import the tools needed for our program. The most important one, and the base for everything I will be doing, is nltk (Natural Language Toolkit). I will then import the regular expression (re) and random modules to support the tools built into nltk.

The first package is nltk, the Natural Language Toolkit. We have used it in class to build on data from corpora and for language processing, and I find it to be a good tool for the scope of my paper (Hansen, Olsen and Enevoldsen, 2023). Please see the bibliography in the written portion of this exam for all references made here.

The next program is pandas, which lets us manipulate and see dataframes. We have used it in both the first and second semester of the master's program, so I see it fit to use this again for my set of data.

Note that if there is a hashtag in front of a %pip command, please remove it before attempting to run the code. The hashtags are simply there so the packages do not download over and over.

A large portion of my code makes reference to homework we were assigned and worked with in class, which can be accessed through Stephan's github account (linguist89) or through this link: https://github.com/linguist89/compling25-exercises-week11. Please reach out to him or Ross if there are any issues accessing this worksheet.

In [None]:
%pip install nltk

In [None]:
import nltk, re, random # nltk, regular expression, random module
from nltk import word_tokenize
from collections import Counter  # I will import Counter again later, but this is just to avoid any errors when running all the code at once

Although NLTK has a movie-related corpus, it is limited to reviews, and these reviews do not contain the information relevant to my thesis, namely the titles of movies, so I will move forward with the corpus containing names. From here, I will build on the gendered features we explored in class, and run similar code for the 2024 feature film titles.
The goal is to see where this model finds consistencies in the titles involved, and whether these are comparable to those of standard names.

An example of an inquiry could be: are titles that are deemed more likely to be feminine more positively reviewed?

This entire thing is a bit arbitrary, admittedly, but a curious case nonetheless.

# 2. Gendered name identification

This section is based largely on the code we worked with in class for week 11. This code builds on NLTK's corpus of names, and in class we coded to see if there was any reliability in assuming features of names being gendered. According to the homework notes "The returned dictionary, known as a feature set, maps from features' names to their values. Feature names are case-sensitive strings that typically provide a short human-readable description of the feature. Feature values are values with simple types, such as booleans, numbers, and strings." (linguist89, 2025)

For example, the last letter of the name Shrek is 'k', which is a letter we must map onto a set of variables we determine to be either female or male. This whole thing is quite binary, and there is no in-between encoding for gender-neutral names, as this was not a part of the corpus nor something I am able to append to the corpus data.

First, we write code that extracts the last letter of any given name (word).

In [None]:
def gender_features(word):
    # Extracting the last letter of the word
    return {'last_letter': word[-1:]}

In [None]:
gender_features('Shrek')

I will also run my own name below, for fun

In [None]:
gender_features('Suzan')

Now that we have a feature extractor, we can move on to importing the corpus of names from NLTK. We will be randomizing the contents to ensure a widespread variety of data to work from.

In [None]:
import nltk.corpus
nltk.download('names')
from nltk.corpus import names # we import names to access the names in the corpus
import random
names = ([(name, 'male') for name in names.words('male.txt')] +
         [(name, 'female') for name in names.words('female.txt')])
random.shuffle(names)
len(names) # 7944 names

Following what was done in class, we now use the feature extractor to process the data and divide it into two groups, a training set and a test set. The training set is used to train a new naive Bayes classifier (linguist89, 2025).

In [None]:
featuresets = [(gender_features(n), g) for (n,g) in names]
train_set, test_set = featuresets[500:], featuresets[:500] # test set is the first 500 records
# train set is everything after the first 500 records, i.e. 7444 names. We train on far
# more data than we test
classifier = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, test_set)) # around 0.75, varies with the shuffle


Let's see what happens if we get the first ten results from the training set, just to test that everything works.

In [None]:
train_set[:10] # first 10 records in the training set

And again for the test set

In [None]:
test_set[:10] # first 10 records in the test set

They both alternate quite well between feminine and masculine features, which is a good sign for us, and shows no current overrepresentation of either.

Let's test out some names

In [None]:
classifier.classify(gender_features('Anna')) 

In [None]:
classifier.classify(gender_features('John')) # classify a second name for comparison

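What about a name we are sure is not a traditional name? I'm thinking of dear Smeagol, AKA Gollum, from Lord of the Rings. In my runs the classifier usually labels both of his names as male, even though neither is present in nltk's list of names.

Oddly, Smeagol is sometimes predicted to be female instead of male. The prediction for a given name is fixed once the classifier is trained, so re-running only the cell below will not change it; it can change when the names are reshuffled and the classifier is retrained on a different training set.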

In [None]:
classifier.classify(gender_features('Gollum'))

In [None]:
classifier.classify(gender_features('Smeagol'))

We continue by measuring the accuracy of the classifier against the data contained in the test set.

In [None]:
nltk.classify.accuracy(classifier, test_set)

My test set came out with an accuracy of 0.758, a number that varies based on the generated test set.
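
The accuracy changes between runs because `random.shuffle` reorders the names differently each time, so a different set of 500 names lands in the test set. Below is a minimal sketch of how the split could be made repeatable, assuming a fixed seed is acceptable for the experiment. The toy name list and the helper name `split_names` are hypothetical and only stand in for the NLTK corpus.

```python
import random

def split_names(labeled_names, test_size, seed):
    """Shuffle a copy of the labeled names with a seeded generator and split it."""
    rng = random.Random(seed)        # private generator, leaves the global one untouched
    shuffled = list(labeled_names)   # copy, so the caller's list keeps its order
    rng.shuffle(shuffled)
    return shuffled[test_size:], shuffled[:test_size]   # (train, test)

# Toy stand-in for the names corpus (hypothetical data)
toy_names = [("Anna", "female"), ("John", "male"), ("Maria", "female"),
             ("Peter", "male"), ("Sofia", "female"), ("Erik", "male")]

train_a, test_a = split_names(toy_names, test_size=2, seed=42)
train_b, test_b = split_names(toy_names, test_size=2, seed=42)
print(test_a == test_b)   # the same seed gives the same split
```

With a seed in place, the accuracy would stay the same from run to run, which makes it easier to tell whether a change in the features actually helped.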

Let's see which features are the most informative for reaching this accuracy distribution:

In [None]:
classifier.show_most_informative_features(5)

With the most basic framework set, I will continue my program by importing my data and starting the programming relevant to my thesis below.

## 2.1 Picking the right features
It helps to settle on the relevant features up front, so that our model produces information we can use to spot patterns of interest, like the position of letters, the length of names, or other features that might be universal for movies with better scores.

In [None]:
def gender_features2(name):
    features = {}
    features["firstletter"] = name[0].lower()
    features["lastletter"] = name[-1].lower()
    for letter in 'abcdefghijklmnopqrstuvwxyz':
        features["count(%s)" % letter] = name.lower().count(letter)
        features["has(%s)" % letter] = (letter in name.lower())
    return features
# We have 54 features: the first letter, the last letter, and a count and a has flag for each of the 26 letters.

In [None]:
#Let's test it out:
gender_features2('Shrek')

I also want to make an alternative one that includes special characters, for the sake of the movies in the spreadsheet. Some of them contain special characters like : or !, and I think it's interesting to include them as they are. 

In [None]:
def gender_features_special(name):
    features = {}
    features["firstletter"] = name[0].lower()
    features["lastletter"] = name[-1].lower()
    features["has(the)"] = ('the' in name.lower())
    for letter in '1234567890abcdefghijklmnopqrstuvwxyz&#@!$%^&*+-=;:,.?':
        features["count(%s)" % letter] = name.lower().count(letter)
        features["has(%s)" % letter] = (letter in name.lower())
    return features


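The special alphabet has 52 distinct characters: 10 digits, 26 letters and 16 special characters (the & appears twice in the string but only counts once). I have added the word 'the' as a feature as it appears in many of the movie names. With a count and a has flag for each character, plus the first letter, last letter and 'the' features, it will count out as 107 features.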
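
The total of 107 can be checked by hand before running the extractor. The sketch below copies the character string from `gender_features_special`; the variable names are my own and only serve the illustration.

```python
# The character string used in gender_features_special
alphabet = '1234567890abcdefghijklmnopqrstuvwxyz&#@!$%^&*+-=;:,.?'

distinct = set(alphabet)   # '&' is listed twice, a set keeps it once
fixed_features = 3         # firstletter, lastletter, has(the)
per_character = 2          # count(x) and has(x) for every character x

expected_total = fixed_features + per_character * len(distinct)
print(len(alphabet), len(distinct), expected_total)
```

Because the features are stored in a dictionary, the repeated & overwrites its own keys instead of adding new ones, which is why the distinct characters are what matter here.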

Let's compare the two by using the name Shrek with and without an exclamation mark in the two different models:

In [None]:
len(gender_features2("Shrek"))

In [None]:
len(gender_features2("Shrek!"))

In [None]:
len(gender_features_special("Shrek"))

In [None]:
len(gender_features_special("Shrek!"))

As seen above, adding the exclamation mark changes the number of features in neither extractor: gender_features2 and gender_features_special (the extended extractor) each return the same number of features with and without it. The set of features is fixed by the alphabet. Each name simply gets a count and a has value for every character in that alphabet, so only the values change.
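
To see which values do change, the two feature sets can be compared key by key. This is a sketch under assumptions: `letter_features` is a compact, hypothetical stand-in for the two extractors (it leaves out the 'the' feature), and the special alphabet is shortened to two extra characters.

```python
def letter_features(name, alphabet):
    """Compact stand-in for the extractors above (hypothetical helper)."""
    lowered = name.lower()
    features = {"firstletter": lowered[0], "lastletter": lowered[-1]}
    for char in alphabet:
        features["count(%s)" % char] = lowered.count(char)
        features["has(%s)" % char] = char in lowered
    return features

def changed_features(before, after):
    """Return the features whose values differ between two feature sets."""
    return {key: (before[key], after[key]) for key in before if before[key] != after[key]}

plain_alphabet = 'abcdefghijklmnopqrstuvwxyz'
special_alphabet = plain_alphabet + '!:'   # shortened special alphabet for this sketch

diff_plain = changed_features(letter_features("Shrek", plain_alphabet),
                              letter_features("Shrek!", plain_alphabet))
diff_special = changed_features(letter_features("Shrek", special_alphabet),
                                letter_features("Shrek!", special_alphabet))
print(diff_plain)
print(diff_special)
```

In the plain alphabet only the last letter changes, while the special alphabet also registers the exclamation mark in its count and has features.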

In [None]:
#Let's prove this with another example:
print(gender_features2('Window123'))  # printed, since a cell only displays its last expression
len(gender_features2('Window123'))

In [None]:
print(gender_features_special('Window123'))
len(gender_features_special('Window123'))

# 3. Classification of the 2024 movie title pool

Now that we have the feature extractor, we can move on to the relevant data for this paper. 

I put all of the titles of the 2024 releases from the .csv file called IMDB_LIST_266_massive into a list called movie_titles. Admittedly, I got this data by exporting a list from IMDb, and I ran a Copilot script to automate the process of turning each title into an item in my list.

Note that the order below is seemingly random, and is not based on alphabetical order, popularity or release date. I only know that it follows an arbitrary "list order", which I cannot find the parameters for. The same order is of course in the IMDb spreadsheet, and was imported in this order from IMDb itself.


We will double check that all 266 titles are in the list once the data has been loaded.

Now comes the most intensive part of this program: importing the dataframe from the spreadsheet (saved as a csv file) containing all of my data. There is a lot of data...

We start by importing pandas, a toolkit that allows us to work with data in a user-friendly way, and is great for tables with lots of data. As we have worked with it in this course and in a course in the previous semester, I am a bit more familiar with it than I am with nltk.

I will be using the official documentation from Pandas to guide my coding process, especially to make sure I do not make massive mistakes with my massive dataset (pandas documentation — pandas 2.2.3 documentation, 2024)

Please import the repository from this GitHub link if the Wiseflow application has not imported the following two items: https://github.com/bingusiscoding/suzcompling.git
1. IMDB_LIST_266_massive.csv
2. Suz_Compling_Code.ipynb

In [None]:
import pandas as pd
import os

# Check current working directory
print("Current working directory:", os.getcwd())

# List files in the current directory
print("Files in directory:", os.listdir())

# The file is read straight from GitHub; point csv_path at a local copy if you have downloaded it
csv_path = "https://raw.githubusercontent.com/bingusiscoding/suzcompling/main/IMDB_LIST_266_massive.csv"

rawdata = pd.read_csv(csv_path, sep=';', quoting=1, encoding='utf-8')  # quoting=1 is csv.QUOTE_ALL
rawdata = rawdata.replace('*', '')  # blank out cells that contain only an asterisk
rawdata

There is a whole row which just repeats the column headers of the dataframe. Let's remove it.

In [None]:
# Let us remove the row where Position is 'Position' (the header row accidentally included in the data)
rawdata = rawdata[rawdata['Position'] != 'Position']
rawdata = rawdata.reset_index(drop=True)
rawdata
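
The same filter can be sketched without pandas, which makes it easier to see what the comparison does. This is not how the notebook loads the data: I have swapped in Python's built-in csv module to keep the example self-contained, and the small extract below is made up (only the column names and two of the titles come from the dataset).

```python
import csv
import io

# Made-up extract in the same semicolon-separated, fully quoted layout
raw_text = (
    '"Position";"Title"\n'
    '"Position";"Title"\n'
    '"1";"Conclave"\n'
    '"2";"Challengers"\n'
)

reader = csv.DictReader(io.StringIO(raw_text), delimiter=';')
# The first line becomes the header; the repeated header arrives as an ordinary row,
# which we recognize because its Position value is the word 'Position'
rows = [row for row in reader if row['Position'] != 'Position']
print(rows)
```

A side effect of the repeated header is that every column is read as text, which is why numeric columns need converting before they can be sorted by value.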

In [None]:
# The very first thing we're going to do is list all of the movie titles into a list
movie_titles = rawdata['Title'].tolist()

In [None]:
# Let's see if that worked as it should
movie_titles[:10]

In [None]:
# Let's make sure we have all 266 movies, which is the number of rows in the original dataset
len(movie_titles)

Admittedly, this is overkill. BUT! It does provide us with scores and a point of reference for how good a movie is beyond the feature extractor I built earlier.

Also, I spent way too many hours inputting the data, so bear with me for wanting to display it for a short while.

Let's group the columns we will want to use in the initial experiment for now. I will come back to the rest of the data more in the discursive parts of this assignment.

In [None]:
df_small_beta = rawdata.groupby(['Position', 'Title', 'Runtime (mins)', 'Genres', 'Flickmetrix_total']).size().reset_index(name='Count')
# Only the runtime column needs a new name; the other columns keep theirs
df_small_beta = df_small_beta.rename(columns={'Runtime (mins)': 'Runtime'})
df_small_beta

Some of the titles were not on Flick Metrix, so let's filter those out. We know that the cells without an input are simply empty, which is why we use the .isnull method. We can then strip them from the dataframe and store the final outcome, with a fresh index, as a new cleaned dataframe.

Admittedly, I have struggled with getting the September 5 title to be correct in both the Excel file and in my code, so I will exclude it as well.

In [None]:
# Flag rows where 'Flickmetrix_total' is missing or empty, or Title is 'September 5' or 'sep-05'
is_excluded = (
    df_small_beta['Flickmetrix_total'].isnull() |
    (df_small_beta['Flickmetrix_total'].astype(str).str.strip() == '') |
    (df_small_beta['Title'].str.strip().str.lower().isin(['september 5', 'sep-05']))
)
missing_scores = df_small_beta[is_excluded]

# Print the titles of the removed rows
print("Removed rows (no Flickmetrix_total or problematic title):")
print(missing_scores[['Title', 'Flickmetrix_total']])

# Remove those rows from the dataframe, reusing the same condition
df_cleaned = df_small_beta[~is_excluded].reset_index(drop=True)

# Lastly, I will remove the same titles from the movie_titles list
removed_titles = set(missing_scores['Title'].str.strip())
movie_titles = [title for title in movie_titles if title.strip() not in removed_titles]

In [None]:
# Let's check to see if these titles have been removed. Movie 183, 'Martin', is one that should now no longer be in the movie_titles list
print('Martin' in movie_titles)

In [None]:
len(df_cleaned) # Let's see how many we have left, it should be 246

## 3.1 Sorting

Now we can get to the fun part: sorting!

This subpart is more of a formality so that we can get different sets of sorting into respective dataframes. This makes the process of comparing different values and sorting orders easier later on.

The reason I focus on titles, runtime, and genres for this programming is that these are the most common things we watch out for when picking a movie to watch. Subcategorization by actors and directors is a much more biased line of research, and far too expansive for this paper. For example, someone who likes movies from one director is more likely to watch the rest of the movies from that director, even if they have bad ratings.

We are all also different in our tastes and in our general moods. Some people love long movies, some will sigh if a movie is anything over 2 hours long. Some love horror movies, others find them repulsive and will want to steer clear of any. For this reason, I will keep these categories (mostly) separated for most of my coding going forward.

Let's start by making sure we have all 246 movies that contain a Flick Metrix score and see them in alphabetical order:

In [None]:
total = df_cleaned.value_counts(['Title', 'Flickmetrix_total']).sort_index(ascending=True)
print(total)
#And the length of the total just below, it should say 246
print(len(total))

We start by making a dataframe that indexes by title in alphabetical order (ascending):

In [None]:
flickmetrix_sorted = df_cleaned.sort_values(['Title', 'Flickmetrix_total'], ascending=True)
flickmetrix_sorted.head(10)

In [None]:
# If we do the same with genres, it shows the entire groups of genres, while we are interested in the individual ones.
genres_sorted = df_cleaned.sort_values(['Genres', 'Title'], ascending=True)
print(genres_sorted.value_counts('Genres'))  # printed, since a cell only displays its last expression
genres_sorted.head(10)

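As some of these movies have multiple genres, I needed to import defaultdict, a dictionary that creates a default value (here an empty set) the first time a key is used, so each genre can collect its titles without first checking whether the genre is already in the dictionary.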

In [None]:
from collections import defaultdict

# Create a dictionary to store genres and the titles they appear in
genre_titles = defaultdict(set)

# We run a for loop to iterate over the dataframe and split genres by comma
for _, row in df_cleaned.iterrows():
    genres = [g.strip() for g in row['Genres'].split(',')]
    for genre in genres:
        genre_titles[genre].add(row['Title'])

# Print the count of titles per genre and example titles
for genre, titles in genre_titles.items():
    print(f"{genre}: {len(titles)} titles, e.g. {list(titles)[:10]}")

# And a DataFrame to summarize the genres and their titles
genre_titles_df = pd.DataFrame({
    'Genre': list(genre_titles.keys()),
    'Number of Titles': [len(titles) for titles in genre_titles.values()],
    'Titles': [', '.join(sorted(titles)) for titles in genre_titles.values()]
})

genre_titles_df

# We have two different outputs: the printed lines list each genre with its number of titles and up to ten example titles,
# and the dataframe is a more easy-on-the-eye summary with genres, number of titles, and some of the titles.
# Note: the dataframe display cuts long cells short, so it only shows the first few titles in each genre.
# Setting pd.set_option('display.max_colwidth', None) before displaying it should show them in full.

And lastly (for now), by runtime

In [None]:
# The most important step here is to convert the runtime column to numeric, so we can sort it as such.
df_cleaned['Runtime'] = pd.to_numeric(df_cleaned['Runtime'], errors='coerce')
runtime_sorted = df_cleaned.sort_values('Runtime', ascending=False)
runtime_sorted.head(10)  # Display the top 10 longest movies
# If correctly done, The Brutalist should be a whopping 216 minutes long (3.6 hours),
# and the shortest movie should be Look Back with 58 minutes (just two minutes shy of an hour).
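
The conversion matters because text is compared character by character rather than by value. A small sketch with plain Python lists shows the difference; 58 and 216 are the runtimes mentioned above, while the other two values are made up for the illustration.

```python
# Runtimes as they might arrive from the file, stored as text
runtimes_as_text = ['58', '216', '100', '95']

# Text comparison looks at the first character first, so '95' beats '216'
text_order = sorted(runtimes_as_text, reverse=True)

# Converting to integers compares the actual values
numeric_order = sorted((int(runtime) for runtime in runtimes_as_text), reverse=True)

print(text_order)
print(numeric_order)
```

Sorted as text, the 216-minute film would not come out on top, which is why `pd.to_numeric` is applied to the column before sorting.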

In [None]:
# Let's test out how it looks for a single movie, which we can do by filtering with a boolean condition
tester1 = df_cleaned[df_cleaned['Title'] == 'Look Back']
tester1

In [None]:
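# Now three of my favorite movies from the list, the isin (is in) method allows us to check for multiple values
tester1 = df_cleaned[df_cleaned['Title'].isin(['Challengers', 'Conclave', 'Drive-Away Dolls'])]
# We will sort this by Flickmetrix_total, descending. The scores may still be stored as text at this point,
# so they are converted to numbers for the comparison
tester1.sort_values('Flickmetrix_total', ascending=False, key=lambda scores: pd.to_numeric(scores, errors='coerce'))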


In [None]:
# Let us run tester1 by runtime in descending order
tester1_sorted = tester1.sort_values('Runtime', ascending=False)
tester1_sorted

To progress to the final stages, we have needed these four items: the list of the movie titles, the runtime, the genres, and the Flick Metrix scores. Now that we have all four, we can proceed to the next part, where we combine it all with the gender features coding we did way earlier, and from this we will start to deduce if there are any patterns between these four parameters and the gender features. 

I do this to not just run the entire list of movies through the (quite arbitrary) name model, but to have a point of comparison - any patterns with the other factors add some credibility to the final results.

# 4. Let's combine!

For this section we can finally begin to combine our classifier with the sorted dataframes from section 3.1.

I will be importing Counter from Python's collections module, which helps in providing tallies, of which there will presumably be a lot for each letter and number in our specialized alphabet (collections — Container datatypes, no date).
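
As a quick illustration of what Counter does, here is a sketch that tallies the last letters of four titles named elsewhere in this notebook. The short list is only for demonstration and is not the full dataset.

```python
from collections import Counter

# Four titles mentioned in this notebook, used as a tiny sample
titles = ['Challengers', 'Conclave', 'Drive-Away Dolls', 'Look Back']

# Counter accepts any iterable and counts how often each item occurs
last_letters = Counter(title[-1].lower() for title in titles)

print(last_letters.most_common())   # (letter, tally) pairs, most frequent first
print(last_letters['z'])            # a letter that never occurs simply counts as 0
```

Counter behaves like a dictionary whose missing keys count as zero, so there is no need to check whether a letter has been seen before adding to its tally.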

In [None]:
from collections import Counter

# Extract the titles from df_cleaned
titles_cleaned = df_cleaned['Title'].tolist()

# Extract features for each title
title_features_cleaned = [gender_features(title) for title in titles_cleaned]

# Classify each title using the trained classifier
title_genders_cleaned = [classifier.classify(features) for features in title_features_cleaned]

# Count occurrences of each predicted gender
gender_counts_cleaned = Counter(title_genders_cleaned)
print("Predicted gender counts for df_cleaned titles:", gender_counts_cleaned)

Let's follow this up by creating a more comprehensive dataframe that includes the movie titles and their predicted genders

In [None]:
# A DataFrame with titles and their predicted gender
# Use df_cleaned for titles and predicted gender
gender_df = pd.DataFrame({
    'Title': titles_cleaned,
    'Predicted Gender': title_genders_cleaned
})

# Merge with Flickmetrix scores from df_cleaned
gender_df = gender_df.merge(flickmetrix_sorted[['Title', 'Flickmetrix_total']], on='Title', how='left')

# And runtime
gender_df = gender_df.merge(runtime_sorted[['Title', 'Runtime']], on='Title', how='left')

# And genres
gender_df = gender_df.merge(genres_sorted[['Title', 'Genres']], on='Title', how='left')

# Convert Flickmetrix_total to numeric for sorting
gender_df['Flickmetrix_total'] = pd.to_numeric(gender_df['Flickmetrix_total'], errors='coerce')

# Separate female and male titles, sort by Flickmetrix score descending
female_titles = gender_df[gender_df['Predicted Gender'] == 'female'].sort_values('Flickmetrix_total', ascending=False).reset_index(drop=True)
male_titles = gender_df[gender_df['Predicted Gender'] == 'male'].sort_values('Flickmetrix_total', ascending=False).reset_index(drop=True)

#Finally, we can display the top 20 titles sorted by Flickmetrix score. This is just to show a baseline of the data.
gender_df.sort_values('Flickmetrix_total', ascending=False).head(20).style.format({'Flickmetrix_total': '{:.0f}'})

From a first glance it seems that movies with "male" titles have the highest score. Let's explore why.

We start by wrangling the letters to see which ones are the most common overall. Then we move onto the most common last letter of words.

Looking at all of the letters is mostly for fun, but does give us an idea of what it could look like for the last letters in the code that follows.

In [None]:
# We first exclude all of the instances of 'the'
movie_titles_no_the = [re.sub(r'\bthe\b', '', title, flags=re.IGNORECASE).strip() for title in movie_titles]

# Display a few examples to verify
print(movie_titles[:10])
print(movie_titles_no_the[:10])

# Join the titles without 'the' into a single string and convert to lowercase. We convert to lowercase to ensure uniformity in counting letters.
all_letters = ''.join(movie_titles_no_the).lower()

# Count each character
letter_counts = Counter(all_letters)

# Display the counts for each character (a-z, digits, spaces and special characters), sorted by count descending
for letter, count in sorted(letter_counts.items(), key=lambda x: x[1], reverse=True):
    print(f"{letter}: {count}")

# It will display the original titles, then the titles without 'the', and finally the counts of each character in the titles without 'the', sorted in descending order.
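
One detail of the word-boundary pattern is worth a small sketch: removing a 'the' from the middle of a title leaves two spaces behind, and `.strip()` only trims the ends. The helper name `strip_the` is hypothetical, and the second example title is made up to show that 'the' inside a longer word is left alone.

```python
import re

def strip_the(title):
    """Remove the standalone word 'the' and tidy the leftover whitespace."""
    without_the = re.sub(r'\bthe\b', '', title, flags=re.IGNORECASE)
    # Collapse any run of whitespace into one space, then trim both ends
    return re.sub(r'\s+', ' ', without_the).strip()

print(strip_the('The Lord of the Rings'))
print(strip_the('Weather Report'))   # 'the' inside 'Weather' does not match \bthe\b
```

For the letter counts this mostly affects how many spaces are tallied, so it is a refinement rather than a correction.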

We can't test for accuracy because none of the titles have an actual gender assigned, meaning that there are no true labels against which nltk can measure the predictions. However, we do have informative features.
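
To make the point concrete: accuracy is the share of predictions that agree with known labels, so it needs both lists. The function below is a hand-rolled sketch of that idea, not NLTK's own code, and the labels are toy data.

```python
def accuracy(predicted, gold):
    """Share of predictions that match the known (gold) labels."""
    if not gold or len(predicted) != len(gold):
        raise ValueError("every prediction needs a matching gold label")
    correct = sum(p == g for p, g in zip(predicted, gold))
    return correct / len(gold)

# Names come with gold labels, so a score can be computed (toy data)
name_predictions = ['female', 'male', 'male', 'female']
name_gold_labels = ['female', 'male', 'female', 'female']
print(accuracy(name_predictions, name_gold_labels))

# Movie titles only have predictions; there is no gold list to pass in
```

For the movie titles the second argument simply does not exist, which is why the analysis turns to the distribution of predictions instead.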

In [None]:
# Let us get the most informative last letters for movie titles in determining gender.

# Extract last letters and predicted genders for each movie title
last_letters = [features['last_letter'].lower() for features in title_features_cleaned]
gender_labels = title_genders_cleaned

# Count (last_letter, gender) pairs
pair_counts = Counter(zip(last_letters, gender_labels))

# For each last letter, compute the ratio of male to female predictions
letter_stats = {}
for letter in set(last_letters):
    male_count = pair_counts.get((letter, 'male'), 0)
    female_count = pair_counts.get((letter, 'female'), 0)
    total = male_count + female_count
    if total > 0:
        ratio = male_count / total
        letter_stats[letter] = {'male': male_count, 'female': female_count, 'ratio_male': ratio, 'total': total}

# Sort by informativeness: letters with high skew toward one gender and enough samples
informative_letters = sorted(
    letter_stats.items(),
    key=lambda x: abs(x[1]['ratio_male'] - 0.5) * x[1]['total'],
    reverse=True
)

# Display the top 10 most informative last letters
print("Most informative last letters for movie title gender prediction:")
for letter, stats in informative_letters[:10]:
    print(f"Last letter '{letter}': male={stats['male']}, female={stats['female']}, total={stats['total']}, male_ratio={stats['ratio_male']:.2f}")

In [None]:
# In a dataframe:
informative_features_df = pd.DataFrame.from_dict(letter_stats, orient='index')
informative_features_df.index.name = 'last_letter'
informative_features_df = informative_features_df.reset_index()

# Sort by informativeness: letters with high skew toward one gender and enough samples
informative_features_df['informativeness'] = informative_features_df['total'] * abs(informative_features_df['ratio_male'] - 0.5) 
# We ratio it to male as we know it has a higher count than for female features
informative_features_df = informative_features_df.sort_values('informativeness', ascending=False)

# Print last letter and its statistics
for _, row in informative_features_df.iterrows():
    print(f"Last letter '{row['last_letter']}': male={row['male']}, female={row['female']}, total={row['total']}, male_ratio={row['ratio_male']:.2f}")

informative_features_df.sort_values('total', ascending=False).head(20)

#Note: the last_letter column shows each letter, and the male and female counts are the predictions of the classifier that was trained on the names corpus earlier
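
The informativeness score used in the two cells above is the distance of the male share from 0.5, multiplied by the number of titles. A worked sketch with made-up counts shows how the score rewards both a strong skew and a large group; the function name is my own.

```python
def informativeness(male_count, female_count):
    """Distance of the male share from 0.5, multiplied by the number of titles."""
    total = male_count + female_count
    if total == 0:
        return 0.0
    ratio_male = male_count / total
    return abs(ratio_male - 0.5) * total

# Made-up counts for three last letters
large_one_sided = informativeness(30, 0)   # every title predicted male, many titles
small_one_sided = informativeness(2, 0)    # every title predicted male, few titles
evenly_split = informativeness(10, 10)     # half male, half female

print(large_one_sided, small_one_sided, evenly_split)
```

A letter that always gets the same prediction scores half its number of titles, while an evenly split letter scores zero no matter how common it is.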

In [None]:
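last_letter_to_titles = defaultdict(list)
# Pair each feature set with the title it was extracted from (titles_cleaned is in the same order)
for title, features in zip(titles_cleaned, title_features_cleaned):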
    last_letter = features['last_letter'].lower()
    last_letter_to_titles[last_letter].append(title)

# Print the last letter and the corresponding movie titles
for letter, titles in last_letter_to_titles.items():
    print(f"Last letter '{letter}': {titles}")

# Convert the letter_stats dictionary to a DataFrame
letter_stats_df = pd.DataFrame.from_dict(letter_stats, orient='index')
letter_stats_df.index.name = 'last_letter'
letter_stats_df = letter_stats_df.reset_index()
letter_stats_df = letter_stats_df.sort_values(by='total', ascending=False)

letter_stats_df

Now let's see what common grounds we can settle on for the common last letters of these movies. I will take the first 10 features into account.

In [None]:
# Find the 10 most common last letters in the cleaned movie titles

# Get last letters from cleaned titles
last_letters_cleaned = [features['last_letter'].lower() for features in title_features_cleaned]
last_letter_counts = Counter(last_letters_cleaned)
top_10_letters = [letter for letter, _ in last_letter_counts.most_common(10)]

# Prepare results
results = []

# Show all rows and columns when the results are displayed (set once, before the loop)
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)

for letter in top_10_letters:
    # Filter df_cleaned for titles ending with this last letter
    mask = [features['last_letter'].lower() == letter for features in title_features_cleaned]
    subset = df_cleaned[mask]
    # Calculate average runtime
    avg_runtime = int(round(subset['Runtime'].mean()))
    # Get all genres, split and count
    all_genres = subset['Genres'].str.split(',').explode().str.strip()
    genre_counts = all_genres.value_counts().head(5)  # Top 5 genres for brevity
    results.append({
        'last_letter': letter,
        'count': len(subset),
        'avg_runtime': avg_runtime,
        'top_genres': genre_counts.to_dict()
    })



# Display as DataFrame
pd.DataFrame(results)

In [None]:
# Let's print the top five genres for each last letter in full
for entry in results:
	print(f"Last letter: {entry['last_letter']}, Top genres: {entry['top_genres']}")

In [None]:
# Let's compare to what we did in the sorting, where we generated the genre_titles_df and sorted it by the number of titles.
# Drama, Thriller and Action were the three most common genres, making this average very likely.
genre_titles_df[:10]

Additionally, since we calculated the runtime while generating the average for the genres, we now have a complete overview of what we need. 

The very last thing to do in this code is to compare the average gender features connected to the average of the other parameters.

Finally, we can see whether movie titles that are the highest rated are most commonly coded as male or as female.

In [None]:
# Find the title(s) with the highest Flickmetrix score
max_score = gender_df['Flickmetrix_total'].max()
top_titles = gender_df[gender_df['Flickmetrix_total'] == max_score].copy()  # copy, so a column can be added safely

# Get their last letters
top_titles['last_letter'] = top_titles['Title'].str[-1].str.lower()

# Get the most common last letter(s) among all titles
most_common_last_letter = top_10_letters[0]  # 'e', from previous results

print("Title(s) with the highest Flickmetrix score:")
print(top_titles[['Title', 'Flickmetrix_total', 'last_letter', 'Predicted Gender']])

print(f"\nMost common last letter among all titles: '{most_common_last_letter}'")
print("Top 10 most common last letters and their stats:")
for entry in results:
    print(f"Last letter: {entry['last_letter']}, Count: {entry['count']}, Avg runtime: {entry['avg_runtime']}, Top genres: {entry['top_genres']}")

results_df = pd.DataFrame(results)
results_df

In [None]:
# Add most common predicted gender for each last letter to the results table
results_with_gender = []

for entry in results:
    letter = entry['last_letter']
    # Find indices in title_features_cleaned where last_letter matches
    mask = [features['last_letter'].lower() == letter for features in title_features_cleaned]
    # Get predicted genders for these titles
    genders = [g for g, m in zip(title_genders_cleaned, mask) if m]
    # Count most common gender
    if genders:
        most_common_gender = Counter(genders).most_common(1)[0][0]
    else:
        most_common_gender = None
    entry_with_gender = entry.copy()
    entry_with_gender['most_common_gender'] = most_common_gender
    results_with_gender.append(entry_with_gender)

# Display as DataFrame
pd.DataFrame(results_with_gender)

In [None]:
# Calculate female to male ratio for the 10 most common last letters
ratios = []
for entry in results_with_gender[:10]:
    # Every entry is a dict built in the previous cell
    letter = entry['last_letter']
    # Mark the titles in title_features_cleaned whose last letter matches
    mask = [features['last_letter'].lower() == letter for features in title_features_cleaned]
    # Get predicted genders for these titles
    genders = [g for g, m in zip(title_genders_cleaned, mask) if m]
    female_count = genders.count('female')
    male_count = genders.count('male')
    ratio = female_count / male_count if male_count > 0 else float('inf')
    ratios.append({
        'last_letter': letter,
        'female': female_count,
        'male': male_count,
        'female_to_male_ratio': ratio,
        'most_common_gender': entry.get('most_common_gender')
    })

ratios_df = pd.DataFrame(ratios)

# Top 20 movies by Flickmetrix score with gender attached
top20 = gender_df.sort_values('Flickmetrix_total', ascending=False).head(20)[['Title', 'Predicted Gender', 'Flickmetrix_total']]
print("\nTop 20 scores from Flick Metrix with predicted gender attached:")
top20

In [None]:
# Let's write some code to calculate the specific ratio between male and female counts
male_count = (top20['Predicted Gender'] == 'male').sum()
female_count = (top20['Predicted Gender'] == 'female').sum()
if female_count > 0:
    male_female_ratio = male_count / female_count
else:
    male_female_ratio = float('inf')
print(f"\nIn the top 20: male={male_count}, female={female_count}, male/female ratio={male_female_ratio:.2f}")

That's it! We can now see from the metrics that the pool of 2024 releases favors an average runtime of 115-116 minutes and the genres of drama, thriller and action, and that the titles are mostly determined to have male features, usually ending in e, s, or n.