In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder


# Name and movies datasets

For this project we decided to work on the CMU movie dataset, which contains metadata extracted from Freebase. It includes movie metadata (box office revenue, genre, release date, runtime and language) as well as character metadata: character names aligned with information about the actors who portray them, including gender and estimated age at the time of the movie's release.
First, let's see what the two metadata files contain.

#### Characters
The dataset contains information on 450,669 characters aligned to the movies: Wikipedia movie ID, Freebase movie ID, Movie release date, Character name, Actor date of birth, Actor gender, Actor height (in meters), Actor ethnicity, Actor name, Actor age at movie release, Freebase character/actor map ID, Freebase character ID and Freebase actor ID.


#### Movies
The dataset contains information on 81,741 movies: Wikipedia movie ID, Freebase movie ID, Movie name, Movie release date, Movie box office revenue, Movie runtime, Movie languages, Movie countries and Movie genres.



## Cleaning

The cleaning task is implemented in the *clean_raw_data()* method of our CharacterData and MovieData classes (by Wikipedia Movie ID) and validated with the *check_clean_data()* method, which the 2 datasets share through Python inheritance.

In both metadata files, we directly observed shared features such as the Wikipedia movie ID and the Freebase movie ID, which are useful for later merging the 2 datasets. However, both datasets also contain several columns of Freebase and Wikipedia IDs for actors, characters and films; we decided to put these aside, as the data behind them is too difficult to access.

These are the steps we applied to each dataset before merging:

Character dataset:
- Load with the right separator.
- Rename the columns so they are easy to understand.
- Check the types: Actor date of birth and Release date as datetime, the other columns as object.
- Deal with missing values: we encoded them as NaN or NaT.
- Drop the unwanted columns.
- Check that the cleaning succeeded.

Movie dataset:
- Load with the right separator.
- Rename the columns so they are easy to understand.
- Modify the Language, Country and Genre columns: the information was stored as JSON, which is neither readable nor easy to access, so we extracted the values and replaced them with a human-readable string.
- Convert the movie runtime into a timedelta and the release date into a datetime object for further manipulation.
- Check the movie Name, Language, Country and Genre columns: we verified that they were of object type and converted them if not.
- Deal with missing values: we encoded them as NaN or NaT.
- Drop the unwanted columns.
- Check that the cleaning succeeded.
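
To illustrate the JSON and type-conversion steps above, here is a minimal sketch on a hand-made table. The column names (`Languages`, `Runtime`, `Release_date`) and the helper `json_values_to_string` are assumptions made for this example; the actual `clean_raw_data()` implementation may differ.

```python
import json
import pandas as pd

def json_values_to_string(cell):
    """Turn a JSON mapping (ID -> label) into 'label1, label2'; None if empty or missing."""
    if pd.isna(cell):
        return None
    labels = list(json.loads(cell).values())
    return ", ".join(labels) if labels else None

# Hypothetical raw rows, mimicking the Freebase-style JSON cells
raw = pd.DataFrame({
    "Languages": ['{"/m/02h40lc": "English Language"}', "{}"],
    "Runtime": [95.0, None],                 # runtime in minutes
    "Release_date": ["2000-02-16", None],
})

clean = raw.copy()
clean["Languages"] = clean["Languages"].apply(json_values_to_string)
clean["Runtime"] = pd.to_timedelta(clean["Runtime"], unit="min")               # missing -> NaT
clean["Release_date"] = pd.to_datetime(clean["Release_date"], errors="coerce")  # missing -> NaT

print(clean)
```

Missing or empty cells end up as None/NaT, which matches the "NaN or NaT" convention described in the list.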


We kept the whole dataset, with its NaN and NaT values, in a separate file in order to retain features that could be interesting even though their rate of missing values is very high (such as ethnicity or box office revenue). However, for the following notebook, we decided to remove these 2 columns since more than 70% of their values are missing.


## Demo

Here, we import and clean the data to demonstrate the process.


In [None]:
# imports my code from the file src/data/movies_char_data.py
import src.data.movies_char_data as MovieChar

#### Characters 

In [None]:
character_df = MovieChar.CharacterData("Character", "character.metadata.tsv", output_name = "character_data_clean.csv")
character_df.raw_df.head()

In [None]:
# print duplicated rows
character_df.clean_raw_data()
character_df.clean_df.head()

In [None]:
character_df.pipeline()

####  Movie dataset

In [None]:
movie_df = MovieChar.MovieData("Movie", "movie.metadata.tsv", output_name = "movie_data_clean.csv")

#Display name and file_name
print(movie_df.name, movie_df.file_name)

movie_df.raw_df.head()

In [None]:
# print duplicated rows
movie_df.clean_raw_data()
movie_df.clean_df.head()

In [None]:
movie_df.pipeline()

## Merging Movie and Character into one dataset

We merged the 2 datasets on the Wikipedia movie ID.
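
As a rough illustration of what such a merge does, here is a sketch on toy data. It assumes both cleaned tables expose a `Wikipedia_movie_ID` column and uses an inner join; the real `merge_movies_characters_data` helper may use a different join type or handle columns differently.

```python
import pandas as pd

# Toy tables standing in for the cleaned movie and character data
movies = pd.DataFrame({
    "Wikipedia_movie_ID": [1, 2],
    "Movie_name": ["Movie A", "Movie B"],
})
characters = pd.DataFrame({
    "Wikipedia_movie_ID": [1, 1, 3],
    "Character_name": ["ALICE", "BOB", "CAROL"],
})

# Inner join (assumption): one row per character whose movie exists in both tables
merged = movies.merge(characters, on="Wikipedia_movie_ID", how="inner")
print(merged)
```

The movie columns are repeated on every character row, which is why the merged table later needs `drop_duplicates()` when only movie-level information is required.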

In [None]:
from src.utils.movies_utils import *

mov_char_data = merge_movies_characters_data(movie_df, character_df)

mov_char_data.head()  # Preview the merged data

In [None]:
# Print size of the dataset
print(mov_char_data.shape)

In [None]:
mov_char_data.head(100)  # Preview the first 100 rows of the merged data

# Names datasets

In order to answer our research questions, we needed to find some birth registries from different countries. Those were freely available and we found datasets for the following countries:

- France
- USA
- United Kingdom
- (add others if needed)

Since they all came from different places and didn't follow the same structure, we had to decide what kind of data was necessary for our project and what structure would be the most practical to work with. We ended up with the following columns in our dataframes:

1. **Year** : An integer value giving the year of the count 
2. **Name** :  A string representing the name that was counted
3. **Sex** : There are two possible characters, 'F' (female) and 'M' (male)
4. **Count** : An integer value giving the count of the name during this year

## Data homogenization

The cleaning task is implemented in the *clean_raw_data()* method of our NamesData classes (one per country) and validated with the *check_clean_data()* method, which all the name datasets share through Python inheritance.

### Column structure
This task was not too difficult, since it mostly consisted of reordering the columns, renaming the ones we needed and dropping the ones that were not useful for our project. We also made sure that each column uses the same type across the different datasets.

### Year 
All of our datasets had the same year format, but some had missing values in this field, which made those rows useless; they were therefore discarded.
This column made it hard to find datasets from more countries, since many of them only started counting in the early 2000s, which doesn't give us enough data to detect real changes in the distribution. (The movie dataset ends in 2012.)

### Name
This was the hardest column to sanitize and clean, since many variations of the same name are possible. We ended up defining a regular expression that describes what we accept as a valid name: `^[A-Z-\s\']+$`

This restricts us to names composed only of capital letters, spaces, '-' for compound names and ''' for some regional variations. Applied to the raw data, this rule is very strict and would have made us lose a considerable proportion of our dataset. This is where the cleaning process helps to homogenize our data; it mainly consists of the following operations:

- Converting all names to uppercase
- Removing all accents from letters; for example, é becomes e.

Some names have different spellings, for example JEREMY and JEREMIE, but we decided to count these as two separate entries, since grouping "similar" names is out of the scope of this project and there is no standard definition of similarity.
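
The two normalization operations and the regex check described above can be sketched with the standard library only. The function names `normalize_name` and `is_valid_name` are hypothetical and chosen for this example; they are not necessarily the ones used in our NamesData classes.

```python
import re
import unicodedata

# Same pattern as in the text: capital letters, '-', whitespace and apostrophes
VALID_NAME = re.compile(r"^[A-Z-\s\']+$")

def normalize_name(name):
    """Strip accents and convert to uppercase (e.g. 'Jérémie' -> 'JEREMIE')."""
    # NFKD splits accented letters into a base letter plus combining marks
    decomposed = unicodedata.normalize("NFKD", name)
    without_accents = "".join(c for c in decomposed if not unicodedata.combining(c))
    return without_accents.upper().strip()

def is_valid_name(name):
    """Return True if the name matches the validation regex."""
    return VALID_NAME.match(name) is not None

for raw_name in ["Jérémie", "anne-marie", "D'Arcy", "R2D2"]:
    cleaned = normalize_name(raw_name)
    print(raw_name, "->", cleaned, "| valid:", is_valid_name(cleaned))
```

Names that still fail the check after normalization (here the one containing digits) are the ones that would be discarded.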

### Sex
The French dataset encoded this field with integer values, which we converted to the expected format. This field is useful for our research questions, but it complicated the search for datasets, since many countries do not include this information in their registries.

### Validation

The Python class representing our datasets contains a method *check_clean_data()* that checks multiple conditions to make sure that the data is uniform:

- Checks the column names
- Checks whether any missing values are present
- Checks the data type of each column
- Checks for duplicated rows (same year, same name and same sex)
- Checks that the strings respect the defined regular expressions
- Checks that the counts and years are coherent numbers
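
A minimal sketch of how such checks could look with pandas is given below. It is an illustration under assumptions, not the actual *check_clean_data()* code: the function name `check_names_frame`, the returned messages and the exact rules (for example, counts must be strictly positive) are ours for this example.

```python
import pandas as pd

EXPECTED_COLUMNS = ["Year", "Name", "Sex", "Count"]
NAME_PATTERN = r"^[A-Z-\s\']+$"

def check_names_frame(df):
    """Return a list of problems found in a names DataFrame (empty if it is clean)."""
    if list(df.columns) != EXPECTED_COLUMNS:
        return [f"unexpected columns: {list(df.columns)}"]
    problems = []
    if df.isna().any().any():
        problems.append("missing values")
    if df.duplicated(subset=["Year", "Name", "Sex"]).any():
        problems.append("duplicated (Year, Name, Sex) rows")
    if not df["Name"].astype(str).str.match(NAME_PATTERN).all():
        problems.append("invalid names")
    if not df["Sex"].isin(["F", "M"]).all():
        problems.append("invalid sex values")
    if not pd.api.types.is_integer_dtype(df["Year"]):
        problems.append("non-integer years")
    # The dtype test runs first, so the comparison is only evaluated on integer columns
    if not pd.api.types.is_integer_dtype(df["Count"]) or (df["Count"] <= 0).any():
        problems.append("non-positive or non-integer counts")
    return problems

clean = pd.DataFrame({"Year": [1999, 1999], "Name": ["NEO", "TRINITY"],
                      "Sex": ["M", "F"], "Count": [12, 7]})
dirty = pd.DataFrame({"Year": [1999, 1999], "Name": ["NEO", "NEO"],
                      "Sex": ["M", "M"], "Count": [12, 0]})
print(check_names_frame(clean))
print(check_names_frame(dirty))
```

Returning the list of problems rather than a single boolean makes it easier to see which rule a dataset violates.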

## Demo

Here, we import and clean the data to demonstrate the process.


In [None]:
import src.data.names_data as NamesData
ukNames = NamesData.UKNamesData("UK", "ukbabynames.csv")

# The raw data directly from the file
ukNames.raw_df.head()

In [None]:
# We can call the cleaning method, which will correct the columns' names and ordering, and clean the content
ukNames.clean_raw_data()
ukNames().head() #  This is the cleaned data

In [None]:
frenchNames = NamesData.FranceNamesData("France", "france.csv", "https://www.insee.fr/fr/statistiques/8205621?sommaire=8205628#dictionnaire", ";")

frenchNames.raw_df.head()

In [None]:
frenchNames.clean_raw_data()
frenchNames().head()


In [None]:
USNames = NamesData.USNamesData("US", "babyNamesUSYOB-full.csv")
USNames.raw_df.head()

In [None]:
USNames.clean_raw_data()
USNames().head()

### Merging the datasets
If we want to answer a question regardless of where the names come from, we can use our function to group all the datasets together.

In [None]:
from src.utils.names_utils import merge_names_data

global_names = merge_names_data([ukNames, frenchNames, USNames])
global_names().head()
print(f"The merged dataset contains {global_names().shape[0]} rows ! ")

## Feature Visualization

Let's visualize the different information from the datasets.

In [None]:
from src.utils.data_utils import *

In [None]:
# Number of NaN values in the Movies & Character dataset
mov_char_data.isna().sum()

In [None]:
# Visualizing the number of missing values per columns
nan_percentage = mov_char_data.isnull().mean().sort_values(ascending=False)

# Plot the percentage of NaN values per column
ax = nan_percentage.plot(kind='bar', figsize=(16, 8), color='skyblue')
plt.ylabel('Percentage of NaN values')
plt.title('Percentage of NaN values in % per column', fontsize=20)
for p in ax.patches:
    ax.annotate(f'{p.get_height():.2%}', (p.get_x() + p.get_width() / 2., p.get_height()),
                ha='center', va='center', xytext=(0, 10), textcoords='offset points')
plt.show()

This graph shows, for each column, the percentage of NaN values it contains. It helps us understand how missing data is spread across the different features, highlighting the columns with higher or lower levels of incompleteness.

Top 50 movies by revenue
This graph displays the top 50 movies ranked by revenue only.

## Trend evaluation
To assess the impact a movie had on baby naming, we first try a simple model: for each character name, it computes the average yearly count of babies given that name over the 5 years before and the 5 years after the movie's release, then takes the difference (after minus before). The larger the difference, the stronger the trend around the release year.
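
The before/after comparison can be sketched as follows for a single name. This is only an illustration: the function name `trend_score` is hypothetical, the counts are made up, and we assume here that the release year itself is excluded from both windows, which may differ from what `trend_eval_ranking` does.

```python
import pandas as pd

def trend_score(counts_by_year, release_year, window=5):
    """Mean yearly count over the `window` years after release minus the mean over the `window` years before."""
    # .loc slicing on a sorted integer index is inclusive on both ends
    before = counts_by_year.loc[release_year - window: release_year - 1]
    after = counts_by_year.loc[release_year + 1: release_year + window]
    return after.mean() - before.mean()

# Made-up yearly counts for one name, indexed by year (1994 to 2004)
counts = pd.Series(
    [10, 10, 10, 10, 10, 20, 30, 30, 30, 30, 30],
    index=range(1994, 2005),
)
print(trend_score(counts, release_year=1999))
```

A positive score means the name was more common after the release than before; ranking names by this score gives the most "trend-inducing" characters.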

Here is a demo of the model printing the top 10 trend-inducing character names.

In [43]:
# Make a new dataframe from mov_char_data with only the movie name and the Wikipedia movie ID
movie_name_df = mov_char_data[["Movie_name","Wikipedia_movie_ID"]]
print(movie_name_df.shape)
movie_name_df.head()




(171400, 2)


Unnamed: 0,Movie_name,Wikipedia_movie_ID
0,Getting Away with Murder: The JonBenét Ramsey ...,3196793
1,Getting Away with Murder: The JonBenét Ramsey ...,3196793
2,Getting Away with Murder: The JonBenét Ramsey ...,3196793
3,Getting Away with Murder: The JonBenét Ramsey ...,3196793
4,Getting Away with Murder: The JonBenét Ramsey ...,3196793


In [47]:
# Remove duplicated rows from movie_name_df
movie_name_df = movie_name_df.drop_duplicates()
print(movie_name_df.shape)


(31627, 2)


In [None]:
from src.models.trend_evaluation import trend_eval_ranking
from src.models.imdb_manipulation import get_movie_votes, merge_with_characters

# Get the IMDB votes for the movies (Warning : might take a few minutes, nearly 700K rows !)
imdb_titles = get_movie_votes("data/raw/imdb")
imdb_titles.head()




In [48]:
# Merging the IMDB votes and rating with the dataset of movie names
merge_with_rating = merge_with_characters(imdb_titles, movie_name_df)
print(merge_with_rating.shape)
merge_with_rating.head(10)






There are 55459 rows in the merged dataset
(55459, 7)


Unnamed: 0,Movie_name,Wikipedia_movie_ID,tconst,primaryTitle,originalTitle,numVotes,averageRating
0,Getting Away with Murder: The JonBenét Ramsey ...,3196793,,,,,
1,The Gangsters,13696889,tt0139667,The Gangsters,Les truands,36.0,5.5
2,The Gangsters,13696889,tt27788655,The Gangsters,The Gangsters,14.0,5.8
3,The Sorcerer's Apprentice,18998739,tt0075811,The Sorcerer's Apprentice,Carodejuv ucen,1224.0,7.5
4,The Sorcerer's Apprentice,18998739,tt0120166,The Sorcerer's Apprentice,The Sorcerer's Apprentice,742.0,4.2
5,The Sorcerer's Apprentice,18998739,tt0963966,The Sorcerer's Apprentice,The Sorcerer's Apprentice,172102.0,6.1
6,Alexander's Ragtime Band,10408933,tt0029852,Alexander's Ragtime Band,Alexander's Ragtime Band,2361.0,6.8
7,Little city,6631279,,,,,
8,Henry V,171005,tt0036910,Henry V,The Chronicle History of King Henry the Fifth ...,7261.0,7.0
9,Henry V,171005,tt0097499,Henry V,Henry V,32160.0,7.5


In [51]:
aggregated_data = merge_with_rating.groupby('primaryTitle').apply(
    lambda group: pd.Series({
        'weightedAverageRating': (group['numVotes'] * group['averageRating']).sum() / group['numVotes'].sum(),
        'totalVotes': group['numVotes'].sum()
    })
).reset_index()

# Display the result
aggregated_data.sort_values('weightedAverageRating', ascending=False).head(10)



Unnamed: 0,primaryTitle,weightedAverageRating,totalVotes
11700,Ninaithale,9.4,6.0
13386,Raja Ki Aayegi Baraat,9.4,25.0
12113,One Woman Show,9.3,47.0
10889,Moksha,9.3,51.0
19287,The Shawshank Redemption,9.3,2971050.0
15284,Spicy Mac Project,9.3,116.0
8411,Janam,9.251695,2124.0
17555,The Godfather,9.2,2072334.0
4392,Day One,9.2,8.0
13430,Ramayana: The Legend of Prince Rama,9.2,15670.0


In [52]:
aggregated_data.sort_values('totalVotes', ascending=False).head(10)

Unnamed: 0,primaryTitle,weightedAverageRating,totalVotes
1018,Alice in Wonderland,6.629452,6729712.0
9554,Les Misérables,7.498881,4808408.0
20288,Titanic,7.89181,3974739.0
2102,Beauty and the Beast,7.592731,3541676.0
6618,Gladiator,8.487226,3409098.0
16427,The Avengers,7.873179,3063446.0
19287,The Shawshank Redemption,9.3,2971050.0
17043,The Dark Knight,9.0,2952379.0
8066,Inception,8.8,2619982.0
17623,The Great Gatsby,7.164842,2556964.0


In [None]:
def is_blockbuster(row, votes_threshold=1000000, rating_threshold=8.0):
    """
    Determines if a movie is a blockbuster based on total votes and weighted average rating.

    :param row: A row of the DataFrame
    :param votes_threshold: The minimum number of votes to qualify as a blockbuster
    :param rating_threshold: The minimum average rating to qualify as a blockbuster
    :return: Boolean (True if blockbuster, False otherwise)
    """
    return row['totalVotes'] > votes_threshold and row['weightedAverageRating'] >= rating_threshold

# Apply the function to the DataFrame
aggregated_data['isBlockbuster'] = aggregated_data.apply(is_blockbuster, axis=1)

# Display the first few rows
display(aggregated_data.head())




Unnamed: 0,primaryTitle,weightedAverageRating,totalVotes,isBlockbuster
0,$,6.3,2961.0,False
1,$9.99,6.7,3525.0,False
2,'Neath the Arizona Skies,5.0,1174.0,False
3,'R Xmas,5.7,1661.0,False
4,'Til There Was You,4.8,3038.0,False


Unnamed: 0,primaryTitle,weightedAverageRating,totalVotes,isBlockbuster
205,A Beautiful Mind,8.2,1006552.0,True
1165,American Beauty,8.3,1231659.0,True
1184,American History X,8.5,1211578.0,True
1848,Back to the Future,8.5,1346149.0,True
2029,Batman Begins,8.2,1616000.0,True


In [59]:
# Filter for blockbusters
blockbusters = aggregated_data[aggregated_data['isBlockbuster']]
blockbusters.sort_values('weightedAverageRating', ascending=False).head(50)

Unnamed: 0,primaryTitle,weightedAverageRating,totalVotes,isBlockbuster
19287,The Shawshank Redemption,9.3,2971050.0,True
17555,The Godfather,9.2,2072334.0,True
17556,The Godfather Part II,9.0,1398975.0,True
17043,The Dark Knight,9.0,2952379.0,True
18281,The Lord of the Rings: The Return of the King,9.0,2033640.0,True
14325,Schindler's List,9.0,1489706.0,True
18280,The Lord of the Rings: The Fellowship of the Ring,8.9,2063322.0,True
13153,Pulp Fiction,8.9,2280958.0,True
6156,Forrest Gump,8.8,2324952.0,True
18282,The Lord of the Rings: The Two Towers,8.8,1833778.0,True


In [61]:
# Step 1: Standardize both columns for consistency
aggregated_data['primaryTitle_clean'] = aggregated_data['primaryTitle'].str.strip().str.lower()
mov_char_data['Movie_name_clean'] = mov_char_data['Movie_name'].str.strip().str.lower()

# Step 2: Merge the data
merged_data = mov_char_data.merge(
    aggregated_data[['primaryTitle_clean', 'weightedAverageRating', 'totalVotes', 'isBlockbuster']],
    left_on='Movie_name_clean',
    right_on='primaryTitle_clean',
    how='left'
)

# Step 3: Drop the helper columns if needed
merged_data.drop(['Movie_name_clean', 'primaryTitle_clean'], axis=1, inplace=True)

# Display the first few rows of the new dataset
merged_data.head()


Unnamed: 0,Wikipedia_movie_ID,Movie_name,Release_date,Revenue,Runtime,Languages,Countries,Genres,Character_name,Actor_DOB,Actor_gender,Actor_height,Actor_name,Actor_age,weightedAverageRating,totalVotes,isBlockbuster
0,3196793,Getting Away with Murder: The JonBenét Ramsey ...,2000-02-16,,0 days 01:35:00,English,United States of America,"Mystery, Biographical film, Drama, Crime Drama",POLICE OFFICER,NaT,M,,ALLEN CUTLER,,,,
1,3196793,Getting Away with Murder: The JonBenét Ramsey ...,2000-02-16,,0 days 01:35:00,English,United States of America,"Mystery, Biographical film, Drama, Crime Drama",REPORTER,1956-12-19,F,,ALICE BARRETT,43.0,,,
2,3196793,Getting Away with Murder: The JonBenét Ramsey ...,2000-02-16,,0 days 01:35:00,English,United States of America,"Mystery, Biographical film, Drama, Crime Drama",FBI PROFILER ROBERT HANKS,1950-01-05,M,,ROBERT CATRINI,50.0,,,
3,3196793,Getting Away with Murder: The JonBenét Ramsey ...,2000-02-16,,0 days 01:35:00,English,United States of America,"Mystery, Biographical film, Drama, Crime Drama",JOHN RAMSEY,1945-02-12,M,1.85,CLIFF DEYOUNG,55.0,,,
4,3196793,Getting Away with Murder: The JonBenét Ramsey ...,2000-02-16,,0 days 01:35:00,English,United States of America,"Mystery, Biographical film, Drama, Crime Drama",PATSY RAMSEY,1964-07-12,F,1.63,JUDI EVANS,35.0,,,


In [66]:
# Drop the rows without blockbuster information, then display the top 10 blockbusters grouped by movie name
merged_data.dropna(subset=['isBlockbuster'], inplace=True)
blockbusters = merged_data[merged_data['isBlockbuster']]
blockbusters.groupby('Movie_name').first().sort_values('weightedAverageRating', ascending=False).head(10)

Movie_name,Wikipedia_movie_ID,Release_date,Revenue,Runtime,Languages,Countries,Genres,Character_name,Actor_DOB,Actor_gender,Actor_height,Actor_name,Actor_age,weightedAverageRating,totalVotes,isBlockbuster
The Shawshank Redemption,30625,1994-09-10,28341470.0,0 days 02:22:00,English,United States of America,"Crime Fiction, Buddy film, Prison film, Drama,...",WARDEN NORTON,1945-11-15,M,1.85,BOB GUNTON,48.0,9.3,2971050.0,True
The Godfather,2466773,1972-03-15,268500000.0,0 days 02:57:00,"Latin, Italian, Sicilian, English",United States of America,"Crime Fiction, Gangster Film, Crime Drama, Fam...",VIRGIL SOLLOZZO,1928-02-24,M,1.75,AL LETTIERI,44.0,9.2,2072334.0,True
The Godfather Part II,73875,1974-12-12,193000000.0,0 days 03:20:00,"Italian, English, Latin, Spanish",United States of America,"Crime Fiction, Gangster Film, Drama, Crime Drama",QUESTADT,1928-01-20,M,1.7,PETER DONAT,46.0,9.0,1398975.0,True
The Dark Knight,4276475,2008-07-16,1004558000.0,0 days 02:33:00,"Standard Mandarin, English","United States of America, United Kingdom","Crime Fiction, Thriller, Superhero movie, Crim...",HARVEY DENT,1968-03-12,M,1.83,AARON ECKHART,40.0,9.0,2952379.0,True
The Lord of the Rings: The Return of the King,174251,2003-12-17,1119930000.0,0 days 04:10:00,"Old English language, English","United States of America, New Zealand","Fantasy Adventure, Adventure, Epic, Action/Adv...",MERCENARY ON BOAT,1961-10-31,M,1.69,PETER JACKSON,42.0,9.0,2033640.0,True
Schindler's List,65834,1993-11-30,321306300.0,0 days 03:06:00,"French, Polish, Hebrew, German, English",United States of America,"Tragedy, Biography, History, War film, Biopic ...",MILA PFEFFERBERG,1952-06-07,M,1.93,ADI NITZAN,41.0,9.0,1489706.0,True
The Lord of the Rings: The Fellowship of the Ring,173941,2001-12-10,871530300.0,0 days 02:58:00,English,"United States of America, New Zealand","Fantasy Adventure, Adventure, Epic, Fantasy, F...",FRODO BAGGINS,1981-01-28,M,1.68,ELIJAH WOOD,20.0,8.9,2063322.0,True
Pulp Fiction,54173,NaT,213928800.0,0 days 02:48:00,"French, English, Spanish",United States of America,"Crime Fiction, Thriller, Crime Comedy, Indie, ...",JODY,1959-08-10,F,1.64,ROSANNA ARQUETTE,,8.9,2280958.0,True
Forrest Gump,41528,1994-06-23,677387700.0,0 days 02:16:00,English,United States of America,"Coming of age, Comedy film, Drama, War film, R...",YOUNG FORREST GUMP,1956-07-09,M,1.83,MICHAEL CONNER HUMPHREYS,9.0,8.8,2324952.0,True
The Lord of the Rings: The Two Towers,173944,2002-12-05,926047100.0,0 days 02:59:00,"Old English language, English","United States of America, New Zealand","Fantasy Adventure, Adventure, Epic, Action/Adv...",ALDOR,1930-08-25,M,1.68,BRUCE ALLPRESS,72.0,8.8,1833778.0,True


### Emile


In [None]:
from src.models.trend_evaluation import trend_eval_ranking
from src.models.imdb_manipulation import get_movie_votes, merge_with_characters

# Get the IMDB votes for the movies (Warning : might take a few minutes, nearly 700K rows !)
imdb_titles = get_movie_votes("data/raw/imdb")
# Merge the movies and characters data with the IMDB votes
char_rating = merge_with_characters(imdb_titles, mov_char_data)



In [None]:
def merge_with_characters(imdb_df, characters_df):
    """
    Merge the IMDb ratings and vote counts into the characters data.

    `Movie_name` is matched against `primaryTitle` first, then against `originalTitle`
    as a fallback. IMDb titles are deduplicated before merging so that the left merges
    cannot add rows to `characters_df`.

    :param imdb_df: DataFrame with primaryTitle, originalTitle, averageRating and numVotes columns
    :param characters_df: DataFrame with a Movie_name column
    :return: DataFrame (characters_df with averageRating and numVotes columns added)
    """
    rating_cols = ['averageRating', 'numVotes']

    # Keep one IMDb entry per title so the merges cannot duplicate rows.
    # Assumption: when several IMDb entries share a title, the most voted one is kept.
    imdb_sorted = imdb_df.sort_values('numVotes', ascending=False)
    by_primary = imdb_sorted.drop_duplicates(subset='primaryTitle')[['primaryTitle'] + rating_cols]
    by_original = imdb_sorted.drop_duplicates(subset='originalTitle')[['originalTitle'] + rating_cols]

    # Merge based on primaryTitle
    char_rating = characters_df.merge(
        by_primary,
        left_on='Movie_name',
        right_on='primaryTitle',
        how='left'
    )
    # Drop the redundant 'primaryTitle' column
    char_rating = char_rating.drop(columns=['primaryTitle'])

    # Merge based on originalTitle to fill missing data
    char_rating = char_rating.merge(
        by_original,
        left_on='Movie_name',
        right_on='originalTitle',
        how='left',
        suffixes=('_primary', '_original')
    )
    # Use primary title data if available, otherwise fall back to original title
    char_rating['averageRating'] = char_rating['averageRating_primary'].combine_first(char_rating['averageRating_original'])
    char_rating['numVotes'] = char_rating['numVotes_primary'].combine_first(char_rating['numVotes_original'])

    # Drop temporary columns
    char_rating = char_rating.drop(columns=['originalTitle', 'averageRating_primary', 'averageRating_original', 'numVotes_primary', 'numVotes_original'])

    print(f"There are {char_rating.shape[0]} rows in the merged dataset (no rows were added by the merge).")
    return char_rating


In [None]:
# Study the number of NaN values per column, including the rating columns
print(char_rating.isna().sum())

# Plot the release years of the rows that have a NaN value in the averageRating column
# (.copy() avoids modifying a view of char_rating)
nan_rating = char_rating[char_rating['averageRating'].isna()].copy()
nan_rating['Release_date'] = nan_rating['Release_date'].dt.year
nan_rating['Release_date'].value_counts().sort_index().plot(kind='bar', figsize=(16, 8), color='skyblue')
plt.ylabel('Number of movies with NaN averageRating')
plt.xlabel('Release_date')
plt.title('Number of movies with NaN averageRating per year', fontsize=20)
plt.show()

In [None]:
# Create a dataset with only the movies that have a rating
char_rating_cleaned = char_rating.dropna(subset=['averageRating'])

# Size of the resulting dataset
print(char_rating_cleaned.shape)


In [None]:
# Ranking top 10 influencing character names by trend increase
ranking = trend_eval_ranking(global_names.clean_df, char_rating)
print(ranking[["Character_name","movie_name","release_year"]].head(10))

### Trend visualisation
Using the previously computed trend-inducing movies, we can now plot the popularity of a baby name over time, with a red line marking the year of the most influential movie for this name.

Note that the name must be given in uppercase for compatibility with the name datasets, together with the gender (M/F) to avoid confusion for androgynous names.

In [None]:
from src.models.trend_evaluation import plot_trend

plot_trend("NEO", "M", ranking, global_names.clean_df)

## Name prediction

To answer our research questions, we needed a method to determine whether, after a specific date, the count of a name follows an abnormal evolution.

There are multiple ways to do this, and for our first attempt we decided to try interrupted time series.

### ITS - Interrupted time series
The concept is rather simple: at a specific point in time, we split our measurements into two parts and use the first one to train a model. This model tries to predict what the evolution would have been based on the previous behaviour, and we can then compare its prediction with the second part of the data that we kept aside.
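
To make the idea concrete, here is a minimal sketch in which a simple linear trend stands in for the forecasting model. This is an assumption made purely for illustration: it is not one of the models used in our project, and the function name `its_linear_gap` and the counts are made up.

```python
import numpy as np

def its_linear_gap(years, counts, split_year):
    """Fit a linear trend on the data up to `split_year` and return observed minus predicted counts afterwards."""
    years = np.asarray(years, dtype=float)
    counts = np.asarray(counts, dtype=float)
    pre = years <= split_year
    # Center the years on the split point to keep the fit numerically well-behaved
    x = years - split_year
    slope, intercept = np.polyfit(x[pre], counts[pre], deg=1)
    predicted = slope * x[~pre] + intercept
    return counts[~pre] - predicted

# Made-up counts: a steady linear growth, plus a jump of 50 after the split year
years = np.arange(1970, 1982)
counts = 100.0 + 10.0 * (years - 1970)
counts[years > 1976] += 50.0

gap = its_linear_gap(years, counts, split_year=1976)
print(np.round(gap, 2))
```

A gap that stays clearly away from zero after the split suggests that the series departed from its previous behaviour; with real data, the size of the gap has to be judged against the usual year-to-year noise.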

As mentioned in the explanation, we need to choose a model for this, and after some research we decided to try the following two.

We are still evaluating how well they are suited to our project, since the training sample is quite limited due to the granularity of the data (counts are per year).

#### SARIMA - Seasonal Autoregressive Integrated Moving Average
A well-known model for univariate time series forecasting, SARIMA is an extension of the ARIMA model that adds support for time series with seasonal behaviour, in addition to the trend support of ARIMA.

In [None]:
from src.models.naming_prediction import predict_naming_ARIMA

prediction = predict_naming_ARIMA(global_names, "LUKE", 1976, 10, True)

Here we asked our SARIMA model to forecast the counts for the name "Luke" for the years following 1976, i.e. starting in 1977, the year the first Star Wars movie was released.

We can see that the modelled curve grows more slowly than the actual one, which can be used to show an abnormal evolution of the count.

#### Prophet
Developed by Facebook, Prophet is a fully automatic procedure for time series forecasting that is used in various contexts thanks to its wide range of features (seasonality, holiday effects, ...).

In [None]:
from src.models.naming_prediction import predict_naming_prophet

prediction = predict_naming_prophet(global_names, "LUKE", 1976, 10, True)

This time, we use Facebook's Prophet to forecast the counts with the same parameters, and we can already see a difference between the two models. Prophet is generally more robust to outliers, and here this leads to a larger gap between the modelled data and the actual data.

For now, these are only observations; they will be investigated more thoroughly in the coming days.

#### Model conclusion
We still need to compare the two models and see if the ITS approach would be beneficial for our project since other options are available.