# Matching IMDB to Wikipedia

This notebook merges our Wikipedia dataframe with the movies in our dataframe gathered from IMDB.
- The first step is to get all movies that match in both dataframes on title and year.
- The second step is to account for movies where the title matches but the years differ somewhat, potentially due to user error.
- The final step is to fuzzy match movies where the titles differ only through small conflicts like ":" instead of "-".*

*For this reason the matching threshold is set very high; we are not trying to find a best fit for every remaining movie.


In [5]:
import pandas as pd
import re
import ast
import numpy as np
from rapidfuzz import process, fuzz

In [6]:
wiki_movies = pd.read_csv("wikipedia_movie_franchises.csv", index_col = 0)
imdb_movies = pd.read_csv("imdb_movies_db.csv", index_col = 0)

We need to convert the year in imdb_movies to an integer. We first remove any rows without a value (signified by \N) and then convert to int.

In [7]:
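# Drop rows whose year is the IMDB missing-value marker, then cast to int.
# .copy() avoids a SettingWithCopyWarning on the assignment below.
imdb_movies = imdb_movies[imdb_movies["startYear"] != "\\N"].copy()
imdb_movies["startYear"] = imdb_movies["startYear"].astype(int)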
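
A more defensive variant of this conversion can be sketched on a toy frame (the sample rows are made up for illustration). It assumes we want any non-numeric year treated as missing, not only `\N`, and uses `pd.to_numeric` with `errors="coerce"` to do that in one step:

```python
import pandas as pd

# Toy stand-in for imdb_movies; the values are illustrative only.
toy = pd.DataFrame({
    "tconst": ["tt1", "tt2", "tt3"],
    "startYear": ["1939", "\\N", "2019"],
})

# Anything that is not a number (such as "\N") becomes NaN ...
toy["startYear"] = pd.to_numeric(toy["startYear"], errors="coerce")
# ... and rows without a year are dropped before casting to int.
toy = toy.dropna(subset=["startYear"]).copy()
toy["startYear"] = toy["startYear"].astype(int)

print(toy)
```

The trade-off is that malformed years are silently dropped instead of raising an error, which may or may not be what is wanted here.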

Convert the akas from a string to a list:

In [8]:
def extract_akas(row):
    # Parse the string form of a list (e.g. "['A', 'B']") into a real list.
    # Non-string values such as NaN return None.
    if isinstance(row, str):
        return ast.literal_eval(row)
    return None

In [9]:
imdb_movies["akas"] = imdb_movies["akas"].apply(extract_akas)

Combine all the names into a new column

In [10]:
def combine_names(row):
    tempList = []
    tempList.append(row["primaryTitle"])
    tempList.append(row["originalTitle"])
    if isinstance(row["akas"], list):
        for x in row["akas"]:
            tempList.append(x)
    return tempList

imdb_movies["all_names"] = imdb_movies.apply(combine_names, axis=1)


Remove any duplicates in the all_names column

In [11]:
imdb_movies["all_names"] = imdb_movies["all_names"].apply(lambda x: list(set(x)))

Create a row for each item in "all_names", to allow our matching to run through the dataframe smoothly, instead of going into each list in each row and checking them there.

In [12]:
imdb_movies = imdb_movies.explode("all_names")
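
The name-preparation steps above can be sketched end to end on a two-row toy frame. The rows are illustrative, and the order-preserving de-duplication with `dict.fromkeys` is an assumption of this sketch (the notebook uses `list(set(x))`, which does not guarantee order):

```python
import pandas as pd

# Toy stand-in for imdb_movies after the akas column has been parsed into lists.
toy = pd.DataFrame({
    "tconst": ["tt1", "tt2"],
    "primaryTitle": ["What a Life", "Finis Hominis"],
    "originalTitle": ["What a Life", "Finis Hominis"],
    "akas": [
        ["What a Life", "A Vida Começa aos 14"],
        ["The End of Man", "Finis Hominis"],
    ],
})

def combine_names(row):
    names = [row["primaryTitle"], row["originalTitle"]]
    if isinstance(row["akas"], list):
        names.extend(row["akas"])
    # dict.fromkeys drops duplicates while keeping first-seen order.
    return list(dict.fromkeys(names))

toy["all_names"] = toy.apply(combine_names, axis=1)

# One row per (movie, name); the original index label is repeated for each name.
exploded = toy.explode("all_names")
print(exploded[["tconst", "all_names"]])
```

Note that `explode` repeats the original index label for every name of a movie, so the exploded frame no longer has a unique index.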

### Merge the dataframes

Now we combine the dataframes in stages: first matching on both year and name, then matching on exact name alone, with an allowed offset in year to cover any user error in IMDB and Wikipedia. The aim is to shrink the dataframes after each stage by filtering, so that we don't re-match movies that have already been matched.

In [13]:
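# Columns to drop from a merged frame to get back to each side's own columns:
# wiki_filter lists the IMDB columns (dropped to leave the Wikipedia remainder),
# imdb_filter lists the Wikipedia columns (dropped to leave the IMDB remainder).
wiki_filter = ["tconst", "titleType", "primaryTitle","originalTitle", "isAdult","startYear","endYear","runtimeMinutes","genres","averageRating","numVotes","akas","_merge", "all_names"]
imdb_filter = ["franchise_name", "franchise_id", "movie_name", "release_year","_merge"]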

As the matching takes place, we merge both of our datasets. To get the unmatched rows of each dataframe, we keep the rows marked "left_only" or "right_only" in the "_merge" column, drop the other dataframe's columns, and return the result as a remainder dataframe.

In [14]:
def filter_df(df, merge, cols):
    remainder = df[df["_merge"]== merge]
    remainder = remainder.drop(cols, axis = 1)
    return remainder
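
As a small illustration of how the `_merge` indicator drives this split, here is a sketch on toy data. The two tiny frames are stand-ins built from titles that appear in this notebook, and the shortened column lists are assumptions for the example:

```python
import pandas as pd

wiki = pd.DataFrame({
    "movie_name": ["What a Life", "Fast X"],
    "release_year": [1939, 2023],
})
imdb = pd.DataFrame({
    "tconst": ["tt0032123", "tt9916538"],
    "all_names": ["What a Life", "Kuambil Lagi Hatiku"],
    "startYear": [1939, 2019],
})

# indicator=True adds a "_merge" column: "both", "left_only" or "right_only".
merged = pd.merge(
    wiki, imdb,
    left_on=["movie_name", "release_year"],
    right_on=["all_names", "startYear"],
    how="outer",
    indicator=True,
)

def filter_df(df, merge, cols):
    # Keep one side's unmatched rows and drop the other side's (empty) columns.
    remainder = df[df["_merge"] == merge]
    return remainder.drop(cols, axis=1)

matched = merged[merged["_merge"] == "both"]
wiki_left = filter_df(merged, "left_only", ["tconst", "all_names", "startYear", "_merge"])
imdb_left = filter_df(merged, "right_only", ["movie_name", "release_year", "_merge"])

print(matched["tconst"].tolist(), wiki_left["movie_name"].tolist(), imdb_left["tconst"].tolist())
```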

Merge the dataframes on title and year, then collect the unmatched rows of each dataframe.

In [15]:
all_merged = pd.merge(wiki_movies, imdb_movies, left_on=["movie_name", "release_year"], right_on=["all_names", "startYear"], how="outer", indicator = True)
wiki_remainder = filter_df(all_merged, "left_only", wiki_filter)
imdb_remainder = filter_df(all_merged, "right_only", imdb_filter)
all_merged

Unnamed: 0,franchise_id,franchise_name,movie_name,release_year,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres,averageRating,numVotes,akas,all_names,_merge
0,f0,The Aldrich Family,What a Life,1939.0,tt0032123,movie,What a Life,What a Life,0.0,1939.0,\N,75,"Comedy,Drama",6.9,93.0,"[What a Life, A Vida Começa aos 14]",What a Life,both
1,f0,The Aldrich Family,Life with Henry,1940.0,tt0033834,movie,Life with Henry,Life with Henry,0.0,1940.0,\N,80,"Comedy,Family,Music",6.0,49.0,"[Life with Henry, Henry quería ir a Alaska, He...",Life with Henry,both
2,f0,The Aldrich Family,Henry Aldrich for President,1941.0,tt0033708,movie,Henry Aldrich for President,Henry Aldrich for President,0.0,1941.0,\N,75,"Comedy,Family",6.6,146.0,"[Henry Aldrich Para Presidente, Henry Aldrich ...",Henry Aldrich for President,both
3,f0,The Aldrich Family,"Henry Aldrich, Editor",1942.0,tt0034842,movie,"Henry Aldrich, Editor","Henry Aldrich, Editor",0.0,1942.0,\N,72,"Comedy,Drama,Family",6.4,150.0,"[Henry periodista, Henry Aldrich, Editor]","Henry Aldrich, Editor",both
4,f0,The Aldrich Family,Henry and Dizzy,1942.0,tt0034844,movie,Henry and Dizzy,Henry and Dizzy,0.0,1942.0,\N,71,"Comedy,Family",7.2,58.0,[Henry and Dizzy],Henry and Dizzy,both
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1340959,,,,,tt9916428,movie,The Secret of China,Hong xing zhao yao Zhong guo,0.0,2019.0,\N,\N,"Adventure,History,War",3.8,14.0,"[The Secret of China, Hong xing zhao yao Zhong...",The Secret of China,right_only
1340960,,,,,tt9916428,movie,The Secret of China,Hong xing zhao yao Zhong guo,0.0,2019.0,\N,\N,"Adventure,History,War",3.8,14.0,"[The Secret of China, Hong xing zhao yao Zhong...",Hong xing zhao yao Zhong guo,right_only
1340961,,,,,tt9916538,movie,Kuambil Lagi Hatiku,Kuambil Lagi Hatiku,0.0,2019.0,\N,123,Drama,8.6,7.0,[Kuambil Lagi Hatiku],Kuambil Lagi Hatiku,right_only
1340962,,,,,tt9916730,movie,6 Gunn,6 Gunn,0.0,2017.0,\N,116,\N,8.3,10.0,"[६ गुण, 6 Gunn]",६ गुण,right_only


Keep only the matched rows and drop duplicates so that each IMDB title (tconst) appears once, as a movie may match more than one row in IMDB, for example when its primaryTitle and originalTitle are the same.

In [16]:
all_merged_both = all_merged[all_merged["_merge"]=="both"]
all_merged_both = all_merged_both.drop_duplicates(subset = "tconst",keep = "first")
all_merged_both

Unnamed: 0,franchise_id,franchise_name,movie_name,release_year,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres,averageRating,numVotes,akas,all_names,_merge
0,f0,The Aldrich Family,What a Life,1939.0,tt0032123,movie,What a Life,What a Life,0.0,1939.0,\N,75,"Comedy,Drama",6.9,93.0,"[What a Life, A Vida Começa aos 14]",What a Life,both
1,f0,The Aldrich Family,Life with Henry,1940.0,tt0033834,movie,Life with Henry,Life with Henry,0.0,1940.0,\N,80,"Comedy,Family,Music",6.0,49.0,"[Life with Henry, Henry quería ir a Alaska, He...",Life with Henry,both
2,f0,The Aldrich Family,Henry Aldrich for President,1941.0,tt0033708,movie,Henry Aldrich for President,Henry Aldrich for President,0.0,1941.0,\N,75,"Comedy,Family",6.6,146.0,"[Henry Aldrich Para Presidente, Henry Aldrich ...",Henry Aldrich for President,both
3,f0,The Aldrich Family,"Henry Aldrich, Editor",1942.0,tt0034842,movie,"Henry Aldrich, Editor","Henry Aldrich, Editor",0.0,1942.0,\N,72,"Comedy,Drama,Family",6.4,150.0,"[Henry periodista, Henry Aldrich, Editor]","Henry Aldrich, Editor",both
4,f0,The Aldrich Family,Henry and Dizzy,1942.0,tt0034844,movie,Henry and Dizzy,Henry and Dizzy,0.0,1942.0,\N,71,"Comedy,Family",7.2,58.0,[Henry and Dizzy],Henry and Dizzy,both
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8194,f1343,Super Sentai,Kamen Rider × Super Sentai × Space Sheriff: Su...,2013.0,tt2632184,movie,Kamen Rider × Super Sentai × Space Sheriff: Su...,Kamen Raidâ × Sûpâ Sentai × Uchû Keiji: Supâ H...,0.0,2013.0,\N,92,"Action,Adventure,Comedy",6.3,129.0,[Kamen Raidâ × Sûpâ Sentai × Uchû Keiji: Supâ ...,Kamen Rider × Super Sentai × Space Sheriff: Su...,both
8199,f1343,Super Sentai,Ressha Sentai ToQger vs. Kyoryuger: The Movie,2015.0,tt4152148,movie,Ressha Sentai ToQger vs. Kyoryuger: The Movie,Ressha Sentai Tokkyûjâ tai Kyôryûjâ Za Mûbî,0.0,2015.0,\N,64,Action,7.5,32.0,"[Ressha Sentai Tokkyûjâ tai Kyôryûjâ Za Mûbî, ...",Ressha Sentai ToQger vs. Kyoryuger: The Movie,both
8200,f1343,Super Sentai,Super Hero Taisen GP: Kamen Rider 3,2015.0,tt4282466,movie,Super Hero Taisen GP: Kamen Rider 3,Super Hero Taisen GP: Kamen Rider 3,0.0,2015.0,\N,95,Action,6.5,85.0,"[Super Hero Taisen GP: Kamen Rider 3, スーパーヒーロー...",Super Hero Taisen GP: Kamen Rider 3,both
8213,f1343,Super Sentai,Kishiryu Sentai Ryusoulger Special Chapter: Me...,2021.0,tt13681618,movie,Kishiryu Sentai Ryusoulger Special Chapter: Me...,Kishiryuu Sentai Ryuusoujâ Tokubetsuhen: Memor...,0.0,2021.0,\N,15,"Action,Fantasy",6.4,6.0,[Kishiryuu Sentai Ryuusoujâ Tokubetsuhen: Memo...,Kishiryu Sentai Ryusoulger Special Chapter: Me...,both


This removes already-matched movies from imdb_remainder: any row whose tconst was matched through one of its exploded names is dropped, including the rows for its other names.

In [17]:
imdb_remainder = imdb_remainder[~imdb_remainder["tconst"].isin(all_merged_both["tconst"])]

This next segment of cells extracts any movies that match purely on title, then filters out those where the year in one dataframe is too far from the year in the other. We set the maximum offset at 10 years.

In [18]:
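no_year = pd.merge(wiki_remainder, imdb_remainder, left_on=["movie_name"], right_on=["all_names"], how="outer", indicator = True)
# Note: the remainders are rebuilt from all_merged (not from no_year), so they
# still contain the titles matched in this stage, and the tconst filter applied
# to imdb_remainder above has to be reapplied later.
wiki_remainder = filter_df(all_merged, "left_only", wiki_filter)
imdb_remainder = filter_df(all_merged, "right_only", imdb_filter)
no_year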

Unnamed: 0,franchise_id,franchise_name,movie_name,release_year,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres,averageRating,numVotes,akas,all_names,_merge
0,f1,Coffin Joe,At Midnight I'll Take Your Soul,1963.0,,,,,,,,,,,,,,left_only
1,f1,Coffin Joe,The End of Man,1970.0,tt0067099,movie,Finis Hominis,Finis Hominis,0.0,1971.0,\N,79,"Comedy,Drama,Mystery",5.7,373.0,"[End of man, The End of Man, Finis Hominis, En...",The End of Man,both
2,f2,The Crime Club,The Last Express,1938.0,tt0037776,movie,The Hidden Eye,The Hidden Eye,0.0,1945.0,\N,69,"Action,Crime,Mystery",6.2,397.0,"[The Hidden Eye, Perfume do Oriente, L'oeil ca...",The Last Express,both
3,f3,Fast & Furious,Fast X,2023.0,,,,,,,,,,,,,,left_only
4,f4,Gingerdead Man vs. Evil Bong,Evil Bong 3D: The Wrath of Bong,2011.0,,,,,,,,,,,,,,left_only
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1274737,,,,,tt9916428,movie,The Secret of China,Hong xing zhao yao Zhong guo,0.0,2019.0,\N,\N,"Adventure,History,War",3.8,14.0,"[The Secret of China, Hong xing zhao yao Zhong...",The Secret of China,right_only
1274738,,,,,tt9916428,movie,The Secret of China,Hong xing zhao yao Zhong guo,0.0,2019.0,\N,\N,"Adventure,History,War",3.8,14.0,"[The Secret of China, Hong xing zhao yao Zhong...",Hong xing zhao yao Zhong guo,right_only
1274739,,,,,tt9916538,movie,Kuambil Lagi Hatiku,Kuambil Lagi Hatiku,0.0,2019.0,\N,123,Drama,8.6,7.0,[Kuambil Lagi Hatiku],Kuambil Lagi Hatiku,right_only
1274740,,,,,tt9916730,movie,6 Gunn,6 Gunn,0.0,2017.0,\N,116,\N,8.3,10.0,"[६ गुण, 6 Gunn]",६ गुण,right_only


In [19]:
no_year_both = no_year[no_year["_merge"]=="both"]
no_year_both = no_year_both.drop_duplicates(subset = "tconst",keep = "first")
no_year_both["tconst"].nunique()

918

In [20]:
no_year_both['difference_in_years'] = no_year_both.apply(lambda x: abs(x['startYear'] - x['release_year']), axis=1)
no_year_both = no_year_both[no_year_both["difference_in_years"] <= 10]
no_year_both["tconst"].nunique()

427
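
The year-offset filter can also be written without a row-wise `apply`. A minimal sketch on toy rows: the first two come from the output in this notebook, the third is made up to show a rejection, and the constant name `MAX_YEAR_GAP` is introduced only for this example:

```python
import pandas as pd

candidates = pd.DataFrame({
    "movie_name": ["The End of Man", "The Last Express", "Sunset Trail"],
    "release_year": [1970.0, 1938.0, 1939.0],
    "startYear": [1971.0, 1945.0, 1915.0],  # last value is illustrative
})

MAX_YEAR_GAP = 10  # the offset chosen in this notebook

# Column arithmetic is vectorised, so no lambda per row is needed.
candidates["difference_in_years"] = (
    candidates["startYear"] - candidates["release_year"]
).abs()
kept = candidates[candidates["difference_in_years"] <= MAX_YEAR_GAP]

print(kept)
```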

In [21]:
no_year_both.head(50)

Unnamed: 0,franchise_id,franchise_name,movie_name,release_year,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres,averageRating,numVotes,akas,all_names,_merge,difference_in_years
1,f1,Coffin Joe,The End of Man,1970.0,tt0067099,movie,Finis Hominis,Finis Hominis,0.0,1971.0,\N,79,"Comedy,Drama,Mystery",5.7,373.0,"[End of man, The End of Man, Finis Hominis, En...",The End of Man,both,1.0
2,f2,The Crime Club,The Last Express,1938.0,tt0037776,movie,The Hidden Eye,The Hidden Eye,0.0,1945.0,\N,69,"Action,Crime,Mystery",6.2,397.0,"[The Hidden Eye, Perfume do Oriente, L'oeil ca...",The Last Express,both,7.0
78,f16,Young and Dangerous,Those Were the Days,2000.0,tt0114146,movie,Those Were the Days,Le plus bel âge...,0.0,1995.0,\N,85,Drama,6.5,208.0,"[A Mais Bela Idade, Those Were the Days, Najpi...",Those Were the Days,both,5.0
79,f16,Young and Dangerous,Those Were the Days,2000.0,tt0118165,movie,Those Were the Days,Wong Gok dik tin hung 2: Nam siu yee,0.0,1996.0,\N,89,"Action,Crime,Drama",5.1,17.0,"[Those Were the Days, Wong Gok dik tin hung 2:...",Those Were the Days,both,4.0
80,f16,Young and Dangerous,Those Were the Days,2000.0,tt0186543,movie,Si ge 32A he yi ge xiang jiao shao nian,Si ge 32A he yi ge xiang jiao shao nian,0.0,1996.0,\N,101,Drama,6.6,32.0,"[Si ge 32A he yi ge xiang jiao shao nian, Thos...",Those Were the Days,both,4.0
81,f16,Young and Dangerous,Those Were the Days,2000.0,tt0285244,movie,Those Were the Days,Jing zhuang nan xiong nan di,0.0,1997.0,\N,103,Comedy,6.5,140.0,"[Those Were the Days, 精裝難兄難弟, 精装难兄难弟, Jing zhu...",Those Were the Days,both,3.0
120,f20,Gamera,Gamera the Brave,2006.0,tt0467923,movie,Gamera the Brave,Chiisaki yûsha-tachi: Gamera,0.0,2005.0,\N,96,"Action,Adventure,Family",6.6,1298.0,"[Гамера: Маленькие герои, Gamera: O genaios, 小...",Gamera the Brave,both,1.0
133,f23,"L.E.T.H.A.L. Ladies (a.k.a. Triple-B, Bullets,...",Hard Hunted,1992.0,tt0104391,movie,Hard Hunted,Hard Hunted,0.0,1993.0,\N,97,"Action,Adventure,Crime",4.1,1253.0,"[Agenttitytöt Havaijilla, Hard Hunted, Θηλυκοί...",Hard Hunted,both,1.0
153,f27,"Signed, Sealed, Delivered",From Paris with Love,2015.0,tt1179034,movie,From Paris with Love,From Paris with Love,0.0,2010.0,\N,92,"Action,Crime,Thriller",6.4,119052.0,"[From Paris with Love, 諜戰巴黎, Párizsból szerete...",From Paris with Love,both,5.0
156,f27,"Signed, Sealed, Delivered",Truth Be Told,2015.0,tt1500252,movie,Auf Wiedersehen: 'Til We Meet Again,Auf Wiedersehen: 'Til We Meet Again,0.0,2010.0,\N,76,"Biography,Documentary,History",6.8,15.0,"[Auf Wiedersehen: 'Til We Meet Again, Truth Be...",Truth Be Told,both,5.0


This merges the offset year matches with the rest of the matches above

In [39]:
matched = pd.concat([all_merged_both, no_year_both])
matched = matched[matched["_merge"]=="both"]
matched

Unnamed: 0,franchise_id,franchise_name,movie_name,release_year,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres,averageRating,numVotes,akas,all_names,_merge,difference_in_years
0,f0,The Aldrich Family,What a Life,1939.0,tt0032123,movie,What a Life,What a Life,0.0,1939.0,\N,75,"Comedy,Drama",6.9,93.0,"[What a Life, A Vida Começa aos 14]",What a Life,both,
1,f0,The Aldrich Family,Life with Henry,1940.0,tt0033834,movie,Life with Henry,Life with Henry,0.0,1940.0,\N,80,"Comedy,Family,Music",6.0,49.0,"[Life with Henry, Henry quería ir a Alaska, He...",Life with Henry,both,
2,f0,The Aldrich Family,Henry Aldrich for President,1941.0,tt0033708,movie,Henry Aldrich for President,Henry Aldrich for President,0.0,1941.0,\N,75,"Comedy,Family",6.6,146.0,"[Henry Aldrich Para Presidente, Henry Aldrich ...",Henry Aldrich for President,both,
3,f0,The Aldrich Family,"Henry Aldrich, Editor",1942.0,tt0034842,movie,"Henry Aldrich, Editor","Henry Aldrich, Editor",0.0,1942.0,\N,72,"Comedy,Drama,Family",6.4,150.0,"[Henry periodista, Henry Aldrich, Editor]","Henry Aldrich, Editor",both,
4,f0,The Aldrich Family,Henry and Dizzy,1942.0,tt0034844,movie,Henry and Dizzy,Henry and Dizzy,0.0,1942.0,\N,71,"Comedy,Family",7.2,58.0,[Henry and Dizzy],Henry and Dizzy,both,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4040,f1342,Hopalong Cassidy (American-Western),Sunset Trail,1939.0,tt0023539,movie,Sunset Trail,Sunset Trail,0.0,1932.0,\N,62,"Action,Music,Romance",6.2,42.0,"[La fattoria maledetta, To monopati tou thanat...",Sunset Trail,both,7.0
4041,f1342,Hopalong Cassidy (American-Western),Sunset Trail,1939.0,tt0030812,movie,Sunset Trail,Sunset Trail,0.0,1938.0,\N,69,Western,6.9,180.0,"[Gentleman-Cowboy, Äventyret i Silver City, Ro...",Sunset Trail,both,1.0
4043,f1342,Hopalong Cassidy (American-Western),Lost Canyon,1943.0,tt0034996,movie,Lost Canyon,Lost Canyon,0.0,1942.0,\N,61,Western,6.3,161.0,"[Striden i dödsdalen, Den sorte rytter, Desfil...",Lost Canyon,both,1.0
4046,f1342,Hopalong Cassidy (American-Western),Fool's Gold,1947.0,tt0038532,movie,Fool's Gold,Fool's Gold,0.0,1946.0,\N,63,"Drama,Western",6.1,140.0,"[Överfallet på guldtransporten, Twin Buttes, R...",Fool's Gold,both,1.0


In [23]:
imdb_remainder = imdb_remainder[~imdb_remainder["tconst"].isin(all_merged_both["tconst"])]

### Fuzzy Matching

Finds the best fit for each remaining movie in the wiki dataframe

In [24]:
def get_top_match(row):
    # extractOne returns a tuple: (matched title, score, position in imdb_movie_list)
    return process.extractOne(row.movie_name, imdb_movie_list)

In [25]:
imdb_movie_list = imdb_remainder.all_names.to_list()

In [26]:
wiki_remainder['best_match'] = wiki_remainder.apply(get_top_match, axis = 1)

Save a backup of the fuzzy matching results, so that the matching does not have to be re-run if a mistake is made in a later step.

In [27]:
wiki_remainder.to_csv("backup_wiki_remainder.csv")

Split the result into the 3 columns below

In [28]:
wiki_remainder[['best_fit_title', 'best_fit_ratio', 'best_fit_game_id']] = wiki_remainder['best_match'].apply(lambda x: pd.Series([i for i in x]))
wiki_remainder

Unnamed: 0,franchise_id,franchise_name,movie_name,release_year,best_match,best_fit_title,best_fit_ratio,best_fit_game_id
11,f1,Coffin Joe,At Midnight I'll Take Your Soul,1963.0,"(Midnight, 90.0, 30523)",Midnight,90.000000,30523
15,f1,Coffin Joe,The End of Man,1970.0,"(The End of Man, 100.0, 279988)",The End of Man,100.000000,279988
27,f2,The Crime Club,The Last Express,1938.0,"(The Last Express, 100.0, 95553)",The Last Express,100.000000,95553
44,f3,Fast & Furious,Fast X,2023.0,"(X, 90.0, 42415)",X,90.000000,42415
49,f4,Gingerdead Man vs. Evil Bong,Evil Bong 3D: The Wrath of Bong,2011.0,"(Evil Bong 3: The Wrath of Bong, 98.3606557377...",Evil Bong 3: The Wrath of Bong,98.360656,991760
...,...,...,...,...,...,...,...,...
8217,f1343,Super Sentai,Kikai Sentai Zenkaiger vs. Kiramager vs. Senpa...,2022.0,(Kikai Sentai Zenkaiger vs Kiramager vs Senpai...,Kikai Sentai Zenkaiger vs Kiramager vs Senpaiger,97.959184,997090
8218,f1343,Super Sentai,Avataro Sentai Donbrothers The Movie: New Firs...,2022.0,"(Brothers, 90.0, 20888)",Brothers,90.000000,20888
8219,f1343,Super Sentai,Ninpu Sentai Hurricaneger Degozaru! Shushuuto ...,2023.0,"(Hurricane, 90.0, 24928)",Hurricane,90.000000,24928
8220,f1343,Super Sentai,Avataro Sentai Donbrothers vs. Zenkaiger,2023.0,"(Brothers, 90.0, 20888)",Brothers,90.000000,20888
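
An alternative way to split the tuples, sketched on two rows taken from the output above, is to build the three columns in one go with the `DataFrame` constructor instead of creating a `Series` per row. This assumes every `best_match` is a 3-tuple:

```python
import pandas as pd

remainder = pd.DataFrame(
    {
        "movie_name": ["The End of Man", "Fast X"],
        "best_match": [("The End of Man", 100.0, 279988), ("X", 90.0, 42415)],
    },
    index=[15, 44],
)

# One constructor call turns the list of tuples into three typed columns.
parts = pd.DataFrame(
    remainder["best_match"].tolist(),
    index=remainder.index,
    columns=["best_fit_title", "best_fit_ratio", "best_fit_game_id"],
)
remainder = remainder.join(parts)

print(remainder.dtypes)
```

In this toy example the id column comes out as an integer dtype directly, so no separate conversion to int is needed.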


In [29]:
wiki_remainder = wiki_remainder[wiki_remainder['best_fit_ratio'].notna()]

Convert best_fit_game_id into int value to compare later

In [30]:
wiki_remainder["best_fit_game_id"] = wiki_remainder["best_fit_game_id"].astype(int)

In [35]:
wiki_remainder[wiki_remainder["best_fit_ratio"] == 100]

Unnamed: 0,franchise_id,franchise_name,movie_name,release_year,best_match,best_fit_title,best_fit_ratio,best_fit_game_id
15,f1,Coffin Joe,The End of Man,1970.0,"(The End of Man, 100.0, 279988)",The End of Man,100.0,279988
27,f2,The Crime Club,The Last Express,1938.0,"(The Last Express, 100.0, 95553)",The Last Express,100.0,95553
111,f10,Mystery Woman,Mystery Woman,2003.0,"(Mystery Woman, 100.0, 47226)",Mystery Woman,100.0,47226
151,f13,Star Wars,Star Wars: Episode I – The Phantom Menace,1999.0,"(Star Wars: Episode I - The Phantom Menace, 10...",Star Wars: Episode I - The Phantom Menace,100.0,526549
152,f54,Star Wars,Star Wars: Episode I – The Phantom Menace,1999.0,"(Star Wars: Episode I - The Phantom Menace, 10...",Star Wars: Episode I - The Phantom Menace,100.0,526549
...,...,...,...,...,...,...,...,...
8017,f1340,Santo,Chanoc y el hijo del Santo contra los vampiros...,1981.0,(Chanoc y el hijo del Santo contra los vampiro...,Chanoc y el hijo del Santo contra los vampiros...,100.0,623072
8105,f1342,Hopalong Cassidy (American-Western),Sunset Trail,1939.0,"(Sunset Trail, 100.0, 34301)",Sunset Trail,100.0,34301
8130,f1342,Hopalong Cassidy (American-Western),Lost Canyon,1943.0,"(Lost Canyon, 100.0, 83158)",Lost Canyon,100.0,83158
8140,f1342,Hopalong Cassidy (American-Western),Fool's Gold,1947.0,"(Fool's Gold, 100.0, 3807)",Fool's Gold,100.0,3807


In [36]:
wiki_remainder[(wiki_remainder["best_fit_ratio"] < 100) & (wiki_remainder["best_fit_ratio"] > 95)]

Unnamed: 0,franchise_id,franchise_name,movie_name,release_year,best_match,best_fit_title,best_fit_ratio,best_fit_game_id
49,f4,Gingerdead Man vs. Evil Bong,Evil Bong 3D: The Wrath of Bong,2011.0,"(Evil Bong 3: The Wrath of Bong, 98.3606557377...",Evil Bong 3: The Wrath of Bong,98.360656,991760
53,f4,Gingerdead Man vs. Evil Bong,Evil Bong High-5!,2016.0,"(Evil Bong: High 5, 96.96969696969697, 1152092)",Evil Bong: High 5,96.969697,1152092
71,f6,Lash LaRue,Son of a Bad Man,1949.0,"(Son of a Badman, 96.7741935483871, 117172)",Son of a Badman,96.774194,117172
103,f9,Mickey Mouse,Mickey Mouse Jubliee Show,1978.0,"(Mickey Mouse Jubilee Show, 96.0, 643767)",Mickey Mouse Jubilee Show,96.000000,643767
126,f11,Naruto the Movie,Naruto Shippuden the Movie,2007.0,"(Naruto Shippuden: The Movie, 98.1132075471698...",Naruto Shippuden: The Movie,98.113208,822626
...,...,...,...,...,...,...,...,...
8202,f1343,Super Sentai,Shuriken Sentai Ninninger vs. ToQger the Movie...,2016.0,(Shuriken Sentai Ninninger vs. ToQger the Movi...,Shuriken Sentai Ninninger vs. ToQger the Movie...,99.259259,1162698
8209,f1343,Super Sentai,Kishiryu Sentai Ryusoulger VS Lupinranger VS P...,2020.0,(Kishiryu Sentai Ryusoulger vs. Lupinranger vs...,Kishiryu Sentai Ryusoulger vs. Lupinranger vs....,98.181818,865015
8214,f1343,Super Sentai,Mashin Sentai Kiramager vs. Ryusoulger,2021.0,"(Mashin Sentai Kiramager vs Ryusoulger, 98.666...",Mashin Sentai Kiramager vs Ryusoulger,98.666667,927510
8215,f1343,Super Sentai,Saber + Zenkaiger: Superhero Senki,2021.0,"(Saber + Zenkaiger: Super Hero Senki, 98.55072...",Saber + Zenkaiger: Super Hero Senki,98.550725,940152


Keep only matches with a score above 95, which limits us to titles that differ by, for example, a single letter or a single punctuation mark.

In [37]:
wiki_remainder = wiki_remainder[wiki_remainder["best_fit_ratio"] > 95]

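Merge the remainders based on the fuzzy match. Note that best_fit_game_id is the position of the match in imdb_movie_list, whereas right_index=True joins on the index labels of imdb_remainder. The two only coincide when that index runs 0..n-1; in the output below the all_names values do not correspond to best_fit_title, which suggests the index needs resetting before this merge.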

In [38]:
fuzzy_merged = pd.merge(wiki_remainder, imdb_remainder, left_on=["best_fit_game_id"], right_index=True, how="outer", indicator = True)
wiki_remainder = filter_df(fuzzy_merged, "left_only", wiki_filter)
imdb_remainder = filter_df(fuzzy_merged, "right_only", imdb_filter)
fuzzy_merged

Unnamed: 0,franchise_id,franchise_name,movie_name,release_year,best_match,best_fit_title,best_fit_ratio,best_fit_game_id,tconst,titleType,...,isAdult,startYear,endYear,runtimeMinutes,genres,averageRating,numVotes,akas,all_names,_merge
15.0,f1,Coffin Joe,The End of Man,1970.0,"(The End of Man, 100.0, 279988)",The End of Man,100.000000,279988,tt0063704,movie,...,0.0,1968.0,\N,91,Thriller,6.0,195.0,"[L'homme à la jaguar rouge, La morte in jaguar...",Death in a Red Jaguar,both
27.0,f2,The Crime Club,The Last Express,1938.0,"(The Last Express, 100.0, 95553)",The Last Express,100.000000,95553,tt0035082,movie,...,0.0,1942.0,\N,94,"Crime,Drama,Film-Noir",6.8,2034.0,"[La péniche de l'amour, Apokliroi tis zois, St...",Nocny przypływ,both
49.0,f4,Gingerdead Man vs. Evil Bong,Evil Bong 3D: The Wrath of Bong,2011.0,"(Evil Bong 3: The Wrath of Bong, 98.3606557377...",Evil Bong 3: The Wrath of Bong,98.360656,991760,tt14354838,movie,...,0.0,2021.0,\N,94,Documentary,7.5,704.0,"[Héroes. Silencio y Rock & Roll, Héroes del Si...",Héroes del Silencio: Barulho e Rock'n'Roll,both
53.0,f4,Gingerdead Man vs. Evil Bong,Evil Bong High-5!,2016.0,"(Evil Bong: High 5, 96.96969696969697, 1152092)",Evil Bong: High 5,96.969697,1152092,tt3113296,movie,...,0.0,2014.0,\N,80,Documentary,7.0,10.0,[Pequeña Babilonia],Pequeña Babilonia,both
71.0,f6,Lash LaRue,Son of a Bad Man,1949.0,"(Son of a Badman, 96.7741935483871, 117172)",Son of a Badman,96.774194,117172,tt0039348,movie,...,0.0,1948.0,\N,100,"Drama,Fantasy",6.6,68.0,"[El judio errante, Den evige jøde, O Homem Sem...",Vaeltava juutalainen,both
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
,,,,,,,,1340959,tt9916428,movie,...,0.0,2019.0,\N,\N,"Adventure,History,War",3.8,14.0,"[The Secret of China, Hong xing zhao yao Zhong...",The Secret of China,right_only
,,,,,,,,1340960,tt9916428,movie,...,0.0,2019.0,\N,\N,"Adventure,History,War",3.8,14.0,"[The Secret of China, Hong xing zhao yao Zhong...",Hong xing zhao yao Zhong guo,right_only
,,,,,,,,1340961,tt9916538,movie,...,0.0,2019.0,\N,123,Drama,8.6,7.0,[Kuambil Lagi Hatiku],Kuambil Lagi Hatiku,right_only
,,,,,,,,1340962,tt9916730,movie,...,0.0,2017.0,\N,116,\N,8.3,10.0,"[६ गुण, 6 Gunn]",६ गुण,right_only
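
The join on `best_fit_game_id` relies on list positions and index labels agreeing. A minimal sketch on toy data of how that can go wrong and how resetting the index lines the two up; the `tt_a` style ids and the index labels are hypothetical:

```python
import pandas as pd

# Toy remainder whose index labels are NOT positions (as after filtering a merge result).
imdb_remainder = pd.DataFrame(
    {
        "tconst": ["tt_a", "tt_b", "tt_c"],
        "all_names": ["Midnight", "Sunset Trail", "Lost Canyon"],
    },
    index=[7, 20, 31],
)
name_list = imdb_remainder["all_names"].to_list()

# A list-based matcher reports the position of the hit: 1 for "Sunset Trail".
wiki_remainder = pd.DataFrame({
    "movie_name": ["Sunset Trail"],
    "best_fit_game_id": [name_list.index("Sunset Trail")],
})

# Joining positions against labels finds nothing here (label 1 does not exist).
misaligned = pd.merge(wiki_remainder, imdb_remainder,
                      left_on="best_fit_game_id", right_index=True, how="left")

# After reset_index the labels equal the list positions, so the join lines up.
aligned = pd.merge(wiki_remainder, imdb_remainder.reset_index(drop=True),
                   left_on="best_fit_game_id", right_index=True, how="left")

print(misaligned["all_names"].tolist(), aligned["all_names"].tolist())
```

When a label equal to the position does happen to exist, the misaligned join returns an unrelated row instead of nothing, which is harder to spot.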


In [40]:
fuzzy_merged = fuzzy_merged[fuzzy_merged["_merge"]=="both"]
fuzzy_merged = fuzzy_merged.drop_duplicates(subset = "tconst",keep = "first")
fuzzy_merged["tconst"].nunique()

787

Filter by year

In [41]:
fuzzy_merged['difference_in_years'] = fuzzy_merged.apply(lambda x: abs(x['startYear'] - x['release_year']), axis=1)
fuzzy_merged = fuzzy_merged[fuzzy_merged["difference_in_years"] <= 10]
fuzzy_merged["tconst"].nunique()

545

In [42]:
matched = pd.concat([matched, fuzzy_merged])
matched = matched[matched["_merge"]=="both"]
matched

Unnamed: 0,franchise_id,franchise_name,movie_name,release_year,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,...,averageRating,numVotes,akas,all_names,_merge,difference_in_years,best_match,best_fit_title,best_fit_ratio,best_fit_game_id
0.0,f0,The Aldrich Family,What a Life,1939.0,tt0032123,movie,What a Life,What a Life,0.0,1939.0,...,6.9,93.0,"[What a Life, A Vida Começa aos 14]",What a Life,both,,,,,
1.0,f0,The Aldrich Family,Life with Henry,1940.0,tt0033834,movie,Life with Henry,Life with Henry,0.0,1940.0,...,6.0,49.0,"[Life with Henry, Henry quería ir a Alaska, He...",Life with Henry,both,,,,,
2.0,f0,The Aldrich Family,Henry Aldrich for President,1941.0,tt0033708,movie,Henry Aldrich for President,Henry Aldrich for President,0.0,1941.0,...,6.6,146.0,"[Henry Aldrich Para Presidente, Henry Aldrich ...",Henry Aldrich for President,both,,,,,
3.0,f0,The Aldrich Family,"Henry Aldrich, Editor",1942.0,tt0034842,movie,"Henry Aldrich, Editor","Henry Aldrich, Editor",0.0,1942.0,...,6.4,150.0,"[Henry periodista, Henry Aldrich, Editor]","Henry Aldrich, Editor",both,,,,,
4.0,f0,The Aldrich Family,Henry and Dizzy,1942.0,tt0034844,movie,Henry and Dizzy,Henry and Dizzy,0.0,1942.0,...,7.2,58.0,[Henry and Dizzy],Henry and Dizzy,both,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8197.0,f1343,Super Sentai,Heisei Riders vs. Shōwa Riders: Kamen Rider Ta...,2014.0,tt2241691,movie,Under the Rainbow,Au bout du conte,0.0,2013.0,...,5.8,1393.0,"[Au bout du conte, Under regnbuen, Książę nie ...",Френска приказка,both,1.0,(Heisei Rider vs. Showa Rider: Kamen Rider Tai...,Heisei Rider vs. Showa Rider: Kamen Rider Tais...,97.058824,1102455.0
8202.0,f1343,Super Sentai,Shuriken Sentai Ninninger vs. ToQger the Movie...,2016.0,tt3355274,movie,Così parlò De Crescenzo,Così parlò De Crescenzo,0.0,2016.0,...,7.8,77.0,"[Così parlò De Crescenzo, I Am an Unemployed E...",Sono un ingegnere disoccupato,both,0.0,(Shuriken Sentai Ninninger vs. ToQger the Movi...,Shuriken Sentai Ninninger vs. ToQger the Movie...,99.259259,1162698.0
8204.0,f1343,Super Sentai,Doubutsu Sentai Zyuohger vs. Ninninger the Mov...,2016.0,tt4257950,movie,Russell Madness,Russell Madness,0.0,2015.0,...,4.2,684.0,"[Russell Madness, Chú Chó Đô Vật, Безумие Расс...",Russell Wahnsinn,both,1.0,(Doubutsu Sentai Zyuohger vs. Ninninger the Mo...,Doubutsu Sentai Zyuohger vs. Ninninger the Mov...,100.000000,1197391.0
8214.0,f1343,Super Sentai,Mashin Sentai Kiramager vs. Ryusoulger,2021.0,tt11708788,movie,Mecca I'm Coming,Mekah I'm Coming,0.0,2019.0,...,7.6,99.0,"[Mekah I'm Coming, Mecca, I'm Coming, Mecca I'...",Mekah I'm Coming,both,2.0,"(Mashin Sentai Kiramager vs Ryusoulger, 98.666...",Mashin Sentai Kiramager vs Ryusoulger,98.666667,927510.0


Trim complete dataframe

In [43]:
matched = matched[["franchise_id", "franchise_name", "movie_name", "tconst", "primaryTitle", "isAdult", "release_year", "runtimeMinutes", "genres", "averageRating", "numVotes"]]

In [None]:
matched

In [None]:
wiki_remainder

In [44]:
matched.columns = matched.columns.str.replace('release_year', 'startYear')

Merge matched dataframe with all movies left in IMDB

*This was done with the intention of examining non-franchised movies in a research question that was later cut*

In [45]:
imdb_final = pd.merge(matched, imdb_movies, on = "tconst", how = "outer")
imdb_final

Unnamed: 0,franchise_id,franchise_name,movie_name,tconst,primaryTitle_x,isAdult_x,startYear_x,runtimeMinutes_x,genres_x,averageRating_x,...,originalTitle,isAdult_y,startYear_y,endYear,runtimeMinutes_y,genres_y,averageRating_y,numVotes_y,akas,all_names
0,f0,The Aldrich Family,What a Life,tt0032123,What a Life,0.0,1939.0,75,"Comedy,Drama",6.9,...,What a Life,0,1939,\N,75,"Comedy,Drama",6.9,93.0,"[What a Life, A Vida Começa aos 14]",A Vida Começa aos 14
1,f0,The Aldrich Family,What a Life,tt0032123,What a Life,0.0,1939.0,75,"Comedy,Drama",6.9,...,What a Life,0,1939,\N,75,"Comedy,Drama",6.9,93.0,"[What a Life, A Vida Começa aos 14]",What a Life
2,f0,The Aldrich Family,Life with Henry,tt0033834,Life with Henry,0.0,1940.0,80,"Comedy,Family,Music",6.0,...,Life with Henry,0,1940,\N,80,"Comedy,Family,Music",6.0,49.0,"[Life with Henry, Henry quería ir a Alaska, He...",Henry quería ir a Alaska
3,f0,The Aldrich Family,Life with Henry,tt0033834,Life with Henry,0.0,1940.0,80,"Comedy,Family,Music",6.0,...,Life with Henry,0,1940,\N,80,"Comedy,Family,Music",6.0,49.0,"[Life with Henry, Henry quería ir a Alaska, He...",Henry Está na Berlinda
4,f0,The Aldrich Family,Life with Henry,tt0033834,Life with Henry,0.0,1940.0,80,"Comedy,Family,Music",6.0,...,Life with Henry,0,1940,\N,80,"Comedy,Family,Music",6.0,49.0,"[Life with Henry, Henry quería ir a Alaska, He...",Life with Henry
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1337336,,,,tt9916428,,,,,,,...,Hong xing zhao yao Zhong guo,0,2019,\N,\N,"Adventure,History,War",3.8,14.0,"[The Secret of China, Hong xing zhao yao Zhong...",The Secret of China
1337337,,,,tt9916428,,,,,,,...,Hong xing zhao yao Zhong guo,0,2019,\N,\N,"Adventure,History,War",3.8,14.0,"[The Secret of China, Hong xing zhao yao Zhong...",Hong xing zhao yao Zhong guo
1337338,,,,tt9916538,,,,,,,...,Kuambil Lagi Hatiku,0,2019,\N,123,Drama,8.6,7.0,[Kuambil Lagi Hatiku],Kuambil Lagi Hatiku
1337339,,,,tt9916730,,,,,,,...,6 Gunn,0,2017,\N,116,\N,8.3,10.0,"[६ गुण, 6 Gunn]",६ गुण
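
Because `imdb_movies` was exploded earlier, each movie appears once per name, which is why the merged output repeats rows for the same tconst. One way to get a single row per movie, sketched on toy data under the assumption that the per-name rows are no longer needed at this point, is to de-duplicate on tconst before merging and bring across only the franchise columns:

```python
import pandas as pd

# Toy stand-ins; only the columns needed for the illustration are included.
matched = pd.DataFrame({
    "franchise_id": ["f0"],
    "tconst": ["tt1"],
    "primaryTitle": ["What a Life"],
})
imdb_exploded = pd.DataFrame({
    "tconst": ["tt1", "tt1", "tt2"],
    "primaryTitle": ["What a Life", "What a Life", "6 Gunn"],
    "all_names": ["What a Life", "A Vida Começa aos 14", "6 Gunn"],
})

# Collapse the exploded frame back to one row per movie.
one_per_movie = imdb_exploded.drop_duplicates(subset="tconst").drop(columns="all_names")

# Merging only the franchise columns avoids the _x/_y suffixes entirely.
final = pd.merge(matched[["franchise_id", "tconst"]], one_per_movie,
                 on="tconst", how="outer")

print(final)
```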


Filter out the duplicated columns, keeping only the IMDB-side (_y) copy of each

In [47]:
imdb_final = imdb_final[["franchise_id", "franchise_name", "movie_name", "tconst", "primaryTitle_y", "isAdult_y", "startYear_y", "runtimeMinutes_y", "genres_y", "averageRating_y", "numVotes_y"]]
imdb_final

Unnamed: 0,franchise_id,franchise_name,movie_name,tconst,primaryTitle_y,isAdult_y,startYear_y,runtimeMinutes_y,genres_y,averageRating_y,numVotes_y
0,f0,The Aldrich Family,What a Life,tt0032123,What a Life,0,1939,75,"Comedy,Drama",6.9,93.0
1,f0,The Aldrich Family,What a Life,tt0032123,What a Life,0,1939,75,"Comedy,Drama",6.9,93.0
2,f0,The Aldrich Family,Life with Henry,tt0033834,Life with Henry,0,1940,80,"Comedy,Family,Music",6.0,49.0
3,f0,The Aldrich Family,Life with Henry,tt0033834,Life with Henry,0,1940,80,"Comedy,Family,Music",6.0,49.0
4,f0,The Aldrich Family,Life with Henry,tt0033834,Life with Henry,0,1940,80,"Comedy,Family,Music",6.0,49.0
...,...,...,...,...,...,...,...,...,...,...,...
1337336,,,,tt9916428,The Secret of China,0,2019,\N,"Adventure,History,War",3.8,14.0
1337337,,,,tt9916428,The Secret of China,0,2019,\N,"Adventure,History,War",3.8,14.0
1337338,,,,tt9916538,Kuambil Lagi Hatiku,0,2019,123,Drama,8.6,7.0
1337339,,,,tt9916730,6 Gunn,0,2017,116,\N,8.3,10.0


Cleans up column names

In [48]:
# Strip the "_y" merge suffix from the IMDB-side column names
imdb_final.columns = imdb_final.columns.str.replace("_y$", "", regex=True)

In [51]:
wiki_remainder["franchise_name"].nunique()

41

In [49]:
imdb_final["franchise_id"].nunique()

1142

Export to csv

In [52]:
imdb_final.to_csv("clean_movies_with_franchises.csv")

# Test Code

In [None]:
imdb_final[imdb_final["primaryTitle"] == "iron man"]

In [None]:
wiki_remainder[wiki_remainder["franchise_name"] == "DC Extended Universe"]

In [None]:
test = imdb_remainder.sort_values("numVotes", ascending = False).drop_duplicates(subset=["tconst"], keep="first").sort_values(["tconst"])
test[test["numVotes"] >292]

In [None]:
imdb_final[imdb_final["franchise_id"].isnull()]

In [None]:
test = pd.isnull(imdb_final["franchise_id"])
imdb_final[test]["numVotes"].describe()

In [None]:
test = pd.notnull(imdb_final["franchise_id"])
imdb_final[test]["numVotes"].describe()

# Unused code

In [None]:
# apply this mask function at each step of the matching process and see how many found rows there are
list1 = list(matched["index"])
found_mask = []
for i in range(0, len(list(wiki_movies["index"]))):
    if i in list1:
        found_mask.append(True)
    else:
        found_mask.append(False)
    
#print(list1)
print(wiki_movies[found_mask])

In [None]:
both_test = matched[matched["_merge"]=="both"].sort_values(by = "index")

In [None]:
# look into moving all names into one list, then searching with date
# With remainder, search without date and see what happens

In [None]:
# losing some values in both data frames for unknown reason when using merge.
# starts with 6405 rows in wiki and 290239 rows in imdb
# 3307 
#              wiki  |  imdb 
# start     |  6405  |  290239
# found     |  3088  |    3088
# remaining |  3307  |  287199
# total     |  6395  |  290287
# variance  |   -10  |     +48