**Collecting and Cleaning the Data**

First, we will import memorable_quotes.txt from the Cornell Movie Quotes Corpus and convert the information we want into a dataframe.

In [None]:
import pandas as pd

columns = ["movie_title", "memorable_quote", "LINE_ID_MEMORABLE_MATCHED_QUOTE"]
with open("/content/moviequotes.memorable_quotes.txt", "r",
          encoding="latin-1") as file:
    text = file.read()
chunks = text.strip().split("\n\n")
data = [chunk.split("\n") for chunk in chunks]

df_memorable = pd.DataFrame(data, columns = columns)
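
The parsing step above assumes every blank-line-separated chunk has exactly three lines. As a minimal sketch of how that assumption could be checked, the example below parses a small made-up sample (the quotes and line IDs in `sample_text` are hypothetical, not taken from the corpus) and sets aside any chunk that does not have the expected number of lines.

```python
import pandas as pd

# Hypothetical sample in the shape the parsing code expects:
# three lines per record, records separated by a blank line.
sample_text = (
    "airplane\n"
    "Good luck to you both.\n"
    "123 Good luck to you both.\n"
    "\n"
    "zulu dawn\n"
    "Well fought.\n"
    "456 Well fought.\n"
)

columns = ["movie_title", "memorable_quote", "LINE_ID_MEMORABLE_MATCHED_QUOTE"]

records = []  # chunks with the expected number of lines
skipped = []  # malformed chunks, kept for inspection
for chunk in sample_text.strip().split("\n\n"):
    lines = chunk.split("\n")
    if len(lines) == len(columns):
        records.append(lines)
    else:
        skipped.append(chunk)

df_sample = pd.DataFrame(records, columns=columns)
print(df_sample)
print(f"skipped {len(skipped)} malformed chunk(s)")
```

Collecting malformed chunks separately, rather than letting `pd.DataFrame` raise on a ragged list, makes it easier to see what went wrong in the source file.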

Next, we will import scripts.txt so that we can merge on line_id and find the character who says each line.

In [None]:
df_memorable["line_id"] = df_memorable["LINE_ID_MEMORABLE_MATCHED_QUOTE"].str.split(" ").str[0]
df_memorable["line_id"] = df_memorable["line_id"].astype(int)
df_memorable = df_memorable[["movie_title", "memorable_quote", "line_id"]]

In [None]:
columns = ["line_id", "movie_title", "movie_line_nr", "character",
           "reply_to_line_id", "text"]

# read in file as df
df_full = pd.read_csv("moviequotes.scripts.txt", delimiter=r" \+\+\+\$\+\+\+ ",
                      engine="python", header=None, names=columns,
                      encoding="ISO-8859-1")
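
The delimiter above is written as a regular expression because `+` and `$` are regex metacharacters and must be escaped. The sketch below runs the same pattern on two in-memory rows. The line IDs and character come from the output shown in this notebook, while the remaining field values are made up for illustration.

```python
import io
import re

import pandas as pd

SEPARATOR_PATTERN = r" \+\+\+\$\+\+\+ "  # matches the literal " +++$+++ "

# Two illustrative rows; only line_id and character mirror the real data.
sample = (
    "733 +++$+++ 10 things i hate about you +++$+++ 12 +++$+++ kat"
    " +++$+++ 732 +++$+++ Hello there\n"
    "788 +++$+++ 10 things i hate about you +++$+++ 40 +++$+++ kat"
    " +++$+++ 787 +++$+++ Another line\n"
)
columns = ["line_id", "movie_title", "movie_line_nr", "character",
           "reply_to_line_id", "text"]

# A multi-character regex separator requires the slower Python engine.
df_sample = pd.read_csv(io.StringIO(sample), sep=SEPARATOR_PATTERN,
                        engine="python", header=None, names=columns)

# The same pattern splits a single raw line into six fields.
fields = re.split(SEPARATOR_PATTERN, sample.splitlines()[0])
print(df_sample[["line_id", "character"]])
print(len(fields))
```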

In [None]:
df_full = df_full[df_full["line_id"].isin(df_memorable["line_id"])]
df_full = df_full[["line_id", "character"]]
df_full

Unnamed: 0,line_id,character
733,733,kat
788,788,kat
831,831,bianca
876,876,michael
979,979,walter
...,...,...
892829,893074,keenan
893222,893467,sydney
893951,894196,storey
894001,894246,pulleine


In [None]:
df = df_memorable.merge(df_full, on=("line_id"))

In [None]:
#putting a separator (_) between quotes said by same character
df_joined = df.groupby(["movie_title", "character"], as_index=False).agg({
    "memorable_quote": "_".join
})
df_joined.to_csv("memorable_quotes.csv")
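
Writing a dataframe with `to_csv` also writes its index by default, and reading the file back turns that index into an `Unnamed: 0` column. As a small sketch with made-up quotes, the example below repeats the group-and-join step and round-trips the result through an in-memory CSV with `index=False`, so no extra column appears.

```python
import io

import pandas as pd

# Made-up quotes, only to illustrate the grouping and the CSV round trip.
df_demo = pd.DataFrame({
    "movie_title": ["airplane", "airplane", "zulu dawn"],
    "character": ["dr. rumack", "dr. rumack", "pulleine"],
    "memorable_quote": ["Quote one.", "Quote two.", "Quote three."],
})

# Join all quotes by the same character in the same movie with "_".
df_joined_demo = df_demo.groupby(["movie_title", "character"],
                                 as_index=False).agg({"memorable_quote": "_".join})

buffer = io.StringIO()
df_joined_demo.to_csv(buffer, index=False)  # do not write the index
buffer.seek(0)
df_roundtrip = pd.read_csv(buffer)
print(df_roundtrip.columns.tolist())
```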

Installing Cinemagoer to find the IMDb IDs of the movies

In [None]:
pip install cinemagoer



In [None]:
df_memorable = pd.read_csv("memorable_quotes.csv")
df_memorable

Unnamed: 0.1,Unnamed: 0,movie_title,character,memorable_quote
0,0,10 things i hate about you,bianca,You're asking me out? That's so cute! What's y...
1,1,10 things i hate about you,cameron,"Just 'cause you're beautiful, that doesn't mea..."
2,2,10 things i hate about you,joey,Watching that bitch violate my car doesn't cou...
3,3,10 things i hate about you,kat,"I guess in this society, being male and an ass..."
4,4,10 things i hate about you,michael,"I have a dick on my face, don't I?"
...,...,...,...,...
3233,3233,zerophilia,keenan,You're gonna' have a great time with her tonig...
3234,3234,zerophilia,sydney,That's the thing about the truth. It'll set yo...
3235,3235,zulu dawn,crealock,"Excuse me, my Lord, there's something I must c..."
3236,3236,zulu dawn,pulleine,"Well fought, Gentlemen. It's time to save the ..."


We will be using the IMDB API by Octopus Team to get our movie info and the IMDB Scraper API by Oanor to get our actor info. The IMDB APIs have limits for the free plan, so we have to change the key after a certain number of requests.

In [None]:
num = "1" #batch number; the APIs have a request limit
#repeat the steps below for each batch of rows until all info is collected
df_num = df_memorable.iloc[0:100, :].copy() #copy so new columns can be added safely
df_num

Unnamed: 0.1,Unnamed: 0,movie_title,character,memorable_quote
0,0,10 things i hate about you,bianca,You're asking me out? That's so cute! What's y...
1,1,10 things i hate about you,cameron,"Just 'cause you're beautiful, that doesn't mea..."
2,2,10 things i hate about you,joey,Watching that bitch violate my car doesn't cou...
3,3,10 things i hate about you,kat,"I guess in this society, being male and an ass..."
4,4,10 things i hate about you,michael,"I have a dick on my face, don't I?"
...,...,...,...,...
95,95,air force one,rose,". No matter what happens, we land this aircraf..."
96,96,airplane,air controller #1,I know but this guy has no flying experience a...
97,97,airplane,businessman,"Well, I'll give him another twenty minutes, bu..."
98,98,airplane,dr. rumack,I just want to tell you both good luck. We're ...
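
Since the same steps are repeated for each slice of 100 rows, the slicing can be expressed as a small generator. This is only a sketch: `iter_batches` is a hypothetical helper, not part of the original workflow, and the demo dataframe is made up.

```python
import pandas as pd


def iter_batches(df, batch_size=100):
    """Yield (batch_number, rows) pairs; each batch is an independent copy."""
    for start in range(0, len(df), batch_size):
        batch_number = start // batch_size + 1
        yield batch_number, df.iloc[start:start + batch_size].copy()


# Made-up dataframe with 250 rows to show how the last batch is shorter.
df_demo = pd.DataFrame({"movie_title": [f"movie {i}" for i in range(250)]})
batch_sizes = {num: len(batch) for num, batch in iter_batches(df_demo)}
print(batch_sizes)
```

Returning a copy for each batch means columns can be added to it later without pandas warning about writing to a slice of the original dataframe.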


In [None]:
unique_movies = df_num["movie_title"].unique().tolist() #checks movies
unique_movies

['10 things i hate about you',
 '1492: conquest of paradise',
 '15 minutes',
 '2001: a space odyssey',
 '8mm',
 'a bucket of blood',
 'a clockwork orange',
 'a few good men',
 "a hard day's night",
 'a nightmare on elm street',
 'a nightmare on elm street 3: dream warriors',
 'a nightmare on elm street 4: the dream master',
 "a nightmare on elm street part 2: freddy's revenge",
 'a nightmare on elm street: the dream child',
 'a perfect world',
 'a serious man',
 'a walk to remember',
 'above the law',
 'absolute power',
 'ace ventura pet detective',
 'adaptation',
 'affliction',
 'after school special',
 'agnes of god',
 'air force one',
 'airplane']

In [None]:
import requests
from imdb import Cinemagoer # extracts the imdb ID based on a movie name

ia = Cinemagoer()

This API grabs info about the cast of the specified movie.

In [None]:
#function for movie id
def get_id(movie_name):
  movies = ia.search_movie(movie_name)
  movie = movies[0]
  return movie.movieID

#function for movie info
def get_movie_info(movie_id):
  url = f"https://imdb236.p.rapidapi.com/imdb/tt{movie_id}/cast"
  headers = {
    "x-rapidapi-key": "e8b568dfbcmshcb2b1b257c05f75p1f4700jsn0675dce0a24d",
    "x-rapidapi-host": "imdb236.p.rapidapi.com"
  }
  response = requests.get(url, headers=headers)
  return response.json()

movie_ids = []
movie_jsons = {}
for movie in unique_movies:
  movie_id = get_id(movie)  # avoid shadowing the built-in id()
  movie_ids.append(movie_id)
  movie_json = get_movie_info(movie_id)
  movie_jsons[movie] = movie_json
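
Because the key has to be swapped whenever the free-plan limit is reached, one option is to read it from an environment variable instead of editing it inside each function. The sketch below assumes a variable named `RAPIDAPI_KEY`; both that name and the `build_headers` helper are hypothetical.

```python
import os


def build_headers(host, env_var="RAPIDAPI_KEY"):
    """Build RapidAPI request headers from a key stored in the environment."""
    key = os.environ.get(env_var)
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable first")
    return {"x-rapidapi-key": key, "x-rapidapi-host": host}


# Placeholder value so the example runs; a real key would be set outside the code.
os.environ.setdefault("RAPIDAPI_KEY", "demo-key-not-real")
headers = build_headers("imdb236.p.rapidapi.com")
print(sorted(headers))
```

With this approach, switching keys only requires changing the environment variable, and the key never appears in a shared notebook.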

In [None]:
movie_json

[{'id': 'nm0001332',
  'url': 'https://www.imdb.com/name/nm0001332/',
  'fullName': 'Robert Hays',
  'job': 'actor',
  'characters': ['Ted Striker']},
 {'id': 'nm0353546',
  'url': 'https://www.imdb.com/name/nm0353546/',
  'fullName': 'Julie Hagerty',
  'job': 'actress',
  'characters': ['Elaine Dickinson']},
 {'id': 'nm0000558',
  'url': 'https://www.imdb.com/name/nm0000558/',
  'fullName': 'Leslie Nielsen',
  'job': 'actor',
  'characters': ['Dr. Rumack']},
 {'id': 'nm0000717',
  'url': 'https://www.imdb.com/name/nm0000717/',
  'fullName': 'Kareem Abdul-Jabbar',
  'job': 'actor',
  'characters': ['Roger Murdock']},
 {'id': 'nm0000978',
  'url': 'https://www.imdb.com/name/nm0000978/',
  'fullName': 'Lloyd Bridges',
  'job': 'actor',
  'characters': ['Steve McCroskey']},
 {'id': 'nm0336335',
  'url': 'https://www.imdb.com/name/nm0336335/',
  'fullName': 'Peter Graves',
  'job': 'actor',
  'characters': ['Captain Clarence Oveur']},
 {'id': 'nm0666309',
  'url': 'https://www.imdb.com/nam

In [None]:
def get_cast(movie_json, character):
  # returns the first actor whose lowercased role name contains the character name
  for actor in movie_json:
    roles = [role.lower() for role in actor['characters']]
    for role in roles:
      if character in role:
        return actor['fullName']
  return "actor not found"


def get_actor_id(movie_json, actor_name):
  for actor in movie_json:
    if actor.get("fullName") == actor_name:
      return actor.get("id")
  return None

def get_actor_job(movie_json, actor_id):
  for actor in movie_json:
    if actor.get("id") == actor_id:
      return actor.get("job")
  return None

movie_dict = df_num.groupby("movie_title")["character"].apply(list).to_dict()
movie_dict.keys()

dict_keys(['10 things i hate about you', '1492: conquest of paradise', '15 minutes', '2001: a space odyssey', '8mm', 'a bucket of blood', 'a clockwork orange', 'a few good men', "a hard day's night", 'a nightmare on elm street', 'a nightmare on elm street 3: dream warriors', 'a nightmare on elm street 4: the dream master', "a nightmare on elm street part 2: freddy's revenge", 'a nightmare on elm street: the dream child', 'a perfect world', 'a serious man', 'a walk to remember', 'above the law', 'absolute power', 'ace ventura pet detective', 'adaptation', 'affliction', 'after school special', 'agnes of god', 'air force one', 'airplane'])
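
The three lookup functions above each scan the cast list separately. As a sketch of an alternative, the hypothetical `find_actor` below returns the whole matching cast entry in one pass, so the name, ID, and job can all be read from it. The two cast entries are taken from the JSON output shown earlier.

```python
def find_actor(cast, character):
    """Return the first cast entry with a role containing `character`, else None."""
    wanted = character.strip().lower()
    for member in cast:
        # Treat a missing or empty "characters" field as no roles.
        for role in member.get("characters") or []:
            if wanted in role.lower():
                return member
    return None


cast = [
    {"id": "nm0001332", "fullName": "Robert Hays", "job": "actor",
     "characters": ["Ted Striker"]},
    {"id": "nm0000558", "fullName": "Leslie Nielsen", "job": "actor",
     "characters": ["Dr. Rumack"]},
]

match = find_actor(cast, "dr. rumack")
no_match = find_actor(cast, "air controller #1")
print(match["fullName"], match["id"], match["job"])
print(no_match)
```

Substring matching is forgiving about titles and surnames, but a very short character name can match an unrelated role, so unmatched and suspicious rows are worth checking by hand.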

This API returns actor info.

In [None]:
def get_actor_info(actor_id):
  url = "https://imdb-scraper3.p.rapidapi.com/api/v1/name/detail"
  querystring = {"id":f"{actor_id}"}
  headers = {
    "x-rapidapi-key": "91514d88a9msh6375841cf39c88bp1054afjsn72bc51c76bb4",
    "x-rapidapi-host": "imdb-scraper3.p.rapidapi.com"
  }
  response = requests.get(url, headers=headers, params=querystring)
  return response.json()


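def get_age(actor_json):
  # despite the name, this returns the date of birth (None if missing)
  birth_date = actor_json["data"]["result"].get("birthDate") or {}
  return birth_date.get("date")


def get_bio(actor_json):
  # returns the plain-text bio (None if missing)
  bio = actor_json["data"]["result"].get("bio") or {}
  return (bio.get("text") or {}).get("plainText")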
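
Chained `.get()` calls fail with an `AttributeError` as soon as one level of the response is `None`. The sketch below shows a small hypothetical helper, `dig`, that walks nested dictionaries and falls back to a default. The sample response only mimics the keys used above (`data`, `result`, `birthDate`, `bio`) and is not a full API response.

```python
def dig(mapping, *keys, default=None):
    """Follow `keys` through nested dicts; return `default` if any level is missing."""
    current = mapping
    for key in keys:
        if not isinstance(current, dict):
            return default
        current = current.get(key)
    return default if current is None else current


# Trimmed-down response: a birth date is present, but the bio is null.
actor_json = {"data": {"result": {"birthDate": {"date": "1981-06-07"},
                                  "bio": None}}}

dob = dig(actor_json, "data", "result", "birthDate", "date")
bio = dig(actor_json, "data", "result", "bio", "text", "plainText")
print(dob, bio)
```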

We extract this info into lists, which are then turned into columns of our dataframe.

In [None]:
actors = []
actor_ids = []
jobs = []
#appends the lists to add to dataframe with corresponding information
for movie in movie_dict.keys():
  for character in movie_dict[movie]:
    actor = get_cast(movie_jsons[movie], character)
    actors.append(actor)
    actor_id = get_actor_id(movie_jsons[movie], actor)
    actor_ids.append(actor_id)
    job = get_actor_job(movie_jsons[movie], actor_id)
    jobs.append(job)

df_num["actor"] = actors
df_num["actor_id"] = actor_ids
df_num["job"] = jobs
#copy after filtering so later column assignments do not act on a slice
df_num = df_num[df_num["actor"] != "actor not found"].copy()



In [None]:
#this ensures that we have separate entries for the same actor who acted in
# two different movies / as two different characters in one movie
actor_tuples = [(row.memorable_quote, row.actor_id) for row in
                df_num.itertuples(index=False)]
actor_jsons = {}

len(actor_tuples)

62

Run for specific actor info

In [None]:
for pair in actor_tuples:
  actor_jsons[pair] = get_actor_info(pair[1])

In [None]:
actor_jsons

{("You're asking me out? That's so cute! What's your name again?",
  'nm0646351'): {'status': 'ok',
  'request_id': 'ffcf056a-65fc-4f8f-8004-2846833e5b60',
  'message': None,
  'data': {'result': {'id': 'nm0646351',
    'nameText': {'text': 'Larisa Oleynik', '__typename': 'NameText'},
    'searchIndexing': {'disableIndexing': False,
     '__typename': 'NameSearchIndexing'},
    'disambiguator': {'text': 'I', '__typename': 'Disambiguation'},
    'knownFor': {'edges': [{'node': {'title': {'titleText': {'text': '10 Dinge, die ich an Dir hasse',
          '__typename': 'TitleText'},
         '__typename': 'Title'},
        'summary': {'principalCategory': {'text': 'Actress',
          '__typename': 'CreditCategory'},
         '__typename': 'NameKnownForSummary'},
        '__typename': 'NameKnownFor'},
       '__typename': 'NameKnownForEdge'},
      {'node': {'title': {'titleText': {'text': 'Was ist los mit Alex Mack?',
          '__typename': 'TitleText'},
         '__typename': 'Title'},


Get the dates of birth and the bios of the specific actors

In [None]:
DOBs = []
bios = []
# iterate in the same order as df_num so the lists line up with its rows
for pair in actor_tuples:
  actor_json = actor_jsons.get(pair)
  if actor_json is not None:  # Check if this actor's info was retrieved
    DOB = get_age(actor_json)
    bio = get_bio(actor_json)
  else:
    DOB = None  # default value for a missing date of birth
    bio = None  # default value for a missing bio

  DOBs.append(DOB)
  bios.append(bio)


df_num["actor_DOB"] = DOBs
df_num["actor_bio"] = bios



This third call uses a different endpoint of the first API we called. It retrieves a lot of information about each movie. The field we care about is the release date, which we can combine with the DOB to calculate age.

In [None]:
def get_movie_details(movie_id):
  url = f"https://imdb236.p.rapidapi.com/imdb/tt{movie_id}"

  headers = {
    "x-rapidapi-key": "91514d88a9msh6375841cf39c88bp1054afjsn72bc51c76bb4",
    "x-rapidapi-host": "imdb236.p.rapidapi.com"
  }

  response = requests.get(url, headers=headers)

  return response.json()

movie_details = {}
#reuse the IDs collected earlier instead of searching for each movie again
for movie, movie_id in zip(unique_movies, movie_ids):
  movie_details[movie] = get_movie_details(movie_id)

In [None]:
#for movie release dates
movies_list = []
for title, details in movie_details.items():
    release_date = details.get("releaseDate", None)  # Get the release date for each movie
    movies_list.append({"title": title, "release_date": release_date})

This dataframe has one row per movie. Merging it with our other dataframe saves requests, since many characters appear in the same movie.

In [None]:
#turn to dataframe to be merged
df_movies_list = pd.DataFrame(movies_list)
df_movies_list

Unnamed: 0,title,release_date
0,10 things i hate about you,1999-03-31
1,1492: conquest of paradise,1992-10-09
2,15 minutes,2001-03-09
3,2001: a space odyssey,1968-05-12
4,8mm,1999-02-26
5,a bucket of blood,1959-10-21
6,a clockwork orange,1972-02-02
7,a few good men,1992-12-11
8,a hard day's night,1964-07-07
9,a nightmare on elm street,1984-11-16


In [None]:
df_num

Unnamed: 0.1,Unnamed: 0,movie_title,character,memorable_quote,actor,actor_id,job,actor_DOB,actor_bio
0,0,10 things i hate about you,bianca,You're asking me out? That's so cute! What's y...,Larisa Oleynik,nm0646351,actress,1981-06-07,"Larisa Oleynik was born in Santa Clara County,..."
1,1,10 things i hate about you,cameron,"Just 'cause you're beautiful, that doesn't mea...",Joseph Gordon-Levitt,nm0330687,actor,1981-02-17,"Joseph Gordon-Levitt is an actor, filmmaker, a..."
2,2,10 things i hate about you,joey,Watching that bitch violate my car doesn't cou...,Andrew Keegan,nm0005080,actor,1979-01-29,"Andrew Keegan was born in Shadow Hills, Califo..."
3,3,10 things i hate about you,kat,"I guess in this society, being male and an ass...",Julia Stiles,nm0005466,actress,1981-03-28,"Lovely, Julia (O'Hara) Stiles, of Irish, Engli..."
4,4,10 things i hate about you,michael,"I have a dick on my face, don't I?",David Krumholtz,nm0472710,actor,1978-05-15,David Krumholtz is an American actor and comed...
...,...,...,...,...,...,...,...,...,...
88,88,affliction,wade,I think there's some dirty business going on i...,Nick Nolte,nm0000560,actor,1941-02-08,"Nick Nolte was born in Omaha, Nebraska and beg..."
90,90,agnes of god,agnes,As much as Mother Miriam loves me?,Meg Tilly,nm0000672,actress,1960-02-14,"Meg Tilly was set on being a dancer, and at 17..."
91,91,agnes of god,martha,As much as God loves you_I don't know the mean...,Jane Fonda,nm0000404,actress,1937-12-21,Born in New York City to legendary screen star...
94,94,air force one,gibbs,How the hell did they get Air Force One?,Xander Berkeley,nm0075359,actor,1955-12-16,Xander's father was a painter and his mother a...


This is one part of the dataframe. We ended up with 29 dataframes that we concatenated together.

In [None]:
#Repeat steps until filling in entire dataframe
df_num.to_csv(f"df_{num}.csv")

Finding Actor Ages

In [None]:
#merge movie with list
df_job_releaseDate = df_num.merge(df_movies_list, left_on="movie_title",
                                  right_on="title")
df_job_releaseDate.drop(columns=["title"], inplace=True)
#replace job titles with genders
df_job_releaseDate["job"] = df_job_releaseDate["job"].replace({"actor": "M",
                                                               "actress": "F"})
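
The default inner merge silently drops characters whose movie has no row in the release-date table. As a sketch of how to make that visible, the example below uses a left merge with pandas' `validate` and `indicator` options. The frames are made up, except for the 8mm release date, and the character "tom" is hypothetical.

```python
import pandas as pd

# Per-character rows (many per movie) and per-movie release dates (one per movie).
df_characters = pd.DataFrame({
    "movie_title": ["8mm", "8mm", "airplane"],
    "character": ["max", "tom", "dr. rumack"],
})
df_dates = pd.DataFrame({"title": ["8mm"], "release_date": ["1999-02-26"]})

merged = df_characters.merge(
    df_dates,
    left_on="movie_title",
    right_on="title",
    how="left",              # keep characters even without a release date
    validate="many_to_one",  # raise if a title appears twice in df_dates
    indicator=True,          # adds a "_merge" column showing the match status
).drop(columns=["title"])

# Movies that found no release date.
missing = merged.loc[merged["_merge"] == "left_only", "movie_title"].tolist()
print(merged)
print(missing)
```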

In [None]:
#code to calculate age at time of release
df_job_releaseDate["actor_DOB"] = pd.to_datetime(df_job_releaseDate["actor_DOB"],
                                                 errors='coerce')
df_job_releaseDate["release_date"] = pd.to_datetime(df_job_releaseDate["release_date"],
                                                    errors='coerce')

df_job_releaseDate["age_at_release"] = (df_job_releaseDate["release_date"].dt.year
                                        - df_job_releaseDate["actor_DOB"].dt.year)

df_job_releaseDate

Unnamed: 0.1,Unnamed: 0,movie_title,character,memorable_quote,actor,actor_id,job,actor_DOB,actor_bio,release_date,age_at_release
0,0,10 things i hate about you,bianca,You're asking me out? That's so cute! What's y...,Larisa Oleynik,nm0646351,F,1981-06-07,"Larisa Oleynik was born in Santa Clara County,...",1999-03-31,18.0
1,1,10 things i hate about you,cameron,"Just 'cause you're beautiful, that doesn't mea...",Joseph Gordon-Levitt,nm0330687,M,1981-02-17,"Joseph Gordon-Levitt is an actor, filmmaker, a...",1999-03-31,18.0
2,2,10 things i hate about you,joey,Watching that bitch violate my car doesn't cou...,Andrew Keegan,nm0005080,M,1979-01-29,"Andrew Keegan was born in Shadow Hills, Califo...",1999-03-31,20.0
3,3,10 things i hate about you,kat,"I guess in this society, being male and an ass...",Julia Stiles,nm0005466,F,1981-03-28,"Lovely, Julia (O'Hara) Stiles, of Irish, Engli...",1999-03-31,18.0
4,4,10 things i hate about you,michael,"I have a dick on my face, don't I?",David Krumholtz,nm0472710,M,1978-05-15,David Krumholtz is an American actor and comed...,1999-03-31,21.0
...,...,...,...,...,...,...,...,...,...,...,...
57,88,affliction,wade,I think there's some dirty business going on i...,Nick Nolte,nm0000560,M,1941-02-08,"Nick Nolte was born in Omaha, Nebraska and beg...",1999-02-19,58.0
58,90,agnes of god,agnes,As much as Mother Miriam loves me?,Meg Tilly,nm0000672,F,1960-02-14,"Meg Tilly was set on being a dancer, and at 17...",1985-09-27,25.0
59,91,agnes of god,martha,As much as God loves you_I don't know the mean...,Jane Fonda,nm0000404,F,1937-12-21,Born in New York City to legendary screen star...,1985-09-27,48.0
60,94,air force one,gibbs,How the hell did they get Air Force One?,Xander Berkeley,nm0075359,M,1955-12-16,Xander's father was a painter and his mother a...,1997-07-25,42.0
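
Subtracting years ignores whether the actor's birthday had already passed on the release date, so the result can be one year too high. For example, someone born on 1981-06-07 is 17, not 18, on 1999-03-31. The sketch below shows one way to compute the exact age; `exact_age` is a hypothetical helper, and the dates are taken from the table above.

```python
import pandas as pd


def exact_age(dob, on_date):
    """Age in whole years on `on_date`, accounting for the birthday."""
    years = on_date.dt.year - dob.dt.year
    # True where the birthday has not yet happened in the year of `on_date`.
    before_birthday = (on_date.dt.month < dob.dt.month) | (
        (on_date.dt.month == dob.dt.month) & (on_date.dt.day < dob.dt.day)
    )
    return years - before_birthday.astype(int)


df_demo = pd.DataFrame({
    "actor_DOB": pd.to_datetime(["1981-06-07", "1981-02-17", "1978-05-15"]),
    "release_date": pd.to_datetime(["1999-03-31"] * 3),
})
ages = exact_age(df_demo["actor_DOB"], df_demo["release_date"])
print(ages.tolist())
```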


In [None]:
#rename returns a new dataframe for display; df_job_releaseDate itself keeps the "job" column
df_job_releaseDate.rename(columns={"job": "gender"})

Unnamed: 0.1,Unnamed: 0,movie_title,character,memorable_quote,actor,actor_id,gender,actor_DOB,actor_bio,release_date,age_at_release
0,0,10 things i hate about you,bianca,You're asking me out? That's so cute! What's y...,Larisa Oleynik,nm0646351,F,1981-06-07,"Larisa Oleynik was born in Santa Clara County,...",1999-03-31,18.0
1,1,10 things i hate about you,cameron,"Just 'cause you're beautiful, that doesn't mea...",Joseph Gordon-Levitt,nm0330687,M,1981-02-17,"Joseph Gordon-Levitt is an actor, filmmaker, a...",1999-03-31,18.0
2,2,10 things i hate about you,joey,Watching that bitch violate my car doesn't cou...,Andrew Keegan,nm0005080,M,1979-01-29,"Andrew Keegan was born in Shadow Hills, Califo...",1999-03-31,20.0
3,3,10 things i hate about you,kat,"I guess in this society, being male and an ass...",Julia Stiles,nm0005466,F,1981-03-28,"Lovely, Julia (O'Hara) Stiles, of Irish, Engli...",1999-03-31,18.0
4,4,10 things i hate about you,michael,"I have a dick on my face, don't I?",David Krumholtz,nm0472710,M,1978-05-15,David Krumholtz is an American actor and comed...,1999-03-31,21.0
...,...,...,...,...,...,...,...,...,...,...,...
57,88,affliction,wade,I think there's some dirty business going on i...,Nick Nolte,nm0000560,M,1941-02-08,"Nick Nolte was born in Omaha, Nebraska and beg...",1999-02-19,58.0
58,90,agnes of god,agnes,As much as Mother Miriam loves me?,Meg Tilly,nm0000672,F,1960-02-14,"Meg Tilly was set on being a dancer, and at 17...",1985-09-27,25.0
59,91,agnes of god,martha,As much as God loves you_I don't know the mean...,Jane Fonda,nm0000404,F,1937-12-21,Born in New York City to legendary screen star...,1985-09-27,48.0
60,94,air force one,gibbs,How the hell did they get Air Force One?,Xander Berkeley,nm0075359,M,1955-12-16,Xander's father was a painter and his mother a...,1997-07-25,42.0


Generally, movies are released in theaters about a year after filming, so we wanted to take into account an estimate of the actor's age when the line was actually said, to see if there were any problematic trends.

In [None]:
#column for approx age while filming, most movies take about a year after filming to come out
df_job_releaseDate["approx_age_filming"] = df_job_releaseDate["age_at_release"] - 1

Cleaning the data

In [63]:

df_cleaned = df_job_releaseDate.dropna(subset=["release_date", "actor_DOB"]).copy()
missing_values_after = df_cleaned[["release_date", "actor_DOB"]].isna().sum()

We wanted an easy way to see how many quotes each character says. This count will be less helpful later, since we are going to separate the data by individual quotes instead of by character.

In [None]:
df_cleaned["quote_count"] = df_cleaned["memorable_quote"].str.count("_") + 1



In [None]:
#repeat until all data is recorded
df_cleaned

Unnamed: 0.1,Unnamed: 0,movie_title,character,memorable_quote,actor,actor_id,job,actor_DOB,actor_bio,release_date,age_at_release,approx_age_filming,quote_count
0,0,10 things i hate about you,bianca,You're asking me out? That's so cute! What's y...,Larisa Oleynik,nm0646351,F,1981-06-07,"Larisa Oleynik was born in Santa Clara County,...",1999-03-31,18.0,17.0,1
1,1,10 things i hate about you,cameron,"Just 'cause you're beautiful, that doesn't mea...",Joseph Gordon-Levitt,nm0330687,M,1981-02-17,"Joseph Gordon-Levitt is an actor, filmmaker, a...",1999-03-31,18.0,17.0,3
2,2,10 things i hate about you,joey,Watching that bitch violate my car doesn't cou...,Andrew Keegan,nm0005080,M,1979-01-29,"Andrew Keegan was born in Shadow Hills, Califo...",1999-03-31,20.0,19.0,1
3,3,10 things i hate about you,kat,"I guess in this society, being male and an ass...",Julia Stiles,nm0005466,F,1981-03-28,"Lovely, Julia (O'Hara) Stiles, of Irish, Engli...",1999-03-31,18.0,17.0,4
4,4,10 things i hate about you,michael,"I have a dick on my face, don't I?",David Krumholtz,nm0472710,M,1978-05-15,David Krumholtz is an American actor and comed...,1999-03-31,21.0,20.0,1
5,5,10 things i hate about you,patrick,Who knocked up your sister?_I was watching you...,Heath Ledger,nm0005132,M,1979-04-04,"When hunky, twenty-year-old heart-throb Heath ...",1999-03-31,20.0,19.0,5
6,6,10 things i hate about you,walter,"You're 18, you don't know what you want. And y...",Larry Miller,nm0588777,M,1953-10-15,Larry Miller was born on 15 October 1953 in Va...,1999-03-31,46.0,45.0,1
7,7,1492: conquest of paradise,fernando,Of all the words my father wrote - and there w...,Loren Dean,nm0000363,M,1969-07-31,Loren Dean was born on 31 July 1969 in Las Veg...,1992-10-09,23.0,22.0,1
8,8,15 minutes,emil,I can kill you - I'm insane._I love America. N...,Karel Roden,nm0734558,M,1962-05-18,Karel Roden is an internationally known actor ...,2001-03-09,39.0,38.0,2
9,11,8mm,max,"Oh, he's a lover, man... definitely loves what...",Joaquin Phoenix,nm0001618,M,1974-10-28,Joaquin Phoenix was born Joaquin Rafael Bottom...,1999-02-26,25.0,24.0,1
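
Since the quotes of one character are joined with `_`, they can be separated into one row per quote by splitting on that separator. The following is a minimal sketch using `str.split` and `explode`. The first quote comes from the data above and the second is made up. It assumes no quote contains an underscore of its own.

```python
import pandas as pd

df_demo = pd.DataFrame({
    "character": ["agnes", "martha"],
    "memorable_quote": ["As much as Mother Miriam loves me?",
                        "First quote_Second quote"],
})

# Number of quotes per character: separators plus one.
df_demo["quote_count"] = df_demo["memorable_quote"].str.count("_") + 1

# One row per individual quote; the other columns are repeated.
df_quotes = (
    df_demo.assign(memorable_quote=df_demo["memorable_quote"].str.split("_"))
    .explode("memorable_quote")
    .reset_index(drop=True)
)
print(df_quotes)
```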


In [None]:
#ends up with this dataset after repeating
df = pd.read_csv("df_cleaned-4.csv")