# CPSC 4810 - Group project


# Project description

- Data acquisition
- Preparation
- Cleaning and aggregation
- Exploratory data analysis

***

# Data

For this project, 7 datasets from 4 different sources were used:
1.	Credits – CSV file obtained from Kaggle.
2.	Keywords – CSV file obtained from Kaggle.
3.	Movies – CSV file obtained from Kaggle.
4.	Ratings – CSV file obtained from Kaggle.
5.	Academy Award Winning Films – data from a table on Wikipedia.
6.	Top Movies – JSON files from OMDb API.
7.	Oscar Winners – TXT file from OpenIntro.org.

More information about the content of each dataset, and about how each one was used, is provided throughout the rest of this document.

## Sources
![alt text](datasources.png "Sources")

In [52]:
import pandas as pd

Because we will be dealing with JSON strings, we import json, along with `exists` to check whether the data files are present:

In [53]:
import json
from os.path import exists

***
# Dataset 1: CREDITS
***

***
## Description

The original dataset was obtained from Kaggle: https://www.kaggle.com/datasets/rounakbanik/the-movies-dataset and contains 3 columns:

- **Cast** – information in JSON format about the cast that participated in the movie.
- **Crew** – information in JSON format about the crew that helped to create the movie.
- **ID** – integer that uniquely identifies each movie. 
***

## Transformation and cleaning

The original file (**credits_cleaned01.csv**) is larger than 100 MB, which makes it heavy to work with. Some data cleaning will make the file more manageable.

***
**!!!! RUN THIS CODE ONLY IF YOU HAVE THE credits_cleaned01.csv FILE !!!!**

The ***credits_cleaned01.csv*** file is larger than 100 MB, so it cannot be uploaded to GitHub. If you do not have it, skip this part and run the section where the file **credits_cleaned03.csv** is imported.
***

Because we are not interested in the column with the **crew** information, it will be deleted:

In [118]:
file_exists = exists("data/credits_cleaned01.csv")
if(file_exists):
    credits = pd.read_csv('data/credits_cleaned01.csv')
    credits.drop(columns=['crew'], inplace=True)
    credits.to_csv('data/credits_cleaned02.csv', index=False)
    display(list(credits.columns))

['Unnamed: 0', 'cast', 'id']

We read the new file and display its size and columns. We can see there are only two columns: 

In [119]:
file_exists = exists("data/credits_cleaned02.csv")
if(file_exists):
    credits = pd.read_csv('data/credits_cleaned02.csv',index_col=0)
    display(len(credits))
    display(list(credits.columns))

45399

['cast', 'id']

Each row in the **cast** column contains a large JSON string with information we don't want to use. This is what each **cast** value looks like:

In [126]:
if(file_exists):
    print(credits.iloc[0]['cast'][0:500],"...")

[{"character": "Woody (voice)", "name": "Tom Hanks"}, {"character": "Buzz Lightyear (voice)", "name": "Tim Allen"}, {"character": "Mr. Potato Head (voice)", "name": "Don Rickles"}, {"character": "Slinky Dog (voice)", "name": "Jim Varney"}, {"character": "Rex (voice)", "name": "Wallace Shawn"}, {"character": "Hamm (voice)", "name": "John Ratzenberger"}, {"character": "Bo Peep (voice)", "name": "Annie Potts"}, {"character": "Andy (voice)", "name": "John Morris"}, {"character": "Sid (voice)", "name ...


Since the format of the string is not compatible with **json.loads()**, we need to change its structure so that each movie has a list of dictionaries, one per person:
- Create a function to clean the JSON
- Loop through each row of the dataset
- Change quotes to dummy values
- Remove unnecessary quotes or backslashes (if any)
- Remove unnecessary keys from each person's dict
- Update the dataset with the reduced JSON

In [61]:
def cleanjsontring(jsonString):
    """Turn a Python-style dict/list string into text that json.loads() accepts."""
    #Mark the quotes that delimit keys and values with a ## placeholder
    jsonString = jsonString.replace("{'", "{##")
    jsonString = jsonString.replace('{"', "{##")

    jsonString = jsonString.replace("': ", "##: ")
    jsonString = jsonString.replace('": ', "##: ")

    jsonString = jsonString.replace(", '", ", ##")
    jsonString = jsonString.replace(', "', ", ##")

    jsonString = jsonString.replace("'}", "##}")
    jsonString = jsonString.replace('"}', "##}")

    jsonString = jsonString.replace(": '", ": ##")
    jsonString = jsonString.replace("', ", "##, ")
    jsonString = jsonString.replace(': "', ': ##')
    jsonString = jsonString.replace('", ', '##, ')

    #Remove backslashes and any quotes left inside the values
    jsonString = jsonString.replace("\\", '')
    jsonString = jsonString.replace('"', "")
    jsonString = jsonString.replace("'", "")

    #Restore the delimiters as double quotes and map Python's None to JSON's null
    jsonString = jsonString.replace('##', '"')
    jsonString = jsonString.replace(": None", ": null")
    return jsonString
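
As an alternative sketch (not the approach used in this notebook), strings that are valid Python literals can often be parsed with the standard library's `ast.literal_eval`, which avoids the quote replacement altogether. The sample string below is made up for illustration, and real rows may contain values that `literal_eval` rejects, so this is only a starting point.

```python
import ast
import json

# Hypothetical sample in the same Python-literal style as the raw cast column
raw_cast = ("[{'cast_id': 14, 'character': 'Woody (voice)', "
            "'name': 'Tom Hanks', 'profile_path': None}]")

# literal_eval understands single quotes and None, unlike json.loads
people = ast.literal_eval(raw_cast)

# Keep only the two keys the notebook uses later
reduced = [{"character": p["character"], "name": p["name"]} for p in people]

# Serialize back to real JSON so json.loads() can read it afterwards
cast_json = json.dumps(reduced)
print(cast_json)
```

Names that contain apostrophes or quotes are handled by the parser itself here, which is the main reason to consider this route.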

In [369]:
file_exists = exists("data/credits_cleaned02.csv")
if(file_exists):
    credits = pd.read_csv('data/credits_cleaned02.csv',index_col=0)
    credits.reset_index(drop=True, inplace=True)

errors = []
if(file_exists):
    for i, credit in credits.iterrows():
        if(i >= 0):
            try:
                cast = cleanjsontring(credit.cast)
                
                if(credit.cast == "[]" or credit.cast == ""):
                    errors.append(i)
                else:
                    arrayOfPeople = json.loads(cast)
                    for dictOfPeople in arrayOfPeople:
                        dictOfPeople.pop("cast_id")
                        dictOfPeople.pop("credit_id")
                        dictOfPeople.pop("gender")
                        dictOfPeople.pop("order")
                        dictOfPeople.pop("profile_path")
                        dictOfPeople.pop("id")
                        #str.replace returns a new string, so the result must be assigned back
                        dictOfPeople['character'] = dictOfPeople['character'].replace('"',"´")
                    credits.iloc[i,0] = json.dumps(arrayOfPeople)
            except Exception as e:
                print(i)
                errors.append(i)

In [556]:
#These are out of range elements:
#print(errors)

In [370]:
len(credits)

45399

To remove the rows that could not be parsed, we run the following code:

In [55]:
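#Ignore positions that are out of range (for example if this cell is run twice)
valid = [i for i in errors if i < len(credits)]
#Drop all failed rows in a single call: dropping them one by one would shift
#the positions of the remaining rows after each removal
credits.drop(index=credits.index[valid], inplace=True)
credits.reset_index(drop=True, inplace=True)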
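
A small illustration of why the order of removal matters when rows are dropped by position. The dataframe below is a made-up toy, not the credits data: it compares dropping inside a loop with dropping every position in one call.

```python
import pandas as pd

# Toy data: positions 1 and 3 hold the rows we want to remove
toy = pd.DataFrame({"cast": ["a", "[]", "c", "[]", "e"]})
bad_positions = [1, 3]

# Dropping inside a loop: after the first drop, position 3 points at "e"
looped = toy.copy()
for pos in bad_positions:
    looped = looped.drop(looped.index[pos])

# Dropping in one call: positions are resolved before any row is removed
one_pass = toy.drop(index=toy.index[bad_positions]).reset_index(drop=True)

print(looped["cast"].tolist())    # one "[]" row survives, "e" is lost
print(one_pass["cast"].tolist())  # both "[]" rows are gone
```

Both versions remove the same number of rows, so a row count alone does not reveal the difference.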

In [372]:
len(credits)

43182

In [54]:
credits.head(5)

Unnamed: 0,cast,id
0,"[{""character"": ""Woody (voice)"", ""name"": ""Tom H...",862
1,"[{""character"": ""Alan Parrish"", ""name"": ""Robin ...",8844
2,"[{""character"": ""Max Goldman"", ""name"": ""Walter ...",15602
3,"[{""character"": ""Savannah Vannah Jackson"", ""nam...",31357
4,"[{""character"": ""George Banks"", ""name"": ""Steve ...",11862


Now we store this dataframe as a new file. With these steps we have reduced the original file from 190 MB to 37 MB, which makes it easier to work with:

In [561]:
credits.to_csv('data/credits_cleaned03.csv')
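
To check a size reduction like this one, the file sizes can be read from disk. The helper `file_size_mb` below is a hypothetical name introduced for this sketch, and it is demonstrated on a temporary file rather than on the project files.

```python
import os
import tempfile

def file_size_mb(path):
    """Return the size of a file in megabytes (1 MB = 1,000,000 bytes)."""
    return os.path.getsize(path) / 1_000_000

# Demonstration on a temporary file of exactly 2,500,000 bytes
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"x" * 2_500_000)
demo_path = tmp.name
demo_size = file_size_mb(demo_path)
os.remove(demo_path)  # clean up the temporary file
print(demo_size)
```

In the notebook the same call could be pointed at the files in `data/`, assuming they exist locally.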

## Usage

In [56]:
credits = pd.read_csv('data/credits_cleaned03.csv', index_col=0)
credits.head(5)

Unnamed: 0,cast,id
0,"[{""character"": ""Woody (voice)"", ""name"": ""Tom H...",862
1,"[{""character"": ""Alan Parrish"", ""name"": ""Robin ...",8844
2,"[{""character"": ""Max Goldman"", ""name"": ""Walter ...",15602
3,"[{""character"": ""Savannah Vannah Jackson"", ""nam...",31357
4,"[{""character"": ""George Banks"", ""name"": ""Steve ...",11862


Let's build a dataset of performers, counting the number of movies each one appears in:

In [57]:
actArray = []
movArray = []
for i, credit in credits.iterrows():
    try:
        arrayOfPeople = json.loads(credit.cast)
        for dictOfPeople in arrayOfPeople:
            actArray.append(dictOfPeople["name"])
            movArray.append(credit.id)
    except Exception as e:  
        print(i,credit.cast,e)
        break

    
myDict = {'performer': actArray, 'movie_id': movArray}
acts = pd.DataFrame(data=myDict)

#Count movies per performer; naming the columns explicitly avoids relying on
#the default column names, which differ between pandas versions
counts = (acts['performer'].value_counts()
          .rename_axis('performer')
          .reset_index(name='number_of_movies'))

#for i in range(0,len(counts)):
    #print(counts.iloc[i].performer, counts.iloc[i].number_of_movies)
    #break
    
display(counts)

Unnamed: 0,performer,number_of_movies
0,Bess Flowers,230
1,Christopher Lee,141
2,Samuel L. Jackson,120
3,John Wayne,120
4,Donald Sutherland,107
...,...,...
196189,Joanne,1
196190,Johnny Chung-Jen Lin,1
196191,Ma Nien-Hsien,1
196192,Wei-min Ying,1
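
The same count can also be sketched without an explicit row loop, using `DataFrame.explode` to produce one row per performer. The two-movie dataframe below is invented for illustration and only mimics the shape of the cleaned credits file.

```python
import json
import pandas as pd

# Toy stand-in for the cleaned credits file (hypothetical values)
toy_credits = pd.DataFrame({
    "id": [1, 2],
    "cast": [
        '[{"character": "A", "name": "X"}, {"character": "B", "name": "Y"}]',
        '[{"character": "C", "name": "X"}]',
    ],
})

# Parse each JSON string into a list of performer names
names = toy_credits["cast"].apply(lambda s: [p["name"] for p in json.loads(s)])

# One row per (movie, performer) pair
toy_acts = toy_credits[["id"]].assign(performer=names).explode("performer")

# Count appearances, with explicit column names
toy_counts = (toy_acts["performer"].value_counts()
              .rename_axis("performer")
              .reset_index(name="number_of_movies"))
print(toy_counts)
```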


# Dataset 2: Keywords
***

***
## Description

This dataset was also obtained from Kaggle: https://www.kaggle.com/datasets/rounakbanik/the-movies-dataset and it contains 2 columns:

- **ID** – integer that uniquely identifies each movie. 
- **Keywords** – a list of different words that might be used to look for a movie.
***

## Transformation and cleaning
***

Because of the simplicity of the file, no cleaning was needed.

## Usage
***

In [50]:
keywords = pd.read_csv('data/keywords.csv')
keywords.head(5)

Unnamed: 0,id,keywords
0,862,"[{'id': 931, 'name': 'jealousy'}, {'id': 4290,..."
1,8844,"[{'id': 10090, 'name': 'board game'}, {'id': 1..."
2,15602,"[{'id': 1495, 'name': 'fishing'}, {'id': 12392..."
3,31357,"[{'id': 818, 'name': 'based on novel'}, {'id':..."
4,11862,"[{'id': 1009, 'name': 'baby'}, {'id': 1599, 'n..."


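Create a new column in the **movies** dataframe with the numerical value of the release year (this requires **movies** to be loaded first, see the section *USING MOVIES FILE*):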

In [59]:
movies["year"] = movies["release_date"].str.slice(stop=4).astype('Int64')
movies[["release_date","year"]]

Unnamed: 0,release_date,year
0,1995-10-30,1995
1,1995-12-15,1995
2,1995-12-22,1995
3,1995-12-22,1995
4,1995-02-10,1995
...,...,...
45458,,
45459,2011-11-17,2011
45460,2003-08-01,2003
45461,1917-10-21,1917
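
Slicing the first four characters assumes every non-missing value starts with a year. A more defensive sketch, on made-up values, parses the dates with `pd.to_datetime` and turns anything unparseable into a missing year:

```python
import pandas as pd

# Toy release dates, including a missing value and a malformed one
toy = pd.DataFrame(
    {"release_date": ["1995-10-30", None, "2011-11-17", "not a date"]}
)

# errors="coerce" converts unparseable entries to NaT instead of raising
parsed = pd.to_datetime(toy["release_date"], format="%Y-%m-%d", errors="coerce")

# Nullable integer dtype keeps missing years as <NA>
toy["year"] = parsed.dt.year.astype("Int64")
print(toy)
```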


We will now count the most frequent keywords in each decade and display the top 5:

In [62]:
decades = [
    {"1970-1979": movies[(movies["year"]>=1970) & (movies["year"]<=1979)]["id"].tolist()},
    {"1980-1989": movies[(movies["year"]>=1980) & (movies["year"]<=1989)]["id"].tolist()},
    {"1990-1999": movies[(movies["year"]>=1990) & (movies["year"]<=1999)]["id"].tolist()},
    {"2000-2009": movies[(movies["year"]>=2000) & (movies["year"]<=2009)]["id"].tolist()},
    {"2010-2019": movies[(movies["year"]>=2010) & (movies["year"]<=2019)]["id"].tolist()},
    {"2020-2022": movies[(movies["year"]>=2020) & (movies["year"]<=2022)]["id"].tolist()}
]


for i in range(len(decades)):
    listOfKeywords = []
    for years, ids in decades[i].items():
        print("---------------------------------------------")
        print(years)
        print("---------------------------------------------")
        for movieId in ids:
            filteredDF = keywords[keywords["id"] == movieId]
            if(len(filteredDF.index) != 0):
                myString = filteredDF.iloc[0]['keywords']
                myJson = json.loads(cleanjsontring(myString))
                for myKeyword in myJson:
                    listOfKeywords.append(myKeyword["name"])
    
    print(pd.Series(listOfKeywords).value_counts().head(5).to_string())
    print()
    print()

---------------------------------------------
1970-1979
---------------------------------------------
murder              164
female nudity       140
independent film    133
nudity              129
sex                 103


---------------------------------------------
1980-1989
---------------------------------------------
independent film    201
murder              190
nudity              145
woman director      134
female nudity       127


---------------------------------------------
1990-1999
---------------------------------------------
woman director      396
independent film    362
murder              190
sex                 118
violence            109


---------------------------------------------
2000-2009
---------------------------------------------
woman director      988
independent film    948
murder              251
sex                 238
sport               174


---------------------------------------------
2010-2019
---------------------------------------------
wo

###  !!!!!  ^----- Analyze THIS  ------^ !!!!!
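
The per-decade keyword count can also be sketched with a merge and a group-by instead of nested loops. The data below is invented, and the `names` column is assumed to already hold a list of keyword names per movie (in the notebook these would first be extracted from the JSON strings).

```python
import pandas as pd

# Toy movies and keywords (hypothetical values)
movies_toy = pd.DataFrame({"id": [1, 2, 3], "year": [1975, 1978, 1984]})
keywords_toy = pd.DataFrame({
    "id": [1, 2, 3],
    "names": [["murder", "sex"], ["murder"], ["independent film"]],
})

# One row per (movie, keyword) pair
merged = movies_toy.merge(keywords_toy, on="id").explode("names")

# 1975 -> 1970, 1984 -> 1980, and so on
merged["decade"] = merged["year"] // 10 * 10

# Count keywords within each decade and keep the five most frequent
top = (merged.groupby("decade")["names"].value_counts()
       .groupby(level=0).head(5))
print(top)
```

This avoids filtering the keywords dataframe once per movie, which is where most of the time in a loop-based version would likely go.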

## Getting list of Academy Award winning films
Now we will scrape the list of Academy Award-winning films from Wikipedia, so that we can compare the most awarded movies with what we have in our dataset:

In [361]:
import requests
from bs4 import BeautifulSoup
url = "https://en.wikipedia.org/wiki/List_of_Academy_Award-winning_films"
req = requests.get(url)
soup = BeautifulSoup(req.content, 'html.parser')
body = soup.find('body')
table = body.find('table')
tbody = table.find('tbody')


c_names = []
awardsData = []

trs = tbody.findAll('tr')
for i in range(len(trs)):
    
    if(i ==0):
        #Getting column names:
        ths = trs[i].findAll('th')
        for j in range(len(ths)):
            c_names.append(ths[j].getText().replace(u'\n', ''))
    else:
        #Data:
        tds = trs[i].findAll('td')
        awardsData.append({
            c_names[0]:tds[0].getText(),
            c_names[1]:tds[1].getText().replace(u'\xa0', ''),
            c_names[2]:int(tds[2].getText().split(' ')[0].replace(u'\n', '')),
            c_names[3]:int(tds[3].getText().split('[')[0].replace(u'\n', '')),
        })
        
academyAwards = pd.DataFrame.from_dict(sorted(awardsData, key=lambda d: d['Nominations'], reverse=True) )
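
The table cells carry extra text after the number (for example a parenthetical note or a footnote marker), which is why the code splits on `' '` and `'['`. A single hypothetical helper, `leading_int`, could cover both cases with a regular expression. The sample cell texts below are made up to resemble such cells.

```python
import re

def leading_int(cell_text):
    """Return the integer at the start of a table cell, ignoring what follows."""
    match = re.match(r"\s*(\d+)", cell_text)
    if match is None:
        raise ValueError(f"no leading integer in {cell_text!r}")
    return int(match.group(1))

# Hypothetical cell texts: a parenthetical note and a footnote marker
awards = leading_int("11 (1)\n")
nominations = leading_int("14[2]\n")
print(awards, nominations)
```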

In [362]:
display(academyAwards.head(10))

Unnamed: 0,Film,Year,Awards,Nominations
0,La La Land,2016,6,14
1,Titanic,1997,11,14
2,All About Eve,1950,6,14
3,The Shape of Water,2017,4,13
4,The Curious Case of Benjamin Button,2008,3,13
5,Chicago,2002,6,13
6,The Lord of the Rings: The Fellowship of the Ring,2001,4,13
7,Shakespeare in Love,1998,7,13
8,Forrest Gump,1994,6,13
9,Who's Afraid of Virginia Woolf?,1966,5,13


## Getting top movies details from API
OMDb provides an API that returns information about films. We want to compare what we obtained from Wikipedia with the information the API provides.

To do this, we use the `requests` module, passing the API key and the film title; the API returns a JSON response.

In [365]:
#from PIL import Image
from io import BytesIO

awardsFromApiData = []

for i in range(len(academyAwards.head(10))):
    #Pass the query through params so requests URL-encodes titles with spaces, "&" or "?"
    req = requests.get("http://www.omdbapi.com/",
                       params={"apikey": "1161cf29", "t": academyAwards.iloc[i]["Film"]})
    MovieData = req.json()
    #img = requests.get(MovieData["Poster"])
    #img = Image.open(BytesIO(img.content))
    #img.thumbnail((200, 200))
    #display(img)
    awardsFromApiData.append({
        "Title" : MovieData["Title"],
        "Director" : MovieData["Director"],
        "APIAwards" : MovieData["Awards"],
        "imdbID" : MovieData["imdbID"],
    })
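
Film titles can contain characters that have a special meaning in a URL, such as `?` or `&`. A minimal sketch of building the query string safely with the standard library follows; `omdb_url` and the placeholder key `YOUR_API_KEY` are assumptions made for this example, not part of the notebook.

```python
from urllib.parse import urlencode

def omdb_url(title, api_key="YOUR_API_KEY"):
    """Build an OMDb title-lookup URL with the title safely encoded."""
    query = urlencode({"apikey": api_key, "t": title})
    return "http://www.omdbapi.com/?" + query

# The apostrophe, spaces and question mark in the title are all encoded
url = omdb_url("Who's Afraid of Virginia Woolf?")
print(url)
```

Reading the key from an environment variable instead of writing it in the notebook would also keep it out of the repository.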

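The API reports more nominations than Wikipedia, because its `Awards` text gives total wins and nominations rather than Oscar nominations only; the number of Oscars won is the same in both sources: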

In [366]:

pd.concat([
    academyAwards.head(10),
    pd.DataFrame(data=awardsFromApiData)["APIAwards"]
], axis=1)


Unnamed: 0,Film,Year,Awards,Nominations,APIAwards
0,La La Land,2016,6,14,Won 6 Oscars. 243 wins & 297 nominations total
1,Titanic,1997,11,14,Won 11 Oscars. 125 wins & 83 nominations total
2,All About Eve,1950,6,14,Won 6 Oscars. 27 wins & 20 nominations total
3,The Shape of Water,2017,4,13,Won 4 Oscars. 138 wins & 358 nominations total
4,The Curious Case of Benjamin Button,2008,3,13,Won 3 Oscars. 83 wins & 160 nominations total
5,Chicago,2002,6,13,Won 6 Oscars. 57 wins & 129 nominations total
6,The Lord of the Rings: The Fellowship of the Ring,2001,4,13,Won 4 Oscars. 121 wins & 126 nominations total
7,Shakespeare in Love,1998,7,13,Won 7 Oscars. 64 wins & 87 nominations total
8,Forrest Gump,1994,6,13,Won 6 Oscars. 51 wins & 75 nominations total
9,Who's Afraid of Virginia Woolf?,1966,5,13,Won 5 Oscars. 22 wins & 28 nominations total


## USING MOVIES FILE

In [375]:
movies = pd.read_csv('data/movies_metadata.csv',dtype={'popularity': 'float'})
print('Columns:')
print(',  '.join(movies))

Columns:
adult,  belongs_to_collection,  budget,  genres,  homepage,  id,  imdb_id,  original_language,  original_title,  overview,  popularity,  poster_path,  production_companies,  production_countries,  release_date,  revenue,  runtime,  spoken_languages,  status,  tagline,  title,  video,  vote_average,  vote_count


In [48]:
movies.head(5)

Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,...,release_date,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count
0,False,"{""name"": ""Toy Story Collection""}",30000000,"[{""name"": ""Animation""}, {""name"": ""Comedy""}, {""...",http://toystory.disney.com/toy-story,862,tt0114709,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",...,1995-10-30,373554033,81.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,Toy Story,False,7.7,5415
1,False,,65000000,"[{""name"": ""Adventure""}, {""name"": ""Fantasy""}, {...",,8844,tt0113497,en,Jumanji,When siblings Judy and Peter discover an encha...,...,1995-12-15,262797249,104.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,Roll the dice and unleash the excitement!,Jumanji,False,6.9,2413
2,False,"{""name"": ""Grumpy Old Men Collection""}",0,"[{""name"": ""Romance""}, {""name"": ""Comedy""}]",,15602,tt0113228,en,Grumpier Old Men,A family wedding reignites the ancient feud be...,...,1995-12-22,0,101.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Still Yelling. Still Fighting. Still Ready for...,Grumpier Old Men,False,6.5,92
3,False,,16000000,"[{""name"": ""Comedy""}, {""name"": ""Drama""}, {""name...",,31357,tt0114885,en,Waiting to Exhale,"Cheated on, mistreated and stepped on, the wom...",...,1995-12-22,81452156,127.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Friends are the people who let you be yourself...,Waiting to Exhale,False,6.1,34
4,False,"{""name"": ""Father of the Bride Collection""}",0,"[{""name"": ""Comedy""}]",,11862,tt0113041,en,Father of the Bride Part II,Just when George Banks has recovered from his ...,...,1995-02-10,76578911,106.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Just When His World Is Back To Normal... He's ...,Father of the Bride Part II,False,5.7,173


Let's analyze the column belongs_to_collection:

In [47]:
print(movies.loc[:,'belongs_to_collection'][0])
print(movies.loc[:,'belongs_to_collection'][9])

{'id': 10194, 'name': 'Toy Story Collection', 'poster_path': '/7G9915LfUQ2lVfwMEEhDsn3kT4B.jpg', 'backdrop_path': '/9FBwqcd9IRruEDUrTdcaafOMKUq.jpg'}
{'id': 645, 'name': 'James Bond Collection', 'poster_path': '/HORpg5CSkmeQlAolx3bKMrKgfi.jpg', 'backdrop_path': '/6VcVl48kNKvdXOZfJPdarlUGOsk.jpg'}


The column has some values that we are not going to use: poster_path, backdrop_path and id.

In [374]:
errors = []
for i, movie in movies.iterrows():
    if(i >= 0):
        try:
            collection = cleanjsontring(movie.belongs_to_collection)

            if(movie.belongs_to_collection == "[]" or movie.belongs_to_collection == ""):
                errors.append(i)
            else:
                collectionDict = json.loads(collection)
                #Removing the columns poster_path, backdrop_path, and id
                collectionDict.pop("poster_path")
                collectionDict.pop("backdrop_path")
                collectionDict.pop("id")
                movies.loc[i,'belongs_to_collection'] = json.dumps(collectionDict)
                #movies.loc[i,'belongs_to_collection'] = str(collectionDict['name'])
        except Exception as e:
            #print(e)
            errors.append(i)

In [49]:
movies.loc[:,'belongs_to_collection'][0]

'{"name": "Toy Story Collection"}'

Let's analyze the column genres:

In [50]:
print(movies.loc[:,'genres'][0])
print(movies.loc[:,'genres'][9])

[{'id': 16, 'name': 'Animation'}, {'id': 35, 'name': 'Comedy'}, {'id': 10751, 'name': 'Family'}]
[{'id': 12, 'name': 'Adventure'}, {'id': 28, 'name': 'Action'}, {'id': 53, 'name': 'Thriller'}]


The column has some values that we are not going to use: id

In [377]:
errors = []
for i, movie in movies.iterrows():
    if(i >= 0):
        try:
            genres = cleanjsontring(movie.genres)

            if(movie.genres == "[]" or movie.genres == ""):
                errors.append(i)
            else:
                genresDict = json.loads(genres)
                for genre in genresDict:
                    #Removing the column id
                    genre.pop("id")
                    #genresList.append(genre['name'])
                movies.loc[i,'genres'] = json.dumps(genresDict)
                #movies.loc[i,'genres'] = str(genresList)
        except Exception as e:
            #print(e)
            errors.append(i)

In [52]:
print(movies.loc[:,'genres'][0])

[{"name": "Animation"}, {"name": "Comedy"}, {"name": "Family"}]


Let's analyze the column production_companies:

In [53]:
movies.loc[:,'production_companies'][1]

"[{'name': 'TriStar Pictures', 'id': 559}, {'name': 'Teitler Film', 'id': 2550}, {'name': 'Interscope Communications', 'id': 10201}]"

The column has some values that we are not going to use: id

In [54]:
errors = []
for i, movie in movies.iterrows():
    if(i >= 0):
        try:
            prod_companies = cleanjsontring(movie.production_companies)
            if(movie.production_companies == "[]" or movie.production_companies == ""):
                errors.append(i)
            else:
                prodcompDict = json.loads(prod_companies)
                for prodcomp in prodcompDict:
                    #Removing the column id
                    prodcomp.pop("id")
                movies.loc[i,'production_companies'] = json.dumps(prodcompDict)
        except Exception as e:
            #print(e)
            errors.append(i)

In [55]:
movies.loc[:,'production_companies'][1]

'[{"name": "TriStar Pictures"}, {"name": "Teitler Film"}, {"name": "Interscope Communications"}]'
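
The three cleaning loops above follow the same pattern: parse the text, remove some keys, and write the JSON back. A hypothetical helper, `strip_keys`, sketches how that shared step could be factored out. It assumes its input is already valid JSON, so in the notebook it would run after `cleanjsontring`.

```python
import json

def strip_keys(json_text, unwanted):
    """Remove the given keys from a JSON object or from every object in a JSON list."""
    data = json.loads(json_text)
    items = data if isinstance(data, list) else [data]
    for item in items:
        for key in unwanted:
            item.pop(key, None)  # a missing key is ignored instead of raising
    return json.dumps(data)

genres = strip_keys('[{"id": 16, "name": "Animation"}, {"id": 35, "name": "Comedy"}]', ["id"])
collection = strip_keys('{"id": 10194, "name": "Toy Story Collection"}', ["id", "poster_path"])
print(genres)
print(collection)
```

Using `pop(key, None)` means a row with a missing key is still cleaned rather than being counted as an error.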

Exporting the changes to a new file:

In [56]:
movies.to_csv('data/movies_cleaned01.csv', index=False)

### Reading the movies cleaned file:

In [7]:
movies = pd.read_csv('data/movies_cleaned01.csv', dtype={'popularity': 'float'})
print('Columns:')
print(',  '.join(movies))

Columns:
adult,  belongs_to_collection,  budget,  genres,  homepage,  id,  imdb_id,  original_language,  original_title,  overview,  popularity,  poster_path,  production_companies,  production_countries,  release_date,  revenue,  runtime,  spoken_languages,  status,  tagline,  title,  video,  vote_average,  vote_count


In [8]:
movies.loc[:,'production_companies'][1]

'[{"name": "TriStar Pictures"}, {"name": "Teitler Film"}, {"name": "Interscope Communications"}]'

In [9]:
movies.popularity

0        21.946943
1        17.015539
2        11.712900
3         3.859495
4         8.387519
           ...    
45458     0.072051
45459     0.178241
45460     0.903007
45461     0.003503
45462     0.163015
Name: popularity, Length: 45463, dtype: float64

In [97]:
movies.id.sort_values()

4342          2
12947         3
17            5
474           6
256          11
          ...  
45075    465044
45270    467731
21890    468343
45395    468707
20188    469172
Name: id, Length: 45463, dtype: int64

In [95]:
movies[movies.id=='6269']

Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,...,release_date,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count
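
The empty result above is probably a type mismatch rather than a missing movie: `id` has dtype `int64` in the previous output, while the filter compares it with the string `'6269'`. A toy illustration with made-up rows:

```python
import pandas as pd

# Toy dataframe with an integer id column (hypothetical titles)
toy = pd.DataFrame({"id": [862, 6269], "title": ["Toy Story", "Hypothetical title"]})

# Comparing integers with a string matches nothing
as_text = toy[toy.id == "6269"]

# Comparing with an integer finds the row
as_int = toy[toy.id == 6269]

print(len(as_text), len(as_int))
```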


## Reading actors and actresses Oscar winners

### Reading a TXT file from a URL

In [3]:
import requests
url = 'https://www.openintro.org/data/tab-delimited/oscars.txt'
req = requests.get(url, headers={"User-Agent": "Notebook"})
#Create a list with the TXT results and use the \n to separate the rows
Oscar_list=req.text.split('\n')
Oscar_data=[]
for i,row in enumerate(Oscar_list):
    if i==0:
        col_names=row.split('\t')
    else:
        Oscar_data.append(row.split('\t'))
#Creation of the dataframe
Oscars=pd.DataFrame(Oscar_data, columns=col_names)
#Removal of the last row
Oscars.drop(index=Oscars.index[-1], axis=0, inplace=True)
#Removal of unnecessary columns
Oscars.drop(columns=['oscar_no', 'birth_mo','birth_d', 'birth_y'], inplace=True)
display(Oscars)

Unnamed: 0,oscar_yr,award,name,movie,age,birth_pl,birth_date
0,1929,Best actress,Janet Gaynor,7th Heaven,22,Pennsylvania,1906-10-06
1,1930,Best actress,Mary Pickford,Coquette,37,Canada,1892-04-08
2,1931,Best actress,Norma Shearer,The Divorcee,28,Canada,1902-08-10
3,1932,Best actress,Marie Dressler,Min and Bill,63,Canada,1868-11-09
4,1933,Best actress,Helen Hayes,The Sin of Madelon Claudet,32,Washington DC,1900-10-10
...,...,...,...,...,...,...,...
179,2015,Best actor,Eddie Redmayne,The Theory of Everything,32,England,1982-01-06
180,2016,Best actor,Leonardo Di Caprio,The Revenant,41,California,1974-11-11
181,2017,Best actor,Casey Affleck,Manchester by the Sa,41,Massachusetts,1975-08-12
182,2018,Best actor,Gary Oldman,Darkest Hour,59,England,1958-03-21


List of actors/actresses that were awarded more than 1 Oscar

In [4]:
top10actors = Oscars['name'].value_counts()
top10actors = top10actors[top10actors.values>1]
top10actors

Katharine Hepburn      4
Daniel Day-Lewis       3
Sally Field            2
Olivia de Havilland    2
Meryl Streep           2
Jack Nicholson         2
Frances McDormand      2
Jane Fonda             2
Glenda Jackson         2
Hilary Swank           2
Tom Hanks              2
Sean Penn              2
Jodie Foster           2
Fredric March          2
Dustin Hoffman         2
Ingrid Bergman         2
Marlon Brando          2
Vivien Leigh           2
Luise Rainer           2
Bette Davis            2
Gary Cooper            2
Spencer Tracy          2
Name: name, dtype: int64

For which movies did these performers win their best actor/actress Oscars?

In [5]:
index=[]
for i,name in enumerate(Oscars['name']):
    if name in top10actors.index:
        index.append(i)
Oscars.iloc[index]
grouped=Oscars.iloc[index]
grouped=grouped.loc[:,['name','movie','oscar_yr','award']].groupby(['name','movie'])
grouped.first().sort_values(['name','oscar_yr'])

Unnamed: 0_level_0,Unnamed: 1_level_0,oscar_yr,award
name,movie,Unnamed: 2_level_1,Unnamed: 3_level_1
Bette Davis,Dangerous,1936,Best actress
Bette Davis,Jezebel,1939,Best actress
Daniel Day-Lewis,My Left Foot,1990,Best actor
Daniel Day-Lewis,There Will Be Blood,2008,Best actor
Daniel Day-Lewis,Lincoln,2013,Best actor
Dustin Hoffman,Kramer vs. Kramer,1980,Best actor
Dustin Hoffman,Rain Man,1989,Best actor
Frances McDormand,Fargo,1997,Best actress
Frances McDormand,Three Billboards Outside of Ebbing Missouri,2018,Best actress
Fredric March,Dr. Jekyll and Mr. Hyde,1933,Best actor
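
The index-collecting loop can be replaced by a boolean filter with `Series.isin`. A sketch on a made-up table with the same column names:

```python
import pandas as pd

# Toy winners table (hypothetical rows)
toy = pd.DataFrame({
    "name": ["A", "B", "A", "C"],
    "movie": ["M1", "M2", "M3", "M4"],
    "oscar_yr": [1990, 1995, 2008, 2010],
})

# Performers with more than one award
award_counts = toy["name"].value_counts()
multi = award_counts[award_counts > 1]

# Keep only their rows, ordered by performer and year
winners = toy[toy["name"].isin(multi.index)].sort_values(["name", "oscar_yr"])
print(winners)
```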


List of movies that won both the best actor and the best actress awards

In [6]:
movies_2_oscars=Oscars['movie'].value_counts()
movies_2_oscars=movies_2_oscars[movies_2_oscars.values==2]
movies_2_oscars

Network                     2
Coming Home                 2
It Happened One Night       2
On Golden Pond              2
The Silence of the Lambs    2
Name: movie, dtype: int64

Which actor and actress were awarded the Oscar for each of these movies:

In [7]:
index=[]
for i,movie in enumerate(Oscars['movie']):
    if movie in movies_2_oscars.index:
        index.append(i)
Oscars.iloc[index].sort_values('oscar_yr')[['oscar_yr','award','name','movie']]

Unnamed: 0,oscar_yr,award,name,movie
6,1935,Best actress,Claudette Colbert,It Happened One Night
92,1935,Best actor,Clark Gable,It Happened One Night
49,1977,Best actress,Faye Dunaway,Network
134,1977,Best actor,Peter Finch,Network
51,1979,Best actress,Jane Fonda,Coming Home
136,1979,Best actor,Jon Voight,Coming Home
54,1982,Best actress,Katharine Hepburn,On Golden Pond
139,1982,Best actor,Henry Fonda,On Golden Pond
64,1992,Best actress,Jodie Foster,The Silence of the Lambs
149,1992,Best actor,Anthony Hopkins,The Silence of the Lambs


## USING RATINGS FILE

In [11]:
ratings = pd.read_csv('data/ratings_small.csv', index_col=0)
ratings.head()

Unnamed: 0_level_0,movieId,rating,timestamp
userId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
1,31,2.5,1260759144
1,1029,3.0,1260759179
1,1061,3.0,1260759182
1,1129,2.0,1260759185
1,1172,4.0,1260759205


In [103]:
type(ratings.movieId)

pandas.core.series.Series

In [28]:
movie_rating = movies.merge(ratings, left_on='id', right_on='movieId', how='inner')
movie_rating.head(5)

Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,...,spoken_languages,status,tagline,title,video,vote_average,vote_count,movieId,rating,timestamp
0,False,,60000000,"[{""name"": ""Action""}, {""name"": ""Crime""}, {""name...",,949,tt0113277,en,Heat,"Obsessive master thief, Neil McCauley leads a ...",...,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,A Los Angeles Crime Saga,Heat,False,7.7,1886,949,3.5,1148721092
1,False,,60000000,"[{""name"": ""Action""}, {""name"": ""Crime""}, {""name...",,949,tt0113277,en,Heat,"Obsessive master thief, Neil McCauley leads a ...",...,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,A Los Angeles Crime Saga,Heat,False,7.7,1886,949,4.0,956598942
2,False,,60000000,"[{""name"": ""Action""}, {""name"": ""Crime""}, {""name...",,949,tt0113277,en,Heat,"Obsessive master thief, Neil McCauley leads a ...",...,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,A Los Angeles Crime Saga,Heat,False,7.7,1886,949,2.0,955092697
3,False,,60000000,"[{""name"": ""Action""}, {""name"": ""Crime""}, {""name...",,949,tt0113277,en,Heat,"Obsessive master thief, Neil McCauley leads a ...",...,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,A Los Angeles Crime Saga,Heat,False,7.7,1886,949,5.0,956688825
4,False,,60000000,"[{""name"": ""Action""}, {""name"": ""Crime""}, {""name...",,949,tt0113277,en,Heat,"Obsessive master thief, Neil McCauley leads a ...",...,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,A Los Angeles Crime Saga,Heat,False,7.7,1886,949,3.0,1117846575
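
Because every rating is a separate row, the merge repeats the full movie record once per rating, as the five identical *Heat* rows show. If only a summary per movie is needed, one option is to aggregate the ratings before merging. The sketch below uses toy data, and the column names `mean_rating` and `n_ratings` are introduced here for illustration.

```python
import pandas as pd

# Toy movies and ratings (hypothetical values)
movies_toy = pd.DataFrame({"id": [949, 862], "title": ["Heat", "Toy Story"]})
ratings_toy = pd.DataFrame({
    "userId": [1, 2, 3],
    "movieId": [949, 949, 862],
    "rating": [3.5, 4.0, 5.0],
})

# One row per movie: average rating and number of ratings
summary = (ratings_toy.groupby("movieId")["rating"]
           .agg(mean_rating="mean", n_ratings="count")
           .reset_index())

# Each movie now appears exactly once after the merge
movie_rating_toy = movies_toy.merge(summary, left_on="id", right_on="movieId", how="inner")
print(movie_rating_toy)
```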
