# COGS 108 - Data Checkpoint

# Names

- Jenny Quach
- Lucy Granger
- Anvita Suresh
- Boston Thome
- Angela (Anqi) Wu

# Research Question

Which of the following variables (audience rating, genre, release date, studio, and number of nominations in other categories) best predict whether a nominated movie will win the Best Picture Award at the Oscars?



## Background and Prior Work


The Oscars are annual awards that recognize excellence in cinematic achievement across various categories. The winners are determined by the Academy of Motion Picture Arts and Sciences (AMPAS), and the awards bring considerable recognition and prestige to those who win. Best Picture is the most prestigious and sought-after award at the event.

However, the winning movie is not necessarily the one that we expect. Oftentimes, the movie with the highest audience rating does not end up winning. There have been many cases where a film was critically acclaimed but still bombed at the box office. Therefore, we wanted to see whether there is a relationship between certain attributes of a movie and its likelihood of winning at the Oscars.

This topic has been explored before. One analysis by DataRes at UCLA looked at how factors such as the director, production house, film budget, and genre relate to a movie's box office revenue. Although this article does not answer our question directly, it is helpful because it gives insight into why certain movies earn more at the box office than others. With its help, our own exploration may find a correlation between revenue and Oscar winners.<a name="cite_ref-1"></a>[<sup>1</sup>](#cite_note-1)

Another analysis, by Bloomberg, shows what it would look like if Hollywood were to engineer a best actor, best actress, and best picture. In that analysis, the engineered best picture is a drama, between 121 and 160 minutes long, released in the final months of the year by Columbia Pictures, with an average budget of 290 million at the box office. This article relates to our question because it shows which attributes its authors consider important for the Best Picture award. We can compare our findings with the article and see whether the attributes they deemed significant to ‘build an Oscar winner’ agree with the data we have collected.<a name="cite_ref-2"></a>[<sup>2</sup>](#cite_note-2)


1. <a name="cite_note-1"></a> [^](#cite_ref-1) DataRes at UCLA. (28 Mar 2022) The Success of Movies at the Oscars, Box Office, and Reviews. Medium. https://ucladatares.medium.com/the-success-of-movies-at-the-oscars-box-office-and-reviews-e33e5059ea40
2. <a name="cite_note-2"></a> [^](#cite_ref-2) Keith Collins and Adam Pearce. (18 Feb 2015) How to Build an Oscar Winner. https://www.bloomberg.com/graphics/2015-oscar-winners/


# Hypothesis


We believe that attributes of a nominee can be used to predict the Oscar Best Picture winner: specifically, the audience rating, genre, movie studio, time of release, and number of categories nominated for. We expect the number of categories nominated for to be the most influential predictor, because if the Academy deems a movie good enough to nominate in other categories, it probably also favors that movie for the Best Picture award.

# Data

## Data overview

Dataset #1:

- Name: Academy Awards Oscars: Nominees and Winners 1927 to Present
- Link: https://datahub.io/rufuspollock/oscars-nominees-and-winners#resource-oscars-nominees-and-winners_zip
- Number of Observations: 11,058
- Number of Variables: 4

Dataset #2:
- Name: Oscar Best Picture Movies
- Link: https://www.kaggle.com/datasets/martinmraz07/oscar-movies
- Number of Observations: 571
- Number of Variables: 30

Dataset #1 contains a list of all nominees and winners in every Oscar category since 1927. The important variables here are the winner/nominee distinction and the title of the movie. One of the attributes we are interested in is how many nominations each movie has. Because this dataset contains the nominees for every Oscar category, we can count the nominations for each movie and store the count in a new variable. After that, we can drop all of the nominations that are not for Best Picture.

Dataset #2 is a list of all the Best Picture winners and nominees. The variables we are interested in here are the movie genres, studio, date of release, and IMDb and Rotten Tomatoes ratings. There are many variables we do not need, such as movie length and cast, so we drop those columns.

We would merge the two datasets together by matching the movie titles to get a dataset with all five of the attributes we are interested in.
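
A minimal sketch of this merge, using small toy frames in place of the real data. The Dataset #2 column names `Film` and `Genres` are assumptions (check `df.columns`), the genre values are placeholders, and `normalize_title` is a hypothetical helper introduced only for illustration. The `indicator=True` option makes it easy to list titles that failed to match, which is likely to happen when the two sources spell a title differently.

```python
import pandas as pd


def normalize_title(title: str) -> str:
    """Hypothetical helper: lowercase and collapse whitespace so small
    formatting differences between the two sources still match."""
    return " ".join(title.lower().split())


# Toy stand-in for the Best Picture rows of Dataset #1.
nominations = pd.DataFrame({
    "entity": ["Lawrence of Arabia", "The Longest Day", "To Kill a Mockingbird"],
    "winner": [True, False, False],
    "number_of_nominations": [8, 5, 6],
})

# Toy stand-in for Dataset #2 (column names assumed, genre values are placeholders).
movies = pd.DataFrame({
    "Film": ["lawrence of arabia ", "To Kill A Mockingbird"],
    "Genres": ["placeholder-genre-1", "placeholder-genre-2"],
})

# Build a shared key from the titles, then merge on it.
nominations["title_key"] = nominations["entity"].map(normalize_title)
movies["title_key"] = movies["Film"].map(normalize_title)
merged = nominations.merge(movies, on="title_key", how="left", indicator=True)

# Titles present in Dataset #1 but missing from Dataset #2.
unmatched = merged.loc[merged["_merge"] == "left_only", "entity"].tolist()
print(unmatched)
```

A left merge keeps every nominee from Dataset #1, so unmatched titles show up as rows with missing values rather than silently disappearing.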


In [None]:
import pandas as pd
import numpy as np
import seaborn as sns

## Academy Awards Oscars: Nominees and Winners 1927 to Present

In [None]:
# Load every Oscar nominee and winner (all categories) from the datahub.io CSV
oscar_winner = pd.read_csv('https://datahub.io/rufuspollock/oscars-nominees-and-winners/r/0.csv')
oscar_winner

Unnamed: 0,year,category,winner,entity
0,1927,ACTOR,False,Richard Barthelmess
1,1927,ACTOR,True,Emil Jannings
2,1927,ACTRESS,False,Louise Dresser
3,1927,ACTRESS,True,Janet Gaynor
4,1927,ACTRESS,False,Gloria Swanson
...,...,...,...,...
11053,2017,SCIENTIFIC AND TECHNICAL AWARD (Scientific and...,True,"To LEONARD CHAPMAN for the overall concept, de..."
11054,2017,SCIENTIFIC AND TECHNICAL AWARD (Technical Achi...,True,To JASON SMITH and JEFF WHITE for the original...
11055,2017,SCIENTIFIC AND TECHNICAL AWARD (Technical Achi...,True,"To JOE MANCEWICZ, MATT DERKSEN and HANS RIJPKE..."
11056,2017,SCIENTIFIC AND TECHNICAL AWARD (Technical Achi...,True,To ALEX POWELL for his contribution to the des...


In [None]:
# Keep only nominations from 1962 onward
oscar_greater_1962 = oscar_winner[oscar_winner['year'] >= 1962]
# From those, keep only the Best Picture category
best_picture = oscar_greater_1962[oscar_greater_1962['category'] == 'BEST PICTURE']
best_picture

Unnamed: 0,year,category,winner,entity
4107,1962,BEST PICTURE,True,Lawrence of Arabia
4108,1962,BEST PICTURE,False,The Longest Day
4109,1962,BEST PICTURE,False,Meredith Willson's The Music Man
4110,1962,BEST PICTURE,False,Mutiny on the Bounty
4111,1962,BEST PICTURE,False,To Kill a Mockingbird
...,...,...,...,...
10997,2017,BEST PICTURE,False,Lady Bird
10998,2017,BEST PICTURE,False,Phantom Thread
10999,2017,BEST PICTURE,False,The Post
11000,2017,BEST PICTURE,True,The Shape of Water


In [None]:
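# For each row, count how many rows share the same year and entity.
# Note: in categories where the entity is a person (e.g. ACTOR), the count
# is attached to the person's name rather than to the film.
num_of_noms = []
oscar_greater_1962 = oscar_greater_1962.reset_index(drop=True)
for i in range(len(oscar_greater_1962)):
    row = oscar_greater_1962.iloc[i]
    year = row["year"]
    entity = row["entity"]
    # Number of rows with this (year, entity) pair, including the row itself
    num = len(oscar_greater_1962[(oscar_greater_1962["year"] == year) & (oscar_greater_1962["entity"] == entity)])
    num_of_noms.append(num)
oscar_greater_1962["number_of_nominations"] = num_of_noms
oscar_greater_1962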

Unnamed: 0,year,category,winner,entity,number_of_nominations
0,1962,ACTOR,False,Burt Lancaster,1
1,1962,ACTOR,False,Jack Lemmon,1
2,1962,ACTOR,False,Marcello Mastroianni,1
3,1962,ACTOR,False,Peter O'Toole,1
4,1962,ACTOR,True,Gregory Peck,1
...,...,...,...,...,...
7031,2017,SCIENTIFIC AND TECHNICAL AWARD (Scientific and...,True,"To LEONARD CHAPMAN for the overall concept, de...",1
7032,2017,SCIENTIFIC AND TECHNICAL AWARD (Technical Achi...,True,To JASON SMITH and JEFF WHITE for the original...,1
7033,2017,SCIENTIFIC AND TECHNICAL AWARD (Technical Achi...,True,"To JOE MANCEWICZ, MATT DERKSEN and HANS RIJPKE...",1
7034,2017,SCIENTIFIC AND TECHNICAL AWARD (Technical Achi...,True,To ALEX POWELL for his contribution to the des...,1
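
The row-by-row loop above rescans the whole table for every row. As a sketch of an alternative (not the code we ran), the same per-row count can be computed with a grouped transform, assuming no missing values in `year` or `entity`. The toy frame below uses made-up film names in the same shape as the Oscars table.

```python
import pandas as pd

# Toy rows in the same shape as the Oscars table (made-up film names).
toy = pd.DataFrame({
    "year": [1962, 1962, 1962, 1963],
    "category": ["BEST PICTURE", "DIRECTING", "BEST PICTURE", "BEST PICTURE"],
    "entity": ["Film A", "Film A", "Film B", "Film A"],
})

# Count the rows in each (year, entity) group and broadcast that count
# back onto every row of the group.
toy["number_of_nominations"] = (
    toy.groupby(["year", "entity"])["entity"].transform("size")
)
print(toy["number_of_nominations"].tolist())
```

"Film A" is counted twice in 1962 and once in 1963, because the year is part of the grouping key, just as in the loop.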


In [None]:
oscar_greater_1962[oscar_greater_1962["number_of_nominations"] > 1]

Unnamed: 0,year,category,winner,entity,number_of_nominations
20,1962,ART DIRECTION (Black-and-White),False,Days of Wine and Roses,3
21,1962,ART DIRECTION (Black-and-White),False,The Longest Day,5
24,1962,ART DIRECTION (Black-and-White),True,To Kill a Mockingbird,6
25,1962,ART DIRECTION (Color),True,Lawrence of Arabia,8
26,1962,ART DIRECTION (Color),False,Meredith Willson's The Music Man,6
...,...,...,...,...,...
7014,2017,WRITING (Adapted Screenplay),False,Mudbound,3
7016,2017,WRITING (Original Screenplay),True,Get Out,3
7017,2017,WRITING (Original Screenplay),False,Lady Bird,3
7018,2017,WRITING (Original Screenplay),False,The Shape of Water,10
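
Because `best_picture` was created before `number_of_nominations` was added, it does not carry the new column. A minimal sketch of re-applying the Best Picture filter after counting, shown on a toy frame with values taken from the output above; the name `best_picture_counted` is hypothetical.

```python
import pandas as pd

# Toy stand-in for the counted table (rows mirror the 1962 output above).
counted = pd.DataFrame({
    "year": [1962, 1962, 1962],
    "category": ["ART DIRECTION (Color)", "BEST PICTURE", "BEST PICTURE"],
    "winner": [True, True, False],
    "entity": ["Lawrence of Arabia", "Lawrence of Arabia", "The Longest Day"],
    "number_of_nominations": [8, 8, 5],
})

# Drop every nomination that is not for Best Picture, keeping the counts.
best_picture_counted = (
    counted[counted["category"] == "BEST PICTURE"].reset_index(drop=True)
)
print(best_picture_counted)
```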


## Oscar Best Picture Movies

In [None]:
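# Load the Kaggle "Oscar Best Picture Movies" CSV (downloaded locally)
df = pd.read_csv('oscars_df.csv')
# Print both, since a notebook cell only displays its last expression
print(df.columns)
print(df.shape)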

In [None]:
# Drop the columns we do not need for our analysis
df = df.drop(columns=['Unnamed: 0', 'Streaming Release Date', 'Film ID', 'Year of Release', 'Movie Genre',
        'Production Company', 'Movie Info', 'Audience Status', 'Audience Rating', 'Audience Count', 'Directors',
        'Authors', 'Actors', 'Content Rating', 'Movie Time', 'Tomatometer Status', 'Tomatometer Top Critics Count',
        'Tomatometer Fresh Critics Count', 'Tomatometer Rotten Critics Count', 'Critic Consensus'])
# Remove any row that has a missing value in the remaining columns
df = df.dropna()

In [None]:
df.head()
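
Since `dropna()` removes every row with a missing value in any remaining column, it can be useful to check how many rows it discards. A small sketch on a toy frame; the column names `IMDB Rating` and `Genres` are assumptions about Dataset #2 and the values are made up.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the trimmed Dataset #2 (assumed column names, made-up values).
toy = pd.DataFrame({
    "Film": ["Film A", "Film B", "Film C"],
    "IMDB Rating": [8.3, np.nan, 7.9],
    "Genres": ["Drama", "Drama", None],
})

# Missing values per column, and the number of rows dropna() would remove.
missing_per_column = toy.isna().sum()
rows_lost = len(toy) - len(toy.dropna())
print(missing_per_column)
print(rows_lost)
```

If many rows are lost because of a single sparse column, dropping that column instead of the rows may preserve more nominees.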

# Ethics & Privacy

When considering ethical and privacy concerns within our project, we first look at the data itself. We are using Oscar nominations and winners, which are chosen by the Academy of Motion Picture Arts and Sciences (AMPAS). This group is not an accurate representation of the general population of people who view movies. Since the members of AMPAS are likely not diverse in terms of gender, race, culture, fame, socioeconomic status, and so forth, their choices of movies are likely not representative of the population either. This biases the data toward the perspectives of the members of AMPAS. On account of this, our results are not generalizable outside of the scope of the Oscars.

Similarly, movies that are nominated for an Oscar likely come from well-known studios, directors, producers, or some other name already established in the world of film. Movies from independent or new studios are less likely to be included in the Oscars at all. This is a bias in the Oscar nominations themselves, but it is worth noting that our project is therefore not generalizable to all movies, only to those within the Oscar nominations.

We would also like to consider how our project could affect potential stakeholders. Since we are using studio, producer, and director names, they could be negatively impacted if our data shows a correlation between those names and not winning at the Oscars. We will write any conclusions with these potential impacts in mind. Additionally, our data does not contain any personal information about people except for their names as they appeared in their Oscar nominations. Our project and data also have the potential to be misconstrued if someone believes the factors we are analyzing are causes of the Oscar outcomes. They are not necessarily causal, and we will make this clear throughout our project.

# Team Expectations


We expect each other to be present at the weekly meetings and to contribute ideas regularly. We also want everyone to be responsive to group messages and to reply whenever they are able to. If we assign individual tasks, everyone should complete theirs before the deadline; anyone who cannot complete an assigned task should let the rest of the team know. Overall, we expect everyone to communicate well and to stay involved in all aspects of the project.

Signed by:
- Lucy Granger
- Jenny Quach
- Anvita Suresh
- Boston Thome
- Angela Wu


# Project Timeline Proposal



| Meeting Date  | Meeting Time| Completed Before Meeting  | Discuss at Meeting |
|---|---|---|---|
| 1/20  |  1 PM | Read & Think about COGS 108 expectations; brainstorm topics/questions  | Determine best form of communication; Discuss and decide on final project topic; discuss hypothesis; begin background research |
| 1/26  |  10 AM |  Do background research on topic | Discuss ideal dataset(s) and ethics; draft project proposal |
| 2/1  | 10 AM  | Edit, finalize, and submit proposal; Search for datasets  | Discuss Wrangling and possible analytical approaches; Assign group members to lead each specific part   |
| 2/14  | 6 PM  | Import & Wrangle Data (Ant Man); EDA (Hulk) | Review/Edit wrangling/EDA; Discuss Analysis Plan   |
| 2/23  | 12 PM  | Finalize wrangling/EDA; Begin Analysis (Iron Man; Thor) | Discuss/edit Analysis; Complete project check-in |
| 3/13  | 12 PM  | Complete analysis; Draft results/conclusion/discussion (Wasp)| Discuss/edit full project |
| 3/19  | Before 11:59 PM  | NA | Turn in Final Project & Group Project Surveys |