**If you lost points on the last checkpoint you can get them back by responding to TA/IA feedback**  

Update/change the relevant sections where you lost those points, make sure you respond on GitHub Issues to your TA/IA to call their attention to the changes you made here.

Please update your Timeline... no battle plan survives contact with the enemy, so make sure we understand how your plans have changed.

# COGS 108 - EDA Checkpoint

# Names

- Sadaf Jamali-Hanjani
- Emilie Joseph
- Raisa Shafi
- Janhavi Shah
- Austria Wang

# Research Question

How do the marriage and divorce rates in the United States impact the financial success of romantic comedies spanning from 1990 to 2022? Specifically, is there a discernible trend in the profitability, measured by a 20% profit margin compared to the film's original budget, of movies falling under the genre of romantic comedies, identified by keywords such as 'romance', 'love', 'fate', 'humor', 'meet cute', 'chemistry', 'heartwarming', 'feel good', 'relationships', and 'laughter' across different decades?

## Background and Prior Work

Marriage and divorce rates in the United States have long been subjects of interest, not only for sociologists and demographers but also for those in the entertainment industry. The portrayal of romantic relationships in films, particularly romantic comedies, often reflects societal norms, values, and attitudes towards love and marriage. Understanding the fluctuations in marriage and divorce rates can provide insights into how these changes might influence the themes, narratives, and ultimately, the financial success of romantic comedies.

Several studies have explored the relationship between societal trends, including marriage and divorce rates, and their impact on media content. For instance, research by Banjo (2002) suggests that media consumption, especially of romanticized content, can foster unrealistic expectations and perceptions about romantic relationships, potentially influencing individuals' attitudes and behaviors in their own relationships<sup>[1]</sup>.

Ray (2022) highlights the concept of relational schemas, which are cognitive representations of how individuals expect relationships to function. The influence of romantic movie media on the development of these schemas suggests that individuals with high exposure to such media may internalize and seek to replicate the relationship patterns depicted in films<sup>[3]</sup>. Larson (1988) adds to this, stating that when individuals consume romantic comedies, they are often presented with idealized depictions of romantic partners who effortlessly embody multiple roles and fulfill all relationship needs. This can further perpetuate unrealistic expectations within viewers' own relationships, potentially leading to dissatisfaction or disillusionment when reality fails to meet the standards set by media portrayals, and eventually to divorce<sup>[2]</sup>. However, while these studies provide valuable insights into the individual components of romantic comedies and societal trends, there is a gap in research specifically examining the intersection between marriage and divorce rates and the financial success of romantic comedies over time.

In this study, we aim to fill this gap by analyzing the profitability of romantic comedies spanning from 1990 to 2022 in relation to fluctuations in marriage and divorce rates in the United States. By identifying keywords associated with romantic comedies and using box office data to measure profitability, we seek to uncover any discernible trends or correlations between societal relationship dynamics and the commercial success of romantic comedies. This analysis will contribute to a deeper understanding of how societal changes impact media consumption patterns and the financial performance of films within the romantic comedy genre.

<sub>References:</sub>

<sub>[1] - Banjo, O. (n.d.). (PDF) the effect of media consumption on the perception of romantic relationships. https://www.researchgate.net/publication/283703576_The_Effect_of_Media_Consumption_on_the_Perception_of_Romantic_Relationships</sub>

<sub>[2] - Larson, J. H. (1988). The interplay of marriage and friendship. In W. H. Jones & D. Perlman (Eds.), Advances in personal relationships (Vol. 1, pp. 277–311). Jessica Kingsley Publishers.</sub>

<sub>[3] - Ray, C. R. (n.d.). (thesis). Romantic media exposure’s effect on relationship beliefs and expectations. </sub>

# Hypothesis


Our team's hypothesis concerns the relationship between the financial success of romantic comedies and divorce and marriage rates from 1990 to 2022. We predict that an increase in divorce rates will lower the financial success of romantic comedies, whereas a decrease in divorce rates together with an increase in marriage rates will raise their profitability. We define a romantic comedy as successful if it earns at least a 20% profit margin over the film's original budget. For our divorce and marriage data, we will use the CDC's state-level divorce and marriage rates from 1990 to 2022. Together, these data will help us understand what influence divorce and marriage have on the financial success of romantic comedies.
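The success criterion can be sketched in code. This is a minimal illustration, assuming that "20% profit margin over the original budget" means `(revenue - budget) / budget >= 0.20`; the rows, the `add_profitability` helper, and the `return_on_budget` and `is_profitable` column names are hypothetical and not part of the original notebook.

```python
import pandas as pd

# Hypothetical stand-in rows; real values would come from the movies DataFrame.
films = pd.DataFrame({
    "title": ["Film A", "Film B", "Film C"],
    "budget": [10_000_000, 20_000_000, 5_000_000],
    "revenue": [15_000_000, 22_000_000, 4_000_000],
})

# Assumed reading of "20% profit margin over the film's original budget".
PROFIT_THRESHOLD = 0.20


def add_profitability(df: pd.DataFrame, threshold: float = PROFIT_THRESHOLD) -> pd.DataFrame:
    """Return a copy with the return on budget and a boolean success flag."""
    out = df.copy()
    out["return_on_budget"] = (out["revenue"] - out["budget"]) / out["budget"]
    out["is_profitable"] = out["return_on_budget"] >= threshold
    return out


films = add_profitability(films)
print(films[["title", "return_on_budget", "is_profitable"]])
```

Under this reading, a film that merely breaks even or earns a 10% return is not counted as a success. Rows with a budget of zero would need to be handled before the division.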

# Data

## Data overview

For each dataset include the following information
- Dataset #1
  - Dataset Name: TMDB 5000 Movie Dataset
  - Link to the dataset: https://www.kaggle.com/datasets/tmdb/tmdb-movie-metadata
  - Number of observations: 1034416
  - Number of variables: 24
  
- Dataset #2
  - Dataset Name: Marriage Rate by State in the United States from 1990-2022
  - Link to the dataset: https://www.cdc.gov/nchs/data/dvs/marriage-divorce/state-marriage-rates-90-95-00-22.pdf
  - Number of observations: 51
  - Number of variables: 26
  
- Dataset #3
  - Dataset Name: Divorce Rate by State in the United States from 1990-2022
  - Link to the dataset: https://www.cdc.gov/nchs/data/dvs/marriage-divorce/state-marriage-rates-90-95-00-22.pdf
  - Number of observations: 51
  - Number of variables: 26
  
The TMDB Movies Dataset features top grossing romantic comedy movies from 1990 to 2022. We will be looking at features such as MPAA ratings, total gross profit, inflation-adjusted gross profits, tickets sold, market share, and gross profits by the year. We are also using the dataset to extract variables such as keywords, taglines, budget, revenue, release date and genre to compare to the Marriage and Divorce Datasets.

The Centers for Disease Control and Prevention (CDC) released these figures under the Change in the Reporting of Marriage and Divorce Statistics. The dataset reports divorce and marriage rates at both the national and state levels. With data from 1990 through 2022, we can break down important trends in divorce and marriage rates by state, by year, and relative to population.

We plan to compare trends in the gross profits and prevalence of popular romantic comedies from 1990 to 2022 with the USA’s marriage and divorce rate trends, to see whether the success of romantic comedies is related to the share of the population that has married or divorced. All three of our datasets will be used to look for a trend that answers our research question.
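One way the three datasets could be brought together is to aggregate films by release year and join the result to yearly marriage and divorce rates. The sketch below only illustrates the mechanics: every number is made up, and the column names (`release_year`, `is_profitable`, `marriage_rate`, `divorce_rate`) are assumptions, not columns from the actual files.

```python
import pandas as pd

# Made-up per-film flags, one row per film.
films = pd.DataFrame({
    "release_year": [2000, 2000, 2001, 2001, 2002, 2002],
    "is_profitable": [True, False, True, True, False, False],
})

# Made-up yearly rates, one row per year.
rates = pd.DataFrame({
    "year": [2000, 2001, 2002],
    "marriage_rate": [9.0, 8.5, 8.0],
    "divorce_rate": [4.0, 4.2, 4.1],
})

# Share of profitable films in each release year.
share = (
    films.groupby("release_year")["is_profitable"]
    .mean()
    .rename("share_profitable")
    .reset_index()
    .rename(columns={"release_year": "year"})
)

# Join on year, then compute pairwise Pearson correlations.
merged = share.merge(rates, on="year", how="inner")
corr = merged[["share_profitable", "marriage_rate", "divorce_rate"]].corr()
print(merged)
print(corr)
```

With only a few dozen yearly points, any correlation found this way would be suggestive at best and would not show that one trend causes the other.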

## The Movie Database

In [1]:
import pandas as pd
movies = pd.read_csv('TMDB_movie_dataset_v11.csv')
movies.head()

Unnamed: 0,budget,genres,homepage,id,keywords,original_language,original_title,overview,popularity,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,title,vote_average,vote_count
0,237000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.avatarmovie.com/,19995,"[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...",en,Avatar,"In the 22nd century, a paraplegic Marine is di...",150.437577,"[{""name"": ""Ingenious Film Partners"", ""id"": 289...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2009-12-10,2787965087,162.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,Enter the World of Pandora.,Avatar,7.2,11800
1,300000000,"[{""id"": 12, ""name"": ""Adventure""}, {""id"": 14, ""...",http://disney.go.com/disneypictures/pirates/,285,"[{""id"": 270, ""name"": ""ocean""}, {""id"": 726, ""na...",en,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...",139.082615,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}, {""...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2007-05-19,961000000,169.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"At the end of the world, the adventure begins.",Pirates of the Caribbean: At World's End,6.9,4500
2,245000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.sonypictures.com/movies/spectre/,206647,"[{""id"": 470, ""name"": ""spy""}, {""id"": 818, ""name...",en,Spectre,A cryptic message from Bond’s past sends him o...,107.376788,"[{""name"": ""Columbia Pictures"", ""id"": 5}, {""nam...","[{""iso_3166_1"": ""GB"", ""name"": ""United Kingdom""...",2015-10-26,880674609,148.0,"[{""iso_639_1"": ""fr"", ""name"": ""Fran\u00e7ais""},...",Released,A Plan No One Escapes,Spectre,6.3,4466
3,250000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 80, ""nam...",http://www.thedarkknightrises.com/,49026,"[{""id"": 849, ""name"": ""dc comics""}, {""id"": 853,...",en,The Dark Knight Rises,Following the death of District Attorney Harve...,112.31295,"[{""name"": ""Legendary Pictures"", ""id"": 923}, {""...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2012-07-16,1084939099,165.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,The Legend Ends,The Dark Knight Rises,7.6,9106
4,260000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://movies.disney.com/john-carter,49529,"[{""id"": 818, ""name"": ""based on novel""}, {""id"":...",en,John Carter,"John Carter is a war-weary, former military ca...",43.926995,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}]","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2012-03-07,284139100,132.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"Lost in our world, found in another.",John Carter,6.1,2124
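The preview above shows that `genres` and `keywords` are stored as JSON-encoded strings, so they need to be parsed before films can be matched against the keywords in the research question. The sketch below is one possible approach. The sample rows are hypothetical, the `extract_names` helper is not from the original notebook, and case-insensitive matching is an assumption.

```python
import json

import pandas as pd

# Keywords taken from the research question.
ROMCOM_KEYWORDS = {
    "romance", "love", "fate", "humor", "meet cute", "chemistry",
    "heartwarming", "feel good", "relationships", "laughter",
}


def extract_names(cell) -> list:
    """Turn '[{"id": 1, "name": "x"}, ...]' into ['x', ...] (lowercased)."""
    try:
        items = json.loads(cell)
    except (TypeError, ValueError):
        return []  # missing or malformed cell
    return [item["name"].lower() for item in items if isinstance(item, dict) and "name" in item]


# Hypothetical rows in the same layout as the dataset (ids are illustrative).
sample = pd.DataFrame({
    "title": ["Hypothetical Rom-Com", "Hypothetical Action Film"],
    "genres": [
        '[{"id": 35, "name": "Comedy"}, {"id": 10749, "name": "Romance"}]',
        '[{"id": 28, "name": "Action"}]',
    ],
    "keywords": ['[{"id": 1, "name": "love"}]', '[{"id": 2, "name": "spy"}]'],
})

sample["genre_names"] = sample["genres"].apply(extract_names)
sample["keyword_names"] = sample["keywords"].apply(extract_names)

# Two separate signals; how to combine them is left as a project decision.
sample["genre_match"] = sample["genre_names"].apply(lambda g: {"romance", "comedy"} <= set(g))
sample["keyword_match"] = sample["keyword_names"].apply(lambda k: bool(ROMCOM_KEYWORDS & set(k)))
print(sample[["title", "genre_match", "keyword_match"]])
```

Keeping the genre and keyword signals in separate columns makes it easy to check how many films each rule captures before settling on a definition of "romantic comedy".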


In [2]:
movies.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4803 entries, 0 to 4802
Data columns (total 20 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   budget                4803 non-null   int64  
 1   genres                4803 non-null   object 
 2   homepage              1712 non-null   object 
 3   id                    4803 non-null   int64  
 4   keywords              4803 non-null   object 
 5   original_language     4803 non-null   object 
 6   original_title        4803 non-null   object 
 7   overview              4800 non-null   object 
 8   popularity            4803 non-null   float64
 9   production_companies  4803 non-null   object 
 10  production_countries  4803 non-null   object 
 11  release_date          4802 non-null   object 
 12  revenue               4803 non-null   int64  
 13  runtime               4801 non-null   float64
 14  spoken_languages      4803 non-null   object 
 15  status               

In [3]:
movies.describe()

Unnamed: 0,budget,id,popularity,revenue,runtime,vote_average,vote_count
count,4803.0,4803.0,4803.0,4803.0,4801.0,4803.0,4803.0
mean,29045040.0,57165.484281,21.492301,82260640.0,106.875859,6.092172,690.217989
std,40722390.0,88694.614033,31.81665,162857100.0,22.611935,1.194612,1234.585891
min,0.0,5.0,0.0,0.0,0.0,0.0,0.0
25%,790000.0,9014.5,4.66807,0.0,94.0,5.6,54.0
50%,15000000.0,14629.0,12.921594,19170000.0,103.0,6.2,235.0
75%,40000000.0,58610.5,28.313505,92917190.0,118.0,6.8,737.0
max,380000000.0,459488.0,875.581305,2787965000.0,338.0,10.0,13752.0
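The summary above shows a minimum of 0 for both `budget` and `revenue`, and a 25th-percentile `revenue` of 0, which more plausibly indicates missing financial data than films that cost or earned nothing. A minimal sketch of one way to handle this follows, assuming zeros should be treated as missing and only films released from 1990 to 2022 are kept. The sample rows and the `clean_financials` helper are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical rows with the same column names as the movies DataFrame.
raw = pd.DataFrame({
    "title": ["Film A", "Film B", "Film C", "Film D"],
    "budget": [30_000_000, 0, 12_000_000, 8_000_000],
    "revenue": [90_000_000, 5_000_000, 0, 20_000_000],
    "release_date": ["1999-05-28", "2005-02-11", "2010-08-13", "1985-07-03"],
})


def clean_financials(df: pd.DataFrame, start_year: int = 1990, end_year: int = 2022) -> pd.DataFrame:
    """Treat zero budget/revenue as missing and keep films in the study window."""
    out = df.copy()
    # Assumption: a value of 0 means "not reported".
    out[["budget", "revenue"]] = out[["budget", "revenue"]].replace(0, np.nan)
    # Unparseable dates become NaT and are dropped below.
    out["release_year"] = pd.to_datetime(out["release_date"], errors="coerce").dt.year
    out = out.dropna(subset=["budget", "revenue", "release_year"])
    out = out[out["release_year"].between(start_year, end_year)]
    return out.astype({"release_year": int})


clean = clean_financials(raw)
print(clean)
```

It would be worth reporting how many rows this removes, since dropping films without financial data could bias the sample toward larger releases.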


In [4]:
movies.drop_duplicates(inplace=True)
# Drop columns not needed for the analysis. errors='ignore' skips names that are
# absent from the loaded file ('adult', 'backdrop_path', 'imdb_id', 'poster_path').
cols_to_drop = ['id', 'status', 'adult', 'backdrop_path', 'spoken_languages', 'production_countries', 'homepage', 'imdb_id', 'production_companies', 'original_language', 'original_title', 'popularity', 'poster_path']
movies.drop(columns=cols_to_drop, errors='ignore', inplace=True)
movies.info()

## Marriage Rate in the United States from 1990-2022

In [8]:
marriage = pd.read_csv('marriage.csv')
marriage.head()

Unnamed: 0,State,2022,2021,2020,2019,2018,2017,2016,2015,2014,...,2007,2006,2005,2004,2003,2002,2001,2000,1995,1990
0,Alabama,7.3,7.6,7.2,6.7,6.8,7.0,7.1,7.4,7.8,...,8.9,9.2,9.2,9.4,9.6,9.9,9.4,10.1,9.8,10.6
1,Alaska,6.5,6.3,5.7,6.5,6.7,6.9,7.1,7.4,7.5,...,8.5,8.2,8.2,8.5,8.1,8.3,8.1,8.9,9.0,10.2
2,Arizona,5.8,5.4,4.9,5.3,5.5,5.8,5.9,5.9,5.8,...,6.4,6.5,6.6,6.7,6.5,6.7,7.6,7.5,8.8,10.0
3,Arkansas,7.9,8.2,7.8,8.4,8.9,9.5,9.9,10.0,10.1,...,12.0,12.4,12.9,13.4,13.4,14.3,14.3,15.4,14.4,15.3
4,California,5.7,5.2,3.2,5.7,6.0,6.3,6.5,6.2,6.4,...,6.2,6.3,6.4,6.4,6.1,6.2,6.5,5.8,6.3,7.9


In [9]:
marriage.describe()

Unnamed: 0,2022,2021,2020,2019,2018,2017,2016,2015,2012,2011,2010,2009,2008,2007,2005,2004,1995,1990
count,51.0,51.0,51.0,51.0,51.0,51.0,51.0,51.0,51.0,51.0,51.0,51.0,51.0,51.0,51.0,51.0,51.0,51.0
mean,6.776471,6.645098,5.764706,6.692157,7.019608,7.309804,7.429412,7.472549,7.527451,7.615686,7.619608,7.535294,7.784314,8.182353,8.684314,8.984314,10.343137,11.588235
std,3.194282,3.187997,2.559674,3.107593,3.198188,3.389941,3.369706,3.714086,4.354771,4.595231,4.782929,5.049706,5.382801,6.242586,7.431376,8.01976,10.922678,12.674062
min,3.7,4.4,3.2,4.9,5.1,5.5,5.6,5.3,4.9,4.8,4.9,4.7,4.0,4.2,4.1,5.2,6.1,7.1
25%,5.5,5.25,4.5,5.35,5.7,5.95,6.1,6.2,5.8,5.85,5.8,5.8,5.95,6.15,6.5,6.5,7.3,8.2
50%,6.1,5.9,5.3,6.0,6.4,6.8,6.9,6.9,6.8,6.8,6.8,6.6,6.8,7.0,7.0,7.4,8.4,9.5
75%,7.2,7.05,6.0,7.05,7.1,7.25,7.4,7.4,7.35,7.65,7.4,7.4,7.8,7.8,8.25,8.45,9.9,10.65
max,25.9,26.2,21.0,25.9,26.7,28.6,28.4,31.0,35.1,36.9,38.3,40.3,42.3,48.6,57.4,62.1,85.2,99.0
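The marriage table is in wide format, with one column per year, and the summary above leaves out several year columns that appear in the preview (for example 2014, 2006, and 2000), which suggests those columns were read as text. A long format with one row per state and year is easier to join to films by release year. The sketch below uses a small excerpt of the values shown above; the `to_long` helper and the `marriage_rate` column name are hypothetical.

```python
import pandas as pd

# Excerpt in the same wide layout as the marriage table; one column is
# deliberately stored as text to mimic a column pandas failed to parse.
wide = pd.DataFrame({
    "State": ["Alabama", "Alaska"],
    "2022": [7.3, 6.5],
    "2021": ["7.6", "6.3"],
    "1990": [10.6, 10.2],
})


def to_long(df: pd.DataFrame, value_name: str) -> pd.DataFrame:
    """Reshape State x year columns into one row per (State, year)."""
    long = df.melt(id_vars="State", var_name="year", value_name=value_name)
    long["year"] = long["year"].astype(int)
    # Non-numeric entries become NaN instead of raising an error.
    long[value_name] = pd.to_numeric(long[value_name], errors="coerce")
    return long


marriage_long = to_long(wide, "marriage_rate")

# Unweighted mean across states; this is not the official national rate.
yearly = marriage_long.groupby("year", as_index=False)["marriage_rate"].mean()
print(yearly)
```

The same helper could be reused for the divorce table. Because the state average is unweighted, outliers such as the maximum of 99.0 in 1990 would pull it upward, so a median or a population-weighted rate may be a better summary.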


## Divorce Rate in the United States from 1990-2022

In [10]:
divorce = pd.read_csv('divorce.csv')
divorce.head()

Unnamed: 0,State,2022,2021,2020,2019,2018,2017,2016,2015,2014,...,2007,2006,2005,2004,2003,2002,2001,2000,1995,1990
0,Alabama,3.2,3.6,3.7,3.6,3.7,3.7,3.8,3.9,3.8,...,4.5,4.9,4.9,4.9,5.2,5.4,5.4,5.5,6,6.1
1,Alaska,3,3.1,3.2,3.6,3.7,3.6,3.9,4.1,4,...,4.3,4.2,4.3,4.3,3.9,4.6,4.3,3.9,5,5.5
2,Arizona,2.3,2.7,2.9,2.9,3,3.5,3.4,3.6,3.9,...,4,4,4.2,4.3,4.4,4.8,4,4.6,6.2,6.9
3,Arkansas,3.5,3.6,3.6,4,4.1,3.7,3.9,4.8,4.8,...,5.9,5.8,6,6.1,6,6.2,6.2,6.4,6.3,6.9
4,California,---,---,---,---,---,---,---,---,---,...,---,---,---,---,---,---,---,---,---,4.3


In [11]:
divorce.describe()

Unnamed: 0,State,2022,2021,2020,2019,2018,2017,2016,2015,2014,...,2007,2006,2005,2004,2003,2002,2001,2000,1995,1990
count,51,51.0,51.0,51,51,51,51,51,51.0,51.0,...,51,51,51,51,51.0,51.0,51.0,51.0,51,51.0
unique,51,22.0,22.0,26,22,23,21,23,20.0,25.0,...,27,29,30,28,25.0,28.0,28.0,28.0,29,31.0
top,Alabama,2.3,2.6,---,---,---,---,---,2.8,2.8,...,---,---,---,---,3.2,3.4,3.2,3.9,5,5.5
freq,1,7.0,6.0,5,5,5,5,6,6.0,5.0,...,6,6,6,5,5.0,4.0,5.0,4.0,4,6.0
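The summary above reports `count`, `unique`, `top`, and `freq` instead of means and quartiles because the `---` placeholders (as in the California row) cause the year columns to be read as text. A minimal sketch of converting them to numbers follows, using a hypothetical excerpt rather than the full file.

```python
import pandas as pd

# Hypothetical excerpt mirroring the rows shown above.
divorce_sample = pd.DataFrame({
    "State": ["Alabama", "California"],
    "2022": ["3.2", "---"],
    "1990": ["6.1", "4.3"],
})

year_cols = [c for c in divorce_sample.columns if c != "State"]

cleaned = divorce_sample.copy()
# '---' cannot be parsed as a number, so errors='coerce' turns it into NaN.
cleaned[year_cols] = cleaned[year_cols].apply(pd.to_numeric, errors="coerce")

# Count how many years are missing for each state.
missing_per_state = cleaned.set_index("State")[year_cols].isna().sum(axis=1)
print(cleaned.dtypes)
print(missing_per_state)
```

An alternative is to pass `na_values=['---']` to `pd.read_csv` when loading the file, which marks the placeholder as missing at read time. Either way, states with many missing years may need to be excluded from yearly averages.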


# Results

## Exploratory Data Analysis

Carry out whatever EDA you need to for your project.  Because every project will be different we can't really give you much of a template at this point. But please make sure you describe the what and why in text here as well as providing interpretation of results and context.

### Section 1 of EDA - please give it a better title than this

Some more words and stuff.  Remember notebooks work best if you interleave the code that generates a result with properly annotate figures and text that puts these results into context.

In [None]:
## YOUR CODE HERE
## FEEL FREE TO ADD MULTIPLE CELLS PER SECTION

### Section 2 of EDA if you need it  - please give it a better title than this

Some more words and stuff.  Remember notebooks work best if you interleave the code that generates a result with properly annotate figures and text that puts these results into context.

In [None]:
## YOUR CODE HERE
## FEEL FREE TO ADD MULTIPLE CELLS PER SECTION

# Ethics & Privacy

1. **The Question:** Our group is analyzing the relationship between romantic comedies and marriage and divorce rates in the United States from 1990 to 2022. To do so, we are using the TMDb (The Movie Database) dataset we found on Kaggle as well as the CDC's dataset on marriages and divorces. We chose to focus on movies because our group enjoys film, specifically movies tagged with the keywords 'romance', 'love', 'fate', 'humor', 'meet cute', 'chemistry', 'heartwarming', 'feel good', 'relationships', and 'laughter', which we use to identify the 'romantic comedy' genre. We will track each movie’s budget, profit, audience rating, and genre to see whether the success of romantic comedies correlates with an increase or decrease in marriages or divorces by year.

2. **The Implications:** The stakeholders for this question are the movie studios, since our analysis could serve as a tool to help them green-light or forgo certain movie genres or scripts that they are considering. One unintended consequence is that, if our final analysis finds that romantic comedies are in fact linked to more marriages, movie studios could release more or fewer of them depending on whether they want marriage rates to increase or decrease. This could be a detriment to writers, actors, and directors, as the movie studios would focus only on movies or genres that make them money rather than taking a chance on movie ideas or scripts that are original and innovative but may be less financially successful. Another implication is that creativity would be stifled as the movie studios prioritize financial gains over creative risks.

3. **The Data:** The dataset that we are using is available on Kaggle and contains over 1 million movie titles. The dataset is also updated daily, which means that there is a consistent stream of new data. There are around 24 columns available for analysis, so we will be able to compare different aspects of the success or failure of each movie. The main bias we need to be cautious about lies in the voters' rating and the total vote count of each movie. Since the dataset takes the average of the ratings given by viewers, it may be skewed, as not every viewer who watches a movie will rate it. We have to take into account that the people who actually rate movies tend to be those who are very interested in film and will take the time to rate a movie they just watched, rather than casual viewers who may not review it. The marriage and divorce datasets are broken down only by state, so we do not have detailed information on the race, age, or financial status of the individuals.

4. **Informed Consent:** Since our datasets are all available for public use and do not gather any individual’s personal information, using them poses no risk to individuals. The other empirical data regarding each movie’s budget and financial return is also publicly available, so there is very low risk in using this data to answer our question. One remaining issue is that if Kaggle or the owner of the dataset decides to remove it, we would no longer be able to use it.

5. **Privacy:** Nothing we are using from these datasets will create any risks regarding privacy as this data was gathered from public sources such as IMDB. The main privacy issue would be the voter’s column but this specific dataset is anonymized which means that there is no identifiable information on who the voter actually is.

6. **Evaluation:** We will evaluate our project by checking whether there is any pattern or correlation between highly successful romantic comedies in a specific year and an increase in marriages or divorces. If our outcome is not verifiable and we see no patterns within the genre of romantic comedies, we will draw our conclusion based on that as well.

7. **Analysis:** We will need to spend time with our dataset to make sure that we are not inferring relationships between things that do not actually correlate with each other. Taking into account financial success, audience ratings, and remakes of movies from previous decades, we will need empirical evidence of whether there is a pattern in the decisions movie studios make, and of whether romantic comedies hold enough weight to persuade or dissuade individuals from getting married.

8. **Transparency and Appeal:** Our model would not be a black box model as we have the empirical data to back up our findings. Since all of our data is publicly available we would not be able to alter the numbers such as financial gain and profit. We also have each film’s budget which we will take into account when it comes to a movie’s true profitability or not. The data is transparent in the way that the data gathered can be found online and not based on our own personal biases or inferences.

9. **Continuous Monitoring:** We will be able to update our metrics because the dataset that we are using is constantly being updated. Once we set up a good system and code to use and analyze this data, we will be able to use it until the dataset is removed from Kaggle. We will be wary of potential unintended side effects, such as movie studios using our system to analyze future trends in the popularity of romantic comedies and whether people are more or less likely to marry during a year in which the top-grossing movies were romance based.

# Team Expectations 

Our group came to the consensus that in order to be successful in the completion of our project, we must hold ourselves and each other accountable in following the expectations we put in place.

* *Team Expectation 1:* Effective Communication: Communicate when things need to be completed so that we are able to maintain consistent progress. We will communicate through messages and Discord to share our ideas.

* *Team Expectation 2:* Weekly Meetings: Show up to every weekly meeting so we can touch base on the progress of our project. We will meet every Thursday at 11 am (a time that we all agreed on) on a video call to work together and discuss our progress.

* *Team Expectation 3:* Asking Questions: Reach out and give each other the support when things begin to feel stressful. We will text any questions to the group chat for anyone to answer. We can also save any major questions that need further discussion for the video call meetings.

* *Team Expectation 4:* Meeting Deadlines: Deadlines are important so make sure each individual's assigned portion is completed on time. We must set soft deadlines for specific parts of the project and all agree to complete our respective parts by the selected date.

* *Team Expectation 5:* Checking Work: We should check each other’s work as we progress with the project to make sure we don’t have any errors. Be open to changes if necessary.

# Project Timeline Proposal


| Meeting Date  | Meeting Time| Completed Before Meeting  | Discuss at Meeting |
|---|---|---|---|
| 4/25  |  11 AM | Create a Discord group chat to communicate | Get to know each other, brainstorm different topics and vote by next meeting date | 
| 5/2  |  11 AM |  Choose topic and gather information about the topic to present | Draft project proposal | 
| 5/9  | 11 AM  |  Everyone provides datasets they have searched for prior (Genre and box office sales)  | Assign group members to their part and discuss any questions/confusions   |
| 5/16  | 11 AM  | Have data prepared and ready for the Checkpoint | Present the started work and review/edit together |
| 5/23  | 11 AM  | Prepare for the next checkpoint | Work on EDA and edit together for the checkpoint |
| 5/30  | 11 AM  | Complete EDA Checkpoint and have it ready to be reviewed | Start finalizing the final project and prepare additional information |
| 6/6  | 11 AM  | Review the Final Project | Add final touch ups before turning the project in on 6/12 |