# ADA: Milestone 3 notebook
---

## Globywood: the story of movie industries across time.

Cinema made its debut in the late 19th century [1] and has since become a multibillion-dollar industry spread across many countries, each shaped by its own culture and trends.
Using the CMU Movie Summary Corpus, a collection of 42,306 movie plot summaries extracted from Wikipedia and aligned with movie and character metadata, we take a trip through time to examine these trends and how they have (or have not) differed across the world.
We will analyze movies from some of the largest film industries - namely the United States, India, United Kingdom, Japan, and France [2] - from the 1950s to the 2010s.
More specifically, we will look at the data from two perspectives: the movies themselves and the actors who play in them.
This will give us insight into whether the industries are converging due to globalization or whether each one is keeping its own identity.

### Libraries used for this project:

In [1]:
# Local python scripts
import sys
sys.path.append('../src')
import load
import clean
import aggregate
import plot
import features

# Utilities
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# TODO: Add libraries here (ONLY if they are used) and put a header (such as NLP) describing what you use it for

## Loading the data

Our analysis will look at differences between film industries from two perspectives, namely the movies themselves and the actors who play in them.
Regarding movies, we will use the `movie.metadata.tsv` dataset for metadata, and `plot_summaries.txt` for the summaries.
Regarding actors we will use the `character.metadata.tsv` dataset.

Our loading functions allow us to read each of these files:

In [3]:
movies_raw = load.movie_metadata()
summaries_raw = load.plot_summaries()
characters_raw = load.character_metadata()
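
The `load` module itself is not shown in this notebook. As a rough sketch only, assuming the summaries file is tab-separated with no header row and one `(movie ID, summary)` pair per line, a loader could look like the following. The function name `read_plot_summaries`, the column names, and the sample rows are hypothetical and are not taken from our `load` script.

```python
import csv
import io

import pandas as pd


def read_plot_summaries(path_or_buffer):
    """Read a headerless, tab-separated file of (movie ID, summary) pairs."""
    return pd.read_csv(
        path_or_buffer,
        sep="\t",
        header=None,
        names=["Wikipedia_Movie_ID", "Summary"],  # hypothetical column names
        quoting=csv.QUOTE_NONE,  # keep quote characters inside summaries as-is
    )


# Tiny in-memory stand-in for plot_summaries.txt (made-up rows)
sample = io.StringIO(
    '1\tA crew lands on Mars.\n'
    '2\tTwo friends open a "quiet" cafe.\n'
)
summaries_demo = read_plot_summaries(sample)
print(summaries_demo)
```

Disabling quote handling is a design choice for free text: a stray `"` inside a summary would otherwise be interpreted as the start of a quoted field.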

**Note:** the original movie metadata file stores the columns `Movie_Countries`, `Movie_Genres`, and `Movie_Languages` as dictionaries in the format `{freebase_id:value}`. Our loading function parses these dictionaries and creates a separate row for each value, so that each cell holds a single value:

In [4]:
movies_raw[['Movie_Name', 'Movie_Countries', 'Movie_Languages', 'Movie_Genres']].head(3)

|   | Movie_Name | Movie_Countries | Movie_Languages | Movie_Genres |
|---|---|---|---|---|
| 0 | Ghosts of Mars | United States of America | English Language | Thriller |
| 0 | Ghosts of Mars | United States of America | English Language | Science Fiction |
| 0 | Ghosts of Mars | United States of America | English Language | Horror |
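
To illustrate the parsing step described in the note, here is a minimal sketch of how dictionary-valued columns can be turned into one row per value with pandas. It assumes the raw cells are JSON strings; the helper `explode_dict_columns` and the Freebase IDs are placeholders for illustration, not the actual code in our `load` script.

```python
import json

import pandas as pd

# Made-up row mimicking the raw {freebase_id: value} format (IDs are placeholders)
raw = pd.DataFrame({
    "Movie_Name": ["Ghosts of Mars"],
    "Movie_Countries": ['{"/m/id_a": "United States of America"}'],
    "Movie_Genres": [
        '{"/m/id_b": "Thriller", "/m/id_c": "Science Fiction", "/m/id_d": "Horror"}'
    ],
})


def explode_dict_columns(df, columns):
    """Replace each JSON dictionary by its values and emit one row per value."""
    out = df.copy()
    for col in columns:
        # Parse the JSON string and keep only the human-readable values
        out[col] = out[col].apply(lambda s: list(json.loads(s).values()))
        # One row per list element; the original index label is repeated
        out = out.explode(col)
    return out


flat = explode_dict_columns(raw, ["Movie_Countries", "Movie_Genres"])
print(flat)
```

Because `explode` repeats the original index label, a movie with three genres occupies three rows that share the same index, as in the output above. Any per-movie count on such a frame should therefore deduplicate on a movie identifier first.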


We now have three dataframes to work with for our analysis: the movie metadata, the plot summaries, and the character metadata.

## References

[1] S. Pruitt, “The Lumière Brothers, Pioneers of Cinema,” HISTORY. https://www.history.com/news/the-lumiere-brothers-pioneers-of-cinema (accessed Nov. 17, 2022).

[2] “Global Box Office Down 72%, Digital Leads Home Entertainment in 2020,” Boxoffice, Mar. 26, 2021. https://www.boxofficepro.com/global-box-office-down-72-digital-leads-home-entertainment-in-2020/ (accessed Nov. 16, 2022).