# Datamigos - The Iron Curtain Dirty Movie Secret

During the Cold War, both the Western and Eastern blocs relied on soft power to spread ideological messages and influence global opinion, with cinema emerging as a key medium for this influence. We are interested in the way cinema served as a subtle but potent tool for political messaging during this period.

In this notebook, we build a dataset suited to exploring this subject and examine the question from three perspectives:
1. How international power dynamics and influence are reflected in film production.
2. How the themes and narratives of films were crafted to convey and promote specific ideologies.
3. How film characters were used, either as tools to manipulate populations or as mirrors of society.

**Table of contents:**

- [Dataset Merging and Cleaning](#dataset-merging-and-cleaning)
- [Data Enhancement with GPT-4o mini and cleaning](#data-enhancement-with-gpt-4o-mini-and-cleaning)
- [A first glance at the data](#a-first-glance-at-the-data)
- [The reflection of international dynamics and influences in film production]()
- [Thematic and narrative strategies in movies to promote ideologies]()
- [What about movies characters?]()
- [Conclusion]() 

In [1]:
%load_ext autoreload
%autoreload 2

import pandas as pd

from src.dataset_creation.cmu_merging_cleaning import *
from src.dataset_creation.datasets_merging import *
from src.dataset_creation.gpt_4o_data_enhancement import *
from src.dataset_creation.merged_dataset_preprocessing import *

from src.utils.constants import *
from src.utils.helpers import *

from src.analysis.intro import *
from ipyleaflet import Map, TileLayer, GeoJSON, Popup, WidgetControl
from ipywidgets import HTML


## Dataset Merging and Cleaning

The first step was to build a suitable dataset for our project. The CMU Movie Summary Corpus is quite "America-centric", which is problematic for our study. To address this, we extended it with the IMDb Non-Commercial Dataset and TMDb to obtain a more balanced representation, with more movies from other countries such as the Soviet Union. Here is a brief description of the datasets:

- [CMU Movie Summary Corpus](https://www.cs.cmu.edu/~ark/personas/):
    - **Movies Dataset** (`movie.metadata.tsv`): Includes comprehensive information about movies. It features data such as Wikipedia and Freebase IDs, movie names, release dates, box office revenues, runtimes, languages, countries, and genres.
    - **Plot Summaries Dataset** (`plot_summaries.text`): Offers a concise summary of movie plots, linked to movies through Wikipedia movie IDs.

- [The Movies Dataset (TMDb)](https://www.kaggle.com/datasets/rounakbanik/the-movies-dataset) contains metadata for 45,000 films listed in the Full MovieLens Dataset, all of which were released on or before July 2017. We use this dataset, specifically the `movies_metadata.csv` file, primarily to enrich the `movie.metadata.tsv` file of the CMU movie dataset. This dataset also includes plot keywords (in the `keywords.csv` file), which are helpful for analyzing movie plots.

- [IMDb Non-Commercial Dataset](https://developer.imdb.com/non-commercial-datasets/) is a very large database containing over 50 million movies from different regions. Because the CMU Movie Summary Corpus is largely focused on the United States, this dataset allows us to complement our initial data, specifically with movies made in the Soviet Union during the Cold War.

In [2]:
# Load, merge and do some cleaning with our three datasets
merged_dataset = create_merged_dataset()
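
As a rough illustration of the kind of merge described above, here is a minimal sketch using toy data. It is not the actual implementation of `create_merged_dataset()`: the column names (`title`, `release_year`, `countries`), the helper `merge_sources`, and the choice of matching on a normalized title plus release year are all assumptions made for this example.

```python
import pandas as pd

# Toy stand-ins for two sources; the column names are illustrative assumptions.
cmu = pd.DataFrame({
    "title": ["Dr. Strangelove", "Rocky IV"],
    "release_year": [1964, 1985],
    "countries": ["United States of America", "United States of America"],
})
imdb = pd.DataFrame({
    "title": ["rocky iv ", "The Cranes Are Flying"],
    "release_year": [1985, 1957],
    "countries": ["United States of America", "Russia"],
})


def merge_sources(left: pd.DataFrame, right: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: outer-merge two sources on (normalized title, year)."""
    left = left.assign(title_key=left["title"].str.lower().str.strip())
    right = right.assign(title_key=right["title"].str.lower().str.strip())
    merged = left.merge(
        right,
        on=["title_key", "release_year"],
        how="outer",  # keep films that appear in only one source
        suffixes=("_cmu", "_imdb"),
    )
    # Prefer the CMU value and fall back to the IMDb value when it is missing.
    for col in ("title", "countries"):
        merged[col] = merged[f"{col}_cmu"].combine_first(merged[f"{col}_imdb"])
    return merged[["title", "release_year", "countries"]]


merged = merge_sources(cmu, imdb)
print(merged)
```

The outer join is what lets films missing from the CMU corpus (here, the Soviet title) enter the merged table, while films present in both sources are kept only once.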

## Data Enhancement with GPT-4o mini and cleaning

We used the GPT-4o mini model to generate 6 new columns for our dataset. The columns are:
- `cold_war_side` - Which side of the Cold War the country belongs to, if any
- `character_western_bloc_representation` - How the character representing the Western bloc is portrayed
- `character_eastern_bloc_representation` - How the character representing the Eastern bloc is portrayed
- `eastern_bloc_values` - The values represented by the Eastern bloc
- `western_bloc_values` - The values represented by the Western bloc
- `theme` - The theme and keywords of the movie

This will be very useful later on to grasp the context of the movies and to understand the values and themes they represent.  

This is the prompt we used to generate the new columns:

> You are an expert in movie history and Cold War. You will be given the name of the film, the year and the plot of the movie. You first need to analyse if the movie can be identified to the Eastern or Western bloc during the Cold War. If yes come up with the character or group of character impersonating the Western and Eastern bloc and their values as well as their main archetye.
> Your output needs to be parsable comma separated without context (the output needs to start directly), use only keyword and very important use new line character after each of the following:
> - Cold War side belonging either Easter, Western or None.
> - The character or group of character representing Western bloc with their values and archetype comma separated or None.
> - The character or group of character representing Eastern bloc with their values and archetype comma separated or None.
> - The Western bloc representation main values and characteristics comma separated or None.
> - The Eatern bloc representation main values and characteristics comma separated or None.
> - The theme of the movies and keywords.
> If and only if the movie does not belong to any of the blocs, put `None` in the respective fields.



We then need to parse and clean the GPT output before we can use it in our dataset.

In [3]:
merged_dataset_enhanced = create_enhanced_dataset()
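
To give an idea of what the parsing step involves, here is a minimal sketch that turns one raw model response into the six columns. It is not the code behind `create_enhanced_dataset()`: the function `parse_gpt_response`, the field order (taken from the order of the bullet points in the prompt), the tolerance for the "Easter" spelling, and the example response are all assumptions made for illustration.

```python
from typing import Dict, Optional

# Field order follows the order of the bullet points in the prompt (assumption).
COLUMNS = [
    "cold_war_side",
    "character_western_bloc_representation",
    "character_eastern_bloc_representation",
    "western_bloc_values",
    "eastern_bloc_values",
    "theme",
]


def parse_gpt_response(raw: str) -> Optional[Dict[str, Optional[str]]]:
    """Hypothetical parser: one non-empty line per field, 'None' mapped to None."""
    # Drop empty lines and any leading bullet dash the model may have echoed.
    lines = [line.strip().lstrip("-").strip() for line in raw.splitlines()]
    lines = [line for line in lines if line]
    if len(lines) != len(COLUMNS):
        return None  # malformed response: leave it for manual inspection

    parsed: Dict[str, Optional[str]] = {}
    for column, value in zip(COLUMNS, lines):
        parsed[column] = None if value.lower() == "none" else value

    # Normalize the side label; the prompt spells it "Easter", so accept both.
    side = parsed["cold_war_side"]
    if side is not None:
        lowered = side.lower()
        if lowered.startswith("east"):
            parsed["cold_war_side"] = "Eastern"
        elif lowered.startswith("west"):
            parsed["cold_war_side"] = "Western"
        else:
            parsed["cold_war_side"] = None
    return parsed


# Made-up response, only to show the expected shape of the input.
example = (
    "Western\n"
    "Secret agent, loyalty, hero\n"
    "Soviet general, ruthlessness, villain\n"
    "freedom, individualism\n"
    "control, secrecy\n"
    "espionage, nuclear threat"
)
print(parse_gpt_response(example))
```

Returning `None` for responses that do not have exactly six fields keeps malformed outputs out of the dataset instead of silently shifting values into the wrong columns.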

We now perform a final round of cleaning and processing, and drop unused columns, to prepare the dataset for analysis.

In [4]:
preprocess_movies = create_preprocessed_movies()

After data augmentation and pre-processing, we ended up with a dataset containing 25,621 films released between 1945 and 1995, of which 2,919 are classified as Eastern side and 3,142 as Western side. In total, we identified 6,061 movies that could be associated with one of the blocs. They represent around 24% of our dataset, a non-negligible proportion, with a good balance between the Western and Eastern blocs.
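
The proportions can be recomputed directly from the counts quoted above. The snippet below is only a quick arithmetic check; the variable names are ours and the numbers are copied from the text rather than read from the dataset.

```python
# Counts quoted in the text above.
total_films = 25621
eastern_films = 2919
western_films = 3142

# Films aligned with either bloc, and their share of the whole dataset.
aligned_films = eastern_films + western_films
aligned_share = aligned_films / total_films

# Balance between the two blocs among the aligned films.
eastern_share = eastern_films / aligned_films
western_share = western_films / aligned_films

print(f"Aligned films: {aligned_films}")
print(f"Share of the dataset: {aligned_share:.1%}")
print(f"Eastern / Western split: {eastern_share:.1%} / {western_share:.1%}")
```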

In [5]:
plot_movies_distrib(preprocess_movies)

## A first glance at the data

We are now equipped to start our analysis. We begin by looking in more detail at the number of movies produced by each country and their ideological alignment.

In [6]:
country_counts_df = compute_side_movie_count_per_country(preprocess_movies)

# Display result
print(country_counts_df.head())

                    Country  Occurrences  Western  Eastern  None
0                    Russia        10853      720     2324  7809
1                     Italy         1374      139       71  1164
2                   Germany          645       68       42   535
3  United States of America         8239     1819      140  6280
4                   Estonia           72        7       13    52
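
A table of this shape can be produced with a cross-tabulation. The sketch below runs on toy data and is not the code of `compute_side_movie_count_per_country()`: the helper name `count_sides_per_country`, the assumption that `countries` holds a list per film, and the assumption that unaligned films carry the string `"None"` are all ours.

```python
import pandas as pd

# Toy data; a film can list several production countries (assumption).
movies = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "countries": [
        ["Russia"],
        ["United States of America"],
        ["Italy", "Russia"],
        ["Russia"],
    ],
    "cold_war_side": ["Eastern", "Western", "None", "None"],
})


def count_sides_per_country(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: films per country, split by Cold War side."""
    # One row per (film, country) pair, so co-productions count for each country.
    exploded = df.explode("countries")
    counts = pd.crosstab(exploded["countries"], exploded["cold_war_side"])
    # Make sure all three side columns exist, in a fixed order.
    counts = counts.reindex(columns=["Western", "Eastern", "None"], fill_value=0)
    counts.insert(0, "Occurrences", counts.sum(axis=1))
    counts.index.name = "Country"
    counts.columns.name = None
    return counts.reset_index()


country_counts = count_sides_per_country(movies)
print(country_counts)
```

With this approach a co-production is counted once for each of its countries, so the per-country totals can add up to more than the number of films in the dataset.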


The interactive map below shows the number of movies released per country.

In [7]:
display_map_film_nb(country_counts_df)

[Interactive map output: number of movies released per country]

The United States and Russia stand out as the main film producers during this period, with 8,239 and 10,853 films respectively. This illustrates their global dominance and soft power. There are also a large number of films from European countries such as France, the United Kingdom, Italy and Germany. These countries play a relevant role in our study and could prove essential in the analysis of international collaboration in the film industry.

We also examined the distribution of films over the years of the Cold War to identify trends that could provide insights for our analysis.

In [8]:
plot_movies_distrib_by_year(preprocess_movies)

Throughout the Cold War, the number of films produced rose steadily. This suggests that cinema played an increasingly important role in the ideological conflict.

These were the results for all the films in our dataset. If we focus exclusively on the films with a clear alignment toward one of the Cold War blocs, we observe the following distribution: