# Boat2023 Milestone 2

In this preparatory part of the project, we agreed on our project proposal: the impact of terrorism on cinema. We performed initial analyses, including data exploration, data pre-processing and initial data visualizations, all in relation to our research questions: emotional depiction in terrorism-related movies, genre association, topic analysis and popularity.

**General data processing**  

In [1]:
#useful imports
import xml.etree.ElementTree as ET
import pandas as pd
from typing import Dict
import json
import re

#important libraries for data visualisation
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import numpy as np
from scipy import stats
import statsmodels.formula.api as smf


#important libraries for the Sentiment analysis
from scipy.signal import savgol_filter
import nltk
from nltk import tokenize
from nltk.corpus import stopwords
from nltk.tokenize import sent_tokenize
from nltk.stem import WordNetLemmatizer
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

First, we import the CMU Movie Summary Corpus: the movie metadata and the plot summaries.

In [12]:
# Import the data from movie.metadata.tsv and name its columns
column_names = [
    'Wikipedia movie ID', 'Freebase movie ID', 'Movie name', 'Release date',
    'Box office revenue', 'Runtime', 'Languages', 'Countries', 'Genres',
]
# Note: combined with names=..., header=0 treats the first line of the file as a header row
# and replaces it; pass header=None instead if the file has no header row
m_data = pd.read_csv('data/movie.metadata.tsv', delimiter='\t', on_bad_lines='skip', names=column_names, header=0)
display(m_data)

Unnamed: 0,Wikipedia movie ID,Freebase movie ID,Movie name,Release date,Box office revenue,Runtime,Languages,Countries,Genres
0,3196793,/m/08yl5d,Getting Away with Murder: The JonBenét Ramsey ...,2000-02-16,,95.0,"{""/m/02h40lc"": ""English Language""}","{""/m/09c7w0"": ""United States of America""}","{""/m/02n4kr"": ""Mystery"", ""/m/03bxz7"": ""Biograp..."
1,28463795,/m/0crgdbh,Brun bitter,1988,,83.0,"{""/m/05f_3"": ""Norwegian Language""}","{""/m/05b4w"": ""Norway""}","{""/m/0lsxr"": ""Crime Fiction"", ""/m/07s9rl0"": ""D..."
2,9363483,/m/0285_cd,White Of The Eye,1987,,110.0,"{""/m/02h40lc"": ""English Language""}","{""/m/07ssc"": ""United Kingdom""}","{""/m/01jfsb"": ""Thriller"", ""/m/0glj9q"": ""Erotic..."
3,261236,/m/01mrr1,A Woman in Flames,1983,,106.0,"{""/m/04306rv"": ""German Language""}","{""/m/0345h"": ""Germany""}","{""/m/07s9rl0"": ""Drama""}"
4,13696889,/m/03cfc81,The Gangsters,1913-05-29,,35.0,"{""/m/06ppq"": ""Silent film"", ""/m/02h40lc"": ""Eng...","{""/m/09c7w0"": ""United States of America""}","{""/m/02hmvc"": ""Short Film"", ""/m/06ppq"": ""Silen..."
...,...,...,...,...,...,...,...,...,...
81735,35228177,/m/0j7hxnt,Mermaids: The Body Found,2011-03-19,,120.0,"{""/m/02h40lc"": ""English Language""}","{""/m/09c7w0"": ""United States of America""}","{""/m/07s9rl0"": ""Drama""}"
81736,34980460,/m/0g4pl34,Knuckle,2011-01-21,,96.0,"{""/m/02h40lc"": ""English Language""}","{""/m/03rt9"": ""Ireland"", ""/m/07ssc"": ""United Ki...","{""/m/03bxz7"": ""Biographical film"", ""/m/07s9rl0..."
81737,9971909,/m/02pygw1,Another Nice Mess,1972-09-22,,66.0,"{""/m/02h40lc"": ""English Language""}","{""/m/09c7w0"": ""United States of America""}","{""/m/06nbt"": ""Satire"", ""/m/01z4y"": ""Comedy""}"
81738,913762,/m/03pcrp,The Super Dimension Fortress Macross II: Lover...,1992-05-21,,150.0,"{""/m/03_9r"": ""Japanese Language""}","{""/m/03_3d"": ""Japan""}","{""/m/06n90"": ""Science Fiction"", ""/m/0gw5n2f"": ..."
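
The effect of the `header` argument in the loading cell can be checked on a toy file. The sketch below is only an illustration under stated assumptions: the two-line TSV and its values are made up, and whether `movie.metadata.tsv` actually contains a header row should be verified against the file itself.

```python
import io
import pandas as pd

# Made-up, header-less TSV standing in for the real metadata file
raw = "1\tFirst movie\n2\tSecond movie\n"
cols = ["Wikipedia movie ID", "Movie name"]

# header=0: the first line is consumed as a header row and replaced by `names`
with_header0 = pd.read_csv(io.StringIO(raw), delimiter="\t", names=cols, header=0)

# header=None: every line is kept as data
with_header_none = pd.read_csv(io.StringIO(raw), delimiter="\t", names=cols, header=None)

print(len(with_header0))      # 1
print(len(with_header_none))  # 2
```

If the real file has no header row, `header=None` would keep its first movie in the DataFrame.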


In [6]:
# Define the path to the text file containing movie plots
file_path = 'data/plot_summaries.txt'
# import the summaries file
df_summaries = pd.read_csv(file_path, delimiter='\t', header= None)

Since our research focuses on the impact of terrorism on movies, we need to restrict the dataset to movies related to terrorism. We compiled a list of terrorism-related keywords, mainly based on information from Wikipedia, and used it to select the movies whose plot summary contains at least one of these keywords. This subset of the original dataset is the one we use for all our analyses.

In [7]:
KEYWORDS = [ "Terrorism", "Terrorist", "Terrorists", "Jihad", "Extremism", "Extremist", "Attacks", "Attack", "Bombs", "Bombing", "Bombers", 
            "Hijack", "Hijacking", "Kidnap", "Kidnapping", "Counterterrorism", "Counterterrorist", "Radicalization", "Radicalized", 
            "Security Threat", "Political Violence", "Suicide Bomber", "War on Terror", "Homeland Security", "National Security", "Intelligence Agencies", 
            "Counterinsurgency", "Terrorist Cells", "Radical Ideology", "Terrorist Plot", "Terrorist Organization", "Hostage Crisis", "Terrorism Investigation", 
            "Counterterrorist Operation", "Radical", "Guerrilla Warfare", "Insurgency", "Terror Threat", "Covert Operations", "Political Unrest", "Martyrdom", 
            "Cyberterrorism", "Terrorism Financing", "Violent Extremism", "Terrorist Recruitment", "Suicide Attacks", "Terrorist Sleeper Cells", 
            "Counterterror Measures", "Clandestine Activities", "Security Intelligence" ]

In [14]:
# Collect the IDs of all movies whose plot summary contains at least one keyword
movie_ids = set()
for index, row in df_summaries.iterrows():
    movie_id = row[0]
    plot = row[1]
    # Substring check against the lowercased keywords; the plot itself is not lowercased,
    # so only lowercase occurrences in the plot match
    for keyword in KEYWORDS:
        if keyword.lower() in plot:
            movie_ids.add(movie_id)
            break  # one match is enough for this movie
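
The substring check above also matches keywords inside longer words (for example "radical" inside "radically") and is case-sensitive. A possible alternative, shown here only as a sketch, is a case-insensitive regular expression with word boundaries. The names `sample_keywords`, `pattern` and `mentions_keyword` are hypothetical, and the keyword list is shortened for illustration.

```python
import re

# Shortened, illustrative keyword list (not the full KEYWORDS list)
sample_keywords = ["Terrorist", "Attack", "Radical"]

# One case-insensitive pattern; \b requires a word boundary on both sides of a keyword
pattern = re.compile(
    r"\b(?:" + "|".join(re.escape(k) for k in sample_keywords) + r")\b",
    flags=re.IGNORECASE,
)

def mentions_keyword(plot: str) -> bool:
    """Return True if the plot contains any keyword as a whole word."""
    return pattern.search(plot) is not None

print(mentions_keyword("Terrorist cells plan a bombing."))  # True
print(mentions_keyword("He suffers a heart attack."))       # True: a whole-word hit, though unrelated to terrorism
print(mentions_keyword("A radically new design."))          # False: "radically" is not the word "radical"
```

Whole-word matching reduces accidental hits but does not remove ambiguous ones such as "heart attack", so the resulting subset would still need a manual check.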

In [13]:
# Keep only the rows of m_data whose Wikipedia movie ID is in movie_ids
selected_columns = ['Wikipedia movie ID', 'Movie name', 'Release date', 'Countries', 'Languages', 'Genres', 'Box office revenue']
filtered_data = m_data[m_data['Wikipedia movie ID'].isin(movie_ids)][selected_columns]
# Sort by release date; rows with a missing release date are placed last
filtered_data = filtered_data.sort_values(by=['Release date'])
display(filtered_data)

Unnamed: 0,Wikipedia movie ID,Movie name,Release date,Countries,Languages,Genres,Box office revenue
42214,32986669,Robbery Under Arms,1907-11-02,"{""/m/0chghy"": ""Australia""}","{""/m/06ppq"": ""Silent film""}","{""/m/06ppq"": ""Silent film"", ""/m/07s9rl0"": ""Dra...",
64189,7870349,Dr. Jekyll and Mr. Hyde,1908-03-07,"{""/m/09c7w0"": ""United States of America""}","{""/m/06ppq"": ""Silent film""}","{""/m/02hmvc"": ""Short Film"", ""/m/06ppq"": ""Silen...",
70994,29391146,The Black Viper,1908-07-25,"{""/m/09c7w0"": ""United States of America""}","{""/m/06ppq"": ""Silent film""}","{""/m/06ppq"": ""Silent film""}",
18652,28777800,The Englishman and the Girl,1910-02-17,"{""/m/09c7w0"": ""United States of America""}","{""/m/06ppq"": ""Silent film"", ""/m/02h40lc"": ""Eng...","{""/m/02hmvc"": ""Short Film"", ""/m/06ppq"": ""Silen...",
45311,13254122,What the Daisy Said,1910-07-11,"{""/m/09c7w0"": ""United States of America""}","{""/m/06ppq"": ""Silent film"", ""/m/02h40lc"": ""Eng...","{""/m/02hmvc"": ""Short Film"", ""/m/06ppq"": ""Silen...",
...,...,...,...,...,...,...,...
81156,11971266,La Guerre des tuques,,"{""/m/0d060g"": ""Canada""}","{""/m/064_8sq"": ""French Language""}","{""/m/0hj3myq"": ""Children's/Family"", ""/m/0hj3mt...",
81303,11515305,Buio Omega,,"{""/m/03rjj"": ""Italy""}","{""/m/02bjrlw"": ""Italian Language"", ""/m/02h40lc...","{""/m/03npn"": ""Horror""}",
81312,27613497,Emperor: Young Caesar,,{},{},"{""/m/06l3bl"": ""Epic""}",
81340,27646962,Raging Sharks,,"{""/m/09c7w0"": ""United States of America"", ""/m/...","{""/m/02h40lc"": ""English Language""}","{""/m/03npn"": ""Horror""}",
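
The release dates in the metadata come in mixed formats (full dates such as `2000-02-16`, bare years such as `1988`, and missing values). For analyses over time, one option is to reduce them to a year. The following is a minimal sketch on toy values, assuming every non-missing date starts with a four-digit year; the names `dates` and `years` are illustrative.

```python
import pandas as pd

# Toy dates mirroring the formats seen in the table: full date, year only, missing
dates = pd.Series(["1907-11-02", "1988", None])

# Take the first four characters as the year; missing values stay missing
years = pd.to_numeric(dates.str[:4], errors="coerce").astype("Int64")

print(years.tolist())
```

The nullable `Int64` dtype keeps the years as integers while still allowing missing entries.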


When displaying the dataframe, we notice that the columns Countries, Languages and Genres contain a Freebase ID in addition to the readable value we are interested in (for example {"/m/0chghy": "Australia"}). To get a more readable dataframe, we parse these strings with the json module and keep only the values.

In [16]:
# Function to extract the readable values from the strings
def extract_values(text):
    try:
        # Load the text as JSON and extract values
        data_dict = json.loads(text)
        return ', '.join(data_dict.values())
    except json.JSONDecodeError:
        # If it's not valid JSON, try to find all readable parts with regex
        return ', '.join(re.findall(r'":\s*"([^"]+)"', text))

# Apply the function to clean the columns of our dataframe
filtered_data['Countries'] = filtered_data['Countries'].apply(extract_values)
filtered_data['Languages'] = filtered_data['Languages'].apply(extract_values)
filtered_data['Genres'] = filtered_data['Genres'].apply(extract_values)
display(filtered_data)

Unnamed: 0,Wikipedia movie ID,Movie name,Release date,Countries,Languages,Genres,Box office revenue
42214,32986669,Robbery Under Arms,1907-11-02,Australia,Silent film,"Silent film, Drama",
64189,7870349,Dr. Jekyll and Mr. Hyde,1908-03-07,United States of America,Silent film,"Short Film, Silent film, Horror, Indie, Black-...",
70994,29391146,The Black Viper,1908-07-25,United States of America,Silent film,Silent film,
18652,28777800,The Englishman and the Girl,1910-02-17,United States of America,"Silent film, English Language","Short Film, Silent film, Comedy",
45311,13254122,What the Daisy Said,1910-07-11,United States of America,"Silent film, English Language","Short Film, Silent film, Drama, Indie, Black-a...",
...,...,...,...,...,...,...,...
81156,11971266,La Guerre des tuques,,Canada,French Language,"Children's/Family, Animal Picture, Comedy-dram...",
81303,11515305,Buio Omega,,Italy,"Italian Language, English Language",Horror,
81312,27613497,Emperor: Young Caesar,,,,Epic,
81340,27646962,Raging Sharks,,"United States of America, Bulgaria",English Language,Horror,
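
To make the parsing step concrete, the sketch below applies `json.loads` to a single cell value in the format shown earlier. The helper `extract_list` is hypothetical: it returns a list instead of a comma-joined string, which may be convenient later when counting individual genres or countries.

```python
import json

# One cell value in the raw format (Freebase ID -> readable name)
raw = '{"/m/03rt9": "Ireland", "/m/07ssc": "United Kingdom"}'

parsed = json.loads(raw)        # dict mapping each ID to its readable name
names = list(parsed.values())
print(names)                    # ['Ireland', 'United Kingdom']
print(", ".join(names))         # Ireland, United Kingdom

def extract_list(text: str) -> list:
    """Hypothetical variant of extract_values that returns a list of names."""
    return list(json.loads(text).values())

print(extract_list("{}"))       # []: an empty dictionary yields no names
```

An empty dictionary such as `{}` produces no values, which is why some rows end up with empty Countries or Languages cells after cleaning.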


In [17]:
df_summaries.columns = ['Wikipedia movie ID', 'Plot Summary']
# Add the plot summaries to filtered_data by merging on the Wikipedia movie ID
df_terrorism_summaries = pd.merge(filtered_data, df_summaries, on='Wikipedia movie ID')
df_terrorism_summaries

Unnamed: 0,Wikipedia movie ID,Movie name,Release date,Countries,Languages,Genres,Box office revenue,Plot Summary
0,32986669,Robbery Under Arms,1907-11-02,Australia,Silent film,"Silent film, Drama",,Key scenes of the film included the branding o...
1,7870349,Dr. Jekyll and Mr. Hyde,1908-03-07,United States of America,Silent film,"Short Film, Silent film, Horror, Indie, Black-...",,Dr. Jekyll and Mr. Hyde began with the raising...
2,29391146,The Black Viper,1908-07-25,United States of America,Silent film,Silent film,,A thug accosts a girl as she leaves her workpl...
3,28777800,The Englishman and the Girl,1910-02-17,United States of America,"Silent film, English Language","Short Film, Silent film, Comedy",,A small town's drama group is preparing for a ...
4,13254122,What the Daisy Said,1910-07-11,United States of America,"Silent film, English Language","Short Film, Silent film, Drama, Indie, Black-a...",,Two farm sisters are feeling romantic and loo...
...,...,...,...,...,...,...,...,...
8621,11971266,La Guerre des tuques,,Canada,French Language,"Children's/Family, Animal Picture, Comedy-dram...",,The film involves a huge snowball fight betwee...
8622,11515305,Buio Omega,,Italy,"Italian Language, English Language",Horror,,"Anna Völkl, the fiance of taxidermist Frank Wy..."
8623,27613497,Emperor: Young Caesar,,,,Epic,,The film will attempt to adapt the first two n...
8624,27646962,Raging Sharks,,"United States of America, Bulgaria",English Language,Horror,,"In the opening, a collision between two alien ..."
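
`pd.merge` performs an inner join by default, so only movies present in both tables are kept. As a sanity check, the merge can be asked to verify that each movie ID appears at most once on each side. The sketch below uses toy DataFrames (`left`, `right` and their values are made up) and is not a statement about the actual data.

```python
import pandas as pd

# Toy stand-ins for filtered_data and df_summaries
left = pd.DataFrame({"Wikipedia movie ID": [1, 2, 3], "Movie name": ["A", "B", "C"]})
right = pd.DataFrame({"Wikipedia movie ID": [1, 2], "Plot Summary": ["plot a", "plot b"]})

# validate="one_to_one" raises pandas.errors.MergeError if an ID is duplicated on either side
merged = pd.merge(left, right, on="Wikipedia movie ID", how="inner", validate="one_to_one")

print(len(left), len(merged))  # 3 2: movie 3 has no summary and is dropped
```

Comparing the row counts before and after the merge shows whether any movie was lost or duplicated.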
