# IV.2) Topic Modeling

In this section, we explore and visualize the topics identified in the movie data. The goal is to observe how these topics evolve over time and across regions. Topic modeling was performed with the **BERTopic** library. All processing steps are explained in the notebook `topicModeling.ipynb` in the folder `src/models`. In brief, the following steps were performed:
- Data Preprocessing: The movie data was preprocessed by removing stopwords, lemmatizing, and tokenizing the text.
- Topic Modeling: The BERTopic model was trained on the preprocessed text data to identify topics.
- Topic Data Preparation: The topic data was prepared by assigning a topic to each movie and extracting the most representative words for each topic.
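
As a rough illustration of the steps above, the sketch below shows a simplified preprocessing function and the usual way a BERTopic model is fitted. It is not the code from `topicModeling.ipynb`: the stopword list is a tiny hypothetical sample, lemmatization is left out, and the names `preprocess` and `fit_topics` are chosen for illustration only.

```python
import re

# Tiny illustrative stopword list (assumption, not the list used in the notebook)
STOPWORDS = {"a", "an", "the", "and", "of", "in", "to", "is"}


def preprocess(text):
    """Lowercase, tokenize on letters, and drop stopwords (no lemmatization)."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return " ".join(t for t in tokens if t not in STOPWORDS)


def fit_topics(docs):
    """Fit a default BERTopic model; return the model and one topic id per document."""
    from bertopic import BERTopic  # imported lazily: heavy optional dependency

    topic_model = BERTopic()
    topics, _ = topic_model.fit_transform(docs)
    return topic_model, topics


print(preprocess("The detective is in the city of Tokyo"))
# -> detective city tokyo
```

Calling `fit_topics` on a list of preprocessed plot summaries would return the fitted model together with a topic id for each document. The model settings actually used are the ones documented in the notebook.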


In [9]:
# Modules to import
import sys
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import plotly.express as px
import plotly.graph_objects as go

In [10]:
df = pd.read_csv('data/cultureData/df_movies_with_topics.csv')
df['release_date'] = pd.to_datetime(df['release_date'])
df['year'] = df['release_date'].dt.year
df.head()

| | wiki_id | topic_name | release_date | countries | region | year |
|---|---|---|---|---|---|---|
| 0 | 20663735 | Love & Family | 2000-01-01 | India | South Asia | 2000 |
| 1 | 20663735 | Love & Family | 2000-01-01 | India | South Asia | 2000 |
| 2 | 20663735 | Love & Family | 2000-01-01 | India | South Asia | 2000 |
| 3 | 20663735 | Love & Family | 2000-01-01 | India | South Asia | 2000 |
| 4 | 1952976 | Crime | 2005-06-27 | United States of America | North America | 2005 |


The dataset was created by the script `prepareCultureData.py`, which is explained at the beginning of Section IV. It contains the following columns:

- **wiki_id**: Unique identifier for each movie.
- **topic_name**: The topic associated with the movie (e.g., "Martial Arts", "Middle East").
- **countries**: The countries where the movie was released.
- **region**: The region where the movie was primarily released.
- **release_date**: The release date of the movie.

The objective is to analyze the distribution and propagation of various topics across countries and regions.
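
The preview above shows the same `wiki_id` on several identical rows, so it can be worth checking the schema and the number of exact duplicate rows before counting anything. The sketch below does this on a toy frame that mirrors the five previewed rows. The names `sample`, `expected`, and `n_duplicates` are illustrative, and the check is a suggestion rather than a step taken from the original notebook.

```python
import pandas as pd

# Toy frame mirroring the five rows shown in the preview of df
sample = pd.DataFrame({
    "wiki_id": [20663735, 20663735, 20663735, 20663735, 1952976],
    "topic_name": ["Love & Family"] * 4 + ["Crime"],
    "release_date": pd.to_datetime(["2000-01-01"] * 4 + ["2005-06-27"]),
    "countries": ["India"] * 4 + ["United States of America"],
    "region": ["South Asia"] * 4 + ["North America"],
})

# Columns described in the list above
expected = {"wiki_id", "topic_name", "release_date", "countries", "region"}
missing = expected - set(sample.columns)

# duplicated() flags every repeat of an earlier identical row
n_duplicates = int(sample.duplicated().sum())

print(f"missing columns: {missing}")
print(f"exact duplicate rows: {n_duplicates}")
```

On this toy frame no column is missing and three of the five rows are flagged as duplicates. Running the same two checks on the full `df` would show how much repetition the later `nunique` and `drop_duplicates` calls have to absorb.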

In [11]:
# List of topics
df['topic_name'].unique()

array(['Love & Family', 'Crime', 'Investigation', 'Pirates',
       'Middle East', 'Civil War', 'Africa', 'Cartoons', 'Martial Arts',
       'USSR', 'Betty Boop', 'French Life', 'WWII', 'Samurai',
       'Sci-Fi Earth', 'Politics', 'Family Drama', 'Stooges',
       'Musketeers', 'Charlie Brown', 'Christmas', 'Space', 'Soldiers',
       'Laurel & Hardy', 'Boxing', 'Sports', 'Tom & Jerry', 'College',
       'Roman Empire', 'Royalty', 'Jungle', 'Tokyo Life', 'Racing',
       'Yakuza', 'Monsters', 'Fantasy', 'Disney', 'Godzilla',
       'Pink Panther', 'School Life'], dtype=object)

In [12]:
# Count number of unique wiki_id per topic
df.groupby('topic_name')['wiki_id'].nunique().sort_values(ascending=False)

topic_name
Love & Family     6137
Crime             4408
Investigation     1254
Martial Arts      1036
Family Drama       598
Sci-Fi Earth       558
Civil War          439
French Life        354
Space              324
WWII               322
Pirates            256
Soldiers           216
USSR               201
Sports             199
Monsters           185
Samurai            153
Cartoons           149
Royalty            143
Stooges            124
Christmas          123
Africa             118
Charlie Brown       98
College             88
Middle East         86
Tokyo Life          84
Tom & Jerry         83
Fantasy             81
Yakuza              76
Jungle              73
Musketeers          69
Racing              69
Laurel & Hardy      68
Godzilla            53
Betty Boop          53
Disney              51
Roman Empire        50
Politics            49
Boxing              48
School Life         40
Pink Panther        20
Name: wiki_id, dtype: int64

### IV.2.1) Mean topic propagation

We now analyze how each topic propagates across regions and countries. As an example, we select the topic 'Martial Arts' and count the countries, regions, and movies in which it appears.

In [13]:
# Select a topic
topic = 'Martial Arts'

df_topic = df[df['topic_name'] == topic]

# Number of countries where the topic is present
print(f'Number of countries where the topic {topic} is present: {df_topic["countries"].nunique()}')

# Number of regions where the topic is present
print(f'Number of regions where the topic {topic} is present: {df_topic["region"].nunique()}')

# Number of movies where the topic is present
print(f'Number of movies where the topic {topic} is present: {df_topic["wiki_id"].nunique()}')


Number of countries where the topic Martial Arts is present: 33
Number of regions where the topic Martial Arts is present: 7
Number of movies where the topic Martial Arts is present: 1036


We then compute the same statistics (number of countries, regions, and movies) for every topic.

In [14]:
topic_stats = df.groupby('topic_name').agg(
    num_countries=('countries', 'nunique'),
    num_regions=('region', 'nunique'),
    num_movies=('wiki_id', 'nunique')
).reset_index()

# Compute average statistics
avg_countries = topic_stats['num_countries'].mean()
avg_regions = topic_stats['num_regions'].mean()
avg_movies = topic_stats['num_movies'].mean()

# Statistics per topic
print("Statistics per topic:")
topic_stats

Statistics per topic:


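| | topic_name | num_countries | num_regions | num_movies |
|---|---|---|---|---|
| 0 | Africa | 30 | 7 | 118 |
| 1 | Betty Boop | 2 | 2 | 53 |
| 2 | Boxing | 9 | 6 | 48 |
| 3 | Cartoons | 10 | 4 | 149 |
| 4 | Charlie Brown | 10 | 5 | 98 |
| 5 | Christmas | 20 | 6 | 123 |
| 6 | Civil War | 20 | 8 | 439 |
| 7 | College | 13 | 6 | 88 |
| 8 | Crime | 53 | 9 | 4408 |
| 9 | Disney | 2 | 1 | 51 |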
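
To make the aggregation above easier to read, here is a small sketch on toy data showing why `nunique` is unaffected by repeated rows while a plain row count is not. The frame `toy`, its values, and the extra `num_rows` column are hypothetical and exist only for this illustration.

```python
import pandas as pd

# Toy data: movie 1 appears twice with identical values
toy = pd.DataFrame({
    "topic_name": ["Crime", "Crime", "Crime", "Samurai"],
    "wiki_id": [1, 1, 2, 3],
    "countries": ["France", "France", "Italy", "Japan"],
    "region": ["Europe", "Europe", "Europe", "East Asia"],
})

# Same named aggregation as in the cell above, plus a raw row count for comparison
stats = toy.groupby("topic_name").agg(
    num_countries=("countries", "nunique"),
    num_regions=("region", "nunique"),
    num_movies=("wiki_id", "nunique"),
    num_rows=("wiki_id", "size"),
).reset_index()

print(stats)
```

For the toy topic "Crime", `num_movies` is 2 even though `num_rows` is 3, because the repeated row for movie 1 is counted once by `nunique`.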


In [15]:
# Average statistics across all topics
print("Average statistics across all topics:")
print(f"Average number of countries per topic: {avg_countries:.2f}")
print(f"Average number of regions per topic: {avg_regions:.2f}")
print(f"Average number of movies per topic: {avg_movies:.2f}")

Average statistics across all topics:
Average number of countries per topic: 20.75
Average number of regions per topic: 5.85
Average number of movies per topic: 463.40
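
The average number of movies per topic should be read with some care, because two topics ('Love & Family' and 'Crime') are far larger than the rest. As a quick check, the sketch below recomputes the mean from the per-topic counts printed earlier in this section and compares it with the median. The list `movies_per_topic` is copied by hand from that output, and using the median as a comparison is a suggestion, not part of the original analysis.

```python
from statistics import mean, median

# Movies per topic, copied from the per-topic counts printed earlier
movies_per_topic = [
    6137, 4408, 1254, 1036, 598, 558, 439, 354, 324, 322,
    256, 216, 201, 199, 185, 153, 149, 143, 124, 123,
    118, 98, 88, 86, 84, 83, 81, 76, 73, 69,
    69, 68, 53, 53, 51, 50, 49, 48, 40, 20,
]

print(f"mean   = {mean(movies_per_topic):.2f}")    # -> mean   = 463.40
print(f"median = {median(movies_per_topic):.1f}")  # -> median = 120.5
```

The mean matches the value reported above, while the median is roughly a quarter of it, which suggests that a typical topic covers far fewer movies than the average implies.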


### IV.2.2) World map visualization

We analyze a selection of topics and map their presence across countries. This analysis will allow us to visualize how the presence of these topics varies geographically.

- **Plot**: A world map showing the presence of a specific topic across countries.

In [16]:
# Selected topics for analysis
topics_list = ['Middle East','Africa','Martial Arts','USSR','French Life', 'Samurai', 'Tokyo Life','Roman Empire']

In [17]:
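# Remove duplicates for each movie by country and topic.
# .copy() makes df_unique an independent DataFrame, so adding a column
# below does not trigger pandas' SettingWithCopyWarning.
df_unique = df.drop_duplicates(subset=['wiki_id', 'countries', 'topic_name']).copy()

# Add a column indicating presence (1)
df_unique['presence'] = 1

# Topics to display (the selection defined above)
topics = topics_list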

# Create a figure
fig = go.Figure()

# Add traces for each topic
for topic in topics:
    df_topic = df_unique[df_unique['topic_name'] == topic]
    fig.add_trace(
        go.Choropleth(
            locations=df_topic['countries'],
            locationmode='country names',
            z=df_topic['presence'],
            colorbar=dict(title='Presence'),
            colorscale='Blues',
            showscale=False,
            visible=(topic == topics[0])  # Show only the first topic initially
        )
    )

# Create buttons for topic selection
buttons = []
for i, topic in enumerate(topics):
    buttons.append(
        dict(
            label=topic,
            method='update',
            args=[{'visible': [t == topic for t in topics]},
                  {'title': f'Presence of the topic: {topic}'}]
        )
    )

# Add dropdown menu
fig.update_layout(
    updatemenus=[
        dict(
            buttons=buttons,
            direction='down',
            showactive=True,
            x=0.1,
            xanchor='center',
            y=1.15,
            yanchor='top'
        )
    ],
    title=f'Presence of the topic: {topics[0]}',
    geo=dict(
        showframe=False,
        showcoastlines=True,
        projection_type='equirectangular'
    )
)

# Show the figure
fig.show()
#fig.write_html('../../htmlSavedGraph/topic_world_map.html')

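
The map above encodes presence as a constant 1, so every country where a topic appears is shaded identically. If shading by intensity is preferred, one possible variation is to count distinct movies per country first and pass that count as `z`. The sketch below shows only the aggregation step on toy data. The helper `movies_per_country` and the column name `num_movies` are hypothetical and not part of the original notebook.

```python
import pandas as pd


def movies_per_country(df, topic):
    """Count distinct movies per country for one topic."""
    subset = df[df["topic_name"] == topic]
    return (
        subset.groupby("countries")["wiki_id"]
        .nunique()
        .rename("num_movies")
        .reset_index()
    )


# Toy data: movie 1 is repeated, and movie 4 belongs to another topic
toy = pd.DataFrame({
    "wiki_id": [1, 1, 2, 3, 4],
    "countries": ["Japan", "Japan", "Japan", "France", "Japan"],
    "topic_name": ["Samurai", "Samurai", "Samurai", "Samurai", "Crime"],
})

counts = movies_per_country(toy, "Samurai")
print(counts)
```

On the toy data this gives one movie for France and two for Japan. In the plotting loop, `counts['countries']` and `counts['num_movies']` could then replace `df_topic['countries']` and `df_topic['presence']` as the `locations` and `z` arguments of `go.Choropleth`.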
