<a href="https://colab.research.google.com/github/carloscailao/CSMODEL/blob/main/notebook.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 📄 Abstract

This notebook is prepared in partial fulfillment of the **Statistical Modeling and Simulation (CSMODEL)** course under the **College of Computer Studies (CCS)** of **De La Salle University – Manila**.

The research team, **Statisteros Modeleros**, is composed of three undergraduate students from the **Bachelor of Science in Computer Science** program. This project investigates real-world patterns in the **film industry from 2004 to 2025** in relation to the **rise of social media platforms** during the same period.

Motivated by the popular belief that global attention spans are shrinking due to the emergence of short-form content such as **TikToks, Instagram Reels, and YouTube Shorts**, this study explores whether the **movie industry has responded or adapted** to these shifting media consumption habits.

### 👥 Researchers

<table>
<tr>
<td>

**Carlos Luis B. Cailao**  
_2nd Year, BS Computer Science – Software Technology_  
📧 carlos_cailao@dlsu.edu.ph  

**Andre Gabriel D. Llanes**  
_2nd Year, BS Computer Science – Software Technology_  
📧 andre_llanes@dlsu.edu.ph  

**Sophia Pauline V. Sena**  
_2nd Year, BS Computer Science – Network and Information Security_  
📧 sophia_sena@dlsu.edu.ph  

</td>
<td align="right" style="text-align: right; vertical-align: top;">

<strong>Statisteros Modeleros</strong><br>  
<img src="https://raw.githubusercontent.com/carloscailao/CSMODEL/main/assets/StatisterosModeleros_Logo.png" alt="Statisteros Modeleros Logo" width="150"><br>  
<sub><i>Logo generated using ChatGPT</i></sub>

</td>
</tr>
</table>

---

# 📊 About the Datasets

## 🎬 1. Movies Dataset – The Movies Database (TMDB)

- The **TMDB Movie Database** is a comprehensive dataset containing key information about films such as:
  - 🎞️ ID
  - 🎬 Title
  - ⭐ Average Vote
  - 🗳️ Vote Count
  - 📅 Release Date
  - ⏱️ Runtime
  - 💰 Revenue
  - 📌 Status
  - …and other attributes

- **Source:** [Kaggle – TMDB Movies Dataset (930K+ Movies)](https://www.kaggle.com/datasets/asaniczka/tmdb-movies-dataset-2023-930k-movies?fbclid=IwY2xjawK7Og1leHRuA2FlbQIxMQABHlArG7Tyo1mSlbbwmYbOZ068LWVINqSO5yVICdjNNztmY--1a57isPs8cuts_aem_nJypZ3sxgaGXGtVjc-kKLA)

- **Filtering Criteria:**
  - The original dataset contained around **1 million movie entries** (500 MB file size).
  - To ensure **relevance and engagement**, only movies with **1,000+ vote counts** were retained.
  - This filter reduced the dataset to **3,940 movies** (3 MB file size).
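
The vote-count filter described above can be sketched as follows. This is a minimal illustration, not the exact script the team ran: the helper name `filter_by_vote_count` and the sample rows are hypothetical, and the threshold is assumed to be inclusive (1,000 or more votes).

```python
import pandas as pd

def filter_by_vote_count(frame: pd.DataFrame, min_votes: int = 1000) -> pd.DataFrame:
    """Keep only rows whose vote_count is at least min_votes."""
    kept = frame[frame["vote_count"] >= min_votes]
    return kept.reset_index(drop=True)

# Tiny stand-in for the raw file (titles and counts are made up)
sample = pd.DataFrame({
    "title": ["Film A", "Film B", "Film C"],
    "vote_count": [34495, 12, 1000],
})

filtered = filter_by_vote_count(sample)
print(filtered)
```

For a file of roughly 500 MB, one option is to pass `chunksize` to `pd.read_csv`, apply the same filter to each chunk, and combine the results with `pd.concat`.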

## 📱 2. Social Media Dataset – Manually Constructed from Statistics and Data Org

- This dataset was manually compiled by the researchers based on an interactive chart from **Statistics and Data** showing:
  - The **Global Top 10 Social Media Platforms** by **Monthly Active Users (MAU)**  
  - Timeline: **February 2004 to January 2025**
  - Entries: **2425** (across 252 months)

- Since no downloadable dataset was provided, the group created their own CSV file containing:
  - 📆 Year and Month  
  - 🏷️ Platform Name  
  - 👥 Monthly Active Users  
  - 🏅 Rank per Month

- **Source Website:** [Statistics & Data – Most Popular Social Media (2004–2025)](https://statisticsanddata.org/data/most-popular-social-media-2004-2025/?fbclid=IwY2xjawK7OolleHRuA2FlbQIxMQABHlLb727mru341vsUF6i4K2_suxcTzDdB0tJDLk_xLoc0Ifb4_PRylDYjOiNK_aem_Na3LzkWUyR_4eRbxyhf7-w#google_vignette)

- The group **attempted to reach out** to the site’s administrators via **email, Messenger, and Instagram**, but received no response as of this notebook's writing.

## 📦 Import the datasets here:

In [1]:
import pandas as pd

# URLs of the dataset copies (as of 06/15/2025) hosted on GitHub
moviesDatasetRaw = "https://raw.githubusercontent.com/carloscailao/CSMODEL/main/data/Movies_Dataset_TMDB.csv"
socialMediaDatasetRaw = "https://raw.githubusercontent.com/carloscailao/CSMODEL/main/data/Social_Media_Dataset_Top_10_Social_Media_2004-2025.csv"

# Creation of dataframes from CSV files of datasets
moviesDf = pd.read_csv(moviesDatasetRaw)
socialMediaDf = pd.read_csv(socialMediaDatasetRaw)

---

# 🛠️ Data Preprocessing
- This section covers the cleaning, type conversion, and other preparation applied to the Movies and Social Media datasets before Exploratory Data Analysis.
- It also details how *socialMediaDf* and *moviesDf* become ***cleanedSocialMediaDf*** and ***cleanedMoviesDf***, respectively.

## 📱 Social Media Dataset Preprocessing
- Because this CSV was entered by hand, the researchers must catch manual-entry errors and clean the dataframe's data types and input formatting.

Info on the **original, uncleaned** dataset
- Here, year_month and mau are both of data type **object**. We would prefer year_month to be dates and mau to be integers, which makes plotting and modeling easier.

In [2]:
socialMediaDf.info()
socialMediaDf.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2425 entries, 0 to 2424
Data columns (total 4 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   year_month  2425 non-null   object
 1   platform    2425 non-null   object
 2   mau         2425 non-null   object
 3   rank        2425 non-null   int64 
dtypes: int64(1), object(3)
memory usage: 75.9+ KB


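|   | year_month | platform | mau | rank |
|---|------------|----------|-----|------|
| 0 | 2004_02 | Friendster | 4500000 | 1 |
| 1 | 2004_02 | Orkut | 500000 | 2 |
| 2 | 2004_02 | MySpace | 370000 | 3 |
| 3 | 2004_03 | Friendster | 5241578 | 1 |
| 4 | 2004_03 | Orkut | 1685648 | 2 |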


### 1. Normalizing MAU entries

- From object data type, **to integer**
- Omitting commas e.g. 4,500,000 to 4500000

In [3]:
# Make a copy of the original Social Media Dataframe for cleaning
cleanedSocialMediaDf = socialMediaDf.copy()

# Remove commas and convert MAU to integer
cleanedSocialMediaDf['mau'] = cleanedSocialMediaDf['mau'].str.replace(',', '', regex=False).astype(int)

cleanedSocialMediaDf['mau'].head()

0    4500000
1     500000
2     370000
3    5241578
4    1685648
Name: mau, dtype: int64
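
Because the MAU values were typed by hand, a stricter variant of the conversion above may help surface malformed entries instead of raising an error on the first one. The sketch below is an assumption-laden alternative, not the notebook's method: `parse_mau` is a hypothetical helper and the sample values (including the deliberately bad `"3.7M"`) are made up.

```python
import pandas as pd

def parse_mau(raw: pd.Series) -> pd.Series:
    """Strip commas and surrounding spaces, then convert to numbers.

    Entries that still cannot be parsed become NaN instead of raising.
    """
    cleaned = raw.astype(str).str.replace(",", "", regex=False).str.strip()
    return pd.to_numeric(cleaned, errors="coerce")

# Made-up entries: the last one is intentionally malformed
sample = pd.Series(["4,500,000", "500000", " 370000 ", "3.7M"])

parsed = parse_mau(sample)
bad_rows = sample[parsed.isna()]  # original text of the rows that failed
print(bad_rows)
```

Once `bad_rows` is empty, the column can be cast with `.astype(int)` as in the cell above.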

### 2. Normalizing year_month entries
- From object data type to **datetime64**
- Conversion of manually inputted year_month to proper format

In [4]:
# Replace date underscores with dashes
cleanedSocialMediaDf['year_month'] = cleanedSocialMediaDf['year_month'].str.replace('_', '-', regex=False)

# Convert to datetime
cleanedSocialMediaDf['year_month'] = pd.to_datetime(cleanedSocialMediaDf['year_month'], format='%Y-%m')

cleanedSocialMediaDf['year_month'].head()

0   2004-02-01
1   2004-02-01
2   2004-02-01
3   2004-03-01
4   2004-03-01
Name: year_month, dtype: datetime64[ns]
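
With year_month stored as real dates, it becomes easy to check that no month was skipped during manual entry. The following is a small sketch under assumptions: `missing_months` is a hypothetical helper, and the three sample dates are made up to show a gap.

```python
import pandas as pd

def missing_months(dates: pd.Series, start: str, end: str) -> pd.DatetimeIndex:
    """Return the month starts between start and end that never appear in dates."""
    expected = pd.date_range(start=start, end=end, freq="MS")  # MS = month start
    present = pd.DatetimeIndex(dates.unique())
    return expected.difference(present)

# Made-up sample with March 2004 missing
sample_dates = pd.to_datetime(pd.Series(["2004-02", "2004-02", "2004-04"]), format="%Y-%m")

gaps = missing_months(sample_dates, "2004-02-01", "2004-04-01")
print(gaps)

# Number of month starts from February 2004 through January 2025
full_timeline = pd.date_range(start="2004-02-01", end="2025-01-01", freq="MS")
print(len(full_timeline))
```

The length of `full_timeline` gives the number of months the dataset is expected to span, which can be compared against the 252 months stated in the dataset description.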

### 3. Normalizing platform entries

In [5]:
cleanedSocialMediaDf['platform'].unique()

array(['Friendster', 'Orkut', 'MySpace', 'Flickr', 'Facebook', 'Hi5',
       'Flicker', 'Youtube', 'Reddit', 'Orkut ', 'Twitter/X', 'Tumblr',
       'QZone', 'Weibo', 'Myspace', 'Whatsapp', 'Wechat', 'Google+',
       'Instagram', 'TIkTok', 'TikTok', 'Telegram', 'Snapchat'],
      dtype=object)

The output reveals **four** inconsistencies introduced during manual input:
- **Flickr** appears as both "Flickr" and "Flicker".
- **TikTok** appears as both "TikTok" and "TIkTok".
- **MySpace** appears as both "MySpace" and "Myspace".
- **Orkut** has a trailing space ("Orkut ").


In [6]:
# Replace known typos and inconsistencies
cleanedSocialMediaDf['platform'] = cleanedSocialMediaDf['platform'].replace({
    'Flicker': 'Flickr',
    'TIkTok': 'TikTok',
    'Myspace': 'MySpace',
    'Orkut ': 'Orkut'  # trailing space
})

print(sorted(cleanedSocialMediaDf['platform'].unique()))

['Facebook', 'Flickr', 'Friendster', 'Google+', 'Hi5', 'Instagram', 'MySpace', 'Orkut', 'QZone', 'Reddit', 'Snapchat', 'Telegram', 'TikTok', 'Tumblr', 'Twitter/X', 'Wechat', 'Weibo', 'Whatsapp', 'Youtube']
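
Spelling variants like these can also be flagged automatically rather than by eye. The sketch below groups names that become identical after stripping whitespace and ignoring case; `find_name_variants` is a hypothetical helper, not part of the notebook. It is only a partial check: a genuine misspelling such as "Flicker" still needs manual review, because it does not collapse to the same key as "Flickr".

```python
from collections import defaultdict

import pandas as pd

def find_name_variants(names: pd.Series) -> dict:
    """Group distinct spellings that match after strip() and casefold()."""
    groups = defaultdict(set)
    for name in names.unique():
        groups[name.strip().casefold()].add(name)
    # Keep only keys that have more than one spelling
    return {key: sorted(spellings) for key, spellings in groups.items() if len(spellings) > 1}

sample = pd.Series([
    "Orkut", "Orkut ", "MySpace", "Myspace",
    "TikTok", "TIkTok", "Flickr", "Flicker",
])

variants = find_name_variants(sample)
print(variants)
```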


Now that we have normalized all the columns in the Social Media Dataframe, we shall use ***cleanedSocialMediaDf*** going forward.

In [7]:
cleanedSocialMediaDf.info()
cleanedSocialMediaDf.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2425 entries, 0 to 2424
Data columns (total 4 columns):
 #   Column      Non-Null Count  Dtype         
---  ------      --------------  -----         
 0   year_month  2425 non-null   datetime64[ns]
 1   platform    2425 non-null   object        
 2   mau         2425 non-null   int64         
 3   rank        2425 non-null   int64         
dtypes: datetime64[ns](1), int64(2), object(1)
memory usage: 75.9+ KB


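|   | year_month | platform | mau | rank |
|---|------------|----------|-----|------|
| 0 | 2004-02-01 | Friendster | 4500000 | 1 |
| 1 | 2004-02-01 | Orkut | 500000 | 2 |
| 2 | 2004-02-01 | MySpace | 370000 | 3 |
| 3 | 2004-03-01 | Friendster | 5241578 | 1 |
| 4 | 2004-03-01 | Orkut | 1685648 | 2 |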
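
As a final integrity check on a hand-built table like this one, the ranks within each month can be validated. The sketch below assumes that ranks in a month should run 1, 2, ..., n with no repeats. That rule is an assumption made here for illustration, and `months_with_bad_ranks` and the sample rows are hypothetical.

```python
import pandas as pd

def months_with_bad_ranks(frame: pd.DataFrame) -> list:
    """Return the months whose ranks are not exactly 1..n without repeats."""
    def ranks_ok(ranks: pd.Series) -> bool:
        return sorted(ranks) == list(range(1, len(ranks) + 1))

    ok = frame.groupby("year_month")["rank"].apply(ranks_ok)
    return ok[~ok].index.tolist()

# Made-up sample: March 2004 has rank 1 entered twice
sample = pd.DataFrame({
    "year_month": pd.to_datetime(
        ["2004-02-01", "2004-02-01", "2004-02-01", "2004-03-01", "2004-03-01"]
    ),
    "platform": ["Friendster", "Orkut", "MySpace", "Friendster", "Orkut"],
    "rank": [1, 2, 3, 1, 1],
})

bad_months = months_with_bad_ranks(sample)
print(bad_months)
```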


## 🎬 Movies Dataset Preprocessing

Information about the original, uncleaned movie dataset:

- release_date is stored as an object, whereas we prefer it to be a datetime data type.

- homepage, tagline, production_companies, production_countries, spoken_languages, and keywords contain null entries.

In [8]:
moviesDf.info()
moviesDf.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3940 entries, 0 to 3939
Data columns (total 24 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   id                    3940 non-null   int64  
 1   title                 3940 non-null   object 
 2   vote_average          3940 non-null   float64
 3   vote_count            3940 non-null   int64  
 4   status                3940 non-null   object 
 5   release_date          3940 non-null   object 
 6   revenue               3940 non-null   int64  
 7   runtime               3940 non-null   int64  
 8   adult                 3940 non-null   bool   
 9   backdrop_path         3940 non-null   object 
 10  budget                3940 non-null   int64  
 11  homepage              2186 non-null   object 
 12  imdb_id               3940 non-null   object 
 13  original_language     3940 non-null   object 
 14  original_title        3940 non-null   object 
 15  overview             

Unnamed: 0,id,title,vote_average,vote_count,status,release_date,revenue,runtime,adult,backdrop_path,...,original_title,overview,popularity,poster_path,tagline,genres,production_companies,production_countries,spoken_languages,keywords
0,27205,Inception,8.364,34495,Released,15/07/2010,825532764,148,False,/8ZTVqvKDQ8emSGUEMjsS4yHAwrp.jpg,...,Inception,"Cobb, a skilled thief who commits corporate es...",83.952,/oYuLEt3zVCKq57qu2F8dT7NIa6f.jpg,Your mind is the scene of the crime.,"Action, Science Fiction, Adventure","Legendary Pictures, Syncopy, Warner Bros. Pict...","United Kingdom, United States of America","English, French, Japanese, Swahili","rescue, mission, dream, airplane, paris, franc..."
1,157336,Interstellar,8.417,32571,Released,05/11/2014,701729206,169,False,/pbrkL804c8yAv3zBZR4QPEafpAR.jpg,...,Interstellar,The adventures of a group of explorers who mak...,140.241,/gEU2QniE6E77NI6lCU6MxlNBvIx.jpg,Mankind was born on Earth. It was never meant ...,"Adventure, Drama, Science Fiction","Legendary Pictures, Syncopy, Lynda Obst Produc...","United Kingdom, United States of America",English,"rescue, future, spacecraft, race against time,..."
2,155,The Dark Knight,8.512,30619,Released,16/07/2008,1004558444,152,False,/nMKdUUepR0i5zn0y1T4CsSB5chy.jpg,...,The Dark Knight,Batman raises the stakes in his war on crime. ...,130.643,/qJ2tW6WMUDux911r6m7haRef0WH.jpg,Welcome to a world without rules.,"Drama, Action, Crime, Thriller","DC Comics, Legendary Pictures, Syncopy, Isobel...","United Kingdom, United States of America","English, Mandarin","joker, sadism, chaos, secret identity, crime f..."
3,19995,Avatar,7.573,29815,Released,15/12/2009,2923706026,162,False,/vL5LR6WdxWPjLPFRLe133jXWsh5.jpg,...,Avatar,"In the 22nd century, a paraplegic Marine is di...",79.932,/kyeqWdyUXW608qlYkRqosgbbJyK.jpg,Enter the world of Pandora.,"Action, Adventure, Fantasy, Science Fiction","Dune Entertainment, Lightstorm Entertainment, ...","United States of America, United Kingdom","English, Spanish","future, society, culture clash, space travel, ..."
4,24428,The Avengers,7.71,29166,Released,25/04/2012,1518815515,143,False,/9BBTo63ANSmhC4e6r62OJFuK2GL.jpg,...,The Avengers,When an unexpected enemy emerges and threatens...,98.082,/RYMX2wcKCBAr24UyPD7xwmjaTn.jpg,Some assembly required.,"Science Fiction, Action, Adventure",Marvel Studios,United States of America,"English, Hindi, Russian","new york city, superhero, shield, based on com..."


Let's first make a copy of the movie dataset for cleaning.

In [9]:
cleanedMoviesDf = moviesDf.copy()

### 1. Normalizing release_date entries

We will follow the yyyy-mm-dd date format and convert the release_date column from an object to a datetime data type.

In [10]:
cleanedMoviesDf['release_date'] = cleanedMoviesDf['release_date'].str.replace('/', '-', regex=False)
cleanedMoviesDf['release_date'].head()

0    15-07-2010
1    05-11-2014
2    16-07-2008
3    15-12-2009
4    25-04-2012
Name: release_date, dtype: object

In [11]:
cleanedMoviesDf['release_date'] = pd.to_datetime(cleanedMoviesDf['release_date'], format='%d-%m-%Y')
cleanedMoviesDf['release_date'].head()

0   2010-07-15
1   2014-11-05
2   2008-07-16
3   2009-12-15
4   2012-04-25
Name: release_date, dtype: datetime64[ns]
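
The two-step conversion above (swap the slashes, then parse) could also be written as a single call by giving `pd.to_datetime` the original day/month/year pattern. The sketch below is one possible alternative, not the notebook's method. It adds `errors="coerce"` so that impossible dates become `NaT` and can be listed for review. The sample dates are illustrative, and the last one is deliberately invalid.

```python
import pandas as pd

# Dates in the dd/mm/yyyy layout; the last entry is an impossible date
raw_dates = pd.Series(["15/07/2010", "05/11/2014", "31/02/2012"])

# Parse directly from the slash-separated format; bad entries become NaT
parsed = pd.to_datetime(raw_dates, format="%d/%m/%Y", errors="coerce")
print(parsed)

# Show the original text of any row that failed to parse
unparseable = raw_dates[parsed.isna()].tolist()
print("Unparseable rows:", unparseable)
```

Stating the format explicitly matters here: a value such as 05/11/2014 is ambiguous without it and could be read as May 11 instead of November 5.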

The next step is to handle the null values in the following columns: homepage, tagline, production_companies, production_countries, spoken_languages, and keywords.
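
A quick way to see which columns need this treatment, and how many entries each is missing, is to count nulls per column. This is a minimal sketch on a made-up three-row frame. The values are illustrative and not taken from the dataset.

```python
import pandas as pd

# Made-up stand-in for the movies dataframe
sample = pd.DataFrame({
    "title": ["A", "B", "C"],
    "homepage": ["https://example.com", None, None],
    "tagline": ["Catchy.", None, "Another."],
})

# Count nulls in every column, then keep only the columns that have any
null_counts = sample.isna().sum()
columns_with_nulls = null_counts[null_counts > 0].sort_values(ascending=False)
print(columns_with_nulls)
```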

### 2. Normalizing homepage entries

The 'homepage' column in the data frame simply holds a link to the movie's website.

In [12]:
# Number of null entries in 'homepage'
print(cleanedMoviesDf[cleanedMoviesDf['homepage'].isna()].shape[0])

1754


The homepage variable has no relevance to our analysis. Still, to avoid confusion and as good practice, we replace its null entries with 'N/A'.

In [13]:
# Replace null entries in 'homepage' with 'N/A'.
# The result is assigned back to the column because calling
# fillna(..., inplace=True) on a selected column may not update the dataframe.
cleanedMoviesDf['homepage'] = cleanedMoviesDf['homepage'].fillna('N/A')
print(cleanedMoviesDf[cleanedMoviesDf['homepage'].isna()].shape[0])

0


### 3. Normalizing tagline entries

The 'tagline' variable is a short and catchy slogan used to promote the movie. 

In [14]:
# Number of null entries in 'tagline'
print(cleanedMoviesDf[cleanedMoviesDf['tagline'].isna()].shape[0])

287


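We will use 'tagline' for a corpus analysis later on. We replace its null entries with 'N/A'.

In [15]:
# Replace null entries in 'tagline' with 'N/A'.
# The result is assigned back to the column because calling
# fillna(..., inplace=True) on a selected column may not update the dataframe.
cleanedMoviesDf['tagline'] = cleanedMoviesDf['tagline'].fillna('N/A')
print(cleanedMoviesDf[cleanedMoviesDf['tagline'].isna()].shape[0])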

0


### 4. Normalizing production_companies entries

This column lists the companies that produced each movie. It has five null entries. The team researched the missing production companies and filled them in manually.

In [16]:
# Print all titles of movies with null entries in 'production_companies'
print(cleanedMoviesDf[cleanedMoviesDf['production_companies'].isna()]['title'])

3136    Wizards of Waverly Place: The Movie
3202                         The Open House
3513                       Un Chien Andalou
3582                             Starstruck
3783           Naomi and Ely's No Kiss List
Name: title, dtype: object


Wizards of Waverly Place: The Movie - It's a Laugh Productions https://www.imdb.com/title/tt1369845/

In [17]:
# Replace null entry in 'production_companies' column for row 3136
cleanedMoviesDf.loc[3136, 'production_companies'] = "It's a Laugh Productions"

# Print both 'production_companies' and 'title' for row 3136
print(f"Production Companies: {cleanedMoviesDf.loc[3136, 'production_companies']}, Title: {cleanedMoviesDf.loc[3136, 'title']}")

Production Companies: It's a Laugh Productions, Title: Wizards of Waverly Place: The Movie


The Open House - Suzane Coote, Matt Angel https://www.imdb.com/title/tt7608028/reviews/ 

In [18]:
# Replace null entry in 'production_companies' column for row 3202
cleanedMoviesDf.loc[3202, 'production_companies'] = 'Suzane Coote, Matt Angel'

# Print both 'production_companies' and 'title' for row 3202
print(f"Production Companies: {cleanedMoviesDf.loc[3202, 'production_companies']}, Title: {cleanedMoviesDf.loc[3202, 'title']}")

Production Companies: Suzane Coote, Matt Angel, Title: The Open House


Un Chien Andalou - Luis Buñuel and Salvador Dalí https://www.imdb.com/title/tt0020530/

In [19]:
# Replace null entry in 'production_companies' column for row 3513
cleanedMoviesDf.loc[3513, 'production_companies'] = 'Luis Buñuel and Salvador Dalí'

# Print both 'production_companies' and 'title' for row 3513
print(f"Production Companies: {cleanedMoviesDf.loc[3513, 'production_companies']}, Title: {cleanedMoviesDf.loc[3513, 'title']}")

Production Companies: Luis Buñuel and Salvador Dalí, Title: Un Chien Andalou


Starstruck - Disney Channel https://www.imdb.com/title/tt1579247/

In [20]:
# Replace null entry in 'production_companies' column for row 3582
cleanedMoviesDf.loc[3582, 'production_companies'] = 'Disney Channel'

# Print both 'production_companies' and 'title' for row 3582
print(f"Production Companies: {cleanedMoviesDf.loc[3582, 'production_companies']}, Title: {cleanedMoviesDf.loc[3582, 'title']}")

Production Companies: Disney Channel, Title: Starstruck


Naomi and Ely's No Kiss List - One Two films https://en.wikipedia.org/wiki/Naomi_and_Ely%27s_No_Kiss_List

In [21]:
# Replace null entry in 'production_companies' column for row 3783
cleanedMoviesDf.loc[3783, 'production_companies'] = 'One Two films'

# Print both 'production_companies' and 'title' for row 3783
print(f"Production Companies: {cleanedMoviesDf.loc[3783, 'production_companies']}, Title: {cleanedMoviesDf.loc[3783, 'title']}")

Production Companies: One Two films, Title: Naomi and Ely's No Kiss List


In [22]:
# Check for remaining null entries in the 'production_companies' column
null_count = cleanedMoviesDf['production_companies'].isna().sum()

# Print the result
print(f"Total number of null values in 'production_companies': {null_count}")

Total number of null values in 'production_companies': 0
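
The five row-by-row replacements above could also be expressed as one lookup table applied in a single step. The sketch below is one possible alternative, not the notebook's method. It keys the fixes by title, which assumes titles are unique among the rows being filled. The sample frame is a made-up stand-in, and the two company values are the ones the team researched above.

```python
import pandas as pd

# Made-up stand-in for the movies dataframe
sample = pd.DataFrame({
    "title": ["Starstruck", "Un Chien Andalou", "Inception"],
    "production_companies": [None, None, "Legendary Pictures, Syncopy"],
})

# Manually researched values, keyed by movie title
manual_fixes = {
    "Starstruck": "Disney Channel",
    "Un Chien Andalou": "Luis Buñuel and Salvador Dalí",
}

# Fill only the rows that are currently null; existing values stay untouched
is_missing = sample["production_companies"].isna()
sample.loc[is_missing, "production_companies"] = sample.loc[is_missing, "title"].map(manual_fixes)

remaining_nulls = sample["production_companies"].isna().sum()
print(sample)
print(remaining_nulls)
```

A title with no entry in `manual_fixes` would remain null, so the final null count doubles as a check that every missing row was covered.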


### 5. Normalizing production_countries entries

# 💗 Acknowledgements

- ✍️ **Writing Assistance & Formatting**  
  Portions of the notebook's text — including the abstract, dataset descriptions, and formatting — were improved and polished with the help of **ChatGPT**, developed by **OpenAI**.

- 🎨 **Logo Support**  
  The Statisteros Modeleros logo image was generated with **ChatGPT (Image & Design Guidance)**, one of OpenAI's generative tools.

- 🧰 **Tool Stack**  
  - [Google Colab](https://colab.research.google.com/) for collaborative Python notebook work  
  - [GitHub](https://github.com/) for version control and dataset hosting  
  - [OpenAI ChatGPT](https://chat.openai.com/) for content support and productivity assistance
  
