<a href="https://colab.research.google.com/github/carloscailao/CSMODEL/blob/main/notebook.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 📄 Abstract

This notebook is prepared in partial fulfillment of the **Statistical Modeling and Simulation (CSMODEL)** course under the **College of Computer Studies (CCS)** of De La Salle University – Manila.

The research team, **Statisteros Modeleros**, is composed of three undergraduate students from the **Bachelor of Science in Computer Science** program. This project investigates real-world patterns in the **film industry from 2004 to 2025** and how they relate to the **rise of social media platforms** during the same period.

Motivated by the popular belief that global attention spans are shrinking due to the emergence of short-form content such as **TikToks, Instagram Reels, and YouTube Shorts**, this study explores whether the **movie industry has responded or adapted** to these shifting media consumption habits.

### 👥 Researchers

<table>
<tr>
<td>

**Carlos Luis B. Cailao**  
_2nd Year, BS Computer Science – Software Technology_  
📧 carlos_cailao@dlsu.edu.ph  

**Andre Gabriel D. Llanes**  
_2nd Year, BS Computer Science – Software Technology_  
📧 andre_llanes@dlsu.edu.ph  

**Sophia Pauline V. Sena**  
_2nd Year, BS Computer Science – Network and Information Security_  
📧 sophia_sena@dlsu.edu.ph  

</td>
<td align="right" style="text-align: right; vertical-align: top;">

<strong>Statisteros Modeleros</strong><br>  
<img src="https://raw.githubusercontent.com/carloscailao/CSMODEL/main/assets/StatisterosModeleros_Logo.png" alt="Statisteros Modeleros Logo" width="150"><br>  
<sub><i>Logo generated using ChatGPT</i></sub>

</td>
</tr>
</table>

---

# 📊 About the Datasets

## 🎬 1. Movies Dataset – The Movie Database (TMDB)

- The **TMDB Movies Dataset** contains key information about each film, such as:
  - 🎞️ ID
  - 🎬 Title
  - ⭐ Average Vote
  - 🗳️ Vote Count
  - 📅 Release Date
  - ⏱️ Runtime
  - 💰 Revenue
  - 📌 Status
  - …and other attributes

- **Source:** [Kaggle – TMDB Movies Dataset (930K+ Movies)](https://www.kaggle.com/datasets/asaniczka/tmdb-movies-dataset-2023-930k-movies?fbclid=IwY2xjawK7Og1leHRuA2FlbQIxMQABHlArG7Tyo1mSlbbwmYbOZ068LWVINqSO5yVICdjNNztmY--1a57isPs8cuts_aem_nJypZ3sxgaGXGtVjc-kKLA)

- **Filtering Criteria:**
  - The original dataset contained around **1 million movie entries**.
  - To focus on films with substantial audience engagement, only movies with a **vote count of 1,000 or more** were retained.
  - This filter reduced the dataset to **3,940 movies**.
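
The vote-count filter described above could be reproduced along the following lines. This is a minimal sketch, not necessarily the team's exact procedure: the column name `vote_count`, the reading of "1,000+" as "at least 1,000", and the toy rows are all assumptions made for illustration.

```python
import pandas as pd

# Toy stand-in for the raw TMDB data; titles and vote counts are made up.
# The column name `vote_count` is an assumption about the CSV header.
raw_movies = pd.DataFrame({
    "title": ["Film A", "Film B", "Film C", "Film D"],
    "vote_count": [15000, 999, 1000, 42],
})

MIN_VOTES = 1000  # threshold stated in the filtering criteria

# Keep only rows that meet the threshold, then renumber the index
filtered_movies = (
    raw_movies[raw_movies["vote_count"] >= MIN_VOTES]
    .reset_index(drop=True)
)

print(filtered_movies)
print(f"Kept {len(filtered_movies)} of {len(raw_movies)} rows")
```

Applied to the full Kaggle file, a filter of this shape would be expected to yield the 3,940 rows reported above, provided the threshold is inclusive.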

## 📱 2. Social Media Dataset – Manually Constructed from Statistics and Data Org

- This dataset was manually compiled by the researchers based on an interactive chart from **Statistics and Data** showing:
  - The **Global Top 10 Social Media Platforms** by **Monthly Active Users (MAU)**  
  - Timeline: **February 2004 to January 2025**
  - Entries: **2,425** (across 252 months)

- Since no downloadable dataset was provided, the group created their own CSV file containing:
  - 📆 Year and Month  
  - 🏷️ Platform Name  
  - 👥 Monthly Active Users  
  - 🏅 Rank per Month
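
To picture this long-format layout, the sketch below loads a few rows of that shape from an in-memory string and counts the rows per month. The rows are the first five from the preview shown later in this notebook, and the column names follow that preview; the `io.StringIO` stand-in for the real file and the variable names are assumptions for illustration.

```python
import io
import pandas as pd

# First five rows as they appear in the notebook's preview
csv_text = """year_month,platform,mau,rank
2004_02,Friendster,4500000,1
2004_02,Orkut,500000,2
2004_02,MySpace,370000,3
2004_03,Friendster,5241578,1
2004_03,Orkut,1685648,2
"""

# Read year_month as text so the underscore form is preserved
sample_df = pd.read_csv(io.StringIO(csv_text), dtype={"year_month": str})

# One row per (month, platform) pair; a Top 10 chart gives at most ten
# rows per month, and fewer when fewer platforms are listed.
rows_per_month = sample_df.groupby("year_month").size()
print(rows_per_month)
```

Months with fewer than ten listed platforms would be consistent with the total of 2,425 entries being below 252 × 10 = 2,520.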

- **Source Website:** [Statistics & Data – Most Popular Social Media (2004–2025)](https://statisticsanddata.org/data/most-popular-social-media-2004-2025/?fbclid=IwY2xjawK7OolleHRuA2FlbQIxMQABHlLb727mru341vsUF6i4K2_suxcTzDdB0tJDLk_xLoc0Ifb4_PRylDYjOiNK_aem_Na3LzkWUyR_4eRbxyhf7-w#google_vignette)

- The group **attempted to reach out** to the site’s administrators via **email, Messenger, and Instagram**, but received no response as of this notebook's writing.

## Importing the Datasets

In [58]:
import pandas as pd

# URLs of the dataset copies (as of 06/15/2025) hosted on GitHub
moviesDatasetRaw = "https://raw.githubusercontent.com/carloscailao/CSMODEL/main/data/Movies_Dataset_TMDB.csv"
socialMediaDatasetRaw = "https://raw.githubusercontent.com/carloscailao/CSMODEL/main/data/Social_Media_Dataset_Top_10_Social_Media_2004-2025.csv"

# Creation of dataframes from CSV files of datasets
moviesDf = pd.read_csv(moviesDatasetRaw)
socialMediaDf = pd.read_csv(socialMediaDatasetRaw)

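# A peek into the uncleaned dataframes.
# Note: a notebook cell renders only its last expression, so only
# socialMediaDf.head() appears below; print() or display() would show both.
moviesDf.head()
socialMediaDf.head()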

|   | year_month | platform | mau | rank |
|---|---|---|---|---|
| 0 | 2004_02 | Friendster | 4500000 | 1 |
| 1 | 2004_02 | Orkut | 500000 | 2 |
| 2 | 2004_02 | MySpace | 370000 | 3 |
| 3 | 2004_03 | Friendster | 5241578 | 1 |
| 4 | 2004_03 | Orkut | 1685648 | 2 |


---

# 🛠️ Data Preprocessing
- This section covers the cleaning, type conversion, and other preparation of the Movies and Social Media datasets ahead of Exploratory Data Analysis.
- It also documents how *socialMediaDf* and *moviesDf* become ***cleanedSocialMediaDf*** and ***cleanedMoviesDf***, respectively.

## 📱 Social Media Dataset Preprocessing
- Because this CSV was entered manually, the researchers must check for data-entry errors and make sure each column has a consistent data type and format.

Info on the **original, uncleaned** dataset
- Here, year_month and mau are both of data type **object**. We would prefer year_month to be dates and mau to be integers, which makes later plotting and modeling easier.

In [59]:
socialMediaDf.info()
socialMediaDf.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2425 entries, 0 to 2424
Data columns (total 4 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   year_month  2425 non-null   object
 1   platform    2425 non-null   object
 2   mau         2425 non-null   object
 3   rank        2425 non-null   int64 
dtypes: int64(1), object(3)
memory usage: 75.9+ KB


|   | year_month | platform | mau | rank |
|---|---|---|---|---|
| 0 | 2004_02 | Friendster | 4500000 | 1 |
| 1 | 2004_02 | Orkut | 500000 | 2 |
| 2 | 2004_02 | MySpace | 370000 | 3 |
| 3 | 2004_03 | Friendster | 5241578 | 1 |
| 4 | 2004_03 | Orkut | 1685648 | 2 |


1. Normalizing MAU entries (removing thousands separators, e.g., 4,500,000 becomes 4500000)

- Converts the column from the object data type **to integer**.
- This allows for smoother operations later on when working with MAU entries.

In [60]:
# Make a MAU normalized copy of the original Social Media Dataframe
mauNormalizedSocialMediaDf = socialMediaDf.copy()

# Remove commas and convert MAU to integer (explicit int64, since plain int can map to 32 bits on some platforms)
mauNormalizedSocialMediaDf['mau'] = mauNormalizedSocialMediaDf['mau'].str.replace(',', '', regex=False).astype('int64')

mauNormalizedSocialMediaDf.info()
mauNormalizedSocialMediaDf.head()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2425 entries, 0 to 2424
Data columns (total 4 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   year_month  2425 non-null   object
 1   platform    2425 non-null   object
 2   mau         2425 non-null   int64 
 3   rank        2425 non-null   int64 
dtypes: int64(2), object(2)
memory usage: 75.9+ KB


|   | year_month | platform | mau | rank |
|---|---|---|---|---|
| 0 | 2004_02 | Friendster | 4500000 | 1 |
| 1 | 2004_02 | Orkut | 500000 | 2 |
| 2 | 2004_02 | MySpace | 370000 | 3 |
| 3 | 2004_03 | Friendster | 5241578 | 1 |
| 4 | 2004_03 | Orkut | 1685648 | 2 |
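
Because the file was typed by hand, a single stray character in mau would make the `astype` call above raise an error. A more defensive variant, sketched here on made-up values, uses `pd.to_numeric` with `errors="coerce"` so that malformed entries surface as `NaN` and can be inspected before casting. The sample values and variable names are hypothetical and do not come from the dataset.

```python
import pandas as pd

# Made-up MAU strings, including one malformed entry ("1.2M")
mau_raw = pd.Series(["4,500,000", "500000", " 370,000 ", "1.2M", "3,000,000,000"])

# Drop thousands separators and surrounding whitespace
mau_stripped = mau_raw.str.replace(",", "", regex=False).str.strip()

# Unparseable strings become NaN instead of raising an exception
mau_numeric = pd.to_numeric(mau_stripped, errors="coerce")

# Show the original text of every entry that failed to parse
bad_entries = mau_raw[mau_numeric.isna()]
print(bad_entries)

# Cast only the valid values to 64-bit integers
mau_clean = mau_numeric.dropna().astype("int64")
print(mau_clean)
```

If `bad_entries` is empty, this variant and the direct `astype` conversion should give the same result.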


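2. Normalizing year_month entries
- Converts the column from the object data type to **datetime64**.
- The manually entered underscore form (e.g., 2004_02) is rewritten with a dash (2004-02) before parsing; each parsed date falls on the first day of its month.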

In [61]:
# Make a Date Normalized copy of the MAU Normalized Social Media Dataframe
dateNormalizedSocialMediaDf = mauNormalizedSocialMediaDf.copy()

# Replace underscores with dashes
dateNormalizedSocialMediaDf['year_month'] = dateNormalizedSocialMediaDf['year_month'].str.replace('_', '-', regex=False)

# Convert to datetime
dateNormalizedSocialMediaDf['year_month'] = pd.to_datetime(dateNormalizedSocialMediaDf['year_month'], format='%Y-%m')

dateNormalizedSocialMediaDf.info()
dateNormalizedSocialMediaDf.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2425 entries, 0 to 2424
Data columns (total 4 columns):
 #   Column      Non-Null Count  Dtype         
---  ------      --------------  -----         
 0   year_month  2425 non-null   datetime64[ns]
 1   platform    2425 non-null   object        
 2   mau         2425 non-null   int64         
 3   rank        2425 non-null   int64         
dtypes: datetime64[ns](1), int64(2), object(1)
memory usage: 75.9+ KB


|   | year_month | platform | mau | rank |
|---|---|---|---|---|
| 0 | 2004-02-01 | Friendster | 4500000 | 1 |
| 1 | 2004-02-01 | Orkut | 500000 | 2 |
| 2 | 2004-02-01 | MySpace | 370000 | 3 |
| 3 | 2004-03-01 | Friendster | 5241578 | 1 |
| 4 | 2004-03-01 | Orkut | 1685648 | 2 |
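
An equivalent approach to the replace-then-parse step above is to parse the underscore form directly by passing `format="%Y_%m"`, together with `errors="coerce"` so that mistyped months become `NaT` rather than raising. The sketch below uses made-up sample strings (including a deliberately invalid month) and hypothetical variable names.

```python
import pandas as pd

# Made-up year_month strings; "2004_13" is deliberately invalid
year_month_raw = pd.Series(["2004_02", "2004_03", "2004_13", "2025_01"])

# Parse the underscore form directly; invalid entries become NaT
parsed = pd.to_datetime(year_month_raw, format="%Y_%m", errors="coerce")

# Original text of every entry that could not be parsed
invalid = year_month_raw[parsed.isna()]
print(invalid)

# The valid entries, shown back in year-month form
print(parsed.dropna().dt.strftime("%Y-%m").tolist())
```

Checking that `invalid` is empty is a quick way to confirm that no month was mistyped during manual entry.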


3. Checking for duplicate representations of the same social media platform name

In [62]:
dateNormalizedSocialMediaDf['platform'].unique()

array(['Friendster', 'Orkut', 'MySpace', 'Flickr', 'Facebook', 'Hi5',
       'Flicker', 'Youtube', 'Reddit', 'Orkut ', 'Twitter/X', 'Tumblr',
       'QZone', 'Weibo', 'Myspace', 'Whatsapp', 'Wechat', 'Google+',
       'Instagram', 'TIkTok', 'TikTok', 'Telegram', 'Snapchat'],
      dtype=object)

The output reveals **four** inconsistencies introduced during manual entry:
- **Flickr** appears as both "Flickr" and "Flicker".
- **TikTok** appears as both "TikTok" and "TIkTok".
- **MySpace** appears as both "MySpace" and "Myspace".
- **Orkut** has a trailing space ("Orkut ").
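
The four issues above were found by reading the list of unique names. As a cross-check, variants of this kind could also be flagged programmatically. The sketch below is one possible approach, not the method used in this notebook: it first groups names that collide once whitespace and letter case are ignored, then uses the standard library's `difflib.get_close_matches` to catch near-identical spellings. The `0.85` cutoff and all variable names are assumptions chosen for illustration.

```python
import difflib
import pandas as pd

# A subset of the platform names printed above, plus one clean name
platforms = pd.Series(["Orkut", "Orkut ", "MySpace", "Myspace",
                       "TikTok", "TIkTok", "Flickr", "Flicker", "Facebook"])

unique_names = pd.Series(platforms.unique())

# Pass 1: names that collide once whitespace and letter case are ignored
keys = unique_names.str.strip().str.casefold()
case_or_space_variants = {
    key: group.tolist()
    for key, group in unique_names.groupby(keys)
    if len(group) > 1
}

# Pass 2: among the remaining distinct keys, pairs spelled almost alike
distinct_keys = sorted(keys.unique())
near_matches = set()
for i, key in enumerate(distinct_keys):
    later_keys = distinct_keys[i + 1:]
    for other in difflib.get_close_matches(key, later_keys, n=3, cutoff=0.85):
        near_matches.add((key, other))

print(case_or_space_variants)
print(near_matches)
```

Flagged pairs still need a human decision, since two distinct platforms could have similar names; the check only narrows down where to look.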


In [63]:
# Copy for progress tracking
platformNormalizedSocialMediaDf = dateNormalizedSocialMediaDf.copy()

# Replace known typos and inconsistencies
platformNormalizedSocialMediaDf['platform'] = platformNormalizedSocialMediaDf['platform'].replace({
    'Flicker': 'Flickr',
    'TIkTok': 'TikTok',
    'Myspace': 'MySpace',
    'Orkut ': 'Orkut'  # trailing space
})

print(sorted(platformNormalizedSocialMediaDf['platform'].unique()))
platformNormalizedSocialMediaDf.info()
platformNormalizedSocialMediaDf.head()

['Facebook', 'Flickr', 'Friendster', 'Google+', 'Hi5', 'Instagram', 'MySpace', 'Orkut', 'QZone', 'Reddit', 'Snapchat', 'Telegram', 'TikTok', 'Tumblr', 'Twitter/X', 'Wechat', 'Weibo', 'Whatsapp', 'Youtube']
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2425 entries, 0 to 2424
Data columns (total 4 columns):
 #   Column      Non-Null Count  Dtype         
---  ------      --------------  -----         
 0   year_month  2425 non-null   datetime64[ns]
 1   platform    2425 non-null   object        
 2   mau         2425 non-null   int64         
 3   rank        2425 non-null   int64         
dtypes: datetime64[ns](1), int64(2), object(1)
memory usage: 75.9+ KB


|   | year_month | platform | mau | rank |
|---|---|---|---|---|
| 0 | 2004-02-01 | Friendster | 4500000 | 1 |
| 1 | 2004-02-01 | Orkut | 500000 | 2 |
| 2 | 2004-02-01 | MySpace | 370000 | 3 |
| 3 | 2004-03-01 | Friendster | 5241578 | 1 |
| 4 | 2004-03-01 | Orkut | 1685648 | 2 |


Now that all columns in the Social Media dataframe have been normalized, we name the result ***cleanedSocialMediaDf***.

In [64]:
# Copy (rather than alias) so later edits do not alter the intermediate dataframe
cleanedSocialMediaDf = platformNormalizedSocialMediaDf.copy()

cleanedSocialMediaDf.info()
cleanedSocialMediaDf.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2425 entries, 0 to 2424
Data columns (total 4 columns):
 #   Column      Non-Null Count  Dtype         
---  ------      --------------  -----         
 0   year_month  2425 non-null   datetime64[ns]
 1   platform    2425 non-null   object        
 2   mau         2425 non-null   int64         
 3   rank        2425 non-null   int64         
dtypes: datetime64[ns](1), int64(2), object(1)
memory usage: 75.9+ KB


|   | year_month | platform | mau | rank |
|---|---|---|---|---|
| 0 | 2004-02-01 | Friendster | 4500000 | 1 |
| 1 | 2004-02-01 | Orkut | 500000 | 2 |
| 2 | 2004-02-01 | MySpace | 370000 | 3 |
| 3 | 2004-03-01 | Friendster | 5241578 | 1 |
| 4 | 2004-03-01 | Orkut | 1685648 | 2 |
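
Before moving on, the cleaned dataframe could be given a final sanity check. The sketch below shows one way to do that; the function `validate_social_media` is hypothetical (it is not part of this notebook), and the specific checks, including the assumption that ranks fall between 1 and 10 for a Top 10 chart, are our own suggestions. The sample rows reuse values from the preview above.

```python
import pandas as pd

def validate_social_media(df: pd.DataFrame) -> list:
    """Return a list of problems found; an empty list means all checks passed."""
    problems = []
    if not pd.api.types.is_datetime64_any_dtype(df["year_month"]):
        problems.append("year_month is not datetime")
    if not pd.api.types.is_integer_dtype(df["mau"]):
        problems.append("mau is not integer")
    if df.isna().any().any():
        problems.append("missing values present")
    if df.duplicated(subset=["year_month", "platform"]).any():
        problems.append("duplicate (year_month, platform) rows")
    if not df["rank"].between(1, 10).all():
        problems.append("rank outside 1-10")
    return problems

# Small sample in the cleaned layout, using rows from the preview
sample = pd.DataFrame({
    "year_month": pd.to_datetime(["2004-02-01", "2004-02-01", "2004-03-01"]),
    "platform": ["Friendster", "Orkut", "Friendster"],
    "mau": [4500000, 500000, 5241578],
    "rank": [1, 2, 1],
})

print(validate_social_media(sample))
```

Calling `validate_social_media(cleanedSocialMediaDf)` would then report any remaining issues as plain text messages.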


---

# 💗 Acknowledgements

- ✍️ **Writing Assistance & Formatting**  
  Portions of the notebook's text — including the abstract, dataset descriptions, and formatting — were improved and polished with the help of **ChatGPT**, developed by **OpenAI**.

- 🎨 **Logo Support**  
  The Statisteros Modeleros logo image was generated with **ChatGPT (Image & Design Guidance)**, part of OpenAI's generative tools.

- 🧰 **Tool Stack**  
  - [Google Colab](https://colab.research.google.com/) for collaborative Python notebook work  
  - [GitHub](https://github.com/) for version control and dataset hosting  
  - [OpenAI ChatGPT](https://chat.openai.com/) for content support and productivity assistance
  