<a href="https://colab.research.google.com/github/carloscailao/CSMODEL/blob/main/notebook.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 📄 Abstract

This notebook is prepared in partial fulfillment of the **Statistical Modeling and Simulation (CSMODEL)** course under the **College of Computer Studies (CCS)** of De La Salle University – Manila.

The research team, **Statisteros Modeleros**, is composed of three undergraduate students from the **Bachelor of Science in Computer Science** program. This project investigates real-world patterns in the **film industry from 2004 to 2025** and how they relate to the **rise of social media platforms** during the same period.

Motivated by the popular belief that global attention spans are shrinking due to the emergence of short-form content such as **TikToks, Instagram Reels, and YouTube Shorts**, this study explores whether the **movie industry has responded or adapted** to these shifting media consumption habits.

### 👥 Researchers

<table>
<tr>
<td>

**Carlos Luis B. Cailao**  
_2nd Year, BS Computer Science – Software Technology_  
📧 carlos_cailao@dlsu.edu.ph  

**Andre Gabriel D. Llanes**  
_2nd Year, BS Computer Science – Software Technology_  
📧 andre_llanes@dlsu.edu.ph  

**Sophia Pauline V. Sena**  
_2nd Year, BS Computer Science – Network and Information Security_  
📧 sophia_sena@dlsu.edu.ph  

</td>
<td align="right" style="text-align: right; vertical-align: top;">

<strong>Statisteros Modeleros</strong><br>  
<img src="https://raw.githubusercontent.com/carloscailao/CSMODEL/main/assets/StatisterosModeleros_Logo.png" alt="Statisteros Modeleros Logo" width="150"><br>  
<sub><i>Logo generated using ChatGPT</i></sub>

</td>
</tr>
</table>

---

# 📊 About the Datasets

## 🎬 1. Movies Dataset – The Movie Database (TMDB)

- The **TMDB movies dataset** contains key information about films, such as:
  - 🎞️ ID
  - 🎬 Title
  - ⭐ Average Vote
  - 🗳️ Vote Count
  - 📅 Release Date
  - ⏱️ Runtime
  - 💰 Revenue
  - 📌 Status
  - …and other attributes

- **Source:** [Kaggle – TMDB Movies Dataset (930K+ Movies)](https://www.kaggle.com/datasets/asaniczka/tmdb-movies-dataset-2023-930k-movies)

- **Filtering Criteria:**
  - The original dataset contained around **1 million movie entries**.
  - To focus on films with meaningful audience engagement, only movies with **at least 1,000 votes** were retained.
  - This filter reduced the dataset to **3,940 movies**.
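
The vote-count filter described above could be reproduced roughly as follows. This is a minimal sketch on a toy table, not the team's actual filtering script: the column name `vote_count`, the sample rows, and the inclusive threshold (`>=`) are assumptions made for illustration.

```python
import pandas as pd

# Toy stand-in for the raw TMDB table (hypothetical rows);
# the real file has far more rows and columns.
raw_movies = pd.DataFrame(
    {
        "title": ["Film A", "Film B", "Film C", "Film D"],
        "vote_count": [15234, 999, 1000, 12],
    }
)

MIN_VOTES = 1000  # threshold stated in the filtering criteria (assumed inclusive)

# Keep only the rows that meet the vote threshold.
filtered_movies = raw_movies[raw_movies["vote_count"] >= MIN_VOTES].reset_index(drop=True)

print(filtered_movies)
```

Applying the same boolean mask to the full raw file would be expected to yield the reduced table, provided the column names match.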

## 📱 2. Social Media Dataset – Manually Constructed from Statistics and Data Org

- This dataset was manually compiled by the researchers based on an interactive chart from **Statistics and Data** showing:
  - The **Global Top 10 Social Media Platforms** by **Monthly Active Users (MAU)**  
  - Timeline: **February 2004 to January 2025**

- Since no downloadable dataset was provided, the group created their own CSV file containing:
  - 📆 Year and Month  
  - 🏷️ Platform Name  
  - 👥 Monthly Active Users  
  - 🏅 Rank per Month

- **Source Website:** [Statistics & Data – Most Popular Social Media (2004–2025)](https://statisticsanddata.org/data/most-popular-social-media-2004-2025/)

- The group **attempted to contact** the site’s administrators via **email, Messenger, and Instagram**, but had received no response as of this writing.
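
The CSV layout described above (one row per platform per month) can be sketched as below. The column names `platform`, `monthly_active_users`, and `rank`, the `YYYY_M` format of `year_month`, and all user counts are hypothetical placeholders rather than values from the source chart. The sketch also shows one way the per-month rank could be derived from the user counts instead of being typed in by hand.

```python
import pandas as pd

# Hypothetical rows in long format: one row per platform per month.
# The user counts are made-up placeholders, not figures from the source chart.
social_sample = pd.DataFrame(
    {
        "year_month": ["2004_2", "2004_2", "2004_2", "2004_3", "2004_3"],
        "platform": ["Platform A", "Platform B", "Platform C", "Platform A", "Platform B"],
        "monthly_active_users": [300, 500, 100, 450, 400],
    }
)

# Rank 1 = most monthly active users within each month.
social_sample["rank"] = (
    social_sample.groupby("year_month")["monthly_active_users"]
    .rank(method="first", ascending=False)
    .astype(int)
)

print(social_sample)
```

Deriving the rank this way also gives a quick consistency check against manually entered ranks.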

## Importing the Datasets

In [8]:
import pandas as pd

# Raw GitHub URLs of the two CSV files
moviesDatasetRaw = "https://raw.githubusercontent.com/carloscailao/CSMODEL/main/data/Movies_Dataset_TMDB.csv"
socialMediaDatasetRaw = "https://raw.githubusercontent.com/carloscailao/CSMODEL/main/data/Social_Media_Dataset_Top_10_Social_Media_2004-2025.csv"

# Load each CSV into its own DataFrame
moviesDf = pd.read_csv(moviesDatasetRaw)
socialMediaDf = pd.read_csv(socialMediaDatasetRaw)

---

# Data Preprocessing

1. Normalize the date format of the social media DataFrame
   - This allows for smoother conversion later when working with date entries.

In [12]:
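socialMediaDf['year_month'] = socialMediaDf['year_month'].astype(str)

# Ensure the month is two digits. Both "_" and "-" are accepted as separators,
# so the cell can be re-run on already-normalized values (e.g. "2004-02")
# without raising an IndexError.
def normalize_year_month(value):
    year, month = value.replace('_', '-').split('-')[:2]
    return f"{year}-{int(month):02d}"

socialMediaDf['year_month'] = socialMediaDf['year_month'].apply(normalize_year_month)
socialMediaDf.head()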
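
Once `year_month` values are in `YYYY-MM` form, the later conversion to date types might look like the sketch below. The sample values are hypothetical, and the choice between timestamps and monthly periods is an assumption about what later analysis needs, not something the notebook specifies.

```python
import pandas as pd

# Hypothetical values already normalized to "YYYY-MM".
normalized = pd.Series(["2004-02", "2004-11", "2025-01"], name="year_month")

# Parse as timestamps (first day of each month), then view them as monthly periods.
as_timestamp = pd.to_datetime(normalized, format="%Y-%m")
as_period = as_timestamp.dt.to_period("M")

print(as_timestamp.dt.year.tolist())
print(as_period.astype(str).tolist())
```

Monthly periods keep the data at its native granularity, while timestamps are convenient for plotting and for extracting the year to join against movie release dates.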

---

# 💗 Acknowledgements

- ✍️ **Writing Assistance & Formatting**  
  Portions of the notebook's text — including the abstract, dataset descriptions, and formatting — were improved and polished with the help of **ChatGPT**, developed by **OpenAI**.

- 🎨 **Logo Support**  
  The Statisteros Modeleros logo image was generated with **ChatGPT (Image & Design Guidance)**, one of OpenAI's generative tools.

- 🧰 **Tool Stack**  
  - [Google Colab](https://colab.research.google.com/) for collaborative Python notebook work  
  - [GitHub](https://github.com/) for version control and dataset hosting  
  - [OpenAI ChatGPT](https://chat.openai.com/) for content support and productivity assistance
  