# Data Audit and Validation

This notebook documents the initial inspection and validation of the Reddit dataset used in this project.

The goals of this audit are to:
- understand the structure and time coverage of the dataset
- confirm that the data supports the selected Ukraine War events
- measure comment volume around each event
- identify which subreddits the comments come from

This step ensures that all later analyses are based on verified and well-understood data.


## Dataset Source

I originally planned to collect Reddit data directly through the Reddit API. 
However, API access restrictions and application approval requirements made this 
infeasible within the project timeline.

Instead, I used a publicly available Reddit comments dataset hosted on Kaggle. 
This dataset contains historical Reddit comments related to Russia–Ukraine discussions and spans multiple years.

- **Title:** Public Opinion Russia Ukraine War  
- **Link:** https://www.kaggle.com/datasets/asaniczka/public-opinion-russia-ukraine-war-updated-daily?resource=download

Using this dataset provides:
- historical coverage back to the early stages of the war
- reproducible analysis without violating platform policies
- sufficient data volume for event-based sentiment analysis


## Loading Strategy

The raw dataset is approximately 5 GB in size, which makes it impractical to load entirely into memory.

To handle this safely, all analyses in this notebook use **chunked loading**, where the file is read in 
smaller batches (chunks) and processed incrementally.

This approach allows us to:
- scan the full dataset
- compute statistics such as date ranges and counts
- avoid memory crashes
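
As a minimal sketch of the chunked pattern described above, the example below reads a tiny in-memory CSV in batches. The CSV contents and the chunk size of 4 are hypothetical stand-ins for the real file and settings, chosen only to make the iteration visible.

```python
import io

import pandas as pd

# Hypothetical in-memory CSV standing in for the large raw file.
csv_text = "created_time,self_text\n" + "\n".join(
    f"2024-01-{day:02d} 12:00:00,comment {day}" for day in range(1, 11)
)

total_rows = 0
n_chunks = 0

# With chunksize set, read_csv returns an iterator of DataFrames
# instead of one large DataFrame.
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=4):
    total_rows += len(chunk)  # accumulate a statistic incrementally
    n_chunks += 1

print(total_rows, n_chunks)
```

Only one chunk is held in memory at a time, so peak memory depends on the chunk size rather than on the size of the file.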


In [1]:
import pandas as pd
from collections import defaultdict


## File Configuration

The following parameters define where the raw data is stored locally and how it is processed.


In [3]:
file_path = "../data/raw/reddit_opinion_ru_ua.csv"
chunksize = 200_000  # number of rows processed per chunk


## Dataset Schema

Before analyzing content, it is important to understand which columns are available in the dataset.


In [5]:
sample = pd.read_csv(file_path, nrows=1000, low_memory=False)
sample.columns


Index(['comment_id', 'score', 'self_text', 'subreddit', 'created_time',
       'post_id', 'author_name', 'controversiality', 'ups', 'downs',
       'user_is_verified', 'user_account_created_time', 'user_awardee_karma',
       'user_awarder_karma', 'user_link_karma', 'user_comment_karma',
       'user_total_karma', 'post_score', 'post_self_text', 'post_title',
       'post_upvote_ratio', 'post_thumbs_ups', 'post_total_awards_received',
       'post_created_time'],
      dtype='object')
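
A quick guard on the schema can catch a renamed or missing column before the long full-file scans start. The sketch below is one possible way to do this; `required_columns` and `sample_demo` are hypothetical names, and the small DataFrame stands in for the 1,000-row sample loaded above.

```python
import pandas as pd

# Columns that the later cells of this notebook read.
required_columns = {"created_time", "self_text", "subreddit"}

# Hypothetical stand-in for the sample loaded from the raw file.
sample_demo = pd.DataFrame(
    {
        "comment_id": ["a1"],
        "self_text": ["example comment"],
        "subreddit": ["ukraine"],
        "created_time": ["2024-11-06 10:00:00"],
    }
)

# Set difference lists every required column that is absent.
missing = required_columns - set(sample_demo.columns)
if missing:
    raise KeyError(f"Missing expected columns: {sorted(missing)}")
```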

## Time Coverage of the Dataset

To determine whether the dataset covers the full time span needed for this project, 
I scanned the entire file to find the earliest and latest comment timestamps.


In [7]:
min_date = None
max_date = None

for chunk in pd.read_csv(file_path, chunksize=chunksize, low_memory=False, on_bad_lines="skip"):
    # Parse timestamps; values that cannot be parsed become NaT and are dropped.
    dt = pd.to_datetime(chunk["created_time"], errors="coerce").dropna()
    if dt.empty:
        continue  # no valid timestamps in this chunk (min/max would be NaT)

    chunk_min = dt.min()
    chunk_max = dt.max()

    # Update the running minimum and maximum across chunks.
    if min_date is None or chunk_min < min_date:
        min_date = chunk_min
    if max_date is None or chunk_max > max_date:
        max_date = chunk_max

min_date, max_date


(Timestamp('2014-11-13 02:18:53'), Timestamp('2025-07-08 11:22:39'))

## Event Windows

The analysis focuses on five key events related to the Ukraine War. 
Each event is represented by a fixed time window to allow before/after comparisons.


In [9]:
event_windows = {
    "event1_kyiv": ("2022-02-20", "2022-03-20"),
    "event2_kherson": ("2022-11-01", "2022-11-20"),
    "event3_stalemate": ("2023-03-01", "2023-06-30"),
    "event4_trump_election": ("2024-10-25", "2024-11-20"),
    "event5_white_house_meeting": ("2025-02-15", "2025-03-10"),
}

event_windows = {
    k: (pd.to_datetime(v[0]), pd.to_datetime(v[1]))
    for k, v in event_windows.items()
}
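
One way to confirm that the data can support these events is to check each window against the date range found by the full-file scan earlier in this notebook. The sketch below hard-codes that reported range; `data_start`, `data_end`, `windows`, and `coverage` are illustrative names rather than variables used elsewhere in the notebook.

```python
import pandas as pd

# Date range reported by the full-file scan earlier in this notebook.
data_start = pd.Timestamp("2014-11-13 02:18:53")
data_end = pd.Timestamp("2025-07-08 11:22:39")

windows = {
    "event1_kyiv": ("2022-02-20", "2022-03-20"),
    "event2_kherson": ("2022-11-01", "2022-11-20"),
    "event3_stalemate": ("2023-03-01", "2023-06-30"),
    "event4_trump_election": ("2024-10-25", "2024-11-20"),
    "event5_white_house_meeting": ("2025-02-15", "2025-03-10"),
}

# A window is covered when it lies entirely inside the dataset's date range.
coverage = {
    name: data_start <= pd.Timestamp(start) and pd.Timestamp(end) <= data_end
    for name, (start, end) in windows.items()
}

for name, ok in coverage.items():
    print(f"{name}: {'covered' if ok else 'NOT covered'}")
```

This only checks the date bounds. Whether a window contains enough comments is a separate question.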


## Comment Volume per Event

To ensure that each event window contains enough data for analysis, 
I counted how many comments fall within each event’s time window.


In [11]:
event_counts = defaultdict(int)

for chunk in pd.read_csv(file_path, chunksize=chunksize, low_memory=False, on_bad_lines="skip"):
    chunk["dt"] = pd.to_datetime(chunk["created_time"], errors="coerce")
    # Keep only rows with a valid timestamp and non-missing comment text.
    chunk = chunk.dropna(subset=["dt", "self_text"])

    for event, (start, end) in event_windows.items():
        # Both bounds are midnight timestamps, so the end date itself
        # contributes only comments stamped exactly 00:00:00.
        mask = (chunk["dt"] >= start) & (chunk["dt"] <= end)
        event_counts[event] += int(mask.sum())

event_counts


defaultdict(int,
            {'event1_kyiv': 2940,
             'event2_kherson': 279,
             'event3_stalemate': 9955,
             'event4_trump_election': 359088,
             'event5_white_house_meeting': 241262})
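
Because the windows differ in length, raw counts are not directly comparable. As a rough sketch, the counts above can be normalised to comments per day. The names `windows`, `counts`, and `rate_df` are illustrative, and the counts are copied from the output above rather than recomputed.

```python
import pandas as pd

windows = {
    "event1_kyiv": ("2022-02-20", "2022-03-20"),
    "event2_kherson": ("2022-11-01", "2022-11-20"),
    "event3_stalemate": ("2023-03-01", "2023-06-30"),
    "event4_trump_election": ("2024-10-25", "2024-11-20"),
    "event5_white_house_meeting": ("2025-02-15", "2025-03-10"),
}

# Counts copied from the output of the counting cell above.
counts = {
    "event1_kyiv": 2940,
    "event2_kherson": 279,
    "event3_stalemate": 9955,
    "event4_trump_election": 359088,
    "event5_white_house_meeting": 241262,
}

rows = []
for name, (start, end) in windows.items():
    # Window length in days between the two midnight bounds.
    days = (pd.Timestamp(end) - pd.Timestamp(start)).days
    rows.append(
        {
            "event": name,
            "days": days,
            "comments": counts[name],
            "comments_per_day": counts[name] / days,
        }
    )

rate_df = pd.DataFrame(rows).set_index("event")
print(rate_df.round(1))
```

Comparing the per-day rates separates the effect of window length from the effect of overall comment activity.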

## Subreddit Coverage

To understand which communities the comments come from, 
I extracted and counted the subreddits present in the dataset.


In [21]:
subreddit_counts = defaultdict(int)

for chunk in pd.read_csv(file_path, chunksize=chunksize, low_memory=False, on_bad_lines="skip"):
    if "subreddit" not in chunk.columns:
        continue

    vc = chunk["subreddit"].dropna().value_counts()
    for sub, cnt in vc.items():
        subreddit_counts[sub] += int(cnt)

# Build a table of subreddits sorted from most to least comments.
sub_df = (
    pd.DataFrame(subreddit_counts.items(), columns=["subreddit", "count"])
      .sort_values("count", ascending=False)
      .reset_index(drop=True)
)

sub_df.head(50)


|   | subreddit | count |
|---|-----------|-------|
| 0 | UkraineWarVideoReport | 1058238 |
| 1 | worldnews | 894825 |
| 2 | UkraineRussiaReport | 890467 |
| 3 | UkrainianConflict | 567003 |
| 4 | ukraine | 512959 |
| 5 | europe | 424409 |
| 6 | CombatFootage | 305308 |
| 7 | AskARussian | 252520 |
| 8 | politics | 185622 |
| 9 | conspiracy | 108492 |

## Summary and Observations

Key observations from this data audit:

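- The dataset spans from November 2014 to July 2025, so all five event windows fall within its date range.
- Comment volume varies by roughly three orders of magnitude across events, from 279 comments (Kherson) to 359,088 (Trump election). This likely reflects differences in public attention, although the windows also differ in length.
- The early-war events (2022 to 2023) have fewer than 10,000 comments each, while the later politically driven events (2024 to 2025) have more than 240,000 each.
- The comments come from war-focused communities (for example UkraineWarVideoReport, UkraineRussiaReport, and UkrainianConflict) as well as general news, political, and regional communities (worldnews, politics, europe).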

These findings informed the final choice of event windows and analysis methods used in later notebooks.
