# 🎯 Objective

This notebook processes raw Reddit data collected in NB01, cleans it, and stores it in a structured SQLite database.  




# 📚 Libraries




In [106]:
import os  
import json  
import sqlite3  
import pandas as pd  
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer
import re

# 📊 Data Loading

In [107]:
# Define base and raw-data directories
BASE_DIR = "/files/ds105a-2024-alternative-summative-ajchan03"  
DATA_DIR = os.path.join(BASE_DIR, "data", "raw")  

# Define file paths
POSTS_FILE = os.path.join(DATA_DIR, "reddit_filtered_posts.json")
COMMENTS_FILE = os.path.join(DATA_DIR, "reddit_filtered_comments.json")

# Check if JSON files exist before attempting to load them
if not os.path.exists(POSTS_FILE):
    raise FileNotFoundError(f"🚨 Error: `{POSTS_FILE}` not found. Please run the scraper first.")

if not os.path.exists(COMMENTS_FILE):
    raise FileNotFoundError(f"🚨 Error: `{COMMENTS_FILE}` not found. Please run the scraper first.")

# Load JSON data into DataFrames
with open(POSTS_FILE, "r", encoding="utf-8") as f:
    posts_data = json.load(f)
df_posts = pd.DataFrame(posts_data)

with open(COMMENTS_FILE, "r", encoding="utf-8") as f:
    comments_data = json.load(f)
df_comments = pd.DataFrame(comments_data)

# Data successfully loaded
print("Data Loaded Successfully")


Data Loaded Successfully


# 🧹 Data Cleaning & Transformation

This section does the following:
- Handles **missing values**  
- Converts timestamps to **datetime format**  
- Removes **duplicate posts and comments**  
- Ensures all comments **link to a valid post**  
- **Filters comments so that only those mentioning 'Trump' remain**, to match the research question
- Adds **sentiment analysis** scores to comments  

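To make the 'Trump' filter concrete, here is a minimal sketch of how the whole-word, case-insensitive pattern used in the cleaning cell behaves. The sample strings and the `mentions_trump` helper are made up for illustration; they are not part of the dataset or of the notebook's code.

```python
import re

# Same pattern the cleaning cell passes to str.contains
TRUMP_PATTERN = re.compile(r"\bTrump\b", flags=re.IGNORECASE)


def mentions_trump(text: str) -> bool:
    """Hypothetical helper: True if `text` contains 'Trump' as a whole word."""
    return TRUMP_PATTERN.search(text) is not None


# Made-up sample comments (not taken from the dataset)
samples = [
    "Trump is number 1!!!",
    "I disagree with trump's policy",
    "Trumpism is a movement",
    "He played the trump card",
    "No mention here",
]

for sample in samples:
    print(f"{mentions_trump(sample)!s:<5} | {sample}")
```

The word boundaries exclude longer words such as "Trumpism", but the common noun "trump" still matches, so a small number of false positives is possible.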


In [108]:
# 🚀 Step 2: Data Cleaning (Filtering for Trump-Related Comments)
# (re, nltk and SentimentIntensityAnalyzer are already imported in the Libraries cell)

# Download NLTK VADER sentiment analysis tool
nltk.download("vader_lexicon")
sia = SentimentIntensityAnalyzer()

# 🧹 Handle Missing Values
df_posts.fillna("", inplace=True)
df_comments.fillna("", inplace=True)

# Convert Date Fields to Datetime
df_posts["created_utc"] = pd.to_datetime(df_posts["created_utc"])
df_comments["created_utc"] = pd.to_datetime(df_comments["created_utc"])

# Add 'subreddit' to Comments Table
df_comments = df_comments.merge(df_posts[['id', 'subreddit']], left_on="post_id", right_on="id", how="left")

# Drop the duplicate 'id' column (from df_posts) since 'post_id' is already in df_comments
df_comments.drop(columns=["id"], inplace=True)

# Remove Duplicate Posts
df_posts.drop_duplicates(subset=["id"], inplace=True)

# Remove Duplicate Comments
df_comments.drop_duplicates(subset=["comment_id"], inplace=True)

# Ensure Foreign Key Consistency
df_comments = df_comments[df_comments["post_id"].isin(df_posts["id"])]

# Filter Comments: Keep only those mentioning "Trump" as a whole word (case-insensitive)
df_comments = df_comments[df_comments["body"].str.contains(r'\bTrump\b', flags=re.IGNORECASE, regex=True, na=False)]

# Add Sentiment Analysis: VADER compound score, from -1 (most negative) to +1 (most positive)
df_comments["comment_sentiment"] = df_comments["body"].apply(lambda text: sia.polarity_scores(text)["compound"])

# Verify That No Duplicates Remain in the DataFrames
duplicate_posts = df_posts[df_posts.duplicated(subset=["id"], keep=False)]
duplicate_comments = df_comments[df_comments.duplicated(subset=["comment_id"], keep=False)]


print(df_comments.head())

if duplicate_posts.empty:
    print("\n✔️ No duplicate posts found.")
else:
    print("\n⚠️ Duplicate posts detected!")
    print(duplicate_posts)

if duplicate_comments.empty:
    print("\n✔️ No duplicate comments found.")
else:
    print("\n⚠️ Duplicate comments detected!")
    print(duplicate_comments)




[nltk_data] Downloading package vader_lexicon to
[nltk_data]     /home/datahub/nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


   post_id comment_id                                               body  \
0  1d4emcb    l6e9yus  So all those who still believe James Buchanan ...   
2  1d4emcb    l6e9bug  My favorite is Fox News saying this was all or...   
3  1d4emcb    l6dqpw7         Just here for HISTORY!!!!\n\n\nFUCK TRUMP.   
5  1d4emcb    l6dt95e  Trump is number 1!!!  First to lose the popula...   
6  1d4emcb    l6dti72  His Wikipedia page has been updated as well.  ...   

   score         created_utc subreddit  comment_sentiment  
0    113 2024-05-30 23:08:29  politics            -0.7717  
2    145 2024-05-30 23:04:25  politics            -0.5859  
3    994 2024-05-30 21:12:19  politics            -0.7507  
5     57 2024-05-30 21:26:54  politics            -0.7887  
6    110 2024-05-30 21:28:21  politics             0.7783  

✔️ No duplicate posts found.

✔️ No duplicate comments found.


# 💾 Database Design

This section does the following:
- **Defines the SQLite database structure**
- **Creates tables (`posts` & `comments`) with foreign key relationships**
- **Stores the cleaned data in the database**

### Database Structure

I store the data in **`data/reddit_data.db`**.

| **Table**   | **Columns** | **Primary Key** | **Foreign Key** |
|------------|------------|----------------|----------------|
| **posts**  | `id, subreddit, title, score, num_comments, created_utc, text, url` | `id` | - |
| **comments** | `comment_id, post_id, body, score, created_utc, subreddit, comment_sentiment` | `comment_id` | `post_id` (FK → posts.id) |

---
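One caveat on the foreign key listed in the table: SQLite only enforces foreign key constraints on connections where `PRAGMA foreign_keys = ON` has been issued, and enforcement is normally off by default. The sketch below uses an in-memory database and a trimmed-down version of the two tables (fewer columns, made-up rows) to show the difference. The `orphan_insert_allowed` helper is hypothetical and only serves the illustration.

```python
import sqlite3

# Trimmed-down stand-ins for the posts and comments tables
SCHEMA = """
CREATE TABLE posts (
    id TEXT PRIMARY KEY,
    title TEXT
);
CREATE TABLE comments (
    comment_id TEXT PRIMARY KEY,
    post_id TEXT,
    body TEXT,
    FOREIGN KEY (post_id) REFERENCES posts (id)
);
"""


def orphan_insert_allowed(enforce: bool) -> bool:
    """Return True if a comment pointing at a missing post can be inserted."""
    conn = sqlite3.connect(":memory:")
    try:
        # The pragma applies per connection; it is set explicitly here
        # so both cases are unambiguous.
        conn.execute(f"PRAGMA foreign_keys = {'ON' if enforce else 'OFF'};")
        conn.executescript(SCHEMA)
        conn.execute("INSERT INTO posts VALUES ('p1', 'A made-up post');")
        try:
            # 'missing' does not exist in posts.id
            conn.execute(
                "INSERT INTO comments VALUES ('c1', 'missing', 'orphan comment');"
            )
        except sqlite3.IntegrityError:
            return False
        return True
    finally:
        conn.close()


print("Enforcement off, orphan accepted:", orphan_insert_allowed(enforce=False))
print("Enforcement on, orphan accepted: ", orphan_insert_allowed(enforce=True))
```

With enforcement off the orphan row is accepted, and with it on the insert raises `sqlite3.IntegrityError`. This is why an explicit foreign key check after loading is still useful even when the constraint is declared.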



# 📥 Database Creation

In [109]:
# 📥 Step 4: Database Creation

# Define database path
DB_PATH = os.path.join(BASE_DIR, "data", "reddit_data.db")

# Connect to SQLite and create tables
conn = sqlite3.connect(DB_PATH)
cursor = conn.cursor()

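# Drop any existing tables so the notebook can be re-run from scratch.
# 'comments' is dropped first because it references 'posts'.
cursor.execute("DROP TABLE IF EXISTS comments;")
cursor.execute("DROP TABLE IF EXISTS posts;")

# Create 'posts' table
cursor.execute("""
CREATE TABLE IF NOT EXISTS posts (
    id TEXT PRIMARY KEY,
    subreddit TEXT,
    title TEXT,
    score INTEGER,
    num_comments INTEGER,
    created_utc DATETIME,
    text TEXT,
    url TEXT
);
""")

# Create 'comments' table with sentiment analysis
# ('subreddit' is included because the cleaning step adds it to df_comments)
cursor.execute("""
CREATE TABLE IF NOT EXISTS comments (
    comment_id TEXT PRIMARY KEY,
    post_id TEXT,
    body TEXT,
    score INTEGER,
    created_utc DATETIME,
    subreddit TEXT,
    comment_sentiment REAL,
    FOREIGN KEY (post_id) REFERENCES posts (id)
);
""")

# Insert data into SQLite database.
# if_exists="append" writes into the tables defined above and keeps their keys;
# if_exists="replace" would drop them and recreate the tables without constraints.
df_posts.to_sql("posts", conn, if_exists="append", index=False)
df_comments.to_sql("comments", conn, if_exists="append", index=False)

conn.commit()
print("✅ Database Creation & Data Insertion Completed!")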




✅ Database Creation & Data Insertion Completed!


## ✅ Quality Check

The code below checks data quality by:
- Checking record counts
- Validating foreign key relationships
- Inspecting sentiment score distribution

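The cell that follows covers the first two checks. For the third, one possible approach is to bucket the VADER compound score with the commonly used cut-offs (0.05 and above as positive, -0.05 and below as negative, anything in between as neutral). The sketch below runs that query against an in-memory table holding only the five scores printed in the cleaning output above. The table is a stand-in for `comments`, and the cut-offs are an assumption rather than something this notebook defines.

```python
import sqlite3

# Five compound scores copied from the df_comments.head() output; a stand-in
# for the full comments table, used only to keep the sketch self-contained.
sample_scores = [-0.7717, -0.5859, -0.7507, -0.7887, 0.7783]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE comments (comment_sentiment REAL);")
conn.executemany(
    "INSERT INTO comments (comment_sentiment) VALUES (?);",
    [(score,) for score in sample_scores],
)

# Assumed cut-offs: thresholds commonly used with VADER compound scores.
query = """
SELECT
    CASE
        WHEN comment_sentiment >= 0.05 THEN 'positive'
        WHEN comment_sentiment <= -0.05 THEN 'negative'
        ELSE 'neutral'
    END AS label,
    COUNT(*) AS n,
    ROUND(AVG(comment_sentiment), 4) AS mean_score
FROM comments
GROUP BY label
ORDER BY label;
"""
distribution = {label: (n, mean) for label, n, mean in conn.execute(query)}
conn.close()

for label, (n, mean) in distribution.items():
    print(f"{label}: n={n}, mean={mean}")
```

Against the real database, the same query string could be passed to `pd.read_sql_query` with the open `conn`, before the connection is closed.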

In [112]:
# Check Table Row Counts
print("\n📊 Table Row Counts:")
print("Posts:", pd.read_sql_query("SELECT COUNT(*) FROM posts;", conn).iloc[0, 0])
print("Comments:", pd.read_sql_query("SELECT COUNT(*) FROM comments;", conn).iloc[0, 0])

# Validate Foreign Keys (Ensure All Comments Link to Valid Posts)
invalid_comments = pd.read_sql_query("""
    SELECT COUNT(*) FROM comments c
    LEFT JOIN posts p ON c.post_id = p.id
    WHERE p.id IS NULL;
""", conn).iloc[0, 0]

if invalid_comments == 0:
    print("\n✅ Foreign Key Check: All comments have valid posts.")
else:
    print(f"\n⚠️ Warning: {invalid_comments} comments have no associated post!")



# Close Database Connection
conn.close()
print("\n✅ Data Quality Check Completed!")






📊 Table Row Counts:
Posts: 841
Comments: 19132

✅ Foreign Key Check: All comments have valid posts.

✅ Data Quality Check Completed!
