### Jupyter Notebook for Extracting Subreddit Data with PRAW and Saving Outputs

**Introduction**
This notebook extracts posts and their comments from a subreddit using PRAW (Python Reddit API Wrapper), 
loads them into a pandas DataFrame, and saves the result as JSON, JSONL, CSV, and SQLite files.

### Importing Required Libraries
- `praw`: Python Reddit API Wrapper to interact with Reddit.
- `pandas`: Powerful data manipulation and analysis library, ideal for tabular data.
- `sqlite3`: Python's built-in SQLite database library to store data in relational format.
- `tqdm`: Progress bar library to track iterations in loops.

In [None]:
%pip install praw pandas tqdm

In [None]:
import os
from getpass import getpass
import praw
import pandas as pd
import sqlite3
from tqdm import tqdm
from datetime import datetime

### Prompting User for Reddit Credentials
In Jupyter notebooks, we avoid hardcoding sensitive credentials. Instead, each credential is resolved in this order:
- First, a variable with the same name already defined in the notebook (for example, from an earlier run of the cell).
- If that is missing, the matching environment variable (`REDDIT_CLIENT_ID`, `REDDIT_CLIENT_SECRET`, `REDDIT_USERNAME`).
- Finally, a `getpass` prompt, which hides the input as you type.

Because values from an earlier run are reused, restart your kernel from the menu above if you need to change or reset the credentials after running the cell.
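
The lookup order above can be sketched as a small helper. This is only an illustration: `resolve_credential`, its `namespace` parameter, and its `prompt_fn` parameter are hypothetical names that do not appear in the notebook, and the environment value is a placeholder.

```python
import os
from getpass import getpass


def resolve_credential(name, env_var, prompt, namespace=None, prompt_fn=getpass):
    """Return the first non-empty value: namespace variable, environment variable, or prompt."""
    # Default to the namespace this function is defined in (the notebook's globals)
    namespace = globals() if namespace is None else namespace
    return namespace.get(name) or os.getenv(env_var) or prompt_fn(prompt)


# Placeholder value for illustration only; real credentials should never be hardcoded
os.environ["REDDIT_CLIENT_ID"] = "example-id"

# With no variable named "client_id" in the namespace, the environment variable is used
value = resolve_credential(
    "client_id", "REDDIT_CLIENT_ID", "Enter your Reddit client ID: ", namespace={}
)
print(value)
```

Passing the prompt function as a parameter keeps the helper easy to exercise without an interactive session; the notebook's cell below inlines the same `or` chain instead.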

In [None]:
client_id = globals().get("client_id") or os.getenv("REDDIT_CLIENT_ID") or getpass("Enter your Reddit client ID: ")
client_secret = globals().get("client_secret") or os.getenv("REDDIT_CLIENT_SECRET") or getpass("Enter your Reddit client secret: ")
username = globals().get("username") or os.getenv("REDDIT_USERNAME") or getpass("Enter your Reddit username: ")


# Initialize PRAW
reddit = praw.Reddit(
    client_id=client_id,
    client_secret=client_secret,
    user_agent=f"script:praw:{praw.__version__} (by u/{username})"
)

# This cell will prompt for Reddit API credentials only once.
# If you need to change or reset the credentials, then please restart your kernel from the menu above.

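### Extracting Data from a Subreddit
The function `extract_subreddit_data` fetches the newest posts from a subreddit, together with the text of each post's comments, and returns them as a DataFrame.
Calling `replace_more(limit=0)` removes the unexpanded "load more comments" placeholders, so only the comments returned with the initial request are collected.
`tqdm` displays a progress bar while the submissions are fetched.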

In [None]:
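def extract_subreddit_data(subreddit_name, limit=100):
    """Fetch up to `limit` of the newest posts in a subreddit, with their comments, as a DataFrame."""
    subreddit = reddit.subreddit(subreddit_name)
    data = []
    try:
        # total=limit lets tqdm show a percentage; the listing itself has no known length
        for submission in tqdm(subreddit.new(limit=limit), total=limit, desc=f"Fetching posts from r/{subreddit_name}"):
            # Drop "load more comments" placeholders instead of fetching them
            submission.comments.replace_more(limit=0)
            # Flatten the comment tree into a list of comment bodies
            comments = [comment.body for comment in submission.comments.list()]
            data.append({
                'post_id': submission.id,
                'title': submission.title,
                'selftext': submission.selftext,
                'author': str(submission.author),  # "None" when the author is unavailable
                'created_utc': submission.created_utc,  # Unix time in seconds
                'comments': comments
            })
    except Exception as e:
        # Report the error and fall through, returning whatever was collected so far
        print(f"An error occurred while extracting data from r/{subreddit_name}: {e}")
    return pd.DataFrame(data)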
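
The `created_utc` field is a Unix timestamp in seconds, which is hard to read at a glance. As a minimal sketch, assuming a DataFrame shaped like the function's output, it can be converted to timezone-aware timestamps with `pd.to_datetime`. The sample rows and the `created_at` column name are made up for illustration.

```python
import pandas as pd

# Hypothetical sample rows shaped like the output of extract_subreddit_data
sample = pd.DataFrame({
    "post_id": ["abc123", "def456"],
    "created_utc": [1700000000.0, 1700003600.0],
})

# Interpret the values as seconds since the Unix epoch, in UTC
sample["created_at"] = pd.to_datetime(sample["created_utc"], unit="s", utc=True)
print(sample[["post_id", "created_at"]])
```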

### Saving the Data
Using pandas, we export the data to multiple formats:
- JSON: A standard, human-readable format for data sharing and APIs.
- JSONL: JSON Lines, a machine-readable format where each record is a line, suitable for processing with tools like `grep` or loading into databases like BigQuery.
- CSV: A widely-used format for spreadsheets and data exchange.
- SQLite: A relational database format, useful for structured queries and relationships (e.g., posts and comments).
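
To illustrate the JSONL format described above, the following sketch writes a small, made-up DataFrame as JSON Lines to an in-memory string and reads it back. The sample values are hypothetical; the point is that each record occupies one line and that list-valued columns such as `comments` survive the round trip.

```python
import io

import pandas as pd

# Hypothetical sample with a list-valued column, like the notebook's `comments`
sample = pd.DataFrame({
    "post_id": ["abc123", "def456"],
    "comments": [["first", "second"], []],
})

# With no path given, to_json returns the serialized text
jsonl_text = sample.to_json(orient="records", lines=True)
lines = jsonl_text.strip().split("\n")
print(len(lines))  # one line per record

# Read the records back; each line is parsed as a separate JSON object
restored = pd.read_json(io.StringIO(jsonl_text), lines=True)
print(restored["comments"].tolist())
```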


In [None]:
def save_data(df, base_filename):
    try:
        # Save to JSON
        json_filename = f"{base_filename}.json"
        df.to_json(json_filename, orient='records', indent=4)
        print(f"Data saved to {json_filename}")

        # Save to JSONL
        jsonl_filename = f"{base_filename}.jsonl"
        df.to_json(jsonl_filename, orient='records', lines=True)
        print(f"Data saved to {jsonl_filename}")

        # Save to CSV (each row's list of comments is written as its string representation)
        csv_filename = f"{base_filename}.csv"
        df.to_csv(csv_filename, index=False)
        print(f"Data saved to {csv_filename}")

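        # Save to SQLite: one table for posts, and one row per comment in a second table
        db_filename = f"{base_filename}.db"
        conn = sqlite3.connect(db_filename)
        try:
            df[['post_id', 'title', 'selftext', 'author', 'created_utc']].to_sql('posts', conn, if_exists='replace', index=False)
            comments_data = df.explode('comments')[['post_id', 'comments']].rename(columns={'comments': 'comment'})
            # A post with no comments explodes to a single NaN row; drop those rows
            comments_data = comments_data.dropna(subset=['comment'])
            comments_data.to_sql('comments', conn, if_exists='replace', index=False)
        finally:
            conn.close()  # close the connection even if a write fails
        print(f"Data saved to {db_filename}")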
    except Exception as e:
        print(f"An error occurred while saving data: {e}")
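
Once posts and comments sit in separate tables, they can be joined on `post_id`. The sketch below is an assumption-laden illustration rather than part of the notebook: it builds two tiny tables with made-up rows in an in-memory database, using the same table and column names as `save_data`, and counts the comments per post.

```python
import sqlite3
from contextlib import closing

import pandas as pd

# In-memory database with made-up rows; a real run would connect to the saved .db file
with closing(sqlite3.connect(":memory:")) as conn:
    posts = pd.DataFrame({"post_id": ["abc123", "def456"], "title": ["First", "Second"]})
    comments = pd.DataFrame({"post_id": ["abc123", "abc123"], "comment": ["hi", "hello"]})
    posts.to_sql("posts", conn, index=False)
    comments.to_sql("comments", conn, index=False)

    # LEFT JOIN keeps posts that have no comments, with a count of zero
    counts = pd.read_sql_query(
        """
        SELECT p.post_id, p.title, COUNT(c.comment) AS n_comments
        FROM posts AS p
        LEFT JOIN comments AS c ON c.post_id = p.post_id
        GROUP BY p.post_id, p.title
        ORDER BY p.post_id
        """,
        conn,
    )

print(counts)
```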

### Specifying Subreddit and Running Extraction

In [None]:
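# Strip stray whitespace so it does not end up in the subreddit name or the filenames
subreddit_name = input("Enter the subreddit name (without the r/ prefix): ").strip()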
base_filename = f"subreddit_{subreddit_name.lower()}_{datetime.now().strftime('%Y-%m-%d')}"

# Fetch data and save outputs
data = extract_subreddit_data(subreddit_name, limit=100)
save_data(data, base_filename)