# **COMMENT SCRAPING FROM REDDIT USING REDDIT API**

In this project, we'll use the Python ***Reddit API Wrapper (PRAW)*** to fetch comments from a specific subreddit and save them into CSV files for further analysis. Let's dive into the code and understand how each part contributes to the project.

##**STEP 1: Installing Necessary Libraries**

First, we import all the required libraries. These include ***praw*** for accessing Reddit, ***datetime*** for handling date and time, ***requests*** for making HTTP requests, ***csv*** for writing data to CSV files, and ***os*** for handling file directories.

In [13]:
!pip install praw
import datetime
import requests
import csv
import time
import os
from os.path import join
import praw




##**Step 2: Setting Up Reddit API Credentials**

We set up the Reddit API credentials using PRAW.


In [14]:
reddit = praw.Reddit(client_id='api_client_id',
                     client_secret='client_secret',
                     user_agent='Aditya Bhandari',
                     check_for_async=False)
data_directory = "data"
os.makedirs(data_directory, exist_ok=True)

##**Step 3: Defining Variables**
Here, we define the subreddit we are interested in and initialize an empty list to store the fetched comments.

In [15]:
# Subreddit from which we will get comments

subreddit_name = 'OpenAI'

# List to store comments
comments_data = []

filepath = "D:\\Spring 24 Study\\Research and Communication\\Research Paper  - Aditya and Reina\\Data and code"


##**Step 4: Fetching Comments**
We define a ***function fetch_comments_per_month*** that fetches comments from the specified subreddit for a given month and year.

The function uses ***PRAW*** to interact with Reddit and filter comments based on specific keywords.

In [16]:
def fetch_comments_per_month(subreddit_name, year, month, limit=1000):
    comments_data = []
    start_time = datetime.datetime(year, month, 1)
    if month == 12:
        end_time = datetime.datetime(year + 1, 1, 1)
    else:
        end_time = datetime.datetime(year, month + 1, 1)

    subreddit = reddit.subreddit(subreddit_name)

    # comments containing these keywords will be fetched
    keywords = ['chatgpt', 'gpt', 'ai', 'LLM']

    for submission in subreddit.new(limit=None):
        submission_time = datetime.datetime.fromtimestamp(submission.created_utc)
        if start_time <= submission_time < end_time:
            submission.comments.replace_more(limit=0)
            for comment in submission.comments.list():
                if len(comments_data) < limit:
                    comment_text = comment.body.lower()
                    if any(keyword in comment_text for keyword in keywords):
                        comments_data.append({
                            'body': comment.body,
                            'created_utc': datetime.datetime.fromtimestamp(comment.created_utc).strftime('%Y-%m-%d %H:%M:%S'),
                            'permalink': f"https://reddit.com{comment.permalink}"
                        })
                else:
                    return comments_data
    return comments_data


##**Step 5: Saving Comments to CSV**
We define a function ***save_comments_to_csv*** that saves the fetched comments into a CSV file.

We have new csv file for each month.

In [17]:
def save_comments_to_csv(comments_data, year, month):
    data_directory = "data"
    os.makedirs(data_directory, exist_ok=True)
    csv_file_name = f"comments_{year}_{month:02d}.csv"
    csv_file_path = os.path.join(data_directory, csv_file_name)

    with open(csv_file_path, mode='w', newline='', encoding='utf-8') as file:
        fieldnames = ['body', 'created_utc', 'permalink']
        writer = csv.DictWriter(file, fieldnames=fieldnames)
        writer.writeheader()
        for comment in comments_data:
            writer.writerow(comment)

    print(f"Data saved to {filepath} with {len(comments_data)} comments.")



##**Step 6: Running the Data Collection**
We set the start and end dates and run a loop to fetch and save comments month by month. This ensures that we collect data for the entire period from the start date to the current date.

In [18]:
start_date = datetime.datetime(2024, 1, 1)
end_date = datetime.datetime.now()

current_date = start_date
while current_date < end_date:
    year = current_date.year
    month = current_date.month
    comments_data = fetch_comments_per_month(subreddit_name, year, month)
    save_comments_to_csv(comments_data, year, month)
    if month == 12:
        current_date = datetime.datetime(year + 1, 1, 1)
    else:
        current_date = datetime.datetime(year, month + 1, 1)

Data saved to D:\Spring 24 Study\Research and Communication\Research Paper  - Aditya and Reina\Data and code with 0 comments.
Data saved to D:\Spring 24 Study\Research and Communication\Research Paper  - Aditya and Reina\Data and code with 0 comments.
Data saved to D:\Spring 24 Study\Research and Communication\Research Paper  - Aditya and Reina\Data and code with 0 comments.
Data saved to D:\Spring 24 Study\Research and Communication\Research Paper  - Aditya and Reina\Data and code with 0 comments.
Data saved to D:\Spring 24 Study\Research and Communication\Research Paper  - Aditya and Reina\Data and code with 1000 comments.
Data saved to D:\Spring 24 Study\Research and Communication\Research Paper  - Aditya and Reina\Data and code with 1000 comments.
Data saved to D:\Spring 24 Study\Research and Communication\Research Paper  - Aditya and Reina\Data and code with 1000 comments.
