# Data Wrangling

## Import Libraries

In [3]:
import praw
import pandas as pd
import time

Create the Reddit connection using the praw.ini file in this project folder.

In [2]:
reddit = praw.Reddit()
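Calling `praw.Reddit()` with no arguments relies on PRAW finding the credentials in a `praw.ini` file. As a rough sketch of what such a file contains, the snippet below holds a template with placeholder values and checks that the keys PRAW needs for read-only access are present. The helper name `missing_praw_keys` and all values are illustrative assumptions, not the contents of the actual file used in this project.

```python
import configparser

# Placeholder values only; real credentials come from the Reddit app settings.
PRAW_INI_TEMPLATE = """\
[DEFAULT]
client_id = YOUR_CLIENT_ID
client_secret = YOUR_CLIENT_SECRET
user_agent = data-wrangling-notebook by u/YOUR_USERNAME
"""


def missing_praw_keys(ini_text, site="DEFAULT"):
    """Return the required keys that are absent from the given praw.ini text."""
    parser = configparser.ConfigParser()
    parser.read_string(ini_text)
    section = parser[site]
    required = ("client_id", "client_secret", "user_agent")
    return [key for key in required if key not in section]


# An empty list means the template has every key checked above
print(missing_praw_keys(PRAW_INI_TEMPLATE))
```

Keeping the credentials in `praw.ini` rather than in the notebook means the notebook can be shared without exposing them, provided the file itself is not committed.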

## Reddit Data Acquisition

#### Collecting data from Reddit through the API credentials in our praw.ini file. For each of 5 subreddits we collect the title of the post, the question asked (the post body), the subreddit, the answer (the top-ranked comment), and the time the post was created.

#### *NOTE: The code shown here is the refined version after several iterations of trying to read in the Reddit data. I was not able to obtain all 5000 rows in one attempt because the run kept timing out at about 800-1000 rows. In practice I had to run this process 5-6 different times in smaller, slower batches. Had I not run into the time-out issue, or had I been willing to wait several hours for the pull, I would have run it as shown below, as originally intended. The messy intermediate code is left out: it was not wanted in the last project, and it would not make sense next to this notebook.*
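One way to make a long pull more tolerant of the time-outs mentioned in the note is to retry a failed step with an increasing wait. The helper below is a minimal sketch of that idea, not the code that was actually used for this dataset. The name `with_retries` and its parameters are hypothetical, and `fetch` stands for any zero-argument function, such as one that collects a single subreddit.

```python
import time


def with_retries(fetch, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch() until it succeeds or max_attempts is reached.

    The wait doubles after each failure (base_delay, 2 * base_delay, ...).
    The last failure is re-raised so the caller still sees the error.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch()
        except Exception as exc:
            if attempt == max_attempts:
                raise
            delay = base_delay * 2 ** (attempt - 1)
            print(f"Attempt {attempt} failed ({exc}); retrying in {delay:.0f}s")
            sleep(delay)
```

A call could look like `with_retries(lambda: collect_subreddit('techsupport'))`, where `collect_subreddit` is a hypothetical function wrapping the per-subreddit loop. Saving the rows after each subreddit would also let a failed run resume without starting over.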

In [None]:
subreddits = ['techsupport', 'askphilosophy', 'askculinary', 'askacademia', 'askstatistics']

# Collect one dict per post; DataFrame.append was removed in pandas 2.0
rows = []

posts_per_subreddit = 1100

for subreddit_name in subreddits:
    subreddit = reddit.subreddit(subreddit_name)

    collecting_posts = subreddit.hot(limit=posts_per_subreddit)

    for post in collecting_posts:
        # Skip posts flagged as over 18
        if not post.over_18:
            # Drop the "load more comments" placeholders
            post.comments.replace_more(limit=0)
            highest_ups = -1
            best_comment = ""
            # Keep the top-level comment with the most upvotes as the answer
            for comment in post.comments:
                if comment.ups > highest_ups:
                    best_comment = comment.body
                    highest_ups = comment.ups

            rows.append({
                'title': post.title,
                'selftext': post.selftext,
                'subreddit': post.subreddit.display_name,
                'comments': best_comment,
                'created_utc': post.created_utc
            })
        # Pause between posts to slow down the requests
        time.sleep(1.0)

# Build the dataframe once, after all posts have been collected
data = pd.DataFrame(rows, columns=['title', 'selftext', 'subreddit', 'comments', 'created_utc'])
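The inner loop above picks the comment with the most upvotes. The same selection can be sketched more compactly with `max()`. The function name `top_comment_body` is hypothetical, and the `SimpleNamespace` objects are stand-ins that carry only the two attributes the loop reads (`.ups` and `.body`), so the example runs without a Reddit connection.

```python
from types import SimpleNamespace


def top_comment_body(comments):
    """Return the body of the comment with the most upvotes, or "" if there are none."""
    best = max(comments, key=lambda c: c.ups, default=None)
    return best.body if best is not None else ""


# Stand-in comments for illustration only
sample = [
    SimpleNamespace(ups=3, body="Try restarting the router."),
    SimpleNamespace(ups=10, body="Update the network driver first."),
    SimpleNamespace(ups=10, body="Same score, but posted later."),
]

print(top_comment_body(sample))
print(repr(top_comment_body([])))
```

On ties `max()` keeps the first comment it sees, which matches the strict `>` comparison in the loop. One difference: the loop starts from `highest_ups = -1`, so it returns an empty string when every comment has a negative score, whereas this sketch returns the least-downvoted comment.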

We save the dataframe to a CSV file as a safeguard, so that we don't overwrite anything or lose work to a later mistake. We then read this CSV file back in when obtaining the ChatGPT responses.

In [None]:
data.to_csv('reddit_only_dataset.csv', index=False)

## OpenAI Data Acquisition

In [None]:
import openai

At this point the dataframe below holds only the Reddit information; the ChatGPT responses will be added next.

In [6]:
merged_df = pd.read_csv('reddit_only_dataset.csv')
merged_df.head()

| Unnamed: 0.1 | Unnamed: 0 | title | selftext | subreddit | comments | created_utc |
|---|---|---|---|---|---|---|
| 0 | 0 | Recommended wiki articles (including malware r... | ## Check out these recommended threads on our ... | techsupport | | 1627263000.0 |
| 1 | 1 | Another update on our future | Dear /r/TechSupport visitors and subscribers,\... | techsupport | Discord is definitely not a proper alternative... | 1690183000.0 |
| 2 | 2 | Help for my 92yo blind grandad | Hi Reddit I hope this might reach the right pe... | techsupport | I know this isn't what you want to hear, but I... | 1695145000.0 |
| 3 | 3 | Moving countries and needing to be sure “illeg... | Heyho, I’m moving countries in a couple of mon... | techsupport | Most data isn’t deleted when you click delete.... | 1695140000.0 |
| 4 | 4 | Insanely High PC DPC-Count Latency - How To Fix? | I've been trying to record on FL Studio using ... | techsupport | | 1695148000.0 |
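The `Unnamed: 0` style columns in the output above are what pandas produces when a CSV that was written with its index is read back in. The sketch below reproduces the effect on a tiny made-up frame and shows two ways to avoid or undo it. The data and variable names are illustrative assumptions, not rows from the real dataset.

```python
import io

import pandas as pd

# Tiny made-up frame standing in for the Reddit data
df = pd.DataFrame({"title": ["a", "b"], "subreddit": ["techsupport", "askculinary"]})

# Default to_csv writes the index as a first column with an empty header
with_index = io.StringIO()
df.to_csv(with_index)
with_index.seek(0)
reloaded = pd.read_csv(with_index)
print(list(reloaded.columns))

# Writing with index=False keeps the round trip clean
without_index = io.StringIO()
df.to_csv(without_index, index=False)
without_index.seek(0)
clean = pd.read_csv(without_index)
print(list(clean.columns))

# For a file that already carries such columns, drop them by name prefix
repaired = reloaded.loc[:, ~reloaded.columns.str.startswith("Unnamed")]
print(list(repaired.columns))
```

Each additional save and reload with the index included adds another such column, which is consistent with seeing two of them here.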


I then set up my OpenAI connection. The API key has been removed here for security purposes.

In [None]:
openai.api_key = 'MY_API_KEY'
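One common way to keep the key out of the notebook entirely is to read it from an environment variable. The helper below is a minimal sketch of that approach. The function name `load_api_key` is hypothetical, and the variable name `OPENAI_API_KEY` is assumed to have been set in the shell before the notebook was started.

```python
import os


def load_api_key(env_var="OPENAI_API_KEY"):
    """Read the API key from the environment and fail early if it is missing."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable before running this cell")
    return key
```

The assignment in the cell above would then become `openai.api_key = load_api_key()`, and no placeholder needs to be edited before sharing the notebook.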

In [None]:
total_token_usage = 0
count = 0

The for loop below feeds the questions from the selftext column to OpenAI and fills each answer into the new chatgpt_response column once it comes back.

A few things to note: I had to use iloc to make a few hundred calls at a time, so that I wasn't racking up an enormous bill and wouldn't face time-out issues. I also had to go back afterwards and rerun several chunks of rows that the API had missed.

Furthermore, I tracked total token use after every call and counted the calls, to monitor the speed of the requests and the number of tokens per request, and to see how far along I was instead of running blind.
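The "few hundred calls at a time" approach amounts to walking the dataframe in fixed-size position ranges. A small sketch of how those `iloc` bounds could be generated is shown here. The function name `chunk_bounds` and the chunk size of 300 are illustrative assumptions, not the values actually used.

```python
def chunk_bounds(start, stop, chunk_size):
    """Yield (lo, hi) pairs that cover the positions from start up to stop."""
    for lo in range(start, stop, chunk_size):
        yield lo, min(lo + chunk_size, stop)


# Print the slices that would be processed one run at a time
for lo, hi in chunk_bounds(1, 1000, 300):
    print(f"merged_df.iloc[{lo}:{hi}]")
```

Running one slice per session and saving the dataframe after each one keeps both the cost and the damage from a time-out bounded to a single chunk.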

In [7]:
# Loop over the selected slice of the dataframe, one request per row
for index, row in merged_df.iloc[1:5000].iterrows():
    # Count the requests so we can throttle and report progress
    count += 1
    # Use the selftext column as the question for ChatGPT
    question = row['selftext']

    # try/except so that one failed request does not stop the loop
    try:
        # Make an API request
        chatgpt_response = openai.Completion.create(
            # Using text-babbage-001 since it was extremely cheap
            model='text-babbage-001',
            prompt=question,
            temperature=0.6,
            max_tokens=50  # Adjust max tokens as needed
        )

        # Extract and save the response
        merged_df.at[index, 'chatgpt_response'] = chatgpt_response.choices[0].text.strip()

        # Monitor the number of tokens used
        token_usage = chatgpt_response['usage']['total_tokens']
        total_token_usage += token_usage

        # I had to add this to slow down my API requests and avoid a big bill
        if count % 60 == 0:
            time.sleep(60)  # Wait for 60 seconds every 60 requests

        # I had to add this as well to slow down the requests
        if total_token_usage >= 250000:
            time.sleep(60)  # Wait for 60 seconds
        print(f"Row: {count} is complete")

    except Exception as e:
        # Notify me when a row doesn't work as intended
        print(f"Error processing row {index}: {str(e)}")
        continue

# Print the total token usage after processing all requests
print(f"Total Token Usage: {total_token_usage}")
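In the loop above `total_token_usage` is cumulative and never resets, so once it passes 250000 every later successful request waits 60 seconds. If the intent was a per-minute budget, a throttle with a resetting window is one way to express it. The class below is a sketch under that assumption. `MinuteThrottle` and its parameters are hypothetical names, and the limits are simply the numbers that appear in the loop.

```python
import time


class MinuteThrottle:
    """Pause when a per-minute request or token budget is used up (illustrative)."""

    def __init__(self, max_requests=60, max_tokens=250_000,
                 clock=time.monotonic, sleep=time.sleep):
        self.max_requests = max_requests
        self.max_tokens = max_tokens
        self.clock = clock
        self.sleep = sleep
        self._reset()

    def _reset(self):
        # Start a fresh 60-second window with empty counters
        self.window_start = self.clock()
        self.requests = 0
        self.tokens = 0

    def record(self, tokens_used):
        """Register one finished request and wait if a budget is exhausted."""
        if self.clock() - self.window_start >= 60:
            self._reset()
        self.requests += 1
        self.tokens += tokens_used
        if self.requests >= self.max_requests or self.tokens >= self.max_tokens:
            # Sleep only for what is left of the current window
            remaining = 60 - (self.clock() - self.window_start)
            self.sleep(max(0.0, remaining))
            self._reset()
```

Inside the loop, the two `if` checks could then be replaced by a single `throttle.record(token_usage)` call, with `throttle = MinuteThrottle()` created once before the loop.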



At this point I want to check the null values to see which rows were skipped. After checking how many rows never received a ChatGPT response, I would re-sort the rows and then rerun the for loop above with iloc on just those rows to re-ask the questions.

In [None]:
merged_df.isnull().sum()

In [None]:
merged_df = merged_df.sort_values(by='chatgpt_response', ascending=False)
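As an alternative to sorting and then slicing by position, the rows that still lack a response can be selected directly by their index labels. The snippet below sketches this on a small made-up frame. The data is invented, and the placeholder string stands in for the real API call.

```python
import pandas as pd

# Made-up frame in which two rows were skipped by the API
df = pd.DataFrame({
    "selftext": ["question 1", "question 2", "question 3"],
    "chatgpt_response": ["answer 1", None, None],
})

# Index labels of the rows that still have no response
missing_labels = df.index[df["chatgpt_response"].isna()].tolist()
print(f"{len(missing_labels)} rows still need a response")

# Revisit only those rows; a real run would call the API here instead
for index in missing_labels:
    question = df.at[index, "selftext"]
    df.at[index, "chatgpt_response"] = f"(placeholder answer to {question})"

print(df["chatgpt_response"].isna().sum())
```

Because the selection uses labels rather than positions, it keeps working regardless of how the dataframe is sorted.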

Once I had obtained all of the data I needed, I saved and exported it to a new CSV file and moved on to the next notebook for cleaning.

In [None]:
# index=False avoids adding another "Unnamed: 0" column on the next read
merged_df.to_csv('reddit_chatgpt_full_data.csv', index=False)