# **Reddit API Test**

In [None]:
import praw
import pandas as pd
import time
import json
import sys

In [None]:
# append path to credentials
sys.path.append('c:\\Users\\3leso\\Documents\\Elena\\Uni\\MasterThesis')
from credentials import CLIENT_ID, CLIENT_SECRET, USER_AGENT

In [3]:
users = ["Process-Lumpy", "mothertoker", "emilheu", "Suspiciouspackages1", "MidwestMonster2"] 

users_all = pd.read_csv("output/users_all.csv")
users_all

Unnamed: 0,user
0,Process-Lumpy
1,mothertoker
2,emilheu
3,Suspiciouspackages1
4,MidwestMonster2
...,...
775794,drizzystonks
775795,shinichiblue
775796,Estivenrex18
775797,TexasWithADollarsign


In [None]:
# Draw a reproducible random sample of 100 users for testing
user_sample = users_all.sample(n=100, random_state=32)
user_sample

Unnamed: 0,user
240230,BouncingNeuron
726267,aliengoods1
217615,AdreNa1ine25
574269,BostonBopper
246582,Numerous_Photograph9
...,...
717406,CarrierKid
517384,TheIABConnection
451703,Anonymou_brendan
14208,GladiusMortis


***
***
Do You Need to Handle Rate Limits When Using PRAW?

No, you don't need to manually handle rate limits when using PRAW. PRAW automatically respects Reddit's API rate limits and will pause or retry requests as needed. However, understanding how PRAW manages rate limits and how Reddit enforces them is important, especially for large-scale data collection like your case with 776,000 users.

How PRAW Handles Rate Limits

    Automatic Handling:

        PRAW paces requests automatically. When Reddit returns a rate limit error (e.g., "You're doing that too much. Try again in X seconds"), PRAW waits and retries, provided the requested wait is short enough.

        You can configure the ratelimit_seconds parameter to set the longest wait PRAW will accept. If Reddit asks for a longer wait, PRAW raises an exception instead of sleeping. For example:

        reddit = praw.Reddit(client_id="YOUR_CLIENT_ID",
                             client_secret="YOUR_CLIENT_SECRET",
                             user_agent="YOUR_USER_AGENT",
                             ratelimit_seconds=300)

    Batch Requests:

        PRAW often bundles multiple objects (e.g., submissions or comments) into a single request, which helps optimize API usage.

    Rate Limit Information:

        You can access rate limit details via reddit.auth.limits, which provides information such as remaining requests and reset timestamps.
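
As a rough illustration of reading those details, the sketch below summarizes a limits dictionary. It assumes the dictionary carries the keys `remaining`, `reset_timestamp`, and `used`, as `reddit.auth.limits` does in PRAW 7, and that the values can be `None` before the first request. `describe_limits` is a hypothetical helper, not part of PRAW, and the example dictionary is made up so the code runs without credentials.

```python
import time


def describe_limits(limits, now=None):
    """Summarize a rate-limit dict such as the one from reddit.auth.limits.

    Assumed keys: 'remaining', 'reset_timestamp', 'used' (values may be None
    before any request has been made).
    """
    now = time.time() if now is None else now
    remaining = limits.get("remaining")
    reset_timestamp = limits.get("reset_timestamp")
    if remaining is None or reset_timestamp is None:
        return "No rate-limit information yet (no request made)."
    seconds_left = max(0.0, reset_timestamp - now)
    return f"{remaining:.0f} requests remaining, window resets in {seconds_left:.0f} s"


# Stand-in for reddit.auth.limits so the example runs without credentials
example_limits = {"remaining": 87.0, "reset_timestamp": 1_700_000_300.0, "used": 13}
print(describe_limits(example_limits, now=1_700_000_000.0))
# prints: 87 requests remaining, window resets in 300 s
```

With a live session, the same helper could be called as `describe_limits(reddit.auth.limits)` after at least one request has been made.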

Reddit's API Rate Limits

    Authenticated Requests (OAuth):

        100 requests per minute per OAuth client ID.

        Averaged over a 10-minute window, allowing bursts of up to 1,000 requests in 10 minutes.

    Unauthenticated Requests:

        Limited to 10 requests per minute.

    Special Rate Limits:

        Reddit may enforce additional limits for certain actions (e.g., commenting, banning users), which are not documented but handled by PRAW.
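
To see what these limits mean for run time, the sketch below converts a per-minute request budget into hours. The function name `estimate_hours` is hypothetical. The per-user request counts are assumptions for illustration: 1 is the best case, 2 corresponds to one request each for a user's submissions and comments, and 20 assumes both listings are paged to the end at roughly 100 items per request.

```python
def estimate_hours(n_users, requests_per_user, requests_per_minute=100):
    """Return the minimum run time in hours under a fixed request rate."""
    total_requests = n_users * requests_per_user
    return total_requests / requests_per_minute / 60


n_users = 776_000  # approximate size of the full user list
for requests_per_user in (1, 2, 20):  # assumed request counts per user
    hours = estimate_hours(n_users, requests_per_user)
    print(f"{requests_per_user:>2} request(s)/user: {hours:,.0f} h ({hours / 24:.1f} days)")
```

At one request per user this gives about 129 hours. The total grows linearly with the number of requests each user needs.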

Handling 776,000 Users

Given the scale of your task, here’s how you can efficiently collect data while staying within rate limits:

Steps to Optimize Your Workflow

    Use OAuth Authentication:

        Ensure your app is authenticated with OAuth to get the higher rate limit (100 requests/minute).

    Track Progress:

        Use a counter to keep track of processed users and log progress periodically.

    Parallel Processing:

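        Threads that share one OAuth client ID also share its rate limit, so they do not increase throughput. PRAW is not thread-safe, so each thread or process needs its own Reddit instance. Check Reddit's API terms before spreading the task across several client IDs, since that may count as circumventing the limit.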

    Pause on Rate Limits:

        Let PRAW handle rate limits automatically, but monitor reddit.auth.limits for real-time feedback on remaining requests.

    Retry Logic:

        Implement retry logic with exponential backoff if you encounter API errors or unexpected delays.
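
The retry step above could look roughly like the following. This is a minimal sketch, not PRAW functionality: `call_with_backoff` and its parameters are hypothetical names, and it retries on any `Exception`, which is broader than a real collector would want. The `sleep` argument exists only so the demonstration can record delays instead of waiting.

```python
import time


def call_with_backoff(func, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call func(), retrying on any Exception with exponentially growing delays.

    Delays are base_delay, 2*base_delay, 4*base_delay, ... seconds.
    The last exception is re-raised once max_attempts is exhausted.
    """
    for attempt in range(max_attempts):
        try:
            return func()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)


# Demonstration with a stand-in function that fails twice, then succeeds
calls = {"count": 0}


def flaky():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("temporary failure")
    return "ok"


delays = []
result = call_with_backoff(flaky, sleep=delays.append)  # record delays instead of sleeping
print(result, delays)
# prints: ok [1.0, 2.0]
```

In practice it is worth retrying only transient failures such as timeouts or server errors. A 404 for a deleted account will fail the same way on every attempt, so it should be skipped rather than retried.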

Example Code for Large-Scale Data Collection

Here’s a simplified example of how you might process users while respecting rate limits:


***
***

In [None]:


# Authenticate with Reddit API
def authenticate():
    reddit = praw.Reddit(
        client_id=CLIENT_ID,
        client_secret=CLIENT_SECRET,
        user_agent=USER_AGENT,
        ratelimit_seconds=300
    )
    return reddit

# Fetch the set of subreddits a user has posted or commented in
def fetch_user_subreddits(username, reddit):
    try:
        redditor = reddit.redditor(username)
        subreddits = set()

        # limit=None pages through the whole listing, which can take
        # several requests per user
        for submission in redditor.submissions.new(limit=None):
            subreddits.add(submission.subreddit.display_name)

        for comment in redditor.comments.new(limit=None):
            subreddits.add(comment.subreddit.display_name)

        return list(subreddits)

    except Exception as e:
        # Accounts that cannot be fetched (e.g. a 404 response) end up here
        # and are recorded with an empty list
        print(f"Error fetching data for user {username}: {e}")
        return []




# Process users one at a time, saving the results after each user
def process_users(user_list, reddit):
    user_dict = {}
    total = len(user_list)

    for processed_count, username in enumerate(user_list, start=1):
        user_dict[username] = fetch_user_subreddits(username, reddit)

        # Rewrite the output file after every user so that an interrupted
        # run keeps everything fetched so far
        with open('output/user_data.json', 'w') as f:
            json.dump(user_dict, f)

        # Log progress
        print(f"Processed {processed_count}/{total} users.")

    # No explicit sleep is needed: PRAW paces requests on its own
    return user_dict


In [10]:

# Main execution
if __name__ == "__main__":
    reddit = authenticate()
    
    # Example user list (replace with your actual list)
    #user_list = ["user1", "user2", "user3", ...]
    
    process_users(user_sample['user'], reddit)

Version 7.7.1 of praw is outdated. Version 7.8.1 was released Friday October 25, 2024.


Processed 1/100 users.
Processed 2/100 users.
Processed 3/100 users.
Error fetching data for user BostonBopper: received 404 HTTP response
Processed 4/100 users.
Processed 5/100 users.
Processed 6/100 users.
Processed 7/100 users.
Processed 8/100 users.
Processed 9/100 users.
Processed 10/100 users.


KeyboardInterrupt: 

In [12]:

with open('output/user_data.json','r') as f:
    user_dict = json.load(f)

In [13]:
user_dict

{'BouncingNeuron': ['GamePhysics',
  'elderscrollsonline',
  'hiphopheads',
  'systemofadown',
  'PS4',
  'Music',
  'politics',
  'needysluts',
  'modernwarfare',
  'CaliBanging',
  'MapPorn',
  'skyrim'],
 'aliengoods1': ['entertainment',
  'Jokes',
  'WTF',
  'CatastrophicFailure',
  'sports',
  'politics',
  'Artisan',
  'iwatchedanoldmovie',
  'SelfSufficiency',
  'scuba',
  'Whatcouldgowrong',
  'nfl',
  'shutupandtakemymoney',
  'aquaponics',
  'hillaryclinton',
  'science',
  'technology',
  'AskReddit',
  'GreenBayPackers',
  'DIY',
  'childfree',
  'TheStaircase',
  'IDontWorkHereLady',
  'movies',
  'pics',
  'Libertarian',
  'history',
  'HillaryForAmerica',
  'Liberal',
  'EnoughTrumpSpam',
  'news',
  'business',
  'TinyHouses',
  'environment',
  'gameofthrones',
  'enoughsandersspam',
  'lost',
  'atheism',
  'worldnews',
  'Tennesseetitans',
  'electronics',
  'democrats',
  'videos',
  'Bad_Cop_No_Donut',
  'funny',
  'CFB',
  'webdev',
  'MMA',
  'reddit.com',
  'vic

***
***

Key Considerations

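    Time Estimate: At 100 requests/minute, processing 776,000 users would take approximately 129 hours if each user required only one request. The code above issues at least two requests per user (one for submissions, one for comments) and more for active users, so the real figure is at least twice that. Because the limit applies per OAuth client ID, running several threads under one client ID does not raise this ceiling.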

    Ethical Compliance: Ensure you're collecting only publicly available data and adhering to Reddit's API terms of use.

    Monitoring: Use logging or monitoring tools to track progress and detect issues during long-running tasks.
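
For a run of this length, progress tracking can double as a checkpoint, so that an interrupted job resumes where it stopped instead of starting over. The sketch below is one possible shape under stated assumptions: `load_checkpoint`, `process_with_checkpoint`, and the `save_every` parameter are hypothetical names, and `fetch` stands in for any function that maps a username to JSON-serializable data.

```python
import json
import os
import tempfile


def load_checkpoint(path):
    """Return previously saved results, or an empty dict if none exist."""
    if os.path.exists(path):
        with open(path, "r") as f:
            return json.load(f)
    return {}


def process_with_checkpoint(usernames, fetch, path, save_every=50):
    """Fetch data for each user not already present in the checkpoint file.

    `fetch` is any callable mapping a username to a JSON-serializable value.
    """
    results = load_checkpoint(path)
    pending = [u for u in usernames if u not in results]
    for i, username in enumerate(pending, start=1):
        results[username] = fetch(username)
        # Save every `save_every` users and once more at the end
        if i % save_every == 0 or i == len(pending):
            with open(path, "w") as f:
                json.dump(results, f)
    return results


# Demonstration with a stand-in fetch function and a temporary file
fetched = []


def fake_fetch(username):
    fetched.append(username)
    return [f"{username}_subreddit"]


path = os.path.join(tempfile.mkdtemp(), "user_data.json")
process_with_checkpoint(["a", "b"], fake_fetch, path)
process_with_checkpoint(["a", "b", "c"], fake_fetch, path)  # only "c" is fetched
print(fetched)
# prints: ['a', 'b', 'c']
```

Saving every `save_every` users avoids rewriting a growing file on each iteration. The trade-off is that a crash can lose the users fetched since the last save.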

PRAW's automatic rate limit handling makes it well-suited for large-scale data collection tasks like yours!

***
***