# How random is `r/random`?

Reddit asks clients to make no more than one request every two seconds (0.5 req/s). Going faster gets a 429 response.


## What a good response looks like (status code 302)
```
$ curl https://www.reddit.com/r/random

<html>
 <head>
  <title>302 Found</title>
 </head>
 <body>
  <h1>302 Found</h1>
  The resource was found at <a href="https://www.reddit.com/r/Amd/?utm_campaign=redirect&amp;utm_medium=desktop&amp;utm_source=reddit&amp;utm_name=random_subreddit">https://www.reddit.com/r/Amd/?utm_campaign=redirect&amp;utm_medium=desktop&amp;utm_source=reddit&amp;utm_name=random_subreddit</a>;
you should be redirected automatically.


 </body>
</html>
```

## What a bad response looks like (status code 429)
```
$ curl https://www.reddit.com/r/random

<!doctype html>
<html>
  <head>
    <title>Too Many Requests</title>
    <style>
      body {
          font: small verdana, arial, helvetica, sans-serif;
          width: 600px;
          margin: 0 auto;
      }

      h1 {
          height: 40px;
          background: transparent url(//www.redditstatic.com/reddit.com.header.png) no-repeat scroll top right;
      }
    </style>
  </head>
  <body>
    <h1>whoa there, pardner!</h1>



<p>we're sorry, but you appear to be a bot and we've seen too many requests
from you lately. we enforce a hard speed limit on requests that appear to come
from bots to prevent abuse.</p>

<p>if you are not a bot but are spoofing one via your browser's user agent
string: please change your user agent string to avoid seeing this message
again.</p>

<p>please wait 4 second(s) and try again.</p>

    <p>as a reminder to developers, we recommend that clients make no
    more than <a href="http://github.com/reddit/reddit/wiki/API">one
    request every two seconds</a> to avoid seeing this message.</p>
  </body>
</html>
```
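The 429 page states how long to wait ("please wait 4 second(s) and try again"), so a scraper could read that number and back off instead of guessing. The sketch below is one way to do that. `suggested_wait` and `WAIT_PATTERN` are hypothetical names, and the pattern is based only on the sample page above, so it may not match other variants of the message.

```python
import re

# Assumption: the wording matches the sample 429 page shown above.
WAIT_PATTERN = re.compile(r"please wait (\d+) second\(s\)")


def suggested_wait(body, default=2.0):
    """Return the wait (in seconds) named in a 429 page, or `default` if none is found."""
    match = WAIT_PATTERN.search(body)
    return float(match.group(1)) if match else default


sample = "<p>please wait 4 second(s) and try again.</p>"
print(suggested_wait(sample))                # wait time taken from the page
print(suggested_wait("<h1>302 Found</h1>"))  # no wait time in the body, falls back to the default
```

In a scraping loop this could be called as `sleep(suggested_wait(r.text))` whenever `r.status_code == 429`.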


## What happens
GET `/r/random` --> 302 (redirect) --> 200 (subreddit page)

I only want the name of the subreddit, which is already in the 302 response, so I don't need to follow the redirect.
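An alternative to slicing the HTML body is to parse the redirect URL itself. A 302 response normally carries the target in its `Location` header, which `requests` exposes as `r.headers.get('Location')`. I have not checked that Reddit always sets it, so treat that as an assumption. The helper below, `subreddit_from_location`, is a hypothetical name and works on any URL string, including the one shown in the 302 body above.

```python
from urllib.parse import urlsplit


def subreddit_from_location(location):
    """Return the subreddit name from a redirect URL, or None if the path isn't /r/<name>."""
    # "/r/Amd/" -> ["r", "Amd"]
    parts = urlsplit(location).path.strip("/").split("/")
    if len(parts) >= 2 and parts[0] == "r" and parts[1]:
        return parts[1]
    return None


url = ("https://www.reddit.com/r/Amd/?utm_campaign=redirect"
       "&utm_medium=desktop&utm_source=reddit&utm_name=random_subreddit")
print(subreddit_from_location(url))
```

Because it ignores the query string entirely, this approach does not depend on the `utm_campaign` parameter being present.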

In [None]:
import pandas as pd
import requests

from time import sleep
from tqdm import tqdm
from random import random

In [None]:
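def parse_http(resp):
  """
  Returns the name of the subreddit from a response to /r/random

  If the status code isn't 302, returns "Error"
  """
  if resp.status_code != 302:
    return "Error"

  # The 302 body links to https://www.reddit.com/r/<name>/?utm_campaign=redirect...
  # Start right after the first '/r/'
  start_idx = resp.text.index('/r/') + len('/r/')
  # Stop before the '/' that precedes the query string
  end_idx = resp.text.index('?utm_campaign=redirect') - 1

  return resp.text[start_idx:end_idx]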

  

In [None]:
sites = []
codes = []

headers = {
    'User-Agent': 'Mozilla/5.0'
}

# Observed: 10 and 100 requests work at 3 seconds / request
#           10 requests work at 2 seconds / request
for _ in tqdm(range(1000), ascii=True):
  # The default User-Agent says the request comes from python requests,
  # which looks like a bot. Changing it fixed everything.
  # https://evanhahn.com/python-requests-library-useragent
  r = requests.get('https://www.reddit.com/r/random',
                   headers=headers,
                   allow_redirects=False)
  if r.status_code == 429:
    print("Got rate limit error")
  sites.append(parse_http(r))
  codes.append(r.status_code)
  # Jitter the sleep a bit to throw off bot detection
  sleep(2 + random())



100%|##########| 1000/1000 [43:44<00:00,  2.62s/it]


In [None]:
# Show the last 10 results
for code, site in list(zip(codes, sites))[-10:]:
  print(code, site)

302 Documentaries
302 MadeInAbyss
302 starcitizen
302 camphalfblood
302 selfhosted
302 MrRobot
302 hajimenoippo
302 Warthunder
302 FifaCareers
302 Pathfinder_Kingmaker


In [None]:
df = pd.DataFrame(list(zip(sites, codes)), columns=['subreddit', 'response_code'])
df.head()

          subreddit  response_code
0           ireland            302
1       videography            302
2  jurassicworldevo            302
3            Glocks            302
4          Worldbox            302


In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 2 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   subreddit      1000 non-null   object
 1   response_code  1000 non-null   int64 
dtypes: int64(1), object(1)
memory usage: 15.8+ KB


In [27]:
from time import time

# Put a Unix timestamp in the filename so each run gets its own file
fname = f'reddit_randomness_{int(time())}.csv'
df.to_csv(fname, index=False)
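
A first look at how random the draws are could start by counting repeats: if some subreddits come up far more often than others, the sampling is probably not uniform. The sketch below uses made-up sample names, not the scraped data, and `repeat_summary` is a hypothetical helper. It drops the `"Error"` placeholder that `parse_http` returns for non-302 responses.

```python
from collections import Counter


def repeat_summary(names, top=3):
    """Summarize how often each subreddit name appears, ignoring "Error" entries."""
    valid = [name for name in names if name != "Error"]
    counts = Counter(valid)
    return {
        "draws": len(valid),                    # successful draws
        "unique": len(counts),                  # distinct subreddits seen
        "most_common": counts.most_common(top)  # most frequent names first
    }


# Illustrative input only; with the notebook's data this would be `sites`.
sample = ["Amd", "ireland", "Amd", "Error", "Glocks", "ireland", "Amd"]
print(repeat_summary(sample))
```

With the notebook's data, `repeat_summary(sites)` would give the same summary for the 1000 collected draws.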