To answer questions about public sentiment on cannabis use and its ties to psychosis and schizophrenia, I will collect text from Reddit. Reddit functions as a public forum covering a wide variety of topics, which makes it a good source of text data on discussions of cannabis, schizophrenia, and psychosis.

To get data from the Reddit API, I first made a user account and registered an app. This allowed me to generate a client ID and client secret for my app. My Reddit username and password are also necessary to gain access to the API.
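
Since these four values should not appear in the notebook itself, one option is to read them from environment variables. The following is a minimal sketch, not the method used in this project: the variable names (`REDDIT_CLIENT_ID` and so on) and the helper `load_credentials` are hypothetical.

```python
import os

# Hypothetical environment variable names; pick whatever fits your setup.
REQUIRED_VARS = (
    "REDDIT_CLIENT_ID",
    "REDDIT_CLIENT_SECRET",
    "REDDIT_USERNAME",
    "REDDIT_PASSWORD",
)


def load_credentials(env=os.environ):
    """Return the four credentials as a dict, failing early if any is missing."""
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise KeyError("Missing environment variables: " + ", ".join(missing))
    return {
        "client_id": env["REDDIT_CLIENT_ID"],
        "client_secret": env["REDDIT_CLIENT_SECRET"],
        "username": env["REDDIT_USERNAME"],
        "password": env["REDDIT_PASSWORD"],
    }
```

With the variables set in the shell, `creds = load_credentials()` would supply `creds["client_id"]`, `creds["client_secret"]`, `creds["username"]`, and `creds["password"]` for the token request.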

To start pulling data from the Reddit API, I generate an access token with an HTTP POST request using the `requests` package in Python. Note that I have removed my personal information from this code, so `client_id`, `client_secret`, `username`, and `password` must be defined before running it.

In [58]:
import requests
import requests.auth



client_auth = requests.auth.HTTPBasicAuth(client_id, client_secret)
post_data = {"grant_type": "password", "username": username, "password": password}
headers = {"User-Agent": "DSANProject/1.0 by u/Haunting_River_226"}

response = requests.post("https://www.reddit.com/api/v1/access_token", auth=client_auth, data=post_data, headers=headers)
response_data = response.json()

With this call to the API, the response returns an access token as well as more token information. I will use the access token and token type to construct my API requests.

In [59]:
access_token = response_data['access_token']
token_type = response_data['token_type']
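
If the credentials are wrong, the response JSON may not contain an `access_token` key at all, and the cell above would fail with a bare `KeyError`. A minimal sketch of a more explicit check follows. The helper `extract_token` is hypothetical, and the assumption that a failed request carries an `error` field is mine rather than something shown in this notebook.

```python
def extract_token(payload):
    """Return (token_type, access_token) from a token response dict.

    Raises ValueError with whatever detail is available when no token is present.
    """
    if "access_token" not in payload:
        # Assumption: a failed request describes the problem in an "error" field.
        detail = payload.get("error", payload)
        raise ValueError(f"No access token in response: {detail}")
    return payload["token_type"], payload["access_token"]
```

It could be used as `token_type, access_token = extract_token(response_data)`.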

Now, I can use my access token to construct a header to use for all of my API calls.

In [60]:
# The Authorization value is the token type and the token separated by a space
headers = {"Authorization": f"{token_type} {access_token}", "User-Agent": "DSANProject/1.0 by u/Haunting_River_226"}

Now to get the data, I have chosen three subreddits that will be relevant:
    1. r/Psychosis
    2. r/schizophrenia
    3. r/weed
Each of these subreddits relates to cannabis and/or psychosis, and I will analyze the text to determine whether and how these topics intersect in public conversation.

To get recent data, I will pull up to 10,000 of the top posts from the previous year (October 12, 2022 - October 12, 2023). I use the `/top` endpoint to get the top posts in a given subreddit. The Reddit API returns at most 100 results per request, but I can get more by passing the `after` parameter and setting it to the value of the `after` key in the previous response's JSON. The loop pulls the first 100 posts, then the next 100, and so on until it reaches 10,000 posts or the API stops returning an `after` value.
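
The cursor logic described above can be sketched independently of the network call. In this sketch, `collect_pages` and `fetch_page` are hypothetical names, and `fake_listing` is a made-up stand-in for the API so the example runs offline.

```python
def collect_pages(fetch_page, max_pages=100):
    """Follow the `after` cursor until it runs out or max_pages is reached."""
    pages = []
    after = None  # no cursor on the first request
    for _ in range(max_pages):
        page = fetch_page(after)
        pages.append(page)
        after = page["data"]["after"]  # full name of the last post, e.g. "t3_..."
        if after is None:  # the listing has no further pages
            break
    return pages


# Made-up stand-in for the API: three pages chained by invented cursors.
fake_listing = {
    None: {"data": {"after": "t3_aaa", "children": []}},
    "t3_aaa": {"data": {"after": "t3_bbb", "children": []}},
    "t3_bbb": {"data": {"after": None, "children": []}},
}
pages = collect_pages(lambda after: fake_listing[after])
print(len(pages))  # 3
```

Note that the cursor is passed back unchanged, including its `t3_` prefix, and that the loop stops as soon as the cursor is `None`.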

The Reddit API also limits the number of requests per minute, so I pause between requests with `time.sleep`; a 6-second pause keeps the rate at no more than 10 requests per minute.
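
A fixed pause adds the full delay on top of however long each request takes. As an alternative, here is a minimal sketch of a throttle that only sleeps for the time still remaining in the interval. The `Throttle` class is hypothetical and is not used in the cells of this notebook; the `clock` and `sleep` arguments exist only so the behavior can be checked without waiting.

```python
import time


class Throttle:
    """Space out calls so that at most `per_minute` happen each minute."""

    def __init__(self, per_minute, clock=time.monotonic, sleep=time.sleep):
        self.interval = 60.0 / per_minute  # e.g. 10 per minute -> 6 seconds
        self.clock = clock
        self.sleep = sleep
        self.last = None  # time of the previous call, if any

    def wait(self):
        """Block until at least `interval` seconds have passed since the last call."""
        now = self.clock()
        if self.last is not None:
            remaining = self.interval - (now - self.last)
            if remaining > 0:
                self.sleep(remaining)
                now = self.clock()
        self.last = now
```

Calling `throttle = Throttle(10)` once and then `throttle.wait()` before each request would take the place of `time.sleep(6)`.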

I will start with the r/Psychosis subreddit.

In [36]:
import time

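post_id = None  # no cursor on the first request
data = {}
for i in range(0, 100):
    time.sleep(6)  # about 10 requests per minute
    response = requests.get("https://oauth.reddit.com/r/Psychosis/top.json", params={'t': 'year', 'limit': 100, 'after': post_id}, headers=headers)
    res = response.json()
    data[i] = res
    # Keep the full name (including the "t3_" prefix); `after` expects it
    post_id = res["data"]["after"]
    if post_id is None:  # no more pages to fetch
        break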

Next, we will repeat this process to get data from r/schizophrenia.

In [61]:
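post_id = None  # no cursor on the first request
data_schizophrenia = {}
for i in range(0, 100):
    time.sleep(6)  # about 10 requests per minute
    response = requests.get("https://oauth.reddit.com/r/schizophrenia/top.json", params={'t': 'year', 'limit': 100, 'after': post_id}, headers=headers)
    if response.status_code != 200:
        # Report the failing page and stop instead of parsing an error body
        print(i)
        print(response.status_code)
        break
    res = response.json()
    data_schizophrenia[i] = res
    # Keep the full name (including the "t3_" prefix); `after` expects it
    post_id = res["data"]["after"]
    if post_id is None:  # no more pages to fetch
        break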

Finally, we will repeat this process once more to get data from r/weed.

In [None]:
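post_id = None  # no cursor on the first request
data_cannabis = {}
for i in range(0, 100):
    time.sleep(6)  # same 10-requests-per-minute pace as the other subreddits
    response = requests.get("https://oauth.reddit.com/r/weed/top.json", params={'t': 'year', 'limit': 100, 'after': post_id}, headers=headers)
    res = response.json()
    data_cannabis[i] = res
    # Keep the full name (including the "t3_" prefix); `after` expects it
    post_id = res["data"]["after"]
    if post_id is None:  # no more pages to fetch
        break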
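
Each of the three dictionaries maps a page number to one listing response, so the posts themselves sit one level down, under `data` and then `children`. To check how many posts were actually collected, the pages can be flattened into a single list. This is a minimal sketch: `flatten_posts` is a hypothetical helper, the fields kept (`id`, `title`, `selftext`) are my choice, and `example_pages` is made-up data rather than real API output.

```python
def flatten_posts(pages):
    """Turn a dict of listing pages into one list of small post records."""
    posts = []
    for page in pages.values():
        for child in page["data"]["children"]:
            post = child["data"]
            posts.append({
                "id": post.get("id"),
                "title": post.get("title", ""),
                "text": post.get("selftext", ""),
            })
    return posts


# Made-up page in the same shape as a listing response.
example_pages = {
    0: {"data": {"after": None, "children": [
        {"kind": "t3", "data": {"id": "abc", "title": "Hello", "selftext": "World"}},
    ]}},
}
print(len(flatten_posts(example_pages)))  # 1
```

For example, `len(flatten_posts(data_cannabis))` would give the number of r/weed posts retrieved.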