---
title: Data Collection
---

# Text Data

## Reddit

In order to answer questions regarding public sentiment on cannabis usage and its ties to psychosis and schizophrenia, we will get text from Reddit. Reddit functions as a public forum on a large variety of topics, making it a good source for text data featuring discussions on cannabis, schizophrenia, and psychosis.

To get data from the Reddit API, I first made a user account and registered an app. This allowed me to generate a client ID and client secret for my app. My Reddit username and password are also necessary to gain access to the API.

To start pulling data from the Reddit API, I generate an access token with an HTTP POST request using the `requests` package in Python. Note that I have removed my personal information from this code.

In [58]:
import requests
import requests.auth

# Placeholder credentials: replace with your own values and never publish real ones
client_id = 'YOUR_CLIENT_ID'
client_secret = 'YOUR_CLIENT_SECRET'
username = 'YOUR_REDDIT_USERNAME'
password = 'YOUR_REDDIT_PASSWORD'

client_auth = requests.auth.HTTPBasicAuth(client_id, client_secret)
post_data = {"grant_type": "password", "username": username, "password": password}
headers = {"User-Agent": f"DSANProject/1.0 by u/{username}"}

response = requests.post("https://www.reddit.com/api/v1/access_token", auth=client_auth, data=post_data, headers=headers)
response_data = response.json()

With this call to the API, the response returns an access token as well as more token information. I will use the access token and token type to construct my API requests.

In [59]:
access_token = response_data['access_token']
token_type = response_data['token_type']

Now, I can use my access token to construct a header to use for all of my API calls.

In [60]:
# The Authorization value must be "<token type> <token>", separated by a space
headers = {"Authorization": f"{token_type} {access_token}", "User-Agent": f"DSANProject/1.0 by u/{username}"}
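
The token lookup and header construction above can be combined with a check for failed authentication. The following is a minimal sketch, not the code used for this project: the helper name `build_auth_header` is hypothetical, and it assumes that a failed token request returns JSON carrying an `error` field in place of `access_token`.

```python
def build_auth_header(response_data, user_agent):
    """Turn a token response into request headers, failing loudly on errors."""
    if "access_token" not in response_data:
        # Assumption: a failed request carries an "error" field instead of a token.
        reason = response_data.get("error", "unknown error")
        raise ValueError(f"Token request failed: {reason}")
    token_type = response_data.get("token_type", "bearer")
    # The Authorization value is "<type> <token>", separated by a single space.
    return {
        "Authorization": f"{token_type} {response_data['access_token']}",
        "User-Agent": user_agent,
    }


# Example with made-up values (not real credentials).
example = build_auth_header(
    {"access_token": "abc123", "token_type": "bearer"},
    "ExampleApp/1.0",
)
print(example["Authorization"])  # bearer abc123
```

Raising an exception here surfaces a bad password or client secret immediately, rather than as a `KeyError` or a failed request later on.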

Now to get the data, I have chosen three relevant subreddits:

1. r/Psychosis
2. r/schizophrenia
3. r/weed

Each of these subreddits relates to cannabis and/or psychosis, and I will analyze the text to determine if and how these topics intersect in public conversation.

To get recent data, I will pull the top 10,000 posts from the previous year (October 12, 2022 - October 12, 2023). I use the `/top` endpoint to get the top posts in a given subreddit. The Reddit API returns at most 100 results per request, but I can get more than 100 results by passing the `after` parameter and setting it equal to the `after` key in the previous `response` JSON. This starts by pulling the first 100 posts, then gets the next 100 posts, and so on until we have reached 10,000.
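
The cursor-following logic described above can be sketched independently of the network call. This is an illustrative example under stated assumptions: `fetch_all` and `fetch_page` are hypothetical names, and the fake listing below only imitates the `data.after` and `data.children` keys of a real response.

```python
def fetch_all(fetch_page, max_pages=100):
    """Collect listing pages by following the `after` cursor.

    `fetch_page` is a hypothetical callable that takes the current cursor
    (None for the first page) and returns the parsed JSON of one listing page.
    """
    pages = []
    after = None
    for _ in range(max_pages):
        page = fetch_page(after)
        pages.append(page)
        # The cursor is passed back unchanged on the next request.
        after = page["data"]["after"]
        if after is None:
            # No cursor means the listing has no further pages.
            break
    return pages


# Stand-in for the API: three fake pages chained by their cursors.
fake_listing = {
    None: {"data": {"children": ["a", "b"], "after": "t3_b"}},
    "t3_b": {"data": {"children": ["c", "d"], "after": "t3_d"}},
    "t3_d": {"data": {"children": ["e"], "after": None}},
}
pages = fetch_all(lambda cursor: fake_listing[cursor])
print(len(pages))  # 3
```

With `requests`, `fetch_page` could wrap a call to `requests.get(...)` that passes the cursor as the `after` parameter and returns `response.json()`.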

The Reddit API also has stringent limits on the number of requests made per minute, so I'll use a sleep function that limits the API requests to 10 per minute.
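
A fixed sleep is the simplest approach. As an alternative, the pause could be derived from the rate-limit information in each response. The sketch below assumes the response exposes `X-Ratelimit-Remaining` and `X-Ratelimit-Reset` headers. The function name `seconds_to_wait` is hypothetical, and the fallback of 6 seconds mirrors the 10-requests-per-minute pace used here.

```python
def seconds_to_wait(headers, default_delay=6.0):
    """Suggest a pause (in seconds) before the next request.

    Assumption: `headers` may contain `X-Ratelimit-Remaining` (requests left
    in the current window) and `X-Ratelimit-Reset` (seconds until the window
    resets). Falls back to a fixed delay when either is missing or malformed.
    """
    try:
        remaining = float(headers["X-Ratelimit-Remaining"])
        reset = float(headers["X-Ratelimit-Reset"])
    except (KeyError, ValueError):
        return default_delay
    if remaining <= 0:
        # Budget exhausted: wait for the window to reset.
        return reset
    # Spread the remaining budget over the window, never faster than the default.
    return max(default_delay, reset / remaining)


print(seconds_to_wait({}))  # 6.0
print(seconds_to_wait({"X-Ratelimit-Remaining": "50.0", "X-Ratelimit-Reset": "600"}))  # 12.0
```

In a request loop, this would replace the constant, for example `time.sleep(seconds_to_wait(response.headers))`.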

I will start with the r/Psychosis subreddit.

In [36]:
import time

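after = None  # full name (e.g. "t3_abc123") of the last post on the previous page
data = {}
for i in range(0, 100):
    time.sleep(6)  # 6 seconds between calls keeps us at 10 requests per minute
    response = requests.get("https://oauth.reddit.com/r/Psychosis/top.json", params={'t': 'year', 'limit': 100, 'after': after}, headers=headers)
    res = response.json()
    data[i] = res
    # Pass the `after` value back unchanged (prefix included); stop when there are no more pages
    after = res["data"]["after"]
    if after is None:
        break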

Next, we will repeat this process to get data from r/schizophrenia.

In [61]:
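after = None  # full name of the last post on the previous page
data_schizophrenia = {}
for i in range(0, 100):
    time.sleep(6)  # 6 seconds between calls keeps us at 10 requests per minute
    response = requests.get("https://oauth.reddit.com/r/schizophrenia/top.json", params={'t': 'year', 'limit': 100, 'after': after}, headers=headers)
    # Report the page index and status code of any failed request
    if response.status_code != 200:
        print(i)
        print(response.status_code)
    res = response.json()
    data_schizophrenia[i] = res
    # Pass the `after` value back unchanged (prefix included); stop when there are no more pages
    after = res["data"]["after"]
    if after is None:
        break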

Finally, we will repeat this process once more to get data from r/weed.

In [62]:
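after = None  # full name of the last post on the previous page
data_cannabis = {}
for i in range(0, 100):
    time.sleep(6)  # 6 seconds between calls keeps us at 10 requests per minute
    response = requests.get("https://oauth.reddit.com/r/weed/top.json", params={'t': 'year', 'limit': 100, 'after': after}, headers=headers)
    res = response.json()
    data_cannabis[i] = res
    # Pass the `after` value back unchanged (prefix included); stop when there are no more pages
    after = res["data"]["after"]
    if after is None:
        break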

Let's save the data to limit calls to the API. We'll save each dictionary as a JSON file to preserve its data structure. We will work on cleaning this data in the Data Cleaning page of this website.

In [63]:
import json

with open('reddit_psychosis_data.json', 'w') as json_file:
    json.dump(data, json_file, indent=4)

with open('reddit_schizophrenia_data.json', 'w') as json_file:
    json.dump(data_schizophrenia, json_file, indent=4)

with open('reddit_cannabis_data.json', 'w') as json_file:
    json.dump(data_cannabis, json_file, indent=4)
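
One detail worth knowing when reading these files back: JSON object keys are always strings, so the integer page indices used above come back as `"0"`, `"1"`, and so on. The sketch below round-trips a tiny made-up example and flattens it into one record per post. The helper `flatten_posts` is hypothetical, and the `title` and `selftext` field names are an assumption about the listing structure rather than something verified against the saved files.

```python
import json
import os
import tempfile


def flatten_posts(pages):
    """Return one dict per post from a {page_index: listing} mapping."""
    posts = []
    # Sort numerically because JSON turns integer keys into strings.
    for key in sorted(pages, key=int):
        for child in pages[key]["data"]["children"]:
            post = child["data"]
            posts.append({
                "title": post.get("title", ""),
                "text": post.get("selftext", ""),
            })
    return posts


# Round-trip a tiny made-up example through a temporary JSON file.
sample = {
    0: {"data": {"children": [{"data": {"title": "First", "selftext": "Hello"}}]}},
    1: {"data": {"children": [{"data": {"title": "Second", "selftext": ""}}]}},
}
path = os.path.join(tempfile.mkdtemp(), "sample.json")
with open(path, "w") as json_file:
    json.dump(sample, json_file, indent=4)
with open(path) as json_file:
    reloaded = json.load(json_file)

print(list(reloaded.keys()))  # ['0', '1']
print([p["title"] for p in flatten_posts(reloaded)])  # ['First', 'Second']
```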

## Wikipedia

To get a more academic perspective on the link between psychosis and cannabis, we will also pull data from Wikipedia. We will use R and the `WikipediR` package to get data from Wikipedia.

In [2]:
library(WikipediR)

In [67]:
long_term_cannabis_backlinks <- page_backlinks(
    "en",
    "wikipedia",
    page = "Long-term effects of cannabis",
    limit = 500
)
long_term_cannabis_links <- page_links(
    "en",
    "wikipedia",
    page = "Long-term effects of cannabis",
    limit = 500,
    namespaces = 0
)
long_term_cannabis <- page_content(
    "en",
    "wikipedia",
    page_name = "Long-term effects of cannabis"
)


In [42]:
long_term_cannabis_links$query$pages$`25905247`$links[[500]]

In [41]:
long_term_cannabis_backlinks$query$backlinks[[500]]

Now that we have the forward and back links for the "Long-term effects of cannabis" page, let's get the content of the forward and back links to create our corpus.

In [92]:
library(tidyverse)

wiki_data <- tibble(
    title = long_term_cannabis$parse$title,
    text = long_term_cannabis$parse$text$`*`,
    link = "main"
)

In [93]:
for(i in 1:500) {
    page_title <- long_term_cannabis_links$query$pages$`25905247`$links[[i]]$title
    print(page_title)
    # Return NULL on failure so a failed page is skipped instead of re-adding the previous page
    page_details <- tryCatch({
        page_content(
            "en",
            "wikipedia",
            page_name = page_title
        )
    },
    error = function(e) {
        print(paste0("error with ", page_title))
        NULL
    })
    if (is.null(page_details)) next

    wiki_data <- wiki_data %>%
        add_row(
            title = page_details$parse$title,
            text = page_details$parse$text$`*`,
            link = "link"
        )
}
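for(i in 1:500) {
    # Backlink entries store their ID under `pageid`; `page_content()` takes it as `page_id`
    page_id <- long_term_cannabis_backlinks$query$backlinks[[i]]$pageid
    # Return NULL on failure so a failed page is skipped instead of re-adding the previous page
    page_details <- tryCatch({
        page_content(
            "en",
            "wikipedia",
            page_id = page_id
        )
    },
    error = function(e) {
        print(paste0("error with ", page_id))
        NULL
    })
    if (is.null(page_details)) next

    wiki_data <- wiki_data %>%
        add_row(
            title = page_details$parse$title,
            text = page_details$parse$text$`*`,
            link = "back"
        )
}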

[1] "(C6)-CP 47,497"
[1] "(C9)-CP 47,497"
[1] "(<U+2212>)-Cannabidiol"
[1] "error with (<U+2212>)-Cannabidiol"
[1] "1-Butyl-3-(2-methoxybenzoyl)indole"
[1] "error with 1-Butyl-3-(2-methoxybenzoyl)indole"
[1] "1-Butyl-3-(4-methoxybenzoyl)indole"
[1] "error with 1-Butyl-3-(4-methoxybenzoyl)indole"
[1] "1-Pentyl-3-(2-methoxybenzoyl)indole"
[1] "error with 1-Pentyl-3-(2-methoxybenzoyl)indole"
[1] "11-Hydroxy-Delta-8-THC"
[1] "11-Hydroxy-THC"
[1] "11-Nor-9-carboxy-THC"
[1] "11-OH-CBN"
[1] "11-OH-HHC"
[1] "2-Arachidonoyl lysophosphatidylinositol"
[1] "error with 2-Arachidonoyl lysophosphatidylinositol"
[1] "2-Arachidonoylglycerol"
[1] "2-Arachidonyl glyceryl ether"
[1] "2-Oleoylglycerol"
[1] "2C (psychedelics)"
[1] "2F-QMPSB"
[1] "3'-Hydroxy-THC"
[1] "3,4-Methylenedioxyamphetamine"
[1] "4'-Fluorocannabidiol"
[1] "4'Cl-CUMYL-PINACA"
[1] "4'F-CUMYL-5F-PICA"
[1] "4'F-CUMYL-5F-PINACA"
[1] "4-HTMPIPO"
[1] "4-Nonylphenylboronic acid"
[1] "420 (cannabis culture)"
[1] "4CN-CUMYL-BUT7AICA"
[1] "error

In [94]:
wiki_data %>% nrow()

In [96]:
wiki_data %>% 
    write_csv(file = "wikipedia_scrape.csv")

In [97]:
save(wiki_data, file = "wikipedia_scrape.Rdata")

# Record Data

To get helpful record data about the use of cannabis and psychosis, we can use datasets gathered by previous researchers.

