---
title: Data Collection
---

# Text Data

## Reddit

In order to answer questions regarding public sentiment on cannabis usage and its ties to psychosis and schizophrenia, we will get text from Reddit. Reddit functions as a public forum on a large variety of topics, making it a good source for text data featuring discussions on cannabis, schizophrenia, and psychosis.

To get data from the Reddit API, I first made a user account and registered an app. This allowed me to generate a client ID and client secret for my app. My Reddit username and password are also necessary to gain access to the API.

To start pulling data from the Reddit API, I generate an access token with a basic HTTP POST request using the `requests` package in Python. Note that I have removed my personal information from this code.

In [58]:
import requests
import requests.auth

client_auth = requests.auth.HTTPBasicAuth(client_id, client_secret)
post_data = {"grant_type": "password", "username": username, "password": password}
headers = {"User-Agent": "DSANProject/1.0 by u/Haunting_River_226"}

response = requests.post("https://www.reddit.com/api/v1/access_token", auth=client_auth, data=post_data, headers=headers)
response_data = response.json()

With this call to the API, the response returns an access token as well as more token information. I will use the access token and token type to construct my API requests.

In [59]:
access_token = response_data['access_token']
token_type = response_data['token_type']

Now, I can use my access token to construct a header to use for all of my API calls.

In [60]:
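# The Authorization value needs a space between the token type and the token, e.g. "bearer <token>"
headers = {"Authorization": f"{token_type} {access_token}", "User-Agent": "DSANProject/1.0 by u/Haunting_River_226"}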
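
The token request can fail, for example with a wrong password, in which case the JSON body will not contain an access token. The sketch below wraps the header construction in a small helper that checks for this first. The function name `build_auth_headers` and the assumption that a failed request carries an `error` key are illustrative and not part of the original workflow.

```python
def build_auth_headers(token_payload: dict, user_agent: str) -> dict:
    """Build request headers from the JSON body of the token response."""
    # Assumption: a failed token request has an "error" key and no access token.
    if "error" in token_payload or "access_token" not in token_payload:
        raise ValueError(f"Token request failed: {token_payload}")
    # Fall back to "bearer" if the response does not state a token type.
    token_type = token_payload.get("token_type", "bearer")
    return {
        "Authorization": f"{token_type} {token_payload['access_token']}",
        "User-Agent": user_agent,
    }
```

With the variables above, this would be called as `build_auth_headers(response_data, "DSANProject/1.0 by u/Haunting_River_226")`.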

Now to get the data, I have chosen three relevant subreddits:

1. r/Psychosis
2. r/schizophrenia
3. r/weed

Each of these subreddits relates to cannabis and/or psychosis, and I will be analyzing the text to determine if and how these topics intersect in public conversation.

To get recent data, I will request up to 10,000 of the top posts from the previous year (October 12, 2022 - October 12, 2023). I use the `/top` endpoint to get the top posts in a given subreddit. The Reddit API returns at most 100 results per request, but I can get more by passing the `after` value from each `response` JSON as the `after` parameter of the next request. This pulls the first 100 posts, then the next 100, and so on until we have reached 10,000 or the listing runs out.

The Reddit API also has stringent limits on the number of requests made per minute, so I'll use a sleep function that limits the API requests to 10 per minute.

I will start with the r/Psychosis subreddit.

In [36]:
import time

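post_id = None
data = {}
for i in range(0, 100):
    time.sleep(6)  # stay at or below 10 requests per minute
    response = requests.get("https://oauth.reddit.com/r/Psychosis/top.json", params={'t': 'year', 'limit': 100, 'after': post_id}, headers=headers)
    res = response.json()
    data[i] = res
    # `after` must be the full name (e.g. "t3_..."), so keep the prefix
    post_id = res["data"]["after"]
    if post_id is None:  # no further pages in the listing
        break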

Next, we will repeat this process to get data from r/schizophrenia.

In [61]:
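post_id = None
data_schizophrenia = {}
for i in range(0, 100):
    time.sleep(6)  # stay at or below 10 requests per minute
    response = requests.get("https://oauth.reddit.com/r/schizophrenia/top.json", params={'t': 'year', 'limit': 100, 'after': post_id}, headers=headers)
    # Report any request that did not succeed
    if response.status_code != 200:
        print(i)
        print(response.status_code)
    res = response.json()
    data_schizophrenia[i] = res
    # `after` must be the full name (e.g. "t3_..."), so keep the prefix
    post_id = res["data"]["after"]
    if post_id is None:  # no further pages in the listing
        break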

Finally, we will repeat this process once more to get data from r/weed.

In [62]:
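post_id = None
data_cannabis = {}
for i in range(0, 100):
    time.sleep(6)  # stay at or below 10 requests per minute
    response = requests.get("https://oauth.reddit.com/r/weed/top.json", params={'t': 'year', 'limit': 100, 'after': post_id}, headers=headers)
    res = response.json()
    data_cannabis[i] = res
    # `after` must be the full name (e.g. "t3_..."), so keep the prefix
    post_id = res["data"]["after"]
    if post_id is None:  # no further pages in the listing
        break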
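
The three loops above differ only in the subreddit name, so the pagination logic could be factored into one helper. The following is a minimal sketch, not the code that produced the saved data. The names `fetch_all_pages` and `fetch_page` are hypothetical, and the sketch assumes that the API expects the `after` value to be passed back unchanged (the full name, such as `t3_...`) and that it returns `None` once the listing is exhausted.

```python
import time
from typing import Callable, List, Optional


def fetch_all_pages(
    fetch_page: Callable[[Optional[str]], dict],
    max_pages: int = 100,
    delay_seconds: float = 6.0,
    sleep: Callable[[float], None] = time.sleep,
) -> List[dict]:
    """Collect listing pages by following the `after` cursor.

    `fetch_page` takes the current cursor (None for the first page)
    and returns the parsed JSON of one listing page.
    """
    pages = []
    after = None
    for _ in range(max_pages):
        sleep(delay_seconds)  # 6 seconds between calls is 10 requests per minute
        page = fetch_page(after)
        pages.append(page)
        after = page["data"]["after"]
        if after is None:  # the listing has no further pages
            break
    return pages
```

For r/Psychosis, `fetch_page` could be a small function that wraps the `requests.get` call shown earlier and passes its argument as the `after` entry of `params`. Injecting `fetch_page` and `sleep` also makes the loop easy to try out without calling the API.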

Let's save the data to limit calls to the API. We'll save each dictionary as a JSON file to preserve its data structure. We will work on cleaning this data in the Data Cleaning page of this website.

In [63]:
import json

with open('reddit_psychosis_data.json', 'w') as json_file:
    json.dump(data, json_file, indent=4)

with open('reddit_schizophrenia_data.json', 'w') as json_file:
    json.dump(data_schizophrenia, json_file, indent=4)

with open('reddit_cannabis_data.json', 'w') as json_file:
    json.dump(data_cannabis, json_file, indent=4)
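
One detail to keep in mind when reading these files back: JSON object keys are always strings, so the integer page indices used above come back as `"0"`, `"1"`, and so on. The sketch below shows one way to reload a file and flatten its pages into a single list of posts. The helper name `load_posts` is hypothetical, and the sketch assumes each page follows the usual listing layout, with posts under `data` then `children`, each holding its fields under `data`.

```python
import json


def load_posts(path: str) -> list:
    """Flatten a saved dictionary of listing pages into a list of post dicts."""
    with open(path) as json_file:
        pages = json.load(json_file)
    posts = []
    # Page keys were written as strings, so sort them numerically
    for key in sorted(pages, key=int):
        for child in pages[key]["data"]["children"]:
            posts.append(child["data"])
    return posts
```

For example, `len(load_posts('reddit_psychosis_data.json'))` would give the number of posts actually collected for r/Psychosis.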

## Wikipedia

To get a more academic perspective on the link between psychosis and cannabis, we will also pull data from Wikipedia. We will use R and the `WikipediR` package to get data from Wikipedia.

In [2]:
library(WikipediR)

In [67]:
long_term_cannabis_backlinks <- page_backlinks(
    "en",
    "wikipedia",
    page = "Long-term effects of cannabis",
    limit = 500
)
long_term_cannabis_links <- page_links(
    "en",
    "wikipedia",
    page = "Long-term effects of cannabis",
    limit = 500,
    namespaces = 0
)
long_term_cannabis <- page_content(
    "en",
    "wikipedia",
    page_name = "Long-term effects of cannabis"
)


In [42]:
long_term_cannabis_links$query$pages$`25905247`$links[[500]]

In [41]:
long_term_cannabis_backlinks$query$backlinks[[500]]
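
For readers who want to see what these calls correspond to, the sketch below builds the MediaWiki API query parameters for the same two requests in Python. It is an illustration of the underlying API rather than a description of how `WikipediR` is implemented. The function names are hypothetical, and the parameters would be sent to `https://en.wikipedia.org/w/api.php`. A `continue` key in the response indicates that more results exist beyond the requested limit, which is worth checking given that the 500th element is present in both lists above.

```python
def backlinks_params(title: str, limit: int = 500) -> dict:
    """Query parameters for pages that link to `title`."""
    return {
        "action": "query",
        "list": "backlinks",
        "bltitle": title,
        "bllimit": limit,
        "format": "json",
    }


def links_params(title: str, limit: int = 500, namespace: int = 0) -> dict:
    """Query parameters for pages that `title` links to."""
    return {
        "action": "query",
        "prop": "links",
        "titles": title,
        "pllimit": limit,
        "plnamespace": namespace,
        "format": "json",
    }


def has_more_results(response_json: dict) -> bool:
    """True if the API signalled that the result list was truncated."""
    return "continue" in response_json
```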

Now that we have the forward and back links for the "Long-term effects of cannabis" page, let's get the content of the forward and back links to create our corpus.

In [92]:
library(tidyverse)

wiki_data <- tibble(
    title = long_term_cannabis$parse$title,
    text = long_term_cannabis$parse$text$`*`,
    link = "main"
)

In [None]:
# Forward links: fetch each page by title and skip any page whose request fails
links <- long_term_cannabis_links$query$pages$`25905247`$links
for (i in seq_along(links)) {
    page_title <- links[[i]]$title
    page_details <- tryCatch(
        page_content("en", "wikipedia", page_name = page_title),
        error = function(e) {
            print(paste0("error with ", page_title))
            NULL
        }
    )
    # Without this check, a failed request would re-add the previous page
    if (is.null(page_details)) next

    wiki_data <- wiki_data %>%
        add_row(
            title = page_details$parse$title,
            text = page_details$parse$text$`*`,
            link = "link"
        )
}

# Back links: fetch each page by its page ID (the API field is `pageid`)
backlinks <- long_term_cannabis_backlinks$query$backlinks
for (i in seq_along(backlinks)) {
    page_id <- backlinks[[i]]$pageid
    page_details <- tryCatch(
        page_content("en", "wikipedia", page_id = page_id),
        error = function(e) {
            print(paste0("error with ", page_id))
            NULL
        }
    )
    if (is.null(page_details)) next

    wiki_data <- wiki_data %>%
        add_row(
            title = page_details$parse$title,
            text = page_details$parse$text$`*`,
            link = "back"
        )
}

In [94]:
wiki_data %>% nrow()

In [96]:
wiki_data %>% 
    write_csv(file = "wikipedia_scrape.csv")

In [97]:
save(wiki_data, file = "wikipedia_scrape.Rdata")

# Record Data

To get helpful record data about the use of cannabis and psychosis, we can use datasets gathered by previous researchers. Generally, we want to find research that measures both cannabis usage and mental health risks.

Many researchers focused on these topics have graciously made their data publicly available. I searched for public research data using [Google Dataset Search](https://datasetsearch.research.google.com) in order to collect data.

## The Behavioral Sequelae of Cannabis Use in Healthy People

The first dataset comes from @Sorkhou2021 in **The Behavioral Sequelae of Cannabis Use in Healthy People: A Systematic Review**. This data was gathered from a collection of longitudinal studies on "cannabis-related adverse behavioral outcomes."

The data comes in the form of a Word document, so we can use the `docxtractr` package in R to extract the table. We will clean this table in the next step.

In [104]:
library(docxtractr)

table_as_docx <- read_docx("../data/Table_1_The Behavioral Sequelae of Cannabis Use in Healthy People_ A Systematic Review.DOCX")
tbl_out <- docx_extract_tbl(table_as_docx)
tbl_out %>% write_csv("../data/behavioral_sequelae.csv")

## Cannabis Research Article

The next dataset comes from "cannabis research article" by @baklaci. This data was created to study the differences in cannabis usage between users with psychotic experiences (PEs) and users without PEs.

This data comes in the form of an SPSS file, `.sav`, so we can read it using the `haven` package in R.

In [107]:
library(haven)

cannabis_research_data <- read_sav("../data/dataset.sav")
cannabis_research_data %>% write_csv("../data/cannabis_research_data.csv")

## Cannabinoid use in psychotic patients impacts inflammatory levels and their association with psychosis severity

The next dataset comes from @GIBSON2020113380 and their research on the impact of cannabinoid usage on psychotic patients. This data is stored as an Excel file, which is easily read using `readxl`.

In [None]:
library(readxl)

cannabinoid <- read_excel("../data/DataforSumbission_FINAL.xlsx")

## Cannabis Use, Schizotypy and Kamin Blocking Performance

Next, we will utilize a data set that was used by @Dawes2021 to study cannabis usage and schizotypy and their relationship with Kamin blocking. We will again use `docxtractr` to read in this data.

In [117]:
docx_table_2 <- read_docx("../data/Table_2_Cannabis Use, Schizotypy and Kamin Blocking Performance.DOCX")
kamin_blocking <- docx_table_2 %>%
    docx_extract_tbl() %>%
    write_csv("../data/kamin_blocking.csv")

## Cannabis use in male and female first episode of non-affective psychosis patients: Long-term clinical, neuropsychological and functional differences

In our final dataset, @Setién-Suero2017 aim to study how the link between cannabis usage and psychosis differs between men and women. Setién-Suero et al. provide two datasets that will be utilized in this analysis.

In [120]:
read_sav("../data/S1File.sav") %>% 
    write_csv("../data/s1file.csv")

read_sav("../data/S2File.sav") %>% 
    write_csv("../data/s2file.csv")

Now that we have collected an ample amount of data, we can move on to data cleaning.