---
title: Data Cleaning
---

Now that we have gathered a reasonable amount of textual and record data, we can begin the data cleaning process. Our ultimate goal is statistical modeling, so the data must first be put into a consistent, analysis-ready form. In general, we will follow the principles of tidy data [@tidydata].

# Cleaning Text Data

To begin, we will clean our textual data by parsing the text out of the JSON and HTML objects returned by the Reddit and Wikipedia APIs, respectively.

## Reddit Data

Recall that the Reddit data was returned as JSON. We retrieved 10,000 text posts from each of three subreddits and saved each set in its own JSON file. Our goal is to parse these JSON files into a dataframe. From there, we can transform the data into a Bag of Words, a Document Term Matrix, or any other helpful format.

We will use `pandas` and `json` to parse this data into the desired format. First, let's read in the data.

In [12]:
import pandas as pd
import json

with open("../data/reddit_psychosis_data.json") as f:
    reddit_psychosis = json.load(f)
with open("../data/reddit_cannabis_data.json") as f:
    reddit_cannabis = json.load(f)
with open("../data/reddit_schizophrenia_data.json") as f:
    reddit_schizophrenia = json.load(f)

From the data pull, we know that each of these JSON files has 100 elements, each with 100 posts.

Let's look at the structure of one element to identify how we can extract the title and text information.


In [15]:
reddit_psychosis['0']

{'kind': 'Listing',
 'data': {'after': 't3_15z9mid',
  'dist': 100,
  'modhash': '',
  'geo_filter': '',
  'children': [{'kind': 't3',
    'data': {'approved_at_utc': None,
     'subreddit': 'Psychosis',
     'selftext': '3 years post-psychosis in recovery some days can be a complete nightmare can be a nightmare but if u hold on long enough every time you get knocked down you grow stronger until you find your breakthrough moment &lt;3 trust me friends DM me for support ❤️',
     'author_fullname': 't2_8e0dojvz',
     'saved': False,
     'mod_reason_title': None,
     'gilded': 0,
     'clicked': False,
     'title': 'first time smiling on camera in... 3 years!',
     'link_flair_richtext': [],
     'subreddit_name_prefixed': 'r/Psychosis',
     'hidden': False,
     'pwls': None,
     'link_flair_css_class': None,
     'downs': 0,
     'thumbnail_height': 140,
     'top_awarded_type': None,
     'hide_score': False,
     'name': 't3_ybi1ao',
     'quarantine': False,
     'link_flair_

From here, we see that the posts are stored under the `children` key.

In [18]:
len(reddit_psychosis['0']['data']['children'])

100

It looks like these are the 100 posts we're looking for. Now we'll extract the title and text from each of these children elements.

In [34]:
reddit_psychosis['0']['data']['children'][1]['data']

{'approved_at_utc': None,
 'subreddit': 'Psychosis',
 'selftext': '',
 'author_fullname': 't2_im5nzt97',
 'saved': False,
 'mod_reason_title': None,
 'gilded': 0,
 'clicked': False,
 'title': 'I quit my meds lmfao',
 'link_flair_richtext': [],
 'subreddit_name_prefixed': 'r/Psychosis',
 'hidden': False,
 'pwls': None,
 'link_flair_css_class': None,
 'downs': 0,
 'thumbnail_height': 99,
 'top_awarded_type': None,
 'hide_score': False,
 'name': 't3_15q9gxw',
 'quarantine': False,
 'link_flair_text_color': 'dark',
 'upvote_ratio': 0.97,
 'author_flair_background_color': None,
 'subreddit_type': 'public',
 'ups': 380,
 'total_awards_received': 0,
 'media_embed': {},
 'thumbnail_width': 140,
 'author_flair_template_id': '40dc1c44-00e9-11e7-8565-0e172963bac8',
 'is_original_content': False,
 'user_reports': [],
 'secure_media': None,
 'is_reddit_media_domain': True,
 'is_meta': False,
 'category': None,
 'secure_media_embed': {},
 'link_flair_text': None,
 'can_mod_post': False,
 'score': 38

In [24]:
# Read the text and the title from the same post (index 1, shown above).
post = reddit_psychosis['0']['data']['children'][1]['data']
text = post['selftext']
title = post['title']
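
Before looping over every page, it may be worth confirming that each page really holds the expected 100 posts. The sketch below is one way to do that; `count_posts_per_page` and `toy_json` are hypothetical names introduced here for illustration, and the toy data only mimics the nesting seen above.

```python
def count_posts_per_page(reddit_json):
    """Return {page_key: number_of_posts} for a parsed Reddit JSON object."""
    return {
        page_key: len(page["data"]["children"])
        for page_key, page in reddit_json.items()
    }


# Tiny made-up example with the same nesting as the real files.
toy_json = {
    "0": {"data": {"children": [{"data": {}}, {"data": {}}]}},
    "1": {"data": {"children": [{"data": {}}]}},
}

counts = count_posts_per_page(toy_json)
# Pages that do not contain exactly 100 posts deserve a closer look.
short_pages = {key: n for key, n in counts.items() if n != 100}
```

On the real data, calling `count_posts_per_page(reddit_psychosis)` and checking that `short_pages` is empty would confirm that a fixed 100 by 100 loop is safe.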

Since we have discovered the structure of this data, we can extract the text info for all of the posts. Let's loop through all three files to get the data in a data frame.

First, we can define a function to loop through each of our JSON files.

In [52]:
def parse_reddit_json(reddit_json):
    """Return parallel lists of post text, title, and subreddit."""
    text_list = []
    title_list = []
    subreddit_list = []
    # Each top-level key ('0', '1', ...) holds one page of results.
    for i in range(len(reddit_json)):
        page = reddit_json[str(i)]
        # Iterate over the posts directly so a short page cannot raise IndexError.
        for child in page['data']['children']:
            post = child['data']
            text_list.append(post['selftext'])
            title_list.append(post['title'])
            subreddit_list.append(post['subreddit'])

    return text_list, title_list, subreddit_list

The function parses one JSON file and returns three parallel lists: the text, the title, and the subreddit of each post. All posts have titles, but not all posts have additional text; for those posts, `selftext` is an empty string.

In [53]:
psy_text, psy_title, psy_sub = parse_reddit_json(reddit_psychosis)
schiz_text, schiz_title, schiz_sub = parse_reddit_json(reddit_schizophrenia)
cannabis_text, cannabis_title, cannabis_sub = parse_reddit_json(reddit_cannabis)
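
Because some posts have an empty `selftext`, and because the raw text contains HTML entities such as `&lt;3`, one option is to merge the title and body of each post into a single cleaned string. The sketch below assumes that merging is acceptable for the later modeling; `build_document` is a hypothetical helper, not part of the original pipeline.

```python
import html


def build_document(title, selftext):
    """Join a post's title and body into one string, decoding HTML entities."""
    parts = [html.unescape(part).strip() for part in (title, selftext)]
    # Skip empty pieces so title-only posts do not end with a stray space.
    return " ".join(part for part in parts if part)


doc_with_body = build_document("first time smiling", "hold on &lt;3 trust me")
doc_title_only = build_document("I quit my meds lmfao", "")
```

Applied to the lists above, this would look like `[build_document(t, s) for t, s in zip(psy_title, psy_text)]`.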

Now that we have these lists, we can combine them into a `pandas` dataframe where each row is one post on Reddit.

In [60]:
text = psy_text + schiz_text + cannabis_text
title = psy_title + schiz_title + cannabis_title
sub = psy_sub + schiz_sub + cannabis_sub

reddit_df = pd.DataFrame({'text': text, 'title': title, 'subreddit': sub})
reddit_df.head()

|   | text | title | subreddit |
|---|------|-------|-----------|
| 0 | 3 years post-psychosis in recovery some days c... | first time smiling on camera in... 3 years! | Psychosis |
| 1 |  | I quit my meds lmfao | Psychosis |
| 2 |  | I hate it here | Psychosis |
| 3 |  | art by me. I thought it kinda visualized how I... | Psychosis |
| 4 |  | But I’m still god and this is neither a joke a... | Psychosis |


Now we have a data frame of labeled text objects that will be easy to work with for modeling.
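
To make the Document Term Matrix mentioned earlier concrete, here is a minimal sketch that builds one from a list of strings using only the standard library. The tokenizer is a deliberately simple assumption (lowercase alphabetic words, optional apostrophe), and `tokenize` and `document_term_matrix` are hypothetical names.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase a string and split it into simple word tokens."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())


def document_term_matrix(documents):
    """Return (vocabulary, rows); rows[i][j] counts vocabulary[j] in documents[i]."""
    counts = [Counter(tokenize(doc)) for doc in documents]
    vocabulary = sorted(set().union(*counts))
    # A Counter returns 0 for missing terms, so absent words need no special case.
    rows = [[doc_counts[term] for term in vocabulary] for doc_counts in counts]
    return vocabulary, rows


vocabulary, rows = document_term_matrix(["I hate it here", "I quit my meds", ""])
```

The result can be wrapped as `pd.DataFrame(rows, columns=vocabulary)`. For a corpus of this size, a sparse implementation such as scikit-learn's `CountVectorizer` is likely a better fit than dense Python lists.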

## Wikipedia Data

Next, let's clean up our Wikipedia data. The Wikipedia API returned a complex nested R object. We already extracted the HTML from this R object and stored it in a CSV, but what we really want is the main text of each web page.

We will use `rvest` to "harvest" the text from each HTML document and store it in a tibble.

In [5]:
library(rvest)

load("../data/wikipedia_scrape.Rdata")
wiki_data %>% names()

The HTML data is stored in the `text` column of the data frame. Let's take the first element and work out how to parse the text from a single web page.

In [26]:
first <- wiki_data$text[1]
first %>% 
    read_html() %>%
    html_element("body") %>% 
    html_element("div") %>% 
    html_elements("p") %>% 
    html_text() %>% 
    head()


After exploring the structure of the Wikipedia HTML, we see that there is a simple way to get the paragraph text from each page: pull the text from every `<p>` tag.

Now, we will loop through the tibble to convert the HTML string into a plain text string representing the paragraphs on the Wikipedia page.

In [45]:
# Pre-allocate one list slot per page instead of growing the list in the loop.
text_column <- vector("list", nrow(wiki_data))
for (i in seq_len(nrow(wiki_data))) {
    paragraphs <- wiki_data$text[i] %>%
        read_html() %>%
        html_element("body") %>%
        html_elements("p") %>%
        html_text()
    # Collapse all paragraphs of a page into a single string.
    text_column[[i]] <- paste0(paragraphs, collapse = " ")
}
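
For readers who would rather keep the Wikipedia step in Python alongside the Reddit code, the same idea (collect the text of every `<p>` element and join it) can be sketched with the standard library's `html.parser`. This is an illustrative alternative, not the `rvest` approach used here; `ParagraphExtractor` and `paragraph_text` are hypothetical names.

```python
from html.parser import HTMLParser


class ParagraphExtractor(HTMLParser):
    """Collect the text inside every <p> element of an HTML document."""

    def __init__(self):
        super().__init__()
        self.in_paragraph = False
        self.paragraphs = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.in_paragraph = True
            self.paragraphs.append("")

    def handle_endtag(self, tag):
        if tag == "p":
            self.in_paragraph = False

    def handle_data(self, data):
        # Text of nested inline tags (<b>, <a>, ...) is still inside the <p>.
        if self.in_paragraph:
            self.paragraphs[-1] += data


def paragraph_text(html_string):
    """Return all paragraph text of a page as one space-separated string."""
    parser = ParagraphExtractor()
    parser.feed(html_string)
    parser.close()
    return " ".join(p.strip() for p in parser.paragraphs if p.strip())


sample = (
    "<body><div><p>Hello <b>world</b>.</p>"
    "<table><tr><td>skip</td></tr></table>"
    "<p>Second&amp;last</p></div></body>"
)
page_text = paragraph_text(sample)
```

Text outside `<p>` elements, such as the table cell in `sample`, is ignored, which mirrors the effect of `html_elements("p")`.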

Now we have a list of all the text from each of these HTML files. Let's add this as a new column to our tibble and get rid of the huge HTML strings to decrease our memory footprint.

In [55]:
#| message: false
library(tidyverse)

wiki_data <- wiki_data %>%
    select(-text) %>%
    tibble::add_column(raw_text = text_column)
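
Paragraph text scraped from Wikipedia typically keeps the bracketed footnote markers, for example `[1]` or `[4][5]`. If those markers are unwanted in the model input, a regular expression can remove them. The sketch below is in Python and assumes that only purely numeric markers should be dropped, so editorial brackets such as `[b]ecause` are left alone; `strip_citations` is a hypothetical helper.

```python
import re

# Matches numeric footnote markers such as [1] or [27], but not [b] or [OR].
CITATION_MARKER = re.compile(r"\[\d+\]")


def strip_citations(text):
    """Remove numeric citation markers and collapse leftover whitespace."""
    without_markers = CITATION_MARKER.sub("", text)
    return re.sub(r"\s{2,}", " ", without_markers).strip()


cleaned = strip_citations("draw conclusions.[1] In 2017 a report.[2][3] Done.")
kept = strip_citations("proposed that [b]ecause of this (adjusted odds ratio [OR] 3.2)")
```

The same pattern could be applied on the R side with `stringr::str_remove_all(raw_text, "\\[\\d+\\]")`.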

In [57]:
wiki_data %>% head()

title,link,raw_text
<chr>,<chr>,<list>
Long-term effects of cannabis,main,"The long-term effects of cannabis have been the subject of ongoing debate. Because cannabis is illegal in most countries, clinical research presents a challenge and there is limited evidence from which to draw conclusions.[1] In 2017, the U.S. National Academies of Sciences, Engineering, and Medicine issued a report summarizing much of the published literature on health effects of cannabis, into categories regarded as conclusive, substantial, moderate, limited and of no or insufficient evidence to support an association with a particular outcome.[2] Cannabis is the most widely used illicit drug in the Western world.[3] In the United States, 10-20% of those who begin the use of cannabis daily will later become dependent.[4][5] Cannabis use can lead to addiction, which is defined as ""when the person cannot stop using the drug even though it interferes with many aspects of his or her life.""[5][6][7][8] Cannabis use disorder is defined in the fifth revision of the Diagnostic and Statistical Manual of Mental Disorders (DSM-5) as a condition requiring treatment.[3] A 2012 review of cannabis use and dependency in the United States by Danovitch et al said that ""42% of persons over age 12 have used cannabis at least once in their lifetime, 11.5% have used within the past year, and 1.8% have met diagnostic criteria for cannabis abuse or dependence within the past year. Among individuals who have ever used cannabis, conditional dependence (the proportion who go on to develop dependence) is 9%."" Although no medication is known to be effective in combating dependency, combinations of psychotherapy such as cognitive behavioural therapy and motivational enhancement therapy have achieved some success.[9] Cannabis dependence develops in 9% of users, significantly less than that of heroin, cocaine, alcohol, and prescribed anxiolytics,[10] but slightly higher than that for psilocybin, mescaline, or LSD. 
Dependence on cannabis tends to be less severe than that observed with cocaine, opiates, and alcohol.[11] A 2018 academic review, published in partnership with Canopy Growth, discussed the limitations of current studies of therapeutic and non-therapeutic cannabis use, and further stated that the nature of dependence formation among regular marijuana consumers has declined since 2002.[12] Cambridge University published a study in 2015 that showed the surprising fact that in England and Wales, the use of cannabis had decreased. Although there was a reported decrease in use, the need for addiction treatment was surging. The study looked more in depth on how the potency of the cannabis affected someone's dependence on the drug. They tested three different levels of potency and found that the most potent cannabis had the highest amount of dependence. Researchers believe that this is because of the high that the participants felt after using. The lower potency strains did not give users the same high, which made them not desire or in turn depend on that strain as much.[13] Acute cannabis intoxication has been shown to negatively affect attention, psychomotor task ability, and short-term memory.[14][15] Studies of chronic cannabis users have demonstrated, although inconsistently, a long-lasting effect on the attention span, memory function, and cognitive abilities of moderate-dose, long-term users. Once cannabis use is discontinued for several months, these effects disappear, unless the user started consuming during adolescence. It is speculated that this is due to neurotoxic effects of cannabis interfering with critical brain development.[16][17] Chronic use of cannabis during adolescence, a time when the brain is still developing, is correlated in the long term with lower IQ and cognitive deficits. It is not clear, though, if cannabis use causes the problems or if the causality is in the reverse. 
Recent studies have shown that IQ deficits existed in some subjects before chronic cannabis use, suggesting that lower IQ may instead be a risk factor for cannabis addiction.[18][6][19] A prospective cohort study that took place between 1972 and 2012 investigated the association between cannabis use and neuropsychological decline. Subjects were tested at various points in their life administering multiple different neuropsychological tests. The authors concluded that:  Cannabis intoxication was not only found to affect attention, psychomotor task ability, and short-term memory.[14][15] It was also found that intoxicated users were facing the difficulty of having false memories.[20] The use of cannabis has been heavily shown to affect the working-memory network function. Using large amounts of cannabis at a time is associated with hyperactivity of the network during a working-memory task. Most of these findings are showing that people who use cannabis on a daily basis will need additional effort in order to perform certain tasks.[21][22][23] Cannabis contains over 100 different cannabinoid compounds, many of which have displayed psychoactive effects. 
The most distinguished cannabinoids are tetrahydrocannabinol (THC) and cannabidiol (CBD), with THC being the primary psychoactive agent.[24][12] The effects of THC and CBD are salient regarding psychosis and anxiety.[25] According to the National Academies of Sciences, Engineering and Medicine, there is substantial evidence of a statistical association between cannabis use and the development of schizophrenia or other chronic psychoses, with the highest risk potentially among the most frequent users.[2] A possible connection between psychosis and cannabis is controversial because observational studies suggest a correlation but do not establish any causative effect of cannabis on long-term psychiatric health.[26] Medical evidence strongly suggests that the long-term use of cannabis by people who begin use at an early age display a higher tendency towards mental health problems and other physical and development disorders, although a causal link could not be proven by the available data.[27] The risks appear to be most acute in adolescent users.[27] In one 2013 review, the authors concluded long-term cannabis use ""increases the risk of psychosis in people with certain genetic or environmental vulnerabilities"", but does not cause psychosis. Important predisposing factors were genetic liability, childhood trauma and urban upbringing.[26] Another review that same year concluded that cannabis use may cause permanent psychological disorders in some users such as cognitive impairment, anxiety, paranoia, and increased risks of psychosis. Key predisposing variables included age of first exposure, frequency of use, the potency of the cannabis used, and individual susceptibility.[28] Nevertheless, some researchers maintain there exists ""a strong association between schizophrenia and cannabis use..."", while cannabis use alone does not predict the transition to subsequent psychiatric illness. 
Many factors are involved, including genetics, environment, time period of initiation and duration of cannabis use, underlying psychiatric pathology that preceded drug use, and combined use of other psychoactive drugs.[29] The temporal relationship between cannabis and psychosis was reviewed in 2014, and the authors proposed that ""[b]ecause longitudinal work indicates that cannabis use precedes psychotic symptoms, it seems reasonable to assume a causal relationship"" between cannabis and psychosis, but that ""more work is needed to address the possibility of gene-environment correlation.""[30] In 2016 a meta-analysis was published on associations studies covering a range of dosing habits, again showing that cannabis use is associated with a significantly increased risk of psychosis, and alleged that a dose–response relationship exists between the level of cannabis use and risk of psychosis. The risk was increased 4-fold with daily use, though the analysis was not adequate to establish a causal link.[31] Another 2016 meta-analysis found that cannabis use only predicted transition to psychosis among those who met the criteria for abuse of or dependence on the drug.[32] Another 2016 review concluded that the existing evidence did not show that cannabis caused psychosis, but rather that early or heavy cannabis use were among many factors more likely to be found in those at risk of developing psychosis.[33] An opposing view was expressed by Suzanne Gage and coauthors reviewing the literature available in 2016, who regarded the epidemiologic evidence on cannabis use and psychosis strong enough ""to warrant a public health message that cannabis use can increase the risk of psychotic disorders,"" but also cautioning that additional studies are needed to determine the size of the effect.[34] Such a public health message was subsequently issued in August 2019 by the Surgeon General of the United States.[35] The review by Gage et al. 
also stated ""If the association between cannabis and schizophrenia is causal and of the magnitude estimated across studies to date, this would equate to a schizophrenia lifetime risk of approximately 2% in regular cannabis users (though risk for broader psychotic outcomes will be greater). This implies that about 98% of regular cannabis users will not develop schizophrenia...[and that] risk could be much greater in those at a higher genetic risk, or in those who use particularly potent strains of cannabis.[34]:11 Expressed in terms of odds ratio, another study found that ""Daily cannabis use was associated with increased odds of psychotic disorder compared with never users (adjusted odds ratio [OR] 3.2, 95% CI 2.2–4.1), increasing to nearly five-times increased odds for daily use of high-potency types of cannabis (4.8, 2.5–6.3).""[36] To calculate what the increased odds ratio[36] means for schizophrenia specifically, a 2005 review placed the lifetime morbid risk of narrowly defined schizophrenia at 0.72%.[37] For some locations, this translates into a substantial population attributable risk, such that ""assuming causality, if high-potency cannabis types were no longer available, then 12% of cases of first-episode psychosis could be prevented across Europe, rising to 30% in London and 50% in Amsterdam.""[36] A 2019 meta-analysis found that 34% of people with cannabis-induced psychosis transitioned to schizophrenia. This was found to be comparatively higher than hallucinogens (26%) and amphetamines (22%).[38] However, a 2004 study noted that general population statistics show no increase in psychosis incidence rates in any developed country over the last 50 years, despite a five-fold increase in cannabis use rates. To quote Macleod et al. 
2004: ""Cannabis use appears to have increased substantially amongst young people over the past 30 years, from around 10% reporting ever use in 1969–70, to around 50% reporting ever use in 2001, in Britain and Sweden. If the relation between use and schizophrenia were truly causal and if the relative risk was around five-fold then the incidence of schizophrenia should have more than doubled since 1970. However population trends in schizophrenia incidence suggest that incidence has either been stable or slightly decreased over the relevant time period.""[39] Of note, cannabis with a high THC to CBD ratio produces a higher incidence of psychological effects. CBD may show antipsychotic and neuroprotective properties, acting as an antagonist to some of the effects of THC. Studies examining this effect have used high ratios of CBD to THC, and it is unclear to what extent these laboratory studies translate to the types of cannabis used by real life users.[28][40] Research has suggested that CBD can safely reduce some symptoms of psychosis in general.[41] A 2014 review examined psychological therapy as add-on for people with schizophrenia who are using cannabis:  As of 2017 there was clear evidence that long-term use of cannabis increases the risk of psychosis, regardless of confounding factors, and particularly for people who have genetic risk factors,[43] but see previous section. Even in those with no family history of psychosis, the administration of pure THC in clinical settings has been demonstrated to elicit transient psychotic symptoms.[44][45][46][47] Cannabis use may precipitate new-onset panic attacks and depersonalization/derealization symptoms simultaneously. The association between cannabis use and depersonalisation/derealisation disorder has been studied. Depersonalization is defined as a dissociative symptom in which one feels like an outside observer with respect to one's thoughts, body, and sensations. 
While derealization is marked by feelings of unreality and detachment from one's surroundings, such that one's environment is experienced as remote or unfamiliar.[48] Some individuals experiencing depersonalisation/derealisation symptoms prior to any cannabis use have reported the effects of cannabis to calm these symptoms and make the depersonalisation/derealisation disorder more manageable with regular use.[49] Less attention has been given to the association between cannabis use and depression, though according to the Australian National Drug & Alcohol Research Centre, it is possible this is because cannabis users who have depression are less likely to access treatment than those with psychosis.[50] The findings on marijuana's relationship to depressive disorder are scattered, showing that cannabis use has benefits, but can also be detrimental to overall mental health. However, sufficient evidence exists showing reductions in cannabis use improve anxiety, depression, and sleep quality.[51] A 2017 review suggests that cannabis has been shown to improve the mood of depression-diagnosed patients.[12] This is indicative of a longitudinal relationship between cannabis reduction and improvements in anxiety and depression.  Anxiety and depression have been found to increase susceptibility to marijuana use.[52] This is due to a desire to alleviate the symptoms of these experiences through marijuana use. Chronic users who use for anxiolytic purposes will even develop dependencies on cannabis, making it difficult to cope with anxiety when the drug is absent.  
Teenage cannabis users show no difference from the general population in incidence of major depressive disorder (MDD), but an association exists between early exposure coupled with continued use into adult life and increased incidence of MDD in adulthood.[53] Among cannabis users of all ages, there may be an increased risk of developing depression, with heavy users seemingly having a higher risk.[54] Heavy marijuana use in adolescence has also been associated with deficits in cognition. A recent study assessing changes in neuropsychological functioning resulting from long-term cannabis use followed a group of adolescents (ages 12–15 at baseline) over a 14-year period. Researchers found that more days of use were correlated with decreases in inhibitory control, and visuospatial ability. Contrary to existing cross-sectional studies showing marijuana use in adolescence is associated with poor cognitive functioning, there were no associations between long-term cannabis use and memory and processing speed.[55] While this study showed no correlations between memory and cannabis use, others have found that there is. It is important to know that studies looking at associations between cannabis use and poor neurocognitive functioning have found that extended abstinence from marijuana leads to improvements in cognitive deficits. Decreases in cognition resulting from marijuana use are indeed reversible.  
A February 2019 systematic review and meta-analysis found that cannabis consumption during adolescence was associated with an increased risk of developing depression and suicidal behavior later in life, while finding no effect on anxiety.[56] In a longitudinal study assessing the associations between long term use and mental health in a group of individuals participating in a drug-based treatment for depression, researchers found that, compared to non-users, patients using both medically and non-medically experienced less improvement in depressive symptoms and an increase in suicidal ideation. Additionally, those who used non-medically, were less likely to visit the psychiatrist.[57] Further research should investigate this finding to see whether non-medical marijuana use serves as a barrier to treatment seeking behavior.  Mania is a mental illness marked by periods of great excitement or euphoria, delusions, and overactivity.[58] This is common in cannabis users when they hit a point of their high that could lead to paranoia, anxiety, increased heart rate. Some strains of the drug can have these effects on the individuals that use them, but no effects are guaranteed when used. A case review reported that an adult user had marijuana-induced mania even though they had no previous psychiatric history.[59] However, some participants that have been previously diagnosed with bipolar disorder, had a worsen occurrence with mania symptoms.[60] This showed that anyone, diagnosed or psychiatrically stable, can develop mania symptoms when under the influence of cannabis.  
Adolescent cannabis users show no difference from their peers in suicidal ideation or rate of suicide attempts, but those who continue to use cannabis into adult life exhibit an increased incidence of both, although multiple other contributory factors are also implicated.[53] In the general population a weak (indirect) association appears to exist between suicidal behaviour and cannabis consumption in both psychotic and non-psychotic users,[61] although it remains unclear whether regular cannabis use increases the risk of suicide.[62] Cannabis use is a risk factor in suicidality, but suicide attempts are characterized by many additional risk factors including mood disorders, alcohol use, stress, personal problems and poor support.[61] The gateway drug hypothesis asserts that the use of soft drugs such as cannabis, tobacco or alcohol may ultimately lead to the use of harder drugs. The release of dopamine at CB1 receptors when cannabinoids enter the body can enforce drug seeking behavior. In addition to the gateway framework, there is also the peer clustering theory which says that friendships influence drug seeking behaviors. Friends who use can influence one another to take drugs that are more rewarding and have a higher potential for abuse.[63] Large-scale longitudinal studies in the UK and New Zealand from 2015 and 2017 showed an association between cannabis use and an increased probability of later disorders in the use of other drugs.[64][65][66] Over time, the marijuana gateway hypothesis has been studied more and more. In one published study, the use of marijuana was shown not a reliable gateway cause of illicit drug use.[67] However, social factors and environment influence drug use and abuse, making the gateway effects of cannabis different for those in differing social circumstances. 
A study looking at associations between drug injection and cannabis use in street-involved youth found that cannabis use was associated with slower time to injection initiation.[68] Injection initiation leads to further patterns of injection initiation which will eventually lead to addiction.[68] A 2013 literature review said that exposure to cannabis was ""associated with diseases of the liver (particularly with co-existing hepatitis C), lungs, heart, and vasculature"". The authors cautioned that ""evidence is needed, and further research should be considered, to prove causal associations of marijuana with many physical health conditions"".[3] Researchers are concerned that with the increase in legalization will lead towards an increase of use which will in turn call for new strategies as well as rehabilitation to minimize the harm that cannabis can do on someone's body.[69] Studies conflict on whether long-term cannabis use causes persistent structural changes in humans. Twin studies have shown no significant difference between users and non-users in twin pairs,[70] but other studies have demonstrated that chronic use affects white matter and hippocampal volume in the brains of healthy (non-psychotic) patients, which is where large amounts of cannabinoid-1 receptors are present.[71][72] Long term cannabis users are at risk for developing cannabinoid hyperemesis syndrome (CHS), characterized by recurrent bouts of intense vomiting. The mechanism behind CHS is poorly understood and is contrary to the antiemetic properties of cannabis and cannabinoids.[73] The acute effects of cannabis use in humans include a dose-dependent increase in heart rate, typically accompanied by a mild increase in blood pressure while lying down and postural hypotension - a drop in blood pressure when standing up. These effects may vary depending on the relative concentration of the many different cannabinoids that can affect the cardiovascular function, such as cannabigerol. 
Smoking cannabis decreases exercise tolerance.[74] Cardiovascular effects may not lead to serious health issues for the majority of young, healthy users; on the contrary, heart attack, that is myocardial infarction, stroke, and other adverse cardiovascular events, have occurred in association with its use. Cannabis use by people with cardiovascular disease poses a health risk because it can lead to increased cardiac work, increased catecholamine levels, and impaired blood oxygen carrying capacity due to the production of carboxyhemoglobin.[75] A 2012 review examining the relation of cancer and cannabis found little direct evidence that cannabinoids found in cannabis, including THC, are carcinogenic. Cannabinoids are not mutagenic according to the Ames test. However, cannabis smoke has been found to be carcinogenic in rodents and mutagenic in the Ames test. Correlating cannabis use with the development of human cancers has been problematic due to difficulties in quantifying cannabis use, unmeasured confounders, and cannabinoids' potential as cancer treatment.[76] According to a 2013 literature review, cannabis could be carcinogenic, but there are methodological limitations in studies making it difficult to establish a link between cannabis use and cancer risk.[3] The authors say that bladder cancer does seem to be linked to habitual cannabis use, and that there may be a risk for cancers of the head and neck among long-term (more than 20 years) users.[3] Gordon and colleagues said, ""there does appear to be an increased risk of cancer (particularly head and neck, lung, and bladder cancer) for those who use marijuana over a period of time, although what length of time that this risk increases is uncertain.""[3] There have been a limited number of studies that have looked at the effects of smoking cannabis on the respiratory system.[77] Chronic heavy cannabis smoking is associated with coughing, production of sputum, wheezing, and other symptoms of chronic 
bronchitis.[78] Regular cannabis use has not been shown to cause significant abnormalities in lung function.[79] Regular cannabis smokers show pathological changes in lung cells similar to those that precede the development of lung cancer in tobacco smokers.[80] Gordon and colleagues in a 2013 literature review said: ""Unfortunately, methodological limitations in many of the reviewed studies, including selection bias, small sample size, limited generalizability, and lack of adjustment for tobacco smoking, may limit the ability to attribute cancer risk solely to marijuana use.""[3] Reviewing studies adjusted for age and tobacco use, they said there was a risk of lung cancer even after adjusting for tobacco use, but that the period of time over which the risk increases is uncertain.[3] A 2013 review which specifically examined the effects of cannabis on the lung concluded ""[f]indings from a limited number of well-designed epidemiological studies do not suggest an increased risk for the development of either lung or upper airway cancer from light or moderate use, although evidence is mixed concerning possible carcinogenic risks of heavy, long-term use.""[79] In 2013 the International Lung Cancer Consortium found no significant additional lung cancer risk in tobacco users who also smoked cannabis. Nor did they find an increased risk in cannabis smokers who did not use tobacco. 
They concluded that ""[o]ur pooled results showed no significant association between the intensity, duration, or cumulative consumption of cannabis smoke and the risk of lung cancer overall or in never smokers."" They cautioned that ""[o]ur results cannot preclude the possibility that cannabis may exhibit an association with lung cancer risk at extremely high dosage."" The same authors supported further study, and called attention to evolving means of cannabis consumption: ""Specifically, respiratory risks may differ with the use of water pipes and vaporizers or with consuming oral preparations.""[81] Cannabis smoke contains thousands of organic and inorganic chemicals, including many of the same carcinogens as tobacco smoke.[82] A 2012 special report by the British Lung Foundation concluded that cannabis smoking was linked to many adverse effects, including bronchitis and lung cancer.[83] They identified cannabis smoke as a carcinogen and also said awareness of the danger was low compared with the high awareness of the dangers of smoking tobacco particularly among younger users. They said there was an increased risk from each cannabis cigarette due to drawing in large puffs of smoke and holding them.[83] Cannabis smoke has been listed on the California Proposition 65 warning list as a carcinogen since 2009, but leaves and pure THC are not.[84] A 2015 review found no association between head and neck cancer and lifetime cannabis smoking.[85] A 2013 literature review by Gordon and colleagues concluded that inhaled cannabis is associated with lung disease,[3] although Tashkin's 2013 review has found ""no clear link to chronic obstructive pulmonary disease"".[79] Smoking cannabis has been linked to adverse respiratory effects including: chronic coughing, wheezing, sputum production, and acute bronchitis.[83] It has been suggested that the common practice of inhaling cannabis smoke deeply and holding breath could lead to pneumothorax. 
In a few case reports involving immunocompromised patients, pulmonary infections such as aspergillosis have been attributed to smoking cannabis contaminated with fungi. The transmission of tuberculosis has been linked to cannabis inhalation techniques, such as sharing water pipes and 'Hotboxing'.[86] Of the various methods of cannabis consumption, smoking is considered the most harmful; the inhalation of smoke from organic materials can cause various health problems (e.g., coughing and sputum). Isoprenes help to modulate and slow down reaction rates, contributing to the significantly differing qualities of partial combustion products from various sources.[87][88] Male cannabis use has been associated with reduced fertility and decreased sperm counts.[89] Initial epigenetic studies have shown that male cannabis use causes widespread DNA methylation changes in sperm, resulting in lower rates of fertilization and higher rates of miscarriage.[90] Sperm DNA methylation alterations from cannabis extract exposure are evident in the offspring of rats.[91] This is important as prenatal cannabis exposure has been associated with neuropsychiatric disorders, and rates of autism have increased in the U.S., particularly in states where cannabis is legal.[92] A study released by the National Academies of Sciences, Engineering, and Medicine cited significant evidence for a statistical link between mothers who smoke cannabis during pregnancy and lower birth weights of their babies.[2] Cannabis consumption in pregnancy is associated with restrictions in growth of the fetus, miscarriage, and cognitive deficits in offspring.[93] Although the majority of research has concentrated on the adverse effects of alcohol, there is now evidence that prenatal exposure to cannabis has serious effects on the developing brain and is associated with ""deficits in language, attention, areas of cognitive performance, and delinquent behavior in adolescence"".[94] A report prepared for the Australian 
National Council on Drugs concluded cannabis and other cannabinoids are contraindicated in pregnancy as it may interact with the endocannabinoid system.[50] No fatal overdoses associated with cannabis use have ever been reported.[62] Due to the small number of studies that have been conducted, the evidence is insufficient to show a long-term elevated risk of mortality from any cause. Motor vehicle accidents, suicide, and possible respiratory and brain cancers are all of interest to many researchers, but no studies have been able to show a consistent increase in mortality from these causes.[62]"
"(C6)-CP 47,497",link,"(C6)-CP 47,497 (CP 47,497 dimethylhexyl homologue) is a synthetic cannabinoid, a CP 47,497 homologue.[1] Its systematic name is 2-[(1S,3R)-3-hydroxycyclohexyl]-5-(1,1-dimethylhexyl)phenol.[2] This cannabinoid related article is a stub. You can help Wikipedia by expanding it."
"(C9)-CP 47,497",link,"(C9)-CP 47,497 (CP 47,497 dimethylnonyl homologue) is a synthetic cannabinoid, a CP 47,497 homologue.[1] Its systematic name is 2-[(1S,3R)-3-hydroxycyclohexyl]-5-(1,1-dimethylnonyl)phenol.  This cannabinoid related article is a stub. You can help Wikipedia by expanding it."
"(C9)-CP 47,497",link,"(C9)-CP 47,497 (CP 47,497 dimethylnonyl homologue) is a synthetic cannabinoid, a CP 47,497 homologue.[1] Its systematic name is 2-[(1S,3R)-3-hydroxycyclohexyl]-5-(1,1-dimethylnonyl)phenol.  This cannabinoid related article is a stub. You can help Wikipedia by expanding it."
"(C9)-CP 47,497",link,"(C9)-CP 47,497 (CP 47,497 dimethylnonyl homologue) is a synthetic cannabinoid, a CP 47,497 homologue.[1] Its systematic name is 2-[(1S,3R)-3-hydroxycyclohexyl]-5-(1,1-dimethylnonyl)phenol.  This cannabinoid related article is a stub. You can help Wikipedia by expanding it."
"(C9)-CP 47,497",link,"(C9)-CP 47,497 (CP 47,497 dimethylnonyl homologue) is a synthetic cannabinoid, a CP 47,497 homologue.[1] Its systematic name is 2-[(1S,3R)-3-hydroxycyclohexyl]-5-(1,1-dimethylnonyl)phenol.  This cannabinoid related article is a stub. You can help Wikipedia by expanding it."


We now have the Wikipedia text data in a data frame, with a label on each row indicating whether that page is a forward link or a backlink of the page titled "Long-term effects of cannabis."
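The preview above still contains numeric citation markers such as `[1]`, the Wikipedia stub notice, and repeated rows for the same title. A minimal sketch of one way to handle these is shown below, using only the standard library. The column names (`title`, `link_type`, `text`), the shortened sample rows, and the helper functions are assumptions made for illustration and are not taken from the actual file.

```python
import csv
import io
import re

# Hypothetical sample mimicking the rows shown above; the header names
# (title, link_type, text) are assumptions, not the file's actual header.
SAMPLE_CSV = '''title,link_type,text
"(C6)-CP 47,497",link,"(C6)-CP 47,497 is a synthetic cannabinoid.[1] Its systematic name is given.[2] This cannabinoid related article is a stub. You can help Wikipedia by expanding it."
"(C9)-CP 47,497",link,"(C9)-CP 47,497 is a synthetic cannabinoid.[1] This cannabinoid related article is a stub. You can help Wikipedia by expanding it."
"(C9)-CP 47,497",link,"(C9)-CP 47,497 is a synthetic cannabinoid.[1] This cannabinoid related article is a stub. You can help Wikipedia by expanding it."
'''

# Numeric footnote markers such as [74]; editorial brackets like [f] are kept.
CITATION = re.compile(r"\[\d+\]")
# The boilerplate sentence pair Wikipedia appends to stub articles.
STUB_NOTICE = re.compile(
    r"This [^.]*article is a stub\. You can help Wikipedia by expanding it\."
)


def clean_text(text):
    """Strip citation markers and the stub notice, then collapse whitespace."""
    text = CITATION.sub("", text)
    text = STUB_NOTICE.sub("", text)
    return re.sub(r"\s+", " ", text).strip()


def clean_rows(rows):
    """Clean each row's text, keeping the first row per (title, link_type)."""
    seen = set()
    cleaned = []
    for row in rows:
        key = (row["title"], row["link_type"])
        if key in seen:
            continue  # skip repeated rows for the same page and label
        seen.add(key)
        cleaned.append({**row, "text": clean_text(row["text"])})
    return cleaned


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
cleaned = clean_rows(rows)
for row in cleaned:
    print(row["title"], "|", row["link_type"], "|", row["text"])
```

Deduplicating on the title and label together, rather than on the title alone, keeps a page that appears as both a forward link and a backlink. With `pandas`, `DataFrame.drop_duplicates` would give the same result.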

# Record Data