## Project 3: Webscraping subreddit r/schizophrenia

<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Project-3:-Webscraping-subreddit-r/schizophrenia" data-toc-modified-id="Project-3:-Webscraping-subreddit-r/schizophrenia-1">Project 3: Webscraping subreddit r/schizophrenia</a></span><ul class="toc-item"><li><ul class="toc-item"><li><span><a href="#Import-Libraries" data-toc-modified-id="Import-Libraries-1.0.1">Import Libraries</a></span></li></ul></li><li><span><a href="#Step-1:-Create-Reddit-instance" data-toc-modified-id="Step-1:-Create-Reddit-instance-1.1">Step 1: Create Reddit instance</a></span></li><li><span><a href="#Step-2:-Scrape-the-URL" data-toc-modified-id="Step-2:-Scrape-the-URL-1.2">Step 2: Scrape the URL</a></span></li><li><span><a href="#Step-3:-Create-a-pandas-DataFrame-from-list-of-subreddit-posts" data-toc-modified-id="Step-3:-Create-a-pandas-DataFrame-from-list-of-subreddit-posts-1.3">Step 3: Create a pandas DataFrame from list of subreddit posts</a></span></li><li><span><a href="#Step-4:-Export-data-to-csv" data-toc-modified-id="Step-4:-Export-data-to-csv-1.4">Step 4: Export data to csv</a></span></li></ul></li></ul></div>

Intro to Notebook 2

This notebook (Notebook 2) walks step by step through scraping the subreddit r/schizophrenia. The scrape requests up to 1000 posts. The text of these posts will be analyzed and later applied to classification models in the notebooks that follow.

#### Import Libraries

In [2]:
# HTTP library for requesting content from websites
import requests
# Data manipulation
import pandas as pd
# Reading and writing CSV files so data can be used outside this notebook
import csv
# Python wrapper for the Reddit API, used here to scrape posts
import praw

### Step 1: Create Reddit instance

Before any data can be scraped, the user must be authenticated. To do this, a Reddit instance must be created.

1. Create a Reddit app here: https://www.reddit.com/prefs/apps
2. After pressing "create app", the authentication information needed to create the praw.Reddit instance will be provided.

In [3]:
# Enter the client_id, client_secret, and user_agent values shown after the "create app" action
reddit = praw.Reddit(
    client_id='inQ3U2b00SevdQ',
    client_secret='mWYaNvAOx-0ilXOfMV82nHZ7R3U',
    user_agent='reddit scrape',
)
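Hard-coding credentials in a notebook makes them easy to leak when the notebook is shared. The following is a minimal sketch of one alternative, assuming the values are stored in environment variables. The variable names `REDDIT_CLIENT_ID`, `REDDIT_CLIENT_SECRET`, and `REDDIT_USER_AGENT` and the helper `load_reddit_credentials` are hypothetical and not part of PRAW.

```python
import os

# Hypothetical environment variable names; any names would work.
REQUIRED_VARS = {
    "client_id": "REDDIT_CLIENT_ID",
    "client_secret": "REDDIT_CLIENT_SECRET",
    "user_agent": "REDDIT_USER_AGENT",
}


def load_reddit_credentials(env=None):
    """Return keyword arguments for praw.Reddit, read from a mapping of environment variables."""
    env = os.environ if env is None else env
    # Report every missing or empty variable at once
    missing = [var for var in REQUIRED_VARS.values() if not env.get(var)]
    if missing:
        raise KeyError(f"Missing environment variables: {', '.join(missing)}")
    return {key: env[var] for key, var in REQUIRED_VARS.items()}


# Demonstration with a stand-in mapping instead of the real environment
example_env = {
    "REDDIT_CLIENT_ID": "my-client-id",
    "REDDIT_CLIENT_SECRET": "my-client-secret",
    "REDDIT_USER_AGENT": "reddit scrape",
}
creds = load_reddit_credentials(example_env)
print(sorted(creds))
```

With the variables set in the real environment, the instance could then be created as `reddit = praw.Reddit(**load_reddit_credentials())`.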

### Step 2: Scrape the URL

In [5]:
# Start with an empty list
posts = []
# Get the subreddit r/schizophrenia
scz_subreddit = reddit.subreddit('schizophrenia')

# Grab up to 1000 'hot' posts from subreddit r/schizophrenia
for post in scz_subreddit.hot(limit=1000):
    # Collect the fields of each post as one row
    posts.append([post.title, post.score, post.id, post.subreddit, post.url, post.num_comments, post.selftext, post.created])
# Convert the rows to a DataFrame and name the columns
posts = pd.DataFrame(posts,columns=['title', 'score', 'id', 'subreddit', 'url', 'num_comments', 'body', 'created'])
# Print 'posts' to confirm the data was scraped successfully
print(posts)

                                                 title  score      id  \
0        Frequently Asked Questions (Read This Sticky)     41  53xfmu   
1                                        The latest :)     68  dhouqp   
2                                   Hurting loved ones     10  dhsf49   
3    Today I saw a lot of strange sparkly dust fall...     42  dhmke5   
4                                 Im getting so fat...      5  dhtybm   
..                                                 ...    ...     ...   
994                   Side effects from antipchycotics      5  d4q7ru   
995   Any studies of ppl who quit meds after long use?      3  d4s9eq   
996    "Spider-Man: Far from Home" is mega triggering.     24  d4jaj6   
997                                  Supporting Fiancé      4  d4qhxu   
998              There's another reality and I know it     15  d4js18   

         subreddit                                                url  \
0    schizophrenia  https://www.reddit.com/r/schiz

Observations: The scrape returned 999 rows, which is close enough to the 1000 requested to provide what is needed for analysis.
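Not every scraped post necessarily has body text, since link and image posts have an empty `selftext`. The sketch below shows one way to count and keep only posts with text. It uses a small illustrative DataFrame rather than the scraped data, and the rule that whitespace-only bodies count as empty is an assumption.

```python
import pandas as pd

# Small stand-in for the scraped data; the values are illustrative only
sample = pd.DataFrame({
    "title": ["Text post", "Image post", "Whitespace only"],
    "body": ["Some self-text", "", "   "],
})

# A post counts as having text if its body is non-empty after stripping whitespace
has_text = sample["body"].fillna("").str.strip().ne("")
text_posts = sample[has_text]

print(f"{has_text.sum()} of {len(sample)} posts contain text")
```

Applied to the scraped data, the same mask could be built from the `body` column of `posts`.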

### Step 3: Create a pandas DataFrame from list of subreddit posts

In [6]:
# Save the scraped r/schizophrenia data (abbreviated 'scz') to a new DataFrame
df_scz = pd.DataFrame(posts)
# Display the DataFrame to confirm it was created successfully
df_scz

|  | title | score | id | subreddit | url | num_comments | body | created |
|---|---|---|---|---|---|---|---|---|
| 0 | Frequently Asked Questions (Read This Sticky) | 41 | 53xfmu | schizophrenia | https://www.reddit.com/r/schizophrenia/comment... | 7 | Welcome to [/r/schizophrenia](https://www.redd... | 1.474549e+09 |
| 1 | The latest :) | 68 | dhouqp | schizophrenia | https://i.redd.it/dhg7lm4wkhs31.jpg | 10 |  | 1.571079e+09 |
| 2 | Hurting loved ones | 10 | dhsf49 | schizophrenia | https://www.reddit.com/r/schizophrenia/comment... | 10 | korba went insane for about a month. big surpr... | 1.571097e+09 |
| 3 | Today I saw a lot of strange sparkly dust fall... | 42 | dhmke5 | schizophrenia | https://www.reddit.com/r/schizophrenia/comment... | 10 |  | 1.571063e+09 |
| 4 | Im getting so fat... | 5 | dhtybm | schizophrenia | https://www.reddit.com/r/schizophrenia/comment... | 11 | I'm on risperdone. Is there a schizophrenia me... | 1.571103e+09 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 994 | Side effects from antipchycotics | 5 | d4q7ru | schizophrenia | https://www.reddit.com/r/schizophrenia/comment... | 4 | About two weeks ago i was injected a high dose... | 1.568609e+09 |
| 995 | Any studies of ppl who quit meds after long use? | 3 | d4s9eq | schizophrenia | https://www.reddit.com/r/schizophrenia/comment... | 4 | Big brother stopped taking his meds after bein... | 1.568618e+09 |
| 996 | "Spider-Man: Far from Home" is mega triggering. | 24 | d4jaj6 | schizophrenia | https://www.reddit.com/r/schizophrenia/comment... | 4 | I just got about halfway through "Spider-Man: ... | 1.568576e+09 |
| 997 | Supporting Fiancé | 4 | d4qhxu | schizophrenia | https://www.reddit.com/r/schizophrenia/comment... | 1 | How do I support my fiancé who has paranoid de... | 1.568610e+09 |
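The `created` column holds large numbers such as `1.474549e+09`, which are hard to read as dates. A minimal sketch of converting them with `pd.to_datetime`, assuming the values are Unix timestamps in seconds, as their magnitude suggests. The column name `created_dt` is introduced here for illustration only.

```python
import pandas as pd

# Two 'created' values as displayed in the table above
sample = pd.DataFrame({"created": [1.474549e+09, 1.571079e+09]})

# Interpret the numbers as seconds since the Unix epoch (assumption)
sample["created_dt"] = pd.to_datetime(sample["created"], unit="s", utc=True)

# Extract the year of each post to check the conversion
years = sample["created_dt"].dt.year.tolist()
print(years)
```

A datetime column makes it straightforward to sort posts chronologically or to group them by month in later analysis.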


### Step 4: Export data to csv
Save the scraped data as a CSV file (the path below writes to the current working directory) so it can be read and cleaned for further analysis in Notebook 3.

In [7]:
# Export schizophrenia dataframe file
df_scz.to_csv(r'./subreddit_scz.csv')
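By default, `to_csv` also writes the DataFrame index, which comes back as a column named `Unnamed: 0` when the file is read again with `pd.read_csv`. The sketch below compares the default with `index=False` on a small illustrative DataFrame, using an in-memory buffer in place of a file. Whether to drop the index depends on what Notebook 3 expects, so treat this as an option rather than a required change.

```python
import io

import pandas as pd

# Small stand-in for df_scz; the values are illustrative only
sample = pd.DataFrame({"title": ["A", "B"], "score": [1, 2]})

# Default behaviour: the index is written as an unnamed first column
with_index = io.StringIO()
sample.to_csv(with_index)
with_index.seek(0)
cols_with_index = pd.read_csv(with_index).columns.tolist()

# index=False leaves the index out of the file
without_index = io.StringIO()
sample.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = pd.read_csv(without_index).columns.tolist()

print(cols_with_index)
print(cols_without_index)
```

If the index column is kept in the file, passing `index_col=0` to `pd.read_csv` restores it as the index instead of a data column.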