#### Notebook Contents:
- Import Python libraries
- Use the PRAW API to extract up to 1,000 posts from each of the FOXNEWS and MSNBC subreddits
- Convert posts to DataFrame
- Prepare DataFrame to transfer to other Notebook

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 

The links below were instrumental in building this notebook.
- Towards Data Science: Scraping Reddit data    
    - https://towardsdatascience.com/scraping-reddit-data-1c0af3040768


- Introduction and Basics - Python Reddit API Wrapper (PRAW) Tutorial P.1
    - https://www.youtube.com/watch?v=NRgfgtzIhBQ

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

#### Imports

In [1]:
# API 
import praw

# Data Manipulation
import pandas as pd
import numpy as np

# Tokenization
from nltk.tokenize import RegexpTokenizer

# Data Visualization
import seaborn as sns
import matplotlib.pyplot as plt

# DateTime
import time

# Web Scraping, Text Cleaning
from bs4 import BeautifulSoup 

# Stop-Word list
from nltk.corpus import stopwords

# Detect Patterns in Text
import regex as re

# Stemming
from nltk.stem.porter import PorterStemmer

# Train Test Split
from sklearn.model_selection import train_test_split

%matplotlib inline

In [2]:
# Set pandas to display up to 500 rows and 500 columns
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

## Data Extraction

#### PRAW API Parameters Setup

In [3]:
# Authenticating with the Reddit API through PRAW so new posts can be pulled
reddit = praw.Reddit(client_id = 'bwollOzqeTFAGw',
                     client_secret = "DvMUBxO8ZsMgLXYR7mW7uDWY6CE", 
                     password = 'scoober12',
                     user_agent = 'USERAGENT', 
                     username = 'DramaticPlate')
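Credentials written directly into a notebook are visible to anyone who can read the file. A minimal sketch of one alternative, assuming the five values are supplied through environment variables: the variable names (`REDDIT_CLIENT_ID` and so on) and the helper `load_reddit_credentials` are hypothetical and not part of the original notebook. The keyword names match the `praw.Reddit` call above.

```python
import os

# Hypothetical environment variable names mapped to the praw.Reddit keywords
# used in the cell above; rename them to suit your own setup.
ENV_TO_KWARG = {
    'REDDIT_CLIENT_ID': 'client_id',
    'REDDIT_CLIENT_SECRET': 'client_secret',
    'REDDIT_PASSWORD': 'password',
    'REDDIT_USER_AGENT': 'user_agent',
    'REDDIT_USERNAME': 'username',
}

def load_reddit_credentials(env=None):
    """Build the keyword arguments for praw.Reddit from environment variables."""
    env = os.environ if env is None else env
    missing = [name for name in ENV_TO_KWARG if not env.get(name)]
    if missing:
        raise KeyError(f'Missing environment variables: {", ".join(missing)}')
    return {kwarg: env[name] for name, kwarg in ENV_TO_KWARG.items()}

# Usage in this notebook (the variables must be set before the kernel starts):
# reddit = praw.Reddit(**load_reddit_credentials())
```

Failing early with the list of missing variables tends to be easier to debug than an authentication error raised later by the API.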

#### Web Scraping

In [4]:
# Number of posts to pull from each subreddit
num_pulls = int(input())

10


In [5]:
# Collects the newest posts from the FOXNEWS subreddit
FOXNEWS_posts = []

FOXNEWS = reddit.subreddit('FOXNEWS')

# Sets Limit and Parameters for what data we want to pull in
for post in FOXNEWS.new(limit = num_pulls):
    
    FOXNEWS_posts.append([post.id, 
                  post.title, 
                  post.ups, 
                  post.downs, 
                  post.subreddit,
                  post.selftext,
                  post.url, 
                  post.num_comments, 
                  post.created])
    time.sleep(3)

    # Running count of posts collected so far
    print(f'Number of Fox News Posts Obtained: {len(FOXNEWS_posts)}')
    
# Converts Post content into DataFrame
FOXNEWS_posts = pd.DataFrame(FOXNEWS_posts, columns = ['ID', 
                                       'Title', 
                                       'Upvotes', 
                                       'Downvotes', 
                                       'Subreddit',
                                       'Body',
                                       'URL', 
                                       'Number of Comments', 
                                       'Date Created'])

Number of Fox News Posts Obtained: 1
Number of Fox News Posts Obtained: 2
Number of Fox News Posts Obtained: 3
Number of Fox News Posts Obtained: 4
Number of Fox News Posts Obtained: 5
Number of Fox News Posts Obtained: 6
Number of Fox News Posts Obtained: 7
Number of Fox News Posts Obtained: 8
Number of Fox News Posts Obtained: 9
Number of Fox News Posts Obtained: 10


In [6]:
# Collects the newest posts from the MSNBC subreddit
MSNBC_posts = []

MSNBC = reddit.subreddit('MSNBC')

# Sets Limit and Parameters for what data we want to pull in
for post in MSNBC.new(limit = num_pulls):
    
    MSNBC_posts.append([post.id, 
                  post.title, 
                  post.ups, 
                  post.downs, 
                  post.subreddit,
                  post.selftext,
                  post.url, 
                  post.num_comments, 
                  post.created])
    time.sleep(3)

    # Running count of posts collected so far
    print(f'Number of MSNBC Posts Obtained: {len(MSNBC_posts)}')
    
# Converts Post content into DataFrame
MSNBC_posts = pd.DataFrame(MSNBC_posts, columns = ['ID', 
                                       'Title', 
                                       'Upvotes', 
                                       'Downvotes', 
                                       'Subreddit',
                                       'Body',
                                       'URL', 
                                       'Number of Comments', 
                                       'Date Created'])

Number of MSNBC Posts Obtained: 1
Number of MSNBC Posts Obtained: 2
Number of MSNBC Posts Obtained: 3
Number of MSNBC Posts Obtained: 4
Number of MSNBC Posts Obtained: 5
Number of MSNBC Posts Obtained: 6
Number of MSNBC Posts Obtained: 7
Number of MSNBC Posts Obtained: 8
Number of MSNBC Posts Obtained: 9
Number of MSNBC Posts Obtained: 10
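The two extraction cells above differ only in the subreddit name, so the loop can be factored into one helper. The following is a sketch under stated assumptions, not the notebook's own code: `collect_rows` and `FIELDS` are hypothetical names, and `SimpleNamespace` objects stand in for PRAW submissions so the example runs without network access.

```python
import time
from types import SimpleNamespace

# Attribute names copied from the cells above, in column order.
FIELDS = ['id', 'title', 'ups', 'downs', 'subreddit',
          'selftext', 'url', 'num_comments', 'created']

def collect_rows(submissions, fields=FIELDS, pause=0.0):
    """Return one list per submission holding the requested attributes."""
    rows = []
    for post in submissions:
        rows.append([getattr(post, name) for name in fields])
        time.sleep(pause)  # the cells above pause 3 seconds after each post
    return rows

# Stand-in objects (made-up values) so the sketch runs offline.
fake_posts = [
    SimpleNamespace(id='a1', title='First', ups=3, downs=0, subreddit='FOXNEWS',
                    selftext='', url='https://example.com/1',
                    num_comments=2, created=1578741000.0),
    SimpleNamespace(id='b2', title='Second', ups=1, downs=0, subreddit='MSNBC',
                    selftext='text', url='https://example.com/2',
                    num_comments=0, created=1578355000.0),
]

rows = collect_rows(fake_posts)
print(f'Number of Posts Obtained: {len(rows)}')
```

In the notebook, the call would look like `collect_rows(reddit.subreddit('MSNBC').new(limit = num_pulls), pause = 3)`, and the returned rows can be passed to `pd.DataFrame` with the same column list as above.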


#### Create Master DataFrame

In [7]:
# Creating master DataFrame by combining both subreddit DataFrames
master = pd.concat([FOXNEWS_posts, MSNBC_posts], axis = 0)

In [8]:
# Resetting index for concatenated master
master = master.reset_index(drop = True)
master.head()

| | ID | Title | Upvotes | Downvotes | Subreddit | Body | URL | Number of Comments | Date Created |
|---|---|---|---|---|---|---|---|---|---|
| 0 | en1us5 | Shadow banned !!😡 | 0 | 0 | FOXNEWS | So lately every time I try to post on Foxnews.... | https://www.reddit.com/r/FOXNEWS/comments/en1u... | 3 | 1578741000.0 |
| 1 | ekw19z | Train wreck | 3 | 0 | FOXNEWS | The whole crew at Fox probably has a bad case ... | https://www.reddit.com/r/FOXNEWS/comments/ekw1... | 10 | 1578355000.0 |
| 2 | ekuwee | Why did you flip the Epstein images? | 1 | 0 | FOXNEWS | After watching the 60 minutes broadcast last n... | https://www.reddit.com/r/FOXNEWS/comments/ekuw... | 3 | 1578350000.0 |
| 3 | egdoxi | What’s the name of this dude? | 0 | 0 | FOXNEWS | | https://i.redd.it/v1lwq0jtj7741.jpg | 5 | 1577495000.0 |
| 4 | eg164t | How can I watch fox news broadcast in Europe (... | 0 | 0 | FOXNEWS | My dad has an interest of my chrome cast and t... | https://www.reddit.com/r/FOXNEWS/comments/eg16... | 18 | 1577424000.0 |
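The `Date Created` column holds raw numeric timestamps. A minimal sketch of making them readable, assuming the values are Unix epoch seconds in UTC (the notebook does not state this, the displayed values appear rounded, and PRAW documents `created_utc` as the UTC field). The helper name `epoch_to_utc` is hypothetical.

```python
from datetime import datetime, timezone

def epoch_to_utc(seconds):
    """Convert Unix epoch seconds to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(seconds, tz=timezone.utc)

# First value shown in the table above.
print(epoch_to_utc(1578741000.0).isoformat())  # 2020-01-11T11:10:00+00:00
```

On the full DataFrame, `pd.to_datetime(master['Date Created'], unit = 's', utc = True)` performs the same conversion column-wise.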


In [9]:
# Checking that all post titles are unique (used here as a proxy for duplicate posts)
print(f'Number of Unique: {len(master["Title"].unique())}')
print(f'Number of Total Posts: {len(master["Title"])}')

Number of Unique: 20
Number of Total Posts: 20
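Two different posts can share a title, so the post ID is arguably a stricter key for this check. The sketch below lists any value that appears more than once; the helper `duplicated_values` and the sample list are illustrative assumptions, not part of the original notebook.

```python
from collections import Counter

def duplicated_values(values):
    """Return the values that occur more than once, in first-seen order."""
    counts = Counter(values)
    return [value for value, n in counts.items() if n > 1]

# Illustrative IDs; 'ekw19z' is repeated on purpose.
sample_ids = ['en1us5', 'ekw19z', 'ekuwee', 'ekw19z']
print(duplicated_values(sample_ids))  # ['ekw19z']
```

In the notebook, `duplicated_values(master['ID'])` would return an empty list when every post is distinct.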


- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

#### DataFrame Transfer

In [10]:
# Stores master DataFrame so that it can be imported into the EDA Notebook.
master_extract_1 = master
%store master_extract_1

Stored 'master_extract_1' (DataFrame)
