# Project 3: Webscraping, APIs, and Natural Language Processing (NLP)

## Part 1: Scraping Data From Reddit with APIs

In this project I collected posts from two subreddits, r/boardgames and r/cardgames, to use as the two classes in an NLP classification task.

In [1]:
import requests
import time
import pandas as pd
pd.set_option('display.max_columns', None)
import numpy as np

## 1. Scraping Board Games' Data

In [2]:
# Pushshift endpoint for searching Reddit submissions
URL = 'https://api.pushshift.io/reddit/search/submission'

In [3]:
# query parameters: subreddit, posts per request, and latest post time (Unix timestamp)
params_boardgames = {
    'subreddit': 'boardgames',
    'size': 100,
    'before': 1603002965
}

In [4]:
res_boardgames = requests.get(URL, params_boardgames)

In [5]:
res_boardgames.status_code

200

In [6]:
data_boardgames = res_boardgames.json()

In [7]:
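board_posts = []

# Request a fresh page on every pass and move 'before' back to the oldest
# post seen, so each iteration adds new posts rather than the same page.
for i in range(20):
    res_boardgames = requests.get(URL, params=params_boardgames)
    page = res_boardgames.json()['data'] if res_boardgames.status_code == 200 else []
    if not page:
        break
    board_posts.extend(page)
    params_boardgames['before'] = min(post['created_utc'] for post in page)
    time.sleep(1)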

In [8]:
len(board_posts)

2000
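
The paging step can also be wrapped in a small helper so the same logic serves both subreddits. The following is only a sketch: `collect_posts`, `fetch_page`, and the `fetch` and `pause` parameters are hypothetical names introduced here, and it assumes every returned post carries the `created_utc` field that appears among the columns.

```python
import time

URL = 'https://api.pushshift.io/reddit/search/submission'


def fetch_page(params):
    """Request one page of submissions; return [] on a non-200 response."""
    # Imported here so the paging logic can be exercised without network access.
    import requests
    res = requests.get(URL, params=params)
    return res.json()['data'] if res.status_code == 200 else []


def collect_posts(subreddit, before, n_pages, size=100, fetch=fetch_page, pause=1):
    """Collect up to n_pages pages, each one older than the page before it."""
    params = {'subreddit': subreddit, 'size': size, 'before': before}
    posts = []
    for _ in range(n_pages):
        page = fetch(params)
        if not page:
            break  # request failed or no older posts are left
        posts.extend(page)
        # The next request only returns posts older than the oldest one seen.
        params['before'] = min(post['created_utc'] for post in page)
        time.sleep(pause)
    return posts
```

Passing the request function in as `fetch` keeps the loop itself independent of the network, which makes it easy to try out on made-up pages first.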

In [10]:
df_boardgames = pd.DataFrame(board_posts)

In [11]:
df_boardgames.keys()

Index(['all_awardings', 'allow_live_comments', 'author',
       'author_flair_css_class', 'author_flair_richtext', 'author_flair_text',
       'author_flair_type', 'author_fullname', 'author_patreon_flair',
       'author_premium', 'awarders', 'can_mod_post', 'contest_mode',
       'created_utc', 'domain', 'full_link', 'gildings', 'id',
       'is_crosspostable', 'is_meta', 'is_original_content',
       'is_reddit_media_domain', 'is_robot_indexable', 'is_self', 'is_video',
       'link_flair_background_color', 'link_flair_richtext',
       'link_flair_text_color', 'link_flair_type', 'locked', 'media_only',
       'no_follow', 'num_comments', 'num_crossposts', 'over_18',
       'parent_whitelist_status', 'permalink', 'pinned', 'post_hint',
       'preview', 'pwls', 'removed_by_category', 'retrieved_on', 'score',
       'selftext', 'send_replies', 'spoiler', 'stickied', 'subreddit',
       'subreddit_id', 'subreddit_subscribers', 'subreddit_type', 'thumbnail',
       'thumbnail_height', 

In [12]:
# keep only the columns needed for classification
boardgames = df_boardgames[['id', 'author', 'selftext', 'title', 'subreddit']]

In [13]:
boardgames.head()

|   | id | author | selftext | title | subreddit |
|---|---|---|---|---|---|
| 0 | jdb4vs | StayAwhileandList-n | | Our group's game stash for our 4 day weekend! | boardgames |
| 1 | jdat35 | cybershocker455 | | How to Win Against Any Board Game Geek \| One P... | boardgames |
| 2 | jdafqp | the-postminimalist | Mostly looking for uncommon settings. Europe, ... | I loved Pax Pamir's informed depiction of an u... | boardgames |
| 3 | jdacfk | AutoModerator | \*\*Welcome to /r/boardgames Daily Discussion an... | /r/boardgames Daily Discussion and Game Recomm... | boardgames |
| 4 | jdaa7f | bethcabforcutie | None of my family or friends seem to remember ... | Does anyone remember this 80’s/90’s game? | boardgames |


## 2. Scraping Card Games' Data

In [14]:
# Pushshift endpoint for searching Reddit submissions
URL = 'https://api.pushshift.io/reddit/search/submission'

In [15]:
# query parameters: subreddit and latest post time (Unix timestamp)
params_cardgames = {
    'subreddit': 'cardgames',
    'before': 1510065136
}

In [16]:
res_cardgames = requests.get(URL, params_cardgames)

In [17]:
res_cardgames.status_code

200

In [18]:
data_cardgames = res_cardgames.json()

In [19]:
# data_cardgames['data']

In [20]:
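card_posts = []

# Request a fresh page on every pass and move 'before' back to the oldest
# post seen, so each iteration adds new posts rather than the same page.
for i in range(80):
    res_cardgames = requests.get(URL, params=params_cardgames)
    page = res_cardgames.json()['data'] if res_cardgames.status_code == 200 else []
    if not page:
        break
    card_posts.extend(page)
    params_cardgames['before'] = min(post['created_utc'] for post in page)
    time.sleep(1)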

In [21]:
len(card_posts)

2000

In [22]:
df_cardgames = pd.DataFrame(card_posts)

In [23]:
# keep only the columns needed for classification
cardgames = df_cardgames[['id', 'author', 'selftext', 'title', 'subreddit']]

In [24]:
cardgames.head()

|   | id | author | selftext | title | subreddit |
|---|---|---|---|---|---|
| 0 | 7b2bdk | Scorpinock | | The final version of my card game is now avail... | cardgames |
| 1 | 7b1lp9 | 2ndshot | Bidding Game\n\nSetup\n\nDeal each player 25 t... | Bidding Game | cardgames |
| 2 | 7ayeig | Tylerjsmall09 | Hey guys! I'm new to Reddit, and I joined spec... | Need help with funny card game title! | cardgames |
| 3 | 7awtpn | cherchak4 | I am going to thailand and want to bring some ... | Help! Which games to bring backpacking? | cardgames |
| 4 | 7auhsy | Le_Codex14 | Hi, this is my first post on this subreddit. I... | Created a card game in 10 minutes called Karma | cardgames |


## 3. Pre-Processing

### Concatenate the Two DataFrames

In [25]:
# DataFrame.append was removed in pandas 2.0; pd.concat gives the same result
data = pd.concat([boardgames, cardgames], ignore_index=True, sort=False)

In [26]:
data

|   | id | author | selftext | title | subreddit |
|---|---|---|---|---|---|
| 0 | jdb4vs | StayAwhileandList-n | | Our group's game stash for our 4 day weekend! | boardgames |
| 1 | jdat35 | cybershocker455 | | How to Win Against Any Board Game Geek \| One P... | boardgames |
| 2 | jdafqp | the-postminimalist | Mostly looking for uncommon settings. Europe, ... | I loved Pax Pamir's informed depiction of an u... | boardgames |
| 3 | jdacfk | AutoModerator | \*\*Welcome to /r/boardgames Daily Discussion an... | /r/boardgames Daily Discussion and Game Recomm... | boardgames |
| 4 | jdaa7f | bethcabforcutie | None of my family or friends seem to remember ... | Does anyone remember this 80’s/90’s game? | boardgames |
| ... | ... | ... | ... | ... | ... |
| 3995 | 78d1zn | EnvyMe__ | | Beyblade card game anyone? | cardgames |
| 3996 | 786uwa | JulieMaddon95 | | Play Roulette online | cardgames |
| 3997 | 77zy6k | sbobet9 | | play baccarat - เล่นบาคาร่า - Thailand | cardgames |
| 3998 | 77tfjf | dollypolly666 | | Bonus Malam Minggu Hanya di Kartuking Sebesar ... | cardgames |
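
Since the posts were gathered in repeated batches, it may be worth checking the combined frame for repeated post ids before going further. A minimal sketch on a small made-up frame standing in for `data` (the ids, titles, and variable names below are invented for illustration):

```python
import pandas as pd

# Hypothetical stand-in for `data`; the ids and titles are made up.
sample = pd.DataFrame({
    'id': ['a1', 'a2', 'a1', 'b1'],
    'title': ['first', 'second', 'first', 'third'],
    'subreddit': ['boardgames', 'boardgames', 'boardgames', 'cardgames'],
})

# Count rows whose id has already appeared earlier in the frame.
n_duplicates = sample.duplicated(subset='id').sum()

# Keep the first occurrence of each id and renumber the rows.
deduped = sample.drop_duplicates(subset='id').reset_index(drop=True)

# Check how many posts remain in each class after deduplication.
class_counts = deduped['subreddit'].value_counts()
```

If `n_duplicates` is large, the class counts after deduplication give a more honest picture of how much distinct text each class contributes.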


### Convert target column to binary value

In [27]:
data['subreddit'] = data['subreddit'].map({'boardgames': 0, 'cardgames': 1})
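
`Series.map` with a dict returns NaN for any value that is not a key of the dict, so a quick check after the mapping can catch an unexpected subreddit name. A small sketch, where `label_map`, `labels`, and the value `'chess'` are made up for illustration:

```python
import pandas as pd

label_map = {'boardgames': 0, 'cardgames': 1}

# Hypothetical stand-in for the subreddit column; 'chess' is an
# invented value that is not covered by the mapping.
labels = pd.Series(['boardgames', 'cardgames', 'chess'])

mapped = labels.map(label_map)

# Values that the mapping did not cover end up as NaN.
unmapped = labels[mapped.isna()].unique().tolist()
```

An empty `unmapped` list means every row received a 0 or 1 label.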

### Shuffle Data

In [28]:
data = data.sample(frac=1).reset_index(drop=True)
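
Without a seed, `sample(frac=1)` produces a different order on every run. If a repeatable shuffle is wanted, `random_state` can be passed. A minimal sketch on a made-up frame; the seed 42 is an arbitrary choice, not a value from this project:

```python
import pandas as pd

# Hypothetical stand-in for `data`; the ids are made up.
sample = pd.DataFrame({'id': ['a1', 'a2', 'b1', 'b2'], 'subreddit': [0, 0, 1, 1]})

# random_state=42 is an arbitrary seed chosen for illustration.
shuffled_once = sample.sample(frac=1, random_state=42).reset_index(drop=True)
shuffled_again = sample.sample(frac=1, random_state=42).reset_index(drop=True)

# The same seed yields the same row order on both calls.
same_order = shuffled_once.equals(shuffled_again)
```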

In [29]:
data

|   | id | author | selftext | title | subreddit |
|---|---|---|---|---|---|
| 0 | jcepuk | smarter_than_an_oreo | | I made an Oxygen, Heat, and Generation marker ... | 0 |
| 1 | 7ayeig | Tylerjsmall09 | Hey guys! I'm new to Reddit, and I joined spec... | Need help with funny card game title! | 1 |
| 2 | 7a9boe | SleepyPanda_1945 | | [Monster Cry Eternal] Fantasy CCG for Mobile. | 1 |
| 3 | jco7ut | Squirrelhenge | I've been out of my apartment for 7 months bec... | My stuff just got moved back in to my apartmen... | 0 |
| 4 | jd2wz0 | backhandle1982 | Is anyone able to give me some tips on how to ... | Tips for 3 card solitaire | 0 |
| ... | ... | ... | ... | ... | ... |
| 3995 | 7b2bdk | Scorpinock | | The final version of my card game is now avail... | 1 |
| 3996 | jck6zp | PearlsofRon | I'm contemplating picking this one up, but I w... | Anyone play Trudvang Legends? | 0 |
| 3997 | 7939no | onlinerummyindia | | Ace2Three Paytm Offer | 1 |
| 3998 | jcp3i8 | randallion | Which means that I don't really play board gam... | My wife does not consider board games a way to... | 0 |


## 4. Exported the Final DataFrame

In [30]:
data.to_csv('./data/boardgames_and_cardgames.csv', index=False)
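
When the CSV is read back, pandas parses empty fields as NaN by default, so posts with an empty `selftext` come back as missing values rather than empty strings. A minimal sketch of the round trip, using a temporary file and made-up rows instead of the real export:

```python
import os
import tempfile

import pandas as pd

# Hypothetical rows standing in for the exported frame.
sample = pd.DataFrame({
    'id': ['a1', 'b1'],
    'selftext': ['', 'Some post body'],
    'title': ['Link-only post', 'Text post'],
    'subreddit': [0, 1],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'sample.csv')
    sample.to_csv(path, index=False)
    reloaded = pd.read_csv(path)

# The empty selftext is read back as NaN.
n_missing = reloaded['selftext'].isna().sum()

# Restore empty strings so later text processing sees str values only.
reloaded['selftext'] = reloaded['selftext'].fillna('')
```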