<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Project 3: Application Programming Interface, Natural Language Processing & Classification Modelling

### Contents:
- [Problem Statement](#Problem-Statement)
- [Background](#Background)
- [Objectives / Rationales](#Objectives-/-Rationales)
- [Web Scraping](#Web-Scraping)

## Problem Statement

We, the Data Team at CryptoTrade23, aim to build a chatbot for the company’s website that serves as the first checkpoint for users enquiring about cryptocurrencies. The chatbot will apply Natural Language Processing (NLP) and Machine Learning (ML) classifiers.
The classifier in the chatbot will be trained to respond to these enquiries based on keywords in the users’ inputs.

## Background

CryptoTrade23 is a fintech startup specialising in cryptocurrency investments. Recently, our Customer Service Team has been receiving an overwhelming number of enquiries about cryptocurrencies and their applications. The Head of Customer Service has engaged the Data Team to automate responses to simple enquiries in the face of increasing workload and resource constraints. This will enable the Customer Service Team to focus on more complex enquiries.

## Objectives / Rationales

In this project, our team will be using data from two subreddits to achieve the following:
1. Analyse and understand the text data in each of the subreddits using Natural Language Processing (NLP).
2. Use Machine Learning (ML) classifiers to identify the subreddit a submission is likely to originate from.
3. Evaluate the ML classifiers against our baseline model using accuracy and ROC AUC as the metrics.
4. Propose the most suitable ML classifier for developing a minimum viable product (MVP) for the chatbot, and make other recommendations.
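
Objective 3 compares every classifier with a baseline. The sketch below assumes the baseline is the usual majority-class predictor, which is what `DummyClassifier(strategy='most_frequent')` would give; the text above does not say which strategy is used. The labels are illustrative, not the scraped data.

```python
from collections import Counter

def majority_baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most frequent label."""
    counts = Counter(labels)
    return max(counts.values()) / len(labels)

# Illustrative labels only: 3 'Bitcoin' posts and 2 'ethereum' posts.
labels = ['Bitcoin', 'Bitcoin', 'Bitcoin', 'ethereum', 'ethereum']
baseline = majority_baseline_accuracy(labels)
print(baseline)  # 0.6
```

A classifier is only useful if its accuracy clearly exceeds this figure. With two classes of roughly equal size, the figure sits near 0.5, and a constant predictor scores a ROC AUC of 0.5.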

## Web Scraping

In [1]:
import gensim.downloader as api  # lets us download the word2vec and GloVe embeddings that we need
from gensim.models.word2vec import Word2Vec

In [2]:
# import libraries
import pandas as pd, numpy as np, requests, time, nltk, datetime as dt
from random import randint
from time import sleep

# NLP
from nltk.stem import WordNetLemmatizer
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer


# classifiers
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, confusion_matrix, plot_confusion_matrix, recall_score, precision_score)
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.dummy import DummyClassifier


# easier to see full text with a bigger maxwidth:
pd.options.display.max_colwidth = 200

### Set up single post scrape

In [3]:
# query the Pushshift API for the subreddit data we need
# (the subreddit is passed via the params dictionaries below, so it is not repeated in the URL)
url1 = 'https://api.pushshift.io/reddit/search/submission'
url2 = 'https://api.pushshift.io/reddit/search/submission'

In [4]:
# define what we need to get from the subreddit Bitcoin
params1 = {
    'subreddit': 'Bitcoin',
    'size': 100,
    'before': '1626939127'
}

In [5]:
# define what we need to get from the subreddit ethereum
params2 = {
    'subreddit': 'ethereum',
    'size': 100,
    'before': '1626939643'
}
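
To make the request concrete, the sketch below shows how a parameter dictionary like the ones above turns into a query string. It uses the standard library's `urlencode` for illustration; `requests` does its own encoding, so treat this as an approximation of the URL it sends. The names `base_url` and `full_url` are illustrative.

```python
from urllib.parse import urlencode

base_url = 'https://api.pushshift.io/reddit/search/submission'
params = {
    'subreddit': 'Bitcoin',
    'size': 100,             # number of submissions per request
    'before': '1626939127'   # only submissions created before this epoch time
}

# join the endpoint and the encoded parameters into one URL
full_url = base_url + '?' + urlencode(params)
print(full_url)
# https://api.pushshift.io/reddit/search/submission?subreddit=Bitcoin&size=100&before=1626939127
```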

Let's check if our response is valid. We're looking for a 200 response code.

In [6]:
res1 = requests.get(url1, params1)
res1.status_code

200

In [7]:
res2 = requests.get(url2, params2)
res2.status_code

200
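
The manual status check can also be folded into a small helper that refuses to parse anything other than a 200 response. This is a minimal sketch: `posts_from_response`, `FakeResponse` and `fake_get` are hypothetical names, and the fake objects stand in for `requests.get` so the example runs without network access.

```python
def posts_from_response(get, url, params):
    """Call get(url, params=params) and return the list stored under 'data'.

    Raises RuntimeError for any status code other than 200.
    """
    res = get(url, params=params)
    if res.status_code != 200:
        raise RuntimeError(f'Request failed with status {res.status_code}')
    return res.json()['data']

# With the real API this would be called as:
#     posts_from_response(requests.get, url1, params1)

class FakeResponse:
    """Stand-in for a requests response (illustration only)."""
    def __init__(self, status_code, payload):
        self.status_code = status_code
        self._payload = payload

    def json(self):
        return self._payload

def fake_get(url, params=None):
    # pretend the API returned a single post
    return FakeResponse(200, {'data': [{'id': 'op915k'}]})

print(posts_from_response(fake_get, 'https://example.invalid', {}))
# [{'id': 'op915k'}]
```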

In [8]:
# get Bitcoin content in JSON format
data1 = res1.json()
posts1 = data1['data']
posts1

[{'all_awardings': [],
  'allow_live_comments': False,
  'author': 'theremnanthodl',
  'author_flair_css_class': 'noob',
  'author_flair_richtext': [{'e': 'text', 't': 'redditor for a day'}],
  'author_flair_template_id': '2ec8e69e-6c36-11e9-a04b-0afb553d4ea6',
  'author_flair_text': 'redditor for a day',
  'author_flair_text_color': 'dark',
  'author_flair_type': 'richtext',
  'author_fullname': 't2_dg8srid3',
  'author_is_blocked': False,
  'author_patreon_flair': False,
  'author_premium': False,
  'awarders': [],
  'can_mod_post': False,
  'contest_mode': False,
  'created_utc': 1626939006,
  'domain': 'self.Bitcoin',
  'full_link': 'https://www.reddit.com/r/Bitcoin/comments/op915k/bitcoin_town_a_fiction_novel_about_using_bitcoin/',
  'gildings': {},
  'id': 'op915k',
  'is_created_from_ads_ui': False,
  'is_crosspostable': False,
  'is_meta': False,
  'is_original_content': False,
  'is_reddit_media_domain': False,
  'is_robot_indexable': False,
  'is_self': True,
  'is_video': Fa

In [9]:
# get ethereum content in JSON format
data2 = res2.json()
posts2 = data2['data']
posts2

[{'all_awardings': [],
  'allow_live_comments': False,
  'author': 'wuzzgucci',
  'author_flair_css_class': None,
  'author_flair_richtext': [],
  'author_flair_text': None,
  'author_flair_type': 'text',
  'author_fullname': 't2_53ftzlxj',
  'author_patreon_flair': False,
  'author_premium': False,
  'awarders': [],
  'can_mod_post': False,
  'contest_mode': False,
  'created_utc': 1626938727,
  'domain': 'self.ethereum',
  'full_link': 'https://www.reddit.com/r/ethereum/comments/op8z9u/should_i_sell_bitcoin_and_just_go_all_in_ethereum/',
  'gildings': {},
  'id': 'op8z9u',
  'is_created_from_ads_ui': False,
  'is_crosspostable': True,
  'is_meta': False,
  'is_original_content': False,
  'is_reddit_media_domain': False,
  'is_robot_indexable': True,
  'is_self': True,
  'is_video': False,
  'link_flair_background_color': '',
  'link_flair_richtext': [],
  'link_flair_text_color': 'dark',
  'link_flair_type': 'text',
  'locked': False,
  'media_only': False,
  'no_follow': False,
  'n

Check the number of posts extracted from each subreddit.

In [10]:
len(posts1)

100

In [11]:
len(posts2)

100

Keep only the fields that we intend to use.

In [12]:
df1 = pd.DataFrame(posts1)
df1.head()

Unnamed: 0,all_awardings,allow_live_comments,author,author_flair_css_class,author_flair_richtext,author_flair_template_id,author_flair_text,author_flair_text_color,author_flair_type,author_fullname,...,media_embed,secure_media,secure_media_embed,distinguished,suggested_sort,link_flair_text,gallery_data,is_gallery,media_metadata,author_flair_background_color
0,[],False,theremnanthodl,noob,"[{'e': 'text', 't': 'redditor for a day'}]",2ec8e69e-6c36-11e9-a04b-0afb553d4ea6,redditor for a day,dark,richtext,t2_dg8srid3,...,,,,,,,,,,
1,[],False,theremnanthodl,,[],,,,text,t2_dg8srid3,...,,,,,,,,,,
2,[],False,ReadDailyCoin,noob,"[{'e': 'text', 't': 'redditor for 3 months'}]",2ec8e69e-6c36-11e9-a04b-0afb553d4ea6,redditor for 3 months,dark,richtext,t2_bmm97n7n,...,,,,,,,,,,
3,[],False,theloiteringlinguist,,[],,,,text,t2_7em1h7ph,...,"{'content': '&lt;iframe width=""356"" height=""200"" src=""https://www.youtube.com/embed/7pLusWKO86Y?feature=oembed&amp;enablejsapi=1"" frameborder=""0"" allow=""accelerometer; autoplay; clipboard-write; e...","{'oembed': {'author_name': 'The Valuable Investors', 'author_url': 'https://www.youtube.com/channel/UCcdigZ5bdD_FGyeHnfW1Tbg', 'height': 200, 'html': '&lt;iframe width=""356"" height=""200"" src=""http...","{'content': '&lt;iframe width=""356"" height=""200"" src=""https://www.youtube.com/embed/7pLusWKO86Y?feature=oembed&amp;enablejsapi=1"" frameborder=""0"" allow=""accelerometer; autoplay; clipboard-write; e...",,,,,,,
4,[],False,Electronic_Chard1987,,[],,,,text,t2_994q7jme,...,,,,,,,,,,


In [13]:
df2 = pd.DataFrame(posts2)
df2.head()

Unnamed: 0,all_awardings,allow_live_comments,author,author_flair_css_class,author_flair_richtext,author_flair_text,author_flair_type,author_fullname,author_patreon_flair,author_premium,...,post_hint,preview,secure_media,secure_media_embed,url_overridden_by_dest,crosspost_parent,crosspost_parent_list,author_flair_background_color,author_flair_text_color,banned_by
0,[],False,wuzzgucci,,[],,text,t2_53ftzlxj,False,False,...,,,,,,,,,,
1,[],False,Needle_NFT,,[],,text,t2_bamj1t7z,False,False,...,,,,,,,,,,
2,[],False,AOFEX__Official,,[],,text,t2_b00ercqa,False,False,...,,,,,,,,,,
3,[],False,FarEnergy3518,,[],,text,t2_bonkrdx5,False,False,...,rich:video,"{'enabled': False, 'images': [{'id': 'jFjNd1uiVnBZCa6C2enlMYDrqff6DjiOi0c05LoN2FE', 'resolutions': [{'height': 81, 'url': 'https://external-preview.redd.it/kJ6OI_6lRA2ZzQxC6LDgFL_lM4CM8w-yNSnFFR18...","{'oembed': {'author_name': 'The Money Game Capital', 'author_url': 'https://www.youtube.com/channel/UCjpdIY9eURfiSssHasnB_9Q', 'description': 'Hope you guys enjoy this video!!! This is my first ac...","{'content': '&lt;iframe class=""embedly-embed"" src=""https://cdn.embedly.com/widgets/media.html?src=https%3A%2F%2Fwww.youtube.com%2Fembed%2FWgyM0tQ0Hfs%3Ffeature%3Doembed&amp;display_name=YouTube&am...",https://m.youtube.com/watch?v=WgyM0tQ0Hfs,,,,,
4,[],False,excusemealot,,[],,text,t2_b4v6m29o,False,False,...,link,"{'enabled': False, 'images': [{'id': 'YyUykZ9N8GKcddhbXPI4xo3AQBZ4KDQInaIPFPK67EA', 'resolutions': [{'height': 108, 'url': 'https://external-preview.redd.it/fp6wEJgB9knF94Y4PvPar470G8JMHVexuKK3_Gt...","{'oembed': {'author_name': 'Mihailo Bjelic', 'author_url': 'https://twitter.com/MihailoBjelic', 'cache_age': 3153600000, 'height': None, 'html': '&lt;blockquote class=""twitter-video""&gt;&lt;p lang...","{'content': '&lt;blockquote class=""twitter-video""&gt;&lt;p lang=""en"" dir=""ltr""&gt;I truly hope the success of Polygon will help both VCs and founders understand that it&amp;#39;s (much) better to ...",https://twitter.com/mihailobjelic/status/1417591789645561856?s=21,,,,,


In [14]:
# these are the fields we plan to use, so we keep only these columns
subfields = ['subreddit', 'title', 'selftext', 'created_utc', 'author', 'is_self', 'score', 'num_comments']
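
The raw JSON records do not all carry the same keys. For example, `author_flair_template_id` appears in the Bitcoin sample above but not in the ethereum one. Selecting columns with `df[subfields]` raises a `KeyError` if one of the listed fields is missing from every record. The sketch below shows a tolerant selection in plain Python. `select_fields` is a hypothetical helper, and the sample record is a trimmed post with `selftext` deliberately left out.

```python
subfields = ['subreddit', 'title', 'selftext', 'created_utc',
             'author', 'is_self', 'score', 'num_comments']

def select_fields(posts, fields):
    """Keep only `fields` from each post dict; missing keys become None."""
    return [{f: post.get(f) for f in fields} for post in posts]

# trimmed sample record (illustration): 'selftext' is absent, 'domain' is extra
sample = [{
    'subreddit': 'Bitcoin',
    'title': 'Help starting crypto business',
    'created_utc': 1626935613,
    'author': 'TheWanderer09',
    'is_self': True,
    'score': 1,
    'num_comments': 5,
    'domain': 'self.Bitcoin',
}]

selected = select_fields(sample, subfields)
print(selected[0]['selftext'])   # None
print('domain' in selected[0])   # False
```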

In [15]:
df1 = df1[subfields]
df1.head()

Unnamed: 0,subreddit,title,selftext,created_utc,author,is_self,score,num_comments
0,Bitcoin,Bitcoin Town - A fiction novel about using Bitcoin against the Great Reset,[removed],1626939006,theremnanthodl,True,1,0
1,Bitcoin,Bitcoin Town - A fiction novel about using Bitcoin against the Great Reset,[removed],1626938084,theremnanthodl,True,1,0
2,Bitcoin,"Crypto Influencers Dorsey, Woods, and Musk Face-off During B-Word Conference",,1626937970,ReadDailyCoin,False,1,1
3,Bitcoin,Elon Musk’s View on Bitcoin (July 21 2021),,1626937137,theloiteringlinguist,False,1,2
4,Bitcoin,"You’ve undoubtedly heard about crypto currencies by now such as Bitcoin, Ethereum and DOGE coin. But you can’t help but wonder what is it? What can I do with it? What is mining? Well CryptoMapz is...",,1626936557,Electronic_Chard1987,False,1,20


In [16]:
df2 = df2[subfields]
df2.head()

Unnamed: 0,subreddit,title,selftext,created_utc,author,is_self,score,num_comments
0,ethereum,Should I sell bitcoin and just go all in ethereum?,"* I have like $37.5k invested total in bitcoin, and after the crash, I only have $10.8k gains. If I sold at $65k, I'd have $100k right now in cash.\n* I basically bought a whole coin at around $10...",1626938727,wuzzgucci,True,1,44
1,ethereum,The hug – ultra rare 1/1,[removed],1626938319,Needle_NFT,True,1,0
2,ethereum,L2BEAT website upgrade,[removed],1626938005,AOFEX__Official,True,1,0
3,ethereum,The Internet World of our Future!!! (INSANE) the best vid I’ve ever watched in my life of crypto,,1626937513,FarEnergy3518,False,1,0
4,ethereum,I like the way Polygon thinks - Mihailo Bjelic on Twitter,,1626936362,excusemealot,False,1,1


Remove any duplicate posts

In [17]:
# drop rows that are exact duplicates across all columns
# (reassigning avoids modifying a slice of the original DataFrame in place)
df1 = df1.drop_duplicates()
df2 = df2.drop_duplicates()
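
Called without arguments, `drop_duplicates` compares whole rows, so a repost with the same title and author but a different `created_utc` is kept. The two "Bitcoin Town" rows in this notebook are an instance of this. The sketch below shows, in plain Python, what deduplicating on a subset of fields would look like. `drop_reposts` is a hypothetical helper, and whether reposts should be removed at all is a decision this notebook does not make. The pandas counterpart would be `drop_duplicates(subset=['title', 'author'])`.

```python
def drop_reposts(rows, key_fields=('title', 'author')):
    """Keep the first row for each combination of key_fields."""
    seen = set()
    kept = []
    for row in rows:
        key = tuple(row[f] for f in key_fields)
        if key not in seen:
            seen.add(key)
            kept.append(row)
    return kept

title = 'Bitcoin Town - A fiction novel about using Bitcoin against the Great Reset'
rows = [
    {'title': title, 'author': 'theremnanthodl', 'created_utc': 1626939006},
    {'title': title, 'author': 'theremnanthodl', 'created_utc': 1626938084},
    {'title': 'Help starting crypto business', 'author': 'TheWanderer09',
     'created_utc': 1626935613},
]

deduped = drop_reposts(rows)
print(len(rows), len(deduped))  # 3 2
```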

We only want original text, so we keep the self (text) posts and drop link posts.

In [18]:
# filter only self posts for Bitcoin
df1 = df1[df1['is_self']]
df1.head()

Unnamed: 0,subreddit,title,selftext,created_utc,author,is_self,score,num_comments
0,Bitcoin,Bitcoin Town - A fiction novel about using Bitcoin against the Great Reset,[removed],1626939006,theremnanthodl,True,1,0
1,Bitcoin,Bitcoin Town - A fiction novel about using Bitcoin against the Great Reset,[removed],1626938084,theremnanthodl,True,1,0
6,Bitcoin,what moves crypto market apart from the speculators,"I would like to know if there is anything that moves crypto market apart from the speculators, or are cryptocurrencies and their prices absolutely speculative?",1626936533,hawk-fe,True,1,7
9,Bitcoin,"Only morons post about Elon Musk, SpaceX, or Tesla",[removed],1626935783,monoslim,True,1,0
10,Bitcoin,Help starting crypto business,Hi guys.\n\nI'm interested in starting a crypto business / app www.hashtaghodl.com that would take small amounts of money from your account over time and then invest the money into your favourite ...,1626935613,TheWanderer09,True,1,5


In [19]:
# filter only self posts for ethereum
df2 = df2[df2['is_self']]
df2.head()

Unnamed: 0,subreddit,title,selftext,created_utc,author,is_self,score,num_comments
0,ethereum,Should I sell bitcoin and just go all in ethereum?,"* I have like $37.5k invested total in bitcoin, and after the crash, I only have $10.8k gains. If I sold at $65k, I'd have $100k right now in cash.\n* I basically bought a whole coin at around $10...",1626938727,wuzzgucci,True,1,44
1,ethereum,The hug – ultra rare 1/1,[removed],1626938319,Needle_NFT,True,1,0
2,ethereum,L2BEAT website upgrade,[removed],1626938005,AOFEX__Official,True,1,0
5,ethereum,Is it possible to specify at which block height my transaction get included?,[removed],1626935304,zjiekai,True,1,0
6,ethereum,Create your Token in 3 easy steps with SuperToken,[removed],1626934954,orchidkart,True,1,0


### Webscraping Submissions - TEST RUNS

Bitcoin Submissions - TEST RUN

In [20]:
# page backwards through the Pushshift results, 100 submissions per request
# n caps the number of requests; the loop stops early once 300 rows are collected
# note: is_video is accepted but is not applied as a filter
def fetch_posts_test(subreddit, kind='submission', is_video=False, n=500):
       
        # establish params
        url = 'https://api.pushshift.io/reddit/search/' + kind
        posts = []
        current_time = '1626939127'
        
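        # run at most n requests
        for i in range(1, n+1):
            try:
                res = requests.get(
                    url,
                    params={
                        'subreddit': subreddit,
                        'size': 100,
                        'before': current_time
                    }
                )
            except requests.exceptions.RequestException:
                # skip this iteration if the request itself fails
                continue
            
            # convert the JSON payload into a DataFrame
            df = pd.DataFrame(res.json()['data'])
            
            # stop if the API returned no more posts
            if df.empty:
                break
            
            # move the cursor to the earliest timestamp in this batch,
            # so that the next request pages further back in time
            current_time = df['created_utc'].min()
            
            # append the df from each iteration to the posts list
            posts.append(df)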
            
            # find out no. of posts we're getting
            total_scraped = sum(len(x) for x in posts)
            
            print(total_scraped)
            
            # wait 3-10 seconds between requests and break once we obtain 300 total_scraped rows
            if total_scraped < 300:
                time.sleep(randint(3,10))
            else:
                break
                
        # merge list of dfs from our requests
        df_merged = pd.concat(posts, sort=False)
        
        # keep only the columns we need
        subfields = ['subreddit', 'title', 'selftext', 'created_utc', 'author', 'is_self', 'score', 'num_comments']
        df_merged = df_merged[subfields]
        
        # remove rows that are exact duplicates
        df_merged = df_merged.drop_duplicates()
        
        print(df_merged.shape)
        
        # convert the UTC epoch seconds to a calendar date (in the machine's local time zone)
        df_merged['timestamp'] = df_merged['created_utc'].map(dt.date.fromtimestamp)
        
        return df_merged.reset_index(drop=True)
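
The function above pages through the archive by moving the `before` cursor to the earliest `created_utc` seen so far. The sketch below isolates that pagination logic so it can be checked without calling the API. `fetch_posts_paged`, `fake_fetch_page` and `ARCHIVE` are hypothetical names, and the fake archive assumes that `before` is exclusive and that results come back newest first.

```python
def fetch_posts_paged(fetch_page, subreddit, before, target=300, max_requests=500):
    """Collect posts by repeatedly requesting the page older than `before`.

    fetch_page(subreddit, before) must return a list of post dicts.
    """
    posts = []
    for _ in range(max_requests):
        batch = fetch_page(subreddit, before)
        if not batch:
            break  # nothing older is available
        posts.extend(batch)
        # move the cursor to the earliest post seen so far
        before = min(p['created_utc'] for p in batch)
        if len(posts) >= target:
            break
    return posts

# fake archive: 1,000 posts with created_utc 1000, 999, ..., 1 (newest first)
ARCHIVE = [{'subreddit': 'Bitcoin', 'created_utc': t} for t in range(1000, 0, -1)]

def fake_fetch_page(subreddit, before, size=100):
    older = [p for p in ARCHIVE
             if p['subreddit'] == subreddit and p['created_utc'] < before]
    return older[:size]

posts = fetch_posts_paged(fake_fetch_page, 'Bitcoin', before=1001, target=300)
print(len(posts))  # 300
```

Passing the fetching step in as an argument also means one function can serve both subreddits, with the subreddit name and starting timestamp supplied by the caller rather than hardcoded.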

In [21]:
btc_test = fetch_posts_test('Bitcoin')
btc_test.head()

100
200
300
(300, 8)


Unnamed: 0,subreddit,title,selftext,created_utc,author,is_self,score,num_comments,timestamp
0,Bitcoin,Bitcoin Town - A fiction novel about using Bitcoin against the Great Reset,[removed],1626939006,theremnanthodl,True,1,0,2021-07-22
1,Bitcoin,Bitcoin Town - A fiction novel about using Bitcoin against the Great Reset,[removed],1626938084,theremnanthodl,True,1,0,2021-07-22
2,Bitcoin,"Crypto Influencers Dorsey, Woods, and Musk Face-off During B-Word Conference",,1626937970,ReadDailyCoin,False,1,1,2021-07-22
3,Bitcoin,Elon Musk’s View on Bitcoin (July 21 2021),,1626937137,theloiteringlinguist,False,1,2,2021-07-22
4,Bitcoin,"You’ve undoubtedly heard about crypto currencies by now such as Bitcoin, Ethereum and DOGE coin. But you can’t help but wonder what is it? What can I do with it? What is mining? Well CryptoMapz is...",,1626936557,Electronic_Chard1987,False,1,20,2021-07-22


In [40]:
btc_test.to_csv('../data/bitcoin_sub_test.csv', index=False)

ethereum Submissions - TEST RUN

In [23]:
# page backwards through the Pushshift results, 100 submissions per request
# n caps the number of requests; the loop stops early once 300 rows are collected
# note: is_video is accepted but is not applied as a filter
def fetch_posts_test(subreddit, kind='submission', is_video=False, n=500):
       
        # establish params
        url = 'https://api.pushshift.io/reddit/search/' + kind
        posts = []
        current_time = '1626939643'
        
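        # run at most n requests
        for i in range(1, n+1):
            try:
                res = requests.get(
                    url,
                    params={
                        'subreddit': subreddit,
                        'size': 100,
                        'before': current_time
                    }
                )
            except requests.exceptions.RequestException:
                # skip this iteration if the request itself fails
                continue
            
            # convert the JSON payload into a DataFrame
            df = pd.DataFrame(res.json()['data'])
            
            # stop if the API returned no more posts
            if df.empty:
                break
            
            # move the cursor to the earliest timestamp in this batch,
            # so that the next request pages further back in time
            current_time = df['created_utc'].min()
            
            # append the df from each iteration to the posts list
            posts.append(df)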
            
            # find out no. of posts we're getting
            total_scraped = sum(len(x) for x in posts)
            
            print(total_scraped)
            
            # wait 3-10 seconds between requests and break once we obtain 300 total_scraped rows
            if total_scraped < 300:
                time.sleep(randint(3,10))
            else:
                break
                
        # merge list of dfs from our requests
        df_merged = pd.concat(posts, sort=False)
        
        # keep only the columns we need
        subfields = ['subreddit', 'title', 'selftext', 'created_utc', 'author', 'is_self', 'score', 'num_comments']
        df_merged = df_merged[subfields]
        
        # duplicate removal is disabled in this test run
        #df_merged.drop_duplicates(inplace=True)
        
        print(df_merged.shape)
        
        # convert the UTC epoch seconds to a calendar date (in the machine's local time zone)
        df_merged['timestamp'] = df_merged['created_utc'].map(dt.date.fromtimestamp)
        
        return df_merged.reset_index(drop=True)
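
One detail of the `timestamp` column: `dt.date.fromtimestamp` interprets the epoch value in the local time zone of the machine running the notebook, so posts created close to midnight UTC can land on different dates on different machines. If a machine-independent date is wanted, a UTC-based conversion can be sketched as follows. `utc_date` is a hypothetical helper, not part of the notebook.

```python
import datetime as dt

def utc_date(epoch_seconds):
    """Convert epoch seconds to a calendar date in UTC."""
    return dt.datetime.fromtimestamp(epoch_seconds, tz=dt.timezone.utc).date()

# created_utc of the first Bitcoin post shown in this notebook
print(utc_date(1626939006))  # 2021-07-22
```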

In [24]:
eth_test = fetch_posts_test('ethereum')
eth_test.head()

100
200
300
(300, 8)


Unnamed: 0,subreddit,title,selftext,created_utc,author,is_self,score,num_comments,timestamp
0,ethereum,Should I sell bitcoin and just go all in ethereum?,"* I have like $37.5k invested total in bitcoin, and after the crash, I only have $10.8k gains. If I sold at $65k, I'd have $100k right now in cash.\n* I basically bought a whole coin at around $10...",1626938727,wuzzgucci,True,1,44,2021-07-22
1,ethereum,The hug – ultra rare 1/1,[removed],1626938319,Needle_NFT,True,1,0,2021-07-22
2,ethereum,L2BEAT website upgrade,[removed],1626938005,AOFEX__Official,True,1,0,2021-07-22
3,ethereum,The Internet World of our Future!!! (INSANE) the best vid I’ve ever watched in my life of crypto,,1626937513,FarEnergy3518,False,1,0,2021-07-22
4,ethereum,I like the way Polygon thinks - Mihailo Bjelic on Twitter,,1626936362,excusemealot,False,1,1,2021-07-22


In [41]:
eth_test.to_csv('../data/ethereum_sub_test.csv', index=False)

### Webscraping - Bitcoin Submissions

Bitcoin Submissions - ACTUAL RUN

In [28]:
# page backwards through the Pushshift results, 100 submissions per request
# n caps the number of requests; the loop stops early once 10,000 rows are collected
# note: is_video is accepted but is not applied as a filter
def fetch_posts(subreddit, kind='submission', is_video=False, n=500):
       
        # establish params
        url = 'https://api.pushshift.io/reddit/search/' + kind
        posts = []
        current_time = '1626939127'
        
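        # run at most n requests
        for i in range(1, n+1):
            try:
                res = requests.get(
                    url,
                    params={
                        'subreddit': subreddit,
                        'size': 100,
                        'before': current_time
                    }
                )
            except requests.exceptions.RequestException:
                # skip this iteration if the request itself fails
                continue
                
            # convert the JSON payload into a DataFrame
            df = pd.DataFrame(res.json()['data'])
            
            # stop if the API returned no more posts
            if df.empty:
                break
            
            # move the cursor to the earliest timestamp in this batch,
            # so that the next request pages further back in time
            current_time = df['created_utc'].min()
            
            # append the df from each iteration to the posts list
            posts.append(df)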
            
            # find out no. of posts we're getting
            total_scraped = sum(len(x) for x in posts)
            
            print(total_scraped)
            
            # wait 3-10 seconds between requests and break once we obtain 10_000 total_scraped rows
            if total_scraped < 10_000:
                time.sleep(randint(3,10))
            else:
                break
                
        # merge list of dfs from our requests
        df_merged = pd.concat(posts, sort=False)
        
        # keep only the columns we need
        subfields = ['subreddit', 'title', 'selftext', 'created_utc', 'author', 'is_self', 'score', 'num_comments']
        df_merged = df_merged[subfields]
        
        # remove rows that are exact duplicates
        df_merged = df_merged.drop_duplicates()
        
        print(df_merged.shape)
        
        # convert the UTC epoch seconds to a calendar date (in the machine's local time zone)
        df_merged['timestamp'] = df_merged['created_utc'].map(dt.date.fromtimestamp)
        
        return df_merged.reset_index(drop=True)
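
In the loop above, a failed request falls through to `continue`, which retries immediately without waiting and uses up one of the `n` iterations. A common alternative is to retry with an increasing delay. The sketch below shows that pattern in isolation. `get_with_retries` and `flaky` are hypothetical names, and the `sleep` argument is injectable so that the example runs instantly.

```python
import time

def get_with_retries(request_fn, retries=3, base_delay=1.0, sleep=time.sleep):
    """Call request_fn() until it succeeds or the retries run out.

    Waits base_delay, then 2 * base_delay, then 4 * base_delay, ... between attempts.
    """
    for attempt in range(retries):
        try:
            return request_fn()
        except Exception:
            if attempt == retries - 1:
                raise  # out of retries: surface the last error
            sleep(base_delay * 2 ** attempt)

# demo: a call that fails twice, then succeeds
calls = {'n': 0}

def flaky():
    calls['n'] += 1
    if calls['n'] < 3:
        raise ConnectionError('temporary failure')
    return 'ok'

delays = []  # record the requested delays instead of actually sleeping
result = get_with_retries(flaky, retries=3, base_delay=1.0, sleep=delays.append)
print(result, delays)  # ok [1.0, 2.0]
```

In the scraper, `request_fn` could be a small lambda wrapping the `requests.get` call, with the default `time.sleep` left in place.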

In [30]:
btc = fetch_posts('Bitcoin')
btc.head()

100
200
...
9900
10000
(9998, 8)


Unnamed: 0,subreddit,title,selftext,created_utc,author,is_self,score,num_comments,timestamp
0,Bitcoin,Bitcoin Town - A fiction novel about using Bitcoin against the Great Reset,[removed],1626939006,theremnanthodl,True,1,0,2021-07-22
1,Bitcoin,Bitcoin Town - A fiction novel about using Bitcoin against the Great Reset,[removed],1626938084,theremnanthodl,True,1,0,2021-07-22
2,Bitcoin,"Crypto Influencers Dorsey, Woods, and Musk Face-off During B-Word Conference",,1626937970,ReadDailyCoin,False,1,1,2021-07-22
3,Bitcoin,Elon Musk’s View on Bitcoin (July 21 2021),,1626937137,theloiteringlinguist,False,1,2,2021-07-22
4,Bitcoin,"You’ve undoubtedly heard about crypto currencies by now such as Bitcoin, Ethereum and DOGE coin. But you can’t help but wonder what is it? What can I do with it? What is mining? Well CryptoMapz is...",,1626936557,Electronic_Chard1987,False,1,20,2021-07-22


In [42]:
btc.to_csv('../data/bitcoin_sub.csv', index=False)

### Webscraping - ethereum Submissions

ethereum Submissions - ACTUAL RUN

In [32]:
# page backwards through the Pushshift results, 100 submissions per request
# n caps the number of requests; the loop stops early once 10,000 rows are collected
# note: is_video is accepted but is not applied as a filter
def fetch_posts(subreddit, kind='submission', is_video=False, n=500):
       
        # establish params
        url = 'https://api.pushshift.io/reddit/search/' + kind
        posts = []
        current_time = '1626939643'
        
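        # run at most n requests
        for i in range(1, n+1):
            try:
                res = requests.get(
                    url,
                    params={
                        'subreddit': subreddit,
                        'size': 100,
                        'before': current_time
                    }
                )
            except requests.exceptions.RequestException:
                # skip this iteration if the request itself fails
                continue
                
            # convert the JSON payload into a DataFrame
            df = pd.DataFrame(res.json()['data'])
            
            # stop if the API returned no more posts
            if df.empty:
                break
            
            # move the cursor to the earliest timestamp in this batch,
            # so that the next request pages further back in time
            current_time = df['created_utc'].min()
            
            # append the df from each iteration to the posts list
            posts.append(df)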
            
            # find out no. of posts we're getting
            total_scraped = sum(len(x) for x in posts)
            
            print(total_scraped)
            
            # wait 3-10 seconds between requests and break once we obtain 10_000 total_scraped rows
            if total_scraped < 10_000:
                time.sleep(randint(3,10))
            else:
                break
                
        # merge list of dfs from our requests
        df_merged = pd.concat(posts, sort=False)
        
        # keep only the columns we need
        subfields = ['subreddit', 'title', 'selftext', 'created_utc', 'author', 'is_self', 'score', 'num_comments']
        df_merged = df_merged[subfields]
        
        # remove rows that are exact duplicates
        df_merged = df_merged.drop_duplicates()
        
        print(df_merged.shape)
        
        # convert the UTC epoch seconds to a calendar date (in the machine's local time zone)
        df_merged['timestamp'] = df_merged['created_utc'].map(dt.date.fromtimestamp)
        
        return df_merged.reset_index(drop=True)

In [33]:
eth = fetch_posts('ethereum')
eth.head()

100
200
...
9900
10000
(9999, 8)


Unnamed: 0,subreddit,title,selftext,created_utc,author,is_self,score,num_comments,timestamp
0,ethereum,Should I sell bitcoin and just go all in ethereum?,"* I have like $37.5k invested total in bitcoin, and after the crash, I only have $10.8k gains. If I sold at $65k, I'd have $100k right now in cash.\n* I basically bought a whole coin at around $10...",1626938727,wuzzgucci,True,1,44,2021-07-22
1,ethereum,The hug – ultra rare 1/1,[removed],1626938319,Needle_NFT,True,1,0,2021-07-22
2,ethereum,L2BEAT website upgrade,[removed],1626938005,AOFEX__Official,True,1,0,2021-07-22
3,ethereum,The Internet World of our Future!!! (INSANE) the best vid I’ve ever watched in my life of crypto,,1626937513,FarEnergy3518,False,1,0,2021-07-22
4,ethereum,I like the way Polygon thinks - Mihailo Bjelic on Twitter,,1626936362,excusemealot,False,1,1,2021-07-22


In [43]:
eth.to_csv('../data/ethereum_sub.csv', index=False)