# Project 3

# Nvidia Subreddit Scraping

In this notebook, we scrape data from the Nvidia subreddit. Four sets of posts are scraped, one for each listing view on the subreddit page: hot posts, new posts, rising posts, and the day's top posts.
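As a quick overview, the four views can be written down as a small lookup table. This is only a sketch: the URLs and CSV paths are the ones used later in this notebook, while the `LISTINGS` name and the loop are illustrative and not part of the original code.

```python
# Hypothetical lookup table: view name -> (JSON listing URL, output CSV path).
# URLs and paths are copied from the cells later in this notebook.
LISTINGS = {
    'new': ('https://www.reddit.com/r/nvidia/new.json', 'datasets/new_nvidia_posts.csv'),
    'hot': ('https://www.reddit.com/r/nvidia.json', 'datasets/main_nvidia_posts.csv'),
    'rising': ('https://www.reddit.com/r/nvidia/rising.json', 'datasets/rising_nvidia_posts.csv'),
    'top': ('https://www.reddit.com/r/nvidia/top.json', 'datasets/today_top_nvidia_posts.csv'),
}

# Print each view with its source and destination; no request is made here.
for view, (url, csv_path) in LISTINGS.items():
    print(view, url, csv_path)
```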

In [1]:
#Import necessary modules
import requests
import pandas as pd
import time
import random

In [2]:
#Set max display of columns and rows
pd.set_option('display.max_columns',None)
pd.set_option('display.max_rows',None)

In [3]:
def web_scrapper(url, csv_link):
    '''Scrape up to 40 pages of a subreddit JSON listing and save the posts to a CSV.'''
    posts = []
    after = None  # pagination token returned by the previous page

    for _ in range(40):
        # The first request has no token; later requests pass it via ?after=
        if after is None:
            current_url = url
        else:
            current_url = url + '?after=' + after
        print(current_url)
        res = requests.get(current_url, headers={'User-agent': 'Darion Inc 1.0'})

        if res.status_code != 200:
            print('Status error', res.status_code)
            break
        current_dict = res.json()
        current_posts = [p['data'] for p in current_dict['data']['children']]
        posts.extend(current_posts)
        after = current_dict['data']['after']

        # Pause for a random 2 to 10 seconds between requests
        sleep_duration = random.randint(2, 10)
        print(sleep_duration)
        time.sleep(sleep_duration)

        # Rewrite the CSV after every page so partial results are kept
        pd.DataFrame(posts).to_csv(csv_link, index=False)
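
Note that the function above falls back to the bare URL whenever the returned `after` token is `None`, so a listing that runs out of pages is fetched again from the start. This is what the repeated URLs in the rising and top logs further down show. A possible variant stops as soon as the token runs out. The sketch below isolates that pagination logic. `collect_pages`, `fetch_page` and `FAKE_PAGES` are hypothetical names, and the HTTP call is replaced by a fake two-page listing so the example runs without network access.

```python
from typing import Callable, Optional


def collect_pages(fetch_page: Callable[[Optional[str]], dict], max_pages: int = 40) -> list:
    """Follow 'after' tokens until they run out or max_pages is reached."""
    posts = []
    after = None
    for _ in range(max_pages):
        listing = fetch_page(after)['data']
        posts.extend(child['data'] for child in listing['children'])
        after = listing['after']
        if after is None:  # no further page: stop instead of refetching page one
            break
    return posts


# Hypothetical stand-in for the HTTP request: two fake pages keyed by token.
FAKE_PAGES = {
    None: {'data': {'children': [{'data': {'id': 'a'}}, {'data': {'id': 'b'}}], 'after': 't3_b'}},
    't3_b': {'data': {'children': [{'data': {'id': 'c'}}], 'after': None}},
}

fake_posts = collect_pages(lambda after: FAKE_PAGES[after])
print([p['id'] for p in fake_posts])
```

With a real listing, `fetch_page` would wrap the `requests.get` call and the random sleep from `web_scrapper`.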

## New Nvidia Subreddit Posts View

In [19]:
#This line is commented out to avoid accidentally triggering the web scraping
#web_scrapper('https://www.reddit.com/r/nvidia/new.json','datasets/new_nvidia_posts.csv')

https://www.reddit.com/r/nvidia/new.json
3
https://www.reddit.com/r/nvidia/new.json?after=t3_k30166
6
https://www.reddit.com/r/nvidia/new.json?after=t3_k2v97c
3
https://www.reddit.com/r/nvidia/new.json?after=t3_k2ozsk
7
https://www.reddit.com/r/nvidia/new.json?after=t3_k2jhsa
9
https://www.reddit.com/r/nvidia/new.json?after=t3_k2f48c
3
https://www.reddit.com/r/nvidia/new.json?after=t3_k2aqei
6
https://www.reddit.com/r/nvidia/new.json?after=t3_k23sj1
4
https://www.reddit.com/r/nvidia/new.json?after=t3_k1ze3k
4
https://www.reddit.com/r/nvidia/new.json?after=t3_k1ust4
9
https://www.reddit.com/r/nvidia/new.json?after=t3_k1phrw
10
https://www.reddit.com/r/nvidia/new.json?after=t3_k1lmbp
2
https://www.reddit.com/r/nvidia/new.json?after=t3_k1gk9m
3
https://www.reddit.com/r/nvidia/new.json?after=t3_k1aq4a
6
https://www.reddit.com/r/nvidia/new.json?after=t3_k13hfl
5
https://www.reddit.com/r/nvidia/new.json?after=t3_k0ziin
5
https://www.reddit.com/r/nvidia/new.json?after=t3_k0vegl
10
https://www

In [4]:
#Reading in the CSV
nvi_new=pd.read_csv('datasets/new_nvidia_posts.csv')
nvi_new.shape

(986, 114)

In [21]:
#First 5 rows of dataframe
nvi_new.head()

Unnamed: 0,approved_at_utc,subreddit,selftext,author_fullname,saved,mod_reason_title,gilded,clicked,title,link_flair_richtext,...,post_hint,url_overridden_by_dest,preview,link_flair_template_id,media_metadata,is_gallery,gallery_data,author_cakeday,crosspost_parent_list,crosspost_parent
0,,nvidia,"Have a 3080 Ventus, I've been running on drive...",t2_247yan2v,False,,0,False,Custom fan curve not working with latest drivers,[],...,,,,,,,,,,
1,,nvidia,,t2_4hzb2,False,,0,False,Just found an EVGA GeForce RTX 2080 Super Ftw3...,"[{'e': 'text', 't': 'Question'}]",...,image,https://i.redd.it/1alqwcvtl4261.jpg,{'images': [{'source': {'url': 'https://previe...,fb6f8e52-4086-11e6-b284-0ee96c7aff3d,,,,,,
2,,nvidia,I have read rumours that RTX 3080 Ti is launch...,t2_10jc8s,False,,0,False,Are the RTX 3080 Ti leaks reliable?,"[{'e': 'text', 't': 'Question'}]",...,,,,fb6f8e52-4086-11e6-b284-0ee96c7aff3d,,,,,,
3,,nvidia,"for Malaysian currency, it's RM3999 for the Ga...",t2_10y4b1,False,,0,False,I have a chance to buy either the RTX 3080 Gam...,"[{'e': 'text', 't': 'Question'}]",...,,,,fb6f8e52-4086-11e6-b284-0ee96c7aff3d,,,,,,
4,,nvidia,"Hi guys, \nA little background. I currently ha...",t2_8hurluin,False,,0,False,If I have the option to enable G-Sync in the N...,"[{'e': 'text', 't': 'Question'}]",...,,,,fb6f8e52-4086-11e6-b284-0ee96c7aff3d,,,,,,


## Hot Nvidia Subreddit Posts View

In [22]:
#This line is commented out to avoid accidentally triggering the web scraping
#web_scrapper('https://www.reddit.com/r/nvidia.json','datasets/main_nvidia_posts.csv')

https://www.reddit.com/r/nvidia.json
10
https://www.reddit.com/r/nvidia.json?after=t3_k2sehd
6
https://www.reddit.com/r/nvidia.json?after=t3_k2b445
7
https://www.reddit.com/r/nvidia.json?after=t3_k2nf8b
5
https://www.reddit.com/r/nvidia.json?after=t3_k250qk
9
https://www.reddit.com/r/nvidia.json?after=t3_k2cz90
4
https://www.reddit.com/r/nvidia.json?after=t3_k2gl9r
7
https://www.reddit.com/r/nvidia.json?after=t3_k2az6b
10
https://www.reddit.com/r/nvidia.json?after=t3_k1tprb
5
https://www.reddit.com/r/nvidia.json?after=t3_k1zgiv
7
https://www.reddit.com/r/nvidia.json?after=t3_k1ww2o
10
https://www.reddit.com/r/nvidia.json?after=t3_k1si47
3
https://www.reddit.com/r/nvidia.json?after=t3_k1fniw
6
https://www.reddit.com/r/nvidia.json?after=t3_k1h1kd
2
https://www.reddit.com/r/nvidia.json?after=t3_k0qemk
5
https://www.reddit.com/r/nvidia.json?after=t3_k0qhhm
5
https://www.reddit.com/r/nvidia.json?after=t3_k10yeu
5
https://www.reddit.com/r/nvidia.json?after=t3_k0vegl
10
https://www.reddit.com

In [5]:
#Reading in the CSV
nvi_main=pd.read_csv('datasets/main_nvidia_posts.csv')
nvi_main.shape

(997, 115)

In [24]:
#First 5 rows of dataframe
nvi_main.head()

Unnamed: 0,approved_at_utc,subreddit,selftext,author_fullname,saved,mod_reason_title,gilded,clicked,title,link_flair_richtext,...,num_crossposts,media,is_video,is_gallery,media_metadata,gallery_data,url_overridden_by_dest,author_cakeday,crosspost_parent_list,crosspost_parent
0,,nvidia,# Game Ready Driver 457.30 has been released.\...,t2_bvuns,False,,0,False,Game Ready Driver 457.30 FAQ/Discussion,"[{'e': 'text', 't': 'Discussion'}]",...,0,,False,,,,,,,
1,,nvidia,We're consolidating **all** tech support posts...,t2_bvuns,False,,0,False,Tech Support and Question Megathread - Novembe...,"[{'e': 'text', 't': 'Tech Support'}]",...,0,,False,,,,,,,
2,,nvidia,,t2_y88hz,False,,0,False,I like wood,"[{'e': 'text', 't': 'Build/Photos'}]",...,0,,False,True,"{'7314z8hgy0261': {'status': 'valid', 'e': 'Im...","{'items': [{'media_id': 'l131jqcey0261', 'id':...",https://www.reddit.com/gallery/k2svw3,,,
3,,nvidia,,t2_7lukx,False,,0,False,My 2021 - TRON - RTX3080 build is complete!,"[{'e': 'text', 't': 'Build/Photos'}]",...,0,,False,,,,https://i.redd.it/dbqm5ko7rz161.png,,,
4,,nvidia,,t2_3dluredk,False,,0,False,My 3080 Vision from last weeks Best Buy drop s...,"[{'e': 'text', 't': 'Build/Photos'}]",...,0,,False,True,"{'vm8yup20w2261': {'status': 'valid', 'e': 'Im...","{'items': [{'media_id': 'vm8yup20w2261', 'id':...",https://www.reddit.com/gallery/k2zs9k,,,


## Rising Nvidia Subreddit Posts View

In [25]:
#This line is commented out to avoid accidentally triggering the web scraping
#web_scrapper('https://www.reddit.com/r/nvidia/rising.json','datasets/rising_nvidia_posts.csv')

https://www.reddit.com/r/nvidia/rising.json
9
https://www.reddit.com/r/nvidia/rising.json
3
https://www.reddit.com/r/nvidia/rising.json
6
https://www.reddit.com/r/nvidia/rising.json
7
https://www.reddit.com/r/nvidia/rising.json
9
https://www.reddit.com/r/nvidia/rising.json
2
https://www.reddit.com/r/nvidia/rising.json
5
https://www.reddit.com/r/nvidia/rising.json
6
https://www.reddit.com/r/nvidia/rising.json
2
https://www.reddit.com/r/nvidia/rising.json
5
https://www.reddit.com/r/nvidia/rising.json
10
https://www.reddit.com/r/nvidia/rising.json
9
https://www.reddit.com/r/nvidia/rising.json
6
https://www.reddit.com/r/nvidia/rising.json
3
https://www.reddit.com/r/nvidia/rising.json
5
https://www.reddit.com/r/nvidia/rising.json
3
https://www.reddit.com/r/nvidia/rising.json
10
https://www.reddit.com/r/nvidia/rising.json
8
https://www.reddit.com/r/nvidia/rising.json
5
https://www.reddit.com/r/nvidia/rising.json
4
https://www.reddit.com/r/nvidia/rising.json
4
https://www.reddit.com/r/nvidia/

In [6]:
#Reading in the CSV
nvi_rising=pd.read_csv('datasets/rising_nvidia_posts.csv')
nvi_rising.shape

(920, 111)

In [33]:
#First 5 rows of dataframe
nvi_rising.head()

Unnamed: 0,approved_at_utc,subreddit,selftext,author_fullname,saved,mod_reason_title,gilded,clicked,title,link_flair_richtext,...,created_utc,num_crossposts,media,is_video,post_hint,url_overridden_by_dest,preview,is_gallery,media_metadata,gallery_data
0,,nvidia,"for Malaysian currency, it's RM3999 for the Ga...",t2_10y4b1,False,,0,False,I have a chance to buy either the RTX 3080 Gam...,"[{'e': 'text', 't': 'Question'}]",...,1606632000.0,0,,False,,,,,,
1,,nvidia,,t2_4hzb2,False,,0,False,Just found an EVGA GeForce RTX 2080 Super Ftw3...,"[{'e': 'text', 't': 'Question'}]",...,1606633000.0,0,,False,image,https://i.redd.it/1alqwcvtl4261.jpg,{'images': [{'source': {'url': 'https://previe...,,,
2,,nvidia,,t2_y88hz,False,,0,False,I like wood,"[{'e': 'text', 't': 'Build/Photos'}]",...,1606588000.0,0,,False,,https://www.reddit.com/gallery/k2svw3,,True,"{'7314z8hgy0261': {'status': 'valid', 'e': 'Im...","{'items': [{'media_id': 'l131jqcey0261', 'id':..."
3,,nvidia,,t2_7lukx,False,,0,False,My 2021 - TRON - RTX3080 build is complete!,"[{'e': 'text', 't': 'Build/Photos'}]",...,1606574000.0,0,,False,image,https://i.redd.it/dbqm5ko7rz161.png,{'images': [{'source': {'url': 'https://previe...,,,
4,,nvidia,,t2_3dluredk,False,,0,False,My 3080 Vision from last weeks Best Buy drop s...,"[{'e': 'text', 't': 'Build/Photos'}]",...,1606612000.0,0,,False,,https://www.reddit.com/gallery/k2zs9k,,True,"{'vm8yup20w2261': {'status': 'valid', 'e': 'Im...","{'items': [{'media_id': 'vm8yup20w2261', 'id':..."


## Today's Top Nvidia Subreddit Posts View

In [28]:
#This line is commented out to avoid accidentally triggering the web scraping
#web_scrapper('https://www.reddit.com/r/nvidia/top.json','datasets/today_top_nvidia_posts.csv')

https://www.reddit.com/r/nvidia/top.json
4
https://www.reddit.com/r/nvidia/top.json?after=t3_k2zky8
3
https://www.reddit.com/r/nvidia/top.json?after=t3_k2twhn
2
https://www.reddit.com/r/nvidia/top.json?after=t3_k2ozsk
8
https://www.reddit.com/r/nvidia/top.json
5
https://www.reddit.com/r/nvidia/top.json?after=t3_k2zky8
4
https://www.reddit.com/r/nvidia/top.json?after=t3_k2twhn
7
https://www.reddit.com/r/nvidia/top.json?after=t3_k2ozsk
10
https://www.reddit.com/r/nvidia/top.json
7
https://www.reddit.com/r/nvidia/top.json?after=t3_k2zky8
9
https://www.reddit.com/r/nvidia/top.json?after=t3_k2twhn
10
https://www.reddit.com/r/nvidia/top.json?after=t3_k2ozsk
2
https://www.reddit.com/r/nvidia/top.json
2
https://www.reddit.com/r/nvidia/top.json?after=t3_k2zky8
3
https://www.reddit.com/r/nvidia/top.json?after=t3_k2twhn
4
https://www.reddit.com/r/nvidia/top.json?after=t3_k2ozsk
9
https://www.reddit.com/r/nvidia/top.json
10
https://www.reddit.com/r/nvidia/top.json?after=t3_k2zky8
10
https://www.re

In [12]:
#Reading in the CSV
nvi_today_top=pd.read_csv('datasets/today_top_nvidia_posts.csv')
nvi_today_top.shape

(970, 114)

In [37]:
#First 5 rows of dataframe
nvi_today_top.head()

Unnamed: 0,approved_at_utc,subreddit,selftext,author_fullname,saved,mod_reason_title,gilded,clicked,is_gallery,title,...,subreddit_subscribers,created_utc,num_crossposts,media,is_video,post_hint,preview,crosspost_parent_list,crosspost_parent,author_cakeday
0,,nvidia,,t2_y88hz,False,,0,False,True,I like wood,...,587648,1606588000.0,0,,False,,,,,
1,,nvidia,,t2_7lukx,False,,0,False,,My 2021 - TRON - RTX3080 build is complete!,...,587648,1606574000.0,0,,False,image,{'images': [{'source': {'url': 'https://previe...,,,
2,,nvidia,,t2_f7yo8,False,,0,False,,Built my first gaming PC earlier this week! R5...,...,587648,1606561000.0,0,,False,image,{'images': [{'source': {'url': 'https://previe...,,,
3,,nvidia,,t2_8n8jcts,False,,0,False,,How to surprise your friend with your new GPU,...,587648,1606575000.0,1,,False,image,{'images': [{'source': {'url': 'https://previe...,,,
4,,nvidia,,t2_16pp7zkq,False,,0,False,True,Finally got the RTX 3090 Strix OC &amp; AMD 59...,...,587648,1606549000.0,0,,False,,,,,


Next, we drop all columns except subreddit, selftext and title. We will use selftext and title to train the classification model.

In [14]:
#Drop all columns except subreddit, selftext & title
nvi_main.drop(nvi_main.columns.difference(['subreddit','selftext','title']),axis=1,inplace=True)

In [15]:
#Drop all columns except subreddit, selftext & title
nvi_rising.drop(nvi_rising.columns.difference(['subreddit','selftext','title']),axis=1,inplace=True)

In [16]:
#Drop all columns except subreddit, selftext & title
nvi_new.drop(nvi_new.columns.difference(['subreddit','selftext','title']),axis=1,inplace=True)

In [17]:
#Drop all columns except subreddit, selftext & title
nvi_today_top.drop(nvi_today_top.columns.difference(['subreddit','selftext','title']),axis=1,inplace=True)
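
The four cells above keep three columns by dropping everything else in place. An equivalent approach is to select the wanted columns directly, which avoids repeating the `drop` call. The sketch below assumes a hypothetical helper `keep_text_columns` and a toy frame standing in for the scraped data.

```python
import pandas as pd

# Columns kept for the classification model (named in the text above).
TEXT_COLUMNS = ['subreddit', 'selftext', 'title']


def keep_text_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df restricted to the three text columns."""
    return df[TEXT_COLUMNS].copy()


# Toy frame with two extra columns standing in for the scraped fields.
demo = pd.DataFrame({
    'subreddit': ['nvidia'],
    'selftext': ['body text'],
    'title': ['a title'],
    'gilded': [0],
    'saved': [False],
})

trimmed = keep_text_columns(demo)
print(trimmed.columns.tolist())
```

Unlike `drop(..., inplace=True)`, this leaves the original frame untouched and returns a new one.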

In [18]:
#Shape of data to check the number of posts
nvi_main.shape

(997, 3)

In [19]:
#Shape of data to check the number of posts
nvi_rising.shape

(920, 3)

In [20]:
#Shape of data to check the number of posts
nvi_today_top.shape

(970, 3)

In [21]:
#Shape of data to check the number of posts
nvi_new.shape

(986, 3)

We will combine all four data sets into one.

In [22]:
#Combining all data sets into one
nvi_combine = pd.concat([nvi_rising,nvi_main,nvi_new,nvi_today_top])

In [24]:
#Shape of data to check the total number of posts
nvi_combine.shape

(3873, 3)

In [25]:
#Dropping duplicate values
nvi_combine = nvi_combine.drop_duplicates()

In [26]:
#Shape of data to check the total number of posts
nvi_combine.shape

(1000, 3)
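
The drop from 3873 to 1000 rows happens because `drop_duplicates` collapses rows that agree on all three columns, whether the repeat comes from the same post appearing in several views or from a page being fetched more than once. A minimal sketch on toy data, where `view_a` and `view_b` are hypothetical stand-ins for two scraped views:

```python
import pandas as pd

# Two toy views that share one post ('post 1').
view_a = pd.DataFrame({
    'subreddit': ['nvidia', 'nvidia'],
    'selftext': ['x', None],
    'title': ['post 1', 'post 2'],
})
view_b = pd.DataFrame({
    'subreddit': ['nvidia', 'nvidia'],
    'selftext': ['x', 'y'],
    'title': ['post 1', 'post 3'],
})

# ignore_index=True (not used in the notebook) renumbers rows 0..n-1.
combined = pd.concat([view_a, view_b], ignore_index=True)

# Rows identical in every column are reduced to their first occurrence.
deduped = combined.drop_duplicates()
print(combined.shape, deduped.shape)
```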

In [27]:
#Checking for null values
nvi_combine.isnull().sum()

subreddit      0
selftext     278
title          0
dtype: int64

In [32]:
#Drop rows with null selftext, as a null value means the post is a non-text post
nvi_combine.dropna(subset=['selftext'],inplace=True)

In [35]:
#Top 5 rows of the combined dataset
nvi_combine.head()

Unnamed: 0,subreddit,selftext,title
0,nvidia,"for Malaysian currency, it's RM3999 for the Ga...",I have a chance to buy either the RTX 3080 Gam...
12,nvidia,Download your promotional code for GeForce NOW...,[GIVEAWAY] GeForce NOW 1 year Founders Members...
13,nvidia,I’m wanting to build my pc and I’m buying part...,When is the best time to buy a GTX 2060??
21,nvidia,Anyone who has this card want to provide some ...,Thoughts on Evga 3080 FTW3 Ultra
0,nvidia,# Game Ready Driver 457.30 has been released.\...,Game Ready Driver 457.30 FAQ/Discussion
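
The row label 0 appears twice in the output above because `pd.concat` was called without `ignore_index`, so each source frame kept its own labels. If unique labels are wanted, one option is to reset the index after dropping the null rows. The sketch below uses a toy frame with a deliberately repeated label, and `text_only` is a hypothetical name.

```python
import pandas as pd

# Toy frame with a repeated row label, mimicking the combined data set.
toy = pd.DataFrame(
    {
        'subreddit': ['nvidia'] * 3,
        'selftext': ['text post', None, 'another text post'],
        'title': ['a', 'b', 'c'],
    },
    index=[0, 12, 0],
)

# Drop posts without selftext, then renumber the remaining rows from 0.
text_only = toy.dropna(subset=['selftext']).reset_index(drop=True)
print(text_only)
```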


In [37]:
#Saving the data set into a csv
nvi_combine.to_csv('datasets/nvi_combine.csv',index=False)