<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Project 3: Notebook 1
## Audio Engineering Subreddit Pre-Processing

---
## Section 1: Imports

---

**BELOW:** Importing the Python libraries I need to scrape and pre-process my first subreddit, with help from a [YouTube video](https://www.youtube.com/watch?v=AcrjEWsMi_E&feature=youtu.be) by GA instructor Riley Dallas.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import requests
from time import sleep

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

In [2]:
url = 'https://api.pushshift.io/reddit/search/submission'

In [3]:
# Defining a parameter list to set the scope of my scrape - here, the Audio Engineering Subreddit, with a scrape size of 100 posts per scrape.
params = {
        'subreddit': 'audioengineering',
        'size': 100
}

In [4]:
# Setting a response variable for my request to the Pushshift API.
res = requests.get(url, params=params)

In [5]:
# Checking the status code - response is 200, so looking good!
res.status_code

200
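
A 200 can also be checked in code rather than by eye. The sketch below is one possible way to do that. The helper name `fetch_posts`, the injectable `get` argument, and the 10-second timeout are assumptions for illustration and are not part of the original notebook; `raise_for_status()` and `timeout` are standard `requests` features.

```python
def fetch_posts(url, params, get=None, timeout=10):
    """Return the list stored under 'data' in a Pushshift-style JSON response.

    `fetch_posts`, `get`, and `timeout=10` are illustrative choices, not part
    of the original notebook.
    """
    if get is None:
        # Imported lazily so the helper can be exercised with a stand-in `get`.
        import requests
        get = requests.get
    res = get(url, params=params, timeout=timeout)
    # Raises an HTTPError for 4xx/5xx responses instead of failing later on.
    res.raise_for_status()
    return res.json()['data']
```

Called as `fetch_posts(url, params)`, this would return the same list that `res.json()['data']` yields, but it stops with an error if the API responds with a failure code.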

In [6]:
# Parsing the JSON response body into a Python dictionary.
data = res.json()

In [7]:
# Defining 'posts' as the list of posts stored under the 'data' key. I'll later add the posts from each subsequent pull to 'posts'.
posts = data['data']

In [8]:
# Checking to see if my pull worked - it did!
posts[0]

{'all_awardings': [],
 'allow_live_comments': False,
 'author': 'Tuniac',
 'author_flair_css_class': None,
 'author_flair_richtext': [],
 'author_flair_text': None,
 'author_flair_type': 'text',
 'author_fullname': 't2_978df',
 'author_patreon_flair': False,
 'author_premium': False,
 'awarders': [],
 'can_mod_post': False,
 'contest_mode': False,
 'created_utc': 1606787707,
 'domain': 'self.audioengineering',
 'full_link': 'https://www.reddit.com/r/audioengineering/comments/k4axms/help_identifying_a_mic/',
 'gildings': {},
 'id': 'k4axms',
 'is_crosspostable': True,
 'is_meta': False,
 'is_original_content': False,
 'is_reddit_media_domain': False,
 'is_robot_indexable': True,
 'is_self': True,
 'is_video': False,
 'link_flair_background_color': '',
 'link_flair_richtext': [],
 'link_flair_text_color': 'dark',
 'link_flair_type': 'text',
 'locked': False,
 'media_only': False,
 'no_follow': False,
 'num_comments': 0,
 'num_crossposts': 0,
 'over_18': False,
 'parent_whitelist_status':

---
## Section 2: Pulling Data From Audio Engineering Subreddit
---

**BELOW:** Setting up the Pushshift API, with help from a [YouTube video](https://www.youtube.com/watch?v=AcrjEWsMi_E&feature=youtu.be) by GA instructor Riley Dallas. I also worked with Nolan A, and had help from Noah and Patrick working out the kinks!

In [9]:
url = 'https://api.pushshift.io/reddit/search/submission'

In [10]:
# I had help with this loop from Nolan A.

# Iterating 49 times (i runs from 0 through 48). Together with the initial pull of 100 posts, this brings the total to 5,000.
for i in range(49):
    # Setting a 'sleep' timer to make sure the computer waits 3 seconds between each scrape.
    sleep(3)
    # Storing the 'created_utc' value (a Unix timestamp in UTC) of the oldest post scraped so far in a 'utc' variable.
    # Passing it as 'before' on the next request lets me bring in posts in groups of 100 without pulling the same 100 posts each time through the loop.
    utc = posts[-1]['created_utc']
    # Defining a parameter list to set the scope of my scrape - here, the Audio Engineering Subreddit, with a scrape size
    # of 100 posts per scrape, and 'before' specifying the utc (defined above) so that posts aren't duplicated.
    params = {
        'subreddit': 'audioengineering',
        'size': 100,
        'before': utc
    }
    
    # Setting a response variable for my requests from the Pushshift API.
    res = requests.get(url, params=params)
    # Parsing the JSON response and keeping the list of posts stored under 'data'.
    data = res.json()['data']
    # Compiling all posts by adding the new data obtained upon each iteration.
    posts = posts + data

In [11]:
# Looking at how many posts I pulled in.
len(posts)

5000
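
The paging idea in the loop above (use the oldest timestamp seen so far as the `before` cursor for the next request) can be sketched on its own, without the network. This is a minimal sketch, assuming results arrive newest first; `collect_posts` and `fetch_page` are hypothetical names that do not appear in the original notebook.

```python
from time import sleep


def collect_posts(fetch_page, n_pages, pause=0.0):
    """Page backwards through posts using the oldest timestamp seen so far.

    `fetch_page(before)` is a hypothetical callable that returns a list of
    post dicts created before `before` (or the newest posts when `before`
    is None), ordered newest first.
    """
    posts = []
    before = None
    for _ in range(n_pages):
        page = fetch_page(before)
        if not page:
            # Stop early on an empty page so page[-1] is never indexed.
            break
        posts.extend(page)
        # The oldest post on this page becomes the cursor for the next request.
        before = page[-1]['created_utc']
        sleep(pause)
    return posts
```

In the notebook's setting, `fetch_page` would wrap the `requests.get` call with `'before'` in its parameters, and `pause=3` would reproduce the three-second wait between requests.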

In [12]:
# Creating a pandas dataframe from my scraped posts.
audio_df = pd.DataFrame(posts)

In [13]:
audio_df.head()

Unnamed: 0,all_awardings,allow_live_comments,author,author_flair_css_class,author_flair_richtext,author_flair_text,author_flair_type,author_fullname,author_patreon_flair,author_premium,...,suggested_sort,author_cakeday,crosspost_parent,crosspost_parent_list,thumbnail_height,thumbnail_width,url_overridden_by_dest,edited,banned_by,distinguished
0,[],False,Tuniac,,[],,text,t2_978df,False,False,...,,,,,,,,,,
1,[],False,out-of-the-ordinary,,[],,text,t2_5etcgj4w,False,False,...,,,,,,,,,,
2,[],False,ChristopherEv,,[],,text,t2_q72eyoi,False,False,...,,,,,,,,,,
3,[],False,gainstager,bluestripe,[],Audio Software,text,t2_3t4mqs5b,False,False,...,,,,,,,,,,
4,[],False,ThesisWarrior,,[],,text,t2_1z268pgg,False,False,...,,,,,,,,,,


In [14]:
# Singling out columns to look at as per Riley Dallas' video.
audio_df[['subreddit', 'title', 'selftext']]

Unnamed: 0,subreddit,title,selftext
0,audioengineering,Help Identifying a Mic,[https://www.youtube.com/watch?v=Gr9mOTKQELE&a...
1,audioengineering,Import eq filters to t.racks dsp 4x4 mini from...,"Hello dear people of the internet,\n\n&amp;#x2..."
2,audioengineering,Could I theoretically split a male 3.5mm into ...,Basically I have a headset with microphone and...
3,audioengineering,"I’ve got a great mix, but I know it will be a ...",I’ve been working on a mix for a little too lo...
4,audioengineering,Attack and Release on a Compressor - HOW sound...,Hi all - ive created a basic diagram graphic f...
...,...,...,...
4995,audioengineering,The state of plugin licensing is terrible.,[removed]
4996,audioengineering,Question about XLR and 1/4inch cables,"this may sound like a stupid question, but her..."
4997,audioengineering,Help with Ceiling Mounted Mic Stand,I've been on a little bit of a journey with th...
4998,audioengineering,Sony Walkman NW-A40 vs NW-A50,"Apart from some obvious cosmetic changes, is t..."


---
## Section 3: Cleaning
---

In [15]:
audio_df.head()

Unnamed: 0,all_awardings,allow_live_comments,author,author_flair_css_class,author_flair_richtext,author_flair_text,author_flair_type,author_fullname,author_patreon_flair,author_premium,...,suggested_sort,author_cakeday,crosspost_parent,crosspost_parent_list,thumbnail_height,thumbnail_width,url_overridden_by_dest,edited,banned_by,distinguished
0,[],False,Tuniac,,[],,text,t2_978df,False,False,...,,,,,,,,,,
1,[],False,out-of-the-ordinary,,[],,text,t2_5etcgj4w,False,False,...,,,,,,,,,,
2,[],False,ChristopherEv,,[],,text,t2_q72eyoi,False,False,...,,,,,,,,,,
3,[],False,gainstager,bluestripe,[],Audio Software,text,t2_3t4mqs5b,False,False,...,,,,,,,,,,
4,[],False,ThesisWarrior,,[],,text,t2_1z268pgg,False,False,...,,,,,,,,,,


In [16]:
audio_df.columns

Index(['all_awardings', 'allow_live_comments', 'author',
       'author_flair_css_class', 'author_flair_richtext', 'author_flair_text',
       'author_flair_type', 'author_fullname', 'author_patreon_flair',
       'author_premium', 'awarders', 'can_mod_post', 'contest_mode',
       'created_utc', 'domain', 'full_link', 'gildings', 'id',
       'is_crosspostable', 'is_meta', 'is_original_content',
       'is_reddit_media_domain', 'is_robot_indexable', 'is_self', 'is_video',
       'link_flair_background_color', 'link_flair_richtext',
       'link_flair_text_color', 'link_flair_type', 'locked', 'media_only',
       'no_follow', 'num_comments', 'num_crossposts', 'over_18',
       'parent_whitelist_status', 'permalink', 'pinned', 'post_hint',
       'preview', 'pwls', 'retrieved_on', 'score', 'selftext', 'send_replies',
       'spoiler', 'stickied', 'subreddit', 'subreddit_id',
       'subreddit_subscribers', 'subreddit_type', 'thumbnail', 'title',
       'total_awards_received', 'treatmen

In [17]:
audio_df.isnull().sum()

all_awardings                0
allow_live_comments          0
author                       0
author_flair_css_class    4872
author_flair_richtext       34
                          ... 
thumbnail_width           4992
url_overridden_by_dest    4949
edited                    4969
banned_by                 4990
distinguished             4957
Length: 76, dtype: int64
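
With 76 columns, raw null counts are easier to read as a share of rows. The sketch below shows one way to rank columns by missing share, using a made-up toy frame instead of the scraped data; the 0.5 cut-off is an arbitrary assumption.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the scraped posts; the values are made up.
toy_df = pd.DataFrame({
    'author': ['a', 'b', 'c', 'd'],
    'banned_by': [np.nan, np.nan, np.nan, 'moderators'],
    'selftext': ['text', np.nan, 'text', 'text'],
})

# Fraction of missing values per column, largest first.
null_share = toy_df.isnull().mean().sort_values(ascending=False)

# Columns that are missing in more than half of the rows (assumed threshold).
mostly_empty = null_share[null_share > 0.5].index.tolist()
print(mostly_empty)
```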

In [18]:
# Modifying dataframe to contain only relevant features for building my model.
audio_df = audio_df[['subreddit', 'author', 'title', 'selftext']]

In [19]:
audio_df.head()

Unnamed: 0,subreddit,author,title,selftext
0,audioengineering,Tuniac,Help Identifying a Mic,[https://www.youtube.com/watch?v=Gr9mOTKQELE&a...
1,audioengineering,out-of-the-ordinary,Import eq filters to t.racks dsp 4x4 mini from...,"Hello dear people of the internet,\n\n&amp;#x2..."
2,audioengineering,ChristopherEv,Could I theoretically split a male 3.5mm into ...,Basically I have a headset with microphone and...
3,audioengineering,gainstager,"I’ve got a great mix, but I know it will be a ...",I’ve been working on a mix for a little too lo...
4,audioengineering,ThesisWarrior,Attack and Release on a Compressor - HOW sound...,Hi all - ive created a basic diagram graphic f...


In [20]:
# The author 'staticblue38' seemed to be posting spam, so here I'm looking for instances of their posts.
# The one post found is a removed promotional listing, so I'll drop this author's posts.
audio_df[audio_df['author'] == 'staticblue38']

Unnamed: 0,subreddit,author,title,selftext
229,audioengineering,staticblue38,"[Black Friday] 60% Off 3000+ Sample Packs, Tem...",[removed]


In [21]:
# Dropping staticblue38's posts.
audio_df = audio_df[audio_df['author'] != 'staticblue38']

In [22]:
# Looking for instances of removed posts in selftext - designated on Reddit with '[removed]'.
audio_df[audio_df['selftext'] == '[removed]']

Unnamed: 0,subreddit,author,title,selftext
8,audioengineering,Murky_Yogurtcloset56,Add artificial emotion to a tts bot?,[removed]
42,audioengineering,Pmalo512,Where do I start!?!?,[removed]
48,audioengineering,ruqweller,Behringer U-Phoria UM2 Auido Interface Weird N...,[removed]
50,audioengineering,Gangliitis,Turn Text into Human Sounding Voiceovers (One ...,[removed]
59,audioengineering,hoppedoffzeeplane,Anyone know some reputable mastering engineers...,[removed]
...,...,...,...,...
4954,audioengineering,frmrmtribmbr,Uli Behringer is a racist and a sexist. Worked...,[removed]
4966,audioengineering,elhemida,A fake neumann u87ai on aliexpress ? Wondering...,[removed]
4971,audioengineering,KrutoChuvak265,Which Pair of Headphones Would You Guys Recomm...,[removed]
4986,audioengineering,fishermansbluegrass,UA Needs our help,[removed]


In [23]:
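# Removing rows whose selftext is '[removed]'. The .copy() avoids a SettingWithCopyWarning when I add a column to this dataframe later.
audio_df = audio_df[audio_df['selftext'] != '[removed]'].copy()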
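
The author filter and the `[removed]` filter can also be expressed as a single boolean mask. The following is a sketch on a toy frame, not the notebook's actual code: the `blocked_authors` and `placeholder_text` lists are hypothetical, and including `'[deleted]'` is an assumption about markers that may or may not occur in this data.

```python
import pandas as pd

# Toy frame standing in for audio_df; the values are made up.
toy_df = pd.DataFrame({
    'author': ['u1', 'spammer', 'u2', 'u3'],
    'selftext': ['real question', 'buy now', '[removed]', '[deleted]'],
})

# Hypothetical block lists; '[deleted]' is an assumed extra marker.
blocked_authors = ['spammer']
placeholder_text = ['[removed]', '[deleted]']

# Keep rows whose author is not blocked and whose text is not a placeholder.
keep = ~toy_df['author'].isin(blocked_authors) & ~toy_df['selftext'].isin(placeholder_text)

# .copy() returns an independent frame, so later column assignments are unambiguous.
clean_df = toy_df[keep].copy()
```

As with the `!=` comparison, rows with a missing `selftext` pass through this mask, since `isin` returns `False` for missing values.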

In [24]:
# Engineering a new feature, 'full_post', to contain title and selftext (in other words, the entire content of the user's original post, comments excluded).
audio_df['full_post'] = audio_df['title'] + ' ' + audio_df['selftext']
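
One thing to keep in mind with this concatenation: when `selftext` is missing, the combined string is missing too. The sketch below demonstrates that on a toy frame and shows one possible alternative, filling missing bodies with an empty string, for cases where title-only posts should be kept. This notebook instead drops those rows further down.

```python
import numpy as np
import pandas as pd

# Toy frame with one complete post and one title-only post; values are made up.
toy_df = pd.DataFrame({
    'title': ['Help Identifying a Mic', 'Title-only post'],
    'selftext': ['Some body text', np.nan],
})

# Plain concatenation: a missing selftext makes the whole result missing.
naive = toy_df['title'] + ' ' + toy_df['selftext']

# Alternative: treat a missing body as an empty string, then trim the trailing space.
filled = (toy_df['title'] + ' ' + toy_df['selftext'].fillna('')).str.strip()
```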

In [25]:
audio_df.shape

(4622, 5)

In [26]:
audio_df.head()

Unnamed: 0,subreddit,author,title,selftext,full_post
0,audioengineering,Tuniac,Help Identifying a Mic,[https://www.youtube.com/watch?v=Gr9mOTKQELE&a...,Help Identifying a Mic [https://www.youtube.co...
1,audioengineering,out-of-the-ordinary,Import eq filters to t.racks dsp 4x4 mini from...,"Hello dear people of the internet,\n\n&amp;#x2...",Import eq filters to t.racks dsp 4x4 mini from...
2,audioengineering,ChristopherEv,Could I theoretically split a male 3.5mm into ...,Basically I have a headset with microphone and...,Could I theoretically split a male 3.5mm into ...
3,audioengineering,gainstager,"I’ve got a great mix, but I know it will be a ...",I’ve been working on a mix for a little too lo...,"I’ve got a great mix, but I know it will be a ..."
4,audioengineering,ThesisWarrior,Attack and Release on a Compressor - HOW sound...,Hi all - ive created a basic diagram graphic f...,Attack and Release on a Compressor - HOW sound...


In [27]:
audio_df.shape

(4622, 5)

In [28]:
audio_df.drop_duplicates().shape

(4594, 5)
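
By default `drop_duplicates()` treats two rows as duplicates only when every column matches, including `author`. As a hedged illustration on made-up data, the `subset` argument narrows the comparison to the text columns, which would also catch the same post submitted under different usernames. Whether that is desirable here is a judgment call.

```python
import pandas as pd

# Toy frame: the same post twice by one author and once by another.
toy_df = pd.DataFrame({
    'author': ['u1', 'u1', 'u2'],
    'title': ['Same title', 'Same title', 'Same title'],
    'selftext': ['same body', 'same body', 'same body'],
})

# Default: a row is a duplicate only if every column matches.
all_columns = toy_df.drop_duplicates()

# Restricting to the text columns also catches reposts by different authors.
text_only = toy_df.drop_duplicates(subset=['title', 'selftext'])
```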

In [29]:
audio_df = audio_df.drop_duplicates()

In [30]:
audio_df.isnull().sum()

subreddit     0
author        0
title         0
selftext     10
full_post    10
dtype: int64

In [31]:
audio_df = audio_df.dropna()

In [32]:
audio_df.shape

(4584, 5)

---

In [33]:
# Saving initial cleaned dataframe for Audio Engineering subreddit to csv.
audio_df.to_csv('../data/audioengineering_subreddit_01.csv', index = False)
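
A quick round trip is one way to confirm the file was written as expected. The sketch below uses a toy frame and a temporary directory (both assumptions for illustration) rather than the real `../data/` path.

```python
import os
import tempfile

import pandas as pd

# Toy frame standing in for the cleaned dataframe; the values are made up.
toy_df = pd.DataFrame({
    'subreddit': ['audioengineering', 'audioengineering'],
    'full_post': ['first post', 'second post'],
})

# Write to a temporary directory rather than the notebook's real data path.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'toy_subreddit.csv')
    toy_df.to_csv(path, index=False)
    reloaded = pd.read_csv(path)

# With index=False there is no extra 'Unnamed: 0' column after reloading.
same_shape = reloaded.shape == toy_df.shape
same_columns = list(reloaded.columns) == list(toy_df.columns)
```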