<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Project 3: Notebook 2
## Songwriting Subreddit Pre-Processing

---
## Section 1: Imports

---

**BELOW:** Importing the Python libraries for processing and testing my second subreddit, with help from a [YouTube video](https://www.youtube.com/watch?v=AcrjEWsMi_E&feature=youtu.be) by GA instructor Riley Dallas.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import requests
from time import sleep

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

In [2]:
url = 'https://api.pushshift.io/reddit/search/submission'

In [3]:
# Defining a parameter list to set the scope of my scrape - here, the Songwriting Subreddit, with a scrape size of 100 posts per scrape.
params = {
        'subreddit': 'Songwriting',
        'size': 100
}

In [4]:
# Setting a response variable for my requests from the Pushshift API.
res = requests.get(url, params)

In [5]:
# Checking the status code - response is 200, so looking good!
res.status_code

200

In [6]:
# Converting data to json.
data = res.json()

In [7]:
# Defining 'posts' to equal jsonified data. I'll later concatenate all the subreddit entries to 'posts'.
posts = data['data']

In [8]:
# Checking to see if my pull worked - it did!
posts[0]

{'all_awardings': [],
 'allow_live_comments': False,
 'author': '3rrr6',
 'author_flair_css_class': None,
 'author_flair_richtext': [],
 'author_flair_text': None,
 'author_flair_type': 'text',
 'author_fullname': 't2_jyu6a',
 'author_patreon_flair': False,
 'author_premium': False,
 'awarders': [],
 'can_mod_post': False,
 'contest_mode': False,
 'created_utc': 1606783939,
 'domain': 'self.Songwriting',
 'full_link': 'https://www.reddit.com/r/Songwriting/comments/k49sb4/a_melody_first_lyrics_last_songwriting_procedure/',
 'gildings': {},
 'id': 'k49sb4',
 'is_crosspostable': True,
 'is_meta': False,
 'is_original_content': False,
 'is_reddit_media_domain': False,
 'is_robot_indexable': True,
 'is_self': True,
 'is_video': False,
 'link_flair_background_color': '#ffd779',
 'link_flair_richtext': [],
 'link_flair_template_id': 'e3905cf4-fd16-11e8-8cbc-0e37a6fc933c',
 'link_flair_text': 'Resource',
 'link_flair_text_color': 'dark',
 'link_flair_type': 'text',
 'locked': False,
 'media_on

---
## Section 2: Pulling Data From Songwriting Subreddit
---

**BELOW:** Setting up the Pushshift API, with help from a [YouTube video](https://www.youtube.com/watch?v=AcrjEWsMi_E&feature=youtu.be) by GA instructor Riley Dallas. I also worked with Nolan A, and had help from Noah and Patrick working out the kinks!

In [9]:
url = 'https://api.pushshift.io/reddit/search/submission'

In [10]:
# I had help with this loop from Nolan A.

# Looping 49 times (i runs from 0 through 48). Together with the initial pull above, that makes 50 scrapes of 100 posts each.
for i in range(49):
    # Setting a 'sleep' timer to make sure the computer waits 3 seconds between each scrape.
    sleep(3)
    # Creating a 'utc' variable and assigning it the 'created_utc' timestamp (Unix epoch seconds) of the oldest post scraped so far.
    # This enables me to bring in posts in groups of 100, while ensuring that I don't pull the same 100 posts each time through the loop.
    utc = posts[-1]['created_utc']
    # Defining a parameter list to set the scope of my scrape - here, the Songwriting Subreddit, with a scrape size
    # of 100 posts per scrape, and 'before' specifying the utc (defined above) so that posts aren't duplicated.
    params = {
        'subreddit': 'Songwriting',
        'size': 100,
        'before': utc
    }
    
    # Setting a response variable for my requests from the Pushshift API.
    res = requests.get(url, params)
    # Converting these responses to json.
    data = res.json()['data']
    # Compiling all posts by adding the new data obtained upon each iteration.
    posts = posts + data
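
**BELOW:** A minimal sketch of the same "page backwards with `before`" idea, written so it can run without touching the API. The names `collect_posts`, `fetch_page`, `fake_posts`, and `fake_fetch` are hypothetical stand-ins I made up for illustration - in the loop above, the role of `fetch_page` is played by `requests.get`. The one behavioral difference is that this version stops early if a batch comes back empty, instead of repeating the same request.

```python
from time import sleep


def collect_posts(fetch_page, first_page, n_pages, pause=0):
    """Page backwards through posts using each batch's oldest timestamp.

    fetch_page is a hypothetical callable: it takes a 'before' timestamp
    and returns a list of post dicts that are older than that timestamp.
    """
    posts = list(first_page)
    for _ in range(n_pages):
        if not posts:
            break
        sleep(pause)
        # The oldest post pulled so far sets the cutoff for the next batch.
        before = posts[-1]['created_utc']
        batch = fetch_page(before)
        # An empty batch means there is nothing older left to pull.
        if not batch:
            break
        posts.extend(batch)
    return posts


# Stand-in for the API: 12 fake posts, newest first, served 5 at a time.
fake_posts = [{'id': i, 'created_utc': 1000 - i} for i in range(12)]


def fake_fetch(before, size=5):
    older = [p for p in fake_posts if p['created_utc'] < before]
    return older[:size]


collected = collect_posts(fake_fetch, fake_posts[:5], n_pages=10)
print(len(collected))
```

Because each request only asks for posts strictly older than the last one collected, no batch overlaps the one before it.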

In [11]:
# Looking at how many posts I pulled in.
len(posts)

5000

In [12]:
# Creating a pandas dataframe from my scraped posts.
song_df = pd.DataFrame(posts)

In [13]:
song_df.head()

Unnamed: 0,all_awardings,allow_live_comments,author,author_flair_css_class,author_flair_richtext,author_flair_text,author_flair_type,author_fullname,author_patreon_flair,author_premium,...,author_flair_background_color,author_flair_text_color,removed_by_category,gallery_data,is_gallery,media_metadata,edited,author_cakeday,poll_data,banned_by
0,[],False,3rrr6,,[],,text,t2_jyu6a,False,False,...,,,,,,,,,,
1,[],False,PjoSuperFan1,,[],,text,t2_3vat0sey,False,False,...,,,,,,,,,,
2,[],False,SeveralSmalMouses,,[],,text,t2_3jvx93kc,False,False,...,,,,,,,,,,
3,[],False,Letitbedone56363,,[],,text,t2_84luw0q8,False,False,...,,,,,,,,,,
4,[],False,feli-usp,,[],,text,t2_93d6fd4z,False,False,...,,,,,,,,,,


In [14]:
# Singling out columns to look at as per Riley Dallas' video.
song_df[['subreddit', 'title', 'selftext']]

Unnamed: 0,subreddit,title,selftext
0,Songwriting,"A ""melody first, lyrics last"" songwriting proc...","1. Find a key, tempo, and time signature you l..."
1,Songwriting,Teen Music Makers Discord,\nI’m looking for new teen members (ages 13-19...
2,Songwriting,I would like to start writing songs.,I haven't written songs before and would like ...
3,Songwriting,Some lyrics- if you want to sing them and reco...,Never knew I could meet someone that can make ...
4,Songwriting,Hey!! Wrote this one a couple of months ago an...,
...,...,...,...
4995,Songwriting,How Pain In Your Life Affects The Songs You Make,
4996,Songwriting,Edgy Song,[removed]
4997,Songwriting,New to this sub and need feedback,New to this sub and this would be my very firs...
4998,Songwriting,New to this sub and need feedback,New to this sub and this would be my very firs...


---
## Section 3: Cleaning
---

In [15]:
song_df.head()

Unnamed: 0,all_awardings,allow_live_comments,author,author_flair_css_class,author_flair_richtext,author_flair_text,author_flair_type,author_fullname,author_patreon_flair,author_premium,...,author_flair_background_color,author_flair_text_color,removed_by_category,gallery_data,is_gallery,media_metadata,edited,author_cakeday,poll_data,banned_by
0,[],False,3rrr6,,[],,text,t2_jyu6a,False,False,...,,,,,,,,,,
1,[],False,PjoSuperFan1,,[],,text,t2_3vat0sey,False,False,...,,,,,,,,,,
2,[],False,SeveralSmalMouses,,[],,text,t2_3jvx93kc,False,False,...,,,,,,,,,,
3,[],False,Letitbedone56363,,[],,text,t2_84luw0q8,False,False,...,,,,,,,,,,
4,[],False,feli-usp,,[],,text,t2_93d6fd4z,False,False,...,,,,,,,,,,


In [16]:
song_df.columns

Index(['all_awardings', 'allow_live_comments', 'author',
       'author_flair_css_class', 'author_flair_richtext', 'author_flair_text',
       'author_flair_type', 'author_fullname', 'author_patreon_flair',
       'author_premium', 'awarders', 'can_mod_post', 'contest_mode',
       'created_utc', 'domain', 'full_link', 'gildings', 'id',
       'is_crosspostable', 'is_meta', 'is_original_content',
       'is_reddit_media_domain', 'is_robot_indexable', 'is_self', 'is_video',
       'link_flair_background_color', 'link_flair_richtext',
       'link_flair_template_id', 'link_flair_text', 'link_flair_text_color',
       'link_flair_type', 'locked', 'media_only', 'no_follow', 'num_comments',
       'num_crossposts', 'over_18', 'parent_whitelist_status', 'permalink',
       'pinned', 'pwls', 'retrieved_on', 'score', 'selftext', 'send_replies',
       'spoiler', 'stickied', 'subreddit', 'subreddit_id',
       'subreddit_subscribers', 'subreddit_type', 'thumbnail', 'title',
       'total_awards

In [17]:
song_df.isnull().sum()

all_awardings                0
allow_live_comments          0
author                       0
author_flair_css_class    5000
author_flair_richtext      126
                          ... 
media_metadata            4936
edited                    4963
author_cakeday            4991
poll_data                 4993
banned_by                 4991
Length: 80, dtype: int64

In [18]:
# Modifying dataframe to contain only relevant features for building my model.
song_df = song_df[['subreddit', 'author', 'title', 'selftext']]

In [19]:
song_df.head()

Unnamed: 0,subreddit,author,title,selftext
0,Songwriting,3rrr6,"A ""melody first, lyrics last"" songwriting proc...","1. Find a key, tempo, and time signature you l..."
1,Songwriting,PjoSuperFan1,Teen Music Makers Discord,\nI’m looking for new teen members (ages 13-19...
2,Songwriting,SeveralSmalMouses,I would like to start writing songs.,I haven't written songs before and would like ...
3,Songwriting,Letitbedone56363,Some lyrics- if you want to sing them and reco...,Never knew I could meet someone that can make ...
4,Songwriting,feli-usp,Hey!! Wrote this one a couple of months ago an...,


In [20]:
# Looking for instances of removed posts in selftext - designated on Reddit with '[removed]'.
song_df[song_df['selftext'] == '[removed]']

Unnamed: 0,subreddit,author,title,selftext
56,Songwriting,Magical_Naomi,"Um, I made this song like a few months ago I r...",[removed]
57,Songwriting,allisonvenditti,songwritting,[removed]
65,Songwriting,lykasupernova,Can't Run Anymore (Original Song),[removed]
132,Songwriting,jaxx2010nov,What to do next,[removed]
143,Songwriting,techshavi,High CPC keywords,[removed]
...,...,...,...,...
4963,Songwriting,quenri23,When is it OK to make up a word to fit the song?,[removed]
4964,Songwriting,lach_225x,Wrote a song and I'd like to know what you lot...,[removed]
4984,Songwriting,Faaaaann,[lyrics] Feed back plz :)),[removed]
4996,Songwriting,stalkingseagull,Edgy Song,[removed]


In [21]:
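# Removing 'removed' selftext rows. The .copy() makes song_df an independent dataframe, so adding a column to it later doesn't trigger a SettingWithCopyWarning.
song_df = song_df[song_df['selftext'] != '[removed]'].copy()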

In [22]:
# Engineering a new feature, 'full_post', to contain title and selftext (in other words, the entire content of the user's original post, comments excluded).
song_df['full_post'] = song_df['title'] + ' ' + song_df['selftext']

In [23]:
song_df.shape

(4818, 5)

In [24]:
song_df.head()

Unnamed: 0,subreddit,author,title,selftext,full_post
0,Songwriting,3rrr6,"A ""melody first, lyrics last"" songwriting proc...","1. Find a key, tempo, and time signature you l...","A ""melody first, lyrics last"" songwriting proc..."
1,Songwriting,PjoSuperFan1,Teen Music Makers Discord,\nI’m looking for new teen members (ages 13-19...,Teen Music Makers Discord \nI’m looking for ne...
2,Songwriting,SeveralSmalMouses,I would like to start writing songs.,I haven't written songs before and would like ...,I would like to start writing songs. I haven't...
3,Songwriting,Letitbedone56363,Some lyrics- if you want to sing them and reco...,Never knew I could meet someone that can make ...,Some lyrics- if you want to sing them and reco...
4,Songwriting,feli-usp,Hey!! Wrote this one a couple of months ago an...,,Hey!! Wrote this one a couple of months ago an...


In [25]:
song_df.shape

(4818, 5)

In [26]:
song_df.drop_duplicates().shape

(4718, 5)
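
**BELOW:** A small sketch of one way to look at duplicated rows before dropping them, using a made-up `toy_df` rather than the real data. `duplicated(keep=False)` flags every copy of a repeated row, so both members of a pair show up. Rows 4997 and 4998 in the earlier output appear to be one such pair in the real dataframe.

```python
import pandas as pd

# Toy data standing in for song_df - rows 1 and 2 are exact copies.
toy_df = pd.DataFrame({
    'author': ['a', 'b', 'b', 'c'],
    'title': ['Song', 'Feedback', 'Feedback', 'Song'],
    'selftext': ['x', 'y', 'y', 'z'],
})

# keep=False marks all copies of a duplicated row, not just the later ones.
dupes = toy_df[toy_df.duplicated(keep=False)]
print(dupes)

# drop_duplicates() keeps the first copy of each duplicated row by default.
deduped = toy_df.drop_duplicates()
print(deduped.shape)
```

Rows 0 and 3 share a title but differ in author and selftext, so they are not treated as duplicates here.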

In [27]:
song_df.drop_duplicates(inplace = True)

In [28]:
song_df.isnull().sum()

subreddit    0
author       0
title        0
selftext     9
full_post    9
dtype: int64
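
**BELOW:** A short sketch of why the null counts for 'selftext' and 'full_post' match: adding a string to a missing value gives a missing value, so every row with a null 'selftext' also gets a null 'full_post'. The `toy_df` here is made up for illustration. The `fillna('')` variant is an alternative that would keep title-only posts, not what this notebook does - whether to keep those rows is a modeling choice.

```python
import numpy as np
import pandas as pd

# Toy data: the second row has a title but no selftext.
toy_df = pd.DataFrame({
    'title': ['First song', 'Link post'],
    'selftext': ['Some lyrics', np.nan],
})

# Same concatenation as in the notebook: the missing value propagates.
strict = toy_df['title'] + ' ' + toy_df['selftext']
print(strict.isnull().sum())

# Alternative: treat missing selftext as an empty string before joining.
lenient = toy_df['title'] + ' ' + toy_df['selftext'].fillna('')
print(lenient.tolist())
```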

In [29]:
song_df = song_df.dropna()

In [30]:
song_df.shape

(4709, 5)

---

In [31]:
# Saving initial cleaned dataframe for Songwriting subreddit to csv.
song_df.to_csv('../data/songwriting_subreddit_01.csv', index = False)
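
**BELOW:** A sketch of something to watch for when reading this csv back in. Some posts (like row 4 above) have an empty-string 'selftext', and by default `pd.read_csv` reads empty fields as NaN, so nulls can reappear after the round trip. This example uses a made-up `toy_df` and an in-memory buffer instead of the real file. `keep_default_na=False` is one possible way to keep empty strings as they are; it applies to every column, which I'm assuming is fine for an all-text dataframe like this one.

```python
import io

import pandas as pd

# Toy data: the first row has an empty-string selftext, not a null.
toy_df = pd.DataFrame({
    'title': ['Hey!! Wrote this one', 'First song'],
    'selftext': ['', 'Some lyrics'],
})

# Writing to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
toy_df.to_csv(buffer, index=False)

# Default read: the empty field comes back as NaN.
buffer.seek(0)
default_read = pd.read_csv(buffer)
print(default_read['selftext'].isnull().sum())

# With keep_default_na=False the empty string is kept as ''.
buffer.seek(0)
kept_read = pd.read_csv(buffer, keep_default_na=False)
print(kept_read['selftext'].isnull().sum())
```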