# Project 3: Reddit NLP - Data Collection & Cleaning

Pushshift is a service that archives and indexes Reddit at regular intervals. It offers higher-level search and query functionality for Reddit comments and submissions, which facilitates data collection for analysis and modeling. The API is queried here with the `requests` library and returns a JSON response that can then be parsed for the data of interest.

Resources: 
- Pushshift Endpoints: https://pushshift.io/
- Pushshift Documentation: https://github.com/pushshift/api
- Pushshift Subreddit: https://www.reddit.com/r/pushshift/comments/89pxra/pushshift_api_with_large_amounts_of_data/

Credit goes to Chris Sinatra

In [4]:
import requests, time, csv, json, re
import pandas as pd

## Setting the base query syntax:

Setting the query URL for the Pushshift API. The URL ends in `submission` so that only subreddit posts (submissions) are selected, not comments. In the dating and relationship advice subreddits, submissions are usually more verbose than comments because posters are describing their situation(s) and seeking help.

In [5]:
url = 'https://api.pushshift.io/reddit/search/submission'

In the following test run, I am going to pull the 5 most recent submissions from the dating subreddit. The `sort` parameter controls whether the newest (`desc`) or oldest (`asc`) posts are returned first.

In [6]:
params = {
          'subreddit':'dating',
          'sort':'desc',
          'size':5,
         }
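
As a rough illustration of what `requests` does with the `params` dictionary, the sketch below builds the equivalent query string with the standard library. The `build_query_url` helper is a hypothetical name used only for this example; it is not part of the notebook or of Pushshift.

```python
from urllib.parse import urlencode

def build_query_url(base_url, params):
    """Append URL-encoded parameters to base_url (illustrative helper)."""
    # Skip parameters set to None, which requests also leaves out of the URL.
    filtered = {key: value for key, value in params.items() if value is not None}
    return f'{base_url}?{urlencode(filtered)}'

example_url = build_query_url(
    'https://api.pushshift.io/reddit/search/submission',
    {'subreddit': 'dating', 'sort': 'desc', 'size': 5},
)
print(example_url)
# https://api.pushshift.io/reddit/search/submission?subreddit=dating&sort=desc&size=5
```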

In [7]:
response = requests.get(url, params=params)
response.status_code

200

The status code returned from the server is 200, meaning the request succeeded. Checking the number of submissions in the JSON response.

In [8]:
len(response.json()['data'])

5
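
One way to make these two checks reusable is a small helper that fails loudly on a bad status code and returns the list of posts. This is only a sketch: `posts_from_response` is a hypothetical name, and it assumes nothing more than a response object offering `raise_for_status()` and `json()`, as `requests.Response` does.

```python
def posts_from_response(response):
    """Return the submissions stored under the 'data' key of a response."""
    # raise_for_status() raises an exception for 4xx/5xx status codes.
    response.raise_for_status()
    payload = response.json()
    # Fall back to an empty list if the 'data' key is missing.
    return payload.get('data', [])
```

With the cells above, it could be called as `posts_from_response(requests.get(url, params=params))`.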

List of attributes from a single submission:

In [9]:
response.json()['data'][0].keys()

dict_keys(['author', 'author_flair_css_class', 'author_flair_richtext', 'author_flair_text', 'author_flair_type', 'author_fullname', 'author_patreon_flair', 'can_mod_post', 'contest_mode', 'created_utc', 'domain', 'full_link', 'gildings', 'id', 'is_crosspostable', 'is_meta', 'is_original_content', 'is_reddit_media_domain', 'is_robot_indexable', 'is_self', 'is_video', 'link_flair_background_color', 'link_flair_css_class', 'link_flair_richtext', 'link_flair_template_id', 'link_flair_text', 'link_flair_text_color', 'link_flair_type', 'locked', 'media_only', 'no_follow', 'num_comments', 'num_crossposts', 'over_18', 'parent_whitelist_status', 'permalink', 'pinned', 'pwls', 'retrieved_on', 'score', 'selftext', 'send_replies', 'spoiler', 'stickied', 'subreddit', 'subreddit_id', 'subreddit_subscribers', 'subreddit_type', 'thumbnail', 'title', 'url', 'whitelist_status', 'wls'])

Keys of interest are:
- author
- selftext
- permalink
- subreddit

I am going to look at each pulled post's permalink to verify that the posts were pulled correctly:

In [11]:
# Parse the JSON once and iterate over the posts directly.
for post in response.json()['data']:
    print(post['permalink'])

/r/dating/comments/bbium8/vip_escorts_in_bangalore_at_affordable_price/
/r/dating/comments/bbilev/feel_like_my_mate_was_hitting_all_night_on_a/
/r/dating/comments/bbiktd/am_i_being_too_picky_how_important_is_the_spark/
/r/dating/comments/bbiild/what_are_some_great_first_date_questions_that_can/
/r/dating/comments/bbifsz/considering_going_off_birth_control_so_that_some/
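
The permalinks are relative paths, so to open one in a browser it has to be joined with Reddit's base URL. A minimal sketch, where `REDDIT_BASE` and `full_link` are names introduced only for this example:

```python
from urllib.parse import urljoin

REDDIT_BASE = 'https://www.reddit.com'

def full_link(permalink):
    """Turn a relative Reddit permalink into an absolute URL."""
    return urljoin(REDDIT_BASE, permalink)

print(full_link('/r/dating/comments/bbiktd/am_i_being_too_picky_how_important_is_the_spark/'))
```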


Checking the length of the json, which is equivalent to how many submissions the API grabbed.

In [7]:
len(response.json()['data'])

5

## Querying Reddit and saving raw data in .json format:

Writing a function that formats file names with a unique timestamp and appends each name to a log file, which keeps track of the files created from pulling data.

In [8]:
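def filename_format_log(file_path,
                        logfile='../assets/file_log.txt',
                        now=None,
                        file_description=None):
    # Default to the current time at call time. A default of round(time.time())
    # in the signature would be evaluated only once, when the function is defined.
    if now is None:
        now = round(time.time())

    # Require a file extension (a single dot that is not part of a leading '..').
    if re.search(r'(?<!^)(?<!\.)\.(?!\.)', file_path) is None:
        raise NameError('Please enter a relative path with a file extension.')

    # Find where the '<word>_<word>.<ext>' suffix starts (e.g. 'raw_submissions.json')
    # and insert the timestamp just before it.
    stamp = re.search(r'(?<!^)(?<!\.)[a-z]+_[a-z]+(?=\.)', file_path).start()
    formatted_name = f'{file_path[:stamp]}{now}_{file_path[stamp:]}'
    if not file_description:
        file_description = f'Pull: {time.asctime(time.gmtime(now))}'
    # Append the new file name and its description to the log file.
    with open(logfile, 'a+') as f:
        f.write(f'{formatted_name}: {file_description}\n')
    return formatted_name, now, file_description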
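
To see the naming scheme on its own, the sketch below isolates the timestamp insertion and writes no log file. It reuses the same regular expression; `insert_timestamp` is a hypothetical helper introduced only for this illustration.

```python
import re

def insert_timestamp(file_path, stamp_value):
    """Insert stamp_value before the trailing '<word>_<word>.<ext>' part of file_path."""
    match = re.search(r'(?<!^)(?<!\.)[a-z]+_[a-z]+(?=\.)', file_path)
    if match is None:
        raise ValueError('Expected a name ending in <word>_<word>.<ext>')
    start = match.start()
    return f'{file_path[:start]}{stamp_value}_{file_path[start:]}'

print(insert_timestamp('../data/dating_raw_submissions.json', 1545699198))
# ../data/dating_1545699198_raw_submissions.json
```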

Writing a function for collecting submissions and saving out the raw data for each pull. Parsing into a dataframe with the features of interest happens in a later step. Request loop inspired by: [(Source)](https://www.reddit.com/r/pushshift/comments/89pxra/pushshift_api_with_large_amounts_of_data/).

In [9]:
def reddit_query(subreddits, n_samples=1500, before=None, after=None):

    url = 'https://api.pushshift.io/reddit/search/submission'
    # Page backwards in time, starting from `before` if given, otherwise from now.
    last_comment = before if before is not None else round(time.time())
    timestamp = last_comment
    comment_list = []

    run = 1
    while len(comment_list) < n_samples:

        print(f'Starting query {run}')

        # The API may return fewer posts per request than `size`, hence the loop.
        params = {
          'subreddit':subreddits,
          'sort':'desc',
          'size':n_samples,
          'before':last_comment-1,
          'after':after,
         }

        try:
            response = requests.get(url, params=params)
            response.raise_for_status()
            posts = response.json()['data']
        except requests.HTTPError:
            return f'Check status. Error code: {response.status_code}'
        except (requests.RequestException, ValueError, KeyError):
            return 'Error. Pull not completed.'

        if len(posts) == 0:
            # No older submissions are left, so stop instead of looping forever.
            break

        # The oldest post in this batch becomes the upper bound of the next query.
        last_comment = posts[-1]['created_utc']
        timestamp = last_comment
        comment_list.extend(posts)
        time.sleep(1)  # Pause between requests
        run += 1

    formatted_name, now, file_description = filename_format_log(file_path=f'../data/{subreddits}_raw_submissions.json', now=timestamp)
    with open(formatted_name, 'w+') as f:
        json.dump(comment_list, f)

    print(f'Saved and completed query and returned {len(comment_list)} submissions.')
    print('Reddit text is ready for processing.')
    print(f'Last timestamp was {timestamp}.')
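
The core idea of the request loop is cursor-style paging: each request asks for posts older than the oldest post seen so far. The sketch below shows that idea offline, with the network call replaced by a caller-supplied function. `collect_pages`, `fake_fetch`, and `fake_posts` are hypothetical names used only for this illustration.

```python
def collect_pages(fetch_page, n_samples, start_before):
    """Collect up to n_samples posts by paging backwards in time.

    fetch_page(before) is a caller-supplied function that returns posts
    created before the given timestamp, newest first.
    """
    collected = []
    before = start_before
    while len(collected) < n_samples:
        page = fetch_page(before)
        if not page:
            break  # Nothing older is available
        collected.extend(page)
        # The oldest post on this page bounds the next request.
        before = page[-1]['created_utc']
    return collected[:n_samples]

# Offline demonstration with fake posts whose timestamps are 10 down to 1.
fake_posts = [{'created_utc': t} for t in range(10, 0, -1)]

def fake_fetch(before, page_size=3):
    older = [post for post in fake_posts if post['created_utc'] < before]
    return older[:page_size]

result = collect_pages(fake_fetch, n_samples=7, start_before=11)
print([post['created_utc'] for post in result])
# [10, 9, 8, 7, 6, 5, 4]
```

Passing the fetch step in as a function keeps the paging logic testable without touching the API.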

Using the query function to collect 10000 submissions from the dating subreddit.

In [10]:
reddit_query(subreddits='dating',
             n_samples=10000)

Starting query 1
Starting query 2
Starting query 3
Starting query 4
Starting query 5
Starting query 6
Starting query 7
Starting query 8
Starting query 9
Starting query 10
Saved and completed query and returned 10000 submissions.
Reddit text is ready for processing.
Last timestamp was 1545699198.
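
The reported timestamp is a Unix epoch value in seconds (UTC), the same unit as `created_utc`. A small sketch for turning it into a readable date, where `utc_from_epoch` is a hypothetical helper name:

```python
from datetime import datetime, timezone

def utc_from_epoch(epoch_seconds):
    """Convert Unix epoch seconds to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)

print(utc_from_epoch(1545699198).isoformat())
# 2018-12-25T00:53:18+00:00
```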


Using the query function to collect 10000 submissions from the relationship_advice subreddit.

In [11]:
reddit_query(subreddits='relationship_advice',
             n_samples=10000)

Starting query 1
Starting query 2
Starting query 3
Starting query 4
Starting query 5
Starting query 6
Starting query 7
Starting query 8
Starting query 9
Starting query 10
Saved and completed query and returned 10000 submissions.
Reddit text is ready for processing.
Last timestamp was 1553649404.


In [12]:
!ls '../data/'

dating_1545268715_raw_submissions.json
dating_1545699198_raw_submissions.json
df.csv
relationship_advice_1553295834_raw_submissions.json
relationship_advice_1553649404_raw_submissions.json


In [4]:
with open(f'../data/dating_1545268715_raw_submissions.json', 'r') as f:
    dating_json = json.load(f)

In [5]:
len(dating_json)

10000

Parsing the json file into a dataframe containing the features of interest.

In [6]:
dating_json[0].keys()

dict_keys(['author', 'author_flair_css_class', 'author_flair_richtext', 'author_flair_text', 'author_flair_type', 'author_fullname', 'author_patreon_flair', 'can_mod_post', 'contest_mode', 'created_utc', 'domain', 'full_link', 'gildings', 'id', 'is_crosspostable', 'is_meta', 'is_original_content', 'is_reddit_media_domain', 'is_robot_indexable', 'is_self', 'is_video', 'link_flair_background_color', 'link_flair_css_class', 'link_flair_richtext', 'link_flair_template_id', 'link_flair_text', 'link_flair_text_color', 'link_flair_type', 'locked', 'media_only', 'no_follow', 'num_comments', 'num_crossposts', 'over_18', 'parent_whitelist_status', 'permalink', 'pinned', 'pwls', 'retrieved_on', 'score', 'selftext', 'send_replies', 'spoiler', 'stickied', 'subreddit', 'subreddit_id', 'subreddit_subscribers', 'subreddit_type', 'thumbnail', 'title', 'url', 'whitelist_status', 'wls'])

Keys of interest are:
- author
- selftext
- permalink
- subreddit

In [7]:
def reddit_to_df(sample):
    
    col_list = ['author',
                'selftext',
                'subreddit',
                'permalink',
                ]
    
    submissions_df = pd.DataFrame(sample)
    submissions_df = submissions_df[col_list]

    return submissions_df
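
If a pull ever lacks one of these keys, the column selection in `reddit_to_df` raises a `KeyError`. A more tolerant variant, sketched here under the assumption that missing columns should simply be filled with `NaN`, uses `DataFrame.reindex`. The name `reddit_to_df_safe` is hypothetical and not part of the notebook.

```python
import pandas as pd

def reddit_to_df_safe(sample, col_list=('author', 'selftext', 'subreddit', 'permalink')):
    """Build a dataframe with exactly the columns in col_list."""
    # reindex keeps the requested columns and adds any missing ones as NaN.
    return pd.DataFrame(sample).reindex(columns=list(col_list))

demo = reddit_to_df_safe([{'author': 'a', 'subreddit': 'dating', 'score': 1}])
print(demo.columns.tolist())
```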

In [8]:
dating_df = reddit_to_df(dating_json)

In [9]:
with open(f'../data/relationship_advice_1553295834_raw_submissions.json', 'r') as f:
    relationship_json = json.load(f)

In [10]:
relationship_advice_df = reddit_to_df(relationship_json)

In [11]:
relationship_advice_df.head()

| | author | selftext | subreddit | permalink |
|---|---|---|---|---|
| 0 | thethrowayboi | Hey Reddit,\n\nSo I've been in a very stable a... | relationship_advice | /r/relationship_advice/comments/b7yrg8/my_girl... |
| 1 | throwaway66770220 | I (19F) have been with my boyfriend (20M) for ... | relationship_advice | /r/relationship_advice/comments/b7yr4m/i_dont_... |
| 2 | forgottenveggie | - warning will probably be long as semi ventin... | relationship_advice | /r/relationship_advice/comments/b7yq90/how_to_... |
| 3 | trowway988 | Throwaway account for the reason that my bf is... | relationship_advice | /r/relationship_advice/comments/b7yq2l/my21f_b... |
| 4 | CrazyMonkey0425 | I’m (19M) in that mindset that I really like t... | relationship_advice | /r/relationship_advice/comments/b7ypr2/when_pu... |


Combining the two dataframes:

In [12]:
df = pd.concat([dating_df,relationship_advice_df], ignore_index= True)

I need to remove `[removed]`, `[deleted]`, and `null` entries in the `selftext` column.

In [13]:
df[(df['selftext'] == '[removed]')|(df['selftext'] == '[deleted]')]['selftext'].value_counts()

[removed]    2115
[deleted]       8
Name: selftext, dtype: int64

In [14]:
df = df[~df['selftext'].isin(['[removed]','[deleted]'])]

I also will remove `null` selftexts:

In [15]:
df[df['selftext'].isnull()]

| | author | selftext | subreddit | permalink |
|---|---|---|---|---|
| 332 | [deleted] | | dating | /r/dating/comments/b6nwaf/a_girl_i_see_on_a_da... |
| 3361 | [deleted] | | dating | /r/dating/comments/av6ya1/the_solution_to_all_... |
| 3365 | [deleted] | | dating | /r/dating/comments/av6k1i/never_dated_in_my_li... |


In [16]:
df = df[~df['selftext'].isnull()]
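
The same filtering could also be applied to the raw JSON records before they reach pandas. This is an alternative sketch, not what the notebook does; it assumes each record is a dictionary like those in `dating_json`, and `PLACEHOLDERS` and `has_usable_text` are hypothetical names.

```python
PLACEHOLDERS = {'[removed]', '[deleted]'}

def has_usable_text(post):
    """True if the post has a selftext that is neither missing nor a placeholder."""
    text = post.get('selftext')
    return text is not None and text not in PLACEHOLDERS

sample_posts = [
    {'author': 'a', 'selftext': 'Looking for advice.'},
    {'author': 'b', 'selftext': '[removed]'},
    {'author': '[deleted]'},  # selftext key missing entirely
]
kept = [post for post in sample_posts if has_usable_text(post)]
print(len(kept))
# 1
```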

In [17]:
df.shape

(17874, 4)

There are a total of 17,874 posts from both dating and relationship advice subreddits.

In [18]:
df.subreddit.value_counts()

relationship_advice    9738
dating                 8136
Name: subreddit, dtype: int64

Creating a new column that will indicate whether a post is from `dating` (1) or `relationship_advice` (0).

In [19]:
df['dating'] = df['subreddit'].map({'dating':1, 'relationship_advice':0})
df.reset_index(drop=True, inplace=True)

In [20]:
df.head()

| | author | selftext | subreddit | permalink | dating |
|---|---|---|---|---|---|
| 0 | Illumnyx | Last weekend I went on a lunch date for the fi... | dating | /r/dating/comments/b7yhwo/first_date_in_a_whil... | 1 |
| 1 | whitewoods | I’ve been friends with a girl from my high sch... | dating | /r/dating/comments/b7y2t4/fearing_rejection/ | 1 |
| 2 | rebeccazone | I bumped into a lady I used to know casually a... | dating | /r/dating/comments/b7y20k/got_stood_up_by_an_o... | 1 |
| 3 | iiDaddy_Jj | Okay so this girl i have history with has done... | dating | /r/dating/comments/b7xzn2/is_being_stubborn_to... | 1 |
| 4 | eelikay | I can't tell if my standards for looks/body an... | dating | /r/dating/comments/b7xwbp/what_to_do_if_the_in... | 1 |


Before running natural language processing, I will clean the `selftext` column by removing the following:
1. URLs
2. Mentions of any subreddit
3. Apostrophes
4. All punctuation
5. Non-breaking spaces (`\xa0`)
6. Line breaks
7. Any remaining characters that are not letters

In [21]:
# URLs and subreddit mentions are removed first: once slashes and colons have been
# stripped as punctuation, their patterns can no longer match.
df['selftext'] = df['selftext'].map(lambda x: re.sub(r'https?://\S*', ' ', x)) # Removing urls
df['selftext'] = df['selftext'].map(lambda x: re.sub(r'\s/?r/\S+', ' ', x)) # Removing mentions of any subreddit
df['selftext'] = df['selftext'].map(lambda x: re.sub(r'//', ' ', x)) # Removing double slashes
df['selftext'] = df['selftext'].map(lambda x: re.sub(r"[\[\]']", '', x)) # Removing apostrophes (and square brackets)
df['selftext'] = df['selftext'].map(lambda x: re.sub(r'[^\w\s]', ' ', x)) # Removing all punctuation
df['selftext'] = df['selftext'].map(lambda x: re.sub('\xa0', ' ', x)) # Removing non-breaking spaces (xa0)
df['selftext'] = df['selftext'].map(lambda x: re.sub(r'\n', ' ', x)) # Removing line breaks
df['selftext'] = df['selftext'].map(lambda x: re.sub(r'[^a-zA-Z]', ' ', x)) # Only keeping letters
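
The cleaning steps can also be collected into a single function that is easy to try on one string. The sketch below is a variation rather than the notebook's exact pipeline: it assumes curly apostrophes should be dropped like straight ones and that runs of whitespace should be collapsed. `clean_text` is a hypothetical name.

```python
import re

def clean_text(text):
    """Reduce a Reddit post to space-separated words made of letters only."""
    text = re.sub(r'https?://\S*', ' ', text)     # URLs, while '://' is still intact
    text = re.sub(r'(?:^|\s)/?r/\S+', ' ', text)  # Subreddit mentions such as r/dating
    text = re.sub(r"['’]", '', text)              # Straight and curly apostrophes
    text = re.sub(r'[^a-zA-Z]', ' ', text)        # Everything that is not a letter
    return re.sub(r'\s+', ' ', text).strip()      # Collapse repeated whitespace

sample = "I can't even.\n\nSee r/dating or https://example.com/x?y=1 (seriously)!"
print(clean_text(sample))
# I cant even See or seriously
```

It could be applied to the whole column with `df['selftext'].map(clean_text)`.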

In [22]:
df.head()

| | author | selftext | subreddit | permalink | dating |
|---|---|---|---|---|---|
| 0 | Illumnyx | Last weekend I went on a lunch date for the fi... | dating | /r/dating/comments/b7yhwo/first_date_in_a_whil... | 1 |
| 1 | whitewoods | I ve been friends with a girl from my high sch... | dating | /r/dating/comments/b7y2t4/fearing_rejection/ | 1 |
| 2 | rebeccazone | I bumped into a lady I used to know casually a... | dating | /r/dating/comments/b7y20k/got_stood_up_by_an_o... | 1 |
| 3 | iiDaddy_Jj | Okay so this girl i have history with has done... | dating | /r/dating/comments/b7xzn2/is_being_stubborn_to... | 1 |
| 4 | eelikay | I cant tell if my standards for looks body and... | dating | /r/dating/comments/b7xwbp/what_to_do_if_the_in... | 1 |


I will check if the texts are cleaned up.

In [23]:
df['selftext'][0]

'Last weekend I went on a lunch date for the first time in a while with someone I met on a dating app  Was awkward at first  first time meeting so thats natural  but once we settled in we had decent discussion    As it went on though  I felt increasingly tense and uncomfortable at some of the stuff they were saying  like  it just didnt sit right with how I view things   They also kept knocking my feet under the table   not sure if that was intentional or not  The conversations seemed pretty one sided in the sense that I seemed to be talking a lot about myself without them really reciprocating  They also ordered a lot of food and only ended up eating half  Didnt even take it in a container when offered  that kind of wastefulness really ticks me off    After an hour  I was really telling myself to start making moves to leave as I was that tense  We eventually did and ended up splitting the bill  I offered as the girl behind the counter was getting confused by her account of what we both 

In [25]:
# Dropping posts whose text is empty or whitespace-only after cleaning
df = df[df['selftext'].str.strip() != '']

Shuffling the rows in the dataframe for future training and testing datasets.

In [26]:
df = df.sample(n=len(df), replace = False).reset_index(drop=True)
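
The shuffle above gives a different row order on every run. If a reproducible order is wanted, one option is to pass a fixed `random_state`. The sketch below uses a small stand-in dataframe (`demo_df`, with made-up values) and an arbitrary seed of 42.

```python
import pandas as pd

# Stand-in data for illustration only.
demo_df = pd.DataFrame({'selftext': ['a', 'b', 'c', 'd'], 'dating': [1, 1, 0, 0]})

# frac=1 samples every row exactly once; random_state fixes the order across runs.
shuffled = demo_df.sample(frac=1, random_state=42).reset_index(drop=True)
print(shuffled)
```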

In [27]:
df.to_csv('../data/df.csv', index = False)

# Please go to Notebook 2: Exploratory Data Analysis