## Project 3 Reddit Posts

### Problem Statement

I am a data scientist at Reddit. The operations team recently discovered data corruption that caused a loss of data for recent posts in two large, active subreddits: Parenting and Relationship Advice. The team managed to recover some of the data, namely the posts and their descriptions, but not the subreddit each post belongs to (Parenting or Relationship Advice). They have therefore engaged the data team, and me in particular, to classify these posts into their respective subreddits so that the data can be restored.

For this problem, my proposed solution is two-fold:
1. Using [Pushshift's](https://github.com/pushshift/api) API, I will collect posts from the two subreddits - Parenting and Relationship Advice.
2. I will then use NLP to train a classifier that predicts which subreddit a given post came from. This is a binary classification problem.
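
The classifier itself is built in the second notebook and is not shown here. As a rough illustration of what binary text classification involves, the following is a minimal sketch that assumes a bag-of-words Naive Bayes model with add-one smoothing. The toy posts and the function names (tokenize, train_nb, predict_nb) are hypothetical and are not the model used in the project.

```python
import math
from collections import Counter, defaultdict

def tokenize(text):
    # Very simple tokenizer: lowercase and split on whitespace
    return text.lower().split()

def train_nb(samples):
    """samples is a list of (text, label) pairs."""
    word_counts = defaultdict(Counter)   # label -> word frequencies
    label_counts = Counter()             # label -> number of posts
    vocab = set()
    for text, label in samples:
        tokens = tokenize(text)
        word_counts[label].update(tokens)
        label_counts[label] += 1
        vocab.update(tokens)
    return word_counts, label_counts, vocab

def predict_nb(model, text):
    word_counts, label_counts, vocab = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in label_counts:
        # log prior + sum of smoothed log likelihoods for known words
        score = math.log(label_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for tok in tokenize(text):
            if tok in vocab:
                score += math.log((word_counts[label][tok] + 1)
                                  / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical toy posts, not taken from the collected data
toy_samples = [
    ("my toddler will not sleep", "Parenting"),
    ("potty training my son", "Parenting"),
    ("my boyfriend ignores my texts", "relationship_advice"),
    ("should i break up with my girlfriend", "relationship_advice"),
]
model = train_nb(toy_samples)
pred_parenting = predict_nb(model, "how do i get my toddler to sleep")
pred_rship = predict_nb(model, "my boyfriend wants to break up")
print(pred_parenting, pred_rship)
```

With real posts, the same idea applies at a larger scale: the words in the title and text are the features, and the subreddit name is the label.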

### Reading data from the Reddit API

I will use the Pushshift API to import Reddit posts from both subreddits and store them in a DataFrame. The data is saved for further processing in the second notebook, which avoids unnecessary re-importing whenever that notebook is run.

The two subreddits and their community descriptions are shown below:
- **Parenting** - /r/Parenting is the place to discuss the ins and out as well as ups and downs of child-rearing. From the early stages of pregnancy to when your teenagers are finally ready to leave the nest (even if they don't want to) we're here to help you through this crazy thing called parenting. You can get advice on potty training, talk about breastfeeding, discuss how to get your baby to sleep or ask if that one weird thing your kid does is normal.


- **Relationship Advice** - A community to help in relationships, whether it's romance, friendship, family, co-workers, or basic human interaction.

### Data retrieval and initial processing

I have created a key function, get_redditposts(), to import all the data and features from the two selected subreddits. Because the API returns a maximum of 100 posts per request, the function calls the API repeatedly; it takes as input the number of calls to make, which determines the number of entries retrieved. The last entry of each batch has the earliest created_utc of its 100 posts. The function passes that timestamp as the API's before parameter in the next request, which imports the 100 posts preceding it.
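
The paging logic can be illustrated without calling the API. The sketch below is only an illustration: paginate_backwards, fake_fetch and FAKE_POSTS are hypothetical names, the in-memory list stands in for the API, and the sketch assumes that before is exclusive and that each batch is ordered from newest to oldest.

```python
def paginate_backwards(fetch_page, start_ts, n_calls, page_size=100):
    """Collect up to n_calls pages, each older than the previous one."""
    collected = []
    before = start_ts
    for _ in range(n_calls):
        page = fetch_page(before, page_size)
        if not page:
            break  # nothing older is left
        collected.extend(page)
        # The last post of the page is the earliest one in it
        before = page[-1]["created_utc"]
    return collected

# Hypothetical stand-in for the API: 250 posts, one per second
FAKE_POSTS = [{"id": i, "created_utc": 1_000 + i} for i in range(250)]

def fake_fetch(before, size):
    # Assumption: only posts strictly older than `before`, newest first
    older = [p for p in FAKE_POSTS if p["created_utc"] < before]
    older.sort(key=lambda p: p["created_utc"], reverse=True)
    return older[:size]

posts = paginate_backwards(fake_fetch, start_ts=2_000, n_calls=20)
print(len(posts), posts[0]["created_utc"], posts[-1]["created_utc"])
```

If before is exclusive as assumed here, posts that share the exact timestamp of the last entry in a batch could be skipped by the next request.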

For this project, I called the API 20 times for each subreddit, which gives up to 2,000 posts per subreddit. The API returns a total of 72 different data fields. However, I only use the post title and the post text/description to train the classifier. Hence, I have kept only the 3 required columns, which are explained in the Data section below.

After getting the dataframe with the 3 relevant fields, I removed duplicate posts, first on the title column and then on the selftext column, as Reddit ‘jams’ your web scraping by giving you duplicate posts.
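
The effect of the two-pass de-duplication can be seen on a small example. This is a sketch on hypothetical toy data, not on the collected posts. It also shows that pandas treats missing values as equal to one another when checking for duplicates, so at most one row with a missing selftext survives the second pass.

```python
import pandas as pd

# Hypothetical toy data, not taken from the collected posts
toy = pd.DataFrame({
    "title":    ["A", "A", "B", "C", "D"],
    "selftext": ["x", "y", "x", None, None],
})

# Pass 1: keep the first row for each title
step1 = toy.drop_duplicates(subset=["title"], keep="first")

# Pass 2: keep the first row for each selftext (missing values count as equal)
step2 = step1.drop_duplicates(subset=["selftext"], keep="first")

# For comparison: one pass on both columns only drops rows where both match
both = toy.drop_duplicates(subset=["title", "selftext"], keep="first")

print(len(toy), len(step1), len(step2), len(both))
```

Two separate passes are stricter than a single pass on both columns: a row is dropped if either its title or its selftext has already been seen.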

Finally, I concatenate the dataframes for both subreddits into one combined dataframe and save it to a CSV file. This file is then used in the second Python notebook for further processing, modelling and training the classifier.

### Data

The final data used in my project were collected by reading from the API on Friday, September 3, 2021, between 2:50 pm and 2:55 pm, after I had explored the API and the data imported earlier in the week. As mentioned above, a total of 72 data fields are extracted from the API, such as created_utc (creation time) and author_fullname (ID of the author). However, I am only interested in the following fields for my project, so I created a dataframe containing only these 3 fields.


| **Column**  | **Description**  |
| :-|:-|
| subreddit  | subreddit category housing this post  |
| selftext   | Text of the post  |
| title  | post title |


### References

- https://docs.python.org/3/library/time.html
- https://github.com/pushshift/api
- https://youtu.be/AcrjEWsMi_E
- https://www.reddit.com/r/relationship_advice/
- https://www.reddit.com/r/Parenting/

### Import Libraries

In [1]:
## Imports

import requests
import pandas as pd
import time

# Suppress warnings
import warnings
warnings.filterwarnings('ignore')

In [2]:
# link for pushshift api
url = 'https://api.pushshift.io/reddit/search/submission'

In [3]:
# Function to retrieve `count` batches of posts from 1 subreddit and combine them into 1 dataframe
def get_redditposts(post, count):

    # start from the current time and page backwards
    ts = int(time.time())
    frames = []

    for i in range(count):
        print("No. of retrieval: " + str(i + 1))

        # get up to 100 posts from the selected subreddit before the timestamp
        params = {'subreddit': post,
                  'size': 100,
                  'before': ts}

        # get the request and status code
        res = requests.get(url, params=params)
        print("Status Code is " + str(res.status_code))
        print("------------------")
        posts = res.json()['data']

        # stop early if the API has no more posts to return
        if not posts:
            break

        frames.append(pd.DataFrame(posts))

        # the last post of the batch is the earliest; use its timestamp for the next request
        ts = posts[-1]['created_utc']

    # DataFrame.append was removed in pandas 2.0, so the batches are combined with pd.concat
    return pd.concat(frames, sort=False)
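
The function above prints the status code but does not react to a failed request. One possible way to handle this, shown here only as a sketch, is to retry with an increasing delay. The helper get_with_retry and the fake fetch function are hypothetical and are not part of the notebook; the fetch argument is assumed to be any zero-argument callable returning a (status_code, payload) pair.

```python
import time

def get_with_retry(fetch, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch() until it returns status 200 or the attempts run out."""
    status = None
    for attempt in range(max_attempts):
        status, payload = fetch()
        if status == 200:
            return payload
        if attempt < max_attempts - 1:
            # wait 1x, 2x, 4x ... the base delay before trying again
            sleep(base_delay * 2 ** attempt)
    raise RuntimeError(
        f"request failed after {max_attempts} attempts (last status {status})"
    )

# Hypothetical fake endpoint: fails twice, then succeeds
responses = iter([(429, None), (500, None), (200, {"data": []})])
delays = []  # record the delays instead of actually sleeping

result = get_with_retry(lambda: next(responses), sleep=delays.append)
print(result, delays)
```

In the notebook, fetch could wrap the requests.get call and return res.status_code together with res.json().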

### Retrieving and Processing of the Data from subreddit Parenting

In [4]:
# Getting the posts for subreddit "Parenting"
df_parenting = get_redditposts('parenting', 20)

No. of retrieval: 1
Status Code is 200
------------------
No. of retrieval: 2
Status Code is 200
------------------
No. of retrieval: 3
Status Code is 200
------------------
No. of retrieval: 4
Status Code is 200
------------------
No. of retrieval: 5
Status Code is 200
------------------
No. of retrieval: 6
Status Code is 200
------------------
No. of retrieval: 7
Status Code is 200
------------------
No. of retrieval: 8
Status Code is 200
------------------
No. of retrieval: 9
Status Code is 200
------------------
No. of retrieval: 10
Status Code is 200
------------------
No. of retrieval: 11
Status Code is 200
------------------
No. of retrieval: 12
Status Code is 200
------------------
No. of retrieval: 13
Status Code is 200
------------------
No. of retrieval: 14
Status Code is 200
------------------
No. of retrieval: 15
Status Code is 200
------------------
No. of retrieval: 16
Status Code is 200
------------------
No. of retrieval: 17
Status Code is 200
------------------
No. of

In [5]:
# checking for duplicates using created_utc, which is the time at which the post was created
df_parenting['created_utc'].nunique()

1999

In [6]:
# Checking the columns for the data imported from the API
df_parenting.columns

Index(['all_awardings', 'allow_live_comments', 'author',
       'author_flair_css_class', 'author_flair_richtext', 'author_flair_text',
       'author_flair_type', 'author_fullname', 'author_is_blocked',
       'author_patreon_flair', 'author_premium', 'awarders', 'can_mod_post',
       'contest_mode', 'created_utc', 'domain', 'full_link', 'gildings', 'id',
       'is_created_from_ads_ui', 'is_crosspostable', 'is_meta',
       'is_original_content', 'is_reddit_media_domain', 'is_robot_indexable',
       'is_self', 'is_video', 'link_flair_background_color',
       'link_flair_css_class', 'link_flair_richtext', 'link_flair_template_id',
       'link_flair_text', 'link_flair_text_color', 'link_flair_type', 'locked',
       'media_only', 'no_follow', 'num_comments', 'num_crossposts', 'over_18',
       'parent_whitelist_status', 'permalink', 'pinned', 'pwls',
       'retrieved_on', 'score', 'selftext', 'send_replies', 'spoiler',
       'stickied', 'subreddit', 'subreddit_id', 'subreddit_sub

In [7]:
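# saving the 3 key columns into a new dataframe for further processing and using the info function to check the dataframe
# .copy() makes the new dataframe independent, so the in-place drop_duplicates below does not act on a view
df_parenting1 = df_parenting[['subreddit', 'selftext', 'title']].copy()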
df_parenting1.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2000 entries, 0 to 99
Data columns (total 3 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   subreddit  2000 non-null   object
 1   selftext   1996 non-null   object
 2   title      2000 non-null   object
dtypes: object(3)
memory usage: 62.5+ KB


In [8]:
# dropping duplicate rows based on the title column, then the selftext column, keeping the first occurrence
df_parenting1.drop_duplicates(subset=['title'], keep='first', inplace=True)
df_parenting1.drop_duplicates(subset=['selftext'], keep='first', inplace=True)

In [9]:
# using the info function to check for the null values and number of rows
df_parenting1.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1580 entries, 0 to 99
Data columns (total 3 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   subreddit  1580 non-null   object
 1   selftext   1579 non-null   object
 2   title      1580 non-null   object
dtypes: object(3)
memory usage: 49.4+ KB


In [10]:
# using the head function to see the first 10 rows of the dataframe.
df_parenting1.head(10)

| | **subreddit** | **selftext** | **title** |
| :-|:-|:-|:-|
| 0 | Parenting | So I live with a 3 year old who constantly hit... | Is it normal for a 3 year old to hit constantly |
| 1 | Parenting | My son just turned 7, and at times he seems li... | Normal common sense in a child... |
| 2 | Parenting | My son is 6 months old and I'm fully aware thi... | Getting bruised by a baby |
| 3 | Parenting | My 8 year old is terrible to get out of the be... | Morning routine |
| 4 | Parenting | TW: PPD\n\nMy 17 month old is a tornado and th... | My house is trashed... |
| 5 | Parenting | These days you can find out everything has som... | Soon to be a new parent; does anyone have a co... |
| 6 | Parenting | This is probably an unpopular opinion. When an... | Over the “Check-in” |
| 7 | Parenting | I am finding motherhood a struggle.  I love my... | Motherhood a struggle (33f with 2yo son) |
| 8 | Parenting | My son is 3 now and I’m trying to potty train ... | Diaper Change Fights |
| 9 | Parenting | Hi! My daughter is turning one soon and becaus... | 1st birthday time capsule |


### Retrieving and Processing of the Data from subreddit Relationship Advice

In [11]:
# Getting the posts for subreddit "Relationship Advice"
df_rshipadvice = get_redditposts('relationship_advice', 20)

No. of retrieval: 1
Status Code is 200
------------------
No. of retrieval: 2
Status Code is 200
------------------
No. of retrieval: 3
Status Code is 200
------------------
No. of retrieval: 4
Status Code is 200
------------------
No. of retrieval: 5
Status Code is 200
------------------
No. of retrieval: 6
Status Code is 200
------------------
No. of retrieval: 7
Status Code is 200
------------------
No. of retrieval: 8
Status Code is 200
------------------
No. of retrieval: 9
Status Code is 200
------------------
No. of retrieval: 10
Status Code is 200
------------------
No. of retrieval: 11
Status Code is 200
------------------
No. of retrieval: 12
Status Code is 200
------------------
No. of retrieval: 13
Status Code is 200
------------------
No. of retrieval: 14
Status Code is 200
------------------
No. of retrieval: 15
Status Code is 200
------------------
No. of retrieval: 16
Status Code is 200
------------------
No. of retrieval: 17
Status Code is 200
------------------
No. of

In [12]:
# checking for duplicates using created_utc, which is the time at which the post was created
df_rshipadvice['created_utc'].nunique()

1986

In [13]:
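# saving the 3 key columns into a new dataframe for further processing and using the info function to check the dataframe
# .copy() makes the new dataframe independent, so the in-place drop_duplicates below does not act on a view
df_rshipadvice1 = df_rshipadvice[['subreddit', 'selftext', 'title']].copy()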
df_rshipadvice1.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2000 entries, 0 to 99
Data columns (total 3 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   subreddit  2000 non-null   object
 1   selftext   1999 non-null   object
 2   title      2000 non-null   object
dtypes: object(3)
memory usage: 62.5+ KB


In [14]:
# dropping duplicate rows based on the title column, then the selftext column, keeping the first occurrence
df_rshipadvice1.drop_duplicates(subset=['title'], keep='first', inplace=True)
df_rshipadvice1.drop_duplicates(subset=['selftext'], keep='first', inplace=True)

In [15]:
# using the info function to check for the null values and number of rows
df_rshipadvice1.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1684 entries, 0 to 99
Data columns (total 3 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   subreddit  1684 non-null   object
 1   selftext   1683 non-null   object
 2   title      1684 non-null   object
dtypes: object(3)
memory usage: 52.6+ KB


In [16]:
# using the head function to see the first 10 rows of the dataframe.
df_rshipadvice1.head(10)

| | **subreddit** | **selftext** | **title** |
| :-|:-|:-|:-|
| 0 | relationship_advice | [removed] | How much is it normal or acceptable for couple... |
| 1 | relationship_advice | \nI really just want to break something. Havin... | I want to break something. |
| 3 | relationship_advice | I need some advices in how to get back my girl... | I need some advice |
| 4 | relationship_advice | This is kind of long but I’ll make it short\nF... | Physical affection |
| 5 | relationship_advice | So for some context, my ex and I dated for 3 y... | My [21M] ex gf [20F] wants to hang out with my... |
| 6 | relationship_advice | Sorry for formatting issues, I’m on mobile. Th... | Am I [26F] delusional for thinking my brother ... |
| 7 | relationship_advice | Earlier tonight I called my brother because I ... | Need advice for me and my brother’s relationship |
| 8 | relationship_advice | So this was my first proper relationship. My e... | I (26f) broke up with my boyfriend (25m) about... |
| 9 | relationship_advice | This is less about a relationship issue but I ... | My feelings are hurt, but should they be?? |
| 10 | relationship_advice | Do you think it’s normal to find ur man attrac... | Physically attracted to a guy but having hard ... |


In [17]:
df_final = pd.concat([df_parenting1, df_rshipadvice1], ignore_index=True)

In [18]:
df_final

| | **subreddit** | **selftext** | **title** |
| :-|:-|:-|:-|
| 0 | Parenting | So I live with a 3 year old who constantly hit... | Is it normal for a 3 year old to hit constantly |
| 1 | Parenting | My son just turned 7, and at times he seems li... | Normal common sense in a child... |
| 2 | Parenting | My son is 6 months old and I'm fully aware thi... | Getting bruised by a baby |
| 3 | Parenting | My 8 year old is terrible to get out of the be... | Morning routine |
| 4 | Parenting | TW: PPD\n\nMy 17 month old is a tornado and th... | My house is trashed... |
| ... | ... | ... | ... |
| 3259 | relationship_advice | i’ve been talking to this guy for almost a yea... | What do i mean to him? |
| 3260 | relationship_advice | I, a cis male, teenage, have liked girls my en... | What do I do now that I know I like them? |
| 3261 | relationship_advice | I, (16M) are currently dating my partner, T (1... | It feels like my partner likes making me jealo... |
| 3262 | relationship_advice | TLDR: He has had past with some girls.  He che... | Boyfriend(24M) of 3 years keeps checking out o... |


In [19]:
# using value counts to check the number of rows for each subreddit
df_final['subreddit'].value_counts()

relationship_advice    1684
Parenting              1580
Name: subreddit, dtype: int64
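
The two classes are close to balanced. As a quick worked check, the share of the larger class gives the accuracy of a model that always predicts that class, which is a natural baseline for the classifier in the second notebook. The snippet below is a small sketch that uses the counts from the value_counts() output above; the variable names are illustrative.

```python
# Row counts taken from the value_counts() output above
counts = {"relationship_advice": 1684, "Parenting": 1580}

total = sum(counts.values())
majority_label = max(counts, key=counts.get)
baseline_accuracy = counts[majority_label] / total

print(f"{majority_label}: {baseline_accuracy:.1%} of {total} posts")
# prints "relationship_advice: 51.6% of 3264 posts"
```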

In [20]:
# saving df_final to csv for further processing in the second Python notebook, to avoid scraping repeatedly with the API
df_final.to_csv("Combined Data.csv", index=False)