# Introduction & Problem Statement 

Reddit is a social media website focused on social news aggregation, web content rating, and discussion. Members contribute content to subreddits (community-managed pages), where other members of the community vote it up or down. Posts are scored by an algorithm that weighs votes and, among other things, how long the post has been up, which ensures that new content will always float to the top. Submissions with a high enough score make it to the front page of Reddit. Reddit's subreddits have fostered interesting niche communities that have flourished and might not have survived on other parts of the internet.

r/AmItheAsshole is one such niche community, built on a unique premise that aims to both help and entertain users. The premise is this: people post situations they have found themselves in where they feel they may be perceived as an 'asshole'. Commenters reply to the original poster's situation and vote on it using coded language designated in the subreddit rules, such as YTA and NTA. There are times, however, when commenters can't make up their minds and leave a vaguely worded message with a conflicting vote.

This project aims to create a model, trained on comments, that can predict the verdict of comments containing both YTA and NTA.
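
One way to isolate the comments this project targets is to flag any comment whose text contains both verdict codes. The following is a minimal sketch, assuming the codes appear as standalone tokens in any letter case; the helper name `has_conflicting_vote` and the sample comments are hypothetical and not part of the original notebook.

```python
import re

# Word-boundary patterns, so "YTA" or "NTA" embedded in a longer word is ignored.
YTA_PATTERN = re.compile(r"\bYTA\b", re.IGNORECASE)
NTA_PATTERN = re.compile(r"\bNTA\b", re.IGNORECASE)


def has_conflicting_vote(comment_text: str) -> bool:
    """Return True when a comment mentions both YTA and NTA."""
    has_yta = YTA_PATTERN.search(comment_text) is not None
    has_nta = NTA_PATTERN.search(comment_text) is not None
    return has_yta and has_nta


# Made-up sample comments, for illustration only.
comments = [
    "NTA, your roommate agreed to the rules.",
    "I was going to say YTA, but after the edit... NTA?",
    "YTA for the way you handled it.",
]

# Keep only the comments that carry both codes.
conflicted = [c for c in comments if has_conflicting_vote(c)]
```

Comments that pass this check are the ones the model would be asked to classify; comments with a single code could serve as labelled training data.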

To that end, 4 different forms of lemmatization were created and run through word-frequency-based classification models, which were evaluated on accuracy. The final production model achieved an accuracy score of 84.6%.
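
To make the two ideas in that paragraph concrete, here is a small sketch of a word-frequency representation and of the accuracy metric. It is only an illustration under simple assumptions (whitespace tokenization, no lemmatization); the function names `word_counts` and `accuracy` are hypothetical, and this is not the notebook's actual modeling code.

```python
from collections import Counter


def word_counts(text):
    """Count how often each lower-cased, whitespace-separated token occurs."""
    return Counter(text.lower().split())


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("label lists must be non-empty and of equal length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Word-frequency features for one made-up comment.
counts = word_counts("NTA nta they were rude")

# Three of these four made-up predictions match the true labels.
score = accuracy(["YTA", "NTA", "NTA", "YTA"], ["YTA", "NTA", "YTA", "YTA"])
```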



In [2]:
import pandas as pd
import praw

# PRAW
Using the Python Reddit API Wrapper (PRAW), I pulled posts and the comments within them.

For the comments, I pulled:
- comment_text: the body text of the comment
- comment_dist: whether the comment is distinguished (flags moderator posts, used for cleaning)
- comment_score: the comment's score (upvotes minus downvotes)
- comment_parentpost_id: the ID of the parent post
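
The `comment_dist` field is what makes the later cleaning step possible: PRAW reports `distinguished` as `None` for ordinary comments and as a string such as `"moderator"` for distinguished ones. A minimal sketch of that filter, assuming the rows are still in memory as scraped (after a CSV round trip the missing values would need to be checked differently); the helper name `drop_distinguished` and the sample rows are hypothetical.

```python
def drop_distinguished(rows):
    """Keep only comments that are not distinguished (e.g. not moderator posts)."""
    return [row for row in rows if row["comment_distinguished"] is None]


# Made-up rows using the same column names as the comments DataFrame.
rows = [
    {"post_id": "abc", "comment_text": "NTA, clearly.",
     "comment_distinguished": None, "comment_score": 120},
    {"post_id": "abc", "comment_text": "Welcome to the subreddit! Please read the rules.",
     "comment_distinguished": "moderator", "comment_score": 1},
    {"post_id": "xyz", "comment_text": "YTA, sorry.",
     "comment_distinguished": None, "comment_score": 45},
]

user_comments = drop_distinguished(rows)
```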

Parent posts:
- post_title
- post_text
- post_id
- post_dist
- post_score
- post_upvoteratio
- post_date

The parent posts were pulled for further research: showing the predicted comments alongside their word similarity to the parent text. However, due to a lack of time, this wasn't followed up on.

In total, 4 pulls were made, each roughly a week apart, plus a fifth pull that broke this schedule due to complications I will discuss on the next page.
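
Because every pull reads the subreddit's current `hot` listing, the same post may show up in more than one pull. The sketch below shows one way to combine pulls while keeping a single row per post ID, with the most recent pull winning. It is an assumption about how the pulls could be merged, not the notebook's own cleaning code; `merge_pulls` and the sample data are hypothetical.

```python
def merge_pulls(pulls):
    """Merge pulls given in chronological order, keeping one row per post id.

    When a post appears in several pulls, the row from the latest pull is kept.
    """
    merged = {}
    for pull in pulls:
        for post in pull:
            merged[post["id"]] = post
    return list(merged.values())


# Made-up pulls: post "b" appears in both, with an updated score in the second.
first_pull = [{"id": "a", "score": 10}, {"id": "b", "score": 5}]
second_pull = [{"id": "b", "score": 50}, {"id": "c", "score": 1}]

posts = merge_pulls([first_pull, second_pull])
```

Keeping the latest row means scores and upvote ratios reflect the most recent snapshot of each post.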

In [1]:
# Accessing the Reddit API
reddit = praw.Reddit(client_id="9tAy2O858UyM-Q",#client id
                     client_secret="wRRhIAzTIgVBvcsvJftgZfK6dC4Nsg", #client secret
                     user_agent="alex_bot") #user agent name

# define custom scraping function
def scrape_subreddit(subreddit, postlimit=1000):
    
    subreddit = reddit.subreddit(subreddit)

    post_title = []
    post_text = []
    post_id = []
    post_dist = []
    post_score = []
    post_upvoteratio = []
    post_date = []
    comment_text = []
    comment_dist = []
    comment_score = []
    comment_parentpost_id = []

    # collect from posts sorted by hot
    for submission in subreddit.hot(limit = postlimit):
        # collect information on post
        post_title.append(submission.title)
        post_text.append(submission.selftext)
        post_id.append(submission.id)
        post_dist.append(submission.distinguished)
        post_score.append(submission.score)
        post_upvoteratio.append(submission.upvote_ratio)
        post_date.append(submission.created_utc)

        # collect all comments on each post
        submission.comments.replace_more(limit = None)
        for comment in submission.comments.list():     
            comment_text.append(comment.body)
            comment_dist.append(comment.distinguished)
            comment_score.append(comment.score)
            comment_parentpost_id.append(submission.id)
 
    # put posts into a df
    df_post = pd.DataFrame({'title': post_title,
                            'id': post_id,
                            'date_created': post_date,
                            'text': post_text,
                            'distinguished': post_dist,
                            'score': post_score,
                            'upvote_ratio': post_upvoteratio})
    df_post['date_created'] = pd.to_datetime(df_post['date_created'], unit='s')

    # put comments into a df
    df_comments = pd.DataFrame({'post_id': comment_parentpost_id,
                                'comment_text': comment_text,
                                'comment_distinguished': comment_dist,
                                'comment_score': comment_score})
    
    return df_post, df_comments


In [9]:
%%time
# scrape
aita_posts, aita_comments = scrape_subreddit('AmiTheAsshole')
aita_posts.to_csv('datasets/posts.csv', index=False)
aita_comments.to_csv('datasets/comments.csv', index=False)

CPU times: user 55.9 s, sys: 3.47 s, total: 59.4 s
Wall time: 31min 50s
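
The `praw.Reddit(...)` call in the cells here embeds the client ID and secret directly in the notebook. A common alternative is to read them from environment variables so they never appear in shared code. The sketch below assumes three variable names (`REDDIT_CLIENT_ID`, `REDDIT_CLIENT_SECRET`, `REDDIT_USER_AGENT`) chosen purely for illustration; the helper `load_reddit_credentials` is hypothetical and not part of PRAW.

```python
import os

# Hypothetical variable names; any naming scheme would work.
REQUIRED_VARS = ("REDDIT_CLIENT_ID", "REDDIT_CLIENT_SECRET", "REDDIT_USER_AGENT")


def load_reddit_credentials(env=os.environ):
    """Build the keyword arguments for praw.Reddit from a mapping of variables."""
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise KeyError("missing environment variables: " + ", ".join(missing))
    return {
        "client_id": env["REDDIT_CLIENT_ID"],
        "client_secret": env["REDDIT_CLIENT_SECRET"],
        "user_agent": env["REDDIT_USER_AGENT"],
    }


# Demonstration with placeholder values instead of the real environment.
example_env = {
    "REDDIT_CLIENT_ID": "placeholder-id",
    "REDDIT_CLIENT_SECRET": "placeholder-secret",
    "REDDIT_USER_AGENT": "alex_bot",
}
creds = load_reddit_credentials(example_env)

# With praw installed and the variables set, the client could be created as:
# reddit = praw.Reddit(**load_reddit_credentials())
```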


In [20]:
# Second pull, 2/1/21
# Reuses `reddit` and `scrape_subreddit` exactly as defined in the first pull cell.

# Scrape and save in the same cell to avoid accidents
aita_posts, aita_comments = scrape_subreddit('AmiTheAsshole')

aita_posts.to_csv('datasets/posts2.csv', index=False)
aita_comments.to_csv('datasets/comments2.csv', index=False)

In [3]:
# Third pull, 7/1
# Reuses `reddit` and `scrape_subreddit` exactly as defined in the first pull cell.

# scrape
aita_posts, aita_comments = scrape_subreddit('AmiTheAsshole')

aita_posts.to_csv('datasets/posts3.csv', index=False)
aita_comments.to_csv('datasets/comments3.csv', index=False)

In [4]:
# Fourth pull, 13/1
# Reuses `reddit` and `scrape_subreddit` exactly as defined in the first pull cell.

# scrape
aita_posts, aita_comments = scrape_subreddit('AmiTheAsshole')

aita_posts.to_csv('datasets/posts4.csv', index=False)
aita_comments.to_csv('datasets/comments4.csv', index=False)

In [28]:
# Fifth pull, 21/1 (output files use the 4_1 suffix)
# Reuses `reddit` and `scrape_subreddit` exactly as defined in the first pull cell.

# scrape
aita_posts, aita_comments = scrape_subreddit('AmiTheAsshole')

aita_posts.to_csv('datasets/posts4_1.csv', index=False)
aita_comments.to_csv('datasets/comments4_1.csv', index=False)