# Subreddit Classification: Writing & Blogging

![cover](images/cover.png)

Blogging and Writing subreddit groups are very similar in nature. Both are communities that are focused on writing. 

- However, the communities differ in size: the Blogging subreddit has 67.7K users, whereas the Writing subreddit has 1.4M.
- Based on the following image, the Writing subreddit is also far more active than the Blogging subreddit, with 1.1K users online at the time of the snapshot compared with only 96 for Blogging.
- Lastly, both groups were created at around the same time, in Q1 2008.
![cover](images/groups_r.png)

##### What is blogging? 
A blog (a truncation of "weblog") is a discussion or informational website published on the World Wide Web (www) consisting of discrete, often informal diary-style text entries (posts).
[Source](https://en.wikipedia.org/wiki/Blog)

##### What is writing?
Writing is a medium of human communication that involves the representation of a language with symbols. Writing systems are not themselves human languages (with the debatable exception of computer languages); they are means of rendering a language into a form that can be reconstructed by other humans separated by time and/or space.
[Source](https://en.wikipedia.org/wiki/writing)

***In other words, a blogger is also a writer, one who writes on the internet through weblogs ('blogs'). Writing, however, is an art in itself, which emphasises communication through language. A writer could write anywhere (newspapers, books, magazines, emails, blogs etc.).***

# Business Case

I am a data scientist on the Internet team at Alytics. My team works with online influencers and YouTubers to improve the views on their posts.

A rising trend in the market was for Bloggers to improve their views on their posts through the use of analytics. A subset of this trend is the rise of Reddit as a virtual community for bloggers to ask questions and seek guidance from like-minded individuals. 

Our team is working on a long-term project to help bloggers improve the views on their blogs. The first phase of this project targets the Reddit platform, which bloggers often visit to share ideas, get feedback and ask questions.
- **Phase A Part 1**: Create a classifying tool to help bloggers to post their questions and experiences in the correct Subreddit group. **(Current Project!)**
- **Phase A Part 2**: Look into subreddit analytics to understand how to structure a Reddit post to maximise eyeballs (upvotes, comments).

# Problem Statement

Create a text classifier that determines whether a Reddit post belongs in the "Blogging" or the "Writing" subreddit.

We will be measuring the success of our classifier model by looking at several metrics, including accuracy, specificity, sensitivity and model scores.
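
As a minimal sketch of how three of these metrics relate to each other, the snippet below computes accuracy, sensitivity and specificity from true and predicted labels. The function name, the toy labels and the choice of Blogging as the positive class are assumptions made for illustration; they are not taken from the project notebooks.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Return accuracy, sensitivity and specificity for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    return {
        # share of all posts that were classified correctly
        "accuracy": (tp + tn) / len(pairs),
        # share of positive-class posts that were found (true positive rate)
        "sensitivity": tp / (tp + fn) if (tp + fn) else float("nan"),
        # share of negative-class posts that were found (true negative rate)
        "specificity": tn / (tn + fp) if (tn + fp) else float("nan"),
    }


# Toy labels made up for illustration (assumed coding: 1 = Blogging, 0 = Writing)
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

Reporting sensitivity and specificity alongside accuracy shows whether the classifier makes more mistakes on one subreddit than on the other, which accuracy alone can hide.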

# Executive Summary

#### EDA
From our analysis, the Blogging subreddit places a heavy emphasis on blog optimisation (common phrases that appear: SEO, traffic flow, keyword search etc.) and less on the technical elements of writing. The Writing subreddit appears to be the opposite: posts focus on writing techniques, with many users posting questions and seeking help with their stories (common phrases that appear: 'writing advice, don know, dont want, help writing, start writing').

In terms of the overall tone of the words and phrases that commonly appear in each subreddit, the Blogging subreddit appears more formal and professional, whilst the Writing subreddit appears more casual and community-based. This could be because bloggers are more marketing- and promotion-oriented, whilst writers are more focused on the art of writing.
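
Common two-word phrases like the ones quoted above can be surfaced by counting bigrams. The sketch below uses only the standard library. The tokenisation rule, the function name and the sample posts are assumptions for illustration, and the project notebooks may have used a different tool.

```python
import re
from collections import Counter


def top_bigrams(texts, n=3):
    """Return the n most frequent two-word phrases across all texts."""
    counts = Counter()
    for text in texts:
        # lowercase and keep alphabetic runs only (a deliberately simple tokeniser)
        tokens = re.findall(r"[a-z]+", text.lower())
        # pair each token with the one that follows it
        counts.update(zip(tokens, tokens[1:]))
    return [(" ".join(pair), count) for pair, count in counts.most_common(n)]


# Sample posts made up for illustration
sample_posts = [
    "Need help writing my first chapter",
    "Any writing advice for a new author?",
    "I need help writing dialogue",
]
print(top_bigrams(sample_posts, n=2))
```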

#### Modelling (Classification Model)
<div>
<img src="images/model.png" width="500"/>
</div>

- As seen in the model summary above, both models performed similarly in predicting whether a post falls under the Writing or Blogging subreddit, with an accuracy of approximately 93%.
- From the logistic regression model, we were able to understand how it classifies posts based on the words that appear in them.
<br>
<div>
<img src="images/lr.png" width="700"/>
</div>

<div>
<img src="images/lr2.png" width="700"/>
</div>

- Interestingly, the words of greater importance in classifying posts into the **Blogging** subreddit are: posts, content, website, niche, article, Google, SEO, link etc. (web-analytics oriented).
- The words of greater importance in classifying posts into the **Writing** subreddit are: story, character, book, novel, read, writer, plot, feel, chapter etc. (traditional writing-oriented).
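
One way to read word importance from a logistic regression is to sort the vocabulary by coefficient: large positive values push a post towards the positive class and large negative values towards the other. The sketch below shows only the ranking step. The coefficient values are hypothetical, Blogging is assumed to be the positive class, and `top_words` is an illustrative helper rather than code from the notebooks.

```python
def top_words(feature_names, coefficients, k=3):
    """Return the k most negative and the k most positive words by coefficient."""
    ranked = sorted(zip(feature_names, coefficients), key=lambda pair: pair[1])
    most_negative = [word for word, _ in ranked[:k]]
    most_positive = [word for word, _ in reversed(ranked[-k:])]
    return most_negative, most_positive


# Hypothetical coefficients, not the fitted values from the project
words = ["seo", "story", "website", "plot", "niche", "novel"]
coefs = [1.8, -2.1, 1.2, -1.4, 0.9, -1.7]

# Assumption: positive class = Blogging, so negative weights point to Writing
writing_words, blogging_words = top_words(words, coefs, k=3)
print("Writing: ", writing_words)
print("Blogging:", blogging_words)
```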

# Content

The two subreddits that I scraped my Reddit posts from are:

- [Blogging](https://www.reddit.com/r/blogging/)
- [Writing](https://www.reddit.com/r/writing/)

[**Part 1**](https://github.com/alysesu/GA-Projects/blob/master/Project-3/Project%203%20-%201.%20Webscrapping.ipynb)

- Web Scraping

[**Part 2**](https://github.com/alysesu/GA-Projects/blob/master/Project-3/Project%203%20-%202.%20Data%20Cleaning%20%26%20EDA.ipynb)
- Exploratory Data Analysis
- Sentiment Analysis 

[**Part 3**](https://github.com/alysesu/GA-Projects/blob/master/Project-3/Project%203%20-%203.%20Data%20Processing%20%26%20Modelling.ipynb)
- Preprocessing Text Data
- Modelling (Logistic Regression)
- Modelling (Naive-Bayes Modelling)
- Conclusion

In [1]:
# Basic Imports
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# Web Scraping Imports
import requests
import time
import random

In [2]:
write = 'https://www.reddit.com/r/writing.json'
blog = 'https://www.reddit.com/r/Blogging.json'

In [3]:
res_w = requests.get(write, headers={'User-agent': 'GAProj3'})
res_b = requests.get(blog, headers={'User-agent': 'GAProj3'})

In [4]:
res_w.status_code

200

In [5]:
res_b.status_code

200

In [6]:
write_dict = res_w.json()
blog_dict = res_b.json()

## First Round Scraping (Hot Posts - Default)
Note: I accidentally reran this cell at the last minute, so I interrupted it and kept my previously scraped data for the EDA and modelling that follow.
The code below still works as written.

In [None]:
posts_w = []
after = None

for a in range(30):
    if after is None:
        current_url = write
    else:
        current_url = write + '?after=' + after
    print(current_url)
    res_w = requests.get(current_url, headers={'User-agent': 'GAProj3'})
    
    if res_w.status_code != 200:
        print('Status error', res_w.status_code)
        break
    
    write_dict = res_w.json()
    current_posts = [p['data'] for p in write_dict['data']['children']]
    posts_w.extend(current_posts)
    after = write_dict['data']['after']
    
    
    # generate a random sleep duration to look more 'natural'
    sleep_duration = random.randint(2,15)
    print(sleep_duration)
    time.sleep(sleep_duration)

https://www.reddit.com/r/writing.json
7


In [None]:
posts_b = []
after = None

for a in range(30):
    if after is None:
        current_url = blog
    else:
        current_url = blog + '?after=' + after
    print(current_url)
    res_b = requests.get(current_url, headers={'User-agent': 'GAProj3'})
    
    if res_b.status_code != 200:
        print('Status error', res_b.status_code)
        break
    blog_dict = res_b.json()    
    current_posts = [p['data'] for p in blog_dict['data']['children']]
    posts_b.extend(current_posts)
    after = blog_dict['data']['after']
    
    
    # generate a random sleep duration to look more 'natural'
    sleep_duration = random.randint(2,10)
    print(sleep_duration)
    time.sleep(sleep_duration)

## Second Round Scraping (New Posts)

In [None]:
posts_w2 = []
after = None

website_w2 = 'https://www.reddit.com/r/writing/new.json'
for a in range(15):
    if after is None:
        current_url = website_w2
    else:
        current_url = website_w2 + '?after=' + after
    print(current_url)
    res_w = requests.get(current_url, headers={'User-agent': 'GAProj3'})
    
    if res_w.status_code != 200:
        print('Status error', res_w.status_code)
        break
    
    write_dict = res_w.json()
    current_posts = [p['data'] for p in write_dict['data']['children']]
    posts_w2.extend(current_posts)
    after = write_dict['data']['after']
    
    
    # generate a random sleep duration to look more 'natural'
    sleep_duration = random.randint(2,15)
    print(sleep_duration)
    time.sleep(sleep_duration)

In [None]:
posts_b2 = []
after = None

website_b2 = 'https://www.reddit.com/r/blogging/new.json'
for a in range(15):
    if after is None:
        current_url = website_b2
    else:
        current_url = website_b2 + '?after=' + after
    print(current_url)
    res_b = requests.get(current_url, headers={'User-agent': 'GAProj3'})
    
    if res_b.status_code != 200:
        print('Status error', res_b.status_code)
        break
    
    blog_dict = res_b.json()
    current_posts = [p['data'] for p in blog_dict['data']['children']]
    posts_b2.extend(current_posts)
    after = blog_dict['data']['after']
    
    # generate a random sleep duration to look more 'natural'
    sleep_duration = random.randint(2,15)
    print(sleep_duration)
    time.sleep(sleep_duration)
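
The four scraping loops above share the same structure, so they could be folded into one helper. The sketch below is one possible refactor, not the code that produced the dataset. `scrape_listing` and its `fetch_json` parameter are hypothetical names: `fetch_json` is any callable that takes a URL and returns the parsed listing, or `None` on a failed request. The sketch also stops when the listing reports no further page, on the assumption that the `after` token is null in that case. The demo uses a fake in-memory listing so that it runs without network access.

```python
import random
import time


def scrape_listing(base_url, fetch_json, max_pages=15, sleep_range=(2, 15)):
    """Collect post dicts from a listing by following its `after` token."""
    posts = []
    after = None
    for _ in range(max_pages):
        url = base_url if after is None else f"{base_url}?after={after}"
        listing = fetch_json(url)
        if listing is None:  # fetch_json signals a failed request with None
            break
        posts.extend(child["data"] for child in listing["data"]["children"])
        after = listing["data"]["after"]
        if after is None:  # no further page: stop instead of restarting from page one
            break
        # random pause between requests, as in the loops above
        time.sleep(random.randint(*sleep_range))
    return posts


# Fake two-page listing, made up so the sketch can run offline
fake_pages = {
    "listing.json": {
        "data": {"children": [{"data": {"title": "a"}}], "after": "t3_x"}
    },
    "listing.json?after=t3_x": {
        "data": {"children": [{"data": {"title": "b"}}], "after": None}
    },
}
demo_posts = scrape_listing(
    "listing.json", fake_pages.get, max_pages=5, sleep_range=(0, 0)
)
print([post["title"] for post in demo_posts])
```

For real use, `fetch_json` could wrap the `requests.get(url, headers={'User-agent': 'GAProj3'})` call from the cells above and return `res.json()` when the status code is 200.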

# Merging all together

In [None]:
print(f'Scrape 1 Type (Blogging): {type(posts_b)}')
print(f'Scrape 2 Type (Blogging): {type(posts_b2)}')
print('\n')
print(f'Scrape 1 Type (Writing): {type(posts_w)}')
print(f'Scrape 2 Type (Writing): {type(posts_w2)}')

In [None]:
print(f'#Posts for Scrape 1 (Blogging): {len(posts_b)}')
print(f'#Posts for Scrape 2 (Blogging): {len(posts_b2)}')
print('\n')
print(f'#Posts for Scrape 1 (Writing): {len(posts_w)}')
print(f'#Posts for Scrape 2 (Writing): {len(posts_w2)}')

In [None]:
blogposts = posts_b
type(blogposts)

In [None]:
posts_b.extend(posts_b2)
len(posts_b)

In [None]:
writeposts = posts_w
type(writeposts)

In [None]:
posts_w.extend(posts_w2)
len(posts_w)
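
One detail of the merge cells above is that `blogposts = posts_b` binds a second name to the same list rather than copying it, and `extend` changes that list in place, so rerunning the `extend` cell appends the second scrape again. The sketch below shows a non-mutating alternative. `merge_scrapes` and the sample posts are hypothetical and only illustrate the behaviour.

```python
def merge_scrapes(*scrapes):
    """Return a new list with the posts of every scrape, in order."""
    merged = []
    for scrape in scrapes:
        merged.extend(scrape)
    return merged


# Hypothetical stand-ins for posts_b and posts_b2
hot_posts = [{"id": "a1"}, {"id": "b2"}]
new_posts = [{"id": "b2"}, {"id": "c3"}]

alias = hot_posts  # same list object, not a copy
all_posts = merge_scrapes(hot_posts, new_posts)

print(alias is hot_posts)              # the two names point at one list
print(len(hot_posts), len(all_posts))  # the inputs are left unchanged
```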

# Data Checking
### Unique Posts

In [None]:
# Note: `blog` and `write` previously held the listing URLs and are rebound to DataFrames here
blog = pd.DataFrame(posts_b)
write = pd.DataFrame(posts_w)

In [None]:
blog.shape

In [None]:
write.shape

In [None]:
write[['selftext','title']].duplicated().sum()

In [None]:
blog[['selftext','title']].duplicated().sum()

**Initial Observations**
- Our two rounds of web scraping returned many duplicated posts in both the Writing and Blogging subreddits. This is probably because we combined the 'New' and 'Hot' listings, which can contain the same posts. Some posts may also be duplicated within the 'New' or 'Hot' listing itself.

In [None]:
write.drop_duplicates(subset=['selftext','title'],inplace=True)
blog.drop_duplicates(subset=['selftext','title'],inplace=True)

In [None]:
print(write.shape)
print(blog.shape)
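
To make the de-duplication rule concrete, the sketch below reproduces it in plain Python: a post counts as a duplicate when both its `selftext` and its `title` match an earlier post, and the first occurrence is kept, which is also the default behaviour of `drop_duplicates`. The helper name and the sample posts are made up for illustration.

```python
def drop_duplicate_posts(posts, keys=("selftext", "title")):
    """Keep the first post for each distinct combination of the given keys."""
    seen = set()
    unique = []
    for post in posts:
        signature = tuple(post.get(key) for key in keys)
        if signature not in seen:
            seen.add(signature)
            unique.append(post)
    return unique


# Sample posts made up for illustration
sample = [
    {"title": "SEO tips", "selftext": "How do I rank?", "id": "a1"},
    {"title": "SEO tips", "selftext": "How do I rank?", "id": "a1"},  # exact repeat
    {"title": "SEO tips", "selftext": "Different body", "id": "b2"},  # same title only
]
unique_posts = drop_duplicate_posts(sample)
print(len(sample), "->", len(unique_posts))
```

Matching on both columns means that two posts sharing a title but differing in body text are both retained.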

# Exporting Data

In [None]:
blog.to_csv('datasets/blogging.csv', index=False)
write.to_csv('datasets/writing.csv', index=False)