# Determining the Topics of Posts in the r/Python Subreddit

The aim of this project is to classify the topics of posts from the r/Python subreddit, end to end. To do this, I first needed to scrape enough posts to train a reasonably accurate NLP model. I did this with this [submission downloader code](https://github.com/dgadish/projects/blob/master/NLP/Reddit_Scrape_Analysis/Reddit%20Submission%20downloader.ipynb).

Once enough submissions had been scraped, I read them into a pandas DataFrame so that they could be cleaned before being fed into a model.

The submissions were collected at around 10:00 on 07/02/2021.

The submissions have both title and text sections. I will attempt to determine topics using both. 

I will attempt to use both LSA (Latent Semantic Analysis) and LDA (Latent Dirichlet Allocation) to determine the topics of each post. When I have more time, I will also try word embeddings combined with a clustering method such as K-Means.
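To give a feel for what LSA does, here is a minimal sketch on four made-up posts. It uses a plain NumPy singular value decomposition rather than gensim, and the toy documents, the choice of `k = 2` and the `cosine` helper are all assumptions for illustration, not part of the project code.

```python
import numpy as np

# Four toy tokenized posts (made up for illustration, not from the dataset)
docs = [
    ["python", "pandas", "dataframe"],
    ["pandas", "dataframe", "csv"],
    ["flask", "web", "server"],
    ["web", "server", "django"],
]

# Vocabulary and a word -> column lookup
vocab = sorted({word for doc in docs for word in doc})
col = {word: i for i, word in enumerate(vocab)}

# Document-term count matrix: one row per post, one column per word
X = np.zeros((len(docs), len(vocab)))
for row, doc in enumerate(docs):
    for word in doc:
        X[row, col[word]] += 1

# LSA keeps only the k largest singular directions of the count matrix
U, S, Vt = np.linalg.svd(X, full_matrices=False)
k = 2
doc_topics = U[:, :k] * S[:k]  # each row is a post in k-dimensional topic space


def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


# Posts that share vocabulary should end up close together in topic space
print(cosine(doc_topics[0], doc_topics[1]))
print(cosine(doc_topics[0], doc_topics[2]))
```

In this toy case the two pandas posts land on the same direction in topic space, while the pandas and web posts are orthogonal. Real data will be far less clean.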

In [26]:
import nltk
import gensim
import spacy
import numpy as np
import pandas as pd

In [2]:
subs = pd.read_json('python_posts.json', orient='index')

In [3]:
subs.head()

| | title | subreddit | score | num_comments | created_utc | selftext |
|---|---|---|---|---|---|---|
| 1 | Python conversion tool for converting csv to J... | Python | 1 | 1 | 1612467290 | [removed] |
| 2 | lynda courses | Python | 1 | 2 | 1612466655 | [removed] |
| 3 | Anybody have sample code to determine if a web... | Python | 4 | 5 | 1612466550 | Hi, has anyone created python code that tests ... |
| 5 | Barnsley Fern - an interesting fractal created... | Python | 1 | 0 | 1612465003 | [deleted] |
| 6 | Project – Find Neapolitan pizza with AI help | Python | 2 | 0 | 1612464251 | Just finished this project! An unfiltered revi... |


In [4]:
subs.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 26003 entries, 1 to 40100
Data columns (total 6 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   title         26003 non-null  object
 1   subreddit     26003 non-null  object
 2   score         26003 non-null  int64 
 3   num_comments  26003 non-null  int64 
 4   created_utc   26003 non-null  int64 
 5   selftext      26003 non-null  object
dtypes: int64(3), object(3)
memory usage: 1.4+ MB


In [5]:
# Drop submissions which have been removed or deleted

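n_r_d = (subs['selftext'] != '[removed]') & (subs['selftext'] != '[deleted]')

# Filter and reset the index in one step; this returns a new DataFrame
# rather than modifying a slice of the original in place
subs_1 = subs[n_r_d].reset_index(drop=True)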

In [6]:
subs_1.head()

| | title | subreddit | score | num_comments | created_utc | selftext |
|---|---|---|---|---|---|---|
| 0 | Anybody have sample code to determine if a web... | Python | 4 | 5 | 1612466550 | Hi, has anyone created python code that tests ... |
| 1 | Project – Find Neapolitan pizza with AI help | Python | 2 | 0 | 1612464251 | Just finished this project! An unfiltered revi... |
| 2 | In response to the "Medium bad" thread, here a... | Python | 9 | 4 | 1612460552 | I agree with some of the sentiments shared in ... |
| 3 | To open source or not to open source? | Python | 3 | 8 | 1612453887 | When do you guys know when to open source a pr... |
| 4 | How can i decrypt signature Url of YouTube Videos | Python | 2 | 3 | 1612453834 | I made a python module which download youtube ... |


In [7]:
subs_1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18102 entries, 0 to 18101
Data columns (total 6 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   title         18102 non-null  object
 1   subreddit     18102 non-null  object
 2   score         18102 non-null  int64 
 3   num_comments  18102 non-null  int64 
 4   created_utc   18102 non-null  int64 
 5   selftext      18102 non-null  object
dtypes: int64(3), object(3)
memory usage: 848.7+ KB


### Data cleaning and pre-processing

Now that I have collected a sufficiently large data set and dropped any submissions that were deleted or removed, I can begin to clean it. For now, the cleaning will focus on preparing the data for a bag-of-words model, which is suitable for both LSA and LDA.

I will begin with the 'selftext' column and then move to the 'title' column.
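As a reminder of what a bag-of-words representation looks like, here is a small sketch using only the standard library. The two toy posts and the variable names are made up for illustration. The point is that only token counts survive, not word order.

```python
from collections import Counter

# Two toy tokenized posts (made up for illustration, not from the dataset)
posts = [
    ["python", "pandas", "dataframe", "pandas"],
    ["flask", "python", "server"],
]

# A bag of words keeps only how often each token occurs, not its position
bags = [Counter(tokens) for tokens in posts]

# Shared vocabulary, sorted so the column order is reproducible
vocab = sorted(set().union(*posts))

# One count vector per post, aligned to the vocabulary
vectors = [[bag[word] for word in vocab] for bag in bags]

print(vocab)
print(vectors)
```

This is why the cleaning steps matter: every distinct token becomes its own column, so URLs, punctuation and very common words would otherwise add noise to the vocabulary.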

**Selftext**

I will start by removing the URLs that appear in a number of the submissions. I have taken a regular expression from [GitHub Gist user gruber](https://gist.github.com/gruber/8891611) and will apply it with pandas string methods.

In [8]:
# First create a new copy of the dataframe

subs_2 = subs_1.copy()

# Remove URLs with the regex from the gist

rgx = r"(?i)\b((?:https?:(?:/{1,3}|[a-z0-9%])|[a-z0-9.\-]+[.](?:com|net|org|edu|gov|mil|aero|asia|biz|cat|coop|info|int|jobs|mobi|museum|name|post|pro|tel|travel|xxx|ac|ad|ae|af|ag|ai|al|am|an|ao|aq|ar|as|at|au|aw|ax|az|ba|bb|bd|be|bf|bg|bh|bi|bj|bm|bn|bo|br|bs|bt|bv|bw|by|bz|ca|cc|cd|cf|cg|ch|ci|ck|cl|cm|cn|co|cr|cs|cu|cv|cx|cy|cz|dd|de|dj|dk|dm|do|dz|ec|ee|eg|eh|er|es|et|eu|fi|fj|fk|fm|fo|fr|ga|gb|gd|ge|gf|gg|gh|gi|gl|gm|gn|gp|gq|gr|gs|gt|gu|gw|gy|hk|hm|hn|hr|ht|hu|id|ie|il|im|in|io|iq|ir|is|it|je|jm|jo|jp|ke|kg|kh|ki|km|kn|kp|kr|kw|ky|kz|la|lb|lc|li|lk|lr|ls|lt|lu|lv|ly|ma|mc|md|me|mg|mh|mk|ml|mm|mn|mo|mp|mq|mr|ms|mt|mu|mv|mw|mx|my|mz|na|nc|ne|nf|ng|ni|nl|no|np|nr|nu|nz|om|pa|pe|pf|pg|ph|pk|pl|pm|pn|pr|ps|pt|pw|py|qa|re|ro|rs|ru|rw|sa|sb|sc|sd|se|sg|sh|si|sj|Ja|sk|sl|sm|sn|so|sr|ss|st|su|sv|sx|sy|sz|tc|td|tf|tg|th|tj|tk|tl|tm|tn|to|tp|tr|tt|tv|tw|tz|ua|ug|uk|us|uy|uz|va|vc|ve|vg|vi|vn|vu|wf|ws|ye|yt|yu|za|zm|zw)/)(?:[^\s()<>{}\[\]]+|\([^\s()]*?\([^\s()]+\)[^\s()]*?\)|\([^\s]+?\))+(?:\([^\s()]*?\([^\s()]+\)[^\s()]*?\)|\([^\s]+?\)|[^\s`!()\[\]{};:'\".,<>?«»“”‘’])|(?:(?<!@)[a-z0-9]+(?:[.\-][a-z0-9]+)*[.](?:com|net|org|edu|gov|mil|aero|asia|biz|cat|coop|info|int|jobs|mobi|museum|name|post|pro|tel|travel|xxx|ac|ad|ae|af|ag|ai|al|am|an|ao|aq|ar|as|at|au|aw|ax|az|ba|bb|bd|be|bf|bg|bh|bi|bj|bm|bn|bo|br|bs|bt|bv|bw|by|bz|ca|cc|cd|cf|cg|ch|ci|ck|cl|cm|cn|co|cr|cs|cu|cv|cx|cy|cz|dd|de|dj|dk|dm|do|dz|ec|ee|eg|eh|er|es|et|eu|fi|fj|fk|fm|fo|fr|ga|gb|gd|ge|gf|gg|gh|gi|gl|gm|gn|gp|gq|gr|gs|gt|gu|gw|gy|hk|hm|hn|hr|ht|hu|id|ie|il|im|in|io|iq|ir|is|it|je|jm|jo|jp|ke|kg|kh|ki|km|kn|kp|kr|kw|ky|kz|la|lb|lc|li|lk|lr|ls|lt|lu|lv|ly|ma|mc|md|me|mg|mh|mk|ml|mm|mn|mo|mp|mq|mr|ms|mt|mu|mv|mw|mx|my|mz|na|nc|ne|nf|ng|ni|nl|no|np|nr|nu|nz|om|pa|pe|pf|pg|ph|pk|pl|pm|pn|pr|ps|pt|pw|py|qa|re|ro|rs|ru|rw|sa|sb|sc|sd|se|sg|sh|si|sj|Ja|sk|sl|sm|sn|so|sr|ss|st|su|sv|sx|sy|sz|tc|td|tf|tg|th|tj|tk|tl|tm|tn|to|tp|tr|tt|tv|tw|tz|ua|ug|uk|us|uy|uz|va|vc|ve|vg|vi|vn|vu|wf|ws|ye|yt|yu|za|zm|zw)\b/?(?!@)))"
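# regex=True is required in recent pandas versions, where str.replace
# treats the pattern as a literal string by default
subs_2['selftext'] = subs_2['selftext'].str.replace(rgx, " ", regex=True)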
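The gist's pattern is long because it also catches bare domains such as `example.com`. To see the basic idea on a single string, here is a sketch with a much simpler pattern. `SIMPLE_URL` and `strip_urls` are hypothetical names, and this pattern is an assumption for illustration that only catches links starting with `http://`, `https://` or `www.`.

```python
import re

# Much simpler than the gist's pattern: only matches links that start with
# http://, https:// or www. (an illustrative assumption, not the gist's regex)
SIMPLE_URL = re.compile(r"https?://\S+|www\.\S+", flags=re.IGNORECASE)


def strip_urls(text):
    """Replace every match of SIMPLE_URL with a single space."""
    return SIMPLE_URL.sub(" ", text)


sample = "Docs are at https://docs.python.org/3/ and www.example.com today"
cleaned = strip_urls(sample)

# Splitting on whitespace shows the words that survive
print(cleaned.split())
```

Replacing with a space rather than an empty string keeps the words on either side of a link from being glued together.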

Next I will remove punctuation, convert the text to lowercase, tokenize it, and remove stop words and any words shorter than four letters.

In [12]:
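# Replace non-word characters (punctuation) with spaces and convert to lowercase
# (raw string for the pattern; regex=True is required in recent pandas versions)

subs_2['selftext'] = subs_2['selftext'].str.replace(r'\W', ' ', regex=True).str.lower()

# Tokenize (nltk.word_tokenize needs NLTK's Punkt tokenizer data to be downloaded)

subs_2['selftext'] = subs_2['selftext'].apply(nltk.word_tokenize)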
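It is worth seeing what `\W` actually removes. The sketch below applies the same substitution to one made-up sentence with the standard `re` module, and uses `str.split` in place of `nltk.word_tokenize` so that it runs without any NLTK data. That swap is an assumption for illustration.

```python
import re

# A made-up sentence with punctuation, an underscore and a version number
text = "Hi, has anyone used my_module v2.0? It's great!"

# \W matches anything that is not a letter, digit or underscore
no_punct = re.sub(r"\W", " ", text).lower()

# Whitespace split stands in for nltk.word_tokenize here
tokens = no_punct.split()

print(tokens)
```

Note that underscores and digits are kept, that `v2.0` is split into `v2` and `0`, and that `It's` becomes `it` and `s`. Most of these fragments are removed later by the length filter.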


In [17]:
# Remove stop words and anything less than 4 letters long
# (requires the NLTK 'stopwords' corpus, e.g. via nltk.download('stopwords'))

stopwords = set(nltk.corpus.stopwords.words('english'))


def no_stop(tokens):
    """Return the tokens that are not in the English stop word list."""
    return [t for t in tokens if t not in stopwords]


def no_short(tokens):
    """Return the tokens that are at least 4 characters long."""
    return [t for t in tokens if len(t) >= 4]


In [19]:
strin = 'This is a test string to see if my functions work super duper well'
strin = nltk.word_tokenize(strin)

In [20]:
no_stop(strin)

['This',
 'test',
 'string',
 'see',
 'functions',
 'work',
 'super',
 'duper',
 'well']

In [21]:
no_short(strin)

['This', 'test', 'string', 'functions', 'work', 'super', 'duper', 'well']
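Notice that 'This' survives the stop word filter in the test above. NLTK's stop word list is lowercase, so the comparison is case-sensitive. That does not matter for the 'selftext' column, which has already been lowercased. For text that has not been, a case-insensitive variant could look like the sketch below. `STOP` is a tiny stand-in for the NLTK list and `clean_tokens` is a hypothetical helper, both assumptions for illustration.

```python
# Tiny stand-in for NLTK's English stop word list (assumption, illustration only)
STOP = {"this", "is", "a", "to", "if", "my"}


def clean_tokens(tokens, stop=STOP, min_len=4):
    """Drop stop words (case-insensitively) and tokens shorter than min_len."""
    return [t for t in tokens if t.lower() not in stop and len(t) >= min_len]


sample = "This is a test string to see if my functions work super duper well".split()
result = clean_tokens(sample)

print(result)
```

Doing both checks in one pass also avoids building an intermediate list for every post.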

In [22]:
# Apply my tested functions to the 'selftext' column

subs_2['selftext'] = subs_2['selftext'].apply(no_stop).apply(no_short)

In [23]:
subs_2['selftext'].head()

0    [anyone, created, python, code, tests, page, d...
1    [finished, project, unfiltered, review, pizza,...
2    [agree, sentiments, shared, thread, medium, tu...
3    [guys, know, open, source, project, working, p...
4    [made, python, module, download, youtube, vide...
Name: selftext, dtype: object

The next step in processing the data is to lemmatize and stem the words, which reduces each word to a common base form by normalizing tense and removing endings such as 'ed', 'ly' and 's'.
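As a rough preview of the stemming half of that step, here is a minimal sketch using NLTK's `PorterStemmer`, which needs no extra data downloads. The token list is made up, and the choice of the Porter algorithm is an assumption, not necessarily the stemmer this project ends up using. Lemmatization with NLTK additionally needs the WordNet corpus, so it is left out here.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# A few made-up tokens of the kind left in 'selftext' after cleaning
tokens = ["tests", "finished", "functions", "working"]

# Stemming strips common suffixes by rule; the result is not always a real word
stems = [stemmer.stem(token) for token in tokens]

print(dict(zip(tokens, stems)))
```

Because stemming is rule-based, different forms of a word collapse to one token, which shrinks the vocabulary that the bag-of-words model has to handle.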