# Post Topics from Posts in the r/Python subreddit

The aim of this project is to classify posts from the r/Python subreddit by topic, end to end. To do this, I first needed to scrape enough posts to train a reasonably accurate NLP model. I did this with this [submission downloader code](https://github.com/dgadish/projects/blob/master/NLP/Reddit_Scrape_Analysis/Reddit%20Submission%20downloader.ipynb).

Once a sufficient number of submissions were scraped, I read them into a pandas dataframe in order to clean them sufficiently to be fed into a model.

The submissions were collected at around 10:00 on 07/02/2021.

The submissions have both title and text sections. I will attempt to determine topics using both. 

I will attempt to use both LSA (Latent Semantic Analysis) and LDA (Latent Dirichlet Allocation) to determine the topics of each post. When I have more time, I will attempt to use Word Embedding combined with something like K-Means clustering as well.

### Progress so far...

So far I have imported the submissions and I have cleaned and preprocessed the text sections of the submissions. I have then created a bag-of-words model for the text column and used it to train an LDA model.

### Next steps...

Next I need to experiment with the parameters of the model to find suitable topics, which I can then label and assign to submissions.

In [1]:
import nltk
#nltk.download('all')
import gensim
import spacy
import numpy as np
import pandas as pd

In [2]:
subs = pd.read_json('python_posts.json', orient='index')

In [3]:
subs.head()

Unnamed: 0,title,subreddit,score,num_comments,created_utc,selftext
1,Python conversion tool for converting csv to J...,Python,1,1,1612467290,[removed]
2,lynda courses,Python,1,2,1612466655,[removed]
3,Anybody have sample code to determine if a web...,Python,4,5,1612466550,"Hi, has anyone created python code that tests ..."
5,Barnsley Fern - an interesting fractal created...,Python,1,0,1612465003,[deleted]
6,Project – Find Neapolitan pizza with AI help,Python,2,0,1612464251,Just finished this project! An unfiltered revi...


In [4]:
subs.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 26003 entries, 1 to 40100
Data columns (total 6 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   title         26003 non-null  object
 1   subreddit     26003 non-null  object
 2   score         26003 non-null  int64 
 3   num_comments  26003 non-null  int64 
 4   created_utc   26003 non-null  int64 
 5   selftext      26003 non-null  object
dtypes: int64(3), object(3)
memory usage: 1.4+ MB


In [5]:
# Drop submissions which have been removed or deleted

n_r_d = (subs['selftext'] != '[removed]') & (subs['selftext'] != '[deleted]')
# Filter and reset the index in one step, so that subs_1 is a new dataframe rather than a modified slice of subs
subs_1 = subs[n_r_d].reset_index(drop=True)

In [6]:
subs_1.head()

Unnamed: 0,title,subreddit,score,num_comments,created_utc,selftext
0,Anybody have sample code to determine if a web...,Python,4,5,1612466550,"Hi, has anyone created python code that tests ..."
1,Project – Find Neapolitan pizza with AI help,Python,2,0,1612464251,Just finished this project! An unfiltered revi...
2,"In response to the ""Medium bad"" thread, here a...",Python,9,4,1612460552,I agree with some of the sentiments shared in ...
3,To open source or not to open source?,Python,3,8,1612453887,When do you guys know when to open source a pr...
4,How can i decrypt signature Url of YouTube Videos,Python,2,3,1612453834,I made a python module which download youtube ...


In [7]:
subs_1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18102 entries, 0 to 18101
Data columns (total 6 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   title         18102 non-null  object
 1   subreddit     18102 non-null  object
 2   score         18102 non-null  int64 
 3   num_comments  18102 non-null  int64 
 4   created_utc   18102 non-null  int64 
 5   selftext      18102 non-null  object
dtypes: int64(3), object(3)
memory usage: 848.7+ KB


### Data cleaning and pre-processing

Now that I have collected a sufficiently large data set and dropped any submissions that were deleted or removed, I can begin to clean the data set. For now my cleaning will be focused on preparing the data for a bag-of-words model, suitable for LSA and LDA.

I will begin with the 'selftext' column and then move to the 'title' column.

**Selftext**

I will start by removing the URLs which are included in a number of the submissions. I have taken a regular expression from [GitHub Gist user gruber](https://gist.github.com/gruber/8891611) and will use it to remove the URLs with pandas string methods.

In [8]:
# First create a new copy of the dataframe

subs_2 = subs_1.copy()

# Remove URLs with the provided regex

rgx = r"(?i)\b((?:https?:(?:/{1,3}|[a-z0-9%])|[a-z0-9.\-]+[.](?:com|net|org|edu|gov|mil|aero|asia|biz|cat|coop|info|int|jobs|mobi|museum|name|post|pro|tel|travel|xxx|ac|ad|ae|af|ag|ai|al|am|an|ao|aq|ar|as|at|au|aw|ax|az|ba|bb|bd|be|bf|bg|bh|bi|bj|bm|bn|bo|br|bs|bt|bv|bw|by|bz|ca|cc|cd|cf|cg|ch|ci|ck|cl|cm|cn|co|cr|cs|cu|cv|cx|cy|cz|dd|de|dj|dk|dm|do|dz|ec|ee|eg|eh|er|es|et|eu|fi|fj|fk|fm|fo|fr|ga|gb|gd|ge|gf|gg|gh|gi|gl|gm|gn|gp|gq|gr|gs|gt|gu|gw|gy|hk|hm|hn|hr|ht|hu|id|ie|il|im|in|io|iq|ir|is|it|je|jm|jo|jp|ke|kg|kh|ki|km|kn|kp|kr|kw|ky|kz|la|lb|lc|li|lk|lr|ls|lt|lu|lv|ly|ma|mc|md|me|mg|mh|mk|ml|mm|mn|mo|mp|mq|mr|ms|mt|mu|mv|mw|mx|my|mz|na|nc|ne|nf|ng|ni|nl|no|np|nr|nu|nz|om|pa|pe|pf|pg|ph|pk|pl|pm|pn|pr|ps|pt|pw|py|qa|re|ro|rs|ru|rw|sa|sb|sc|sd|se|sg|sh|si|sj|Ja|sk|sl|sm|sn|so|sr|ss|st|su|sv|sx|sy|sz|tc|td|tf|tg|th|tj|tk|tl|tm|tn|to|tp|tr|tt|tv|tw|tz|ua|ug|uk|us|uy|uz|va|vc|ve|vg|vi|vn|vu|wf|ws|ye|yt|yu|za|zm|zw)/)(?:[^\s()<>{}\[\]]+|\([^\s()]*?\([^\s()]+\)[^\s()]*?\)|\([^\s]+?\))+(?:\([^\s()]*?\([^\s()]+\)[^\s()]*?\)|\([^\s]+?\)|[^\s`!()\[\]{};:'\".,<>?«»“”‘’])|(?:(?<!@)[a-z0-9]+(?:[.\-][a-z0-9]+)*[.](?:com|net|org|edu|gov|mil|aero|asia|biz|cat|coop|info|int|jobs|mobi|museum|name|post|pro|tel|travel|xxx|ac|ad|ae|af|ag|ai|al|am|an|ao|aq|ar|as|at|au|aw|ax|az|ba|bb|bd|be|bf|bg|bh|bi|bj|bm|bn|bo|br|bs|bt|bv|bw|by|bz|ca|cc|cd|cf|cg|ch|ci|ck|cl|cm|cn|co|cr|cs|cu|cv|cx|cy|cz|dd|de|dj|dk|dm|do|dz|ec|ee|eg|eh|er|es|et|eu|fi|fj|fk|fm|fo|fr|ga|gb|gd|ge|gf|gg|gh|gi|gl|gm|gn|gp|gq|gr|gs|gt|gu|gw|gy|hk|hm|hn|hr|ht|hu|id|ie|il|im|in|io|iq|ir|is|it|je|jm|jo|jp|ke|kg|kh|ki|km|kn|kp|kr|kw|ky|kz|la|lb|lc|li|lk|lr|ls|lt|lu|lv|ly|ma|mc|md|me|mg|mh|mk|ml|mm|mn|mo|mp|mq|mr|ms|mt|mu|mv|mw|mx|my|mz|na|nc|ne|nf|ng|ni|nl|no|np|nr|nu|nz|om|pa|pe|pf|pg|ph|pk|pl|pm|pn|pr|ps|pt|pw|py|qa|re|ro|rs|ru|rw|sa|sb|sc|sd|se|sg|sh|si|sj|Ja|sk|sl|sm|sn|so|sr|ss|st|su|sv|sx|sy|sz|tc|td|tf|tg|th|tj|tk|tl|tm|tn|to|tp|tr|tt|tv|tw|tz|ua|ug|uk|us|uy|uz|va|vc|ve|vg|vi|vn|vu|wf|ws|ye|yt|yu|za|zm|zw)\b/?(?!@)))"
subs_2['selftext'] = subs_2['selftext'].str.replace(rgx, " ", regex=True)
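
To illustrate the replacement step on a small scale, here is a minimal sketch using Python's built-in `re` module. `SIMPLE_URL` and `strip_urls` are hypothetical names, and the pattern is a deliberately simplified stand-in for the Gruber regex: it only catches links that start with `http://`, `https://` or `www.`.

```python
import re

# Hypothetical, simplified pattern: only matches links starting with
# http://, https:// or www. (far less thorough than the regex above).
SIMPLE_URL = re.compile(r"(?i)\b(?:https?://|www\.)\S+")


def strip_urls(text):
    """Replace every match of SIMPLE_URL with a single space."""
    return SIMPLE_URL.sub(" ", text)


sample = "Docs are at https://docs.python.org/3/ and www.example.com today"
print(strip_urls(sample))
```

On a pandas column the same idea is expressed with `Series.str.replace(pattern, " ", regex=True)`, where the pattern is passed explicitly as a regular expression.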

Next I shall remove punctuation, tokenize the text, and remove any stop words and words shorter than 4 letters.

In [9]:
# Remove punctuation and convert all to lowercase

subs_2['selftext'] = subs_2['selftext'].str.replace(r'\W', ' ', regex=True).str.lower()

# Tokenize

subs_2['selftext'] = subs_2['selftext'].apply(nltk.word_tokenize)


In [10]:
# Remove stop words and anything less than 4 letters long
''' 
By the very nature of the subreddit, every submission should be about python.
As such, the word 'python' can also be removed.

After running the model, the term 'x200b' is common. 
This is to do with a zero-width space character and is in the dataset due to how the data is encoded. 
It does not add any information and so will be removed.
'''

stopwords = set(nltk.corpus.stopwords.words('english'))
stopwords.add('python')
stopwords.add('x200b')

def no_stop(tokens):
    # Keep only the tokens that are not in the stop word set
    return [t for t in tokens if t not in stopwords]


def no_short(tokens):
    # Keep only the tokens that are at least 4 characters long
    return [t for t in tokens if len(t) >= 4]
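
The `x200b` token most likely comes from the HTML entity `&#x200B;` (a zero-width space) appearing literally in the post text: once the non-word characters `&`, `#` and `;` are replaced by spaces, only `x200B` is left. The sketch below reproduces this with the standard library and shows one possible alternative to a stop word, namely decoding entities with `html.unescape` before cleaning. The `tokens` helper and the sample string are hypothetical, and a plain whitespace split stands in for `nltk.word_tokenize`.

```python
import html
import re


def tokens(text):
    """Replace non-word characters with spaces, lowercase, split on whitespace."""
    return re.sub(r"\W", " ", text).lower().split()


raw = "Here is my code:&#x200B; print('hi')"

# Cleaning the raw text leaves the entity behind as the token 'x200b'
print(tokens(raw))

# Decoding the entity first turns it into a real zero-width space,
# which the \W replacement then removes like any other non-word character
print(tokens(html.unescape(raw)))
```

Either approach removes the token; adding it to the stop word set, as done above, has the advantage of not touching the rest of the pipeline.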


In [11]:
strin = 'This is a test string to see if my functions work super duper well'
strin = nltk.word_tokenize(strin)

In [12]:
no_stop(strin)

['This',
 'test',
 'string',
 'see',
 'functions',
 'work',
 'super',
 'duper',
 'well']

In [13]:
no_short(strin)

['This', 'test', 'string', 'functions', 'work', 'super', 'duper', 'well']

In [14]:
# Apply my tested functions to the 'selftext' column

subs_2['selftext'] = subs_2['selftext'].apply(no_stop).apply(no_short)

In [15]:
subs_2['selftext'].head()

0    [anyone, created, code, tests, page, different...
1    [finished, project, unfiltered, review, pizza,...
2    [agree, sentiments, shared, thread, medium, tu...
3    [guys, know, open, source, project, working, p...
4    [made, module, download, youtube, videos, with...
Name: selftext, dtype: object

The next step in processing the data is to lemmatize and stem the words, which brings verbs back to their base form and removes endings such as 'ed', 'ly', 's' etc.

This can be done using proven functions in the nltk package.

In [16]:
wnl = nltk.stem.WordNetLemmatizer()
stemmer = nltk.stem.SnowballStemmer('english')

wnl.lemmatize() takes two arguments: the word and a 'part-of-speech' tag. For now I will focus on just converting the verbs, pos='v', but I may come back to this in the future and amend it to account for all parts of speech, so that the stemmer, which produces stems that aren't proper English words, isn't needed.

In [17]:
# Function to lemmatize and stem the text

def lemmatize_stemmer(tokens):
    # Lemmatize each token as a verb, then stem the result
    return [stemmer.stem(wnl.lemmatize(word, pos='v')) for word in tokens]

In [18]:
# Test the function on one row of subs_2

slftxt = subs_2['selftext'][1]
print(slftxt)
cleaned = lemmatize_stemmer(slftxt)
print(cleaned)

['finished', 'project', 'unfiltered', 'review', 'pizza', 'places', 'city', 'authentic', 'pizza', 'places', 'boston', 'said', 'combining', 'computer', 'vision', 'machine', 'learning', 'ease', 'search', 'neapolitan', 'pizza', 'based', 'photos', 'public', 'crowd', 'sourced', 'reviews', 'check', 'city', 'learn', 'going', 'hood', 'would', 'glad', 'receive', 'feedback']
['finish', 'project', 'unfilt', 'review', 'pizza', 'place', 'citi', 'authent', 'pizza', 'place', 'boston', 'say', 'combin', 'comput', 'vision', 'machin', 'learn', 'eas', 'search', 'neapolitan', 'pizza', 'base', 'photo', 'public', 'crowd', 'sourc', 'review', 'check', 'citi', 'learn', 'go', 'hood', 'would', 'glad', 'receiv', 'feedback']


Although not perfect, this will be good enough for the purpose of this project at this stage. The function will now be applied to subs_2.

In [19]:
subs_2['selftext'] = subs_2['selftext'].apply(lemmatize_stemmer)

In [20]:
subs_2['selftext'].head()

0    [anyon, creat, code, test, page, differ, previ...
1    [finish, project, unfilt, review, pizza, place...
2    [agre, sentiment, share, thread, medium, turn,...
3    [guy, know, open, sourc, project, work, projec...
4    [make, modul, download, youtub, video, without...
Name: selftext, dtype: object

The next step is to produce a bag-of-words model from the texts. This can be done with the assistance of the gensim package.

First, a dictionary will be built that maps each unique word to an integer ID and records how many documents it appears in. This will then be cut down by filtering out any words that appear in fewer than 20 documents or in more than 50% of documents, as these words will be either too rare or too common to be useful for classifying topics.

Once this has been done, the dictionary will be used to create a bag-of-words model for each submission in the data set.
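
As a rough illustration of what these steps compute, the following sketch reimplements them in plain Python on a toy corpus. The helper names (`build_vocab`, `to_bow`), the toy documents and the tiny thresholds are assumptions made for the example; gensim's own implementation differs in details such as how IDs are assigned.

```python
from collections import Counter


def build_vocab(docs, no_below, no_above):
    """Map each kept token to an integer id.

    A token is kept if it appears in at least `no_below` documents and in
    no more than a fraction `no_above` of all documents.
    """
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))  # count each token once per document
    max_docs = no_above * len(docs)
    kept = sorted(t for t, df in doc_freq.items() if no_below <= df <= max_docs)
    return {token: idx for idx, token in enumerate(kept)}


def to_bow(doc, vocab):
    """Return sorted (token_id, count) pairs for the tokens present in `vocab`."""
    counts = Counter(vocab[t] for t in doc if t in vocab)
    return sorted(counts.items())


docs = [
    ["file", "read", "file", "code"],
    ["code", "game", "draw"],
    ["code", "file", "json"],
    ["code", "game"],
]

# 'code' is in every document (too common); 'read', 'draw', 'json' are in one (too rare)
vocab = build_vocab(docs, no_below=2, no_above=0.75)
print(vocab)
print(to_bow(docs[0], vocab))
```

Each document ends up as a short list of `(token_id, count)` pairs, which is the same shape as the gensim corpus built in the next cell.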

In [21]:
slf_dictionary = gensim.corpora.Dictionary(subs_2['selftext'])
slf_dictionary.filter_extremes(no_below=20, no_above=0.5)
slf_bow_corpus = [slf_dictionary.doc2bow(doc) for doc in subs_2['selftext']]

In [22]:
# Preview BOW for the first submission in the data set

slf_bow_first_sub = slf_bow_corpus[0]

for word_id, count in slf_bow_first_sub:
    print(f'Word {word_id} ("{slf_dictionary[word_id]}") appears {count} time.')

Word 0 ("anyon") appears 2 time.
Word 1 ("autom") appears 1 time.
Word 2 ("basic") appears 1 time.
Word 3 ("code") appears 1 time.
Word 4 ("creat") appears 1 time.
Word 5 ("differ") appears 1 time.
Word 6 ("display") appears 1 time.
Word 7 ("enough") appears 1 time.
Word 8 ("get") appears 1 time.
Word 9 ("keep") appears 1 time.
Word 10 ("need") appears 1 time.
Word 11 ("page") appears 2 time.
Word 12 ("previous") appears 1 time.
Word 13 ("process") appears 1 time.
Word 14 ("refresh") appears 1 time.
Word 15 ("right") appears 1 time.
Word 16 ("test") appears 1 time.
Word 17 ("ticket") appears 1 time.
Word 18 ("tri") appears 1 time.


In [23]:
# What does slf_bow_corpus actually look like?

print(slf_bow_corpus[:1])

[[(0, 2), (1, 1), (2, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1), (8, 1), (9, 1), (10, 1), (11, 2), (12, 1), (13, 1), (14, 1), (15, 1), (16, 1), (17, 1), (18, 1)]]


### Selftext Topic Model (LDA)

The bag-of-words model based on the selftext column is now ready to train a model. I have chosen to start with LDA. In addition to the corpus and the dictionary, the number of topics also needs to be provided. At this time I shall train the model for 10 topics. In the future, I may look to find a way to determine the optimum number of topics.

I will run LDA using multiple CPU cores to parallelize and speed up the model training. 

Some of the parameters involved are:

* **num_topics** is the number of topics to be extracted from the corpus.
* **id2word** is a mapping from word IDs to the actual words, in this case, our dictionary.
* **workers** is the number of worker processes to use for parallel training.
* **passes** is the number of training passes through the corpus.


In [24]:
slf_lda = gensim.models.LdaMulticore(slf_bow_corpus, num_topics=10, 
                                     id2word= slf_dictionary, passes=10, workers=2)

In [30]:
# Compute coherence score (used to judge how good a model is, higher is better)

slf_lda_coherence_model = gensim.models.CoherenceModel(model=slf_lda, texts=subs_2['selftext'], 
                                                       dictionary=slf_dictionary, coherence='c_v')
slf_lda_coherence = slf_lda_coherence_model.get_coherence()

print(f'Coherence Score: {slf_lda_coherence:.2f}')

Coherence Score: 0.56
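
To give a feel for what a coherence score measures, here is a minimal sketch of the simpler UMass-style coherence (available in gensim as `coherence='u_mass'`), not the `c_v` measure used in the cell above. It rewards topics whose top words tend to appear in the same documents. The function name `umass_coherence`, the toy documents and the choice to average over word pairs are assumptions made for the example.

```python
import math
from itertools import combinations


def umass_coherence(top_words, docs):
    """Average of log((D(earlier, later) + 1) / D(earlier)) over word pairs.

    D(...) counts the documents containing all of the given words.
    `top_words` is assumed to be ordered from most to least probable, and
    every word is assumed to occur in at least one document.
    """
    doc_sets = [set(doc) for doc in docs]

    def doc_count(*words):
        return sum(all(w in d for w in words) for d in doc_sets)

    scores = [
        math.log((doc_count(earlier, later) + 1) / doc_count(earlier))
        for earlier, later in combinations(top_words, 2)
    ]
    return sum(scores) / len(scores)


docs = [
    ["file", "json", "read"],
    ["file", "json", "write"],
    ["game", "draw", "screen"],
    ["game", "screen", "file"],
]

print(umass_coherence(["file", "json"], docs))    # words that often co-occur
print(umass_coherence(["json", "screen"], docs))  # words that never co-occur
```

As with `c_v`, a higher value indicates a more coherent topic, although the two measures are on different scales and are not directly comparable.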


In [31]:
# Visualise the topics

for idx, topic in slf_lda.print_topics(-1):
    print(f'Topic: {idx} \nWords: {topic}')
    print('\n') 

Topic: 0 
Words: 0.027*"learn" + 0.018*"would" + 0.017*"program" + 0.017*"like" + 0.016*"know" + 0.015*"want" + 0.013*"start" + 0.013*"make" + 0.012*"help" + 0.012*"work"


Topic: 1 
Words: 0.102*"print" + 0.042*"input" + 0.021*"return" + 0.020*"els" + 0.019*"number" + 0.018*"enter" + 0.018*"import" + 0.016*"elif" + 0.015*"true" + 0.015*"code"


Topic: 2 
Words: 0.017*"post" + 0.015*"websit" + 0.014*"page" + 0.014*"find" + 0.012*"make" + 0.012*"link" + 0.011*"send" + 0.011*"email" + 0.010*"use" + 0.010*"user"


Topic: 3 
Words: 0.055*"file" + 0.024*"data" + 0.017*"script" + 0.013*"use" + 0.013*"write" + 0.013*"server" + 0.012*"request" + 0.010*"need" + 0.009*"json" + 0.009*"connect"


Topic: 4 
Words: 0.057*"imag" + 0.037*"pygam" + 0.022*"color" + 0.018*"import" + 0.017*"game" + 0.017*"screen" + 0.016*"card" + 0.012*"turtl" + 0.011*"draw" + 0.011*"rect"


Topic: 5 
Words: 0.054*"video" + 0.042*"text" + 0.022*"button" + 0.021*"root" + 0.020*"column" + 0.020*"grid" + 0.019*"label" + 0.01