**In this notebook I will be doing topic modelling on both the posts and comments.**

In [1]:
import pandas as pd
pd.set_option('max_colwidth',1000)
%matplotlib inline
import re
import string
from nltk.tokenize import word_tokenize
from nltk.stem import SnowballStemmer
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.corpus import wordnet
from nltk import pos_tag
from sklearn.feature_extraction.text import CountVectorizer,TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation, NMF
from nltk.corpus import stopwords

Let's first make some functions to remove non-useful words or items in the post that are not words.

In [2]:
stopwords_remove = ["anxiety", "suicide", "suicidal", "depression", "ive", "im" "depressed", "www", "reddit", "anxious", "depression" ]
stopwords = stopwords.words("english")  + stopwords_remove


Lot's of these stopwords that I have added come from trial and error, looking at words that I don't want to be in the topics. I removed words like "anxiety" or "suicide" because I didn't want topics to cluster around types of illnesses, but just on how Reddit talks about mental health in general.

In [3]:
def text_cleaner_and_stemmer(row):
    if type(row) != str :
        row = row.decode('utf-8')
    row = re.sub('http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+',' ', row)
    regex = re.compile('[%s]' % re.escape(string.punctuation))
    row = regex.sub('', row)
    row = word_tokenize(row)
    lemmatizer = WordNetLemmatizer()
    lemmatizer_fun = lambda x: lemmatizer.lemmatize(x, get_wordnet_pos(x))
    
    lemmatizer_fun_outer = lambda x: list(map(lemmatizer_fun, x))
    return generalize_fun(row, lemmatizer_fun_outer)

def generalize_fun(corpus, lambda_fun):
# must handle a list of lists (tokenized docs) and also a simple list

    if isinstance(corpus[0], list):
        # list of lists
        corpus = map(lambda_fun, corpus)
    else:
        # single list
        corpus = lambda_fun(corpus)

    return list(corpus)

    
        

In [4]:
def get_wordnet_pos(row):
    """Convert the part-of-speech naming scheme
       from the nltk default to that which is
       recognized by the WordNet lemmatizer"""
    treebank_tag = pos_tag([row])
    if treebank_tag[0][1].startswith('J'):
        return wordnet.ADJ
        print(treebank_tag)
    elif treebank_tag[0][1].startswith('V'):
        return wordnet.VERB

    elif treebank_tag[0][1].startswith('N'):
        return wordnet.NOUN

    elif treebank_tag[0][1].startswith('R'):
        return wordnet.ADV
    else:
        return wordnet.NOUN

In [6]:
posts = pd.read_pickle("posts.pkl")

In [10]:
posts.head(2)

Unnamed: 0,_id,created,id,name,score,selftext,title,Type,total_text,split_text,text_length_simple,label
0,591dffd6f7327e5f6df05935,1460275000.0,4e3oc9,Dance_trey_dance,2,Why do I live? Why do I keep fighting? What's the point? Is there any reason for me to stick around?,But why?,depression,But why? Why do I live? Why do I keep fighting? What's the point? Is there any reason for me to stick around?,"[But, why?, Why, do, I, live?, Why, do, I, keep, fighting?, What's, the, point?, Is, there, any, reason, for, me, to, stick, around?]",23,2
1,591dffd6f7327e5f6df05937,1460275000.0,4e3o7b,EmptyShell11,2,"I don't know how I got to this point. Nearly 30 years old with no friends, no love life, no close family relationships and generally no one who would enjoy spending time with me.\nI emigrated to USA as a young child and always had trouble fitting in. I had a group of friends back in high school but I was always the outcast of the group. My interests never really did match up with theirs and try as I might, I never did feel comfortable with them. I stayed in contact with them throughout college by attending the few lunches or dinners they had once or twice a year. Over time, I have lost touch with all of them. A few friends I did manage to make in university never reach out to me. I tried getting in contact with one of them recently and was flat out ignored. \nThis brings me to where I am now. I am tired, I am lonely and I just wanted to share my sad pathetic life with everyone on the internet. I don't want to die and I'm not suicidal but I don't want to feel this way anymore. \nI h...",Just Survive Somehow,depression,"Just Survive Somehow I don't know how I got to this point. Nearly 30 years old with no friends, no love life, no close family relationships and generally no one who would enjoy spending time with me.\nI emigrated to USA as a young child and always had trouble fitting in. I had a group of friends back in high school but I was always the outcast of the group. My interests never really did match up with theirs and try as I might, I never did feel comfortable with them. I stayed in contact with them throughout college by attending the few lunches or dinners they had once or twice a year. Over time, I have lost touch with all of them. A few friends I did manage to make in university never reach out to me. I tried getting in contact with one of them recently and was flat out ignored. \nThis brings me to where I am now. I am tired, I am lonely and I just wanted to share my sad pathetic life with everyone on the internet. I don't want to die and I'm not suicidal but I don't want to feel th...","[Just, Survive, Somehow, I, don't, know, how, I, got, to, this, point., Nearly, 30, years, old, with, no, friends,, no, love, life,, no, close, family, relationships, and, generally, no, one, who, would, enjoy, spending, time, with, me.\nI, emigrated, to, USA, as, a, young, child, and, always, had, trouble, fitting, in., I, had, a, group, of, friends, back, in, high, school, but, I, was, always, the, outcast, of, the, group., My, interests, never, really, did, match, up, with, theirs, and, try, as, I, might,, I, never, did, feel, comfortable, with, them., I, stayed, in, contact, with, them, throughout, college, by, attending, ...]",211,1


I used both Count Vectorizer and TFIDF Vectorizer along wiht NMF and LDA. I tried a lot of diffrent ngram ranges and number of topics. Below are some examples. For my project/presentation I ended up using TFIDF vectorizer and NMF for posts and count vectorizer and NMF for the commnets. I also ended up using unigrams, bigrams and trigrams for posts and unigrams and bigrams for the comments. I ended up with 5 topics for posts and 4 for comments.

In [5]:
CV = CountVectorizer(ngram_range=((1,2)),  
                                   stop_words=stopwords, 
                                   analyzer = "word",
                                   tokenizer = text_cleaner_and_stemmer,
                                   token_pattern="\\b[a-z][a-z]+\\b",
                                   min_df=30, max_df=.95,
                                   lowercase=True)

TV = TfidfVectorizer(ngram_range=(1, 3), 
                    stop_words=stopwords, 
                    tokenizer = text_cleaner_and_stemmer,
                    token_pattern="\\b[a-z][a-z]+\\b", min_df=40, max_df=.95)

In [16]:
X = CV.fit_transform(posts.total_text)

In [18]:
X1 = TV.fit_transform(posts.total_text)

In [20]:
lda = LatentDirichletAllocation(n_topics=3, max_iter=20)


In [21]:
nmf = NMF(n_components = 5, max_iter=50)

In [22]:
lda.fit(X)



LatentDirichletAllocation(batch_size=128, doc_topic_prior=None,
             evaluate_every=-1, learning_decay=0.7, learning_method=None,
             learning_offset=10.0, max_doc_update_iter=100, max_iter=20,
             mean_change_tol=0.001, n_jobs=1, n_topics=3, perp_tol=0.1,
             random_state=None, topic_word_prior=None,
             total_samples=1000000.0, verbose=0)

In [23]:
nmf.fit(X1)

NMF(alpha=0.0, beta=1, eta=0.1, init=None, l1_ratio=0.0, max_iter=50,
  n_components=5, nls_max_iter=2000, random_state=None, shuffle=False,
  solver='cd', sparseness=None, tol=0.0001, verbose=0)

This below function will print out the top words for each topic in an easily readable way.

In [19]:
def print_topic_top_words(model, cv, n_top_words=10):
    feature_names = cv.get_feature_names()
    
    for topic_vec in enumerate(model.components_):
        topic_num = topic_vec[0]
        topic_words = topic_vec[1]
        
        print('Topic {}:'.format(topic_num + 1))
        
        topic_values = sorted(zip(topic_words, feature_names), 
                              reverse=True)[:n_top_words]
    
        print(' '.join([y for x,y in topic_values]))
        # pprint(topic_values)
        
    return

In [27]:
n_top_words = 20

print_topic_top_words(lda, CV, n_top_words)

Topic 1:
feel im like get think feel like go anyone thought know really attack dont panic bad something thing time cant else
Topic 2:
im dont want like feel get know life go people friend make even cant time thing think try one really
Topic 3:
go get take im work year time day help week start month back last really one job make first well


In [28]:
print_topic_top_words(nmf, TV, n_top_words)

Topic 1:
life friend people want get make one thing would say talk never time year even love go know try think
Topic 2:
take get work go day week start doctor time sleep im month medication back year help night job anyone last
Topic 3:
feel like feel like like im feel like im anyone else make something anyone else make feel really normal sometimes dont feel way thought feel way always head
Topic 4:
im dont know dont know want cant fuck go dont want anymore think scar im go know im anything die tire really even think im
Topic 5:
attack panic panic attack help breathing calm heart felt symptom get go happen something trigger thought fear first bad body disorder


Topic 1: Seems to have a lot to do with relating and issues with people.

Topic 2: Seems to have to do with medicine/doctors, but also a lot of temporal words be used (like people are telling a story).

Topic 3: Seems to have to do with relating (with uses of words like anyone) and feeling.

Topic 4: Seems to be pretty dismal. People not thinking they can move on, not feeling like they can do anything.

Topic 5: Seems to focus a lot on panic attacks, and symptoms for things.

Overall, I see a lot of describing their issues, look for similarity, talking about doctors and medication, and general describing how they feel.

**Now onto the comments.**

In [6]:
comments = pd.read_pickle("comments.pkl")

All comments were combined into one field for topic modelling.

In [8]:
comments.head()

Unnamed: 0,_id,com0,com1,com10,com100,com101,com102,com103,com104,com105,...,com93,com94,com95,com96,com97,com98,com99,id,type,total_comments
0,591dffd5f7327e5f6df05934,"That's all up to you, my friend. You can choose to end your story now or you can choose to keep going and see what comes next. We have to come up with our own reasons to stick around. As for me, I stick around partly because I'm stubborn and refuse to let the depression win and I have some people who love me who would be totally devastated. I reply to people on here so I can feel like I have a purpose when I encourage others and show a little love and care.","No. And me neither. We stick around because killing ourselves is hard and we feel bad because it feels like throwing in the towel. And because we have hope that it will get better. But what does that mean? Will we get friends? Will we have a life we want to come home to? In my case, no.",,,,,,,,...,,,,,,,,4e3oc9,depression,"That's all up to you, my friend. You can choose to end your story now or you can choose to keep going and see what comes next. We have to come up with our own reasons to stick around. As for me, I stick around partly because I'm stubborn and refuse to let the depression win and I have some people who love me who would be totally devastated. I reply to people on here so I can feel like I have a purpose when I encourage others and show a little love and care. No. And me neither. We stick around because killing ourselves is hard and we feel bad because it feels like throwing in the towel. And because we have hope that it will get better. But what does that mean? Will we get friends? Will we have a life we want to come home to? In my case, no."
1,591dffd6f7327e5f6df05936,What are your interests?,Hang in there. Many feel like you. It is much harder to maintain and build friendships outside of high school and college. Everyone seems so self absorbed or not interested in new friendships. It's especially hard for people who aren't naturally outgoing. Don't be so hard on yourself. You took the time to put yourself out here. Hopefully we can provide you support and encouragement. I understand you feeling tired and lonely. You are not alone here. Hang in there,,,,,,,,...,,,,,,,,4e3o7b,depression,What are your interests? Hang in there. Many feel like you. It is much harder to maintain and build friendships outside of high school and college. Everyone seems so self absorbed or not interested in new friendships. It's especially hard for people who aren't naturally outgoing. Don't be so hard on yourself. You took the time to put yourself out here. Hopefully we can provide you support and encouragement. I understand you feeling tired and lonely. You are not alone here. Hang in there
2,591dffd7f7327e5f6df05938,"keep trying.Your link with your creativity has weakened so you need to kick on it constantly to get it back. Dont feel down if you your drawings are shit in the beginning,once you get in the flow you will be connected back to your creativity :)","As some others have said, keep trying. Grab a pen and make it go places on paper. Watch a movie or browse through deviantArt to gain some inspiration.",,,,,,,,...,,,,,,,,4e3nmq,depression,"keep trying.Your link with your creativity has weakened so you need to kick on it constantly to get it back. Dont feel down if you your drawings are shit in the beginning,once you get in the flow you will be connected back to your creativity :) As some others have said, keep trying. Grab a pen and make it go places on paper. Watch a movie or browse through deviantArt to gain some inspiration. I have the same exact problem. Structure and a push helps me to draw to my optimum. I wish I could help, but I'm just as distraught as you about this. I just kinda... put the pen in one direction, then another and dont really think about. I seperate my mind and my hand"
3,591dffd8f7327e5f6df0593a,,,,,,,,,,...,,,,,,,,4e3n9j,depression,
4,591dffd9f7327e5f6df0593c,"The first thing you can do is be social on here. It always sucks to drink alone; I've done plenty of it myself. If it didn't screw with my blood sugar, I'd drink more. It's hard to be social in a university setting when you don't feel like you fit in or that you're not ""one of them"". I went to a little bitty school and felt like a near outcast most of the time I was there. I know now that it was my perception because I have several friends from that time who have kept up with me for nearly 20 years now.",,,,,,,,,...,,,,,,,,4e3llb,depression,"The first thing you can do is be social on here. It always sucks to drink alone; I've done plenty of it myself. If it didn't screw with my blood sugar, I'd drink more. It's hard to be social in a university setting when you don't feel like you fit in or that you're not ""one of them"". I went to a little bitty school and felt like a near outcast most of the time I was there. I know now that it was my perception because I have several friends from that time who have kept up with me for nearly 20 years now."


In [38]:
CV1 = CountVectorizer(ngram_range=((1,2)),  
                                   stop_words=stopwords, 
                                   analyzer = "word",
                                   tokenizer = text_cleaner_and_stemmer,
                                   token_pattern="\\b[a-z][a-z]+\\b",
                                   min_df=5, max_df=.95)

TV1 = TfidfVectorizer(ngram_range=(1, 3), 
                    stop_words=stopwords, 
                    tokenizer = text_cleaner_and_stemmer,
                    token_pattern="\\b[a-z][a-z]+\\b", min_df=5, max_df=.93)




In [13]:
comments = comments[comments.total_comments.notnull()]

In [39]:
X_comments = CV1.fit_transform(comments.total_comments)

In [40]:
X_comments1 = TV1.fit_transform(comments.total_comments)

In [41]:
n_topics = 3
n_iter = 30

lda_comments = LatentDirichletAllocation(n_topics=n_topics,
                                max_iter=n_iter,
                                random_state=42)

lda_comments.fit(X_comments1)



LatentDirichletAllocation(batch_size=128, doc_topic_prior=None,
             evaluate_every=-1, learning_decay=0.7, learning_method=None,
             learning_offset=10.0, max_doc_update_iter=100, max_iter=30,
             mean_change_tol=0.001, n_jobs=1, n_topics=3, perp_tol=0.1,
             random_state=42, topic_word_prior=None,
             total_samples=1000000.0, verbose=0)

In [42]:
nmf_comments = NMF(n_components = 4, max_iter=50, random_state=42)
nmf_comments.fit(X_comments)

NMF(alpha=0.0, beta=1, eta=0.1, init=None, l1_ratio=0.0, max_iter=50,
  n_components=4, nls_max_iter=2000, random_state=42, shuffle=False,
  solver='cd', sparseness=None, tol=0.0001, verbose=0)

In [45]:
print_topic_top_words(lda_comments, TV1, 20)

Topic 1:
get feel like go im help dont make thing people take know time think try well youre really work life
Topic 2:
officially medical advice valid medical remove perfectly valid remove perfectly post hasnt remove perfectly valid hasnt remove hasnt remove perfectly post hasnt best recommend doctor generally best come idea may advice online good thing keep idea may receive consider medical receive area everyones
Topic 3:
resource ton great page get helpranxietywgettinghelp helpranxietywgettinghelp wiki list crisis u international information type it’s international armor crisis tip link downloads chatroom link say u


In [46]:
print_topic_top_words(nmf_comments, CV1, 20)

Topic 1:
people dont like feel make know life think thing friend get one say go want way youre time talk someone
Topic 2:
get take help work go thing well medication time try make start also doctor good really need bad day youre
Topic 3:
im feel get like go really time year day want week feel like well know cant make last try dont bad
Topic 4:
panic attack panic attack go feel like im time thought get help heart life know day one first really year experience


Topic 1: Responses seem to involve relating to or talking to friends or other people.

Topic 2: Responses seem on the line of getting medical help.

Topic 3: This seems to be a general relating topic and about feelings.
  
Topic 4: We get some time words in here and as we as mentions about panic attacks (people describing their own experiences?).

Overall these topics are similar to the posts ones (and all center on relating or trying to help the person out).