<div class="alert" style="background-color:#fff; color:white; padding:0px 10px; border-radius:5px;"><h1 style='margin:15px 15px; color:#5d3a8e; font-size:40px'>Community Detection</h1>
</div>

***
## Stage 1:


<div class="alert alert-info" style="background-color:#5d3a8e; color:white; padding:0px 10px; border-radius:5px;"><h2 style='margin:10px 5px'> 1. Import Packages</h2>
</div>

In [2]:
# File handling
from os import listdir
from os.path import join
import string

import pandas as pd
import nltk

# Stemming and lemmatization
from nltk.stem import PorterStemmer, WordNetLemmatizer

# Stop words
# (nltk.word_tokenize and WordNetLemmatizer also need the NLTK tokenizer and WordNet data to be installed)
nltk.download('stopwords')
from nltk.corpus import stopwords
sw = set(stopwords.words("english"))

# Names of the channels
channels = ['techchap', "dave2d", "ijustine", "mkbhd", "unboxtherapy"]

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Rama\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


****

<div class="alert alert-info" style="background-color:#5d3a8e; color:white; padding:0px 10px; border-radius:5px;"><h2 style='margin:10px 5px'> 2. Tokenize words, remove Stopwords, Lemmatize, and Clean-up text </h2>
</div>

### Here we create a dataframe that contains the stemmed tokens of each video's subtitles.

In [12]:
rows = []
ps = PorterStemmer()
lemmatizer = WordNetLemmatizer()
# looping over the channels
for channel in channels:
    # subtitles directory of that channel
    subtitles_dir = "Youtube-Data/" + channel + "/" + channel + "/" + channel + "-subtitles"
    # looping over the files of the channel
    for file in listdir(subtitles_dir):
        path = join(subtitles_dir, file)
        try:
            with open(path, "r") as myfile:
                data = myfile.read().replace('\n', ' ')
            # remove all punctuation
            data = data.translate(str.maketrans('', '', string.punctuation))

            nltk_tokens = nltk.word_tokenize(data)
            cleanTokens = [x for x in nltk_tokens if x not in sw]

            # lemmatize each token as a verb, then stem the result
            stems = [ps.stem(lemmatizer.lemmatize(w, pos='v')) for w in cleanTokens]
            rows.append([file, stems])
        except Exception:
            # print the path of any file that could not be read or processed
            print(path)

# DataFrame.append was removed in pandas 2.0, so the frame is built in one step
df = pd.DataFrame(rows)



Youtube-Data/dave2d/dave2d/dave2d-subtitles\RARKNr6ru-c.txt
Youtube-Data/dave2d/dave2d/dave2d-subtitles\sMib1nMCdfc.txt
Youtube-Data/ijustine/ijustine/ijustine-subtitles\B5a_k25lSAE.txt
Youtube-Data/unboxtherapy/unboxtherapy/unboxtherapy-subtitles\0JzXL_uAHfo.txt
Youtube-Data/unboxtherapy/unboxtherapy/unboxtherapy-subtitles\F1ZT8XyxyH4.txt
Youtube-Data/unboxtherapy/unboxtherapy/unboxtherapy-subtitles\G2ZE5W97kXE.txt
Youtube-Data/unboxtherapy/unboxtherapy/unboxtherapy-subtitles\kdQM2dUHuDc.txt
Youtube-Data/unboxtherapy/unboxtherapy/unboxtherapy-subtitles\l87M93p7PC0.txt
Youtube-Data/unboxtherapy/unboxtherapy/unboxtherapy-subtitles\T1wkbDxCa6o.txt
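
The subtitles path in the cell above is assembled by string concatenation. As a minimal sketch, assuming the same `Youtube-Data/<channel>/<channel>/<channel>-subtitles` layout, the path could be built in one place with `pathlib`. The helper names `subtitles_dir` and `video_id` are hypothetical and not part of the notebook.

```python
from pathlib import Path

def subtitles_dir(channel, root="Youtube-Data"):
    """Return the subtitles folder of one channel (layout assumed from the cell above)."""
    return Path(root) / channel / channel / f"{channel}-subtitles"

def video_id(filename):
    """Drop the extension from a subtitle file name, leaving the video ID."""
    return Path(filename).stem

print(subtitles_dir("mkbhd").as_posix())
print(video_id("-7gyHZEving.txt"))
```

`Path.stem` removes only the final extension, so it stays correct if a subtitle file ever has an extension that is not exactly four characters long.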


## Function for preprocessing texts

In [13]:
def preprocess(data):
    """Strip punctuation, tokenize, drop stop words, then lemmatize and stem."""
    data = data.translate(str.maketrans('', '', string.punctuation))

    nltk_tokens = nltk.word_tokenize(data)
    cleanTokens = [x for x in nltk_tokens if x not in sw]

    ps = PorterStemmer()
    lemmatizer = WordNetLemmatizer()
    # lemmatize each token as a verb, then stem the result
    return [ps.stem(lemmatizer.lemmatize(w, pos='v')) for w in cleanTokens]
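
The stop-word test in `preprocess` is case-sensitive, and NLTK's English stop-word list is lowercase, so a capitalized token such as "So" passes the filter. The sketch below illustrates this with a tiny hand-written stop-word set, which is an assumption standing in for `sw`. `drop_stopwords` is a hypothetical helper, not part of the notebook.

```python
# Tiny stand-in for NLTK's English stop-word list (illustrative assumption)
STOP_WORDS = {"so", "this", "is", "a", "the"}

def drop_stopwords(tokens, ignore_case=False):
    """Remove stop words, optionally comparing tokens in lowercase."""
    if ignore_case:
        return [t for t in tokens if t.lower() not in STOP_WORDS]
    return [t for t in tokens if t not in STOP_WORDS]

tokens = ["So", "this", "is", "a", "great", "phone"]
print(drop_stopwords(tokens))                    # ['So', 'great', 'phone']
print(drop_stopwords(tokens, ignore_case=True))  # ['great', 'phone']
```

Whether to compare in lowercase is a design choice. It removes sentence-initial stop words, but the tokens that survive keep their original casing unless they are lowercased separately.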


In [15]:
df.columns = ["file", "stems"]
df

Unnamed: 0,file,stems
0,-7gyHZEving.txt,"[hey, guy, anton, tech, chap, new, msi, GF, 65..."
1,-cOYX11AfPc.txt,"[hey, guy, Im, tom, tech, chap, someth, littl,..."
2,-g1mHQwkpQY.txt,"[hi, guy, welcom, back, tech, chap, peopl, thi..."
3,-lCQMFC2D5Q.txt,"[oh, that, way, heavier, I, expect, actual, Im..."
4,-My0ls6Da-c.txt,"[hey, guy, im, tummi, tech, chapman, ive, get,..."
...,...,...
6483,_RU8FktAnlU.txt,"[welcom, back, favorit, seri, unbox, therapi, ..."
6484,_T3uDK90PvA.txt,"[chair, mean, din, read, newspap, still, other..."
6485,_uLIiWSqAAg.txt,"[come, check, So, guy, follow, instagram, youv..."
6486,_VSC4iGYGQA.txt,"[music, what, guy, lew, see, behind, final, ge..."


***

<div class="alert alert-info" style="background-color:#5d3a8e; color:white; padding:0px 10px; border-radius:5px;"><h2 style='margin:10px 5px'> 3. Create the Dictionary and Corpus needed for Topic Modeling</h2>
</div>

### Creating the Dictionary

In [17]:
import gensim
processed_docs = df['stems']

dictionary = gensim.corpora.Dictionary(processed_docs)


### Creating the Corpus

In [None]:
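# Keep tokens found in at least 15 documents and at most 50% of them (capped at 100000 tokens)
dictionary.filter_extremes(no_below=15, no_above=0.5, keep_n=100000)
# Bag of words: each document becomes a list of (token_id, count) pairs
bow_corpus = [dictionary.doc2bow(doc) for doc in processed_docs]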
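
To make the two calls above concrete, here is a simplified pure-Python illustration of document-frequency filtering and bag-of-words conversion. It is a sketch of the idea, not gensim's implementation, and it leaves out the `keep_n` cap. The names `filter_by_doc_freq` and `to_bow` are hypothetical, and the toy documents are made up for illustration.

```python
from collections import Counter

def filter_by_doc_freq(docs, no_below, no_above):
    """Keep tokens found in at least `no_below` docs and at most a fraction `no_above` of them."""
    # count each token once per document
    doc_freq = Counter(token for doc in docs for token in set(doc))
    max_docs = no_above * len(docs)
    kept = sorted(t for t, df in doc_freq.items() if no_below <= df <= max_docs)
    return {token: idx for idx, token in enumerate(kept)}

def to_bow(doc, token2id):
    """Count the known tokens of a document and return sorted (token_id, count) pairs."""
    counts = Counter(token2id[t] for t in doc if t in token2id)
    return sorted(counts.items())

docs = [
    ["phone", "camera", "camera"],
    ["phone", "camera", "fold"],
    ["phone", "laptop"],
    ["phone", "laptop", "game"],
]
token2id = filter_by_doc_freq(docs, no_below=2, no_above=0.5)
print(token2id)                   # {'camera': 0, 'laptop': 1}
print(to_bow(docs[0], token2id))  # [(0, 2)]
```

In this toy corpus "phone" is dropped because it occurs in every document, and "fold" and "game" are dropped because each occurs in only one.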

In [19]:
from gensim import models
# Reweight the bag-of-words counts with TF-IDF
tfidf = models.TfidfModel(bow_corpus)
corpus_tfidf = tfidf[bow_corpus]
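
As a rough sketch of what the TF-IDF step computes, the example below assumes the weighting that is usually described as gensim's default: the term count multiplied by log2(N / df), followed by scaling each document vector to unit length. `tfidf_vector` is a hypothetical helper and the document frequencies are toy values.

```python
import math

def tfidf_vector(bow, doc_freq, num_docs):
    """Weight (token_id, count) pairs by log2(num_docs / df) and L2-normalize the result."""
    weighted = [(tid, count * math.log2(num_docs / doc_freq[tid])) for tid, count in bow]
    # in this sketch, tokens present in every document get weight 0 and are dropped
    weighted = [(tid, w) for tid, w in weighted if w > 0]
    norm = math.sqrt(sum(w * w for _, w in weighted))
    return [(tid, w / norm) for tid, w in weighted] if norm else []

doc_freq = {0: 2, 1: 1}  # token_id -> number of documents containing it (toy values)
bow = [(0, 1), (1, 1)]
print(tfidf_vector(bow, doc_freq, num_docs=4))
```

Token 1 occurs in fewer documents than token 0, so it receives the larger weight even though both appear once in this document.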

<div class="alert alert-info" style="background-color:#5d3a8e; color:white; padding:0px 10px; border-radius:5px;"><h2 style='margin:10px 5px'> 4. Building the Topic Model</h2>
</div>

In [None]:
# Train a 30-topic LDA model on the TF-IDF corpus, using 2 passes and 4 worker processes
lda_model_tfidf = gensim.models.LdaMulticore(corpus_tfidf, num_topics=30, id2word=dictionary, passes=2, workers=4)

<div class="alert alert-info" style="background-color:#5d3a8e; color:white; padding:0px 10px; border-radius:5px;"><h2 style='margin:10px 5px'> 5. View the topics in LDA model</h2>
</div>

In [31]:
for idx, topic in lda_model_tfidf.print_topics(-1):
    print('Topic: {} Word: {}'.format(idx, topic))


Topic: 0 Word: 0.005*"phone" + 0.003*"camera" + 0.003*"iphon" + 0.002*"galaxi" + 0.002*"ipad" + 0.002*"android" + 0.002*"display" + 0.002*"screen" + 0.002*"pixel" + 0.002*"nexu"
Topic: 1 Word: 0.004*"So" + 0.003*"thi" + 0.002*"and" + 0.002*"it" + 0.002*"iphon" + 0.002*"Oh" + 0.002*"laptop" + 0.002*"music" + 0.002*"the" + 0.002*"you"
Topic: 2 Word: 0.003*"So" + 0.002*"googl" + 0.002*"oh" + 0.002*"game" + 0.002*"thi" + 0.002*"Oh" + 0.002*"phone" + 0.002*"emoji" + 0.002*"he" + 0.002*"app"
Topic: 3 Word: 0.003*"So" + 0.002*"phone" + 0.002*"car" + 0.002*"it" + 0.002*"question" + 0.002*"thi" + 0.002*"iphon" + 0.002*"danc" + 0.002*"Oh" + 0.002*"and"
Topic: 4 Word: 0.005*"So" + 0.004*"thi" + 0.003*"and" + 0.003*"Oh" + 0.003*"it" + 0.002*"phone" + 0.002*"you" + 0.002*"camera" + 0.002*"iphon" + 0.002*"the"
Topic: 5 Word: 0.003*"laptop" + 0.003*"phone" + 0.002*"game" + 0.002*"camera" + 0.002*"So" + 0.002*"batteri" + 0.002*"pixel" + 0.002*"screen" + 0.001*"devic" + 0.001*"appl"
Topic: 6 Word: 0.00

### Predicting the topic of a document

In [65]:
unseen_document = '''The phones are great but I think they should have made the prices lower, to compete with apple.( coming from a Samsung fanboy)'''
# Note: the model was trained on TF-IDF weights, while this query uses raw bag-of-words counts
bow_vector = dictionary.doc2bow(preprocess(unseen_document))
# Print the topics from most to least probable, with the top 5 words of each
for index, score in sorted(lda_model_tfidf[bow_vector], key=lambda tup: tup[1], reverse=True):
    print("Score: {}\t Topic: {}".format(score, lda_model_tfidf.print_topic(index, 5)))


Score: 0.7007796764373779	 Topic: 0.004*"phone" + 0.003*"camera" + 0.003*"iphon" + 0.002*"screen" + 0.002*"fold"
Score: 0.19530367851257324	 Topic: 0.004*"So" + 0.003*"and" + 0.003*"phone" + 0.002*"thi" + 0.002*"headphon"
Score: 0.0037113104481250048	 Topic: 0.005*"phone" + 0.003*"camera" + 0.003*"iphon" + 0.002*"galaxi" + 0.002*"ipad"
Score: 0.0037113104481250048	 Topic: 0.004*"So" + 0.003*"thi" + 0.002*"and" + 0.002*"it" + 0.002*"iphon"
Score: 0.0037113104481250048	 Topic: 0.003*"So" + 0.002*"googl" + 0.002*"oh" + 0.002*"game" + 0.002*"thi"
Score: 0.0037113104481250048	 Topic: 0.003*"So" + 0.002*"phone" + 0.002*"car" + 0.002*"it" + 0.002*"question"
Score: 0.0037113104481250048	 Topic: 0.005*"So" + 0.004*"thi" + 0.003*"and" + 0.003*"Oh" + 0.003*"it"
Score: 0.0037113104481250048	 Topic: 0.003*"laptop" + 0.003*"phone" + 0.002*"game" + 0.002*"camera" + 0.002*"So"
Score: 0.0037113104481250048	 Topic: 0.003*"iphon" + 0.002*"phone" + 0.002*"devic" + 0.002*"laptop" + 0.002*"app"
Score: 0.003

### Saving the model

In [25]:
# gensim.test.utils.datapath is meant for gensim's bundled test data, so plain paths are used here

# Save the model to disk.
model_path = "C:/Users/Rama/nlp-datamining/communityDetection/LDAmodel"
lda_model_tfidf.save(model_path)

# Save the dictionary to disk.
dictionary.save("C:/Users/Rama/nlp-datamining/communityDetection/dictionary")

<div class="alert alert-info" style="background-color:#5d3a8e; color:white; padding:0px 10px; border-radius:5px;"><h2 style='margin:10px 5px'> 6. Finding the dominant topic for each video's subtitles</h2>
</div>


In [86]:
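# Most probable (topic_id, probability) pair for each document
df["topic"] = df['stems'].apply(lambda processedDoc:
    max(lda_model_tfidf[dictionary.doc2bow(processedDoc)], key=lambda tup: tup[1])
)
df["topicID"] = df["topic"].apply(lambda x: x[0])
df["confidence"] = df["topic"].apply(lambda x: x[1])
# the positional axis argument of drop was removed in pandas 2.0
df = df.drop(columns=["topic"])
# Strip the ".txt" extension so that only the video ID remains
df["file"] = df["file"].apply(lambda x: x[:-4])
df.head()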

Unnamed: 0,file,stems,topicID,confidence
0,-7gyHZEving,"[hey, guy, anton, tech, chap, new, msi, GF, 65...",26,0.535545
1,-cOYX11AfPc,"[hey, guy, Im, tom, tech, chap, someth, littl,...",14,0.421258
2,-g1mHQwkpQY,"[hi, guy, welcom, back, tech, chap, peopl, thi...",26,0.535237
3,-lCQMFC2D5Q,"[oh, that, way, heavier, I, expect, actual, Im...",26,0.594496
4,-My0ls6Da-c,"[hey, guy, im, tummi, tech, chapman, ive, get,...",26,0.454431
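
For each document the model returns a list of `(topic_id, probability)` pairs, and the cell above keeps the pair with the highest probability. Here is a minimal sketch of that selection on hand-written pairs (made-up numbers, not model output). `dominant_topic` is a hypothetical helper, and returning `None` for an empty list is an assumption about how such a case might be handled.

```python
from operator import itemgetter

def dominant_topic(topic_scores):
    """Return the (topic_id, probability) pair with the highest probability, or None if empty."""
    if not topic_scores:
        return None
    return max(topic_scores, key=itemgetter(1))

scores = [(0, 0.12), (14, 0.31), (26, 0.54)]
topic_id, confidence = dominant_topic(scores)
print(topic_id, confidence)  # 26 0.54
```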


In [90]:
df = df.rename({"file":"videoId"}, axis='columns')

### Exporting the dataframe to a CSV file

In [92]:
df.to_csv("videosAndTopics.csv")