#### Overview: 
*** 
The goal is to extract topics from course syllabus documents. A total of 21 course syllabi are used.

We perform topic modelling using Latent Dirichlet Allocation (LDA) and Non-negative Matrix Factorization (NMF).

Bigram and trigram topic modelling is also done using LDA.

The pyLDAvis library is used for visualisation.

We look at the topics, the top words within each topic and the top subjects under each topic. We also look at the topic distribution within each subject document, since each syllabus consists of a combination of topics.

Library used for topic modelling: scikit-learn, as it gave better results on our dataset than Gensim and the other libraries we tried.

What is Topic Modelling?
- Topic modelling is a statistical technique for discovering hidden semantic patterns in an unstructured collection of documents. A large collection of documents is represented in terms of topics, and topics are represented in terms of words. This top-down approach helps expose hidden insights from the corpus. In this approach, every document is a distribution over topics and every topic is a distribution over words. The extracted topics are collections of similar words. The intuition behind topic modelling rests on a mathematical framework based on the probability and statistics of words in each topic.
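
To make the "distribution over topics, distribution over words" idea concrete, here is a minimal sketch with two made-up topics and a four-word vocabulary. All numbers and names are illustrative assumptions, not values from the syllabus data.

```python
# Toy illustration: 2 topics over a 4-word vocabulary (all numbers are made up).
vocab = ["exam", "homework", "painting", "museum"]

# Each topic is a probability distribution over words.
topic_word = [
    [0.50, 0.40, 0.05, 0.05],  # topic 0: coursework-like words
    [0.05, 0.05, 0.50, 0.40],  # topic 1: art-like words
]

# Each document is a probability distribution over topics.
doc_topic = [0.25, 0.75]

# Word distribution implied for the document:
# P(word | doc) = sum over topics k of P(topic k | doc) * P(word | topic k)
doc_word = [
    sum(doc_topic[k] * topic_word[k][w] for k in range(len(doc_topic)))
    for w in range(len(vocab))
]

for word, p in zip(vocab, doc_word):
    print(f"{word}: {p:.4f}")
```

A topic model works in the opposite direction: it only observes the words in each document and tries to recover plausible `topic_word` and `doc_topic` tables.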

What is Non-negative matrix factorization (NMF)?
- Non-negative Matrix Factorization is a matrix-decomposition method that reduces the dimensionality of the input corpus. It approximates the document-term matrix as the product of two smaller non-negative matrices and, much like factor analysis, gives comparatively less weight to words that have less coherence.

Some Important points about NMF:
1. It belongs to the family of linear algebra algorithms that are used to identify the latent or hidden structure present in the data.

2. It factorizes a non-negative matrix into the product of two smaller non-negative matrices.

3. It can also be applied for topic modelling, where the input is the term-document matrix, typically TF-IDF normalized.

4. NMF has become so popular because of its ability to automatically extract sparse and easily interpretable factors.
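
The factorization idea can be sketched in a few lines of NumPy. This is an illustrative implementation using the classic multiplicative update rules, not the solver scikit-learn's `NMF` uses by default (that is coordinate descent); the function name `nmf_multiplicative` and the toy matrix are assumptions made for this example.

```python
import numpy as np

def nmf_multiplicative(V, n_components, n_iter=200, seed=0, eps=1e-9):
    """Approximate a non-negative matrix V (docs x terms) as W @ H,
    with W (docs x topics) and H (topics x terms) both non-negative."""
    rng = np.random.default_rng(seed)
    n_docs, n_terms = V.shape
    W = rng.random((n_docs, n_components)) + eps
    H = rng.random((n_components, n_terms)) + eps
    for _ in range(n_iter):
        # Multiplicative updates keep every entry non-negative
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Toy document-term matrix: two documents use terms 0-1, two use terms 2-3
V = np.array([
    [3.0, 2.0, 0.0, 0.0],
    [2.0, 3.0, 0.0, 0.0],
    [0.0, 0.0, 4.0, 1.0],
    [0.0, 0.0, 1.0, 4.0],
])

W, H = nmf_multiplicative(V, n_components=2)
print("W (document-topic weights):\n", np.round(W, 2))
print("H (topic-term weights):\n", np.round(H, 2))
print("Reconstruction W @ H:\n", np.round(W @ H, 2))
```

Rows of `H` play the role of topics (weights over terms) and rows of `W` say how strongly each document uses each topic.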

What is Latent Dirichlet Allocation (LDA)?
- Latent Dirichlet Allocation (LDA) is a popular topic modelling technique for extracting topics from a given corpus. The term latent describes something that exists but is not yet visible; in other words, latent means hidden or concealed.

- The topics that we want to extract from the data are also "hidden topics" that are yet to be discovered. Hence the term "latent" in LDA. The "Dirichlet allocation" part is named after the Dirichlet distribution.

- The Dirichlet distribution is a distribution over distributions: each draw from it is itself a probability distribution, that is, a vector of non-negative proportions that sum to 1. In LDA, such draws describe the topic mixture of each document and the word mixture of each topic. The Dirichlet process extends the same idea to an unbounded number of components.
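
As a small illustration of "each draw is itself a distribution", the sketch below samples from a Dirichlet distribution by normalizing independent Gamma draws, which is a standard construction. The helper name `sample_dirichlet` and the concentration values are assumptions chosen for the example.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one sample from Dirichlet(alpha) by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

rng = random.Random(0)

# Small concentration values tend to give mixtures dominated by one topic,
# large values tend to give mixtures close to uniform.
peaked_mix = sample_dirichlet([0.5, 0.5, 0.5], rng)
even_mix = sample_dirichlet([10.0, 10.0, 10.0], rng)

print("alpha = 0.5 :", [round(p, 3) for p in peaked_mix])
print("alpha = 10.0:", [round(p, 3) for p in even_mix])
```

Every sample is a valid topic mixture: its entries are non-negative and sum to 1.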
***

### Importing Libraries

In [63]:
#@title Option to display the Subject Topic Combination table in a single line.

pd.set_option('expand_frame_repr', False)

In [1]:
#@title Visualization library

pip install pyldavis==2.1.2



In [33]:
#@title Importing all the important libraries

import numpy as np
import pandas as pd
import re, nltk, spacy, gensim
# Sklearn
from sklearn.decomposition import LatentDirichletAllocation, TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from pprint import pprint
# Plotting tools
import pyLDAvis
import pyLDAvis.sklearn
import matplotlib.pyplot as plt
%matplotlib inline

from plotly.offline import plot
import plotly.graph_objects as go
import plotly.express as px


Using or importing the ABCs from 'collections' instead of from 'collections.abc' is deprecated since Python 3.3,and in 3.9 it will stop working



Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations




In [3]:
#@title Connecting to Google Drive for the syllabus documents folder 

from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [4]:
#@title Getting the filenames from the folder

import glob
path = "/content/drive/MyDrive/syllabus/*.*"

fileNames = []

for file in glob.glob(path):
   fileNames.append(file)

fileNames

['/content/drive/MyDrive/syllabus/100_Syll_Policies_F20.pdf',
 '/content/drive/MyDrive/syllabus/ENVS_101_Syllabus_Fall__2020.pdf',
 '/content/drive/MyDrive/syllabus/CSCI_130SyllabusSpring2020.pdf',
 '/content/drive/MyDrive/syllabus/ART_125__Syllabus_Spring_2021.pdf',
 '/content/drive/MyDrive/syllabus/Syllabus_SP21_BUS_100.pdf',
 '/content/drive/MyDrive/syllabus/1_210117BIOS100PhamSp21.pdf',
 '/content/drive/MyDrive/syllabus/CBL_101_Syllabus_GarriganSpring2020.pdf',
 '/content/drive/MyDrive/syllabus/101_syll19_chang.pdf',
 '/content/drive/MyDrive/syllabus/100_syllabus_2019.pdf',
 '/content/drive/MyDrive/syllabus/2020_F_HESM_270_Lifetime_wellness__Syllabus__v1_0.pdf',
 '/content/drive/MyDrive/syllabus/HIS_101_Syllabus_Fall_2020.pdf',
 '/content/drive/MyDrive/syllabus/Syllabus_PSYC_101__001__Carlstrom_Spring_2019.pdf',
 '/content/drive/MyDrive/syllabus/CSCI_241_syllabusFall20.pdf',
 '/content/drive/MyDrive/syllabus/ANTH201_Syllabus_2020.pdf',
 '/content/drive/MyDrive/syllabus/Syllabus_105

In [5]:
#@title Installing the PyMuPDF library for reading PDF files. Very simple and easy to use.

pip install PyMuPDF



In [6]:
#@title Reading each PDF and collecting its text. Creates a list of strings, one per document.

import fitz # install using: pip install PyMuPDF
documents = []

for file in fileNames:
  with fitz.open(file) as doc:
      text = ""
      for page in doc:
          text += page.get_text()

  documents.append(text)
# print(text)
documents

['1 | P a g e  \n \nUniversity of Wisconsin-Parkside, Fall 2020 \n \n \nANTH 100 003 Online \nINTRODUCTION TO ANTHROPOLOGY \nCOURSE POLICIES \nInstructor: \nKathleen Gillogly  \n \nClass Time: \nOnline All the Way! \nEmail: \n \ngillogly@uwp.edu \n \nClass Room: \nYour computer contains ALL the things!  \nPhone:  \n262-595-2147 \n \n \nOffice:  \nMOLN 219 \nDept. Phone: \n262-595-2177 \n \n \nDept. Office: \nMOLN 319 \nOffice Hours:  \nM/W, 11:30 AM-1:00 PM or by appointment (or online!) \nI will respond to emails within 24 hours (M-F) – even if only to acknowledge receipt until I can write a more \ncomprehensive response.  \nText (Required):  \nKottak, C. Anthropology: Appreciating Human Diversity, 18th.  McGraw-Hill. 2019.  \n \nTABLE OF CONTENTS \nANTH 100 003 ONLINE .................................................................................................... 1 \nINTRODUCTION TO ANTHROPOLOGY ....................................................... 1 \nMasks & Social Distancing

In [7]:
#@title Manually entering the subject names since the syllabi do not state them in a consistent form

Subjects = ['ANTH 100', 'ENVS 101', 'CSCI130', 'Art 125', 'BUS 100', 'BIOS 100 Spring 2021', 'CBL 101',
            'CHEM 101 Fall 2019', 'CHEM 100', 'HESM 207', 'HIS 101', 'PYSC 101','CSCI 241', 'ANTH 201', 'CSCI 105', 'Art 103',
            'BIOS 100 Spring 2020', 'CHEM 101 Fall 2020', 'COMM 107', 'Literature', 'CRMJ 101']

In [8]:
len(documents)

21

### DataFrame Creation and Pre-Processing

In [9]:
#@title Punctuation remover function

import re

def punctuation_stripper(data):

  # Remove emails first: once punctuation is stripped the '@' is gone and this pattern can no longer match
  data = re.sub(r'\S*@\S*\s?', '', data)
  # Remove punctuation (this also removes single quotes)
  data = re.sub(r'[^\w\s]', '', data)
  # Collapse new line characters and repeated whitespace into single spaces
  data = re.sub(r'\s+', ' ', data)
  return data
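
The order of these substitutions matters. A minimal sketch, using hypothetical helper names and a sample line taken from the first syllabus, shows what happens when punctuation is stripped before or after the email pattern runs:

```python
import re

EMAIL = re.compile(r"\S*@\S*\s?")
PUNCT = re.compile(r"[^\w\s]")

def strip_punct_then_email(text):
    # Punctuation goes first, so '@' and '.' are already gone when the email pattern runs
    return EMAIL.sub("", PUNCT.sub("", text))

def strip_email_then_punct(text):
    # The email pattern still sees the '@', so the whole address is removed
    return PUNCT.sub("", EMAIL.sub("", text))

sample = "Email: gillogly@uwp.edu Office: MOLN 219"
print(strip_punct_then_email(sample))  # address survives as a fused token
print(strip_email_then_punct(sample))  # address is removed
```

Removing emails first avoids leaving fused tokens such as `gilloglyuwpedu` in the vocabulary.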

In [10]:
#@title Convert to a DataFrame with one row per subject.
import pandas as pd

# Pair every manually entered subject label with its syllabus text
dataset_df = pd.DataFrame(list(zip(Subjects, documents)), columns =['Subject', 'Syllabus'])

In [11]:
# Clean every syllabus; apply() avoids hard-coding the number of documents
dataset_df['Syllabus'] = dataset_df['Syllabus'].apply(punctuation_stripper)

In [12]:
#@title Preprocessing functions and steps such as tokenization, lemmatization, lower case transformation and more.

import string

#defining the function to remove punctuation
def remove_punctuation(text):
    punctuationfree="".join([i for i in text if i not in string.punctuation])
    return punctuationfree

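#storing the punctuation free text
dataset_df['Syllabus']= dataset_df['Syllabus'].apply(lambda x:remove_punctuation(x))
# data.head()

# Convert to lowercase
dataset_df['Syllabus'] = dataset_df['Syllabus'].apply(lambda x: x.lower())

#defining function for tokenization
import re
def tokenization(text):
    # NOTE: 'W+' (no backslash) matches literal 'W' characters, not non-word characters.
    # On lowercased text it never matches, so this returns a one-element list holding the
    # whole document, and the stop-word removal and lemmatization below see one long
    # string per document rather than individual words.
    # Switching to r'\W+' yields real tokens, which then need to be joined back into
    # a string before vectorizing.
    tokens = re.split('W+',text)
    return tokens
#applying function to the column
dataset_df['Syllabus'] = dataset_df['Syllabus'].apply(lambda x: tokenization(x))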

#importing nlp library
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords

#Stop words present in the library
stopwords = stopwords.words('english')
# stopwords[0:10]

#defining the function to remove stopwords from tokenized text
def remove_stopwords(text):
    output= [i for i in text if i not in stopwords]
    return output


#applying the function
dataset_df['Syllabus'] = dataset_df['Syllabus'].apply(lambda x:remove_stopwords(x))

# from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')

from nltk.stem import WordNetLemmatizer 

#defining the object for Lemmatization
wordnet_lemmatizer = WordNetLemmatizer()

#defining the function for lemmatization
def lemmatizer(text):
  lemm_text = [wordnet_lemmatizer.lemmatize(word) for word in text]
  return lemm_text

dataset_df['Syllabus'] = dataset_df['Syllabus'].apply(lambda x:lemmatizer(x))


# dataset_df['Syllabus'] = str(dataset_df['Syllabus'])
dataset_df

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


Unnamed: 0,Subject,Syllabus
0,ANTH 100,[1 p a g e university of wisconsinparkside fal...
1,ENVS 101,[envs 101 introduction to environmental studie...
2,CSCI130,[csci 130 introduction to programming spring 2...
3,Art 125,[art 125 survey of world art spring 2021 sylla...
4,BUS 100,[introduction to businessbus 100 tuesdaythursd...
5,BIOS 100 Spring 2021,[1 of 6 bios 100 nature of life spring 2021 co...
6,CBL 101,[1 cbl 101 introduction to community engagemen...
7,CHEM 101 Fall 2019,[chem 101 general chemistry i tentative fall s...
8,CHEM 100,[chemistry 100 the world of chemistry spring s...
9,HESM 207,[ 1 university of wisconsin parkside departmen...


In [13]:
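# Turn each token list back into a single string for the vectorizers.
# " ".join works whether a list holds one long string or many tokens.
dataset_df['Syllabus'] = dataset_df['Syllabus'].apply(" ".join)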

In [14]:
dataset_df

Unnamed: 0,Subject,Syllabus
0,ANTH 100,1 p a g e university of wisconsinparkside fall...
1,ENVS 101,envs 101 introduction to environmental studies...
2,CSCI130,csci 130 introduction to programming spring 20...
3,Art 125,art 125 survey of world art spring 2021 syllab...
4,BUS 100,introduction to businessbus 100 tuesdaythursda...
5,BIOS 100 Spring 2021,1 of 6 bios 100 nature of life spring 2021 cou...
6,CBL 101,1 cbl 101 introduction to community engagement...
7,CHEM 101 Fall 2019,chem 101 general chemistry i tentative fall se...
8,CHEM 100,chemistry 100 the world of chemistry spring se...
9,HESM 207,1 university of wisconsin parkside department...


### EDA

In [34]:
#@title Top 30 Unigram words in the documents
def get_top_n_words(corpus, n=None):
    vec = CountVectorizer(stop_words='english').fit(corpus)
    bag_of_words = vec.transform(corpus)
    sum_words = bag_of_words.sum(axis=0) 
    words_freq = [(word, sum_words[0, idx]) for word, idx in vec.vocabulary_.items()]
    words_freq =sorted(words_freq, key = lambda x: x[1], reverse=True)
    return words_freq[:n]

common_words = get_top_n_words(dataset_df['Syllabus'], 30)
df2 = pd.DataFrame(common_words, columns = ['unigram' , 'count'])

fig = go.Figure([go.Bar(x=df2['unigram'], y=df2['count'])])
fig.update_layout(title=go.layout.Title(text="Top 30 unigrams in the syllabus text after removing stop words and lemmatization"))
fig.show()

In [36]:
#@title Top 20 Bigram words in the documents

def get_top_n_words(corpus, n=None):
    vec = CountVectorizer(ngram_range = (2, 2), stop_words='english').fit(corpus)
    bag_of_words = vec.transform(corpus)
    sum_words = bag_of_words.sum(axis=0) 
    words_freq = [(word, sum_words[0, idx]) for word, idx in vec.vocabulary_.items()]
    words_freq =sorted(words_freq, key = lambda x: x[1], reverse=True)
    return words_freq[:n]

common_words = get_top_n_words(dataset_df['Syllabus'], 20)
df2 = pd.DataFrame(common_words, columns = ['bigram' , 'count'])

fig = go.Figure([go.Bar(x=df2['bigram'], y=df2['count'])])
fig.update_layout(title=go.layout.Title(text="Top 20 bigrams in the syllabus text after removing stop words and lemmatization"))
fig.show()

In [37]:
#@title Top 20 Trigram words in the documents

def get_top_n_words(corpus, n=None):
    vec = CountVectorizer(ngram_range = (3, 3), stop_words='english').fit(corpus)
    bag_of_words = vec.transform(corpus)
    sum_words = bag_of_words.sum(axis=0) 
    words_freq = [(word, sum_words[0, idx]) for word, idx in vec.vocabulary_.items()]
    words_freq =sorted(words_freq, key = lambda x: x[1], reverse=True)
    return words_freq[:n]

common_words = get_top_n_words(dataset_df['Syllabus'], 20)
df2 = pd.DataFrame(common_words, columns = ['trigram' , 'count'])

fig = go.Figure([go.Bar(x=df2['trigram'], y=df2['count'])])
fig.update_layout(title=go.layout.Title(text="Top 20 trigrams in the syllabus text after removing stop words and lemmatization"))
fig.show()
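
For intuition about what the n-gram counts above measure, here is a minimal standard-library sketch that counts n-grams in a token list. The helper `top_ngrams` and the sample sentence are assumptions for illustration; unlike `CountVectorizer`, it does no stop-word removal.

```python
from collections import Counter

def top_ngrams(tokens, n, k):
    """Return the k most frequent n-grams (as space-joined strings) in a token list."""
    # zip over n shifted copies of the list yields every run of n consecutive tokens
    grams = zip(*(tokens[i:] for i in range(n)))
    return Counter(" ".join(g) for g in grams).most_common(k)

tokens = "general education course general education requirements final course grade".split()

print(top_ngrams(tokens, 2, 1))  # most frequent bigram
print(top_ngrams(tokens, 3, 1))  # most frequent trigram (all occur once here)
```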

### Topic Modelling

Manually set the following parameters for the topic modelling procedure.

no_topics = Total topics within the documents

no_top_words = Number of top words within every topic

no_top_documents = Number of top documents for every topic

In [15]:
no_topics =  10#@param {type:"integer"}

no_top_words =  15#@param {type:"integer"}

no_top_documents =  5#@param {type:"integer"}

from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import NMF, LatentDirichletAllocation
import numpy as np

In [58]:
#@title Functions to display the results, NMF and LDA topic modelling, Visualization

# Results display function
def display_topics(H, W, feature_names, documents, no_top_words, no_top_documents):
    # H holds the topic-term weights, W the document-topic weights
    for topic_idx, topic in enumerate(H):
        print("Topic %d:" % (topic_idx))
        # Top words for this topic, highest weight first
        print(" ".join([ (feature_names[i] + " (" + str(topic[i].round(2)) + ")")
          for i in topic.argsort()[:-no_top_words - 1:-1]]))
        # Documents with the largest weight on this topic
        top_doc_indices = np.argsort( W[:,topic_idx] )[::-1][0:no_top_documents]
        for doc_index in top_doc_indices:
            # Print the subject label rather than the full syllabus text
            print(dataset_df.loc[doc_index, 'Subject'])

# NMF topic modelling function
def nmf_executor(ngram_range = (1,1)):
  # NMF is able to use tf-idf
  tfidf_vectorizer = TfidfVectorizer(ngram_range = ngram_range, max_df=0.95, min_df=2, stop_words='english')
  docs_list = list(dataset_df['Syllabus'])
  tfidf = tfidf_vectorizer.fit_transform(docs_list)
  # get_feature_names() is deprecated since scikit-learn 1.0; get_feature_names_out() replaces it
  tfidf_feature_names = tfidf_vectorizer.get_feature_names_out()

  # Run NMF
  nmf_model = NMF(n_components=no_topics, random_state=1, alpha=.1, l1_ratio=.5, init='nndsvd').fit(tfidf)
  nmf_W = nmf_model.transform(tfidf)   # document-topic weights
  nmf_H = nmf_model.components_        # topic-term weights

  print("NMF Topics")
  display_topics(nmf_H, nmf_W, tfidf_feature_names, 
                docs_list, 
                no_top_words, no_top_documents)

  print("--------------")
  print("SUBJECT TOPIC DISTRIBUTION\n")
  # Build the column labels from no_topics so the table still works when the parameter changes
  topic_columns = ['Topic %d' % i for i in range(no_topics)]
  df = pd.DataFrame(nmf_W, index = Subjects, columns = topic_columns)

  print(df)
  print("\n--------------")

  return nmf_model, tfidf, tfidf_vectorizer

# LDA topic modelling function
def lda_executor(ngram_range = (1, 1)):

  # LDA uses raw term counts rather than tf-idf because it is a probabilistic graphical model
  tf_vectorizer = CountVectorizer(ngram_range = ngram_range, max_df=0.95, min_df=2, stop_words='english')
  docs_list = list(dataset_df['Syllabus'])

  tf = tf_vectorizer.fit_transform(docs_list)
  # get_feature_names() is deprecated since scikit-learn 1.0; get_feature_names_out() replaces it
  tf_feature_names = tf_vectorizer.get_feature_names_out()

  # Run LDA
  lda_model = LatentDirichletAllocation(n_components=no_topics, max_iter=5, learning_method='online', learning_offset=50.,random_state=0).fit(tf)
  lda_W = lda_model.transform(tf)   # document-topic distribution
  lda_H = lda_model.components_     # topic-term weights

  print("LDA Topics")
  display_topics(lda_H, lda_W, tf_feature_names, docs_list, no_top_words, no_top_documents)
  print("--------------")
  print("SUBJECT TOPIC DISTRIBUTION\n")
  # Build the column labels from no_topics so the table still works when the parameter changes
  topic_columns = ['Topic %d' % i for i in range(no_topics)]
  df = pd.DataFrame(lda_W, index = Subjects, columns = topic_columns)

  print(df)
  print("\n--------------")
  return lda_model, tf, tf_vectorizer

# Visualization was not executing when done through a function, hence manually coded for every instance.
import pyLDAvis.sklearn
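
The top words of a topic are selected in `display_topics` with the slice `argsort()[:-no_top_words - 1:-1]`. The sketch below, with made-up weights for a single topic, shows how that idiom returns the indices of the largest weights in descending order:

```python
import numpy as np

feature_names = np.array(["exam", "lab", "points", "chemistry", "quiz"])
topic = np.array([0.12, 0.05, 0.15, 0.20, 0.01])  # made-up weights for one topic

no_top_words = 3

# argsort() orders indices from smallest to largest weight;
# the slice walks backwards from the end and stops after no_top_words entries
top_idx = topic.argsort()[:-no_top_words - 1:-1]

print(top_idx.tolist())
print(feature_names[top_idx].tolist())
```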

### 1 Gram Topic Modelling

In [59]:
#@title Topic Modelling with NMF - 1 gram

nmf_model, tfidf, vectorizer = nmf_executor()


Function get_feature_names is deprecated; get_feature_names is deprecated in 1.0 and will be removed in 1.2. Please use get_feature_names_out instead.



`alpha` was deprecated in version 1.0 and will be removed in 1.2. Use `alpha_W` and `alpha_H` instead




NMF Topics
Topic 0:
points (0.15) exam (0.12) academic (0.1) canvas (0.1) discussion (0.1) community (0.09) university (0.09) activities (0.08) final (0.08) writing (0.07) project (0.07) exams (0.07) report (0.06) exercise (0.06) personal (0.06)
HESM 207
ANTH 100
HIS 101
COMM 107
CBL 101
Topic 1:
homework (0.29) chemistry (0.2) hat (0.18) exam (0.17) quiz (0.15) chemical (0.15) chem (0.14) scientific (0.13) pts (0.12) chapter (0.12) ch (0.12) questions (0.09) 101 (0.08) exams (0.08) systems (0.07)
CHEM 101 Fall 2020
CHEM 101 Fall 2019
CHEM 100
BIOS 100 Spring 2020
BIOS 100 Spring 2021
Topic 2:
ch (0.78) environmental (0.15) concept (0.12) exam (0.1) oct (0.09) science (0.09) sept (0.08) nov (0.08) week (0.08) pages (0.08) connection (0.06) 10 (0.06) lab (0.05) 20 (0.05) pts (0.04)
PYSC 101
ENVS 101
CSCI 241
CHEM 100
CHEM 101 Fall 2019
Topic 3:
programming (0.28) computer (0.18) lab (0.16) problem (0.12) program (0.12) outcome (0.12) computers (0.1) electronically (0.09) solve (0.08) te

In [32]:
#@title Visualise 1 gram NMF with pyLDAVis

pyLDAvis.enable_notebook()

pyLDAvis_data = pyLDAvis.sklearn.prepare(nmf_model, tfidf, vectorizer)
# Visualization can be displayed in the notebook
pyLDAvis.display(pyLDAvis_data)






NMF shows good results; however, at times the topics are incoherent. Results were not consistent for n-gram topic modelling with NMF, so NMF is used here only for 1-gram topic modelling.

We can see the different topics and the topic distribution for the documents too.
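
One caveat when reading the NMF table: the rows of the NMF document-topic matrix are non-negative weights that do not sum to 1, whereas scikit-learn's LDA `transform` returns a normalized distribution per document. A minimal sketch, with a hypothetical row of NMF weights and a hypothetical helper name, of rescaling a row so the two are easier to compare:

```python
def to_proportions(weights):
    """Scale a row of non-negative topic weights so that it sums to 1."""
    total = sum(weights)
    if total == 0:
        # A document with no weight on any topic stays all zeros
        return [0.0 for _ in weights]
    return [w / total for w in weights]

nmf_row = [0.02, 0.30, 0.08, 0.0]  # hypothetical NMF weights for one syllabus
proportions = to_proportions(nmf_row)

print([round(p, 3) for p in proportions])
```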

In [60]:
#@title Topic Modelling with LDA 1 gram

lda_model, tfidf_1, vectorizer_1 = lda_executor()

LDA Topics
Topic 0:
points (2.58) writing (2.47) learn (2.27) academic (2.2) exams (2.18) chapter (2.17) ch (2.15) write (2.15) discussions (2.1) 10 (1.9) make (1.83) week (1.83) read (1.82) understanding (1.79) help (1.78)
Art 103
Literature
CHEM 100
Art 125
BUS 100
Topic 1:
ch (1.86) exam (1.49) points (1.33) personal (1.21) week (1.18) credit (1.17) paper (1.16) skills (1.15) materials (1.15) concept (1.15) understanding (1.12) exams (1.12) date (1.1) pages (1.09) discussion (1.09)
Art 103
Literature
CHEM 100
Art 125
BUS 100
Topic 2:
exam (1.14) time (1.07) talk (1.06) systems (1.04) canvas (1.02) pm (1.02) personal (1.01) action (1.0) account (1.0) catch (1.0) notes (0.99) supplement (0.99) ch (0.99) know (0.99) lab (0.98)
Art 103
Literature
CHEM 100
Art 125
BUS 100
Topic 3:
week (2.14) canvas (1.88) points (1.76) ch (1.7) exam (1.51) review (1.44) performance (1.44) responsibility (1.44) text (1.43) project (1.42) read (1.42) university (1.39) material (1.39) personal (1.37) 10 (1



In [30]:
#@title Visualise 1 gram LDA with pyLDAVis

pyLDAvis.enable_notebook()

pyLDAvis_data = pyLDAvis.sklearn.prepare(lda_model, tfidf_1, vectorizer_1)
# Visualization can be displayed in the notebook
pyLDAvis.display(pyLDAvis_data)





### 2 Gram Topic Modelling with LDA

In [61]:
#@title Topic Modelling with LDA 2 gram

lda_model_2, tfidf_2, vectorizer_2 = lda_executor((2, 2))



LDA Topics
Topic 0:
computer science (2.83) programming assignments (2.57) information technology (2.34) disability services (2.24) general education (2.01) previous knowledge (1.93) test data (1.88) course requirements (1.79) students disabilities (1.75) solve problem (1.75) located wyll (1.75) academic dishonesty (1.73) day drop (1.73) basis reasonable (1.72) final grade (1.72)
CSCI 241
CSCI130
BUS 100
Art 125
CRMJ 101
Topic 1:
learning outcome (2.69) discussion board (1.93) outcome students (1.9) team members (1.82) learning objectives (1.81) general education (1.73) 10 points (1.71) students able (1.56) drop box (1.56) midnight week (1.53) verbal nonverbal (1.48) box discussion (1.44) students read (1.42) reasoned judgment (1.38) interpret communicate (1.37)
Literature
Art 125
CRMJ 101
BUS 100
Art 103
Topic 2:
writing assignments (6.23) written assignments (2.76) critical thinking (2.65) general education (2.39) geography anthropology (2.24) office hours (2.18) disability services 

In [26]:
#@title Visualise 2 gram LDA with pyLDAVis

pyLDAvis.enable_notebook()

pyLDAvis_data = pyLDAvis.sklearn.prepare(lda_model_2, tfidf_2, vectorizer_2)
# Visualization can be displayed in the notebook
pyLDAvis.display(pyLDAvis_data)





### 3 Gram Topic Modelling with LDA

In [62]:
#@title Topic Modelling with LDA 3 gram

lda_model_3, tfidf_3, vectorizer_3 = lda_executor((3, 3))

LDA Topics
Topic 0:
new york times (3.99) worth 20 points (1.76) day drop course (1.59) worth 100 points (1.51) ch ch 12 (1.5) understanding effective communication (1.35) final course grade (1.31) writing understanding effective (1.28) general education requirements (1.28) responsible choice ones (1.26) education lifelong learning (1.24) critical thinking skills (1.23) social personal responsibility (1.22) equity gaps areas (1.22) points total points (1.2)
PYSC 101
Art 125
Literature
CRMJ 101
ENVS 101
Topic 1:
wisconsin administrative code (1.92) university wisconsin parkside (1.79) sanctions academic misconduct (1.58) academic misconduct include (1.58) information including limited (1.57) university wisconsin administrative (1.53) chapter 14 uws (1.53) contact disability services (1.51) academic misconduct subject (1.49) administrative code chapter (1.48) requirements religious observances (1.46) located wyll d175 (1.43) activities meet course (1.42) used class time (1.42) requiremen



In [28]:
#@title Visualise 3 gram LDA with pyLDAVis

pyLDAvis.enable_notebook()

pyLDAvis_data = pyLDAvis.sklearn.prepare(lda_model_3, tfidf_3, vectorizer_3)
# Visualization can be displayed in the notebook
pyLDAvis.display(pyLDAvis_data)





### Conclusion:

***

The results are largely self-explanatory. Good top words have been identified for the topics, and the topics themselves show promise. Bigrams and trigrams are also important, as they identify phrases and topics that unigram modelling misses.

The distance map in the visualization gives good visual insight into the topics, along with the salient terms within them.

A way of gaining insights from the identified topics, and of quantifying them, still needs to be developed.
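
One simple starting point for quantifying the topics, sketched here under the assumption that a document-topic weight table is available, is to assign each document its dominant topic and count how many documents fall under each one. The course labels and weights below are hypothetical, not results from this notebook.

```python
from collections import Counter

# Hypothetical document-topic weights (rows: documents, columns: topics); not real results
doc_topic = {
    "Course A": [0.70, 0.20, 0.10],
    "Course B": [0.10, 0.85, 0.05],
    "Course C": [0.15, 0.60, 0.25],
    "Course D": [0.05, 0.05, 0.90],
}

def dominant_topic(weights):
    """Index of the topic with the largest weight for one document."""
    return max(range(len(weights)), key=lambda k: weights[k])

dominant = {name: dominant_topic(w) for name, w in doc_topic.items()}
topic_sizes = Counter(dominant.values())

for name, k in dominant.items():
    print(f"{name}: Topic {k} (weight {doc_topic[name][k]:.2f})")
print("Documents per dominant topic:", dict(topic_sizes))
```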

LDA seemingly does better at identifying a variety of topics within the documents and gives a comparatively more thorough understanding.

Further Work: We can create topics based on the manual HIP work done on the 60-odd documents. The keywords from that study can be combined with the approach used here to get better results. Document summarization can also be looked into, along with word embeddings in combination with topic modelling.
***