<a href="https://colab.research.google.com/github/armaank/dbn/blob/main/text-analysis/umap.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
# clone repo
%cd /content
%rm -rf dbn
!ls
!git clone https://github.com/armaank/dbn

/content
sample_data
Cloning into 'dbn'...
remote: Enumerating objects: 12284, done.
remote: Counting objects: 100% (127/127), done.
remote: Compressing objects: 100% (95/95), done.
remote: Total 12284 (delta 51), reused 85 (delta 30), pack-reused 12157
Receiving objects: 100% (12284/12284), 127.76 MiB | 18.17 MiB/s, done.
Resolving deltas: 100% (556/556), done.
Checking out files: 100% (378/378), done.


In [2]:
# cd to text-analysis directory
%cd ./dbn/text-analysis/
%ls
%pip install -r requirements.txt
%pip install umap-learn 
%pip install umap-learn[plot]

/content/dbn/text-analysis
csvpreprocess.py  mplda.py          README.md         topicmodel.ipynb
data/             nltk_download.py  requirements.txt
datahandler.py    notebooks/        stopwords.txt
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting chardet==4.0.0
  Using cached chardet-4.0.0-py2.py3-none-any.whl (178 kB)
Installing collected packages: chardet
  Attempting uninstall: chardet
    Found existing installation: chardet 3.0.4
    Uninstalling chardet-3.0.4:
      Successfully uninstalled chardet-3.0.4
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
requests 2.23.0 requires chardet<4,>=3.0.2, but you have chardet 4.0.0 which is incompatible.
google-colab 1.0.0 requires ipython~=5.5.0, but you have ipython 7.16.1 which is incompatible.
datascience 0.10.6 requ

In [3]:
# download datasets 
!python3 ./data/arch-lectures/get_dataset.py

In [4]:
# imports 
import os 
import sys

import numpy as np

from datahandler import DataHandler

import pandas as pd
import umap.umap_ as umap
import umap.plot

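# Vectorizers for building the document-term matrices
# (fetch_20newsgroups is unused below; the corpus is loaded through DataHandler)
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer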

# Some plotting libraries
import matplotlib.pyplot as plt
# the interactive `notebook` backend does not render on Colab, so draw static figures inline
%matplotlib inline
from bokeh.plotting import show, save, output_notebook, output_file
from bokeh.resources import INLINE

In [5]:

seed=123

# supply data directory
data_dir = os.path.join(".","preprocessed")
# load corpus
corpus = DataHandler(data_dir, seed)

# print some various information from the corpus
print("Total Word Count: {}".format(corpus.total_words))
print("Number of Docs in the Corpus: {}".format(corpus.total_docs))

# summarize statistics from all institutions in the corpus
print(corpus.stats)

Total Word Count: 2996921
Number of Docs in the Corpus: 283
[{'inst': 'MIT', 'n_docs': 18, 'wc': 135822}, {'inst': 'GSD', 'n_docs': 37, 'wc': 408691}, {'inst': 'AA', 'n_docs': 137, 'wc': 1532531}, {'inst': 'CU', 'n_docs': 42, 'wc': 449772}, {'inst': 'Know', 'n_docs': 32, 'wc': 293989}, {'inst': 'RISD', 'n_docs': 17, 'wc': 176116}]
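
The per-institution statistics are easier to compare as shares of the corpus. The sketch below copies the printed `corpus.stats` values by hand and checks them against the totals reported above; `summarize` is an illustrative helper, not part of `DataHandler`.

```python
# Per-institution statistics, copied from the `corpus.stats` output above.
stats = [
    {'inst': 'MIT', 'n_docs': 18, 'wc': 135822},
    {'inst': 'GSD', 'n_docs': 37, 'wc': 408691},
    {'inst': 'AA', 'n_docs': 137, 'wc': 1532531},
    {'inst': 'CU', 'n_docs': 42, 'wc': 449772},
    {'inst': 'Know', 'n_docs': 32, 'wc': 293989},
    {'inst': 'RISD', 'n_docs': 17, 'wc': 176116},
]

total_docs = sum(s['n_docs'] for s in stats)
total_words = sum(s['wc'] for s in stats)


def summarize(stats):
    """Return (inst, share of docs, mean words per doc) rows, largest share first."""
    rows = [
        (s['inst'], s['n_docs'] / total_docs, s['wc'] / s['n_docs'])
        for s in stats
    ]
    return sorted(rows, key=lambda r: r[1], reverse=True)


for inst, share, mean_wc in summarize(stats):
    print(f"{inst:>5}: {share:6.1%} of docs, {mean_wc:8.0f} words/doc")
```

AA contributes 137 of the 283 documents, so it will dominate any plot labeled by institution. Assuming `corpus.stats` is a list of dicts as printed, `pd.DataFrame(corpus.stats)` should give the same table directly in the notebook.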


## Raw Counts

In [6]:
# vectorizer: raw counts over the full vocabulary (min_df=1 keeps every term)
vectorizer = CountVectorizer(min_df=1, input='filename')
word_doc_matrix = vectorizer.fit_transform(corpus.data.keys())
print(word_doc_matrix.shape)
embedding = umap.UMAP(n_components=2, metric='hellinger').fit(word_doc_matrix)
print(embedding.embedding_.shape)
f = umap.plot.points(embedding, labels=np.array(list(corpus.data.values())))

(283, 46056)




(283, 2)


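
The `hellinger` metric treats each row of counts as an unnormalized probability distribution over the vocabulary. Below is a minimal NumPy sketch of that distance, written from the textbook definition rather than taken from UMAP's source; the `hellinger` helper and the toy vectors are illustrative assumptions.

```python
import numpy as np


def hellinger(x, y):
    """Hellinger distance between two non-negative count vectors."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    if x.sum() == 0 or y.sum() == 0:
        raise ValueError("each vector needs at least one non-zero count")
    # normalize counts to probability distributions
    p = x / x.sum()
    q = y / y.sum()
    # Bhattacharyya coefficient: 1 for identical distributions, 0 for disjoint ones
    bc = np.sqrt(p * q).sum()
    # clip tiny negative values caused by floating-point rounding
    return float(np.sqrt(max(0.0, 1.0 - bc)))


# toy term counts for three hypothetical documents
doc_a = [3, 0, 1, 2]
doc_b = [6, 0, 2, 4]   # same proportions as doc_a, twice as long
doc_c = [0, 5, 0, 0]   # shares no terms with doc_a

print(hellinger(doc_a, doc_b))
print(hellinger(doc_a, doc_c))
```

Because the counts are normalized first, a long and a short lecture with the same word proportions are treated as identical, which suits a corpus where document length varies widely.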

In [7]:
# vectorizer: drop English stop words and terms found in fewer than 5 documents
vectorizer = CountVectorizer(min_df=5, stop_words="english", input='filename')
word_doc_matrix = vectorizer.fit_transform(corpus.data.keys())
print(word_doc_matrix.shape)
# print(vectorizer.get_feature_names_out())
embedding = umap.UMAP(n_components=2, metric='hellinger').fit(word_doc_matrix)
print(embedding.embedding_.shape)
f = umap.plot.points(embedding, labels=np.array(list(corpus.data.values())))

(283, 13806)
(283, 2)


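
Adding `min_df=5` and `stop_words="english"` shrinks the vocabulary from 46056 to 13806 terms. The toy example below shows the same two filters on four made-up sentences (with `min_df=2`, since the toy corpus is tiny); the sentences are assumptions for illustration only.

```python
from sklearn.feature_extraction.text import CountVectorizer

# four hypothetical one-line "documents"
docs = [
    "the studio drawing shows the facade",
    "the facade of the building",
    "a drawing of the building and the site",
    "the site plan",
]

# default settings: every token of two or more characters is kept
full = CountVectorizer().fit(docs)

# keep only non-stop-word terms that occur in at least two documents
pruned = CountVectorizer(min_df=2, stop_words="english").fit(docs)

print(sorted(full.vocabulary_))
print(sorted(pruned.vocabulary_))
```

Terms that appear in a single document cannot pull two documents together, so pruning them mostly removes noise dimensions before the embedding is computed.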

## TF-IDF

In [8]:
output_notebook()

hover_df = pd.DataFrame(corpus.data.keys(), columns=['inst'])

tfidf_vectorizer = TfidfVectorizer(min_df=5, stop_words='english', input='filename')
tfidf_word_doc_matrix = tfidf_vectorizer.fit_transform(corpus.data.keys())
tfidf_embedding = umap.UMAP(metric='hellinger').fit(tfidf_word_doc_matrix)
fig = umap.plot.interactive(tfidf_embedding, labels=np.array(list(corpus.data.values())),
                            hover_data = hover_df, point_size=10)
show(fig)

Thoughts: TF-IDF places the AA lectures closer to those from the other schools, and it isn't clear why. Using the speaker as a visualization label might help: the closely related documents may come from the same speaker (e.g. speaker tours).
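
To make the reweighting concrete, here is a NumPy sketch of TF-IDF using the smoothed formula documented for scikit-learn's defaults, `idf = ln((1 + n) / (1 + df)) + 1`, followed by L2 row normalization. The `tfidf` helper and the toy counts are assumptions for illustration; this shows only what the weighting does and does not explain the AA observation.

```python
import numpy as np


def tfidf(counts):
    """Return (L2-normalized TF-IDF matrix, idf vector) for a docs x terms count matrix."""
    counts = np.asarray(counts, dtype=float)
    n_docs = counts.shape[0]
    # document frequency: number of documents containing each term
    df = (counts > 0).sum(axis=0)
    # smoothed inverse document frequency
    idf = np.log((1 + n_docs) / (1 + df)) + 1
    weighted = counts * idf
    # scale every document vector to unit length
    norms = np.linalg.norm(weighted, axis=1, keepdims=True)
    norms[norms == 0] = 1.0
    return weighted / norms, idf


# rows: documents; columns: hypothetical terms ("architecture", "facade", "roundtable")
counts = [
    [10, 2, 0],
    [8, 0, 0],
    [12, 0, 3],
]
weights, idf = tfidf(counts)
print(idf)
print(weights.round(3))
```

A term that occurs in every document gets the minimum idf of 1, so vocabulary shared across all institutions counts for less than it does under raw counts.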

In [9]:
output_notebook()

hover_df = pd.DataFrame(corpus.data.keys(), columns=['inst'])

tfidf_vectorizer = TfidfVectorizer(input='filename')
tfidf_word_doc_matrix = tfidf_vectorizer.fit_transform(corpus.data.keys())
tfidf_embedding = umap.UMAP(metric='hellinger').fit(tfidf_word_doc_matrix)
f = umap.plot.points(tfidf_embedding, labels=np.array(list(corpus.data.values())))
fig = umap.plot.interactive(tfidf_embedding, labels=np.array(list(corpus.data.values())),
                            hover_data = hover_df, point_size=10)
show(fig)


There seems to be a difference between professional and academic lectures, but it still isn't clear why.
*   AA lectures tend to have a round-table format



## Experiments with UMAP


In [10]:

hover_df = pd.DataFrame(corpus.data.keys(), columns=['inst'])


tfidf_vectorizer = TfidfVectorizer(input='filename', stop_words='english', min_df=5)
tfidf_word_doc_matrix = tfidf_vectorizer.fit_transform(corpus.data.keys())
print(tfidf_word_doc_matrix.shape)
tfidf_embedding = umap.UMAP(n_neighbors=50, metric='cosine', min_dist=1, spread=2.0).fit(tfidf_word_doc_matrix)
print(tfidf_embedding.embedding_.shape)
fig = umap.plot.interactive(tfidf_embedding, labels=np.array(list(corpus.data.values())),
                            hover_data = hover_df, point_size=10)
show(fig)


(283, 13806)
(283, 2)


Experimented with the cosine metric and varied several other UMAP parameters (n_neighbors, min_dist, spread); nothing interesting emerged.
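
A sweep like this is easier to repeat when the settings are generated up front. The sketch below builds a grid of keyword arguments and skips combinations where `min_dist` exceeds `spread`, which UMAP rejects; the helper name and the grid values are illustrative assumptions, not the settings actually tried.

```python
from itertools import product


def umap_param_grid(n_neighbors, min_dists, spreads):
    """Return a list of keyword-argument dicts for umap.UMAP, skipping invalid pairs."""
    grid = []
    for nn, md, sp in product(n_neighbors, min_dists, spreads):
        if md > sp:
            continue  # UMAP requires min_dist <= spread
        grid.append({"n_neighbors": nn, "min_dist": md, "spread": sp})
    return grid


grid = umap_param_grid(
    n_neighbors=[5, 15, 50],
    min_dists=[0.1, 1.0, 1.5],
    spreads=[1.0, 2.0],
)
print(len(grid), "settings")
for params in grid[:3]:
    print(params)
```

Each entry can then be passed on in the notebook, for example `umap.UMAP(metric='cosine', random_state=seed, **params).fit(tfidf_word_doc_matrix)`; fixing `random_state` keeps the layouts comparable between runs.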


## K-Means


In [11]:
# learn clusters with k-means on the TF-IDF matrix (not on the 2-D embedding)
from sklearn.cluster import KMeans

hover_df = pd.DataFrame(corpus.data.keys(), columns=['inst'])


tfidf_vectorizer = TfidfVectorizer(input='filename', stop_words='english', min_df=5)
tfidf_word_doc_matrix = tfidf_vectorizer.fit_transform(corpus.data.keys())
tfidf_embedding = umap.UMAP(n_neighbors=15, metric='cosine', min_dist=.1).fit(tfidf_word_doc_matrix)

clusters = KMeans(n_clusters=5)
clusters.fit(tfidf_word_doc_matrix)

f = umap.plot.points(tfidf_embedding, labels=np.array(["c{}".format(c) for c in clusters.labels_]))


Using labels learned from k-means, it looks like we have some rough clusters: c3 is a smaller cluster on the left, c0 sits in the upper right, c4 dominates the top, c2 documents are at the bottom, and c1 is spread all over. Perhaps there is something more here.
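
One way to check whether the clusters carry more than visual structure is to cross-tabulate them against the institution labels. The sketch below uses only the standard library; `crosstab` is an illustrative helper, and the toy labels are made up rather than taken from the corpus.

```python
from collections import Counter, defaultdict


def crosstab(cluster_ids, institutions):
    """Count how many documents of each institution fall into each cluster."""
    table = defaultdict(Counter)
    for cluster, inst in zip(cluster_ids, institutions):
        table[cluster][inst] += 1
    return table


# hypothetical toy labels, NOT results from the corpus
toy_clusters = [0, 0, 1, 1, 1, 2, 2, 0]
toy_insts = ["AA", "AA", "GSD", "MIT", "GSD", "AA", "CU", "AA"]

table = crosstab(toy_clusters, toy_insts)
for cluster in sorted(table):
    print(f"c{cluster}:", dict(table[cluster]))
```

In the notebook this would be called as `crosstab(clusters.labels_, corpus.data.values())`, assuming the values of `corpus.data` are the institution labels used for the plots. A cluster made up almost entirely of one institution would suggest k-means is recovering the source rather than a topic.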


In [12]:
%mkdir -p ./umap

In [13]:
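import csv

# k-means cluster id for each document
labels = np.asarray(clusters.labels_)

# document paths relative to the preprocessed data directory
files = np.array(hover_df).flatten()
names = [os.path.relpath(path, "./preprocessed/") for path in files]

# 2-D UMAP coordinates
x = tfidf_embedding.embedding_[:, 0]
y = tfidf_embedding.embedding_[:, 1]

# one row per document: cluster, file, x, y (no header row)
# newline='' lets the csv module control line endings
with open('./umap/output.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerows(zip(labels, names, x, y))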
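
Since the file has no header row, a reader has to know the column order (cluster, file, x, y). A minimal sketch of reading it back with the standard library follows; `read_umap_rows` is an illustrative helper, and the sample rows use made-up file names and coordinates.

```python
import csv
import io


def read_umap_rows(handle):
    """Parse rows of (cluster, file, x, y) written without a header."""
    rows = []
    for label, name, x, y in csv.reader(handle):
        rows.append((int(label), name, float(x), float(y)))
    return rows


# hypothetical sample in the same layout as ./umap/output.csv
sample = io.StringIO(
    "3,AA/lecture_01.txt,-1.25,4.5\n"
    "0,MIT/lecture_07.txt,2.0,0.75\n"
)
rows = read_umap_rows(sample)
for row in rows:
    print(row)
```

For the real file, replace the `StringIO` object with `open('./umap/output.csv', newline='')`.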
