# MSDS DS Tools Final Project: Topic Clustering

This notebook approaches topic clustering and visualization in five steps:
* Articles are scraped from the U.S. section of TIME (time.com)
* Article content is summarized using `sentence-transformers`
* Articles are clustered by their cleaned text using `sentence-transformers` again
* Clusters are tagged with a human-readable topic using Llama 3.1
* Results are visualized as a network graph in `pyvis`

This process worked surprisingly well, and the visualization provides an at-a-glance view of the latest TIME content.\
More information is available below.

### Requirements
> **Note:** Some of these requirements are time-consuming or difficult to set up. They can be skipped by loading data from the PKL files.
#### General / Web Scraping
* `ipywidgets`: notebook tools
* `pyquery`: HTML parsing
* `tqdm`: progress bars
* `pyvis`: graph visual
* `matplotlib`: coloring
* `nltk`: NLP tools
* `gensim`: NLP tools
#### For Basic Topic Grouping (Skippable)
* `sentence-transformers`: basic semantic models, clustering, summarizing
* `pytorch`: GPU processing for `sentence-transformers`
* `CUDA`, `cuDNN`, `Visual Studio + MSVC`: for `pytorch` GPU capabilities
#### For Topic Defining (Skippable)
* `ollama`: LLM communication
* `llama3.1:70b`: LLM choice

### General / Topic Grouping Dependencies Installs
> See above for information on these dependencies

In [2]:
%%capture
%pip install ipywidgets
%pip install pyquery
%pip install tqdm
%pip install nltk
%pip install pyvis
%pip install matplotlib
%pip install gensim

### Common Functions And Constants

In [12]:
from pyquery import PyQuery as pq
import requests
import asyncio
from typing import Any, AsyncGenerator, List, Set, Tuple, Dict
from urllib.parse import urljoin
import json
import unicodedata
from tqdm.notebook import tqdm

## Constants
TIMES_URL = "https://time.com/"
TIMES_HOME_URL = urljoin(TIMES_URL, "section/us/")
TIMES_PAGE_URL = "/section/us/?page={page}"
HEADERS = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:128.0) Gecko/20100101 Firefox/128.0"
}
N_ARTICLES = 20
SUMMARIZED_ARTICLES_PICKLE = "summarized_articles.pkl"
ARTICLES_CORPUS_PICKLE = "articles_with_corpus.pkl"
ARTICLES_PICKLE = "articles.pkl"
ARTICLES_TOPIC_PICKLE = "articles_with_topic.pkl"
CLUSTERS_PICKLE = "clusters.pkl"

In [13]:
def json_str(cls: type):
    """
    Decorates a class with a `__str__` and `__repr__` function that returns a JSON representation of the object
    """
    def custom_str(self):
        class_name = self.__class__.__name__
        properties = dict(self.__dict__)
        # default=str keeps non-JSON values (e.g. the asyncio.Lock on RateLimiter) from raising TypeError
        return json.dumps({"class": class_name, "properties": properties}, indent=4, default=str)
    cls.__str__ = custom_str
    cls.__repr__ = custom_str
    return cls

def remove_unicode_and_normalize(s: str) -> str:
    """
    Normalizes unicode characters to ascii or removes them otherwise
    """
    normalized = unicodedata.normalize('NFKD', s)
    return normalized.encode('ascii', 'ignore').decode('ascii')

In [14]:
@json_str
class RateLimiter:
    """
    Shared rate limiter for web requests
    """
    def __init__(self, n: int=1, interval:float=2):
        """
        Creates a simple `RateLimiter` that allows for `n` requests in `interval` seconds.
        """
        self.n = n
        self.interval = interval
        self.calls = 0
        self.lock = asyncio.Lock()

    async def acquire(self) -> None:
        """
        Waits for next available call, concurrency-safe
        """
        async with self.lock:
            if self.calls >= self.n:
                await asyncio.sleep(self.interval)
                self.calls = 0
            self.calls += 1

In [15]:
from typing import TypeVar


T = TypeVar('T')
async def async_islice(generator: AsyncGenerator[T, None], stop: int, start: int = 0) -> AsyncGenerator[T, None]:
    """
    Async-compatible implementation of `itertools.islice`
    """
    i = -1
    async for value in generator:
        i += 1
        if i < start:
            continue
        if i >= stop:
            break
        yield value

In [16]:
## Extensible article definitions

@json_str
class Article:
    def __init__(self, url: str, title: str, paragraphs: List[str]):
        self.paragraphs: List[str] = paragraphs
        self.title: str = title
        self.url: str = url

@json_str
class SummarizedArticle(Article):    
    def __init__(self, summary: str, **kwargs):
        super().__init__(**kwargs)
        self.summary: str = summary

@json_str
class ArticleWithCorpus(SummarizedArticle):
    def __init__(self, corpus: List[str], **kwargs):
        super().__init__(**kwargs)
        self.corpus : List[str] = corpus

@json_str
class ArticleWithTopic(ArticleWithCorpus):
    def __init__(self, topic: str, **kwargs):
        super().__init__(**kwargs)
        self.topic: str = topic

        


### Web Scraping
Web scraping is done with an asynchronous iterator. Listing pages are iterated directly by URL, and articles are iterated from each page. Both feed a single iterator of `Article` objects that obeys rate limiting as needed. Each `Article` is only retrieved when it is iterated.\
time.com is relatively easy to scrape because each page of its US section lists articles and the page URLs follow an iterable structure of the form `/section/us/?page={page}`.\
Rate limiting is set to one request per second and `500` articles are collected by default. This takes a bit of time to complete.
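
To make the "retrieved only as iterated" idea concrete, here is a minimal sketch that runs without any network access. `FAKE_SITE`, `fetch_page`, `iter_article_urls`, and `take` are hypothetical names made up for this illustration and are not part of the scraper below. A short sleep stands in for the rate limiter.

```python
import asyncio
from typing import AsyncGenerator, Dict, List

# Hypothetical stand-in for the site: page number -> article URLs on that page
FAKE_SITE: Dict[int, List[str]] = {
    1: ["/a1", "/a2"],
    2: ["/a3", "/a4"],
    3: ["/a5"],
}
fetched_pages: List[int] = []  # records which pages were actually requested


async def fetch_page(page_num: int, delay: float = 0.01) -> List[str]:
    """Pretend to download one listing page, pausing first as a crude rate limit."""
    await asyncio.sleep(delay)
    fetched_pages.append(page_num)
    return FAKE_SITE.get(page_num, [])


async def iter_article_urls() -> AsyncGenerator[str, None]:
    """Yield article URLs page by page; a page is fetched only when it is needed."""
    page_num = 1
    while True:
        urls = await fetch_page(page_num)
        if not urls:
            return
        for url in urls:
            yield url
        page_num += 1


async def take(n: int) -> List[str]:
    """Collect the first `n` URLs, then stop pulling from the generator."""
    out: List[str] = []
    gen = iter_article_urls()
    async for url in gen:
        out.append(url)
        if len(out) >= n:
            break
    await gen.aclose()
    return out


first_three = asyncio.run(take(3))
print(first_three)    # ['/a1', '/a2', '/a3']
print(fetched_pages)  # [1, 2] -> page 3 was never requested
```

Inside a notebook, where an event loop is already running, `await take(3)` would replace `asyncio.run(take(3))`.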

In [17]:
async def iter_times(max_pages: int = 10) -> AsyncGenerator[Article, None]:
    """
    Returns an `AsyncGenerator` that yields `Article` instances. Articles are retrieved as they are iterated and rate limiting happens with the flow of the iterator
    """
    rl = RateLimiter(interval=1)

    def get_page_from_url(url) -> pq:
        """
        Creates a `PyQuery` instance for the page indicated by `url`
        """
        resp = requests.get(url, headers=HEADERS)
        if resp.status_code != 200:
            return None
        page = pq(resp.text)
        return page
    
    async def iter_pages(start_url: str) -> AsyncGenerator[pq, None]:
        """
        Returns an `AsyncGenerator` of pages, returns a `PyQuery` instance for each page
        """
        page_num = 1
        
        def get_next_page() -> pq:
            """
            Returns a `PyQuery` instance for the next page by determining the URL from the next index
            """
            nonlocal page_num
            page_num += 1
            page = get_page_from_url(urljoin(TIMES_URL, TIMES_PAGE_URL.format(page=page_num)))
            return page
        
        await rl.acquire()
        page = get_page_from_url(start_url)

        # Yield each page exactly once; the next page is only requested when the consumer asks for it
        while page is not None:
            yield page
            await rl.acquire()
            page = get_next_page()

    def get_article_urls(head_page: pq) -> List[str]:
        """
        Returns a list of article urls from the given page (`PyQuery`)
        """
        article_urls = [x.attr("href") for x in head_page("div.component div.taxonomy-tout a").items()]
        return article_urls
    
    def get_article_from_url(url: str) -> Article:
        """
        Creates an `Article` instance for the article page `url` provided.
        """
        page = get_page_from_url(url)
        try:
            title = remove_unicode_and_normalize(page("h1.self-baseline").text())
            paragraphs = [remove_unicode_and_normalize(x.text()) for x in page("p.self-baseline.px-0").items()]
            return Article(url, title, paragraphs)
        except Exception as e:
            print(f"Article at url='{url}' failed to parse: {e}")
            return None

    async for page in async_islice(iter_pages(TIMES_HOME_URL), max_pages):
        urls = get_article_urls(page)
        for url in urls:
            yield get_article_from_url(urljoin(TIMES_URL, url))
            await rl.acquire()

### Text Cleaning
In order to clean the text, the following processes are used:
* Removing numbers
* Removing stopwords
* Lemmatizing tokens
* Removing words used only a single time
* Removing named-entities (ex: Trump, Washington)
* Lowercasing words

The token list is not used for determining the final topic but is instead used to cluster articles. By removing named entities and infrequent words, the models in `sentence-transformers` are less prone to confusion from unfamiliar or unimportant tokens.\
These tokens are otherwise mainly used during visualization, where named entities would clutter the screen without displaying meaningful connections between topics.
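
As a rough, dependency-free illustration of three of these steps (digit removal, stopword removal, and dropping words used only once), here is a small sketch. `TOY_STOPWORDS` and `toy_clean` are hypothetical names for this example only. Lemmatizing and named-entity removal are left out because they need the NLTK models.

```python
import re
from collections import Counter
from typing import List

# Tiny illustrative stopword list (an assumption); the notebook uses NLTK's English list
TOY_STOPWORDS = {"the", "a", "of", "and", "in", "to", "is"}


def toy_clean(text: str) -> List[str]:
    """Drop digits, stopwords, and words that occur only once, then lowercase."""
    text = re.sub(r"\d+", "", text)
    words = re.findall(r"[A-Za-z]+", text)
    words = [w for w in words if w.lower() not in TOY_STOPWORDS]
    counts = Counter(words)
    words = [w for w in words if counts[w] > 1]
    return [w.lower() for w in words]


sample = "The budget vote in 2024 split the Senate and the budget vote failed"
tokens = toy_clean(sample)
print(tokens)  # ['budget', 'vote', 'budget', 'vote']
```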

In [18]:
import nltk
from nltk import pos_tag, ne_chunk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
import re
from nltk.probability import FreqDist


nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')
nltk.download('maxent_ne_chunker')
nltk.download('words')

#Implementation from https://stackoverflow.com/questions/43742956/fast-named-entity-removal-with-nltk
def fast_ne_removal(tokens: List[str]) -> List[str]:
    """
    Removes named-entities from the given token list
    """
    tagged = pos_tag(tokens)
    tree = ne_chunk(tagged)
    non_entities = [leaf[0] for leaf in tree if not isinstance(leaf, nltk.Tree)]
    return non_entities

def clean_text(text: str) -> List[str]:
    """
    Cleans the text provided and tokenizes using the following processes:
    * Removes digits
    * Removes stopwords
    * Lemmatizes tokens
    * Removes words used only a single time
    * Removes named-entities
    * Lowercases words
    """
    text = re.sub(r'\d+', '', text)
    words = word_tokenize(text)
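    stop_words = set(stopwords.words('english'))
    # compare in lowercase so capitalized stopwords (e.g. "The") are removed too; casing is kept for entity detection below
    words = [word for word in words if word.isalnum() and word.lower() not in stop_words]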
    lemmatizer = WordNetLemmatizer()
    words = [lemmatizer.lemmatize(word) for word in words]
    freq_dist = FreqDist(words)
    words = [word for word in words if freq_dist[word] > 1]
    
    words = fast_ne_removal(words)
    words = [s.lower() for s in words]
    return words

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\gsuga\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\gsuga\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\gsuga\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\gsuga\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     C:\Users\gsuga\AppData\Roaming\nltk_data...
[nltk_data]   Package maxent_ne_chunker is already up-to-date!
[nltk_data] Downloading package words to
[nltk_data]     C:\Users\gsuga\AppData\Roaming\nltk_data...
[nltk_data]   Package words is already up-

### Coloration
The final number of topics is not known in advance, so ChatGPT helped write a function that combines several discrete `matplotlib` colormaps and wraps around when more colors are needed. The colors are converted to hex before being used in the visualization.
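
The wrapping idea on its own can be sketched without `matplotlib`. `BASE_PALETTE` holds a few hex values hard-coded for illustration, and `wrap_palette` is a hypothetical helper, not the function used in this notebook.

```python
from itertools import cycle, islice
from typing import List

# A few hex colors hard-coded for illustration (assumption: any discrete palette works)
BASE_PALETTE = ["#1f77b4", "#ff7f0e", "#2ca02c", "#d62728"]


def wrap_palette(n: int, palette: List[str] = BASE_PALETTE) -> List[str]:
    """Return `n` colors, starting over from the first color when the palette runs out."""
    return list(islice(cycle(palette), n))


colors = wrap_palette(6)
print(colors)  # the four base colors followed by the first two again
```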

In [19]:
import matplotlib.pyplot as plt
from matplotlib.colors import rgb2hex

#chatgpt
def get_combined_colormap(N):
    """
    Returns a list of colors of length `N` in hex format by wrapping through multiple colorsets
    """
    colormaps = ['tab20', 'tab20c', 'tab10', 'Set3']  # A combination of discrete colormaps
    colors = []

    for cmap_name in colormaps:
        cmap = plt.get_cmap(cmap_name)  # plt.cm.get_cmap is deprecated and removed in newer matplotlib
        colors.extend([rgb2hex(c) for c in cmap.colors])

    return colors[:N] if N <= len(colors) else colors * (N // len(colors)) + colors[:N % len(colors)]

### Data Checkpoint
> **Note**: All of the code above consists of common definitions. After running it, run the cell below and skip to the visualization if new data is not needed.

The data in this notebook can be very time-consuming to obtain. Any time a long process is used, the output is saved to a PKL file so that it can be restored.\
If there are any parts of the code that cannot be run on the current system, the provided PKL files can be used to load the data needed for visualization.
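
The checkpoint pattern can be sketched as a "load if present, otherwise compute and save" helper. `load_or_compute` and `slow_step` are hypothetical names for this illustration. The notebook itself keeps loading and saving as separate cells.

```python
import os
import pickle
import tempfile
from typing import Any, Callable, List


def load_or_compute(path: str, compute: Callable[[], Any]) -> Any:
    """Return the pickled result at `path` if it exists, else compute and save it."""
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)
    result = compute()
    with open(path, "wb") as f:
        pickle.dump(result, f)
    return result


calls: List[int] = []  # counts how often the expensive step really runs


def slow_step() -> List[str]:
    """Stand-in for a long-running step such as scraping or summarizing."""
    calls.append(1)
    return ["article one", "article two"]


with tempfile.TemporaryDirectory() as tmp:
    checkpoint = os.path.join(tmp, "demo.pkl")
    first = load_or_compute(checkpoint, slow_step)   # computes and saves
    second = load_or_compute(checkpoint, slow_step)  # loads from disk

print(first == second, len(calls))  # True 1
```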

In [20]:
import pickle
import os

def read_pickle(filename: str) -> Any:
    """
    Read pickle and print status
    """
    if os.path.exists(filename):
        with open(filename, "rb") as f:
            data = pickle.load(f)
        # report success only after the file has actually been unpickled
        print(f"{filename}: SUCCESS")
        return data
    else:
        print(f"{filename}: MISSING")
        return None


articles: List[Article] = read_pickle(ARTICLES_PICKLE)
summarized_articles: List[SummarizedArticle] = read_pickle(SUMMARIZED_ARTICLES_PICKLE)
articles_corpus: List[ArticleWithCorpus] = read_pickle(ARTICLES_CORPUS_PICKLE)
clusters: List[List[int]] = read_pickle(CLUSTERS_PICKLE)
articles_topic: List[ArticleWithTopic] = read_pickle(ARTICLES_TOPIC_PICKLE)

articles.pkl: SUCCESS
summarized_articles.pkl: SUCCESS
articles_with_corpus.pkl: SUCCESS
clusters.pkl: SUCCESS
articles_with_topic.pkl: SUCCESS


### Collecting Articles (Skippable)
This code collects the articles this notebook operates on. By default `500` articles are collected.\
This was done relatively easily and there are no findings at this point.

In [46]:
N_ARTICLES = 500

articles: List[Article] = []
with tqdm(total=N_ARTICLES) as pbar:
    async for article in async_islice(iter_times(max_pages=100), N_ARTICLES):
        pbar.update(1)
        if article is not None:
            articles.append(article)
    tqdm.write(f"Processed {len(articles)}/{N_ARTICLES} before iterator ended.")


import pickle
with open(ARTICLES_PICKLE, "wb") as f:
    pickle.dump(articles, f)

  0%|          | 0/500 [00:00<?, ?it/s]

Processed 500/500 before iterator ended.


### Summarizing (Skippable)
This code uses one of my favorite implementations in the `sentence-transformers` examples: a version of `LexRank`, used here to rank the sentences of each article and extract the top ones.\
This is a form of summarizing that does not require generating new text, so it carries a low risk of adding false information.\
The summaries are saved to the extended `Article` class `SummarizedArticle`.

These summaries will be used to check work and be passed into a second `LexRank` process later to summarize topics. Doing the second `LexRank` process this way maintains the importance of individual articles while summarizing topics.
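
To show the general idea of centrality-based extraction, here is a simplified sketch. It is not the `LexRank` implementation used below: it scores each sentence by the plain sum of its similarities to the other sentences, and it uses bag-of-words cosine similarity instead of model embeddings. `bow`, `cosine`, and `top_sentences` are hypothetical helpers.

```python
import math
import re
from collections import Counter
from typing import List


def bow(sentence: str) -> Counter:
    """Bag-of-words vector for one sentence."""
    return Counter(re.findall(r"[a-z]+", sentence.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def top_sentences(sentences: List[str], k: int) -> List[str]:
    """Rank sentences by summed similarity to all other sentences and keep the top `k`."""
    vecs = [bow(s) for s in sentences]
    scores = [
        sum(cosine(v, u) for j, u in enumerate(vecs) if j != i)
        for i, v in enumerate(vecs)
    ]
    ranked = sorted(range(len(sentences)), key=lambda i: -scores[i])
    return [sentences[i] for i in ranked[:k]]


doc = [
    "The senate passed the budget bill",
    "The budget bill now goes to the house",
    "Critics say the budget bill adds debt",
    "A dog won the county fair",
]
summary = top_sentences(doc, 2)
print(summary)  # the two sentences most similar to the rest of the text
```

The off-topic sentence about the county fair scores lowest because it shares almost no words with the others, which is the behavior that makes this style of summary extractive rather than generative.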

Install for `sentence-transformers`

In [None]:
%%capture
%pip install sentence-transformers

Summarizing implementation

In [12]:
from sentence_transformers import SentenceTransformer
from nltk.tokenize import sent_tokenize
import numpy as np
# https://github.com/UKPLab/sentence-transformers/tree/master/examples/applications/text-summarization
from LexRank import degree_centrality_scores

SUMMARY_LENGTH = 3

model = SentenceTransformer("all-mpnet-base-v2", device="cuda")

summarized_articles: List[SummarizedArticle] = []

with tqdm(total=len(articles)) as pbar:
    for article in articles:
        sentences: List[str] = []
        for paragraph in article.paragraphs:
            if paragraph is not None and len(paragraph) > 0:
                sentences.extend([tok for tok in sent_tokenize(paragraph) if len(tok.strip()) > 0])

        sentences = np.array(sentences)

        if len(sentences) < 1:
            continue
        
        embeddings = model.encode(sentences)

        # https://github.com/UKPLab/sentence-transformers/blob/master/examples/applications/text-summarization/text-summarization.py
        similarity_scores = model.similarity(embeddings, embeddings).numpy()
        centrality_scores = degree_centrality_scores(similarity_scores, threshold=None)
        most_central_sentence_indices = np.argsort(-centrality_scores)

        summary = "\n".join(sentences[most_central_sentence_indices[0:SUMMARY_LENGTH]])
        summarized_articles.append(SummarizedArticle(summary, **article.__dict__))
        pbar.update(1)
    pbar.write(f"Summarized {len(summarized_articles)} articles.")

import pickle

with open(SUMMARIZED_ARTICLES_PICKLE, "wb") as f:
    pickle.dump(summarized_articles, f)

  0%|          | 0/500 [00:00<?, ?it/s]

Summarized 499 articles.


The summaries show that key points are being addressed from each article. These seem intuitive and the data is ready for the next steps.

In [21]:
print("\n\n".join([f"{s.title}:\n\t{s.summary}" for s in summarized_articles[0:2]]))

Iran Is Trying to Interfere in U.S. Election, Including Hacking Campaigns: Intel Agencies:
	Besides breaching the Trump campaign, officials also believe that Iran tried to hack into the presidential campaign of Kamala Harris.
We have observed increasingly aggressive Iranian activity during this election cycle, specifically involving influence operations targeting the American public and cyber operations targeting Presidential campaigns, said the statement released by the FBI, the Office of the Director of National Intelligence and the Cybersecurity and Infrastructure Security Agency.
Earlier this month, Microsoft issued a report detailing foreign agents attempts to interfere in this years election, citing an instance of an Iranian military intelligence unit in June sending a spear-phishing email to a high-ranking official of a presidential campaign from a compromised email account of a former senior advisor.

The Future Is Here: The Biggest Moments From Hillary Clintons 2024 DNC Speech

### Generating Clean Corpuses For Each Article (Skippable)
The content of each article is passed into the text cleaning process described above. These clean corpuses are saved to the extended `Article` class `ArticleWithCorpus`

In [14]:
articles_corpus: List[ArticleWithCorpus] = [ArticleWithCorpus(clean_text("\n".join(s.paragraphs)), **s.__dict__) for s in summarized_articles]

import pickle
with open(ARTICLES_CORPUS_PICKLE, "wb") as f:
    pickle.dump(articles_corpus, f)

### Semantic Clustering (Skippable)
Using `sentence-transformers` again, the articles are clustered by their embeddings by detecting "communities". This does not produce a description of each topic, which is a much trickier problem.\
Usually, getting named topics involves defining or seeding topics for a model to look for. This may have been possible here, but the intention was for the process to be as unguided as possible, and seeding topics can bias the modelling.\
These untagged clusters will get tagged later with the help of a more modern approach.
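
For intuition, threshold-based community detection can be sketched in plain Python on toy 2D vectors. This is a simplified illustration under my own assumptions, not the library's exact algorithm. `toy_communities` and the sample `points` are hypothetical.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def toy_communities(vectors: List[Sequence[float]], threshold: float, min_size: int) -> List[List[int]]:
    """Group items whose similarity to a shared center is at least `threshold`."""
    n = len(vectors)
    # candidate community per item: every item at least `threshold` similar to it
    candidates = []
    for i in range(n):
        members = [j for j in range(n) if cosine(vectors[i], vectors[j]) >= threshold]
        if len(members) >= min_size:
            candidates.append(members)
    # largest candidates first; keep only members that are not already assigned
    candidates.sort(key=len, reverse=True)
    assigned = set()
    communities: List[List[int]] = []
    for members in candidates:
        fresh = [j for j in members if j not in assigned]
        if len(fresh) >= min_size:
            communities.append(fresh)
            assigned.update(fresh)
    return communities


points = [
    (1.0, 0.0), (0.9, 0.1), (0.95, 0.05),   # one tight group
    (0.0, 1.0), (0.1, 0.9), (0.05, 0.95),   # a second tight group
    (-1.0, -1.0),                           # an outlier that joins nothing
]
groups = toy_communities(points, threshold=0.9, min_size=3)
print(groups)  # [[0, 1, 2], [3, 4, 5]]
```

As in the notebook, items that do not reach the minimum community size are simply left unclustered.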

In [15]:
from sentence_transformers import util, SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode([" ".join(a.corpus) for a in articles_corpus], batch_size=64, show_progress_bar=True, convert_to_tensor=True)
clusters: List[List[int]] = util.community_detection(embeddings, min_community_size=3, threshold=0.5)

import pickle
with open(CLUSTERS_PICKLE, "wb") as f:
    pickle.dump(clusters, f)

Batches:   0%|          | 0/8 [00:00<?, ?it/s]

The clusters can be seen below and each cluster appears related according to human intuition. Next, tagging the clusters needs to happen.

In [16]:
for i, cluster in enumerate(clusters[0:3]):
    print(f"({i + 1}) {len(cluster)} Articles")
    for sentence_id in cluster[0:3]:
        print("\t", articles_corpus[sentence_id].title)
    print("\t...")

(1) 37 Articles
	 
	 How Republicans Are Reacting to the Colorado Ruling to Remove Trump From the Ballot
	 Read the Full Transcripts of Donald Trumps Interviews With TIME
	...
(2) 35 Articles
	 What Americas Student Photojournalists Saw at the Campus Protests
	 Columbias Relationship With Student Protesters Has Long Been Fraught
	 Gaza Calls, Columbia Falls: Campus Protesters Defy Suspension Threats and Occupy Hall
	...
(3) 12 Articles
	 High Debt Threatens the U.S. Economy
	 The Republican War on Food Programs
	 Is America in Decline?
	...


### Topic Tagging With Llama 3.1 70B (Skippable)
> **Note**: This is a time- and resource-intensive process, even with ~50 clusters. It is recommended to load this from the checkpoint and skip this step.

In the interest of cost, a local Ollama instance was used instead of an API. Due to the limitations and performance of LLMs, it was necessary to condense the information in the articles to a manageable amount before reaching this step.\
LLMs perform best when the fewest unnecessary tokens are used. To start, the summary of each article in a cluster is tokenized by sentence and added to a central list. `LexRank` is used again to pick the top `10` sentences from this list to form a cluster summary.\
Now, the data is ready for Llama to determine the shared topic. The system prompt used can be seen below and is used to create a specialized Llama instance in Ollama. Each cluster summary is passed to the LLM, which is tasked with finding a couple of words to describe the similarities. This takes quite a bit of time and computing power.
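
The tagging loop can be sketched with the model call passed in as a function, so it runs without Ollama. `tag_clusters` and `fake_llm` are hypothetical names for this illustration. In the real run, the callable would wrap the `ollama.generate(...)['response']` call shown further down.

```python
from typing import Callable, Dict, List


def tag_clusters(cluster_summaries: List[str], generate: Callable[[str], str]) -> Dict[int, str]:
    """Map cluster index -> topic label, dropping clusters the model answers NONE for."""
    topics: Dict[int, str] = {}
    for idx, summary in enumerate(cluster_summaries):
        label = generate(summary).strip()
        if not label or label.upper() == "NONE":
            continue
        topics[idx] = label
    return topics


def fake_llm(prompt: str) -> str:
    """Hypothetical stand-in for the local model; returns a canned label."""
    return "Campus Protests\n" if "campus" in prompt.lower() else "NONE"


demo_summaries = [
    "Students occupied a campus hall.\n\nProtesters defied suspension threats.",
    "An unrelated sentence.\n\nAnother unrelated sentence.",
]
topics = tag_clusters(demo_summaries, fake_llm)
print(topics)  # {0: 'Campus Protests'}
```

Keeping the cluster index as the key means a cluster that gets no topic cannot shift the labels of the clusters after it.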

Install for Ollama

In [22]:
%%capture
%pip install ollama

Create the specialized Llama 3.1 70B instance in Ollama

In [70]:
import ollama

modelspec='''
FROM llama3.1:70b
SYSTEM You will be given a number of article summaries you need to find the common topic between. This topic should only be a couple of words at a maximum and there should only be one topic. Make sure the topic corresponds to EVERY summary provided. If you cannot find a topic respond with only NONE. Summaries are separated by newlines. Respond only with the short topic as described with no extra information.
PARAMETER temperature 0.0
'''

ollama.create(model='llama3.1-70-topic', modelfile=modelspec)

{'status': 'success'}

Tag topics for each cluster with the specialized LLM

In [74]:
from sentence_transformers import SentenceTransformer
from nltk.tokenize import sent_tokenize
import numpy as np
# https://github.com/UKPLab/sentence-transformers/tree/master/examples/applications/text-summarization
from LexRank import degree_centrality_scores

topic_sentences: List[List[str]] = []
for cluster in clusters:
    sentences: List[str] = []
    for idx in cluster:
        sentences.extend(sent_tokenize(articles_corpus[idx].summary))
    topic_sentences.append(sentences)

TOPIC_SUMMARY_LENGTH = 10

model = SentenceTransformer("all-mpnet-base-v2", device="cuda")

# keep each summary paired with its cluster so a skipped cluster cannot shift the indices
summaries: List[Tuple[List[int], str]] = []
with tqdm(total=len(clusters), desc="Summarizing Clusters") as pbar:
    for cluster, sentences in zip(clusters, topic_sentences):
        pbar.update(1)
        sentences = np.array(sentences)
        if len(sentences) < 1:
            continue
        
        embeddings = model.encode(sentences)

        # https://github.com/UKPLab/sentence-transformers/blob/master/examples/applications/text-summarization/text-summarization.py
        similarity_scores = model.similarity(embeddings, embeddings).numpy()
        centrality_scores = degree_centrality_scores(similarity_scores, threshold=None)
        most_central_sentence_indices = np.argsort(-centrality_scores)

        summary = "\n\n".join(sentences[most_central_sentence_indices[0:TOPIC_SUMMARY_LENGTH]])
        summaries.append((cluster, summary))
    pbar.write(f"Summarized {len(summaries)} clusters.")

articles_topic: List[ArticleWithTopic] = []
with tqdm(total=len(summaries), desc="Extracting topics") as pbar:
    for cluster, summary in summaries:
        # strip whitespace so the "NONE" check during visualization matches exactly
        topic = ollama.generate(model="llama3.1-70-topic", prompt=summary)['response'].strip()

        for sentence_id in cluster:
            articles_topic.append(ArticleWithTopic(topic, **articles_corpus[sentence_id].__dict__))

        pbar.update(1)

import pickle
with open(ARTICLES_TOPIC_PICKLE, "wb") as f:
    pickle.dump(articles_topic, f)



Summarizing Clusters:   0%|          | 0/3 [00:00<?, ?it/s]

Summarized 54 clusters.


Extracting topics:   0%|          | 0/54 [00:00<?, ?it/s]

### Main Graph Visualization (Skippable)
> **Note**: The visualization is output as an HTML file. Since HTML cannot be easily loaded into VSCode it is not running in Notebook mode. Open [net.html](./net.html) in your browser to view the graph.

The bulk of the results is seen in the visual generated below. It uses `pyvis` to create a graph of topics and their connected keywords.\
At this step, there were ~5300 unique keywords from the 500 articles, so a lot of extra cleaning is needed. Many words are either too generic or too specific to be meaningful. Additionally, words need to be important to the topic. If a word is not present enough in the cluster, it is not guaranteed to be significant. Likewise, if a word is present in too many topics, it is not meaningful enough to include.

The process for this keyword filtering is as follows:
* A word's total count within the cluster must be at least 75% of the number of articles in the cluster
* A word needs to be present in at most 3 clusters

This helps guarantee that the words are not too infrequent, not too specific, and not too generic. These conditions were tuned iteratively based on the visual.

Each topic is graphed as a node sized according to its number of related articles. A colored edge connects each topic to its filtered keywords. Each keyword node is subtly sized based on the number of clusters it appears in. Topics that share at least one keyword are linked through that keyword's node. The rest is much easier seen in the visual.
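
The two filtering conditions can be tried on toy data with the sketch below. It counts each word once per article, which is a stricter reading than counting every occurrence, so treat it as an illustration rather than a copy of the cell further down. `filter_keywords` and `toy_clusters` are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Set


def filter_keywords(
    clusters: Dict[str, List[List[str]]],
    min_share: float = 0.75,
    max_clusters: int = 3,
) -> Dict[str, Set[str]]:
    """Keep words frequent inside their cluster but present in few clusters overall."""
    # number of clusters each word appears in at all
    residency: Counter = Counter()
    for docs in clusters.values():
        residency.update({w for doc in docs for w in doc})

    kept: Dict[str, Set[str]] = {}
    for name, docs in clusters.items():
        # number of articles in this cluster that contain each word
        doc_freq: Counter = Counter()
        for doc in docs:
            doc_freq.update(set(doc))
        kept[name] = {
            w for w, n in doc_freq.items()
            if n >= min_share * len(docs) and residency[w] <= max_clusters
        }
    return kept


toy_clusters = {
    "election": [["vote", "ballot", "people"], ["vote", "campaign", "people"],
                 ["vote", "ballot", "people"], ["vote", "poll", "people"]],
    "economy": [["debt", "inflation", "people"], ["debt", "people"]],
    "campus": [["protest", "student", "people"], ["protest", "people"]],
    "health": [["vaccine", "people"], ["vaccine", "clinic", "people"]],
}
keywords = filter_keywords(toy_clusters)
print(keywords["election"])  # {'vote'}
```

Here "people" is dropped everywhere because it appears in all four clusters, and "ballot" is dropped because it appears in only half of the election articles.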

Install for `pyvis`

In [23]:
%%capture
%pip install pyvis

Graph visualization code

In [39]:
from nltk.probability import FreqDist
from pyvis.network import Network
from collections import defaultdict
from statistics import median


net = Network(height="100vh", width="100%", bgcolor="#222222", font_color="white", cdn_resources="remote", neighborhood_highlight=True)
net.toggle_physics(True)

all_words: List[str] = []

# remove NONE topic where LLM could not tag the cluster for some reason
for at in articles_topic:
    if at.topic == "NONE":
        continue
    all_words.extend(at.corpus)

shared_words: List[str] = []
word_cluster_residency: List[str] = []
# get filtered set of words shared within their clusters within requirements and populate the word cluster frequency dict
for cluster in clusters:
    cluster_words: List[str] = []
    for idx in cluster:
        ac = articles_corpus[idx]
        cluster_words.extend(ac.corpus)
    cluster_freq = FreqDist(cluster_words)
    shared_words.extend([word for word in cluster_words if cluster_freq[word] >= len(cluster) * .75])
    word_cluster_residency.extend(set(cluster_words))

word_cluster_count = FreqDist(word_cluster_residency)

topic_articles: Dict[str, List[ArticleWithTopic]] = defaultdict(list)
# collect final articles by topic
for article in articles_topic:
    topic_articles[article.topic].append(article)

# extra word stats for debug/checking
word_freq = FreqDist(all_words)
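# note: this divides the total token count by the number of articles, so it is tokens per article rather than a per-word mean
mean_freq = sum(word_freq.values()) / len(articles_corpus)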
median_freq = median(word_freq.values())
print(f"Total Unique Words={len(set(all_words))}")
print(f"Mean Word Frequency Overall={mean_freq}")
print(f"Median Word Frequency Overall={median_freq}")
# filter words present in more than 3 clusters
shared_words = list(set([word for word in shared_words if word_cluster_count[word] <= 3 and word in all_words]))
# add words as nodes sizing by how many clusters they appear in
net.add_nodes(shared_words, size=[min(max(word_cluster_count[word] * 3, 1), 10) for word in shared_words])

edges: List[Tuple[str, str, str]] = []  # (topic, word, color)
topics_n_edges: Dict[str, int] = defaultdict(lambda: 0)
topic_colors: Dict[str, str] = {}
# one color for each topic
colors = get_combined_colormap(len(topic_articles.items()))
# delete topic group "NONE" if it exists
if "NONE" in topic_articles:
    del topic_articles["NONE"]
i = 0
# determine topic colors and define colored edges between topics and words
for topic, articles in topic_articles.items():
    color = colors[i]
    topic_colors[topic] = color
    i += 1
    topic_words: Set[str] = set()
    for article in articles:
        topic_words = topic_words.union(article.corpus)
    for word in topic_words:
        if word in shared_words:
            topics_n_edges[topic] = topics_n_edges[topic] + 1
            edges.append((topic, word, color))

# delete totally disjoint topics with no edges
for k in list(topic_articles.keys()):
    if topics_n_edges[k] < 1:
        del topic_articles[k]

# create topic nodes sized by relative proportion of all articles
max_articles = max([len(k) for k in topic_articles.values()])
for topic, articles in topic_articles.items():
    net.add_node(topic, size=(len(articles)/max_articles) * 50 + 20, color=topic_colors[topic])

# create colored edges defined above
for k1, k2, color in edges:
    net.add_edge(source=k1, to=k2, color=color)

net.save_graph("net.html")
print("Open net.html in your browser to view.")

Total Unique Words=5081
Mean Word Frequency Overall=178.82565130260522
Median Word Frequency Overall=4
Open net.html in your browser to view.


### Visualization Findings
A number of things can be seen from the visual. The topics and words are familiar from recent events, and the associated words are intuitive the majority of the time.\
Since the most recent 500 articles were used, the data goes back a few months and some topics are fairly general. The largest and most connected topics are those relating to politics and economics.\
This section will be continued further in the final paper.\
Much more work could be done in this notebook, but ultimately the results are readable and a lot of time has already been invested in it.