<a href='https://ai.meng.duke.edu'> <img align="left" style="padding-top:10px;" src="https://storage.googleapis.com/aipi_datasets/Duke-AIPI-Logo.png"></a>

# Topic Modeling using Defined Topics
In some cases we have a pre-existing list of topics and want to identify which topic(s) each document in a collection covers. If a portion of the documents were labeled with their topics, we could train a supervised classification model, but we can also do this with an unsupervised approach that requires no labels.

In [3]:
import requests
from bs4 import BeautifulSoup
from sklearn.feature_extraction.text import CountVectorizer
import spacy
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
from nltk.corpus import stopwords
import numpy as np

## Get documents to tag with topics
We will use requests to download a few articles from the web and BeautifulSoup to extract the text content from the HTML.

In [4]:
# Get article
article_urls = ['https://www.cbssports.com/college-basketball/news/duke-basketballs-game-vs-clemson-postponed-due-to-positive-covid-19-tests-in-blue-devils-program/',
                'https://www.usatoday.com/story/news/health/2021/12/21/covid-holiday-safety-need-to-know/8968198002/',
                'https://www.fayobserver.com/story/sports/college/basketball/2021/12/29/duke-blue-devils-basketball-recruiting-jon-scheyer-commits/9032663002/',
                'https://www.today.com/health/health/covid-19-cold-flu-tell-difference-rcna10114',
                'https://www.dukechronicle.com/article/2021/06/duke-mens-basketball-head-coach-jon-scheyer-mike-krzyzewski',
                'https://www.hopkinsmedicine.org/health/conditions-and-diseases/coronavirus']
article_text = []
titles = []
for url in article_urls:
    page = requests.get(url)
    soup = BeautifulSoup(page.content, 'html.parser')
    # Extract body text from article
    bodytext = soup.find_all('p')
    bodytext = [i.text for i in bodytext]
    bodytext = ' '.join(bodytext)
    article_text.append(bodytext)
    # Extract titles for articles
    title = soup.find_all('h1')
    title = title[0].text.strip()
    titles.append(title)
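
To see what the extraction step does without making any network requests, here is a minimal sketch that applies the same `find_all('p')` and `h1` logic to an inline HTML string. The `sample_html` string and the `extract_title_and_body` helper are hypothetical and exist only for illustration. Unlike the loop above, the helper returns an empty title when a page has no `<h1>` element instead of raising an `IndexError`.

```python
from bs4 import BeautifulSoup

# Hypothetical page content standing in for a downloaded article
sample_html = """
<html><body>
  <h1>  Example headline  </h1>
  <p>First paragraph.</p>
  <div><p>Second <b>paragraph</b>.</p></div>
  <script>var x = 1;</script>
</body></html>
"""

def extract_title_and_body(html):
    """Return (title, body_text) for an HTML string (illustrative helper)."""
    soup = BeautifulSoup(html, 'html.parser')
    # Join the text of every <p> element, wherever it is nested
    paragraphs = [p.get_text() for p in soup.find_all('p')]
    body = ' '.join(paragraphs)
    # Use the first <h1> as the title, or an empty string if there is none
    h1 = soup.find('h1')
    title = h1.get_text().strip() if h1 is not None else ''
    return title, body

sample_title, sample_body = extract_title_and_body(sample_html)
print(sample_title)
print(sample_body)
```

Only text inside `<p>` elements is kept, so script content and navigation text outside paragraphs are ignored. Real article pages may still include paragraphs that are not part of the story, such as captions or footers.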


Below we define a list of topics that appear in our set of documents. Our goal is then to identify, for each article, the topic from the list that best matches it.

In [1]:
topic_list = ['coronavirus','Duke basketball']

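## Embed topics and documents and find closest matching topics
We embed each article and each candidate topic with a pretrained sentence-transformer model, then assign each article the topic whose embedding has the highest cosine similarity to the article's embedding.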

In [11]:
def model_topics(documents,candidates):
    model = SentenceTransformer('all-MiniLM-L6-v2')
    # Encode all of the articles in one batch -> array of shape (n_docs, dim)
    # Note: this model truncates long inputs (256 word pieces by default),
    # so only the beginning of each article contributes to its embedding
    doc_embeddings = model.encode(documents)
    # Encode the candidate topics -> array of shape (n_candidates, dim)
    candidate_embeddings = model.encode(candidates)

    # Calculate cosine similarity between each document and each candidate topic
    scores = cosine_similarity(doc_embeddings, candidate_embeddings)
    # Take the highest-scoring candidate as the topic for each document
    topics = [candidates[i] for i in scores.argmax(axis=1)]
    
    return topics
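
The function above returns only the single best topic for each document. Since a document may cover more than one topic, or none of them, it can help to look at the similarity scores themselves. The following is a minimal sketch of that idea. It uses made-up 3-dimensional vectors in place of real model embeddings, and the `cosine_scores` and `assign_topics` helpers and the `threshold` value of 0.5 are assumptions chosen for illustration, not tuned settings.

```python
import numpy as np

def cosine_scores(doc_vecs, topic_vecs):
    """Cosine similarity between every document row and every topic row."""
    doc_unit = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    topic_unit = topic_vecs / np.linalg.norm(topic_vecs, axis=1, keepdims=True)
    return doc_unit @ topic_unit.T  # shape (n_docs, n_topics)

def assign_topics(scores, labels, threshold=0.5):
    """Return, per document, every (label, score) at or above the threshold,
    sorted from highest to lowest score (hypothetical helper)."""
    assigned = []
    for row in scores:
        order = np.argsort(row)[::-1]
        assigned.append([(labels[i], float(row[i])) for i in order
                         if row[i] >= threshold])
    return assigned

toy_labels = ['coronavirus', 'Duke basketball']
# Toy vectors for illustration only; these are not model output
toy_topics = np.array([[1.0, 0.0, 0.0],   # stands in for 'coronavirus'
                       [0.0, 1.0, 0.0]])  # stands in for 'Duke basketball'
toy_docs = np.array([[0.9, 0.1, 0.0],     # close to the first topic
                     [0.6, 0.6, 0.1],     # close to both topics
                     [0.0, 0.1, 0.9]])    # close to neither topic

toy_scores = cosine_scores(toy_docs, toy_topics)
toy_assigned = assign_topics(toy_scores, toy_labels, threshold=0.5)
for i, matches in enumerate(toy_assigned):
    print('Document {}: {}'.format(i, matches))
```

With real embeddings, a suitable threshold depends on the model and the topic phrasing, so it would need to be chosen by inspecting scores on a sample of documents.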

In [13]:
topics = model_topics(article_text,topic_list)
for i,topic in enumerate(topics):
    print('Article {}: {}'.format(i,titles[i]))
    print('Topic: {}'.format(topic))
    print()

Article 0: Duke basketball games vs. Clemson, Notre Dame postponed due to positive COVID-19 tests in Blue Devils program
Topic: Duke basketball

Article 1: Vaccinated and test positive? What to know about omicron, COVID for this holiday season.
Topic: coronavirus

Article 2: How did Duke basketball and Jon Scheyer keep up their major recruiting hot streak in December?
Topic: Duke basketball

Article 3: Is it COVID-19 or just a cold? Here's how to tell the difference
Topic: coronavirus

Article 4: Jon Scheyer to succeed Mike Krzyzewski after Duke men's basketball's 2021-22 season
Topic: Duke basketball

Article 5: What Is Coronavirus?
Topic: coronavirus

