# NLP Final Project - sample data

For this final project, there is a collection of ~200K news articles on our favorite topics, data science, machine learning, and artificial intelligence. Our task is to identify what industries and job lines are going to be most impacted by AI over the next several years, based on the information/insights you can extract from this text corpus.

Goal: provide actionable recommendations on what can be done with AI to automate the jobs, improve employee productivity, and generally make AI adoption successful. Please pay attention to the introduction of novel technologies and algorithms, such as AI for image generation and Conversational AI, as they represent the entire paradigm shift in adoption of AI technologies and data science in general.


## Importing Data

In [1]:
import pandas as pd
df_news_final_project = pd.read_parquet('https://storage.googleapis.com/msca-bdp-data-open/news_final_project/news_final_project.parquet', engine='pyarrow')
df_news_final_project.shape

(200141, 5)

In [2]:
df_news_final_project.head()

Unnamed: 0,url,date,language,title,text
0,http://auckland.scoop.co.nz/2020/01/aut-boosts...,2020-01-28,en,auckland.scoop.co.nz » AUT boosts AI expertise...,\n\nauckland.scoop.co.nz » AUT boosts AI exper...
1,http://spaceref.com/astronomy/observation-simu...,2021-07-05,en,"Observation, Simulation, And AI Join Forces To...","\n\nObservation, Simulation, And AI Join Force..."
2,http://www.mysmartrend.com/news-briefs/technic...,2020-04-17,en,Cr Bard Inc Has Returned 48.9% Since SmarTrend...,\n\nCr Bard Inc Has Returned 48.9% Since SmarT...
3,http://www.productivityapps.itbusinessnet.com/...,2020-06-23,en,Applitools Visual AI Reaches One Billion Image...,\n\nApplitools Visual AI Reaches One Billion I...
4,http://www.sbwire.com/press-releases/data-scie...,2020-12-24,en,Data Science and Machine-Learning Platforms Ma...,\n\nData Science and Machine-Learning Platform...


## Creating Sample

In [6]:
sample = df_news_final_project.sample(n=200, random_state=42)


# Display the sampled DataFrame
sample.head()


Unnamed: 0,url,date,language,title,text
126836,https://www.kold.com/prnewswire/2022/03/16/chi...,2022-03-16,en,"CHIPOTLE TESTS AI KITCHEN ASSISTANT, CHIPPY","CHIPOTLE TESTS AI KITCHEN ASSISTANT, CHIPPY\n\..."
69009,https://www.newsbytesapp.com/news/science/amaz...,2023-04-14,en,"How Amazon is competing against Microsoft, Goo...",\n\n\nHow Amazon is competing against Microsof...
188489,https://www.beckershospitalreview.com/healthca...,2022-07-19,en,Healthcare AI startup Olive laying off 450 emp...,\nHealthcare AI startup Olive laying off 450 e...
176028,https://www.rittmanmead.com/blog/2020/02/accel...,2020-02-20,en,Accelerated Data Science SDK Configuration,\n\nAccelerated Data Science SDK Configuration...
174638,https://www.13abc.com/prnewswire/2023/09/19/or...,2023-09-19,en,Oracle Introduces Generative AI Capabilities t...,Oracle Introduces Generative AI Capabilities t...


## Data Cleaning

In [7]:
# Check for missing values
print(sample.isnull().sum())

url         0
date        0
language    0
title       0
text        0
dtype: int64


No missing values.

In [8]:
# Check for duplicates
duplicates = sample.duplicated()
print(f"Number of duplicates: {duplicates.sum()}")

Number of duplicates: 0


## Text Cleaning

In [9]:
import re

def clean_text(text):
    text = re.sub(r'http\S+', '', text)  # Remove URLs
    text = re.sub(r'\s+', ' ', text)  # Remove extra whitespace
    text = re.sub(r'[^\w\s]', '', text)  # Remove special characters
    return text

sample['cleaned_text'] = sample['text'].apply(clean_text)


In [10]:
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

nltk.download('punkt')
nltk.download('wordnet')
nltk.download('stopwords')

lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))

def preprocess_text(text):
    tokens = word_tokenize(text.lower())  # Tokenize and lower case
    tokens = [lemmatizer.lemmatize(token) for token in tokens if token.isalpha()]  # Lemmatize and remove non-alphabetic tokens
    tokens = [token for token in tokens if token not in stop_words]  # Remove stopwords
    return tokens

sample['tokens'] = sample['cleaned_text'].apply(preprocess_text)


[nltk_data] Downloading package punkt to /Users/csong97/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /Users/csong97/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/csong97/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [13]:
sample.head()

Unnamed: 0,url,date,language,title,text,cleaned_text,tokens
126836,https://www.kold.com/prnewswire/2022/03/16/chi...,2022-03-16,en,"CHIPOTLE TESTS AI KITCHEN ASSISTANT, CHIPPY","CHIPOTLE TESTS AI KITCHEN ASSISTANT, CHIPPY\n\...",CHIPOTLE TESTS AI KITCHEN ASSISTANT CHIPPY Ski...,"[chipotle, test, ai, kitchen, assistant, chipp..."
69009,https://www.newsbytesapp.com/news/science/amaz...,2023-04-14,en,"How Amazon is competing against Microsoft, Goo...",\n\n\nHow Amazon is competing against Microsof...,How Amazon is competing against Microsoft Goo...,"[amazon, competing, microsoft, google, ai, rac..."
188489,https://www.beckershospitalreview.com/healthca...,2022-07-19,en,Healthcare AI startup Olive laying off 450 emp...,\nHealthcare AI startup Olive laying off 450 e...,Healthcare AI startup Olive laying off 450 em...,"[healthcare, ai, startup, olive, laying, emplo..."
176028,https://www.rittmanmead.com/blog/2020/02/accel...,2020-02-20,en,Accelerated Data Science SDK Configuration,\n\nAccelerated Data Science SDK Configuration...,Accelerated Data Science SDK Configuration Se...,"[accelerated, data, science, sdk, configuratio..."
174638,https://www.13abc.com/prnewswire/2023/09/19/or...,2023-09-19,en,Oracle Introduces Generative AI Capabilities t...,Oracle Introduces Generative AI Capabilities t...,Oracle Introduces Generative AI Capabilities t...,"[oracle, introduces, generative, ai, capabilit..."


In [75]:
print(type(sample['tokens'][69009]))

<class 'list'>


## Filter relevant articles

In [128]:
# Define keywords that might indicate relevance to AI, machine learning, etc.
keywords = ['artificial intelligence', 'AI', 'machine learning', 'ML', 'deep learning', 'neural network', 'data science']

# Filtering function to check for the presence of keywords
def filter_relevant_articles(text):
    return any(keyword in text.lower() for keyword in keywords)

# Apply the filter function
sample = sample[sample['cleaned_text'].apply(filter_relevant_articles)]


## Create Dictionary & Corpus

In [129]:
from gensim import corpora

# Create a dictionary representation of the documents
dictionary = corpora.Dictionary(sample['tokens'])

# Filter out rare and common tokens
dictionary.filter_extremes(no_below=3, no_above=0.5)

# Create a corpus from the dictionary representation
corpus = [dictionary.doc2bow(tokens) for tokens in sample['tokens']]


In [130]:
# corpus - document to bag of words mapping

print('Number of unique tokens: %d' % len(dictionary))
print('Number of documents: %d' % len(corpus))

Number of unique tokens: 4492
Number of documents: 137


## BERTopic

In [98]:
from bertopic import BERTopic
from bertopic.vectorizers import ClassTfidfTransformer

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer

from umap import UMAP
from hdbscan import HDBSCAN
from sentence_transformers import SentenceTransformer

In [110]:
# instantiating the embedding model
embedding_model = SentenceTransformer("all-MiniLM-L6-v2")

# Encoding the entire corpus to get the embeddings
corpus_embeddings = embedding_model.encode(corpus)

In [111]:
# save embeddings

import numpy as np

np.save('embeddings.npy', corpus_embeddings)  # Save
loaded_embeddings = np.load('embeddings.npy')  # Load


In [112]:
# Outputting the results
print("Number of sentences:", len(loaded_embeddings))
for i, embedding in enumerate(loaded_embeddings):
    print(f"\nEmbedding for sentence {i+1} (Dimensions: {len(embedding)}):")
    print(embedding)

Number of sentences: 200

Embedding for sentence 1 (Dimensions: 384):
[ 4.61167805e-02  6.72440678e-02 -5.97034767e-02 -1.98393166e-02
 -6.75167367e-02 -8.55588261e-03  6.50242791e-02 -3.33559662e-02
  2.57946420e-02 -8.44580233e-02 -2.14971224e-04 -8.27888474e-02
  5.00708371e-02  2.58647706e-02 -1.39342593e-02 -3.08165476e-02
 -8.12518597e-02 -8.74144062e-02 -1.19804349e-02  4.05123271e-02
  4.39748131e-02  4.29789610e-02 -2.31525842e-02  3.80028747e-02
  2.91329958e-02 -9.71123297e-03  8.77918229e-02  6.17577992e-02
  7.80545687e-03 -2.48631947e-02 -1.44937569e-02 -1.22951001e-01
 -2.79651000e-03 -2.44545779e-04  2.62370538e-02 -3.12412344e-02
  4.78930734e-02 -2.67221574e-02 -3.54568101e-02  6.31025583e-02
 -1.04654264e-02 -7.47207031e-02  6.25477433e-02 -2.15796437e-02
 -7.11317211e-02  4.53871638e-02  1.91807244e-02  1.00624003e-01
  3.02315485e-02 -6.77172467e-02 -4.19888683e-02  5.00252582e-02
 -5.34921177e-02  3.42230778e-03  5.21486513e-02  2.92243697e-02
 -9.02218148e-02  1.

This next step is the dimensionality reduction. As you can see from above, there can be quite a few dimensions. The dimensionality of the embedding depends on the model used; some models can produced embeddings a few thousand dimensions long (for fun, see here: https://huggingface.co/spaces/mteb/leaderboard). 

The default dimensionality reduction algorithm for BERTopic is the UMAP. UMAP, like some other dimensionality reduction algorithm are stochastic in nature. We can create repoducible results using a random state seed.

In [113]:
# Step 2 - Reduce dimensionality
umap_model = UMAP(n_neighbors=15, n_components=5, min_dist=0.0, metric='cosine', random_state=RANDOM_STATE)

Clustering is the next step in the BERTopic pipeline. After we have our documents embedded in some n-dimensional space, we have to cluster them. Like other components, we can just which clustering algorithm we want to use. HDBSCAN is a great choice because (1) it allows us to find clusters of different shapes (as opposed to k-means which assumes "spherical" shapes), and (2) it can help us identify outliers so we don't force documents into a given topic.

**Clustering defines the number of topics that BERTopic produces.** The `nr_topics` parameter of the `BERTopic` class, controls the topics by **merging the topics after they have been created**. We can instead limit the number of topics during the clustering stage by adjusting `min_cluster_size`. A higher value will generate fewer clusters (topics) and a smaller value will generate more clusters (topics).

In [114]:
# Step 3 - Cluster reduced embeddings
hdbscan_model = HDBSCAN(min_cluster_size=15, metric='euclidean', cluster_selection_method='eom', prediction_data=True)

The next step is the CountVectorizer step which tokenizes our text for us, allowing us to specify n-grams, stop words, etc. Keep in mind that stop words can be task specific so we can always pass a custom list of stop words. The reason we tokenize text **after** embeddings is the because embedding models are sophisticated enough to make use of grammatical artifacts in text. Tokenization here allows us extract key words that we can use to intepret the clusters. Furthermore the fact that we are tokenizing after the emebddings means we can tune these paremters to get best representation for the clusters: **the clusters are already created, changing the tokenization allows us to find the best interpretation**.

In [115]:
# Step 4 - Tokenize topics
vectorizer_model = CountVectorizer(stop_words="english", ngram_range=(1, 2))

# Step 5 - Create topic representation
ctfidf_model = ClassTfidfTransformer()

In [116]:
# All steps together; customize the BERTopic pipeline

topic_model = BERTopic(
  embedding_model=embedding_model,          # Step 1 - Extract embeddings
  umap_model=umap_model,                    # Step 2 - Reduce dimensionality
  hdbscan_model=hdbscan_model,              # Step 3 - Cluster reduced embeddings
  vectorizer_model=vectorizer_model,        # Step 4 - Tokenize topics
  ctfidf_model=ctfidf_model,                # Step 5 - Extract topic words
)

### Training Model

In [117]:
# training model
topics, probs = topic_model.fit_transform(sample["cleaned_text"])

In [118]:
# topics contains the topic assingment for each document
print(
    len(topics) == len(sample["cleaned_text"]),
    topics[:10], # topics assignmetns of first 10 articles,
    sep="\n\n"
)

True

[0, 0, 0, 0, 0, 0, 0, 0, 0, 0]


In [119]:
# probs store the probabilities that each topic
# by default only returns the probability for the assignment topic
# this behavior can be changed using the calculate_proababilities parameter when instantiating the model
probs[:10]

array([1.        , 1.        , 1.        , 1.        , 0.90459662,
       1.        , 1.        , 0.90439826, 1.        , 1.        ])

In [120]:
# can get the topics from the model
# why are we seeing stop words? because we used default parameters upon initialization, which means we are using
# CountVectorizer, which doesn't remove stopwords for us
topic_model.get_topic_info()

Unnamed: 0,Topic,Count,Name,Representation,Representative_Docs
0,0,182,0_ai_news_new_best,"[ai, news, new, best, market, data, technology...",[ Valens Semiconductor and AI Image Processing...
1,1,18,1_px_jpeg_px 300_300,"[px, jpeg, px 300, 300, generated image, rawpi...",[Laughing adult movie kid AI Free Photo rawp...


In [121]:
# can also use this
topic_model.get_topics()

{0: [('ai', 0.027483926595978658),
  ('news', 0.01755824614864792),
  ('new', 0.012851512538110321),
  ('best', 0.012390756846549798),
  ('market', 0.011439280465450657),
  ('data', 0.010683804975314872),
  ('technology', 0.00922166103920482),
  ('2023', 0.008890831695108023),
  ('business', 0.008736310813817866),
  ('intelligence', 0.008133664329406287)],
 1: [('px', 0.13685303248195832),
  ('jpeg', 0.13440306511649594),
  ('px 300', 0.13440306511649594),
  ('300', 0.1306917417884623),
  ('generated image', 0.09926508270448099),
  ('rawpixel', 0.09926508270448099),
  ('ai generated', 0.09926508270448099),
  ('generated', 0.09277131442775867),
  ('image', 0.08792875910136755),
  ('ai', 0.0802980809257699)]}

In [122]:
# can get specific topic
topic_model.get_topic(0)

[('ai', 0.027483926595978658),
 ('news', 0.01755824614864792),
 ('new', 0.012851512538110321),
 ('best', 0.012390756846549798),
 ('market', 0.011439280465450657),
 ('data', 0.010683804975314872),
 ('technology', 0.00922166103920482),
 ('2023', 0.008890831695108023),
 ('business', 0.008736310813817866),
 ('intelligence', 0.008133664329406287)]

In [123]:
# topics

In [124]:
# accessing the frequent topics
topic_model.get_document_info(sample["cleaned_text"])

Unnamed: 0,Document,Topic,Name,Representation,Representative_Docs,Top_n_words,Probability,Representative_document
0,CHIPOTLE TESTS AI KITCHEN ASSISTANT CHIPPY Ski...,0,0_ai_news_new_best,"[ai, news, new, best, market, data, technology...",[ Valens Semiconductor and AI Image Processing...,ai - news - new - best - market - data - techn...,1.000000,False
1,How Amazon is competing against Microsoft Goo...,0,0_ai_news_new_best,"[ai, news, new, best, market, data, technology...",[ Valens Semiconductor and AI Image Processing...,ai - news - new - best - market - data - techn...,1.000000,False
2,Healthcare AI startup Olive laying off 450 em...,0,0_ai_news_new_best,"[ai, news, new, best, market, data, technology...",[ Valens Semiconductor and AI Image Processing...,ai - news - new - best - market - data - techn...,1.000000,False
3,Accelerated Data Science SDK Configuration Se...,0,0_ai_news_new_best,"[ai, news, new, best, market, data, technology...",[ Valens Semiconductor and AI Image Processing...,ai - news - new - best - market - data - techn...,1.000000,False
4,Oracle Introduces Generative AI Capabilities t...,0,0_ai_news_new_best,"[ai, news, new, best, market, data, technology...",[ Valens Semiconductor and AI Image Processing...,ai - news - new - best - market - data - techn...,0.904597,False
...,...,...,...,...,...,...,...,...
195,Opinion Discussing Social Justice with Bard a...,0,0_ai_news_new_best,"[ai, news, new, best, market, data, technology...",[ Valens Semiconductor and AI Image Processing...,ai - news - new - best - market - data - techn...,1.000000,False
196,UNs English Language Day celebrated today to ...,0,0_ai_news_new_best,"[ai, news, new, best, market, data, technology...",[ Valens Semiconductor and AI Image Processing...,ai - news - new - best - market - data - techn...,1.000000,False
197,Leading AI decisioning platform Rich Data Co s...,0,0_ai_news_new_best,"[ai, news, new, best, market, data, technology...",[ Valens Semiconductor and AI Image Processing...,ai - news - new - best - market - data - techn...,1.000000,False
198,Highlymechanised society AI challenging moder...,0,0_ai_news_new_best,"[ai, news, new, best, market, data, technology...",[ Valens Semiconductor and AI Image Processing...,ai - news - new - best - market - data - techn...,1.000000,False


In [126]:
print(len(topic_model.get_topics()))


2


In [125]:
# visualization can get at the idea their may be little practical difference between clusters
topic_model.visualize_topics()

ValueError: zero-size array to reduction operation maximum which has no identity