Natural Language Processing (NLP) is a hot topic in machine learning. It involves analyzing and understanding text-based data. Clustering algorithms are quite popular in NLP. They group a set of unlabeled texts so that texts in the same cluster are more similar to one another than to texts in other clusters. Topic modeling is one application of clustering in NLP. It uses unsupervised learning to extract topics from a collection of documents. Other applications include automatic document organization and fast information retrieval or filtering.

You'll learn how to use Cohere’s NLP tools to perform semantic search and clustering of AI papers, which will help you discover trends in AI. You'll scrape the Journal of Artificial Intelligence Research to produce a list of recently published AI papers. You’ll then use Cohere’s Embed Endpoint to generate embeddings for your list of AI papers. Finally, you'll visualize the embeddings and build semantic search and topic modeling on top of them.


# **Pre-Requisites**
To follow along with this tutorial, you need to be familiar with Python. Make sure you have Python 3.6+ installed on your development machine. You can also use Google Colab to try out the project in the cloud. Finally, you need a Cohere account. If you haven’t signed up already, register for a new Cohere account. All new accounts receive $75 in free credits. Once you use up your credits, you'll move to a Pay-As-You-Go option.

# **Getting Started**

First, install the Python dependencies required to run the project. Use pip to install them with the command below:

In [17]:
pip install requests beautifulsoup4 cohere altair clean-text numpy pandas scikit-learn > /dev/null

Create a new Python file named cohere_nlp.py and write all your code in this file. Import the dependencies and initialize Cohere’s client.

In [18]:
import cohere

# Paste your API key here. Remember not to share it publicly
api_key = '<API_KEY>'
co = cohere.Client(api_key)
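
Since the key should stay private, one option is to read it from an environment variable instead of pasting it into the file. The sketch below assumes a variable named `COHERE_API_KEY`; that name and the `load_api_key` helper are illustrative choices, not part of Cohere's SDK.

```python
import os

def load_api_key(env_var="COHERE_API_KEY"):
    """Return the API key stored in the given environment variable."""
    key = os.environ.get(env_var)
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable first")
    return key

# Usage (assumes the variable was exported in your shell beforehand):
# co = cohere.Client(load_api_key())
```

Failing early with a clear message is usually easier to debug than a rejected request later on.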

# Data Collection and Cleaning
This tutorial focuses on applying topic modeling to look for recent trends in AI. This task requires you to source a list of titles for AI papers. Use web scraping techniques to collect a list of AI papers. Use the Journal of Artificial Intelligence Research as your data source. Finally, you will clean this data by removing unwanted characters and stop words.


First, import the required libraries to make web requests and process the web content.


In [19]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np
from cleantext import clean

Next, make an HTTP request to the source website that has an archive of the AI papers. 

In [20]:
URL = "https://www.jair.org/index.php/jair/issue/archive"
page = requests.get(URL)

Use this archive to get the list of published AI papers. This archive has papers published since 2015. This tutorial considers only recent papers, published in 2020 or later.

In [21]:
soup = BeautifulSoup(page.content, "html.parser")
archive_links = []

for link in soup.select('a.title'):
  vol = link.text
  link = link.get('href')
  year = int(vol[vol.find("(")+1:vol.find(")")])
  if year >= 2020:
    archive_links.append({ 'year': year, 'link': link })
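
The slice-based parsing above assumes every volume title contains a year in parentheses. A slightly more defensive variant, sketched here with a hypothetical `extract_year` helper and made-up sample titles, uses a regular expression and returns `None` when no year is found.

```python
import re

def extract_year(volume_title):
    """Return the four-digit year found in parentheses, or None."""
    match = re.search(r"\((\d{4})\)", volume_title)
    return int(match.group(1)) if match else None

# Sample titles are made up for illustration
for title in ["Vol. 73 (2022)", "Vol. 67 (2020)", "Special Track"]:
    print(title, "->", extract_year(title))
```

In the loop above, you could then skip any link for which the helper returns `None` instead of letting `int()` raise an error.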

Finally, you’ll need to fetch each archive page and clean the titles of the AI papers gathered. Remove trailing white spaces and unwanted characters. Optionally, use the NLTK library to get English stop words and filter them out.

In [22]:
papers = []
for archive in archive_links:
  page = requests.get(archive['link'])
  soup = BeautifulSoup(page.content, "html.parser")
  links = soup.select('h3.media-heading a')
  for link in links:
    # clean the title
    title = clean(text=link.text,
            fix_unicode=True,
            to_ascii=True,
            lower=True,
            no_line_breaks=False,
            no_urls=False,
            no_emails=False,
            no_phone_numbers=False,
            no_numbers=False,
            no_digits=False,
            no_currency_symbols=False,
            no_punct=False,
            replace_with_punct="",
            replace_with_url="This is a URL",
            replace_with_email="Email",
            replace_with_phone_number="",
            replace_with_number="123",
            replace_with_digit="0",
            replace_with_currency_symbol="$",
            lang="en")
    papers.append({ 'year': archive['year'], 'title': title, 'link': link.get('href') })
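
The cell above handles character-level cleanup but does not drop stop words. A minimal sketch of that step follows; the tiny `STOP_WORDS` set and the `remove_stop_words` helper are illustrative assumptions, and in practice you would likely substitute the fuller list from NLTK's `stopwords.words('english')`.

```python
# Hand-picked stop words for illustration only; NLTK ships a fuller list
STOP_WORDS = {"a", "an", "and", "for", "in", "of", "on", "the", "to", "with"}

def remove_stop_words(title, stop_words=STOP_WORDS):
    """Drop stop words from a lowercased, whitespace-separated title."""
    kept = [word for word in title.split() if word not in stop_words]
    return " ".join(kept)

print(remove_stop_words("a primer on the limits of graph kernels"))
```

Whether stop-word removal helps the embeddings is worth checking on your own data, since embedding models are typically trained on natural sentences.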


The dataset created using this process has 260 AI papers published between 2020 and 2022. Use the pandas library to create a data frame to hold the text data.

In [23]:
df = pd.DataFrame(papers)
print(len(df))

260
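
As a quick sanity check on the scrape, you could also count how many papers fall in each year. The sketch below uses made-up rows in the same shape as the `papers` list; `count_by_year` is a hypothetical helper.

```python
from collections import Counter

def count_by_year(rows):
    """Return a {year: number of papers} mapping, sorted by year."""
    counts = Counter(row['year'] for row in rows)
    return dict(sorted(counts.items()))

# Made-up rows for illustration; pass the real `papers` list instead
sample_rows = [
    {'year': 2022, 'title': 'graph kernels'},
    {'year': 2020, 'title': 'auction design'},
    {'year': 2022, 'title': 'nash equilibria'},
]
print(count_by_year(sample_rows))
```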


# Create and Visualize Word Embeddings
Word embedding is a technique for learning numeric representations of text. Words or sentences with similar meanings have similar representations. You can use these embeddings to:

•	cluster large amounts of text
•	match a query with other similar sentences
•	perform classification tasks like sentiment classification

Cohere’s platform provides an Embed Endpoint that returns text embeddings. An embedding is a list of floating-point numbers. They capture the semantic meaning of the represented text. Models used to create these embeddings are available in 3 sizes: small, medium, and large. Small models are faster while large models offer better performance.

Write a function to create the word embeddings using Cohere. The function should read as follows:


In [24]:
# Get text embeddings
def get_embeddings(text,model='medium'):
  output = co.embed(
                model=model,
                texts=[text])
  return output.embeddings[0]

Create a new column in your pandas data frame to hold the embeddings created.

In [25]:
df['title_embeds'] = df['title'].apply(get_embeddings)
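
Calling the endpoint through `apply` means one request per row. Because `co.embed` accepts a list of texts, a batched variant is possible. The sketch below is only an illustration: `embed_in_batches`, the `embed_fn` parameter, and the batch size of 64 are all hypothetical, so check the API's current limits before relying on them.

```python
def embed_in_batches(texts, embed_fn, batch_size=64):
    """Embed texts in chunks, calling embed_fn once per chunk.

    embed_fn takes a list of strings and returns one embedding per string.
    """
    embeddings = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        embeddings.extend(embed_fn(batch))
    return embeddings

# With the Cohere client from earlier, embed_fn could look like this:
# embed_fn = lambda batch: co.embed(model='medium', texts=batch).embeddings
# df['title_embeds'] = embed_in_batches(df['title'].tolist(), embed_fn)
```

Passing the embedding call in as a function keeps the batching logic testable without network access.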

Congratulations! You have created the word embeddings. Now, you will visualize the embeddings using a scatter plot. First, you need to reduce the dimensions of the embeddings. You’ll use Principal Component Analysis (PCA) to achieve this. Import the necessary packages and create a function to return the principal components.

In [26]:
# Reduce dimensionality using PCA
from sklearn.decomposition import PCA

# Function to return the principal components
def get_pc(arr,n):
  pca = PCA(n_components=n)
  embeds_transform = pca.fit_transform(arr)
  return embeds_transform
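
To see roughly what `PCA` does, here is a NumPy sketch of the same idea: center the data, take the singular value decomposition, and project onto the top `n` directions. The `pca_project` name and the toy matrix are illustrative; scikit-learn's implementation differs in details such as sign conventions, so its output may be mirrored along an axis.

```python
import numpy as np

def pca_project(arr, n):
    """Project the rows of arr onto their first n principal components."""
    centered = arr - arr.mean(axis=0)  # remove the per-column mean
    # Rows of vt are the principal directions, ordered by explained variance
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n].T

# Toy data: 4 samples with 3 features each
toy = np.array([[1.0, 2.0, 3.0],
                [2.0, 4.0, 6.1],
                [3.0, 6.0, 9.0],
                [4.0, 8.0, 12.2]])
print(pca_project(toy, 2).shape)
```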

Next, create a function to generate a scatter plot chart. You’ll use the altair library to create the charts.

In [27]:
import altair as alt
# Function to generate the 2D plot
def generate_chart(df,xcol,ycol,lbl='off',color='basic',title=''):
  chart = alt.Chart(df).mark_circle(size=500).encode(
    x= alt.X(xcol,
      scale=alt.Scale(zero=False),
      axis=alt.Axis(labels=False, ticks=False, domain=False)
    ),

    y= alt.Y(ycol,
      scale=alt.Scale(zero=False),
      axis=alt.Axis(labels=False, ticks=False, domain=False)
    ),
    
    color= alt.value('#333293') if color == 'basic' else color,
    tooltip=['title']
  )

  if lbl == 'on':
    text = chart.mark_text(align='left', baseline='middle',dx=15, size=13,color='black').encode(text='title', color= alt.value('black'))
  else:
    text = chart.mark_text(align='left', baseline='middle',dx=10).encode()

  result = (chart + text).configure(background="#FDF7F0"
        ).properties(
        width=800,
        height=500,
        title=title
       ).configure_legend(
  orient='bottom', titleFontSize=18,labelFontSize=18)
        
  return result


Finally, use the embeddings with reduced dimensionality to create a scatter plot. 

In [28]:
sample = 200
# Reduce embeddings to 2 principal components to aid visualization
embeds = np.array(df['title_embeds'].tolist())
embeds_pc2 = get_pc(embeds,2)
# Add the principal components to dataframe
df_pc2 = pd.concat([df, pd.DataFrame(embeds_pc2)], axis=1)

# Altair needs string column names, so convert them before plotting
df_pc2.columns = df_pc2.columns.astype(str)
print(df_pc2.iloc[:sample])


     year                                              title  \
0    2022  metric-distortion bounds under limited informa...   
1    2022  recursion in abstract argumentation is hard --...   
2    2022  crossing the conversational chasm: a primer on...   
3    2022  hebo: pushing the limits of sample-efficient h...   
4    2022  learning bayesian networks under sparsity cons...   
..    ...                                                ...   
195  2020  using machine learning for decreasing state un...   
196  2020  mapping the landscape of artificial intelligen...   
197  2020  contrasting the spread of misinformation in on...   
198  2020  to regulate or not: a social dynamics analysis...   
199  2020  qualitative numeric planning: reductions and c...   

                                                  link  \
0    https://www.jair.org/index.php/jair/article/vi...   
1    https://www.jair.org/index.php/jair/article/vi...   
2    https://www.jair.org/index.php/jair/article/vi...   

Here’s a chart demonstrating the word embeddings for AI papers. It is important to note that the chart represents a sample size of 200 papers.

In [29]:
generate_chart(df_pc2.iloc[:sample],'0','1',title='2D Embeddings')

# Semantic Search
Traditional search techniques rely on keywords to retrieve text-based information. Semantic search takes this a level higher by using the intent and contextual meaning of the search query. In this section, you’ll use Cohere to create embeddings for the search query, then compare them against your dataset’s embeddings to measure similarity. The output is a list of similar AI papers.

First, create a function to get similarities between two embeddings. This will use the cosine similarity function from the scikit-learn library.


In [30]:
from sklearn.metrics.pairwise import cosine_similarity

def get_similarity(target,candidates):
  # Turn list into array
  candidates = np.array(candidates)
  target = np.expand_dims(np.array(target),axis=0)

  # Calculate cosine similarity
  sim = cosine_similarity(target,candidates)
  sim = np.squeeze(sim).tolist()
  sort_index = np.argsort(sim)[::-1]
  sort_score = [sim[i] for i in sort_index]
  similarity_scores = zip(sort_index,sort_score)

  # Return similarity scores
  return similarity_scores
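
For intuition, cosine similarity is the dot product of two vectors divided by the product of their lengths. The pure-Python sketch below mirrors the ranking logic of `get_similarity` on toy vectors; `cosine` and `rank_by_similarity` are illustrative names, not library functions.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def rank_by_similarity(target, candidates):
    """Return (index, score) pairs sorted from most to least similar."""
    scores = [(i, cosine(target, c)) for i, c in enumerate(candidates)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

# Toy 2-D vectors: the second candidate points the same way as the target
ranked = rank_by_similarity([1.0, 0.0], [[0.0, 1.0], [2.0, 0.0], [1.0, 1.0]])
print(ranked)
```

Because the score depends only on direction, the candidate `[2.0, 0.0]` ranks first even though it is twice as long as the target.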


Next, create embeddings for the search query.

In [31]:
# Add new query
new_query = "graph network strategies"

# Get embeddings of the new query
new_query_embeds = get_embeddings(new_query)


Finally, check the similarity between the two embeddings. Display the top 10 similar papers using your result.

In [32]:
# Get the similarity between the search query and the paper titles
similarity = get_similarity(new_query_embeds,embeds[:sample])

# View the top 10 articles
print('Query:')
print(new_query,'\n')

print('Similar queries:')
for idx,sim in list(similarity)[:10]:
  print(f'Similarity: {sim:.2f};',df.iloc[idx]['title'])


Query:
graph network strategies 

Similar queries:
Similarity: 0.49; amp chain graphs: minimal separators and structure learning algorithms
Similarity: 0.46; pure nash equilibria in resource graph games
Similarity: 0.44; general value function networks
Similarity: 0.42; on the online coalition structure generation problem
Similarity: 0.42; efficient local search based on dynamic connectivity maintenance for minimum connected dominating set
Similarity: 0.42; graph kernels: a survey
Similarity: 0.39; rwne: a scalable random-walk based network embedding framework with personalized higher-order proximity preserved
Similarity: 0.39; the petlon algorithm to plan efficiently for task-level-optimal navigation
Similarity: 0.38; election manipulation on social networks: seeding, edge removal, edge addition
Similarity: 0.38; a semi-exact algorithm for quickly computing a maximum weight clique in large sparse graphs

Visualizing semantic search: https://github.com/cohere-ai/notebooks/blob/main/notebooks/Visualizing_Text_Embeddings.ipynb

# Text Clustering
Clustering is a process of grouping similar documents into clusters. It allows you to organize many documents into a smaller number of groups. As a result, you can discover emerging patterns in the documents. In this section, you will use the k-means clustering algorithm to group the papers into 5 clusters.

First, import the k-means algorithm from the scikit-learn package. Then configure two variables: a copy of the data frame and the number of clusters.


In [33]:
from sklearn.cluster import KMeans

# Work on a copy of the data frame
df_clust = df_pc2.copy()
# Pick the number of clusters
n_clusters=5


Next, initialize the k-means model and use it to fit the embeddings to create the clusters.

In [34]:
# Cluster the embeddings
kmeans_model = KMeans(n_clusters=n_clusters, random_state=0)
classes = kmeans_model.fit_predict(embeds).tolist()
print(classes)
df_clust['cluster'] = (list(map(str,classes)))


[2, 0, 3, 4, 4, 3, 1, 1, 0, 3, 0, 2, 0, 1, 1, 0, 0, 2, 0, 1, 3, 2, 1, 3, 0, 2, 2, 0, 2, 1, 1, 2, 2, 1, 0, 1, 1, 1, 2, 2, 2, 4, 3, 3, 3, 3, 2, 1, 2, 4, 3, 0, 2, 0, 1, 1, 0, 4, 0, 2, 2, 3, 1, 2, 4, 1, 2, 1, 4, 0, 3, 3, 4, 2, 0, 2, 2, 2, 0, 0, 0, 4, 1, 4, 1, 2, 0, 4, 1, 1, 4, 1, 4, 1, 4, 1, 0, 0, 4, 2, 4, 3, 4, 3, 2, 0, 2, 1, 1, 4, 2, 4, 2, 2, 0, 3, 1, 3, 2, 3, 1, 2, 0, 4, 4, 1, 0, 0, 4, 1, 1, 2, 2, 1, 2, 3, 0, 0, 1, 1, 1, 0, 4, 1, 4, 2, 4, 2, 4, 3, 2, 0, 1, 4, 1, 1, 2, 2, 0, 1, 1, 1, 1, 1, 1, 2, 2, 0, 2, 0, 4, 2, 4, 2, 1, 0, 3, 0, 1, 0, 2, 2, 1, 4, 1, 3, 4, 1, 0, 2, 1, 2, 0, 2, 4, 1, 4, 2, 2, 1, 0, 0, 1, 0, 2, 1, 0, 4, 1, 4, 0, 2, 1, 4, 1, 3, 2, 4, 2, 0, 1, 0, 3, 0, 2, 4, 1, 1, 3, 2, 3, 1, 3, 4, 2, 2, 0, 1, 1, 1, 4, 1, 0, 4, 3, 2, 2, 2, 2, 0, 1, 3, 1, 3, 4, 2, 4, 2, 1, 3]
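
The raw label list is hard to read on its own. One way to inspect the result is to group titles by label and look at a few per cluster. The sketch below uses made-up titles and labels; with the tutorial's data you would pass `df['title']` and `classes` instead. `sample_titles_by_cluster` is a hypothetical helper.

```python
from collections import defaultdict

def sample_titles_by_cluster(titles, labels, per_cluster=3):
    """Map each cluster label to its first few titles."""
    groups = defaultdict(list)
    for title, label in zip(titles, labels):
        if len(groups[label]) < per_cluster:
            groups[label].append(title)
    return dict(groups)

# Made-up titles and labels for illustration
titles = ["graph kernels", "nash equilibria", "graph embeddings", "auction design"]
labels = [0, 1, 0, 1]
for label, examples in sorted(sample_titles_by_cluster(titles, labels).items()):
    print(label, examples)
```

Reading a handful of titles per cluster is often the quickest way to judge whether a cluster corresponds to a recognizable topic.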


Finally, draw a scatter plot to visualize the 5 clusters for the sampled papers.

In [35]:
# Plot on a chart
df_clust.columns = df_clust.columns.astype(str)
generate_chart(df_clust.iloc[:sample],'0','1',lbl='off',color='cluster',title='Clustering with 5 Clusters')

# Conclusion
Let's recap the NLP tasks implemented in this tutorial. You’ve created word embeddings, performed a semantic search, and clustered text. Cohere’s platform provides NLP tools that are easy and intuitive to integrate. You can create digital experiences that support powerful NLP capabilities like text clustering. It’s easy to register a Cohere account and gain access to an API key. New Cohere accounts have $75 free credits for the first 3 months. Cohere also offers a Pay-as-you-go Pricing Model that bills you based on usage.
