<a target="_blank" href="https://colab.research.google.com/github/cohere-ai/notebooks/blob/main/notebooks/guide/Fueling_Generative_Content_with_Keyword_Research.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

# Fueling Generative Content with Keyword Research

Generative models have proven extremely useful in content idea generation. But they don’t take into account user search demand and trends. In this notebook, let’s see how we can solve that by adding keyword research into the equation.

Read the accompanying [blog post here](https://txt.cohere.ai/generative-content-keyword-research/).

In [2]:
# Install packages
! pip install cohere wget -q

In [10]:
import cohere
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

co = cohere.Client("COHERE_API_KEY") # Get your API key: https://dashboard.cohere.com/api-keys

In [2]:
#@title Enable text wrapping in Google Colab

from IPython.display import HTML, display

def set_css():
  display(HTML('''
  <style>
    pre {
        white-space: pre-wrap;
    }
  </style>
  '''))
get_ipython().events.register('pre_run_cell', set_css)

# Step 1: Get a List of High-Performing Keywords

First, we need a supply of high-traffic keywords for a given topic. We can get this from keyword research tools, of which many are available. We’ll use Google Keyword Planner, which is free to use.

In [3]:
# Download the pre-created dataset (feel free to replace with your CSV file, containing two columns - "keyword" and "volume")

import wget
wget.download("https://raw.githubusercontent.com/cohere-ai/notebooks/main/notebooks/data/remote_teams.csv", "remote_teams.csv")

'remote_teams.csv'

In [4]:
# Create a dataframe
df = pd.read_csv('remote_teams.csv')
df.columns = ["keyword","volume"]
df.head()

| | keyword | volume |
|---|---|---|
| 0 | managing remote teams | 1000 |
| 1 | remote teams | 390 |
| 2 | collaboration tools for remote teams | 320 |
| 3 | online games for remote teams | 320 |
| 4 | how to manage remote teams | 260 |
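If you replace the dataset with your own export, it may help to sanity-check it before moving on. The following is a minimal sketch, not part of the original notebook: `load_keywords` is a hypothetical helper, and the choices to lowercase keywords, drop rows with unparseable volumes, and keep the highest-volume duplicate are assumptions you may want to change.

```python
import io
import pandas as pd

def load_keywords(csv_source):
    """Load a two-column keyword CSV and return a cleaned DataFrame (hypothetical helper)."""
    df = pd.read_csv(csv_source)
    df.columns = ["keyword", "volume"]
    # Normalize keyword text so near-identical rows collapse together (assumption)
    df["keyword"] = df["keyword"].astype(str).str.strip().str.lower()
    # Coerce volume to numbers; values that cannot be parsed become NaN and are dropped
    df["volume"] = pd.to_numeric(df["volume"], errors="coerce")
    df = df.dropna(subset=["volume"])
    df = df[df["keyword"] != ""]
    # Keep only the highest-volume row for each keyword
    df = (df.sort_values("volume", ascending=False)
            .drop_duplicates(subset="keyword")
            .reset_index(drop=True))
    df["volume"] = df["volume"].astype(int)
    return df

# Small in-memory CSV standing in for a real keyword export
sample_csv = io.StringIO(
    "keyword,volume\n"
    "managing remote teams,1000\n"
    "Managing Remote Teams ,900\n"
    "remote teams,390\n"
    "tools for remote teams,n/a\n"
)
clean_df = load_keywords(sample_csv)
print(clean_df)
```

With a real file, you would pass the file path instead of the in-memory sample.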


# Step 2: Group the Keywords into Topics 

We now have a list of keywords, but this list is still raw. For example, “managing remote teams” is the top-ranking keyword in this list. But at the same time, there are many similar keywords further down in the list, such as “how to effectively manage remote teams.”

We can consolidate these similar keywords by clustering them into topics. For this, we’ll leverage Cohere’s Embed endpoint and scikit-learn.

### Embed the Keywords with Cohere Embed

The Cohere Embed endpoint turns a text input into a text embedding.

In [6]:
def embed_text(texts):
  output = co.embed(
                texts=texts,
                model='embed-english-v3.0',
                input_type="search_document",
                )
  return output.embeddings

embeds = np.array(embed_text(df['keyword'].tolist()))
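For longer keyword lists, a single Embed call may exceed the number of texts allowed per request. The sketch below shows one way to embed in batches. `embed_in_batches` is a hypothetical helper, the batch size of 96 is an assumption to check against the Embed documentation, and `fake_embed` is a stand-in so the example runs without an API key.

```python
import numpy as np

def embed_in_batches(texts, embed_fn, batch_size=96):
    """Embed texts in chunks of batch_size and stack the results into one array."""
    vectors = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        vectors.extend(embed_fn(batch))
    return np.array(vectors)

# Hypothetical stand-in for embed_text: returns a 2-dimensional vector per text
def fake_embed(batch):
    return [[float(len(t)), float(t.count(" "))] for t in batch]

keywords = [f"keyword number {i}" for i in range(250)]
batched = embed_in_batches(keywords, fake_embed, batch_size=96)
print(batched.shape)
```

In the notebook, this would be called as `embed_in_batches(df['keyword'].tolist(), embed_text)`.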

### Cluster the Keywords into Topics with scikit-learn

We then use these embeddings to cluster the keywords. A common term used for this exercise is “topic modeling.” Here, we can leverage scikit-learn’s KMeans class, which implements the k-means clustering algorithm.

In [7]:
NUM_TOPICS = 4
kmeans = KMeans(n_clusters=NUM_TOPICS, random_state=21, n_init="auto").fit(embeds)
df['topic'] = list(kmeans.labels_)
df.head()

| | keyword | volume | topic |
|---|---|---|---|
| 0 | managing remote teams | 1000 | 0 |
| 1 | remote teams | 390 | 1 |
| 2 | collaboration tools for remote teams | 320 | 1 |
| 3 | online games for remote teams | 320 | 3 |
| 4 | how to manage remote teams | 260 | 0 |
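`NUM_TOPICS` is set by hand here. If you are unsure how many topics your keywords contain, one common heuristic is to try several values and compare silhouette scores. The sketch below is an illustration, not the notebook's method: `pick_num_topics` is a hypothetical helper, and the synthetic `toy_vectors` stand in for `embeds` so the example runs without an API key.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def pick_num_topics(vectors, candidates=range(2, 8), seed=21):
    """Return the candidate cluster count with the highest silhouette score, plus all scores."""
    scores = {}
    for k in candidates:
        labels = KMeans(n_clusters=k, random_state=seed, n_init=10).fit_predict(vectors)
        scores[k] = silhouette_score(vectors, labels)
    best_k = max(scores, key=scores.get)
    return best_k, scores

# Synthetic stand-in for `embeds`: four tight, well-separated groups of 2-D points
rng = np.random.default_rng(0)
centers = np.array([[0, 0], [10, 10], [-10, 10], [10, -10]], dtype=float)
toy_vectors = np.vstack([c + rng.normal(scale=0.5, size=(30, 2)) for c in centers])

best_k, scores = pick_num_topics(toy_vectors)
print(best_k)
for k, score in scores.items():
    print(f"k={k}: silhouette={score:.3f}")
```

Real embeddings are rarely this cleanly separated, so treat the score as a guide and inspect the resulting clusters before settling on a value.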


### Generate Topic Names with Cohere Chat

We then use the Chat endpoint to generate a topic name for each cluster.

In [14]:
# Group the DataFrame by 'topic' and aggregate the 'keyword' column into sets (which automatically removes duplicates)
topic_keywords_dict = {topic: list(set(group['keyword'])) for topic, group in df.groupby('topic')}

In [22]:
# Function to generate a topic name based on keywords
def generate_topic_name(keywords):
    # Construct the prompt
    prompt = f"""Generate a concise topic name that best represents these keywords. \
Provide just the topic name and not any additional details.

Keywords: {', '.join(keywords)}"""
    
    # Call the Cohere API
    response = co.chat(
        model='command-r',  # Choose the model size
        message=prompt,
        preamble="")
    
    # Return the generated text
    return response.text
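Generated names sometimes come back wrapped in quotes, prefixed with a label, or followed by extra commentary. The following sketch shows one way to tidy the text before storing it. `clean_topic_name` is a hypothetical helper, and the cleanup rules are assumptions about typical model output rather than guaranteed behavior.

```python
import re

def clean_topic_name(raw_text):
    """Keep the first non-empty line and strip labels, quotes, emphasis marks, and periods."""
    lines = [line.strip() for line in raw_text.splitlines() if line.strip()]
    if not lines:
        return ""
    name = lines[0]
    # Drop a leading label such as "Topic name:" if the model echoed one
    name = re.sub(r"^(topic\s*name|topic)\s*:\s*", "", name, flags=re.IGNORECASE)
    # Strip surrounding spaces, periods, quotes, asterisks, and backticks
    return name.strip(" .\"'*`")

print(clean_topic_name('Topic name: "Remote Team Management".\n\nLet me know if you need more.'))
```

To apply it, `generate_topic_name` could return `clean_topic_name(response.text)` instead of `response.text`.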

In [23]:
# Generate topic names and create a mapping of topic number to topic name
topic_name_mapping = {topic: generate_topic_name(keywords) for topic, keywords in topic_keywords_dict.items()}

# Use the mapping to create a new column in the DataFrame
df['topic_name'] = df['topic'].map(topic_name_mapping)

# Display the first few rows to verify the new column
df.head()

| | keyword | volume | topic | topic_name |
|---|---|---|---|---|
| 0 | managing remote teams | 1000 | 0 | Remote Team Management |
| 1 | remote teams | 390 | 1 | Remote Team Tools and Tips |
| 2 | collaboration tools for remote teams | 320 | 1 | Remote Team Tools and Tips |
| 3 | online games for remote teams | 320 | 3 | Remote Team Fun |
| 4 | how to manage remote teams | 260 | 0 | Remote Team Management |


In [25]:
# View the list of topics
for topic, name in topic_name_mapping.items():
    print(f"Topic {topic}: {name}")

Topic 0: Remote Team Management
Topic 1: Remote Team Tools and Tips
Topic 2: Remote Team Resources
Topic 3: Remote Team Fun


# Step 3: Generate Blog Post Ideas for Each Topic

Now that we have the keywords nicely grouped into topics, we can proceed to generate the content ideas.


### Take the Top Keywords from Each Topic

Here we can implement a filter to take just the top N keywords from each topic, sorted by the search volume. In our case, we use 10.

In [26]:
TOP_N = 10

# Group the DataFrame by topic and select the top N keywords sorted by volume
top_keywords = (df.groupby('topic')
                        .apply(lambda x: x.nlargest(TOP_N, 'volume'))
                        .reset_index(drop=True))


# Map each topic number to its topic name (computed once, outside the loop)
topic2name = dict(df.groupby('topic')['topic_name'].first())

# Convert the DataFrame to a nested dictionary
content_by_topic = {}
for topic, group in top_keywords.groupby('topic'):
    keywords = ', '.join(list(group['keyword']))
    topic_name = topic2name[topic]
    content_by_topic[topic] = {'topic_name': topic_name, 'keywords': keywords}
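To make the top-N selection easier to follow, here is the same idea on a toy DataFrame. This sketch uses `sort_values` followed by `groupby(...).head(...)` as an alternative to `groupby(...).apply(...)`, which emits a deprecation warning in some recent pandas versions. The data and variable names are made up for illustration.

```python
import pandas as pd

# Toy data standing in for the keyword DataFrame
toy = pd.DataFrame({
    "keyword": ["a", "b", "c", "d", "e", "f"],
    "volume":  [100, 300, 200, 50, 70, 60],
    "topic":   [0, 0, 0, 1, 1, 1],
})

TOP_N = 2

# Sort once by volume, keep the first TOP_N rows of each topic,
# then order the result by topic and descending volume
top_rows = (toy.sort_values("volume", ascending=False)
               .groupby("topic")
               .head(TOP_N)
               .sort_values(["topic", "volume"], ascending=[True, False])
               .reset_index(drop=True))

# Join each topic's keywords into a single comma-separated string
keywords_by_topic = top_rows.groupby("topic")["keyword"].agg(", ".join).to_dict()
print(keywords_by_topic)
```

Each topic keeps only its two highest-volume keywords, in descending order of volume.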

In [27]:
# Print the topics and their top keywords
content_by_topic

{0: {'topic_name': 'Remote Team Management',
  'keywords': 'managing remote teams, how to manage remote teams, leading remote teams, managing remote teams best practices, remote teams best practices, best practices for managing remote teams, manage remote teams, building culture in remote teams, culture building for remote teams, managing remote teams training'},
 1: {'topic_name': 'Remote Team Tools and Tips',
  'keywords': 'remote teams, collaboration tools for remote teams, team building for remote teams, scrum remote teams, tools for remote teams, zapier remote teams, working agreements for remote teams, working with remote teams, free collaboration tools for remote teams, free retrospective tools for remote teams'},
 2: {'topic_name': 'Remote Team Resources',
  'keywords': 'best collaboration tools for remote teams, slack best practices for remote teams, best communication tools for remote teams, best tools for remote teams, always on video for remote teams, best apps for remote t

### Create a Prompt with These Keywords

Next, we use the Chat endpoint to produce the content ideas. The prompt we’ll use is as follows:

In [28]:
def generate_blog_ideas(keywords):
  prompt = f"""{keywords}\n\nThe above is a list of high-traffic keywords obtained from a keyword research tool. 
Suggest three blog post ideas that are highly relevant to these keywords. 
For each idea, write a one paragraph abstract about the topic. 
Use this format:
Blog title: <text>
Abstract: <text>"""
  
  response = co.chat(
    model='command-r',
    message = prompt)
  return response.text
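Because the prompt asks for a fixed `Blog title:` / `Abstract:` format, the response can be turned into structured records for later use, such as exporting to a spreadsheet. The sketch below is one possible parser, assuming the model follows the requested format and separates ideas with blank lines. `parse_blog_ideas` is a hypothetical helper and the sample text is made up.

```python
import re

def parse_blog_ideas(text):
    """Extract title/abstract pairs from text in the 'Blog title:' / 'Abstract:' format."""
    pattern = re.compile(
        r"Blog title:\s*(?P<title>.+?)\s*\n\s*Abstract:\s*(?P<abstract>.+?)"
        r"(?=\n\s*\n|\n\s*(?:\d+\.\s*)?Blog title:|\Z)",
        flags=re.DOTALL,
    )
    ideas = []
    for match in pattern.finditer(text):
        # Remove surrounding whitespace and quotes from the title
        title = match.group("title").strip().strip('"')
        # Collapse line breaks and repeated spaces inside the abstract
        abstract = " ".join(match.group("abstract").split())
        ideas.append({"title": title, "abstract": abstract})
    return ideas

# Made-up response text in the requested format
sample = '''Here are three blog post ideas:

1. Blog title: "Leading Remote Teams"
   Abstract: Practical strategies for leading
   dispersed employees.

2. Blog title: Remote Team Tools
   Abstract: A tour of collaboration tools.'''

parsed = parse_blog_ideas(sample)
for idea in parsed:
    print(idea)
```

An abstract that itself contains a blank line would be cut short by this pattern, so check the parsed output against the raw text when the format matters.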


### Generate Content Ideas

Next, we generate the blog post ideas. For each topic, the `generate_blog_ideas` function takes in a string of keywords, calls the Chat endpoint, and returns the generated text.

In [30]:
# Generate content ideas
for key,value in content_by_topic.items():
  value['ideas'] = generate_blog_ideas(value['keywords'])


# Print the results
for key,value in content_by_topic.items():
  print(f"Topic Name: {value['topic_name']}\n")
  print(f"Top Keywords: {value['keywords']}\n")
  print(f"Blog Post Ideas: {value['ideas']}\n")
  print("-"*50)

Topic Name: Remote Team Management

Top Keywords: managing remote teams, how to manage remote teams, leading remote teams, managing remote teams best practices, remote teams best practices, best practices for managing remote teams, manage remote teams, building culture in remote teams, culture building for remote teams, managing remote teams training

Blog Post Ideas: Here are three blog post ideas:

1. Blog title: "Leading Remote Teams: Strategies for Effective Management"
   Abstract: Effective management of remote teams is crucial for success, but it comes with unique challenges. This blog will explore practical strategies for leading dispersed employees, focusing on building a cohesive and productive virtual workforce. It will cover topics such as establishing clear communication protocols, fostering a collaborative environment, and the importance of trusting and empowering your remote employees for enhanced performance.

2. Blog title: "Remote Teams' Best Practices: Creating a Vib