<a target="_blank" href="https://colab.research.google.com/github/cohere-ai/notebooks/blob/main/notebooks/guide/Hello_World_Meet_Language_AI.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

# Hello World! Meet Language AI

Here we take a quick tour of what’s possible with language AI via Cohere’s Large Language Model (LLM) API. This is the Hello, World! of language AI, written for developers with little or no background in AI. In fact, we’ll do that by exploring the Hello, World! phrase itself.

Read the accompanying [blog post here](https://txt.cohere.ai/hello-world-p1/).

![Hello World! Meet Language AI](https://github.com/cohere-ai/notebooks/raw/main/notebooks/images/hello-world/hello-world-feat.png)

We’ll cover three groups of tasks that you will typically work on when dealing with language data:
- Generating text
- Classifying text
- Analyzing text


The first step is to install the Cohere Python SDK. Next, create an API key, which you can generate from the Cohere [dashboard](https://os.cohere.ai/register) or [CLI tool](https://docs.cohere.ai/cli-key).

In [1]:
# Install the libraries
! pip install cohere altair umap-learn -q

In [1]:
# Import the libraries
import cohere
import pandas as pd
import numpy as np
import altair as alt

# Set up the Cohere client
co = cohere.Client("COHERE_API_KEY") # Get your API key: https://dashboard.cohere.com/api-keys
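Hard-coding the key in a notebook makes it easy to leak when the notebook is shared. As a minimal sketch, assuming the key has been stored in an environment variable named `COHERE_API_KEY` (the variable name and the `load_api_key` helper are illustrative, not part of the Cohere SDK), the key could be read like this:

```python
import os

def load_api_key(env_var="COHERE_API_KEY"):
    """Return the API key stored in an environment variable, or raise a clear error.

    Both the helper and the default variable name are hypothetical conveniences.
    """
    key = os.environ.get(env_var)
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable before creating the client.")
    return key

# Usage, once the variable is set in your environment:
# co = cohere.Client(load_api_key())
```

The client is then created exactly as above, only with the key coming from the environment rather than from the source code.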

# 1 - Generating Text

The Cohere Chat endpoint generates text given an input, called a “prompt”. The prompt provides the context for the text we want the model to generate. To illustrate this, let’s start with a simple prompt as the input.

### Try a Simple Prompt

In [3]:
# Create a simple one-line prompt
prompt = "What is a Hello World program."

# Generate text by calling the Chat endpoint
response = co.chat(
  message=prompt,
  model='command-r')

print(response.text)

A "Hello World" program is a traditional and simple program that is often used as an introduction to a new programming language. The program typically displays the message "Hello World" as its output. The concept of a "Hello World" program originated from the book *The C Programming Language* written by Kernighan and Ritchie, where the example program in the book displayed the message using the C programming language. 

The "Hello World" program serves as a basic and straightforward way to verify that your development environment is set up correctly and to familiarize yourself with the syntax and fundamentals of the programming language. It's a starting point for learning how to write and run programs in a new language.

The program's simplicity makes it accessible to programmers of all skill levels, and it's often one of the first programs beginners write when learning to code. The exact implementation of a "Hello World" program varies depending on the programming language being used,

### Create a Better Prompt

The output is not bad, but it can be better. We need a way to steer the output closer to what we want, which is where *prompt engineering* comes in.

In [4]:
# Create a prompt containing a short description and examples, separated by "--"
prompt = """
Write the first paragraph of a blog post given a blog title.
--
Blog Title: Best Activities in Toronto
First Paragraph: Looking for fun things to do in Toronto? When it comes to exploring Canada's
largest city, there's an ever-evolving set of activities to choose from. Whether you're looking to
visit a local museum or sample the city's varied cuisine, there is plenty to fill any itinerary. In
this blog post, I'll share some of my favorite recommendations
--
Blog Title: Mastering Dynamic Programming
First Paragraph: In this piece, we'll help you understand the fundamentals of dynamic programming,
and when to apply this optimization technique. We'll break down bottom-up and top-down approaches to
solve dynamic programming problems.
--
Blog Title: Learning to Code with Hello, World!
First Paragraph:"""

# Generate text by calling the Chat endpoint
response = co.chat(
  message=prompt,
  model='command-r')

print(response.text)

Starting to code can be daunting, but it's actually simpler than you think! The famous first program, "Hello, World!" is a rite of passage for all coders, and an excellent starting point to begin your coding journey. This blog will guide you through the process of writing your very first line of code, and help you understand why learning to code is an exciting and valuable skill to have, covering the fundamentals and the broader implications of this seemingly simple phrase.


### Automating the Process

In real applications, you will likely need to produce these text generations on an ongoing basis, given different inputs. Let’s simulate that with our example.

In [9]:
# A function that generates text given a base prompt and a new topic
def generate_text(topic):
  prompt = f"""
Write the first paragraph of a blog post given a blog title.
--
Blog Title: Best Activities in Toronto
First Paragraph: Looking for fun things to do in Toronto? When it comes to exploring Canada's
largest city, there's an ever-evolving set of activities to choose from. Whether you're looking to
visit a local museum or sample the city's varied cuisine, there is plenty to fill any itinerary. In
this blog post, I'll share some of my favorite recommendations
--
Blog Title: Mastering Dynamic Programming
First Paragraph: In this piece, we'll help you understand the fundamentals of dynamic programming,
and when to apply this optimization technique. We'll break down bottom-up and top-down approaches to
solve dynamic programming problems.
--
Blog Title: {topic}
First Paragraph:"""
  # Generate text by calling the Chat endpoint
  response = co.chat(
    message=prompt,
    model='command-r')

  return response.text
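The base prompt in the function above is written out by hand. One possible variation, sketched here under the assumption that the examples are kept as `(title, paragraph)` pairs, assembles the same layout programmatically so that examples can be added or swapped without editing a long string. The `build_prompt` helper and the `blog_examples` list are hypothetical names, and the example paragraphs are shortened from the ones above.

```python
def build_prompt(examples, title,
                 instruction="Write the first paragraph of a blog post given a blog title."):
    """Assemble a few-shot prompt: instruction, worked examples, then the new title."""
    parts = [instruction]
    for ex_title, ex_paragraph in examples:
        parts.append(f"Blog Title: {ex_title}\nFirst Paragraph: {ex_paragraph}")
    # The final block is left open so the model completes the paragraph
    parts.append(f"Blog Title: {title}\nFirst Paragraph:")
    # Blocks are separated by "--" on its own line, as in the hand-written prompt
    return "\n--\n".join(parts)

blog_examples = [
    ("Best Activities in Toronto",
     "Looking for fun things to do in Toronto?"),
    ("Mastering Dynamic Programming",
     "In this piece, we'll help you understand the fundamentals of dynamic programming, "
     "and when to apply this optimization technique."),
]

prompt = build_prompt(blog_examples, "Learning to Code with Hello, World!")
print(prompt)
```

The resulting string can be passed as the `message` argument in the same way as the hand-written prompt.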

In [10]:
# The list of topics
topics = ["How to Grow in Your Career",
          "The Habits of Great Software Developers",
          "Ideas for a Relaxing Weekend"]

In [12]:
# Keep the generations in a list of paragraphs
paragraphs = []

for topic in topics:
  paragraphs.append(generate_text(topic))
  
# Display the generated paragraphs
for topic,para in zip(topics,paragraphs):
  print(f"Topic: {topic}")
  print(f"First Paragraph: {para}")
  print("-"*10)

Topic: How to Grow in Your Career
First Paragraph: Advancing in your career can seem like a daunting task, especially if you're unsure of the path ahead. In this ever-changing professional landscape, there are numerous factors to consider. This blog aims to shed light on the strategies and skills that can help you navigate the complexities of career progression and unlock your full potential. Whether you're looking to secure a promotion or explore new opportunities, these insights will help you chart a course for your future. Let's embark on this journey of self-improvement and professional growth, equipping you with the tools to succeed in your career aspirations.
----------
Topic: The Habits of Great Software Developers
First Paragraph: Great software developers are renowned for their ability to write robust code and create innovative applications, but what sets them apart from their peers? In this blog, we'll delve into the daily habits that contribute to their success. From their a

# 2 - Classifying Text

Cohere’s Classify endpoint makes it easy to take a list of texts and predict their categories, or classes. A typical machine learning model requires many training examples to perform text classification, but with the Classify endpoint, you can get started with as few as 5 examples per class.

### Sentiment Analysis

In [5]:
# Create the training examples for the classifier
from cohere import ClassifyExample

examples = [
    ClassifyExample(text="I’m so proud of you", label="positive"), 
    ClassifyExample(text="What a great time to be alive", label="positive"), 
    ClassifyExample(text="That’s awesome work", label="positive"), 
    ClassifyExample(text="The service was amazing", label="positive"), 
    ClassifyExample(text="I love my family", label="positive"), 
    ClassifyExample(text="They don't care about me", label="negative"), 
    ClassifyExample(text="I hate this place", label="negative"), 
    ClassifyExample(text="The most ridiculous thing I've ever heard", label="negative"), 
    ClassifyExample(text="I am really frustrated", label="negative"), 
    ClassifyExample(text="This is so unfair", label="negative"),
    ClassifyExample(text="This made me think", label="neutral"), 
    ClassifyExample(text="The good old days", label="neutral"), 
    ClassifyExample(text="What's the difference", label="neutral"), 
    ClassifyExample(text="You can't ignore this", label="neutral"), 
    ClassifyExample(text="That's how I see it", label="neutral")
]
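Since the text above suggests starting with about five examples per class, it can be useful to check the label balance before calling the endpoint. The following is a rough sketch: `labels_below_minimum` is a hypothetical helper, and `SimpleNamespace` objects stand in for `ClassifyExample` so the snippet runs without the SDK (it only assumes each example has a `label` attribute).

```python
from collections import Counter
from types import SimpleNamespace

def count_examples_per_label(examples):
    """Count how many training examples each label has."""
    return Counter(ex.label for ex in examples)

def labels_below_minimum(examples, minimum=5):
    """Return the labels that have fewer than `minimum` examples, sorted by name."""
    counts = count_examples_per_label(examples)
    return sorted(label for label, n in counts.items() if n < minimum)

# Stand-in objects with the same `text` and `label` attributes as the examples above
sample = [
    SimpleNamespace(text="I love my family", label="positive"),
    SimpleNamespace(text="I hate this place", label="negative"),
]

print(labels_below_minimum(sample))  # → ['negative', 'positive']
```

With the full `examples` list from the cell above, every label has five entries, so the helper would return an empty list.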

In [6]:
# Enter the inputs to be classified
inputs=["Hello, world! What a beautiful day",
        "It was a great time with great people",
        "Great place to work",
        "That was a wonderful evening",
        "Maybe this is why",
        "Let's start again",
        "That's how I see it",
        "These are all facts",
        "This is the worst thing",
        "I cannot stand this any longer",
        "This is really annoying",
        "I am just plain fed up"
        ]

In [7]:
# A function that classifies a list of inputs given the examples
def classify_text(inputs, examples):
  """
  Classify a list of input texts
  Arguments:
    inputs(list[str]): a list of input texts to be classified
    examples(list[ClassifyExample]): a list of example texts and class labels
  Returns:
    classifications(list): each result contains the text, labels, and confidence values
  """
  # Classify text by calling the Classify endpoint
  response = co.classify(
    model='embed-english-v2.0',
    inputs=inputs,
    examples=examples)
  
  classifications = response.classifications
  
  return classifications

In [8]:
# Classify the inputs
predictions = classify_text(inputs,examples)

# Display the classification outcomes
for inp,pred in zip(inputs,predictions):
  class_pred = pred.predictions[0]
  class_conf = pred.confidences[0]

  print(f"Input: {inp}")
  print(f"Prediction: {class_pred}")
  print(f"Confidence: {class_conf:.2f}")
  print("-"*10)

Input: Hello, world! What a beautiful day
Prediction: positive
Confidence: 0.84
----------
Input: It was a great time with great people
Prediction: positive
Confidence: 0.99
----------
Input: Great place to work
Prediction: positive
Confidence: 0.91
----------
Input: That was a wonderful evening
Prediction: positive
Confidence: 0.96
----------
Input: Maybe this is why
Prediction: neutral
Confidence: 0.70
----------
Input: Let's start again
Prediction: neutral
Confidence: 0.83
----------
Input: That's how I see it
Prediction: neutral
Confidence: 1.00
----------
Input: These are all facts
Prediction: neutral
Confidence: 0.78
----------
Input: This is the worst thing
Prediction: negative
Confidence: 0.93
----------
Input: I cannot stand this any longer
Prediction: negative
Confidence: 0.93
----------
Input: This is really annoying
Prediction: negative
Confidence: 0.99
----------
Input: I am just plain fed up
Prediction: negative
Confidence: 1.00
----------
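In practice, the confidence value is often used to decide which predictions to trust and which to send for manual review. As a small sketch, assuming results are collected as `(text, label, confidence)` tuples and using an arbitrary threshold of 0.8 (both are illustrative choices, not something the Classify endpoint prescribes), the low-confidence items could be picked out like this. The sample values are copied from the output above.

```python
def low_confidence(results, threshold=0.8):
    """Return the (text, label, confidence) tuples whose confidence is below the threshold."""
    return [(text, label, conf) for text, label, conf in results if conf < threshold]

# A few rows taken from the classification output above
results = [
    ("Hello, world! What a beautiful day", "positive", 0.84),
    ("Maybe this is why", "neutral", 0.70),
    ("These are all facts", "neutral", 0.78),
    ("I am just plain fed up", "negative", 1.00),
]

for text, label, conf in low_confidence(results):
    print(f"Review: {text!r} (predicted {label}, confidence {conf:.2f})")
```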


# 3 - Analyzing Text

Cohere’s Embed endpoint takes a piece of text and turns it into a vector embedding. Embeddings represent text as numbers that capture its meaning and context. This lets you turn unstructured text data into a structured form, which opens up ways to analyze it and extract insights from it.


### Get Embeddings

Here we have a list of 50 top web search keywords about Hello, World! taken from a keyword tool. Let’s look at a few examples:

In [17]:
# Get a list of texts and add to a dataframe
df = pd.read_csv("https://github.com/cohere-ai/notebooks/raw/main/notebooks/data/hello-world-kw.csv", names=["search_term"])
df.head()

|   | search_term |
|---|-------------|
| 0 | how to print hello world in python |
| 1 | what is hello world |
| 2 | how do you write hello world in an alert box |
| 3 | how to print hello world in java |
| 4 | how to write hello world in eclipse |


We use the Embed endpoint to get the embeddings for each of these keywords.

In [35]:
# A function that turns a list of texts into embeddings
def embed_text(texts, input_type):
  """
  Turns a list of texts into embeddings
  Arguments:
    texts(list[str]): the texts to be turned into embeddings
    input_type(str): the type of input, e.g. "search_document" or "search_query"
  Returns:
    embeddings(list[list[float]]): one embedding per input text
  """
  # Embed text by calling the Embed endpoint
  response = co.embed(
                model="embed-english-v3.0",
                input_type=input_type,
                texts=texts)
  
  return response.embeddings
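The 50 search terms here fit comfortably in a single request, but larger datasets usually have to be sent in chunks. Below is a hedged sketch of one way to do that. The `batch_texts` and `embed_in_batches` helpers are hypothetical, and the default batch size of 96 is an assumption that should be checked against the current Embed API documentation. A fake embedding function is used so the snippet runs without an API key.

```python
def batch_texts(texts, batch_size=96):
    """Split a list of texts into consecutive chunks of at most `batch_size` items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [texts[i:i + batch_size] for i in range(0, len(texts), batch_size)]

def embed_in_batches(texts, embed_fn, batch_size=96):
    """Call `embed_fn` on each chunk and concatenate the results in the original order."""
    embeddings = []
    for batch in batch_texts(texts, batch_size):
        embeddings.extend(embed_fn(batch))
    return embeddings

# Demo with a stand-in embedder that maps each text to a 1-dimensional "embedding"
def fake_embed(batch):
    return [[float(len(text))] for text in batch]

vectors = embed_in_batches(["a", "bb", "ccc"], fake_embed, batch_size=2)
print(vectors)  # → [[1.0], [2.0], [3.0]]
```

With the real endpoint, `embed_fn` could be something like `lambda batch: embed_text(batch, "search_document")`.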

In [36]:
# Get embeddings of all search terms
df["search_term_embeds"] = embed_text(texts=df["search_term"].tolist(),
                                      input_type="search_document")
doc_embeds = np.array(df["search_term_embeds"].tolist())

### Semantic Search

We’ll look at a couple of example applications. The first example is semantic search. Given a new query, our "search engine" must return the most similar FAQs, where the FAQs are the 50 search terms we loaded earlier.


In [43]:
# Add a new query
query = "what is the history of hello world"

# Get embeddings of the new query
query_embeds = embed_text(texts=[query],
                          input_type="search_query")[0]

We use cosine similarity to compare the new query with each of the FAQs.

In [46]:
# Calculate cosine similarity

from sklearn.metrics.pairwise import cosine_similarity

def get_similarity(target, candidates):
  """
  Computes the similarity between a target text and a list of other texts
  Arguments:
    target(list[float]): the target text
    candidates(list[list[float]]): a list of other texts, or candidates
  Returns:
    sim(list[tuple]): candidate IDs and the similarity scores
  """
  # Turn list into array
  candidates = np.array(candidates)
  target = np.expand_dims(np.array(target),axis=0)

  # Calculate cosine similarity
  sim = cosine_similarity(target,candidates)
  sim = np.squeeze(sim).tolist()

  # Sort by descending order in similarity
  sim = list(enumerate(sim))
  sim = sorted(sim, key=lambda x:x[1], reverse=True)

  # Return similarity scores
  return sim
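To make the metric itself concrete, here is a dependency-free sketch of cosine similarity, the dot product of two vectors divided by the product of their lengths, together with the same sort-by-score step. The `cosine` and `rank_by_similarity` names are illustrative. The scikit-learn version above computes the same quantity far more efficiently on real embeddings.

```python
import math

def cosine(a, b):
    """Cosine similarity: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

def rank_by_similarity(target, candidates):
    """Return (candidate index, score) pairs sorted from most to least similar."""
    scores = [(idx, cosine(target, cand)) for idx, cand in enumerate(candidates)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

print(cosine([1.0, 0.0], [1.0, 0.0]))  # → 1.0 (same direction)
print(cosine([1.0, 0.0], [0.0, 1.0]))  # → 0.0 (orthogonal)

# Toy 2D "embeddings": candidate 1 points the same way as the target
ranking = rank_by_similarity([1.0, 0.0], [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
print([idx for idx, _ in ranking])  # → [1, 2, 0]
```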

Finally, we display the top 5 FAQs that match the new query.

In [47]:
# Get the similarity between the new query and existing queries
similarity = get_similarity(query_embeds,doc_embeds)

# Display the top 5 FAQs
print("New query:")
print(query,'\n')

print("Similar queries:")
for idx,score in similarity[:5]:
  print(f"Similarity: {score:.2f};", df.iloc[idx]["search_term"])

New query:
what is the history of hello world 

Similar queries:
Similarity: 0.58; how did hello world originate
Similarity: 0.56; where did hello world come from
Similarity: 0.54; why hello world
Similarity: 0.53; why is hello world so famous
Similarity: 0.53; what is hello world


### Semantic Exploration

The second example takes the same idea as semantic search and applies it more broadly: exploring large volumes of text and analyzing their semantic relationships.

We'll use the same 50 top web search terms about Hello, World! There are different techniques we can use to compress the embeddings down to just 2 dimensions while retaining as much information as possible. We'll use a technique called UMAP. Once the embeddings are down to 2 dimensions, we can plot them on a 2D chart.

In [49]:
# Reduce the embeddings' dimensions to 2 using UMAP
import umap
reducer = umap.UMAP(n_neighbors=49) 
umap_embeds = reducer.fit_transform(doc_embeds)

# Add the 2 dimensions to the dataframe
df['x'] = umap_embeds[:,0]
df['y'] = umap_embeds[:,1]
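UMAP is one of several ways to project embeddings down to 2 dimensions. For comparison only, here is a minimal sketch of a different and simpler technique, PCA, implemented with NumPy's SVD. It is not what this notebook uses, and the random array below is a stand-in for `doc_embeds` with a made-up dimension of 16. Unlike UMAP, PCA is a linear projection, so it tends to preserve global spread rather than local neighborhoods.

```python
import numpy as np

def pca_2d(embeddings):
    """Project row vectors onto their first two principal components (PCA, not UMAP)."""
    X = np.asarray(embeddings, dtype=float)
    # Center each dimension so the components describe variance around the mean
    X_centered = X - X.mean(axis=0)
    # Rows of vt are the principal directions, ordered by decreasing singular value
    _, _, vt = np.linalg.svd(X_centered, full_matrices=False)
    return X_centered @ vt[:2].T

# Stand-in data: 50 fake "embeddings" of dimension 16
rng = np.random.default_rng(0)
fake_embeds = rng.normal(size=(50, 16))

coords = pca_2d(fake_embeds)
print(coords.shape)  # → (50, 2)
```

The two columns of `coords` could be assigned to `df['x']` and `df['y']` in the same way as the UMAP output.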

In [59]:
# Plot the 2-dimension embeddings on a chart
chart = alt.Chart(df).mark_circle(size=500).encode(
  x=
  alt.X('x',
      scale=alt.Scale(zero=False),
      axis=alt.Axis(labels=False, ticks=False, domain=False)
  ),

  y=
  alt.Y('y',
      scale=alt.Scale(zero=False),
      axis=alt.Axis(labels=False, ticks=False, domain=False)
  ),
  
  tooltip=['search_term']
  )

text = chart.mark_text(align='left', dx=15, size=12, color='black'
          ).encode(text='search_term', color= alt.value('black'))

result = (chart + text).configure(background="#FDF7F0"
      ).properties(
      width=1000,
      height=700,
      title="2D Embeddings"
      )

result