<a href="https://colab.research.google.com/github/amrindersingh03/Unstructured-Machine-Learning-/blob/main/Summarization_using_Cohere.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Summarization with the help of Generation Models
This notebook shows how to use Cohere's Generation models to summarize text.

<img src="https://github.com/cohere-ai/notebooks/raw/main/notebooks/images/summarization.png"
    style="width:100%; max-width:600px" alt="provided with the right prompt, a language model can generate multiple candidate summaries" />

We use few-shot prompting: the prompt contains two examples and a task description.



In [None]:
# Let's first install Cohere's python SDK
!pip install cohere


In [None]:
# Set up Cohere Client using API key
import cohere
import time
import pandas as pd
# Paste your API key here. Do not share it publicly or commit it with the notebook
api_key = 'YOUR_API_KEY'
co = cohere.Client(api_key)
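
A key pasted into a notebook cell is easy to leak when the notebook is shared. A minimal sketch of one alternative, assuming the key has been exported in an environment variable beforehand. The variable name `COHERE_API_KEY` and the helper `load_api_key` are illustrative choices for this sketch, not something the SDK requires:

```python
import os

def load_api_key(env_var="COHERE_API_KEY"):
    """Return the API key stored in the given environment variable.

    The variable name is an assumption for this sketch; use whatever
    name your own environment defines.
    """
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable to your API key")
    return key

# Usage (not executed here): co = cohere.Client(load_api_key())
```

Failing loudly when the variable is missing avoids sending requests with an empty key.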


Our prompt is designed to paraphrase an input sentence into a simpler one. It contains two examples, followed by the sentence we want the model to summarize:

**Killer whales have a diverse diet, although individual populations often specialize in particular types of prey.**

In [None]:
# Few-shot prompting: the prompt includes two example summaries,
# all formatted the same way so the model sees a consistent pattern
prompt = '''"The killer whale or orca (Orcinus orca) is a toothed whale belonging to the oceanic dolphin family, of which it is the largest member"
In summary: "The killer whale or orca is the largest type of dolphin"
---
"It is recognizable by its black-and-white patterned body"
In summary: "Its body has a black and white pattern"
---
"Killer whales have a diverse diet, although individual populations often specialize in particular types of prey"
In summary: "'''
print(prompt)

"The killer whale or orca (Orcinus orca) is a toothed whale belonging to the oceanic dolphin family, of which it is the largest member"
In summary: "The killer whale or orca is the largest type of dolphin"
---
"It is recognizable by its black-and-white patterned body"
In summary: "Its body has a black and white pattern"
---
"Killer whales have a diverse diet, although individual populations often specialize in particular types of prey"
In summary: "
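
When the examples change often, a prompt of this shape can be assembled from data instead of being typed by hand. The following is a rough sketch; `build_prompt` is a hypothetical helper written for this illustration, and the separator and `In summary:` cue simply mirror the prompt above:

```python
def build_prompt(examples, sentence):
    """Assemble a few-shot summarization prompt.

    examples: list of (original, summary) pairs shown to the model.
    sentence: the text we want summarized; its summary is left open.
    """
    blocks = [f'"{original}"\nIn summary: "{summary}"' for original, summary in examples]
    # The final block ends with an opening quote, so the model writes the
    # summary and then closes it with a '"' character.
    blocks.append(f'"{sentence}"\nIn summary: "')
    return "\n---\n".join(blocks)

examples = [
    ("The killer whale or orca (Orcinus orca) is a toothed whale belonging to the "
     "oceanic dolphin family, of which it is the largest member",
     "The killer whale or orca is the largest type of dolphin"),
    ("It is recognizable by its black-and-white patterned body",
     "Its body has a black and white pattern"),
]
prompt = build_prompt(
    examples,
    "Killer whales have a diverse diet, although individual populations often "
    "specialize in particular types of prey",
)
print(prompt)
```

Generating every block from the same template keeps spacing and punctuation identical across examples, which is easy to get wrong by hand.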


We request several completions from the model via the API.

In [None]:
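# Generate predictions using Cohere's model. Make sure to use the latest available model
n_generations = 5

prediction = co.generate(
    model='xlarge',
    prompt=prompt,
    return_likelihoods='GENERATION',  # return token likelihoods for the generated text
    stop_sequences=['"'],             # stop once the model closes the quoted summary
    max_tokens=50,
    temperature=0.7,
    num_generations=n_generations,
    k=0,                              # 0 disables top-k filtering
    p=0.75)                           # top-p (nucleus) sampling threshold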


In [None]:
# Get list of generations and one likelihood score per generation
gens = []
likelihoods = []
for gen in prediction.generations:
    gens.append(gen.text)
    # Sum the per-token likelihoods of this generation
    likelihoods.append(sum(t.likelihood for t in gen.token_likelihoods))
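
Depending on the SDK version, the returned text may still include the stop sequence (here the closing quote) or stray whitespace, which can make otherwise identical summaries look different when duplicates are dropped later. A small cleanup sketch; `clean_generation` is a hypothetical helper and the raw strings are made up for illustration:

```python
def clean_generation(text, stop='"'):
    """Cut the text at the first stop sequence, if present, and trim whitespace."""
    cut = text.find(stop)
    if cut != -1:
        text = text[:cut]
    return text.strip()

# Hypothetical raw outputs, for illustration only
raw = ['Killer whales eat many kinds of prey"', '  Orcas have a varied diet  ']
cleaned = [clean_generation(t) for t in raw]
print(cleaned)
```

Applying it as `gens = [clean_generation(g) for g in gens]` before building the dataframe would normalize the candidates.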


In [None]:
pd.options.display.max_colwidth = 200
# Create a dataframe for the generated sentences and their likelihood scores
df = pd.DataFrame({'generation':gens, 'likelihood': likelihoods})
# Drop duplicates
df = df.drop_duplicates(subset=['generation'])
# Sort by highest sum likelihood
df = df.sort_values('likelihood', ascending=False, ignore_index=True)
print('Candidate summaries for the sentence: \n"Killer whales have a diverse diet, although individual populations often specialize in particular types of prey."')
df

Better results can often be reached by generating several candidates and then ranking and filtering them. Here we rank the generations by the sum of their token likelihoods.
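
The ranking code above adds up the token likelihoods of each generation. Assuming these are log-likelihoods (each at most zero), a sum tends to favor shorter generations, while dividing by the token count gives a length-normalized average. The sketch below contrasts the two; the helper names and the numbers are made up for illustration and do not come from the model:

```python
def rank(candidates, score):
    """Return candidate texts sorted from highest to lowest score."""
    ordered = sorted(candidates, key=lambda c: score(c[1]), reverse=True)
    return [text for text, _ in ordered]

def total(logps):
    """Sum of per-token log-likelihoods."""
    return sum(logps)

def mean(logps):
    """Average per-token log-likelihood."""
    return sum(logps) / len(logps)

# Made-up per-token log-likelihoods, for illustration only
candidates = [
    ("short", [-0.9, -0.9]),               # sum -1.8, mean -0.9
    ("longer", [-0.5, -0.5, -0.5, -0.5]),  # sum -2.0, mean -0.5
]
print(rank(candidates, total))  # ['short', 'longer']
print(rank(candidates, mean))   # ['longer', 'short']
```

Which score suits a task depends on whether brevity should be rewarded; for summaries of similar length the two orderings usually agree.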

## Hyperparameters
It's worth spending some time learning the hyperparameters of the generation endpoint. For example, [temperature](https://docs.cohere.ai/temperature-wiki) tunes the degree of randomness in the generations. Other parameters include [top-k and top-p](https://docs.cohere.ai/token-picking), which restrict the set of tokens the model samples from, as well as `frequency_penalty` and `presence_penalty`, which can reduce the amount of repetition in the model's output. See the [API reference of the generate endpoint](https://docs.cohere.ai/generate-reference) for more details on all the parameters.
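
To build intuition for temperature and top-p, here is a toy sketch using the textbook definitions on a made-up four-token vocabulary. It is not Cohere's implementation; the function names and logits are assumptions for illustration:

```python
import math

def softmax(logits, temperature=1.0):
    """Convert logits to probabilities; a lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    z = sum(exps)
    return [e / z for e in exps]

def top_p_filter(probs, p):
    """Keep the smallest set of most likely tokens whose total probability reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    norm = sum(probs[i] for i in kept)
    # Renormalize so the kept tokens form a valid distribution
    return {i: probs[i] / norm for i in kept}

logits = [2.0, 1.0, 0.5, 0.1]  # toy scores for a four-token vocabulary
for t in (0.7, 1.5):
    probs = softmax(logits, temperature=t)
    kept = sorted(top_p_filter(probs, 0.75))
    print(t, [round(x, 3) for x in probs], kept)
```

At the lower temperature the top token takes a larger share of the probability mass, so fewer tokens are needed to reach the same top-p threshold.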