# Build Multilingual Financial Search Applications with Cohere - Code Walkthrough
In this use case, we show how Cohere’s Embed model can search and query financial news
in different languages within a single pipeline. We then show how adding Rerank to
embeddings-based retrieval (or to a legacy lexical search) can further improve the
results.

### Step 0: Enable Model Access Through Amazon Bedrock

Enable model access through the [Amazon Bedrock console](https://console.aws.amazon.com/bedrock), following the instructions in the [Amazon Bedrock documentation](https://docs.aws.amazon.com/bedrock/latest/userguide/model-access.html).

For this walkthrough you will need to request access to the Cohere Embed Multilingual model.

### Step 1: Install Packages and Import Modules

In [None]:
!pip install --upgrade cohere-aws hnswlib translate
# if you upgrade the package, you need to restart the kernel

In [None]:
import pandas as pd
import cohere_aws
import hnswlib
from translate import Translator
import warnings
import os
import re
import boto3

warnings.filterwarnings('ignore')

### Step 2: Import Documents 

Information about the MultiFin dataset can be found in its GitHub repo: https://github.com/RasmusKaer/MultiFin. The repo contains a link to [Dropbox](https://www.dropbox.com/sh/v0fvtn5gnf4ij6f/AABPSV4NzHIRa8GwUfQ7kA0aa?dl=0) where the data can be downloaded.

Download the file _MultiFinDataset2023.zip_, unzip the file and navigate into the folder _MultiFinDataset_EACL_ and then _0-all-languages-lowlevel_. It contains a _train.json_ file. Upload that file to your Jupyter SageMaker Notebook.

In [None]:
# Read the data
df = pd.read_json("./train.json", lines=True)
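
The `lines=True` argument tells pandas to treat the file as JSON Lines, that is, one JSON object per line. As a rough illustration, the same parse can be sketched with the standard library. The two records below are invented and carry only the `text` and `lang` fields used later in this walkthrough; the real file may contain additional fields.

```python
import io
import json

# Two made-up records standing in for train.json; only `text` and `lang` are assumed.
sample = io.StringIO(
    '{"text": "Example headline one", "lang": "English"}\n'
    '{"text": "Eksempel overskrift to", "lang": "Danish"}\n'
)

# JSON Lines: parse each non-empty line as its own JSON document.
records = [json.loads(line) for line in sample if line.strip()]

for record in records:
    print(record["lang"], "->", record["text"])
```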

In [None]:
# Inspect dataset
df.head(5)

In [None]:
# check language distribution
df['lang'].value_counts()

### Step 3: Select List of Documents to Query

In [None]:
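# Ensure there is no duplicated text in the headers: collapse a repeated phrase
# (two or more words) and the text between its occurrences into a single copy
def remove_duplicates(text):
    return re.sub(r'((\b\w+\b.{1,2}\w+\b)+).+\1', r'\1', text, flags=re.I)

df['text'] = df['text'].apply(remove_duplicates)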

# Select languages
languages = ['English', 'Spanish', 'Danish']
# Copy the filtered rows so the new column below is added to an independent DataFrame
top_80_df = df.loc[df['lang'].isin(languages)].copy()

# Pick the top 80 longest articles
top_80_df['text_length'] = top_80_df['text'].str.len()
top_80_df = top_80_df.sort_values(by=['text_length'], ascending=False)
top_80_df = top_80_df[:80]

# Language distribution
top_80_df['lang'].value_counts()
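
To see what the `remove_duplicates` pattern above does, it can be run on a couple of strings in isolation. The two headlines below are invented for illustration and are not taken from the dataset. Note that the pattern only fires when a phrase of at least two words occurs twice, and that it also drops the text between the two occurrences.

```python
import re


def remove_duplicates(text):
    # Same pattern as in the cell above: a phrase of two or more words,
    # some filler, then the same phrase again (case-insensitive).
    return re.sub(r'((\b\w+\b.{1,2}\w+\b)+).+\1', r'\1', text, flags=re.I)


# Invented headline with a repeated two-word phrase.
repeated = "Annual Report 2020 | Annual Report 2020"
# Invented headline with no repetition.
clean = "Quarterly results improve"

deduplicated = remove_duplicates(repeated)
unchanged = remove_duplicates(clean)

print(deduplicated)
print(unchanged)
```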

In [None]:
# Let's look at the first record
top_80_df['text'].iloc[0]

In [None]:
# Set up translator. You can also comment out this block and use the Python list at the end of this notebook
# if you run into any issues with the translator API
translator = Translator(from_lang="autodetect", to_lang="en")
translations = []

# Translate text
for text in top_80_df['text']:
    translation = translator.translate(text)
    # If the header is already in English, the translator returns this message, so we keep the original text
    if translation == 'PLEASE SELECT TWO DISTINCT LANGUAGES':
        translations.append(text) 
    else:
        translations.append(translation)
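
The fallback logic in the loop above can also be isolated into a small helper, which makes it easy to exercise without calling the translation service. This is only a sketch: `translate_or_keep` and `fake_translate` are hypothetical names, and `fake_translate` is an offline stand-in for `translator.translate`, not part of the `translate` package.

```python
SAME_LANGUAGE_MESSAGE = "PLEASE SELECT TWO DISTINCT LANGUAGES"


def translate_or_keep(text, translate_fn, sentinel=SAME_LANGUAGE_MESSAGE):
    """Return the translation, or the original text when the translator
    answers with the sentinel message instead of a translation."""
    translation = translate_fn(text)
    return text if translation == sentinel else translation


# Hypothetical stand-in for translator.translate, used only to run the helper offline.
def fake_translate(text):
    known = {"Hola mundo": "Hello world"}
    return known.get(text, SAME_LANGUAGE_MESSAGE)


translated = [translate_or_keep(t, fake_translate) for t in ["Hola mundo", "Hello again"]]
print(translated)
```

In the notebook, the equivalent call would pass `translator.translate` as `translate_fn`.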

In [None]:
# Add translated text to the dataframe
top_80_df['translated_text'] = translations

# Now we can see the translation of that same record
top_80_df['translated_text'].iloc[0]

### Step 4: Embed and Index Documents

In [None]:
# Establish Cohere client
co = cohere_aws.Client(mode=cohere_aws.Mode.BEDROCK)
model_id = "cohere.embed-multilingual-v3"

# Embed documents
docs = top_80_df['text'].to_list()
docs_lang = top_80_df['lang'].to_list()
translated_docs = top_80_df['translated_text'].to_list() #for reference when returning non-English results
doc_embs = co.embed(texts=docs, model_id=model_id, input_type='search_document').embeddings

# Create a search index
index = hnswlib.Index(space='ip', dim=1024)
index.init_index(max_elements=len(doc_embs), ef_construction=512, M=64)
index.add_items(doc_embs, list(range(len(doc_embs))))
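
The HNSW index built above is an approximate nearest-neighbor structure. With `space='ip'` it ranks by inner product, and hnswlib reports the distance as 1 minus the inner product. For a corpus of only 80 documents, an exact brute-force search is cheap and can serve as a sanity check on what the index returns. The sketch below uses toy 3-dimensional vectors rather than real embeddings, and `exact_ip_search` is a hypothetical helper name.

```python
def inner_product(a, b):
    return sum(x * y for x, y in zip(a, b))


def exact_ip_search(query_vec, doc_vecs, k):
    """Return (doc_ids, distances) for the k documents with the largest inner product.

    Distances follow the 'ip' convention of 1 - inner product, so smaller is closer.
    """
    scored = [(1.0 - inner_product(query_vec, vec), doc_id)
              for doc_id, vec in enumerate(doc_vecs)]
    scored.sort()
    top = scored[:k]
    return [doc_id for _, doc_id in top], [dist for dist, _ in top]


# Toy vectors standing in for document embeddings.
toy_doc_vecs = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.6, 0.8, 0.0],
]
toy_query = [0.0, 1.0, 0.0]

doc_ids, distances = exact_ip_search(toy_query, toy_doc_vecs, k=2)
print(doc_ids)
```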

### Step 5: Build a Retrieval System

In [None]:
# Retrieve the 3 documents closest to the query
def retrieval(query):
    # Embed query and retrieve results
    query_emb = co.embed(texts=[query], model_id=model_id, input_type="search_query").embeddings
    doc_ids = index.knn_query(query_emb, k=3)[0][0] # labels of the 3 nearest neighbors for our single query
    
    # Print and append results
    print(f"QUERY: {query.upper()} \n")
    retrieved_docs, translated_retrieved_docs = [], []
    
    for doc_id in doc_ids:
        # Append results
        retrieved_docs.append(docs[doc_id])
        translated_retrieved_docs.append(translated_docs[doc_id])
    
        # Print results
        print(f"ORIGINAL ({docs_lang[doc_id]}): {docs[doc_id]}")
        if docs_lang[doc_id] != "English":
            print(f"TRANSLATION: {translated_docs[doc_id]} \n----")
        else:
            print("----")
    print("END OF RESULTS \n\n")
    return retrieved_docs, translated_retrieved_docs

### Step 6: Query the Retrieval System

In [None]:
queries = [
    "Are businesses meeting sustainability goals?",
    "Can data science help meet sustainability goals?"
]

for query in queries:
    retrieval(query)

In [None]:
query = "Hvor kan jeg finde den seneste danske boligplan?" # "Where can I find the latest Danish property plan?"
retrieved_docs, translated_retrieved_docs = retrieval(query)

### Step 7: Improve Results with Cohere Rerank

The following query does not return the most relevant result at the top. This is where Rerank helps.

In [None]:
query = "Are companies ready for the next down market?"
retrieved_docs, translated_retrieved_docs = retrieval(query)
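
Conceptually, reranking is a second pass over the candidates returned by the first-stage retrieval: each candidate is scored against the query and the list is reordered by that score. The sketch below shows only the shape of this two-stage pattern. It uses a deliberately crude word-overlap scorer as a stand-in for the Rerank model, which scores relevance very differently; `overlap_score` and `rerank_candidates` are hypothetical names, and the candidate texts are invented.

```python
def overlap_score(query, document):
    """Toy relevance score: fraction of query words that also appear in the document."""
    query_words = set(query.lower().split())
    if not query_words:
        return 0.0
    doc_words = set(document.lower().split())
    return len(query_words & doc_words) / len(query_words)


def rerank_candidates(query, candidates, score_fn, top_n):
    """Score every candidate against the query and return the top_n as (index, score) pairs."""
    scored = [(i, score_fn(query, doc)) for i, doc in enumerate(candidates)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


# Invented candidates, listed in first-stage retrieval order.
candidates = [
    "hotel sector outlook improves",
    "companies prepare for a market downturn",
    "new tax rules for large companies",
]

top = rerank_candidates(
    "are companies ready for the next market downturn",
    candidates,
    overlap_score,
    top_n=1,
)
print(top)
```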

#### Subscribe to the model package in SageMaker


Rerank is available in SageMaker and will soon be available in Bedrock. We use SageMaker here; with Bedrock, the process will be similar to the one shown above for the Embed model, and you won't need to create and manage an endpoint.


To subscribe to the model package:
1. Open the model package listing page [cohere-rerank-multilingual](https://aws.amazon.com/marketplace/pp/prodview-pf7d2umihcseq).
1. On the AWS Marketplace listing, click the **Continue to subscribe** button.
1. On the **Subscribe to this software** page, review the terms and click **Accept Offer** if you and your organization agree with the EULA, pricing, and support terms.
1. Click the **Continue to configuration** button and choose a **region**. A **Product Arn** is then displayed. This is the model package ARN that you need to specify when creating a deployable model using Boto3. Copy the ARN corresponding to your region and specify it in the following cell.

In [None]:
# map model package arn
import boto3
cohere_package = "cohere-rerank-multilingual-v2--8b26a507962f3adb98ea9ac44cb70be1" # replace this with your info

model_package_map = {
    "us-east-1": f"arn:aws:sagemaker:us-east-1:865070037744:model-package/{cohere_package}",
    "us-east-2": f"arn:aws:sagemaker:us-east-2:057799348421:model-package/{cohere_package}",
    "us-west-1": f"arn:aws:sagemaker:us-west-1:382657785993:model-package/{cohere_package}",
    "us-west-2": f"arn:aws:sagemaker:us-west-2:594846645681:model-package/{cohere_package}",
    "ca-central-1": f"arn:aws:sagemaker:ca-central-1:470592106596:model-package/{cohere_package}",
    "eu-central-1": f"arn:aws:sagemaker:eu-central-1:446921602837:model-package/{cohere_package}",
    "eu-west-1": f"arn:aws:sagemaker:eu-west-1:985815980388:model-package/{cohere_package}",
    "eu-west-2": f"arn:aws:sagemaker:eu-west-2:856760150666:model-package/{cohere_package}",
    "eu-west-3": f"arn:aws:sagemaker:eu-west-3:843114510376:model-package/{cohere_package}",
    "eu-north-1": f"arn:aws:sagemaker:eu-north-1:136758871317:model-package/{cohere_package}",
    "ap-southeast-1": f"arn:aws:sagemaker:ap-southeast-1:192199979996:model-package/{cohere_package}",
    "ap-southeast-2": f"arn:aws:sagemaker:ap-southeast-2:666831318237:model-package/{cohere_package}",
    "ap-northeast-2": f"arn:aws:sagemaker:ap-northeast-2:745090734665:model-package/{cohere_package}",
    "ap-northeast-1": f"arn:aws:sagemaker:ap-northeast-1:977537786026:model-package/{cohere_package}",
    "ap-south-1": f"arn:aws:sagemaker:ap-south-1:077584701553:model-package/{cohere_package}",
    "sa-east-1": f"arn:aws:sagemaker:sa-east-1:270155090741:model-package/{cohere_package}",
}

region = boto3.Session().region_name
if region not in model_package_map:
    raise Exception(f"Current boto3 session region {region} is not supported.")

model_package_arn = model_package_map[region]

#### Create an endpoint and perform real-time inference

If you want to understand how real-time inference with Amazon SageMaker works, see [Documentation](https://docs.aws.amazon.com/sagemaker/latest/dg/how-it-works-hosting.html).

In [None]:
co = cohere_aws.Client(region_name=region)
co.create_endpoint(arn=model_package_arn, endpoint_name="cohere-rerank-multilingual", instance_type="ml.g4dn.xlarge", n_instances=1)

# If the endpoint is already created, you just need to connect to it
# co.connect_to_endpoint(endpoint_name="cohere-rerank-multilingual")

Once the endpoint has been created, you can perform real-time inference.

In [None]:
results = co.rerank(query=query, documents=retrieved_docs, top_n=1)

for hit in results:
    print(hit.document['text'])
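
Because `retrieval` returns the translations in the same order as the original documents, a reranked hit can be mapped back to its English translation by position. The sketch below assumes that each hit exposes an `index` attribute pointing into the `documents` list passed to `rerank`, along with a `relevance_score`; check these attribute names against your installed `cohere-aws` version. `FakeHit` and `translations_for_hits` are hypothetical names, and `FakeHit` exists only so the snippet runs without an endpoint.

```python
from collections import namedtuple

# Hypothetical stand-in for a rerank result; the attribute names are an assumption.
FakeHit = namedtuple("FakeHit", ["index", "relevance_score"])


def translations_for_hits(hits, translated_docs):
    """Pair each hit's translated text with its relevance score, using the hit's position."""
    return [(translated_docs[hit.index], hit.relevance_score) for hit in hits]


# Invented values for illustration only.
sample_translations = ["first translation", "second translation", "third translation"]
sample_hits = [FakeHit(index=2, relevance_score=0.91)]

pairs = translations_for_hits(sample_hits, sample_translations)
print(pairs)
```

In the notebook, the equivalent call would be `translations_for_hits(results, translated_retrieved_docs)`, under the same assumption about the result attributes.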

### Translations Backup

If you run into issues or rate limits with the `translate` package, you can use the list below, which already contains the translated text.

In [None]:
translations = ["CFOdirect: Results from PwC's Employee Engagement Landscape Survey, including how to increase employee engagement. Read also about the accounting consequences for income tax according to Brexit",
 "CFOdirect: Get guidance on, among other things, income taxes, IFRS, etc. in PwC's Strategy That Works on how to ensure an effective and value-creating approach to strategy and execution through five actions",
 "CFOdirect: For the 10th time, PwC measures companies' digital transformation in the Digital IQ Survey 2017 and new FASB guidance clarifies the accounting treatment of non-financial assets",
 'Most acquisitions and divestments don’t maximise value – even when some dealmakers think they do. But acquirers who prioritise value creation at the onset outperform peers by as much as 14%',
 'CFOdirect: Investors look at diversity on the board. What does it look like for your industry? Also read our benchmarking analysis of recently listed technology companies',
 'Polish Constitutional Tribunal stated that the bill introducing the removal of the pension and disability contributions is not consistent with Polish Constitution',
 'CFOdirect: Why and how the board should take a fresh look at risks in your company, analysis of 17 major banks during and after the financial crisis and much more',
 'New from CFOdirect: new standard for hedge accounting on the way, simplified tax accounting for internal transfers of assets, and much more',
 'Proposed amendment to the Polish CIT Law to end exemption of rental revenues and capital gains from disposal of real estate available to investment funds',
 'New from CFOdirect: New standard on revenue recognition, revision of accounting estimates/audit estimates, cybersecurity webcast and much more',
 '69% of business leaders have experienced a corporate crisis in the last five years yet 29% of companies have no staff dedicated to crisis preparedness',
 'PwC Esade Georgeson and Diligent create a new Corporate Governance Centre of reference in Spain to improve the governance of Esade companies',
 'Amendment to the Polish CIT Law restricting exemption of rental revenues and capital gains from disposal of real estate available to investment funds',
 'PwC and Salesforce: Connecting organizations to their customers using the latest innovations in cloud, social, mobile and data analytics solutions',
 'New from CFOdirect: New PP&E guide, FAQs on the new leasing standard, podcast on the challenges of implementing the leasing standard and much more',
 'Legislative proposal presented on interest-free loans, deferred payroll tax deadline, early payment of tax credit and ceiling on deposits in the tax account',
 'Spain could reduce its emissions by between 7% and 17% by 2033 if it opts for a model based on innovation and technological development',
 'Experts and managers consider the role of institutions in driving innovation and digitization in the economy to be irrelevant',
 'PwC Tax & Legal Services, the second most innovative law firm in Europe in the New Business of Law category, according to the FT Innovative Lawyers awards',
 'Experts and managers believe that the reform of the Corporation Tax will reduce the competitiveness of companies and economic growth',
 'Conclusions of the webcast: Multinationals facing the COVID-19 crisis: fiscal perspective of the restructuring of business models',
 'The Spanish hotel sector will remain stable for the 2018 Summer season following the historical records broken in previous seasons',
 'Spanish companies are improving their degree of digitisation and are betting on Artificial Intelligence and robotisation for the Internet of Things',
 'PwC Poland, Airbus Poland, Cleanproject and SatRevolution to start a unique aerospace startup acceleration program  – Future Space Accelerator',
 "CFOdirect: IFRS news - New Year's resolutions for your company, be prepared for what questions you will face at the shareholders' meeting and much more",
 'Integrate ESG criteria and purpose into the main challenge strategy of the Boards of Spanish companies in the post-COVID world',
 'CFOdirect: Seven Paths to Growth for Medium-Sized Enterprises, and Podcast on Disclosure Requirements in the "Revenue from Contracts with Customers" Standard',
 'Greater attention to non-financial information and  technology, keys to the new internal control regulations for companies',
 'Digitalization and new consumer expectations put a $46 trillion market in logistics at stake',
 'CFOdirect: 20 years of PwC CEO survey, VIDEO: FASB has updated the definition of a company - get the changes explained and much more',
 'Changes in the VAT Directive – as of 1 January 2020 new rules on settlement of international commodity transactions will come into force',
 'The experts and directors ask the new government not to touch the Corporate Tax and a reduction of the IRPF in the next legislature',
 'Bill on the legal status of employers and employees in the case of salary compensation of companies in connection with COVID-19 has now been adopted',
 'Technology and succession strategy the three challenges faced by the Boards of Directors of large Spanish companies',
 'Blockchain technologies could boost the global economy US$1.76 trillion by 2030 through raising levels of tracking, tracing and trust.',
 'Blockchain technologies could boost the global economy US$1.76 trillion by 2030 through raising levels of tracking, tracing and trust',
 'Hotel sector prospects improve for the second quarter of the year despite political and financial uncertainty',
 'Survey Polled Over 500 Global Executives on How Financial Services and Technology Firms are Navigating the Current Fintech Landscape',
 'Large companies must submit all their tax documentation by electronic means from 1 July',
 "Experts question the Treasury's calculation of the effective rate of Corporate Income Tax paid by Spanish companies",
 'Key technical and digital training to improve the labour insertion of girls and young people in the Spanish labour market of the future',
 'The press will experience a 21% drop in revenue over the next five years despite the rise of digital advertising',
 'PwC Named a Leader in Global Cybersecurity Consulting and Global Digital Experience Agencies reports by Independent Research Firm',
 'Loss of sensitive data and damage to physical assets main impacts of cyberattacks on Spanish companies',
 'As work sites slowly start to reopen, CFOs are concerned about the global economy and a potential new COVID-19 wave - PwC survey',
 'Proposal from the Minister for Industry, Business and Financial Affairs: Abolition of the possibility of establishing IVSs and simultaneous reduction of the capital requirement in APSs',
 'Large companies change the direction of innovation: software and services gain weight to the detriment of the physical product',
 'AML Transaction Monitoring Overview: Banking segments differentiators - Retail, Corporate, Correspondent and Investment banking.',
 'Quality of business reporting on the Sustainable Development Goals improves, but has a long way to go to meet and drive targets.',
 'From cigars to 3D printing: Starting a new entrepreneurial venture in the Netherlands: Global Family Business Survey 2016: PwC',
 'Water companies - now associations can probably be established soon in order to improve the efficiency and quality of the sector',
 'European banks face the challenge of increasing their profitability in a context of low rates and new regulatory requirements',
 '85% of Spanish CEOs believe that technology is the disruptive factor that will transform their companies in the coming years',
 'More than half of companies acknowledge that they do not adequately manage their risks and claim the role of internal auditors',
 'The preference for cash and the fragmentation of the market are obstacles to creating a means of payment industry in the EU',
 'Tax Flash: Amendment of the circular regarding the filing procedures of Country by Country Reports and relevant notifications',
 'Sherlock in Health:How artificial intelligence may improve quality and efficiency whilst reducing healthcare costs in Europe?',
 'Finnish national developments with regards to the final tax losses doctrine concerning cross-border EU/EEA group contribution',
 'Banking executives see the arrival of new entrants outside the sector as the main threat to the traditional business',
 'Publications on new leasing standard, IFRS 16, and special challenges for shipping companies and communication companies',
 'Innovation, digitisation and international competition are the three major challenges facing Spanish family businesses',
 'Governments should do more to unlock the potential of technology to facilitate tax compliance, says PwC and World Bank report',
 'Companies shifting more R&D spending away from physical products to software and services: 2016 Global Innovation 1000 Study',
 'Over a half of the “fathers and sons” surveyed by PwC together with NAFI are optimistic about the future of Russian business',
 'R&D Tax Relief – significant growth of the deduction and costs of staff on civil law contracts as qualified costs since 2018',
 'Only 10 years to achieve Sustainable Development Goals but businesses remain on starting blocks for integration and progress',
 'Webcast summary: Data protection regulation and guarantee of digital rights in health crisis situations',
 'The stock market value of the 100 largest listed companies falls 15% between January and March but withstands the onslaught of COVID-19',
 "Business and people with high share or capital income must finance the government's proposal on the right to early retirement",
 'Decarbonization plus investment in networks and renewables and new services the strategic bets of the large electricity companies',
 'Using AI to better manage the environment could reduce greenhouse gas emissions, boost global GDP by up to 38m jobs by 2030',
 'Increased regulatory demands coupled with business change are forcing European banks to reinvent themselves in order to be profitable',
 'New from CFOdirect: Shareholder questions for management, the SEC cybersecurity guide, US tax reform and more',
 'Globalisation, digitisation, demographic change: Managing the megatrends in Russia: Global Family Business Survey 2016: PwC',
 'PwC leads for the seventh consecutive year in transaction advice in Spain in 2018 according to the main rankings',
 '49% of CFOs believe their companies would be back to normal in 3 months if the pandemic ended today',
 'The real estate sector faces 2013 with moderate optimism and expects it to mark the beginning of the market revival',
 'Foundations and Associations as obligated subjects for the prevention of money laundering and terrorist financing',
 '34 Digital Innovation Hubs that have qualified for the Smart Factories programme will receive support from PwC and Oxentia',
 'Monitoring, Reporting and Validating (MRV) maritime vessels’ CO2 emissions performance: A new era in the Shipping Industry'] 