## 1. Document Embeddings

In this section, we explore several methods for generating embeddings for financial documents. Word embeddings capture the semantic meaning of words and the relationships between them. Aggregating them lets us represent documents numerically, so that they can be used in NLP tasks such as clustering, text classification, and sentiment analysis.

### Installing & Importing required libraries

In [1]:
%%capture
!pip install wordninja
!pip install contractions
!pip install num2words 
!pip install textblob
!pip install gensim
!pip install joblib

In [2]:
import pandas as pd
import numpy as np
import re
import wordninja
from num2words import num2words
import contractions
from textblob import TextBlob
import gensim
import nltk
import joblib
import ast
from nltk import FreqDist
from nltk.stem.snowball import SnowballStemmer
from gensim.models import KeyedVectors
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer
import string
nltk.download("punkt")
from gensim.parsing.preprocessing import remove_stopwords, strip_punctuation, strip_multiple_whitespaces
import warnings 
warnings.simplefilter("ignore")



[nltk_data] Downloading package punkt to /usr/share/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


### Loading documents

In [3]:
# Load data
data_path = "/kaggle/input/3-document-cleaning/cleaned-docs.csv"
raw_data = pd.read_csv(data_path)

# Convert lists represented as strings to lists within the dataframe
raw_data['HSentences'] = raw_data['HSentences'].apply(ast.literal_eval)
raw_data['MSentences'] = raw_data['MSentences'].apply(ast.literal_eval)

In [4]:
raw_data.head(5)

Unnamed: 0,Report Name,Bank Name,Report Date,Page ID,Page Text,MSentences,HSentences
0,fx_insight_e_16_janvier_2023,citi,16-1-2023,0,\n Citi Global Wealth Investments \n FX Snaps...,[citi global wealth invest fx snapshot major c...,[ Citi Global Wealth Investments FX Snapshot M...
1,fx_insight_e_16_janvier_2023,citi,16-1-2023,3,\n Important Disclosure \n “Citi analysts” ref...,[import disclosur citi analyst refer invest pr...,[ Important Disclosure Citi analysts refers to...
2,exemple_analyse_macro_economique_goldman_sachs,goldman_sachs,,0,\n Fixed Income \n MUSINGS \n FIXED INCOME Go...,[fix incom muse fix incom goldman sach asset m...,[ Fixed Income MUSINGS FIXED INCOME Goldman Sa...
3,exemple_analyse_macro_economique_goldman_sachs,goldman_sachs,,1,\n Fixed Income \n MUSINGS \n Goldman Sachs As...,[fix incom muse goldman sach asset manag fix i...,[ Fixed Income MUSINGS Goldman Sachs Asset Man...
4,exemple_analyse_macro_economique_goldman_sachs,goldman_sachs,,2,\n MUSINGS \n FIXED INCOME Goldman Sachs Asse...,[muse fix incom goldman sach asset manag centr...,[ MUSINGS FIXED INCOME Goldman Sachs Asset Man...


In [5]:
raw_data.describe()

Unnamed: 0,Page ID
count,395.0
mean,29.837975
std,27.220832
min,0.0
25%,6.0
50%,22.0
75%,48.0
max,107.0


Here, we define a function that returns a document at different aggregation levels: words, sentences, or the whole text. Each row in the dataframe represents a single page. The columns MSentences and HSentences each hold the list of sentences for that page. MSentences contains fully preprocessed sentences, which are used to compute the embeddings. HSentences contains lightly preprocessed sentences, which are used later to output readable answers to questions/queries in the semantic search and information retrieval section.

In [6]:
from itertools import chain

def get_document(document, by="words", col="MSentences", paginated=False):
    # Get the sentences of all pages
    pages_sentences = document[col].tolist()   
    
    if by == "words":
        pages_words = []
        for page_sentences in pages_sentences:
            page_words = []
            for sentence in page_sentences:
                sent_tokens = word_tokenize(sentence)
                page_words.append(sent_tokens)
            
            pages_words.append(list(chain.from_iterable(page_words)))
        results = pages_words
        
    elif by == "sentences":
        results = pages_sentences
        
    elif by == "document":
        results = [ " ".join(page_sentences).strip() for page_sentences in pages_sentences ]
        if not paginated: 
            return " ".join(results).strip()
            
    if paginated:
        return results
    return list(chain.from_iterable(results))

In [7]:
report_name = "fx_insight_e_15_mai_2023"
document = raw_data[raw_data['Report Name'] == report_name].sort_values(by='Page ID')
print("The document", report_name, "has", len(document), "pages.")

The document fx_insight_e_15_mai_2023 has 7 pages.


In [8]:
# Get the first 10 words of a page of the document
page_id = 0
print(get_document(document, by="words", col="MSentences", paginated=True)[page_id][:10])

# Get the first 3 sentences of a page of the document
print(get_document(document, by="sentences", col="MSentences", paginated=True)[page_id][:3])

# Get a small chunk from the whole document
print(get_document(document, by="document", col="MSentences", paginated=False)[:100])

['invest', 'product', 'bank', 'deposit', 'govern', 'insur', 'bank', 'guarante', 'lose', 'valu']
['invest product bank deposit', 'govern insur', 'bank guarante']
invest product bank deposit govern insur bank guarante lose valu major currenc perform sourc bloombe


In [9]:
print(get_document(document, by="document", col="HSentences", paginated=True)[4])

Safe-haven Currencies CHF Citi views & strategy Bias/ Forecasts/ Key levels Citi FX outlook Room for more short-term hawkishness - domestic factors enable an incrementally hawkish SNB, which should keep CHF supported. SNBs conditional inflation forecast would suggest the need for additional hikes, and the improvement in leading indicators creates room for a hawkish surprise. Previously USDCHF: 0 3mths: 0.90 USDCHF: 6 12mths: 0.87 USDCHF: LT: 0.85 Currently (as of Apr): USDCHF: 0 3mths: 0.87 USDCHF: 6 12mths: 0.86 USDCHF: LT: 0.85 6-12mths: Bullish CHF vs USD, moderately bearish vs EUR JPY Citi views & strategy Bias/ Forecasts/ Key levels Citi FX outlook BoJ YCC shift remains in play. Citi analysts base case is for shortening of the target duration from 10y to 5y or 3y in June meeting (after final outcome of spring wage negotiations). For US, as more hikes are priced, total cycle cuts are likely to deepen. Even in the scenario of a hawkish Fed, we think it may be hard for US duration to

### 1.1. Document Embeddings using GloVe

GloVe (Global Vectors for Word Representation) is an unsupervised learning algorithm for obtaining vector representations for words. Training is performed on aggregated global word-word co-occurrence statistics from a corpus, and the resulting representations showcase interesting linear substructures of the word vector space. Those embeddings are capable of capturing semantic relationships between words.
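
For reference, the training objective described in the original GloVe paper is a weighted least-squares fit to the log co-occurrence counts. The notation is not defined elsewhere in this notebook: $X_{ij}$ is the number of times word $j$ occurs in the context of word $i$, $w_i$ and $\tilde{w}_j$ are the word and context vectors, $b_i$ and $\tilde{b}_j$ are bias terms, $V$ is the vocabulary size, and $f$ is a weighting function that caps the influence of very frequent pairs.

$$
J = \sum_{i,j=1}^{V} f(X_{ij}) \left( w_i^{\top} \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2,
\qquad
f(x) =
\begin{cases}
(x / x_{\max})^{\alpha} & \text{if } x < x_{\max} \\
1 & \text{otherwise}
\end{cases}
$$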

#### Approach:

1. Data Preparation: The first step in the process is to ensure that the documents have been parsed and cleaned, which has already been accomplished in the previous notebook. We will be using the output of that notebook as our dataset for this current task.

2. Tokenization: For generating vector embeddings at the sentence level, we need to tokenize the sentences within each page of the document. Tokenization is the process of breaking down a sentence into individual words or tokens. This step is essential as it allows us to convert raw text into a format that can be used for further analysis. This step has also been accomplished in the previous notebook.

3. Word Vector Retrieval: Once we have tokenized the sentences, the next step is to retrieve the vector embeddings for each word in the sentence using the pre-trained GloVe embeddings. These embeddings are 300-dimensional representations, capturing semantic information about each word.

4. Sentence Vector Calculation: With the word embeddings at hand, we calculate an average vector for each sentence. This is achieved by taking the average of the individual word vectors in the sentence. The result is a representative vector that captures the overall meaning of the sentence.

5. Page-level Sentence Embeddings: All sentences within a page of the document go through the same process of tokenization, word vector retrieval, and sentence vector calculation. Consequently, each page will have a list of sentences and their corresponding vector representations.

Now, we have the option to further condense the information to represent each page by calculating an average vector of all the sentence vectors within that page. This will provide us with a representative vector for each page. If we wish to represent the entire document as a single vector, we can take an additional average across all the page-level vectors. This final step results in a vector that represents the entire financial document.
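
The averaging steps above can be sketched on toy data. This is a minimal illustration, not the notebook's implementation: the 3-dimensional vectors in `toy_vectors` and the helper names are hypothetical, sentences are split on whitespace, and unknown words are simply skipped (the functions defined later in this notebook handle them differently).

```python
import numpy as np

# Hypothetical 3-dimensional word vectors (the GloVe vectors used here have 300 dimensions)
toy_vectors = {
    "bank":    np.array([1.0, 0.0, 0.0]),
    "deposit": np.array([0.0, 1.0, 0.0]),
    "rate":    np.array([0.0, 0.0, 1.0]),
}

def sentence_vector(sentence):
    """Average the vectors of the known words in a whitespace-tokenized sentence."""
    vectors = [toy_vectors[w] for w in sentence.split() if w in toy_vectors]
    return np.mean(vectors, axis=0) if vectors else np.zeros(3)

def page_vector(sentences):
    """Average the sentence vectors of one page."""
    return np.mean([sentence_vector(s) for s in sentences], axis=0)

def document_vector(pages):
    """Average the page vectors of the whole document."""
    return np.mean([page_vector(p) for p in pages], axis=0)

# A toy document with two pages
pages = [["bank deposit", "rate"], ["bank rate"]]
doc_vec = document_vector(pages)
print(doc_vec)
```

For this toy input the document vector works out to (0.375, 0.125, 0.5). Note that an average of averages gives every sentence and every page equal weight regardless of length, which may or may not be desirable.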

#### Loading GloVe Embeddings

In [10]:
%%capture 
!wget -N http://nlp.stanford.edu/data/glove.6B.zip
!unzip -o glove.6B.zip -d glove.6B
!rm glove.6B/glove.6B.50d.txt
!rm glove.6B/glove.6B.100d.txt
!rm glove.6B/glove.6B.200d.txt
!rm glove.6B.zip

In [11]:
%%time
glove_vectors = "/kaggle/working/glove.6B/glove.6B.300d.txt"
glove_model = KeyedVectors.load_word2vec_format(glove_vectors, binary=False, encoding='utf-8', no_header=True)

CPU times: user 59 s, sys: 2.45 s, total: 1min 1s
Wall time: 60 s


#### Generating Embeddings

Using a single document from the dataset, we generate embeddings at different aggregation levels: word embeddings, sentence embeddings, and a single-vector document embedding. To represent a sentence as one vector, we compute the average of the embeddings of its words. The same approach is applied to get a document embedding. One can also use the sum of the vectors instead of the average, but here we stick to the average.
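
On the choice between sum and average: the sum of a sentence's word vectors is its average scaled by the number of words, and cosine similarity ignores scale, so both give the same cosine similarity. The sketch below checks this on random vectors. The vectors are hypothetical stand-ins, not embeddings from this notebook.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

rng = np.random.default_rng(0)
word_vectors = rng.normal(size=(4, 300))  # four hypothetical word vectors
other = rng.normal(size=300)              # a hypothetical vector to compare against

mean_vec = word_vectors.mean(axis=0)
sum_vec = word_vectors.sum(axis=0)

sim_mean = cosine(mean_vec, other)
sim_sum = cosine(sum_vec, other)
print(sim_mean, sim_sum)
```

The two printed values agree up to floating-point error. The choice would matter for distance measures that are sensitive to vector length, such as Euclidean distance.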

In [12]:
def get_word_embedding(word, model):
    """Retrieve Word Embeddings of a Word using GloVe or Word2Vec-based Models"""
    # Word2Vec models expose their vectors under `.wv`; KeyedVectors are indexed directly
    vectors = model.wv if hasattr(model, 'wv') else model
    try:
        return vectors[word]
    except KeyError:
        # Out-of-vocabulary words fall back to the vector of the placeholder token 'x'
        return vectors['x']

def get_sentence_embedding(sentence, embedding_model):
    """Return the mean word vector of a sentence as an array of shape (1, dim)."""
    # Word2Vec models expose their vectors under `.wv`; KeyedVectors are indexed directly.
    # (Previously `w2v` was only assigned for Word2Vec models, which raised an error
    # for GloVe models whenever the sentence was empty.)
    vectors = embedding_model.wv if hasattr(embedding_model, 'wv') else embedding_model
    # Tokenize the sentence using NLTK tokenizer
    words = word_tokenize(sentence)
    # Get word vectors for each word in the sentence
    word_vectors = [get_word_embedding(word, embedding_model) for word in words]
    
    if not word_vectors:
        # normally this wouldn't happen unless the sentence is empty
        return np.zeros_like(vectors['x']).reshape(1, -1)
    
    # Calculate the mean of the word vectors (a simple baseline that ignores word order)
    mean_embedding = np.mean(word_vectors, axis=0)
    return mean_embedding.reshape(1, -1)

In [13]:
def get_document_embeddings(document, by="words", paginated=False, model=None):
    if by == "words":
        doc = get_document(document, by="words", col="MSentences", paginated=paginated)
        # if paginated, we have a list of lists of words, else it's a single list of words
        words = []
        if paginated:
            for page in doc:
                page_words = []
                for word in page:
                    word_embedding = get_word_embedding(word, model)
                    word_item = [word, word_embedding]  
                    page_words.append(word_item)
                words.append(page_words)
        else:
            for word in doc:
                word_embedding = get_word_embedding(word, model)
                word_item = [word, word_embedding]
                words.append(word_item)
        return words
    
    elif by == "sentences":
        doc = get_document(document, by="sentences", col="MSentences", paginated=paginated)
        sentences = []
        if paginated:
            for page in doc:
                page_sentences = []
                for sentence in page:
                    sentence_embedding = get_sentence_embedding(sentence, model)
                    sentence_item = [sentence, sentence_embedding[0]]
                    page_sentences.append(sentence_item)
                sentences.append(page_sentences)
        else:
            for sentence in doc:
                sentence_embedding = get_sentence_embedding(sentence, model)
                sentence_item = [sentence, sentence_embedding[0]]
                sentences.append(sentence_item)
        return sentences

In [14]:
# Get the word embedding of the second word of the document (index 1)
get_document_embeddings(document, by="words", paginated=False, model=glove_model)[1]

['product',
 array([ 1.3399e-01,  9.5343e-01, -2.1405e-02, -4.4056e-01, -3.1848e-01,
        -1.1838e-02, -1.3645e-01,  3.1068e-01,  2.1663e-01, -2.0424e+00,
        -1.8248e-01, -1.6210e-01, -5.6770e-01, -3.7521e-01, -5.0013e-02,
        -2.0031e-02, -2.4038e-01, -4.9640e-02,  2.8889e-01, -3.9435e-01,
        -1.5044e-01,  1.9880e-01,  1.6445e-01,  4.1103e-01, -4.6098e-01,
         4.0521e-01, -4.5604e-01, -9.8286e-02,  1.3198e-01,  1.3040e-01,
        -5.5641e-01,  2.2624e-01,  1.1782e-01,  6.0687e-02, -1.0273e+00,
         6.4199e-01, -1.3349e-01,  1.4734e-02,  2.5083e-01, -3.1211e-01,
        -3.8832e-01,  1.5705e-01, -6.9372e-02,  2.1884e-01, -2.9250e-01,
        -3.9328e-01, -1.8500e-01, -2.9737e-01, -5.8112e-02,  1.6959e-01,
        -7.1924e-03, -1.6866e-01,  1.3513e-01,  3.0760e-01, -3.7039e-01,
         2.4984e-01,  4.0689e-01,  2.2337e-02, -3.8146e-01, -1.1498e-01,
        -4.6484e-03, -1.0378e-04, -6.8782e-04, -1.5847e-01, -4.9675e-02,
         1.8417e-01, -3.1724e-01,  9.64

In [15]:
# Get the sentence embedding of the first sentence of the document
get_document_embeddings(document, by="sentences", paginated=True, model=glove_model)[0][0]

['invest product bank deposit',
 array([-2.83167511e-01,  1.57509252e-01, -2.64883757e-01, -1.45697504e-01,
         2.34204501e-01, -1.61922008e-01,  3.48022491e-01,  5.04252538e-02,
        -1.33927003e-01, -1.62987494e+00, -5.80024943e-02, -2.47445002e-01,
         7.63149932e-02, -2.68419981e-01, -1.80604979e-02, -2.79577732e-01,
        -1.91222489e-01, -4.86485004e-01,  5.60092553e-02, -3.28714997e-01,
         3.55305001e-02, -3.21049988e-02,  1.08314753e-01,  5.36522493e-02,
        -1.52801007e-01,  2.30902452e-02, -1.06143743e-01,  1.97545484e-01,
        -3.54631752e-01,  7.92725012e-03, -3.32102507e-01,  2.04788893e-01,
         2.64919996e-01,  1.48712739e-01, -4.90858495e-01,  5.82342446e-02,
        -1.09559998e-01,  2.55208492e-01,  8.51374939e-02, -2.58603007e-01,
        -4.25881267e-01,  1.45485010e-02, -8.74930024e-02,  4.56924975e-01,
         4.49222550e-02, -2.52350003e-01, -1.68006912e-01,  1.20587498e-01,
        -3.03364754e-01,  3.35247487e-01, -3.59291062e-0

**Most Similar/Close Words** 

Let's try to find the words that are very close to the word "bank" within our document.

In [16]:
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def most_similar_words(word_list, target_word, top_n=5):
    """Rank the distinct words of `word_list` by cosine similarity to `target_word`."""
    target_embedding = None
    # Find the embedding for the target word
    for word, embedding in word_list:
        if word == target_word:
            target_embedding = embedding.reshape(1, -1) 
            break

    if target_embedding is None:
        return []

    # Calculate cosine similarity once per distinct word
    seen_words = set()
    results = []
    for word, embedding in word_list:
        if word == target_word or word in seen_words:
            continue
        seen_words.add(word)
        similarity = cosine_similarity(target_embedding, embedding.reshape(1, -1))[0][0]
        results.append((word, similarity))

    results.sort(key=lambda x: x[1], reverse=True)
    return results[:top_n]

target_word = "bank"

glove_words_embeddings = get_document_embeddings(document, by="words", paginated=False, model=glove_model)
similar_words = most_similar_words(glove_words_embeddings, target_word)

for word, similarity in similar_words:
    print(f"{word}: {similarity}")

central: 0.5375902056694031
credit: 0.531377911567688
citibank: 0.4939170479774475
deposit: 0.45129913091659546
citigroup: 0.4335481524467468


**Word Analogy**

"NZD is to New Zealand as AUD is to ?"




In [17]:
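# "nzd is to zealand as aud is to ?" is solved with the vector zealand - nzd + aud,
# so 'zealand' and 'aud' are added and 'nzd' is subtracted
analogy_pair = ['nzd', 'zealand']
word_to_find_analogy_for = 'aud'
analogy_results = glove_model.most_similar(positive=[analogy_pair[1], word_to_find_analogy_for], negative=[analogy_pair[0]], topn=1)
print(f"Analogy: {analogy_pair[0]} is to {analogy_pair[1]} as {word_to_find_analogy_for} is to {analogy_results[0][0]}")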

Analogy: nzd is to zealand as aud is to australia
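
The arithmetic behind an analogy query can be illustrated on toy data. For "a is to b as c is to ?", the usual approach is to look for the word whose vector is closest to b - a + c, excluding the three query words. The sketch below is only an illustration under that assumption: the 2-dimensional vectors in `toy_vectors` are hypothetical and hand-built so that one axis separates currencies from countries and the other separates New Zealand from Australia.

```python
import numpy as np

# Hypothetical toy vectors: axis 0 ~ "currency vs. country", axis 1 ~ "NZ vs. AU"
toy_vectors = {
    "nzd":       np.array([1.0, 1.0]),
    "zealand":   np.array([-1.0, 1.0]),
    "aud":       np.array([1.0, -1.0]),
    "australia": np.array([-1.0, -1.0]),
    "bank":      np.array([0.2, 0.1]),
}

def cosine(u, v):
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def solve_analogy(a, b, c, vectors):
    """Return the word closest to b - a + c, excluding the three query words."""
    query = vectors[b] - vectors[a] + vectors[c]
    candidates = {w: v for w, v in vectors.items() if w not in (a, b, c)}
    return max(candidates, key=lambda w: cosine(query, candidates[w]))

answer = solve_analogy("nzd", "zealand", "aud", toy_vectors)
print(answer)
```

On these toy vectors the query vector is exactly the vector of "australia", so that word is returned. With real embeddings the match is only approximate, which is why the nearest neighbour is used.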


### 1.2. Document Embeddings using FinText


#### Loading FinText Embeddings

In [18]:
%%capture
!wget -N https://www.rahimikia.com/FinText/FinText_Word2Vec_CBOW.zip
!unzip -o FinText_Word2Vec_CBOW.zip -d FinText_Word2Vec_CBOW
!rm FinText_Word2Vec_CBOW.zip

In [19]:
%%time
fintext_path = "/kaggle/working/FinText_Word2Vec_CBOW/FinText_Word2Vec_CBOW/Word_Embedding_2000_2015"
# Load the FinText model
fintext_model = KeyedVectors.load(fintext_path)

CPU times: user 56 s, sys: 2.18 s, total: 58.1 s
Wall time: 58.2 s


#### Generating Embeddings

In [20]:
# Get the word embedding of the first word of the document
#get_document_embeddings(document, by="words", paginated=False, model=fintext_model)[0]

In [21]:
# Get the sentence embedding of the first sentence of the document
get_document_embeddings(document, by="sentences", paginated=True, model=fintext_model)[0][0]

['invest product bank deposit',
 array([-4.95728683e+00,  9.98760819e-01, -9.86634254e-01, -1.91504419e+00,
         6.06757402e-01,  1.21559811e+00,  1.37422240e+00,  6.41845703e-01,
        -6.37138784e-02, -1.85384464e+00, -2.50361800e+00, -2.84312868e+00,
        -1.61761713e+00, -1.11309290e+00, -2.98852301e+00,  1.77939549e-01,
         1.57604289e+00, -1.95075810e-01, -2.74044871e+00, -1.22344077e-01,
        -2.16380882e+00, -9.99018729e-01, -2.43867016e+00,  1.31971848e+00,
         4.26488370e-01, -1.43408871e+00,  1.04030108e+00,  6.26651704e-01,
         4.08551991e-01,  8.64704132e-01,  1.06607628e+00, -7.07986951e-01,
        -1.09643841e+00, -5.28450310e-01, -2.21319988e-01, -1.66300082e+00,
         1.49757540e+00, -1.05606520e+00, -1.92846024e+00, -2.14403018e-01,
        -1.37729704e+00,  5.09446144e-01,  8.67957711e-01, -1.59890324e-01,
         2.82230759e+00, -4.69033539e-01, -2.87354803e+00,  4.60414946e-01,
         3.01001638e-01, -1.41977608e-01, -1.60587883e+0

In [22]:
target_word = "bank"

fintext_words_embeddings = get_document_embeddings(document, by="words", paginated=False, model=fintext_model)
similar_words = most_similar_words(fintext_words_embeddings, target_word)

for word, similarity in similar_words:
    print(f"{word}: {similarity}")

citibank: 0.5876072645187378
boc: 0.41906818747520447
snb: 0.39384377002716064
citigroup: 0.39015644788742065
fed: 0.37750378251075745


### 1.3. Document Embeddings using FinBERT


In [23]:
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("ProsusAI/finbert")
finbert_model = BertModel.from_pretrained("ProsusAI/finbert")


Some weights of the model checkpoint at ProsusAI/finbert were not used when initializing BertModel: ['classifier.bias', 'classifier.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [24]:
def get_sentence_finbert_embeddings(sentence, model):
    with torch.no_grad():
        sentence_tokens = tokenizer(sentence, return_tensors="pt", padding=True, truncation=True, max_length=256)
        sentence_embeddings = model(**sentence_tokens).last_hidden_state.mean(dim=1)
        return sentence_embeddings

def get_document_finbert_embeddings(document, paginated=False, model=None):
    doc = get_document(document, by="sentences", col="HSentences", paginated=paginated)
    sentences = []
    if paginated:
        for page in doc:
            page_sentences = []
            for sentence in page:
                sentence_embedding = get_sentence_finbert_embeddings(sentence, model)
                sentence_item = [sentence, np.array(sentence_embedding)]
                page_sentences.append(sentence_item)
            sentences.append(page_sentences)
    else:
        for sentence in doc:
            sentence_embedding = get_sentence_finbert_embeddings(sentence, model)
            sentence_item = [sentence, np.array(sentence_embedding)]
            sentences.append(sentence_item)
    return sentences
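
The helper above encodes one sentence at a time, so the tokenizer adds no padding and a plain mean over the token dimension is safe. If sentences were ever encoded in batches, padded positions would enter that mean. A common remedy is to weight the average by the attention mask. The sketch below shows the idea with NumPy arrays standing in for the model output. The function name and the toy values are hypothetical, and this is not what the notebook currently does.

```python
import numpy as np

def masked_mean_pool(hidden_states, attention_mask):
    """Average token vectors per sentence, ignoring positions where the mask is 0.

    hidden_states: array of shape (batch, seq_len, dim)
    attention_mask: array of shape (batch, seq_len) containing 0/1
    """
    mask = attention_mask[:, :, None].astype(hidden_states.dtype)
    summed = (hidden_states * mask).sum(axis=1)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)  # avoid division by zero
    return summed / counts

# Hypothetical batch: 2 sentences, 3 token positions, 2 dimensions
hidden = np.array([
    [[1.0, 1.0], [3.0, 3.0], [100.0, 100.0]],  # last position is padding
    [[2.0, 0.0], [4.0, 0.0], [6.0, 0.0]],      # no padding
])
mask = np.array([[1, 1, 0], [1, 1, 1]])

pooled = masked_mean_pool(hidden, mask)
print(pooled)
```

The first sentence pools to (2, 2) because the padded position is ignored, whereas a plain mean over all three positions would be dominated by the padding values.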

In [25]:
#finbert_embeddings = get_document_finbert_embeddings(document, paginated=False, model=finbert_model)

In [26]:
# Get the first sentence embedding of the document:
#finbert_embeddings[0]

## 2. Document Semantic Search

In [27]:
import sklearn
from sklearn.metrics.pairwise import cosine_similarity

def retrieve_answers(question_embedding, sentence_embeddings, sentences, k=1):
    sentence_embeddings = np.vstack(sentence_embeddings)
    similarity_scores = cosine_similarity(sentence_embeddings, question_embedding.reshape(1, -1))
    top_k_indices = np.argsort(similarity_scores, axis=0)[-k:][::-1]
    top_k_answers = []
    
    for index in top_k_indices:
        index = index[0] 
        top_k_answers.append((index, similarity_scores[index][0], sentences[index]))

    return top_k_answers

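def get_answer_page_num(document, answer):
    # Find the page whose list of readable sentences contains the answer verbatim.
    # (Matching against str(lst) can fail when the list repr escapes quotes in a sentence.)
    page_num = document.loc[document['HSentences'].apply(lambda sentences: answer in sentences), 'Page ID']
    return page_num.values[0]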
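
The ranking logic used for retrieval can be checked on toy data: normalize the vectors, take dot products to get cosine similarities, and keep the indices with the highest scores. The sketch below uses only NumPy. The function name and the 2-dimensional vectors are hypothetical and stand in for the sentence and question embeddings.

```python
import numpy as np

def top_k_by_cosine(query, candidates, k=1):
    """Return (index, score) pairs for the k rows of `candidates` most similar to `query`."""
    query = query / np.linalg.norm(query)
    candidates = candidates / np.linalg.norm(candidates, axis=1, keepdims=True)
    scores = candidates @ query
    order = np.argsort(scores)[::-1][:k]  # highest score first
    return [(int(i), float(scores[i])) for i in order]

# Hypothetical sentence vectors and question vector
sentence_vectors = np.array([
    [1.0, 0.0],
    [0.0, 1.0],
    [1.0, 1.0],
])
question_vector = np.array([2.0, 0.1])

ranked = top_k_by_cosine(question_vector, sentence_vectors, k=2)
print(ranked)
```

Here the question vector points almost along the first axis, so sentence 0 ranks first and sentence 2 second.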

In [28]:
# Same preprocessing pipeline as in the document-cleaning notebook, so that questions are cleaned the same way as MSentences

def replace_symbols(text):
    symbols = {'$': 'dollar','€': 'euro','£': 'pound','¥': 'yen','₹': 'rupee','%': 'percent'}
    for symbol, word in symbols.items():
        text = re.sub(r'(\d)' + re.escape(symbol), r'\1 ' + word, text)
        text = re.sub(re.escape(symbol) + r'(\d)', word + r' \1', text)
        text = re.sub(re.escape(symbol), word, text)
    return text

def remove_urls(text):
    url_pattern_1 = "^https?:\\/\\/(?:www\\.)?[-a-zA-Z0-9@:%._\\+~#=]{1,256}\\.[a-zA-Z0-9()]{1,6}\\b(?:[-a-zA-Z0-9()@:%_\\+.~#?&\\/=]*)$"
    url_pattern_2 = r'https?://\S+'
    url_pattern_3 = "^[-a-zA-Z0-9@:%._\\+~#=]{1,256}\\.[a-zA-Z0-9()]{1,6}\\b(?:[-a-zA-Z0-9()@:%_\\+.~#?&\\/=]*)$"
    url_pattern = f"({url_pattern_1})|({url_pattern_2})|({url_pattern_3})"
    text = re.sub(url_pattern, '', text, flags=re.MULTILINE)
    return text
    
stemmer = SnowballStemmer(language='english')

def clean_sent(text):
    text = text.lower()
    text = remove_urls(text)
    text = contractions.fix(text)
    text = re.sub(r'[^\x00-\x7F$€£¥₹]+', '', text)
    text = re.sub(r'\b\d+\b|\b\d+\.\d+\b', lambda match: num2words(float(match.group(0))) if '.' in match.group(0) else num2words(int(match.group(0))), text)
    text = replace_symbols(text)
    text = strip_punctuation(text)
    text = remove_stopwords(text)
    text = re.sub(r'[^a-z0-9\s]', '', text)
    words = word_tokenize(text)
    stemmed_words = [stemmer.stem(word) for word in words]
    text = " ".join(stemmed_words).strip()
    text = strip_multiple_whitespaces(text)
    return text

In [29]:
report_name = 'fx_insight_e'
qa_document = raw_data[raw_data['Report Name'] == report_name].sort_values(by='Page ID')

original_sentences = get_document(qa_document, by="sentences", col="HSentences")

In [30]:
questions = ["Why did Citi analysts push the RBA rate hike expectation from June to August?",
             "What are the expected medium-term gains for AUD vs USD?",
             "What will happen to UK?",
             "How might China's low inflation impact the policy support from PBoC?"]

In [31]:
def print_answers(questions, answers):
    for question_id, question in enumerate(questions):
        print(f"Question {question_id + 1}:", question)
        print(f"Top {top_k} answers:")
        for index, similarity, sentence in answers[question_id]:
            print("=>", sentence, '(similarity_score', similarity,')' '(from Document',report_name, 'on Page', get_answer_page_num(qa_document,sentence), ')\n')
            #print(cleaned_sentences[index])
        print('*'*50)

### 2.1. Using GloVe Embeddings

In [32]:
top_k = 1 
answers = []

doc_embeddings = [emb for _,emb in get_document_embeddings(qa_document, by="sentences", model=glove_model)]
for question in questions:
    question_embedding = get_sentence_embedding(clean_sent(question), embedding_model=glove_model)
    top_answers = retrieve_answers(question_embedding, doc_embeddings, original_sentences, k=top_k)
    answers.append(top_answers)
    
print_answers(questions, answers)

Question 1: Why did Citi analysts push the RBA rate hike expectation from June to August?
Top 1 answers:
=> Citi analysts push RBA rate hike expectation from June to August to give it more time to assess its reaction function towards wages, inflation and the labor market. (similarity_score 0.8865552 )(from Document fx_insight_e on Page 5 )

**************************************************
Question 2: What are the expected medium-term gains for AUD vs USD?
Top 1 answers:
=> This should underpin strong support for AUDUSD at 0.6550 0.6600 with a likely acceleration of Chinas growth in 2H a medium-term tailwind for AUD, putting it on track for further medium-term gains vs USD (> 0.72-0.74 range in H223), NZD, CAD and GBP (economies with much higher rates and higher recession risks than Australia). (similarity_score 0.7392146 )(from Document fx_insight_e on Page 5 )

**************************************************
Question 3: What will happen to UK?
Top 1 answers:
=> The medium to long

### 2.2. Using FinText Embeddings

In [33]:
top_k = 1 
answers = []

doc_embeddings = [emb for _,emb in get_document_embeddings(qa_document, by="sentences", model=fintext_model)]
for question in questions:
    question_embedding = get_sentence_embedding(clean_sent(question), embedding_model=fintext_model)
    top_answers = retrieve_answers(question_embedding, doc_embeddings, original_sentences, k=top_k)
    answers.append(top_answers)
    
print_answers(questions, answers)

Question 1: Why did Citi analysts push the RBA rate hike expectation from June to August?
Top 1 answers:
=> Citi analysts push RBA rate hike expectation from June to August to give it more time to assess its reaction function towards wages, inflation and the labor market. (similarity_score 0.87560993 )(from Document fx_insight_e on Page 5 )

**************************************************
Question 2: What are the expected medium-term gains for AUD vs USD?
Top 1 answers:
=> This should underpin strong support for AUDUSD at 0.6550 0.6600 with a likely acceleration of Chinas growth in 2H a medium-term tailwind for AUD, putting it on track for further medium-term gains vs USD (> 0.72-0.74 range in H223), NZD, CAD and GBP (economies with much higher rates and higher recession risks than Australia). (similarity_score 0.7110077 )(from Document fx_insight_e on Page 5 )

**************************************************
Question 3: What will happen to UK?
Top 1 answers:
=> Persons who come 

### 2.3. Using FinBERT Embeddings

In [34]:
top_k = 1 
answers = []

doc_embeddings = [emb for _,emb in get_document_finbert_embeddings(qa_document, model=finbert_model)]
for question in questions:
    question_embedding = np.array(get_sentence_finbert_embeddings(question, model=finbert_model))
    top_answers = retrieve_answers(question_embedding, doc_embeddings, original_sentences, k=top_k)
    answers.append(top_answers)
    
print_answers(questions, answers)

Question 1: Why did Citi analysts push the RBA rate hike expectation from June to August?
Top 1 answers:
=> Citi analysts push RBA rate hike expectation from June to August to give it more time to assess its reaction function towards wages, inflation and the labor market. (similarity_score 0.86756265 )(from Document fx_insight_e on Page 5 )

**************************************************
Question 2: What are the expected medium-term gains for AUD vs USD?
Top 1 answers:
=> Previously USDCHF: 0 3mths: 0.87 USDCHF: 6 12mths: 0.86 USDCHF: LT: 0.85 Currently (as of May): USDCHF: 0 3mths: 0.88 USDCHF: 6 12mths: 0.86 USDCHF: LT: 0.85 6-12mths: Bullish CHF vs USD, moderately bearish vs EUR JPY Citi views & strategy Bias/ Forecasts/ Key levels Citi FX outlook Japanese fundamentals make a strong case for Yen to find support on 3 grounds (1) Japans terms of trade having decisively bottomed around mid- August 2022; (2) rising probability of a shift in the BoJs stance on YCC at its June meeting

#### Save Embeddings
In the next notebook, we'll do document clustering at the sentence level. We export the sentence embeddings here so that they can be reused there.

In [35]:
import os
import shutil

# Path to the main directory
main_dir = "sent_embeddings"

# Start from an empty output directory: remove any previous export, then recreate it.
# (os.rmdir only removes empty directories, and calling os.makedirs on a directory
# that still exists raises FileExistsError.)
if os.path.exists(main_dir):
    shutil.rmtree(main_dir)
os.makedirs(main_dir)

In [36]:
# sent_embeddings > report_X.json

def export_embeddings(report_name):
    document = raw_data[raw_data['Report Name'] == report_name].sort_values(by='Page ID')
    hsentences = get_document(document, by="sentences", col="HSentences", paginated=True)
    msentences = get_document(document, by="sentences", col="MSentences", paginated=True)
    
    report_data = {}
    report_data['Report Name'] = report_name
    report_data['Bank Name'] = document['Bank Name'].unique()[0]
    report_data['Report Date'] = document['Report Date'].unique()[0]
    report_data['Pages'] = []
    
    # Embeddings 
    glove_embeddings = get_document_embeddings(document, by="sentences", model=glove_model, paginated=True)
    fintext_embeddings = get_document_embeddings(document, by="sentences", model=fintext_model, paginated=True)
    finbert_embeddings = get_document_finbert_embeddings(document, paginated=True, model=finbert_model)
    
    for page_num in range(len(document)):
        page_data = {}
        page_data['Page ID'] = page_num
        page_data['MSentences'] = msentences[page_num]
        page_data['HSentences'] = hsentences[page_num]
        page_data['FinText'] = [emb.tolist() for _, emb in fintext_embeddings[page_num]]
        page_data['FinBert'] = [emb.tolist() for _, emb in finbert_embeddings[page_num]]
        page_data['GloVe'] = [emb.tolist() for _, emb in glove_embeddings[page_num]]
        report_data['Pages'].append(page_data)
    
    return report_data

In [37]:
# List of documents
reports = list(raw_data['Report Name'].unique())

In [38]:
import json
for report_name in reports:
    report_data = export_embeddings(report_name)
    output_filename = report_data['Report Name'] + ".json"
    out_path = os.path.join(main_dir, output_filename)
    with open(out_path, 'w') as output_file:
        json.dump(report_data, output_file, indent=4)
    
    print(f"Report data exported to {out_path}")

Report data exported to sent_embeddings/fx_insight_e_16_janvier_2023.json
Report data exported to sent_embeddings/exemple_analyse_macro_economique_goldman_sachs.json
Report data exported to sent_embeddings/fx_insight_e_20_fevrier_2023.json
Report data exported to sent_embeddings/fx_insight_e_15_mai_2023.json
Report data exported to sent_embeddings/bnp_parisbas_global_view_2023.json
Report data exported to sent_embeddings/goldman_sachs_janvier_2023.json
Report data exported to sent_embeddings/jpmorgan_private_banking_global_view_2023.json
Report data exported to sent_embeddings/fx_insight_e_03_01_2023.json
Report data exported to sent_embeddings/recession_goldman_sachs.json
Report data exported to sent_embeddings/fx_insight_e_24_avril_2023.json
Report data exported to sent_embeddings/citi_gold_fx_16_janvier_2023_USD_EUR.json
Report data exported to sent_embeddings/goldman_sachs_global_view_2023.json
Report data exported to sent_embeddings/goldman_sachs_global_outlook.json
Report data ex

In [39]:
import shutil

output_zip_filename = main_dir

shutil.make_archive(output_zip_filename, 'zip', main_dir)
print(f"{output_zip_filename} zip created.")

sent_embeddings zip created.
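
As a sketch of how an exported file might be read back, the example below writes a miniature report with the same keys as the export above (`Pages`, `HSentences`, `GloVe`) to a temporary file and loads it into a list of sentences and a matrix of vectors. The report content is hypothetical toy data, and the 2-dimensional vectors stand in for the real embeddings.

```python
import json
import os
import tempfile

import numpy as np

# Hypothetical miniature report using the same keys as the exported files
report = {
    "Report Name": "toy_report",
    "Pages": [
        {"Page ID": 0, "HSentences": ["A b.", "C d."], "GloVe": [[0.1, 0.2], [0.3, 0.4]]},
        {"Page ID": 1, "HSentences": ["E f."], "GloVe": [[0.5, 0.6]]},
    ],
}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_report.json")
    with open(path, "w") as f:
        json.dump(report, f)

    with open(path) as f:
        loaded = json.load(f)

# Flatten the pages into parallel collections of sentences and vectors
sentences = [s for page in loaded["Pages"] for s in page["HSentences"]]
glove_matrix = np.array([vec for page in loaded["Pages"] for vec in page["GloVe"]])
print(len(sentences), glove_matrix.shape)
```

One thing to watch when loading the real files: the FinBERT helper keeps each sentence embedding in shape (1, dim), so the `FinBert` entries appear to carry an extra leading dimension compared with `GloVe` and `FinText` and may need to be squeezed.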
