# Document Post-Processing
This notebook applies text cleaning to the parsed documents so that they are ready for downstream tasks such as clustering and analysis. Cleaning includes removing URLs, irrelevant characters, and redundant whitespace, and normalizing contractions, numbers, and symbols. A spelling-correction helper is also defined, although the pipeline below does not apply it. The result is two versions of each page: a lightly cleaned version that stays readable for humans, and a heavily normalized version intended for embeddings and clustering.

## Importing required libraries

In [1]:
%%capture
!pip install wordninja
!pip install contractions
!pip install num2words
!pip install textblob
!pip install gensim

In [2]:
import pandas as pd
import numpy as np
import re
import wordninja
from num2words import num2words
import contractions
import spacy
from textblob import TextBlob
import gensim
from gensim.parsing.preprocessing import remove_stopwords, strip_punctuation, strip_multiple_whitespaces



## Loading documents
We'll use the documents parsed using the PyMuPDF library. This method has demonstrated superior parsing results compared to other approaches.

In [3]:
# Loading documents
raw_data = pd.read_csv("/kaggle/input/parsed-docs/parsed-docs-pymupdf.csv")

In [4]:
raw_data.head()

Unnamed: 0,Report ID,Report Name,Bank Name,Report Date,Page ID,Page Text,Parsing Method
0,0,fx_insight_e_16_janvier_2023,Unknown,Unknown,0,\n Citi Global Wealth Investments \n FX Snaps...,pymupdf
1,0,fx_insight_e_16_janvier_2023,Unknown,Unknown,1,"\n Source: Bloomberg L.P. \n (K = Thousand, M ...",pymupdf
2,0,fx_insight_e_16_janvier_2023,Unknown,Unknown,2,\n Citi FX interest rate Forecast % \n Source:...,pymupdf
3,0,fx_insight_e_16_janvier_2023,Unknown,Unknown,3,\n Important Disclosure \n “Citi analysts” ref...,pymupdf
4,1,exemple_analyse_macro_economique_goldman_sachs,Unknown,Unknown,0,\n Fixed Income \n MUSINGS \n FIXED INCOME Go...,pymupdf


Let's look at a few pages to identify what needs cleaning.


In [5]:
print(raw_data['Page Text'].iloc[0])


 Citi Global Wealth Investments  
 FX Snapshot 
 Major Currencies Performance 
 Source: Bloomberg L.P., as of Jan 13, 2023 (cut off time is NY Time 5:00pm); Citi (forecasts as of Jan 12, 2023) 
 CCY Close Weekly Change 
 1 month high 
 1 month low 
 1 month change 
 3 month high 
 3 month low 
 3 month change 
 52 week high 
 52 week low 
 Year-To- Date Change 
 USD  102.20 -1.6% 105.04 102.20 -1.7% 113.31 102.20 -9.0% 114.78 94.63 -1.3% 
 EUR/USD  1.0830 1.7% 1.0853 1.0522 1.9% 1.0853 0.9722 10.8% 1.1495 0.9536 1.2% 
 USD/JPY  127.87 -3.2% 137.78 127.87 -5.7% 150.15 127.87 -13.1% 151.95 113.47 -2.5% 
 GBP/USD  1.2227 1.1% 1.2426 1.1908 -1.1% 1.2426 1.1160 8.0% 1.3690 1.0350 1.2% 
 USD/CAD  1.3396 -0.4% 1.3699 1.3367 -1.1% 1.3885 1.3275 -2.6% 1.3977 1.2403 -1.2% 
 AUD/USD  0.6968 1.3% 0.6969 0.6670 1.6% 0.6969 0.6199 10.6% 0.7661 0.6170 2.3% 
 NZD/USD  0.6375 0.4% 0.6464 0.6234 -1.4% 0.6464 0.5562 13.1% 0.7034 0.5512 0.4% 
 USD/CHF  0.9269 -0.1% 0.9362 0.9213 -0.2% 1.0133 0.9213 -7.3%

In [6]:
print(raw_data['Page Text'].iloc[6])


 MUSINGS 
 FIXED INCOME  Goldman Sachs Asset Management  3 
 CENTRAL BANK SNAPSHOT 
 Source: Goldman Sachs Asset Management. As of July 29, 2022. The economic and market forecasts presented herein are for informational purposes as of the date of this presentation. There can be no assurance that the forecasts will be achieved. Please see additional disclosures at the end of this presentation. 
 Interest Rate Policy Balance Sheet Policy Outlook 
 Our outlook relative to  
 market-implied  
 pricing 
 Fed  Federal funds rate: 2.25%-2.50%  
 Last changed:  
 July 2022 (+75bps) Prior changes: 
 June 2022 (+75bps) 
 May 2022 (+50bps) 
 March 2022 (+25bps) 
 Started reducing the monthly pace of its net asset purchases in November 2021 and ended net additional purchases of Treasuries and agency MBS in early March. Balance sheet runoff begins in June; an eventual monthly cap will be set at $95bn—split $60bn-$35bn between US Treasury and mortgage- backed securities (MBS)—and the caps will initi

In [7]:
print(raw_data['Page Text'].iloc[20])


 NZD Citi views & strategy Bias/ Forecasts/ Key levels 
 Citi FX outlook 
 NZD outperformance earlier due to its higher carry looks unsustainable as –(1) current highly restrictive NZ financial conditions sharply raise the prospect of a hard landing for the NZ economy; (2) the recent larger 50bp hike sharply reduces odds of further RBNZ tightening; and (3) a strengthening Chinese recovery better supports other currencies closely linked to China (EUR, AUD, SGD, CNH). Against USD however, the combination of a stronger Chinese recovery in H2’23 and Fed pivoting towards rate cuts may yet lift NZDUSD in H2’23. 
 Previously  • NZDUSD: 0 – 3mths: 0.62 
 • NZDUSD: 6 – 12mths: 0.63 
 • NZDUSD: Longer term: 0.67 
 Currently (as of Apr):  • NZDUSD: 0 – 3mths: 0.63 
 • NZDUSD: 6 – 12mths: 0.63 
 • NZDUSD: Longer term: 0.67 
 • 6-12mths: Modestly bullish NZD vs USD, bearish vs EUR and AUD 
 CAD Citi views & strategy Bias/ Forecasts/ Key levels 
 Citi FX outlook 
 USDCAD may find it difficult to br

## Cleaning documents
Here are some cleaning steps to consider:
- Remove URLs: Any URLs or web links present in the text are removed.
- Replace Contractions: Contractions like "can't" are expanded to their full forms ("cannot").
- Remove Non-ASCII Characters: Non-ASCII characters are removed, except the currency symbols €, £, ¥, and ₹, which are kept so they can be replaced with words later.
- Replace Numbers: Numbers are replaced with their textual representations (e.g., "1" becomes "one").
- Replace Symbols: Specific symbols may be replaced with their textual equivalents.
- Remove Unwanted Punctuation: Unwanted punctuation marks are removed from the text.
- Convert Text to Lowercase: The entire text is converted to lowercase to ensure consistency.
- Remove Stopwords: Common stopwords like "the," "a," "and," etc., are removed from the text.
- Remove Non-Alphanumeric Characters: Only letters and digits are kept, and all other non-alphanumeric characters are removed.
- Strip Multiple Whitespaces: Any extra consecutive whitespaces are reduced to a single whitespace.
- Remove Short Pages: Pages with too few sentences are dropped.

In [8]:
def remove_urls(text):
    url_pattern_1 = "^https?:\\/\\/(?:www\\.)?[-a-zA-Z0-9@:%._\\+~#=]{1,256}\\.[a-zA-Z0-9()]{1,6}\\b(?:[-a-zA-Z0-9()@:%_\\+.~#?&\\/=]*)$"
    url_pattern_2 = r'https?://\S+'
    url_pattern_3 = "^[-a-zA-Z0-9@:%._\\+~#=]{1,256}\\.[a-zA-Z0-9()]{1,6}\\b(?:[-a-zA-Z0-9()@:%_\\+.~#?&\\/=]*)$"
    url_pattern = f"({url_pattern_1})|({url_pattern_2})|({url_pattern_3})"
    text = re.sub(url_pattern, '', text, flags=re.MULTILINE)
    return text

def replace_contractions(text):
    return contractions.fix(text)

def remove_non_ascii(text):
    return re.sub(r'[^\x00-\x7F$€£¥₹]+', '', text)

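def replace_numbers(text):
    # Convert one matched number (integer or decimal) to words
    def to_words(match):
        token = match.group(0)
        return num2words(float(token)) if '.' in token else num2words(int(token))
    # The decimal alternative must come first; otherwise "2.25" is matched as "2" and "25"
    return re.sub(r'\b\d+\.\d+\b|\b\d+\b', to_words, text)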

def replace_symbols(text):
    symbols = {
        '$': 'dollar',
        '€': 'euro',
        '£': 'pound',
        '¥': 'yen',
        '₹': 'rupee',
        '%': 'percent'
    }
    
    for symbol, word in symbols.items():
        text = re.sub(r'(\d)' + re.escape(symbol), r'\1 ' + word, text)
        text = re.sub(re.escape(symbol) + r'(\d)', word + r' \1', text)
        text = re.sub(re.escape(symbol), word, text)
    return text

def correct_sent(text):
    # TextBlob.correct() returns a TextBlob, so convert it back to a plain string
    return str(TextBlob(text).correct())

def remove_unwanted_punctuation(text):
    special_chars = [
            '!', '"', '#', '$', '%', '&', '(', ')', '*', '+', ',', '-', '.', '/', '“', '”',
            ':', ';', '<', '=', '>', '?', '@', '[', '\\', ']', '^', '_', '`', '{', '|', '}', '~', '–','•'
            ]
    pattern = "[" + re.escape("".join(special_chars)) + "]"
    return re.sub(pattern, '', text)

def remove_unwanted_space(text):
    text = text.replace('\n', ' ').replace('\t', ' ')
    return re.sub(r'\s+', ' ', text).strip()
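
A note on matching numbers with a single regular expression: Python's `re` module tries alternatives from left to right and takes the first one that matches, so the order of the integer and decimal patterns decides whether `2.25` is seen as one number or two. The snippet below is a small standard-library illustration; the sample sentence is made up.

```python
import re

sample = "Federal funds rate rose to 2.25 in 2022"

# Integer alternative first: "2.25" is split into "2" and "25",
# because \b\d+\b already matches the "2" before the dot.
int_first = re.findall(r'\b\d+\b|\b\d+\.\d+\b', sample)

# Decimal alternative first: "2.25" is captured as a single token.
dec_first = re.findall(r'\b\d+\.\d+\b|\b\d+\b', sample)

print(int_first)  # ['2', '25', '2022']
print(dec_first)  # ['2.25', '2022']
```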

In [9]:
def clean(text, include_all=False):
    text = remove_urls(text)
    text = replace_contractions(text)
    text = remove_non_ascii(text)
    if include_all:
        text = replace_numbers(text)
        text = replace_symbols(text)
        #text = remove_unwanted_punctuation(text)
        text = strip_punctuation(text)
        text = text.lower()
        text = remove_stopwords(text)
        text = re.sub(r'[^a-z0-9\s]', '', text)
    text = strip_multiple_whitespaces(text)
    return text


def process(text, include_all=False):
    blob = TextBlob(text)
    original_sentences = blob.raw_sentences
    new_sentences = []
    for sentence in original_sentences: 
        new_sentences.append(clean(sentence, include_all))
    return new_sentences

In [10]:
raw_data['HSentences'] = raw_data['Page Text'].apply(lambda x: process(x, include_all=False))
raw_data['MSentences'] = raw_data['Page Text'].apply(lambda x: process(x, include_all=True))

In [11]:
# Drop pages that have `threshold` or fewer sentences
threshold = 2
raw_data = raw_data[raw_data['HSentences'].apply(lambda x: len(x) > threshold)]
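
Before settling on a threshold, it can help to see how many pages fall at each sentence count. The following is a minimal sketch that uses plain lists in place of the `HSentences` column; the `pages` data is invented for illustration.

```python
from collections import Counter

# Hypothetical stand-in for raw_data['HSentences']: one list of sentences per page.
pages = [
    ["Title only"],
    ["Header", "Footer"],
    ["First sentence.", "Second sentence.", "Third sentence."],
    ["A.", "B.", "C.", "D."],
]

threshold = 2

# Number of pages for each sentence count.
counts = Counter(len(p) for p in pages)

# Same rule as the filter above: keep pages with more than `threshold` sentences.
kept = [p for p in pages if len(p) > threshold]

print(sorted(counts.items()))  # [(1, 1), (2, 1), (3, 1), (4, 1)]
print(len(kept))               # 2
```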

Let's observe some pages after the cleaning.

In [12]:
page_sentences = raw_data['HSentences'].iloc[6]
print('* Full Page Text')
print(' '.join(page_sentences))
print(50*'*')
print('* Page Text by Sentence:')
for i,sent in enumerate(page_sentences):
    print(f'Sentence {i}:',sent)

* Full Page Text
 MUSINGS FIXED INCOME Goldman Sachs Asset Management 5 Disclosures Views and opinions are current as of date of publication and may be subject to change, they should not be construed as investment advice. Views and opinions expressed are for informational purposes only and do not constitute a recommendation by Goldman Sachs Asset Management to buy, sell, or hold any security. Individual portfolio management teams for Goldman Sachs Asset Management may have views and opinions and/or make investment decisions that, in certain instances, may not always be consistent with the views and opinions expressed herein. This material is provided at your request for informational purposes only. It is not an offer or solicitation to buy or sell any securities. The website links provided are for your convenience only and are not an endorsement or recommendation by Goldman Sachs Asset Management of any of these websites or the products or services offered. Goldman Sachs Asset Manageme

Each report name comes from the filename of the source document. Most names contain the bank or organization behind the report, and some also contain the report date. We'll extract both whenever possible; names containing `fx` with no recognizable bank are attributed to Citi.

### Generating documents metadata

In [13]:
report_names = raw_data['Report Name'].unique().tolist()

In [14]:
report_names

['fx_insight_e_16_janvier_2023',
 'exemple_analyse_macro_economique_goldman_sachs',
 'fx_insight_e_20_fevrier_2023',
 'fx_insight_e_15_mai_2023',
 'bnp_parisbas_global_view_2023',
 'goldman_sachs_janvier_2023',
 'jpmorgan_private_banking_global_view_2023',
 'fx_insight_e_03_01_2023',
 'recession_goldman_sachs',
 'fx_insight_e_24_avril_2023',
 'citi_gold_fx_16_janvier_2023_USD_EUR',
 'goldman_sachs_global_view_2023',
 'goldman_sachs_global_outlook',
 'fx_insight_e_30_janvier_2023',
 'fx_insight_e_9_9_2023',
 'jpmorgan_asset_management_Q1_2023',
 'fx_insight_e_22_mai_2023',
 'kkr_global_view_2023',
 'fx_insight_e_13_fevrier_2023',
 'fx_insight_e']

In [15]:
months = {
    'janvier': 1,
    'fevrier': 2,
    'mars': 3,
    'avril': 4,
    'mai': 5,
    'juin': 6,
    'juillet': 7,
    'aout': 8,
    'septembre': 9,
    'octobre': 10,
    'novembre': 11,
    'decembre': 12
}

banks = ['goldman_sachs','jpmorgan','citi','bnp_parisba','kkr']

reports2struct = {}
for x in report_names:
    reports2struct[x] = {}
    # Getting dates
    date_pattern = r'\d{1,2}_\w*_\d{4}|Q[1-4]_\d{4}|_(?!view)[a-z]*_\d{4}|\d{4}'
    d = re.findall(date_pattern, x) 
    date = d[0] if len(d) > 0 else ""
    if date != "":
        date = date.split("_")
        date = [part for part in date if part != '']
        if len(date) > 2:
            date[1] = str(months[date[1]]) if date[1].isalpha() else date[1]
        else:
            date[0] = str(months[date[0]]) if date[0] in months.keys() else date[0]
    reports2struct[x]['date'] = '-'.join(date)
    # Getting bank names
    bank_name = None
    for bank in banks:
        if bank in x:
            bank_name = bank
    if bank_name is None and "fx" in x:
        bank_name = "citi"
    reports2struct[x]['bank'] = bank_name

reports2struct

{'fx_insight_e_16_janvier_2023': {'date': '16-1-2023', 'bank': 'citi'},
 'exemple_analyse_macro_economique_goldman_sachs': {'date': '',
  'bank': 'goldman_sachs'},
 'fx_insight_e_20_fevrier_2023': {'date': '20-2-2023', 'bank': 'citi'},
 'fx_insight_e_15_mai_2023': {'date': '15-5-2023', 'bank': 'citi'},
 'bnp_parisbas_global_view_2023': {'date': '2023', 'bank': 'bnp_parisba'},
 'goldman_sachs_janvier_2023': {'date': '1-2023', 'bank': 'goldman_sachs'},
 'jpmorgan_private_banking_global_view_2023': {'date': '2023',
  'bank': 'jpmorgan'},
 'fx_insight_e_03_01_2023': {'date': '03-01-2023', 'bank': 'citi'},
 'recession_goldman_sachs': {'date': '', 'bank': 'goldman_sachs'},
 'fx_insight_e_24_avril_2023': {'date': '24-4-2023', 'bank': 'citi'},
 'citi_gold_fx_16_janvier_2023_USD_EUR': {'date': '16-1-2023', 'bank': 'citi'},
 'goldman_sachs_global_view_2023': {'date': '2023', 'bank': 'goldman_sachs'},
 'goldman_sachs_global_outlook': {'date': '', 'bank': 'goldman_sachs'},
 'fx_insight_e_30_janvie
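
The extracted dates come in several shapes (day-month-year, month-year, quarter-year, year only, or empty), which makes them awkward to sort. The following is a minimal sketch, not part of the original pipeline, of one way to rewrite them as year-first strings. `to_sortable` is a hypothetical helper name, and it assumes the dash-separated formats produced by the code above.

```python
def to_sortable(date_str):
    """Rewrite a dash-separated date string so that the year comes first."""
    if not date_str:
        return ""
    parts = date_str.split("-")
    year, rest = parts[-1], parts[:-1]
    if not rest:                                   # year only, e.g. '2023'
        return year
    if len(rest) == 2 and all(p.isdigit() for p in rest):
        day, month = int(rest[0]), int(rest[1])    # day-month-year
        return f"{year}-{month:02d}-{day:02d}"
    if len(rest) == 1 and rest[0].isdigit():       # month-year
        return f"{year}-{int(rest[0]):02d}"
    if len(rest) == 1:                             # e.g. quarter-year
        return f"{year}-{rest[0]}"
    return date_str                                # unknown shape: leave unchanged


for raw in ["16-1-2023", "03-01-2023", "1-2023", "Q1-2023", "2023", ""]:
    print(repr(raw), "->", repr(to_sortable(raw)))
```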

In [16]:
# Look up the bank name and date of each page's report in reports2struct
raw_data['Bank Name'] = raw_data['Report Name'].map(lambda name: reports2struct[name]['bank'])
raw_data['Report Date'] = raw_data['Report Name'].map(lambda name: reports2struct[name]['date'])

In [17]:
raw_data.drop(columns=['Parsing Method'], inplace=True)

In [18]:
raw_data.head(20)

Unnamed: 0,Report ID,Report Name,Bank Name,Report Date,Page ID,Page Text,HSentences,MSentences
0,0,fx_insight_e_16_janvier_2023,citi,16-1-2023,0,\n Citi Global Wealth Investments \n FX Snaps...,[ Citi Global Wealth Investments FX Snapshot M...,[citi global wealth investments fx snapshot ma...
3,0,fx_insight_e_16_janvier_2023,citi,16-1-2023,3,\n Important Disclosure \n “Citi analysts” ref...,[ Important Disclosure Citi analysts refers to...,[important disclosure citi analysts refers inv...
4,1,exemple_analyse_macro_economique_goldman_sachs,goldman_sachs,,0,\n Fixed Income \n MUSINGS \n FIXED INCOME Go...,[ Fixed Income MUSINGS FIXED INCOME Goldman Sa...,[fixed income musings fixed income goldman sac...
5,1,exemple_analyse_macro_economique_goldman_sachs,goldman_sachs,,1,\n Fixed Income \n MUSINGS \n Goldman Sachs As...,[ Fixed Income MUSINGS Goldman Sachs Asset Man...,[fixed income musings goldman sachs asset mana...
6,1,exemple_analyse_macro_economique_goldman_sachs,goldman_sachs,,2,\n MUSINGS \n FIXED INCOME Goldman Sachs Asse...,[ MUSINGS FIXED INCOME Goldman Sachs Asset Man...,[musings fixed income goldman sachs asset mana...
7,1,exemple_analyse_macro_economique_goldman_sachs,goldman_sachs,,3,\n MUSINGS \n FIXED INCOME Goldman Sachs Asse...,[ MUSINGS FIXED INCOME Goldman Sachs Asset Man...,[musings fixed income goldman sachs asset mana...
8,1,exemple_analyse_macro_economique_goldman_sachs,goldman_sachs,,4,\n MUSINGS \n FIXED INCOME Goldman Sachs Asse...,[ MUSINGS FIXED INCOME Goldman Sachs Asset Man...,[musings fixed income goldman sachs asset mana...
9,1,exemple_analyse_macro_economique_goldman_sachs,goldman_sachs,,5,\n MUSINGS \n FIXED INCOME Goldman Sachs Asse...,[ MUSINGS FIXED INCOME Goldman Sachs Asset Man...,[musings fixed income goldman sachs asset mana...
10,1,exemple_analyse_macro_economique_goldman_sachs,goldman_sachs,,6,\n MUSINGS \n FIXED INCOME Goldman Sachs Asse...,[ MUSINGS FIXED INCOME Goldman Sachs Asset Man...,[musings fixed income goldman sachs asset mana...
11,2,fx_insight_e_20_fevrier_2023,citi,20-2-2023,0,\n Citi Global Wealth Investments \n FX Snaps...,[ Citi Global Wealth Investments FX Snapshot M...,[citi global wealth investments fx snapshot ma...


In [19]:
raw_data.to_csv("cleaned-docs.csv", index=False)
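
One thing to keep in mind when reloading this file: the sentence columns hold Python lists, and a CSV stores them as their string representation. The sketch below shows the round trip using only the standard library, with made-up data.

```python
import ast
import csv
import io

# Simulate what a CSV writer stores for a list-valued cell: its string form.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["Page ID", "HSentences"])
writer.writerow([0, str(["First sentence.", "Second sentence."])])

# Reading back yields plain strings, not lists.
buffer.seek(0)
rows = list(csv.DictReader(buffer))
raw_cell = rows[0]["HSentences"]

# ast.literal_eval safely parses the string form back into a list.
sentences = ast.literal_eval(raw_cell)

print(type(raw_cell).__name__)  # str
print(sentences)                # ['First sentence.', 'Second sentence.']
```

With pandas, the same function can be supplied through the `converters` argument of `pd.read_csv`, for example `converters={'HSentences': ast.literal_eval, 'MSentences': ast.literal_eval}`.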

This post-processing prepares the parsed documents for further task-specific pre-processing and analysis. There is still more that could be done to improve the cleaning.