### Data Preprocessing and Cleaning

Sample of the uncleaned dataset:
```text
BOARD OF DIRECTORS
Shri Ghanshyam Das Agarwal
-
Non-executive Chairman
Shri Jugal Kishore Agarwal
-
Non-executive Director
Shri Nirmal Kumar Agarwal
-
Non-executive Director
Shri Mohan Lal Agarwal

MANAGEMENT DISCUSSION & ANALYSIS
ADHUNIK METALIKS - AN OVERVIEW
Your Company operates in a specialised segment of steel industry,
producing, special alloy steel, ferro alloys, iron billets and rolled
products at it manufacturing facility at Odisha. Though integrated
with iron ore and manganese ore mines and a 1.6 MMTPA pellet
making facility set up under its wholly owned subsidiary, Orissa
Manganese & Minerals Limited, the fortune of your industry are
dependent upon the growth and fall of iron & steel segment of
the economy. During the year under review, the iron & steel
industry has been plagued with several challenges relating to
negative growth, issues with the mining sector and uncontrolled
imports from countries with surplus capacities. Though a preferred
supplier to many major industrial houses, your Company's
performance has been marred due to the sharp decline in the
performance of important customers of the Company.
```

In [1]:
# Check whether a Management Discussion and Analysis section is present in each report

import os
import re

bankrupt_companies = os.listdir('Dataset/Final Dataset/Bankrupt')
healthy_companies = os.listdir('Dataset/Final Dataset/Healthy')


In [2]:
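# Keep only the reports that mention the phrase (case-insensitive).
# Note: the literal "and" means a heading written with "&" is not matched.
acceptable_bankrupt = []
acceptable_healthy = []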
for company in bankrupt_companies:
    with open('Dataset/Final Dataset/Bankrupt/' + company, 'r') as f:
        text = f.read()
        if re.search('management discussion and analysis', text, re.IGNORECASE):
            acceptable_bankrupt.append(company)

for company in healthy_companies:
    with open('Dataset/Final Dataset/Healthy/' + company, 'r') as f:
        text = f.read()
        if re.search('management discussion and analysis', text, re.IGNORECASE):
            acceptable_healthy.append(company)

In [3]:
print(f'Acceptable bankrupt companies: {len(acceptable_bankrupt)} out of {len(bankrupt_companies)} companies.')
print(f'Acceptable healthy companies: {len(acceptable_healthy)} out of {len(healthy_companies)} companies.')

Acceptable bankrupt companies: 131 out of 201 companies.
Acceptable healthy companies: 130 out of 298 companies.
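
The filter above searches for the literal phrase `management discussion and analysis`, so a heading spelled with an ampersand, as in the sample at the top of this section, is not counted unless the phrase also appears elsewhere in the file. A minimal sketch of a more tolerant pattern follows. The names `MDA_HEADING` and `has_mda_heading` are hypothetical and not part of the notebook, and using this pattern would change the counts reported above.

```python
import re

# Hypothetical, more tolerant heading pattern: accepts "and" or "&"
# and any run of whitespace (including line breaks) between words.
MDA_HEADING = re.compile(r"management\s+discussion\s+(?:and|&)\s+analysis", re.IGNORECASE)

def has_mda_heading(text):
    """Return True if the text contains an MD&A heading in either spelling."""
    return MDA_HEADING.search(text) is not None

print(has_mda_heading("MANAGEMENT DISCUSSION & ANALYSIS"))     # True
print(has_mda_heading("Management Discussion and\nAnalysis"))  # True
print(has_mda_heading("Corporate Governance Report"))          # False
```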


In [4]:
import spacy
import nltk
import string
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize, sent_tokenize

# Download the NLTK tokenizer models (punkt) and the stopword list
nltk.download('punkt')
nltk.download('stopwords')

# Load spaCy's English model
nlp = spacy.load("en_core_web_sm")
stop_words = set(stopwords.words('english'))


[nltk_data] Downloading package punkt to /home/vijay/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /home/vijay/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [5]:
def preprocess_mda(mda_text):
    # Step 1: Convert text to lowercase
    mda_text = mda_text.lower()
    
    # Step 2: Sentence tokenization with NLTK
    sentences = sent_tokenize(mda_text)
    
    # Step 3: Process each sentence
    processed_sentences = []
    for sentence in sentences:
        # Tokenize each sentence into words
        words = word_tokenize(sentence)
        
        # Keep alphanumeric tokens and full stops, then drop stopwords.
        # Note: isalnum() also drops tokens that contain symbols, e.g. decimals such as "7.6".
        filtered_words = [word for word in words if (word.isalnum() or word == '.') and word not in stop_words]
        
        # Join words back into a sentence
        processed_sentence = ' '.join(filtered_words)
        processed_sentences.append(processed_sentence)
    
    # Join processed sentences back into a single text
    cleaned_text = ' '.join(processed_sentences)
    
    # Step 4: Lemmatization with spaCy
    doc = nlp(cleaned_text)
    # spaCy 2.x lemmatizes pronouns to '-PRON-'; keep the original token text in that case
    lemmatized_tokens = [token.lemma_ if token.lemma_ != '-PRON-' else token.text for token in doc]
    
    # Step 5: Named Entity Recognition (NER) with spaCy
    named_entities = [(ent.text, ent.label_) for ent in doc.ents]
    
    # Join lemmatized tokens back to form a preprocessed text string
    lemmatized_text = ' '.join(lemmatized_tokens)
    
    return lemmatized_text, named_entities
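
To see what the token filter in Step 3 keeps, the sketch below applies the same condition to a hand-written token list. The tokens and the tiny `demo_stop_words` set are illustrative assumptions standing in for the output of `word_tokenize` and the NLTK stopword list.

```python
# Illustrative stand-ins (assumptions): a tiny stopword set and a token list
# shaped like word_tokenize output for a lowercased sentence.
demo_stop_words = {"the", "by", "in", "of"}
tokens = ["the", "gdp", "grew", "by", "7.6", "per", "cent", "in", "2015-16", ",", "fy16", "."]

# Same condition as in preprocess_mda: keep alphanumeric tokens and full stops,
# then drop stopwords.
kept = [t for t in tokens if (t.isalnum() or t == ".") and t not in demo_stop_words]

print(" ".join(kept))  # gdp grew per cent fy16 .
```

Decimal figures and hyphenated tokens fail `isalnum()`, which may explain the bare "per cent" phrases in the cleaned output further down.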

In [6]:
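# Extract the MD&A section from the text
def extract_mda_section(text):
    """Return the text from the first MD&A heading up to the next section heading,
    or to the end of the document if no end heading follows; None if no MD&A heading."""
    # Define the start and end patterns for the MD&A section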
    start_pattern = r"(?:MANAGEMENT DISCUSSION & ANALYSIS|management discussion and analysis|MD&A|MDA|Management Discussion and Analysis|management discussion)"
    end_pattern = r"(?:DIRECTORS’ REPORT|BOARD OF DIRECTORS|CORPORATE GOVERNANCE|CEO CERTIFICATION)"

    # Find the start and end indices
    start_match = re.search(start_pattern, text, re.IGNORECASE)
    end_match = re.search(end_pattern, text[start_match.end():], re.IGNORECASE) if start_match else None
    
    # If both start and end are found, extract the section
    if start_match and end_match:
        mda_section = text[start_match.start():start_match.end() + end_match.start()]
        return mda_section.strip()
    elif start_match:
        # If only the start is found, extract from start to the end of the document
        mda_section = text[start_match.start():].strip()
        return mda_section
    # No MD&A heading found
    return None
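
The offset arithmetic in `extract_mda_section` can be checked on a toy document. The sketch below mirrors the same logic in a hypothetical helper, `slice_section`, with the patterns passed in as arguments. The toy report text is made up for illustration.

```python
import re

def slice_section(text, start_pattern, end_pattern):
    """Return the text from the first start match up to the next end match."""
    start = re.search(start_pattern, text, re.IGNORECASE)
    if start is None:
        return None
    # The end pattern is searched in the remainder only, so its offsets are
    # relative to start.end() and must be shifted back into `text` coordinates.
    end = re.search(end_pattern, text[start.end():], re.IGNORECASE)
    stop = start.end() + end.start() if end else len(text)
    return text[start.start():stop].strip()

toy_report = (
    "BOARD OF DIRECTORS\nShri A\n"
    "MANAGEMENT DISCUSSION & ANALYSIS\nYour Company operates in steel.\n"
    "CORPORATE GOVERNANCE\nReport on governance."
)
section = slice_section(
    toy_report,
    r"MANAGEMENT DISCUSSION & ANALYSIS",
    r"BOARD OF DIRECTORS|CORPORATE GOVERNANCE",
)
print(section)
```

The `BOARD OF DIRECTORS` heading that precedes the MD&A heading is ignored, because the end pattern is only searched after the start match.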

In [7]:
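# Now start extracting the text from the acceptable companies.
# Note: extraction runs on the preprocessed text, where stopwords such as "and"
# and symbols such as "&" are already removed, so the full-heading alternatives
# of the start pattern can no longer match; the shorter "management discussion"
# alternative still can.
with open('Dataset/Final Dataset/Bankrupt/' + acceptable_bankrupt[0], 'r') as f:
    text = f.read()
    text_data, ne = preprocess_mda(text)
    mda_section = extract_mda_section(text_data)

print(mda_section)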

management discussion analysis forward look statement gdp growth statement management discussion analysis financial condition result operation company describe company objective expectation prediction may forward look within mean applicable security law regulation . forward look statement base certain assumption expectation future event . india gdp grow five year high per cent 16 powered rebound farm output improvement electricity generation mining production fourth quarter fiscal . economic growth estimate per cent growth number last fiscal reinforce india position world large economy come back strong per cent growth last quarter fiscal . fourth quarter growth come time china report per cent march quarter slow growth seven year . farm sector grow per cent year ago compare per cent contraction december quarter . mining grow per cent march quarter per cent previous quarter . electricity water gas production growth surge per cent per cent december quarter . go forward well rainfall seven
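
Extending the single-file cell above to every acceptable report could look like the sketch below. `build_corpus` is a hypothetical helper, and the `extract` and `preprocess` arguments in the demonstration are simple stand-ins rather than the notebook's functions. This sketch also extracts from the raw text before cleaning, the reverse of the order used above; that is an assumption, made so the heading patterns see the original wording.

```python
import os
import tempfile

def build_corpus(folder, filenames, extract, preprocess):
    """Map each filename to its processed MD&A text, skipping files with no section."""
    corpus = {}
    for name in filenames:
        with open(os.path.join(folder, name), "r", encoding="utf-8") as f:
            raw = f.read()
        section = extract(raw)  # locate the section on the raw text
        if section is None:
            continue  # no MD&A heading found in this file
        corpus[name] = preprocess(section)
    return corpus

# Toy demonstration with stand-in callables (assumptions, not the notebook's functions).
with tempfile.TemporaryDirectory() as folder:
    for name, body in [("a.txt", "intro MDA steel output rose"), ("b.txt", "no section here")]:
        with open(os.path.join(folder, name), "w", encoding="utf-8") as f:
            f.write(body)
    corpus = build_corpus(
        folder,
        ["a.txt", "b.txt"],
        extract=lambda t: t[t.index("MDA"):] if "MDA" in t else None,
        preprocess=str.lower,
    )

print(corpus)  # {'a.txt': 'mda steel output rose'}
```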