# Table of Contents
* [How many volumes and chapters are there?](#chapter1)
    * [Extracting the volumes](#section_1_1)
    * [Extracting the chapters](#section_1_2)
* [Basic EDA](#chapter2)
    * [Occurrence of sentences](#section_2_1)
    * [Occurrence of words](#section_2_2)

# How many volumes and chapters are there?<a class="anchor" id="chapter1"></a>

In [64]:
# Necessary libraries
import re
from pathlib import Path 
import json
from processtext import clean, remove_sw, lemmatize, clean_text
PATH = str(Path.cwd().parent)

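# Reading the dataset (assumes the file is UTF-8 encoded; an explicit encoding avoids platform-dependent defaults)
with open(PATH + "/data/raw/Complete Works of Swami Vivekananda -  All Volumes - Swami Vivekananda.txt", "r", encoding="utf-8") as file: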
    book = file.read()

## Extracting the volumes inside the book<a class="anchor" id="section_1_1"></a>

In [52]:
pattern = re.compile("Volume [0-9]+")  # "Volume " followed by one or more digits
volumes = re.findall(pattern, book)
volumes

['Volume 1',
 'Volume 2',
 'Volume 3',
 'Volume 4',
 'Volume 5',
 'Volume 6',
 'Volume 7',
 'Volume 8',
 'Volume 9',
 'Volume 9',
 'Volume 9']
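
The raw matches repeat `Volume 9`, so the length of the list overstates the number of volumes. A minimal sketch of order-preserving deduplication follows; `sample_matches` is a hypothetical stand-in that mirrors the output above, not the notebook's own variable.

```python
# Hypothetical stand-in for the list of matches shown above.
sample_matches = [f"Volume {n}" for n in range(1, 10)] + ["Volume 9", "Volume 9"]

# dict.fromkeys keeps the first occurrence of each key and preserves insertion order.
unique_volumes = list(dict.fromkeys(sample_matches))

print(len(unique_volumes))
```

Using `dict.fromkeys` rather than `set` keeps the volumes in the order they appear in the book.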

## Extracting the chapters<a class="anchor" id="section_1_2"></a>

In [53]:
def search_patterns(text: str, pattern: str) -> list:
    """Search the whole text for non-overlapping matches of a regex pattern.

    Args:
        text (str): Input text
        pattern (str): Pattern in regex format

    Returns:
        list: Sorted list of unique matches, with newlines and trailing whitespace removed
    """
    forward_pointer = 0
    output_results_list = []
    # Search repeatedly, each time resuming just after the previous match.
    # Note: a pattern that can match the empty string would never advance the pointer.
    while (match := re.search(pattern, text[forward_pointer:])) is not None:
        forward_pointer += match.end()
        output_results_list.append(match.group().replace('\n', '').rstrip())
    return sorted(set(output_results_list))

list_of_chapters = search_patterns(book,
                                   pattern=r'\b\d\.\d[.\w\s]{4}[A-Za-z0-9_]*.?(\n|.*)')  # https://regexr.com/
list_of_chapters.remove('0.0032 millimeter.')  # known false positive: a measurement, not a heading
list_of_chapters

['1.1 ADDRESSES AT THE PARLIAM',
 '1.1 ADDRESSES AT THE PARLIAMENT',
 '1.1. 1. (Sanskrit in ITRANS format)',
 '1.1.1 RESPONSE TO WELCOME',
 '1.1.2 WHY WE DISAGREE',
 '1.1.3 PAPER ON HINDUISM',
 '1.1.4 RELIGION NOT THE CRYING NEED OF INDIA',
 '1.1.5 BUDDHISM, THE FULFILMENT OF HINDUISM',
 '1.1.5 PAPER ON HINDUISM',
 '1.1.6 ADDRESS AT THE FINAL SESSION',
 '1.11.',
 '1.2 KARMA-YOGA',
 '1.2.1 CHAPTER 1: KARMA IN ITS EFFECT ON CHARACTER',
 '1.2.2 CHAPTER 2: EACH IS GREAT IN HIS OWN PLACE',
 '1.2.4 CHAPTER 4: WHAT IS DUTY?',
 '1.2.5 CHAPTER 3: THE SECRET OF WORK',
 '1.2.5 CHAPTER 5: WE HELP OURSELVES, NOT THE WORLD',
 '1.2.6 CHAPTER 6: NON-ATTACHMENT IS COMPLETE SELF-',
 '1.2.7 CHAPTER 7: FREED',
 '1.2.7 CHAPTER 7: FREEDOM',
 '1.2.8 CHAPTER 8: THE IDEAL OF K',
 '1.2.8 CHAPTER 8: THE IDEAL OF KARMA-YOGA',
 '1.2.9',
 '1.3 RAJA-YOGA',
 '1.3.0 PREFACE',
 '1.3.1 CHAPTER 1: INTRODUCTORY',
 '1.3.2 CHAPTER 2: THE FIRST STEPS',
 '1.3.3 CHAPTER 3: PRANA',
 '1.3.4 CHAPTER 4: THE PSYCHIC PRANA',
 '1.3.5
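
The list above contains headings that appear cut off, such as `'1.2.7 CHAPTER 7: FREED'` next to `'1.2.7 CHAPTER 7: FREEDOM'`. One possible cleanup, sketched here as a heuristic rather than the notebook's own method, is to drop any entry that is a strict prefix of another entry. The helper name `drop_truncated` and the `sample_headings` list are assumptions made for illustration.

```python
def drop_truncated(headings):
    """Drop any heading that is a strict prefix of another heading."""
    ordered = sorted(set(headings))
    kept = []
    for i, heading in enumerate(ordered):
        # In sorted order, every string that extends `heading` directly follows it,
        # so checking the immediate successor is sufficient.
        successor = ordered[i + 1] if i + 1 < len(ordered) else None
        if successor is not None and successor.startswith(heading):
            continue
        kept.append(heading)
    return kept


# Hypothetical sample taken from the output above.
sample_headings = [
    '1.1 ADDRESSES AT THE PARLIAM',
    '1.1 ADDRESSES AT THE PARLIAMENT',
    '1.2.7 CHAPTER 7: FREED',
    '1.2.7 CHAPTER 7: FREEDOM',
    '1.3 RAJA-YOGA',
]
filtered_headings = drop_truncated(sample_headings)
print(filtered_headings)
```

This heuristic would also drop a genuine heading that happens to be a prefix of a longer one, so the result is worth checking by eye.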

# Basic EDA<a class="anchor" id="chapter2"></a>

## Occurrence of sentences <a class="anchor" id="section_2_1"></a>

In [54]:
starting_sentence = "OUR MASTER AND HIS MESSAGE"
pattern = re.compile(starting_sentence)
starting_pointer = re.search(pattern, book).end()
# Rough count: treats every ". " as a sentence boundary
print("Number of sentences: ", len(book[starting_pointer:].split(". ")))

Number of sentences:  79340
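
Splitting on `". "` is only an approximation: it misses sentences that end in `?` or `!`, and sentences that end at a line break. A hedged alternative, shown on a hypothetical `sample_text` rather than the book itself, splits on whitespace that follows any sentence-ending punctuation mark. The helper name `split_sentences` is an assumption for illustration.

```python
import re


def split_sentences(text):
    """Split on whitespace that directly follows '.', '!' or '?'."""
    parts = re.split(r'(?<=[.!?])\s+', text.strip())
    return [part for part in parts if part]


# Hypothetical sample, not taken from the book.
sample_text = "What is duty? It varies. Work on!\nRest later."

naive_count = len(sample_text.split(". "))
regex_count = len(split_sentences(sample_text))
print(naive_count, regex_count)
```

This is still a rough rule: abbreviations such as "Dr." would be counted as sentence endings, so a dedicated sentence tokenizer may be preferable for precise counts.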


## Occurrence of words <a class="anchor" id="section_2_2"></a>

In [72]:
cleaned_text = clean(book.replace("’", ""), extra_spaces=True, lowercase=True, punct=True)  # Basic preprocessing
cleaned_text = remove_sw(cleaned_text) # Removing stopwords
cleaned_text = cleaned_text.replace("—", ' ')  # replace em-dashes with a space so the words on either side are not fused
cleaned_text = clean_text(cleaned_text)

def remove_numerals_and_one_char_words(input_text):
    """Remove standalone numerals and one-character words, collapsing extra whitespace."""
    # Remove numerals
    text_without_numerals = re.sub(r'\b\d+\b', '', input_text)

    # Remove one-character words; split() and join() also collapse extra spaces
    return ' '.join(word for word in text_without_numerals.split() if len(word) > 1)

cleaned_text = remove_numerals_and_one_char_words(cleaned_text)
cleaned_text = lemmatize(cleaned_text) # lemmatization
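
With the text cleaned and lemmatized, word occurrences can be tallied. A minimal sketch using `collections.Counter` from the standard library follows; it assumes the cleaned text is a single whitespace-separated string, and `sample_cleaned_text` is a hypothetical stand-in for `cleaned_text`.

```python
from collections import Counter

# Hypothetical stand-in for the cleaned, lemmatized text.
sample_cleaned_text = "work duty work freedom duty work"

# Count every whitespace-separated token.
word_counts = Counter(sample_cleaned_text.split())

# The two most frequent words with their counts.
top_words = word_counts.most_common(2)
print(top_words)
```

Passing a larger number to `most_common` returns more entries, sorted from most to least frequent.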