# Index Creation
**Keys in index_config:**
1. **filename** - Name of a PDF file or a JSON file. A PDF file is read from reports_folder; a JSON file is read from scraped_data_folder.
2. **index_name** - Name of the index.
3. **l1_chunk_size** - Chunk size used when creating the L1 index.
4. **l2_chunk_size** - Chunk size used when creating the L2 index.
5. **index_folder** (*optional*) - Folder in which the index is stored.
6. **index_path** (*optional*) - If not specified, the company name is extracted from the filename and the index is stored in the folder for that company.
7. **embedding_model** (*optional*) - Either all_mpnet_base_v2 or text_embedding_004, as defined in config.py.
8. **cleaning** (*optional*) - If True, the scraped result is cleaned: websites that are old and irrelevant to investment are ignored during index creation. Cleaning can sometimes be incorrect, so it is better to disable it for smaller websites. This applies only to websites, not to other input files (such as PDFs).
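
The key list above can be turned into a small pre-flight check before calling config_to_index. The sketch below is not part of the project: `source_folder_key`, `company_from_filename` and `check_index_config` are hypothetical helpers written for illustration, and they assume that every key not marked optional is required and that chunk sizes are positive integers. The company-name rule mirrors the `split('.')[0].split('_')[0]` convention used in the code cell of this notebook.

```python
import os

# Keys the list above does not mark as optional (an assumption about what is required).
REQUIRED_KEYS = {"filename", "index_name", "l1_chunk_size", "l2_chunk_size"}
OPTIONAL_KEYS = {"index_folder", "index_path", "embedding_model", "cleaning"}


def source_folder_key(filename):
    """Name the configured folder a file would be read from, based on its extension."""
    ext = os.path.splitext(filename)[1].lower()
    if ext == ".pdf":
        return "reports_folder"
    if ext == ".json":
        return "scraped_data_folder"
    raise ValueError(f"Unsupported file type: {filename!r}")


def company_from_filename(filename):
    """Take the text before the first '.' and then before the first '_'."""
    return filename.split(".")[0].split("_")[0]


def check_index_config(config):
    """Return a list of problems; an empty list means the config looks usable."""
    problems = []
    keys = set(config)
    missing = REQUIRED_KEYS - keys
    unknown = keys - REQUIRED_KEYS - OPTIONAL_KEYS
    if missing:
        problems.append(f"missing keys: {sorted(missing)}")
    if unknown:
        problems.append(f"unknown keys: {sorted(unknown)}")
    for key in ("l1_chunk_size", "l2_chunk_size"):
        value = config.get(key)
        if key in config and not (isinstance(value, int) and value > 0):
            problems.append(f"{key} should be a positive integer, got {value!r}")
    if "filename" in config:
        try:
            source_folder_key(config["filename"])
        except ValueError as err:
            problems.append(str(err))
    return problems


config = {
    "filename": "adanipower_cleaned.json",
    "index_name": "Adani Power",
    "l1_chunk_size": 1500,
    "l2_chunk_size": 300,
    "cleaning": False,
}
print(check_index_config(config))
print(source_folder_key(config["filename"]), company_from_filename(config["filename"]))
```

Returning a list of problems instead of raising on the first one makes it easier to fix a whole batch of configs in one pass.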

In [None]:
from src.index.index import config_to_index
from src.config import company_to_index_name

# One dict per index to build; the keys are described in the list above.
index_configs = [
    {'filename': 'adanipower_cleaned.json', 'l1_chunk_size': 1500, 'l2_chunk_size': 300, 'cleaning': False}
]

for index_config in index_configs:
    # Derive the company name from the filename: 'adanipower_cleaned.json' -> 'adanipower'
    company_name = index_config['filename'].split('.')[0].split('_')[0]
    # Look up the index name registered for this company in src.config
    index_config['index_name'] = company_to_index_name[company_name]
    print(f'{index_config=}')
    config_to_index(index_config)
    print('=============================================================================')

index_config={'filename': 'adanipower_cleaned.json', 'l1_chunk_size': 1500, 'l2_chunk_size': 300, 'cleaning': False, 'index_name': 'Adani Power'}
Length of data: 10125
Chunking to form L1 paras


100%|██████████| 5/5 [00:01<00:00,  2.81it/s]


Chunking to form L2 paras


100%|██████████| 5/5 [00:03<00:00,  1.64it/s]


Getting embeddings for L1 paras


To support symlinks on Windows, you either need to activate Developer Mode or to run Python as an administrator. In order to activate developer mode, see this article: https://docs.microsoft.com/en-us/windows/apps/get-started/enable-your-device-for-development


len(texts)=8
len(batches)=1
100%|██████████| 1/1 [00:01<00:00,  1.47s/it]
Getting embeddings for L2 paras
len(texts)=37
len(batches)=1
100%|██████████| 1/1 [00:02<00:00,  2.09s/it]
Building L1 and L2 indexes
Creating index at: data\index\adanipower\adanipower_cleaned_1500_300_EMtex.pkl
Index created in 38.71962785720825 seconds
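
The log above reports fewer L1 paras than L2 paras, which is what a larger l1_chunk_size and a smaller l2_chunk_size would produce. The internals of config_to_index are not shown here, so the following is only a minimal sketch of one way two-level chunking could work. It assumes chunk sizes are measured in characters and that each L2 para is cut from inside a single L1 para; neither is confirmed by the source, and `split_into_chunks` and `build_two_level_chunks` are hypothetical names.

```python
def split_into_chunks(text, chunk_size):
    """Greedily pack whitespace-separated words into chunks of at most chunk_size characters.

    A single word longer than chunk_size becomes its own oversized chunk.
    """
    chunks = []
    current = ""
    for word in text.split():
        candidate = f"{current} {word}" if current else word
        if len(candidate) <= chunk_size or not current:
            current = candidate
        else:
            chunks.append(current)
            current = word
    if current:
        chunks.append(current)
    return chunks


def build_two_level_chunks(text, l1_chunk_size, l2_chunk_size):
    """Return (l1_paras, l2_paras); each L2 entry is (index of parent L1 para, text)."""
    l1_paras = split_into_chunks(text, l1_chunk_size)
    l2_paras = []
    for l1_id, para in enumerate(l1_paras):
        for piece in split_into_chunks(para, l2_chunk_size):
            l2_paras.append((l1_id, piece))
    return l1_paras, l2_paras


# Illustrative input only; real input would be the scraped or PDF text.
sample_text = " ".join(f"word{i}" for i in range(2000))
l1_paras, l2_paras = build_two_level_chunks(sample_text, l1_chunk_size=1500, l2_chunk_size=300)
print(f"{len(l1_paras)=}")
print(f"{len(l2_paras)=}")
```

Keeping the parent index with every L2 para means a match on a small chunk can be traced back to its larger L1 context.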


# Adding cleaned text to raw json
This step adds cleaned text to a raw scraped JSON file. Gemini-1.5-flash is used for the cleaning.

In [None]:
import os
from src.config import scraped_data_folder
from src.scrape_cleaning import add_cleaned_text

# Raw scraped file, located inside scraped_data_folder
scraped_file = 'adanipower.json'
scraped_path = os.path.join(scraped_data_folder, scraped_file)
cleaned_res = add_cleaned_text(scraped_path)