# NSMQ - Kwame AI Project

## Title: JSON to Text File Conversion Script
### Author: Ernest Samuel
### Team: Data Preprocessing Team
### Date: June 24, 2023
##### Updated: August 24, 2023

---

## Data Processing Functions

This script includes several functions for processing structured JSON data and converting it into a plain text file. Below are the key functions used:

### `convert_dict_to_strings()`
This function converts a list of dictionaries into a list of `key: value` strings. Despite its name, it rejects a bare dictionary with a `TypeError`. The script applies it to each section's `figures` entry.
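
As a rough illustration of the behavior, the sketch below restates the function compactly and runs it on made-up input. Note that the implementation in this notebook iterates over a list of dictionaries and raises `TypeError` for a single dictionary. The `figures` sample values are hypothetical and do not come from any real textbook file.

```python
def convert_dict_to_strings(dict_of_strings):
    """Flatten a list of dictionaries into "key: value" strings."""
    if isinstance(dict_of_strings, dict):
        raise TypeError("dict_of_strings must not be a dictionary")
    return [f"{key}: {value}" for dic in dict_of_strings for key, value in dic.items()]


# Hypothetical figure entries, for illustration only
figures = [{"Figure 1.1": "A cell diagram"}, {"Figure 1.2": "A leaf cross-section"}]
figure_lines = convert_dict_to_strings(figures)
print(figure_lines)
# ['Figure 1.1: A cell diagram', 'Figure 1.2: A leaf cross-section']
```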

### `convert_list_to_strings()`
This function transforms a list of lists into a list of strings by joining the items of each inner list with newlines. The script applies it to each section's `lists` and `tables` entries.
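
A minimal sketch of what this looks like in practice: each inner list becomes one string, with its items converted via `str` and joined by newlines. The sample input is invented for illustration.

```python
def convert_list_to_strings(list_of_lists):
    """Join the items of each inner list with newlines."""
    return ["\n".join(map(str, list_item)) for list_item in list_of_lists]


# Hypothetical input: one list of words and one list of numbers
sample_lists = [["nucleus", "ribosome"], [1, 2, 3]]
joined = convert_list_to_strings(sample_lists)
print(joined)
# ['nucleus\nribosome', '1\n2\n3']
```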

### `extract_and_save_data()`
The central function, `extract_and_save_data()`, serves as the core of the script. It performs the following steps:

1. Reads the input JSON file and loads its contents.
2. Determines the folder path and base file name of the input JSON file.
3. Creates an `output_txt_files` folder next to the input file if it does not already exist.
4. Initializes a list to store the formatted sections.

Next, the function iterates through the pages and sections in the JSON data. For each section, it builds a text block that starts with a `__section__` marker and the section title, followed by the paragraphs (separated by `_paragraph_` markers) and the content under the `**Lists**`, `**Table**`, and `**Figures**` labels. Each block is appended to the list of formatted sections.

Finally, the function derives the output file name from the input file's base name and writes the formatted sections to that `.txt` file inside the `output_txt_files` folder.
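
To make the expected input shape concrete, here is a small sketch that formats one section the same way the main function does. The keys (`title`, `Section`, `lists`, `tables`, `figures`) are the ones the script reads. The sample values and the helper name `format_section` are assumptions made for this illustration and are not part of the script.

```python
# Hypothetical sample mirroring the keys the script reads; the values are invented.
sample_data = [
    {
        "Chapter 1": [
            {
                "title": "The Cell",
                "Section": ["Cells are the basic unit of life.", "They vary in size."],
                "lists": [["nucleus", "ribosome"]],
                "tables": [["Organelle | Function", "Nucleus | Stores DNA"]],
                "figures": [{"Figure 1.1": "A cell diagram"}],
            }
        ]
    }
]


def format_section(section):
    """Format one section with the same markers extract_and_save_data() uses."""
    lists = ["\n".join(map(str, item)) for item in section["lists"]]
    tables = ["\n".join(map(str, item)) for item in section["tables"]]
    figures = [f"{k}: {v}" for fig in section["figures"] for k, v in fig.items()]
    text = f"__section__\n**{section['title']}**\n"
    text += "\n\n_paragraph_ \n".join(section["Section"]) + "\n"
    text += "\n**Lists**\n" + "\n".join(lists) + "\n"
    text += "\n**Table**\n" + "\n".join(tables) + "\n"
    text += "\n**Figures**\n" + "\n".join(figures) + "\n\n"
    return text


# Each page is a dictionary mapping a page title to a list of sections
for page in sample_data:
    for page_title, sections in page.items():
        for section in sections:
            print(format_section(section))
```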

In [1]:
# -------- Import libraries --------------- #
import json
import os


In [6]:
# Data processing functions

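def convert_dict_to_strings(dict_of_strings):
  """
  Converts a list of dictionaries to a list of "key: value" strings.

  Args:
    dict_of_strings: A list of dictionaries (not a single dictionary).

  Returns:
    A list of strings, one per key-value pair.

  Raises:
    TypeError: If a single dictionary is passed instead of a list.
  """
  new_list = []
  # A bare dictionary would iterate over its keys, so reject it explicitly
  if isinstance(dict_of_strings, dict):
    raise TypeError("dict_of_strings must be a list of dictionaries, not a dictionary")

  for dic in dict_of_strings:
    for key, value in dic.items():
      new_list.append(f"{key}: {value}")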

  return new_list


def convert_list_to_strings(list_of_lists):
  """
  Converts a list of lists to a list of strings.

  Args:
    list_of_lists: A list of lists.

  Returns:
    A list of strings.
  """
  new_list = []
  for list_item in list_of_lists:
    new_list.append("\n".join(map(str, list_item)))
  return new_list
  

In [7]:
# Main .TXT File Conversion Function

def extract_and_save_data(input_filename):
    """
    This function extracts structured data from a JSON file, formats it, and saves it as a .txt file.
    
    Parameters:
    input_filename (str): The name of the input JSON file to be processed.
    """
    
    # Read the JSON file
    with open(input_filename, 'r', encoding='utf-8') as json_file:
        data = json.load(json_file)

    # Get the directory and base name of the input file
    input_dir, base_name = os.path.split(input_filename)
    base_name_without_extension, _ = os.path.splitext(base_name)

    # Create the output folder if it doesn't exist
    output_folder = os.path.join(input_dir, "output_txt_files")
    os.makedirs(output_folder, exist_ok=True)

    # Create a list to store formatted sections
    formatted_sections = []

    # Iterate through the pages and sections
    for page_data in data:
        for page_title, sections in page_data.items():
            for section in sections:
                formatted_section = f"__section__\n**{section['title']}**\n"
                formatted_section += "\n\n_paragraph_ \n".join(section['Section']) + "\n"
                formatted_section += "\n**Lists**\n"
                formatted_section += "\n".join(convert_list_to_strings(section['lists'])) + "\n"
                formatted_section += "\n**Table**\n"
                formatted_section += "\n".join(convert_list_to_strings(section['tables'])) + "\n"
                formatted_section += "\n**Figures**\n"
                formatted_section += "\n".join(convert_dict_to_strings(section['figures'])) + "\n\n"
                formatted_sections.append(formatted_section)

    # Create the output file name
    output_file_name = os.path.join(output_folder, f"{base_name_without_extension}.txt")

    # Save the formatted sections to a text file
    with open(output_file_name, 'w', encoding='utf-8') as output_file:
        output_file.write("\n\n".join(formatted_sections))

    print(f"{base_name_without_extension} Data extracted and saved as .txt successfully in {output_folder}.")


#### Run Script.

NOTE: Make sure the JSON file you are running was generated with `extract_textbook_from_url_notebook_.ipynb`.
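
One way to check a file before converting it is to confirm that every section carries the keys the conversion function reads. The sketch below is an optional pre-flight check and not part of the original script. The helper name `find_missing_keys` and the sample data are hypothetical.

```python
# Keys that extract_and_save_data() reads from every section
REQUIRED_KEYS = {"title", "Section", "lists", "tables", "figures"}


def find_missing_keys(data):
    """Return (page_title, section_index, missing_keys) for each incomplete section."""
    problems = []
    for page in data:
        for page_title, sections in page.items():
            for index, section in enumerate(sections):
                missing = REQUIRED_KEYS - set(section)
                if missing:
                    problems.append((page_title, index, sorted(missing)))
    return problems


# Hypothetical data: one complete section and one incomplete section
good = [{"Page": [{"title": "T", "Section": [], "lists": [], "tables": [], "figures": []}]}]
bad = [{"Page": [{"title": "T", "Section": []}]}]

print(find_missing_keys(good))  # []
print(find_missing_keys(bad))   # [('Page', 0, ['figures', 'lists', 'tables'])]
```

An empty result suggests the file has the structure the conversion expects. Any reported entries point to sections that would raise a `KeyError` during conversion.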

In [9]:
# Call the function with the input filename
input_filename = 'Biology 2e.json'  # Replace with the actual input file name
extract_and_save_data(input_filename)


Biology 2e Data extracted and saved as .txt successfully in output_txt_files.
