# NSMQ - Kwame AI Project

## Title: JSON to Text File Conversion Script
### Author: Ernest Samuel
### Team: Data Preprocessing Team
### Date: June 24, 2023
##### Updated: August 24, 2023

---

## Data Processing Functions

This script includes several functions for processing structured JSON data and converting it into a plain text (.txt) or CSV (.csv) file. The key functions are described below:

### `convert_dict_to_strings()`
This function converts a list of dictionaries (for example, the figures of a section) into a flat list of `key: value` strings. It raises a `TypeError` if a single dictionary is passed instead of a list.
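As a minimal sketch of this conversion, the snippet below mirrors the loop in the code cell further down, which iterates over a list of dictionaries. The name `figures_to_strings` and the sample figure entries are made up for illustration and are not part of the script.

```python
def figures_to_strings(list_of_dicts):
    """Hypothetical stand-in that mirrors convert_dict_to_strings()."""
    lines = []
    for entry in list_of_dicts:
        # Each dictionary contributes one "key: value" string per item.
        for key, value in entry.items():
            lines.append(f"{key}: {value}")
    return lines


# Sample data (assumed shape): one dictionary per figure.
sample_figures = [
    {"Figure 1.1": "A cell diagram"},
    {"Figure 1.2": "A food web"},
]

result = figures_to_strings(sample_figures)
print(result)
```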

### `convert_list_to_strings()`
This function transforms a list of lists into a list of strings by joining the items of each inner list with newline characters.
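A small sketch of the same idea, assuming each inner list holds the items of one list or the cells of one table. The helper name `rows_to_strings` and the sample values are hypothetical.

```python
def rows_to_strings(list_of_lists):
    """Hypothetical stand-in that mirrors convert_list_to_strings()."""
    # str() is applied first so that non-string items (numbers, etc.) can be joined.
    return ["\n".join(map(str, row)) for row in list_of_lists]


# Sample data (assumed shape): each inner list becomes one multi-line string.
sample_lists = [["nucleus", "ribosome"], [1, 2, 3]]

joined = rows_to_strings(sample_lists)
print(joined)
```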

### `convert_to_txt_or_csv()`
The central function, `convert_to_txt_or_csv()`, serves as the core of the script. It performs the following steps:

1. Reads the input JSON file, extracting the data contained within.
2. Determines the folder path and file name of the input JSON file.
3. Initializes a list to store the formatted sections.

Next, the function iterates through the pages and sections in the JSON data. For each section, it builds a formatted text block from the section title, paragraphs, lists, tables, and figures, and appends that block to the list of formatted sections.
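The iteration can be sketched as follows. The input shape (a list of pages, each mapping a page title to a list of sections with the keys `title`, `Section`, `lists`, `tables`, and `figures`) is inferred from the code cell below; the helper `format_section` and the sample content are hypothetical.

```python
def format_section(section):
    """Hypothetical helper: builds the text layout for one section."""
    lists_text = "\n".join("\n".join(map(str, item)) for item in section["lists"])
    tables_text = "\n".join("\n".join(map(str, row)) for row in section["tables"])
    figures_text = "\n".join(
        f"{key}: {value}" for fig in section["figures"] for key, value in fig.items()
    )
    return (
        f"__section__\n**{section['title']}**\n"
        + "\n\n_paragraph_ \n".join(section["Section"]) + "\n"
        + "\n**Lists**\n" + lists_text + "\n"
        + "\n**Table**\n" + tables_text + "\n"
        + "\n**Figures**\n" + figures_text + "\n\n"
    )


# Sample data (assumed shape), standing in for the loaded JSON.
sample_data = [
    {
        "Chapter 1": [
            {
                "title": "Cells",
                "Section": ["Cells are the basic unit of life.", "Most cells are microscopic."],
                "lists": [["nucleus", "ribosome"]],
                "tables": [],
                "figures": [{"Figure 1.1": "A cell diagram"}],
            }
        ]
    }
]

formatted_sections = []
for page in sample_data:
    for page_title, sections in page.items():
        for section in sections:
            formatted_sections.append(format_section(section))

print(formatted_sections[0])
```

Note that `_paragraph_` is used as a separator between paragraphs, so it appears one time fewer than the number of paragraphs.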

Finally, the function generates the output file name and saves the formatted sections in the requested format, either `.txt` or `.csv`, inside a `formatted_files` folder next to the input file. The default format is `.txt`.
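Because the CSV cells contain embedded newlines, it may help to see that `csv.DictWriter` quotes such cells and that `csv.DictReader` restores them unchanged. The sketch below uses the same column names as the script; the row content and the temporary file location are assumptions made for illustration.

```python
import csv
import os
import tempfile

fieldnames = ["Section Title", "Paragraphs", "Lists", "Table", "Figures"]

# Sample row (assumed content) with multi-line cells, as the script produces.
rows = [
    {
        "Section Title": "Cells",
        "Paragraphs": "_paragraph_1\n First paragraph.\n_paragraph_2\n Second paragraph.",
        "Lists": "",
        "Table": "",
        "Figures": "_figure_1\n Figure 1.1: A cell diagram",
    }
]

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, "sample.csv")

    # newline="" lets the csv module control line endings itself.
    with open(csv_path, "w", newline="", encoding="utf-8") as csv_file:
        writer = csv.DictWriter(csv_file, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)

    # Reading with newline="" keeps the embedded newlines inside quoted cells intact.
    with open(csv_path, newline="", encoding="utf-8") as csv_file:
        read_back = list(csv.DictReader(csv_file))

print(read_back[0]["Paragraphs"])
```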

---

This documentation outlines the purpose and functionality of each function, ensuring clarity and readability for future users and collaborators.

In [70]:
# -------- Import libraries --------------- #
import json
import os
import csv



In [71]:
# Data processing functions

def convert_dict_to_strings(dict_of_strings):
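  """
  Converts a list of dictionaries to a list of "key: value" strings.

  Args:
    dict_of_strings: A list of dictionaries (for example, one per figure).

  Returns:
    A list of strings, one per key-value pair.

  Raises:
    TypeError: If a single dictionary is passed instead of a list.
  """
  new_list = []
  if isinstance(dict_of_strings, dict):
    raise TypeError("expected a list of dictionaries, not a single dictionary")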

  for dic in dict_of_strings: 
    for key, value in dic.items():
      new_list.append(f"{key}: {value}")

  return new_list


def convert_list_to_strings(list_of_lists):
  """
  Converts a list of lists to a list of strings.

  Args:
    list_of_lists: A list of lists.

  Returns:
    A list of strings.
  """
  new_list = []
  for list_item in list_of_lists:
    new_list.append("\n".join(map(str, list_item)))
  return new_list
  

In [76]:
def convert_to_txt_or_csv(input_filename, output_format="txt"):
    """
    Extracts structured data from a JSON file, applies formatting, and saves it as a .txt or .csv file.

    Args:
        input_filename (str): The name of the input JSON file for processing.
        output_format (str): The desired output file format (txt or csv).
    """

    # Read the input JSON file
    with open(input_filename, 'r', encoding='utf-8') as json_file:
        data = json.load(json_file)

    # Extract input file information
    input_dir, base_name = os.path.split(input_filename)
    base_name_without_extension, _ = os.path.splitext(base_name)

    # Prepare the output folder
    output_folder = os.path.join(input_dir, "formatted_files")
    os.makedirs(output_folder, exist_ok=True)

    # Container for formatted sections
    formatted_sections = []

    # Container for CSV data
    csv_data = []

    # Iterate through pages and sections
    for page_data in data:
        for page_title in page_data.keys():
            sections = page_data[page_title]
            for section in sections:
                # Construct formatted section
                formatted_section = f"__section__\n**{section['title']}**\n"
                #formatted_section += "\n\n_paragraph_ \n"
                formatted_section += "\n\n_paragraph_ \n".join(section['Section']) + "\n"
                formatted_section += "\n**Lists**\n"
                formatted_section += "\n".join(convert_list_to_strings(section['lists'])) + "\n"
                formatted_section += "\n**Table**\n"
                formatted_section += "\n".join(convert_list_to_strings(section['tables'])) + "\n"
                formatted_section += "\n**Figures**\n"
                formatted_section += "\n".join(convert_dict_to_strings(section['figures'])) + "\n\n"
                formatted_sections.append(formatted_section)

                # Prepare numbered CSV data
                paragraph_numbered = [f"_paragraph_{i+1}\n {p}" for i, p in enumerate(section['Section'])]
                lists_numbered = [f"_list_{i+1}\n {l}" for i, l in enumerate(convert_list_to_strings(section['lists']))]
                tables_numbered = [f"_table_{i+1}\n {t}" for i, t in enumerate(convert_list_to_strings(section['tables']))]
                figures_numbered = [f"_figure_{i+1}\n {f}" for i, f in enumerate(convert_dict_to_strings(section['figures']))]

                csv_row = {
                    "Section Title": section['title'],
                    "Paragraphs": "\n".join(paragraph_numbered),
                    "Lists": "\n".join(lists_numbered),
                    "Table": "\n".join(tables_numbered),
                    "Figures": "\n".join(figures_numbered)
                }
                csv_data.append(csv_row)

    # Determine the output file name and format
    if output_format == "txt":
        output_file_name = os.path.join(output_folder, f"{base_name_without_extension}.txt")
        formatted_output = "\n\n".join(formatted_sections)
    elif output_format == "csv":
        output_file_name = os.path.join(output_folder, f"{base_name_without_extension}.csv")
        
        # Use csv.DictWriter to write CSV data
        with open(output_file_name, 'w', newline='', encoding='utf-8') as csv_file:
            fieldnames = ["Section Title", "Paragraphs", "Lists", "Table", "Figures"]
            writer = csv.DictWriter(csv_file, fieldnames=fieldnames)
            writer.writeheader()
            for row in csv_data:
                writer.writerow(row)

        formatted_output = None  # No formatted output for CSV
    else:
        print("Invalid output format. Supported formats are 'txt' and 'csv'.")
        return

    # Save the formatted data to the output file
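    # (an empty result is still written, so the success message below stays accurate)
    if formatted_output is not None:
        with open(output_file_name, 'w', encoding='utf-8') as output_file:
            output_file.write(formatted_output)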

    print(f"{base_name_without_extension} Data extracted and saved as .{output_format} successfully in {output_folder}.")


#### Run Script.

NOTE: Make sure the JSON file you are running this on was generated with `extract_textbook_from_url_notebook_.ipynb`.
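One way to check a file before converting it is a quick shape test. This is only a sketch: the required keys are taken from the fields that `convert_to_txt_or_csv()` reads, while the function name `find_shape_problems` and the sample data are hypothetical and not part of the script.

```python
REQUIRED_KEYS = {"title", "Section", "lists", "tables", "figures"}


def find_shape_problems(data):
    """Hypothetical pre-flight check; returns a list of problem descriptions."""
    if not isinstance(data, list):
        return ["top level must be a list of pages"]

    problems = []
    for page_index, page in enumerate(data):
        if not isinstance(page, dict):
            problems.append(f"page {page_index} is not a dictionary")
            continue
        for page_title, sections in page.items():
            if not isinstance(sections, list):
                problems.append(f"{page_title!r} does not map to a list of sections")
                continue
            for section_index, section in enumerate(sections):
                if not isinstance(section, dict):
                    problems.append(f"{page_title!r} section {section_index} is not a dictionary")
                    continue
                missing = REQUIRED_KEYS - set(section)
                if missing:
                    problems.append(
                        f"{page_title!r} section {section_index} is missing {sorted(missing)}"
                    )
    return problems


# Sample data (assumed shape): the second section lacks the "figures" key.
sample = [
    {
        "Chapter 1": [
            {"title": "Cells", "Section": [], "lists": [], "tables": [], "figures": []},
            {"title": "Tissues", "Section": [], "lists": [], "tables": []},
        ]
    }
]

problems = find_shape_problems(sample)
print(problems)
```

An empty list means the data has the keys the conversion function expects; it does not guarantee that the values themselves are well formed.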

In [78]:
# Call the function with the input filename
input_filename = 'Biology 2e.json'  # Replace with the actual input file name
output_format = "csv"               # Choose output format

# ----- Uncomment the line below to run the conversion in the chosen format ----- #
#convert_to_txt_or_csv(input_filename, output_format)


Biology 2e Data extracted and saved as .csv successfully in formatted_files.
