# text_to_json

This notebook converts the raw chapter text files into structured JSON.

## JSON structure

The raw text files have the following structure:

- Book
    - Chapter
        - Title (first line)
        - Paragraphs (all other lines)

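The JSON structure should be similar. Note that the code in this notebook writes only the book `title` and, for each chapter, a `title` and a list of paragraph strings; the richer fields shown here are not produced yet.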

```json
{
    "title": "My Book",
    "chapters": [{
        "title": "Chapter 1",
        "paragraphs": [
            {
                "text": "this is a paragraph",
                "parsed_text": ["this", "is", "a", "paragraph"],
                "word_definitions": [
                    {
                        "word": "this",
                        "definition": "an article that refers to something"
                    },
                    {
                        "word": "is",
                        "definition": "3rd person singular of to be"
                    },
                    {
                        "word": "a",
                        "definition": "an article that refers to one of something"
                    },
                    {
                        "word": "paragraph",
                        "definition": "a group of printed words that forms a collection of ideas in writing"
                    }
                ]
            }
        ]
    }],
    "characters": [{
        "character": "shi",
        "occurrences": 1,
        "rank_in_book": 1,
        "rank_in_chinese": 4
    }],
    "words": [{
        "characters": "nihao",
        "pinyin": "nihao",
        "occurrences": 1,
        "meanings": ["hello"]
    }]
}
```
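
The `characters` entries could be derived from the paragraph text with a frequency count. The following is a minimal sketch, assuming the count is taken per character across all paragraphs and that `rank_in_book` is the 1-based position by descending count. The helper name `count_characters` and the sample paragraphs are hypothetical, and `rank_in_chinese` is left out because it needs an external frequency list.

```python
from collections import Counter

def count_characters(paragraphs):
    """Return (character, count, rank) tuples, most frequent first (hypothetical helper)."""
    # Whitespace is skipped; punctuation is still counted in this sketch.
    counts = Counter(ch for text in paragraphs for ch in text if not ch.isspace())
    return [
        (ch, n, rank)
        for rank, (ch, n) in enumerate(counts.most_common(), start=1)
    ]

# Made-up paragraphs for illustration.
characters = count_characters(["我是学生", "你是老师"])
print(characters[0])
```

Characters with equal counts still receive distinct ranks here, in first-seen order; a real ranking may want to handle ties differently.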

In [2]:
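import glob

def get_chapters_for_book(book_path):
    """Return the book's chapter file paths, ordered by chapter number."""
    text_dir = f"../data/books/{book_path}/text"
    # Files are named "<chapter number>_<name>"; look each number up in turn,
    # because a plain alphabetical sort would put 10_* before 2_*.
    chapter_count = len(glob.glob(f"{text_dir}/*"))
    chapters = []
    for i in range(1, chapter_count + 1):
        matches = glob.glob(f"{text_dir}/{i}_*")
        if not matches:
            raise FileNotFoundError(f"no chapter file matches {text_dir}/{i}_*")
        chapters.append(matches[0])
    return chapters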
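
The function above looks up each chapter by its numeric prefix, presumably because sorting the file names alphabetically would misorder chapters once there are ten or more. A small sketch of that difference, using made-up file names; `chapter_number` is a hypothetical helper and not part of the notebook.

```python
import os

def chapter_number(path):
    """Extract the integer prefix from a name like '12_title.txt' (hypothetical helper)."""
    return int(os.path.basename(path).split("_", 1)[0])

# Made-up file names for illustration only.
paths = ["text/10_ten.txt", "text/2_two.txt", "text/1_one.txt"]

alphabetical = sorted(paths)                  # compares character by character
numeric = sorted(paths, key=chapter_number)   # compares the chapter numbers

print(alphabetical)
print(numeric)
```

Sorting with a numeric key like this would be an alternative to globbing once per chapter number, under the same assumption about how the files are named.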

In [3]:
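def get_paragraph_json_from_file(file_path):
    """Return (title, paragraphs): the first line and the remaining lines of a chapter file."""
    # Read-only access is enough; utf-8 is assumed for the chapter text.
    with open(file_path, "r", encoding="utf-8") as f:
        title = f.readline().strip()
        paragraphs = [line.strip() for line in f.readlines()]
    return title, paragraphs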
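
To make the expected file layout concrete, here is a small sketch that writes a made-up chapter file to a temporary directory and reads it back with the same first-line/remaining-lines logic. The sample text and file name are assumptions for illustration, as is the utf-8 encoding.

```python
import os
import tempfile

# Hypothetical chapter content: first line is the title, the rest are paragraphs.
sample = "第一章\n第一段。\n第二段。\n"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "1_chapter.txt")
    with open(path, "w", encoding="utf-8") as f:
        f.write(sample)

    # Same reading logic as get_paragraph_json_from_file.
    with open(path, "r", encoding="utf-8") as f:
        title = f.readline().strip()
        paragraphs = [line.strip() for line in f.readlines()]

print(title)
print(paragraphs)
```

Blank lines in a chapter file come through as empty strings with this logic, so they may need filtering before parsing.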

In [4]:
import json

def write_json_file_from_chapter_texts(book_path, book_name):
    """Collect every chapter of a book and write it to book_data.json."""
    chapter_data = []
    for file_path in get_chapters_for_book(book_path):
        title, paragraphs = get_paragraph_json_from_file(file_path)
        chapter_data.append({"title": title, "paragraphs": paragraphs})

    book_data = {"title": book_name, "chapters": chapter_data}

    # ensure_ascii=False keeps non-ASCII text readable instead of \uXXXX escapes;
    # utf-8 is assumed for the output file.
    with open(f"../data/books/{book_path}/book_data.json", "w", encoding="utf-8") as f:
        json.dump(book_data, f, ensure_ascii=False)

In [11]:
write_json_file_from_chapter_texts("dark_forest", "dark forest")

In [5]:
write_json_file_from_chapter_texts("deaths_end", "deaths end")
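
A quick way to sanity-check the output is to load it back and compare it with what was written. The sketch below does this with made-up data in the same shape the function above produces, written to a temporary directory rather than the real `../data/books/` tree, and also shows what `ensure_ascii=False` changes.

```python
import json
import os
import tempfile

# Made-up data in the shape write_json_file_from_chapter_texts produces.
book_data = {
    "title": "demo",
    "chapters": [{"title": "第一章", "paragraphs": ["第一段。"]}],
}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "book_data.json")
    with open(path, "w", encoding="utf-8") as f:
        json.dump(book_data, f, ensure_ascii=False)
    with open(path, "r", encoding="utf-8") as f:
        raw = f.read()

loaded = json.loads(raw)

# With the default ensure_ascii=True, non-ASCII characters become \uXXXX escapes.
escaped = json.dumps(book_data)

print(loaded["chapters"][0]["title"])
print(escaped)
```

Both forms load back to the same data; the unescaped form is simply easier to read when inspecting the file by hand.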

# Parsing

Now that the basic text is in JSON format, the next step is to parse the individual paragraphs using spaCy. This continues in `parsing.ipynb`.