<a href="https://www.kaggle.com/code/binfeng2021/question-answering-with-chinese-history-book?scriptVersionId=209598627" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# Main Goal

* Process a PDF of a Chinese history book and use it as the reference for answering fact-based questions about Chinese history.
* Embed the text of the book and retrieve the text chunks most similar to each question as context.
* Build a question-answering bot that reads the retrieved context and returns the most likely answer.
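
The retrieval step in the second bullet can be sketched in plain Python. This is only an illustration under stated assumptions: the three-dimensional vectors are made-up stand-ins for real sentence embeddings, and `cosine` and `top_n` are hypothetical helper names that do not appear in the notebook.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def top_n(query_vec, sentence_vecs, n=2):
    """Return the indices of the n sentences most similar to the query, best first."""
    scores = [cosine(query_vec, v) for v in sentence_vecs]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:n]

# Toy vectors (assumed values) standing in for real sentence embeddings
sentence_vecs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.9, 0.1, 0.0]]
query_vec = [1.0, 0.05, 0.0]

best = top_n(query_vec, sentence_vecs, n=2)
print(best)
```

The notebook itself does the same ranking with `cosine_similarity` from scikit-learn and `np.argsort` over embeddings produced by a sentence-transformer model.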

# Data 
[A History of CHINA by Morris Rossabi](https://arxiujosepserradell.cat/wp-content/uploads/2021/12/A-History-of-China-by-Morris-Rossabi-z-lib.org_.pdf)

## Install and import needed libraries

In [1]:
!pip install pymupdf

Collecting pymupdf
  Downloading PyMuPDF-1.24.14-cp39-abi3-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (3.4 kB)
Downloading PyMuPDF-1.24.14-cp39-abi3-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (19.8 MB)
Installing collected packages: pymupdf
Successfully installed pymupdf-1.24.14


In [2]:
!pip install sentence-transformers

Collecting sentence-transformers
  Downloading sentence_transformers-3.3.1-py3-none-any.whl.metadata (10 kB)
Downloading sentence_transformers-3.3.1-py3-none-any.whl (268 kB)
Installing collected packages: sentence-transformers
Successfully installed sentence-transformers-3.3.1


In [3]:
import pymupdf
import re
import json
import pandas as pd
import math

from sentence_transformers import SentenceTransformer, util

from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

from collections import Counter

from fuzzywuzzy import fuzz

import nltk
nltk.download('punkt')  # Download the sentence tokenizer

from nltk.tokenize import sent_tokenize
from transformers import pipeline
import warnings
warnings.filterwarnings('ignore')



[nltk_data] Downloading package punkt to /usr/share/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [4]:
def get_most_popular_item(l):
    counter = Counter(l)
    most_common = counter.most_common(1)
    value, _ = most_common[0]
    return value

In [5]:
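def extract_pdf_text_with_metadata(page_num, doc):
    # Collect one row per text block on the page: its text, bounding box,
    # page index, and the most common font and font size within the block
    data = {'text':[], 'bbox':[], 'page':[], 'font':[], 'size':[]}
    page = doc.load_page(page_num)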
    
    # extract text blocks with positional info
    blocks = page.get_text("dict")["blocks"]
    
    for block in blocks:
        if "lines" not in block:
            continue
        
        bbox = block['bbox']
        block_texts = ""
        font_list = []
        size_list = []
        
        for line in block["lines"]:
            line_texts = ""
            for span in line["spans"]:
                line_texts += span["text"].strip()
                font_list.append(span["font"])
                size_list.append(span["size"])
                
            if block_texts == "":
                block_texts = line_texts
            else:
                block_texts = block_texts + " " + line_texts

        # store text blocks with metadata
        # first, choose the most popular font and size of the block to store
        
        data['font'].append(get_most_popular_item(font_list))
        data['size'].append(get_most_popular_item(size_list))
        data['text'].append(block_texts)
        data['bbox'].append(bbox)
        data['page'].append(page_num)
    data = pd.DataFrame(data)
    return data

In [6]:
def dict_to_df(book_dict):
    rows = []
    for part, part_info in book_dict.items():
        part_name = part_info["name"]
        for chapter, chapter_info in part_info["chapters"].items():
            chapter_name = chapter_info["name"]
            for section, pages in chapter_info["sections"].items():
                rows.append([part_name, chapter_name, section, pages])
    
    # Create a DataFrame from the list of rows
    df = pd.DataFrame(rows, columns=["Part", "Chapter", "Section", "Pages"])
    return df

In [7]:
def find_most_similar(text, df, model, n = 5):
    # Encode the user query
    query_embedding = model.encode(text)
    
    # Calculate cosine similarities between the query and every sentence embedding
    similarities = cosine_similarity([query_embedding], df['embeddings'].tolist())
    
    # Indices of the n most similar sentences, best first
    top_n_indices = np.argsort(similarities[0])[-n:][::-1]
    
    # Return the relevant rows (text, page number, part, and chapter) from the
    # DataFrame passed in, rather than from the global df_process_pdf
    relevant_info = df.iloc[top_n_indices].reset_index(drop = True)
    relevant_info['relevant_rank'] = relevant_info.index + 1
    return relevant_info

In [8]:
def get_context(context_info, context_window = 10):
    context_list = []
    relevant_rank_list = []
    for _, row in context_info.iterrows():
        sen_num = row['sen_num']
        part = row['Part']
        chapter = row['Chapter']
        section = row['Section']
        relevant_rank = row['relevant_rank']
        sentences_list = df_process_pdf[(df_process_pdf['Part']==part)
        &(df_process_pdf['Chapter']==chapter)
        &(df_process_pdf['Section']==section)
        &(df_process_pdf['sen_num']>=sen_num - context_window)
        &(df_process_pdf['sen_num']<=sen_num + context_window)].sentences.tolist()
        context = ' '.join(sentences_list)
        context_list.append(context)
        relevant_rank_list.append(relevant_rank)
    return context_list, relevant_rank_list

In [9]:
def generate_similarity_score(row):
    similarity = util.cos_sim(row['gt_embeddings'], row['predict_embeddings'])
    return similarity.item()

## Process the PDF documents

In [10]:
doc = pymupdf.open('/kaggle/input/a-history-of-china-morris-rossabi/A-History-of-China-by-Morris-Rossabi-z-lib.org_.pdf')

df_full_doc = []
for page_n in range(doc.page_count):
    processed_df = extract_pdf_text_with_metadata(page_n, doc)
    processed_df['page_num'] = page_n + 1
    df_full_doc.append(processed_df)

df_full_doc = pd.concat(df_full_doc)

df_full_doc['y0'] = df_full_doc['bbox'].apply(lambda x: x[1])
df_full_doc = df_full_doc.groupby('page_num', as_index=False).apply(lambda x: x.sort_values('y0').reset_index(drop = True).reset_index())
df_full_doc['block_num'] = df_full_doc['index'] + 1

df_full_doc = df_full_doc[(df_full_doc['font'].isin(['PlantinStd', 'PlantinStd-Italic-SC700']))&(df_full_doc['size'].round(2)>=10)&(df_full_doc['page_num']>=31)&(df_full_doc['page_num']<441)]

df_full_doc.loc[(df_full_doc['font']=='PlantinStd-Italic-SC700')&(df_full_doc['size'].round(2)==19.47), 'text_function'] = 'section title'

df_full_doc.loc[(df_full_doc['font']=='PlantinStd')&(df_full_doc['text_function'].isnull())&(~df_full_doc['text'].str.startswith('Figure'))&(~df_full_doc['text'].str.startswith('Map')), 'text_function'] = 'main body'

df_in = df_full_doc[df_full_doc['text_function'].isin(['main body', 'section title'])][['page_num', 'block_num', 'text', 'text_function']].drop_duplicates().sort_values(['page_num', 'block_num']).reset_index(drop = True)

In [11]:
section_list = df_in[df_in['text_function']=='section title'].index.tolist() + [math.inf]

In [12]:
dict_ref = {'section':[], 'text':[]}
for (start, end) in zip(section_list, section_list[1:]):
    df = df_in[(df_in.index >start)&(df_in.index <end)]
    section = df_in[df_in.index==start].text.values[0]
    dict_ref['section'].append(section)
    text_all = " ".join(df['text'].unique().tolist())
    dict_ref['text'].append(text_all)

dict_ref = pd.DataFrame(dict_ref)

# Remove further reading and notes sections
dict_ref = dict_ref[~dict_ref['section'].isin(['FURTHERREADING', 'NOTES'])]

In [13]:
with open('/kaggle/input/a-history-of-china-morris-rossabi/table of contents.txt') as file:
    content = file.read()

In [14]:
# Convert part of the dictionary to DataFrame
df_structure = dict_to_df(json.loads(content))

# Note: page numbers below refer to PDF pages, not printed book pages; the PDF page is the book page plus 28
df_structure['book_page'] = df_structure['Pages'].str.extract(r'(\d+)')
df_structure['page_start'] = df_structure['book_page'].astype(float) + 28 # apply the offset so that all later page references are PDF pages
df_structure['page_end'] = df_structure['page_start'].shift(-1)

df_structure.loc[df_structure['page_end'].isnull(), 'page_end'] = df_structure.loc[df_structure['page_end'].isnull(), 'page_start']

In [15]:
extracted_sections = dict_ref['section'].unique().tolist()
original_sections = df_structure['Section'].unique().tolist()

section_match = {'section':[], 'Section':[], 'ratio':[]}
for l1 in extracted_sections:
    l1_ = l1.replace(' ', '')
    l1_ = re.sub(r'[^\w\s]', '', l1_)  # str.replace treats the pattern literally, so use re.sub to strip punctuation
    for l2 in original_sections:
        l2_ = l2.upper()
        l2_ = l2_.replace(' ', '')
        l2_ = re.sub(r'[^\w\s]', '', l2_)
        ratio = fuzz.ratio(l1_, l2_)
        section_match['section'].append(l1)
        section_match['Section'].append(l2)
        section_match['ratio'].append(ratio)

section_match = pd.DataFrame(section_match).sort_values('ratio', ascending=False).groupby(['section'], as_index=False).head(1).drop('ratio', axis = 1)

In [16]:
df_process_pdf = df_structure.merge(section_match).merge(dict_ref).drop('section', axis = 1)

In [17]:
df_process_pdf['sentences'] = df_process_pdf['text'].apply(lambda x: sent_tokenize(x))

In [18]:
df_process_pdf = df_process_pdf.explode('sentences').reset_index(drop = True)

In [19]:
# build text embeddings
embeddings_model = SentenceTransformer("all-mpnet-base-v2")

In [20]:
df_process_pdf['embeddings'] = df_process_pdf['sentences'].apply(lambda x: embeddings_model.encode(x, show_progress_bar = False))

In [21]:
df_process_pdf = df_process_pdf.groupby(['Part', 'Chapter', 'Section'], as_index=False).apply(lambda x: x.reset_index(drop = True).reset_index())

df_process_pdf.rename(columns = {'index':'sen_num'}, inplace = True)

In [26]:
df_process_pdf.to_excel('processed_book.xlsx', index=False)

In [25]:
df_process_pdf

| | | sen_num | Part | Chapter | Section | Pages | book_page | page_start | page_end | text | sentences | embeddings |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | China Among 'Barbarians' | Chaos and Religious and Political Responses, 2... | Buddhism Enters China | Pages 110 | 110 | 138.0 | 144.0 | Buddhism is linked with the Indian prince Sidd... | Buddhism is linked with the Indian prince Sidd... | [0.038585927, 0.020595705, -0.035208672, 0.007... |
| 0 | 1 | 1 | China Among 'Barbarians' | Chaos and Religious and Political Responses, 2... | Buddhism Enters China | Pages 110 | 110 | 138.0 | 144.0 | Buddhism is linked with the Indian prince Sidd... | Because descriptions of his life evolved centu... | [0.021002242, 0.016055271, -0.04977115, 0.0365... |
| 0 | 2 | 2 | China Among 'Barbarians' | Chaos and Religious and Political Responses, 2... | Buddhism Enters China | Pages 110 | 110 | 138.0 | 144.0 | Buddhism is linked with the Indian prince Sidd... | Such writings are not reliable accounts, as th... | [0.086978056, 0.028460192, -0.01349521, 0.0572... |
| 0 | 3 | 3 | China Among 'Barbarians' | Chaos and Religious and Political Responses, 2... | Buddhism Enters China | Pages 110 | 110 | 138.0 | 144.0 | Buddhism is linked with the Indian prince Sidd... | Actual events are so interlaced with myths and... | [0.06389172, 0.051194068, -0.015496978, 0.0812... |
| 0 | 4 | 4 | China Among 'Barbarians' | Chaos and Religious and Political Responses, 2... | Buddhism Enters China | Pages 110 | 110 | 138.0 | 144.0 | Buddhism is linked with the Indian prince Sidd... | Yet consideration of the mythical accounts co... | [0.030101553, 0.05770132, -0.032306902, 0.0712... |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 134 | 56 | 56 | China in Global History | The Republican Period, 1911–1949 | Warlords in Power | Pages 337 | 337 | 365.0 | 368.0 | The northwest province of Xinjiang illustrates... | Some in the old gentry class profited from the... | [-0.02893394, 0.05812279, 0.03695344, 0.080495... |
| 134 | 57 | 57 | China in Global History | The Republican Period, 1911–1949 | Warlords in Power | Pages 337 | 337 | 365.0 | 368.0 | The northwest province of Xinjiang illustrates... | Virtually without restrictions, these landlord... | [-0.045782655, 0.038755514, -0.023722423, -0.0... |
| 134 | 58 | 58 | China in Global History | The Republican Period, 1911–1949 | Warlords in Power | Pages 337 | 337 | 365.0 | 368.0 | The northwest province of Xinjiang illustrates... | Alliance with local officials, who either deri... | [-0.07431569, 0.055751823, 0.019721208, 0.0838... |
| 134 | 59 | 59 | China in Global History | The Republican Period, 1911–1949 | Warlords in Power | Pages 337 | 337 | 365.0 | 368.0 | The northwest province of Xinjiang illustrates... | Chinese who believed themselves to be exploite... | [-0.020467656, 0.022244317, 0.048469592, -0.01... |


## Build QA bot to answer questions

In [None]:
qa_pipeline = pipeline("question-answering", model = "distilbert-base-uncased-distilled-squad")

In [None]:
df_questions = pd.read_excel("/kaggle/input/a-history-of-china-morris-rossabi/QA_dataset.xlsx", sheet_name = 'Sheet1')

In [None]:
df_results = []
for _, df_row in df_questions.iterrows():
    question = df_row.Question
    context_info = find_most_similar(question, df_process_pdf, embeddings_model)
    context_list, relevant_rank_list = get_context(context_info)
    # Get one answer per retrieved context
    for context, relevant_rank in zip(context_list, relevant_rank_list):
        result = qa_pipeline(question=question, context=context)
        # Copy the row so each result is stored independently of later iterations
        result_row = df_row.copy()
        result_row['generated_answer'] = result['answer']
        result_row['score'] = result['score']
        result_row['relevant_rank'] = relevant_rank
        df_results.append(pd.DataFrame(result_row).T)

In [None]:
df_results = pd.concat(df_results)

## Evaluate the Results Generated From the Model

I generate answers in two ways:

1. Use only the highest-ranked text as the context for the QA model and take its answer.
2. Use each of the five highest-ranked texts as a context and generate one answer per context. Sum the scores of identical answers, then return the answer with the highest total score.
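
The second approach amounts to score-weighted voting across contexts. A minimal sketch, assuming made-up answers and scores for a single question (`pick_answer` is a hypothetical helper, not a function from the notebook):

```python
from collections import defaultdict

def pick_answer(candidates):
    """Sum the scores per distinct answer and return (answer, total) with the highest total."""
    totals = defaultdict(float)
    for answer, score in candidates:
        totals[answer] += score
    return max(totals.items(), key=lambda item: item[1])

# Made-up (answer, score) pairs, ordered by context rank 1..5 for one question
candidates = [("Han", 0.55), ("Qin", 0.40), ("Qin", 0.35), ("Tang", 0.10), ("Qin", 0.05)]

answer, total = pick_answer(candidates)
print(answer, round(total, 2))
```

In this toy input the first approach would return "Han", the answer from the rank-1 context, while the summed scores favor "Qin" because it recurs across several contexts. The pandas cells below implement the same idea with `groupby(...)['score'].sum()`.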

In [None]:
# First method: keep only the answer generated from the highest-ranked context
# (.copy() so that columns can be added later without modifying a slice of df_results)
df_results1 = df_results[df_results['relevant_rank']==1].copy()

df_results1.sort_values('score', ascending=False).head(10)

In [None]:
# Second method: sum the scores of each generated answer across the top 5 contexts, then choose the answer with the highest total score
df_results2 = df_results.groupby(['Question', 'Answer', 'generated_answer'], as_index=False)['score'].sum()
df_results2 = df_results2.sort_values('score', ascending=False).groupby(['Question', 'Answer'], as_index=False).head(1)
df_results2.sort_values('score', ascending=False).head(10)

### Use semantic similarity to evaluate the generated answers against the provided answers

In [None]:
# using sentence transformer to embed the generated answers and the ground truth, so we can calculate the semantic similarity
df_results1['gt_embeddings'] = df_results1['Answer'].apply(lambda x: embeddings_model.encode(x, show_progress_bar = False))
df_results1['predict_embeddings'] = df_results1['generated_answer'].apply(lambda x: embeddings_model.encode(x, show_progress_bar = False))
df_results1['similarity'] = df_results1.apply(generate_similarity_score, axis = 1)

# calculate similarity for results 2
df_results2['gt_embeddings'] = df_results2['Answer'].apply(lambda x: embeddings_model.encode(x, show_progress_bar = False))
df_results2['predict_embeddings'] = df_results2['generated_answer'].apply(lambda x: embeddings_model.encode(x, show_progress_bar = False))
df_results2['similarity'] = df_results2.apply(generate_similarity_score, axis = 1)

In [None]:
df_results1[df_results1['similarity']>=0.8].shape

In [None]:
df_results2[df_results2['similarity']>=0.8].shape

# Future Work

* Improve the accuracy of context extraction. Currently I compare the question with every sentence of the book to find the most similar sentences, then expand each hit into a block of text using a sentence window. Depending on how the question is phrased, this can retrieve the wrong context.

* The question-answering model is used as pre-trained, so its answers may not be optimal. Fine-tuning it would require a suitable dataset, which is not currently available.
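
For reference, the sentence-window expansion mentioned in the first point can be reduced to the sketch below. It assumes a flat list of sentences from a single section and uses a hypothetical helper name, `sentence_window`. The notebook's `get_context` applies the same idea through DataFrame filters, with a default window of 10 sentences on each side.

```python
def sentence_window(sentences, hit_index, window=2):
    """Return the hit sentence plus up to `window` neighbours on each side, joined into one string."""
    start = max(0, hit_index - window)
    end = min(len(sentences), hit_index + window + 1)
    return " ".join(sentences[start:end])

# Placeholder sentences standing in for one section of the book
sentences = ["S0.", "S1.", "S2.", "S3.", "S4.", "S5."]

context = sentence_window(sentences, hit_index=1, window=2)
print(context)
```

Because the window is clipped at the section boundaries, a hit near the start or end of a section yields a shorter context than one in the middle.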