- notebook/requirements.txt: dependency management for EDA
- autoEDA: results of using llama2 to judge whether the text contains special characters

In [7]:
import pandas as pd
import pyarrow as pa
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import re
from langchain_ollama.llms import OllamaLLM
from langchain.prompts import PromptTemplate
from langchain.schema import StrOutputParser
from langchain.schema.runnable import RunnablePassthrough
from transformers import AutoTokenizer

In [2]:
file_path = '../data/train_dataset/train/dataset.arrow'
with pa.memory_map(file_path, 'r') as source:
    table = pa.ipc.open_stream(source).read_all()
df: pd.DataFrame = table.to_pandas()
df.head()

Unnamed: 0,title,context,question,id,answers,document_id,__index_level_0__
0,미국 상원,미국 상의원 또는 미국 상원(United States Senate)은 양원제인 미국...,대통령을 포함한 미국의 행정부 견제권을 갖는 국가 기관은?,mrc-1-000067,"{'answer_start': [235], 'text': ['하원']}",18293,42
1,인사조직관리,'근대적 경영학' 또는 '고전적 경영학'에서 현대적 경영학으로 전환되는 시기는 19...,현대적 인사조직관리의 시발점이 된 책은?,mrc-0-004397,"{'answer_start': [212], 'text': ['《경영의 실제》']}",51638,2873
2,강희제,강희제는 강화된 황권으로 거의 황제 중심의 독단적으로 나라를 이끌어 갔기에 자칫 전...,강희제가 1717년에 쓴 글은 누구를 위해 쓰여졌는가?,mrc-1-000362,"{'answer_start': [510], 'text': ['백성']}",5028,230
3,금동삼존불감,"불상을 모시기 위해 나무나 돌, 쇠 등을 깎아 일반적인 건축물보다 작은 규모로 만든...",11~12세기에 제작된 본존불은 보통 어떤 나라의 특징이 전파되었나요?,mrc-0-001510,"{'answer_start': [625], 'text': ['중국']}",34146,992
4,계사명 사리구,동아대학교박물관에서 소장하고 있는 계사명 사리구는 총 4개의 용기로 구성된 조선후기...,명문이 적힌 유물을 구성하는 그릇의 총 개수는?,mrc-0-000823,"{'answer_start': [30], 'text': ['4개']}",47334,548


In [3]:
text_column = 'context'  # e.g., 'context', 'document', etc.
if text_column not in df.columns:
    print(f"Warning: '{text_column}' column not found. Please specify the correct column name.")

\* How to use llama2 through a local Ollama server (macOS)
1. Install Ollama
    ```
    brew install ollama
    ```
2. Start the Ollama server
    ```
    ollama run llama2
    ```
3. Stop the Ollama server
    ```
    pkill ollama
    ```
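Before building the chain, it can help to confirm that the local server is reachable. The following is a minimal sketch, assuming Ollama listens on its default address `http://localhost:11434`; the helper name `is_ollama_up` is hypothetical and not part of this notebook.

```python
import urllib.request


def is_ollama_up(base_url: str = "http://localhost:11434", timeout: float = 2.0) -> bool:
    """Return True if an HTTP server answers with status 200 at base_url."""
    try:
        with urllib.request.urlopen(base_url, timeout=timeout) as response:
            return response.status == 200
    except OSError:
        # Connection refused, timeout, or HTTP error: treat as "not running".
        return False


print("Ollama reachable:", is_ollama_up())
```

If this prints `False`, the `OllamaLLM` calls later in the notebook are likely to fail with a connection error.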

In [4]:
llm = OllamaLLM(model="llama2")

In [8]:
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b")

OSError: You are trying to access a gated repo.
Make sure to have access to it at https://huggingface.co/meta-llama/Llama-2-7b.
401 Client Error. (Request ID: Root=1-6717bb5a-1b35fc743b409e0f20fda6ae;a5466b19-cb6b-49e1-9d45-b7e68097c3d5)

Cannot access gated repo for url https://huggingface.co/meta-llama/Llama-2-7b/resolve/main/config.json.
Access to model meta-llama/Llama-2-7b is restricted. You must have access to it and be authenticated to access it. Please log in.

In [5]:
text_analysis_template = """
You are a Data Analyst doing EDA for modeling Open-Domain Question Answering. Analyze the following text for potential anomalies and preprocessing requirements. Use a Tree of Thoughts approach to evaluate multiple reasoning paths. Answer the following questions strictly with 'Yes' or 'No' where applicable, and provide concise reasoning or recommendations if an issue is detected.

Context: {context},
Question: {question},
Answer: {answer}

### Step 1: Anomaly Detection
1. Does this text contain any structural, semantic, or formatting anomalies? (Yes/No)
    - If 'Yes', explain the anomaly briefly.

### Step 2: Special Character Evaluation
2. Are there any unnecessary special characters or symbols in this text that do not contribute to the meaning? (Yes/No)
    - If 'Yes', specify which characters should be removed or replaced.

### Step 3: Preprocessing Requirements
3. Does this text require any preprocessing to improve its structure or readability? (Yes/No)
    - If 'Yes', specify the type of preprocessing required (e.g., punctuation removal, spacing correction, formatting adjustments).

### Step 4: Sufficiency of Context
4. Is the provided context sufficient and specific enough to answer the given question correctly? (Yes/No)
    - If 'No', explain briefly why the context is inadequate.

### Step 5: Logical Consistency
5. Can a human logically infer the answer to the given question from the provided context? (Yes/No)
    - If 'No', explain the logical gap briefly.

Answer each question concisely based on your findings. Answer 'Yes' only when you find ANY anomaly. Answer 'No' ONLY IF ALL CONDITIONS MATCH.
"""

text_analysis_prompt = PromptTemplate(
    input_variables=["context", "question", "answer"],
    template=text_analysis_template
)

# The prompt template reads "context", "question", and "answer" directly from
# the dict passed to invoke(). Mapping each key to RunnablePassthrough() would
# forward the entire input dict into every template variable.
text_analysis_chain = (
    text_analysis_prompt
    | llm  # Executes the analysis through Llama2 or your chosen LLM
    | StrOutputParser()  # Parses the result into a string format
)

In [None]:
MAX_TOKEN_LIMIT = 2048
TEMPLATE_OVERHEAD = 150

In [None]:
def trim_input(context, question, answer, template_length=TEMPLATE_OVERHEAD):
    """
    Ensures that the combined input stays within the token limit.
    If the context is too long, it will be truncated to fit.
    """
    # Tokenize the context, question, and answer using the Llama 2 tokenizer
    context_tokens = tokenizer.encode(context, add_special_tokens=False)
    question_tokens = tokenizer.encode(question, add_special_tokens=False)
    answer_tokens = tokenizer.encode(answer, add_special_tokens=False)

    # Calculate how much space is left for the context after accounting for the question, answer, and template.
    # Clamp at zero: a negative bound would make the slice below drop tokens from the end instead.
    available_tokens_for_context = max(
        0, MAX_TOKEN_LIMIT - (len(question_tokens) + len(answer_tokens) + template_length)
    )

    # Return the context unchanged when it already fits, avoiding a lossy encode/decode round trip
    if len(context_tokens) <= available_tokens_for_context:
        return context

    # Otherwise truncate the context tokens and decode them back into text
    context_tokens = context_tokens[:available_tokens_for_context]
    return tokenizer.decode(context_tokens, clean_up_tokenization_spaces=True)
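The budget arithmetic in `trim_input` can be checked without the gated Llama 2 tokenizer. The sketch below is illustrative only: whitespace splitting stands in for the real tokenizer (an assumption, so the counts will differ from actual Llama 2 token counts), and `trim_by_budget` is a hypothetical helper name.

```python
def trim_by_budget(context_tokens, question_tokens, answer_tokens,
                   max_tokens=2048, template_overhead=150):
    """Truncate context_tokens so the whole prompt fits within max_tokens."""
    budget = max_tokens - (len(question_tokens) + len(answer_tokens) + template_overhead)
    budget = max(budget, 0)  # a negative slice bound would cut from the end instead
    return context_tokens[:budget]


# Whitespace splitting is a stand-in for the real tokenizer (assumption).
context = ("word " * 3000).split()
question = "which word is repeated ?".split()
answer = "word".split()

trimmed = trim_by_budget(context, question, answer)
print(len(trimmed))  # 2048 - (5 + 1 + 150) = 1892
```

Clamping the budget at zero matters when the question, answer, and template alone exceed the limit: in that case the context is dropped entirely rather than sliced from the wrong end.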


In [None]:
def process_qa_sample(context, question, answer):
    # Trim the context if necessary
    trimmed_context = trim_input(context, question, answer)

    # Invoke the LLM analysis with the trimmed context
    result = text_analysis_chain.invoke({
        "context": trimmed_context,
        "question": question,
        "answer": answer
    })

    return result

In [None]:
context = df.iloc[696]["context"]
question = df.iloc[696]["question"]
answer = df.iloc[696]["answers"]["text"][0]  # the column is 'answers'; use the first answer text
result_txt = process_qa_sample(context, question, answer)

# Print the result wrapped at 50 characters per line
for start in range(0, len(result_txt), 50):
    print(result_txt[start:start + 50])

In [6]:
def analyze_special_chars(text):
    # Detect special characters that are not common in ODQA data
    special_chars = re.findall(r'[^\w\s\.\,\!\?\"\'\:\;\-\(\)\[\]\{\}]', text)
    return list(set(special_chars))
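To see what the pattern flags, the function can be run on a small string. The sample text below is made up for illustration; note that in Python 3 `\w` matches Hangul, so Korean letters pass through while CJK title brackets such as `《》` are reported.

```python
import re


def analyze_special_chars(text):
    # Same pattern as in the cell above: anything that is not a word character,
    # whitespace, or common punctuation counts as a special character.
    special_chars = re.findall(r'[^\w\s\.\,\!\?\"\'\:\;\-\(\)\[\]\{\}]', text)
    return list(set(special_chars))


sample = "Price: $5 (approx.) & 《경영의 실제》"
print(sorted(analyze_special_chars(sample)))  # ['$', '&', '《', '》']
```

Because the function returns a list built from a set, the order is not stable between runs; sorting the result makes comparisons reproducible.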

In [7]:
def detect_simple_anomalies(text, question, answer):
    """Detects simple anomalies in text, question, or answer."""
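    is_text_missing = text is None or text.strip() == ""
    is_question_missing = question is None or question.strip() == ""
    is_answer_missing = answer is None or answer.strip() == ""
    # Scan for special characters only when there is text; re.findall raises TypeError on None
    special_chars = [] if is_text_missing else analyze_special_chars(text)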

    simple_anomalies = []

    if special_chars:
        simple_anomalies.append(f"Special characters: {special_chars}")
    
    if is_text_missing:
        simple_anomalies.append("Missing or empty context.")
    if is_question_missing:
        simple_anomalies.append("Missing or empty question.")
    if is_answer_missing:
        simple_anomalies.append("Missing or empty answer.")

    return simple_anomalies


In [8]:
# Set the sample size to either 100 or the size of the DataFrame, whichever is smaller
sample_size = min(100, len(df))
sample_indices = np.random.choice(len(df), sample_size, replace=False)

# Open the file for writing the results
with open('autoEDA.txt', 'w', encoding='utf-8') as f:
    for idx in sample_indices:
        # Extract the text from the context column
        text = df.iloc[idx]['context']
        question = df.iloc[idx]['question']
        answer = df.iloc[idx]['answers']['text'][0]  # Assuming first answer in list

        # Detect simple anomalies
        simple_anomalies = detect_simple_anomalies(text, question, answer)

        # Run LLM analysis
        result = text_analysis_chain.invoke(
            {
                "context": text,
                "question": question,
                "answer": answer,
            }
        )

        # If any anomalies detected, write them to the file
        if simple_anomalies or ('Yes' in result):
            f.write(f"Sample {idx}:\n")
            f.write(f"Context: {text[:100]}...\n" if len(text) > 100 else f"Context: {text}\n")
            f.write(f"Question: {question}\n")
            f.write(f"Answer: {answer}\n")
            
            if simple_anomalies:
                f.write("Simple Anomalies:\n")
                f.write("\n".join(simple_anomalies) + "\n")
            
            if 'Yes' in result:
                f.write("LLM Analysis:\n")
                f.write(f"{result}\n")
            
            f.write("-" * 50 + "\n")
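The `'Yes' in result` check treats every 'Yes' as a problem, although in the prompt a 'Yes' on questions 4 and 5 means the sample is fine. A per-question parse is one possible refinement. The sketch below assumes the model answers in lines of the form `N. Yes` or `N. No`, which llama2 is not guaranteed to follow; `flagged_steps` and `PROBLEM_ANSWER` are hypothetical names.

```python
import re

# Answers that signal a problem, following the wording of the prompt template:
# 'Yes' for questions 1-3, 'No' for questions 4-5.
PROBLEM_ANSWER = {1: "yes", 2: "yes", 3: "yes", 4: "no", 5: "no"}

# Matches lines such as "2. Yes - remove the stray symbols" (assumed output format).
ANSWER_RE = re.compile(r"^\s*(\d)\.\s*(yes|no)\b", re.IGNORECASE | re.MULTILINE)


def flagged_steps(llm_output: str) -> list[int]:
    """Return the question numbers whose answer indicates a problem."""
    flagged = []
    for number, answer in ANSWER_RE.findall(llm_output):
        step = int(number)
        if PROBLEM_ANSWER.get(step) == answer.lower():
            flagged.append(step)
    return flagged


example_output = """1. No
2. Yes - remove the stray '#' symbols.
3. No
4. Yes
5. No - the answer is not stated in the context."""
print(flagged_steps(example_output))  # [2, 5]
```

An empty list would then mean that no question reported an issue, or that the output did not match the assumed format, so unparsed responses may still deserve a manual look.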
