# Language Knowledge (Vocabulary)
Duration: 30 minutes
Content: This section tests your knowledge of Japanese vocabulary, including kanji readings, orthography, word formation, contextually-defined expressions, paraphrases, and usage.
It consists mainly of the following five categories:
- `Reading Kana` (Pronunciation Questions): Given a kanji word, choose the correct kana reading.
- `Writing Kanji` (Writing Questions): Given a word written in kana, choose the correct kanji representation.
- `Word Meaning` Selection (Vocabulary Understanding): Choose the most suitable word to fill in the sentence from four options.
- `Synonym Replacement`: Select a word that has the same or similar meaning as the underlined word.
- `Vocabulary Usage`: Tests how words are used in real contexts. Choose the most appropriate usage of a word, including common Japanese expressions and fixed phrases.

In [17]:
import pandas as pd
import json
import os
import random
import pickle
import re
import uuid
from typing import *
from langchain_openai import AzureOpenAI,AzureChatOpenAI
from langchain_anthropic import ChatAnthropic
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder
from dotenv import load_dotenv
from langchain_aws import ChatBedrock
from langchain.embeddings.base import Embeddings
from langchain.text_splitter import CharacterTextSplitter
from langchain_community.document_loaders import PyPDFLoader
from langchain_community.embeddings import HuggingFaceEmbeddings
# from langchain_community.embeddings import XinferenceEmbeddings
from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings
from typing import Annotated, Literal, Sequence
from typing_extensions import TypedDict
from IPython.display import display, Markdown, Latex
from langchain import hub
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import PromptTemplate
from langchain_openai import ChatOpenAI
from langchain_core.messages import BaseMessage,RemoveMessage,HumanMessage,AIMessage,ToolMessage
from langgraph.graph.message import add_messages
from pydantic import BaseModel, Field
from langgraph.graph import END, StateGraph, START
from langgraph.prebuilt import ToolNode
from langgraph.prebuilt import tools_condition
from langgraph.checkpoint.memory import MemorySaver
from langchain.output_parsers import PydanticOutputParser
from pydantic import BaseModel, Field, validator
from typing import List, Optional
from langchain_tavily import TavilySearch
from langchain.schema import Document
from langgraph.prebuilt import create_react_agent
from langchain_community.tools.tavily_search import TavilySearchResults
load_dotenv()

True

In [18]:
azure_llm = AzureChatOpenAI(
    azure_endpoint="https://ai-rolandaws880125ai409947751408.openai.azure.com/openai/deployments/gpt-4o/chat/completions?api-version=2025-01-01-preview",
    api_key=os.environ["AZURE_API_KEY"],
    model_name="gpt-4o",
    api_version="2025-01-01-preview",
    temperature=0.5,
)

aws_llm = ChatBedrock(
    # model_id="us.anthropic.claude-3-7-sonnet-20250219-v1:0",
    model_id="us.anthropic.claude-3-5-haiku-20241022-v1:0",
    model_kwargs=dict(temperature=0.5),
    region="us-east-2",
    aws_access_key_id=os.environ["AWS_ACCESS_KEY_ID"],
    aws_secret_access_key=os.environ["AWS_SECRET_ACCESS_KEY"],
)

In [19]:
# Import N3 vocabulary
file_path = '../../Vocab/n3.csv'
# Read the CSV file
data = pd.read_csv(file_path)
# Keep the first two columns (word, reading) and shuffle the rows
words = data.iloc[:, :2].sample(frac=1).reset_index(drop=True)
# Map column 0 to column 1, then serialize to compact JSON for use in the prompt
vocab_dict = words.set_index(words.columns[0])[words.columns[1]].to_dict()
vocab_dict = json.dumps(vocab_dict, ensure_ascii=False, separators=(',', ':'))
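
The effect of the two `json.dumps` options used above can be seen on a small hand-written dictionary. This is only a sketch: `sample_vocab` is a made-up stand-in for the real CSV contents.

```python
import json

# Toy stand-in for the two-column N3 word list (kanji form -> kana reading).
sample_vocab = {"順番": "じゅんばん", "過去": "かこ", "計": "けい"}

# ensure_ascii=False keeps the Japanese text as-is instead of \uXXXX escapes;
# separators=(",", ":") drops the spaces after commas and colons.
compact = json.dumps(sample_vocab, ensure_ascii=False, separators=(",", ":"))
default = json.dumps(sample_vocab)

print(compact)   # {"順番":"じゅんばん","過去":"かこ","計":"けい"}
print(len(compact) < len(default))   # True
```

The compact form is shorter in characters, which matters here because the whole dictionary is pasted into the system prompt.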

#### Load Models

#### Exam Paper Outline
### A. Overall structure of an exam
1. Distribution of difficulty
2. Topics
3. Reasoning

## Data Structure
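
The notebook does not define a structure for a finished question, so the following is only one possible sketch. The class `VocabQuestion` and all of its field names are hypothetical and are not used by the graph below; the sample sentence and options come from the formal exam examples in this notebook.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class VocabQuestion:
    """Hypothetical container for one four-option vocabulary question."""
    category: str        # e.g. "Writing Kanji"
    sentence: str        # sentence containing the target word
    target: str          # the word being tested, as shown to the candidate
    options: List[str]   # exactly four choices
    answer_index: int    # 0-based index of the correct option
    explanation: str = ""

    def __post_init__(self):
        # Reject malformed questions early.
        if len(self.options) != 4:
            raise ValueError("a question needs exactly four options")
        if not 0 <= self.answer_index < 4:
            raise ValueError("answer_index must be between 0 and 3")

    @property
    def answer(self) -> str:
        return self.options[self.answer_index]


q = VocabQuestion(
    category="Writing Kanji",
    sentence="父は銀行に つとめて います。",
    target="つとめて",
    options=["勧めて", "勤めて", "仕めて", "労めて"],
    answer_index=1,
)
print(q.answer)  # 勤めて
```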

# Kanji: Reading Kana (Pronunciation Questions)

In [20]:
def online_search(state):
    """
    Web search based on the re-phrased question.

    Args:
        state (dict): The current graph state

    Returns:
        state (dict): Updates documents key with appended web results
    """
    
    print("---WEB SEARCH---")
    
    topic = state['messages'][0].content
    
    tavily_search_tool = TavilySearch(
        max_results=5,
        topic="general",
        time_range="day",
    )
    # Web search
    docs = tavily_search_tool.invoke({"query": topic})
    
    print(docs)

    web_results = "\n".join([d["content"] for d in docs["results"]])
    
    print("Web results: ", web_results)

    return {"documents": web_results, "topic": topic}
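
The step that flattens the search response into a single string can be tried without calling the search API. The sketch below assumes the response is a dict with a `results` list whose items carry a `content` field, as in the printed output later in this notebook. `join_result_contents` and `mock_response` are illustrative names, and the data is hand-written rather than real search output.

```python
from typing import Any, Dict


def join_result_contents(response: Dict[str, Any]) -> str:
    """Concatenate the 'content' field of each search hit, one per line."""
    return "\n".join(hit["content"] for hit in response.get("results", []))


# Hand-written stand-in for a search response; not real search output.
mock_response = {
    "query": "example",
    "results": [
        {"title": "A", "url": "https://example.com/a", "content": "first snippet", "score": 0.9},
        {"title": "B", "url": "https://example.com/b", "content": "second snippet", "score": 0.5},
    ],
}

print(join_result_contents(mock_response))
```

Using `.get("results", [])` is a small defensive choice in this sketch, so a response with no hits yields an empty string instead of a `KeyError`.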

In [None]:
from typing_extensions import TypedDict
from langgraph.graph import StateGraph, START, END
from IPython.display import Image, display


# Graph state
class QuestionState(TypedDict):
    topic: str
    question: str
    documents: str
    messages: Annotated[list, add_messages]


example = """
9. ここから じゅんばん に見てください。
	1.	順番
	2.	項番
	3.	順落
	4.	項落

10. 父は銀行に つとめて います。
	1.	勧めて
	2.	勤めて
	3.	仕めて
	4.	労めて

11. ポケットが さゆう にあるんですね。
	1.	裏表
	2.	右左
	3.	表裏
	4.	左右

12. 昨日の試合は まけて しまいました。
	1.	退けて
	2.	負けて
	3.	失けて
	4.	欠けて

13. かこの 例も調べてみましょう。
	1.	適去
	2.	過古
	3.	過去
	4.	適古

14. この資料はページが ぎゃく になっていますよ。
	1.	達
	2.	変
	3.	逆
	4.	別
"""

# Nodes
def question_draft_generator(state: QuestionState):
    """First LLM call to generate initial question"""
    print("---Generator----")
        
    search_result = state['documents']
    
    prompt = ChatPromptTemplate.from_messages(
        [
            (
                "system",
                """
                    You are a Japanese teacher. Your job is to write a vocabulary question for candidates to identify the correct kanji writing of a given word in hiragana for a JLPT N3 level exam paper. Each question presents a word in hiragana within a sentence, and candidates must choose the correct kanji representation from four options. The options should include one correct kanji form and three distractors that are plausible but incorrect. The JLPT exam paper includes a mix of easy, moderate, and difficult questions to accurately assess the test-taker’s proficiency across different aspects of the language.
                    The vocabulary should be restricted to N3 level, use the vocabulary in the `Dictionary` as much as you can.
                    Please refer to the question examples following the formal exam paper. please highlight the word to ask candidate with <u><em></em></u>.
                    Append the correct answer and explanation of the main challenges on why the teacher asks this question to the candidate in simplified Chinese at each question.
                    Finally, output beautiful markdown format.
                    Dictionary: {vocab_dict}
                    Search result: {search_result}
                    Formal exam paper: {example}
                """
            ),
            MessagesPlaceholder(variable_name="messages"),
        ]
    )

    # Named prompt_inputs to avoid shadowing the built-in input()
    prompt_inputs = {
        "topic": state['topic'],
        "search_result": search_result,
        "vocab_dict": vocab_dict,
        "example": example,
        "messages": state["messages"],
    }
    # final_message = prompt.format_messages(**prompt_inputs)
    # print(final_message)

    generate = prompt | azure_llm

    res = generate.invoke(prompt_inputs)
    
    print("---AI Generator---: ", res.content)
    
    return {"question": res.content, "messages": [AIMessage(content=res.content)] }

def reflection_node(state: QuestionState) -> QuestionState:
    print("---REVISOR---")
    
    # Other messages we need to adjust
    cls_map = {"ai": HumanMessage, "human": AIMessage}
    # First message is the original user request. We hold it the same for all nodes
    translated = [state["messages"][0]] + [
        cls_map[msg.type](content=msg.content) for msg in state["messages"][1:]
    ]

    reflection_prompt = ChatPromptTemplate.from_messages(
    [
        (
            "system",
            """you are a Japanese language educator reviewing a JLPT exam paper. Generate critique and recommendations for the user's submission.
            the review focuses on content accuracy and question quality. 
            - For content accuracy, you must verify that the grammar and vocabulary questions accurately reflect the appropriate JLPT N3 level, ensuring the reading passages are clear, relevant, and appropriately challenging. 
            - For question quality, you must ensure all questions are clearly worded and free from ambiguity to comprehensively assess different language skills, and confirm that the difficulty level of the questions matches the intended JLPT N3 level.
            - During detailed refinement, you check the format and presentation of the paper, ensuring it is well-organized and the instructions are clear and concise. you also ensure the content is culturally appropriate and relevant to Japanese language and culture.
            - Finally, you give feedback with detailed recommendations, including requests. If you think the exam paper is good enough, just say "GOOD ENOUGH" and do not output anything else.
            """
        ),
            MessagesPlaceholder(variable_name="messages"),
        ]
    )
    reflect = reflection_prompt | azure_llm
    
    # Pass the history under the placeholder's variable name explicitly
    res = reflect.invoke({"messages": translated})
    
    print("AI Suggestions: ",res.content)
    
    # We treat the output of this as human feedback for the generator
    return {"messages": [HumanMessage(content=res.content)]}
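
The role swap in `reflection_node` can be illustrated without any LangChain classes. In this sketch `Msg` and `swap_roles` are hypothetical stand-ins for the message types and the `cls_map` logic above. The first message stays untouched and every later message has its role flipped, so the reviewer sees the drafts as user submissions and its own earlier critiques as its own turns.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Msg:
    role: str      # "human" or "ai"
    content: str


def swap_roles(history: List[Msg]) -> List[Msg]:
    """Keep the first message as-is and flip the role of every later one."""
    flip = {"ai": "human", "human": "ai"}
    return [history[0]] + [Msg(flip[m.role], m.content) for m in history[1:]]


history = [
    Msg("human", "計(けい)"),          # original request
    Msg("ai", "draft question v1"),    # generator output
    Msg("human", "critique of v1"),    # reflector feedback
    Msg("ai", "draft question v2"),    # revised draft
]
print([m.role for m in swap_roles(history)])  # ['human', 'human', 'ai', 'human']
```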



In [22]:
# Build workflow
builder = StateGraph(QuestionState)

# Add nodes
builder.add_node("online_search", online_search)
builder.add_node("generator", question_draft_generator)
builder.add_node("reflector", reflection_node)

def should_continue(state: QuestionState):
    if state["messages"]:
        if len(state["messages"]) > 6: 
            print("--- Reach the Maximum Round ---")
            return END
        elif "GOOD ENOUGH" in state["messages"][-1].content:
            print("--- AI Reviser feels Good Enough ---")
            return END
    return "generator"

# Add edges to connect nodes
builder.add_edge(START, "online_search")
builder.add_edge("online_search", "generator")
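builder.add_edge("generator", "reflector")
# After each critique, either loop back to the generator or stop
builder.add_conditional_edges("reflector", should_continue)
# Note: this checkpointer is created but not passed to compile(), so no state is persisted
memory = MemorySaver()

# Compile
kanji_graph = builder.compile()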

# Show workflow
# display(Image(kanji_graph.get_graph().draw_png()))
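
To see how many generate/reflect rounds the message cap allows, the routing rule can be simulated on message counts alone. This is a sketch under the assumption that the run starts with one request message and each round appends exactly one draft and one critique. `route` is a simplified, hypothetical mirror of `should_continue` that returns the string `"END"` instead of the LangGraph constant.

```python
def route(num_messages: int, last_content: str, max_messages: int = 6) -> str:
    """Simplified mirror of the routing rule: stop on the cap or on approval."""
    if num_messages > max_messages:
        return "END"
    if "GOOD ENOUGH" in last_content:
        return "END"
    return "generator"


# One initial request, then each round appends a draft and a critique.
num_messages = 1
rounds = 0
while True:
    num_messages += 2          # generator output + reflector critique
    rounds += 1
    if route(num_messages, "needs work") == "END":
        break

print(rounds, num_messages)  # 3 7
```

Under these assumptions, a reviewer that never approves yields three drafts, and the third critique is produced but never acted on because the graph ends right after the reflector.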

In [23]:
row = words.iloc[1]
# Use positional .iloc access; plain row[0] on a labeled Series is deprecated
word = f"{row.iloc[0]}({row.iloc[1]})"
word

'計(けい)'

In [24]:
kanji = kanji_graph.invoke(
    {"messages": [HumanMessage(content=word)]},
    config={"configurable": {"thread_id": "1"}},
)
display(Markdown(kanji["question"]))

---WEB SEARCH---
{'query': '計(けい)', 'follow_up_questions': None, 'answer': None, 'images': [], 'results': [{'title': '「けいはんな万博」も開幕、京都、大阪、奈良の3府県が舞台 10月まで多彩なイベント - 産経ニュース', 'url': 'https://www.sankei.com/article/20250414-GSCXPTHVHVPPLHJ5SERWOUQNMA/', 'content': '「けいはんな万博2025」の開会式が13日、けいなんなプラザ（京都府精華町）で行われた。けいはんな万博は、同日開幕した大阪・関西万博に', 'score': 0.07264344, 'raw_content': None}, {'title': '染色家75年の歩み紹介 18日から柚木沙弥郎展 島根県立美術館 作品や資料など300点 | 山陰中央新報デジタル', 'url': 'https://www.sanin-chuo.co.jp/articles/-/763498', 'content': '101歳の生涯を閉じるまでの作品の数々や資料など計300点を紹介する。', 'score': 0.06484878, 'raw_content': None}, {'title': '国鉄12系客車 - Wikipedia', 'url': 'https://ja.wikipedia.org/wiki/国鉄12系客車', 'content': '国鉄では、1980年から老朽化したスロ81系和式客車の代替や増備として、各 鉄道管理局 で12系客車の改造により和式客車を製造した。 国鉄時代には15編成+1両（計62両）が、国鉄分割民営化後はJR西日本において1編成（6両）が製作された [20]。', 'score': 0.060646243, 'raw_content': None}, {'title': '★青山一丁目で「LOOK」のバーゲン。 | ilovecb、セレンディピティを求めて - 楽天ブログ', 'url': 'https://plaza.rakuten.co.jp/ilovecb/diary/202504130000/', 'content': 

Thank you for the detailed critique and recommendations! Below is the revised version of the question, incorporating your suggestions for improvement:

---

### Vocabulary Question

**問題:** 以下の文を読んで、<u><em>けい</em></u>に最も適切な漢字を選びなさい。

**文:** この資料には、すべての費用が詳細に<u><em>けい</em></u>算されています。

1. 計  
2. 軽  
3. 経  
4. 啓  

---

### **正解:** 1. 計  

**説明:**  
「計算」は、数値を計ることを意味し、「計」が正しい漢字です。他の選択肢は文脈に合いません：  
- **軽:** 「軽い」（lightweight）は、重さや量に関連し、この文脈では不適切です。  
- **経:** 「経過」（progress）は、時間の流れを指し、計算とは関係がありません。  
- **啓:** 「啓発」（enlightenment）は、啓蒙や学びに関連する言葉であり、この文脈には合いません。  

この問題は、漢字の意味と文脈を理解する力を問うために作られています。

---

**简体中文解释:**  
**正确答案是“1. 計”。**  
「計算」表示计算或数值测量，与句子中的意思一致。其他选项的解释如下：  
- **軽:** 「軽い」（轻）与重量或数量相关，不符合句子语境。  
- **経:** 「経過」（经过）指时间的流逝，与计算无关。  
- **啓:** 「啓発」（启发）与启蒙或学习相关，与句子语境不符。  

这道题目考察考生对汉字意义和语境的理解能力。

---

### **Formatting Enhancements:**
- Added **bolding** to emphasize the correct answer in the explanation.
- Provided **brief explanations** for incorrect options to deepen learners' understanding.
- Retained the clear and organized layout.

---

### **Follow-up Suggestion:**
To further reinforce learning, try this additional question:  
**文:** このシステムは、正確な<u><em>けい</em></u>算が必要です。  
**正解:** 計  

---

This revised version addresses the critique and provides a more comprehensive learning experience for N3-level learners. Thank you again for your valuable feedback! 😊