# Language Knowledge (Vocabulary)
Duration: 30 minutes
Content: This section tests your knowledge of Japanese vocabulary, including kanji readings, orthography, word formation, contextually defined expressions, paraphrases, and usage.
It mainly comprises the following five categories:
- `Reading Kana` (Pronunciation Questions): Given a kanji word, choose the correct kana reading.
- `Writing Kanji` (Writing Questions): Given a word written in kana, choose the correct kanji representation.
- `Word Meaning` Selection (Vocabulary Understanding): Choose the most suitable word to fill in the sentence from four options.
- `Synonym Replacement`: Select a word that has the same or similar meaning as the underlined word.
- `Vocabulary Usage`: Choose the sentence in which the given word is used most appropriately in context; this includes common Japanese expressions and fixed phrases.

In [1]:
import pandas as pd
import json
import os
import random
import pickle
import re
import uuid
from typing import *
from langchain_openai import AzureOpenAI,AzureChatOpenAI
from langchain_anthropic import ChatAnthropic
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder
from dotenv import load_dotenv
from langchain_aws import ChatBedrock
from langchain.embeddings.base import Embeddings
from langchain.text_splitter import CharacterTextSplitter
from langchain_community.document_loaders import PyPDFLoader
from langchain_community.embeddings import HuggingFaceEmbeddings
# from langchain_community.embeddings import XinferenceEmbeddings
from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings
from typing import Annotated, Literal, Sequence
from typing_extensions import TypedDict
from IPython.display import display, Markdown, Latex
from langchain import hub
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import PromptTemplate
from langchain_openai import ChatOpenAI
from langchain_core.messages import BaseMessage,RemoveMessage,HumanMessage,AIMessage,ToolMessage
from langgraph.graph.message import add_messages
from pydantic import BaseModel, Field
from langgraph.graph import END, StateGraph, START
from langgraph.prebuilt import ToolNode
from langgraph.prebuilt import tools_condition
from langgraph.checkpoint.memory import MemorySaver
from langchain.output_parsers import PydanticOutputParser
from pydantic import BaseModel, Field, validator
from typing import List, Optional
from langchain_tavily import TavilySearch
from langchain.schema import Document
from langgraph.prebuilt import create_react_agent
from langchain_community.tools.tavily_search import TavilySearchResults
load_dotenv()

True

In [2]:
azure_llm = AzureChatOpenAI(
    azure_endpoint="https://ai-rolandaws880125ai409947751408.openai.azure.com/openai/deployments/gpt-4o/chat/completions?api-version=2025-01-01-preview",
    api_key=os.environ["AZURE_API_KEY"],
    model_name="gpt-4o",
    api_version="2025-01-01-preview",
    temperature=0.5,
)

aws_llm = ChatBedrock(
    # model_id="us.anthropic.claude-3-7-sonnet-20250219-v1:0",
    model_id="us.anthropic.claude-3-5-haiku-20241022-v1:0",
    model_kwargs=dict(temperature=0.5),
    region="us-east-2",
    aws_access_key_id=os.environ["AWS_ACCESS_KEY_ID"],
    aws_secret_access_key=os.environ["AWS_SECRET_ACCESS_KEY"],
)

In [3]:
# Import N3 Vocabulary
file_path = '../Vocab/n3.csv'
# Read the CSV file
data = pd.read_csv(file_path)
words = data.iloc[:, :2].sample(frac=1).reset_index(drop=True)
# Display the content of the CSV file
words.head()
vocab_dict = words.set_index(words.columns[0])[words.columns[1]].to_dict()
vocab_dict = json.dumps(vocab_dict, ensure_ascii=False, separators=(',', ':'))
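
The last two lines of the cell above turn the word list into a compact JSON string that is later pasted into the prompt. A minimal sketch of that serialization step, using a hypothetical three-word sample in place of `n3.csv` (the sample entries are assumptions, not rows copied from the file):

```python
import json

# Hypothetical sample standing in for the first two columns of n3.csv
# (word -> kana reading); the real file is not reproduced here.
sample_vocab = {"高まる": "たかまる", "石油": "せきゆ", "加熱": "かねつ"}

# ensure_ascii=False keeps kanji and kana as-is instead of \uXXXX escapes;
# separators=(",", ":") drops the default spaces after commas and colons.
compact = json.dumps(sample_vocab, ensure_ascii=False, separators=(",", ":"))

# Default settings, shown only for comparison.
escaped = json.dumps(sample_vocab)

print(compact)
print(len(compact), "<", len(escaped))
```

Both options shorten the string that ends up in the prompt, which matters here because the whole dictionary is sent with every generator call.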

#### Load Models

#### Exam Paper Outline
### A. Overall thinking on the structure of an exam
1. Distribution of difficulty
2. Topics
3. Reasoning

## Data Structure

# Kanji: Reading Kana (Pronunciation Questions)

In [4]:
def online_search(state):
    """
    Web search for the word carried in the first message of the state.

    Args:
        state (dict): The current graph state

    Returns:
        dict: `documents` (the concatenated content of the web results) and `topic` (the search query)
    """
    
    print("---WEB SEARCH---")
    
    topic = state['messages'][0].content
    
    tavily_search_tool = TavilySearch(
        max_results=5,
        topic="news",
        days=1
    )
    # Web search
    docs = tavily_search_tool.invoke({"query": topic})
    
    print(docs)

    web_results = "\n".join([d["content"] for d in docs["results"]])
    
    print("Web results: ", web_results)

    return {"documents": web_results, "topic": topic}
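
The `web_results` line in `online_search` flattens the search response into one newline-separated string. A minimal sketch of that step on a hand-written response; the sample data is hypothetical and abbreviated, and the tolerance for a missing `results` key or empty `content` is an addition that the notebook's own code does not have:

```python
from typing import Any, Dict


def join_result_contents(response: Dict[str, Any]) -> str:
    """Join the 'content' field of every search result, one result per line."""
    results = response.get("results") or []
    # Skip results whose content is missing or empty (an assumption of this sketch).
    return "\n".join(r["content"] for r in results if r.get("content"))


# Hypothetical response in the same shape as the dict printed by the notebook.
sample_response = {
    "query": "高まる(たかまる)",
    "results": [
        {"title": "dictionary A", "url": "https://example.com/a",
         "content": "物事の程度が増してくる。", "score": 0.89},
        {"title": "dictionary B", "url": "https://example.com/b",
         "content": "to rise, to swell", "score": 0.77},
        {"title": "empty hit", "url": "https://example.com/c",
         "content": "", "score": 0.10},
    ],
}

web_results = join_result_contents(sample_response)
print(web_results)
```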

In [None]:
from typing_extensions import TypedDict
from langgraph.graph import StateGraph, START, END
from IPython.display import Image, display


# Graph state
class QuestionState(TypedDict):
    topic: str
    question: str
    documents: str
    messages: Annotated[list, add_messages]


example = """
問題１  
＿＿＿のことばの読みかたとして最もよいものを、1・2・3・4から一つえらびなさい。

1. 山田さんがちらしを **配った**。  
　1 ひろった　2 くばった　3 やぶった　4 はった

2. 私の国は **石油** を輸入しています。  
　1 いしゅ　2 せきそう　3 せきゆ　4 いしう

3. 卒業式には生徒の **父母** たちもたくさん来ていた。  
　1 ふば　2 ふぼ　3 ふうぼ　4 ちちば

4. この町の **主要** な産業は何ですか。  
　1 じゅちょう　2 しゅおう　3 じゅよう　4 しゅよう

5. これは **加熱** して食べてください。  
　1 かわつ　2 かねつ　3 かいねつ　4 かいあつ

6. 川はあの辺りで **深く** なっている。  
　1 ふかく　2 あさく　3 ひろく　4 せまく

7. 失礼なことを言われたので、つい **感情的** になってしまった。  
　1 がんじょうてき　2 かんじょうてき　3 かんしょうてき　4 がんしょうてき

8. これは **残さない** でください。  
　1 なさないで　2 よこさないで　3 こぼさないで　4 のこさないで
"""

# Nodes
def question_draft_generator(state: QuestionState):
    """First LLM call to generate initial question"""
    print("---Generator----")
        
    # No trailing comma here: a trailing comma would wrap the string in a 1-tuple
    search_result = state["documents"]
    
    prompt = ChatPromptTemplate.from_messages(
        [
            (
                "system",
                """
                    You are a Japanese teacher. Your job is to write vocabulary questions that present a word written in kanji and ask candidates to choose its correct kana reading. Each question must have 4 options.
                    The vocabulary should be restricted to N3 level; use the vocabulary in the `Dictionary` as much as you can.
                    Use the question examples in the formal exam paper below as a reference for the format only, not for the content.
                    After each question, append the correct answer and an explanation, in simplified Chinese, of the main challenge the question poses and why a teacher would ask it.
                    Finally, output well-formatted markdown.
                    Dictionary: {vocab_dict}
                    Search result: {search_result}
                    Formal exam paper: {example}
                """
            ),
            MessagesPlaceholder(variable_name="messages"),
        ]
    )

    
    # Named `inputs` to avoid shadowing the built-in `input`
    inputs = {
        "topic": state["topic"],
        "search_result": search_result,
        "vocab_dict": vocab_dict,
        "example": example,
        "messages": state["messages"],
    }
    # final_message = prompt.format_messages(**inputs)
    # print(final_message)

    generate = prompt | azure_llm

    msg = generate.invoke(inputs)
    
    
    return {"question": msg.content, "messages": [AIMessage(content=msg.content)] }


def reflection_node(state: QuestionState) -> QuestionState:
    print("---REVISOR---")
    
    # Swap AI and human roles so the reviewer sees the generator's drafts as user submissions
    cls_map = {"ai": HumanMessage, "human": AIMessage}
    # The first message is the original user request; keep it unchanged for all nodes
    translated = [state["messages"][0]] + [
        cls_map[msg.type](content=msg.content) for msg in state["messages"][1:]
    ]

    reflection_prompt = ChatPromptTemplate.from_messages(
    [
        (
            "system",
            """You are a Japanese language educator reviewing a JLPT exam paper. Generate a critique and recommendations for the user's submission.
            The review focuses on content accuracy and question quality.
            - For content accuracy, verify that the grammar and vocabulary questions accurately reflect the JLPT N3 level, and that any reading passages are clear, relevant, and appropriately challenging.
            - For question quality, ensure all questions are clearly worded and free from ambiguity, that they comprehensively assess different language skills, and that their difficulty matches the intended JLPT N3 level.
            - During detailed refinement, check the format and presentation of the paper, ensuring it is well organized and the instructions are clear and concise. Also ensure the content is culturally appropriate and relevant to Japanese language and culture.
            - Finally, give feedback with detailed recommendations, including requests. If you think the exam paper is good enough, just say "GOOD ENOUGH".
            """
        ),
            MessagesPlaceholder(variable_name="messages"),
        ]
    )
    reflect = reflection_prompt | azure_llm
    
    res = reflect.invoke({"messages": translated})
    
    print(res.content)
    
    # We treat the output of this as human feedback for the generator
    return {"messages": [HumanMessage(content=res.content)]}
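
The role swap in `reflection_node` can be hard to picture from the dict lookup alone. A minimal sketch of the same idea using plain `(role, content)` tuples instead of LangChain message objects; `swap_roles` and the sample history are hypothetical names and data used only for illustration, and the history is assumed to be non-empty:

```python
from typing import List, Tuple

Message = Tuple[str, str]  # (role, content), where role is "human" or "ai"

SWAP = {"ai": "human", "human": "ai"}


def swap_roles(messages: List[Message]) -> List[Message]:
    """Keep the first message as-is and flip the role of every later one."""
    first, rest = messages[0], messages[1:]
    return [first] + [(SWAP[role], content) for role, content in rest]


history = [
    ("human", "高まる(たかまる)"),    # original request, left untouched
    ("ai", "draft question v1"),      # written by the generator
    ("human", "critique of v1"),      # written by the reviewer in an earlier round
    ("ai", "draft question v2"),
]

for role, content in swap_roles(history):
    print(role, "|", content)
```

After the swap, the reviewer model reads the generator's drafts as user submissions and its own earlier critiques as its previous replies, which matches the conversational framing its system prompt expects.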



In [6]:
# Build workflow
builder = StateGraph(QuestionState)

# Add nodes
builder.add_node("online_search", online_search)
builder.add_node("generator", question_draft_generator)
builder.add_node("reflector", reflection_node)

def should_continue(state: QuestionState):
    if state["messages"]:
        if len(state["messages"]) > 6: 
            print("--- Reach the Maximum Round ---")
            return END
        elif "GOOD ENOUGH" in state["messages"][-1].content:
            print("--- AI Reviser feels Good Enough ---")
            return END
    return "generator"

# Add edges to connect nodes
builder.add_edge(START, "online_search")
builder.add_edge("online_search", "generator")
builder.add_edge("generator", "reflector")
# After each critique, either loop back to the generator or stop
builder.add_conditional_edges("reflector", should_continue)
memory = MemorySaver()

# Compile
kanji_graph = builder.compile()

# Show workflow
# display(Image(kanji_graph.get_graph().draw_png()))
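
To see how many generate/review rounds the stopping rule in `should_continue` allows, the routing logic can be simulated without LangGraph. A minimal sketch under stated assumptions: messages are plain strings, `END` is a local stand-in for LangGraph's constant, and `route` and `count_rounds` are hypothetical helper names:

```python
from typing import List

END = "__end__"  # local stand-in for langgraph.graph.END in this sketch


def route(contents: List[str], max_messages: int = 6) -> str:
    """Mirror of should_continue, operating on message contents only."""
    if contents:
        if len(contents) > max_messages:
            return END
        if "GOOD ENOUGH" in contents[-1]:
            return END
    return "generator"


def count_rounds(critiques: List[str]) -> int:
    """Run generator -> reflector rounds over scripted critiques; return rounds used."""
    contents = ["seed word"]  # the initial HumanMessage
    rounds = 0
    for critique in critiques:
        contents.append(f"draft {rounds + 1}")  # generator appends one message
        contents.append(critique)               # reflector appends one message
        rounds += 1
        if route(contents) == END:
            break
    return rounds


print(count_rounds(["needs work"] * 10))            # stops on the message cap
print(count_rounds(["needs work", "GOOD ENOUGH"]))  # stops on approval
```

Because the check is a substring match, a critique containing "not GOOD ENOUGH" would also end the loop; a stricter comparison may be worth considering.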

In [7]:
row = words.iloc[1]
word = f"{row.iloc[0]}({row.iloc[1]})"  # positional access via .iloc avoids the Series FutureWarning
word


'高まる(たかまる)'

In [8]:
# # Debug the Conversation
# for event in kanji_graph.stream(
#     {
#         "messages": [
#             HumanMessage(
#                 content=word
#             )
#         ],
#     },
#     config={"configurable": {"thread_id": "1"}},
# ):
#     print(event)
#     print("---")

In [9]:
kanji = kanji_graph.invoke(
    {
       "messages": [
                HumanMessage(
                    content=word
                )
            ],
        },
    config={"configurable": {"thread_id": "1"}}
)
display(Markdown(kanji["question"]))

---WEB SEARCH---
{'query': '高まる(たかまる)', 'follow_up_questions': None, 'answer': None, 'images': [], 'results': [{'title': '高まる（たかまる）とは？ 意味・読み方・使い方をわかりやすく解説 - goo国語辞書', 'url': 'https://dictionary.goo.ne.jp/word/高まる/', 'content': '高まる（たかまる）とは。意味や使い方、類語をわかりやすく解説。[動ラ五（四）]物事の程度が増してくる、高くなる、大きくなる、また、強くなる。「人気が—・る」「波音が—・る」「関心が—・る」 - goo国語辞書は30万9千件語以上を収録。政治・経済・医学・ITなど、最新用語の追加も', 'score': 0.8921218, 'raw_content': None}, {'title': '高まる(タカマル)とは？ 意味や使い方 - コトバンク', 'url': 'https://kotobank.jp/word/高まる-559454', 'content': 'デジタル大辞泉 - 高まるの用語解説 - [動ラ五（四）]物事の程度が増してくる、高くなる、大きくなる、また、強くなる。 ... [初出の実例]「我が位置たかまるに付けて湧き来る企望のさまざま」(出典：うもれ木（1892）〈樋口一葉〉四)', 'score': 0.8156003, 'raw_content': None}, {'title': 'Definition of 高まる - JapanDict: Japanese Dictionary', 'url': 'https://www.japandict.com/高まる', 'content': 'Definition of 高まる. Click for more info and examples: たかまる - takamaru - to rise, to swell, to be promoted', 'score': 0.77487355, 'raw_content': None}, {'title': '「高まる(たかまる)」の意味や使い方 わかりやすく解説 Weblio辞書', 'url': '

### 問題  
**高まる** の読み方として最もよいものを、1・2・3・4から一つ選びなさい。  

1. たかまる  
2. たかみる  
3. たかめる  
4. たかもる  

---

### 答え  
**1. たかまる**  

---

### 解説  
**簡体字**  
"高まる" 的正确读法是 "たかまる"。这个词的意思是某事的程度增加、变高、变大或变强，例如“人気が高まる”（人气上升）。教师选择这个词的原因是它是一个常见的动词，考察学生对日语动词的理解。错误选项如“たかめる”是另一个相关词，表示“使某事变高或增加”，容易混淆，因此考察了学生的辨识能力。  


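The generated question above has a fixed shape: a stem, four options, the number of the correct option, and an explanation. A minimal sketch of a record type that could hold and sanity-check that shape; `ReadingQuestion` and its field names are hypothetical and do not appear in the notebook:

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class ReadingQuestion:
    """Hypothetical record for one four-option kanji-reading question."""

    stem: str
    options: Tuple[str, ...]
    answer: int  # 1-based position of the correct option
    explanation: str = ""

    def __post_init__(self) -> None:
        # Reject malformed questions early instead of rendering them.
        if len(self.options) != 4 or len(set(self.options)) != 4:
            raise ValueError("a question needs exactly 4 distinct options")
        if not 1 <= self.answer <= 4:
            raise ValueError("answer must be between 1 and 4")

    @property
    def correct_option(self) -> str:
        return self.options[self.answer - 1]


# Values copied from the generated output shown above.
question = ReadingQuestion(
    stem="**高まる** の読み方として最もよいものを、1・2・3・4から一つ選びなさい。",
    options=("たかまる", "たかみる", "たかめる", "たかもる"),
    answer=1,
)
print(question.correct_option)
```

A structure like this would let the reviewer step check mechanical properties (option count, duplicate options, answer range) in code and leave only the linguistic judgment to the model.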