In [19]:
import pandas as pd
import os

In [1]:
from ccai9012 import llm_utils

## Initialize and test the LLM

In [3]:
# Initialize LLM
api_key = llm_utils.get_deepseek_api_key()
llm = llm_utils.initialize_llm()

Enter your DEEPSEEK_API_KEY:  ········


In [5]:
# Test connection
test = ["Is 9.9 or 9.11 bigger?"]
llm_utils.ask_llm(test)


📌 Prompt:
['Is 9.9 or 9.11 bigger?']

Let's compare the two numbers: **9.9** and **9.11**.

---

**Step 1: Compare the whole number part**  
Both numbers have the same whole number part: **9**.

---

**Step 2: Compare the decimal part**  
- **9.9** means \( 9 + \frac{9}{10} \) = 9.900...  
- **9.11** means \( 9 + \frac{11}{100} \) = 9.11  

---

**Step 3: Align decimal places for clarity**  
Write both with the same number of decimal places:  
- 9.90  
- 9.11  

Now compare digit by digit after the decimal:  
- Tenths place: 9 (first number) vs 1 (second number)  
- Since 9 > 1, **9.9 is larger**.

---

\[
\boxed{9.9}
\]



## Parse and embed the document

In [5]:
retriever = llm_utils.build_pdf_retriever(
    pdf_path="data/1-s2.0-S1353829223001867-main.pdf",
    embedding_model_name="BAAI/bge-base-en-v1.5",
    chunk_size=800,
    chunk_overlap=200,
    top_k=8,
    exclude_last_n_pages=2  # exclude the last few reference pages
)

  embedding_model = HuggingFaceEmbeddings(model_name=embedding_model_name)
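
The `chunk_size=800` and `chunk_overlap=200` arguments control how the PDF text is cut into pieces before embedding. The following is a minimal sketch of the idea, assuming sizes are counted in characters and using a plain sliding window. The function name `split_with_overlap` is hypothetical and is not part of `llm_utils`; the helper may well measure sizes differently or split on separators such as paragraphs, as LangChain's `RecursiveCharacterTextSplitter` does.

```python
def split_with_overlap(text: str, chunk_size: int = 800, chunk_overlap: int = 200) -> list[str]:
    """Split text into fixed-size character windows that share an overlap.

    Illustrative only: sizes are assumed to be in characters.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")

    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


# Toy input: 2000 characters of repeating digits.
sample = "".join(str(i % 10) for i in range(2000))
chunks = split_with_overlap(sample)
print(len(chunks), [len(c) for c in chunks])
```

Each window repeats the last 200 characters of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk.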


In [7]:
# Find related chunks from the paper
results = retriever.get_relevant_documents("public space")
for r in results:
    print(r.page_content)
    print("page:", r.metadata.get("page"))
    print("source:", r.metadata.get("source"))
    print("----")

  results = retriever.get_relevant_documents("public space")


et al., 2011 ; Tonkin and Whitaker, 2019 ; Quinn and Russo, 2022 ). 
Against this background, there is an urgent need to investigate ways that 
limited public space resources in cities can be (re-)designed to facilitate 
more physical and playful activities to make cities more liveable and 
sustainable ( Edwards et al., 2015 ; Kaczynski et al., 2014 ; Slater et al., 
2016 ). 
Currently, studies that examine the relationship between public open 
space and physical activity mostly focus on the availability of or 
accessibility to public open space ( Koohsari et al., 2015 ; Lackey and 
Kaczynski, 2009 ). There are also studies that examine whether the size, 
density, or the presence of some features in public open space are
page: 0
source: data/1-s2.0-S1353829223001867-main.pdf
----
the world. It is especially related to the growing trend of Tactical Ur -
banism, which aims to create temporary changes using low-cost and 
moveable features in under-utilized public spaces that leads to long
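
Retrieval with `top_k=8` returns the eight chunks whose embeddings are closest to the query embedding. The sketch below illustrates that ranking step with cosine similarity on made-up three-dimensional vectors. The chunk names, the vectors, and the choice of cosine similarity are all assumptions for illustration; the vector store and metric that `build_pdf_retriever` actually uses are not shown in this notebook.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k_chunks(query_vec: list[float], chunk_vecs: dict[str, list[float]], k: int) -> list[str]:
    """Return the names of the k chunks most similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), name) for name, vec in chunk_vecs.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [name for _, name in scored[:k]]


# Hypothetical embeddings; real ones from BAAI/bge-base-en-v1.5 are much longer.
toy_vectors = {
    "chunk_about_public_space": [0.9, 0.1, 0.0],
    "chunk_about_regression": [0.1, 0.9, 0.2],
    "chunk_about_limitations": [0.2, 0.3, 0.9],
}
query = [1.0, 0.2, 0.0]
top_two = top_k_chunks(query, toy_vectors, k=2)
print(top_two)
```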

## Summarize the paper

In [9]:
summary = llm_utils.run_qa_chain(
    query="Please summarize the paper in a brief paragraph.",
    retriever=retriever,
    llm=llm,
    return_sources=True,
    save_path="output/summary.txt"
)


--- Final Answer ---
Based on the provided context, I cannot provide a complete summary of the paper.

The excerpts contain specific data from a results section (including statistical values and odds ratios) and parts of the discussion and limitations sections. The paper appears to be a study analyzing physical activity and play behavior (using trajectory analysis) in a public open space, specifically a waterfront area.

However, the context does not include the paper's abstract, introduction, or a clear statement of its main objectives and conclusions, which are necessary for a proper summary. The authors acknowledge limitations such as the study being based on a single day's video data and not capturing less active forms of play.

-------------------- Document 1 --------------------
*** 
  1.956 0.141 (0.100 ‚Äì 0.201) 
*** 
Interaction with heart-shaped 
seating 
  0.116 0.891 (0.827 ‚Äì 0.96)** 0.020 1.02 (0.959 ‚Äì 1.085)   4.720 0.009 (0.004 ‚Äì 0.019) 
*** 
  0.680 0.507 (0.447

## Ask a specific question

In [11]:
answer = llm_utils.run_qa_chain(
    query="What model is applied in the paper?",
    retriever=retriever,
    llm=llm,
    return_sources=True,
    save_path="output/question.txt"
)


--- Final Answer ---
Based on the provided context, the paper applies a **multinomial logistic regression** model. 

The purpose of this model is to determine whether different movement trajectories are associated with the usage of different features in space.

Specifically:
*   **Dependent variable:** The type of trajectory (e.g., passive, playful, etc.).
*   **Independent variables:** The total duration a person spends within 0.5 meters of different types of spatial features.
*   **Reference group:** People with passive trajectories are used as the baseline for comparison.

-------------------- Document 1 --------------------
*** 
  1.956 0.141 (0.100 ‚Äì 0.201) 
*** 
Interaction with heart-shaped 
seating 
  0.116 0.891 (0.827 ‚Äì 0.96)** 0.020 1.02 (0.959 ‚Äì 1.085)   4.720 0.009 (0.004 ‚Äì 0.019) 
*** 
  0.680 0.507 (0.447 ‚Äì 0.575) 
*** 
Interaction with bench   0.610 0.543 (0.281 ‚Äì 1.051) 0.312 1.367 (0.742 ‚Äì 2.517)   9.987 0.0 (0.0 ‚Äì 0.001)***   2.589 0.075 (0.033 ‚Äì 0

## Extract information from multiple papers for comparison

In [13]:
from langchain.prompts import PromptTemplate

structured_prompt = PromptTemplate.from_template(
"""
Given the following document text, extract key information for a concise literature review. 
Output a markdown table with columns:

| Problem | Research Gap | Methodology | Key Results |
|-------|--------------|-------------|-------------|

Requirements:
- Each paper must be represented in a single row.
- If a category contains multiple points, separate them using "; ".
- Do NOT create multiple rows for the same paper.

Context:
{context}

Question:
{question}
"""
)
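
The `{context}` and `{question}` placeholders are filled in when the chain runs. The sketch below shows the substitution with plain `str.format` on a shortened copy of the template. The placeholder chunks are hypothetical, and joining the retrieved chunks with blank lines is an assumption about how `run_qa_chain` assembles the context, not something this notebook confirms.

```python
# Shortened copy of the template above, kept as a plain string for illustration.
template = (
    "Given the following document text, extract key information "
    "for a concise literature review.\n\n"
    "Context:\n{context}\n\n"
    "Question:\n{question}\n"
)

# Hypothetical stand-ins for the chunks a retriever would return.
retrieved_chunks = [
    "First retrieved chunk of the paper.",
    "Second retrieved chunk of the paper.",
]

# Assumption: chunks are concatenated into one context string.
context = "\n\n".join(retrieved_chunks)
filled_prompt = template.format(
    context=context,
    question="Extract key information from this paper.",
)
print(filled_prompt)
```

Because the model only ever sees the retrieved chunks, the quality of the extracted table depends on which chunks the retriever selects for the query.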

In [21]:
folder_path = "data"
save_csv_path = "output/multiple_comparison.csv"
query_text = "Extract key information from this paper."
results = []

for fname in sorted(os.listdir(folder_path)):
    if not fname.lower().endswith(".pdf"):
        continue
    pdf_path = os.path.join(folder_path, fname)
    print(f"Processing {pdf_path} ...")

    # Build retriever for this pdf
    retriever = llm_utils.build_pdf_retriever(pdf_path)

    # Run QA chain for structured extraction
    extracted_text = llm_utils.run_qa_chain(
        query=query_text,
        retriever=retriever,
        llm=llm,
        prompt_template=structured_prompt,
        return_sources=False
    )

    results.append({
        "pdf_path": fname,
        "extracted_text": extracted_text,
    })

df = pd.DataFrame(results)
os.makedirs(os.path.dirname(save_csv_path), exist_ok=True)
df.to_csv(save_csv_path, index=False, encoding="utf-8-sig")
print(f"\n Saved results to {save_csv_path}")

Processing data/1-s2.0-S1353829223001867-main.pdf ...


Ignoring wrong pointing object 624 0 (offset 0)
Ignoring wrong pointing object 625 0 (offset 0)
Ignoring wrong pointing object 628 0 (offset 0)
Ignoring wrong pointing object 629 0 (offset 0)
Ignoring wrong pointing object 663 0 (offset 0)
Ignoring wrong pointing object 664 0 (offset 0)



--- Final Answer ---
Based on the provided text, here is the key information extracted for a concise literature review.

| Problem | Research Gap | Methodology | Key Results |
|-------|--------------|-------------|-------------|
| Difficulty in measuring and distinguishing playful activities from other behaviors in public open spaces; need to understand how movement patterns relate to space usage and physical activity. | Few studies measure play directly; existing research often overlooks the relationship between specific movement trajectories, interaction with site features, and resulting physical activity levels in unprogrammed public spaces. | Analysis of anonymized video data from a harbourfront space; generation of pedestrian trajectories tracked for ≥25s; classification into types (Playful, Strolling, Sporty, Passive, Mixed) using indicators (detour ratio, bounding box area/ratio, trajectory length/duration); multinomial logistic regression to link trajectory types to feature 

## Combine the results extracted from the papers

In [23]:
merged_df_list = []

for i, md in enumerate(df["extracted_text"]):
    try:
        parsed = llm_utils.parse_markdown_table(md)
        parsed["source_doc"] = df.loc[i, "pdf_path"] if "pdf_path" in df.columns else f"doc_{i+1}"
        merged_df_list.append(parsed)
    except Exception as e:
        print(f"Failed to parse doc {i+1}: {e}")

final_df = pd.concat(merged_df_list, ignore_index=True)
final_df.to_csv("output/merged_lit_review.csv", index=False)
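
The implementation of `llm_utils.parse_markdown_table` is not shown here. As a rough illustration of what such a step involves, the sketch below converts a markdown table embedded in free text into a list of row dictionaries. The name `parse_markdown_table_sketch` is hypothetical, and the approach (keep lines starting with `|`, skip the separator row) is an assumption rather than the library's actual logic.

```python
def parse_markdown_table_sketch(markdown: str) -> list[dict[str, str]]:
    """Parse the first pipe-delimited markdown table found in the text.

    Illustrative only: does not handle escaped pipes inside cells.
    """
    # Keep only lines that look like table rows; ignore surrounding prose.
    lines = [ln.strip() for ln in markdown.splitlines() if ln.strip().startswith("|")]
    if len(lines) < 2:
        raise ValueError("no markdown table found")

    def split_row(line: str) -> list[str]:
        return [cell.strip() for cell in line.strip("|").split("|")]

    header = split_row(lines[0])
    rows = []
    for line in lines[1:]:
        cells = split_row(line)
        # Skip the |---|---| separator row under the header.
        if all(cell and set(cell) <= set("-: ") for cell in cells):
            continue
        if len(cells) != len(header):
            raise ValueError(f"expected {len(header)} cells, got {len(cells)}")
        rows.append(dict(zip(header, cells)))
    return rows


sample = """
Here is the extracted information.

| Problem | Methodology |
|---------|-------------|
| Measuring play | Video trajectories; logistic regression |
"""
rows = parse_markdown_table_sketch(sample)
print(rows)
```

Passing `rows` to `pd.DataFrame` would give one DataFrame row per table row, which is the shape the merge loop above expects. A model response with no table at all raises `ValueError`, which the `try`/`except` in the loop would report and skip.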

In [25]:
final_df

|  | Problem | Research Gap | Methodology | Key Results | source_doc |
|---|---------|--------------|-------------|-------------|------------|
| 0 | Difficulty in measuring and distinguishing pla... | Few studies measure play directly; existing re... | Analysis of anonymized video data from a harbo... | Playful trajectories are associated with inter... | 1-s2.0-S1353829223001867-main.pdf |
| 1 | Developing a vision-based system for monitorin... | How to define appropriate averaging weights an... | An AI and Computer Vision methodology involvin... | The system successfully identified crowded zon... | DeepSOCIAL.pdf |
| 2 | Achieving optimal speed and accuracy for real-... | Improving upon previous object detection archi... | Proposes YOLOv4, a CNN-based object detector w... | YOLOv4 is presented as achieving state-of-the-... | DeepSOCIAL.pdf |
| 3 | Enabling real-time object detection with a uni... | Overcoming the complexity and speed limitation... | Proposes YOLO (You Only Look Once), a unified ... | YOLO enables real-time, unified object detection. | DeepSOCIAL.pdf |
| 4 | Improving the speed and performance of two-sta... | The computational bottleneck of generating reg... | Introduces Faster R-CNN, which uses a Region P... | Enables real-time object detection with a regi... | DeepSOCIAL.pdf |
| 5 | Addressing class imbalance in dense object det... | The class imbalance problem encountered during... | Proposes RetinaNet and a novel Focal Loss func... | Focal loss enables one-stage detectors to achi... | DeepSOCIAL.pdf |
| 6 | Monitoring social distancing in public spaces ... | The need for an efficient, real-time framework... | A deep learning framework combining fine-tuned... | Models were trained on a filtered "Person" cla... | Monitoring COVID-19 social distancing.pdf |
| 7 | Quantifying the vitality of small public space... | Existing methods (e.g., for large-scale spaces... | A computer vision-based framework to extract h... | A significant regression model was found (N=48... | Small public space vitality analysis and evalu... |
