# Text Dataset Processing
Thanks for reading this notebook. Please refer to ./PittsburghCensus and ./LLMProcess for more details!

## Introduction

**Text Data**<br>
For the text data, we use large language model (LLM) APIs to process, reason over, and understand the records, and finally to give us an evaluation of the best neighborhood in Pittsburgh.
Specifically, we call the [Doubao API](https://team.doubao.com/en/), an LLM provided by ByteDance. We chose Doubao because the full dataset contains more than 100,000 items, so we needed a model that is fast, cheap, and supports high concurrency. We use Doubao to reason over the Non-Traffic Citations and Monthly Criminal Activity Dashboard datasets and generate a coarse summary of each record, condensing the information into predefined labels. Once we have a summarized item for each crime, we can visualize them with a word cloud. <br>
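
The coarse labeling step could be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code in ./LLMProcess: `label_record` is a hypothetical stand-in for one Doubao API call (here a simple keyword lookup so the example runs offline), and `LABELS` is an invented label set. It only shows the shape of the idea: label many records concurrently, then count the labels that a word cloud would be drawn from.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Hypothetical label set; the real predefined labels live in the project code.
LABELS = ("theft", "disorderly conduct", "other")


def label_record(record: str) -> str:
    """Stand-in for one LLM call: map a raw record to a predefined label.

    A real implementation would send `record` to the model and check
    that the reply is one of LABELS.
    """
    text = record.lower()
    for label in LABELS:
        if label in text:
            return label
    return "other"


def label_all(records: list[str], max_workers: int = 8) -> list[str]:
    """Label records concurrently; results keep the input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(label_record, records))


# Made-up records for illustration only.
records = [
    "Retail theft reported at a store",
    "Disorderly conduct near a bar",
    "Theft from vehicle",
    "Public drunkenness",
]
labels = label_all(records)

# Label frequencies are the input a word cloud would be drawn from.
label_counts = Counter(labels)
print(label_counts.most_common())
```

A thread pool suits this kind of workload because each call mostly waits on the network; the actual concurrency limit would depend on the API's rate limits.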

Next, we use the [Sonar API](https://www.perplexity.ai/hub/blog/introducing-the-sonar-pro-api) provided by Perplexity as the judge for the final evaluation because:
 - It is a larger model and is suitable for complex reasoning.
 - Perplexity is known for online search, which we hope reduces hallucination and helps it evaluate the results better.
 - Perplexity provides 5 free credits for students!

To further reduce the context length, we use a rank sum to get a coarse ranking of the processed data, and we only ask the model to choose among the top 5 neighborhoods.
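
A rank sum of this kind could be computed along the following lines. This is a sketch under assumptions: the raw metric columns (`crime_count`, `park_count`, `tree_count`) and all values are made up, and the way ./PittsburghCensus actually derives `rank_sum.csv` may differ. It assumes every rank is oriented so that a higher rank is better, matching how the scores are described to the judge later in this notebook.

```python
import pandas as pd

# Made-up metrics for three placeholder neighborhoods (not project data).
metrics = pd.DataFrame(
    {
        "NEIGHBORHOOD": ["A", "B", "C"],
        "crime_count": [120, 40, 75],   # lower is better
        "park_count": [2, 5, 9],        # higher is better
        "tree_count": [300, 900, 650],  # higher is better
    }
)

rank_df = metrics[["NEIGHBORHOOD"]].copy()
# Rank each metric so that a larger rank always means "better".
rank_df["crime_rank"] = metrics["crime_count"].rank(ascending=False)
rank_df["park_rank"] = metrics["park_count"].rank(ascending=True)
rank_df["tree_rank"] = metrics["tree_count"].rank(ascending=True)

# Sum the per-metric ranks into one coarse score.
rank_cols = ["crime_rank", "park_rank", "tree_rank"]
rank_df["rank_sum"] = rank_df[rank_cols].sum(axis=1)

# Keep only the best candidates for the LLM judge.
top = rank_df.sort_values("rank_sum", ascending=False).head(2)
print(top)
```

Only the shortlisted rows are passed on, so the judge sees a handful of neighborhoods instead of the full table.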

## Metrics 

We use a language model to evaluate the results and pick the best neighborhood.

This is the final ranking provided by the Perplexity API:

🥇 Final Rank (Based on Composite Lifestyle Fit)
🏆 Regent Square — Best overall mix of family-friendliness, culture, nature, and charm. Strong community events. Slightly underrated.

Point Breeze — Elegant, safe, green. A bit pricier, but excellent for stability and long-term quality of life.

Central Northside — Strong urban lifestyle and cultural identity. Some residual safety concerns, but improving fast.

Greenfield — Great affordability and recreation access. A bit weaker in character or “destination” feel compared to others.

Polish Hill — Super affordable and edgy-cool. But less green space and weaker infrastructure may limit broader appeal.

### Coarse processing with Doubao

In [14]:
from LLMProcess.CallAPI import judge_hood_crime
import pandas as pd

# Per-record summaries produced in the earlier processing step
citation_df = pd.read_csv("PittsburghCensus/summary_all.csv")

# The five candidate neighborhoods, using the names this notebook filters on
neighborhood_list = [
    "Central North Side",
    "Polish Hill",
    "Regent Square",
    "Greenfield",
    "Point Breeze",
]

keywords_list = []
for hood_name in neighborhood_list:
    # Gather all summaries for this neighborhood into one text block
    hood_records = citation_df[citation_df["NEIGHBORHOOD"] == hood_name]["SUMMARY"].to_string()
    keywords_list.append(judge_hood_crime(hood_name=hood_name, crime_records=hood_records))

# Print the model's result for each neighborhood
for hood_name, keywords in zip(neighborhood_list, keywords_list):
    print(hood_name)
    print(keywords)
    print("--------------------------------")




### Judge with Sonar

In [15]:
from LLMProcess.CallAPI import judge_hood_all
import pandas as pd



rank_df = pd.read_csv("PittsburghCensus/rank_sum.csv")
citation_df = pd.read_csv("PittsburghCensus/summary_all.csv")
print(rank_df.head(5))

# We only process the top 5 neighborhoods by rank_sum. Output of the print above:
#         NEIGHBORHOOD  crime_rank  facility_rank  steps_rank  park_rank  tree_rank  student_rank  rank_sum
# 0  CENTRAL NORTHSIDE        61.0           18.0        42.0       54.0       62.0          48.0     285.0
# 1        POLISH HILL        42.0           56.0        60.0       55.0       40.0          12.0     265.0
# 2      REGENT SQUARE        59.0           59.0        10.0       58.0       58.0          17.0     261.0
# 3         GREENFIELD        39.0           27.0        53.0       46.0       39.0          50.0     254.0
# 4       POINT BREEZE        55.0           39.0        11.0       59.0       47.0          26.0     237.0


score_list = []


for i in range(5):
    hood_name = rank_df.iloc[i]["NEIGHBORHOOD"]
    crime_score = rank_df.iloc[i]["crime_rank"]
    facility_score = rank_df.iloc[i]["facility_rank"]
    steps_score = rank_df.iloc[i]["steps_rank"]
    park_score = rank_df.iloc[i]["park_rank"]
    tree_score = rank_df.iloc[i]["tree_rank"]
    student_score = rank_df.iloc[i]["student_rank"]
    score_sum = rank_df.iloc[i]["rank_sum"]

    score_record = f"""
    All the scores are higher the better.
    In neighborhood {hood_name},
    the crime score is {crime_score},
    the facility score is {facility_score},
    the steps score is {steps_score},
    the park score is {park_score},
    the tree score is {tree_score},
    the student score is {student_score},
    the total score sum up to {score_sum}
    """
    score_list.append(score_record)

ctr_north = citation_df[citation_df["NEIGHBORHOOD"] == "Central North Side"]["SUMMARY"].to_string()
pohill    = citation_df[citation_df["NEIGHBORHOOD"] == "Polish Hill"]["SUMMARY"].to_string()
rsqure    = citation_df[citation_df["NEIGHBORHOOD"] == "Regent Square"]["SUMMARY"].to_string()
gfiled    = citation_df[citation_df["NEIGHBORHOOD"] == "Greenfield"]["SUMMARY"].to_string()
pbreeze   = citation_df[citation_df["NEIGHBORHOOD"] == "Point Breeze"]["SUMMARY"].to_string()

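# The order here must match the first five rows of rank_df,
# because scores and records are paired by position below.
records = [ctr_north, pohill, rsqure, gfiled, pbreeze]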

final_result = []

for i in range(5):
    final_result.append(score_list[i] + "The detailed crime records are: " + records[i])
    print(final_result[i])

result = judge_hood_all(final_result)
print(result)










## Conclusion


Our text-processing part explores the use of large language models (LLMs) to help evaluate Pittsburgh neighborhoods based on a mix of qualitative and quantitative crime-related text data. By using the high-concurrency, cost-effective **Doubao API** for initial summarization and the more powerful **Sonar API** from Perplexity for final evaluation, we designed a two-stage pipeline that balances scalability and reasoning depth.

However, several limitations affect the robustness and reliability of the results. Please refer to our slides and presentation for more details:

- **Prompt Dependency**: The outcome is highly sensitive to the system prompt design and model choice. Small changes in phrasing or task framing can significantly alter the generated summaries or final evaluations. For example, when we used the deep research model, it tended to reason from knowledge found on the internet.
  
- **Model Hallucination on Hard Tasks**: When the reasoning complexity exceeds the model’s effective context length or clarity, especially in comparative evaluations with dense crime data, we observed instances of hallucination or overconfident assertions not grounded in input facts.

- **Monotonous and Redundant Input**: Much of the raw data (e.g., repetitive criminal activity entries) lacks semantic diversity. This can cause the model to overfit on trivial patterns, skewing interpretation or leading to surface-level summaries. Also, we only use the crime data and the ranks derived from the other datasets, which might not be enough for a comprehensive evaluation.

- **Context Window and Truncation**: Despite ranking and filtering steps to reduce context length, LLMs may still miss relevant information or fail to maintain coherence across long-form comparisons. This limits their effectiveness in nuanced tradeoff analysis.
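
One possible mitigation for the redundant-input and context-length issues, which we did not evaluate, is to collapse repeated summaries into counts before building the prompt. The sketch below is illustrative only: `condense_records` and its `max_items` parameter are hypothetical names, and the sample summaries are made up.

```python
from collections import Counter


def condense_records(summaries: list[str], max_items: int = 50) -> str:
    """Collapse repeated summaries into 'summary (xN)' lines, most frequent first."""
    # Normalize case and whitespace so near-identical entries are merged.
    counts = Counter(s.strip().lower() for s in summaries if s.strip())
    lines = [f"{text} (x{n})" for text, n in counts.most_common(max_items)]
    return "\n".join(lines)


# Made-up summaries for illustration only.
summaries = ["Retail theft", "retail theft ", "Disorderly conduct", "Retail theft"]
condensed = condense_records(summaries)
print(condensed)
```

This keeps frequency information while shortening the text, but it also discards any ordering or per-record detail, so it trades nuance for brevity.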


