
# A. Thread Format Strategy

1. JSON/HTML outputs: Passing the raw JSON or HTML response to the LLM without restructuring.

2. Concatenation with Delimiters: Joining comment texts into a single string using special characters or patterns to separate individual comments and indicate hierarchy.

3. Structured Input with Role/Hierarchy Indicators: Including explicit labels or markers within the text to denote the role of each turn (e.g., "Comment", "Reply") and its position in the conversation tree (e.g., "Level 0", "Level 1").

4. Summarization or Extraction before Feeding: Pre-processing the conversation to create summaries of threads or extract key information, reducing the input size for the LLM.

5. Feeding in Chunks: Breaking down very long conversations into smaller segments to fit within the LLM's context window, potentially processing each chunk separately.

6. Using Models Designed for Conversational Data: Utilizing LLMs specifically trained to handle conversational turns and maintain context across interactions.
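
Strategy 3 can be sketched as follows. This is a minimal illustration, assuming each item is a dict with `author`, `text`, and `children` keys (the shape of the Hacker News API responses shown in section A1); the `format_thread` helper and the label wording are hypothetical.

```python
def format_thread(item, level=0):
    """Return a list of labelled lines for an item and all of its replies."""
    role = "Comment" if level == 0 else "Reply"
    lines = []
    text = item.get("text")
    if text:
        author = item.get("author") or "unknown"
        lines.append(f"[{role} | Level {level} | {author}] {text}")
    # Recurse into replies, one level deeper
    for child in item.get("children") or []:
        lines.extend(format_thread(child, level + 1))
    return lines


# Tiny hand-made thread (illustrative data, not fetched from the API)
sample_thread = {
    "author": "alice",
    "text": "Has anyone tried this?",
    "children": [
        {"author": "bob", "text": "Yes, it works well.", "children": []},
    ],
}

labelled = "\n".join(format_thread(sample_thread))
print(labelled)
```

Explicit labels cost a few extra tokens per comment, but they let the model tell a top-level comment from a reply without relying on whitespace alone.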

## A1. JSON outputs

In [1]:
import requests

item_id = 44047693
url = f"https://hn.algolia.com/api/v1/items/{item_id}"
response = requests.get(url)
data = response.json()

data

{'author': 'bookofjoe',
 'children': [{'author': 'bookofjoe',
   'children': [],
   'created_at': '2025-05-21T02:13:59.000Z',
   'created_at_i': 1747793639,
   'id': 44047694,
   'options': [],
   'parent_id': 44047693,
   'points': None,
   'story_id': 44047693,
   'text': '<a href="https:&#x2F;&#x2F;archive.ph&#x2F;1geSw" rel="nofollow">https:&#x2F;&#x2F;archive.ph&#x2F;1geSw</a>',
   'title': None,
   'type': 'comment',
   'url': None},
  {'author': 'neom',
   'children': [{'author': 'dirtyhippiefree',
     'children': [],
     'created_at': '2025-05-21T19:41:15.000Z',
     'created_at_i': 1747856475,
     'id': 44055435,
     'options': [],
     'parent_id': 44047849,
     'points': None,
     'story_id': 44047693,
     'text': 'The article also says that William Ziff Jr. sold the computer magazines when his son, Dirk, didn’t want to run them.<p>He’s worth $7 billion, now.',
     'title': None,
     'type': 'comment',
     'url': None}],
   'created_at': '2025-05-21T02:49:10.000Z',

## A2. Recursive Concatenation

In [2]:
import requests

item_id = 44047693
url = f"https://hn.algolia.com/api/v1/items/{item_id}"
response = requests.get(url)
data = response.json()
print("data['children']: ", data['children'])

# Recursively build the conversation string, indenting each reply level
def build_conversation_string(item, indent=0):
    parts = []

    # Add indentation and the comment text, followed by a blank line
    text = item.get('text')
    if text is not None:
        parts.append("  " * indent + text + "\n\n")

    # Recurse into replies, one indent level deeper
    for child in item.get('children') or []:
        parts.append(build_conversation_string(child, indent + 1))

    return "".join(parts)

# Start with the main item's children (the top-level comments)
conversation_text = "".join(
    build_conversation_string(top_level_comment, indent=0)
    for top_level_comment in data.get('children') or []
)

# The 'conversation_text' variable now holds the concatenated conversation
print(conversation_text)

# Now you can feed 'conversation_text' to your LLM
# Example (conceptual):
# llm_input = {"prompt": "Summarize the following conversation:\n" + conversation_text}
# llm_output = your_llm_api_call(llm_input)

data['children']:  [{'author': 'bookofjoe', 'children': [], 'created_at': '2025-05-21T02:13:59.000Z', 'created_at_i': 1747793639, 'id': 44047694, 'options': [], 'parent_id': 44047693, 'points': None, 'story_id': 44047693, 'text': '<a href="https:&#x2F;&#x2F;archive.ph&#x2F;1geSw" rel="nofollow">https:&#x2F;&#x2F;archive.ph&#x2F;1geSw</a>', 'title': None, 'type': 'comment', 'url': None}, {'author': 'neom', 'children': [{'author': 'dirtyhippiefree', 'children': [], 'created_at': '2025-05-21T19:41:15.000Z', 'created_at_i': 1747856475, 'id': 44055435, 'options': [], 'parent_id': 44047849, 'points': None, 'story_id': 44047693, 'text': 'The article also says that William Ziff Jr. sold the computer magazines when his son, Dirk, didn’t want to run them.<p>He’s worth $7 billion, now.', 'title': None, 'type': 'comment', 'url': None}], 'created_at': '2025-05-21T02:49:10.000Z', 'created_at_i': 1747795750, 'id': 44047849, 'options': [], 'parent_id': 44047693, 'points': None, 'story_id': 44047693, '
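
Long threads may not fit in a model's context window (strategy 5 in the list above). Here is a minimal sketch, assuming the blank-line-separated format produced by `build_conversation_string`. The `chunk_conversation` helper and the character budget are hypothetical, and a real pipeline would likely count tokens rather than characters.

```python
def chunk_conversation(text, max_chars=2000):
    """Split text on blank lines into chunks of at most max_chars characters.

    A single comment longer than max_chars is kept whole in its own chunk.
    """
    chunks, current = [], ""
    for comment in text.split("\n\n"):
        if not comment.strip():
            continue  # skip empty pieces left by trailing separators
        candidate = comment if not current else current + "\n\n" + comment
        if current and len(candidate) > max_chars:
            # Adding this comment would exceed the budget: close the chunk
            chunks.append(current)
            current = comment
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


sample = "first comment\n\n  a reply\n\nsecond comment\n\n"
for i, chunk in enumerate(chunk_conversation(sample, max_chars=25)):
    print(f"--- chunk {i} ---")
    print(chunk)
```

Splitting only at comment boundaries keeps each comment intact, although a reply can still land in a different chunk from its parent.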

In [3]:
item_id = 44047693
url = f"https://hn.algolia.com/api/v1/items/{item_id}"
response = requests.get(url)
data = response.json()

comments = data['children']
for comment in comments:
  print(comment['text'])
  print('\n')

<a href="https:&#x2F;&#x2F;archive.ph&#x2F;1geSw" rel="nofollow">https:&#x2F;&#x2F;archive.ph&#x2F;1geSw</a>


The owner is the grandson of the Ziff Davis fortune. <a href="https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Ziff_Davis" rel="nofollow">https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Ziff_Davis</a> (Popular Electronics, PC Mag, ZDNET, etc.)


&gt; In the spring of 2027, the museum will open a permanent gallery devoted to the evolution and cultural impact of the American guitar.<p>This is fun, it looks like they have many important prototype and early production guitars.


As a guitar player (20+ years) and a audio engineer&#x2F;electrical engineer&#x2F;dsp programmer one thing that really keeps baffling me about the field of guitar playing is how much myths about what affects the sound in which way exists. With people trying to get the sounds of their stars by buying products that were made in the 60s and somehow assuming the wild quality fluctuations and effects of the 
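
The comment text printed above still carries HTML entities (`&#x2F;`, `&gt;`) and tags (`<a>`, `<p>`). One possible cleanup before passing text to an LLM is sketched below using only the standard library. The `clean_comment` helper is hypothetical, and the regex-based tag stripping is a simplification that assumes the light markup seen in these comments rather than arbitrary HTML.

```python
import html
import re


def clean_comment(raw):
    """Convert Hacker News comment HTML to plain text."""
    text = re.sub(r"<p>", "\n\n", raw)    # paragraph tags become blank lines
    text = re.sub(r"<[^>]+>", "", text)   # drop remaining tags such as <a ...> and <i>
    return html.unescape(text).strip()    # decode entities like &#x2F; and &gt;


raw = (
    'See <a href="https:&#x2F;&#x2F;example.com" rel="nofollow">'
    "https:&#x2F;&#x2F;example.com</a><p>&gt; quoted"
)
print(clean_comment(raw))
```

Tags are stripped before entities are decoded so that an escaped `&gt;` in a quote is not mistaken for part of a tag.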

# Dataset for AI agent

In [4]:
import requests
import pandas as pd

# Load the JSON from GitHub
url = "https://raw.githubusercontent.com/casualcomputer/datasets/master/conversations.json"
data = requests.get(url).json()

# Normalize into a DataFrame
conversation_df = pd.json_normalize(data["conversations"])
conversation_df

Unnamed: 0,item_id,conversation_text
0,39765630,I like the interactive setup. I think this is ...
1,39763458,I am fascinated they used 60 pounds and a larg...
2,39761650,"I’ve seen this at a Fortune-50 company, on the..."
3,39771596,ESG was basically instantly exploited to the p...
4,39764387,"Paywall-free: <a href=""https:&#x2F;&#x2F;archi..."
...,...,...
250,39768860,A friend of mine from a farming family in Euro...
251,39768434,"<a href=""https:&#x2F;&#x2F;www.flightradar24.c..."
252,39770249,qacom for radio company? was mokia taken?\n\n ...
253,39748761,And of course the Nick Metropolis mentioned in...


In [5]:
import pandas as pd

url = "https://raw.githubusercontent.com/casualcomputer/datasets/master/analysis_results.csv"

threads_summary = pd.read_csv(url).sort_values(by='rank', ascending=False)
threads_summary

Unnamed: 0,rank,title,link,item_id,page,scrape_time,content_saved
10,255.0,Biden is giving Intel $8.5B for big semiconduc...,https://text.npr.org/1239533039,39768240,9,2025-05-17 11:28:26.188898,
11,254.0,How does perception of climate protest influen...,https://www.nature.com/articles/s44168-023-000...,39769918,9,2025-05-17 11:28:26.187889,D:\alpha_seeker\content_2024-03-20\1178aeb149c...
12,253.0,US Weighs Sanctioning Huawei's Secretive Chine...,https://www.bloomberg.com/news/articles/2024-0...,39769909,9,2025-05-17 11:28:26.187889,
13,252.0,Researchers demonstrate breakthrough recyclabi...,https://phys.org/news/2024-03-breakthrough-rec...,39762164,9,2025-05-17 11:28:26.187889,D:\alpha_seeker\content_2024-03-20\19037b0b087...
14,251.0,Intel receives $8.5B from US for expanding hig...,https://arstechnica.com/tech-policy/2024/03/in...,39766518,9,2025-05-17 11:28:26.186382,D:\alpha_seeker\content_2024-03-20\e21d0bea450...
...,...,...,...,...,...,...,...
247,5.0,Michel Talagrand wins Abel Prize for work wran...,https://www.quantamagazine.org/michel-talagran...,39764954,1,2025-05-17 11:28:00.178128,D:\alpha_seeker\content_2024-03-20\a3b498a5018...
248,4.0,Google Scholar PDF Reader,https://scholar.googleblog.com/2024/03/superch...,39768438,1,2025-05-17 11:28:00.178128,D:\alpha_seeker\content_2024-03-20\98fd21dea64...
249,3.0,Rive Renderer for real-time vector graphics is...,https://rive.app/blog/rive-renderer-now-open-s...,39766893,1,2025-05-17 11:28:00.176560,D:\alpha_seeker\content_2024-03-20\938c76f3e1d...
250,2.0,Suspicious discontinuities (2020),https://danluu.com/discontinuities/,39768860,1,2025-05-17 11:28:00.176560,D:\alpha_seeker\content_2024-03-20\2779493d1bf...


In [9]:
# Cast item_id to a nullable integer so the join keys have compatible dtypes
conversation_df['item_id'] = pd.to_numeric(conversation_df['item_id'], errors='coerce').astype('Int64')

# Left merge: keep every row of threads_summary and attach conversation_text where item_id matches
merged_df = threads_summary.merge(conversation_df, on='item_id', how='left').sort_values(by='rank')

# Display the first few rows of the merged DataFrame
merged_df.head()

Unnamed: 0,rank,title,link,item_id,page,scrape_time,content_saved,conversation_text
254,1.0,Flightradar24's new GPS jamming map,https://www.flightradar24.com/blog/gps-jamming...,39768434,1,2025-05-17 11:28:00.176560,D:\alpha_seeker\content_2024-03-20\32d2683cf43...,"<a href=""https:&#x2F;&#x2F;www.flightradar24.c..."
253,2.0,Suspicious discontinuities (2020),https://danluu.com/discontinuities/,39768860,1,2025-05-17 11:28:00.176560,D:\alpha_seeker\content_2024-03-20\2779493d1bf...,A friend of mine from a farming family in Euro...
252,3.0,Rive Renderer for real-time vector graphics is...,https://rive.app/blog/rive-renderer-now-open-s...,39766893,1,2025-05-17 11:28:00.176560,D:\alpha_seeker\content_2024-03-20\938c76f3e1d...,"Repo: <a href=""https:&#x2F;&#x2F;github.com&#x..."
251,4.0,Google Scholar PDF Reader,https://scholar.googleblog.com/2024/03/superch...,39768438,1,2025-05-17 11:28:00.178128,D:\alpha_seeker\content_2024-03-20\98fd21dea64...,"That&#x27;s nice and all, but google scholar r..."
250,5.0,Michel Talagrand wins Abel Prize for work wran...,https://www.quantamagazine.org/michel-talagran...,39764954,1,2025-05-17 11:28:00.178128,D:\alpha_seeker\content_2024-03-20\a3b498a5018...,"Following Shaw Price in 2019 <a href=""https:&#..."
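
A left merge keeps every row of `threads_summary`, so threads with no matching conversation end up with a missing `conversation_text`. The sketch below uses toy data (the frames are illustrative, not the notebook's datasets) to show how pandas' `indicator=True` option can flag such rows.

```python
import pandas as pd

# Toy stand-ins for threads_summary and conversation_df (illustrative values)
threads = pd.DataFrame({"item_id": [1, 2, 3], "title": ["a", "b", "c"]})
convs = pd.DataFrame({"item_id": ["1", "3"], "conversation_text": ["hi", "yo"]})

# Align key dtypes before merging: string IDs become nullable integers
convs["item_id"] = pd.to_numeric(convs["item_id"], errors="coerce").astype("Int64")

# indicator=True adds a '_merge' column describing where each row came from
merged = threads.merge(convs, on="item_id", how="left", indicator=True)

# Rows marked 'left_only' had no conversation to attach
unmatched = merged[merged["_merge"] == "left_only"]
print(unmatched[["item_id", "title"]])
```

Checking the unmatched rows before running an agent over the merged frame avoids sending empty conversations to the model.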


# AI agents

The baseline agent verifies the origin of the information (e.g., via a web search) and then provides a summary of the problem, its solution, and a link to that solution.
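
As a rough illustration of that role, the sketch below assembles the instructions and one thread into a single prompt string. It is plain Python rather than any agent framework's API. `build_agent_prompt` and the wording of the instructions are hypothetical, and the actual call to an LLM or agent is left out.

```python
def build_agent_prompt(title, link, conv_text):
    """Assemble a prompt asking for source verification and a summary."""
    # Missing conversations (e.g., NaN after a left merge) get a placeholder
    if not isinstance(conv_text, str) or not conv_text.strip():
        conv_text = "(no comments available)"
    return (
        "You are a baseline research agent.\n"
        "1. Verify where the information comes from (e.g., via a web search).\n"
        "2. Summarize the problem discussed and its solution.\n"
        "3. Provide a link to that solution.\n\n"
        f"Title: {title}\n"
        f"Link: {link}\n"
        f"Discussion:\n{conv_text}\n"
    )


# Placeholder inputs for illustration only
prompt = build_agent_prompt(
    "Example thread title",
    "https://example.com/article",
    "Example comment text",
)
print(prompt)
```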

In [10]:
!pip install crewai --quiet

In [19]:
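# Print the title, link, and conversation text of every thread.
# Note: .loc looks rows up by index label, so rows are visited in label order (0, 1, 2, ...),
# not in the rank-sorted order of merged_df.
for row_id in range(len(merged_df)):
    title = merged_df.loc[row_id, "title"]
    link = merged_df.loc[row_id, "link"]
    conv_text = merged_df.loc[row_id, "conversation_text"]
    print(f'title:{title}, \nlink:{link}, \nconv_text:{conv_text}', "\n\n")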

title:Biden is giving Intel $8.5B for big semiconductor projects in 4 states, 
link:https://text.npr.org/1239533039, 
conv_text: 



title:How does perception of climate protest influence support for climate action?, 
link:https://www.nature.com/articles/s44168-023-00096-9, 
conv_text: 



title:US Weighs Sanctioning Huawei's Secretive Chinese Chip Network, 
link:https://www.bloomberg.com/news/articles/2024-03-20/us-weighs-sanctioning-huawei-s-secretive-chinese-chip-network, 
conv_text:<a href="https:&#x2F;&#x2F;archive.is&#x2F;wCyOO" rel="nofollow">https:&#x2F;&#x2F;archive.is&#x2F;wCyOO</a>

this will later be described as part of the &#x27;US-China&#x27; chip war, as if both are equal protagonists 



title:Researchers demonstrate breakthrough recyclability of carbon nanotube sheets, 
link:https://phys.org/news/2024-03-breakthrough-recyclability-carbon-nanotube-sheets.amp, 
conv_text:&gt; <i>&quot;It demonstrates that high-performance materials made from carbon nanotubes can be reus