### Challenge

Your goal is to create an **accurate representation of a user** based on their Google search history.

The data is in `./search_history.json`. This contains a list of searches made by a single person over time.

### What does "accurate" mean?

**Accurate** means understanding which searches are **signal** and which are **noise**. Not every search reflects who someone is. Your job is to separate the meaningful from the incidental and build a coherent picture of this person.

A strong solution might surface insights like:
- **Fashion preferences**: What styles, brands, or aesthetics do they gravitate toward?
- **Travel**: Where have they been? Where are they planning to go?
- **Daily life**: What occupies their time—at work and for leisure?
- **Life transitions**: Are they moving? Starting a new job? Planning a wedding?
- **Location**: Where do they live?

This is not an exhaustive list. The point is to go beyond surface-level keyword extraction and demonstrate that you *actually understand* this person.

### What could a "representation" look like?

There are many ways to represent a user. A few examples:
- A **personal knowledge graph** capturing entities, relationships, and context
- A **single embedding** that encodes the user's preferences in a vector space
- An **LLM fine-tuned** on the user's data
- An **agent** that uses RAG to answer questions about the user

These are just starting points—come up with your own if you have a better idea. The specific representation you choose matters less than **why** you chose it and how well it captures what's meaningful about this person.

### Dummy approach

The following is what we consider a **dummy** approach:
1. Embed all searches
2. Cluster them by topic
3. Label each cluster and call it a "user interest"

This is mechanical. It doesn't distinguish signal from noise, doesn't capture nuance, and doesn't produce insights that feel *true* about a real person.

### What makes an interesting approach?

We're not looking for a "correct" answer; there probably isn't one. We're looking for **evidence of thinking**:
- Why did you choose this method over alternatives?
- What assumptions are you making, and why are they reasonable?
- How do you handle ambiguity in the data?
- What did you try that didn't work?

**The reasoning behind your approach is as important as the solution itself.** Show your work. Explain your decisions. If you explored dead ends, include them.

Make sure to include the cell output in the final commit. We will **not** execute the notebook ourselves.

In [None]:
# add your code here
# Let's load the JSON

import json 

# Load the search history data
with open('search_history.json', 'r') as f:
    search_history = json.load(f)

# Displaying basic information
for i, search in enumerate(search_history[:3]):
    print(f"\n{i+1}. {search}")



1. {'header': 'Search', 'title': 'Visited https://www.businessinsider.com/shivon-zilis-reported-mother-elon-musk-twins-2022-7?amp', 'titleUrl': 'https://www.google.com/url?q=https://www.businessinsider.com/shivon-zilis-reported-mother-elon-musk-twins-2022-7%3Famp&usg=AOvVaw1JpQbDqah1O4c5A5wg4Who', 'time': '2024-06-23T22:21:50.431Z', 'products': ['Search'], 'activityControls': ['Web & App Activity']}

2. {'header': 'Search', 'title': 'Visited Elon Musk and Shivon Zilis privately welcome third baby – NBC10 ...', 'titleUrl': 'https://www.google.com/url?q=https://www.nbcphiladelphia.com/entertainment/entertainment-news/elon-musk-and-shivon-zilis-privately-welcome-third-baby/3892694/&usg=AOvVaw0BqY5StEFFTmHdppOUNY4V', 'time': '2024-06-23T22:20:53.934Z', 'products': ['Search'], 'activityControls': ['Web & App Activity']}

3. {'header': 'Search', 'title': 'Searched for elon musk shivon zilis', 'titleUrl': 'https://www.google.com/search?q=elon+musk+shivon+zilis', 'time': '2024-06-23T22:20:47.

In [None]:
# Initial data exploration

import pandas as pd
df = pd.DataFrame(search_history)

# Convert time to datetime - using format='ISO8601' to handle mixed ISO8601 formats
df['time'] = pd.to_datetime(df['time'], format='ISO8601')

# Understanding the data at a high level
print(df.dtypes)
print(df.head())
print(df.describe(include='all'))




header                           object
title                            object
titleUrl                         object
time                datetime64[ns, UTC]
products                         object
activityControls                 object
locationInfos                    object
subtitles                        object
details                          object
dtype: object
   header                                              title  \
0  Search  Visited https://www.businessinsider.com/shivon...   
1  Search  Visited Elon Musk and Shivon Zilis privately w...   
2  Search                Searched for elon musk shivon zilis   
3  Search                                     1 notification   
4  Search               Searched for bank station fire alert   

                                            titleUrl  \
0  https://www.google.com/url?q=https://www.busin...   
1  https://www.google.com/url?q=https://www.nbcph...   
2  https://www.google.com/search?q=elon+musk+shiv...   
3                

In [None]:
# Going slightly deeper into individual variables to understand their distributions
df.title.unique()
df.products.apply(lambda x: x[0]).unique()
# Looks like there are 3 kinds of location source
df['locationInfos'].dropna().apply(lambda x: x[0]["source"]).unique()
df[df['details'].notna()]['details']
df[df['subtitles'].notna()]['subtitles']
df['time']  # Looks like the history runs from 8 June 2017 to 23 June 2024
df.sort_values('time').to_csv('/')

Unnamed: 0,header,title,titleUrl,time,products,activityControls,locationInfos,subtitles,details
55382,Search,Searched for gmail,https://www.google.com/search?q=gmail,2017-06-08 16:42:55.223000+00:00,[Search],[Web & App Activity],,,
55381,Search,Visited https://www.google.com/gmail/,https://www.google.com/gmail/,2017-06-08 16:42:57.355000+00:00,[Search],[Web & App Activity],,,
55380,Search,Searched for investment banking networking eve...,https://www.google.com/search?q=investment+ban...,2017-06-08 16:45:50.139000+00:00,[Search],[Web & App Activity],,,
55379,Search,Visited http://news.efinancialcareers.com/uk-e...,https://www.google.com/url?q=http://news.efina...,2017-06-08 16:45:58.449000+00:00,[Search],[Web & App Activity],,,
55378,Search,Searched for blackstone's women networking event,https://www.google.com/search?q=blackstone%27s...,2017-06-08 16:48:12.167000+00:00,[Search],[Web & App Activity],,,
...,...,...,...,...,...,...,...,...,...
4,Search,Searched for bank station fire alert,https://www.google.com/search?q=bank+station+f...,2024-06-23 16:52:09.311000+00:00,[Search],[Web & App Activity],"[{'name': 'At this general area', 'url': 'http...",,
3,Search,1 notification,,2024-06-23 17:08:38.542000+00:00,[Search],[Web & App Activity],,"[{'name': 'Including topics:'}, {'name': 'Reut...",
2,Search,Searched for elon musk shivon zilis,https://www.google.com/search?q=elon+musk+shiv...,2024-06-23 22:20:47.560000+00:00,[Search],[Web & App Activity],"[{'name': 'At this general area', 'url': 'http...",,
1,Search,Visited Elon Musk and Shivon Zilis privately w...,https://www.google.com/url?q=https://www.nbcph...,2024-06-23 22:20:53.934000+00:00,[Search],[Web & App Activity],,,
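
The `titleUrl` values in the rows above wrap the real target in a `q` query parameter, both for `/search` pages and for `/url` redirects. Below is a minimal sketch of recovering that target with the standard library, assuming every Google URL follows the pattern seen in these sample rows. `unwrap_title_url` is a hypothetical helper, not part of the notebook.

```python
from urllib.parse import parse_qs, urlparse


def unwrap_title_url(title_url):
    """Return (kind, target) for a Google activity URL.

    kind is 'search' for /search URLs, 'visit' for /url redirects,
    and 'other' for anything else (including missing values).
    """
    if not isinstance(title_url, str) or not title_url:
        return ("other", None)
    parsed = urlparse(title_url)
    # parse_qs percent-decodes values and turns '+' into spaces
    q_values = parse_qs(parsed.query).get("q")
    if parsed.netloc.endswith("google.com") and q_values:
        if parsed.path == "/search":
            return ("search", q_values[0])
        if parsed.path == "/url":
            return ("visit", q_values[0])
    # Not a recognised wrapper: hand the URL back unchanged
    return ("other", title_url)


kind, target = unwrap_title_url("https://www.google.com/search?q=elon+musk+shivon+zilis")
print(kind, target)
```

For visited pages, the domain of the unwrapped target (via `urlparse(target).netloc`) may be a more stable feature than the truncated page title.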


In [86]:
## Embed the search queries using Google Gemini embeddings

# Install the package first
!pip install -q google-generativeai

import os
import time

import google.generativeai as genai
import numpy as np
from tqdm import tqdm

# Initialize the client. The API key is read from the environment instead of being
# hard-coded in the notebook (GOOGLE_API_KEY is the variable name assumed here).
genai.configure(api_key=os.environ["GOOGLE_API_KEY"])

# Extract clean search queries from the title column and categorize action type
def extract_search_query(title):
    """Extract the actual search query from the title field"""
    if pd.isna(title):
        return ""
    # Strip only the leading prefix; str.replace would also remove any later
    # occurrence of the same text inside the query itself.
    # For visited entries this leaves the page title or raw URL as-is.
    for prefix in ("Searched for ", "Visited "):
        if title.startswith(prefix):
            return title[len(prefix):]
    return title

def categorize_action(title):
    """Categorize the action as 'search' or 'visited' or 'other'"""
    if pd.isna(title):
        return 'other'
    elif title.startswith("Searched for "):
        return 'search'
    elif title.startswith("Visited "):
        return 'visited'
    else:
        return 'other'

df['search_query'] = df['title'].apply(extract_search_query)
df['action_type'] = df['title'].apply(categorize_action)

# Filter out empty queries
df_to_embed = df[df['search_query'].str.strip() != ''].copy()

print(f"Total searches to embed: {len(df_to_embed)}")
print(f"\nAction type distribution:")
print(df['action_type'].value_counts())
print(f"\nSample queries:")
print(df_to_embed[['title', 'action_type', 'search_query']].head(10))


# The search queries look like the most informative column for building a representation of the person.
# To separate signal from noise, one option is a Bayesian approach that updates the person's representation as evidence arrives.
# Because the history spans many years, a data point from years ago is less likely to be signal unless it recurs over the years.
# Bayesian updates could help capture that as well.



# Embed in batches to handle rate limits and large dataset
batch_size = 50  # Reduced batch size to avoid rate limits
delay_between_batches = 2  # seconds to wait between batches
embeddings = []

queries_list = df_to_embed['search_query'].tolist()

print(f"\nEmbedding {len(queries_list)} queries in batches of {batch_size}...")
print(f"With {delay_between_batches}s delay between batches, this will take approximately {(len(queries_list)//batch_size * delay_between_batches)/60:.1f} minutes")

for i in tqdm(range(0, len(queries_list), batch_size)):
    batch = queries_list[i:i + batch_size]
    
    try:
        result = genai.embed_content(
            model="models/embedding-001",
            content=batch
        )
        
        # Extract embeddings from the result
        batch_embeddings = result['embedding']
        embeddings.extend(batch_embeddings)
        
        # Sleep to avoid rate limits
        if i + batch_size < len(queries_list):  # Don't sleep after last batch
            time.sleep(delay_between_batches)
        
    except Exception as e:
        print(f"\nError at batch {i//batch_size}: {e}")
        # Add None as placeholders for failed batches
        embeddings.extend([None] * len(batch))
        # Sleep longer after an error
        time.sleep(delay_between_batches * 2)

# Add embeddings to dataframe
df_to_embed['embedding'] = embeddings

print(f"\nEmbedding complete!")
print(f"Embedding dimension: {len(embeddings[0]) if embeddings and embeddings[0] else 'N/A'}")
print(f"Total embeddings: {len([e for e in embeddings if e is not None])}")


Total searches to embed: 55383

Action type distribution:
action_type
search     30542
visited    22496
other       2345
Name: count, dtype: int64

Sample queries:
                                               title action_type  \
0  Visited https://www.businessinsider.com/shivon...     visited   
1  Visited Elon Musk and Shivon Zilis privately w...     visited   
2                Searched for elon musk shivon zilis      search   
3                                     1 notification       other   
4               Searched for bank station fire alert      search   
5               Searched for bank station fire alert      search   
6                   Searched for mukesh ambani house      search   
7  Visited Teens could lose bank accounts and dri...     visited   
8  Visited Starmer: Sunak showing 'total lack of ...     visited   
9  Visited Sunak looked like a man who was runnin...     visited   

                                        search_query  
0  https://www.businessinsider.c

100%|██████████| 1108/1108 [55:35<00:00,  3.01s/it] 


Embedding complete!
Embedding dimension: 768
Total embeddings: 55383
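
The comments in the cell above suggest that an old search should count for less unless it keeps recurring. One simple stand-in for that idea is an exponential decay, which is not a full Bayesian update. The sketch below assumes a half-life of 365 days and topic labels that have already been assigned; the function name `recency_weighted_scores` and the example labels are hypothetical.

```python
from collections import defaultdict
from datetime import datetime, timezone


def recency_weighted_scores(events, now, half_life_days=365.0):
    """Sum an exponentially decayed weight per topic.

    events: iterable of (topic, timestamp) pairs with timezone-aware datetimes.
    An event that is exactly one half-life old contributes 0.5.
    """
    scores = defaultdict(float)
    for topic, ts in events:
        # Clamp at zero so timestamps after `now` are not over-weighted
        age_days = max((now - ts).total_seconds() / 86400.0, 0.0)
        scores[topic] += 0.5 ** (age_days / half_life_days)
    return dict(scores)


# Hypothetical topic labels; the dates mirror the span of the history
now = datetime(2024, 6, 23, tzinfo=timezone.utc)
events = [
    ("investment banking", datetime(2017, 6, 8, tzinfo=timezone.utc)),  # one-off, years ago
    ("news", datetime(2024, 6, 23, tzinfo=timezone.utc)),
    ("news", datetime(2023, 6, 24, tzinfo=timezone.utc)),  # 365 days before `now`
]
scores = recency_weighted_scores(events, now)
for topic, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{topic}: {score:.3f}")
```

A topic that recurs keeps accumulating weight, while a one-off search from years ago fades towards zero. The half-life is a tunable assumption and would need to be justified against the data.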





In [90]:
df_to_embed.to_csv("/Users/abhinavkhare/Desktop/onfabric_embeddings/embeddings.csv")
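
Saving a column of Python lists with `to_csv` writes each embedding as its string form (for example `"[0.1, 0.2]"`), so reading the file back yields strings rather than vectors. Below is a minimal sketch of converting a cell back into a list of floats, assuming each cell is either a list literal or empty for a failed batch. `parse_embedding` is a hypothetical helper.

```python
import ast


def parse_embedding(cell):
    """Convert a CSV cell such as '[0.1, 0.2]' back into a list of floats.

    Returns None for empty cells or non-string values, which is how a
    missing embedding typically appears after a CSV round trip.
    """
    if not isinstance(cell, str) or not cell.strip():
        return None
    value = ast.literal_eval(cell)  # evaluates Python literals only, never arbitrary code
    if value is None:
        return None
    return [float(x) for x in value]


print(parse_embedding("[0.25, -1.0, 3]"))
```

With pandas this could be applied as `pd.read_csv(path)["embedding"].apply(parse_embedding)`. Storing the vectors in a binary format such as NumPy's `.npy` would avoid the string round trip altogether.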