# Google Trends Anchor-based Rescaling V1.0

* Topic: Chlamydia (Infection)  -> MID = /m/020gd
* Anchor: California (US-CA)
* Author: LIU Aoran / 2025

> **Note**  
> You should run this notebook on Google Colab to get a stable IP for Google Trends.

> ⚠️ **Warning**  
> The methodology for Step 3 is still incomplete, and its results are not guaranteed to be usable. Step 2 can still be used on its own to scrape the time series for individual states.

In [2]:
!pip install pytrends

Collecting pytrends
  Downloading pytrends-4.9.2-py3-none-any.whl.metadata (13 kB)
Downloading pytrends-4.9.2-py3-none-any.whl (15 kB)
Installing collected packages: pytrends
Successfully installed pytrends-4.9.2


In [3]:
from pytrends.request import TrendReq
import pandas as pd
import time, random, os

In [9]:
from google.colab import drive
drive.mount('/content/drive')
save_dir = "/content/drive/MyDrive/google_trends_anchor"
os.makedirs(save_dir, exist_ok=True)

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [10]:
keyword = "/m/020gd"       # Google Topic ID: Chlamydia (Infection)
label = "Chlamydia_infection"
timeframe = "2018-01-01 2025-09-01"
anchor_state = "US-CA"     # California as anchor
save_dir = "/content/drive/MyDrive/google_trends_anchor"
os.makedirs(save_dir, exist_ok=True)

In [11]:
states = [
    'US-AK','US-AL','US-AZ','US-AR','US-CA','US-CO','US-CT','US-DE','US-FL','US-GA','US-HI','US-ID','US-IL',
    'US-IN','US-IA','US-KS','US-KY','US-LA','US-ME','US-MD','US-MA','US-MI','US-MN','US-MS','US-MO','US-MT',
    'US-NE','US-NV','US-NH','US-NJ','US-NM','US-NY','US-NC','US-ND','US-OH','US-OK','US-OR','US-PA','US-RI',
    'US-SC','US-SD','US-TN','US-TX','US-UT','US-VT','US-VA','US-WA','US-WV','US-WI','US-WY'
]

In [12]:
pytrends = TrendReq(hl='en-US', tz=360)

## Step 1: Region-level long-term average

Using `interest_by_region()` with `geo="US"`, collect the long-term average Google Trends index for each U.S. state over 2018–2025.

These averages represent the relative search intensity of each state with respect to the entire country.

Denote the region-level long-term average of state $r$ by $\bar{GT_r}$.
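
Because Step 3 divides by these averages, it may help to check the region table before using it. The following is a minimal sketch, assuming a table with the same `state` and `avg_index` columns that Step 1 produces; the helper name `check_region_averages` and the toy values are hypothetical and are not real Google Trends output.

```python
import pandas as pd

def check_region_averages(region_df, expected_states):
    """Return (missing, zero): states absent from the table, and states whose average is 0."""
    present = set(region_df["state"])
    missing = sorted(set(expected_states) - present)
    zero = sorted(region_df.loc[region_df["avg_index"] <= 0, "state"])
    return missing, zero

# Hypothetical toy values for illustration only
toy = pd.DataFrame({
    "state": ["California", "Texas", "Wyoming"],
    "avg_index": [80, 100, 0],
})
missing, zero = check_region_averages(toy, ["California", "Texas", "Wyoming", "Vermont"])
print("missing:", missing)
print("zero average:", zero)
```

A state listed under `zero` would produce an infinite scale factor in Step 3, so it is worth handling explicitly.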


In [13]:
print("▶️ Fetching region-level average (for scaling factors)...")
pytrends.build_payload([keyword], geo="US", timeframe=timeframe)
region_df = pytrends.interest_by_region(resolution='region', inc_low_vol=True).reset_index()
region_df = region_df.rename(columns={keyword: "avg_index", "geoName": "state"})
region_df.to_csv(f"{save_dir}/region_avg.csv", index=False)
print("✅ Region averages saved.")

▶️ Fetching region-level average (for scaling factors)...
✅ Region averages saved.


## Step 2: Time-series per state

For each state, we download the complete monthly time series of the topic *“Chlamydia (Infection)”* (`MID = /m/020gd`) using the `interest_over_time()` function.
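
The download loop in this notebook skips a state when a request fails. One possible refinement is to retry the request a few times before giving up. The sketch below is an assumption, not part of the original workflow: `fetch_with_retry`, its parameters, and the stand-in `flaky` function are hypothetical, and the `sleep` argument exists only so the demo runs without real waiting.

```python
import time

def fetch_with_retry(fetch, retries=3, base_delay=60.0, sleep=time.sleep):
    """Call fetch() until it succeeds or `retries` attempts have been used."""
    for attempt in range(1, retries + 1):
        try:
            return fetch()
        except Exception as exc:
            if attempt == retries:
                raise  # out of attempts: let the caller see the error
            delay = base_delay * attempt  # linear backoff: 60 s, 120 s, ...
            print(f"attempt {attempt} failed ({exc}); retrying in {delay:.0f}s")
            sleep(delay)

# Demo with a stand-in callable that fails twice, then succeeds
calls = {"n": 0}

def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("429")
    return "ok"

waits = []  # record the requested delays instead of sleeping
result = fetch_with_retry(flaky, sleep=waits.append)
print(result, waits)
```

In the notebook, `fetch` could be a small function that calls `build_payload` and `interest_over_time()` for a single state.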

(Under ideal conditions, one keyword takes approximately 35–45 minutes, because the loop pauses 30–55 seconds after each of the 50 states.)
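
The runtime estimate can be reproduced from the pause length alone. This is a rough sketch that counts only the `time.sleep(random.uniform(30, 55))` pauses and ignores request time and the 60-second pause after a failure; the function names are hypothetical.

```python
def expected_wait_minutes(n_requests, low_s, high_s):
    """Mean total pause when each pause is uniform on [low_s, high_s] seconds."""
    return n_requests * (low_s + high_s) / 2 / 60

def worst_case_wait_minutes(n_requests, high_s):
    """Total pause if every pause hits the upper bound."""
    return n_requests * high_s / 60

mean_min = expected_wait_minutes(50, 30, 55)
worst_min = worst_case_wait_minutes(50, 55)
print(round(mean_min, 1), round(worst_min, 1))
```

For 50 states this gives about 35 minutes on average and about 46 minutes at most.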


In [14]:
print("\n▶️ Fetching state-level time series...")
all_data = []
for s in states:
    try:
        pytrends.build_payload([keyword], geo=s, timeframe=timeframe)
        df = pytrends.interest_over_time().reset_index()
        if df.empty:
            print(f"⚠️ {s} returned empty data.")
            continue
        df = df.rename(columns={keyword: "gt_index"})
        df["state_code"] = s.split("-")[1]
        all_data.append(df)
        print(f"✅ {s} done. Waiting...")
        time.sleep(random.uniform(30, 55))  # avoid 429
    except Exception as e:
        print(f"❌ {s} failed: {e}")
        time.sleep(60)


▶️ Fetching state-level time series...
✅ US-AK done. Waiting...
✅ US-AL done. Waiting...
✅ US-AZ done. Waiting...
✅ US-AR done. Waiting...
✅ US-CA done. Waiting...
✅ US-CO done. Waiting...
✅ US-CT done. Waiting...
✅ US-DE done. Waiting...
✅ US-FL done. Waiting...
✅ US-GA done. Waiting...
✅ US-HI done. Waiting...
✅ US-ID done. Waiting...
✅ US-IL done. Waiting...
✅ US-IN done. Waiting...
✅ US-IA done. Waiting...
✅ US-KS done. Waiting...
✅ US-KY done. Waiting...
✅ US-LA done. Waiting...
✅ US-ME done. Waiting...
✅ US-MD done. Waiting...
✅ US-MA done. Waiting...
✅ US-MI done. Waiting...
✅ US-MN done. Waiting...
✅ US-MS done. Waiting...
✅ US-MO done. Waiting...
✅ US-MT done. Waiting...
✅ US-NE done. Waiting...
✅ US-NV done. Waiting...
✅ US-NH done. Waiting...
✅ US-NJ done. Waiting...
✅ US-NM done. Waiting...
✅ US-NY done. Waiting...
✅ US-NC done. Waiting...
✅ US-ND done. Waiting...
✅ US-OH done. Waiting...
✅ US-OK done. Waiting...
✅ US-OR done. Waiting...
✅ US-PA done. Waiting...
✅ US-RI do

In [15]:
gt_raw = pd.concat(all_data, ignore_index=True)  # stack all states; drop the repeated per-state row index
gt_raw.to_csv(f"{save_dir}/gt_raw.csv", index=False)
print("✅ All state series saved.")

✅ All state series saved.


## Step 3: Anchor-based Rescaling

Each state’s time series is rescaled by the ratio between the long-term average of California and that of the corresponding state:

$$
\tilde{GT}_{r,t} = GT_{r,t} \times
\frac{\bar{GT_{CA}}}{\bar{GT_r}}
$$

where  
   - $GT_{r,t}$ is the raw Google Trends index for state $r$ at time $t$,  
   - $\bar{GT_r}$ is the state’s long-term average from Step 1,  
   - $\bar{GT_{CA}}$ is the anchor state’s long-term average.
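
As a worked illustration of the formula, the sketch below applies it to two states with made-up numbers. The averages and index values are hypothetical and chosen only to make the arithmetic easy to follow; the column names mirror those used in this notebook.

```python
import pandas as pd

# Hypothetical long-term averages (Step 1 style)
region = pd.DataFrame({"state_code": ["CA", "TX"], "avg_index": [80.0, 100.0]})

# Hypothetical monthly series (Step 2 style)
series = pd.DataFrame({
    "date": pd.to_datetime(["2018-01-01", "2018-02-01"] * 2),
    "state_code": ["CA", "CA", "TX", "TX"],
    "gt_index": [50, 100, 40, 60],
})

# Scale factor = anchor average / state average
anchor_mean = region.loc[region["state_code"] == "CA", "avg_index"].iloc[0]
region["scale_factor"] = anchor_mean / region["avg_index"]

# Attach each state's factor to its rows and rescale
scaled = series.merge(region[["state_code", "scale_factor"]], on="state_code", how="left")
scaled["scaled_index"] = scaled["gt_index"] * scaled["scale_factor"]
print(scaled)
```

Here California keeps its values (factor 1.0), while the Texas values are multiplied by 80 / 100 = 0.8.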

In [18]:
state_map = {
    "Alabama": "AL", "Alaska": "AK", "Arizona": "AZ", "Arkansas": "AR", "California": "CA",
    "Colorado": "CO", "Connecticut": "CT", "Delaware": "DE", "Florida": "FL", "Georgia": "GA",
    "Hawaii": "HI", "Idaho": "ID", "Illinois": "IL", "Indiana": "IN", "Iowa": "IA",
    "Kansas": "KS", "Kentucky": "KY", "Louisiana": "LA", "Maine": "ME", "Maryland": "MD",
    "Massachusetts": "MA", "Michigan": "MI", "Minnesota": "MN", "Mississippi": "MS",
    "Missouri": "MO", "Montana": "MT", "Nebraska": "NE", "Nevada": "NV", "New Hampshire": "NH",
    "New Jersey": "NJ", "New Mexico": "NM", "New York": "NY", "North Carolina": "NC",
    "North Dakota": "ND", "Ohio": "OH", "Oklahoma": "OK", "Oregon": "OR", "Pennsylvania": "PA",
    "Rhode Island": "RI", "South Carolina": "SC", "South Dakota": "SD", "Tennessee": "TN",
    "Texas": "TX", "Utah": "UT", "Vermont": "VT", "Virginia": "VA", "Washington": "WA",
    "West Virginia": "WV", "Wisconsin": "WI", "Wyoming": "WY", "District of Columbia": "DC"
}

region_df["state_code"] = region_df["state"].map(state_map)

In [20]:
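# Look up the anchor by state code so it follows `anchor_state` ("US-CA") instead of a hard-coded name
anchor_mean = region_df.loc[region_df["state_code"] == anchor_state.split("-")[1], "avg_index"].values[0]
# Mask zero averages as NaN so the ratio never divides by zero
region_df["scale_factor"] = anchor_mean / region_df["avg_index"].where(region_df["avg_index"] > 0)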


In [21]:
merged = gt_raw.merge(region_df[["state_code","scale_factor"]], on="state_code", how="left")
merged["scaled_index"] = merged["gt_index"] * merged["scale_factor"]
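
Because the merge is a left join, any state code without a matching scale factor silently ends up with `NaN` in `scaled_index`. One way to surface such rows is the `indicator` option of `DataFrame.merge`. The sketch below uses hypothetical toy frames; the unmatched code `PR` is only an example and does not appear in this notebook's state list.

```python
import pandas as pd

# Hypothetical toy frames with the notebook's column names
series = pd.DataFrame({"state_code": ["CA", "TX", "PR"], "gt_index": [50, 40, 10]})
factors = pd.DataFrame({"state_code": ["CA", "TX"], "scale_factor": [1.0, 0.8]})

# indicator=True adds a `_merge` column saying where each row was found
checked = series.merge(factors, on="state_code", how="left", indicator=True)
unmatched = sorted(checked.loc[checked["_merge"] == "left_only", "state_code"].unique())
print("codes without a scale factor:", unmatched)
```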

In [22]:
final_path = f"{save_dir}/gt_anchor_scaled.csv"
merged.to_csv(final_path, index=False)
print(f"\n✅ Fixed anchor-based rescaled data saved to:\n{final_path}")


✅ Fixed anchor-based rescaled data saved to:
/content/drive/MyDrive/google_trends_anchor/gt_anchor_scaled.csv
