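The current algorithm: first, use BigQuery to rank all repositories by the number of stars they received over the last two days and take the top 1000. At this point each row has only two fields: repo_name and the two-day star count (the star_count column in the query). Then, for each of these 1000 rows, use the GitHub API to look up the repository's creation time and its current total star count, and sort by creation time.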

So repositories near the top of the result have gained many stars within two days and were created recently.
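
The two-day window described above maps onto one GH Archive day table per date. As a minimal sketch, the table names could be generated by a small helper instead of being written out one by one. The helper name `day_table_names` is hypothetical, and the use of UTC is an assumption (that the day tables are keyed by UTC date); neither comes from the original notebook.

```python
from datetime import datetime, timedelta, timezone


def day_table_names(now, n_days=2):
    """Return GH Archive day-table names for the last n_days, newest first.

    Hypothetical helper: `now` is the reference datetime, n_days the window size.
    """
    return [
        f"githubarchive.day.{(now - timedelta(days=offset)).strftime('%Y%m%d')}"
        for offset in range(n_days)
    ]


# Assumption: the day tables follow UTC dates, so take "now" in UTC.
tables = day_table_names(datetime.now(timezone.utc))
```

Each returned name can be wrapped in backticks and placed in the FROM clause of one branch of the UNION ALL.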

The main things to install are:
1. GCP CLI tools  https://cloud.google.com/sdk/docs/install?hl=zh-cn
2. Google BigQuery client library  https://cloud.google.com/bigquery/docs/reference/libraries?hl=zh-cn
3. tqdm + pandas


Follow the documentation linked above to set up GCP, mainly the authentication.

In my own testing, this runs fine even on a GCP account with no verified payment method.

In [15]:
from google.cloud import bigquery
from datetime import datetime, timedelta

# Construct a BigQuery client object.
client = bigquery.Client()

# Get today's date and the previous day's date
today = datetime.today()
yesterday = today - timedelta(days=1)

# Format the dates the way the BigQuery table names expect (YYYYMMDD)
today_str = today.strftime('%Y%m%d')
yesterday_str = yesterday.strftime('%Y%m%d')

# Build the query string
query = f"""
WITH watch_data AS (
  -- WatchEvent events from the most recent day
  SELECT 
    repo.name AS repo_name
  FROM 
    `githubarchive.day.{today_str}`
  WHERE 
    type = 'WatchEvent'
  
  UNION ALL
  
  -- WatchEvent events from the previous day
  SELECT 
    repo.name AS repo_name
  FROM 
    `githubarchive.day.{yesterday_str}`
  WHERE 
    type = 'WatchEvent'
)

SELECT 
  repo_name,
  COUNT(*) AS star_count
FROM 
  watch_data
GROUP BY 
  repo_name
ORDER BY 
  star_count DESC
LIMIT 1000
"""

# Run the query and wait for the result
rows = client.query(query)  # submit the query job
results = rows.result().to_dataframe()  # wait for completion and load into a DataFrame






In [16]:
import requests
import pandas as pd
from tqdm import tqdm
from concurrent.futures import ThreadPoolExecutor, as_completed


# GitHub GraphQL API URL
GRAPHQL_URL = "https://api.github.com/graphql"

# GitHub token (replace with your own token)
TOKEN = "your_token"

# Request headers
HEADERS = {
    "Authorization": f"bearer {TOKEN}",
    "Content-Type": "application/json"
}

# GraphQL query template
GRAPHQL_QUERY_TEMPLATE = """
query {{
  repository(owner: "{repo_owner}", name: "{repo_name}") {{
    createdAt
    stargazerCount
  }}
}}
"""

# Request function: fetch details for one repository
def fetch_repo_details(repo_name):
    """
    Use the GitHub GraphQL API to get a repository's creation date and total star count.
    """
    if "/" not in repo_name:
        return None, None  # invalid repo_name
    
    repo_owner, repo_name_only = repo_name.split("/", 1)
    
    query = GRAPHQL_QUERY_TEMPLATE.format(repo_owner=repo_owner, repo_name=repo_name_only)
    
    # Send the request
    try:
        response = requests.post(
            GRAPHQL_URL,
            json={"query": query},
            headers=HEADERS,
            timeout=30  # avoid a hung connection blocking a worker thread forever
        )
        # Check whether the request succeeded
        if response.status_code == 200:
            data = response.json()
            # Make sure both data and repository are present and valid
            if "data" in data and "repository" in data["data"]:
                repo_data = data["data"]["repository"]
                if repo_data is not None:
                    return repo_data.get("createdAt"), repo_data.get("stargazerCount")
                else:
                    print(f"Repository data is None for {repo_name}")
            else:
                print(f"Missing 'data' or 'repository' for {repo_name}")
        else:
            print(f"Failed to fetch data for {repo_name}: {response.status_code}, {response.text}")
    except requests.exceptions.RequestException as e:
        # Catch network-level errors raised by requests
        print(f"Request failed for {repo_name}: {str(e)}")
    
    # Return None for both fields when anything goes wrong
    return None, None

# Issue the requests in parallel
def fetch_repo_details_parallel(df):
    results = []
    
    # Use a ThreadPoolExecutor to run the requests concurrently
    with ThreadPoolExecutor(max_workers=10) as executor:
        futures = {executor.submit(fetch_repo_details, row["repo_name"]): index for index, row in df.iterrows()}
        
        # Show a progress bar
        for future in tqdm(as_completed(futures), total=len(futures), desc="Fetching repo details"):
            index = futures[future]
            created_at, stargazer_count = future.result()
            df.at[index, "created_at"] = created_at
            df.at[index, "current_star_count"] = stargazer_count
    
    return df

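# Start from the DataFrame produced by the previous cell
df = results  # DataFrame returned by the BigQuery query

# Add two columns: creation date and total star count
df["created_at"] = None
df["current_star_count"] = None

# Fetch the repository details in parallel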
df = fetch_repo_details_parallel(df)


Fetching repo details:   0%|          | 0/1000 [00:00<?, ?it/s]

Fetching repo details:   5%|▌         | 50/1000 [00:01<00:28, 33.57it/s]

Repository data is None for PasinduSmarasingha/rust-h4ck-free
Repository data is None for Kuyaa06/counter-str1ke-2-h4ck
Repository data is None for xynqx/Roblox-Blox-Fruits-Script-2024
Repository data is None for RoyalKnightD/Spotify-Premium-for-free-2024
Repository data is None for DELDELHENRY2/r0b10x-synapse-x-free
Repository data is None for 1ofzp1/Discord-AllinOne-Tool
Repository data is None for carloscupu/IObit-Driver-Booster-Pro-2024-free-Serial-Key


Fetching repo details:   6%|▌         | 55/1000 [00:01<00:33, 28.51it/s]

Repository data is None for ArkManace/SketchUp-Pro-free-2024


Fetching repo details:   7%|▋         | 68/1000 [00:02<00:31, 29.49it/s]

Repository data is None for ArdaKundurayapan/Al-Photoshop-2024
Repository data is None for XxLordxXx/Adobe-Express-2024
Repository data is None for daveolio/counter-str1ke-2-h4ck
Repository data is None for anthonnyjohn/hack-apex-1egend


Fetching repo details:   8%|▊         | 76/1000 [00:02<00:31, 29.49it/s]

Repository data is None for shadowstrike2/FL-Studio
Repository data is None for CoderCosmics/Exit1ag-Free-2024
Repository data is None for mersa31/ESET-KeyGen-2024
Repository data is None for NoPressure000/Discord-AllinOne-Tool
Repository data is None for dsdsadfdfs2323rw/Dayz-Cheat-H4ck-A1mb0t
Repository data is None for SamsongT/h4ck-f0rtnite


Fetching repo details:   8%|▊         | 85/1000 [00:03<00:29, 30.54it/s]

Repository data is None for jubemar/League-0f-Legends-h4ck
Repository data is None for brumikdev/IObit-Driver-Booster-Pro-2024-free-Serial-Key
Repository data is None for modore/m0dmenu-gta5-free
Repository data is None for shaikhhaareess/Roblox-Blox-Fruits-Script-2024
Repository data is None for FusenTG/IDM-Activation-Script-2024
Repository data is None for GunsJoez/SilenceGen


Fetching repo details:  10%|▉         | 96/1000 [00:03<00:29, 30.80it/s]

Repository data is None for JohnBluess/PhotoDiva-Pro-free-2024
Repository data is None for anuj4207/SketchUp-Pro-free-2024
Repository data is None for Luhbob/OpenSea-Bidding-Bot-2024
Repository data is None for DuongSuper/Nexus-Roblox
Repository data is None for Unknownshadowterror/NitroDreams-2024
Repository data is None for hrjossmahmud/roblox-scr1pts-s0lara


Fetching repo details:  10%|█         | 105/1000 [00:03<00:26, 33.57it/s]

Repository data is None for DelDD/SoLBF
Repository data is None for hyper8-codefien/Wave-Executor
Repository data is None for popeyerollvn/cheat-escape-from-tarkov
Repository data is None for Maurice1001/SonyVegas-2024
Repository data is None for fafadk/hack-apex-1egend
Repository data is None for Tyiscola/r0b10x-synapse-x-free
Repository data is None for sabbir32981/minecraft-cheat2024


Fetching repo details:  11%|█         | 109/1000 [00:03<00:29, 29.71it/s]

Repository data is None for Roki0217/rust-h4ck-free
Repository data is None for minhtan2010/Valorant-H4ck
Repository data is None for ntziz2/Rainbow-S1x-Siege-Cheat
Repository data is None for JojiDevelopment/Al-Photoshop-2024
Repository data is None for yukihamaaa/Wemod-Premium-Unlocker-2024
Repository data is None for eddadawdwda/counter-str1ke-2-h4ck


Fetching repo details:  12%|█▏        | 117/1000 [00:04<00:27, 31.97it/s]

Repository data is None for kaidenp11/Xbox-Game-Pass-Activator-Free-2024


Fetching repo details:  12%|█▏        | 121/1000 [00:04<00:28, 30.82it/s]

Repository data is None for JokenPlayz/Spotify-Premium-for-free-2024
Repository data is None for Frankkubas123/Dayz-Cheat-H4ck-A1mb0t
Repository data is None for lokinn005/Adobe-Express-2024


Fetching repo details:  14%|█▍        | 139/1000 [00:04<00:31, 27.13it/s]

Repository data is None for thisiscindychou/roblox-solara-executors
Repository data is None for timigsh2mos/fortnite-hack-external


Fetching repo details:  38%|███▊      | 383/1000 [00:13<00:23, 26.68it/s]

Repository data is None for Dilodova/Roblox-Executor-Xeno-v1.0.9


Fetching repo details:  50%|█████     | 502/1000 [00:17<00:17, 28.80it/s]

Repository data is None for spear-blackseeker/Solara-Executor-Roblox


Fetching repo details:  55%|█████▍    | 545/1000 [00:19<00:14, 30.80it/s]

Repository data is None for beastamya8/Roblox-Synapse-X


Fetching repo details:  56%|█████▌    | 556/1000 [00:19<00:13, 32.72it/s]

Repository data is None for kotskirk852/verse-spoofer


Fetching repo details: 100%|██████████| 1000/1000 [00:34<00:00, 28.63it/s]
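
The template used in fetch_repo_details interpolates the owner and repository name directly into the query text. As a hedged alternative sketch, the same two fields can be requested with GraphQL variables, so that unusual characters in a name cannot alter the query. The names `REPO_QUERY` and `build_repo_payload` are hypothetical and not part of the original notebook.

```python
# Query with declared variables instead of string interpolation.
REPO_QUERY = """
query($owner: String!, $name: String!) {
  repository(owner: $owner, name: $name) {
    createdAt
    stargazerCount
  }
}
"""


def build_repo_payload(full_name):
    """Build the JSON body for one 'owner/name' repository, or None if malformed."""
    if "/" not in full_name:
        return None
    owner, name = full_name.split("/", 1)
    return {"query": REPO_QUERY, "variables": {"owner": owner, "name": name}}


payload = build_repo_payload("Tencent/HunyuanVideo")
```

The resulting dictionary would be passed as the `json=` argument of `requests.post`, in place of `{"query": query}`.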


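Entries reported as None are mostly repositories that have star records but have since been made private or deleted. I opened some of them before they were removed: they were all created by different accounts, yet their contents were identical. There was no code, only a README telling people to download some exe. Their creation times are very close together, and so are their star counts. The repository names are also all odd and do not look like they were written for human readers.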

I suspect some organization is running a social-engineering malware distribution campaign.
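
One way to make the "different accounts, identical content" pattern visible is to group the missing repositories by repository name and list the names that appear under more than one owner. The sketch below is illustrative only: `shared_repo_names` is a hypothetical helper, and the sample list is a handful of names copied from the log output above.

```python
from collections import defaultdict


def shared_repo_names(full_names, min_owners=2):
    """Map each repository name to its owners, keeping names used by several owners."""
    owners_by_name = defaultdict(set)
    for full_name in full_names:
        if "/" not in full_name:
            continue  # skip malformed entries
        owner, name = full_name.split("/", 1)
        owners_by_name[name].add(owner)
    return {
        name: sorted(owners)
        for name, owners in owners_by_name.items()
        if len(owners) >= min_owners
    }


# A few of the repositories reported as None in the log above.
missing = [
    "PasinduSmarasingha/rust-h4ck-free",
    "Roki0217/rust-h4ck-free",
    "Kuyaa06/counter-str1ke-2-h4ck",
    "daveolio/counter-str1ke-2-h4ck",
    "shadowstrike2/FL-Studio",
]
duplicates = shared_repo_names(missing)
```

A repeated name is only a hint, not proof of coordination, but combined with near-identical creation times and star counts it helps separate this cluster from ordinary deleted repositories.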

In [20]:
df

Unnamed: 0,repo_name,star_count,created_at,current_star_count
0,Tencent/HunyuanVideo,2094,2024-11-28 08:38:31+00:00,2332
1,LadybirdBrowser/ladybird,1705,2024-05-30 09:18:10+00:00,24964
2,huggingface/smol-course,1432,2024-11-25 19:22:43+00:00,1659
3,myhhub/stock,1030,2023-03-21 01:23:26+00:00,5273
4,lobehub/lobe-chat,937,2023-05-21 07:19:12+00:00,47133
...,...,...,...,...
995,CorentinTh/it-tools,28,2020-04-05 11:50:24+00:00,23318
996,PostHog/posthog,28,2020-01-23 22:46:58+00:00,22390
997,mifi/lossless-cut,28,2016-10-30 10:49:56+00:00,28316
998,interledger/open-payments-snippets,28,2023-09-21 12:29:10+00:00,35


In [21]:
# Make sure created_at is a datetime column
df["created_at"] = pd.to_datetime(df["created_at"])

# Sort by created_at in descending order (most recently created first)
df_sorted = df.sort_values(by="created_at", ascending=False)


In [22]:
df_sorted.to_csv('result.csv')