# 1) Prepare data

To collect the submission URLs per hack id, we have to check each contest separately.
For each contest, there exists an overview page listing the submitted hacks.
We scan that list and keep the entries whose ids match the successful hacks we have already collected.
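
As a minimal sketch of this matching step, assume a contest's overview has already been reduced to a list of hack ids. The names `match_hacks`, `overview_ids`, and `successful_ids` are illustrative and not part of the notebook, and all id values except 265867 are made up. Ids are kept as strings because they are read from HTML attributes later on.

```python
def match_hacks(overview_ids, successful_ids):
    """Return the overview ids that are known successful hacks, in page order."""
    return [hack_id for hack_id in overview_ids if hack_id in successful_ids]


# Toy data: ids of successful hacks collected earlier (hypothetical values)
successful_ids = {"265867", "265901"}
# Toy data: ids found on one contest's overview page (hypothetical values)
overview_ids = ["265860", "265867", "265880", "265901"]

matched = match_hacks(overview_ids, successful_ids)
print(matched)
```

Using a set for the known ids keeps each membership test cheap, which matters when an overview lists many hacks.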

## Load successful hack data

In [6]:
import json
import os
from collections import defaultdict

In [7]:
with open("codehacks.json", encoding="utf-8") as f:
    data = json.load(f)
len(data)

288617

In [8]:
data[0]

{'id': 265867,
 'creationTimeSeconds': 1479638061,
 'hacker': {'contestId': 729,
  'members': [{'handle': 'TheWayISteppedOutTheCar'}],
  'participantType': 'CONTESTANT',
  'ghost': False,
  'room': 4,
  'startTimeSeconds': 1479632700},
 'defender': {'contestId': 729,
  'members': [{'handle': 'Dmozze'}],
  'participantType': 'CONTESTANT',
  'ghost': False,
  'room': 4,
  'startTimeSeconds': 1479632700},
 'verdict': 'HACK_SUCCESSFUL',
 'problem': {'contestId': 729,
  'index': 'A',
  'name': 'Interview with Oleg',
  'type': 'PROGRAMMING',
  'points': 500.0,
  'rating': 900,
  'tags': ['implementation', 'strings']},
 'judgeProtocol': {'protocol': 'Solution verdict:\nTIME_LIMIT_EXCEEDED\n\nChecker:\n\n\nInput:\n3\r\nogo\r\n\n\nOutput:\n\n\nAnswer:\n\n\nTime:\n1000\n\nMemory:\n0\n',
  'manual': 'false',
  'verdict': 'Successful hacking attempt'}}

## Group hack ids by contest

In [9]:
# Ids of finished contests in contests.json (list order reversed)
with open("contests.json") as file:
    contests = json.load(file)
contests = [contest["id"] for contest in contests["result"] if contest["phase"] == "FINISHED"][::-1]
len(contests)

# Finished contests in contest2.json (list order reversed)
with open("contest2.json") as file:
    contests2 = json.load(file)
contests2 = [contest for contest in contests2["result"] if contest["phase"] == "FINISHED"][::-1]
len(contests2)

# Contest ids that appear in the second file but not in the first
ids = {contest["id"] for contest in contests2 if contest["id"] not in contests}
len(ids)

126

In [13]:
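# Map each contest id to the set of its successful hack ids (as strings)
contest_hackIDs = defaultdict(set)
for hack in data:
    contestNR = hack["problem"]["contestId"]
    # Only keep hacks from the contests selected above
    if contestNR not in ids:
        continue
    contest_hackIDs[contestNR].add(str(hack["id"]))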

In [12]:
len(contest_hackIDs)

88

# 2) Crawl contest hack page

In the second step, we request the hack overview page of each contest and extract the submission link for every successful hack collected above.

In [15]:
import requests
import time
from bs4 import BeautifulSoup
import tqdm

In [16]:
for contestID, successfulHacks in tqdm.tqdm(contest_hackIDs.items()):
    url = f"https://codeforces.com/contest/{contestID}/hacks?showAll=true"
    page = requests.get(url)
    soup = BeautifulSoup(page.text, "html.parser")
    # Each hack is a table row that carries its id in the "challengeid" attribute
    for row in soup.find_all("tr", {"challengeid": True}):
        hackID = row["challengeid"]
        if hackID not in successfulHacks:
            continue

        # Extract the submission link from the first "small" span of the row
        submissionLinks = row.find_all("span", {"class": "small"})[0]
        submissionLink = submissionLinks.find_all("a", href=True)[0]
        submissionRef = submissionLink["href"]
        # Append the result as "contestID hackID submissionRef"
        with open('submissionUrls.txt', 'a') as file:
            content = f'{contestID} {hackID} {submissionRef} \n'
            file.write(content)

    # Pause between contests to limit the load on the server
    time.sleep(5)

100%|███████████████████████████████████████████| 87/87 [07:24<00:00,  5.11s/it]
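
The resulting file can be read back into a lookup table. The following is a minimal sketch, assuming each line has the form `contestID hackID submissionRef` as written by the loop above. The function name `parse_submission_urls` is hypothetical, and the submission path in the sample line is a placeholder, not real crawl output.

```python
def parse_submission_urls(lines):
    """Map each hack id to a (contest id, submission href) pair."""
    result = {}
    for line in lines:
        parts = line.split()
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        contest_id, hack_id, submission_ref = parts
        result[hack_id] = (int(contest_id), submission_ref)
    return result


# Sample input mimicking the file format (placeholder submission path)
sample = [
    "729 265867 /contest/729/submission/1 \n",
    "\n",
]
parsed = parse_submission_urls(sample)
print(parsed)
```

Because the function accepts any iterable of lines, an open file handle for `submissionUrls.txt` can be passed to it directly.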
