# Touché 2022: Collection Version 0.0.2 for Task 2 

We discussed the following improvements of version 0.0.2 over 0.0.1:

- Remove duplicated IDs (version 0.0.1 contained them because of a pooling problem)
- Remove passages that are too short (fewer than 10 whitespace-separated terms)
  - concerns 5083 of the 1222231 passages
- Remove passages that are too long (more than 1024 terms)
  - concerns 584 of the 1222231 passages
- Remove near-duplicate passages
  - concerns 348512 of the 1222231 passages
- Combine everything into a single file
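
The rules in the list above can be read as one keep/drop decision per passage. The following is a minimal sketch, not the code that produced the collection: it assumes that terms are whitespace-separated tokens (as in the length check further down), and the names `keep_passage`, `seen_ids` and `near_duplicate_ids` as well as the toy passages are hypothetical.

```python
MIN_TERMS = 10    # passages with fewer terms count as too short
MAX_TERMS = 1024  # passages with more terms count as too long


def keep_passage(passage_id, text, seen_ids, near_duplicate_ids):
    """Return True if a passage survives all cleaning rules (hypothetical helper)."""
    if passage_id in seen_ids or passage_id in near_duplicate_ids:
        return False
    n_terms = len(text.split())  # assumption: terms are whitespace-separated
    return MIN_TERMS <= n_terms <= MAX_TERMS


# Toy passages (made up for illustration): (id, text)
toy = [
    ('a', 'too short'),                 # 2 terms: dropped
    ('b', ' '.join(['term'] * 12)),     # kept
    ('b', ' '.join(['term'] * 12)),     # repeated id: dropped
    ('c', ' '.join(['term'] * 2000)),   # too long: dropped
    ('d', ' '.join(['term'] * 50)),     # listed as near-duplicate: dropped
]

seen = set()
kept = []
for pid, text in toy:
    if keep_passage(pid, text, seen, near_duplicate_ids={'d'}):
        seen.add(pid)
        kept.append(pid)

print(kept)  # ['b']
```

Note that both bounds are inclusive here: a passage with exactly 10 or exactly 1024 terms is kept, matching the strict comparisons (`< 10`, `> 1024`) used in the notebook.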

In [1]:
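import json
from tqdm import tqdm

def all_lines(file_name):
    # Yield one parsed JSON object per line of the given file in the data-cleaning directory
    with open('/mnt/ceph/storage/data-in-progress/data-research/arguana/touche-shared-tasks/data/2022-task2/data-cleaning/' + file_name) as f:
        for line in tqdm(f):
            try:
                yield json.loads(line)
            except json.JSONDecodeError:
                # Skip lines that are not valid JSON
                continue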


In [2]:
too_short = set()
too_long = set()
docs = {}

for year in ['2020', '2021']:
    for i in all_lines(year + '-task2-passages-of-top100-docs.jsonl'):
        docs[i['id']] = i

        # Passage length in whitespace-separated terms
        length = len(i['fullyCanonicalizedContent'].split())

        if length > 1024:
            too_long.add(i['id'])
        if length < 10:
            too_short.add(i['id'])

595428it [01:58, 5005.52it/s]
691551it [02:17, 5013.84it/s]


In [3]:
len(docs)

1222231

In [4]:
len(too_short)

5083

In [5]:
len(too_long)

584

In [6]:
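duplicates = set()

for year in ['2020', '2021']:
    for i in all_lines('s3-scores-' + year + '-task2-passages-of-top100-docs.jsonl'):
        firstId = i['idPair']['left']
        secondId = i['idPair']['right']

        # A passage compared with itself is not a duplicate
        if firstId == secondId:
            continue

        # Sanity check: the left id of each pair must sort before the right id
        if firstId > secondId:
            raise ValueError(f'Unordered id pair: {firstId} > {secondId}')

        # This near-duplicate threshold is from previous studies.
        # Only the second passage of a pair is dropped; the first one is kept.
        if i['s3Score'] > 0.82:
            duplicates.add(secondId)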

12197648it [07:09, 28428.27it/s]
11373094it [07:53, 24024.03it/s]


In [7]:
print(len(duplicates))

348512
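
The selection rule in the cell above (drop the right-hand id of every pair whose S3 score exceeds 0.82) can be tried on a few toy records. This is only a sketch: the field names `idPair`, `left`, `right` and `s3Score` are taken from the notebook, while the function name `ids_to_drop`, the ids and the scores are made up for illustration.

```python
S3_THRESHOLD = 0.82  # threshold used in the notebook


def ids_to_drop(score_records, threshold=S3_THRESHOLD):
    """Collect the right-hand id of every pair scoring above the threshold (hypothetical helper)."""
    drop = set()
    for rec in score_records:
        left = rec['idPair']['left']
        right = rec['idPair']['right']
        if left != right and rec['s3Score'] > threshold:
            drop.add(right)
    return drop


# Made-up score records in the same shape as the s3-scores files
toy_records = [
    {'idPair': {'left': 'p1', 'right': 'p2'}, 's3Score': 1.0},
    {'idPair': {'left': 'p2', 'right': 'p3'}, 's3Score': 0.9},
    {'idPair': {'left': 'p1', 'right': 'p4'}, 's3Score': 0.5},
    {'idPair': {'left': 'p5', 'right': 'p5'}, 's3Score': 1.0},
]

dropped = ids_to_drop(toy_records)
print(sorted(dropped))  # ['p2', 'p3']
```

One consequence of this rule is visible in the toy data: in a chain of near-duplicates (`p1` ~ `p2`, `p2` ~ `p3`) every passage except the one with the smallest id is dropped, even if `p1` and `p3` were never scored as near-duplicates of each other.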


# Some example duplicates

In [16]:
#clueweb12-0915wb-42-00127___124 , clueweb12-0915wb-93-17218___124 --> 1.0

print(docs['clueweb12-0915wb-42-00127___124']['content'])
print('\n\n')
print(docs['clueweb12-0915wb-93-17218___124']['content'])

Where the market has done an efficient job in flushing out Chinese RTOs and other equities with unreliable accounting, the notion of hiring a team of lawyers to prop of a company with questionable financials, reconciled by non existent auditing firm is a dangerous blueprint for other Chinese companies to inflate their stock while management could possibly be selling stock into a bidding market. Chinese nationals in management positions, insulated as they are from any enforcement of US securities law can easily orchestrate the whole process. Therefore the SEC is a gatekeeper on a set of market integrity concerns which stretch far beyond the current Harbin drama. Proof that Tianfu Yang has NO INTENTION of Concluding the Proposed $24 Buyout of Harbin Electric It's really quite simple. If Tianfu Yang wanted to buy Harbin Electric he would have taken a different path. He knew Simo's cash couldn't be reconciled. He knew his gross margins from his antiquated factories couldn't possibly be dou

In [18]:
#clueweb12-0803wb-25-35631___2 , clueweb12-0808wb-93-02306___2 --> 0.9166666666666666

print(docs['clueweb12-0803wb-25-35631___2']['content'])
print('\n\n')
print(docs['clueweb12-0808wb-93-02306___2']['content'])

Local Business Search TrueLocal.com.au Business: Type Name In Printer Friendly Text size + - Your News Save the environment and money with our tips for a green Christmas Environment 20 Dec 11 @ 02:21pm by Nicole Has your food waste expanded over Christmas? Would you like to know how you can minimise your carbon footprint over the holiday season? The Nature Conservation Council of NSW offers practical advice and tips to help reduce the environmental impact of your festive season. Food and drink Plan all Christmas meals in advance and use a shopping list to ensure that you only purchase what is necessary. This will not only help the environment, but will also save money, time and improve nutrition. Buy local and organic food produce which is fresh, supports local farmers, and reduces food miles and transport emissions.



Local Business Search TrueLocal.com.au Business: Type Name In Printer Friendly Text size + - Your News Save the environment and money with our tips for a green Christma

In [19]:
#clueweb12-0803wb-67-20451___7 , clueweb12-0805wb-36-03834___7 --> 0.847926267281106

print(docs['clueweb12-0803wb-67-20451___7']['content'])
print('\n\n')
print(docs['clueweb12-0805wb-36-03834___7']['content'])

Rather than mailing a card, send friends and family a virtual card instead. The festive season does not have to be a time of over consumption and waste. By planning ahead you can minimise your carbon footprint and encourage others around you to do the same. Photo: Marju Randmer Write the News! Know something we don't? Help set the local agenda by writing your own news stories. Write news About the author Writer: Nicole Articles Written: 49 Joined: 9 August 2011 Related Articles Save the environment and money with our tips for a green Christmas Has your food waste expanded over Christmas? Would you like to know how you can minimise your carbon... Add your Comment Your name: * Email address: * Postcode: Comment: *(max 1200 characters) Verify: (type word into box on right)



Rather than mailing a card, send friends and family a virtual card instead. The festive season does not have to be a time of over consumption and waste. By planning ahead you can minimise your carbon footprint and en

# Verify the passages flagged as too short or too long

In [25]:
docs[[i for i in too_short][0]]['content']

'All Rights Reserved'

In [27]:
docs[[i for i in too_short][10]]['content']

'Author Name: Site contents ©2005 - 2010 Linux Install. Net'

In [26]:
docs[[i for i in too_long][0]]['content']

'March 2009 February 2009 January 2009 December 2008 November 2008 October 2008 September 2008 August 2008 July 2008 June 2008 May 2008 April 2008 March 2008 February 2008 January 2008 December 2007 November 2007 October 2007 September 2007 August 2007 July 2007 June 2007 May 2007 April 2007 March 2007 February 2007 January 2007 December 2006 November 2006 October 2006 September 2006 August 2006 July 2006 June 2006 May 2006 April 2006 March 2006 November 2005 October 2005 Categories Select Category “Accidents” “Atlantic Yards” “Gridlock” Sam Schwartz “Sustainable Streets” 2009 Transportation Bill 9th Avenue Renaissance 9th Street Road Diet Aaron Naparstek AARP AASHTO Ad Nauseam Adolfo Carrion Adrian Benepe Adriano Espaillat Air Quality Al Gore Alan Gerson Albany Reform Alex Marshall Allerton Amanda Burden Amsterdam Amtrak Andrew Cuomo Andrew Hevesi Andrew Lanza Andy Wiley-Schwartz Angela Glover Blackwell Anthony Weiner April Fool’s Day APTA Astoria Astrid Glynn Athens Athletes and Cele

In [28]:
docs[[i for i in too_long][10]]['content']



# Create Version 0.0.2 from 0.0.1 using the above information

In [10]:
import json

ids_already_covered = set()

with open('/mnt/ceph/storage/data-in-progress/data-research/arguana/touche-shared-tasks/data/2022-task2/touche-task2-passages-version-002.jsonl', 'w') as out_file:
    for year in ['2020', '2021']:
        for i in all_lines('../../' + year + '-task2-passages-of-top100-docs.jsonl'):
            doc_id = i['id']
            # Skip repeated ids, near-duplicates, and passages outside the length limits
            if doc_id in ids_already_covered or doc_id in duplicates or doc_id in too_short or doc_id in too_long:
                continue

            ids_already_covered.add(doc_id)
            out_file.write(json.dumps(i) + '\n')

print(len(ids_already_covered))

595432it [01:51, 5344.91it/s] 
691551it [02:05, 5518.96it/s] 

868655
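
As a quick consistency check, the counts printed in this notebook can be compared with each other. The sketch below only does arithmetic on those printed numbers; the variable names are hypothetical.

```python
# Counts as printed in the cells of this notebook
total = 1222231        # len(docs)
too_short_n = 5083     # len(too_short)
too_long_n = 584       # len(too_long)
near_dup_n = 348512    # len(duplicates)
kept = 868655          # passages written to version 0.0.2

removed = total - kept
sum_of_filters = too_short_n + too_long_n + near_dup_n
surplus = sum_of_filters - removed

print(removed)         # 353576
print(sum_of_filters)  # 354179
print(surplus)         # 603
```

The three filters together flag 603 more ids than were actually removed. A plausible reading is that some passages are flagged by more than one filter (for example, a short boilerplate passage that is also a near-duplicate), so they are counted more than once in the sum; this has not been verified against the data here.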





In [29]:
!gzip -c /mnt/ceph/storage/data-in-progress/data-research/arguana/touche-shared-tasks/data/2022-task2/touche-task2-passages-version-002.jsonl > /mnt/ceph/storage/data-in-progress/data-research/arguana/touche-shared-tasks/data/2022-task2/touche-task2-passages-version-002.jsonl.gz
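
The compressed file can be read back line by line without unpacking it first. The following is a minimal sketch using Python's standard `gzip` and `json` modules; the helper name `read_passages` is hypothetical, and the demo writes a tiny made-up file to a temporary directory instead of touching the real collection.

```python
import gzip
import json
import os
import tempfile


def read_passages(path):
    """Yield one passage dict per line of a gzip-compressed JSON Lines file."""
    with gzip.open(path, 'rt', encoding='utf-8') as f:
        for line in f:
            if line.strip():
                yield json.loads(line)


# Self-contained demo with made-up passages (not part of the collection)
demo_path = os.path.join(tempfile.mkdtemp(), 'demo.jsonl.gz')
with gzip.open(demo_path, 'wt', encoding='utf-8') as f:
    f.write(json.dumps({'id': 'demo___1', 'content': 'first passage'}) + '\n')
    f.write(json.dumps({'id': 'demo___2', 'content': 'second passage'}) + '\n')

ids = [p['id'] for p in read_passages(demo_path)]
print(ids)  # ['demo___1', 'demo___2']
```

For the real file, `read_passages` would be called with the path of `touche-task2-passages-version-002.jsonl.gz`; counting the yielded passages should then reproduce the 868655 printed above.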