# TREC CrisisFACTs Track 2022 Tutorial

This notebook illustrates how to download the TREC 2022 CrisisFACTs event streams along with the information needs for each one.

## Downloading the Track Data

Below, we walk you through the steps for downloading the CrisisFACTS data using the `ir_datasets` package.

After downloading, we demonstrate converting this data into a Pandas DataFrame for quick inspection of the content associated with a given event-day pair.

<hr> 


**Part 1: Installing Needed Packages**

Before we can get the data, we need to install some packages to handle the download process. In particular, we are going to install one main package:

*   ir_datasets (https://github.com/allenai/ir_datasets): A Python package that provides a common interface to many IR ad-hoc ranking benchmarks, training datasets, etc. We use it to download the raw event streams and the information needs for each event.


In [1]:
!pip install --upgrade git+https://github.com/allenai/ir_datasets.git@crisisfacts # install ir_datasets (crisisfacts branch)



<hr> 



**Part 2: Initializing Your Credentials**

When you want to download part of the CrisisFACTs dataset, we require that you provide a set of contact details. The reason for this is twofold: 1) the terms of service from some of the platforms (like Twitter) from which we have sourced data require us to do so, and 2) it allows us to collect statistics on how many people are making use of the data we provide.

**GDPR Statement**: By downloading the CrisisFACTs datasets, you agree to the University of Glasgow processing your personal data (your name and email in this case), as defined by the EU General Data Protection Regulation (GDPR). Queries about data processing and access/deletion requests should be sent to [me via email](http://www.dcs.gla.ac.uk/~richardm/Home/Contact.html). We will store your data for as long as the track is ongoing and for up to 2 years beyond that. I may contact you using the details provided to notify you about changes in the datasets or track, to provide information or ask you questions about your participation, or otherwise contact you about topics relevant to emergency management. We may collate statistics from the provided information that will be published, but we will not release individual names or email addresses.

Rather than entering these details every time you request the dataset, it's more efficient to set them once up front, so fill in your details below:

In [2]:
credentials = {
    "institution": "<University/Agency Name>", # University, Company or Public Agency Name
    "contactname": "<Your Name>", # Your Name
    "email": "<Your Email>", # A contact email address
    "institutiontype": "<Research | Industry | Public Sector>" # Either 'Research', 'Industry', or 'Public Sector'
}

# Write this to a file so it can be read when needed
import json
import os

home_dir = os.path.expanduser('~')

!mkdir -p ~/.ir_datasets/auth/
with open(home_dir + '/.ir_datasets/auth/crisisfacts.json', 'w') as f:
    json.dump(credentials, f)
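
The `!mkdir` line above is a notebook shell escape, so it only works inside Jupyter. As a minimal sketch, the same file could be written in plain Python. The helper name `write_credentials` and its `base_dir` parameter are illustrative assumptions (they are not part of `ir_datasets`), and the demo writes to a temporary directory rather than your real home directory.

```python
import json
import os
import tempfile


def write_credentials(credentials, base_dir):
    """Write credentials to <base_dir>/.ir_datasets/auth/crisisfacts.json."""
    auth_dir = os.path.join(base_dir, ".ir_datasets", "auth")
    # exist_ok=True mirrors `mkdir -p`: no error if the directory exists
    os.makedirs(auth_dir, exist_ok=True)
    path = os.path.join(auth_dir, "crisisfacts.json")
    with open(path, "w") as f:
        json.dump(credentials, f)
    return path


# Placeholder values only; replace them with your own details
demo_credentials = {
    "institution": "<University/Agency Name>",
    "contactname": "<Your Name>",
    "email": "<Your Email>",
    "institutiontype": "Research",
}

with tempfile.TemporaryDirectory() as tmp:
    path = write_credentials(demo_credentials, tmp)
    # Read the file back to confirm it round-trips as valid JSON
    with open(path) as f:
        loaded = json.load(f)

print(loaded == demo_credentials)
```

To write the real file, one would pass `os.path.expanduser('~')` as `base_dir`.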

<hr> 

**Part 3: Understanding the structure of the CrisisFACTs Dataset**

The CrisisFACTs dataset is divided into events, each representing a real-world crisis. Each event is given an identifier, e.g. 'CrisisFACTS-001' is the Lilac Wildfire from 2017. We sometimes refer to the event number or 'eventNo', which is the last three digits of the event identifier, e.g. '001'. The list below contains 18 event numbers ('001' to '018'):

In [3]:
# Event numbers as a list
eventNoList = [
        "001", # Lilac Wildfire 2017
        "002", # Cranston Wildfire 2018
        "003", # Holy Wildfire 2018
        "004", # Hurricane Florence 2018
        "005", # 2018 Maryland Flood
        "006", # Saddleridge Wildfire 2019
        "007", # Hurricane Laura 2020
        "008", # Hurricane Sally 2020
        "009", # Beirut Explosion, 2020
        "010", # Houston Explosion, 2020
        "011", # Rutherford TN Floods, 2020
        "012", # TN Derecho, 2020
        "013", # Edenville Dam Fail, 2020
        "014", # Hurricane Dorian, 2019
        "015", # Kincade Wildfire, 2019
        "016", # Easter Tornado Outbreak, 2020
        "017", # Tornado Outbreak, 2020 Apr
        "018", # Tornado Outbreak, 2020 March
]
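
The relationship between event numbers, event identifiers, and the per-day dataset strings used later in this notebook can be sketched with a few small helpers. The function names here are hypothetical conveniences; only the string formats (`CrisisFACTS-001` and `crisisfacts/001/2017-12-07`) come from this tutorial.

```python
def event_id_from_no(event_no):
    """Build the event identifier, e.g. '001' -> 'CrisisFACTS-001'."""
    return "CrisisFACTS-" + event_no


def event_no_from_id(event_id):
    """Recover the event number: the last three digits of the identifier."""
    return event_id[-3:]


def dataset_id_for(event_no, date_string):
    """Build the per-day dataset string, e.g. 'crisisfacts/001/2017-12-07'."""
    return "crisisfacts/%s/%s" % (event_no, date_string)


print(event_id_from_no("001"))
print(event_no_from_id("CrisisFACTS-001"))
print(dataset_id_for("001", "2017-12-07"))
```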

Each event has a duration, i.e. it lasts for a number of days. In the CrisisFACTs track, you need to produce a timeline summary for each day for a set of events. You can get the list of days for an event as shown below (example is for event "001", i.e. the Lilac Wildfire 2017):

In [4]:
import requests

# Gets the list of days for a specified event number, e.g. '001'
def getDaysForEventNo(eventNo):

    # We will download a file containing the day list for an event
    url = "http://trecis.org/CrisisFACTs/CrisisFACTS-"+eventNo+".requests.json"

    # Download the list and parse as JSON
    dayList = requests.get(url).json()

    # Each day object in the list contains the following fields:
    #   {
    #      "eventID" : "CrisisFACTS-001",
    #      "requestID" : "CrisisFACTS-001-r3",
    #      "dateString" : "2017-12-07",
    #      "startUnixTimestamp" : 1512604800,
    #      "endUnixTimestamp" : 1512691199
    #   }

    return dayList

for day in getDaysForEventNo(eventNoList[0]):
    print(day["dateString"])

2017-12-07
2017-12-08
2017-12-09
2017-12-10
2017-12-11
2017-12-12
2017-12-13
2017-12-14
2017-12-15


Now that we know the request strings for each event and day, we can download the associated stream for each via `ir_datasets`.

**Part 1. Installing Packages**

Search engines are a core part of the online information space, and much work has gone into making this technology accessible and easy to develop. A major package in this space that is designed to facilitate experimentation with search and information-retrieval methods is `Terrier` and its related Python bindings. To use this library, we install the following packages:

*   pyTerrier (https://pyterrier.readthedocs.io/en/latest/): pyTerrier is a Python wrapper around the Terrier IR Platform (a search engine in a box). We will use this to produce a searchable index for each day during a crisis event, so we can retrieve (hopefully) relevant content for different information needs.

In [5]:
!pip install --upgrade python-terrier # install pyTerrier


# Simple query-based baseline


In [6]:
eventsMeta = {}

for eventNo in eventNoList: # for each event
    print("Event "+eventNo)
    
    dailyInfo = getDaysForEventNo(eventNo) # get the list of days
    eventsMeta[eventNo]= dailyInfo
    for day in dailyInfo: # for each day
        print("  crisisfacts/"+eventNo+"/"+day["dateString"], "-->", day["requestID"]) # construct the request string

    print()

Event 001
  crisisfacts/001/2017-12-07 --> CrisisFACTS-001-r3
  crisisfacts/001/2017-12-08 --> CrisisFACTS-001-r4
  crisisfacts/001/2017-12-09 --> CrisisFACTS-001-r5
  crisisfacts/001/2017-12-10 --> CrisisFACTS-001-r6
  crisisfacts/001/2017-12-11 --> CrisisFACTS-001-r7
  crisisfacts/001/2017-12-12 --> CrisisFACTS-001-r8
  crisisfacts/001/2017-12-13 --> CrisisFACTS-001-r9
  crisisfacts/001/2017-12-14 --> CrisisFACTS-001-r10
  crisisfacts/001/2017-12-15 --> CrisisFACTS-001-r11

Event 002
  crisisfacts/002/2018-07-25 --> CrisisFACTS-002-r1
  crisisfacts/002/2018-07-26 --> CrisisFACTS-002-r2
  crisisfacts/002/2018-07-27 --> CrisisFACTS-002-r3
  crisisfacts/002/2018-07-28 --> CrisisFACTS-002-r4
  crisisfacts/002/2018-07-29 --> CrisisFACTS-002-r5
  crisisfacts/002/2018-07-30 --> CrisisFACTS-002-r6

Event 003
  crisisfacts/003/2018-08-06 --> CrisisFACTS-003-r5
  crisisfacts/003/2018-08-07 --> CrisisFACTS-003-r6
  crisisfacts/003/2018-08-08 --> CrisisFACTS-003-r7
  crisisfacts/003/2018-08-09 -

In [7]:
import pandas as pd
import numpy as np
import pyterrier as pt

import ir_datasets

In [8]:
def get_retriever(datasetID):
    # Initialize pyTerrier if not started
    if not pt.started():
        pt.init()

    # Ask pyTerrier to download the dataset, the 'irds:' header tells pyTerrier to use ir_datasets as the data source
    pyTerrierDataset = pt.get_dataset(f'irds:{datasetID}')
    # To create the index, we use an 'indexer', which iterates over the documents in the collection and adds them to the index
    # The parameters of this call are:
    #  Index Storage Path: "None" (some index types write to disk, this would be the directory to write to)
    #  Index Type: type=pt.index.IndexingType(3) (Type 3 is a Memory Index)
    #  Meta Index Fields: meta=['docno', 'text'] (The index also can store raw fields so they can be attached to the search results, this specifies what fields to store)
    #  Meta Index Lengths: meta_lengths=[40, 200] (pyTerrier allocates a fixed amount of storage space per field, how many characters should this be?)
    indexer = pt.IterDictIndexer(
        "None", 
        type=pt.index.IndexingType(3), 
        meta=['docno', 'text'], 
        meta_lengths=[40, 200],
    )

    # Trigger the indexing process
    index = indexer.index(pyTerrierDataset.get_corpus_iter())
    retriever = pt.BatchRetrieve(
        index, 
        wmodel="DFReeKLIM", 
        metadata=["docno", "text"],
    )
    return retriever

def get_quries_hits(retriever, queries):
    # For every retrieved document, record which queries retrieved it
    # and the retrieval score it received for each of those queries
    event_queries = {}
    event_scores = {}

    for _, query in queries.iterrows():
        # Search the index using the query's indicative terms
        results = pd.DataFrame(retriever.search(query['indicative_terms']))

        for _, result in results.iterrows():
            docno = result['docno']
            event_queries.setdefault(docno, []).append(query['query_id'])
            event_scores.setdefault(docno, []).append(result['score'])

    return event_queries, event_scores
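
The aggregation step above can be illustrated without pyTerrier. The sketch below applies the same bookkeeping to hand-made results. The function name `collect_hits`, the input layout, and the toy query and document IDs are all illustrative assumptions, not track data.

```python
def collect_hits(results_by_query):
    """Map each document to the queries that retrieved it and to its scores.

    results_by_query: {query_id: [(docno, score), ...]}
    """
    doc_queries = {}
    doc_scores = {}
    for query_id, hits in results_by_query.items():
        for docno, score in hits:
            doc_queries.setdefault(docno, []).append(query_id)
            doc_scores.setdefault(docno, []).append(score)
    return doc_queries, doc_scores


# Toy retrieval results (made-up IDs and scores)
toy_results = {
    "q1": [("d1", 2.0), ("d2", 1.0)],
    "q2": [("d1", 0.5)],
}

doc_queries, doc_scores = collect_hits(toy_results)
print(doc_queries)  # {'d1': ['q1', 'q2'], 'd2': ['q1']}
print(doc_scores)   # {'d1': [2.0, 0.5], 'd2': [1.0]}
```

A document retrieved by many queries ends up with a longer list, which is what the count-based importance later in this notebook relies on.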

In [21]:
# Build the document cache for each CrisisFACTS event-day request.
# All data goes to the `ir_datasets/crisisfacts` directory,
# which on Unix-type machines is `~/.ir_datasets/crisisfacts/`
for eventId,dailyInfo in eventsMeta.items():

    for thisDay in dailyInfo:
        
        requestID = thisDay["requestID"]
        ir_dataset_id = "crisisfacts/%s/%s" % (eventId, thisDay["dateString"])        
        
        dataset = ir_datasets.load(ir_dataset_id)
        print(dataset.dataset_id())
        
        itemsAsDataFrame = pd.DataFrame(dataset.docs_iter())
        print("\tDataset Size:", itemsAsDataFrame.shape[0])
        
        queries = pd.DataFrame(dataset.queries_iter())
        print("\tQuery Count:", queries.shape[0])
        
        # Both of these should be true. If you get an error here,
        # it suggests a bug popped up during data download.
        # Try re-running this cell.
        assert itemsAsDataFrame.shape[0] > 0
        assert queries.shape[0] > 0

crisisfacts/001/2017-12-07
	Dataset Size: 7288
	Query Count: 52
crisisfacts/001/2017-12-08
	Dataset Size: 19231
	Query Count: 52
crisisfacts/001/2017-12-09
	Dataset Size: 5839
	Query Count: 52
crisisfacts/001/2017-12-10
	Dataset Size: 4407
	Query Count: 52
crisisfacts/001/2017-12-11
	Dataset Size: 3394
	Query Count: 52
crisisfacts/001/2017-12-12
	Dataset Size: 2805
	Query Count: 52
crisisfacts/001/2017-12-13
	Dataset Size: 2658
	Query Count: 52
crisisfacts/001/2017-12-14
	Dataset Size: 2728
	Query Count: 52
crisisfacts/001/2017-12-15
	Dataset Size: 2665
	Query Count: 52
crisisfacts/002/2018-07-25
	Dataset Size: 5056
	Query Count: 52
crisisfacts/002/2018-07-26
	Dataset Size: 7866
	Query Count: 52
crisisfacts/002/2018-07-27
	Dataset Size: 7433
	Query Count: 52
crisisfacts/002/2018-07-28
	Dataset Size: 5238
	Query Count: 52
crisisfacts/002/2018-07-29
	Dataset Size: 4691
	Query Count: 52
crisisfacts/002/2018-07-30
	Dataset Size: 251
	Query Count: 52
crisisfacts/003/2018-08-06
	Dataset Size

In [22]:
rows = []

for eventId,dailyInfo in eventsMeta.items():

    for thisDay in dailyInfo:
        
        requestID = thisDay["requestID"]
        ir_dataset_id = "crisisfacts/%s/%s" % (eventId, thisDay["dateString"])        
        
        # Should use the cached data we downloaded above
        dataset = ir_datasets.load(ir_dataset_id)        
        queries = pd.DataFrame(dataset.queries_iter())

        retriever = get_retriever(ir_dataset_id)
        event_queries, event_scores = get_quries_hits(retriever,queries)
        
        itemsAsDataFrame = pd.DataFrame(dataset.docs_iter())

        for key,value in event_scores.items():
            fact = itemsAsDataFrame.query(f"doc_id== '{key}'").iloc[0]
            
            rows.append({
                'requestID' : requestID, 
                'factText' : fact["text"], 
                'unixTimestamp' : fact["unix_timestamp"],
                "importance_count" : len(value),
                "importance_sum" : np.sum(value),
                "sources" : [key],
                "streamID" : key,
                "informationNeeds" : event_queries[key]
            })

output = pd.DataFrame(rows)

crisisfacts/001/2017-12-07 documents: 7288it [00:01, 6492.80it/s]
crisisfacts/001/2017-12-08 documents: 19231it [00:04, 4506.85it/s]
crisisfacts/001/2017-12-09 documents: 5839it [00:01, 5270.65it/s]
crisisfacts/001/2017-12-10 documents: 4407it [00:00, 7515.32it/s]
crisisfacts/001/2017-12-11 documents: 3394it [00:00, 8335.85it/s]
crisisfacts/001/2017-12-12 documents: 2805it [00:00, 7850.02it/s]
crisisfacts/001/2017-12-13 documents: 2658it [00:00, 9182.69it/s]
crisisfacts/001/2017-12-14 documents: 2728it [00:00, 11050.82it/s]
crisisfacts/001/2017-12-15 documents: 2665it [00:00, 6732.35it/s]
crisisfacts/002/2018-07-25 documents: 5056it [00:00, 5296.69it/s]
crisisfacts/002/2018-07-26 documents: 7866it [00:01, 6586.86it/s]
crisisfacts/002/2018-07-27 documents: 7433it [00:00, 7481.62it/s]
crisisfacts/002/2018-07-28 documents: 5238it [00:00, 8450.59it/s]
crisisfacts/002/2018-07-29 documents: 4691it [00:00, 9237.29it/s]
crisisfacts/002/2018-07-30 documents: 251it [00:00, 9681.89it/s]
crisisfacts/003/2018-08-06 documents: 4402it [00:00, 10274.53it/s]
crisisfacts/003/2018-08-07 documents: 5727it [00:00, 7191.29it/s]
crisisfacts/003/2018-08-08 documents: 5925it [00:00, 6962.10it/s]
crisisfacts/003/2018-08-09 documents: 5984it [00:00, 7885.32it/s]
crisisfacts/003/2018-08-10 documents: 6233it [00:00, 7832.42it/s]
crisisfacts/003/2018-08-12 documents: 4023it [00:00, 8759.50it/s]
crisisfacts/003/2018-08-13 documents: 195it [00:00, 10939.17it/s]
crisisfacts/004/2018-09-01 documents: 7959it [00:00, 19636.65it/s]
crisisfacts/004/2018-09-04 documents: 503it [00:00, 16659.18it/s]
crisisfacts/004/2018-09-05 documents: 1559it [00:00, 18428.32it/s]
crisisfacts/004/2018-09-07 documents: 3329it [00:00, 22686.28it/s]
crisisfacts/004/2018-09-08 documents: 6056it [00:00, 22378.00it/s]
crisisfacts/004/2018-09-09 documents: 4000it [00:00, 22142.24it/s]
crisisfacts/004/2018-09-10 documents: 38738it [00:02, 17758.02it/s]
crisisfacts/004/2018-09-11 documents: 55465it [00:03, 15395.03it/s]
crisisfacts/004/2018-09-12 documents: 78467it [00:05, 13595.77it/s]
crisisfacts/004/2018-09-13 documents: 64442it [00:04, 13582.78it/s]
crisisfacts/004/2018-09-14 documents: 58610it [00:05, 10591.73it/s]
crisisfacts/004/2018-09-15 documents: 1000it [00:00, 17896.54it/s]
crisisfacts/004/2018-09-16 documents: 13528it [00:00, 18583.51it/s]
crisisfacts/004/2018-09-17 documents: 18202it [00:00, 22180.95it/s]
crisisfacts/004/2018-09-18 documents: 710it [00:00, 10783.87it/s]
crisisfacts/005/2018-05-27 documents: 7380it [00:01, 6715.20it/s]
crisisfacts/005/2018-05-28 documents: 16407it [00:03, 4379.10it/s]
crisisfacts/005/2018-05-29 documents: 10226it [00:02, 4038.79it/s]
crisisfacts/005/2018-05-30 documents: 7757it [00:01, 4544.43it/s]
crisisfacts/006/2019-10-10 documents: 6993it [00:01, 4432.38it/s]
crisisfacts/006/2019-10-11 documents: 2000it [00:00, 7512.86it/s]
crisisfacts/006/2019-10-12 documents: 9364it [00:02, 4394.58it/s]
crisisfacts/006/2019-10-13 documents: 6998it [00:01, 4494.13it/s]
crisisfacts/007/2020-08-27 documents: 46021it [00:17, 2656.28it/s]
crisisfacts/007/2020-08-28 documents: 2000it [00:00, 8578.91it/s]
crisisfacts/008/2020-09-11 documents: 2215it [00:00, 7522.71it/s]
crisisfacts/008/2020-09-12 documents: 7000it [00:00, 7655.74it/s]
crisisfacts/008/2020-09-13 documents: 7678it [00:00, 10107.73it/s]
crisisfacts/008/2020-09-14 documents: 12000it [00:01, 9197.38it/s]
crisisfacts/008/2020-09-15 documents: 22399it [00:02, 9604.78it/s]
crisisfacts/008/2020-09-16 documents: 34106it [00:09, 3730.49it/s]
crisisfacts/008/2020-09-17 documents: 17087it [00:03, 5079.85it/s]
crisisfacts/008/2020-09-18 documents: 5437it [00:00, 14565.77it/s]
crisisfacts/009/2020-08-03 documents: 259it [00:00, 20612.59it/s]
crisisfacts/009/2020-08-04 documents: 114528it [00:29, 3836.91it/s]
crisisfacts/009/2020-08-05 documents: 27000it [00:04, 5458.82it/s]
crisisfacts/009/2020-08-06 documents: 77036it [00:40, 1907.66it/s]
crisisfacts/009/2020-08-07 documents: 52431it [00:20, 2545.66it/s]
crisisfacts/009/2020-08-08 documents: 34917it [00:10, 3298.58it/s]
crisisfacts/009/2020-08-09 documents: 28724it [00:08, 3526.58it/s]
crisisfacts/010/2020-01-23 documents: 11207it [00:02, 5513.73it/s]
crisisfacts/010/2020-01-24 documents: 23981it [00:05, 4628.46it/s]
crisisfacts/010/2020-01-25 documents: 12430it [00:02, 4461.71it/s]
crisisfacts/010/2020-01-26 documents: 10698it [00:01, 7851.49it/s]
crisisfacts/010/2020-01-27 documents: 10574it [00:01, 6107.11it/s]
crisisfacts/010/2020-01-28 documents: 2931it [00:00, 5750.18it/s]
crisisfacts/010/2020-01-29 documents: 709it [00:00, 19451.10it/s]
crisisfacts/011/2020-02-04 documents: 3455it [00:00, 9095.49it/s]
crisisfacts/011/2020-02-05 documents: 5075it [00:01, 3688.62it/s]
crisisfacts/011/2020-02-06 documents: 7009it [00:00, 7929.32it/s]
crisisfacts/011/2020-02-07 documents: 4497it [00:00, 7459.51it/s]
crisisfacts/011/2020-02-08 documents: 832it [00:00, 11096.54it/s]
crisisfacts/012/2020-05-02 documents: 13337it [00:02, 6629.84it/s]
crisisfacts/012/2020-05-03 documents: 20831it [00:03, 6419.87it/s]
crisisfacts/012/2020-05-04 documents: 24301it [00:03, 6181.37it/s]
crisisfacts/012/2020-05-05 documents: 9000it [00:01, 6595.86it/s]
crisisfacts/012/2020-05-06 documents: 1197it [00:00, 11294.16it/s]
crisisfacts/012/2020-05-07 documents: 29it [00:00, 2428.13it/s]
crisisfacts/012/2020-05-08 documents: 19it [00:00, 4842.72it/s]
crisisfacts/013/2020-05-23 documents: 8783it [00:00, 13300.16it/s]
crisisfacts/013/2020-05-24 documents: 5363it [00:00, 11143.13it/s]
crisisfacts/013/2020-05-25 documents: 5638it [00:00, 9382.33it/s]
crisisfacts/013/2020-05-26 documents: 5145it [00:00, 5568.57it/s]
crisisfacts/013/2020-05-27 documents: 2928it [00:00, 9490.21it/s]
crisisfacts/013/2020-05-28 documents: 315it [00:00, 7978.68it/s]
crisisfacts/013/2020-05-29 documents: 13it [00:00, 3147.97it/s]
crisisfacts/014/2019-08-29 documents: 8000it [00:00, 8350.51it/s]
crisisfacts/014/2019-08-30 documents: 34000it [00:04, 7925.62it/s]
crisisfacts/014/2019-08-31 documents: 31000it [00:05, 6199.33it/s]
crisisfacts/014/2019-09-01 documents: 10000it [00:01, 8236.11it/s]
crisisfacts/014/2019-09-02 documents: 29000it [00:02, 9753.85it/s]
crisisfacts/014/2019-09-03 documents: 20000it [00:02, 9158.16it/s]
crisisfacts/014/2019-09-04 documents: 12000it [00:01, 9604.40it/s]
crisisfacts/015/2019-10-25 documents: 12000it [00:01, 7599.60it/s]
crisisfacts/015/2019-10-26 documents: 1000it [00:00, 16369.04it/s]
crisisfacts/015/2019-10-27 documents: 26178it [00:02, 9106.05it/s]
crisisfacts/015/2019-10-28 documents: 1000it [00:00, 22968.64it/s]
crisisfacts/015/2019-10-29 documents: 4000it [00:00, 14692.07it/s]
crisisfacts/015/2019-10-30 documents: 7000it [00:01, 6904.36it/s]
crisisfacts/015/2019-10-31 documents: 3000it [00:00, 11483.38it/s]
crisisfacts/016/2020-04-12 documents: 54642it [00:09, 5649.88it/s]
crisisfacts/016/2020-04-13 documents: 14000it [00:01, 9397.94it/s]
crisisfacts/016/2020-04-14 documents: 22000it [00:03, 5721.30it/s]
crisisfacts/016/2020-04-15 documents: 4832it [00:00, 5030.57it/s]
crisisfacts/016/2020-04-16 documents: 53it [00:00, 9214.05it/s]
crisisfacts/017/2020-04-21 documents: 20900it [00:03, 5777.18it/s]
crisisfacts/017/2020-04-22 documents: 13000it [00:02, 6418.12it/s]
crisisfacts/017/2020-04-23 documents: 29506it [00:05, 5614.48it/s]
crisisfacts/017/2020-04-24 documents: 6000it [00:00, 9062.09it/s]
crisisfacts/017/2020-04-25 documents: 4000it [00:00, 14182.78it/s]
crisisfacts/017/2020-04-26 documents: 2108it [00:00, 5554.61it/s]
crisisfacts/018/2020-03-02 documents: 3000it [00:00, 10666.54it/s]
crisisfacts/018/2020-03-03 documents: 123416it [00:20, 5916.56it/s]
crisisfacts/018/2020-03-04 documents: 52566it [00:05, 8888.73it/s]
crisisfacts/018/2020-03-05 documents: 14918it [00:01, 13729.67it/s]
crisisfacts/018/2020-03-06 documents: 336it [00:00, 26576.26it/s]
crisisfacts/018/2020-03-07 documents: 77it [00:00, 20217.94it/s]

In [23]:
output["importance_count"] = output["importance_count"] / output["importance_count"].max()
output["importance_sum"] = output["importance_sum"] / output["importance_sum"].max()
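
Dividing a column by its maximum, as done above, rescales the values so that the largest becomes 1.0. The sketch below shows the same idea on a plain list. The function name `max_normalize` is hypothetical, and the guard for an all-zero column is an addition that the notebook cell does not have.

```python
def max_normalize(values):
    """Scale values so that the largest one becomes 1.0."""
    largest = max(values)
    if largest == 0:
        # Added guard (not in the notebook): avoid dividing by zero
        return [0.0 for _ in values]
    return [v / largest for v in values]


# Toy importance counts: number of queries that retrieved each document
counts = [1, 2, 4]
print(max_normalize(counts))  # [0.25, 0.5, 1.0]
```

Because the maximum is taken over the whole `output` frame, the scores are comparable across all event-day requests rather than normalized separately per request.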

In [24]:
with open("./submission_v1.json", "w") as out_file:
    for idx,row in output\
        .drop(columns=["importance_sum"])\
        .rename(columns={"importance_count": "importance"})\
        .iterrows():
        
        out_file.write("%s\n" % json.dumps(dict(row)))

with open("./submission_v2.json", "w") as out_file:
    for idx,row in output\
        .drop(columns=["importance_count"])\
        .rename(columns={"importance_sum": "importance"})\
        .iterrows():
        
        out_file.write("%s\n" % json.dumps(dict(row)))
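
Each submission file written above holds one JSON object per line. Before using such a file, it can be worth checking that every line parses and carries the expected fields. This is a minimal sketch under a stated assumption: `EXPECTED_FIELDS` simply lists the columns this notebook writes and is not an official schema. The function name and the toy record values are made up for illustration.

```python
import json

# Assumption: these are the columns this notebook writes out,
# not an official submission schema
EXPECTED_FIELDS = {
    "requestID", "factText", "unixTimestamp", "importance",
    "sources", "streamID", "informationNeeds",
}


def find_incomplete_records(lines):
    """Return (line_number, missing_fields) for each record lacking a field."""
    problems = []
    for line_no, line in enumerate(lines, start=1):
        record = json.loads(line)
        missing = EXPECTED_FIELDS - set(record)
        if missing:
            problems.append((line_no, sorted(missing)))
    return problems


# Toy records with made-up values, serialized one JSON object per line
complete = {
    "requestID": "CrisisFACTS-001-r3",
    "factText": "example fact text",
    "unixTimestamp": 1512604800,
    "importance": 0.5,
    "sources": ["example-doc-id"],
    "streamID": "example-doc-id",
    "informationNeeds": ["example-query-id"],
}
incomplete = {k: v for k, v in complete.items() if k != "importance"}
lines = [json.dumps(complete), json.dumps(incomplete)]

problems = find_incomplete_records(lines)
print(problems)  # [(2, ['importance'])]
```

To check a real file, one could pass the open file object itself, since iterating over it yields one line at a time.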

In [25]:
with open("./submission_v1.json", "r") as in_file:
    submission_rows = [json.loads(line) for line in in_file]
submission_df = pd.DataFrame(submission_rows)

In [26]:
top_k = 64

for reqId, group in submission_df.groupby("requestID"):
    print(reqId, group.shape[0])
    
    summary = " ".join(group.sort_values(by="importance", ascending=False).head(top_k)["factText"])
    
    print(summary)
    
    break

CrisisFACTS-001-r10 654
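
The summary step above (group by request, sort by importance, keep the top `top_k` facts, join their text) can be sketched in plain Python. The function name `summarize` and the toy rows are illustrative assumptions, not track data.

```python
from collections import defaultdict


def summarize(rows, top_k):
    """Join the text of the top_k most important facts for each request."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["requestID"]].append(row)

    summaries = {}
    for request_id, group in grouped.items():
        # Highest importance first, then keep only the first top_k facts
        best = sorted(group, key=lambda r: r["importance"], reverse=True)[:top_k]
        summaries[request_id] = " ".join(r["factText"] for r in best)
    return summaries


# Toy rows with made-up request IDs and text
rows = [
    {"requestID": "r-a", "importance": 0.2, "factText": "low"},
    {"requestID": "r-a", "importance": 0.9, "factText": "high"},
    {"requestID": "r-a", "importance": 0.5, "factText": "mid"},
    {"requestID": "r-b", "importance": 0.1, "factText": "only"},
]

summaries = summarize(rows, top_k=2)
print(summaries)  # {'r-a': 'high mid', 'r-b': 'only'}
```

Note that pandas `groupby` sorts its keys as strings by default, which is why `CrisisFACTS-001-r10` is printed first above rather than `CrisisFACTS-001-r3`.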
