# TREC CrisisFACTs Track 2022 Tutorial

This notebook illustrates how to download the TREC 2022 CrisisFACTs event streams along with the information needs for each one.

## Downloading the Track Data

Below, we walk you through the steps for downloading the CrisisFACTS data using the `ir_datasets` package.

After downloading, we demonstrate converting this data into a Pandas DataFrame for quick inspection of the content associated with a given event-day pair.

<hr> 


**Part 1: Installing Needed Packages**

Before we can get the data, we need to install some packages to handle the download process. In particular, we are going to install one main package:

*   ir_datasets (https://github.com/allenai/ir_datasets): A python package that provides a common interface to many IR ad-hoc ranking benchmarks, training datasets, etc. We can use this to download the raw event streams and information needs for each.


In [None]:
!pip install --upgrade git+https://github.com/allenai/ir_datasets.git@crisisfacts # install ir_datasets (crisisfacts branch)


<hr> 



**Part 2: Initializing Your Credentials**

When you download part of the CrisisFACTs dataset, we require that you provide a set of contact details. The reason for this is two-fold: 1) the terms of service of some of the platforms from which we have sourced data (like Twitter) require us to do so, and 2) it allows us to collect statistics on how many people are making use of the data we provide.

**GDPR Statement**: By downloading the CrisisFACTs datasets, you agree to the University of Glasgow processing your personal data, as defined by the EU General Data Protection Regulation (GDPR) - your name and email in this case. Queries about data processing and access/deletion requests should be sent to [me via email](http://www.dcs.gla.ac.uk/~richardm/Home/Contact.html). We will store your data for as long as the track is on-going and up-to 2 years beyond that. I may contact you using the details provided to notify you about changes in the datasets or track, to provide information or ask you questions about your participation or otherwise contact you about topics relevant to emergency management. We may collate statistics from the provided information that will be published, but we will not release individual names or email addresses. 

Rather than entering these details every time you request the dataset, it's more efficient to set them once up front, so fill in your details below:

In [None]:
credentials = {
    "institution": "<University/Agency Name>", # University, Company or Public Agency Name
    "contactname": "<Your Name>", # Your Name
    "email": "<Your Email>", # A contact email address
    "institutiontype": "<Research | Industry | Public Sector>" # Either 'Research', 'Industry', or 'Public Sector'
}

# Write this to a file so it can be read when needed
import json
import os

home_dir = os.path.expanduser('~')

!mkdir -p ~/.ir_datasets/auth/
with open(home_dir + '/.ir_datasets/auth/crisisfacts.json', 'w') as f:
    json.dump(credentials, f)
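
The `!mkdir` line above relies on notebook shell magic. As a minimal sketch, the same step can be done in plain Python so it also runs outside a notebook. The function name `write_credentials`, its `auth_dir` parameter, and the check for empty fields are assumptions made for illustration; only the file location and the four field names come from the cell above.

```python
import json
import os

# Field names taken from the credentials dict in the cell above
REQUIRED_FIELDS = ("institution", "contactname", "email", "institutiontype")

def write_credentials(credentials, auth_dir=None):
    """Write credentials to <auth_dir>/crisisfacts.json and return the file path.

    Hypothetical helper: by default it targets ~/.ir_datasets/auth/,
    the same directory used in the cell above.
    """
    if auth_dir is None:
        auth_dir = os.path.join(os.path.expanduser("~"), ".ir_datasets", "auth")

    # Refuse to write a file with missing or empty fields (an added safety check)
    missing = [key for key in REQUIRED_FIELDS if not credentials.get(key)]
    if missing:
        raise ValueError(f"Missing credential fields: {missing}")

    # Portable replacement for `mkdir -p`
    os.makedirs(auth_dir, exist_ok=True)

    path = os.path.join(auth_dir, "crisisfacts.json")
    with open(path, "w") as f:
        json.dump(credentials, f)
    return path
```

Calling `write_credentials(credentials)` with the dict defined above would write to the default location; passing `auth_dir` is only useful for testing against a scratch directory.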

<hr> 

**Part 3: Understanding the structure of the CrisisFACTs Dataset**

The CrisisFACTs dataset is divided into events, each representing a real-world crisis. Each event is given an identifier, e.g. 'CrisisFACTS-001' is the Lilac Wildfire from 2017. We sometimes refer to the event number or 'eventNo', which is the last three digits of the event identifier, e.g. '001'. There are 8 events for CrisisFACTs 2022; the list below contains 18 event numbers, '001' through '018':

In [None]:
# Event numbers as a list
eventNoList = [
        "001", # Lilac Wildfire 2017
        "002", # Cranston Wildfire 2018
        "003", # Holy Wildfire 2018
        "004", # Hurricane Florence 2018
        "005", # 2018 Maryland Flood
        "006", # Saddleridge Wildfire 2019
        "007", # Hurricane Laura 2020
        "008", # Hurricane Sally 2020
        "009", # Beirut Explosion, 2020
        "010", # Houston Explosion, 2020
        "011", # Rutherford TN Floods, 2020
        "012", # TN Derecho, 2020
        "013", # Edenville Dam Fail, 2020
        "014", # Hurricane Dorian, 2019
        "015", # Kincade Wildfire, 2019
        "016", # Easter Tornado Outbreak, 2020
        "017", # Tornado Outbreak, 2020 Apr
        "018", # Tornado Outbreak, 2020 March
]

Each event has a duration, i.e. it lasts for a number of days. In the CrisisFACTs track, you need to produce a timeline summary for each day for a set of events. You can get the list of days for an event as shown below (example is for event "001", i.e. the Lilac Wildfire 2017):

In [None]:
import requests

# Gets the list of days for a specified event number, e.g. '001'
def getDaysForEventNo(eventNo):

    # We will download a file containing the day list for an event
    url = "http://trecis.org/CrisisFACTs/CrisisFACTS-"+eventNo+".requests.json"

    # Download the list and parse as JSON
    dayList = requests.get(url).json()

    # Note: each day object contains the following fields
    #   {
    #      "eventID" : "CrisisFACTS-001",
    #      "requestID" : "CrisisFACTS-001-r3",
    #      "dateString" : "2017-12-07",
    #      "startUnixTimestamp" : 1512604800,
    #      "endUnixTimestamp" : 1512691199
    #   }

    return dayList

for day in getDaysForEventNo(eventNoList[0]):
    print(day["dateString"])
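
Each day object also carries a start and end Unix timestamp. The sketch below turns them into timezone-aware datetimes, which can help when filtering stream items by time. It assumes the timestamps are seconds since the epoch in UTC; that reading is consistent with the sample object in the comment above, but it is not stated by the track. `day_window` is a hypothetical helper name.

```python
from datetime import datetime, timezone

def day_window(day):
    """Return (start, end) of one day object as UTC datetimes.

    Assumes startUnixTimestamp/endUnixTimestamp are seconds since the epoch (UTC).
    """
    start = datetime.fromtimestamp(day["startUnixTimestamp"], tz=timezone.utc)
    end = datetime.fromtimestamp(day["endUnixTimestamp"], tz=timezone.utc)
    return start, end

# Sample day object copied from the comment in getDaysForEventNo
sample_day = {
    "eventID": "CrisisFACTS-001",
    "requestID": "CrisisFACTS-001-r3",
    "dateString": "2017-12-07",
    "startUnixTimestamp": 1512604800,
    "endUnixTimestamp": 1512691199,
}

start, end = day_window(sample_day)
print(start.isoformat(), "->", end.isoformat())
```

For the sample object, the window starts at midnight UTC on the day named by `dateString` and ends one second before the next midnight.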

Now that we know the request strings for each event and day, we can download the associated stream for each via ir_datasets.

**Part 1. Installing Packages**

Search engines are a core part of the online information space, and much work has gone into making this technology accessible and easy to develop. A major package in this space that is designed to facilitate experimentation with search and information-retrieval methods is `Terrier` and its related Python bindings. To use this library, we install the following packages:

*   pyTerrier (https://pyterrier.readthedocs.io/en/latest/): pyTerrier is a Python wrapper around the Terrier IR Platform (a search engine in a box). We will use this to produce a searchable index for each day during a crisis event, so we can retrieve (hopefully) relevant content for different information needs.

In [None]:
!pip install --upgrade python-terrier # install pyTerrier

# Simple query-based baseline


In [None]:
eventsMeta = {}

for eventNo in eventNoList: # for each event
    print("Event "+eventNo)
    
    dailyInfo = getDaysForEventNo(eventNo) # get the list of days
    eventsMeta[eventNo]= dailyInfo
    for day in dailyInfo: # for each day
        print("  crisisfacts/"+eventNo+"/"+day["dateString"], "-->", day["requestID"]) # construct the request string

    print()
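
The request string printed above follows the pattern `crisisfacts/<eventNo>/<dateString>`. As a small sketch, that construction can be pulled into a helper, with a lookup from each `requestID` to its dataset ID. Both function names are hypothetical; the ID pattern and field names are taken from the cell above.

```python
def build_dataset_id(event_no, day):
    """Build the ir_datasets ID for one event-day, e.g. 'crisisfacts/001/2017-12-07'."""
    return f"crisisfacts/{event_no}/{day['dateString']}"

def request_to_dataset_ids(events_meta):
    """Map each requestID to its dataset ID.

    events_meta has the same shape as eventsMeta above:
    {eventNo: [day objects]}.
    """
    return {
        day["requestID"]: build_dataset_id(event_no, day)
        for event_no, days in events_meta.items()
        for day in days
    }

# Toy input shaped like eventsMeta; the values are illustrative only
toy_meta = {
    "001": [
        {"requestID": "CrisisFACTS-001-r3", "dateString": "2017-12-07"},
    ],
}
print(request_to_dataset_ids(toy_meta))
```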

In [None]:
import pandas as pd
import numpy as np
import pyterrier as pt

import ir_datasets

In [None]:
def get_retriever(datasetID):
    # Initialize pyTerrier if not started
    if not pt.started():
        pt.init()

    # Ask pyTerrier to download the dataset, the 'irds:' header tells pyTerrier to use ir_datasets as the data source
    pyTerrierDataset = pt.get_dataset(f'irds:{datasetID}')
    # To create the index, we use an 'indexer', which iterates over the documents in the collection and adds them to the index
    # The parameters of this call are:
    #  Index Storage Path: "None" (some index types write to disk, this would be the directory to write to)
    #  Index Type: type=pt.index.IndexingType(3) (Type 3 is a Memory Index)
    #  Meta Index Fields: meta=['docno', 'text'] (The index also can store raw fields so they can be attached to the search results, this specifies what fields to store)
    #  Meta Index Lengths: meta_lengths=[40, 200] (pyTerrier allocates a fixed amount of storage space per field, how many characters should this be?)
    indexer = pt.IterDictIndexer(
        "None", 
        type=pt.index.IndexingType(3), 
        meta=['docno', 'text'], 
        meta_lengths=[40, 200],
    )

    # Trigger the indexing process
    index = indexer.index(pyTerrierDataset.get_corpus_iter())
    retriever = pt.BatchRetrieve(
        index, 
        wmodel="DFReeKLIM", 
        metadata=["docno", "text"],
    )
    return retriever

def get_quries_hits(retriever, queries):
    
    event_queries={}
    event_scores={}
    
    for index, query in queries.iterrows():
        results = pd.DataFrame(retriever.search(query['indicative_terms']))
        
        for ind, result in results.iterrows():
            if result['docno'] in event_queries:
                event_queries[result['docno']].append(query['query_id'])
                event_scores[result['docno']].append(result['score'])
            else:
                event_queries[result['docno']]=[query['query_id']]
                event_scores[result['docno']]=[result['score']]
    
    return event_queries,event_scores
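
The aggregation in `get_quries_hits` groups hits by document: for every document, it collects the IDs of the queries that retrieved it and the scores it received. The sketch below shows the same grouping on toy data without pyTerrier. The function name `group_hits_by_doc`, the tuple layout, and the toy IDs and scores are all illustrative assumptions.

```python
from collections import defaultdict

def group_hits_by_doc(hits):
    """Group (query_id, docno, score) tuples by document.

    Returns two dicts keyed by docno: the matching query IDs and the scores,
    in the order the hits were seen.
    """
    doc_queries = defaultdict(list)
    doc_scores = defaultdict(list)
    for query_id, docno, score in hits:
        doc_queries[docno].append(query_id)
        doc_scores[docno].append(score)
    return dict(doc_queries), dict(doc_scores)

# Toy hits: document d1 is retrieved by two queries, d2 by one
toy_hits = [
    ("q1", "d1", 2.0),
    ("q1", "d2", 1.5),
    ("q2", "d1", 0.5),
]
doc_queries, doc_scores = group_hits_by_doc(toy_hits)
print(doc_queries)
print(doc_scores)
```

A document matched by many queries ends up with a longer score list, which is what the later `importance_count` and `importance_sum` columns measure.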

In [None]:
# Build the document cache for each CrisisFACTS event-day
#. request. All data goes to the `ir_datasets/crisisfacts` directory,
#. which on Unix-type machines is `~/.ir_datasets/crisisfacts/`
for eventId,dailyInfo in eventsMeta.items():

    for thisDay in dailyInfo:
        
        requestID = thisDay["requestID"]
        ir_dataset_id = "crisisfacts/%s/%s" % (eventId, thisDay["dateString"])        
        
        dataset = ir_datasets.load(ir_dataset_id)
        print(dataset.dataset_id())
        
        itemsAsDataFrame = pd.DataFrame(dataset.docs_iter())
        print("\tDataset Size:", itemsAsDataFrame.shape[0])
        
        # This should be true. If you get an error here, 
        #. it suggests a bug popped up during data download. 
        #. Try re-running this cell.
        assert itemsAsDataFrame.shape[0] > 0
        
        queries = pd.DataFrame(dataset.queries_iter())
        print("\tQuery Count:", queries.shape[0])
        
        # This should also be true. If you get an error here, 
        #. it suggests a bug popped up during data download. 
        #. Try re-running this cell.
        assert queries.shape[0] > 0
        
        # If the above code fails on JSON decode error associated with queries,
        #. that means you've ended up with an empty query file. We have an 
        #. intermittent bug in the networking infrastructure that occasionally
        #. causes this. You can look for 0-byte files in ~/.ir_datasets/crisisfacts
        #. and delete them to force the library to retry downloading
        #. E.g., `rm ~/.ir_datasets/crisisfacts/2020-08-05/009/queries.json`
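
The comment above suggests looking for 0-byte files in the cache directory. A minimal sketch of that search follows; it only lists candidates and deletes nothing. `find_empty_files` is a hypothetical helper, and the cache location is the one named in the comments above.

```python
from pathlib import Path

def find_empty_files(root):
    """Return a sorted list of 0-byte files under root (empty list if root is missing)."""
    root = Path(root).expanduser()
    if not root.is_dir():
        return []
    return sorted(
        path for path in root.rglob("*")
        if path.is_file() and path.stat().st_size == 0
    )

# List candidates for manual inspection; nothing is removed here
for empty_file in find_empty_files("~/.ir_datasets/crisisfacts"):
    print(empty_file)
```

After checking the list, removing a file by hand should force the library to retry that download, as described above.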

In [None]:
rows = []

for eventId,dailyInfo in eventsMeta.items():

    for thisDay in dailyInfo:
        
        requestID = thisDay["requestID"]
        ir_dataset_id = "crisisfacts/%s/%s" % (eventId, thisDay["dateString"])        
        
        # Should use the cached data we downloaded above
        dataset = ir_datasets.load(ir_dataset_id)        
        queries = pd.DataFrame(dataset.queries_iter())

        retriever = get_retriever(ir_dataset_id)
        event_queries, event_scores = get_quries_hits(retriever,queries)
        
        itemsAsDataFrame = pd.DataFrame(dataset.docs_iter())

        for key,value in event_scores.items():
            fact = itemsAsDataFrame.query(f"doc_id== '{key}'").iloc[0]
            
            rows.append({
                'requestID' : requestID, 
                'factText' : fact["text"], 
                'unixTimestamp' : fact["unix_timestamp"],
                "importance_count" : len(value),
                "importance_sum" : np.sum(value),
                "sources" : [key],
                "streamID" : key,
                "informationNeeds" : event_queries[key]
            })

output = pd.DataFrame(rows)

In [None]:
output["importance_count"] = output["importance_count"] / output["importance_count"].max()
output["importance_sum"] = output["importance_sum"] / output["importance_sum"].max()
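
Both columns above are divided by their maximum over the whole `output` frame, so the scaling is global across all requests rather than per request. The plain-Python sketch below shows the same max-normalization on a list. The guard for a non-positive maximum is an added assumption; the cell above has no such guard.

```python
def max_normalize(values):
    """Scale values by their maximum so the largest becomes 1.0.

    Returns all zeros when the maximum is not positive, to avoid
    dividing by zero or flipping signs (an assumption, not in the original cell).
    """
    values = list(values)
    if not values:
        return []
    peak = max(values)
    if peak <= 0:
        return [0.0 for _ in values]
    return [value / peak for value in values]

print(max_normalize([1, 2, 4]))
```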

In [None]:
with open("./submission_v1.json", "w") as out_file:
    for idx,row in output\
        .drop(columns=["importance_sum"])\
        .rename(columns={"importance_count": "importance"})\
        .iterrows():
        
        out_file.write("%s\n" % json.dumps(dict(row)))

with open("./submission_v2.json", "w") as out_file:
    for idx,row in output\
        .drop(columns=["importance_count"])\
        .rename(columns={"importance_sum": "importance"})\
        .iterrows():
        
        out_file.write("%s\n" % json.dumps(dict(row)))
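
Each submission file holds one JSON object per line. As a hedged sketch, the check below verifies that a line parses and carries the keys written by the cells above. The expectation that `importance` lies in [0, 1] follows from the max-normalization when scores are non-negative; it is an assumption here, not a stated track requirement. `check_submission_line` is a hypothetical helper.

```python
import json

# Keys produced by the cells above, after renaming the importance column
EXPECTED_KEYS = {
    "requestID", "factText", "unixTimestamp", "importance",
    "sources", "streamID", "informationNeeds",
}

def check_submission_line(line):
    """Return a list of problems found in one JSON line (empty if none)."""
    try:
        row = json.loads(line)
    except json.JSONDecodeError as err:
        return [f"invalid JSON: {err}"]
    if not isinstance(row, dict):
        return ["line is not a JSON object"]

    problems = []
    missing = EXPECTED_KEYS - row.keys()
    if missing:
        problems.append(f"missing keys: {sorted(missing)}")

    importance = row.get("importance")
    is_number = isinstance(importance, (int, float)) and not isinstance(importance, bool)
    if not is_number or not 0 <= importance <= 1:
        problems.append(f"importance outside [0, 1]: {importance!r}")
    return problems
```

Running it over every line of `submission_v1.json` and printing any non-empty result gives a quick sanity check before submitting.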

In [None]:
with open("./submission_v1.json", "r") as in_file:
    submission_rows = [json.loads(line) for line in in_file]
submission_df = pd.DataFrame(submission_rows)

In [None]:
top_k = 64

for reqId, group in submission_df.groupby("requestID"):
    print(reqId, group.shape[0])
    
    summary = " ".join(group.sort_values(by="importance", ascending=False).head(top_k)["factText"])
    
    print(summary)
    
    break
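
The summary above keeps the `top_k` most important facts for a request and joins their text. The same selection can be sketched without pandas using `heapq.nlargest`. The function name `summarize` and the toy rows are illustrative assumptions; the field names match the submission rows.

```python
import heapq

def summarize(rows, top_k=64):
    """Join the factText of the top_k rows, ordered by descending importance."""
    top_rows = heapq.nlargest(top_k, rows, key=lambda row: row["importance"])
    return " ".join(row["factText"] for row in top_rows)

# Toy rows for a single request; text and scores are made up
toy_rows = [
    {"factText": "Road closed.", "importance": 0.2},
    {"factText": "Evacuation ordered.", "importance": 0.9},
    {"factText": "Shelter opened.", "importance": 0.5},
]
print(summarize(toy_rows, top_k=2))
```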