# TREC CrisisFACTs Track 2023 Downloader

This notebook illustrates how to download the TREC CrisisFACTs event streams along with the information needs for each one.

## Downloading the Track Data

Below, we walk you through the steps for downloading the CrisisFACTS data using the `ir_datasets` package.

After downloading, we demonstrate converting this data into a Pandas DataFrame for quick inspection of the content associated with a given event-day pair.

<hr> 


**Part 1: Installing Needed Packages**

Before we can get the data, we need to install some packages to handle the download process. In particular, we are going to install one main package:

*   ir_datasets (https://github.com/allenai/ir_datasets): A python package that provides a common interface to many IR ad-hoc ranking benchmarks, training datasets, etc. We can use this to download the raw event streams and information needs for each.


In [1]:
!pip install --upgrade git+https://github.com/allenai/ir_datasets.git@crisisfacts # install ir_datasets (crisisfacts branch)


<hr> 



**Part 2: Initializing Your Credentials**

When you want to download part of the CrisisFACTs dataset we require that you provide a set of contact details. The reason for this is two-fold: 1) the terms of service from some of the platforms (like Twitter) from which we have sourced data require us to do so, and 2) it allows us to collect statistics on how many people are making use of the data we provide.

**GDPR Statement**: By downloading the CrisisFACTs datasets, you agree to the University of Glasgow processing your personal data, as defined by the EU General Data Protection Regulation (GDPR) - your name and email in this case. Queries about data processing and access/deletion requests should be sent to [me via email](http://www.dcs.gla.ac.uk/~richardm/Home/Contact.html). We will store your data for as long as the track is on-going and up-to 2 years beyond that. I may contact you using the details provided to notify you about changes in the datasets or track, to provide information or ask you questions about your participation or otherwise contact you about topics relevant to emergency management. We may collate statistics from the provided information that will be published, but we will not release individual names or email addresses. 

Rather than entering these details every time you request the dataset, it's more efficient to set them once up-front, so fill in your details below:

In [2]:
credentials = {
    "institution": "<University/Agency Name>", # University, Company or Public Agency Name
    "contactname": "<Your Name>", # Your Name
    "email": "<Your Email>", # A contact email address
    "institutiontype": "<Research | Industry | Public Sector>" # Either 'Research', 'Industry', or 'Public Sector'
}

# Write this to a file so it can be read when needed
import json
import os

home_dir = os.path.expanduser('~')

!mkdir -p ~/.ir_datasets/auth/
with open(home_dir + '/.ir_datasets/auth/crisisfacts.json', 'w') as f:
    json.dump(credentials, f)
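
The `!mkdir` line above relies on notebook shell magic. If you run this step as a plain Python script instead, the same thing can be done with the standard library alone. This is a minimal sketch: the helper name `write_credentials` and its `base_dir` parameter are our own, introduced for illustration, and only the target path `.ir_datasets/auth/crisisfacts.json` comes from the cell above.

```python
import json
import os


def write_credentials(credentials, base_dir=None):
    """Write the contact details to <base_dir>/.ir_datasets/auth/crisisfacts.json.

    base_dir defaults to the user's home directory, matching the cell above.
    Returns the path of the file that was written.
    """
    if base_dir is None:
        base_dir = os.path.expanduser("~")

    # Create the auth directory if it does not exist yet (like `mkdir -p`)
    auth_dir = os.path.join(base_dir, ".ir_datasets", "auth")
    os.makedirs(auth_dir, exist_ok=True)

    path = os.path.join(auth_dir, "crisisfacts.json")
    with open(path, "w") as f:
        json.dump(credentials, f)
    return path
```

Calling `write_credentials(credentials)` with the dictionary defined above writes to the same location as the original cell.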

<hr> 

**Part 3: Understanding the structure of the CrisisFACTs Dataset**

The CrisisFACTs dataset is divided into events, representing real-world crises. Each event is given an identifier, e.g. 'CrisisFACTS-001' is the Lilac Wildfire from 2017. We sometimes refer to the event number or 'eventNo'; this is the last three digits of the event identifier, e.g. '001'. There are 18 events in total: the 8 events from CrisisFACTs 2022 (001 to 008) and 10 further events for 2023 (009 to 018):

In [2]:
# Event numbers as a list
eventNoList = [
    "001", # Lilac Wildfire 2017
    "002", # Cranston Wildfire 2018
    "003", # Holy Wildfire 2018
    "004", # Hurricane Florence 2018
    "005", # 2018 Maryland Flood
    "006", # Saddleridge Wildfire 2019
    "007", # Hurricane Laura 2020
    "008", # Hurricane Sally 2020
    "009", # Beirut Explosion, 2020
    "010", # Houston Explosion, 2020
    "011", # Rutherford TN Floods, 2020
    "012", # TN Derecho, 2020
    "013", # Edenville Dam Fail, 2020
    "014", # Hurricane Dorian, 2019
    "015", # Kincade Wildfire, 2019
    "016", # Easter Tornado Outbreak, 2020
    "017", # Tornado Outbreak, 2020 Apr
    "018", # Tornado Outbreak, 2020 March
]
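
Since the event number, the event identifier and the request string are all simple string transformations of one another, it can be handy to wrap them in small helpers. The function names below are hypothetical; the formats themselves ('CrisisFACTS-\<eventNo\>' and 'crisisfacts/\<eventNo\>/\<day\>') are the ones described in this notebook.

```python
def event_id_from_no(event_no):
    """'001' -> 'CrisisFACTS-001'"""
    return "CrisisFACTS-" + event_no


def event_no_from_id(event_id):
    """'CrisisFACTS-001' -> '001' (the last three digits of the identifier)"""
    return event_id[-3:]


def request_string(event_no, date_string):
    """Build the ir_datasets request string for an <event,day> pair."""
    return "crisisfacts/" + event_no + "/" + date_string


print(event_id_from_no("001"))
print(event_no_from_id("CrisisFACTS-009"))
print(request_string("001", "2017-12-07"))
```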

Each event has a duration, i.e. it lasts for a number of days. In the CrisisFACTs track, you need to produce a timeline summary for each day for a set of events. You can get the list of days for an event as shown below (example is for event "001", i.e. the Lilac Wildfire 2017):

In [3]:
import requests

# Gets the list of days for a specified event number, e.g. '001'
def getDaysForEventNo(eventNo):

    # We will download a file containing the day list for an event
    url = "http://trecis.org/CrisisFACTs/CrisisFACTS-"+eventNo+".requests.json"

    # Download the list and parse as JSON
    dayList = requests.get(url).json()

    # Each day object in the returned list contains the following fields:
    #   {
    #      "eventID" : "CrisisFACTS-001",
    #      "requestID" : "CrisisFACTS-001-r3",
    #      "dateString" : "2017-12-07",
    #      "startUnixTimestamp" : 1512604800,
    #      "endUnixTimestamp" : 1512691199
    #   }

    return dayList

for day in getDaysForEventNo(eventNoList[0]):
    print(day["dateString"])

2017-12-07
2017-12-08
2017-12-09
2017-12-10
2017-12-11
2017-12-12
2017-12-13
2017-12-14
2017-12-15
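
Each day object also carries `startUnixTimestamp` and `endUnixTimestamp`, which bound the day in seconds. The sketch below converts them to readable UTC datetimes and checks whether a given timestamp falls inside the day. The `day` dictionary is the example object from the comment in `getDaysForEventNo`; the helper names `day_window_utc` and `in_day` are our own.

```python
from datetime import datetime, timezone

# Example day object, copied from the comment in getDaysForEventNo above
day = {
    "eventID": "CrisisFACTS-001",
    "requestID": "CrisisFACTS-001-r3",
    "dateString": "2017-12-07",
    "startUnixTimestamp": 1512604800,
    "endUnixTimestamp": 1512691199,
}


def day_window_utc(day):
    """Return the (start, end) of a day object as timezone-aware UTC datetimes."""
    start = datetime.fromtimestamp(day["startUnixTimestamp"], tz=timezone.utc)
    end = datetime.fromtimestamp(day["endUnixTimestamp"], tz=timezone.utc)
    return start, end


def in_day(unix_timestamp, day):
    """True if a unix timestamp (in seconds) falls within the day, bounds inclusive."""
    return day["startUnixTimestamp"] <= unix_timestamp <= day["endUnixTimestamp"]


start, end = day_window_utc(day)
print(start.isoformat(), "to", end.isoformat())
```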


For each day, we collected content related to the event from the following sources:


*   **Twitter**: We are re-using tweets collected as part of the TREC Incident Streams track (http://trecis.org). These tweets were crawled by keyword, and as such most are likely to be relevant to the event, but are not necessarily good candidates for inclusion in a summary of what is happening.
*   **Reddit**: Discussions of what happens during events also occur on the forum platform Reddit. We collected Reddit threads relevant to each event, including both the original submission and the subsequent comments within those threads.
*   **News**: Traditional news agencies are often a good source of information during an emergency and so we have also included a small number of news articles collected during each event as well.
*   **Facebook**: We collected Facebook/Meta posts from public pages that are relevant to each event using CrowdTangle. We cannot share the content of these posts; however, we have included the post and page ids of this content within the stream for those who have access to the CrowdTangle API and can retrieve this data separately.

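Because these sources have different formatting and characteristics, we reformatted this data into a list of standardized 'stream items'. When loaded through ir_datasets, as in the outputs later in this notebook, the streamID, unixTimestamp and sourceType fields appear as doc_id, unix_timestamp and source_type. A stream item contains: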


*   **event**: The identifier of the event, e.g. 'CrisisFACTS-001'
*   **streamID**: A unique identifier for the stream item. This will generally be of the form 'CrisisFACTS-\<eventNo\>-\<source\>-\<postID\>-\<sentenceID\>', e.g. CrisisFACTS-001-Twitter-15712-0.
*   **unixTimestamp**: This is the time that the content was originally posted, expressed as a unix timestamp in seconds (UTC timezone).
*   **text**: The text of the stream item. The maximum length of a stream item is 200 characters. 
*   **sourceType**: A string denoting the source, i.e. either Twitter, Reddit, News or Facebook.
*   **source**: This is the original post content formatted as JSON (ir_datasets ignores this field).

Since some types of content are longer than others (compare a news article with a tweet, for instance), we perform sentence segmentation on long-form content, so one input post might form multiple stream items. In these cases, the 'sentenceID' component of the streamID denotes the number of the sentence in the source content.
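
Because the streamID encodes the event, source, post and sentence, it can be split back into its components. The sketch below assumes the general form 'CrisisFACTS-\<eventNo\>-\<source\>-\<postID\>-\<sentenceID\>' given above; identifiers that do not follow this form raise a `ValueError`. The `StreamID` type and `parse_stream_id` function are hypothetical helpers, not part of ir_datasets.

```python
from typing import NamedTuple


class StreamID(NamedTuple):
    event_no: str
    source: str
    post_id: str
    sentence_id: int


def parse_stream_id(stream_id):
    """Split 'CrisisFACTS-<eventNo>-<source>-<postID>-<sentenceID>' into parts."""
    parts = stream_id.split("-", 3)
    if len(parts) != 4 or parts[0] != "CrisisFACTS":
        raise ValueError("Unexpected streamID format: " + stream_id)
    _, event_no, source, rest = parts

    # The sentence number is whatever follows the last hyphen
    post_id, sep, sentence = rest.rpartition("-")
    if not sep or not sentence.isdigit():
        raise ValueError("Unexpected streamID format: " + stream_id)

    return StreamID(event_no, source, post_id, int(sentence))


print(parse_stream_id("CrisisFACTS-001-Twitter-15712-0"))
```

Grouping stream items by `(event_no, source, post_id)` and sorting by `sentence_id` recovers the sentences of one original post in order.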


The dataset is structured by day and event. To access the stream items for a particular \<event,day\> pair we generate a request string specifying the day and event we want, of the form:

*   '**crisisfacts/\<eventNo\>/\<day\>**'

For instance, we could generate request strings for all CrisisFACTs \<event,day\> pairs as follows:

In [4]:
eventsMeta = {}

for eventNo in eventNoList: # for each event
    dailyInfo = getDaysForEventNo(eventNo) # get the list of days
    eventsMeta[eventNo]= dailyInfo
    
    print("Event "+eventNo)
    for day in dailyInfo: # for each day
        print("  crisisfacts/"+eventNo+"/"+day["dateString"], "-->", day["requestID"]) # construct the request string

    print()

Event 001
  crisisfacts/001/2017-12-07 --> CrisisFACTS-001-r3
  crisisfacts/001/2017-12-08 --> CrisisFACTS-001-r4
  crisisfacts/001/2017-12-09 --> CrisisFACTS-001-r5
  crisisfacts/001/2017-12-10 --> CrisisFACTS-001-r6
  crisisfacts/001/2017-12-11 --> CrisisFACTS-001-r7
  crisisfacts/001/2017-12-12 --> CrisisFACTS-001-r8
  crisisfacts/001/2017-12-13 --> CrisisFACTS-001-r9
  crisisfacts/001/2017-12-14 --> CrisisFACTS-001-r10
  crisisfacts/001/2017-12-15 --> CrisisFACTS-001-r11

Event 002
  crisisfacts/002/2018-07-25 --> CrisisFACTS-002-r1
  crisisfacts/002/2018-07-26 --> CrisisFACTS-002-r2
  crisisfacts/002/2018-07-27 --> CrisisFACTS-002-r3
  crisisfacts/002/2018-07-28 --> CrisisFACTS-002-r4
  crisisfacts/002/2018-07-29 --> CrisisFACTS-002-r5
  crisisfacts/002/2018-07-30 --> CrisisFACTS-002-r6

Event 003
  crisisfacts/003/2018-08-06 --> CrisisFACTS-003-r5
  crisisfacts/003/2018-08-07 --> CrisisFACTS-003-r6
  crisisfacts/003/2018-08-08 --> CrisisFACTS-003-r7
  crisisfacts/003/2018-08-09 -

Now that we know what the request strings for each event and day are, we can download the associated stream for each via ir_datasets:

In [6]:
import ir_datasets

# download the first day for event 001 (this is a lazy call, it won't download until we first request a document from the stream)
dataset = ir_datasets.load('crisisfacts/001/2017-12-07')

for item in dataset.docs_iter()[:10]: # create an iterator over the stream containing the first 10 items
    print(item)



In [10]:
# download the second day for event 009, first 2023 event
dataset = ir_datasets.load('crisisfacts/009/2020-08-04')

for item in dataset.docs_iter()[:10]: # create an iterator over the stream containing the first 10 items
    print(item)

[INFO] [starting] building docstore
[INFO] [starting] requesting access key
[INFO] [finished] requesting access key [5.82s]
docs_iter: 114528doc [03:07, 609.74doc/s]

CrisisFactsStreamDoc(doc_id='CrisisFACTS-009-News-0-0', event='CrisisFACTS-009', text="Massive explosions rock Lebanon’s capital of Beirut; Trump says it was an ‘attack’: At least 70 people were killed Tuesday and more than 3,000 wounded in multiple explosions that rocked Downtown Beirut, Lebanon's health minister said.", source='{"id": "alarabiya.net-22", "date": "2020-08-04", "source": "alarabiya.net", "title": "Massive explosions rock Lebanon\\u2019s capital of Beirut; Trump says it was an \\u2018attack\\u2019", "content": "At least 70 people were killed Tuesday and more than 3,000 wounded in multiple explosions that rocked Downtown Beirut, Lebanon\'s health minister said. US President Donald Trump said he had reason to believe that the blasts were an attack.\\n\\nBuildings several kilometers away suffered material damage, the explosions were heard over 20 kilometers away from Beirut and residents in Cyprus said they felt the blasts.\\n\\nAdvertisement\\n\\nFor all the latest headli


[INFO] [finished] docs_iter: [03:07] [114528doc] [609.74doc/s]
[INFO] [finished] building docstore [03:08]


The stream mixes items from several sources, and not all of them are relevant, particularly at the beginning of the event. To find content of a particular type, we can filter on the source_type field.

In [12]:
import pandas as pd

# Convert the stream of items to a Pandas Dataframe
itemsAsDataFrame = pd.DataFrame(dataset.docs_iter())

# Create a filter expression
is_reddit =  itemsAsDataFrame['source_type']=="Reddit"

# Apply our filter
itemsAsDataFrame[is_reddit]

Unnamed: 0,doc_id,event,text,source,source_type,unix_timestamp
5710,CrisisFACTS-009-Reddit-0-0,CrisisFACTS-009,"This just happened near us,in my country... a ...","{""subreddit_display"": ""r/teenagers"", ""subreddi...",Reddit,1596544575
5711,CrisisFACTS-009-Reddit-0-1,CrisisFACTS-009,And lebanon is a small country so most of the ...,"{""subreddit_display"": ""r/teenagers"", ""subreddi...",Reddit,1596544575
5712,CrisisFACTS-009-Reddit-0-2,CrisisFACTS-009,In the capital at least... its really depressi...,"{""subreddit_display"": ""r/teenagers"", ""subreddi...",Reddit,1596544575
5713,CrisisFACTS-009-Reddit-0-3,CrisisFACTS-009,i just wanna get out of here,"{""subreddit_display"": ""r/teenagers"", ""subreddi...",Reddit,1596544575
6102,CrisisFACTS-009-Reddit-1-0,CrisisFACTS-009,Idk where to post this an explosion in Lebanon...,"{""subreddit_display"": ""Well... That sucks..."",...",Reddit,1596544787
...,...,...,...,...,...,...
109930,CrisisFACTS-009-Reddit-157-1,CrisisFACTS-009,I feel so helpless not knowing exactly when an...,"{""body"": ""I'd been seeing a massive explosion ...",Reddit,1596583810
109931,CrisisFACTS-009-Reddit-157-2,CrisisFACTS-009,"Lots of prayers for Lebanon and its people, si...","{""body"": ""I'd been seeing a massive explosion ...",Reddit,1596583810
113389,CrisisFACTS-009-Reddit-158-0,CrisisFACTS-009,Keep making these.,"{""body"": ""Keep making these. There's more foot...",Reddit,1596585094
113390,CrisisFACTS-009-Reddit-158-1,CrisisFACTS-009,"There's more footage coming in constantly, you...","{""body"": ""Keep making these. There's more foot...",Reddit,1596585094


In [13]:
# Create a filter expression
is_twitter =  itemsAsDataFrame['source_type']=="Twitter"

# Apply our filter
itemsAsDataFrame[is_twitter]

Unnamed: 0,doc_id,event,text,source,source_type,unix_timestamp
73,CrisisFACTS-009-Twitter-115-0,CrisisFACTS-009,Opinion | We Lebanese Thought We Could Survive...,"{""created_at"":""Tue Aug 04 00:15:50 +0000 2020""...",Twitter,1596500150
74,CrisisFACTS-009-Twitter-115-1,CrisisFACTS-009,We Were Wrong. -,"{""created_at"":""Tue Aug 04 00:15:50 +0000 2020""...",Twitter,1596500150
75,CrisisFACTS-009-Twitter-115-2,CrisisFACTS-009,The New York Times https://t.co/VvergehA9A,"{""created_at"":""Tue Aug 04 00:15:50 +0000 2020""...",Twitter,1596500150
131,CrisisFACTS-009-Twitter-116-0,CrisisFACTS-009,"The synagogue of Bhamdoun, Lebanon, now and then.","{""created_at"":""Tue Aug 04 01:13:56 +0000 2020""...",Twitter,1596503636
132,CrisisFACTS-009-Twitter-116-1,CrisisFACTS-009,https://t.co/0Af5wyWLUD,"{""created_at"":""Tue Aug 04 01:13:56 +0000 2020""...",Twitter,1596503636
...,...,...,...,...,...,...
114518,CrisisFACTS-009-Twitter-16967-0,CrisisFACTS-009,@offthe_res @Gettingtrump Is this what went of...,"{""created_at"":""Tue Aug 04 23:59:53 +0000 2020""...",Twitter,1596585593
114519,CrisisFACTS-009-Twitter-16968-0,CrisisFACTS-009,May we also include Lebanon in our prayers ?,"{""created_at"":""Tue Aug 04 23:59:53 +0000 2020""...",Twitter,1596585593
114525,CrisisFACTS-009-Twitter-16969-0,CrisisFACTS-009,@CBSEveningNews @CBSNews @SayChrisLive I can?t...,"{""created_at"":""Tue Aug 04 23:59:57 +0000 2020""...",Twitter,1596585597
114526,CrisisFACTS-009-Twitter-16970-0,CrisisFACTS-009,So far there are 70 people reported dead in Be...,"{""created_at"":""Tue Aug 04 23:59:58 +0000 2020""...",Twitter,1596585598


In [14]:
# Create a filter expression
is_fb =  itemsAsDataFrame['source_type']=="Facebook"

# Apply our filter
itemsAsDataFrame[is_fb]

Unnamed: 0,doc_id,event,text,source,source_type,unix_timestamp
66,CrisisFACTS-009-Facebook-0-0,CrisisFACTS-009,استقالة عنقوديّة – سمير عطا الله – الشرق الأوسط,"{""Page Name"": ""www.beirutobserver.com"", ""User ...",Facebook,1596499322
67,CrisisFACTS-009-Facebook-1-0,CrisisFACTS-009,استقالة حتي بداية تفكك الحكومة وسقوطها,"{""Page Name"": ""Voice of Lebanon 100.5 FM"", ""Us...",Facebook,1596499503
68,CrisisFACTS-009-Facebook-2-0,CrisisFACTS-009,Pompeo says Iranian-Chinese agreement will ‘de...,"{""Page Name"": ""SYria Real Infos And News - SYR...",Facebook,1596499713
69,CrisisFACTS-009-Facebook-3-0,CrisisFACTS-009,#DreamDestination #Travel #Tourism #Lebanon #S...,"{""Page Name"": ""Million Dollar Homepage-Lebanon...",Facebook,1596499980
70,CrisisFACTS-009-Facebook-3-1,CrisisFACTS-009,68,"{""Page Name"": ""Million Dollar Homepage-Lebanon...",Facebook,1596499980
...,...,...,...,...,...,...
114520,CrisisFACTS-009-Facebook-28394-0,CrisisFACTS-009,Le bilan provisoire des #explosions survenues ...,"{""Page Name"": ""CGTN Fran\u00e7ais"", ""User Name...",Facebook,1596585596
114521,CrisisFACTS-009-Facebook-28394-1,CrisisFACTS-009,Deux énormes déflagrations ont secoué la capit...,"{""Page Name"": ""CGTN Fran\u00e7ais"", ""User Name...",Facebook,1596585596
114522,CrisisFACTS-009-Facebook-28394-2,CrisisFACTS-009,Selon les informations actuellement disponible...,"{""Page Name"": ""CGTN Fran\u00e7ais"", ""User Name...",Facebook,1596585596
114523,CrisisFACTS-009-Facebook-28394-3,CrisisFACTS-009,(GIF:CCTV),"{""Page Name"": ""CGTN Fran\u00e7ais"", ""User Name...",Facebook,1596585596


In [15]:
# Create a filter expression
is_news =  itemsAsDataFrame['source_type']=="News"

# Apply our filter
itemsAsDataFrame[is_news]

Unnamed: 0,doc_id,event,text,source,source_type,unix_timestamp
0,CrisisFACTS-009-News-0-0,CrisisFACTS-009,Massive explosions rock Lebanon’s capital of B...,"{""id"": ""alarabiya.net-22"", ""date"": ""2020-08-04...",News,1596499200
1,CrisisFACTS-009-News-0-1,CrisisFACTS-009,US President Donald Trump said he had reason t...,"{""id"": ""alarabiya.net-22"", ""date"": ""2020-08-04...",News,1596499200
2,CrisisFACTS-009-News-0-2,CrisisFACTS-009,Buildings several kilometers away suffered mat...,"{""id"": ""alarabiya.net-22"", ""date"": ""2020-08-04...",News,1596499200
3,CrisisFACTS-009-News-0-3,CrisisFACTS-009,Advertisement\n\n,"{""id"": ""alarabiya.net-22"", ""date"": ""2020-08-04...",News,1596499200
4,CrisisFACTS-009-News-0-4,CrisisFACTS-009,For all the latest headlines follow our Google...,"{""id"": ""alarabiya.net-22"", ""date"": ""2020-08-04...",News,1596499200
...,...,...,...,...,...,...
90930,CrisisFACTS-009-News-26-13,CrisisFACTS-009,“I also had discussions with my peacekeeping s...,"{""id"": ""fbcnews.com.fj-112"", ""date"": ""2020-08-...",News,1596576405
90931,CrisisFACTS-009-News-26-14,CrisisFACTS-009,International media report the blast has kille...,"{""id"": ""fbcnews.com.fj-112"", ""date"": ""2020-08-...",News,1596576405
112953,CrisisFACTS-009-News-27-0,CrisisFACTS-009,Beirut explosion: Frantic search for survivors...,"{""id"": ""bbc.com-87"", ""date"": ""2020-08-04"", ""so...",News,1596584905
112954,CrisisFACTS-009-News-27-1,CrisisFACTS-009,Three Beirut hospitals were closed with two ot...,"{""id"": ""bbc.com-87"", ""date"": ""2020-08-04"", ""so...",News,1596584905


You now have the data necessary for producing lists of facts for a given event-day pair.
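
Before moving on, it can be useful to check how the stream is distributed across sources and how many sentences each original post was split into. The sketch below uses a handful of `doc_id` and `source_type` values copied from the outputs above as a stand-in; in the notebook you would apply the same calls to `itemsAsDataFrame`. The `post_id` column is our own addition, derived by dropping the trailing sentence number from `doc_id`.

```python
import pandas as pd

# A few (doc_id, source_type) pairs taken from the outputs above
sample = pd.DataFrame({
    "doc_id": [
        "CrisisFACTS-009-News-0-0",
        "CrisisFACTS-009-News-0-1",
        "CrisisFACTS-009-Reddit-0-3",
        "CrisisFACTS-009-Reddit-158-0",
        "CrisisFACTS-009-Twitter-115-0",
        "CrisisFACTS-009-Facebook-3-1",
    ],
    "source_type": ["News", "News", "Reddit", "Reddit", "Twitter", "Facebook"],
})

# Number of stream items per source
counts = sample["source_type"].value_counts()

# Strip the trailing sentence number to get an identifier for the original post
sample["post_id"] = sample["doc_id"].str.rsplit("-", n=1).str[0]

# Number of stream items (sentences) per original post
sentences_per_post = sample.groupby("post_id").size()

print(counts)
print(sentences_per_post)
```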

<hr>

**Part 4. Queries for Disaster Manager/Responder Information Needs**

Clearly not all of the information in the input stream for each day will be useful for an emergency responder, or even be relevant. Hence it makes sense that we filter these streams down based on what the emergency responder cares about. Our task is focused on producing timeline summaries containing similar information to what might be entered into an after action report, similar to an ICS 209 form: 
* https://training.fema.gov/emiweb/is/icsresource/assets/ics%20forms/ics%20form%20209,%20incident%20status%20summary%20(v3).pdf

But how can we express this information need in a way that a computer can understand? 

To make it easier for participant systems to integrate content relevant to the event, we have manually constructed a set of queries that encapsulate this information need. These queries are in effect questions that an emergency responder might ask when writing their after action report.

These queries are included as part of each day of the CrisisFACTs dataset, and you can access them as follows:

In [16]:
import pandas as pd

pd.DataFrame(dataset.queries_iter())

[INFO] [starting] requesting access key
[INFO] [finished] requesting access key [5.83s]


Unnamed: 0,query_id,text,indicative_terms,trecis_category_mapping,event_id,event_title,event_dataset,event_description,event_trecis_id,event_type,event_url
0,CrisisFACTS-General-q001,Have airports closed,airport closed,Report-Factoid,CrisisFACTS-009,Beirut Explosion,beirutExplosion2020,"On 4 August 2020, a large amount of ammonium n...",TRECIS-CTIT-H-066,Accident,https://en.wikipedia.org/wiki/2020_Beirut_expl...
1,CrisisFACTS-General-q002,Have railways closed,rail closed,Report-Factoid,CrisisFACTS-009,Beirut Explosion,beirutExplosion2020,"On 4 August 2020, a large amount of ammonium n...",TRECIS-CTIT-H-066,Accident,https://en.wikipedia.org/wiki/2020_Beirut_expl...
2,CrisisFACTS-General-q003,Have water supplied been contaminated,water supply,Report-EmergingThreats,CrisisFACTS-009,Beirut Explosion,beirutExplosion2020,"On 4 August 2020, a large amount of ammonium n...",TRECIS-CTIT-H-066,Accident,https://en.wikipedia.org/wiki/2020_Beirut_expl...
3,CrisisFACTS-General-q004,How many firefighters are active,firefighters on-duty,Report-Factoid,CrisisFACTS-009,Beirut Explosion,beirutExplosion2020,"On 4 August 2020, a large amount of ammonium n...",TRECIS-CTIT-H-066,Accident,https://en.wikipedia.org/wiki/2020_Beirut_expl...
4,CrisisFACTS-General-q005,How many people are affected,evacuated,Report-Factoid,CrisisFACTS-009,Beirut Explosion,beirutExplosion2020,"On 4 August 2020, a large amount of ammonium n...",TRECIS-CTIT-H-066,Accident,https://en.wikipedia.org/wiki/2020_Beirut_expl...
5,CrisisFACTS-General-q006,How many people are in shelters,shelters,Report-Factoid,CrisisFACTS-009,Beirut Explosion,beirutExplosion2020,"On 4 August 2020, a large amount of ammonium n...",TRECIS-CTIT-H-066,Accident,https://en.wikipedia.org/wiki/2020_Beirut_expl...
6,CrisisFACTS-General-q007,How many people are missing,missing,Report-Factoid,CrisisFACTS-009,Beirut Explosion,beirutExplosion2020,"On 4 August 2020, a large amount of ammonium n...",TRECIS-CTIT-H-066,Accident,https://en.wikipedia.org/wiki/2020_Beirut_expl...
7,CrisisFACTS-General-q008,How many people are trapped,trapped,Request-SearchAndRescue,CrisisFACTS-009,Beirut Explosion,beirutExplosion2020,"On 4 August 2020, a large amount of ammonium n...",TRECIS-CTIT-H-066,Accident,https://en.wikipedia.org/wiki/2020_Beirut_expl...
8,CrisisFACTS-General-q009,How many people have been injured,injury injured,Report-Factoid,CrisisFACTS-009,Beirut Explosion,beirutExplosion2020,"On 4 August 2020, a large amount of ammonium n...",TRECIS-CTIT-H-066,Accident,https://en.wikipedia.org/wiki/2020_Beirut_expl...
9,CrisisFACTS-General-q010,How many people have been killed,killed dead,Report-Factoid,CrisisFACTS-009,Beirut Explosion,beirutExplosion2020,"On 4 August 2020, a large amount of ammonium n...",TRECIS-CTIT-H-066,Accident,https://en.wikipedia.org/wiki/2020_Beirut_expl...
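
Each query has both a natural-language `text` and a shorter `indicative_terms` field. One possible way to use them, sketched below, is to build a mapping from `query_id` to the string you intend to search with. The three rows are copied from the output above; the function name `build_search_strings` and its `field` parameter are assumptions made for illustration. Whether the question text or the indicative terms retrieves better content is something to test rather than assume.

```python
import pandas as pd

# Three queries copied from the output above
queries = pd.DataFrame({
    "query_id": [
        "CrisisFACTS-General-q001",
        "CrisisFACTS-General-q009",
        "CrisisFACTS-General-q010",
    ],
    "text": [
        "Have airports closed",
        "How many people have been injured",
        "How many people have been killed",
    ],
    "indicative_terms": ["airport closed", "injury injured", "killed dead"],
})


def build_search_strings(queries, field="indicative_terms"):
    """Map each query_id to the string in the chosen field."""
    return dict(zip(queries["query_id"], queries[field]))


search_strings = build_search_strings(queries)
for query_id, search_string in search_strings.items():
    print(query_id, "->", search_string)
```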


<hr> 



## Searching for Relevant Content in the Newly Downloaded Data

At this point you know how to get the data streams that you are to summarize, and you know what ideally should be included within your summary. This is the minimum that you need to tackle the CrisisFACTs task. However, one of the reasons that we integrated the CrisisFACTs datasets into ir_datasets is that it provides you with a plug-and-play means to perform text search of the content for a day via pyTerrier. This is useful as an initial step to find content that is relevant to the emergency responder information needs.

Before we get into creating our search engine, it's worth providing a very broad overview of how a (text) search engine works. At its core, a search engine produces a data structure called an index from your document set. This index makes it fast to identify documents containing a particular query term. To create our index, we need to provide the input documents and specify which fields in each document contain text that we want to be searchable.
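
To make the idea of an index concrete, here is a toy inverted index in plain Python. It is only an illustration of the concept and not how Terrier is implemented: the tokenizer is a simple lowercase whitespace split, and the three documents are made up for the example.

```python
from collections import defaultdict


def tokenize(text):
    """Very simple tokenizer: lowercase and split on whitespace."""
    return text.lower().split()


def build_index(docs, field="text"):
    """Map each term to the set of document ids whose `field` contains it."""
    index = defaultdict(set)
    for doc in docs:
        for term in tokenize(doc[field]):
            index[term].add(doc["docno"])
    return index


# Made-up documents, for illustration only
docs = [
    {"docno": "d1", "text": "Multiple injuries reported"},
    {"docno": "d2", "text": "Airport closed after blast"},
    {"docno": "d3", "text": "Injuries and damage near the port"},
]

index = build_index(docs)

# Looking up a term is now a dictionary access rather than a scan of all documents
print(sorted(index["injuries"]))
```

The `field` argument plays the same role as telling the indexer which document fields should be searchable.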


**Part 1. Installing Packages**

Search engines are a core part of the online information space, and much work has gone into making this technology accessible and easy to develop. A major package in this space that is designed to facilitate experimentation with search and information-retrieval methods is `Terrier` and its related Python bindings. To use this library, we install the following packages:

*   pyTerrier (https://pyterrier.readthedocs.io/en/latest/): pyTerrier is a python wrapper around the Terrier IR Platform (a search engine in-a-box). We will use this to produce a searchable index for each day during a crisis event, so we can retrieve (hopefully) relevant content for different information needs. 

In [12]:
!pip install python-terrier # install pyTerrier

**Part 2. Creating an Index**

We can take one of the request strings for an \<event,day\> pair and ask pyTerrier to create an index for us:

In [17]:
import pyterrier as pt

# Initialize pyTerrier if not started
if not pt.started():
    pt.init()

# Ask pyTerrier to download the dataset, the 'irds:' header tells pyTerrier to use ir_datasets as the data source
pyTerrierDataset = pt.get_dataset('irds:crisisfacts/009/2020-08-04')

# To create the index, we use an 'indexer', which iterates over the documents in the collection and adds them to the index
# The parameters of this call are:
#  Index Storage Path: "None" (some index types write to disk, this would be the directory to write to)
#  Index Type: type=pt.index.IndexingType(3) (Type 3 is a Memory Index)
#  Meta Index Fields: meta=['docno', 'text'] (The index also can store raw fields so they can be attached to the search results, this specifies what fields to store)
#  Meta Index Lengths: meta_lengths=[40, 200] (pyTerrier allocates a fixed amount of storage space per field, how many characters should this be?)
indexer = pt.IterDictIndexer("None", type=pt.index.IndexingType(3), meta=['docno', 'text'], meta_lengths=[40, 200])

# Trigger the indexing process
index = indexer.index(pyTerrierDataset.get_corpus_iter())

PyTerrier 0.9.2 has loaded Terrier 5.7 (built by craigm on 2022-11-10 18:30) and terrier-helper 0.0.7

No etc/terrier.properties, using terrier.default.properties for bootstrap configuration.
crisisfacts/009/2020-08-04 documents: 114528it [00:09, 12386.55it/s]


**Part 3. Handling Queries via the Retriever Object**

Now that we have an index, we can issue queries to it as you would to a web search engine. Since this is our index, we control how items are scored. Each item is scored using what is known as a weighting model: a function that produces a score based on the number of query terms the document contains, in combination with statistics of the documents in the dataset. Different weighting models suit different types of documents. For instance, the classical BM25 model is commonly used for longer texts such as web pages.
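
To illustrate what a weighting model does, the sketch below scores documents with a basic TF-IDF formula: term frequency in the document multiplied by a factor that grows as the term becomes rarer in the collection. This is deliberately not the model used in this notebook and not Terrier's implementation; it is a simplified stand-in, and the tokenized documents are made up.

```python
import math
from collections import Counter


def tf_idf_score(query_terms, doc_terms, doc_freq, num_docs):
    """Sum over query terms of (term frequency) * log(1 + num_docs / document frequency)."""
    tf = Counter(doc_terms)
    score = 0.0
    for term in query_terms:
        if tf[term] == 0:
            continue  # the term does not occur in this document
        idf = math.log(1 + num_docs / doc_freq[term])
        score += tf[term] * idf
    return score


# Made-up, already tokenized documents
docs = {
    "d1": ["multiple", "injuries"],
    "d2": ["airport", "closed"],
    "d3": ["injuries", "injuries", "reported"],
}

# Collection statistics: in how many documents does each term occur?
doc_freq = Counter(term for terms in docs.values() for term in set(terms))

scores = {
    docno: tf_idf_score(["injuries"], terms, doc_freq, len(docs))
    for docno, terms in docs.items()
}

# Rank documents by descending score
for docno, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(docno, round(score, 3))
```

Here `d3` ranks first because it mentions the query term twice, and `d2` scores zero because it does not contain it at all.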

In pyTerrier, we create a retriever object that will execute our queries. We pass the index to the retriever along with the weighting model we want to be used. We can also specify any raw fields we stored in the index for an item that we want to be attached to the search result, such as the original text:

In [18]:
retriever = pt.BatchRetrieve(index, wmodel="DFReeKLIM", metadata=["docno", "text"])

Now that we have a retriever, we can use it to issue queries:

In [19]:
pd.DataFrame(retriever.search("injuries"))



Unnamed: 0,qid,docid,docno,text,rank,score,query
0,1,30463,CrisisFACTS-009-Facebook-11442-0,"Blasts rock Beirut, widespread damage, injurie...",0,8.170213,injuries
1,1,10263,CrisisFACTS-009-Facebook-3882-1,Huge explosions rock Beirut with widespread da...,1,8.165600,injuries
2,1,13601,CrisisFACTS-009-Facebook-5190-0,Huge explosions rock Beirut with widespread da...,2,8.120484,injuries
3,1,82051,CrisisFACTS-009-Twitter-12222-0,Blast Injuries- types of injuries - medical ma...,3,8.120484,injuries
4,1,10439,CrisisFACTS-009-Facebook-3960-0,Huge explosions rock Beirut with widespread da...,4,8.099267,injuries
...,...,...,...,...,...,...,...
421,1,52090,CrisisFACTS-009-Twitter-4633-4,Multiple injuries.,421,4.238542,injuries
422,1,54019,CrisisFACTS-009-Twitter-5137-1,Lots of casualties/injuries,422,4.238542,injuries
423,1,70504,CrisisFACTS-009-Twitter-9477-2,But virtually everyone has some injury or anot...,423,4.238542,injuries
424,1,78172,CrisisFACTS-009-Twitter-11472-1,a lot of martyrs and injuries.,424,4.238542,injuries
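
The results above contain several items whose text looks identical, which is common when the same headline is shared many times. One simple post-processing step, sketched below, is to keep only the best-ranked item for each distinct text. The function name `drop_duplicate_texts` is our own, and the rows are toy data for illustration; in the notebook you would pass in `retriever.search(...)` instead. It assumes the results are already ordered by rank.

```python
import pandas as pd


def drop_duplicate_texts(results):
    """Keep the first (best-ranked) row for each distinct text.

    Texts are compared after lowercasing and stripping surrounding whitespace.
    """
    normalized = results["text"].str.lower().str.strip()
    return results.loc[~normalized.duplicated()].reset_index(drop=True)


# Toy search results, for illustration only
results = pd.DataFrame({
    "docno": ["toy-1", "toy-2", "toy-3"],
    "text": ["Blast reported", "blast reported ", "Multiple injuries."],
    "rank": [0, 1, 2],
})

deduplicated = drop_duplicate_texts(results)
print(deduplicated)
```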
