# Loading datasets

This notebook loads an initial set of LongEval data so that we can prepare it for indexing into OpenSearch.
Along the way, we'll take a look at the LongEval 2024 collection.

In [2]:
%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [3]:
data_root = "../../data/longeval"
! tree {data_root} -L 3

../../data/longeval
├── LongEval 2024 Test Collection Readme.pdf
├── LongEval 2024 Train Collection Readme.pdf
├── opensearch-v1
│   ├── batch_metrics_enabled.conf
│   ├── logging_enabled.conf
│   ├── nodes
│   │   └── 0
│   ├── performance_analyzer_enabled.conf
│   ├── rca_enabled.conf
│   └── thread_contention_monitoring_enabled.conf
├── parquet
│   ├── test
│   │   ├── 2023_06
│   │   └── 2023_08
│   └── train
│       └── 2023_01
└── raw
    ├── LongEval Test Collection
    │   ├── 2023_06
    │   └── 2023_08
    ├── LongEval Train Collection
    │   └── 2023_01
    ├── id_urls_2023_01.txt
    ├── id_urls_2023_06.txt
    └── id_urls_2023_08.txt

15 directories, 10 files


In [4]:
! head -n 10 {data_root}/raw/id_urls_2023_01.txt

doc012310000001	https://nationalinterest.org/blog/reboot/forget-nukes-tunnels-are-north-koreas-secret-weapon-164295
doc012310000002	https://ppmforums.com/viewtopic.php?t=46501&view=previous
doc012310000003	https://www.gettyimages.nl/detail/nieuwsfoto%27s/lawyer-james-b-donovan-widely-known-for-negotiating-the-nieuwsfotos/540536060
doc012310000004	https://nationalinterest.org/feature/what-rex-tillersons-nomination-means-russia-policy-18726
doc012310000005	https://en.pokechange.net/@Issan/collection
doc012310000006	http://www.llu.edu/pages/faculty/directory/portfolio_activity.php?catid=5&eid=1a30481&uid=junternaehrer
doc012310000007	https://wikimili.com/en/Kurt_Meyer
doc012310000008	http://www.historycommons.org/context.jsp?item=moorheadcensor72
doc012310000009	https://farsight.org/demo/Time_Cross_Project/2018/Time_Cross_January_2018_Events.html
doc012310000010	https://jnccn.org/abstract/journals/jnccn/15/8/article-p1028.xml?result=2&rskey=26Qf0J
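
The listing above suggests that each line pairs a document ID with its source URL, separated by a tab. A minimal sketch of reading that layout in plain Python, assuming the tab delimiter holds for the whole file; `parse_id_urls` is a hypothetical helper and not part of the `longeval` package:

```python
from urllib.parse import urlsplit


def parse_id_urls(lines):
    """Map docid -> URL from tab-separated lines (hypothetical helper)."""
    mapping = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        # split on the first tab only, in case a URL ever contains one
        docid, url = line.split("\t", 1)
        mapping[docid] = url
    return mapping


# two lines copied from the listing above
sample = [
    "doc012310000007\thttps://wikimili.com/en/Kurt_Meyer\n",
    "doc012310000005\thttps://en.pokechange.net/@Issan/collection\n",
]
id_urls = parse_id_urls(sample)
# hostnames give a quick view of which sites the documents come from
hosts = {docid: urlsplit(url).netloc for docid, url in id_urls.items()}
```

For the real file, passing an open file handle (`parse_id_urls(open(path))`) should work the same way, since the function only iterates over lines.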


In [16]:
from longeval.spark import get_spark

data_root = "../../data/longeval"

spark = get_spark()
spark

24/12/19 23:19:29 WARN SparkSession: Using an existing Spark session; only runtime SQL configurations will take effect.


In [17]:
! head -n 10 {data_root}/raw/"LongEval Test Collection"/2023_06/English/Documents/Trec/collector_kodicare_4.txt

<DOC>
<DOCNO>doc062300400001</DOCNO>
<DOCID>doc062300400001</DOCID>
<TEXT>
Request for Quotation LLD
Renault Traffic
- Request for a long-term rental quotation
Renault Traffic //
Your quote request LLD / LOA
Renault Renault TRAFFIC
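
The sample above shows a TREC-style layout: a `<DOC>` wrapper with `<DOCNO>`, `<DOCID>`, and `<TEXT>` inside. A rough sketch of pulling `(docid, contents)` pairs out of such text with a regular expression follows. The closing `</TEXT>` and `</DOC>` tags are an assumption, since the truncated sample does not show them, and `parse_trec` is a hypothetical helper rather than how `longeval.collection` actually parses the files:

```python
import re

# Assumed layout: <DOC> ... <DOCNO>id</DOCNO> ... <TEXT> body </TEXT> ... </DOC>
DOC_RE = re.compile(
    r"<DOC>\s*<DOCNO>(?P<docid>[^<]+)</DOCNO>.*?"
    r"<TEXT>\n?(?P<contents>.*?)</TEXT>\s*</DOC>",
    re.DOTALL,  # let "." span newlines, since documents are multi-line
)


def parse_trec(text):
    """Yield (docid, contents) pairs from TREC-style text (hypothetical helper)."""
    for match in DOC_RE.finditer(text):
        yield match.group("docid").strip(), match.group("contents").strip()


sample = """<DOC>
<DOCNO>doc062300400001</DOCNO>
<DOCID>doc062300400001</DOCID>
<TEXT>
Request for Quotation LLD
Renault Traffic
</TEXT>
</DOC>
"""
parsed = list(parse_trec(sample))
```

A regex is enough for a quick look, but it reads the whole file into memory; a line-by-line state machine would likely scale better for the full collection.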


In [18]:
# count the opening tags in one file of the test collection (the pattern skips closing tags)
! cat {data_root}/raw/"LongEval Test Collection"/2023_06/English/Documents/Trec/collector_kodicare_3.txt | \
    grep -oP "<[^>\/]+>" | sort | uniq -c | sort -nr | head -n 8

  17887 <TEXT>
  17887 <DOCNO>
  17887 <DOCID>
  17887 <DOC>
     82 <br>
     32 <p>
     27 <strong>
     26 <reference>
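
The counts above show four structural tags that occur once per document, plus a few stray HTML tags inside the text. The same tally can be sketched in Python with `re` and `collections.Counter`; the pattern mirrors the `grep` expression, and `count_open_tags` is a hypothetical helper name:

```python
import re
from collections import Counter

# Same idea as the grep pattern: "<", then one or more characters that are
# neither ">" nor "/", then ">". Excluding "/" skips closing tags such as
# </DOC>. "\n" is excluded as well so matches stay on one line, as with grep.
TAG_RE = re.compile(r"<[^>/\n]+>")


def count_open_tags(text):
    """Count occurrences of each opening tag in a string."""
    return Counter(TAG_RE.findall(text))


sample = "<DOC>\n<DOCNO>d1</DOCNO>\n<TEXT>\nline one<br>line two\n</TEXT>\n</DOC>\n"
counts = count_open_tags(sample)
# most_common() plays the role of `sort | uniq -c | sort -nr`
top = counts.most_common(8)
```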


```bash
! cat {data_root}/raw/"LongEval Train Collection"/2023_01/English/Documents/Trec/* | \
    grep -oaE "<[^>\/]+>" | sort | uniq -c | sort -nr | head -n 8

2033736 <TEXT>
2033736 <DOCNO>
2033736 <DOCID>
2033736 <DOC>
```

In [21]:
from longeval.collection import RawCollection

# according to the grep above, there are 2033736 documents in the collection
train_eng = RawCollection(
    spark, f"{data_root}/raw/LongEval Train Collection/2023_01/English"
)
docs = train_eng.documents.cache()
%time docs.count()

24/12/19 23:21:07 WARN CacheManager: Asked to cache already cached data.


In [12]:
docs.show(5, vertical=True, truncate=80)

-RECORD 0------------------------------------------------------------------------------------
 contents | Volt\nSystem\n–\nThe Tiffen Company SHOPPING CART title\ndetails THE\nHEART\n... 
 docid    | doc012300700001                                                                  
-RECORD 1------------------------------------------------------------------------------------
 contents | Eurovision\nSong\nContest Story — 12 years ago\n— 12 years\nago Story — 12 ye... 
 docid    | doc012300700002                                                                  
-RECORD 2------------------------------------------------------------------------------------
 contents | Assembly rules out the possibility of pharmacists prescribing certain medicin... 
 docid    | doc012300700003                                                                  
-RECORD 3------------------------------------------------------------------------------------
 contents | Nanne Grönvall\n| BeatZone\nAlbums\nEvents\nNann

In [13]:
train_eng.queries.show(5, vertical=True)

-RECORD 0---------------------
 qid   | q012318              
 query | case over the border 
-RECORD 1---------------------
 qid   | q012396              
 query | water atlantic       
-RECORD 2---------------------
 qid   | q0123180             
 query | blanquette de vea... 
-RECORD 3---------------------
 qid   | q0123240             
 query | gift woman           
-RECORD 4---------------------
 qid   | q0123387             
 query | Government           
only showing top 5 rows



In [14]:
train_eng.qrels.show(5, vertical=True)

-RECORD 0----------------
 qid   | q012318         
 rank  | 0               
 docid | doc012303114898 
 rel   | 0               
-RECORD 1----------------
 qid   | q012318         
 rank  | 0               
 docid | doc012307806130 
 rel   | 1               
-RECORD 2----------------
 qid   | q012318         
 rank  | 0               
 docid | doc012311314092 
 rel   | 0               
-RECORD 3----------------
 qid   | q012318         
 rank  | 0               
 docid | doc012301310209 
 rel   | 0               
-RECORD 4----------------
 qid   | q012318         
 rank  | 0               
 docid | doc012311608989 
 rel   | 0               
only showing top 5 rows
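
Each qrels row ties a query to one judged document. As a small illustration in plain Python, the five rows shown above can be grouped per query to see how many documents were judged and which were marked relevant. Treating `rel > 0` as "relevant" is an assumption here, and the meaning of the `rank` column (0 in every row shown) is not stated in this notebook:

```python
from collections import defaultdict

# Rows copied from the qrels output above: (qid, rank, docid, rel)
qrels = [
    ("q012318", 0, "doc012303114898", 0),
    ("q012318", 0, "doc012307806130", 1),
    ("q012318", 0, "doc012311314092", 0),
    ("q012318", 0, "doc012301310209", 0),
    ("q012318", 0, "doc012311608989", 0),
]

judged = defaultdict(int)     # qid -> number of judged documents
relevant = defaultdict(list)  # qid -> docids with rel > 0 (assumed "relevant")
for qid, _rank, docid, rel in qrels:
    judged[qid] += 1
    if rel > 0:
        relevant[qid].append(docid)
```

On the full dataset the same grouping would more naturally be a Spark `groupBy("qid")` aggregation over `train_eng.qrels`.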



In [20]:
from longeval.collection import ParquetCollection

train_eng = ParquetCollection(spark, f"{data_root}/parquet/train/2023_01/English")
%time train_eng.documents.count()

CPU times: user 3.08 ms, sys: 530 μs, total: 3.61 ms
Wall time: 5.27 s

2033736