# How we do it in Barbera1 - RCV1

We use Terrier 4.2 with the RCV1 dataset.

The script we used to convert the JSON dump to Terrier input files did not include the TITLE field, so we indexed only the BODY.
We will use the TITLE (unprocessed) as the query.

Index properties:
- we have only BODY
- we have a positional index: block.indexing=True


We apply BM25Classic and BM25Passage.
- for BM25Passage we divide each document into 10 equal parts
- for BM25P we apply the weights from our statistical analysis of the text (TF-IDF and IDF); see the notebook `High IDF Terms in Signal collection.ipynb`
- we evaluate P@1, NDCG@k, and Recall@k
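
The passage-based weighting can be sketched as follows. This is a minimal illustration, not the Terrier implementation used for these runs: it assumes a document is a list of tokens, splits it into 10 contiguous near-equal passages, and combines the per-passage term frequencies with one weight per passage. The function names and the example weights are hypothetical; the real weights come from the statistical analysis mentioned above.

```python
from collections import Counter

def split_into_passages(tokens, n_passages=10):
    """Split a token list into n_passages contiguous, near-equal parts."""
    n = len(tokens)
    # Passage i covers tokens[bounds[i]:bounds[i + 1]]; sizes differ by at most 1.
    bounds = [(i * n) // n_passages for i in range(n_passages + 1)]
    return [tokens[bounds[i]:bounds[i + 1]] for i in range(n_passages)]

def weighted_tf(tokens, term, weights):
    """Sum the per-passage frequency of `term`, scaled by each passage's weight."""
    passages = split_into_passages(tokens, len(weights))
    return sum(w * Counter(p)[term] for w, p in zip(weights, passages))

tokens = ("fed keeps rates steady " * 5).split()  # 20 tokens, 10 passages of 2
uniform = [1.0] * 10
front_loaded = [2.0] + [1.0] * 9                  # hypothetical weights

print(weighted_tf(tokens, "fed", uniform))        # 5.0, same as the raw count
print(weighted_tf(tokens, "fed", front_loaded))   # 6.0, first passage counts double
```

With uniform weights the weighted frequency equals the raw term frequency, so a BM25 score computed from it would match the classic one; non-uniform weights let occurrences in some passages count more than others.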

## 1. Creating queries

In [12]:
%load_ext autoreload
%autoreload 2


from elasticsearch import Elasticsearch, RequestError
import json
import pandas as pd
import numpy as np

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [4]:
dataset = pd.read_json("../data/rcv1_news.json", lines=True)
dataset.shape[0]

dataset

| | categories | content | date | headline | id | lang | title |
|---|---|---|---|---|---|---|---|
| 0 | [E11, ECAT, M11, M12, MCAT] | Emerging evidence that Mexico's economy was ba... | 1996-08-20 | Recovery excitement brings Mexican markets to ... | 2286 | en | MEXICO: Recovery excitement brings Mexican mar... |
| 1 | [C24, CCAT] | Chrysler Corp. Tuesday announced $380 million ... | 1996-08-20 | Chrysler plans new investments in Latin America. | 2287 | en | USA: Chrysler plans new investments in Latin A... |
| 2 | [C15, C151, CCAT, E41, ECAT, GCAT, GJOB] | CompuServe Corp. Tuesday reported a surprising... | 1996-08-20 | CompuServe reports loss, cutting work force. | 2288 | en | USA: CompuServe reports loss, cutting work force. |
| 3 | [C15, C151, CCAT] | CompuServe Corp. Tuesday reported a surprising... | 1996-08-20 | CompuServe reports loss, cutting work force. | 2289 | en | USA: CompuServe reports loss, cutting work force. |
| 4 | [C11, C22, CCAT] | If dining at Planet Hollywood made you feel li... | 1996-08-20 | Planet Hollywood launches credit card. | 2290 | en | USA: Planet Hollywood launches credit card. |
| 5 | [M14, MCAT] | Hog prices fell Tuesday after government slaug... | 1996-08-20 | Hog prices tumble as supplies increase, cocoa ... | 2291 | en | USA: Hog prices tumble as supplies increase, c... |
| 6 | [M11, M12, M13, M132, M14, MCAT] | Blue-chip stocks rallied Tuesday after the Fed... | 1996-08-20 | Blue chips end up as Fed keeps interest rates ... | 2292 | en | USA: Blue chips end up as Fed keeps interest r... |
| 7 | [C22, CCAT] | Sprint Corp. Tuesday announced plans to offer ... | 1996-08-20 | Sprint to offer consumer Internet access service. | 2293 | en | USA: Sprint to offer consumer Internet access ... |
| 8 | [E14, ECAT] | Shoppers are loading up this year on perennial... | 1996-08-20 | Back-to-school spending is up. | 2294 | en | USA: Back-to-school spending is up. |
| 9 | [C12, CCAT, GCAT, GCRIM] | Kansas and Arizona filed lawsuits against some... | 1996-08-20 | Kansas, Arizona add to suits against tobacco f... | 2295 | en | USA: Kansas, Arizona add to suits against toba... |


In [5]:
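# regex=True is needed on pandas >= 2.0, where str.replace matches literally by default
newTitle = dataset.headline.str.replace('[^a-zA-Z ]', '', regex=True)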

In [10]:
newTitle[0:100]

0     Recovery excitement brings Mexican markets to ...
1       Chrysler plans new investments in Latin America
2            CompuServe reports loss cutting work force
3            CompuServe reports loss cutting work force
4                 Planet Hollywood launches credit card
5     Hog prices tumble as supplies increase cocoa g...
6     Blue chips end up as Fed keeps interest rates ...
7      Sprint to offer consumer Internet access service
8                           Backtoschool spending is up
9     Kansas Arizona add to suits against tobacco firms
10      Chains may raise prices after minimum wage hike
11    Blue chips end up as Fed keeps interest rates ...
12       US Federal Reserve holds interest rates steady
13         Lloyds CEO questioned in recovery suit in US
14                 Most active stocks in Nasdaq trading
15                                NYSE closing averages
16       Ohio Blue Cross approves  mln ColumbiaHCA deal
17    War hero Colin Powell hits road with Dole 

In [7]:
# header=False keeps a header row out of the query file on pandas >= 1.0
newTitle.to_csv('RCV1-title-queries.txt', sep=':', header=False)
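
The regular expression used for `newTitle` deletes every character that is not an ASCII letter or a space, which also fuses hyphenated words (`Back-to-school` becomes `Backtoschool` in the output) and drops numbers. A possible alternative, shown only as a sketch and not what was used for the runs in this notebook, replaces those characters with a space and collapses repeated whitespace. The helper names are hypothetical.

```python
import re

def clean_title_delete(title):
    """Same rule as the notebook: drop everything except ASCII letters and spaces."""
    return re.sub(r"[^a-zA-Z ]", "", title)

def clean_title_space(title):
    """Alternative: turn other characters into spaces, then collapse whitespace."""
    return " ".join(re.sub(r"[^a-zA-Z ]", " ", title).split())

headline = "Back-to-school spending is up."
print(clean_title_delete(headline))  # Backtoschool spending is up
print(clean_title_space(headline))   # Back to school spending is up
```

The second variant keeps the individual words as separate query terms, which may matter when the index tokenizes hyphenated words in the body into separate terms.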

## 2. Running the queries in Terrier

#### BM25Classic

My command - not using terrier passage:

    bin/trec_terrier.sh -r -Dtrec.model=BM25 \
                           -Dtrec.topics=/home/muntean/RCV1/data/RCV1-title-queries.txt \
                           -Dtrec.results=/home/muntean/RCV1/results \
                           -Dtrec.results.file=BM25-RCV1-all-title-queries.res
                           
    12:17:31.764 [main] INFO  o.t.a.batchquerying.TRECQuerying - Time to process query: 0.054
    12:17:31.799 [main] INFO  o.t.a.batchquerying.TRECQuerying - Settings of Terrier written to /home/muntean/RCV1/results/BM25-RCV1-all-title-queries.res.settings
    12:17:31.799 [main] INFO  o.t.a.batchquerying.TRECQuerying - Finished topics, executed 804406 queries in 66634.29 seconds, results written to /home/muntean/RCV1/results/BM25-RCV1-all-title-queries.res
    Time elapsed: 66637.903 seconds.

__We then run count-query-type-Terrier.py on the BM25-RCV1-all-title-queries.res file__

```
('A', 225227)
('C', 152174)
('B', 221300)
('D', 105383)
```
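
Terrier writes the `.res` file in the standard TREC run format, one line per retrieved document: `query_id Q0 docno rank score run_tag`. The `count-query-type-Terrier.py` script is not shown here, so the following is only a sketch of how such a file could be loaded and grouped by query before any classification. `read_trec_run` is a hypothetical helper and the sample lines are made up for illustration.

```python
from collections import defaultdict

def read_trec_run(lines):
    """Group TREC run lines by query id, keeping (docno, rank, score) tuples."""
    run = defaultdict(list)
    for line in lines:
        parts = line.split()
        if len(parts) != 6:  # skip blank or malformed lines
            continue
        qid, _q0, docno, rank, score, _tag = parts
        run[qid].append((docno, int(rank), float(score)))
    return run

# Made-up sample lines, not taken from the real results file
sample = [
    "0 Q0 2286 0 14.87 BM25",
    "0 Q0 2292 1 11.02 BM25",
    "1 Q0 2287 0 16.40 BM25",
]
run = read_trec_run(sample)
print(run["0"][0])  # ('2286', 0, 14.87)
print(len(run))     # 2
```

For the real file, an open file handle can be passed directly, for example `with open(path) as f: run = read_trec_run(f)`.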


## 3. Running the queries on Elastic!
#### JUST for BM25 so as to find query types
An article has the following fields:  

```
{'_index': 'articles',
 '_type': 'article',
 '_id': '26',
 '_score': 1.0,
 '_source': {'title': "Survivor Upset Use of 'Eye of the Tiger' With Kim Davis",
             'media-type': 'Blog',
             'content': 'On Tuesday afternoon, Kim Davis, the Kentucky clerk whose imprisonment for not granting same-sex marriage licenses became a cause célèbre for the right, was released amid fanfare and a glowing introduction by Republican presidential hopeful Mike Huckabee.As Davis, her husband and attorney ...',
             'source': 'Latest News on One News Page [United States] - Top Headlines and News Videos',
             'published': '2015-09-10T03:00:15Z',
             'id': '7516303b-0db5-477d-9e5d-243a73865e39'}}
```

In [8]:
es = Elasticsearch(['http://localhost/'], 
                    #http_auth=('elastic', 'bm25p'),
                    port=9200,
                    timeout=30
                    )

In [9]:
res = es.search(index="rcv1", 
                doc_type='article', 
                size=10, 
                body={"query": {
                        "match": {
                                "content": "Alex Wagner"
                        }}})

# [x['_source'] for x in res['hits']['hits']]
for r in res['hits']['hits'][:3]:
    print(r)
    print()

{'_index': 'rcv1', '_type': 'article', '_id': '23152', '_score': 14.878456, '_source': {'id': '26153', 'date': '1996-09-01', 'lang': 'en', 'title': 'ECUADOR: SOCCER-ECUADOR TAKE ANOTHER STEP TOWARDS FIRST WORLD CUP.', 'headline': 'SOCCER-ECUADOR TAKE ANOTHER STEP TOWARDS FIRST WORLD CUP.', 'content': "QUITO, Sept 1 (Ecuador) - Ecuador took another step towards their first World Cup when they beat Venezuela 1-0 on Sunday to record their third win in four qualifying games.\nCaptain Alex Aguinaga scored the winner in a game featuring the only two South American nations who have never played in the World Cup tournament.\nDespite three more precious points, Ecuador were jeered off the field by a 45,000 crowd who had expected a higher score against South American's weakest team who have taken only one point from their four games so far.\nAguinaga snapped up a rebound following a shot by Brazilian-born Gilson de Souza in the fourth minute but Ecuador were unable to add to their tally.\nEcuado
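
To run every title as a query rather than a single hand-written one, the request body can be built by a small helper. This is a sketch under assumptions: `build_match_query` and `top_ids` are hypothetical names, and the response is assumed to have the `hits.hits` layout shown in the output above. The commented call mirrors the `es.search` call used in this notebook and is not executed here.

```python
def build_match_query(text, field="content"):
    """Return a match-query body like the one used in the search call above."""
    return {"query": {"match": {field: text}}}

def top_ids(response):
    """Extract the `_id` of each hit from a search response dictionary."""
    return [hit["_id"] for hit in response["hits"]["hits"]]

body = build_match_query("Chrysler plans new investments in Latin America")
print(body)

# With a live cluster, following the call used in this notebook:
# res = es.search(index="rcv1", doc_type="article", size=10, body=body)
# print(top_ids(res))

# Stand-in response with the same shape, for illustration only
fake_response = {"hits": {"hits": [{"_id": "23152"}, {"_id": "7"}]}}
print(top_ids(fake_response))  # ['23152', '7']
```

Keeping the body construction separate from the network call makes it easy to loop over all cleaned titles and to check the generated queries without a running cluster.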

__We then run count-query-type-Elastic.py on the BM25-RCV1-all-title-queries.res file__

```
('A', 238580)
('C', 152506)
('B', 228059)
('D', 103161)
```

## 4. Generate samples (10,000) for queries in the intersection between Terrier and Elastic - Global

See notebook: http://146.48.82.32:9999/notebooks/RCV1/notebooks/Create%20RCV1%20final%20query%20set%20from%20Terrier%20and%20Elastic%20on%20BM25.ipynb
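
The linked notebook is not reproduced here. As a rough sketch of the idea in the heading, assuming each system yields a mapping from query id to its type (A to D), one could keep the queries that both systems assign to the same type and draw a fixed-size random sample. All names, the toy data, and the choice to require matching types are assumptions for illustration.

```python
import random

def sample_common_queries(terrier_types, elastic_types, k, seed=42):
    """Sample up to k query ids that both systems label with the same type."""
    common = sorted(
        qid for qid, qtype in terrier_types.items()
        if elastic_types.get(qid) == qtype
    )
    rng = random.Random(seed)  # fixed seed for a reproducible sample
    return rng.sample(common, min(k, len(common)))

# Toy data, not taken from the real runs
terrier_types = {"0": "A", "1": "B", "2": "C", "3": "A"}
elastic_types = {"0": "A", "1": "C", "2": "C", "4": "D"}

picked = sample_common_queries(terrier_types, elastic_types, k=10)
print(sorted(picked))  # ['0', '2']
```

On the full query set, `k` would be set to 10000.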