# Run the pipeline (or individual steps) on dev data and compute the AVeriTeC score at METEOR threshold 0.25

In [None]:
!test -n "$OPENAI_API_KEY" && echo "OPENAI_API_KEY is set" || echo "OPENAI_API_KEY is not set"

In [None]:
%pip install spacy
!python -m spacy download en_core_web_lg
!python -m nltk.downloader punkt
!python -m nltk.downloader wordnet

## ⚙️ 0. Prepare the environment

In [None]:
%pip install -r requirements.txt
!python -m spacy download en_core_web_lg
!python -m nltk.downloader punkt
!python -m nltk.downloader wordnet
!sh script/copy_data.sh

## 🔎 1. Get precomputed Google API results and scrape text from their URLs

In [None]:
!bash script/scraper.sh dev 0 500

## 🥇 2. Rank search results with BM25

In [None]:
%run src/reranking/bm25_sentences.py
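
To make the ranking step concrete, here is a minimal, dependency-free sketch of Okapi BM25 over sentences. It is not the code in `src/reranking/bm25_sentences.py`: the function names `bm25_scores` and `top_k`, the whitespace tokenizer, and the parameters `k1=1.5` and `b=0.75` are illustrative assumptions.

```python
import math
from collections import Counter


def bm25_scores(query, sentences, k1=1.5, b=0.75):
    """Score each sentence against the query with Okapi BM25 (illustrative)."""
    if not sentences:
        return []
    # Assumption: lowercase whitespace tokenization; the real script may differ.
    docs = [s.lower().split() for s in sentences]
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs or 1.0
    # Document frequency: number of sentences that contain each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            # Smoothed IDF; the +1 inside the log keeps it positive.
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            # Term frequency saturation with length normalization.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def top_k(query, sentences, k=3):
    """Return the k highest-scoring sentences, best first."""
    scores = bm25_scores(query, sentences)
    ranked = sorted(zip(scores, sentences), key=lambda p: p[0], reverse=True)
    return [sentence for _, sentence in ranked[:k]]


# Toy data, invented for illustration only.
sentences = [
    "the vaccine was approved in 2020",
    "cats sleep most of the day",
    "the vaccine trial enrolled 30000 people",
]
best = top_k("vaccine approved", sentences, k=2)
```

In this toy example the sentence that matches both query terms ranks first, and the unrelated sentence scores zero.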

## 💬 3. Generate QA pairs for the top sentences

In [None]:
%run src/reranking/question_generation_top_sentences.py

## 🥈 4. Rerank the QA pairs

In [None]:
%run src/reranking/rerank_questions.py

## 🤥 5. Predict veracity

In [None]:
%run src/prediction/veracity_prediction.py

## 📊 6. Evaluate the pipeline

In [2]:
%run src/prediction/evaluate_veracity.py --label_file data/dev.json --prediction_file data_store/dev_veracity_prediction_bck.json

Question-only score (HU-meteor):             0.24041210604919014
Question-answer score (HU-meteor):           0.18547341231661782
Veracity F1 scores:
 * Supported:                                0.4372093023255814
 * Refuted:                                  0.7138157894736842
 * Not Enough Evidence:                      0.0
 * Conflicting Evidence/Cherrypicking:       0.13333333333333333
 * macro:                                    0.32108960628314975
 * acc:                                      0.546
--------------------
AVeriTeC scores:
 * Veracity scores (meteor @ 0.1):           0.452
 * Veracity scores (meteor @ 0.2):           0.186
 * Veracity scores (meteor @ 0.25):          0.092
 * Veracity scores (meteor @ 0.3):           0.05
 * Veracity scores (meteor @ 0.4):           0.012
 * Veracity scores (meteor @ 0.5):           0.002
--------------------
AVeriTeC scores by type @ 0.25:
 * Veracity scores (Event/Property Claim):   0.05979024836242316
 * Veracity scores (Position St
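
The `macro` line above is consistent with an unweighted mean of the four per-class F1 scores. The short check below reproduces it from the printed values; the variable names are illustrative and the snippet is not part of `evaluate_veracity.py`.

```python
# Per-class F1 values copied from the output of the full dev run above.
per_class_f1 = {
    "Supported": 0.4372093023255814,
    "Refuted": 0.7138157894736842,
    "Not Enough Evidence": 0.0,
    "Conflicting Evidence/Cherrypicking": 0.13333333333333333,
}

# Macro F1: every class counts equally, regardless of how many claims it has.
macro_f1 = sum(per_class_f1.values()) / len(per_class_f1)
print(f"{macro_f1:.4f}")  # 0.3211
```

Because each class carries equal weight, the 0.0 for Not Enough Evidence pulls the macro score well below the 0.546 accuracy.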

In [3]:
%run src/prediction/evaluate_veracity.py --label_file data/dev10.json --prediction_file data_store/dev10_veracity_prediction.json

Question-only score (HU-meteor):             0.2647792777614756
Question-answer score (HU-meteor):           0.17693171723529702
Veracity F1 scores:
 * Supported:                                0.5714285714285714
 * Refuted:                                  0.8
 * Not Enough Evidence:                      0.0
 * Conflicting Evidence/Cherrypicking:       0.0
 * macro:                                    0.34285714285714286
 * acc:                                      0.6
--------------------
AVeriTeC scores:
 * Veracity scores (meteor @ 0.1):           0.6
 * Veracity scores (meteor @ 0.2):           0.1
 * Veracity scores (meteor @ 0.25):          0.0
 * Veracity scores (meteor @ 0.3):           0.0
 * Veracity scores (meteor @ 0.4):           0.0
 * Veracity scores (meteor @ 0.5):           0.0
--------------------
AVeriTeC scores by type @ 0.25:
 * Veracity scores (Position Statement):     0.13407472728460382
 * Veracity scores (Quote Verification):     0.0
 * Veracity scores (Event/Property Claim):   0.0
 * Veracity scores (Causal Claim):           0.0
 * Veracity scores (Numerical Claim):        0.0
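
The `meteor @ λ` lines can be read as a thresholded accuracy: a claim counts only if its predicted label is correct and its evidence score reaches λ. That reading fits the outputs above, where the score never exceeds `acc` and falls as λ rises. The sketch below assumes this definition; the field names (`pred_label`, `gold_label`, `evidence_score`), the non-strict `>=` comparison, and the toy data are hypothetical, not taken from `evaluate_veracity.py`.

```python
def averitec_score(examples, threshold=0.25):
    """Fraction of claims with a correct label AND evidence score >= threshold.

    Assumption: the real evaluator may use a strict comparison instead.
    """
    if not examples:
        return 0.0
    hits = sum(
        1
        for ex in examples
        if ex["pred_label"] == ex["gold_label"]
        and ex["evidence_score"] >= threshold
    )
    return hits / len(examples)


# Toy data, invented for illustration; not results from the runs above.
toy = [
    {"pred_label": "Refuted", "gold_label": "Refuted", "evidence_score": 0.31},
    {"pred_label": "Supported", "gold_label": "Supported", "evidence_score": 0.12},
    {"pred_label": "Refuted", "gold_label": "Supported", "evidence_score": 0.40},
    {"pred_label": "Supported", "gold_label": "Supported", "evidence_score": 0.27},
]

for lam in (0.1, 0.25, 0.3):
    print(lam, averitec_score(toy, lam))
```

With a 10-claim subset such as `dev10`, each claim moves the score by 0.1, so a single weak evidence set can take the score at 0.25 to 0.0.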


In [1]:
%run src/prediction/evaluate_veracity.py --label_file data/dev_20.json --prediction_file data_store/dev_20_claude.json

Question-only score (HU-meteor):             0.4456959937506465
Question-answer score (HU-meteor):           0.305419432129659
Veracity F1 scores:
 * Supported:                                0.7272727272727273
 * Refuted:                                  0.9090909090909091
 * Not Enough Evidence:                      0.0
 * Conflicting Evidence/Cherrypicking:       0.0
 * macro:                                    0.40909090909090906
 * acc:                                      0.7
--------------------
AVeriTeC scores:
 * Veracity scores (meteor @ 0.1):           0.7
 * Veracity scores (meteor @ 0.2):           0.55
 * Veracity scores (meteor @ 0.25):          0.4
 * Veracity scores (meteor @ 0.3):           0.3
 * Veracity scores (meteor @ 0.4):           0.15
 * Veracity scores (meteor @ 0.5):           0.05
--------------------
AVeriTeC scores by type @ 0.25:
 * Veracity scores (Event/Property Claim):   0.27871912377066566
 * Veracity scores (Position Statement):     0.2745546713391
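
The `HU-meteor` scores presumably pair predicted and reference evidence one-to-one with Hungarian matching before averaging. The sketch below illustrates that idea under stated assumptions: a simple token-overlap F1 stands in for METEOR, brute-force search over permutations stands in for the Hungarian algorithm (fine for a handful of items), and the total is normalized by the number of references. None of the names come from the repository.

```python
from collections import Counter
from itertools import permutations


def token_f1(a, b):
    """Token-overlap F1; a stand-in for METEOR, not METEOR itself."""
    ta, tb = a.lower().split(), b.lower().split()
    overlap = sum((Counter(ta) & Counter(tb)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(ta), overlap / len(tb)
    return 2 * precision * recall / (precision + recall)


def matched_score(predicted, reference, sim=token_f1):
    """Best one-to-one matching score, normalized by len(reference).

    Brute force over permutations; a real evaluator would use an
    assignment solver instead.
    """
    if not reference:
        return 0.0
    best = 0.0
    if len(predicted) >= len(reference):
        # Assign a distinct prediction to every reference item.
        for perm in permutations(range(len(predicted)), len(reference)):
            total = sum(sim(predicted[p], reference[r]) for r, p in enumerate(perm))
            best = max(best, total)
    else:
        # Fewer predictions than references: some references stay unmatched.
        for perm in permutations(range(len(reference)), len(predicted)):
            total = sum(sim(predicted[p], reference[r]) for p, r in enumerate(perm))
            best = max(best, total)
    return best / len(reference)


# Toy data, invented for illustration only.
predicted = ["who approved the vaccine", "when was it approved"]
reference = ["when was it approved", "who approved the vaccine"]
score = matched_score(predicted, reference)
```

The one-to-one constraint means a single good question cannot earn credit against several references at once, and unmatched references lower the average.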