## Explore Data

In [2]:
import os
import pandas as pd

In [3]:
def read_files(path):
    file_contents = dict()

    for filename in os.listdir(path):
        if filename.endswith('.txt'):
            with open(os.path.join(path, filename), 'r') as f:
                content = f.read()
                file_contents[filename] = content

    print("... Reading files in path : ", path)
    print("Number of files read: ", len(file_contents))

    return file_contents

In [4]:
effectiveness_contents = read_files('data/effectiveness/train') # 4191
label_contents = read_files('data/label/train') # 15594

... Reading files in path :  data/effectiveness/train
Number of files read:  4191
... Reading files in path :  data/label/train
Number of files read:  15594


In [9]:
train_raw = pd.read_csv('data/effectiveness/train.csv')
train_raw.sample(3)

Unnamed: 0,discourse_id,essay_id,discourse_text,discourse_type,discourse_effectiveness
27238,53ed669461b1,AC2463899B0F,It can and has been argued that the Electoral ...,Evidence,Adequate
1211,ae2a5f7ad531,16B47CE3BEBB,Some peolpe don't like it mabye it's because t...,Counterclaim,Adequate
16170,5f15a0df3f4b,1840AEC2DC71,So you see It's not like it is perfect in ever...,Rebuttal,Adequate


In [10]:
test_raw = pd.read_csv('data/effectiveness/test.csv')
test_raw.sample(3)

Unnamed: 0,discourse_id,essay_id,discourse_text,discourse_type
7,c668ff840720,D72CB1C11673,Seeking others opinion can be very helpful and...,Claim
4,93578d946723,D72CB1C11673,can be very helpful and beneficial.,Claim
3,75ce6d68b67b,D72CB1C11673,a great chance to learn something new,Claim


##  Retrieval Augmented Generation
> * Learn to retrieve a sequence from an existing corpus of human-written prototypes (e.g., dialogue responses)
> * Learn to edit the retrieved sequence by adding, removing, and modifying tokens in the prototype – this will still result in a more “human-like” generation



`ColBERT` (dense retriever) 
1. use NN to encode all documents into representative vectors
2. encodes query into a vector and using vector similarity search

`PLAID` engine
* use aset of filtering steps to improve latency times for ColBERT-based indexes

    `PLAIDDocumentStore` document store class
    * `collection_path` is the path to the documents collection, in the form of a TSV file with columns being "id,content,title" where the title is optional.
    * `checkpoint_path` is the path for the encoder model, needed to encode queries into vectors at run time. Could be a local path to a model or a model hosted on HuggingFace hub. In order to use our trained model based on NaturalQuestions, provide the path Intel/ColBERT-NQ; see Model Hub for more details.
    * `index_path` location of the indexed documents. The index contains the optimized and compressed vector representation of all the documents. Index can be created by the user given a collection and a checkpoint, or can be specified via a path.

`Fusion-in-Decoder` (`FiD`)
* transformer-based generative model (based on T5 architecture)

#### Create Index

In [11]:
# extract effective arguments to create document store for retrieval task
train = (train_raw[train_raw['discourse_effectiveness'] == 'Effective']
         .reset_index()
         .reset_index()[['level_0', 'discourse_text']]
         .replace(r'\n',' ', regex=True) 
        )
train['discourse_text'].str.strip()
train.head(3)

Unnamed: 0,level_0,discourse_text
0,0,Limiting the usage of cars has personal and pr...
1,1,With so many things in this world that few peo...
2,2,It is no secret that morning traffic jams and ...


In [12]:
#test = test.reset_index()[['index', 'discourse_text']].replace(r'\n',' ', regex=True) 

test = (train_raw[train_raw['discourse_effectiveness'] != 'Effective']
         .reset_index()
         .reset_index()[['level_0', 'discourse_text']]
         .replace(r'\n',' ', regex=True) 
        )
test['discourse_text'].str.strip()
test.head(3)

Unnamed: 0,level_0,discourse_text
0,0,"Hi, i'm Isaac, i'm going to be writing about h..."
1,1,"On my perspective, I think that the face is a ..."
2,2,I think that the face is a natural landform be...


In [71]:
from colbert.infra import Run, RunConfig, ColBERTConfig
from colbert.data import Queries, Collection
from colbert import Indexer, Searcher

In [14]:
dataroot = 'data'
dataset = 'effective'
datasplit = 'train'

queries = os.path.join(dataroot, dataset, datasplit, 'questions.search.tsv')
collection = os.path.join(dataroot, dataset, datasplit, 'collection.tsv')

# collection refers to the documents collection
# in the form of a TSV file with columns being "id,content,title" where the title is optional.
with open(collection, 'w') as write_tsv:
   write_tsv.write(train.to_csv(sep='\t', index=False, header=False))

with open(queries, 'w') as write_tsv:
   write_tsv.write(test.to_csv(sep='\t', index=False, header=False))

tsv_read = pd.read_csv(queries, sep='\t')
tsv_read.head(3)

Unnamed: 0,0,"Hi, i'm Isaac, i'm going to be writing about how this face on Mars is a natural landform or if there is life on Mars that made it. The story is about how NASA took a picture of Mars and a face was seen on the planet. NASA doesn't know if the landform was created by life on Mars, or if it is just a natural landform."
0,1,"On my perspective, I think that the face is a ..."
1,2,I think that the face is a natural landform be...
2,3,"If life was on Mars, we would know by now. The..."


In [73]:
queries = Queries(path=queries)
collection = Collection(path=collection)

f'Loaded {len(queries)} queries and {len(collection):,} passages'

[Mar 01, 18:56:44] #> Loading the queries from data/effective/train/questions.search.tsv ...
[Mar 01, 18:56:44] #> Got 10 queries. All QIDs are unique.

[Mar 01, 18:56:44] #> Loading collection...
0M 


'Loaded 10 queries and 9,326 passages'

In [16]:
nbits = 2
index_name = f'{dataset}.{datasplit}.{nbits}bits'

with Run().context(RunConfig(nranks=5, experiment='notebook')):

    config = ColBERTConfig(
        nbits=nbits,
    )
    indexer = Indexer(checkpoint='downloads/colbertv2.0', config=config)
    print('start indexing")
    indexer.index(name=index_name, collection=collection, overwrite=True)

SyntaxError: EOL while scanning string literal (3613644112.py, line 10)

In [15]:
indexer.get_index() 

NameError: name 'indexer' is not defined

In [17]:
with Run().context(RunConfig(experiment='notebook')):
    searcher = Searcher(index=index_name)

NameError: name 'Run' is not defined