- Models
    - DistilBERT
        - https://huggingface.co/distilbert-base-uncased-distilled-squad
        - input restricted to 512 tokens (question and context combined), not 512 words
        - appropriate for page summaries
    - look into
        - Longformer
            - https://medium.com/dair-ai/longformer-what-bert-should-have-been-78f4cd595be9
    - open model directory
        - https://huggingface.co/

# How to use?

To run the final system, execute these sections in order:
- `Setup`
- `WikiAI Functions`
- `WikiAI` / `WikiAI Setup`
  - Change variable CORPUS_TO_USE to switch between curated and auto Wikipedia article lists.
- `WikiAI` / `Model served with Gradio`

To change the curated list and generate a new corpus, go to the section `Data input`.

# Setup

In [2]:
# for distilbert - answer questions
from transformers import AutoTokenizer, AutoModelForQuestionAnswering
import torch

# for choosing the correct article to answer question
from sentence_transformers import SentenceTransformer, util


# for getting wikipedia articles
import wikipediaapi
wiki_wiki = wikipediaapi.Wikipedia('en')

# data
import pandas as pd


# utils
import tqdm
import gradio as gr

# standard
from typing import Iterable

In [3]:
device = torch.device("cuda:1" if torch.cuda.is_available() else "cpu")
device

device(type='cuda', index=1)
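
The hard-coded `cuda:1` above assumes a machine with at least two GPUs. The following is a minimal sketch of a more defensive choice, assuming a hypothetical helper `pick_device_name` (not part of the notebook) that falls back to GPU 0 or the CPU when the preferred index is not available. It takes plain values so it can be checked without a GPU.

```python
def pick_device_name(cuda_available: bool, device_count: int, preferred_index: int = 1) -> str:
    """Return a device string that is valid for the visible hardware.

    Hypothetical helper for illustration; the notebook itself hard-codes "cuda:1".
    """
    if not cuda_available or device_count < 1:
        return "cpu"
    # fall back to GPU 0 when the preferred index does not exist
    index = preferred_index if 0 <= preferred_index < device_count else 0
    return f"cuda:{index}"


# Assumed wiring with PyTorch (left as a comment so the sketch runs without torch):
#   device = torch.device(pick_device_name(torch.cuda.is_available(),
#                                          torch.cuda.device_count()))
```
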

# WikiAI Functions

In [4]:
def get_related_articles_top_k(query: str, corpus: pd.DataFrame, embedder, model, top_k: int = 10):
    # NOTE: despite its name, `model` holds the precomputed corpus embeddings (a tensor), not a model
    corpus_embeddings_loaded = model
    query_embedding = embedder.encode(query, convert_to_tensor=True)
    
    hits = util.semantic_search(query_embedding, corpus_embeddings_loaded, top_k=top_k)
    hits_idx = list(map(lambda x: x['corpus_id'], hits[0]))
    
    return corpus.iloc[hits_idx]


def distilbert_ask(question, text, tokenizer, model):
    inputs = tokenizer(question, text, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)

    answer_start_index = outputs.start_logits.argmax()
    answer_end_index = outputs.end_logits.argmax()

    predict_answer_tokens = inputs.input_ids[0, answer_start_index : answer_end_index + 1]
    return tokenizer.decode(predict_answer_tokens)


def ask_question(question: str,
                 related_articles_model,
                 related_articles_embedder,
                 related_articles_corpus: pd.DataFrame,
                 question_models: dict,
                 question_tokenizer,
                 related_articles_top_k: int = 5,
                 ):
    # get all relevant articles
    related = get_related_articles_top_k(query=question,
                                         corpus=related_articles_corpus,
                                         embedder=related_articles_embedder,
                                         model=related_articles_model,
                                         top_k=related_articles_top_k
                                        )
    results = pd.DataFrame()
    # for each relevant article
    for idx, page in related.iterrows():        
        results.at[idx, 'article title'] = page['title']
        results.at[idx, 'article summary'] = page['summary']
        
        # for each question asking model
        for model_name, model_data in question_models.items():
            model = model_data['model']
            max_words = model_data['max_words']
            
            # run model on question with text
            # TODO make sure num. tokens is not exceeded for given model
            
            text = page['summary']
            assert text.strip() != ''  # context must be provided
            answer = ''
            try:
                answer = distilbert_ask(question=question,
                                        text=text,
                                        tokenizer=question_tokenizer,
                                        model=model)
            except Exception:  # a bare except would also swallow KeyboardInterrupt
                print('error during distilbert_ask')
            
            results.at[idx, model_name] = answer
        
    
    return related, results
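
The TODO in `ask_question` concerns inputs that exceed the model's token limit, which is a likely cause of the `error during distilbert_ask` messages. One way to handle it is to split the context into overlapping windows and ask the question on each window. The sketch below shows only the windowing arithmetic on an already tokenized list. `split_into_windows` and `context_budget` are hypothetical helpers, and the count of three special tokens assumes the BERT-style pair layout `[CLS] question [SEP] context [SEP]`.

```python
from typing import List, Sequence


def context_budget(question_tokens: int, model_limit: int = 512, special_tokens: int = 3) -> int:
    """Tokens left for the context once the question and special tokens are counted."""
    return model_limit - question_tokens - special_tokens


def split_into_windows(tokens: Sequence[str], window: int, overlap: int) -> List[List[str]]:
    """Split tokens into windows of at most `window` items.

    Consecutive windows share `overlap` tokens, so an answer that straddles
    a boundary is still fully contained in at least one window.
    """
    if window <= 0:
        raise ValueError("window must be positive")
    if not 0 <= overlap < window:
        raise ValueError("overlap must satisfy 0 <= overlap < window")
    step = window - overlap
    windows = []
    for start in range(0, len(tokens), step):
        windows.append(list(tokens[start:start + window]))
        if start + window >= len(tokens):
            break  # the last window already reaches the end of the text
    return windows
```

With the Hugging Face tokenizer a similar effect may be achievable directly through the `truncation="only_second"`, `max_length`, `stride` and `return_overflowing_tokens=True` arguments; that variant is not tested here.
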
    

# Data input

## Curated collection

In [10]:
# compile corpus
curated_pages = '''
Artificial intelligence
Natural language processing
Deep learning
Supervised learning
Semi-supervised learning
Unsupervised learning
Statistical classification
Regression analysis
Federated learning
k-anonymity
Data anonymization
k-means clustering
DBSCAN
Dimensionality reduction
Silhouette (clustering)
Davies–Bouldin index
Multidimensional scaling
Cluster analysis
Principal component analysis
Isolation forest
Unsupervised learning
Hierarchical clustering
Local outlier factor
Kaiser–Meyer–Olkin test
Bartlett's test

Affinity propagation
Automatic clustering algorithms
BFR algorithm
BIRCH
Canopy clustering algorithm
Chinese whispers (clustering method)
Cluster-weighted modeling
Cobweb (clustering)
Complete-linkage clustering
Constrained clustering
CURE algorithm
Data stream clustering
DBSCAN
Expectation–maximization algorithm
FLAME clustering
Fuzzy clustering
Hierarchical clustering
Hoshen–Kopelman algorithm
Information bottleneck method
Jenks natural breaks optimization
K q-flats
K-means clustering
K-means++
K-medians clustering
K-medoids
K-SVD
Linde–Buzo–Gray algorithm
Low-energy adaptive clustering hierarchy
Mean shift
Nearest-neighbor chain algorithm
Neighbor joining
OPTICS algorithm
Pitman–Yor process
Quantum clustering
Self-organizing map
SimRank
Single-linkage clustering
Spectral clustering
SUBCLU
UPGMA
Ward's method
WPGMA



Support-vector machine
Boosting (machine learning)
Random forest
Linear regression
Logistic regression
Naive Bayes classifier
Artificial neural network
Perceptron
k-nearest neighbors algorithm
Semi-supervised learning
Ensemble learning
Bootstrap aggregating



'''

curated_pages = curated_pages.strip().splitlines()  # string to list of strings
curated_pages = [p for p in curated_pages if p.strip() != '']  # remove blank lines
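
The curated list contains repeated titles (for example `DBSCAN` and `Hierarchical clustering`), so those articles are fetched and embedded twice and can show up twice in the search results. A minimal sketch of an order-preserving clean-up follows. `dedupe_titles` is a hypothetical helper, and treating the first letter as case-insensitive is an assumption based on how English Wikipedia handles titles (`k-means clustering` and `K-means clustering` lead to the same page).

```python
from typing import Iterable, List


def dedupe_titles(titles: Iterable[str]) -> List[str]:
    """Drop blank lines and repeated titles while keeping the original order."""
    seen = set()
    unique = []
    for title in titles:
        cleaned = title.strip()
        if not cleaned:
            continue  # skip blank lines
        # assumption: only the first character of a title is case-insensitive
        key = cleaned[0].upper() + cleaned[1:]
        if key not in seen:
            seen.add(key)
            unique.append(cleaned)
    return unique
```

Applied to the list above this would be `curated_pages = dedupe_titles(curated_pages)`.
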

## Collect all pages under Machine learning

In [None]:
cats_open = ['Category:Machine learning']
cats_close = []
all_pages = []
while len(cats_open) > 0:
    c = cats_open.pop()
    if c in cats_close:
        continue
    cats_close.append(c)
    cat = wiki_wiki.page(c)
    members = list(cat.categorymembers.keys())
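    # match on the prefix only, so article titles that merely contain 'Category:' are not misclassified
    subcats = filter(lambda m: m.startswith('Category:'), members)
    pages = filter(lambda m: not m.startswith('Category:'), members)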
    all_pages.extend(pages)
    cats_open.extend(subcats)
    
# remove duplicates
auto_pages = []
for p in all_pages:
    if p not in auto_pages:
        auto_pages.append(p)
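
Wikipedia categories are deeply nested, so the crawl above can grow far beyond the intended topic. The sketch below is one possible variant with a depth limit and set-based bookkeeping, written against a plain callable so it can be tried on a toy graph. `collect_pages`, `get_members` and `max_depth` are hypothetical names, not part of the notebook.

```python
from collections import deque
from typing import Callable, Iterable, List


def collect_pages(get_members: Callable[[str], Iterable[str]],
                  root: str,
                  max_depth: int = 2) -> List[str]:
    """Breadth-first walk over categories, returning unique page titles in discovery order."""
    visited = {root}            # categories already queued (guards against cycles)
    queue = deque([(root, 0)])
    pages = {}                  # dict used as an insertion-ordered set
    while queue:
        category, depth = queue.popleft()
        for member in get_members(category):
            if member.startswith("Category:"):
                # only descend while the depth limit allows it
                if depth < max_depth and member not in visited:
                    visited.add(member)
                    queue.append((member, depth + 1))
            else:
                pages.setdefault(member, None)
    return list(pages)


# toy category graph with a cycle, standing in for the Wikipedia API
toy_graph = {
    "Category:Machine learning": ["Perceptron", "Category:Cluster analysis"],
    "Category:Cluster analysis": ["DBSCAN", "Perceptron", "Category:Machine learning"],
}
toy_pages = collect_pages(lambda c: toy_graph.get(c, []), "Category:Machine learning")
```

With the `wiki_wiki` object from the setup cell, `get_members` could presumably be `lambda c: wiki_wiki.page(c).categorymembers.keys()`, mirroring the call used in the cell above.
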


## Prepare dataset

In [11]:
CORPUS_TO_USE = 'curated'

In [13]:
auto_pages = []  # placeholder for when the category crawl above was skipped; do not run this cell after a crawl

In [14]:
corpus_types = {'curated': curated_pages,
                'auto': auto_pages}

In [15]:
wikipedia_pages = corpus_types[CORPUS_TO_USE]

In [16]:
df = pd.DataFrame({'title_input': wikipedia_pages})
df['title'] = ''
df['summary'] = ''
df['text'] = ''

In [17]:
for idx, line in tqdm.tqdm(df.iterrows(), total=df.shape[0]):
    page_py = wiki_wiki.page(line['title_input'])
    df.at[idx, 'title'] = page_py.title
    df.at[idx, 'text'] = page_py.text
    df.at[idx, 'summary'] = page_py.summary
    
df

100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 79/79 [00:21<00:00,  3.61it/s]


Unnamed: 0,title_input,title,summary,text
0,Artificial intelligence,Artificial intelligence,Artificial intelligence (AI) is intelligence d...,Artificial intelligence (AI) is intelligence d...
1,Natural language processing,Natural language processing,Natural language processing (NLP) is a subfiel...,Natural language processing (NLP) is a subfiel...
2,Deep learning,Deep learning,Deep learning (also known as deep structured ...,Deep learning (also known as deep structured ...
3,Supervised learning,Supervised learning,Supervised learning (SL) is the machine learni...,Supervised learning (SL) is the machine learni...
4,Semi-supervised learning,Semi-supervised learning,Semi-supervised learning is an approach to mac...,Semi-supervised learning is an approach to mac...
...,...,...,...,...
74,Perceptron,Perceptron,"In machine learning, the perceptron is an algo...","In machine learning, the perceptron is an algo..."
75,k-nearest neighbors algorithm,k-nearest neighbors algorithm,"In statistics, the k-nearest neighbors algorit...","In statistics, the k-nearest neighbors algorit..."
76,Semi-supervised learning,Semi-supervised learning,Semi-supervised learning is an approach to mac...,Semi-supervised learning is an approach to mac...
77,Ensemble learning,Ensemble learning,"In statistics and machine learning, ensemble m...","In statistics and machine learning, ensemble m..."


In [18]:
df.to_csv(f'corpus_wikipedia_{CORPUS_TO_USE}.csv')

# Match article to input

## Encode corpus with sentence transformer and store embeddings

In [20]:
embedder = SentenceTransformer('all-MiniLM-L6-v2', device=device)
corpus = (df["title"] + " " + df["text"]).tolist()  # keep title and body separated; encode() takes a list of strings
corpus_embeddings = embedder.encode(corpus, convert_to_tensor=True)
corpus_embeddings.shape

torch.save(corpus_embeddings, f'corpus_embeddings_{CORPUS_TO_USE}.pt')


## Load embeddings

In [None]:
corpus_embeddingsLoaded = torch.load(f'corpus_embeddings_{CORPUS_TO_USE}.pt')

In [None]:
query = 'what is the metric used in k means'
query_embedding = embedder.encode(query, convert_to_tensor=True)

top_k = 10

hits = util.semantic_search(query_embedding, corpus_embeddingsLoaded, top_k=top_k)
hits_idx = list(map(lambda x: x['corpus_id'], hits[0]))

for hit in hits[0]:  # hits holds one result list per query; a single query was sent
    hit_id = hit['corpus_id']
    article_data = df.iloc[hit_id]
    title = article_data['title']
    print("-", title, hit['score'], hit_id)

In [None]:
query_test = 'what is the metric used in k means'
get_related_articles_top_k(query=query_test,
                           corpus=df,
                           model=corpus_embeddingsLoaded,
                           embedder=embedder,
                           top_k=7
                          )
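
For intuition, `util.semantic_search` ranks corpus entries by their similarity to the query embedding, using cosine similarity by default. The following is a small pure-Python sketch of that ranking on toy vectors. `cosine` and `top_k_by_cosine` are hypothetical names, and the real function works on tensors in batches rather than on lists.

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_by_cosine(query: Sequence[float],
                    corpus: Sequence[Sequence[float]],
                    k: int) -> List[Tuple[int, float]]:
    """Return (corpus_id, score) pairs for the k most similar corpus vectors."""
    scored = [(idx, cosine(query, vec)) for idx, vec in enumerate(corpus)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# toy 2-d "embeddings"
toy_corpus = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
toy_hits = top_k_by_cosine([1.0, 0.1], toy_corpus, k=2)
```
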

# Ask Questions on specific text

## Distilbert - max 512 tokens

In [None]:
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased-distilled-squad")

model = AutoModelForQuestionAnswering.from_pretrained("distilbert-base-uncased-distilled-squad")

## Distilbert - Demo

In [None]:
page_title = input('wikipedia page title:')
page_py = wiki_wiki.page(page_title)
if not page_py.exists():
    print('page does not exist')
else:
    print('\npage title:', page_py.title)
    print('\nsummary:\n')
    print(page_py.summary)


In [None]:
question = input('question:\n')
distilbert_ask(question, page_py.summary, tokenizer=tokenizer, model=model)

## Longformer

In [None]:
input_ids

In [None]:
tokenizer.encode(text)

In [None]:
encoding = tokenizer(question, page_py.text, return_tensors="pt") #
input_ids = encoding["input_ids"]

In [None]:

question, text = "Who was Jim Henson?", "Jim Henson was a nice puppet"
encoding = tokenizer(question, text, return_tensors="pt", truncation=True, max_length=512)  # max_length only takes effect with truncation enabled
input_ids = encoding["input_ids"]

In [None]:
from transformers import LongformerTokenizer, LongformerForQuestionAnswering
import torch

tokenizer = LongformerTokenizer.from_pretrained("allenai/longformer-large-4096-finetuned-triviaqa")
model = LongformerForQuestionAnswering.from_pretrained("allenai/longformer-large-4096-finetuned-triviaqa")

question, text = "Who was Jim Henson?", "Jim Henson was a nice puppet"
encoding = tokenizer(question, text, return_tensors="pt", truncation=True, max_length=512)  # max_length only takes effect with truncation enabled
input_ids = encoding["input_ids"]

# default is local attention everywhere
# the forward method will automatically set global attention on question tokens
attention_mask = encoding["attention_mask"]

outputs = model(input_ids, attention_mask=attention_mask)
start_logits = outputs.start_logits
end_logits = outputs.end_logits
all_tokens = tokenizer.convert_ids_to_tokens(input_ids[0].tolist())

answer_tokens = all_tokens[torch.argmax(start_logits) : torch.argmax(end_logits) + 1]
answer = tokenizer.decode(
    tokenizer.convert_tokens_to_ids(answer_tokens)
)  # remove space prepending space token

In [None]:
tokenizer = LongformerTokenizer.from_pretrained("allenai/longformer-large-4096-finetuned-triviaqa")
model = LongformerForQuestionAnswering.from_pretrained("allenai/longformer-large-4096-finetuned-triviaqa")

def longformer_ask(question, text, tokenizer, model):
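    # truncate only the article text so the pair fits Longformer's 4096-token window
    encoding = tokenizer(question, text, return_tensors="pt",
                         truncation="only_second", max_length=4096)
    input_ids = encoding["input_ids"]

    with torch.no_grad():
        # local attention everywhere; the forward pass sets global attention on the question tokens
        outputs = model(input_ids, attention_mask=encoding["attention_mask"])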

    start_logits = outputs.start_logits
    end_logits = outputs.end_logits
    all_tokens = tokenizer.convert_ids_to_tokens(input_ids[0].tolist())

    answer_tokens = all_tokens[torch.argmax(start_logits) : torch.argmax(end_logits) + 1]
    answer = tokenizer.decode(
        tokenizer.convert_tokens_to_ids(answer_tokens)
    )  # remove space prepending space token
    return answer

longformer_ask('which metric?', page_py.text, tokenizer, model)
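
Both `distilbert_ask` and `longformer_ask` take the argmax of the start and end logits independently, which can put the end before the start and yield an empty answer. A common alternative is to pick the best-scoring pair with `start <= end` and a bounded length. The sketch below works on plain lists of floats; `best_span` and `max_answer_len` are hypothetical names, and the pairwise loop is chosen for clarity rather than speed.

```python
from typing import Sequence, Tuple


def best_span(start_logits: Sequence[float],
              end_logits: Sequence[float],
              max_answer_len: int = 30) -> Tuple[int, int]:
    """Return (start, end) maximizing start_logit + end_logit with start <= end.

    The span covers at most `max_answer_len` tokens. Returns (0, 0) for empty input.
    """
    best = (0, 0)
    best_score = float("-inf")
    for start, start_score in enumerate(start_logits):
        last_end = min(start + max_answer_len, len(end_logits))
        for end in range(start, last_end):
            score = start_score + end_logits[end]
            if score > best_score:
                best_score = score
                best = (start, end)
    return best
```

With tensors, the inputs could be obtained as `outputs.start_logits[0].tolist()` and `outputs.end_logits[0].tolist()`, and the answer tokens sliced as `input_ids[0, start : end + 1]`.
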

In [None]:
answer

# WikiAI

## WikiAI Setup

In [21]:
CORPUS_TO_USE = 'curated'

# related articles
corpus_df = pd.read_csv(f'corpus_wikipedia_{CORPUS_TO_USE}.csv', index_col=0)
corpus_embeddingsLoaded = torch.load(f'corpus_embeddings_{CORPUS_TO_USE}.pt')
embedder = SentenceTransformer('all-MiniLM-L6-v2', device=device)

# question asking
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased-distilled-squad")
question_model = AutoModelForQuestionAnswering.from_pretrained("distilbert-base-uncased-distilled-squad")

question_models = {
    'distillbert': {'model': question_model, 'tokenizer': tokenizer, 'max_words': 512},
    
}

## Single run test

In [22]:
related, results = \
    ask_question('what is k-means metric?',
                 related_articles_model=corpus_embeddingsLoaded,
                 related_articles_embedder=embedder,
                 related_articles_corpus=corpus_df,
                 question_models=question_models,
                 question_tokenizer=tokenizer,
                 related_articles_top_k= 10,
                )

In [23]:
mask = results['distillbert'] == ''
mask |= results['distillbert'] == '[SEP]'
mask |= results['distillbert'] == '[CLS]'
mask = ~mask

In [24]:
results[mask]

Unnamed: 0,article title,article summary,distillbert
48,K-medians clustering,"In statistics, k-medians clustering is a clust...",squared 2 - norm distance metric
50,K-SVD,"In applied mathematics, K-SVD is a dictionary ...",clustering method
47,K-means++,"In data mining, k-means++ is an algorithm for ...",an algorithm for choosing the initial values (...
26,Automatic clustering algorithms,Automatic clustering algorithms are algorithms...,clustering
75,k-nearest neighbors algorithm,"In statistics, the k-nearest neighbors algorit...",k - nearest neighbors algorithm
36,Data stream clustering,"In computer science, data stream clustering is...",data stream clustering
45,K q-flats,"In data mining and machine learning, \n \n ...",k { \ displaystyle k } - means algorithm
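
The mask above drops answers that are empty or consist of a single special token. Spans that merely contain `[CLS]` or `[SEP]` (an answer running across the question/context boundary) are arguably unusable as well. A minimal sketch of a stricter check follows, assuming a hypothetical helper `is_usable_answer` and the special-token spellings of the uncased DistilBERT tokenizer.

```python
SPECIAL_TOKENS = ("[CLS]", "[SEP]", "[PAD]")


def is_usable_answer(answer: str) -> bool:
    """True if the decoded answer is non-empty and contains no special tokens."""
    text = answer.strip()
    if not text:
        return False
    return not any(token in text for token in SPECIAL_TOKENS)
```

On the results frame this could be applied as `results[results['distillbert'].map(is_usable_answer)]`.
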


## Model served with Gradio

In [30]:
def question_answer(question):
    related, results = \
        ask_question(question,
                     related_articles_model=corpus_embeddingsLoaded,
                     related_articles_embedder=embedder,
                     related_articles_corpus=corpus_df,
                     question_models=question_models,
                     question_tokenizer=tokenizer,
                     related_articles_top_k= 10,
                    )
    
    mask = results['distillbert'] == ''
    mask |= results['distillbert'] == '[SEP]'
    mask |= results['distillbert'] == '[CLS]'
    mask = ~mask
    
    results = results[mask]
    return results[['article title', 'distillbert']]

gr_interface = gr.Interface(fn=question_answer, inputs=[ "text"], outputs=["dataframe"])

In [32]:
question_answer('what is love?')

error during distilbert_ask


Unnamed: 0,article title,distillbert
1,Natural language processing,natural language processing ( nlp ) is a subfi...
40,Fuzzy clustering,each data point can belong to more than one cl...
73,Artificial neural network,an ann is based on a collection of connected u...
39,FLAME clustering,fuzzy clustering by local approximation of mem...
17,Cluster analysis,exploratory data analysis
3,Supervised learning,supervised learning
8,Federated learning,"enables multiple actors to build a common, rob..."


In [33]:
gr_interface.launch(server_port=7860)

Running on local URL:  http://127.0.0.1:7860/

To create a public link, set `share=True` in `launch()`.


(<gradio.routes.App at 0x7f1720db8a00>, 'http://127.0.0.1:7860/', None)

error during distilbert_ask


In [None]:
gr.close_all()
gr_interface.close()