- Models
    - distilbert
        - https://huggingface.co/distilbert-base-uncased-distilled-squad
        - word count restricted to 512
        - appropriate for page summaries
    - look into
        - longform
            - https://medium.com/dair-ai/longformer-what-bert-should-have-been-78f4cd595be9
    - model open directory
        - https://huggingface.co/

# Setup

In [262]:
# for distilbert - answer questions
from transformers import AutoTokenizer, AutoModelForQuestionAnswering
import torch

# for choosing the correct article to answer question
from sentence_transformers import SentenceTransformer, util


# for getting wikipedia articles
import wikipediaapi
wiki_wiki = wikipediaapi.Wikipedia('en')

# data
import pandas as pd


# utils
import tqdm

In [199]:
device = torch.device("cuda:1" if torch.cuda.is_available() else "cpu")
device

device(type='cuda', index=1)

# Data input

## Curated collection

In [350]:
# compile corpus
curated_pages = '''
Artificial intelligence
Natural language processing
Deep learning
Supervised learning
Semi-supervised learning
Unsupervised learning
Statistical classification
Regression analysis
Federated learning
k-anonymity
Data anonymization
k-means clustering
DBSCAN
Dimensionality reduction
Silhouette (clustering)
Davies–Bouldin index
Multidimensional scaling
Cluster analysis
Principal component analysis
Isolation forest
Unsupervised learning
Hierarchical clustering
Local outlier factor
Kaiser–Meyer–Olkin test
Bartlett's test

Affinity propagation
Automatic clustering algorithms
BFR algorithm
BIRCH
Canopy clustering algorithm
Chinese whispers (clustering method)
Cluster-weighted modeling
Cobweb (clustering)
Complete-linkage clustering
Constrained clustering
CURE algorithm
Data stream clustering
DBSCAN
Expectation–maximization algorithm
FLAME clustering
Fuzzy clustering
Hierarchical clustering
Hoshen–Kopelman algorithm
Information bottleneck method
Jenks natural breaks optimization
K q-flats
K-means clustering
K-means++
K-medians clustering
K-medoids
K-SVD
Linde–Buzo–Gray algorithm
Low-energy adaptive clustering hierarchy
Mean shift
Nearest-neighbor chain algorithm
Neighbor joining
OPTICS algorithm
Pitman–Yor process
Quantum clustering
Self-organizing map
SimRank
Single-linkage clustering
Spectral clustering
SUBCLU
UPGMA
Ward's method
WPGMA



Support-vector machine
Boosting (machine learning)
Random forest
Linear regression
Logistic regression
Naive Bayes classifier
Artificial neural network
Perceptron
k-nearest neighbors algorithm
Semi-supervised learning
Ensemble learning
Bootstrap aggregating



'''

curated_pages = curated_pages.strip().splitlines()  # string to list of strings
curated_pages = [p for p in curated_pages if p.strip() != '']  # remove blank lines

## Collect all pages under Machine learning

In [228]:
cats_open = ['Category:Machine learning']
cats_close = []
all_pages = []
while len(cats_open) > 0:
    c = cats_open.pop()
    if c in cats_close:
        continue
    cats_close.append(c)
    cat = wiki_wiki.page(c)
    members = list(cat.categorymembers.keys())
    subcats = filter(lambda m: 'Category:' in m, members)
    pages = filter(lambda m: 'Category:' not in m, members)
    all_pages.extend(pages)
    cats_open.extend(subcats)
    
# remove duplicates
auto_pages = []
for p in all_pages:
    if p not in auto_pages:
        auto_pages.append(p)


## Prepare dataset

In [304]:
CORPUS_TO_USE = 'auto'

In [351]:
corpus_types = {'curated': curated_pages,
                'auto': auto_pages}

In [306]:
wikipedia_pages = corpus_types[CORPUS_TO_USE]

In [356]:
df = pd.DataFrame({'title_input': wikipedia_pages})
df['title'] = ''
df['summary'] = ''
df['text'] = ''

In [413]:
for idx, line in tqdm.tqdm(df.iterrows(), total=df.shape[0]):
    page_py = wiki_wiki.page(line['title_input'])
    df.at[idx, 'title'] = page_py.title
    df.at[idx, 'text'] = page_py.text
    df.at[idx, 'summary'] = page_py.summary
    
df

100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1526/1526 [05:10<00:00,  4.91it/s]


Unnamed: 0,title_input,title,summary,text
0,Bayesian learning mechanisms,Bayesian learning mechanisms,Bayesian learning mechanisms are probabilistic...,Bayesian learning mechanisms are probabilistic...
1,Machine learning,Machine learning,Machine learning (ML) is a field of inquiry de...,Machine learning (ML) is a field of inquiry de...
2,List of datasets for machine-learning research,List of datasets for machine-learning research,These datasets are applied for machine learnin...,These datasets are applied for machine learnin...
3,Outline of machine learning,Outline of machine learning,The following outline is provided as an overvi...,The following outline is provided as an overvi...
4,80 Million Tiny Images,80 Million Tiny Images,80 Million Tiny Images is a dataset intended f...,80 Million Tiny Images is a dataset intended f...
...,...,...,...,...
1521,Fan Hui,Fan Hui,Fan Hui (Chinese: 樊麾; pinyin: Fán Huī; born 27...,Fan Hui (Chinese: 樊麾; pinyin: Fán Huī; born 27...
1522,Future of Go Summit,Future of Go Summit,The Future of Go Summit (Chinese: 中国乌镇围棋峰会) wa...,The Future of Go Summit (Chinese: 中国乌镇围棋峰会) wa...
1523,Aja Huang,Aja Huang,Aja Huang (Chinese: 黃士傑; pinyin: Huáng Shìjié;...,Aja Huang (Chinese: 黃士傑; pinyin: Huáng Shìjié;...
1524,Master (software),Master (software),Master is a version of DeepMind's Go software ...,Master is a version of DeepMind's Go software ...


In [416]:
df.to_csv(f'corpus_wikipedia_{CORPUS_TO_USE}.csv')

# Match article to input

## Train sentence transformer on corpus and store model

In [310]:
embedder = SentenceTransformer('all-MiniLM-L6-v2', device=device)
corpus = df["title"]+df["text"]
corpus_embeddings = embedder.encode(corpus, convert_to_tensor=True)
corpus_embeddings.shape

torch.save(corpus_embeddings, f'corpus_embeddings_{CORPUS_TO_USE}.pt')


## Load model

In [315]:
corpus_embeddingsLoaded = torch.load(f'corpus_embeddings_{CORPUS_TO_USE}.pt')

In [325]:
query = 'what is the metric used in k means'
query_embedding = embedder.encode(query, convert_to_tensor=True)

top_k = 10

hits = util.semantic_search(query_embedding, corpus_embeddingsLoaded, top_k=top_k)
hits_idx = list(map(lambda x: x['corpus_id'], hits[0]))

for hit in hits:
    hit_id = hit ['corpus_id']
    article_data = df.iloc[hit_id]
    title = article_data ['title']
    print ("-", title, hit ['score'], hit_id)

- Determining the number of clusters in a data set 0.5419691205024719 1203
- K-medians clustering 0.5400686264038086 1247
- K-medoids 0.5380665063858032 1248
- K-SVD 0.49442362785339355 1249
- K-means clustering 0.4790186285972595 1245
- K-means++ 0.4489384889602661 1246
- K-nearest neighbors algorithm 0.43898695707321167 727
- Automatic clustering algorithms 0.43277668952941895 1211
- Data stream clustering 0.4221004247665405 1236
- K q-flats 0.4161999821662903 1244


In [352]:
def get_related_articles_top_k(query: str, corpus: pd.DataFrame, embedder, model, top_k: int = 10):
    corpus_embeddings_loaded = model
    query_embedding = embedder.encode(query, convert_to_tensor=True)
    
    hits = util.semantic_search(query_embedding, corpus_embeddings_loaded, top_k=top_k)
    hits_idx = list(map(lambda x: x['corpus_id'], hits[0]))
    
    return corpus.iloc[hits_idx]

In [355]:
query_test = 'what is the metric used in k means'
get_related_articles_top_k(query=query_test,
                           corpus=df,
                           model=corpus_embeddingsLoaded,
                           embedder=embedder,
                           top_k=7
                          )

Unnamed: 0,title_input,title,text
1203,Determining the number of clusters in a data set,Determining the number of clusters in a data set,Determining the number of clusters in a data s...
1247,K-medians clustering,K-medians clustering,"In statistics, k-medians clustering is a clust..."
1248,K-medoids,K-medoids,The k-medoids problem is a clustering problem ...
1249,K-SVD,K-SVD,"In applied mathematics, K-SVD is a dictionary ..."
1245,K-means clustering,K-means clustering,k-means clustering is a method of vector quant...
1246,K-means++,K-means++,"In data mining, k-means++ is an algorithm for ..."
727,K-nearest neighbors algorithm,K-nearest neighbors algorithm,"In statistics, the k-nearest neighbors algorit..."


# Ask Questions on specific text

## Distilbert - max 512 words

In [320]:
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased-distilled-squad")

model = AutoModelForQuestionAnswering.from_pretrained("distilbert-base-uncased-distilled-squad")

In [359]:
def distilbert_ask(question, text, tokenizer, model):
    inputs = tokenizer(question, text, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)

    answer_start_index = outputs.start_logits.argmax()
    answer_end_index = outputs.end_logits.argmax()

    predict_answer_tokens = inputs.input_ids[0, answer_start_index : answer_end_index + 1]
    return tokenizer.decode(predict_answer_tokens)


## Distilbert - Demo

In [148]:
page_title = input('wikipedia page title:')
page_py = wiki_wiki.page(page_title)
if not page_py.exists():
    print('page does not exist')
else:
    print('\npage title:', page_py.title)
    print('\nsummary:\n')
    print(page_py.summary)


wikipedia page title: kmeans



page title: K-means clustering

summary:

k-means clustering is a method of vector quantization, originally from signal processing, that aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean (cluster centers or cluster centroid), serving as a prototype of the cluster. This results in a partitioning of the data space into Voronoi cells. k-means clustering minimizes within-cluster variances (squared Euclidean distances), but not regular Euclidean distances, which would be the more difficult Weber problem: the mean optimizes squared errors, whereas only the geometric median minimizes Euclidean distances. For instance, better Euclidean solutions can be found using k-medians and k-medoids.
The problem is computationally difficult (NP-hard); however, efficient heuristic algorithms converge quickly to a local optimum. These are usually similar to the expectation-maximization algorithm for mixtures of Gaussian distributions via

In [362]:
page_py.summary

'In the mathematical theory of stochastic processes, variable-order Markov (VOM) models are an important class of models that extend the well known Markov chain models. In contrast to the Markov chain models, where each random variable in a sequence with a Markov property depends on a fixed number of random variables, in VOM models this number of conditioning random variables may vary based on the specific observed realization.\nThis realization sequence is often called the context; therefore the VOM models are also called context trees. VOM models are nicely rendered by colorized probabilistic suffix trees (PST). The flexibility in the number of conditioning random variables turns out to be of real advantage for many applications, such as statistical analysis, classification and prediction.'

In [363]:
question = input('question:\n')
distilbert_ask(question, page_py.summary, tokenizer=tokenizer, model=model)

question:
 what is variable-order


'vom ) models'

# Asking any question

In [None]:
question = input('question:\n')

In [319]:
question = 'what is the metric of kmeans?'

In [358]:
from typing import Iterable

In [420]:
# related articles
corpus_df = pd.read_csv(f'corpus_wikipedia_{CORPUS_TO_USE}.csv')
corpus_embeddingsLoaded = torch.load(f'corpus_embeddings_{CORPUS_TO_USE}.pt')
embedder = SentenceTransformer('all-MiniLM-L6-v2', device=device)

# question asking
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased-distilled-squad")
question_model = AutoModelForQuestionAnswering.from_pretrained("distilbert-base-uncased-distilled-squad")

question_models = {
    'distillbert': {'model': question_model, 'max_words': 512},
}

In [436]:
def ask_question(question: str,
                 related_articles_model,
                 related_articles_embedder,
                 related_articles_corpus: pd.DataFrame,
                 question_models: dict,
                 question_tokenizer,
                 related_articles_top_k: int = 5,
                 ):
    # get all relevant articles
    related = get_related_articles_top_k(query=question,
                                         corpus=related_articles_corpus,
                                         embedder=related_articles_embedder,
                                         model=related_articles_model,
                                         top_k=related_articles_top_k
                                        )
    results = pd.DataFrame()
    # for each relevant article
    for idx, page in related.iterrows():        
        results.at[idx, 'article title'] = page['title']
        results.at[idx, 'article summary'] = page['summary']
        
        # for each question asking model
        for model_name, model_data in question_models.items():
            model = model_data['model']
            max_words = model_data['max_words']
            
            # run model on questin with text
            # TODO validate text has less than max_words (crop)
            
            text = page['summary']
            assert len(text.split()) < max_words
            assert text.strip() != ''  # context must be provided
            answer = distilbert_ask(question=question,
                                    text=text,
                                    tokenizer=question_tokenizer,
                                    model=model)
            
            results.at[idx, model_name] = answer
        
    
    return related, results
    

In [442]:
related, results = \
ask_question('what is k-means metric?',
             related_articles_model=corpus_embeddingsLoaded,
             related_articles_embedder=embedder,
             related_articles_corpus=df,
             question_models=question_models,
             question_tokenizer=tokenizer,
             related_articles_top_k= 10,
            )

In [447]:
distilbert_ask(question='what is k-means metric?',
               text=results.loc[1245]['article summary'],
               tokenizer=tokenizer,
               model=model)

''

In [443]:
results

Unnamed: 0,article title,article summary,distillbert
1248,K-medoids,The k-medoids problem is a clustering problem ...,[CLS]
1247,K-medians clustering,"In statistics, k-medians clustering is a clust...",squared 2 - norm distance metric
1203,Determining the number of clusters in a data set,Determining the number of clusters in a data s...,a quantity
1249,K-SVD,"In applied mathematics, K-SVD is a dictionary ...",clustering method
1245,K-means clustering,k-means clustering is a method of vector quant...,
1246,K-means++,"In data mining, k-means++ is an algorithm for ...",an algorithm for choosing the initial values (...
1211,Automatic clustering algorithms,Automatic clustering algorithms are algorithms...,clustering
727,K-nearest neighbors algorithm,"In statistics, the k-nearest neighbors algorit...",k - nearest neighbors algorithm
1236,Data stream clustering,"In computer science, data stream clustering is...",data stream clustering
1244,K q-flats,"In data mining and machine learning, \n \n ...",k { \ displaystyle k } - means algorithm
