# Clustering Wikipedia articles with BERT and Projector
In this example, we cluster Wikipedia articles and visualize them in [Projector](http://projector.tensorflow.org/).

1. Download raw text from Wikipedia.
2. Convert the articles to a tf-idf matrix.
3. Reduce the dimensionality of the resulting tf-idf matrix using SVD.
4. Upload the vectors to Projector.

This notebook also shows how to use a pre-trained BERT model together with TensorFlow Text.
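The tf-idf and Projector steps listed above can be sketched in pure Python. This is a minimal illustration under stated assumptions, not the notebook's implementation: `tfidf_vectors` and `to_tsv` are hypothetical helpers, the corpus is a toy stand-in for the article texts, the idf uses one common smoothed variant, and the SVD step is left out. Projector can load vectors as tab-separated values, one vector per line, with an optional metadata file holding one label per line.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
  """Return (terms, rows): one tf-idf row per document."""
  tokenized = [doc.lower().split() for doc in docs]
  # Document frequency: in how many documents each term occurs.
  df = Counter(term for tokens in tokenized for term in set(tokens))
  terms = sorted(df)
  n_docs = len(docs)
  rows = []
  for tokens in tokenized:
    counts = Counter(tokens)
    length = max(len(tokens), 1)  # avoid dividing by zero on empty documents
    # Term frequency times a smoothed inverse document frequency.
    rows.append([
        (counts[t] / length) * math.log((1 + n_docs) / (1 + df[t]))
        for t in terms
    ])
  return terms, rows

def to_tsv(rows):
  """Serialize vectors as tab-separated lines, one vector per line."""
  return "\n".join("\t".join(f"{x:.6f}" for x in row) for row in rows)

# Hypothetical toy corpus standing in for the downloaded articles.
titles = ["Cat", "Dog", "Star"]
docs = ["the cat purrs", "the dog barks", "the star shines"]

terms, rows = tfidf_vectors(docs)
vectors_tsv = to_tsv(rows)        # contents for a vectors file
metadata_tsv = "\n".join(titles)  # contents for a single-column metadata file
```

A term that occurs in every document (here `the`) gets weight 0 under this smoothing, so it does not influence the distances between articles.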

In [1]:
try:
  %tensorflow_version 2.x
except Exception:
  pass

TensorFlow 2.x selected.


In [0]:
!pip install wikipedia -q
!pip install bert-tensorflow -q
!pip install tensorflow_text -q

In [0]:
import pickle
import wikipedia

import tensorflow as tf
import tensorflow_text as tftext
import tensorflow_hub as hub
from bert.tokenization import FullTokenizer

from tqdm.auto import tqdm

## Data
As data for this experiment, we will use the Wikipedia articles linked from the ["Vital articles"](https://en.wikipedia.org/wiki/Wikipedia:Vital_articles) page.

In [0]:
main = wikipedia.page('Wikipedia:Vital articles')

In [5]:
contents = {}
failed = []

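# Iterate over the list directly so that tqdm can show the total count.
for article in tqdm(main.links):
  if article in contents: continue
  try:
    text = wikipedia.page(article)
    contents[article] = text.content
  except Exception:
    # Some links cannot be resolved to a single page; record them and move on.
    failed.append(article)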

In [0]:
# Cache the downloaded articles so the download does not have to be repeated.
with open('WikiVitalArticles.pkl', 'wb') as f:
  pickle.dump(contents, f)

In [0]:
with open('WikiVitalArticles.pkl', 'rb') as f:
  contents = pickle.load(f)

In [0]:
keys = list(contents.keys())

In [5]:
train_input = [contents[key].lower() for key in keys[:10]]
train_input

 'abiogenesis, or informally the origin of life, is the natural process by which life has arisen from non-living matter, such as simple organic compounds. while the details of this process are still unknown, the prevailing scientific hypothesis is that the transition from non-living to living entities was not a single event, but an evolutionary process of increasing complexity that involved molecular self-replication, self-assembly, autocatalysis, and the emergence of cell membranes. although the occurrence of abiogenesis is uncontroversial among scientists, its possible mechanisms are poorly understood. there are several principles and hypotheses for how abiogenesis could have occurred.\nresearchers study abiogenesis through a combination of molecular biology, paleontology, astrobiology, oceanography, biophysics, geochemistry and biochemistry, and aim to determine how pre-life chemical reactions gave rise to life. the study of abiogenesis can be geophysical, chemical, or biological, w

## Processing

### Tokenization

In [0]:
BERT_URL = 'https://tfhub.dev/tensorflow/bert_en_cased_L-12_H-768_A-12/1'
bert_layer = hub.KerasLayer(BERT_URL, trainable=False)

In [7]:
vocab_file = bert_layer.resolved_object.vocab_file.asset_path.numpy()
do_lower_case = bert_layer.resolved_object.do_lower_case.numpy()

print(f'BERT vocab is stored at     : {vocab_file}')
print(f'BERT model lowercases input : {do_lower_case}')

BERT vocab is stored at     : b'/tmp/tfhub_modules/25a382cb8907296cce1ba02833c676110971fad6/assets/vocab.txt'
BERT model lowercases input : False


#### Create BERT vocab table

Load the BERT vocab file and strip surrounding whitespace from each token.

In [0]:
def load_vocab(vocab_file):
  """Loads a vocabulary file into a list."""
  vocab = []
  with tf.io.gfile.GFile(vocab_file, "r") as reader:
    while True:
      token = reader.readline()
      if not token: break
      token = token.strip()
      vocab.append(token)
  return vocab

In [0]:
def create_vocab_table(vocab, num_oov=1):
  """Create a lookup table for a vocabulary"""
  vocab_values = tf.range(tf.size(vocab, out_type=tf.int64), dtype=tf.int64)
  init = tf.lookup.KeyValueTensorInitializer(keys=vocab, values=vocab_values, key_dtype=tf.string, value_dtype=tf.int64)
  vocab_table = tf.lookup.StaticVocabularyTable(init, num_oov, lookup_key_dtype=tf.string)
  return vocab_table

In [0]:
vocab = load_vocab(vocab_file)
vocab_lookup_table = create_vocab_table(vocab)
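To make the role of `num_oov` concrete, here is a pure-Python stand-in for a vocabulary table with out-of-vocabulary (OOV) buckets. It is a sketch only: `make_lookup` is a hypothetical helper, and the toy hash is an assumption that differs from the fingerprint hash TensorFlow uses. The id layout is the part that carries over: known tokens get their line index in the vocab file, and unknown tokens get one of `num_oov` extra ids placed right after them.

```python
def make_lookup(vocab, num_oov=1):
  """Return (lookup, size) for a toy vocabulary table with OOV buckets."""
  index = {token: i for i, token in enumerate(vocab)}

  def lookup(token):
    if token in index:
      return index[token]
    # Unknown tokens are hashed into one of `num_oov` ids after the vocab.
    bucket = sum(token.encode("utf-8")) % num_oov  # toy hash, not TensorFlow's
    return len(vocab) + bucket

  # The table size counts both the vocabulary and the OOV buckets.
  return lookup, len(vocab) + num_oov

toy_vocab = ["[PAD]", "[UNK]", "the", "fox"]
lookup, size = make_lookup(toy_vocab, num_oov=1)
known_id = lookup("the")
unknown_id = lookup("zebra")
```

With a single bucket, every unknown token maps to the same id, `len(vocab)`, and the reported table size is one larger than the vocabulary.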

In [0]:
tokenizer = tftext.BertTokenizer(vocab_lookup_table, token_out_type=tf.int64, lower_case=do_lower_case)

In [12]:
tokenizer.tokenize(["the brown fox jumped over the lazy dog"])

<tf.RaggedTensor [[[1103], [3058], [17594], [4874], [1166], [1103], [16688], [3676]]]>
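The nested result above has one inner list per word because a WordPiece tokenizer may split a single word into several pieces. The greedy longest-match-first rule behind that split can be sketched in pure Python. The `wordpiece` helper and the toy vocabulary are hypothetical, and the sketch ignores details such as text normalization and maximum word length.

```python
def wordpiece(word, vocab, unk="[UNK]"):
  """Split one word into wordpieces by greedy longest-match-first."""
  pieces, start = [], 0
  while start < len(word):
    end = len(word)
    match = None
    # Try the longest remaining substring first, then shrink it.
    while start < end:
      piece = word[start:end]
      if start > 0:
        piece = "##" + piece  # continuation pieces carry a ## prefix
      if piece in vocab:
        match = piece
        break
      end -= 1
    if match is None:
      return [unk]  # no piece fits, so the whole word becomes unknown
    pieces.append(match)
    start = end
  return pieces

toy_vocab = {"the", "fox", "jump", "##ed", "##ing"}
tokenized = [wordpiece(w, toy_vocab) for w in "the fox jumped".split()]
```

Here `jumped` is split into `jump` and `##ed`, while `the` and `fox` stay whole, which mirrors the words-by-wordpieces nesting of the ragged tensor.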

In [0]:
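tokens = tokenizer.tokenize(train_input)

# The BERT module expects a 2D tensor (not 3D). Keep only the first wordpiece
# of each word, pad with 0, and truncate to BERT's 512-position limit.
# Note that no [CLS] or [SEP] tokens are added here.
tokens = tokens.to_tensor()[:, :512, 0]
tokens = tf.cast(tokens, dtype=tf.int32)

# Set masks and segment ids. The mask is 1 for real tokens and 0 for padding.
input_mask = tf.cast(tf.not_equal(tokens, 0), dtype=tf.int32)
segment_ids = tf.zeros_like(tokens)

# Embed the inputs.
pooled_output, sequence_output = bert_layer([tokens, input_mask, segment_ids])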
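The three inputs passed to the BERT layer are rectangular arrays of the same shape. A pure-Python sketch of how variable-length id lists can be brought into that form is shown below. It follows one common convention, where the mask marks real tokens with 1 and padding with 0; `pad_and_mask` is a hypothetical helper and `pad_id=0` is an assumption.

```python
def pad_and_mask(sequences, max_len, pad_id=0):
  """Truncate and pad id lists to a rectangle and build the matching mask."""
  width = min(max_len, max((len(s) for s in sequences), default=0))
  ids, mask = [], []
  for seq in sequences:
    seq = seq[:width]  # truncate sequences that are too long
    padding = width - len(seq)
    ids.append(seq + [pad_id] * padding)
    mask.append([1] * len(seq) + [0] * padding)
  # Single-segment input: every position belongs to segment 0.
  segment_ids = [[0] * width for _ in sequences]
  return ids, mask, segment_ids

ids, mask, segment_ids = pad_and_mask([[5, 6, 7], [8]], max_len=2)
```

The first sequence is truncated to two ids and the second is padded to two, so all three outputs have shape 2 by 2.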

In [0]:
dataset = tf.data.Dataset.from_tensor_slices(train_input)

In [0]:
dataset = dataset.map(bert_preprocess)

In [0]:
subtokens.to_tensor().shape

In [0]:
subtokens.to_tensor()[0]

In [0]:
subtokens.to_tensor()[5]

In [0]:
vocab_lookup_table.size()