# Wikipedia articles clustering with Projector
In this example, we cluster Wikipedia articles and visualize them in [Projector](http://projector.tensorflow.org/).

1. Download raw text from Wikipedia.
2. Convert the articles to a tf-idf matrix.
3. Reduce the dimensionality of the resulting tf-idf matrix using SVD.
4. Upload the vectors to Projector.
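
To make step 2 concrete, here is a minimal sketch of a tf-idf matrix built by hand with NumPy on three toy sentences. It uses the textbook weighting `count * log(N / df)`, which is an assumption for illustration and not necessarily the exact formula applied by `TextVectorization`; the toy documents and all variable names are made up.

```python
from collections import Counter

import numpy as np

# Toy corpus (made up for illustration).
docs = ["rome was an empire", "the empire of rome fell", "cats chase mice"]
tokenized = [d.split() for d in docs]

# Vocabulary: one column per distinct token.
vocab = sorted({tok for doc in tokenized for tok in doc})
index = {tok: i for i, tok in enumerate(vocab)}

# Term counts: rows are documents, columns are tokens.
counts = np.zeros((len(docs), len(vocab)))
for row, doc in enumerate(tokenized):
    for tok, n in Counter(doc).items():
        counts[row, index[tok]] = n

# Document frequency and the textbook inverse document frequency.
df = (counts > 0).sum(axis=0)
idf = np.log(len(docs) / df)

# tf-idf: tokens that appear in fewer documents get larger weights.
tfidf = counts * idf
```

In this toy matrix, "cats" (present in one document) receives a larger weight than "rome" (present in two), which is the property that makes tf-idf rows useful for grouping articles by topic.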

In [4]:
!pip install wikipedia -q
!pip install tf-nightly -q

  Building wheel for wikipedia (setup.py) ... done
  Building wheel for gast (setup.py) ... done
ERROR: tensorflow 1.15.0 has requirement gast==0.2.2, but you'll have gast 0.3.2 which is incompatible.
ERROR: google-colab 1.0.0 has requirement google-auth~=1.4.0, but you'll have google-auth 1.10.0 which is incompatible.
ERROR: tb-nightly 2.2.0a20200106 has requirement grpcio>=1.24.3, but you'll have grpcio 1.15.0 which is incompatible.


In [0]:
import pickle
import wikipedia
import numpy as np
from tqdm.auto import tqdm

import tensorflow as tf
from tensorflow.keras.layers.experimental.preprocessing import TextVectorization
from sklearn.decomposition import TruncatedSVD

In [3]:
print('TensorFlow version: ' + tf.__version__)

TensorFlow version: 2.2.0-dev20200112


## Data

### Downloading
As data for this experiment, we will use Wikipedia articles under the ["Vital articles"](https://en.wikipedia.org/wiki/Wikipedia:Vital_articles) section.

In [13]:
main = wikipedia.page('Wikipedia:Vital articles')
contents = {}
failed = []

for article in tqdm(main.links):
  if article in contents:
    continue
  try:
    text = wikipedia.page(article)
    contents[article] = text.content
  except Exception:
    # Disambiguation pages, missing pages, and network errors all end up here.
    failed.append(article)

In [0]:
with open('WikiVitalArticles.pkl', 'wb') as f:
  pickle.dump(contents, f)

In [0]:
with open('WikiVitalArticles.pkl', 'rb') as f:
  contents = pickle.load(f)

### Sampling
To avoid running out of memory (OOM), select a random subset of the downloaded articles.

In [0]:
MAX_ARTICLES = 100 #@param max number of articles

In [0]:
keys = list(contents.keys())

In [0]:
# Sample without replacement so that no article is picked twice.
indices = np.random.choice(len(keys), size=min(MAX_ARTICLES, len(keys)), replace=False)

In [0]:
train_input = np.array([[contents[keys[index]].lower()] for index in indices])
train_input[:10]

## Processing

In [0]:
MAX_TOKENS = 2000 #@param max number of tokens

### Tokenization

Limit the number of tokens to avoid running out of memory. Alternatively, replace the default normalizer with a custom one that removes uninformative tokens.
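
One simple way to drop uninformative tokens, sketched here, is to clean the text in plain Python before it reaches the layer. The stop-word list and the `clean_text` helper are hypothetical and only illustrate the idea; similar logic could presumably be written with `tf.strings` ops in a callable passed as `standardize`.

```python
import re

# Hypothetical stop-word list; a realistic one would be much longer.
STOP_WORDS = {"the", "of", "and", "in", "a", "to", "is"}


def clean_text(text):
    """Lowercase, keep only runs of letters, drop stop words and 1-letter tokens."""
    tokens = re.findall(r"[a-z]+", text.lower())
    kept = [tok for tok in tokens if tok not in STOP_WORDS and len(tok) > 1]
    return " ".join(kept)


example = clean_text("The History of Rome, in 3 parts!")
```

Applied to this notebook, that would mean building `train_input` from `clean_text(contents[keys[index]])` instead of the lowercased raw text, so that the `MAX_TOKENS` budget is not spent on very common words.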

In [0]:
dataset = tf.data.Dataset.from_tensor_slices(train_input).batch(8)

In [0]:
vectorize_layer = TextVectorization(
    standardize='lower_and_strip_punctuation',
    split='whitespace',
    max_tokens=MAX_TOKENS,
    output_mode='tf-idf',
    pad_to_max_tokens=False)

In [0]:
vectorize_layer.adapt(train_input)

In [0]:
tfids = vectorize_layer(train_input).numpy()

In [0]:
tfids.shape

### Dimensionality reduction with SVD

In [0]:
svd = TruncatedSVD(n_components=100, random_state=0)
tfids_reduced = svd.fit_transform(tfids)

In [0]:
tfids_reduced.shape
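
As a rough illustration of what the reduction does, the sketch below applies a plain NumPy SVD to a small random matrix and keeps the top `k` singular directions. The toy matrix stands in for the tf-idf matrix and all names are made up; `TruncatedSVD` should yield the same kind of projection (up to sign and solver approximation), but this is not its actual implementation.

```python
import numpy as np

# Toy stand-in for a tf-idf matrix: 6 documents, 10 token columns.
rng = np.random.default_rng(0)
X = rng.random((6, 10))
k = 2  # number of dimensions to keep

# Thin SVD: X = U @ diag(S) @ Vt, singular values sorted in descending order.
U, S, Vt = np.linalg.svd(X, full_matrices=False)

# Reduced representation: document coordinates along the top-k directions.
reduced = U[:, :k] * S[:k]

# The same coordinates, obtained by projecting X onto the top-k right singular vectors.
projected = X @ Vt[:k].T
```

Each row of `reduced` is a `k`-dimensional summary of one document, which is the shape of data that Projector expects in the vectors file.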

## Output
Save the metadata (article names) and vectors, then upload them to Projector for further analysis.

In [0]:
np.savetxt('vectors.tsv', tfids_reduced, delimiter='\t')

In [0]:
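# Write the names of the sampled articles, in the same order as the rows of vectors.tsv.
with open('metadata.tsv', 'w', encoding='utf-8') as f:
  f.write('\n'.join(keys[index] for index in indices))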
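
Projector pairs row i of the vectors file with line i of the metadata file, so both files need the same length and order, and a single-column metadata file is expected to have no header row. The sketch below wraps both writes in one hypothetical helper, `write_projector_files`, that checks the lengths first; the helper name and the toy data are assumptions for illustration.

```python
import os
import tempfile

import numpy as np


def write_projector_files(vectors, labels, out_dir):
    """Write vectors.tsv and metadata.tsv with exactly one label per vector row."""
    if len(vectors) != len(labels):
        raise ValueError("need exactly one label per vector")
    # Tabs and newlines inside a label would break the TSV layout.
    clean = [label.replace("\t", " ").replace("\n", " ") for label in labels]
    vec_path = os.path.join(out_dir, "vectors.tsv")
    meta_path = os.path.join(out_dir, "metadata.tsv")
    np.savetxt(vec_path, vectors, delimiter="\t")
    with open(meta_path, "w", encoding="utf-8") as f:
        f.write("\n".join(clean))
    return vec_path, meta_path


# Toy usage with made-up data.
out_dir = tempfile.mkdtemp()
toy_vectors = np.arange(6, dtype=float).reshape(3, 2)
toy_labels = ["Rome", "Cat", "Isaac\tNewton"]
vec_path, meta_path = write_projector_files(toy_vectors, toy_labels, out_dir)
```

In this notebook the equivalent call would pass `tfids_reduced` together with the list of sampled article names, so a mismatch between vectors and labels fails loudly instead of producing mislabeled points.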