## Sparse Vector Tutorial

Transistors are the building blocks of computers, and they represent the world as 1's and 0's. Consequently, to build systems that interpret the world, we first need to convert its data into 1's and 0's.

In this walkthrough, we'll create a [Bag-of-words model](https://en.wikipedia.org/wiki/Bag-of-words_model). This is a method of "feature extraction" from text data. Features, in this case, are numerical (1's and 0's) representations of words.
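
Before working with a real article, it may help to see the idea on a tiny scale. The following is a minimal sketch, not part of the original notebook: the two sentences and the helper name `to_binary_vector` are made up for illustration. It builds a vocabulary from the texts and marks each vocabulary word as present (1) or absent (0).

```python
# Toy illustration of a binary bag-of-words model (the sentences are made up).
documents = [
    "the cat sat on the mat",
    "the dog sat on the log",
]

# The vocabulary is every distinct word across all documents, in sorted order.
vocabulary = sorted({word for doc in documents for word in doc.split()})

def to_binary_vector(text, vocabulary):
    """Return 1 for each vocabulary word present in text, else 0."""
    words = set(text.split())
    return [1 if word in words else 0 for word in vocabulary]

vectors = [to_binary_vector(doc, vocabulary) for doc in documents]

print(vocabulary)
for doc, vec in zip(documents, vectors):
    print(vec, doc)
```

Each vector has one position per vocabulary word, so two texts can be compared position by position even though they use different words.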

In [None]:
import nltk
import numpy as np
import random
import string

import bs4 as bs
import urllib.request
import re

import ssl
from pprint import pp

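# Disable HTTPS certificate verification so the download works without local certificates.
# This is insecure; only do it in a throwaway tutorial environment.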
ssl._create_default_https_context = ssl._create_unverified_context

In [None]:
# grab the entire Japan wikipedia article
html = urllib.request.urlopen('https://en.wikipedia.org/wiki/Japan').read()
article = bs.BeautifulSoup(html, 'lxml')

# extract all the paragraphs
paragraphs = article.find_all('p')

pp(paragraphs)

In [None]:
article_text = ''

# get just the content of the article
for para in paragraphs:
    article_text += para.text

print(article_text)

## Tokenize

In order to find similar strings, we need to get to the "core" of each word. This is what our tokenization pipeline does. We'll create an incredibly basic tokenizer function which:

1. Splits the article into tokens on whitespace
2. Normalizes every token to lowercase
3. Removes all punctuation
4. Removes all stopwords
5. Reduces each remaining token to its stem

In [None]:
import Stemmer
import re
import string

# simply breaking a string up by whitespace into an array of strings.
def tokenize(text):
    return text.split()

# converting every string into lowercase
def lowercase_filter(tokens):
    return [token.lower() for token in tokens]

# applying the stemmer library to reduce every word to its root
def stem_filter(tokens):
    STEMMER = Stemmer.Stemmer('english')
    return STEMMER.stemWords(tokens)

# remove all punctuation
def punctuation_filter(tokens):
    PUNCTUATION = re.compile('[%s]' % re.escape(string.punctuation))
    return [PUNCTUATION.sub('', token) for token in tokens]

# remove all stop words
# These are the 25 most common words in English according to Wikipedia
# (https://en.wikipedia.org/wiki/Most_common_words_in_English), plus 'wikipedia' itself.
# Tokens are already lowercased when this filter runs, so 'i' must be lowercase to match.
def stopword_filter(tokens):
    STOPWORDS = set(['the', 'be', 'to', 'of', 'and', 'a', 'in', 'that', 'have',
                     'i', 'it', 'for', 'not', 'on', 'with', 'he', 'as', 'you',
                     'do', 'at', 'this', 'but', 'his', 'by', 'from', 'wikipedia'])
    return [token for token in tokens if token not in STOPWORDS]

# Now an "analyze" function which puts all of the functions above into one single function:
def analyze(text):
    tokens = tokenize(text)
    tokens = lowercase_filter(tokens)
    tokens = punctuation_filter(tokens)
    tokens = stopword_filter(tokens)
    tokens = stem_filter(tokens)

    return [token for token in tokens if token]

In [None]:
corpus = analyze(article_text)
print(corpus)

In [None]:
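# Get the frequency of each word in the corpus.
# Note: `corpus` is a flat list of stemmed tokens (not sentences), so each
# `entry` below is a single word that is run through nltk's tokenizer.
# nltk.word_tokenize needs NLTK's Punkt tokenizer data (installed via nltk.download).

wordfreq = {}
for entry in corpus:
    tokens = nltk.word_tokenize(entry)
    for token in tokens:
        # dict.get returns 0 the first time a token is seen
        wordfreq[token] = wordfreq.get(token, 0) + 1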

print(wordfreq)
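
The counting loop above can also be sketched with the standard library's `collections.Counter`. This is an alternative for illustration only, and `sample_tokens` is a hypothetical stand-in for the real `corpus` list.

```python
from collections import Counter

# Hypothetical stand-in for the `corpus` list produced by analyze().
sample_tokens = ["japan", "island", "japan", "nation", "island", "japan"]

# Counter tallies how many times each token occurs.
sample_freq = Counter(sample_tokens)

print(sample_freq)
# most_common(n) returns the n most frequent tokens with their counts.
print(sample_freq.most_common(2))
```

A `Counter` behaves like a dictionary, so it can be passed to any code that expects a plain word-to-count mapping.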

In [29]:
# What are the 200 most frequently used words?
import heapq
most_freq = heapq.nlargest(200, wordfreq, key=wordfreq.get)

print(most_freq)

['japan', 'is', 'japanes', 'has', 'world', 'are', 'nation', 'was', 'countri', 'which', 'it', 'war', 'an', 'period', 'includ', 'been', 'state', 'use', 'among', 'popul', 'one', 'island', 'largest', 'sinc', 'centuri', 'after', 'unit', 'most', 'develop', 'popular', 'or', 'power', 'militari', 'dure', 'govern', 'china', 'cultur', 'other', 'such', 'day', 'tradit', 'constitut', 'south', 'million', 'first', 'emperor', 'forc', 'industri', 'highest', 'intern', 'main', 'high', 'known', 'languag', 'system', 'major', 'tokyo', 'chines', 'earli', 'influenc', 'were', 'over', 'total', 'asia', 'well', 'region', 'begin', 'rank', 'western', 'all', 'law', 'follow', 'area', 'between', 'gdp', 'establish', 'water', 'number', 'year', 'larg', 'climat', 'secur', 'rate', 'age', 'educ', 'about', 'into', 'more', 'base', 'polit', 'adopt', 'ii', 'began', 'econom', 'becom', 'signific', 'around', 'introduc', 'local', 'school', 'mani', 'korea', 'becaus', 'winter', 'summer', 'import', 'environment', 'host', 'relat', 'publ

In [28]:
# Now we convert the corpus into sparse vector representations (1's and 0's).
# For each corpus entry, this block walks the most_freq list and records 1 if that
# word appears in the entry and 0 if it does not.

sentence_vectors = []
for sentence in corpus:
    sentence_tokens = nltk.word_tokenize(sentence)

    sent_vec = []
    for token in most_freq:
        if token in sentence_tokens:
            sent_vec.append(1)
        else:
            sent_vec.append(0)
    sentence_vectors.append(sent_vec)

sentence_vectors = np.asarray(sentence_vectors)

print(sentence_vectors)

[[1 0 0 ... 0 0 0]
 [0 0 1 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 ...
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]]


## Future

In this tutorial we walked you through the conversion of text content into its corresponding vector representations. These are sparse vectors: each one has a position for every word in the vocabulary, holding a 1 if that word is present and a 0 if it is not, so most of its entries are 0.
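
Since any one text contains only a few of the vocabulary words, most entries in its vector are 0. One common way to take advantage of this, sketched here as an assumption rather than something the notebook does, is to store only the positions of the 1's. The vector and the helper name `to_full_vector` are hypothetical.

```python
# Hypothetical vector over a 10-word vocabulary: mostly zeros.
full_vector = [0, 0, 1, 0, 0, 0, 0, 1, 0, 0]

# Compact form: keep only the positions that hold a 1.
nonzero_positions = [i for i, value in enumerate(full_vector) if value == 1]

def to_full_vector(positions, length):
    """Rebuild the full 0/1 list from the stored positions."""
    wanted = set(positions)
    return [1 if i in wanted else 0 for i in range(length)]

print(nonzero_positions)
print(to_full_vector(nonzero_positions, len(full_vector)) == full_vector)
```

With a 200-word vocabulary the saving is small, but it grows quickly as the vocabulary gets larger.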

There is still the question of how to retrieve similar texts using these features. To do this, a search engine would convert the user-supplied query into tokens, build the query's vector in the same way, and then find the stored vectors that are most similar to it.
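
The retrieval idea can be sketched as follows. This is a hedged illustration only: the toy vocabulary, the three tiny "documents", and the choice of cosine similarity as the measure of closeness are all assumptions, since the walkthrough does not specify how similarity is computed.

```python
import math

# Hypothetical vocabulary and pre-tokenized documents (already stemmed).
toy_vocabulary = ["japan", "island", "war", "tokyo", "climat"]
toy_documents = {
    "geography": ["japan", "island", "climat"],
    "history": ["japan", "war"],
    "capital": ["tokyo", "japan"],
}

def vectorize(tokens, vocabulary):
    """Binary vector: 1 if the vocabulary word occurs in tokens, else 0."""
    present = set(tokens)
    return [1 if word in present else 0 for word in vocabulary]

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# The query goes through the same tokenization as the documents.
query_tokens = ["island", "climat"]
query_vector = vectorize(query_tokens, toy_vocabulary)

# Rank document names from most to least similar to the query.
ranked = sorted(
    toy_documents,
    key=lambda name: cosine_similarity(
        query_vector, vectorize(toy_documents[name], toy_vocabulary)
    ),
    reverse=True,
)

print(ranked[0])
```

The key point is that the query and the stored texts must be vectorized against the same vocabulary, in the same order, for the comparison to be meaningful.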

The next step in this walkthrough will be the conversion of a similar corpus into its dense vector representation.