## DIY Word Embedding

Transistors are the building blocks of computers, and they represent everything as 1's and 0's. Consequently, to build systems that interpret text, we first need to convert that text into 1's and 0's.

In this walkthrough, we'll create a [Bag-of-words model](https://en.wikipedia.org/wiki/Bag-of-words_model), a method of "feature extraction" from text data. Features, in this case, are numerical (1's and 0's) representations of words.
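Before touching real data, here is a minimal sketch of the idea on two toy sentences. The helper names `build_vocabulary` and `to_binary_vector` are made up for this illustration and are not used later in the notebook.

```python
def build_vocabulary(documents):
    """Collect every distinct lowercase word, sorted so column order is stable."""
    return sorted({word for doc in documents for word in doc.lower().split()})

def to_binary_vector(document, vocabulary):
    """Emit 1 for each vocabulary word present in the document, else 0."""
    words = set(document.lower().split())
    return [1 if term in words else 0 for term in vocabulary]

docs = ["Tokyo is the capital", "Osaka is a city"]
vocab = build_vocabulary(docs)
vectors = [to_binary_vector(doc, vocab) for doc in docs]

print(vocab)    # ['a', 'capital', 'city', 'is', 'osaka', 'the', 'tokyo']
print(vectors)  # [[0, 1, 0, 1, 0, 1, 1], [1, 0, 1, 1, 1, 0, 0]]
```

Each column corresponds to one vocabulary word, and word order inside the sentence is discarded, which is where the "bag" in bag-of-words comes from.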

In [3]:
import nltk
import numpy as np
import random
import string

import bs4 as bs
import urllib.request
import re

import ssl

# Skip HTTPS certificate verification for every urllib request.
# This is insecure and only acceptable in a throwaway notebook like this one.
ssl._create_default_https_context = ssl._create_unverified_context
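If you would rather keep certificate verification switched on, one option is to pass an explicit default context to `urlopen` instead of patching the `ssl` module globally. This is only a sketch: `make_verified_context` and `fetch` are hypothetical helper names, and whether it works depends on your machine having a usable certificate store.

```python
import ssl
import urllib.request

def make_verified_context():
    """Return an SSL context that verifies certificates and hostnames."""
    return ssl.create_default_context()

def fetch(url, timeout=10):
    """Download a URL over HTTPS with certificate verification enabled."""
    context = make_verified_context()
    with urllib.request.urlopen(url, context=context, timeout=timeout) as response:
        return response.read()
```

With this in place, `fetch('https://en.wikipedia.org/wiki/Japan')` could stand in for the `urlopen(...).read()` call used in the next cell.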

In [4]:
# grab the entire Japan wikipedia article
html = urllib.request.urlopen('https://en.wikipedia.org/wiki/Japan').read()
article = bs.BeautifulSoup(html, 'lxml')

# extract all the paragraphs
paragraphs = article.find_all('p')

print(paragraphs)

[<p class="mw-empty-elt">
</p>, <p><b>Japan</b> (<a href="/wiki/Japanese_language" title="Japanese language">Japanese</a>: <span lang="ja">日本</span>, <span title="Japanese-language romanization"><i lang="ja-Latn">Nippon</i></span> or <span title="Japanese-language romanization"><i lang="ja-Latn">Nihon</i></span>,<sup class="reference" id="cite_ref-8"><a href="#cite_note-8">[nb 1]</a></sup> and formally <span title="Japanese-language text"><span lang="ja">日本国</span></span>, <i>Nihonkoku</i><sup class="reference" id="cite_ref-fn1_10-0"><a href="#cite_note-fn1-10">[nb 2]</a></sup>) is an <a href="/wiki/Island_country" title="Island country">island country</a> in <a href="/wiki/East_Asia" title="East Asia">East Asia</a>. It is situated in the northwest <a href="/wiki/Pacific_Ocean" title="Pacific Ocean">Pacific Ocean</a>, and is bordered on the west by the <a href="/wiki/Sea_of_Japan" title="Sea of Japan">Sea of Japan</a>, while extending from the <a href="/wiki/Sea_of_Okhotsk" title="Sea 

In [5]:
article_text = ''

# get just the content of the article
for para in paragraphs:
    article_text += para.text

print(article_text)


Japan (Japanese: 日本, Nippon or Nihon,[nb 1] and formally 日本国, Nihonkoku[nb 2]) is an island country in East Asia. It is situated in the northwest Pacific Ocean, and is bordered on the west by the Sea of Japan, while extending from the Sea of Okhotsk in the north toward the East China Sea, Philippine Sea, and Taiwan in the south. Japan is a part of the Ring of Fire, and spans an archipelago of 6852 islands covering 377,975 square kilometers (145,937 sq mi); the five main islands are Hokkaido, Honshu (the "mainland"), Shikoku, Kyushu, and Okinawa. Tokyo is the nation's capital and largest city, followed by Yokohama, Osaka, Nagoya, Sapporo, Fukuoka, Kobe, and Kyoto.
Japan is the eleventh most populous country in the world, as well as one of the most densely populated and urbanized. About three-fourths of the country's terrain is mountainous, concentrating its population of 125.5 million on narrow coastal plains. Japan is divided into 47 administrative prefectures and eight traditional re

## Tokenize

In order to find similar strings, we need to get to the "core" of each word. This is what tokenization, together with a few normalization filters, does. We'll create an incredibly basic analyzer function which:

1. Splits the article on whitespace
2. Normalizes every token to lowercase
3. Removes all punctuation
4. Removes all stopwords
5. Reduces each remaining token to its stem

In [6]:
import Stemmer
import re
import string

# break a string up on whitespace into a list of strings
def tokenize(text):
    return text.split()

# convert every string to lowercase
def lowercase_filter(tokens):
    return [token.lower() for token in tokens]

# apply the stemmer library (PyStemmer) to reduce every word to its root
def stem_filter(tokens):
    STEMMER = Stemmer.Stemmer('english')
    return STEMMER.stemWords(tokens)

# remove all punctuation
def punctuation_filter(tokens):
    PUNCTUATION = re.compile('[%s]' % re.escape(string.punctuation))
    return [PUNCTUATION.sub('', token) for token in tokens]

# remove all stop words
# These are the top 25 most common words in English according to Wikipedia
# (https://en.wikipedia.org/wiki/Most_common_words_in_English), plus 'wikipedia'.
# Tokens are lowercased before this filter runs, so every entry must be lowercase.
def stopword_filter(tokens):
    STOPWORDS = set(['the', 'be', 'to', 'of', 'and', 'a', 'in', 'that', 'have',
                     'i', 'it', 'for', 'not', 'on', 'with', 'he', 'as', 'you',
                     'do', 'at', 'this', 'but', 'his', 'by', 'from', 'wikipedia'])
    return [token for token in tokens if token not in STOPWORDS]

# Now an "analyze" function which puts all of the functions above into one single function:
def analyze(text):
    tokens = tokenize(text)
    tokens = lowercase_filter(tokens)
    tokens = punctuation_filter(tokens)
    tokens = stopword_filter(tokens)
    tokens = stem_filter(tokens)

    return [token for token in tokens if token]

In [11]:
corpus = analyze(article_text)
print(corpus)

['japan', 'japanes', '日本', 'nippon', 'or', 'nihonnb', '1', 'formal', '日本国', 'nihonkokunb', '2', 'is', 'an', 'island', 'countri', 'east', 'asia', 'is', 'situat', 'northwest', 'pacif', 'ocean', 'is', 'border', 'west', 'sea', 'japan', 'while', 'extend', 'sea', 'okhotsk', 'north', 'toward', 'east', 'china', 'sea', 'philippin', 'sea', 'taiwan', 'south', 'japan', 'is', 'part', 'ring', 'fire', 'span', 'an', 'archipelago', '6852', 'island', 'cover', '377975', 'squar', 'kilomet', '145937', 'sq', 'mi', 'five', 'main', 'island', 'are', 'hokkaido', 'honshu', 'mainland', 'shikoku', 'kyushu', 'okinawa', 'tokyo', 'is', 'nation', 'capit', 'largest', 'citi', 'follow', 'yokohama', 'osaka', 'nagoya', 'sapporo', 'fukuoka', 'kobe', 'kyoto', 'japan', 'is', 'eleventh', 'most', 'popul', 'countri', 'world', 'well', 'one', 'most', 'dens', 'popul', 'urban', 'about', 'threefourth', 'countri', 'terrain', 'is', 'mountain', 'concentr', 'it', 'popul', '1255', 'million', 'narrow', 'coastal', 'plain', 'japan', 'is', 'd
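Notice tokens such as `'nihonnb'` and `'nihonkokunb'` in the output: they come from footnote markers like `[nb 1]` that end up glued to the preceding word once punctuation is removed. One possible fix, sketched below, is to strip those markers before calling `analyze`. The pattern is an assumption about what the markers look like in this article, and `strip_reference_markers` is a hypothetical helper.

```python
import re

# Assumed marker shapes: "[nb 1]" style notes and plain numeric citations like "[12]".
REFERENCE_MARKER = re.compile(r'\[(?:nb\s+)?\d+\]')

def strip_reference_markers(text):
    """Remove bracketed footnote and citation markers from scraped text."""
    return REFERENCE_MARKER.sub('', text)

sample = 'Nippon or Nihon,[nb 1] and formally 日本国, Nihonkoku[nb 2]) is an island country.[12]'
cleaned = strip_reference_markers(sample)
print(cleaned)
# Nippon or Nihon, and formally 日本国, Nihonkoku) is an island country.
```

In the notebook this would be applied as `analyze(strip_reference_markers(article_text))`.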

In [12]:
# Get the word frequency of each word in the corpus.
# Note: `corpus` is a flat list of stemmed tokens (not sentences), so each
# item passed to nltk.word_tokenize below is already a single word.

wordfreq = {}
for word in corpus:
    tokens = nltk.word_tokenize(word)
    for token in tokens:
        wordfreq[token] = wordfreq.get(token, 0) + 1

print(wordfreq)

{'japan': 184, 'japanes': 90, '日本': 5, 'nippon': 4, 'or': 16, 'nihonnb': 1, '1': 6, 'formal': 2, '日本国': 1, 'nihonkokunb': 1, '2': 2, 'is': 117, 'an': 24, 'island': 18, 'countri': 35, 'east': 6, 'asia': 10, 'situat': 1, 'northwest': 2, 'pacif': 6, 'ocean': 1, 'border': 1, 'west': 4, 'sea': 7, 'while': 6, 'extend': 2, 'okhotsk': 2, 'north': 7, 'toward': 2, 'china': 15, 'philippin': 1, 'taiwan': 3, 'south': 13, 'part': 6, 'ring': 2, 'fire': 1, 'span': 2, 'archipelago': 4, '6852': 2, 'cover': 3, '377975': 1, 'squar': 1, 'kilomet': 6, '145937': 1, 'sq': 3, 'mi': 4, 'five': 3, 'main': 12, 'are': 43, 'hokkaido': 5, 'honshu': 3, 'mainland': 1, 'shikoku': 3, 'kyushu': 4, 'okinawa': 4, 'tokyo': 11, 'nation': 40, 'capit': 4, 'largest': 18, 'citi': 5, 'follow': 9, 'yokohama': 1, 'osaka': 2, 'nagoya': 1, 'sapporo': 2, 'fukuoka': 1, 'kobe': 1, 'kyoto': 4, 'eleventh': 2, 'most': 17, 'popul': 19, 'world': 60, 'well': 10, 'one': 19, 'dens': 2, 'urban': 2, 'about': 8, 'threefourth': 1, 'terrain': 2, 'mo
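The same counting can be done with `collections.Counter` from the standard library. The sketch below uses a short hand-written token list instead of the real corpus so it can run on its own.

```python
from collections import Counter

# A tiny stand-in for the real corpus of stemmed tokens.
sample_tokens = ['japan', 'is', 'island', 'japan', 'sea', 'is', 'japan']

# Counter tallies each distinct token in a single pass.
sample_counts = Counter(sample_tokens)

print(sample_counts['japan'])   # 3
print(sample_counts['sea'])     # 1
print(sample_counts['tokyo'])   # 0 (missing keys count as zero)
```

Because `Counter` is a `dict` subclass, it can be used anywhere the hand-built `wordfreq` dictionary is used.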

In [14]:
# What are the 200 most frequently used words?
import heapq
most_freq = heapq.nlargest(200, wordfreq, key=wordfreq.get)

print(most_freq)

['japan', 'is', 'japanes', 'has', 'world', 'are', 'nation', 'was', 'countri', 'which', 'it', 'war', 'an', 'period', 'includ', 'been', 'state', 'use', 'among', 'popul', 'one', 'island', 'largest', 'sinc', 'centuri', 'after', 'unit', 'most', 'develop', 'popular', 'or', 'power', 'militari', 'dure', 'govern', 'china', 'cultur', 'other', 'such', 'day', 'tradit', 'constitut', 'south', 'million', 'first', 'emperor', 'forc', 'industri', 'highest', 'intern', 'main', 'high', 'known', 'languag', 'system', 'major', 'tokyo', 'chines', 'earli', 'influenc', 'were', 'over', 'total', 'asia', 'well', 'region', 'begin', 'rank', 'western', 'all', 'law', 'follow', 'area', 'between', 'gdp', 'establish', 'water', 'number', 'year', 'larg', 'climat', 'secur', 'rate', 'age', 'educ', 'about', 'into', 'more', 'base', 'polit', 'adopt', 'ii', 'began', 'econom', 'becom', 'signific', 'around', 'introduc', 'local', 'school', 'mani', 'korea', 'becaus', 'winter', 'summer', 'import', 'environment', 'host', 'relat', 'publ
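This ranked list doubles as the vocabulary for the vectors: a word's position in the list is its column. As a small sketch (using a few counts copied from the output above, and the hypothetical name `feature_index`), here is how `heapq.nlargest` compares with `Counter.most_common`, and how to build a word-to-column lookup:

```python
import heapq
from collections import Counter

# A handful of counts taken from the word-frequency output above.
sample_freq = Counter({'sea': 7, 'world': 60, 'japan': 184, 'japanes': 90, 'is': 117})

# Two ways to get the top-3 words by count.
top_heap = heapq.nlargest(3, sample_freq, key=sample_freq.get)
top_counter = [word for word, _ in sample_freq.most_common(3)]

# Map each selected word to its column position in a feature vector.
feature_index = {word: column for column, word in enumerate(top_heap)}

print(top_heap)       # ['japan', 'is', 'japanes']
print(feature_index)  # {'japan': 0, 'is': 1, 'japanes': 2}
```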

In [16]:
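# Now we convert the corpus into sparse vector representations (1's and 0's).
# For each entry in the corpus, the commented-out block below walks the most_freq
# list and records 1 if that word is present in the entry, 0 otherwise.
# Note: `corpus` is a flat list of tokens, so each "sentence" here is a single word,
# as the printed output shows.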

sentence_vectors = []
for sentence in corpus:
    sentence_tokens = nltk.word_tokenize(sentence)
    print(sentence_tokens)
#     sent_vec = []
#     for token in most_freq:
#         if token in sentence_tokens:
#             sent_vec.append(1)
#         else:
#             sent_vec.append(0)
#     sentence_vectors.append(sent_vec)

# sentence_vectors = np.asarray(sentence_vectors)

# print(sentence_vectors)

['japan']
['japanes']
['日本']
['nippon']
['or']
['nihonnb']
['1']
['formal']
['日本国']
['nihonkokunb']
['2']
['is']
['an']
['island']
['countri']
['east']
['asia']
['is']
['situat']
['northwest']
['pacif']
['ocean']
['is']
['border']
['west']
['sea']
['japan']
['while']
['extend']
['sea']
['okhotsk']
['north']
['toward']
['east']
['china']
['sea']
['philippin']
['sea']
['taiwan']
['south']
['japan']
['is']
['part']
['ring']
['fire']
['span']
['an']
['archipelago']
['6852']
['island']
['cover']
['377975']
['squar']
['kilomet']
['145937']
['sq']
['mi']
['five']
['main']
['island']
['are']
['hokkaido']
['honshu']
['mainland']
['shikoku']
['kyushu']
['okinawa']
['tokyo']
['is']
['nation']
['capit']
['largest']
['citi']
['follow']
['yokohama']
['osaka']
['nagoya']
['sapporo']
['fukuoka']
['kobe']
['kyoto']
['japan']
['is']
['eleventh']
['most']
['popul']
['countri']
['world']
['well']
['one']
['most']
['dens']
['popul']
['urban']
['about']
['threefourth']
['countri']
['terrain']
['is']
['mount
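As the output shows, every "sentence" is a single token, because `corpus` was flattened by `analyze`. One way to get one vector per sentence is to split the raw text into sentences first and analyze each sentence separately. The sketch below is self-contained and makes two simplifying assumptions: `split_sentences` is a naive regex splitter, and `simple_analyze` is a stand-in for `analyze` that only lowercases and strips punctuation (no stemming, no stopword removal).

```python
import re
import string

def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s]

def simple_analyze(sentence):
    """Stand-in for analyze(): lowercase and strip ASCII punctuation only."""
    table = str.maketrans('', '', string.punctuation)
    return [token for token in sentence.lower().translate(table).split() if token]

def vectorize(sentences, vocabulary):
    """Build one binary vector per sentence, one column per vocabulary word."""
    vectors = []
    for sentence in sentences:
        tokens = set(simple_analyze(sentence))
        vectors.append([1 if word in tokens else 0 for word in vocabulary])
    return vectors

text = "Japan is an island country. Tokyo is the capital of Japan. The sea is wide."
vocabulary = ['japan', 'is', 'island', 'tokyo', 'sea']

sentences = split_sentences(text)
toy_vectors = vectorize(sentences, vocabulary)
for sentence, vector in zip(sentences, toy_vectors):
    print(vector, sentence)
# [1, 1, 1, 0, 0] Japan is an island country.
# [1, 1, 0, 1, 0] Tokyo is the capital of Japan.
# [0, 1, 0, 0, 1] The sea is wide.
```

In the notebook, `vocabulary` would be `most_freq` and `simple_analyze` would be replaced by `analyze`; the resulting list can then be passed to `np.asarray` as in the commented-out code.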

## Future

In this tutorial we walked you through the conversion of text content into its corresponding vector representation. These are sparse vectors, meaning each token becomes a feature represented by a single digit (a 1 or a 0), and most of those digits are 0.

What remains is retrieval: finding texts similar to a query. To do this, your search engine would convert the user-supplied query into tokens with the same analyzer, build its feature vector, and then find the stored vectors most similar to it.
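A minimal sketch of that lookup is shown below, using cosine similarity as the similarity measure. That choice is an assumption, since the notebook does not settle on one. The vocabulary, documents, and query tokens are hand-written stand-ins for what `analyze` and `most_freq` would produce.

```python
import math

def to_vector(tokens, vocabulary):
    """Binary bag-of-words vector: 1 if the vocabulary word is present."""
    present = set(tokens)
    return [1 if word in present else 0 for word in vocabulary]

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

vocabulary = ['japan', 'island', 'tokyo', 'capit', 'sea']
documents = {
    'geography': ['japan', 'island', 'sea'],
    'capital': ['tokyo', 'capit', 'japan'],
}
doc_vectors = {name: to_vector(tokens, vocabulary) for name, tokens in documents.items()}

# Tokens as the analyzer might emit them for a query like "Tokyo capital".
query_vector = to_vector(['tokyo', 'capit'], vocabulary)

scores = {name: cosine_similarity(query_vector, vec) for name, vec in doc_vectors.items()}
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)  # ['capital', 'geography']
```

The important detail is that the query goes through exactly the same analysis and vocabulary as the documents; otherwise the columns of the two vectors would not line up.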

The next step in this walkthrough will be the conversion of a similar corpus into its dense vector representation.