# Critical Introduction to Natural Language Processing

Using only the raw text, we'll derive and explore the semantic properties of its words.

## Imports

Python code in one module gains access to the code in another module by the process of importing it. The import statement is the most common way of invoking the import machinery, but it is not the only way.
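To illustrate the "not the only way" remark, the standard library's `importlib.import_module` performs the same import from a string name. This is a minimal sketch; `math` is an arbitrary choice of example module.

```python
import importlib

# Equivalent to `import math`, but the module name is supplied as a string.
math_module = importlib.import_module("math")

print(math_module.sqrt(16.0))  # 4.0
```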

In [1]:
from __future__ import absolute_import, division, print_function

First, we import general-purpose standard-library modules that are not specific to NLP.


In [2]:
import codecs
import glob
import logging
import multiprocessing
import os
import pprint
import re

In [3]:
import nltk
import gensim.models.word2vec as w2v
from gensim.models import KeyedVectors
from gensim.models import Word2Vec
from gensim.utils import simple_preprocess
import sklearn.manifold
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns



You may run into a `ModuleNotFoundError` here. This means that a required module is not installed on your system.
You can install it from the Anaconda command prompt,
for example: <b>"conda install -c anaconda nltk"</b> or <b>"conda install -c anaconda gensim"</b>. <br> For detailed information, refer to https://docs.anaconda.com/anaconda/user-guide/tasks/install-packages/ <br>
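To see up front which packages are missing, one option is `importlib.util.find_spec`, which returns `None` for a top-level module that cannot be found. This is only a sketch: the helper name `missing_modules` is made up for this example, and the list covers the third-party packages imported above.

```python
import importlib.util

def missing_modules(names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]

required = ["nltk", "gensim", "sklearn", "numpy", "matplotlib", "pandas", "seaborn"]
print(missing_modules(required))
```

An empty list means every package is available. Any name that is printed still needs to be installed.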


In [4]:
%pylab inline

Populating the interactive namespace from numpy and matplotlib


**Set up logging**

In [5]:
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

**Download the NLTK tokenizer models and stopword list (only needed the first time)**

In [6]:
nltk.download("punkt")
nltk.download("stopwords")

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\fmx\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\fmx\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

## Set a Text

In [7]:
sentence = "The oppressed are allowed once every few years to decide which particular representatives of the oppressing class are to represent and repress them."

**Build your vocabulary (word tokenization)**

In [8]:
tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')

Given a character sequence and a defined document unit, tokenization is the task of chopping it up into pieces, called tokens, perhaps at the same time throwing away certain characters, such as punctuation.
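The choice of whether to keep or discard punctuation can be sketched with the standard library's `re` module alone: one pattern keeps punctuation marks as separate tokens, the other drops them. The function names and the sample string are illustrative and not part of the notebook.

```python
import re

def tokenize_keep_punct(text):
    # Runs of letters become tokens; every other non-space character is its own token.
    return re.findall(r"[A-Za-z]+|[^A-Za-z\s]", text)

def tokenize_drop_punct(text):
    # Only runs of letters survive; punctuation and digits are thrown away.
    return re.findall(r"[A-Za-z]+", text)

sample = "Represent, and repress them."
print(tokenize_keep_punct(sample))  # ['Represent', ',', 'and', 'repress', 'them', '.']
print(tokenize_drop_punct(sample))  # ['Represent', 'and', 'repress', 'them']
```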

In [9]:
# Convert a raw string into a list of words:
# replace every non-letter character (punctuation, digits, hyphens) with a space,
# then split on whitespace.
def sentence_to_wordlist(raw):
    clean = re.sub("[^a-zA-Z]"," ", raw)
    words = clean.split()
    return words

In [10]:
# the sentence as a list of word tokens
words = sentence_to_wordlist(sentence)

In [11]:
print(sentence)
print(words)

The oppressed are allowed once every few years to decide which particular representatives of the oppressing class are to represent and repress them.
['The', 'oppressed', 'are', 'allowed', 'once', 'every', 'few', 'years', 'to', 'decide', 'which', 'particular', 'representatives', 'of', 'the', 'oppressing', 'class', 'are', 'to', 'represent', 'and', 'repress', 'them']


In [12]:
num_tokens = len(words)
print("The corpus contains {0:,} tokens".format(num_tokens))

The corpus contains 23 tokens


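Now we build the vocabulary: the set of distinct words, sorted lexicographically. Uppercase letters sort before lowercase ones, so `The` comes first.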

In [13]:
vocab = sorted(set(words))
vocab_size = len(vocab)
', '.join(vocab)
print("The size of the vocabulary is {0:,} words".format(vocab_size))

The size of the vocabulary is 21 words
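The gap between 23 tokens and 21 vocabulary entries comes from repeated words. A quick sketch with `collections.Counter` shows which ones repeat, and that `The` and `the` count as different entries because the comparison is case-sensitive. The sentence and the tokenization are repeated here so the snippet runs on its own. Lowercasing is an optional variation, not something the notebook does.

```python
import re
from collections import Counter

sentence = ("The oppressed are allowed once every few years to decide which "
            "particular representatives of the oppressing class are to "
            "represent and repress them.")
words = re.sub("[^a-zA-Z]", " ", sentence).split()

counts = Counter(words)
repeated = sorted(w for w, c in counts.items() if c > 1)
print(repeated)                          # ['are', 'to']
print(len(counts))                       # 21 distinct case-sensitive words
print(len({w.lower() for w in words}))   # 20 once 'The' and 'the' are merged
```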


## Create a one-hot vector representation

In [14]:
onehot_vectors = np.zeros((num_tokens, vocab_size), int)
print(onehot_vectors)

[[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]]

In [15]:
for i, word in enumerate(words):
    onehot_vectors[i, vocab.index(word)] = 1
' '.join(vocab)

'The allowed and are class decide every few of once oppressed oppressing particular represent representatives repress the them to which years'

In [16]:
print(onehot_vectors)

[[1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0]
 [0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0]
 [0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0]]
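
The loop above calls `vocab.index(word)` for each token, which scans the list every time. For larger corpora, a common alternative is to build a word-to-column dictionary once and use it to select rows of an identity matrix. The following is a sketch of that idea on a shortened word list; `word_to_col` is a name introduced here, not one the notebook uses.

```python
import numpy as np

words = ["the", "oppressed", "are", "allowed", "the"]
vocab = sorted(set(words))                       # ['allowed', 'are', 'oppressed', 'the']
word_to_col = {w: i for i, w in enumerate(vocab)}

# Row i of the identity matrix is the one-hot vector for vocabulary entry i,
# so indexing with the column of each token stacks the right vectors.
cols = [word_to_col[w] for w in words]
onehot_vectors = np.eye(len(vocab), dtype=int)[cols]

print(onehot_vectors)

# Sanity checks: exactly one 1 per row, and argmax recovers the original words.
assert (onehot_vectors.sum(axis=1) == 1).all()
assert [vocab[i] for i in onehot_vectors.argmax(axis=1)] == words
```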

In [17]:
pd.DataFrame(onehot_vectors, words, columns=vocab)

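| | The | allowed | and | are | class | decide | every | few | of | once | ... | oppressing | particular | represent | representatives | repress | the | them | to | which | years |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| The | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| oppressed | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| are | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| allowed | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| once | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| every | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| few | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| years | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| to | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| decide | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |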


In [18]:
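df = pd.DataFrame(onehot_vectors, index=words, columns=vocab)
# Blank out the zeros for readability; cast to object first so the
# integer columns can hold empty strings.
df = df.astype(object)
df[df == 0] = ''
df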

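| | The | allowed | and | are | class | decide | every | few | of | once | ... | oppressing | particular | represent | representatives | repress | the | them | to | which | years |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| The | 1.0 | | | | | | | | | | ... | | | | | | | | | | |
| oppressed | | | | | | | | | | | ... | | | | | | | | | | |
| are | | | | 1.0 | | | | | | | ... | | | | | | | | | | |
| allowed | | 1.0 | | | | | | | | | ... | | | | | | | | | | |
| once | | | | | | | | | | 1.0 | ... | | | | | | | | | | |
| every | | | | | | | 1.0 | | | | ... | | | | | | | | | | |
| few | | | | | | | | 1.0 | | | ... | | | | | | | | | | |
| years | | | | | | | | | | | ... | | | | | | | | | | 1.0 |
| to | | | | | | | | | | | ... | | | | | | | | 1.0 | | |
| decide | | | | | | 1.0 | | | | | ... | | | | | | | | | | |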
