# Critical Introduction to Natural Language Processing

Using only the raw text, we'll derive and explore the semantic properties of its words.

## Imports

Python code in one module gains access to the code in another module by the process of importing it. The import statement is the most common way of invoking the import machinery, but it is not the only way.

In [1]:
from __future__ import absolute_import, division, print_function

First, we import common system tools that are not directly connected to NLP.


In [2]:
import codecs
import glob
import logging
import multiprocessing
import os
import pprint
import re

In [3]:
import nltk
import gensim.models.word2vec as w2v
from gensim.models import KeyedVectors
from gensim.models import Word2Vec
from gensim.utils import simple_preprocess
import sklearn.manifold
from sklearn.manifold import TSNE
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns



You will probably run into a "ModuleNotFoundError" here. This means that a required module is not installed on your system.
You can install it from the Anaconda command prompt,
for example: <b>"conda install -c anaconda nltk"</b> or <b>"conda install -c anaconda gensim"</b>. Note that glob is part of the Python standard library and needs no separate installation. <br> For detailed information, refer to https://docs.anaconda.com/anaconda/user-guide/tasks/install-packages/ <br>
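
Before running the import cells, it can help to check which third-party packages are actually installed. The following is a minimal sketch that uses only the standard library and assumes Python 3; the helper name `find_missing` and the package list are illustrative assumptions, not part of the original notebook.

```python
import importlib.util

# Top-level packages imported by this notebook (illustrative list).
REQUIRED = ["nltk", "gensim", "sklearn", "numpy", "matplotlib", "pandas", "seaborn"]

def find_missing(packages):
    """Return the names from `packages` that cannot be located for import."""
    return [name for name in packages if importlib.util.find_spec(name) is None]

missing = find_missing(REQUIRED)
if missing:
    print("Not installed:", ", ".join(missing))
else:
    print("All required packages are available.")
```

Any name reported as missing can then be installed with the corresponding `conda install` command.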


In [4]:
%pylab inline

Populating the interactive namespace from numpy and matplotlib


**Set up logging**

In [5]:
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

**Download NLTK tokenizer models (only the first time)**

In [6]:
nltk.download("punkt")
nltk.download("stopwords")

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\alx\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\alx\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

## Prepare Corpus

**Load books from files**

In [7]:
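# Build the pattern with os.path.join so it works on any operating system
# and avoids the invalid "\*" escape sequence.
book_filenames = sorted(glob.glob(os.path.join("txtdata", "*.txt")))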

In [8]:
print("Found books:")
book_filenames

Found books:


['txtdata\\truth-and-method-gadamer-2004.txt']

**Combine the books into one string**

In [9]:
corpus_raw = u""
for book_filename in book_filenames:
    print("Reading '{0}'...".format(book_filename))
    with codecs.open(book_filename, "r", "utf-8") as book_file:
        corpus_raw += book_file.read()
    print("Corpus is now {0} characters long".format(len(corpus_raw)))
    print()

Reading 'txtdata\truth-and-method-gadamer-2004.txt'...
Corpus is now 1618721 characters long



**Build your vocabulary (word tokenization)**

In [10]:
tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')

Given a character sequence and a defined document unit, tokenization is the task of chopping it up into pieces, called tokens, perhaps at the same time throwing away certain characters, such as punctuation.
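
As a rough illustration of this definition, the sketch below tokenizes a short string with a regular expression that keeps runs of ASCII letters and discards everything else. The function name `simple_tokens` is an assumption made for illustration; the notebook itself relies on NLTK's Punkt model for sentence splitting.

```python
import re

def simple_tokens(text):
    """Split text into tokens made of ASCII letters, dropping all other characters."""
    return re.findall(r"[A-Za-z]+", text)

sample = "The hermeneutic phenomenon is basically not a problem of method at all."
print(simple_tokens(sample))
```

A pattern like this also splits hyphenated words and contractions ("well-known" becomes two tokens) and cuts words at non-ASCII letters such as "ä", which is worth keeping in mind for a text translated from German.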

In [11]:
raw_sentences = tokenizer.tokenize(corpus_raw)

In [12]:
# Convert a raw sentence into a list of words:
# replace every non-letter character (punctuation, digits, hyphens) with a space,
# then split on whitespace.
def sentence_to_wordlist(raw):
    clean = re.sub("[^a-zA-Z]"," ", raw)
    words = clean.split()
    return words

In [13]:
# Build a list of sentences, each represented as a list of word tokens
sentences = []
for raw_sentence in raw_sentences:
    if len(raw_sentence) > 0:
        sentences.append(sentence_to_wordlist(raw_sentence))

In [14]:
print(raw_sentences[5])
print(sentence_to_wordlist(raw_sentences[5]))

The hermeneutic phenomenon is
basically not a problem of method at all.
['The', 'hermeneutic', 'phenomenon', 'is', 'basically', 'not', 'a', 'problem', 'of', 'method', 'at', 'all']


In [15]:
token_count = sum([len(sentence) for sentence in sentences])
print("The book corpus contains {0:,} tokens".format(token_count))

The book corpus contains 260,890 tokens


## Train Word2Vec

Word2vec is a method of computing vector representations of words introduced by a team of researchers at Google led by Tomas Mikolov. Google hosts an open-source version of Word2vec released under an Apache 2.0 license. In 2014, Mikolov left Google for Facebook, and in May 2015, Google was granted a patent for the method, which does not abrogate the Apache license under which it has been released. <br>

<b>Foreign Languages</b> <br>

While words in all languages may be converted into vectors with Word2vec, and those vectors learned with deep-learning frameworks, NLP preprocessing can be very language-specific and may require tools beyond the libraries used here. The Stanford Natural Language Processing Group has a number of Java-based tools for tokenization, part-of-speech tagging and named-entity recognition for languages such as Mandarin Chinese, Arabic, French, German and Spanish. For Japanese, NLP tools like Kuromoji are useful. Other foreign-language resources, including text corpora, are available at http://www-nlp.stanford.edu/links/statnlp.html

<b>size (int, optional)</b> – Dimensionality of the word vectors (renamed `vector_size` in gensim 4.0).

<b>window (int, optional)</b> – Maximum distance between the current and predicted word within a sentence.

<b>min_count (int, optional)</b> – Ignores all words with total frequency lower than this.

<b>workers (int, optional)</b> – Use these many worker threads to train the model (=faster training with multicore machines).

<b>sg ({0, 1}, optional)</b> – Training algorithm: 1 for skip-gram; otherwise CBOW.

<b>seed (int, optional)</b> – Seed for the random number generator. Initial vectors for each word are seeded with a hash of the concatenation of word + str(seed). Note that for a fully deterministically-reproducible run, you must also limit the model to a single worker thread (workers=1), to eliminate ordering jitter from OS thread scheduling. (In Python 3, reproducibility between interpreter launches also requires use of the PYTHONHASHSEED environment variable to control hash randomization).

<b>sample (float, optional)</b> – The threshold for configuring which higher-frequency words are randomly downsampled, useful range is (0, 1e-5).

<b>iter (int, optional)</b> – Number of iterations (epochs) over the corpus (renamed `epochs` in gensim 4.0).

https://radimrehurek.com/gensim/models/word2vec.html
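
To make the `window` and `sg` parameters more concrete, here is a simplified sketch of how skip-gram training pairs can be drawn from one tokenized sentence: each word is paired with every neighbour at most `window` positions away. The function `skipgram_pairs` is hypothetical and is not gensim's implementation, which additionally shrinks the window at random and applies subsampling and negative sampling.

```python
def skipgram_pairs(words, window):
    """Return (center, context) pairs for every word and its neighbours
    within `window` positions on either side."""
    pairs = []
    for i, center in enumerate(words):
        start = max(0, i - window)
        end = min(len(words), i + window + 1)
        for j in range(start, end):
            if j != i:  # a word is never its own context
                pairs.append((center, words[j]))
    return pairs

sentence = ["problem", "of", "method"]
for pair in skipgram_pairs(sentence, window=1):
    print(pair)
```

With a larger `window`, each word is paired with more distant neighbours, so the resulting vectors tend to reflect broader topical context rather than immediate word order.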

2019-11-26 16:27:52,400 : INFO : collecting all words and their counts
2019-11-26 16:27:52,402 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2019-11-26 16:27:52,488 : INFO : PROGRESS: at sentence #10000, processed 223470 words, keeping 11480 word types
2019-11-26 16:27:52,503 : INFO : collected 12895 word types from a corpus of 260890 raw words and 11459 sentences
2019-11-26 16:27:52,507 : INFO : Loading a fresh vocabulary
2019-11-26 16:27:52,672 : INFO : min_count=2 retains 7334 unique words (56% of original 12895, drops 5561)
2019-11-26 16:27:52,674 : INFO : min_count=2 leaves 255329 word corpus (97% of original 260890, drops 5561)
2019-11-26 16:27:52,737 : INFO : deleting the raw counts dictionary of 12895 items
2019-11-26 16:27:52,738 : INFO : sample=0.001 downsamples 45 most-common words
2019-11-26 16:27:52,740 : INFO : downsampling leaves estimated 185415 word corpus (72.6% of prior 255329)
2019-11-26 16:27:52,784 : INFO : estimated required memory fo

Word2Vec vocabulary length: 7334
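
The pruning step reported in the log (`min_count=2` retains 7,334 of 12,895 word types) can be mimicked on a toy corpus with a simple frequency count. This is a minimal sketch, assuming sentences are lists of tokens as built earlier; `prune_vocab` and the toy data are hypothetical and do not reproduce gensim's internals.

```python
from collections import Counter

def prune_vocab(sentences, min_count):
    """Count word frequencies and keep only words seen at least `min_count` times."""
    counts = Counter(word for sentence in sentences for word in sentence)
    return {word: n for word, n in counts.items() if n >= min_count}

toy_sentences = [
    ["truth", "and", "method"],
    ["method", "and", "understanding"],
]
kept = prune_vocab(toy_sentences, min_count=2)
print(sorted(kept))
```

With `min_count=2`, only words that occur exactly once are dropped. This is why the log reports the same number, 5,561, for both dropped word types and dropped tokens, and why 97% of the running text remains although 44% of the word types are removed.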


**Start training, this might take a minute or two...**

2019-11-26 16:27:53,212 : INFO : training model with 2 workers on 7334 vocabulary and 160 features, using sg=1 hs=0 sample=0.001 negative=5 window=10
2019-11-26 16:27:54,283 : INFO : EPOCH 1 - PROGRESS: at 38.26% words, 70544 words/s, in_qsize 3, out_qsize 0
2019-11-26 16:27:55,306 : INFO : EPOCH 1 - PROGRESS: at 72.71% words, 66481 words/s, in_qsize 3, out_qsize 0
2019-11-26 16:27:56,063 : INFO : worker thread finished; awaiting finish of 1 more threads
2019-11-26 16:27:56,176 : INFO : worker thread finished; awaiting finish of 0 more threads
2019-11-26 16:27:56,178 : INFO : EPOCH - 1 : training on 260890 raw words (185276 effective words) took 2.9s, 63996 effective words/s
2019-11-26 16:27:57,295 : INFO : EPOCH 2 - PROGRESS: at 30.60% words, 51920 words/s, in_qsize 4, out_qsize 0
2019-11-26 16:27:58,374 : INFO : EPOCH 2 - PROGRESS: at 65.04% words, 55716 words/s, in_qsize 3, out_qsize 0
2019-11-26 16:27:59,275 : INFO : worker thread finished; awaiting finish of 1 more threads
2019-11

(1854259, 2608900)

**Save to file, can be useful later**

2019-11-26 16:28:21,662 : INFO : saving Word2Vec object under word2vecGadamer.w2v, separately None
2019-11-26 16:28:21,666 : INFO : not storing attribute vectors_norm
2019-11-26 16:28:21,670 : INFO : not storing attribute cum_table
2019-11-26 16:28:21,856 : INFO : saved word2vecGadamer.w2v


## Explore the Trained Model