# CHAPTER 7
**Natural Language Processing Using PyTorch**

In [1]:
import torch
import torch.nn as nn

torch.manual_seed(1)

<torch._C.Generator at 0x2388297f270>

## Recipe 7-1. Word Embedding
Word embedding represents words, phrases, and tokens as dense, real-valued vectors. The vectors are learned during training, so their geometry comes to reflect how the words are used in the corpus.

In [2]:
word_to_ix = {"data": 0, "science": 1}
embeds = nn.Embedding(2, 5)
lookup_tensor = torch.tensor([word_to_ix["data"]], dtype=torch.long)
data_embed = embeds(lookup_tensor)  # embedding vector for the word "data"
print(data_embed)

tensor([[ 0.6614,  0.2669,  0.0617,  0.6213, -0.4519]],
       grad_fn=<EmbeddingBackward>)
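The lookup above can be read as plain row indexing into a trainable weight matrix. The following is a minimal sketch of that reading, reusing the same two-word vocabulary; the variable names `looked_up`, `both`, and `same_row` are illustrative and not part of the recipe.

```python
import torch
import torch.nn as nn

torch.manual_seed(1)

# Same toy vocabulary as the cell above
word_to_ix = {"data": 0, "science": 1}
embeds = nn.Embedding(num_embeddings=2, embedding_dim=5)

# Looking up index 0 returns row 0 of the embedding weight matrix
idx = torch.tensor([word_to_ix["data"]], dtype=torch.long)
looked_up = embeds(idx)
same_row = torch.equal(looked_up[0], embeds.weight[0])

# A batch of indices returns one row per index
both = embeds(torch.tensor([0, 1], dtype=torch.long))

print(looked_up.shape, both.shape, same_row)
```

Because the rows are ordinary parameters, they receive gradients and are updated by the optimizer like any other layer weight.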


In [3]:
CONTEXT_SIZE = 2
EMBEDDING_DIM = 10

In [4]:
test_sentence = """The popularity of the term "data science" has exploded in
business environments and academia, as indicated by a jump in job openings.[32]
However, many critical academics and journalists see no distinction between data
science and statistics. Writing in Forbes, Gil Press argues that data science is a
buzzword without a clear definition and has simply replaced "business analytics" in
contexts such as graduate degree programs.[7] In the question-and-answer section of
his keynote address at the Joint Statistical Meetings of American Statistical
Association，noted applied statistician Nate Silver said, "I think data-scientist
is a sexed up term for a statistician... Statistics is a branch of science.
Data scientist is slightly redundant in some way and people shouldn't berate the
term statistician."[9] Similarly,in business sector, multiple researchers and
analysts state that data scientists alone are far from being sufficient in granting
companies a real competitive advantage[33] and consider data scientists as only
one of the four greater job families companies require to leverage big
data effectively, namely: data analysts, data scientists, big data developers
and big data engineers.[34]
on the other hand, responses to criticism are as numerous.In a 2014 wall Street
Journal article, Irving Wladawsky-Berger compares the data science enthusiasm with
the dawn of computer science.He argues data science, like any other interdisciplinary
field, employs methodologies and practices from across the academia and industry, but
then it will morph them into a new discipline. He brings to attention the sharp criticisms
computer science, now a well respected academic discipline, had to once face.[35] Likewise,
NYU Stern's Vasant Dhar, as do many other academic proponents of data science, [35] argues
more specifically in December 2013 that data science is different from the existing practice
of data analysis across all disciplines, which focuses only on explaining data sets.
Data science seeks actionable and consistent pattern for predictive uses.[1] This practical
engineering goal takes data science beyond traditional analytics. Now the data in those
disciplines and applied fields that lacked solid theories,like health science and social
science, could be sought and utilized to generate powerful predictive models.[1]""".split()

In [5]:
trigrams = [([test_sentence[i], test_sentence[i + 1]], test_sentence[i + 2])
            for i in range(len(test_sentence) - 2)]
vocab = set(test_sentence)
word_to_ix = {word: i for i, word in enumerate(vocab)}
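To see what the trigram construction produces, it can help to run it on a very short input. This sketch assumes a made-up four-word sentence and sorts the vocabulary so that the index assignment is reproducible; the recipe itself enumerates an unsorted `set`, whose order can differ between runs.

```python
# Hypothetical toy sentence, chosen only for illustration
tokens = "data science is fun".split()

# Each item pairs two context words with the word that follows them
trigrams = [([tokens[i], tokens[i + 1]], tokens[i + 2])
            for i in range(len(tokens) - 2)]

# sorted() makes the word-to-index mapping deterministic
vocab = sorted(set(tokens))
word_to_ix = {word: i for i, word in enumerate(vocab)}

print(trigrams)
print(word_to_ix)
```

A sequence of n tokens yields n - 2 trigrams, so the four tokens here give two training pairs.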

In [6]:
class NGramLanguage(nn.Module):
    def __init__(self, vocab_size, embedding_dim, context_size):
        super(NGramLanguage, self).__init__()
        self.embeddings = nn.Embedding(vocab_size, embedding_dim)
        self.linear = nn.Sequential(
            nn.Linear(context_size * embedding_dim, 128),
            nn.ReLU(),
            nn.Linear(128, vocab_size),
            nn.LogSoftmax(dim=1)
        )

    def forward(self, x):
        # Flatten the context embeddings into a single row vector
        embeds = self.embeddings(x).view((1, -1))
        out = self.linear(embeds)
        return out


losses = []
loss_fn = nn.NLLLoss()
ngram = NGramLanguage(len(vocab), EMBEDDING_DIM, CONTEXT_SIZE)
optimizer = torch.optim.SGD(ngram.parameters(), lr=0.001)
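Before training, a quick shape check of the forward pass can catch wiring mistakes. The sketch below redefines the same architecture so it runs on its own; the vocabulary size of 20 and the context indices 3 and 7 are arbitrary assumptions, not values from the recipe.

```python
import torch
import torch.nn as nn

torch.manual_seed(1)

# Hypothetical sizes for the check; the recipe uses len(vocab), 10, and 2
VOCAB_SIZE, EMBEDDING_DIM, CONTEXT_SIZE = 20, 10, 2


class NGramLanguage(nn.Module):
    def __init__(self, vocab_size, embedding_dim, context_size):
        super().__init__()
        self.embeddings = nn.Embedding(vocab_size, embedding_dim)
        self.linear = nn.Sequential(
            nn.Linear(context_size * embedding_dim, 128),
            nn.ReLU(),
            nn.Linear(128, vocab_size),
            nn.LogSoftmax(dim=1),
        )

    def forward(self, x):
        # (context_size, embedding_dim) -> (1, context_size * embedding_dim)
        embeds = self.embeddings(x).view((1, -1))
        return self.linear(embeds)


model = NGramLanguage(VOCAB_SIZE, EMBEDDING_DIM, CONTEXT_SIZE)
context = torch.tensor([3, 7], dtype=torch.long)  # two arbitrary word indices
log_probs = model(context)

# Exponentiating log-probabilities should give a distribution summing to 1
prob_sum = log_probs.exp().sum().item()
print(log_probs.shape, round(prob_sum, 4))
```

The output has one row with one log-probability per vocabulary word, which is the shape `nn.NLLLoss` expects alongside a target tensor of shape `(1,)`.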

In [7]:
for epoch in range(10):
    total_loss = 0
    for context, target in trigrams:
        # step 1. Prepare the inputs to be passed to the model
        # (turn the words into integer indices and 
        # wrap them in tensors)
        context_idxs = torch.tensor(
            [word_to_ix[w] for w in context],
            dtype=torch.long
        )
        ngram.zero_grad()
        log_probs = ngram(context_idxs)
        loss = loss_fn(
            log_probs,
            torch.tensor(
                [word_to_ix[target]], dtype=torch.long
            )
        )
        loss.backward()
        optimizer.step()
        total_loss += loss.item()
    losses.append(total_loss)
print(losses)

[1855.1886982917786, 1841.4242477416992, 1827.9246106147766, 1814.6671857833862, 1801.6455292701721, 1788.847277879715, 1776.2672295570374, 1763.9086821079254, 1751.7693364620209, 1739.8531415462494]
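Each printed value is the loss summed over all trigrams in one epoch, so its size depends on the corpus length as well as on model quality. One way to interpret it is to compare the per-example loss against the loss of a uniform guess, which equals the natural log of the vocabulary size. The sketch below illustrates this with a hypothetical six-word vocabulary, and also shows that `LogSoftmax` followed by `NLLLoss` matches `CrossEntropyLoss` applied to raw scores.

```python
import math

import torch
import torch.nn as nn

torch.manual_seed(1)

# Hypothetical raw scores over a six-word vocabulary, with target word index 2
logits = torch.randn(1, 6)
target = torch.tensor([2])

# LogSoftmax + NLLLoss, as in the recipe
nll = nn.NLLLoss()(nn.LogSoftmax(dim=1)(logits), target)

# CrossEntropyLoss fuses the same two steps
ce = nn.CrossEntropyLoss()(logits, target)

# Equal scores for every word give a uniform distribution: loss = ln(6)
uniform = nn.CrossEntropyLoss()(torch.zeros(1, 6), target)

print(nll.item(), ce.item())
print(uniform.item(), math.log(6))
```

Dividing an epoch's summed loss by `len(trigrams)` gives a per-example figure that can be compared directly with `math.log(len(vocab))`.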


## Recipe 7-2. CBOW Model in PyTorch

In [9]:
raw_text = """For the future of data science, Donoho projects an ever-growing
environment for open science where data sets used for academic publications are
accessible to all researchers.[36] US National Institute of Health has already announced
plans to enhance reproducibility and transparency of research data.[39] Other big journals
are likewise following suit.[40][41] This way, the future of data science not only exceeds
the boundary of statistical theories in scale and methodology, but data science will
revolutionize current academia and research paradigms.[36] As Donoho concludes, "the scope
and impact of data science will continue to expand enormously in coming decades as scientific
data and data about science itself become ubiquitously available."[36]""".split()

In [10]:
vocab = set(raw_text)
vocab_size = len(vocab)

word_to_ix = {word: i for i, word in enumerate(vocab)}
data = []
# Iterate over token positions (not vocabulary size) so every full window is used
for i in range(2, len(raw_text) - 2):
    context = [raw_text[i - 2], raw_text[i - 1],
               raw_text[i + 1], raw_text[i + 2]]
    target = raw_text[i]
    data.append((context, target))
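A continuous bag-of-words (CBOW) model predicts a word from the words around it, so each training pair is a context window and its center word. The windowing can be traced on a short input. This sketch assumes a made-up seven-word sentence and loops over token positions, leaving two positions free at each end.

```python
# Hypothetical toy sentence, chosen only for illustration
tokens = "we study data science with pytorch today".split()

data = []
# Positions 2 .. len(tokens) - 3 have two neighbours on both sides
for i in range(2, len(tokens) - 2):
    context = [tokens[i - 2], tokens[i - 1],
               tokens[i + 1], tokens[i + 2]]
    target = tokens[i]
    data.append((context, target))

for context, target in data:
    print(context, "->", target)
```

With n tokens and a window of two words per side, the loop yields n - 4 pairs.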

In [11]:
def make_context_vector(context, word_to_ix):
    idxs = [word_to_ix[w] for w in context]
    return torch.tensor(idxs, dtype=torch.long)
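The context vector built by `make_context_vector` is meant to be fed to a CBOW network. The recipe does not show that network here, so the following is only a minimal sketch under stated assumptions: the class name `CBOW`, the choice to sum the context embeddings, the single linear output layer, and the toy sentence are all illustrative and not taken from the source.

```python
import torch
import torch.nn as nn

torch.manual_seed(1)


class CBOW(nn.Module):  # hypothetical class name
    def __init__(self, vocab_size, embedding_dim):
        super().__init__()
        self.embeddings = nn.Embedding(vocab_size, embedding_dim)
        self.linear = nn.Linear(embedding_dim, vocab_size)

    def forward(self, context_idxs):
        # Assumption: combine the context by summing its embeddings
        summed = self.embeddings(context_idxs).sum(dim=0).view(1, -1)
        return torch.log_softmax(self.linear(summed), dim=1)


def make_context_vector(context, word_to_ix):
    idxs = [word_to_ix[w] for w in context]
    return torch.tensor(idxs, dtype=torch.long)


# Hypothetical toy data
tokens = "we study data science with pytorch today".split()
word_to_ix = {w: i for i, w in enumerate(sorted(set(tokens)))}

model = CBOW(len(word_to_ix), embedding_dim=10)
context = ["we", "study", "science", "with"]
log_probs = model(make_context_vector(context, word_to_ix))

# Negative log-likelihood of the true center word "data"
loss = nn.NLLLoss()(log_probs, torch.tensor([word_to_ix["data"]]))
print(log_probs.shape, loss.item())
```

Summing (or averaging) makes the prediction independent of word order within the window, which is what distinguishes this setup from the n-gram model in Recipe 7-1, where the context embeddings are concatenated.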

## Recipe 7-3. LSTM Model