# Chapter 2: Natural Language and Distributed Representations

## 2.1 Natural Language Processing

## 2.2 Thesaurus

A thesaurus is a dictionary that groups words together with their synonyms.
A well-known example used in natural language processing is [WordNet](https://wordnet.princeton.edu/).
However, relying on a thesaurus for natural language processing is problematic: it takes a great deal of manual work to keep it up to date as word meanings change and new words appear.

## 2.3 Count-based method

Text data acquired for the purpose of understanding natural language is called a 'corpus'.



In [21]:
# use corpus

text = 'You say goodbye and I say hello.'
text = text.lower()
text = text.replace('.', ' .')

words = text.split(' ')
words

['you', 'say', 'goodbye', 'and', 'i', 'say', 'hello', '.']

In [23]:
import numpy as np

def preprocess(text):
    # Lowercase the text and split off the period so it becomes its own token
    text = text.lower()
    text = text.replace('.', ' .')
    words = text.split(' ')

    word_to_id = {}
    id_to_word = {}

    # Assign IDs in order of first appearance
    for word in words:
        if word not in word_to_id:
            new_id = len(word_to_id)
            word_to_id[word] = new_id
            id_to_word[new_id] = word

    # Represent the text as an array of word IDs
    corpus = np.array([word_to_id[w] for w in words])

    return corpus, word_to_id, id_to_word

text = 'You say goodbye and I say hello.'
corpus, word_to_id, id_to_word = preprocess(text)

print(corpus)
print(word_to_id)
print(id_to_word)


[0 1 2 3 4 1 5 6]
{'you': 0, 'say': 1, 'goodbye': 2, 'and': 3, 'i': 4, 'hello': 5, '.': 6}
{0: 'you', 1: 'say', 2: 'goodbye', 3: 'and', 4: 'i', 5: 'hello', 6: '.'}
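
As a quick sanity check on the mappings printed above, the ID sequence can be decoded back into text. This is only a sketch: it copies the printed values into plain Python objects instead of calling `preprocess`, and the names `decoded` and `restored` are introduced here for illustration.

```python
# Values copied from the output above (a plain list stands in for the NumPy array).
word_to_id = {'you': 0, 'say': 1, 'goodbye': 2, 'and': 3, 'i': 4, 'hello': 5, '.': 6}
id_to_word = {i: w for w, i in word_to_id.items()}
corpus = [0, 1, 2, 3, 4, 1, 5, 6]

# Map every ID back to its word, then join the tokens with spaces.
decoded = [id_to_word[i] for i in corpus]
restored = ' '.join(decoded)

print(restored)  # you say goodbye and i say hello .
```

Note that 'say' appears twice in the text but maps to the single ID 1, so the vocabulary has 7 entries while the corpus has 8 tokens.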


# Appendix B: WordNet

## B.1 Install NLTK

`pip install nltk`


In [1]:
import nltk

## B.2 Get synonyms

In [5]:
from nltk.corpus import wordnet

nltk.download("wordnet")

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [6]:
from nltk.corpus import wordnet

wordnet.synsets('car')

[Synset('car.n.01'),
 Synset('car.n.02'),
 Synset('car.n.03'),
 Synset('car.n.04'),
 Synset('cable_car.n.01')]

In [15]:
car = wordnet.synset('car.n.01')
print(car.definition())
print(car.lemma_names())

a motor vehicle with four wheels; usually propelled by an internal combustion engine
['car', 'auto', 'automobile', 'machine', 'motorcar']


## B.3 WordNet and the network of terms



In [16]:
car.hypernym_paths()[0]

[Synset('entity.n.01'),
 Synset('physical_entity.n.01'),
 Synset('object.n.01'),
 Synset('whole.n.02'),
 Synset('artifact.n.01'),
 Synset('instrumentality.n.03'),
 Synset('container.n.01'),
 Synset('wheeled_vehicle.n.01'),
 Synset('self-propelled_vehicle.n.01'),
 Synset('motor_vehicle.n.01'),
 Synset('car.n.01')]

## B.4 Semantic similarity according to WordNet

The similarity between two synsets can be measured with `path_similarity` or `wup_similarity`.


In [17]:
car = wordnet.synset('car.n.01')
novel = wordnet.synset('novel.n.01')
dog = wordnet.synset('dog.n.01')
motorcycle = wordnet.synset('motorcycle.n.01')

print(car.path_similarity(novel))
print(car.path_similarity(dog))
print(car.path_similarity(motorcycle))

0.05555555555555555
0.07692307692307693
0.3333333333333333
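
The scores above are consistent with how `path_similarity` is defined in NLTK: 1 / (d + 1), where d is the number of edges on the shortest path between the two synsets in the hypernym graph. The sketch below reproduces that idea on a tiny hand-written graph. The dictionary `HYPERNYM` and the helper names are assumptions made for illustration. They are not read from WordNet, and the code is not NLTK's implementation.

```python
from collections import deque

# Hand-written toy hypernym links (child -> parent). This is an assumption
# for illustration only; it is not loaded from WordNet.
HYPERNYM = {
    'car': 'motor_vehicle',
    'motorcycle': 'motor_vehicle',
    'motor_vehicle': 'self-propelled_vehicle',
    'self-propelled_vehicle': 'wheeled_vehicle',
}


def build_undirected(links):
    """Turn child -> parent links into an undirected adjacency map."""
    graph = {}
    for child, parent in links.items():
        graph.setdefault(child, set()).add(parent)
        graph.setdefault(parent, set()).add(child)
    return graph


def shortest_distance(graph, start, goal):
    """Breadth-first search; returns the edge count, or None if unreachable."""
    queue = deque([(start, 0)])
    seen = {start}
    while queue:
        node, dist = queue.popleft()
        if node == goal:
            return dist
        for nxt in graph[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None


def toy_path_similarity(a, b, links=HYPERNYM):
    """1 / (shortest path length + 1) on the toy graph."""
    dist = shortest_distance(build_undirected(links), a, b)
    return None if dist is None else 1.0 / (dist + 1)


print(toy_path_similarity('car', 'motorcycle'))       # 0.3333333333333333
print(toy_path_similarity('car', 'wheeled_vehicle'))  # 0.25
```

In the toy graph, 'car' and 'motorcycle' share the parent 'motor_vehicle', so they are two edges apart and score 1/3, the same value printed for `car.path_similarity(motorcycle)`. Read the same way, the scores for `novel` (1/18) and `dog` (1/13) correspond to much longer paths of 17 and 12 edges.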
