# Finding the nearest sentences about *cat*

In [1]:
import numpy as np
import re
from scipy.spatial import distance

### Reading sentences from a file into a dict and a list
We use the following structure to store words:

```
struct Word {
  Index int,
  Count int
}
dict{<word>: @Word}
```

In [2]:
class Words:
    def __init__(self):
        self._words = dict()
    
    def add(self, word):
        if word in self._words:
            self._words[word]['count'] += 1
        else:
            self._words[word] = {'index': len(self._words), 'count': 1}
    
    def index(self, word):
        if word in self._words:
            return self._words[word]['index']
        return None
    
    def count(self, word):
        if word in self._words:
            return self._words[word]['count']
        return None
    
    def capacity(self):
        return len(self._words)
    
    def print_words(self):
        print(self._words)
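
To see how the `index` and `count` fields evolve, here is a small self-contained sketch. `build_vocab` is a hypothetical helper that is not part of the notebook; it mirrors what `Words.add` does, using a plain dict and a made-up token list.

```python
def build_vocab(tokens):
    """Map each token to its first-seen index and its running count."""
    vocab = {}
    for token in tokens:
        if token in vocab:
            vocab[token]['count'] += 1
        else:
            # A new word gets the next free index.
            vocab[token] = {'index': len(vocab), 'count': 1}
    return vocab


vocab = build_vocab(['the', 'cat', 'saw', 'the', 'dog'])
print(vocab['the'])  # first word seen, counted twice
print(vocab['dog'])  # fourth distinct word, counted once
```

Indices are assigned in first-seen order and never change, which is what lets them double as column positions in the feature vectors.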

Reading sentences from the file

In [3]:
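allWords = Words()             # vocabulary shared by all sentences
listOfSentenceFeatures = []    # one word-count vector per sentence
with open('data/1c2s_sentences.txt', 'r') as f:
    for sentence in f:
        sentenceWords = Words()    # word counts for this sentence only
        # Split on every non-letter character, lower-case, and drop empty tokens
        words = [x.lower() for x in re.split('[^A-Za-z]', sentence.strip()) if x]
        for word in words:
            allWords.add(word)
            sentenceWords.add(word)
        # The vector is only as long as the vocabulary seen so far;
        # shorter rows are zero-padded when the matrix is built.
        sentenceFeatures = np.zeros(allWords.capacity())
        for word in words:
            sentenceFeatures[allWords.index(word)] = sentenceWords.count(word)
        listOfSentenceFeatures.append(sentenceFeatures)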
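
The tokenizing and counting step can be tried on a toy input. The sketch below uses two made-up sentences and `collections.Counter` in place of the `Words` class, so it illustrates the idea rather than reproducing the cell above exactly.

```python
import re
from collections import Counter

import numpy as np

sentences = ["The cat sat.", "A cat saw the other cat!"]  # made-up examples

vocab = {}  # word -> column index, in first-seen order
rows = []
for sentence in sentences:
    # Split on non-letters, lower-case, and drop empty tokens.
    tokens = [t.lower() for t in re.split('[^A-Za-z]', sentence) if t]
    for t in tokens:
        vocab.setdefault(t, len(vocab))
    # The row is only as long as the vocabulary seen so far.
    row = np.zeros(len(vocab))
    for word, n in Counter(tokens).items():
        row[vocab[word]] = n
    rows.append(row)

print(rows[0])
print(rows[1])
```

The second row is longer than the first because three new words entered the vocabulary, and its second entry is 2 because "cat" occurs twice in that sentence.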

Construct the matrix of sentence features

In [4]:
features = np.zeros((len(listOfSentenceFeatures), len(listOfSentenceFeatures[-1])))
for enu, row in enumerate(listOfSentenceFeatures):
    features[enu, :len(row)] += row
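
Each row was created while the vocabulary was still growing, so earlier rows are shorter than later ones. A minimal sketch of the zero-padding, using made-up rows:

```python
import numpy as np

# Made-up rows of unequal length, standing in for listOfSentenceFeatures.
rows = [np.array([1., 1., 1.]), np.array([1., 2., 0., 1., 1., 1.])]

width = max(len(r) for r in rows)
matrix = np.zeros((len(rows), width))
for i, r in enumerate(rows):
    matrix[i, :len(r)] = r  # columns beyond len(r) stay zero

print(matrix)
```

This sketch takes the width from `max` rather than from the last row. In the notebook the two agree, because the vocabulary only grows and the last row is therefore the longest.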

### Calculating the cosine distance from the first sentence to every other sentence

In [5]:
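distances = dict()
# Keyed by distance so that sorting the keys ranks the sentences.
# Caveat: two sentences at exactly the same distance would overwrite each other.
for num, row in enumerate(features[1:]):
    distances[distance.cosine(features[0], row)] = num + 1
print([(distances[key], key) for key in sorted(distances.keys())])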

[(6, 0.7327387580875756), (4, 0.7770887149698589), (21, 0.8250364469440588), (10, 0.8328165362273942), (12, 0.8396432548525454), (16, 0.8406361854220809), (20, 0.8427572744917122), (2, 0.8644738145642124), (13, 0.8703592552895671), (14, 0.8740118423302576), (11, 0.8804771390665607), (8, 0.8842724875284311), (19, 0.8885443574849294), (3, 0.8951715163278082), (9, 0.9055088817476932), (7, 0.9258750683338899), (5, 0.9402385695332803), (18, 0.9442721787424647), (1, 0.9527544408738466), (17, 0.956644501523794)]
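
For reference, the cosine distance between vectors `u` and `v` is `1 - (u . v) / (|u| |v|)`, so smaller values mean more similar sentences. The sketch below computes it by hand on made-up vectors and ranks the rows with `np.argsort`. This is an alternative to a dict keyed by distance, not the notebook's method; its advantage is that rows with equal distances are all kept.

```python
import numpy as np

# Made-up count vectors; row 0 plays the role of the reference sentence.
features = np.array([
    [1., 1., 0., 0.],  # reference
    [1., 1., 0., 0.],  # identical to the reference
    [0., 0., 1., 1.],  # no words in common
    [1., 0., 1., 0.],  # one word in common
    [0., 1., 0., 1.],  # one word in common, same distance as row 3
])


def cosine_distance(u, v):
    """1 minus the cosine of the angle between u and v."""
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))


dists = np.array([cosine_distance(features[0], row) for row in features[1:]])

# A stable sort keeps tied rows in their original order; +1 skips the reference row.
ranking = np.argsort(dists, kind='stable') + 1
print(ranking.tolist())
```

Rows 3 and 4 are equally distant from the reference, and both appear in the ranking.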


### Writing results to file

In [6]:
with open('data/1c2s_result1.txt', 'w') as f:
    f.write(' '.join((str(distances[key]) for key in sorted(distances.keys()))))

In [10]:
!cat data/1c2s_result1.txt

6 4 21 10 12 16 20 2 13 14 11 8 19 3 9 7 5 18 1 17

In [9]:
!cat data/1c2s_sentences.txt

In comparison to dogs, cats have not undergone major changes during the domestication process.
As cat simply catenates streams of bytes, it can be also used to concatenate binary files, where it will just concatenate sequence of bytes.
A common interactive use of cat for a single file is to output the content of a file to standard output.
Cats can hear sounds too faint or too high in frequency for human ears, such as those made by mice and other small animals.
In one, people deliberately tamed cats in a process of artificial selection, as they were useful predators of vermin.
The domesticated cat and its closest wild ancestor are both diploid organisms that possess 38 chromosomes and roughly 20,000 genes.
Domestic cats are similar in size to the other members of the genus Felis, typically weighing between 4 and 5 kg (8.8 and 11.0 lb).
However, if the output is piped or redirected, cat is unnecessary.
cat with one named file is safer where human error is a concern - one wrong us