# Text similarity

You are given a set of sentences copied from Wikipedia. Each of them is related to "cats" in one of these contexts:

 * Cats (the animals)
 * The Unix tool `cat`, which prints file contents to the console
 * Versions of the OS X operating system named after members of the cat family


Your task is to find the two sentences most similar to the sentence on the first line. Similarity in this task is measured by cosine distance: the smaller the distance, the more similar the sentences.
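
As a reminder, the cosine distance between two vectors u and v is 1 - (u·v) / (‖u‖‖v‖). The following is a minimal pure-Python sketch of that formula. The helper name `cosine_distance` and the toy vectors are illustrative assumptions and not part of this notebook, which relies on `scipy.spatial.distance.cosine` instead.

```python
import math

def cosine_distance(u, v):
    """Return 1 - cos(angle) for two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

# Toy word-count vectors (made-up data, not taken from the task)
same_direction = cosine_distance([1, 2, 0], [2, 4, 0])  # parallel vectors
orthogonal = cosine_distance([1, 0, 0], [0, 1, 0])      # no shared words
```

Parallel vectors give a distance close to 0, while vectors with no words in common give a distance of 1, so the measure ignores sentence length and looks only at the proportions of the words.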

In [0]:
# Imports
import numpy as np
import scipy as sp
import scipy.spatial
import matplotlib as mpl
from matplotlib import pylab as plt
import pandas as pd

%matplotlib inline

import re

In [3]:
# Original text
text = """In comparison to dogs, cats have not undergone major changes during the domestication process.
As cat simply catenates streams of bytes, it can be also used to concatenate binary files, where it will just concatenate sequence of bytes.
A common interactive use of cat for a single file is to output the content of a file to standard output.
Cats can hear sounds too faint or too high in frequency for human ears, such as those made by mice and other small animals.
In one, people deliberately tamed cats in a process of artificial selection, as they were useful predators of vermin.
The domesticated cat and its closest wild ancestor are both diploid organisms that possess 38 chromosomes and roughly 20,000 genes.
Domestic cats are similar in size to the other members of the genus Felis, typically weighing between 4 and 5 kg (8.8 and 11.0 lb).
However, if the output is piped or redirected, cat is unnecessary.
cat with one named file is safer where human error is a concern - one wrong use of the default redirection symbol ">" instead of "<" (often adjacent on keyboards) may permanently delete the file you were just needing to read.
In terms of legibility, a sequence of commands starting with cat and connected by pipes has a clear left-to-right flow of information.
Cat command is one of the basic commands that you learned when you started in the Unix / Linux world.
Using cat command, the lines received from stdin can be redirected to a new file using redirection symbols.
When you type simply cat command without any arguments, it just receives the stdin content and displays it in the stdout.
Leopard was released on October 26, 2007 as the successor of Tiger (version 10.4), and is available in two editions.
According to Apple, Leopard contains over 300 changes and enhancements over its predecessor, Mac OS X Tiger.
As of Mid 2010, some Apple computers have firmware factory installed which will no longer allow installation of Mac OS X Leopard.
Since Apple moved to using Intel processors in their computers, the OSx86 community has developed and now also allows Mac OS X Tiger and later releases to be installed on non-Apple x86-based computers.
OS X Mountain Lion was released on July 25, 2012 for purchase and download through Apple's Mac App Store, as part of a switch to releasing OS X versions online and every year.
Apple has released a small patch for the three most recent versions of Safari running on OS X Yosemite, Mavericks, and Mountain Lion.
The Mountain Lion release marks the second time Apple has offered an incremental upgrade, rather than releasing a new cat entirely.
Mac OS X Mountain Lion installs in place, so you won't need to create a separate disk or run the installation off an external drive.
The fifth major update to Mac OS X, Leopard, contains such a mountain of features - more than 300 by Apple's count."""

# Split the original text into sentences (one sentence per line)
sentences = text.lower().split("\n")

# Build the word index: map every distinct word to a unique column index,
# and count word occurrences separately for each sentence
words_index = dict()
words_count = []
for sentence in sentences:
  # Tokenize: split on any character that is not a lowercase letter
  tokens = re.split('[^a-z]', sentence)
  words_count_local = dict()
  for word in tokens:
    if word != '':  # skip empty tokens produced by adjacent separators
      words_count_local[word] = words_count_local.get(word, 0) + 1
      if word not in words_index:
        words_index[word] = len(words_index)
  words_count.append(words_count_local)

# Build the matrix
# Each row is a sentence vector; its N-th coordinate is the number of
# occurrences of the N-th word in that sentence
n_sentences = len(sentences)
n_words = len(words_index)
matrix = np.zeros((n_sentences, n_words))

for sentence_index, sentence in enumerate(words_count):
  for word, count in sentence.items():
    matrix[sentence_index, words_index[word]] = count

# Find the sentences most similar to the first one, i.e. those with the
# smallest cosine distance to it
res = []
for index in range(1, n_sentences):
  res.append((index, sp.spatial.distance.cosine(matrix[0], matrix[index])))

r = [x[0] for x in sorted(res, key = lambda t: t[1])[0:2]]

[6, 4]

In [0]:
# Write result to file
with open('submission-1.txt', 'w') as f:
  f.write(str(r[0]) + " " + str(r[1]))
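
The word-counting and matrix-building steps above can also be written more compactly with `collections.Counter`. The sketch below is one possible variant under that assumption. The function name `bag_of_words` and the two sample sentences are hypothetical, and plain lists stand in for the NumPy matrix.

```python
import re
from collections import Counter

def bag_of_words(sentences):
    """Return (vocabulary, rows); rows[i][j] counts word j in sentence i."""
    # Same tokenization rule as in the notebook: split on non-letters
    counts = [
        Counter(w for w in re.split('[^a-z]', s.lower()) if w)
        for s in sentences
    ]
    # Assign column indices in order of first appearance
    vocabulary = {}
    for sentence_counts in counts:
        for word in sentence_counts:
            vocabulary.setdefault(word, len(vocabulary))
    # Fill one count vector per sentence
    rows = [[0] * len(vocabulary) for _ in counts]
    for i, sentence_counts in enumerate(counts):
        for word, n in sentence_counts.items():
            rows[i][vocabulary[word]] = n
    return vocabulary, rows

# Made-up sentences, used only to illustrate the shape of the result
vocabulary, rows = bag_of_words(["Cats chase mice.", "Mice fear cats, cats purr."])
```

Each row can then be passed to a cosine distance function in the same way as the rows of `matrix`.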