## spaCy and NLTK comparison

In [1]:
# importing spacy and nltk
import spacy
import nltk.data

In [2]:
# try NLTK sentence splitting
sent_detector = nltk.data.load('tokenizers/punkt/english.pickle')
sents_nltk = sent_detector.tokenize('This is first sentence. And this is another one.')
for sent in sents_nltk:
    print(sent)

This is first sentence.
And this is another one.


In [3]:
# try spaCy sentence splitting
spacy_model = spacy.load('en_core_web_sm-2.0.0/en_core_web_sm/en_core_web_sm-2.0.0/')
doc = spacy_model('This is first sentence. And this is another one.')
sents_spacy = [sent.text for sent in doc.sents]
for sent in sents_spacy:
    print(sent)

This is first sentence.
And this is another one.


In [4]:
# import random and choose 50 paragraphs from wiki.txt
import random

with open('wiki.txt', 'r') as wiki_file:
    wiki = wiki_file.readlines()

random.seed(0)
our_pars = random.sample(wiki, k=50)

In [5]:
# apply nltk and spacy to these paragraphs
our_pars = ''.join(our_pars)
sents_nltk = sent_detector.tokenize(our_pars)
doc = spacy_model(our_pars)
sents_spacy = [sent.text for sent in doc.sents]

In [6]:
# looking at result
for pair in list(zip(sents_nltk, sents_spacy)):
    print(pair)
    print('\n')

('The term "dominatrix" is mostly used to describe a female professional dominant (or "pro-domme") who is paid to engage in BDSM play with a submissive.', 'The term "dominatrix" is mostly used to describe a female professional dominant (or "pro-domme") who is paid to engage in BDSM play with a submissive.')


('An appointment or roleplay is referred to as a "session", and is often conducted in a dedicated professional play space which has been set up with specialist equipment, known as a "dungeon".', 'An appointment or roleplay is referred to as a "session", and is often conducted in a dedicated professional play space which has been set up with specialist equipment, known as a "dungeon".')


('Sessions may also be conducted remotely by letter or telephone, or in the contemporary era of technological connectivity by email or online chat.', 'Sessions may also be conducted remotely by letter or telephone, or in the contemporary era of technological connectivity by email or online chat.')

In [7]:
# spacy sentences keep leading/trailing newlines and spaces, so strip them
sents_spacy = [sent.strip() for sent in sents_spacy]
sents_spacy[:5]

['The term "dominatrix" is mostly used to describe a female professional dominant (or "pro-domme") who is paid to engage in BDSM play with a submissive.',
 'An appointment or roleplay is referred to as a "session", and is often conducted in a dedicated professional play space which has been set up with specialist equipment, known as a "dungeon".',
 'Sessions may also be conducted remotely by letter or telephone, or in the contemporary era of technological connectivity by email or online chat.',
 'Most, but not all, clients of female professional dominants are men.',
 'Male professional dominants also exist, catering predominantly to the gay male market.']

In [8]:
# looking at result again
for pair in list(zip(sents_nltk, sents_spacy)):
    print(pair)
    print('\n')

('The term "dominatrix" is mostly used to describe a female professional dominant (or "pro-domme") who is paid to engage in BDSM play with a submissive.', 'The term "dominatrix" is mostly used to describe a female professional dominant (or "pro-domme") who is paid to engage in BDSM play with a submissive.')


('An appointment or roleplay is referred to as a "session", and is often conducted in a dedicated professional play space which has been set up with specialist equipment, known as a "dungeon".', 'An appointment or roleplay is referred to as a "session", and is often conducted in a dedicated professional play space which has been set up with specialist equipment, known as a "dungeon".')


('Sessions may also be conducted remotely by letter or telephone, or in the contemporary era of technological connectivity by email or online chat.', 'Sessions may also be conducted remotely by letter or telephone, or in the contemporary era of technological connectivity by email or online chat.')

In [9]:
# now compare results
set_nltk = set(sents_nltk)
set_spacy = set(sents_spacy)
print(len(sents_nltk))
print(len(set_nltk.intersection(set_spacy)))

131
119


In [10]:
# as we can see, the two sets differ in some sentences. What do those sentences look like?

sets_difference_nltk = set_nltk.difference(set_spacy)
sets_difference_spacy = set_spacy.difference(set_nltk)

for line in sets_difference_nltk:
    print('nltk: {0}'.format(line))
print('\n')
for line in sets_difference_spacy:
    print('spacy: {0}'.format(line))

nltk: The album's final single, "That Summer", would go on to be the most successful single from the album, reaching No.
nltk: I have hardly anything in common with myself and should stand very quietly in a corner, content that I can breathe".
nltk: With its principal source in the far west, it reverses the direction of flow exhibited by the Nile and Congo, and ultimately flows into the Atlantic — a fact that eluded European geographers for many centuries.
nltk: 12 on the "Billboard" Top Country Albums chart, Brooks' first song in three years to fail to make the top 10.
nltk: Pei, Louis Kahn, Philip Johnson, and Ludwig Mies van der Rohe; he was the only architect who had more than one building on the list.
nltk: Nonetheless, "We Shall Be Free" peaked at No.
nltk: Discourse 5 of that work, "Knowledge Its Own End", is a recent statement of a Christian educational perennialism.
nltk: 22 on the "Billboard" Christian Songs charts through a marketing deal with Rick Hendrix Company, and earne

(Some of the examples below come from an earlier run of the same code, before the random seed was fixed.)

Looking at these differences, we can see that spaCy copes better with sentences that contain '!' or 'i.e.' in the middle, for example:

- On 23 April 2009, Depeche Mode performed for the television program "Jimmy Kimmel Live!" at the famed corner of Hollywood Boulevard and Vine Street, drawing more than 12,000 fans, which was the largest audience the program had seen since its 2003 premiere, with a performance by Coldplay.
- Antianginal: Herodotus (484 c BC–425 c BC) attests that the Gandarian mercenaries (i.e. "Gandharans/Kambojans" of Gandari Strapy of Achaemenids) from the 20th strapy of the Achaemenids were recruited in the army of emperor Xerxes

But it fails on Roman numerals (in the same sentence!):

- Antianginal: Herodotus (484 c BC–425 c BC) attests that the Gandarian mercenaries (i.e. "Gandharans/Kambojans" of Gandari Strapy of Achaemenids) from the 20th strapy of the Achaemenids were recruited in the army of emperor Xerxes I (486-465 BC), which he led against the Hellas.

spaCy also failed to separate the following:

- "Two examples of miscellaneous pronunciations which contrast with both standard American and British usages are "data", which may be pronounced with ("dah") instead of ("day"); and "maroon", pronounced with ("own") as opposed to ("oon").\nEffervescents.\nPolitics."
- "On 23 April 2009, Depeche Mode performed for the television program "Jimmy Kimmel Live!" at the famed corner of Hollywood Boulevard and Vine Street, drawing more than 12,000 fans, which was the largest audience the program had seen since its 2003 premiere, with a performance by Coldplay.\nHistory."

NLTK fails on the 'No.' abbreviation:

- The album only reached No. 12 on the "Billboard" Top Country Albums chart, Brooks' first song in three years to fail to make the top 10.
- Nonetheless, "We Shall Be Free" peaked at No. 22 on the "Billboard" Christian Songs charts through a marketing deal with Rick Hendrix Company, and earned Brooks a 1993 GLAAD Media Award.
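One possible way to handle such cases is to tell Punkt explicitly that 'No.' is an abbreviation. The sketch below is only an illustration under assumptions: it builds an untrained `PunktSentenceTokenizer` from empty `PunktParameters` instead of using the pretrained English model from the cells above, and the sample text is made up.

```python
from nltk.tokenize.punkt import PunktParameters, PunktSentenceTokenizer

# Assumption: we start from empty parameters, not from the pretrained model,
# so the only thing this tokenizer knows is the abbreviation listed below.
params = PunktParameters()
params.abbrev_types = {'no'}  # lowercase, without the trailing period

custom_detector = PunktSentenceTokenizer(params)

# hypothetical sample text, modelled on the failing examples
sample = 'The album only reached No. 12 on the chart. It was released in 1993.'
custom_sents = custom_detector.tokenize(sample)
for sent in custom_sents:
    print(sent)
```

Whether this helps on the full wiki sample has not been checked here; it only shows the mechanism.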

NLTK also has problems with initials, as in:

- On that list, Wright was listed along with many of the USA's other greatest architects including Eero Saarinen, I.M. Pei, Louis Kahn, Philip Johnson, and Ludwig Mies van der Rohe; he was the only architect who had more than one building on the list.

spaCy has problems with clauses such as:

- I have hardly anything in common with myself and should stand very quietly in a corner, content that I can breathe".
- Discourse 5 of that work, "Knowledge Its Own End", is a recent statement of a Christian educational perennialism.

Both NLTK and spaCy have problems with the ':\n' sequence, but in different sentences.

## Tokenization

In [19]:
import re

In [36]:
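# build a dictionary of unique entries from the training treebank
dictionary = []
with open('ja_gsd-ud-train.conllu.txt', encoding='utf-8') as jap_train:
    for line in jap_train:
        try:
            some_line = line.split('\t')
            # CoNLL-U column index 2 is LEMMA (index 1 is FORM)
            word = some_line[2]
            # skip duplicates and entries with digits, whitespace or Latin letters
            if word not in dictionary and not re.search(r'[\d\sa-zA-Z]', word):
                dictionary.append(word)
        except IndexError:
            # comment and blank lines have no tab-separated columns
            pass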

In [62]:
# group words by length: indexed_dict[i] holds all words of length i + 1
max_len = max([len(word) for word in dictionary])
indexed_dict = []
for i in range(max_len):
    out_dict = []
    for word in dictionary:
        if len(word)==i+1:
            out_dict.append(word)
    indexed_dict.append(out_dict)

In [79]:
def maxmatch_index(sent, indexed_dict):
    """Greedy longest-match tokenization; unknown characters become single tokens."""
    tokens = []
    sent = str(sent)
    max_len = len(indexed_dict)
    while sent!='':
        flag = False
        for i in range(max_len,0,-1):
            curr_dict = indexed_dict[i-1]
            for word in curr_dict:
                if sent.startswith(word):
                    tokens.append(word)
                    sent = sent[i:]
                    flag = True
                    break
            if flag:
                break
            elif i==1:
                tokens.append(sent[0])
                sent = sent[i:]
    return tokens

In [80]:
maxmatch_index('新たに指定される美郷町では「特産品の開発などソフト事業も含めて独自の過疎計画を策定し町の活性化につなげたい', indexed_dict)

['新た',
 'に',
 '指定',
 'さ',
 'れる',
 '美郷町',
 'では',
 '「',
 '特産品',
 'の',
 '開発',
 'など',
 'ソフト',
 '事業',
 'も',
 '含',
 'め',
 'て',
 '独自',
 'の',
 '過疎',
 '計画',
 'を',
 '策定',
 'し',
 '町',
 'の',
 '活性化',
 'に',
 'つ',
 'な',
 'げ',
 'たい']
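The inner loop of `maxmatch_index` scans every dictionary word of a given length. The same greedy longest-match idea can be sketched with a set lookup, which checks each candidate substring in constant time on average. This is a minimal illustration: `maxmatch_set` and the toy vocabulary are hypothetical names and data, not part of the notebook above.

```python
def maxmatch_set(sent, vocabulary):
    """Greedy longest-match tokenization with a set-based vocabulary (hypothetical helper)."""
    vocabulary = set(vocabulary)
    max_len = max((len(word) for word in vocabulary), default=1)
    tokens = []
    pos = 0
    while pos < len(sent):
        # try the longest candidate first, then shrink it
        for length in range(min(max_len, len(sent) - pos), 0, -1):
            candidate = sent[pos:pos + length]
            if candidate in vocabulary:
                break
        else:
            # nothing matched: fall back to a single character
            candidate = sent[pos]
        tokens.append(candidate)
        pos += len(candidate)
    return tokens

# toy English vocabulary, used only to make the behaviour easy to follow
toy_vocabulary = {'the', 'table', 'down', 'there'}
print(maxmatch_set('thetabledownthere', toy_vocabulary))
```

It could be called on the Japanese data as `maxmatch_set(text, dictionary)`, assuming `dictionary` is the list built earlier.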

In [125]:
import nltk

def compare_with_conllu(conllu_path, indexed_dict, num_examples=100):
    """Return sentence-level BLEU scores of maxmatch output against gold tokens."""
    with open(conllu_path, 'r', encoding='utf-8') as conllus:
        test_tags = []  # gold tokens (FORM column) of the current sentence
        bleus = []
        count = 0
        while count < num_examples:
            line = conllus.readline()
            if line.startswith('# text = '):
                # a new sentence starts: remember its raw text
                text = line[9:].strip()
                test_tags = []
            elif line == '\n':
                # a blank line ends the sentence: tokenize it and score the result
                pred_tags = maxmatch_index(text, indexed_dict)
                bleu = nltk.translate.bleu_score.sentence_bleu([test_tags], pred_tags)
                bleus.append(bleu)
                count += 1
            elif len(line.split('\t'))>1:
                test_tags.append(line.split('\t')[1])
            elif line == '':
                # end of file
                break
    return bleus

In [126]:
import numpy as np

bleus_train = compare_with_conllu('ja_gsd-ud-train.conllu.txt', indexed_dict, 1000)
bleus_test = compare_with_conllu('ja_gsd-ud-test.conllu.txt', indexed_dict, 1000)

print(np.mean(bleus_train))
print(np.mean(bleus_test))

0.6541781036907555
0.48157346386564276


As we can see, the BLEU score of the maxmatch algorithm is higher on the training set than on the test set (0.65 vs 0.48).
This is probably because the dictionary was built from the training set only, so it lacks some of the words that occur in the test set.
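One way to probe the dictionary-coverage explanation would be to measure how many gold tokens are missing from the dictionary. The sketch below is an assumption-laden illustration: `oov_rate` is a hypothetical helper and the toy lists are made up, not taken from the treebank.

```python
def oov_rate(gold_tokens, vocabulary):
    """Share of gold tokens that are missing from the vocabulary (hypothetical helper)."""
    if not gold_tokens:
        return 0.0
    vocabulary = set(vocabulary)
    missing = sum(1 for token in gold_tokens if token not in vocabulary)
    return missing / len(gold_tokens)

# toy data for illustration only
toy_vocabulary = ['新た', 'に', '指定', 'の']
toy_gold = ['新た', 'に', '指定', 'さ', 'れる']
print(oov_rate(toy_gold, toy_vocabulary))
```

Running it over the gold tokens of the train and test files with `dictionary` as the vocabulary would show whether the test set really has a higher out-of-vocabulary share.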