## TFLearn Fragment Detection

Catherine has prepared data files in which sentences have been turned into fragments. As input I will use 60,000 fragments and 60,000 sentences. For now, every fragment is derived from one of the input sentences; in the future, the fragments will not be derived from the input sentences. Each label is either 1 or 0, where 1 indicates a sentence and 0 indicates a fragment.

#### Import Dependencies

In [78]:
import pandas as pd
import numpy as np
import tensorflow as tf
import tflearn
from tflearn.data_utils import to_categorical
import re
import spacy
from nltk.util import ngrams, trigrams

nlp = spacy.load('en')

#### Load Datafiles

In [39]:
texts = []
labels = []

# Each file holds one example per line, with fields separated by " ||| ":
# field 0 is the full sentence and field 2 is the fragment derived from it.
dataFiles = [
    "./removingPOS/updatedSentences/conjunctionSentences/detailedRemoval.txt",
    "./removingPOS/updatedSentences/nounSentences/detailedRemoval.txt",
    "./removingPOS/updatedSentences/nounverbSentences/detailedRemoval.txt",
    "./removingPOS/updatedSentences/verbSentences/detailedRemoval.txt",
]

for path in dataFiles:
    with open(path, "r") as f:
        for line in f:
            asArray = line.split(" ||| ")
            fragment = asArray[2].strip()
            # Tidy the punctuation left behind when words were removed.
            fragment = re.sub(r" \.", ".", fragment)
            fragment = re.sub(r",\.", ".", fragment)
            # Note: capitalize() also lowercases every character after the first.
            texts.append(fragment.capitalize())
            labels.append(0)  # 0 = fragment
            texts.append(asArray[0].strip())
            labels.append(1)  # 1 = sentence

print(texts[-10:])

['With 92% of dawson creek residents canadian-born, and 93% speaking only english, the city has few visible minorities.', 'With 92% of Dawson Creek residents being Canadian-born, and 93% speaking only English, the city has few visible minorities.', 'By the end of the year, the texians all mexican troops from texas.', 'By the end of the year, the Texians had driven all Mexican troops from Texas.', 'In northern manitoba, quartz to make arrowheads.', 'In Northern Manitoba, quartz was mined to make arrowheads.', 'There significant fictionalisation, however.', 'There was significant fictionalisation, however.', "Extremeolation from society and community also apparent in crane's work.", "Extreme isolation from society and community is also apparent in Crane's work."]
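The per-line processing above can be sketched in isolation, which makes one side effect easier to see: `capitalize()` lowercases everything after the first character, so proper nouns in fragments lose their capitals (as in the output above) while the full sentences keep theirs. The helper name `parse_line` and the sample line are hypothetical; in particular, the contents of the middle field and the exact raw fragment text are assumptions, since the notebook only reads fields 0 and 2 and never prints a raw line.

```python
import re

def parse_line(line):
    """Split one ' ||| '-delimited line into (sentence, cleaned fragment)."""
    fields = line.split(" ||| ")
    sentence = fields[0].strip()
    fragment = fields[2].strip()
    # Same punctuation clean-up as the loading cell.
    fragment = re.sub(r" \.", ".", fragment)
    fragment = re.sub(r",\.", ".", fragment)
    # capitalize() upper-cases the first character and lower-cases the rest.
    return sentence, fragment.capitalize()

# Hypothetical raw line: the middle field and the stray " ." are assumptions.
sample = ("By the end of the year, the Texians had driven all Mexican troops from Texas."
          " ||| had driven ||| "
          "By the end of the year, the Texians all Mexican troops from Texas .\n")

sentence, fragment = parse_line(sample)
print(sentence)
print(fragment)
```

If the casing difference between the two classes turns out to matter for the tagger (for example, proper nouns may be tagged differently once lowercased), upper-casing only the first character with `fragment[:1].upper() + fragment[1:]` would be one alternative to consider.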


##### Shuffle the data

In [40]:
import random

combined = list(zip(texts,labels))
random.shuffle(combined)

texts[:], labels[:] = zip(*combined)
print(texts[-10:])
print(labels[-10:])

["An archetypal, it generally close to the facts of sheppard's life, but portrays him as a swashbuckling hero.", 'In 2005, president kagame also launched a program known.', 'Only in late the bill finalized and forwarded to the senate for consideration.', 'Ashmole met the botanist and collector John Tradescant the younger around 1650.', 'City-to-city relationships communities and special interest groups both locally and abroad to engage in a wide range of exchange activities.', 'They arrived on 14 May.', 'Gained prestige as the oxford of the east in its early years.', 'Laffoon vigorously defended the commissions he had issued and those issued by his predecessors.', 'To the air force permanent in 1928.', 'Unable to return home immediately, Chrisye became distracted by thoughts of his family and began to find playing difficult.']
[0, 0, 0, 1, 0, 1, 0, 1, 0, 1]
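Because `random.shuffle` gives a different order on every run, the examples printed above will change between executions. If repeatable runs are wanted, one option is to shuffle with a seeded `random.Random` instance. This is only a sketch: the helper name `shuffle_together`, the seed value, and the toy data are assumptions, not part of the notebook.

```python
import random

def shuffle_together(texts, labels, seed=42):
    """Shuffle two parallel, non-empty lists in unison, reproducibly."""
    rng = random.Random(seed)          # private generator; global state untouched
    combined = list(zip(texts, labels))
    rng.shuffle(combined)
    shuffled_texts, shuffled_labels = zip(*combined)
    return list(shuffled_texts), list(shuffled_labels)

# Toy stand-ins for the notebook's `texts` and `labels` lists.
toy_texts = ["They arrived on 14 May.",
             "To the air force permanent in 1928.",
             "Laffoon vigorously defended the commissions he had issued."]
toy_labels = [1, 0, 1]

shuffled_texts, shuffled_labels = shuffle_together(toy_texts, toy_labels)
print(list(zip(shuffled_texts, shuffled_labels)))
```

Each text stays paired with its label, and calling the helper again with the same seed returns the same order.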


##### Get parts of speech for text string

In [92]:
def textStringToPOSArray(text):
    doc = nlp(text)
    tags = []
    for word in doc:
        tags.append(word.pos_)
    return tags

textStringToPOSArray(texts[3])

['ADV',
 'PUNCT',
 'ADP',
 'PROPN',
 'NUM',
 'PRON',
 'VERB',
 'ADP',
 'NOUN',
 'CONJ',
 'VERB',
 'VERB',
 'PUNCT']

##### Get POS trigrams for a text string

In [99]:
def find_ngrams(input_list, n):
    # Generic n-gram helper: zips n staggered copies of the list.
    return zip(*[input_list[i:] for i in range(n)])

def getPOSTrigramsForTextString(text):
    tags = textStringToPOSArray(text)
    tgrams = list(trigrams(tags))
    return tgrams

print("Text: ", texts[3], labels[3])
getPOSTrigramsForTextString(texts[3])

Text:  However, in February 1969 she collapsed with exhaustion and was hospitalised. 1


[('ADV', 'PUNCT', 'ADP'),
 ('PUNCT', 'ADP', 'PROPN'),
 ('ADP', 'PROPN', 'NUM'),
 ('PROPN', 'NUM', 'PRON'),
 ('NUM', 'PRON', 'VERB'),
 ('PRON', 'VERB', 'ADP'),
 ('VERB', 'ADP', 'NOUN'),
 ('ADP', 'NOUN', 'CONJ'),
 ('NOUN', 'CONJ', 'VERB'),
 ('CONJ', 'VERB', 'VERB'),
 ('VERB', 'VERB', 'PUNCT')]
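The `find_ngrams` helper defined in the cell above is not called there, since the trigrams come from `nltk.util.trigrams`. As a small illustration of what the sliding window does, the sketch below applies the helper to a hand-written tag list (an assumption, not spaCy output). A list of n tags yields n - 2 trigrams, which matches the 13 tags and 11 trigrams shown above.

```python
def find_ngrams(input_list, n):
    # Zip n staggered copies of the list; zip stops at the shortest copy.
    return zip(*[input_list[i:] for i in range(n)])

# Hand-written tags for illustration only.
tags = ['ADV', 'PUNCT', 'ADP', 'PROPN', 'NUM']

tgrams = list(find_ngrams(tags, 3))
print(tgrams)
# [('ADV', 'PUNCT', 'ADP'), ('PUNCT', 'ADP', 'PROPN'), ('ADP', 'PROPN', 'NUM')]
```

A text with fewer than three tokens produces no trigrams at all, so very short fragments contribute nothing to the counts.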

##### Turn Trigrams into Dict keys

In [100]:
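def trigramsToDictKeys(tgrams):
    # Join each POS trigram into one string key, e.g. 'ADP>DET>NOUN'.
    # (The parameter is named tgrams so it does not shadow nltk's trigrams.)
    return ['>'.join(trigram) for trigram in tgrams]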

print(texts[2])
print(trigramsToDictKeys(getPOSTrigramsForTextString(texts[2])))

After the war Flamininus visited the Nemean Games in Argos and proclaimed the polis free.
['ADP>DET>NOUN', 'DET>NOUN>PROPN', 'NOUN>PROPN>VERB', 'PROPN>VERB>DET', 'VERB>DET>PROPN', 'DET>PROPN>PROPN', 'PROPN>PROPN>ADP', 'PROPN>ADP>PROPN', 'ADP>PROPN>CONJ', 'PROPN>CONJ>VERB', 'CONJ>VERB>DET', 'VERB>DET>NOUN', 'DET>NOUN>ADJ', 'NOUN>ADJ>PUNCT']


In [None]:
from collections import Counter

c = Counter()

for textString in texts:
    c.update(trigramsToDictKeys(getPOSTrigramsForTextString(textString)))

total_counts = c

print("Total unique POS trigrams in data set: ", len(total_counts))

In [None]:
vocab = sorted(total_counts, key=total_counts.get, reverse=True)[:10000]
print(vocab[:60])
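The counting and vocabulary steps can be tried on a toy input without running spaCy over the full corpus. The sketch below uses hand-written trigram keys (an assumption) and `Counter.most_common`, which returns the keys ordered from most to least frequent and is an alternative to sorting by `total_counts.get`. The name `vocab_size` is hypothetical; the notebook uses a cutoff of 10000.

```python
from collections import Counter

# Hand-written trigram keys for three toy texts (illustration only).
toy_keys = [
    ['ADP>DET>NOUN', 'DET>NOUN>VERB'],
    ['ADP>DET>NOUN', 'DET>NOUN>PUNCT'],
    ['ADP>DET>NOUN', 'DET>NOUN>VERB'],
]

c = Counter()
for keys in toy_keys:
    c.update(keys)                     # add one count per key occurrence

print("Unique trigram keys:", len(c))

vocab_size = 2                         # hypothetical cutoff for the toy data
vocab = [key for key, _ in c.most_common(vocab_size)]
print(vocab)
```

With only 17 or so coarse POS tags, the number of distinct trigrams is bounded by the cube of the tag count, so it is worth checking `len(total_counts)` to see whether the 10000 cutoff actually removes anything.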