## TFLearn Fragment Detection

Catherine has prepared data files in which sentences have been turned into fragments. As input I will use about 60,000 fragments and 60,000 sentences. For now, every fragment is derived from one of the input sentences; in the future, the fragments will not be descendants of the input sentences. Each label is either 1 or 0, where 1 indicates a sentence and 0 indicates a fragment.

#### Import Dependencies

In [104]:
import pandas as pd
import numpy as np
import tensorflow as tf
import tflearn
from tflearn.data_utils import to_categorical
import spacy
nlp = spacy.load('en')
import re
from nltk.util import ngrams, trigrams

#### Load Datafiles

In [105]:
texts = []
labels = []

# Each line is split on " ||| ": field 0 is the full sentence,
# field 2 is the fragment derived from it.
categories = ["conjunctionSentences", "nounSentences", "nounverbSentences", "verbSentences"]

for category in categories:
    path = "./removingPOS/updatedSentences/" + category + "/detailedRemoval.txt"
    with open(path, "r") as f:
        for line in f:
            asArray = line.split(" ||| ")
            fragment = asArray[2].strip()
            # Tidy punctuation left behind by the removal: " ." and ",." become "."
            fragment = re.sub(r" \.", ".", fragment)
            fragment = re.sub(r",\.", ".", fragment)
            texts.append(fragment.capitalize())  # label 0: fragment
            labels.append(0)
            texts.append(asArray[0].strip())     # label 1: sentence
            labels.append(1)

print(texts[-10:])

['With 92% of dawson creek residents canadian-born, and 93% speaking only english, the city has few visible minorities.', 'With 92% of Dawson Creek residents being Canadian-born, and 93% speaking only English, the city has few visible minorities.', 'By the end of the year, the texians all mexican troops from texas.', 'By the end of the year, the Texians had driven all Mexican troops from Texas.', 'In northern manitoba, quartz to make arrowheads.', 'In Northern Manitoba, quartz was mined to make arrowheads.', 'There significant fictionalisation, however.', 'There was significant fictionalisation, however.', "Extremeolation from society and community also apparent in crane's work.", "Extreme isolation from society and community is also apparent in Crane's work."]
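One detail is visible in the output above: `str.capitalize()` uppercases the first character and lowercases everything after it, which is why the fragments read "dawson creek" while the source sentences keep "Dawson Creek". If that casing difference is not wanted as an extra signal, a minimal sketch of an alternative could look like the following. The helper name `capitalize_first` is hypothetical and not part of this notebook.

```python
def capitalize_first(text):
    """Uppercase only the first character and leave the rest untouched."""
    return text[:1].upper() + text[1:]

fragment = "with 92% of Dawson Creek residents Canadian-born."

# Built-in behaviour: the rest of the string is lowercased
print(fragment.capitalize())

# Hypothetical alternative: proper nouns keep their casing
print(capitalize_first(fragment))
```

Whether this matters depends on how the tagger treats lowercased proper nouns, which has not been measured here.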


##### Shuffle the data

In [106]:
import random

combined = list(zip(texts,labels))
random.shuffle(combined)

texts[:], labels[:] = zip(*combined)
print(texts[-10:])
print(labels[-10:])

['After graduating in 1970, he enrolled at the University of Kentucky.', 'Glacier is distant from major cities.', 'Consequently, the remained in force.', 'His grandfather, james whatman bosanquet, a banker and achieved distinction as a biblical historian.', 'During the trip, he consulted with many prominent politicians, Adair among them, about the possibility of wresting Mexico from Spain.', 'However, the new archbishop began to look for someone outside the Salzburg court to be his new Kapellmeister.', 'Three years.', 'He was one of five southern governors.', 'Born in Harvey, Western Australia, Becher entered the Royal Australian Naval College in 1922.', 'The townsite was named Medford in December.']
[1, 1, 0, 0, 1, 1, 0, 0, 1, 1]
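The zip, shuffle, unzip pattern above keeps each text paired with its label. A small sketch on toy data shows the same idea with a seeded generator so that the order is reproducible between runs. The seed and the `demo_*` names are assumptions for illustration; the notebook itself does not seed the shuffle.

```python
import random

# Toy data standing in for the notebook's `texts` and `labels`
demo_texts = ["He left early.", "Left early.", "She saw a fish.", "While swimming."]
demo_labels = [1, 0, 1, 0]

rng = random.Random(42)  # fixed seed: an assumption, not used in the notebook
combined = list(zip(demo_texts, demo_labels))
rng.shuffle(combined)

# Unzip back into two parallel lists
shuffled_texts, shuffled_labels = (list(column) for column in zip(*combined))
print(list(zip(shuffled_texts, shuffled_labels)))
```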


##### Get parts of speech for text string

In [107]:
def textStringToPOSArray(text):
    doc = nlp(text)
    tags = []
    for word in doc:
        tags.append(word.pos_)
    return tags

textStringToPOSArray(texts[3])

['PROPN',
 'VERB',
 'ADP',
 'DET',
 'NOUN',
 'PUNCT',
 'ADV',
 'PRON',
 'VERB',
 'VERB',
 'CONJ',
 'VERB',
 'ADP',
 'ADJ',
 'NOUN',
 'PUNCT']

##### Get POS trigrams for a text string

In [108]:
def find_ngrams(input_list, n):
  return zip(*[input_list[i:] for i in range(n)])

def getPOSTrigramsForTextString(text):
    tags = textStringToPOSArray(text)
    tgrams = list(trigrams(tags))
    return tgrams

print("Text: ", texts[3], labels[3])
getPOSTrigramsForTextString(texts[3])

Text:  Lara returned to the reservation, where he was arrested and charged with public intoxication. 1


[('PROPN', 'VERB', 'ADP'),
 ('VERB', 'ADP', 'DET'),
 ('ADP', 'DET', 'NOUN'),
 ('DET', 'NOUN', 'PUNCT'),
 ('NOUN', 'PUNCT', 'ADV'),
 ('PUNCT', 'ADV', 'PRON'),
 ('ADV', 'PRON', 'VERB'),
 ('PRON', 'VERB', 'VERB'),
 ('VERB', 'VERB', 'CONJ'),
 ('VERB', 'CONJ', 'VERB'),
 ('CONJ', 'VERB', 'ADP'),
 ('VERB', 'ADP', 'ADJ'),
 ('ADP', 'ADJ', 'NOUN'),
 ('ADJ', 'NOUN', 'PUNCT')]
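The `find_ngrams` helper defined in the cell above is not actually called; the cell relies on NLTK's `trigrams` instead. As a sketch of what the sliding window does, here is the same zip-based helper (wrapped in `list` for printing) applied to the first five tags of the example:

```python
def find_ngrams(input_list, n):
    """Slide a window of size n over the list and return the n-grams as tuples."""
    return list(zip(*[input_list[i:] for i in range(n)]))

tags = ['PROPN', 'VERB', 'ADP', 'DET', 'NOUN']
print(find_ngrams(tags, 3))

# A text with fewer than three tokens yields no trigrams at all
print(find_ngrams(['NOUN', 'PUNCT'], 3))
```

The second call illustrates an edge case: very short inputs produce an empty feature set.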

##### Turn Trigrams into Dict keys

In [109]:
def trigramsToDictKeys(trigrams):
    keys = []
    for trigram in trigrams:
        keys.append('>'.join(trigram))
    return keys

print(texts[2])
print(trigramsToDictKeys(getPOSTrigramsForTextString(texts[2])))

The depression persisted for another day while tracking southward, before being absorbed by nearby Tropical Storm Lorena.
['DET>NOUN>VERB', 'NOUN>VERB>ADP', 'VERB>ADP>DET', 'ADP>DET>NOUN', 'DET>NOUN>ADP', 'NOUN>ADP>VERB', 'ADP>VERB>ADV', 'VERB>ADV>PUNCT', 'ADV>PUNCT>ADP', 'PUNCT>ADP>VERB', 'ADP>VERB>VERB', 'VERB>VERB>ADP', 'VERB>ADP>ADJ', 'ADP>ADJ>PROPN', 'ADJ>PROPN>PROPN', 'PROPN>PROPN>PROPN', 'PROPN>PROPN>PUNCT']


In [110]:
from collections import Counter

c = Counter()

for textString in texts:
    c.update(trigramsToDictKeys(getPOSTrigramsForTextString(textString)))

total_counts = c

# Each key is a POS trigram, so this counts distinct trigrams rather than words
print("Total words in data set: ", len(total_counts))

Total words in data set:  2097


In [119]:
vocab = sorted(total_counts, key=total_counts.get, reverse=True)[:1200]
print(vocab[:60])

['ADP>DET>NOUN', 'NOUN>ADP>DET', 'ADJ>NOUN>PUNCT', 'DET>ADJ>NOUN', 'DET>NOUN>ADP', 'ADJ>NOUN>ADP', 'NOUN>ADP>NOUN', 'DET>NOUN>PUNCT', 'ADP>ADJ>NOUN', 'ADP>DET>ADJ', 'VERB>DET>NOUN', 'NOUN>NOUN>PUNCT', 'VERB>ADP>DET', 'DET>NOUN>VERB', 'ADP>NOUN>PUNCT', 'VERB>VERB>ADP', 'NOUN>ADP>ADJ', 'NOUN>VERB>VERB', 'ADJ>NOUN>VERB', 'ADJ>ADJ>NOUN', 'ADJ>NOUN>NOUN', 'NOUN>ADP>PROPN', 'DET>NOUN>NOUN', 'VERB>ADJ>NOUN', 'PROPN>PROPN>PUNCT', 'NOUN>NOUN>ADP', 'VERB>DET>ADJ', 'NOUN>VERB>ADP', 'NOUN>PUNCT>NOUN', 'NOUN>PUNCT>VERB', 'VERB>ADP>NOUN', 'ADP>NOUN>ADP', 'ADP>NUM>PUNCT', 'NOUN>ADP>NUM', 'ADP>DET>PROPN', 'NOUN>PUNCT>CONJ', 'NOUN>CONJ>NOUN', 'ADP>PROPN>PUNCT', 'VERB>NOUN>ADP', 'PROPN>PROPN>PROPN', 'DET>PROPN>PROPN', 'VERB>ADP>PROPN', 'NOUN>NOUN>VERB', 'NUM>NOUN>PUNCT', 'VERB>PART>VERB', 'VERB>ADV>VERB', 'PUNCT>DET>NOUN', 'ADP>PROPN>PROPN', 'VERB>ADP>ADJ', 'NOUN>VERB>DET', 'NOUN>PART>VERB', 'ADP>NUM>NOUN', 'VERB>NOUN>PUNCT', 'NOUN>VERB>ADV', 'ADV>VERB>ADP', 'ADP>NOUN>NOUN', 'NOUN>CONJ>VERB', 'NUM>NOUN>

In [120]:
print(vocab[-1], ': ', total_counts[vocab[-1]])

NOUN>X>X :  27
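The vocabulary keeps the 1,200 most frequent of the 2,097 trigram types, and the last one kept occurs 27 times. One way to judge a cut-off like this is the share of all trigram occurrences that the top k types cover. The sketch below uses made-up toy counts, and `coverage` is a hypothetical helper; on the real data it could be called as `coverage(total_counts, 1200)`.

```python
from collections import Counter

def coverage(counts, k):
    """Fraction of all occurrences accounted for by the k most common keys."""
    kept = sum(count for _, count in counts.most_common(k))
    return kept / sum(counts.values())

# Toy counts for illustration only, not taken from the data set
toy_counts = Counter({"ADP>DET>NOUN": 50, "NOUN>ADP>DET": 30,
                      "DET>ADJ>NOUN": 15, "NOUN>X>X": 5})
print(coverage(toy_counts, 2))
```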


Take the trigrams and index them

In [121]:
# Map each trigram key to its position in the vocabulary
word2idx = {n: i for i, n in enumerate(vocab)}
print(word2idx)

{'ADP>DET>NOUN': 0, 'NOUN>ADP>DET': 1, 'ADJ>NOUN>PUNCT': 2, 'DET>ADJ>NOUN': 3, 'DET>NOUN>ADP': 4, 'ADJ>NOUN>ADP': 5, 'NOUN>ADP>NOUN': 6, 'DET>NOUN>PUNCT': 7, 'ADP>ADJ>NOUN': 8, 'ADP>DET>ADJ': 9, 'VERB>DET>NOUN': 10, 'NOUN>NOUN>PUNCT': 11, 'VERB>ADP>DET': 12, 'DET>NOUN>VERB': 13, 'ADP>NOUN>PUNCT': 14, 'VERB>VERB>ADP': 15, 'NOUN>ADP>ADJ': 16, 'NOUN>VERB>VERB': 17, 'ADJ>NOUN>VERB': 18, 'ADJ>ADJ>NOUN': 19, 'ADJ>NOUN>NOUN': 20, 'NOUN>ADP>PROPN': 21, 'DET>NOUN>NOUN': 22, 'VERB>ADJ>NOUN': 23, 'PROPN>PROPN>PUNCT': 24, 'NOUN>NOUN>ADP': 25, 'VERB>DET>ADJ': 26, 'NOUN>VERB>ADP': 27, 'NOUN>PUNCT>NOUN': 28, 'NOUN>PUNCT>VERB': 29, 'VERB>ADP>NOUN': 30, 'ADP>NOUN>ADP': 31, 'ADP>NUM>PUNCT': 32, 'NOUN>ADP>NUM': 33, 'ADP>DET>PROPN': 34, 'NOUN>PUNCT>CONJ': 35, 'NOUN>CONJ>NOUN': 36, 'ADP>PROPN>PUNCT': 37, 'VERB>NOUN>ADP': 38, 'PROPN>PROPN>PROPN': 39, 'DET>PROPN>PROPN': 40, 'VERB>ADP>PROPN': 41, 'NOUN>NOUN>VERB': 42, 'NUM>NOUN>PUNCT': 43, 'VERB>PART>VERB': 44, 'VERB>ADV>VERB': 45, 'PUNCT>DET>NOUN': 46, 'ADP>

In [123]:
def textToTrigrams(text): 
    return trigramsToDictKeys(getPOSTrigramsForTextString(text))

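def text_to_vector(text):
    # Count how often each vocabulary trigram occurs in the text;
    # trigrams outside the vocabulary are ignored
    wordVector = np.zeros(len(vocab))
    for word in textToTrigrams(text):
        index = word2idx.get(word)
        if index is not None:
            wordVector[index] += 1
    return wordVector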

In [124]:
text_to_vector('The tea is for a party to celebrate '
               'the movie so she has no time for a cake')[:65]

array([ 2.,  1.,  0.,  0.,  2.,  0.,  0.,  0.,  0.,  0.,  2.,  0.,  1.,
        1.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
        0.,  1.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
        0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  1.,  0.,
        0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.])
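To make the counting step concrete without loading spaCy, here is a minimal sketch using plain Python lists and a three-entry toy vocabulary. The names `keys_to_counts`, `toy_vocab` and `toy_key2idx` are hypothetical; the notebook's `text_to_vector` does the same thing with a NumPy array and the full `word2idx`.

```python
def keys_to_counts(keys, key2idx):
    """Count occurrences of each known key; unknown keys are skipped."""
    counts = [0] * len(key2idx)
    for key in keys:
        idx = key2idx.get(key)
        if idx is not None:
            counts[idx] += 1
    return counts

toy_vocab = ["ADP>DET>NOUN", "NOUN>ADP>DET", "DET>NOUN>ADP"]
toy_key2idx = {key: i for i, key in enumerate(toy_vocab)}

keys = ["DET>NOUN>ADP", "NOUN>ADP>DET", "ADP>DET>NOUN",
        "DET>NOUN>PUNCT",  # not in the toy vocabulary, so it is dropped
        "ADP>DET>NOUN"]
print(keys_to_counts(keys, toy_key2idx))
```

Note that the vector records only how many times each trigram occurs, not where in the sentence it occurs.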

In [128]:
word_vectors = np.zeros((len(texts), len(vocab)), dtype=np.int_)
for ii, text in enumerate(texts):
    word_vectors[ii] = text_to_vector(text)

In [129]:
# Printing out the first 5 word vectors
word_vectors[:5, :23]

array([[1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0],
       [2, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
       [1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0],
       [1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 2, 1, 1, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]])

### Chunking the data for TF

In [143]:
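records = len(labels)
train_fraction = 0.9  # share of records used for training; the rest is held out for testing

# Both variables hold the same boundary index: training is [:split], testing is [split:]
train_split, test_split = int(records*train_fraction), int(records*train_fraction)
print(train_split, test_split)
trainX, trainY = word_vectors[:train_split], to_categorical(labels[:train_split], 2)
testX, testY = word_vectors[test_split:], to_categorical(labels[test_split:], 2)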

106133 106133


In [147]:
trainX[-1], trainY[-1]

(array([1, 1, 0, ..., 0, 0, 0]), array([ 0.,  1.]))
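The label `array([ 0.,  1.])` above is the one-hot form of label 1 (a sentence). A plain-Python sketch of that encoding, assuming two classes and using a hypothetical `one_hot` helper rather than tflearn's `to_categorical`:

```python
def one_hot(labels, n_classes):
    """Turn integer labels into rows with a single 1.0 at the label's index."""
    return [[1.0 if i == label else 0.0 for i in range(n_classes)]
            for label in labels]

# 1 = sentence, 0 = fragment
print(one_hot([1, 0], 2))
```

Column 0 therefore marks fragments and column 1 marks sentences, which is the column order the network's softmax output follows as well.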

In [153]:
len(trainY), len(testY), len(trainY) + len(testY)

(106133, 11793, 117926)
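The sizes printed above follow directly from the split arithmetic. A short check, using the record count from the output and a hypothetical `split_index` helper:

```python
def split_index(n_records, train_fraction):
    """Index of the first test record: everything before it is training data."""
    return int(n_records * train_fraction)

records = 117926
split = split_index(records, 0.9)
n_train, n_test = split, records - split
print(n_train, n_test, n_train + n_test)
```

Because `int` truncates, any fractional record goes to the test set.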

### Setting up TF

In [159]:
# Network building
def build_model():
    # Reset the default graph so that rebuilding the model starts from a clean state
    tf.reset_default_graph()
    
    # Two hidden ReLU layers, then a 2-way softmax over (fragment, sentence)
    net = tflearn.input_data([None, len(vocab)])                          # Input
    net = tflearn.fully_connected(net, 200, activation='ReLU')      # Hidden
    net = tflearn.fully_connected(net, 25, activation='ReLU')      # Hidden
    net = tflearn.fully_connected(net, 2, activation='softmax')   # Output
    net = tflearn.regression(net, optimizer='sgd', learning_rate=0.1, loss='categorical_crossentropy')
    model = tflearn.DNN(net)

    return model
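For a sense of scale, the number of trainable parameters in this network can be worked out by hand, assuming each fully connected layer has a weight matrix plus one bias per unit and an input width of 1,200 (the vocabulary size). `dense_params` is a hypothetical helper for the arithmetic, not a tflearn function.

```python
def dense_params(n_in, n_out):
    """Weights (n_in * n_out) plus one bias per output unit."""
    return n_in * n_out + n_out

# input -> 200 -> 25 -> 2, as in build_model()
layer_sizes = [1200, 200, 25, 2]
per_layer = [dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]
print(per_layer, sum(per_layer))
```

Almost all of the parameters sit in the first layer, so the vocabulary cut-off is the main lever on model size.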

### Initialize

In [160]:
model = build_model()

### Training

In [161]:
# Training
model.fit(trainX, trainY, validation_set=0.1, show_metric=True, batch_size=128, n_epoch=50)

Training Step: 37349  | total loss: 0.05092 | time: 9.386s
| SGD | epoch: 050 | loss: 0.05092 - acc: 0.9784 -- iter: 95488/95519
Training Step: 37350  | total loss: 0.05543 | time: 10.501s
| SGD | epoch: 050 | loss: 0.05543 - acc: 0.9767 | val_loss: 0.65664 - val_acc: 0.8425 -- iter: 95519/95519
--
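The step counter in the log can be reproduced from the batch size and the number of training samples shown in the `iter` field. A small arithmetic sketch:

```python
import math

n_train = 95519    # training samples after the validation split, from the log
batch_size = 128
n_epoch = 50

steps_per_epoch = math.ceil(n_train / batch_size)
total_steps = steps_per_epoch * n_epoch
# Size of the final, partial batch in each epoch
last_batch = n_train - (steps_per_epoch - 1) * batch_size
print(steps_per_epoch, total_steps, last_batch)
```

This matches the final `Training Step: 37350` and the jump from 95488 to 95519 in the last step.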


In [162]:
# Testing
# Column 0 holds P(fragment): threshold it at 0.5 and compare with column 0 of the one-hot labels
predictions = (np.array(model.predict(testX))[:,0] >= 0.5).astype(np.int_)
test_accuracy = np.mean(predictions == testY[:,0], axis=0)
print("Test accuracy: ", test_accuracy)

Test accuracy:  0.84227931824
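The accuracy above thresholds the fragment column at 0.5. With two classes this is equivalent, ties aside, to taking the arg max of each prediction row. A plain-Python sketch of that view on toy values follows; the probabilities and labels are invented for illustration, and `argmax` and `accuracy` are hypothetical helpers.

```python
def argmax(values):
    """Index of the largest value in a list."""
    return max(range(len(values)), key=lambda i: values[i])

def accuracy(probs, one_hot_labels):
    """Share of rows where the predicted class equals the labelled class."""
    hits = sum(argmax(p) == argmax(y) for p, y in zip(probs, one_hot_labels))
    return hits / len(probs)

# Columns are [P(fragment), P(sentence)]; toy values only
toy_probs = [[0.9, 0.1], [0.2, 0.8], [0.6, 0.4], [0.3, 0.7]]
toy_labels = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [0.0, 1.0]]
print(accuracy(toy_probs, toy_labels))
```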


### Playground

In [163]:
def test_sentence(sentence):
    # Index 1 is the "sentence" class, so "Positive" below means a complete sentence
    positive_prob = model.predict([text_to_vector(sentence)])[0][1]
    print('Sentence: {}'.format(sentence))
    print('P(positive) = {:.3f} :'.format(positive_prob), 
          'Positive' if positive_prob > 0.5 else 'Negative')
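In this notebook "Positive" stands for "complete sentence" and "Negative" for "fragment". A sketch of a more explicit labelling step, using a hypothetical `describe` helper and three probabilities taken from the playground outputs in this notebook:

```python
def describe(prob_sentence, threshold=0.5):
    """Map P(sentence) to a readable label."""
    return "Sentence" if prob_sentence > threshold else "Fragment"

for p in (0.977, 0.381, 0.000):
    print("P(sentence) = {:.3f} : {}".format(p, describe(p)))
```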

In [164]:
test_sentence("Even though he had the better arguments and was by far the more powerful speaker.")

Sentence: Even though he had the better arguments and was by far the more powerful speaker.
P(positive) = 0.000 : Negative


In [167]:
test_sentence("Even though he had the better arguments and was by far the more powerful speaker, Peter lost the debate.")

Sentence: Even though he had the better arguments and was by far the more powerful speaker, Peter lost the debate.
P(positive) = 1.000 : Positive


In [165]:
test_sentence("Working far into the night in an effort to salvage her little boat.")

Sentence: Working far into the night in an effort to salvage her little boat.
P(positive) = 0.000 : Negative


In [166]:
test_sentence("She was working far into the night in an effort to salvage her little boat.")

Sentence: She was working far into the night in an effort to salvage her little boat.
P(positive) = 0.977 : Positive


In [169]:
test_sentence("The man eating pizza.")

Sentence: The man eating pizza.
P(positive) = 0.381 : Negative


In [171]:
test_sentence("The man eating pizza is overwieght.")

Sentence: The man eating pizza is overwieght.
P(positive) = 0.004 : Negative


In [172]:
test_sentence("While we were swimming at the lake.")

Sentence: While we were swimming at the lake.
P(positive) = 0.010 : Negative


In [173]:
test_sentence("While we were swimming at the lake, we saw a fish.")

Sentence: While we were swimming at the lake, we saw a fish.
P(positive) = 0.982 : Positive


In [174]:
test_sentence("Keep going.")

Sentence: Keep going.
P(positive) = 0.000 : Negative


In [175]:
test_sentence("A time of wonder and amazement")

Sentence: A time of wonder and amazement
P(positive) = 0.000 : Negative


In [176]:
test_sentence("That was a time of wonder and amazement")

Sentence: That was a time of wonder and amazement
P(positive) = 0.012 : Negative


In [177]:
test_sentence("Since she never saw that movie.") 

Sentence: Since she never saw that movie.
P(positive) = 0.006 : Negative


In [178]:
test_sentence("We should invite her, since she never saw that movie.")

Sentence: We should invite her, since she never saw that movie.
P(positive) = 0.999 : Positive
