Title: 3. Tweet2Bible - Comparing Similarity Measures
Tags: initial_model
Authors: Ben Hoyle
Summary: This post looks at some approaches for matching tweets to Bible passages.

# 3. Tweet2Bible - Comparing Similarity Measures

Now we have our data we can look at some matching.

To start we will look at a number of off-the-shelf similarity functions. We will then compare these subjectively and see what gets us the best matches.

Note the docker-machine virtual machine only has one CPU and 1GB RAM - we need to run docker without the VM...

## Similarity Functions

Here are some initial similarity functions we can look at:

* [Difflib's SequenceMatcher](https://docs.python.org/3/library/difflib.html) has a "ratio" function that provides a match score for two strings. This represents a "naive" baseline.
* We can use spaCy's ["similarity" method](https://spacy.io/usage/vectors-similarity) on "doc" objects (i.e. as applied to each string).
* We can apply the techniques set available in Gensim as set out in [this helpful tutorial](https://radimrehurek.com/gensim/tut3.html).

We can then use the results as a baseline for more complex models and algorithms.

We will also time how long each method takes.

### Load Data

In [1]:
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

In [2]:
import pickle
with open("processed_data.pkl", 'rb') as f:
    tweets, bible_data = pickle.load(f)

In [3]:
print("We have {0} tweets.".format(len(tweets)))
print("We have {0} Bible passages.".format(len(bible_data)))

We have 9806 tweets.
We have 31102 Bible passages.


### Difflib SequenceMatcher

In [6]:
from difflib import SequenceMatcher

def similar(a, b):
    """Get a similarity metric for strings a and b"""
    return SequenceMatcher(None, a, b).ratio()

def get_matches(tweet, bible_data):
    """Match a tweet against the bible_data."""
    # Get matches
    scores = [
        (verse, passage, similar(tweet, passage)) 
        for verse, passage in bible_data
    ]
    # Sort by descending score
    scores.sort(key=lambda tup: tup[2], reverse = True) 
    return scores

def test_random_tweets(tweets, bible_data, n=5, k=5):
    """Print n examples for k tweets selected at random."""
    import random
    num_tweets = len(tweets)
    indices = random.sample(range(0, num_tweets), k)
    for i in indices:
        tweet = tweets[i]
        print("-----------------")
        print("Tweet text: {}".format(tweet))
        scores = get_matches(tweet, bible_data)
        for verse, passage, score in scores[0:n]:
            print("\n{0}, {1}, {2}".format(verse, passage, score))

In [7]:
test_random_tweets(tweets, bible_data)

-----------------
Tweet text: "In addition, along with the advance of the electronic information society, a variety of electronic devices are utilized." #thetimeswelivein

Mark 8:19, When I broke the five loaves among the five thousand, how many baskets full of broken pieces did you take up? They told him, Twelve., 0.4338235294117647

Exodus 6:16, These are the names of the sons of Levi according to their generations: Gershon, and Kohath, and Merari; and the years of the life of Levi were one hundred thirty-seven years., 0.43174603174603177

Numbers 3:21, Of Gershon was the family of the Libnites, and the family of the Shimeites: these are the families of the Gershonites., 0.4186046511627907

Job 4:10, The roaring of the lion, and the voice of the fierce lion, the teeth of the young lions, are broken., 0.4166666666666667

2 Corinthians 3:9, For if the service of condemnation has glory, the service of righteousness exceeds much more in glory., 0.4132231404958678
-----------------
Tweet 

### spaCy String Similarity

The 'en_core_web_lg' file crashed my Jupyter kernel but the 'en_core_web_sm' file loaded okay. I'll try the medium-sized file 'en_core_web_md'. Yes - 'md' file loaded okay.

In [4]:
!python3 -m spacy download en_core_web_md

Collecting https://github.com/explosion/spacy-models/releases/download/en_core_web_md-2.0.0/en_core_web_md-2.0.0.tar.gz
[?25l  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_md-2.0.0/en_core_web_md-2.0.0.tar.gz (120.8MB)
[K    100% |################################| 120.9MB 7.0MB/s ta 0:00:011 0% |                                | 1.2MB 2.2MB/s eta 0:00:56    2% |                                | 2.8MB 4.0MB/s eta 0:00:30    2% |                                | 3.3MB 3.9MB/s eta 0:00:31    3% |#                               | 4.1MB 5.6MB/s eta 0:00:21    3% |#                               | 4.4MB 3.7MB/s eta 0:00:32    5% |#                               | 6.8MB 7.2MB/s eta 0:00:16    8% |##                              | 9.8MB 6.0MB/s eta 0:00:19    8% |##                              | 10.8MB 6.1MB/s eta 0:00:19    9% |##                              | 11.2MB 6.3MB/s eta 0:00:18    9% |###                             | 11.5MB 5.1MB/s eta 0:00

In [8]:
import spacy

nlp = spacy.load('en_core_web_md')

In [8]:
def similar(a, b):
    """Get a similarity metric for strings a and b"""
    spacy_a = nlp(a)
    spacy_b = nlp(b)
    return spacy_a.similarity(spacy_b)

In [11]:
test_random_tweets(tweets, bible_data)

-----------------
Tweet text: Next-Gen Bluetooth Bulb Controllable Via Smartphone | Freshome http://t.co/5oEjvL6c [A great patented idea - get it to market!]

Nehemiah 3:32, Between the ascent of the corner and the sheep gate repaired the goldsmiths and the merchants., 0.38009049773755654

Jeremiah 27:5, I have made the earth, the men and the animals that are on the surface of the earth, by my great power and by my outstretched arm; and I give it to whom it seems right to me., 0.3588039867109635

1 Corinthians 10:7, Neither be idolaters, as some of them were. As it is written, The people sat down to eat and drink, and rose up to play., 0.3562753036437247

Acts 2:20, The sun will be turned into darkness, and the moon into blood, before the great and glorious day of the Lord comes., 0.35537190082644626

Matthew 18:4, Whoever therefore humbles himself as this little child, the same is the greatest in the Kingdom of Heaven., 0.351931330472103
-----------------
Tweet text: Ask Yourself: Are

### Gensim

Gensim needs a little bit of pre-processing to convert our texts into vector form. We need to get a bag of words that represents each portion of text.

First we need to tokenise our text. We can use spaCy or NLTK to do this. (The method above involves generating a spaCy doc for each Bible passage - we can maybe do this once and then use elsewhere.)

Then we filter the text and convert it into a vector form.

The procedure below mirrors the [Gensim tutorial](https://radimrehurek.com/gensim/tut1.html).

In [None]:
# This took quite a long time so I might go for the quicker word_tokenize from nltk
# spacy_bible = [(verse, nlp(passage)) for verse, passage in bible_data]

In [9]:
from nltk import word_tokenize
tokenised = [(verse, word_tokenize(passage)) for verse, passage in bible_data]

In [10]:
tokenised[3]

('Genesis 1:4',
 ['God',
  'saw',
  'the',
  'light',
  ',',
  'and',
  'saw',
  'that',
  'it',
  'was',
  'good',
  '.',
  'God',
  'divided',
  'the',
  'light',
  'from',
  'the',
  'darkness',
  '.'])

In [11]:
def process_words(tokens):
    """ Remove digits and punctuation from text and convert to lower case. """
    # Alternative for complete text is re.sub('\W+', '', text)
    return [w.lower() for w in tokens if w.isalpha()]

In [12]:
tokenised = [(verse, process_words(tokens)) for verse, tokens in tokenised]

In [13]:
tokenised[3]

('Genesis 1:4',
 ['god',
  'saw',
  'the',
  'light',
  'and',
  'saw',
  'that',
  'it',
  'was',
  'good',
  'god',
  'divided',
  'the',
  'light',
  'from',
  'the',
  'darkness'])

In [14]:
texts = [tokens for _, tokens in tokenised]

In [15]:
# Import NLTK modules
from nltk import word_tokenize
from nltk.corpus import stopwords
# Load stopwords
ENG_STOPWORDS = stopwords.words('english')

def text_preprocessing(original_text):
    """Clean and process texts for Gensim methods.""" 
    # Tokenise
    tokenised = word_tokenize(original_text) 
    
    # Convert to lowercase and remove non-text / stopwords
    tokenised = [w.lower() for w in tokenised if (w.isalpha() and w not in ENG_STOPWORDS)]
    return tokenised

In [16]:
text_preprocessing(bible_data[3][1])

['god', 'saw', 'light', 'saw', 'good', 'god', 'divided', 'light', 'darkness']

In [17]:
texts = [text_preprocessing(passage) for _, passage in bible_data]

In [18]:
texts[5]

['god',
 'said',
 'let',
 'expanse',
 'middle',
 'waters',
 'let',
 'divide',
 'waters',
 'waters']

In [32]:
# Create a dictionary from our processed bible texts

from gensim import corpora

# Create a dictionary that maps numbers to words
dictionary = corpora.Dictionary(texts)
# Save dictionary
dictionary.save('bible.dict')
print(dictionary)

2018-06-21 12:50:20,527 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2018-06-21 12:50:20,799 : INFO : adding document #10000 to Dictionary(6375 unique tokens: ['beginning', 'created', 'earth', 'god', 'heavens']...)
2018-06-21 12:50:21,016 : INFO : adding document #20000 to Dictionary(9840 unique tokens: ['beginning', 'created', 'earth', 'god', 'heavens']...)
2018-06-21 12:50:21,242 : INFO : adding document #30000 to Dictionary(12041 unique tokens: ['beginning', 'created', 'earth', 'god', 'heavens']...)
2018-06-21 12:50:21,272 : INFO : built Dictionary(12255 unique tokens: ['beginning', 'created', 'earth', 'god', 'heavens']...) from 31102 documents (total 370556 corpus positions)
2018-06-21 12:50:21,276 : INFO : saving Dictionary object under bible.dict, separately None
2018-06-21 12:50:21,288 : INFO : saved bible.dict


Dictionary(12255 unique tokens: ['beginning', 'created', 'earth', 'god', 'heavens']...)


In [33]:
corpus = [dictionary.doc2bow(text) for text in texts]
# Save corpus for later
corpora.MmCorpus.serialize('bible.mm', corpus)
print(corpus)

2018-06-21 12:50:26,077 : INFO : storing corpus in Matrix Market format to bible.mm
2018-06-21 12:50:26,082 : INFO : saving sparse matrix to bible.mm
2018-06-21 12:50:26,085 : INFO : PROGRESS: saving document #0
2018-06-21 12:50:26,122 : INFO : PROGRESS: saving document #1000
2018-06-21 12:50:26,162 : INFO : PROGRESS: saving document #2000
2018-06-21 12:50:26,199 : INFO : PROGRESS: saving document #3000
2018-06-21 12:50:26,237 : INFO : PROGRESS: saving document #4000
2018-06-21 12:50:26,274 : INFO : PROGRESS: saving document #5000
2018-06-21 12:50:26,313 : INFO : PROGRESS: saving document #6000
2018-06-21 12:50:26,350 : INFO : PROGRESS: saving document #7000
2018-06-21 12:50:26,391 : INFO : PROGRESS: saving document #8000
2018-06-21 12:50:26,429 : INFO : PROGRESS: saving document #9000
2018-06-21 12:50:26,470 : INFO : PROGRESS: saving document #10000
2018-06-21 12:50:26,504 : INFO : PROGRESS: saving document #11000
2018-06-21 12:50:26,549 : INFO : PROGRESS: saving document #12000
2018-

On a first run of this we note that most topics are defined by common stopwords. Let's get rid of these.

In [35]:
from gensim import models, similarities
# We'll start with LSI and a 100D vector
lsi = models.LsiModel(corpus, id2word=dictionary, num_topics=100)

2018-06-21 12:51:15,457 : INFO : using serial LSI version on this node
2018-06-21 12:51:15,458 : INFO : updating model with new documents
2018-06-21 12:51:15,460 : INFO : preparing a new chunk of documents
2018-06-21 12:51:15,620 : INFO : using 100 extra samples and 2 power iterations
2018-06-21 12:51:15,626 : INFO : 1st phase: constructing (12255, 200) action matrix
2018-06-21 12:51:15,866 : INFO : orthonormalizing (12255, 200) action matrix
2018-06-21 12:51:16,872 : INFO : 2nd phase: running dense svd on (200, 20000) matrix
2018-06-21 12:51:18,142 : INFO : computing the final decomposition
2018-06-21 12:51:18,146 : INFO : keeping 100 factors (discarding 18.228% of energy spectrum)
2018-06-21 12:51:18,326 : INFO : processed documents up to #20000
2018-06-21 12:51:18,329 : INFO : topic #0(128.237): 0.579*"shall" + 0.450*"yahweh" + 0.419*"i" + 0.176*"said" + 0.173*"god" + 0.131*"israel" + 0.108*"king" + 0.106*"the" + 0.087*"he" + 0.085*"house"
2018-06-21 12:51:18,333 : INFO : topic #1(9

In [36]:
# Create index
index = similarities.MatrixSimilarity(lsi[corpus])
index.save('bible.index')

2018-06-21 12:53:39,874 : INFO : creating matrix with 31102 documents and 100 features
2018-06-21 12:53:43,628 : INFO : saving MatrixSimilarity object under bible.index, separately None
2018-06-21 12:53:43,786 : INFO : saved bible.index


In [37]:
def text2vec(text, dictionary, lsi):
    """Convert a portion of text to an LSI vector."""
    processed = text_preprocessing(text)
    vec_bow = dictionary.doc2bow(processed)
    vec_lsi = lsi[vec_bow] # convert the query to LSI space
    return vec_lsi

In [39]:
vec_lsi = text2vec(tweets[5], dictionary, lsi)

In [47]:
sims = index[vec_lsi] # perform a similarity query against the corpus
print(list(enumerate(sims))[0:5]) # print (document_number, document_similarity) 2-tuples

[(0, 0.026685458), (1, 0.010693545), (2, 0.02526992), (3, 0.018537477), (4, -0.0064496454)]


In [48]:
sims_sorted = sorted(enumerate(sims), key=lambda item: -item[1])
print(sims_sorted[0:5]) # print sorted (document number, similarity score) 2-tuples

[(14471, 0.98052335), (16335, 0.78113711), (5388, 0.75273919), (17916, 0.74159586), (26053, 0.73570257)]


In [49]:
bible_data[sims_sorted[0][0]]

('Psalm 37:21',
 "The wicked borrow, and don't pay back, but the righteous give generously.")

In [46]:
tweets[5]

'“The ungovernable metropolis, with its fluid population and ethnic and occupational enclaves, is an affront to a mindset that envisions a world of harmony, purity, and organic wholeness.” - to thrive you need to give up unattainable perfection and unquestioning agreement'

Now let's fold all this into a function.

In [60]:
zipped = [(p, v, s) for (p, v), s in zip(bible_data, sims)]

In [61]:
zipped[14471]

('Psalm 37:21',
 "The wicked borrow, and don't pay back, but the righteous give generously.",
 0.98052335)

In [19]:
# Import gensim modules
from gensim import corpora, models, similarities

# Import NLTK modules
from nltk import word_tokenize
from nltk.corpus import stopwords
# Load stopwords
ENG_STOPWORDS = stopwords.words('english')

def text_preprocessing(original_text):
    """Clean and process texts for Gensim methods.""" 
    # Tokenise
    tokenised = word_tokenize(original_text) 
    
    # Convert to lowercase and remove non-text / stopwords
    tokenised = [w.lower() for w in tokenised if (w.isalpha() and w not in ENG_STOPWORDS)]
    return tokenised

def text2vec(text, dictionary, lsi):
    """Convert a portion of text to an LSI vector."""
    processed = text_preprocessing(text)
    vec_bow = dictionary.doc2bow(processed)
    vec_lsi = lsi[vec_bow] # convert the query to LSI space
    return vec_lsi

def build_data(tweets, bible_data):
    """Generate variables for matching."""
    # Process text
    texts = [text_preprocessing(passage) for _, passage in bible_data]
    # Build dictionary
    dictionary = corpora.Dictionary(texts)
    # Convert bible data to corpus
    corpus = [dictionary.doc2bow(text) for text in texts]
    lsi = models.LsiModel(corpus, id2word=dictionary, num_topics=100)
    # Create index
    index = similarities.MatrixSimilarity(lsi[corpus])
    # Save all of these
    dictionary.save('bible.dict')
    corpora.MmCorpus.serialize('bible.mm', corpus)
    lsi.save('bible.lsi')
    index.save('bible.index')
    return dictionary, corpus, lsi, index

def get_matches(tweet, bible_data, dictionary, lsi, index):
    """Match a tweet against the bible_data."""
    # To run this we need dictionary, lsi, and index variables
    # Get matches
    vec_lsi = text2vec(tweet, dictionary, lsi)
    sims = index[vec_lsi] # perform a similarity query against the corpus
    scores = [(p, v, s) for (p, v), s in zip(bible_data, sims)]
    # Sort by descending score
    scores.sort(key=lambda tup: tup[2], reverse = True) 
    return scores

def test_random_tweets(tweets, bible_data, n=5, k=5):
    """Print n examples for k tweets selected at random."""
    try:
        dictionary = corpora.Dictionary.load('bible.dict')
        corpus = corpora.MmCorpus('bible.mm')
        lsi = models.LsiModel.load('bible.lsi')
        index = similarities.MatrixSimilarity.load('bible.index')
    except FileNotFoundError:
        dictionary, corpus, lsi, index = build_data(tweets, bible_data)
        
    import random
    num_tweets = len(tweets)
    indices = random.sample(range(0, num_tweets), k)
    for i in indices:
        tweet = tweets[i]
        print("-----------------")
        print("Tweet text: {}".format(tweet))
        scores = get_matches(tweet, bible_data, dictionary, lsi, index)
        for verse, passage, score in scores[0:n]:
            print("\n{0}, {1}, {2}".format(verse, passage, score))

2018-06-23 08:15:34,349 : INFO : 'pattern' package not found; tag filters are not available for English


In [5]:
test_random_tweets(tweets, bible_data)

2018-06-21 13:41:59,527 : INFO : loading Dictionary object from bible.dict
2018-06-21 13:41:59,541 : INFO : loaded bible.dict
2018-06-21 13:41:59,552 : INFO : loaded corpus index from bible.mm.index
2018-06-21 13:41:59,557 : INFO : initializing cython corpus reader from bible.mm
2018-06-21 13:41:59,562 : INFO : accepted corpus with 31102 documents, 12255 features, 339121 non-zero entries
2018-06-21 13:41:59,564 : INFO : loading LsiModel object from bible.lsi
2018-06-21 13:42:09,193 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2018-06-21 13:42:09,432 : INFO : adding document #10000 to Dictionary(6375 unique tokens: ['beginning', 'created', 'earth', 'god', 'heavens']...)
2018-06-21 13:42:09,651 : INFO : adding document #20000 to Dictionary(9840 unique tokens: ['beginning', 'created', 'earth', 'god', 'heavens']...)
2018-06-21 13:42:09,875 : INFO : adding document #30000 to Dictionary(12041 unique tokens: ['beginning', 'created', 'earth', 'god', 'heavens']...)
2018-06-21 

-----------------
Tweet text: @sustrans Kids to Newbridge primary in Bath have river cycle path close - but no safe way to travel 200m up hill & across A-road or to path

Genesis 49:17, Dan will be a serpent in the way, an adder in the path, That bites the horse's heels, so that his rider falls backward., 0.9940089583396912

Psalm 80:12, Why have you broken down its walls, so that all those who pass by the way pluck it?, 0.9881158471107483

Proverbs 13:6, Righteousness guards the way of integrity, but wickedness overthrows the sinner., 0.9859569668769836

Genesis 35:19, Rachel died, and was buried in the way to Ephrath (the same is Bethlehem)., 0.9790574312210083

Ezekiel 12:5, Dig through the wall in their sight, and carry your stuff out that way., 0.9741089344024658
-----------------
Tweet text: Stand-Up Comics Have to Censor Their Jokes on (US) College Campuses - The Atlantic http://t.co/W998v8oahs

Lamentations 5:16, The crown is fallen from our head: Woe to us! for we have sinned.

### Comparing Approaches

To compare the approaches, let's generate 200 random examples and take the top match for each of the three techniques. We will then manually score each match on a scale of 0 to 5 where 0 = no match and 5 = perfect match. Then we can see which technique comes up on top.

The easiest way to quickly compare the results is to export to a spreadsheet, with columns for the scores of each.

In [4]:
def get_difflib_matches(tweet, bible_data):
    """Match a tweet against the bible_data."""
    # Get matches
    scores = [
        (verse, passage, SequenceMatcher(None, tweet, passage).ratio()) 
        for verse, passage in bible_data
    ]
    # Sort by descending score
    scores.sort(key=lambda tup: tup[2], reverse = True) 
    return scores

def get_spacy_matches(spacy_tweet, spacy_bible):
    """Perform matches on text as spacy docs"""
    # Get matches
    scores = [
        (verse, passage, spacy_tweet.similarity(passage)) 
        for verse, passage in spacy_bible
    ]
    # Sort by descending score
    scores.sort(key=lambda tup: tup[2], reverse = True) 
    return scores

def get_gensim_matches(tweet, bible_data, dictionary, lsi, index):
    """Match a tweet against the bible_data."""
    # To run this we need dictionary, lsi, and index variables
    # Get matches
    vec_lsi = text2vec(tweet, dictionary, lsi)
    sims = index[vec_lsi] # perform a similarity query against the corpus
    scores = [(v, p, s) for (v, p), s in zip(bible_data, sims)]
    # Sort by descending score
    scores.sort(key=lambda tup: tup[2], reverse = True) 
    return scores

In [20]:
spacy_bible = [(verse, nlp(passage)) for verse, passage in bible_data]

In [21]:
dictionary, corpus, lsi, index = build_data(tweets, bible_data)

2018-06-23 09:02:25,811 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2018-06-23 09:02:26,052 : INFO : adding document #10000 to Dictionary(6375 unique tokens: ['beginning', 'created', 'earth', 'god', 'heavens']...)
2018-06-23 09:02:26,272 : INFO : adding document #20000 to Dictionary(9840 unique tokens: ['beginning', 'created', 'earth', 'god', 'heavens']...)
2018-06-23 09:02:26,491 : INFO : adding document #30000 to Dictionary(12041 unique tokens: ['beginning', 'created', 'earth', 'god', 'heavens']...)
2018-06-23 09:02:26,516 : INFO : built Dictionary(12255 unique tokens: ['beginning', 'created', 'earth', 'god', 'heavens']...) from 31102 documents (total 370556 corpus positions)
2018-06-23 09:02:26,953 : INFO : using serial LSI version on this node
2018-06-23 09:02:26,954 : INFO : updating model with new documents
2018-06-23 09:02:26,955 : INFO : preparing a new chunk of documents
2018-06-23 09:02:27,111 : INFO : using 100 extra samples and 2 power iterations
2018-06-

In [27]:
import random

rows = []
k = 200
num_tweets = len(tweets)
indices = random.sample(range(0, num_tweets), k)
for i in indices:
    print(i)
    tweet = tweets[i]
    dl_match = get_difflib_matches(tweet, bible_data)[0]
    spacy_tweet = nlp(tweet)
    sp_match = get_spacy_matches(spacy_tweet, spacy_bible)[0]
    gs_match = get_gensim_matches(tweet, bible_data, dictionary, lsi, index)[0]
    comparison_row = (
        tweet, 
        dl_match[1], dl_match[2], "", 
        sp_match[1].text, sp_match[2], "", 
        gs_match[1], gs_match[2], ""
    )
    rows.append(comparison_row)

805
6338
1954
187
643
7804
6135
164
6627
8612
4862
2978
9167
6449
3845
1993
3365
5952
941
313
314
9059
8760
1841
5401
2997
8512
5086
6803
2831
2655
9023
3000
476
2465
783
2912
7930
7052
4218
707
6193
3530
3201
3774
5127
9574
2833
9105
8807
2772
8165
7355
2892
9623
8057
6258
9365
6321
6285
971
6196
7150
3677
8989
4857
5877
4966
3596
4438
4617
526
1084
2951
9232
2905
7975
4409
9726
4311
6729
8826
7963
8821
140
1964
3994
8583
2130
374
7358
1483
6824
3229
6959
7958
8653
2224
3041
4267
3366
2482
7479
6039
5755
3522
3747
2297
7627
7571
4365
4298
8474
8648
120
6467
4933
1853
1780
7857
8309
493
3624
8162
1093
4801
8481
3711
9250
9513
6900
2795
9638
9596
1251
5128
9748
7428
7772
4540
1859
1783
2143
9208
55
977
7828
8427
5265
6371
9628
8436
8064
2037
3148
6370
8570
978
8671
107
3996
4878
9688
4768
7048
3180
1938
5662
6360
1298
8129
384
5868
8724
4077
7595
1265
5731
5376
626
9718
3743
5387
3120
8677
2743
8999
1261
5010
227
3014
31
4337
1005
2808
3607
8593
2686
7139
5767


In [26]:
dl_match[1]

'that I may reveal it as I ought to speak.'

In [28]:
import pickle
with open("rows.pkl", 'wb') as f:
    pickle.dump(rows, f)

In [53]:
import pandas as pd

row_df = pd.DataFrame(rows)

In [58]:
del rows

In [54]:
row_df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9
0,Also: if your patent specification does not co...,"Therefore you need to be in subjection, not on...",0.420168,,"But whatever has a blemish, that you shall not...",0.921702,,"There is one body, and one Spirit, even as you...",0.836346,
1,The CImg Library - C++ Template Image Processi...,"the people which I formed for myself, that the...",0.374384,,For the house he made windows of fixed lattice...,0.736559,,"The sound of a cry from Horonaim, desolation a...",0.971526,
2,Criteria for boilerplate legal text: it should...,He began to build in the second [day] of the s...,0.447368,,It's also written in your law that the testimo...,0.923304,,"the two pillars, and the two bowls of the capi...",0.711855,
3,Hadn't realised: the difference between deduct...,and the avenger of blood find him outside of t...,0.386946,,However in the assembly I would rather speak f...,0.927348,,"Should he reason with unprofitable talk, or wi...",0.997764,
4,You need to be wrong on certain things in orde...,"No eye pitied you, to do any of these things t...",0.45283,,Be of the same mind one toward another. Don't ...,0.966708,,You are witnesses of these things.,0.823695,


In [61]:
row_df.to_excel("comparison.xls")

ModuleNotFoundError: No module named 'xlwt'