## Data sampling - sampling of sentences
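This notebook samples 200 short quotes (at most 25 characters each), weighted by their character trigram frequencies. The sampled sentences will be used to create the data set.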

#### Imports

In [2]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from collections import Counter

#### Loading the data

In [3]:
PATH_TO_DATA = "../data/"
data = pd.read_csv(PATH_TO_DATA + "quotes_dataset.csv")


In [4]:
# drop columns 3 to 41, keeping only the first three
data.drop(columns=data.columns[3:42], inplace=True)

In [5]:
data.columns = ['quote', 'author', 'categories']
data.head()

Unnamed: 0,quote,author,categories
0,You've gotta dance like there's nobody watchin...,William W. Purkey,"dance, heaven, hurt, inspirational, life, love..."
1,You know you're in love when you can't fall as...,Dr. Seuss,"attributed-no-source, dreams, love, reality, s..."
2,A friend is someone who knows all about you an...,Elbert Hubbard,"friend, friendship, knowledge, love"
3,Darkness cannot drive out darkness: only light...,"Martin Luther King Jr., A Testament of Hope: T...","darkness, drive-out, hate, inspirational, ligh..."
4,We accept the love we think we deserve.,"Stephen Chbosky, The Perks of Being a Wallflower","inspirational, love"


In [6]:
# keep only the sentences with <= 25 characters
data = data[data.quote.apply(lambda x: len(str(x)) <= 25)]

print("The remaining data set consists of {} quotes.".format(data.shape[0]))

The remaining data set consists of 5424 quotes.
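
The length filter above converts each quote to a string and keeps those with at most 25 characters. As a minimal sketch on made-up data (the `toy` frame and its contents are hypothetical, not taken from `quotes_dataset.csv`), the same mask can also be written with pandas' vectorized string methods. This is an alternative formulation, not the notebook's own code.

```python
import pandas as pd

# Hypothetical toy data, not taken from quotes_dataset.csv
toy = pd.DataFrame({"quote": ["Keep trying", "x" * 26, "Everybody lies"]})

# Mask as written in the notebook: one Python call per row
mask_apply = toy.quote.apply(lambda x: len(str(x)) <= 25)

# Vectorized variant: string length computed by pandas
mask_vectorized = toy.quote.astype(str).str.len() <= 25

short_quotes = toy[mask_vectorized]
print(short_quotes)
```

Both masks agree on this toy data; the 26-character string is the only row that is dropped.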


In [7]:
# create character trigrams for each quote (discarding the white spaces)
no_spaces = data.quote.apply(lambda x: str(x).replace(" ", ""))
data['trigrams'] = no_spaces.apply(lambda s: [s[i:i+3] for i in range(len(s) - 2)])
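
To make the trigram step concrete, here is a small sketch that applies the same windowing to a single quote from the data. The helper name `char_trigrams` is an assumption introduced for illustration; the notebook itself uses an inline lambda.

```python
def char_trigrams(quote):
    """Return all overlapping 3-character windows of `quote`, ignoring spaces."""
    text = str(quote).replace(" ", "")
    return [text[i:i + 3] for i in range(len(text) - 2)]

trigrams = char_trigrams("Love as thou wilt")
print(trigrams)
print(len(trigrams))
```

A string of n non-space characters yields n - 2 trigrams, so the 14 characters of "Loveasthouwilt" produce 12 trigrams, and a string shorter than three characters produces none.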


In [8]:
data

Unnamed: 0,quote,author,categories,trigrams
552,You are my life now.,"Stephenie Meyer, Twilight","bella, edward-cullen, entertainment, love","[You, oua, uar, are, rem, emy, myl, yli, lif, ..."
1156,Sweets to the sweet.,"William Shakespeare, Hamlet","death, love, shakespeare","[Swe, wee, eet, ets, tst, sto, tot, oth, the, ..."
1221,Love makes us such fools.,"Leslye Walton, The Strange and Beautiful Sorro...",love,"[Lov, ove, vem, ema, mak, ake, kes, esu, sus, ..."
1519,Love those you hate you.,"Leo Tolstoy, Anna Karenina","hate, inspirational, love","[Lov, ove, vet, eth, tho, hos, ose, sey, eyo, ..."
1593,Love as thou wilt,"Jacqueline Carey, Kushiel's Chosen",love,"[Lov, ove, vea, eas, ast, sth, tho, hou, ouw, ..."
...,...,...,...,...
498851,My dad died of a stroke.,William Shatner,"Stroke, Died","[Myd, yda, dad, add, ddi, die, ied, edo, dof, ..."
498895,My dad was a preacher.,Matthew Desmond,Preacher,"[Myd, yda, dad, adw, dwa, was, asa, sap, apr, ..."
498907,My dad is an egomaniac.,Sia,Egomaniac,"[Myd, yda, dad, adi, dis, isa, san, ane, neg, ..."
499541,The future is today.,William Osler,Today,"[The, hef, efu, fut, utu, tur, ure, rei, eis, ..."


In [9]:
# list of all trigrams
all_trigrams = data.trigrams.sum()

# create dictionary for the frequencies of the different trigrams
frequencies = Counter(all_trigrams)
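
`data.trigrams.sum()` builds the full list by concatenating the per-quote lists one after another, which works but copies the growing list at each step. As a sketch under the assumption that only the counts are needed, `itertools.chain.from_iterable` can feed the `Counter` directly. The `toy_trigrams` series below is hypothetical.

```python
from collections import Counter
from itertools import chain

import pandas as pd

# Hypothetical toy column of trigram lists
toy_trigrams = pd.Series([["Lov", "ove"], ["Lov", "ovi", "vin"]])

# Concatenation via sum(), as in the cell above
counts_sum = Counter(toy_trigrams.sum())

# Flattening via chain, without building intermediate lists
counts_chain = Counter(chain.from_iterable(toy_trigrams))

print(counts_chain)
```

Both approaches give identical counts; the difference is only in how the intermediate list is built.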

In [10]:
abs_frequencies = sum(frequencies.values())
abs_frequencies

84019

In [11]:
def transform_to_frequencies(trigrams):
    # sum of (1 - relative frequency) over all trigrams of one quote
    total = 0
    for trigram in trigrams:
        total += 1 - (frequencies.get(trigram) / abs_frequencies)
    return total

data['freq_weights'] = data.trigrams.apply(transform_to_frequencies)
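
Each trigram contributes `1 - count / total` to the weight of its quote, so a rare trigram adds almost 1 and a frequent one adds slightly less. The following sketch works through the arithmetic on a made-up corpus; `toy_frequencies`, `toy_total`, and `rarity_weight` are hypothetical names used only for this illustration.

```python
from collections import Counter

# Hypothetical toy corpus of trigrams (not from the real data)
toy_frequencies = Counter(["the", "the", "the", "hef", "efu", "fut"])
toy_total = sum(toy_frequencies.values())  # 6 trigrams in total

def rarity_weight(trigrams):
    """Sum of (1 - relative frequency) over the trigrams of one quote."""
    return sum(1 - toy_frequencies[t] / toy_total for t in trigrams)

common = rarity_weight(["the", "the"])  # 2 * (1 - 3/6) = 1.0
rare = rarity_weight(["hef", "efu"])    # 2 * (1 - 1/6), roughly 1.667
print(common, rare)
```

In the real data the total is 84019, so every term is close to 1 and the weight of a quote is close to its number of trigrams (for example, 13.98 for the 14 trigrams of "You are my life now.").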

In [12]:
data

Unnamed: 0,quote,author,categories,trigrams,freq_weights
552,You are my life now.,"Stephenie Meyer, Twilight","bella, edward-cullen, entertainment, love","[You, oua, uar, are, rem, emy, myl, yli, lif, ...",13.981718
1156,Sweets to the sweet.,"William Shakespeare, Hamlet","death, love, shakespeare","[Swe, wee, eet, ets, tst, sto, tot, oth, the, ...",14.986551
1221,Love makes us such fools.,"Leslye Walton, The Strange and Beautiful Sorro...",love,"[Lov, ove, vem, ema, mak, ake, kes, esu, sus, ...",18.985491
1519,Love those you hate you.,"Leo Tolstoy, Anna Karenina","hate, inspirational, love","[Lov, ove, vet, eth, tho, hos, ose, sey, eyo, ...",17.969650
1593,Love as thou wilt,"Jacqueline Carey, Kushiel's Chosen",love,"[Lov, ove, vea, eas, ast, sth, tho, hou, ouw, ...",11.985277
...,...,...,...,...,...
498851,My dad died of a stroke.,William Shatner,"Stroke, Died","[Myd, yda, dad, add, ddi, die, ied, edo, dof, ...",16.994620
498895,My dad was a preacher.,Matthew Desmond,Preacher,"[Myd, yda, dad, adw, dwa, was, asa, sap, apr, ...",15.985182
498907,My dad is an egomaniac.,Sia,Egomaniac,"[Myd, yda, dad, adi, dis, isa, san, ane, neg, ...",16.988122
499541,The future is today.,William Osler,Today,"[The, hef, efu, fut, utu, tur, ure, rei, eis, ...",14.978767


In [13]:
sample = data.sample(200, replace=False, weights=data.freq_weights, random_state=1)
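
`DataFrame.sample` accepts weights that do not sum to 1 and normalizes them internally, which is why `freq_weights` can be passed as is. The sketch below shows the same call on a hypothetical four-row frame; the values are invented for illustration.

```python
import pandas as pd

# Hypothetical toy frame; the weights need not sum to 1
toy = pd.DataFrame({
    "quote": ["a", "b", "c", "d"],
    "freq_weights": [10.0, 12.0, 14.0, 0.0],
})

# Draw 2 distinct rows; rows with larger weights are more likely to be picked
picked = toy.sample(2, replace=False, weights=toy.freq_weights, random_state=1)

# The same random_state reproduces the same draw
repeat = toy.sample(2, replace=False, weights=toy.freq_weights, random_state=1)

print(picked)
```

With `replace=False` no row is drawn twice, and a row with weight 0 (row `d` here) is never selected.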

In [14]:
sample

Unnamed: 0,quote,author,categories,trigrams,freq_weights
168534,Words can be bridges.,Joydeep Roy-Bhattacharya,words,"[Wor, ord, rds, dsc, sca, can, anb, nbe, beb, ...",15.992514
325402,Carlton was a blade,sharp and hard and well built.,"Kate Harper, His Wayward Ward","[Car, arl, rlt, lto, ton, onw, nwa, was, asa, ...",13.995227
552,You are my life now.,"Stephenie Meyer, Twilight","bella, edward-cullen, entertainment, love","[You, oua, uar, are, rem, emy, myl, yli, lif, ...",13.981718
125348,The Majesty of the Maker!,Lailah Gifty Akita,"christian, inspiring, majesty, maker, wise-words","[The, heM, eMa, Maj, aje, jes, est, sty, tyo, ...",18.983075
65527,Be Good-Do Good-Be One,Kirpal Singh,"guru, kirpal-singh, sant-mat, spiritual-master...","[BeG, eGo, Goo, ood, od-, d-D, -Do, DoG, oGo, ...",16.995906
...,...,...,...,...,...
442367,Nature is neutral.,Adlai E. Stevenson,Neutral,"[Nat, atu, tur, ure, rei, eis, isn, sne, neu, ...",13.985860
229994,Fearless fate!,Lailah Gifty Akita,"faith, fate, fearless","[Fea, ear, arl, rle, les, ess, ssf, sfa, fat, ...",10.988907
418342,Who can refute a sneer?,William Paley,sneer,"[Who, hoc, oca, can, anr, nre, ref, efu, fut, ...",16.993406
281733,I write because I have to,because I wouldn't know what to do with my ha...,Angeline Trevena,"[Iwr, wri, rit, ite, teb, ebe, bec, eca, cau, ...",17.989443


In [17]:
print(sample.quote.iloc[:50]) 

168534        Words can be bridges.
325402          Carlton was a blade
552            You are my life now.
125348    The Majesty of the Maker!
65527        Be Good-Do Good-Be One
37475     We are all poets, really.
86425           Respect your haters
136882                 Daring faith
159157       Discover your destiny.
222396      And darkness will rule.
170581      Life is merely terrible
305789    O me, this place is hell.
94693          Seeked of self love.
405671       Learn how to feel joy.
12688     When you're lost in space
296938      Yes, you are a battery.
168665    And empty words are evil.
230911    Silence rolled at me, in 
58749           Make each day count
93158     Revere the Righteous One.
367179           Pills for sickness
471874     I'm a travel enthusiast.
125955    Life is sacred existence.
309698        Oh captain my captain
404343      The way to do is to be.
411393       Laws die  books never.
34310     Action achieves ambition.
17899             God is my 

In [18]:
# positions 51-99; position 50 is not covered by this slice or the previous one
print(sample.quote.iloc[51:100])

96308         People before Profit.
115668        Hate wound the heart.
202755    Be wise enough to forgive
21790                   Keep trying
240084                Yay, science.
249512       Impossible is nothing.
313789        Create your own path.
39618       You are what you write.
167710     Words don’t get accident
311053    Comments outnumber ideas.
167731           Life is but words.
20251        Every wound is a word.
220839         Love is purely holy.
293167    Aim High and Hit the Mark
210704    Aim deliberately at goal.
436785          I love team sports.
248093            Lycans are human.
413412         Philosophy is doubt.
58036       Time reveals character.
58357       Work to fulfill destiny
371429     One who finishes, lasts.
159358       I walk in my own path.
72624      Love is a divine lullaby
422146           World without end.
137831       Beliefs create reality
341323        Anything from Kipling
327973        Doubt isn't original.
407103        All history is
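
`iloc` slices are half-open, so `iloc[:50]` covers positions 0 to 49 and the next block would start at position 50. The print cells in this part of the notebook start their slices at 51 and 101, which leaves positions 50 and 100 unprinted. A sketch of paging through a series in consecutive blocks without gaps follows; `toy_quotes` and `page_size` are hypothetical stand-ins for `sample.quote` and the block length.

```python
import pandas as pd

# Hypothetical stand-in for sample.quote with 200 entries
toy_quotes = pd.Series([f"quote {i}" for i in range(200)])

page_size = 50

# Consecutive half-open slices: [0, 50), [50, 100), [100, 150), [150, 200)
pages = [toy_quotes.iloc[start:start + page_size]
         for start in range(0, len(toy_quotes), page_size)]

# Show the first block only, to keep the output short
print(pages[0].to_string())
```

Concatenating the pages gives back the full series, so no row is skipped or printed twice.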

In [19]:
# positions 101-149; position 100 is not covered by this slice or the previous one
print(sample.quote.iloc[101:150])

415326              Include me out.
270334      Humility Preceeds Glory
8771        Fall off your own roof.
423386         Toys can be anything
308524            Pain is like love
495800           Alone I'm nothing.
76029                Everybody lies
57289            Time is a teacher.
428407     I have respect for beer.
312737       It took me a lifetime.
26893     See me just as I see you.
344270     The dead must heed them.
343242     Struggle forms character
420705     Trust one who has tried.
319104      The dawn is your enemy.
50937         Reading is my breath.
10537     I don't want to be a tree
12328         Sin is a gravitation.
12805          We are the universe.
110218         Stay united in love.
397709              'Old Tomorrow.'
229135           Grace is goodness.
390391        You call that a kiss?
50935      Books are sacred wisdom.
118449     Music is breath of life.
247188                Yes. I rememb
475529               I love design.
232783     Love doesn’t conq