Basic NLTK Example from https://www.dataknowsall.com/bowtfidf.html

with some additions



In [27]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

# random features
from random import sample

from string import punctuation

from nltk.util import ngrams
from nltk.tokenize import SyllableTokenizer
from nltk import word_tokenize
from nltk.tokenize import LegalitySyllableTokenizer
import nltk
nltk.download("punkt")


[nltk_data] Downloading package punkt to /home/kugel/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

A corpus is basically a list of sentences. You can create your own or use an example.

In [28]:
corpus = [
    "Tune a hyperparameter.",
    "You can tune a piano but you cannot tune a fish.",
    "Fish who eat fish, catch fish.",
    "People can tune a fish or a hyperparameter.",
    "It is hard to catch fish and tune it.",
]

We do some standard text processing first.
The first step is to count the occurrences of every word in every line. We use the CountVectorizer to do this; English stop words are dropped.

In [29]:
# start with CountVectorizer which creates a BoW
vectorizer = CountVectorizer(stop_words='english') 
X = vectorizer.fit_transform(corpus) 
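# .toarray() densifies the sparse matrix (the .A shortcut is not available in newer SciPy releases)
pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out())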

,catch,eat,fish,hard,hyperparameter,people,piano,tune
0,0,0,0,0,1,0,0,1
1,0,0,1,0,0,0,1,2
2,1,1,3,0,0,0,0,0
3,0,0,1,0,1,1,0,1
4,1,0,1,1,0,0,0,1
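
To see what the count table contains, here is a minimal pure-Python sketch that rebuilds it by hand. It is not how CountVectorizer is implemented. The `STOP` set is a hand-picked assumption that only covers the stop words occurring in this corpus, and `bag_of_words` is a hypothetical helper name. The regular expression keeps tokens of two or more word characters, which mirrors the vectorizer's default token pattern.

```python
import re
from collections import Counter

corpus = [
    "Tune a hyperparameter.",
    "You can tune a piano but you cannot tune a fish.",
    "Fish who eat fish, catch fish.",
    "People can tune a fish or a hyperparameter.",
    "It is hard to catch fish and tune it.",
]

# Assumption: a tiny hand-picked stop list, sufficient for this corpus only.
STOP = {"a", "you", "can", "but", "cannot", "who", "or", "it", "is", "to", "and"}


def bag_of_words(docs, stop_words):
    """Return (sorted vocabulary, one row of counts per document)."""
    # Lowercase, keep tokens of 2+ word characters, drop stop words.
    tokenised = [
        [t for t in re.findall(r"\b\w\w+\b", doc.lower()) if t not in stop_words]
        for doc in docs
    ]
    vocab = sorted({t for doc in tokenised for t in doc})
    rows = []
    for doc in tokenised:
        counts = Counter(doc)
        rows.append([counts[term] for term in vocab])
    return vocab, rows


vocab, rows = bag_of_words(corpus, STOP)
print(vocab)
for row in rows:
    print(row)
```

Each printed row should line up with the corresponding row of the table above.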


The table above shows the raw occurrence counts. Now we switch to another vectorizer that uses a different metric: TF-IDF (term frequency-inverse document frequency, see e.g. [here](https://towardsdatascience.com/tf-idf-simplified-aba19d5f5530)). The TfidfVectorizer can be run in two modes (`use_idf` False or True). With `use_idf=False` the document frequency is ignored, so each row is just the count vector scaled to unit length.

In [30]:
# change vectorizer
vectorizer = TfidfVectorizer(stop_words='english', use_idf=False) 
X = vectorizer.fit_transform(corpus) 
df = pd.DataFrame(np.round(X.toarray(),3), columns=vectorizer.get_feature_names_out())
df


,catch,eat,fish,hard,hyperparameter,people,piano,tune
0,0.0,0.0,0.0,0.0,0.707,0.0,0.0,0.707
1,0.0,0.0,0.408,0.0,0.0,0.0,0.408,0.816
2,0.302,0.302,0.905,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.5,0.0,0.5,0.5,0.0,0.5
4,0.5,0.0,0.5,0.5,0.0,0.0,0.0,0.5
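
The numbers in this table can be checked by hand. With `use_idf=False` and the vectorizer's default L2 normalisation, each row of counts is divided by its Euclidean length. The sketch below does this for row 1 of the count table; `l2_normalize` is a hypothetical helper name, not part of scikit-learn.

```python
import math

# Row 1 of the count table: fish=1, piano=1, tune=2
counts = [0, 0, 1, 0, 0, 0, 1, 2]


def l2_normalize(row):
    """Scale a vector to unit Euclidean length (all-zero rows stay zero)."""
    norm = math.sqrt(sum(v * v for v in row))
    return [v / norm if norm else 0.0 for v in row]


tf = l2_normalize(counts)
print([round(v, 3) for v in tf])
# [0.0, 0.0, 0.408, 0.0, 0.0, 0.0, 0.408, 0.816]
```

The length of the row is sqrt(1 + 1 + 4) = sqrt(6), which gives 0.408 for the words that occur once and 0.816 for "tune".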


`use_idf=True` applies the inverse document frequency, which favors words that appear in fewer lines.

In [31]:
# inverse vectorizer
vectorizer = TfidfVectorizer(stop_words='english', use_idf=True) 
X = vectorizer.fit_transform(corpus) 
df = pd.DataFrame(np.round(X.toarray(),3), columns=vectorizer.get_feature_names_out())
df


,catch,eat,fish,hard,hyperparameter,people,piano,tune
0,0.0,0.0,0.0,0.0,0.82,0.0,0.0,0.573
1,0.0,0.0,0.35,0.0,0.0,0.0,0.622,0.701
2,0.38,0.471,0.796,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.373,0.0,0.534,0.661,0.0,0.373
4,0.534,0.0,0.373,0.661,0.0,0.0,0.0,0.373
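
As a worked check of row 0, the sketch below applies the smoothed IDF that scikit-learn uses by default, idf(t) = ln((1 + n) / (1 + df(t))) + 1, and then the same L2 normalisation as before. The document frequencies are read off the count table (`hyperparameter` appears in 2 of the 5 lines, `tune` in 4); `smooth_idf` is a hypothetical helper name used only for this illustration.

```python
import math

n_docs = 5
# Number of lines each term appears in, read off the count table.
doc_freq = {"hyperparameter": 2, "tune": 4}
# Raw counts in row 0 ("Tune a hyperparameter.").
counts = {"hyperparameter": 1, "tune": 1}


def smooth_idf(df_t, n):
    """Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1."""
    return math.log((1 + n) / (1 + df_t)) + 1


# Weight each count by its IDF, then scale the row to unit length.
weighted = {t: c * smooth_idf(doc_freq[t], n_docs) for t, c in counts.items()}
norm = math.sqrt(sum(v * v for v in weighted.values()))
tfidf = {t: round(v / norm, 3) for t, v in weighted.items()}
print(tfidf)
# {'hyperparameter': 0.82, 'tune': 0.573}
```

The rarer word "hyperparameter" ends up with the larger weight, although both words occur once in the line.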


Instead of counting words, we can also split the lines into tokens.


In [32]:
# tokenize into words
w = word_tokenize(" ".join(corpus))
print("Words: ",w)
# syllables (the tokenizer has to be created before it is used)
SSP = SyllableTokenizer()
result = [SSP.tokenize(token) for token in w]
print("Syllables in sentence: ",result)


Words:  ['Tune', 'a', 'hyperparameter', '.', 'You', 'can', 'tune', 'a', 'piano', 'but', 'you', 'can', 'not', 'tune', 'a', 'fish', '.', 'Fish', 'who', 'eat', 'fish', ',', 'catch', 'fish', '.', 'People', 'can', 'tune', 'a', 'fish', 'or', 'a', 'hyperparameter', '.', 'It', 'is', 'hard', 'to', 'catch', 'fish', 'and', 'tune', 'it', '.']
Syllables in sentence:  [['Tu', 'ne'], ['a'], ['hy', 'per', 'pa', 'ra', 'me', 'ter'], ['.'], ['Yo', 'u'], ['can'], ['tu', 'ne'], ['a'], ['pia', 'no'], ['but'], ['yo', 'u'], ['can'], ['not'], ['tu', 'ne'], ['a'], ['fish'], ['.'], ['Fish'], ['who'], ['eat'], ['fish'], [','], ['catch'], ['fish'], ['.'], ['Peo', 'ple'], ['can'], ['tu', 'ne'], ['a'], ['fish'], ['or'], ['a'], ['hy', 'per', 'pa', 'ra', 'me', 'ter'], ['.'], ['It'], ['is'], ['hard'], ['to'], ['catch'], ['fish'], ['and'], ['tu', 'ne'], ['it'], ['.']]


We can also tokenize a single word into syllables.

In [33]:
# syllables of a single word
SSP = SyllableTokenizer()
s = SSP.tokenize('justification')
print("Syllables: ",s)



Syllables:  ['jus', 'ti', 'fi', 'ca', 'tion']


### After these basic examples we try to create lyrics
A haiku has 3 lines with any number of words, provided the lines contain 5, 7 and 5 syllables respectively.

We start with the same corpus, but you may use any other text or an NLTK sample corpus (check the NLTK website).
The first thing to do here is to remove punctuation.

In [34]:
# combine into a single string and replace all punctuation with a space
text = "".join(corpus)
for p in punctuation:
    text = text.replace(p," ")
    
print("Text: ",text)


Text:  Tune a hyperparameter You can tune a piano but you cannot tune a fish Fish who eat fish  catch fish People can tune a fish or a hyperparameter It is hard to catch fish and tune it 


Now we split the text into words (similar to the example above) and create a dict with the number of syllables for each word.

In [35]:
# tokenize into words
words = word_tokenize(text)
print("Words: ",words)

# create dict with words and number of syllables
wdict = {}
for w in words:
    if w not in wdict:
        wdict[w] = len(SSP.tokenize(w))

print("Wdict: ",wdict)


Words:  ['Tune', 'a', 'hyperparameter', 'You', 'can', 'tune', 'a', 'piano', 'but', 'you', 'can', 'not', 'tune', 'a', 'fish', 'Fish', 'who', 'eat', 'fish', 'catch', 'fish', 'People', 'can', 'tune', 'a', 'fish', 'or', 'a', 'hyperparameter', 'It', 'is', 'hard', 'to', 'catch', 'fish', 'and', 'tune', 'it']
Wdict:  {'Tune': 2, 'a': 1, 'hyperparameter': 6, 'You': 2, 'can': 1, 'tune': 2, 'piano': 2, 'but': 1, 'you': 2, 'not': 1, 'fish': 1, 'Fish': 1, 'who': 1, 'eat': 1, 'catch': 1, 'People': 2, 'or': 1, 'It': 1, 'is': 1, 'hard': 1, 'to': 1, 'and': 1, 'it': 1}


Let's take a random sample from our dict (this will give different results on every run).

In [39]:
n = 5
rwords = sample(list(wdict.keys()), n)
print(rwords)
for r in rwords:
    print(f"{r}: {wdict[r]} syllables")



['It', 'not', 'eat', 'can', 'you']
It: 1 syllables
not: 1 syllables
eat: 1 syllables
can: 1 syllables
you: 2 syllables


### Up to you to create lines with the appropriate number of syllables for the haiku
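
One possible starting point, as a rough sketch rather than a finished solution: keep drawing random words whose syllable count still fits into the line until the target is reached. `make_line` is a hypothetical helper, and the dict is copied from the `Wdict` output above so the snippet runs on its own. The result will be grammatical nonsense, but the 5-7-5 pattern holds.

```python
import random

# Syllable counts copied from the Wdict output above.
wdict = {'Tune': 2, 'a': 1, 'hyperparameter': 6, 'You': 2, 'can': 1, 'tune': 2,
         'piano': 2, 'but': 1, 'you': 2, 'not': 1, 'fish': 1, 'Fish': 1,
         'who': 1, 'eat': 1, 'catch': 1, 'People': 2, 'or': 1, 'It': 1,
         'is': 1, 'hard': 1, 'to': 1, 'and': 1, 'it': 1}


def make_line(syllables, target, rng):
    """Pick random words until their syllable counts add up to target."""
    words, remaining = [], target
    while remaining > 0:
        # Only words that still fit into the rest of the line are candidates.
        candidates = [w for w, n in syllables.items() if 0 < n <= remaining]
        if not candidates:
            raise ValueError("no word fits the remaining syllables")
        word = rng.choice(candidates)
        words.append(word)
        remaining -= syllables[word]
    return words


rng = random.Random()  # pass a seed, e.g. random.Random(1), for repeatable output
haiku = [make_line(wdict, target, rng) for target in (5, 7, 5)]
for line in haiku:
    print(" ".join(line))
```

Because the corpus contains one-syllable words, the loop can always fill the last remaining syllable. With a different word list that is not guaranteed, which is why the helper raises an error instead of looping forever.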
