Basic NLTK Example from https://www.dataknowsall.com/bowtfidf.html

with some additions



In [None]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

# random features
from random import sample

from string import punctuation

from nltk.util import ngrams
from nltk.tokenize import SyllableTokenizer
from nltk import word_tokenize
from nltk.tokenize import LegalitySyllableTokenizer
import nltk
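nltk.download("punkt")
# Newer NLTK releases look for "punkt_tab" instead; requesting both is harmless.
nltk.download("punkt_tab")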


A corpus is basically a list of sentences. You can create your own or use an existing example.

In [None]:
corpus = [
    "Tune a hyperparameter.",
    "You can tune a piano but you cannot tune a fish.",
    "Fish who eat fish, catch fish.",
    "People can tune a fish or a hyperparameter.",
    "It is hard to catch fish and tune it.",
]

We do some standard text processing first.
The first step is to count the occurrences of every word in every line. We use `CountVectorizer` for this; English stop words are dropped.

In [None]:
# start with CountVectorizer which creates a BoW
vectorizer = CountVectorizer(stop_words='english') 
X = vectorizer.fit_transform(corpus) 
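# toarray() converts the sparse matrix to a dense array (the .A shorthand is gone in recent SciPy)
pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out())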

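To make the counting step concrete, here is a minimal pure-Python sketch of a bag of words for a single sentence. The stop-word set is a hypothetical, hand-picked subset for illustration (scikit-learn's built-in English list is much longer), and the regular expression is meant to mirror the vectorizer's default of keeping tokens with two or more word characters.

```python
import re
from collections import Counter

# Hypothetical mini stop-word list, for illustration only.
STOP_WORDS = {"a", "and", "but", "can", "cannot", "is", "it", "or", "to", "who", "you"}


def bag_of_words(sentence):
    """Count the non-stop-word tokens of one sentence."""
    # Lowercase, then keep runs of two or more word characters.
    tokens = re.findall(r"\b\w\w+\b", sentence.lower())
    return Counter(t for t in tokens if t not in STOP_WORDS)


counts = bag_of_words("You can tune a piano but you cannot tune a fish.")
print(dict(counts))  # {'tune': 2, 'piano': 1, 'fish': 1}
```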
The DataFrame from `CountVectorizer` shows the raw occurrence counts. Next we use another vectorizer with a different metric: TF-IDF (term frequency-inverse document frequency, see e.g. [here](https://towardsdatascience.com/tf-idf-simplified-aba19d5f5530)). `TfidfVectorizer` can run in two modes, `use_idf=False` or `use_idf=True`. With `False`, the document frequency is ignored and each row holds only length-normalised term frequencies.

In [None]:
# change vectorizer
vectorizer = TfidfVectorizer(stop_words='english', use_idf=False) 
X = vectorizer.fit_transform(corpus) 
df = pd.DataFrame(np.round(X.toarray(), 3), columns=vectorizer.get_feature_names_out())
df

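The values in this mode can be reproduced by hand, which may help when reading the table. The sketch below assumes that the third sentence is reduced to the tokens `fish`, `eat` and `catch` after stop-word removal, and that the vectorizer keeps its default L2 row normalisation; each count is divided by the Euclidean length of the row.

```python
import math

# Assumed token counts for "Fish who eat fish, catch fish." after stop-word removal.
counts = {"fish": 3, "eat": 1, "catch": 1}

# Euclidean (L2) length of the count vector: sqrt(3^2 + 1^2 + 1^2).
norm = math.sqrt(sum(c * c for c in counts.values()))

# Dividing by the norm gives the row unit length.
tf = {term: round(c / norm, 3) for term, c in counts.items()}
print(tf)  # {'fish': 0.905, 'eat': 0.302, 'catch': 0.302}
```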

`use_idf=True` applies the inverse document frequency, which favors words that appear in fewer lines.

In [None]:
# TF-IDF with inverse document frequency enabled
vectorizer = TfidfVectorizer(stop_words='english', use_idf=True)
X = vectorizer.fit_transform(corpus)
df = pd.DataFrame(np.round(X.toarray(), 3), columns=vectorizer.get_feature_names_out())
df

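As a rough illustration of why rarer words get more weight, the sketch below computes the smoothed inverse document frequency, `ln((1 + n) / (1 + df)) + 1`, which scikit-learn documents as its default (`smooth_idf=True`). The token lists are an assumption about what remains of each sentence after lowercasing and stop-word removal; the final TF-IDF values additionally multiply by the term counts and normalise each row.

```python
import math

# Assumed tokens per document after lowercasing and stop-word removal.
docs = [
    ["tune", "hyperparameter"],
    ["tune", "piano", "tune", "fish"],
    ["fish", "eat", "fish", "catch", "fish"],
    ["people", "tune", "fish", "hyperparameter"],
    ["hard", "catch", "fish", "tune"],
]

n_docs = len(docs)

# Document frequency: in how many documents does each term appear?
doc_freq = {}
for doc in docs:
    for term in set(doc):
        doc_freq[term] = doc_freq.get(term, 0) + 1

# Smoothed IDF: ln((1 + n) / (1 + df)) + 1
idf = {t: math.log((1 + n_docs) / (1 + d)) + 1 for t, d in doc_freq.items()}

# Print from the most common to the rarest term.
for term in sorted(idf, key=idf.get):
    print(f"{term:15s} df={doc_freq[term]}  idf={idf[term]:.3f}")
```

Under these assumptions `tune` and `fish` (4 of 5 documents) receive the lowest weight, while words such as `piano` that occur in a single document receive the highest.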

Instead of counting words, we can also split the lines into individual word tokens.


In [None]:
# tokenize into words
w = word_tokenize(" ".join(corpus))
print("Words: ",w)


We can also tokenize words into syllables.


**Note:** the tokenizer is rule-based (it follows the Sonority Sequencing Principle) and does not always produce the correct number of syllables.

In [None]:
# syllables of a single word
SSP = SyllableTokenizer()
s = SSP.tokenize('justification')
print("Syllables: ",s)


### After these basic examples we try to create lyrics
A HAIKU

A haiku has 3 lines with any number of words, provided the lines contain 5, 7 and 5 syllables respectively.

We start with the same corpus, but you may use some other text or an NLTK sample corpus (check the NLTK website).
The first thing to do here is to remove punctuation.

In [None]:
# combine into a single string and replace all punctuation with " "
text = " ".join(corpus)
for p in punctuation:
    text = text.replace(p," ")
    
print("Text: ",text)


Now we split the text into words (as in the tokenizing example above) and create a dict with the number of syllables for each word.

In [None]:
# tokenize into words
words = word_tokenize(text)
print("Words: ",words)

# create dict mapping each distinct word to its number of syllables
wdict = {}
for w in words:
    if w not in wdict:
        wdict[w] = len(SSP.tokenize(w))

print("Wdict: ",wdict)


Let's take a random sample from our dict (this will generate different results on every run).

In [None]:
# draw n distinct words at random and show their syllable counts
n = 5
rwords = sample(list(wdict.keys()), n)
print(rwords)
for r in rwords:
    print(f"{r}: {wdict[r]} syllables")



### Up to you to create lines with the appropriate number of syllables for the haiku
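One possible starting point, offered as a sketch rather than a finished solution: keep drawing random words that still fit into the remaining syllable budget of a line until the budget is used up. The `syllables` table below is hypothetical and stands in for `wdict`; the counts the syllable tokenizer actually produces may differ. The helper name `build_line` is likewise made up for this example.

```python
import random

# Hypothetical syllable counts; in the notebook these would come from wdict.
syllables = {
    "tune": 1, "fish": 1, "catch": 1, "hard": 1, "eat": 1,
    "people": 2, "piano": 3, "hyperparameter": 6,
}


def build_line(target, table, rng):
    """Pick random words until their syllable counts sum to exactly target."""
    line, remaining = [], target
    while remaining > 0:
        # Only words that still fit into the remaining syllable budget.
        candidates = [w for w, n in table.items() if 0 < n <= remaining]
        if not candidates:
            raise ValueError("no word fits the remaining syllables")
        word = rng.choice(candidates)
        line.append(word)
        remaining -= table[word]
    return line


rng = random.Random()  # use random.Random(42) or similar for repeatable output
haiku = [build_line(t, syllables, rng) for t in (5, 7, 5)]
for words in haiku:
    print(" ".join(words))
```

The lines will meet the 5-7-5 pattern but will rarely make sense; filtering candidates by part of speech or drawing from n-grams of a larger corpus are natural next experiments.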
