## Poetry Generation Using N-grams

- Poetry Generation with n-gram Language Modeling:
  - This project generates Urdu poetry with n-gram language models.
  
- Task Description:
  - Print three stanzas of poetry, separated by an empty line.
  - The generation model is trained on a provided poetry corpus containing poems by renowned Urdu poets such as Faiz, Ghalib, and Iqbal, along with other Urdu poetry scraped from the internet.
  
- Assignment Task:
  - Load the Poetry Corpus and tokenize it to split it into a list of words.
  - Generate n-gram models, including unigram and bigram models.
  - For each stanza:
    - For each verse:
      - Pick a random verse length between 7 and 10 words, inclusive.
      - Select the first word randomly.
      - Select subsequent words until the verse is complete.
      - Attempt to rhyme the last word with the last word of the previous verse.
      - Print the verse.
    - Print an empty line after each stanza.
    
- Implementation Challenges:
  - The main difficulty is choosing each subsequent word conditioned on the word before it, starting from the randomly chosen first word.
  - A Conditional Frequency Distribution (CFD) is used to predict the most probable next word.
  - Rhyming the generated verses adds complexity, because it requires building a rhyming dictionary.
  - Urdu is written right to left, so the n-gram models are adapted to that word order.
  
- Standard n-gram Models:
  - Develop unigram, bigram, and trigram models using the Conditional Frequency Distribution method.
  - Start by selecting the first word randomly and use the bigram model to generate subsequent words until the verse is complete.
  - Compare the results of the two n-gram models.
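
The Conditional Frequency Distribution idea from the list above can be sketched as follows. This is a minimal illustration under assumptions, not the assignment's reference solution: it uses only the standard library as a stand-in for NLTK's `ConditionalFreqDist`, and the names `build_cfd` and `most_probable_next` as well as the toy Roman-script tokens are hypothetical.

```python
from collections import Counter, defaultdict

def build_cfd(tokens):
    """Map each word to a Counter of the words that directly follow it."""
    cfd = defaultdict(Counter)
    for prev_word, next_word in zip(tokens, tokens[1:]):
        cfd[prev_word][next_word] += 1
    return cfd

def most_probable_next(cfd, word):
    """Return the most frequent follower of `word`, or None if it has none."""
    followers = cfd.get(word)  # .get() avoids creating empty entries
    if not followers:
        return None
    return followers.most_common(1)[0][0]

# Toy tokens (assumption): a stand-in for the tokenized corpus
toy_tokens = "dil hai ke dil hai to dil hai".split()
toy_cfd = build_cfd(toy_tokens)
print(most_probable_next(toy_cfd, "dil"))
```

Building the table once lets each lookup run without rescanning the whole bigram list for every generated word.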
  
This project showcases proficiency in natural language processing (NLP), specifically in generating poetry using n-gram language modeling techniques applied to Urdu language data.
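
The rhyming dictionary mentioned among the challenges could be approximated by grouping words on their final characters. The sketch below is one hedged possibility: `rhyme_key`, `build_rhyme_dict`, `rhyming_candidates`, the suffix length `n=2`, and the Roman-script sample words are all assumptions made for illustration. Python strings keep text in logical (reading) order, so `word[-n:]` is the end of the word for Urdu script as well, although diacritics stored as separate combining characters may need extra handling.

```python
from collections import defaultdict

def rhyme_key(word, n=2):
    """Use the last n characters of a word as a crude rhyme key."""
    return word[-n:]

def build_rhyme_dict(words, n=2):
    """Group the distinct words of a corpus by their rhyme key."""
    rhymes = defaultdict(set)
    for word in set(words):
        rhymes[rhyme_key(word, n)].add(word)
    return rhymes

def rhyming_candidates(rhymes, previous_last_word, n=2):
    """Words sharing a rhyme key with the previous verse's last word."""
    group = rhymes.get(rhyme_key(previous_last_word, n), set())
    return sorted(group - {previous_last_word})

# Sample words (assumption): Roman-script placeholders for corpus tokens
sample_words = ["zindagi", "bandagi", "dil", "manzil", "raat"]
sample_rhymes = build_rhyme_dict(sample_words)
print(rhyming_candidates(sample_rhymes, "zindagi"))
```

When closing a verse, the generator could prefer a follower of the current word that also appears in `rhyming_candidates`, and fall back to the ordinary choice when none exists.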


In [None]:
# File reading

with open('ghalib.txt', 'r', encoding='utf-8') as file:
    ghalib_urdu_text = file.read()

with open('iqbal.txt', 'r', encoding='utf-8') as file:
    iqbal_urdu_text = file.read()

# Join with a newline so the last word of one file does not fuse with the first word of the next
combine_urdu = '\n'.join([ghalib_urdu_text, iqbal_urdu_text])
#print(combine_urdu)

In [None]:
# Word tokenization (word_tokenize needs NLTK's Punkt tokenizer data to be installed)
from nltk.tokenize import word_tokenize

urdu_words = word_tokenize(combine_urdu)

In [None]:
# Creating ngrams

from nltk.util import ngrams
from nltk import FreqDist

unigrams = list(ngrams(urdu_words, 1))
unigram_freq = FreqDist(unigrams)

bigrams = list(ngrams(urdu_words, 2))
bigram_freq = FreqDist(bigrams)

trigrams = list(ngrams(urdu_words, 3))
trigram_freq = FreqDist(trigrams)

#print(type(unigrams))
#print(bigrams)
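
The trigram list built above is not used by the generation cell that follows, which works from bigrams only. As a hedged sketch of how a trigram model could be compared against the bigram one, the code below conditions the next word on the previous two words and backs off to the previous one when that pair was never seen. The function names, the back-off strategy, and the toy tokens are assumptions, and plain `Counter` tables stand in for NLTK's frequency distributions.

```python
from collections import Counter, defaultdict

def build_models(tokens):
    """Build follower counts keyed by one word (bigram) and by two words (trigram)."""
    bigram_cfd = defaultdict(Counter)
    trigram_cfd = defaultdict(Counter)
    for a, b in zip(tokens, tokens[1:]):
        bigram_cfd[a][b] += 1
    for a, b, c in zip(tokens, tokens[1:], tokens[2:]):
        trigram_cfd[(a, b)][c] += 1
    return bigram_cfd, trigram_cfd

def next_word(verse, bigram_cfd, trigram_cfd):
    """Prefer the trigram context; back off to the bigram context; else None."""
    if len(verse) >= 2:
        followers = trigram_cfd.get(tuple(verse[-2:]))
        if followers:
            return followers.most_common(1)[0][0]
    followers = bigram_cfd.get(verse[-1])
    if followers:
        return followers.most_common(1)[0][0]
    return None

# Toy tokens (assumption): "b" is followed by "c" once and by "d" twice
toy = "x b c y b d y b d".split()
bi, tri = build_models(toy)
print(next_word(["x", "b"], bi, tri), next_word(["b"], bi, tri))
```

With the extra word of context, the pair `("x", "b")` selects `c`, while the bigram model alone would pick the globally more frequent `d`.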

In [None]:
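import random

# Pool of candidate verse-opening words: 21 unigram tuples drawn at random
start_word = random.sample(unigrams, 21)
# Unpack each 1-tuple into a plain word
starting_words = [item[0] for item in start_word]
#print(starting_words)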

def verse_generation(model):
    """Generate one verse of at most 7 to 10 words from a list of n-gram tuples."""
    verse_length = random.randint(7, 10)  # both ends inclusive

    current_word = random.choice(starting_words)
    verse = [current_word]

    for _ in range(verse_length - 1):
        # Collect every word that follows current_word in the n-gram list
        next_words = [gram[1] for gram in model if gram[0] == current_word]
        if not next_words:
            break  # dead end: no n-gram starts with current_word
        # Greedy choice: always take the most frequent follower
        current_word = FreqDist(next_words).max()
        verse.append(current_word)

    return ' '.join(verse)

def stanza_generation(n_grams, count_for_verse):
    stanza = []
    for _ in range(count_for_verse):
        stanza.append(verse_generation(n_grams))
    return stanza

for stanza_number in range(3):
    stanza = stanza_generation(bigrams, 4)
    for i in stanza:
        print(i)
    if stanza_number < 2:
        print()

معدن کو؟ فروغِ جوہرِ تیغ، آب دار
معدن کو؟ فروغِ جوہرِ تیغ، آب دار تھا کہ ‘
صفا کیش و دل کی ہے کہ ‘ تو نے
ہے کہ ‘ تو نے کیا ہے کہ ‘ تو

بیاباں نورد تھا کہ ‘ تو نے کیا ہے
کشا ہے کہ ‘ تو نے کیا ہے کہ ‘
میرے ‘ تو نے کیا ہے کہ
بیاباں نورد تھا کہ ‘ تو نے کیا

دبا یہ بات کہ ‘ تو نے کیا ہے کہ
پیشہ طلبگارِ مرد تھا کہ ‘ تو نے کیا ہے
میں ہے کہ ‘ تو نے کیا ہے کہ ‘
از نمود کچھ بھی نہیں ہے کہ
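
The verses above repeat the same word cycle because the generator always takes the single most frequent follower. One possible remedy, sketched here as an assumption rather than as part of the original solution, is to sample the next word in proportion to its count. The helper name `sample_next` and the toy counts are hypothetical.

```python
import random
from collections import Counter

def sample_next(followers, rng=random):
    """Draw one follower at random, weighted by how often it was observed."""
    words = list(followers)
    weights = [followers[w] for w in words]
    return rng.choices(words, weights=weights, k=1)[0]

# Toy follower counts (assumption): "a" seen 3 times, "b" seen once
toy_followers = Counter({"a": 3, "b": 1})
rng = random.Random(0)  # seeded only to make the demo repeatable
draws = [sample_next(toy_followers, rng) for _ in range(20)]
print(draws)
```

Frequent followers still dominate, but rarer ones occasionally appear, which makes a fixed loop less likely.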


# Question 3
Rule-Based Roman Urdu Text Normalization

Roman Urdu lacks a standard lexicon, so a given word usually has many spelling variations; for example, zindagi (life) is also written as zindagee, zindagy, zaindagee and zndagi. In this question, normalize Roman Urdu words using the rules given in the attached PDF. The code should work for a complete sentence or for multiple sentences.

For example, zaroori, zaruri, and zarori all map to 'zrory', so zrory becomes the normalized form for every spelling listed above.

In [None]:
import re
import string
from nltk import word_tokenize

sentence="zaroori zaruri zarori armaniyyyyyyan karain mara tutai haye hay aeihhhhhh haey ytesssssss htie hdhryd esrladh aaaaamin hdjtyhfdjfti hdfhfjjjjjjfljd"
sentence=sentence.lower()
sentence_tokens=word_tokenize(sentence)
modified_list=[]

for words in sentence_tokens:
    if (words.endswith("ain")):
        #print(f'if i am true {words}')
        words=words.replace('ain','ein')

    if("ar" in words[1:]):
        words=words.replace('ar','r')
        #print(f'if ar i am true {words}')


    if("ai" in words):
        words=words.replace('ai','ae')
        #print(f'if ai i am true {words}')


    if ("iy" in words):
        # 'i' followed by one or more 'y' collapses to a single 'I'
        multi_case=r'iy+'
        words=re.sub(multi_case,'I',words)
        #print(f'if iy i am true {words}')

    if (words.endswith("ay")):
        #print(f'if ay i am true {words}')
        words=words.replace('ay','e')

    if ("ih" in words):
        # 'i' followed by one or more 'h' becomes 'eh'
        multi_case=r'ih+'
        words=re.sub(multi_case,'eh',words)
        #print(f'if ih i am true {words}')

    if (words.endswith("ey")):
        #print(f'if ey i am true {words}')
        words=words.replace('ey','e')

    if ("s" in words):
        multi_case=r's+'
        words=re.sub(multi_case,'s',words)
        #print(f'if s i am true {words}')

    if (words.endswith("ie")):
        #print(f'if ie i am true {words}')
        words=words.replace('ie','y')

    if ("ry" in words and not(words.endswith("ry"))):
        #print(f'if ri i am true {words}')
        words=words.replace('ry','ri')

    if (words.startswith("es")):
        #print(f'if es i am true {words}')
        words=words.replace('es','is')

    if("a" in words):
        multi_case=r'a+'
        words=re.sub(multi_case,"a",words)

    if ("ty" in words and not(words.endswith("ty"))):
        #print(f'if ty i am true {words}')
        words=words.replace('ty','ti')

    if("j" in words):
        multi_case=r'j+'
        words=re.sub(multi_case,"j",words)

    if("o" in words):
        multi_case=r'o+'
        words=re.sub(multi_case,"o",words)

    if("e" in words):
        multi_case=r'e+'
        words=re.sub(multi_case,"i",words)

    if("d" in words):
        multi_case=r'd+'
        words=re.sub(multi_case,"d",words)

    if 'u' in words:
        words=words.replace('u','o')

    if 'i' in words:
        pre_case=f'([{string.ascii_lowercase[1:]}])i'
        words=re.sub(pre_case,r'\1y',words)

    modified_list.append(words)

new_sentence=' '.join(modified_list)
print(new_sentence)


zrory zrory zrory armanIan kryin mra totai hayy hy aih hai ytys hty hdhryd isrladh amyn hdjtyhfdjfty hdfhfjfljd
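
The cell above applies its rules through a long chain of `if` statements. One possible alternative, sketched here under assumptions, keeps the rules as an ordered table of patterns and applies them to every word of a text with several sentences. The names `RULES`, `normalize_word`, and `normalize_text` are hypothetical, and the table holds only a small subset of the rules from the cell, not the full rule set of the PDF, so results on other words may differ from the cell's output.

```python
import re

# Ordered (pattern, replacement) pairs. Assumption: a small illustrative
# subset of the rules used in the cell above.
RULES = [
    (re.compile(r"(?<=[a-z])ar"), "r"),    # drop the 'a' of a non-initial "ar"
    (re.compile(r"([asjod])\1+"), r"\1"),  # collapse repeated a, s, j, o, d
    (re.compile(r"e+"), "i"),              # any run of 'e' becomes 'i'
    (re.compile(r"u"), "o"),               # 'u' becomes 'o'
    (re.compile(r"([b-z])i"), r"\1y"),     # 'i' after a letter other than 'a' becomes 'y'
]

def normalize_word(word):
    """Apply every rule in order to a single lowercase word."""
    for pattern, replacement in RULES:
        word = pattern.sub(replacement, word)
    return word

def normalize_text(text):
    """Normalize each alphabetic word; punctuation and spacing are kept."""
    return re.sub(r"[a-z]+", lambda m: normalize_word(m.group()), text.lower())

print(normalize_text("Zaroori hai. Zaruri, zarori!"))
```

Because the rules live in a list, their order is explicit, and adding or reordering a rule does not require touching the loop.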
