# LLM - Large Language Model (a sample)

In [1]:
# nltk library to create a basic language model
#   --> minimal viable product (MVP): a small but complete end-to-end implementation
#   --> language model = probabilistic model --> predicts the likelihood of a sequence of words appearing in a given context
#   --> used in NLP (natural language processing) --> speech recognition, machine translation, and text generation
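
To make "predicts the likelihood of a sequence of words" concrete, here is a minimal sketch of a bigram estimate built from raw counts. The helper name `bigram_probability` and the toy sentence are assumptions for illustration only; the notebook itself relies on `nltk.lm.MLE` for this.

```python
from collections import Counter

def bigram_probability(tokens, prev_word, word):
    # Maximum-likelihood estimate: count(prev_word, word) / count(prev_word as context)
    pair_counts = Counter(zip(tokens, tokens[1:]))
    context_counts = Counter(tokens[:-1])
    if context_counts[prev_word] == 0:
        return 0.0  # unseen context: no estimate available
    return pair_counts[(prev_word, word)] / context_counts[prev_word]

tokens = "the cat sat on the mat".split()
p = bigram_probability(tokens, "the", "cat")
print(p)  # 0.5 -> "the" is followed once by "cat" and once by "mat"
```

A trigram model works the same way, except that the context is the two preceding words instead of one.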

In [2]:
# import libraries
import pandas as pd
from sklearn.preprocessing import LabelEncoder
import re
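import nltk
# nltk.word_tokenize (used below) needs tokenizer data, e.g. nltk.download('punkt')
# ('punkt_tab' on newer NLTK releases)
from nltk import FreqDist
from nltk.util import ngrams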
from nltk.lm.preprocessing import padded_everygram_pipeline
from nltk.lm import MLE
import random
from random import choice
# Run in case warnings should be ignored
import warnings
warnings.filterwarnings('ignore')
pd.set_option('display.max_colwidth',100)

In [3]:
# loading data
df01 = pd.read_csv('topical_chat.csv')

In [4]:
df01.head()

Unnamed: 0,conversation_id,message,sentiment
0,1,Are you a fan of Google or Microsoft?,Curious to dive deeper
1,1,Both are excellent technology they are helpful in many ways. For the security purpose both are ...,Curious to dive deeper
2,1,"I'm not a huge fan of Google, but I use it a lot because I have to. I think they are a monopol...",Curious to dive deeper
3,1,"Google provides online related services and products, which includes online ads, search engine ...",Curious to dive deeper
4,1,"Yeah, their services are good. I'm just not a fan of intrusive they can be on our personal lives.",Curious to dive deeper


In [5]:
df01.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 188378 entries, 0 to 188377
Data columns (total 3 columns):
 #   Column           Non-Null Count   Dtype 
---  ------           --------------   ----- 
 0   conversation_id  188378 non-null  int64 
 1   message          188378 non-null  object
 2   sentiment        188378 non-null  object
dtypes: int64(1), object(2)
memory usage: 4.3+ MB


In [6]:
df01.sentiment.value_counts()

sentiment
Curious to dive deeper    80888
Neutral                   41367
Surprised                 30638
Happy                     29617
Sad                        2533
Disgusted                  1433
Fearful                    1026
Angry                       876
Name: count, dtype: int64

In [7]:
# the feature 'sentiment' could serve as the label --> encode it as integers
le = LabelEncoder()
df01['sentiment'] = le.fit_transform(df01['sentiment'])
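
The integer codes that appear in later outputs follow from `LabelEncoder` ordering its classes by sorted value. The plain-Python sketch below mirrors that mapping for the eight sentiments listed above; `fit_label_mapping` is a hypothetical helper, not part of scikit-learn.

```python
def fit_label_mapping(labels):
    # Sorted unique labels -> consecutive integers, mirroring LabelEncoder's class order
    return {label: code for code, label in enumerate(sorted(set(labels)))}

sentiments = [
    "Curious to dive deeper", "Neutral", "Surprised", "Happy",
    "Sad", "Disgusted", "Fearful", "Angry",
]
mapping = fit_label_mapping(sentiments)
print(mapping["Curious to dive deeper"], mapping["Neutral"], mapping["Surprised"])  # 1 5 7
```

With the fitted encoder itself, `le.inverse_transform` turns the integers back into the original sentiment names.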

In [8]:
# use only the messages and the sentiments:
df01 = df01[['message','sentiment']]
df01.head(5)

Unnamed: 0,message,sentiment
0,Are you a fan of Google or Microsoft?,1
1,Both are excellent technology they are helpful in many ways. For the security purpose both are ...,1
2,"I'm not a huge fan of Google, but I use it a lot because I have to. I think they are a monopol...",1
3,"Google provides online related services and products, which includes online ads, search engine ...",1
4,"Yeah, their services are good. I'm just not a fan of intrusive they can be on our personal lives.",1


In [9]:
df01.message.value_counts()

message
Hi, how are you?                                                                                                                            317
You too!                                                                                                                                     98
Nice chatting with you!                                                                                                                      91
Bye!                                                                                                                                         77
Bye                                                                                                                                          61
                                                                                                                                           ... 
Or he could try being an actor, he could audition as the president, he'd probably do better than Reagan did in his audition as p

In [10]:
# remove duplicated messages
df02 = df01.drop_duplicates(subset='message',keep='first')
df02.head(5)

Unnamed: 0,message,sentiment
0,Are you a fan of Google or Microsoft?,1
1,Both are excellent technology they are helpful in many ways. For the security purpose both are ...,1
2,"I'm not a huge fan of Google, but I use it a lot because I have to. I think they are a monopol...",1
3,"Google provides online related services and products, which includes online ads, search engine ...",1
4,"Yeah, their services are good. I'm just not a fan of intrusive they can be on our personal lives.",1


In [11]:
df02.message.value_counts().max()

1

In [12]:
df02.isnull().value_counts()

message  sentiment
False    False        184303
Name: count, dtype: int64

In [13]:
df02.shape

(184303, 2)

In [14]:
# keep only messages with at least 200 characters
df02 = df02[df02.message.astype(str).str.len() >= 200]
df02.head()

Unnamed: 0,message,sentiment
167,"Not where I lived, if I want to read a comic book, I had to go to the library. But as I think a...",7
175,"Agreed. Can you also believe that George Lucas' wife at the time, Marcia, won an academy award ...",7
187,Indeed he doesn't. But I've always liked Chewbacca because he reminded me of a giant dog. Ha! S...,5
228,"Yes, I have a dog. Sounds like in the Dear Amy that they do everything tit for tat. The cat ow...",7
293,Yeah we all know that guy! how about the loch ness monster being protected under the Scottish 1...,1


In [15]:
# message lengths (shortest and longest message, in characters)
messages_len = df02.message.str.len()
print(messages_len.min(), messages_len.max())

200 697


In [16]:
df02.shape

(10821, 2)

In [17]:
# limit the data set to the first 10,000 messages
# .copy() avoids a SettingWithCopyWarning when columns are added later
df03 = df02.iloc[:10000,:].copy()
df03.head()

Unnamed: 0,message,sentiment
167,"Not where I lived, if I want to read a comic book, I had to go to the library. But as I think a...",7
175,"Agreed. Can you also believe that George Lucas' wife at the time, Marcia, won an academy award ...",7
187,Indeed he doesn't. But I've always liked Chewbacca because he reminded me of a giant dog. Ha! S...,5
228,"Yes, I have a dog. Sounds like in the Dear Amy that they do everything tit for tat. The cat ow...",7
293,Yeah we all know that guy! how about the loch ness monster being protected under the Scottish 1...,1


In [18]:
# helper functions for cleaning the text, combined in clean_text() below:
def lower(text):
    return str(text).lower()

# Remove HTML
def remove_HTML(text):
    return re.sub(r'<.*?>', '', text) 

# Removing all digits (also inside words, e.g. 'r2d2' -> 'rd')
def remove_digits(text):
    return re.sub(r'\d+', '', text)

# Removing backslashes:
def remove_backsl(text):
    return text.replace('\\','')

# Removing slashes:
def remove_sl(text):
    return text.replace('/','')

# Removing all dots:
def remove_dots(text):
    return text.replace('.','')

# Removing emoji and related symbols
def remove_emoji(text):
    emoji_pattern = re.compile("["
                           u"\U0001F600-\U0001F64F"  # emoticons
                           u"\U0001F300-\U0001F5FF"  # symbols & pictographs
                           u"\U0001F680-\U0001F6FF"  # transport & map symbols
                           u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
                           u"\U00002702-\U000027B0"  # dingbats
                           u"\U000024C2-\U0001F251"  # enclosed characters (broad range: also matches CJK text)
                           "]+", flags=re.UNICODE)
    return emoji_pattern.sub(r'', text)

# Removing all non-ASCII characters like "ड", "ட"
def remove_non_printable(text):
    text = text.encode("ascii", "ignore")
    return text.decode()

# One function to clean it all
def clean_text(text):
    text = lower(text)
    text = remove_digits(text)
    text = remove_backsl(text)
    text = remove_sl(text)
    text = remove_dots(text)
    text = remove_emoji(text)
    text = remove_HTML(text)
    text = remove_non_printable(text)
    return text
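
A quick way to check the cleaning steps is to run a condensed version on a single string. The sketch below is a hypothetical compact variant (`clean_text_sketch` and the sample string are assumptions, not part of the notebook). It strips tags before removing slashes and relies on the ASCII round trip to drop the emoji as well.

```python
import re

def clean_text_sketch(text):
    text = str(text).lower()
    text = re.sub(r'<.*?>', '', text)      # strip HTML tags first
    text = re.sub(r'\d+', '', text)        # drop digits, also inside words
    for char in ('\\', '/', '.'):          # drop backslashes, slashes, dots
        text = text.replace(char, '')
    # dropping non-ASCII characters also removes emoji
    return text.encode('ascii', 'ignore').decode()

sample = "Visit <b>R2-D2</b> at 10 a.m. / noon \U0001F600"
cleaned = clean_text_sketch(sample)
print(cleaned.split())  # ['visit', 'r-d', 'at', 'am', 'noon']
```

Removing characters rather than replacing them with spaces leaves runs of whitespace behind, which is why the example prints the split tokens instead of the raw string.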

In [19]:
df03['Clean_message'] = df03['message'].apply(clean_text)
df03.head(5)

Unnamed: 0,message,sentiment,Clean_message
167,"Not where I lived, if I want to read a comic book, I had to go to the library. But as I think a...",7,"not where i lived, if i want to read a comic book, i had to go to the library but as i think ab..."
175,"Agreed. Can you also believe that George Lucas' wife at the time, Marcia, won an academy award ...",7,"agreed can you also believe that george lucas' wife at the time, marcia, won an academy award f..."
187,Indeed he doesn't. But I've always liked Chewbacca because he reminded me of a giant dog. Ha! S...,5,indeed he doesn't but i've always liked chewbacca because he reminded me of a giant dog ha! spe...
228,"Yes, I have a dog. Sounds like in the Dear Amy that they do everything tit for tat. The cat ow...",7,"yes, i have a dog sounds like in the dear amy that they do everything tit for tat the cat owne..."
293,Yeah we all know that guy! how about the loch ness monster being protected under the Scottish 1...,1,yeah we all know that guy! how about the loch ness monster being protected under the scottish ...


In [20]:
# create new, regular indices
df03.reset_index(drop=True,inplace=True)

In [21]:
df03.head(3)

Unnamed: 0,message,sentiment,Clean_message
0,"Not where I lived, if I want to read a comic book, I had to go to the library. But as I think a...",7,"not where i lived, if i want to read a comic book, i had to go to the library but as i think ab..."
1,"Agreed. Can you also believe that George Lucas' wife at the time, Marcia, won an academy award ...",7,"agreed can you also believe that george lucas' wife at the time, marcia, won an academy award f..."
2,Indeed he doesn't. But I've always liked Chewbacca because he reminded me of a giant dog. Ha! S...,5,indeed he doesn't but i've always liked chewbacca because he reminded me of a giant dog ha! spe...


In [22]:
# prepared data set
df03.iloc[:3,2:3]

Unnamed: 0,Clean_message
0,"not where i lived, if i want to read a comic book, i had to go to the library but as i think ab..."
1,"agreed can you also believe that george lucas' wife at the time, marcia, won an academy award f..."
2,indeed he doesn't but i've always liked chewbacca because he reminded me of a giant dog ha! spe...


# LLM for a sample with a bigram model

In [23]:
# --> setting up the environment by using the nltk library
def generate_text(text):

    # Tokenize the text
    tokens = nltk.word_tokenize(text)

    # create a bigram model
    # --> generate bigrams and their frequency distribution
    #     (for inspection only; the MLE model below builds its own counts)
    bigram_list = list(ngrams(tokens, 2))
    bigram_freq_dist = FreqDist(bigram_list)

    # Prepare the dataset for training
    # NOTE: padded_everygram_pipeline expects a list of tokenized sentences.
    # Given the flat token list, it treats each token as a sentence and each
    # character as a word, so the model learns character bigrams.
    # Pass [tokens] instead to train on words.
    train_data, padded_sents = padded_everygram_pipeline(2, tokens)

    # train the model --> the bigram model
    model = MLE(2)
    model.fit(train_data, padded_sents)

    # generate text seeded with a word from a randomly chosen question
    # --> more questions or other seed words and phrases can be added here
    # --> a larger training corpus should give more varied output
    # example questions
    questions = [
        "What is the content about",
        "How does it work",
        "What are the advantages",
        "How can I use them",
        "What should I consider"
    ]
    num_words = 25
    question = random.choice(questions)
    question_tokens = nltk.word_tokenize(question)
    # if the model has never seen the seed word, generate() backs off to unigram counts
    seed_word = random.choice(question_tokens)
    sentence = [seed_word]
    for _ in range(num_words - 1):
        next_word = model.generate(1, text_seed=sentence)
        sentence.append(next_word)
    return (question, ' '.join(sentence))

In [24]:
df03['Question_Answer'] = df03.Clean_message.apply(generate_text)

In [25]:
# separate question and answer --> 2 features
df03['Question_bigram'] = df03.Question_Answer.str[0]
df03['Answer_bigram'] = df03.Question_Answer.str[1]
# .copy() so that new columns can be added to df04 without a SettingWithCopyWarning
df04 = df03[['Clean_message','Question_bigram','Answer_bigram']].copy()
df04.head()

Unnamed: 0,Clean_message,Question_bigram,Answer_bigram
0,"not where i lived, if i want to read a comic book, i had to go to the library but as i think ab...",How can I use them,"them , </s> , </s> c </s> <s> w a s </s> p e </s> </s> </s> e </s> </s> l i </s> y o"
1,"agreed can you also believe that george lucas' wife at the time, marcia, won an academy award f...",How can I use them,use r s </s> g </s> a t </s> d </s> </s> e e c a b e r d </s> </s> <s> d </s>
2,indeed he doesn't but i've always liked chewbacca because he reminded me of a giant dog ha! spe...,What are the advantages,advantages <s> ' s </s> t h a n t </s> p e r s </s> s </s> <s> m i n k </s> <s>
3,"yes, i have a dog sounds like in the dear amy that they do everything tit for tat the cat owne...",How can I use them,them <s> t i s t </s> </s> y </s> t i a i t h a t h a m a i k e
4,yeah we all know that guy! how about the loch ness monster being protected under the scottish ...,How does it work,does r e p r e i n o t r e </s> h e a l y </s> h </s> n d e p
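
The answers above consist of single characters because the training call receives a flat list of tokens, while `padded_everygram_pipeline` expects a list of tokenized sentences. The sketch below shows one possible word-level variant under stated assumptions: `generate_words` is a hypothetical helper, the text is split on whitespace instead of with `nltk.word_tokenize` (to avoid the tokenizer data download), and the toy sentence is made up.

```python
import random
from nltk.lm import MLE
from nltk.lm.preprocessing import padded_everygram_pipeline

def generate_words(text, seed_word, num_words=10, order=2, random_seed=0):
    # Wrap the tokens in a list: one "sentence" made of words
    sentences = [text.split()]
    train_data, vocab = padded_everygram_pipeline(order, sentences)
    model = MLE(order)
    model.fit(train_data, vocab)
    rng = random.Random(random_seed)  # shared generator for repeatable sampling
    words = [seed_word]
    for _ in range(num_words - 1):
        words.append(model.generate(1, text_seed=words, random_seed=rng))
    return words

text = "the cat sat on the mat and the dog sat on the rug"
words = generate_words(text, "the", num_words=8)
print(' '.join(words))
```

Every generated token is now a whole word from the message or one of the padding symbols `<s>` and `</s>`. For messages with several sentences, splitting into sentences first and tokenizing each one would give the padding symbols their intended meaning.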


# LLM for a sample with a trigram model

In [26]:
# --> setting up the environment by using the nltk library
def generate_text(text):

    # Tokenize the text
    tokens = nltk.word_tokenize(text)

    # create a trigram model
    # --> generate trigrams and their frequency distribution
    #     (for inspection only; the MLE model below builds its own counts)
    trigram_list = list(ngrams(tokens, 3))
    trigram_freq_dist = FreqDist(trigram_list)

    # Prepare the dataset for training
    # NOTE: padded_everygram_pipeline expects a list of tokenized sentences.
    # Given the flat token list, it treats each token as a sentence and each
    # character as a word, so the model learns character trigrams.
    # Pass [tokens] instead to train on words.
    train_data, padded_sents = padded_everygram_pipeline(3, tokens)

    # train the model --> the trigram model
    model = MLE(3)
    model.fit(train_data, padded_sents)

    # generate text seeded with a word from a randomly chosen question
    # --> more questions or other seed words and phrases can be added here
    # --> a larger training corpus should give more varied output
    # example questions
    questions = [
        "What is the content about",
        "How does it work",
        "What are the advantages",
        "How can I use them",
        "What should I consider"
    ]
    num_words = 25
    question = random.choice(questions)
    question_tokens = nltk.word_tokenize(question)
    # if the model has never seen the seed word, generate() backs off to shorter contexts
    seed_word = random.choice(question_tokens)
    sentence = [seed_word]
    # with order 3 every padded sequence ends in two </s> symbols, so once </s>
    # is generated the only seen continuation is another </s>
    for _ in range(num_words - 1):
        next_word = model.generate(1, text_seed=sentence)
        sentence.append(next_word)
    return (question, ' '.join(sentence))

In [27]:
df04['Question_Answer'] = df04.Clean_message.apply(generate_text)

In [28]:
# separate question and answer --> 2 features
df04['Question_trigram'] = df04.Question_Answer.str[0]
df04['Answer_trigram'] = df04.Question_Answer.str[1]
df05 = df04[['Clean_message','Question_bigram','Answer_bigram','Question_trigram','Answer_trigram']]
df05

Unnamed: 0,Clean_message,Question_bigram,Answer_bigram,Question_trigram,Answer_trigram
0,"not where i lived, if i want to read a comic book, i had to go to the library but as i think ab...",How can I use them,"them , </s> , </s> c </s> <s> w a s </s> p e </s> </s> </s> e </s> </s> l i </s> y o",How does it work,does t i n t </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s...
1,"agreed can you also believe that george lucas' wife at the time, marcia, won an academy award f...",How can I use them,use r s </s> g </s> a t </s> d </s> </s> e e c a b e r d </s> </s> <s> d </s>,How can I use them,use </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </...
2,indeed he doesn't but i've always liked chewbacca because he reminded me of a giant dog ha! spe...,What are the advantages,advantages <s> ' s </s> t h a n t </s> p e r s </s> s </s> <s> m i n k </s> <s>,How does it work,How </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </...
3,"yes, i have a dog sounds like in the dear amy that they do everything tit for tat the cat owne...",How can I use them,them <s> t i s t </s> </s> y </s> t i a i t h a t h a m a i k e,How can I use them,How e s </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s...
4,yeah we all know that guy! how about the loch ness monster being protected under the scottish ...,How does it work,does r e p r e i n o t r e </s> h e a l y </s> h </s> n d e p,How does it work,does <s> <s> d e r </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </...
...,...,...,...,...,...
9995,it's a great show for kids although i haven't seen too many no where near ha! i prefer south p...,How can I use them,I o d </s> r k n </s> </s> s </s> p a r </s> </s> t e p r </s> </s> e a </s>,What should I consider,I </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s>...
9996,"i'm honestly not sure it's crazy bart never appeared nor was mentioned in an episode, right? wo...",What should I consider,I </s> </s> k </s> t </s> a y </s> </s> m a s p o k i g e d e </s> r i,How does it work,work </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> <...
9997,"you can definitely get an answer if you post it, there are some very loyal fans of the show, i ...",How can I use them,use <s> k </s> r e d </s> <s> ' v e </s> s t i d e f i t h o o t,How does it work,it d </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> <...
9998,"yeah, definitely makes me think about the things i ""didn't"" accomplish while i was stuck in my ...",What is the content about,is n i d </s> <s> ' t u r o w h o m </s> </s> a h o o o m a h,What is the content about,What r a t e </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s...
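
The long runs of `</s>` in the trigram answers can be traced back to padding. The sketch below is a plain-Python illustration, assuming the usual convention of n - 1 boundary symbols per side; `pad` and `successors` are hypothetical helpers and the two toy sentences are made up. It counts which symbol follows `</s>` in the padded training data for order 2 and order 3.

```python
from collections import Counter, defaultdict

def pad(tokens, n, start="<s>", end="</s>"):
    # n - 1 boundary symbols on each side of the sequence
    return [start] * (n - 1) + list(tokens) + [end] * (n - 1)

def successors(sentences, n):
    # For every symbol, count the symbols that directly follow it
    follow = defaultdict(Counter)
    for sent in sentences:
        padded = pad(sent, n)
        for prev, nxt in zip(padded, padded[1:]):
            follow[prev][nxt] += 1
    return follow

sentences = [["the", "cat", "sat"], ["the", "dog", "ran"]]
after_end_order2 = dict(successors(sentences, 2)["</s>"])
after_end_order3 = dict(successors(sentences, 3)["</s>"])
print(after_end_order2)  # {} -> nothing follows </s>, so generation backs off
print(after_end_order3)  # {'</s>': 2} -> </s> can only be followed by </s>
```

With order 2 the model falls back to unigram counts after `</s>` and keeps producing varied symbols, whereas with order 3 it stays on `</s>`. Stopping generation at the first `</s>` would avoid the repeated end symbols.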
