# LLM - Large Language Model (an expanding example)

In [1]:
# build a basic language model with the nltk library
#   --> minimal viable product (MVP): a small but complete and detailed implementation
#   --> language model = probabilistic model --> estimates how likely a sequence of words is in a given context
#   --> used in NLP (natural language processing) --> speech recognition, machine translation, and text generation
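
To make "probabilistic model" concrete, here is a minimal sketch of a maximum-likelihood bigram estimate written with the standard library only. The toy corpus and the helper name `bigram_prob` are made up for illustration and are not part of the notebook's data or of nltk.

```python
from collections import Counter

# Toy corpus (made up for illustration, not taken from the dataset)
corpus = [
    ["i", "like", "dogs"],
    ["i", "like", "cats"],
    ["i", "have", "dogs"],
]

bigram_counts = Counter()   # how often (prev, word) occurs
context_counts = Counter()  # how often prev occurs as a context
for sentence in corpus:
    for prev, word in zip(sentence, sentence[1:]):
        bigram_counts[(prev, word)] += 1
        context_counts[prev] += 1

def bigram_prob(prev, word):
    """Maximum-likelihood estimate P(word | prev) = count(prev, word) / count(prev)."""
    if context_counts[prev] == 0:
        return 0.0
    return bigram_counts[(prev, word)] / context_counts[prev]

print(bigram_prob("i", "like"))     # 2 of the 3 bigrams that start with "i"
print(bigram_prob("like", "dogs"))  # 1 of the 2 bigrams that start with "like"
```

The trigram model used later works the same way, only with two preceding words as context instead of one.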

In [2]:
# import libraries
import pandas as pd
from sklearn.preprocessing import LabelEncoder
import re
import nltk
import random
from nltk import word_tokenize, sent_tokenize
from nltk.lm import MLE
from nltk.lm.preprocessing import padded_everygram_pipeline
#from nltk.util import ngrams
#from collections import defaultdict, Counter
# Run in case warnings should be ignored
import warnings
warnings.filterwarnings('ignore')
pd.set_option('display.max_colwidth',200)

In [3]:
# loading data
df01 = pd.read_csv('topical_chat.csv')

In [4]:
df01.head()

Unnamed: 0,conversation_id,message,sentiment
0,1,Are you a fan of Google or Microsoft?,Curious to dive deeper
1,1,Both are excellent technology they are helpful in many ways. For the security purpose both are super.,Curious to dive deeper
2,1,"I'm not a huge fan of Google, but I use it a lot because I have to. I think they are a monopoly in some sense.",Curious to dive deeper
3,1,"Google provides online related services and products, which includes online ads, search engine and cloud computing.",Curious to dive deeper
4,1,"Yeah, their services are good. I'm just not a fan of intrusive they can be on our personal lives.",Curious to dive deeper


In [5]:
df01.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 188378 entries, 0 to 188377
Data columns (total 3 columns):
 #   Column           Non-Null Count   Dtype 
---  ------           --------------   ----- 
 0   conversation_id  188378 non-null  int64 
 1   message          188378 non-null  object
 2   sentiment        188378 non-null  object
dtypes: int64(1), object(2)
memory usage: 4.3+ MB


In [6]:
df01.sentiment.value_counts()

sentiment
Curious to dive deeper    80888
Neutral                   41367
Surprised                 30638
Happy                     29617
Sad                        2533
Disgusted                  1433
Fearful                    1026
Angry                       876
Name: count, dtype: int64

In [7]:
# the 'sentiment' column could serve as my label --> i encode it as integers
le = LabelEncoder()
df01['sentiment'] = le.fit_transform(df01['sentiment'])
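
The integer codes are easier to read with the mapping at hand. `LabelEncoder` numbers the classes in sorted order, so the mapping can be sketched without sklearn; the variable `code_of` is only an illustrative name. On the fitted encoder, `le.classes_` holds the same ordering.

```python
# Labels as listed in the value_counts() output above
labels = ["Curious to dive deeper", "Neutral", "Surprised", "Happy",
          "Sad", "Disgusted", "Fearful", "Angry"]

# LabelEncoder assigns integer codes in sorted order of the class names
code_of = {label: code for code, label in enumerate(sorted(labels))}
for label, code in code_of.items():
    print(code, label)
```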

In [8]:
# use only the messages and the sentiments:
df01 = df01[['message','sentiment']]
df01.head(5)

Unnamed: 0,message,sentiment
0,Are you a fan of Google or Microsoft?,1
1,Both are excellent technology they are helpful in many ways. For the security purpose both are super.,1
2,"I'm not a huge fan of Google, but I use it a lot because I have to. I think they are a monopoly in some sense.",1
3,"Google provides online related services and products, which includes online ads, search engine and cloud computing.",1
4,"Yeah, their services are good. I'm just not a fan of intrusive they can be on our personal lives.",1


In [9]:
df01.message.value_counts()

message
Hi, how are you?                                                                                                                            317
You too!                                                                                                                                     98
Nice chatting with you!                                                                                                                      91
Bye!                                                                                                                                         77
Bye                                                                                                                                          61
                                                                                                                                           ... 
Or he could try being an actor, he could audition as the president, he'd probably do better than Reagan did in his audition as p

In [10]:
# remove duplicated messages
df02 = df01.drop_duplicates(subset='message',keep='first')
df02.head(5)

Unnamed: 0,message,sentiment
0,Are you a fan of Google or Microsoft?,1
1,Both are excellent technology they are helpful in many ways. For the security purpose both are super.,1
2,"I'm not a huge fan of Google, but I use it a lot because I have to. I think they are a monopoly in some sense.",1
3,"Google provides online related services and products, which includes online ads, search engine and cloud computing.",1
4,"Yeah, their services are good. I'm just not a fan of intrusive they can be on our personal lives.",1


In [11]:
df02.message.value_counts().max()

1

In [12]:
df02.isnull().value_counts()

message  sentiment
False    False        184303
Name: count, dtype: int64

In [13]:
df02.shape

(184303, 2)

In [14]:
# i only keep messages with at least 200 characters
df02 = df02[df02.message.astype(str).str.len() >= 200]
df02.head()

Unnamed: 0,message,sentiment
167,"Not where I lived, if I want to read a comic book, I had to go to the library. But as I think about it Tintin is a comic book, so I was wrong, I did read a lot of them when I was young but they w...",7
175,"Agreed. Can you also believe that George Lucas' wife at the time, Marcia, won an academy award for her editing on Star Wars and yet George Lucas didn't win an academy award for directing. That is...",7
187,"Indeed he doesn't. But I've always liked Chewbacca because he reminded me of a giant dog. Ha! Speaking of dogs, I think it's cool that dogs, along with elephants, are the only animals that unders...",5
228,"Yes, I have a dog. Sounds like in the Dear Amy that they do everything tit for tat. The cat owner states that her fiance has to get rid of his saltwater fish if she gets rid of her cat. That's ...",7
293,"Yeah we all know that guy! how about the loch ness monster being protected under the Scottish 1912 protection of animals act. Creatures of the great deep are extraordinary. ""The immortal Jelly fi...",1


In [15]:
# message lengths (shortest and longest)
messages_len = df02.message.str.len()
print(messages_len.min(), messages_len.max())

200 697


In [16]:
df02.shape

(10821, 2)

In [17]:
# i limit the data set to the first 10,000 messages
# --> .copy() so that new columns can be added later without a SettingWithCopyWarning
df03 = df02.iloc[:10000,:].copy()
df03.head()

Unnamed: 0,message,sentiment
167,"Not where I lived, if I want to read a comic book, I had to go to the library. But as I think about it Tintin is a comic book, so I was wrong, I did read a lot of them when I was young but they w...",7
175,"Agreed. Can you also believe that George Lucas' wife at the time, Marcia, won an academy award for her editing on Star Wars and yet George Lucas didn't win an academy award for directing. That is...",7
187,"Indeed he doesn't. But I've always liked Chewbacca because he reminded me of a giant dog. Ha! Speaking of dogs, I think it's cool that dogs, along with elephants, are the only animals that unders...",5
228,"Yes, I have a dog. Sounds like in the Dear Amy that they do everything tit for tat. The cat owner states that her fiance has to get rid of his saltwater fish if she gets rid of her cat. That's ...",7
293,"Yeah we all know that guy! how about the loch ness monster being protected under the Scottish 1912 protection of animals act. Creatures of the great deep are extraordinary. ""The immortal Jelly fi...",1


In [18]:
# i use the first and the last sentence of every message to work with
# --> second_sentence keeps its name, but it returns the final segment of the split
sentenceEnd = re.compile(r'[.!?]\s{1,2}(?=[A-Z])')

def first_sentence(text):
    # the second argument of split() is maxsplit, not flags, so none is passed
    sentenceList = sentenceEnd.split(text)
    return sentenceList[0]

def second_sentence(text):
    sentenceList = sentenceEnd.split(text)
    return sentenceList[-1]
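
A small, self-contained check of the splitting pattern on a made-up sentence. It also shows that the second positional argument of a compiled pattern's `split` is `maxsplit`, so regex flags belong in `re.compile`. The names `sentence_end` and `parts` are illustrative only.

```python
import re

# Same pattern as above, written as a raw string
sentence_end = re.compile(r'[.!?]\s{1,2}(?=[A-Z])')

text = 'Agreed. Can you believe it? That is just downright unbelievable!'
parts = sentence_end.split(text)
print(parts[0])    # first segment
print(parts[-1])   # last segment

# The second positional argument of Pattern.split is maxsplit, not flags
print(sentence_end.split(text, 1))
```

The split only happens when an uppercase letter follows the whitespace, which is why a sentence that starts with a quote or a lowercase letter stays attached to the previous one.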

In [19]:
df03['message_first_sentence'] = df03.message.apply(first_sentence)
df03['message_second_sentence'] = df03.message.apply(second_sentence)
df03.head()

Unnamed: 0,message,sentiment,message_first_sentence,message_second_sentence
167,"Not where I lived, if I want to read a comic book, I had to go to the library. But as I think about it Tintin is a comic book, so I was wrong, I did read a lot of them when I was young but they w...",7,"Not where I lived, if I want to read a comic book, I had to go to the library","But as I think about it Tintin is a comic book, so I was wrong, I did read a lot of them when I was young but they were European."
175,"Agreed. Can you also believe that George Lucas' wife at the time, Marcia, won an academy award for her editing on Star Wars and yet George Lucas didn't win an academy award for directing. That is...",7,Agreed,That is just downright unbelievable!
187,"Indeed he doesn't. But I've always liked Chewbacca because he reminded me of a giant dog. Ha! Speaking of dogs, I think it's cool that dogs, along with elephants, are the only animals that unders...",5,Indeed he doesn't,"Speaking of dogs, I think it's cool that dogs, along with elephants, are the only animals that understand pointing!"
228,"Yes, I have a dog. Sounds like in the Dear Amy that they do everything tit for tat. The cat owner states that her fiance has to get rid of his saltwater fish if she gets rid of her cat. That's ...",7,"Yes, I have a dog",They seem like they may struggle as a couple in the future.
293,"Yeah we all know that guy! how about the loch ness monster being protected under the Scottish 1912 protection of animals act. Creatures of the great deep are extraordinary. ""The immortal Jelly fi...",1,Yeah we all know that guy! how about the loch ness monster being protected under the Scottish 1912 protection of animals act,"Creatures of the great deep are extraordinary. ""The immortal Jelly fish."


In [20]:
# helper functions to clean the text, combined into one function (clean_text) below:
def lower(text):
    return str(text).lower()

# Remove HTML
def remove_HTML(text):
    return re.sub(r'<.*?>', '', text) 

# Removing all digits (inside words and standalone); the rest of the word is kept
def remove_digits(text):
    return re.sub(r'\d+', '', text)

# Removing backslashes:
def remove_backsl(text):
    return text.replace('\\','')

# Removing slashes:
def remove_sl(text):
    return text.replace('/','')

# Removing all dots:
def remove_dots(text):
    return text.replace('.','')

def remove_emoji(text):
    emoji_pattern = re.compile("["
                           u"\U0001F600-\U0001F64F"  # emoticons
                           u"\U0001F300-\U0001F5FF"  # symbols & pictographs
                           u"\U0001F680-\U0001F6FF"  # transport & map symbols
                           u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
                           u"\U00002702-\U000027B0"
                           u"\U000024C2-\U0001F251"
                           "]+", flags=re.UNICODE)
    return emoji_pattern.sub(r'', text)

# Removing all non-ASCII characters like "ड", "ட"
def remove_non_printable(text):
    text = text.encode("ascii", "ignore")
    return text.decode()
        
# One function to clean it all
def clean_text(text):
    text = lower(text)
    text = remove_digits(text)
    text = remove_backsl(text)
    text = remove_sl(text)
    text = remove_dots(text)
    text = remove_emoji(text)
    text = remove_HTML(text)
    text = remove_non_printable(text)
    return text
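
Removing digits and dots can leave doubled or trailing spaces behind. A possible extra step, sketched here as a hypothetical helper `collapse_whitespace` that is not part of the pipeline above, squeezes whitespace runs and trims the ends:

```python
import re

def collapse_whitespace(text):
    """Hypothetical extra step: squeeze whitespace runs and trim both ends."""
    return re.sub(r'\s+', ' ', text).strip()

sample = 'the scottish  protection of animals act  '
print(repr(collapse_whitespace(sample)))
```

If used, it would fit best as the last call inside `clean_text`, after all removals have happened.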

In [21]:
df03['Clean_message_first_sentence'] = df03['message_first_sentence'].apply(clean_text)
df03['Clean_message_second_sentence'] = df03['message_second_sentence'].apply(clean_text)
df03.head(5)

Unnamed: 0,message,sentiment,message_first_sentence,message_second_sentence,Clean_message_first_sentence,Clean_message_second_sentence
167,"Not where I lived, if I want to read a comic book, I had to go to the library. But as I think about it Tintin is a comic book, so I was wrong, I did read a lot of them when I was young but they w...",7,"Not where I lived, if I want to read a comic book, I had to go to the library","But as I think about it Tintin is a comic book, so I was wrong, I did read a lot of them when I was young but they were European.","not where i lived, if i want to read a comic book, i had to go to the library","but as i think about it tintin is a comic book, so i was wrong, i did read a lot of them when i was young but they were european"
175,"Agreed. Can you also believe that George Lucas' wife at the time, Marcia, won an academy award for her editing on Star Wars and yet George Lucas didn't win an academy award for directing. That is...",7,Agreed,That is just downright unbelievable!,agreed,that is just downright unbelievable!
187,"Indeed he doesn't. But I've always liked Chewbacca because he reminded me of a giant dog. Ha! Speaking of dogs, I think it's cool that dogs, along with elephants, are the only animals that unders...",5,Indeed he doesn't,"Speaking of dogs, I think it's cool that dogs, along with elephants, are the only animals that understand pointing!",indeed he doesn't,"speaking of dogs, i think it's cool that dogs, along with elephants, are the only animals that understand pointing!"
228,"Yes, I have a dog. Sounds like in the Dear Amy that they do everything tit for tat. The cat owner states that her fiance has to get rid of his saltwater fish if she gets rid of her cat. That's ...",7,"Yes, I have a dog",They seem like they may struggle as a couple in the future.,"yes, i have a dog",they seem like they may struggle as a couple in the future
293,"Yeah we all know that guy! how about the loch ness monster being protected under the Scottish 1912 protection of animals act. Creatures of the great deep are extraordinary. ""The immortal Jelly fi...",1,Yeah we all know that guy! how about the loch ness monster being protected under the Scottish 1912 protection of animals act,"Creatures of the great deep are extraordinary. ""The immortal Jelly fish.",yeah we all know that guy! how about the loch ness monster being protected under the scottish protection of animals act,"creatures of the great deep are extraordinary ""the immortal jelly fish"


In [22]:
# create new, regular indices
df03.reset_index(drop=True,inplace=True)

In [23]:
df03.head(3)

Unnamed: 0,message,sentiment,message_first_sentence,message_second_sentence,Clean_message_first_sentence,Clean_message_second_sentence
0,"Not where I lived, if I want to read a comic book, I had to go to the library. But as I think about it Tintin is a comic book, so I was wrong, I did read a lot of them when I was young but they w...",7,"Not where I lived, if I want to read a comic book, I had to go to the library","But as I think about it Tintin is a comic book, so I was wrong, I did read a lot of them when I was young but they were European.","not where i lived, if i want to read a comic book, i had to go to the library","but as i think about it tintin is a comic book, so i was wrong, i did read a lot of them when i was young but they were european"
1,"Agreed. Can you also believe that George Lucas' wife at the time, Marcia, won an academy award for her editing on Star Wars and yet George Lucas didn't win an academy award for directing. That is...",7,Agreed,That is just downright unbelievable!,agreed,that is just downright unbelievable!
2,"Indeed he doesn't. But I've always liked Chewbacca because he reminded me of a giant dog. Ha! Speaking of dogs, I think it's cool that dogs, along with elephants, are the only animals that unders...",5,Indeed he doesn't,"Speaking of dogs, I think it's cool that dogs, along with elephants, are the only animals that understand pointing!",indeed he doesn't,"speaking of dogs, i think it's cool that dogs, along with elephants, are the only animals that understand pointing!"


In [24]:
# prepared data set
df03.iloc[:3,4:]

Unnamed: 0,Clean_message_first_sentence,Clean_message_second_sentence
0,"not where i lived, if i want to read a comic book, i had to go to the library","but as i think about it tintin is a comic book, so i was wrong, i did read a lot of them when i was young but they were european"
1,agreed,that is just downright unbelievable!
2,indeed he doesn't,"speaking of dogs, i think it's cool that dogs, along with elephants, are the only animals that understand pointing!"


# Expanding this LLM

In [25]:
df03.iloc[0,5]

'but as i think about it tintin is a comic book, so i was wrong, i did read a lot of them when i was young but they were european  '

In [26]:
# Combine 'Clean_message_first_sentence' and 'Clean_message_second_sentence' into a new column 'Combined'
df03['Combined'] = df03['Clean_message_first_sentence'] + ' ' + df03['Clean_message_second_sentence']

In [27]:
df03.iloc[:3,4:]

Unnamed: 0,Clean_message_first_sentence,Clean_message_second_sentence,Combined
0,"not where i lived, if i want to read a comic book, i had to go to the library","but as i think about it tintin is a comic book, so i was wrong, i did read a lot of them when i was young but they were european","not where i lived, if i want to read a comic book, i had to go to the library but as i think about it tintin is a comic book, so i was wrong, i did read a lot of them when i was young but they we..."
1,agreed,that is just downright unbelievable!,agreed that is just downright unbelievable!
2,indeed he doesn't,"speaking of dogs, i think it's cool that dogs, along with elephants, are the only animals that understand pointing!","indeed he doesn't speaking of dogs, i think it's cool that dogs, along with elephants, are the only animals that understand pointing!"


In [28]:
# generate question/answer pairs
# --> tokenize, create model, train model, generate text seeded with a question

def generate_data(text):
    # --> text is one entry of my 'Combined' column
    # --> text should be clean and well-formatted
    # --> sent_tokenize/word_tokenize need the nltk tokenizer data (punkt) to be downloaded

    # tokenization
    # --> split the combined text into sentences, then each sentence into words
    sent_tokens = sent_tokenize(text)
    word_tokens = [word_tokenize(t) for t in sent_tokens]

    # create a trigram model
    # --> i use a trigram model
    # --> each word is predicted from the two words before it
    n = 3
    train_data, padded_sents = padded_everygram_pipeline(n, word_tokens)

    # train the model
    # --> instantiate the MLE model and fit it with the training data of this single message
    model = MLE(n)
    model.fit(train_data, padded_sents)

    # generate text for a randomly chosen question
    # --> the question is passed as text_seed; the generated words come only from this message's vocabulary
    # --> more varied answers would need a model trained on more text than one message
    # example questions
    questions = [
        "What is the content about",
        "How does it work",
        "What are the advantages",
        "How can I use them",
        "What should I consider"
    ]
    num_words = 25
    question = random.choice(questions)
    word_list = model.generate(num_words, text_seed=question.split())
    return (question,' '.join(word_list))
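
The generated word list can contain the padding symbols `<s>` and `</s>` that the padding pipeline adds to every sentence. A minimal sketch of a post-processing step that drops them before joining; `strip_padding` is a hypothetical helper, not an nltk function, and the token list is a made-up example in the shape of the generated output.

```python
def strip_padding(tokens, pad_symbols=("<s>", "</s>")):
    """Hypothetical helper: drop sentence-padding symbols from generated tokens."""
    return [token for token in tokens if token not in pad_symbols]

generated = ["only", "animals", "that", "understand", "pointing", "!", "</s>", "</s>"]
print(" ".join(strip_padding(generated)))
```

An answer that consists of padding only becomes an empty string this way, which makes such rows easy to spot and filter.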

In [29]:
df03['Question_Answer'] = df03.Combined.apply(generate_data)
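
One likely reason why many answers look unrelated to the question: with a trigram model only the last two words of the seed serve as context, and those two words rarely occur together in a single message. The sketch below checks this for a token list. It assumes, as far as I understand nltk's behaviour, that `generate` falls back to shorter contexts when the seed context was never seen in training; `seed_context_seen` is a hypothetical helper.

```python
def seed_context_seen(seed, tokens, n=3):
    """Hypothetical check: do the last n-1 seed words occur as a run in tokens?"""
    context = tuple(seed.split()[-(n - 1):])
    # all runs of n-1 consecutive tokens
    windows = zip(*(tokens[i:] for i in range(n - 1)))
    return context in set(windows)

tokens = "i had to go to the library".split()
print(seed_context_seen("How can I use them", tokens))
print(seed_context_seen("where to go", tokens))
```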

In [30]:
df03['Question'] = [df03.Question_Answer.iloc[i][0] for i in range(len(df03))]

In [31]:
df03['Answer'] = [df03.Question_Answer.iloc[i][1] for i in range(len(df03))]

In [32]:
df03.head(3)

Unnamed: 0,message,sentiment,message_first_sentence,message_second_sentence,Clean_message_first_sentence,Clean_message_second_sentence,Combined,Question_Answer,Question,Answer
0,"Not where I lived, if I want to read a comic book, I had to go to the library. But as I think about it Tintin is a comic book, so I was wrong, I did read a lot of them when I was young but they w...",7,"Not where I lived, if I want to read a comic book, I had to go to the library","But as I think about it Tintin is a comic book, so I was wrong, I did read a lot of them when I was young but they were European.","not where i lived, if i want to read a comic book, i had to go to the library","but as i think about it tintin is a comic book, so i was wrong, i did read a lot of them when i was young but they were european","not where i lived, if i want to read a comic book, i had to go to the library but as i think about it tintin is a comic book, so i was wrong, i did read a lot of them when i was young but they we...","(How can I use them, when i was wrong , i did read a lot of them when i was wrong , i had to go to the library but)",How can I use them,"when i was wrong , i did read a lot of them when i was wrong , i had to go to the library but"
1,"Agreed. Can you also believe that George Lucas' wife at the time, Marcia, won an academy award for her editing on Star Wars and yet George Lucas didn't win an academy award for directing. That is...",7,Agreed,That is just downright unbelievable!,agreed,that is just downright unbelievable!,agreed that is just downright unbelievable!,"(What is the content about, </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s>)",What is the content about,</s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s>
2,"Indeed he doesn't. But I've always liked Chewbacca because he reminded me of a giant dog. Ha! Speaking of dogs, I think it's cool that dogs, along with elephants, are the only animals that unders...",5,Indeed he doesn't,"Speaking of dogs, I think it's cool that dogs, along with elephants, are the only animals that understand pointing!",indeed he doesn't,"speaking of dogs, i think it's cool that dogs, along with elephants, are the only animals that understand pointing!","indeed he doesn't speaking of dogs, i think it's cool that dogs, along with elephants, are the only animals that understand pointing!","(How does it work, only animals that understand pointing ! </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s>)",How does it work,only animals that understand pointing ! </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s>


In [33]:
df04 = df03[['Question','Answer']]
df04

Unnamed: 0,Question,Answer
0,How can I use them,"when i was wrong , i did read a lot of them when i was wrong , i had to go to the library but"
1,What is the content about,</s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s>
2,How does it work,only animals that understand pointing ! </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s>
3,How can I use them,they may struggle as a couple in the future </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s>
4,What are the advantages,immortal jelly fish </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s>
...,...,...
9995,What should I consider,"many speaking of which , did you know the simpson 's crew actually sent the south park producers flowers after they aired the family guy"
9996,How does it work,</s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s>
9997,What are the advantages,"show , i would 've never noticed bart never appeared did you know how facebook came to be and how it was developed ? </s>"
9998,What are the advantages,the things i `` did n't '' accomplish while i was stuck in my dorm room with roommates back in the day how would you
