# LLM - Large Language Model (a simple example)

In [1]:
# Build a basic language model with the nltk library
#   --> a minimum viable product (MVP): a small but complete implementation
#   --> language model = probabilistic model that predicts how likely a sequence of words is in a given context
#   --> used in NLP (natural language processing), e.g. speech recognition, machine translation and text generation
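
To make "probabilistic model" concrete, here is a minimal sketch of how a bigram model can estimate the probability of the next word from raw counts. The function name `bigram_probabilities` and the toy sentence are assumptions for illustration and are not part of the notebook; no smoothing is applied.

```python
from collections import Counter, defaultdict

def bigram_probabilities(tokens):
    """Estimate P(next | current) from raw bigram counts (no smoothing)."""
    counts = defaultdict(Counter)
    for current, nxt in zip(tokens, tokens[1:]):
        counts[current][nxt] += 1
    # Normalize each row of counts so the probabilities sum to 1
    return {
        current: {nxt: c / sum(followers.values()) for nxt, c in followers.items()}
        for current, followers in counts.items()
    }

# Hypothetical toy corpus, split on whitespace for simplicity
tokens = "i read a comic book and i read a novel".split()
probs = bigram_probabilities(tokens)
print(probs["a"])     # "a" is followed once by "comic" and once by "novel"
print(probs["read"])  # "read" is always followed by "a"
```

Each inner dictionary is a probability distribution over the words that were observed after a given word, which is the information the text generator further down samples from.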

In [2]:
# import libraries
import pandas as pd
from sklearn.preprocessing import LabelEncoder
import re
import nltk
import random
from nltk.util import ngrams
from collections import defaultdict, Counter
# Run in case warnings should be ignored
import warnings
warnings.filterwarnings('ignore')
pd.set_option('display.max_colwidth',200)

In [3]:
# loading data
df01 = pd.read_csv('topical_chat.csv')

In [4]:
df01.head()

Unnamed: 0,conversation_id,message,sentiment
0,1,Are you a fan of Google or Microsoft?,Curious to dive deeper
1,1,Both are excellent technology they are helpful in many ways. For the security purpose both are super.,Curious to dive deeper
2,1,"I'm not a huge fan of Google, but I use it a lot because I have to. I think they are a monopoly in some sense.",Curious to dive deeper
3,1,"Google provides online related services and products, which includes online ads, search engine and cloud computing.",Curious to dive deeper
4,1,"Yeah, their services are good. I'm just not a fan of intrusive they can be on our personal lives.",Curious to dive deeper


In [5]:
df01.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 188378 entries, 0 to 188377
Data columns (total 3 columns):
 #   Column           Non-Null Count   Dtype 
---  ------           --------------   ----- 
 0   conversation_id  188378 non-null  int64 
 1   message          188378 non-null  object
 2   sentiment        188378 non-null  object
dtypes: int64(1), object(2)
memory usage: 4.3+ MB


In [6]:
df01.sentiment.value_counts()

sentiment
Curious to dive deeper    80888
Neutral                   41367
Surprised                 30638
Happy                     29617
Sad                        2533
Disgusted                  1433
Fearful                    1026
Angry                       876
Name: count, dtype: int64

In [7]:
# the 'sentiment' column could serve as a label --> encode it as integers
le = LabelEncoder()
df01['sentiment'] = le.fit_transform(df01['sentiment'])
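
The integer codes are easier to read with the mapping at hand. `LabelEncoder` assigns codes in sorted order of the class names, so the mapping can be reproduced with a plain-Python sketch like the one below; the names `code_of` and `label_of` are assumptions for illustration. In the notebook itself, `le.classes_` and `le.inverse_transform` give the same information.

```python
# Sentiment labels as listed by value_counts() above
labels = [
    "Curious to dive deeper", "Neutral", "Surprised", "Happy",
    "Sad", "Disgusted", "Fearful", "Angry",
]

# Codes follow the sorted order of the class names
code_of = {label: code for code, label in enumerate(sorted(labels))}
label_of = {code: label for label, code in code_of.items()}

for code in sorted(label_of):
    print(code, label_of[code])
```

This matches the tables below, where the first rows ("Curious to dive deeper") carry the code 1.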

In [8]:
# use only the messages and the sentiments:
df01 = df01[['message','sentiment']]
df01.head(5)

Unnamed: 0,message,sentiment
0,Are you a fan of Google or Microsoft?,1
1,Both are excellent technology they are helpful in many ways. For the security purpose both are super.,1
2,"I'm not a huge fan of Google, but I use it a lot because I have to. I think they are a monopoly in some sense.",1
3,"Google provides online related services and products, which includes online ads, search engine and cloud computing.",1
4,"Yeah, their services are good. I'm just not a fan of intrusive they can be on our personal lives.",1


In [9]:
df01.message.value_counts()

message
Hi, how are you?                                                                                                                            317
You too!                                                                                                                                     98
Nice chatting with you!                                                                                                                      91
Bye!                                                                                                                                         77
Bye                                                                                                                                          61
                                                                                                                                           ... 
Or he could try being an actor, he could audition as the president, he'd probably do better than Reagan did in his audition as p

In [10]:
# remove duplicated messages
df02 = df01.drop_duplicates(subset='message',keep='first')
df02.head(5)

Unnamed: 0,message,sentiment
0,Are you a fan of Google or Microsoft?,1
1,Both are excellent technology they are helpful in many ways. For the security purpose both are super.,1
2,"I'm not a huge fan of Google, but I use it a lot because I have to. I think they are a monopoly in some sense.",1
3,"Google provides online related services and products, which includes online ads, search engine and cloud computing.",1
4,"Yeah, their services are good. I'm just not a fan of intrusive they can be on our personal lives.",1


In [11]:
df02.message.value_counts().max()

1

In [12]:
df02.isnull().value_counts()

message  sentiment
False    False        184303
Name: count, dtype: int64

In [13]:
df02.shape

(184303, 2)

In [14]:
# keep only messages with at least 200 characters
df02 = df02[df02.message.astype(str).str.len() >= 200]
df02.head()

Unnamed: 0,message,sentiment
167,"Not where I lived, if I want to read a comic book, I had to go to the library. But as I think about it Tintin is a comic book, so I was wrong, I did read a lot of them when I was young but they w...",7
175,"Agreed. Can you also believe that George Lucas' wife at the time, Marcia, won an academy award for her editing on Star Wars and yet George Lucas didn't win an academy award for directing. That is...",7
187,"Indeed he doesn't. But I've always liked Chewbacca because he reminded me of a giant dog. Ha! Speaking of dogs, I think it's cool that dogs, along with elephants, are the only animals that unders...",5
228,"Yes, I have a dog. Sounds like in the Dear Amy that they do everything tit for tat. The cat owner states that her fiance has to get rid of his saltwater fish if she gets rid of her cat. That's ...",7
293,"Yeah we all know that guy! how about the loch ness monster being protected under the Scottish 1912 protection of animals act. Creatures of the great deep are extraordinary. ""The immortal Jelly fi...",1


In [15]:
# message lengths (shortest and longest)
messages_len = df02.message.str.len()
print(messages_len.min(), messages_len.max())

200 697


In [16]:
df02.shape

(10821, 2)

In [17]:
# limit the data set to the first 10,000 rows
# (.copy() so that new columns can be added later without a SettingWithCopyWarning)
df03 = df02.iloc[:10000,:].copy()
df03.head()

Unnamed: 0,message,sentiment
167,"Not where I lived, if I want to read a comic book, I had to go to the library. But as I think about it Tintin is a comic book, so I was wrong, I did read a lot of them when I was young but they w...",7
175,"Agreed. Can you also believe that George Lucas' wife at the time, Marcia, won an academy award for her editing on Star Wars and yet George Lucas didn't win an academy award for directing. That is...",7
187,"Indeed he doesn't. But I've always liked Chewbacca because he reminded me of a giant dog. Ha! Speaking of dogs, I think it's cool that dogs, along with elephants, are the only animals that unders...",5
228,"Yes, I have a dog. Sounds like in the Dear Amy that they do everything tit for tat. The cat owner states that her fiance has to get rid of his saltwater fish if she gets rid of her cat. That's ...",7
293,"Yeah we all know that guy! how about the loch ness monster being protected under the Scottish 1912 protection of animals act. Creatures of the great deep are extraordinary. ""The immortal Jelly fi...",1


In [18]:
# helper functions for text cleaning, combined into one function (clean_text) at the end:
def lower(text):
    return str(text).lower()

# Remove HTML
def remove_HTML(text):
    return re.sub(r'<.*?>', '', text) 

# Removing all digit sequences (standalone numbers and digits inside words; the rest of the word is kept)
def remove_digits(text):
    return re.sub(r'\d+', '', text)

# Removing backslashes:
def remove_backsl(text):
    return text.replace('\\','')

# Removing slashes:
def remove_sl(text):
    return text.replace('/','')

# Removing all dots:
def remove_dots(text):
    return text.replace('.','')

def remove_emoji(text):
    emoji_pattern = re.compile("["
                           u"\U0001F600-\U0001F64F"  # emoticons
                           u"\U0001F300-\U0001F5FF"  # symbols & pictographs
                           u"\U0001F680-\U0001F6FF"  # transport & map symbols
                           u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
                           u"\U00002702-\U000027B0"
                           u"\U000024C2-\U0001F251"
                           "]+", flags=re.UNICODE)
    return emoji_pattern.sub(r'', text)

# Removing all non-ASCII characters like "ड", "ட"
def remove_non_printable(text):
    text = text.encode("ascii", "ignore")
    return text.decode()
        
# One function to clean it all
def clean_text(text):
    text = lower(text)
    text = remove_digits(text)
    text = remove_backsl(text)
    text = remove_sl(text)
    text = remove_dots(text)
    text = remove_emoji(text)
    text = remove_HTML(text)
    text = remove_non_printable(text)
    return text
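
As a quick check of what the cleaning does, the following compact sketch applies the same kinds of steps to one sample string. It is not the notebook's `clean_text`: the name `clean_text_sketch` is hypothetical, HTML is stripped first, the separate emoji step is left out because the ASCII step drops emojis as well, and the final whitespace collapse is an added assumption (removing digits otherwise leaves double spaces).

```python
import re

def clean_text_sketch(text):
    """Compact variant of the cleaning steps above (illustrative only)."""
    text = str(text).lower()
    text = re.sub(r'<.*?>', '', text)       # strip HTML tags
    text = re.sub(r'\d+', '', text)         # strip digit sequences
    for ch in ('\\', '/', '.'):             # strip backslashes, slashes, dots
        text = text.replace(ch, '')
    text = text.encode('ascii', 'ignore').decode()  # drop non-ASCII characters
    # Assumption: collapse the whitespace runs left behind by the removals
    return re.sub(r'\s+', ' ', text).strip()

sample = "The Scottish 1912 act. <b>Wow</b>!"
print(clean_text_sketch(sample))
```

Note that punctuation such as `!`, `,` and `?` is kept, so the tokenizer later treats these characters as separate tokens.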

In [19]:
df03['Clean_message'] = df03['message'].apply(clean_text)
df03.head(5)

Unnamed: 0,message,sentiment,Clean_message
167,"Not where I lived, if I want to read a comic book, I had to go to the library. But as I think about it Tintin is a comic book, so I was wrong, I did read a lot of them when I was young but they w...",7,"not where i lived, if i want to read a comic book, i had to go to the library but as i think about it tintin is a comic book, so i was wrong, i did read a lot of them when i was young but they we..."
175,"Agreed. Can you also believe that George Lucas' wife at the time, Marcia, won an academy award for her editing on Star Wars and yet George Lucas didn't win an academy award for directing. That is...",7,"agreed can you also believe that george lucas' wife at the time, marcia, won an academy award for her editing on star wars and yet george lucas didn't win an academy award for directing that is j..."
187,"Indeed he doesn't. But I've always liked Chewbacca because he reminded me of a giant dog. Ha! Speaking of dogs, I think it's cool that dogs, along with elephants, are the only animals that unders...",5,"indeed he doesn't but i've always liked chewbacca because he reminded me of a giant dog ha! speaking of dogs, i think it's cool that dogs, along with elephants, are the only animals that understa..."
228,"Yes, I have a dog. Sounds like in the Dear Amy that they do everything tit for tat. The cat owner states that her fiance has to get rid of his saltwater fish if she gets rid of her cat. That's ...",7,"yes, i have a dog sounds like in the dear amy that they do everything tit for tat the cat owner states that her fiance has to get rid of his saltwater fish if she gets rid of her cat that's not..."
293,"Yeah we all know that guy! how about the loch ness monster being protected under the Scottish 1912 protection of animals act. Creatures of the great deep are extraordinary. ""The immortal Jelly fi...",1,"yeah we all know that guy! how about the loch ness monster being protected under the scottish protection of animals act creatures of the great deep are extraordinary ""the immortal jelly fish"


In [20]:
# create new, regular indices
df03.reset_index(drop=True,inplace=True)

In [21]:
df03

Unnamed: 0,message,sentiment,Clean_message
0,"Not where I lived, if I want to read a comic book, I had to go to the library. But as I think about it Tintin is a comic book, so I was wrong, I did read a lot of them when I was young but they w...",7,"not where i lived, if i want to read a comic book, i had to go to the library but as i think about it tintin is a comic book, so i was wrong, i did read a lot of them when i was young but they we..."
1,"Agreed. Can you also believe that George Lucas' wife at the time, Marcia, won an academy award for her editing on Star Wars and yet George Lucas didn't win an academy award for directing. That is...",7,"agreed can you also believe that george lucas' wife at the time, marcia, won an academy award for her editing on star wars and yet george lucas didn't win an academy award for directing that is j..."
2,"Indeed he doesn't. But I've always liked Chewbacca because he reminded me of a giant dog. Ha! Speaking of dogs, I think it's cool that dogs, along with elephants, are the only animals that unders...",5,"indeed he doesn't but i've always liked chewbacca because he reminded me of a giant dog ha! speaking of dogs, i think it's cool that dogs, along with elephants, are the only animals that understa..."
3,"Yes, I have a dog. Sounds like in the Dear Amy that they do everything tit for tat. The cat owner states that her fiance has to get rid of his saltwater fish if she gets rid of her cat. That's ...",7,"yes, i have a dog sounds like in the dear amy that they do everything tit for tat the cat owner states that her fiance has to get rid of his saltwater fish if she gets rid of her cat that's not..."
4,"Yeah we all know that guy! how about the loch ness monster being protected under the Scottish 1912 protection of animals act. Creatures of the great deep are extraordinary. ""The immortal Jelly fi...",1,"yeah we all know that guy! how about the loch ness monster being protected under the scottish protection of animals act creatures of the great deep are extraordinary ""the immortal jelly fish"
...,...,...,...
9995,"It's a great show for kids although I haven't seen too many. No where near 750. Ha! I prefer South Park. Speaking of which, did you know the Simpson's crew actually sent the South Park producers ...",5,"it's a great show for kids although i haven't seen too many no where near ha! i prefer south park speaking of which, did you know the simpson's crew actually sent the south park producers flower..."
9996,"I'm honestly not sure. It's crazy Bart never appeared nor was mentioned in an episode, right? Wonder what episode that was? Maybe I can post about it on Facebook and get an answer! Speaking of Fa...",7,"i'm honestly not sure it's crazy bart never appeared nor was mentioned in an episode, right? wonder what episode that was? maybe i can post about it on facebook and get an answer! speaking of fac..."
9997,"You can definitely get an answer if you post it, there are some very loyal fans of the show, I would've never noticed Bart never appeared. Did you know how Facebook came to be and how it was deve...",1,"you can definitely get an answer if you post it, there are some very loyal fans of the show, i would've never noticed bart never appeared did you know how facebook came to be and how it was devel..."
9998,"Yeah, definitely makes me think about the things I ""didn't"" accomplish while I was stuck in my dorm room with roommates back in the day. I wonder how MySpace feels now for turning down Facebook f...",1,"yeah, definitely makes me think about the things i ""didn't"" accomplish while i was stuck in my dorm room with roommates back in the day i wonder how myspace feels now for turning down facebook fo..."


In [22]:
# preparing data set
df03.Clean_message[:3]

0     not where i lived, if i want to read a comic book, i had to go to the library but as i think about it tintin is a comic book, so i was wrong, i did read a lot of them when i was young but they we...
1     agreed can you also believe that george lucas' wife at the time, marcia, won an academy award for her editing on star wars and yet george lucas didn't win an academy award for directing that is j...
2     indeed he doesn't but i've always liked chewbacca because he reminded me of a giant dog ha! speaking of dogs, i think it's cool that dogs, along with elephants, are the only animals that understa...
Name: Clean_message, dtype: object

In [23]:
# function --> tokenize, create N-Gram Model, generate text
def generate_text(text):

    # tokenization
    #   --> function for breaking a text into individual words or tokens
    # download the tokenizer model only if it is missing
    # (newer nltk releases may require the 'punkt_tab' resource instead)
    try:
        nltk.data.find('tokenizers/punkt')
    except LookupError:
        nltk.download('punkt')
    tokens = nltk.word_tokenize(text)
    
    # N-Gram Model
    #   --> an n-gram is a contiguous sequence of n items from a given sample of text
    #   --> build a simple bigram model (n=2) for the LLM
    #   --> store the bigrams and their frequencies in a nested dictionary
    bigrams = list(ngrams(tokens, 2))
    bigram_freq = defaultdict(Counter)

    for w1, w2 in bigrams:
        bigram_freq[w1][w2] += 1
    
    # generating the text
    # --> the seed word is the sixth token of the message (it is not chosen randomly)
    # --> each next word is sampled from the words that followed the current word, weighted by bigram frequency
    n_words = 15
    result = [str(tokens[5])]
    for _ in range(n_words):
        next_word_options = bigram_freq[result[-1]]
        if len(next_word_options) == 0:
            # dead end: the current word is never followed by another word
            break
        next_word = random.choices(list(next_word_options.keys()), list(next_word_options.values()))[0]
        result.append(next_word)
    return ' '.join(result)
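
The generation step can be tried in isolation with a small, dependency-free sketch. It assumes whitespace tokenization instead of `nltk.word_tokenize`, and the names `generate_from_bigrams`, `seed_word` and `rng` are hypothetical; passing a seeded `random.Random` makes a run repeatable.

```python
import random
from collections import Counter, defaultdict

def generate_from_bigrams(tokens, seed_word, n_words=15, rng=None):
    """Sample up to n_words followers, each weighted by bigram frequency."""
    rng = rng or random.Random()
    followers = defaultdict(Counter)
    for w1, w2 in zip(tokens, tokens[1:]):
        followers[w1][w2] += 1

    result = [seed_word]
    for _ in range(n_words):
        options = followers.get(result[-1])
        if not options:
            break  # dead end: no word ever followed the current one
        words, weights = zip(*options.items())
        result.append(rng.choices(words, weights=weights)[0])
    return result

# Hypothetical toy text; whitespace split stands in for the nltk tokenizer
tokens = "the cat sat on the mat and the cat slept".split()
generated = generate_from_bigrams(tokens, seed_word="the", n_words=8, rng=random.Random(0))
print(" ".join(generated))
```

Because every step only looks at the previous word, each adjacent pair in the output is a bigram that occurs in the input, while the sentence as a whole may never have appeared. This is the same effect visible in the `Generated_message` column below.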

In [24]:
df03['Generated_message'] = df03.Clean_message.apply(generate_text)

In [25]:
df03.head()

Unnamed: 0,message,sentiment,Clean_message,Generated_message
0,"Not where I lived, if I want to read a comic book, I had to go to the library. But as I think about it Tintin is a comic book, so I was wrong, I did read a lot of them when I was young but they w...",7,"not where i lived, if i want to read a comic book, i had to go to the library but as i think about it tintin is a comic book, so i was wrong, i did read a lot of them when i was young but they we...","if i want to read a comic book , i lived , if i had to"
1,"Agreed. Can you also believe that George Lucas' wife at the time, Marcia, won an academy award for her editing on Star Wars and yet George Lucas didn't win an academy award for directing. That is...",7,"agreed can you also believe that george lucas' wife at the time, marcia, won an academy award for her editing on star wars and yet george lucas didn't win an academy award for directing that is j...",that george lucas did n't win an academy award for her editing on star wars and
2,"Indeed he doesn't. But I've always liked Chewbacca because he reminded me of a giant dog. Ha! Speaking of dogs, I think it's cool that dogs, along with elephants, are the only animals that unders...",5,"indeed he doesn't but i've always liked chewbacca because he reminded me of a giant dog ha! speaking of dogs, i think it's cool that dogs, along with elephants, are the only animals that understa...","i think it 's cool that understand pointing ! speaking of dogs , along with elephants"
3,"Yes, I have a dog. Sounds like in the Dear Amy that they do everything tit for tat. The cat owner states that her fiance has to get rid of his saltwater fish if she gets rid of her cat. That's ...",7,"yes, i have a dog sounds like in the dear amy that they do everything tit for tat the cat owner states that her fiance has to get rid of his saltwater fish if she gets rid of her cat that's not...",dog sounds like they seem like in the cat owner states that 's not necessarily fair
4,"Yeah we all know that guy! how about the loch ness monster being protected under the Scottish 1912 protection of animals act. Creatures of the great deep are extraordinary. ""The immortal Jelly fi...",1,"yeah we all know that guy! how about the loch ness monster being protected under the scottish protection of animals act creatures of the great deep are extraordinary ""the immortal jelly fish",guy ! how about the immortal jelly fish


In [26]:
df04 = df03[['sentiment','Generated_message']]
df04

Unnamed: 0,sentiment,Generated_message
0,7,"if i want to read a comic book , i lived , if i had to"
1,7,that george lucas did n't win an academy award for her editing on star wars and
2,5,"i think it 's cool that understand pointing ! speaking of dogs , along with elephants"
3,7,dog sounds like they seem like in the cat owner states that 's not necessarily fair
4,1,guy ! how about the immortal jelly fish
...,...,...
9995,5,for kids although i prefer south park producers flowers after they aired the family guy episode
9996,7,it 's crazy how far it was ? wonder what episode that was launched on facebook
9997,1,answer if you can definitely get an answer if you know how facebook came to be
9998,1,think about the day i was stuck in my dorm room with roommates back in my


In [27]:
df04.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 2 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   sentiment          10000 non-null  int32 
 1   Generated_message  10000 non-null  object
dtypes: int32(1), object(1)
memory usage: 117.3+ KB
