Introduction:

This document describes a simple sentence generator that mimics real-life language models, and explains how its algorithm works.

The program receives a list of sentences and can then output new sentences that begin with a given starting word.


If you are too impatient to read the whole document, here are some outputs generated by this program:


Note:
Please keep in mind that this project does not use actual machine learning; it was merely inspired by the general idea. More information about the program and its limitations can be found at the end of the documentation.

Section 1: Interpretation

This section discusses how the program reads its input and how it processes that input for "learning".

In [4]:
#This function breaks down sentences into digestible chunks for the program to analyze

PUNCTS = (".", "?", "!") #Tuple of currently accepted punctuation marks

def processSentence(sentence):
    chunks = []
    nextChunk = ""

    for i in sentence:
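        if i in PUNCTS:
            #Only store the pending word if there is one (avoids empty chunks, e.g. for "Hi!!")
            if len(nextChunk) > 0:
                chunks.append(nextChunk)
                nextChunk = ""

            chunks.append(i)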
        elif i == " ":
            if len(nextChunk) > 0:
                chunks.append(nextChunk)
                nextChunk = ""
        else:
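            nextChunk += i

    #Keep the last word when the sentence does not end with punctuation
    if len(nextChunk) > 0:
        chunks.append(nextChunk)

    return chunks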
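
As a cross-check on the chunking described above, the same splitting can be sketched with a regular expression. This is only an illustrative alternative, not the function used by the rest of the program: the name `tokenize` is hypothetical, and unlike `processSentence` it treats every kind of whitespace (not just spaces) as a separator.

```python
import re

def tokenize(sentence):
    # Hypothetical stand-in for processSentence:
    # a chunk is either a run of characters that are neither whitespace
    # nor punctuation, or a single punctuation mark on its own.
    return re.findall(r"[^\s.?!]+|[.?!]", sentence)

print(tokenize("The cat sat. Did it?"))
# ['The', 'cat', 'sat', '.', 'Did', 'it', '?']
```

Each punctuation mark becomes its own chunk, which is what later lets the generator recognise the end of a sentence.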

Section 2: Training

This section discusses how the program conforms itself to the processed input in preparation for sentence generation. This is the second level of preprocessing before the program can actually create its own outputs.

In [2]:
trainedMatrix = {
    #"SampleWord" : [[NextWords], [NextCounts]]
    #The counts are only scaled into probabilities when a sentence is generated
}

def train(trainingSentences):
    for sen in trainingSentences:
        procSen = processSentence(sen)

        for i in range(len(procSen)):
            chk = procSen[i]
            prevChk = procSen[i-1] if i > 0 else None

            #Every word gets an entry, even if nothing has followed it yet
            if chk not in PUNCTS and chk not in trainedMatrix:
                trainedMatrix[chk] = [[], []]

            #Punctuation ends a sentence and has no entry of its own,
            #so skip it as a previous chunk (this used to raise a KeyError)
            if prevChk is not None and prevChk not in PUNCTS:
                nextWrds, nextCounts = trainedMatrix[prevChk]

                if chk in nextWrds:
                    #Seen this pair before: increase its count
                    nextCounts[nextWrds.index(chk)] += 1
                else:
                    #First time this chunk follows prevChk
                    nextWrds.append(chk)
                    nextCounts.append(1)
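
To make the layout of the trained data concrete, here is a minimal sketch that builds the same kind of structure from sentences that have already been split into chunks. The name `build_matrix` and the three toy sentences are assumptions made for illustration. Unlike `train`, it returns a fresh dictionary instead of filling a global one, and it only creates entries for words that have at least one follower.

```python
PUNCTS = (".", "?", "!")

def build_matrix(chunked_sentences):
    # Hypothetical helper: word -> [[chunks seen after it], [how often each was seen]]
    matrix = {}
    for chunks in chunked_sentences:
        # Walk over every (previous chunk, current chunk) pair
        for prev, cur in zip(chunks, chunks[1:]):
            if prev in PUNCTS:
                continue  # punctuation ends a sentence, so it has no followers
            nxt_words, counts = matrix.setdefault(prev, [[], []])
            if cur in nxt_words:
                counts[nxt_words.index(cur)] += 1
            else:
                nxt_words.append(cur)
                counts.append(1)
    return matrix

matrix = build_matrix([
    ["the", "cat", "sat", "."],
    ["the", "cat", "ran", "."],
    ["the", "dog", "sat", "."],
])
print(matrix["the"])  # [['cat', 'dog'], [2, 1]]
print(matrix["cat"])  # [['sat', 'ran'], [1, 1]]
print(matrix["sat"])  # [['.'], [2]]
```

Here "cat" followed "the" twice and "dog" followed it once, so "cat" should be picked about two times out of three when generating from "the".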

Section 3: Generation

This section touches on how the program uses all the preprocessed data to actually create its outputs.

In [3]:
import random

#These functions scale the stored counts into probabilities and then select a random chunk weighted by them, see more in the documentation

#Helper function, this scales the counts so that they sum to 1
def probabilityGradient(probs):
    total = sum(probs)
    return [i/total for i in probs]

#This function returns the index of the randomly chosen section
def selectValue(probs):
    gradient = probabilityGradient(probs)
    target = random.random()
    total = 0
    result = 0

    for bound in gradient:
        total += bound

        if target < total:
            return result
        result += 1

    #Floating-point rounding can leave the running total slightly below 1,
    #so fall back to the last chunk in case nothing was picked prior
    return len(gradient) - 1
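
The selection step can be checked by hand if the random number is passed in instead of drawn inside the function. The sketch below does exactly that; `select_index` is a hypothetical variant of `selectValue` written for this illustration, and the counts and words are made up. The last line shows `random.choices` from the standard library, which performs a comparable weighted pick in one call.

```python
import random

def select_index(counts, target):
    # Hypothetical variant of selectValue with the random number passed in.
    # Walk along the cumulative probabilities until they pass target (0 <= target < 1).
    total = sum(counts)
    running = 0.0
    for index, count in enumerate(counts):
        running += count / total
        if target < running:
            return index
    return len(counts) - 1  # fallback in case rounding keeps running below target

counts = [2, 1, 1]  # scaled: [0.5, 0.25, 0.25], boundaries at 0.5, 0.75, 1.0
print(select_index(counts, 0.10))  # 0
print(select_index(counts, 0.60))  # 1
print(select_index(counts, 0.90))  # 2

# Standard-library alternative: weighted pick straight from the raw counts
words = ["cat", "dog", "bird"]
picked = random.choices(words, weights=counts, k=1)[0]
print(picked)
```

Because `random.choices` accepts unnormalised weights, the counts would not need to be scaled first if that route were taken.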

Once the program can predict the next word, it can generate a full sentence through recursion, feeding each output back in as the next input.

In [1]:
MAX_SENT_LIM = 50 #Hard stop maximum sentence length to prevent possible infinite recursion

def generate(startWrd, matrix):
    sent = [startWrd]

    if not startWrd in matrix.keys():
        print("Starting word hasn't been learnt before")
        return None

    def generateRec(prevWrd):
        if len(sent) >= MAX_SENT_LIM:
            return False
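        else:
            nxtWords = matrix[prevWrd][0]
            nextProbs = matrix[prevWrd][1]

            #A word that was never followed by anything has no candidates, so stop here
            if len(nxtWords) == 0:
                return True

            nxtWord = nxtWords[selectValue(nextProbs)]
            sent.append(nxtWord)

            if nxtWord in PUNCTS:
                return True
            else:
                return generateRec(nxtWord)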

    outcome = generateRec(startWrd)

    if outcome:
        print("Success, returning current result")
        return(sent)
    else:
        print("Exceeded maximum word limit, returning current result")
        return(sent)
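
The same generation idea can also be written as a loop instead of recursion. The following is a minimal sketch under stated assumptions, not a replacement for `generate`: the name `generate_iter`, the constant `MAX_WORDS`, and the hand-written `matrix` are all made up for this example, and the weighted pick is done with `random.choices` from the standard library rather than `selectValue`.

```python
import random

PUNCTS = (".", "?", "!")
MAX_WORDS = 50  # hypothetical cap, playing the same role as MAX_SENT_LIM

def generate_iter(start, matrix):
    # Hypothetical loop-based variant of generate
    if start not in matrix:
        return None  # the starting word was never seen
    sent = [start]
    while len(sent) < MAX_WORDS:
        nxt_words, counts = matrix.get(sent[-1], [[], []])
        if not nxt_words:
            break  # nothing ever followed this word
        nxt = random.choices(nxt_words, weights=counts, k=1)[0]
        sent.append(nxt)
        if nxt in PUNCTS:
            break  # punctuation ends the sentence
    return sent

# Hand-written example data in the word -> [[next chunks], [counts]] layout
matrix = {
    "the": [["cat", "dog"], [2, 1]],
    "cat": [["sat", "ran"], [1, 1]],
    "dog": [["sat"], [1]],
    "sat": [["."], [2]],
    "ran": [["."], [1]],
}

result = generate_iter("the", matrix)
print(" ".join(result))
```

With this toy data every run yields a four-chunk sentence that starts with "the" and ends with ".", while the middle words vary from run to run.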

Section 4: Results

This section provides example inputs and outputs of the program, along with a deeper analysis of them.

Section 5: Implications and Limitations

This section discusses the general idea in more depth, as well as possible future features that could greatly improve this very basic program.

Conclusion:

Overall, although this program obviously has little practical purpose (due to its limitations and its inability to compete with real language models), it may serve as an oversimplified example for curious beginners who are looking into the field of artificial intelligence. At the very least, I had a fun time making this, and I hope that this document has been entertaining to read as well as somewhat educational and insightful.

Thank you for reading.