Caitlin Neppl  
LING 4100: Machine Learning in Linguistics  
Fall 2019  
Prof Mans Hulden  
Final Project  

Character Classification using Stochastic Gradient Descent Training
==========


**-- Data --**

The webcomic _Homestuck_ (2009-2016) begins with four human kids who are online friends. They communicate with one another throughout the comic via the "Pesterchum" instant messaging platform. The comic is known for its length (roughly 900,000 words), and much of the word count is taken up by these "pesterlogs".
<img src= "pesterlog.png" alt="Pesterlog" width="600" height="200">

Many of the characters in the comic have very particular ways of typing (it's an alien thing), which helps to differentiate them from one another and is what initially gave me the idea for this project. However, it was much simpler to stick to the human kids, as their data is a lot easier to parse and compare. I wanted to see whether their vocabularies differ enough for a machine learning approach to predict the speaker with high accuracy, and to explore what kinds of multi-class classifiers are out there and how they work.

The text file that contained all of the text from the webcomic was obtained courtesy of http://readmspa.org/. I was able to extract the lines from the pesterlogs of the four main characters. 

As such, my classifier has four classes, one for each character: Dave Strider, John Egbert, Jade Harley, and Rose Lalonde. Each example is one character's dialog from a single pesterlog exchange. Features are words. My initial goal was to train a classifier on data from the earlier parts of the comic and test it on the later parts (when the characters are older), but I was surprised by how much data I needed to make the classifier work decently.


There are many multi-class classifier models out there, but I landed on a Stochastic Gradient Descent (SGD) model, in part because it is well supported by the scikit-learn library, which I used for this project. Prior to classification, the data was processed using NLTK and plain Python. Sklearn was then used to convert the text files into a matrix of token counts, with stop words removed. The counts were reweighted using TF-IDF, and I used the SGD training in sklearn's linear_model module for the classifier.

_Note:_ The cleaning code below is kind of wonky, since I changed the file structure as I figured out how to handle the data. The fully processed example files are up on my GitHub: https://github.com/cnepp/ling4100_project

In [52]:
def writeFile(filename, text):
    # create the file if it doesn't exist, and append to the end of it
    with open(filename, 'a+', encoding='utf8') as f:
        for i in text:
            f.write(i)
    print("file written to:", filename)

# separate character lines from the entire script
count = 0     # counter for keeping track of which example (conversation) we are in
act6 = False  # set once the Act 6 marker has been seen

with open('hsscrpt.txt', 'rt', encoding='utf8') as file:
    # iterate through all lines in file
    for line in file:

        # need to change conditions once we enter act 6 - handle entanglement
        # (e.g. roxy and dave are both 'TG')
        if "[S] ACT 6" in line:
            act6 = True

        if act6:
            ds_conditions = ("DAVE: " in line) or ("DAVESPRITE: " in line)
            je_conditions = ("JOHN: " in line) or ("EB: " in line)
            jh_conditions = ("JADE: " in line) or ("JADESPRITE: " in line)
            rl_conditions = ("ROSE: " in line)
        else:
            # conditions for acts 1-5
            ds_conditions = ("DAVE: " in line) or ("TG: " in line) or ("DAVESPRITE: " in line)
            je_conditions = ("JOHN: " in line) or ("EB: " in line) or ("GT: " in line)
            jh_conditions = ("JADE: " in line) or ("GG:" in line) or ("JADESPRITE: " in line)
            rl_conditions = ("ROSE: " in line) or ("TT: " in line)

        # indicates beginning of convo
        if ("pesterlog" in line) or ("spritelog" in line) or ("dialoglog" in line):
            count = count + 1

        # dave
        if ds_conditions:
            path = 'examples/ds/ds_' + str(count) + '.txt'
            #writeFile(path, line)

        # john
        if je_conditions:
            path = 'examples/je/je_' + str(count) + '.txt'
            #writeFile(path, line)

        # jade
        if jh_conditions:
            path = 'examples/jh/jh_' + str(count) + '.txt'
            #writeFile(path, line)

        # rose
        if rl_conditions:
            path = 'examples/rl/rl_' + str(count) + '.txt'
            #writeFile(path, line)


## Cleaning the Data

* Used NLTK to isolate tokens and clean them up
* Cleaned by:
    * tokenizing with NLTK's TweetTokenizer
    * converting everything to lowercase
    * stripping punctuation
    * dropping tokens that contain non-alphabetic characters
    * removing the chat handles / names that preface each line
* Stored character info and the paths of the previously separated files in a dict
* Wrote the cleaned data for each example into a separate file

 Four classes: dave, john, rose, jade  
 An example is one character's lines from a single pesterlog  
 Features are the words in the example (drawn from a very large overall vocabulary)  
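
As a rough illustration of the cleaning steps listed above, the sketch below runs one made-up line through them using only the standard library. The sample line is hypothetical, and `str.split` stands in for the NLTK tokenizer, so results on the real data will differ in the details.

```python
# Hypothetical sample line; str.split is a simplified stand-in for the NLTK tokenizer.
line = "TG: hey whats up, egbert!!"
handles = {"dave", "tg", "davesprite"}  # name, handle, and aka for Dave

punct = '!#$%&"()*+,-./:;?' + "'"
table = str.maketrans('', '', punct)

tokens = line.split()                               # tokenize (simplified)
tokens = [w.lower() for w in tokens]                # all lowercase
tokens = [w.translate(table) for w in tokens]       # no punctuation
tokens = [w for w in tokens if w.isalpha()]         # no non-alpha tokens
cleaned = [w for w in tokens if w not in handles]   # drop handles / names

print(cleaned)
```

The chat handle that prefaces the line disappears in the last step, which is why each character's dict stores a name, a handle, and an aka.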

In [223]:
import nltk, os

# dictionaries for ea character to make data processing a lil easier

# dave strider (CLASS 0)
daveStrider = {
    'examples': 'raw_examples/ds/dirt/',
    'initials' : 'ds',
    'name' : 'dave',
    'handle' : 'tg',
    'aka' : 'davesprite'
}
# john egbert (CLASS 1)
johnEgbert = {
    'examples': 'raw_examples/je/dirt/',
    'initials' : 'je',
    'name' : 'john',
    'handle' : 'eb',
    'aka' : 'gt'
}
# jade harley (CLASS 2)
jadeHarley = {
    'examples': 'raw_examples/jh/dirt/',
    'initials' : 'jh',
    'name' : 'jade',
    'handle' : 'gg',
    'aka' : 'jadesprite'
}
# rose lalonde (CLASS 3)
roseLalonde = {
    'examples': 'raw_examples/rl/dirt/',
    'initials' : 'rl',
    'name' : 'rose',
    'handle' : 'tt',
    'aka' : ''
}

# helper functions to clean up data

def readFile(filename):
    # read as utf8 to match the encoding used when the files were written
    with open(filename, 'rt', encoding='utf8') as file:
        return file.read()
    
def tokenize(pesterlog):
    # TweetTokenizer is much more forgiving about punctuation
    from nltk.tokenize import TweetTokenizer
    tt = TweetTokenizer()
    tokens = tt.tokenize(pesterlog)
    return tokens

def lowercase(tokens):
    tokens = [w.lower() for w in tokens]
    return tokens

def punctuation(tokens):
    punct = '!#$%&"()*+,-./:;?' 
    punct = punct + "'"  # add the apostrophe separately, since the string above is single-quoted
    table = str.maketrans('', '', punct)
    stripped = [w.translate(table) for w in tokens]
    return stripped

def removeNonAlpha(strippedwords):
    # remove remaining tokens (containing) non-alphabetic characters
    words = [word for word in strippedwords if word.isalpha()]
    return words

def removeWords(name, handle, aka, words):    
    # remove handles/names
    handles = {name, handle, aka}
    words = [w for w in words if w not in handles]
    return words

def writeWords(filename, words):
    # create if not exists; overwrite if exists; one word per line
    with open(filename, 'w+', encoding='utf8') as f:
        for i in words:
            f.write(i + '\n')
    print("file written to ", filename)

In [224]:
def cleanData(kid):
    print(kid['name'])
    #wordcount = 0
    freq = {}
    charwords = []

    # for all examples (pesterlogs)
    for file in os.listdir(kid['examples']):
        # run through helper functions for each component
        filename = kid['examples'] + file
        #print("reading in ", filename)
        rawtext = readFile(filename)
        tokens = tokenize(rawtext) 
        tokens = lowercase(tokens)  
        tokens = punctuation(tokens)
        tokens = removeNonAlpha(tokens)        
        words = removeWords(kid['name'], kid['handle'], kid['aka'], tokens)
        
        # put into a nice list for later tinkering in scikit or whatever
        listToStr = ' '.join([str(elem) for elem in words]) 
        charwords.append(listToStr) 
        
        # get word frequencies
        for i in words:
            freq[i] = freq.get(i,0) + 1
        #wordcount += len(words)
        
        # write new file with cleaned up tokens
        nfile = 'examples/' + kid['initials'] + '/clean/clean_' + file # format stupid path
        #writeWords(nfile, words) 
        #print(words[:100])
    
    #print("wordcount", wordcount, kid['name'])   
    wordfreq = sorted(freq.items(), key = lambda x: x[1], reverse = True)
    print(len(wordfreq))
    # one cleaned string per example; the cells below rely on this return value
    return charwords

In [53]:
# generate cleaned-up word lists for each kid
# list format: one example = one entry
dswords = cleanData(daveStrider)
jewords = cleanData(johnEgbert)
jhwords = cleanData(jadeHarley)
rlwords = cleanData(roseLalonde)

## Exploring the Data


| Character    | Words | Acts 1-4    | Act 5        | Act 6        | Unique Words | Examples |
|--------------|-------|-------------|--------------|--------------|--------------|----------|
| Dave Strider | 48734 | 9425 (~19%) | 15319 (~31%) | 23990 (~49%) | 5443         | 371      |
| John Egbert  | 46684 | 8538 (~18%) | 14090 (~30%) | 24056 (~52%) | 3792         | 538      |
| Jade Harley  | 25657 | 3659 (~14%) | 10320 (~40%) | 11590 (~45%) | 2884         | 273      |
| Rose Lalonde | 25569 | 7170 (~28%) | 7944 (~31%)  | 10543 (~41%) | 4505         | 343      |

There was a heavy imbalance in the number of words said by each character, with Rose saying the least and Dave the most. This is interesting because Rose nevertheless had the second-highest number of unique words, which fits her sizeable vocabulary. As Rose turned out to have one of the highest accuracies when testing the classifier later on, I suspect that a higher ratio of unique words to total words is correlated with classifier accuracy: it is easier to identify a character who uses rarer words than one who shares turns of phrase with their friends.
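
To make the unique-word comparison concrete, here is a small sketch that computes the ratio of unique words to total words for each character. The numbers are copied from the table above; the ratio is only a rough proxy for vocabulary richness and is not a measurement the classifier itself uses.

```python
# (total words, unique words) per character, copied from the table above
counts = {
    "Dave Strider": (48734, 5443),
    "John Egbert": (46684, 3792),
    "Jade Harley": (25657, 2884),
    "Rose Lalonde": (25569, 4505),
}

# ratio of unique words to total words
ratios = {name: unique / total for name, (total, unique) in counts.items()}

for name, ratio in sorted(ratios.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {ratio:.1%}")
```

Rose's ratio comes out well above the other three, which sit much closer together, so this fits the intuition for Rose but does not by itself explain how the remaining characters rank.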


The percentages in the table above are relative to the total number of words said by that character. I had originally planned to train on Acts 1-4, use Act 5 for development, and test on Act 6, but I hadn't realized how disproportionate the acts are. So instead, I processed the entire comic into one folder per character and used scikit-learn to divide the data.
    


In [213]:
#bring on over our nice lil lists
ds = dswords
je = jewords
jh = jhwords
rl = rlwords
allexamples = ds + je + jh + rl

343


# Using sklearn 
- load cleaned up data into 'examples'
- split 'examples' into training and testing sets
    - examples: 'x_train', 'x_test'
    - classes: 'c_train', 'c_test'
- use CountVectorizer
    - remove common words with stop_words
    - convert text docs to matrix of token counts


In [2]:
import sklearn
import numpy as np
from glob import glob
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer

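# load in example files using strategic foldering (one subfolder per class)
classes = ['dave', 'john', 'jade', 'rose']
examples = datasets.load_files(
    "examples/",
    description=None,
    categories=classes,
    load_content=True,
    encoding='utf-8',
    shuffle=True,
    random_state=42,
)
# note: load_files numbers the classes by sorted folder name, so
# examples.target_names is ['dave', 'jade', 'john', 'rose']

# split into train and test sets; hold out 20% of the dataset for testing/dev
# (no random_state is set here, so the split and the scores change from run to run)
x_train, x_test, c_train, c_test = train_test_split(examples.data, examples.target, test_size=0.2)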

# sklearn: countvec counts tokens, gets rid of stop words
countvec = CountVectorizer(stop_words='english')
# get counts of training file
x_train_counts = countvec.fit_transform(x_train)

## Weighting features with TF-IDF
- term frequency - inverse document frequency
- weight the features based on their relative rarity/uniqueness
- sklearn: TfidfTransformer on vector created with countvec 
- referenced https://towardsdatascience.com/multi-class-text-classification-with-scikit-learn-12f1e60e0a9f tutorial
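
The weighting can be worked through by hand on a toy corpus. The sketch below uses three hypothetical tokenized documents and the smoothed formula that scikit-learn documents for `TfidfTransformer` with its default settings (idf = ln((1 + n) / (1 + df)) + 1, followed by L2 normalization). It is a plain-Python illustration, not the code used in this project.

```python
import math

# Three tiny hypothetical "documents", already tokenized.
docs = [
    ["sword", "cool", "cool"],
    ["sword", "wizard"],
    ["sword", "garden"],
]
vocab = sorted({w for doc in docs for w in doc})
n_docs = len(docs)

# document frequency: number of documents that contain each term
df = {t: sum(t in doc for doc in docs) for t in vocab}
# smoothed inverse document frequency
idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}

def tfidf(doc):
    """Raw count times idf, then scaled to unit (L2) length."""
    raw = {t: doc.count(t) * idf[t] for t in vocab}
    norm = math.sqrt(sum(v * v for v in raw.values()))
    return {t: v / norm for t, v in raw.items()}

weights = tfidf(docs[1])
print({t: round(w, 3) for t, w in weights.items() if w > 0})
```

In the second document, "wizard" ends up with a larger weight than "sword" even though both occur once, because "sword" appears in every document and so carries little information about who is speaking.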

In [3]:
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import TfidfVectorizer

#tfidf = TfidfTransformer(use_idf=True)
#x_train_tfidf = tfidf.fit_transform(x_train_counts)


## Top 10 Weighted Features per Class

In [11]:
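def print_top10(vectorizer, clf, class_labels):
    # vectorizer: a fitted CountVectorizer; clf: a fitted linear classifier (has coef_)
    if hasattr(vectorizer, 'get_feature_names_out'):
        feature_names = vectorizer.get_feature_names_out()  # newer scikit-learn
    else:
        feature_names = vectorizer.get_feature_names()  # older scikit-learn
    for i, class_label in enumerate(class_labels):
        top10 = np.argsort(clf.coef_[i])[-10:]
        print("%s: %s" % (class_label, " ".join(feature_names[j] for j in top10)))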

## Set up Classifier

#### Stochastic Gradient Descent (SGD) Classifier
- calculates the loss from one example at a time, rather than the total loss over the whole training set (as in standard gradient descent)
- uses this loss to nudge the feature weights, with a learning rate that decreases over time
- chosen because it's easy to implement using sklearn and fairly accurate

- loss='hinge': the default loss function, and also the fastest one I tested
- penalty='l2': the standard regularizer for linear SVM models
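
The update described above can be sketched in a few lines of plain Python for the binary case. This is a simplified illustration, not scikit-learn's implementation: the step-size schedule `eta0 / (1 + t)`, the fixed example order, the missing bias term, and the toy data are all assumptions made to keep it short. For more than two classes, `SGDClassifier` combines several such binary classifiers in a one-versus-all scheme.

```python
def sgd_hinge(X, y, alpha=1e-3, eta0=0.5, epochs=5):
    """Binary linear classifier trained one example at a time (labels are +1 / -1)."""
    w = [0.0] * len(X[0])
    t = 0
    for _ in range(epochs):
        for x_i, y_i in zip(X, y):
            eta = eta0 / (1 + t)  # assumed schedule: the step size shrinks over time
            margin = y_i * sum(wj * xj for wj, xj in zip(w, x_i))
            # L2 penalty: shrink every weight slightly
            w = [wj * (1 - eta * alpha) for wj in w]
            # hinge loss: only examples inside the margin push the weights
            if margin < 1:
                w = [wj + eta * y_i * xj for wj, xj in zip(w, x_i)]
            t += 1
    return w

# toy data: the two features are counts of two hypothetical words
X = [[2, 0], [0, 1], [1, 0], [0, 2]]
y = [1, -1, 1, -1]

w = sgd_hinge(X, y)
predictions = [1 if sum(wj * xj for wj, xj in zip(w, x)) > 0 else -1 for x in X]
print(w, predictions)
```

Each weight update depends on a single example, which is what makes the method cheap per step compared with computing the loss over the whole training set.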

In [12]:
from sklearn.linear_model import SGDClassifier
#from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn import metrics
from sklearn.pipeline import Pipeline 

# use pipeline to set up vector representations/freq -> tfidf -> classifier 
clf = Pipeline([
    ('vect', CountVectorizer(stop_words='english')),
    ('tfidf', TfidfTransformer(use_idf=True)),
    ('clf', SGDClassifier(loss='hinge', penalty='l2', alpha=1e-3, random_state=51, verbose=0)),
])
# pass training data to fit function
clf.fit(x_train, c_train)


Pipeline(memory=None,
     steps=[('vect', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words='english',
        ...dom_state=51, shuffle=True, tol=None,
       validation_fraction=0.1, verbose=0, warm_start=False))])

## Metrics / Investigate Results


- sklearn has some metrics capabilities built in - shown below
- results show around a 70% average accuracy
- accuracy of characters in descending order
    - Rose: 75-86%
    - Dave: 72-78%
    - John: 66-75%
    - Jade: 66-73%
- I am not entirely surprised by this order. I believe it is largely because of the difference between the vocabulary that Rose and Dave use and the vocabulary that John and Jade use. Removing stop words got rid of a lot of words overall, and TF-IDF down-weights the common ones that remain. Also, this is very specific, but Jade tends to draw out words: "okay!" becomes "oookkaaaayy!!!", which isn't handled easily by the libraries but could most likely be mitigated with some extra normalization.
- If I continue work on this, an enhancement I would consider is using bigrams, especially for newline situations
- Using NLTK's TweetTokenizer preserved a lot more than the standard tokenizer, but I still wish I had been able to keep emojis or certain punctuation-based quirks.
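
One possible mitigation for drawn-out words is to collapse runs of repeated characters before counting. The helper below is a hypothetical sketch using only the standard library; it was not part of this project's pipeline.

```python
import re

def collapse_repeats(word):
    """Collapse any run of a repeated character down to a single character."""
    return re.sub(r"(.)\1+", r"\1", word)

print(collapse_repeats("oookkaaaayy"))  # okay
```

The trade-off is that legitimate double letters are collapsed too ("cool" becomes "col"), so it only helps if every word is normalized the same way. NLTK's `TweetTokenizer` also accepts `reduce_len=True`, which shortens long character runs to three characters instead of removing them.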


In [13]:
# nparray to store classifier's predictions of the test set
predict = clf.predict(x_test)
avgsuccess = np.mean(predict == c_test)
print("Average Success with Classifier on Test Data: ", avgsuccess)

# check out metrics classification report
print('\nSGD Classifier Metrics:\n')
print(metrics.classification_report(c_test, predict, target_names = examples.target_names))

Average Success with Classifier on Test Data:  0.6721311475409836

SGD Classifier Metrics:

              precision    recall  f1-score   support

        dave       0.68      0.75      0.71        73
        jade       0.70      0.55      0.62        67
        john       0.65      0.76      0.70       111
        rose       0.69      0.54      0.60        54

   micro avg       0.67      0.67      0.67       305
   macro avg       0.68      0.65      0.66       305
weighted avg       0.68      0.67      0.67       305
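
As a quick sanity check on the report above, the overall accuracy should equal the per-class recall weighted by support. The sketch below redoes that arithmetic with the rounded values copied from the report, so it can only match the printed accuracy approximately.

```python
# (recall, support) per class, copied from the report above
report = {
    "dave": (0.75, 73),
    "jade": (0.55, 67),
    "john": (0.76, 111),
    "rose": (0.54, 54),
}

total = sum(support for _, support in report.values())
# recall * support approximates the number of correctly classified examples per class
correct = sum(recall * support for recall, support in report.values())
accuracy = correct / total

print(total, round(accuracy, 2))
```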



In [14]:
def prediction(text):
    # load_files numbers classes by sorted folder name (dave, jade, john, rose),
    # so look the name up in target_names instead of a hand-written list
    p = clf.predict(text)[0]
    print("Prediction: Class", p)
    return examples.target_names[p].capitalize()

text = ["thank you for looking at my presentation"]
prediction(text)

Prediction: Class 2


'John'

In [15]:
text = ["i have to give to him"]
prediction(text)

Prediction: Class 2


'John'

In [23]:
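# pull the fitted steps out of the pipeline
vectorizer = clf.named_steps['vect']
linear_clf = clf.named_steps['clf']
# coef_ rows follow the sorted class order from load_files, not the `classes` list
class_labels = examples.target_names

def print_top10(vectorizer, clf, class_labels):
    # vectorizer: a fitted CountVectorizer; clf: a fitted linear classifier (has coef_)
    if hasattr(vectorizer, 'get_feature_names_out'):
        feature_names = vectorizer.get_feature_names_out()  # newer scikit-learn
    else:
        feature_names = vectorizer.get_feature_names()  # older scikit-learn
    for i, class_label in enumerate(class_labels):
        top10 = np.argsort(clf.coef_[i])[-10:]
        print("%s: %s" % (class_label, " ".join(feature_names[j] for j in top10)))

In [24]:
print_top10(vectorizer, linear_clf, class_labels)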

In [32]:
# list every parameter of the pipeline and its steps
clf.get_params(deep=True)

Possible next step: use n-grams to capture concatenated words / phrases.
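
As a minimal sketch of what n-gram features would look like, the hypothetical helper below builds unigrams and bigrams from a token list in plain Python. In scikit-learn the same effect should be available through the `ngram_range=(1, 2)` parameter of `CountVectorizer`, though that was not tried in this project.

```python
def ngrams(tokens, n):
    """Return the n-grams in a token list as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "i have to give to him".split()
features = ngrams(tokens, 1) + ngrams(tokens, 2)
print(features)
```

Bigrams such as "give to" keep some word order that single-word features throw away, at the cost of a much larger and sparser feature matrix.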