In "[Literary Pattern Recognition](https://www.journals.uchicago.edu/doi/full/10.1086/684353)", Long and So train a classifier to differentiate haiku poems from non-haiku poems, and find that many features help do so.  In class, we've discussed the importance of representation--how you *describe* a text computationally influences the kinds of things you are able to do with it.  While Long and So explore description in the context of classification, in this homework, you'll see how well you can design features that can differentiate these two classes *without* any supervision. Are you able to featurize a collection of poems such that two clusters (haiku/non-haiku) emerge when using KMeans clustering, with the text representation as your only degree of freedom?

In [1]:
import csv, os, re
import nltk
from scipy import sparse
from sklearn.cluster import KMeans
from sklearn import metrics
import math
from collections import Counter
import random

In [2]:
def read_texts(path, metadata, filepath_col):
    data=[]
    with open(metadata, encoding="utf-8") as file:
        csv_reader = csv.reader(file)
        next(csv_reader)
        for cols in csv_reader:
            poem_path=os.path.join(path, cols[filepath_col])
            if os.path.exists(poem_path):
                with open(poem_path, encoding="utf-8") as poem_file:
                    poem=poem_file.read()
                    data.append(poem)
    return data

Here we'll use data originally released on Github to support "Literary Pattern Recognition": [https://github.com/hoytlong/PatternRecognition](https://github.com/hoytlong/PatternRecognition)

In [3]:
haiku=read_texts("../data/haiku/long_so_haiku", "../data/haiku/Haikus.csv", 4)

In [4]:
others=read_texts("../data/haiku/long_so_others", "../data/haiku/OthersData.csv", 5)

In [5]:
# don't change anything within this code block

def run_all(haiku, others, feature_function):
    
    X, Y, featurize_vocab=feature_function(haiku, others)
    kmeans = KMeans(n_clusters=2, random_state=0).fit(X)
    nmi=metrics.normalized_mutual_info_score(Y, kmeans.labels_)
    print("%.3f NMI" % nmi)
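The clustering step in `run_all` can be pictured with a small pure-Python sketch of 2-means (Lloyd's algorithm). This is only an illustration, not scikit-learn's implementation: the function names, the toy points, and the choice to seed the centroids with the first and last point are all assumptions made for the example.

```python
def squared_distance(a, b):
    # squared Euclidean distance between two equal-length vectors
    return sum((x - y) ** 2 for x, y in zip(a, b))

def two_means(points, iterations=10):
    # Assumption: seed the two centroids with the first and last point.
    # This is a simplification chosen to keep the example deterministic.
    centroids = [points[0], points[-1]]
    labels = []
    for _ in range(iterations):
        # assignment step: each point joins its nearest centroid
        labels = [
            0 if squared_distance(p, centroids[0]) <= squared_distance(p, centroids[1]) else 1
            for p in points
        ]
        # update step: each centroid moves to the mean of its members
        for k in (0, 1):
            members = [p for p, label in zip(points, labels) if label == k]
            if members:
                centroids[k] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return labels, centroids

# toy 2-D "feature vectors": two well-separated groups
toy_points = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.2), (5.0, 5.1), (5.2, 4.9), (4.9, 5.0)]
labels, centroids = two_means(toy_points)
print(labels)  # prints [0, 0, 0, 1, 1, 1]
```

Because the clustering only ever sees the rows of `X`, two groups emerge only if the featurization places haiku and non-haiku in different regions of the feature space.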

As one example, let's take a simple featurization and represent each poem by a binary indicator of the dictionary word types it contains.  "To be or not to be", for example, would be represented as {"to": 1, "be": 1, "or": 1, "not": 1}
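The "To be or not to be" example can be reproduced with a few lines. This sketch assumes plain whitespace tokenization (not the `nltk.word_tokenize` call used in the notebook), and `binary_unigram_features` is a hypothetical helper name introduced only for illustration.

```python
def binary_unigram_features(text):
    # lowercase, split on whitespace, and mark each word type as present (1);
    # repeated words collapse to a single key
    return {token: 1 for token in text.lower().split()}

features = binary_unigram_features("To be or not to be")
print(features)  # prints {'to': 1, 'be': 1, 'or': 1, 'not': 1}
```

The six tokens collapse to four features because only the presence of a word type is recorded, not its count.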

In [6]:
# This function takes in a list of haiku poems and non-haiku poems, and returns:

# X (sparse matrix, with poems as rows and features as columns)
# Y (list of poem labels, with 1=haiku and 0=non-haiku)
# feature_vocab (dict mapping feature name to feature ID)

def unigram_featurize_all(haiku, others):

    def unigram_featurize(poem, feature_vocab):
        
        # featurize text by just noting the binary presence of words within it
        
        feats={}

        tokens=nltk.word_tokenize(poem.lower())
        for token in tokens:
            if token not in feature_vocab: # check if token is in feature_vocab 
                feature_vocab[token]=len(feature_vocab) # if not, add it to the vocab
            feats[feature_vocab[token]]=1 # then, add it to feats dict with binary 1/0
        return feats

    feature_vocab={}
    data=[]
    Y=[]

    for poem in haiku:
        feats=unigram_featurize(poem, feature_vocab)
        data.append(feats)
        Y.append(1)
    for poem in others:
        feats=unigram_featurize(poem, feature_vocab)
        data.append(feats)
        Y.append(0)
    
    # since the data above has all haiku ordered before non-haiku, let's shuffle them
    temp = list(zip(data, Y))
    random.shuffle(temp)
    data, Y = zip(*temp)

    # we'll use a sparse representation since our features are sparse
    X=sparse.lil_matrix((len(data), len(feature_vocab)))

    for idx,feats in enumerate(data):
        for f in feats:
            X[idx,f]=feats[f]
    
    return X, Y, feature_vocab

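This method yields an NMI of ~0.07, with some run-to-run variability: `random_state=0` fixes KMeans's own seed, but the unseeded `random.shuffle` changes the row order and therefore which poems are picked as initial centroids.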

In [7]:
run_all(haiku, others, unigram_featurize_all)

0.073 NMI
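To make the score easier to interpret, here is a rough pure-Python sketch of normalized mutual information. It assumes the arithmetic-mean normalization (to my understanding the scikit-learn default) and uses hypothetical helper names; it is meant to show the idea, not to replace `metrics.normalized_mutual_info_score`.

```python
import math
from collections import Counter

def entropy(labels):
    # Shannon entropy (in nats) of a labeling
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())

def mutual_information(a, b):
    # mutual information between two labelings of the same items
    n = len(a)
    joint = Counter(zip(a, b))
    count_a, count_b = Counter(a), Counter(b)
    return sum(
        (c / n) * math.log(c * n / (count_a[x] * count_b[y]))
        for (x, y), c in joint.items()
    )

def nmi(a, b):
    # Assumption: normalize by the arithmetic mean of the two entropies
    denom = (entropy(a) + entropy(b)) / 2
    return mutual_information(a, b) / denom if denom > 0 else 1.0

gold = [1, 1, 0, 0]
swapped = nmi(gold, [0, 0, 1, 1])    # same partition, cluster names swapped
unrelated = nmi(gold, [0, 1, 0, 1])  # partition independent of the gold labels
print(round(swapped, 3), round(unrelated, 3))
```

The first score is 1 and the second is 0: NMI rewards recovering the same grouping regardless of which cluster ID KMeans assigns to haiku, which is why it suits an unsupervised setup like this one.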


**Q1**: Copy the `unigram_featurize_all` code above and adapt it to create your own featurization method named `fancy_featurize_all`.  You may use whatever information you like to represent these poems for the purposes of clustering them into two categories, but you must use the KMeans clustering (with 2 clusters) as defined in `run_all`.  Use your own understanding of haiku, or read the Long and So article above for other ideas.  Are you able to improve over an NMI of 0.07?

In [391]:
# packages for embedding
import numpy as np
from sentence_transformers import SentenceTransformer

In [397]:
# embedding approach

def fancy_featurize_all(haiku, others):
    
    sentence_model = SentenceTransformer('sentence-transformers/all-distilroberta-v1')
    
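    poems = haiku + others
    # keep the label convention used above: 1=haiku, 0=non-haiku
    Y = [1] * len(haiku) + [0] * len(others)

    # encode() returns one dense embedding (a numpy array row) per poem
    data = sentence_model.encode(poems)

    # embeddings are dense vectors with no named features, so the vocab stays empty
    feature_vocab = {}

    # since the data above has all haiku ordered before non-haiku, let's shuffle them
    temp = list(zip(data, Y)) # pair each embedding with its label
    random.shuffle(temp)
    data, Y = zip(*temp)

    # stack the shuffled embeddings back into a single 2-D array for KMeans
    return np.array(data), list(Y), feature_vocab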

In [398]:
run_all(haiku, others, fancy_featurize_all)

0.120 NMI


In [75]:
# packages for first-try approach
from nltk.corpus import stopwords
stop_words = set(stopwords.words("english"))

arpabet = nltk.corpus.cmudict.dict()
from g2p_en import G2p
g2p = G2p()

def get_pronunciation(word):
    if word in arpabet:
        # pick the first pronunciation
        return arpabet[word][0]

    else:
        return g2p(word)

def get_syllable_count(word):
    pronunciation=get_pronunciation(word)
    sylls=0
    for phon in pronunciation:
        # vowels in arpabet end in digits (indicating stress)
        if re.search(r"\d$", phon) is not None:
            sylls+=1
    return sylls


import spacy
# disable expects a list of separate pipe names; NER and the parser aren't needed here
nlp = spacy.load('en_core_web_sm', disable=['ner', 'parser'])

def spacy_tokenizer(data):
    spacy_tokens=nlp(data)
    return [token for token in spacy_tokens]
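The syllable counter above relies on the ARPAbet convention that vowel phonemes end in a stress digit. A minimal sketch of that counting step, using a hand-typed pronunciation rather than a CMUdict or g2p lookup (the phoneme list and the name `count_syllables` are illustrative assumptions):

```python
import re

def count_syllables(phonemes):
    # ARPAbet vowels end in a stress digit (0, 1, or 2); consonants do not,
    # so counting digit-final phonemes counts syllables
    return sum(1 for phon in phonemes if re.search(r"\d$", phon))

# hand-written ARPAbet-style pronunciation of "haiku", for illustration only
haiku_phonemes = ["HH", "AY1", "K", "UW0"]
print(count_syllables(haiku_phonemes))  # prints 2
```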

In [361]:
# first-try approach
def fancy_featurize_all(haiku, others):
    
    def fancy_featurize(poem, feature_vocab):

            lowered = poem.lower()
            tokens = spacy_tokenizer(lowered)
            cleaned_tokens = []
            for word in tokens:
                if str(word) not in stop_words:
                    if word.pos_ == "NOUN":
                        cleaned_tokens.append(str(word.lemma_))
                    elif not word.is_punct and "\n" not in str(word):
                        cleaned_tokens.append(str(word))

            feats = {}

            text_length = len(cleaned_tokens)
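            # a multi-word poem is not a CMUdict entry, so this falls through to g2p,
            # which phonemizes the full text; vowels are then counted across all words
            syllables = get_syllable_count(lowered)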
            feats[0] = text_length
            feats[1] = syllables

            for token in cleaned_tokens:
                if token not in feature_vocab: # check if token is in feature_vocab 
                    feature_vocab[token]=len(feature_vocab) # if not, add it to the vocab
                if feature_vocab[token] not in feats: 
                    feats[feature_vocab[token]] = 1 
                else:
                    feats[feature_vocab[token]] = feats[feature_vocab[token]] + 1

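            # drop every feature whose value is below 2, i.e. words that occur only
            # once in the poem (the same threshold applies to length and syllables)
            keys_to_drop = []
            for key in feats:
                if feats[key] < 2:
                    keys_to_drop.append(key)

            for key in keys_to_drop:
                feats.pop(key)
            return feats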
    
    feature_vocab={"length_": 0, "syllables_": 1} 
    data=[]
    Y=[]

    for poem in haiku:
        feats=fancy_featurize(poem, feature_vocab) 
        data.append(feats) 
        Y.append(1) 
    for poem in others:
        feats=fancy_featurize(poem, feature_vocab)
        data.append(feats)
        Y.append(0)
    
    # since the data above has all haiku ordered before non-haiku, let's shuffle them
    temp = list(zip(data, Y)) # pair each feature dict with its label
    random.shuffle(temp)
    data, Y = zip(*temp)

    
    # we'll use a sparse representation since our features are sparse
    X=sparse.lil_matrix((len(data), len(feature_vocab))) # convert to matrix

    for idx,feats in enumerate(data):
        for f in feats:
            X[idx,f]=feats[f]
    
    return X, Y, feature_vocab

In [362]:
run_all(haiku, others, fancy_featurize_all)

0.032 NMI


**Q2**: Describe your method for featurization in 100 words and why you expect it to be able to separate haiku poems from non-haiku poems in this data.

My featurization method uses a SentenceTransformer to encode each poem as a dense vector. The model comes from the Sentence-BERT framework described by [Reimers and Gurevych](https://arxiv.org/abs/1908.10084), which adapts a pretrained BERT-style transformer (here the `all-distilroberta-v1` checkpoint) to produce sentence embeddings, so that each text is mapped to a point in a vector space.

I expect this approach to separate haiku from non-haiku because it takes the context of the entire poem into account, rather than only the presence or absence of individual word tokens. An embedding can reflect factors such as sentence structure and common themes, which Long and So discuss in their paper.

(Note: I kept my initial approach above for my own record. It tried to build on the unigram baseline by adding features that Long and So discuss, such as syllable count and poem length, along with cleaning steps such as lemmatizing nouns and removing stop words. It did much worse (0.032 NMI versus 0.073 for the baseline), and I assume this is because haiku and non-haiku poems overlap significantly on the added features.)