# Caption Style Segmentation for Instagram Food Influencers

---

[https://dryanfurman.shinyapps.io/caption_clustering/](https://dryanfurman.shinyapps.io/caption_clustering/).

This notebook explores style variability in Instagram captions written by better-for-you food influencers (posts from August and September). I focus on style rather than topic; I therefore filtered out any posts that were not about food and any text that uniquely identified the profile (for example, the influencer's name).
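The filtering code itself is not part of this notebook. As a rough illustration only, the sketch below shows one way profile-identifying text could be stripped from a caption. The function name `scrub_profile_markers`, the `profile_terms` argument, and the choice to drop `@mentions` are all assumptions made for this example, not the procedure actually used to prepare the data.

```python
import re

def scrub_profile_markers(caption, profile_terms):
    """Hypothetical helper: remove @mentions and listed profile terms."""
    # Drop @mentions such as "@rachelmansfield" (assumed to identify a profile).
    text = re.sub(r"@\w+", " ", caption)
    # Drop each listed term (for example a first name), ignoring case.
    for term in profile_terms:
        pattern = r"\b" + re.escape(term) + r"\b"
        text = re.sub(pattern, " ", text, flags=re.IGNORECASE)
    # Collapse the whitespace left behind by the removals.
    return re.sub(r"\s+", " ", text).strip()

example = "Rachel here! New cookies from @rachelmansfield today"
cleaned = scrub_profile_markers(example, ["rachel"])
print(cleaned)
```

A term list per profile keeps the scrubbing explicit, at the cost of having to maintain that list by hand.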

I ran KMeans clustering on averaged BERT embeddings and on sparse word-presence features (NMI = 0.9), revealing that captions from the same profile share a distinctive style. I found that it was easier to cluster captions by profile than, for example, to cluster poems by type (as in [Literary Pattern Recognition](https://www.journals.uchicago.edu/doi/pdf/10.1086/684353)).
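NMI (normalized mutual information) compares the cluster assignments with the true profile labels and ignores how the clusters happen to be numbered. The notebook computes it with `sklearn.metrics.normalized_mutual_info_score`; the sketch below is a standard-library re-implementation for illustration only, assuming the arithmetic-mean normalization. The toy label lists are made up for the example.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (natural log) of a label sequence."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())

def mutual_information(a, b):
    """Mutual information between two labelings of the same items."""
    n = len(a)
    joint = Counter(zip(a, b))
    count_a, count_b = Counter(a), Counter(b)
    mi = 0.0
    for (x, y), c in joint.items():
        # p(x, y) * log( p(x, y) / (p(x) * p(y)) )
        mi += (c / n) * math.log(c * n / (count_a[x] * count_b[y]))
    return mi

def nmi(a, b):
    """Mutual information divided by the mean of the two entropies."""
    denom = (entropy(a) + entropy(b)) / 2
    if denom == 0:
        # Both labelings put every item in a single group.
        return 1.0
    return mutual_information(a, b) / denom

# Toy labels (illustrative only): same partition, different cluster ids.
true_profiles = [0, 0, 1, 1, 2, 2]
cluster_ids = [2, 2, 0, 0, 1, 1]
score_same_partition = nmi(true_profiles, cluster_ids)

# Toy labels where the clustering carries no information about the profile.
score_independent = nmi([0, 0, 1, 1], [0, 1, 0, 1])
```

Here `score_same_partition` is 1 up to floating-point rounding and `score_independent` is 0, which is why NMI suits KMeans output, where cluster ids are arbitrary.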

    
![dataviz](Insta-caption-style-viz.png)


In [222]:
import nltk
from transformers import BertModel, BertTokenizer

from scipy import sparse
from sklearn.cluster import KMeans
from sklearn import metrics

import csv, os, re
import math
from collections import Counter
import random
import numpy as np
import pandas as pd
from pathlib import Path
from glob import glob
import matplotlib.pyplot as plt
from random import sample


In [223]:
def read_texts(dir_folder):
    
    paths = glob(dir_folder)
    txt_list = []
    for path in paths:
        txt = Path(path).read_text()
        txt = txt.replace('\n', '')
        txt_list.append(txt)

    return txt_list

In [224]:
rachel_mansfield=read_texts("../data/insta-food-influencers/rachelmansfield/*.txt")
#rachel_mansfield=sample(rachel_mansfield, 40)
minimalist_baker=read_texts("../data/insta-food-influencers/minimalistbaker/*.txt")
minimalist_baker=sample(minimalist_baker,54)
delicioushealthyvideos=read_texts("../data/insta-food-influencers/delicioushealthyvideos/*.txt")

In [225]:
print(len(rachel_mansfield), len(minimalist_baker), len(delicioushealthyvideos))


54 54 54


In [226]:
def run_all(influencer1, influencer2, influencer3, feature_function):
    X, Y, feature_list=feature_function(influencer1, influencer2, influencer3)
    kmeans = KMeans(n_clusters=3, random_state=46).fit(X)
    nmi=metrics.normalized_mutual_info_score(Y, kmeans.labels_)
    print("%.3f NMI" % nmi)

    return X, Y, kmeans.labels_


In [227]:
# This function takes lists of Instagram captions from three influencers and returns:

# X (sparse matrix, with captions as rows and features as columns)
# Y (tuple of caption labels: 0=influencer1, 1=influencer2, 2=influencer3)
# feature_vocab (dict mapping each token to its column index in X)

def featurize_method_1(influencer1, influencer2, influencer3):

    def featurize(caption, feature_vocab):
                
        feats={}

        tokens=nltk.word_tokenize(caption.lower())
        for token in tokens:
            if token not in feature_vocab:
                feature_vocab[token]=len(feature_vocab)
            feats[feature_vocab[token]]=1
        return feats

    feature_vocab={}
    data=[]
    Y=[]

    for caption in influencer1:
        feats=featurize(caption, feature_vocab)
        data.append(feats)
        Y.append(0)
    for caption in influencer2:
        feats=featurize(caption, feature_vocab)
        data.append(feats)
        Y.append(1)
    for caption in influencer3:
        feats=featurize(caption, feature_vocab)
        data.append(feats)
        Y.append(2)
        
    # shuffle the data
    temp = list(zip(data, Y))
    random.shuffle(temp)
    data, Y = zip(*temp)

    # sparse representation
    X=sparse.lil_matrix((len(data), len(feature_vocab)))

    for idx,feats in enumerate(data):
        for f in feats:
            X[idx,f]=feats[f]
    
    return X, Y, feature_vocab

In [228]:
run_all(rachel_mansfield, minimalist_baker, delicioushealthyvideos, featurize_method_1)

0.567 NMI


(<162x2061 sparse matrix of type '<class 'numpy.float64'>'
 	with 7398 stored elements in List of Lists format>,
 (2, 2, 0, 0, 2, 2, 2, 1, 2, 2, ...),
 array([2, 2, 0, 2, 2, 2, 2, 1, 0, 2, 1, 2, 2, 2, 0, 2, 1, 1, 2, 0, 2, 1, ...]))


In [229]:
# BERT helper function

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

def get_bert_average_across_text_tokens(string): 
    
    # tokenize (this also adds the special [CLS] and [SEP] tokens)
    inputs = tokenizer(string, return_tensors="pt")
    # convert input ids back to word pieces so they can be counted
    tokens=tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])

    if len(tokens)<=512:
        outputs = model(**inputs)
        # average the final-layer vectors over all tokens -> one 768-dim vector
        bert_av = np.mean(outputs.last_hidden_state[0].detach().numpy(), axis=0)
    else: 
        # caption exceeds BERT's 512-token limit: fall back to a zero vector
        bert_av = np.zeros(768)
        
    return bert_av



In [230]:
# This function takes lists of Instagram captions from three influencers and returns:

# X (tuple of averaged BERT embeddings, 768 features per caption)
# Y (tuple of caption labels: 0=influencer1, 1=influencer2, 2=influencer3)
# feature_vocab (unused here; returned empty to match featurize_method_1)
    
def featurize_method_3(influencer1, influencer2, influencer3):
    
    def featurize(caption, feature_vocab):
                
        feats=get_bert_average_across_text_tokens(caption)

        return feats
        
    feature_vocab={}
    data=[]
    Y=[]

    for caption in influencer1:
        feats=featurize(caption, feature_vocab)
        data.append(feats)
        Y.append(0)
    for caption in influencer2:
        feats=featurize(caption, feature_vocab)
        data.append(feats)
        Y.append(1)
    for caption in influencer3:
        feats=featurize(caption, feature_vocab)
        data.append(feats)
        Y.append(2)
    temp = list(zip(data, Y))
    random.shuffle(temp)
    data, Y = zip(*temp)

    # not a sparse matrix
    X=data
        
    return X, Y, feature_vocab

In [231]:
X, Y, kmeans_labels = run_all(rachel_mansfield, minimalist_baker, delicioushealthyvideos, featurize_method_3)

0.580 NMI


### Conclusion (2-class KMeans)

I explored three featurization methods for clustering Instagram captions. The first builds a sparse matrix over the entire vocabulary, where captions are rows and words are columns, with binary word presence/absence. The second simply takes the total number of syllables and the total number of sentences in each caption, giving two features per caption (see the visualization at the top of the notebook). The third averages the BERT embeddings of every token in the caption, giving 768 features per caption. With two profiles, the BERT method performed best, with an NMI of ~0.9; the three-profile runs shown above score lower, at 0.567 NMI for sparse word presence and 0.580 NMI for BERT. Ultimately, style detection on Instagram captions is quite feasible, and it scores significantly better than other types of style recognition (similar methods for poem style segmentation score an NMI of ~0.1, for example).
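The second method (syllable and sentence counts) is described above but its code is not included in this notebook. The sketch below is one possible way to compute those two features, using only the standard library. The vowel-group syllable heuristic, the punctuation-based sentence split, and the names `count_syllables` and `caption_features` are assumptions for illustration; the original counts may have been produced differently.

```python
import re

def count_syllables(word):
    """Rough syllable estimate: count groups of consecutive vowels."""
    w = word.lower()
    count = len(re.findall(r"[aeiouy]+", w))
    # Treat a final consonant + 'e' as silent ("bake"), but not "-le" ("apple").
    if count > 1 and re.search(r"[^aeiouy]e$", w) and not w.endswith("le"):
        count -= 1
    return max(count, 1)

def caption_features(caption):
    """Return [total syllables, total sentences] for one caption."""
    words = re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?", caption)
    syllables = sum(count_syllables(w) for w in words)
    # Sentences are approximated as non-empty spans between ., ! or ?
    sentences = [s for s in re.split(r"[.!?]+", caption) if s.strip()]
    return [syllables, len(sentences)]

# Toy caption (illustrative only).
features = caption_features("Bake the cake. So easy!")
print(features)
```

Stacking one such two-element list per caption gives a dense matrix that can be passed to `KMeans` in the same way as the other feature sets.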

In [232]:
X = pd.DataFrame(X)

In [233]:
true_labels = pd.DataFrame(Y)


In [234]:
kmeans_labels = pd.DataFrame(kmeans_labels)


In [235]:
X['true_labels'] = true_labels
X['kmeans_labels'] = kmeans_labels

In [236]:
X.to_csv('PCA_insta_data1.csv')