# Unsupervised learning techniques on Instagram data

For our Practical Data Science final project, we scraped Instagram data to examine the different ways in which people put forth their views on this platform.

## Motivation

Instagram is used to post content by a wide range of users around the world, including the general public, businesses, celebrities, and education and scientific organizations. A common way to increase the popularity of posts on this platform is to use hashtags. Users have different sentiments and strategies for using these hashtags, and oftentimes the hashtags relate to the post content only tangentially. Many users add 'trending' hashtags at random to engage more users through their posts.

Through this project we wanted to analyze user behavior by examining how images are posted on Instagram paired with different hashtags and captions. While captions are written in an attempt to describe the post, hashtags are used to add value to the description and also to gain popularity. We do this by clustering the posts using unsupervised learning techniques, and then clustering the content within each cluster again to define sub-groups of posts, using hashtags and sentence embeddings derived from the captions.

## Content



1. [Introduction](#Introduction)
2. [Data](#Data)
3. [Data Cleaning](#Data-Cleaning)
4. [Modelling and Visualizations](#Modelling-and-Visualizations)
5. [Results](#Results)
6. [Conclusions and Future Work](#Conclusions-and-Future-Work)



### Introduction

Instagram's content can be analyzed to derive key insights about user behavior on the platform. Within the scope of our study we focus on a specific domain to extract and analyze data. Our reference case is how people post content with the hashtag 'climatechange'. Using only images (no data related to stories or video posts was analyzed within the scope of our project), the associated hashtags, and the captions of the posts, we attempted a cluster-within-cluster analysis to classify the data into groups and sub-groups and visualize them.

### Data

The data used in the project was scraped using 'Igram Scraper' (https://github.com/realsirjoe/instagram-scraper) instead of the official Instagram API. The official API gives access to only limited Instagram data, so the alternative scraper tool was used. The scraper pulled posts that used the hashtag 'climatechange'. With a pause of 1.5 seconds between requests, to ensure that our IP address was not blocked by Instagram, 10000 posts were extracted on 28th November 2019. These 10000 posts included both image and video posts and spanned the time frame of 26th November 2019 - 28th November 2019.

Due to the security and privacy policies defined by Instagram, there is no best method to pull "large" amounts of data from Instagram in real time. The scraper tool used in this project was the best option available to us to extract data related to images, i.e. the image itself, the URL of the post, the timestamp, the associated caption and hashtags, the location of the post (if provided by the user), the total number of organic likes, the total of organic + sponsored likes, and the total number of comments.

An attempt was made to extract 25000 and 50000 posts using this scraper, but beyond 10000 posts our requests to Instagram failed to return any content. In future work, an attempt can be made to use a different scraper or write one from scratch to extract more data, but within the scope of our project we were successful in deriving critical insights from the 10000 posts collected.

Additionally, attempting to extract the images along with their meta-data (hashtags, captions, likes etc.) consumed a lot of time and resulted in multiple time-out errors, so the images were extracted separately after the meta-data was collected using the 'Igram scraper'. This was done by using the links collected in the meta-data to download all the images while preserving their order. It is important to note that Instagram allows multiple images in a single post, but within the time frame of our study only the first image of each post was scraped and considered for the analysis.

In [None]:
# Import all libraries
import json
import os
import argparse
import sys
import re
import torch
from PIL import Image
from torchvision import transforms
import torchvision.models as models
import numpy as np

from igramscraper.instagram import Instagram # pylint: disable=no-name-in-module

In [None]:
# Scraping the data

instagram = Instagram(sleep_between_requests=1.5)

medias = instagram.get_medias_by_tag('climatechange', count=10000)

# Write one text block per post; the context manager closes the file even on errors
with open("climate_igram_10k.txt", "w") as fp:
    for media in medias:
        print(media, file=fp)
        print('Account info:', file=fp)
        account = media.owner
        print('Id', account.identifier, file=fp)
        # print('Username', account.username)
        # print('Full Name', account.full_name)
        # print('Profile Pic Url', account.get_profile_picture_url_hd())
        print('--------------------------------------------------', file=fp)

In [None]:

# parser = argparse.ArgumentParser()
# parser.add_argument("--infile")
# parser.add_argument("--outfile")
# args = parser.parse_args()

fp = open("climate_igram_10k.txt")

hdr_to_key = {
    "'Id:": "media_id",
    "Shortcode:": "shortcode",
    "Created at:": "timestamp",
    "Caption:": "caption",
    "Number of comments:": "num_comments",
    "Number of likes:": "num_likes",
    "Link:": "post_link",
    "Hig res image:": "media_link",
    "Media type:": "media_type",
    "Id": "account_id"}

metadata = []
image_info = {}
caption_str = ""
reading_caption = 0
num_media =  0

# Hashtags regex - taken from rarcega/instagram-scraper
hashtag_regex_string = r"(?<!&)#(\w+|(?:[\xA9\xAE\u203C\u2049\u2122\u2139\u2194-\u2199\u21A9\u21AA\u231A\u231B\u2328\u2388\u23CF\u23E9-\u23F3\u23F8-\u23FA\u24C2\u25AA\u25AB\u25B6\u25C0\u25FB-\u25FE\u2600-\u2604\u260E\u2611\u2614\u2615\u2618\u261D\u2620\u2622\u2623\u2626\u262A\u262E\u262F\u2638-\u263A\u2648-\u2653\u2660\u2663\u2665\u2666\u2668\u267B\u267F\u2692-\u2694\u2696\u2697\u2699\u269B\u269C\u26A0\u26A1\u26AA\u26AB\u26B0\u26B1\u26BD\u26BE\u26C4\u26C5\u26C8\u26CE\u26CF\u26D1\u26D3\u26D4\u26E9\u26EA\u26F0-\u26F5\u26F7-\u26FA\u26FD\u2702\u2705\u2708-\u270D\u270F\u2712\u2714\u2716\u271D\u2721\u2728\u2733\u2734\u2744\u2747\u274C\u274E\u2753-\u2755\u2757\u2763\u2764\u2795-\u2797\u27A1\u27B0\u27BF\u2934\u2935\u2B05-\u2B07\u2B1B\u2B1C\u2B50\u2B55\u3030\u303D\u3297\u3299]|\uD83C[\uDC04\uDCCF\uDD70\uDD71\uDD7E\uDD7F\uDD8E\uDD91-\uDD9A\uDE01\uDE02\uDE1A\uDE2F\uDE32-\uDE3A\uDE50\uDE51\uDF00-\uDF21\uDF24-\uDF93\uDF96\uDF97\uDF99-\uDF9B\uDF9E-\uDFF0\uDFF3-\uDFF5\uDFF7-\uDFFF]|\uD83D[\uDC00-\uDCFD\uDCFF-\uDD3D\uDD49-\uDD4E\uDD50-\uDD67\uDD6F\uDD70\uDD73-\uDD79\uDD87\uDD8A-\uDD8D\uDD90\uDD95\uDD96\uDDA5\uDDA8\uDDB1\uDDB2\uDDBC\uDDC2-\uDDC4\uDDD1-\uDDD3\uDDDC-\uDDDE\uDDE1\uDDE3\uDDEF\uDDF3\uDDFA-\uDE4F\uDE80-\uDEC5\uDECB-\uDED0\uDEE0-\uDEE5\uDEE9\uDEEB\uDEEC\uDEF0\uDEF3]|\uD83E[\uDD10-\uDD18\uDD80-\uDD84\uDDC0]|(?:0\u20E3|1\u20E3|2\u20E3|3\u20E3|4\u20E3|5\u20E3|6\u20E3|7\u20E3|8\u20E3|9\u20E3|#\u20E3|\\*\u20E3|\uD83C(?:\uDDE6\uD83C(?:\uDDEB|\uDDFD|\uDDF1|\uDDF8|\uDDE9|\uDDF4|\uDDEE|\uDDF6|\uDDEC|\uDDF7|\uDDF2|\uDDFC|\uDDE8|\uDDFA|\uDDF9|\uDDFF|\uDDEA)|\uDDE7\uD83C(?:\uDDF8|\uDDED|\uDDE9|\uDDE7|\uDDFE|\uDDEA|\uDDFF|\uDDEF|\uDDF2|\uDDF9|\uDDF4|\uDDE6|\uDDFC|\uDDFB|\uDDF7|\uDDF3|\uDDEC|\uDDEB|\uDDEE|\uDDF6|\uDDF1)|\uDDE8\uD83C(?:\uDDF2|\uDDE6|\uDDFB|\uDDEB|\uDDF1|\uDDF3|\uDDFD|\uDDF5|\uDDE8|\uDDF4|\uDDEC|\uDDE9|\uDDF0|\uDDF7|\uDDEE|\uDDFA|\uDDFC|\uDDFE|\uDDFF|\uDDED)|\uDDE9\uD83C(?:\uDDFF|\uDDF0|\uDDEC|\uDDEF|\uDDF2|\uDDF4|\uDDEA)|\uDDEA\uD83C(?:\uDDE6|\uDDE8|\uDDEC|\uDDF7|\uDDEA|\uDDF9|\uDDFA|\uDDF8|\uDDED)|\uDDEB\uD83C(?:\uDDF0|\uDDF4|\uDDEF|\uDDEE|\uDDF7|\uDDF2)|\uDDEC\uD83C(?:\uDDF6|\uDDEB|\uDDE6|\uDDF2|\uDDEA|\uDDED|\uDDEE|\uDDF7|\uDDF1|\uDDE9|\uDDF5|\uDDFA|\uDDF9|\uDDEC|\uDDF3|\uDDFC|\uDDFE|\uDDF8|\uDDE7)|\uDDED\uD83C(?:\uDDF7|\uDDF9|\uDDF2|\uDDF3|\uDDF0|\uDDFA)|\uDDEE\uD83C(?:\uDDF4|\uDDE8|\uDDF8|\uDDF3|\uDDE9|\uDDF7|\uDDF6|\uDDEA|\uDDF2|\uDDF1|\uDDF9)|\uDDEF\uD83C(?:\uDDF2|\uDDF5|\uDDEA|\uDDF4)|\uDDF0\uD83C(?:\uDDED|\uDDFE|\uDDF2|\uDDFF|\uDDEA|\uDDEE|\uDDFC|\uDDEC|\uDDF5|\uDDF7|\uDDF3)|\uDDF1\uD83C(?:\uDDE6|\uDDFB|\uDDE7|\uDDF8|\uDDF7|\uDDFE|\uDDEE|\uDDF9|\uDDFA|\uDDF0|\uDDE8)|\uDDF2\uD83C(?:\uDDF4|\uDDF0|\uDDEC|\uDDFC|\uDDFE|\uDDFB|\uDDF1|\uDDF9|\uDDED|\uDDF6|\uDDF7|\uDDFA|\uDDFD|\uDDE9|\uDDE8|\uDDF3|\uDDEA|\uDDF8|\uDDE6|\uDDFF|\uDDF2|\uDDF5|\uDDEB)|\uDDF3\uD83C(?:\uDDE6|\uDDF7|\uDDF5|\uDDF1|\uDDE8|\uDDFF|\uDDEE|\uDDEA|\uDDEC|\uDDFA|\uDDEB|\uDDF4)|\uDDF4\uD83C\uDDF2|\uDDF5\uD83C(?:\uDDEB|\uDDF0|\uDDFC|\uDDF8|\uDDE6|\uDDEC|\uDDFE|\uDDEA|\uDDED|\uDDF3|\uDDF1|\uDDF9|\uDDF7|\uDDF2)|\uDDF6\uD83C\uDDE6|\uDDF7\uD83C(?:\uDDEA|\uDDF4|\uDDFA|\uDDFC|\uDDF8)|\uDDF8\uD83C(?:\uDDFB|\uDDF2|\uDDF9|\uDDE6|\uDDF3|\uDDE8|\uDDF1|\uDDEC|\uDDFD|\uDDF0|\uDDEE|\uDDE7|\uDDF4|\uDDF8|\uDDED|\uDDE9|\uDDF7|\uDDEF|\uDDFF|\uDDEA|\uDDFE)|\uDDF9\uD83C(?:\uDDE9|\uDDEB|\uDDFC|\uDDEF|\uDDFF|\uDDED|\uDDF1|\uDDEC|\uDDF0|\uDDF4|\uDDF9|\uDDE6|\uDDF3|\uDDF7|\uDDF2|\uDDE8|\uDDFB)|\uDDFA\uD83C(?:\uDDEC|\uDDE6|\uDDF8|\uDDFE|\uDDF2|\uDDFF)|\uDDFB\uD83C(?:\uDDEC|\uDDE8|\uDDEE|\uDDFA|\uDDE6|\uDDEA|\uDDF3)|\uDDFC\uD83C(?:\uDDF8|\uDDEB)|\uDDFD\uD83C\uDDF0|\uDDFE\uD83C(?:\uDDF9|\uDDEA)|\uDDFF\uD83C(?:\uDDE6|\uDDF2|\uDDFC))))[\ufe00-\ufe0f\u200d]?)+"

# Keys whose values are stored as integers
int_keys = {"media_id", "timestamp", "num_comments", "num_likes", "account_id"}
separator = "-" * 50

for l in fp:
    line = l.lstrip()

    # Captions can span several lines; keep appending until the next header
    if reading_caption and not line.startswith("Number of comments:"):
        caption_str += line
        continue

    if line.startswith("Caption:"):
        reading_caption = 1
        caption_str += line[len("Caption:"):].lstrip()
    elif line.startswith(separator):
        # End of one post: store the record and start a new one
        metadata.append(image_info)
        image_info = {}
    else:
        # "Media Info:" and "Account info:" lines match no header and are skipped
        for hdr_str, key in hdr_to_key.items():
            if not line.startswith(hdr_str):
                continue
            if hdr_str == "Number of comments:":
                # The caption ends where the comment count begins
                reading_caption = 0
                image_info["caption"] = caption_str
                hashtags = re.findall(hashtag_regex_string, caption_str, re.UNICODE)
                image_info["hashtags"] = list(set(hashtags))
                caption_str = ""
            # Slice off the header: str.strip(hdr_str) removes a set of characters, not a prefix
            value = line[len(hdr_str):].strip()
            image_info[key] = int(value) if key in int_keys else value
            break

fp.close()

with open("climate_igram_10k.json", 'w') as outfile:
    images_json = {"media_metadata": metadata}
    json.dump(images_json, outfile, indent=4)

In [None]:
# Get Image data for the meta-data collected

# parser = argparse.ArgumentParser()
# parser.add_argument('--jsonfile')
# parser.add_argument('--outfile')
# args = parser.parse_args()

fp = open("climate_igram_10k.json", 'r')

media_json = json.load(fp)

img_links = []

for media in media_json["media_metadata"]:
    if(media["media_type"] == "image"):
        img_links.append(media["media_link"])

fp.close()

with open("IMG_climate_igram_10k.json", 'w') as fo:
    links_json = {"img_links": img_links}
    json.dump(links_json, fo, indent=4)


In [None]:
# Scraper 

import scrapy
from scrapy.crawler import CrawlerProcess
import os
import logging
import json
import argparse


# img_links - JSON file containing list of image links
# out_dir - directory where images are stored
# parser = argparse.ArgumentParser()
# parser.add_argument('--img_links')
# parser.add_argument('--out_dir')
# args = parser.parse_args()

img_links = []

# Images are written to the "images_climate_10k" directory, created below if missing

with open("IMG_climate_igram_10k.json", 'r') as fin:
    img_json = json.load(fin)
    img_links = img_json["img_links"]

# Create the output directory; do nothing if it already exists
os.makedirs("images_climate_10k", exist_ok=True)

# Download IG images
class ImgSpider(scrapy.Spider):
    name = "img"
    
    # Describe requests
    def start_requests(self):        
        for url in img_links:
            req = scrapy.Request(url, self.save_img)
            im_name = url.rsplit('/', 1)[1].rsplit('?', 1)[0]
            # args is not defined in the notebook (the argparse lines are commented out)
            p_im = os.path.join("images_climate_10k", im_name)
            req.meta["img_path"] = p_im
            yield req   

    # Save images
    # Don't use image libraries 
    # Since some formats may not be recognized
    def save_img(self, response):
        p_im = response.meta["img_path"]
        with open(p_im, "wb") as fout:
            # Body of page is image data
            fout.write(response.body)    
        

# Be polite! Set a download delay, in seconds.
# Reduce/increase concurrent requests if needed
# Set settings and run spider
# TODO - autothrottle
# Things to explore -
# 1. Autothrottle 2. Pipelines 3. Writing to database 4. Xpath 5. Infinite scrolling
img_settings = {
                    "BOT_NAME": 'igimg',
                    "LOG_LEVEL": "WARNING",
                    "USER_AGENT": 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.81 Safari/537.36',
                    "ROBOTSTXT_OBEY": True,
                    "CONCURRENT_REQUESTS": 32,
                    "DOWNLOAD_DELAY": 1.5
                }                

img_process = CrawlerProcess(settings=img_settings)
img_process.crawl(ImgSpider)
img_process.start()

### Data Cleaning

The first step in our modeling pipeline is data cleaning, an essential part of exploratory data analysis. The data collected using our scraper was obtained as an unstructured text file, which required conversion to a structured format and additional cleaning to segregate the hashtags for further analysis. We achieved this by converting the unstructured text data to a JSON file, in which each object holds name-value pairs for the following fields:
1. media_id: A unique number of the post on Instagram
2. shortcode: Shortcode for the tinyurl
3. timestamp: Unix timestamp of when the content was posted
4. caption: The caption of the post
5. hashtags: The hashtags used with that post
6. num_comments: The total number of comments
7. num_likes: The total number of likes
8. post_link: The link to the post on Instagram
9. media_link: The link to only the image, story, or video on Instagram (without meta-data for that post)
10. media_type: Type of post - Image, Video, Story (known as sidecar)
11. account_id: The unique account id on Instagram for the user who posted this content.

After the data was converted to JSON format for easy parsing and analysis, the following additional cleaning steps were applied:
1. The data was filtered to retain only media that were images, since our objective is to analyze how users post images on Instagram, using image embeddings to do so.
2. The hashtags of the filtered image posts were processed further to remove redundant variants such as "ClimateChange", which represent the same hashtag as "climatechange".
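
To make the field list concrete, here is a minimal sketch of reading one record and converting its Unix timestamp to a calendar date. The record values are hypothetical placeholders chosen for illustration, not entries from our data set.

```python
from datetime import datetime, timezone

# Hypothetical record using the field names listed above (values are made up)
record = {
    "media_id": 1,
    "shortcode": "EXAMPLE",
    "timestamp": 1574899200,
    "caption": "Act now #climatechange",
    "hashtags": ["climatechange"],
    "num_comments": 2,
    "num_likes": 15,
    "media_type": "image",
    "account_id": 42,
}

# Unix timestamps count seconds since 1970-01-01 UTC
posted_at = datetime.fromtimestamp(record["timestamp"], tz=timezone.utc)
print(posted_at.isoformat())
```

A conversion like this is one way to check that the scraped posts fall inside the expected collection window.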
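
The two cleaning steps above can be sketched roughly as follows. This is an illustration under assumptions rather than the exact code we ran: the helper names `keep_images` and `normalize_hashtags` are hypothetical, and treating "redundant" hashtags as case variants of the same tag is one reading of the step.

```python
def keep_images(posts):
    """Return only the posts whose media_type is 'image'."""
    return [p for p in posts if p.get("media_type") == "image"]


def normalize_hashtags(hashtags):
    """Case-fold hashtags and drop duplicates, keeping first-seen order."""
    seen = {}
    for tag in hashtags:
        seen.setdefault(tag.casefold(), None)
    return list(seen)


# Toy input, made up for illustration
posts = [
    {"media_type": "image", "hashtags": ["ClimateChange", "climatechange", "Vegan"]},
    {"media_type": "video", "hashtags": ["climatechange"]},
]

cleaned = [
    dict(p, hashtags=normalize_hashtags(p["hashtags"]))
    for p in keep_images(posts)
]
print(cleaned)
```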



### Converting posts to embeddings

We converted the images to image embeddings using VGGNet, and the captions (cleaned by removing emojis, hashtag symbols etc.) to sentence embeddings using a 16-language text encoder (https://tfhub.dev/google/universal-sentence-encoder-multilingual/2). Some of the code for converting the images to embeddings is shown below. The notebook for this code is present separately in the repo. The code for the sentence embeddings was similar but much simpler, as we simply had to query the TensorFlow Hub API.
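
The caption cleaning mentioned above could look something like the sketch below. It is an assumption about one reasonable approach, not the exact code we ran: the function name `clean_caption` is hypothetical, and emojis are approximated as characters in the Unicode categories "So" and "Sk", which also drops a few non-emoji symbols such as © or ^.

```python
import unicodedata


def clean_caption(caption):
    """Remove hashtag symbols and emoji-like characters, then collapse whitespace."""
    kept = []
    for ch in caption:
        if ch == "#" or ch in ("\u200d", "\ufe0f"):
            # Hashtag symbol, zero-width joiner, or emoji variation selector
            kept.append(" ")
        elif unicodedata.category(ch) in ("So", "Sk"):
            # Symbol characters, which cover most emojis and skin-tone modifiers
            kept.append(" ")
        else:
            kept.append(ch)
    return " ".join("".join(kept).split())


print(clean_caption("Save the planet 🌍🔥 #climatechange #Act_Now!"))
```

The hashtag words themselves are kept, so they still contribute to the sentence embedding.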


In [None]:
import torch
from PIL import Image
from torchvision import transforms
import torchvision.models as models
import numpy as np
import os
import json

In [None]:
preprocess = transforms.Compose([
    transforms.Resize(299),
    transforms.CenterCrop(299),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

In [None]:
model = models.vgg16(pretrained=True)
# Switch to inference mode so the Dropout layers in the classifier are disabled
model.eval()
print(list(model.children()))

In [None]:
im_dir = '/content/gdrive/My Drive/Fall_2019/Data science/Project/Climate_scrape/images_climate_10k'

im_names = []

images = []

listdirs = os.listdir(im_dir)

num_pics = len(listdirs)
lists = []

for i in range(num_pics // 500 + 1):
  lo = i * 500
  hi = (i+1) * 500
  if(i == num_pics//500):
    hi = num_pics
  lists.append(listdirs[lo:hi])

for l in lists:
  print(len(l))

# lists = [listdirs[:1000], listdirs[1000:2000], listdirs[2000:3000], listdirs[3000:4000], 
#         listdirs[4000:5000], listdirs[5000:6000], listdirs[6000:]]

In [None]:
for num_list, listdir in enumerate(lists):
  print(f"List {num_list}")
  n = len(listdir)
  print(f"Num of elements = {n}")
  if n == 0:
    # The last chunk is empty when the image count is a multiple of 500
    continue
  input_batch = np.zeros([n, 3, 299, 299])
  for i, name in enumerate(listdir):
    if((i+1) % 100 == 0):
      print(f"Iteration {i+1}")
    im_path = os.path.join(im_dir, name)
    im_names.append(im_path)
    with Image.open(im_path) as im:
      if(im.mode != "RGB"):
        im = im.convert("RGB")
      it = preprocess(im)
      input_batch[i, :] = np.array(it)

  input_batch = torch.tensor(np.stack(input_batch, axis=0), dtype=torch.float32)
  print(input_batch.shape)
  # second_m = torch.nn.Sequential(*list(model.children()))

  torch.cuda.empty_cache()

  if torch.cuda.is_available():
    input_batch = input_batch.to('cuda')
    model.to('cuda')

  list_outputs = []

  with torch.no_grad():
    # Process the images in mini-batches of 64; the final batch may be smaller
    for batch_num, lo in enumerate(range(0, n, 64)):
      if((batch_num+1) % 20 == 0):
        print(f"Batch {batch_num+1}")
      hi = min(lo + 64, n)
      output2 = model.features(input_batch[lo:hi])
      output3 = model.avgpool(output2)
      output3 = torch.flatten(output3, 1)
      # Use the output of the second fully connected layer as the embedding
      list_outputs.append(model.classifier[:4](output3))

  list_outputs = [np.array(arr.cpu()) for arr in list_outputs]
  print(len(list_outputs))
  output = np.concatenate(list_outputs)
  list_outputs = None

  # Store each embedding under its image file name, as a list of strings
  embed_json = {}
  for i, im in enumerate(listdir):
    embed = list(output[i, :])
    embed = [str(fl) for fl in embed]
    embed_json[im] = embed

  with open(f'/content/gdrive/My Drive/Fall_2019/Data science/Project/Climate_scrape/im_embed_{num_list}.json', 'w') as f2:
    json.dump(embed_json, f2, indent=4)

  # Release the large arrays before the next chunk
  embed_json = None
  input_batch = None
  output = None

### Modelling and Visualizations

The objective of our project was to create groups and sub-groups of the data to characterize the images with related hashtags and sentences. The data derived had no inherent annotations to classify into categories, which called for the use of unsupervised learning techniques. Using unsupervised learning algorithms, like k-means clustering, hierarchical clustering, and DBSCAN in a two-step process we analyzed the image data and meta-data of hashtags and captions. 

In step 1, we first created image embeddings using a pre-trained model from PyTorch, 'Inception_v3'. Inception_v3 is an image recognition model created by Google (https://pytorch.org/hub/pytorch_vision_inception_v3/) used for projects requiring image recognition. We used this model because in its 42-layer network architecture could.....

The image embeddings were then clustered and visualized using t-SNE to see how the representative groups were distributed.

In step 2, the captions from the image posts were analyzed by converting them to vectors using sentence-embedding techniques. These vectors were grouped into clusters using the k-means, DBSCAN, and hierarchical clustering algorithms, as shown below. After the clusters were created, they were analyzed by counting the most frequent hashtags in each.

# TSNE visualisation of KMeans

In [None]:
with open('line_embeddings.json', 'r') as f:
  array = json.load(f)
with open('img_embeddings.json', 'r') as f:
  img_array = json.load(f)

In [None]:
img_names = list(array.keys())
img_dict = {im_name:i for i, im_name in enumerate(img_names)}

textembl = [np.float32(array[key]) for key in array]
textembl = np.array(textembl)
keys1 = [img_dict[key] for key in array]
array = None
textembl=np.insert(textembl,0,keys1,1)
#print(len(textembl),textembl[0])
print("Loaded sentence embeddings")

img_embed = [np.float32(img_array[key]) for key in img_array]
img_embed = np.array(img_embed)
img_keys = [img_dict[key] for key in img_array]
img_array = None
img_embed = np.insert(img_embed, 0, img_keys, 1)

In [None]:
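import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.cluster import KMeans
from sklearn.manifold import TSNE

# n_clusters=8 is the scikit-learn default, stated explicitly to match the 8-colour palette below
kmeans = KMeans(n_clusters=8, init='k-means++', random_state=42)
kmeans.fit(textembl[:,1:].astype('float32'))
ypred = kmeans.predict(textembl[:,1:].astype('float32'))

# Column 0 holds the post index; the remaining columns are the embedding dimensions
feat_cols = [ 'pixel'+str(i) for i in range(textembl.shape[1]) ]
df = pd.DataFrame(textembl,columns=feat_cols)
tsne = TSNE(n_components=2, verbose=1, perplexity=40, n_iter=300)
tsne_results = tsne.fit_transform(textembl[:, 1:])

# Attach the 2-D projection to the same DataFrame that is plotted
df['tsne-2d-one'] = tsne_results[:,0]
df['tsne-2d-two'] = tsne_results[:,1]

plt.figure(figsize=(16,10))
sns.scatterplot(
    x="tsne-2d-one", y="tsne-2d-two",
    hue=ypred,
    palette=sns.color_palette("hls", 8),
    data=df,
    # legend="full",
    alpha=0.3)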

<img src="tSNE.png">
t-SNE visualization of the clustered sentence embeddings, projected to two dimensions.


We then analyzed the hashtag distribution within each cluster and found something very interesting: the top 10 hashtags in each cluster more or less followed a subjective theme. This is demonstrated below.
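
As a rough illustration of this per-cluster hashtag count, the sketch below tallies the most frequent hashtags for each cluster label. The function name `top_hashtags_per_cluster` and the toy inputs are hypothetical. In the notebook the labels would come from `ypred` and the hashtags from the JSON meta-data, aligned by post.

```python
from collections import Counter, defaultdict


def top_hashtags_per_cluster(labels, hashtags_per_post, k=10):
    """Map each cluster label to its k most frequent hashtags with counts."""
    counts = defaultdict(Counter)
    for label, tags in zip(labels, hashtags_per_post):
        counts[label].update(tags)
    return {label: counter.most_common(k) for label, counter in counts.items()}


# Toy example: three posts assigned to two clusters
labels = [0, 0, 1]
hashtags_per_post = [["vegan", "plantbased"], ["vegan"], ["climatestrike"]]
top = top_hashtags_per_cluster(labels, hashtags_per_post)
print(top)
```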

<img src="plots/line/cat0.png">
Theme is environment and sustainability.

<img src="plots/line/cat1.png">
Theme is climate activism.

<img src="plots/line/cat2.png">
Theme is nature photography.

<img src="plots/line/cat3.png">
Theme is eco-friendliness measures like recycling.

<img src="plots/line/cat4.png">
This is kind of a mixed bag, but it includes food related hashtags.

<img src="plots/line/cat5.png">
Theme is the climate crisis.

<img src="plots/line/cat6.png">
This is again mixed, but some themes identified are LGBTQ and feminism.

<img src="plots/line/cat7.png">
This focuses VERY prominently on veganism, despite the actual share of vegan-tagged posts being quite low. In fact, vegan-related hashtags did not even come up when this clustering analysis was done with image embeddings.

We tried this same analysis using image embeddings, but the topmost hashtags in each category discovered by KMeans were all almost the same throughout the groups. There was no meaningful discovery of important groups of posts using image embeddings.

# Hierarchical Clustering - Dendrograms (based on sentence embeddings)

In [None]:
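import matplotlib.pyplot as plt
import scipy.cluster.hierarchy as shc

plt.figure(figsize=(10, 7))
plt.title("Posts classification")
# Ward linkage on the sentence embeddings; plot only the last p merged clusters
linkage_matrix = shc.linkage(textembl[:,1:].astype('float32'), method='ward')
dend = shc.dendrogram(linkage_matrix, show_leaf_counts=True, truncate_mode='lastp',
    show_contracted=True)
# Horizontal cut line used to read off the clusters
plt.axhline(y=11.5, color='r', linestyle='--')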

<img src="dendo.png">
A dendrogram does not provide a single partitioning of the data set, but instead presents an extensive hierarchy of clusters that merge with each other at certain distances. In a dendrogram, the y-axis marks the distance at which the clusters merge. Based on these distances, we read the dendrogram as partitioning the data into 4 groups. With truncate_mode='lastp', the last p non-singleton clusters formed in the linkage are the only non-leaf nodes shown, and all other non-singleton clusters are contracted into leaf nodes. The 'ward' linkage method is used, which minimizes the variance of the clusters being merged.
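
To illustrate what Ward's criterion minimizes, the sketch below computes the increase in within-cluster sum of squared distances caused by merging two small clusters. It is a toy illustration with hypothetical helper names, not SciPy's internal implementation.

```python
def centroid(points):
    """Mean of a list of equal-length coordinate tuples."""
    n = len(points)
    return [sum(p[d] for p in points) / n for d in range(len(points[0]))]


def sse(points):
    """Sum of squared distances from each point to the cluster centroid."""
    c = centroid(points)
    return sum((p[d] - c[d]) ** 2 for p in points for d in range(len(c)))


def ward_merge_cost(a, b):
    """Increase in total within-cluster SSE if clusters a and b are merged."""
    return sse(a + b) - sse(a) - sse(b)


a = [(0.0, 0.0), (2.0, 0.0)]
b = [(4.0, 0.0)]
cost = ward_merge_cost(a, b)
print(cost)
```

At each step, Ward's method merges the pair of clusters with the smallest such cost. The same value can be obtained from the closed form |A||B| / (|A| + |B|) times the squared distance between the two centroids.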

### Results

We discovered that clustering using sentence embeddings is the most helpful way to analyze Instagram posts for the hashtag climatechange, while the images carried little to no meaning. Indeed, the images, on visual inspection, were found to be very random, consisting of stock photos, activists, memes or posters, often unrelated to the caption, which held the real message.

### Conclusions and Future Work
Through our analysis, we attempted to examine 10,000 publicly available Instagram image posts that used the hashtag climatechange. Our analysis showed that the myriad of images posted on Instagram with this hashtag do not fall neatly into the domain of climate change science. We purposefully chose this hashtag for our analysis, as climatechange is a recurring trending topic and it was interesting to examine how users apply it to their posts. Our initial hypothesis was that relevant images would be using this hashtag, although co-occurring gobbledygook hashtags alongside climatechange were expected. This was because the trending hashtags often used to make a post popular are random words like “follow4follow”, “fun”, “instalike” etc. (https://www.all-hashtag.com/top-hashtags.php). Instead, we found that no key takeaways could be derived from the image clusters, due to the plethora of different images posted on Instagram. In contrast, the clusters of sentence embeddings derived from the captions contained similar subjective domains related to climate change: for example, hashtags that advocated climate change action were clustered together, and hashtags that highlighted veganism were grouped together. Thus, it can be concluded that text data such as captions and hashtags are much more helpful in characterizing the topic.

To understand user behavior in depth on Instagram, some other techniques of exploration can be adopted in future:
1. The scraped data contained the total number of comments, the total number of likes, and the total number of organic + sponsored likes (applicable only to Business profiles on Instagram, which promote posts to make them more popular). This data can be leveraged to check which hashtags within each cluster of sentence embeddings led to more comments and likes.

2. The data collected in our study was limited to 10,000 posts, due to the technical complexity posed by the scraper used and Instagram’s privacy policy. More sophisticated methods can be used in future to extract more posts. More data will help the models produce better clusters. This might not hold for other Data Science problems, but in our case the data extracted spanned only two days, and topics like climate change often see a surge in posts driven by public outcry, so collecting more data will be helpful.
3. Furthermore, analyzing user behavior on Instagram should not be limited to data derived for a particular hashtag. Instead, data for multiple hashtags or location-specific data should be extracted to analyze which ‘categories’ (again, clusters of image posts) trend in a particular region or for a group of co-occurring hashtags.
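
As a starting point for the first item above, the engagement per hashtag within each cluster could be computed along these lines. This is a hypothetical sketch of an analysis we did not run: the function name `mean_likes_by_hashtag` and the toy posts are made up, and only the `hashtags` and `num_likes` field names come from our JSON format.

```python
from collections import defaultdict


def mean_likes_by_hashtag(posts, labels):
    """Average num_likes for every (cluster label, hashtag) pair."""
    totals = defaultdict(lambda: [0, 0])  # (label, tag) -> [sum of likes, post count]
    for post, label in zip(posts, labels):
        for tag in set(post["hashtags"]):
            entry = totals[(label, tag)]
            entry[0] += post["num_likes"]
            entry[1] += 1
    return {key: likes / count for key, (likes, count) in totals.items()}


# Toy posts and cluster labels, made up for illustration
posts = [
    {"hashtags": ["vegan", "climatechange"], "num_likes": 10},
    {"hashtags": ["vegan"], "num_likes": 30},
    {"hashtags": ["climatechange"], "num_likes": 5},
]
labels = [0, 0, 1]
means = mean_likes_by_hashtag(posts, labels)
print(means)
```

The same pattern applies to `num_comments` by swapping the field name.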