# Clustering Post-hoc Analysis

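This notebook analyses the clustering output as marked by the coders: it loads the cluster IDs saved in the first pass, filters the clustered segments down to those clusters, and prepares the matching original segments for a second pass.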

## First pass at saved clusters

In [1]:
directory = '/mnt/5tb/dark-patterns-output/'

saved_clusters = []

# One cluster ID per line; skip blank lines so a trailing newline cannot break int().
with open(directory + 'saved_first_pass_arunesh.txt') as f:
    for line in f:
        if line.strip():
            saved_clusters.append(int(line))

How many clusters were saved in the first pass?

In [2]:
len(saved_clusters)

1768

Read in the clustering output, then keep only the segments that belong to a saved cluster.

In [3]:
import pandas as pd

segment_clusters = pd.read_pickle(directory + 'clusters_with_processed_text.pickle')
len(segment_clusters.index)

1240588

In [4]:
segment_clusters_saved = segment_clusters[segment_clusters.cluster_10_bow_euc.isin(saved_clusters)]
len(segment_clusters_saved.index)

178584
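The filter in In [4] relies on `Series.isin` to build a boolean mask over the cluster column. As a minimal sketch of that step, the toy frame below reuses the notebook's column names, but its rows and the list `toy_saved` are invented for illustration only.

```python
import pandas as pd

# Toy stand-in for segment_clusters; the rows are made up for illustration.
toy_clusters = pd.DataFrame({
    'inner_text_processed': ['only 2 left', 'free shipping', 'about us', 'hurry now'],
    'cluster_10_bow_euc': [3, 7, 12, 3],
})
toy_saved = [3, 7]  # hypothetical saved cluster IDs

# isin() yields a boolean mask that is True where the cluster ID is in toy_saved.
toy_kept = toy_clusters[toy_clusters.cluster_10_bow_euc.isin(toy_saved)]

# Share of segments, and number of distinct clusters, that survive the filter.
kept_fraction = len(toy_kept) / len(toy_clusters)
kept_cluster_count = toy_kept.cluster_10_bow_euc.nunique()
```

Applied to the counts above, 178,584 of the 1,240,588 clustered segments (about 14%) belong to a saved cluster.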

Next, go back to the original segments and label each one with the cluster ID of its processed text, ready for the second pass.

In [5]:
import json
from tqdm import tqdm

segments_json = directory + 'segments.json'

# Map each processed text to its cluster ID. A separate name keeps the list of
# saved cluster IDs from In [1] intact, so In [4] can be re-run safely.
text_to_cluster = dict(zip(segment_clusters_saved.inner_text_processed.values, segment_clusters_saved.cluster_10_bow_euc.values))

segments_second_pass_text = []
segments_second_pass_hostname = []
segments_second_pass_site_url = []
segments_second_pass_cluster_id = []

with open(segments_json) as f:

    # segments.json is read one JSON object per line
    for line in tqdm(f):
        segment = json.loads(line)

        # Keep every original segment whose processed text is in a saved cluster
        if segment['inner_text_processed'] in text_to_cluster:
            segments_second_pass_text.append(segment['inner_text'])
            segments_second_pass_hostname.append(segment['hostname'])
            segments_second_pass_site_url.append(segment['site_url'])
            segments_second_pass_cluster_id.append(text_to_cluster[segment['inner_text_processed']])

1850895it [00:27, 67543.92it/s]


In [6]:
segments_second_pass = pd.DataFrame({'hostname': segments_second_pass_hostname, 
                                     'inner_text': segments_second_pass_text,
                                     'site_url': segments_second_pass_site_url,
                                     'cluster_id': segments_second_pass_cluster_id})

In [7]:
len(segments_second_pass.index)

267380
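The second-pass frame (267,380 rows) is larger than the filtered clustering output (178,584 rows). One plausible reading, which the notebook does not confirm, is that several original segments share the same processed text, so a lookup keyed on processed text labels all of them. The sketch below illustrates that effect; the mapping, hostnames, and texts are hypothetical.

```python
import json

# Hypothetical processed-text -> cluster ID mapping.
toy_text_to_cluster = {'only N left': 3, 'free shipping': 7}

# Hypothetical JSON-lines input: three raw segments, two sharing one processed text.
toy_lines = [
    json.dumps({'inner_text': 'Only 2 left!', 'inner_text_processed': 'only N left',
                'hostname': 'a.example'}),
    json.dumps({'inner_text': 'Only 5 left!', 'inner_text_processed': 'only N left',
                'hostname': 'b.example'}),
    json.dumps({'inner_text': 'About us', 'inner_text_processed': 'about us',
                'hostname': 'a.example'}),
]

toy_labelled = []
for line in toy_lines:
    segment = json.loads(line)
    # Look the processed text up once; None means it is not in a saved cluster.
    cluster_id = toy_text_to_cluster.get(segment['inner_text_processed'])
    if cluster_id is not None:
        toy_labelled.append((segment['hostname'], segment['inner_text'], cluster_id))

# Distinct processed texts that matched, versus raw segments that were labelled.
matched_texts = {json.loads(line)['inner_text_processed'] for line in toy_lines} & set(toy_text_to_cluster)
```

Here one processed text matches, yet two raw segments are labelled, which is the same direction of difference seen in the counts above.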

Shuffle the segments and write them to disk.

In [8]:
segments_second_pass = segments_second_pass.sample(frac=1.0)
segments_second_pass.to_pickle(directory + 'segments_second_pass.pickle')
segments_second_pass.to_csv(directory + 'segments_second_pass.csv', encoding='utf-8', index=False)
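Note that `sample(frac=1.0)` without a seed produces a different order on every run. If a repeatable shuffle is wanted, for example so the pickle and CSV can be regenerated identically, passing `random_state` is one option. This is a sketch on a toy frame; the seed value is an arbitrary assumption, and resetting the index is optional.

```python
import pandas as pd

# Toy stand-in for segments_second_pass; the rows are made up for illustration.
toy_segments = pd.DataFrame({'inner_text': ['a', 'b', 'c', 'd', 'e'],
                             'cluster_id': [1, 1, 2, 3, 3]})

SEED = 0  # arbitrary, hypothetical seed; any fixed integer works

# The same seed gives the same permutation; reset_index drops the old row labels.
shuffled_once = toy_segments.sample(frac=1.0, random_state=SEED).reset_index(drop=True)
shuffled_again = toy_segments.sample(frac=1.0, random_state=SEED).reset_index(drop=True)
```

Both frames contain the same rows in the same shuffled order, so files written from either would have identical content.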