# Crawl Data Analysis: Feature Processing

This notebook applies several feature processing techniques to our web crawl data. It was written for Python 2.7. Before running it, make sure you have concatenated the two JSON files from the previous step.

## Read the segments

Point to the JSON file of processed segments from the previous step. The file holds one JSON object per line; the loading loop below parses each line with `json.loads`.

In [1]:
segments_json = '/mnt/ssd/amathur/dark-patterns-output/segments.json'
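
Since the loading loop calls `json.loads` on each line, the file is presumably in JSON Lines form. A minimal sketch for peeking at the first few records, assuming that layout; `peek_segments` is a hypothetical helper and the sample record is made up, with field names taken from the cells further down:

```python
import io
import json
from itertools import islice

def peek_segments(lines, n=3):
    """Parse the first n non-empty lines of a JSON Lines stream (hypothetical helper)."""
    non_empty = (line for line in lines if line.strip())
    return [json.loads(line) for line in islice(non_empty, n)]

# Made-up sample standing in for segments.json.
sample = io.StringIO(
    u'{"inner_text_processed": "only 2 left in stock", "top": 120, "left": 40, '
    u'"height": 30, "width": 200, "num_buttons": 1, "num_imgs": 0, "num_anchors": "2"}\n'
)
first = peek_segments(sample)
print(sorted(first[0].keys()))
```

With the real file, passing the open file handle (`with open(segments_json) as f: peek_segments(f)`) should work the same way.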

## Processing routines

A collection of preprocessing routines to run before feature extraction. Add more routines here as we try new approaches.

In [2]:
import nltk
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()
# A set makes the per-token stopword lookup constant time instead of a list scan.
stopwords = set(nltk.corpus.stopwords.words('english'))

def tokenize(line):
    """Tokenize a string, drop stopwords and all-digit tokens, and stem the rest."""
    if line is None:
        line = ''
    return [stemmer.stem(t) for t in nltk.word_tokenize(line)
            if t and t not in stopwords and not t.isdigit()]
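
To see which tokens the filter in `tokenize` keeps, here is a small sketch that applies the same three conditions to an already tokenized list. It deliberately leaves out NLTK: `TOY_STOPWORDS` is a made-up stand-in for the English stopword list, and stemming is skipped, so the output only illustrates the filtering step.

```python
# Made-up stand-in for nltk.corpus.stopwords.words('english').
TOY_STOPWORDS = {'only', 'in', 'the'}

def keep_token(t):
    """Same conditions as the list comprehension in tokenize(), minus stemming."""
    return len(t) != 0 and t not in TOY_STOPWORDS and not t.isdigit()

tokens = ['only', '2', 'left', 'in', 'stock', '!', '2nd']
kept = [t for t in tokens if keep_token(t)]
print(kept)  # ['left', 'stock', '!', '2nd']
```

Punctuation and mixed tokens such as `2nd` pass through, because only all-digit tokens are dropped.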

## Bag of words - HashingVectorizer

In [3]:
from sklearn.feature_extraction.text import HashingVectorizer
import json
from tqdm import tqdm
from scipy.sparse import vstack, hstack
import numpy as np

vec = HashingVectorizer(tokenizer=tokenize, strip_accents='ascii', norm=None)
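
As a quick check of what the hashing step produces, the sketch below runs a `HashingVectorizer` on two made-up strings. It uses the default tokenizer rather than `tokenize` so that it runs without the NLTK data; the point is only the output shape, which by default is 2**20 columns regardless of the input vocabulary.

```python
from sklearn.feature_extraction.text import HashingVectorizer

# Default tokenizer here (an assumption for brevity), same norm=None as above.
toy_vec = HashingVectorizer(strip_accents='ascii', norm=None)

# No fit step is needed: the hasher keeps no vocabulary.
toy = toy_vec.transform([u'free shipping today', u'only 2 left in stock'])
print(toy.shape)
```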

In [4]:
feature_matrix = None
counter = 0

with open(segments_json) as f:
    seg_matrix_list = []
    for line in tqdm(f):
        seg = json.loads(line)
        # HashingVectorizer is stateless, so transform() needs no prior fit.
        seg_matrix = vec.transform([seg['inner_text_processed']])
        # Append the seven layout features as extra columns after the hashed text.
        seg_matrix = hstack((seg_matrix,
                             np.array([seg['top']])[:,None],
                             np.array([seg['left']])[:,None],
                             np.array([seg['height']])[:,None],
                             np.array([seg['width']])[:,None],
                             np.array([seg['num_buttons']])[:,None],
                             np.array([seg['num_imgs']])[:,None],
                             np.array([int(seg['num_anchors'])])[:,None]))

        seg_matrix_list.append(seg_matrix)
        counter += 1

        # Fold the pending rows into the running matrix every 500 segments.
        if counter % 500 == 0:
            feature_matrix = vstack([feature_matrix] + seg_matrix_list)
            seg_matrix_list = []
            counter = 0

    # Flush the final partial batch.
    feature_matrix = vstack([feature_matrix] + seg_matrix_list)

1850895it [59:07, 521.68it/s]
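
The loop above vectorizes one segment at a time. As a possible alternative, not what was run here, the same columns can be built a batch at a time, which avoids one `hstack` call per segment. A sketch under that assumption; `featurize_batch` and `LAYOUT_KEYS` are hypothetical names, the sample segments are made up, and the default tokenizer stands in for `tokenize`:

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import HashingVectorizer

# Layout fields in the same order as the hstack call above.
LAYOUT_KEYS = ['top', 'left', 'height', 'width',
               'num_buttons', 'num_imgs', 'num_anchors']

def featurize_batch(vectorizer, segs):
    """Hash the text of a list of segments and append their layout columns."""
    text = vectorizer.transform([s['inner_text_processed'] for s in segs])
    layout = np.array([[float(s[k]) for k in LAYOUT_KEYS] for s in segs])
    return hstack([text, csr_matrix(layout)], format='csr')

toy_vec = HashingVectorizer(strip_accents='ascii', norm=None)
segs = [
    {'inner_text_processed': u'only 2 left in stock', 'top': 120, 'left': 40,
     'height': 30, 'width': 200, 'num_buttons': 1, 'num_imgs': 0, 'num_anchors': '2'},
    {'inner_text_processed': u'free shipping today', 'top': 10, 'left': 0,
     'height': 50, 'width': 960, 'num_buttons': 0, 'num_imgs': 1, 'num_anchors': '0'},
]
batch = featurize_batch(toy_vec, segs)
print(batch.shape)
```

The toy batch has 2**20 + 7 = 1048583 columns: the hashed text columns followed by the 7 layout columns.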


In [5]:
feature_matrix

<1850895x1048583 sparse matrix of type '<type 'numpy.float64'>'
	with 25222926 stored elements in COOrdinate format>

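Let's write the matrix to disk. The save call is commented out below so that re-running the cell reloads the saved matrix instead of overwriting it.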

In [4]:
from scipy.sparse import save_npz, load_npz

#save_npz("/mnt/ssd/amathur/dark-patterns-output/feature_matrix.npz", feature_matrix)
feature_matrix = load_npz("/mnt/ssd/amathur/dark-patterns-output/feature_matrix.npz")
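
A small round-trip check, as a sketch on a made-up matrix written to a temporary directory, to confirm that `save_npz` and `load_npz` preserve shape and values:

```python
import os
import shutil
import tempfile

import numpy as np
from scipy.sparse import coo_matrix, save_npz, load_npz

# Made-up 2x3 matrix standing in for the real feature matrix.
toy = coo_matrix(np.array([[0.0, 3.0, 0.0],
                           [1.0, 0.0, 2.0]]))

tmp_dir = tempfile.mkdtemp()
try:
    path = os.path.join(tmp_dir, 'toy_matrix.npz')
    save_npz(path, toy)
    restored = load_npz(path)
finally:
    # Clean up the temporary directory whether or not the round trip succeeded.
    shutil.rmtree(tmp_dir)

print(restored.shape)
```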

## Dimensionality Reduction

Let's reduce the dimensions of this feature matrix to make clustering more tractable. What are the current dimensions?

In [5]:
feature_matrix.shape

(1850895, 1048583)

We will use TruncatedSVD because, unlike PCA, it does not center the data and so can operate on sparse matrices directly.

In [10]:
from sklearn.decomposition import TruncatedSVD

svd = TruncatedSVD(n_components=4)
result = svd.fit_transform(feature_matrix)

How much of the total variance do these 4 components explain together?

In [11]:
np.sum(svd.explained_variance_ratio_)

0.9999995115319016
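
A ratio this close to 1 from only 4 components may reflect column scale rather than text structure: the layout columns (`top`, `left`, `height`, `width`) are presumably in pixels, while the hashed text columns hold small counts. We have not checked this on the real matrix; the sketch below only shows the effect on made-up data, where a single large-valued column accounts for nearly all of the variance.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.decomposition import TruncatedSVD

rng = np.random.RandomState(0)
# Made-up data: 50 sparse count columns plus one pixel-scale column.
counts = csr_matrix(rng.poisson(0.05, size=(200, 50)).astype(float))
pixels = csr_matrix(rng.uniform(0, 5000, size=(200, 1)))
toy = hstack([counts, pixels], format='csr')

toy_svd = TruncatedSVD(n_components=2, random_state=0)
toy_reduced = toy_svd.fit_transform(toy)
print(toy_svd.explained_variance_ratio_)
```

If that is what is happening here, scaling the layout columns before the SVD (for example with `sklearn.preprocessing.MaxAbsScaler`, which accepts sparse input) might be worth trying.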

What are the dimensions of the returned matrix?

In [12]:
result.shape

(1850895, 4)

Let's write this to disk.

In [13]:
np.savetxt('/mnt/ssd/amathur/dark-patterns-output/feature_svd.arr', result)
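
`np.savetxt` writes plain text, so the array can be read back with `np.loadtxt`. A sketch on a made-up array in a temporary directory:

```python
import os
import shutil
import tempfile

import numpy as np

# Made-up 2x4 array standing in for the SVD output.
toy = np.array([[0.5, -1.25, 3.0, 0.0],
                [2.0, 0.125, -4.5, 1.0]])

tmp_dir = tempfile.mkdtemp()
try:
    path = os.path.join(tmp_dir, 'toy_svd.arr')
    np.savetxt(path, toy)
    restored = np.loadtxt(path)
finally:
    shutil.rmtree(tmp_dir)

print(restored.shape)
```

For an array of this size, the binary `np.save` / `np.load` pair is likely a more compact option.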