# Crawl Data Analysis: Feature Processing

This notebook applies several feature processing techniques to our web crawl data. It was written for Python 2.7. Before running it, make sure the two JSON files from the previous step have been appended to each other, since the notebook reads a single `segments.json`.

## Read the segments

Read the processed segments produced by the previous step. The file is read line by line further down, with one JSON object per line.

In [1]:
segments_json = '/mnt/ssd/amathur/dark-patterns-output/segments.json'

## Processing routines

A collection of preprocessing routines to apply before feature processing. Add more routines here depending on what we would like to try.

In [2]:
from nltk.stem.porter import PorterStemmer
import nltk

stemmer = PorterStemmer()
# Use a set so that stopword membership tests are O(1)
stopwords = set(nltk.corpus.stopwords.words('english'))

def tokenize(line):
    """Tokenize a line, drop stopwords and pure-digit tokens, then stem."""
    if line is None:
        line = ''
    tokens = [stemmer.stem(t) for t in nltk.word_tokenize(line)
              if len(t) != 0 and t not in stopwords and not t.isdigit()]
    return tokens

## Bag of words - HashingVectorizer

In [3]:
from sklearn.feature_extraction.text import HashingVectorizer
import json
from tqdm import tqdm
from scipy.sparse import vstack, hstack
import numpy as np
from scipy.sparse import save_npz, load_npz

vec = HashingVectorizer(tokenizer=tokenize, strip_accents='ascii', n_features=2**18, alternate_sign=False)
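
A small illustration of why hashing suits a line-by-line pass over the file: `HashingVectorizer` learns no vocabulary, so transforming documents one at a time yields the same rows as transforming them together. The sketch below uses two made-up sentences and the default tokenizer instead of `tokenize` (so it runs without NLTK data). The name `demo_vec` is hypothetical and the block is not part of the pipeline.

```python
from sklearn.feature_extraction.text import HashingVectorizer
from scipy.sparse import vstack

# Hypothetical example documents (not taken from the crawl data).
docs = ["only three left in stock", "offer ends in five minutes"]

# Same hashing settings as above, but with the default tokenizer.
demo_vec = HashingVectorizer(strip_accents='ascii', n_features=2**18,
                             alternate_sign=False)

# Transform both documents in one call ...
batch = demo_vec.transform(docs).tocsr()
# ... and one call per document, stacked afterwards.
one_by_one = vstack([demo_vec.transform([d]) for d in docs]).tocsr()

# The output width is fixed by n_features, independent of the input.
print(batch.shape)
# Largest absolute difference between the two results.
print(abs(batch - one_by_one).max())
```

Because the column index of a token depends only on its hash, the width of the matrix is known up front and rows produced in separate calls can be stacked safely.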

In [4]:
text_feature_matrix = None
other_feature_matrix = None
counter = 0

with open(segments_json) as f:
    text_matrix_list = []
    other_matrix_list = []
    
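    for line in tqdm(f):
        # Each line of the file holds one JSON-encoded segment
        segment = json.loads(line)
        # HashingVectorizer is stateless, so transform() needs no prior fit
        text_matrix = vec.transform([segment['inner_text_processed']])
        # Layout features: position and size of the segment
        other_matrix = np.array([[segment['top'], segment['left'], segment['height'], segment['width']]])
        # There are other features we might want to consider (e.g., num_anchors)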
        
        text_matrix_list.append(text_matrix)
        other_matrix_list.append(other_matrix)
        counter += 1
        
        # Fold the buffered rows into the running matrices every 50,000 segments
        if counter % 50000 == 0:
            text_feature_matrix = vstack([text_feature_matrix] + text_matrix_list)
            other_feature_matrix = vstack([other_feature_matrix] + other_matrix_list)
            text_matrix_list = []
            other_matrix_list = []
            counter = 0
        
    text_feature_matrix = vstack([text_feature_matrix] + text_matrix_list)
    other_feature_matrix = vstack([other_feature_matrix] + other_matrix_list)

1850895it [29:43, 1037.59it/s]
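
The loop above keeps the Python lists short by folding every 50,000 rows into the accumulated matrix. A minimal sketch of the same idea on random data follows. The helper name `stack_in_chunks` and the chunk size are hypothetical, and the accumulator starts as an empty list rather than `None` so that no placeholder is passed to `vstack`.

```python
import numpy as np
from scipy.sparse import csr_matrix, vstack

def stack_in_chunks(rows, chunk_size):
    """Stack an iterable of 1 x n sparse rows, folding every chunk_size rows."""
    stacked = []   # holds at most one accumulated matrix
    pending = []   # rows waiting to be folded in
    for row in rows:
        pending.append(row)
        if len(pending) == chunk_size:
            stacked = [vstack(stacked + pending)]
            pending = []
    if stacked or pending:
        # Fold in the remainder and convert once to CSR for later use.
        return vstack(stacked + pending).tocsr()
    return None  # nothing was read

# Hypothetical data: 7 rows with 5 columns each.
rng = np.random.RandomState(0)
rows = [csr_matrix(rng.randint(0, 3, size=(1, 5))) for _ in range(7)]

result = stack_in_chunks(rows, chunk_size=3)
print(result.shape)
```

Converting to CSR at the end is a design choice for the sketch: row slicing and most scikit-learn estimators work on CSR, whereas the matrices in this notebook stay in COO format until a transformer converts them.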


In [5]:
text_feature_matrix

<1850895x262144 sparse matrix of type '<type 'numpy.float64'>'
	with 17117540 stored elements in COOrdinate format>

In [6]:
other_feature_matrix

<1850895x4 sparse matrix of type '<type 'numpy.int64'>'
	with 7335967 stored elements in COOrdinate format>

Let's write the matrices to disk.

In [7]:
save_npz("/mnt/ssd/amathur/dark-patterns-output/text_feature_matrix_bow.npz", text_feature_matrix)
save_npz("/mnt/ssd/amathur/dark-patterns-output/other_matrix.npz", other_feature_matrix)

#text_feature_matrix = load_npz("/mnt/ssd/amathur/dark-patterns-output/text_feature_matrix_bow.npz")
#other_feature_matrix = load_npz("/mnt/ssd/amathur/dark-patterns-output/other_matrix.npz")
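
As a quick check of what `save_npz` and `load_npz` preserve, the following sketch round-trips a tiny matrix through a temporary file. The matrix and the file name `demo_matrix.npz` are made up for illustration.

```python
import os
import tempfile
import numpy as np
from scipy.sparse import coo_matrix, save_npz, load_npz

# Hypothetical 3 x 4 matrix standing in for a feature matrix.
original = coo_matrix(np.array([[0, 1, 0, 2],
                                [0, 0, 0, 0],
                                [3, 0, 0, 4]]))

path = os.path.join(tempfile.mkdtemp(), "demo_matrix.npz")

save_npz(path, original)
restored = load_npz(path)

# The sparse format, shape, and values survive the round trip.
print(restored.format)
print(restored.shape)
print((restored.toarray() == original.toarray()).all())
```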

## TF-IDF - TfidfTransformer

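Let's convert the bag-of-words matrix to a TF-IDF representation. Note that `HashingVectorizer` defaults to `norm='l2'`, so the rows of the input matrix are already L2-normalized rather than raw counts.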

In [8]:
from sklearn.feature_extraction.text import TfidfTransformer

transformer = TfidfTransformer()
tf_idf_text_feature_matrix = transformer.fit_transform(text_feature_matrix)

In [9]:
tf_idf_text_feature_matrix

<1850895x262144 sparse matrix of type '<type 'numpy.float64'>'
	with 17117540 stored elements in Compressed Sparse Row format>
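
To make the reweighting concrete, here is a minimal sketch on a made-up 3 x 3 count matrix. The names `counts`, `demo_transformer`, and `demo_tfidf` are hypothetical, and the input holds raw counts, unlike the normalized matrix used above.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.feature_extraction.text import TfidfTransformer

# Hypothetical counts: 3 documents x 3 features.
counts = csr_matrix(np.array([[2, 1, 0],
                              [1, 0, 0],
                              [1, 0, 3]], dtype=np.float64))

demo_transformer = TfidfTransformer()
demo_tfidf = demo_transformer.fit_transform(counts)

# Feature 0 occurs in every document, so it receives the smallest IDF weight.
print(demo_transformer.idf_)

# With the default norm='l2', every non-empty row has unit length.
row_norms = np.sqrt(np.asarray(demo_tfidf.multiply(demo_tfidf).sum(axis=1))).ravel()
print(row_norms)
```

The sparsity pattern is unchanged by the transform, which is why the TF-IDF matrix above reports the same number of stored elements as the bag-of-words matrix.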

Let's write the matrix to disk.

In [10]:
save_npz("/mnt/ssd/amathur/dark-patterns-output/text_feature_matrix_tfidf.npz", tf_idf_text_feature_matrix)
#tf_idf_text_feature_matrix = load_npz("/mnt/ssd/amathur/dark-patterns-output/text_feature_matrix_tfidf.npz")

## Sanity check

In [11]:
text_feature_matrix

<1850895x262144 sparse matrix of type '<type 'numpy.float64'>'
	with 17117540 stored elements in COOrdinate format>

In [12]:
other_feature_matrix

<1850895x4 sparse matrix of type '<type 'numpy.int64'>'
	with 7335967 stored elements in COOrdinate format>

In [13]:
tf_idf_text_feature_matrix

<1850895x262144 sparse matrix of type '<type 'numpy.float64'>'
	with 17117540 stored elements in Compressed Sparse Row format>

## Dimensionality Reduction

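Let's reduce the dimensionality of the BoW and TF-IDF matrices using truncated SVD, a PCA-like decomposition that works directly on sparse matrices without centering them. We start with the BoW matrix.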

In [14]:
from sklearn.decomposition import TruncatedSVD
svd_bow = TruncatedSVD(n_components=200)
svd_bow_output = svd_bow.fit_transform(text_feature_matrix)

How much of the variance do these 200 components represent when taken together?

In [15]:
np.sum(svd_bow.explained_variance_ratio_)

0.5054071724930227

What are the dimensions of the returned matrix?

In [16]:
svd_bow_output.shape

(1850895, 200)

Let's repeat the same reduction for the TF-IDF matrix.

In [17]:
svd_tfidf = TruncatedSVD(n_components=300)
svd_tfidf_output = svd_tfidf.fit_transform(tf_idf_text_feature_matrix)

How much of the variance do these 300 components represent when taken together?

In [18]:
np.sum(svd_tfidf.explained_variance_ratio_)

0.4161701791669506

What are the dimensions of the returned matrix?

In [19]:
svd_tfidf_output.shape

(1850895, 300)
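
One way to reason about the number of components is to look at the running total of `explained_variance_ratio_` rather than only its sum. The sketch below does this on a small random sparse matrix. The data, the sizes, and the names `X`, `demo_svd`, and `cumulative` are all made up for illustration.

```python
import numpy as np
from scipy.sparse import random as sparse_random
from sklearn.decomposition import TruncatedSVD

# Hypothetical sparse matrix: 200 rows, 50 features, 10% non-zero entries.
X = sparse_random(200, 50, density=0.1, format='csr', random_state=0)

demo_svd = TruncatedSVD(n_components=20, random_state=0)
reduced = demo_svd.fit_transform(X)

# Running total of the variance explained by the first k components.
cumulative = np.cumsum(demo_svd.explained_variance_ratio_)

print(reduced.shape)
# Total variance captured by all 20 components.
print(cumulative[-1])
```

Plotting or inspecting `cumulative` shows where adding components stops paying off, which can help decide between values such as the 200 and 300 used above.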

## Attach the matrices and save

In [22]:
np.savetxt('/mnt/ssd/amathur/dark-patterns-output/svd_bow_output.arr', svd_bow_output)

In [23]:
np.savetxt('/mnt/ssd/amathur/dark-patterns-output/svd_tfidf_output.arr', svd_tfidf_output)
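
A minimal sketch of how the reduced text features could be attached to the layout features, assuming a simple column-wise concatenation is what is wanted. The arrays below are small made-up stand-ins. With the real data, `other_feature_matrix` is sparse and would need `.toarray()` first. The sketch also uses NumPy's binary `.npy` format, which is typically more compact than the text output of `np.savetxt`.

```python
import os
import tempfile
import numpy as np

# Hypothetical stand-ins: 4 segments, 3 SVD components, 4 layout features.
svd_demo = np.random.RandomState(0).rand(4, 3)
layout_demo = np.array([[10, 0, 50, 200],
                        [60, 0, 20, 200],
                        [80, 5, 30, 100],
                        [0, 0, 0, 0]])

# Column-wise concatenation: one row per segment, text features first.
combined = np.hstack([svd_demo, layout_demo])

path = os.path.join(tempfile.mkdtemp(), "combined_demo.npy")
np.save(path, combined)   # binary format
restored = np.load(path)

print(combined.shape)
print(np.array_equal(restored, combined))
```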