In [1]:
###########
# PRELUDE #
###########

# auto-reload changed python files
%load_ext autoreload
%autoreload 2

# Format cells with %%black
%load_ext blackcellmagic

# nice interactive plots
%matplotlib inline

# add repository directory to include path
from pathlib import Path
import sys
PROJECT_DIR = Path('../..').resolve()
sys.path.append(str(PROJECT_DIR))

from IPython.display import display, Markdown
def markdown(s):
    return display(Markdown(s))

print("Add '<div class=\"alert alert-block alert-info\">\\n\\n' to the top of markdown cells to mark professor-provided assignment content")

Add '<div class="alert alert-block alert-info">\n\n' to the top of markdown cells to mark professor-provided assignment content


<div class="alert alert-block alert-info">

# Part 1: Similarity Metrics

In [2]:
from cs168.mini_project_2 import load_data

<div class="alert alert-block alert-info">

## Goal 

The goal of this part of the assignment is to better understand the differences between distance
metrics, and to think about which metric makes the most sense for a particular application.

<div class="alert alert-block alert-info">

## Description

In this part you will look at the similarity between posts on various newsgroups. We’ll use the well-known [20 newsgroups dataset](http://qwone.com/~jason/20Newsgroups/). You will use a version of the dataset in which every article is represented by a bag-of-words: a vector indexed by words, with each component giving the number of occurrences of that word. You will need 3 files: `data50.csv`, `label.csv`, and `group.csv`, all of which can be downloaded from the course website. `data50.csv` contains a sparse representation of the bags-of-words, with each line containing 3 fields: `articleId`, `wordId`, and `count`. To find out which group an article belongs to, use `label.csv`: for `articleId` $i$, line $i$ of `label.csv` contains the `groupId`. Finally, the group names are in `group.csv`, with line $i$ containing the name of group $i$.

We’ll use the following similarity metrics, where $x$ and $y$ are two bags of words:

* Jaccard Similarity: $J(x,y) = \frac{\sum_i{\min(x_i,y_i)}}{\sum_i{\max(x_i,y_i)}}$
* $L_2$ Similarity: $L_2(x,y) = -\|x - y\|_2 = -\sqrt{\sum_i(x_i - y_i)^2}$
* Cosine Similarity: $S_C(x, y) = \frac{\sum_i{x_i \cdot y_i}}{\|x\|_2 \cdot \|y\|_2}$

Note that Jaccard and cosine similarity are numbers between 0 and 1, while $L_2$ similarity is between $-\infty$ and 0 (with higher numbers indicating more similarity).
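
As a minimal sketch of the three formulas above, assuming dense NumPy vectors of non-negative counts: the function names are illustrative, and returning 0 when a denominator is zero is an assumption of this sketch, since the formulas are undefined for all-zero vectors.

```python
import numpy as np


def jaccard_similarity(x, y):
    """Generalized Jaccard: sum of element-wise minima over sum of maxima."""
    denom = np.maximum(x, y).sum()
    # Assumption: two all-zero vectors get similarity 0 (the formula is undefined there).
    return np.minimum(x, y).sum() / denom if denom > 0 else 0.0


def l2_similarity(x, y):
    """Negated Euclidean distance, so larger (closer to 0) means more similar."""
    return -np.sqrt(((x - y) ** 2).sum())


def cosine_similarity(x, y):
    """Inner product divided by the product of the Euclidean norms."""
    norm_product = np.linalg.norm(x) * np.linalg.norm(y)
    # Assumption: similarity with an all-zero vector is defined as 0.
    return float(x @ y) / norm_product if norm_product > 0 else 0.0


# Two toy bags of words over a 3-word vocabulary.
x = np.array([1.0, 0.0, 2.0])
y = np.array([1.0, 1.0, 0.0])
print(jaccard_similarity(x, y), l2_similarity(x, y), cosine_similarity(x, y))
```

For these toy vectors the minima sum to 1 and the maxima to 4, so the Jaccard similarity is 0.25; the $L_2$ similarity is $-\sqrt{5}$ and the cosine similarity is $1/\sqrt{10}$.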

<div class="alert alert-block alert-info">

(a) (2 points) Make sure you can import the given datasets into whatever language you’re using. For
example, if you’re using python, read the data50.csv file and store the information in an appropriate
way. Remember that the total number of words in the corpus is huge, so you might want to work with
a sparse representation of your data (e.g., you don’t want to waste space on words that don’t occur in
a document). If you’re using MATLAB, you can simply import the data using the GUI.
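
One possible way to hold the sparse data, sketched with the standard library only. It assumes the file is comma-separated with no header row; `read_sparse_counts` is a hypothetical helper for illustration, not the `load_data` function this notebook imports (whose implementation is not shown here).

```python
import csv
import io
from collections import defaultdict


def read_sparse_counts(lines):
    """Parse `articleId,wordId,count` rows into {article_id: {word_id: count}}."""
    articles = defaultdict(dict)
    for article_id, word_id, count in csv.reader(lines):
        # Only words that actually occur in an article are stored.
        articles[int(article_id)][int(word_id)] = int(count)
    return dict(articles)


# A tiny sample in the same three-field layout, standing in for the real file.
sample = io.StringIO("1,3,1\n1,10,1\n2,7,1\n")
articles = read_sparse_counts(sample)
print(articles)
```

With the real file, the same function could be called as `read_sparse_counts(open("data50.csv", newline=""))`, again assuming that layout.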

In [3]:
group_names, data, labels = load_data()

assert group_names.shape == (20,)
assert data.shape == (1000, 19575)
assert labels.shape == (1000,)

data.head()

| article_id (rows) / word_id (columns), values are `count` | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | ... | 61058 | 61059 | 61060 | 61061 | 61062 | 61063 | 61064 | 61065 | 61066 | 61067 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | | | 1.0 | | | | | | | 1.0 | ... | | | | | | | | | | |
| 2 | | | | | | | 1.0 | | | | ... | | | | | | | | | | |
| 3 | | | | | | | | | | | ... | | | | | | | | | | |
| 4 | | | | | | | | | | 1.0 | ... | | | | | | | | | | |
| 5 | | | | | | | | | | | ... | | | | | | | | | | |


<div class="alert alert-block alert-info">

(b) (8 points) Implement the three similarity metrics described above. For each metric, prepare the following plot. The plot will look like a 20 × 20 matrix. Rows and columns are indexed by newsgroups (in the same order). For each entry $(A, B)$ of the matrix (including the diagonal), compute the average similarity over all ways of pairing up one article from $A$ with one article from $B$. After you’ve computed these 400 numbers, plot your results in a heatmap. Make sure that you label your axes with the group names and pick an appropriate colormap to represent the data: the rainbow colormap may look fancy, but a simple color map from white to blue may be a lot more insightful. Make sure to include a legend. (Note that the computation might take five or ten minutes, but shouldn’t take much more.)
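
The averaging step described in (b) can be sketched as follows. This is an unoptimized illustration under stated assumptions: articles are rows of a dense NumPy array, group ids run from 0 to `n_groups - 1` (the ids in `label.csv` may start at 1), and every group has at least one article. The names `average_similarity_matrix` and `plot_heatmap` are hypothetical.

```python
import numpy as np


def average_similarity_matrix(X, labels, n_groups, similarity):
    """Entry (a, b) is the mean similarity over all pairs (article in a, article in b)."""
    avg = np.zeros((n_groups, n_groups))
    for a in range(n_groups):
        rows_a = X[labels == a]
        for b in range(n_groups):
            rows_b = X[labels == b]
            sims = [similarity(x, y) for x in rows_a for y in rows_b]
            avg[a, b] = np.mean(sims)
    return avg


def plot_heatmap(avg, group_names, title):
    """Draw a white-to-blue heatmap with labeled axes and a colorbar as the legend."""
    import matplotlib.pyplot as plt  # imported here so the numeric part runs without it

    fig, ax = plt.subplots(figsize=(8, 7))
    im = ax.imshow(avg, cmap="Blues")
    ax.set_xticks(range(len(group_names)))
    ax.set_xticklabels(group_names, rotation=90)
    ax.set_yticks(range(len(group_names)))
    ax.set_yticklabels(group_names)
    ax.set_title(title)
    fig.colorbar(im, ax=ax)
    return fig


def cosine(x, y):
    return float(x @ y) / (np.linalg.norm(x) * np.linalg.norm(y))


# Toy data: 4 articles over a 2-word vocabulary, split into 2 groups.
X = np.array([[1.0, 0.0], [2.0, 0.0], [0.0, 1.0], [0.0, 3.0]])
labels = np.array([0, 0, 1, 1])
avg = average_similarity_matrix(X, labels, 2, cosine)
print(avg)
```

In this toy example each group uses a disjoint vocabulary, so the matrix is 1 on the diagonal and 0 elsewhere. Calling `plot_heatmap(avg, ["group 0", "group 1"], "Cosine")` would render it.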

<div class="alert alert-block alert-info">

(c) (4 points) Based on your three heatmaps, which of the similarity metrics seems the most reasonable, and why would you expect that/those metrics to be better suited to this data?

Are there any pairs of newsgroups that are very similar?

Would you have expected these to be similar?

<div class="alert alert-block alert-info">

# Parts 2 and 3: A nearest-neighbor classification system

<div class="alert alert-block alert-info">

A “nearest-neighbor” classification system is conceptually extremely simple, and is often very effective. Given a large dataset of labeled examples, a nearest-neighbor classification system will predict a label for a new example, $x$, as follows: it will find the element of the labeled dataset that is closest to $x$ (closest in whatever metric makes the most sense for that dataset) and then output the label of this closest point. \[As you can imagine, there are many natural extensions of this system, for example considering the labels of the $r > 1$ closest neighbors.]
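
A brute-force version of this prediction rule might look like the sketch below. It assumes Euclidean distance as the metric and dense NumPy arrays; `nearest_neighbor_label` is an illustrative name, and any other metric could be substituted.

```python
import numpy as np


def nearest_neighbor_label(x, X_train, y_train):
    """Return the label of the training point closest to x."""
    # Assumption: Euclidean distance is the metric; swap in whatever fits the data.
    distances = np.linalg.norm(X_train - x, axis=1)
    return y_train[np.argmin(distances)]


# Three labeled toy points in the plane.
X_train = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]])
y_train = np.array(["a", "a", "b"])
pred = nearest_neighbor_label(np.array([4.0, 4.5]), X_train, y_train)
print(pred)
```

Each prediction scans every labeled point, so the cost grows with both the number of points and their dimension.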

From a computational standpoint, naively finding the closest point to $x$ might be time-consuming if the
labeled dataset is large or the points are very high-dimensional. In the next two parts, you will explore two
ways of speeding up this computation: dimension reduction and locality-sensitive hashing.

<div class="alert alert-block alert-info">

## Part 2: Dimension Reduction

<div class="alert alert-block alert-info">

### Goal
The goal of this part is to get a feel for the trade-off in dimensionality reduction between the quality
of approximation and the number of dimensions used.

### Description

You may have noticed that it takes some time to compute all the distances in the previous part (though it should not take more than a minute or two). In this part we will implement a dimension reduction technique to reduce the running time, which can also be used to speed up classification.

In the following, $k$ will refer to the original dimension of your data, and $d$ will refer to the target dimension.

* Random Projection: Given a set of $k$-dimensional vectors $\{v_1, v_2, \dots\}$, define a $d \times k$ matrix $M$ by drawing each entry randomly (and independently) from a normal distribution with mean 0 and variance 1. The $d$-dimensional reduced vector corresponding to $v_i$ is given by the matrix-vector product $Mv_i$. We can think of the matrix $M$ as a set of $d$ random $k$-dimensional vectors $\{w_1, \dots, w_d\}$ (the rows of $M$), and then the $j$th coordinate of the reduced vector $Mv_i$ is the inner product between $v_i$ and $w_j$. If you need to review the basics of matrix-vector multiplication, see the primer on the course webpage.
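
The random projection described above can be sketched in a few lines of NumPy. This assumes the vectors are stored as the rows of a dense array `V`; the function name and the fixed `seed` are illustrative choices, not part of the assignment.

```python
import numpy as np


def random_projection(V, d, seed=0):
    """Project the rows of V (n x k) to d dimensions with a Gaussian matrix M (d x k)."""
    rng = np.random.default_rng(seed)
    k = V.shape[1]
    M = rng.standard_normal((d, k))  # entries drawn independently from N(0, 1)
    # Row i of the result is M @ V[i]; its j-th coordinate is the inner product of w_j and v_i.
    return V @ M.T, M


V = np.arange(12, dtype=float).reshape(3, 4)  # three toy 4-dimensional vectors
reduced, M = random_projection(V, d=2)
print(reduced.shape)
```

Computing `V @ M.T` applies $M$ to every vector at once, which is equivalent to forming each $Mv_i$ separately.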