
Commit 373c622

updating lab04
1 parent c521dae · commit 373c622

File tree

3 files changed: +1266, -38 lines


Labs/Lab04/Exercises04.ipynb

Lines changed: 259 additions & 38 deletions
Large diffs are not rendered by default.

Labs/Lab04/Exercises04_old.ipynb

Lines changed: 195 additions & 0 deletions
@@ -0,0 +1,195 @@
{
 "metadata": {
  "name": "",
  "signature": "sha256:af9ebe40acaaaf8849035ea1f09639f5d2e08c861ca4ba5b2c0f061ea899ddb1"
 },
 "nbformat": 3,
 "nbformat_minor": 0,
 "worksheets": [
  {
   "cells": [
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "import os\n",
      "import sys\n",
      "import glob\n",
      "import matplotlib.pyplot as plt\n",
      "import numpy as np\n",
      "import pandas as pd\n",
      "%matplotlib inline\n",
      "%precision 4\n",
      "plt.style.use('ggplot')\n"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 1
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Latent Semantic Analysis (LSA) is a method for reducing the dimensionality of documents treated as a bag of words. It is used for document classification, clustering and retrieval. For example, LSA can be used to search for prior art given a new patent application. In this homework, we will implement a small library for simple latent semantic analysis as a practical example of the application of SVD. The ideas are very similar to PCA."
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "**Exercise 1 (10 points)**. Calculating pairwise distance matrices.\n",
      "\n",
      "Suppose we want to construct a distance matrix between the rows of a matrix. For example, given the matrix\n",
      "\n",
      "```python\n",
      "M = np.array([[1,2,3],[4,5,6]])\n",
      "```\n",
      "\n",
      "we want to find the new matrix\n",
      "\n",
      "```python\n",
      "D = np.array([[distance([1,2,3], [1,2,3]), distance([1,2,3], [4,5,6])],\n",
      "              [distance([4,5,6], [1,2,3]), distance([4,5,6], [4,5,6])]])\n",
      "```\n",
      "\n",
      "where `distance` is some appropriate function of two vectors (e.g. squared Euclidean).\n",
      "\n",
      "Write a function to calculate the pairwise-distance matrix given the matrix $M$ and some arbitrary distance function. Your functions should have the following signature:\n",
      "```\n",
      "def func_name(M, distance_func):\n",
      "    pass\n",
      "```\n",
      "\n",
      "0. Write a distance function for the Euclidean, squared Euclidean and cosine measures.\n",
      "1. Write the function using looping for M as a collection of row vectors.\n",
      "2. Write the function using looping for M as a collection of column vectors.\n",
      "3. Write the function using broadcasting for M as a collection of row vectors.\n",
      "4. Write the function using broadcasting for M as a collection of column vectors.\n",
      "\n",
      "For 3 and 4, try to avoid using transposition. Check that all four functions give the same result when applied to the given matrix $M$."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# Your code here\n",
      "\n",
      "\n",
      "\n",
      "\n"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
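The answer cell above is intentionally blank in the committed notebook. For reference, here is a minimal sketch of one possible solution for the row-vector cases, assuming squared Euclidean distance; the names `squared_euclidean`, `row_pdist_loop` and `row_pdist_broadcast` are illustrative, not part of the assignment:

```python
import numpy as np

def squared_euclidean(u, v):
    """Squared Euclidean distance between two 1-d vectors."""
    d = u - v
    return d @ d

def row_pdist_loop(M, distance_func):
    """Pairwise distances between the rows of M, via explicit loops."""
    n = M.shape[0]
    D = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            D[i, j] = distance_func(M[i], M[j])
    return D

def row_pdist_broadcast(M):
    """Squared Euclidean row distances via broadcasting, no transpose:
    M[:, None, :] - M[None, :, :] has shape (n, n, p)."""
    diff = M[:, None, :] - M[None, :, :]
    return (diff ** 2).sum(axis=-1)

M = np.array([[1, 2, 3], [4, 5, 6]])
print(row_pdist_loop(M, squared_euclidean))   # [[ 0. 27.] [27.  0.]]
print(row_pdist_broadcast(M))                 # same values
```

The column-vector versions follow the same pattern with the roles of the two axes swapped.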
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "**Exercise 2 (10 points)**. Write 3 functions to calculate the term frequency (tf), the inverse document frequency (idf) and the product (tf-idf). Each function should take a single argument `docs`, which is a dictionary of (key=identifier, value=document text) pairs, and return an appropriately sized array. Remove punctuation, convert text to lowercase and split on whitespace to generate a collection of terms from the document text.\n",
      "\n",
      "Print the table of tf-idf values for the following document collection\n",
      "\n",
      "```\n",
      "s1 = \"The quick brown fox\"\n",
      "s2 = \"Brown fox jumps over the jumps jumps jumps\"\n",
      "s3 = \"The the the lazy dog elephant.\"\n",
      "s4 = \"The the the the the dog peacock lion tiger elephant\"\n",
      "\n",
      "docs = {'s1': s1, 's2': s2, 's3': s3, 's4': s4}\n",
      "```"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# Your code here\n",
      "\n",
      "\n",
      "\n",
      "\n"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
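A hedged sketch of one way the tf-idf table could be built with pandas. The unsmoothed idf definition log(N / df) is an assumption, since the exercise does not pin down a convention:

```python
import string
import numpy as np
import pandas as pd

def tokenize(text):
    """Lowercase, strip punctuation, split on whitespace."""
    table = str.maketrans('', '', string.punctuation)
    return text.lower().translate(table).split()

def tf(docs):
    """Term-frequency table: rows are terms, columns are documents."""
    terms = sorted({t for text in docs.values() for t in tokenize(text)})
    counts = {name: pd.Series(tokenize(text)).value_counts()
              for name, text in docs.items()}
    return pd.DataFrame(counts, index=terms).fillna(0)

def idf(docs):
    """Inverse document frequency: log(n_docs / n_docs containing term)."""
    t = tf(docs)
    doc_freq = (t > 0).sum(axis=1)
    return np.log(t.shape[1] / doc_freq)

def tf_idf(docs):
    """Elementwise product of tf with idf, broadcast down the term axis."""
    return tf(docs).mul(idf(docs), axis=0)

s1 = "The quick brown fox"
s2 = "Brown fox jumps over the jumps jumps jumps"
s3 = "The the the lazy dog elephant."
s4 = "The the the the the dog peacock lion tiger elephant"
docs = {'s1': s1, 's2': s2, 's3': s3, 's4': s4}
print(tf_idf(docs))
```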
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "**Exercise 3 (10 points)**.\n",
      "\n",
      "1. Write a function that takes a matrix $M$ and an integer $k$ as arguments, and reconstructs a reduced matrix using only the $k$ largest singular values. Use the `scipy.linalg.svd` function to perform the decomposition.\n",
      "\n",
      "2. Apply the function you just wrote to the following term-frequency matrix for a set of 9 documents using $k=2$ and print the reconstructed matrix $M'$.\n",
      "```\n",
      "M = np.array([[1, 0, 0, 1, 0, 0, 0, 0, 0],\n",
      "              [1, 0, 1, 0, 0, 0, 0, 0, 0],\n",
      "              [1, 1, 0, 0, 0, 0, 0, 0, 0],\n",
      "              [0, 1, 1, 0, 1, 0, 0, 0, 0],\n",
      "              [0, 1, 1, 2, 0, 0, 0, 0, 0],\n",
      "              [0, 1, 0, 0, 1, 0, 0, 0, 0],\n",
      "              [0, 1, 0, 0, 1, 0, 0, 0, 0],\n",
      "              [0, 0, 1, 1, 0, 0, 0, 0, 0],\n",
      "              [0, 1, 0, 0, 0, 0, 0, 0, 1],\n",
      "              [0, 0, 0, 0, 0, 1, 1, 1, 0],\n",
      "              [0, 0, 0, 0, 0, 0, 1, 1, 1],\n",
      "              [0, 0, 0, 0, 0, 0, 0, 1, 1]])\n",
      "```\n",
      "\n",
      "3. Calculate the pairwise correlation matrix for the original matrix $M$ and the reconstructed matrix using $k=2$ singular values (this is a distance matrix using Spearman's $\\rho$ as the distance measure). Consider the first 5 documents as one group $G1$ and the last 4 as another group $G2$ (i.e. the first 5 and last 4 columns). What is the average within-group correlation for $G1$ and $G2$, and the average cross-group correlation for $G1$-$G2$, using either $M$ or $M'$? (Do not include self-correlation in the within-group calculations.)"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# Your code here\n",
      "\n",
      "\n",
      "\n",
      "\n"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
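One possible reading of Exercise 3, sketched under the assumption that `scipy.stats.spearmanr` (which correlates the columns of a 2-d array) is an acceptable way to obtain the document-by-document correlation matrix:

```python
import numpy as np
from scipy.linalg import svd
from scipy.stats import spearmanr

def svd_reconstruct(M, k):
    """Reconstruct M using only its k largest singular values."""
    U, s, Vt = svd(M, full_matrices=False)
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

M = np.array([[1, 0, 0, 1, 0, 0, 0, 0, 0],
              [1, 0, 1, 0, 0, 0, 0, 0, 0],
              [1, 1, 0, 0, 0, 0, 0, 0, 0],
              [0, 1, 1, 0, 1, 0, 0, 0, 0],
              [0, 1, 1, 2, 0, 0, 0, 0, 0],
              [0, 1, 0, 0, 1, 0, 0, 0, 0],
              [0, 1, 0, 0, 1, 0, 0, 0, 0],
              [0, 0, 1, 1, 0, 0, 0, 0, 0],
              [0, 1, 0, 0, 0, 0, 0, 0, 1],
              [0, 0, 0, 0, 0, 1, 1, 1, 0],
              [0, 0, 0, 0, 0, 0, 1, 1, 1],
              [0, 0, 0, 0, 0, 0, 0, 1, 1]])

Mp = svd_reconstruct(M, 2)
rho, _ = spearmanr(Mp)               # 9x9 column (document) correlations
g1, g2 = np.arange(5), np.arange(5, 9)

def mean_within(rho, idx):
    """Average within-group correlation, excluding self-correlation."""
    block = rho[np.ix_(idx, idx)]
    n = len(idx)
    return (block.sum() - np.trace(block)) / (n * n - n)

print(mean_within(rho, g1), mean_within(rho, g2),
      rho[np.ix_(g1, g2)].mean())     # within-G1, within-G2, cross-group
```

Running the same three summaries on `spearmanr(M)` gives the comparison with the original matrix.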
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "**Exercise 4 (20 points)**. Clustering with LSA\n",
      "\n",
      "1. Starting from the books b01.txt to b18.txt, create a tf-idf matrix for every term that appears at least once in any of the documents.\n",
      "\n",
      "2. Reconstruct the tf-idf matrix using the top 50 singular values. Find the pairwise distances for the reconstructed matrix for the 16 documents.\n",
      "\n",
      "3. Use agglomerative hierarchical clustering with complete linkage to plot a dendrogram and comment on the likely number of book clusters.\n",
      "\n",
      "4. Rank each of the original 16 documents in terms of its similarity to the new document `b00.txt` using the cosine distance relative to the reconstructed 50-dimensional space.\n",
      "\n",
      "5. Does it matter that the front and back matter of each document is essentially identical for either LSA-based clustering (part 3) or information retrieval (part 4)? Why or why not?"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# Your code here\n",
      "\n",
      "\n",
      "\n",
      "\n"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    }
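Finally, a sketch of one possible Exercise 4 pipeline. Everything here is an assumption beyond what the exercise states: the books are taken to be plain-text files in the working directory, the hypothetical `tf_idf()` from the Exercise 2 sketch is reused, and the query is folded in with one common LSI convention, q_k = S_k^{-1} U_k^T q, with documents represented by the columns of V_k^T:

```python
import glob
import numpy as np
import matplotlib.pyplot as plt
from scipy.linalg import svd
from scipy.spatial.distance import pdist, cosine
from scipy.cluster.hierarchy import linkage, dendrogram

# Assumes b00.txt-b18.txt sit in the working directory and that the
# tf_idf() sketch from Exercise 2 above is in scope.
texts = {path: open(path).read() for path in sorted(glob.glob('b*.txt'))}
T = tf_idf(texts)                             # terms x documents
corpus = T.drop(columns='b00.txt')            # hold out the query book
names = list(corpus.columns)

U, s, Vt = svd(corpus.values, full_matrices=False)
k = min(50, len(s))                           # rank cannot exceed n_docs
Ak = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]    # reconstructed tf-idf matrix

# parts 2-3: pairwise document distances, complete-linkage dendrogram
Z = linkage(pdist(Ak.T), method='complete')
dendrogram(Z, labels=names)
plt.show()

# part 4: fold b00.txt into the k-dimensional space, rank by cosine
q = T['b00.txt'].values
q_k = np.diag(1 / s[:k]) @ U[:, :k].T @ q
sims = [1 - cosine(q_k, Vt[:k, j]) for j in range(Vt.shape[1])]
for j in np.argsort(sims)[::-1]:
    print(names[j], round(float(sims[j]), 3))
```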
   ],
   "metadata": {}
  }
 ]
}
