"Latent Semantic Analysis (LSA) is a method for reducing the dimensionality of documents treated as a bag of words. It is used for document classification, clustering and retrieval. For example, LSA can be used to search for prior art given a new patent application. In this homework, we will implement a small library for simple latent semantic analysis as a practical example of the application of SVD. The ideas are very similar to PCA."
"where `distance` is some appropriate function of two vectors (e.g. squared Euclidean).\n",
"\n",
"Write a function to calculate the pairwise-distance matrix given the matrix $M$ and some arbitrary distance function. Your function should have the following signature:\n",
"```\n",
"def func_name(M, distance_func):\n",
" pass\n",
"```\n",
"\n",
"0. Write a distance function for the Euclidean, squared Euclidean and cosine measures.\n",
"1. Write the function using looping for M as a collection of row vectors.\n",
"2. Write the function using looping for M as a collection of column vectors.\n",
"3. Write the function using broadcasting for M as a collection of row vectors.\n",
"4. Write the function using broadcasting for M as a collection of column vectors.\n",
"\n",
"For 3 and 4, try to avoid using transposition. Check that all four functions give the same result when applied to the given matrix $M$."
]
},
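The four variants asked for above can be sketched as follows. This is a minimal illustration under stated assumptions, not the only valid solution: the function names (`pdist_rows_loop`, `pdist_rows_broadcast`) are hypothetical, and the broadcasting version is shown for the squared Euclidean case only.

```python
import numpy as np

def euclidean(u, v):
    """Euclidean distance between two vectors."""
    return np.sqrt(((u - v) ** 2).sum())

def squared_euclidean(u, v):
    """Squared Euclidean distance between two vectors."""
    return ((u - v) ** 2).sum()

def cosine(u, v):
    """Cosine distance: 1 minus the cosine of the angle between u and v."""
    return 1 - (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

def pdist_rows_loop(M, distance_func):
    """Pairwise distances, looping over the rows of M as observations."""
    n = M.shape[0]
    D = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            D[i, j] = distance_func(M[i], M[j])
    return D

def pdist_rows_broadcast(M):
    """Squared Euclidean pairwise distances via broadcasting (rows as observations)."""
    # (n,1,p) - (1,n,p) broadcasts to (n,n,p); summing out the feature axis
    # gives the n x n distance matrix without an explicit loop.
    diff = M[:, None, :] - M[None, :, :]
    return (diff ** 2).sum(axis=-1)
```

The column-vector variants are analogous, indexing `M[:, i]` in the loop and summing over axis 0 when broadcasting.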
{
"cell_type": "code",
"collapsed": false,
"input": [
"# Your code here\n",
"\n",
"\n",
"\n",
"\n"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Exercise 2 (10 points)**. Write 3 functions to calculate the term frequency (tf), the inverse document frequency (idf) and the product (tf-idf). Each function should take a single argument `docs`, which is a dictionary of (key=identifier, value=document text) pairs, and return an appropriately sized array. Remove punctuation, convert text to lowercase and split on whitespace to generate a collection of terms from the document text.\n",
"\n",
"Print the table of tf-idf values for the following document collection\n",
"\n",
"```\n",
"s1 = \"The quick brown fox\"\n",
"s2 = \"Brown fox jumps over the jumps jumps jumps\"\n",
"s3 = \"The the the lazy dog elephant.\"\n",
"s4 = \"The the the the the dog peacock lion tiger elephant\"\n",
"1. Write a function that takes a matrix $M$ and an integer $k$ as arguments, and reconstructs a reduced matrix using only the $k$ largest singular values. Use the `scipy.linalg.svd` function to perform the decomposition.\n",
"\n",
"2. Apply the function you just wrote to the following term-frequency matrix for a set of 9 documents using k=2 and print the reconstructed matrix $M'$.\n",
"```\n",
"M = np.array([[1, 0, 0, 1, 0, 0, 0, 0, 0],\n",
" [1, 0, 1, 0, 0, 0, 0, 0, 0],\n",
" [1, 1, 0, 0, 0, 0, 0, 0, 0],\n",
" [0, 1, 1, 0, 1, 0, 0, 0, 0],\n",
" [0, 1, 1, 2, 0, 0, 0, 0, 0],\n",
" [0, 1, 0, 0, 1, 0, 0, 0, 0],\n",
" [0, 1, 0, 0, 1, 0, 0, 0, 0],\n",
" [0, 0, 1, 1, 0, 0, 0, 0, 0],\n",
" [0, 1, 0, 0, 0, 0, 0, 0, 1],\n",
" [0, 0, 0, 0, 0, 1, 1, 1, 0],\n",
" [0, 0, 0, 0, 0, 0, 1, 1, 1],\n",
" [0, 0, 0, 0, 0, 0, 0, 1, 1]])\n",
"```\n",
"\n",
"3. Calculate the pairwise correlation matrix for the original matrix $M$ and for the reconstructed matrix using $k=2$ singular values (this is a distance matrix using Spearman's $\\rho$ as the distance measure). Consider the first 5 documents as one group $G1$ and the last 4 as another group $G2$ (i.e. the first 5 and last 4 columns). What is the average within-group correlation for $G1$ and $G2$, and the average cross-group correlation for $G1$-$G2$, using either $M$ or $M'$? (Do not include self-correlations in the within-group calculations.)"
]
},
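As a sketch of what Exercise 2 asks for, one possible implementation is below. Note that tf and idf come in many variants; raw counts for tf and `log(N / n_t)` for idf are assumptions here, as is the helper name `tokenize`.

```python
import string
import numpy as np

def tokenize(text):
    """Remove punctuation, lowercase, and split on whitespace."""
    table = str.maketrans('', '', string.punctuation)
    return text.lower().translate(table).split()

def tf(docs):
    """Term-frequency matrix: rows are terms, columns are documents (raw counts)."""
    terms = sorted({t for text in docs.values() for t in tokenize(text)})
    index = {t: i for i, t in enumerate(terms)}
    keys = sorted(docs)
    M = np.zeros((len(terms), len(keys)))
    for j, k in enumerate(keys):
        for t in tokenize(docs[k]):
            M[index[t], j] += 1
    return M

def idf(docs):
    """Inverse document frequency log(N / n_t) for each term,
    where n_t is the number of documents containing the term."""
    counts = tf(docs)
    n_docs = counts.shape[1]
    doc_freq = (counts > 0).sum(axis=1)
    return np.log(n_docs / doc_freq)

def tf_idf(docs):
    """Element-wise product of tf and idf (idf broadcast across columns)."""
    return tf(docs) * idf(docs)[:, None]
```

Other common variants normalize tf by document length or smooth the idf denominator; any consistent choice should work for the exercise.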
{
"cell_type": "code",
"collapsed": false,
"input": [
"# Your code here\n",
"\n",
"\n",
"\n",
"\n"
],
"language": "python",
"metadata": {},
"outputs": []
},
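For reference, the rank-$k$ reconstruction in part 1 of the previous exercise can be sketched in a few lines; the function name `reconstruct` is an assumption for this illustration.

```python
import numpy as np
from scipy.linalg import svd

def reconstruct(M, k):
    """Reconstruct M keeping only the k largest singular values."""
    U, s, Vt = svd(M, full_matrices=False)
    # scipy returns singular values in descending order,
    # so the first k entries of s are the largest.
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
```

By the Eckart-Young theorem, this truncated SVD is the closest rank-$k$ matrix to $M$ in the Frobenius norm, which is what makes it useful for LSA.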
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Exercise 4 (20 points)**. Clustering with LSA\n",
"\n",
"1. Starting from the books b01.txt to b18.txt, create a tf-idf matrix for every term that appears at least once in any of the documents.\n",
"\n",
"2. Reconstruct the tf-idf matrix using the top 50 singular values. Find the pairwise distances for the reconstructed matrix for the 16 documents. \n",
"\n",
"3. Use agglomerative hierarchical clustering with complete linkage to plot a dendrogram and comment on the likely number of book clusters.\n",
"\n",
"4. Rank each of the original 16 documents in terms of its similarity to the new document `b00.txt` using the cosine distance relative to the reconstructed 50-dimensional space.\n",
"\n",
"5. Does it matter that the front and back matter of each document is essentially identical for either LSA-based clustering (part 3) or information retrieval (part 4)? Why or why not?"
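For part 4, ranking documents by cosine distance can be sketched as below. The function names are assumptions for this illustration, and columns of the matrix are assumed to represent documents.

```python
import numpy as np

def cosine_distance(u, v):
    """1 minus the cosine similarity of u and v."""
    return 1 - (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

def rank_by_similarity(query, M):
    """Column indices of M, ordered from most to least similar to query."""
    d = np.array([cosine_distance(query, M[:, j]) for j in range(M.shape[1])])
    return np.argsort(d)
```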