"Latent Semantic Analysis (LSA) is a method for reducing the dimensionality of documents treated as a bag of words. It is used for document classification, clustering and retrieval. For example, LSA can be used to search for prior art given a new patent application. In this homework, we will implement a small library for simple latent semantic analysis as a practical example of the application of SVD. The ideas are very similar to PCA."
"where `distance` is some appropriate function of two vectors (e.g. squared Euclidean).\n",
"\n",
"Write a function to calculate the pairwise-distance matrix given the matrix $M$ and some arbitrary distance function. Your function should have the following signature:\n",
"```\n",
"def func_name(M, distance_func):\n",
" pass\n",
"```\n",
"\n",
"0. Write a distance function for the Euclidean, squared Euclidean and cosine measures.\n",
"1. Write the function using looping for M as a collection of row vectors.\n",
"2. Write the function using looping for M as a collection of column vectors.\n",
"3. Write the function using broadcasting for M as a collection of row vectors.\n",
"4. Write the function using broadcasting for M as a collection of column vectors.\n",
"\n",
"For 3 and 4, try to avoid using transposition. Check that all four functions give the same result when applied to the given matrix $M$."
]
},
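The four variants asked for above can be sketched as follows. This is a minimal illustration under stated assumptions, not the only valid solution: the function names (`pdist_rows_loop`, `pdist_rows_broadcast`) are hypothetical, and the broadcasting version is shown for the squared Euclidean case only.

```python
import numpy as np

def euclidean(u, v):
    """Euclidean distance between two vectors."""
    return np.sqrt(((u - v) ** 2).sum())

def squared_euclidean(u, v):
    """Squared Euclidean distance between two vectors."""
    return ((u - v) ** 2).sum()

def cosine(u, v):
    """Cosine distance: 1 minus the cosine of the angle between u and v."""
    return 1 - (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

def pdist_rows_loop(M, distance_func):
    """Pairwise distances, looping over the rows of M as observations."""
    n = M.shape[0]
    D = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            D[i, j] = distance_func(M[i], M[j])
    return D

def pdist_rows_broadcast(M):
    """Squared Euclidean pairwise distances via broadcasting (rows as observations)."""
    # (n,1,p) - (1,n,p) broadcasts to (n,n,p); summing out the feature axis
    # gives the n x n distance matrix without an explicit loop.
    diff = M[:, None, :] - M[None, :, :]
    return (diff ** 2).sum(axis=-1)
```

The column-vector variants are analogous, indexing `M[:, i]` in the loop and summing over axis 0 when broadcasting.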
{
"cell_type": "code",
"collapsed": false,
"input": [
"# Your code here\n",
"\n",
"\n",
"\n",
"\n"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Exercise 2 (10 points)**. Write 3 functions to calculate the term frequency (tf), the inverse document frequency (idf) and the product (tf-idf). Each function should take a single argument `docs`, which is a dictionary of (key=identifier, value=document text) pairs, and return an appropriately sized array. Remove punctuation, convert text to lowercase and split on whitespace to generate a collection of terms from the document text.\n",
"\n",
"Print the table of tf-idf values for the following document collection\n",
"\n",
"```\n",
"s1 = \"The quick brown fox\"\n",
"s2 = \"Brown fox jumps over the jumps jumps jumps\"\n",
"s3 = \"The the the lazy dog elephant.\"\n",
"s4 = \"The the the the the dog peacock lion tiger elephant\"\n",
"1. Write a function that takes a matrix $M$ and an integer $k$ as arguments, and reconstructs a reduced matrix using only the $k$ largest singular values. Use the `scipy.linalg.svd` function to perform the decomposition.\n",
"\n",
"2. Apply the function you just wrote to the following term-frequency matrix for a set of 9 documents using k=2 and print the reconstructed matrix $M'$.\n",
"```\n",
"M = np.array([[1, 0, 0, 1, 0, 0, 0, 0, 0],\n",
" [1, 0, 1, 0, 0, 0, 0, 0, 0],\n",
" [1, 1, 0, 0, 0, 0, 0, 0, 0],\n",
" [0, 1, 1, 0, 1, 0, 0, 0, 0],\n",
" [0, 1, 1, 2, 0, 0, 0, 0, 0],\n",
" [0, 1, 0, 0, 1, 0, 0, 0, 0],\n",
" [0, 1, 0, 0, 1, 0, 0, 0, 0],\n",
" [0, 0, 1, 1, 0, 0, 0, 0, 0],\n",
" [0, 1, 0, 0, 0, 0, 0, 0, 1],\n",
" [0, 0, 0, 0, 0, 1, 1, 1, 0],\n",
" [0, 0, 0, 0, 0, 0, 1, 1, 1],\n",
" [0, 0, 0, 0, 0, 0, 0, 1, 1]])\n",
"```\n",
"\n",
"3. Calculate the pairwise correlation matrix for the original matrix $M$ and for the reconstructed matrix using $k=2$ singular values (this is a distance matrix using Spearman's $\\rho$ as the distance measure). Consider the first 5 documents as one group $G1$ and the last 4 as another group $G2$ (i.e. the first 5 and last 4 columns). What is the average within-group correlation for $G1$ and $G2$, and the average cross-group correlation for $G1$-$G2$, using either $M$ or $M'$? (Do not include self-correlations in the within-group calculations.)"
]
},
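As a sketch of what Exercise 2 asks for, one possible implementation is below. Note that tf and idf come in many variants; raw counts for tf and `log(N / n_t)` for idf are assumptions here, as is the helper name `tokenize`.

```python
import string
import numpy as np

def tokenize(text):
    """Remove punctuation, lowercase, and split on whitespace."""
    table = str.maketrans('', '', string.punctuation)
    return text.lower().translate(table).split()

def tf(docs):
    """Term-frequency matrix: rows are terms, columns are documents (raw counts)."""
    terms = sorted({t for text in docs.values() for t in tokenize(text)})
    index = {t: i for i, t in enumerate(terms)}
    keys = sorted(docs)
    M = np.zeros((len(terms), len(keys)))
    for j, k in enumerate(keys):
        for t in tokenize(docs[k]):
            M[index[t], j] += 1
    return M

def idf(docs):
    """Inverse document frequency log(N / n_t) for each term,
    where n_t is the number of documents containing the term."""
    counts = tf(docs)
    n_docs = counts.shape[1]
    doc_freq = (counts > 0).sum(axis=1)
    return np.log(n_docs / doc_freq)

def tf_idf(docs):
    """Element-wise product of tf and idf (idf broadcast across columns)."""
    return tf(docs) * idf(docs)[:, None]
```

Other common variants normalize tf by document length or smooth the idf denominator; any consistent choice should work for the exercise.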
{
"cell_type": "code",
"collapsed": false,
"input": [
"# Your code here\n",
"\n",
"\n",
"\n",
"\n"
],
"language": "python",
"metadata": {},
"outputs": []
},
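For reference, the rank-$k$ reconstruction in part 1 of the previous exercise can be sketched in a few lines; the function name `reconstruct` is an assumption for this illustration.

```python
import numpy as np
from scipy.linalg import svd

def reconstruct(M, k):
    """Reconstruct M keeping only the k largest singular values."""
    U, s, Vt = svd(M, full_matrices=False)
    # scipy returns singular values in descending order,
    # so the first k entries of s are the largest.
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
```

By the Eckart-Young theorem, this truncated SVD is the closest rank-$k$ matrix to $M$ in the Frobenius norm, which is what makes it useful for LSA.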
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Exercise 4 (20 points)**. Clustering with LSA\n",
"\n",
"1. Starting from the books b01.txt to b18.txt, create a tf-idf matrix for every term that appears at least once in any of the documents.\n",
"\n",
"2. Reconstruct the tf-idf matrix using the top 50 singular values. Find the pairwise distances for the reconstructed matrix for the 16 documents. \n",
"\n",
"3. Use agglomerative hierarchical clustering with complete linkage to plot a dendrogram and comment on the likely number of book clusters.\n",
"\n",
"4. Rank each of the original 16 documents in terms of its similarity to the new document `b00.txt` using the cosine distance relative to the reconstructed 50-dimensional space.\n",
"\n",
"5. Does it matter that the front and back matter of each document is essentially identical for either LSA-based clustering (part 3) or information retrieval (part 4)? Why or why not?"
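For part 4, ranking documents by cosine distance can be sketched as below. The function names are assumptions for this illustration, and columns of the matrix are assumed to represent documents.

```python
import numpy as np

def cosine_distance(u, v):
    """1 minus the cosine similarity of u and v."""
    return 1 - (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

def rank_by_similarity(query, M):
    """Column indices of M, ordered from most to least similar to query."""
    d = np.array([cosine_distance(query, M[:, j]) for j in range(M.shape[1])])
    return np.argsort(d)
```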