In [1]:
# -*- coding: utf-8 -*-

import os, json, re, random
from os.path import join, dirname, basename, split, splitext
from collections import Counter, defaultdict, OrderedDict

import numpy as np
import pandas as pd
import networkx as nx
import matplotlib.pyplot as plt

import general as ge

%matplotlib inline

As an example, we use the titles of five documents:
1. Romeo and Juliet.
2. Juliet: O happy dagger!
3. Romeo died by dagger.
4. "Live free or die", that's the New-Hampshire's motto.
5. Did you know, New-Hampshire is in New-England.

The word-document matrix (rows are words, columns are titles, entries are occurrence counts) is then:

| word | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|
| romeo | 1 | 0 | 1 | 0 | 0 |
| juliet | 1 | 1 | 0 | 0 | 0 |
| happy | 0 | 1 | 0 | 0 | 0 |
| dagger | 0 | 1 | 1 | 0 | 0 |
| live | 0 | 0 | 0 | 1 | 0 |
| die | 0 | 0 | 1 | 1 | 0 |
| free | 0 | 0 | 0 | 1 | 0 |
| new-hampshire | 0 | 0 | 0 | 1 | 1 |
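
The counts in the table can be reproduced with a few lines of code. The following is only a minimal sketch: the fixed vocabulary `vocab`, the hand-written `normalize` map (which sends "died" to "die" and strips the possessive from "New-Hampshire's"), and the `tokenize` helper are assumptions made for illustration, not part of the original notebook.

```python
import re
import numpy as np

titles = [
    "Romeo and Juliet.",
    "Juliet: O happy dagger!",
    "Romeo died by dagger.",
    "\"Live free or die\", that's the New-Hampshire's motto.",
    "Did you know, New-Hampshire is in New-England.",
]

# Assumed vocabulary: the eight words kept in the table above.
vocab = ['romeo', 'juliet', 'happy', 'dagger',
         'live', 'die', 'free', 'new-hampshire']

# Hypothetical hand-written normalization standing in for real stemming.
normalize = {'died': 'die', "new-hampshire's": 'new-hampshire'}


def tokenize(text):
    """Lowercase, split into word-like tokens, apply the normalization map."""
    tokens = re.findall(r"[a-z][a-z'-]*", text.lower())
    return [normalize.get(tok, tok) for tok in tokens]


# Rows are words, columns are titles, entries are occurrence counts.
term_doc = np.zeros((len(vocab), len(titles)))
for j, title in enumerate(titles):
    for tok in tokenize(title):
        if tok in vocab:
            term_doc[vocab.index(tok), j] += 1
```

Words outside `vocab` (stop words such as "and" or "the", but also "new-england") are simply dropped, which matches the table.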



In [20]:
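# Compute the SVD of the word-document matrix (rows: words, columns: titles).

tmpm = np.array([[1, 0, 1, 0, 0], [1, 1, 0, 0, 0], [0, 1, 0, 0, 0], [0, 1, 1, 0, 0],
                 [0, 0, 0, 1, 0], [0, 0, 1, 1, 0], [0, 0, 0, 1, 0],
                 [0, 0, 0, 1, 1]]).astype(float)

# np.linalg.svd returns V^T (not V) as its third value, so V below holds V^T.
U, s, V = np.linalg.svd(tmpm, full_matrices=False)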

We then keep only the first two singular values, together with their singular vectors, as the main directions of the system, since the singular values decay quickly.
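
One way to check how much of the matrix the first two directions capture is to look at the full list of singular values and at the cumulative share of the squared Frobenius norm. This is a sketch only; the names `A`, `singular_values` and `energy` are introduced here for illustration, and the energy ratio is just one common truncation criterion, not necessarily the one the author had in mind.

```python
import numpy as np

# Word-document matrix from the example (rows: words, columns: titles).
A = np.array([[1, 0, 1, 0, 0], [1, 1, 0, 0, 0], [0, 1, 0, 0, 0],
              [0, 1, 1, 0, 0], [0, 0, 0, 1, 0], [0, 0, 1, 1, 0],
              [0, 0, 0, 1, 0], [0, 0, 0, 1, 1]], dtype=float)

# compute_uv=False returns only the singular values, in descending order.
singular_values = np.linalg.svd(A, compute_uv=False)

# energy[k-1] is the share of the squared Frobenius norm of A that is
# captured when only the first k singular values are kept.
energy = np.cumsum(singular_values ** 2) / np.sum(singular_values ** 2)

print(singular_values)
print(energy)
```

Reading `energy[1]` gives the share retained by the rank-2 truncation used in the rest of this example.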

In [21]:
# Print the truncated factors (parenthesized print works in Python 2 and 3).

print("The first 2 singular values:")
print(np.diag(s[:2]))
print('\n')

print("Corresponding left matrix: U")
print(U[:, :2])
print('\n')

print("Corresponding right matrix: V^T")
print(V[:2, :])

The first 2 singular values:
[[ 2.28529793  0.        ]
 [ 0.          2.01025824]]


Corresponding left matrix: U
[[-0.39615277 -0.28005737]
 [-0.31426806 -0.44953214]
 [-0.17823952 -0.26899154]
 [-0.43836375 -0.36850831]
 [-0.26388058  0.34592143]
 [-0.52400482  0.24640466]
 [-0.26388058  0.34592143]
 [-0.32637322  0.45966878]]


Corresponding right matrix: V^T
[[-0.31086574 -0.40733041 -0.59446137 -0.60304575 -0.1428143 ]
 [-0.36293322 -0.54074246 -0.20005441  0.6953914   0.22866156]]


The terms (words) are then represented in the concept space by the row vectors of $U\,\mathrm{diag}(s)$, and the documents by the column vectors of $\mathrm{diag}(s)\,V^T$, where $U$, $s$ and $V^T$ are truncated to the first two components.
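
These two representations are simply the projections of the original matrix onto the leading singular vectors. The sketch below checks this numerically under the assumption that the matrix is called `A` and the truncation rank is `k = 2`; the names `term_coords` and `doc_coords` are introduced here for illustration.

```python
import numpy as np

# Word-document matrix from the example (rows: words, columns: titles).
A = np.array([[1, 0, 1, 0, 0], [1, 1, 0, 0, 0], [0, 1, 0, 0, 0],
              [0, 1, 1, 0, 0], [0, 0, 0, 1, 0], [0, 0, 1, 1, 0],
              [0, 0, 0, 1, 0], [0, 0, 0, 1, 1]], dtype=float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2
U_k, s_k, Vt_k = U[:, :k], s[:k], Vt[:k, :]

# Broadcasting scales each column of U_k (each row of Vt_k) by a singular
# value; this is equivalent to multiplying with np.diag(s_k).
term_coords = U_k * s_k            # shape (8, 2), one row per word
doc_coords = s_k[:, None] * Vt_k   # shape (2, 5), one column per title

# Since A = U diag(s) V^T with orthonormal singular vectors:
#   A V_k   = U_k diag(s_k)     (word coordinates)
#   U_k^T A = diag(s_k) V_k^T   (title coordinates)
terms_ok = np.allclose(np.dot(A, Vt_k.T), term_coords)
docs_ok = np.allclose(np.dot(U_k.T, A), doc_coords)
```

Note that the signs of the singular vectors are not unique, so the coordinates may come out negated on another platform or library version without changing their geometry.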

The matrix below has one row per word, each row giving that word's coordinates in the concept space, in the order ['romeo', 'juliet', 'happy', 'dagger', 'live', 'die', 'free', 'new-hampshire'].

In [23]:
print(np.dot(U[:, :2], np.diag(s[:2])))

[[-0.90532712 -0.56298763]
 [-0.71819615 -0.90367568]
 [-0.40733041 -0.54074246]
 [-1.00179178 -0.74079687]
 [-0.60304575  0.6953914 ]
 [-1.19750713  0.49533699]
 [-0.60304575  0.6953914 ]
 [-0.74586005  0.92405295]]


The matrix below has one column per title, each column giving that title's representation in the concept space.

In [24]:
print(np.dot(np.diag(s[:2]), V[:2, :]))

[[-0.71042084 -0.93087134 -1.35852135 -1.37813921 -0.32637322]
 [-0.7295895  -1.08703198 -0.40216102  1.39791629  0.45966878]]


Each column vector of the matrix above can be regarded as describing how the semantics of the corresponding title are distributed over the two concepts.
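
A common way to use these vectors is to compare titles by the cosine of the angle between them. The snippet below is a minimal sketch of that idea; cosine similarity is an assumption chosen here for illustration rather than something the notebook itself computes, and the names `doc_coords`, `unit` and `cos_sim` are hypothetical.

```python
import numpy as np

# Word-document matrix from the example (rows: words, columns: titles).
A = np.array([[1, 0, 1, 0, 0], [1, 1, 0, 0, 0], [0, 1, 0, 0, 0],
              [0, 1, 1, 0, 0], [0, 0, 0, 1, 0], [0, 0, 1, 1, 0],
              [0, 0, 0, 1, 0], [0, 0, 0, 1, 1]], dtype=float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Title coordinates in the 2-D concept space, transposed to one row per title.
doc_coords = (s[:2, None] * Vt[:2, :]).T   # shape (5, 2)

# Normalize each row to unit length, then take all pairwise dot products.
norms = np.linalg.norm(doc_coords, axis=1, keepdims=True)
unit = doc_coords / norms
cos_sim = np.dot(unit, unit.T)             # cos_sim[i, j]: title i+1 vs title j+1

print(np.round(cos_sim, 2))
```

With the coordinates printed above, titles 1 and 2 point in nearly the same direction, as do titles 4 and 5, while titles 1 and 4 are close to orthogonal. The result does not depend on the sign convention of the singular vectors, because flipping a component negates it for every title at once.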