Practical projects from the Massive Data Analysis course (CS573200) at National Tsing Hua University.
The projects are implemented in Java using the Hadoop MapReduce framework, running on the Hortonworks Sandbox. The implementations include PageRank, Locality-Sensitive Hashing, KMeans, and Frequent Itemsets.
A brief introduction to PageRank:
- The input data set can be downloaded here
- There are 10876 nodes in the network graph
- The project lists the top 10 nodes with the highest PageRank, as the image below shows:
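To make the MapReduce formulation concrete, here is a minimal sketch of one PageRank iteration. The input line format, class names, and damping factor are illustrative assumptions, not the repository's actual code; only the node count (10876) comes from the dataset above.

```java
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class PageRankIteration {

    // Input line format (assumed): nodeId<TAB>rank<TAB>neighbor1,neighbor2,...
    public static class RankMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            String[] parts = line.toString().split("\t");
            String node = parts[0];
            double rank = Double.parseDouble(parts[1]);
            String adjacency = parts.length > 2 ? parts[2] : "";

            // Re-emit the adjacency list so the reducer can rebuild the graph.
            context.write(new Text(node), new Text("LINKS\t" + adjacency));

            // Distribute this node's rank evenly over its out-links.
            if (!adjacency.isEmpty()) {
                String[] neighbors = adjacency.split(",");
                for (String neighbor : neighbors) {
                    context.write(new Text(neighbor),
                                  new Text(Double.toString(rank / neighbors.length)));
                }
            }
        }
    }

    public static class RankReducer extends Reducer<Text, Text, Text, Text> {
        private static final double BETA = 0.8; // damping factor (assumed value)
        private static final long N = 10876;    // node count, from the dataset

        @Override
        protected void reduce(Text node, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            String adjacency = "";
            double sum = 0.0;
            for (Text value : values) {
                String v = value.toString();
                if (v.startsWith("LINKS\t")) {
                    adjacency = v.substring(6);
                } else {
                    sum += Double.parseDouble(v);
                }
            }
            // New rank: random-teleport share plus damped incoming contributions.
            double newRank = (1.0 - BETA) / N + BETA * sum;
            context.write(node, new Text(newRank + "\t" + adjacency));
        }
    }
}
```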
This project implements the application of Finding Similar Items. The process has 3 main parts:
- Shingling
- Read in the articles and create shingles of 3 words (3-shingles). For example, the sentence "I commit java code on github" yields 4 3-shingles: "I commit java", "commit java code", "java code on", and "code on github" (see the shingling sketch after this list)
- Min-hashing
- Hash the shingles into integer tokens, greatly reducing the space needed for computation
- Emulate the idea of random permutations with Min-hashing functions to obtain a Signature Matrix (100 Min-hashing functions are used; their formula is shown in the image below)
- Leverage the fact that the expected similarity of 2 signatures equals the Jaccard Similarity of the corresponding 2 columns, and the longer the signatures, the smaller the expected error (see the Min-hashing sketch after this list)
- LSH
- Divide all 100 rows (the results of the 100 Min-hashing functions) into 50 bands of 2 rows each
- In each band, hash the signature segments so that identical segments fall into the same bucket
- Compute the Jaccard Similarity of every pair of signatures that share a bucket to get the actual similarity (see the LSH sketch after this list)
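Brief sketches of the three parts follow. First, shingling: a minimal, self-contained version that assumes whitespace tokenization, with illustrative class and method names rather than the repository's actual code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class Shingler {

    // Returns all consecutive k-word shingles of a sentence.
    public static List<String> shingle(String sentence, int k) {
        String[] words = sentence.trim().split("\\s+");
        List<String> shingles = new ArrayList<>();
        for (int i = 0; i + k <= words.length; i++) {
            shingles.add(String.join(" ", Arrays.copyOfRange(words, i, i + k)));
        }
        return shingles;
    }

    public static void main(String[] args) {
        // Reproduces the example above: 4 3-shingles from a 6-word sentence.
        System.out.println(shingle("I commit java code on github", 3));
        // [I commit java, commit java code, java code on, code on github]
    }
}
```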
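Next, Min-hashing. The project's exact hash formula is the one shown in the image above; this sketch assumes the common universal form h(x) = (a*x + b) mod p with 100 hash functions, and non-negative integer tokens.

```java
import java.util.Arrays;
import java.util.Random;
import java.util.Set;

public class MinHasher {
    private static final int NUM_HASHES = 100;
    private static final long PRIME = 2147483647L; // a large prime modulus
    private final long[] a = new long[NUM_HASHES];
    private final long[] b = new long[NUM_HASHES];

    public MinHasher(long seed) {
        Random rnd = new Random(seed);
        for (int i = 0; i < NUM_HASHES; i++) {
            a[i] = 1 + rnd.nextInt(Integer.MAX_VALUE - 1); // a != 0
            b[i] = rnd.nextInt(Integer.MAX_VALUE);
        }
    }

    // Signature of a document given its (non-negative) shingle tokens. Taking
    // the minimum hash value over all tokens emulates picking the first
    // present row under a random permutation of the rows.
    public long[] signature(Set<Integer> tokens) {
        long[] sig = new long[NUM_HASHES];
        Arrays.fill(sig, Long.MAX_VALUE);
        for (int token : tokens) {
            for (int i = 0; i < NUM_HASHES; i++) {
                long h = (a[i] * token + b[i]) % PRIME;
                sig[i] = Math.min(sig[i], h);
            }
        }
        return sig;
    }

    // The fraction of positions where two signatures agree estimates the
    // Jaccard Similarity of the underlying shingle sets; more hash functions
    // means smaller expected error.
    public static double estimatedSimilarity(long[] s1, long[] s2) {
        int agree = 0;
        for (int i = 0; i < s1.length; i++) {
            if (s1[i] == s2[i]) agree++;
        }
        return (double) agree / s1.length;
    }
}
```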
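Finally, the LSH banding step with 50 bands of 2 rows each. Bucketing by a string key is an illustrative choice, not the repository's actual code; each candidate pair returned here would then be verified with the signature similarity estimate above to obtain the actual similarity.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class LshBanding {
    private static final int BANDS = 50;
    private static final int ROWS_PER_BAND = 2; // 50 * 2 = 100 signature rows

    // Returns candidate pairs of document indices that share a bucket in any
    // band; each pair is encoded as "i,j" with i < j so duplicates across
    // bands are collapsed.
    public static Set<String> candidatePairs(long[][] signatures) {
        Set<String> candidates = new HashSet<>();
        for (int band = 0; band < BANDS; band++) {
            Map<String, List<Integer>> buckets = new HashMap<>();
            for (int doc = 0; doc < signatures.length; doc++) {
                // The band's two rows form the bucket key, so only documents
                // with identical signature segments collide.
                StringBuilder key = new StringBuilder();
                for (int r = 0; r < ROWS_PER_BAND; r++) {
                    key.append(signatures[doc][band * ROWS_PER_BAND + r]).append(',');
                }
                buckets.computeIfAbsent(key.toString(), k -> new ArrayList<>()).add(doc);
            }
            for (List<Integer> bucket : buckets.values()) {
                for (int i = 0; i < bucket.size(); i++) {
                    for (int j = i + 1; j < bucket.size(); j++) {
                        candidates.add(bucket.get(i) + "," + bucket.get(j));
                    }
                }
            }
        }
        return candidates;
    }
}
```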