Practical projects from the Massive Data Analysis course (CS573200) at National Tsing Hua University.
The projects are implemented in Java using the Hadoop MapReduce framework, running on the Hortonworks Sandbox. The implementations include PageRank, Locality-Sensitive Hashing, KMeans, and Frequent Itemsets.
A brief introduction to PageRank:
- The input data set can be downloaded here
- There are 10876 nodes in the network graph
- The project lists the top 10 nodes with the highest PageRank, as the image below shows:
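To make the MapReduce formulation concrete, here is a minimal sketch of one PageRank iteration. The input line format, class names, and damping factor are illustrative assumptions, not the repository's actual code; only the node count (10876) comes from the dataset above.

```java
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class PageRankIteration {

    // Input line format (assumed): nodeId<TAB>rank<TAB>neighbor1,neighbor2,...
    public static class RankMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            String[] parts = line.toString().split("\t");
            String node = parts[0];
            double rank = Double.parseDouble(parts[1]);
            String adjacency = parts.length > 2 ? parts[2] : "";

            // Re-emit the adjacency list so the reducer can rebuild the graph.
            context.write(new Text(node), new Text("LINKS\t" + adjacency));

            // Distribute this node's rank evenly over its out-links.
            if (!adjacency.isEmpty()) {
                String[] neighbors = adjacency.split(",");
                for (String neighbor : neighbors) {
                    context.write(new Text(neighbor),
                                  new Text(Double.toString(rank / neighbors.length)));
                }
            }
        }
    }

    public static class RankReducer extends Reducer<Text, Text, Text, Text> {
        private static final double BETA = 0.8; // damping factor (assumed value)
        private static final long N = 10876;    // node count, from the dataset

        @Override
        protected void reduce(Text node, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            String adjacency = "";
            double sum = 0.0;
            for (Text value : values) {
                String v = value.toString();
                if (v.startsWith("LINKS\t")) {
                    adjacency = v.substring(6);
                } else {
                    sum += Double.parseDouble(v);
                }
            }
            // New rank: random-teleport share plus damped incoming contributions.
            double newRank = (1.0 - BETA) / N + BETA * sum;
            context.write(node, new Text(newRank + "\t" + adjacency));
        }
    }
}
```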
This project implements the application of Finding Similar Items. The process has 3 main parts:
- Shingling
- Read in the articles and create shingles of 3 words (3-shingles). For example, the sentence "I commit java code on github" yields 4 3-shingles: "I commit java", "commit java code", "java code on", and "code on github" (see the shingling sketch after this list)
- Min-hashing
- Hash the shingles into integer tokens, greatly reducing the space needed for computation
- Emulate the idea of random permutations with Min-hashing functions to obtain a Signature Matrix (100 Min-hashing functions are used; their formula is shown in the image below)
- Leverage the fact that the expected similarity of 2 signatures equals the Jaccard Similarity of the corresponding 2 columns, and the longer the signatures, the smaller the expected error (see the Min-hashing sketch after this list)
- LSH
- Divide all 100 rows (the results of the 100 Min-hashing functions) into 50 bands of 2 rows each
- In each band, hash the signature segments so that identical segments fall into the same bucket
- Compute the Jaccard Similarity of every pair of signatures that share a bucket to get the actual similarity (see the LSH sketch after this list)
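Brief sketches of the three parts follow. First, shingling: a minimal, self-contained version that assumes whitespace tokenization, with illustrative class and method names rather than the repository's actual code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class Shingler {

    // Returns all consecutive k-word shingles of a sentence.
    public static List<String> shingle(String sentence, int k) {
        String[] words = sentence.trim().split("\\s+");
        List<String> shingles = new ArrayList<>();
        for (int i = 0; i + k <= words.length; i++) {
            shingles.add(String.join(" ", Arrays.copyOfRange(words, i, i + k)));
        }
        return shingles;
    }

    public static void main(String[] args) {
        // Reproduces the example above: 4 3-shingles from a 6-word sentence.
        System.out.println(shingle("I commit java code on github", 3));
        // [I commit java, commit java code, java code on, code on github]
    }
}
```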
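Next, Min-hashing. The project's exact hash formula is the one shown in the image above; this sketch assumes the common universal form h(x) = (a*x + b) mod p with 100 hash functions, and non-negative integer tokens.

```java
import java.util.Arrays;
import java.util.Random;
import java.util.Set;

public class MinHasher {
    private static final int NUM_HASHES = 100;
    private static final long PRIME = 2147483647L; // a large prime modulus
    private final long[] a = new long[NUM_HASHES];
    private final long[] b = new long[NUM_HASHES];

    public MinHasher(long seed) {
        Random rnd = new Random(seed);
        for (int i = 0; i < NUM_HASHES; i++) {
            a[i] = 1 + rnd.nextInt(Integer.MAX_VALUE - 1); // a != 0
            b[i] = rnd.nextInt(Integer.MAX_VALUE);
        }
    }

    // Signature of a document given its (non-negative) shingle tokens. Taking
    // the minimum hash value over all tokens emulates picking the first
    // present row under a random permutation of the rows.
    public long[] signature(Set<Integer> tokens) {
        long[] sig = new long[NUM_HASHES];
        Arrays.fill(sig, Long.MAX_VALUE);
        for (int token : tokens) {
            for (int i = 0; i < NUM_HASHES; i++) {
                long h = (a[i] * token + b[i]) % PRIME;
                sig[i] = Math.min(sig[i], h);
            }
        }
        return sig;
    }

    // The fraction of positions where two signatures agree estimates the
    // Jaccard Similarity of the underlying shingle sets; more hash functions
    // means smaller expected error.
    public static double estimatedSimilarity(long[] s1, long[] s2) {
        int agree = 0;
        for (int i = 0; i < s1.length; i++) {
            if (s1[i] == s2[i]) agree++;
        }
        return (double) agree / s1.length;
    }
}
```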
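Finally, the LSH banding step with 50 bands of 2 rows each. Bucketing by a string key is an illustrative choice, not the repository's actual code; each candidate pair returned here would then be verified with the signature similarity estimate above to obtain the actual similarity.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class LshBanding {
    private static final int BANDS = 50;
    private static final int ROWS_PER_BAND = 2; // 50 * 2 = 100 signature rows

    // Returns candidate pairs of document indices that share a bucket in any
    // band; each pair is encoded as "i,j" with i < j so duplicates across
    // bands are collapsed.
    public static Set<String> candidatePairs(long[][] signatures) {
        Set<String> candidates = new HashSet<>();
        for (int band = 0; band < BANDS; band++) {
            Map<String, List<Integer>> buckets = new HashMap<>();
            for (int doc = 0; doc < signatures.length; doc++) {
                // The band's two rows form the bucket key, so only documents
                // with identical signature segments collide.
                StringBuilder key = new StringBuilder();
                for (int r = 0; r < ROWS_PER_BAND; r++) {
                    key.append(signatures[doc][band * ROWS_PER_BAND + r]).append(',');
                }
                buckets.computeIfAbsent(key.toString(), k -> new ArrayList<>()).add(doc);
            }
            for (List<Integer> bucket : buckets.values()) {
                for (int i = 0; i < bucket.size(); i++) {
                    for (int j = i + 1; j < bucket.size(); j++) {
                        candidates.add(bucket.get(i) + "," + bucket.get(j));
                    }
                }
            }
        }
        return candidates;
    }
}
```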