Final project and report for DS8003: Management of Data and Big Data Tools
Type Name Latest commit message Commit time
LICENSE Create LICENSE Dec 20, 2018
README.md
Report.pdf Add files via upload Dec 19, 2018
final.py


TFIDF Search With Spark

My system uses Apache Spark with HDFS to build the TFIDF index and to run search queries against it. I am using the cricket corpus for this project. The system first loads the documents as separate records, tokenizes each record, and counts the occurrences of each word per document (TF). It then calculates the number of distinct documents that contain each term (DF), the IDF, and finally the TFIDF index.
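
The stages above can be sketched in plain Python as a local stand-in for the Spark job. This is not the code in final.py: the tokenizer, the `log(N / DF)` form of IDF, the function names, and the three sample sentences are all assumptions made for illustration. In Spark, the per-document counting and the DF step would typically be expressed with RDD operations such as `flatMap` and `reduceByKey`.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase and keep alphanumeric runs (an assumed tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_tfidf_index(docs):
    """Build {term: {doc_id: tfidf}} from a dict of doc_id -> raw text."""
    n_docs = len(docs)

    # TF: count of each word per document.
    tf = {doc_id: Counter(tokenize(text)) for doc_id, text in docs.items()}

    # DF: number of distinct documents that contain each term.
    df = Counter(term for counts in tf.values() for term in counts)

    # IDF: log(N / DF), one common variant (assumed here).
    idf = {term: math.log(n_docs / d) for term, d in df.items()}

    # TFIDF: term frequency weighted by inverse document frequency.
    index = defaultdict(dict)
    for doc_id, counts in tf.items():
        for term, count in counts.items():
            index[term][doc_id] = count * idf[term]
    return dict(index)


# Hypothetical sample documents, not taken from the cricket corpus.
docs = {
    "d1": "The batsman hit a six",
    "d2": "The bowler took a wicket",
    "d3": "The batsman was out",
}
index = build_tfidf_index(docs)
```

With this IDF variant, a term that appears in every document (such as "the" in the sample) gets a weight of zero, while a term found in a single document gets the highest weight.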

Once the TFIDF index is built, my system can take any query, tokenize it the same way it tokenizes the documents in the corpus, and then run the search against the index.
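
A minimal plain-Python sketch of the query step follows. The README does not state how documents are scored, so the ranking rule used here (sum the TFIDF weights of the matching query terms per document) is an assumption, as are the function names, the `top_k` parameter, and the small hand-written index.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Same assumed tokenizer as for documents: lowercase alphanumeric runs."""
    return re.findall(r"[a-z0-9]+", text.lower())


def search(index, query, top_k=10):
    """Rank documents by the summed TFIDF weight of the query terms they contain."""
    scores = defaultdict(float)
    for term in tokenize(query):
        # Terms missing from the index contribute nothing.
        for doc_id, weight in index.get(term, {}).items():
            scores[doc_id] += weight
    # Highest score first; ties are broken by document id for stable output.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:top_k]


# Hypothetical index in the form {term: {doc_id: tfidf}}.
index = {
    "batsman": {"d1": 0.41, "d3": 0.41},
    "six": {"d1": 1.10},
    "wicket": {"d2": 1.10},
}
results = search(index, "Batsman SIX!")
```

Because the query goes through the same tokenizer as the documents, differences in case and punctuation do not affect which index entries it matches.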

Part 1 (Computing the TFIDF score):

image

Part 2 (Search):

image
