Big-Data-Analysis-Term-Project-PE

This repository contains all the Python notebooks, scraped data and other requirements for the BIG DATA ANALYSIS (MA60306) TERM PROJECT for Spring Semester, 2023.

GOOGLE DRIVE LINKS

The embeddings are too large to upload to GitHub, so we have uploaded them to Google Drive instead. They can be accessed through the following links:

  1. Google Drive Link to BERT Embeddings
  2. Google Drive Link to Word2Vec_SkipGram Embeddings
  3. Google Drive Link to GloVe Embeddings
  4. Google Drive Link to BERT Embeddings txt file
  5. Google Drive Link to Word2Vec Embeddings txt file
  6. Google Drive Link to GloVe Embeddings txt file

HOW TO REPRODUCE THIS PROJECT

  1. Clone the repository using the following command: git clone https://github.com/Anubhab2002/Big-Data-Analysis-Term-Project-PE.git
  2. Put the BERT embeddings in Embeddings/BERT/, the Word2Vec_SkipGram embeddings in Embeddings/Word2Vec_SkipGram/ and the GloVe embeddings in Embeddings/Glove/.
  3. Put the .txt file for the BERT vectors in Embedding Projector/projection_txt, and the .txt files for the Word2Vec and GloVe vectors in Embeddings/Word2Vec_SkipGram and Embeddings/GloVe respectively.
  4. All the .ipynb notebooks for the different parts of the project are included in the repository along with the required datasets. Run them either on Google Colab or on your local system (preferably in a virtual environment) to generate the required data and results.
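
Before running the notebooks, it can help to confirm that the downloaded files landed in the folders named in steps 2 and 3. The following is a minimal sketch, not part of the repository: the helper name `missing_embedding_dirs` is hypothetical, and the folder list simply copies the paths from the steps above (the GloVe folder is written as both `Glove` and `GloVe` there, so adjust the entry to match your checkout). It only checks that each folder exists and is non-empty, not that its contents are complete.

```python
from pathlib import Path

# Folder names copied from steps 2 and 3 above (assumption: paths are
# relative to the repository root; adjust the GloVe spelling if needed).
EXPECTED_DIRS = [
    "Embeddings/BERT",
    "Embeddings/Word2Vec_SkipGram",
    "Embeddings/Glove",
    "Embedding Projector/projection_txt",
]


def missing_embedding_dirs(repo_root):
    """Return the expected folders that are absent or empty under repo_root."""
    root = Path(repo_root)
    missing = []
    for rel in EXPECTED_DIRS:
        folder = root / rel
        # A folder counts as missing if it does not exist or holds no entries.
        if not folder.is_dir() or not any(folder.iterdir()):
            missing.append(rel)
    return missing


if __name__ == "__main__":
    # Run from the repository root after copying the Google Drive files.
    for rel in missing_embedding_dirs("."):
        print(f"Missing or empty: {rel}")
```

Running the script from the repository root prints one line per folder that still needs files; no output means every listed folder contains at least one entry.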

EVALUATION RESULTS:

Data 1, 2, 3 and 4 represent four different products, one from each of four categories: Men's Wear, Perfumes, Electronic Accessories and Groceries.

BERT Scores:

Skip-Gram Word2Vec Scores:

Glove Scores:

VISUALISATION:

PCA

t-SNE

TEAM MEMBERS:

Anubhab Mandal (20MA20080), Rohan Das (20MA20077), Rangoju Bhuvan (20MA20048), Mangalik Mitra (20MA20070), Samarth Somani (20MA20049), Sandeep Mishra (20MA20071), Kattunga Lakshmana Sai Kumar (20MA20026), Prabhav Sunil Patil (20MA20042), Jitendra Padmanabhuni (20MA20039), Shatansh Patnaik (20MA20067), Arup Baral (20MA20010), Atharv Bajaj (20MA20014), Vishwash Kumar (20MA20079)
