Semantic Search Engine Using the MIND Dataset

Overview

This repository contains a semantic search engine built using the MIND Dataset. The search engine leverages Elasticsearch for indexing and searching, BERT embeddings for vectorizing the titles, and Streamlit for an interactive user interface.

Features

Elasticsearch Integration: Efficient search and indexing using Elasticsearch.
BERT Embeddings: The title column is vectorized with a BERT model to support semantic search.
Streamlit Interface: User-friendly web interface to interact with the search engine.

Installation

Prerequisites

Python 3.7 or higher
Docker (for running Elasticsearch)
Elasticsearch (can be run via Docker)
Streamlit
SentenceTransformers

Step-by-Step Guide

Clone the repository:

git clone https://github.com/andyzhangstat/semantic_search_engine.git
cd semantic_search_engine

Set up the environment:
```
pip install -r requirements.txt
```

Run Elasticsearch using Docker:

docker run -d --name elasticsearch -p 9200:9200 -p 9300:9300 -e "discovery.type=single-node" elasticsearch:7.10.2

Prepare the data: Download the MIND dataset and place its news.tsv file in the repository root directory.
Ingest data into Elasticsearch:
```
python ingest_data.py
```
Run the Streamlit interface:
```
streamlit run app.py
```
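
As a rough sketch of the data-preparation step, the snippet below parses news.tsv with the Python standard library. It assumes the column order given in the MIND documentation (news ID, category, subcategory, title, abstract, URL, title entities, abstract entities) and that the file has no header row, so check both against your copy. The names NEWS_COLUMNS and read_news are hypothetical and are not taken from ingest_data.py. A two-row in-memory sample with made-up articles stands in for the real file.

```python
import csv
import io

# Assumed column order for MIND news.tsv (the file has no header row).
NEWS_COLUMNS = [
    "news_id", "category", "subcategory", "title",
    "abstract", "url", "title_entities", "abstract_entities",
]


def read_news(handle):
    """Yield one dict per article from an open, tab-separated handle."""
    # QUOTE_NONE keeps quotation marks inside titles as literal characters.
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) != len(NEWS_COLUMNS):
            continue  # skip malformed lines rather than guess at their layout
        record = dict(zip(NEWS_COLUMNS, row))
        if record["title"].strip():  # only titled articles can be embedded
            yield record


# Made-up sample rows; the second one has an empty abstract.
SAMPLE = (
    "N1\tnews\tnewsworld\tStorm hits coast\tA storm made landfall."
    "\thttps://example.com/1\t[]\t[]\n"
    "N2\tsports\tsoccer\tCup final tonight\t"
    "\thttps://example.com/2\t[]\t[]\n"
)

articles = list(read_news(io.StringIO(SAMPLE)))
titles = [article["title"] for article in articles]
print(titles)
```

To read the real file, pass an open handle instead of the sample, for example open("news.tsv", encoding="utf-8", newline="").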

Usage

Once the Streamlit app is running, open your browser and navigate to http://localhost:8501 to start using the semantic search engine. You can enter a search query in the search box, and the results will be displayed with the title and abstract of the news articles that match your query.

Dataset

The MIND Dataset is a large-scale dataset for news recommendation research. It contains news articles and user interactions, making it ideal for building and evaluating news recommendation and search systems.

How It Works

Data Preparation:
- The news.tsv file from the MIND dataset is loaded and the title column is vectorized using the BERT model.
Elasticsearch Indexing:
- The vectorized titles are indexed in Elasticsearch to enable fast and efficient semantic search.
Search:
- A search query is vectorized with the same BERT model, and Elasticsearch's kNN search returns the news articles whose title vectors are most similar to the query vector.
User Interface:
- The Streamlit app provides a simple interface for entering search queries and displays the results in real time.
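
The search step comes down to ranking stored title vectors by their similarity to the query vector. The sketch below illustrates that idea in plain Python, with made-up 3-dimensional vectors standing in for BERT embeddings (real embeddings typically have hundreds of dimensions). It performs an exact, brute-force ranking rather than calling Elasticsearch. The helper names cosine_similarity, top_k, and build_knn_body, and the field name title_vector, are hypothetical. build_knn_body only shows the shape of a request body for the knn search option available in Elasticsearch 8.x; whether this project's query looks like this is an assumption, and older releases need a different query form.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as dissimilar to everything
    return dot / (norm_a * norm_b)


def top_k(query_vector, indexed, k=2):
    """Return the k (score, title) pairs most similar to the query."""
    scored = [
        (cosine_similarity(query_vector, vector), title)
        for title, vector in indexed
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


def build_knn_body(query_vector, k=10, field="title_vector"):
    """Request body for an Elasticsearch 8.x knn search (assumed field name)."""
    return {
        "knn": {
            "field": field,
            "query_vector": list(query_vector),
            "k": k,
            # num_candidates must be at least k; a larger pool trades
            # speed for recall.
            "num_candidates": max(k * 10, 100),
        },
        "_source": ["title", "abstract"],
    }


# Toy "index": titles paired with made-up embedding vectors.
indexed = [
    ("Storm hits coast", [0.9, 0.1, 0.0]),
    ("Cup final tonight", [0.0, 0.2, 0.9]),
    ("Hurricane warning issued", [0.8, 0.3, 0.1]),
]
query_vector = [1.0, 0.2, 0.0]  # stands in for an embedded weather query

results = top_k(query_vector, indexed, k=2)
for score, title in results:
    print(f"{score:.3f}  {title}")

body = build_knn_body(query_vector, k=2)
```

In a real deployment the ranking would be done by Elasticsearch itself, and the query vector would come from the same embedding model that was used at indexing time.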

Project Structure

.
├── app.py                # Streamlit app
├── ingest_data.py        # Script to ingest data into Elasticsearch
├── requirements.txt      # Python dependencies
├── news.tsv              # MIND dataset file (not included in the repo)
└── README.md             # Project documentation

Screenshot

See screenshot.png in the repository root.

License

This project is licensed under the MIT License - see the LICENSE file for details.

Acknowledgments

MIND Dataset: Microsoft for providing the MIND dataset.
SentenceTransformers: Hugging Face for the BERT model used for embedding the titles.

Contact

For any inquiries or issues, please reach out to me.
