This repository contains a semantic search engine built using the MIND Dataset. The search engine leverages Elasticsearch for indexing and searching, BERT embeddings for vectorizing the titles, and Streamlit for an interactive user interface.
- Elasticsearch Integration: Efficient search and indexing using Elasticsearch.
- BERT Embeddings: The title column is vectorized with a BERT model for semantic search.
- Streamlit Interface: User-friendly web interface to interact with the search engine.
- Python 3.7 or higher
- Docker (for running Elasticsearch)
- Elasticsearch (can be run via Docker)
- Streamlit
- SentenceTransformers

- Clone the repository:

      git clone https://github.com/andyzhangstat/semantic_search_engine.git
      cd semantic_search_engine

- Set up the environment:

      pip install -r requirements.txt

- Run Elasticsearch using Docker:

      docker run -d --name elasticsearch -p 9200:9200 -p 9300:9300 -e "discovery.type=single-node" elasticsearch:7.10.2

- Prepare the data: Download the MIND dataset and place the news.tsv file in the root directory.

- Ingest data into Elasticsearch:

      python ingest_data.py

- Run the Streamlit interface:

      streamlit run app.py

Once the Streamlit app is running, open http://localhost:8501 in your browser to start using the semantic search engine. Enter a query in the search box, and the results are displayed with the title and abstract of each news article that matches your query.
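
To illustrate how the displayed title and abstract could be pulled out of a search response, here is a minimal sketch. It assumes the indexed documents store their text under the field names `title` and `abstract`; those names and the helper `extract_results` are hypothetical and may not match what app.py does. The sample response is hand-written placeholder data in the general shape Elasticsearch returns.

```python
from typing import Any, Dict, List


def extract_results(response: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Collect title, abstract, and score from an Elasticsearch search response."""
    results = []
    for hit in response.get("hits", {}).get("hits", []):
        source = hit.get("_source", {})
        results.append(
            {
                # "title" and "abstract" are assumed field names.
                "title": source.get("title", ""),
                "abstract": source.get("abstract", ""),
                "score": hit.get("_score"),
            }
        )
    return results


# Placeholder response for illustration only; not real search output.
sample_response = {
    "hits": {
        "hits": [
            {"_score": 1.8, "_source": {"title": "Example headline A", "abstract": "Example abstract A"}},
            {"_score": 1.5, "_source": {"title": "Example headline B"}},
        ]
    }
}

for row in extract_results(sample_response):
    print(row["score"], row["title"], "|", row["abstract"])
```

Missing fields fall back to an empty string, so a document without an abstract still renders as a result.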
The MIND Dataset is a large-scale dataset for news recommendation research. It contains news articles and user interactions, making it ideal for building and evaluating news recommendation and search systems.
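
As a rough illustration of the file this project reads, the sketch below parses news.tsv rows into dictionaries. It assumes the layout described in the public MIND documentation: tab-separated, no header row, and eight columns in the order shown. The column names and the `read_news` helper are chosen here for illustration and are not taken from ingest_data.py.

```python
import csv
import io
from typing import Dict, Iterable, Iterator

# Assumed column order for news.tsv (the file itself has no header row).
NEWS_COLUMNS = [
    "news_id", "category", "subcategory", "title",
    "abstract", "url", "title_entities", "abstract_entities",
]


def read_news(lines: Iterable[str]) -> Iterator[Dict[str, str]]:
    """Yield one dict per well-formed news.tsv row."""
    # QUOTE_NONE keeps quote characters in titles and JSON columns untouched.
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) != len(NEWS_COLUMNS):
            continue  # skip rows that do not match the assumed layout
        yield dict(zip(NEWS_COLUMNS, row))


# Placeholder rows for illustration; the second one is deliberately malformed.
sample = io.StringIO(
    "N00001\tnews\tnewsworld\tExample title\tExample abstract\t"
    "https://example.com/a\t[]\t[]\n"
    "broken\trow\n"
)

articles = list(read_news(sample))
print(len(articles), articles[0]["title"])
```

For the real file, the same function can be fed an open file handle, for example `open("news.tsv", encoding="utf-8", newline="")`.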
- Data Preparation: The news.tsv file from the MIND dataset is loaded and the title column is vectorized using the BERT model.
- Elasticsearch Indexing: The vectorized titles are indexed in Elasticsearch to enable fast and efficient semantic search.
- Search: A search query is vectorized using the same BERT model, and Elasticsearch's KNN search is used to find the most relevant news articles based on the vector similarity.
- User Interface: The Streamlit app provides a simple interface to enter search queries and display the search results in real-time.
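
The indexing and search steps above can be sketched as plain request bodies. This is an illustration under stated assumptions, not the code in ingest_data.py or app.py: the field name `title_vector` is hypothetical, and the embedding dimension depends on whichever model is used, so it is left as a parameter. Two query styles are shown because the top-level `knn` search option belongs to Elasticsearch 8.x, while on the 7.10.2 image used in the setup steps vector ranking is typically done with a `script_score` query.

```python
from typing import Any, Dict, List

VECTOR_FIELD = "title_vector"  # hypothetical field name for the title embedding


def build_mapping(dims: int, indexed: bool = False) -> Dict[str, Any]:
    """Index mapping with text fields plus a dense_vector for the title embedding."""
    vector: Dict[str, Any] = {"type": "dense_vector", "dims": dims}
    if indexed:
        # Only meaningful on 8.x, where the `knn` option needs an indexed vector.
        vector.update({"index": True, "similarity": "cosine"})
    return {
        "mappings": {
            "properties": {
                "title": {"type": "text"},
                "abstract": {"type": "text"},
                VECTOR_FIELD: vector,
            }
        }
    }


def build_script_score_query(query_vector: List[float], size: int = 10) -> Dict[str, Any]:
    """Exact cosine-similarity ranking over all documents (7.x style)."""
    return {
        "size": size,
        "query": {
            "script_score": {
                "query": {"match_all": {}},
                "script": {
                    # Adding 1.0 keeps scores non-negative, which Elasticsearch requires.
                    "source": f"cosineSimilarity(params.query_vector, '{VECTOR_FIELD}') + 1.0",
                    "params": {"query_vector": query_vector},
                },
            }
        },
        "_source": ["title", "abstract"],
    }


def build_knn_query(query_vector: List[float], k: int = 10,
                    num_candidates: int = 100) -> Dict[str, Any]:
    """Approximate nearest-neighbor search body (8.x style `knn` option)."""
    return {
        "knn": {
            "field": VECTOR_FIELD,
            "query_vector": query_vector,
            "k": k,
            "num_candidates": num_candidates,
        },
        "_source": ["title", "abstract"],
    }


# Tiny 3-dimensional placeholder vector; real embeddings have hundreds of dimensions.
query_vector = [0.1, 0.2, 0.3]
mapping = build_mapping(dims=3)
exact_body = build_script_score_query(query_vector, size=5)
knn_body = build_knn_query(query_vector, k=5)
print(exact_body["query"]["script_score"]["script"]["source"])
```

The query vector must come from the same model that produced the indexed title vectors, and its length must equal the `dims` value in the mapping. The `script_score` form compares the query against every document, so it is exact but slower on large indexes than an approximate kNN search.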
.
├── app.py # Streamlit app
├── ingest_data.py # Script to ingest data into Elasticsearch
├── requirements.txt # Python dependencies
├── news.tsv # MIND dataset file (not included in the repo)
└── README.md # Project documentation
This project is licensed under the MIT License - see the LICENSE file for details.
- MIND Dataset: Thanks to Microsoft for providing the MIND dataset.
- SentenceTransformers: Thanks to Hugging Face for the BERT model used for embedding the titles.
For any inquiries or issues, please reach out to me.