# Project: Designing a virtual assistant for corporate documentation

ChatGPT-like systems use Natural Language Processing (NLP) to interact with users in natural language, giving them intuitive access to specific information. In this project, you will design a simplified system for querying corporate documentation using both full-text and semantic search.

## Objective 1

Enable the search for documents using a short phrase, returning only the documents that match.
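
As one possible starting point, the matching behaviour described above can be sketched with a small in-memory inverted index. This is only an illustration under stated assumptions: the names `tokenize` and `InvertedIndex` are hypothetical, a query matches when a document contains every token of the phrase, and accents are stripped so that `située` and `situee` are treated alike. A real submission would more likely delegate this to a dedicated search engine.

```python
import re
import unicodedata
from collections import defaultdict


def tokenize(text):
    """Lowercase, strip accents, and split on non-alphanumeric characters."""
    text = unicodedata.normalize("NFKD", text.lower())
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    return re.findall(r"[a-z0-9]+", text)


class InvertedIndex:
    """Hypothetical minimal index: token -> ids of the documents containing it."""

    def __init__(self):
        self.postings = defaultdict(set)
        self.titles = {}

    def add(self, doc_id, title, text):
        self.titles[doc_id] = title
        for token in tokenize(f"{title} {text}"):
            self.postings[token].add(doc_id)

    def search(self, phrase):
        """Return the sorted ids of documents containing every token of the phrase."""
        tokens = tokenize(phrase)
        if not tokens:
            return []
        # .get avoids creating empty postings for unknown query tokens
        sets = [self.postings.get(token, set()) for token in tokens]
        return sorted(set.intersection(*sets))
```

With two documents added under ids `1` and `2`, `index.search("tour paris")` returns only the ids of documents containing both `tour` and `paris`; documents matching a single token are excluded, which is the "only the documents that match" behaviour.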

## Objective 2

Allow users to pose questions to their corporate documentation and receive a textual response synthesizing the found sources.
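
A rough sketch of the synthesis step follows, assuming the relevant passages have already been retrieved. Everything here is hypothetical: `Passage`, `build_prompt`, and `answer_with_sources` are illustrative names, and `generate` stands for whatever callable wraps the chosen language model (it takes a prompt string and returns a completion string). The point is only to show how numbered sources can travel with the answer.

```python
from dataclasses import dataclass


@dataclass
class Passage:
    title: str
    url: str
    text: str


def build_prompt(question, passages):
    """Number each passage so the model can cite it as [1], [2], ..."""
    context = "\n\n".join(
        f"[{i}] {p.title}\n{p.text}" for i, p in enumerate(passages, start=1)
    )
    return (
        "Answer the question using only the sources below. "
        "Cite sources as [n]. If the sources are insufficient, say so.\n\n"
        f"Sources:\n{context}\n\nQuestion: {question}\nAnswer:"
    )


def answer_with_sources(question, passages, generate):
    """Return the model's answer together with the sources placed in the prompt.

    `generate` is an assumed callable: prompt string in, completion string out.
    """
    answer = generate(build_prompt(question, passages))
    sources = [
        {"id": i, "title": p.title, "url": p.url}
        for i, p in enumerate(passages, start=1)
    ]
    return {"answer": answer.strip(), "sources": sources}
```

Returning the source list alongside the answer, rather than trusting the model to repeat URLs, keeps the displayed sources exact even when the generated text is imperfect.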

## Rules

- The entire environment must be containerized using Docker.
- Include instructions for installation and use (Readme.md).
- Address the handling of proper nouns and Out Of Vocabulary (OOV) words in the embedding model.
- Enable both full-text and semantic search capabilities.
- Use models that are either hosted locally or cloud-based with free, unlimited access, since the validation phase may be intensive. If you use a cloud-based model, provide the access details.
- Responses to searches must be supplemented with the sources used.

![image_source](./example-LLM-sources.png)
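
To make the rule about proper nouns and OOV words more concrete, here is a toy sketch of one well-known idea: building a word vector from its character n-grams, in the spirit of fastText, so that a word never seen before still gets a vector close to words that share its substrings. This is not a trained embedding model. The vectors are derived from hashes, and `DIM`, `N_MIN`, `N_MAX`, and all function names are assumptions chosen for illustration. Models that use subword tokenization address the same problem in their own way.

```python
import hashlib
import math

DIM = 256             # illustrative size; SHA-256 provides exactly 256 bits
N_MIN, N_MAX = 3, 5   # illustrative n-gram lengths


def char_ngrams(word):
    """Character n-grams of the lowercased word wrapped in boundary markers."""
    wrapped = f"<{word.lower()}>"
    return [
        wrapped[i:i + n]
        for n in range(N_MIN, N_MAX + 1)
        for i in range(len(wrapped) - n + 1)
    ]


def ngram_vector(ngram):
    """Deterministic +1/-1 vector derived from the SHA-256 hash of the n-gram."""
    bits = int.from_bytes(hashlib.sha256(ngram.encode("utf-8")).digest(), "big")
    return [1.0 if (bits >> i) & 1 else -1.0 for i in range(DIM)]


def embed_word(word):
    """Sum the n-gram vectors and L2-normalize; works for any string."""
    total = [0.0] * DIM
    for gram in char_ngrams(word):
        for i, value in enumerate(ngram_vector(gram)):
            total[i] += value
    norm = math.sqrt(sum(x * x for x in total)) or 1.0
    return [x / norm for x in total]


def cosine(a, b):
    """Cosine similarity of two already-normalized vectors."""
    return sum(x * y for x, y in zip(a, b))
```

Because `Macron` and `Macrons` share most of their n-grams, their vectors end up far more similar than those of two unrelated names, even though neither word comes from a fixed vocabulary. Whatever embedding model is chosen, the README should explain how it behaves on such words.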

# Dataset

For this project, we will use the French Wikipedia dataset as the documentation corpus. You can download it using the following code.

You are not required to index the entire dataset; a subset is enough. If you index only a subset, specify which one you used.

```python
# pip install datasets mwparserfromhell
# See the dataset card for an explanation: https://huggingface.co/datasets/wikipedia

from datasets import load_dataset

# Downloads and prepares the French Wikipedia dump dated 2023-12-20
dataset = load_dataset("wikipedia", language="fr", date="20231220", trust_remote_code=True)
```
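
Since indexing a subset is allowed, one simple and reproducible way to pick it is to keep the first articles, in dataset order, that pass a length filter. The sketch below assumes each record is a dict with `id`, `title`, and `text` keys, as in the Wikipedia dataset; the helper names and the default thresholds are arbitrary illustrative choices.

```python
from itertools import islice


def take_subset(records, limit, min_chars=500):
    """Return the first `limit` records whose text has at least `min_chars` characters.

    `records` is any iterable of dicts with "id", "title" and "text" keys.
    """
    long_enough = (r for r in records if len(r["text"]) >= min_chars)
    return list(islice(long_enough, limit))


def describe_subset(subset, limit, min_chars):
    """Build a short description of the subset, suitable for the README."""
    if not subset:
        return "empty subset"
    return (
        f"{len(subset)} articles: first {limit} articles in dataset order "
        f"with at least {min_chars} characters "
        f"(ids {subset[0]['id']} to {subset[-1]['id']})"
    )
```

Assuming the object returned by `load_dataset` is bound to a name such as `dataset`, a call like `take_subset(dataset["train"], limit=10_000)` would select the articles to index, and the output of `describe_subset` can be pasted into the README to state which subset was used.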



# Info

This project is a typical enterprise use case. When ChatGPT, LLMs, or GPT come up in a business context, it is almost always for enhanced semantic search ("talk to my data") over internal documentation, or for a chatbot.

There are software solutions that allow you to do this in a plug-and-play manner (which is not the objective of this project), such as:
- Algolia
- Azure Cognitive Search

There are also documentation-focused software solutions that integrate LLMs:
- Slite
- Confluence
- Notion