# 2. Search Engine

This search engine allows you to retrieve restaurants based on a user query. We’ll build two types of search engines:

- Conjunctive Search Engine: Returns restaurants where all query terms appear in the description.
- Ranked Search Engine: Returns the top-k restaurants sorted by similarity to the query, using TF-IDF and Cosine Similarity.

To effectively analyze restaurant descriptions, it is crucial to *pre-process the text*. As in any optimal text analysis, we must proceed with preprocessing, which we addressed in the first part.In general we followed these steps.:

- Firstly we ensured the removal of stop words in English, as well as customized common words related to Italian cuisine and gourmet dining, such as "pasta," "pizza," and other frequently used terms. 

- The next step involved constructing a `vocabulary` to extract all unique words from the various descriptions and associate each with a unique integer. We decided to increment this integer sequentially for simplicity.

- Additionally, we created an `inverted_index` that maps these integers back to the specific documents in which the corresponding words appear. This setup allows us to define a `search_query` function where, by inputting a word or phrase, we can retrieve all documents containing all of those words.

In [128]:
import nltk 
nltk.download('punkt')
nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import SnowballStemmer
import string
import re
import pandas as pd
from IPython.display import display



[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\User\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\User\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [129]:


columns_to_use = ['index','restaurantName', 'address', 'city','description','creditCards', 'website']
df= pd.read_csv("dataset.tsv", sep='\t', encoding= "utf-8")


## 2.0 Preprocessing

In [130]:

custom_stopwords = {
    'michelin', 'restaurant', 'cuisine', 'menu', 'chef', 'dishes', 'gourmet',
    'service', 'wine', 'dining', 'food', 'kitchen', 'modern', 'traditional',
    'star', 'quality', 'italian', 'pasta', 'pizza', 'sushi', 'fine', 'taste'
}

def preprocess_and_stem_text(text):
    
    words = word_tokenize(text.lower())
    words = [re.sub(r"[^a-zA-Z']", '', word) for word in words]

    
    stop_words = set(stopwords.words('english')) | custom_stopwords
    filtered_words = [word for word in words if word not in stop_words and word != '']

    
    snowball_stemmer = SnowballStemmer('english')
    stemmed_words = [snowball_stemmer.stem(word) for word in filtered_words]

    return stemmed_words

## 2.1 Conjunctive Query

### 2.1.1 Create Your Index!

In [131]:
# Creazione del vocabolario
def create_vocabulary(descriptions):
    all_words = set(word for desc in descriptions for word in desc)
    vocabulary = {word: idx + 1 for idx, word in enumerate(all_words)}
    return vocabulary

def build_inverted_index(descriptions, vocabulary):
    """
    Costruisce un indice inverso che mappa ogni parola (ID) ai documenti in cui appare.
    """
    inverted_index = {term_id: [] for term_id in vocabulary.values()}  # Crea un indice inverso vuoto
    for doc_id, desc in enumerate(descriptions):
        for word in desc:
            word_id = vocabulary.get(word)  # Ottieni l'ID della parola
            if word_id:
                inverted_index[word_id].append(doc_id + 1)  # Aggiungi l'ID del documento
    return inverted_index

In [132]:
# Funzione principale per eseguire il preprocessing e costruire l'indice inverso
def main(df):
    """
    Funzione principale per eseguire il preprocessing, la creazione del vocabolario,
    la costruzione dell'indice inverso e l'esportazione del vocabolario.
    """

    df['processed_description'] = df['description'].apply(lambda x: preprocess_and_stem_text(x))

    vocabulary = create_vocabulary(df['processed_description'])
    inverted_index = build_inverted_index(df['processed_description'], vocabulary)

    # Restituisci il vocabolario e l'indice inverso per utilizzarli fuori dalla funzione
    return vocabulary, inverted_index

# Chiama la funzione principale e memorizza i risultati
vocabulary, inverted_index = main(df)


In [126]:

def search_query(query, vocabulary, inverted_index):
    
    processed_query = preprocess_and_stem_text(query)
    
    word_ids = [vocabulary.get(word) for word in processed_query if word in vocabulary]
    
    document_ids = set(doc_id for word_id in word_ids if word_id in inverted_index for doc_id in inverted_index[word_id])
    
    return document_ids


query = input("Inserisci la tua query: ")
document_ids = search_query(query, vocabulary, inverted_index)


if document_ids:
    print("Documenti che contengono la query '{}':".format(query))
    
    document_ids_list = sorted(list(document_ids))
    columns = ['restaurantName', 'address', 'description', 'website']
    result = df.iloc[document_ids_list][columns]
    

    #result = df_app_name.loc[document_ids_list, ['restaurantName', 'address', 'description', 'website']]
    
    # Visualizza il risultato
    display(result)
else:
    print("Nessun documento trovato per la query '{}'.".format(query))

Documenti che contengono la query 'veneto':


Unnamed: 0,restaurantName,address,description,website
51,Il Piraña,via G. Valentini 110,A classic fish restaurant that opts for simply...,https://www.ristorantepirana.it
72,Fracia,località Fracia,Park the car and walk up a short track (about ...,https://www.ristorantefracia.it/
230,Il Sole di Ranco,piazza Venezia 5,An elegant residence with gardens sloping down...,https://www.ilsolediranco.it/
303,Controcorrente,via Colombo 101,"In this minimalist restaurant, situated below ...",http://www.ristorantecontrocorrente.it
380,Bros',via degli Acaja 2,"Bros' is a synonym for a young, free spirit wh...",https://www.pellegrinobrothers.it/
438,Pierre Alexis 1877,via Marconi 50/a,Situated in a house dating from 1877 (as the n...,http://www.pierrealexiscourmayeur.it
503,Al Vedel,via Vedole 68,"A temple for the production of culatello ham, ...",http://www.poderecadassa.it
613,Osteria Boccolicchio,via Arco Boccolicchio 15,Having gained experience in various restaurant...,tel:+39 0884 090317
643,Bon Wei,via Castelvetro 16/18,China on a plate! This attractive restaurant w...,https://www.bon-wei.it/
661,Al Capitan della Cittadella,piazza Cittadella 7/a,"Situated just outside the town walls, Al Capit...",https://www.alcapitan.it/


In [127]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Inizializzazione di TfidfVectorizer
vectorizer = TfidfVectorizer()

# Calcolo della matrice TF-IDF sulla colonna 'description' del DataFrame
tfidf_matrix = vectorizer.fit_transform(df['description'])

# Conversione della matrice sparsa TF-IDF in un array per visualizzare i punteggi
tfidf_array = tfidf_matrix.toarray()

# Stampa della matrice TF-IDF
print("Matrice TF-IDF:")
print(tfidf_array)

# Mostra il vocabolario
print("\nVocabolario:")
print(vectorizer.vocabulary_)




Matrice TF-IDF:
[[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 ...
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]

Vocabolario:
{'situated': 7564, 'in': 4210, 'the': 8249, 'heart': 3993, 'of': 5659, 'genoa': 3648, 'historic': 4060, 'centre': 1714, 'this': 8275, 'contemporary': 2144, 'style': 7960, 'restaurant': 6875, 'focuses': 3385, 'on': 5698, 'just': 4440, 'few': 3238, 'dishes': 2599, 'almost': 453, 'all': 437, 'fish': 3312, 'based': 931, 'presented': 6436, 'very': 8779, 'modern': 5271, 'and': 526, 'generous': 3643, 'portions': 6355, 'seasonal': 7314, 'ingredients': 4262, 'market': 5009, 'fresh': 3503, 'produce': 6484, 'are': 668, 'guiding': 3919, 'philosophy': 6152, 'here': 4024, 'beautiful': 981, 'stone': 7885, 'vaulted': 8708, 'building': 1367, 'an': 515, 'old': 5680, '17c': 46, 'monastery': 5297, 'town': 8403, 'seafront': 7307, 'promenade': 6508, 'young': 9063, 'owner': 5837, 'chef': 1804, 'alessandro': 425, 'feo': 3216, 