
In [1]:
import pandas as pd

# Load the file
file_path = '/content/output.csv'
data = pd.read_csv(file_path)

data.head()


Unnamed: 0,index,restaurantName,address,city,postalCode,country,priceRange,cuisineType,description,creditCards,facilitiesServices,phoneNumber,website
0,1,20Tre,via David Chiossone 20 r,Genoa,16123,Italy,€€,"Farm to table, Modern Cuisine",Situated in the heart of Genoa’s historic cent...,"['Amex', 'Dinersclub', 'Mastercard', 'Visa']",['Air conditioning'],+39 010 247 6191,https://www.ristorante20tregenova.it/
1,2,Alessandro Feo,via Angelo Lista 24,Marina di Casal Velino,84040,Italy,€€,"Campanian, Seafood",In a beautiful stone-vaulted building (an old ...,"['Amex', 'Dinersclub', 'Mastercard', 'Visa']",[],+39 328 893 7083,https://www.alessandrofeoristorante.it/
2,3,Ape Vino e Cucina,Piazza Risorgimento 3,Alba,12051,Italy,€€,"Piedmontese, Contemporary",This attractive restaurant in the heart of Alb...,"['Amex', 'Dinersclub', 'Maestrocard', 'Masterc...","['Air conditioning', 'Terrace', 'Wheelchair ac...",+39 0173 363453,https://www.apewinebar.it/alba/
3,4,Charleston,via Generale Magliocco 19,Palermo,90141,Italy,€€€€,"Modern Cuisine, Creative","Before it became famous in Mondello, the renow...","['Amex', 'Mastercard', 'Visa']","['Air conditioning', 'Counter dining', 'Terrac...",+39 091 450171,https://casacharleston.net/
4,5,Da Bob Cook Fish,largo Parsano vecchio 16,Sorrento,80067,Italy,€€,Seafood,Working in partnership with the nearby fishmon...,"['Amex', 'Dinersclub', 'Mastercard', 'Visa']","['Air conditioning', 'Terrace']",+39 081 1778 3873,https://www.dabobcookfish.com/


In [2]:
print(data.info())



<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1983 entries, 0 to 1982
Data columns (total 13 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   index               1983 non-null   int64 
 1   restaurantName      1983 non-null   object
 2   address             1983 non-null   object
 3   city                1983 non-null   object
 4   postalCode          1983 non-null   int64 
 5   country             1983 non-null   object
 6   priceRange          1983 non-null   object
 7   cuisineType         1983 non-null   object
 8   description         1983 non-null   object
 9   creditCards         1983 non-null   object
 10  facilitiesServices  1983 non-null   object
 11  phoneNumber         1983 non-null   object
 12  website             1983 non-null   object
dtypes: int64(2), object(11)
memory usage: 201.5+ KB
None


#**[2.0]**

## 2. Search Engine

This project involves creating a search engine to retrieve restaurants based on a user query. We will build two types of search engines:

- **Conjunctive Search Engine**: Returns restaurants where all query terms appear in the description.
- **Ranked Search Engine**: Returns the top-k restaurants sorted by similarity to the query using TF-IDF and Cosine Similarity.

---

### 2.0 Preprocessing

Before building the search engine, it is crucial to clean and prepare the text in each restaurant’s description. The preprocessing steps include:

- Removing stopwords
- Removing punctuation
- Applying stemming
- Performing any other necessary cleaning to improve search accuracy

We will utilize the `nltk` library to handle the preprocessing tasks.



In [3]:
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
import string

# Download stopwords
nltk.download('stopwords')

def preprocess_text(text):
    stop_words = set(stopwords.words("english"))
    stemmer = PorterStemmer()

    # Convert text to lowercase and remove punctuation including quotes and numbers
    text = text.lower().translate(str.maketrans("", "", '‘[]’“”-1234567890' + string.punctuation))

    # Tokenize, remove stopwords and stem (and remove single-letter words)
    tokens = [stemmer.stem(word) for word in text.split() if word not in stop_words and len(word) > 1]

    return " ".join(tokens)

# Apply preprocessing to descriptions
data["processed_description"] = data["description"].apply(preprocess_text)




[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


#**[2.1]**

### 2.1 Conjunctive Query Search Engine

This version of the search engine narrows the search to the description field of each restaurant. Only restaurants whose descriptions contain **all** of the query words will be returned.
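
As a minimal sketch of the idea, a conjunctive query can be answered by intersecting the posting lists of its terms. The toy corpus and the names `toy_docs`, `toy_vocabulary`, `toy_index` and `conjunctive_search` are hypothetical and are not part of the notebook; the texts are assumed to be already preprocessed.

```python
from collections import defaultdict

# Hypothetical toy corpus: document ID -> already-preprocessed description.
toy_docs = {
    1: "modern cuisin garden terrac",
    2: "season cuisin seafood",
    3: "modern season cuisin garden",
}

# Build a vocabulary (word -> term_id) and an inverted index (term_id -> doc IDs).
toy_vocabulary = {}
toy_index = defaultdict(list)
for doc_id, text in toy_docs.items():
    for word in sorted(set(text.split())):
        term_id = toy_vocabulary.setdefault(word, len(toy_vocabulary))
        toy_index[term_id].append(doc_id)


def conjunctive_search(query_terms, vocabulary, inverted_index):
    """Return the sorted IDs of documents that contain every query term."""
    postings = []
    for term in query_terms:
        if term not in vocabulary:
            return []  # an unknown term cannot match any document
        postings.append(set(inverted_index[vocabulary[term]]))
    if not postings:
        return []
    return sorted(set.intersection(*postings))


matches = conjunctive_search(["modern", "garden"], toy_vocabulary, toy_index)
print(matches)
```

Intersecting posting lists touches only the documents that contain at least one query term, which is the usual reason for building an inverted index instead of scanning every description.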


#### 2.1.1 Create Your Index!

You need to create two important components:

- **Vocabulary File**: This file (called `vocabulary.csv`) should map each word in the descriptions to a unique integer (`term_id`). Each term_id will uniquely identify a word across all documents (restaurant descriptions).

- **Inverted Index**: This is a dictionary that maps each `term_id` to a list of document IDs (restaurant IDs) where that term appears.

Example structure of the inverted index:

```python
{
  "term_id_1": [document_1, document_2, document_4],
  "term_id_2": [document_1, document_3, document_5],
  ...
}
```

In [4]:
from collections import defaultdict
# Vocabulary (word -> term_id) and inverted index (term_id -> restaurant IDs)
vocabulary = {}
inverted_index = defaultdict(list)

term_id = 0

for idx, row in data.iterrows():
    description = row["processed_description"]
    restaurant_id = row["index"]

    for word in set(description.split()):  # Use set to avoid duplicates in one description
        if word not in vocabulary:
            vocabulary[word] = term_id
            term_id += 1

        term_id_for_word = vocabulary[word]
        inverted_index[term_id_for_word].append(restaurant_id)


In [5]:
import json

# Save vocabulary as vocabulary.csv
vocab_df = pd.DataFrame(list(vocabulary.items()), columns=["word", "term_id"])
vocab_df.to_csv("/content/vocabulary.csv", index=False)

# Save inverted index as inverted_index.json
with open("/content/inverted_index.json", "w") as f:
    json.dump(inverted_index, f)
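
One detail to keep in mind when reading these files back: JSON object keys are always strings, so integer `term_id` keys come back as `"0"`, `"1"`, and so on. The sketch below shows one way to restore the original types, using hypothetical toy data and a temporary directory instead of the `/content` paths.

```python
import csv
import json
import os
import tempfile

# Hypothetical toy data standing in for the real vocabulary and index.
toy_vocabulary = {"modern": 0, "garden": 1}
toy_index = {0: [1, 3], 1: [3]}

tmp_dir = tempfile.mkdtemp()
vocab_path = os.path.join(tmp_dir, "vocabulary.csv")
index_path = os.path.join(tmp_dir, "inverted_index.json")

# Write both files in the same layout as the cell above.
with open(vocab_path, "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["word", "term_id"])
    writer.writerows(toy_vocabulary.items())
with open(index_path, "w") as f:
    json.dump(toy_index, f)

# Reload: CSV values and JSON object keys both come back as strings,
# so they are converted to int explicitly.
with open(vocab_path, newline="") as f:
    loaded_vocabulary = {row["word"]: int(row["term_id"]) for row in csv.DictReader(f)}
with open(index_path) as f:
    loaded_index = {int(term_id): doc_ids for term_id, doc_ids in json.load(f).items()}
```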


#### 2.1.2 Execute the Query

In [6]:
# Function to preprocess and execute the query
def execute_query(query, data):
    # Preprocess the query terms using the preprocess_text function
    processed_query = preprocess_text(query)
    query_terms = set(processed_query.split())

    # Keep only the restaurants whose processed description contains every query term
    matched_restaurants = data[data["processed_description"].apply(lambda desc: query_terms.issubset(desc.split()))]

    # Return the matching rows with all of their columns
    return matched_restaurants

# Example query (hard-coded here rather than read from user input)
query1 = 'modern seasonal cuisine garden'

result1 = execute_query(query1, data)

result1

Unnamed: 0,index,restaurantName,address,city,postalCode,country,priceRange,cuisineType,description,creditCards,facilitiesServices,phoneNumber,website,processed_description
323,324,Contrasto,via Roma 55,Cercemaggiore,86012,Italy,€€,"Modern Cuisine, Creative","Having returned to his native village, owner-c...","['Mastercard', 'Visa']","['Air conditioning', 'Terrace']",+39 0874 799230,https://contrastoristorante.it,return nativ villag ownerchef lucio testa open...
499,500,Winter Garden Florence,piazza Ognissanti 1,Florence,50123,Italy,€€€€,Mediterranean Cuisine,Horse-drawn carriages once entered the old cou...,"['Amex', 'Dinersclub', 'Mastercard', 'Visa']","['Air conditioning', 'Wheelchair access']",+39 055 2716 3770,https://www.wintergardenflorence.com/it/,horsedrawn carriag enter old courtyard st regi...
627,628,Esplanade,via Lario 3,Desenzano del Garda,25015,Italy,€€€€,"Italian, Modern Cuisine","One of Italy’s long-established restaurants, t...","['Amex', 'Maestrocard', 'Mastercard', 'Visa']","['Air conditioning', 'Car park', 'Garden or pa...",+39 030 914 3361,https://www.ristorante-esplanade.com/,one itali longestablish restaur esplanad proud...
891,892,La Bandiera,contrada Pastini 4,Civitella Casanova,65010,Italy,€€€,"Cuisine from Abruzzo, Contemporary",Although it takes a while to reach this restau...,"['Amex', 'Mastercard', 'Visa']","['Air conditioning', 'Car park', 'Great view',...",+39 085 845219,https://www.labandiera.it/,although take reach restaur first open definit...
1540,1541,[àbitat],via Henry Dunant 1,San Fermo della Battaglia,22020,Italy,€€€,Innovative,"A young, enthusiastic and professional couple ...","['Maestrocard', 'Mastercard', 'Visa']","['Air conditioning', 'Terrace', 'Wheelchair ac...",+39 349 068 3973,https://www.abitatproject.it,young enthusiast profession coupl taken rein m...
1898,1899,Saporium,località Palazzetto 110,Chiusdino,53012,Italy,€€€€,"Tuscan, Italian Contemporary",Saporium is the new fine-dining restaurant at ...,"['Amex', 'Mastercard', 'Visa']","['Air conditioning', 'Car park', 'Garden or pa...",+39 0577 751222,http://www.saporium.com/it/borgo-santo.pietro/,saporium new finedin restaur superb relai borg...
1909,1910,Il Tino,via Monte Cadria 127,Fiumicino,54,Italy,€€€,Creative,Enjoying an attractive location in the Nautilu...,"['Amex', 'Dinersclub', 'Mastercard', 'Visa']","['Air conditioning', 'Car park', 'Interesting ...",+39 06 562 2778,https://www.ristoranteiltino.com/,enjoy attract locat nautilu marina overlook ti...
1952,1953,Mima,via Madonnelle 9,Vico Equense,80069,Italy,€€,"Seasonal Cuisine, Mediterranean Cuisine",You’ll be won over by the seasonal Mediterrane...,"['Amex', 'Mastercard', 'Visa']","['Air conditioning', 'Great view', 'Terrace']",+39 081 1904 1517,http://www.domo20.com/restaurant,youll season mediterranean cuisin creat young ...


#**[2.2]**

### 2.2 Ranked Search Engine with TF-IDF and Cosine Similarity

In this section, we will build the **ranked search engine**. Given a query, the search engine retrieves the top-k restaurants ranked by their relevance to the query. Relevance is measured by representing the query and each description as **TF-IDF** (Term Frequency-Inverse Document Frequency) vectors and computing the **Cosine Similarity** between them.
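
To make the similarity measure concrete, here is a minimal pure-Python sketch of cosine similarity between two sparse vectors stored as `{term: weight}` dictionaries. The function name `cosine_similarity_sparse` and the toy vectors are hypothetical; the notebook itself relies on scikit-learn for this step.

```python
import math


def cosine_similarity_sparse(vec_a, vec_b):
    """Cosine similarity between two sparse vectors stored as {term: weight}."""
    dot = sum(weight * vec_b.get(term, 0.0) for term, weight in vec_a.items())
    norm_a = math.sqrt(sum(w * w for w in vec_a.values()))
    norm_b = math.sqrt(sum(w * w for w in vec_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty vector is treated as similar to nothing
    return dot / (norm_a * norm_b)


# Hypothetical toy vectors.
query_vec = {"modern": 1.0, "garden": 1.0}
doc_same = {"modern": 2.0, "garden": 2.0}      # same direction as the query
doc_partial = {"modern": 1.0, "seafood": 1.0}  # shares one term with the query
doc_unrelated = {"seafood": 3.0}               # shares no term with the query
```

Because the score depends only on the angle between the vectors, `doc_same` is a perfect match even though its weights are twice those of the query, while a document with no shared terms scores zero.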


#### 2.2.1 Inverted Index with TF-IDF Scores

To implement the ranked search engine, we will first calculate the **TF-IDF** scores for each term in each restaurant’s description. This will help us measure the importance of each term in the context of the entire corpus of restaurant descriptions.

1. **TF-IDF Scores**: For each restaurant's description, compute the **TF-IDF** score for every word. The formula is:

   \[
   \text{TF-IDF}(term, document) = \text{TF}(term, document) \times \text{IDF}(term)
   \]
   
   where:
   - **TF(term, document)** is the term frequency of the word in the document.
   - **IDF(term)** is the inverse document frequency of the word across all documents.

2. **Updated Inverted Index**: After calculating the TF-IDF scores, we need to update the inverted index. The inverted index will now map each term (identified by `term_id`) to a list of tuples, where each tuple contains a document ID (restaurant ID) and the corresponding TF-IDF score for that term in the document.

   Example format of the updated inverted index:
   
   ```python
   {
     "term_id_1": [(document_1, tfIdf_1), (document_2, tfIdf_2), (document_4, tfIdf_4)],
     "term_id_2": [(document_1, tfIdf_1), (document_3, tfIdf_3), (document_5, tfIdf_5)],
     ...
   }
   ```

In [7]:
import pandas as pd
import numpy as np
from collections import defaultdict
from sklearn.feature_extraction.text import TfidfVectorizer

# Build an inverted index that stores a TF-IDF score with every posting
def build_inverted_index(data):
    # Vectorize the raw descriptions; TfidfVectorizer lowercases and tokenizes them itself
    vectorizer = TfidfVectorizer()
    tfidf_matrix = vectorizer.fit_transform(data['description'])
    terms = vectorizer.get_feature_names_out()

    # Map each term to a list of (doc_id, tfidf_score) pairs.
    # doc_id is the row position in `data`; the keys are the term strings themselves.
    inverted_index = defaultdict(list)
    for doc_id, row in enumerate(tfidf_matrix):
        for term_id, tfidf_score in zip(row.indices, row.data):
            term = terms[term_id]
            inverted_index[term].append((doc_id, tfidf_score))

    return inverted_index, vectorizer

# Generate the inverted index (this replaces the unweighted index from section 2.1)
inverted_index, vectorizer = build_inverted_index(data)
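
For reference, the sketch below computes the same kind of index by hand, keyed by `term_id` as in the example format above. It assumes scikit-learn's documented defaults for `TfidfVectorizer` (raw term counts, smoothed IDF `ln((1 + n) / (1 + df)) + 1`, and L2 normalization of each document vector). The toy corpus and the names `toy_docs`, `toy_vocab` and `tfidf_index` are hypothetical.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy corpus: document ID -> already-preprocessed description.
toy_docs = {
    1: "modern cuisin garden",
    2: "season cuisin seafood seafood",
    3: "modern season cuisin",
}

n_docs = len(toy_docs)
term_counts = {doc_id: Counter(text.split()) for doc_id, text in toy_docs.items()}

# Document frequency: number of documents that contain each term.
doc_freq = Counter(term for counts in term_counts.values() for term in counts)

# Smoothed IDF (assumed to mirror TfidfVectorizer's default smooth_idf=True).
idf = {term: math.log((1 + n_docs) / (1 + df)) + 1 for term, df in doc_freq.items()}

# Assign term_ids in alphabetical order (an arbitrary but reproducible choice).
toy_vocab = {term: term_id for term_id, term in enumerate(sorted(doc_freq))}

# term_id -> list of (doc_id, tfidf) pairs, with each document vector L2-normalized.
tfidf_index = defaultdict(list)
for doc_id, counts in term_counts.items():
    weights = {term: tf * idf[term] for term, tf in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    for term, weight in weights.items():
        tfidf_index[toy_vocab[term]].append((doc_id, weight / norm))
```

A term such as `cuisin`, which occurs in every toy document, gets the minimum IDF of 1, so rarer terms like `seafood` dominate the vectors of the documents that contain them.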


#### 2.2.2 Execute the Query

In [8]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import pandas as pd
import numpy as np



# TF-IDF vectors for each restaurant description
def preprocess_data(data):
    # Vectorize descriptions
    vectorizer = TfidfVectorizer()
    tfidf_matrix = vectorizer.fit_transform(data['description'])
    return tfidf_matrix, vectorizer

# Ranked query using TF-IDF and Cosine Similarity
def execute_ranked_query(query, data, tfidf_matrix, vectorizer, top_k=5):
    #Transform the query using the same vectorizer
    query_vector = vectorizer.transform([query])

    # Calculate cosine similarity between the query and all documents
    similarity_scores = cosine_similarity(query_vector, tfidf_matrix).flatten()

    # Rank documents by descending score and drop those with zero similarity
    ranked_indices = np.argsort(similarity_scores)[::-1]
    ranked_indices = [idx for idx in ranked_indices if similarity_scores[idx] > 0]

    # Keep the top-k results (slicing already handles fewer than k matches)
    top_k_indices = ranked_indices[:top_k]

    results = data.iloc[top_k_indices][["restaurantName", "address", "description", "website"]].copy()
    results['similarity_score'] = similarity_scores[top_k_indices]
    results = results.reset_index(drop=True)
    return results

tfidf_matrix, vectorizer = preprocess_data(data)


query2 = 'modern seasonal cuisine garden'
top_k = 5  # Number of results to return
result2 = execute_ranked_query(query2, data, tfidf_matrix, vectorizer, top_k=top_k)

print(result2)


  restaurantName                        address  \
0           Saur           via Filippo Turati 8   
1           Mima               via Madonnelle 9   
2          Razzo          via Andrea Doria 17/f   
3       La Botte       via Giuseppe Garibaldi 8   
4    Regio Patio  via San Francesco d'Assisi 23   

                                         description  \
0  In a tiny rural village, this contemporary, al...   
1  You’ll be won over by the seasonal Mediterrane...   
2  A quiet restaurant with a relaxed, young and m...   
3  A modern and welcoming contemporary bistro sit...   
4  Situated just a stone’s throw from the lakefro...   

                                website  similarity_score  
0             https://ristorantesaur.it          0.250014  
1      http://www.domo20.com/restaurant          0.223475  
2                https://vadoarazzo.it/          0.212870  
3  http://www.trattorialabottestresa.it          0.198207  
4              http://www.regiopatio.it          0.19617

#**[3]**

#**Define a New Score!**
Now, we will define a custom ranking metric to prioritize restaurants based on user queries.

**Steps:**

- **User Query**: The user provides a text query. We’ll retrieve relevant documents using the search engine built in Step 2.1.
- **New Ranking Metric**: After retrieving relevant documents, we’ll rank them using a new custom score. Instead of limiting the scoring to only the description field, we can include other attributes like `priceRange`, `facilitiesServices`, and `cuisineType`.
- **Top-k Selection**: You will use a heap data structure (e.g., Python’s `heapq` library) to maintain the top-k restaurants.

**New Scoring Function:**

Define a scoring function that takes into account various attributes:

- **Description Match**: Give weight based on the query similarity to the description (using TF-IDF scores).
- **Cuisine Match**: Increase the score for matching cuisine types.
- **Facilities and Services**: Give more points for matching facilities/services (e.g., “Terrace,” “Air conditioning”).
- **Price Range**: Higher scores could be given to more affordable options based on the user’s choice.

**Output:**

The output should include:

- `restaurantName`
- `address`
- `description`
- `website`
- The new similarity score based on the custom metric.

Are the results you obtain better than with the previous scoring function? Explain and compare results.
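
As a minimal sketch of the heap requirement, the function below keeps a min-heap of at most `k` `(score, item_id)` pairs while streaming over the scored items, so the weakest kept candidate is always at the root. The name `top_k_by_score` and the toy scores are hypothetical; `heapq.nlargest` gives the same result in a single call.

```python
import heapq


def top_k_by_score(scored_items, k):
    """Return the k highest (score, item_id) pairs, best first."""
    if k <= 0:
        return []
    heap = []  # min-heap holding at most k pairs; heap[0] is the weakest kept pair
    for score, item_id in scored_items:
        if len(heap) < k:
            heapq.heappush(heap, (score, item_id))
        elif (score, item_id) > heap[0]:
            # The new pair beats the weakest kept pair, so swap it in.
            heapq.heapreplace(heap, (score, item_id))
    return sorted(heap, reverse=True)


# Hypothetical toy scores.
toy_scores = [(3.2, "A"), (1.5, "B"), (4.8, "C"), (2.9, "D"), (4.1, "E")]
best = top_k_by_score(toy_scores, k=3)
print(best)
```

Keeping only `k` entries in the heap means memory stays bounded by `k` regardless of how many restaurants are scored.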



In [9]:
import ast
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np
import heapq

# Custom scoring function
def custom_score(row, query_terms, vectorizer, description_tfidf, cuisine_weight=1.0, facilities_weight=1.0, price_weight=1.0):
    # Description similarity score using TF-IDF and Cosine Similarity.
    # row.name is used as the row position in description_tfidf, which holds for the default RangeIndex.
    query_vector = vectorizer.transform([" ".join(query_terms)])
    desc_vector = description_tfidf[row.name]
    description_score = cosine_similarity(query_vector, desc_vector).flatten()[0]

    # Cuisine match score: preprocess each cuisine name like the query before comparing,
    # because the query terms are lowercased and stemmed
    cuisine_score = sum(cuisine_weight for cuisine in row['cuisineType'].split(', ') if preprocess_text(cuisine) in query_terms)

    # Facilities match score: the column stores a stringified list, so parse it first
    # (iterating over the raw string would yield single characters)
    facilities = ast.literal_eval(row['facilitiesServices'])
    facilities_score = sum(facilities_weight for facility in facilities if preprocess_text(facility) in query_terms)

    # Price range score (based on € range, prioritizing lower prices)
    price_range = len(row['priceRange'])  # € -> low, €€€€ -> high
    price_score = max(5 - price_range, 0) * price_weight  # prioritize lower price if desired

    # Total score: the description similarity counts double
    total_score = (description_score * 2) + cuisine_score + facilities_score + price_score
    return total_score


In [10]:
def rank_conjunctive_results(query, subset_data, vectorizer, description_tfidf, top_k=10):
    # Preprocess the query terms
    processed_query = preprocess_text(query)
    query_terms = set(processed_query.split())

    # Score each restaurant in the conjunctive results
    scored_restaurants = []
    for idx, row in subset_data.iterrows():
        score = custom_score(row, query_terms, vectorizer, description_tfidf)
        scored_restaurants.append((score, idx))

    #Heap to get the top-k restaurants based on the custom score
    top_k_restaurants = heapq.nlargest(top_k, scored_restaurants, key=lambda x: x[0])
    top_k_indices = [idx for _, idx in top_k_restaurants]

    #Top-k results and sort by custom score
    results = subset_data.loc[top_k_indices].copy()
    results['custom_score'] = [score for score, _ in top_k_restaurants]
    results = results.sort_values(by='custom_score', ascending=False).reset_index(drop=True)

    return results

query3 = 'modern seasonal cuisine garden'

conjunctive_results = execute_query(query3, data)

sorted_results = rank_conjunctive_results(query3, conjunctive_results, vectorizer, tfidf_matrix, top_k=10)
print(sorted_results[["restaurantName", "address", "description","website","custom_score"]])


           restaurantName                  address  \
0                    Mima         via Madonnelle 9   
1               Contrasto              via Roma 55   
2                [àbitat]       via Henry Dunant 1   
3             La Bandiera       contrada Pastini 4   
4                 Il Tino     via Monte Cadria 127   
5  Winter Garden Florence      piazza Ognissanti 1   
6                Saporium  località Palazzetto 110   
7               Esplanade              via Lario 3   

                                         description  \
0  You’ll be won over by the seasonal Mediterrane...   
1  Having returned to his native village, owner-c...   
2  A young, enthusiastic and professional couple ...   
3  Although it takes a while to reach this restau...   
4  Enjoying an attractive location in the Nautilu...   
5  Horse-drawn carriages once entered the old cou...   
6  Saporium is the new fine-dining restaurant at ...   
7  One of Italy’s long-established restaurants, t...   

        