
In [None]:
import pandas as pd

# Load the file
file_path = '/content/output.csv'
data = pd.read_csv(file_path)

data.head()


Unnamed: 0,index,restaurantName,address,city,postalCode,country,priceRange,cuisineType,description,creditCards,facilitiesServices,phoneNumber,website
0,1,20Tre,via David Chiossone 20 r,Genoa,16123,Italy,€€,"Farm to table, Modern Cuisine",Situated in the heart of Genoa’s historic cent...,"['Amex', 'Dinersclub', 'Mastercard', 'Visa']",['Air conditioning'],+39 010 247 6191,https://www.ristorante20tregenova.it/
1,2,Alessandro Feo,via Angelo Lista 24,Marina di Casal Velino,84040,Italy,€€,"Campanian, Seafood",In a beautiful stone-vaulted building (an old ...,"['Amex', 'Dinersclub', 'Mastercard', 'Visa']",[],+39 328 893 7083,https://www.alessandrofeoristorante.it/
2,3,Ape Vino e Cucina,Piazza Risorgimento 3,Alba,12051,Italy,€€,"Piedmontese, Contemporary",This attractive restaurant in the heart of Alb...,"['Amex', 'Dinersclub', 'Maestrocard', 'Masterc...","['Air conditioning', 'Terrace', 'Wheelchair ac...",+39 0173 363453,https://www.apewinebar.it/alba/
3,4,Charleston,via Generale Magliocco 19,Palermo,90141,Italy,€€€€,"Modern Cuisine, Creative","Before it became famous in Mondello, the renow...","['Amex', 'Mastercard', 'Visa']","['Air conditioning', 'Counter dining', 'Terrac...",+39 091 450171,https://casacharleston.net/
4,5,Da Bob Cook Fish,largo Parsano vecchio 16,Sorrento,80067,Italy,€€,Seafood,Working in partnership with the nearby fishmon...,"['Amex', 'Dinersclub', 'Mastercard', 'Visa']","['Air conditioning', 'Terrace']",+39 081 1778 3873,https://www.dabobcookfish.com/


In [None]:
print(data.info())



<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1983 entries, 0 to 1982
Data columns (total 13 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   index               1983 non-null   int64 
 1   restaurantName      1983 non-null   object
 2   address             1983 non-null   object
 3   city                1983 non-null   object
 4   postalCode          1983 non-null   int64 
 5   country             1983 non-null   object
 6   priceRange          1983 non-null   object
 7   cuisineType         1983 non-null   object
 8   description         1983 non-null   object
 9   creditCards         1983 non-null   object
 10  facilitiesServices  1983 non-null   object
 11  phoneNumber         1983 non-null   object
 12  website             1983 non-null   object
dtypes: int64(2), object(11)
memory usage: 201.5+ KB
None


# **[2.0] Search Engine**



#### **Objective**:
The goal of this step is to build an effective search engine to retrieve restaurants based on user queries. We implemented two types of search engines to address different needs: a **Conjunctive Search Engine** and a **Ranked Search Engine**.

#### **Steps**:

2.0 **Preprocessing**

2.1 **Conjunctive Query Search Engine**

2.2 **Ranked Search Engine with TF-IDF and Cosine Similarity**

#### **Output**:
- For both search engines, the results include:
  - `restaurantName`
  - `address`
  - `description`
  - `cuisineType`
  - `priceRange`
  - `website`

### 2.0 Preprocessing
   - We start by cleaning and preparing the text data from restaurant descriptions to improve the search engine's accuracy.
   - The preprocessing steps include:
     - **Lowercasing**: Text is converted to lowercase so that matching is case-insensitive.
     - **Removing Stopwords**: Common words (e.g., "and," "the") that do not contribute to the search context are removed using the `nltk` library.
     - **Removing Punctuation and Digits**: Punctuation marks, quotes, and digits are stripped so that only words remain.
     - **Stemming**: Words are reduced to their stem (e.g., "cooking" to "cook") using the Porter Stemmer from `nltk`, allowing better matching of terms.
     - **Dropping Single-Character Tokens**: Tokens of length one are discarded.
   - The cleaned and preprocessed text is stored in a new column called `processed_description`.



In [None]:
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
import string

# Download stopwords
nltk.download('stopwords')

def preprocess_text(text):
    stop_words = set(stopwords.words("english"))
    stemmer = PorterStemmer()

    # Convert text to lowercase and remove punctuation including quotes and numbers
    text = text.lower().translate(str.maketrans("", "", '‘[]’“”-1234567890' + string.punctuation))

    # Tokenize, remove stopwords and stem (and remove single-letter words)
    tokens = [stemmer.stem(word) for word in text.split() if word not in stop_words and len(word) > 1]

    return " ".join(tokens)

# Apply preprocessing to descriptions
data["processed_description"] = data["description"].apply(preprocess_text)




[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


### 2.1 Conjunctive Query Search Engine
   - This search engine narrows down the results by returning only those restaurants whose **descriptions contain all query terms**.
   - We implemented an **Inverted Index** for efficient querying:
     - **Vocabulary File**: It maps each unique word to a unique `term_id`.
     - **Inverted Index**: A dictionary where each `term_id` points to a list of document IDs (restaurant entries) that contain the word.
   - **Query Execution**:
     - The user query is preprocessed similarly to the restaurant descriptions.
     - The search engine retrieves documents (restaurants) that contain **all terms** from the user query.
   - This method ensures precise matching, making it suitable for specific, multi-keyword searches.
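
The lookup described above can be sketched with set intersections over the posting lists. This is a minimal illustration on made-up, already-preprocessed descriptions, not the notebook's own query function; `toy_docs`, `toy_vocabulary`, `toy_index`, and `conjunctive_lookup` are hypothetical names.

```python
from collections import defaultdict

# Toy, already-preprocessed descriptions keyed by a hypothetical restaurant id.
toy_docs = {
    1: "modern season cuisin terrac",
    2: "season seafood cuisin",
    3: "modern cuisin season menu",
}

# Build the vocabulary (word -> term_id) and the inverted index (term_id -> doc ids).
toy_vocabulary = {}
toy_index = defaultdict(list)
for doc_id, text in toy_docs.items():
    for word in set(text.split()):
        term_id = toy_vocabulary.setdefault(word, len(toy_vocabulary))
        toy_index[term_id].append(doc_id)


def conjunctive_lookup(query_terms, vocabulary, inverted_index):
    """Return the sorted ids of documents that contain every query term."""
    postings = []
    for term in query_terms:
        if term not in vocabulary:
            return []  # an unknown term means no document can match all terms
        postings.append(set(inverted_index[vocabulary[term]]))
    if not postings:
        return []
    # Start from the shortest posting list so intermediate sets stay small.
    postings.sort(key=len)
    return sorted(set.intersection(*postings))


matches = conjunctive_lookup(["modern", "season", "cuisin"], toy_vocabulary, toy_index)
print(matches)  # [1, 3]
```

Because only the posting lists of the query terms are touched, the cost depends on how many documents contain those terms rather than on the size of the whole dataset.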


In [None]:
from collections import defaultdict
#vocabulary and inverted index
vocabulary = {}
inverted_index = defaultdict(list)

term_id = 0

for idx, row in data.iterrows():
    description = row["processed_description"]
    restaurant_id = row["index"]

    for word in set(description.split()):  # Use set to avoid duplicates in one description
        if word not in vocabulary:
            vocabulary[word] = term_id
            term_id += 1

        term_id_for_word = vocabulary[word]
        inverted_index[term_id_for_word].append(restaurant_id)


In [None]:
import json

# Save vocabulary as vocabulary.csv
vocab_df = pd.DataFrame(list(vocabulary.items()), columns=["word", "term_id"])
vocab_df.to_csv("/content/vocabulary.csv", index=False)

# Save inverted index as inverted_index.json
with open("/content/inverted_index.json", "w") as f:
    json.dump(inverted_index, f)
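
One detail worth keeping in mind when reloading the saved index: JSON object keys are always strings, so integer `term_id` keys do not survive a round trip unchanged. The sketch below shows this on a small made-up index (`toy_index` is a hypothetical stand-in for the real one) and converts the keys back.

```python
import json

# Toy inverted index with integer term ids, shaped like the one saved above.
toy_index = {0: [1, 3], 1: [2]}

# After a dump/load round trip the keys come back as the strings "0" and "1".
restored_raw = json.loads(json.dumps(toy_index))

# Convert the keys back to int so lookups by term_id keep working.
restored = {int(term_id): doc_ids for term_id, doc_ids in restored_raw.items()}

print(list(restored_raw.keys()))  # ['0', '1']
print(restored == toy_index)      # True
```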


### 2.1.2 Execute the Query

In [None]:
# Preprocess the query and return the restaurants whose description contains every query term
def execute_query(query, data):
    # Preprocess the query with the same function used for the descriptions
    processed_query = preprocess_text(query)
    query_terms = set(processed_query.split())

    # Keep only the rows whose processed description contains all query terms
    matched_restaurants = data[data["processed_description"].apply(lambda desc: query_terms.issubset(desc.split()))]

    # Return the matching rows with all of their columns
    return matched_restaurants

# Example query (hard-coded rather than read from user input)
query1 = 'modern seasonal cuisine'

result1 = execute_query(query1, data)

result1

Unnamed: 0,index,restaurantName,address,city,postalCode,country,priceRange,cuisineType,description,creditCards,facilitiesServices,phoneNumber,website,processed_description
24,25,Casin del Gamba,via Roccolo Pizzati 1,Altissimo,36070,Italy,€€€€,"Country cooking, Modern Cuisine",The journey to get here – a winding road throu...,"['Amex', 'Dinersclub', 'Mastercard', 'Visa']","['Car park', 'Interesting wine list', 'Terrace...",+39 0444 687709,https://www.casindelgamba.it/,journey get wind road wood hill may challeng t...
98,99,San Giorgio,viale Brigate Bisagno 69r,Genoa,16129,Italy,€€€,"Modern Cuisine, Ligurian",Situated in the city albeit not right in the c...,"['Amex', 'Mastercard', 'Visa']","['Air conditioning', 'Interesting wine list', ...",+39 010 595 5205,https://www.ristorantesangiorgiogenova.it/,situat citi albeit right centr san giorgio typ...
186,187,Il Luogo Aimo e Nadia,via Montecuccoli 6,Milan,20147,Italy,€€€€,"Italian Contemporary, Modern Cuisine",This long-established restaurant has been part...,"['Amex', 'Dinersclub', 'Maestrocard', 'Masterc...","['Air conditioning', 'Counter dining', 'Intere...",+39 02 416886,https://www.aimoenadia.com/il-luogo-aimo-e-nadia,longestablish restaur part milanes culinari sc...
211,212,Vesta Mare,viale Roma 41,Marina di Pietrasanta,55045,Italy,€€€,"Seafood, Classic Cuisine","This typical, elegant Versilian beach club wit...","['Amex', 'Mastercard', 'Visa']","['Air conditioning', 'Car park', 'Interesting ...",+39 0584 20187,https://vestafiorichiari.com/mare/,typic eleg versilian beach club openplan feel ...
302,303,Ca' Del Moro,località Erbin 31,Grezzana,37023,Italy,€€€,"Italian Contemporary, Mediterranean Cuisine",Situated within the La Collina dei Ciliegi win...,"['Amex', 'Mastercard', 'Visa']","['Air conditioning', 'Car park', 'Garden or pa...",+39 045 981 4900,https://www.cadelmoro.wine/it,situat within la collina dei ciliegi wine esta...
323,324,Contrasto,via Roma 55,Cercemaggiore,86012,Italy,€€,"Modern Cuisine, Creative","Having returned to his native village, owner-c...","['Mastercard', 'Visa']","['Air conditioning', 'Terrace']",+39 0874 799230,https://contrastoristorante.it,return nativ villag ownerchef lucio testa open...
336,337,Saur,via Filippo Turati 8,Barco,25034,Italy,€€,Italian Contemporary,"In a tiny rural village, this contemporary, al...","['Mastercard', 'Visa']","['Air conditioning', 'Terrace', 'Wheelchair ac...",+39 030 941149,https://ristorantesaur.it,tini rural villag contemporari almost minimali...
416,417,San Michele,via Castello di Fagagna 33,Fagagna,33034,Italy,€€,Modern Cuisine,Situated next to the ruins of the old castle a...,"['Amex', 'Dinersclub', 'Mastercard', 'Visa']","['Car park', 'Garden or park', 'Terrace', 'Whe...",+39 0432 810466,http://sanmichele.restaurant,situat next ruin old castl small church san mi...
446,447,Chichibio,via Guglielmo Marconi 1,Roccaraso,67037,Italy,€€,"Modern Cuisine, Cuisine from Abruzzo","Despite its lack of awards, this restaurant st...","['Amex', 'Dinersclub', 'Mastercard', 'Visa']",['Air conditioning'],https://www.theforkmanager.com/partnership-the...,tel:+39 328 905 4831,despit lack award restaur stand qualiti cuisin...
499,500,Winter Garden Florence,piazza Ognissanti 1,Florence,50123,Italy,€€€€,Mediterranean Cuisine,Horse-drawn carriages once entered the old cou...,"['Amex', 'Dinersclub', 'Mastercard', 'Visa']","['Air conditioning', 'Wheelchair access']",+39 055 2716 3770,https://www.wintergardenflorence.com/it/,horsedrawn carriag enter old courtyard st regi...


## 2.2 Ranked Search Engine with TF-IDF and Cosine Similarity

This search engine ranks the top-k restaurants based on their **similarity scores** with the user query using **TF-IDF** and **Cosine Similarity**.
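
As a rough illustration of this ranking idea, the sketch below computes TF-IDF weights and cosine similarity by hand on three made-up descriptions. It assumes a textbook weighting (relative term frequency times `log(N / df)`), which is not the smoothed, normalized variant that scikit-learn's `TfidfVectorizer` applies, so its scores are not comparable with the notebook's. The names `toy_docs`, `tfidf_vector`, `cosine`, `tfidf_index`, and `rank` are hypothetical.

```python
import math
from collections import Counter, defaultdict

toy_docs = [
    "modern seasonal cuisine with a modern twist",
    "traditional seafood cuisine",
    "seasonal menu in a modern setting",
]
tokenized = [doc.split() for doc in toy_docs]
n_docs = len(tokenized)

# Document frequency: the number of documents in which each term appears.
df = Counter(term for tokens in tokenized for term in set(tokens))
idf = {term: math.log(n_docs / count) for term, count in df.items()}


def tfidf_vector(tokens):
    """Map each known term to TF * IDF, with TF as a relative frequency."""
    counts = Counter(tokens)
    return {t: (c / len(tokens)) * idf[t] for t, c in counts.items() if t in idf}


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


doc_vectors = [tfidf_vector(tokens) for tokens in tokenized]

# Inverted index mapping each term to a list of (doc_id, tfidf_score) pairs.
tfidf_index = defaultdict(list)
for doc_id, vec in enumerate(doc_vectors):
    for term, weight in vec.items():
        tfidf_index[term].append((doc_id, weight))


def rank(query, top_k=2):
    """Return up to top_k (score, doc_id) pairs with non-zero similarity."""
    q = tfidf_vector(query.split())
    scores = [(cosine(q, vec), doc_id) for doc_id, vec in enumerate(doc_vectors)]
    return sorted((s for s in scores if s[0] > 0), reverse=True)[:top_k]


ranking = rank("modern seasonal cuisine")
print([doc_id for _, doc_id in ranking])  # [0, 2]
```

Document 0 ranks first because it shares all three query terms, while document 2 shares two of them and document 1 only one.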

### **Process**:

1. **TF-IDF Calculation**:
   - **TF-IDF** measures term importance within a restaurant description and across the dataset.
   - The formula is:

   \[
   \text{TF-IDF}(\text{term}, \text{document}) = \text{TF}(\text{term}, \text{document}) \times \text{IDF}(\text{term})
   \]

   - **TF** is the frequency of the term in the document.
   - **IDF** reduces the weight of common terms across all descriptions.

2. **Updated Inverted Index**:
   - The inverted index is updated to map each term (`term_id`) to a list of tuples containing:
     - The restaurant's document ID.
     - The corresponding **TF-IDF score**.
   
   Example:
   ```python
   {
     "term_id_1": [(doc_1, tfIdf_1), (doc_2, tfIdf_2)],
     "term_id_2": [(doc_3, tfIdf_3), (doc_4, tfIdf_4)]
   }
   ```





### 2.2.1 Inverted Index with TF-IDF Scores

In [None]:
import pandas as pd
import numpy as np
from collections import defaultdict
from sklearn.feature_extraction.text import TfidfVectorizer

#TF-IDF scores
def build_inverted_index(data):
    # Vectorize descriptions using TF-IDF
    vectorizer = TfidfVectorizer()
    tfidf_matrix = vectorizer.fit_transform(data['description'])
    terms = vectorizer.get_feature_names_out()

    # inverted index with TF-IDF scores
    inverted_index = defaultdict(list)
    for doc_id, row in enumerate(tfidf_matrix):
        for term_id, tfidf_score in zip(row.indices, row.data):
            term = terms[term_id]
            inverted_index[term].append((doc_id, tfidf_score))

    return inverted_index, vectorizer

# Generate the inverted index
inverted_index, vectorizer = build_inverted_index(data)


### 2.2.2 Execute the Ranked Query

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import pandas as pd
import numpy as np



# TF-IDF vectors for each restaurant description
def preprocess_data(data):
    # Vectorize descriptions
    vectorizer = TfidfVectorizer()
    tfidf_matrix = vectorizer.fit_transform(data['description'])
    return tfidf_matrix, vectorizer

# Ranked query using TF-IDF and Cosine Similarity
def execute_ranked_query(query, data, tfidf_matrix, vectorizer, top_k=5):
    #Transform the query using the same vectorizer
    query_vector = vectorizer.transform([query])

    # Calculate cosine similarity between the query and all documents
    similarity_scores = cosine_similarity(query_vector, tfidf_matrix).flatten()

    # Get indices of documents with non-zero similarity, sorted by score
    ranked_indices = np.argsort(similarity_scores)[::-1]
    ranked_indices = [idx for idx in ranked_indices if similarity_scores[idx] > 0]

    #Top-k results
    top_k_indices = ranked_indices[:top_k] if len(ranked_indices) >= top_k else ranked_indices

    results = data.iloc[top_k_indices][["restaurantName", "address", "description", "website"]].copy()
    results['similarity_score'] = similarity_scores[top_k_indices]
    results = results.reset_index(drop=True)
    return results

tfidf_matrix, vectorizer = preprocess_data(data)


query2 = 'modern seasonal cuisine'
top_k = 5  # Number of results to return
result2 = execute_ranked_query(query2, data, tfidf_matrix, vectorizer, top_k=top_k)

print(result2)


  restaurantName                                   address  \
0           Saur                      via Filippo Turati 8   
1          Razzo                     via Andrea Doria 17/f   
2       La Botte                  via Giuseppe Garibaldi 8   
3   Piccolo Lord               corso San Maurizio 69 bis/g   
4       La Valle  via Umberto I 25, località Valle Sauglio   

                                         description  \
0  In a tiny rural village, this contemporary, al...   
1  A quiet restaurant with a relaxed, young and m...   
2  A modern and welcoming contemporary bistro sit...   
3  Professional service in a welcoming, modern re...   
4  A well - run restaurant in a quiet area just o...   

                                 website  similarity_score  
0              https://ristorantesaur.it          0.311022  
1                 https://vadoarazzo.it/          0.264815  
2   http://www.trattorialabottestresa.it          0.246574  
3  https://www.ristorantepiccololord.it/      


#### **Comparison and Evaluation**:
- The **Conjunctive Search Engine** offers precise, narrow results where all query terms must appear, making it highly accurate for specific searches.
- The **Ranked Search Engine** provides a broader range of results sorted by relevance, making it more flexible for general queries.

#### **Conclusion**:
This two-pronged search engine approach enhances user experience by catering to both specific and general queries. The conjunctive engine ensures accurate filtering, while the ranked engine offers a sorted list of the most relevant options, leveraging TF-IDF scoring to highlight the best matches.


# **[3] Define a New Score!**

To improve the relevance and diversity of search results, we will introduce a custom ranking metric that accounts for multiple restaurant attributes alongside the query's textual similarity.

**Approach:**

1. **User Query Input**:  
   Begin with the user-provided query text to retrieve relevant restaurants using the search engine developed in Step 2.1.

2. **Incorporate Multi-Attribute Scoring**:  
   Move beyond the basic description similarity by considering other attributes such as:
   - **Cuisine Type**: Prioritize matches with cuisine preferences.
   - **Facilities and Services**: Boost scores for restaurants offering sought-after amenities.
   - **Price Range**: Tailor scoring to favor budget-friendly or premium options, based on user preference.
   - **Description Match**: Retain weightage for TF-IDF-based textual relevance.

3. **Efficient Ranking with Heap**:  
   Leverage a heap data structure to dynamically maintain the top-k restaurants as they are scored.
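
A minimal sketch of the heap idea, assuming scores arrive one at a time as `(score, restaurant_id)` pairs: a min-heap of size k keeps the weakest of the current top-k at its root, so each new score is compared against that root only. The function `top_k_stream` and the toy scores and ids are hypothetical.

```python
import heapq


def top_k_stream(scored_items, k):
    """Keep only the k highest-scoring (score, item_id) pairs seen so far."""
    heap = []  # min-heap: heap[0] is the weakest of the current top-k
    for score, item_id in scored_items:
        if len(heap) < k:
            heapq.heappush(heap, (score, item_id))
        elif score > heap[0][0]:
            # The new score beats the weakest kept score, so swap it in.
            heapq.heapreplace(heap, (score, item_id))
    return sorted(heap, reverse=True)


toy_scores = [(0.31, 101), (0.12, 102), (0.57, 103), (0.44, 104), (0.05, 105)]
best = top_k_stream(toy_scores, k=3)
print(best)  # [(0.57, 103), (0.44, 104), (0.31, 101)]
```

When all scores are already available in a list, `heapq.nlargest(k, scored_items)` gives the same result in one call.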

**New Scoring Function**:

The scoring function will evaluate restaurants on multiple criteria and assign a composite score:
   - **Description Similarity**: A weighted score from the TF-IDF vector similarity.
   - **Cuisine Preference**: Add points for matching the cuisine type.
   - **Facilities Match**: Increment the score for each amenity in the user's query.
   - **Affordability Factor**: Add a bonus that shrinks as the price range grows, so `€` restaurants score highest and `€€€€` restaurants lowest.
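
The combination of these criteria can be sketched as a weighted sum. This is an illustration under stated assumptions, not the notebook's scoring cell: the function `composite_score`, its parameters, the default weights, and the example values are all hypothetical, and affordability is modeled as a bonus that drops by one for each extra `€` symbol.

```python
import ast


def composite_score(description_score, cuisine_type, facilities_raw, price_range,
                    wanted_cuisines, wanted_facilities,
                    weights=(2.0, 1.0, 1.0, 1.0)):
    """Combine text similarity, cuisine, facilities and price into one score."""
    w_desc, w_cuisine, w_facilities, w_price = weights

    # Cuisine types are stored as a comma-separated string.
    cuisines = {c.strip().lower() for c in cuisine_type.split(",")}
    # Facilities are assumed to be stored as the string form of a Python list.
    facilities = {f.lower() for f in ast.literal_eval(facilities_raw)}

    cuisine_score = len(cuisines & wanted_cuisines)
    facilities_score = len(facilities & wanted_facilities)
    # Assumption: fewer euro symbols means more affordable (1 symbol -> 4, 4 symbols -> 1).
    price_score = max(5 - len(price_range), 0)

    return (w_desc * description_score
            + w_cuisine * cuisine_score
            + w_facilities * facilities_score
            + w_price * price_score)


example = composite_score(
    description_score=0.3,
    cuisine_type="Modern Cuisine, Creative",
    facilities_raw="['Air conditioning', 'Terrace']",
    price_range="€€",
    wanted_cuisines={"modern cuisine"},
    wanted_facilities={"terrace"},
)
print(round(example, 2))  # 5.6
```

Here the result is 2 × 0.3 for the description, 1 for the cuisine match, 1 for the terrace, and 3 for the price range. Because the price bonus can reach 4 while cosine similarity never exceeds 1, the weights decide how much textual relevance still matters.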

**Implementation Steps**:
1. Preprocess and tokenize the query to identify keywords related to descriptions, facilities, and cuisine types.
2. Compute the description similarity using the TF-IDF vectors and cosine similarity from Step 2.2.
3. Evaluate cuisine and facilities matches by cross-referencing attributes.
4. Integrate all scores into a single composite score using predefined weights.
5. Use a heap to efficiently maintain the top-k results.

**Output**:

The final output will include:
   - **restaurantName**: The name of the restaurant.
   - **address**: Location details for user convenience.
   - **description**: A brief overview of the restaurant.
   - **website**: Direct link for further exploration.
   - **Custom Metric Score**: The computed score based on the new ranking function.


In [None]:
import ast
import heapq

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity


def matches_query(label, query_terms):
    # A label (cuisine or facility) matches when all of its stemmed tokens are in the stemmed query
    tokens = set(preprocess_text(label).split())
    return bool(tokens) and tokens <= query_terms


# Custom scoring function
def custom_score(row, query_terms, vectorizer, description_tfidf, cuisine_weight=1.0, facilities_weight=1.0, price_weight=1.0):
    # Description similarity score using TF-IDF and cosine similarity.
    # Note: query_terms are stemmed, while the vectorizer was fitted on the raw descriptions.
    query_vector = vectorizer.transform([" ".join(query_terms)])
    desc_vector = description_tfidf[row.name]
    description_score = cosine_similarity(query_vector, desc_vector).flatten()[0]

    # Cuisine match score
    cuisine_score = sum(cuisine_weight for cuisine in row['cuisineType'].split(', ') if matches_query(cuisine, query_terms))

    # Facilities match score; the CSV stores the list as a string such as "['Terrace']"
    facilities = row['facilitiesServices']
    if isinstance(facilities, str):
        facilities = ast.literal_eval(facilities)
    facilities_score = sum(facilities_weight for facility in facilities if matches_query(facility, query_terms))

    # Price range score: fewer € symbols give a higher score (€ -> 4, €€€€ -> 1)
    price_range = len(row['priceRange'])
    price_score = max(5 - price_range, 0) * price_weight

    # Total score: description similarity counts double
    total_score = (description_score * 2) + cuisine_score + facilities_score + price_score
    return total_score


In [None]:
def rank_conjunctive_results(query, subset_data, vectorizer, description_tfidf, top_k=10):
    # Preprocess the query terms
    processed_query = preprocess_text(query)
    query_terms = set(processed_query.split())

    # Score each restaurant in the conjunctive results
    scored_restaurants = []
    for idx, row in subset_data.iterrows():
        score = custom_score(row, query_terms, vectorizer, description_tfidf)
        scored_restaurants.append((score, idx))

    #Heap to get the top-k restaurants based on the custom score
    top_k_restaurants = heapq.nlargest(top_k, scored_restaurants, key=lambda x: x[0])
    top_k_indices = [idx for _, idx in top_k_restaurants]

    #Top-k results and sort by custom score
    results = subset_data.loc[top_k_indices].copy()
    results['custom_score'] = [score for score, _ in top_k_restaurants]
    results = results.sort_values(by='custom_score', ascending=False).reset_index(drop=True)

    return results

query3 = 'modern seasonal cuisine'

conjunctive_results = execute_query(query3, data)

sorted_results = rank_conjunctive_results(query3, conjunctive_results, vectorizer, tfidf_matrix, top_k=10)
print(sorted_results[["restaurantName", "address", "description","website","custom_score"]])


            restaurantName                         address  \
0  Osteria del Miglio 2.10                  via Patrioti 2   
1            Osteria Ophis       corso Serpente Aureo 54/b   
2          Osteria Taviani  piazza Vittorio Emanuele II 28   
3                     Saur            via Filippo Turati 8   
4                    Razzo           via Andrea Doria 17/f   
5             Piccolo Lord     corso San Maurizio 69 bis/g   
6                     Mima                via Madonnelle 9   
7          Locanda Solagna            piazza I  Novembre 2   
8                Chichibio         via Guglielmo Marconi 1   
9              San Michele      via Castello di Fagagna 33   

                                         description  \
0  Although the town may not be of major importan...   
1  Situated in the beautiful historic centre of O...   
2  This pleasant, warmly decorated restaurant is ...   
3  In a tiny rural village, this contemporary, al...   
4  A quiet restaurant with a relaxed,

## Analysis of Restaurant Order Changes

### Key Adjustments and Benefits:
- The custom scoring metric prioritizes restaurants with **matching cuisine types** and **desired facilities** from the user's query.
  - For example, a query like *"modern seasonal cuisine with a garden"* gives higher priority to restaurants such as **Winter Garden**, which offers modern cuisine and garden seating.
- **Middle-ranked restaurants** like **La Bandiera** and **Ape Vino** were ranked higher due to diverse services (e.g., terrace seating, seasonal offerings) and specific cuisines matching user preferences.
- Restaurants that lacked key features (e.g., garden seating) were **lower-ranked**, even if their descriptions were textually similar.

### Methodology and Results:
1. **Custom Metric Implementation**:
   - Textual relevance (TF-IDF cosine similarity) was combined with additional score components that consider:
     - Facilities such as a garden or terrace seating.
     - Cuisine type and affordability.
2. **Impact of Changes**:
   - Conjunctive filtering and custom scoring led to a more **personalized recommendation list**.
   - Top restaurants like **Osteria del Miglio 2.10** balanced textual relevance with user preferences.
3. **Example Results**:
   - Example query: *"modern seasonal cuisine"*
   - Top-ranked results included:
     - Osteria del Miglio 2.10
     - Osteria Ophis
     - Osteria Taviani

### Conclusion:
- The custom scoring function effectively **reshaped the ranking order** by integrating user preferences such as cuisine type, facilities, and affordability.
- This enhanced ranking quality by providing recommendations that align closely with user expectations.
