# Homework 3 - Find the perfect place to stay in Texas!

###### Alessandro Flaborea, Egon Ferri, Melis Kaymaz

The homework consists of analyzing the text of Airbnb property listings and building a search engine.

In [1]:
import pandas as pd
from nltk.tokenize import sent_tokenize, word_tokenize, PunktSentenceTokenizer
from nltk.corpus import stopwords
import string
from nltk.stem import PorterStemmer
from nltk import pos_tag, ne_chunk
import nltk
import math

## Step 2: Create documents

We want to create a `.tsv` file for each record of the dataset.
The first thing to do is read the file.

In [2]:
f = pd.read_csv(r'C:\Users\mccol\Desktop\Sapienza\ADM\HW3\Airbnb_Texas_Rentals.csv')

Now we can create `.tsv` files and store them in a directory.

In [3]:
for i in range(f.index.max()+1):
    # one tab-separated file per record, holding its first 10 columns
    with open(r'C:\Users\mccol\Desktop\Sapienza\ADM\HW3\doc\doc_' + str(i) + '.tsv', 'w', encoding="utf-8") as op:
        for j in range(10):
            op.write('%s\t' % f.iloc[i, j])


## Step 3: Search Engine

Now we want to create two different search engines that, given a query as input, return the houses that match it.

As a first common step, we preprocess the documents by:

1. Removing stopwords
2. Removing punctuation
3. Stemming

Then we want to build a file named `vocabulary.txt` that maps each word to an integer (`term_id`).

In [4]:
# FUNCTIONS
def preprocess(l):
    # stem each token; drop stems that are stopwords or punctuation
    final = []
    for i in l:
        stem = ps.stem(i)
        if not (stem in stopWords or stem in string.punctuation):
            final.append(stem)
    return final

def vocabularization(vocabulary, final, index):
    # assign the next free term_id to every word not seen before
    for word in final:
        if word not in vocabulary:
            vocabulary[word] = index
            index = index + 1
    return vocabulary, index


In [5]:
stopWords = set(stopwords.words('english'))
ps = PorterStemmer()
string.punctuation = string.punctuation + '–“”’'

vocabulary= {}
index = 0

for i in range(18259):
    
    op = open(r'C:\Users\mccol\Desktop\Sapienza\ADM\HW3\doc\doc_' + str(i) + '.tsv', 'r', encoding="utf-8")
    for line in op:
        ou = line.strip().split('\t')
        sentence = ou[5].replace('\\n', ' ').replace('/', ' ').replace('*', ' ').replace('\\r', ' ').replace('\\t', ' ') + ' ' + ou[8].replace('\\n', ' ').replace('/', ' ').replace('*', ' ').replace('\\r', ' ').replace('\\t', ' ')
    op.close()
        
    # preprocessing data: deleting stop words, punctuation, etc.
    final = preprocess(word_tokenize(sentence))
    
    # IF  word not in vocabulary -> add the word
    vocabulary, index = vocabularization(vocabulary, final, index)
            
op = open(r'C:\Users\mccol\Desktop\Sapienza\ADM\HW3\vocabulary.txt', 'w', encoding="utf-8")
op.write(str(vocabulary))
op.close()

### 3.1) Conjunctive query
At this moment, we narrow our interest to the `description` and `title` of each document. This means that the first search engine will evaluate queries with respect to these two fields only.

#### 3.1.1) Creating our index!

We want to create the Inverted Index. It will be a dictionary of this format:

```
{
term_id_1:[document_1, document_2, document_4],
term_id_2:[document_1, document_3, document_5, document_6],
...}
```

where _document\_i_ is the *id* of a document that contains the word.

We also want to store it in a separate file and load it in memory when needed.
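
As an illustration of the format above, here is a minimal sketch on a toy corpus. The three tiny documents, the `toy_vocabulary` mapping, and the helper name `build_inverted_index` are made up for the example, and the tokens are assumed to be already preprocessed.

```python
def build_inverted_index(docs, vocabulary):
    """Map each term_id to the list of document ids containing that term."""
    inverted = {}
    for doc_id, tokens in docs.items():
        # set(tokens) so that a document is listed only once per term
        for word in sorted(set(tokens)):
            inverted.setdefault(vocabulary[word], []).append(doc_id)
    return inverted


# hypothetical, already preprocessed documents
toy_docs = {
    "doc_0": ["queen", "bed", "netflix"],
    "doc_1": ["bed", "pool"],
    "doc_2": ["netflix", "queen", "queen"],
}
toy_vocabulary = {"queen": 0, "bed": 1, "netflix": 2, "pool": 3}

toy_index = build_inverted_index(toy_docs, toy_vocabulary)
print(toy_index)
```

Each posting list holds a document id at most once, even when the word is repeated inside the document (as `queen` is in `doc_2`).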

In [6]:
inverted_index = {}

for file in range(18259):

    op = open(r'C:\Users\mccol\Desktop\Sapienza\ADM\HW3\doc\doc_' + str(file) + '.tsv', 'r', encoding="utf-8")
    for line in op:
        ou = line.strip().split('\t')
        sentence = ou[5].replace('\\n', ' ').replace('/', ' ').replace('*', ' ').replace('\\r', ' ').replace('\\t', ' ') + ' ' + ou[8].replace('\\n', ' ').replace('/', ' ').replace('*', ' ').replace('\\r', ' ').replace('\\t', ' ')
    op.close()
 
    
    # preprocessing data: deleting stop words, punctuation, etc.
    final = preprocess(word_tokenize(sentence))
    
    
    #CREATING INVERTED INDEX
    for word in final:
        index = vocabulary[word]
        if not (index in inverted_index):
            inverted_index[index] = ['doc_' + str(file)]
        elif not('doc_' + str(file) in inverted_index[index]):
            inverted_index[index] = inverted_index[index] + ['doc_' + str(file)]

op = open(r'C:\Users\mccol\Desktop\Sapienza\ADM\HW3\inverted_index.txt', 'w', encoding="utf-8")
op.write(str(inverted_index))
op.close()
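
The cell above stores the index as the `str()` of a dictionary. One possible way to load it back into memory, assuming the file contains nothing but that plain Python literal, is `ast.literal_eval`. In this sketch the temporary file and the toy index are stand-ins for the real `inverted_index.txt`.

```python
import ast
import os
import tempfile

# toy stand-in for the real inverted index
toy_index = {0: ["doc_0", "doc_2"], 1: ["doc_0", "doc_1"]}

# write the dictionary the same way as above: its str() representation
path = os.path.join(tempfile.mkdtemp(), "inverted_index.txt")
with open(path, "w", encoding="utf-8") as out:
    out.write(str(toy_index))

# literal_eval only parses Python literals, it does not execute code
with open(path, "r", encoding="utf-8") as src:
    loaded_index = ast.literal_eval(src.read())

print(loaded_index == toy_index)
```

The integer `term_id` keys survive this round trip. With JSON they would come back as strings, because JSON object keys are always strings.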

#### 3.1.2) Execute the query
Now, given a query entered by the user, such as:
```
queen netflix
```
we want the search engine to return the list of documents that contain all the words in the query.

Query input:

In [14]:
user_query = input()

#names, l = chunking(sentence) 
    
# preprocessing data: deleting stop words, punctuation, etc.
final = preprocess(word_tokenize(user_query))
final

netflix bed Bienvenidos


['netflix', 'bed', 'bienvenido']

Translating the query into our 'language' (term ids):

In [15]:
voc = {}
inverted_idx = {}
i=0
for word in final: 
    voc[i]  = vocabulary[word]
    i = i+1
for index in range(i):
    inverted_idx[voc[index]] = inverted_index[voc[index]]

Finding the list of docs that contain all the words in the query and printing them in the format we want:

In [39]:
#finding list of docs that contain all the words in the query
docs = []

for i in range(18259):
    doc = 'doc_' + str(i)
    b = True
    for j in voc.values(): 
        b = b and (doc in inverted_idx[j])
    if b:
        docs.append(i)

df = f.filter(items = ['title', 'description', 'city', 'url']).loc[docs]
df.description = list(map(lambda x: x.replace('\\n', ' '), df.description.tolist()))
df.style.hide_index()

Unnamed: 0,title,description,city,url
57,Lovely Katy Home (3BR/2B) - Fantastic Location,"Bienvenidos! This gorgeous one-story home offers 3 bedrooms, 2 bathrooms and a 2-car garage. The elegant plan features a study room equipped with a sofa bed, gorgeous kitchen and great living space with TV (netflix included). It is tailored for all. We are located in Katy, TX, in close proximity to Katy Mills Shopping Mall, Houston Premium Outlets, Typhoon Water Park, energy corridor, and new hospital area. We sincerely hope you will have a wonderful and profitable stay in our dear town of Katy.",Katy,https://www.airbnb.com/rooms/19387030?location=Cinco%20Ranch%2C%20TX
12924,Lovely Katy Home (3BR/2B) - Fantastic Location,"Bienvenidos! This gorgeous one-story home offers 3 bedrooms, 2 bathrooms and a 2-car garage. The elegant plan features a study room equipped with a sofa bed, gorgeous kitchen and great living space with TV (netflix included). It is tailored for all. We are located in Katy, TX, in close proximity to Katy Mills Shopping Mall, Houston Premium Outlets, Typhoon Water Park, energy corridor, and new hospital area. We sincerely hope you will have a wonderful and profitable stay in our dear town of Katy.",Katy,https://www.airbnb.com/rooms/19387030?location=Beasley%2C%20TX
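
The same conjunctive lookup can also be expressed as an intersection of posting lists, which avoids scanning every document id. This is a sketch on toy data, not the code used above; `conjunctive_search` and the toy dictionaries are hypothetical names introduced for the example.

```python
def conjunctive_search(query_terms, vocabulary, inverted_index):
    """Return the sorted ids of the documents containing every query term."""
    postings = []
    for term in query_terms:
        term_id = vocabulary.get(term)
        if term_id is None:
            # a term missing from the vocabulary appears in no document
            return []
        postings.append(set(inverted_index[term_id]))
    if not postings:
        return []
    # start from the shortest posting list to keep intermediate sets small
    postings.sort(key=len)
    result = postings[0]
    for other in postings[1:]:
        result = result & other
    return sorted(result)


# hypothetical vocabulary and inverted index
toy_vocabulary = {"queen": 0, "bed": 1, "netflix": 2}
toy_index = {
    0: ["doc_0", "doc_2"],
    1: ["doc_0", "doc_1"],
    2: ["doc_0", "doc_2"],
}

print(conjunctive_search(["queen", "netflix"], toy_vocabulary, toy_index))
print(conjunctive_search(["queen", "bed"], toy_vocabulary, toy_index))
```

A query word that is not in the vocabulary yields an empty result here instead of raising a `KeyError`.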


### 3.2) Conjunctive query & Ranking score
In the new search engine, given a query, we want to get the *top-k* documents related to the query (the choice of *k* is up to you!). In particular we want to:

* Find all the documents that contain all the words in the query (as before...).
* Sort them by their similarity with the query.
* Return in output *k* documents, or all the documents with non-zero similarity with the query when the results are fewer than _k_.

To solve this task, we use the *tfIdf* score, and the _Cosine similarity_. Let's see how.
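
Written out, the quantities computed in the cells of this section are the following. The notation is introduced here only for illustration: $N$ is the number of documents, $n_{t,d}$ the number of occurrences of term $t$ in document $d$, $|d|$ the number of tokens of $d$ after preprocessing, and $\mathrm{df}_t$ the number of documents that contain $t$.

$$\mathrm{tf}_{t,d} = \frac{n_{t,d}}{|d|}, \qquad \mathrm{idf}_t = \ln\frac{N}{\mathrm{df}_t}, \qquad \mathrm{tfIdf}_{t,d} = \mathrm{tf}_{t,d} \cdot \mathrm{idf}_t$$

$$\cos(q, d) = \frac{\sum_{t \in q} \mathrm{tfIdf}_{t,q}\,\mathrm{tfIdf}_{t,d}}{\sqrt{\sum_{t \in q} \mathrm{tfIdf}_{t,q}^{2}}\;\sqrt{\sum_{t \in q} \mathrm{tfIdf}_{t,d}^{2}}}$$

Both norms are taken over the query terms only, which is how the code in this section computes them.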

First, we create a new inverted index that contains the `tfIdf` scores:

In [18]:
inverted_index_2 = {}

for file in range(18259):

    op = open(r'C:\Users\mccol\Desktop\Sapienza\ADM\HW3\doc\doc_' + str(file) + '.tsv', 'r', encoding="utf-8")
    for line in op:
        ou = line.strip().split('\t')
        sentence = ou[5].replace('\\n', ' ').replace('/', ' ').replace('*', ' ').replace('\\r', ' ').replace('\\t', ' ') + ' ' + ou[8].replace('\\n', ' ').replace('/', ' ').replace('*', ' ').replace('\\r', ' ').replace('\\t', ' ')
    op.close()
 
    
    # preprocessing data: deleting stop words, punctuation, etc.
    final = preprocess(word_tokenize(sentence))
    
    
    #CREATING INVERTED INDEX
    for word in final:
        index = vocabulary[word]
        
        tf = final.count(word) / len(final)
        idf = math.log( 18259 / len(inverted_index[vocabulary[word]]))
        
        if not (index in inverted_index_2):
            inverted_index_2[index] = [('doc_' + str(file), tf*idf )]
        elif not(('doc_' + str(file), tf*idf)  in inverted_index_2[index]):
            inverted_index_2[index] = inverted_index_2[index] + [('doc_' + str(file), tf*idf)]


op = open(r'C:\Users\mccol\Desktop\Sapienza\ADM\HW3\inverted_index_2.txt', 'w', encoding="utf-8")
op.write(str(inverted_index_2))
op.close()
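
To make the scoring concrete, here is a small self-contained sketch that applies the same definitions (tf as count over document length, idf as the natural log of the number of documents over the document frequency) to two made-up documents. For brevity it is keyed by the word itself instead of the `term_id`; `tfidf_index` and `toy_docs` are hypothetical names.

```python
import math


def tfidf_index(docs):
    """Return {term: [(doc_id, tfidf), ...]} with tf = count/len, idf = ln(N/df)."""
    n_docs = len(docs)
    # document frequency: number of documents containing each term
    df = {}
    for tokens in docs.values():
        for term in set(tokens):
            df[term] = df.get(term, 0) + 1
    index = {}
    for doc_id, tokens in docs.items():
        for term in sorted(set(tokens)):
            tf = tokens.count(term) / len(tokens)
            idf = math.log(n_docs / df[term])
            index.setdefault(term, []).append((doc_id, tf * idf))
    return index


# hypothetical, already preprocessed documents
toy_docs = {
    "doc_0": ["queen", "bed", "netflix", "bed"],
    "doc_1": ["bed", "pool"],
}
toy_tfidf = tfidf_index(toy_docs)
print(toy_tfidf["queen"])
print(toy_tfidf["bed"])
```

A term that occurs in every document, like `bed` here, gets idf = ln(1) = 0, so it contributes nothing to the ranking.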

Query input:

In [19]:
user_query = input()

# preprocessing data: deleting stop words, punctuation, etc.
final_query = preprocess(word_tokenize(user_query))
final_query

netflix bed Bienvenidos


['netflix', 'bed', 'bienvenido']

Translating the query into our 'language' (term ids):

In [20]:
voc = {}
inverted_query = {}
i=0
for word in final_query: 
    voc[i]  = vocabulary[word]
    i = i+1
for index in range(i):
    inverted_query[voc[index]] = inverted_index_2[voc[index]]

Finding the numerator of the cosine similarity formula:

In [65]:
n = {}
index = 0
for i in inverted_query:
    for j in inverted_query[i]:
        if index == 0:
            n[j[0]] = [j[1]]
        elif (not (j[0] in n)):
            n[j[0]] = [0]*index + ([j[1]])
        else:   
            n[j[0]] = n[j[0]] + [0]*(index - len(n[j[0]])) + ([j[1]])
    index = index + 1

for i in n:
    if len(n[i]) < len(final_query) :
        n[i] = n[i] + [0]*(len(final_query)-len(n[i]))


Finding the `tfIdf` scores of the query:

In [22]:
tfidf_query = []  
for word in final_query:
    tf_query = final_query.count(word) / len(final_query)
    idf_query = math.log( 18259 / len(inverted_index[vocabulary[word]]))
    tfidf_query.append(tf_query * idf_query)
tfidf_query

[1.4185284421457318, 0.4563606817729836, 2.538396270266838]

Finding the denominator of the formula: the norm of the query and the norms of the docs:

In [23]:
import numpy
norm_query = numpy.sqrt(sum(list(map(lambda x: x**2, tfidf_query))))

In [24]:
norm_doc = {}
# accumulate the squared tfIdf scores of each document over the query terms
for term in inverted_query:
    for doc, score in inverted_query[term]:
        norm_doc[doc] = norm_doc.get(doc, 0) + score**2
for doc in norm_doc:
    norm_doc[doc] = numpy.sqrt(norm_doc[doc])


Computing the cosine similarities and sorting them with a heap:

In [64]:
daje= []
for i in n:
    st = i
    cos = numpy.dot(tfidf_query, n[st])
    cosine = round(cos / (norm_doc[st]*norm_query), 10)
    #the other formula:
    #import scipy
    #cosine = 1 - scipy.spatial.distance.cosine(tfidf_query, n[st])
    daje.append((cosine, i ))
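
As a worked check of the formula, the sketch below recomputes the cosine for two hypothetical document vectors against the query `tfIdf` scores printed earlier in the notebook. The helper `cosine_similarity` and the two document vectors are made up for illustration; only `tfidf_query` is taken from the output above.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_u * norm_v)


# query tfIdf scores as printed earlier in the notebook
tfidf_query = [1.4185284421457318, 0.4563606817729836, 2.538396270266838]

# hypothetical document that contains only the third query term
only_third = [0.0, 0.0, 0.05]
# hypothetical document whose scores are proportional to the query's
proportional = [0.5 * x for x in tfidf_query]

print(round(cosine_similarity(tfidf_query, only_third), 6))
print(round(cosine_similarity(tfidf_query, proportional), 6))
```

A document matching only the last query word scores about 0.8624 whatever its own `tfIdf` value is, and a document whose scores are proportional to the query's scores gets 1.0. This is consistent with the two ranking values that appear in the results of this section.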


In [59]:
import heapq_max
heap_max = []
for i in daje:
    heapq_max.heappush_max(heap_max, i)

k = 5
best_cosine = []
docs = []
# pop at most k results, fewer when the query matches fewer documents
for i in range(min(k, len(heap_max))):
    cos = heapq_max.heappop_max(heap_max)
    best_cosine.append(cos[0])
    docs.append(int(cos[1][4:]))
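
The top-*k* selection can also be sketched with the standard library's `heapq.nlargest`, which returns at most *k* items and therefore copes with result lists shorter than *k*. This swaps in `heapq` for the third-party `heapq_max` package used above; the `scored` list is made-up data in the same `(cosine, doc_id)` shape as `daje`.

```python
import heapq

# hypothetical (cosine, doc_id) pairs, same shape as the list built above
scored = [
    (0.862386, "doc_4021"),
    (1.0, "doc_57"),
    (0.31, "doc_880"),
    (1.0, "doc_12924"),
]

k = 3
# tuples compare by cosine first, then by document id string on ties
top_k = heapq.nlargest(k, scored)

best_cosine = [cosine for cosine, _ in top_k]
docs = [int(doc_id[len("doc_"):]) for _, doc_id in top_k]

print(best_cosine)
print(docs)
```

Asking for more results than are available, for example `heapq.nlargest(10, scored)`, simply returns all four pairs in descending order.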

In [63]:
df = f.filter(items = ['title', 'description', 'city', 'url']).loc[docs]
df.description = list(map(lambda x: x.replace('\\n', ' '), df.description.tolist()))
df['ranking'] = best_cosine
df.style.hide_index()

title,description,city,url,ranking
Lovely Katy Home (3BR/2B) - Fantastic Location,"Bienvenidos! This gorgeous one-story home offers 3 bedrooms, 2 bathrooms and a 2-car garage. The elegant plan features a study room equipped with a sofa bed, gorgeous kitchen and great living space with TV (netflix included). It is tailored for all. We are located in Katy, TX, in close proximity to Katy Mills Shopping Mall, Houston Premium Outlets, Typhoon Water Park, energy corridor, and new hospital area. We sincerely hope you will have a wonderful and profitable stay in our dear town of Katy.",Katy,https://www.airbnb.com/rooms/19387030?location=Cinco%20Ranch%2C%20TX,1.0
Lovely Katy Home (3BR/2B) - Fantastic Location,"Bienvenidos! This gorgeous one-story home offers 3 bedrooms, 2 bathrooms and a 2-car garage. The elegant plan features a study room equipped with a sofa bed, gorgeous kitchen and great living space with TV (netflix included). It is tailored for all. We are located in Katy, TX, in close proximity to Katy Mills Shopping Mall, Houston Premium Outlets, Typhoon Water Park, energy corridor, and new hospital area. We sincerely hope you will have a wonderful and profitable stay in our dear town of Katy.",Katy,https://www.airbnb.com/rooms/19387030?location=Beasley%2C%20TX,1.0
Bienvenidos 2 - upstairs Cal-king,"Located in a gated community, the room is a small guest bedroom on the second floor, a full bath is located right by the bedroom, the bathroom is considered a shared bath. Our home is close to HWY 151 making moving around the city easy on highways. Listed rate includes city and county taxes. FYI Airbnb assess state taxes on top of listed price.",San Antonio,https://www.airbnb.com/rooms/8607610?location=Castroville%2C%20TX,0.862386
Mi casa es su Casa. Travel instyle.,"Gorgeous 3 master suites w/ privatel baths, all with designer vanities. Hardwood floors, chefs kitchen, 2000 square ft located in N. Dallas. Cozy backyard.Private parking, beautifully maintained neighborhood, pool and tennis courts. Just seconds off of 635 and 75. If you are a traveler who loves to stay/relax in style- this is it. Bienvenidos My wife works every 3rd day and is out for 24hrs. She is a police officer. I am on a mini break from work but I have invested in a business. I am usually out of the home meeting with clients. We do have a baby dog that is hypoallergenic. Sebastian is a Miniature Schnauzer who is very sweet, house trained, and VERY quiet. He has not been allowed in the extra bedrooms. The neighborhood is filled with 25+ something year olds that are yound professionals. It is very quiet. In this community- we have access to 2 large swimming pools/hot tubs, tennis courts, and a fenced in dog park. There are many bus routes that are within walking distance as well as the Dart Rail. If you are a traveler who loves to travel in style but may be on a budget... This has your name written all over it. We provide: clean linen/ towels, traveler kits (so in case you forgot your toothbrush we have various colors to choose from) directions, and coffee in the morning. We accommodate late check outs as well. EXTRAS EXTRAS EXTRAS AVAILABLE!!!! 1. If you want personal groceries waiting for you upon check in and only for your personal enjoyment. You can send me a detailed email and I will provide this feature for the cost of goods (receipt will be provided) plus a 35% fee. The grocery stores offered are: Whole Foods, Trader Joe's, Spec's Wine, Spirits and Finer Foods, Albertson's (Kosher is available here), and Kroger. 
All guest's have full access to a chef inspired modern kitchen, living/lounge area with a big screen TV/with cable, an upstairs private balcony, state of the art laundry room, and a private backyard with an extra large stainless steel grill. I am here to help and help you get settled into your bedroom :) I am reachable via text/call/email. I am usually home in the evenings. We have free street parking, lovely tennis courts, and a very well mainteneced community pool. Our home is minutes away from 2 major highways and public transportation is meters away from our front door step. There are many DART bus stops that are literally a 1 minute walk that can take you to the DART rail. Be sure to use their website to help figure out your commute etc. You can use Richland College as a starting point. If your stay is a week or longer then you are much more like a temporary roommate, in which, you are responsible for your own toiletries, laundry detergents, and are responsible in the upkeep of your bedroom/bath. I have renovated and designed the townhome from the ground up. I take pride in what I have created and feel guest's should take pride in their rooms/the home itself. That is why you have chosen to stay with us :D",Dallas,https://www.airbnb.com/rooms/790791?location=Addison%2C%20TX,0.862386
Freshly remodeled midtown bungalow - close to all!,"Bienvenidos to our 1916 bungalow! We have taken special care to keep this craftsman work of art as original as possible while adding moderate modern convenience & style. Set amongst the quiet yet central and walkable Alta Vista neighborhood in midtown San Antonio we know you will enjoy a refreshing cool down after a day exploring all of the local attractions - Zoo, San Pedro Park, Breckenridge Golf Course, the Japanese Tea Garden, Hildebrand St. Antiques, the Pearl, & of course, the Riverwalk!",San Antonio,https://www.airbnb.com/rooms/18943624?location=Alamo%20Heights%2C%20TX,0.862386
