# Homework 3 - Find the perfect place to stay in Texas!

###### Alessandro Flaborea, Egon Ferri, Melis Kaymaz

The homework consists of analyzing the text of Airbnb property listings and building a search engine.

In [41]:
import pandas as pd
from nltk.tokenize import sent_tokenize, word_tokenize, PunktSentenceTokenizer
from nltk.corpus import stopwords
#nltk.download()
import string
from nltk.stem import PorterStemmer
from nltk import pos_tag, ne_chunk
import nltk
import math

## Step 2: Create documents

We want to create a `.tsv` file for each record of the dataset.
The first step is to read the file.

In [508]:
f = pd.read_csv(r'C:\Users\mccol\Desktop\Sapienza\ADM\HW3\Airbnb_Texas_Rentals.csv')

Now we can create `.tsv` files and store them in a directory.

In [6]:
for i in range(f.index.max()+1):
    # one .tsv file per record, holding its first 10 columns separated by tabs
    with open(r'C:\Users\mccol\Desktop\Sapienza\ADM\HW3\doc\doc_' + str(i) + '.tsv', 'w', encoding="utf-8") as op:
        for j in range(10):
            op.write('%s\t' %f.iloc[i, j])
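
Writing the fields by hand leaves a trailing tab on each line and would misalign the columns if a field itself contained a tab. As a minimal sketch of an alternative, assuming the standard-library `csv` module with a tab delimiter is acceptable here, the snippet below writes and re-reads two toy records. The `rows` values and the temporary `out_dir` are made up for illustration and are not part of the dataset.

```python
import csv
import tempfile
from pathlib import Path

# Toy records standing in for dataset rows (hypothetical values).
rows = [
    ["0", "Cozy loft", "Austin"],
    ["1", "Lake\thouse", "Dallas"],  # a field that itself contains a tab
]

out_dir = Path(tempfile.mkdtemp())  # stand-in for the doc/ directory

for i, row in enumerate(rows):
    # newline="" lets the csv module control line endings itself
    with open(out_dir / f"doc_{i}.tsv", "w", encoding="utf-8", newline="") as fh:
        csv.writer(fh, delimiter="\t").writerow(row)

# Reading back recovers the original fields, including the embedded tab,
# because the writer quotes any field that contains the delimiter.
with open(out_dir / "doc_1.tsv", encoding="utf-8", newline="") as fh:
    restored = next(csv.reader(fh, delimiter="\t"))
```

Files written this way should be read back with `csv.reader` rather than `split('\t')`, since quoted fields are not split correctly by a plain string split.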

## Step 3: Search Engine

Now we want to create two different Search Engines that, given a query as input, return the houses that match the query.

As a first common step, we want to preprocess the documents by

1. Removing stopwords
2. Removing punctuation
3. Stemming

Then we want to build a file named `vocabulary.txt`, that maps each word to an integer (`term_id`).

In [4]:
#FUNCTIONS
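def preprocess(l):
    """Stem each token and drop stop words and punctuation (uses the globals ps and stopWords)."""
    final = []
    for i in l:
        stem = ps.stem(i)  # stem once and reuse the result
        if stem not in stopWords and stem not in string.punctuation:
            final.append(stem)
    return final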

def vocabularization(vocabulary, final, index):
    for word in final:
        if not(word in vocabulary):
            vocabulary[word] = index
            index = index + 1
    return(vocabulary, index)

#not used at the moment
def chunking(sentence):
    names = []
    l = []
    for chunk in ne_chunk(pos_tag(word_tokenize(sentence))) : 
        if type(chunk) is nltk.tree.Tree:
            s = ''
            for  i in chunk:
                s = s + ' ' + i[0]
            names.append(s[1:])
        else:
            l.append(chunk[0])
    return(names, l)

In [16]:
stopWords = set(stopwords.words('english'))
ps = PorterStemmer()
string.punctuation = string.punctuation + '–“”’'

vocabulary= {}
index = 0

for i in range(18259):
    
    op = open(r'C:\Users\mccol\Desktop\Sapienza\ADM\HW3\doc\doc_' + str(i) + '.tsv', 'r', encoding="utf-8")
    for line in op:
        ou = line.strip().split('\t')
        sentence = ou[5].replace('\\n', ' ').replace('/', ' ').replace('*', ' ').replace('\\r', ' ').replace('\\t', ' ') + ' ' + ou[8]
    op.close()
        
    # preprocessing: remove stop words and punctuation, then stem
    final = preprocess(word_tokenize(sentence))
    
    # IF  word not in vocabulary -> add the word
    vocabulary, index = vocabularization(vocabulary, final, index)
            
op = open(r'C:\Users\mccol\Desktop\Sapienza\ADM\HW3\vocabulary.txt', 'w', encoding="utf-8")
op.write(str(vocabulary))
op.close()
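
The word-to-`term_id` mapping can be illustrated on toy data. This is only a sketch: `build_vocabulary` and `toy_docs` are hypothetical names, and the token lists are assumed to be already preprocessed.

```python
def build_vocabulary(token_lists):
    """Map each distinct token to an integer term_id in first-seen order."""
    vocabulary = {}
    for tokens in token_lists:
        for token in tokens:
            # setdefault assigns an id only the first time a token is seen
            vocabulary.setdefault(token, len(vocabulary))
    return vocabulary


# Hypothetical, already-preprocessed token lists for two documents.
toy_docs = [["queen", "bed", "netflix"], ["lake", "view", "queen"]]
toy_vocabulary = build_vocabulary(toy_docs)
```

Using `len(vocabulary)` as the next id removes the need to carry a separate `index` counter between calls.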

### 3.1) Conjunctive query
For now, we narrow our interest to the `description` and `title` of each document. This means that the first Search Engine will evaluate queries only against these two fields.

#### 3.1.1) Creating our index!

We want to create the Inverted Index. It will be a dictionary of this format:

```
{
term_id_1:[document_1, document_2, document_4],
term_id_2:[document_1, document_3, document_5, document_6],
...}
```

where _document\_i_ is the *id* of a document that contains the word.

We also want to store it in a separate file and load it in memory when needed.
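
The format described above can be sketched on a toy corpus. The names `toy_vocabulary`, `toy_docs` and `postings` are made up for illustration; sets are used while building so that the membership check stays cheap, which is a design choice of this sketch rather than something the homework requires.

```python
from collections import defaultdict

# Hypothetical vocabulary and preprocessed documents.
toy_vocabulary = {"queen": 0, "bed": 1, "netflix": 2, "lake": 3}
toy_docs = {
    "doc_0": ["queen", "bed", "netflix", "queen"],
    "doc_1": ["lake", "queen"],
    "doc_2": ["netflix", "lake"],
}

# Collect, for every term_id, the set of documents containing that term.
postings = defaultdict(set)
for doc_id, tokens in toy_docs.items():
    for token in tokens:
        postings[toy_vocabulary[token]].add(doc_id)

# Convert the sets to sorted lists to match the format described above.
toy_inverted_index = {term_id: sorted(docs) for term_id, docs in postings.items()}
```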

In [17]:
inverted_index = {}

for file in range(18259):

    op = open(r'C:\Users\mccol\Desktop\Sapienza\ADM\HW3\doc\doc_' + str(file) + '.tsv', 'r', encoding="utf-8")
    for line in op:
        ou = line.strip().split('\t')
        sentence = ou[5].replace('\\n', ' ').replace('/', ' ').replace('*', ' ').replace('\\r', ' ').replace('\\t', ' ') + ' ' + ou[8]
    op.close()
 
    
    # preprocessing: remove stop words and punctuation, then stem
    final = preprocess(word_tokenize(sentence))
    
    
    #CREATING INVERTED INDEX
    for word in final:
        index = vocabulary[word]
        if not (index in inverted_index):
            inverted_index[index] = ['doc_' + str(file)]
        elif not('doc_' + str(file) in inverted_index[index]):
            inverted_index[index] = inverted_index[index] + ['doc_' + str(file)]

op = open(r'C:\Users\mccol\Desktop\Sapienza\ADM\HW3\inverted_index.txt', 'w', encoding="utf-8")
op.write(str(inverted_index))
op.close()
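
Since the index is saved as the `str()` of a dictionary, one way to load it back into memory is `ast.literal_eval`. The snippet below is a sketch of that round trip on a toy index written to a temporary file; the path and the index contents are placeholders.

```python
import ast
import tempfile
from pathlib import Path

# Toy index in the same {term_id: [doc ids]} format.
toy_inverted_index = {0: ["doc_0", "doc_1"], 2: ["doc_0", "doc_2"]}

# Save it the same way as above: the str() of the dictionary.
path = Path(tempfile.mkdtemp()) / "inverted_index.txt"
path.write_text(str(toy_inverted_index), encoding="utf-8")

# literal_eval parses Python literals only, so it is safer than eval().
loaded_index = ast.literal_eval(path.read_text(encoding="utf-8"))
```

The integer keys survive this round trip, which would not be the case with JSON, where object keys are always strings.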

#### 3.1.2) Execute the query
Now, given a query entered by the user, for example:
```
queen netflix
```
we want the Search Engine to return the list of documents that contain all the words in the query.
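
A conjunctive query amounts to intersecting the posting lists of the query terms. Below is a minimal sketch on toy data; `conjunctive_search` and the toy index are hypothetical, and the query terms are assumed to be already preprocessed.

```python
def conjunctive_search(query_terms, vocabulary, inverted_index):
    """Return the sorted ids of the documents that contain every query term."""
    posting_sets = []
    for term in query_terms:
        term_id = vocabulary.get(term)
        if term_id is None:
            return []  # a term missing from the vocabulary matches nothing
        posting_sets.append(set(inverted_index[term_id]))
    if not posting_sets:
        return []  # an empty query matches nothing
    return sorted(set.intersection(*posting_sets))


# Hypothetical vocabulary and inverted index.
toy_vocabulary = {"queen": 0, "netflix": 1, "lake": 2}
toy_inverted_index = {
    0: ["doc_0", "doc_1", "doc_3"],
    1: ["doc_0", "doc_2", "doc_3"],
    2: ["doc_2"],
}
matches = conjunctive_search(["queen", "netflix"], toy_vocabulary, toy_inverted_index)
```

With set intersection the cost depends on the length of the posting lists involved rather than on the total number of documents.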

Query input:

In [505]:
user_query = input()

#names, l = chunking(sentence) 
    
# preprocessing: remove stop words and punctuation, then stem
final = preprocess(word_tokenize(user_query))
final

queen netflix


['queen', 'netflix']

Translating the query into our 'language' (term ids):

In [506]:
voc = {}
inverted_idx = {}
i=0
for word in final: 
    voc[i]  = vocabulary[word]
    i = i+1
for index in range(i):
    inverted_idx[voc[index]] = inverted_index[voc[index]]

Finding list of docs that contain all the words in the query and printing them in the format that we want:

In [509]:
# finding list of docs that contain all the words in the query
docs = []

for i in range(18259):
    doc = 'doc_' + str(i)
    b = True
    for j in voc.values(): 
        b = b and (doc in inverted_idx[j])
    if b:
        docs.append(i)

df = f.filter(items = ['title', 'description', 'city', 'url']).loc[docs]
df.description = list(map(lambda x: x.replace('\\n', ' '), df.description.tolist()))
df.style.hide_index()

title,description,city,url
"Next to Stadiums! Updated, Modern 1 BD apt","Cozy and comfortable modern 1 BD apt. This apt is in the middle of EVERYTHING. Tons of restaurants and tourist attractions with in 1-2 miles. A second floor unit overlooking the pool. Comfortable king size Serta memory foam mattress in the bedroom, and a comfortable couch and love seat in the living room, along with a queen size air mattress. High speed internet and Netflix provided. This apartment is the perfect place if you are coming to town for the theme Parks or anything at ATT stadium.",Arlington,https://www.airbnb.com/rooms/18363018?location=Bedford%2C%20TX
Great room on East Downtown,"Nice Bedroom, 2 beds (1 full, 1 twin, soon to be 1 queen in September 2017) with private bathroom, fridge, desk, and microwave...Also wifi, cable, Netflix, and Hulu. Only two blocks from Metro station and less that 10 min to downtown, Minute Maid and Dynamo Stadium!",Houston,https://www.airbnb.com/rooms/15254951?location=Baytown%2C%20TX
"1BR perfect location near UT, downtown, Cherrywood","Location! - Walk to UT campus! - 10 minute drive downtown! Apartment features - 50 inch HD 4k TV Netflix, surround sound, bluetooth - Fully stocked kitchen - Queen size bed with ultra plush mattress - microfiber sheets, antimicrobial silicone pillows - 2 towel sets + 2 additional pool towels Building features - Secure covered parking - Resort-style pool, gas grills, covered lounge with big screen cable TVs - Game room with a pool table, cable TVs - 24 hour gym",Austin,https://www.airbnb.com/rooms/15604021?location=Austin%2C%20TX
Glam Celebrity Home in UP! 4BR. 10% for Charity!,"Host your holiday get together here!!! Enjoy staying in our luxurious new 4 Bedroom, 4.5 Bath University Park 3700 sf home in the prestigious Park Cities! Previous Celebrity Home now can be yours! All bedrooms have attached bathrooms. 5 HD TV's. Netflix. HULU. HBO. Cable. WIFI. 8 beds (1K, 5 Queen, 2 Twin) + 4 inflatables. Whether a business traveler, bridal party, or family, come stay at a home and AirBnB of professional athletes, socialites, and other celebrities! Close to everything!",Dallas,https://www.airbnb.com/rooms/14668342?location=Arlington%2C%20TX
Great room on East Downtown,"Nice Bedroom, 2 beds (1 full, 1 twin, soon to be 1 queen in September 2017) with private bathroom, fridge, desk, and microwave...Also wifi, cable, Netflix, and Hulu. Only two blocks from Metro station and less that 10 min to downtown, Minute Maid and Dynamo Stadium!",Houston,https://www.airbnb.com/rooms/15254951?location=Atascocita%2C%20TX
Luxury Oceanview Condo,"This luxury ocean view condo is a short walk directly onto Whitecap beach on Padre Island! Enjoy beautiful views from your private balcony, LR and MBR. Very tastefully decorated unit has it all! Flat-screen cable TVs in LR and MBR, Netflix and Blueray DVD. High-speed internet in the unit allows you to work or play online at no additional cost. Feel at home for as long as you like with washer and dryer, and clean, modern full kitchen. The beautifully appointed MBR has pillow-top queen bed, walk-in closet and large bathroom with tub and shower. The LR has a queen sleeper sofa, balcony and fireplace. Amenities onsite include a beautiful pool and hot tub and a well equipped fitness center. Pond, inlet and surf-fishing are available on the property or within easy walk. Restaurants and nightlife abound on Padre Island and in nearby Port Aransas. Padre Island National Park and Mustang Island State Park are also close by; everything you could want for an unforgettable beach vacation!",Corpus Christi,https://www.airbnb.com/rooms/1022134?location=Baffin%20Bay%2C%20TX
"Just outside Austin ATX, Queen room","You will have room with queen bed upstairs in my beautiful house, access to sitting areas, two bathrooms, living room w/ Netflix, cool backyard patio/fire pit, kitchen, dining room, laundry, etc. Perfect for business or tourism travelers. Close to Dell, Samsung, Apple, etc. An easy drive to downtown Austin. Great for ACL, SXSW, and Formula 1. Close to the Domain, Georgetown, Round Rock (and RR Outlets), Pflugerville, Leander, Cedar Park, Austin, Manor, etc. ! You'll love it.",Pflugerville,https://www.airbnb.com/rooms/13548463?location=Coupland%2C%20TX
Clean Bedroom and Bath Near Austin City Limit,"Private bedroom upstairs in our new house. The room features a queen sized bed and a twin sized bed, TV with Netflix, a desk and a full bathroom. There is also a mini fridge, microwave and a water cooler in the room for your private use. Our house is in the city of Manor, just outside of the Austin City Limits. We are a 15-20 minute drive to any part of Austin (downtown, airport, COTA, Round Rock, etc.) with easy access to downtown via the 290. We live in a very safe and quiet neighborhood.",Manor,https://www.airbnb.com/rooms/6747542?location=Coupland%2C%20TX
Jay's Lounge,"hola, Enjoy this very quiet and comfortable room with Smart TV with Netflix, fridge, air conditioner with queen size bed and laundry machine. This converted garage room is centrally located between Dallas and fort worth metroplex. 9.7 miles from DFW International airport. 7 miles from Love Field airport. 11 miles from Dallas downtown area. 6 miles from a mall, AMC, gym and some of the Great Tex-Mex restaurants. Centrally located and very quite and decent neighborhood.",Irving,https://www.airbnb.com/rooms/18655719?location=Coppell%2C%20TX
Gameday home in Bryan!,"I am a single male living in a new 1,850 sq. ft cozy man cave. 3 bedroom 2 bath home built in 2016. I currently have two private rooms both have a queen bed, master bedroom is off limits. You will have kitchen access, leather couches and Netflix on 65\",Bryan,https://www.airbnb.com/rooms/17991294?location=College%20Station%2C%20TX


### 3.2) Conjunctive query & Ranking score
In the new Search Engine, given a query, we want to get the *top-k* documents related to the query (the choice of *k* is up to you!). In particular we want to:

* Find all the documents that contain all the words in the query (as before...).
* Sort them by their similarity with the query.
* Return in output *k* documents, or all the documents with non-zero similarity with the query when the results are fewer than _k_.

To solve this task, we use the *tfIdf* score, and the _Cosine similarity_. Let's see how.
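
For reference, these are the definitions that the code in this section appears to follow. The notation is introduced here for illustration: $\mathrm{count}(t, d)$ is the number of occurrences of term $t$ in document $d$, $|d|$ is the number of tokens in $d$ after preprocessing, $N$ is the number of documents, and $\mathrm{df}(t)$ is the number of documents containing $t$.

$$
\mathrm{tf}(t, d) = \frac{\mathrm{count}(t, d)}{|d|}, \qquad
\mathrm{idf}(t) = \ln \frac{N}{\mathrm{df}(t)}, \qquad
\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t)
$$

Writing $q_t$ and $d_t$ for the tfIdf weights of term $t$ in the query and in the document, the cosine similarity is

$$
\cos(q, d) = \frac{\sum_t q_t \, d_t}{\sqrt{\sum_t q_t^2} \; \sqrt{\sum_t d_t^2}}
$$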

First, we create a new inverted index that stores the `tfIdf` scores:

In [264]:
inverted_index_2 = {}

for file in range(18259):

    op = open(r'C:\Users\mccol\Desktop\Sapienza\ADM\HW3\doc\doc_' + str(file) + '.tsv', 'r', encoding="utf-8")
    for line in op:
        ou = line.strip().split('\t')
        sentence = ou[5].replace('\\n', ' ').replace('/', ' ').replace('*', ' ').replace('\\r', ' ').replace('\\t', ' ') + ' ' + ou[8]
    op.close()
 
    
    # preprocessing: remove stop words and punctuation, then stem
    final = preprocess(word_tokenize(sentence))
    
    
    #CREATING INVERTED INDEX
    for word in final:
        index = vocabulary[word]
        
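        # tf: share of the document's tokens equal to this word
        tf = final.count(word) / len(final)
        # idf: ln(N / document frequency); 18258 is the largest document id, while the loop covers 18259 files
        idf = math.log( 18258 / len(inverted_index[vocabulary[word]]))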
        
        if not (index in inverted_index_2):
            inverted_index_2[index] = [('doc_' + str(file), tf*idf )]
        elif not(('doc_' + str(file), tf*idf)  in inverted_index_2[index]):
            inverted_index_2[index] = inverted_index_2[index] + [('doc_' + str(file), tf*idf)]


op = open(r'C:\Users\mccol\Desktop\Sapienza\ADM\HW3\inverted_index_2.txt', 'w', encoding="utf-8")
op.write(str(inverted_index_2))
op.close()
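
The same computation can be sketched on a toy corpus. This version is keyed by the term itself rather than by `term_id`, and it iterates over a `Counter` so that each (term, document) pair is computed once; both are simplifications of this sketch, and all names are hypothetical.

```python
import math
from collections import Counter

# Hypothetical preprocessed documents.
toy_docs = {
    "doc_0": ["queen", "bed", "netflix", "queen"],
    "doc_1": ["lake", "queen"],
    "doc_2": ["netflix", "lake"],
}
n_docs = len(toy_docs)

# Document frequency: number of documents that contain each term.
df = Counter(term for tokens in toy_docs.values() for term in set(tokens))

tfidf_index = {}
for doc_id, tokens in toy_docs.items():
    for term, count in Counter(tokens).items():
        tf = count / len(tokens)           # relative frequency in the document
        idf = math.log(n_docs / df[term])  # natural log, as math.log is used above
        tfidf_index.setdefault(term, []).append((doc_id, tf * idf))
```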

Query input:

In [512]:
user_query = input()

# preprocessing: remove stop words and punctuation, then stem
final_query = preprocess(word_tokenize(user_query))
final_query

rare lake netflix


['rare', 'lake', 'netflix']

Translating the query into our 'language' (term ids):

In [513]:
voc = {}
inverted_query = {}
i=0
for word in final_query: 
    voc[i]  = vocabulary[word]
    i = i+1
for index in range(i):
    inverted_query[voc[index]] = inverted_index_2[voc[index]]

Finding the numerator of the cosine similarity formula:

In [527]:
n = {}
index = 0
for i in inverted_query:
    for j in inverted_query[i]:
        if index == 0:
            n[j[0]] = [j[1]]
        elif (not (j[0] in n)):
            n[j[0]] = [0]*index + ([j[1]])
        else:   
            n[j[0]] = n[j[0]] + [0]*(index - len(n[j[0]])) + ([j[1]])
    index = index + 1

for i in n:
    if len(n[i]) < len(final_query) :
        n[i] = n[i] + [0]*(len(final_query)-len(n[i]))
n

{'doc_1340': [0.2327329197543106, 0, 0],
 'doc_1344': [0.1253177260215519, 0, 0],
 'doc_2831': [0.10343685322413805, 0, 0],
 'doc_4026': [0.09873517807758633, 0, 0],
 'doc_4659': [0.19747035615517267, 0.1318841827724638, 0],
 'doc_6071': [0.1253177260215519, 0, 0],
 'doc_6186': [0.01916624045035499, 0, 0],
 'doc_6191': [0.17148741455580782, 0, 0],
 'doc_7522': [0.15893955495416334, 0, 0],
 'doc_8201': [0.08926742127562598, 0, 0],
 'doc_8628': [0.11235382332966719, 0, 0],
 'doc_9516': [0.10343685322413805, 0, 0],
 'doc_9707': [0.05978460323963943, 0.059892358231531724, 0],
 'doc_10444': [0.05978460323963943, 0.059892358231531724, 0],
 'doc_10446': [0.30309403502886967, 0, 0],
 'doc_11511': [0.05978460323963943, 0.059892358231531724, 0],
 'doc_13279': [0.18101449314224158, 0, 0],
 'doc_13326': [0.11044952123933385, 0, 0],
 'doc_14448': [0.1253177260215519, 0, 0],
 'doc_14589': [0.1253177260215519, 0, 0],
 'doc_14722': [0.1253177260215519, 0, 0],
 'doc_15739': [0.21021037913292573, 0, 0],

Finding the tfIdf weights of the query:

In [528]:
tfidf_query = []  
for word in final_query:
    tf_query = final_query.count(word) / len(final_query)
    idf_query = math.log( 18258 / len(inverted_index[vocabulary[word]]))
    tfidf_query.append(tf_query * idf_query)
tfidf_query

[2.172173917706899, 0.7253630052485508, 1.418510185808496]

Finding the denominator of the formula: the norm of the query and the norm of each document:

In [529]:
import numpy
norm_query = numpy.sqrt(sum(list(map(lambda x: x**2, tfidf_query))))

In [530]:
norm_doc = {}
# sum of squared tfIdf weights per document, restricted to the query terms;
# a single pass over the postings replaces the scan over all 18259 ids
for term_id in inverted_query:
    for doc, score in inverted_query[term_id]:
        norm_doc[doc] = norm_doc.get(doc, 0) + score**2
for doc in norm_doc:
    norm_doc[doc] = numpy.sqrt(norm_doc[doc])


Computing the cosine similarity of each document, which we then sort with a heap:

In [531]:
daje= []
for i in n:
    st = i
    cos = numpy.dot(tfidf_query, n[st])
    cosine = round(cos / (norm_doc[st]*norm_query), 10)
    daje.append((cosine, i ))
    print('cosine for ' + i + ' with words ['+ user_query + '] is:', cosine)

cosine for doc_1340 with words [rare lake netflix] is: 0.806354738
cosine for doc_1344 with words [rare lake netflix] is: 0.806354738
cosine for doc_2831 with words [rare lake netflix] is: 0.806354738
cosine for doc_4026 with words [rare lake netflix] is: 0.806354738
cosine for doc_4659 with words [rare lake netflix] is: 0.820105581
cosine for doc_6071 with words [rare lake netflix] is: 0.806354738
cosine for doc_6186 with words [rare lake netflix] is: 0.806354738
cosine for doc_6191 with words [rare lake netflix] is: 0.806354738
cosine for doc_7522 with words [rare lake netflix] is: 0.806354738
cosine for doc_8201 with words [rare lake netflix] is: 0.806354738
cosine for doc_8628 with words [rare lake netflix] is: 0.806354738
cosine for doc_9516 with words [rare lake netflix] is: 0.806354738
cosine for doc_9707 with words [rare lake netflix] is: 0.7602388465
cosine for doc_10444 with words [rare lake netflix] is: 0.7602388465
cosine for doc_10446 with words [rare lake netflix] is: 0.8

cosine for doc_6114 with words [rare lake netflix] is: 0.2692693671
cosine for doc_6117 with words [rare lake netflix] is: 0.2692693671
cosine for doc_6131 with words [rare lake netflix] is: 0.2692693671
cosine for doc_6132 with words [rare lake netflix] is: 0.2692693671
cosine for doc_6140 with words [rare lake netflix] is: 0.2692693671
cosine for doc_6146 with words [rare lake netflix] is: 0.2692693671
cosine for doc_6151 with words [rare lake netflix] is: 0.2692693671
cosine for doc_6152 with words [rare lake netflix] is: 0.2692693671
cosine for doc_6164 with words [rare lake netflix] is: 0.2692693671
cosine for doc_6179 with words [rare lake netflix] is: 0.2692693671
cosine for doc_6190 with words [rare lake netflix] is: 0.2692693671
cosine for doc_6195 with words [rare lake netflix] is: 0.2692693671
cosine for doc_6196 with words [rare lake netflix] is: 0.2692693671
cosine for doc_6199 with words [rare lake netflix] is: 0.2692693671
cosine for doc_6200 with words [rare lake netfli

cosine for doc_10110 with words [rare lake netflix] is: 0.2692693671
cosine for doc_10114 with words [rare lake netflix] is: 0.2692693671
cosine for doc_10117 with words [rare lake netflix] is: 0.2692693671
cosine for doc_10122 with words [rare lake netflix] is: 0.2692693671
cosine for doc_10123 with words [rare lake netflix] is: 0.2692693671
cosine for doc_10128 with words [rare lake netflix] is: 0.2692693671
cosine for doc_10135 with words [rare lake netflix] is: 0.2692693671
cosine for doc_10138 with words [rare lake netflix] is: 0.2692693671
cosine for doc_10143 with words [rare lake netflix] is: 0.2692693671
cosine for doc_10149 with words [rare lake netflix] is: 0.2692693671
cosine for doc_10152 with words [rare lake netflix] is: 0.2692693671
cosine for doc_10175 with words [rare lake netflix] is: 0.2692693671
cosine for doc_10180 with words [rare lake netflix] is: 0.2692693671
cosine for doc_10182 with words [rare lake netflix] is: 0.2692693671
cosine for doc_10185 with words [r

cosine for doc_14966 with words [rare lake netflix] is: 0.2692693671
cosine for doc_14976 with words [rare lake netflix] is: 0.2692693671
cosine for doc_14983 with words [rare lake netflix] is: 0.2692693671
cosine for doc_14998 with words [rare lake netflix] is: 0.2692693671
cosine for doc_15011 with words [rare lake netflix] is: 0.2692693671
cosine for doc_15025 with words [rare lake netflix] is: 0.2692693671
cosine for doc_15040 with words [rare lake netflix] is: 0.2692693671
cosine for doc_15073 with words [rare lake netflix] is: 0.2692693671
cosine for doc_15078 with words [rare lake netflix] is: 0.2692693671
cosine for doc_15092 with words [rare lake netflix] is: 0.2692693671
cosine for doc_15096 with words [rare lake netflix] is: 0.2692693671
cosine for doc_15126 with words [rare lake netflix] is: 0.2692693671
cosine for doc_15135 with words [rare lake netflix] is: 0.2692693671
cosine for doc_15138 with words [rare lake netflix] is: 0.2692693671
cosine for doc_15145 with words [r

In [533]:
import heapq

# nlargest accepts any iterable and manages the heap internally,
# so the (cosine, doc) pairs can be passed to it directly
heapq.nlargest(10, daje)

[(0.820105581, 'doc_4659'),
 (0.806354738, 'doc_9516'),
 (0.806354738, 'doc_8628'),
 (0.806354738, 'doc_8201'),
 (0.806354738, 'doc_7522'),
 (0.806354738, 'doc_6191'),
 (0.806354738, 'doc_6186'),
 (0.806354738, 'doc_6071'),
 (0.806354738, 'doc_4026'),
 (0.806354738, 'doc_2831')]
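
In the output above, documents that contain only one query term all share the same score. This is consistent with the document norm being taken over the query terms only, and with documents being ranked when they contain any query term rather than all of them. As a hedged sketch of one possible variant, not the method used in the cells above, the snippet below first applies the conjunctive filter and then computes the cosine against each document's full tfIdf vector. All names (`toy_docs`, `tfidf_vector`, `cosine`, `top_k`) and data are hypothetical.

```python
import heapq
import math
from collections import Counter

# Hypothetical preprocessed documents.
toy_docs = {
    "doc_0": ["lake", "netflix", "queen", "bed"],
    "doc_1": ["lake", "netflix"],
    "doc_2": ["lake", "pool"],
    "doc_3": ["queen", "bed"],
}
n_docs = len(toy_docs)
df = Counter(t for tokens in toy_docs.values() for t in set(tokens))


def tfidf_vector(tokens):
    """Sparse tf-idf vector {term: weight}; terms unseen in the corpus are skipped."""
    counts = Counter(tokens)
    return {t: (c / len(tokens)) * math.log(n_docs / df[t])
            for t, c in counts.items() if t in df}


def cosine(u, v):
    """Cosine similarity of two sparse vectors, 0.0 if either has zero norm."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def top_k(query_tokens, k):
    """Rank the documents containing every query term and return the best k."""
    query_vec = tfidf_vector(query_tokens)
    scored = []
    for doc_id, tokens in toy_docs.items():
        if all(t in tokens for t in query_tokens):  # conjunctive filter
            scored.append((cosine(query_vec, tfidf_vector(tokens)), doc_id))
    return heapq.nlargest(k, scored)


ranking = top_k(["lake", "netflix"], k=2)
```

Here `doc_1` consists of exactly the query terms, so it ranks first, while `doc_0` also contains unrelated terms and gets a lower score.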