# Lab 1: Information Retrieval

__Students:__ benra741, enral465 

## Lab Description
In this Lab our goal is to implement a search tool based on Google Game apps, where we can write a query as a string with keywords and receive the most similar results.

For the implementation we followed 3 steps. The first step is regarding the data extraction and manipulation of HTML codes by using regular expressions for processing the data. For this step we accessed to the webpage where we collected all the App URL’s and processed them in order to collect only the name and the description of every app. The process of the web URL’s is done by cleaning the HTML and storing the description of every app in a separate files, we had to be careful with the encoding of some characters.

For the second step, with every app description file we collected, we had to apply preprocessing steps which include the implementation of a function called “preprocess” that removes non-alpha numeric characters, tokenize, lowercase the words, removes stop words and Stem. The preprocessing is needed for the computation of the Term Frequency-Inverse Document Frequency which function (TfidfVectorizer) is already implemented at the “sklearn” package, a matrix with the TF-IDF from the collection of raw documents is created which contains for every row (App) a number from zero to one that calculates which words from the description match with the collection of words we are comparing it.

For the last step, a ranked query processor has been created. A string is used as an input, this string needs to be preprocessed and transformed as we did on the preprocessing step. The main objective here is to compute an angle by in order to know which apps are have more relation with our query; in this case we are using the cosine similarity which will return a number from 0 to 1 where 0 means that the query has no relation with an app and 1 is a perfect similarity.



### Crawling



a) Get the webpage content by using functions in 
__[urllib module](https://docs.python.org/3/library/urllib.html#module-urllib)__.

Other libraries are also fine to achieve the crawling.

e.g. scrapy, beautifulsoup... 

In [4]:
import urllib.request
x = urllib.request.urlopen('https://play.google.com/store/apps/category/GAME?hl=en').read().decode('utf-8')

b) Get app url by regular expression using functions from __[re module](https://docs.python.org/3/library/re.html?highlight=re#module-re)__.

A useful online regular expression check.
__[Check your regular expression first](https://regex101.com)__.

In [5]:
import re
appreg = r'href=\"(/store/apps/details.*?)\"'
appre = re.compile(appreg)
app_url_list = re.findall(appre,x)
app_url_list = list(set(app_url_list))
print(len(app_url_list))

79


c) Access specific webpage to get description of each app and then store the description in files.

In [None]:
import re
descg = r'itemprop=\"description.*?\">.*?<div jsname=\".*?\">(.*?)<\/div>'
desc_reg = re.compile(descg)

def cleanHTML(html):
    html = re.sub('<.*?>', ' ', html) #remove the tags
    html = re.sub('&.*?;', ' ', html) #remove the html symbols (ex: &amp;)
    html = re.sub('https?:\/\/.*?(\s|\'|$)', ' ', html) #remove urls
    html = re.sub('[^\w]', ' ', html) #remove non alpha-numeric or space characters
    return html

def gettingDescription(html):
    desc = re.search(desc_reg,html).group()
    return cleanHTML(desc)
    
files = []
names = []

i = 0

while i < 1000 :
    url = app_url_list[0]

    app_url_list = app_url_list[1:]

    html = urllib.request.urlopen('https://play.google.com' + url +"&hl=en").read().decode('utf8')
    
    # retrieve the name of the app
    nameg = r'<title id=\"main-title\">(.*?) - Android'
    name = re.compile(nameg)
    if re.search(name, html) != None :
        name2 = cleanHTML(re.search(name, html).group())
        file_name = 'Files/description_{}.txt'.format(name2)

        #check if not already used
        if file_name not in files:
            desc_url = gettingDescription(html)
            temp = re.findall(appre, html)
            app_url_list = list(set(app_url_list + temp))  

            files.append(file_name)
            names.append(name2) # name list update
            try:
                with open(file_name , 'w') as file:
                    print(desc_url, file = file)
                    i = i + 1
            except:
                print(file_name) # files that throw an exception (character issue)

### Construct Inverted file index (Vector Model)



d) Preprocess text using NLP techniques from __[nltk module](http://www.nltk.org/py-modindex.html)__.

Using nltk.download(ID) to get the corpora if it is not downloaded before. __[nltk corpora](http://www.nltk.org/nltk_data/)__

In [4]:
import nltk
nltk.download('stopwords')
nltk.download('punkt')

In [7]:
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.corpus import stopwords
import nltk.stem

english_stemmer = nltk.stem.SnowballStemmer('english')
stop = set(stopwords.words('english'))

# preprocess text
def preprocess(s):
    tokens = [ word for sent in sent_tokenize(s) for word in word_tokenize(sent) ] # tokenization
    tokens = [ token.lower() for token in tokens ] # lowercase
    tokens = [ token for token in tokens if token not in stop ] # stopwords )
    tokens = [ english_stemmer.stem(token) for token in tokens ] # stem
    return(tokens)

# loop to go through each file and write back to it
for file in files:
    with open(file, 'r+') as f:
        init = f.read()
        f.seek(0)
        f.truncate()
        text = ' '.join(preprocess(init))
        print(text, file=f)


...)Compute tdidf using functions from __[scikit-learn module](http://scikit-learn.org/stable/modules/classes.html)__.

eg. TfidfVectorizer is used for converting a collection of raw documents to a matrix of TF-IDF features.

In [8]:
from sklearn.feature_extraction.text import TfidfVectorizer
transvector = TfidfVectorizer()
corpus = []
for file in files: # go through all the documents
    with open(file, 'r') as f:
        corpus.append(f.readline().strip())

tfidf1 = transvector.fit_transform(corpus)
tfidf_matrix = tfidf1.toarray()
tfidf_matrix.shape

(1076, 9667)

### Query Process

eg. "Dragon, Control, hero, running"

eg. "The hero controls the dragon to run."



In [10]:
from sklearn.metrics.pairwise import cosine_similarity
import operator

def queryProcess(query) :
    query = [' '.join(preprocess(query))]
    print(query)

    query = transvector.transform(query).toarray()
    similarity = cosine_similarity(tfidf_matrix, query)

    #I added a variable called names so we can return the names of the apps
    #print(names)
    #In this part we will need to get somehow an order of the most similar results with their names (dictionary of a matrix? how?)
    apps_dictionary = {}

    for i in range(len(names)) :
        apps_dictionary.update({names[i]:similarity[i]})

    #sorted(similarity, reverse=True)

    sorted_apps_dictionary = sorted(apps_dictionary.items(),key = operator.itemgetter(1),reverse = True)
    return sorted_apps_dictionary[0:10]

print(queryProcess('basketball'))
print(queryProcess('game cooking restaurant'))
print(queryProcess('racing'))

['basketbal']
[(' Free Throw Basketball   Android', array([ 0.70313571])), (' Dunk Hit Basketball   Android', array([ 0.40440768])), (' Basketball Shots 3D  2013    Android', array([ 0.32648624])), (' Dunkers   Basketball Madness   Android', array([ 0.18303438])), (' Basketball Shots 3D  2010    Android', array([ 0.15318526])), (' Flappy Ball   Ball through the Basket   Android', array([ 0.15107596])), (' Jam League Basketball   Android', array([ 0.11764068])), (' Stickman Football   Android', array([ 0.07774137])), (' Stickman Volleyball   Android', array([ 0.07339062])), (' Pink Gold Diamond Live Theme   Android', array([ 0.03596636]))]
['game cook restaur']
[(' Hidden Objects Restaurants   Kitchen Games   Android', array([ 0.57740634])), (' My Cafe  Recipes   Stories   World Cooking Game   Android', array([ 0.396865])), (' Bake Cupcakes   Android', array([ 0.34403219])), ('   Food Truck  Match 3 Game Free   Android', array([ 0.33001276])), (' Cooking colorful cupcakes   Android', ar