# **Deep Natural Language Processing @ PoliTO**

---


**Teaching Assistant:** Moreno La Quatra

**Practice 3:** Information Retrieval & Elastic Search

### Download and set up ElasticSearch on Google Colab

In [65]:
# Download and extract elasticsearch
!wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.10.1-linux-x86_64.tar.gz
!tar -xzf elasticsearch-7.10.1-linux-x86_64.tar.gz
!chown -R daemon:daemon elasticsearch-7.10.1

--2021-11-22 18:51:34--  https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.10.1-linux-x86_64.tar.gz
Resolving artifacts.elastic.co (artifacts.elastic.co)... 34.120.127.130, 2600:1901:0:1d7::
Connecting to artifacts.elastic.co (artifacts.elastic.co)|34.120.127.130|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 318801277 (304M) [application/x-gzip]
Saving to: ‘elasticsearch-7.10.1-linux-x86_64.tar.gz.1’


2021-11-22 18:51:42 (39.7 MB/s) - ‘elasticsearch-7.10.1-linux-x86_64.tar.gz.1’ saved [318801277/318801277]



In [66]:
import os
from subprocess import Popen, PIPE, STDOUT

# If issues are encountered with this section, ES can be manually started as follows:
# ./elasticsearch-7.10.1/bin/elasticsearch

# Start and wait for server
server = Popen(['elasticsearch-7.10.1/bin/elasticsearch'], stdout=PIPE, stderr=STDOUT, preexec_fn=lambda: os.setuid(1))
!sleep 30

In [67]:
# wait a bit then test
!curl -X GET "localhost:9200/"

{
  "name" : "8823589fd85e",
  "cluster_name" : "elasticsearch",
  "cluster_uuid" : "3f7DbuAgTX-pWli9Vs43Iw",
  "version" : {
    "number" : "7.10.1",
    "build_flavor" : "default",
    "build_type" : "tar",
    "build_hash" : "1c34507e66d7db1211f66f3513706fdf548736aa",
    "build_date" : "2020-12-05T01:00:33.671820Z",
    "build_snapshot" : false,
    "lucene_version" : "8.7.0",
    "minimum_wire_compatibility_version" : "6.8.0",
    "minimum_index_compatibility_version" : "6.0.0-beta1"
  },
  "tagline" : "You Know, for Search"
}


## Information Retrieval

Information retrieval is the science of searching for information in a document, searching for documents themselves, and also searching for the metadata that describes data, and for databases of **texts**, images or sounds. (source: Wikipedia).

The goal of this practice is to build a Wikipedia-based search engine. Only a subset of the Wikipedia pages is used.

Data Source: https://snap.stanford.edu/data/wikispeedia.html 

### **Question 1: Pagerank scores**
Using the Wikipedia link network, compute the [PageRank](http://ilpubs.stanford.edu:8090/422/) score of each page.

Which page has the highest PageRank score?


In [4]:
%%capture
!wget https://raw.githubusercontent.com/MorenoLaQuatra/DeepNLP/main/practices/P3/wikipedia_network/articles.tsv
!wget https://raw.githubusercontent.com/MorenoLaQuatra/DeepNLP/main/practices/P3/wikipedia_network/categories.tsv
!wget https://raw.githubusercontent.com/MorenoLaQuatra/DeepNLP/main/practices/P3/wikipedia_network/links.tsv

In [5]:
%%capture
! pip install elasticsearch==7.10.1
! pip install networkx

In [6]:
# Comments are my own

from urllib.parse import unquote

with open("articles.tsv") as f:
    list_articles = f.read().split("\n")  # one (percent-encoded) article ID per line
list_articles = [l for l in list_articles if l != ""]  # drop blank lines
list_articles = [l for l in list_articles if l[0] != "#"]  # drop comment lines starting with "#"
unquoted_list_articles = [unquote(l) for l in list_articles]  # replace %xx escapes with the characters they encode

dict_articles = {}
for i, l in enumerate(unquoted_list_articles):
    dict_articles[l] = {}
    dict_articles[l]["ID"] = l # unquoted_list_articles
    dict_articles[l]["quoted_ID"] = list_articles[i] # quoted article
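The article identifiers are percent-encoded, which is why both the readable `ID` and the `quoted_ID` are stored. A minimal sketch of the round trip, assuming an identifier shaped like the one printed later in this notebook (the exact escaping in the TSV files may differ):

```python
from urllib.parse import quote, unquote

# hypothetical quoted identifier, written as percent-encoded UTF-8 bytes
quoted_id = "%C3%81ed%C3%A1n_mac_Gabr%C3%A1in"

# unquote() decodes the %xx escapes as UTF-8 by default
readable_id = unquote(quoted_id)
print(readable_id)

# quote() goes the other way and can rebuild a file-system friendly name
round_trip = quote(readable_id)
print(round_trip)
```

The quoted form is the one needed later to locate `plaintext_articles/QUOTED_ID_OF_THE_DOC.txt`.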

In [7]:
# same as before

from urllib.parse import unquote

list_categories = open("categories.tsv").read()
list_categories = list_categories.split("\n")
list_categories = [l for l in list_categories if l!= ""]
list_categories = [l for l in list_categories if l[0] != "#"]

for l in list_categories:
    k, v = l.split("\t")
    k = unquote(k)
    v = unquote(v)
    if "categories" in dict_articles[k].keys():
        dict_articles[k]["categories"].append(v)
    else:
        dict_articles[k]["categories"] = [v]
    
# print (dict_articles)

In [8]:
# same as before

from urllib.parse import unquote

list_links = open("links.tsv").read()
list_links = list_links.split("\n")
list_links = [l for l in list_links if l!= ""]
list_links = [l for l in list_links if l[0] != "#"]

for l in list_links:
    s, t = l.split("\t")
    s = unquote(s)
    t = unquote(t)
    if "out_links" in dict_articles[s].keys(): # add a new out link
        dict_articles[s]["out_links"].append(t) 
    else:
        dict_articles[s]["out_links"] = [t]

In [9]:
# print (dict_articles["Áedán_mac_Gabráin"])

In [10]:
# your code here

In [11]:
import networkx as nx
from networkx.algorithms.link_analysis.pagerank_alg import pagerank

# first, we have to create the edges and the nodes
edges = []
nodes = []

# iterate over all the items in the dictionary
# node = k
# edges = (k,link) for link in v["out_links"]
for k, v in dict_articles.items():
  nodes.append(k)
  
  # not all articles have the "out_links" key
  if "out_links" in v.keys():
    for link in v["out_links"]:
      edges.append((k,link))

# create a directed graph
network = nx.DiGraph()

# define nodes and edges on the network
network.add_nodes_from(nodes)
network.add_edges_from(edges)

p = nx.pagerank(network, max_iter=100)

In [12]:
# sort in descending order
ordered_scores = sorted(list(p.items()), key=lambda x : -x[1])
ordered_scores[:10]

[('United_States', 0.00956180652731311),
 ('France', 0.0064200413810133585),
 ('Europe', 0.006337014005458885),
 ('United_Kingdom', 0.006232394913963077),
 ('English_language', 0.004862980440047761),
 ('Germany', 0.00482224267836269),
 ('World_War_II', 0.0047226367934437305),
 ('England', 0.0044723357530703466),
 ('Latin', 0.004422148441338466),
 ('India', 0.004033922521194668)]
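To make the scores above less of a black box, here is a minimal power-iteration sketch of PageRank on a toy graph. It is not the `networkx` implementation: the function name, the toy graph, and the uniform handling of pages without outgoing links are assumptions chosen for illustration, with a damping factor of 0.85.

```python
def pagerank_power_iteration(out_links, damping=0.85, tol=1e-8, max_iter=100):
    """out_links maps each node to the list of nodes it links to."""
    nodes = sorted(set(out_links) | {t for ts in out_links.values() for t in ts})
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}

    for _ in range(max_iter):
        # rank held by pages without outgoing links is spread uniformly
        dangling = sum(rank[v] for v in nodes if not out_links.get(v))
        new_rank = {v: (1.0 - damping) / n + damping * dangling / n for v in nodes}
        # every page splits its damped rank equally among its out-links
        for source, targets in out_links.items():
            if targets:
                share = damping * rank[source] / len(targets)
                for target in targets:
                    new_rank[target] += share
        change = sum(abs(new_rank[v] - rank[v]) for v in nodes)
        rank = new_rank
        if change < tol:
            break
    return rank


# toy graph (made up): C is linked by A, B and D
toy_graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
scores = pagerank_power_iteration(toy_graph)
best_page = max(scores, key=scores.get)
print(best_page, sum(scores.values()))
```

The scores form a probability distribution over pages, which is why the values in the ranking above are all small: the total mass of 1 is shared among thousands of articles.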

### **Question 2: Wikipedia pages indexing**

Create a new index in ElasticSearch and index the Wikipedia pages (along with their content). The content of each page can be found at `plaintext_articles/QUOTED_ID_OF_THE_DOC.txt`

NB: the PageRank score must be a field of each indexed document.


In [13]:
%%capture
! wget https://github.com/MorenoLaQuatra/DeepNLP/raw/main/practices/P3/plaintext_articles.zip
! unzip plaintext_articles.zip

In [14]:
# your code here

In [15]:
# es.indices.delete(index='wikipedia', ignore=[400, 404])

In [16]:
from elasticsearch import Elasticsearch

es = Elasticsearch()
res = es.indices.create(index='wikipedia', ignore=400)


for k, v in dict_articles.items():
  
  text = open("plaintext_articles/" + v["quoted_ID"] + ".txt").read()
  v["text"] = text.replace("\n", " ")
  v["pagerank"] = p[k]


# index all the updated elements
for k,v in dict_articles.items():
  res = es.index(index="wikipedia", id=v["ID"], body = v)
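The loop above sends one HTTP request per document. A possible alternative is the bulk helper shipped with the Python client, which expects an iterable of actions. The sketch below only builds those actions on made-up data; `bulk_actions` and `toy_articles` are hypothetical names, and the call to the server is left as a comment because it needs a running instance.

```python
def bulk_actions(articles, index_name):
    """Yield one bulk action per article, in the format used by elasticsearch.helpers.bulk."""
    for doc_id, doc in articles.items():
        yield {"_index": index_name, "_id": doc_id, "_source": doc}


# toy stand-in for dict_articles (values are made up for illustration)
toy_articles = {
    "Rome": {"ID": "Rome", "text": "Rome is the capital of Italy.", "pagerank": 0.5},
    "Turin": {"ID": "Turin", "text": "Turin is a city in Piedmont.", "pagerank": 0.5},
}
actions = list(bulk_actions(toy_articles, "wikipedia"))
print(len(actions), actions[0]["_id"])

# With a running server, the real dictionary would be sent as:
#   from elasticsearch import helpers
#   helpers.bulk(es, bulk_actions(dict_articles, "wikipedia"))
```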

In [17]:
from prettytable import PrettyTable

def print_res(res):
  print(f"# Results = {res['hits']['total']['value']}")

  x = PrettyTable()
  x.field_names = ["ID", "Score", "Pagerank"] 

  for r in res["hits"]["hits"]:
      x.add_row([r["_source"]["ID"], r["_score"], r["_source"]["pagerank"]])

  print(x)

### **Question 3: Querying ElasticSearch**

Perform a query using ElasticSearch. Look for your favorite content (choose and report 3 of them) on the full text of the articles.

E.g.:
- query 1 : "The capital of Italy" (surprised by the result?)

In [None]:
# Your code here

In [18]:
query_body = {
    "query" : {
          "match" : { "text" : "The capital of Italy"}
    }
}

res = es.search(index='wikipedia', body=query_body)
print_res(res)

# Results = 4604
+------------------+-----------+------------------------+
|        ID        |   Score   |        Pagerank        |
+------------------+-----------+------------------------+
|      Turin       | 6.4071617 | 0.0001972710679501955  |
|       Rome       |  6.325004 | 0.0016184558611025222  |
|    San_Marino    | 6.1934733 | 0.0002833258882520379  |
|      Harare      | 6.0852537 | 6.195756206059539e-05  |
|   Ouagadougou    |  6.076568 | 7.352589008326316e-05  |
|     Sarajevo     | 6.0530405 | 0.00017634954023473427 |
|      Milan       |  5.998312 | 0.0005573211244109519  |
|     Croatia      |  5.932445 | 0.0006357112448291552  |
| Byzantine_Empire | 5.9303164 | 0.0012913446974865692  |
|    Montenegro    |  5.907314 |  0.000441054799573285  |
+------------------+-----------+------------------------+
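The large hit count is probably explained by the default behaviour of the `match` query: the analyzed terms are combined with OR, so a document matches as soon as it contains any of them, including very common words such as "the" and "of". Two stricter variants can be tried for comparison. The helper names below are hypothetical; the query bodies use the standard `operator` option and the `match_phrase` query.

```python
def match_all_terms(field, text):
    """match query that requires every analyzed term to be present."""
    return {"query": {"match": {field: {"query": text, "operator": "and"}}}}


def match_exact_phrase(field, text):
    """match_phrase query: the terms must appear in this order, next to each other."""
    return {"query": {"match_phrase": {field: text}}}


strict_query = match_all_terms("text", "The capital of Italy")
phrase_query = match_exact_phrase("text", "capital of Italy")
print(strict_query)
print(phrase_query)

# With the notebook's client and helper:
#   res = es.search(index="wikipedia", body=phrase_query)
#   print_res(res)
```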


### **Question 4: integrating pagerank scores**

Create a template query that includes the PageRank score when computing the relevance score (`_score`).

Use the [Script score](https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-function-score-query.html#function-script-score) to generate a hybrid score (`_score + pagerank_score * 250`).

Run the same set of queries with this modification. Does it change the results?



In [None]:
# Your code here

In [19]:
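# Note: function_score combines the script result with the query score using
# boost_mode "multiply" by default, so the scores printed below equal
# _score * (_score + 250 * pagerank). Add "boost_mode": "replace" next to
# "script_score" to get exactly _score + 250 * pagerank.
query_body = {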
    "query" : {
        "function_score" : {
            "query" : {
                "match" : {"text" : "The capital of Italy"}
            },
            "script_score" : {
                "script" : {
                  "source" : "if (doc.containsKey('pagerank')) {return _score + 250.0 * doc['pagerank'].value} else {return _score}"
                }
            }
        }
    }
}

res = es.search(index='wikipedia', body=query_body)
print_res(res)

# Results = 4604
+------------------+-----------+------------------------+
|        ID        |   Score   |        Pagerank        |
+------------------+-----------+------------------------+
|       Rome       |  42.56486 | 0.0016184558611025222  |
|      Turin       | 41.367706 | 0.0001972710679501955  |
|    San_Marino    | 38.797806 | 0.0002833258882520379  |
|      Harare      |  37.12457 | 6.195756206059539e-05  |
| Byzantine_Empire | 37.083176 | 0.0012913446974865692  |
|   Ouagadougou    | 37.036377 | 7.352589008326316e-05  |
|     Sarajevo     | 36.906162 | 0.00017634954023473427 |
|      Milan       | 36.815495 | 0.0005573211244109519  |
|     Croatia      | 36.136734 | 0.0006357112448291552  |
|    Montenegro    | 35.547718 |  0.000441054799573285  |
+------------------+-----------+------------------------+
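The scores in this table are much larger than `_score + 250 * pagerank` would give. They are consistent with the default `boost_mode` of `function_score`, which multiplies the script result by the original query score. The arithmetic can be checked offline with the values reported for the plain `match` query; the helper name below is hypothetical.

```python
def hybrid_score(bm25, pagerank, weight=250.0):
    """Additive hybrid score asked for in the question."""
    return bm25 + weight * pagerank


# (_score, pagerank) copied from the plain match query results
plain_results = {
    "Turin": (6.4071617, 0.0001972710679501955),
    "Rome": (6.325004, 0.0016184558611025222),
}

replaced_scores = {}    # what boost_mode "replace" would return
multiplied_scores = {}  # what the default boost_mode "multiply" returns
for page, (bm25, pr) in plain_results.items():
    replaced_scores[page] = hybrid_score(bm25, pr)
    multiplied_scores[page] = bm25 * replaced_scores[page]
    print(page, round(replaced_scores[page], 4), round(multiplied_scores[page], 4))
```

In both modes Rome overtakes Turin for this query, but only the additive form keeps the PageRank term on the scale intended by the weight of 250.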


### **Question 5: integrate semantic dense-vectors**

Generate a new index ("wiki-semantic-search") that includes all the information of the previous one plus an additional field containing a BERT-based embedding vector of the full text of the article (the `text` field in this notebook). Once indexing is completed, repeat the same queries for a qualitative evaluation of the IR system.

**Some hints below:**
- Use Sentence-BERT pretrained encoders (www.sbert.net). Choose the most suitable pretrained model (trade off between speed and accuracy). E.g., `multi-qa-MiniLM-L6-cos-v1`
- Use cosine similarity to compute the similarity between queries and full text of the article.
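As a reminder of what the cosine similarity hint computes, here is a minimal pure-Python sketch on toy 2-D vectors (real Sentence-BERT embeddings have a few hundred dimensions). The `+ 1.0` shift is included because, to my understanding, Elasticsearch does not accept negative scores from a script; shifting maps the range [-1, 1] to [0, 2].

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


query_vec = [1.0, 0.0]
same_direction = [2.0, 0.0]   # same direction, different length
orthogonal = [0.0, 3.0]       # unrelated direction
opposite = [-1.0, 0.0]        # opposite direction

shifted = [cosine_similarity(query_vec, d) + 1.0
           for d in (same_direction, orthogonal, opposite)]
print(shifted)
```

Only the direction of the vectors matters, not their length, which is why the first document gets the maximum score even though it is twice as long as the query vector.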

In [None]:
# Your code here

In [21]:
%%capture
!pip install sentence-transformers
!pip3 install Cython

In [None]:
from sentence_transformers import SentenceTransformer

sbert = SentenceTransformer("multi-qa-MiniLM-L6-cos-v1")

In [24]:
# we need a list of sentences, therefore we have to
# iterate the first time over the dictionary and
# get all the texts

full_text_list = [ v["text"] for k, v in dict_articles.items()]

encodings = sbert.encode(full_text_list)

for idx, (k, v) in enumerate(dict_articles.items()):
  v["encoding"] = encodings[idx].tolist()

In [28]:
# create mapping

dense_dim = len(encodings[0])

index_properties = {}
index_properties['settings']={ "number_of_shards": 2, "number_of_replicas": 1}
index_properties['mappings']={ "dynamic": "true", "_source": { "enabled": "true" }, "properties": {}}
for t in ['ID', 'quoted_ID', 'text']: 
    index_properties['mappings']['properties'][t]={ "type": "text" }
for t in ['pagerank']: 
    index_properties['mappings']['properties'][t]={ "type": "float" }
for d in ["encoding"]: 
    index_properties['mappings']['properties'][d]={ "type": "dense_vector", "dims": dense_dim }

In [30]:
from elasticsearch import Elasticsearch

es = Elasticsearch()
es.indices.create(index="wiki-semantic-search", body=index_properties)

# upload the updated values
for k,v in dict_articles.items():

  # we have to follow the index structure
  curr_body = {
      "ID" : v["ID"],
      "quoted_ID" : v["quoted_ID"],
      "text" : v["text"],
      "pagerank" : v["pagerank"],
      "encoding" : v["encoding"]
  }

  res = es.index(index="wiki-semantic-search", id=v["ID"], body = curr_body)

In [39]:
# encode the query once and convert it to a plain list so it can be sent as JSON
query_vector = sbert.encode("The capital of Italy").tolist()

query_body = {
    "query" : {
        "function_score" : {
            "query" : {
                "match" : {"text" : "The capital of Italy"}
            },
            "script_score" : {
                "script" : {
                  "params" : {"encoding" : query_vector},
                  "source" : "cosineSimilarity(params.encoding, doc['encoding']) + 1.0" # +1 in order to avoid negative values
                }
            }
        }
    }
}


res = es.search(index="wiki-semantic-search", body=query_body)

# The index name must match the one created above ("wiki-semantic-search").
# Indexing into a different name lets Elasticsearch auto-create an index with a
# dynamic mapping, where "encoding" is not a dense_vector and cosineSimilarity fails.

## Content-based Recommender Systems

A recommender system is a subclass of information filtering system that seeks to predict the "rating" or "preference" a user would give to an item. (source: [Wikipedia](https://en.wikipedia.org/wiki/Recommender_system))

In this part of the practice you will build an unsupervised, text-based recommendation system (**content**-based only). The final goal is similar to that of an IR search engine; the main difference lies in **how you define the "queries".**

The tools at your disposal are:
1. `Sentence-BERT model`: should be used to obtain a vector representation of the input data.
2. `ElasticSearch`: can be used for indexing movie information and to perform **fast** similarity search.

For the recommendation system you need the following information:
- Movie's title
- Movie's plot
- Plot's embedding vector

The dataset used for this goal is: [Wikipedia Movie Plots](https://www.kaggle.com/jrobischon/wikipedia-movie-plots). For this practice you will use a truncated version of the data collection to reduce runtime.

In [40]:
! wget https://raw.githubusercontent.com/MorenoLaQuatra/DeepNLP/main/practices/P3/wiki_plots_2005onward.csv
import pandas as pd
df_movies = pd.read_csv("wiki_plots_2005onward.csv")

--2021-11-22 18:24:23--  https://raw.githubusercontent.com/MorenoLaQuatra/DeepNLP/main/practices/P3/wiki_plots_2005onward.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 45936814 (44M) [text/plain]
Saving to: ‘wiki_plots_2005onward.csv’


2021-11-22 18:24:24 (135 MB/s) - ‘wiki_plots_2005onward.csv’ saved [45936814/45936814]



### **Question 6: movie encodings**

Use Sentence-BERT model to encode movie plots into fixed-size vectors.

NB: the vector dimension is dependent on the choice of the pretrained model.

In [None]:
! pip install sentence-transformers

In [46]:
# Your code here

from sentence_transformers import SentenceTransformer

sbert = SentenceTransformer("multi-qa-MiniLM-L6-cos-v1")

plot_list = list(df_movies.Plot)
title_list = list(df_movies.Title)

embedded_plots = sbert.encode(plot_list)

### **Question 7: ElasticSearch indexing**

Create a new ElasticSearch index (`recsys-movies`) and index all movies with their embedding vectors.



In [68]:
from elasticsearch import Elasticsearch

dense_dim = len(embedded_plots[0])

index_properties = {}
index_properties['settings']={ "number_of_shards": 2, "number_of_replicas": 1}
index_properties['mappings']={ "dynamic": "true", "_source": { "enabled": "true" }, "properties": {}}
for t in ['title', 'plot']: 
    index_properties['mappings']['properties'][t]={ "type": "text" }
for d in ["embedding"]: 
    index_properties['mappings']['properties'][d]={ "type": "dense_vector", "dims": dense_dim }

es.indices.delete(index="recsys-movies", ignore=[404])
es.indices.create(index="recsys-movies", body=index_properties)


# Your code here

for i in range(len(plot_list)):

  curr_dic = {
      "title" : title_list[i],
      "plot" : plot_list[i],
      "embedding" : embedded_plots[i].tolist()  # plain list, safe to serialize as JSON
  }

  res = es.index(index="recsys-movies", body = curr_dic)

### **Question 8: Query generation**

Create a function that accepts the following arguments:
1. `embedding_model`: Sentence-BERT model used to generate embeddings
2. `df_movies`: the dataframe containing all the movies' information
3. `movie_title`: a string containing the title of the movie the user is currently watching.

The function should look up the plot of `movie_title` in `df_movies`, encode it with `embedding_model`, and return the resulting embedding vector, which is then used as the query.




In [62]:
# Your code here

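def custom_function(embedding_model, df_movies, movie_title):
  # plots of all movies whose title matches exactly
  plots = df_movies[df_movies["Title"] == movie_title].Plot.values
  if len(plots) == 0:
    raise ValueError(f"Movie not found: {movie_title}")

  # encode a single string (the first match) so that a 1-D vector is returned
  return embedding_model.encode(plots[0])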

### **Question 8: Qualitative evaluation (your personal movie recommendation system)**

Evaluate your personal recommendation system by querying for some movies in the data collection. You need to create an elasticsearch query to use the recommendation system (see Q. 5 of this practice).

Just some examples:
1. title: Harry Potter and the Goblet of Fire
2. title: Avengers: Age of Ultron
3. title: Star Wars: The Last Jedi


In [70]:
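# Your code here
import numpy as np

title = "Star Wars: The Last Jedi"
title_encoded = custom_function(sbert, df_movies, title)
# keep a single 1-D vector and convert it to a plain list for the JSON request
query_vector = np.atleast_2d(title_encoded)[0].tolist()

query_body : dict = {
    "query" : {
        "function_score" : {
            "query" : {
                "match_all" : {}
            },
            "script_score" : {
                "script" : {
                  "params" : {"encoding" : query_vector},
                  "source" : "cosineSimilarity(params.encoding, doc['embedding']) + 1.0" # +1 in order to avoid negative values
                }
            }
        }
    }
}

res = es.search(index="recsys-movies", body=query_body)

# the vector field in this index is called "embedding" (see the mapping above), not "encoding" as in exercise 5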

### **Question 9 (Bonus)**

Rewrite the query-generation function (Question 8: Query generation) so that it takes multiple movie titles (a list of strings). Compute the average of their embedding vectors and use it to obtain recommendations. Perform a qualitative evaluation for this case (you may choose movie titles from the previous list).

In [None]:
# Your code here
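One possible starting point for the bonus question is the averaging step itself. The sketch below is a minimal pure-Python version on toy vectors; `average_embedding` is a hypothetical helper, and it assumes every input is a 1-D vector of the same dimension.

```python
def average_embedding(vectors):
    """Element-wise mean of equally sized 1-D vectors (lists or 1-D arrays)."""
    vectors = [list(v) for v in vectors]
    if not vectors:
        raise ValueError("at least one vector is required")
    dim = len(vectors[0])
    if any(len(v) != dim for v in vectors):
        raise ValueError("all vectors must have the same dimension")
    # plain Python floats, so the result can be sent as JSON
    return [sum(float(v[i]) for v in vectors) / len(vectors) for i in range(dim)]


# toy vectors standing in for the embeddings of two movie plots
toy_vectors = [[1.0, 0.0, 2.0], [3.0, 2.0, 0.0]]
query_vector = average_embedding(toy_vectors)
print(query_vector)
```

In the notebook, the inputs would be the per-title embeddings, each flattened to a 1-D vector, and the returned list would be passed as the `encoding` parameter of the script score query. Averaging treats all watched movies as equally important, which is a design choice and not the only option.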