# Similarity Modulation

Here we implement a similarity function other than BM25, which is the default in Elastic. We want you to implement a tf-idf similarity and test it with the same queries you used in phase 2, so that you can get a sense of how well your Elastic tf-idf works. Follow the instructions and fill in the code wherever it says # TODO.  <br>
If you run into any problems, you can contact me via Telegram: @mahvash_sp

In [1]:
from google.colab import drive
drive.mount('/content/drive')
!pip install elasticsearch

Mounted at /content/drive

In [2]:
from elasticsearch import Elasticsearch, helpers
import json
import warnings



In [3]:
# import data in json format
file_name = '/content/drive/MyDrive/IR_P3/IR_data_news_12k.json'

with open(file_name) as f:
    data = json.load(f)

In [4]:
# Filter warnings
warnings.filterwarnings('ignore')

In [5]:
# data keys
data['0'].keys()

dict_keys(['title', 'content', 'tags', 'date', 'url', 'category'])

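After starting Elasticsearch, connect to it with the following piece of code. A local instance on your PC listens on localhost:9200 by default; the cell below connects to an Elastic Cloud deployment instead, using its cloud ID and password.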


In [6]:
# Connect to Elastic (an Elastic Cloud deployment in this notebook)
ELASTIC_PASSWORD = "6RcaBqaCzfoPaoERT95g1Jwl"

# Found in the 'Manage Deployment' page
CLOUD_ID = "th3amirrj:dXMtY2VudHJhbDEuZ2NwLmNsb3VkLmVzLmlvOjQ0MyQzOTBjODA5YjIwOGI0NGI3OGZlNWI4Y2JhMjhiMzViOSRjZjUxOGIyYmQ4OTk0NDJmOGJlNzBhZTIxOTY0ZDQ5NA=="

# Create the client instance
es = Elasticsearch(
    cloud_id=CLOUD_ID,
    basic_auth=("elastic", ELASTIC_PASSWORD)
)

## Create tf-idf Index

### Create Index

In [7]:
# Name of index 
import random
sm_index_name = f"tfidf_index_{random.randint(3001, 4999)}"

In [8]:
# Delete index if one does exist
if es.indices.exists(index=sm_index_name):
    es.indices.delete(index=sm_index_name)

# Create index    
es.indices.create(index=sm_index_name)

ObjectApiResponse({'acknowledged': True, 'shards_acknowledged': True, 'index': 'tfidf_index_4373'})

### Add documents

Here we use the bulk doc formatter that was introduced in the first subsection of phase 3. <br>
You can find out more [here](https://stackoverflow.com/questions/61580963/insert-multiple-documents-in-elasticsearch-bulk-doc-formatter).
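The shape of the actions that `bulk` expects can be checked without a running cluster. The sketch below uses a hypothetical helper, `make_actions` (not part of the original notebook), that yields one action per document instead of building the full list in memory. The sample documents and the index name are made up for illustration.

```python
def make_actions(index_name, docs):
    """Yield one bulk action per (doc_id, doc) pair of a dict of documents."""
    for doc_id, doc in docs.items():
        yield {
            "_index": index_name,  # target index
            "_id": doc_id,         # reuse the key from the JSON file as the document id
            "_source": doc,        # the document body itself
        }


# Tiny made-up sample, only to show the shape of the actions
sample_docs = {
    "0": {"title": "t0", "content": "c0"},
    "1": {"title": "t1", "content": "c1"},
}
sample_actions = list(make_actions("demo_index", sample_docs))

# With a live client this would be: bulk(es, make_actions(sm_index_name, data))
```

Passing a generator is a design choice that keeps memory use flat for large collections; for the 12k documents here, a list works just as well.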

In [9]:

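from elasticsearch.helpers import bulk

def bulk_sync():
    # Build one bulk action per document, keyed by its id in the JSON file
    actions = [
        {
            '_index': sm_index_name,
            '_id': doc_id,
            '_source': doc
        } for doc_id, doc in data.items()
    ]
    # Send all actions to Elastic through the bulk helper
    bulk(es, actions)
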

In [10]:
# run the function to add documents
bulk_sync()

In [11]:
# Check index
es.count(index = sm_index_name)

ObjectApiResponse({'count': 12202, '_shards': {'total': 1, 'successful': 1, 'skipped': 0, 'failed': 0}})

### Configuring a similarity

To configure a new similarity function, you change the similarity through the settings API of the index. In Python this is done with the `put_settings` function. We replace the 'default' similarity of the index so that Elastic scores with our similarity instead of BM25. The type of this similarity is set to 'scripted' because tf-idf is no longer among the pre-defined similarity functions in Elastic. Since the similarity is scripted, its source code must be written **by you** and passed to it.<br>
> For the changes to be applied, we first close the index, then change the settings, and then reopen it.<br>

Write the tf-idf code as a string and pass it as the value of the "source" key. <br>
You can find the variables available to your code [here](https://www.elastic.co/guide/en/elasticsearch/painless/current/painless-similarity-context.html).
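Before writing the Painless string, it can help to pin down the formula in plain Python. The sketch below assumes one common tf-idf variant, tf = log(1 + freq) and idf = log(docCount / docFreq); the function name `tfidf_score` and the numbers are made up for illustration. Two points carry over to the script: it has to `return` the final score, and since `field.docCount` and `term.docFreq` are integer (`long`) values in the similarity context, dividing them directly truncates unless one side is converted to a double.

```python
import math


def tfidf_score(freq, doc_count, doc_freq, boost=1.0):
    """Reference tf-idf: boost * log(1 + freq) * log(doc_count / doc_freq)."""
    tf = math.log(1 + freq)
    idf = math.log(doc_count / doc_freq)  # true (floating-point) division
    return boost * tf * idf


# Made-up numbers: the term occurs 3 times in the document,
# and 50 of the 12202 indexed documents contain it
score = tfidf_score(freq=3, doc_count=12202, doc_freq=50)

# A term that appears in every document contributes nothing
score_common_term = tfidf_score(freq=3, doc_count=100, doc_freq=100)
```

Because the document frequency can never exceed the document count, the idf factor is never negative, which matters because scripted similarities must not produce negative scores.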

In [12]:
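# TODO : uncomment the code below, write the tf-idf code in here
# The script must return the score; 1.0 * forces floating-point division of the two counts
source_code = "double tf = Math.log(1 + doc.freq); double idf = Math.log(1.0 * field.docCount / term.docFreq); return query.boost * tf * idf;"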

In [13]:
# closing the index
es.indices.close(index=sm_index_name)

# applying the settings
es.indices.put_settings(
    index=sm_index_name,
    settings={
        "similarity": {
            "default": {
                "type": "scripted",
                "script": {
                    # TODO : uncomment the code below and pass the suitable parameter
                    "source": source_code
                }
            }
        }
    }
)

# reopening the index
es.indices.open(index=sm_index_name)

ObjectApiResponse({'acknowledged': True, 'shards_acknowledged': True})
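If the settings update fails (for example because the script does not compile), the index stays closed and later searches fail. A minimal sketch of a safer wrapper is shown below; `apply_similarity` is a hypothetical helper, and `client` is assumed to expose the same `indices.close`, `indices.put_settings` and `indices.open` calls used in the cell above.

```python
def apply_similarity(client, index_name, source):
    """Close the index, install a scripted default similarity, then reopen it."""
    client.indices.close(index=index_name)
    try:
        client.indices.put_settings(
            index=index_name,
            settings={
                "similarity": {
                    "default": {
                        "type": "scripted",
                        "script": {"source": source},
                    }
                }
            },
        )
    finally:
        # Reopen even if the update raised, so the index stays searchable
        client.indices.open(index=index_name)
```

With the objects from this notebook, the call would be `apply_similarity(es, sm_index_name, source_code)`.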

### Query

In this section you have to test your index with the same queries you tested in phase 2. The goal here is to observe how different or similar the results of your Elastic tf-idf implementation are.
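One simple way to quantify "how different or similar" two result lists are is the Jaccard overlap of their top-k items. The sketch below is only an illustration: `overlap_at_k` is a hypothetical helper, and the URL lists are made-up stand-ins for your phase 2 results and the Elastic results.

```python
def overlap_at_k(ranking_a, ranking_b, k=10):
    """Jaccard overlap between the top-k items of two rankings (1.0 = same set)."""
    top_a, top_b = set(ranking_a[:k]), set(ranking_b[:k])
    if not top_a and not top_b:
        return 1.0  # two empty result lists are trivially identical
    return len(top_a & top_b) / len(top_a | top_b)


# Made-up identifiers standing in for real result URLs
phase2_urls = ["u1", "u2", "u3", "u4"]
elastic_urls = ["u2", "u1", "u5", "u6"]
similarity = overlap_at_k(phase2_urls, elastic_urls, k=4)
```

This measure ignores the order inside the top k; a rank-aware comparison would need a different metric.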

In [14]:
# A function that creates the request body for a match query on the content field
def get_query(text):
    body = {
        "query": {
            "match": {
                "content": text
            }
        }
    }

    return body

In [15]:
queries = [
    #TODO : add your queries in string format to this list
    "استهلاک"
]

In [16]:
all_res_tfidf = []


for q in queries:
    res_tfidf = es.search(index=sm_index_name, body=get_query(q), explain=True)
    all_res_tfidf.append(dict(res_tfidf))

In [17]:
for res, q in zip(all_res_tfidf, queries):
    print(q)
    for doc in res['hits']['hits']:
        print(doc['_source']['url'])
    print("----------------------------")

استهلاک
https://www.farsnews.ir/news/14001022000292/تعییین-مصادیق-ایجاد-شبکه-شرکت‌های-دانش-بنیان-با-اولویت-تولید-اقلام
https://www.farsnews.ir/news/14001012000305/ریزش-انقلابی‌ها-داریم-اما-ریزش-انقلاب-نه-باید-با-همگرایی-اختلافات-را
https://www.farsnews.ir/news/14001008000652/قاچاق-در-ایران-چگونه-انجام-می-شود
https://www.farsnews.ir/news/14000921000054/رئیسی-برای-مشکلات-نیازمند-برنامه-جهادی-هستیم-خطوط-قرمز-دولت-در-بودجه
----------------------------
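
The printed URLs hide the scores that the similarity produced. A small sketch for pairing each URL with its `_score` follows; `summarize_hits` is a hypothetical helper, and the response dict is hand-written with made-up values that only mimic the structure returned by `es.search`. Since the searches here are run with `explain=True`, each real hit should also carry an `_explanation` entry describing how its score was computed.

```python
def summarize_hits(response):
    """Return (url, score) pairs from a search response dict, in ranked order."""
    return [
        (hit["_source"]["url"], hit["_score"])
        for hit in response["hits"]["hits"]
    ]


# Hand-written response with made-up values, mimicking the es.search structure
fake_response = {
    "hits": {
        "hits": [
            {"_score": 2.5, "_source": {"url": "https://example.com/a"}},
            {"_score": 1.0, "_source": {"url": "https://example.com/b"}},
        ]
    }
}
pairs = summarize_hits(fake_response)
```

On the real results, `summarize_hits(all_res_tfidf[0])` would give the pairs for the first query.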


In [18]:
queries = [
    #TODO : add your queries in string format to this list
    "صکوک استهلاک"
]

In [19]:
all_res_tfidf = []


for q in queries:
    res_tfidf = es.search(index=sm_index_name, body=get_query(q), explain=True)
    all_res_tfidf.append(dict(res_tfidf))

In [20]:
for res, q in zip(all_res_tfidf, queries):
    print(q)
    for doc in res['hits']['hits']:
        print(doc['_source']['url'])
    print("----------------------------")

صکوک استهلاک
https://www.farsnews.ir/news/14001203000680/کلیات-طرح-حمایت-از-کاربران-فضای-مجازی-در-کمیسیون-ویژه-تصویب-شد
https://www.farsnews.ir/news/14001022000292/تعییین-مصادیق-ایجاد-شبکه-شرکت‌های-دانش-بنیان-با-اولویت-تولید-اقلام
https://www.farsnews.ir/news/14001012000305/ریزش-انقلابی‌ها-داریم-اما-ریزش-انقلاب-نه-باید-با-همگرایی-اختلافات-را
https://www.farsnews.ir/news/14001008000652/قاچاق-در-ایران-چگونه-انجام-می-شود
https://www.farsnews.ir/news/14000921000054/رئیسی-برای-مشکلات-نیازمند-برنامه-جهادی-هستیم-خطوط-قرمز-دولت-در-بودجه
----------------------------
