# Challenges

- Without more context, a search on the page title alone may return irrelevant results
    - Could start from *seed* references and then add further references based on those (e.g. keep only references that agree with the seed references in some way?)
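
One possible reading of the seed-reference idea, as a rough sketch: keep a candidate snippet only if it shares enough vocabulary with at least one seed snippet. The helper names (`tokens`, `jaccard`, `filter_by_seeds`), the choice of Jaccard overlap, and the 0.2 threshold are all assumptions made for illustration; the notes above do not settle on a particular notion of "agreement".

```python
import re


def tokens(text):
    """Return the set of lowercase word tokens in a snippet."""
    return set(re.findall(r"\w+", text.lower()))


def jaccard(a, b):
    """Jaccard overlap between two token sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def filter_by_seeds(candidates, seeds, threshold=0.2):
    """Keep candidates whose best overlap with any seed meets the threshold.

    The threshold is a hypothetical value and would need tuning.
    """
    seed_tokens = [tokens(s) for s in seeds]
    kept = []
    for cand in candidates:
        cand_tokens = tokens(cand)
        best = max((jaccard(cand_tokens, s) for s in seed_tokens), default=0.0)
        if best >= threshold:
            kept.append(cand)
    return kept
```

Token overlap is a crude proxy for agreement; an entity-based or embedding-based comparison could be swapped in behind the same `filter_by_seeds` interface.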

In [13]:
from dotenv import load_dotenv
load_dotenv()

import os
import itertools
import json 

from azure.cognitiveservices.search.websearch import WebSearchClient
from azure.cognitiveservices.search.websearch.models import SafeSearch
from msrest.authentication import CognitiveServicesCredentials

# Bing API setup

In [14]:
subscription_key = os.getenv("COGNITIVE_SERVICE_KEY")
endpoint = os.getenv("COGNITIVE_ENDPOINT")

In [15]:
client = WebSearchClient(endpoint=endpoint, credentials=CognitiveServicesCredentials(subscription_key))

In [16]:
web_data = client.web.search(query="Yosemite", count=10)

Subtype value Organization has no mapping, use base class Thing.


In [17]:
web_data.web_pages.value[0].url

'https://www.nps.gov/yose/index.htm'

# Atomic Edits

In [20]:
data_path = '/mnt/nlp-storage/data/processed/atomic-edits/atomic-edits-04192020.json'

In [21]:
# The file holds one JSON object per line, so parse it line by line
with open(data_path, 'r') as data:
    edits = [json.loads(line) for line in data]

In [25]:
page_title, sect_title = edits[0]['page_title'], edits[0]['section_title']
(page_title, sect_title)

('Chhota Katra', 'Interior')

In [40]:
print(' '.join(edits[0]['target']['context']))

Inside, there is a tomb of "Champa Bibi", but there is no correct history regarding her identity. There was a small mosque within its enclosure which is ruined. The one-dome square Mausoleum of "Champa Bibi", a listed building now, was within its enclosure which was raised to the ground by Padre Shepherd.


In [42]:
print(edits[0]['source']['sentence'],'\n', edits[0]['target']['sentence'])

It was later reconstructed by the archaeologists, but now lost within mazes of shops at "Champatali". 
 It was later reconstructed by the archaeologists, but is now lost within mazes of shops at "Champatali".


In [43]:
bing_search = client.web.search(query=page_title, count=10)

In [45]:
for res in bing_search.web_pages.value:
    print(res.name)
    print(res.url)
    print(res.snippet + '\n')

Chhota Katra - Wikipedia
https://en.wikipedia.org/wiki/Chhota_Katra
Chhota Katra is slightly smaller than Bara Katra, but similar in plan and it is about 185 metres east to it. The ruins of Chhota Katra, amidst urban encroachment. Origin. Katara is a form of cellular dormitory built around an oblong courtyard; the form ...

Chhota Katra - newikis.com
https://newikis.com/en/Chhota_Katra
Chhota Katra (Bengali: ছোট কাটারা; Small Katra) is one of the two Katras built during Mughal's regime in Dhaka, Bangladesh.It was constructed in 1663 by Subahdar Shaista Khan.It is on Hakim Habibur Rahman lane on the bank of the Buriganga River.Basically it was built to accommodate some officials and Shaista Khan's expanding family.

Chhota Katra | Archnet
https://archnet.org/sites/4128
Chhota Katra is located on Hakim Habibur Rahman Lane and is 600 feet east of the historic Bara Katra caravanserai. Commissioned by Nawab Shaita...

Chhota Katra is a Muslim Arcological Heritage in Old Dhaka
https://www.yo

In [46]:
bing_search_st = client.web.search(query=' '.join((page_title, sect_title)), count=10)

In [47]:
for res in bing_search_st.web_pages.value:
    print(res.name)
    print(res.url)
    print(res.snippet)

Chhota Katra - Wikipedia
https://en.wikipedia.org/wiki/Chhota_Katra
Chhota Katra (Bengali: ছোট কাটারা; Small Katra) is one of the two Katras built during Mughal's regime in Dhaka, Bangladesh.It was constructed in 1663 by Subahdar Shaista Khan.It is on Hakim Habibur Rahman lane on the bank of the Buriganga River.Basically it was built to accommodate some officials and Shaista Khan's expanding family.
Chhota Katra - newikis.com
https://newikis.com/en/Chhota_Katra
Chhota Katra (Bengali: ছোট কাটারা; Small Katra) is one of the two Katras built during Mughal's regime in Dhaka, Bangladesh.It was constructed in 1663 by Subahdar Shaista Khan.It is on Hakim Habibur Rahman lane on the bank of the Buriganga River.Basically it was built to accommodate some officials and Shaista Khan's expanding family.
Dhaka - Historic Pictures and Photos, with Notes on ...
https://www.skyscrapercity.com/threads/dhaka-historic-pictures-and-photos-with-notes-on-architectural-conservation.436396/
The Chhota Katra is 

# Potential Issues to Address

- Sites that duplicate Wikipedia content
    - Could drop results whose domain contains the word "wiki"
- Incomplete (truncated) snippets
    - May not be an issue, but the full text could also be retrieved from the URL
- Irrelevant results
    - Maybe this is a downstream concern: not all information in the grounding corpus will be relevant to each edit
    - What happens when the page is very short? There is not much to go on when deciding what information is relevant
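
The "wiki" domain filter mentioned in the list above could look roughly like this. The function names are hypothetical, and the substring rule is a deliberately blunt heuristic: it checks only the host name, so a URL with "wiki" elsewhere in its path is kept, while any host containing "wiki" is dropped whether or not it actually mirrors Wikipedia.

```python
from urllib.parse import urlparse


def is_wiki_domain(url):
    """True when the host name of the URL contains the substring 'wiki'."""
    host = urlparse(url).netloc.lower()
    return "wiki" in host


def drop_wiki_results(urls):
    """Return the URLs whose host name does not contain 'wiki'."""
    return [u for u in urls if not is_wiki_domain(u)]


urls = [
    "https://en.wikipedia.org/wiki/Chhota_Katra",
    "https://newikis.com/en/Chhota_Katra",
    "https://archnet.org/sites/4128",
]
filtered = drop_wiki_results(urls)
```

On the three result URLs shown earlier, this keeps only the Archnet link, since both `en.wikipedia.org` and `newikis.com` contain "wiki" in the host name.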

# Advantages of Bing API

- Can form advanced queries to leverage Bing capabilities
- Relatively fast (and cheap), and provides snippets
- Can be treated as a "pre" information retrieval step
- Reproducible in two senses: we can release our dataset, and others can build on our work by using the Bing API to make their own datasets (debatable, since Bing is a Microsoft product, but this approach should work with any other search engine too)
    
 
- Bing API tiers (TPS = transactions per second)
    - Free: 3 TPS, 1000 transactions per month
    - S1: 250 TPS, \$7 per 1000 transactions
    - S2: 100 TPS, \$3 per 1000 transactions
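
Because each tier caps the request rate, a bulk run needs some client-side throttling. The sketch below is one simple way to do that; the `Throttle` class is hypothetical (it is not part of the Azure SDK) and it assumes that spacing requests evenly is enough to stay under the cap.

```python
import time


class Throttle:
    """Space out calls so that at most `rate` calls start per second."""

    def __init__(self, rate, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 1.0 / rate
        self.clock = clock  # injectable so the class can be tested without waiting
        self.sleep = sleep
        self.last_call = None

    def wait(self):
        """Block until at least min_interval has passed since the previous call."""
        now = self.clock()
        if self.last_call is not None:
            remaining = self.min_interval - (now - self.last_call)
            if remaining > 0:
                self.sleep(remaining)
                now = self.clock()
        self.last_call = now
```

With `throttle = Throttle(rate=3)` for the free tier, calling `throttle.wait()` before each `client.web.search(...)` in a loop would keep the request rate at or below 3 per second.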

# Number of pages to query

In [28]:
# Collect the unique page titles across all edits
page_titles = {e['page_title'] for e in edits}

In [33]:
# convert to list
page_titles = list(page_titles)

In [34]:
len(page_titles)

11936
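
A back-of-the-envelope estimate of what querying every page would cost, using the tier figures listed in these notes. It assumes exactly one query per page title and that billing is prorated per transaction; neither assumption is confirmed here, and section-level queries would multiply the counts.

```python
import math

n_pages = 11936  # unique page titles counted above

# Tier figures as listed in these notes:
# (transactions per second, USD per 1000 transactions)
paid_tiers = {"S1": (250, 7.0), "S2": (100, 3.0)}


def estimate(n_queries, tps, usd_per_1000):
    """Return (cost in USD, minimum wall-clock seconds) for n_queries."""
    cost = n_queries / 1000 * usd_per_1000
    seconds = n_queries / tps
    return cost, seconds


estimates = {
    name: estimate(n_pages, tps, price)
    for name, (tps, price) in paid_tiers.items()
}

# Free tier: 1000 transactions per month, so quota rather than price is the limit.
free_tier_months = math.ceil(n_pages / 1000)
```

Under these assumptions the full run comes to roughly \$84 on S1 or \$36 on S2, and would take 12 months of quota on the free tier.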