## About
- This notebook renews the COVID-19 datastore.
- Advisories from a specified directory are indexed into Elasticsearch.
- Articles from straitstimes, channelnewsasia and gov.sg are scraped and indexed into Elasticsearch.

In [1]:
import sys
sys.path.append(r"C:\Users\tanch\Documents\GitHub\URECA-Covid-19-Question-Answering-Research\Implementation\Datastore Manager")
from haystack.document_store.elasticsearch import ElasticsearchDocumentStore
from haystack.preprocessor.preprocessor import PreProcessor
import NewsScraper
import Preprocessor
import FileManager
import Config

In [2]:
# connect to Elasticsearch
document_store = ElasticsearchDocumentStore(username=Config.AUTH["username"],
                                            password=Config.AUTH["password"],
                                            host="localhost",
                                            port=9200,
                                            create_index=False,
                                            similarity="dot_product",
                                            search_fields=["text", "name"],
                                            text_field="text",
                                            name_field="name",
                                            embedding_field="embedding",
                                            embedding_dim=768)

## Renew advisories
- delete all existing documents in the `advisories` category from the index
- read advisories from local text files
- write the new documents into the index
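
The three steps above can be sketched as one reusable function. This is a minimal sketch under stated assumptions, not the notebook's implementation: `InMemoryStore`, `to_document` and `renew_category` are hypothetical names, the stand-in store only mimics the two `document_store` calls used in this notebook so that the example runs without Elasticsearch, and the document shape (`text` plus a `meta` field holding the category) is assumed rather than taken from `Preprocessor.text2HaystackFormat`.

```python
from pathlib import Path


class InMemoryStore:
    """Hypothetical stand-in mimicking the two document store calls used here."""

    def __init__(self):
        self.docs = []

    def delete_all_documents(self, index, filters=None):
        categories = (filters or {}).get("category")
        # Keep a document if it lives in another index, or if a category
        # filter is given and the document's category is not targeted by it.
        self.docs = [
            d for d in self.docs
            if d["index"] != index
            or (categories is not None and d["meta"]["category"] not in categories)
        ]

    def write_documents(self, documents, index):
        self.docs.extend({**d, "index": index} for d in documents)


def to_document(text, category):
    # Assumed document shape: the text plus a meta field holding the category.
    return {"text": text, "meta": {"category": category}}


def renew_category(store, folder, index, category, clean=str.strip):
    """Delete the category's documents, then index every .txt file in `folder`."""
    store.delete_all_documents(index=index, filters={"category": [category]})
    docs = [
        to_document(clean(path.read_text(encoding="utf-8")), category)
        for path in sorted(Path(folder).glob("*.txt"))
    ]
    if docs:
        store.write_documents(docs, index=index)  # one call for the whole batch
    return len(docs)
```

Collecting the documents first and writing them in one call is a design choice of this sketch; the cell below writes one document per call, which produces one bulk request per file in the log.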

In [6]:
folder = r"C:\Users\tanch\Documents\GitHub\URECA-CovidQA-Research\Implementation\Raw Data\advisories"
index = Config.INDEX_NAME
category = "advisories"

In [7]:
# delete all existing documents of this category from the index
document_store.delete_all_documents(index=index, filters={"category": [category]})

files = FileManager.getFileNames(folder)  # get filenames from the folder
for file in files:
    path = f"{folder}\\{file}"
    text = FileManager.readTextFile(path, Preprocessor.cleanText)  # read and clean the text file

    doc = Preprocessor.text2HaystackFormat(text, category=category)  # convert text into Haystack document format
    document_store.write_documents([doc], index=index)  # write the document into the Elasticsearch index


06/27/2021 22:25:19 - INFO - elasticsearch -   POST http://localhost:9200/covid_datastore/_delete_by_query [status:200 request:0.007s]
06/27/2021 22:25:20 - INFO - elasticsearch -   HEAD http://localhost:9200/covid_datastore [status:200 request:0.003s]
06/27/2021 22:25:21 - INFO - elasticsearch -   POST http://localhost:9200/_bulk?refresh=wait_for [status:200 request:0.850s]
06/27/2021 22:25:21 - INFO - elasticsearch -   HEAD http://localhost:9200/covid_datastore [status:200 request:0.003s]
06/27/2021 22:25:22 - INFO - elasticsearch -   POST http://localhost:9200/_bulk?refresh=wait_for [status:200 request:1.159s]
06/27/2021 22:25:22 - INFO - elasticsearch -   HEAD http://localhost:9200/covid_datastore [status:200 request:0.004s]
06/27/2021 22:25:23 - INFO - elasticsearch -   POST http://localhost:9200/_bulk?refresh=wait_for [status:200 request:1.213s]
06/27/2021 22:25:23 - INFO - elasticsearch -   HEAD http://localhost:9200/covid_datastore [status:200 request:0.004s]
06/27/2021 22:25:2

In [3]:
document_store.get_all_documents(index = Config.INDEX_NAME, filters = {"category":["advisories"]})[0]

06/29/2021 01:33:05 - INFO - elasticsearch -   POST http://localhost:9200/covid_datastore/_search?scroll=1d&size=10000 [status:200 request:0.055s]
06/29/2021 01:33:05 - INFO - elasticsearch -   POST http://localhost:9200/_search/scroll [status:200 request:0.003s]
06/29/2021 01:33:05 - INFO - elasticsearch -   DELETE http://localhost:9200/_search/scroll [status:200 request:0.002s]


{'text': 'Requirements for Safe Management Measures at the workplace Read the sector-specific guidelines and infographic on Safe Management Measures at the workplace. From 16 May 2021 to 13 June 2021, the Safe Management Measures for the workplace will be tightened. Previously, up to 50% of employees28 who are able to work from home could be at the workplace at any time. Now, employers must ensure that all employees who are able to work from home do so. Social gatherings at the workplace are disallowed. These measures help lower transmission risks by reducing the levels of interaction at common spaces at or near the workplace, and in public places, including public transport. Issued on 9 May 2020 Updated as of 14 May 2021 The tripartite partners (MOM, SNEF, and NTUC) have updated the workplace safe management measures to allow greater flexibility for businesses, while mitigating the risk of widespread COVID-19 transmission. Effective implementation of these measures will help to avoid 

## Renew recent COVID-19 news articles
- delete all existing documents in the `articles` category from the index
- scrape text from COVID-19-related articles
- write the new documents into the index
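
Search result pages can list the same article more than once, and a scrape can fail or return no text. Whether `NewsScraper` already handles these cases is not stated here, so the following is only a sketch of how the scraping step could guard against them. `unique_in_order` and `build_article_documents` are hypothetical helpers, `scrape` is any callable standing in for `NewsScraper.scrape`, and the `url` meta field is an assumption.

```python
def unique_in_order(urls):
    """Drop repeated URLs while keeping first-seen order."""
    return list(dict.fromkeys(urls))


def build_article_documents(urls, scrape, category="articles"):
    """Scrape each unique URL and skip pages that fail or yield no text."""
    documents = []
    for url in unique_in_order(urls):
        try:
            text = scrape(url)
        except Exception:
            continue  # one failing page should not abort the whole renewal
        if text and text.strip():
            documents.append(
                {"text": text.strip(), "meta": {"category": category, "url": url}}
            )
    return documents
```

The returned list could then be passed to the document store in a single write, so that an empty or failed scrape never reaches the index.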

In [9]:
index = Config.INDEX_NAME
category = "articles"
domains = NewsScraper.DOMAINS
keywords = NewsScraper.KEYWORDS

In [10]:
# delete all existing documents of this category from the index
document_store.delete_all_documents(index=index, filters={"category": [category]})
google_page_urls = NewsScraper.get_google_page_urls(domains, keywords)  # get the Google result page URLs

article_urls = []
for p_ in google_page_urls:
    article_urls.extend(NewsScraper.get_article_urls_from_google_page(domains, p_))  # get the article URLs from the Google page

for url in article_urls:
    text = NewsScraper.scrape(url, Preprocessor.cleanText)  # scrape and clean the article
    doc = Preprocessor.text2HaystackFormat(text, category=category)  # convert text into Haystack document format
    document_store.write_documents([doc], index=index)  # write the document into the Elasticsearch index


06/27/2021 22:25:53 - INFO - elasticsearch -   POST http://localhost:9200/covid_datastore/_delete_by_query [status:200 request:0.007s]
06/27/2021 22:26:11 - INFO - elasticsearch -   HEAD http://localhost:9200/covid_datastore [status:200 request:0.003s]
06/27/2021 22:26:11 - INFO - elasticsearch -   POST http://localhost:9200/_bulk?refresh=wait_for [status:200 request:0.394s]
06/27/2021 22:26:17 - INFO - elasticsearch -   HEAD http://localhost:9200/covid_datastore [status:200 request:0.002s]
06/27/2021 22:26:18 - INFO - elasticsearch -   POST http://localhost:9200/_bulk?refresh=wait_for [status:200 request:1.020s]
06/27/2021 22:26:24 - INFO - elasticsearch -   HEAD http://localhost:9200/covid_datastore [status:200 request:0.002s]
06/27/2021 22:26:24 - INFO - elasticsearch -   POST http://localhost:9200/_bulk?refresh=wait_for [status:200 request:0.304s]
06/27/2021 22:26:25 - INFO - elasticsearch -   HEAD http://localhost:9200/covid_datastore [status:200 request:0.003s]
06/27/2021 22:26:2

06/27/2021 22:27:13 - INFO - elasticsearch -   POST http://localhost:9200/_bulk?refresh=wait_for [status:200 request:0.511s]
06/27/2021 22:27:13 - INFO - elasticsearch -   HEAD http://localhost:9200/covid_datastore [status:200 request:0.002s]
06/27/2021 22:27:14 - INFO - elasticsearch -   POST http://localhost:9200/_bulk?refresh=wait_for [status:200 request:0.533s]
06/27/2021 22:27:15 - INFO - elasticsearch -   HEAD http://localhost:9200/covid_datastore [status:200 request:0.002s]
06/27/2021 22:27:15 - INFO - elasticsearch -   POST http://localhost:9200/_bulk?refresh=wait_for [status:200 request:0.184s]
06/27/2021 22:27:16 - INFO - elasticsearch -   HEAD http://localhost:9200/covid_datastore [status:200 request:0.002s]
06/27/2021 22:27:16 - INFO - elasticsearch -   POST http://localhost:9200/_bulk?refresh=wait_for [status:200 request:0.166s]
06/27/2021 22:27:17 - INFO - elasticsearch -   HEAD http://localhost:9200/covid_datastore [status:200 request:0.001s]
06/27/2021 22:27:18 - INFO -

06/27/2021 22:28:15 - INFO - elasticsearch -   POST http://localhost:9200/_bulk?refresh=wait_for [status:200 request:0.769s]
06/27/2021 22:28:16 - INFO - elasticsearch -   HEAD http://localhost:9200/covid_datastore [status:200 request:0.001s]
06/27/2021 22:28:17 - INFO - elasticsearch -   POST http://localhost:9200/_bulk?refresh=wait_for [status:200 request:1.130s]
06/27/2021 22:28:20 - INFO - elasticsearch -   HEAD http://localhost:9200/covid_datastore [status:200 request:0.003s]
06/27/2021 22:28:21 - INFO - elasticsearch -   POST http://localhost:9200/_bulk?refresh=wait_for [status:200 request:1.133s]
06/27/2021 22:28:23 - INFO - elasticsearch -   HEAD http://localhost:9200/covid_datastore [status:200 request:0.002s]
06/27/2021 22:28:23 - INFO - elasticsearch -   POST http://localhost:9200/_bulk?refresh=wait_for [status:200 request:0.550s]
06/27/2021 22:28:24 - INFO - elasticsearch -   HEAD http://localhost:9200/covid_datastore [status:200 request:0.002s]
06/27/2021 22:28:25 - INFO -

06/27/2021 22:29:08 - INFO - elasticsearch -   POST http://localhost:9200/_bulk?refresh=wait_for [status:200 request:0.752s]
06/27/2021 22:29:10 - INFO - elasticsearch -   HEAD http://localhost:9200/covid_datastore [status:200 request:0.002s]
06/27/2021 22:29:10 - INFO - elasticsearch -   POST http://localhost:9200/_bulk?refresh=wait_for [status:200 request:0.351s]
06/27/2021 22:29:11 - INFO - elasticsearch -   HEAD http://localhost:9200/covid_datastore [status:200 request:0.002s]
06/27/2021 22:29:12 - INFO - elasticsearch -   POST http://localhost:9200/_bulk?refresh=wait_for [status:200 request:0.378s]
06/27/2021 22:29:13 - INFO - elasticsearch -   HEAD http://localhost:9200/covid_datastore [status:200 request:0.002s]
06/27/2021 22:29:13 - INFO - elasticsearch -   POST http://localhost:9200/_bulk?refresh=wait_for [status:200 request:0.285s]
06/27/2021 22:29:13 - INFO - elasticsearch -   HEAD http://localhost:9200/covid_datastore [status:200 request:0.002s]
06/27/2021 22:29:14 - INFO -

06/27/2021 22:29:53 - INFO - elasticsearch -   POST http://localhost:9200/_bulk?refresh=wait_for [status:200 request:0.837s]
06/27/2021 22:29:53 - INFO - elasticsearch -   HEAD http://localhost:9200/covid_datastore [status:200 request:0.002s]
06/27/2021 22:29:54 - INFO - elasticsearch -   POST http://localhost:9200/_bulk?refresh=wait_for [status:200 request:0.973s]
06/27/2021 22:29:55 - INFO - elasticsearch -   HEAD http://localhost:9200/covid_datastore [status:200 request:0.002s]
06/27/2021 22:29:56 - INFO - elasticsearch -   POST http://localhost:9200/_bulk?refresh=wait_for [status:200 request:0.954s]
06/27/2021 22:29:56 - INFO - elasticsearch -   HEAD http://localhost:9200/covid_datastore [status:200 request:0.002s]
06/27/2021 22:29:57 - INFO - elasticsearch -   POST http://localhost:9200/_bulk?refresh=wait_for [status:200 request:0.905s]
06/27/2021 22:29:57 - INFO - elasticsearch -   HEAD http://localhost:9200/covid_datastore [status:200 request:0.002s]
06/27/2021 22:29:58 - INFO -

06/27/2021 22:30:34 - INFO - elasticsearch -   POST http://localhost:9200/_bulk?refresh=wait_for [status:200 request:0.948s]
06/27/2021 22:30:35 - INFO - elasticsearch -   HEAD http://localhost:9200/covid_datastore [status:200 request:0.002s]
06/27/2021 22:30:35 - INFO - elasticsearch -   POST http://localhost:9200/_bulk?refresh=wait_for [status:200 request:0.796s]
06/27/2021 22:30:36 - INFO - elasticsearch -   HEAD http://localhost:9200/covid_datastore [status:200 request:0.002s]
06/27/2021 22:30:37 - INFO - elasticsearch -   POST http://localhost:9200/_bulk?refresh=wait_for [status:200 request:0.812s]
06/27/2021 22:30:37 - INFO - elasticsearch -   HEAD http://localhost:9200/covid_datastore [status:200 request:0.001s]
06/27/2021 22:30:38 - INFO - elasticsearch -   POST http://localhost:9200/_bulk?refresh=wait_for [status:200 request:0.811s]
06/27/2021 22:30:38 - INFO - elasticsearch -   HEAD http://localhost:9200/covid_datastore [status:200 request:0.001s]
06/27/2021 22:30:39 - INFO -

06/27/2021 22:31:15 - INFO - elasticsearch -   POST http://localhost:9200/_bulk?refresh=wait_for [status:200 request:0.874s]
06/27/2021 22:31:18 - INFO - elasticsearch -   HEAD http://localhost:9200/covid_datastore [status:200 request:0.002s]
06/27/2021 22:31:18 - INFO - elasticsearch -   POST http://localhost:9200/_bulk?refresh=wait_for [status:200 request:0.324s]
06/27/2021 22:31:18 - INFO - elasticsearch -   HEAD http://localhost:9200/covid_datastore [status:200 request:0.002s]
06/27/2021 22:31:19 - INFO - elasticsearch -   POST http://localhost:9200/_bulk?refresh=wait_for [status:200 request:0.705s]
06/27/2021 22:31:19 - INFO - elasticsearch -   HEAD http://localhost:9200/covid_datastore [status:200 request:0.001s]
06/27/2021 22:31:20 - INFO - elasticsearch -   POST http://localhost:9200/_bulk?refresh=wait_for [status:200 request:0.885s]
06/27/2021 22:31:21 - INFO - elasticsearch -   HEAD http://localhost:9200/covid_datastore [status:200 request:0.002s]
06/27/2021 22:31:22 - INFO -

06/27/2021 22:31:58 - INFO - elasticsearch -   POST http://localhost:9200/_bulk?refresh=wait_for [status:200 request:1.006s]
06/27/2021 22:32:00 - INFO - elasticsearch -   HEAD http://localhost:9200/covid_datastore [status:200 request:0.002s]
06/27/2021 22:32:01 - INFO - elasticsearch -   POST http://localhost:9200/_bulk?refresh=wait_for [status:200 request:0.559s]
06/27/2021 22:32:01 - INFO - elasticsearch -   HEAD http://localhost:9200/covid_datastore [status:200 request:0.001s]
06/27/2021 22:32:02 - INFO - elasticsearch -   POST http://localhost:9200/_bulk?refresh=wait_for [status:200 request:0.984s]
06/27/2021 22:32:02 - INFO - elasticsearch -   HEAD http://localhost:9200/covid_datastore [status:200 request:0.001s]
06/27/2021 22:32:03 - INFO - elasticsearch -   POST http://localhost:9200/_bulk?refresh=wait_for [status:200 request:0.456s]
06/27/2021 22:32:03 - INFO - elasticsearch -   HEAD http://localhost:9200/covid_datastore [status:200 request:0.001s]
06/27/2021 22:32:04 - INFO -

06/27/2021 22:32:44 - INFO - elasticsearch -   POST http://localhost:9200/_bulk?refresh=wait_for [status:200 request:0.300s]
06/27/2021 22:32:44 - INFO - elasticsearch -   HEAD http://localhost:9200/covid_datastore [status:200 request:0.002s]
06/27/2021 22:32:45 - INFO - elasticsearch -   POST http://localhost:9200/_bulk?refresh=wait_for [status:200 request:0.698s]
06/27/2021 22:32:46 - INFO - elasticsearch -   HEAD http://localhost:9200/covid_datastore [status:200 request:0.001s]
06/27/2021 22:32:46 - INFO - elasticsearch -   POST http://localhost:9200/_bulk?refresh=wait_for [status:200 request:0.356s]
06/27/2021 22:32:47 - INFO - elasticsearch -   HEAD http://localhost:9200/covid_datastore [status:200 request:0.002s]
06/27/2021 22:32:48 - INFO - elasticsearch -   POST http://localhost:9200/_bulk?refresh=wait_for [status:200 request:0.800s]
06/27/2021 22:32:48 - INFO - elasticsearch -   HEAD http://localhost:9200/covid_datastore [status:200 request:0.002s]
06/27/2021 22:32:49 - INFO -

06/27/2021 22:33:30 - INFO - elasticsearch -   POST http://localhost:9200/_bulk?refresh=wait_for [status:200 request:1.147s]
06/27/2021 22:33:31 - INFO - elasticsearch -   HEAD http://localhost:9200/covid_datastore [status:200 request:0.001s]
06/27/2021 22:33:32 - INFO - elasticsearch -   POST http://localhost:9200/_bulk?refresh=wait_for [status:200 request:0.738s]
06/27/2021 22:33:32 - INFO - elasticsearch -   HEAD http://localhost:9200/covid_datastore [status:200 request:0.001s]
06/27/2021 22:33:33 - INFO - elasticsearch -   POST http://localhost:9200/_bulk?refresh=wait_for [status:200 request:0.538s]
06/27/2021 22:33:33 - INFO - elasticsearch -   HEAD http://localhost:9200/covid_datastore [status:200 request:0.002s]
06/27/2021 22:33:34 - INFO - elasticsearch -   POST http://localhost:9200/_bulk?refresh=wait_for [status:200 request:0.868s]
06/27/2021 22:33:34 - INFO - elasticsearch -   HEAD http://localhost:9200/covid_datastore [status:200 request:0.002s]
06/27/2021 22:33:35 - INFO -

In [4]:
document_store.get_all_documents(index = Config.INDEX_NAME, filters = {"category":["articles"]})[1]

06/29/2021 01:33:16 - INFO - elasticsearch -   POST http://localhost:9200/covid_datastore/_search?scroll=1d&size=10000 [status:200 request:0.628s]
06/29/2021 01:33:16 - INFO - elasticsearch -   POST http://localhost:9200/_search/scroll [status:200 request:0.004s]
06/29/2021 01:33:16 - INFO - elasticsearch -   DELETE http://localhost:9200/_search/scroll [status:200 request:0.004s]


{'text': 'where tech and finance meet Chatbot Coach, UX/UI Designer Robotics Process Designer, Cloud Specialist, Machine Learning Architect. These may sound like Information Technology (IT)jobs, but they are actually roles that exist in the financial services sector. In the past few years, financial institutions have digital upskilling a key priority. Just ask the Talent Acquisition Head at DBS Bank, Ms Susan Cheong.“With COVID-19 accelerating digital disruption and changing the way we live, we are ramping up our upskilling efforts to stay ahead of these massive changes,” she says. From this year, DBS has identified over 7,200 employees to be upskilled or reskilled, in emerging areas such as design thinking, artificial intelligence and machine learning.Head of Group Human Resources at UOB, Mr Dean Tong, says that technology is core to the Bank’s innovation and transformation drive. “By deepening digital knowledge and skills, we ensure that our workforce is ever ready to harness the fut

### Total number of documents currently in the document store

In [11]:
len(document_store.get_all_documents(index = Config.INDEX_NAME) )

06/27/2021 22:35:32 - INFO - elasticsearch -   POST http://localhost:9200/covid_datastore/_search?scroll=1d&size=10000 [status:200 request:0.362s]
06/27/2021 22:35:32 - INFO - elasticsearch -   POST http://localhost:9200/_search/scroll [status:200 request:0.003s]
06/27/2021 22:35:32 - INFO - elasticsearch -   DELETE http://localhost:9200/_search/scroll [status:200 request:0.002s]


344
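
A per-category breakdown can make the total easier to check after a renewal. The sketch below assumes each document is available as a plain dict with a `meta` mapping that holds the `category`; `count_by_category` and the `"uncategorised"` fallback label are hypothetical and not part of the notebook.

```python
from collections import Counter


def count_by_category(documents):
    """Count documents per meta category; documents without one are grouped together."""
    counts = Counter()
    for doc in documents:
        meta = doc.get("meta") or {}
        counts[meta.get("category", "uncategorised")] += 1
    return dict(counts)
```

The per-category counts should add up to the total reported above, so a mismatch would point to documents stored without a category.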