# About the Notebook

In this notebook, we'll look at how to obtain structured information about a particular entity, using several types of data and ML methods.

Sections:
1. Getting the Data Ready

    - We'll collect some data on the GME stock from Reddit.
    - We'll parse the data, and assume we push it into an ES index.
    - We'll query actual data about GME, from an ES index.
    
2. Running NER and RelEx on the Data

    - We'll 
    
3. Parsing and Saving Results

    - This time, we'll assume we already have data with NER and RelEx applied, on "where Elon Musk went" recently.
    - We'll parse our data into nodes and edges, and save the result.
    - We'll print the data to inspect it in the notebook, and also run a Dash app to view it as a graph.
    


## Note:

This is not a tutorial notebook, so it does not implement the process end to end.

However, it brings together most of the components needed for the analysis, and it should give us an idea of how to form a complete pipeline.

In [None]:
%cd /Users/pydata_pres

# 1- Getting the Data Ready 

- We'll collect some data on the GME stock from Reddit.
- We'll parse the data, and assume we push it into an ES index.
- We'll query actual data about GME, from an ES index.

## Scrape Data from the Identified Data Source

In [None]:
# https://www.geeksforgeeks.org/scraping-reddit-with-python-and-beautifulsoup/

In [None]:
import requests
from bs4 import BeautifulSoup
from datetime import date

In [None]:
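def getdata(url):
    """Download a page and return its HTML as text."""
    r = requests.get(url, timeout=30)
    r.raise_for_status()  # fail early on HTTP errors instead of parsing an error page
    return r.text

url = "https://www.reddit.com/r/GME/comments/mfv9we/just_posted_on_sec_%D0%BEver_500000_awarded_to/"

htmldata = getdata(url)
soup = BeautifulSoup(htmldata, 'html.parser')

# print(soup)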

In [None]:
# Collect the text of every paragraph (<p>) element on the page.
page = [item.get_text() for item in soup.find_all("p")]

In [None]:
page

In [None]:
# ["Link to the Press Release on SEC's website:",
#  'https://www.sec.gov/news/press-release/2021-54',
#  'From the release:',
#  'FOR IMMEDIATE RELEASE2021-54',
#  'Washington D.C., March 29, 2021 —',
#  "The Securities and Exchange Commission awarded more than $500,000 to a whistleblower who raised concerns internally before submitting a tip to the Commission.\xa0The whistleblower's information and assistance allowed the Commission and another agency to quickly file actions, shutting down an ongoing fraudulent scheme.",
#  'The whistleblower\'s information prompted an internal investigation by the company, which then reported to an outside agency, which in turn provided the information to the SEC.\xa0Separately, the whistleblower also reported to the SEC within 120 days of reporting the violations internally to the company.\xa0Under the "safe harbor"\xa0provision of the SEC\'s whistleblower rules, the SEC treats the whistleblower\'s information as though it had been submitted to the SEC at the same time it was internally reported as long as the whistleblower also reports the information to the SEC within 120 days of the internal report.',

## Parse Scraped Data to a Common Format

In [None]:
# Wrap the scraped paragraphs in a common document format.
page_parsed_into_a_common_format = {"id": 123,
                                    "url": url,
                                    "date_obtained": date.today(),
                                    "content": list(page)}

In [None]:
page_parsed_into_a_common_format

## Store the Common Format

In [None]:
# from: https://stackoverflow.com/questions/66049377/insert-new-document-using-python-elastic-client-raises-illegal-argument-exceptio
# res = es.index(index='reddit_pages', doc_type="_doc", id=page_parsed_into_a_common_format['id'], body=page_parsed_into_a_common_format)
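
Before a document in the common format is stored, it can help to check that it serializes cleanly. The standard-library `json` module does not encode `date` objects, so the sketch below converts them to ISO 8601 strings first. The helper name `to_json_ready` and the sample values are assumptions made for illustration; they are not part of the original pipeline.

```python
import json
from datetime import date


def to_json_ready(doc):
    """Return a copy of doc with top-level date values as ISO 8601 strings."""
    return {
        key: value.isoformat() if isinstance(value, date) else value
        for key, value in doc.items()
    }


# Hypothetical sample with the same keys as page_parsed_into_a_common_format.
sample_doc = {
    "id": 123,
    "url": "https://example.com/post",
    "date_obtained": date(2021, 3, 29),
    "content": ["first paragraph", "second paragraph"],
}

serialized = json.dumps(to_json_ready(sample_doc))
print(serialized)
```

The same string can be written to a file or passed to whichever store is used; only the top-level values are converted here, which is enough for this flat document shape.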

## Query your Data-Store

In [None]:
from elasticsearch.helpers import scan
from elasticsearch import Elasticsearch
from pydata_pres_vars import MY_ELASTICSEARCH_URL

In [None]:
client = Elasticsearch(MY_ELASTICSEARCH_URL)

In [None]:
query = {
        "bool": {
            "filter": [
                {"range": {"date": {"gte": "2022-06-18", "lt": "2022-06-19"}}},
                {"query_string": {"query": '"elon musk"', "default_field": "content"}}
            ],
        }
    }
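
The query above hard-codes one phrase and one day. As a minimal sketch, the same structure can be produced by a small helper so that other entities or date ranges can be queried. The function name `build_phrase_query` and its parameters are hypothetical, and the phrase is assumed not to contain double quotes.

```python
def build_phrase_query(phrase, start_date, end_date,
                       field="content", date_field="date"):
    """Build a bool/filter query matching an exact phrase in [start_date, end_date)."""
    return {
        "bool": {
            "filter": [
                # Keep only documents dated inside the half-open interval.
                {"range": {date_field: {"gte": start_date, "lt": end_date}}},
                # Wrapping the phrase in double quotes requests a phrase match.
                {"query_string": {"query": f'"{phrase}"', "default_field": field}},
            ]
        }
    }


example_query = build_phrase_query("elon musk", "2022-06-18", "2022-06-19")
print(example_query)
```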

In [None]:
client.count(index="en_search", body={"query":query})

In [None]:
from pydata_pres_vars import MY_INDEX_NAME

scanner = scan(
    client=client,
    index="en_search",
    query={
        "query": query,
        "_source": [
            "date",
            "title",
            "entities.organizations",
            "content",
            "spans",
        ],
    },
)

docs_full = [{"id": d["_id"], **d["_source"]} for d in scanner]
docs = [d['content'] for d in docs_full]

len(docs)

In [None]:
for doc in docs[:3]:
    print(doc[:175]+'..','\n','_____________________')

In [None]:
# THREE SpaceX employees have been sacked after calling chief executive Elon Musk a "source of distraction and embarrassment" in an open letter.

# The letter, which was first cir.. 
#  _____________________
# Now the crypto industry is grappling with an even grimmer prospect: The worst may be yet to come.

# Written by David Yaffe-Bellany

# Cryptocurrency prices are plummeting. A so-c.. 
#  _____________________
# Elon Musk warned Twitter staffers its business needed to "get healthy" and undergo a "rationalisation of headcount and expenses" as he addressed the social media platform's em.. 
#  _____________________

# 2- Running NER and RelEx on the Data

![title](imgs/ner.png)

![title](imgs/relex.png)

# 3- Parsing and Saving Results

- This time, we'll assume we already have data with NER and RelEx applied, on "where Elon Musk went" recently.
- We'll parse our data into nodes and edges, and save the result.
- We'll print the data to inspect it in the notebook, and also run a Dash app to view it as a graph.

In [None]:
import json

In [None]:
RESULTS_PATH = "elon_dataset.json"

with open(RESULTS_PATH, "r") as f:
    elon_dataset = json.load(f)

In [None]:
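extracted_information = []

for doc in elon_dataset:

    # Collect the distinct non-null answers extracted for this document.
    visits_in_a_doc = list({visit['answer'] for visit in doc["visits"] if visit is not None})

    if len(visits_in_a_doc) > 0:
        # Each document is expected to yield exactly one distinct answer.
        assert len(visits_in_a_doc) == 1

        # Skip documents whose only answer is the '[SEP]' token.
        if visits_in_a_doc[0] != '[SEP]':
            extracted_information.append({'title': doc['_source']['title'], 'visits': visits_in_a_doc})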

In [None]:
len(extracted_information)
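
The filtering step can be tried on a few made-up records before running it on the real file. The sketch below assumes the same input shape as `elon_dataset` (a `visits` list of dicts or `None`, and a title under `_source`). The function name `extract_visits` and the toy records are hypothetical, and unlike the cell above it does not assert that each document has exactly one distinct answer.

```python
def extract_visits(dataset, empty_answer="[SEP]"):
    """Return one {'title', 'visits'} record per document with a usable answer."""
    extracted = []
    for doc in dataset:
        # Distinct answers, ignoring missing (None) predictions.
        answers = {v["answer"] for v in doc["visits"] if v is not None}
        # Drop the token that stands for "no answer found".
        answers.discard(empty_answer)
        if answers:
            extracted.append({"title": doc["_source"]["title"],
                              "visits": sorted(answers)})
    return extracted


# Toy records, invented purely to exercise the function.
toy_dataset = [
    {"_source": {"title": "Title A"},
     "visits": [{"answer": "Place X"}, None, {"answer": "Place X"}]},
    {"_source": {"title": "Title B"}, "visits": [{"answer": "[SEP]"}]},
    {"_source": {"title": "Title C"}, "visits": [None]},
]

toy_result = extract_visits(toy_dataset)
print(toy_result)
```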

In [None]:
nodes_to_titles = {}

# Map each visited location to the titles of the documents that mention it.
for doc in extracted_information:
    for loc in doc['visits']:
        nodes_to_titles.setdefault(loc, []).append(doc['title'])

nodes = set(nodes_to_titles.keys())

In [None]:
parsed_nodes = [{'data':{'id':node,'label':node}}
                for node in nodes]
parsed_nodes.append({'data':{'id':'Elon Musk','label':'Elon Musk'}})
parsed_nodes[:3]

In [None]:
import math

# Place the nodes at evenly spaced angles on a unit circle.
visualized_coordinates = []
for i in range(len(parsed_nodes)):
    angle = i*((2*math.pi)/len(parsed_nodes))
    visualized_coordinates.append((math.cos(angle), math.sin(angle)))

In [None]:
visualized_coordinates[0]
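
The layout computed above puts node `i` of `n` at angle `2*pi*i/n` on a circle. A minimal standalone sketch of the same idea follows; the function name `circular_layout` and the `radius` parameter are assumptions added for illustration.

```python
import math


def circular_layout(n, radius=1.0):
    """Return n (x, y) points evenly spaced on a circle of the given radius."""
    step = 2 * math.pi / n
    return [(radius * math.cos(i * step), radius * math.sin(i * step))
            for i in range(n)]


points = circular_layout(4)
for x, y in points:
    # Round for display only; floating-point results are not exactly 0 or 1.
    print(round(x, 3), round(y, 3))
```

The first point always sits at angle zero, so it lands on the positive x-axis; the remaining points follow counter-clockwise in mathematical coordinates.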

In [None]:
assert len(parsed_nodes)==len(visualized_coordinates)

# Scale the unit-circle coordinates and attach them as node positions.
for node, (x, y) in zip(parsed_nodes, visualized_coordinates):
    node.update({'position': {'x': x*2000, 'y': y*2000}})

In [None]:
parsed_nodes[:3]

In [None]:
parsed_edges = [{'data':{'source':'Elon Musk', 'target':loc, 'label':title}} 
                for loc, titles in nodes_to_titles.items() 
                for title in titles]

parsed_edges[:3]
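
The node and edge lists form a star graph: one central entity, one node per location, and one edge per document title. The sketch below rebuilds that structure from a small mapping in the same `{'data': ...}` element shape used above. The function name `build_elements` and the toy mapping are hypothetical.

```python
def build_elements(center, targets_to_labels):
    """Build star-graph nodes and edges around a single central node."""
    # The central node is added exactly once, followed by one node per target.
    nodes = [{"data": {"id": center, "label": center}}]
    nodes += [{"data": {"id": target, "label": target}}
              for target in targets_to_labels if target != center]
    # One edge per (target, label) pair, all starting at the center.
    edges = [{"data": {"source": center, "target": target, "label": label}}
             for target, labels in targets_to_labels.items()
             for label in labels]
    return nodes, edges


# Toy mapping from location to document titles, invented for illustration.
toy_mapping = {"Place X": ["Title A", "Title B"], "Place Y": ["Title C"]}
toy_nodes, toy_edges = build_elements("Center", toy_mapping)
print(len(toy_nodes), len(toy_edges))
```

Checking that every edge target appears among the node ids is a cheap sanity test before the elements are handed to a graph viewer.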

In [None]:
extracted_information_str = json.dumps(parsed_nodes+parsed_edges)

with open("elon_musk_results_to_visualize.json","w") as f:
    f.write(extracted_information_str)

In [None]:
for edge in parsed_edges[:3]:
    print(edge)