# Data Visualization with Plotly Demo

## Introduction to Jupyter Notebook
Jupyter Notebook is a staple in any data scientist's toolkit. It is a free, open source, interactive data science environment that can function as both an IDE and a visualisation tool. A Jupyter Notebook is a single document where you can run code, display the output, and add equations and explanations. Each notebook is an `.ipynb` file, which is a text file that describes the content of the notebook in JSON format.
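
Because a notebook is plain JSON, it can be inspected with Python's standard `json` module. The sketch below builds a tiny notebook-shaped document in memory (the cell contents and the helper name `count_cell_types` are invented for illustration) and counts its cells by type. A real `.ipynb` file could be loaded with `json.load` in the same way.

```python
import json
from collections import Counter

# A minimal notebook-shaped document (nbformat 4 layout); the cell
# contents here are invented purely for illustration.
notebook_text = json.dumps({
    "nbformat": 4,
    "nbformat_minor": 5,
    "metadata": {},
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["# Title"]},
        {"cell_type": "code", "metadata": {}, "execution_count": 1,
         "outputs": [], "source": ["print('hello')"]},
    ],
})


def count_cell_types(raw_json):
    """Return how many cells of each type a notebook JSON string contains."""
    notebook = json.loads(raw_json)
    return Counter(cell["cell_type"] for cell in notebook.get("cells", []))


cell_counts = count_cell_types(notebook_text)
print(dict(cell_counts))
```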

Each Jupyter Notebook is attached to a kernel, which can be thought of as a "computational engine" that executes the code within the notebook. Notebooks are made up of cells. For example, this piece of text you are reading resides in the first cell of this notebook. Cells can be markdown cells, which display text in place, or code cells. When a code cell is run, its output is displayed below the cell. The order in which cells are run matters! Cells that define functions or variables have to be run before those functions or variables can be used in a subsequent cell.

How to use a Jupyter Notebook:
- To run a cell, either click the arrow to the left of the cell or press `Ctrl + Enter` after selecting the cell. When a cell is run, a number appears in square brackets (e.g. [1]) showing the order in which the cells were run.
- To interrupt a cell while it is running, press the button with the black square in the toolbar at the top.
- To restart the kernel, right-click `kernel` and choose from the list of restart options available.


## Introduction to Plotly

Pandas is an open source library providing data structures and data analysis tools for the Python language. Plotly is another open source library that lets you put together high-quality graphs to facilitate the visualisation of data. Plotly Dash (written on top of Plotly.js and React.js) lets you quickly build data apps that are rendered in the browser.

This notebook contains examples of how each of these libraries can be leveraged to analyse and visualise data. For more information, please check out the official documentation listed below.

#### Further Documentation
https://pandas.pydata.org/docs/ \
https://plot.ly/python/ \
https://dash.plotly.com/introduction 

## Setting Up

You can install the libraries using pip or conda. 

**N.B.** you may have to restart the kernel after installing these packages for your first run.

In [None]:
#!/bin/env python

# install packages
!pip3 install --user pandas
!pip3 install --user numpy
!pip3 install --user matplotlib
!pip3 install --user plotly
!pip3 install --user jupyter-dash
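
To check whether the installation is visible to the running kernel, one option is to ask Python which modules it can locate. This is a minimal sketch using only the standard library; `required_modules` and `find_missing` are illustrative names. If a package that was just installed is still reported as missing, restarting the kernel is usually the next thing to try.

```python
import importlib.util

# Import names for the packages installed above. Note that the pip package
# "jupyter-dash" is imported as "jupyter_dash".
required_modules = ["pandas", "numpy", "matplotlib", "plotly", "jupyter_dash"]


def find_missing(module_names):
    """Return the module names that cannot be found in the current environment."""
    return [name for name in module_names
            if importlib.util.find_spec(name) is None]


missing = find_missing(required_modules)
print("Missing modules:", missing if missing else "none")
```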

Having installed the libraries, you can import them as follows.

In [None]:
# import libraries
%matplotlib inline

#import plotly
import requests
import json
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from plotly import express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
from jupyter_dash import JupyterDash
from dash import dcc
from dash import html
from dash.dependencies import Input, Output, State

# Set display row/column to show all data
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)

In [None]:
#!pip3 install --user plotly
#import plotly


## Access Data From Endpoint

#### Further Documentation
https://docs.python-requests.org/en/master/

**N.B.** The URL used in this example belongs to the demo project we have set up. Please replace it with your own URL.

In [None]:
import os

In [None]:
# esg_keyw = []
# with open("../articles/ESG_Keywords.txt", "r") as file:
#     newline_break = ""
#     for readline in file: 
#         line_strip = readline.strip()
#         esg_keyw.append(line_strip)

# print(esg_keyw)

In [None]:
esg_keyw = []  # list to save the ESG Keywords from file
# https://www.delftstack.com/howto/python/python-readlines-without-newline/
with open("articles/ESG_Keywords.txt", "r") as file:
    esg_keyw=file.read().splitlines()[1:] # splitlines method splits lines into list without new line; excluding header with [1:]
    print(esg_keyw)

In [None]:
article1 = []  
# https://www.delftstack.com/howto/python/python-readlines-without-newline/
with open("articles/article1.txt", "r") as file:
    article1=file.read().splitlines() # splitlines method splits lines into list without new line
while("" in article1) :
    article1.remove("") # remove empty strings from list of strings 'article1'
    
print(article1)

article1g = []  
# https://www.delftstack.com/howto/python/python-readlines-without-newline/
with open("articles/article1german.txt", "r") as file:
    article1g=file.read().splitlines() # splitlines method splits lines into list without new line
while("" in article1g) :
    article1g.remove("") # remove empty strings from list of strings 'article1g'
    
print(article1g)

In [None]:
article2 = []  
# https://www.delftstack.com/howto/python/python-readlines-without-newline/
with open("articles/article2.txt", "r") as file:
    article2=file.read().splitlines() # splitlines method splits lines into list without new line
while("" in article2) :
    article2.remove("") # remove empty strings from list of strings 'article2'
    
print(article2)

article3 = []  
# https://www.delftstack.com/howto/python/python-readlines-without-newline/
with open("articles/article3.txt", "r") as file:
    article3=file.read().splitlines() # splitlines method splits lines into list without new line
while("" in article3) :
    article3.remove("") # remove empty strings from list of strings 'article3'
    
print(article3)
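
The cells above repeat the same read-and-clean steps for every file. A small helper can capture that pattern once. The sketch below is one possible version: `read_nonempty_lines` and `skip_header` are hypothetical names, and UTF-8 encoding is an assumption about the article files. It runs against a temporary file so that it does not depend on the `articles/` folder.

```python
import os
import tempfile


def read_nonempty_lines(path, skip_header=False):
    """Read a text file into a list of lines, dropping blank lines.

    skip_header=True discards the first line, mirroring the [1:] slice
    used for the keyword file.
    """
    # Assumption: the files are UTF-8 encoded.
    with open(path, "r", encoding="utf-8") as handle:
        lines = handle.read().splitlines()
    if skip_header:
        lines = lines[1:]
    return [line for line in lines if line != ""]


# Demonstration on a throwaway file with made-up content.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo.txt")
    with open(demo_path, "w", encoding="utf-8") as handle:
        handle.write("Header\nfirst paragraph\n\nsecond paragraph\n")
    demo_lines = read_nonempty_lines(demo_path, skip_header=True)

print(demo_lines)
```

With such a helper, a call like `read_nonempty_lines("articles/article1.txt")` would stand in for one of the read-and-clean cells.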

In [33]:
# url = 'https://dbgee-mar22-12.ew.r.appspot.com/api/text'
url = 'https://dbgee-mar22-12.ew.r.appspot.com/api/analyze'

article3 = []  
# https://www.delftstack.com/howto/python/python-readlines-without-newline/
with open("articles/article3.txt", "r") as file:
    article3=file.read().splitlines() # splitlines method splits lines into list without new line
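while("" in article3) :
    article3.remove("") # remove empty strings from list of strings 'article3'

art3_post = []

# Form-field name sent to the endpoint. "text" is an assumption;
# replace it with the field your API expects.
key = "text"

# send each paragraph to the endpoint and collect the JSON responses
for paragraph in article3:
    myobj = {key: paragraph}
    x = requests.post(url,  data = myobj)
    art3_post.append(x.json())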
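
The request loop can also be wrapped in a helper that adds a timeout and stops on HTTP errors. This is a sketch under stated assumptions rather than the endpoint's documented interface: `post_paragraphs` is a hypothetical name, and the default `field_name="text"` is a guess at the form field the API expects. The `post` parameter exists only so the function can be exercised without network access.

```python
import requests


def post_paragraphs(url, paragraphs, field_name="text",
                    post=requests.post, timeout=10):
    """POST each paragraph to `url` as form data and collect the JSON replies.

    `field_name` is an assumption: the notebook does not state which form
    field the endpoint expects.
    """
    replies = []
    for paragraph in paragraphs:
        response = post(url, data={field_name: paragraph}, timeout=timeout)
        response.raise_for_status()  # fail loudly on 4xx/5xx responses
        replies.append(response.json())
    return replies
```

For example, `post_paragraphs(url, article3)` would return one parsed JSON reply per paragraph.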

In [None]:
art3_post

In [None]:
# define endpoint url
url = "https://dbgee-mar22-12.ew.r.appspot.com/api/text"

# use requests library to send HTTP requests
# in this example, GET sentiment analysis data
data = json.loads(requests.get(url).text)

# examine data
data

In [None]:
# import requests

# url = "https://dbgee-mar22-12.ew.r.appspot.com/api/text"

# # data = {"eventType": "AAS_PORTAL_START", "data": {"uid": "hfe3hf45huf33545", "aid": "1", "vid": "1"}}
# # params = {'sessionKey': '9ebbd0b25760557393a43064a92bae539d962103', 'format': 'xml', 'platformId': 1}

# requests.post(url, params=article2[3])

In [None]:
#esg_keyw

# read the keyword file and close it again once the lines are loaded
with open('articles/ESG_Keywords.txt', "r") as file:
    filterEsgWords = file.readlines()
words = [w.lower().strip() for w in filterEsgWords]

In [None]:
esg_keyw_low = [w.lower() for w in esg_keyw]
#words
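
Lower-casing the keywords makes it possible to compare them with entity names regardless of capitalisation. The sketch below shows one way to do that comparison with a set. The function name `find_keyword_matches` and the sample values are hypothetical and do not come from `ESG_Keywords.txt`.

```python
def find_keyword_matches(candidates, keywords):
    """Return the candidates that appear in keywords, ignoring case and
    surrounding whitespace. Order follows the candidates list."""
    keyword_set = {k.strip().lower() for k in keywords if k.strip()}
    return [c for c in candidates if c.strip().lower() in keyword_set]


# Hypothetical sample values, not taken from ESG_Keywords.txt.
sample_keywords = ["Climate Change", "governance ", "Diversity"]
sample_entities = ["governance", "Berlin", "climate change"]

matches = find_keyword_matches(sample_entities, sample_keywords)
print(matches)
```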

In [None]:
entity_list = [
]
 
#Adding dictionary (entity characteristics) to the list:
#entity_list.append({'name':48,'type':'other', 'score':28})
 
print(entity_list)

## Data Visualisation

Plotly is a commonly-used data visualisation library. The following examples will show you how to create different graphs from the sample data.

We can first read the sample data into a dataframe. The sample data is taken from the UK Met Office and shows the maximum and minimum temperature, the rainfall and the number of hours of sunlight for each month in 2018.

In [None]:
from google.cloud import language_v1

def sample_analyze_entities(text_content):
    """
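    Analyze the entities in a string with the Cloud Natural Language API.

    Args:
      text_content: The text content to analyze.

    Returns:
      A list of dicts with the keys 'name', 'type' and 'score' (salience).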
    """

    client = language_v1.LanguageServiceClient()

    # text_content = 'California is a state.'

    # Available types: PLAIN_TEXT, HTML
    type_ = language_v1.Document.Type.PLAIN_TEXT

    # Optional. If not specified, the language is automatically detected.
    # For list of supported languages:
    # https://cloud.google.com/natural-language/docs/languages
    language = "en"
    document = {"content": text_content, "type_": type_, "language": language}

    # Available values: NONE, UTF8, UTF16, UTF32
    encoding_type = language_v1.EncodingType.UTF8

    response = client.analyze_entities(request = {'document': document, 'encoding_type': encoding_type})
    
    # TODO add in the filter file here
    entity_list = []
    # Loop through the entities returned from the API
    for entity in response.entities:
        entity_list.append({'name':entity.name, 'type':entity.type_, 'score':entity.salience})

        # Report entities that match one of the ESG keywords loaded into `words`
        if entity.name.lower() in words:
            print("We got a match " + entity.name.lower())

    return entity_list


In [None]:
## question 2) Named Entity Extraction ##
entity_name=[]
entity_type = []
entity_score = []

for x in range(1,len(article1)):
    result = sample_analyze_entities(article1[x])
    #entity_dict.append([dict(zip(dict_keys,i)) for i in result])
    
    # store one list per paragraph so the results can be flattened later
    entity_name.append([i["name"] for i in result])
    entity_type.append([i["type"] for i in result])
    entity_score.append([i["score"] for i in result])


In [None]:
## question 2) Named Entity Extraction ##
entity_name2=[]
entity_type2 = []
entity_score2 = []

for x in range(1,len(article2)):
    result = sample_analyze_entities(article2[x])
    #entity_dict.append([dict(zip(dict_keys,i)) for i in result])
    
    # store one list per paragraph so the results can be flattened later
    entity_name2.append([i["name"] for i in result])
    entity_type2.append([i["type"] for i in result])
    entity_score2.append([i["score"] for i in result])


In [None]:
# flatten each list of lists into a single list:
entity_name_flat = [item for sublist in entity_name for item in sublist]
entity_type_flat = [item for sublist in entity_type for item in sublist]
entity_score_flat = [item for sublist in entity_score for item in sublist]

In [None]:
entity_df = pd.DataFrame(list(zip(entity_name_flat, entity_type_flat, entity_score_flat)),  columns = ["entity","type","score"])
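
As an alternative to keeping three parallel lists and zipping them, a DataFrame can be built directly from a list of dictionaries. The sketch below assumes results in the same shape as the dictionaries returned by `sample_analyze_entities`. The sample values and the names `per_paragraph_results` and `records_df` are made up for illustration.

```python
import pandas as pd

# Hypothetical results in the same shape as the dictionaries returned by
# sample_analyze_entities: one list of entity dicts per paragraph.
per_paragraph_results = [
    [{"name": "Acme", "type": "ORGANIZATION", "score": 0.6},
     {"name": "Berlin", "type": "LOCATION", "score": 0.1}],
    [{"name": "Acme", "type": "ORGANIZATION", "score": 0.4}],
]

# Flatten the nested lists into one list of records, then build the frame.
records = [entity for result in per_paragraph_results for entity in result]
records_df = pd.DataFrame(records).rename(columns={"name": "entity"})

print(records_df)
```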

In [None]:
# number of duplicates for each entry, reducing df to unique df:
entity_df_count = entity_df.groupby(entity_df.columns.tolist(),as_index=False).size()

In [None]:
# sort values by score
most_scored = entity_df_count.sort_values(by=['score'],  ascending=False)
# sort values by # repetitions
# entity_df_count.sort_values(by=['size'],  ascending=False)
# not many duplicates

In [None]:
# getting only the entities:
entity_name_df = pd.DataFrame(list(zip(entity_name_flat)), columns=["entity"])
# descending order by number of repetitions:
entity_name_df_sorted = entity_name_df.groupby(entity_name_df.columns.tolist(),as_index=False).size().sort_values(by=['size'],  ascending=False)
entity_name_df_sorted.rename(columns = {'entity':'Entity', 'size':'Frequency'}, inplace = True)
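
The same frequency table can also be produced with `value_counts`, which counts and sorts in one step. This is a sketch of that alternative, not a replacement for the cell above. `sample_names` is hypothetical data standing in for `entity_name_flat`.

```python
import pandas as pd

# Hypothetical entity names standing in for entity_name_flat.
sample_names = ["Acme", "Berlin", "Acme", "Acme", "Berlin", "Paris"]

frequency_df = (
    pd.Series(sample_names, name="entity")
    .value_counts()                 # counts per name, largest first
    .rename_axis("Entity")          # name the index that holds the entity
    .reset_index(name="Frequency")  # turn the index into a regular column
)

print(frequency_df)
```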

In [None]:
# getting only the entities - Article 2:
entity_name_flat2 =  [item for sublist in entity_name2 for item in sublist]
entity_name_df2 = pd.DataFrame(list(zip(entity_name_flat2)), columns=["entity"])
# descending order by number of repetitions:
entity_name_df_sorted2 = entity_name_df2.groupby(entity_name_df2.columns.tolist(),as_index=False).size().sort_values(by=['size'],  ascending=False)
entity_name_df_sorted2.rename(columns = {'entity':'Entity', 'size':'Frequency'}, inplace = True)

In [None]:
# entity name sorted greater than 1 = more than 1 appearance/repetition
ens_gt1 = entity_name_df_sorted[entity_name_df_sorted['Frequency']>1]
ens_gt2 = entity_name_df_sorted2[entity_name_df_sorted2['Frequency']>1]

In [None]:
# Named entity extraction – table of most important/frequently used entities referenced.
px.bar(ens_gt1, x='Entity', y='Frequency', title= "Most Frequently used Entities - Article 1").show()
px.bar(ens_gt2, x='Entity', y='Frequency', title= "Most Frequently used Entities - Article 2").show()

To gain more insight into a particular column, you can use the *describe()* method on the dataframe column name.
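
For instance, calling `describe()` on a numeric column returns its count, mean, standard deviation, minimum, quartiles and maximum. The frame below is hypothetical and only mirrors the shape of the frequency tables built earlier.

```python
import pandas as pd

# Hypothetical frequency table in the same shape as entity_name_df_sorted.
freq_df = pd.DataFrame({
    "Entity": ["Acme", "Berlin", "Paris", "Rome"],
    "Frequency": [4, 2, 1, 1],
})

# describe() on a numeric column returns count, mean, std, min,
# the 25%/50%/75% quantiles, and max as a Series.
frequency_stats = freq_df["Frequency"].describe()
print(frequency_stats)
```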

## Introducing Jupyter Dash

Dash is Plotly's open source Python framework for building full stack analytic web applications using pure Python. The JupyterDash library makes these features available from a Jupyter Notebook.

In [None]:
### Run ngrok to tunnel Dash app port 8050 to the outside world. 
### This command runs in the background.
get_ipython().system_raw('./ngrok http 8050 &')
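
Once the tunnel is up, ngrok's local API at `http://localhost:4040/api/tunnels` lists the active tunnels as JSON. The sketch below shows one way to pull the public URL out of such a response in Python while tolerating an empty tunnel list, which can happen if ngrok has not finished starting. `first_public_url` is a hypothetical helper, and the sample body uses a placeholder URL.

```python
import json


def first_public_url(tunnels_payload):
    """Return the public URL of the first tunnel, or None if there is none.

    Expects a dictionary shaped like {"tunnels": [{"public_url": ...}, ...]}.
    """
    tunnels = tunnels_payload.get("tunnels", [])
    if not tunnels:
        return None
    return tunnels[0].get("public_url")


# Hypothetical response body; the URL is a placeholder, not a real tunnel.
sample_body = '{"tunnels": [{"public_url": "https://example.invalid"}]}'
print(first_public_url(json.loads(sample_body)))
print(first_public_url({"tunnels": []}))
```

In the notebook, the payload could be fetched with `requests.get("http://localhost:4040/api/tunnels").json()` and passed to the helper.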

In [None]:
# get ID of the most recent text
last_text_id = list(data.keys())[0]

app = JupyterDash(__name__)

app.layout = html.Div([
    html.H1("JupyterDash Demo"),
    
    
    # THESE LINES DISPLAY THE OUTPUT OF NLP API
    html.P("Most Recent Text ID: {}".format(last_text_id)),
    html.P("Text Analysed: {}".format(data[last_text_id]["text"])),
    html.P("Sentiment: {}".format(data[last_text_id]["sentiment"])),
  
    # THESE LINES DEMO ONE OF THE DASH CORE COMPONENT(dcc) i.e. dcc.Input
    html.H3("Change the value in the text box to see callbacks in action!"),
    html.Div([
        "Input: ",
        dcc.Input(id='my-input', value='initial value', type='text')
    ]),
    html.Br(),
    html.Div(id='my-output'),
    
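    # THESE LINES DEMO THE INTEGRATION OF PLOTLY GRAPHS WITH DASH
    # (the figure is built from the Article 1 entity frequencies in ens_gt1)
    dcc.Graph(figure=px.bar(ens_gt1, x='Entity', y='Frequency', title="Most Frequently used Entities - Article 1")),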

])


@app.callback(
    Output(component_id='my-output', component_property='children'),
    Input(component_id='my-input', component_property='value')
)
def update_output_div(input_value):
    return 'Output: {}'.format(input_value)


In [None]:
app.run_server(mode="external", port=8050)

#### If the cell below raises an error, please rerun it

In [None]:
### Get the public URL where you can access the Dash app. Copy this URL.
! curl -s http://localhost:4040/api/tunnels | python3 -c \
    "import sys, json; print(json.load(sys.stdin)['tunnels'][0]['public_url'])"