## Overview

This notebook contains code for building two applications:

1. An EDA web application for the ACLED dataset.
2. A knowledge graph web application.

Both web applications were developed with Streamlit but hosted locally due to time constraints.

The sections in this notebook are:

1. Importing libraries and dependencies.
2. Code for developing EDA application.
3. Code for Knowledge Graph application.

**NB**: Two applications were built rather than one to reduce latency and the time taken to reload the independent interactive dashboard sections.
An accompanying video of the web app demos is available on Google Drive [here](https://drive.google.com/drive/folders/1Wb351y-EaWjTnjRCOojVtcNHE4CiUY2e).

## 1. Install Libraries and Dependencies

In [10]:
#!pip3 install geopandas
#!pip install streamlit
#!pip install pyvis

In [11]:
#!pip install boto3

In [12]:
#!pip3 install networkx==2.3

## 1.2 Mounting Drive

In [6]:
## Mount drive

from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [7]:
import os

os.chdir('/content/drive/MyDrive/Module3-Twist-Challenge')

### 2.1 Exploratory Analysis Application Development

This code generates the EDA application on the ACLED dataset.
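
The analytics in this app follow a filter-then-aggregate pattern: restrict events to a date window, then count events and sum fatalities per country. The sketch below illustrates that pattern on a made-up four-row table. The values and the helper name `summarise_by_country` are illustrative assumptions, not part of the ACLED data or the app, and the sketch treats both date bounds as inclusive.

```python
import pandas as pd

# Made-up rows that mimic the ACLED columns used by the app.
events = pd.DataFrame({
    "event_date": ["2021-01-05", "2021-01-20", "2021-02-11", "2021-03-02"],
    "country": ["Kenya", "Kenya", "Mali", "Mali"],
    "fatalities": [2, 0, 5, 1],
})
events["date"] = pd.to_datetime(events["event_date"])


def summarise_by_country(df, start, end):
    """Count events and sum fatalities per country within [start, end]."""
    start, end = pd.to_datetime(start), pd.to_datetime(end)
    window = df[(df["date"] >= start) & (df["date"] <= end)]
    return (
        window.groupby("country")
        .agg(events_count=("date", "count"), fatalities_counts=("fatalities", "sum"))
        .reset_index()
        .sort_values("events_count", ascending=False)
    )


summary = summarise_by_country(events, "2021-01-01", "2021-02-28")
print(summary)
```

The March event falls outside the window, so only three of the four rows contribute to the summary.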

In [None]:
%%writefile app.py

## Import Libraries
import pandas as pd
import matplotlib.pyplot as plt
import time
import datetime as dt
import geopandas
import plotly.express as px 
import seaborn as sns
import os
import streamlit as st
from streamlit import components
from pyvis.network import Network
sns.set_theme()

import re
import bs4
import requests
import spacy
from spacy import displacy
nlp = spacy.load('en_core_web_sm')

from spacy.matcher import Matcher 
from spacy.tokens import Span 

import networkx as nx

import matplotlib.pyplot as plt
from tqdm import tqdm
import boto3
from io import StringIO

## Switch off warnings
st.set_option('deprecation.showPyplotGlobalUse', False)

## Load Data
@st.cache
def load_data():
  """
  Import data from S3 bucket
  """
  ## Read in data
  aws_access_key_id = 'XXXXXXXXXXXXXXXXXXX'
  aws_secret_access_key = 'XXXXXXXXXXXXXXXXXXX'

  client = boto3.client('s3', aws_access_key_id=aws_access_key_id,
          aws_secret_access_key=aws_secret_access_key)

  bucket_name = 'dsi-acled-data'

  object_key = '2019-04-09-2022-04-14.csv'
  csv_obj = client.get_object(Bucket=bucket_name, Key=object_key)
  body = csv_obj['Body']
  csv_string = body.read().decode('utf-8')

  acled_data = pd.read_csv(StringIO(csv_string))

  ## Date manipulation
  acled_data['date'] = pd.to_datetime(acled_data['event_date'])
  acled_data['year_mon'] = acled_data['date'].dt.to_period('M')
  acled_data['year_mon2'] = acled_data['year_mon'].astype(str)

  ## Replace DRC name
  acled_data['country'] = acled_data['country'].str.replace('Democratic Republic of Congo','DRC')

  return acled_data

## Read data in memory
acled_data = load_data()

#Add title and subtitle to the main interface of the app
st.title('Armed Conflict Location & Event Data Analysis')

min_date = acled_data['date'].min()
max_date = acled_data['date'].max()

start_date = st.date_input('Start Date', value=min_date, min_value=min_date, max_value=max_date)
end_date = st.date_input('End Date', value=max_date, min_value=min_date, max_value=max_date)

start_date = pd.to_datetime(start_date)
end_date = pd.to_datetime(end_date)

## Keep events inside the selected window (both bounds inclusive)
filtered_acled_data = acled_data[(acled_data['date'] >= start_date) & (acled_data['date'] <= end_date)]

st.subheader("Africa Analytics")

## Filter for Africa
africa = filtered_acled_data[filtered_acled_data['region'].str.contains("Africa")]

## Conflicts by country
africa_conflicts = africa.groupby('country')['event_id_no_cnty'].count().reset_index(name='events_count').sort_values(['events_count'],ascending=False)

## Fatalities by country
africa_fatalities = africa.groupby('country')['fatalities'].sum().reset_index(name='fatalities_counts').sort_values(['fatalities_counts'],ascending=False)

## Event Types
africa_event_types = africa.groupby('event_type')['event_type'].count().reset_index(name='event_type_count').sort_values(['event_type_count'],ascending=False)

## Fatalities by month
fatalities_by_month = africa.groupby('year_mon')['fatalities'].sum().reset_index(name='fatalities_by_month')
fatalities_by_month['year_mon2'] = fatalities_by_month['year_mon'].astype(str)

## Events distribution by country
africa_analytics_bars = st.container()
with africa_analytics_bars:
  bar_chart1, bar_chart2 = st.columns(2)

  top_ten_events = africa_conflicts[:10]
  fig1 = px.bar(top_ten_events, x='events_count', y='country',title="Conflict Events by Country",orientation='h')
  fig1.update_layout(yaxis={'categoryorder':'total ascending'}) 

  top_ten_fatalities = africa_fatalities[:10]
  fig2 = px.bar(top_ten_fatalities, x='fatalities_counts', y='country',title="Fatalities by Country",orientation='h')
  fig2.update_layout(yaxis={'categoryorder':'total ascending'}) 
  
  bar_chart1.plotly_chart(fig1)
  bar_chart2.plotly_chart(fig2)


## Distribution of Event Types
africa_analytics_conflict_types = st.container()
with africa_analytics_conflict_types:
  bar_chart3,bar_chart4 = st.columns(2)

  fig3 = px.bar(africa_event_types, x='event_type_count', y='event_type',title="Distribution of Event Types",orientation='h')
  fig3.update_layout(yaxis={'categoryorder':'total ascending'}) 

  fig4 = px.line(fatalities_by_month, x="year_mon2", y="fatalities_by_month", title='Fatalities by Month')

  
  
  bar_chart3.plotly_chart(fig3)
  bar_chart4.plotly_chart(fig4)

## Geo-Map for event distribution
continental_analysis_events = st.container()
with continental_analysis_events:

  Cumulative_cases_plot = px.choropleth(africa_conflicts,
                      locations="country", #Location names; corresponds to a column in the dataframe
                      color="events_count", #Corresponding data in the dataframe
                      locationmode = 'country names', #location mode == One of ‘ISO-3’, ‘USA-states’, or ‘country names’ 
                      #locationmode == should match the type of data entries in "locations"
                      scope="africa", #limits the scope of the map to Africa
                      title ="Conflict Events Distribution in Africa",
                      hover_name="country",
                      color_continuous_scale = "deep",
                    )
  Cumulative_cases_plot.update_traces(marker_line_color="black") # line markers between states
  st.plotly_chart(Cumulative_cases_plot)


## Fatality distribution Geo-Map
continental_analysis_fatalities = st.container()
with continental_analysis_fatalities:
  Cumulative_fatalities_plot = px.choropleth(africa_fatalities,
                      locations="country", #Location names; corresponds to a column in the dataframe
                      color="fatalities_counts", #Corresponding data in the dataframe
                      locationmode = 'country names', #location mode == One of ‘ISO-3’, ‘USA-states’, or ‘country names’ 
                      #locationmode == should match the type of data entries in "locations"
                      scope="africa", #limits the scope of the map to Africa
                      title ="Conflict Fatality Distribution in Africa",
                      hover_name="country",
                      color_continuous_scale = "reds",
                    )
  Cumulative_fatalities_plot.update_traces(marker_line_color="black") # line markers between states
  st.plotly_chart(Cumulative_fatalities_plot)
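
The `load_data` function above has placeholder AWS keys typed directly into the source. One possible alternative, sketched here under the assumption that the keys are exported as the environment variables `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY`, is to read them at runtime. The helper name `get_aws_credentials` is hypothetical and not part of the app.

```python
import os


def get_aws_credentials(env=os.environ):
    """Return (access_key_id, secret_access_key) read from a mapping of environment variables.

    Hypothetical helper: raises a clear error when a variable is missing.
    """
    try:
        return env["AWS_ACCESS_KEY_ID"], env["AWS_SECRET_ACCESS_KEY"]
    except KeyError as missing:
        raise RuntimeError(f"Missing environment variable: {missing.args[0]}") from None


# Demonstration with a stand-in mapping rather than real secrets.
fake_env = {"AWS_ACCESS_KEY_ID": "example-id", "AWS_SECRET_ACCESS_KEY": "example-secret"}
key_id, secret = get_aws_credentials(fake_env)
```

The returned pair could then be passed to `boto3.client('s3', aws_access_key_id=key_id, aws_secret_access_key=secret)` in place of the literal strings.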

### 2.2 Run EDA Streamlit Application

The code snippet below generates a link for the EDA web app.

In [None]:
!streamlit run app.py & npx localtunnel --port 8501

### 3.1 Knowledge Graph Application Development

This code generates the Streamlit application for the knowledge graph.
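
At its core, the app turns (source, relation, target) triples into a directed graph restricted to one relation. The sketch below shows that step in isolation on made-up triples. The entity and relation strings are illustrative assumptions, not values extracted from ACLED notes.

```python
import networkx as nx
import pandas as pd

# Made-up triples in the same source/target/edge layout the app produces.
triples = pd.DataFrame({
    "source": ["militia", "police", "militia"],
    "target": ["village", "protesters", "convoy"],
    "edge": ["attacked", "dispersed", "attacked"],
})

# Keep one relation, then build a directed multigraph from the remaining rows.
attacked = triples[triples["edge"] == "attacked"]
G = nx.from_pandas_edgelist(
    attacked, "source", "target", edge_attr=True, create_using=nx.MultiDiGraph()
)

print(sorted(G.edges()))
```

Filtering before building the graph keeps unrelated entities (here `police` and `protesters`) out of the drawing entirely.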

In [8]:
%%writefile app2.py
## Load libraries
import pandas as pd
import matplotlib.pyplot as plt
import time
import datetime as dt
import geopandas
import plotly.express as px 
import seaborn as sns
import os
import streamlit as st
from streamlit import components
sns.set_theme()

import re
import bs4
import requests
import spacy
from spacy import displacy
nlp = spacy.load('en_core_web_sm')

from spacy.matcher import Matcher 
from spacy.tokens import Span 

import networkx as nx
from pyvis.network import Network

import matplotlib.pyplot as plt
from tqdm import tqdm
import boto3
from io import StringIO

## Switch off warnings
st.set_option('deprecation.showPyplotGlobalUse', False)

## Page title
st.title('Armed Conflict Location & Event Network Analysis')

## Load data
@st.cache
def load_data():

  """
  Loads  data from s3 bucket
  """
  
  ## Read in data from s3 bucket

  ## AWS keys
  aws_access_key_id = 'XXXXXXXXXXXXXXXXXXX'
  aws_secret_access_key = 'XXXXXXXXXXXXXXX'

  client = boto3.client('s3', aws_access_key_id=aws_access_key_id,
          aws_secret_access_key=aws_secret_access_key)

  bucket_name = 'dsi-acled-data'

  object_key = '2019-04-09-2022-04-14.csv'
  csv_obj = client.get_object(Bucket=bucket_name, Key=object_key)
  body = csv_obj['Body']
  csv_string = body.read().decode('utf-8')

  acled_data = pd.read_csv(StringIO(csv_string))

  ## Date manipulations
  acled_data['date'] = pd.to_datetime(acled_data['event_date'])
  acled_data['year_mon'] = acled_data['date'].dt.to_period('M')
  acled_data['year_mon2'] = acled_data['year_mon'].astype(str)

  ## Replace DRC name
  acled_data['country'] = acled_data['country'].str.replace('Democratic Republic of Congo','DRC')

  return acled_data

acled_data = load_data()

## Filter for Africa
africa = acled_data[acled_data['region'].str.contains("Africa")]

## Filter for country
africa_select = africa[['event_date','year','event_type','actor1','actor2','region','country','notes','fatalities']]

st.write('Select country of interest for network analysis using knowledge graph')

selected_country = st.selectbox(
    'Select Country for Network Analysis',
    africa['country'].unique())

country_data = africa_select[africa_select['country']==selected_country]

#Add title and subtitle to the main interface of the app


st.subheader("Country Network Analysis")
st.write('To generate entities and relations from the ACLED data click the button below.')

if st.button('Generate Entities and Relations'):
  with st.spinner("Generating Entities and Relations"):
    ## Entities Extraction

    def get_entities(sent):

      ## chunk 1
      ent1 = ""
      ent2 = ""

      prv_tok_dep = ""    # dependency tag of previous token in the sentence
      prv_tok_text = ""   # previous token in the sentence

      prefix = ""
      modifier = ""

      #############################################################
      
      for tok in nlp(sent):
        ## chunk 2
        # if token is a punctuation mark then move on to the next token
        if tok.dep_ != "punct":
          # check: token is a compound word or not
          if tok.dep_ == "compound":
            prefix = tok.text
            # if the previous word was also a 'compound' then add the current word to it
            if prv_tok_dep == "compound":
              prefix = prv_tok_text + " "+ tok.text
          
          # check: token is a modifier or not
          if tok.dep_.endswith("mod"):
            modifier = tok.text
            # if the previous word was also a 'compound' then add the current word to it
            if prv_tok_dep == "compound":
              modifier = prv_tok_text + " "+ tok.text
          
          ## chunk 3
          if "subj" in tok.dep_:  # e.g. nsubj, nsubjpass, csubj
            ent1 = modifier +" "+ prefix + " "+ tok.text
            prefix = ""
            modifier = ""
            prv_tok_dep = ""
            prv_tok_text = ""      

          ## chunk 4
          if "obj" in tok.dep_:  # e.g. dobj, pobj, iobj
            ent2 = modifier +" "+ prefix +" "+ tok.text
            
          ## chunk 5  
          # update variables
          prv_tok_dep = tok.dep_
          prv_tok_text = tok.text
      #############################################################

      return [ent1.strip(), ent2.strip()]

    entity_pairs = []

    for i in tqdm(country_data["notes"]):
      entity_pairs.append(get_entities(i))

    
    def get_relation(sent):
      
      doc = nlp(sent)

      # Matcher class object 
      matcher = Matcher(nlp.vocab)

      #define the pattern 
      pattern = [{'DEP':'ROOT'}, 
                {'DEP':'prep','OP':"?"},
                {'DEP':'agent','OP':"?"},  
                {'POS':'ADJ','OP':"?"}] 

      # spaCy v2 signature; on spaCy v3 use matcher.add("matching_1", [pattern])
      matcher.add("matching_1", None, pattern) 

      matches = matcher(doc)
      k = len(matches) - 1

      span = doc[matches[k][1]:matches[k][2]] 

      return(span.text)


    relations = [get_relation(i) for i in tqdm(country_data["notes"])]


    # extract subject
    source = [i[0] for i in entity_pairs]

    # extract object
    target = [i[1] for i in entity_pairs]

    ## Create entities and relation data frame
    kg_df = pd.DataFrame({'source':source, 'target':target, 'edge':relations})

    st.dataframe(kg_df)

    ## List top 50 relations
    st.write('The top 50 relations discovered are:')

    st.write(pd.Series(relations).value_counts()[:50])

    st.success('Network analysis completed.')

    st.subheader("Relation Knowledge Graph")

    st.write('Select relation to be displayed on the knowledge graph')

    select_relations = pd.Series(relations).value_counts()[:50].index.tolist()

    select_relations = pd.DataFrame(select_relations,columns=['relations'])

    select_relations.to_csv('select_relations.csv',index=False)
    #st.dataframe(select_relations)

    kg_df.to_csv('kg_df.csv',index= False)
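## Load the saved relations if the extraction step has been run at least once
if os.path.exists('select_relations.csv'):
  select_relations = pd.read_csv('select_relations.csv')
else:
  select_relations = pd.DataFrame({'relations': []})
selected_relation = st.selectbox('Select relation to be displayed on knowledge graph',select_relations['relations'])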

if st.button('Generate Knowledge Graph'):
  with st.spinner("Generating Knowledge Graph"):
    
    kg_df = pd.read_csv('kg_df.csv')
    #select_relations = pd.read_csv('select_relations.csv')


    G = nx.from_pandas_edgelist(kg_df[kg_df['edge']==selected_relation], "source", "target", 
                              edge_attr=True,create_using=nx.MultiDiGraph())

    plt.figure(figsize=(12,12))
    pos = nx.spring_layout(G, k = 0.5) # k regulates the distance between nodes
    ax = nx.draw(G, with_labels=True, node_color='skyblue', node_size=1500, edge_cmap=plt.cm.Blues, pos = pos)
      
    plt.savefig("graph.png")
    st.image('graph.png')

  

Overwriting app2.py
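
Entity extraction in this app decides subjects and objects by checking whether a dependency label contains `subj` or `obj`. One detail worth knowing when writing such checks: `str.find` returns an index, so comparing its result to `True` only passes when the substring starts at position 1, whereas a membership test passes wherever the substring occurs. The sketch below contrasts the two on a few illustrative label strings, chosen here for demonstration only.

```python
# Illustrative dependency-style labels (assumed for this demonstration).
labels = ["nsubj", "nsubjpass", "dobj", "pobj", "obj", "amod"]

for label in labels:
    by_find = label.find("obj") == True  # True only when the match starts at index 1
    by_membership = "obj" in label       # True wherever the substring occurs
    print(label, by_find, by_membership)

find_hits = [label for label in labels if label.find("obj") == True]
membership_hits = [label for label in labels if "obj" in label]
```

The two approaches agree on labels such as `dobj` and `pobj` but differ on a bare `obj`, where `find` returns 0.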


### 3.2 Run Knowledge Graph Application

This code snippet generates a URL for the application.

In [9]:
!streamlit run app2.py & npx localtunnel --port 8501

2022-04-22 02:50:16.322 INFO    numexpr.utils: NumExpr defaulting to 2 threads.

  You can now view your Streamlit app in your browser.

  Network URL: http://172.28.0.2:8501
  External URL: http://34.81.168.41:8501

npx: installed 22 in 3.778s
your url is: https://warm-cheetah-68.loca.lt
tcmalloc: large alloc 1645092864 bytes == 0x55851b7a4000 @  0x7f1e5910c2a4 0x558493786424 0x5584936e84e8 0x558493647184 0x558493607902 0x55849367ac4d 0x558493547d14 0x558493677ff1 0x558493675cdd 0x55849360888a 0x5584936768f6 0x558493675cdd 0x55849360888a 0x5584936768f6 0x558493675a2e 0x558493675723 0x558493673acb 0x558493606ff9 0x558493606ef0 0x55849367a9a3 0x5584936087aa 0x558493676b4f 0x558493608ce9 0x558493609341 0x558493677ff1 0x5584936087aa 0x558493676b4f 0x5584936087aa 0x558493676b4f 0x558493608ce9 0x558493609341
100% 1530/1530 [00:22<00:00, 67.96it/s]
100% 1530/1530 [00:21<00:00, 72.45it/s]
The iterable function was deprecated 