<a href="https://colab.research.google.com/github/adamdenault/colab-notebooks/blob/master/DMG_Entity_Classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This notebook lets a user run entity analysis on a set of URLs (four in the example below) and compare the central topics covered by each piece of content. It is a tutorial implementation of the entity analysis method in Google's Cloud Natural Language API.

For more guidance on how to run this, see here: https://sashadagayev.com/systematically-analyze-your-content-vs-competitor-content-and-make-actionable-improvements/

To use this notebook, go to File > Save a Copy in Drive.


In [21]:
import os
import json
import urllib3
import pandas as pd

In [22]:
# Change this to the path of your own credentials file.
# Credentials can be obtained here: https://cloud.google.com/natural-language/docs/quickstart
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = "/content/nlp-colab.json"

In [23]:
# Import the Google Cloud client library.
# Note: the enums module is only available in google-cloud-language releases before 2.0.
from google.cloud import language_v1
from google.cloud.language_v1 import enums

# Instantiate a client
client = language_v1.LanguageServiceClient()

This is the default entity analysis sample recommended by Google. It is currently set up to take the HTML content of a page as input.

In [24]:
def sample_analyze_entities(html_content):
    client = language_v1.LanguageServiceClient()

    # Available types: PLAIN_TEXT, HTML
    type_ = enums.Document.Type.HTML #you can change this to be just text; doesn't have to be HTML.

    # Optional. If not specified, the language is automatically detected.
    # For list of supported languages:
    # https://cloud.google.com/natural-language/docs/languages
    language = "en"
    document = {"content": html_content, "type": type_, "language": language}

    # Available values: NONE, UTF8, UTF16, UTF32
    encoding_type = enums.EncodingType.UTF8

    response = client.analyze_entities(document, encoding_type=encoding_type)
    return response


In [25]:
def return_entity_dataframe(response):
  # Run entity analysis on the raw HTML body of the urllib3 response
  output = sample_analyze_entities(response.data)
  url = response._request_url  # private urllib3 attribute holding the requested URL
  output_list = []
  for entity in output.entities:
    entity_dict = {}
    entity_dict['entity_name'] = entity.name
    entity_dict['entity_type'] = enums.Entity.Type(entity.type).name
    entity_dict['entity_salience('+url+')'] = entity.salience
    entity_dict['entity_number_of_mentions('+url+')'] = len(entity.mentions)
    output_list.append(entity_dict)
  # Build the DataFrame directly from the list of dicts (no JSON round trip needed)
  df = pd.DataFrame(output_list)
  # Combine rows that share an entity name; only the numeric columns are summed
  summed_df = df.groupby(['entity_name']).sum(numeric_only=True)
  # sort_values returns a new DataFrame, so return its result rather than discarding it
  return summed_df.sort_values(by=['entity_salience('+url+')'], ascending=False)
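
The grouping step in `return_entity_dataframe` can be seen in isolation on a few made-up rows. This is only a sketch: the rows, the shortened column names `salience` and `mentions`, and the assumption that the API can return the same entity name more than once (for example under different types) are illustrative, not taken from a real response.

```python
import pandas as pd

# Hypothetical rows shaped like the dicts built in return_entity_dataframe.
rows = [
    {"entity_name": "SUV", "entity_type": "CONSUMER_GOOD", "salience": 0.04, "mentions": 3},
    {"entity_name": "SUV", "entity_type": "OTHER", "salience": 0.02, "mentions": 1},
    {"entity_name": "Service", "entity_type": "OTHER", "salience": 0.01, "mentions": 2},
]
df = pd.DataFrame(rows)

# Sum the numeric columns per entity name; the text column entity_type is left out.
summed = df.groupby("entity_name").sum(numeric_only=True)

# sort_values returns a new DataFrame, so the result has to be kept.
summed = summed.sort_values(by="salience", ascending=False)
print(summed)
```

Summing means an entity that appears several times on a page ends up as one row with its combined salience and mention count.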




# Swap out your URLs HERE!!!


In [26]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [27]:
http = urllib3.PoolManager()
response1 = http.request('GET','https://mbbhm.com')
response2 = http.request('GET','https://mbofmc.com')
response3 = http.request('GET','https://mbotw.com')
response4 = http.request('GET','https://mbobr.com')



In [42]:
url1_analysis = return_entity_dataframe(response1)
url2_analysis = return_entity_dataframe(response2)
url3_analysis = return_entity_dataframe(response3)
url4_analysis = return_entity_dataframe(response4)

The cell below outer-joins the entity results for all four URLs into one table so that they are easier to review and compare side by side. An entity that was not found on a given page has empty (NaN) values in that page's columns.

In [45]:
url1and2 = url1_analysis.merge(url2_analysis,how='outer', left_on='entity_name', right_on="entity_name")
url1and2and3 = url1and2.merge(url3_analysis,how='outer', left_on='entity_name', right_on="entity_name")
url1and2and3and4 = url1and2and3.merge(url4_analysis,how='outer', left_on='entity_name', right_on="entity_name")
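
The chained merges above can also be written for any number of tables. The following is a minimal sketch on made-up data, not the notebook's own method: the helper `toy_table`, the labels `a`, `b`, `c`, and the salience values are all hypothetical, and it joins on the index explicitly with `left_index=True, right_index=True`.

```python
from functools import reduce

import pandas as pd


def toy_table(label, saliences):
    # Build a hypothetical per-URL table indexed by entity_name.
    index = pd.Index(list(saliences), name="entity_name")
    column = f"entity_salience({label})"
    return pd.DataFrame({column: list(saliences.values())}, index=index)


tables = [
    toy_table("a", {"Mercedes-Benz": 0.5, "SUV": 0.2}),
    toy_table("b", {"Mercedes-Benz": 0.3, "Service": 0.1}),
    toy_table("c", {"SUV": 0.4, "test drive": 0.05}),
]

# Outer-join the tables pairwise on their index, keeping every entity seen anywhere.
combined = reduce(
    lambda left, right: left.merge(right, how="outer", left_index=True, right_index=True),
    tables,
)

# Entities with NaN in a page's column were not found on that page.
missing_from_a = combined.index[combined["entity_salience(a)"].isna()].tolist()
print(combined)
print(missing_from_a)
```

Filtering on NaN this way gives a quick list of topics that other pages cover and the first page does not.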

In [49]:
url1and2and3and4.sort_values(by=['entity_salience('+response1._request_url+')'], ascending=False)

entity_name,entity_salience(https://mbbhm.com),entity_number_of_mentions(https://mbbhm.com),entity_salience(https://mbofmc.com),entity_number_of_mentions(https://mbofmc.com),entity_salience(https://mbotw.com),entity_number_of_mentions(https://mbotw.com),entity_salience(https://mbobr.com),entity_number_of_mentions(https://mbobr.com)
Mercedes-Benz,0.233861,21.0,0.065588,28.0,0.100136,35.0,0.143482,26.0
Vehicles,0.093655,13.0,0.008338,2.0,0.060297,14.0,0.069115,10.0
Specials,0.074603,14.0,0.029184,10.0,0.021944,11.0,0.016727,10.0
SUV,0.064842,10.0,0.016013,6.0,0.036626,15.0,0.040724,12.0
Service,0.028901,14.0,0.017039,9.0,0.014871,11.0,0.009877,8.0
...,...,...,...,...,...,...,...,...
site,,,,,,,0.000437,1.0
standards,,,,,,,0.000168,1.0
test drive,,,,,,,0.000184,1.0
types,,,,,,,0.000485,1.0


In [50]:
url1and2and3and4.to_csv('DMG-entities.csv')