<a href="https://colab.research.google.com/github/crazimon/github-slideshow/blob/master/Entity_Classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This notebook lets you run entity analysis on three URLs and compare the central topics each piece of content covers. It is a tutorial implementation of Google's NLP Entity Analysis API.

For more guidance on running this notebook, see: https://sashadagayev.com/systematically-analyze-your-content-vs-competitor-content-and-make-actionable-improvements/

To use this notebook, go to File > Save a Copy in Drive.


In [None]:
import os
import json
import urllib3
import pandas as pd

In [None]:
# Change this to the path of your own credentials file. Credentials can be obtained here:
# https://cloud.google.com/natural-language/docs/quickstart
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = "/content/nlp-colab.json"
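
If the credentials path is wrong, the problem only surfaces later, when the client is created. A minimal sketch of an earlier check, assuming a hypothetical helper named `set_credentials_path` (it is not part of the original notebook or of any Google library):

```python
import os


def set_credentials_path(path):
    """Point GOOGLE_APPLICATION_CREDENTIALS at `path`, failing early if the file is missing.

    Hypothetical helper for illustration only.
    """
    if not os.path.isfile(path):
        raise FileNotFoundError(f"Credentials file not found: {path}")
    os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = path
    return path
```

In this notebook it would be called as `set_credentials_path("/content/nlp-colab.json")` after uploading the key file.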

In [None]:
# Import the Google Cloud Natural Language client library.
# This assumes google-cloud-language < 2.0, where the `enums` module is available.
from google.cloud import language_v1
from google.cloud.language_v1 import enums

# Instantiate a client
client = language_v1.LanguageServiceClient()

This is the default entity analysis sample recommended by Google. It is currently set up to take in the HTML response from a page.

In [None]:
def sample_analyze_entities(html_content):
    client = language_v1.LanguageServiceClient()

    # Available types: PLAIN_TEXT, HTML
    # HTML is used here; switch to PLAIN_TEXT if you pass plain text instead.
    type_ = enums.Document.Type.HTML

    # Optional. If not specified, the language is automatically detected.
    # For list of supported languages:
    # https://cloud.google.com/natural-language/docs/languages
    language = "en"
    document = {"content": html_content, "type": type_, "language": language}

    # Available values: NONE, UTF8, UTF16, UTF32
    encoding_type = enums.EncodingType.UTF8

    response = client.analyze_entities(document, encoding_type=encoding_type)
    return response
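
The function above sends raw HTML, and its comments note that plain text is also accepted. A minimal sketch of one way to produce plain text yourself, using only the standard library; `html_to_text` and `_TextExtractor` are hypothetical names introduced here, not part of the Google client library:

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects visible text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep non-empty text that is not inside a skipped element.
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def html_to_text(html):
    """Return the text content of an HTML string, joined with single spaces."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)
```

Because `response.data` from urllib3 is bytes, it would need decoding first, for example `html_to_text(response1.data.decode("utf-8"))`, and the document type would be set to `enums.Document.Type.PLAIN_TEXT`.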


In [None]:
def return_entity_dataframe(response):
  # Run entity analysis on the raw page body and collect one row per entity.
  url = response._request_url
  output = sample_analyze_entities(response.data)
  output_list = []
  for entity in output.entities:
    entity_dict = {}
    entity_dict['entity_name'] = entity.name
    entity_dict['entity_type'] = enums.Entity.Type(entity.type).name
    entity_dict['entity_salience('+url+')'] = entity.salience
    entity_dict['entity_number_of_mentions('+url+')'] = len(entity.mentions)
    output_list.append(entity_dict)
  df = pd.DataFrame(output_list)
  # Sum salience and mention counts for entities that appear more than once.
  summed_df = df.groupby(['entity_name']).sum(numeric_only=True)
  # sort_values returns a new DataFrame, so assign the result.
  summed_df = summed_df.sort_values(by=['entity_salience('+url+')'], ascending=False)
  return summed_df
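
The grouping and sorting steps can be tried without calling the API. A small sketch on invented rows (the entity names, column names, and numbers below are made up for illustration):

```python
import pandas as pd

# Toy rows standing in for API output; values are invented.
rows = [
    {"entity_name": "Google", "entity_type": "ORGANIZATION", "salience": 0.30, "mentions": 5},
    {"entity_name": "Google", "entity_type": "ORGANIZATION", "salience": 0.05, "mentions": 2},
    {"entity_name": "search engine", "entity_type": "OTHER", "salience": 0.50, "mentions": 3},
]
df = pd.DataFrame(rows)

# numeric_only=True keeps string columns such as entity_type out of the sum.
summed = df.groupby("entity_name").sum(numeric_only=True)

# sort_values returns a new DataFrame; the result must be assigned to take effect.
ranked = summed.sort_values(by="salience", ascending=False)
```

The two "Google" rows collapse into one row with their salience and mention counts added together, and `ranked` lists the most salient entity first.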




# Swap out your URLs HERE!!!


In [None]:
http = urllib3.PoolManager()
response1 = http.request('GET','https://en.wikipedia.org/wiki/Search_engine_optimization')
response2 = http.request('GET','http://mozseoclass.com/who-is-the-smartest-seo/')
response3 = http.request('GET','https://hookagency.com/who-is-the-smartest-seo-in-the-world/')
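
Before analyzing a page, it can help to confirm the request actually succeeded, since an error page would otherwise be analyzed as if it were the content. A minimal sketch, assuming a hypothetical helper named `require_ok`; it relies only on the `status` and `data` attributes that urllib3 responses provide:

```python
def require_ok(response, url):
    """Return the response body if the HTTP status is 200, otherwise raise.

    Hypothetical helper for illustration; `url` is used only in the error message.
    """
    if response.status != 200:
        raise RuntimeError(f"GET {url} returned HTTP {response.status}")
    return response.data
```

For example, `require_ok(response1, 'https://en.wikipedia.org/wiki/Search_engine_optimization')` would raise if the fetch returned anything other than 200.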



In [None]:
url1_analysis = return_entity_dataframe(response1)
url2_analysis = return_entity_dataframe(response2)
url3_analysis = return_entity_dataframe(response3)

The cell below joins the entity results for all three URLs into one table so that they are easier to review and compare.

In [None]:
url1and2 = url1_analysis.merge(url2_analysis,how='outer', left_on='entity_name', right_on="entity_name")
url1and2and3 = url1and2.merge(url3_analysis,how='outer', left_on='entity_name', right_on="entity_name")
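
An outer merge keeps every entity found on either page and leaves NaN where a page does not mention it. A small sketch on invented data (the frame names, column names, and scores are placeholders):

```python
import pandas as pd

# Invented salience scores for two hypothetical pages, indexed by entity name.
page_a = pd.DataFrame(
    {"salience(a)": [0.45, 0.08]},
    index=pd.Index(["Google", "search engines"], name="entity_name"),
)
page_b = pd.DataFrame(
    {"salience(b)": [0.04, 0.01]},
    index=pd.Index(["Google", "winner"], name="entity_name"),
)

# Entities missing from one page get NaN in that page's column.
combined = page_a.merge(page_b, how="outer", on="entity_name")
```

Here "Google" has a value in both columns, while "search engines" and "winner" each have a value in only one.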

In [None]:
url1and2and3.sort_values(by=['entity_salience('+response1._request_url+')'], ascending=False)

entity_name,entity_salience(https://en.wikipedia.org/wiki/Search_engine_optimization),entity_number_of_mentions(https://en.wikipedia.org/wiki/Search_engine_optimization),entity_salience(http://mozseoclass.com/who-is-the-smartest-seo/),entity_number_of_mentions(http://mozseoclass.com/who-is-the-smartest-seo/),entity_salience(https://hookagency.com/who-is-the-smartest-seo-in-the-world/),entity_number_of_mentions(https://hookagency.com/who-is-the-smartest-seo-in-the-world/)
Google,0.450027,72.0,0.037636,7.0,0.001241,6.0
search engines,0.082046,33.0,,,0.000236,1.0
Stats Show Google,0.050376,13.0,,,,
Wikipedia,0.033818,12.0,,,,
search engine,0.018523,13.0,,,,
...,...,...,...,...,...,...
winner,,,,,0.000073,1.0
wins,,,,,0.000412,2.0
world battle,,,,,0.000255,1.0
world title,,,,,0.000280,1.0
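
One way to read a table like this is to look for entities that another page covers but the reference page does not, that is, rows where the reference page's salience is NaN. A sketch on invented data; the column names `salience(mine)` and `salience(theirs)` are placeholders, not the notebook's actual column names:

```python
import pandas as pd

# Invented combined table: NaN means the page did not mention the entity.
combined = pd.DataFrame(
    {
        "salience(mine)": [0.45, None, None],
        "salience(theirs)": [0.04, 0.01, 0.02],
    },
    index=pd.Index(["Google", "winner", "wins"], name="entity_name"),
)

# Entities the other page mentions that are absent from mine, most salient first.
missing_from_mine = combined["salience(mine)"].isna() & combined["salience(theirs)"].notna()
gaps = combined[missing_from_mine].sort_values(by="salience(theirs)", ascending=False)
```

With the real table, the same filter would use the notebook's own `entity_salience(...)` column names.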


In [None]:
url1and2and3.to_csv('x-entities.csv')