# Pipeline for calling the Google Cloud Natural Language API on reviews from the IMDB dataset

In [1]:
!pip install --quiet --upgrade google-cloud-language 

In [2]:
import os
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score
from google.cloud import language_v2

import warnings
warnings.filterwarnings('ignore')

Dataset source: https://ai.stanford.edu/~amaas/data/sentiment/

Read the dataset from Cloud Storage (reading a `gs://` path with pandas requires the `gcsfs` package to be installed).

In [3]:
imdb_data=pd.read_csv('gs://engo-ml_spec2023-demo3/input/IMDB Dataset.csv')
print(imdb_data.shape)

(50000, 2)


In [4]:
imdb_data.head(10)

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive
5,"Probably my all-time favorite movie, a story o...",positive
6,I sure would like to see a resurrection of a u...,positive
7,"This show was an amazing, fresh & innovative i...",negative
8,Encouraged by the positive comments about this...,negative
9,If you like original gut wrenching laughter yo...,positive


The dataset is balanced, with 25k positive and 25k negative reviews.

In [5]:
imdb_data['sentiment'].value_counts()

sentiment
positive    25000
negative    25000
Name: count, dtype: int64

Convert the "positive" and "negative" labels to integers (1 and 0).

In [6]:
imdb_data['sentiment_class'] =  [1 if x == "positive" else 0 for x in imdb_data["sentiment"]]
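
The list comprehension above maps every label other than "positive" to 0, including typos or missing values. As a hedged alternative, the sketch below uses an explicit dictionary with `Series.map` and checks that no label was left unmapped. The `toy` frame and the `label_to_int` name are illustrative stand-ins, not part of the original notebook.

```python
import pandas as pd

# Hypothetical toy frame standing in for imdb_data.
toy = pd.DataFrame({"sentiment": ["positive", "negative", "positive"]})

# Explicit mapping from label to integer class.
label_to_int = {"positive": 1, "negative": 0}
toy["sentiment_class"] = toy["sentiment"].map(label_to_int)

# map() produces NaN for labels missing from the dictionary, so this check
# fails loudly instead of silently treating unknown labels as negative.
assert toy["sentiment_class"].notna().all(), "unexpected sentiment label"
toy["sentiment_class"] = toy["sentiment_class"].astype(int)

print(toy)
```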

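Function that calls the Cloud Natural Language sentiment analysis API and parses the result. It returns the predicted class (1 if the document score is above 0, otherwise 0), the sentiment score, and the magnitude.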

In [7]:
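def sample_analyze_sentiment(text_content: str) -> tuple[int, float, float]:
    """
    Analyzes Sentiment in a string.

    Args:
        text_content: The text content to analyze.

    Returns:
        A tuple (category, score, magnitude), where category is 1 if the
        document score is greater than 0 and 0 otherwise.
    """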

    client = language_v2.LanguageServiceClient()

    # Available types: PLAIN_TEXT, HTML
    document_type_in_plain_text = language_v2.Document.Type.PLAIN_TEXT

    # Optional. If not specified, the language is automatically detected.
    # For list of supported languages:
    # https://cloud.google.com/natural-language/docs/languages
    language_code = "en"
    document = {
        "content": text_content,
        "type_": document_type_in_plain_text,
        "language_code": language_code,
    }

    # Available values: NONE, UTF8, UTF16, UTF32
    # See https://cloud.google.com/natural-language/docs/reference/rest/v2/EncodingType.
    encoding_type = language_v2.EncodingType.UTF8

    response = client.analyze_sentiment(
        request={"document": document, "encoding_type": encoding_type}
    )
    sentiment = response.document_sentiment
    # Scores above 0 count as positive (1); zero or below count as negative (0).
    return int(sentiment.score > 0), sentiment.score, sentiment.magnitude
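
The function turns the API's document score, which ranges from -1.0 (negative) to 1.0 (positive), into a binary class with a `score > 0` test. The sketch below isolates that rule in a small helper so its edge case is visible: a neutral score of exactly 0 lands in the negative class. The helper name `score_to_class` and its `threshold` parameter are assumptions made for illustration and do not appear in the notebook.

```python
def score_to_class(score: float, threshold: float = 0.0) -> int:
    """Map a sentiment score in [-1.0, 1.0] to 1 (positive) or 0 (negative).

    Hypothetical helper mirroring the `score > 0` rule used in the notebook.
    """
    return int(score > threshold)


print(score_to_class(0.8))   # 1
print(score_to_class(-0.3))  # 0
print(score_to_class(0.0))   # 0, a neutral score is counted as negative
```

Raising or lowering `threshold` would be one way to trade off errors between the two classes, although the notebook itself keeps the cutoff at 0.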

Sample only a subset of the reviews to demonstrate the pipeline and avoid unnecessary API costs. The pipeline has already been run on the full dataset, and the results are stored in the /data folder.

In [23]:
imdb_subset = imdb_data.sample(20)
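
Because `sample(20)` is called without a seed, each run draws a different subset and the reported accuracy can change between runs. A minimal sketch of a reproducible draw follows, assuming a fixed `random_state` is acceptable; the seed value 42 and the `toy` frame are arbitrary choices for illustration.

```python
import pandas as pd

# Hypothetical toy frame standing in for imdb_data.
toy = pd.DataFrame({"review": [f"review {i}" for i in range(100)]})

# The same random_state yields the same rows on every run.
subset_a = toy.sample(20, random_state=42)
subset_b = toy.sample(20, random_state=42)

print(subset_a.index.equals(subset_b.index))
```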

Calling the API on each row of the subset

In [24]:
# Make sure the output folder exists before saving.
os.makedirs("data_examples", exist_ok=True)

scores = []
magnitudes = []
categories = []
for _, row in imdb_subset.iterrows():
    category, score, magnitude = sample_analyze_sentiment(row["review"])
    categories.append(category)
    scores.append(score)
    magnitudes.append(magnitude)
    # Saved after every call, so partial results remain on disk if the loop is interrupted.
    np.save(os.path.join("data_examples", "categories_example.npy"), categories)
    np.save(os.path.join("data_examples", "scores_example.npy"), scores)
    np.save(os.path.join("data_examples", "magnitudes_example.npy"), magnitudes)
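
The loop makes one network call per review, so a single transient failure would stop the run. The sketch below shows one possible way to retry a call with exponential backoff. The helper `call_with_retry` and its parameters are hypothetical, and the broad `except Exception` is a simplification: a real pipeline would more likely catch only the specific transient errors raised by the client library.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retry(func: Callable[[], T], max_attempts: int = 3,
                    base_delay: float = 1.0) -> T:
    """Call func(), retrying with exponential backoff when it raises."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except Exception:
            # Out of attempts: re-raise the last error to the caller.
            if attempt == max_attempts:
                raise
            # Exponential backoff: base_delay, 2*base_delay, 4*base_delay, ...
            time.sleep(base_delay * 2 ** (attempt - 1))


# Demo with a stand-in function that fails twice before succeeding.
calls = {"n": 0}


def flaky() -> str:
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient failure")
    return "ok"


result = call_with_retry(flaky, max_attempts=3, base_delay=0.0)
print(result, calls["n"])
```

In the loop above, the API call could then be wrapped as `call_with_retry(lambda: sample_analyze_sentiment(row["review"]))`.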

Accuracy of the results. Accuracy is a suitable metric for binary classification when the classes are balanced, as they are in the full dataset.

In [25]:
accuracy = accuracy_score(imdb_subset['sentiment_class'], categories)

In [26]:
print(f"Cloud Natural Language Sentiment Analysis Accuracy on a random IMDB subset: {accuracy:.0%}")

Cloud Natural Language Sentiment Analysis Accuracy on a random IMDB subset: 100%
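
For reference, accuracy is the fraction of predictions that match the true labels. The sketch below computes it by hand on made-up labels, which should agree with what `accuracy_score` returns for the same inputs. The function name `manual_accuracy` and the example lists are illustrative only and are not results from the API.

```python
def manual_accuracy(y_true: list[int], y_pred: list[int]) -> float:
    """Return the share of positions where prediction and label agree."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("cannot compute accuracy on empty input")
    matches = sum(t == p for t, p in zip(y_true, y_pred))
    return matches / len(y_true)


# Made-up labels and predictions: 4 of 5 agree.
labels = [1, 0, 1, 1, 0]
preds = [1, 0, 0, 1, 0]
print(manual_accuracy(labels, preds))  # 0.8
```

With only 20 sampled reviews, each single misclassification moves the accuracy by 5 percentage points, so a figure from such a small subset is a rough indication rather than a precise estimate.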


Adding the results to the subset's dataframe

In [27]:
imdb_subset["class_from_api"] = categories
imdb_subset["scores_from_api"] = scores
imdb_subset["magnitudes_from_api"] = magnitudes

Reviews that have been misclassified (none in this subset, so the result is empty)

In [28]:
imdb_subset[imdb_subset["sentiment_class"] != imdb_subset["class_from_api"]]

Unnamed: 0,review,sentiment,sentiment_class,class_from_api,scores_from_api,magnitudes_from_api
