# GCP API based enrichment of modeling datasets

Authors: Asli Sabanci Demiroz, Shu Li

Version Date: 2023-03-20

<div>
<img src="https://storage.googleapis.com/public-artifacts-datarobot/e2e_logos/DR%20and%20GCP%20Better%20Together.svg" width=200>
</div>

## Summary
Text data is a valuable source of information for machine learning (ML) models. It can come from many sources, such as social media, news articles, and customer feedback, and techniques such as sentiment analysis and topic modeling turn it into insights that help organizations make informed decisions. However, text is challenging to use in ML models because of the complexity of natural language, the presence of bias and noise, and the lack of standardization. It also requires significant preprocessing and feature engineering before a model can use it effectively.

## Demo use case
One common application of text mining is sentiment analysis, which assigns a numerical value representing whether a text carries a positive, neutral, or negative sentiment. While DataRobot can help build such models efficiently, training them requires large, accurately labeled corpora, which makes the task challenging for users who lack such a training dataset.

In this notebook, we demonstrate how to use the [Google Cloud Natural Language API for sentiment analysis](https://cloud.google.com/natural-language/docs/analyzing-sentiment) to enrich a customer churn dataset. The sentiment scores from the Google API help improve model performance in predicting the likelihood of churn for each customer, without requiring users to train their own sentiment models.




In [None]:
!pip install google-cloud-language 
!pip install google-cloud-aiplatform
!pip install datarobot

## Environment variables

### Google credentials
Set the `GOOGLE_APPLICATION_CREDENTIALS` environment variable to the path of a credentials JSON file to authenticate with the Google Cloud APIs. See the documentation for [other approaches](https://cloud.google.com/docs/authentication/application-default-credentials).
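
Before calling the API, it can help to confirm that the variable is set and points to a real file. The following is a minimal sketch using only the standard library; `credentials_file_status` is a hypothetical helper name, not part of any Google library, and it only checks that the file exists, not that the credentials are valid.

```python
import os
from pathlib import Path


def credentials_file_status(env=None):
    """Return 'unset', 'missing file', or 'ok' for GOOGLE_APPLICATION_CREDENTIALS."""
    # Allow a plain dict to be passed in so the check is easy to try out
    env = os.environ if env is None else env
    path = env.get("GOOGLE_APPLICATION_CREDENTIALS")
    if not path:
        return "unset"
    if not Path(path).is_file():
        return "missing file"
    return "ok"


# Report the status for the current process environment
print(credentials_file_status())
```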

### DR credentials

In a local environment, a `drconfig.yaml` file stored under `~/.config/datarobot/` can be used to authenticate the DataRobot client. See the documentation for [other approaches](https://docs.datarobot.com/en/docs/api/api-quickstart/index.html).
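
As a rough illustration of the file layout, the sketch below builds the text of a minimal `drconfig.yaml` containing the two keys that this notebook reads (`endpoint` and `token`). The helper name `render_drconfig`, the endpoint URL, and the token placeholder are assumptions for illustration; substitute the values for your own DataRobot installation.

```python
def render_drconfig(endpoint, token):
    """Build the text of a minimal drconfig.yaml with the keys read in this notebook."""
    # Values are single-quoted so that the URL is parsed as a plain YAML string
    return f"endpoint: '{endpoint}'\ntoken: '{token}'\n"


# Placeholder values (assumptions); replace them with your own endpoint and API token
print(render_drconfig("https://app.datarobot.com/api/v2", "YOUR_API_TOKEN"))
```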

In [None]:
import os
import yaml
from google.colab import drive
drive.mount('/content/drive')
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "/content/drive/MyDrive/creds/mlops-testing-a1b6500353ac.json"
with open('/content/drive/MyDrive/creds/drconfig.yaml', 'r') as f:
  data = yaml.safe_load(f)
DATAROBOT_ENDPOINT = data['endpoint']
DATAROBOT_API_KEY = data['token']

Mounted at /content/drive


## Setup

### Import libraries

In [None]:
import numpy as np
import pandas as pd
import datarobot as dr
from google.cloud import language_v1

In [None]:
raw_fpath = "/content/drive/MyDrive/experiments/nlp/input"
raw_fname = "churn_dataset.csv"
enriched_fpath = "/content/drive/MyDrive/experiments/nlp/output"
enriched_fname = "churn_dataset_enriched.csv"

## Customer churn dataset
In addition to customer profile data such as tenure, state, current plans, and usage, the dataset has a `chat_log` column containing the transcript of the call between the customer and a service agent, which we will enrich with numerical sentiment scores.

In [None]:
df = pd.read_csv(f"{raw_fpath}/{raw_fname}")
df.shape

(3333, 21)

In [None]:
df.head()

Unnamed: 0,churn,chat_log,state,account_length,area_code,international_plan,voice_mail_plan,number_vmail_messages,total_day_minutes,total_day_calls,...,total_eve_minutes,total_eve_calls,total_eve_charge,total_night_minutes,total_night_calls,total_night_charge,total_intl_minutes,total_intl_calls,total_intl_charge,number_customer_service_calls
0,no,,DC,73,area_code_408,no,no,0,122.0,92,...,138.3,114,11.76,224.2,128,10.09,5.8,5,1.57,1
1,no,,OH,114,area_code_415,no,no,0,191.5,88,...,175.2,78,14.89,220.3,118,9.91,0.0,0,0.0,0
2,yes,Customer: I just received an early termination...,MS,65,area_code_415,yes,no,0,277.9,123,...,155.8,112,13.24,256.9,71,11.56,9.2,10,2.48,0
3,no,Customer: I would like to upgrade my contract ...,WY,126,area_code_408,yes,no,0,197.6,126,...,246.5,112,20.95,285.3,104,12.84,12.5,8,3.38,2
4,no,"Customer: Voice, text and data.\nTelCom Agent:...",WY,54,area_code_408,no,yes,39,143.9,73,...,210.3,117,17.88,129.2,117,5.81,12.5,8,3.38,2


## Enrichment of the `chat_log` column using the GCP API for sentiment analysis

The helper function `analyze_sentiment()` sends the text of one row to the API, and `extract_sentiment()` summarizes the response into the min and max sentence scores, among other metrics that may be of interest for experimentation.

In [None]:
def analyze_sentiment(text_content):
    """Call the Natural Language API; return the response, or None if the call fails."""
    client = language_v1.LanguageServiceClient()
    type_ = language_v1.Document.Type.PLAIN_TEXT
    document = {"content": text_content, "type_": type_}
    encoding_type = language_v1.EncodingType.UTF8
    try:
        return client.analyze_sentiment(request={'document': document, 'encoding_type': encoding_type})
    except Exception as e:
        print(f"Got a GCP exception!: {e}")
        return None

def extract_sentiment(annotations):
    """Summarize document- and sentence-level sentiment into a flat dict of features."""
    # analyze_sentiment() returns None on failure; an empty dict becomes a row of NaN
    if annotations is None:
        return {}
    result = {
        'document_score': annotations.document_sentiment.score,
        'magnitude': annotations.document_sentiment.magnitude,
    }
    sentence_sentiment = [s.sentiment.score for s in annotations.sentences]
    # Weight each sentence score by its magnitude
    weighted_sentence_sentiment = [s.sentiment.score * s.sentiment.magnitude for s in annotations.sentences]
    # min() and max() raise ValueError on an empty list, so skip the sentence stats in that case
    if sentence_sentiment:
        result.update({
            'max_score': max(sentence_sentiment),
            'min_score': min(sentence_sentiment),
            'mean_score': np.mean(sentence_sentiment),
            'std_score': np.std(sentence_sentiment),
            'max_weighted_score': max(weighted_sentence_sentiment),
            'min_weighted_score': min(weighted_sentence_sentiment),
            'mean_weighted_score': np.mean(weighted_sentence_sentiment),
            'std_weighted_score': np.std(weighted_sentence_sentiment),
        })
    return result
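
To see what `extract_sentiment()` computes without calling the API, the summary statistics can be reproduced on a hand-built stand-in for a response. This is a minimal sketch: `fake_sentence` and `summarize` are hypothetical helpers, the three sentence scores are made-up illustrative values rather than API output, and `statistics.pstdev` is used because `np.std` computes the population standard deviation by default.

```python
from statistics import mean, pstdev
from types import SimpleNamespace


def fake_sentence(score, magnitude):
    """Mimic the attributes of one sentence that extract_sentiment() reads."""
    return SimpleNamespace(sentiment=SimpleNamespace(score=score, magnitude=magnitude))


def summarize(sentences):
    """Compute a subset of the per-sentence statistics from extract_sentiment()."""
    scores = [s.sentiment.score for s in sentences]
    weighted = [s.sentiment.score * s.sentiment.magnitude for s in sentences]
    return {
        "min_score": min(scores),
        "max_score": max(scores),
        "mean_score": mean(scores),
        "std_score": pstdev(scores),
        "min_weighted_score": min(weighted),
        "max_weighted_score": max(weighted),
    }


# Illustrative values only: one negative, one neutral, and one positive sentence
sentences = [fake_sentence(-0.6, 0.6), fake_sentence(0.0, 0.0), fake_sentence(0.3, 0.3)]
summary = summarize(sentences)
print(summary)
```

The single most negative sentence drives `min_score`, which is why it can carry a signal that the document-level average dilutes.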

### Score the `chat_log` column and concat to the original raw dataset
The GCP API provides scores for each sentence within a given transcript, as well as a document level score, which we summarize into `min_score`, `max_score`, `document_score`, etc.

In [None]:
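enrichment = []
for _, row in df.iterrows():
    if not pd.isna(row['chat_log']):
        scores = analyze_sentiment(row["chat_log"])
        # analyze_sentiment() returns None when the API call fails; keep the row with empty features
        enrichment.append(extract_sentiment(scores) if scores is not None else {})
    else:
        # No transcript: leave the sentiment features empty (missing keys become NaN)
        enrichment.append(
            {
                'document_score':None,
                'magnitude':None,
                'max_score':None,
                'min_score':None,
                'mean_score':None,
                'std_score':None,
             }
        )
df_enrichment = pd.DataFrame(enrichment)
# Both frames share the same default row index, so a column-wise concat aligns row by row
df_enriched = pd.concat([df, df_enrichment], axis=1)
df_enriched.head()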

Unnamed: 0,churn,chat_log,state,account_length,area_code,international_plan,voice_mail_plan,number_vmail_messages,total_day_minutes,total_day_calls,...,document_score,magnitude,max_score,min_score,mean_score,std_score,max_weighted_score,min_weighted_score,mean_weighted_score,std_weighted_score
0,no,,DC,73,area_code_408,no,no,0,122.0,92,...,,,,,,,,,,
1,no,,OH,114,area_code_415,no,no,0,191.5,88,...,,,,,,,,,,
2,yes,Customer: I just received an early termination...,MS,65,area_code_415,yes,no,0,277.9,123,...,-0.3,3.1,0.0,-0.6,-0.288889,0.237788,0.0,-0.36,-0.14,0.141657
3,no,Customer: I would like to upgrade my contract ...,WY,126,area_code_408,yes,no,0,197.6,126,...,0.1,2.3,0.5,-0.2,0.171429,0.281396,0.25,-0.04,0.085714,0.115246
4,no,"Customer: Voice, text and data.\nTelCom Agent:...",WY,54,area_code_408,no,yes,39,143.9,73,...,0.0,2.8,0.4,-0.3,0.066667,0.239212,0.16,-0.09,0.031667,0.081326
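
The row-by-row pattern above can be tried without network access by swapping the API call for a stub. The sketch below is an assumption-laden illustration, not the notebook's implementation: `enrich_texts` and `toy_scorer` are hypothetical names, and the stub's scores are arbitrary. Rows with no transcript or a failed call yield an empty dict, which `pd.DataFrame` would turn into a row of missing values.

```python
def enrich_texts(texts, score_fn):
    """Score each text; missing texts and failed calls yield an empty dict."""
    rows = []
    for text in texts:
        # None or NaN (NaN is the only value that is not equal to itself)
        if text is None or text != text:
            rows.append({})
            continue
        try:
            rows.append(score_fn(text))
        except Exception as exc:
            print(f"Scoring failed: {exc}")
            rows.append({})
    return rows


def toy_scorer(text):
    """Hypothetical stand-in for analyze_sentiment() followed by extract_sentiment()."""
    if "error" in text:
        raise RuntimeError("simulated API failure")
    return {"min_score": -0.5 if "cancel" in text else 0.2}


rows = enrich_texts([None, float("nan"), "I want to cancel", "error"], toy_scorer)
print(rows)
```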


### Save the output dataset for record

In [None]:
# index=False avoids writing the row index as an extra unnamed column
df_enriched.to_csv(f'{enriched_fpath}/{enriched_fname}', index=False)

## Connect to DataRobot and create DataRobot project

### Read more about different options for [connecting to DataRobot from the client](https://docs.datarobot.com/en/docs/api/api-quickstart/index.html).

In [None]:
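# Configure the global DataRobot client with the credentials loaded earlier
dr.Client(token=DATAROBOT_API_KEY, endpoint=DATAROBOT_ENDPOINT)
# Upload the enriched dataset and start modeling with 'churn' as the binary target
project = dr.Project.create(df_enriched, project_name='churn_enriched_data')
project.analyze_and_model(target='churn', positive_class='yes', worker_count=-1)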

Project(churn_enriched_data)

## Create a feature list that's limited to the raw features in the original dataset, and run Autopilot with the feature list

In [None]:
raw_features = list(df.columns.values)
raw_feature_list = project.create_featurelist(name='no sentiment', features=raw_features)
project.start_autopilot(featurelist_id=raw_feature_list.id, mode='comprehensive')

## Create a feature list that appends the min sentiment score to the raw features, and run Autopilot with the feature list

In [None]:
enriched_features = raw_features + ['min_score']
enriched_feature_list = project.create_featurelist(name='min sentiment', features=enriched_features)
project.start_autopilot(featurelist_id=enriched_feature_list.id, mode='comprehensive')

In [None]:
project.wait_for_autopilot()
project.unlock_holdout()

Project(churn_enriched_data)

### Retrain the top models for each feature list with 100% samples

In [None]:
best_model_no_sentiment = [m for m in project.get_models() if m.featurelist_name == 'no sentiment'][0]
best_model_min_sentiment = [m for m in project.get_models() if m.featurelist_name == 'min sentiment'][0]

In [None]:
best_model_no_sentiment.train(sample_pct=100)
best_model_min_sentiment.train(sample_pct=100)

'250'

### Retrieve the leaderboard

In [None]:
models = []
featurelists = []
logloss_cv = []
logloss_holdout = []
for m in project.get_models():
    models.append(m)
    featurelists.append(m.featurelist_name)
    logloss_cv.append(m.metrics['LogLoss']['crossValidation'])
    logloss_holdout.append(m.metrics['LogLoss']['holdout'])
df_leaderboard = pd.DataFrame({
    'model': models,
    'feature list': featurelists,
    'Cross Validation': logloss_cv,
    'Holdout': logloss_holdout,
})

In [None]:
df_leaderboard.sort_values(by='Cross Validation')[:5]

Unnamed: 0,model,feature list,Cross Validation,Holdout
1,Model('eXtreme Gradient Boosted Trees Classifi...,min sentiment,0.05711,0.04583
3,Model('RandomForest Classifier (Entropy)'),no sentiment,0.06213,0.05112
6,Model('eXtreme Gradient Boosted Trees Classifi...,min sentiment,0.062966,0.04449
5,Model('AVG Blender'),min sentiment,0.06357,0.05697
0,Model('Gradient Boosted Trees Classifier'),no sentiment,0.063806,0.04644
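
To compare the two feature lists directly, the best (lowest) cross-validation LogLoss per feature list can be picked out of these rows. The sketch below hard-codes the five rows printed above so that it runs without a DataRobot connection; `best_by_feature_list` is a hypothetical helper name.

```python
# (feature list, cross-validation LogLoss, holdout LogLoss) copied from the output above
leaderboard = [
    ("min sentiment", 0.05711, 0.04583),
    ("no sentiment", 0.06213, 0.05112),
    ("min sentiment", 0.062966, 0.04449),
    ("min sentiment", 0.06357, 0.05697),
    ("no sentiment", 0.063806, 0.04644),
]


def best_by_feature_list(rows):
    """Keep the row with the lowest cross-validation LogLoss for each feature list."""
    best = {}
    for name, cv, holdout in rows:
        if name not in best or cv < best[name][0]:
            best[name] = (cv, holdout)
    return best


best = best_by_feature_list(leaderboard)
for name, (cv, holdout) in best.items():
    print(f"{name}: CV={cv}, holdout={holdout}")
```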


## Conclusion

This notebook demonstrated how simple it is to enrich a dataset with the Google API and improve the performance of a classification model built with DataRobot, without the challenge of training a reliable sentiment model. While the notebook used only the min sentiment score for modeling, the enriched dataset also includes the mean, max, and magnitude-weighted scores for readers to experiment with. In some cases, the sentiment scores from the Google API can bring higher lift than the NLTK or TextBlob sentiment featurizers, especially when the problem is sentiment oriented, as in the churn use case demonstrated here. We encourage readers who have access to the Google API to experiment.

For reference, the table below shows the more extensive tests that we ran with the other enriched sentiment features.

| Feature list (metric: LogLoss, sample size: 100%) | Cross Validation | Holdout |
|---|---|---|
| Informative features + just min sentence sentiment | 0.0571 | 0.0458 |
| Informative features + just min & max sentence sentiment | 0.0585 | 0.0464 |
| Informative features + just overall doc sentiment | 0.0645 | 0.0532 |
| Informative features (without sentiment) | 0.0617 | 0.0589 |
| Informative features + overall & min & max sentence sentiment, dropping chat_log | 0.1168 | 0.1243 |
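
To put the table in relative terms, the reduction in LogLoss from adding the min sentence sentiment can be computed against the baseline without sentiment. This is a small arithmetic sketch using the values from the table; `relative_reduction` is a hypothetical helper name.

```python
def relative_reduction(baseline, candidate):
    """Fractional reduction in LogLoss relative to the baseline (positive means better)."""
    return (baseline - candidate) / baseline


# Baseline: informative features without sentiment; candidate: plus min sentence sentiment
cv_gain = relative_reduction(0.0617, 0.0571)
holdout_gain = relative_reduction(0.0589, 0.0458)
print(f"Cross validation: {cv_gain:.1%}, holdout: {holdout_gain:.1%}")
```

On these numbers, the reduction is roughly 7% on cross validation and roughly 22% on holdout.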

