$$ \text{Twitter Moral Corpus} $$
$$ \text{and} $$
$$ \text{H2O's Python Client with Driverless AI} $$

This notebook uses H2O's Driverless AI and its Python client to analyze the corpus from the paper 'Moral Foundations Twitter Corpus: A collection of 35k tweets annotated for moral sentiment.' The goal is to analyze the Twitter Moral Corpus and to test the efficacy and usability of H2O's automated machine learning through the h2oai_client Python package.

Moral Foundations Theory is a five-factor taxonomy of human morality. The five factors are listed below.

Moral Factors:
- Care/Harm
- Fairness/Cheating
- Loyalty/Betrayal
- Authority/Subversion
- Purity/Degradation
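
As a quick reference, the taxonomy can be written down as a small lookup table. This is only a sketch: the lowercase label strings and the helper name `foundation_of` are assumptions made for illustration and are not taken from the corpus files.

```python
# Each foundation pairs a virtue with its corresponding vice.
# The lowercase label strings are assumed for illustration.
MORAL_FOUNDATIONS = {
    "care": "harm",
    "fairness": "cheating",
    "loyalty": "betrayal",
    "authority": "subversion",
    "purity": "degradation",
}


def foundation_of(label):
    """Return the 'Virtue/Vice' foundation a label belongs to, or None."""
    label = label.strip().lower()
    for virtue, vice in MORAL_FOUNDATIONS.items():
        if label in (virtue, vice):
            return f"{virtue.capitalize()}/{vice.capitalize()}"
    return None


print(foundation_of("betrayal"))
```

A lookup like this makes it easy to collapse the ten virtue and vice labels into the five foundations if a coarser target is wanted.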

Import libraries

In [1]:
import os
import wget #!pip install wget
import numpy as np
import pandas as pd
from h2oai_client import Client
from sklearn import model_selection

Download the Twitter data, with the tweet text already populated.

In [2]:
wget.download('https://drive.google.com/u/0/uc?id=1d1p95CspLTT1em4I42rDpWY6Hcikg296&export=download')

'tw.json'

The corpus as originally distributed does not include the tweet text. To retrieve the text yourself, you have to sign up for a Twitter developer account.

Collect the following values and fill them in inside the text script before running it with python text_script.py. The tweets will then be populated. <br>
Values to be filled: consumer_key='', consumer_secret='', access_token_key='', access_token_secret=''

In [3]:
# Twitter sentiment data
# wget.download('https://osf.io/cwu4m/download')
# os.rename('MFTC_V4.json', 'twitter_sentiment.json')

# Text script
# wget.download('https://osf.io/mzg5w/download')
# !python text_script.py

The Davidson dataset is distributed separately. The code below downloads it; however, it was left out of the analysis because the word classifications weren't available.

In [4]:
# Download the Davidson dataset
# (adjacent string literals are joined, so the URL contains no stray spaces)
# wget.download('https://raw.githubusercontent.com/t-davidson/hate-speech-and-offensive-language/'
#               'master/data/labeled_data.csv')
# df_davidson = pd.read_csv('labeled_data.csv').drop(columns=['Unnamed: 0'])

Set the credentials for the H2O client. The username is the beginning of the email address used for your GCP account. You can also type `id` into the terminal, which shows the username, e.g. uid=1001(username).

In [2]:
address = 'http://33.106.36.34:12345'
username = 'scidataprojects'
password = 'nnnnn'
h2oai = Client(address = address, username = username, password = password)
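
To keep the address and password out of the notebook itself, the credentials could be read from environment variables instead. The sketch below assumes three hypothetical variable names (`DAI_ADDRESS`, `DAI_USERNAME`, `DAI_PASSWORD`) and a hypothetical helper `load_dai_credentials`; none of these are defined by h2oai_client.

```python
import os


def load_dai_credentials(env=os.environ):
    """Collect connection settings from a mapping of environment variables.

    DAI_ADDRESS, DAI_USERNAME and DAI_PASSWORD are hypothetical names
    chosen for this sketch; they are not defined by h2oai_client.
    """
    required = ("DAI_ADDRESS", "DAI_USERNAME", "DAI_PASSWORD")
    missing = [name for name in required if name not in env]
    if missing:
        raise KeyError("Missing environment variables: " + ", ".join(missing))
    return {
        "address": env["DAI_ADDRESS"],
        "username": env["DAI_USERNAME"],
        "password": env["DAI_PASSWORD"],
    }


# Stand-in mapping used here instead of the real environment
example_env = {
    "DAI_ADDRESS": "http://localhost:12345",
    "DAI_USERNAME": "username",
    "DAI_PASSWORD": "secret",
}
creds = load_dai_credentials(example_env)
print(sorted(creds))

# h2oai = Client(**creds)  # same keyword arguments as in the cell above
```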

Import the moral corpora; each corpus is listed below. The Davidson corpus had to be downloaded from elsewhere and didn't include the actual classification words, so it was ultimately excluded from the analysis.

Domains are selected based on relevant moral problems in social sciences.
- ALM: All Lives Matter - political right
- Baltimore: Baltimore protests related to the death of Freddie Gray.
- BLM: Black Lives Matter - political left
- Election: The 2016 presidential election
- MeToo: Women's sexual harassment/assault movement.
- Sandy: Hurricane Sandy
- Davidson: Hate Speech

In [6]:
df = pd.read_json('tw.json')
dfs = []
# Pick the six corpora by position. Index 3 is skipped; given the domain list above,
# it is presumably the excluded Davidson corpus.
ALM, Baltimore, BLM, Election, MeToo, Sandy = (df['Tweets'][i] for i in (0, 1, 2, 4, 5, 6))

corpus = [ALM, Baltimore, BLM, Election, MeToo, Sandy]

The domains come embedded in a JSON file. For ease of use, the code below takes the JSON and places it in a pandas dataframe. All hashtags are removed from the tweets. Each tweet can have multiple word classifications based on the aforementioned moral theory factors.

In [7]:
# Initial lists to be converted to a dataframe later
text_id = []
tweet = []
date = []
annotation = []

for c in corpus:
    for i in c:
        text_id.append(list(i.values())[0])
        tweet.append(list(i.values())[1])
        date.append(list(i.values())[2])
        annotation.append(list(i.values())[3])

tweet = pd.DataFrame(tweet, columns=['tweet'])
text_id = pd.DataFrame(text_id, columns=['text_id'])
date = pd.DataFrame(date, columns=['date'])

frst_column_annotations = []
scnd_column_annotations = []


for i in annotation:
    frst_column_annotations.append(list(i[0].values())[1])
    scnd_column_annotations.append(list(i[1].values())[1])

dataframe_a_words = pd.DataFrame(frst_column_annotations, columns=['word1'])
dataframe_b_words = pd.DataFrame(scnd_column_annotations, columns=['word2'])
words = pd.concat([dataframe_a_words, dataframe_b_words], axis=1)

# n must be passed by keyword; recent pandas versions reject it as a positional argument
words_col_a = words['word1'].str.split(',', n=3, expand=True).rename(columns={0:'w1', 1:'w2', 2:'w3', 3:'w4'})
words_col_b = words['word2'].str.split(',', n=3, expand=True).rename(columns={0:'w5', 1:'w6', 2:'w7', 3:'w8'})
words_df = pd.concat([words_col_a, words_col_b], axis=1)

# Re-select the named columns (the frames were already built with these names above)
text_id = pd.DataFrame(text_id, columns=['text_id'])
tweet = pd.DataFrame(tweet, columns=['tweet'])
date = pd.DataFrame(date, columns=['date'])

df_list = [text_id, tweet, date, words_df]
df = pd.concat(df_list, axis=1)

df.drop(columns=['text_id', 'date'], inplace=True)
df = df[df['tweet'] != 'no tweet text available']
df.reset_index(drop=True, inplace=True)

# Raw string so that \S is read as a regex token rather than a string escape
df['tweet'] = df['tweet'].str.replace(r'#\S+', '', regex=True) \
                         .str.replace('&amp;','and') \
                         .str.strip()

# Blank out values that repeat within a row (e.g. both annotators giving the same label)
is_duplicate = df.apply(pd.Series.duplicated, axis=1)
df_ = df.where(~is_duplicate, None)

df = df_[['tweet','w1']].rename(columns={'w1':'Annotations'})
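
The cell above reads each record by position. A more compact variant, shown here on toy data, reads the fields by key instead. This is a sketch under assumptions: the key names (`tweet_text`, `annotations`, `annotation`, and so on) and the toy records are invented for illustration and should be checked against the actual JSON before use.

```python
import pandas as pd

# Toy records shaped like the corpus entries; all key names are assumed.
toy_corpus = [
    {"tweet_id": 1, "tweet_text": "Stand together &amp; help #unity", "date": "2016-01-01",
     "annotations": [{"annotator": "a1", "annotation": "care,loyalty"},
                     {"annotator": "a2", "annotation": "care"}]},
    {"tweet_id": 2, "tweet_text": "no tweet text available", "date": "2016-01-02",
     "annotations": [{"annotator": "a1", "annotation": "non-moral"},
                     {"annotator": "a2", "annotation": "non-moral"}]},
]

rows = [
    {
        "tweet": record["tweet_text"],
        # First label given by the first annotator, mirroring the 'w1' column
        "Annotations": record["annotations"][0]["annotation"].split(",")[0],
    }
    for record in toy_corpus
]
toy_df = pd.DataFrame(rows)

# Drop placeholder rows, then strip hashtags and decode '&amp;' as in the cell above
toy_df = toy_df[toy_df["tweet"] != "no tweet text available"].reset_index(drop=True)
toy_df["tweet"] = (
    toy_df["tweet"]
    .str.replace(r"#\S+", "", regex=True)
    .str.replace("&amp;", "and", regex=False)
    .str.strip()
)
print(toy_df)
```

Reading by key avoids depending on the order of the fields inside each record.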

Below is a preview of the null counts for the separated label columns. Most of the additional label columns are largely empty (for example, w4 is null in 2,602 rows). For this reason only the first label (w1) is used.

In [8]:
pd.DataFrame(df_.isna().sum(), columns=['Null Counts:']).T

| | tweet | w1 | w2 | w3 | w4 | w5 | w6 | w7 | w8 |
|---|---|---|---|---|---|---|---|---|---|
| Null Counts: | 0 | 0 | 2107 | 2505 | 2602 | 922 | 2158 | 2520 | 2611 |


Proportion of nulls in the fifth label column (w5), to show the ratio of nulls to labels.

In [9]:
df_['w5'].isna().sum() / len(df_)

0.35030395136778114
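
The same proportion can be computed for every column at once, since the mean of a boolean null mask is the fraction of nulls. A minimal sketch on toy data (the values are invented for illustration):

```python
import pandas as pd

# Toy label columns; None marks a missing label
toy = pd.DataFrame({
    "w1": ["care", "harm", "purity", "non-moral"],
    "w2": [None, "fairness", None, None],
    "w5": ["care", None, "purity", None],
})

# isna() yields True/False per cell; mean() of booleans is the null fraction
null_fraction = toy.isna().mean()
print(null_fraction)
```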

Split the dataset into train and test sets (70/30). We shuffle the rows, split them with numpy, and then export both sets to CSV files.

In [10]:
train, test = np.split(df.sample(frac=1), [int(.7*len(df))])
train.reset_index(inplace=True, drop=True)
test.reset_index(inplace=True, drop=True)
train.to_csv("train_twitter_sentiment.csv", index=False)
test.to_csv("test_twitter_sentiment.csv", index=False)
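
Because the shuffle above is unseeded, each run produces a different split. One possible alternative, sketched here on toy data, is scikit-learn's `train_test_split` with a fixed seed and stratification on the label. The seed value and the toy frame are arbitrary choices for illustration, not part of the original analysis.

```python
import pandas as pd
from sklearn import model_selection

# Toy frame with two equally frequent labels
toy = pd.DataFrame({
    "tweet": [f"tweet {i}" for i in range(20)],
    "Annotations": ["care"] * 10 + ["harm"] * 10,
})

train_toy, test_toy = model_selection.train_test_split(
    toy,
    test_size=0.3,                # same 70/30 proportion as above
    random_state=42,              # arbitrary seed, fixed for repeatability
    stratify=toy["Annotations"],  # keep label proportions equal in both sets
)
print(len(train_toy), len(test_toy))
```

Stratification requires at least two rows per label, so very rare labels would have to be merged or dropped first.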

Upload the data to Driverless AI through the h2oai client so that its classes and tools can be used.

In [11]:
train_path = './train_twitter_sentiment.csv'
test_path = './test_twitter_sentiment.csv'

train = h2oai.upload_dataset_sync(train_path)
test = h2oai.upload_dataset_sync(test_path)

Preview the size of the dataset and the used columns. 

In [12]:
print('Train Dataset: ', len(train.columns), 'x', train.row_count)
print('Test Dataset: ', len(test.columns), 'x', test.row_count)

[c.name for c in train.columns]

Train Dataset:  2 x 1842
Test Dataset:  2 x 790


['tweet', 'Annotations']

Before actually running a set of algorithms on the training set, the h2oai client can preview a rough plan for the chosen parameters, e.g. how the accuracy and time settings change the number of models trained and the estimated runtime.

In [13]:
exp_preview = h2oai.get_experiment_preview_sync(
    dataset_key=train.key
    , validset_key=''
    , target_col='Annotations'
    , classification=True
    , dropped_cols=[]
    , accuracy=10
    , time=10
    , interpretability=10
    , is_time_series=False
    , time_col=''
    , enable_gpus=True
    , reproducible=False
    , resumed_experiment_id=''
    , config_overrides="""
        enable_tensorflow='on'
        enable_tensorflow_charcnn='on'
        enable_tensorflow_textcnn='on'
        enable_tensorflow_textbigru='on'
    """
)

exp_preview

['ACCURACY [10/10]:',
 '- Training data size: *1,842 rows, 2 cols*',
 '- Feature evolution: *[LightGBM, TensorFlow]*, *3-fold CV**, 3 reps*',
 '- Final pipeline: *Ensemble (12 models), 3-fold CV*',
 '',
 'TIME [10/10]:',
 '- Feature evolution: *8 individuals*, up to *540 iterations*',
 '- Early stopping: After *50* iterations of no improvement',
 '',
 'INTERPRETABILITY [10/10]:',
 '- Feature pre-pruning strategy: Permutation Importance FS',
 '- Monotonicity constraints: enabled',
 '- Feature engineering search space: [CVTargetEncode, Frequent, TextBiGRU, TextCNN, TextCharCNN, Text]',
 '',
 '[LightGBM, TensorFlow] models to train:',
 '- Model and feature tuning: *360*',
 '- Feature evolution: *18072*',
 '- Final pipeline: *12*',
 '',
 'Estimated runtime: *hours*',
 'Auto-click Finish/Abort if not done in: *1 day*/*7 days*']
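
Since the preview comes back as a plain list of strings, the key figures can be pulled out with a little parsing, which is handy when comparing several settings. The sketch below works on a shortened copy of the output above; the helper names `dial_settings` and `estimated_runtime` are hypothetical, and the parsing assumes the line formats shown here.

```python
import re

# Shortened copy of the preview output shown above
preview_lines = [
    "ACCURACY [10/10]:",
    "- Final pipeline: *Ensemble (12 models), 3-fold CV*",
    "TIME [10/10]:",
    "- Feature evolution: *8 individuals*, up to *540 iterations*",
    "Estimated runtime: *hours*",
]


def dial_settings(lines):
    """Map each 'NAME [n/10]:' header line to its integer setting."""
    pattern = re.compile(r"^([A-Z]+) \[(\d+)/10\]:$")
    matches = (pattern.match(line) for line in lines)
    return {m.group(1).lower(): int(m.group(2)) for m in matches if m}


def estimated_runtime(lines):
    """Return the text after 'Estimated runtime:', without the asterisks."""
    for line in lines:
        if line.startswith("Estimated runtime:"):
            return line.split(":", 1)[1].strip().strip("*")
    return None


print(dial_settings(preview_lines), estimated_runtime(preview_lines))
```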

Next, run the experiment previewed above.

In [None]:
model = h2oai.start_experiment_sync(
    dataset_key=train.key,
    testset_key=test.key,
    target_col='Annotations',
    scorer='Accuracy',
    is_classification=True,
    cols_to_drop=[],
    accuracy=10,
    time=10,
    interpretability=10,
    enable_gpus=True,
    config_overrides="""
                    enable_tensorflow='on'
                    enable_tensorflow_charcnn='on'
                    enable_tensorflow_textcnn='on'
                    enable_tensorflow_textbigru='on'
                    """
)
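
The same override string is typed out twice, once for the preview and once for the experiment. One way to keep the two in sync is to build the string from a single dictionary. This is a sketch only: `build_config_overrides` is a hypothetical helper, and it simply reproduces the `key='value'` lines used in the cells above.

```python
# Settings copied from the config_overrides strings used above
tensorflow_settings = {
    "enable_tensorflow": "on",
    "enable_tensorflow_charcnn": "on",
    "enable_tensorflow_textcnn": "on",
    "enable_tensorflow_textbigru": "on",
}


def build_config_overrides(settings):
    """Render each setting as a key='value' line, one per line."""
    return "\n".join(f"{key}='{value}'" for key, value in settings.items())


config_overrides = build_config_overrides(tensorflow_settings)
print(config_overrides)
```

The resulting string could then be passed as `config_overrides=config_overrides` in both calls.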

When complete, each model will have a unique key. This key may be used to run comparative analysis on different recipes.

In [None]:
print('Modeling completed for model ' + model.key)

Download the test predictions. 

In [None]:
test_preds = h2oai.download(model.test_predictions_path, '.')
print('Test set predictions available at', test_preds)

Preview the test predictions.

In [None]:
# Read from the path returned by the download call rather than a hard-coded file name
pd.read_csv(test_preds)
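
To turn the predictions into a single score, the most probable class per row can be compared with the true labels. The sketch below assumes the prediction file holds one probability column per class, named `Annotations.<label>`; that layout and the toy values are assumptions for illustration and should be checked against the actual file.

```python
import pandas as pd

# Toy per-class probabilities; the 'Annotations.<label>' column naming is assumed
toy_preds = pd.DataFrame({
    "Annotations.care": [0.7, 0.2, 0.1],
    "Annotations.harm": [0.2, 0.5, 0.3],
    "Annotations.non-moral": [0.1, 0.3, 0.6],
})
true_labels = pd.Series(["care", "harm", "care"])

# Take the column with the highest probability in each row, then drop the prefix
predicted = toy_preds.idxmax(axis=1).str.replace("Annotations.", "", regex=False)
accuracy = (predicted == true_labels).mean()
print(predicted.tolist(), round(accuracy, 3))
```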