$$ \text{Twitter Moral Corpus} $$
$$ \text{and} $$
$$ \text{H2O's Python Client with Driverless AI} $$

This notebook uses H2O's Driverless AI and its Python client to analyze the corpus from the paper 'Moral Foundations Twitter Corpus: A collection of 35k tweets annotated for moral sentiment.' The goal is to analyze the Twitter Moral Corpus and to test the efficacy and usability of H2O's automated machine learning through the H2O AI client.  

Moral Foundations Theory is a five-factor taxonomy of human morality. Each factor pairs a virtue with its opposing vice.

Moral Factors:
- Care/Harm
- Fairness/Cheating
- Loyalty/Betrayal
- Authority/Subversion
- Purity/Degradation

Import libraries

In [20]:
import os
import time
import wget #!pip install wget
import subprocess
import numpy as np
import pandas as pd
from zipfile import ZipFile 
from h2oai_client import Client
from sklearn import model_selection

# pd.set_option('display.max_rows', None), pd.set_option('display.max_columns', None)

Download twitter data with tweets.

In [3]:
# wget.download('https://drive.google.com/u/0/uc?id=1d1p95CspLTT1em4I42rDpWY6Hcikg296&export=download')

The Twitter data doesn't come with the tweet text. To get the tweets, you have to sign up for a Twitter developer account. 

You will need to collect the following values and fill them in in the text script before you run it with python text_script.py. The tweets will then be populated. <br>
Values to be filled: consumer_key='', consumer_secret='', access_token_key='', access_token_secret=''

In [4]:
# Twitter sentiment data
# wget.download('https://osf.io/cwu4m/download')
# os.rename('MFTC_V4.json', 'twitter_sentiment.json')

# Text script
# wget.download('https://osf.io/mzg5w/download')
# !python text_scripts

The Davidson corpus comes separately. The cell below downloads it; however, it was left out of the analysis because the word classifications weren't available. 

In [5]:
# Wget for Davidson dataset mentioned below
# wget.download('https://raw.githubusercontent.com/t-davidson/hate-speech-and-offensive-language/master/data/labeled_data.csv')
# df_davidson = pd.read_csv('labeled_data.csv').drop(columns=['Unnamed: 0'])

Set credentials for the H2O client. The username is the part of the email address before the @ for the email used with your GCP account. You can also type ‘id’ into the terminal, which shows the username, e.g. uid=1001(username). 

In [6]:
address = 'http://34.106.56.14:12345'
username = 'scidataprojects'
password = 'Cypher65!@'
h2oai = Client(address = address, username = username, password = password)
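
As a possible alternative to typing the credentials into the cell, they can be read from environment variables. The sketch below is only an illustration: the variable names `DAI_ADDRESS`, `DAI_USERNAME` and `DAI_PASSWORD`, the helper `load_dai_credentials`, and the default address are assumptions, not part of the H2O client.

```python
import os

def load_dai_credentials(env=None):
    """Collect connection settings from a mapping (os.environ by default).

    The variable names below are hypothetical; pick whatever names suit your setup.
    """
    env = os.environ if env is None else env
    return {
        'address': env.get('DAI_ADDRESS', 'http://localhost:12345'),  # assumed default
        'username': env['DAI_USERNAME'],
        'password': env['DAI_PASSWORD'],
    }

# Demonstration with an explicit mapping standing in for os.environ
creds = load_dai_credentials({'DAI_USERNAME': 'demo', 'DAI_PASSWORD': 'secret'})

# The dictionary keys match the keyword arguments used in the cell above:
# h2oai = Client(**creds)
```

With this approach the notebook can be shared without the password appearing in any cell.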

Import the moral corpora; each corpus is described below. The Davidson corpus had to be downloaded from elsewhere and didn't include the actual classification words, so it was ultimately excluded from the analysis.  

Domains are selected based on relevant moral problems in social sciences.
- ALM: All Lives Matter - political right
- Baltimore: Baltimore protests related to the death of Freddie Gray.
- BLM: Black Lives Matter - political left
- Election: The 2016 presidential election
- MeToo: Women's sexual harassment/assault movement. 
- Sandy: Hurricane Sandy
- Davidson: Hate Speech

In [7]:
df = pd.read_json('tw.json')
dfs = []
ALM, Baltimore, BLM, Election, MeToo, Sandy = df['Tweets'][0], df['Tweets'][1], df['Tweets'][2], \
                                              df['Tweets'][4], df['Tweets'][5], df['Tweets'][6]

corpus = [ALM, Baltimore, BLM, Election, MeToo, Sandy]
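
Selecting the domains by position depends on the order of the rows in the file. A name-based lookup is one way to make the selection explicit. This is a minimal sketch on toy data; it assumes the JSON has a `Corpus` column holding the domain name next to the `Tweets` column, which is not shown in this notebook.

```python
import pandas as pd

# Toy stand-in for the loaded JSON; the 'Corpus' column name is an assumption.
toy = pd.DataFrame({
    'Corpus': ['ALM', 'Baltimore', 'BLM'],
    'Tweets': [[{'tweet_id': '1'}], [{'tweet_id': '2'}, {'tweet_id': '3'}], []],
})

# Map each domain name to its list of tweet records
tweets_by_corpus = dict(zip(toy['Corpus'], toy['Tweets']))

# Pick the domains to analyze by name instead of by row index
wanted = ['ALM', 'BLM']
corpus_subset = [tweets_by_corpus[name] for name in wanted]
```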

The domains come embedded in a JSON file. For ease of use, the code below takes the JSON and places it in a pandas dataframe. All hashtags are removed from the tweets. Each tweet can have multiple word classifications based on the aforementioned moral theory factors.  

In [8]:
# Initial lists to be converted to a dataframe later
text_id = []
tweet = []
date = []
annotation = []

for c in corpus:
    for i in c:
        text_id.append(list(i.values())[0])
        tweet.append(list(i.values())[1])
        date.append(list(i.values())[2])
        annotation.append(list(i.values())[3])

tweet = pd.DataFrame(tweet, columns=['tweet'])
text_id = pd.DataFrame(text_id, columns=['text_id'])
date = pd.DataFrame(date, columns=['date'])

frst_column_annotations = []
scnd_column_annotations = []


for i in annotation:
    frst_column_annotations.append(list(i[0].values())[1])
    scnd_column_annotations.append(list(i[1].values())[1])

dataframe_a_words = pd.DataFrame(frst_column_annotations, columns=['word1'])
dataframe_b_words = pd.DataFrame(scnd_column_annotations, columns=['word2'])
words = pd.concat([dataframe_a_words, dataframe_b_words], axis=1)

# `n` is passed by keyword because newer pandas versions no longer accept it positionally
words_col_a = words['word1'].str.split(',', n=4, expand=True).rename(columns={0:'w1', 1:'w2', 2:'w3', 3:'w4', 4:'w5'})
words_col_b = words['word2'].str.split(',', n=5, expand=True).rename(columns={0:'w6', 1:'w7', 2:'w8', 3:'w9', 4:'w10', 5:'w11'})
words_df = pd.concat([words_col_a, words_col_b], axis=1)

# Rename columns
text_id = pd.DataFrame(text_id, columns=['text_id'])
tweet = pd.DataFrame(tweet, columns=['tweet'])
date = pd.DataFrame(date, columns=['date'])

df_list = [text_id, tweet, date, words_df]
df = pd.concat(df_list, axis=1)
df_words = pd.concat(df_list, axis=1)

df.drop(columns=['text_id', 'date'], inplace=True)
df = df[df['tweet'] != 'no tweet text available']
df.reset_index(drop=True, inplace=True)

df['tweet'] = df['tweet'].str.replace(r'#\S+', '', regex=True) \
                         .str.replace('&amp;','and') \
                         .str.strip()

is_duplicate = df.apply(pd.Series.duplicated, axis=1)
df_ = df.where(~is_duplicate, None)

dataframe = df_[['tweet','w1']].rename(columns={'w1':'Annotations'})
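
The flattening step can also be written with key-based access, which does not depend on the order of the fields in each record. The following is a sketch on a single toy record; the field names `tweet_text`, `annotations` and `annotation` are assumptions about the JSON layout, and the tweet text and labels are made up for illustration.

```python
import pandas as pd

# One toy record; the field names are assumed, not taken from the file.
sample_corpus = [
    {
        'tweet_id': '1',
        'tweet_text': 'Injustice for one is an injustice for all #BLM',
        'date': '2015-01-01',
        'annotations': [
            {'annotator': 'annotator00', 'annotation': 'fairness,care'},
            {'annotator': 'annotator01', 'annotation': 'fairness'},
        ],
    },
]

rows = []
for record in sample_corpus:
    labels = []
    for ann in record['annotations']:
        # Each annotator may give several comma-separated labels
        labels.extend(ann['annotation'].split(','))
    rows.append({'tweet': record['tweet_text'], 'labels': labels})

flat = pd.DataFrame(rows)

# Same cleaning idea as above: drop hashtags, then trim whitespace
flat['tweet'] = flat['tweet'].str.replace(r'#\S+', '', regex=True).str.strip()
```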

Make sure that all columns have a max of one word.

In [9]:
# for x in df_.iloc[:,1:]:
#     print(df_[x].value_counts(), '\n')

In [10]:
dummies = pd.get_dummies(df.iloc[:,1:])
dummies_summed = dummies.iloc[:,1:].T.groupby([s.split('_')[1] for s in dummies.iloc[:,1:].T.index.values]).sum().T
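
To see what the dummy encoding and the grouped sum produce, here is a small sketch on toy data. The column names `w1` and `w6` mirror the ones above, the labels are invented, and the grouping keys are passed as a NumPy array so pandas treats them as values rather than column names.

```python
import numpy as np
import pandas as pd

# Three toy tweets with one label from each of two annotators
toy = pd.DataFrame({
    'w1': ['care', 'harm', 'fairness'],
    'w6': ['care', 'care', 'cheating'],
})

# One indicator column per (annotation column, label) pair, e.g. 'w1_care'
dummies = pd.get_dummies(toy).astype(int)

# Strip the 'w1_' / 'w6_' prefix so columns with the same label share a key
labels = np.array([col.split('_')[1] for col in dummies.columns])

# Sum the indicator columns per label: one row per tweet, one column per label
counts = dummies.T.groupby(labels).sum().T
```

Each cell of `counts` is the number of annotation slots in which that label was assigned to the tweet.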

In [11]:
y = df.iloc[:,0]
y.shape

(2632,)

In [12]:
# Encode one pair of moral values for a tweet:
#   1 -> the first value (col1) was annotated at least once
#   0 -> only the second value (col2) was annotated
#   2 -> neither value was annotated (these rows are filtered out later)
# col1 takes precedence, so a tweet annotated with both values is coded 1.
def set_higher_moral_count(col1, col2):
    if col1 > 0:
        return 1
    elif col2 > 0:
        return 0
    else:
        return 2
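
A variant that compares the two counts directly and reports ties separately could look like the sketch below. The function name `compare_moral_counts` and the use of 3 as a tie code are assumptions for illustration; the results further down in this notebook were not produced with it.

```python
def compare_moral_counts(col1, col2):
    """Return 1 if col1 is higher, 0 if col2 is higher, 3 on a tie, 2 if both are zero."""
    if col1 == 0 and col2 == 0:
        return 2          # neither value was annotated
    if col1 == col2:
        return 3          # both annotated equally often
    return 1 if col1 > col2 else 0

# Quick check on a few count pairs
examples = [compare_moral_counts(a, b) for a, b in [(2, 1), (1, 2), (1, 1), (0, 0)]]
```

With a tie code in play, rows coded 3 would need to be filtered out or handled alongside the rows coded 2 before training a binary classifier.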

In [13]:
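# Note: harm is passed first here, so for care_harm a value of 1 means harm was annotated and 0 means only care was.
care_harm = dummies_summed.apply(lambda x: set_higher_moral_count(x.harm, x.care), axis=1)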
fairness_cheating = dummies_summed.apply(lambda x: set_higher_moral_count(x.fairness, x.cheating), axis=1)
loyalty_betrayal = dummies_summed.apply(lambda x: set_higher_moral_count(x.loyalty, x.betrayal), axis=1)
authority_subversion = dummies_summed.apply(lambda x: set_higher_moral_count(x.authority, x.subversion), axis=1)
purity_degradation = dummies_summed.apply(lambda x: set_higher_moral_count(x.purity, x.degradation), axis=1)

In [14]:
y.shape, care_harm.shape

((2632,), (2632,))

In [15]:
y.head()

0    Wholeheartedly support these protests and acts...
1    This Sandra Bland situation man no disrespect ...
2    Commitment to peace, healing and loving neighb...
3            Injustice for one is an injustice for all
4    This is what compassion looks like!   https://...
Name: tweet, dtype: object

In [16]:
Care_Harm = pd.concat([y, care_harm], axis=1)[care_harm != 2].rename(columns={0:'y'}).reset_index(drop=True)
Fairness_Cheating = pd.concat([y, fairness_cheating], axis=1)[fairness_cheating != 2].rename(columns={0:'y'}).reset_index(drop=True)
Loyalty_Betrayal = pd.concat([y, loyalty_betrayal], axis=1)[loyalty_betrayal != 2].rename(columns={0:'y'}).reset_index(drop=True)
Authority_Subversion = pd.concat([y, authority_subversion], axis=1)[authority_subversion != 2].rename(columns={0:'y'}).reset_index(drop=True)
Purity_Degradation = pd.concat([y, purity_degradation], axis=1)[purity_degradation != 2].rename(columns={0:'y'}).reset_index(drop=True)

Check for value counts and imbalanced classes. 

In [17]:
frames = [Care_Harm, Fairness_Cheating, Loyalty_Betrayal, Authority_Subversion, Purity_Degradation]
names = ['Care_Harm', 'Fairness_Cheating', 'Loyalty_Betrayal', 'Authority_Subversion', 'Purity_Degradation']

print('1 = First listed value')
print('0 = Second listed value\n-----------')

for w,z in zip(names, frames):
    print(w)
    print(z.iloc[:, 1].value_counts(), '\n')

1 = First listed value
0 = Second listed value
-----------
Care_Harm
1    662
0    466
Name: y, dtype: int64 

Fairness_Cheating
1    659
0    532
Name: y, dtype: int64 

Loyalty_Betrayal
1    462
0    351
Name: y, dtype: int64 

Authority_Subversion
0    247
1    108
Name: y, dtype: int64 

Purity_Degradation
1    267
0    160
Name: y, dtype: int64 
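
The imbalance can be put into numbers with `value_counts(normalize=True)`. The sketch below rebuilds a label column from the Authority_Subversion counts printed above (247 zeros and 108 ones) rather than from the real dataframe.

```python
import pandas as pd

# Label column reconstructed from the printed Authority_Subversion counts
labels = pd.Series([0] * 247 + [1] * 108, name='y')

# Share of each class, and the share of the majority class
shares = labels.value_counts(normalize=True)
majority_share = shares.max()

# Ratio of the larger class to the smaller class
class_counts = labels.value_counts()
imbalance_ratio = class_counts.max() / class_counts.min()
```

Here the majority class makes up about 70% of the rows and is roughly 2.3 times the size of the minority class, which makes Authority_Subversion the most imbalanced of the five datasets.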



In [21]:
def get_df_name(df):
    name =[x for x in globals() if globals()[x] is df][0]
    return name


def train_test_split(names, frames):
    # Shuffle each dataset, split it 70/30 into train and test sets, and export both to CSV files.
    for name, frame in zip(names, frames):
        shuffled = frame.sample(frac=1)
        cutoff = int(.7*len(frame))
        train = shuffled.iloc[:cutoff].reset_index(drop=True)
        test = shuffled.iloc[cutoff:].reset_index(drop=True)
        train.to_csv("train_twitter_sentiment{}.csv".format(name), index=False)
        test.to_csv("test_twitter_sentiment{}.csv".format(name), index=False)
        
    # Return the split of the last frame; the split of every frame is available on disk.
    return train, test
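
Given the class imbalance in some of the datasets, a stratified split is one option for keeping the class proportions similar in the train and test sets. This sketch uses scikit-learn's `model_selection.train_test_split` on toy data; the toy frame and the fixed `random_state` are assumptions for illustration, and it is not the split used for the experiments in this notebook.

```python
import pandas as pd
from sklearn import model_selection

# Toy dataset: 14 rows of class 0 and 6 rows of class 1
toy = pd.DataFrame({
    'tweet': ['tweet {}'.format(i) for i in range(20)],
    'y': [0] * 14 + [1] * 6,
})

# 70/30 split that preserves the class proportions of 'y' as closely as possible
train, test = model_selection.train_test_split(
    toy, test_size=0.3, stratify=toy['y'], random_state=0
)
train = train.reset_index(drop=True)
test = test.reset_index(drop=True)
```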


def train_and_download_summary(frame):
    start = time.time()
    
    # Write the train/test CSV files, then locate the pair for this frame by its variable name.
    train_test_split(names, frames)
    name = get_df_name(frame)
    
    train_path = './train_twitter_sentiment{}.csv'.format(name)
    test_path = './test_twitter_sentiment{}.csv'.format(name)
    
    # Upload the data through the h2oai client so Driverless AI can use it.
    train = h2oai.upload_dataset_sync(train_path)
    test = h2oai.upload_dataset_sync(test_path)
 
    # Run the experiment
    model = h2oai.start_experiment_sync(
        dataset_key=train.key,
        testset_key=test.key,
        target_col='y',
        scorer='LOGLOSS',
        is_classification=True,
        cols_to_drop=[],
        accuracy=7,
        time=2,
        interpretability=8,
        enable_gpus=True,
        config_overrides="""
                        enable_tensorflow='on'
                        enable_tensorflow_charcnn='on'
                        enable_tensorflow_textcnn='on'
                        enable_tensorflow_textbigru='on'
                        """
    )

    # Download the summary archive and extract it into a per-dataset 'summary_<name>' folder.
    summary_dir = './summary_{}'.format(name)
    os.makedirs(summary_dir, exist_ok=True)
        
    summary_path = h2oai.download(src_path=model.summary_path, dest_dir='.')
    dir_path = './h2oai_experiment_summary_'+model.key

    with ZipFile(dir_path+'.zip', 'r') as archive: 
        archive.extractall(summary_dir)
        
    end = time.time()
    print(end - start)
    return model
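
The extraction step can be tried out without a running Driverless AI instance. The sketch below builds a throwaway archive in a temporary directory and extracts it; the helper name `extract_summary` and the demo file names are assumptions for illustration.

```python
import os
import tempfile
from zipfile import ZipFile

def extract_summary(zip_path, dest_dir):
    """Extract an archive into dest_dir (created if missing) and return the file names."""
    os.makedirs(dest_dir, exist_ok=True)
    with ZipFile(zip_path, 'r') as archive:
        archive.extractall(dest_dir)
    return sorted(os.listdir(dest_dir))

# Demonstration with a temporary archive standing in for a downloaded summary
with tempfile.TemporaryDirectory() as tmp:
    zip_path = os.path.join(tmp, 'h2oai_experiment_summary_demo.zip')
    with ZipFile(zip_path, 'w') as archive:
        archive.writestr('timing.json', '{}')
    extracted = extract_summary(zip_path, os.path.join(tmp, 'summary_demo'))
```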

In [None]:
# Care_Harm
# Fairness_Cheating
# Loyalty_Betrayal
# Authority_Subversion
# Purity_Degradation

train_and_download_summary(Care_Harm)

In [None]:
# Folder the experiment summary was extracted to; adjust it if your folder name differs.
summary_dir = 'summary_Care_Harm'

ensemble_base_learner_fold_scores = pd.read_json(summary_dir + '/ensemble_base_learner_fold_scores.json')
timing = pd.read_json(summary_dir + '/timing.json')
coefs = pd.read_json(summary_dir + '/coefs.json')
confsn_mtrx = pd.read_json(summary_dir + '/ensemble_confusion_matrix_test.json')
orig_features = pd.read_json(summary_dir + '/ensemble_features_orig.json')
ensemble_features = pd.read_json(summary_dir + '/ensemble_features.json')
ensemble_model_description = pd.read_json(summary_dir + '/ensemble_model_description.json')
ensemble_score_data = pd.read_json(summary_dir + '/ensemble_score_data.json').T
tuning_leaderboard = pd.read_json(summary_dir + '/tuning_leaderboard.json')
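
Not every experiment necessarily produces every summary file, so a loader that skips missing files can be convenient. This is a sketch under that assumption; the helper `load_summaries` is hypothetical, and the demo writes a made-up `timing.json` to a temporary folder instead of reading a real summary.

```python
import json
import os
import tempfile

import pandas as pd

# A few of the summary file names used above, without the '.json' extension
SUMMARY_FILES = ['timing', 'coefs', 'tuning_leaderboard']

def load_summaries(summary_dir, names=SUMMARY_FILES):
    """Read each summary JSON that exists into a DataFrame, keyed by its base name."""
    tables = {}
    for name in names:
        path = os.path.join(summary_dir, name + '.json')
        if os.path.exists(path):
            tables[name] = pd.read_json(path)
    return tables

# Demonstration: only 'timing.json' exists, so only that table is loaded
with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, 'timing.json'), 'w') as handle:
        json.dump([{'step': 'demo', 'seconds': 1.5}], handle)
    tables = load_summaries(tmp)
```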