### Updates
#### (3/26/19)
- Separated urls into links and linked pictures
- Changed racist word count to regular frequency
- Racist/neutral scores are aggregated
- Time tweet created feature removed
- Corrected a few data types in the user-feature dataframe
- Replaced regular variables with np.arrays when calculating metrics to shorten script
- Specify racism dictionary path in cell under "User-defined Variables"

#### (4/1/19)
- Plotted correlation coefficients matrix

#### (4/8/19)
Added two features: hate-speech and offensive language tweets
- https://github.com/t-davidson/hate-speech-and-offensive-language
    - Code and data from this repo were referenced and used
- The script and files in their classifier directory, provided to run their pickled classifier on new data, could not be used
    - The script had multiple problems: Python 2.7 to 3.6 incompatibilities (even after 2to3 conversion), unusable pickled models, a broken input feature generation function, etc.
- Their "Automated Hate Speech Detection and the Problem of Offensive Language Python 3.6" notebook was used to:
    - Re-train their classifier with fixed input dimensions
    - Generate a new pickled file for the best model chosen (75.79% overall accuracy, compared to their average of 82.33%; the drop is most likely due to removing the TF-IDF and POS features used in their pickled model)
    - Calculate percentage of tweets classified as hate-speech 
    - Calculate percentage of tweets classified as offensive language 
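
The two percentages in the last bullets can be derived directly from the classifier's predicted labels. A minimal sketch, assuming the label encoding used later in this notebook (0 = hate speech, 1 = offensive language, 2 = neither); the `predicted` list is made-up illustration data, not real model output:

```python
from collections import Counter

# Hypothetical predictions for six tweets (illustration only).
predicted = [2, 1, 1, 0, 2, 1]

counts = Counter(predicted)
total = len(predicted)

# Counter returns 0 for classes that never occur, so no key check is needed.
percent_hatespeech = counts[0] / total
percent_offensive = counts[1] / total

print(percent_hatespeech, percent_offensive)
```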
    
#### (4/10/19)
- Integrated Austin's code for classifying a tweet as religious
- Vectorized metric calculation functions to speed up program (~7.9x faster)
    - Original runtime: 587.108s
    - New runtime: 74.172s
- Cleaned up code
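
The reported speed-up follows from the two runtimes listed above. A quick check of the arithmetic (variable names are illustrative):

```python
original_runtime_s = 587.108
new_runtime_s = 74.172

# Speed-up factor = old runtime / new runtime
speedup = original_runtime_s / new_runtime_s
print(f"{speedup:.1f}x faster")
```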

### User-defined Variables

In [None]:
# Specify filenames and directories here
# Specify the directory containing user .json + .gz files
TWEETS_DIRECTORY = "../results/users_tweets_metadata/"

# Specify the name of the CSV file containing personality scores for each user
TRAIN_LABEL_FILE = "../data/train_labels_rand.csv"

# Specify the name of the CSV file containing dictionary words related to racism
RACISM_DICT_FILE = "../data/racism_dictionary.csv"

# Specify the name of the CSV file containing dictionary words related to religiosity
RELIGIOUS_DICTIONARY = "../data/religious_corpus.csv"

### Program Begins Here

In [None]:
# Imports here
from sklearn.feature_selection import SelectPercentile, SelectKBest
from sklearn.feature_selection import f_regression
from sklearn import datasets, linear_model
from sklearn.metrics import mean_squared_error, r2_score
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from dictfeaturizer import DictFeaturizer
from nltk import TweetTokenizer
from textstat.textstat import textstat
import joblib  # sklearn.externals.joblib was removed in scikit-learn 0.23; use the standalone package
from tqdm import tqdm
import sys
import os
import operator
import gzip
import re
import json
import string
import numpy as np
import pandas as pd
import emoji
import csv
import time

#### Functions/Classes

In [None]:
# Holds the religious-term dictionary that is passed to the vectorized is_tweet_religious function
class ReligiousityDict:
    def __init__(self, directory):
        self.directory = directory
        self.religious_words = []
        
    def read_csv(self):
        # Keep the lower-cased first field of each row; this also drops the trailing comma
        religious_words = []
        with open(self.directory, newline='') as csvfile:
            for row in csv.reader(csvfile, delimiter=','):
                if row and row[0].strip():
                    religious_words.append(row[0].strip().lower())
        self.religious_words = religious_words

In [None]:
# Function definitions for calculating metrics
# Get the day a tweet was created
def getDayTweeted(data):
    # "created_at" starts with a three-letter day abbreviation (Mon=0 ... Sun=6)
    dayIndex = {"Mon": 0, "Tue": 1, "Wed": 2, "Thu": 3, "Fri": 4, "Sat": 5, "Sun": 6}
    srch = re.search(r"(\w{3})", data["created_at"])
    return dayIndex.get(str(srch[1]))
    
# Get the day most tweets were created on
def getDayTweetedMost(data):
    vgetDayTweeted = np.vectorize(getDayTweeted)
    daysTweeted = vgetDayTweeted(data)
    day, counts = np.unique(daysTweeted, return_counts=True)
    dayCount = dict(zip(day, counts))
    sortedDayCount = sorted(dayCount.items(), key=operator.itemgetter(1), reverse=True)
    return sortedDayCount[0][0]

# Compute flesch/coleman/automated/linsear/gunning readability scores
def getReadabilityScore(mode, tweets):
    vreadabilityFunc = None
    if mode == 'flesch':
        vreadabilityFunc = np.vectorize(textstat.flesch_kincaid_grade)
    elif mode == 'coleman':
        vreadabilityFunc = np.vectorize(textstat.coleman_liau_index)
    elif mode == 'automated':
        vreadabilityFunc = np.vectorize(textstat.automated_readability_index)
    elif mode == 'linsear':
        vreadabilityFunc = np.vectorize(textstat.linsear_write_formula)
    elif mode == 'gunning':
        vreadabilityFunc = np.vectorize(textstat.gunning_fog)
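    else:
        # Fail early: calling a missing readability function below would raise a TypeError
        raise ValueError("Unknown readability mode: " + str(mode))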
    
    avgScore = vreadabilityFunc(tweets).sum() / len(tweets)
    return avgScore

# Get the number of linked pictures in a tweet
def getNumPics(urls):
    numPics = 0
    for url in urls:
        m = re.search('https:\/\/twitter\.com\/i\/web\/status\/', url["expanded_url"])
        if m:
            numPics += 1
    return numPics
    
# Get the sentiment of a tweet
def getSentiment(tweet, sid):
    ss = sid.polarity_scores(tweet)
    compoundScore = ss["compound"]
    if compoundScore > 0:
        return 1
    elif compoundScore == 0:
        return 0
    else:
        return -1

# Create a dictionary that has counts of each sentiment (1: pos, 0: neu, -1: neg)
def getSentimentFreq(tweets, sid):
    vgetSentiment = np.vectorize(lambda t, sid: getSentiment(t, sid))
    sentiments = vgetSentiment(tweets, sid)
    unique, counts = np.unique(sentiments, return_counts=True)
    return dict(zip(unique, counts))

# Count total racist words in a tweet
def aggregateRacismResults(tweet, racismDict):
    racismDictResults = racismDict.transform(tweet.split())
    numRacistWords = racismDictResults['racist-stereotypes'] + \
                        racismDictResults['racist-skin_color'] + \
                        racismDictResults['racist-culture'] + \
                        racismDictResults['racist-country'] + \
                        racismDictResults['racist-animals'] + \
                        racismDictResults['racist-migration'] + \
                        racismDictResults['racist-nationality'] + \
                        racismDictResults['racist-religion'] + \
                        racismDictResults['racist-crime'] + \
                        racismDictResults['racist-race'] + \
                        racismDictResults['racist-diseases']
    return numRacistWords

'''
# Get the average number of racist words in a tweet
def getAvgNumRacistWords(tweets, racismDict):
    vfunc = np.vectorize(aggregateRacismResults)
    return vfunc(tweets, racismDict).sum() / len(tweets)
'''

# Classify a tweet as religious or not (Austin's)
def is_tweet_religious(tweet, religionDict):
    needed_words = 2
    common_words = set(religionDict.religious_words) & set(tweet.split())
    if len(common_words) >= needed_words:
        return 1
    else:
        return 0
    
# Get the percentage of religious tweets
def getPercentReligious(tweets, religionDict):
    vfunc = np.vectorize(is_tweet_religious)
    return vfunc(tweets, religionDict).sum() / len(tweets)

# Classify tweet as hate-speech(0), offensive language(1), or neither(2) and create a counts dictionary
def getHsolClass(tweets, hsolModel):
    transformedTweet = transform_inputs(tweets)
    predictedClasses = hsolModel.predict(transformedTweet)
    unique, counts = np.unique(predictedClasses, return_counts=True)
    return dict(zip(unique, counts))

# Get the average number of retweets/favorites/hashtags/links/pics/emojis/racist words
def getAvgNum(item, data, racismDict=None):
    vfunc = None
    if item == 'retweets':
        vfunc = np.vectorize(lambda d: d["retweet_count"])
    elif item == 'favorites':
        vfunc = np.vectorize(lambda d: d["favorite_count"])
    elif item == 'hashtags':
        vfunc = np.vectorize(lambda d: len(d["entities"]["hashtags"]))
    elif item == 'links':
        vfunc = np.vectorize(lambda d: len(d["entities"]["urls"]) - getNumPics(d["entities"]["urls"]))
    elif item == 'pics':
        vfunc = np.vectorize(lambda d: getNumPics(d["entities"]["urls"]))
    elif item == 'emojis':
        # "data" input must be tweets for this case
        vfunc = np.vectorize(lambda t: len([c for c in t if c in emoji.UNICODE_EMOJI]))
    elif item == 'racist':
        # "data" input must be cleaned tweets for this case
        vfunc = np.vectorize(aggregateRacismResults)
        return vfunc(data, racismDict).sum() / len(data)
    else:
        print("Unknown mode:", item)
        return 0.0
    return vfunc(data).sum() / len(data)

# Get the percentage of pos/neu/neg/religious/hatespeech/offensive tweets
def getPercent(item, tweets, sentimentFreq=None, hsolFreq=None, religionDict=None):
    if item == 'pos' and 1 in sentimentFreq:
        return sentimentFreq[1] / len(tweets)
    elif item == 'neu' and 0 in sentimentFreq:
        return sentimentFreq[0] / len(tweets)
    elif item == 'neg' and -1 in sentimentFreq:
        return sentimentFreq[-1] / len(tweets)
    elif item == 'religious':
        vfunc = np.vectorize(is_tweet_religious)
        return vfunc(tweets, religionDict).sum() / len(tweets)
    elif item == 'hatespeech' and 0 in hsolFreq:
        return hsolFreq[0] / len(tweets)
    elif item == 'offensive' and 1 in hsolFreq:
        return hsolFreq[1] / len(tweets)
    else:
        return 0.0
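
As an alternative to the regex lookup in `getDayTweeted`, the weekday can be read from the parsed timestamp. A minimal sketch, assuming `created_at` uses the Twitter API v1.1 format (e.g. `Wed Oct 10 20:19:24 +0000 2018`); the sample records are made up, and `day_tweeted` and `day_tweeted_most` are hypothetical names:

```python
from collections import Counter
from datetime import datetime

# Assumed layout of the "created_at" field
TWITTER_TIME_FORMAT = "%a %b %d %H:%M:%S %z %Y"

def day_tweeted(tweet):
    """Return the weekday of a tweet as an int (Monday=0 ... Sunday=6)."""
    return datetime.strptime(tweet["created_at"], TWITTER_TIME_FORMAT).weekday()

def day_tweeted_most(data):
    """Return the weekday on which most tweets were created."""
    return Counter(day_tweeted(t) for t in data).most_common(1)[0][0]

# Made-up records for illustration
sample = [
    {"created_at": "Wed Oct 10 20:19:24 +0000 2018"},
    {"created_at": "Wed Oct 17 08:00:00 +0000 2018"},
    {"created_at": "Fri Oct 12 12:30:00 +0000 2018"},
]
print(day_tweeted_most(sample))
```

The weekday numbering matches the one used in `getDayTweeted` (Monday is 0), so the resulting feature values would be comparable.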

In [None]:
# Other function definitions
# Get tweets from one user
def getUserTweetsAndData(userTweetsFile):
    data = []
    tweets = []

    with gzip.open(userTweetsFile,'r') as f:        
        for tweetDeetsBinary in f:
            # Convert each line (binary) to string
            tweetDeetsStr = tweetDeetsBinary.decode('utf-8')

            # Generate json objects
            tweetDeetsJson = json.loads(tweetDeetsStr)
            data.append(tweetDeetsJson)            
            tweets.append(tweetDeetsJson["text"])
    
    return data, tweets

# Clean a tweet
def cleanTweet(text):
    cleanedText = " ".join([re.sub("[^a-zA-Z#']", '', x) for x in tweet_tokenizer.tokenize(text) if
                                 x and
                                 not x.startswith('http') and
                                 not x.startswith('@') and
                                 not x.startswith('#') and
                                 x.lower() != 'rt' and
                                 not (x.startswith('&') and x.endswith(';'))]).strip().lower()
    return cleanedText

# Clean all tweets
def cleanAllTweets(tweets):
    vcleanTweet = np.vectorize(cleanTweet)
    cleanedTweets = vcleanTweet(tweets)
    return cleanedTweets

# Print error message
def printErrorMsg(userID):
    print("---------------------------------------------------------------------")
    print("Unable to calculate metrics for user:", userID)
    print("---------------------------------------------------------------------")
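
The filtering done by `cleanTweet` can be illustrated without NLTK. A simplified sketch, assuming a plain whitespace split as a stand-in for `TweetTokenizer` (which splits punctuation more carefully); `clean_tweet` is a hypothetical name and the sample tweet is made up:

```python
import re

def clean_tweet(text):
    """Drop URLs, mentions, hashtags, 'rt' markers and HTML entities,
    then keep only letters and apostrophes, lower-cased."""
    kept = []
    for token in text.split():  # whitespace split stands in for TweetTokenizer
        if (token.startswith(("http", "@", "#"))
                or token.lower() == "rt"
                or (token.startswith("&") and token.endswith(";"))):
            continue
        token = re.sub(r"[^a-zA-Z']", "", token)
        if token:  # skip tokens that became empty after stripping
            kept.append(token.lower())
    return " ".join(kept)

cleaned = clean_tweet("RT @user: Don't miss this!! https://t.co/abc #news &amp;")
print(cleaned)
```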

In [None]:
# Function definitions from "Automated Hate Speech Detection and the Problem of Offensive Language" paper
def preprocess(text_string):
    """
    Accepts a text string and:
    1) collapses runs of whitespace into a single space
    2) removes urls
    3) removes mentions

    This leaves only the text of the tweet,
    without the links or the specific people mentioned
    """
    space_pattern = '\s+'
    giant_url_regex = ('http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|'
        '[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+')
    mention_regex = '@[\w\-]+'
    parsed_text = re.sub(space_pattern, ' ', text_string)
    parsed_text = re.sub(giant_url_regex, '', parsed_text)
    parsed_text = re.sub(mention_regex, '', parsed_text)
    return parsed_text

def count_twitter_objs(text_string):
    """
    Accepts a text string and replaces:
    1) urls with URLHERE
    2) lots of whitespace with one instance
    3) mentions with MENTIONHERE
    4) hashtags with HASHTAGHERE

    This allows us to get standardized counts of urls and mentions
    Without caring about specific people mentioned.
    
    Returns counts of urls, mentions, and hashtags.
    """
    space_pattern = '\s+'
    giant_url_regex = ('http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|'
        '[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+')
    mention_regex = '@[\w\-]+'
    hashtag_regex = '#[\w\-]+'
    parsed_text = re.sub(space_pattern, ' ', text_string)
    parsed_text = re.sub(giant_url_regex, 'URLHERE', parsed_text)
    parsed_text = re.sub(mention_regex, 'MENTIONHERE', parsed_text)
    parsed_text = re.sub(hashtag_regex, 'HASHTAGHERE', parsed_text)
    return(parsed_text.count('URLHERE'),parsed_text.count('MENTIONHERE'),parsed_text.count('HASHTAGHERE'))

def other_features(tweet):
    """This function takes a string and returns a list of features.
    These include Sentiment scores, Text and Readability scores,
    as well as Twitter specific features"""
    sentiment = sid.polarity_scores(tweet)
    
    words = preprocess(tweet) #Get text only
    
    syllables = textstat.syllable_count(words)
    num_chars = sum(len(w) for w in words)
    num_chars_total = len(tweet)
    num_terms = len(tweet.split())
    num_words = len(words.split())
    avg_syl = round(float((syllables+0.001))/float(num_words+0.001),4)
    num_unique_terms = len(set(words.split()))
    
    ###Modified FK grade, where avg words per sentence is just num words/1
    FKRA = round(float(0.39 * float(num_words)/1.0) + float(11.8 * avg_syl) - 15.59,1)
    ##Modified FRE score, where sentence fixed to 1
    FRE = round(206.835 - 1.015*(float(num_words)/1.0) - (84.6*float(avg_syl)),2)
    
    twitter_objs = count_twitter_objs(tweet)
    retweet = 0
    if "rt" in words:
        retweet = 1
    features = [FKRA, FRE,syllables, avg_syl, num_chars, num_chars_total, num_terms, num_words,
                num_unique_terms, sentiment['neg'], sentiment['pos'], sentiment['neu'], sentiment['compound'],
                twitter_objs[2], twitter_objs[1],
                twitter_objs[0], retweet]
    
    return features

def get_feature_array(tweets):
    feats=[]
    for t in tweets:
        feats.append(other_features(t))
    return np.array(feats)

def transform_inputs(tweets):
    oth_array = get_feature_array(tweets)    
    return pd.DataFrame(oth_array)
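
The two modified readability scores in `other_features` are the standard Flesch-Kincaid grade and Flesch reading-ease formulas with the sentence count fixed to 1. A worked sketch of that arithmetic; the word and syllable counts are made up, and the helper names are hypothetical:

```python
def modified_fk_grade(num_words, avg_syl):
    """Flesch-Kincaid grade with words-per-sentence taken as num_words / 1."""
    return round(0.39 * num_words / 1.0 + 11.8 * avg_syl - 15.59, 1)

def modified_fre(num_words, avg_syl):
    """Flesch reading ease with the sentence count fixed to 1."""
    return round(206.835 - 1.015 * (num_words / 1.0) - 84.6 * avg_syl, 2)

# Hypothetical tweet: 11 words containing 16 syllables
num_words, syllables = 11, 16

# Same smoothing as other_features: +0.001 avoids division by zero for empty tweets
avg_syl = round((syllables + 0.001) / (num_words + 0.001), 4)

fkra = modified_fk_grade(num_words, avg_syl)
fre = modified_fre(num_words, avg_syl)
print(avg_syl, fkra, fre)
```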


#### Pre-processing

In [None]:
# Load training labels and dictionaries + init objs
df = pd.read_csv(TRAIN_LABEL_FILE, dtype={'user_id': 'str'})
tweet_tokenizer = TweetTokenizer()
racismDict = DictFeaturizer.load(RACISM_DICT_FILE)
racismDict.rel = False
sid = SentimentIntensityAnalyzer()
hsolModel = joblib.load('hsof_model_mini.pkl')  # joblib opens and closes the file itself
religionDict = ReligiousityDict(RELIGIOUS_DICTIONARY)
religionDict.read_csv()  # Load the religious terms into religionDict.religious_words
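
With the dictionary loaded, the two-word rule applied by `is_tweet_religious` can be illustrated on made-up data. A minimal sketch; the sample terms and tweets are invented, and `is_religious` is a hypothetical stand-alone copy of the rule:

```python
def is_religious(tweet, religious_words, needed_words=2):
    """Return 1 when the tweet shares at least `needed_words` distinct
    terms with the dictionary, otherwise 0."""
    common = set(religious_words) & set(tweet.split())
    return 1 if len(common) >= needed_words else 0

religious_words = ["prayer", "church", "worship"]  # made-up sample terms
tweets = [
    "going to church for morning prayer",  # two distinct matches
    "church parking is full",              # one match only
    "nice weather today",                  # no match
]

flags = [is_religious(t, religious_words) for t in tweets]
percent_religious = sum(flags) / len(tweets)
print(flags, percent_religious)
```

Because the comparison uses sets, a term repeated within one tweet counts only once toward the threshold.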

In [None]:
# Specify feature columns in dataframe
featureNames = [
    "num_tweets",#ok
    "num_followers",#ok
    "day_with_most_tweets",#ok    
    "flesch_kincaid_grade",#ok
    "coleman_liau_index",#ok
    "automated_readability_index",#ok
    "linsear_write_formula",#ok
    "gunning_fog",#ok
    "avg_num_retweets",#ok
    "avg_num_favorites",#ok
    "avg_num_hashtags",#ok
    "avg_num_emojis",#ok
    "avg_num_links",#ok
    "avg_num_linked_pics",#ok
    "percent_pos_tweets",#ok
    "percent_neu_tweets",#ok
    "percent_neg_tweets",#ok
    "avg_num_racist_words",#ok
    "percent_religious_tweets", #Austin
    "percent_hatespeech_tweets",
    "percent_offensive_tweets"
]

In [None]:
# Init/reset all fields to 0.0
for featName in featureNames:
    df[featName] = np.zeros(df.shape[0])

In [None]:
# Correct the data types in the dataframe + set df index
df = df.astype({"num_tweets": int,
                "num_followers": int,
                "day_with_most_tweets": int
               })
df = df.set_index("user_id")
#print(df.head(3))

In [None]:
# Calculate metrics for each user
# Get the gzip files and load the queue
queue = [TWEETS_DIRECTORY + objname for objname in os.listdir(TWEETS_DIRECTORY) if re.search(r".+\.gz$", objname)]
progressBar = tqdm(total=len(queue))

while len(queue) > 0:
    # Get data and tweets
    userTweetsFile = queue.pop()
    dataAndTweets = getUserTweetsAndData(userTweetsFile)
    data = dataAndTweets[0]          # list of json data for each tweet
    tweets = dataAndTweets[1]
    totalNumTweets = len(data)
    
    # Skip if user has no tweets (still count the file as processed)
    if totalNumTweets == 0:
        progressBar.update(1)
        continue
    
    # Get the user ID
    userID = data[0]["user"]["id_str"]
    
    try:
        # Clean all tweets
        cleanedTweets = cleanAllTweets(tweets)

        # Get sentiment and hatespeech/offensive language frequencies
        sentimentFreq = getSentimentFreq(cleanedTweets, sid)
        hsolFreq = getHsolClass(cleanedTweets, hsolModel)

        df.loc[userID, "num_tweets"] = totalNumTweets
        df.loc[userID, "num_followers"] = data[0]["user"]["followers_count"]
        df.loc[userID, "day_with_most_tweets"] = getDayTweetedMost(data)
        df.loc[userID, "flesch_kincaid_grade"] = getReadabilityScore('flesch', cleanedTweets)
        df.loc[userID, "coleman_liau_index"] = getReadabilityScore('coleman', cleanedTweets)
        df.loc[userID, "automated_readability_index"] = getReadabilityScore('automated', cleanedTweets)
        df.loc[userID, "linsear_write_formula"] = getReadabilityScore('linsear', cleanedTweets)
        df.loc[userID, "gunning_fog"] = getReadabilityScore('gunning', cleanedTweets)
        df.loc[userID, "avg_num_retweets"] = getAvgNum('retweets', data)
        df.loc[userID, "avg_num_favorites"] = getAvgNum('favorites', data)
        df.loc[userID, "avg_num_hashtags"] = getAvgNum('hashtags', data)
        df.loc[userID, "avg_num_emojis"] = getAvgNum('emojis', tweets)
        df.loc[userID, "avg_num_links"] = getAvgNum('links', data)
        df.loc[userID, "avg_num_linked_pics"] = getAvgNum('pics', data)
        df.loc[userID, "avg_num_racist_words"] = getAvgNum('racist', cleanedTweets, racismDict=racismDict)
        df.loc[userID, "percent_pos_tweets"] = getPercent('pos', cleanedTweets, sentimentFreq=sentimentFreq)
        df.loc[userID, "percent_neu_tweets"] = getPercent('neu', cleanedTweets, sentimentFreq=sentimentFreq)
        df.loc[userID, "percent_neg_tweets"] = getPercent('neg', cleanedTweets, sentimentFreq=sentimentFreq) 
        df.loc[userID, "percent_religious_tweets"] = getPercent('religious', cleanedTweets, religionDict=religionDict)
        df.loc[userID, "percent_hatespeech_tweets"] = getPercent('hatespeech', cleanedTweets, hsolFreq=hsolFreq)
        df.loc[userID, "percent_offensive_tweets"] = getPercent('offensive', cleanedTweets, hsolFreq=hsolFreq)
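
    except Exception as err:
        # Report the reason instead of silently swallowing every error
        printErrorMsg(userID)
        print("Reason:", repr(err))
    finally:
        # Advance the progress bar whether or not the metrics could be calculated
        progressBar.update(1)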

In [None]:
# Clean up the dataframe
# Drop any rows containing NaNs or nulls
df = df.dropna()
        
# Normalize all values
#df = (df - df.mean()) / (df.max() - df.min())
df

In [None]:
# Output calculated metrics to a CSV file
df.to_csv("../results/user_metrics.csv")

In [None]:
# Convert dataframe to np array
# Normalization is commented out above, so dfNorm is never defined; use df directly
npDF = df.values

In [None]:
# Plot the correlation matrix
import matplotlib.pyplot as plt
%matplotlib inline

for i in range(len(df.columns)):
    print(str(i) + ": " + df.columns[i])
    
plt.rcParams['figure.figsize'] = [10, 10]
plt.matshow(df.corr())
plt.xticks(range(len(df.columns)), [i for i in range(len(df.columns))])
plt.yticks(range(len(df.columns)), [i for i in range(len(df.columns))])
plt.colorbar()
plt.show()

In [None]:
import seaborn as sns
corr = df.corr()
plt.figure(figsize = (15,10))
tickLabels = list(range(len(df.columns)))
sns.heatmap(corr, xticklabels=tickLabels, yticklabels=tickLabels)

In [None]:
# Split the data into training/testing sets
# sdo + rwa
trainSize = int(.75*npDF.shape[0])
Xtrain = npDF[:trainSize, 2:]
ytrain = npDF[:trainSize, :2]
Xtest = npDF[trainSize:, 2:]
ytest = npDF[trainSize:, :2]

print(Xtrain.shape)
print(ytrain.shape)
print(Xtest.shape)
print(ytest.shape)
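
The slicing above assumes the first two columns of `npDF` hold the label scores (the `sdo + rwa` comment) and the remaining columns hold features, with the first 75% of rows used for training. A minimal sketch of the same indexing on a made-up table, using plain lists so it runs without NumPy; all values are illustrative:

```python
# Hypothetical table: 8 users, 2 label columns followed by 3 feature columns.
rows = [[0.1 * i, 0.2 * i, i, i + 1, i + 2] for i in range(8)]

train_size = int(0.75 * len(rows))  # first 75% of rows, no shuffling

X_train = [r[2:] for r in rows[:train_size]]
y_train = [r[:2] for r in rows[:train_size]]
X_test = [r[2:] for r in rows[train_size:]]
y_test = [r[:2] for r in rows[train_size:]]

print(len(X_train), len(X_train[0]))
print(len(y_test), len(y_test[0]))
```

This split is positional: no shuffling is applied, so the result depends on the row order of the table.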

#### Regression

In [None]:
# Run Linear Regression
regr = linear_model.LinearRegression()
regr.fit(Xtrain, ytrain)
print("Score:", regr.score(Xtest, ytest))
yPred = regr.predict(Xtest)
#print('Coefficients: \n', regr.coef_)
print("Mean squared error: %.2f" % mean_squared_error(ytest, yPred))
print('Variance score: %.2f' % r2_score(ytest, yPred))
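
The two metrics printed above can be reproduced by hand for a single output column. A minimal sketch on made-up values, assuming the usual definitions (mean squared error as the average squared residual, and R² as one minus the ratio of residual to total sum of squares); the helper names are hypothetical:

```python
def mse_by_hand(y_true, y_pred):
    """Average of the squared residuals."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def r2_by_hand(y_true, y_pred):
    """1 - (residual sum of squares / total sum of squares)."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# Made-up targets and predictions
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.5, 5.0, 7.5, 9.0]

mse = mse_by_hand(y_true, y_pred)
r2 = r2_by_hand(y_true, y_pred)
print(mse, r2)
```

`regr.score` also reports R², so the "Score" and "Variance score" lines are expected to show the same quantity. With two target columns, as in this notebook, scikit-learn averages the per-column values by default.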

### Don't run past here, still broken from multiple changes

In [None]:
# Feature extraction (old)
"""
vectorizer = CountVectorizer(stop_words='english')
tf = vectorizer.fit_transform(tweets)
print("tf shape:", tf.shape)

y = np.empty(tf.shape[0])

# Give each tweet the personality score of the user
if TRAIN_LABEL_FILE == "":
    # Randomly fill score if none provided
    #y.fill(np.random.uniform(size=1)[0])
    y = np.random.uniform(size=tf.shape[0])
    print(y[0])
else:
    #y.fill(float(sdo_score))
    y = np.random.uniform(size=tf.shape[0])
    print(y[0])
"""

In [None]:
# Feature selection (old)
"""
#tfNew = SelectKBest(f_regression, k=10).fit_transform(tf, y)
selPercent = SelectPercentile(f_regression, percentile=10)
tfNew = selPercent.fit_transform(tf, y)
print("tfNew shape:", tfNew.shape)
"""

### References

https://www.geeksforgeeks.org/readability-index-pythonnlp/
- Referenced code for calculating readability score
- There are 4 readability formulas we can use
    - flesch_reading_ease(text)
    - gunning_fog(text)
    - smog_index(text)
    - dale_chall_readability_score(text)

@article{tulkens2016automated,
  title={The automated detection of racist discourse in dutch social media},
  author={Tulkens, St{\'e}phan and Hilte, Lisa and Lodewyckx, Elise and Verhoeven, Ben and Daelemans, Walter},
  journal={Computational Linguistics in the Netherlands Journal},
  volume={6},
  number={1},
  pages={3--20},
  year={2016}
}
- Racism dictionary

https://stackoverflow.com/questions/43146528/how-to-extract-all-the-emojis-from-text
- Code used to extract emojis

@inproceedings{hateoffensive,
  title = {Automated Hate Speech Detection and the Problem of Offensive Language},
  author = {Davidson, Thomas and Warmsley, Dana and Macy, Michael and Weber, Ingmar}, 
  booktitle = {Proceedings of the 11th International AAAI Conference on Web and Social Media},
  series = {ICWSM '17},
  year = {2017},
  location = {Montreal, Canada},
  pages = {512-515}
}
- Hate-speech and offensive language classifier