# README
This simple script does the following:
- Reads each user's tweets from the gzipped JSON (`.json.gz`) files in a given directory (the output files from Habeeb's Twitter crawler)
- Cleans the tweets: removes non-printable (non-ASCII) characters, which are assumed to be emojis, and drops non-alphabetic tokens
- Vectorizes tweets using a bag-of-words representation (TF/TF-IDF)
- Performs feature selection by removing all but the highest scoring percentage of features (percentage defined by user)
- Performs topic modeling with LDA - included in the script in case it might produce some useful insights
- Runs regression on the vectorized tweets and random real-numbered y-labels (these will be replaced with the ground truth when it is available to us)
- Returns real-valued predictions, regression coefficients for each feature, mean squared error, and variance score (R^2)
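
As a rough illustration of the vectorize, select, and regress steps listed above, here is a minimal sketch that chains them with a scikit-learn `Pipeline`. The toy tweets and scores are made up, and using a pipeline at all is an assumption for illustration; the cells below run each step separately.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectPercentile, f_regression
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

# Hypothetical toy data: four cleaned tweets with made-up scores
toy_tweets = [
    "good morning everyone",
    "bad traffic this morning",
    "good coffee good day",
    "bad weather bad mood",
]
toy_scores = [0.8, 0.3, 0.9, 0.1]

pipeline = make_pipeline(
    CountVectorizer(),                               # bag-of-words term counts (TF)
    SelectPercentile(f_regression, percentile=50),   # keep roughly the top-scoring half
    LinearRegression(),
)
pipeline.fit(toy_tweets, toy_scores)

# Predict scores for two unseen (made-up) tweets
toy_predictions = pipeline.predict(["good morning", "bad day"])
print(toy_predictions)
```

With only four samples the numbers mean nothing; the point is just the order of the steps.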

Notes:
- LDA should only use TF, not TF-IDF
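
The reason behind this note is that LDA's generative model is defined over word counts, while TF-IDF produces reweighted fractional values. A small sketch on made-up tweets shows the difference between the two matrices and fits LDA on the counts only:

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Made-up tweets for illustration
demo_tweets = [
    "good morning good day",
    "bad traffic this morning",
    "good coffee bad weather",
]

counts = CountVectorizer().fit_transform(demo_tweets)    # integer term counts (TF)
weights = TfidfVectorizer().fit_transform(demo_tweets)   # reweighted, L2-normalized values

counts_dense = counts.toarray()
weights_dense = weights.toarray()
counts_are_integers = bool(np.all(counts_dense == np.round(counts_dense)))
weights_are_integers = bool(np.all(weights_dense == np.round(weights_dense)))
print(counts_are_integers, weights_are_integers)

# LDA is fit on the count matrix, matching the note above
demo_lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(counts)
print(demo_lda.components_.shape)
```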

To Do:
- Take in CSV IDs with their corresponding scores
- Compare SelectPercentile(...) and SelectKBest(...)
- Add k-fold cross-validation
- Add option to save/load ML model
    - https://machinelearningmastery.com/save-load-machine-learning-models-python-scikit-learn/
- Use and compare the following regression models:
    - Linear Regression (current)
    - Ridge Regression
    - Least Angle Regression
    - Bayesian Regression
    - Logistic Regression (a classifier, so the scores would need to be binned into classes first)
    - SVM Regression
    - Nearest Neighbors Regression
    - Decision Trees Regression
    - Gradient Tree Boosting Regression
    - NN Regression (Use code from research with Dr. Jiang)
- Compare regression results when using TF-IDF (TfidfVectorizer(...) instead of CountVectorizer(...)); currently just TF
- With either TfidfVectorizer(...) or CountVectorizer(...), add parameters such as an n-gram range
- Write all results to a .log or .csv file
- Loop over all gzips instead of using just one
- Look into adding tweet-based features
    - https://stackoverflow.com/questions/29653321/add-column-to-tfidf-matrix (reference the 2nd answer)
- In each tweet-json, we have:
    - created_at (date and time created)
    - friends_count
    - followers_count
    - favourites_count
    - retweet_count
    - etc. (open json from any gzip to see what other data was scraped)
- Refine token filtering: don't just throw out anything that contains non-alphabetic characters
- Keep emojis
    - https://stackoverflow.com/questions/43146528/how-to-extract-all-the-emojis-from-text
- Write another script to run this automatically with different options (which models, what parameters, etc.)
- Display statistics/results graphically
- Use the hate-speech-dictionary as a feature
- Use the time that tweet is created as a feature
- Use the syllable count as a feature
    - https://stackoverflow.com/questions/46759492/syllable-count-in-python
- Use emojis as a feature
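
Two of the items above (k-fold cross-validation and comparing regression models) can be sketched together. This is a minimal example on synthetic data, not the project's data; the choice of Ridge as the second model, the fold count, and all variable names are assumptions for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in for the real feature matrix and scores (assumption)
X_demo, y_demo = make_regression(n_samples=60, n_features=8, noise=5.0, random_state=0)

candidate_models = {
    "linear": LinearRegression(),
    "ridge": Ridge(alpha=1.0),
}

# Shuffle once with a fixed seed so every model sees the same folds
folds = KFold(n_splits=5, shuffle=True, random_state=0)

cv_results = {}
for name, model in candidate_models.items():
    # scikit-learn reports negated MSE so that larger is always better
    neg_mse = cross_val_score(
        model, X_demo, y_demo, cv=folds, scoring="neg_mean_squared_error"
    )
    cv_results[name] = -neg_mse.mean()

for name, mse in cv_results.items():
    print(f"{name}: mean CV MSE = {mse:.2f}")
```

Other models from the list could be added to `candidate_models` without changing the loop.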

### User-defined Variables

In [None]:
# Specify the directory containing the users' tweet files (<user ID>.json.gz)
TWEETS_DIRECTORY = "../users-new/"

# Specify the name of the CSV file containing personality scores for each user
TRAIN_LABEL_FILE = "../train_labels.csv"

### Program Begins Here

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.feature_selection import SelectPercentile, SelectKBest
from sklearn.feature_selection import f_regression
from sklearn.decomposition import NMF, LatentDirichletAllocation
from sklearn import datasets, linear_model
from sklearn.metrics import mean_squared_error, r2_score
import sys
import os
import gzip
import re
import json
import string
import random
import numpy as np
import pandas as pd

In [None]:
# Function definitions

# Print the words in their respective topics
def printTopWords(model, featureNames, nTopWords):
    for topicIdx, topic in enumerate(model.components_):
        message = "Topic #%d: " % topicIdx
        message += " ".join([featureNames[i] for i in topic.argsort()[:-nTopWords-1:-1]])
        print(message)
    
    print()
    
# Get tweets from one user
def getUserTweetsAndData(userTweetsFile):
    data = []
    tweets = []

    with gzip.open(userTweetsFile,'r') as f:        
        for tweetDeetsBinary in f:
            # Convert each line (binary) to string
            tweetDeetsStr = tweetDeetsBinary.decode('utf-8')

            # Generate json objects
            tweetDeetsJson = json.loads(tweetDeetsStr)
            data.append(tweetDeetsJson)

            # https://stackoverflow.com/questions/92438/stripping-non-printable-characters-from-a-string-in-python
            filteredTweetText = ''.join(filter(lambda x: x in string.printable, tweetDeetsJson["text"]))

            # Only keep alphabetical words
            filteredTweetText = ' '.join([word for word in filteredTweetText.split(" ") if word.isalpha()])
            tweets.append(filteredTweetText)
    
    return data, tweets

# Get a user's ID
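def getUserID(TWEETS_DIRECTORY, userTweetsFile):
    # Escape the directory so characters such as "." are matched literally
    pattern = "(" + re.escape(TWEETS_DIRECTORY) + r")(\d+)(\.json\.gz)"
    m = re.match(pattern, userTweetsFile)
    if m is None:
        raise ValueError("Unexpected tweets file name: " + userTweetsFile)

    return int(m[2])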

# Get a user's SDO and RWA scores
def getUserScores(TWEETS_DIRECTORY, userTweetsFile, df):
    userID = getUserID(TWEETS_DIRECTORY, userTweetsFile)
    sdo_score = float(df.loc[userID, "sdo"])
    rwa_score = float(df.loc[userID, "rwa"])
    
    return sdo_score, rwa_score

# Print a user's data
def printUserStats(TWEETS_DIRECTORY, userTweetsFile, df):
    # Get the user ID (previously read from a global) and personality scores
    userID = getUserID(TWEETS_DIRECTORY, userTweetsFile)
    sdo_score, rwa_score = getUserScores(TWEETS_DIRECTORY, userTweetsFile, df)
    
    # Print user data
    print("Current twitter user file:", userTweetsFile)
    print("Number of tweets:", int(df.loc[userID, "num_tweets"]))
    print("Number of nighttime tweets:", int(df.loc[userID, "num_night_tweets"]))
    print("Number of monday tweets:", int(df.loc[userID, "num_mon_tweets"]))
    print("Number of tuesday tweets:", int(df.loc[userID, "num_tue_tweets"]))
    print("Number of wednesday tweets:", int(df.loc[userID, "num_wed_tweets"]))
    print("Number of thursday tweets:", int(df.loc[userID, "num_thu_tweets"]))
    print("Number of friday tweets:", int(df.loc[userID, "num_fri_tweets"]))
    print("Number of saturday tweets:", int(df.loc[userID, "num_sat_tweets"]))
    print("Number of sunday tweets:", int(df.loc[userID, "num_sun_tweets"]))
    print("Number of favorites:", int(df.loc[userID, "num_favorited"]))
    print("Number of retweets:", int(df.loc[userID, "num_retweeted"]))
    print("Number of followers:", int(df.loc[userID, "num_followers"]))
    print("SDO Score:", sdo_score)
    print("RWA Score:", rwa_score)
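
An alternative way to pull the user ID out of a file path is to match only the file name, which avoids building a pattern from the directory string. The helper name `parse_user_id` and the sample ID are hypothetical; this is a sketch, not a replacement that the rest of the notebook relies on.

```python
import re
from pathlib import Path

def parse_user_id(tweets_file):
    """Return the numeric user ID from a path such as '../users-new/12345.json.gz'."""
    match = re.fullmatch(r"(\d+)\.json\.gz", Path(tweets_file).name)
    if match is None:
        raise ValueError(f"Unexpected tweets file name: {tweets_file}")
    return int(match.group(1))

print(parse_user_id("../users-new/12345.json.gz"))
```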

In [None]:
# Get the gzip files
gzipFiles = [TWEETS_DIRECTORY + objname for objname in os.listdir(TWEETS_DIRECTORY) if re.search(r".+\.gz$", objname)]
for gzfilename in gzipFiles:
    print(gzfilename)
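
`os.listdir` returns entries in arbitrary order, which matters later because the train/test split takes the first 75% of rows. A minimal sketch of a sorted listing, using `pathlib` and a throwaway directory with made-up file names (the helper name `list_gzip_files` is hypothetical):

```python
import tempfile
from pathlib import Path

def list_gzip_files(directory):
    """Return the .json.gz files in `directory` as strings, sorted by name."""
    return sorted(str(path) for path in Path(directory).glob("*.json.gz"))

# Demonstrate on a temporary directory with made-up file names
with tempfile.TemporaryDirectory() as tmp_dir:
    for file_name in ["222.json.gz", "111.json.gz", "notes.txt"]:
        (Path(tmp_dir) / file_name).touch()
    found_files = [Path(p).name for p in list_gzip_files(tmp_dir)]

print(found_files)
```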

In [None]:
# Read file containing training labels into a dataframe
df = pd.read_csv(TRAIN_LABEL_FILE)
print(df.head(5))

In [None]:
# Modify dataframe to hold features for each user
featureNames = [
    "num_tweets",
    "avg_read_score",
    "num_night_tweets",
    "racist_score",
    "sexist_score",
    "relig_score",
    "num_emojis",
    "num_hashtags",
    "num_retweeted",
    "num_reshares",
    "num_mon_tweets",
    "num_tue_tweets",
    "num_wed_tweets",
    "num_thu_tweets",
    "num_fri_tweets",
    "num_sat_tweets",
    "num_sun_tweets",
    "num_pos_tweets",
    "num_neut_tweets",
    "avg_neg_score",
    "num_links",
    "num_followers",
    "num_favorited",
    "num_pics",
    "num_linked_pics"
]

for featName in featureNames:
    df[featName] = np.zeros(df.shape[0])
df = df.set_index("user_id")

print(df.head(5))

In [None]:
for userTweetsFile in gzipFiles:
    # Get data and tweets
    userID = float(getUserID(TWEETS_DIRECTORY, userTweetsFile))
    dataAndTweets = getUserTweetsAndData(userTweetsFile)
    data = dataAndTweets[0]
    tweets = dataAndTweets[1]
    
    for i in range(len(dataAndTweets[0])):
        # Get the average readability score
        
        # Get the time tweet created
        srch = re.search(r"(\d{2}):\d{2}:\d{2}", data[i]["created_at"])
        hour = int(srch[1])
        # Tweets posted at 19:00 or later count as nighttime tweets
        if hour > 18:
            df.loc[userID, "num_night_tweets"] += 1
        
        # Get the level of racism
        
        # Get the level of sexism
        
        # Get the level of religiousness
        
        # Get whether emojis were used/freq. of emojis
        
        # Get the number of hashtags
        
        # Get the number of reposts
        
        # Get the day of the week tweeted
        srch = re.search(r"(\w{3})", data[i]["created_at"])
        day = str(srch[1])
        if day == "Mon":
            df.loc[userID, "num_mon_tweets"] += 1
        elif day == "Tue":
            df.loc[userID, "num_tue_tweets"] += 1
        elif day == "Wed":
            df.loc[userID, "num_wed_tweets"] += 1
        elif day == "Thu":
            df.loc[userID, "num_thu_tweets"] += 1
        elif day == "Fri":
            df.loc[userID, "num_fri_tweets"] += 1
        elif day == "Sat":
            df.loc[userID, "num_sat_tweets"] += 1
        elif day == "Sun":
            df.loc[userID, "num_sun_tweets"] += 1

        # Get the sentiment of the tweet
        
        # Get whether a link was present
        
        # Get the number favorited
        df.loc[userID, "num_favorited"] += data[i]["favorite_count"]
        
        # Get the number retweeted
        df.loc[userID, "num_retweeted"] += data[i]["retweet_count"]
        
        # Get the number of pictures
        
        # Get the number of linked pictures
    
    # Get the number of tweets total
    df.loc[userID, "num_tweets"] = len(data)
    
    # Get the number of followers (taken from the user's last tweet)
    if data:
        df.loc[userID, "num_followers"] = data[-1]["user"]["followers_count"]
    
    # Print all a user's data
    printUserStats(TWEETS_DIRECTORY, userTweetsFile, df)
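
The hour and weekday can also be read by parsing the whole timestamp instead of using two regular expressions. The sketch below assumes `created_at` looks like `"Wed Oct 10 20:19:24 +0000 2018"`; the format string, helper name, and sample timestamps are assumptions, and `%a`/`%b` depend on the process locale using English names. The offset in these samples is `+0000`, so the hour is UTC rather than the user's local time.

```python
from collections import Counter
from datetime import datetime

# Assumed timestamp layout, e.g. "Wed Oct 10 20:19:24 +0000 2018"
CREATED_AT_FORMAT = "%a %b %d %H:%M:%S %z %Y"

def tweet_time_features(created_at):
    """Return (weekday abbreviation, hour) for one created_at string."""
    created = datetime.strptime(created_at, CREATED_AT_FORMAT)
    return created.strftime("%a"), created.hour

# Made-up timestamps for illustration
sample_timestamps = [
    "Wed Oct 10 20:19:24 +0000 2018",
    "Thu Oct 11 07:05:00 +0000 2018",
    "Wed Oct 17 23:59:59 +0000 2018",
]

day_counts = Counter()
night_tweets = 0
for stamp in sample_timestamps:
    day, hour = tweet_time_features(stamp)
    day_counts[day] += 1
    if hour > 18:  # same nighttime rule as the loop above
        night_tweets += 1

print(dict(day_counts), night_tweets)
```

Counting into a `Counter` and writing the totals to the dataframe once per user would also avoid one `df.loc` update per tweet.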

In [None]:
# Get complete rows of data from DF and convert to np_array
colNames = ["sdo", "rwa"] + featureNames
userIDs = [getUserID(TWEETS_DIRECTORY, userTweetsFile) for userTweetsFile in gzipFiles]

# DataFrame.append was removed in pandas 2.0, so select all rows in one .loc call.
# The two score columns come first because the split below slices on column position.
newDF = df.loc[userIDs, colNames]

npDF = newDF.values
print(npDF)

In [None]:
# Split the data into training/testing sets
trainSize = int(.75*npDF.shape[0])
Xtrain = npDF[:trainSize, 2:]
ytrain = npDF[:trainSize, :2]
Xtest = npDF[trainSize:, 2:]
ytest = npDF[trainSize:, :2]

print(Xtrain.shape)
print(ytrain.shape)
print(Xtest.shape)
print(ytest.shape)
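
The split above takes the first 75% of rows in file order. If a shuffled split is preferred, scikit-learn's `train_test_split` is one option. The sketch below uses a made-up matrix laid out like `npDF` (two score columns followed by feature columns); the variable names and the fixed seed are assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up matrix laid out like npDF: two score columns, then three features
demo = np.arange(40, dtype=float).reshape(8, 5)
X_all, y_all = demo[:, 2:], demo[:, :2]

# Shuffle with a fixed seed so the split is reproducible
X_tr, X_te, y_tr, y_te = train_test_split(
    X_all, y_all, test_size=0.25, shuffle=True, random_state=0
)
print(X_tr.shape, y_tr.shape, X_te.shape, y_te.shape)
```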

In [None]:
# Run Linear Regression
regr = linear_model.LinearRegression()
regr.fit(Xtrain, ytrain)
yPred = regr.predict(Xtest)
print(yPred)
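
For the "save/load ML model" item in the To Do list, one common approach is `joblib`, which is installed alongside scikit-learn. This is a minimal sketch on made-up data; the file name `regr.joblib` is hypothetical.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up training data standing in for Xtrain / ytrain
rng = np.random.default_rng(0)
X_fit = rng.normal(size=(20, 3))
y_fit = X_fit @ np.array([1.0, -2.0, 0.5])

fitted_model = LinearRegression().fit(X_fit, y_fit)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "regr.joblib")  # hypothetical file name
    joblib.dump(fitted_model, model_path)               # save the fitted model
    restored_model = joblib.load(model_path)            # load it back

# The restored model should reproduce the original predictions
same_predictions = bool(
    np.allclose(fitted_model.predict(X_fit), restored_model.predict(X_fit))
)
print(same_predictions)
```

A saved model should only be loaded with a compatible scikit-learn version, and only from a trusted source, since loading runs pickled code.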

In [None]:
print('Coefficients: \n', regr.coef_)
print("Mean squared error: %.2f" % mean_squared_error(ytest, yPred))
print('Variance score: %.2f' % r2_score(ytest, yPred))
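
Because `ytest` has two columns (SDO and RWA), the single numbers printed above are averages over both targets. A sketch of how per-target values could be reported instead, using made-up predictions:

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Made-up values: column 0 plays the role of SDO, column 1 of RWA
y_true_demo = np.array([[1.0, 2.0], [2.0, 4.0], [3.0, 6.0]])
y_pred_demo = np.array([[1.0, 2.0], [2.0, 5.0], [3.0, 6.0]])

# multioutput="raw_values" returns one score per target column
mse_per_target = mean_squared_error(y_true_demo, y_pred_demo, multioutput="raw_values")
r2_per_target = r2_score(y_true_demo, y_pred_demo, multioutput="raw_values")

print("MSE (SDO, RWA):", mse_per_target)
print("R^2 (SDO, RWA):", r2_per_target)
```

Here the first column is predicted exactly (MSE 0, R^2 1), while the single error in the second column gives an MSE of 1/3 and an R^2 of 0.875.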

In [None]:
# Feature extraction
# NOTE: `tweets` holds only the tweets of the last user processed in the loop above
vectorizer = CountVectorizer(stop_words='english')
tf = vectorizer.fit_transform(tweets)
print("tf shape:", tf.shape)

# Placeholder labels: one random score per tweet, to be replaced with each
# user's personality score once the ground truth is wired in
y = np.random.uniform(size=tf.shape[0])
print(y[0])

In [None]:
# Feature selection
#tfNew = SelectKBest(f_regression, k=10).fit_transform(tf, y)
selPercent = SelectPercentile(f_regression, percentile=10)
tfNew = selPercent.fit_transform(tf, y)
print("tfNew shape:", tfNew.shape)
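
To compare `SelectPercentile(...)` and `SelectKBest(...)` as planned in the To Do list, the two selectors can be fit on the same data and their retained indices compared. The sketch below uses synthetic data as a stand-in for the term-count matrix; all variable names are assumptions.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, SelectPercentile, f_regression

# Synthetic data standing in for the term-count matrix (assumption)
X_sel, y_sel = make_regression(
    n_samples=50, n_features=20, n_informative=3, random_state=0
)

# 10% of 20 features and k=2 should both keep two features
by_percent = SelectPercentile(f_regression, percentile=10).fit(X_sel, y_sel)
by_count = SelectKBest(f_regression, k=2).fit(X_sel, y_sel)

percent_idx = by_percent.get_support(indices=True)
count_idx = by_count.get_support(indices=True)

print("SelectPercentile kept:", percent_idx)
print("SelectKBest kept:", count_idx)
print("Same features:", np.array_equal(percent_idx, count_idx))
```

The practical difference is that a percentile scales with the vocabulary size, while `k` stays fixed no matter how many features the vectorizer produces.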

In [None]:
# Run LDA with original tf
lda = LatentDirichletAllocation(n_components=10, max_iter=5,
                                learning_method='online',
                                learning_offset=50.,
                                random_state=0)

lda.fit(tf)
print("Topics in LDA model:")
tfFeatureNames = vectorizer.get_feature_names_out()  # get_feature_names() was removed in scikit-learn 1.2
printTopWords(lda, tfFeatureNames, 20)

In [None]:
# Run LDA with selected tf
lda2 = LatentDirichletAllocation(n_components=10, max_iter=5,
                                learning_method='online',
                                learning_offset=50.,
                                random_state=0)

lda2.fit(tfNew)
print("Topics in new LDA model:")
tfFeatureNames = vectorizer.get_feature_names_out()  # get_feature_names() was removed in scikit-learn 1.2
retained = selPercent.get_support(True)
newFeatureNames = []

# Get the list of selected feature names 
for idx in retained:
    newFeatureNames.append(tfFeatureNames[idx])

printTopWords(lda2, newFeatureNames, 20)