# Building the Document Collections and Quantifying Topic Coverage over Time Series

Our time series covers every day from 11/12/2020 to 12/11/2020 (month/day/year).

For our topic model, we found that treating each individual word as its own topic worked best. We therefore quantify the coverage of every word in our vocabulary on each day of the time series, where a word's coverage on a day is its count divided by the total number of words tweeted that day, and keep only the words whose maximum daily coverage exceeds our threshold. The coverage of each remaining topic over time can then be compared with the underlying stock price using Granger causality tests.
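
Written out, the coverage measure and the selection rule look like this. The symbols are notation added here for illustration and do not appear in the code:

$$
c_d(w) = \frac{n_d(w)}{N_d}, \qquad \text{keep } w \text{ if } \max_{d} c_d(w) > \tau
$$

Here $n_d(w)$ is the number of times word $w$ appears in tweets on day $d$, $N_d$ is the total number of words tweeted on day $d$, and $\tau$ is the cutoff (0.001 in the code below).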

The basis of our algorithm is as follows:

    For each of TSLA, PLTR, and NFLX:

    1. Build the document collection, then build the vocabulary for the entire collection
        1a. Make a set of stop words which we exclude from the vocabulary
    2. Build a vocabulary for each day which maps each word to its count on that day
        2a. Evaluate the coverage of each term in this vocabulary on that day
    3. For each term in the full collection vocabulary, build a dictionary mapping the word
    to a list of its coverage on each day in our time series
        3a. Only keep words whose coverage exceeds our cutoff (0.001) at some point in the time series
    4. Filter for the 200 words which have the highest maximum coverage during the time series
    5. For each of these words, perform a Granger test to evaluate whether the coverage of that
    word "Granger causes" the price of the underlying stock
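
For reference, the textbook form of the Granger test in step 5 compares two regressions. The notation is added here for illustration: $y_t$ is the stock price on day $t$, $x_t$ is the coverage of one word on day $t$, and $p$ is the number of lags, which this section does not specify.

$$
\begin{aligned}
\text{restricted:}\quad & y_t = a_0 + \sum_{i=1}^{p} a_i\, y_{t-i} + \varepsilon_t \\
\text{unrestricted:}\quad & y_t = a_0 + \sum_{i=1}^{p} a_i\, y_{t-i} + \sum_{j=1}^{p} b_j\, x_{t-j} + u_t
\end{aligned}
$$

The word's coverage is said to Granger-cause the price when an F-test rejects the null hypothesis $b_1 = \dots = b_p = 0$, that is, when past coverage improves the prediction of the price beyond what past prices already provide.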

In [1]:
"""
The code in this cell parses our input data which we had previously created in tweet_data_tsla.txt,
tweet_data_pltr.txt, and tweet_data_nflx.txt into document collection lists with tuples in the form
(tweet string, datetime.date())
"""
import re
import datetime

# Helper for formatting the dates in the doc collection: maps a three-letter
# month abbreviation to its number, or -1 if the abbreviation is unknown.
MONTH_NUMBERS = {
    "Jan": 1, "Feb": 2, "Mar": 3, "Apr": 4, "May": 5, "Jun": 6,
    "Jul": 7, "Aug": 8, "Sep": 9, "Oct": 10, "Nov": 11, "Dec": 12,
}

def month_to_num(m):
    return MONTH_NUMBERS.get(m, -1)

def create_tweet_collection(infile):
    
    doc_collection = []
    
    with open(infile) as fp:
            
        for tweet in re.findall('--start--(.*?)--end--', fp.read(), re.S):
                
            split_tweet = tweet.split("\n")
                
            unformatted_string_date = split_tweet[1]
            # 0 - day of week, 1 - Month as 3-letter, 2 - day, 3 - time (UTC), 4 - timezone add-on (always +0000), 5 - year, 6 - :
            split_date = unformatted_string_date.split(" ")
            # a correctly formatted datetime timestamp of the tweet with only the date info
            date = datetime.date(int(split_date[5]), month_to_num(split_date[1]), int(split_date[2]))

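            # Join the remaining lines with single spaces so that words on either
            # side of a line break inside a tweet are not fused into one token.
            tweet_string = " ".join(line.strip() for line in split_tweet[2:] if line.strip())

            dc_tuple = (tweet_string, date)
            doc_collection.append(dc_tuple)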
    
    return doc_collection       
                
tsla_doc_collection = create_tweet_collection("tweet_data_tsla.txt")
pltr_doc_collection = create_tweet_collection("tweet_data_pltr.txt")
nflx_doc_collection = create_tweet_collection("tweet_data_nflx.txt")
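
The date handling above can also be done with `datetime.strptime`. The sketch below is an alternative, not the code we ran. The sample record is an assumption inferred from what the parser expects (a `--start--` marker, a date line with exactly the six fields listed in the comment above, the tweet text, then `--end--`), and `parse_record_date` is a hypothetical helper. If the real date lines carry extra trailing tokens, those would need to be stripped first.

```python
import datetime
import re

# Hypothetical record, shaped the way create_tweet_collection expects its input.
sample = "--start--\nThu Nov 12 14:03:11 +0000 2020\n$TSLA deliveries look strong\n--end--"

def parse_record_date(date_line):
    """Parse a date line such as 'Thu Nov 12 14:03:11 +0000 2020' into a datetime.date."""
    # %a and %b match English day and month abbreviations under the default C locale.
    return datetime.datetime.strptime(date_line, "%a %b %d %H:%M:%S %z %Y").date()

records = re.findall("--start--(.*?)--end--", sample, re.S)
lines = records[0].split("\n")  # ['', date line, tweet text, '']
parsed_date = parse_record_date(lines[1])
tweet_text = " ".join(line for line in lines[2:] if line)
print(parsed_date, tweet_text)
```

Unlike `month_to_num`, which returns -1 for an unknown month and lets `datetime.date` fail later, `strptime` raises a `ValueError` at the point where the malformed line is read.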

In [2]:
"""
The code in this cell was used to parse the stopwords.txt file from MP 2.4 and create a set of stop words which we 
will exclude from our corpus vocabulary to avoid words of low significance.
"""
stopwords = set()

# Read one stop word per line, dropping the trailing newline;
# the with-block closes the file even if reading fails.
with open("stopwords.txt", "r") as stopwords_file:
    for line in stopwords_file:
        stopwords.add(line.split("\n")[0])

In [3]:
"""
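Helper function which is used to filter out words that are not plain ASCII, as a rough proxy for
text that is not in English. It checks characters only and does not detect the language.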
"""
def isEnglish(s):
    try:
        s.encode(encoding='utf-8').decode('ascii')
    except UnicodeDecodeError:
        return False
    else:
        return True
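
It helps to see what this check does and does not catch. The sketch below uses `str.isascii()` (Python 3.7+), which gives the same answer as the encode/decode round trip for ordinary strings. `is_ascii_word` is a hypothetical name and the sample words are made up.

```python
def is_ascii_word(s):
    """Return True when every character in s is ASCII."""
    return s.isascii()

samples = ["tesla", "$TSLA", "café", "株価", "bonjour"]
results = {word: is_ascii_word(word) for word in samples}
print(results)
```

An unaccented non-English word such as "bonjour" passes, while any word containing an accented letter, an emoji, or a curly apostrophe is rejected even when the tweet is in English.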

In [4]:
"""
The module in this cell is used to create our corpus which will perform all the calculations we need to quantify 
topic coverage over the time series. 

Parameters:
    - self.documents: array of tuples (date of tweet, array of words in tweet) which represent our document collection
      for this corpus
    - self.daily_documents: dictionary mapping date to an array of all words used in tweets on that day
    - self.vocabulary: array of all unique words from all documents in this collection excluding stop words
    - self.daily_vocabulary_dict: nested dictionary; date maps to a dictionary which has the word
      as a key and the frequency of that word as its value
    - self.daily_word_count: dictionary mapping date to total number of words on that day
    - self.daily_term_coverage: nested dict; date maps to a dictionary which has the words coverage on this day;
      key is the word and value is total coverage. Coverage = (Term Frequency) / (Count of total words on this day)
    - self.documents_path: the document collection itself, a list of (tweet string, date) tuples; despite the
      name, it is not a file path
    - self.word_coverage_over_time: dictionary mapping a word to a list of its coverage values, one per date in
      self.daily_term_coverage
    - self.word_max_coverage: dictionary mapping word to its max coverage during time series so we can sort and
      filter for the top words which have the highest coverage at any point in our time series.
    - self.number_of_documents: the total number of documents in the given collection
    - self.vocabulary_size: The total number of unique words in this entire document collection
"""
import datetime
from statistics import mean

class Corpus(object):

    def __init__(self, documents_path):
        """
        Initialize parameters and set up document path to get tweets
        """
        self.documents = []
        self.daily_documents = dict()
        self.vocabulary = []
        self.daily_vocabulary_dict = dict()
        self.daily_word_count = dict()
        self.daily_term_coverage = dict()
        self.documents_path = documents_path
        self.word_coverage_over_time = dict()
        self.word_max_coverage = dict()
        self.number_of_documents = 0
        self.vocabulary_size = 0

    def build_corpus(self):
        """
        Split each tweet into words and fill in self.documents, a list of (date, list of words) tuples:
        self.documents = [(date, ["the", "day", "is", "nice", "the", ...]), ...]
        Also group the words by day in self.daily_documents and update self.number_of_documents
        """
        # the doc collection comes in as an array of tuples with the tweet string first and the date next
        for tweet, date in self.documents_path:
            tweet_words = tweet.split(" ")
            # get all words into lists
            self.documents.append((date, tweet_words))
            
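            # Start each day with a copy: storing tweet_words itself would alias the list
            # already held in self.documents, and the += below would then mutate it.
            if date in self.daily_documents:
                self.daily_documents[date] += tweet_words
            else:
                self.daily_documents[date] = list(tweet_words)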

        self.number_of_documents = len(self.documents)

    def build_vocabulary(self):
        """
        Construct a list of unique words in the whole corpus. Put it in self.vocabulary
        for example: ["rain", "the", ...]
        Update self.vocabulary_size
        """
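        # track seen words in a set: membership tests are O(1), while scanning the list is O(n) per word
        seen = set(self.vocabulary)
        for date, tweet in self.documents:
            for word in tweet:
                # make sure word is unique
                if word not in seen and word not in stopwords:
                    seen.add(word)
                    self.vocabulary.append(word)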

        self.vocabulary_size = len(self.vocabulary)
        
    def build_daily_vocab(self):
        """
        Construct a nested dictionary in which each date maps to a dictionary where each word maps to
        its total count in tweets on that respective day. Then, compute the coverage of the given word--a 
        topic in this case--on that day and save that in self.daily_term_coverage.
        """
        for date in self.daily_documents:
            
            dwc = 0
            word_count_dict = dict()
            word_coverage_dict = dict()
            
            for word in self.daily_documents[date]:
                dwc += 1
                if word in word_count_dict:
                    word_count_dict[word] += 1
                else:
                    word_count_dict[word] = 1
        
            self.daily_vocabulary_dict[date] = word_count_dict
            
            self.daily_word_count[date] = dwc
            
            for w in word_count_dict:
                word_coverage_dict[w] = word_count_dict[w] / dwc
                
            self.daily_term_coverage[date] = word_coverage_dict
    
    def build_word_cov_over_time(self):
        """
        This function gives us the coverage of every word in the vocabulary over time. To avoid noise, we only 
        select topics which are heavily covered at some point during our time series.
        """
        for word in self.vocabulary:
            if isEnglish(word):
                # we will not be able to extract knowledge from tweets in other languages yet
                cov_over_time = []

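                # visit dates in chronological order so each list is a proper time series,
                # regardless of the order in which tweets appear in the input file
                for date in sorted(self.daily_term_coverage):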
                    word_to_cov_map = self.daily_term_coverage[date]
                    if word in word_to_cov_map:
                        cov_over_time.append(word_to_cov_map[word])
                    else:
                        cov_over_time.append(0)

                # if this word's coverage exceeds 1/1000 on some day, we mark it as significant
                max_cov = max(cov_over_time)
                if max_cov > 0.001:
                    self.word_coverage_over_time[word] = cov_over_time
                    self.word_max_coverage[word] = max_cov
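
To make the calculation concrete, here is a small standalone sketch of the same steps on made-up data. It does not use the `Corpus` class. `coverage_over_time`, the toy tweets, and the large cutoff of 0.2 are all assumptions chosen so the result is easy to check by hand, and stop-word filtering is left out.

```python
import datetime
from collections import Counter

def coverage_over_time(daily_words, cutoff):
    """Map each word to its per-day coverage list, keeping words whose peak coverage exceeds cutoff."""
    dates = sorted(daily_words)
    # coverage on a day = count of the word that day / total words that day
    daily_cov = {}
    for d in dates:
        counts = Counter(daily_words[d])
        total = len(daily_words[d])
        daily_cov[d] = {w: c / total for w, c in counts.items()}
    vocab = {w for words in daily_words.values() for w in words}
    series = {}
    for w in vocab:
        cov = [daily_cov[d].get(w, 0) for d in dates]  # 0 on days the word is absent
        if max(cov) > cutoff:
            series[w] = cov
    return series

# Toy data: all words tweeted on each of two days (invented for illustration).
toy = {
    datetime.date(2020, 11, 12): ["battery", "day", "battery", "stock"],
    datetime.date(2020, 11, 13): ["stock", "split", "rumor", "stock", "up"],
}
series = coverage_over_time(toy, cutoff=0.2)
print(series)
```

On this data "battery" peaks at 2/4 = 0.5 and "stock" at 2/5 = 0.4, so both are kept, while "split" peaks at exactly 1/5 = 0.2 and is dropped because the comparison is strict.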
                

In [5]:
"""
The code in this cell initializes the three corpora we are interested in: tweets containing $TSLA, $PLTR, or $NFLX.
Each corpus is initialized, then the appropriate functions are called on it to calculate every topic/word's
coverage over time.
"""
tsla_documents_path = tsla_doc_collection
pltr_documents_path = pltr_doc_collection
nflx_documents_path = nflx_doc_collection

tsla_corpus = Corpus(tsla_documents_path)
tsla_corpus.build_corpus()
tsla_corpus.build_vocabulary()
tsla_corpus.build_daily_vocab()
tsla_corpus.build_word_cov_over_time()

pltr_corpus = Corpus(pltr_documents_path)
pltr_corpus.build_corpus()
pltr_corpus.build_vocabulary()
pltr_corpus.build_daily_vocab()
pltr_corpus.build_word_cov_over_time()

nflx_corpus = Corpus(nflx_documents_path)
nflx_corpus.build_corpus()
nflx_corpus.build_vocabulary()
nflx_corpus.build_daily_vocab()
nflx_corpus.build_word_cov_over_time()

In [6]:
"""
The code in this cell keeps the 200 words/topics with the highest maximum daily coverage in each document collection
"""
from operator import itemgetter

tsla_top200_words = dict(sorted(tsla_corpus.word_max_coverage.items(), key = itemgetter(1), reverse = True)[:200])
pltr_top200_words = dict(sorted(pltr_corpus.word_max_coverage.items(), key = itemgetter(1), reverse = True)[:200])
nflx_top200_words = dict(sorted(nflx_corpus.word_max_coverage.items(), key = itemgetter(1), reverse = True)[:200])
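
As a quick illustration of the selection step, the sketch below applies the same `sorted(...)[:k]` pattern to a made-up dictionary with k = 2 instead of 200. The words and values are invented.

```python
from operator import itemgetter

# Invented peak-coverage values, standing in for Corpus.word_max_coverage.
word_max_coverage = {"battery": 0.012, "split": 0.030, "earnings": 0.021, "up": 0.004}

k = 2
# Sort (word, peak coverage) pairs by coverage, highest first, and keep the first k.
top_k_words = dict(sorted(word_max_coverage.items(), key=itemgetter(1), reverse=True)[:k])
print(top_k_words)
```

Because Python's sort is stable, words with equal peak coverage keep their original dictionary order, so which of several tied words makes the cut depends on insertion order.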

In [7]:
"""
The code in this cell converts our term coverage dictionaries into CSVs so that they can be cleaned up, the time
series data (the underlying stock's price for TSLA, PLTR, or NFLX) can be added to each CSV, and the result can be
loaded into R to perform Granger causality tests, which tell us which topics' coverage Granger-causes
the respective stock's price.
"""
import pandas as pd

# Build each DataFrame in one step (one column per word, in top-200 order);
# adding 200 columns one at a time can fragment the frame.
tsla_word_coverage_over_time_df = pd.DataFrame(
    {word: tsla_corpus.word_coverage_over_time[word] for word in tsla_top200_words}
)

pltr_word_coverage_over_time_df = pd.DataFrame(
    {word: pltr_corpus.word_coverage_over_time[word] for word in pltr_top200_words}
)

nflx_word_coverage_over_time_df = pd.DataFrame(
    {word: nflx_corpus.word_coverage_over_time[word] for word in nflx_top200_words}
)

tsla_word_coverage_over_time_df.to_csv("tsla_word_coverage_over_time.csv")
pltr_word_coverage_over_time_df.to_csv("pltr_word_coverage_over_time.csv")
nflx_word_coverage_over_time_df.to_csv("nflx_word_coverage_over_time.csv")
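
The CSVs written above are indexed by row number only. If a date column is wanted to make the later merge with price data easier, one possible approach is sketched below with the standard `csv` module. `write_dated_coverage`, the toy dates, and the toy coverage values are assumptions, and the sketch presumes each coverage list is ordered the same way as the list of dates.

```python
import csv
import datetime
import io

def write_dated_coverage(out, dates, word_coverage):
    """Write one row per date: the date first, then each word's coverage on that date."""
    words = list(word_coverage)
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["date"] + words)
    for i, d in enumerate(dates):
        writer.writerow([d.isoformat()] + [word_coverage[w][i] for w in words])

# Toy inputs, invented for illustration.
dates = [datetime.date(2020, 11, 12), datetime.date(2020, 11, 13)]
word_coverage = {"battery": [0.5, 0.0], "stock": [0.25, 0.4]}

buffer = io.StringIO()  # stands in for a real file opened with newline=""
write_dated_coverage(buffer, dates, word_coverage)
csv_text = buffer.getvalue()
print(csv_text)
```

With ISO-formatted dates in the first column, the price series can be joined on the date rather than on row position, which also makes any missing days visible.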

For the next step, Granger testing, see the granger_testing.html or granger_testing.rmd file.