# Comparing tf-idf Weight Use Across Embedding Models for Stance Classification

# Vision
The system is a grid-searcher of deep neural networks that takes raw Twitter data, processes it, vectorizes it, and then learns the stance of each tweet's author from it. In these tweets, a user is either for, against, or neutral on some topic. The system specifically processes tweets concerning Australian mining companies.

This project specifically focuses on the vectorization step of the process. I compare character n-gram term frequency against five word-based approaches: term frequency-inverse document frequency (tf-idf) alone, tf-idf in combination with Word2vec, Word2vec alone, tf-idf in combination with fastText, and fastText alone.

The term-frequency algorithm is like a "bag of words" approach, with each term weighted by its frequency, creating sparse 'tweet-vectors' that only have non-zero values for the words or character n-grams that appear in that tweet. The approach using individual term vectors (think Word2vec) combines the inverse document frequency weights with individual term embeddings to create a dense 'tweet-vector': the average of the embeddings of all words in a tweet, with each word scaled by its importance according to its inverse document frequency weight (Besbes, "Sentiment analysis on Twitter using word2vec and keras"). The idea is that these networks are 'informed' by what each term 'means'. The networks using just Word2vec or fastText average the embeddings of all words in a tweet into a 'tweet-vector' without scaling the words according to their importance.
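The weighted averaging described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the project's actual code: the function name `tweet_vector`, the dictionary-based `embeddings` and `idf` lookups, and the choice to divide by the number of words used (rather than by the sum of the weights) are all assumptions.

```python
import numpy as np


def tweet_vector(tokens, embeddings, dim, idf=None):
    """Average the embeddings of a tweet's tokens, optionally idf-weighted.

    tokens     -- list of words in the tweet
    embeddings -- dict mapping word -> 1-D NumPy array of length dim (hypothetical)
    dim        -- dimensionality of the embeddings
    idf        -- dict mapping word -> idf weight, or None for a plain average
    """
    total = np.zeros(dim)
    count = 0
    for token in tokens:
        if token not in embeddings:
            continue  # skip words the embedding model has never seen
        if idf is not None and token not in idf:
            continue  # skip words the tf-idf model has never seen
        weight = 1.0 if idf is None else idf[token]
        total += weight * embeddings[token]
        count += 1
    # Assumption: normalize by the number of words used.
    # A tweet with no known words stays a zero vector.
    return total / count if count else total
```

Calling `tweet_vector(tokens, embeddings, dim)` without `idf` gives the unweighted variant used for the "just Word2vec" and "just fastText" runs; passing an `idf` dictionary gives the weighted variant.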

The purpose of this project is to compare a system that only uses term-frequency to a system that uses term-frequency in combination with embeddings generated by tools like Word2vec and fastText. Below, I will be primarily evaluating how well they perform in this specific area of stance analysis. A larger goal is to create a system that can then take in new tweets and classify them as either for, against, or neutral regarding their stance on certain companies, as accurately as possible, and that is why I am taking on this sub-project.

# Background
The system uses [Keras](https://keras.io/) for densely connected networks, because of its simplicity. It also uses a grid search to find a good size for the network, both in layer count and in individual layer size. To create all permutations for the grid search, the system uses itertools, a Python standard-library module for combinatorics. For number and vector manipulation the system uses [NumPy](https://www.numpy.org/) and [pandas](https://pandas.pydata.org/). To calculate performance scores the system uses [scikit-learn](https://scikit-learn.org/stable/).
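As a rough sketch of how itertools can enumerate the search space, the snippet below generates every ordering of distinct layer sizes for networks of one to four hidden layers. The candidate sizes and the four-layer maximum are assumptions inferred from the architectures reported in the results; the function name `architectures` is hypothetical.

```python
from itertools import permutations


def architectures(layer_sizes, max_layers):
    """Yield every ordering of distinct layer sizes, for 1..max_layers layers."""
    for depth in range(1, max_layers + 1):
        # permutations() never repeats a size within one architecture
        for arch in permutations(layer_sizes, depth):
            yield arch


# Assumed candidate sizes; the real search space may differ.
sizes = [10, 25, 50, 100, 500, 1000]
grid = list(architectures(sizes, max_layers=4))
```

With six candidate sizes and up to four layers this yields 6 + 30 + 120 + 360 = 516 architectures, each of which would be trained and scored in turn.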

For word embedding generation it uses [Word2vec](https://en.wikipedia.org/wiki/Word2vec) and [fastText](https://fasttext.cc/), specifically the [Gensim](https://radimrehurek.com/gensim/) implementations. Word2vec and fastText were chosen because they are common tools that can generate word embeddings, and both have many tutorials on how to use them. I chose the Gensim library because it is also commonly used in tutorials and provides implementations of both systems. I chose fastText in addition to Word2vec because fastText is a sort of extension of, or even improvement on, Word2vec. It specifically improves on Word2vec by using sub-word information. This means that, under the hood, the system builds each word embedding from the character n-grams of the given word. So a word like "apple" is actually represented as a series of character n-grams; with n-grams of length two, for instance, "apple" could be represented as <ap, pp, pl, le>. This is especially useful for finding similarities between words that are slightly misspelled, which makes it well suited to tweets. For example, "no" and "noo" would be made up of overlapping representations, <no> and <no, oo>, and since they would also appear in similar contexts, fastText would eventually learn that they are very similar.
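The sub-word decomposition above can be reproduced with a short pure-Python helper. This is only an illustration of the idea: the names `char_ngrams` and `shared_ngrams` are hypothetical, and the default of length-two n-grams is chosen to match the examples in the text. fastText itself wraps each word in `<` and `>` boundary markers and, by default, uses n-grams of length three to six; the `boundaries` flag mimics the markers.

```python
def char_ngrams(word, n_min=2, n_max=2, boundaries=False):
    """Return the character n-grams of a word, shortest n-grams first."""
    token = f"<{word}>" if boundaries else word
    grams = []
    for n in range(n_min, n_max + 1):
        # slide a window of width n across the token
        for i in range(len(token) - n + 1):
            grams.append(token[i:i + n])
    return grams


def shared_ngrams(a, b, **kwargs):
    """Return the set of n-grams that two words have in common."""
    return set(char_ngrams(a, **kwargs)) & set(char_ngrams(b, **kwargs))
```

Because "no" and "noo" share the n-gram "no", their sub-word representations overlap even though the two strings are different vocabulary entries.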

The system also takes advantage of scikit-learn’s term frequency-inverse document frequency vectorizer. Because the vectorizer is simple to use, the system can easily fit a tf-idf model and then extract the weight for each individual word in the vocabulary. Those weights are later used to scale the word embedding vectors generated by the Word2vec and fastText systems.
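A minimal sketch of fitting the vectorizer and pulling out a per-word weight might look like the following. The three example tweets are made up, the vectorizer is left at its default settings (which may not match the project's configuration), and the variable names are hypothetical.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up stand-ins for the real, preprocessed tweets.
tweets = [
    "adani mine is good",
    "adani mine is bad",
    "stop adani",
]

vectorizer = TfidfVectorizer()  # default tokenization and smoothing
vectorizer.fit(tweets)

# vocabulary_ maps each term to its column index; idf_ holds the
# inverse document frequency weight for each column.
idf_weights = {
    term: vectorizer.idf_[index]
    for term, index in vectorizer.vocabulary_.items()
}
```

Words that appear in every tweet (here "adani") receive the lowest weight, while rarer words receive higher weights, so rarer words contribute more when the embeddings are scaled.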

This project will also be based on the work that Roy Adams and I did for our senior project.

My data comes from Twitter and was further manipulated by CSIRO and labeled by both CSIRO and others, including Professor Keith VanderLinden and Roy Adams of Calvin College. The tweets gathered were both tweeted from Australia and concerned certain Australian mining companies, such as Adani.

In order to run this system quickly, I have used [Singularity](https://www.sylabs.io/docs/), a container system; [SLURM](https://slurm.schedmd.com/documentation.html), a workload manager; and [Borg](https://borg.calvin.edu/), Calvin College’s supercomputer.

# Implementation
This system runs and trains on Borg, the Calvin supercomputer, specifically on GPU nodes when they are available. (Right now it’s set to run on CPU nodes because the GPU nodes were in use; this can be changed by editing the SLURM script.) It runs within a Singularity container to make the code easily portable if desired.

The system is implemented in Python and uses Bash and SLURM scripts to assist in running it. It uses Keras for the densely connected neural networks. The Keras neural networks train on vectors that are generated from the data. By (un)commenting certain lines, you can test many different versions of the program. The main differences between the program variations are the different vectorization approaches, as detailed above. 

Once the input data is vectorized and transformed into the desired input format, it is passed into a grid searching algorithm which trains and tests many different networks across the search space of layer size and number of layers. This extends the work that Roy Adams and I did for our senior project by allowing for further manipulation of the input data.

The performance results of each neural network in the grid search are then stored in a CSV file. If you use the “single_run” script, the file name has the date and time at which the system started running appended; if you use the “25x run” script, the results are put in the “run_results” folder instead. Currently, I have the resulting CSV files in the project directory for the following vectorization methods:
*   term frequency-inverse document frequency (tf-idf)
*   tf-idf in combination with Word2vec
*   tf-idf in combination with fastText (there are 25 run results available since it did well)
*   Word2vec
*   fastText
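The timestamped result files described above could be produced along these lines. This is a sketch under assumptions: the file-name pattern, the `results` prefix, the column names, and both function names are hypothetical and may not match what the project's scripts actually write.

```python
import csv
from datetime import datetime
from pathlib import Path


def results_path(start_time, directory=".", prefix="results"):
    """Build a CSV path with the run's start date and time appended."""
    stamp = start_time.strftime("%Y-%m-%d_%H-%M-%S")
    return Path(directory) / f"{prefix}_{stamp}.csv"


def write_results(path, rows):
    """Write one row per network; rows are dicts with the keys below."""
    with open(path, "w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=["architecture", "f1"])
        writer.writeheader()
        writer.writerows(rows)
```

Taking the timestamp once at start-up (for example `results_path(datetime.now())`) keeps every network from one grid search in the same file, and separate runs never overwrite each other.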



# Results
Before this project, Roy Adams and I had created a system that achieved an accuracy of around 74% using term frequency vectorization of character n-grams of length three to five. It used the same grid search that this system uses, and it did well with smaller networks. It was run 25 times so that I could better estimate the mean accuracy.

Within this project, it should be noted that each system was only run on one training set (except for the tf-idf in combination with fastText variation of the program, since it did so well on the first run), so the results that are recorded for each grid-search and each vectorization method may be an outlier performance result. 

By converting the system Roy and I created to use word-based n-grams and term frequency-inverse document frequency instead of just term frequency, the performance decreased. The best performing network, with a densely connected neural network architecture of "1000 10 25 100" (where each number is the number of nodes in the layer, with four hidden layers) had a resulting F1 score of 0.721. A network of architecture "10" followed closely behind with an F1 score of 0.720.

The use of Word2vec alone allowed the system to reach a peak F1 score of 0.75 with a network architecture of “500 10 50 25”. This outperforms the original term-frequency network. Deeper networks seemed to do a bit better in this situation, with four of the top five networks having four layers. This could be due to chance, since there are so many possible permutations of four-layer networks. It could also be because the data isn’t as sparse as term frequency data and therefore requires more nodes and layers to separate the tweets of different stances. Based on the scores and the number of four-layer networks (which you can explore further by looking at the result files), I believe a hypothesis that four-layer networks are consistently some of the top performers due to random chance would be difficult to reject.

The use of Word2vec in combination with tf-idf had a top F1 score of 0.77. The top four networks all had that F1 score (at least at that precision), and four of the top five were four-layer networks, with the top performer being a two-layer network of “500 25”.

The use of fastText alone peaked with an F1 score of 0.784 with a network architecture of “100 1000 50 25”. Adding tf-idf weighting to fastText bumped the peak score up to 0.802 with a network architecture of “25 50 1000 10”, on a single run on the same dataset that was used for all the other tests. The top five networks for both runs all contained four layers.

Across the 25 runs of fastText in combination with tf-idf, the highest mean score achieved by any single architecture was 0.74, with a standard deviation of about 0.03. So, although at least one fastText architecture is able to achieve a high score in every run, there is no single architecture that can consistently perform with a score higher than 0.74. If someone could find the similarities between the architectures that do the best in every run, then they could perhaps construct a network that uses the ideal features of a good architecture.
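The distinction between "the best score in a run" and "the best mean score across runs" can be made concrete with a small aggregation sketch. The function names and the input layout (one dictionary per run, mapping an architecture string to its score) are assumptions, and the sample standard deviation is used here because the text does not say which variant was computed.

```python
from statistics import mean, stdev


def summarize_runs(runs):
    """Map each architecture to (mean, sample std dev) of its scores.

    runs -- list of dicts, one per run, mapping architecture -> score
    """
    scores = {}
    for run in runs:
        for arch, score in run.items():
            scores.setdefault(arch, []).append(score)
    # stdev() needs at least two observations per architecture
    return {
        arch: (mean(vals), stdev(vals))
        for arch, vals in scores.items()
        if len(vals) > 1
    }


def best_by_mean(summary):
    """Return (architecture, (mean, std dev)) with the highest mean."""
    return max(summary.items(), key=lambda item: item[1][0])
```

An architecture can win an individual run with an unusually high score and still have a lower mean than a steadier architecture, which is the pattern the 25-run results suggest.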

In conclusion, fastText did seem to outperform Word2vec, which was probably due to the sub-word information it was able to draw on, and using tf-idf weightings to make “important” words more prominent in the tweet vector seemed to positively influence F1 scores as well.

# Implications
By increasing the accuracy of this system, I have helped form a method to increase a computer’s ability to perform stance analysis of Twitter data. Stance analysis, like sentiment analysis, has business implications. If companies can understand the stance of a subset of the population that engages with their company, they can better understand whether they have a social license to start a new operation or continue a current operation.

Another area in which it could be used would be the political arena. Politicians could use stance analysis to see where people stand regarding the politician themselves or certain policies. This would allow politicians to better understand where a subset of their constituents stands on certain matters.

However, stance analysis can also be misused. If someone could mass-analyze text and identify what stance certain people took, then they could perhaps use it as a tool to control. For example, in a nation like China which is increasing restrictions on speech every day, a system like this could be used to weed out dissenters in a more efficient manner.

# Bibliography

Ahmed Besbes. "Sentiment analysis on Twitter using word2vec and keras". https://ahmedbesbes.com/sentiment-analysis-on-twitter-using-word2vec-and-keras.html. Accessed on 5/12/2019.

Dipanjan (DJ) Sarkar. "A hands-on intuitive approach to Deep Learning Methods for Text Data — Word2Vec, GloVe and FastText". https://towardsdatascience.com/understanding-feature-engineering-part-4-deep-learning-methods-for-text-data-96c44370bbfa. Accessed on 5/12/2019.

Roy Adams and Brent Ritzema. "Social License to Operate and Machine Learning". https://github.com/Calvin-CS/slo-classifiers/tree/feature/keras-nn/stance/keras-nn. 