## Relationships between words
### n-grams and correlations
Read: http://tidytextmining.com/ngrams.html

Exploring relationships and connections between words.
1. Tokenizing by n-gram, here by bigrams for a start
2. Counting and filtering n-grams: find the most common bigrams, then remove those in which at least one word is a stop word.
3. Analyzing bigrams: with one bigram per row, compute tf-idf and look at the bigrams with the highest values; these can be visualized per document (per book in the example).
4. Using them to provide context in sentiment analysis: the approach is to count word sequences such as "happy" and "not happy", examining how often sentiment-associated words are preceded by "not" or other negating words. The example uses the AFINN lexicon for sentiment analysis. Is there something similar for German? -> http://www.ulliwaltinger.de/sentiment/ (gives negative, positive and neutral words in TSV format; not sure whether the words have a "sentiment score")
5. Calculate sentiment score per comment
6. Visualize a network of bigrams: node1 = word1 -> node2 = word2 of the bigram, with the edge weight given by the number of occurrences of the bigram. For each word, only the words that most often follow it are shown as connected, directed nodes. Only show bigrams that occurred at least x times.  
A useful and flexible way to visualize relational data; the result is effectively a visualization of a Markov chain.
7. Put the whole thing built so far into a function for usage on other texts
8. We may be interested in words that tend to co-occur within particular documents or particular chapters, even if they don't occur next to each other. Turn text into a wide matrix first for that.  
Counting and correlating among sections: R has the `widyr` package for this (`pairwise_count`); is there a comparable package here?
9. The most common co-occurring words are not that meaningful, since they are also the most common individual words. We may instead want to examine the correlation among words, which indicates how often they appear together relative to how often they appear separately (the phi coefficient, which is equivalent to the Pearson correlation for binary data).  
Pick some interesting words and look at their correlations. Then plot a graph for highest correlations like above.
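
Steps 8 and 9 can be sketched with plain pandas. This is a minimal illustration, not the `widyr` approach itself: the toy documents are made up, and the phi coefficient is obtained by taking the Pearson correlation of binary "word appears in document" columns. A word that appears in every document has zero variance and would yield NaN correlations.

```python
import pandas as pd

# Hypothetical toy documents (e.g. comments or chapters), for illustration only
docs = [
    "merkel wahl cdu",
    "merkel wahl spd",
    "auto diesel",
    "auto diesel wahl",
]

# Wide binary matrix: one row per document, one column per word,
# 1 if the word occurs in the document, else 0
vocab = sorted({word for doc in docs for word in doc.split()})
X = pd.DataFrame(
    [[int(word in doc.split()) for word in vocab] for doc in docs],
    columns=vocab,
)

# Pairwise counts: number of documents in which both words occur
pair_counts = X.T @ X

# Phi coefficient: Pearson correlation computed on the binary columns
phi = X.corr()

print(pair_counts.loc["auto", "diesel"])
print(phi.loc["merkel"].drop("merkel").sort_values(ascending=False))
```

Sorting one row of `phi` corresponds to picking an interesting word and listing the words most correlated with it; the same matrix could feed a graph of the highest correlations.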
    
Pairs of consecutive words might capture structure that isn't present when one is just counting single words, and may provide context that makes tokens more understandable (for example "pulteney street" instead of "pulteney" only). However, the per-bigram counts are also sparser since they are rarer.
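
As a rough sketch of steps 1 and 2, bigrams can be built and counted with the standard library alone. The helper `bigrams`, the toy comments and the short stop-word set below are assumptions made for illustration; they are not the notebook's data.

```python
import re
from collections import Counter

def bigrams(text):
    """Lowercase the text, split it into word tokens, return consecutive pairs."""
    tokens = re.findall(r"\w+", text.lower())
    return list(zip(tokens, tokens[1:]))

# Hypothetical toy comments and stop words, for illustration only
comments = [
    "Das ist nicht gut",
    "Die Pulteney Street ist schön",
    "Pulteney Street ist nicht weit",
]
stop = {"das", "die", "ist"}

# Count bigrams, dropping every pair in which at least one word is a stop word
counts = Counter(
    pair
    for text in comments
    for pair in bigrams(text)
    if pair[0] not in stop and pair[1] not in stop
)

print(counts.most_common(3))
```

Pairs starting with a negating word such as "nicht" remain in `counts`, which is what the sentiment step relies on, so negating words should be kept out of the stop-word set for that use.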

In [10]:
# Initialize libraries and data
%matplotlib inline
import re
import numpy as np
import matplotlib.pyplot as plt
plt.style.use('ggplot')
import seaborn as sns

import warnings
warnings.filterwarnings('ignore')

import pandas as pd
df_art = pd.read_csv('articles_2017_08.csv')
df_com = pd.read_csv('comments_2017_08.csv').sample(5000) # random subsample to save battery; note that this skews the data
# Make floats more readable
pd.options.display.float_format = '{:.3f}'.format

# https://de.wikipedia.org/wiki/Liste_der_h%C3%A4ufigsten_W%C3%B6rter_der_deutschen_Sprache
stop_words = "die, der, und, in, zu, den, das, nicht, von, sie, ist, des, sich, mit, dem, dass, er, es, ein, ich, auf, so, eine, auch, als, an, nach, wie, im, für"
# Each continuation starts with ", " so the last and first words are not fused (e.g. "fürman")
stop_words += ", man, aber, aus, durch, wenn, nur, war, noch, werden, bei, hat, wir, was, wird, sein, einen, welche, sind, oder, zur, um, haben, einer, mir, über, ihm, diese, einem, ihr, uns"
#stop_words += ", da, zum, kann, doch, vor, dieser, mich, ihn, du, hatte, seine, mehr, am, denn, nun, unter, sehr, selbst, schon, hier"
#stop_words += ", bis, habe, ihre, dann, ihnen, seiner, alle, wieder, meine, Zeit, gegen, vom, ganz, einzelnen, wo, muss, ohne, eines, können, sei"
stop_words = stop_words.lower()
stop_words = stop_words.split(', ')


## Topic modeling
Read: http://tidytextmining.com/topicmodeling.html

Use LDA: each document is a mixture of topics and each topic is a mixture of words. Is there a built-in package? scikit-learn provides `LatentDirichletAllocation` in `sklearn.decomposition`.

It's possible to declare what a "document" is. Also look at how each document is classified (per topic). 
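
A minimal sketch of this idea with scikit-learn, assuming each comment is treated as one "document". The toy corpus, the choice of two topics and the fixed `random_state` are illustrative assumptions, not settings taken from the notebook.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus; each string is one "document"
docs = [
    "wahl merkel cdu wahl",
    "merkel cdu spd wahl",
    "diesel auto abgas diesel",
    "auto abgas diesel auto",
]

# LDA works on raw term counts, so use CountVectorizer rather than tf-idf
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(docs)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(counts)  # one row per document, one column per topic

# Terms ordered by their column index in the count matrix
terms = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
for k, weights in enumerate(lda.components_):
    top = [terms[i] for i in weights.argsort()[::-1][:3]]
    print(f"Topic {k}: {' '.join(top)}")

# Per-document classification: the topic with the largest share
dominant_topic = doc_topics.argmax(axis=1)
print(dominant_topic)
```

Each row of `doc_topics` is that document's topic mixture and sums to 1, while each row of `lda.components_` holds one topic's word weights.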

In [11]:
from time import time

from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import NMF, LatentDirichletAllocation
from sklearn.datasets import fetch_20newsgroups

n_samples = 2000
n_features = 1000
n_components = 10
n_top_words = 20

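def print_top_words(model, feature_names, n_top_words):
    """Print the n_top_words highest-weighted terms of each topic in a fitted model."""
    for topic_idx, topic in enumerate(model.components_):
        message = "Topic #%d: " % topic_idx
        # argsort is ascending, so the reversed slice yields the largest weights first
        message += " ".join([feature_names[i]
                             for i in topic.argsort()[:-n_top_words - 1:-1]])
        print(message)
    print()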

tfidf_vectorizer = TfidfVectorizer(max_df=0.95, min_df=2,
                                   max_features=n_features,
                                   stop_words=stop_words)
tfidf = tfidf_vectorizer.fit_transform(df_com['con'])


