# Natural Language Processing of NeIC Slack logs 

## First fetch the source

First clone with  
`git clone https://github.com/coderefinery/ahm18-nlp-slack.git`  
or  
`git clone git@github.com:coderefinery/ahm18-nlp-slack.git`  
(or fork first and then clone!)

and then  
```
cd ahm18-nlp-slack
jupyter-notebook
```

You will also need the Slack logs! These can be viewed at [https://wiki.neic.no/chat/](https://wiki.neic.no/chat/), but web scraping them is difficult.

Instead, go to this [Google Drive link](https://drive.google.com/open?id=1BVioZ9t15c1Ek7Xq067stN0ED28eXOsJ) and save the zip file to the current directory (thanks to Joel).



## Some background on Jupyter Notebooks

### History of Jupyter
  - In 1991, Guido van Rossum published Python, which started to gain popularity
  - In 2001, Fernando Pérez started programming a fancier shell for Python called IPython
  - In 2014, Fernando Pérez announced a spin-off project from IPython called Project Jupyter. IPython will continue to exist as a Python shell and a kernel for Jupyter, while the notebook and other language-agnostic parts of IPython will move under the Jupyter name
 

### Why "Jupyter"?
 - Julia + Python + R
 - Jupyter is actually language-agnostic, and Jupyter kernels exist for dozens of programming languages
 - Galileo's 1610 pamphlet *Sidereus Nuncius*, reporting his observations of Jupiter's moons, is laid out like a notebook, with illustrations, text, calculations, titles, data points, images, reasoning... One of the first notebooks!  
<img src="http://media.gettyimages.com/photos/pages-from-sidereus-nuncius-magna-by-galileo-galilei-a-book-of-and-picture-id90732970" width="500">

  

### Use cases
- Experimenting with new ideas, testing new libraries/databases 
- Interactive code and visualization development
- Sharing and explaining code to colleagues
- Learning from other notebooks
- Interactive data analysis
- Many cloud platforms offer access to Jupyter Notebooks 
- Keeping track of interactive sessions, like a digital lab notebook
- Supplementary information with published articles
- Teaching (programming, experimental/theoretical science)
- Presentations

### Cells

- **Code cells** contain code to be interpreted by the *kernel* (Python, R, Julia, Octave/Matlab...)
- **Markdown cells** contain formatted text written in Markdown 
![Components](img/notebook_components.png)

### Markdown cells

This cell contains simple [Markdown](https://daringfireball.net/projects/markdown/syntax), a lightweight markup language for writing text that can be automatically converted to other formats, e.g. HTML or LaTeX.

**Bold**, *italics*, **_combined_**, ~~strikethrough~~, `inline code`.

* bullet points

or

1. numbered
3. lists

**Equations:**   
inline $e^{i\pi} + 1 = 0$
or on new line  
$$e^{i\pi} + 1 = 0$$

Images ![CodeRefinery Logo](https://pbs.twimg.com/profile_images/875283559052980224/tQLhMsZC_400x400.jpg)

Links:  
[One of many markdown cheat-sheets](https://github.com/adam-p/markdown-here/wiki/Markdown-Cheatsheet#emphasis)


### Code cells

In [None]:
# a code cell contains code statements.
# when you run this cell, the code is sent
# from the web page to a back-end process (the kernel),
# executed, and the results are displayed to you
print("hello world")

The order of execution is important

In [None]:
x=1

In [None]:
x

In [None]:
x+=1

### Useful keyboard shortcuts 

Some shortcuts only work in Command or Edit mode.

* `Enter` key to enter Edit mode (`Escape` to enter Command mode)
* `Ctrl`-`Enter`: run the cell
* `Shift`-`Enter`: run the cell and select the cell below
* `Alt`-`Enter`: run the cell and insert a new cell below
* `Ctrl`-`s`: save the notebook 
* `Tab` key for code completion or indentation (Edit mode)
* `m` and `y` to toggle between Markdown and Code cells (Command mode)
* `d-d` to delete a cell (Command mode)
* `z` to undo deleting (Command mode)
* `a/b` to insert cells above/below current cell (Command mode)
* `x/c/v` to cut/copy/paste cells (Command mode)
* `Up/Down` or `k/j` to select previous/next cells (Command mode)
* `h` for help menu for keyboard shortcuts (Command mode)
* Append `?` for help on commands/methods, `??` to show source (Edit mode) 

> **Exercise**: Spend a few minutes playing around in the notebook. Add cells, toggle between Markdown and Code, execute some code, write some markdown. Try the keyboard shortcuts

### Links and further reading
 - http://nbviewer.jupyter.org/
 - https://github.com/jupyter/jupyter/wiki/A-gallery-of-interesting-Jupyter-Notebooks
 - http://mybinder.org/
 - https://jupyterhub.readthedocs.io/en/latest/
 - http://ipython-books.github.io/minibook/
 - http://ipython-books.github.io/cookbook/
 - https://www.oreilly.com/ideas/the-state-of-jupyter

## Analyzing Slack logs

Let's get down to business.  
Hopefully everyone has the following packages installed:
- `numpy`, `pandas`, `matplotlib`, `seaborn`, `jupyter`, `nltk`, `textmining`, `lda`, `emoji`

The modules `json`, `re` and `datetime` are also used, but they are part of Python's standard library.

### Import packages

In [None]:
from __future__ import division  # __future__ imports must come first in the cell
import os
import sys
print(sys.version)
#sys.setdefaultencoding('utf8')

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [None]:
import re
import json
import emoji
import lda
import textmining
import datetime
import nltk 

### Extracting the logs

Below we have a code cell, but it's not Python! Have a look at the first line. `%%bash` is a *cell magic* which tells Jupyter to interpret the contents of the cell as bash commands

In [None]:
%%bash
mkdir slack_logs
cd slack_logs
mv ../NeIC_Slack_export_Dec10_2017.zip .
unzip NeIC_Slack_export_Dec10_2017.zip
cd ..

### Inspect directory structure and file format

In the code cell below we have a *line magic* `%sx` (with one `%` sign), which runs a shell command (using `commands.getoutput()`) and captures the output as a list of lines.  

If you want to see what magic commands are available, type `%lsmagic` in a code cell. Magics depend on what kernel is used, and new magics can also be installed (and created!).

In [None]:
dirs = %sx ls -d slack_logs/*/
for n,i in enumerate(dirs):
    print(n,i)

> **Exercise:** Add one or more code cells below. Use either a `%%bash` cell magic or `%sx` line magics to list the contents of the `_aa` directory under `slack_logs`, `cat` the contents of the json file, and then remove the `_aa` directory.

#### Let's first explore one channel

In [None]:
# note how a scrollable window appears in the output area of this cell
dir = dirs[13] #coderefinery
os.listdir(dir)

Look at the structure of the json files

In [None]:
dates = os.listdir(dir)
d = dates[0] # pick the first date

# read in contents of json file
with open(dir+d,"r") as f:
    raw_json = json.loads(f.read())

# dump json
dump = json.dumps(raw_json,indent=4)
print(dump)

Aha, `subtype` is only present if the entry is not a regular message. We also see that each message has a Unix epoch timestamp.
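
As a small illustration of that observation, here is a sketch that keeps only the regular messages from a list of records. The records and the `sample_day` name are made up for this example (only the `text`, `ts` and `subtype` keys mirror what we saw in the dump), so treat it as an outline rather than the actual export contents.

```python
# Hypothetical records mimicking the structure of one exported Slack JSON file
sample_day = [
    {"user": "U01", "text": "Hello everyone", "ts": "1512900000.000010"},
    {"user": "U02", "text": "<@U02> has joined the channel",
     "subtype": "channel_join", "ts": "1512900060.000020"},
    {"user": "U01", "text": "Anyone up for lunch?", "ts": "1512900120.000030"},
]

# Regular messages are the records that carry no "subtype" key
regular = [msg for msg in sample_day if "subtype" not in msg]
texts = [msg["text"] for msg in regular]
print(texts)
```

The join notification is dropped because it carries a `subtype`, while the two ordinary messages remain.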

### Extracting messages

We now extract all regular messages in one channel

In [None]:
dates = os.listdir(dir) # this is still the coderefinery channel
messages = []

for d in dates: 
    with open(dir+d,"r") as f:
        raw_json = json.loads(f.read())

    for j in raw_json:
        if "subtype" not in j: # exclude non-message messages
            messages.append(j["text"])


Add all words in all messages to one list

In [None]:
words = []
for m in messages:
    words.extend(w.lower() for w in m.split())
    
words[-20:]

#### Let's do this for all the Slack channels

In [None]:
# list with all channel names
all_channels = [d.replace("slack_logs/","").replace("/","") for d in dirs]

# dictionary with channel names as keys
words_in_channels = dict.fromkeys(all_channels)

In [None]:
# function to join messages into one long list of lowercase words
def join_messages(messages):
    words = []
    for m in messages:
        words.extend(w.lower() for w in m.split())
    return words


We now join all words in all channels. Is this time-consuming? Let's time it with the `%%timeit` cell magic!

In [None]:
%%timeit -n 1 -r 1
# join messages in all channels into elements of words_in_channels dict
for channel in all_channels:
    dates = os.listdir("slack_logs/"+channel)
    messages = []
    for d in dates: 
        with open("slack_logs/"+channel+"/"+d,"r") as f:
            raw_json = json.loads(f.read())

        for j in raw_json:
            if "subtype" not in j: # exclude non-message messages
                messages.append(j["text"])
    words_in_channels[channel] = join_messages(messages)
    print("channel {} has {} words".format(channel,len(words_in_channels[channel])))

Remove empty channels

In [None]:
words_in_channels = { k:v for k,v in words_in_channels.items() if len(v)!=0 }

Plot number of words in channels using Seaborn barplot

In [None]:
plt.rcParams["figure.figsize"] = [12,9]
x = list(words_in_channels.keys()) # channel names, as a list so seaborn can index them
y = [len(words_in_channels[i]) for i in x]
ax = sns.barplot(x=y, y=x); # word counts on the horizontal axis

ax.set_xlim([0,200000]);

From now on, let's focus on the largest channels (and include `ahm-planning` for good measure)

In [None]:
channels = ["tryggve","general","xt","web","random","arc-debugging","ndgf","coderefinery", "ahm-planning"]

### Simple natural language processing

Natural language toolkit tests

In [None]:
# tab completion can be used to see available methods of a module
#nltk.

#### Frequency distribution of words

In [None]:
# we can view docstrings for functions in python modules: 
nltk.FreqDist?

In [None]:
# and the source code itself can be viewed by using two question marks!
nltk.FreqDist??

Find frequency distribution of words, but excluding stopwords!

In [None]:
from nltk.corpus import stopwords
# needs the stopwords corpus: run nltk.download('stopwords') once if it is missing

most_common_words = dict.fromkeys(channels,0)
dists = dict.fromkeys(channels,0)
stop = set(stopwords.words('english')) # a set makes the membership test fast
for channel in channels:
    words = words_in_channels[channel]
    words = [token for token in words if token not in stop]
    dist = nltk.FreqDist(words)
    dists[channel] = dist
    most_common_words[channel] = dist.most_common(20)
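
To see what the stop word filter and the frequency count do on a small scale, here is a sketch using only the standard library. `collections.Counter` stands in for `nltk.FreqDist`, which offers a similar `most_common` method; the stop list and the sentence are invented for the example.

```python
from collections import Counter

# Tiny made-up stop list and token list, for illustration only
stop_words = {"the", "is", "a", "to", "and"}
tokens = "the cluster is down and the cluster queue is full".split()

# Drop stop words, then count what is left
content_tokens = [t for t in tokens if t not in stop_words]
freq = Counter(content_tokens)
print(freq.most_common(1))
```

Without the filter, `the` and `is` would tie with `cluster` at the top of the list, which is why stop words are removed before counting.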

#### Pandas dataframes

We create a dataframe to work with:

In [None]:
# we can feed a dictionary to the `DataFrame` constructor
df_words = pd.DataFrame(data=most_common_words)
df_words.head(30)

In [None]:
# we can get a list of unique common words with a set()
common_words = set()
for index, row in df_words.iterrows():
    for r in row:
        common_words.add(r[0])
for i in common_words:
    print(i)

### Lexical diversity (type-token ratio)

Let's look at lexical diversity, i.e. the ratio of the number of distinct words to the total number of words

In [None]:
def lexical_diversity(text):
    return len(set(text)) / len(text)

In [None]:
for channel in channels: # loop over the largest channels
    words = words_in_channels[channel]
    lex_div = lexical_diversity(words)
    print("Lexical diversity in %s is %f"%(channel,lex_div))

Linguistic richness is clearly greatest in `ahm-planning`, closely followed by `random` and `general`!
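
For intuition, here is the same ratio on two made-up token lists. The function is repeated so the snippet runs on its own; the example sentences are not from the logs.

```python
def lexical_diversity(tokens):
    """Ratio of distinct tokens to total tokens (type-token ratio)."""
    return len(set(tokens)) / len(tokens)

repetitive = "yes yes yes yes no".split()
varied = "we deploy the new release tomorrow".split()

print(lexical_diversity(repetitive))  # 2 distinct words out of 5
print(lexical_diversity(varied))      # every word is distinct
```

Keep in mind that the type-token ratio tends to drop as a text gets longer, so channels of very different sizes are not strictly comparable.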

### Collocations, contexts and similar words

In [None]:
nltk.Text.collocations?

**Collocations (sequences of words that co-occur more often than expected by chance)**

In [None]:
for channel in channels:
    words = words_in_channels[channel]
    all_words = " ".join(words)
    tokens = nltk.word_tokenize(all_words)
    text = nltk.Text(tokens)
    print(channel)
    print("------------")
    text.collocations()
    print("------------------------------------------------")
    print("")

tyttebär hangout??
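
To get a feeling for what "more often than expected by chance" means, here is a rough sketch that ranks adjacent word pairs by pointwise mutual information (PMI). This is a simplified stand-in chosen for illustration, not necessarily the scoring that `nltk.Text.collocations` uses internally, and the token list is invented.

```python
import math
from collections import Counter

tokens = ("the grid job failed again . the grid job restarted . "
          "the user asked about the grid job").split()

unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))
n = len(tokens)

def pmi(pair):
    """Pointwise mutual information of an adjacent word pair."""
    w1, w2 = pair
    p_pair = bigrams[pair] / (n - 1)
    return math.log2(p_pair / ((unigrams[w1] / n) * (unigrams[w2] / n)))

# Keep pairs seen at least twice, ranked by PMI
candidates = [pair for pair, count in bigrams.items() if count >= 2]
ranked = sorted(candidates, key=pmi, reverse=True)
print(ranked[0])
```

The pair `grid job` comes out on top because the two words almost always appear together, whereas `the` also occurs in front of other words.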

**Similar contexts**

In [None]:
nltk.Text.similar?

In [None]:
for channel in channels:
    words = words_in_channels[channel]
    all_words = " ".join(words)
    tokens = nltk.word_tokenize(all_words)
    text = nltk.Text(tokens)
    print(channel)
    print("------------")
    text.similar("good")
    print("------------------------------------------------")
    print("")


**Searching for words**

> **Exercise:** Another method of `nltk.Text` is `concordance`, which finds all matches in a body of text for a given word or phrase.  
1. Have a look at the docstring for `nltk.Text.concordance` to see how it's used.
2. Copy-paste either the `collocations` or the `similar` cell from above using the keyboard shortcuts `c` and `v` (Command mode).
3. Edit the cell to use the `concordance` method, with a word or phrase of your choice.
4. Share if you find something interesting!

## "Sentiment analysis": emojis!

[Sentiment analysis (a.k.a. opinion mining or emotion AI)](https://en.wikipedia.org/wiki/Sentiment_analysis) refers to the use of NLP, text analysis, computational linguistics, and biometrics to systematically identify, extract, quantify, and study affective states and subjective information.   
Used on reviews, survey responses, online and social media, healthcare materials...

We will abuse this concept and analyze sentiments in NeIC Slack channels through emoji usage.  
Emojis in Slack logs are expressed like `:slightly_smiling_face:` 

Let's find all the emojis first by joining all words from all channels and regex-ing it, and then find the most frequent ones

In [None]:
# join words from all channels into one list, and then join into one long "sentence"
words = [j for i in words_in_channels.values() for j in i] 
all_words = " ".join(words)

emojis = re.findall(r":[a-zA-Z_]+:",all_words) # this filters out strings like :43:

dist = nltk.FreqDist(emojis)
dist.most_common(30)
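
Here is the same regular expression applied to a short made-up string, to show what it picks up and what it skips. `Counter` is used in place of `nltk.FreqDist` so the snippet only needs the standard library.

```python
import re
from collections import Counter

# Made-up message text; the time stamp contains colons but no emoji name
text = "nice work :thumbsup: see you at 10:43:00 :wink: :thumbsup:"

found = re.findall(r":[a-zA-Z_]+:", text)
counts = Counter(found)
print(found)
print(counts.most_common(1))
```

Because the pattern only allows letters and underscores between the colons, `:43:` is ignored. The same restriction means that emoji names containing digits or symbols, such as `:+1:`, are not counted either.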


Let's investigate a few key emojis

In [None]:
common_emojis = [u":disappointed:",u":wink:",u":slightly_smiling_face:",u":simple_smile:",
              u":thumbsup:",u":clap:",u":stuck_out_tongue:",u":smile:",u":grinning:"]

for i in common_emojis:
    # note: emoji >= 2.0 replaced use_aliases=True with language='alias'
    print(emoji.emojize('NeIC is %s'%i, use_aliases=True)) 
    print(i)
    print("------------------")

Hmm, the `emoji` package doesn't understand `:simple_smile:`.

Let's count how often emojis are used in the different channels

In [None]:
channel_emojis = dict.fromkeys(channels,0)
 
for channel in channels:
    words = words_in_channels[channel]
    count = [words.count(i) for i in common_emojis]
    channel_emojis[channel] = count

channel_emojis

Put the dict into a DataFrame for easy manipulation

In [None]:
df_emojis = pd.DataFrame(data=channel_emojis)
df_emojis.head(10)

Let's use real emojis as indices in the dataframe

In [None]:
emojis=[]
for i in common_emojis:
    x = emoji.emojize(i, use_aliases=True) 
    emojis.append(x)
df_emojis["emojis"] = emojis
df_emojis.set_index('emojis',inplace=True, drop=True)
df_emojis.head(10)

That doesn't look good with the unsupported emoji. Let's add the `simple_smile` counts to `slightly_smiling_face` (we in the Nordics don't have such fine-grained positive emotions anyway)

In [None]:
row_keep = df_emojis.index[2]
row_delete = df_emojis.index[3]

In [None]:
# add row_delete to row_keep, and delete row_delete
df_emojis.loc[row_keep] += df_emojis.loc[row_delete]
df_emojis.drop([row_delete], inplace=True)

df_emojis


Normalize to total number of selected emojis in each channel to get final result

In [None]:
df_tmp = 100*df_emojis/df_emojis.sum()
df_tmp.round(1)
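
`df_emojis.sum()` adds up each column, i.e. each channel, so the division rescales every channel to sum to 100. The same arithmetic in plain Python, with invented counts and hypothetical channel names, looks roughly like this:

```python
# Hypothetical emoji counts per channel (not taken from the real logs)
counts = {
    "channel_a": {":wink:": 3, ":smile:": 6, ":clap:": 1},
    "channel_b": {":wink:": 10, ":smile:": 30, ":clap:": 10},
}

percentages = {}
for channel, emoji_counts in counts.items():
    total = sum(emoji_counts.values())
    percentages[channel] = {
        name: round(100 * n / total, 1) for name, n in emoji_counts.items()
    }

print(percentages["channel_a"])
print(percentages["channel_b"])
```

After normalization both channels show the same share of smiles, even though one of them uses five times as many emojis in total.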

### Conclusions

- NeIC people express quite positive emotions overall 
- There's not a lot of clapping and thumbs-up-giving, except for `XT` people who do both, and `Tryggve` people enjoy clapping 
- On the other hand, `XT`-ers don't smile as much as other channels, but they do laugh a bit
- People in `web` stick out their tongue more than average
- `arc-debugging` and `ahm-planning` folks are rather limited in their emotional repertoire  
- The most ambiguous communication takes place on `random` and `general`, as evidenced by the high proportion of winking
- `NDGF`-ers are the most disappointed channel. Anything we can do to help, guys? 😉 

### Topic modeling with latent Dirichlet allocation (LDA) 

Topic models are statistical models for discovering topics that occur in a set of documents. For a visual representation of what goes on in topic modelling, [see this animation](https://upload.wikimedia.org/wikipedia/commons/7/70/Topic_model_scheme.webm).  

[LDA](https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation) is probably the most commonly used topic model today and was introduced by [David Blei, Andrew Ng and Michael I. Jordan](http://www.jmlr.org/papers/v3/blei03a.html).

We need to do some preprocessing. First join all words for each channel

In [None]:
joined_words_in_channels = dict.fromkeys(channels)
for i in channels:
    words = words_in_channels[i]
    joined_words_in_channels[i] = " ".join(words)


`textmining` is a useful package to create the [term-document matrix](https://en.wikipedia.org/wiki/Document-term_matrix)

In [None]:
tdm = textmining.TermDocumentMatrix()
# add documents to the term-document matrix (each channel is a document)
for channel in channels:
    tdm.add_doc(joined_words_in_channels[channel])

# write term document matrix to csv file
tdm.write_csv('matrix.csv', cutoff=1)
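
If the term-document matrix feels abstract, this sketch builds one by hand for two toy documents. It only mirrors the idea (one row per document, one column per term); the document names and texts are assumptions, and `textmining` may order or filter terms differently.

```python
from collections import Counter

# Two toy "documents" standing in for the joined channel texts
docs = {
    "channel_a": "grid job failed grid",
    "channel_b": "meeting agenda grid",
}

counts = {name: Counter(text.split()) for name, text in docs.items()}
vocab = sorted(set().union(*counts.values()))

# One row per document, one column per vocabulary term
matrix = [[counts[name][term] for term in vocab] for name in docs]

print(vocab)
for name, row in zip(docs, matrix):
    print(name, row)
```

Each entry is simply the number of times a term occurs in a document, which is the kind of count matrix the LDA fit takes as input.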


The LDA fit needs a numpy array as input

In [None]:
rows = list(tdm.rows(cutoff=1)) # build the rows only once
vocab = rows[0] # first row holds the terms, needed later
titles = channels
X = np.array(rows[1:])

X.shape

We now define the model, and run the LDA fit

In [None]:
n_topics = 20
model = lda.LDA(n_topics=n_topics, n_iter=1500, random_state=1)
model.fit(X)  # model.fit_transform(X) is also available


From the fitted model we can look at the topic-word distributions. Let's look at the top 20 words for each topic by probability

In [None]:
topic_word = model.topic_word_
topic_word.shape

In [None]:
n_top_words = 20
for i, topic_dist in enumerate(topic_word):
    topic_words = np.array(vocab)[np.argsort(topic_dist)][:-(n_top_words+1):-1]
    print('*Topic {}\n- {}'.format(i, ' '.join(topic_words)))
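
The slice `[:-(n_top_words+1):-1]` is a bit cryptic, so here is what it does on a small made-up example. Plain lists are used instead of numpy arrays, with `sorted` playing the role of `np.argsort`; the vocabulary and probabilities are invented.

```python
# Made-up word probabilities for a single topic
vocab = ["cluster", "lunch", "grid", "meeting", "job"]
topic_dist = [0.30, 0.05, 0.40, 0.10, 0.15]
n_top_words = 3

# Indices ordered from lowest to highest probability (what np.argsort returns)
order = sorted(range(len(topic_dist)), key=topic_dist.__getitem__)

# Walk backwards from the end and stop after n_top_words entries
top_indices = order[:-(n_top_words + 1):-1]
top_words = [vocab[i] for i in top_indices]
print(top_words)
```

In other words, the slice reverses the ascending order and keeps the first `n_top_words` entries, giving the most probable words first.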

The other information we get from the model is document-topic probabilities

In [None]:
doc_topic = model.doc_topic_
doc_topic.shape

In [None]:
for n in range(len(channels)):
    topic_most_pr = doc_topic[n].argmax()
    print("doc: {} topic: {}\n{}...".format(n,topic_most_pr,titles[n][:50]))

Ok, that wasn't very interesting. The top topics mostly consist of common function words (stop words).

Let's plot the topic probabilities for each channel instead, and exclude topics 0 and 19

In [None]:
f, ax= plt.subplots(len(channels), 1, figsize=(8, 12))
for k in range(len(channels)):
    ax[k].stem(doc_topic[k,:], linefmt='r-',
               markerfmt='ro', basefmt='w-')
    ax[k].set_xlim(0.5, 18.5)
    ax[k].set_xticks(range(1,19))
    ax[k].set_ylim(0, .3)
    ax[k].set_ylabel("Prob")
    ax[k].set_title("{}".format(channels[k]))

ax[-1].set_xlabel("Topic")

plt.tight_layout()
plt.show()

For reference, let's also make an HTML table

In [None]:
table=[[] for i in range(len(topic_word))]
for i, topic_dist in enumerate(topic_word):
    table[i].append(str(i))
    topic_words = np.array(vocab)[np.argsort(topic_dist)][:-(n_top_words+1):-1]
    table[i].append(' '.join(topic_words))
from IPython.display import HTML, display
display(HTML('<table><tr>{}</tr></table>'.format('</tr><tr>'.join(
            '<td>{}</td>'.format('</td><td>'.join(str(_) for _ in row)) for row in table))))

### Conclusions

- Clearly, different topics are being discussed in each channel

> **Exercise:** Redo the above LDA analysis *removing stop words first*

### Extra: time analysis

#### When are messages posted?

In [None]:
unixtime_in_channels = dict.fromkeys(channels)

Join timestamps in all channels

In [None]:
for channel in channels:
    dates = os.listdir("slack_logs/"+channel)
    unixtime = []
    for d in dates: 
        with open("slack_logs/"+channel+"/"+d,"r") as f:
            raw_json = json.loads(f.read())

        for j in raw_json:
            if "subtype" not in j: # exclude non-message messages
                unixtime.append(j["ts"])
    unixtime_in_channels[channel] = unixtime

We now convert to readable timestamps (and use a list this time, for no particular reason)

In [None]:
all_timestamps = []
for channel in channels:
    unixtime = unixtime_in_channels[channel]
    timestamps = [datetime.datetime.fromtimestamp(int(float(i))).strftime('%Y-%m-%d %H:%M:%S')
                  for i in unixtime]
    all_timestamps.append(timestamps)
    # os.listdir returns files in arbitrary order, so take the earliest timestamp
    print("first message in {} was posted on {}".format(channel,min(timestamps)))
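
The conversion of a single value can be checked in isolation. The `ts` string below is a made-up example in the same format as the logs, and UTC is passed explicitly (an assumption for this sketch) so that the result does not depend on where the notebook is run; without a `tz` argument, `fromtimestamp` uses the local time zone.

```python
from datetime import datetime, timezone

# Made-up "ts" value: epoch seconds stored as a string with a fractional part
ts = "1512900000.000010"

# UTC is used here so the result does not depend on the local time zone
posted = datetime.fromtimestamp(int(float(ts)), tz=timezone.utc)
readable = posted.strftime("%Y-%m-%d %H:%M:%S")
print(readable)
```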

Let's convert to a pandas DataFrame, since they're nice to work with

In [None]:
df_datetimes = pd.DataFrame(data=all_timestamps)
df_datetimes = df_datetimes.transpose()
# set column labels to channel names
df_datetimes.columns = channels
df_datetimes.head()

Finally, use the `groupby` method to group the dataframe by hour, and then count to get a histogram

In [None]:
df_hours = pd.DataFrame()
for channel in channels:
    # convert the strings to datetime64 (missing values become NaT)
    df_datetimes[channel] = pd.to_datetime(df_datetimes[channel])
    # groupby and count()
    df_hours[channel] = df_datetimes[channel].groupby(df_datetimes[channel].dt.hour).count()

# normalize    
df_hours = df_hours / df_hours.sum()
# set name of index column
df_hours.rename_axis("Hour",inplace=True)
df_hours.head()

In [None]:
df_hours.plot(kind="bar",subplots=True,figsize=(10,18),ylim=(0,0.25));
plt.tight_layout()
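
The hourly histogram can also be sketched without pandas, which may help to see what the `groupby` and the normalization compute. The `ts` values are invented, and UTC is again an assumption made so that the hours are reproducible.

```python
from collections import Counter
from datetime import datetime, timezone

# Hypothetical "ts" values; UTC is an assumption made for reproducibility
unixtime = ["1512900000.000010", "1512901800.000020", "1512907200.000030",
            "1512936000.000040"]

hours = [datetime.fromtimestamp(int(float(ts)), tz=timezone.utc).hour
         for ts in unixtime]
per_hour = Counter(hours)

# Fraction of messages posted in each hour of the day
fractions = {h: per_hour[h] / len(hours) for h in sorted(per_hour)}
print(fractions)
```

Dividing by the total number of messages is what makes channels of different sizes comparable in the bar plots.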

> **Exercise:** Do the same analysis but splitting into weekdays!   
> (hint: format specifier %A gives the weekday)