# Discovering Channel Keywords

There is a lot of content in Slack. As mentioned in Part 1, there are more than 10,000 messages across 62 channels. People talk about a myriad of topics using different forms of expression.

We need an effective way to summarize the content in each channel. One way to do this is by extracting keywords that are pertinent to each channel. This clears up a lot of the clutter: imagine getting about 10 phrases that summarize the content of each channel. The simplest approach is to treat each channel as a 'document' and compare naive word counts across these documents. This is problematic, however, for a number of reasons:

* There is no context in naive counts of words
* Each word is considered equally important to a document, even though some words may be more pertinent
* Longer documents will contain more words and appear more pertinent
* It captures no interdependence between words
* There are many words that provide little meaning but add to the feature space, creating the [curse of dimensionality](https://en.wikipedia.org/wiki/Curse_of_dimensionality)

We will address some of these issues using preprocessing and Term Frequency-Inverse Document Frequency ([Tf-Idf](https://en.wikipedia.org/wiki/Tf%E2%80%93idf)). Tf-Idf gives a word more weight the more frequently it occurs in a document, and less weight the more common it is across other documents. It thus surfaces words that are relevant to a specific document and not found much elsewhere.
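
To make the weighting concrete, here is a minimal sketch using scikit-learn's `TfidfVectorizer` on three made-up channel 'documents' (the channel texts are invented for illustration and are not from the Slack data). With scikit-learn's default settings, a word that appears in every document, such as 'let', should end up with a lower weight than a word that appears in only one, such as 'beer', even though both occur once in the *social* text.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy data: one string per channel, invented for illustration.
channel_names = ["social", "recreation", "python"]
channel_texts = [
    "beer garden tonight let know",
    "soccer sunday let know",
    "pip install let know",
]

# Default settings: single words, smoothed idf, L2-normalized rows.
toy_vectorizer = TfidfVectorizer()
weights = toy_vectorizer.fit_transform(channel_texts)

# vocabulary_ maps each word to its column index in the weight matrix.
vocab = toy_vectorizer.vocabulary_
social_row = weights[channel_names.index("social")].toarray().ravel()

beer_weight = social_row[vocab["beer"]]  # appears only in 'social'
let_weight = social_row[vocab["let"]]    # appears in every channel

print("beer: {:.3f}  let: {:.3f}".format(beer_weight, let_weight))
```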

### Table of Contents ###
> Part 1: [Analyzing Trends by Time](Part 1 - Analyzing Trends by Time.html)

> **Part 2: [Discovering Channel Keywords]**

> Part 3: [Channel Clustering](Part 3 - Channel Clustering.html)

> Part 4: [Sentiment Analysis](Part 4 - Sentiment Analysis.html)
    

In [71]:
import pandas as pd
data = pd.read_csv("data/concatenated.csv")

In [99]:
data.sample(3)

| Unnamed: 0 | channel | datetime | id | text | reactions | reaction_counts | reaction_total_count | reaction_type_count |
|---|---|---|---|---|---|---|---|---|
| 8258 | sascoding | 2016-08-11 19:17:40.000268 | U1KR9BP0E | <@U1KQYG6SJ>: I will support your continued le... | [u'joy'] | [2] | 2.0 | 1.0 |
| 8905 | social | 2016-07-06 18:47:34.000013 | U1L6WB44A | Dying to see The Secret Life of Pets this Frid... | [u'+1'] | [5] | 5.0 | 1.0 |
| 2432 | general | 2016-08-28 18:34:53.000034 | U1KQYG6SJ | interesting that just last year, they had addr... | | | | |


We will pass through the *text* column of the data and compute Tf-Idf scores of two-word phrases in each channel. As preprocessing, we first remove filler words, called *stop words*, such as 'and' and 'to'. The vectorizer then keeps only the 200 most frequent phrases in each channel as its vocabulary.

In [107]:
from sklearn.feature_extraction.text import TfidfVectorizer

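def create_vectorizer():
    # Score two-word phrases (bigrams) after dropping English stop words;
    # max_features keeps only the 200 most frequent phrases.
    return TfidfVectorizer(input='content', max_features=200, stop_words='english',
                           encoding='utf-8', ngram_range=(2, 2))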
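
As a rough illustration of what these settings do to a single message, the sketch below uses the vectorizer's `build_analyzer()` method on a made-up sentence (the sentence is an assumption for illustration, not taken from the data). Stop words are dropped before the two-word phrases are formed, which would explain why a phrase such as 'let know' can show up in the results.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Same preprocessing options as create_vectorizer(), minus max_features.
sketch_vectorizer = TfidfVectorizer(stop_words="english", ngram_range=(2, 2))

# build_analyzer() returns the function that turns raw text into features.
analyze = sketch_vectorizer.build_analyzer()

# Hypothetical message, invented for illustration.
bigrams = analyze("I will let you know about the beer garden")

for phrase in bigrams:
    print(phrase)
```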

Below is code for displaying the top 10 key phrases output by tf-idf.

In [141]:
import numpy as np

# number of top phrases to display
top_n = 10

def display_scores(vectorizer, tfidf_result, top_n):
    # Sum each phrase's Tf-Idf score over all messages in the channel.
    # Older scikit-learn releases use get_feature_names() instead.
    scores = zip(vectorizer.get_feature_names_out(),
                 np.asarray(tfidf_result.sum(axis=0)).ravel())

    sorted_scores = sorted(scores, key=lambda x: x[1], reverse=True)

    for item in sorted_scores[:top_n]:
        print("{0:20} Score: {1}".format(item[0], item[1]))

vectorizer = create_vectorizer()

def display_channel_top(channel, top_n=top_n):
    # Each message in the channel is one document; the vectorizer is refit per channel.
    text = data[data['channel'] == channel]['text']
    tfidf_result = vectorizer.fit_transform(text.values.astype('U'))
    print("Channel:  " + channel + "\n")

    display_scores(vectorizer, tfidf_result, top_n)
    print("")

Now, 62 outputs would be overwhelming. In an era of information overload, you don't need more information thrown at you. I will only display results for a select few channels.

In [143]:
display_channel_top('social')
display_channel_top('recreation')

Channel:  social

beer garden          Score: 22.7900132132
let know             Score: 19.5876367634
http www             Score: 12.7183706158
going tonight        Score: 11.635329113
wants join           Score: 11.5523027368
https www            Score: 8.14295801667
just got             Score: 7.52015004717
tomorrow night       Score: 6.59385732977
rum runners          Score: 6.33199893985
sounds good          Score: 6.30284299504

Channel:  recreation

miller fields        Score: 8.5083861035
sunday miller        Score: 4.78438518916
soccer sunday        Score: 4.10924901114
fields 5pm           Score: 3.77685479516
let know             Score: 3.32321840368
armory fields        Score: 3.16872797977
play sunday          Score: 2.6602341385
fields closed        Score: 2.57576408461
looks like           Score: 2.55836785092
make tomorrow        Score: 2.49293243197



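'Beer Garden' is a favorite of our MSA class. More importantly, it is a place mentioned frequently in the *social* channel, but not so much elsewhere. Note that because the vectorizer is refit on each channel's messages, the score itself reflects how often the phrase recurs across messages within the channel rather than a comparison against other channels. It surfaces as the top keyword for *social* either way. We can also surmise that people ask 'who is going tonight and wants to join?', followed by 'I'll let you know' and 'sounds good.' You can almost hear a typical conversation taking place in this channel.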

By comparison, in the *recreation* channel the most popular topic is playing soccer on Sunday at 5pm at Miller fields. Another location is Armory fields, but given its much lower score, Miller fields is clearly where people primarily play. Armory fields presumably only gets used when Miller fields is closed. An interesting question I could tackle is how to distinguish between the use of Miller for Miller fields and for Miller Lite. Now I got ideas poppin'..

In [144]:
display_channel_top('linearalgebra')
display_channel_top('logisticregression')

Channel:  linearalgebra

factor analysis      Score: 1.91259225706
principal components Score: 1.87123950889
correlation covariance Score: 1.70710678119
dr race              Score: 1.56956155258
worksheet quiz       Score: 1.46632668397
grades don           Score: 1.39226311039
linear algebra       Score: 1.34927756873
total variance       Score: 1.31380841384
does know            Score: 1.26094372813
linearly independent Score: 1.22445651981

Channel:  logisticregression

logistic regression  Score: 6.1850438952
odds ratio           Score: 3.60502587628
need know            Score: 3.31621266659
ratio test           Score: 3.0764583422
odds ratios          Score: 2.97734414536
training data        Score: 2.69870342597
dr simmons           Score: 2.69276540582
likelihood ratio     Score: 2.6202610932
common odds          Score: 2.51817439189
quasi complete       Score: 2.21223542475



Shall we do some academics? 'Factor Analysis' and 'Principal Components' are top keywords in *linearalgebra*, whereas there are many phrases related to odds and ratios in *logisticregression*. You can even see who teaches these courses: Dr. Race in LinAlg, Dr. Simmons in Logistic.

In [147]:
display_channel_top('foodies')
display_channel_top('_practicum_tech_leads')

Channel:  foodies

foodie outing        Score: 2.64320932962
want dinner          Score: 2.25457460504
restaurant week      Score: 2.07366157262
wants join           Score: 2.06198535522
let know             Score: 2.04065156877
thinking going       Score: 2.0
want try             Score: 1.96772555601
really good          Score: 1.78485000765
outing tonight       Score: 1.6940231452
dinner tonight       Score: 1.68187118087

Channel:  _practicum_tech_leads

u2r2nanuc thanks     Score: 4.0
practicum server     Score: 3.7890372446
enterprise miner     Score: 2.51180867934
does know            Score: 2.33347405574
tech leads           Score: 2.170148842
great idea           Score: 2.0
think supposed       Score: 2.0
using git            Score: 2.0
want use             Score: 1.93286781486
backup drive         Score: 1.81054101571



Foodies is pretty self-explanatory. *u2r2nanuc* is a user ID, masked for privacy reasons, that belongs to the person in charge of the servers and of transfer portal security. *_practicum_tech_leads* is a channel where practicum technical leads ask questions and share knowledge. It seems SAS Enterprise Miner has been talked about a lot, as well as using git and how to back up drives.

In [149]:
display_channel_top('python')
display_channel_top('datamining')

Channel:  python

pip install          Score: 5.48826627611
https www            Score: 4.11172314471
nltk download        Score: 3.98470560605
command line         Score: 3.96094126756
dr healey            Score: 3.95008169731
weather underground  Score: 3.80634314124
lunch learn          Score: 3.33454701845
http www             Score: 3.26123361693
looks like           Score: 3.15393624544
let know             Score: 2.99492069591

Channel:  datamining

data mining          Score: 3.48222499158
10 000               Score: 3.34900079537
misclassification rate Score: 2.13994929403
makes sense          Score: 2.06049634364
far apart            Score: 2.0
line flattens        Score: 2.0
person person        Score: 2.0
thanks u1ks1bfed     Score: 2.0
thanks u1l6yqusu     Score: 2.0
measure impurity     Score: 1.96904987881



Yessssss I love python and data mining. 

To summarize, we have found phrases that characterize the conversations in each channel. In the [next section](Part 3 - Channel Clustering.ipynb), we will find groupings of these channels. 
