<div class="alert alert-block alert-success"><h3>IFN619 - Data Analytics for Information Professionals</h3></div>

## Module 2A Workshop :: Relevance and Basic Text Analytics

1. What is relevant?
    * Information Retrieval
    * Filter Bubbles
2. Which text is most relevant?
    * Lexical frequency
    * TF/IDF
    * BM25
    * Topic modelling
3. Reflection
    * Are these questions equivalent?
    * Asking the right questions

In [None]:
from IPython.display import IFrame 
import pandas as pd
from collections import Counter
from gensim.models import TfidfModel
from gensim.corpora import Dictionary
from gensim.utils import tokenize
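# Note: the gensim.summarization imports (bm25, keywords) need gensim < 4.0; the module was removed in gensim 4.0
from gensim.summarization.bm25 import get_bm25_weights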
from gensim.corpora.textcorpus import remove_stopwords
from gensim.summarization import keywords
from gensim.models.ldamodel import LdaModel

In [None]:
#!pip install pyLDAvis
import pyLDAvis
import pyLDAvis.gensim

### [1] What is relevant?

In [None]:
IFrame(src="https://embed.ted.com/talks/lang/en/eli_pariser_beware_online_filter_bubbles", width="854", height="480")

**DISCUSSION:**
* How might filter bubbles have contributed to the recent tragedy in Christchurch?
* Can you think of ways that you personally might be trapped in a filter bubble?
* As an information professional, what is your responsibility?

[Read More]()

### [2] Which text is most relevant?

This section works through common algorithms for finding the text most relevant to a given query.

#### Before we start...

We need to get the text into a format that is structured enough to use (a similar process to the last workshop).
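
As a rough illustration of that preparation step, the sketch below tokenizes one sample string, filters out noise terms, and counts what remains. It uses only the standard library rather than gensim, and the names `simple_tokenize`, `keep_term` and the sample sentence are hypothetical, made up for this example. Lowercasing is also an assumption; the workshop cells may handle case differently.

```python
import re
from collections import Counter

def simple_tokenize(text):
    """Split text into lowercase alphabetic tokens (a hypothetical stand-in for a library tokenizer)."""
    return re.findall(r"[a-z]+", text.lower())

def keep_term(term):
    """Keep terms longer than one character, dropping the HTML-entity remnants 'lt' and 'gt'."""
    return len(term) > 1 and term not in ("lt", "gt")

# Made-up sample text containing escaped HTML tags, similar in spirit to a news description
sample = "Reuters - A &lt;b&gt;short&lt;/b&gt; example of a news description."

sample_terms = [t for t in simple_tokenize(sample) if keep_term(t)]
sample_counts = Counter(sample_terms)  # lexical frequency of each remaining term

print(sample_terms)
print(sample_counts.most_common(3))
```

The filter removes the leftover `lt`, `gt` and single-letter tokens that escaped markup tends to produce, so only content-bearing terms are counted.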


In [None]:
df = pd.read_csv(???,names=['topic','title','description'])

In [None]:
???

In [None]:
testString = df.iloc[10]['description']
testString

In [None]:
terms = list(tokenize(testString))
terms

In [None]:
def termFilter(term):
    # Keep terms longer than one character, excluding the HTML-entity remnants 'lt' and 'gt'
    return len(term) > 1 and term != 'lt' and term != 'gt'

In [None]:
list(filter(???,terms))

In [None]:
list(filter(???,tokenize(df.iloc[9]['description'])))

In [None]:
def getTerms(text):
    return list(filter(???,tokenize(text)))

In [None]:
df['terms'] = df['description'].map(???)

In [None]:
df['termCount'] = df['terms'].map(???)

In [None]:
df[['topic','termCount']].groupby(???).mean()

In [None]:
subset = df[df['termCount']>40]
len(subset)

In [None]:
len(df)

In [None]:
terms = subset['terms']
terms

In [None]:
allterms = [term for list_ in terms for term in list_]
len(???)

In [None]:
vocab = Counter(???)
# Print each term with its count, from most to least frequent
for key, count in vocab.most_common():
    print(count,'\t',key)

#### Term Frequency, Inverse Document Frequency (TF/IDF)

[Read More]()
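
To make the idea concrete, here is a minimal standard-library sketch of one common TF/IDF variant: raw term counts multiplied by a base-2 logarithmic inverse document frequency, then scaled so each document vector has unit length. The function name `tfidf_vectors` and the toy documents are hypothetical, and the weighting choices are assumptions for illustration; libraries offer several other variants.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one {term: weight} dict per document (hypothetical helper, one TF/IDF variant)."""
    n_docs = len(docs)
    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        # Term frequency times inverse document frequency (log base 2)
        weights = {t: c * math.log2(n_docs / doc_freq[t]) for t, c in counts.items()}
        # Scale to unit Euclidean length so long and short documents are comparable
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({t: (w / norm if norm else 0.0) for t, w in weights.items()})
    return vectors

# Made-up documents, already tokenized
toy_docs = [
    ["oil", "prices", "rise"],
    ["oil", "prices", "fall"],
    ["football", "team", "wins"],
]
toy_vectors = tfidf_vectors(toy_docs)
print(toy_vectors[0])
```

In the first toy document, `rise` receives a higher weight than `oil` or `prices` because it appears in only one document, while the other two terms are shared with the second document.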

In [None]:


vocab = Dictionary(df[???].tolist())  # fit dictionary
corpus = [vocab.doc2bow(terms) for terms in df['terms'].tolist()]  # convert corpus to BoW format

model = TfidfModel(corpus)  # fit model
vector = model[corpus[0]]  # apply model to the first corpus document

In [None]:
vector

In [None]:
[(vocab[el[0]],el[1]) for el in vector if el[1]>0.3]

In [None]:
def get_tfidf(idx):
    # Return the five terms with the highest TF/IDF weight in document idx
    term_values = [(vocab[el[0]],el[1]) for el in model[corpus[idx]] if el[1]>0]
    srt = sorted(term_values, key=lambda x: x[1],reverse=True)
    return list(map(lambda x: x[0],srt[:5]))


In [None]:
df['tfidf'] = df.index.map(???)

In [None]:
df

In [None]:
df = df[['topic','description','terms','tfidf']]
df

In [None]:
pd.set_option('display.max_colwidth', None)  # None means no truncation (pandas >= 1.0; older versions used -1)
print(df.iloc[0])

#### BM25 (Best Match)

[Read More]()
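
As a point of reference, the sketch below scores each document against a query using a widely used form of the Okapi BM25 formula. It is a standard-library illustration, not the gensim implementation: the function name `bm25_scores`, the toy documents, the parameter values `k1=1.5` and `b=0.75`, and the smoothed (always non-negative) IDF are all assumptions chosen for this example.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score every document in docs against the query terms (hypothetical helper)."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter(t for d in docs for t in set(d))
    scores = []
    for doc in docs:
        counts = Counter(doc)
        score = 0.0
        for term in query:
            if term not in counts:
                continue  # terms absent from the document contribute nothing
            # Smoothed inverse document frequency (assumed variant, never negative)
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1)
            tf = counts[term]
            # Term frequency saturates via k1 and is adjusted for document length via b
            score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(doc) / avg_len))
        scores.append(score)
    return scores

# Made-up documents, already tokenized
toy_docs = [
    ["oil", "prices", "rise"],
    ["oil", "prices", "fall"],
    ["football", "team", "wins"],
]
toy_scores = bm25_scores(["oil", "rise"], toy_docs)
print(toy_scores)
```

The first toy document matches both query terms and so ranks highest, the second matches only `oil`, and the third matches neither and scores zero.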

In [None]:

doc_terms = list(map(remove_stopwords,df['terms'].tolist()))
doc_terms

In [None]:
bm25_model = get_bm25_weights(???, n_jobs=-1)

In [None]:
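def get_bm25(idx):
    # Caution: get_bm25_weights appears to return one score per document pair
    # (a documents x documents matrix), so this zip pairs each term of document idx
    # with a document-level score, not a per-term weight. Interpret the output with care.
    term_values = list(zip(doc_terms[idx],bm25_model[idx]))
    top_values = filter(lambda t: t[1]>0,term_values)
    srt = sorted(top_values, key=lambda x: x[1],reverse=True)
    return list(map(lambda x: x[0],srt[:5]))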

In [None]:
get_bm25(???)

In [None]:
df['bm25'] = df.index.map(???)

In [None]:
df

In [None]:
def get_keywords(text):
    return keywords(text).split('\n')

In [None]:
df['keywords'] = df['description'].map(???)

In [None]:
df

#### Topic Modelling

Latent Dirichlet Allocation (LDA) - [Read More]()

In [None]:

lda_model = LdaModel(corpus=???, id2word=???, num_topics=???, random_state=100, update_every=1,
                     chunksize=100, passes=???, alpha='auto', per_word_topics=True)

In [None]:
lda_model.print_topics()

In [None]:
# Visualize the topics
pyLDAvis.enable_notebook()
#vis = pyLDAvis.gensim.prepare(lda_model, corpus, vocab)
#vis

### [3] Reflection

* Are these questions equivalent?
* Asking the right questions

#### Pulse Survey

Please complete the pulse survey. This is a new unit, and we need both positive and negative feedback to help us improve it.