# Text Analysis

#### The idea of this exercise is to perform simple text analysis, a technique used in many applications. Also known as text mining, it aims to extract high-quality information from text. Typical text mining tasks include text categorization, text clustering, concept/entity extraction, sentiment analysis, and document summarization.

#### Based on a custom query, we will try to find the most similar documents in our pool of documents

In [None]:
from pyspark import SparkContext
sc = SparkContext()

In [None]:
# Load the bz2-compressed text file directly; textFile decompresses it on the fly
t = sc.textFile('test.ft.txt.bz2')

In [None]:
t.count()

In [None]:
# Take a look at what the data looks like
t.take(10)

##### Stopwords: the most frequently used words in a language (for example "the", "is", "and"). They carry little information about the content of a text, so we generally remove them before further processing

In [None]:
# Execute this cell to download the list of English stopwords
from urllib.request import urlretrieve
urlretrieve("https://raw.githubusercontent.com/stanfordnlp/CoreNLP/master/data/edu/stanford/nlp/patterns/surface/stopwords.txt", "stopwords.txt")

In [None]:
stopwords = sc.textFile("stopwords.txt").collect()

#### Split the total dataset into two parts, if needed

In [None]:
train,test = t.randomSplit(weights=[0.9, 0.1], seed=1)

In [None]:
# Check the number of partitions
train.getNumPartitions()

In [None]:
# Increase the number of partitions
train = train.repartition(10)

In [None]:
train.getNumPartitions() # Check again

In [None]:
train.persist() # Store the RDD in memory for quicker operations

In [None]:
# Split the text into 'tokens' (individual words) by whitespace
traw = train.map(lambda x: x.split(' '))

In [None]:
# Discard the first token(word) and take rest
tdata = traw.map(lambda x: x[1:])

In [None]:
tdata.take(10)

In [None]:
# Create a function that would make the tokens(words) lowercase and then check if it's a stopword or not.
# If stopword, then discard it
# Input: x -> list of words/tokens
# Output: list of words/tokens without stopwords
def remove_sw(x):
    # Write your code here

In [None]:
t_semi_clean = tdata.map(remove_sw)
t_semi_clean.take(10)
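
The stopword step can be tried on a plain Python list before running it on the RDD. This is a minimal sketch of one possible approach, not the expected solution: `toy_stopwords` and `drop_stopwords` are hypothetical names, and the toy list stands in for the downloaded `stopwords.txt`. A `set` is used because membership tests on a set are faster than on a list.

```python
# Toy stopword set (assumption); the notebook uses the downloaded stopwords.txt.
toy_stopwords = {"the", "is", "a", "of"}

def drop_stopwords(tokens, stopword_set):
    """Lowercase every token and keep only those not in stopword_set."""
    lowered = (token.lower() for token in tokens)
    return [token for token in lowered if token not in stopword_set]

sample = ["The", "plot", "is", "a", "masterpiece", "of", "suspense"]
cleaned = drop_stopwords(sample, toy_stopwords)
print(cleaned)
```

With the real data, the collected `stopwords` list could be wrapped in `set(...)` once, outside the function, so the lookup stays cheap for every record.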

In [None]:
# Create a function which tries to eliminate all the special characters in tokens(words)
# Also, only take words which have length more than 2!
# Hint: Use regex, the module in python is re
# Input: x -> list of words/tokens
# Output: list of words/tokens with length more than 2 and without any special characters
import re
def replace_special_chars(x):
    # Write code here

In [None]:
t_clean = t_semi_clean.map(replace_special_chars)
t_clean.take(10)
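
The cleaning step can also be tested on a small list first. The sketch below is one possible approach, assuming the tokens are already lowercase (the stopword step lowercases them) and that "special characters" means anything other than ASCII letters and digits. `NON_ALNUM` and `clean_tokens` are hypothetical names.

```python
import re

# Matches any character that is not a lowercase ASCII letter or a digit.
# Assumption: tokens were lowercased earlier, so uppercase letters do not occur.
NON_ALNUM = re.compile(r"[^a-z0-9]")

def clean_tokens(tokens):
    """Strip special characters, then keep tokens longer than 2 characters."""
    stripped = (NON_ALNUM.sub("", token) for token in tokens)
    return [token for token in stripped if len(token) > 2]

sample = ["great!!", "cd:", "it's", "ok", "5-star", "..."]
cleaned = clean_tokens(sample)
print(cleaned)
```

The length check runs after the characters are stripped, so a token such as `cd:` is dropped because only two characters remain.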

## Term Frequency (TF): The number of times a specific word occurs in a record

#### TF of term 't' in a document 'd' = Number of times term 't' occurs in a document or record 'd'

In [None]:
# Write a function which takes an RDD item (record) and
# counts the occurrences of each word in that record
# Input: record -> list of words/tokens
# Output: list of (word, frequency of occurrence)
def tf(record):
    counts = {}
    # Write your code here
    return list(counts.items()) 

In [None]:
tokens_with_tfs = t_clean.map(tf)
tokens_with_tfs.take(10)
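
For comparison, term frequency for a single record can be computed with `collections.Counter` from the standard library. This is a sketch of one alternative to the dictionary-based skeleton above, with `term_frequencies` as a hypothetical name.

```python
from collections import Counter

def term_frequencies(tokens):
    """Return a list of (word, count) pairs for one record."""
    return list(Counter(tokens).items())

record = ["great", "sound", "great", "price"]
pairs = term_frequencies(record)
print(pairs)
```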

## Inverse Document Frequency (IDF): How important is a specific word in the whole corpus

#### Calculation of IDF is not as straightforward as TF. 
#### IDF score of term 't' = log(total number of documents / number of documents containing 't')
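
A small worked example may help build intuition for the formula. The numbers below are made up for illustration: a toy corpus of 4 documents and hypothetical counts of how many documents contain each term. The natural logarithm is used, matching `math.log`.

```python
import math

total_docs = 4  # toy corpus size (assumed for illustration)

# Hypothetical counts: how many of the 4 documents contain each term.
doc_counts = {"great": 4, "battery": 2, "zither": 1}

# IDF = log(total number of documents / number of documents containing the term)
idf = {term: math.log(total_docs / n) for term, n in doc_counts.items()}
for term, score in idf.items():
    print(term, round(score, 3))
```

A term that appears in every document gets an IDF of 0, so it contributes nothing to the TF-IDF score, while a term that appears in only one document gets the highest score.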

In [None]:
#Take out the unique words per record from 't_clean' 
# Hint: Use python 'set' function

unique_words_per_record = t_clean.map(lambda x: #YOUR CODE HERE )

In [None]:
# Write a helper function to attach '1' to every word
# Input: record -> list of words
# Output: list of tuples where each tuple is (word, 1)
def attach_1_to_words(record):
    # Your code here

In [None]:
# You need to attach '1' to each and every word across all records of RDD 'unique_words_per_record'
# and return the result as one flat RDD of (word, 1) tuples, not one list per record.
# Which transformation should we use?
unique_words_per_record_with_1 = # unique_words_per_record. YOUR CODE HERE

In [None]:
# We need to add up the '1's together for same words 
# which is basically counting the number of documents where a specific word occurs!
# Which transformation?
tokens_with_docs_count = # unique_words_per_record_with_1. YOUR CODE HERE
tokens_with_docs_count.take(2)
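
The three steps above (unique words, attach 1, sum per word) can be mirrored in plain Python to check the logic on a toy corpus. This is only an analogy under assumed data, not the Spark code itself; on an RDD, a flattening transformation such as `flatMap` followed by a per-key aggregation such as `reduceByKey` is one way to get the same result.

```python
from collections import Counter
from itertools import chain

# Three toy records, already cleaned (assumed for illustration).
records = [
    ["great", "sound", "great"],
    ["great", "price"],
    ["poor", "sound"],
]

# Step 1: unique words per record, so repeats inside a record count once.
unique_per_record = [set(record) for record in records]

# Step 2: pair every word with 1 and flatten all records into one sequence.
word_one_pairs = chain.from_iterable(
    ((word, 1) for word in words) for words in unique_per_record
)

# Step 3: sum the 1s per word to get the number of documents containing it.
doc_counts = Counter()
for word, one in word_one_pairs:
    doc_counts[word] += one

print(dict(doc_counts))
```

Note that "great" occurs three times in total but is counted for two documents only, which is why step 1 matters.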

In [None]:
# Now, count the total number of documents
docs = t_clean.count()

In [None]:
# You have the counts for the words in the whole document set, now try to calculate IDF
# Hint: use python module "math" and then math.log for logarithm
# Return: RDD of (token, idf_score)
import math
tokens_with_idfs = tokens_with_docs_count.map(lambda x: (x[0], math.log(docs/x[1])))

In [None]:
# Sort the result on the basis of idf scores and take just 10. Which 'action' do we use?
tokens_with_idfs.takeOrdered(10, lambda s: s[1])

In [None]:
# Collect the IDF scores of all tokens (words) into a python dict (because we need to look them up over and over again)
tokens_with_idfs_dict = tokens_with_idfs.collectAsMap()

### TFIDF score of a term in a specific document = TF of the term in a specific doc x IDF of the term 

In [None]:
# Write the function tfidf which takes one record of token counts
# and multiplies each count by the IDF score of that term
# Input: record -> list of (word, term frequency)
# Output: list of (word, tfidf score)
def tfidf(record):
    res = []
    #Your code here
    return res

In [None]:
tfidf_docs = tokens_with_tfs.map(tfidf)
tfidf_docs.take(5)
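
The multiplication itself can be sketched on toy values. In the example below, `tfidf_scores` and `toy_idf` are hypothetical names, and the IDF values are invented. One design choice is an assumption worth noting: words missing from the IDF lookup get a score of 0.0, so that a query containing words never seen in the corpus does not raise a `KeyError`.

```python
def tfidf_scores(tf_pairs, idf_lookup):
    """Multiply each term frequency by the term's IDF score.

    Terms missing from idf_lookup get 0.0 (an assumption made here so that
    unseen words do not raise KeyError).
    """
    return [(word, count * idf_lookup.get(word, 0.0)) for word, count in tf_pairs]

toy_idf = {"great": 0.5, "sound": 1.0}  # hypothetical IDF values
tf_pairs = [("great", 2), ("sound", 1), ("unseen", 3)]
scores = tfidf_scores(tf_pairs, toy_idf)
print(scores)
```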

### Calculate cosine similarity: a measure of how similar two vectors are, here a document vector and the query vector. The document vectors are the TF-IDF representations of our documents, which we have already calculated; the query vector will be calculated from a custom query

#### https://en.wikipedia.org/wiki/Cosine_similarity
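
Cosine similarity is the dot product of two vectors divided by the product of their lengths. The sketch below checks that definition on sparse vectors stored as `{term: weight}` dicts. The function name `cosine` and the toy vectors are assumptions for illustration, and returning 0.0 for an empty vector is a convention chosen here.

```python
import math

def cosine(vec_a, vec_b):
    """Cosine similarity of two sparse vectors stored as {term: weight} dicts."""
    dot = sum(weight * vec_b[term] for term, weight in vec_a.items() if term in vec_b)
    norm_a = math.sqrt(sum(w * w for w in vec_a.values()))
    norm_b = math.sqrt(sum(w * w for w in vec_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here: an empty vector matches nothing
    return dot / (norm_a * norm_b)

doc = {"great": 3.0, "sound": 4.0}
same_direction = {"great": 6.0, "sound": 8.0}  # doc scaled by 2
unrelated = {"price": 2.0}                     # shares no terms with doc

print(cosine(doc, same_direction))
print(cosine(doc, unrelated))
```

Scaling a vector does not change its direction, so a document and a copy with doubled weights score as identical, while vectors with no shared terms score zero.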

In [None]:
# The cosine similarity function
# Input: doc_record: list of (word, tfidf score) for one document, query: dict of word -> tfidf score
# Output: tuple of (doc_record, cosine similarity score)
def cosine_similarity(doc_record, query):
    dot_prod = 0.0
    norm_record = []
    norm_query = []
    for query_term in query:
        norm_query.append(query[query_term]**2)  # squared weights for the query norm
    for word_tfidf in doc_record:
        word = word_tfidf[0]
        tfidf = word_tfidf[1]
        norm_record.append(tfidf**2)

        if word in query:
            dot_prod += query[word] * tfidf
    # Compute the score once, after all words of the document have been processed
    denom = math.sqrt(sum(norm_record)) * math.sqrt(sum(norm_query))
    if denom == 0:
        return (doc_record, 0.0)  # empty document or query: no similarity
    return (doc_record, dot_prod / denom)

In [None]:
def tuples_to_dict(record):
    output = {}
    for word_tfidf in record:
        word = word_tfidf[0]
        tfidf = word_tfidf[1]
        output[word] = tfidf
    return output

In [None]:
def querybuilder(querystr=""):
    query_rdd_raw = sc.parallelize([tuple(querystr.split(' '))])
    query_sw = query_rdd_raw.map(remove_sw)
    query_rs = query_sw.map(replace_special_chars)
    query_rdd_tf = query_rs.map(tf)
    query_rdd_tfidf = query_rdd_tf.map(tfidf)
    query_dict = query_rdd_tfidf.map(tuples_to_dict).collect()[0]
    return query_dict

In [None]:
test.take(1)

#### Now we will build the 'query' which would be used to find similar documents

In [None]:
# query = querybuilder("") # You can build the query by passing a string OR
# Build the query from any document of the test RDD, dropping its first token as we did for the training data
query = querybuilder(test.take(1)[0].split(' ', 1)[1])
query

In [None]:
r = tfidf_docs.map(lambda x: cosine_similarity(x, query)) # Calculate the cosine similarity

In [None]:
r.takeOrdered(10, key=lambda s: -s[1])

#### The rest of the section is optional and could be used if needed

In [None]:
r.filter(lambda x: x is None).count()

In [None]:
r = r.filter(lambda x: x is not None)

In [None]:
# Attach the document id and then sort
r.zipWithIndex().takeOrdered(5, key=lambda s: -s[0][1])

In [None]:
def get_original_record_ids(result_rdd, number):
    ids = []
    r_rdd = result_rdd.zipWithIndex()
    r_rdd_sorted = r_rdd.takeOrdered(number, key=lambda s: -s[0][1])
    i = 0
    for rec in r_rdd_sorted:
        ids.append((rec[1], i))
        i = i+1
    return ids

def filter_records_on_ids(training_record, oids):
    position = training_record[1]
    for oid in oids:
        if position == oid[0]:
            return True
    return False

def map_final_records(training_record, oids):
    position = training_record[1]
    for oid in oids:
        if position == oid[0]:
            return (training_record, oid[1])
    return None

In [None]:
oids = get_original_record_ids(r, 10)
oids

In [None]:
#Get the full content of the matched documents
train.zipWithIndex().filter(lambda x: filter_records_on_ids(x, oids)).map(lambda x: map_final_records(x, oids)).takeOrdered(10, lambda s: s[1])
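
The lookup above pairs each score with its position, keeps the top matches, and then fetches the original text at those positions. The same idea can be sketched in plain Python with `enumerate` and `heapq.nlargest`. The scores and document strings below are invented for illustration; this is an analogy for what the helper functions do, not a replacement for them.

```python
import heapq

# Hypothetical similarity scores, one per training record, in record order.
scores = [0.10, 0.85, 0.00, 0.42, 0.91]
documents = ["doc a", "doc b", "doc c", "doc d", "doc e"]

# Pair each score with its position, similar to what zipWithIndex does for an RDD.
indexed = [(score, position) for position, score in enumerate(scores)]

# Keep the 3 highest scores; rank 0 is the best match.
top = heapq.nlargest(3, indexed, key=lambda pair: pair[0])

# Look up the original text by position and attach the rank.
ranked = [(documents[position], rank) for rank, (score, position) in enumerate(top)]
print(ranked)
```

This only works when the scored records and the original records are in the same order, which is also what the `zipWithIndex` approach relies on.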