# Text Analysis


#### The idea of this exercise is to perform simple text analysis, a popular concept used in many cutting-edge applications. Also known as text mining, it aims to retrieve high-quality information from text. Some text mining tasks are: text categorization, text clustering, concept/entity extraction, sentiment analysis, and document summarization.

In [None]:
from pyspark import SparkContext
sc = SparkContext()

In [None]:
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("Text Classification") \
    .getOrCreate()

In [None]:
# Load the bzip2-compressed text file directly; Spark decompresses it on read
t = sc.textFile('data/test.ft.txt.bz2')

In [None]:
t.take(2)

#### Stopwords: the most frequently used words in a specific language. Stopwords carry little useful information about a chunk of text, so we generally remove them before processing further.
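
A minimal plain-Python sketch of the idea, assuming a tiny hand-written stopword set. The names `demo_stopwords` and `drop_stopwords` and the sample tokens are illustrative only; they are not taken from the stopword list downloaded in the next cell.

```python
# Hypothetical mini stopword set, for illustration only.
demo_stopwords = {"the", "is", "a", "of", "and"}

def drop_stopwords(tokens, stopword_set):
    """Lowercase each token and keep it only if it is not a stopword."""
    lowered = (t.lower() for t in tokens)
    return [t for t in lowered if t not in stopword_set]

sample = ["The", "battery", "is", "a", "marvel", "of", "engineering"]
filtered = drop_stopwords(sample, demo_stopwords)
print(filtered)
```

A set is used here because membership checks on a set take roughly constant time, which may matter when the filter runs over every token of a large corpus.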

In [None]:
# Execute this cell to download the list of English stopwords
import urllib.request as urllib
urllib.urlretrieve("https://raw.githubusercontent.com/stanfordnlp/CoreNLP/master/data/edu/stanford/nlp/patterns/surface/stopwords.txt", "stopwords.txt")

In [None]:
stopwords = sc.textFile("stopwords.txt").collect()

In [None]:
# Check the number of partitions
t.getNumPartitions()

In [None]:
# Increase the number of partitions
t = t.repartition(10)

In [None]:
t.getNumPartitions() # Check again

In [None]:
# Split the text into 'tokens' (individual words) by whitespace
traw = ....
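
As a hedged illustration in plain Python (not the notebook's solution), `str.split()` with no arguments splits on any run of whitespace. In Spark, a function like this could be applied to every line through `map`. The sample line below is made up.

```python
# Made-up sample line; the real lines come from the RDD `t`.
line = "first_token Great   product,\twould buy again!"

# split() with no argument splits on runs of spaces, tabs and newlines,
# and never produces empty strings.
tokens = line.split()
print(tokens)
```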

In [None]:
# Discard the first token (word) and keep the rest
tdata = traw.map(lambda x: x[1:])

In [None]:
# Create a function that removes all special characters from the tokens (words)
# Also, keep only words longer than 2 characters!
# Hint: use regular expressions; the Python module is re
# Input: x -> list of words/tokens
# Output: list of words/tokens longer than 2 characters, without any special characters
import re
def replace_special_chars(x):
    out = []
    # YOUR CODE HERE
    return out
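
One possible approach, sketched under the assumption that "special characters" means anything other than ASCII letters and digits. The function name `clean_tokens_sketch` and the sample tokens are hypothetical, and other definitions of "special" would need a different pattern.

```python
import re

def clean_tokens_sketch(tokens):
    """Strip non-alphanumeric characters, then keep tokens longer than 2 characters."""
    out = []
    for tok in tokens:
        # Assumption: drop everything except ASCII letters and digits.
        cleaned = re.sub(r"[^A-Za-z0-9]", "", tok)
        if len(cleaned) > 2:
            out.append(cleaned)
    return out

cleaned_sample = clean_tokens_sketch(["Great!!", "CD,", "it's", "a", "5-star", "buy..."])
print(cleaned_sample)
```

Note that the length check is applied after cleaning, so a token such as `"CD,"` is dropped because only two characters remain.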

In [None]:
t_semi_clean = tdata.map(replace_special_chars)
t_semi_clean.take(10)

In [None]:
# Create a function that lowercases each token (word) and checks whether it is a stopword.
# If it is a stopword, discard it
# Input: x -> list of words/tokens
# Output: list of words/tokens without stopwords
def remove_sw(x):
    out = []
    # YOUR CODE HERE
    return out

In [None]:
t_clean = t_semi_clean.map(remove_sw)
t_clean.take(10)

In [None]:
tfinal = t_clean.zipWithIndex()  # add a unique id for the records

In [None]:
inputdf = spark.createDataFrame(tfinal, ["words", "id"])  # DATAFRAMES! WHAAAAAT IS THAT?

In [None]:
inputdf.show()

In [None]:
inputdf.cache()  # store in memory for quicker operations, Spark's USP!

### Term Frequency (TF): The number of times a specific word occurs in a record

To convert text into word counts, we use the CountVectorizer:
https://spark.apache.org/docs/2.1.0/ml-features.html#countvectorizer

PySpark docs:
https://spark.apache.org/docs/latest/api/python/pyspark.ml.html#pyspark.ml.feature.CountVectorizer
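
To make the idea of term counts concrete, here is a small plain-Python sketch. It is not how CountVectorizer is implemented: the vocabulary is simply sorted alphabetically, no `vocabSize` or `minDF` filtering is applied, and the names `demo_docs`, `demo_vocab`, and `term_counts` are hypothetical.

```python
from collections import Counter

# Made-up tokenized records.
demo_docs = [["good", "sound", "good", "price"], ["bad", "sound"]]

# Simplification: an alphabetically sorted vocabulary over all records.
demo_vocab = sorted({w for doc in demo_docs for w in doc})

def term_counts(doc):
    """Return one count per vocabulary word, in vocabulary order."""
    counts = Counter(doc)
    return [counts.get(w, 0) for w in demo_vocab]

vectors = [term_counts(d) for d in demo_docs]
print(demo_vocab)
print(vectors)
```

Each record becomes a vector with one slot per vocabulary word; Spark stores the same kind of information as a sparse vector so that zero counts take no space.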

In [None]:
from pyspark.ml.feature import CountVectorizer
# Parameters:
# inputCol: "words"
# outputCol: "rawFeatures"
# vocabSize: 1000
# minDF: 2
cv = .....

In [None]:
# 'Fit' the model to the input data frame (inputdf) with 'cv'
cvmodel = .....

cvmodel here is a 'CountVectorizerModel'; its methods are listed here: https://spark.apache.org/docs/latest/api/python/pyspark.ml.html#pyspark.ml.feature.CountVectorizerModel

In [None]:
vocab = cvmodel.vocabulary  # Get the vocabulary list out

In [None]:
# Broadcast the variable to all the workers (only suitable for read-only data, such as a lookup table)
vocab_broadcast = sc.broadcast(vocab)

In [None]:
# 'Transform' the inputdf with the cvmodel
featurizedData = ......

In [None]:
featurizedData.show()

### Inverse Document Frequency (IDF): How important is a specific word in the whole corpus

https://spark.apache.org/docs/latest/api/python/pyspark.ml.html#pyspark.ml.feature.IDF
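
The Spark documentation describes a smoothed formula, IDF(t, D) = log((|D| + 1) / (DF(t, D) + 1)), where |D| is the number of documents and DF(t, D) is the number of documents containing term t. The plain-Python sketch below applies that formula to made-up data; the names `idf_docs`, `doc_freq`, and `idf_sketch` are hypothetical.

```python
import math

# Made-up tokenized records.
idf_docs = [["good", "sound"], ["bad", "sound"], ["good", "price"]]
num_docs = len(idf_docs)

def doc_freq(term):
    """Number of records that contain the term at least once."""
    return sum(1 for d in idf_docs if term in d)

def idf_sketch(term):
    """Smoothed inverse document frequency, using the natural logarithm."""
    return math.log((num_docs + 1) / (doc_freq(term) + 1))

idf_sound = idf_sketch("sound")  # appears in 2 of 3 records
idf_price = idf_sketch("price")  # appears in 1 of 3 records
print(idf_sound, idf_price)
```

The rarer word ("price") gets the larger weight. Multiplying a term's count in a record by its IDF gives the TF-IDF value for that term.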

In [None]:
from pyspark.ml.feature import IDF
# Parameters:
# inputCol: "rawFeatures"
# outputCol: "features"
idf = .......

In [None]:
# fit the model with 'featurizedData'
idfModel = .......

idfModel here is an 'IDFModel'; its methods are documented here: https://spark.apache.org/docs/latest/api/python/pyspark.ml.html#pyspark.ml.feature.IDFModel

In [None]:
# Transform 'featurizedData' with idfModel
tfidfData = ...... 
# We get TFIDF data above

In [None]:
tfidfData.select('rawFeatures').take(1)

In [None]:
tfidfData.select('features').take(1)

### Latent Dirichlet allocation (LDA) is a topic model that infers topics from a collection of text documents. LDA can be thought of as a clustering algorithm in which topics correspond to cluster centers and documents correspond to examples (rows) in a dataset.

https://spark.apache.org/docs/latest/mllib-clustering.html#latent-dirichlet-allocation-lda

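https://spark.apache.org/docs/latest/api/python/pyspark.mllib.html#pyspark.mllib.clustering.LDA

Note: these links describe the RDD-based `pyspark.mllib` API, while the code below imports the DataFrame-based `pyspark.ml.clustering.LDA`.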

In [None]:
from pyspark.ml.clustering import LDA
# Group the words into 10 different topics
# Parameters:
# k: 10
# featuresCol: "features"
lda = ......

In [None]:
# fit the model with 'tfidfData'
ldamodel = .......

https://spark.apache.org/docs/latest/api/python/pyspark.mllib.html#pyspark.mllib.clustering.LDAModel

In [None]:
# describe the topics from the model
ldatopics = ........

In [None]:
ldatopics.show()

In [None]:
def indexToWord(x):
    vocab_local = vocab_broadcast.value  # Read the variable that we broadcasted earlier
    res = []
    for i in x[1]:
        res.append(vocab_local[i])
    return (x[0], res)
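
The lookup performed by `indexToWord` can be tried without Spark. In this sketch a plain list stands in for the broadcast vocabulary, and the row mimics the (topic, termIndices, termWeights) layout produced by describing the topics; the values and the names `local_vocab` and `index_to_word_local` are made up.

```python
# Made-up vocabulary; in the notebook this comes from the fitted model.
local_vocab = ["sound", "good", "price", "bad"]

# Made-up row: (topic id, term indices, term weights).
row = (0, [1, 0, 2], [0.5, 0.3, 0.2])

def index_to_word_local(x, vocab_local):
    """Replace each term index in x[1] with the word at that position."""
    return (x[0], [vocab_local[i] for i in x[1]])

result = index_to_word_local(row, local_vocab)
print(result)
```

The term weights in `x[2]` are dropped here, just as in `indexToWord`, so the output pairs each topic id with its list of words only.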

In [None]:
topicsRDD = ldatopics.rdd.map(indexToWord)

In [None]:
# convert topicsRDD to dataframe passing the names of the columns ['topic', 'word'] and show the contents
