# **Textual Data Classification**
A set of preprocessing steps must be applied to the textual attribute before a classification model can be generated. Spark ML algorithms work only on "tables" of double values, so the textual part of the input data must be translated into a set of numerical attributes that represent the data as a table. Many words carry little information for classification (e.g., conjunctions); these stopwords are usually removed.

Traditionally, a weight based on the **TF-IDF** measure is used to assign a different importance to each word, based on its frequency in the collection.

In [None]:
from pyspark.ml.linalg import Vectors
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.feature import Tokenizer
from pyspark.ml.feature import StopWordsRemover
from pyspark.ml.feature import HashingTF
from pyspark.ml.feature import IDF
from pyspark.ml.classification import LogisticRegression
from pyspark.ml import Pipeline
from pyspark.ml import PipelineModel

# input and output folders
trainingData = "ex_dataText/trainingData.csv"
unlabeledData = "ex_dataText/unlabeledData.csv"
outputPath = "predictionsLRPipelineText/"

In [None]:
# *************************
# Training step
# *************************

# Create a DataFrame from trainingData.csv
# Training data in raw format
trainingData = spark.read.load(trainingData,\
                                format="csv",\
                                header=True,\
                                inferSchema=True)

### **Here is the part that differs**

In [None]:
# Configure an ML pipeline, which consists of five stages:
# tokenizer -> splits each sentence into a list of words
# remover -> removes stopwords
# hashingTF -> maps each list of words to a fixed-length feature vector
# (each word is mapped to a feature and the value of the feature is the
# frequency of the word in the sentence)
# idf -> computes the IDF component of the TF-IDF measure
# lr -> logistic regression classification algorithm

# The Tokenizer splits each sentence into a list of words.
# It analyzes the content of column "text" and adds the
# new column "words" to the returned DataFrame
tokenizer = Tokenizer().setInputCol("text").setOutputCol("words")

In [None]:
# Remove stopwords.
# The StopWordsRemover component returns a new DataFrame with
# a new column called "filteredWords". "filteredWords" is generated
# by removing the stopwords from the content of column "words"

remover = StopWordsRemover()\
.setInputCol("words")\
.setOutputCol("filteredWords")
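
The first two stages can be illustrated without Spark. The following is a minimal pure-Python sketch of what tokenization and stopword removal roughly do to one sentence. The `STOPWORDS` set and the helper names `tokenize` and `remove_stopwords` are hypothetical and exist only for this illustration; they are not part of the Spark API, and Spark's default stopword list is much longer.

```python
# Hypothetical, deliberately tiny stopword list (assumption for this sketch)
STOPWORDS = {"the", "is", "a", "and", "of", "in"}

def tokenize(sentence):
    # Lowercase the sentence and split it on whitespace
    return sentence.lower().split()

def remove_stopwords(words, stopwords=STOPWORDS):
    # Keep only the words that are not in the stopword list
    return [w for w in words if w not in stopwords]

words = tokenize("The Spark framework is fast and scalable")
filtered_words = remove_stopwords(words)

print(words)           # content of a "words"-like column
print(filtered_words)  # content of a "filteredWords"-like column
```

Here `filtered_words` keeps only `spark`, `framework`, `fast`, and `scalable`, which are the words that carry information about the topic of the sentence.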

In [None]:
# Now we want to compute the TF-IDF (Term Frequency - Inverse Document Frequency) weights.
# Unfortunately, there is no single method that computes them automatically.
# We need to go through 2 steps and accept 1 approximation

# First step: hashing TF
# It maps each input word to a number and counts the
# occurrences of that number inside each sentence.
# numFeatures is the number of features of the vector
# (collisions are less likely if it is greater than the number of distinct words)
hashingTF = HashingTF()\
.setNumFeatures(1000)\
.setInputCol("filteredWords")\
.setOutputCol("rawFeatures")
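
To see why the number of features matters, the hashing step can be sketched in plain Python. This is only an illustration of the idea: the helper `hashing_tf` is hypothetical, and `zlib.crc32` is an assumption chosen because it is deterministic, not the hash function Spark uses. The indices produced here will therefore differ from those in the `rawFeatures` column.

```python
import zlib
from collections import Counter

def hashing_tf(words, num_features):
    # Map each word to an index in [0, num_features) and count how many
    # times each index occurs. Two different words may share an index
    # (a collision), in which case their counts are summed.
    counts = Counter(
        zlib.crc32(w.encode("utf-8")) % num_features for w in words
    )
    return dict(sorted(counts.items()))

words = ["spark", "fast", "spark", "scalable"]

# With many features, collisions between distinct words are unlikely
tf = hashing_tf(words, num_features=1000)
print(tf)

# With a single feature, every word collides on index 0
collapsed = hashing_tf(words, num_features=1)
print(collapsed)
```

The total count is preserved in both cases, but with too few features the vector can no longer distinguish one word from another.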

In [None]:
# Apply the IDF transformation/computation.
# The system computes, over the entire dataset, the number of
# sentences in which each (hashed) word occurs
idf = IDF()\
.setInputCol("rawFeatures")\
.setOutputCol("features") 
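
The effect of the IDF stage can be checked by hand on a toy collection. The sketch below works on words directly instead of hashed indices, to keep it readable. It assumes the smoothed formula given in the Spark documentation, idf(t) = log((m + 1) / (df(t) + 1)), where m is the number of documents and df(t) is the number of documents containing term t. The helper names `idf_weights` and `tf_idf` are hypothetical.

```python
import math
from collections import Counter

def idf_weights(documents):
    # documents: list of lists of words
    m = len(documents)
    # Document frequency: number of documents that contain each term
    df = Counter(term for doc in documents for term in set(doc))
    return {term: math.log((m + 1) / (n + 1)) for term, n in df.items()}

def tf_idf(doc, idf):
    # Term frequency (raw count) multiplied by the IDF weight of the term
    tf = Counter(doc)
    return {term: count * idf[term] for term, count in tf.items()}

docs = [
    ["spark", "fast"],
    ["spark", "scalable"],
    ["spark", "spark", "cluster"],
]
idf = idf_weights(docs)
weights = tf_idf(docs[2], idf)

print(idf)
print(weights)
```

In this toy collection "spark" appears in every document, so its IDF weight is zero and it does not contribute to the features, no matter how often it occurs in a sentence. Rarer words such as "cluster" receive a positive weight.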

In [None]:
# Create a classification model based on the logistic regression algorithm
# We can set the values of the parameters of the
# Logistic Regression algorithm using the setter methods.
lr = LogisticRegression()\
.setMaxIter(10)\
.setRegParam(0.01)

In [None]:
pipeline = Pipeline().setStages([tokenizer, remover, hashingTF, idf, lr])

# Execute the pipeline on the training data to build the
# classification model
classificationModel = pipeline.fit(trainingData)

# Now, the classification model can be used to predict the class label
# of new unlabeled data

In [None]:
# *************************
# Prediction step
# *************************

# Read unlabeled data
# Create a DataFrame from unlabeledData.csv
# Unlabeled data in raw format
unlabeledData = spark.read.load(unlabeledData,\
    format="csv", header=True, inferSchema=True)

In [None]:
# Make predictions on unlabeled documents by using the
# Transformer.transform() method.
# The pipeline model applies all of its stages, starting from the "text"
# column; the label is not needed to compute the predictions
predictionsDF = classificationModel.transform(unlabeledData)

# Select only the original features (i.e., the value of the original text attribute) and
# the predicted class for each record
predictions = predictionsDF.select("text", "prediction")
                                   
# Save the result in an HDFS output folder
predictions.write.csv(outputPath, header="true")

# **Sparse Labeled Data: The LIBSVM format**
Training data are frequently sparse; textual data are a typical example. MLlib supports reading training examples stored in the LIBSVM format, a commonly used textual format for representing sparse documents/data points.

The LIBSVM format is a textual format in which each line represents an input record/data point by using a sparse feature vector.
Each line has the format:
- **label index1:value1 index2:value2 ...**

Consider the following two records/data points characterized by 4 predictive features and a class label:
- Features = [5.8, 1.7, 0, 0] -- Label = 1
- Features = [4.1, 0, 2.5, 1.2] -- Label = 0

Their LIBSVM-based representation is the following:
- 1 1:5.8 2:1.7
- 0 1:4.1 3:2.5 4:1.2
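
The conversion from a dense record to a LIBSVM line can be sketched in a few lines of plain Python. The function `to_libsvm_line` is a hypothetical helper written for this illustration, not part of Spark or of the LIBSVM tools; it assumes numeric labels and one-based feature indices, as in the example above.

```python
def to_libsvm_line(label, features):
    # Keep only the non-zero features; LIBSVM indices start at 1
    pairs = [
        f"{index}:{value:g}"
        for index, value in enumerate(features, start=1)
        if value != 0
    ]
    return " ".join([f"{label:g}"] + pairs)

line1 = to_libsvm_line(1, [5.8, 1.7, 0, 0])
line2 = to_libsvm_line(0, [4.1, 0, 2.5, 1.2])

print(line1)  # 1 1:5.8 2:1.7
print(line2)  # 0 1:4.1 3:2.5 4:1.2
```

Zero-valued features are simply omitted, which is what keeps the file small when most features are zero.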

LIBSVM files can be loaded into DataFrames by combining the following methods:
- read, format("libsvm"), and load(inputpath)

The returned DataFrame has two columns:
- label: double
    - The double value associated with the label
- features: vector
    - A sparse vector associated with the predictive features

In [None]:
...
spark.read.format("libsvm").load("sample_libsvm_data.txt")
..
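
To make the content of the two columns concrete, the following pure-Python sketch parses one LIBSVM line into a label and a sparse representation (indices and values). The helper `parse_libsvm_line` is hypothetical and is not how Spark implements the loader. It converts the one-based LIBSVM indices to zero-based ones, which is the behavior described in the Spark documentation for loaded LIBSVM data.

```python
def parse_libsvm_line(line):
    # Split "label index1:value1 index2:value2 ..." into its parts
    tokens = line.split()
    label = float(tokens[0])
    indices, values = [], []
    for token in tokens[1:]:
        index, value = token.split(":")
        indices.append(int(index) - 1)  # one-based -> zero-based
        values.append(float(value))
    return label, indices, values

label, indices, values = parse_libsvm_line("0 1:4.1 3:2.5 4:1.2")

print(label)    # 0.0
print(indices)  # [0, 2, 3]
print(values)   # [4.1, 2.5, 1.2]
```

Only the positions and values of the non-zero features are stored; every other feature is implicitly zero.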