
Algorithms used in the project


We have used three algorithms:

  1. CountVectorizer
  2. Random Forest
  3. RegexTokenizer

CountVectorizer

CountVectorizer and CountVectorizerModel aim to help convert a collection of text documents to vectors of token counts. When an a-priori dictionary is not available, CountVectorizer can be used as an Estimator to extract the vocabulary and generate a CountVectorizerModel. The model produces sparse representations for the documents over the vocabulary, which can then be passed to other algorithms like LDA.

During the fitting process, CountVectorizer will select the top vocabSize words ordered by term frequency across the corpus. An optional parameter minDF also affects the fitting process by specifying the minimum number (or fraction, if < 1.0) of documents a term must appear in to be included in the vocabulary. Another optional parameter, binary, toggles the output vector: if set to true, all nonzero counts are set to 1. This is especially useful for discrete probabilistic models that model binary, rather than integer, counts.
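As a minimal PySpark sketch of how these parameters are set (the column names "texts" and "vector" are placeholders, not this project's actual settings):

```python
from pyspark.ml.feature import CountVectorizer

cv = CountVectorizer(
    inputCol="texts",    # column of Array[String] documents
    outputCol="vector",  # sparse count vectors
    vocabSize=3,         # keep the top 3 terms by corpus frequency
    minDF=2.0,           # a term must appear in at least 2 documents
    binary=False         # True would clip all nonzero counts to 1
)
```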

Examples

Assume that we have the following DataFrame with columns id and texts:

 id | texts
----|---------------------------------
 0  | Array("a", "b", "c")
 1  | Array("a", "b", "b", "c", "a")

Each row in texts is a document of type Array[String]. Invoking fit of CountVectorizer produces a CountVectorizerModel with vocabulary (a, b, c). The output column “vector” after transformation then contains:

 id | texts                           | vector
----|---------------------------------|---------------------------
 0  | Array("a", "b", "c")            | (3,[0,1,2],[1.0,1.0,1.0])
 1  | Array("a", "b", "b", "c", "a")  | (3,[0,1,2],[2.0,2.0,1.0])

Each vector represents the token counts of the document over the vocabulary.
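A sketch reproducing the example above in PySpark (the SparkSession setup is assumed, and the vocabulary index order can vary when term frequencies tie):

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import CountVectorizer

spark = SparkSession.builder.getOrCreate()

# The two example documents from the table above
df = spark.createDataFrame(
    [(0, ["a", "b", "c"]),
     (1, ["a", "b", "b", "c", "a"])],
    ["id", "texts"])

model = CountVectorizer(inputCol="texts", outputCol="vector").fit(df)
print(model.vocabulary)                   # e.g. ['a', 'b', 'c']
model.transform(df).show(truncate=False)
# id=0 -> (3,[0,1,2],[1.0,1.0,1.0])
# id=1 -> (3,[0,1,2],[2.0,2.0,1.0])
```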

For more details, see the Spark ML documentation on CountVectorizer in the References below.

Random Forest

Random Forest is a flexible, easy-to-use machine learning algorithm that produces a great result most of the time, even without hyper-parameter tuning. It is also one of the most used algorithms because of its simplicity and the fact that it can be used for both classification and regression tasks. Below is a brief overview of how the random forest algorithm works.

Random Forest is a supervised learning algorithm. As its name suggests, it builds a "forest" that is somewhat random: an ensemble of Decision Trees, most of the time trained with the "bagging" method. The general idea of bagging is that a combination of learning models improves the overall result.

Put simply: random forest builds multiple decision trees and merges them together to get a more accurate and stable prediction.

One big advantage of random forest is that it can be used for both classification and regression problems, which form the majority of current machine learning systems. The focus here is on random forest for classification.
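A rough sketch of random forest classification in Spark ML, the same library used for the other two components here (the column names, numTrees value, and seed are illustrative assumptions, not the project's actual settings):

```python
from pyspark.ml.classification import RandomForestClassifier

rf = RandomForestClassifier(
    featuresCol="features",  # e.g. the CountVectorizer output
    labelCol="label",
    numTrees=100,            # size of the ensemble
    seed=42)                 # each tree trains on a bootstrap ("bagged") sample

# model = rf.fit(train_df)                 # train_df: DataFrame with features/label
# predictions = model.transform(test_df)   # majority vote across the trees
```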

For more details, see the random forest article in the References below.

RegexTokenizer

A regex-based tokenizer that extracts tokens either by using the provided regex pattern to split the text (the default) or by repeatedly matching the regex (if gaps is false). Optional parameters also allow filtering tokens by a minimum length. It returns an array of strings that can be empty.

ft_regex_tokenizer(
  x,
  input_col = NULL,
  output_col = NULL,
  gaps = TRUE,
  min_token_length = 1,
  pattern = "\\s+",
  to_lower_case = TRUE,
  uid = random_string("regex_tokenizer_"),
  ...
)
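The signature above is from sparklyr (R). For comparison, a minimal sketch of the same options through Spark ML's PySpark API (the column names are placeholders):

```python
from pyspark.ml.feature import RegexTokenizer

tokenizer = RegexTokenizer(
    inputCol="text",
    outputCol="tokens",
    pattern="\\s+",      # the regex used to split (gaps=True) or match (gaps=False)
    gaps=True,           # True: pattern splits the text; False: pattern matches tokens
    minTokenLength=1,    # drop tokens shorter than this
    toLowercase=True)

# tokens_df = tokenizer.transform(df)  # df: DataFrame with a string column "text"
```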

For more details, see the sparklyr and Spark API references below.

References

https://spark.apache.org/docs/2.1.0/ml-features.html#countvectorizer

https://towardsdatascience.com/the-random-forest-algorithm-d457d499ffcd

https://spark.rstudio.com/reference/ft_regex_tokenizer/

https://spark.apache.org/docs/1.6.2/api/java/org/apache/spark/ml/feature/RegexTokenizer.html
