
MADLIB-1351 : Added stopping criteria on perplexity to LDA #432

Merged (1 commit) Nov 18, 2019

Conversation

hpandeycodeit
Member

@hpandeycodeit hpandeycodeit commented Aug 27, 2019

LDA:
Added stopping criteria on perplexity to LDA.

MADLIB-1351

Currently, in LDA there are no stopping criteria. It runs for all the provided iterations.
This PR calculates the perplexity at each evaluation and stops iterating when the difference between the last two perplexity values is less than perplexity_tol.

These are the two new parameters added to the function:

evaluate_every      Integer,
perplexity_tol      Double Precision

And there is a change to the Model output table as well. It will have these two extra columns

perplexity  DOUBLE PRECISION[]
perplexity_iters INTEGER[]

where:
perplexity is an array of perplexity values computed as per the 'evaluate_every' parameter.
perplexity_iters is an array of the iteration numbers at which perplexity was calculated.
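
The stopping behavior described above can be sketched roughly as follows. This is an illustrative outline only, not the actual MADlib implementation; the helper names run_gibbs_iteration and compute_perplexity are hypothetical stand-ins.

```python
# Illustrative sketch of the perplexity-based early stopping described above.
# run_gibbs_iteration and compute_perplexity are hypothetical stand-ins for
# the actual MADlib internals.
def train_with_early_stop(iter_num, evaluate_every, perplexity_tol,
                          run_gibbs_iteration, compute_perplexity):
    perplexity = []        # maps to the new 'perplexity' output column
    perplexity_iters = []  # maps to the new 'perplexity_iters' output column
    for it in range(1, iter_num + 1):
        run_gibbs_iteration(it)
        if evaluate_every > 0 and it % evaluate_every == 0:
            perplexity.append(compute_perplexity())
            perplexity_iters.append(it)
            # Stop early once the last two perplexity values are within tol
            if len(perplexity) >= 2 and \
                    abs(perplexity[-1] - perplexity[-2]) < perplexity_tol:
                break
    return perplexity, perplexity_iters
```

For example, with iter_num = 10 and evaluate_every = 2, perplexity would be recorded at iterations 2, 4, 6, 8, 10 unless the tolerance is hit earlier.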

@asf-ci

asf-ci commented Aug 27, 2019

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/madlib-pr-build/1032/

@kaknikhil
Contributor

kaknikhil commented Aug 28, 2019

@hpandeycodeit
I haven't reviewed the code yet, but it looks like there aren't any tests for this PR. Can you add tests for all possible scenarios related to the changes made in this PR?
Make sure to cover all possible test cases for evaluate_every and perplexity.

@hpandeycodeit
Member Author

@hpandeycodeit
I haven't reviewed the code yet, but it looks like there aren't any tests for this PR. Can you add tests for all possible scenarios related to the changes made in this PR?
Make sure to cover all possible test cases for evaluate_every and perplexity.

@kaknikhil I will add the test cases to the PR soon.

@asf-ci

asf-ci commented Aug 31, 2019

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/madlib-pr-build/1037/

@asf-ci

asf-ci commented Sep 3, 2019

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/madlib-pr-build/1041/

# the Model and Output Table
if self.evaluate_every > 0:
    self.perplexity.append(
        get_perplexity('madlib',
Contributor

The schema should not be hard coded to 'madlib' in all the places that call get_perplexity. Use the schema_madlib variable instead.

Member Author

done

prep_string = ""
prep_itr_str = ""
if len(self.perplexity) > 0:
    prep_string = ", " + py_list_to_sql_string(self.perplexity)
Contributor

Use .format instead of +

Member Author

done

END;
$$ LANGUAGE plpgsql;

select assert(validate_perplexity() = TRUE, 'Perplexity calculation is wrong');
Contributor

missing new line

Member Author

done

'lda_training',
'lda_model',
'lda_output_data',
20, 5, 2, 10, 0.01, 2, .2);
Contributor

maybe add the column name as a comment after each of these numbers to make it more readable and also add a new line after each argument

Member Author

done

# JIRA: MADLIB-1351
# If the Perplexity_diff is less than the perplexity_tol,
# Stop the iteration
if self.perplexity_diff < self.perplexity_tol:
Contributor

We should also add a test case for this condition. Either unit test or dev check

'lda_output_data',
20, 5, 3, 10, 0.01, 1, .1);

SELECT assert(cardinality(perplexity) = 3, 'Perplexity calculation is wrong') from lda_model;
Contributor

I don't think the cardinality function is available in GPDB 4.3. If not, we should replace it with something like array_upper.

Member Author

done.

---------- TEST CASES FOR PERPLEXITY ----------

drop table if exists lda_model, lda_output_data;
SELECT lda_train(
Contributor

We should add a few more test cases. In all these cases we need to assert that we calculated the perplexity at the right iteration.

  1. no_of_iterations % evaluate_every != 0.
  2. both no_of_iters and evaluate_every = 1
  3. no_of_iterations % evaluate_every == 0 and no_of_iterations != evaluate_every
  4. Set evaluate_every to 0 and -1
  5. When perplexity_tol is reached before finishing all the iterations
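
Scenarios 1 through 4 above can be reasoned about with a small helper that predicts which iterations should record perplexity. This is an illustrative sketch under the simplifying assumption that perplexity is evaluated every evaluate_every iterations; it is not the PR's actual code.

```python
# Hypothetical helper for reasoning about the test scenarios above: which
# iterations would record perplexity for a given iter_num / evaluate_every,
# assuming perplexity is evaluated every 'evaluate_every' iterations.
def expected_perplexity_iters(iter_num, evaluate_every):
    if evaluate_every <= 0:
        return []  # scenario 4: evaluation disabled for 0 or negative values
    return [it for it in range(1, iter_num + 1) if it % evaluate_every == 0]

# Scenario 1 (iter_num % evaluate_every != 0): expected_perplexity_iters(10, 3)
# Scenario 2 (both equal to 1):                expected_perplexity_iters(1, 1)
# Scenario 3 (divisible but not equal):        expected_perplexity_iters(10, 2)
```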

Member Author

Added tests for 2 and 4. There are a few outstanding tests (1, 3 and 5) for which I need some more clarity. I will discuss them with you.

# JIRA: MADLIB-1351
# Calculate Perplexity every 'evaluate_every' iterations
# Skip the calculation at the first iteration, as the model generated
# at the first iteration is a random model
Contributor

I think we should be more verbose in this comment. Something like (but definitely not limited to)

For each iteration:

1. The model table is updated (for the first iteration, it is the random model; for iteration > 1, the model that is updated is the one learnt in the previous iteration).
2. __lda_count_topic_agg is called.
3. Then lda_gibbs_sample is called, which learns and updates the model (the updated model is not passed to Python; the learnt model is updated in the next iteration).

Because of this workflow we can safely ignore the first perplexity value.

Member Author

done

# Calculate Perplexity every 'evaluate_every' iterations
# Skip the calculation at the first iteration as the model generated
# at the first iteration is a random model
if it > self.evaluate_every and self.evaluate_every > 0 and (
Contributor

  1. We already assert that evaluate_every >= 0 (line 514), so we don't need to repeat this check.
  2. Unless I am missing something, the whole if check can be simplified by skipping the perplexity calculation when it == 0, instead of using both it and it - 1.
  3. We could move this code logic (lines 206 - 216) to its own function and unit test all the logic.
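
The simplification suggested in point 2 might look like the following sketch, assuming a zero-based iteration counter; should_evaluate is a hypothetical name for the extracted function proposed in point 3.

```python
# Sketch of the suggested simplification: skip the first (random) model and
# otherwise evaluate every 'evaluate_every' iterations. 'it' is assumed to be
# a zero-based iteration counter; should_evaluate is a hypothetical name.
def should_evaluate(it, evaluate_every):
    if evaluate_every <= 0:
        return False  # perplexity evaluation disabled
    if it == 0:
        return False  # the first model is random, so skip it
    return it % evaluate_every == 0
```

Extracting the condition into its own function like this also makes it trivially unit-testable, which is the point of the review comment.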

Member Author

This logic appends values to perplexity_iters as (it - 1): perplexity_iters[0] = it - 1.
Also, I moved the code to a separate function.

@@ -445,6 +511,12 @@ def lda_train(schema_madlib, train_table, model_table, output_data_table, voc_si
'invalid argument: positive real expected for alpha')
_assert(beta is not None and beta > 0,
'invalid argument: positive real expected for beta')
_assert(evaluate_every is not None and evaluate_every >= 0,
Contributor

The user docs for evaluate_every mention Set it to 0 or negative number to not evaluate perplexity in training at all but this check will throw an exception for evaluate_every < 0

Member Author

I have removed this check as we are not calculating the perplexity for 0 or -1.

@kaknikhil
Contributor

A few more general comments:

  1. The commit title should have the module name and not the JIRA number, i.e. LDA: Added stopping criteria on perplexity.
  2. The commit is missing details and the JIRA number. We should add a verbose commit message (including the motivation for excluding the first iteration when calculating perplexity).
  3. The URL for the JIRA in the PR message is incorrect. It points to the Apache MADlib pull request URL instead of the Apache MADlib JIRA.

@asf-ci

asf-ci commented Sep 6, 2019

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/madlib-pr-build/1053/

prep_string = ""
prep_itr_str = ""
if len(self.perplexity) > 1:
    prep_string = ", {0}".format(py_list_to_sql_string(self.perplexity))
Contributor

Can we give these two variables better names? What does prep mean (perplexity?)?

Member Author

changed the names here.

if it > self.evaluate_every and self.evaluate_every > 0 and (
        it - 1) % self.evaluate_every == 0:
    self.gen_output_data_table(work_table_in)
    perplexity = 0.0
Contributor

this line is not needed

Member Author

done.

perplexity = get_perplexity(self.schema_madlib,
                            self.model_table,
                            self.output_data_table)
self.perplexity_diff = abs(self.perplexity[
Contributor

@kaknikhil kaknikhil Sep 16, 2019

refactor self.perplexity[len(self.perplexity) - 1] as self.perplexity[-1]

Member Author

done.

@@ -288,3 +288,126 @@ CREATE OR REPLACE FUNCTION validate_lda_output() RETURNS integer AS $$
$$ LANGUAGE plpgsql;

select validate_lda_output();


---------- TEST CASES FOR PERPLEXITY ----------
Contributor

@kaknikhil kaknikhil Sep 16, 2019

consider adding a description at the beginning of each test case

Member Author

One-liner headings are already present for every test case. Let me know if you think adding more detail is a good idea.

'lda_training',
'lda_model',
'lda_output_data',
20, 5, 2, 10, 0.01, 2, .2);
Contributor

@kaknikhil kaknikhil Sep 16, 2019

same comment as before

maybe add the column name as a comment after each of these numbers to make it more readable and also add a new line after each argument

'lda_output_data',
20, 5, 2, 10, 0.01, 2, .2);

SELECT assert(perplexity_iters = '{2}', 'Number of Perplexity iterations are wrong') from lda_model;
Contributor

  1. We can also assert the length of the perplexity array.
  2. Since we cannot deterministically assert the perplexity value itself, we should at least assert that all the perplexity values are > 0.

Member Author

Added the test cases for the above as discussed.

.1 -- perplexity_tol
);

SELECT assert(array_upper(perplexity,1) = 3, 'Perplexity calculation is wrong') from lda_model;
Contributor

We should assert the value of perplexity_iters here, and also that all perplexity values are > 0.

Member Author

Added test for this as well.

.1 -- perplexity_tol
);

select assert(perplexity = '{}', 'Perplexity calculation is wrong') from lda_model;
Contributor

If evaluate_every=1, why do we expect the perplexity array to be empty ?

Member Author

Fixed this one.

@asf-ci

asf-ci commented Sep 26, 2019

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/madlib-pr-build/1086/

@kaknikhil
Contributor

kaknikhil commented Sep 30, 2019

@hpandeycodeit the jenkins build is failing for the latest commit. Can you take a look ?

@kaknikhil
Contributor

Can you also add a test for perplexity_tol ?

@hpandeycodeit
Member Author

Can you also add a test for perplexity_tol ?

fixed these.

@asf-ci

asf-ci commented Oct 1, 2019

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/madlib-pr-build/1089/

@asf-ci

asf-ci commented Oct 1, 2019

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/madlib-pr-build/1090/

@asf-ci

asf-ci commented Oct 1, 2019

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/madlib-pr-build/1091/

@asf-ci

asf-ci commented Oct 9, 2019

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/madlib-pr-build/1098/

select assert(abs(perplexity[2] - perplexity[1]) <10, 'Perplexity tol is less than the perplexity difference') from lda_model ;
Contributor

Why are we checking for < 10 if the tol is 100?

Contributor

I think we can add another assert to all the dev-check tests to check that all the perplexity values are unique. What do you think?

Member Author

Do you mean checking whether the number of calculated perplexity values matches the number of distinct perplexity values?
Fixed the other issues.

Contributor

No, I mean adding an assert to check that all the perplexity values are different.

Member Author

Added test case for distinct perplexity values as discussed.

<dt>evaluate_every</dt>
<dd>int, optional (default=0). How often to evaluate perplexity. Set it to 0 or a negative number to not evaluate perplexity during training at all. Evaluating perplexity can help you check convergence during training, but it will also increase total training time. Evaluating perplexity at every iteration might increase training time up to two-fold.</dd>
<dt>perplexity_tol</dt>
<dd>float, optional (default=1e-1). Perplexity tolerance to stop iterating. Only used when evaluate_every is greater than 0.</dd>
Contributor

maybe @fmcquillan99 can add a more verbose explanation here.

@@ -438,7 +444,9 @@ select assert(array_upper(perplexity_iters,1) <= 5, 'Perplexity iterations are d
select assert(perplexity[1] > 0, 'Perplexity value should be greater than 0') from lda_model;


-- Test to check if the perplexity_tol is greater than the difference between two perplexity iterations --
-- Test: If the difference between the last two iterations is less than the perplexity_tol, training will stop --
Contributor

Instead of saying "last two iterations" we can just say "If the perplexity difference between any two iterations is less than the perplexity_tol, we will stop training."

@asf-ci

asf-ci commented Oct 15, 2019

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/madlib-pr-build/1106/

@asf-ci

asf-ci commented Oct 15, 2019

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/madlib-pr-build/1109/

@fmcquillan99

(1)
Please add num_iterations to the output table. This is needed now because
we have a perplexity tolerance, so training may not run the maximum number of iterations
specified. The model table should look like:

model_table
...
model	BIGINT[]. The encoded model ...etc...
num_iterations	INTEGER. Number of iterations that training ran for,
which may be less than the maximum value specified in the parameter 'iter_num' if
the perplexity tolerance was reached.
perplexity	DOUBLE PRECISION[] Array of ...etc....
...

(2)
The parameter 'perplexity_tol' can be any value >= 0.0. Currently it errors out below a
value of 0.1, which is not correct. I may want to set it to 0.0 so that training runs
for the full number of iterations. So please change it to error out only if 'perplexity_tol' < 0.

DROP TABLE IF EXISTS lda_model_perp, lda_output_data_perp;

SELECT madlib.lda_train( 'documents_tf',          -- documents table in the form of term frequency
                         'lda_model_perp',        -- model table created by LDA training (not human readable)
                         'lda_output_data_perp',  -- readable output data table
                         103,                     -- vocabulary size
                         5,                       -- number of topics
                         10,                      -- number of iterations
                         5,                       -- Dirichlet prior for the per-doc topic multinomial (alpha)
                         0.01,                    -- Dirichlet prior for the per-topic word multinomial (beta)
                         2,                       -- Evaluate perplexity every 2 iterations
                         0.0                      -- Set tolerance to 0 so runs full number of iterations
                       );

produces

InternalError: (psycopg2.InternalError) plpy.Error: invalid argument: perplexity_tol should not be less than .1 (plpython.c:5038)
CONTEXT:  Traceback (most recent call last):
  PL/Python function "lda_train", line 22, in <module>
    voc_size, topic_num, iter_num, alpha, beta,evaluate_every , perplexity_tol)
  PL/Python function "lda_train", line 519, in lda_train
  PL/Python function "lda_train", line 96, in _assert
PL/Python function "lda_train"
 [SQL: "SELECT madlib.lda_train( 'documents_tf',          -- documents table in the form of term frequency\n                         'lda_model_perp',        -- model table created by LDA training (not human readable)\n                         'lda_output_data_perp',  -- readable output data table \n                         103,                     -- vocabulary size\n                         5,                       -- number of topics\n                         10,                      -- number of iterations\n                         5,                       -- Dirichlet prior for the per-doc topic multinomial (alpha)\n                         0.01,                    -- Dirichlet prior for the per-topic word multinomial (beta)\n                         2,                       -- Evaluate perplexity every 2 iterations\n                         0.0                      -- Set tolerance to 0 so runs full number of iterations\n                       );"]
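
A corrected validation along the lines requested might look like the following sketch. The _assert here is a local stand-in for the MADlib helper of the same name seen in the traceback, reimplemented for illustration.

```python
# Sketch of the requested fix: accept any perplexity_tol >= 0.0 (so that 0.0
# runs the full number of iterations) and error out only for negative values.
# _assert is a local stand-in for the MADlib helper of the same name.
def _assert(condition, msg):
    if not condition:
        raise ValueError(msg)

def validate_perplexity_tol(perplexity_tol):
    _assert(perplexity_tol is not None and perplexity_tol >= 0,
            'invalid argument: nonnegative real expected for perplexity_tol')
    return perplexity_tol
```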

@fmcquillan99

fmcquillan99 commented Oct 28, 2019

(3)
Last iteration value for perplexity does not match final perplexity value:

DROP TABLE IF EXISTS documents;
CREATE TABLE documents(docid INT4, contents TEXT);

INSERT INTO documents VALUES
(0, 'Statistical topic models are a class of Bayesian latent variable models, originally developed for analyzing the semantic content of large document corpora.'),
(1, 'By the late 1960s, the balance between pitching and hitting had swung in favor of the pitchers. In 1968 Carl Yastrzemski won the American League batting title with an average of just .301, the lowest in history.'),
(2, 'Machine learning is closely related to and often overlaps with computational statistics; a discipline that also specializes in prediction-making. It has strong ties to mathematical optimization, which deliver methods, theory and application domains to the field.'),
(3, 'California''s diverse geography ranges from the Sierra Nevada in the east to the Pacific Coast in the west, from the Redwood Douglas fir forests of the northwest, to the Mojave Desert areas in the southeast. The center of the state is dominated by the Central Valley, a major agricultural area.'),
(4, 'One of the many applications of Bayes'' theorem is Bayesian inference, a particular approach to statistical inference. When applied, the probabilities involved in Bayes'' theorem may have different probability interpretations. With the Bayesian probability interpretation the theorem expresses how a degree of belief, expressed as a probability, should rationally change to account for availability of related evidence. Bayesian inference is fundamental to Bayesian statistics.'),
(5, 'When data are unlabelled, supervised learning is not possible, and an unsupervised learning approach is required, which attempts to find natural clustering of the data to groups, and then map new data to these formed groups. The support-vector clustering algorithm, created by Hava Siegelmann and Vladimir Vapnik, applies the statistics of support vectors, developed in the support vector machines algorithm, to categorize unlabeled data, and is one of the most widely used clustering algorithms in industrial applications.'),
(6, 'Deep learning architectures such as deep neural networks, deep belief networks, recurrent neural networks and convolutional neural networks have been applied to fields including computer vision, speech recognition, natural language processing, audio recognition, social network filtering, machine translation, bioinformatics, drug design, medical image analysis, material inspection and board game programs, where they have produced results comparable to and in some cases superior to human experts.'),
(7, 'A multilayer perceptron is a class of feedforward artificial neural network. An MLP consists of at least three layers of nodes: an input layer, a hidden layer and an output layer. Except for the input nodes, each node is a neuron that uses a nonlinear activation function. MLP utilizes a supervised learning technique called backpropagation for training.'),
(8, 'In mathematics, an ellipse is a plane curve surrounding two focal points, such that for all points on the curve, the sum of the two distances to the focal points is a constant.'),
(9, 'In artificial neural networks, the activation function of a node defines the output of that node given an input or set of inputs.'),
(10, 'In mathematics, graph theory is the study of graphs, which are mathematical structures used to model pairwise relations between objects. A graph in this context is made up of vertices (also called nodes or points) which are connected by edges (also called links or lines). A distinction is made between undirected graphs, where edges link two vertices symmetrically, and directed graphs, where edges link two vertices asymmetrically; see Graph (discrete mathematics) for more detailed definitions and for other variations in the types of graph that are commonly considered. Graphs are one of the prime objects of study in discrete mathematics.'),
(11, 'A Rube Goldberg machine, named after cartoonist Rube Goldberg, is a machine intentionally designed to perform a simple task in an indirect and overly complicated way. Usually, these machines consist of a series of simple unrelated devices; the action of each triggers the initiation of the next, eventually resulting in achieving a stated goal.'),
(12, 'In statistics, the logistic model (or logit model) is used to model the probability of a certain class or event existing such as pass/fail, win/lose, alive/dead or healthy/sick. This can be extended to model several classes of events such as determining whether an image contains a cat, dog, lion, etc... Each object being detected in the image would be assigned a probability between 0 and 1 and the sum adding to one.'),
(13, 'k-means clustering is a method of vector quantization, originally from signal processing, that is popular for cluster analysis in data mining. k-means clustering aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean, serving as a prototype of the cluster.'),
(14, 'In pattern recognition, the k-nearest neighbors algorithm (k-NN) is a non-parametric method used for classification and regression.');


ALTER TABLE documents ADD COLUMN words TEXT[];

UPDATE documents SET words = 
    regexp_split_to_array(lower(
    regexp_replace(contents, E'[,.;\']','', 'g')
    ), E'[\\s+]');


DROP TABLE IF EXISTS documents_tf, documents_tf_vocabulary;

SELECT madlib.term_frequency('documents',    -- input table
                             'docid',        -- document id column
                             'words',        -- vector of words in document
                             'documents_tf', -- output documents table with term frequency
                             TRUE);          -- TRUE to create vocabulary table

Train

DROP TABLE IF EXISTS lda_model_perp, lda_output_data_perp;

SELECT madlib.lda_train( 'documents_tf',          -- documents table in the form of term frequency
                         'lda_model_perp',        -- model table created by LDA training (not human readable)
                         'lda_output_data_perp',  -- readable output data table 
                         384,                     -- vocabulary size
                         5,                        -- number of topics
                         100,                      -- number of iterations
                         5,                       -- Dirichlet prior for the per-doc topic multinomial (alpha)
                         0.01,                    -- Dirichlet prior for the per-topic word multinomial (beta)
                         1,                       -- Evaluate perplexity every n iterations
                         0.1                      -- Stopping perplexity tolerance
                       );

SELECT voc_size, topic_num, alpha, beta, perplexity, perplexity_iters from lda_model_perp;

-[ RECORD 1 ]----+--------------------------------------------------------------------------------------------------
voc_size         | 384
topic_num        | 5
alpha            | 5
beta             | 0.01
perplexity       | {195.764020671,194.317808815,193.208428811,188.2838923,188.384646897,189.849099875,189.939592275}
perplexity_iters | {1,2,3,4,5,6,7}

Predict on input data

DROP TABLE IF EXISTS outdata_predict_perp;

SELECT madlib.lda_predict( 'documents_tf',          -- Document to predict
                           'lda_model_perp',             -- LDA model from training
                           'outdata_predict_perp'                
                         );

SELECT madlib.lda_get_perplexity( 'lda_model_perp',
                                  'outdata_predict_perp'
                                );

-[ RECORD 1 ]------+-----------------
lda_get_perplexity | 192.569799335159

I would expect this to be 189.939592275 which is the last value in the array for perplexity at iteration 7.

@fmcquillan99

(4)
Unnecessary verbose output

DROP TABLE IF EXISTS documents_tf, documents_tf_vocabulary;

SELECT madlib.term_frequency('documents',    -- input table
                             'docid',        -- document id column
                             'words',        -- vector of words in document
                             'documents_tf', -- output documents table with term frequency
                             TRUE);          -- TRUE to create vocabulary table

produces

NOTICE:  Table doesn't have 'DISTRIBUTED BY' clause. Creating a NULL policy entry.
CONTEXT:  SQL statement "
                 CREATE TABLE documents_tf_vocabulary AS
                 SELECT (row_number() OVER (order by word))::INTEGER - 1 as wordid,
                        word::TEXT
                 FROM (
                    SELECT distinct(words) as word
                    FROM (
                          SELECT unnest(words::TEXT[]) as words
                          FROM documents
                    ) q1
                ) q2
                "
PL/Python function "term_frequency"
NOTICE:  One or more columns in the following table(s) do not have statistics: documents
HINT:  For non-partitioned tables, run analyze <table_name>(<column_list>). For partitioned tables, run analyze rootpartition <table_name>(<column_list>). See log for columns missing statistics.
CONTEXT:  SQL statement "
                 CREATE TABLE documents_tf_vocabulary AS
                 SELECT (row_number() OVER (order by word))::INTEGER - 1 as wordid,
                        word::TEXT
                 FROM (
                    SELECT distinct(words) as word
                    FROM (
                          SELECT unnest(words::TEXT[]) as words
                          FROM documents
                    ) q1
                ) q2
                "
PL/Python function "term_frequency"
NOTICE:  Table doesn't have 'DISTRIBUTED BY' clause -- Using column named 'docid' as the Greenplum Database data distribution key for this table.
HINT:  The 'DISTRIBUTED BY' clause determines the distribution of data. Make sure column(s) chosen are the optimal data distribution key to minimize skew.
CONTEXT:  SQL statement "
        CREATE TABLE documents_tf(
            docid INTEGER,
            wordid INTEGER,
            count INTEGER
        )
        "
PL/Python function "term_frequency"
NOTICE:  One or more columns in the following table(s) do not have statistics: documents
HINT:  For non-partitioned tables, run analyze <table_name>(<column_list>). For partitioned tables, run analyze rootpartition <table_name>(<column_list>). See log for columns missing statistics.
CONTEXT:  SQL statement "
        INSERT INTO documents_tf
            SELECT docid, w.wordid as wordid, word_count as count
            FROM (
                SELECT docid, word::TEXT, count(*) as word_count
                FROM
                (
                    SELECT docid, unnest(words::TEXT[]) as word
                    FROM documents
                    WHERE
                        docid IS NOT NULL
                ) q1
                GROUP BY docid, word
            ) q2
            
            , documents_tf_vocabulary as w
            WHERE
                q2.word = w.word
            
        "
PL/Python function "term_frequency"
                                      term_frequency                                      
------------------------------------------------------------------------------------------
 Term frequency output in table documents_tf, vocabulary in table documents_tf_vocabulary
(1 row)

Time: 206.233 ms

@hpandeycodeit
Member Author

DROP TABLE IF EXISTS documents_tf, documents_tf_vocabulary;

SELECT madlib.term_frequency('documents', -- input table
'docid', -- document id column
'words', -- vector of words in document
'documents_tf', -- output documents table with term frequency
TRUE);

@fmcquillan99 I don't see verbose output when I run the above query. Are you running it in GPDB or postgres?

postgres=# DROP TABLE IF EXISTS documents_tf, documents_tf_vocabulary;
DROP TABLE
postgres=# 
postgres=# SELECT madlib.term_frequency('documents',    -- input table
postgres(#                              'docid',        -- document id column
postgres(#                              'words',        -- vector of words in document
postgres(#                              'documents_tf', -- output documents table with term frequency
postgres(#                              TRUE);          
                                      term_frequency                                      
------------------------------------------------------------------------------------------
 Term frequency output in table documents_tf, vocabulary in table documents_tf_vocabulary
(1 row)

postgres=# 

@fmcquillan99

@hpandeycodeit I was running on GP5 from psql

@hpandeycodeit
Member Author

@hpandeycodeit I was running on GP5 from psql

So this is not in the LDA code; it is part of GPDB 5. If a table does not have statistics, GPDB prints messages about the missing stats. Once the stats are updated (run ANALYZE on these tables) and the SQL above is run again, these messages disappear.

@hpandeycodeit
Member Author

@fmcquillan99,

In lda_predict, although the model table remains the same, the output table is randomly initialized. That is why we see a difference between the perplexity values calculated in lda_train and by lda_get_perplexity().

However, if the same output table generated by lda_train is passed to the lda_get_perplexity() function, the perplexity values match. For example:

DROP TABLE IF EXISTS lda_model_perp, lda_output_data_perp;

SELECT madlib.lda_train( 'documents_tf',          -- documents table in the form of term frequency
                         'lda_model_perp',        -- model table created by LDA training (not human readable)
                         'lda_output_data_perp',  -- readable output data table 
                         385,                     -- vocabulary size
                         5,                        -- number of topics
                         10,                      -- number of iterations
                         5,                       -- Dirichlet prior for the per-doc topic multinomial (alpha)
                         0.01,                    -- Dirichlet prior for the per-topic word multinomial (beta)
                         1,                       -- Evaluate perplexity every n iterations
                         .2                      -- Stopping perplexity tolerance
                       );

Generates the following perplexity values with the last perplexity value 179.380131412:

postgres=# select perplexity from lda_model_perp ;
                                                                  perplexity                                                                  
----------------------------------------------------------------------------------------------------------------------------------------------
 {196.940707618,193.245742228,191.155602156,185.314159394,182.901929923,187.283749958,186.944341124,185.508311039,185.72038473,179.380131412}
(1 row)

Now running the get_perplexity() on the above-generated output table lda_output_data_perp produces the following perplexity:

postgres=# SELECT madlib.lda_get_perplexity( 'lda_model_perp',
postgres(#                                   'lda_output_data_perp'
postgres(#                                 );
 lda_get_perplexity 
--------------------
   179.380131412469

which matches the last perplexity value calculated by lda_train

Thanks!

@hpandeycodeit
Member Author

(1)
Please add num_iterations to the output table. This is needed now because
we have a perplexity tolerance, so training may not run the maximum number of iterations
specified. The model table should look like:

model_table
...
model	BIGINT[]. The encoded model ...etc...
num_iterations	INTEGER. Number of iterations that training ran for,
which may be less than the maximum value specified in the parameter 'iter_num' if
the perplexity tolerance was reached.
perplexity	DOUBLE PRECISION[] Array of ...etc....
...

(2)
The parameter 'perplexity_tol' can be any value >= 0.0. Currently it errors out below a
value of 0.1, which is not correct. I may want to set it to 0.0 so that training runs
for the full number of iterations. So please change it to error out only if 'perplexity_tol' < 0.

DROP TABLE IF EXISTS lda_model_perp, lda_output_data_perp;

SELECT madlib.lda_train( 'documents_tf',          -- documents table in the form of term frequency
                         'lda_model_perp',        -- model table created by LDA training (not human readable)
                         'lda_output_data_perp',  -- readable output data table
                         103,                     -- vocabulary size
                         5,                       -- number of topics
                         10,                      -- number of iterations
                         5,                       -- Dirichlet prior for the per-doc topic multinomial (alpha)
                         0.01,                    -- Dirichlet prior for the per-topic word multinomial (beta)
                         2,                       -- Evaluate perplexity every 2 iterations
                         0.0                      -- Set tolerance to 0 so runs full number of iterations
                       );

produces

InternalError: (psycopg2.InternalError) plpy.Error: invalid argument: perplexity_tol should not be less than .1 (plpython.c:5038)
CONTEXT:  Traceback (most recent call last):
  PL/Python function "lda_train", line 22, in <module>
    voc_size, topic_num, iter_num, alpha, beta,evaluate_every , perplexity_tol)
  PL/Python function "lda_train", line 519, in lda_train
  PL/Python function "lda_train", line 96, in _assert
PL/Python function "lda_train"
 [SQL: "SELECT madlib.lda_train( 'documents_tf',          -- documents table in the form of term frequency\n                         'lda_model_perp',        -- model table created by LDA training (not human readable)\n                         'lda_output_data_perp',  -- readable output data table \n                         103,                     -- vocabulary size\n                         5,                       -- number of topics\n                         10,                      -- number of iterations\n                         5,                       -- Dirichlet prior for the per-doc topic multinomial (alpha)\n                         0.01,                    -- Dirichlet prior for the per-topic word multinomial (beta)\n                         2,                       -- Evaluate perplexity every 2 iterations\n                         0.0                      -- Set tolerance to 0 so runs full number of iterations\n                       );"]

This is fixed.

@asf-ci

asf-ci commented Nov 1, 2019

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/madlib-pr-build/1123/

@fmcquillan99

(5)
The iteration number does not match on early termination:

-[ RECORD 1 ]----+----------------------------------------------------------
voc_size         | 384
topic_num        | 5
alpha            | 5
beta             | 0.01
num_iterations   | 5
perplexity       | {199.746367293,193.662852162,190.782567914,189.245695537}
perplexity_iters | {1,2,3,4}

Time: 38.941 ms

I think num_iterations should be 4?

@fmcquillan99

fmcquillan99 commented Nov 1, 2019

(6)
NULLs not being handled properly

DROP TABLE IF EXISTS lda_model_perp, lda_output_data_perp;

SELECT madlib.lda_train( 'documents_tf',          -- documents table in the form of term frequency
                         'lda_model_perp',        -- model table created by LDA training (not human readable)
                         'lda_output_data_perp',  -- readable output data table 
                         384,                     -- vocabulary size
                         5,                        -- number of topics
                         20,                      -- number of iterations
                         5,                       -- Dirichlet prior for the per-doc topic multinomial (alpha)
                         0.01,                    -- Dirichlet prior for the per-topic word multinomial (beta)
                         NULL,                    -- Evaluate perplexity every n iterations
                         NULL                     -- Stopping perplexity tolerance
                       );

InternalError: (psycopg2.InternalError) plpy.Error: invalid argument: perplexity_tol should not be less than 0 (plpython.c:5038)
CONTEXT:  Traceback (most recent call last):
  PL/Python function "lda_train", line 22, in <module>
    voc_size, topic_num, iter_num, alpha, beta,evaluate_every , perplexity_tol)
  PL/Python function "lda_train", line 525, in lda_train
  PL/Python function "lda_train", line 96, in _assert
PL/Python function "lda_train"
 [SQL: "SELECT madlib.lda_train( 'documents_tf',          -- documents table in the form of term frequency\n                         'lda_model_perp',        -- model table created by LDA training (not human readable)\n                         'lda_output_data_perp',  -- readable output data table \n                         384,                     -- vocabulary size\n                         5,                        -- number of topics\n                         20,                      -- number of iterations\n                         5,                       -- Dirichlet prior for the per-doc topic multinomial (alpha)\n                         0.01,                    -- Dirichlet prior for the per-topic word multinomial (beta)\n                         NULL,                       -- Evaluate perplexity every n iterations\n                         NULL                      -- Stopping perplexity tolerance\n                       );"]

Please implement as per

evaluate_every (optional)
INTEGER, default: 0. How often to evaluate perplexity. Set it to 0 or a negative number to not evaluate perplexity in training at all. Evaluating perplexity can help you check convergence during the training process, but it will also increase total training time. For example, evaluating perplexity in every iteration might increase training time up to two-fold.
perplexity_tol (optional)
DOUBLE PRECISION, default: 0.1. Perplexity tolerance to stop iteration. Only used when the parameter 'evaluate_every' is greater than 0.
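A minimal Python sketch of how that defaulting and validation could look. This is illustrative only, not MADlib's actual code: in PL/Python a SQL NULL arrives as None, and `validate_lda_args` is a hypothetical helper name.

```python
def validate_lda_args(evaluate_every, perplexity_tol):
    """Default NULL (None) arguments and validate them per the spec above."""
    # NULL from SQL arrives as None in PL/Python; fall back to the defaults
    evaluate_every = 0 if evaluate_every is None else evaluate_every
    perplexity_tol = 0.1 if perplexity_tol is None else perplexity_tol
    # Any non-negative tolerance is allowed, including 0.0
    if perplexity_tol < 0:
        raise ValueError("perplexity_tol must be >= 0")
    # evaluate_every <= 0 simply disables perplexity evaluation; no error
    return evaluate_every, perplexity_tol
```

With this shape, passing NULL for both parameters behaves the same as omitting them, which is the behavior requested in this thread.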

@hpandeycodeit
Member Author

(6) NULLs not being handled properly (quoting the comment above).

Fixed this and num_iterations.

@asf-ci

asf-ci commented Nov 4, 2019

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/madlib-pr-build/1128/

@fmcquillan99

fmcquillan99 commented Nov 4, 2019


Re-test after latest commits

(1)
Please add num_iterations to the output table. This is needed now because
we have a perplexity tolerance, so training may not run the maximum number of iterations
specified. The model table should look like:

model_table
...
model	BIGINT[]. The encoded model ...etc...
num_iterations	INTEGER. Number of iterations that training ran for,
which may be less than the maximum value specified in the parameter 'iter_num' if
the perplexity tolerance was reached.
perplexity	DOUBLE PRECISION[] Array of ...etc....
...

Now looks like:

-[ RECORD 1 ]----+--------------------------------------------
voc_size         | 384
topic_num        | 5
alpha            | 5
beta             | 0.01
num_iterations   | 9
perplexity       | {196.148467882,192.142777576,193.872066117}
perplexity_iters | {3,6,9}

OK

(2)
The parameter 'perplexity_tol' can be any value >= 0.0 Currently it errors out below a
value of 0.1 which is not correct. I may want to set it to 0.0 so that training runs
for the full number of iterations. So please change it to error out if 'perplexity_tol'<0.

DROP TABLE IF EXISTS lda_model_perp, lda_output_data_perp;

SELECT madlib.lda_train( 'documents_tf',          -- documents table in the form of term frequency
                         'lda_model_perp',        -- model table created by LDA training (not human readable)
                         'lda_output_data_perp',  -- readable output data table
                         103,                     -- vocabulary size
                         5,                       -- number of topics
                         10,                      -- number of iterations
                         5,                       -- Dirichlet prior for the per-doc topic multinomial (alpha)
                         0.01,                    -- Dirichlet prior for the per-topic word multinomial (beta)
                         2,                       -- Evaluate perplexity every 2 iterations
                         0.0                      -- Set tolerance to 0 so runs full number of iterations
                       );

produces

-[ RECORD 1 ]----+--------------------------------------------------------------------------------------------------------------------------------------------
voc_size         | 384
topic_num        | 5
alpha            | 5
beta             | 0.01
num_iterations   | 20
perplexity       | {191.992070922,188.198782019,187.433873268,184.973287318,184.491077644,176.27420008,180.63646659,180.456641184,179.574266867,179.152413582}
perplexity_iters | {2,4,6,8,10,12,14,16,18,20}

OK

(3)
Last iteration value for perplexity does not match final perplexity value:

DROP TABLE IF EXISTS documents;
CREATE TABLE documents(docid INT4, contents TEXT);

INSERT INTO documents VALUES
(0, 'Statistical topic models are a class of Bayesian latent variable models, originally developed for analyzing the semantic content of large document corpora.'),
(1, 'By the late 1960s, the balance between pitching and hitting had swung in favor of the pitchers. In 1968 Carl Yastrzemski won the American League batting title with an average of just .301, the lowest in history.'),
(2, 'Machine learning is closely related to and often overlaps with computational statistics; a discipline that also specializes in prediction-making. It has strong ties to mathematical optimization, which deliver methods, theory and application domains to the field.'),
(3, 'California''s diverse geography ranges from the Sierra Nevada in the east to the Pacific Coast in the west, from the Redwood Douglas fir forests of the northwest, to the Mojave Desert areas in the southeast. The center of the state is dominated by the Central Valley, a major agricultural area.'),
(4, 'One of the many applications of Bayes'' theorem is Bayesian inference, a particular approach to statistical inference. When applied, the probabilities involved in Bayes'' theorem may have different probability interpretations. With the Bayesian probability interpretation the theorem expresses how a degree of belief, expressed as a probability, should rationally change to account for availability of related evidence. Bayesian inference is fundamental to Bayesian statistics.'),
(5, 'When data are unlabelled, supervised learning is not possible, and an unsupervised learning approach is required, which attempts to find natural clustering of the data to groups, and then map new data to these formed groups. The support-vector clustering algorithm, created by Hava Siegelmann and Vladimir Vapnik, applies the statistics of support vectors, developed in the support vector machines algorithm, to categorize unlabeled data, and is one of the most widely used clustering algorithms in industrial applications.'),
(6, 'Deep learning architectures such as deep neural networks, deep belief networks, recurrent neural networks and convolutional neural networks have been applied to fields including computer vision, speech recognition, natural language processing, audio recognition, social network filtering, machine translation, bioinformatics, drug design, medical image analysis, material inspection and board game programs, where they have produced results comparable to and in some cases superior to human experts.'),
(7, 'A multilayer perceptron is a class of feedforward artificial neural network. An MLP consists of at least three layers of nodes: an input layer, a hidden layer and an output layer. Except for the input nodes, each node is a neuron that uses a nonlinear activation function. MLP utilizes a supervised learning technique called backpropagation for training.'),
(8, 'In mathematics, an ellipse is a plane curve surrounding two focal points, such that for all points on the curve, the sum of the two distances to the focal points is a constant.'),
(9, 'In artificial neural networks, the activation function of a node defines the output of that node given an input or set of inputs.'),
(10, 'In mathematics, graph theory is the study of graphs, which are mathematical structures used to model pairwise relations between objects. A graph in this context is made up of vertices (also called nodes or points) which are connected by edges (also called links or lines). A distinction is made between undirected graphs, where edges link two vertices symmetrically, and directed graphs, where edges link two vertices asymmetrically; see Graph (discrete mathematics) for more detailed definitions and for other variations in the types of graph that are commonly considered. Graphs are one of the prime objects of study in discrete mathematics.'),
(11, 'A Rube Goldberg machine, named after cartoonist Rube Goldberg, is a machine intentionally designed to perform a simple task in an indirect and overly complicated way. Usually, these machines consist of a series of simple unrelated devices; the action of each triggers the initiation of the next, eventually resulting in achieving a stated goal.'),
(12, 'In statistics, the logistic model (or logit model) is used to model the probability of a certain class or event existing such as pass/fail, win/lose, alive/dead or healthy/sick. This can be extended to model several classes of events such as determining whether an image contains a cat, dog, lion, etc... Each object being detected in the image would be assigned a probability between 0 and 1 and the sum adding to one.'),
(13, 'k-means clustering is a method of vector quantization, originally from signal processing, that is popular for cluster analysis in data mining. k-means clustering aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean, serving as a prototype of the cluster.'),
(14, 'In pattern recognition, the k-nearest neighbors algorithm (k-NN) is a non-parametric method used for classification and regression.');


ALTER TABLE documents ADD COLUMN words TEXT[];

UPDATE documents SET words =
    regexp_split_to_array(lower(
    regexp_replace(contents, E'[,.;\']','', 'g')
    ), E'[\\s+]');


DROP TABLE IF EXISTS documents_tf, documents_tf_vocabulary;

SELECT madlib.term_frequency('documents',    -- input table
                             'docid',        -- document id column
                             'words',        -- vector of words in document
                             'documents_tf', -- output documents table with term frequency
                             TRUE);          -- TRUE to create vocabulary table

Train

DROP TABLE IF EXISTS lda_model_perp, lda_output_data_perp;

SELECT madlib.lda_train( 'documents_tf',          -- documents table in the form of term frequency
                         'lda_model_perp',        -- model table created by LDA training (not human readable)
                         'lda_output_data_perp',  -- readable output data table
                         384,                     -- vocabulary size
                         5,                        -- number of topics
                         100,                      -- number of iterations
                         5,                       -- Dirichlet prior for the per-doc topic multinomial (alpha)
                         0.01,                    -- Dirichlet prior for the per-topic word multinomial (beta)
                         1,                       -- Evaluate perplexity every n iterations
                         0.1                      -- Stopping perplexity tolerance
                       );

SELECT voc_size, topic_num, alpha, beta, perplexity, perplexity_iters from lda_model_perp;

-[ RECORD 1 ]----+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
voc_size         | 384
topic_num        | 5
alpha            | 5
beta             | 0.01
num_iterations   | 16
perplexity       | {195.582090721,192.071728778,191.048336558,194.186905186,195.150503634,191.566207005,191.199131632,185.533220287,189.910983656,184.981903783,185.753724338,183.043524383,189.125703696,191.460991339,189.193774612,189.182916247}
perplexity_iters | {1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16}

Perplexity on input data

SELECT madlib.lda_get_perplexity( 'lda_model_perp',
                                  'lda_output_data_perp'
                                );

 lda_get_perplexity 
--------------------
   189.182916246556
(1 row)

which matches the last value in the array for the training function.

OK

(6) still has an issue

DROP TABLE IF EXISTS lda_model_perp, lda_output_data_perp;

SELECT madlib.lda_train( 'documents_tf',          -- documents table in the form of term frequency
                         'lda_model_perp',        -- model table created by LDA training (not human readable)
                         'lda_output_data_perp',  -- readable output data table 
                         384,                     -- vocabulary size
                         5,                        -- number of topics
                         20,                      -- number of iterations
                         5,                       -- Dirichlet prior for the per-doc topic multinomial (alpha)
                         0.01,                    -- Dirichlet prior for the per-topic word multinomial (beta)
                         2                       -- Evaluate perplexity every n iterations
                       );

Done.
(psycopg2.ProgrammingError) function madlib.lda_train(unknown, unknown, unknown, integer, integer, integer, integer, numeric, integer) does not exist
LINE 1: SELECT madlib.lda_train( 'documents_tf',          -- documen...
               ^
HINT:  No function matches the given name and argument types. You might need to add explicit type casts.
 [SQL: "SELECT madlib.lda_train( 'documents_tf',          -- documents table in the form of term frequency\n                         'lda_model_perp',        -- model table created by LDA training (not human readable)\n                         'lda_output_data_perp',  -- readable output data table \n                         384,                     -- vocabulary size\n                         5,                        -- number of topics\n                         20,                      -- number of iterations\n                         5,                       -- Dirichlet prior for the per-doc topic multinomial (alpha)\n                         0.01,                    -- Dirichlet prior for the per-topic word multinomial (beta)\n                         2                       -- Evaluate perplexity every n iterations\n                       );"]

This should be the same results as:

DROP TABLE IF EXISTS lda_model_perp, lda_output_data_perp;

SELECT madlib.lda_train( 'documents_tf',          -- documents table in the form of term frequency
                         'lda_model_perp',        -- model table created by LDA training (not human readable)
                         'lda_output_data_perp',  -- readable output data table 
                         384,                     -- vocabulary size
                         5,                        -- number of topics
                         20,                      -- number of iterations
                         5,                       -- Dirichlet prior for the per-doc topic multinomial (alpha)
                         0.01,                    -- Dirichlet prior for the per-topic word multinomial (beta)
                         2,                       -- Evaluate perplexity every n iterations
                         NULL
                       );

which actually does work if you put NULL for the last param.

@asf-ci

asf-ci commented Nov 5, 2019

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/madlib-pr-build/1129/

@asf-ci

asf-ci commented Nov 5, 2019

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/madlib-pr-build/1130/

@hpandeycodeit
Member Author

@fmcquillan99 Fixed the issue with the Null handling on the last param.

@kaknikhil
Contributor

@hpandeycodeit
We should test cases for the following scenarios (not sure if we already have tests for some of these) :

  1. If evaluate_every <=0, assert that there are no perplexity values.
  2. If tolerance == 0, assert that we don't stop early.
  3. All permutations of the interface with evaluate_every and tolerance being passed as NULL and/or not passed at all to make sure we default the values as expected.
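Scenarios 1 and 2 can be illustrated against a toy version of the stopping rule this PR adds (stop when the last two perplexity values differ by less than the tolerance). `toy_train` below is a self-contained stand-in with synthetic perplexity values, not the MADlib implementation, and the real checks live in the SQL install-check tests:

```python
def toy_train(iters, evaluate_every, tol):
    """Toy stand-in for lda_train's stopping rule. Perplexity is synthetic
    (200.0 minus the iteration number, so it strictly decreases); training
    stops early when the last two evaluated values differ by less than tol."""
    perplexity = []
    num_iterations = 0
    for it in range(1, iters + 1):
        num_iterations = it
        if evaluate_every > 0 and it % evaluate_every == 0:
            perplexity.append(200.0 - it)  # synthetic perplexity value
            if (len(perplexity) >= 2
                    and abs(perplexity[-2] - perplexity[-1]) < tol):
                break
    return num_iterations, perplexity

# 1. evaluate_every <= 0: no perplexity values are recorded
assert toy_train(10, 0, 0.1) == (10, [])
# 2. tol == 0: never stops early, since abs(...) < 0 is never true
assert toy_train(10, 1, 0.0)[0] == 10
# For contrast, a loose tolerance stops at the second evaluation
assert toy_train(10, 1, 2.0)[0] == 2
```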

@fmcquillan99

I checked (6) after the last commit and it works now.

So LGTM on functionality.

@asf-ci

asf-ci commented Nov 5, 2019

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/madlib-pr-build/1131/

@asf-ci

asf-ci commented Nov 5, 2019

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/madlib-pr-build/1133/

@hpandeycodeit
Member Author

Regarding the test scenarios listed above:

Added the test cases for 2 and 3. There was already a test case covering scenario 1.

@asf-ci

asf-ci commented Nov 7, 2019

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/madlib-pr-build/1141/

@asf-ci

asf-ci commented Nov 7, 2019

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/madlib-pr-build/1142/

@fmcquillan99

LGTM

@@ -474,3 +474,89 @@ select assert(array_upper(perplexity_iters,1) = 2, 'Perplexity iterations are d
select assert(perplexity[1] > 0 , 'Perplexity value should be greater than 0') from lda_model ;
select assert(array_upper(ARRAY(Select distinct unnest(perplexity)),1)= array_upper(perplexity,1) , 'Perplexity values should be unique') from lda_model ;


-- Test for evaluate_every = 1 and 0 : In this case the iterations should not stop early --
Contributor

@hpandeycodeit
I can't find the test for evaluate_every = 0. Am i missing something ?

Member Author

When evaluate_every = NULL, it takes the default evaluate_every = 0, and in that case we don't calculate perplexity. We already have a test case covering evaluate_every = NULL.

Prior to this commit, in LDA there are no stopping criteria. It runs for
all the provided iterations. This commit calculates the perplexity on
each iteration and when the difference between the last two perplexity
values is less than the perplexity_tol, it stops the iteration.

These are the two new parameters added to the function:

```
evaluate_every  INTEGER,
perplexity_tol  DOUBLE PRECISION
```

Also, there is a change to the model output table. The following new
columns are added:

1. perplexity (DOUBLE PRECISION[]): an array of perplexity values, evaluated
as per the 'evaluate_every' parameter.
2. perplexity_iters (INTEGER[]): an array indicating the iterations for
which perplexity was calculated.
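The iteration and stopping logic described in this commit message can be sketched as follows. This is a hedged sketch, not MADlib's actual PL/Python code: `run_iteration` and `compute_perplexity` are hypothetical callables standing in for the Gibbs-sampling step and the perplexity computation.

```python
def train_lda(iter_num, evaluate_every, perplexity_tol,
              run_iteration, compute_perplexity):
    """Run up to iter_num iterations, evaluating perplexity every
    'evaluate_every' iterations, and stop early when the difference
    between the last two perplexity values drops below perplexity_tol."""
    perplexity = []        # evaluated perplexity values
    perplexity_iters = []  # iterations at which perplexity was evaluated
    num_iterations = 0
    for it in range(1, iter_num + 1):
        run_iteration()
        num_iterations = it
        if evaluate_every > 0 and it % evaluate_every == 0:
            perplexity.append(compute_perplexity())
            perplexity_iters.append(it)
            # stopping criterion: last two perplexity values within tol
            if (len(perplexity) >= 2
                    and abs(perplexity[-2] - perplexity[-1]) < perplexity_tol):
                break
    return num_iterations, perplexity, perplexity_iters
```

Note that num_iterations reports the iterations actually run, which may be fewer than iter_num if the tolerance was reached, matching the output-table change discussed in this PR.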
@asf-ci

asf-ci commented Nov 18, 2019

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/madlib-pr-build/1154/

@khannaekta khannaekta merged commit 5a1717e into apache:master Nov 18, 2019