
MADLIB-1351 : Added stopping criteria on perplexity to LDA #432

Merged (1 commit) Nov 18, 2019

Conversation

hpandeycodeit
Member

@hpandeycodeit hpandeycodeit commented Aug 27, 2019

LDA:
Added stopping criteria on perplexity to LDA.

MADLIB-1351

Currently, in LDA there are no stopping criteria. It runs for all the provided iterations.
This PR calculates the perplexity at each evaluation and stops iterating when the difference between the last two perplexity values is less than perplexity_tol.

These are the two new parameters added to the function:

evaluate_every      Integer,
perplexity_tol      Double Precision

And there is a change to the Model output table as well. It will have these two extra columns

perplexity  DOUBLE PRECISION[]
perplexity_iters INTEGER[]

where:
perplexity is an array of perplexity values computed as per the 'evaluate_every' parameter.
perplexity_iters is an array of the iteration numbers at which perplexity was calculated.
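
The stopping behavior described above can be sketched roughly as follows. This is an illustrative outline only, not the actual MADlib implementation; the helper names run_gibbs_iteration and compute_perplexity are hypothetical stand-ins.

```python
# Illustrative sketch of the perplexity-based early stopping described above.
# run_gibbs_iteration and compute_perplexity are hypothetical stand-ins for
# the actual MADlib internals.
def train_with_early_stop(iter_num, evaluate_every, perplexity_tol,
                          run_gibbs_iteration, compute_perplexity):
    perplexity = []        # maps to the new 'perplexity' output column
    perplexity_iters = []  # maps to the new 'perplexity_iters' output column
    for it in range(1, iter_num + 1):
        run_gibbs_iteration(it)
        if evaluate_every > 0 and it % evaluate_every == 0:
            perplexity.append(compute_perplexity())
            perplexity_iters.append(it)
            # Stop early once the last two perplexity values are within tol
            if len(perplexity) >= 2 and \
                    abs(perplexity[-1] - perplexity[-2]) < perplexity_tol:
                break
    return perplexity, perplexity_iters
```

For example, with iter_num = 10 and evaluate_every = 2, perplexity would be recorded at iterations 2, 4, 6, 8, 10 unless the tolerance is hit earlier.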

@asf-ci

asf-ci commented Aug 27, 2019

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/madlib-pr-build/1032/

@kaknikhil
Contributor

kaknikhil commented Aug 28, 2019

@hpandeycodeit
I haven't reviewed the code yet, but it looks like there aren't any tests for this PR. Can you add tests for all possible scenarios related to the changes made in this PR?
Make sure to cover all possible test cases for evaluate_every and perplexity.

@hpandeycodeit
Member Author

@hpandeycodeit
I haven't reviewed the code yet, but it looks like there aren't any tests for this PR. Can you add tests for all possible scenarios related to the changes made in this PR?
Make sure to cover all possible test cases for evaluate_every and perplexity.

@kaknikhil I will add the test cases to the PR soon.

@asf-ci

asf-ci commented Aug 31, 2019

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/madlib-pr-build/1037/

@asf-ci

asf-ci commented Sep 3, 2019

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/madlib-pr-build/1041/

# the Model and Output Table
if self.evaluate_every > 0:
    self.perplexity.append(
        get_perplexity('madlib',
Contributor

The schema should not be hard coded to 'madlib' in all the places that call get_perplexity. Use the schema_madlib variable instead.

Member Author

done

prep_string = ""
prep_itr_str = ""
if len(self.perplexity) > 0:
    prep_string = ", " + py_list_to_sql_string(self.perplexity)
Contributor

Use .format instead of +

Member Author

done

END;
$$ LANGUAGE plpgsql;

select assert(validate_perplexity() = TRUE, 'Perplexity calculation is wrong');
Contributor

missing new line

Member Author

done

'lda_training',
'lda_model',
'lda_output_data',
20, 5, 2, 10, 0.01, 2, .2);
Contributor

maybe add the column name as a comment after each of these numbers to make it more readable and also add a new line after each argument

Member Author

done

# JIRA: MADLIB-1351
# If the Perplexity_diff is less than the perplexity_tol,
# Stop the iteration
if self.perplexity_diff < self.perplexity_tol:
Contributor

We should also add a test case for this condition. Either unit test or dev check

'lda_output_data',
20, 5, 3, 10, 0.01, 1, .1);

SELECT assert(cardinality(perplexity) = 3, 'Perplexity calculation is wrong') from lda_model;
Contributor

I don't think the cardinality function is available in GPDB 4.3. If not, we should replace it with something like array_upper.

Member Author

done.

---------- TEST CASES FOR PERPLEXITY ----------

drop table if exists lda_model, lda_output_data;
SELECT lda_train(
Contributor

We should add a few more test cases. In all these cases we need to assert that we calculated the perplexity at the right iteration.

  1. no_of_iterations % evaluate_every != 0.
  2. both no_of_iters and evaluate_every = 1
  3. no_of_iterations % evaluate_every == 0 and no_of_iterations != evaluate_every
  4. Set evaluate_every to 0 and -1
  5. When perplexity_tol is reached before finishing all the iterations
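
Scenarios 1 through 4 above can be reasoned about with a small helper that predicts which iterations should record perplexity. This is an illustrative sketch under the simplifying assumption that perplexity is evaluated every evaluate_every iterations; it is not the PR's actual code.

```python
# Hypothetical helper for reasoning about the test scenarios above: which
# iterations would record perplexity for a given iter_num / evaluate_every,
# assuming perplexity is evaluated every 'evaluate_every' iterations.
def expected_perplexity_iters(iter_num, evaluate_every):
    if evaluate_every <= 0:
        return []  # scenario 4: evaluation disabled for 0 or negative values
    return [it for it in range(1, iter_num + 1) if it % evaluate_every == 0]

# Scenario 1 (iter_num % evaluate_every != 0): expected_perplexity_iters(10, 3)
# Scenario 2 (both equal to 1):                expected_perplexity_iters(1, 1)
# Scenario 3 (divisible but not equal):        expected_perplexity_iters(10, 2)
```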

Member Author

Added tests for 2 and 4. There are a few outstanding tests (1, 3 and 5) for which I need some more clarity. I will discuss them with you.

# JIRA: MADLIB-1351
# Calculate Perplexity every 'evaluate_every' iterations
# Skip the calculation at the first iteration, as the model generated
# at the first iteration is a random model
Contributor

I think we should be more verbose in this comment. Something like (but definitely not limited to)

For each iteration:

1. The model table is updated (for the first iteration, it is the random model; for iteration > 1, the model that is updated is the one learnt in the previous iteration).
2. __lda_count_topic_agg is called.
3. Then lda_gibbs_sample is called, which learns and updates the model (the updated model is not passed to Python; the learnt model is updated in the next iteration).

Because of this workflow we can safely ignore the first perplexity value.

Member Author

done

# Calculate Perplexity every 'evaluate_every' iterations
# Skip the calculation at the first iteration as the model generated
# at the first iteration is a random model
if it > self.evaluate_every and self.evaluate_every > 0 and (
Contributor

  1. We already assert that evaluate_every >= 0 (line 514), so we don't need to repeat this check.
  2. Unless I am missing something, the whole if check can be simplified by skipping the perplexity calculation when it == 0, instead of using both it and it - 1.
  3. We could move this code logic (lines 206 - 216) to its own function and unit test all the logic.
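
The simplification suggested in point 2 might look like the following sketch, assuming a zero-based iteration counter; should_evaluate is a hypothetical name for the extracted function proposed in point 3.

```python
# Sketch of the suggested simplification: skip the first (random) model and
# otherwise evaluate every 'evaluate_every' iterations. 'it' is assumed to be
# a zero-based iteration counter; should_evaluate is a hypothetical name.
def should_evaluate(it, evaluate_every):
    if evaluate_every <= 0:
        return False  # perplexity evaluation disabled
    if it == 0:
        return False  # the first model is random, so skip it
    return it % evaluate_every == 0
```

Extracting the condition into its own function like this also makes it trivially unit-testable, which is the point of the review comment.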

Member Author

This logic appends values to perplexity_iters as (it - 1): perplexity_iters[0] = it - 1.
Also, I moved the code to a separate function.

@@ -445,6 +511,12 @@ def lda_train(schema_madlib, train_table, model_table, output_data_table, voc_si
'invalid argument: positive real expected for alpha')
_assert(beta is not None and beta > 0,
'invalid argument: positive real expected for beta')
_assert(evaluate_every is not None and evaluate_every >= 0,
Contributor

The user docs for evaluate_every mention Set it to 0 or negative number to not evaluate perplexity in training at all but this check will throw an exception for evaluate_every < 0

Member Author

I have removed this check as we are not calculating the perplexity for 0 or -1.

@kaknikhil
Contributor

A few more general comments:

  1. The commit title should have the module name and not the JIRA number, i.e. LDA: Added stopping criteria on perplexity.
  2. The commit is missing details and the JIRA number. We should add a verbose commit message (including the motivation for excluding the first iteration when calculating perplexity).
  3. The URL for the JIRA in the PR message is incorrect. It points to the Apache MADlib pull request URL instead of the Apache MADlib JIRA.

@asf-ci

asf-ci commented Sep 6, 2019

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/madlib-pr-build/1053/

prep_string = ""
prep_itr_str = ""
if len(self.perplexity) > 1:
    prep_string = ", {0}".format(py_list_to_sql_string(self.perplexity))
Contributor

Can we give these two variables better names? What does prep mean (perplexity?)?

Member Author

changed the names here.

if it > self.evaluate_every and self.evaluate_every > 0 and (
        it - 1) % self.evaluate_every == 0:
    self.gen_output_data_table(work_table_in)
    perplexity = 0.0
Contributor

this line is not needed

Member Author

done.

perplexity = get_perplexity(self.schema_madlib,
                            self.model_table,
                            self.output_data_table)
self.perplexity_diff = abs(self.perplexity[
Contributor

@kaknikhil kaknikhil Sep 16, 2019

refactor self.perplexity[len(self.perplexity) - 1] as self.perplexity[-1]

Member Author

done.

@@ -288,3 +288,126 @@ CREATE OR REPLACE FUNCTION validate_lda_output() RETURNS integer AS $$
$$ LANGUAGE plpgsql;

select validate_lda_output();


---------- TEST CASES FOR PERPLEXITY ----------
Contributor

@kaknikhil kaknikhil Sep 16, 2019

consider adding a description at the beginning of each test case

Member Author

One-liner headings are already present for every test case. Let me know if you think adding more detail is a good idea.

'lda_training',
'lda_model',
'lda_output_data',
20, 5, 2, 10, 0.01, 2, .2);
Contributor

@kaknikhil kaknikhil Sep 16, 2019

same comment as before

maybe add the column name as a comment after each of these numbers to make it more readable and also add a new line after each argument

'lda_output_data',
20, 5, 2, 10, 0.01, 2, .2);

SELECT assert(perplexity_iters = '{2}', 'Number of Perplexity iterations are wrong') from lda_model;
Contributor

  1. We can also assert the length of the perplexity array.
  2. Since we cannot deterministically assert the perplexity value itself, we should at least assert that all the perplexity values are > 0.

Member Author

Added the test cases for the above as discussed.

.1 -- perplexity_tol
);

SELECT assert(array_upper(perplexity,1) = 3, 'Perplexity calculation is wrong') from lda_model;
Contributor

We should assert the value of perplexity_iters here, and also that all perplexity values are > 0.

Member Author

Added test for this as well.

.1 -- perplexity_tol
);

select assert(perplexity = '{}', 'Perplexity calculation is wrong') from lda_model;
Contributor

If evaluate_every=1, why do we expect the perplexity array to be empty ?

Member Author

Fixed this one.

@asf-ci

asf-ci commented Sep 26, 2019

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/madlib-pr-build/1086/

@kaknikhil
Contributor

kaknikhil commented Sep 30, 2019

@hpandeycodeit the jenkins build is failing for the latest commit. Can you take a look ?

@kaknikhil
Contributor

Can you also add a test for perplexity_tol ?

@hpandeycodeit
Member Author

Can you also add a test for perplexity_tol ?

fixed these.

@asf-ci

asf-ci commented Oct 1, 2019

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/madlib-pr-build/1089/

@asf-ci

asf-ci commented Oct 1, 2019

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/madlib-pr-build/1090/

@asf-ci

asf-ci commented Oct 1, 2019

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/madlib-pr-build/1091/

@asf-ci

asf-ci commented Oct 9, 2019

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/madlib-pr-build/1098/

select assert(abs(perplexity[2] - perplexity[1]) <10, 'Perplexity tol is less than the perplexity difference') from lda_model ;
Contributor

Why are we checking for < 10 if the tol is 100?

Contributor

I think we can add another assert to all the dev-check tests to check that all the perplexity values are unique. What do you think?

Member Author

Do you mean checking whether the number of calculated perplexity values matches the number of distinct perplexity values?
Fixed the other issues.

Contributor

No, I mean adding an assert to check that all the perplexity values are different.

Member Author

Added test case for distinct perplexity values as discussed.

<dt>evaluate_every</dt>
<dd>int, optional (default=0). How often to evaluate perplexity. Set it to 0 or a negative number to not evaluate perplexity during training at all. Evaluating perplexity can help you check convergence during training, but it will also increase total training time. Evaluating perplexity at every iteration might increase training time up to two-fold.</dd>
<dt>perplexity_tol</dt>
<dd>float, optional (default=1e-1). Perplexity tolerance to stop iterating. Only used when evaluate_every is greater than 0.</dd>
Contributor

maybe @fmcquillan99 can add a more verbose explanation here.

@@ -438,7 +444,9 @@ select assert(array_upper(perplexity_iters,1) <= 5, 'Perplexity iterations are d
select assert(perplexity[1] > 0, 'Perplexity value should be greater than 0') from lda_model;


-- Test to check if the perplexity_tol is greater than the difference between two perplexity iterations --
-- Test: If the difference between the last two iterations is less than the perplexity_tol, training will stop --
Contributor

Instead of saying "last two iterations" we can just say "If the perplexity difference between any two iterations is less than the perplexity_tol, we will stop training."

@asf-ci

asf-ci commented Oct 15, 2019

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/madlib-pr-build/1106/

@asf-ci

asf-ci commented Oct 15, 2019

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/madlib-pr-build/1109/

@fmcquillan99

(1)
Please add num_iterations to the output table. This is needed now because
we have a perplexity tolerance, so training may not run the maximum number of iterations
specified. The model table should look like:

model_table
...
model	BIGINT[]. The encoded model ...etc...
num_iterations	INTEGER. Number of iterations that training ran for,
which may be less than the maximum value specified in the parameter 'iter_num' if
the perplexity tolerance was reached.
perplexity	DOUBLE PRECISION[] Array of ...etc....
...

(2)
The parameter 'perplexity_tol' can be any value >= 0.0. Currently it errors out below a
value of 0.1, which is not correct. I may want to set it to 0.0 so that training runs
for the full number of iterations. So please change it to error out only if 'perplexity_tol' < 0.

DROP TABLE IF EXISTS lda_model_perp, lda_output_data_perp;

SELECT madlib.lda_train( 'documents_tf',          -- documents table in the form of term frequency
                         'lda_model_perp',        -- model table created by LDA training (not human readable)
                         'lda_output_data_perp',  -- readable output data table
                         103,                     -- vocabulary size
                         5,                       -- number of topics
                         10,                      -- number of iterations
                         5,                       -- Dirichlet prior for the per-doc topic multinomial (alpha)
                         0.01,                    -- Dirichlet prior for the per-topic word multinomial (beta)
                         2,                       -- Evaluate perplexity every 2 iterations
                         0.0                      -- Set tolerance to 0 so runs full number of iterations
                       );

produces

InternalError: (psycopg2.InternalError) plpy.Error: invalid argument: perplexity_tol should not be less than .1 (plpython.c:5038)
CONTEXT:  Traceback (most recent call last):
  PL/Python function "lda_train", line 22, in <module>
    voc_size, topic_num, iter_num, alpha, beta,evaluate_every , perplexity_tol)
  PL/Python function "lda_train", line 519, in lda_train
  PL/Python function "lda_train", line 96, in _assert
PL/Python function "lda_train"
 [SQL: "SELECT madlib.lda_train( 'documents_tf',          -- documents table in the form of term frequency\n                         'lda_model_perp',        -- model table created by LDA training (not human readable)\n                         'lda_output_data_perp',  -- readable output data table \n                         103,                     -- vocabulary size\n                         5,                       -- number of topics\n                         10,                      -- number of iterations\n                         5,                       -- Dirichlet prior for the per-doc topic multinomial (alpha)\n                         0.01,                    -- Dirichlet prior for the per-topic word multinomial (beta)\n                         2,                       -- Evaluate perplexity every 2 iterations\n                         0.0                      -- Set tolerance to 0 so runs full number of iterations\n                       );"]
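
A corrected validation along the lines requested might look like the following sketch. The _assert here is a local stand-in for the MADlib helper of the same name seen in the traceback, reimplemented for illustration.

```python
# Sketch of the requested fix: accept any perplexity_tol >= 0.0 (so that 0.0
# runs the full number of iterations) and error out only for negative values.
# _assert is a local stand-in for the MADlib helper of the same name.
def _assert(condition, msg):
    if not condition:
        raise ValueError(msg)

def validate_perplexity_tol(perplexity_tol):
    _assert(perplexity_tol is not None and perplexity_tol >= 0,
            'invalid argument: nonnegative real expected for perplexity_tol')
    return perplexity_tol
```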

@fmcquillan99

fmcquillan99 commented Oct 28, 2019

(3)
Last iteration value for perplexity does not match final perplexity value:

DROP TABLE IF EXISTS documents;
CREATE TABLE documents(docid INT4, contents TEXT);

INSERT INTO documents VALUES
(0, 'Statistical topic models are a class of Bayesian latent variable models, originally developed for analyzing the semantic content of large document corpora.'),
(1, 'By the late 1960s, the balance between pitching and hitting had swung in favor of the pitchers. In 1968 Carl Yastrzemski won the American League batting title with an average of just .301, the lowest in history.'),
(2, 'Machine learning is closely related to and often overlaps with computational statistics; a discipline that also specializes in prediction-making. It has strong ties to mathematical optimization, which deliver methods, theory and application domains to the field.'),
(3, 'California''s diverse geography ranges from the Sierra Nevada in the east to the Pacific Coast in the west, from the Redwood Douglas fir forests of the northwest, to the Mojave Desert areas in the southeast. The center of the state is dominated by the Central Valley, a major agricultural area.'),
(4, 'One of the many applications of Bayes'' theorem is Bayesian inference, a particular approach to statistical inference. When applied, the probabilities involved in Bayes'' theorem may have different probability interpretations. With the Bayesian probability interpretation the theorem expresses how a degree of belief, expressed as a probability, should rationally change to account for availability of related evidence. Bayesian inference is fundamental to Bayesian statistics.'),
(5, 'When data are unlabelled, supervised learning is not possible, and an unsupervised learning approach is required, which attempts to find natural clustering of the data to groups, and then map new data to these formed groups. The support-vector clustering algorithm, created by Hava Siegelmann and Vladimir Vapnik, applies the statistics of support vectors, developed in the support vector machines algorithm, to categorize unlabeled data, and is one of the most widely used clustering algorithms in industrial applications.'),
(6, 'Deep learning architectures such as deep neural networks, deep belief networks, recurrent neural networks and convolutional neural networks have been applied to fields including computer vision, speech recognition, natural language processing, audio recognition, social network filtering, machine translation, bioinformatics, drug design, medical image analysis, material inspection and board game programs, where they have produced results comparable to and in some cases superior to human experts.'),
(7, 'A multilayer perceptron is a class of feedforward artificial neural network. An MLP consists of at least three layers of nodes: an input layer, a hidden layer and an output layer. Except for the input nodes, each node is a neuron that uses a nonlinear activation function. MLP utilizes a supervised learning technique called backpropagation for training.'),
(8, 'In mathematics, an ellipse is a plane curve surrounding two focal points, such that for all points on the curve, the sum of the two distances to the focal points is a constant.'),
(9, 'In artificial neural networks, the activation function of a node defines the output of that node given an input or set of inputs.'),
(10, 'In mathematics, graph theory is the study of graphs, which are mathematical structures used to model pairwise relations between objects. A graph in this context is made up of vertices (also called nodes or points) which are connected by edges (also called links or lines). A distinction is made between undirected graphs, where edges link two vertices symmetrically, and directed graphs, where edges link two vertices asymmetrically; see Graph (discrete mathematics) for more detailed definitions and for other variations in the types of graph that are commonly considered. Graphs are one of the prime objects of study in discrete mathematics.'),
(11, 'A Rube Goldberg machine, named after cartoonist Rube Goldberg, is a machine intentionally designed to perform a simple task in an indirect and overly complicated way. Usually, these machines consist of a series of simple unrelated devices; the action of each triggers the initiation of the next, eventually resulting in achieving a stated goal.'),
(12, 'In statistics, the logistic model (or logit model) is used to model the probability of a certain class or event existing such as pass/fail, win/lose, alive/dead or healthy/sick. This can be extended to model several classes of events such as determining whether an image contains a cat, dog, lion, etc... Each object being detected in the image would be assigned a probability between 0 and 1 and the sum adding to one.'),
(13, 'k-means clustering is a method of vector quantization, originally from signal processing, that is popular for cluster analysis in data mining. k-means clustering aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean, serving as a prototype of the cluster.'),
(14, 'In pattern recognition, the k-nearest neighbors algorithm (k-NN) is a non-parametric method used for classification and regression.');


ALTER TABLE documents ADD COLUMN words TEXT[];

UPDATE documents SET words = 
    regexp_split_to_array(lower(
    regexp_replace(contents, E'[,.;\']','', 'g')
    ), E'[\\s+]');


DROP TABLE IF EXISTS documents_tf, documents_tf_vocabulary;

SELECT madlib.term_frequency('documents',    -- input table
                             'docid',        -- document id column
                             'words',        -- vector of words in document
                             'documents_tf', -- output documents table with term frequency
                             TRUE);          -- TRUE to create vocabulary table

Train

DROP TABLE IF EXISTS lda_model_perp, lda_output_data_perp;

SELECT madlib.lda_train( 'documents_tf',          -- documents table in the form of term frequency
                         'lda_model_perp',        -- model table created by LDA training (not human readable)
                         'lda_output_data_perp',  -- readable output data table 
                         384,                     -- vocabulary size
                         5,                        -- number of topics
                         100,                      -- number of iterations
                         5,                       -- Dirichlet prior for the per-doc topic multinomial (alpha)
                         0.01,                    -- Dirichlet prior for the per-topic word multinomial (beta)
                         1,                       -- Evaluate perplexity every n iterations
                         0.1                      -- Stopping perplexity tolerance
                       );

SELECT voc_size, topic_num, alpha, beta, perplexity, perplexity_iters from lda_model_perp;

-[ RECORD 1 ]----+--------------------------------------------------------------------------------------------------
voc_size         | 384
topic_num        | 5
alpha            | 5
beta             | 0.01
perplexity       | {195.764020671,194.317808815,193.208428811,188.2838923,188.384646897,189.849099875,189.939592275}
perplexity_iters | {1,2,3,4,5,6,7}

Predict on input data

DROP TABLE IF EXISTS outdata_predict_perp;

SELECT madlib.lda_predict( 'documents_tf',          -- Document to predict
                           'lda_model_perp',             -- LDA model from training
                           'outdata_predict_perp'                
                         );

SELECT madlib.lda_get_perplexity( 'lda_model_perp',
                                  'outdata_predict_perp'
                                );

-[ RECORD 1 ]------+-----------------
lda_get_perplexity | 192.569799335159

I would expect this to be 189.939592275 which is the last value in the array for perplexity at iteration 7.

@fmcquillan99

(4)
Unnecessary verbose output

DROP TABLE IF EXISTS documents_tf, documents_tf_vocabulary;

SELECT madlib.term_frequency('documents',    -- input table
                             'docid',        -- document id column
                             'words',        -- vector of words in document
                             'documents_tf', -- output documents table with term frequency
                             TRUE);          -- TRUE to create vocabulary table

produces

NOTICE:  Table doesn't have 'DISTRIBUTED BY' clause. Creating a NULL policy entry.
CONTEXT:  SQL statement "
                 CREATE TABLE documents_tf_vocabulary AS
                 SELECT (row_number() OVER (order by word))::INTEGER - 1 as wordid,
                        word::TEXT
                 FROM (
                    SELECT distinct(words) as word
                    FROM (
                          SELECT unnest(words::TEXT[]) as words
                          FROM documents
                    ) q1
                ) q2
                "
PL/Python function "term_frequency"
NOTICE:  One or more columns in the following table(s) do not have statistics: documents
HINT:  For non-partitioned tables, run analyze <table_name>(<column_list>). For partitioned tables, run analyze rootpartition <table_name>(<column_list>). See log for columns missing statistics.
CONTEXT:  SQL statement "
                 CREATE TABLE documents_tf_vocabulary AS
                 SELECT (row_number() OVER (order by word))::INTEGER - 1 as wordid,
                        word::TEXT
                 FROM (
                    SELECT distinct(words) as word
                    FROM (
                          SELECT unnest(words::TEXT[]) as words
                          FROM documents
                    ) q1
                ) q2
                "
PL/Python function "term_frequency"
NOTICE:  Table doesn't have 'DISTRIBUTED BY' clause -- Using column named 'docid' as the Greenplum Database data distribution key for this table.
HINT:  The 'DISTRIBUTED BY' clause determines the distribution of data. Make sure column(s) chosen are the optimal data distribution key to minimize skew.
CONTEXT:  SQL statement "
        CREATE TABLE documents_tf(
            docid INTEGER,
            wordid INTEGER,
            count INTEGER
        )
        "
PL/Python function "term_frequency"
NOTICE:  One or more columns in the following table(s) do not have statistics: documents
HINT:  For non-partitioned tables, run analyze <table_name>(<column_list>). For partitioned tables, run analyze rootpartition <table_name>(<column_list>). See log for columns missing statistics.
CONTEXT:  SQL statement "
        INSERT INTO documents_tf
            SELECT docid, w.wordid as wordid, word_count as count
            FROM (
                SELECT docid, word::TEXT, count(*) as word_count
                FROM
                (
                    SELECT docid, unnest(words::TEXT[]) as word
                    FROM documents
                    WHERE
                        docid IS NOT NULL
                ) q1
                GROUP BY docid, word
            ) q2
            
            , documents_tf_vocabulary as w
            WHERE
                q2.word = w.word
            
        "
PL/Python function "term_frequency"
                                      term_frequency                                      
------------------------------------------------------------------------------------------
 Term frequency output in table documents_tf, vocabulary in table documents_tf_vocabulary
(1 row)

Time: 206.233 ms

@hpandeycodeit
Member Author

DROP TABLE IF EXISTS documents_tf, documents_tf_vocabulary;

SELECT madlib.term_frequency('documents', -- input table
'docid', -- document id column
'words', -- vector of words in document
'documents_tf', -- output documents table with term frequency
TRUE);

@fmcquillan99 I don't see verbose output when I run the above query. Are you running it in GPDB or postgres?

postgres=# DROP TABLE IF EXISTS documents_tf, documents_tf_vocabulary;
DROP TABLE
postgres=# 
postgres=# SELECT madlib.term_frequency('documents',    -- input table
postgres(#                              'docid',        -- document id column
postgres(#                              'words',        -- vector of words in document
postgres(#                              'documents_tf', -- output documents table with term frequency
postgres(#                              TRUE);          
                                      term_frequency                                      
------------------------------------------------------------------------------------------
 Term frequency output in table documents_tf, vocabulary in table documents_tf_vocabulary
(1 row)

postgres=# 

@fmcquillan99

@hpandeycodeit I was running on GP5 from psql

@hpandeycodeit
Member Author

@hpandeycodeit I was running on GP5 from psql

So this is not in the LDA code; it is part of GPDB 5. If a table does not have statistics, GPDB prints messages about the missing stats. Once the stats are updated (run ANALYZE on these tables) and the SQL above is run again, these messages disappear.

@hpandeycodeit
Member Author

@fmcquillan99,

In lda_predict, although the model table remains the same, the output table is randomly initialized. That is why we see a difference between the perplexity values calculated in lda_train and by lda_get_perplexity().

However, if the same output table generated by lda_train is passed to the lda_get_perplexity() function, the perplexity values match. For example:

DROP TABLE IF EXISTS lda_model_perp, lda_output_data_perp;

SELECT madlib.lda_train( 'documents_tf',          -- documents table in the form of term frequency
                         'lda_model_perp',        -- model table created by LDA training (not human readable)
                         'lda_output_data_perp',  -- readable output data table 
                         385,                     -- vocabulary size
                         5,                        -- number of topics
                         10,                      -- number of iterations
                         5,                       -- Dirichlet prior for the per-doc topic multinomial (alpha)
                         0.01,                    -- Dirichlet prior for the per-topic word multinomial (beta)
                         1,                       -- Evaluate perplexity every n iterations
                         .2                      -- Stopping perplexity tolerance
                       );

Generates the following perplexity values with the last perplexity value 179.380131412:

postgres=# select perplexity from lda_model_perp ;
                                                                  perplexity                                                                  
----------------------------------------------------------------------------------------------------------------------------------------------
 {196.940707618,193.245742228,191.155602156,185.314159394,182.901929923,187.283749958,186.944341124,185.508311039,185.72038473,179.380131412}
(1 row)

Now running the get_perplexity() on the above-generated output table lda_output_data_perp produces the following perplexity:

postgres=# SELECT madlib.lda_get_perplexity( 'lda_model_perp',
postgres(#                                   'lda_output_data_perp'
postgres(#                                 );
 lda_get_perplexity 
--------------------
   179.380131412469

which matches the last perplexity value calculated by lda_train

Thanks!

@hpandeycodeit
Member Author

(1)
Please add num_iterations to the output table. This is needed now because
we have a perplexity tolerance, so training may not run the maximum number of iterations
specified. The model table should look like:

model_table
...
model	BIGINT[]. The encoded model ...etc...
num_iterations	INTEGER. Number of iterations that training ran for,
which may be less than the maximum value specified in the parameter 'iter_num' if
the perplexity tolerance was reached.
perplexity	DOUBLE PRECISION[] Array of ...etc....
...

(2)
The parameter 'perplexity_tol' can be any value >= 0.0. Currently it errors out below a
value of 0.1, which is not correct. I may want to set it to 0.0 so that training runs
for the full number of iterations. So please change it to error out only if 'perplexity_tol' < 0.

DROP TABLE IF EXISTS lda_model_perp, lda_output_data_perp;

SELECT madlib.lda_train( 'documents_tf',          -- documents table in the form of term frequency
                         'lda_model_perp',        -- model table created by LDA training (not human readable)
                         'lda_output_data_perp',  -- readable output data table
                         103,                     -- vocabulary size
                         5,                       -- number of topics
                         10,                      -- number of iterations
                         5,                       -- Dirichlet prior for the per-doc topic multinomial (alpha)
                         0.01,                    -- Dirichlet prior for the per-topic word multinomial (beta)
                         2,                       -- Evaluate perplexity every 2 iterations
                         0.0                      -- Set tolerance to 0 so runs full number of iterations
                       );

produces

InternalError: (psycopg2.InternalError) plpy.Error: invalid argument: perplexity_tol should not be less than .1 (plpython.c:5038)
CONTEXT:  Traceback (most recent call last):
  PL/Python function "lda_train", line 22, in <module>
    voc_size, topic_num, iter_num, alpha, beta,evaluate_every , perplexity_tol)
  PL/Python function "lda_train", line 519, in lda_train
  PL/Python function "lda_train", line 96, in _assert
PL/Python function "lda_train"
 [SQL: "SELECT madlib.lda_train( 'documents_tf',          -- documents table in the form of term frequency\n                         'lda_model_perp',        -- model table created by LDA training (not human readable)\n                         'lda_output_data_perp',  -- readable output data table \n                         103,                     -- vocabulary size\n                         5,                       -- number of topics\n                         10,                      -- number of iterations\n                         5,                       -- Dirichlet prior for the per-doc topic multinomial (alpha)\n                         0.01,                    -- Dirichlet prior for the per-topic word multinomial (beta)\n                         2,                       -- Evaluate perplexity every 2 iterations\n                         0.0                      -- Set tolerance to 0 so runs full number of iterations\n                       );"]

This is fixed.

@asf-ci

asf-ci commented Nov 1, 2019

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/madlib-pr-build/1123/

@fmcquillan99

(5)
The iteration number does not match on early termination:

-[ RECORD 1 ]----+----------------------------------------------------------
voc_size         | 384
topic_num        | 5
alpha            | 5
beta             | 0.01
num_iterations   | 5
perplexity       | {199.746367293,193.662852162,190.782567914,189.245695537}
perplexity_iters | {1,2,3,4}

Time: 38.941 ms

I think num_iterations should be 4?

@fmcquillan99

fmcquillan99 commented Nov 1, 2019

(6)
NULLs not being handled properly

DROP TABLE IF EXISTS lda_model_perp, lda_output_data_perp;

SELECT madlib.lda_train( 'documents_tf',          -- documents table in the form of term frequency
                         'lda_model_perp',        -- model table created by LDA training (not human readable)
                         'lda_output_data_perp',  -- readable output data table 
                         384,                     -- vocabulary size
                         5,                        -- number of topics
                         20,                      -- number of iterations
                         5,                       -- Dirichlet prior for the per-doc topic multinomial (alpha)
                         0.01,                    -- Dirichlet prior for the per-topic word multinomial (beta)
                         NULL,                    -- Evaluate perplexity every n iterations
                         NULL                     -- Stopping perplexity tolerance
                       );

InternalError: (psycopg2.InternalError) plpy.Error: invalid argument: perplexity_tol should not be less than 0 (plpython.c:5038)
CONTEXT:  Traceback (most recent call last):
  PL/Python function "lda_train", line 22, in <module>
    voc_size, topic_num, iter_num, alpha, beta,evaluate_every , perplexity_tol)
  PL/Python function "lda_train", line 525, in lda_train
  PL/Python function "lda_train", line 96, in _assert
PL/Python function "lda_train"
 [SQL: "SELECT madlib.lda_train( 'documents_tf',          -- documents table in the form of term frequency\n                         'lda_model_perp',        -- model table created by LDA training (not human readable)\n                         'lda_output_data_perp',  -- readable output data table \n                         384,                     -- vocabulary size\n                         5,                        -- number of topics\n                         20,                      -- number of iterations\n                         5,                       -- Dirichlet prior for the per-doc topic multinomial (alpha)\n                         0.01,                    -- Dirichlet prior for the per-topic word multinomial (beta)\n                         NULL,                       -- Evaluate perplexity every n iterations\n                         NULL                      -- Stopping perplexity tolerance\n                       );"]

Please implement as per

evaluate_every (optional)
INTEGER, default: 0. How often to evaluate perplexity. Set it to 0 or a negative number to not evaluate perplexity in training at all. Evaluating perplexity can help you check convergence during the training process, but it will also increase total training time. For example, evaluating perplexity in every iteration might increase training time up to two-fold.
perplexity_tol (optional)
DOUBLE PRECISION, default: 0.1. Perplexity tolerance to stop iteration. Only used when the parameter 'evaluate_every' is greater than 0.
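A minimal Python sketch of how that defaulting and validation could look. This is illustrative only, not MADlib's actual code: in PL/Python a SQL NULL arrives as None, and `validate_lda_args` is a hypothetical helper name.

```python
def validate_lda_args(evaluate_every, perplexity_tol):
    """Default NULL (None) arguments and validate them per the spec above."""
    # NULL from SQL arrives as None in PL/Python; fall back to the defaults
    evaluate_every = 0 if evaluate_every is None else evaluate_every
    perplexity_tol = 0.1 if perplexity_tol is None else perplexity_tol
    # Any non-negative tolerance is allowed, including 0.0
    if perplexity_tol < 0:
        raise ValueError("perplexity_tol must be >= 0")
    # evaluate_every <= 0 simply disables perplexity evaluation; no error
    return evaluate_every, perplexity_tol
```

With this shape, passing NULL for both parameters behaves the same as omitting them, which is the behavior requested in this thread.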

@hpandeycodeit
Member Author

(6) NULLs not being handled properly (quoting the comment above).

Fixed this and num_iterations.

@asf-ci

asf-ci commented Nov 4, 2019

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/madlib-pr-build/1128/

@fmcquillan99

fmcquillan99 commented Nov 4, 2019


Re-test after latest commits

(1)
Please add num_iterations to the output table. This is needed now because
we have a perplexity tolerance, so training may not run the maximum number of iterations
specified. The model table should look like:

model_table
...
model	BIGINT[]. The encoded model ...etc...
num_iterations	INTEGER. Number of iterations that training ran for,
which may be less than the maximum value specified in the parameter 'iter_num' if
the perplexity tolerance was reached.
perplexity	DOUBLE PRECISION[] Array of ...etc....
...

Now looks like:

-[ RECORD 1 ]----+--------------------------------------------
voc_size         | 384
topic_num        | 5
alpha            | 5
beta             | 0.01
num_iterations   | 9
perplexity       | {196.148467882,192.142777576,193.872066117}
perplexity_iters | {3,6,9}

OK

(2)
The parameter 'perplexity_tol' can be any value >= 0.0 Currently it errors out below a
value of 0.1 which is not correct. I may want to set it to 0.0 so that training runs
for the full number of iterations. So please change it to error out if 'perplexity_tol'<0.

DROP TABLE IF EXISTS lda_model_perp, lda_output_data_perp;

SELECT madlib.lda_train( 'documents_tf',          -- documents table in the form of term frequency
                         'lda_model_perp',        -- model table created by LDA training (not human readable)
                         'lda_output_data_perp',  -- readable output data table
                         103,                     -- vocabulary size
                         5,                       -- number of topics
                         10,                      -- number of iterations
                         5,                       -- Dirichlet prior for the per-doc topic multinomial (alpha)
                         0.01,                    -- Dirichlet prior for the per-topic word multinomial (beta)
                         2,                       -- Evaluate perplexity every 2 iterations
                         0.0                      -- Set tolerance to 0 so runs full number of iterations
                       );

produces

-[ RECORD 1 ]----+--------------------------------------------------------------------------------------------------------------------------------------------
voc_size         | 384
topic_num        | 5
alpha            | 5
beta             | 0.01
num_iterations   | 20
perplexity       | {191.992070922,188.198782019,187.433873268,184.973287318,184.491077644,176.27420008,180.63646659,180.456641184,179.574266867,179.152413582}
perplexity_iters | {2,4,6,8,10,12,14,16,18,20}

OK

(3)
Last iteration value for perplexity does not match final perplexity value:

DROP TABLE IF EXISTS documents;
CREATE TABLE documents(docid INT4, contents TEXT);

INSERT INTO documents VALUES
(0, 'Statistical topic models are a class of Bayesian latent variable models, originally developed for analyzing the semantic content of large document corpora.'),
(1, 'By the late 1960s, the balance between pitching and hitting had swung in favor of the pitchers. In 1968 Carl Yastrzemski won the American League batting title with an average of just .301, the lowest in history.'),
(2, 'Machine learning is closely related to and often overlaps with computational statistics; a discipline that also specializes in prediction-making. It has strong ties to mathematical optimization, which deliver methods, theory and application domains to the field.'),
(3, 'California''s diverse geography ranges from the Sierra Nevada in the east to the Pacific Coast in the west, from the Redwood Douglas fir forests of the northwest, to the Mojave Desert areas in the southeast. The center of the state is dominated by the Central Valley, a major agricultural area.'),
(4, 'One of the many applications of Bayes'' theorem is Bayesian inference, a particular approach to statistical inference. When applied, the probabilities involved in Bayes'' theorem may have different probability interpretations. With the Bayesian probability interpretation the theorem expresses how a degree of belief, expressed as a probability, should rationally change to account for availability of related evidence. Bayesian inference is fundamental to Bayesian statistics.'),
(5, 'When data are unlabelled, supervised learning is not possible, and an unsupervised learning approach is required, which attempts to find natural clustering of the data to groups, and then map new data to these formed groups. The support-vector clustering algorithm, created by Hava Siegelmann and Vladimir Vapnik, applies the statistics of support vectors, developed in the support vector machines algorithm, to categorize unlabeled data, and is one of the most widely used clustering algorithms in industrial applications.'),
(6, 'Deep learning architectures such as deep neural networks, deep belief networks, recurrent neural networks and convolutional neural networks have been applied to fields including computer vision, speech recognition, natural language processing, audio recognition, social network filtering, machine translation, bioinformatics, drug design, medical image analysis, material inspection and board game programs, where they have produced results comparable to and in some cases superior to human experts.'),
(7, 'A multilayer perceptron is a class of feedforward artificial neural network. An MLP consists of at least three layers of nodes: an input layer, a hidden layer and an output layer. Except for the input nodes, each node is a neuron that uses a nonlinear activation function. MLP utilizes a supervised learning technique called backpropagation for training.'),
(8, 'In mathematics, an ellipse is a plane curve surrounding two focal points, such that for all points on the curve, the sum of the two distances to the focal points is a constant.'),
(9, 'In artificial neural networks, the activation function of a node defines the output of that node given an input or set of inputs.'),
(10, 'In mathematics, graph theory is the study of graphs, which are mathematical structures used to model pairwise relations between objects. A graph in this context is made up of vertices (also called nodes or points) which are connected by edges (also called links or lines). A distinction is made between undirected graphs, where edges link two vertices symmetrically, and directed graphs, where edges link two vertices asymmetrically; see Graph (discrete mathematics) for more detailed definitions and for other variations in the types of graph that are commonly considered. Graphs are one of the prime objects of study in discrete mathematics.'),
(11, 'A Rube Goldberg machine, named after cartoonist Rube Goldberg, is a machine intentionally designed to perform a simple task in an indirect and overly complicated way. Usually, these machines consist of a series of simple unrelated devices; the action of each triggers the initiation of the next, eventually resulting in achieving a stated goal.'),
(12, 'In statistics, the logistic model (or logit model) is used to model the probability of a certain class or event existing such as pass/fail, win/lose, alive/dead or healthy/sick. This can be extended to model several classes of events such as determining whether an image contains a cat, dog, lion, etc... Each object being detected in the image would be assigned a probability between 0 and 1 and the sum adding to one.'),
(13, 'k-means clustering is a method of vector quantization, originally from signal processing, that is popular for cluster analysis in data mining. k-means clustering aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean, serving as a prototype of the cluster.'),
(14, 'In pattern recognition, the k-nearest neighbors algorithm (k-NN) is a non-parametric method used for classification and regression.');


ALTER TABLE documents ADD COLUMN words TEXT[];

UPDATE documents SET words =
    regexp_split_to_array(lower(
    regexp_replace(contents, E'[,.;\']','', 'g')
    ), E'[\\s+]');


DROP TABLE IF EXISTS documents_tf, documents_tf_vocabulary;

SELECT madlib.term_frequency('documents',    -- input table
                             'docid',        -- document id column
                             'words',        -- vector of words in document
                             'documents_tf', -- output documents table with term frequency
                             TRUE);          -- TRUE to create vocabulary table

Train

DROP TABLE IF EXISTS lda_model_perp, lda_output_data_perp;

SELECT madlib.lda_train( 'documents_tf',          -- documents table in the form of term frequency
                         'lda_model_perp',        -- model table created by LDA training (not human readable)
                         'lda_output_data_perp',  -- readable output data table
                         384,                     -- vocabulary size
                         5,                        -- number of topics
                         100,                      -- number of iterations
                         5,                       -- Dirichlet prior for the per-doc topic multinomial (alpha)
                         0.01,                    -- Dirichlet prior for the per-topic word multinomial (beta)
                         1,                       -- Evaluate perplexity every n iterations
                         0.1                      -- Stopping perplexity tolerance
                       );

SELECT voc_size, topic_num, alpha, beta, perplexity, perplexity_iters from lda_model_perp;

-[ RECORD 1 ]----+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
voc_size         | 384
topic_num        | 5
alpha            | 5
beta             | 0.01
num_iterations   | 16
perplexity       | {195.582090721,192.071728778,191.048336558,194.186905186,195.150503634,191.566207005,191.199131632,185.533220287,189.910983656,184.981903783,185.753724338,183.043524383,189.125703696,191.460991339,189.193774612,189.182916247}
perplexity_iters | {1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16}

Perplexity on input data

SELECT madlib.lda_get_perplexity( 'lda_model_perp',
                                  'lda_output_data_perp'
                                );

 lda_get_perplexity 
--------------------
   189.182916246556
(1 row)

which matches the last value in the array for the training function.

OK

(6) still has an issue

DROP TABLE IF EXISTS lda_model_perp, lda_output_data_perp;

SELECT madlib.lda_train( 'documents_tf',          -- documents table in the form of term frequency
                         'lda_model_perp',        -- model table created by LDA training (not human readable)
                         'lda_output_data_perp',  -- readable output data table 
                         384,                     -- vocabulary size
                         5,                        -- number of topics
                         20,                      -- number of iterations
                         5,                       -- Dirichlet prior for the per-doc topic multinomial (alpha)
                         0.01,                    -- Dirichlet prior for the per-topic word multinomial (beta)
                         2                       -- Evaluate perplexity every n iterations
                       );

Done.
(psycopg2.ProgrammingError) function madlib.lda_train(unknown, unknown, unknown, integer, integer, integer, integer, numeric, integer) does not exist
LINE 1: SELECT madlib.lda_train( 'documents_tf',          -- documen...
               ^
HINT:  No function matches the given name and argument types. You might need to add explicit type casts.
 [SQL: "SELECT madlib.lda_train( 'documents_tf',          -- documents table in the form of term frequency\n                         'lda_model_perp',        -- model table created by LDA training (not human readable)\n                         'lda_output_data_perp',  -- readable output data table \n                         384,                     -- vocabulary size\n                         5,                        -- number of topics\n                         20,                      -- number of iterations\n                         5,                       -- Dirichlet prior for the per-doc topic multinomial (alpha)\n                         0.01,                    -- Dirichlet prior for the per-topic word multinomial (beta)\n                         2                       -- Evaluate perplexity every n iterations\n                       );"]

This should be the same results as:

DROP TABLE IF EXISTS lda_model_perp, lda_output_data_perp;

SELECT madlib.lda_train( 'documents_tf',          -- documents table in the form of term frequency
                         'lda_model_perp',        -- model table created by LDA training (not human readable)
                         'lda_output_data_perp',  -- readable output data table 
                         384,                     -- vocabulary size
                         5,                        -- number of topics
                         20,                      -- number of iterations
                         5,                       -- Dirichlet prior for the per-doc topic multinomial (alpha)
                         0.01,                    -- Dirichlet prior for the per-topic word multinomial (beta)
                         2,                       -- Evaluate perplexity every n iterations
                         NULL
                       );

which actually does work if you put NULL for the last param.

@asf-ci

asf-ci commented Nov 5, 2019

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/madlib-pr-build/1129/

@asf-ci

asf-ci commented Nov 5, 2019

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/madlib-pr-build/1130/

@hpandeycodeit
Member Author

@fmcquillan99 Fixed the issue with the Null handling on the last param.

@kaknikhil
Contributor

@hpandeycodeit
We should test cases for the following scenarios (not sure if we already have tests for some of these) :

  1. If evaluate_every <=0, assert that there are no perplexity values.
  2. If tolerance == 0, assert that we don't stop early.
  3. All permutations of the interface with evaluate_every and tolerance being passed as NULL and/or not passed at all to make sure we default the values as expected.
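Scenarios 1 and 2 can be illustrated against a toy version of the stopping rule this PR adds (stop when the last two perplexity values differ by less than the tolerance). `toy_train` below is a self-contained stand-in with synthetic perplexity values, not the MADlib implementation, and the real checks live in the SQL install-check tests:

```python
def toy_train(iters, evaluate_every, tol):
    """Toy stand-in for lda_train's stopping rule. Perplexity is synthetic
    (200.0 minus the iteration number, so it strictly decreases); training
    stops early when the last two evaluated values differ by less than tol."""
    perplexity = []
    num_iterations = 0
    for it in range(1, iters + 1):
        num_iterations = it
        if evaluate_every > 0 and it % evaluate_every == 0:
            perplexity.append(200.0 - it)  # synthetic perplexity value
            if (len(perplexity) >= 2
                    and abs(perplexity[-2] - perplexity[-1]) < tol):
                break
    return num_iterations, perplexity

# 1. evaluate_every <= 0: no perplexity values are recorded
assert toy_train(10, 0, 0.1) == (10, [])
# 2. tol == 0: never stops early, since abs(...) < 0 is never true
assert toy_train(10, 1, 0.0)[0] == 10
# For contrast, a loose tolerance stops at the second evaluation
assert toy_train(10, 1, 2.0)[0] == 2
```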

@fmcquillan99

I checked (6) after the last commit and it works now.

So LGTM on functionality.

@asf-ci

asf-ci commented Nov 5, 2019

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/madlib-pr-build/1131/

@asf-ci

asf-ci commented Nov 5, 2019

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/madlib-pr-build/1133/

@hpandeycodeit
Member Author

Regarding the test scenarios listed above:

Added the test cases for 2 and 3. There was already a test case covering scenario 1.

@asf-ci

asf-ci commented Nov 7, 2019

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/madlib-pr-build/1141/

@asf-ci

asf-ci commented Nov 7, 2019

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/madlib-pr-build/1142/

@fmcquillan99

LGTM

@@ -474,3 +474,89 @@ select assert(array_upper(perplexity_iters,1) = 2, 'Perplexity iterations are d
select assert(perplexity[1] > 0 , 'Perplexity value should be greater than 0') from lda_model ;
select assert(array_upper(ARRAY(Select distinct unnest(perplexity)),1)= array_upper(perplexity,1) , 'Perplexity values should be unique') from lda_model ;


-- Test for evaluate_every = 1 and 0 : In this case the iterations should not stop early --
Contributor

@hpandeycodeit
I can't find the test for evaluate_every = 0. Am i missing something ?

Member Author

When evaluate_every = NULL, it takes the default evaluate_every = 0, and in that case we don't calculate perplexity. We already have a test case covering evaluate_every = NULL.

Prior to this commit, in LDA there are no stopping criteria. It runs for
all the provided iterations. This commit calculates the perplexity on
each iteration and when the difference between the last two perplexity
values is less than the perplexity_tol, it stops the iteration.

These are the two new parameters added to the function:

```
evaluate_every  INTEGER,
perplexity_tol  DOUBLE PRECISION
```

Also, there is a change to the model output table. The following new
columns are added:

1. perplexity (DOUBLE PRECISION[]): an array of perplexity values, evaluated
as per the 'evaluate_every' parameter.
2. perplexity_iters (INTEGER[]): an array indicating the iterations for
which perplexity was calculated.
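The iteration and stopping logic described in this commit message can be sketched as follows. This is a hedged sketch, not MADlib's actual PL/Python code: `run_iteration` and `compute_perplexity` are hypothetical callables standing in for the Gibbs-sampling step and the perplexity computation.

```python
def train_lda(iter_num, evaluate_every, perplexity_tol,
              run_iteration, compute_perplexity):
    """Run up to iter_num iterations, evaluating perplexity every
    'evaluate_every' iterations, and stop early when the difference
    between the last two perplexity values drops below perplexity_tol."""
    perplexity = []        # evaluated perplexity values
    perplexity_iters = []  # iterations at which perplexity was evaluated
    num_iterations = 0
    for it in range(1, iter_num + 1):
        run_iteration()
        num_iterations = it
        if evaluate_every > 0 and it % evaluate_every == 0:
            perplexity.append(compute_perplexity())
            perplexity_iters.append(it)
            # stopping criterion: last two perplexity values within tol
            if (len(perplexity) >= 2
                    and abs(perplexity[-2] - perplexity[-1]) < perplexity_tol):
                break
    return num_iterations, perplexity, perplexity_iters
```

Note that num_iterations reports the iterations actually run, which may be fewer than iter_num if the tolerance was reached, matching the output-table change discussed in this PR.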
@asf-ci

asf-ci commented Nov 18, 2019

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/madlib-pr-build/1154/

@khannaekta khannaekta merged commit 5a1717e into apache:master Nov 18, 2019