Here is a list of records which have large fulltext contents. Due to a limitation in SOLR, we currently throw away anything beyond 32k bytes. Under these circumstances, it would be better to be more sensible when generating fulltext, so that we throw away things which are not interesting (e.g. numeric tables) and keep the text that we actually want.
To clarify: the SOLR ingestion problem was caused by a limit on word length (a single blob of characters longer than 32k bytes caused a failure), not by overall text length.
Commit ee36740 modifies the PDF extraction code so that newlines are kept, greatly reducing the chance that huge unbroken strings of nonsense are generated.
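As an extra safeguard against the token-length failure described above, a pre-ingestion filter could drop any single whitespace-delimited token whose encoded length exceeds the limit, while preserving line breaks. This is a hypothetical sketch (not the code from commit ee36740); the 32000-byte threshold and function name are assumptions for illustration:

```python
def sanitize_fulltext(text, max_token_bytes=32000):
    """Drop any whitespace-delimited token whose UTF-8 encoding exceeds
    the assumed SOLR term-length limit, keeping line structure intact."""
    cleaned_lines = []
    for line in text.splitlines():
        # Keep only tokens that fit within the byte limit once encoded.
        kept = [tok for tok in line.split()
                if len(tok.encode("utf-8")) <= max_token_bytes]
        cleaned_lines.append(" ".join(kept))
    return "\n".join(cleaned_lines)
```

A filter like this would remove pathological "words" (e.g. a run of digits from a numeric table fused into one token) without touching ordinary prose.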