
Globo Dataset missing Assets (Labels) #9

Closed
Curlykonda opened this issue Jan 14, 2020 · 3 comments

@Curlykonda

Curlykonda commented Jan 14, 2020

Hi,

With the provided Globo dataset, we cannot train the ACR module because the article contents were not provided. Progressing to the NAR training, it seems that some assets are missing.

In nar_trainer_gcom.py the following method tries to deserialise the labels, metadata and article embeddings from a pickle file. However, this pickle file is not provided; more precisely, the 'acr_label_encoders' are missing.

tf.logging.info('Loading ACR module assets')
acr_label_encoders, articles_metadata_df, content_article_embeddings_matrix = \
    load_acr_module_resources(FLAGS.acr_module_resources_path)

def load_acr_module_resources(acr_module_resources_path):
    (acr_label_encoders, articles_metadata_df, content_article_embeddings) = \
        deserialize(acr_module_resources_path)

    tf.logging.info("Read ACR label encoders for: {}".format(acr_label_encoders.keys()))
    tf.logging.info("Read ACR articles metadata: {}".format(len(articles_metadata_df)))
    tf.logging.info("Read ACR article content embeddings: {}".format(content_article_embeddings.shape))

    return acr_label_encoders, articles_metadata_df, content_article_embeddings
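For context, `deserialize` presumably just unpickles a tuple that the ACR preprocessing step wrote out. A minimal sketch of that round trip, using placeholder assets (the real label encoders, metadata and embedding matrix would come from the ACR preprocessing, not from this snippet):

```python
import pickle

def serialize(path, obj):
    # Write the asset tuple to disk; the ACR preprocessing step would
    # do this with the real label encoders, metadata and embeddings.
    with open(path, "wb") as f:
        pickle.dump(obj, f)

def deserialize(path):
    # Read back whatever tuple was dumped at preprocessing time.
    with open(path, "rb") as f:
        return pickle.load(f)

# Round trip with placeholder assets (not the real dataset contents)
assets = ({"article_id": {10: 0, 20: 1}},           # label encoders
          [{"article_id": 10}, {"article_id": 20}],  # metadata rows
          [[0.1, 0.2], [0.3, 0.4]])                  # content embeddings
serialize("/tmp/acr_module_resources.pickle", assets)
encoders, metadata, embeddings = deserialize("/tmp/acr_module_resources.pickle")
```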

Similarly, in nar_utils.py, this method cannot be executed because the folder '/pickles/' does not contain 'nar_label_encoders':

def load_nar_module_preprocessing_resources(nar_module_preprocessing_resources_path):
    #{'nar_label_encoders', 'nar_standard_scalers'}
    nar_resources = deserialize(nar_module_preprocessing_resources_path)

    nar_label_encoders = nar_resources['nar_label_encoders']
    tf.logging.info("Read NAR label encoders for: {}".format(nar_label_encoders.keys()))

    return nar_label_encoders    
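When a pickle exists but a key like 'nar_label_encoders' is missing, it helps to inspect what the file actually contains. A small diagnostic sketch (the helper name is mine, not from the repo):

```python
import pickle

def inspect_pickle(path):
    # Load the pickle and report what it contains: the sorted keys for
    # a dict (e.g. to check for 'nar_label_encoders'), otherwise the
    # type name of the top-level object.
    with open(path, "rb") as f:
        obj = pickle.load(f)
    if isinstance(obj, dict):
        return sorted(obj.keys())
    return type(obj).__name__

# Example: a resources dict missing 'nar_label_encoders'
with open("/tmp/nar_resources.pickle", "wb") as f:
    pickle.dump({"nar_standard_scalers": {}}, f)
print(inspect_pickle("/tmp/nar_resources.pickle"))  # → ['nar_standard_scalers']
```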

How can we get these labels? Or am I overlooking something?

Thanks

@Heng-xiu

Hi,
It is indeed missing some assets during the process. I noticed that 'acr_label_encoders.pickle' comes from acr/preprocessing/acr_preprocess_gcom.py. If you check the code in acr_preprocess_gcom.py, you can see the required parameters in the program.

However, we cannot regenerate 'acr_label_encoders.pickle' without the full text from the Globo dataset.
[Screenshot from 2020-01-30 12-54-31]
As you can see, there is a column called full_text in 'documents_g1.csv'.

Following the instructions in "Pre-processing data for the ACR module" and downloading the files from Kaggle, the full_text column doesn't exist in any of the files described on Kaggle, such as clicks.zip, articles_metadata.csv, and articles_embeddings.pickle.

So, how can we get the documents_g1.csv?

Thanks

@CatcherGG

Any answer to this?

@gabrielspmoreira
Owner

Hi. Sorry for the delayed response. nar_trainer_gcom.py did in fact have an issue: it required some assets that were not available for the G1 dataset (as the raw textual content of the articles could not be released for that dataset). The information the NAR training now requires from the ACR module is limited to two command-line parameters (--acr_module_articles_metadata_csv_path and --acr_module_articles_content_embeddings_pickle_path), which correspond to the files available in the public dataset URL.
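With that change, the NAR training only needs to read the two published files directly. A sketch of such loading (the function name is hypothetical; the actual code in nar_trainer_gcom.py may differ):

```python
import pickle
import pandas as pd

def load_acr_assets(metadata_csv_path, embeddings_pickle_path):
    # articles_metadata.csv and articles_embeddings.pickle are the two
    # files published with the G1 dataset on Kaggle; no label-encoder
    # pickle is needed.
    articles_metadata_df = pd.read_csv(metadata_csv_path)
    with open(embeddings_pickle_path, "rb") as f:
        content_article_embeddings = pickle.load(f)
    return articles_metadata_df, content_article_embeddings
```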
