
Globo Dataset missing Assets (Labels) #9

Closed
Curlykonda opened this issue Jan 14, 2020 · 3 comments

@Curlykonda

Curlykonda commented Jan 14, 2020

Hi,

With the provided Globo dataset, we cannot train the ACR module because the article contents were not provided. Progressing to the NAR training, it seems that some assets are missing.

In nar_trainer_gcom.py the following method tries to deserialise the labels, metadata and article embeddings from a pickle file. However, this pickle file is not provided; more precisely, the 'acr_label_encoders' are missing.

tf.logging.info('Loading ACR module assets')
acr_label_encoders, articles_metadata_df, content_article_embeddings_matrix = \
    load_acr_module_resources(FLAGS.acr_module_resources_path)

def load_acr_module_resources(acr_module_resources_path):
    (acr_label_encoders, articles_metadata_df, content_article_embeddings) = \
        deserialize(acr_module_resources_path)

    tf.logging.info("Read ACR label encoders for: {}".format(acr_label_encoders.keys()))
    tf.logging.info("Read ACR articles metadata: {}".format(len(articles_metadata_df)))
    tf.logging.info("Read ACR article content embeddings: {}".format(content_article_embeddings.shape))

    return acr_label_encoders, articles_metadata_df, content_article_embeddings
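For context, `deserialize` presumably just unpickles a tuple that the ACR preprocessing step wrote out. A minimal sketch of that round trip, using placeholder assets (the real label encoders, metadata and embedding matrix would come from the ACR preprocessing, not from this snippet):

```python
import pickle

def serialize(path, obj):
    # Write the asset tuple to disk; the ACR preprocessing step would
    # do this with the real label encoders, metadata and embeddings.
    with open(path, "wb") as f:
        pickle.dump(obj, f)

def deserialize(path):
    # Read back whatever tuple was dumped at preprocessing time.
    with open(path, "rb") as f:
        return pickle.load(f)

# Round trip with placeholder assets (not the real dataset contents)
assets = ({"article_id": {10: 0, 20: 1}},           # label encoders
          [{"article_id": 10}, {"article_id": 20}],  # metadata rows
          [[0.1, 0.2], [0.3, 0.4]])                  # content embeddings
serialize("/tmp/acr_module_resources.pickle", assets)
encoders, metadata, embeddings = deserialize("/tmp/acr_module_resources.pickle")
```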

Similarly, in nar_utils.py, this method cannot be executed because the folder '/pickles/' does not contain 'nar_label_encoders':

def load_nar_module_preprocessing_resources(nar_module_preprocessing_resources_path):
    #{'nar_label_encoders', 'nar_standard_scalers'}
    nar_resources = deserialize(nar_module_preprocessing_resources_path)

    nar_label_encoders = nar_resources['nar_label_encoders']
    tf.logging.info("Read NAR label encoders for: {}".format(nar_label_encoders.keys()))

    return nar_label_encoders    
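When a pickle exists but a key like 'nar_label_encoders' is missing, it helps to inspect what the file actually contains. A small diagnostic sketch (the helper name is mine, not from the repo):

```python
import pickle

def inspect_pickle(path):
    # Load the pickle and report what it contains: the sorted keys for
    # a dict (e.g. to check for 'nar_label_encoders'), otherwise the
    # type name of the top-level object.
    with open(path, "rb") as f:
        obj = pickle.load(f)
    if isinstance(obj, dict):
        return sorted(obj.keys())
    return type(obj).__name__

# Example: a resources dict missing 'nar_label_encoders'
with open("/tmp/nar_resources.pickle", "wb") as f:
    pickle.dump({"nar_standard_scalers": {}}, f)
print(inspect_pickle("/tmp/nar_resources.pickle"))  # → ['nar_standard_scalers']
```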

How can we get these labels? Or am I overlooking something?

Thanks

@Heng-xiu

Hi,
It is indeed missing some assets during the process. I noticed that 'acr_label_encoders.pickle' comes from acr/preprocessing/acr_preprocess_gcom.py. If you check the code in acr_preprocess_gcom.py, you can see the required parameters in the program.

However, we cannot regenerate 'acr_label_encoders.pickle' without the full text from the Globo dataset.
[Screenshot from 2020-01-30 12-54-31]
As you can see, there is a column called full_text in 'documents_g1.csv'.

Following the instructions in "Pre-processing data for the ACR module" and downloading the files from Kaggle, the full_text column doesn't exist in any of the files described on Kaggle, such as clicks.zip, articles_metadata.csv, and articles_embeddings.pickle.

So, how can we get the documents_g1.csv?

Thanks

@CatcherGG

Any answer to this?

@gabrielspmoreira
Owner

Hi. Sorry for the delayed response. nar_trainer_gcom.py did in fact have an issue: it required some assets that were not available for the G1 dataset (as the raw textual content of the articles could not be released for that dataset). The information the NAR training now requires from the ACR module is limited to two command-line parameters (--acr_module_articles_metadata_csv_path and --acr_module_articles_content_embeddings_pickle_path), which correspond to the files available in the public dataset URL.
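With that change, the NAR training only needs to read the two published files directly. A sketch of such loading (the function name is hypothetical; the actual code in nar_trainer_gcom.py may differ):

```python
import pickle
import pandas as pd

def load_acr_assets(metadata_csv_path, embeddings_pickle_path):
    # articles_metadata.csv and articles_embeddings.pickle are the two
    # files published with the G1 dataset on Kaggle; no label-encoder
    # pickle is needed.
    articles_metadata_df = pd.read_csv(metadata_csv_path)
    with open(embeddings_pickle_path, "rb") as f:
        content_article_embeddings = pickle.load(f)
    return articles_metadata_df, content_article_embeddings
```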
