<img align="left" src="https://ithaka-labs.s3.amazonaws.com/static-files/images/tdm/tdmdocs/CC_BY.png"><br />

Created by Amy Kirchhoff for [Constellate](https://constellate.org/) under [Creative Commons CC BY License](https://creativecommons.org/licenses/by/4.0/)<br />
For questions/comments/improvements, email amy.kirchhoff@ithaka.org.<br />

# Create a CSV from a Dataset with Abstracts



___

## Import your dataset

We'll use the `constellate` client to automatically retrieve the dataset as a gzipped JSON Lines (JSONL) file.

Enter a [dataset ID](https://docs.constellate.org/key-terms/#dataset-ID) in the next code cell.

If you don't have a dataset ID, you can:
* Use the sample dataset ID already in the code cell
* [Create a new dataset](https://constellate.org/builder)
* [Use a dataset ID from other pre-built sample datasets](https://constellate.org/dataset/dashboard)

In [None]:
# Creating a variable `dataset_id` to hold our dataset ID
# The default dataset is Shakespeare Quarterly, 1950-present
dataset_id = "18ca8082-f0af-eb43-0d51-11ce559b3ec1"

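Next, import the `constellate` client and pass it the `dataset_id`. The cell below calls the `download` method to fetch the full dataset; the `get_dataset` call, which fetches the sampled dataset, is included but commented out.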

In [None]:
# Importing your dataset with a dataset ID
import constellate
# Pull in the sampled dataset (1500 documents) that matches `dataset_id`
# in the form of a gzipped JSON lines file.
# The .get_dataset() method downloads the gzipped JSONL file
# to the /data folder and returns a string for the file name and location
#dataset_file = constellate.get_dataset(dataset_id)

# To download the full dataset (up to a limit of 25,000 documents),
# request it first in the builder environment. See the Constellate Client
# documentation at: https://constellate.org/docs/constellate-client
# Then use the `constellate.download` method shown below.
dataset_file = constellate.download(dataset_id, 'jsonl')
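To make the file format concrete, here is a minimal sketch of reading a gzipped JSON Lines file with only the Python standard library. It is not the `constellate` client's implementation. The helper `read_jsonl_gz`, the sample documents, and the temporary file path are all made up for illustration.

```python
import gzip
import json
import os
import tempfile


def read_jsonl_gz(path):
    """Yield one dict per non-blank line of a gzipped JSON Lines file."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines
                yield json.loads(line)


# Build a tiny two-document sample file so the sketch runs without a download.
sample_docs = [
    {"id": "doc-1", "title": "First sample", "abstract": "A short abstract."},
    {"id": "doc-2", "title": "Second sample"},  # no abstract field
]
sample_path = os.path.join(tempfile.mkdtemp(), "sample.jsonl.gz")
with gzip.open(sample_path, "wt", encoding="utf-8") as handle:
    for doc in sample_docs:
        handle.write(json.dumps(doc) + "\n")

# Read the file back, one document (dict) per line.
loaded_docs = list(read_jsonl_gz(sample_path))
for doc in loaded_docs:
    print(doc["id"], doc.get("abstract", ""))
```

Each line of the file is one complete JSON document, so a reader can process documents one at a time without loading the whole dataset into memory.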

Import some of the packages we will need.

In [None]:
import csv
import pandas as pd

Set up an empty dataframe with one column for each metadata field we want to keep.

In [None]:
df = pd.DataFrame(columns = ['id', 'docType', 'docSubType', 'provider', 'title', 'subTitle', 'collection', 'creator', 'publicationYear', 
                             'isPartOf', 'doi', 'pageStart', 'pageEnd', 'pageCount', 'wordCount', 'pagination', 'language', 
                             'publisher', 'placeOfPublication', 'identifier', 'abstract', 'url', 'tdmCategory', 
                             'sourceCategory', 'sequence', 'issueNumber', 'volumeNumber', 'outputFormat', 'datePublished' ])

Read each document using `constellate.dataset_reader` and insert the pertinent fields into our dataframe.

In [None]:
# Collect one record per document, then build the dataframe in a single step.
# (Calling pd.concat inside the loop copies the whole dataframe on every pass.)
columns = list(df.columns)
records = []
for i, document in enumerate(constellate.dataset_reader(dataset_file), start=1):
    # Every field falls back to an empty string when it is missing...
    record = {column: document.get(column, "") for column in columns}
    # ...except `id`, which is required and raises a KeyError if absent
    record['id'] = document['id']
    records.append(record)

    # We could make the above better by dealing with the lists of identifiers and authors in a more thoughtful way

    if i % 100 == 0:
        print("At document #: " + str(i))

# Keep the column order defined when the dataframe was set up
df = pd.DataFrame.from_records(records, columns=columns)
df
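The comment in the loop notes that the lists of identifiers and authors could be handled more thoughtfully. One possible approach is sketched below: flatten each list into a single delimited string before it goes into the dataframe. This sketch assumes that `creator` is a list of name strings and that `identifier` is a list of dictionaries with `name` and `value` keys. Check a document from your own dataset before relying on those shapes. The helper names `join_creators` and `join_identifiers` and the sample document are hypothetical.

```python
def join_creators(creators, separator="; "):
    """Turn a list of author names into one delimited string."""
    if not creators:
        return ""
    if isinstance(creators, str):
        return creators
    return separator.join(str(name) for name in creators)


def join_identifiers(identifiers, separator="; "):
    """Turn a list of {'name': ..., 'value': ...} dicts into 'name:value' pairs."""
    if not identifiers:
        return ""
    pairs = []
    for item in identifiers:
        if isinstance(item, dict):
            pairs.append(f"{item.get('name', '')}:{item.get('value', '')}")
        else:
            # Fall back to the plain string form for unexpected entries
            pairs.append(str(item))
    return separator.join(pairs)


# A made-up document for illustration only (assumed field shapes).
sample_document = {
    "id": "doc-1",
    "creator": ["Ada Example", "Ben Sample"],
    "identifier": [
        {"name": "issn", "value": "0000-0000"},
        {"name": "local_doi", "value": "10.0000/example"},
    ],
}

flat_creator = join_creators(sample_document.get("creator", ""))
flat_identifier = join_identifiers(sample_document.get("identifier", ""))
print(flat_creator)
print(flat_identifier)
```

Flattened strings keep each CSV cell readable in a spreadsheet. The trade-off is that the separator must not appear inside the values themselves.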
        
    

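Finally, write the dataframe to a CSV file named after the dataset ID.

In [None]:
# index=False leaves the dataframe's row index out of the CSV
df.to_csv(dataset_id + "-md-with-abstracts.csv", encoding='utf-8', index=False)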
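To check the result, the CSV can be read back and filtered to the rows that actually have an abstract. The sketch below uses a made-up two-row dataframe and an in-memory buffer in place of the real file, so the names `sample_df`, `reloaded`, and `with_abstracts` are illustrative only. It passes `keep_default_na=False` because `pd.read_csv` otherwise turns empty cells into `NaN`.

```python
import io

import pandas as pd

# A made-up two-row dataframe standing in for the real one.
sample_df = pd.DataFrame([
    {"id": "doc-1", "title": "First sample", "abstract": "A short abstract."},
    {"id": "doc-2", "title": "Second sample", "abstract": ""},
])

# Write to an in-memory buffer instead of a file so the sketch leaves nothing on disk.
buffer = io.StringIO()
sample_df.to_csv(buffer, index=False)
buffer.seek(0)

# keep_default_na=False keeps empty cells as "" instead of converting them to NaN.
reloaded = pd.read_csv(buffer, keep_default_na=False)

# Keep only the rows whose abstract is not empty or whitespace.
with_abstracts = reloaded[reloaded["abstract"].str.strip() != ""]
print(len(reloaded), "rows,", len(with_abstracts), "with an abstract")
```

For the real file, replace the buffer with the CSV path written in the cell above.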