<img align="left" src="https://ithaka-labs.s3.amazonaws.com/static-files/images/tdm/tdmdocs/CC_BY.png"><br />

Created by Amy Kirchhoff for [Constellate](https://constellate.org/) under [Creative Commons CC BY License](https://creativecommons.org/licenses/by/4.0/)<br />
For questions/comments/improvements, email amy.kirchhoff@ithaka.org.<br />

# Create a CSV from a Dataset with Abstracts



___

## Import your dataset

We'll use the `constellate` client to automatically retrieve the dataset as a gzipped JSON Lines (JSONL) file.

Enter a [dataset ID](https://docs.constellate.org/key-terms/#dataset-ID) in the next code cell.

If you don't have a dataset ID, you can:
* Use the sample dataset ID already in the code cell
* [Create a new dataset](https://constellate.org/builder)
* [Use a dataset ID from other pre-built sample datasets](https://constellate.org/dataset/dashboard)

In [1]:
# Creating a variable `dataset_id` to hold our dataset ID
# The sample dataset holds 434 articles with innovation-related titles
# from three journals (see the output of the download cell)
dataset_id = "18ca8082-f0af-eb43-0d51-11ce559b3ec1"

Next, import the `constellate` client and pass it the `dataset_id`. The `get_dataset` method retrieves a sampled dataset; the cell below instead uses the `download` method to retrieve the full dataset.

In [2]:
# Importing your dataset with a dataset ID
import constellate
# Pull in the sampled dataset (1500 documents) that matches `dataset_id`
# in the form of a gzipped JSON lines file.
# The .get_dataset() method downloads the gzipped JSONL file
# to the /data folder and returns a string for the file name and location
#dataset_file = constellate.get_dataset(dataset_id)

# To download the full dataset (up to a limit of 25,000 documents),
# request it first in the builder environment. See the Constellate Client
# documentation at: https://constellate.org/docs/constellate-client
# Then use the `constellate.download` method shown below.
dataset_file = constellate.download(dataset_id, 'jsonl')

Constellate: use and download of datasets is covered by the Terms & Conditions of Use: https://constellate.org/terms-and-conditions/
ti:"innovation" OR ti:"innovations" OR ti:"invention" OR ti:"inventions" OR ti:"technological change" OR ti:"technical change" OR ti:"technological discontinuity" OR ti:"technical discontinuity" published in The American Economic Review, The Quarterly Journal of Economics, Strategic Management Journal from 1850 - 2021. 434 documents.
INFO:root:File /home/jovyan/data/18ca8082-f0af-eb43-0d51-11ce559b3ec1-jsonl.jsonl.gz exists. Not re-downloading.
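
For readers curious about what a gzipped JSON Lines file contains, here is a minimal sketch that reads one using only the standard library. It illustrates the file format (one JSON document per line, compressed with gzip) and is not the implementation of `constellate.dataset_reader`. The helper name `read_jsonl_gz` and the tiny sample file are hypothetical, created only so the sketch runs without a real dataset.

```python
import gzip
import json
import os
import tempfile


def read_jsonl_gz(path):
    """Yield one dict per non-empty line of a gzipped JSON Lines file."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)


# Build a tiny stand-in file (hypothetical data, not a Constellate dataset).
sample_docs = [
    {"id": "doc-1", "title": "First sample", "publicationYear": 1956},
    {"id": "doc-2", "title": "Second sample"},
]
sample_path = os.path.join(tempfile.mkdtemp(), "sample.jsonl.gz")
with gzip.open(sample_path, "wt", encoding="utf-8") as handle:
    for doc in sample_docs:
        handle.write(json.dumps(doc) + "\n")

# Read the documents back, one dict per line.
loaded_docs = list(read_jsonl_gz(sample_path))
print(len(loaded_docs), loaded_docs[0]["title"])
```

Because each line is an independent JSON document, the file can be processed one document at a time without loading the whole dataset into memory.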


Import some of the packages we will need.

In [3]:
import csv
import pandas as pd

INFO:numexpr.utils:NumExpr defaulting to 8 threads.


Set up an empty dataframe with one column for each metadata field.

In [4]:
df = pd.DataFrame(columns = ['id', 'docType', 'docSubType', 'provider', 'title', 'subTitle', 'collection', 'creator', 'publicationYear', 
                             'isPartOf', 'doi', 'pageStart', 'pageEnd', 'pageCount', 'wordCount', 'pagination', 'language', 
                             'publisher', 'placeOfPublication', 'identifier', 'abstract', 'url', 'tdmCategory', 
                             'sourceCategory', 'sequence', 'issueNumber', 'volumeNumber', 'outputFormat', 'datePublished' ])

Read each document using `constellate.dataset_reader` and insert the pertinent fields into our dataframe.

In [5]:
i = 0
for document in constellate.dataset_reader(dataset_file):
    i = i + 1

    # Build a one-row dataframe from this document's metadata and append it.
    # `id` is required; every other field defaults to an empty string.
    df = pd.concat([df, pd.DataFrame.from_records([{
        'id': document['id'],
        'docType': document.get('docType', ""),
        'docSubType': document.get('docSubType', ""),
        'provider': document.get('provider', ""),
        'title': document.get('title', ""),
        'subTitle': document.get('subTitle', ""),
        'collection': document.get('collection', ""),
        'creator': document.get('creator', ""),
        'publicationYear': document.get('publicationYear', ""),
        'isPartOf': document.get('isPartOf', ""),
        'doi': document.get('doi', ""),
        'pageStart': document.get('pageStart', ""),
        'pageEnd': document.get('pageEnd', ""),
        'pageCount': document.get('pageCount', ""),
        'wordCount': document.get('wordCount', ""),
        'pagination': document.get('pagination', ""),
        'language': document.get('language', ""),
        'publisher': document.get('publisher', ""),
        'placeOfPublication': document.get('placeOfPublication', ""),
        'identifier': document.get('identifier', ""),
        'abstract': document.get('abstract', ""),
        'url': document.get('url', ""),
        'tdmCategory': document.get('tdmCategory', ""),
        'sourceCategory': document.get('sourceCategory', ""),
        'sequence': document.get('sequence', ""),
        'issueNumber': document.get('issueNumber', ""),
        'volumeNumber': document.get('volumeNumber', ""),
        'outputFormat': document.get('outputFormat', ""),
        'datePublished': document.get('datePublished', "")
    }])])

    # We could make the above better by dealing with the lists of identifiers and authors in a more thoughtful way

    # Report progress every 100 documents
    if i % 100 == 0:
        print("At document #: " + str(i))

df


At document #: 100
At document #: 200
At document #: 300
At document #: 400


Unnamed: 0,id,docType,docSubType,provider,title,subTitle,collection,creator,publicationYear,isPartOf,...,identifier,abstract,url,tdmCategory,sourceCategory,sequence,issueNumber,volumeNumber,outputFormat,datePublished
0,http://www.jstor.org/stable/1814296,article,research-article,jstor,"Technological Change, Obsolescence and Aggrega...",,,[Robert Eisner],1956,The American Economic Review,...,"[{'name': 'issn', 'value': '00028282'}, {'name...",,http://www.jstor.org/stable/1814296,[Mathematics - Mathematical objects],"[Business & Economics, Business, Economics]",,4,46,"[unigram, bigram, trigram]",1956-09-01
0,http://www.jstor.org/stable/1803327,article,research-article,jstor,Uncertain Innovation and the Persistence of Mo...,,,"[Richard J. Gilbert, David M. G. Newbery]",1984,The American Economic Review,...,"[{'name': 'issn', 'value': '00028282'}, {'name...",,http://www.jstor.org/stable/1803327,[Law - Civil law],"[Business & Economics, Business, Economics]",,1,74,"[unigram, bigram, trigram]",1984-03-01
0,http://www.jstor.org/stable/23469764,article,research-article,jstor,Financial Innovation and Portfolio Risks,,,[Alp Simsek],2013,The American Economic Review,...,"[{'name': 'issn', 'value': '00028282'}, {'name...",,http://www.jstor.org/stable/23469764,"[Economics - Economic disciplines, Law - Civil...","[Business & Economics, Business, Economics]",,3,103,"[unigram, bigram, trigram]",2013-05-01
0,http://www.jstor.org/stable/3094054,article,research-article,jstor,Evolutionary Diffusion: Internal and External ...,,,"[Anuradha Nagarajan, Will Mitchell]",1998,Strategic Management Journal,...,"[{'name': 'issn', 'value': '01432095'}, {'name...",This study links theories concerning methods t...,http://www.jstor.org/stable/3094054,"[Business - Business administration, Applied s...","[Management & Organizational Behavior, Busines...",,11,19,"[unigram, bigram, trigram]",1998-11-01
0,ark://27927/pf1d05m4zc,article,,portico,Spawned with a silver spoon? Entrepreneurial p...,,,[Aaron K. Chatterji],2009,Strategic Management Journal,...,"[{'name': 'doi', 'value': '10.1002/smj.729'}, ...",,http://doi.org/10.1002/smj.729,[Information science - Informetrics],,4,2,30,"[unigram, bigram, trigram]",2009-02-01
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
0,http://www.jstor.org/stable/1816273,article,research-article,jstor,"Market Structure, Business Conduct, and Innova...",,,[Jesse W. Markham],1965,The American Economic Review,...,"[{'name': 'issn', 'value': '00028282'}, {'name...",,http://www.jstor.org/stable/1816273,[Economics - Microeconomics],"[Business & Economics, Business, Economics]",,1/2,55,"[unigram, bigram, trigram]",1965-03-01
0,http://www.jstor.org/stable/43861131,article,research-article,jstor,"Technological Innovations, Downside Risk, and ...",,,"[Kyle Emerick, Alain de Janvry, Elisabeth Sado...",2016,The American Economic Review,...,"[{'name': 'issn', 'value': '00028282'}, {'name...",We use a randomized experiment in India to sho...,http://www.jstor.org/stable/43861131,[Biological sciences - Ecology],"[Business & Economics, Business, Economics]",,6,106,"[unigram, bigram, trigram]",2016-06-01
0,http://www.jstor.org/stable/2486396,article,research-article,jstor,The Dynamics of Continuous Innovation in Scale...,,,[Yasunori Baba],1989,Strategic Management Journal,...,"[{'name': 'issn', 'value': '01432095'}, {'name...",This paper attempts to explain why some indust...,http://www.jstor.org/stable/2486396,"[Business - Business administration, Philosoph...","[Management & Organizational Behavior, Busines...",,1,10,"[unigram, bigram, trigram]",1989-01-01
0,http://www.jstor.org/stable/26527997,article,research-article,jstor,Team-Specific Capital and Innovation,,,"[Xavier Jaravel, Neviana Petkova, Alex Bell]",2018,The American Economic Review,...,"[{'name': 'issn', 'value': '00028282'}, {'name...",We establish the importance of team-specific c...,http://www.jstor.org/stable/26527997,[Mathematics - Applied mathematics],"[Business & Economics, Business, Economics]",,4-5,108,"[unigram, bigram, trigram]",2018-04-01
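
The loop above calls `pd.concat` once per document, which copies the growing dataframe each time and leaves every row with the index `0`, as the output shows. A common alternative is to collect one plain dictionary per document and build the dataframe once at the end. The sketch below illustrates that pattern. The shortened column list and the stand-in `documents` are hypothetical; in the notebook the documents would come from `constellate.dataset_reader(dataset_file)`. Unlike the loop above, this sketch uses `.get` for every field, including `id`.

```python
import pandas as pd

# Hypothetical subset of the metadata fields, kept short for illustration.
columns = ["id", "title", "creator", "publicationYear", "abstract"]

# Stand-in documents (hypothetical data).
documents = [
    {"id": "doc-1", "title": "First sample", "creator": ["A. Author"],
     "publicationYear": 1956},
    {"id": "doc-2", "title": "Second sample", "abstract": "A short abstract."},
]

# Collect one plain dict per document; missing fields become empty strings.
records = []
for document in documents:
    record = {column: document.get(column, "") for column in columns}
    records.append(record)

# Build the dataframe once; rows receive a default index of 0, 1, 2, ...
df_fast = pd.DataFrame(records, columns=columns)
print(df_fast.shape)
```

Building the dataframe in one step avoids repeated copying, which matters more as the dataset approaches the 25,000-document limit.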


Write the dataframe to a CSV file named after the dataset ID.

In [6]:
df.to_csv(dataset_id + "-md-with-abstracts.csv", encoding='utf-8')
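
The comment in the loop notes that the lists of identifiers and authors could be handled more thoughtfully. As written, list-valued fields such as `creator` and `identifier` are saved to the CSV as the text of a Python list, and the dataframe index is saved as an extra unnamed column. The sketch below shows one possible approach, assuming `creator` is a list of names and `identifier` is a list of `{'name': ..., 'value': ...}` dictionaries, as in the output above. The helper names `join_names` and `find_identifier` and the sample rows are hypothetical.

```python
import io

import pandas as pd

# Stand-in rows shaped like the output above (hypothetical data).
df_sample = pd.DataFrame([
    {"id": "doc-1",
     "creator": ["Richard J. Gilbert", "David M. G. Newbery"],
     "identifier": [{"name": "issn", "value": "00028282"}],
     "abstract": ""},
    {"id": "doc-2", "creator": "", "identifier": "",
     "abstract": "A short abstract."},
])


def join_names(value, separator="; "):
    """Join a list of names into one string; other values become text."""
    if isinstance(value, list):
        return separator.join(str(item) for item in value)
    return "" if value is None else str(value)


def find_identifier(value, name):
    """Return the value of the first identifier with this name, else ''."""
    if isinstance(value, list):
        for entry in value:
            if isinstance(entry, dict) and entry.get("name") == name:
                return entry.get("value", "")
    return ""


df_flat = df_sample.copy()
df_flat["creator"] = df_flat["creator"].apply(join_names)
df_flat["issn"] = df_flat["identifier"].apply(
    lambda value: find_identifier(value, "issn"))
df_flat = df_flat.drop(columns=["identifier"])

# index=False leaves the dataframe index out of the CSV.
buffer = io.StringIO()
df_flat.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```

Writing to an in-memory buffer keeps the sketch free of side effects; passing a file name to `to_csv`, as in the cell above, writes the same text to disk.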