<img align="left" src="https://ithaka-labs.s3.amazonaws.com/static-files/images/tdm/tdmdocs/CC_BY.png"><br />

Created by Amy Kirchhoff for [Constellate](https://constellate.org/) under [Creative Commons CC BY License](https://creativecommons.org/licenses/by/4.0/)<br />
For questions/comments/improvements, email amy.kirchhoff@ithaka.org.<br />

# Create a CSV from a Dataset with Abstracts



___

## Import your dataset

We'll use the `constellate` client to automatically retrieve the dataset in the JSON file format. 

Enter a [dataset ID](https://docs.constellate.org/key-terms/#dataset-ID) in the next code cell.

If you don't have a dataset ID, you can:
* Use the sample dataset ID already in the code cell
* [Create a new dataset](https://constellate.org/builder)
* [Use a dataset ID from other pre-built sample datasets](https://constellate.org/dataset/dashboard)

In [1]:
# Creating a variable `dataset_id` to hold our dataset ID
# The default dataset is Shakespeare Quarterly, 1950-present
dataset_id = "18ca8082-f0af-eb43-0d51-11ce559b3ec1"

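Next, import the `constellate` client and pass the `dataset_id` to one of its retrieval methods. The `get_dataset` method (commented out below) retrieves a sample of the dataset, while the `download` method retrieves the full dataset once it has been requested in the builder.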

In [2]:
# Importing your dataset with a dataset ID
import constellate
# Pull in the sampled dataset (1500 documents) that matches `dataset_id`
# in the form of a gzipped JSON lines file.
# The .get_dataset() method downloads the gzipped JSONL file
# to the /data folder and returns a string for the file name and location
#dataset_file = constellate.get_dataset(dataset_id)

# To download the full dataset (up to a limit of 25,000 documents),
# request it first in the builder environment. See the Constellate Client
# documentation at: https://constellate.org/docs/constellate-client
# Then use the `constellate.download` method shown below.
dataset_file = constellate.download(dataset_id, 'jsonl')

Constellate: use and download of datasets is covered by the Terms & Conditions of Use: https://constellate.org/terms-and-conditions/
"innovation" OR ti:"innovations" OR ti:"invention" OR ti:"inventions" OR ti:"technological change" OR ti:"technical change" OR ti:"technological discontinuity" OR ti:"technical discontinuity" published in The American Economic Review, The Quarterly Journal of Economics, Strategic Management Journal from 1850 - 2021. 7832 documents.
INFO:root:File /root/data/8ff6274d-8080-a83a-c1dc-00f9d05ecf22-jsonl.jsonl.gz exists. Not re-downloading.
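
For readers curious about what the downloaded file contains: a gzipped JSON Lines file holds one JSON object per line. The sketch below reads such a file using only the standard library. It illustrates the file format and is not the `constellate` client's actual implementation; the helper name `read_jsonl_gz` and the sample documents are hypothetical.

```python
import gzip
import json
import os
import tempfile


def read_jsonl_gz(path):
    """Yield one dict per non-empty line of a gzipped JSON Lines file."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)


# Build a tiny sample file so the sketch runs on its own (made-up documents)
sample_docs = [
    {"id": "doc-1", "title": "First sample", "abstract": "A short abstract."},
    {"id": "doc-2", "title": "Second sample"},
]
sample_path = os.path.join(tempfile.mkdtemp(), "sample.jsonl.gz")
with gzip.open(sample_path, "wt", encoding="utf-8") as handle:
    for doc in sample_docs:
        handle.write(json.dumps(doc) + "\n")

loaded = list(read_jsonl_gz(sample_path))
# The second document has no abstract, so .get() falls back to ""
print(len(loaded), "documents;", repr(loaded[1].get("abstract", "")))
```

Not every document carries every field, which is why the notebook reads fields with `.get(field, "")` rather than indexing directly.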


Import some of the packages we will need.

In [3]:
import csv
import pandas as pd

INFO:numexpr.utils:NumExpr defaulting to 4 threads.


Set up an empty dataframe with one column per metadata field.

In [4]:
df = pd.DataFrame(columns = ['id', 'docType', 'docSubType', 'provider', 'title', 'subTitle', 'collection', 'creator', 'publicationYear', 
                             'isPartOf', 'doi', 'pageStart', 'pageEnd', 'pageCount', 'wordCount', 'pagination', 'language', 
                             'publisher', 'placeOfPublication', 'identifier', 'abstract', 'url', 'tdmCategory', 
                             'sourceCategory', 'sequence', 'issueNumber', 'volumeNumber', 'outputFormat', 'datePublished' ])

Read each document using `constellate.dataset_reader` and insert the pertinent fields into our dataframe.

In [5]:
# Collect one dict per document, then build the dataframe in a single step.
# (DataFrame.append was removed in pandas 2.0, and appending row by row is slow.)
rows = []
for i, document in enumerate(constellate.dataset_reader(dataset_file), start=1):
    # Every field except 'id' is optional; missing fields become empty strings
    row = {field: document.get(field, "") for field in df.columns}
    row['id'] = document['id']
    rows.append(row)

    # We could make the above better by dealing with the lists of identifiers and authors in a more thoughtful way

    if i % 100 == 0:
        print("At document #: " + str(i))

# Reuse the column order defined when the dataframe was set up
df = pd.DataFrame(rows, columns=df.columns)
df

At document #: 100
At document #: 200
At document #: 300
At document #: 400
At document #: 500
At document #: 600
At document #: 700
At document #: 800
At document #: 900
At document #: 1000
At document #: 1100
At document #: 1200
At document #: 1300
At document #: 1400
At document #: 1500
At document #: 1600
At document #: 1700
At document #: 1800
At document #: 1900
At document #: 2000
At document #: 2100
At document #: 2200
At document #: 2300
At document #: 2400
At document #: 2500
At document #: 2600
At document #: 2700
At document #: 2800
At document #: 2900
At document #: 3000
At document #: 3100
At document #: 3200
At document #: 3300
At document #: 3400
At document #: 3500
At document #: 3600
At document #: 3700
At document #: 3800
At document #: 3900
At document #: 4000
At document #: 4100
At document #: 4200
At document #: 4300
At document #: 4400
At document #: 4500
At document #: 4600
At document #: 4700
At document #: 4800
At document #: 4900
At document #: 5000
At docume

Unnamed: 0,id,docType,docSubType,provider,title,subTitle,collection,creator,publicationYear,isPartOf,...,identifier,abstract,url,tdmCategory,sourceCategory,sequence,issueNumber,volumeNumber,outputFormat,datePublished
0,ark://27927/pc0gj2v8m,article,research-article,portico,Diversification in context: a cross‐national a...,,,"[Michael Mayer, Richard Whittington]",2003,Strategic Management Journal,...,"[{'name': 'doi', 'value': '10.1002/smj.334'}, ...",,http://doi.org/10.1002/smj.334,"[Applied sciences - Research methods, Mathemat...",,6,8,24,"[unigram, bigram, trigram]",2003-08-01
1,http://www.jstor.org/stable/1828073,article,research-article,jstor,Price Behavior in U.S. Manufacturing: An Empir...,,,[Leonard Sahling],1977,The American Economic Review,...,"[{'name': 'issn', 'value': '00028282'}, {'name...",,http://www.jstor.org/stable/1828073,[Mathematics - Applied mathematics],"[Business & Economics, Business, Economics]",,5,67,"[unigram, bigram, trigram]",1977-12-01
2,ark://27927/pc0cckxpk,article,research-article,portico,Applying epistemic logic and evidential theory...,,,[Carl Brønn],1998,Strategic Management Journal,...,"[{'name': 'doi', 'value': '10.1002/(SICI)1097-...",,http://doi.org/10.1002/(SICI)1097-0266(199801)...,[Mathematics - Mathematical logic],,5,1,19,"[unigram, bigram, trigram]",1998-01-01
3,ark://27927/pc0mkhhg5,article,research-article,portico,"Strategic positioning, human capital, and perf...",,,"[Bruce C. Skaggs, Mark Youndt]",2004,Strategic Management Journal,...,"[{'name': 'doi', 'value': '10.1002/smj.365'}, ...",,http://doi.org/10.1002/smj.365,"[Philosophy - Applied philosophy, Applied scie...",,6,1,25,"[unigram, bigram, trigram]",2004-01-01
4,http://www.jstor.org/stable/2118434,article,research-article,jstor,Permanent and Transitory Components of GNP and...,,,[John H. Cochrane],1994,The Quarterly Journal of Economics,...,"[{'name': 'issn', 'value': '00335533'}, {'name...",This paper uses two-variable autoregressions t...,http://www.jstor.org/stable/2118434,[Mathematics - Mathematical objects],"[Business & Economics, Business, Economics]",,1,109,"[unigram, bigram, trigram]",1994-02-01
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7827,http://www.jstor.org/stable/1814611,article,book-review,jstor,Review Article,,,[John Sheahan],1965,The American Economic Review,...,"[{'name': 'issn', 'value': '00028282'}, {'name...",,http://www.jstor.org/stable/1814611,,"[Business & Economics, Business, Economics]",,3,55,"[unigram, bigram, trigram]",1965-06-01
7828,http://www.jstor.org/stable/1881742,article,research-article,jstor,Co-operative Production in France and England,,,[Edward Cummings],1890,The Quarterly Journal of Economics,...,"[{'name': 'issn', 'value': '00335533'}, {'name...",,http://www.jstor.org/stable/1881742,[Philosophy - Applied philosophy],"[Business & Economics, Business, Economics]",,4,4,"[unigram, bigram, trigram, fullText]",1890-07-01
7829,http://www.jstor.org/stable/3094044,article,research-article,jstor,Technology Development Mode: A Transaction Cos...,,,"[Thomas S. Robertson, Hubert Gatignon]",1998,Strategic Management Journal,...,"[{'name': 'issn', 'value': '01432095'}, {'name...",Technology alliances have emerged in the past ...,http://www.jstor.org/stable/3094044,[Economics - Microeconomics],"[Management & Organizational Behavior, Busines...",,6,19,"[unigram, bigram, trigram]",1998-06-01
7830,http://www.jstor.org/stable/20142145,article,research-article,jstor,The Influence of Mergers on Firms' Product-Mix...,,,"[Ranjani A. Krishnan, Satish Joshi, Hema Krish...",2004,Strategic Management Journal,...,"[{'name': 'issn', 'value': '01432095'}, {'name...",This study draws on the institutional and reso...,http://www.jstor.org/stable/20142145,"[Health sciences - Health and wellness, Busine...","[Management & Organizational Behavior, Busines...",,6,25,"[unigram, bigram, trigram]",2004-06-01
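
The output above shows that `creator` holds a list of names and `identifier` holds a list of name/value dictionaries, so both end up as Python-literal strings in a CSV. One possible way to act on the note in the loop about handling these lists more thoughtfully is to flatten them before building each row. This is a minimal sketch; `flatten_value` and the `"; "` separator are hypothetical choices, not part of the `constellate` client.

```python
def flatten_value(value, separator="; "):
    """Turn list-valued metadata into a single readable string."""
    if isinstance(value, list):
        parts = []
        for item in value:
            if isinstance(item, dict) and "name" in item and "value" in item:
                # Identifier entries look like {'name': 'doi', 'value': '...'}
                parts.append(f"{item['name']}:{item['value']}")
            else:
                parts.append(str(item))
        return separator.join(parts)
    # Scalars (strings, numbers) pass through unchanged
    return value


# Sample values shaped like the dataframe output above (abridged)
document = {
    "creator": ["Michael Mayer", "Richard Whittington"],
    "identifier": [{"name": "doi", "value": "10.1002/smj.334"}],
    "publicationYear": 2003,
}
flat = {field: flatten_value(value) for field, value in document.items()}
print(flat)
```

Applying a helper like this to each value inside the loop would produce CSV cells such as `Michael Mayer; Richard Whittington` instead of a bracketed list.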


In [None]:
df.to_csv(dataset_id + "-md-with-abstracts.csv", encoding='utf-8')
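
Once the CSV exists, it can be read back with the standard `csv` module, for example to keep only the rows that have an abstract. The sketch below uses a small in-memory CSV with made-up rows in place of the real file, so the column subset and values are assumptions for illustration.

```python
import csv
import io

# Stand-in for the exported file: a few columns and made-up rows
csv_text = (
    "id,title,abstract\n"
    "doc-1,First sample,This paper studies something.\n"
    "doc-2,Second sample,\n"
    "doc-3,Third sample,Another short abstract.\n"
)

# DictReader maps each row to {column name: cell text}
reader = csv.DictReader(io.StringIO(csv_text))
with_abstracts = [row for row in reader if row["abstract"].strip()]

print(len(with_abstracts), "rows have an abstract")
for row in with_abstracts:
    print(row["id"], "-", row["title"])
```

To run this against the real export, open the file written by `df.to_csv` with `encoding='utf-8'` and `newline=''`, then pass the file object to `csv.DictReader` in place of the `io.StringIO` object.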