<img align="left" src="https://ithaka-labs.s3.amazonaws.com/static-files/images/tdm/tdmdocs/CC_BY.png"><br />

Created by Amy Kirchhoff for [Constellate](https://constellate.org/) under [Creative Commons CC BY License](https://creativecommons.org/licenses/by/4.0/)<br />
For questions/comments/improvements, email amy.kirchhoff@ithaka.org.<br />

# Extract full text and write it to a directory of files



___

## Import your dataset

We'll use the `constellate` client to automatically retrieve the dataset in the JSON file format. 

Enter a [dataset ID](https://docs.constellate.org/key-terms/#dataset-ID) in the next code cell.

If you don't have a dataset ID, you can:
* Use the sample dataset ID already in the code cell
* [Create a new dataset](https://constellate.org/builder)
* [Use a dataset ID from other pre-built sample datasets](https://constellate.org/dataset/dashboard)

In [None]:
# Creating a variable `dataset_id` to hold our dataset ID
# The default dataset is Shakespeare Quarterly, 1950-present
#dataset_id = "18ca8082-f0af-eb43-0d51-11ce559b3ec1"

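Next, import the `constellate` client and the other packages we need, and point `dataset_file` at the dataset file in the `data` folder. The cell below expects that file to exist already. To download the dataset instead, pass `dataset_id` to the client's `get_dataset` method.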

In [1]:
# Import the Constellate client and the other packages we need
import constellate
import os
from pathlib import Path
import csv
import pandas as pd

# Check if a data folder exists. If not, create it.
data_folder = Path('./data/')
data_folder.mkdir(exist_ok=True)

# Path to the dataset file
# This cell does not download anything, so the file must already be in the data folder
# The default dataset is Shakespeare Quarterly, 1950-present
dataset_file = Path.cwd() / 'data' / 'my_data.jsonl.gz' # Make sure this filepath matches your dataset filename

Constellate: use and download of datasets is covered by the Terms & Conditions of Use: https://constellate.org/terms-and-conditions/
INFO:numexpr.utils:NumExpr defaulting to 8 threads.
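
The `.jsonl.gz` extension suggests a gzip-compressed JSON Lines file, with one JSON document per line. The sketch below shows how such a file could be read with only the standard library. It is an illustration of the file format, not the `constellate` client's implementation; the function name `read_jsonl_gz` and the sample documents are hypothetical.

```python
import gzip
import json
import tempfile
from pathlib import Path


def read_jsonl_gz(path):
    """Yield one dict per non-empty line of a gzipped JSON Lines file."""
    with gzip.open(path, 'rt', encoding='utf-8') as handle:
        for line in handle:
            if line.strip():
                yield json.loads(line)


# Build a tiny sample file so the sketch runs without a real dataset
sample_docs = [
    {'id': 'http://example.org/doc/1', 'title': 'First',
     'fullText': ['page one ', 'page two']},
    {'id': 'http://example.org/doc/2', 'title': 'Second'},
]
sample_path = Path(tempfile.mkdtemp()) / 'sample.jsonl.gz'
with gzip.open(sample_path, 'wt', encoding='utf-8') as handle:
    for doc in sample_docs:
        handle.write(json.dumps(doc) + '\n')

# Read the file back, one document at a time
loaded_docs = list(read_jsonl_gz(sample_path))
print(len(loaded_docs), 'documents read')
```

Because the reader is a generator, only one document is held in memory at a time, which matters for large datasets.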


Set up an empty dataframe to hold the metadata, and create a folder for the full-text files.

In [2]:
df = pd.DataFrame(columns = ['id', 'docType', 'docSubType', 'provider', 'title', 'subTitle', 'collection', 'creator', 'publicationYear', 
                             'isPartOf', 'doi', 'pageStart', 'pageEnd', 'pageCount', 'wordCount', 'pagination', 'language', 
                             'publisher', 'placeOfPublication', 'identifier', 'abstract', 'url', 'tdmCategory', 
                             'sourceCategory', 'sequence', 'issueNumber', 'volumeNumber', 'outputFormat', 'datePublished', 'fullText' ])

# Create a folder for the full-text files if it does not exist yet
ft_folder = Path('./ft/')
ft_folder.mkdir(exist_ok=True)


Read each document using `constellate.dataset_reader`, write its full text to a file in the `ft` folder, and insert the pertinent metadata fields into our dataframe.

In [7]:
# Every metadata column except fullText, which goes to its own text file
metadata_fields = [column for column in df.columns if column != 'fullText']
records = []

for document in constellate.dataset_reader(dataset_file):
    # Document IDs are URLs, so replace '/' to get a usable filename
    ft_name = document.get('id', "").replace('/', '_') + ".txt"
    print(ft_name)

    # Write each piece of the full text to the document's own file
    ft = document.get('fullText', "")
    with open(ft_folder / ft_name, 'w', encoding='utf-8') as f:
        for chunk in ft:
            f.write(chunk)

    # Collect the metadata, using "" for any missing field
    records.append({field: document.get(field, "") for field in metadata_fields})

# Build the dataframe once, after the loop, keeping the column order defined above
df = pd.DataFrame(records, columns=df.columns)

http:__www.jstor.org_stable_976884.txt
http:__www.jstor.org_stable_3517931.txt
http:__www.jstor.org_stable_10.13173_centasiaj.58.1-2.0228.txt
http:__www.jstor.org_stable_10.5406_jsporthistory.43.2.250.txt
http:__www.jstor.org_stable_24452479.txt
http:__www.jstor.org_stable_43611606.txt
http:__www.jstor.org_stable_45136330.txt
http:__www.jstor.org_stable_10.13173_centasiaj.58.1-2.0227.txt
http:__www.jstor.org_stable_48703935.txt
http:__www.jstor.org_stable_43611621.txt
http:__www.jstor.org_stable_10.5406_jsporthistory.43.2.251.txt
http:__www.jstor.org_stable_24452564.txt
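
The filenames above still contain the colon from the URL-style IDs, which Windows does not allow in filenames. If the files need to be portable, a stricter sanitizer could be used instead of replacing only `/`. The helper below is a minimal sketch; the name `safe_filename` and the choice of allowed characters are assumptions, not part of the original notebook.

```python
import re


def safe_filename(doc_id, suffix='.txt'):
    """Replace each run of characters outside A-Z, a-z, 0-9, '.', '_', '-' with one underscore."""
    return re.sub(r'[^A-Za-z0-9._-]+', '_', doc_id) + suffix


print(safe_filename('http://www.jstor.org/stable/976884'))
# prints http_www.jstor.org_stable_976884.txt
```

Dots and hyphens are kept, so DOI-style IDs such as `10.13173/centasiaj.58.1-2.0228` remain recognizable in the filename.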


In [8]:
df.to_csv("./data/my_data.md-with-abstracts.csv", encoding='utf-8')
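
To check the export, the CSV can be read back with `pd.read_csv`. The sketch below does a round trip on a small hypothetical dataframe written to a temporary folder. It assumes that some fields, such as `creator`, hold lists; CSV stores those as text, so they need to be parsed again after loading.

```python
import ast
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical sample row; real rows come from the dataset
sample = pd.DataFrame([{'id': 'http://example.org/doc/1',
                        'title': 'First',
                        'creator': ['A. Author', 'B. Writer']}])

csv_path = Path(tempfile.mkdtemp()) / 'sample.csv'
sample.to_csv(csv_path, encoding='utf-8')

# index_col=0 restores the row index that to_csv wrote as the first column
restored = pd.read_csv(csv_path, index_col=0, encoding='utf-8')

# List values come back as their string representation, so parse them again
restored['creator'] = restored['creator'].apply(ast.literal_eval)
print(restored.loc[0, 'creator'])
```

`ast.literal_eval` only parses Python literals, which makes it a safer choice than `eval` for this conversion.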