## Gathering and transforming data for use in [Torch](https://github.com/jcjohnson/torch-rnn)
The scripts below pull data from the NASA Astrophysics Data System (ADS) API and convert it into a text file. The screenshots at the end show Torch being used in the terminal to create the .h5 and .json files needed to train an LSTM RNN.

In [6]:
# modules for ADS API and data transformations
import os
import ads
import pandas as pd
import numpy as np

### Configure API and execute API call for 1,000 most recent "black hole" papers
* I will also run calls for 10,000 and 100,000 records using the same methods; the 1,000-record set serves as the example here.
* Subsequently I will construct three datasets of 1,000, 10,000, and 100,000 records for each of three different topics, for comparison.
* This demo covers only "black hole" papers.
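
The query later in this notebook requests `max_pages=20` for the 1,000-record set, which implies 50 records per page. Under that assumption (the page size is inferred, not stated in this notebook), a minimal sketch of the page counts needed for the larger sets could look like this. `pages_needed` is a hypothetical helper, not part of the `ads` package.

```python
import math

# Assumption: 50 records per page, inferred from max_pages=20 for 1,000 records.
ROWS_PER_PAGE = 50


def pages_needed(n_records, rows_per_page=ROWS_PER_PAGE):
    """Return how many result pages are required to cover n_records."""
    return math.ceil(n_records / rows_per_page)


# Page counts for the three planned dataset sizes.
for n in (1000, 10000, 100000):
    print(n, "records ->", pages_needed(n), "pages")
```

The resulting value would be passed as `max_pages` in the search call; whether the API permits that many requests for a given key is not covered here.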

In [7]:
os.environ["ADS_DEV_KEY"] = "kNUoTurJ5TXV9hsw9KQN1k8wH4U0D7Oy0CJoOvyw"
# pass the key itself to the ads client, not the name of the environment variable
ads.config.token = os.environ["ADS_DEV_KEY"]
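
A key written directly into a notebook is easy to leak when the notebook is shared. As a minimal sketch, assuming the key has been exported as `ADS_DEV_KEY` in the shell before the notebook starts, it could be read and validated like this. `read_ads_token` is a hypothetical helper, not part of the `ads` package.

```python
import os


def read_ads_token(env=None, var_name="ADS_DEV_KEY"):
    """Return the API token stored in an environment variable.

    Hypothetical helper: raises a clear error when the variable is
    missing or empty instead of failing later during the API call.
    """
    env = os.environ if env is None else env
    token = env.get(var_name, "").strip()
    if not token:
        raise RuntimeError(
            "Set {} in your shell before starting the notebook".format(var_name)
        )
    return token


# Usage in the notebook (assumes the variable was exported beforehand):
# ads.config.token = read_ads_token()
```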

### Search for papers and retrieve associated bibcodes, titles, and abstracts
Initially I will generate only titles with my ANN, using the titles from this API call as the training dataset. I can then compare the generated titles with the training titles to see how closely they match. For my project I will also experiment with using abstracts instead of titles. Titles are simply much faster to work with because they are short.

In [8]:
# request 20 pages of results, sorted by publication date
papers1 = list(ads.SearchQuery(q="black hole", fl=['bibcode', 'title', 'abstract'], sort='pubdate', max_pages=20))

In [10]:
# collect the title field of each paper
title = [paper.title for paper in papers1]

Convert the resulting titles into a dataframe so that they can be easily written to a text file for use in [Torch](https://github.com/jcjohnson/torch-rnn). Torch requires a .txt file as input to its Python preprocessing script, which creates the .json and .h5 files that the RNN library needs.

In [11]:
# create an initial df (only 1 column) and clean it up
df = pd.DataFrame({'Title': title})
# each entry in the column is a list; keep its first element as the title string
df['Title'] = df['Title'].str.get(0)

# write to .txt
df.to_csv("blackhole_1000.txt", sep=' ', header=None, index=None, encoding='utf-8')
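
With a space as the separator and pandas' default quoting, `to_csv` wraps any field that contains the separator in double quotes, so multi-word titles are likely to appear quoted in the output file. If plain text with one title per line is preferred, a minimal alternative sketch is shown below. `write_titles` is a hypothetical helper, and the sample titles are made up for illustration.

```python
import io
import os
import tempfile


def write_titles(titles, path):
    """Write one title per line and return the number of lines written.

    `titles` is a list whose entries are lists of strings (the shape the
    notebook unpacks with .str.get(0)); empty or missing entries are skipped.
    """
    count = 0
    with io.open(path, "w", encoding="utf-8") as fh:
        for entry in titles:
            if not entry:
                continue
            # collapse internal whitespace so each title stays on one line
            line = " ".join(entry[0].split())
            if line:
                fh.write(line + "\n")
                count += 1
    return count


# Made-up sample data in the same shape as the API titles.
sample = [["Black hole  spin"], None, ["Accretion\ndisks"], []]
out_path = os.path.join(tempfile.mkdtemp(), "titles.txt")
written = write_titles(sample, out_path)
with io.open(out_path, encoding="utf-8") as fh:
    lines = fh.read().splitlines()
print(written, lines)
```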

In [15]:
# get average number of characters in a title (integer division) for use in Torch
sum(df['Title'].str.len()) // len(df)

84
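
The average depends on how many titles were actually returned, so it can help to compute it from the data itself and to look at the spread as well. A minimal sketch using only the standard library follows. `title_length_stats` is a hypothetical helper, and the sample titles are made up.

```python
from statistics import mean, median


def title_length_stats(titles):
    """Summarise character counts for the non-empty titles in a list."""
    lengths = [len(t) for t in titles if t]
    return {
        "count": len(lengths),
        "mean": mean(lengths),
        "median": median(lengths),
        "max": max(lengths),
    }


# Made-up sample titles; in the notebook this would be df['Title'].tolist().
stats = title_length_stats(["Black hole spin", "Accretion disks", ""])
print(stats)
```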

Create the .json and .h5 files using Torch's [preprocessing script](https://github.com/jcjohnson/torch-rnn/blob/master/scripts/preprocess.py):
![Data Conversion](Torch_File_Conversion.png)
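
Conceptually, character-level preprocessing maps each distinct character to an integer (the vocabulary, which is the kind of information stored in the .json file) and encodes the text as a sequence of those integers (the kind of data stored in the .h5 file). The sketch below illustrates that idea only. It is not torch-rnn's code, the function names are hypothetical, and details such as data splits and file output are omitted.

```python
def build_vocab(text):
    """Map each distinct character to a 1-based integer index."""
    return {ch: i for i, ch in enumerate(sorted(set(text)), start=1)}


def encode(text, vocab):
    """Turn a string into a list of integer indices."""
    return [vocab[ch] for ch in text]


def decode(indices, vocab):
    """Turn a list of integer indices back into a string."""
    inverse = {i: ch for ch, i in vocab.items()}
    return "".join(inverse[i] for i in indices)


sample_text = "black hole\n"
vocab = build_vocab(sample_text)
encoded = encode(sample_text, vocab)
print(len(vocab), "distinct characters")
print(encoded)
```

Because the model only ever sees these integers, any stray characters in the text file (for example quote marks added during export) become part of the vocabulary.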
  
Resulting new files needed for training the model:
![New Files](New_Files.png)
   
We can now do a test run, using our tiny dataset to train Torch's LSTM implementation. Training runs for 50 epochs:
![Training](Training.png)
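
Since the terminal steps appear only as screenshots, the sketch below assembles the two commands as text. The flag names follow the torch-rnn README and should be checked against the installed version. The .h5 and .json file names are assumptions chosen to match `blackhole_1000.txt`, and the helper functions are hypothetical.

```python
import shlex


def preprocess_command(txt_path, h5_path, json_path):
    """Build the argument list for torch-rnn's preprocessing script."""
    return [
        "python", "scripts/preprocess.py",
        "--input_txt", txt_path,
        "--output_h5", h5_path,
        "--output_json", json_path,
    ]


def train_command(h5_path, json_path, max_epochs=50):
    """Build the argument list for training with train.lua."""
    return [
        "th", "train.lua",
        "-input_h5", h5_path,
        "-input_json", json_path,
        "-max_epochs", str(max_epochs),
    ]


def as_shell(cmd):
    """Render an argument list as a shell-safe command string."""
    return " ".join(shlex.quote(part) for part in cmd)


# Output file names are assumed, not taken from the screenshots.
print(as_shell(preprocess_command(
    "blackhole_1000.txt", "blackhole_1000.h5", "blackhole_1000.json")))
print(as_shell(train_command("blackhole_1000.h5", "blackhole_1000.json")))
```

Both commands are meant to be run from the root of a torch-rnn checkout.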