# A long short-term memory recurrent neural network for generating astrophysics-specific language and assessment using tf-idf, cosine similarity, and presence of non-ASCII characters to determine model effectiveness
### Daina Bouquin
## Summary of Findings

The sections below outline the results of an analysis initially geared toward testing the following hypotheses.

**Some subdomains of astronomy will be more accurately generated by the LSTM ANN than others.**

A. The model's ability to generate titles similar to those in the original training sets will differ between fields that are more emergent and those that are more established. For instance, I hypothesize that the LSTM ANN will be more effective in generating titles based on the "black hole" data than on the "exoplanet" data, as the exoplanet field is much less established and therefore likely less regular from a linguistic perspective than text discussing black holes.

   
* Black Holes - earliest publication available in ADS is 1799 
* Astrobiology - earliest publication available in ADS is 1870 
* Exoplanets - earliest publication available in ADS is 1943 

This hypothesis was supported for an LSTM trained on real astronomy titles from the Astrophysics Data System. The real titles were used to train the neural network, and each trained model output 10,000 RNN-generated titles based on the training data; training corpora ranged in size from 1,000 to 20,000 titles. The best result achieved by the LSTM RNN for each subdomain of astronomy is listed below as the minimum cosine similarity angle (a smaller angle means the generated titles are more similar to the training titles). You can see that the most established field (black holes) reached the smallest angle across all tests, while the most emergent field (exoplanets) was not able to achieve an angle smaller than 45.81 degrees.

| Astronomy Domain        | Minimum Cosine Similarity Angle (degrees) |
| ------------- |:-------------:| 
| Black Holes     | 30.6 | 
| Astrobiology      |  32.4   | 
| Exoplanets | 45.81     | 
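
One way to arrive at an angle like those in the table is to build tf-idf vectors for the training titles and the generated titles, then take the arccosine of their cosine similarity. The sketch below is a minimal pure-Python illustration and not necessarily the pipeline used for this analysis: the function names, the whitespace tokenization, the smoothed idf weighting, and the choice to treat each set of titles as a single document are all assumptions, and the example titles are made up.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one sparse tf-idf vector (dict: term -> weight) per document.

    Assumption: lowercase whitespace tokenization and a smoothed idf.
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: how many documents contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append(
            {t: c * (math.log((1 + n_docs) / (1 + df[t])) + 1.0) for t, c in tf.items()}
        )
    return vectors


def cosine_angle_degrees(u, v):
    """Angle in degrees between two sparse vectors (0 = same direction)."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 90.0  # an empty document shares nothing with the other one
    # Clamp to guard against floating-point drift outside [-1, 1].
    cos = max(-1.0, min(1.0, dot / (norm_u * norm_v)))
    return math.degrees(math.acos(cos))


# Hypothetical titles, for illustration only.
training_titles = ["black hole accretion disks", "spin of a black hole"]
generated_titles = ["black hole spin disks", "accretion of a planet"]

train_vec, gen_vec = tfidf_vectors(
    [" ".join(training_titles), " ".join(generated_titles)]
)
angle = cosine_angle_degrees(train_vec, gen_vec)
print(f"angle between corpora: {angle:.2f} degrees")
```

Under this reading, an angle of 0 degrees means the two corpora use the same terms in the same proportions, and 90 degrees means they share no terms at all.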

Moreover, as training sizes increased, accuracy as measured through cosine similarity improved for both the black hole and exoplanet tests (violin plots below). The improvements were subtle, likely because I did not have the capacity to process much larger datasets: running torch (the module I used for this analysis in Python) without access to a GPU and GPU acceleration through CUDA necessarily limited my ability to train models using larger datasets.

<img src="blackhole_summary.png" width="600">
<img src="exoplanet_summary.png" width="600">

Interestingly, the cosine similarity analysis of titles generated by the model trained on the most interdisciplinary field (astrobiology) showed no obvious improvement across training set sizes, and no clear pattern emerged from these initial results. I suspect that this is due to the significantly higher unique word count for astrobiology papers (see point B below).

<img src="astrobio_summary.png" width="600">
 
B. Model effectiveness will be correlated with the diversity of unique words used within the training corpus of text representing the subdomain. That is, a subdomain with a larger variety of unique words (combinations of letters) will be more challenging for the LSTM ANN to generate than a training corpus with fewer unique words. I hypothesize that some fields will need more data than others to achieve a high degree of similarity between generated text and training text, and that this will similarly correspond to unique word count.

| Astronomy Domain  | Training Size  | Unique Word Count  |
| ------------- |:-------------:| -----:|
| <span style="color:green">Black Hole</span>     | <span style="color:green">1K</span> | <span style="color:green">4134</span> |
| <span style="color:green">Black Hole</span>      | <span style="color:green">5K</span>      |  <span style="color:green">15203</span> |
| <span style="color:green">Black Hole</span> | <span style="color:green">10K</span>      |  <span style="color:green">26095</span>  |  
| <span style="color:green">Black Hole</span>     | <span style="color:green">20K</span> | <span style="color:green">45761</span> |  
| <span style="color:blue">Astrobiology</span>      | <span style="color:blue">1K</span>      |  <span style="color:blue">5217</span>  |   
| <span style="color:blue">Astrobiology</span> | <span style="color:blue">5K</span>      |  <span style="color:blue">20324</span>  |   
| <span style="color:blue">Astrobiology</span>      | <span style="color:blue">10K</span> | <span style="color:blue">34104</span> |   
| <span style="color:blue">Astrobiology</span>      | <span style="color:blue">20K</span>      |  <span style="color:blue">56703</span> |   
| <span style="color:orange">Exoplanets</span> | <span style="color:orange">1K</span>   |   <span style="color:orange">4162</span>  |   
| <span style="color:orange">Exoplanets</span>      | <span style="color:orange">5K</span> | <span style="color:orange">14143</span> |   
| <span style="color:orange">Exoplanets</span>      | <span style="color:orange">10K</span>     |  <span style="color:orange">23175</span>  |   
| <span style="color:orange">Exoplanets</span> | <span style="color:orange">20K</span>      |  <span style="color:orange">38301</span> |  

The above hypothesis based on unique word count only holds true for the astrobiology titles tested in this analysis, as there are no substantial differences between Black Hole titles and Exoplanet titles in regard to unique word count.
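
A unique word count like those tabulated above can be obtained by collecting every token in a corpus into a set. The snippet below is a small sketch under assumptions: the function name is hypothetical, and splitting on whitespace after lowercasing is only one possible tokenization, which may differ from the one used to produce the table.

```python
def unique_word_count(titles):
    """Count distinct whitespace-separated tokens across all titles.

    Assumption: tokens are compared case-insensitively.
    """
    return len({word for title in titles for word in title.lower().split()})


# Hypothetical titles, for illustration only.
sample_titles = ["Black hole spin", "black hole mass"]
print(unique_word_count(sample_titles))
```

Keeping punctuation attached to tokens, as this sketch does, inflates the count somewhat, so the tokenization choice matters when comparing counts across subdomains.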

C. I also performed an analysis using the presence of non-ASCII characters in the training and model-generated titles.
I compared the proportion of non-ASCII characters in model-generated titles that contain at least one non-ASCII character with the same proportion for titles in the training corpus itself.

| Astronomy Domain  | Training Size  | Non-ASCII Training  | Non-ASCII Test  |
| ------------- |:-------------:| -----:| -----:|
| <span style="color:green">Black Hole</span>     | <span style="color:green">1K</span> | <span style="color:green">3</span> | <span style="color:green">25</span> |
| <span style="color:green">Black Hole</span>      | <span style="color:green">5K</span>      |  <span style="color:green">3</span> | <span style="color:green">26</span> |
| <span style="color:green">Black Hole</span> | <span style="color:green">10K</span>      |  <span style="color:green">3</span>  |  <span style="color:green">6</span> |
| <span style="color:green">Black Hole</span>     | <span style="color:green">20K</span> | <span style="color:green">3</span> |  <span style="color:green">6</span> |
| <span style="color:blue">Astrobiology</span>      | <span style="color:blue">1K</span>      |  <span style="color:blue">3</span>  |   <span style="color:blue">8</span>  |   
| <span style="color:blue">Astrobiology</span> | <span style="color:blue">5K</span>      |  <span style="color:blue">2</span>  |   <span style="color:blue">9</span>  |   
| <span style="color:blue">Astrobiology</span>      | <span style="color:blue">10K</span> | <span style="color:blue">3</span> |   <span style="color:blue">15</span>  |   
| <span style="color:blue">Astrobiology</span>      | <span style="color:blue">20K</span>      |  <span style="color:blue">3</span> |   <span style="color:blue">15</span>  |   
| <span style="color:orange">Exoplanets</span> | <span style="color:orange">1K</span>   |   <span style="color:orange">3</span>  |   <span style="color:orange">7</span>  |   
| <span style="color:orange">Exoplanets</span>      | <span style="color:orange">5K</span> | <span style="color:orange">3</span> |   <span style="color:orange">7</span>  |   
| <span style="color:orange">Exoplanets</span>      | <span style="color:orange">10K</span>     |  <span style="color:orange">2</span>  |   <span style="color:orange">4</span>  |   
| <span style="color:orange">Exoplanets</span> | <span style="color:orange">20K</span>      |  <span style="color:orange">2</span> |  <span style="color:orange">4</span>  |   

Again, you can see from the above results that astrobiology is the only subdomain that deviates from the trend in which increasing the training set size makes the prevalence of non-ASCII characters more realistic.
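
The non-ASCII measure can be read in more than one way. The sketch below implements one plausible interpretation: among titles that contain at least one non-ASCII character, average the fraction of characters in each title that are non-ASCII. The function names are hypothetical, and the scaling may not match the values in the table.

```python
def non_ascii_share(title):
    """Fraction of characters in a title whose code point is above 127."""
    if not title:
        return 0.0
    return sum(ord(ch) > 127 for ch in title) / len(title)


def mean_non_ascii_share(titles):
    """Average non-ASCII fraction over titles with at least one such character."""
    shares = [non_ascii_share(t) for t in titles]
    shares = [s for s in shares if s > 0]
    return sum(shares) / len(shares) if shares else 0.0


# Hypothetical titles, for illustration only.
training_sample = ["Lyman-α emitters", "Black hole spin"]
generated_sample = ["Lyαα black ñole", "Exoplanet transits"]
print(mean_non_ascii_share(training_sample))
print(mean_non_ascii_share(generated_sample))
```

Comparing the two averages for each training size gives a rough sense of whether the generated titles overuse non-ASCII characters relative to the training corpus.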

**Further discussion of these results will be incorporated into the final write-up of these analyses.**