<a href="https://colab.research.google.com/github/bellaroseee/447-Group-Project/blob/checkpoint-2/src/test.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

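Adapted from the Keras example [Character-level text generation with LSTM](https://keras.io/examples/generative/lstm_character_level_text_generation/).
  
Data: the Astronomer's Telegram dataset on Kaggle, file [Processed_Atels.csv](https://www.kaggle.com/namanj27/astronomers-telegram-dataset?select=Processed_Atels.csv).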

# Setup

In [3]:
from tensorflow import keras
from tensorflow.keras import layers

import numpy as np
import random
import io
import pandas as pd

# Prepare the Data

In [69]:
path_to_file = keras.utils.get_file(
    "Processed_Atels", 
    "https://raw.githubusercontent.com/bellaroseee/447-Group-Project/checkpoint-2/src/Processed_Atels.csv")
data = pd.read_csv(path_to_file)
data = data["Text processed"]

In [71]:
data[0]

'We report spectroscopic observations of AT2018lab, discovered by the DLT40 survey on UT 2018 Dec 29.13 in the luminous infrared galaxy (LIRG) IC 2163. The observations were performed using the FLOYDS spectrograph on Faulkes-South on UT 2018 Dec 29.46. The spectrum reveals a mostly featureless blue continuum. We also note the presence of H-alpha emission in our spectrum with FWHM of 400 km/s. This feature could be associated with the host galaxy IC 2163, but is blueshifted from the nominal recessional velocity of the host by 600 km/s.'

In [72]:
len(data)

1203

In [73]:
test_data = data[:10]
dev_data = data[10:20]
train_data = data[20:]
print(f"test: {len(test_data)}\ndev: {len(dev_data)}\ntrain: {len(train_data)}")

test: 10
dev: 10
train: 1183


In [76]:
# `data` already holds the "Text processed" column selected above
text = ""
for row in data:
    text += row

print("Corpus length:", len(text))
chars = sorted(list(set(text)))
print("Total chars:", len(chars))
char_indices = dict((c, i) for i, c in enumerate(chars))
indices_char = dict((i, c) for i, c in enumerate(chars))
print(len(list(char_indices)))
# char_indices maps character to index (index is decided here)
# indices_char maps index to character (this is the opposite of char_indices)

maxlen = 40
step = 3
sentences = []
next_chars = []
for i in range(0, len(text) - maxlen, step):
    sentences.append(text[i : i + maxlen]) # add 40 chars from i to sentences
    next_chars.append(text[i + maxlen]) # add the next char to next_chars
print("Number of sequences:", len(sentences))

# One-hot encode: x[i, t, c] marks character c at position t of sequence i,
# and y[i, c] marks the character that follows sequence i.
# (np.bool was removed from recent NumPy releases; the built-in bool is equivalent here)
x = np.zeros((len(sentences), maxlen, len(chars)), dtype=bool)
y = np.zeros((len(sentences), len(chars)), dtype=bool)
for i, sentence in enumerate(sentences):
    for t, char in enumerate(sentence):
        x[i, t, char_indices[char]] = 1
    y[i, char_indices[next_chars[i]]] = 1

Corpus length: 1940271
Total chars: 108
108
Number of sequences: 646744


* `text` is a single string built by concatenating every row of the data. `text[0]` = `W`, `text[:5]` = `We re`
* `sentences` is a list of substrings of `text`, each `maxlen` characters long, with start positions `step` characters apart; `next_chars[i]` is the character that follows `sentences[i]`
    * `sentences[0]` : We report spectroscopic observations of | `next_chars[0]` : A
    * `sentences[1]` : report spectroscopic observations of AT2 | `next_chars[1]` : 0
    * `sentences[2]` : ort spectroscopic observations of AT2018 | `next_chars[2]` : l
* `x.shape` (646744, 40, 108) -> (num of sequences, length of sequence, number of characters) 
* `y.shape` (646744, 108) -> (num of sequences, number of characters)
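
The windowing and one-hot encoding described in the list above can be checked on a toy string. This is a minimal sketch: `make_windows`, `one_hot`, `toy_text`, and the small `maxlen`/`step` values are illustrative names and settings, not part of the notebook.

```python
import numpy as np

def make_windows(text, maxlen, step):
    """Slice text into overlapping windows and record the character after each."""
    windows, targets = [], []
    for i in range(0, len(text) - maxlen, step):
        windows.append(text[i : i + maxlen])
        targets.append(text[i + maxlen])
    return windows, targets

def one_hot(windows, targets, char_to_index):
    """Encode windows as (n, maxlen, vocab) and targets as (n, vocab) boolean arrays."""
    maxlen = len(windows[0])
    x = np.zeros((len(windows), maxlen, len(char_to_index)), dtype=bool)
    y = np.zeros((len(windows), len(char_to_index)), dtype=bool)
    for i, window in enumerate(windows):
        for t, char in enumerate(window):
            x[i, t, char_to_index[char]] = True
        y[i, char_to_index[targets[i]]] = True
    return x, y

# Toy corpus (hypothetical); the notebook uses maxlen=40, step=3 on the full text.
toy_text = "We report spectra"
toy_vocab = {c: i for i, c in enumerate(sorted(set(toy_text)))}
windows, targets = make_windows(toy_text, maxlen=5, step=3)
x_toy, y_toy = one_hot(windows, targets, toy_vocab)
print(windows, targets)
print(x_toy.shape, y_toy.shape)
```

Each window contributes exactly `maxlen` `True` entries to `x` and one to `y`, which is a quick sanity check on the real arrays as well.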


Full Explanation of For Loop
  
i, sentence: `0 We report spectroscopic observations of`

t, char: `0 W`

`char_indices['W']` = 56

`x[0][0]` : 

```
[False False False False False False False False False False False False
 False False False False False False False False False False False False
 False False False False False False False False False False False False
 False False False False False False False False False False False False
 False False False False False False False False  True False False False
 False False False False False False False False False False False False
 False False False False False False False False False False False False
 False False False False False False False False False False False False
 False False False False False False False False False False False False]
```

`next_chars[0]` is `'A'`, and `char_indices['A']` = 34.

`y[0]` :

```
[False False False False False False False False False False False False
 False False False False False False False False False False False False
 False False False False False False False False False False  True False
 False False False False False False False False False False False False
 False False False False False False False False False False False False
 False False False False False False False False False False False False
 False False False False False False False False False False False False
 False False False False False False False False False False False False
 False False False False False False False False False False False False]
```


  


# Build the model: a single LSTM layer

In [51]:
model = keras.Sequential(  # stack the layers below into a single tf.keras.Model
    [
        keras.Input(shape=(maxlen, len(chars))),  # one input sequence is a (40, 108) one-hot matrix
        layers.LSTM(128),  # 128 is the dimensionality of the LSTM output
        layers.Dense(len(chars), activation="softmax"),  # one probability per character (108 outputs)
    ]
)
optimizer = keras.optimizers.RMSprop(learning_rate=0.01)
model.compile(loss="categorical_crossentropy", optimizer=optimizer) # configure the losses and optimizer 

In [48]:
model.summary()

Model: "[<tensorflow.python.keras.layers.recurrent_v2.LSTM object at 0x7f9a36db82b0>, <tensorflow.python.keras.layers.core.Dense object at 0x7f9a365cc470>]"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
lstm_8 (LSTM)                (None, 128)               121344    
_________________________________________________________________
dense_7 (Dense)              (None, 108)               13932     
Total params: 135,276
Trainable params: 135,276
Non-trainable params: 0
_________________________________________________________________
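
The parameter counts in the summary can be reproduced by hand. The sketch below assumes the standard LSTM layout (four gates, each with input weights, recurrent weights, and a bias) and a dense layer with one bias per output; the helper names are illustrative.

```python
def lstm_param_count(input_dim, units):
    # four gates, each with an (input_dim x units) kernel,
    # a (units x units) recurrent kernel, and a bias of length units
    return 4 * (units * (input_dim + units) + units)

def dense_param_count(input_dim, units):
    # one weight per input/output pair plus one bias per output
    return input_dim * units + units

vocab_size = 108  # "Total chars" printed above
lstm_units = 128

lstm_params = lstm_param_count(vocab_size, lstm_units)
dense_params = dense_param_count(lstm_units, vocab_size)
print(lstm_params, dense_params, lstm_params + dense_params)
```

The three printed values match the `Param #` column and the total in the summary (121,344, 13,932, and 135,276).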


# Prepare the text sampling function

In [56]:
# draws a character index at random from the temperature-adjusted distribution
# (it does not simply return the most likely index)
def sample(preds, temperature=1.0):
    # helper function to sample an index from a probability array
    preds = np.asarray(preds).astype("float64")
    # divide the log-probabilities by the temperature: values below 1 sharpen the
    # distribution toward the likeliest characters, values above 1 flatten it
    preds = np.log(preds) / temperature
    exp_preds = np.exp(preds)
    # renormalize so the probabilities sum to 1 again (a softmax over the scaled logs)
    preds = exp_preds / np.sum(exp_preds)
    # a single multinomial draw yields a one-hot vector; argmax recovers the drawn index
    probas = np.random.multinomial(1, preds, 1)
    return np.argmax(probas)
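
To see what the temperature does before any random draw happens, the rescaling step can be run on a small distribution. This is a sketch in plain Python; `apply_temperature`, the example probabilities, and the `1e-12` floor (added so that a zero probability does not break the logarithm) are assumptions, not part of the notebook.

```python
import math

def apply_temperature(preds, temperature=1.0):
    """Rescale a probability list by temperature and renormalize it."""
    # floor at a tiny value so log() never receives zero (assumed safeguard)
    logits = [math.log(max(p, 1e-12)) / temperature for p in preds]
    peak = max(logits)  # subtract the maximum for numerical stability
    weights = [math.exp(value - peak) for value in logits]
    total = sum(weights)
    return [w / total for w in weights]

probs = [0.5, 0.3, 0.2]  # hypothetical model output over three characters
for temperature in (0.2, 1.0, 1.2):
    rescaled = apply_temperature(probs, temperature)
    print(temperature, [round(p, 3) for p in rescaled])
```

A temperature of 1.0 leaves the distribution unchanged, lower values concentrate the mass on the likeliest character, and higher values spread it out, which is consistent with the noisier text generated at the higher diversity settings.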

# Train the model

In [60]:
epochs = 1
batch_size = 128

for epoch in range(epochs):
    model.fit(x, y, batch_size=batch_size, epochs=1) # train the model
    print()
    print("Generating text after epoch: %d" % epoch)

    start_index = random.randint(0, len(text) - maxlen - 1)
    print(f"text len {len(text)} start_index {start_index}, maxlen {maxlen}")
    for diversity in [0.2, 0.5, 1.0, 1.2]:
        print("...Diversity:", diversity)

        test_generated = []  # the three generated strings
        topthree = []  # first character of each generated string
        sentence = text[start_index : start_index + maxlen]
        print('...Generating with seed: "' + sentence + '"')

        # note: `sentence` is not reset between passes, so each pass
        # continues from where the previous one stopped
        for a in range(3):
            generated = ""
            for i in range(20):
                x_pred = np.zeros((1, maxlen, len(chars)))  # one row with the same dimensions as a row of x
                for t, char in enumerate(sentence):
                    x_pred[0, t, char_indices[char]] = 1.0  # one-hot encode the current window
                preds = model.predict(x_pred, verbose=0)[0]
                next_index = sample(preds, diversity)  # draw the next index with sample() above
                next_char = indices_char[next_index]
                if i == 0:
                    topthree.append(next_char)
                sentence = sentence[1:] + next_char
                generated += next_char
            test_generated.append(generated)

        print("...Generated: ", test_generated[0])
        print(f"{test_generated[1]}\n{test_generated[2]}")
        print("top Three:", topthree)
        print()


Generating text after epoch: 0
text len 1940271 start_index 1767214, maxlen 40
...Diversity: 0.2
...Generating with seed: " al. 1995 AJ, 110, 880), and a redshift "
...Generated:  of the observations 
of the program the p
rogram is and the so
top Three: ['o', 'o', 'r']

...Diversity: 0.5
...Generating with seed: " al. 1995 AJ, 110, 880), and a redshift "
...Generated:  from the propomence 
of the first about t
he first of the two 
top Three: ['f', 'o', 'h']

...Diversity: 1.0
...Generating with seed: " al. 1995 AJ, 110, 880), and a redshift "
...Generated:  is rereooum enfi8q.:
 We Ales anc program
 mapiling eeal usrem
top Three: ['i', ' ', ' ']

...Diversity: 1.2
...Generating with seed: " al. 1995 AJ, 110, 880), and a redshift "
...Generated:  limit unchouscom we 
solutin. relea used 
ampimetionREOO=Hahla
top Three: ['l', 's', 'a']
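
In the output above, the second and third strings continue the first (for example "of the program the p" is followed by "rogram is and the so"), because the sliding window is carried over between the three passes. If three independent continuations of the same seed are wanted instead, one option is to restart from the seed each time. The sketch below uses a hypothetical `toy_predict_next` in place of the trained model and sampling function, so every name in it is an assumption.

```python
import random

def generate(predict_next, seed, length, rng):
    """Return `length` sampled characters continuing `seed`; `seed` is left unchanged."""
    window = seed  # local sliding window, restarted on every call
    generated = []
    for _ in range(length):
        vocab, probs = predict_next(window)
        next_char = rng.choices(vocab, weights=probs, k=1)[0]
        generated.append(next_char)
        window = window[1:] + next_char  # slide the window by one character
    return "".join(generated)

def toy_predict_next(window):
    """Hypothetical stand-in for the model: never emits two spaces in a row."""
    vocab = ["a", "b", " "]
    if window[-1] == " ":
        return vocab, [0.5, 0.5, 0.0]
    return vocab, [0.3, 0.3, 0.4]

rng = random.Random(0)
seed = "ab ab ab "
# every candidate starts again from the same seed
candidates = [generate(toy_predict_next, seed, 20, rng) for _ in range(3)]
for candidate in candidates:
    print(repr(candidate))
```

With the real model, `predict_next` would wrap the one-hot encoding, `model.predict`, and the temperature step; the first characters of the three candidates would then be three separate guesses for the same next character.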



In [77]:
seed_text = "That’s one small ste"  # renamed to avoid shadowing the built-in input()
for t, char in enumerate(seed_text):
    print(f"{t}, {char}")

0, T
1, h
2, a
3, t
4, ’
5, s
6,  
7, o
8, n
9, e
10,  
11, s
12, m
13, a
14, l
15, l
16,  
17, s
18, t
19, e
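
A seed like this one differs from the training windows in two ways: it is 20 characters long rather than `maxlen`, and it contains a curly apostrophe that may not be in the training vocabulary, in which case a direct `char_indices[char]` lookup would raise a `KeyError`. The sketch below shows one possible way to encode such a seed. Dropping unknown characters and left-padding with all-zero rows are assumptions (the model was not trained on padded input), and `encode_seed` and `toy_vocab` are illustrative names.

```python
import numpy as np

def encode_seed(seed, char_to_index, maxlen):
    """One-hot encode a seed, skipping unknown characters and left-padding with zeros."""
    known = [c for c in seed if c in char_to_index]  # drop out-of-vocabulary characters
    known = known[-maxlen:]  # keep only the most recent maxlen characters
    x_pred = np.zeros((1, maxlen, len(char_to_index)))
    offset = maxlen - len(known)  # leading rows stay all-zero as padding
    for t, char in enumerate(known):
        x_pred[0, offset + t, char_to_index[char]] = 1.0
    return x_pred

# Toy vocabulary (hypothetical) that deliberately lacks the curly apostrophe.
toy_vocab = {c: i for i, c in enumerate(sorted(set("Thats one small ste")))}
x_seed = encode_seed("That’s one small ste", toy_vocab, maxlen=40)
print(x_seed.shape, int(x_seed.sum()))
```

The result has the `(1, maxlen, vocabulary size)` shape that `model.predict` expects; how well the model handles the zero padding would need to be checked empirically.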
