# Iteration 1 - Prompt database - Step 2 - Neural network

This notebook uses the `aitextgen` library by [Max Woolf](https://minimaxir.com): [GitHub repository](https://github.com/minimaxir/aitextgen) and [documentation](https://docs.aitextgen.io/).


In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
!pip install -q aitextgen

import logging
logging.basicConfig(
        format="%(asctime)s — %(levelname)s — %(name)s — %(message)s",
        datefmt="%m/%d/%Y %H:%M:%S",
        level=logging.INFO
    )

from aitextgen import aitextgen
from aitextgen.colab import mount_gdrive, copy_file_from_gdrive

  Building wheel for aitextgen (setup.py) ... done
  Building wheel for fire (setup.py) ... done


04/25/2022 11:49:04 — INFO — numexpr.utils — NumExpr defaulting to 2 threads.


## GPU

Verify which GPU is active for this session.

In [None]:
!nvidia-smi

Mon Apr 25 11:49:08 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla K80           Off  | 00000000:00:04.0 Off |                    0 |
| N/A   69C    P8    33W / 149W |      0MiB / 11441MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces
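
The same check can be done from Python when only the GPU name and memory are needed. The following is a minimal sketch, assuming `nvidia-smi` is on the `PATH`; `list_gpus` is a hypothetical helper name, and it returns an empty list rather than failing when no NVIDIA driver is present.

```python
import shutil
import subprocess


def list_gpus():
    """Return one 'name, total memory' string per visible NVIDIA GPU.

    Returns an empty list when nvidia-smi is missing or exits with an error.
    """
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,memory.total", "--format=csv,noheader"],
        capture_output=True,
        text=True,
    )
    if result.returncode != 0:
        return []
    return [line.strip() for line in result.stdout.splitlines() if line.strip()]


gpus = list_gpus()
print(gpus if gpus else "No NVIDIA GPU visible to this runtime")
```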

## Loading GPT-2

Because we are fine-tuning the model on new text, GPT-2 needs to be loaded onto the GPU.

In [None]:
ai = aitextgen(tf_gpt2="124M", to_gpu=True)

04/25/2022 11:49:08 — INFO — aitextgen — Downloading the 124M GPT-2 TensorFlow weights/config from Google's servers


Fetching checkpoint:   0%|          | 0.00/77.0 [00:00<?, ?it/s]

Fetching hparams.json:   0%|          | 0.00/90.0 [00:00<?, ?it/s]

Fetching model.ckpt.data-00000-of-00001:   0%|          | 0.00/498M [00:00<?, ?it/s]

Fetching model.ckpt.index:   0%|          | 0.00/5.21k [00:00<?, ?it/s]

Fetching model.ckpt.meta:   0%|          | 0.00/471k [00:00<?, ?it/s]

04/25/2022 11:49:20 — INFO — aitextgen — Converting the 124M GPT-2 TensorFlow weights to PyTorch.
Converting TensorFlow checkpoint from /content/aitextgen/124M
Loading TF weight model/h0/attn/c_attn/b with shape [2304]
Loading TF weight model/h0/attn/c_attn/w with shape [1, 768, 2304]
Loading TF weight model/h0/attn/c_proj/b with shape [768]
Loading TF weight model/h0/attn/c_proj/w with shape [1, 768, 768]
Loading TF weight model/h0/ln_1/b with shape [768]
Loading TF weight model/h0/ln_1/g with shape [768]
Loading TF weight model/h0/ln_2/b with shape [768]
Loading TF weight model/h0/ln_2/g with shape [768]
Loading TF weight model/h0/mlp/c_fc/b with shape [3072]
Loading TF weight model/h0/mlp/c_fc/w with shape [1, 768, 3072]
Loading TF weight model/h0/mlp/c_proj/b with shape [768]
Loading TF weight model/h0/mlp/c_proj/w with shape [1, 3072, 768]
Loading TF weight model/h1/attn/c_attn/b with shape [2304]
Loading TF weight model/h1/attn/c_attn/w with shape [1, 768, 2304]
Loading TF weight

Save PyTorch model to aitextgen/pytorch_model.bin


04/25/2022 11:49:27 — INFO — aitextgen — Loading 124M GPT-2 model from /aitextgen.


Save configuration file to aitextgen/config.json


04/25/2022 11:49:29 — INFO — aitextgen — GPT2 loaded with 124M parameters.
04/25/2022 11:49:29 — INFO — aitextgen — Using the default GPT-2 Tokenizer.
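
The "124M" figure can be roughly reproduced from the tensor shapes printed in the conversion log. The sketch below uses the per-layer shapes shown for `h0` (768, 2304, 3072). The layer count (12), vocabulary size (50257) and number of positions (1024) are not in the log; they are the published values for the smallest GPT-2 and are assumptions here. It also assumes the output layer shares the token embedding, so it is not counted twice.

```python
def gpt2_param_count(n_layer=12, d_model=768, vocab_size=50257, n_ctx=1024):
    """Count GPT-2 parameters from the tensor shapes in the conversion log."""
    embeddings = vocab_size * d_model + n_ctx * d_model   # token + position tables
    layer_norm = 2 * d_model                              # gain g and bias b
    attn = d_model * 3 * d_model + 3 * d_model            # c_attn: [768, 2304] + [2304]
    attn += d_model * d_model + d_model                   # c_proj: [768, 768] + [768]
    mlp = d_model * 4 * d_model + 4 * d_model             # c_fc: [768, 3072] + [3072]
    mlp += 4 * d_model * d_model + d_model                # c_proj: [3072, 768] + [768]
    per_layer = 2 * layer_norm + attn + mlp               # ln_1, ln_2, attention, MLP
    return embeddings + n_layer * per_layer + layer_norm  # plus the final layer norm


total = gpt2_param_count()
print(f"{total:,} parameters")
```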


## Mounting Google Drive

In [None]:
mount_gdrive()

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


## Uploading the Spotify song database

In [None]:
file_name = "/content/drive/MyDrive/songtitlesapr24.csv"
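
Before training, it can help to confirm that the file really is a single-column CSV with one title per row, since the `line_by_line` option used later expects that layout. This is a minimal sketch using only the standard library; `summarize_csv` is a hypothetical helper, and the temporary file with its header row and three titles is a stand-in so the example runs without Google Drive.

```python
import csv
import os
import tempfile


def summarize_csv(path):
    """Return (row_count, set of column counts) for a CSV file."""
    widths = set()
    rows = 0
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.reader(f):
            if not row:          # skip completely blank lines
                continue
            rows += 1
            widths.add(len(row))
    return rows, widths


# Hypothetical stand-in file; replace demo_path with file_name in the notebook.
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False,
                                 newline="", encoding="utf-8") as tmp:
    writer = csv.writer(tmp)
    for title in ["title", "Born To Run", "Wish You Would", "Singing the Blues"]:
        writer.writerow([title])
    demo_path = tmp.name

rows, widths = summarize_csv(demo_path)
os.remove(demo_path)
print(rows, widths)   # a single-column file reports widths == {1}
```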

## Finetuning GPT-2

Important parameters for `train()`:

- **`line_by_line`**: Set this to `True` if the input text file is a single-column CSV, with one record per row. aitextgen will then process each row as its own record.
- **`from_cache`**: Set this to `True` if you compressed your dataset locally and are training from that cache file.
- **`num_steps`**: Number of steps to train the model for.
- **`generate_every`**: Interval of steps at which to generate example text from the model; good for qualitatively validating training.
- **`save_every`**: Interval of steps at which to save the model; it is saved in the VM to `/trained_model`.
- **`save_gdrive`**: Set this to `True` to copy the model to a unique folder in your Google Drive, if you mounted it in an earlier cell.
- **`fp16`**: Enables half-precision training, which is faster and more memory-efficient. Only works on a T4 or V100 GPU.

The following `train()` parameters are also useful, but you likely do not need to change them:

- **`learning_rate`**: Learning rate of the model training.
- **`batch_size`**: Batch size of the model training; setting it too high will cause the GPU to run out of memory (with `fp16`, you can increase the batch size more safely).
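
The `fp16` note above ties half-precision training to specific GPUs, and this session runs on a Tesla K80, which is not on that list. As a rough illustration, the choice could be derived from the GPU name reported by `nvidia-smi`. `can_use_fp16` is a hypothetical helper that only encodes the T4/V100 rule stated above; the T4 and V100 name strings are illustrative examples, not output from this notebook.

```python
def can_use_fp16(gpu_name, supported=("T4", "V100")):
    """Return True if the GPU name matches a card listed as fp16-capable."""
    tokens = gpu_name.upper().replace("-", " ").split()
    return any(card in tokens for card in supported)


for name in ["Tesla K80", "Tesla T4", "Tesla V100-SXM2-16GB"]:
    print(name, "->", "fp16=True" if can_use_fp16(name) else "fp16=False")
```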

In [None]:
ai.train(file_name,
         line_by_line=True,
         from_cache=False,
         num_steps=3000,
         generate_every=500,
         save_every=1000,
         save_gdrive=True,
         learning_rate=1e-3,
         fp16=False,
         batch_size=1, 
         )

04/25/2022 12:34:25 — INFO — aitextgen — Loading text from /content/drive/MyDrive/songtitlesapr24.csv with generation length of 1024.


  0%|          | 0/6326 [00:00<?, ?it/s]

04/25/2022 12:34:25 — INFO — aitextgen.TokenDataset — Encoding 6,326 rows from /content/drive/MyDrive/songtitlesapr24.csv.
  f"Setting `Trainer(checkpoint_callback={checkpoint_callback})` is deprecated in v1.5 and will "
  f"Setting `Trainer(progress_bar_refresh_rate={progress_bar_refresh_rate})` is deprecated in v1.5 and"
  "Setting `Trainer(weights_summary=None)` is deprecated in v1.5 and will be removed"
04/25/2022 12:34:26 — INFO — pytorch_lightning.utilities.rank_zero — GPU available: True, used: True
04/25/2022 12:34:26 — INFO — pytorch_lightning.utilities.rank_zero — TPU available: False, using: 0 TPU cores
04/25/2022 12:34:26 — INFO — pytorch_lightning.utilities.rank_zero — IPU available: False, using: 0 IPUs
04/25/2022 12:34:26 — INFO — pytorch_lightning.utilities.rank_zero — HPU available: False, using: 0 HPUs
  f"The `Callback.{hook}` hook was deprecated in v1.6 and"
04/25/2022 12:34:26 — INFO — pytorch_lightning.accelerators.gpu — LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]


  0%|          | 0/3000 [00:00<?, ?it/s]

  "`trainer.progress_bar_dict` is deprecated in v1.5 and will be removed in v1.7."


500 steps reached: generating sample texts.
 a Fool  Bok Nero
1,000 steps reached: saving model to /trained_model
1,000 steps reached: generating sample texts.
 My Woman
1,500 steps reached: generating sample texts.

2,000 steps reached: saving model to /trained_model
2,000 steps reached: generating sample texts.

2,500 steps reached: generating sample texts.

3,000 steps reached: saving model to /trained_model
3,000 steps reached: generating sample texts.
This Love


04/25/2022 13:13:10 — INFO — aitextgen — Saving trained model pytorch_model.bin to /trained_model
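
The cadence in the log above follows directly from `num_steps=3000`, `generate_every=500` and `save_every=1000`. The sketch below reproduces that schedule with plain modular arithmetic. It illustrates the observed behavior and is not aitextgen's internal implementation; `training_events` is a hypothetical helper.

```python
def training_events(num_steps, generate_every, save_every):
    """Map each step with a scheduled action to the list of actions at that step."""
    events = {}
    for step in range(1, num_steps + 1):
        actions = []
        if save_every and step % save_every == 0:
            actions.append("save")          # the log shows saving before sampling
        if generate_every and step % generate_every == 0:
            actions.append("generate")
        if actions:
            events[step] = actions
    return events


schedule = training_events(num_steps=3000, generate_every=500, save_every=1000)
for step, actions in schedule.items():
    print(f"{step:>5,}: {', '.join(actions)}")
```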


## Generate Text From The Trained Model

In [None]:
ai.generate()

Boomin In Your Jeep



Optional but helpful parameters for `ai.generate()`:

* **`min_length`**: The minimum length of the generated text: if the text is shorter than this value after cleanup, aitextgen will generate another one.
* **`max_length`**: Number of tokens to generate (default 256; you can generate up to 1024 tokens with GPT-2 and 2048 with GPT Neo).
* **`temperature`**: The higher the temperature, the crazier the text (default 0.7; recommended to keep between 0.7 and 1.0).
* **`top_k`**: Limits the generated guesses to the top *k* guesses (default 0, which disables the behavior; if the generated output is super crazy, you may want to set `top_k=40`).
* **`top_p`**: Nucleus sampling: limits the generated guesses to a cumulative probability (`top_p=0.9` often gives good results).
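
To make `temperature`, `top_k` and `top_p` concrete, here is a small pure-Python sketch of how these three settings shape a next-token distribution. It is a simplified illustration under stated assumptions, not the code aitextgen runs: the five logits are made up, and the function names are hypothetical.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Convert logits to probabilities after dividing by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)                       # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def filter_top_k_top_p(probs, top_k=0, top_p=1.0):
    """Keep the top_k most likely tokens, then the smallest set reaching top_p."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    if top_k > 0:
        ranked = ranked[:top_k]
    mass = sum(probs[i] for i in ranked)     # renormalize after the top-k cut
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i] / mass
        if cumulative >= top_p:
            break
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}


def sample(logits, temperature=0.7, top_k=0, top_p=1.0, rng=random):
    """Draw one token index using temperature, top-k and top-p filtering."""
    candidates = filter_top_k_top_p(softmax(logits, temperature), top_k, top_p)
    indices = list(candidates)
    return rng.choices(indices, weights=[candidates[i] for i in indices])[0]


toy_logits = [4.0, 3.0, 2.0, 1.0, 0.0]        # hypothetical scores for 5 tokens
sharp = softmax(toy_logits, temperature=0.7)
flat = softmax(toy_logits, temperature=1.8)
print(round(sharp[0], 3), round(flat[0], 3))  # lower temperature concentrates mass
print(sorted(filter_top_k_top_p(flat, top_p=0.7)))
print(sample(toy_logits, temperature=1.8, top_p=0.7, rng=random.Random(0)))
```

A high temperature flattens the distribution, while `top_p` then removes the unlikely tail again, so the two settings pull in opposite directions.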

In [None]:
ai.generate(n=40,
            batch_size=50,
            temperature=1.8,
            min_length=1,
            max_length=10,
            top_p=0.7)

Lonely Cities
Born On Your Own
A Boy Named Sue    
Cant Get Enough  rpm 
Boom  Single 
Wish You Would
Born To The USA
Sick of This
Hush On It
Hush  for You
I Have a Dream  The Complete Speech of
Lover I Miss You
All Out of Love
Dont Know Why
CANT STOP THE FEELING! Original
Grown Man Sport
Mundian to Bach Ke
Born TO Run
Let Me Be Your Teddy Bear
Shes in Love With Him
Hush  Just Because You Feel Good
Ghetto Boy Blues
The Hard Way
This Time
Shots
Aint All Seym high  
I Took a Pill In Ibiza Youth
Singing the Blues
Bridges Burn
All About U ft Nate Dogg Snoop
Tik Tok  BlocBoy JB
Bitch Don’t Kill My V
Remaster
Dont Know Why
Fountain To Move
Bonus Track
Dont Know Why
TAlone Calvin Harris  Feat Ste
Sting Aint Nothin 
Born In The USA


## Bulk generating and saving to a text file

In [None]:
num_files = 1

for _ in range(num_files):
  ai.generate_to_file(n=10000,
                      batch_size=500,
                      temperature=2.8,
                      min_length=1,
                      max_length=10,
                      top_p=0.7)

04/25/2022 13:26:23 — INFO — aitextgen — Generating 10,000 texts to ATG_20220425_132623_54219819.txt


  0%|          | 0/10000 [00:00<?, ?it/s]
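
Sampled titles repeat often (the 40-sample run above contains "Dont Know Why" three times), so bulk output may benefit from a clean-up pass before it goes into the prompt database. The sketch below works on a plain list of strings. How the lines are read from the generated file depends on the delimiter `generate_to_file` writes, which is not shown here. `clean_titles` is a hypothetical helper.

```python
def clean_titles(lines):
    """Strip whitespace, drop empties, and remove case-insensitive duplicates."""
    seen = set()
    unique = []
    for line in lines:
        title = " ".join(line.split())      # collapse runs of spaces
        key = title.casefold()
        if title and key not in seen:
            seen.add(key)
            unique.append(title)
    return unique


# A few lines copied from the 40-sample output earlier in this notebook.
sample_lines = ["Dont Know Why", "Born TO Run", "Hush  for You",
                "Dont Know Why", "", "Born In The USA", "Dont Know Why"]
print(clean_titles(sample_lines))
```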




# LICENSE

MIT License

Copyright (c) 2020-2021 Max Woolf

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.