# rap wiz

by: [pelgo14](https://github.com/pelgo14)  
reference: "Train a GPT-2 Text-Generating Model w/ GPU For Free" by [Max Woolf](http://minimaxir.com)

read about this project in [**my blog post**](https://pelgo14.github.io/artificial-rapper)

*Last updated: September 9th, 2020*

features:

* scrape lyrics from your favourite rappers
* train gpt 2 on this dataset
* generate (rhyming) rap lyrics 

To get started (in colab):

1. Copy this notebook to your Google Drive to keep it and save your changes. (File -> Save a Copy in Drive)
2. Make sure you're running the notebook with a GPU runtime.
3. Run the cells below, or use runtime/run-all

To get started (local .ipynb)

0. install [miniconda](https://docs.conda.io/en/latest/miniconda.html)

1. create the conda environment with packages: jupyter, notebook, ... 
  - `conda env create -f environment.yml`

2. start jupyter notebook
  - `jupyter notebook`

3. Run the cells below, or use runtime/run-all



In [1]:
# @markdown am i running in colab?
running_in_colab = True # @param {type:"boolean"}

# @markdown the artists to scrape lyrics from
# use "thrust" for debug, he has only 4 songs, else: "mf-doom", "aesop-rock", "aesop-rocky", etc...
artists = ["aesop-rock", "mf-doom", "asap-rocky", "kodak black"] #@param 

# @markdown your genius api key from https://genius.com/api-clients
key = "OSk3SqQp7_L1lqVhOBoCSNHz1Fy0J7UJQC2RLM4bhTVe7Ev5KeSzJUuZCq8HwrVJ" #@param {type:"string"}

# @markdown amount of steps to train
steps = 250 # @param {type:"number"}

# @markdown how many lines to generate
line_count = 4 # @param {type:"number"}


# @markdown the generated lines will appear here or at the bottom of the notebook
lines = [] # @param 


**rapwiz42** is a neural network (gpt-2) and lyric corpus based lyric generation system.

## setup

In [2]:
if running_in_colab:
  !nvidia-smi
  %tensorflow_version 1.x

Fri Sep 11 16:09:05 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.66       Driver Version: 418.67       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla P4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   50C    P8     8W /  75W |      0MiB /  7611MiB |      0%      Default |
|                               |                      |                 ERR! |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [3]:
if running_in_colab:
  !pip -q install pronouncing syllables progressbar2 gpt-2-simple enlighten lyricsgenius

[K     |████████████████████████████████| 51kB 5.0MB/s 
[K     |████████████████████████████████| 942kB 10.8MB/s 
[K     |████████████████████████████████| 81kB 7.8MB/s 
[K     |████████████████████████████████| 92kB 9.4MB/s 
[?25h  Building wheel for pronouncing (setup.py) ... [?25l[?25hdone
  Building wheel for gpt-2-simple (setup.py) ... [?25l[?25hdone


In [4]:
import pronouncing
import syllables
import progressbar
import gpt_2_simple as gpt2

import logging

import urllib.request
import tarfile
import os

import random

import enlighten
import yaml
import lyricsgenius as genius
import fileinput
import json
import re
from os import listdir
from os.path import isfile, join

import concurrent.futures

The TensorFlow contrib module will not be included in TensorFlow 2.0.
For more information, please see:
  * https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md
  * https://github.com/tensorflow/addons
  * https://github.com/tensorflow/io (for I/O related ops)
If you depend on functionality not listed there, please file an issue.



## classes

### scraper class

In [5]:
class Scraper:
    def __init__(self, genius_key, bar):
        logging.basicConfig(level=logging.INFO)
        self.genius_key = genius_key
        self.bar = bar

    def scrape(self, artist):
        # scrape
        api = genius.Genius(self.genius_key)
        artist_api = api.search_artist(artist)
        try:
            artist_api.save_lyrics(overwrite=True)
        except:
            logging.warn(f"Oops! Cant download: {artist}")
        self.bar.update()
        return True

    def extract(self):
        """
        extract
        # "songs": [
        # {
        #  "lyrics": "verse..."
        #  } ,
        # ...
        # ]
        :return:
        """
        source_path = '.'
        source_files = [f for f in listdir(source_path) if isfile(join(source_path, f))]
        for file in source_files:
            if file.endswith('.json'):
                with open(file) as json_file:
                    data = json.load(json_file)
                    for p in data['songs']:
                        if p['lyrics']:
                            with open("lyricdb.txt", "ab") as myfile:
                                myfile.write(p['lyrics'].encode('ASCII', errors="ignore"))

    def clean_folder(self):
        dir_name = "."
        test = os.listdir(dir_name)
        for item in test:
            if item.endswith(".json"):
                os.remove(os.path.join(dir_name, item))

    def clean_lyricdb(self):
        # clean
        # [intro: eminem] <- needs to be removed
        # ...
        for line in fileinput.input(r"lyricdb.txt", inplace=True):
            if not re.search(r"([^\w\d\s',<?>])", line):
                print(line, end="")

    def scrape_list(self, artists):
        # https://stackoverflow.com/questions/6893968/how-to-get-the-return-value-from-a-thread-in-python
        # multi-threaded scraping with fault tolerance
        with concurrent.futures.ThreadPoolExecutor() as executor:
            tasks = {}
            for artist in artists:
                task = executor.submit(self.scrape, artist)
                tasks[artist] = task
            while len(tasks.keys()) != 0:
                for artist, task in tasks.copy().items():
                    try:
                      res = task.result()
                    except:
                      res = False
                    tasks.pop(artist)
                    if not res:
                        tasks[artist] = executor.submit(self.scrape, artist)


### generator class

In [6]:
class Generator:
    def __init__(self, sess, log_level="INFO"):
        if "INFO" in log_level:
            logging.basicConfig(level=logging.INFO)
        elif "DEBUG" in log_level:
            logging.basicConfig(level=logging.DEBUG)
        self.sess = sess
        # gpt2.load_gpt2(self.sess, run_name="run1")

    def generate(self, line_count=16):
        """
        generates line_count amount of lines. every second line will rhyme with the previous
        (the last word of every second line will rhyme with the last word of the previous line)
        maybe_todo: add rhyme scheme support (not just ababab...)
        :param line_count:
        :return:
        """
        logging.info("generating lines...")
        lines = [self.gen_next_non_rhyme_line()]
        for i in progressbar.progressbar(range(1, line_count)):
            last_line = lines[i - 1]
            logging.debug(f"gen line num: {i}, last_line: {last_line}")
            logging.debug(f"lines: {lines}")
            if i % 2 == 1:
                line = self.gen_next_rhyme_line(last_line)
            else:
                line = self.gen_next_non_rhyme_line()
            lines.append(line)
        logging.info("generating lines... done!")
        for line in lines:
            print(line)
        return lines

    def gen_next_non_rhyme_line(self):
        """
        returns a line that suffices the is_good metric
        :return:
        """
        good_lines = []
        tries_per_iteration = 100
        i = 0
        while len(good_lines) == 0:
            logging.debug(f"trying to generate line. try:{i}")
            i += 1
            temp_lines = gpt2.generate(
                self.sess,
                return_as_list=True,
                length=12,
                temperature=0.8,
                nsamples=tries_per_iteration,
                batch_size=tries_per_iteration
            )
            temp_lines = [x.strip() for x in temp_lines]
            for line in temp_lines:
                if self.is_good(line):
                    good_lines.append(line)
        return random.choice(good_lines)

    def gen_next_rhyme_line(self, last_line):
        """
        returns a line that suffices the is_good metric and rhymes with last_line
        :param last_line:
        :return:
        """
        good_lines = []
        tries_per_iteration = 100
        i = 0
        while len(good_lines) == 0:
            logging.debug(f"try:{i} to generate rhyming line on line: {last_line}.")
            i += 1
            temp_lines = gpt2.generate(
                self.sess,
                return_as_list=True,
                length=12,
                temperature=0.8,
                nsamples=tries_per_iteration,
                batch_size=tries_per_iteration
            )
            temp_lines = [x.strip() for x in temp_lines]
            last_word = last_line.split()[-1]
            for line in temp_lines:
                this_last_word = line.split()[-1]
                if self.is_good(line) and self.rhymes(last_word, this_last_word):
                    good_lines.append(line)
            if i >= 30:
              return "Failed, next try"
        return random.choice(good_lines)


    def is_good(self, line):
        """"
        # defines a minimal metric for a "good" line
        # if line is empty or doesn't exist -> not good
        # if line is too short or too long -> not good
        # if there are (almost) no rhyme for the last word -> not good
        # if line is a multi-line line -> not good
        # if the line passed all checks it is good
        :param line:
        :return:
        """
        # if line is empty or doesn't exist -> not good
        if not line or len(line) == 0:
            return False

        # if line is too short or too long -> not good
        syl_count = syllables.estimate(line)
        if not 7 < syl_count < 18:
            return False

        # if there are (almost) no rhyme for the last word -> not good
        last_word = line.split()[-1]
        rhyme_list = pronouncing.rhymes(last_word)
        if not len(rhyme_list) > 20:
            return False

        # if line is a multi-line line -> not good
        if len(line.split("\n")) != 1:
            return False

        # if the line passed all checks it is good
        return True

    def rhymes(self, last_word, this_last_word):
        """ 
        :param last_word:
        :param this_last_word:
        :return: true if last_word rhymes with this_last_word
        """
        return this_last_word in pronouncing.rhymes(last_word)


## main 

### scrape

first, given a list of names of artists, their lyrics are scraped from genius.com.


then a text corpus is created from the scraped lyrics, and the corpus is cleaned of any bad characters.


In [None]:
print("scraping...")
manager = enlighten.get_manager()
scrape_bar = manager.counter(total=len(artists), desc='scrape', unit='artists')
scraper = Scraper(key, scrape_bar)
scraper.scrape_list(artists)
scraper.extract()
scraper.clean_folder()
scraper.clean_lyricdb()

scraping...
Searching for songs by aesop-rock...

Searching for songs by mf-doom...

Searching for songs by asap-rocky...

Searching for songs by kodak black...

Changing artist name to 'Aesop Rock'
Changing artist name to 'A$AP Rocky'
Changing artist name to 'MF DOOM'
Changing artist name to 'Kodak Black'
Song 1: "None Shall Pass"
Song 1: "Fuckin’ Problems"
Song 2: "Daylight"
Song 1: "Doomsday"
Song 3: "Zero Dark Thirty"
Song 1: "Tunnel Vision"
Song 4: "Rings"
Song 2: "1Train"
Song 2: "Beef Rapp"
Song 5: "Coffee"
Song 2: "No Flockin"
Song 3: "Goldie"
Song 3: "Roll in Peace"
Song 6: "Gopher Guts"
Song 4: "ZEZE"
Song 7: "Kirby"
Song 4: "Wild for the Night"
Song 5: "Praise the Lord (Da Shine)"
Song 5: "SKRT"
Song 6: "Peso"
Song 3: "That’s That"
Song 7: "Everyday"
Song 4: "Rapp Snitch Knishes"
Song 8: "Long Live A$AP"
Song 5: "Deep Fried Frenz"
Song 9: "Fashion Killa"
Song 8: "Dorks"
Song 6: "Hoe Cakes"
Song 6: "There He Go"
Song 7: "One Beer"
Song 9: "Mystery Fish"
Song 7: "Skrilla"
Song

### train 

then a gpt-2 model gets fine-tuned on the text corpus.

In [None]:
# from google.colab import files
# files.upload()

In [None]:
print("training...")
file_name = "lyricdb.txt"
gpt2.download_gpt2(model_name="345M")
sess = gpt2.start_tf_sess()
gpt2.finetune(sess,
              dataset=file_name,
              model_name='345M',
              steps=steps,
              restore_from='fresh',
              run_name='run1',
              print_every=25,
              sample_every=250,
              save_every=500,
              overwrite=True
              )


### generate

then, with this gpt-2 model fine-tuned to rap flavour, rhyming lines in an “aabbaabb…” scheme are generated.

In [None]:
gen = Generator(sess, log_level="INFO")
lines = gen.generate(line_count)

## pretty print

In [None]:
!pip install art

Collecting art
[?25l  Downloading https://files.pythonhosted.org/packages/22/dd/f7be08119239db50f80285444323e58c81d6443158ac5f8b401b3fedb4d0/art-4.7-py2.py3-none-any.whl (547kB)
[K     |████████████████████████████████| 552kB 4.4MB/s 
[?25hInstalling collected packages: art
Successfully installed art-4.7


In [None]:
from art import *

In [None]:
for line in lines:
  print(line)

After a long and arduous week of nursing, I'm
Failed, next try
Growing up in downtown Atlanta, I never dreamed of being an
If you do a great job at the job, you can


In [None]:
for line in lines:
  print(text2art(line, font="small"))

   _     __  _                      _                                   _                 _                                          _           __                        _                  ___  _        
  /_\   / _|| |_  ___  _ _   __ _  | | ___  _ _   __ _   __ _  _ _   __| |  __ _  _ _  __| | _  _  ___  _  _  ___ __ __ __ ___  ___ | |__  ___  / _|  _ _   _  _  _ _  ___(_) _ _   __ _     |_ _|( ) _ __  
 / _ \ |  _||  _|/ -_)| '_| / _` | | |/ _ \| ' \ / _` | / _` || ' \ / _` | / _` || '_|/ _` || || |/ _ \| || |(_-< \ V  V // -_)/ -_)| / / / _ \|  _| | ' \ | || || '_|(_-<| || ' \ / _` | _   | | |/ | '  \ 
/_/ \_\|_|   \__|\___||_|   \__,_| |_|\___/|_||_|\__, | \__,_||_||_|\__,_| \__,_||_|  \__,_| \_,_|\___/ \_,_|/__/  \_/\_/ \___|\___||_\_\ \___/|_|   |_||_| \_,_||_|  /__/|_||_||_|\__, |( ) |___|   |_|_|_|
                                                 |___/                                                                                                                          

## outro

### author: [pelgo14](https://github.com/pelgo14)
read my [blog post](https://pelgo14.github.io/artificial-rapper) for more information


### reference: [Max Woolf](http://minimaxir.com) says
For more about `gpt-2-simple`, you can visit [this GitHub repository](https://github.com/minimaxir/gpt-2-simple). You can also read my [blog post](https://minimaxir.com/2019/09/howto-gpt2/) for more information how to use this notebook!

### original paper 
https://openai.com/blog/better-language-models/