# Parse and prepare a dataset of abc music notations

Download the [Nottingham Dataset](https://github.com/jukedeck/nottingham-dataset) or this [dataset of abc music notation from Henrik Norbeck](http://norbeck.nu/abc/download.asp). For the latter, select the 'one big zip file (549 kilobytes)' link at the end of the page.

If we use the Henrik Norbeck dataset, the first step is to parse all the files and concatenate their text into one 'big' text file.

We will train our model using Char-RNN for TensorFlow, which you can clone from [https://github.com/sherjilozair/char-rnn-tensorflow](https://github.com/sherjilozair/char-rnn-tensorflow).

## ABC Notations

We will need some software to work with `abc` and `mid` files. On Ubuntu, you can install it with:

```
$ sudo apt-get install abcmidi timidity
```

On Mac:


```
$ brew install abcmidi timidity
```

Mac users can also install [EasyABC](https://www.nilsliberg.se/ksp/easyabc/) to read the files.

Here’s a simple example:

```
X: 1
T:"Hello world in abc notation"
M:4/4
K:C
"Am" C, D, E, F,|"F" G, A, B, C|"C"D E F G|"G" A B e c
```

To test the installation we can listen to this by saving the above snippet into a `hello.abc` file and running (Mac and Ubuntu):

```
$ abc2midi hello.abc -o hello.mid && timidity hello.mid
```
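
The same check can be scripted from Python, which is handy inside a notebook. The following is a minimal sketch, not part of the original workflow: it writes the snippet above to `hello.abc` in a temporary folder and calls `abc2midi` with the same `-o` flag, but only if the tool is found on the `PATH`. The helper name `abc_to_midi_command` is hypothetical.

```python
import os
import shutil
import subprocess
import tempfile

# The "hello world" tune from above
HELLO_ABC = '''X: 1
T:"Hello world in abc notation"
M:4/4
K:C
"Am" C, D, E, F,|"F" G, A, B, C|"C"D E F G|"G" A B e c
'''

def abc_to_midi_command(abc_path, mid_path):
    # Hypothetical helper: mirrors `abc2midi hello.abc -o hello.mid`
    return ['abc2midi', abc_path, '-o', mid_path]

workdir = tempfile.mkdtemp()
abc_path = os.path.join(workdir, 'hello.abc')
mid_path = os.path.join(workdir, 'hello.mid')

with open(abc_path, 'w') as f:
    f.write(HELLO_ABC)

command = abc_to_midi_command(abc_path, mid_path)

# Only run the conversion when abc2midi is actually installed
if shutil.which('abc2midi'):
    result = subprocess.run(command)
    print('abc2midi exit code:', result.returncode)
else:
    print('abc2midi not found on PATH, skipping conversion')
```

The resulting `hello.mid` can then be played with `timidity` as shown above.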

In [56]:
import os

# input_folder_fp = '/home/gu-ma/Downloads/hn201809'
input_folder_fp = '/home/gu-ma/Downloads/nottingham-dataset/ABC_cleaned'
abc_raw_txt = ''
abc_all_txt = ''

# Parse all files in the input folders
for root, subdirs, files in os.walk(input_folder_fp):
    print(root)
    for filename in files:
        file_path = os.path.join(root, filename)
        print('\t- %s ' % filename)
        if filename.lower().endswith('.abc'):
            with open(file_path, 'r') as f:
                abc_raw_txt += f.read()

print('\nabc_raw_txt:\n--\n' + abc_raw_txt[:1000])

/home/gu-ma/Downloads/nottingham-dataset/ABC_cleaned
	- reelsh-l.abc 
	- reelsm-q.abc 
	- reelsd-g.abc 
	- playford.abc 
	- waltzes.abc 
	- reelsa-c.abc 
	- slip.abc 
	- reelsr-t.abc 
	- reelsu-z.abc 
	- jigs.abc 
	- morris.abc 
	- xmas.abc 
	- ashover.abc 
	- hpps.abc 

abc_raw_txt:
--

X: 1
T:Hallowe'en
% Nottingham Music Database
S:Chris Dewhurst 1983, via PR
M:4/4
L:1/4
K:G
c/2B/2|"Am"A/2^G/2A/2B/2 c/2B/2c/2d/2|"Am"e/2d/2e/2f/2 "D7"g/2e/2d/2B/2|\
"G"GB/2G/2 d/2G/2B/2d/2|g/2e/2d/2B/2 G/2A/2B/2G/2|
"Am"A/2^G/2A/2B/2 c/2B/2c/2d/2|"Am"e/2d/2e/2f/2 "D7"g/2e/2d/2B/2|\
"G"G/2A/2B/2G/2 "E7"e/2d/2c/2B/2|"Am"cA A::
c/2B/2|"Am"Aa/2A/2 g/2A/2f/2A/2|"Am"e/2d/2e/2f/2 "D7"g/2e/2d/2B/2|\
"G"GB/2G/2 d/2G/2B/2d/2|"G"g/2e/2d/2B/2 G/2A/2B/2G/2|
"Am"Aa/2A/2 g/2A/2f/2A/2|"Am"e/2d/2e/2f/2 "D7"g/2e/2d/2B/2|\
"G"G/2A/2B/2G/2 "E7"e/2d/2c/2B/2|"Am"cA A:|


X: 2
T:Hannah Onestep
% Nottingham Music Database
S:Pauline Wilson, via PR
M:4/4
L:1/4
K:G
"G"D4|"D7"B3F|"G"AG FG-|G3F|AG FA-|AG E2|"Am"F4-|"D7"F4|"Am"E4|
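
Each tune in the output above starts with an `X:` reference number line, so counting those lines gives a quick sanity check that the concatenation picked up what we expect. This is a minimal sketch on a shortened, made-up sample; the helper name `count_tunes` is hypothetical, and in the notebook you would pass `abc_raw_txt` instead.

```python
import re

def count_tunes(abc_text):
    # Hypothetical helper: count the lines that start with an 'X:' field
    return len(re.findall(r'^X:', abc_text, flags=re.M))

# Shortened sample in the same shape as the concatenated text
sample_txt = (
    "X: 1\nT:Hallowe'en\nM:4/4\nL:1/4\nK:G\nc/2B/2|\n"
    "\n"
    "X: 2\nT:Hannah Onestep\nM:4/4\nL:1/4\nK:G\n\"G\"D4|\n"
)

print(count_tunes(sample_txt))
```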

Then we remove the 'unnecessary' parts and clean up the text.

In [57]:
import re

# Helper function to extract (and optionally delete) chunks of text.
# Note: matches are searched in `txt`, but the deletion and the empty-line
# cleanup always act on the global abc_raw_txt.
def extract_text(regex, txt, delete):
    global abc_raw_txt
    output = ''
    # extract the text
    for result in re.findall(regex, txt, re.S):
        output += result + "\n"
    # delete from the original text
    if delete:
        abc_raw_txt = re.sub(regex, '', abc_raw_txt, flags=re.S)
    # remove empty lines
    abc_raw_txt = ''.join([s for s in abc_raw_txt.strip().splitlines(True) if s.strip()])
    return output

# Helper function to delete selected lines from a text
def delete_lines(regex, txt):
    txt = (re.sub(regex, '', txt, flags=re.S))
    txt = ''.join([s for s in txt.strip().splitlines(True) if s.strip()])
    return txt

# Extract intro text
useless_txt = extract_text(r'(This file.*?- Questions?.[^\n]*)', abc_raw_txt, True)

# Save the file without the intro text
abc_all_txt = abc_raw_txt

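# Delete everything from the first double quote to the end of the line
# (this removes chord symbols such as "Am" together with the notes after them)
abc_raw_txt = delete_lines(r'".[^\n]*', abc_raw_txt)
# Delete comments (they start with '%')
abc_raw_txt = delete_lines(r'%.[^\n]*', abc_raw_txt)
# Delete lyrics (W: fields)
abc_raw_txt = delete_lines(r'W:.[^\n]*', abc_raw_txt)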

# Extract headers
abc_headers_txt = extract_text(r'(X:.*?K:.[^\n]*)', abc_raw_txt, True)

print('\nabc_raw_txt:\n--\n' + abc_raw_txt[:1000])
print('\nabc_headers_txt:\n--\n' + abc_headers_txt[:1000])


abc_raw_txt:
--
c/2B/2|
c/2B/2|
DG A|
Bd2^c|
ag e_e|
ed c2|
GA EG|
P:A
ed |
P:B
ef/2g/2 |
AG |
P:A
G|
P:B
e/2f/2|
B/2c/2|
 [2
|
d/2e/2|
d/2d/2|
||
P:A
A/2G/2|
P:B
a|
P:C
A/2G/2|
c|
c|
:|
c|
c|
P:A
D|
P:B
G/2A/2|
(A/4G/4)|F/2A/2 A/4B/4A/4F/4|G/2B/2 B/4c/4B/4G/4|F/2A/2 e/4f/4e/4d/4|\
c/2A/2 A/4B/4A/4G/4|
F/2A/2 A/4B/4A/4F/4|G/2B/2 B/2(3A/4B/4c/4|d/4e/4d/4c/4 A/4B/4A/4G/4|F/2D/2 D/2\
:|
(d/4e/4)|f/2d/2 f/2d/2|f/4d/4f/4g/4 a/2a/2|e/4c/4A/4c/4 e/4c/4A/4c/4|\
e/2f/2 g/2a/4g/4|
f/2d/2 f/2d/2|f/4d/4f/4g/4 a/2a/2|d/4e/4d/4c/4 A/4B/4A/4G/4|F/2D/2 D/2:|
FG |
||
P:A
A/2G/2|
[1
[2
P:B
F/2G/2|
D|
B|
e|
c/2d/2|
|
B/2c/2|
K:C
G/2G/2|
:|
c/2d/2|
P:A
e|
P:B
(3e/2f/2g/2|
D/2D/2|
Dc BA|
d/2^c/2d/2e/2 dD/2D/2|
|:D/2D/2|
A|
f/2g/2|
A|
|:
a/2g/2|
E/2F/2|
:|
c/2d/2|
EG B2|
P:A
A/2G/2|
K:G
P:B
e/2f/2|
P:A
A/2G/2|
K:G
P:B
e/2f/2|
[1
K:C
P:C
K:C
P:D
|:
P:A
B/2c/2|
P:B
B/2c/2|
P:A
G/2A/2|
P:B
B/2A/2|
P:A
E|:
P:V
|
G/2A/2|: |||:
e/2f/2|
::
A|
A,|
|:d/2e/2|
M:4/4
L:1/4
B3/2B/2 Bc|
c/4d/4|
z/2|
P:A
f/2e/2|
K:D
P:B


In [58]:
print(len(abc_raw_txt))
print(len(abc_headers_txt))

22260
363155


Once we have what we need, we can save the files to disk.

In [59]:
output_raw_fp = os.path.join(input_folder_fp, 'abc_raw.txt')
output_all_fp =  os.path.join(input_folder_fp, 'abc_all.txt')
output_header_fp =  os.path.join(input_folder_fp, 'abc_headers.txt')

with open(output_raw_fp, 'w') as f:
    f.write(abc_raw_txt)
    
with open(output_all_fp, 'w') as f:
    f.write(abc_all_txt)
    
with open(output_header_fp, 'w') as f:
    f.write(abc_headers_txt)

Now that our input text file is ready, we can run it through our RNN. We will use char-rnn for TensorFlow, which you can download and install from [here](https://github.com/sherjilozair/char-rnn-tensorflow).

In [60]:
import shutil
import subprocess

charrnn_folder_fp = '/home/gu-ma/Documents/Projects/201809-HSLU-COMPPX/References/char-rnn-tensorflow'

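# We try with the full text first.
# Note: shutil.move removes abc_all.txt from input_folder_fp, and the
# target folder (data/abc) must already exist.
shutil.move(output_all_fp, os.path.join(charrnn_folder_fp, 'data', 'abc', 'input.txt'))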

'/home/gu-ma/Documents/Projects/201809-HSLU-COMPPX/References/char-rnn-tensorflow/data/abc/input.txt'

Go to the directory and run the training