## Libraries

- **os**. Standard library module for handling file paths.
- **json**. Reads and writes JSON files.
- **music21**. 👀 Great library for working with kern, MIDI and MusicXML files. It can be used as a converter, but its main strength is that it parses files into objects (awesome): all the information about a song, such as _notes_, _duration_, _clef_ ..., is represented as objects (_each with its own parameters_). **Check the documentation <a href="http://web.mit.edu/music21/doc/usersGuide/usersGuide_01_installing.html">here</a>**
- **keras & numpy**. Used to one-hot encode the training sequences.

In [11]:
import os
import json
import music21 as m21  # music converter, useful for handling kern, MIDI and MusicXML files
import tensorflow.keras as keras
import numpy as np

In [12]:
# VARIABLES


# Path Variables
DEUSTCHL_PATH = "deutschl/"
CHINA_PATH = "china/"
SAVE_DIR = "dataset"
SINGLE_FILE_DATASET = "file_dataset"

# Get all subdirectories and save paths on a list.
german_dataset_paths = []
china_dataset_paths = []

for path, subdir, files in os.walk(DEUSTCHL_PATH):
    for dir in subdir:
        path_to_dir = os.path.join(DEUSTCHL_PATH, dir)
        german_dataset_paths.append(path_to_dir)

for path, subdir, files in os.walk(CHINA_PATH):
    for dir in subdir:
        path_to_dir = os.path.join(CHINA_PATH, dir)
        china_dataset_paths.append(path_to_dir)
        
# Duration variable
ACCEPTABLE_DURATIONS = [
    0.25, 0.5, 0.75, 1, 1.5, 2, 3, 4
]

# Sequence length (number of time steps in each input sequence fed to our NN)
SEQUENCE_LENGTH = 64


## Preprocess Function

The idea is to convert the whole dataset from .krn format to our time-series representation. <br>

<h4> Steps TODO </h4>

1. **Load folk songs**. We're going to use the music21 library for that (_load_songs_in_kern()_ function).

2. **Filter out songs that have non-acceptable durations**. This step is important since we want the network to easily pick up the structures we feed it. We only keep songs whose note and rest durations all belong to a fixed set of simple values (multiples of 16th/8th notes or so).

3. **Transpose songs to C major / A minor**. Bringing the whole dataset to the same key makes things easier for the network, since it does not have to learn every key separately.

4. **Encode songs with music time series representation**. We still need to one-hot encode the data later, but this format is handy to store and manipulate.

5. **Save songs to text files**


**The whole process is wrapped in the preprocess() function**
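
As a rough illustration of step 4, here is a minimal sketch of the time-series encoding that does not depend on music21. The `melody` list and the `encode_events` helper are hypothetical stand-ins introduced only for this example: each event is a pair of a MIDI pitch (or `"r"` for a rest) and a duration in quarter lengths.

```python
# Hypothetical stand-in for music21 events:
# (MIDI pitch or "r" for a rest, duration in quarter lengths)
melody = [(60, 1.0), (62, 0.5), ("r", 0.5), (64, 0.25)]


def encode_events(events, time_step=0.25):
    """Write each event as its symbol followed by "_" for every extra time step it lasts."""
    encoded = []
    for symbol, quarter_length in events:
        steps = int(quarter_length / time_step)
        for step in range(steps):
            # The first step carries the symbol, the remaining steps are holds
            encoded.append(str(symbol) if step == 0 else "_")
    return " ".join(encoded)


encoded_melody = encode_events(melody)
print(encoded_melody)  # 60 _ _ _ 62 _ r _ 64
```

With a time step of 0.25 (a 16th note), a quarter note occupies four positions, so every position in the string covers the same amount of musical time.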

In [13]:
def load_songs_in_kern(dataset_path, songs=None):
    '''Load a dataset using music21.
       Return a list of music21 objects, one per song.
    '''
    if songs is None:
        songs = []
    
    for path, subdir, files in os.walk(dataset_path):
        for file in files:
            if file.endswith(".krn"):  # only parse kern files, skip anything else (e.g. checksums)
                song = m21.converter.parse(os.path.join(path,file))
                songs.append(song)
    
    return songs


def has_acceptable_durations(song, acceptable_durations):
    
    for note in song.flat.notesAndRests:
        if note.duration.quarterLength not in acceptable_durations:
            return False
    return True


def transpose(song):
    '''Transpose the song to C major / A minor'''
    
    # get the key notated in the song
    parts = song.getElementsByClass(m21.stream.Part)
    measures_part0 = parts[0].getElementsByClass(m21.stream.Measure)
    key = measures_part0[0][4]  # element at index 4 of the first measure is expected to be the key
    
    # if that element is not a key, estimate the key using music21
    if not isinstance(key, m21.key.Key):
        key = song.analyze("key")
    
    # get interval for transposition. E.g., Bmaj --> Cmaj
    if key.mode == "major":
        interval = m21.interval.Interval(key.tonic, m21.pitch.Pitch("C"))
    elif key.mode == "minor":
        interval = m21.interval.Interval(key.tonic, m21.pitch.Pitch("A"))
    
    # transpose song by calculated interval
    transposed_song = song.transpose(interval)
    
    return transposed_song


def encode_song(song, time_step=0.25):
    '''Encode song to time-series format'''
    # p = 60, d = 1.0 --> [60, "_", "_", "_"]
    
    encoded_song = []
    
    for event in song.flat.notesAndRests:
        
        # handle notes
        if isinstance(event, m21.note.Note):
            symbol = event.pitch.midi
        # handle rests
        elif isinstance(event, m21.note.Rest):
            symbol = "r"
        
        # convert the note/rest into time series notation
        steps = int(event.duration.quarterLength / time_step)
        for step in range(steps):
            if step == 0: # First step --> Symbol
                encoded_song.append(symbol)
            else: #Rest of steps are "_"
                encoded_song.append("_")
        
    # cast encoded song to a str
    encoded_song = " ".join(map(str, encoded_song))
    
    return encoded_song


def preprocess(dataset_path, save_dir):
    
    print("Loading songs...")
    songs = []
    if isinstance(dataset_path, list):
        for path in dataset_path:
            # load_songs_in_kern appends to the list it receives
            songs = load_songs_in_kern(path, songs=songs)
            print(f"Loaded {path} directory.")
    else:
        songs = load_songs_in_kern(dataset_path)
        
    print(f"Dataset loaded up succesfully. \n Total songs: {len(songs)}")
    
    print(f"Processing songs...")
    for i, song in enumerate(songs):
        if not has_acceptable_durations(song, ACCEPTABLE_DURATIONS):
            continue
        
        #transpose to Cmaj/Amin
        try:
            song = transpose(song)
        except IndexError:
            continue
        # encode song to time series format
        encoded_song = encode_song(song)
        
        #save songs to text file
        save_path = os.path.join(save_dir, str(i))
        with open(save_path, "w") as file:
            file.write(encoded_song)
    
    print(f"Dataset Created!")
        
        
        


In [66]:

if __name__ == "__main__":
#     songs = load_songs_in_kern('china/natmin')
#     print(len(songs))
#     song = songs[0]
#     print(f"Has acceptable duration? {has_acceptable_durations(song,ACCEPTABLE_DURATIONS)}")
#     transpose(song)
    preprocess(german_dataset_paths, os.path.join(SAVE_DIR, 'deutschl'))
    
    
    

Loading songs...
Loaded deutschl/allerkbd directory.
Loaded deutschl/altdeu1 directory.
Loaded deutschl/altdeu2 directory.
Loaded deutschl/ballad directory.
Loaded deutschl/boehme directory.
Loaded deutschl/dva directory.
Loaded deutschl/erk directory.
Loaded deutschl/fink directory.
Loaded deutschl/kinder directory.
Loaded deutschl/test directory.
Loaded deutschl/variant directory.
Loaded deutschl/zuccal directory.
Dataset loaded up succesfully. 
 Total songs: 5365


## Join whole dataset in a single file

In [14]:
def load_song(file_path):
    with open(file_path, "r") as file:
        song = file.read()
    return song



def create_single_file_dataset(dataset_path, file_dataset_path, sequence_length=SEQUENCE_LENGTH):
    # Delimiter as long as one input sequence, so the network can learn that it marks the end of a song
    new_song_delimiter = "/ " * sequence_length
    songs = ""
    
    
    # load encoded songs and add delimiters between 
    for path, _ , files in os.walk(dataset_path):
        for file in files:
            file_path = os.path.join(path, file)
            song = load_song(file_path)
            songs = songs + song + " " + new_song_delimiter
    
    songs = songs[:-1]  # remove the trailing space
    
    # save string that contains all dataset
    with open(file_dataset_path, "w") as file:
        file.write(songs)
    
    return songs  # returned so it can be reused later to build the symbol mappings
    
    
    

In [98]:
if __name__ == "__main__":
    german_songs = create_single_file_dataset(os.path.join(SAVE_DIR,'deutschl'), SINGLE_FILE_DATASET + "_Deutschl")
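
To see what the joined dataset looks like, here is a small sketch of the same concatenation logic on in-memory strings instead of files. The two encoded songs are made up for the example, and `sequence_length` is set to 4 only to keep the output short (the notebook uses `SEQUENCE_LENGTH = 64`).

```python
sequence_length = 4  # small value for readability; assumption for this example only
encoded_songs = ["60 _ 62 _", "r _ 64 _"]  # hypothetical encoded songs

# One "/" per time step of an input sequence marks the end of a song
delimiter = "/ " * sequence_length

dataset = ""
for song in encoded_songs:
    dataset = dataset + song + " " + delimiter
dataset = dataset[:-1]  # remove the trailing space

symbols = dataset.split()
print(dataset)  # 60 _ 62 _ / / / / r _ 64 _ / / / /
```

Because the delimiter spans a full sequence, any training window that starts right after a song ends contains only `/` symbols rather than the tail of the previous song.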

## Map Songs Info to create a vocabulary of Notes

In [15]:
def create_mapping(songs, mapping_path):
    mappings = {}
    
    # identify the vocabulary
    songs = songs.split()
    vocabulary = list(set(songs))
    
    # create mappings (dictionary)
    for i, symbol in enumerate(vocabulary):
        mappings[symbol] = i
    
    
    # save vocabulary to a json file
    with open(mapping_path, "w") as file:
        json.dump(mappings, file)
        

In [99]:
if __name__ == "__main__":
    create_mapping(german_songs, "deutschl_mapping.json")
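
The sketch below shows what the mapping looks like on a tiny, made-up dataset string, and how it turns symbols into integers. One deliberate difference from `create_mapping()`: the vocabulary is sorted here so that the indices are reproducible from run to run, which is an assumption of this example and not what the function above does.

```python
import json

dataset = "60 _ 62 _ / / r _ 60 _ / /"  # hypothetical joined dataset

# Sorting is a choice made for this example so the indices are deterministic
vocabulary = sorted(set(dataset.split()))
mappings = {symbol: index for index, symbol in enumerate(vocabulary)}

# Same lookup that convert_songs_to_int() performs after loading the JSON file
int_songs = [mappings[symbol] for symbol in dataset.split()]

# The mapping survives a JSON round trip because all keys are strings
restored = json.loads(json.dumps(mappings))

print(mappings)   # {'/': 0, '60': 1, '62': 2, '_': 3, 'r': 4}
print(int_songs)  # [1, 3, 2, 3, 0, 0, 4, 3, 1, 3, 0, 0]
```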

## Convert values to ints and Generate Training Sequences

In [16]:
def convert_songs_to_int(songs, mappings_path):
    '''Use the saved mappings to convert the symbols of the encoded songs into int values'''
    
    int_songs =[]
    
    #load mappings
    with open(mappings_path, "r") as file:
        mappings = json.load(file)
    
    # cast songs string to a list
    songs = songs.split()
    
    #map songs to int
    for symbol in songs:
        int_songs.append(mappings[symbol])
    
    return int_songs


def generating_training_sequences(dataset, sequence_length=SEQUENCE_LENGTH):
    
    if dataset.lower() == 'china':
        single_file_path = SINGLE_FILE_DATASET + '_China'
        map_path = 'china_mapping.json'
    elif dataset.lower() == 'deutschl':
        single_file_path = SINGLE_FILE_DATASET + '_Deutschl'
        map_path = 'deutschl_mapping.json'
    else:
        print('Wrong Dataset, please check and try again')
        return
    
    # load songs and map them to int
    songs = load_song(single_file_path)
    int_songs = convert_songs_to_int(songs, map_path)
    
    # generate the training sequences
    inputs = []
    targets = []
    
    num_sequences = len(int_songs) - sequence_length
    for  i in range(num_sequences):
        inputs.append(int_songs[i:i+sequence_length]) # Each sequence is shifted by one step with respect to the previous one
        targets.append(int_songs[i+sequence_length]) # The target is the symbol that immediately follows the input sequence
        
        
    # one-hot encode the sequences
    vocabulary_size = len(set(int_songs)) # Get total num of symbol types in datasets
    inputs = keras.utils.to_categorical(inputs, num_classes=vocabulary_size)
    targets = np.array(targets)
    
    return inputs, targets
    

In [18]:
if __name__ == '__main__':
    inputs, targets =  generating_training_sequences('china')
    

In [23]:
inputs.shape, targets.shape

((423588, 64, 46), (423588,))

## Input data: inputs & targets.

<h3 style='text-align: center'> Inputs </h3>

Our X data. At each step the network is fed a sequence of length _SEQUENCE_LENGTH_, which represents a chunk of melody.  

- **Shape** --> (total_sequences, sequence_length, one_hot_variables)

<h3 style='text-align: center'> Targets </h3>

This is our _"Y data"_: for a given X input, the LSTM needs to know the next value (_output value_ or _target value_). It is the symbol (note, rest or hold) that follows the melody chunk given to the network as a time-series sequence.

- **Shape** --> (total_sequences,)
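
To make these shapes concrete, here is a minimal sketch of the sliding-window logic on a short, made-up integer sequence. It uses `np.eye` indexing for the one-hot step instead of `keras.utils.to_categorical` so that it runs with numpy alone; that swap, the toy `int_songs` list and the small sizes are assumptions of this example.

```python
import numpy as np

int_songs = [1, 3, 2, 3, 0, 0, 4, 3, 1, 3]  # hypothetical mapped dataset
sequence_length = 4
vocabulary_size = 5

inputs, targets = [], []
for i in range(len(int_songs) - sequence_length):
    inputs.append(int_songs[i:i + sequence_length])  # window of 4 symbols
    targets.append(int_songs[i + sequence_length])   # symbol right after the window

# Indexing an identity matrix with integer labels gives their one-hot rows
inputs = np.eye(vocabulary_size, dtype="float32")[np.array(inputs)]
targets = np.array(targets)

print(inputs.shape, targets.shape)  # (6, 4, 5) (6,)
```

The targets stay as plain integers in a one-dimensional array, which matches the `(423588,)` shape printed above; integer targets like these are typically paired with a sparse categorical loss.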