# Loading the OHHLA corpus
This notebook shows a typical example of the data loading and preprocessing that NLP work requires. In this case we load a corpus downloaded from the Original Hip-Hop Lyrics Archive (OHHLA) at [www.ohhla.com](http://www.ohhla.com). Our primary goal is to provide a dataset loading function for the [language modelling](todo) chapter in this book.

We provide the corpus in the `data` directory. As this notebook lives in a sub-directory itself, we access it via `../data`. Before preprocessing all files and providing *generic* loaders, it is useful to inspect the file format using one specific example file and to develop the loading process in that context. Here we look at `../data/ohhla/www.ohhla.com/anonymous/j_live/SPTA/authentc.jlv.txt`.

In [12]:
with open('../data/ohhla/www.ohhla.com/anonymous/j_live/SPTA/authentc.jlv.txt', 'r') as f:
    # we use read().splitlines() instead of readlines() to skip newline characters
    lines = f.read().splitlines()
    
lines

['Artist: J-Live',
 'Album:  SPTA (Said Person of That Ability)',
 'Song:   The Authentic',
 'Typed by: OHHLA Webmaster DJ Flash',
 '',
 '[J-Live]',
 "Well if isn't the outbreak monkey for that latest epidemic of The Vapors",
 'Bringing you the greatest live caper to date',
 'To get involved, to marry to the music',
 'Givin no quarter for those that would abuse it',
 '',
 'Nah, not even a fuck and a dream, a Coke and smile',
 'A smoke and a pancake, not a pat on the back and a handshake',
 'For bein whack, shit is beyond hate',
 "I'm evacuatin the state, you makin me glands ache",
 '',
 'Me and my bags packed, the headphones are playin the soundtrack',
 'My ears to your songs is like teeth to plaque, or',
 'big speakers to feedback, quarterbacks to sacks',
 "The bottom line nah we don't need that",
 '',
 'Yo your shit is crack! In so much as',
 'I never fuck with it and never will - never use, never deal',
 "Never chopped, never cooked, and I really don't feel the appeal",
 'How they g

First we would like to get rid of the header: the four metadata lines and the empty line that follows them.

In [15]:
lyrics = lines[5:]
lyrics

['[J-Live]',
 "Well if isn't the outbreak monkey for that latest epidemic of The Vapors",
 'Bringing you the greatest live caper to date',
 'To get involved, to marry to the music',
 'Givin no quarter for those that would abuse it',
 '',
 'Nah, not even a fuck and a dream, a Coke and smile',
 'A smoke and a pancake, not a pat on the back and a handshake',
 'For bein whack, shit is beyond hate',
 "I'm evacuatin the state, you makin me glands ache",
 '',
 'Me and my bags packed, the headphones are playin the soundtrack',
 'My ears to your songs is like teeth to plaque, or',
 'big speakers to feedback, quarterbacks to sacks',
 "The bottom line nah we don't need that",
 '',
 'Yo your shit is crack! In so much as',
 'I never fuck with it and never will - never use, never deal',
 "Never chopped, never cooked, and I really don't feel the appeal",
 'How they get a record deal? For real',
 '',
 '[Chorus]',
 "It's the return of the supergood, the authentic",
 "You might drama but we wouldn't rec

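The slice `lines[5:]` relies on the header being exactly four lines followed by one empty line. As a minimal sketch of a less position-dependent alternative, the header could be split off at the first empty line and kept as a dictionary. The helper name `split_header` is hypothetical and not part of this notebook, and the `Key: value` header layout is an assumption based only on the example file above.

```python
def split_header(lines):
    """Split song lines into (header dict, lyric lines).

    Assumption: the header is a run of 'Key: value' lines that ends at the
    first empty line. This is inferred from a single example file.
    """
    header = {}
    for i, line in enumerate(lines):
        if line.strip() == '':
            # everything after the first empty line is treated as lyrics
            return header, lines[i + 1:]
        key, sep, value = line.partition(':')
        if not sep:
            # no colon: the file does not follow the assumed header layout
            return header, lines[i:]
        header[key.strip()] = value.strip()
    return header, []


example = [
    'Artist: J-Live',
    'Album:  SPTA (Said Person of That Ability)',
    'Song:   The Authentic',
    'Typed by: OHHLA Webmaster DJ Flash',
    '',
    '[J-Live]',
    'Bringing you the greatest live caper to date',
]
header, example_lyrics = split_header(example)
print(header['Artist'])
print(example_lyrics[0])
```

Keeping the header fields around may be handy later, for example to group songs by artist, but for the loader in this notebook only the lyric lines are needed.
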
Finally, we would like to convert the list of lines to a single string, as this will be easier for our language models to process. We wrap each lyrical "bar" (line) in `[BAR]` and `[/BAR]` tags so that the rhythmic structure of the song is preserved. Empty lines become empty `[BAR][/BAR]` pairs.

In [17]:
string = '[BAR]' + '[/BAR][BAR]'.join(lyrics) + '[/BAR]'
string

"[BAR][J-Live][/BAR][BAR]Well if isn't the outbreak monkey for that latest epidemic of The Vapors[/BAR][BAR]Bringing you the greatest live caper to date[/BAR][BAR]To get involved, to marry to the music[/BAR][BAR]Givin no quarter for those that would abuse it[/BAR][BAR][/BAR][BAR]Nah, not even a fuck and a dream, a Coke and smile[/BAR][BAR]A smoke and a pancake, not a pat on the back and a handshake[/BAR][BAR]For bein whack, shit is beyond hate[/BAR][BAR]I'm evacuatin the state, you makin me glands ache[/BAR][BAR][/BAR][BAR]Me and my bags packed, the headphones are playin the soundtrack[/BAR][BAR]My ears to your songs is like teeth to plaque, or[/BAR][BAR]big speakers to feedback, quarterbacks to sacks[/BAR][BAR]The bottom line nah we don't need that[/BAR][BAR][/BAR][BAR]Yo your shit is crack! In so much as[/BAR][BAR]I never fuck with it and never will - never use, never deal[/BAR][BAR]Never chopped, never cooked, and I really don't feel the appeal[/BAR][BAR]How they get a record deal? 

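The tagged string can be turned back into the original list of lines, which is a useful sanity check on the format. The sketch below uses two hypothetical helpers, `to_bars` and `from_bars`, that are not part of this notebook. It assumes that no lyric line contains the literal tags, and `to_bars` matches the join expression above only for non-empty lists.

```python
BAR_OPEN, BAR_CLOSE = '[BAR]', '[/BAR]'


def to_bars(lines):
    """Wrap every line in [BAR]...[/BAR] and concatenate the results."""
    return ''.join(BAR_OPEN + line + BAR_CLOSE for line in lines)


def from_bars(tagged):
    """Recover the list of lines from a tagged string.

    Assumption: no lyric line contains the literal tags themselves.
    """
    if tagged == '':
        return []
    # drop the outermost tags, then split on the boundary between two bars
    inner = tagged[len(BAR_OPEN):-len(BAR_CLOSE)]
    return inner.split(BAR_CLOSE + BAR_OPEN)


sample = ['[J-Live]', 'To get involved, to marry to the music', '', 'Nah']
tagged = to_bars(sample)
print(from_bars(tagged) == sample)
```
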
We are now ready to provide a loading function.

In [21]:
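def load_song(file_name):
    """Load one OHHLA song file and return its lyrics as a [BAR]-tagged string."""
    with open(file_name, 'r') as f:
        # we use read().splitlines() instead of readlines() to skip newline characters
        lines = f.read().splitlines()
    # drop the four header lines and the empty line that follows them
    lyrics = lines[5:]
    return '[BAR]' + '[/BAR][BAR]'.join(lyrics) + '[/BAR]'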
    
load_song('../data/ohhla/www.ohhla.com/anonymous/j_live/SPTA/authentc.jlv.txt')

"[BAR][J-Live][/BAR][BAR]Well if isn't the outbreak monkey for that latest epidemic of The Vapors[/BAR][BAR]Bringing you the greatest live caper to date[/BAR][BAR]To get involved, to marry to the music[/BAR][BAR]Givin no quarter for those that would abuse it[/BAR][BAR][/BAR][BAR]Nah, not even a fuck and a dream, a Coke and smile[/BAR][BAR]A smoke and a pancake, not a pat on the back and a handshake[/BAR][BAR]For bein whack, shit is beyond hate[/BAR][BAR]I'm evacuatin the state, you makin me glands ache[/BAR][BAR][/BAR][BAR]Me and my bags packed, the headphones are playin the soundtrack[/BAR][BAR]My ears to your songs is like teeth to plaque, or[/BAR][BAR]big speakers to feedback, quarterbacks to sacks[/BAR][BAR]The bottom line nah we don't need that[/BAR][BAR][/BAR][BAR]Yo your shit is crack! In so much as[/BAR][BAR]I never fuck with it and never will - never use, never deal[/BAR][BAR]Never chopped, never cooked, and I really don't feel the appeal[/BAR][BAR]How they get a record deal? 

Now we want to load several files from an album directory. 

In [37]:
from os import listdir
from os.path import isfile, join

def load_album(path):
    # we filter out directories, and files that don't have the .txt extension of OHHLA song files
    onlyfiles = [join(path, f) for f in listdir(path) if isfile(join(path, f)) and f.endswith('.txt')]
    lyrics = [load_song(f) for f in onlyfiles]
    return lyrics

songs = load_album('../data/ohhla/www.ohhla.com/anonymous/j_live/SPTA/')
[len(s) for s in songs]

[2555, 2779, 3283]
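
Note that `listdir` returns entries in arbitrary order, so the order of the songs in the list can differ between machines. As a sketch of one way to make the listing reproducible, the files could be collected with `pathlib`, matched on their suffix, and sorted by name. The helper name `list_song_files` is hypothetical, and the demonstration runs on a temporary directory rather than the real corpus.

```python
import tempfile
from pathlib import Path


def list_song_files(album_dir):
    """Return the .txt files directly inside album_dir, sorted by name."""
    return sorted(p for p in Path(album_dir).iterdir()
                  if p.is_file() and p.suffix == '.txt')


# Demonstration on a throwaway directory with made-up file names.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / 'b_song.txt').write_text('x')
    (root / 'a_song.txt').write_text('y')
    (root / 'txt_notes.html').write_text('z')  # has 'txt' in its name, but is not a .txt file
    (root / 'subdir').mkdir()                  # directories are skipped
    found = [p.name for p in list_song_files(root)]

print(found)
```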

We will also make it easy to load several albums. Then, for a few artists, we provide shortcuts to the album directories we care about.

In [38]:
def load_albums(album_paths):
    return [song 
            for path in album_paths 
            for song in load_album(path)]

top_dir = '../data/ohhla/www.ohhla.com/anonymous/'
j_live = [
    # top_dir already ends with '/', so the album paths must not start with one
    top_dir + 'j_live/allabove/',
    top_dir + 'j_live/bestpart/'
]
len(load_albums(j_live))


UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 3520: invalid continuation byte
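
The error indicates that at least one file in these albums is not valid UTF-8. The byte `0xe9` is "é" in Latin-1, which suggests (but does not prove) that some files use a single-byte Western encoding. A minimal sketch of a reader that tries UTF-8 first and then falls back is shown below. The helper name `read_lines_with_fallback` is hypothetical, and the choice of Latin-1 as the fallback is an assumption that has not been checked against the corpus.

```python
import os
import tempfile


def read_lines_with_fallback(file_name, encodings=('utf-8', 'latin-1')):
    """Read a text file as a list of lines, trying each encoding in turn.

    Assumption: files that are not valid UTF-8 are Latin-1. Latin-1 maps
    every byte to a character, so it never fails and should come last.
    """
    with open(file_name, 'rb') as f:
        raw = f.read()
    for encoding in encodings:
        try:
            return raw.decode(encoding).splitlines()
        except UnicodeDecodeError:
            continue
    raise ValueError(f'none of the encodings could decode {file_name}')


# Demonstration on a temporary file that contains the byte 0xe9.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'song.txt')
    with open(path, 'wb') as f:
        f.write(b'caf\xe9 au lait\nsecond line\n')
    decoded = read_lines_with_fallback(path)

print(decoded)
```

A loader could call such a function in place of `open(file_name, 'r')` followed by `read().splitlines()`. Because the fallback never raises, a file in some other encoding would be decoded silently into the wrong characters, so spot-checking a few of the affected files is advisable.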