# Data Scraping

A complementary notebook for [MusicRNN](https://github.com/eyal-sasson/music-rnn) demonstrating the data acquisition.

[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/eyal-sasson/music-rnn/blob/main/Data_Scraping.ipynb)
[![View Source on GitHub](https://badgen.net/badge/icon/View%20Source%20on%20GitHub?icon=github&label)](https://github.com/eyal-sasson/music-rnn/blob/main/Data_Scraping.ipynb)

## Dependencies
First we import our dependencies. We are going to use the `requests` library to fetch the website content and `BeautifulSoup` to parse it.

In [None]:
import random
from tqdm.notebook import tqdm, trange
import requests
from bs4 import BeautifulSoup
import re
from time import sleep

## The acquisition process
In this example, we will download songs from the Lesession dataset, from the [ABC Notation website](https://abcnotation.com).

### Getting the tune search pages
First, let's go over the tune search pages, each with 50 songs. We will take up to 1300 songs, and convert every page to a `BeautifulSoup` object.

In [None]:
url = "https://abcnotation.com/searchTunes"
site = "www.lesession.co.uk"
pages = []
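for offset in trange(0, 1300, 50):
  params = {'q': f'site:{site}', 'f': 't', 's': offset}
  r = requests.get(url, params=params)
  while r.status_code != 200:
    sleep(5)  # wait before retrying, so we don't hammer the server
    r = requests.get(url, params=params)
  soup = BeautifulSoup(r.content, 'html.parser')  # name the parser explicitly to avoid a warning
  pages.append(soup)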

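To make the paging concrete, here is a small sketch of the URLs the loop above asks for. It uses the standard library's `urlencode` instead of `requests`, so it runs without network access; the helper name `search_page_url` is made up for this illustration, and the exact URL that `requests` sends may differ slightly in encoding.

```python
from urllib.parse import urlencode

url = "https://abcnotation.com/searchTunes"
site = "www.lesession.co.uk"

def search_page_url(offset):
    # Hypothetical helper: builds the query string for one search page
    params = {'q': f'site:{site}', 'f': 't', 's': offset}
    return f"{url}?{urlencode(params)}"

# One offset per page of 50 results: 0, 50, ..., 1250
offsets = list(range(0, 1300, 50))
print(len(offsets))
print(search_page_url(offsets[1]))
```

The step of 50 matches the number of songs per search page, which is why 1300 songs take 26 requests.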

### Getting all song pages
Now, we go through our pages and find the song links in them: we look for `href`s starting with `/tunePage` (indicating they are tunes). Then we save the links themselves, without the dataset prefix (the site name).

In [None]:
songs = []
for soup in pages:
  hrefs = [x['href'] for x in soup('a', href=True)]  # skip anchors that have no href
  page_songs = [href.split('/', 2)[-1] for href in hrefs if href.startswith('/tunePage')]
  songs.extend(page_songs)
print(len(songs))
songs[:10]

1272


['music/lgsdmweb/0094',
 'music/lgsdmweb/0095',
 'music/lgsdmweb/0010',
 'music/woodenflute/0003',
 'music/LEMcCulloughTunes201404/0084',
 'music/posts/0001',
 'music/posts/0002',
 'music/lgsdmweb/0096',
 'music/LEMcCulloughTunes201404/0000',
 'music/LEMcCulloughTunes201404/0115']

For every `song` in our `songs` list, we can find its page at `/tunePage?a=[DATASET]/[SONG]`.
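Here is what the `split` does on a single link. The `href` below is an illustrative value built from the pattern above, not one copied from the site, so treat it as an assumption about the link format.

```python
site = "www.lesession.co.uk"

# Illustrative link in the /tunePage?a=[DATASET]/[SONG] form
href = "/tunePage?a=www.lesession.co.uk/music/lgsdmweb/0094"

# Splitting at most twice on '/' leaves the song path as the last piece
song = href.split('/', 2)[-1]
print(song)

# The tune page address can be rebuilt from the site and the song path
tune_page = f"https://abcnotation.com/tunePage?a={site}/{song}"
print(tune_page)
```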

Let's now convert our songs to a set and then back to a list to remove any duplicates.

In [None]:
songs = [*set(songs)]
len(songs)

1272

As we can see, there were no duplicates in the dataset.
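Note that a set does not keep the original order of the songs. If the order matters, one possible alternative is `dict.fromkeys`, which also drops duplicates but keeps the first occurrence of each item in place. The list below is made-up sample data.

```python
# Made-up sample with one repeated entry
songs_with_dupes = ['music/posts/0001', 'music/posts/0002', 'music/posts/0001']

# dict keys are unique and keep insertion order
unique_songs = list(dict.fromkeys(songs_with_dupes))
print(unique_songs)
```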

### Defining preprocessing
Some of the songs have junk lines in them, e.g. the composer, a URL, and even lyrics. Each field's meaning can be found in the [ABC Notation Standard](https://abcnotation.com/wiki/abc:standard:v2.1#information_field_definition).
We will remove the unnecessary fields. The title is kept, because we will later use it as each song's file name.

In [None]:
def preprocess_song(song):
  """Return the song's ABC text without the lines that start with an unwanted header."""
  lines_processed = []
  to_remove = False
  headers_to_remove = ('A:', 'B:', 'C:', 'D:', 'F:', 'H:', 'I:', 'N:', 'O:', 'R:', 'S:', 'W:', 'w:', 'Z:')
  for line in song.split('\n'):
    # To also remove the lines that follow a junk header (until the next line whose
    # second character is ':'), uncomment the `if` below and indent `to_remove = False` under it.
    # if len(line) > 1 and line[1] == ':':
    to_remove = False
    if line.startswith(headers_to_remove):
      to_remove = True
    if not to_remove:
      lines_processed.append(line)
  return '\n'.join(lines_processed)
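To see the effect, we can run the same filtering on a tiny tune. This sketch restates the function compactly so it runs on its own, and the sample tune (title, composer, notes, lyric) is invented for the illustration.

```python
def preprocess_song(song):
    # Same header list as above, written as a single filter
    headers_to_remove = ('A:', 'B:', 'C:', 'D:', 'F:', 'H:', 'I:', 'N:', 'O:',
                         'R:', 'S:', 'W:', 'w:', 'Z:')
    kept = [line for line in song.split('\n')
            if not line.startswith(headers_to_remove)]
    return '\n'.join(kept)

# Invented sample tune, not taken from the dataset
sample = '\n'.join([
    'X:1',
    'T:Example Tune',
    'C:Some Composer',
    'R:reel',
    'M:4/4',
    'L:1/8',
    'K:D',
    '|:d2fd A2FA|d2fd e2ce:|',
    'W:made-up lyric line',
])

cleaned = preprocess_song(sample)
print(cleaned)
```

The composer (`C:`), rhythm (`R:`) and lyric (`W:`) lines are dropped, while the index, title, meter, note length, key and the music itself are kept.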

### Downloading every song
Now we will get every song and save it as an `.abc` file. We are going to make many requests, so we have to take some steps to avoid getting blocked by the website.

We will send a browser-like `User-Agent` header with every request to lower the chances of getting blocked.

In [None]:
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:20.0) Gecko/20100101 Firefox/20.0'}

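sess = requests.Session()
adapter = requests.adapters.HTTPAdapter(max_retries=20)
# The download URLs use HTTPS, so the retrying adapter is mounted for both schemes
sess.mount('http://', adapter)
sess.mount('https://', adapter)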

Now let's actually download the songs!

The following steps are done *for every tune*:

1. Request the tune's ABC text, sending the headers with the request.
1. If the request failed (i.e. we got blocked), wait a few seconds and try again until it succeeds.
1. Preprocess the output to get the cleaned song.
1. Extract the title using a regex.
1. Save the song to `data/`, using its title as the file name.
1. Wait 2-4 seconds (to avoid getting blocked).
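The retry step in the list above keeps trying forever. If you would rather give up after a while, a bounded version could look like the sketch below. The names `fetch_with_retries`, `max_attempts` and `delay` are hypothetical, and the fake `fetch` stands in for a real request so the example runs offline and without waiting.

```python
import time

def fetch_with_retries(fetch, max_attempts=5, delay=3, sleep=time.sleep):
    """Call fetch() until it returns status 200, at most max_attempts times."""
    for attempt in range(1, max_attempts + 1):
        status, body = fetch()
        if status == 200:
            return body
        if attempt < max_attempts:
            sleep(delay)  # wait before the next attempt
    raise RuntimeError(f"giving up after {max_attempts} attempts")

# Stand-in for a real request: fails twice, then succeeds
responses = iter([(503, ''), (503, ''), (200, 'X:1\nT:Example Tune')])
waits = []  # records the requested delays instead of actually sleeping
body = fetch_with_retries(lambda: next(responses), sleep=waits.append)
print(body)
print(waits)
```

Passing `sleep` as a parameter is only there to keep the example fast; in the notebook the default `time.sleep` would be used.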

In [None]:
!mkdir -p data
for song in tqdm(songs):
  url = f"https://abcnotation.com/getResource/downloads/text_/_.abc?a={site}/{song}"
  r = sess.get(url, headers=headers)
  while r.status_code != 200:
    sleep(3)
    r = sess.get(url, headers=headers)
  output = preprocess_song(r.text)
  # Take the first T: field as the title; '/' is replaced so it is not read as a path separator
  title = re.search('\nT:(.*)\n', output).group(1).replace('/', ' ')
  with open(f'data/{title}.abc', 'w') as f:
    f.write(output.strip())
  sleep(random.randint(2, 4))

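The title regex in the loop assumes every song has a `T:` line that is preceded and followed by a newline; if one does not, `re.search` returns `None` and the cell stops with an `AttributeError`. Two songs with the same title would also overwrite each other. A slightly more forgiving variant, sketched with a hypothetical helper `title_to_filename` and a made-up fallback name, might be:

```python
import re

def title_to_filename(abc_text, fallback="untitled"):
    # Match the first line that starts with T:, wherever it sits in the text
    match = re.search(r'^T:(.*)$', abc_text, flags=re.MULTILINE)
    title = (match.group(1).strip() if match else '') or fallback
    # '/' would be read as a path separator, so replace it as in the loop above
    return title.replace('/', ' ') + '.abc'

print(title_to_filename("X:1\nT:Example Tune\nK:G"))
print(title_to_filename("X:2\nT:Jigs/Reels\nK:D"))
print(title_to_filename("X:3\nK:D"))
```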

### Compress the folder to a zip file
Let's save the directory to a zip file so we can later use it in the model itself.

In [None]:
!zip -qr lesession.zip data/
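If the `zip` command is not available (e.g. when running outside Colab), the same archive can probably be produced from Python with the standard library. This is a sketch with a hypothetical helper `zip_folder`; the default names mirror the command above.

```python
import shutil

def zip_folder(folder="data", archive_name="lesession"):
    # Creates <archive_name>.zip in the current directory, with the files
    # stored under the folder's own name (data/...), and returns its path
    return shutil.make_archive(archive_name, "zip", root_dir=".", base_dir=folder)
```

Calling `zip_folder()` from the notebook's working directory would then write `lesession.zip` next to the `data/` folder.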