## <center> ENSF 519.01 Programming Fundamentals for Data Engineers </center>
<h2> <center> Assignment 7: Web data mining (100 marks)</center></h2>
<h3> <center> Due: by  Nov 9th, 04.59 pm. To be submitted on D2L. </center></h3>
<!--h3> <center> <font color='red'>IMPORTANT NOTE: each task must be implemented exactly as asked, even if there are other easier or better solutions.</font> </center></h3-->




Edit this file and write your solutions to the problems in the sections marked with `# Your solution goes here`. Test your code, and when you are done, download this notebook as an `.ipynb` file and submit it to D2L. To get this file in Jupyter Notebook, go to File -> Download as -> Notebook (.ipynb).

## Part 1: Billboard top charts

Complete the function below to get the best-selling albums of a given week from the [Billboard 200](https://www.billboard.com/charts/billboard-200) page. Create a pandas DataFrame to store the information. Your DataFrame will have (at least) three columns:

* title (the album name)
* artist
* rank

Feel free to extract more information (such as the number of weeks the album has been on the chart, or a URL to the album cover art) if it is available and you are interested.
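
The extraction step can be tried offline before hitting the live site. The following is a minimal sketch, assuming markup shaped like the `chart-list-item` blocks that the solution cell queries; `SAMPLE_HTML` and `parse_chart_items` are hypothetical names introduced here for illustration, and the real page layout may differ.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Hypothetical markup: only the class names come from the solution cell;
# the real Billboard page may be structured differently.
SAMPLE_HTML = """
<div class="chart-list-item">
  <div class="chart-list-item__rank">2</div>
  <div class="chart-list-item__title">beerbongs &amp; bentleys</div>
  <div class="chart-list-item__artist">Post Malone</div>
</div>
<div class="chart-list-item">
  <div class="chart-list-item__rank">3</div>
  <div class="chart-list-item__title">Harder Than Ever</div>
  <div class="chart-list-item__artist">Lil Baby</div>
</div>
"""


def parse_chart_items(html: str) -> pd.DataFrame:
    """Turn each chart item block into one row with rank, title and artist."""
    soup = BeautifulSoup(html, 'html.parser')
    rows = []
    for item in soup.find_all('div', class_='chart-list-item'):
        rows.append({
            'rank': int(item.find('div', class_='chart-list-item__rank').get_text(strip=True)),
            'title': item.find('div', class_='chart-list-item__title').get_text(strip=True),
            'artist': item.find('div', class_='chart-list-item__artist').get_text(strip=True),
        })
    # Build the DataFrame once from the collected rows.
    return pd.DataFrame(rows, columns=['rank', 'title', 'artist'])


chart = parse_chart_items(SAMPLE_HTML)
print(chart)
```

Collecting plain dictionaries and constructing the DataFrame at the end keeps the parsing logic independent of pandas and avoids growing a DataFrame row by row.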


In [4]:
from bs4 import BeautifulSoup
import requests
import numpy as np
import pandas as pd

from typing import Iterable

def get_billboard_top_albums_dataframe(date: str='2001-06-02', count: int=5) -> pd.DataFrame:
    # Fetch the Billboard 200 chart page for the requested week (date as YYYY-MM-DD).
    response = requests.get('https://www.billboard.com/charts/billboard-200/' + date)
    parser = BeautifulSoup(response.text, 'html.parser')

    def text_of(parent, class_name):
        # Text of the first matching <div>, without newlines or surrounding spaces.
        return parent.find('div', class_=class_name).get_text().replace('\n', '').strip()

    # Collect rows in a plain list and build the DataFrame once at the end;
    # DataFrame.append was removed in pandas 2.0.
    rows = []

    # The number-one album sits in its own block, separate from the other chart items.
    first_hit = parser.find('div', class_='chart-number-one__details')
    if first_hit is not None:
        rows.append({'Rank': '1',
                     'Title': text_of(first_hit, 'chart-number-one__title'),
                     'Artist': text_of(first_hit, 'chart-number-one__artist')})

    # The remaining albums are rendered as 'chart-list-item' blocks.
    for hit in parser.find_all('div', class_='chart-list-item'):
        rows.append({'Rank': text_of(hit, 'chart-list-item__rank'),
                     'Title': text_of(hit, 'chart-list-item__title'),
                     'Artist': text_of(hit, 'chart-list-item__artist')})

    return pd.DataFrame(rows, columns=['Rank', 'Title', 'Artist']).iloc[:count]

top_5_albums = get_billboard_top_albums_dataframe(count=5, date='2018-06-02')
top_5_albums

|   | Rank | Title | Artist |
|---|------|-------|--------|
| 0 | 1 | Love Yourself: Tear | BTS |
| 1 | 2 | beerbongs & bentleys | Post Malone |
| 2 | 3 | Harder Than Ever | Lil Baby |
| 3 | 4 | And Justice For None | Five Finger Death Punch |
| 4 | 5 | Invasion Of Privacy | Cardi B |


## MusicBrainz API
The code below connects to the MusicBrainz music database API and searches for information about a given album.

In [5]:
import itertools
import unicodedata
import re

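def _remove_accents(input_str):
    # Decompose accented characters (NFKD) and drop the combining marks.
    return ''.join(c for c in unicodedata.normalize('NFKD', input_str) if not unicodedata.combining(c))


def _clean_string(s):
    # Strip accents and punctuation, lowercase, and return the set of words.
    return set(re.sub(r'[^\w\s]', '', _remove_accents(s)).lower().split())


def _jaccard(set1, set2):
    # Jaccard similarity: size of the intersection divided by size of the union.
    # Two empty sets are scored 0.0 to avoid dividing by zero.
    union = set1 | set2
    return float(len(set1 & set2)) / float(len(union)) if union else 0.0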


def search(entity_type:str, query: str):
    return requests.get(
        'http://musicbrainz.org/ws/2/{entity}/'.format(entity=entity_type),
        params={
            'fmt': 'json',
            'query': query
        }
    ).json()


def get_release_url(artist: str, title: str):
    type_ = 'release'
    search_results = search(type_, '%s AND artist:%s' % (title, artist))
    artist = _clean_string(artist)
    title = _clean_string(title)

    #print("title = " + str(title) +' artist=' + str(artist))
    for item in search_results.get(type_+'s', []):
        names = list()
        for artists in item['artist-credit']:
            if 'artist' in artists:
                names.append(_clean_string(artists['artist']['name']))
                for alias in artists['artist'].get('aliases', {}):
                    names.append(_clean_string(alias.get('name', '')))
        #print('  title=' + str(_clean_string(item['title'])) + ' names=' + ', '.join(itertools.chain(*names)))

        if _jaccard(_clean_string(item['title']), title) > 0.5 and \
            (any(_jaccard(artist, name) > 0.3 for name in names) or len(names) == 0):
            return 'http://musicbrainz.org/ws/2/{type}/{id}/{inc}'.format(id=item['id'], type=type_, \
            inc='?inc=artist-credits+labels+discids+recordings&fmt=json')
    
    return None


Using the above function, find the track count and disc count of each album (if available) and store them in the pandas DataFrame. Read the implementation of the `search` function above and see how the `fmt` parameter is passed to the web service to make it return the results in `json` format.
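
To make the role of `fmt` concrete, here is a small sketch of the URL that `search` ends up requesting. It uses the standard library's `urlencode` in place of `requests`, on the assumption that `requests` encodes the `params` dictionary into the query string in the same way; `build_search_url` is a hypothetical helper, not part of the assignment code.

```python
from urllib.parse import urlencode


def build_search_url(entity_type: str, query: str) -> str:
    """Hypothetical helper: assemble the URL that search() asks requests to fetch."""
    base = 'http://musicbrainz.org/ws/2/{entity}/'.format(entity=entity_type)
    # Same parameters as in search(): 'fmt' selects the response format,
    # 'query' carries the search expression.
    params = {'fmt': 'json', 'query': query}
    return base + '?' + urlencode(params)


url = build_search_url('release', 'Invasion Of Privacy AND artist:Cardi B')
print(url)
```

Each key in the dictionary becomes one `key=value` pair in the query string, so any further parameter can be passed the same way.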

Also note that if you open the URL provided by the `get_release_url` function above, the response does not include the information you need. Like `fmt`, there is an `inc` parameter that determines which extra data is included in the response. Consult the [API docs](https://musicbrainz.org/doc/Development/JSON_Web_Service) to figure out the value of the `inc` parameter that makes the service include the track count and disc count of the album (release).
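
Once the response does include the media information, the two counts can be derived from it. The sketch below assumes a response in which each entry of `media` is one disc carrying a `track-count` field, which are the fields the solution cell reads; `SAMPLE_RESPONSE` is made-up illustrative data, not real API output, and `count_discs_and_tracks` is a hypothetical helper.

```python
import json

# Illustrative data only: a made-up two-disc release, not a real API response.
SAMPLE_RESPONSE = '''
{
  "title": "Example Album",
  "media": [
    {"track-count": 12},
    {"track-count": 9}
  ]
}
'''


def count_discs_and_tracks(release: dict) -> tuple:
    """Return (disc_count, track_count) for a release dictionary."""
    # A missing 'media' key is treated as zero discs.
    media = release.get('media', [])
    disc_count = len(media)
    track_count = sum(medium.get('track-count', 0) for medium in media)
    return disc_count, track_count


release = json.loads(SAMPLE_RESPONSE)
disc_count, track_count = count_discs_and_tracks(release)
print(disc_count, track_count)
```

Using `.get` with a default keeps the helper from raising when a field is absent, which matters because the assignment asks for the counts only "if available".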

There is a lot of other information freely available in the MusicBrainz database. You can browse the [API docs](https://musicbrainz.org/doc/Development/JSON_Web_Service) and use them to extract more information. Feel free to do that.
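
When interpreting missing values, it helps to know that `get_release_url` returns `None` whenever no search result passes its similarity thresholds (title above 0.5, artist above 0.3). The sketch below re-implements the matching helpers as standalone functions to show how the score is computed; the function names are local to this example, and the candidate title `'Beerbongs & Bentleys (Deluxe)'` is a hypothetical search result.

```python
import re
import unicodedata


def remove_accents(text: str) -> str:
    # Decompose accented characters and drop the combining marks.
    return ''.join(c for c in unicodedata.normalize('NFKD', text)
                   if not unicodedata.combining(c))


def clean_string(text: str) -> set:
    # Remove accents and punctuation, lowercase, and split into a set of words.
    return set(re.sub(r'[^\w\s]', '', remove_accents(text)).lower().split())


def jaccard(a: set, b: set) -> float:
    # Shared words divided by all distinct words; 0.0 if both sets are empty.
    union = a | b
    return len(a & b) / len(union) if union else 0.0


chart_title = clean_string('beerbongs & bentleys')
candidate = clean_string('Beerbongs & Bentleys (Deluxe)')  # hypothetical result
score = jaccard(chart_title, candidate)
print(chart_title, candidate, score)
```

Here the two titles share two of three distinct words, so the score is 2/3 and the title check would pass. Differences in capitalization or accents alone do not lower the score, because both are removed before the comparison.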

In [6]:
# Your solution goes here
rows = []
for title, artist in zip(top_5_albums['Title'], top_5_albums['Artist']):
    print((title, artist))
    release_url = get_release_url(str(artist), str(title))
    if release_url is None:
        # No matching release found: keep the chart's names and leave the counts empty.
        rows.append({'Album': str(title), 'Artist': str(artist),
                     'Track Count': np.nan, 'Disc Count': np.nan})
        continue

    release = requests.get(release_url).json()
    # Each entry in 'media' is one disc; 'track-count' is the number of tracks on it.
    media = release['media']
    rows.append({'Album': release['title'],
                 'Artist': release['artist-credit'][0]['name'],
                 'Track Count': sum(medium['track-count'] for medium in media),
                 'Disc Count': len(media)})

# Build the DataFrame once; DataFrame.append was removed in pandas 2.0.
DFTracks = pd.DataFrame(rows, columns=['Album', 'Artist', 'Track Count', 'Disc Count'])
DFTracks



('Love Yourself: Tear', 'BTS')
('beerbongs & bentleys', 'Post Malone')
('Harder Than Ever', 'Lil Baby')
('And Justice For None', 'Five Finger Death Punch')
('Invasion Of Privacy', 'Cardi B')


|   | Album | Artist | Track Count | Disc Count |
|---|-------|--------|-------------|------------|
| 0 | Love Yourself: Tear | BTS | NaN | NaN |
| 1 | beerbongs & bentleys | Post Malone | 18.0 | 1.0 |
| 2 | Harder Than Ever | Lil Baby | 17.0 | 1.0 |
| 3 | And Justice for None | Five Finger Death Punch | 16.0 | 1.0 |
| 4 | Invasion of Privacy | Cardi B | 13.0 | 1.0 |
