# Wikipedia Scraper

Before retrieving artist information from the Spotify Web API, it was necessary to define an initial list of artists to work on.
To do so, we wanted a source that provided artists from different genres, and we did not want to restrict ourselves to the most popular artists.
Hence, Billboard and other popularity-based sources were discarded and Wikipedia was chosen instead.

Wikipedia's [Lists of musicians](https://en.wikipedia.org/wiki/Lists_of_musicians) page provides a set of artists classified by genre.
We worked with 53 of the genres linked from this page; each link leads to a genre page where the artists of that genre are listed.
This large number of genres allowed us to obtain a diverse set of artists, so that the answer to our question, "How do artists collaborate?",
is not restricted to a certain kind of artist or to our own preferences.

The objective of this notebook is to use the Wikipedia API to obtain a list of artists in the format
needed to later query the Spotify API:

```
artists = [
    {"name": "Justin Bieber", "id": ""},
    {"name": "Kanye West", "id": ""},
    {"name": "Ariana Grande", "id": ""}
]
```
To achieve this, the steps followed were:

1. Scrape the Lists of musicians page to obtain a set of different genres
2. Define scenarios based on the different format patterns of the genre pages
3. Scrape the genre pages based on the scenarios to obtain the list of artists

The steps taken are detailed below:


## 1. Scraping the List of Musicians

In [None]:
import urllib.request
import json
import pandas as pd
import re

In [None]:
# Create function to build the Wikipedia API query URL for a page title
def getapi(name):
    baseurl = "https://en.wikipedia.org/w/api.php?"
    action = "action=query"
    title = "titles="+name
    content = "prop=revisions&rvprop=content"
    dataformat ="format=json"
    query = "{}{}&{}&{}&{}".format(baseurl, action, content, title, dataformat)
    return(query)
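
The query string above is assembled by hand, so callers have to escape characters such as `&` themselves. As a minimal sketch of an alternative, assuming the same endpoint and parameters as `getapi`, the standard library's `urllib.parse.urlencode` can take care of the escaping. The name `build_query_url` is hypothetical and not part of the notebook.

```python
from urllib.parse import urlencode, quote

BASE_URL = "https://en.wikipedia.org/w/api.php?"


def build_query_url(title):
    """Build the same revisions query as getapi, letting urlencode escape the title."""
    params = {
        "action": "query",
        "prop": "revisions",
        "rvprop": "content",
        "titles": title,
        "format": "json",
    }
    # quote_via=quote percent-encodes spaces as %20 instead of "+"
    return BASE_URL + urlencode(params, quote_via=quote)


print(build_query_url("Lists_of_musicians"))
print(build_query_url("List of R&B musicians"))
```

Because this version escapes the title itself, it should be given the raw page title rather than one that already contains `%26`.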

In [None]:
# Create function to get the contents from query
def getcontents(query):
    # Read in JSON
    wikiresponse = urllib.request.urlopen(query)
    wikijson = json.loads(wikiresponse.read())

    # Get content
    key1 = 'query'
    key2 = 'pages'
    for key in wikijson[key1][key2]: #To get the pages number key
        key3 = f'{key}'
    
    # Missing or invalid pages come back without a 'revisions' key
    key4 = 'revisions'
    if key4 not in wikijson[key1][key2][key3]:
        # Report the page title, then fail so the caller can record the error
        print("Error with key4", wikijson[key1][key2][key3].get('title'))
        raise KeyError(key4)
    
    content_main = wikijson[key1][key2][key3][key4][0]["*"]
    return(content_main)
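
The nested lookup in `getcontents` can be tried without a network call. The sketch below assumes a response shaped like the one the function reads (`query` → `pages` → page id → `revisions` → `"*"`); the two payloads are hypothetical and trimmed to the keys used here, and `extract_wikitext` is an illustrative name.

```python
def extract_wikitext(response):
    """Return the wikitext of the single page in a parsed response, or None."""
    pages = response["query"]["pages"]
    # The page id is not known in advance, so take the only entry
    page = next(iter(pages.values()))
    revisions = page.get("revisions")
    if not revisions:
        return None
    return revisions[0]["*"]


# Hypothetical payloads: one existing page and one missing page
found = {"query": {"pages": {"123": {
    "title": "Example list",
    "revisions": [{"*": "==A==\n* [[ABBA]]"}],
}}}}
missing = {"query": {"pages": {"-1": {"title": "No such page", "missing": ""}}}}

print(extract_wikitext(found))
print(extract_wikitext(missing))
```

Returning `None` for a missing page is a design choice of this sketch; `getcontents` fails instead, and `getartist` records the genre as an error.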

In [None]:
# Get API link for Lists of Musicians
query = getapi("Lists_of_musicians")

# Get content
content_main = getcontents(query)

# Defining where the content needed is located
start_main =content_main.find("==Genre==")
end_main = content_main.find("==Instrument==")

print(query)

https://en.wikipedia.org/w/api.php?action=query&prop=revisions&rvprop=content&titles=Lists_of_musicians&format=json


In [None]:
# Create function to retrieve all link targets from the wikitext (the genres)
def getlinks(file):
    l=[]
    match1 = re.findall(r"\[\[([^\]\[:]+)\]\]", file)
    match2 =  re.findall(r"\[\[([^\]\[:]+)\|([^\]\[:]+)\]\]", file)
    for m1 in match1:
        if "|" not in m1:
            l.append(m1)
        else:
            pass
    for m2 in match2:
        l.append(m2[0]) 
    return l

In [None]:
# Create function to retrieve all display names from the links (these will be the artist names)
def getnames(file):
    l=[]
    match1 = re.findall(r"\[\[([^\]\[:]+)\]\]", file)
    match2 =  re.findall(r"\[\[([^\]\[:]+)\|([^\]\[:]+)\]\]", file)
    for m1 in match1:
        if "|" not in m1:
            l.append(m1)
        else:
            pass
    for m2 in match2:
        l.append(m2[1]) 
    return l
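
To make the difference between `getlinks` and `getnames` concrete, here is a small sketch on a hypothetical wikitext fragment. It uses a single pattern with an optional label group instead of the two patterns above, so unlike the original functions it returns the links in document order. `parse_links` and `LINK_PATTERN` are illustrative names.

```python
import re

# Group 1 is the link target; the optional group 2 is the label after "|".
# Targets containing ":" (for example "Category:...") are not matched.
LINK_PATTERN = re.compile(r"\[\[([^\]\[:|]+)(?:\|([^\]\[:]+))?\]\]")


def parse_links(wikitext):
    """Return (target, label) pairs in document order; label defaults to target."""
    return [(target, label or target)
            for target, label in LINK_PATTERN.findall(wikitext)]


sample = (
    "* [[Radiohead]]\n"
    "* [[Prince (musician)|Prince]]\n"
    "[[Category:Lists of musicians]]"
)
pairs = parse_links(sample)
targets = [target for target, _ in pairs]  # what getlinks collects
names = [label for _, label in pairs]      # what getnames collects
print(targets)
print(names)
```

The target is what the API needs to fetch a page, while the label is the name shown to readers, which is why the genres use the former and the artists the latter.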

In [None]:
# Evaluating the links in List of Musicians
GENRES = getlinks(content_main[start_main:end_main])
print(len(GENRES[1:])) # There are too many GENRES


# Selecting a sub-set of Genres:
with open("../../data/wiki/SubListofGenres") as f:
  content_file = f.read()

content_file = content_file.replace('\n', "").replace("'","")
SET_GENRES = content_file.split(",")
print(len(SET_GENRES))

211
53


After scraping the List of Musicians page we obtained a list of 211 different genres,
ranging from very general to very specific ones. After running the whole exercise we ended up with a list of
36K artists, which was far too many. Working with that many artists would cause problems when obtaining the data from the Spotify Web API
and would furthermore add complexity when analysing the network. Therefore, it was decided to reduce the number of genres to scrape from 211 to 53.
The subset was defined by selecting the most general genres from the list. For example, lists such as
*"List of Christian dance, electronic, and techno artists"* or *"List of bhangra artists"* were removed, while *"List of hard-rock musicians"* and *"List of hip hop groups"* were kept.


## 2. Identifying Scenarios for the Genres pages

After getting the set of genres, the next step was to obtain the list of artists from each genre page.
However, since each page is formatted differently, different scenarios were defined.

In [None]:
# Create function to choose the scenario of each genre page
def getscenario(CONDITIONS):
    # str.find returns -1 when a marker is absent, so keep only the positions found
    conditions = [cond for cond in CONDITIONS if cond != -1]
    # The earliest marker in the page wins; -1 means no marker was found
    if len(conditions)>0:
        return (min(conditions))
    else:
        return (-1)

# Create function to get the artists from each genre page 
def getartist(G_LIST):
    ALL_ARTISTS = []
    G_MISSING = []
    G_ERRORS = []

    for genre in G_LIST:
        print(genre)
        try:
            # Cleaning the Genre name
            clean_genre = genre.replace(" ", "_").replace("&", "%26")
            
            #Extracting the content from the Genre Page
            query = getapi(clean_genre)
            content_genre = getcontents(query)

            #Defining different pages format options as scenarios:
            #Scenarios for the start of the artists list
            by_alphanum = content_genre.find("==0-9==")
            by_alphanum_ = content_genre.find("== 0-9 ==")
            by_alphanum_sub = content_genre.find("===0-9===")
            by_alpha = content_genre.find("==A==")
            by_alpha_ = content_genre.find("== A ==")
            by_artist = content_genre.find("==Artists==")
            by_artist_ = content_genre.find("== Artists ==")
            by_individuals = content_genre.find("==Individuals==")
            by_individuals_ = content_genre.find("== Individuals ==")
            by_list = content_genre.find("== List ==")

            #Scenarios for the end of the artists list            
            to_seealso = content_genre.find("==See also==")
            to_seealso_ = content_genre.find("== See also ==")
            to_seealso1 = content_genre.find("See also==")
            to_ref = content_genre.find("==References==")
            to_ref_ = content_genre.find("== References ==")
            to_notes = content_genre.find("==Notes==")
            to_notes_ = content_genre.find("== Notes ==")

            CONDITIONS_BY = [by_alphanum, by_alphanum_, by_alpha, by_alpha_, by_artist, by_artist_,
             by_individuals, by_individuals_, by_list]
            CONDITIONS_TO = [to_seealso, to_seealso_, to_seealso1, to_ref, to_ref_, to_notes, to_notes_]
            
           
            #Identifying which is the scenario to evaluate
            by_scenario = getscenario(CONDITIONS_BY)
            to_scenario = getscenario(CONDITIONS_TO)

            # Extracting links only on the artists list position (based on the scenarios)
            if by_scenario > 0 and to_scenario > 0: 
                G_ARTISTS = getnames(content_genre[by_scenario:to_scenario])
                ALL_ARTISTS.append(G_ARTISTS)
            
            elif by_scenario < 0 and to_scenario > 0:
                G_ARTISTS = getnames(content_genre[0:to_scenario])
                ALL_ARTISTS.append(G_ARTISTS)
            
            elif by_scenario > 0 and to_scenario < 0:
                G_ARTISTS = getnames(content_genre[by_scenario:])
                ALL_ARTISTS.append(G_ARTISTS)
            
            # Extracting all links only for the cases there is not a scenario defined
            else:
                G_ARTISTS = getnames(content_genre)
                ALL_ARTISTS.append(G_ARTISTS)
                
                G_MISSING.append(genre)
              
        except Exception:  # skip this genre, but still allow the loop to be interrupted
            print("ERROR IN:", genre)
            G_ERRORS.append(genre)

    return(ALL_ARTISTS, G_MISSING, G_ERRORS)
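
The scenario logic can be illustrated on a small, hypothetical page. The sketch below is a simplified version of the slicing in `getartist`: it looks for the earliest start marker and the earliest end marker, and falls back to the beginning or the end of the text when one of them is absent. The marker lists are copied from the function above; `first_position` and `slice_artist_section` are illustrative names.

```python
START_MARKERS = ["==0-9==", "== 0-9 ==", "==A==", "== A ==", "==Artists==",
                 "== Artists ==", "==Individuals==", "== Individuals ==", "== List =="]
END_MARKERS = ["==See also==", "== See also ==", "==References==",
               "== References ==", "==Notes==", "== Notes =="]


def first_position(text, markers):
    """Return the earliest position of any marker in text, or -1 if none occurs."""
    found = [pos for pos in (text.find(m) for m in markers) if pos != -1]
    return min(found) if found else -1


def slice_artist_section(text):
    """Keep only the part of the page between the start and end markers."""
    start = first_position(text, START_MARKERS)
    end = first_position(text, END_MARKERS)
    begin = start if start != -1 else 0
    stop = end if end != -1 else len(text)
    return text[begin:stop]


# Hypothetical genre page
sample = (
    "Intro mentioning [[Music genre]].\n"
    "==A==\n* [[ABBA]]\n"
    "==B==\n* [[Blur (band)|Blur]]\n"
    "==See also==\n* [[Lists of musicians]]\n"
    "==References==\n"
)
section = slice_artist_section(sample)
print(section)
```

Links in the introduction and in the "See also" section are dropped, which is the point of the scenarios: only the links inside the artist list should end up in the output.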


## 3. Scraping the Genre pages based on the scenarios

Once the scenarios were defined, it was possible to scrape the genre pages to obtain the list of artists:

In [None]:
ALL_ARTISTS,ALL_MISSED,ALL_ERRORS = getartist(GENRES[1:]);
print("ARTISTS/GENRES:",len(ALL_ARTISTS), "ALL LINKS:", len(ALL_MISSED), "ERROR:",len(ALL_ERRORS))

List of 1970s Christian pop artists
List of acid rock artists
List of adult alternative artists
List of alternative country musicians
List of alternative hip hop artists
List of alternative metal artists
List of alternative rock artists
List of ambient music artists
List of anarcho-punk bands
List of Arabic pop musicians
List of baroque pop artists
List of bebop musicians
List of bhangra artists
List of big band musicians
List of black metal bands
List of blue-eyed soul artists
List of bluegrass musicians
List of blues musicians
List of blues rock musicians
List of boogie woogie musicians
List of British blues musicians
List of British music hall musicians
List of Britpop musicians
List of C-pop artists
List of calypso musicians
List of Carnatic instrumentalists
List of Celtic musicians
List of chamber jazz musicians
List of Chicago blues musicians
List of Christian country artists
List of Christian dance, electronic, and techno artists
List of Christian hardcore bands
List of Christia

In [None]:
SET_ARTISTS,SET_MISSED,SET_ERRORS = getartist(SET_GENRES)
print("ARTISTS/GENRES:",len(SET_ARTISTS), "WITH ALL LINKS:", len(SET_MISSED), "ERROR:",len(SET_ERRORS))

List of 1970s Christian pop artists
List of adult alternative artists
List of alternative country musicians
List of alternative hip hop artists
List of alternative rock artists
List of blues musicians
List of Christian country artists
List of country music performers
List of country rock musicians
List of dance-pop artists
List of dance-punk artists
List of dubstep musicians
List of electro house artists
List of emo artists
List of experimental musicians
List of folk musicians
List of funk musicians
List of funk metal and funk rock bands
List of gangsta rap artists
List of garage rock bands
List of gospel musicians
List of hard rock musicians (A–M)
ERROR IN: List of hard rock musicians (A–M)
List of hard rock musicians (N–Z)
ERROR IN: List of hard rock musicians (N–Z)
List of heavy metal bands
List of hip hop groups
List of hip hop musicians
List of house music artists
List of indie pop artists
List of indie rock musicians
List of jazz musicians
Category:K-pop music groups
List of Lati

In [None]:
T_ARTISTS,T_MISSED,T_ERRORS = getartist(["List_of_R%26B_musicians", 'List of hip hop groups',
'List of hip hop musicians',
'List of indie pop artists',
'List of indie rock musicians'])
len([item for sublist in T_ARTISTS for item in sublist])

List_of_R%26B_musicians
List of hip hop groups
List of hip hop musicians
List of indie pop artists
List of indie rock musicians


5034

In [None]:
POP = ["List_of_dance-pop_artists","List of Latin pop artists","List_of_synth-pop_artists"] 
INDIE = ['List of indie pop artists', 'List of indie rock musicians']
HIPHOP = ["List_of_R%26B_musicians", 'List of hip hop groups',  'List of hip hop musicians']
ROCK =  ['List of alternative rock artists', 'List of progressive rock artists', 'List of rock music performers', 'List of psychedelic rock artists', 'List of garage rock bands']

merged = [ROCK,POP,HIPHOP]
flat_merge = [item for sublist in merged for item in sublist]

T_ARTISTS,T_MISSED,T_ERRORS = getartist(flat_merge)
print(len([item for sublist in T_ARTISTS for item in sublist]))
# getoutput is defined in a later cell of this notebook and must be run first
RPH_OUTPUT, d = getoutput(T_ARTISTS)
print(len(RPH_OUTPUT))


List of alternative rock artists
List of progressive rock artists
List of rock music performers
List of psychedelic rock artists
List of garage rock bands
List_of_dance-pop_artists
List of Latin pop artists
List_of_synth-pop_artists
List_of_R%26B_musicians
List of hip hop groups
List of hip hop musicians
7906
6988


In [None]:
# TESTS FOR ERRORS: 

genre = "List of hard rock musicians (A–M)"

clean_genre = genre.replace(" ", "_").replace("&", "%26")
query = getapi(clean_genre)
print(query)
content_genre = getcontents(query)

#Defining the different pages format options as scenarios:
by_alphanum = content_genre.find("==0-9==")
by_alphanum_ = content_genre.find("== 0-9 ==")
by_alpha = content_genre.find("==A==")
by_alpha_ = content_genre.find("== A ==")
by_artist = content_genre.find("==Artists==")
by_artist_ = content_genre.find("== Artists ==")
by_individuals = content_genre.find("==Individuals==")
by_individuals_ = content_genre.find("== Individuals ==")
by_list = content_genre.find("== List ==")
                        
to_seealso = content_genre.find("==See also==")
to_seealso_ = content_genre.find("== See also ==")
to_ref = content_genre.find("==References==")
to_ref_ = content_genre.find("== References ==")
to_notes = content_genre.find("==Notes==")
to_notes_ = content_genre.find("== Notes ==")

CONDITIONS_BY = [by_alphanum, by_alphanum_, by_alpha, by_alpha_, by_artist, by_artist_, 
by_individuals, by_individuals_, by_list]
CONDITIONS_TO = [to_seealso, to_seealso_, to_ref, to_ref_, to_notes, to_notes_]


print(CONDITIONS_BY)
print(CONDITIONS_TO)

#Defining the content to evaluate by scenario:
by_scenario = getscenario(CONDITIONS_BY)
to_scenario = getscenario(CONDITIONS_TO)

print(by_scenario)
print(to_scenario)

if by_scenario > 0 and to_scenario > 0: 
    G_ARTISTS = getnames(content_genre[by_scenario:to_scenario])

else:
    print("missing")
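
One plausible cause of the two hard rock errors, which the notebook itself does not confirm, is the en dash in the page titles: a URL passed to `urllib.request.urlopen` has to be ASCII, and the cleaning step only replaces spaces and `&`. As a sketch under that assumption, the titles that need escaping can be found with `str.isascii` and percent-encoded with `urllib.parse.quote`.

```python
from urllib.parse import quote

# Two titles taken from the genre list printed above
titles = [
    "List of hard rock musicians (A–M)",
    "List of hip hop groups",
]

# Titles with characters outside ASCII cannot go into a URL unescaped
non_ascii = [title for title in titles if not title.isascii()]

# Percent-encode the title as UTF-8; underscores are left untouched by quote
encoded = [quote(title.replace(" ", "_")) for title in non_ascii]

print(non_ascii)
print(encoded)
```

If this is indeed the cause, passing the encoded title to `getapi` instead of the raw one should let the request go through.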

After cleaning up the extraction errors, it was possible to transform the artist lists into the desired output:

In [None]:
# Create a function to get the ARTISTS in the desired Output
def getoutput(ARTISTS_LIST):
    OUT_ARTISTS = []
    flat_ARTISTS = [item for sublist in ARTISTS_LIST for item in sublist]
    
    # Remove duplicates
    single_ARTISTS = sorted(list(dict.fromkeys(flat_ARTISTS)))
    num_duplicates = len(flat_ARTISTS) - len(single_ARTISTS)

    # Generate Output 
    for name in single_ARTISTS:
        dic_art = {"name": name , "id": ""}
        OUT_ARTISTS.append(dic_art)
    
    return(OUT_ARTISTS, num_duplicates)
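
The effect of the flattening and de-duplication is easiest to see on a tiny input. The sketch below mirrors the steps of `getoutput` on hypothetical data, using a `set` for the de-duplication; `build_artist_records` is an illustrative name.

```python
def build_artist_records(artist_lists):
    """Flatten per-genre lists, drop duplicate names and build the output records."""
    flat = [name for sublist in artist_lists for name in sublist]
    unique = sorted(set(flat))
    records = [{"name": name, "id": ""} for name in unique]
    return records, len(flat) - len(unique)


# Hypothetical input: two genres that share one artist
sample = [["Blur", "ABBA"], ["ABBA", "Queen"]]
records, duplicates = build_artist_records(sample)
print(records)
print(duplicates)
```

An artist listed under several genres appears only once in the result, which is why the number of records is lower than the number of links extracted.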

In [None]:
ALL_output, ALL_duplicates = getoutput(ALL_ARTISTS)
print(len(ALL_output))
print(ALL_duplicates)

In [None]:
SET_output, SET_duplicates = getoutput(SET_ARTISTS)
print(len(SET_output))
print(SET_duplicates)

With the final results, it was possible to save the list of artists in JSON format for later use with the Spotify API.

In [None]:
with open('../../data/wiki/WikiArtists_RPH.json', 'w') as outfile:
    json.dump(RPH_OUTPUT, outfile)

In [None]:
with open('../../data/wiki/WikiArtists.json', 'w') as outfile:
    json.dump(SET_output, outfile)
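
Before handing the file to the Spotify step, it can be worth checking that it reads back unchanged. The sketch below does this round trip with hypothetical records in a temporary directory, so it does not touch the notebook's data folder. The `ensure_ascii=False` and `indent` options are optional additions that keep accented names readable in the file.

```python
import json
import os
import tempfile

# Hypothetical records in the same shape as the notebook's output
records = [{"name": "ABBA", "id": ""}, {"name": "Beyoncé", "id": ""}]

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "WikiArtists_sample.json")
    # Write with an explicit encoding so non-ASCII names survive on any platform
    with open(path, "w", encoding="utf-8") as outfile:
        json.dump(records, outfile, ensure_ascii=False, indent=2)
    # Read the file back and compare it with what was written
    with open(path, encoding="utf-8") as infile:
        reloaded = json.load(infile)

print(reloaded == records)
```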

### Additional code for tests and checks

In [None]:
#TESTS API

name="List of political hip hop artists"
clean_name= name.replace(" ", "_").replace("&", "%26")
print(clean_name)
print(getapi(clean_name))
getcontents(getapi(clean_name));

In [None]:
def getartist_MISSED_TESTS(G_LIST):
    ALL_ARTISTS = []
    G_MISSING = []
    G_ERRORS = []

    for genre in G_LIST:
        print(genre)
        try:
            clean_genre = genre.replace(" ", "_").replace("&", "%26")
            
            query = getapi(clean_genre)
            content_genre = getcontents(query)

            #Defining the different pages format options as scenarios:
            by_alphanum = content_genre.find("==0-9==")
            by_alphanum_ = content_genre.find("== 0-9 ==")
            by_alphanum_sub = content_genre.find("===0-9===")
            by_alpha = content_genre.find("==A==")
            by_alpha_ = content_genre.find("== A ==")
            by_artist = content_genre.find("==Artists==")
            by_artist_ = content_genre.find("== Artists ==")
            by_individuals = content_genre.find("==Individuals==")
            by_individuals_ = content_genre.find("== Individuals ==")
            by_list = content_genre.find("== List ==")
                        
            to_seealso = content_genre.find("==See also==")
            to_seealso_ = content_genre.find("== See also ==")
            to_ref = content_genre.find("==References==")
            to_ref_ = content_genre.find("== References ==")
            to_notes = content_genre.find("==Notes==")
            to_notes_ = content_genre.find("== Notes ==")

            CONDITIONS_BY = [by_alphanum, by_alphanum_, by_alpha, by_alpha_, by_artist, by_artist_,
             by_individuals, by_individuals_, by_list]
            CONDITIONS_TO = [to_seealso, to_seealso_, to_ref, to_ref_, to_notes, to_notes_]
            
            print(CONDITIONS_BY)
            print(CONDITIONS_TO)

            #Defining the content to evaluate by scenario:
            by_scenario = getscenario(CONDITIONS_BY)
            to_scenario = getscenario(CONDITIONS_TO)
           
            print(by_scenario)
            print(to_scenario)

            if by_scenario > 0 and to_scenario > 0: 
                G_ARTISTS = getnames(content_genre[by_scenario:to_scenario])
                ALL_ARTISTS.append(G_ARTISTS)

            elif by_scenario < 0 and to_scenario > 0:
                G_ARTISTS = getnames(content_genre[0:to_scenario])
                ALL_ARTISTS.append(G_ARTISTS)
                
            
            elif by_scenario > 0 and to_scenario < 0:
                G_ARTISTS = getnames(content_genre[by_scenario:])
                ALL_ARTISTS.append(G_ARTISTS)

            else:
                G_MISSING.append(genre)
                    
        except Exception:  # skip this genre, but still allow the loop to be interrupted
            print("ERROR IN:", genre)
            G_ERRORS.append(genre)

    return(ALL_ARTISTS, G_MISSING, G_ERRORS)

In [None]:
a,b,c = getartist_MISSED_TESTS(SET_MISSED)