# **BTS and Sampling**
This project explores the network of relationships between BTS and other artists, based on the samples used in their music. The objective is to understand how artists and their influences are connected, and how those connections span genres and geographies.

In this project, I will be using a few specific terms to refer to songs and samples. They are as follows:
- **sample** - A sample is a recorded sound taken from its original context and applied to a new context.
- **sampled** - The action of using a sample. *They sampled <Sample Source> in <Song Containing Sample>.*
- **song** - For the context of this project, *all songs contain samples, but not all samples are songs*.

In practice, sentences should take the following form:

- BTS *sampled* **Here We Go (Live at the Funhouse) by Run-DMC** in the *song* **호르몬 전쟁 War of Hormone**.

In [1]:
# Created: May 17, 2023
# Author: Brendan Keane (GitHub @brendanwilliam)
# Purpose: Create a list of CSS selectors to pull sample data

# Imports
from bs4 import BeautifulSoup
import requests
import time
import random
import pandas as pd
import re

#==================== Global constants ====================#
EXPORT_PATH = "../src/data/raw/"
ARTIST = "BTS"
ARTIST_PAGE = "https://whosampled.com/{}/samples/"
ARTIST_SAMPLES = ARTIST_PAGE.format(ARTIST) + '?sp={}'
ROOT_URL = "https://www.whosampled.com"
HEADERS = {
    'Access-Control-Allow-Origin': '*',
    'Access-Control-Allow-Methods': 'GET',
    'Access-Control-Allow-Headers': 'Content-Type',
    'Access-Control-Max-Age': '3600',
    'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0'
    }

### These functions gather all samples used by a specified artist
`nav_all_pages` works through all pages from a given URL and calls the specified function on each page. It is used throughout this project to deal with listings that have more than 10 results.

`get_sample_pages` scrapes all the sample URLs on a page.
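The pagination pattern described above can be tried out without touching the network. The sketch below assumes a hypothetical `fetch_page` callable that stands in for downloading and parsing one page; it returns the items on the page plus a flag saying whether a "next" link exists. The function names, the return shape, and the `example.com` URLs are illustrative assumptions, not part of WhoSampled or of this notebook.

```python
from typing import Callable, List, Tuple

# Hypothetical stand-in for "download and parse one page".
# Returns (items found on the page, whether a next page exists).
FetchPage = Callable[[str], Tuple[List[str], bool]]


def collect_all_pages(url_template: str, fetch_page: FetchPage) -> List[str]:
    """Format the page number into the URL template until no next page exists."""
    results: List[str] = []
    page = 1
    while True:
        items, has_next = fetch_page(url_template.format(page))
        results.extend(items)
        if not has_next:
            break
        page += 1
    return results


# Fake three-page listing used only for illustration.
FAKE_SITE = {
    "https://example.com/artist/samples/?sp=1": (["/a/", "/b/"], True),
    "https://example.com/artist/samples/?sp=2": (["/c/"], True),
    "https://example.com/artist/samples/?sp=3": (["/d/"], False),
}

paths = collect_all_pages(
    "https://example.com/artist/samples/?sp={}", FAKE_SITE.__getitem__
)
print(paths)
```

Passing the per-page work in as a callable is the same design choice `nav_all_pages` makes with its `func` argument: the loop stays generic and each caller decides what to extract.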

In [12]:
# Works through all pages from a given WhoSampled artist/track page
def nav_all_pages(url, func, dt):
  cur_page = 1
  while True:
    # This is to help people see where information is coming from in this project
    cur_url = url.format(cur_page)
    print('Current page:\t' + cur_url)

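    # Pages with fewer than 10 results may not have a paginated listing.
    # If the request fails, fall back to the base URL by stripping the
    # 13-character suffix (e.g. 'sampled/?cp=1' or 'samples/?sp=1').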
    response = requests.get(cur_url, headers=HEADERS)
    if response.status_code != 200:
      cur_url = cur_url[:-13]
      response = requests.get(cur_url, headers=HEADERS)

    soup = BeautifulSoup(response.content, 'html.parser')
    dt = func(soup, dt, cur_url)

    # Finding the next page element
    next_page = soup.find('span', class_='next')

    if not next_page:
      break
    else:
      cur_page += 1
      time.sleep(random.randint(1, 2))
  return dt

# Returns a list of sample pages from a given WhoSampled artist page
def get_sample_pages(soup, page_list, cur_page):
  for sample in soup.find_all('a', class_='connectionName playIcon'):
    page_list.append(sample.get('href'))
  return page_list

def get_sample_source(url):
  cur_url = ROOT_URL + url
  response = requests.get(cur_url, headers=HEADERS)
  soup = BeautifulSoup(response.content, 'html.parser')
  sample_element = soup.find('div', id='sampleWrap_source')
  sample_source = sample_element.find('a').get('href')
  return sample_source

def get_all_sample_sources(urls):
  sample_sources = []
  for url in urls:
    sample_sources.append(get_sample_source(url))

  return list(set(sample_sources))


# Creates a list of all samples an artist has used
pages = nav_all_pages(ARTIST_SAMPLES, get_sample_pages, [])

# Collects the unique source URL paths for those samples
sources = get_all_sample_sources(pages)

print("Number of sample URL paths collected: ", str(len(sources)))

Current page:	https://whosampled.com/BTS/samples/?sp=1
Current page:	https://whosampled.com/BTS/samples/?sp=2
Number of sample URL paths collected:  23


### **Creating a dataset of songs and samples**
Now that we have the URLs for every sample used by an artist, we can create a dataset of all songs containing the sample. This dataset is the basis of our network, as the connection between song and sample is that of two nodes connected by an edge.

In [18]:
# URL for all songs containing a specific sample
# Format:
  # Field 1: Sample URL
  # Field 2: Page number
SAMPLE_URL = 'https://www.whosampled.com{}sampled/?cp={}'

# Gathers information about every song containing a sample.
# Takes in the sample URL and a DataFrame and returns a DataFrame with
# all songs containing the specified sample.
def get_all_sample_uses(url, df):
  cur_url = SAMPLE_URL.format(url, '{}')
  df = nav_all_pages(cur_url, get_sample_uses, df)
  return df

# This function is passed into `nav_all_pages` so that it's executed on every
# page visited.
# Takes in the BeautifulSoup data from the page, a DataFrame, and the URL of the
# song containing the sample.
# Returns the DataFrame with the `title`, `artist`, `year`, and `sample` (URL)
# added as a new row to the DataFrame. On a full page, this will add 10 entries
# to the DataFrame.
def get_sample_uses(soup, df, cur_page):

  substring = "Was sampled in"
  found_section = None
  sections = soup.find_all('section')

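  # Makes sure the section is for samples
  for section in sections:
    header = section.find('span')

    # Skip sections that have no <span> header
    if header is not None and substring in header.text:
      found_section = section
      break

  if found_section is None:
    # Return the DataFrame unchanged so the rest of the scrape can continue
    print("Error: Unable to find instances of this song being sampled.")
    return df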

  # Loops through all entries on a page
  for sample in found_section.find_all('div', class_='listEntry sampleEntry'):

    # Scrapes the desired information
    title = sample.find('a', class_='trackName playIcon').text.strip()
    artist = sample.find('span', class_='trackArtist').text.strip()
    style = sample.find('span', class_='topItem').text.strip()
    sample_path = re.sub(r'sampled/\?cp=\d+', '', cur_page)

    # Handling if a genre does not exist
    try:
      elem = sample.find('span', class_='bottomItem').text.strip()
      if elem:
        genre = elem
      else:
        genre = ''
    except AttributeError:
      genre = ''

    # Adds information to DataFrame
    new_entry = pd.DataFrame({
      'title': [title],
      'artist': [' '.join(artist.split()[1:-1])],
      'year': [artist.split()[-1][1:-1]],
      'genre': [genre],
      'style': [style],
      'sample': [sample_path]
    })
    df = pd.concat([df, new_entry], ignore_index=True)

  return df


# Takes in a list of sample URLs and saves a `.csv` of all songs that contain the sample.
def create_shared_sample_df(sources):

  # Create the DataFrame
  df = pd.DataFrame(columns=['title', 'artist', 'year', 'genre', 'style', 'sample'])

  # Loop through all URLs
  for source in sources:

    # Reset the df variable to include all newly added songs
    df = get_all_sample_uses(source, df)
    time.sleep(random.randint(1, 3))

    # Save current state
    df.to_csv(EXPORT_PATH + 'songs_w_samples_output.csv')

  # Final export
  df.to_csv(EXPORT_PATH + 'songs_w_samples_dataset.csv')

create_shared_sample_df(sources)

Current page:	https://www.whosampled.com/BTS/Intro%3A-2-Cool-4-Skool/sampled/?cp=1
Current page:	https://www.whosampled.com/BTS/Best-of-Me-(Japanese-Version)/sampled/?cp=1
Current page:	https://www.whosampled.com/BTS/Outro%3A-Luv-in-Skool/sampled/?cp=1
Current page:	https://www.whosampled.com/Urban-Zakapa/%EC%BB%A4%ED%94%BC%EB%A5%BC-%EB%A7%88%EC%8B%9C%EA%B3%A0/sampled/?cp=1
Current page:	https://www.whosampled.com/Keb%27-Mo%27/Am-I-Wrong/sampled/?cp=1
Current page:	https://www.whosampled.com/Mountain/Long-Red/sampled/?cp=1
Current page:	https://www.whosampled.com/Mountain/Long-Red/sampled/?cp=2
Current page:	https://www.whosampled.com/Mountain/Long-Red/sampled/?cp=3
Current page:	https://www.whosampled.com/Mountain/Long-Red/sampled/?cp=4
Current page:	https://www.whosampled.com/Mountain/Long-Red/sampled/?cp=5
Current page:	https://www.whosampled.com/Mountain/Long-Red/sampled/?cp=6
Current page:	https://www.whosampled.com/Mountain/Long-Red/sampled/?cp=7
Current page:	https://www.whosamp
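The string handling inside `get_sample_uses` is easier to follow in isolation. The sketch below assumes the scraped artist text has the shape `by <Artist> (<Year>)`, which is inferred from the slicing in the cell above rather than confirmed against the site. The helper names are hypothetical.

```python
import re


def split_artist_and_year(raw: str):
    """Split text such as 'by Ice-T (1993)' into ('Ice-T', '1993')."""
    tokens = raw.split()
    artist = " ".join(tokens[1:-1])  # drop the leading word and the year token
    year = tokens[-1][1:-1]          # strip the surrounding parentheses
    return artist, year


def strip_pagination(page_url: str) -> str:
    """Remove the 'sampled/?cp=<n>' suffix to recover the sample's own URL."""
    return re.sub(r"sampled/\?cp=\d+", "", page_url)


print(split_artist_and_year("by Digital Underground feat. 2Pac (1993)"))
print(strip_pagination("https://www.whosampled.com/Mountain/Long-Red/sampled/?cp=3"))
```

Note that the slicing would misbehave if the text did not start with a single leading word or did not end with a parenthesized year, so it is worth spot-checking the `artist` and `year` columns after a scrape.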

In [48]:
# Takes in a song URL path and a DataFrame, and returns the DataFrame with
# that song's details appended as a new row

def get_song_details(url, df):
  cur_url = ROOT_URL + url
  res = requests.get(cur_url, headers=HEADERS)
  soup = BeautifulSoup(res.content, 'html.parser')

  title = soup.find('meta', itemprop='name')["content"].strip()
  name = soup.find('span', itemprop='byArtist').find('meta', itemprop='name')["content"].strip()
  year = soup.find('meta', itemprop='datePublished')["content"].strip()
  genre = soup.find('span', itemprop='genre').text.strip()

  new_entry = pd.DataFrame({
    'title': [title],
    'artist': [name],
    'year': [year],
    'sample': [cur_url],
    'genre': [genre],
    'style': ['Source']
  })
  print(new_entry)
  df = pd.concat([df, new_entry], ignore_index=True)

  return df

def create_song_df(url_list):

  df = pd.DataFrame(columns=['title', 'artist', 'year', 'genre', 'style', 'sample'])
  for url in url_list:
    df = get_song_details(url, df)
    time.sleep(random.randint(1, 2))

  return df

sample_df = create_song_df(sources)
sample_df.to_csv(EXPORT_PATH + 'samples_dataset.csv')

                   title artist  year  \
0  Intro: 2 Cool 4 Skool    BTS  2013   

                                              sample       genre   style  
0  https://whosampled.com/BTS/Intro%3A-2-Cool-4-S...  Rock / Pop  Source  
                           title artist  year  \
0  Best of Me (Japanese Version)    BTS  2018   

                                              sample       genre   style  
0  https://whosampled.com/BTS/Best-of-Me-(Japanes...  Rock / Pop  Source  
                 title artist  year  \
0  Outro: Luv in Skool    BTS  2013   

                                              sample       genre   style  
0  https://whosampled.com/BTS/Outro%3A-Luv-in-Skool/  Rock / Pop  Source  
     title        artist  year  \
0  커피를 마시고  Urban Zakapa  2009   

                                              sample       genre   style  
0  https://whosampled.com/Urban-Zakapa/%EC%BB%A4%...  Rock / Pop  Source  
        title    artist  year  \
0  Am I Wrong  Keb' Mo'  1994   

   

### Creating a Network DataFrame
Now that we have data for songs and samples, we will create two DataFrames. The first contains every song in this project, so that each song can be referenced by an ID.

The second is an edge table with `source`, `target`, and `value` columns. It describes the network purely in terms of ID values, which we will later populate with metadata.
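As a rough illustration of the source/target/value layout, the sketch below builds an edge list from a handful of made-up node rows using only the standard library. The IDs, styles, and URL paths are invented for the example, and it assumes every sample group contains exactly one row whose style is `'Source'`. Each edge is weighted by one over the group size, mirroring the `1/len(root)` weighting used later in this notebook.

```python
from collections import defaultdict

# Made-up node rows: (id, style, sample path). 'Source' marks the sampled track.
rows = [
    (1, "Source", "/artist-a/track-1/"),
    (2, "Drums", "/artist-a/track-1/"),
    (3, "Vocals / Lyrics", "/artist-a/track-1/"),
    (4, "Source", "/artist-b/track-2/"),
    (5, "Multiple Elements", "/artist-b/track-2/"),
]

# Group the rows by the sample they share.
groups = defaultdict(list)
for node_id, style, sample in rows:
    groups[sample].append((node_id, style))

edges = []
for sample, members in groups.items():
    # Assumption: exactly one 'Source' row per group.
    source_id = next(node_id for node_id, style in members if style == "Source")
    for node_id, _style in members:
        if node_id != source_id:
            edges.append({
                "source": source_id,
                "target": node_id,
                "value": 1 / len(members),
            })

for edge in edges:
    print(edge)
```

With this weighting, heavily sampled sources produce many light edges, while rarely sampled sources produce a few heavy ones.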

In [54]:
sample_df = pd.read_csv('data/raw/samples_dataset.csv')
song_w_samples_df = pd.read_csv('data/raw/songs_w_samples_dataset.csv')

song_df = pd.concat([sample_df, song_w_samples_df], ignore_index=True)
song_df = song_df.drop('Unnamed: 0', axis=1)
song_df.to_csv('data/processed/song_data.csv')

In [29]:
df = pd.read_csv('data/processed/nodes_unique.csv')
df.sample(5)

Unnamed: 0,id,title,artist,year,genre,style,sample
207,230,Brooklyn Blew Up the Bridge,MC Mitchski,1987,Hip-Hop / Rap / R&B,Vocals / Lyrics,https://www.whosampled.com/Mountain/Long-Red/
406,429,That's How I'm Livin' (On the Rox Remix),Ice-T,1993,Hip-Hop / Rap / R&B,Drums,https://www.whosampled.com/Mountain/Long-Red/
1777,1800,How Could You,J Foe,1990,Hip-Hop / Rap / R&B,Vocals / Lyrics,https://www.whosampled.com/Run-DMC/Here-We-Go-...
76,99,Wussup Wit the Luv,Digital Underground feat. 2Pac,1993,Hip-Hop / Rap / R&B,Vocals / Lyrics,https://www.whosampled.com/Mountain/Long-Red/
1289,1312,Back Stage Pacin',Brother Ali,2003,Hip-Hop / Rap / R&B,Multiple Elements,https://www.whosampled.com/Run-DMC/Here-We-Go-...


In [30]:
df.columns

Index(['id', 'title', 'artist', 'year', 'genre', 'style', 'sample'], dtype='object')

In [35]:
rdf = df
rdf = rdf.drop('id', axis=1)
rdf.to_csv('data/processed/nodes_unique.csv')

In [52]:
song_df.sample(5)

Unnamed: 0.1,Unnamed: 0,title,artist,year,genre,style,sample
617,594,"Monkey See, Monkey Do",C.E.B.,2016,Hip-Hop / Rap / R&B,Drums,https://www.whosampled.com/Mountain/Long-Red/
371,348,Bus Dat Ass,King Tee feat. Tha Alkaholiks,1992,Hip-Hop / Rap / R&B,Vocals / Lyrics,https://www.whosampled.com/Mountain/Long-Red/
664,641,Come Il Sole,Ensi,2012,Hip-Hop / Rap / R&B,Vocals / Lyrics,https://www.whosampled.com/Mountain/Long-Red/
976,953,Abyssal Dependence,Zuntata,2011,Electronic / Dance,Vocals / Lyrics,https://www.whosampled.com/YouTube/Crazy-Germa...
638,615,No Risk No Reward,LMNO,2013,Hip-Hop / Rap / R&B,Vocals / Lyrics,https://www.whosampled.com/Mountain/Long-Red/


In [36]:
import pandas as pd

net = pd.DataFrame(columns=['source', 'target', 'value'])
df = pd.read_csv('data/processed/nodes_unique.csv')


def get_sample_df(df, val):
  rdf = df[df['sample'] == val]
  return rdf


def add_sample_to_network(root, df):
  root_index = root[root['style'] == 'Source']['id'].item()

  for index, row in root.iterrows():
    cur_id = row['id']

    if root_index != cur_id:
      new_edge = pd.DataFrame({
        'source': root_index,
        'target': row['id'],
        'value': [1/len(root)]
      })

      df = pd.concat([df, new_edge], ignore_index=True)

  return df

def make_sample_network(df):
  sample_list = list(set(df['sample']))
  net = pd.DataFrame(columns=['source', 'target', 'value'])

  for sample in sample_list:
    sample_df = get_sample_df(df, sample)
    net = add_sample_to_network(sample_df, net)

  return net

net = make_sample_network(df)
net.to_csv('data/processed/edges.csv')

ValueError: can only convert an array of size 1 to a Python scalar
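This `ValueError` is the message `.item()` produces when the selection does not hold exactly one element. Here that would mean a sample group with zero or several rows whose `style` is `'Source'`. Which group triggered it is not shown in the output. The sketch below is one possible defensive lookup, assuming the same `id`, `style`, and `sample` columns; the helper name `find_source_id` and the toy data are hypothetical.

```python
import pandas as pd


def find_source_id(group: pd.DataFrame):
    """Return the id of the single 'Source' row, or None if there is not exactly one."""
    source_ids = group.loc[group["style"] == "Source", "id"]
    if len(source_ids) != 1:
        return None
    return source_ids.iloc[0]


# Toy data: group '/a/' has one Source row, group '/b/' has none.
nodes = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "style": ["Source", "Drums", "Drums", "Vocals / Lyrics"],
    "sample": ["/a/", "/a/", "/b/", "/b/"],
})

for sample, group in nodes.groupby("sample"):
    source_id = find_source_id(group)
    if source_id is None:
        print("No single Source row for", sample, "with ids", group["id"].tolist())
    else:
        print(sample, "has source id", source_id)
```

Printing the offending groups instead of raising makes it easier to see whether the problem is a missing source row or a mismatch between the `sample` values of source rows and sampling rows.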

### **Creating the visualization from the node and edge tables**
With all nodes and edges defined, we can now create the network diagram. To do this, we will load the node table from `nodes.csv` and define the connections with the edge table from `edges.csv`. The result should be an interactive network diagram consisting of all songs that share a sample with BTS.

In [3]:
# Importing packages and necessary data

from pyvis.network import Network
import networkx as nx
import pandas as pd
import numpy as np

nodes = pd.read_csv('data/processed/nodes.csv')
nodes = nodes.drop_duplicates()  # DataFrame has no .unique(); drop repeated rows instead
edges = pd.read_csv('data/processed/edges.csv')

In [None]:
print(len(nodes))

In [17]:
net = Network(
  notebook=True,
)
nodes['label'] = nodes.apply(lambda row: '{} ({}) - {}'.format(row['title'], row['year'], row['artist']), axis=1)
nodes['colors'] = np.where(nodes['style'] == 'Source', 'CornflowerBlue', np.where(nodes['artist'] == 'BTS', 'DarkOrange', 'DimGray'))

titles = nodes['label'].tolist()
labels = nodes['artist'].tolist()
colors = nodes['colors'].tolist()

net.add_nodes(nodes['id'].tolist(),
  title=titles,
  label=labels,
  color=colors)

for index, row in edges.iterrows():
  net.add_edge(row['source'], row['target'], weight=row['value']*len(edges))

net.show_buttons(filter_=['physics'])
net.show('../index.html')

../index.html
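The label and color rules applied to the nodes above can also be written as two small plain-Python functions, which makes the precedence explicit: a `'Source'` row is always blue, even when the artist is BTS. This is only a restatement of the nested `np.where` logic; the function names are hypothetical.

```python
def node_label(title: str, year, artist: str) -> str:
    """Build the hover text in the form 'Title (Year) - Artist'."""
    return "{} ({}) - {}".format(title, year, artist)


def node_color(style: str, artist: str) -> str:
    """Sampled sources are blue, other BTS songs orange, everything else gray."""
    if style == "Source":
        return "CornflowerBlue"
    if artist == "BTS":
        return "DarkOrange"
    return "DimGray"


print(node_label("Long Red", 1972, "Mountain"))
print(node_color("Source", "BTS"))
print(node_color("Drums", "BTS"))
print(node_color("Drums", "Ice-T"))
```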
