# Best Selling Music Artists

In [1]:
import pandas as pd
import re
wikiurl = 'https://en.wikipedia.org/wiki/List_of_best-selling_music_artists'
tables = pd.read_html(wikiurl)

### Some notes:

The pattern `\[\d+\]` is replaced with an empty string. This removes every footnote marker made of one or more digits inside square brackets (for example `[12]`) from the element.

The `r` before the string marks it as a raw string, so backslash (`\`) characters reach the regex engine unchanged instead of being interpreted as Python escape sequences. The `\[\d+\]` pattern matches one or more digits surrounded by literal square brackets. Every occurrence of this pattern is replaced with the empty string (`''`).

The `regex=True` parameter tells the `.replace()` method that the pattern is a regular expression. Regular expressions are a powerful tool for matching patterns in text.
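
The effect of escaping the brackets can be checked with the standard `re` module alone. This is a minimal sketch; the sample string `cell` is made up for illustration and is not taken from the scraped table.

```python
import re

cell = "The Beatles[1][23]"  # hypothetical cell text with footnote markers

# Escaped brackets: a literal "[", one or more digits, then a literal "]".
footnote = re.compile(r"\[\d+\]")
print(footnote.sub("", cell))    # The Beatles

# Unescaped brackets form a character class: any single digit or a literal "+".
char_class = re.compile(r"[\d+]")
print(char_class.sub("", cell))  # The Beatles[][]
```

Only the escaped form removes the whole marker; the character class strips the digits and leaves the empty brackets behind.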

---
`.values[0]` is an attribute access followed by indexing, not a method call: `.values` returns the Series' underlying NumPy array, and `[0]` takes its first element. Because `pd.Series(x)` wraps a single cell, that array holds exactly one value, the cleaned string.

This step is needed because the function passed to `.map()` must return one value per cell. `.replace()` does not modify anything in place; it returns a new one-element Series, and `.values[0]` unwraps that Series back into a plain string.

### Extra:

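The `.replace(..., regex=True)` call already removes every occurrence of the pattern. A more direct alternative is to call `re.sub()` on each element, which avoids building a temporary Series:

`df = tables[0].map(
    lambda x: re.sub(r'\[\d+\]', '', str(x))
)`

This code calls `re.sub()` directly. Like `.replace(..., regex=True)`, it replaces every match by default; its `count` argument can limit the number of replacements. The `re.DOTALL` flag is not needed here: it only makes `.` match newline characters, and this pattern contains no `.`. Wrapping the element in `str()` keeps `re.sub()` from raising a `TypeError` on non-string cells such as numbers.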
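
The behaviour of `re.sub()` and of `re.DOTALL` can be verified in isolation. The snippet below is a small sketch on a made-up two-line string (`text` is an assumption for illustration, not scraped data).

```python
import re

text = "600 million[1]\n500 million[2]"  # hypothetical two-line cell

# By default (count=0) re.sub replaces every match.
all_removed = re.sub(r"\[\d+\]", "", text)
print(repr(all_removed))    # '600 million\n500 million'

# count=1 limits the replacement to the first match.
first_removed = re.sub(r"\[\d+\]", "", text, count=1)
print(repr(first_removed))  # '600 million\n500 million[2]'

# re.DOTALL only changes what "." matches: with it, "." also matches "\n".
without_flag = re.findall(r"600.+", text)
with_flag = re.findall(r"600.+", text, flags=re.DOTALL)
print(without_flag)  # ['600 million[1]']
print(with_flag)     # ['600 million[1]\n500 million[2]']
```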

---

In [2]:
# DataFrame.map requires pandas 2.1 or later (older versions use applymap)
df = tables[0].map(
    lambda x:  # Lambda function applied to each element of the DataFrame
        pd.Series(x)  # Wrap the element in a one-element Pandas Series
        .astype(str)  # Convert the Series' value to a string
        .replace(r'\[\d+\]', '', regex=True)  # Remove every match of the pattern `\[\d+\]`, e.g. [12]
        .values[0]  # Unwrap the single cleaned value from the Series
)
df

Unnamed: 0,Artist,Country,Period active,Release-year of first charted record,Genre,Total certified units (from available markets)[b],Claimed sales
0,The Beatles,United Kingdom,1960–1970,1962,Rock / pop,293.4 million US: 217.250 million JPN: 4.950 m...,600 million 500 million
1,Elvis Presley,United States,1953–1977,1956,Rock and roll / pop / country,"234.1 million US: 199.650 million JPN: 300,000...",500 million
2,Michael Jackson,United States,1964–2009,1971,Pop / rock / dance / soul / R&B,285.6 million US: 177.3 million JPN: 4.650 mil...,400 million
3,Elton John,United Kingdom,1962–present,1970,Pop / rock,212.5 million US: 139.050 million JPN: 1.1 mil...,300 million 250 million
4,Queen,United Kingdom,1971–present,1973,Rock,186.2 million US: 97.7 million JPN: 3.8 millio...,300 million 250 million
5,Madonna,United States,1979–present,1983,Pop / dance / electronica,184.9 million US: 87.675 million JPN: 6.450 mi...,300 million 250 million
6,Led Zeppelin,United Kingdom,1968–1980,1969,Hard rock / blues rock / folk rock,"142.9 million US: 115.1 million JPN: 400,000 [...",300 million 200 million
7,Rihanna,Barbados,2003–present,2005,R&B / pop / dance / hip-hop,365.8 million US: 261.550 million JPN: 1.4 mil...,250 million 230 million
8,Pink Floyd,United Kingdom,"1965–1996, 2005, 2012–2014",1967,Progressive rock / psychedelic rock,"124.2 million US: 78 million JPN: 100,000[b] G...",250 million 200 million


In [3]:
dfs = []

# Iterate over the tables together with their index
for index, table in enumerate(tables):
    # Remove every bracketed marker such as [12] or [b]. The negated character
    # class [^\]]+ stops at the first closing bracket; non-string cells are left unchanged.
    cleaned_table = table.map(
        lambda x: re.sub(r'\[[^\]]+\]', '', x) if isinstance(x, str) else x
    )

    # Append the cleaned table to the list
    dfs.append(cleaned_table)

    # Stop after the table at index 5, i.e. after the first six tables
    if index == 5:
        break

# Concatenate all DataFrames in the 'dfs' list along axis=0 with a new index
best_selling_artists_df = pd.concat(dfs, axis=0, ignore_index=True)

best_selling_artists_df

Unnamed: 0,Artist,Country,Period active,Release-year of first charted record,Genre,Total certified units (from available markets)[b],Claimed sales
0,The Beatles,United Kingdom,1960–1970,1962,Rock / pop,293.4 million US: 217.250 million JPN: 4.950 m...,600 million 500 million
1,Elvis Presley,United States,1953–1977,1956,Rock and roll / pop / country,"234.1 million US: 199.650 million JPN: 300,000...",500 million
2,Michael Jackson,United States,1964–2009,1971,Pop / rock / dance / soul / R&B,285.6 million US: 177.3 million JPN: 4.650 mil...,400 million
3,Elton John,United Kingdom,1962–present,1970,Pop / rock,212.5 million US: 139.050 million JPN: 1.1 mil...,300 million 250 million
4,Queen,United Kingdom,1971–present,1973,Rock,186.2 million US: 97.7 million JPN: 3.8 millio...,300 million 250 million
...,...,...,...,...,...,...,...
116,Bob Marley,Jamaica,1962–1981,1975,Reggae,"46.5 million US: 21.850 million JPN: 200,000 G...",75 million
117,The Police,United Kingdom,1977–1986 2007–2008,1978,Pop / rock,"42.2 million US: 23.650 million JPN: 100,000 G...",75 million
118,Barry Manilow,United States,1973–present,1973,Pop / soft rock,36.7 million US: 33.3 million UK: 3.225 millio...,75 million
119,Kiss,United States,1972–present,1974,Hard rock / heavy metal,"28.7 million US: 26 million JPN: 100,000 UK: 8...",75 million
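
The loop above uses a broader pattern than the earlier cell, which is why markers such as `[b]` no longer appear in the cell values. The sketch below compares the two patterns, plus a non-greedy variant, on a made-up string modeled on the "Total certified units" column (`cell` is an assumption for illustration).

```python
import re

cell = "124.2 million US: 78 million[12] JPN: 100,000[b]"  # hypothetical sample

digits_only = r"\[\d+\]"     # numeric footnotes such as [12]
any_bracket = r"\[[^\]]+\]"  # negated class: "[", a run of non-"]" characters, "]"
lazy_bracket = r"\[.+?\]"    # non-greedy variant; "." skips newlines unless re.DOTALL is set

print(re.sub(digits_only, "", cell))   # 124.2 million US: 78 million JPN: 100,000[b]
print(re.sub(any_bracket, "", cell))   # 124.2 million US: 78 million JPN: 100,000
print(re.sub(lazy_bracket, "", cell))  # 124.2 million US: 78 million JPN: 100,000
```

The negated character class needs no flag to cross line breaks, because `[^\]]` matches any character other than `]`, including a newline.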


In [4]:
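# Assigning to the row copies yielded by iterrows() may not update the DataFrame,
# so keep the text before the first 'US' with vectorized string methods instead.
col = 'Total certified units (from available markets)[b]'
best_selling_artists_df[col] = best_selling_artists_df[col].str.split('US').str[0].str.strip()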
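
The cell above keeps only the text before the first `US` in each value. On a single string the operation looks like this; `raw` is a shortened sample based on the first row, and the trailing `strip()` is an optional addition to drop the space left before `US`.

```python
raw = "293.4 million US: 217.250 million"  # shortened sample value

# Everything before the first "US" is the headline total.
total = raw.split("US")[0].strip()
print(total)  # 293.4 million

# If "US" is absent, split() returns the whole string as its only element.
no_us = "36.7 million".split("US")[0]
print(no_us)  # 36.7 million
```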


In [5]:
best_selling_artists_df.rename(columns={'Total certified units (from available markets)[b]':'Total certified units (from available markets)'}, inplace=True)

In [6]:
best_selling_artists_df

Unnamed: 0,Artist,Country,Period active,Release-year of first charted record,Genre,Total certified units (from available markets),Claimed sales
0,The Beatles,United Kingdom,1960–1970,1962,Rock / pop,293.4 million,600 million 500 million
1,Elvis Presley,United States,1953–1977,1956,Rock and roll / pop / country,234.1 million,500 million
2,Michael Jackson,United States,1964–2009,1971,Pop / rock / dance / soul / R&B,285.6 million,400 million
3,Elton John,United Kingdom,1962–present,1970,Pop / rock,212.5 million,300 million 250 million
4,Queen,United Kingdom,1971–present,1973,Rock,186.2 million,300 million 250 million
...,...,...,...,...,...,...,...
116,Bob Marley,Jamaica,1962–1981,1975,Reggae,46.5 million,75 million
117,The Police,United Kingdom,1977–1986 2007–2008,1978,Pop / rock,42.2 million,75 million
118,Barry Manilow,United States,1973–present,1973,Pop / soft rock,36.7 million,75 million
119,Kiss,United States,1972–present,1974,Hard rock / heavy metal,28.7 million,75 million
