In this exercise, you will use BeautifulSoup to build a dataset of all Country Music Hall of Fame inductees. This information is available at https://countrymusichalloffame.org/hall-of-fame/members/; you will scrape the contents of that page and convert them into a pandas DataFrame.

The website splits the members across multiple pages, but start by just working on the first page. Later on in the exercise, you'll take what you've done for the first page and apply it across all pages.

In [1]:
import requests
from bs4 import BeautifulSoup  
import pandas as pd
import re

### 1. Start by using either the inspector or by viewing the page source. Can you identify a tag that might be helpful for finding the names of all inductees? Make use of this to create a list containing just the names of each inductee.

In [21]:
website_url = 'https://countrymusichalloffame.org/hall-of-fame/members/page/2'
response = requests.get(website_url)

In [23]:
soup = BeautifulSoup(response.content, 'lxml')

In [24]:
artist_names = soup.find_all('h3')

In [25]:
artist_names

[<h3>Bobby Braddock</h3>,
 <h3>Harold Bradley</h3>,
 <h3>Jerry Bradley</h3>,
 <h3>Owen Bradley</h3>,
 <h3>Rod Brasfield</h3>,
 <h3>Garth Brooks</h3>,
 <h3>Brooks &amp; Dunn</h3>,
 <h3>Jim Ed Brown</h3>,
 <h3>Jim Ed Brown and the Browns</h3>]

In [33]:
artist_list = []
for artist in artist_names:
    artist_list.append(artist.text)

In [43]:
artist_list = [artist.text for artist in artist_names]

In [44]:
artist_list

['Bobby Braddock',
 'Harold Bradley',
 'Jerry Bradley',
 'Owen Bradley',
 'Rod Brasfield',
 'Garth Brooks',
 'Brooks & Dunn',
 'Jim Ed Brown',
 'Jim Ed Brown and the Browns']
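
Note that the tag list printed earlier shows `Brooks &amp; Dunn`, while the `.text` version shows `Brooks & Dunn`: BeautifulSoup decodes HTML entities when it returns text. The short sketch below reproduces that decoding in isolation with the standard library's `html.unescape`; the variable names are illustrative only.

```python
from html import unescape

# The name as it appears in the page source, with the ampersand escaped.
raw_name = "Brooks &amp; Dunn"

# Decode HTML entities into the characters they stand for.
clean_name = unescape(raw_name)
print(clean_name)
```

If a scraped name ever still contains an entity such as `&amp;`, it was probably taken from the raw markup rather than from `.text`.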

### 2. Next, try to find a tag that could be used to find the year that each member was inducted. Extract these into a list. When you do this, be sure to include only the year and not the full text. For example, for Roy Acuff, the list entry should be "1962" and not "Inducted 1962". Double-check that the resulting list has the correct number of elements and is in the same order as your inductees list.

In [45]:
year_inducted_soup = soup.find_all('p')

In [46]:
year_inducted_soup = soup.find_all('div', attrs = {'class': "vertical-card_content--copy"})

In [47]:
len(year_inducted_soup)

9

In [49]:
year_inducted_soup[0].text

'\n\n                    Inducted 2011                  \n'

In [50]:
years_inducted = [int(re.findall(r"\d+", year_str.text)[0]) for year_str in year_inducted_soup
                      if re.match(r"\s+Inducted\s\d+\s+", year_str.text)]

In [51]:
years_inducted

[2011, 2006, 2019, 1974, 1987, 2012, 2019, 2015, 2015]

In [77]:
years = []
for x in year_inducted_soup:
    # Drop the surrounding whitespace and the "Inducted " prefix, leaving just the year.
    years.append(int(x.text.strip().replace('Inducted ', "")))
years

[2011, 2006, 2019, 1974, 1987, 2012, 2019, 2015, 2015]

In [84]:
new_years =[]
for i in range(len(years)):
    new_years.append(years[i] - 2000)
    
new_years

[11, 6, 19, -26, -13, 12, 19, 15, 15]
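
Both approaches above either skip or crash on text that does not look like "Inducted YYYY", and skipping can leave the years list shorter than the names list. A minimal sketch of an alternative is shown below, assuming a hypothetical helper `extract_year` that returns `None` for unparseable text, so every input keeps its position.

```python
import re

def extract_year(text):
    """Return the four-digit induction year in `text`, or None if there is none."""
    match = re.search(r"Inducted\s+(\d{4})", text)
    return int(match.group(1)) if match else None

# Sample strings: one copied from the output above, one clean, one without a year.
raw_texts = [
    "\n\n                    Inducted 2011                  \n",
    "Inducted 1962",
    "no year here",
]
years_checked = [extract_year(text) for text in raw_texts]
print(years_checked)

# The result always has one entry per input, so it stays aligned with the names.
assert len(years_checked) == len(raw_texts)
```

A `None` entry then marks exactly which inductee needs a closer look.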

### 3. Take the two lists you created in parts 1 and 2 and convert them into a pandas DataFrame.

In [55]:
hof_df = pd.DataFrame(
    {
        "artist": artist_list, 
        "year_inducted": years_inducted
    }
)

In [61]:
list(zip(artist_list, years_inducted))

[('Bobby Braddock', 2011),
 ('Harold Bradley', 2006),
 ('Jerry Bradley', 2019),
 ('Owen Bradley', 1974),
 ('Rod Brasfield', 1987),
 ('Garth Brooks', 2012),
 ('Brooks & Dunn', 2019),
 ('Jim Ed Brown', 2015),
 ('Jim Ed Brown and the Browns', 2015)]
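
One caveat with `zip`: it stops at the shorter input, so a missing year would silently drop an artist instead of raising an error. The sketch below, built around a hypothetical helper `pair_up`, checks the lengths first and returns a list of dictionaries, which `pd.DataFrame` also accepts.

```python
def pair_up(names, years):
    """Pair each name with its year, refusing lists of different lengths."""
    if len(names) != len(years):
        raise ValueError(
            f"{len(names)} names but {len(years)} years; the lists are misaligned"
        )
    return [{"artist": name, "year_inducted": year} for name, year in zip(names, years)]

records = pair_up(["Bobby Braddock", "Harold Bradley"], [2011, 2006])
print(records)
```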

In [59]:
results_together = pd.DataFrame(list(zip(artist_list, years_inducted)), columns = ['Artist', 'Year'])
results_together

|   | Artist | Year |
| --- | --- | --- |
| 0 | Bobby Braddock | 2011 |
| 1 | Harold Bradley | 2006 |
| 2 | Jerry Bradley | 2019 |
| 3 | Owen Bradley | 1974 |
| 4 | Rod Brasfield | 1987 |
| 5 | Garth Brooks | 2012 |
| 6 | Brooks & Dunn | 2019 |
| 7 | Jim Ed Brown | 2015 |
| 8 | Jim Ed Brown and the Browns | 2015 |


In [57]:
hof_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9 entries, 0 to 8
Data columns (total 2 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   artist         9 non-null      object
 1   year_inducted  9 non-null      int64 
dtypes: int64(1), object(1)
memory usage: 272.0+ bytes


### 4. Now, take what you created for the first page and apply it across the rest of the pages so that you can scrape all inductees. Notice that when you click the next-page button at the bottom of the page, the URL changes to "https://countrymusichalloffame.org/hall-of-fame/members/page/2". Check that the code you wrote for the first page still works for page 2. Once you have verified that it does, write a for loop that cycles through all 16 pages and builds a DataFrame containing all inductees and the year of their induction.

In [89]:
test_list = [1, 2, 3, 4]
test_list.extend([5, 6, 7, 8])
test_list

[1, 2, 3, 4, 5, 6, 7, 8]

In [95]:
all_artists = []
all_years_inducted = []
for page in range(1, 17):
    # The first page has no "/page/N" suffix; every later page does.
    if page == 1:
        hof_url = "https://countrymusichalloffame.org/hall-of-fame/members/"
    else:
        hof_url = f"https://countrymusichalloffame.org/hall-of-fame/members/page/{page}"
    response = requests.get(hof_url)
    soup = BeautifulSoup(response.content, 'lxml')
    
    artists_page = soup.find_all('h3')
    artist_list_page = [artist.text for artist in artists_page]
    
    # Keep only the <p> tags whose text looks like "Inducted YYYY", then pull out the year.
    years_inducted_page = soup.find_all('p')
    years_inducted_list_page = [int(re.findall(r"\d+", year_str.text)[0]) for year_str in years_inducted_page
                      if re.match(r"\s+Inducted\s\d+\s+", year_str.text)]
    
    all_artists.extend(artist_list_page)
    all_years_inducted.extend(years_inducted_list_page)
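
The page-URL rule used in the loop (no suffix for page 1, `page/N` afterwards) can also be pulled into a small function so it can be checked without making any requests. This is only a sketch; `page_url` and `MEMBERS_URL` are names introduced here for illustration.

```python
MEMBERS_URL = "https://countrymusichalloffame.org/hall-of-fame/members/"

def page_url(page):
    """Return the listing URL for a 1-based page number."""
    if page < 1:
        raise ValueError("page numbers start at 1")
    # Page 1 is the bare listing URL; later pages append "page/N".
    return MEMBERS_URL if page == 1 else f"{MEMBERS_URL}page/{page}"

# The exercise states there are 16 pages in total.
all_page_urls = [page_url(page) for page in range(1, 17)]
print(all_page_urls[0])
print(all_page_urls[-1])
```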


In [96]:
hof_df_all = pd.DataFrame(
    {
        "artist": all_artists, 
        "year_inducted": all_years_inducted
    }
)
hof_df_all.head()

|   | artist | year_inducted |
| --- | --- | --- |
| 0 | Roy Acuff | 1962 |
| 1 | Alabama | 2005 |
| 2 | Bill Anderson | 2001 |
| 3 | Eddy Arnold | 1966 |
| 4 | Chet Atkins | 1973 |


In [92]:
hof_df_all.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 140 entries, 0 to 139
Data columns (total 2 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   artist         140 non-null    object
 1   year_inducted  140 non-null    int64 
dtypes: int64(1), object(1)
memory usage: 2.3+ KB


### 5. Create a visual using the data that you scraped. Prepare a short (<5 minute) presentation.
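
One possible starting point for a visual is the number of inductions per decade. The sketch below uses only the nine years scraped from page 2 earlier in this notebook, not the full dataset, and prints a rough text bar chart; the grouping by decade is just one assumption about what might be interesting to show.

```python
from collections import Counter

# Induction years taken from the page-2 output shown earlier.
sample_years = [2011, 2006, 2019, 1974, 1987, 2012, 2019, 2015, 2015]

# Round each year down to the start of its decade and count occurrences.
by_decade = Counter((year // 10) * 10 for year in sample_years)

# One row per decade, one '#' per inductee.
for decade in sorted(by_decade):
    print(f"{decade}s: {'#' * by_decade[decade]}")
```

The same counts could be handed to any plotting library once the full DataFrame is available.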

### 6. Bonus Question: If you navigate to Roy Acuff's page, you will see that his date of birth and date of death are listed toward the top of the page, along with his birthplace. Write some code that will extract these three values. Once you get it working for Roy Acuff, figure out how to extract these values automatically across the whole dataset of artists. To do this, you'll need a way to determine the correct URL for each artist automatically. Note also that not every artist has all three values, so write your code so that it can handle missing values. Alabama is one such example.

In [124]:
test_string = "Birth: September 15, 1903 - Death: November 23, 1992 Birthplace: Maynardville, Tennessee"

In [133]:
re.findall(r"Birth: (\w+)", test_string)

['September']

In [145]:
re.sub("[a-z]", "", test_string)

'B: S 15, 1903 - D: N 23, 1992 B: M, T'

In [17]:
roy_url = 'https://countrymusichalloffame.org/artist/roy-acuff/'
roy_response = requests.get(roy_url)
roy_soup = BeautifulSoup(roy_response.content, 'lxml')

In [18]:
info_style = 'margin-top: 20px;color: #50565A;font-size: 14px;font-weight: 500;letter-spacing: 1.5px;line-height: 25px; text-transform: uppercase;'
roy_info = roy_soup.find_all('p', attrs = {'style': info_style})
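
To go from Roy Acuff to every artist, the page URL has to be derived from the artist's name. A minimal sketch of one way to do that is shown below: keep the runs of word characters, join them with hyphens, and lowercase the result. The helper name `artist_url` is hypothetical, and it is an assumption that the site follows this pattern for every artist, so it is worth checking the response for each generated URL.

```python
import re

ARTIST_BASE_URL = "https://countrymusichalloffame.org/artist/"

def artist_url(name):
    """Guess an artist's page URL from the display name."""
    # r"\w+" keeps letters, digits and underscores and drops punctuation such as "&".
    slug = "-".join(re.findall(r"\w+", name)).lower()
    return f"{ARTIST_BASE_URL}{slug}/"

print(artist_url("Roy Acuff"))
print(artist_url("Brooks & Dunn"))
```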

In [147]:
info_style = 'margin-top: 20px;color: #50565A;font-size: 14px;font-weight: 500;letter-spacing: 1.5px;line-height: 25px; text-transform: uppercase;'
births = []
deaths = []
birthplaces = []
for artist_name in hof_df_all["artist"].to_list():
    # Build the URL slug: keep only the word characters, join with hyphens, lowercase.
    artist_page_name = '-'.join(re.findall(r"\w+", artist_name)).lower()
    print(artist_page_name)

    artist_url = f'https://countrymusichalloffame.org/artist/{artist_page_name}/'
    artist_response = requests.get(artist_url)
    soup = BeautifulSoup(artist_response.content, 'lxml')
    print(artist_url)
    
    # Some pages have no info paragraph at all (Alabama is one example);
    # in that case search an empty string so that every pattern simply finds nothing.
    artist_info = soup.find_all('p', attrs = {'style': info_style})
    info_text = artist_info[0].text if artist_info else ""
    
    # A field can also be missing from a paragraph that does exist,
    # so each empty result falls back to [""] to keep the lists aligned.
    birthdate = re.findall(r"Birth: (\w+\s\d+, \d+)\s", info_text) or [""]
    deathdate = re.findall(r"Death: (\w+\s\d+, \d+)\s", info_text) or [""]
    birthplace = re.findall(r"Birthplace: (.+)\s$", info_text) or [""]
    
    print(artist_info)
    print(f"Birth Date: {birthdate}")
    print(f"Death Date: {deathdate}")
    print(f"Birthplace: {birthplace}")
    
    print("==========================")
    births.extend(birthdate)
    deaths.extend(deathdate)
    birthplaces.extend(birthplace)

roy-acuff
https://countrymusichalloffame.org/artist/roy-acuff/
[<p style="margin-top: 20px;color: #50565A;font-size: 14px;font-weight: 500;letter-spacing: 1.5px;line-height: 25px; text-transform: uppercase;">Birth: September 15, 1903 - Death: November 23, 1992 <br/> Birthplace: Maynardville, Tennessee </p>]
Birth Date: ['September 15, 1903']
Death Date: ['November 23, 1992']
Birthplace: ['Maynardville, Tennessee']
alabama
https://countrymusichalloffame.org/artist/alabama/
[]
Birth Date: ['']
Death Date: ['']
Birthplace: ['']
bill-anderson
https://countrymusichalloffame.org/artist/bill-anderson/
[<p style="margin-top: 20px;color: #50565A;font-size: 14px;font-weight: 500;letter-spacing: 1.5px;line-height: 25px; text-transform: uppercase;">Birth: November 1, 1937 <br/> Birthplace: Columbia, South Carolina </p>]
Birth Date: ['November 1, 1937']
Death Date: ['']
Birthplace: ['Columbia, South Carolina']
eddy-arnold
https://countrymusichalloffame.org/artist/eddy-arnold/
[<p style="margin-top:

https://countrymusichalloffame.org/artist/patsy-cline/
[<p style="margin-top: 20px;color: #50565A;font-size: 14px;font-weight: 500;letter-spacing: 1.5px;line-height: 25px; text-transform: uppercase;">Birth: September 8, 1932 - Death: March 5, 1963 <br/> Birthplace: Winchester, Virginia </p>]
Birth Date: ['September 8, 1932']
Death Date: ['March 5, 1963']
Birthplace: ['Winchester, Virginia']
hank-cochran
https://countrymusichalloffame.org/artist/hank-cochran/
[<p style="margin-top: 20px;color: #50565A;font-size: 14px;font-weight: 500;letter-spacing: 1.5px;line-height: 25px; text-transform: uppercase;">Birth: August 2, 1935 - Death: July 15, 2010 <br/> Birthplace: Isola, Mississippi </p>]
Birth Date: ['August 2, 1935']
Death Date: ['July 15, 2010']
Birthplace: ['Isola, Mississippi']
paul-cohen
https://countrymusichalloffame.org/artist/paul-cohen/
[<p style="margin-top: 20px;color: #50565A;font-size: 14px;font-weight: 500;letter-spacing: 1.5px;line-height: 25px; text-transform: uppercase;

https://countrymusichalloffame.org/artist/vince-gill/
[<p style="margin-top: 20px;color: #50565A;font-size: 14px;font-weight: 500;letter-spacing: 1.5px;line-height: 25px; text-transform: uppercase;">Birth: April 12, 1957 <br/> Birthplace: Norman, Oklahoma </p>]
Birth Date: ['April 12, 1957']
Death Date: ['']
Birthplace: ['Norman, Oklahoma']
johnny-gimble
https://countrymusichalloffame.org/artist/johnny-gimble/
[<p style="margin-top: 20px;color: #50565A;font-size: 14px;font-weight: 500;letter-spacing: 1.5px;line-height: 25px; text-transform: uppercase;">Birth: May 30, 1926 - Death: May 9, 2015 <br/> Birthplace: Tyler, Texas </p>]
Birth Date: ['May 30, 1926']
Death Date: ['May 9, 2015']
Birthplace: ['Tyler, Texas']
merle-haggard
https://countrymusichalloffame.org/artist/merle-haggard/
[<p style="margin-top: 20px;color: #50565A;font-size: 14px;font-weight: 500;letter-spacing: 1.5px;line-height: 25px; text-transform: uppercase;">Birth: April 6, 1937 - Death: April 6, 2016 <br/> Birthplace:

https://countrymusichalloffame.org/artist/loretta-lynn/
[<p style="margin-top: 20px;color: #50565A;font-size: 14px;font-weight: 500;letter-spacing: 1.5px;line-height: 25px; text-transform: uppercase;">Birth:  April 14, 1935 <br/> Birthplace: Butcher Holler, Kentucky </p>]
Birth Date: ['']
Death Date: ['']
Birthplace: ['Butcher Holler, Kentucky']
uncle-david-macon
https://countrymusichalloffame.org/artist/uncle-david-macon/
[<p style="margin-top: 20px;color: #50565A;font-size: 14px;font-weight: 500;letter-spacing: 1.5px;line-height: 25px; text-transform: uppercase;">Birth: October 7, 1870 - Death: March 22, 1952 <br/> Birthplace: Smart Station, Warren County, Tennessee </p>]
Birth Date: ['October 7, 1870']
Death Date: ['March 22, 1952']
Birthplace: ['Smart Station, Warren County, Tennessee']
barbara-mandrell
https://countrymusichalloffame.org/artist/barbara-mandrell/
[<p style="margin-top: 20px;color: #50565A;font-size: 14px;font-weight: 500;letter-spacing: 1.5px;line-height: 25px; text

https://countrymusichalloffame.org/artist/ray-price/
[<p style="margin-top: 20px;color: #50565A;font-size: 14px;font-weight: 500;letter-spacing: 1.5px;line-height: 25px; text-transform: uppercase;">Birth: January 12, 1926 - Death: December 16, 2013 <br/> Birthplace: Perryville, Texas </p>]
Birth Date: ['January 12, 1926']
Death Date: ['December 16, 2013']
Birthplace: ['Perryville, Texas']
charley-pride
https://countrymusichalloffame.org/artist/charley-pride/
[<p style="margin-top: 20px;color: #50565A;font-size: 14px;font-weight: 500;letter-spacing: 1.5px;line-height: 25px; text-transform: uppercase;">Birth: March 18, 1934 <br/> Birthplace: Sledge, Mississippi </p>]
Birth Date: ['March 18, 1934']
Death Date: ['']
Birthplace: ['Sledge, Mississippi']
jerry-reed
https://countrymusichalloffame.org/artist/jerry-reed/
[]
Birth Date: ['']
Death Date: ['']
Birthplace: ['']
jim-reeves
https://countrymusichalloffame.org/artist/jim-reeves/
[<p style="margin-top: 20px;color: #50565A;font-size: 14px

https://countrymusichalloffame.org/artist/sons-of-the-pioneers/
[]
Birth Date: ['']
Death Date: ['']
Birthplace: ['']
jack-stapp
https://countrymusichalloffame.org/artist/jack-stapp/
[<p style="margin-top: 20px;color: #50565A;font-size: 14px;font-weight: 500;letter-spacing: 1.5px;line-height: 25px; text-transform: uppercase;">Birth: December 8, 1912 - Death: December 20, 1980 <br/> Birthplace: Nashville, Tennessee </p>]
Birth Date: ['December 8, 1912']
Death Date: ['December 20, 1980']
Birthplace: ['Nashville, Tennessee']
ray-stevens
https://countrymusichalloffame.org/artist/ray-stevens/
[<p style="margin-top: 20px;color: #50565A;font-size: 14px;font-weight: 500;letter-spacing: 1.5px;line-height: 25px; text-transform: uppercase;">Birth: January 24, 1939 <br/> Birthplace: Clarkdale, Georgia </p>]
Birth Date: ['January 24, 1939']
Death Date: ['']
Birthplace: ['Clarkdale, Georgia']
cliffie-stone
https://countrymusichalloffame.org/artist/cliffie-stone/
[<p style="margin-top: 20px;color: #5

https://countrymusichalloffame.org/artist/dottie-west/
[<p style="margin-top: 20px;color: #50565A;font-size: 14px;font-weight: 500;letter-spacing: 1.5px;line-height: 25px; text-transform: uppercase;">Birth: October 11, 1932 - Death: September 4, 1991 <br/> Birthplace: McMinnville, Tennessee </p>]
Birth Date: ['October 11, 1932']
Death Date: ['September 4, 1991']
Birthplace: ['McMinnville, Tennessee']
don-williams
https://countrymusichalloffame.org/artist/don-williams/
[<p style="margin-top: 20px;color: #50565A;font-size: 14px;font-weight: 500;letter-spacing: 1.5px;line-height: 25px; text-transform: uppercase;">Birth: May 27, 1939 - Death: September 8, 2017 <br/> Birthplace: Floydada, Texas </p>]
Birth Date: ['May 27, 1939']
Death Date: ['September 8, 2017']
Birthplace: ['Floydada, Texas']
hank-williams
https://countrymusichalloffame.org/artist/hank-williams/
[<p style="margin-top: 20px;color: #50565A;font-size: 14px;font-weight: 500;letter-spacing: 1.5px;line-height: 25px; text-transfo
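
In the log above, Loretta Lynn's birth date comes back empty even though her paragraph contains one: the text reads `Birth:  April 14, 1935` with two spaces after the colon, and the pattern expects exactly one. The sketch below shows a more tolerant variant, assuming a hypothetical helper `parse_artist_info`; it allows any amount of whitespace after each label and returns an empty string for a missing field.

```python
import re

def parse_artist_info(text):
    """Return (birth_date, death_date, birthplace); missing fields become ''."""
    def first(pattern):
        # Take the first match only, so each artist yields exactly one value per field.
        match = re.search(pattern, text)
        return match.group(1).strip() if match else ""

    date = r"(\w+\s+\d{1,2},\s*\d{4})"
    birth = first(r"Birth:\s*" + date)
    death = first(r"Death:\s*" + date)
    place = first(r"Birthplace:\s*(.+?)\s*$")
    return birth, death, place

# Text as it would read after stripping the tags from two paragraphs in the log.
roy_text = "Birth: September 15, 1903 - Death: November 23, 1992  Birthplace: Maynardville, Tennessee "
loretta_text = "Birth:  April 14, 1935  Birthplace: Butcher Holler, Kentucky "

print(parse_artist_info(roy_text))
print(parse_artist_info(loretta_text))
print(parse_artist_info(""))
```

Returning a single value per field, rather than a list of matches, also guarantees that the births, deaths and birthplaces lists grow by exactly one entry per artist.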

In [148]:
bonus_df = pd.DataFrame(
    {
        "artist": all_artists,
        "year_inducted": all_years_inducted,
        "birth_date": births,
        "death_date": deaths,
        "birth_place":  birthplaces
    }
)

bonus_df

|   | artist | year_inducted | birth_date | death_date | birth_place |
| --- | --- | --- | --- | --- | --- |
| 0 | Roy Acuff | 1962 | September 15, 1903 | November 23, 1992 | Maynardville, Tennessee |
| 1 | Alabama | 2005 |  |  |  |
| 2 | Bill Anderson | 2001 | November 1, 1937 |  | Columbia, South Carolina |
| 3 | Eddy Arnold | 1966 | May 15, 1918 | May 8, 2008 | Henderson, Tennessee |
| 4 | Chet Atkins | 1973 | June 20, 1924 | June 30, 2001 | Luttrell, Tennessee |
| ... | ... | ... | ... | ... | ... |
| 135 | Hank Williams | 1961 | September 17, 1923 | January 1, 1953 | Mount Olive, Alabama |
| 136 | Bob Wills | 1968 | March 6, 1905 | May 13, 1975 | Kosse, Texas |
| 137 | Mac Wiseman | 2014 | May 23, 1925 | February 24, 2019 | Crimora, Virginia |
| 138 | Tammy Wynette | 1998 | May 5, 1942 | April 6, 1998 | Itawamba County, Mississippi |
