# Regional Analysis of Rapper Influences - Data Collection

Inspired by my trip to Atlanta, this project analyzes the regional upbringing and influences of the top-selling rappers of the last few decades. There is no complete, readily available dataset of top-selling rap artists, but there are webpages that list the top sellers for each year. In this notebook, I scrape one such page per year to collect the artists with the biggest first-week album sales of that year. In essence, I aim to get a rough sample of the best-selling rappers over time. The scraped list is then cleaned into a readable, usable format for analysis.

In [379]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import re

## Web Scrape the Data

In [380]:
# URL prefix shared by the yearly pages; the year is appended to it
base_url = 'https://beats-rhymes-lists.com/sales/biggest-hip-hop-album-first-week-sales-of-'
page_range = range(1998, 2024)  # Years 1998 through 2023 (the end of a range is exclusive)
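
As a quick sanity check on the year range, the sketch below builds the per-year URLs without requesting them. The `urls` list is a name introduced here for illustration and is not part of the notebook.

```python
base_url = 'https://beats-rhymes-lists.com/sales/biggest-hip-hop-album-first-week-sales-of-'
page_range = range(1998, 2024)  # range() excludes its end value, so the last year is 2023

# Hypothetical helper list: one URL per year, built the same way as in the scraping loop
urls = [f'{base_url}{year}' for year in page_range]

print(len(urls))  # number of pages that will be requested
print(urls[0])    # first page (1998)
print(urls[-1])   # last page (2023)
```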

In [417]:
rows = []  # One dict per scraped heading

for page_number in page_range:
    url = f'{base_url}{page_number}'
    response = requests.get(url)

    if response.status_code == 200:
        soup = BeautifulSoup(response.text, 'html.parser')

        # Select every <h2> heading on the page; non-entry headings are dropped during cleaning
        target_elements = soup.select('h2')

        for element in target_elements:
            # Keep the heading text together with the year of the page it came from
            extracted_text = element.text.strip()
            rows.append({'Top Artists': extracted_text, 'Year': page_number})
    else:
        # Report the failed year and move on to the next page
        print(f"Failed to retrieve the page for {page_number}. Status code: {response.status_code}")

# Build the DataFrame once, after all pages have been processed
scraped_df = pd.DataFrame(rows, columns=['Top Artists', 'Year'])


## Clean Data

In [418]:
drop_values = ['All Time', 'Album Sales', 'First Week Sales', 'Best-Selling', 'Most Number 1']
scraped_df['Top Artists'] = scraped_df['Top Artists'].astype(str)
scraped_df['Year'] = scraped_df['Year'].astype(int)
scraped_df = scraped_df[~scraped_df['Top Artists'].str.contains('|'.join(drop_values))].reset_index()
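
`str.contains` interprets its pattern as a regular expression by default, so joining the phrases with `|` works here only because none of them contains a character with special regex meaning. A minimal sketch of a safer pattern, using only the standard library; the sample headings and the `is_noise` helper are made up for illustration:

```python
import re

drop_values = ['All Time', 'Album Sales', 'First Week Sales', 'Best-Selling', 'Most Number 1']

# Escape each phrase so characters such as '.' or '(' would be matched literally
drop_pattern = re.compile('|'.join(re.escape(value) for value in drop_values))

def is_noise(heading):
    """Return True when the heading contains one of the phrases to drop (hypothetical helper)."""
    return drop_pattern.search(heading) is not None

# Made-up headings: the first is an entry to keep, the second is a section title to drop
samples = [
    '1.\n \n  Example Album\n \n by Example Artist',
    'Biggest First Week Sales of Example Year',
]
kept = [heading for heading in samples if not is_noise(heading)]
print(kept)
```

The same escaped pattern string could be passed to `str.contains` in place of the plain `'|'.join(drop_values)`.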

In [419]:
# with pd.option_context('display.max_rows', None, 'display.max_columns', None): 
#     print(scraped_df)
print(scraped_df)

     index                                        Top Artists  Year
0        0    10.\n \n  It’s Dark and Hell Is Hot\n \n by DMX  1998
1        1  9.\n \n  Charge It 2 Da Game\n \n Silkk by The...  1998
2        2     8.\n \n  Vol. 2… Hard Knock Life\n \n by Jay-Z  1998
3        3          7.\n \n  Ghetto Fabulous\n \n by Mystikal  1998
4        4  6.\n \n  Tical 2000: Judgement Day\n \n by Met...  1998
..     ...                                                ...   ...
312    412  YoungBoy Never Broke Again –\n \n  I Rest My Case  2023
313    413                Lil Yachty –\n \n  Let’s Start Here  2023
314    414                      Don Toliver –\n \n  Love Sick  2023
315    415                             Yeat –\n \n  Afterlyfe  2023
316    416                 Trippie Redd –\n \n  Mansion Musik  2023

[317 rows x 3 columns]
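
The printed rows show that each heading carries embedded line breaks and padding (`\n \n`). For inspection it can help to collapse that whitespace. A small sketch; `normalize_heading` is a hypothetical helper, not part of the notebook:

```python
def normalize_heading(text):
    """Collapse every run of whitespace (spaces, newlines) into a single space."""
    return ' '.join(text.split())

raw = '10.\n \n  It’s Dark and Hell Is Hot\n \n by DMX'  # first row of the printed output
print(normalize_heading(raw))  # 10. It’s Dark and Hell Is Hot by DMX
```

Because some of the extraction patterns used later key on `\n`, this is meant for display only, not as a preprocessing step.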


In [450]:
# Dataset 1: years 1998-2003 and 2011-2021, where the artist follows 'by '
clean_df1 = pd.DataFrame(columns=['Top Artists', 'Year'])

for index, row in scraped_df.iterrows():
    if 1998 <= row['Year'] <= 2003 or 2011 <= row['Year'] <= 2021:
        substring = 'by '
        parts = row['Top Artists'].split(substring, 1) # Split once on 'by '; the artist is the second part
        clean_tmp1 = {'Top Artists': parts[1].strip(), 'Year': row['Year']} # Pair the artist with its year
        clean_df1 = pd.concat([clean_df1, pd.DataFrame([clean_tmp1])], ignore_index=True) # Append to dataframe

# Dataset 2: years 2022-2023, where the artist comes first, followed by ' –'
scraped_tmp = scraped_df.copy()

def extract_artist1(entry):
    # Capture the start of the entry up to ' –\n', or up to the first '\n' if there is no dash
    match = re.search(r'^(.*?)( –\n|\n)', entry)
    return match.group(1) if match else None

# Apply the extraction function only when 'Year' is in [2022, 2023]
condition = scraped_tmp['Year'].isin([2022, 2023]) 
scraped_tmp.loc[condition, 'Top Artists'] = scraped_tmp.loc[condition, 'Top Artists'].apply(extract_artist1) 

# Select the relevant columns
clean_df2 = scraped_tmp.loc[condition, ['Top Artists', 'Year']]

# Dataset 3: years 2004-2010, where the artist sits between the rank ('<n>. ') and ' –'
scraped_tmp = scraped_df.copy()

def extract_artist2(entry):
    # Capture the text after '<digits>. ' and before the whitespace and dash that follow the artist
    match = re.search(r'\d+\.\s([^\n]+)\s–', entry)
    return match.group(1) if match else None

# Apply the extraction function only when 'Year' is in 2004-2010 (both ends inclusive)
condition = scraped_tmp['Year'].between(2004, 2010)
scraped_tmp.loc[condition, 'Top Artists'] = scraped_tmp.loc[condition, 'Top Artists'].apply(extract_artist2)

# Select the relevant columns
clean_df3 = scraped_tmp.loc[condition, ['Top Artists', 'Year']]

# Merge the three cleaned dataframes and order them by year
full_df = pd.concat([clean_df1, clean_df3, clean_df2])
full_df = full_df.sort_values(by=['Year'])
print(full_df)

                    Top Artists  Year
0                           DMX  1998
1                   The Shocker  1998
2                         Jay-Z  1998
3                      Mystikal  1998
4                    Method Man  1998
..                          ...   ...
315                        Yeat  2023
312  YoungBoy Never Broke Again  2023
313                  Lil Yachty  2023
314                 Don Toliver  2023
316                Trippie Redd  2023

[317 rows x 2 columns]
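
The three cleaning passes apply one extraction rule per year range. As an alternative view of the same logic, the sketch below folds the rules into a single function that picks the rule from the year. `extract_artist` is a hypothetical name. The 1998 and 2023 strings come from the printed scrape output, while the 2005 string is invented because no 2004-2010 row is shown. Unlike the loop for dataset 1, this version returns `None` instead of raising when `'by '` is missing, which is an assumption about the desired behavior.

```python
import re

def extract_artist(entry, year):
    """Return the artist from a scraped heading, using the rule for that year's page format."""
    if 2004 <= year <= 2010:
        # '<rank>. <artist> – <album>': capture the text between the rank and the dash
        match = re.search(r'\d+\.\s([^\n]+)\s–', entry)
        return match.group(1) if match else None
    if year >= 2022:
        # '<artist> –\n ... <album>': capture the first line up to the dash
        match = re.search(r'^(.*?)( –\n|\n)', entry)
        return match.group(1) if match else None
    # Remaining years: '<album> ... by <artist>', so keep what follows the first 'by '
    parts = entry.split('by ', 1)
    return parts[1].strip() if len(parts) == 2 else None

print(extract_artist('10.\n \n  It’s Dark and Hell Is Hot\n \n by DMX', 1998))  # DMX
print(extract_artist('Yeat –\n \n  Afterlyfe', 2023))                           # Yeat
print(extract_artist('3. Example Artist – Example Album', 2005))                # Example Artist
```

As in the notebook, the `'by '` rule splits on the first occurrence, so an album title that itself contains `'by '` would yield the wrong artist.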


In [451]:
# Export DataFrame to CSV File
full_df.to_csv('topSellingRappers_1998-2023.csv')