
## TO DO:

1. **File naming:**
    1. ~~Rename 'regional' chart_type to 'top200'~~
    2. ~~Rename 'viral' chart_type to 'viral50'~~
    3. ~~Rename country_abbrv dirs to country_name dirs~~
        1. ~~csv of country_abbrvs to country_names~~
    4. Create different enable_levels of CSV creation
        1. per day (*enabled*)
        2. per month
        3. per year (*enabled*)
        4. per country
        5. per pull (*enabled*)
    5. Abstraction of make-dir methods (creates directory $\rightarrow$ returns directory path)
2. ~~**Enable chart skipping:** (*maybe for each region page, scrape available dates from dropdown*)~~
    1. ~~Chart non-existent for region~~
    2. ~~Date non-existent for region~~
    3. ~~*Maybe:*~~
        1. ~~spotify charts crawl; passed(crawl_params)~~
        2. ~~crawl chart type; passed(chart_types, chart_regions)~~
        3. ~~crawl chart region; passed(chart_dates)~~
        4. ~~*n.b.* create dataframe on lowest level crawl; i.e. individual chart crawl, passing resulting dataframe all the way back to initial call, allowing empty dataframes to be appended for non-existent charts~~
    4. ~~*Or Maybe:*~~
       1. ~~if chart crawl fails, append URL to a list~~
       2. ~~after web crawl, double check each fail~~
    5. ~~*Or Or Maybe:*~~
        1. ~~handle page-non-existent HTTPErrors differently in the try...except block~~
        2. ~~i.e. handles IncompleteRead errors by:~~
            1. ~~append to list of failed crawls, retrying crawls at the end~~
            2. ~~or simply retrying the url (*once?* check exception_logs; IncompleteRead errors seem to be resolved on a single retry, i.e. they only fail on attempt = 0, not attempt >= 1)~~
    6. ~~***Determine start/end dates of entire pull from oldest/newest dropdown dates***~~
3. **Create data summary methods:**
    1. 200 entries for each Top200 chart
    2. 50 entries for each Viral50 chart
    3. Per country, per year, per month, per day (i.e. highlighting day_charts that are not uniform)
4. **Create data extraction methods:**
    1. List of unique songs; could use???:
        1. `track_title` $\Leftarrow\Rightarrow$ `track_title`
            1. (pro) *do not change*
            2. (con) *naming may not be uniformly formatted across sites*
            3. (con) *many songs have the same name*
            4. ***maybe:***
                1. *use string compare methods that disregard formatting*
                2. *compare strings with different formatting combinations*
                3. *accept some threshold of string similarity*
                4. *also compare `track_artist` using the same string compare methods*
        2. Spotify URL
            1. (pro) *same songs may have same spotify url*
            2. (con) *urls can change over time*
        3. validate using a combination of both
5. **Determine database design:**
    1. create a UML diagram
    2. consider information:
        1. Tables
        2. Primary Keys
        3. Foreign Keys
    3. determine all information needed
6. **Adapt file saving to GoogleDrive:**
    1. extract directory creation and file saving
    2. determine necessary libraries
    3. create GoogleDrive saving methods
7. **Write signatures/documentations/annotations for code:**
    1. import reasoning
    2. program description
    3. method statements
8. ~~**Create master exception log:**~~
    1. ~~include chart_type, chart_region in exception logging~~
    2. ~~append all chart_scrape logs together~~
9. **Repeated code abstraction:**
    1. making directories
    2. making file path names
    3. ~~making soup~~
    4. saving to CSV files
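
The string-compare idea under item 4 (ignore formatting, accept a similarity threshold, and compare `track_artist` the same way) could be sketched as follows. This is only a minimal sketch using the standard library's `difflib`; the helper names `normalize_text` and `is_same_track`, and the `0.9` threshold, are assumptions for illustration and would need tuning against real chart data.

```python
import difflib
import re
import unicodedata


def normalize_text(text: str) -> str:
    """Lowercases, strips accents, drops punctuation and collapses whitespace."""
    decomposed = unicodedata.normalize('NFKD', text)
    no_accents = ''.join(ch for ch in decomposed if not unicodedata.combining(ch))
    no_punct = re.sub(r'[^\w\s]', ' ', no_accents.lower())
    return ' '.join(no_punct.split())


def similarity(text_a: str, text_b: str) -> float:
    """Returns a similarity ratio in [0, 1] between the normalized strings."""
    return difflib.SequenceMatcher(None, normalize_text(text_a), normalize_text(text_b)).ratio()


def is_same_track(title_a: str, artist_a: str,
                  title_b: str, artist_b: str,
                  threshold: float = 0.9) -> bool:
    """Treats two entries as the same song when both title and artist are similar enough.

    The threshold value is an assumption, not a measured cut-off.
    """
    return (similarity(title_a, title_b) >= threshold
            and similarity(artist_a, artist_b) >= threshold)


print(is_same_track('Shape of You', 'Ed Sheeran', 'shape of you!', 'ED SHEERAN'))
print(is_same_track('Closer', 'The Chainsmokers', 'Closer', 'Ne-Yo'))
```

Requiring both scores to pass addresses the "many songs have the same name" con: an identical title alone is not enough when the artists differ.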

In [1]:
from bs4 import BeautifulSoup
import http.client
import urllib.request
import os

import pandas as pd
import numpy as np
import csv

import dateutil.parser
import datetime
import sys

In [2]:
def det_trend_type(trend_fill:str) -> str:
    """Returns the trend type given the fill color of the Spotify Chart trend SVG."""
    # i. color: green; shape: up-triangle
    if trend_fill == '#84bd00':
        return 'Up'
    # ii. color: gray; shape: horizontal-rectangle
    elif trend_fill == '#3e3e40':
        return 'Flat'
    # iii. color: red; shape: down-triangle
    elif trend_fill == '#bd3200':
        return 'Down'
    # iv. color: blue; shape: circle
    elif trend_fill == '#4687d7':
        return 'New'
    # v. unhandled: non-v.1.1 trend_fill
    else:
        return 'UNKNOWN'
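
The same fill-color mapping can also be written as a dictionary lookup, which keeps the colors in one place if the chart styling changes. A minimal sketch: the names `TREND_BY_FILL` and `det_trend_type_lookup` are hypothetical, and lowercasing the input is an added assumption (the original compares the fill string exactly).

```python
# Fill colors copied from det_trend_type; any other value maps to 'UNKNOWN'.
TREND_BY_FILL = {
    '#84bd00': 'Up',    # green up-triangle
    '#3e3e40': 'Flat',  # gray horizontal rectangle
    '#bd3200': 'Down',  # red down-triangle
    '#4687d7': 'New',   # blue circle
}


def det_trend_type_lookup(trend_fill: str) -> str:
    """Returns the trend type for a fill color, or 'UNKNOWN' if unrecognized."""
    return TREND_BY_FILL.get(trend_fill.strip().lower(), 'UNKNOWN')


print(det_trend_type_lookup('#84BD00'))
print(det_trend_type_lookup('#000000'))
```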

In [3]:
def get_datestamp_str(date):
    datestamp_str = date.strftime('%Y-%m-%d')
    return datestamp_str

In [4]:
def get_timestamp_str():
    now_datetime = datetime.datetime.now()
    timestamp_str = now_datetime.strftime('D-%Y-%m-%d_T-%H-%M-%S-%f')
    return timestamp_str

In [5]:
def get_soup(url):
    """Returns the BeautifulSoup object for the given URL"""
    user_agent = 'Mozilla/5.0 (Windows NT 6.1; Win64; x64)'
    url_req = urllib.request.Request(url, headers = {'User-Agent': user_agent})
    
    req_resp = urllib.request.urlopen(url_req)
    req_read = req_resp.read()
    req_resp.close()
    
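    # names the parser explicitly so results do not depend on which parsers are installed
    return BeautifulSoup(req_read, 'html.parser')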

In [6]:
def scrape_region_dict(chart_type_url):
    """Returns a dict mapping each region abbreviation to its region name for the given chart type URL."""
    chart_soup = get_soup(chart_type_url)
    region_li_tags = chart_soup.find('div', 
                                     attrs={'class':'responsive-select','data-type':'country'}).find_all('li')
    
    region_dict = {}
    for li_tag in region_li_tags:
        region_abrv = li_tag.get('data-value').strip()
        region_name = li_tag.get_text().strip()
        region_dict[region_abrv] = region_name
    
    return region_dict

In [7]:
def scrape_chart(date_soup, chart_name, chart_region, chart_date):
    """Returns a list of rows (one list per song) scraped from the chart's <table class="chart-table" ...> tag"""
    tr_tags = date_soup.find('table', class_='chart-table').find('tbody').find_all('tr')
    
    chart_data = []
    for tr_tag in tr_tags:
        
        song_info = []
        
        # 0. Chart info
        song_info.append(chart_name)
        song_info.append(chart_region)
        song_info.append(chart_date)
        
        # 1. Chart Position; element of {1, 2, ... , 200}
        pos_tag = tr_tag.find('td', class_='chart-table-position')
        pos_val = int(pos_tag.get_text().strip())
        song_info.append(pos_val)
        
        # 2. Streaming Trend Type; element of {Up, Flat, Down, New}
        trend_tag = tr_tag.find('td', class_='chart-table-trend')
        trend_fill = trend_tag.find('svg').get('fill').strip()
        trend_val = det_trend_type(trend_fill)
        song_info.append(trend_val)
        
        # 3. Track Title and Artist
        track_tag = tr_tag.find('td', class_='chart-table-track')
        title_val = track_tag.find('strong').get_text().strip()
        artist_val = track_tag.find('span').get_text().strip().replace('by ', '')
        song_info.append(title_val)
        song_info.append(artist_val)
        
        # 4. Total Streams
        if chart_name == 'Viral50':
            song_info.append(None)
        else:
            streams_tag = tr_tag.find('td', class_='chart-table-streams')
            streams_val = int(streams_tag.get_text().strip().replace(',',''))
            song_info.append(streams_val)
        
        # 5. Icon and Spotify URLs
        icon_tag = tr_tag.find('td', class_='chart-table-image')
        icon_url_val = icon_tag.find('img').get('src').strip()
        spotify_url_val = icon_tag.find('a').get('href').strip()
        song_info.append(icon_url_val)
        song_info.append(spotify_url_val)
        
        chart_data.append(song_info)

    return chart_data

In [8]:
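def scrape_charts_dates(region_url, start_date, end_date):
    """Returns the chart dates available in the region's dropdown that fall within
    [start_date, end_date], in reversed dropdown order."""
    region_soup = get_soup(region_url)
    date_div = region_soup.find('div', attrs={'class':'responsive-select','data-type':'date'})
    
    # returns an empty list when the page has no date dropdown
    if date_div is None:
        return []
    else:
        date_li_tags = date_div.find_all('li')
        region_chart_dates_list = []
        
        for li_tag in date_li_tags:
            date = dateutil.parser.isoparse(li_tag.get('data-value').strip()).date()
            
            if start_date <= date <= end_date:
                region_chart_dates_list.append(date)
        
        # list.reverse() works in place and returns None, so reverse first, then return the list
        region_chart_dates_list.reverse()
        return region_chart_dates_list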

In [9]:
def crawl_spotify_charts(start_date: datetime.date,
                         end_date: datetime.date,
                         crawl_top: bool = False,
                         crawl_viral: bool = False,
                         colab_save: bool = False) -> pd.DataFrame:
    """Returns a pd.DataFrame containing information scraped from https://spotifycharts.com."""
    # n.b. crawl_top, crawl_viral and colab_save are accepted but not used yet
    # 1. defines the charts to loop through
    BASE_URL = 'https://spotifycharts.com'
    chart_type_dict = {'regional':'Top200', 'viral':'Viral50'}
    chart_interval_list = ['daily', 'weekly']
    
    # 2. instantiates result lists
    result_data = []
    error_list = []
    
    # 3. scrapes all available charts
    # 3.1. loops through chart types
    for chart_type, chart_name in chart_type_dict.items():
        
        chart_type_url = BASE_URL + '/' + chart_type
        chart_regions_dict = scrape_region_dict(chart_type_url)
            
        # 3.2. loops through chart regions
        for chart_region_abrv, chart_region_name in chart_regions_dict.items():
            
            # 3.3. loops through chart intervals
            for chart_interval in chart_interval_list:
                
                chart_region_url = chart_type_url + '/' + chart_region_abrv + '/' + chart_interval
                
                # 3.3.1. scrapes the available chart dates; on failure, logs the error and skips
                # n.b. the caught exception types are an assumption (the original try had no except clause)
                try:
                    charts_dates_list = scrape_charts_dates(chart_region_url, start_date, end_date)
                except (urllib.error.URLError, http.client.HTTPException) as e:
                    print('Error: ', chart_region_abrv, ' - ', chart_type, '\n\turl: ', chart_region_url, ' ; ', type(e), '\n\n')
                    error_list.append([chart_type, chart_region_name, None, chart_region_url, 0, e])
                    continue

                # 3.4. loops through all available chart dates
                for chart_date in charts_dates_list:

                    chart_date_url = chart_region_url + '/' + chart_date.isoformat()

                    # 3.4.1. tries up to 3 times to read the chart's html page
                    for i in range(0,3):
                        # 3.4.1.1. tries to read the url, and if successful, scrapes and appends its data
                        try:
                            chart_date_soup = get_soup(chart_date_url)
                            chart_date_data = scrape_chart(chart_date_soup, 
                                                           chart_name, 
                                                           chart_region_name, 
                                                           chart_date)
                            # scrape_chart returns a list of rows, so extend (not append) keeps one row per song
                            result_data.extend(chart_date_data)
                            print('Appended: ', chart_date, ' - ', chart_region_abrv, ' - ', chart_type)
                            break

                        # 3.4.1.2. catches IncompleteRead exceptions, then retries the URL
                        except http.client.IncompleteRead as e:
                            print('Error: ', chart_date, ' - ', chart_region_abrv, ' - ', chart_type, ', Attempt ' , i ,'\n\turl: ', chart_date_url, ' ; ', type(e), '\n\n')
                            error_list.append([chart_type, chart_region_name, chart_date, chart_date_url, i, e])
                            continue
                    
                    # 3.4.2. otherwise, if every attempt failed (no break), skips this chart date
                    else:
                        print('Skipped: ', chart_date, ' - ', chart_region_abrv, ' - ', chart_type, '\n\turl: ', chart_date_url, '\n\n')
                        error_list.append([chart_type, chart_region_name, chart_date, chart_date_url, i, 'skipped'])

    # 4. creates both the result and error pd.DataFrames from the accumulated list of data entries
    result_col_names = ['Chart', 'Region' , 'Date', 'Position', 'Trend', 
                        'Title', 'Artist', 'Streams', 'Icon_URL', 'Spotify_URL']
    result_df = pd.DataFrame(result_data, columns = result_col_names)
    error_col_names = ['Chart_Type', 'Region', 'Date', 'URL', 'Attempt', 'Exception']
    error_df = pd.DataFrame(error_list, columns = error_col_names)
    
    # 5. saves both the result and error pd.DataFrame as csv files
    timestamp_str = get_timestamp_str()
    tgt_dir_path = './pull_' + timestamp_str
    if not os.path.exists(tgt_dir_path): os.mkdir(tgt_dir_path)
    
    start_date_str = get_datestamp_str(start_date)
    end_date_str = get_datestamp_str(end_date)

    result_filename = 'Result_' + timestamp_str + '_FrmD-' + start_date_str + '_ToD-' + end_date_str + '.csv'
    result_filepath = os.path.join(tgt_dir_path, result_filename)
    result_df.to_csv(result_filepath)
    
    error_filename = 'ExceptionLog_' + timestamp_str + '_FrmD-' + start_date_str + '_ToD-' + end_date_str + '.csv'
    error_filepath = os.path.join(tgt_dir_path, error_filename)
    error_df.to_csv(error_filepath)
    
    # 6. returns the result pd.DataFrame
    return result_df
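
The retry logic in step 3.4.1 relies on Python's `for ... else`: the `else` branch runs only when the loop finishes without hitting `break`, i.e. when every attempt failed. A minimal offline sketch of that pattern follows; `fetch_with_retries` and `make_flaky_fetch` are hypothetical helpers, and the fake fetch function stands in for `get_soup` so that nothing is downloaded.

```python
import http.client


def fetch_with_retries(fetch, url, max_attempts=3):
    """Calls fetch(url) up to max_attempts times; returns (result, errors)."""
    errors = []
    for attempt in range(max_attempts):
        try:
            result = fetch(url)
            break  # success: leaves the loop, so the else branch is not run
        except http.client.IncompleteRead as e:
            errors.append((url, attempt, e))
    else:
        # reached only when no attempt succeeded
        result = None
        errors.append((url, max_attempts, 'skipped'))
    return result, errors


def make_flaky_fetch(failures):
    """Builds a fake fetch that raises IncompleteRead for the first `failures` calls."""
    state = {'calls': 0}

    def fetch(url):
        state['calls'] += 1
        if state['calls'] <= failures:
            raise http.client.IncompleteRead(b'')
        return 'page for ' + url

    return fetch


result, errors = fetch_with_retries(make_flaky_fetch(1), 'chart-url')
print(result, len(errors))
```

Returning the error list alongside the result keeps the logging decision with the caller, which is how the crawl above accumulates its `error_list`.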

In [None]:
# WEB CRAWL CALL
# crawl_spotify_charts()

In [None]:
# TESTING:

In [None]:
# 0. vars:
oneday_delta = datetime.timedelta(days = 1)

jan2017_strt = datetime.date(2017,1,1)
feb2017_strt = datetime.date(2017,2,1)
mar2017_strt = datetime.date(2017,3,1)

jan2017_end = feb2017_strt - oneday_delta
feb2017_end = mar2017_strt - oneday_delta

In [None]:
# 1. test call:
test_df = crawl_spotify_charts(start_date = jan2017_end, end_date = feb2017_strt)
test_df
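
The data summary item in the TO DO list (200 entries per Top200 chart, 50 per Viral50 chart, highlighting charts that are not uniform) could be checked with a simple count per chart, region and date. This is a sketch under assumptions: `find_nonuniform_charts` and `EXPECTED_ENTRIES` are hypothetical names, and the rows are assumed to start with the `Chart`, `Region`, `Date` columns in that order, as in the result DataFrame (for example, rows obtained via `test_df.values.tolist()`).

```python
from collections import Counter

# Expected number of entries per chart name, taken from the TO DO list.
EXPECTED_ENTRIES = {'Top200': 200, 'Viral50': 50}


def find_nonuniform_charts(rows):
    """Returns {(chart, region, date): entry_count} for charts with an unexpected count.

    Each row is a sequence whose first three items are chart, region and date.
    """
    counts = Counter((row[0], row[1], row[2]) for row in rows)
    return {key: count for key, count in counts.items()
            if count != EXPECTED_ENTRIES.get(key[0])}


# Made-up rows: a complete Viral50 chart and a Top200 chart missing one entry.
sample_rows = ([['Viral50', 'Global', '2017-01-31', pos] for pos in range(1, 51)]
               + [['Top200', 'Global', '2017-01-31', pos] for pos in range(1, 200)])
print(find_nonuniform_charts(sample_rows))
```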

In [None]:
date_str = '2019-10-10'
date_split = date_str.split('-')
datetime.date(int(date_split[0]), int(date_split[1]), int(date_split[2])).isoformat()


In [None]:
dateutil.parser.isoparse(date_str).date()
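
For plain `YYYY-MM-DD` strings like the one above, the standard library can do the same conversion without splitting the string by hand. A small sketch, assuming Python 3.7 or newer (where `datetime.date.fromisoformat` is available):

```python
import datetime

date_str = '2019-10-10'

# Parses an ISO 8601 calendar date directly into a datetime.date.
parsed_date = datetime.date.fromisoformat(date_str)
print(parsed_date.isoformat())
```

`dateutil.parser.isoparse` remains the more permissive choice when the strings may carry a time component.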

<br>

### [ $ \uparrow $ ]  Implemented Code

<br>

---

### [ $ \downarrow $ ]  Scrapped Code

<br>

In [None]:
# n.b. credentials redacted; they should be loaded from the environment rather than hard-coded
# client_id = 'YOUR_SPOTIFY_CLIENT_ID'
# client_secret = 'YOUR_SPOTIFY_CLIENT_SECRET'

# client_credentials_manager = spotipy.oauth2.SpotifyClientCredentials(client_id, client_secret)
# sp = spotipy.Spotify(client_credentials_manager = client_credentials_manager)

In [None]:
# UNUSED: while the downloadable csv does contain most of the target information, it proved
# difficult to parse and unreliable in formatting

# scrape_chart_csv : str -> DataFrame
# returns a DataFrame of information scraped from the csv-styled downloadable byte file 
def scrape_chart_csv(csv_url):
    
    # i. handles the URL
    user_agent = 'Mozilla/5.0 (Windows NT 6.1; Win64; x64)'
    url_req = urllib.request.Request(csv_url, headers = {'User-Agent': user_agent})
    url_resp = urllib.request.urlopen(url_req)
    
    # ii. handles the opened byte file
    csv_bytes = url_resp.read()
    csv_str = csv_bytes.decode(encoding='utf-8', errors='strict')
    
    csv_data_str = csv_str[csv_str.find('Position'):].strip()
    csv_data_lines = csv_data_str.split('\n')
    
    url_resp.close()

    # iii. loops through the data string list extracting needed information
    data_list = []
    for line in csv_data_lines:
        
        song_info = []
        
        # 1. Chart Position
        position = line.partition(',')[0].strip()
        if position.isdigit():
            position = int(position)
        song_info.append(position)
            
        # 2. Spotify URL
        url = line.partition(',')[2].rpartition(',')[2].strip()
        song_info.append(url)
        
        # 3. Total Streams
        streams = line.partition(',')[2].rpartition(',')[0].rpartition(',')[2].strip()
        if streams.isdigit():
            streams = int(streams)
        song_info.append(streams)
        
        # 4. Track Artist
        artist = line.partition(',')[2].rpartition(',')[0].rpartition(',')[0].rpartition(',')[2].strip()
        artist = artist.replace('"','').strip()
        song_info.append(artist)
        
        # 5. Track Title
        title = line.partition(',')[2].rpartition(',')[0].rpartition(',')[0].rpartition(',')[0].strip()
        title = title.replace('"','').strip()
        song_info.append(title)
        
        data_list.append(song_info)
    
    # iv. creates a DataFrame of the scraped song information
    cols_names = data_list[0]
    song_data = data_list[1:]
    chart_csv_df = pd.DataFrame(song_data, columns = cols_names)
    
    return chart_csv_df
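
Part of the parsing difficulty noted above comes from splitting on commas by hand, which breaks when a quoted title or artist itself contains a comma. The standard `csv` module handles quoted fields, as in the sketch below. The sample text and its column names are made up for illustration; the real download's layout is only assumed to have a header row that starts with `Position`, as the scrapped function expects.

```python
import csv
import io


def parse_chart_csv(csv_text: str) -> list:
    """Returns a list of row dicts parsed from csv text, starting at the 'Position' header."""
    header_start = csv_text.find('Position')
    if header_start == -1:
        return []
    reader = csv.DictReader(io.StringIO(csv_text[header_start:]))
    return [dict(row) for row in reader]


# Hypothetical sample: one preamble line, a header row, and one data row
# whose title contains a comma inside quotes.
sample_csv = (
    'note: preamble line before the header\n'
    'Position,"Track Name",Artist,Streams,URL\n'
    '1,"Hello, Goodbye","Some Artist",12345,https://example.com/track/1\n'
)

rows = parse_chart_csv(sample_csv)
print(rows[0]['Track Name'])
```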

In [None]:
def crawl_chart(chart_soup):
 
    top_charts_df = pd.DataFrame(columns = col_names)
    
    # ii. creates the base URL
    base_url = 'https://spotifycharts.com/' + chart_type + '/' + chart_region + '/' + chart_interval

    # iii. loops through DateTimes from start_date to end_date inclusively
    try_attempt = 0
    curr_date_iter = start_date
    while curr_date_iter <= end_date:
        
        # 1. formats the date and creates its URL
        curr_iso_date = curr_date_iter.isoformat()
        date_url = base_url + '/' + curr_iso_date
        
        # 2. scrapes the URL into a DataFrame
        try:
            date_table_df = scrape_chart_table(date_url, chart_type, chart_region,  curr_date_iter, col_names)

        except Exception  as e:
            if try_attempt < 3:
                print('Error: ', curr_iso_date, ', Attempt ' , try_attempt ,'\n\turl: ', date_url, ' ; ', sys.exc_info()[0], '\n\n')
                try_attempt += 1
                continue
            else:
                print('Error: ', curr_iso_date, ', Attempt ' , try_attempt ,'\n\turl: ', date_url, ' ; ', sys.exc_info()[0], '\n\n')
                exception_list.append([curr_iso_date, curr_date_iter, date_url, str(e)])
                try_attempt = 0
                curr_date_iter = curr_date_iter + datetime.timedelta(days = 1)
                continue
        
        # •••. creates the necessary directories and saves the files
        if enable_date_csvs: save_date_csv(curr_date_iter, date_table_df, crawl_dir_name)
        
        # 3. appends the DataFrame to the running total dataset
        # n.b. DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
        top_charts_df = pd.concat([top_charts_df, date_table_df], ignore_index = True)
        
        # 4. advances the while-iterator
        try_attempt = 0
        curr_date_iter = curr_date_iter + datetime.timedelta(days = 1)
    
    # iv. formats the resulting DataFrame

    # 1. specifies the name of the csv file
    now_datetime_str = get_timestamp_str()
    start_date_str = get_datestamp_str(start_date)
    end_date_str = get_datestamp_str(end_date)

    # 2. creates and saves to the file path top charts DataFrame as a csv file
    chart_name =  chart_type.title() + '-' + chart_region.title() + '-'  + chart_interval.title() + '_' 
    all_charts_filename = chart_name + 'AllCharts_' + 'FrmD-' + start_date_str + '-ToD-' + end_date_str + '_At-' + now_datetime_str +  '.csv'
    all_charts_filepath = os.path.join(crawl_dir_name, all_charts_filename)
    top_charts_df.to_csv(all_charts_filepath)

    # •••. instantiates exception log DataFrame
    el_col_names = ['Atmpt_Date', 'Atmpt_datetime', 'Atmpt_URL', 'Exception']
    exception_log_df = pd.DataFrame(data = exception_list, columns = el_col_names)

    # •••. creates and saves to the file path exception log DataFrame as a csv file 
    exception_log_filename = 'ExceptionLog_' + now_datetime_str + '_FrmD-' + start_date_str + '_ToD-' + end_date_str + '.csv'
    exception_log_filepath = os.path.join(crawl_dir_name, exception_log_filename)
    exception_log_df.to_csv(exception_log_filepath)
    
    # vi. returns: top charts DataFrame for the date range
    return top_charts_df

In [None]:
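# save_date_csv : date, DataFrame, str -> None
# saves the chart DataFrame as a csv file under crawl_dir_name/YYYY/MM/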
def save_date_csv(curr_date, date_table_df, crawl_dir_name):
    
    year_dir_name = curr_date.strftime('%Y')
    year_dir_path = os.path.join(crawl_dir_name, year_dir_name)
    if not os.path.exists(year_dir_path): os.mkdir(year_dir_path)
    
    month_dir_name = curr_date.strftime('%m')
    month_dir_path = os.path.join(year_dir_path, month_dir_name)
    if not os.path.exists(month_dir_path): os.mkdir(month_dir_path)
    
    timestamp_str = get_timestamp_str()
    curr_datestamp = get_datestamp_str(curr_date)
    
    curr_date_filename = 'ChartDate-' + curr_datestamp + '_At-' + timestamp_str + '.csv'
    curr_date_filepath = os.path.join(month_dir_path, curr_date_filename)
    
    date_table_df.to_csv(curr_date_filepath)
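
The repeated "check, then `os.mkdir`" pattern above is what the TO DO list's make-dir abstraction (creates directory, returns directory path) refers to. One possible sketch uses `os.makedirs` with `exist_ok=True`, which also creates the intermediate year and month directories in a single call. The helper name `ensure_dir` is hypothetical.

```python
import os
import tempfile


def ensure_dir(*path_parts: str) -> str:
    """Creates the directory (and any missing parents) and returns its path."""
    dir_path = os.path.join(*path_parts)
    os.makedirs(dir_path, exist_ok=True)  # no error if the directory already exists
    return dir_path


# Usage: builds <root>/2017/01 inside a throwaway temporary directory.
with tempfile.TemporaryDirectory() as root_dir:
    month_dir_path = ensure_dir(root_dir, '2017', '01')
    print(os.path.isdir(month_dir_path))
```

With such a helper, `save_date_csv` could obtain `month_dir_path` in one line and go straight to building the file name.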