# DS 3000 HW 4

Due: Sun Mar 6 @ 11:59 PM EST

### Submission Instructions
Submit this `ipynb` file to [gradescope](https://www.gradescope.com/courses/337250).  To ensure that your submitted `ipynb` file represents your latest code, run a fresh `Kernel > Restart & Run All` just before uploading the `ipynb` file to gradescope.

### Tips for success
- Start early
- Make use of [Piazza](https://course.ccs.neu.edu/ds3000/admin.html#piazza-discussion-forum)
- Make use of [Office Hours](https://course.ccs.neu.edu/ds3000/office_hours.html)
- Remember that [Documentation / style counts for credit](https://course.ccs.neu.edu/ds3000/style_guide.html)
- Under no circumstances may one student view or share their ungraded homework or quiz with another student [(see also)](https://course.ccs.neu.edu/ds3000/syllabus.html#academic-integrity-and-conduct)

## Overview: "Good" songs

You will start **(but not complete)** a comparison of "good" songs as determined by two websites.
 - The [best music](https://pitchfork.com/reviews/best/tracks/) according to [Pitchfork](https://pitchfork.com/)
     - new (mostly independent) music
 - The [best music](https://www.billboard.com/articles/news/list/9494940/best-songs-2020-top-100/) according to [Billboard](https://www.billboard.com/)
     - "good" defined based on record sales    
    
The analysis pipeline will
 - scrape top songs from pitchfork
 - scrape top songs from billboard
 - query the Spotify API to get popularity rankings on each song
 - produce the histogram shown below

<img src="https://i.ibb.co/0Z8VPQV/Screenshot-from-2021-02-25-15-02-18.png" alt="Drawing" style="width: 400px;"/>


## Part 1: Program design (28 points)
The task above may be completed by running the following script.  Note that `clean_pitchfork()` and `clean_billboard()` both return dataframes with columns `track` and `artist`.

```python
url_pitchfork = 'https://pitchfork.com/reviews/best/tracks/'
url_billboard = 'https://www.billboard.com/articles/news/list/9494940/best-songs-2020-top-100/'
spot_api_key = '<spotify-key-here>'

# get html of each set of songs
html_str_pitchfork = get_url(url_pitchfork)
html_str_billboard = get_url(url_billboard)

# web scrape tracks from html of pages
df_pitchfork = clean_pitchfork(html_str_pitchfork)
df_billboard = clean_billboard(html_str_billboard)

# record source of each track
df_pitchfork['source'] = 'pitchfork'
df_billboard['source'] = 'billboard'

# concatenate all tracks
df_track = pd.concat((df_pitchfork, df_billboard), axis=0)

# query spotify API for popularity of each track
df_track = get_popularity(df_track, api_key=spot_api_key)

# plot histogram of popularity per source
hist_feat(df_track, feat='popularity')
```

For each of the functions listed in sub-parts below, write a function statement and docstring.  

The "work" of this problem is being able to clearly define the inputs and outputs as needed so the pipeline produces the desired result.  Be sure to describe the inputs / outputs of each function by writing the function statement / docstring as shown in the example below:

```python

def some_fnc(input0, input1):
    """ this function does a thing!
    
    Args:
        input0 (type of input0): input0 is a ...
        input1 (type of input1): input1 is ...
        
    Returns:
        output0 (type of output0): output0 is ...
    """
    # "pass" lets us end an indented body without causing
    # any errors from the python interpreter
    pass
```

### Part 1.1: `get_url()`

In [None]:
# Solution Part 1.1
def get_url(url):
    """ gets html associated with a url 
    
    Args:
        url (str): url website to look up
        
    Returns:
        html_str (str): html associated with this url
    """
    pass

### Part 1.2: `clean_pitchfork()`
(No need to write a separate docstring for `clean_billboard()`, as it has the same inputs / outputs as `clean_pitchfork()`.)

In [None]:
# Solution Part 1.2
def clean_pitchfork(html_str):
    """ scrapes artists / track from a pitchfork page
    
    ex:
    https://pitchfork.com/reviews/best/tracks/?page=1
    
    Args:
        html_str (str): html of pitchfork page
        
    Returns:
        df_pitch (DataFrame): each row is a song.  
            contains columns 'artist', 'track'  
    """
    pass

### Part 1.3: `get_popularity()`

In [None]:
# Solution Part 1.3
def get_popularity(df_track, api_key):
    """ queries spotify API for popularity of all songs in df_track
    
    Args:
        df_track (DataFrame): a dataframe which contains (at least)
            columns: artist & track.  one row per song
        api_key (str): api key of spotify API
        
    Returns:
        df_track (DataFrame): each row is a song.  
            contains columns 'artist', 'track' and 
            `popularity` as well as other columns in
            the input df_track.  note: this is a copy
            of the input (doesn't overwrite)
    """
    pass

### Part 1.4: `hist_feat()`

In [None]:
# Solution Part 1.4
def hist_feat(df, feat):
    """ produces a histogram of feat per unique source
    
    Args:
        df (DataFrame): contains a column source
        feat (str): some other column of input df
    """
    pass

### Part 2: Build `get_url()` (6 points)
When you're done, check that it works by outputting to the jupyter notebook the `html_str` associated with input:
```python
url='https://www.billboard.com/media/lists/best-songs-2020-top-100-9494940/'
```

Tip: you can click or double click the margin just below `Out[x]` to hide / limit this output ... the full html string can be quite long.

In [None]:
# solution Part 2
import requests

def get_url(url):
    """ gets html associated with a url 
    
    Args:
        url (str): url website to look up
        
    Returns:
        html_str (str): html associated with this url
    """
    return requests.get(url).text

In [None]:
url='https://www.billboard.com/media/lists/best-songs-2020-top-100-9494940/'
get_url(url)


### Part 3:  Build `clean_pitchfork()`  (28 points)

Build `clean_pitchfork()`

- You may skip the initial track "Porridge Radio"
- Be sure to remove the double quotes: `“` `”` from the track names.  Note these are not the typical straight double quote character (`"`); copy and paste them from above to ensure you get the proper string match.
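
As a minimal sketch of the quote removal described above, the two directional quote characters can be deleted in one pass with `str.translate`. The helper name `strip_curly_quotes` is hypothetical (it is not one of the functions the assignment asks for), and the characters are written as Unicode escapes so they cannot be confused with straight quotes:

```python
# Hypothetical helper, shown only to illustrate the quote removal step.
# '\u201c' is the left directional double quote, '\u201d' the right one.
CURLY_QUOTES = str.maketrans('', '', '\u201c\u201d')


def strip_curly_quotes(track):
    """ removes directional double quotes from a track name

    Args:
        track (str): track name, possibly wrapped in directional quotes

    Returns:
        track_clean (str): track name without directional quotes
    """
    return track.translate(CURLY_QUOTES)


# directional quotes are removed, straight quotes are left alone
print(strip_curly_quotes('\u201cMidnight Sun\u201d'))
print(strip_curly_quotes('say "hi"'))
```

Calling `str.replace()` twice, once per character, has the same effect; the translation table just keeps both characters in one place.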

When you're done, check that it works by outputting to the jupyter notebook the first few rows of a DataFrame of Pitchfork songs:
```python
url = 'https://pitchfork.com/reviews/best/tracks/?page=1'
html_str = get_url(url)
df_pitch = clean_pitchfork(html_str)
df_pitch.head()
```

which should show (as of Feb 22 @ 1PM):

|   |          artist |                                      track |    source |
|--:|----------------:|-------------------------------------------:|----------:|
| 0 |           yeule |                           Bites on My Neck | pitchfork |
| 1 |       Two Shell |                                       home | pitchfork |
| 2 |   Nilüfer Yanya |                               Midnight Sun | pitchfork |
| 3 |        Soul Glo | Jump!! (Or Get Jumped!!!)((By the Future)) | pitchfork |
| 4 | Earl Sweatshirt |                                       2010 | pitchfork |

In [None]:
# solution part 3

from bs4 import BeautifulSoup
import pandas as pd

def clean_pitchfork(html_str):
    """ scrapes artists / track from a pitchfork page
    
    ex:
    https://pitchfork.com/reviews/best/tracks/?page=1
    
    Args:
        html_str (str): html of pitchfork page
        
    Returns:
        df_pitch (DataFrame): each row is a song.  
            contains columns 'artist', 'track' and 
            'source' (source is always 'pitchfork')    
    """
    # build soup (naming the parser explicitly avoids a bs4 warning)
    soup = BeautifulSoup(html_str, 'html.parser')
    
    # collect one dict per song, build the dataframe once at the end
    # (DataFrame.append was removed in pandas 2.0)
    song_list = []
    for song in soup.find_all('div', class_='track-collection-item'):
        # extract artist
        artist = song.find_all('ul', class_='artist-list')[0].text
        
        # extract track
        track = song.find_all('h2', class_='track-collection-item__title')[0].text
        
        # discard all directional double quotes
        track = track.replace('“', '')
        track = track.replace('”', '')
        
        # collect song data
        song_dict = {'artist': artist, 
                     'track': track,
                     'source': 'pitchfork'}
        song_list.append(song_dict)
        
    return pd.DataFrame(song_list, columns=['artist', 'track', 'source'])

In [None]:
# solution part 3
url = 'https://pitchfork.com/reviews/best/tracks/?page=1'
html_str = get_url(url)
df_pitch = clean_pitchfork(html_str)
df_pitch.head()

### Part 4 Managing the scrolling on Pitchfork's website (10 points)

Notice that as one scrolls to the bottom of the pitchfork page the `?page=x` counter increments.  [Try it yourself](https://pitchfork.com/reviews/best/tracks/).  Just as we did with the API work, we can modify the URL to get different sets of songs from Pitchfork.

Write a script which scrolls through 10 pages of Pitchfork's music recommendations and collects all the songs you find into a single `df_pitch` DataFrame.  Be sure to use the functions you've created above.

Validation: We found 56 songs running this on Feb 22.
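
The URL modification described above can be sketched without touching the network. The helper `build_page_urls` below is hypothetical (not part of the assignment); it only assumes the `?page=x` query pattern noted above:

```python
# Hypothetical helper: builds the list of page urls to visit, assuming
# pages are numbered 1, 2, ... via the ?page=x query parameter.
def build_page_urls(base_url, max_page):
    """ builds one url per page, pages are numbered from 1

    Args:
        base_url (str): url of the listing, without a query string
        max_page (int): last page to include

    Returns:
        url_list (list): one url (str) per page
    """
    return [f'{base_url}?page={page_idx}'
            for page_idx in range(1, max_page + 1)]


urls = build_page_urls('https://pitchfork.com/reviews/best/tracks/',
                       max_page=10)
print(len(urls))
print(urls[0])
print(urls[-1])
```

Each url in the list can then be passed to `get_url()` and `clean_pitchfork()` in turn.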

In [None]:
# solution part 4

max_page = 10

# collect one dataframe per page, concatenate once at the end
# (DataFrame.append was removed in pandas 2.0)
df_page_list = []
for page_idx in range(1, max_page + 1):
    # build url of a given page of pitchfork music
    url = f'https://pitchfork.com/reviews/best/tracks/?page={page_idx}'
    
    # get pitchfork page
    html_str = get_url(url)
    
    # process to dataframe
    df_pitch_page = clean_pitchfork(html_str)
    
    # store this page's songs
    df_page_list.append(df_pitch_page)

# collect all pages in one common dataframe: df_pitch
df_pitch = pd.concat(df_page_list, ignore_index=True)

# output shape to jupyter notebook (56 songs total when I ran this Feb 25)
df_pitch.shape

### Part 5 (28 points)

<img src="https://i.ibb.co/wht5NB0/Screenshot-from-2022-02-23-05-14-26.png" alt="Drawing" style="width: 600px;"/>

Write a function, `clean_quote()` which scrapes all the quotes from https://www.brainyquote.com/topics/websites-quotes:

```python
url = 'https://www.brainyquote.com/topics/websites-quotes'
html = get_url(url)
df_quote = clean_quote(html)
df_quote.head()
```

gives:

|   |          author |                                              text |
|--:|----------------:|--------------------------------------------------:|
| 0 |  Shreya Ghoshal | I'm not a gadget freak, so to say. I own an iP... |
| 1 | Anthony Carmona | Social media websites are no longer performing... |
| 2 |    M. J. Hyland | As is the case for many people with multiple s... |
| 3 |     Brie Larson | There are so many opportunities to learn thing... |
| 4 |      Ben Barnes |        There are loads of websites devoted to me. |

**Extra Credit (up to +3 points)**: Navigate to each quote's own webpage and you'll find more information:

<img src="https://i.ibb.co/ZKQS1ks/Screenshot-from-2022-02-23-05-14-37.png" alt="Drawing" style="width: 600px;"/>

Store the tags associated with each quote too.  For example, Bill Gates's quote above has tags: `'truth'`, `'government'`, `'internet'`, `'never'` and `'hard'`.  Think carefully about how you store the tags so that one may easily understand how many times each tag (e.g. `'internet'`) appears in your dataframe with simple pandas manipulations (hint: look at how tags are stored for boardgames in `Out [3]` of the `ipynb` for the [board game example project](https://course.ccs.neu.edu/ds3000/proj_example.html)).
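
One possible storage layout, sketched here under the assumption that each tag gets its own 0/1 indicator column, makes counting a tag a matter of summing its column. The example below uses plain dictionaries and made-up tag lists (not scraped data); the helper name `one_hot_rows` is hypothetical. A list of such dictionaries can be handed to `pd.DataFrame()` to get one column per tag:

```python
from collections import Counter

# made-up example data: tags for three hypothetical quotes
quote_tags = [['truth', 'internet'],
              ['internet', 'people'],
              ['people']]


def one_hot_rows(tag_lists):
    """ maps each list of tags to a dict of 0/1 indicators

    Args:
        tag_lists (list): one list of tags (str) per quote

    Returns:
        rows (list): one dict per quote, keys are all observed tags,
            values are 1 if the quote has the tag, else 0
    """
    all_tags = sorted({tag for tags in tag_lists for tag in tags})
    return [{tag: int(tag in tags) for tag in all_tags}
            for tags in tag_lists]


rows = one_hot_rows(quote_tags)

# summing each indicator over all rows counts how often the tag appears
tag_count = Counter()
for row in rows:
    tag_count.update(row)

print(rows[0])
print(tag_count['internet'])
```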


In [None]:
# solution part 5
def clean_quote(html_str, collect_tags=False):
    """ cleans brainy quote quotes
    
    ex: https://www.brainyquote.com/topics/websites-quotes
    
    Args:
        html_str (str): html of website
        collect_tags (bool): toggles whether tags are 
            collected on each quote
        
    Returns:
        df_quote (pd.DataFrame): contains columns author,
            text and (potentially) a column for every tag (e.g.
            Communication, Family, Positive, Love...).
            values of a tag column are 1 where tag 
            applies to the row's quote, otherwise they're 0
    """
    # build soup from html_str (naming the parser avoids a bs4 warning)
    soup = BeautifulSoup(html_str, 'html.parser')

    # collect one dict per quote, build the dataframe once at the end
    # (DataFrame.append was removed in pandas 2.0)
    quote_list = []
    for quote in soup.find_all('div', class_='grid-item'):
        
        if 'm-ad-brick' in quote.attrs['class']:
            # ad, skip this one
            continue

        # get author and text
        a_list = quote.find_all('a')
        author = a_list[1].text
        text = a_list[0].text.strip()

        quote_dict = {'author': author,
                      'text': text}

        if collect_tags:
            # ex credit: get quote tags
            # find website of individual quote
            href = 'https://www.brainyquote.com' + quote.a.attrs['href']
            quote_html = get_url(href)
            quote_soup = BeautifulSoup(quote_html, 'html.parser')

            # collect tags
            tags = quote_soup.find_all('div', class_='kw-box')[0].text
            for tag in tags.strip().split('\n'):
                quote_dict[tag] = 1

        # store quote
        quote_list.append(quote_dict)

    # one row per quote, missing tags show up as NaN
    df_quote = pd.DataFrame(quote_list)

    if collect_tags:
        # ex credit: all NaNs are where tag wasn't observed, map them to 0
        df_quote = df_quote.fillna(0)
        
    return df_quote

In [None]:
# solution part 5
url = 'https://www.brainyquote.com/topics/websites-quotes'
html = get_url(url)
df_quote = clean_quote(html)
df_quote.head()

### (Solution) Extra Credit Analysis: Which tags are often associated with websites?

In [None]:
# solution part 5
url = 'https://www.brainyquote.com/topics/websites-quotes'
html = get_url(url)
df_quote = clean_quote(html, collect_tags=True)

In [None]:
# solution part 5
# take the mean of all tag columns (fraction of quotes with each tag);
# numeric_only=True skips the string columns author and text
df_quote_mean = df_quote.mean(numeric_only=True)

# display top 10 most common tags in website quotes:
# (People = .2166667 implies that about 22% of website quotes have the tag "People")
df_quote_mean.sort_values(ascending=False).head(10)
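
The reading of the mean as a fraction can be checked on a small made-up indicator column: because each value is 0 or 1, the mean equals the share of rows where the tag is present. The values below are illustrative only, not taken from the scraped quotes:

```python
from statistics import mean

# made-up indicator column: 1 where a quote has the tag, otherwise 0
has_tag = [1, 0, 0, 1, 0]

# mean of 0/1 values = (number of 1s) / (number of rows)
frac = mean(has_tag)
print(f'{frac:.0%} of quotes have the tag')
```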