# DS 3000 HW 4

Due: Sun Mar 6 @ 11:59 PM EST

### Submission Instructions
Submit this `ipynb` file to [gradescope](https://www.gradescope.com/courses/337250).  To ensure that your submitted `ipynb` file represents your latest code, make sure to give a fresh `Kernel > Restart & Run All` just before uploading the `ipynb` file to gradescope.

### Tips for success
- Start early
- Make use of [Piazza](https://course.ccs.neu.edu/ds3000/admin.html#piazza-discussion-forum)
- Make use of [Office Hours](https://course.ccs.neu.edu/ds3000/office_hours.html)
- Remember that [Documentation / style counts for credit](https://course.ccs.neu.edu/ds3000/style_guide.html)
- Under no circumstances may one student view or share their ungraded homework or quiz with another student [(see also)](https://course.ccs.neu.edu/ds3000/syllabus.html#academic-integrity-and-conduct)

## Overview: "Good" songs

You will start **(but not complete)** a comparison of "good" songs as determined by two websites.
 - The [best music](https://pitchfork.com/reviews/best/tracks/) according to [Pitchfork](https://pitchfork.com/)
     - new (mostly independent) music
 - The [best music](https://www.billboard.com/articles/news/list/9494940/best-songs-2020-top-100/) according to [Billboard](https://www.billboard.com/)
     - "good" defined based on record sales    
    
The analysis pipeline will
 - scrape top songs from pitchfork
 - scrape top songs from billboard
 - query the Spotify API to get popularity rankings on each song
 - produce the histogram shown below

<img src="https://i.ibb.co/0Z8VPQV/Screenshot-from-2021-02-25-15-02-18.png" alt="Drawing" style="width: 400px;"/>


## Part 1: Program design (28 points)
The task above may be completed by running the following script.  Note that `clean_pitchfork()` and `clean_billboard()` both return dataframes with columns `track` and `artist`.

```python
url_pitchfork = 'https://pitchfork.com/reviews/best/tracks/'
url_billboard = 'https://www.billboard.com/articles/news/list/9494940/best-songs-2020-top-100/'
spot_api_key = '<spotify-key-here>'

# get html of each set of songs
html_str_pitchfork = get_url(url_pitchfork)
html_str_billboard = get_url(url_billboard)

# web scrape tracks from html of pages
df_pitchfork = clean_pitchfork(html_str_pitchfork)
df_billboard = clean_billboard(html_str_billboard)

# record source of each track
df_pitchfork['source'] = 'pitchfork'
df_billboard['source'] = 'billboard'

# concatenate all tracks
df_track = pd.concat((df_pitchfork, df_billboard), axis=0)

# query spotify API for popularity of each track
df_track = get_popularity(df_track, api_key=spot_api_key)

# plot histogram of popularity per source
hist_feat(df_track, feat='popularity')
```
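
As a minimal sketch of what the concatenation step produces, the example below stacks two small made-up dataframes. The track and artist names are placeholders rather than scraped data, and the only assumption carried over from the script is that both cleaning functions return `track` and `artist` columns. The `ignore_index=True` argument is an addition that is not in the script above.

```python
import pandas as pd

# placeholder rows standing in for the output of clean_pitchfork() / clean_billboard()
df_pitchfork = pd.DataFrame({'track': ['track a', 'track b'],
                             'artist': ['artist a', 'artist b']})
df_billboard = pd.DataFrame({'track': ['track c'],
                             'artist': ['artist c']})

# record source of each track
df_pitchfork['source'] = 'pitchfork'
df_billboard['source'] = 'billboard'

# stack the rows of both dataframes; ignore_index=True renumbers the rows 0, 1, 2
df_track = pd.concat((df_pitchfork, df_billboard), axis=0, ignore_index=True)

# number of tracks contributed by each source
print(df_track['source'].value_counts())
```

Without `ignore_index=True` the row labels of the two inputs are kept as they are (here 0, 1, 0), which is harmless for a histogram but can be surprising when selecting rows with `.loc`.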

For each of the functions listed in sub-parts below, write a function statement and docstring.  

The "work" of this problem is being able to clearly define the inputs and outputs as needed so the pipeline produces the desired result.  Be sure to describe the inputs / outputs of each function by writing the function statement / docstring as shown in the example below:

```python

def some_fnc(input0, input1):
    """ this function does a thing!
    
    Args:
        input0 (type of input0): input0 is a ...
        input1 (type of input1): input1 is ...
        
    Returns:
        output0 (type of output0): output0 is ...
    """
    # "pass" allows us to end an indented body without causing
    # any errors from the python interpreter
    pass
```

### Part 1.1: `get_url()`

In [1]:
def get_url(url):
    """using requests, gets the html source of the url that is given
    
    Args:
        url (str): url of the desired webpage
        
    Returns:
        html (str): html source of the url given
    """
    pass

### Part 1.2: `clean_pitchfork()`
(No need to write a separate docstring for `clean_billboard()`, as it has the same inputs / outputs as `clean_pitchfork()`.)

In [2]:
def clean_pitchfork(html_str_pitchfork):
    """extracts the track and artist of each song from the html of a pitchfork best tracks page
    
    Args:
        html_str_pitchfork (str): html source of the pitchfork webpage, as returned by get_url()
        
    Returns:
        df_tracks (pd.DataFrame): one row per song, with columns `track` and `artist`
    """
    pass

### Part 1.3: `get_popularity()`

In [3]:
def get_popularity(df_track, api_key):
    """queries api using api_key and retrieves content on the popularity of each track
    
    Args:
        df_track (pd.DataFrame): dataframe with information about all tracks
        api_key (str): api key used to query the api for track popularity
    
    Returns:
        df_track (pd.DataFrame): dataframe with information about all tracks including popularity
    """
    pass

### Part 1.4: `hist_feat()`

In [4]:
def hist_feat(df_track, feat):
    """plots a histogram of one feature (e.g. popularity) of the tracks, per source
    
    Args:
        df_track (pd.DataFrame): dataframe with information about all tracks,
            including columns `source` and `feat`
        feat (str): name of the column plotted on the histogram
    
    Returns:
        hist (plot): histogram plot of `feat`
    
    """
    pass

### Part 2: Build `get_url()` (6 points)
When you're done, check that it works by outputting to the jupyter notebook the `html_str` associated with input:
```python
url='https://www.billboard.com/media/lists/best-songs-2020-top-100-9494940/'
```

Tip: you can click or double click the margin just below `Out[x]` to hide / limit this output ... the full html string can be quite long.

In [5]:
import requests

In [6]:
def get_url(url):
    """gets the html source of the given url using requests
    
    Args:
        url (str): url of the desired webpage
        
    Returns:
        html (str): html source of the url given"""
    
    html = requests.get(url).text
    
    return html

In [7]:
# validating the get_url() function works
url = 'https://www.billboard.com/media/lists/best-songs-2020-top-100-9494940/'
get_url(url)
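
A slightly more defensive variant is sketched below. The name `get_url_checked()` and the 10 second default timeout are assumptions made for illustration and are not part of the assignment; the `timeout` argument and `raise_for_status()` are standard `requests` features.

```python
import requests


def get_url_checked(url, timeout=10):
    """ gets the html source of the given url, raising an error on failure

    Args:
        url (str): url of the desired webpage
        timeout (float): seconds to wait for the server before giving up

    Returns:
        html (str): html source of the url given
    """
    # the timeout keeps the notebook from hanging on an unresponsive server
    response = requests.get(url, timeout=timeout)

    # raises requests.exceptions.HTTPError on a 4xx / 5xx status code
    response.raise_for_status()

    return response.text
```

Failing loudly here means a blocked or missing page is reported when it is requested, rather than later as an empty DataFrame from the cleaning step.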




### Part 3:  Build `clean_pitchfork()`  (28 points)

Build `clean_pitchfork()`

- You may skip the initial track "Porridge Radio"
- Be sure to remove the double quotes: `“` `”` from the track names.  Note these are not the typical <shift + comma> character, copy and paste them from above to ensure you get the proper string match.
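
One possible way to strip the curly quotes is sketched below. `strip_curly_quotes()` is a hypothetical helper name, and the sample string simply reuses a track title that appears in this assignment.

```python
def strip_curly_quotes(track):
    """ removes the curly double quotes “ and ” from a track name

    Args:
        track (str): track name as scraped, e.g. '“home”'

    Returns:
        track_clean (str): track name without curly double quotes
    """
    # \u201c is “ and \u201d is ”; both differ from the straight quote "
    return track.replace('\u201c', '').replace('\u201d', '')


print(strip_curly_quotes('“Midnight Sun”'))
```

Writing the characters as unicode escapes avoids any doubt about which quote character ended up in the source after copying and pasting.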

When you're done, check that it works by outputting to the jupyter notebook the first few rows of a DataFrame of Pitchfork songs:
```python
url = 'https://pitchfork.com/reviews/best/tracks/?page=1'
html_str = get_url(url)
df_pitch = clean_pitchfork(html_str)
df_pitch.head()
```

which should show (as of Feb 22 @ 1PM):

|   |          artist |                                      track |    source |
|--:|----------------:|-------------------------------------------:|----------:|
| 0 |           yeule |                           Bites on My Neck | pitchfork |
| 1 |       Two Shell |                                       home | pitchfork |
| 2 |   Nilüfer Yanya |                               Midnight Sun | pitchfork |
| 3 |        Soul Glo | Jump!! (Or Get Jumped!!!)((By the Future)) | pitchfork |
| 4 | Earl Sweatshirt |                                       2010 | pitchfork |

In [8]:
import pandas as pd
from bs4 import BeautifulSoup

In [9]:
def clean_pitchfork(html_str_pitchfork):
    """extracts the artist and track name of each song on a pitchfork best tracks page
    
    Args:
        html_str_pitchfork (str): html source of the pitchfork webpage, as returned by get_url()
        
    Returns:
        df_tracks (pd.DataFrame): one row per song, with columns artist, track and source
    """    
    # build soup object from the html passed in (naming the parser avoids a bs4 warning)
    soup = BeautifulSoup(html_str_pitchfork, 'html.parser')
    
    # collect one dictionary per song
    track_list = []
    
    # loop over the html block of each song
    for item in soup.find_all('div', class_='track-collection-item'):
        title = item.find_all('h2', class_='track-collection-item__title')[0]
        
        track_list.append({
            # artist name is the first list item of the block
            'artist': item.ul.li.text,
            # discard the curly double quotes around the track name
            'track': title.text.replace('“', '').replace('”', ''),
            # record the source of the track
            'source': 'pitchfork'})

    # build dataframe from the collected songs
    df_tracks = pd.DataFrame(track_list, columns=['artist', 'track', 'source'])
    
    return df_tracks
    

In [10]:
# outputting first few rows of Pitchfork's best tracks

url = 'https://pitchfork.com/reviews/best/tracks/?page=1'
html_str = get_url(url)
df_pitch = clean_pitchfork(html_str)
df_pitch.head()

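|   |            artist |                                      track |    source |
|--:|------------------:|-------------------------------------------:|----------:|
| 0 | Caroline Polachek |                                   Billions | pitchfork |
| 1 |             yeule |                           Bites on My Neck | pitchfork |
| 2 |         Two Shell |                                       home | pitchfork |
| 3 |     Nilüfer Yanya |                               Midnight Sun | pitchfork |
| 4 |          Soul Glo | Jump!! (Or Get Jumped!!!)((By the Future)) | pitchfork |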


### Part 4: Managing the scrolling on Pitchfork's website (10 points)

Notice that as one scrolls to the bottom of the pitchfork page the `?page=x` counter increments.  [Try it yourself](https://pitchfork.com/reviews/best/tracks/).  Just as we did with the API work, we can modify the URL to get different sets of songs from Pitchfork.

Write a script which scrolls through 10 pages of Pitchfork's music recommendations and collects all the songs you find into a single `df_pitch` DataFrame.  Be sure to use the functions you've created above.

Validation: We found 56 songs running this on Feb 22.

In [11]:
# scrolls through 10 pages of Pitchfork's music recommendations and collects all songs to put into df_pitch

# collect the DataFrame of each page
df_page_list = []

# scroll through 10 pages of Pitchfork
for page in range(1, 11):
    
    # page url
    url = f'https://pitchfork.com/reviews/best/tracks/?page={page}'
    
    # get html string of the url
    html_str = get_url(url)
    
    # clean html source to get songs
    df_page = clean_pitchfork(html_str)
    
    # store the songs of this page
    df_page_list.append(df_page)

# concatenate the songs of all pages into one DataFrame
df_pitch = pd.concat(df_page_list, ignore_index=True)

# display df_pitch with all songs from 10 pages of Pitchfork's music recs
df_pitch

|   |            artist |                                      track |    source |
|--:|------------------:|-------------------------------------------:|----------:|
| 0 | Caroline Polachek |                                   Billions | pitchfork |
| 1 |             yeule |                           Bites on My Neck | pitchfork |
| 2 |         Two Shell |                                       home | pitchfork |
| 3 |     Nilüfer Yanya |                               Midnight Sun | pitchfork |
| 4 |          Soul Glo | Jump!! (Or Get Jumped!!!)((By the Future)) | pitchfork |
| 5 |   Earl Sweatshirt |                                       2010 | pitchfork |
| 6 |             Adele |                                To Be Loved | pitchfork |
| 7 |            Mitski |                      The Only Heartbreaker | pitchfork |
| 8 |              Jlin |                                     Embryo | pitchfork |
| 9 |  The War on Drugs |                  I Don’t Live Here Anymore | pitchfork |


In [12]:
# check if there are 56 songs
assert len(df_pitch) == 56

In [13]:
# get the length (number of songs) of df_pitch
len(df_pitch)

56

### Part 5 (28 points)

<img src="https://i.ibb.co/wht5NB0/Screenshot-from-2022-02-23-05-14-26.png" alt="Drawing" style="width: 600px;"/>

Write a function, `clean_quote()` which scrapes all the quotes from https://www.brainyquote.com/topics/websites-quotes:

```python
url = 'https://www.brainyquote.com/topics/websites-quotes'
html = get_url(url)
df_quote = clean_quote(html)
df_quote.head()
```

gives:

|   |          author |                                              text |
|--:|----------------:|--------------------------------------------------:|
| 0 |  Shreya Ghoshal | I'm not a gadget freak, so to say. I own an iP... |
| 1 | Anthony Carmona | Social media websites are no longer performing... |
| 2 |    M. J. Hyland | As is the case for many people with multiple s... |
| 3 |     Brie Larson | There are so many opportunities to learn thing... |
| 4 |      Ben Barnes |        There are loads of websites devoted to me. |

**Extra Credit (up to +3 points)**: Navigate to each quote's own webpage and you'll find more information:

<img src="https://i.ibb.co/ZKQS1ks/Screenshot-from-2022-02-23-05-14-37.png" alt="Drawing" style="width: 600px;"/>

Store the tags associated with each quote too.  For example, Bill Gates's quote above has tags: `'truth'`, `'government'`, `'internet'`, `'never'` and `'hard'`.  Think carefully about how you store the tags so that one may easily understand how many times each tag (e.g. `'internet'`) appears in your dataframe with simple pandas manipulations (hint: look at how tags are stored for boardgames in `Out [3]` of the `ipynb` for the [board game example project](https://course.ccs.neu.edu/ds3000/proj_example.html)).
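
One way to make tag counts easy to compute is sketched below: one boolean column per tag, so that a column sum gives the number of quotes carrying that tag. This is only an illustration under stated assumptions. The rows in `quote_list` are placeholders rather than scraped quotes, the `tag_` column prefix is a hypothetical naming choice, and the layout is not necessarily the one used in the board game example.

```python
import pandas as pd

# placeholder rows: each quote carries a list of its tags
quote_list = [
    {'author': 'author a', 'text': 'first quote', 'tags': ['truth', 'internet']},
    {'author': 'author b', 'text': 'second quote', 'tags': ['internet']},
    {'author': 'author c', 'text': 'third quote', 'tags': []},
]

# every tag which appears at least once, in a fixed order
all_tags = sorted({tag for quote in quote_list for tag in quote['tags']})

# build one row per quote, with a True / False entry for every tag
row_list = []
for quote in quote_list:
    row = {'author': quote['author'], 'text': quote['text']}
    for tag in all_tags:
        row[f'tag_{tag}'] = tag in quote['tags']
    row_list.append(row)

df_quote = pd.DataFrame(row_list)

# summing a boolean column counts the quotes which have that tag
tag_cols = [f'tag_{tag}' for tag in all_tags]
tag_count = df_quote[tag_cols].sum()
print(tag_count)
```

Storing the tags as a single comma separated string would keep the dataframe narrower, but counting a tag would then require splitting every string first.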


In [14]:
def clean_quote(url):
    """ 
    scrapes all the quotes from a brainyquote topic page,
    storing the author and text of each quote
    
    https://www.brainyquote.com/topics/websites-quotes
    
    Args:
        url (str) : url of the brainyquote webpage with quotes
        
    Returns:
        df_quotes (pd.DataFrame) : one row per quote, with columns author and text
    
    """
    
    # html source from url
    html_str = get_url(url)
    
    # create a BeautifulSoup object from given html_str
    soup = BeautifulSoup(html_str, 'html.parser')
    
    # collect one dictionary per quote
    quote_list = []
    
    # iterate through quotes for author and quote
    for div in soup.find_all('div', class_='clearfix'):

        # find all <a>: the first holds the quote text, the second the author
        all_a = div.find_all('a')
        
        # store author and text of this quote
        quote_list.append({'author': all_a[1].text,
                           'text': all_a[0].text.strip()})
        
    # build dataframe with quotes and authors
    df_quotes = pd.DataFrame(quote_list, columns=['author', 'text'])
    
    return df_quotes

In [15]:
df_quote = clean_quote('https://www.brainyquote.com/topics/websites-quotes')
df_quote.head()

|   |              author |                                              text |
|--:|--------------------:|--------------------------------------------------:|
| 0 |      Shreya Ghoshal | I'm not a gadget freak, so to say. I own an iP... |
| 1 |     Anthony Carmona | Social media websites are no longer performing... |
| 2 |        M. J. Hyland | As is the case for many people with multiple s... |
| 3 |          Ben Barnes |        There are loads of websites devoted to me. |
| 4 | Matthew McConaughey | I go on The Daily Beast. The Daily Beast is on... |
