## Building a Text-Based Data Set by Web Scraping
This notebook builds a text data set by scraping web pages. The scraped text is stored in a dictionary that has the URL as the key and the page's visible text as the value, and the dictionary is then written to a .txt file.

### Imports

In [1]:
import requests #requests package to get the pages

from bs4 import BeautifulSoup #beautiful soup to process/parse the pages
from bs4.element import Comment

### Reads in list of websites from anasoundcloud_websites.txt file
I tested my code using the "anasoundcloud_test.txt" file, which contains only 1 URL, before running it on all of the URLs. "anasoundcloud_websites.txt" contains the URLs for all episodes through November 26, 2019.

In [4]:
#reads in list of A New Angle Soundcloud links as `sites`
sites = []
#with open("anasoundcloud_test.txt",'r') as infile :
with open("anasoundcloud_websites.txt",'r') as infile :
    for line in infile :
        sites.append(line.strip())

In [5]:
#prints all links in `sites`
print(sites)
r = requests.get(sites[0])
#checks HTTP response status codes -> 200 = good
#r.status_code

['https://soundcloud.com/anewangle/sea-change-8-nichole-heyer', 'https://soundcloud.com/anewangle/lara-birkes-sees-a-sustainable-montana', 'https://soundcloud.com/anewangle/jim-sciutto-and-the-shadow-war', 'https://soundcloud.com/anewangle/the-innovation-factory-can-we-teach-creativity', 'https://soundcloud.com/anewangle/sea-change-7-micah-larsen-returns', 'https://soundcloud.com/anewangle/matt-gangloff-final', 'https://soundcloud.com/anewangle/matthamon', 'https://soundcloud.com/anewangle/sheila-stearns-likes-hard-jobs-even-in-retirement', 'https://soundcloud.com/anewangle/nora-saks-and-the-richest-hill', 'https://soundcloud.com/anewangle/sea-change-6-jennifer-palmieri', 'https://soundcloud.com/anewangle/rethining-childcare-with-myvillages-erica-mackey-and-elke-govertsen', 'https://soundcloud.com/anewangle/the-nature-conservancys-leigh-greenwood-keeps-forests-healthy', 'https://soundcloud.com/anewangle/bioscience-clusters-early-childhood-education-at-innovateum-live', 'https://soundcl

### Extracts visible text from each page

In [6]:
#filter that keeps only the visible text nodes of a page
#(the text itself is stored in a dictionary, url -> text, in the next cell)
#from John via stackoverflow
def tag_visible(element):
    if element.parent.name in ['style', 'script', 'head', 'title', 'meta', '[document]', 'div']:
        return False
    if isinstance(element, Comment):
        return False
    return True

In [7]:
anasc_text = dict()

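for link in sites :
    try :
        r = requests.get(link)
    except requests.RequestException :
        #skips links that fail to load instead of reusing the previous response
        continue

    if r.status_code == 200 :
        soup = BeautifulSoup(r.text, 'html.parser')
        #find_all(string=True) is the current spelling of findAll(text=True)
        texts = soup.find_all(string=True)
        visible_texts = filter(tag_visible, texts)
        anasc_text[link] = " ".join(t.strip() for t in visible_texts)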

In [8]:
#anasc_text

The extracted text includes some unwanted text at the beginning and end of the text pulled from each link: things like "JavaScript is disabled" and related error messages, along with "Download... Users who like... " etc.

I messed around trying to exclude it programmatically and was unsuccessful. Here is what I tried in order to EXCLUDE it from my text data set:
* the element `<p class="errorTitle">JavaScript is disabled</p>`, selected with `soup('p', {'class':'errorTitle'})`

SOOO... I ended up deleting it manually in a text editor... and saving it as "anasoundcloud_dataset_clean.txt".
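
A minimal sketch of one way the exclusion could be done: skip the unwanted elements while the text is being collected, rather than filtering afterwards. This is not the approach used in this notebook. It swaps BeautifulSoup for the standard library's `html.parser` so that it runs without extra packages, and the class name `VisibleText` and its `skip_classes` parameter are hypothetical names made up for this illustration. It also assumes the markup closes its tags properly.

```python
from html.parser import HTMLParser

# Tags whose text is never shown on the page.
SKIP_TAGS = {"style", "script", "head", "title"}
# Tags that never have a closing tag, so they are not tracked.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class VisibleText(HTMLParser):
    """Collects page text, skipping chosen tags and CSS classes (hypothetical helper)."""

    def __init__(self, skip_classes=("errorTitle",)):
        super().__init__()
        self.skip_classes = set(skip_classes)
        self.stack = []   # (tag, skipped) for every element that is currently open
        self.chunks = []  # pieces of text that were kept

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        classes = (dict(attrs).get("class") or "").split()
        skipped = tag in SKIP_TAGS or bool(self.skip_classes & set(classes))
        self.stack.append((tag, skipped))

    def handle_endtag(self, tag):
        # Close the most recent matching open tag and anything opened inside it.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        text = data.strip()
        # Keep the text only if no enclosing element is marked as skipped.
        if text and not any(skipped for _, skipped in self.stack):
            self.chunks.append(text)


def visible_text(html, skip_classes=("errorTitle",)):
    parser = VisibleText(skip_classes)
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)
```

With BeautifulSoup itself, a similar effect might be had by calling `decompose()` on each element returned by `soup('p', {'class':'errorTitle'})` before extracting the text, though I have not checked that against the actual SoundCloud pages.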

I also struggled to figure out how to include the number of listens. Its place in the HTML looks like:
* `<li title="441 plays" class="sc-ministats-item">`

So this bit of information is missing from my dataset.
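
A rough sketch of how the play count might be read from an element like the one above, assuming the `<li>` is actually present in the HTML that `requests` receives (it may be added by JavaScript, in which case this would find nothing). It uses the standard library's `html.parser` instead of BeautifulSoup so that it runs on its own. `PlayCountParser` and `extract_play_count` are hypothetical names for this illustration.

```python
import re
from html.parser import HTMLParser


class PlayCountParser(HTMLParser):
    """Finds the first <li class="sc-ministats-item" title="N plays"> (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.plays = None

    def handle_starttag(self, tag, attrs):
        if tag != "li" or self.plays is not None:
            return
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if "sc-ministats-item" not in classes:
            return
        # The title is expected to look like "441 plays" or "1,203 plays".
        match = re.fullmatch(r"([\d,]+) plays?", attrs.get("title") or "")
        if match:
            self.plays = int(match.group(1).replace(",", ""))


def extract_play_count(html):
    """Returns the play count as an int, or None if no matching element is found."""
    parser = PlayCountParser()
    parser.feed(html)
    return parser.plays
```

Calling `extract_play_count(r.text)` inside the scraping loop would then give either a number or `None` for each page, so pages without the element could still be kept in the data set.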

### Writes all visible text from each page to a local text file

In [9]:
#writes the dictionary to a text file as its string representation
with open("anasoundcloud_dataset.txt",'w', encoding='utf-8') as ofile :
    ofile.write(str(anasc_text))
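
Writing `str(anasc_text)` saves the dictionary as a Python literal, which takes extra work to load back later. As a possible alternative (not what this notebook does), here is a small sketch that saves the same url -> text dictionary as JSON with the standard `json` module. The function names `save_dataset` and `load_dataset` are hypothetical.

```python
import json


def save_dataset(data, path):
    """Writes a url -> text dictionary to `path` as JSON."""
    with open(path, "w", encoding="utf-8") as ofile:
        # ensure_ascii=False keeps non-ASCII characters readable in the file
        json.dump(data, ofile, ensure_ascii=False, indent=2)


def load_dataset(path):
    """Reads a dictionary back from a JSON file written by save_dataset."""
    with open(path, "r", encoding="utf-8") as infile:
        return json.load(infile)
```

For example, `save_dataset(anasc_text, "anasoundcloud_dataset.json")` followed by `load_dataset("anasoundcloud_dataset.json")` should give back an equal dictionary.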

### END
The following cells are scratch code I used to test which HTML tags to include or exclude.

In [None]:
print(soup.find_all('li'))

In [None]:
print(soup.find_all('p'))

In [None]:
print(soup.find_all('a'))

In [None]:
print(soup.find_all("span", {"class": "sc-visuallyhidden"}))

In [None]:
<span class="sc-visuallyhidden">363 plays</span>
<span class="sc-ministats sc-ministats-medium sc-ministats-plays">