# Country & Soul
### *Analyzing Trends in American Music Journalism from 1960 to Present*
---

# Digital Analytical Edition
#### Table of Contents
1. [Introduction](#introduction)
2. [Libraries & Set Up](#paragraph1)
3. [Creating the Library Table](#paragraph2)
    1. [Parsing Dates](#subparagraph1)
    2. [Narrowing the Scope](#subparagraph2)
4. [Constructing the Corpus](#paragraph3)
5. [Extracting a Vocabulary](#paragraph4)

## Introduction<a name="introduction"></a>
In this notebook, I aggregate the source files to create a Digital Analytical Edition (DAE).  

The DAE contains the following tables:  
+ `LIB.csv`    - Metadata for each article.
+ `CORPUS.csv` - Aggregated text from source files. Indexed by Ordered Hierarchy of Content Object (OHCO) structure.
+ `VOCAB.csv`  - Linguistic features and statistics for words in `CORPUS.csv`. Stop words removed.
+ `BOW.csv`    - Bag-of-words representation of `CORPUS.csv` with stop words removed.

Source files were scraped from [rocksbackpages.com](https://www.rocksbackpages.com/) using `rbpscraper.py`. I present an example of `rbpscraper.py` functionality here. I do **not** recommend that the reader run this code: `rbpscraper.py` requires a subscription to rocksbackpages.com and several hours of your time!

In [1]:
import sys
sys.path.append('.')

from rbpscraper import RBPScraper

country = RBPScraper(desc = "country", write_path = "./datadir/")
    # desc - description of articles to be scraped
    # write_path - path to directory where articles will be saved

Please paste URL:
https://www.rocksbackpages.com/sample-search-url

Thank you.

Please paste cookies:
sample=cookies; AcceptedCookieNotice=true

Thank you.


After instantiating an `RBPScraper` object, the user is prompted to enter the URL for the first page of search results, and then the cookies used for authentication. The web-scraping process is started by chaining the methods below. This saves the articles as HTML files and a metadata table as a CSV file in the specified directory.

In [2]:
#country.searchScraper().articleScraper().writeLIB()

---
## Libraries & Set Up<a name="paragraph1"></a>

In [3]:
import pandas as pd
import numpy as np
from datetime import date

import nltk
import spacy
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer, TfidfTransformer

In [4]:
# path to data
dataPath = "./data/"
articlePath = "./data/html/"

# define OHCO structure
OHCO = ['article_id', 'para_id', 'sent_id', 'token_id']
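
Because the OHCO levels are ordered from coarsest to finest, any prefix of the list names a unit of analysis (article, paragraph, sentence). The following is a minimal pure-Python sketch of that idea, not code from this notebook: `token_index` holds made-up index tuples and `count_tokens` is a hypothetical helper.

```python
from collections import Counter

OHCO = ['article_id', 'para_id', 'sent_id', 'token_id']

# Toy token index: one tuple per token, ordered as in OHCO (made-up values).
token_index = [
    ('c1', 0, 0, 0), ('c1', 0, 0, 1), ('c1', 0, 1, 0),
    ('c1', 1, 0, 0), ('s1', 0, 0, 0), ('s1', 0, 0, 1),
]

def count_tokens(index, level):
    """Count tokens per unit at the given OHCO level (e.g. 'para_id')."""
    depth = OHCO.index(level) + 1          # number of leading index levels to keep
    return Counter(key[:depth] for key in index)

per_article = count_tokens(token_index, 'article_id')
per_sentence = count_tokens(token_index, 'sent_id')
```

With a pandas MultiIndex, the same grouping would typically be expressed as a `groupby` over `OHCO[:depth]`.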

---
## Creating the Library Table<a name="paragraph2"></a>
Two library tables were created during the webscraping process. I combine, process, and refine the tables.

In [5]:
# read in metadata table for each genre
country_metadata = pd.read_csv(f"{dataPath}cLIB.csv")
country_metadata.set_index("id", inplace = True)

soul_metadata = pd.read_csv(f"{dataPath}sLIB.csv")
soul_metadata.set_index("id", inplace = True)

In [6]:
# create LIB table
LIB = pd.concat([country_metadata, soul_metadata])

LIB.index.rename(OHCO[0], inplace = True)

#### Parsing Dates<a name="subparagraph1"></a>
Inspection reveals that the publication dates are not in a standard format. I define a function that parses date strings. The following table describes how dates are assigned to ambiguous formats.

|`date` format|output format|
|------------|-------------|
|Month *year*|YYYY-mm-01|
|*year*|YYYY-07-01|
|Summer *year*|YYYY-08-01|
|Fall *year*|YYYY-11-01|
|Winter *year*|YYYY-02-01|
|Spring *year*|YYYY-05-01|

Any date string that is missing or does not adhere to one of the above formats will be corrected by hand.

In [7]:
def normal_date(dt):
    from datetime import datetime
    
    return datetime.strptime(dt, "%d %B %Y")

def year_only(dt):
    from datetime import datetime
    
    dt_obj = datetime.strptime(dt, "%Y").date()
    return dt_obj.replace(month = 7, day = 1)

def month_year(dt):
    from datetime import datetime
    
    return datetime.strptime(dt, "%B %Y").date()

def season_year(dt):
    from datetime import datetime, date
    import numpy as np
    
    try:
        season = dt.split()[-2].lower()
        year = dt.split()[-1]

    except IndexError:
        return np.nan
    
    if season == 'spring':
        return date(int(year), 5, 1)
    elif season == 'summer':
        return date(int(year), 8, 1)
    elif season == 'fall':
        return date(int(year), 11, 1)
    elif season == 'winter':
        return date(int(year), 2, 1)
    else:
        return np.nan



def date_parser(dt):
    '''
    Converts publication date strings from rocksbackpages.com articles
    to date objects with format YYYY-mm-dd.
    '''
    helpers = [normal_date, year_only, month_year, season_year]
    
    for f in helpers:
        try:
            return f(dt)
        except ValueError:
            pass
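
As a quick check on the conversion table above, here is a self-contained, table-driven sketch of the same rules. It is an illustration under my own assumptions rather than the parser used in this notebook: `parse_pub_date` and `SEASON_MONTH` are hypothetical names, it returns `None` instead of `NaN` for unparseable strings, and it assumes English month names.

```python
from datetime import date, datetime

# Month assigned to each season, mirroring the conversion table above.
SEASON_MONTH = {"spring": 5, "summer": 8, "fall": 11, "winter": 2}

def parse_pub_date(text):
    """Hypothetical table-driven parser; returns a date, or None on failure."""
    text = text.strip()
    # Full dates and "Month year" strings are handled by strptime.
    for fmt in ("%d %B %Y", "%B %Y"):
        try:
            return datetime.strptime(text, fmt).date()
        except ValueError:
            pass
    parts = text.split()
    # A bare four-digit year is assigned to 1 July.
    if len(parts) == 1 and len(parts[0]) == 4 and parts[0].isdigit():
        return date(int(parts[0]), 7, 1)
    # "Season year": look the season up in the table.
    if len(parts) >= 2 and parts[-2].lower() in SEASON_MONTH and parts[-1].isdigit():
        return date(int(parts[-1]), SEASON_MONTH[parts[-2].lower()], 1)
    return None

examples = ["9 June 1982", "June 1982", "1982", "Summer 1982", "31 February 1972", "15 2003"]
parsed = {s: parse_pub_date(s) for s in examples}
```

The last two strings are the kind of input that cannot be parsed automatically and has to be fixed by hand.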
        

In [8]:
LIB['date_parsed'] = LIB.date.apply(date_parser)

In [9]:
LIB.loc[LIB.date_parsed.isna()]

|article_id|title|author|source|date|subjects|topic|type|href|date_parsed|
|---|---|---|---|---|---|---|---|---|---|
|c610|Charley Pride, Tammy Wynette and George Jones:...|Gene Guerrero|The Great Speckled Bird|31 February 1972|['george-jones', 'tammy-wynette', 'charley-pri...|country|Live Review|/Library/SearchLinkRedirect?folder=charley-pri...|NaT|
|s12801|Duffy: Rockferry|Gavin Martin|Daily Mirror|29 February 2002|['duffy']|soul|Review|/Library/SearchLinkRedirect?folder=duffy-irock...|NaT|
|s12916|Earth Wind And Fire: Beacon Theater, New York|Kandia Crazy Horse|PopMatters|15 2003|['earth-wind--fire']|soul|Live Review|/Library/SearchLinkRedirect?folder=earth-wind-...|NaT|


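Only three dates failed to parse, so I correct them by hand. Two of them name a day that does not exist in that month, so I move each to the last valid day of February; the third lacks a month and is treated as a year-only date. After that, I replace the `date` column with `date_parsed`, since keeping both is redundant.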

In [10]:
LIB.loc['c610', 'date_parsed'] = date(1972, 2, 29)
LIB.loc["s12801", 'date_parsed'] = date(2002, 2, 28)
LIB.loc["s12916", 'date_parsed'] = date(2003, 7, 1)

# replace date col
LIB['date'] = LIB.date_parsed
LIB.drop(columns ='date_parsed', inplace = True)
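
The two February corrections above amount to clamping an impossible day to the last valid day of its month. As a small sketch of that rule (the helper name `clamp_day` is hypothetical and is not used elsewhere in this notebook), the standard library's `calendar.monthrange` gives the month length, leap years included:

```python
import calendar
from datetime import date

def clamp_day(year, month, day):
    """Clamp an out-of-range day to the last valid day of the month."""
    last_day = calendar.monthrange(year, month)[1]   # (weekday, days_in_month)
    return date(year, month, min(day, last_day))

feb_1972 = clamp_day(1972, 2, 31)   # 1972 is a leap year
feb_2002 = clamp_day(2002, 2, 29)   # 2002 is not
```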

Next, I notice that each entry in the `subjects` column is a string representation of a list, not an actual list. I convert `subjects` to list type.

In [11]:
from ast import literal_eval
LIB.subjects = LIB.subjects.apply(literal_eval)
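
To illustrate what `literal_eval` does here, the sketch below parses two made-up `subjects` strings in plain Python. Unlike `eval`, `literal_eval` only accepts Python literals, so it will not execute arbitrary code found in the CSV.

```python
from ast import literal_eval

# Made-up examples of how list-valued cells look after a CSV round trip.
raw_subjects = ["['al-green', 'laura-lee']", "[]"]

parsed_subjects = [literal_eval(s) for s in raw_subjects]
```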

#### Narrowing the Scope<a name="subparagraph2"></a>
I narrow the focus of my analysis rather than work with all 4000 articles.

In [12]:
# drop articles not tagged with a subject
empty_subjects = [x == [] for x in LIB.subjects]
print(f"{sum(empty_subjects)} articles missing a subject were dropped")

LIB = LIB[[not x for x in empty_subjects]]

73 articles missing a subject were dropped


In [13]:
# refine article types

# consolidate 'Sleeve and programme notes' with 'Sleevenotes'
LIB.loc[LIB.type == 'Sleeve and programme notes', 'type'] = 'Sleevenotes'

LIB.type.value_counts().to_frame().T

| |Interview|Review|Live Review|Profile and Interview|Report and Interview|Profile|Report|Book Excerpt|Obituary|Retrospective|...|Review and Interview|Film/DVD/TV Review|Guide|Audio transcript of interview|Discography|Special Feature|Readers' Letters|Column|Letters|Film/DVD Review|
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|type|1016|905|762|301|155|109|106|104|101|88|...|18|16|6|5|5|4|3|1|1|1|


Immediately, I see some document types that should be excluded: *Film/DVD Review*, *Letters*, *Readers' Letters*. These documents do not fit with the broader corpus as they are either not about music, or not written by professional journalists.  

It is debatable whether *obituary* and *memoir* should be included. These document types are dissimilar from other documents in the corpus as they contain more biographical language. On the other hand, an artist with an associated *obituary* or *memoir* is likely to have had significant cultural impact. However, this analysis is conducted at the level of genre rather than artist, so that signal matters less here. Thus, I discard *obituary* and *memoir* in favor of reducing the size of the corpus.

Lastly, I reason that *book excerpts* should also be removed. Out of the 104 book excerpts, 97 are sourced from a single book, *The Faber Companion to 20<sup>th</sup> Century Popular Music*. These excerpts suffer from the same unsuitability as the obituaries; they are more pragmatic than they are poetic. Additionally, including so many documents from a single source would almost certainly have an unintended effect on the analysis.

The remaining documents discuss musicians, albums, and performances.

In [14]:
exclude = ['Film/DVD Review', 'Film/DVD/TV Review',
           'Letters', "Readers' Letters", 
           'Obituary', 'Memoir', 
           'Book Excerpt', 'Book Review',
           'Audio transcript of interview']

LIB = LIB.loc[~LIB.type.isin(exclude)]
print(f"Number of Articles: {len(LIB)}")

Number of Articles: 3651


In [15]:
LIB.sample(10)

|article_id|title|author|source|date|subjects|topic|type|href|
|---|---|---|---|---|---|---|---|---|
|c1419|Willie Nelson: Hammersmith Odeon, London|Mick Brown|The Guardian|1982-06-09 00:00:00|[willie-nelson]|country|Live Review|/Library/SearchLinkRedirect?folder=willie-nels...|
|s3713|Al Green, Laura Lee: Apollo Theatre, New York NY|Dan Nooger|The Village Voice|1973-11-15 00:00:00|[al-green, laura-lee]|soul|Live Review|/Library/SearchLinkRedirect?folder=al-green-la...|
|s11209|M People: Swing Out Citrus|John Harris|New Musical Express|1994-12-10 00:00:00|[m-people]|soul|Interview|/Library/SearchLinkRedirect?folder=m-people-sw...|
|s4501|Ohio Players, Graham Central Station, Funkadel...|Vernon Gibbs|The Village Voice|1975-02-24 00:00:00|[funkadelic, ohio-players-the, graham-central-...|soul|Live Review|/Library/SearchLinkRedirect?folder=ohio-player...|
|c306|Merle Haggard: Home-fried Humor and Cowboy Soul|Al Aronowitz|Rolling Stone|1968-08-10 00:00:00|[merle-haggard]|country|Profile and Interview|/Library/SearchLinkRedirect?folder=merle-hagga...|
|s9913|World Saxophone Quartet: Rhythm 'n Blues (Elek...|Kirk Silsbee|Musician|1989-09-01 00:00:00|[world-saxophone-quartet]|soul|Review|/Library/SearchLinkRedirect?folder=world-saxop...|
|c3316|Dolly Parton: Backwoods Barbie **½|Mark Kemp|Rolling Stone Online|2008-03-06 00:00:00|[dolly-parton]|country|Review|/Library/SearchLinkRedirect?folder=dolly-parto...|
|s7617|Fela: Return of the Afrobeat Rebel|Randall Grass|Musician|1983-10-01 00:00:00|[fela-kuti]|soul|Profile|/Library/SearchLinkRedirect?folder=fela-return...|
|s10010|A Schism Divides Black Pop Radical Rappers And...|Jim Sullivan|The Boston Globe|1989-12-31 00:00:00|[beastie-boys-the, earth-wind--fire, ll-cool-j...|soul|Comment|/Library/SearchLinkRedirect?folder=a-schism-di...|
|c1407|Rodney Crowell: A Songwriter Surfaces|Fred Schruers|Rolling Stone|1980-08-21 00:00:00|[rodney-crowell]|country|Profile and Interview|/Library/SearchLinkRedirect?folder=rodney-crow...|


---
## Constructing the Corpus<a name="paragraph3"></a>
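I read in the body of each document, split it into paragraphs, and tokenize it with `nltk`'s `sent_tokenize()` function and `WhitespaceTokenizer` class. Each token is also tagged with its part of speech using `nltk.pos_tag()`. Tokens are amassed in the `CORPUS` table and indexed by the document's OHCO structure.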

In [16]:
# define function for reading files
def RBP_reader(filePath):
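    '''
    Reads one scraped rocksbackpages.com article and returns its body text.

    Inputs
    -----
    filePath - string, path to an article's .html file

    Returns
    -----
    doc - string, article copy, preceded by the standfirst unless the
          standfirst contains the writer's name
    '''
    import re
    from bs4 import BeautifulSoup
    
    with open(filePath, 'r', encoding = 'utf-8') as f:
        contents = f.read()
        
    soup = BeautifulSoup(contents, 'html')
    
    # collapse newlines in the writer's name into single spaces
    writer = re.sub(r'\n+', ' ', soup.find("span", class_="writer").get_text()) \
               .strip()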
    
    standfirst = soup.find("div", class_="standfirst").get_text()
    copy = soup.find("div", class_="copy").get_text()

    # if standfirst contains writer name, it is assumed that 
    # standfirst is purely metadata and should be ignored
    if writer.lower() in standfirst.lower():
        doc = copy
    else:
        doc = standfirst+copy
    
    return doc.strip()
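
The standfirst rule in `RBP_reader` can be seen in isolation with plain strings. This is a minimal sketch with made-up text, and `assemble_doc` is a hypothetical helper that only mirrors the final `if`/`else` of the function above.

```python
def assemble_doc(writer, standfirst, copy):
    """Return the article text, skipping a standfirst that holds the byline."""
    # A standfirst that mentions the writer is treated as pure metadata.
    if writer.lower() in standfirst.lower():
        return copy.strip()
    return (standfirst + copy).strip()

# Made-up examples: one byline-style standfirst, one that is real content.
byline_case = assemble_doc("Mick Brown", "Mick Brown, The Guardian",
                           "Willie Nelson took the stage.")
lede_case = assemble_doc("Mick Brown", "A night of outlaw country. ",
                         "Willie Nelson took the stage.")
```

One caveat of this rule: an empty writer string is contained in every standfirst, so an article with no writer name would always lose its standfirst.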

In [17]:
def tokenize_collection(library, filePrefix = "./data/html/"):
    '''
    Inputs
    -----
    library - pandas dataframe, must include article_id as index
    filePrefix - string, path to directory containing .html files
    
    Returns
    -----
    CORPUS - pandas dataframe, contains all tokens in corpus
    '''
    import re
    import nltk
    
    para_pat = r"\n{2,}"
    documents = []
    
    i=0
    numIter = len(library)
    
    for id in library.index:
        
        doc = RBP_reader(filePrefix+str(id)+".html")
        
        PARAS = re.split(para_pat, doc)
        PARAS = [x.strip().replace('\n', ' ') for x in PARAS]
        PARAS = pd.DataFrame(PARAS, columns = ['para_str'])
        PARAS.index.name = 'para_id'
        
        SENTS = PARAS.para_str.apply(lambda x: pd.Series(nltk.sent_tokenize(x)))\
        .stack()\
        .to_frame('sent_str')
        
        SENTS.index.names = ['para_id', 'sent_id']
        
        SENTS = SENTS.sent_str.str.replace("-", " ").to_frame()
        
        TOKENS = SENTS.sent_str\
                      .apply(lambda x: \
                             pd.Series(nltk.pos_tag(nltk.WhitespaceTokenizer().tokenize(x)))) \
                      .stack() \
                      .to_frame('pos_tuple')

        TOKENS.index.names = OHCO[1:]
        
        TOKENS['pos'] = TOKENS.pos_tuple.apply(lambda x: x[1])
        TOKENS['token_str'] = TOKENS.pos_tuple.apply(lambda x: x[0])
        TOKENS['term_str'] = TOKENS.token_str.str.lower()
        
        punc_pos = ['$', "''", '(', ')', '[', ']', ',', '--', '.', ':', '``']
        TOKENS['term_str'] = TOKENS[~TOKENS.pos.isin(punc_pos)].token_str \
                                .str.replace(r'[\W_]+', '', regex=True).str.lower()

        TOKENS['article_id'] = id
        TOKENS = TOKENS.reset_index().set_index(OHCO)
        
        documents.append(TOKENS)
        
        # max(1, ...) avoids a modulo-by-zero error for very small libraries
        if i % max(1, round(numIter/5)) == 0:
            print(f"{round(i*100/numIter)}% Complete")
        i +=1

    
    CORPUS = pd.concat(documents).sort_index()
    
    del(documents)
    del(PARAS)
    del(SENTS)
    del(TOKENS)
    
    return CORPUS
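
To make the paragraph, sentence, and token hierarchy concrete without any dependencies, here is a simplified stand-in for the function above. It is a sketch under stated assumptions: sentences are split with a naive regular expression rather than `nltk.sent_tokenize`, there is no part-of-speech tagging, and `ohco_tokens` and the sample text are hypothetical. The paragraph pattern, hyphen replacement, whitespace tokenization, and term normalization follow the cell above.

```python
import re

def ohco_tokens(doc):
    """Yield (para_id, sent_id, token_id, token_str, term_str) tuples."""
    # Paragraphs are separated by two or more newlines, as in the cell above.
    paras = [p.strip().replace("\n", " ") for p in re.split(r"\n{2,}", doc)]
    for para_id, para in enumerate(paras):
        # Naive sentence split after ., ! or ? (assumption; not nltk).
        sents = re.split(r"(?<=[.!?])\s+", para)
        for sent_id, sent in enumerate(sents):
            # Hyphens become spaces, then tokens are split on whitespace.
            tokens = sent.replace("-", " ").split()
            for token_id, token in enumerate(tokens):
                # Strip non-word characters and lowercase to get the term.
                term = re.sub(r"[\W_]+", "", token).lower()
                yield para_id, sent_id, token_id, token, term

sample = "Over the hill. Fearful, well-worn songs!\n\nSome explosions, then singing."
rows = list(ohco_tokens(sample))
```

Each tuple carries the last three OHCO levels; prepending an `article_id` gives the full index used by `CORPUS`.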

In [18]:
CORPUS = tokenize_collection(LIB)

0% Complete
20% Complete
40% Complete
60% Complete
80% Complete
100% Complete


In [19]:
CORPUS.to_csv('./data/CORPUS.csv')

In [30]:
CORPUS.sample(10)

|article_id|para_id|sent_id|token_id|pos_tuple|pos|token_str|term_str|
|---|---|---|---|---|---|---|---|
|s5313|50|0|0|(Over, IN)|IN|Over|over|
|c2218|3|3|8|(disparities, NNS)|NNS|disparities|disparities|
|s10500|8|2|26|(fearful., NN)|NN|fearful.|fearful|
|c2118|1|0|8|(of, IN)|IN|of|of|
|c3806|10|2|18|(ground, NN)|NN|ground|ground|
|c1614|32|2|36|(country, NN)|NN|country|country|
|s8003|12|2|13|(some, DT)|DT|some|some|
|s11716|7|0|6|(explosions,, NN)|NN|explosions,|explosions|
|s2803|22|0|8|(does, VBZ)|VBZ|does|does|
|s9405|2|0|46|(singing, VBG)|VBG|singing|singing|


---
## Extracting a Vocabulary<a name="paragraph4"></a>