# Country & Soul
### *Analyzing Trends in American Music Journalism from 1960 to Present*
---

# Digital Analytical Edition
#### Table of Contents
1. [Introduction](#introduction)
2. [Libraries & Set Up](#paragraph1)
3. [Creating the Library Table](#paragraph2)
    1. [Parsing Dates](#subparagraph1)
    2. [Narrowing the Scope](#subparagraph2)
4. [Constructing the Corpus](#paragraph3)
5. [Extracting a Vocabulary](#paragraph4)
6. [Creating a Bag-of-Words](#paragraph5)
    1. [Add Frequency Features to `VOCAB`](#subparagraph3)
7. [Saving the Digital Analytical Edition](#paragraph6)

## Introduction<a name="introduction"></a>
In this notebook, I aggregate and parse the source files to create a Digital Analytical Edition (DAE).  

The DAE contains the following tables:  
+ `LIB.csv`    - Metadata for each article.
+ `CORPUS.csv` - Aggregated text from source files. Indexed by Ordered Hierarchy of Content Object (OHCO) structure.
+ `VOCAB.csv`  - Linguistic features and statistics for words in `CORPUS.csv`. Stop words removed.
+ `BOW.csv`    - Bag-of-words representation of `CORPUS.csv` with stop words removed.

Source files were scraped from [rocksbackpages.com](https://www.rocksbackpages.com/) using `rbpscraper.py`. I present an example of `rbpscraper.py` functionality here, but I do **not** recommend that the reader run this code: `rbpscraper.py` requires a subscription to rocksbackpages.com and several hours of your time!

In [1]:
import sys
sys.path.append('.')

from rbpscraper import RBPScraper

country = RBPScraper(desc = "country", write_path = "./datadir/")
    # desc - description of articles to be scraped
    # write_path - path to directory where articles will be saved

Please paste URL:
https://www.rocksbackpages.com/

Thank you.

Please paste cookies:
sample=cookies; auth=details

Thank you.


After instantiating an `RBPScraper` object, the user is prompted to enter the URL for the first page of search results, and then to enter cookies for authentication. The web-scraping process can be initiated by calling the following methods. This saves the articles as HTML files and a metadata table as a CSV file in the specified directory.

In [2]:
#country.searchScraper().articleScraper().writeLIB()

---
## Libraries & Set Up<a name="paragraph1"></a>

In [3]:
import pandas as pd
import numpy as np
from datetime import date

import nltk
import spacy
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer, TfidfTransformer

In [4]:
# path to data
dataPath = "./data/"
articlePath = "./data/html/"

# define OHCO structure
OHCO = ['article_id', 'para_id', 'sent_id', 'token_id']

---
## Creating the Library Table<a name="paragraph2"></a>
Two library tables were created during the web-scraping process, one for each genre. I combine, process, and refine the tables.

In [5]:
# read in metadata table for each genre
country_metadata = pd.read_csv(f"{dataPath}cLIB.csv")
country_metadata.set_index("id", inplace = True)

soul_metadata = pd.read_csv(f"{dataPath}sLIB.csv")
soul_metadata.set_index("id", inplace = True)

In [6]:
# create LIB table
LIB = pd.concat([country_metadata, soul_metadata])

LIB.index.rename(OHCO[0], inplace = True)

#### Parsing Dates<a name="subparagraph1"></a>
Inspection reveals that the publication dates are not in a standard format. I define a function that parses date strings. The following table describes how dates are assigned to ambiguous formats.

|`date` format|output format|
|------------|-------------|
|Month *year*|YYYY-mm-01|
|*year*|YYYY-07-01|
|Summer *year*|YYYY-08-01|
|Fall *year*|YYYY-11-01|
|Winter *year*|YYYY-02-01|
|Spring *year*|YYYY-05-01|

Any date string that is missing or does not adhere to one of the above formats will be corrected by hand.

In [7]:
def normal_date(dt):
    from datetime import datetime
    
    return datetime.strptime(dt, "%d %B %Y")

def year_only(dt):
    from datetime import datetime
    
    dt_obj = datetime.strptime(dt, "%Y").date()
    return dt_obj.replace(month = 7, day = 1)

def month_year(dt):
    from datetime import datetime
    
    return datetime.strptime(dt, "%B %Y").date()

def season_year(dt):
    from datetime import datetime, date
    import numpy as np
    
    try:
        season = dt.split()[-2].lower()
        year = dt.split()[-1]

    except IndexError:
        return np.nan
    
    if season == 'spring':
        return date(int(year), 5, 1)
    elif season == 'summer':
        return date(int(year), 8, 1)
    elif season == 'fall':
        return date(int(year), 11, 1)
    elif season == 'winter':
        return date(int(year), 2, 1)
    else:
        return np.nan



def date_parser(dt):
    '''
    Converts publication date strings from rocksbackpages.com articles
    to date objects with format YYYY-mm-dd.
    '''
    helpers = [normal_date, year_only, month_year, season_year]
    
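    for f in helpers:
        try:
            return f(dt)
        except (ValueError, TypeError, AttributeError):
            # ValueError: string does not match this helper's format
            # TypeError / AttributeError: dt is missing (not a string)
            continue

    # no helper matched; leave the date missing so it can be corrected by hand
    return None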

In [8]:
LIB['date_parsed'] = LIB.date.apply(date_parser)

In [9]:
LIB.loc[LIB.date_parsed.isna()]

|article_id|title|author|source|date|subjects|topic|type|href|date_parsed|
|----------|-----|------|------|----|--------|-----|----|----|-----------|
|c610|Charley Pride, Tammy Wynette and George Jones:...|Gene Guerrero|The Great Speckled Bird|31 February 1972|['george-jones', 'tammy-wynette', 'charley-pri...|country|Live Review|/Library/SearchLinkRedirect?folder=charley-pri...|NaT|
|s12801|Duffy: Rockferry|Gavin Martin|Daily Mirror|29 February 2002|['duffy']|soul|Review|/Library/SearchLinkRedirect?folder=duffy-irock...|NaT|
|s12916|Earth Wind And Fire: Beacon Theater, New York|Kandia Crazy Horse|PopMatters|15 2003|['earth-wind--fire']|soul|Live Review|/Library/SearchLinkRedirect?folder=earth-wind-...|NaT|


Only three dates failed to parse. I will manually correct them. After that, I will replace the `date` column with `date_parsed` as keeping both is redundant.
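To illustrate why these three strings fail, here is a minimal sketch using only the standard library. The helper `try_parse` is hypothetical and not part of this notebook; it simply wraps `datetime.strptime`, which rejects both strings that do not match the format and strings that name an impossible calendar day.

```python
from datetime import datetime

def try_parse(dt, fmt="%d %B %Y"):
    """Return a date if dt matches fmt and names a real calendar day, else None."""
    try:
        return datetime.strptime(dt, fmt).date()
    except ValueError:
        return None

# the three strings that failed to parse above
failed = ["31 February 1972", "29 February 2002", "15 2003"]
results = {dt: try_parse(dt) for dt in failed}

# 1972 was a leap year, so 29 February 1972 is a valid date
leap_day = try_parse("29 February 1972")

print(results)
print(leap_day)
```

February never has 31 days, 2002 was not a leap year, and `15 2003` has no month at all, so each string needs a manual decision rather than a new parsing rule.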

In [10]:
LIB.loc['c610', 'date_parsed'] = date(1972, 2, 29)
LIB.loc["s12801", 'date_parsed'] = date(2002, 2, 28)
LIB.loc["s12916", 'date_parsed'] = date(2003, 7, 1)

# replace date col
LIB['date'] = LIB.date_parsed
LIB.drop(columns ='date_parsed', inplace = True)

Next, I notice that the `subjects` column holds a string representation of a list rather than an actual list. I convert `subjects` to list type.

In [11]:
from ast import literal_eval
LIB.subjects = LIB.subjects.apply(literal_eval)
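As a small standalone sketch of what `literal_eval` does to each cell, the example below parses a few strings shaped like the `subjects` values. The list `raw` is illustrative sample data, not read from `LIB`.

```python
from ast import literal_eval

# strings that look like Python lists, as stored in the CSV
raw = ["['duffy']", "['chely-wright', 'lavender-country']", "[]"]

# literal_eval safely evaluates Python literals without running arbitrary code
parsed = [literal_eval(s) for s in raw]

print(parsed)
print([type(x).__name__ for x in parsed])
```

Unlike `eval`, `literal_eval` only accepts literals such as strings, numbers, lists, and dicts, which makes it a reasonable choice for scraped text.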

#### Narrowing the Scope<a name="subparagraph2"></a>
I narrow the focus of my analysis rather than work with all 4000 articles.

In [12]:
# drop articles not tagged with a subject
empty_subjects = [True if x == [] else False for x in LIB.subjects ]
print(f"{sum(empty_subjects)} articles missing a subject were dropped")

LIB = LIB[[not x for x in empty_subjects]]

73 articles missing a subject were dropped


In [13]:
# refine article types

# consolidate 'Sleeve and programme notes' with 'Sleevenotes'
LIB.loc[LIB.type == 'Sleeve and programme notes', 'type'] = 'Sleevenotes'

LIB.type.value_counts().to_frame().T

| |Interview|Review|Live Review|Profile and Interview|Report and Interview|Profile|Report|Book Excerpt|Obituary|Retrospective|...|Review and Interview|Film/DVD/TV Review|Guide|Audio transcript of interview|Discography|Special Feature|Readers' Letters|Column|Letters|Film/DVD Review|
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|type|1016|905|762|301|155|109|106|104|101|88|...|18|16|6|5|5|4|3|1|1|1|


Immediately, I see some document types that should be excluded: *Film/DVD Review*, *Letters*, *Readers' Letters*. These documents do not fit with the broader corpus as they are either not about music, or not written by professional journalists.  

It is debatable whether *obituary* and *memoir* should be included. These document types are dissimilar from other documents in the corpus as they contain more biographical language. However, if an artist has an associated *obituary* or *memoir*, they were likely to have had significant cultural impact. This analysis is conducted at the level of genre rather than artist. Thus, I discard *obituary* and *memoir* in favor of reducing the size of the corpus.

Lastly, I reason that *book excerpts* should also be removed. Out of the 104 book excerpts, 97 are sourced from a single book, *The Faber Companion to 20<sup>th</sup> Century Popular Music*. These excerpts suffer the same issues of unsuitability as the obituaries; they are more pragmatic than they are poetic. Additionally, including so many documents from a single source would almost certainly have an unintended effect on the analysis.

The remaining documents discuss musicians, albums, and performances.

In [14]:
exclude = ['Film/DVD Review', 'Film/DVD/TV Review',
           'Letters', "Readers' Letters", 
           'Obituary', 'Memoir', 
           'Book Excerpt', 'Book Review',
           'Audio transcript of interview']

LIB = LIB.loc[~LIB.type.isin(exclude)]
print(f"Number of Articles: {len(LIB)}")

Number of Articles: 3651


In [33]:
LIB.sample(10)

|article_id|title|author|source|date|subjects|topic|type|href|length|
|----------|-----|------|------|----|--------|-----|----|----|------|
|s15507|Grace Jones et al.: Love Supreme Jazz Festival...|Nick Hasted|The Independent|2016-07-07 00:00:00|[burt-bacharach, grace-jones, kamasi-washingto...|soul|Live Review|/Library/SearchLinkRedirect?folder=grace-jones...|430|
|c3802|Kris Kristofferson|David Burke|R2/Rock'n'Reel|2013-09-01 00:00:00|[kris-kristofferson]|country|Interview|/Library/SearchLinkRedirect?folder=kris-kristo...|1939|
|s11106|Rick James: The Untold Story|Michael Goldberg|Vibe|1994-04-01 00:00:00|[rick-james]|soul|Interview|/Library/SearchLinkRedirect?folder=rick-james-...|3594|
|s5018|Chuck Jackson: Chuck's Foot Is Back On The Thr...|Kevin Allen|Record Mirror|1975-12-06 00:00:00|[chuck-jackson]|soul|Report and Interview|/Library/SearchLinkRedirect?folder=chuck-jacks...|881|
|c903|Scott Walker: We Had It All|Fred Dellar|New Musical Express|1974-09-21 00:00:00|[scott-walker]|country|Review|/Library/SearchLinkRedirect?folder=scott-walke...|101|
|c3816|Country music's gay stars: "We're still kickin...|Graeme Thomson|The Guardian|2014-04-10 00:00:00|[chely-wright, lavender-country]|country|Report and Interview|/Library/SearchLinkRedirect?folder=country-mus...|1403|
|s12606|Seeking Tuneage Kicks With... Kelis|Dorian Lynskey|Select|2001-01-01 00:00:00|[kelis]|soul|Interview|/Library/SearchLinkRedirect?folder=seeking-tun...|1106|
|s15410|Khruangbin: The Universe Smiles Upon You|Daryl Easlea|MOJO|2016-01-01 00:00:00|[khruangbin]|soul|Review|/Library/SearchLinkRedirect?folder=khruangbin-...|149|
|s7113|Darlene Hits Local Scene, Reviving an Old Spector|Joel Selvin|San Francisco Chronicle|1982-07-18 00:00:00|[darlene-love, phil-spector]|soul|Profile and Interview|/Library/SearchLinkRedirect?folder=darlene-hit...|541|
|s1300|The Righteous Brothers: Greek Theatre, Los Ang...|June Harris|New Musical Express|1967-09-23 00:00:00|[righteous-brothers-the]|soul|Live Review|/Library/SearchLinkRedirect?folder=the-righteo...|129|


---
## Constructing the Corpus<a name="paragraph3"></a>
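I read in the body of each document, split it into paragraphs on blank lines, and tokenize it with `nltk`'s `sent_tokenize()` function and `WhitespaceTokenizer` class. Each token is also tagged with its part of speech using `nltk.pos_tag()`. Tokens are amassed in the `CORPUS` table and indexed by the document's OHCO structure.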
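To make the OHCO index concrete, here is a minimal standard-library sketch that assigns each token an `(article_id, para_id, sent_id, token_id)` key. The function `ohco_tokens` and the sample text are hypothetical, and the regular-expression sentence splitter is only a naive stand-in for `nltk.sent_tokenize`.

```python
import re

def ohco_tokens(article_id, text):
    """Yield ((article_id, para_id, sent_id, token_id), token) pairs."""
    # paragraphs are separated by two or more newlines
    paras = re.split(r"\n{2,}", text.strip())
    for para_id, para in enumerate(paras):
        # naive sentence split: break after ., ! or ? followed by whitespace
        sents = re.split(r"(?<=[.!?])\s+", para.replace("\n", " ").strip())
        for sent_id, sent in enumerate(sents):
            # whitespace tokenization, as in the notebook
            for token_id, token in enumerate(sent.split()):
                yield (article_id, para_id, sent_id, token_id), token

sample = "Soul is alive. So is country.\n\nBoth tell stories."
rows = dict(ohco_tokens("x1", sample))

for key, token in rows.items():
    print(key, token)
```

Because every token carries its full position, the table can later be regrouped at any level (sentence, paragraph, or article) with a single `groupby`.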

In [16]:
# define function for reading files
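def RBP_reader(filePath):
    '''
    Reads a scraped rocksbackpages.com article and returns its body text.

    Inputs
    -----
    filePath - string, path to an article's .html file

    Returns
    -----
    doc - string, standfirst + copy (or copy alone if the standfirst
          only repeats the byline)
    '''
    import re
    from bs4 import BeautifulSoup
    
    with open(filePath, 'r', encoding = 'utf-8') as f:
        contents = f.read()
        
    soup = BeautifulSoup(contents, 'html')
    
    # collapse newlines in the byline; re.sub is needed because
    # str.replace treats its pattern as a literal string, not a regex
    writer = soup.find("span", class_="writer").get_text()
    writer = re.sub(r'\n+', ' ', writer).strip()
    
    standfirst = soup.find("div", class_="standfirst").get_text()
    copy = soup.find("div", class_="copy").get_text()

    # if standfirst contains writer name, it is assumed that 
    # standfirst is purely metadata and should be ignored
    if writer.lower() in standfirst.lower():
        doc = copy
    else:
        doc = standfirst+copy
    
    return doc.strip()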

In [17]:
def tokenize_collection(library, filePrefix = "./data/html/"):
    '''
    Inputs
    -----
    library - pandas dataframe, must include article_id as index
    filePrefix - string, path to directory containing .html files
    
    Returns
    -----
    CORPUS - pandas dataframe, contains all tokens in corpus
    '''
    import re
    import nltk
    
    para_pat = r"\n{2,}"
    documents = []
    
    i=0
    numIter = len(library)
    
    for id in library.index:
        
        doc = RBP_reader(filePrefix+str(id)+".html")
        
        PARAS = re.split(para_pat, doc)
        PARAS = [x.strip().replace('\n', ' ') for x in PARAS]
        PARAS = pd.DataFrame(PARAS, columns = ['para_str'])
        PARAS.index.name = 'para_id'
        
        SENTS = PARAS.para_str.apply(lambda x: pd.Series(nltk.sent_tokenize(x)))\
        .stack()\
        .to_frame('sent_str')
        
        SENTS.index.names = ['para_id', 'sent_id']
        
        SENTS = SENTS.sent_str.str.replace("-", " ").to_frame()
        SENTS = SENTS.sent_str.str.replace("/", " ").to_frame()
        
        TOKENS = SENTS.sent_str\
                      .apply(lambda x: \
                             pd.Series(nltk.pos_tag(nltk.WhitespaceTokenizer().tokenize(x)))) \
                      .stack() \
                      .to_frame('pos_tuple')

        TOKENS.index.names = OHCO[1:]
        
        TOKENS['pos'] = TOKENS.pos_tuple.apply(lambda x: x[1])
        TOKENS['token_str'] = TOKENS.pos_tuple.apply(lambda x: x[0])
        TOKENS['term_str'] = TOKENS.token_str.str.lower()
        
        punc_pos = ['$', "''", '(', ')', '[', ']', ',', '--', '.', ':', '``']
        TOKENS['term_str'] = TOKENS[~TOKENS.pos.isin(punc_pos)].token_str \
                                .str.replace(r'[\W_]+', '', regex=True).str.lower()

        TOKENS['article_id'] = id
        TOKENS = TOKENS.reset_index().set_index(OHCO)
        
        documents.append(TOKENS)
        
        # report progress roughly every 20%; max() avoids a zero step for tiny libraries
        if i % max(1, round(numIter/5)) == 0:
            print(f"{round(i*100/numIter)}% Complete")
        i +=1

    # sort index & columns
    CORPUS = pd.concat(documents).sort_index()
    CORPUS = CORPUS[['token_str', 'term_str', 'pos_tuple', 'pos']]
    
    # add POS_group
    CORPUS['pos_group'] = CORPUS.pos.str.slice(0,2)
    
    del(documents)
    del(PARAS)
    del(SENTS)
    del(TOKENS)
    
    return CORPUS

In [18]:
CORPUS = tokenize_collection(LIB)

0% Complete
20% Complete
40% Complete
60% Complete
80% Complete
100% Complete


In [19]:
# remove blank tokens and NANs
CORPUS = CORPUS[CORPUS.term_str!='']
CORPUS = CORPUS[~CORPUS.term_str.isna()]
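Blank terms arise because the normalization step strips every non-alphanumeric character, so a token made only of punctuation collapses to an empty string. The sketch below reproduces that step with the same `[\W_]+` pattern on a few made-up tokens; the helper name `normalize` is hypothetical.

```python
import re

def normalize(token):
    """Strip non-word characters and underscores, then lowercase."""
    return re.sub(r"[\W_]+", "", token).lower()

tokens = ["Rock'n'Roll!", "R&B", "mid_tempo", "--"]
terms = [normalize(t) for t in tokens]

print(terms)
```

The last token becomes an empty string, which is the kind of row the filter above removes.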

**Adding article length to `LIB`**

In [20]:
LIB['length'] = CORPUS.groupby('article_id').term_str.count()

---
## Extracting a Vocabulary<a name="paragraph4"></a>

In [21]:
# create VOCAB table
VOCAB = CORPUS.term_str.value_counts().to_frame('n')

In [23]:
# add max part-of-speech
VOCAB['max_pos'] = CORPUS[['term_str','pos']].value_counts().unstack(fill_value=0).idxmax(1)

# add max POS group
VOCAB['max_pos_group'] = VOCAB.max_pos.str.slice(0,2)

# add number of POS associated with each term
VOCAB['n_pos'] = CORPUS[['term_str','pos']].value_counts().unstack().count(1)

# add the set of all POS tags observed for each term
VOCAB['cat_pos'] = CORPUS[['term_str','pos']].value_counts().to_frame('n').reset_index()\
    .groupby('term_str').pos.apply(set)
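For readers less familiar with the pandas idioms above, the same three features can be sketched in plain Python. The `(term, pos)` pairs below are invented for illustration and do not come from `CORPUS`.

```python
from collections import Counter, defaultdict

# hypothetical (term_str, pos) observations
pairs = [("pines", "NNP"), ("pines", "NNP"), ("pines", "NNS"), ("catsuit", "NN")]

# count how often each POS tag occurs for each term
by_term = defaultdict(Counter)
for term, pos in pairs:
    by_term[term][pos] += 1

max_pos = {t: c.most_common(1)[0][0] for t, c in by_term.items()}  # most frequent tag
n_pos = {t: len(c) for t, c in by_term.items()}                    # number of distinct tags
cat_pos = {t: set(c) for t, c in by_term.items()}                  # set of all tags

print(max_pos, n_pos, cat_pos)
```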


In [24]:
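# Add term-based statistics
VOCAB['n_chars'] = VOCAB.index.str.len()   # number of characters in the term
VOCAB['p'] = VOCAB.n / VOCAB.n.sum()       # relative frequency of the term in the corpus
VOCAB['i'] = -np.log2(VOCAB.p)             # self-information of the term, in bits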
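As a worked example of these statistics, the sketch below computes `n`, `p`, and `i` for a toy list of eight terms. The term list is invented; only the formulas `p = n / sum(n)` and `i = -log2(p)` come from the cell above.

```python
import math
from collections import Counter

# toy corpus of 8 terms
terms = ["soul"] * 4 + ["country"] * 2 + ["blues", "gospel"]

counts = Counter(terms)
total = sum(counts.values())

stats = {
    term: {"n": n, "p": n / total, "i": -math.log2(n / total)}
    for term, n in counts.items()
}

for term, row in stats.items():
    print(term, row)
```

Each halving of a term's relative frequency adds one bit of information, so rare terms such as those in the sample below (with `p` near 5e-07) carry roughly 21 bits.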

In [25]:
VOCAB.sample(5)

| |n|max_pos|max_pos_group|n_pos|cat_pos|n_chars|p|i|
|---|---|---|---|---|---|---|---|---|
|deena|2|NNP|NN|1|{NNP}|5|5.15869e-07|20.886492|
|catsuit|2|NN|NN|1|{NN}|7|5.15869e-07|20.886492|
|pastored|1|VBN|VB|1|{VBN}|8|2.579345e-07|21.886492|
|pines|19|NNP|NN|4|{NNP, NNPS, NN, NNS}|5|4.900756e-06|17.638564|
|enlightenment|6|NN|NN|3|{JJ, NN, NNS}|13|1.547607e-06|19.301529|


---
## Creating a Bag-of-Words<a name="paragraph5"></a>
I create a bag-of-words model for each document in the corpus. In this representation, the frequency of each word is recorded while grammar and word order are discarded. In addition to raw word counts, I record **tf-idf**, which weights a term more heavily when it is frequent within a document but appears in few documents across the corpus. I use scikit-learn's `CountVectorizer` and `TfidfTransformer`.
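To show what the tf-idf weighting computes, here is a minimal pure-Python sketch. It assumes the smoothed formula that scikit-learn documents for `TfidfTransformer` with its default settings, `idf(t) = ln((1 + n) / (1 + df(t))) + 1`, followed by L2 normalization of each document row. The function `tfidf_rows` and the toy count matrix are hypothetical.

```python
import math

def tfidf_rows(counts):
    """Apply smoothed idf weighting and L2-normalize each row of a count matrix."""
    n_docs = len(counts)
    n_terms = len(counts[0])
    # document frequency: number of documents containing each term
    df = [sum(1 for row in counts if row[j] > 0) for j in range(n_terms)]
    idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in df]
    out = []
    for row in counts:
        weighted = [tf * w for tf, w in zip(row, idf)]
        norm = math.sqrt(sum(v * v for v in weighted)) or 1.0
        out.append([v / norm for v in weighted])
    return out

# rows are documents, columns are terms; the middle term appears in both documents
counts = [[2, 1, 0],
          [0, 1, 3]]
weights = tfidf_rows(counts)

for row in weights:
    print([round(v, 3) for v in row])
```

The term shared by both documents receives the minimum idf of 1, so within each document it is outweighed by the terms that are unique to that document.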

In [26]:
# Gather CORPUS into articles
ARTICLES = CORPUS.groupby('article_id').term_str.apply(lambda x: ' '.join(x))

I create two bags-of-words. The first contains the top 32,000 unigrams with stop words removed. I chose the limit of 32,000 because it is roughly half the number of terms in the vocabulary and because each of the top 32,000 terms occurs at least 3 times in the corpus.

The second bag-of-words contains the top 10,000 N-grams for $N \in \{1, 2, 3, 4\}$. Again, stop words are removed.
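As a rough illustration of what an N-gram range of 1 to 4 produces, the sketch below builds word N-grams from a single phrase. The function `word_ngrams` is hypothetical, and dropping stop words before forming N-grams is my assumption about how `CountVectorizer` behaves; the one-word stop list is a stand-in for the full English list.

```python
def word_ngrams(tokens, n_min=1, n_max=4, stop_words=frozenset()):
    """Return all word n-grams for n_min <= n <= n_max after removing stop words."""
    kept = [t for t in tokens if t not in stop_words]
    grams = []
    for n in range(n_min, n_max + 1):
        # slide a window of width n across the remaining tokens
        for i in range(len(kept) - n + 1):
            grams.append(" ".join(kept[i:i + n]))
    return grams

tokens = "rhythm and blues revue".split()
grams = word_ngrams(tokens, stop_words={"and"})

print(grams)
```

Under this assumption, removing a stop word can join words that were not adjacent in the original text, which is worth keeping in mind when reading the N-gram features.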

In [27]:
def createBOW(documents, ngram_range = (1,1), max_features = None):
    
    # initialize CountVectorizer
    count_engine = CountVectorizer(stop_words = 'english',
                                   ngram_range = ngram_range,
                                   max_features = max_features)
    
    # fit/transform
    X1 = count_engine.fit_transform(documents)
    
    # Create Document-Term matrix from output
    DTM = pd.DataFrame(X1.toarray(),
                       columns = count_engine.get_feature_names_out(),
                       index = documents.index)
    
    # initialize TfidfTransformer
    tfidf_engine = TfidfTransformer(norm='l2', use_idf=True)
    
    # fit/transform
    X2 = tfidf_engine.fit_transform(DTM)
    
    # create TFIDF table
    TFIDF = pd.DataFrame(X2.toarray(),
                         columns = DTM.columns,
                         index=DTM.index)
    
    # create BOW
    BOW = DTM[DTM > 0].stack().to_frame('n') \
        .join(TFIDF[TFIDF > 0].stack().to_frame('tfidf'))
    BOW.index.rename('term_str', level = 1, inplace = True)
    
    return BOW

In [28]:
# create unigram BOW
BOW_unigrams = createBOW(ARTICLES, ngram_range = (1,1), max_features = 32000)

In [29]:
# create Ngram BOW
BOW_ngrams = createBOW(ARTICLES, ngram_range = (1,4), max_features = 10000)

#### Add Frequency Features to `VOCAB`<a name="subparagraph3"></a>
I add the following features to the `VOCAB` table.
+ `tfidf_mean` - Mean tfidf of the term across all documents in the corpus (documents that lack the term contribute zero)
+ `tfidf_max` - Max tfidf value of term
+ `df` - Document Frequency, count of documents in which the term appears
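Document frequency is easy to confuse with raw term frequency, so here is a minimal sketch of the distinction on three invented documents. The dictionary `docs` is illustrative only.

```python
from collections import Counter

# hypothetical articles, already lowercased and joined into strings
docs = {
    "a1": "soul soul gospel",
    "a2": "country soul",
    "a3": "country twang",
}

# term frequency: total occurrences across the corpus
n = Counter(term for text in docs.values() for term in text.split())

# document frequency: each document counts a term at most once
df = Counter(term for text in docs.values() for term in set(text.split()))

print(n)
print(df)
```

Here `soul` occurs three times but in only two documents, so its `n` is 3 while its `df` is 2.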

In [30]:
# initialize CountVectorizer with default params, but custom tokenizer
vocab_engine = CountVectorizer(tokenizer=lambda txt: txt.split())

# fit/transform
vocab_counts = vocab_engine.fit_transform(ARTICLES)

# convert to dataframe
DOC_TERM = pd.DataFrame(vocab_counts.toarray(),
                        columns = vocab_engine.get_feature_names_out(),
                       index = ARTICLES.index)
# append to VOCAB
VOCAB['df'] = DOC_TERM[DOC_TERM > 0].count()

In [31]:
# initialize TfidfTransformer
vocab_tfidf_engine = TfidfTransformer(norm='l2', use_idf=True)

# fit/transform
vocab_tfidf = vocab_tfidf_engine.fit_transform(DOC_TERM)

# create TFIDF dataframe
TFIDF = pd.DataFrame(vocab_tfidf.toarray(),
                    columns = DOC_TERM.columns,
                    index =  DOC_TERM.index)
# add to VOCAB
VOCAB['tfidf_mean'] = TFIDF.mean()
VOCAB['tfidf_max'] = TFIDF.max()

---
## Saving the Digital Analytical Edition<a name="paragraph6"></a>

With the creation of the `LIB`, `CORPUS`, `VOCAB`, and `BOW` tables, I have established a foundation on which I can begin my analysis. I save these tables to separate `.csv` files for ease of use. In the following notebook, I will use each of these tables to perform principal component analysis, topic modelling, word embeddings, and sentiment analysis. Along the way, I will continue to build out the digital analytical edition by adding features to each table.

In [32]:
# save to csv
LIB.to_csv('./data/LIB.csv')
CORPUS.to_csv('./data/CORPUS.csv')
VOCAB.to_csv('./data/VOCAB.csv')
BOW_unigrams.to_csv('./data/BOW_unigrams.csv')
BOW_ngrams.to_csv('./data/BOW_ngrams.csv')