# Country Music Reviews
### *Digital Analytical Edition*
---

### Table of Contents
1. [Introduction](#introduction)
2. [Libraries & Set Up](#paragraph1)
3. [Creating the Library Table](#paragraph2)
    1. [Parsing Dates](#subparagraph1)
4. [Constructing the Corpus](#paragraph3)

## Introduction<a name="introduction"></a>
In this notebook, I aggregate the source files to create a Digital Analytical Edition (DAE). The source files were scraped from [rocksbackpages.com](https://www.rocksbackpages.com) using `RBP_search_scraper.py` and `RBP_article_scraper.py`.

The DAE contains the following tables:  
+ `LIB.csv`    - Metadata for each article.
+ `CORPUS.csv` - Aggregated text from source files, indexed by the Ordered Hierarchy of Content Objects (OHCO) structure.
+ `VOCAB.csv`  - Linguistic features and statistics for words in `CORPUS.csv`. Stop words removed.
+ `BOW.csv`    - Bag-of-words representation of `CORPUS.csv` with stop words removed.

## Libraries & Set Up<a name="paragraph1"></a>

In [1]:
import pandas as pd
import numpy as np
import nltk
import spacy
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer, TfidfTransformer

In [2]:
# path to data
txtPath = "./data/articles_txt/"
htmlPath = "./data/articles_html/"

# define OHCO structure
OHCO = ['article_id', 'para_id', 'sent_id', 'token_id']
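
To illustrate what the OHCO index makes possible, here is a minimal sketch in plain Python. It avoids pandas to stay dependency-free, the tokens are made up rather than taken from the corpus, and `regroup` is a hypothetical helper. Because each token is keyed by the full `(article_id, para_id, sent_id, token_id)` tuple, truncating the key regroups the tokens at any higher level.

```python
from collections import defaultdict

OHCO = ['article_id', 'para_id', 'sent_id', 'token_id']

# hypothetical tokens keyed by their full OHCO tuple
tokens = {
    (1, 0, 0, 0): 'hello',
    (1, 0, 0, 1): 'darlin',
    (1, 0, 1, 0): 'nice',
    (1, 1, 0, 0): 'to',
    (2, 0, 0, 0): 'see',
    (2, 0, 0, 1): 'you',
}

def regroup(tokens, level):
    """Join tokens that share the first `level` OHCO keys."""
    groups = defaultdict(list)
    for key in sorted(tokens):
        groups[key[:level]].append(tokens[key])
    return {k: ' '.join(v) for k, v in groups.items()}

sentences = regroup(tokens, 3)   # keyed by (article_id, para_id, sent_id)
articles = regroup(tokens, 1)    # keyed by (article_id,)
```

The same idea carries over to a pandas `MultiIndex`, where grouping by a prefix of `OHCO` plays the role of truncating the tuple.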

---
## Creating the Library Table<a name="paragraph2"></a>

In [3]:
# read in article & search metadata
article_metadata = pd.read_csv("./data/RBP_article_metadata.csv")
article_metadata.set_index("id", inplace = True)

search_metadata = pd.read_csv("./data/RBP_search_metadata.csv")
search_metadata.set_index("id", inplace = True)

In [4]:
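# create LIB table (copy so that article_metadata is left unchanged)
LIB = article_metadata.copy()
LIB.index.rename(OHCO[0], inplace = True)

# assignment aligns rows on the shared id index
LIB['doc_type'] = search_metadata['type']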

#### Parsing Dates<a name="subparagraph1"></a>
Inspection shows that publication dates come in several formats, some of which omit the day or the month. I define a function that parses each date string. The table below shows the date assigned to each ambiguous format.

|`date` format|output format|
|------------|-------------|
|*Month* *year*|YYYY-mm-01|
|*year*|YYYY-07-01|
|Summer *year*|YYYY-08-01|
|Fall *year*|YYYY-11-01|
|Winter *year*|YYYY-02-01|
|Spring *year*|YYYY-05-01|

Any date string that is missing or does not adhere to one of the above formats will be corrected by hand.
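
The rules in the table can be sketched compactly as follows. This is an illustrative stand-in, not the notebook's parser: `sketch_parse` and `SEASON_MONTH` are hypothetical names, the sample strings are made up, and month names are assumed to be in English.

```python
import re
from datetime import date, datetime

# season -> month, following the table above
SEASON_MONTH = {'spring': 5, 'summer': 8, 'fall': 11, 'winter': 2}

def sketch_parse(text):
    """Hypothetical stand-in for a date parser; returns a date or None."""
    text = text.strip()
    # full date, e.g. "18 December 1975"
    try:
        return datetime.strptime(text, "%d %B %Y").date()
    except ValueError:
        pass
    # month and year, e.g. "October 1982" -> first of the month
    try:
        return datetime.strptime(text, "%B %Y").date()
    except ValueError:
        pass
    # season and year, e.g. "Summer 1976"
    m = re.fullmatch(r"(\w+) (\d{4})", text)
    if m and m.group(1).lower() in SEASON_MONTH:
        return date(int(m.group(2)), SEASON_MONTH[m.group(1).lower()], 1)
    # bare year, e.g. "2001" -> 1 July
    if re.fullmatch(r"\d{4}", text):
        return date(int(text), 7, 1)
    # anything else is left for manual correction
    return None

examples = {s: sketch_parse(s) for s in
            ["18 December 1975", "October 1982", "Summer 1976", "2001"]}
```

Strings that match none of the formats come back as `None`, which corresponds to the rows that are corrected by hand.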

In [5]:
from datetime import datetime, date


def normal_date(dt):
    # full date, e.g. "18 December 1975"
    return datetime.strptime(dt, "%d %B %Y")


def year_only(dt):
    # year only, e.g. "2001" -> 2001-07-01
    dt_obj = datetime.strptime(dt, "%Y").date()
    return dt_obj.replace(month = 7, day = 1)


def month_year(dt):
    # month and year, e.g. "October 1982" -> 1982-10-01
    return datetime.strptime(dt, "%B %Y").date()


def season_year(dt):
    # season and year, e.g. "Summer 1976" -> 1976-08-01
    season_month = {'spring': 5, 'summer': 8, 'fall': 11, 'winter': 2}

    parts = dt.split()
    if len(parts) < 2:
        return np.nan

    season, year = parts[-2].lower(), parts[-1]
    if season not in season_month:
        return np.nan

    # int(year) raises ValueError for a non-numeric year
    return date(int(year), season_month[season], 1)


def date_parser(dt):
    '''
    Converts publication date strings from rocksbackpages.com articles
    to date objects with format YYYY-mm-dd. Returns NaN when the string
    is missing or matches none of the expected formats.
    '''
    # missing values arrive as NaN (a float), not as strings
    if not isinstance(dt, str):
        return np.nan

    helpers = [normal_date, year_only, month_year, season_year]

    for f in helpers:
        try:
            return f(dt)
        except ValueError:
            pass

    return np.nan

In [6]:
LIB['date_parsed'] = LIB.date.apply(date_parser)

In [7]:
LIB.loc[LIB.date_parsed.isna()]

|article_id|title|author|source|date|subjects|doc_type|date_parsed|
|----------|-----|------|------|----|--------|--------|-----------|
|610|Charley Pride, Tammy Wynette and George Jones:...|Gene Guerrero|The Great Speckled Bird|31 February 1972|['george-jones', 'tammy-wynette', 'charley-pri...|Live Review|NaT|


Only one date failed to parse: `31 February 1972` is not a valid calendar date. I manually set it to February 29, 1972, the last valid day of that month. After that, I replace the `date` column with `date_parsed`, since keeping both is redundant.
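
As a quick check of why this string fails, the sketch below uses only the standard library. The variable names are illustrative, and the format string `"%d %B %Y"` is assumed to match the full-date format used for parsing.

```python
import calendar
from datetime import datetime

raw = "31 February 1972"

try:
    datetime.strptime(raw, "%d %B %Y")
    parsed_ok = True
except ValueError:
    # strptime rejects days that do not exist in the given month
    parsed_ok = False

# number of days in February 1972, which was a leap year
last_day = calendar.monthrange(1972, 2)[1]
```

Here `parsed_ok` ends up `False` and `last_day` is 29, so February 29 is the latest date that month can hold.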

In [8]:
from datetime import date
LIB.loc[610, 'date_parsed'] = date(1972,2,29)

# replace date col
LIB['date'] = LIB.date_parsed
LIB.drop(columns ='date_parsed', inplace = True)

Next, I notice that the `subjects` column holds a string representation of a list rather than an actual list. I convert `subjects` to list type.

In [9]:
from ast import literal_eval
LIB.subjects = LIB.subjects.apply(literal_eval)
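
For readers unfamiliar with `literal_eval`, here is a small standalone sketch of what the conversion does to a single value. The input string is illustrative, modeled on the `subjects` values shown earlier.

```python
from ast import literal_eval

raw = "['george-jones', 'tammy-wynette']"   # illustrative value

subjects = literal_eval(raw)

# literal_eval only accepts Python literals, so arbitrary
# expressions are rejected instead of being executed
try:
    literal_eval("__import__('os').getcwd()")
    rejected = False
except ValueError:
    rejected = True
```

This is the reason to prefer `literal_eval` over `eval` when reading list-like strings back from a CSV file.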

In [10]:
# preview LIB
LIB.sample(10)

|article_id|title|author|source|date|subjects|doc_type|
|----------|-----|------|------|----|--------|--------|
|4115|Classic Rockers Firefall Drop First New Record...|Bob Ruggiero|Houston Press|2020-11-30 00:00:00|[firefall, gram-parsons]|Interview|
|1019|Linda Ronstadt: 'Heat Wave' – The Long Hot Ses...|Todd Everett|Rolling Stone|1975-12-18 00:00:00|[linda-ronstadt]|Report and Interview|
|1118|George Jones, August 1976, at Sunset Park, Wes...|Peter Stone Brown|unpublished|1976-08-01 00:00:00|[george-jones]|Interview|
|2602|Sandy Posey|Phil Hardy, Dave Laing|The Faber Companion to 20th-Century Popular Music|2001-07-01 00:00:00|[sandy-posey]|Book Excerpt|
|216|Linda Leading Stone Poneys To Gold Water|Tony Leigh|KRLA Beat|1967-12-30 00:00:00|[stone-poneys-the]|Interview|
|3601|Laura Cantrell: St. Bonaventure's, Bristol|Stephen Dalton|The Times|2011-05-04 00:00:00|[laura-cantrell]|Live Review|
|2503|Shelby Lynne: I Am Shelby Lynne (Mercury)|Tom Cox|The Guardian|1999-09-24 00:00:00|[shelby-lynne]|Review|
|2103|Travis Tritt: Ten Feet Tall and Bulletproof (W...|Eric Weisbard|Spin|1994-06-01 00:00:00|[travis-tritt]|Review|
|508|Lynn Anderson: Pssst! Don't tell the British t...|Richard Green|New Musical Express|1971-03-20 00:00:00|[lynn-anderson]|Interview|
|1502|Rosanne Cash: Somewhere In The Stars (Columbia)|Mitchell Cohen|Creem|1982-10-01 00:00:00|[rosanne-cash]|Review|


---
## Constructing the Corpus<a name="paragraph3"></a>
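I read in the body of each document and tokenize it with NLTK: paragraphs are split on blank lines, sentences with `sent_tokenize`, and tokens with `WhitespaceTokenizer`, then tagged with `pos_tag`. Tokens are collected in the `CORPUS` table and indexed by the OHCO structure.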
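
The paragraph, sentence, and token levels can be sketched without any NLTK data as follows. This is a simplification under stated assumptions: a naive regular expression stands in for NLTK's sentence tokenizer, `str.split()` stands in for whitespace tokenization, no part-of-speech tagging is done, and `sketch_tokenize` and the sample text are hypothetical.

```python
import re

doc = "Hank sang. He wept too.\n\nThe crowd\nroared."

def sketch_tokenize(doc):
    """Return {(para_id, sent_id, token_id): token} for one document."""
    rows = {}
    # paragraphs: split on blank lines, then flatten single newlines
    paras = [p.strip().replace('\n', ' ') for p in re.split(r"\n{2,}", doc)]
    for para_id, para in enumerate(paras):
        # naive stand-in for a sentence tokenizer: split after ., ! or ?
        sents = re.split(r"(?<=[.!?])\s+", para)
        for sent_id, sent in enumerate(sents):
            # hyphens become spaces, then split on runs of whitespace
            for token_id, token in enumerate(sent.replace('-', ' ').split()):
                rows[(para_id, sent_id, token_id)] = token
    return rows

rows = sketch_tokenize(doc)
```

Prefixing each key with an `article_id` gives the full four-level OHCO index.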

In [11]:
# define function for reading files
def RBP_reader(filePath):
    '''
    Reads a scraped article .html file and returns the article text
    (standfirst plus copy, or copy alone) as a single string.
    '''
    import re
    from bs4 import BeautifulSoup
    
    with open(filePath, 'r', encoding = 'utf-8') as f:
        contents = f.read()
        
    soup = BeautifulSoup(contents, 'html')
    
    # str.replace treats its pattern literally, so newlines in the
    # writer name are collapsed with a regex substitution instead
    writer = soup.find("span", class_="writer").get_text()
    writer = re.sub(r'\n+', ' ', writer).strip()
    
    standfirst = soup.find("div", class_="standfirst").get_text()
    copy = soup.find("div", class_="copy").get_text()

    # if standfirst contains writer name, it is assumed that 
    # standfirst is purely metadata and should be ignored
    if writer.lower() in standfirst.lower():
        doc = copy
    else:
        doc = standfirst+copy
    
    return doc.strip()
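
The standfirst rule can be seen in isolation in the sketch below. `choose_body` is a hypothetical helper that mirrors only the final branch of the reader, and the strings are invented for illustration.

```python
def choose_body(writer, standfirst, copy):
    """Hypothetical helper mirroring the standfirst check."""
    # a standfirst that names the writer is treated as a byline and dropped
    if writer.lower() in standfirst.lower():
        return copy.strip()
    return (standfirst + copy).strip()

body = "The album opens quietly."

# standfirst is a byline: only the copy is kept
with_byline = choose_body("Jane Doe", "By JANE DOE, in Nashville\n", body)

# standfirst is a teaser: it is kept in front of the copy
with_teaser = choose_body("Jane Doe", "A second act for a singer.\n\n", body)
```

One caveat of this heuristic: an empty writer string is contained in every standfirst, so in that case the standfirst would always be dropped.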

In [12]:
def tokenize_collection(library, filePrefix = "./data/articles_html/"):
    '''
    Inputs
    -----
    library - pandas dataframe, must include article_id as index
    filePrefix - string, path to directory containing .html files
    
    Returns
    -----
    CORPUS - pandas dataframe, contains all tokens in corpus
    '''
    import re
    import nltk
    
    para_pat = r"\n{2,}"
    # POS tags assigned to punctuation and symbol tokens
    punc_pos = ['$', "''", '(', ')', '[', ']', ',', '--', '.', ':', '``']
    documents = []
    
    for i, article_id in enumerate(library.index, start = 1):
        
        doc = RBP_reader(filePrefix+str(article_id)+".html")
        
        # paragraphs: split on blank lines, then flatten single newlines
        PARAS = re.split(para_pat, doc)
        PARAS = [x.strip().replace('\n', ' ') for x in PARAS]
        PARAS = pd.DataFrame(PARAS, columns = ['para_str'])
        PARAS.index.name = 'para_id'
        
        # sentences within each paragraph
        SENTS = PARAS.para_str.apply(lambda x: pd.Series(nltk.sent_tokenize(x)))\
        .stack()\
        .to_frame('sent_str')
        
        SENTS.index.names = ['para_id', 'sent_id']
        
        # split hyphenated compounds into separate words
        SENTS = SENTS.sent_str.str.replace("-", " ").to_frame()
        
        # whitespace tokens paired with part-of-speech tags
        TOKENS = SENTS.sent_str\
                      .apply(lambda x: \
                             pd.Series(nltk.pos_tag(nltk.WhitespaceTokenizer().tokenize(x)))) \
                      .stack() \
                      .to_frame('pos_tuple')

        TOKENS.index.names = OHCO[1:]
        
        TOKENS['pos'] = TOKENS.pos_tuple.apply(lambda x: x[1])
        TOKENS['token_str'] = TOKENS.pos_tuple.apply(lambda x: x[0])
        
        # term_str: lowercased token with non-word characters removed;
        # tokens tagged as punctuation are left as NaN
        TOKENS['term_str'] = TOKENS[~TOKENS.pos.isin(punc_pos)].token_str \
                                .str.replace(r'[\W_]+', '', regex=True).str.lower()

        TOKENS['article_id'] = article_id
        TOKENS = TOKENS.reset_index().set_index(OHCO)
        
        documents.append(TOKENS)
        
        # report progress every 150 documents
        if i % 150 == 0:
            print(f"{round(i/len(library.index)*100, 1)}% Complete")
    
    CORPUS = pd.concat(documents).sort_index()
    
    return CORPUS

In [13]:
CORPUS = tokenize_collection(LIB)

18.6% Complete
37.3% Complete
55.9% Complete
74.5% Complete
93.2% Complete


In [14]:
CORPUS.to_csv('./data/CORPUS.csv')

In [19]:
CORPUS.sample(10)

|article_id|para_id|sent_id|token_id|pos_tuple|pos|token_str|term_str|
|----------|-------|-------|--------|---------|---|---------|--------|
|1207|8|2|12|(Nelson's, NNP)|NNP|Nelson's|nelsons|
|3014|7|2|16|(ever, RB)|RB|ever|ever|
|3904|4|1|20|(old, JJ)|JJ|old|old|
|1002|69|0|11|(meaning,, NNS)|NNS|meaning,|meaning|
|1815|13|1|16|(she'll, JJ)|JJ|she'll|shell|
|3101|29|10|0|(Don't, NNP)|NNP|Don't|dont|
|3019|21|1|41|(plot, NN)|NN|plot|plot|
|3609|11|1|8|(Kennedy,, NNP)|NNP|Kennedy,|kennedy|
|2912|51|0|18|(punk, NN)|NN|punk|punk|
|4018|25|2|23|(too, RB)|RB|too|too|
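
The `term_str` values in the sample follow from the normalization pattern `[\W_]+` used during tokenization. The sketch below applies the same pattern to a few of the sampled tokens with the standard `re` module; `to_term` is a hypothetical name for illustration.

```python
import re

def to_term(token):
    """Strip non-word characters and underscores, then lowercase."""
    return re.sub(r'[\W_]+', '', token).lower()

terms = {t: to_term(t) for t in ["Nelson's", "meaning,", "she'll", "Don't"]}
```

Note that removing apostrophes merges some distinct words: "she'll" becomes `shell`, which is indistinguishable from the noun. This may matter for later vocabulary statistics.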
