# Country Music Reviews
### *Digital Analytical Edition*
---

## Introduction
In this notebook, I aggregate the source files to create a Digital Analytical Edition (DAE). The source files were scraped from [rocksbackpages.com](https://www.rocksbackpages.com) using `RBP_search_scraper.py` and `RBP_article_scraper.py`.  

The DAE contains the following tables:  
+ `LIB.csv`    - Metadata for each article.
+ `CORPUS.csv` - Aggregated text from source files, indexed by its Ordered Hierarchy of Content Objects (OHCO) structure.
+ `VOCAB.csv`  - Linguistic features and statistics for words in `CORPUS.csv`. Stop words removed.
+ `BOW.csv`    - Bag-of-words representation of `CORPUS.csv` with stop words removed.

## Libraries & Set Up

In [1]:
import pandas as pd
import numpy as np
import nltk
import spacy
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer, TfidfTransformer

In [2]:
# path to data
txtPath = "./data/articles_txt/"
htmlPath = "./data/articles_html/"

# define OHCO structure
OHCO = ['article_id', 'paragraph_num', 'sentence_num', 'token_id']
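
To illustrate how an OHCO index is typically used, here is a minimal sketch on a toy token table. The rows and the `token_str` column name are invented for illustration and are not taken from the corpus; only the `OHCO` list matches the definition above.

```python
import pandas as pd

# Same OHCO levels as defined above
OHCO = ['article_id', 'paragraph_num', 'sentence_num', 'token_id']

# Toy token table (made-up rows; `token_str` is a hypothetical column name)
tokens = pd.DataFrame(
    [
        (1, 0, 0, 0, "Willie"),
        (1, 0, 0, 1, "sings"),
        (1, 0, 1, 0, "Crowd"),
        (1, 0, 1, 1, "cheers"),
        (2, 0, 0, 0, "Encore"),
    ],
    columns=OHCO + ["token_str"],
).set_index(OHCO)

# Grouping by a prefix of OHCO moves up the hierarchy:
# OHCO[:3] -> sentences, OHCO[:1] -> whole articles
sentences = tokens.groupby(OHCO[:3])["token_str"].agg(" ".join)
articles = tokens.groupby(OHCO[:1])["token_str"].agg(" ".join)

print(sentences)
print(articles)
```

Because each level is a prefix of the next, the same table can be re-aggregated at any granularity without storing separate copies of the text.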

---
## Constructing the `LIB` Table

In [3]:
# read in article & search metadata
article_metadata = pd.read_csv("./data/RBP_article_metadata.csv")
article_metadata.set_index("id", inplace = True)

search_metadata = pd.read_csv("./data/RBP_search_metadata.csv")
search_metadata.set_index("id", inplace = True)

In [4]:
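# create LIB table (copy so that article_metadata itself is left unmodified)
LIB = article_metadata.copy()
LIB.index.rename(OHCO[0], inplace = True)

# add the document type from the search metadata; values align on the shared id index
LIB['doc_type'] = search_metadata['type']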
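
The `doc_type` assignment works because pandas aligns a Series to the target frame by index label, not by row position. The sketch below shows this on two toy frames; the ids, titles, and types are invented stand-ins, not values from the scraped metadata.

```python
import pandas as pd

# Toy stand-ins for the two metadata tables (values are invented)
article_metadata = pd.DataFrame(
    {"title": ["A", "B", "C"]},
    index=pd.Index([10, 20, 30], name="id"),
)
search_metadata = pd.DataFrame(
    {"type": ["Review", "Interview"]},
    index=pd.Index([30, 10], name="id"),  # different order, id 20 missing
)

LIB = article_metadata.copy()

# Column assignment aligns on index labels, not on row position
LIB["doc_type"] = search_metadata["type"]

print(LIB)
```

Ids missing from the search metadata end up as `NaN` in `doc_type`, so a quick `LIB.doc_type.isna().sum()` is a cheap way to confirm that the two tables cover the same articles.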

Inspection reveals that the publication dates do not follow a single format. Full dates written as *day* Month *year* are parsed directly. For the remaining strings, I define a function that assigns a date according to the following table.

|`date` format|output format|
|------------|-------------|
|Month *year*|YYYY-mm-01|
|*year*|YYYY-07-01|
|Summer *year*|YYYY-08-01|
|Fall *year*|YYYY-11-01|
|Winter *year*|YYYY-02-01|
|Spring *year*|YYYY-05-01|

Any date string that is missing or does not adhere to one of the above formats will be corrected by hand.
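
As a quick illustration of the rules in the table, here is a compact, table-driven sketch. It is an illustrative restatement, not the parser used in this notebook (which is defined in the next cell); `SEASON_MONTH` and `parse_ambiguous_date` are hypothetical names, and the sample strings are made up.

```python
from datetime import date, datetime

# Month assigned to each season, copied from the table above
SEASON_MONTH = {"winter": 2, "spring": 5, "summer": 8, "fall": 11}

def parse_ambiguous_date(text):
    """Hypothetical helper: apply the table's rules to one date string."""
    parts = text.split()
    if len(parts) == 1:
        # "1975" -> middle of the year
        return date(int(parts[0]), 7, 1)
    if len(parts) == 2 and parts[0].lower() in SEASON_MONTH:
        # "Summer 1975" -> first day of the season's assigned month
        return date(int(parts[1]), SEASON_MONTH[parts[0].lower()], 1)
    if len(parts) == 2:
        # "August 1975" -> first day of that month
        return datetime.strptime(text, "%B %Y").date()
    raise ValueError(f"unrecognized date format: {text!r}")

for sample in ["1975", "August 1975", "Fall 1975", "Winter 1975"]:
    print(sample, "->", parse_ambiguous_date(sample))
```

Anything that matches none of the three shapes raises `ValueError`, which mirrors the rule that such strings are corrected by hand.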

In [5]:
# shared imports for the date helpers
from datetime import datetime, date

# define helper functions which parse common formats

def normal_date(dt):
    # full date written as "day Month year"
    return datetime.strptime(dt, "%d %B %Y")

def year_only(dt):
    # year only: assign July 1 of that year
    dt_obj = datetime.strptime(dt, "%Y").date()
    return dt_obj.replace(month = 7, day = 1)

def month_year(dt):
    # month and year: strptime defaults the day to 1
    return datetime.strptime(dt, "%B %Y").date()

def season_year(dt):
    # season and year: assign the first day of a fixed month per season
    season_month = {'spring': 5, 'summer': 8, 'fall': 11, 'winter': 2}

    try:
        season = dt.split()[-2].lower()
        year = dt.split()[-1]
    except IndexError:
        return np.nan

    if season in season_month:
        return date(int(year), season_month[season], 1)
    return np.nan

# define wrapper function to try-catch errors

def date_parser(dt):
    '''
    Converts publication date strings from rocksbackpages.com articles
    to date objects with format YYYY-mm-dd. Returns NaN when the string
    is missing or matches none of the known formats.
    '''
    # missing values arrive as float NaN, which strptime cannot handle
    if not isinstance(dt, str):
        return np.nan

    helpers = [normal_date, year_only, month_year, season_year]

    for f in helpers:
        try:
            return f(dt)
        except ValueError:
            pass

    return np.nan


In [6]:
LIB['date_parsed'] = LIB.date.apply(date_parser)

In [7]:
LIB.loc[LIB.date_parsed.isna()]

|article_id|title|author|source|date|subjects|doc_type|date_parsed|
|----------|-----|------|------|----|--------|--------|-----------|
|610|Charley Pride, Tammy Wynette and George Jones:...|Gene Guerrero|The Great Speckled Bird|31 February 1972|['george-jones', 'tammy-wynette', 'charley-pri...|Live Review|NaT|


Only one date failed to parse: `31 February 1972` is not a valid calendar date. I manually set it to the last day of that month, February 29, 1972 (1972 was a leap year). I then replace the `date` column with `date_parsed`, since keeping both is redundant.

In [8]:
from datetime import date
LIB.loc[610, 'date_parsed'] = date(1972,2,29)

# replace date col
LIB['date'] = LIB.date_parsed
LIB.drop(columns = 'date_parsed', inplace = True)
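
The manual correction above relies on two facts that can be checked with the standard library: the scraped string names a day that does not exist, and February 1972 had 29 days. The following is a small stand-alone check, separate from the notebook's own cells.

```python
import calendar
from datetime import date, datetime

# The scraped string names a day that does not exist, so strptime rejects it
try:
    datetime.strptime("31 February 1972", "%d %B %Y")
    parsed = True
except ValueError:
    parsed = False

# monthrange returns (weekday of the 1st, number of days in the month)
last_day = calendar.monthrange(1972, 2)[1]
corrected = date(1972, 2, last_day)

print(parsed, last_day, corrected)
```

Computing the last day with `calendar.monthrange` avoids hard-coding 28 or 29 if a similar correction is ever needed for another year.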

In [12]:
# preview LIB
LIB.sample(10)

|article_id|title|author|source|date|subjects|doc_type|
|----------|-----|------|------|----|--------|--------|
|3414|Alison Krauss: An Interview|Paul Sexton|Daily Telegraph|2009-07-22 00:00:00|['alison-krauss']|Interview|
|518|Waylon Jennings|Gene Guerrero|The Great Speckled Bird|1971-06-21 00:00:00|['waylon-jennings']|Interview|
|2019|Garth Brooks|Robin Eggar|The Sunday Times|1994-01-23 00:00:00|['garth-brooks']|Profile and Interview|
|2911|The Dixie Chicks: Weapons Of Mass Destruction|Andria Lisle|MOJO|2003-10-01 00:00:00|['dixie-chicks-the']|Interview|
|2005|Trisha Yearwood: Crazy Horse, Santa Ana CA|Richard Cromelin|Los Angeles Times|1992-10-14 00:00:00|['trisha-yearwood']|Live Review|
|804|Chet Atkins: Country Gent|Chris Charlesworth|Melody Maker|1973-11-24 00:00:00|['chet-atkins']|Interview|
|605|Charley Pride, Lynn Anderson: Oakland Coliseum...|Philip Elwood|The San Francisco Examiner|1971-10-30 00:00:00|['lynn-anderson', 'charley-pride']|Live Review|
|1009|Willie Nelson: Red Headed Stranger (Columbia)|Nick Tosches|Creem|1975-08-01 00:00:00|['willie-nelson']|Review|
|3718|Guy Clark: My Favorite Picture Of You|Holly Gleason|Paste|2013-07-23 00:00:00|['guy-clark']|Review|
|2205|Frontman: Merle Haggard|Mark Rowland|Musician|1995-09-01 00:00:00|['merle-haggard']|Interview|
