# Country & Soul
### *Analyzing Trends in American Music Journalism from 1960 to Present*
---

# Textual Analysis 
#### Table of Contents
1. [Introduction](#introduction)
2. [Vector Space Modeling](#paragraph1)
3. [Clustering](#paragraph2)
4. [Principal Component Analysis](#paragraph3)
5. [Topic Modeling](#paragraph4)
6. [Word Embeddings](#paragraph5)
7. [Sentiment Analysis](#paragraph6)

## Introduction<a name="introduction"></a>
In this notebook, I apply several analytical techniques to the corpus of music reviews: vector space models (VSMs), clustering methods, Principal Component Analysis (PCA), topic models, word embeddings, and sentiment analysis.

To begin, I read in the tables that I created in *`Digital Analytical Edition.ipynb`*.

In [1]:
# Libraries
import pandas as pd
import numpy as np
import plotly.express as px

from gensim.models import word2vec
from ast import literal_eval

import plotly.io as pio
pio.renderers.default='notebook'

In [2]:
# define OHCO structure
OHCO = ['article_id', 'para_id', 'sent_id', 'token_id']

# import data
path = "./data/"


LIB = pd.read_csv(path+"LIB.csv")
LIB.set_index('article_id', inplace = True)
LIB.subjects = LIB.subjects.apply(literal_eval)

VOCAB = pd.read_csv(path+"VOCAB.csv")
VOCAB.set_index('term_str', inplace = True)

CORPUS = pd.read_csv(path+"CORPUS.csv")
CORPUS.set_index(OHCO, inplace = True)

In [3]:
uBOW = pd.read_csv(path+"BOW_unigrams.csv")
uBOW.set_index(['article_id', 'term_str'], inplace = True)

nBOW = pd.read_csv(path+"BOW_ngrams.csv")
nBOW.set_index(['article_id', 'term_str'], inplace=True)

In [4]:
# Frequent artists by genre
from itertools import chain

# Flatten the per-article subject lists into one list of artist mentions per genre
soul_artists = list(chain.from_iterable(LIB[LIB.topic == 'soul'].subjects))
country_artists = list(chain.from_iterable(LIB[LIB.topic == 'country'].subjects))

# Count mentions and keep the five most frequent artists in each genre
top_soul_artists = pd.Series(soul_artists).value_counts().head(5).rename('mentions').to_frame()
top_soul_artists.index.name = 'artist'
top_country_artists = pd.Series(country_artists).value_counts().head(5).rename('mentions').to_frame()
top_country_artists.index.name = 'artist'

# DataFrame.append is deprecated, so stack the two tables with pd.concat
top_artists = pd.concat([top_soul_artists, top_country_artists])

In [5]:
top_artists = top_artists.reset_index().artist.to_list()

---
## Vector Space Modeling<a name="paragraph1"></a>

Here, I create several vector space models (VSMs). The first VSM is a straightforward document-term matrix. This representation allows me to assess the similarity of documents in the corpus.

The second VSM is a term-time matrix that measures term occurrence over time. Terms are assigned a time step based on the order of their occurrence in an article. Publication dates are used to order articles within the corpus.
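
As a rough illustration of the term-time idea, the sketch below builds a tiny term-time matrix from made-up data. The table `tokens`, its `pub_date` column, and the choice of three tokens per time bin are all assumptions for the example, not the structures used in this project.

```python
import pandas as pd

# Hypothetical toy data: one row per token, with a publication date per article.
tokens = pd.DataFrame({
    "article_id": [2, 2, 2, 1, 1, 1],
    "pub_date": pd.to_datetime(["1975-03-01"] * 3 + ["1968-06-15"] * 3),
    "token_id": [0, 1, 2, 0, 1, 2],
    "term_str": ["soul", "groove", "soul", "country", "twang", "soul"],
})

# Order articles by publication date, then tokens by their position in the article.
tokens = tokens.sort_values(["pub_date", "article_id", "token_id"]).reset_index(drop=True)

# Each token gets a global time step reflecting its place in that ordering.
tokens["time_step"] = range(len(tokens))

# Group consecutive time steps into coarse bins (assumed width: 3 tokens).
bin_size = 3
tokens["time_bin"] = tokens["time_step"] // bin_size

# Term-time matrix: rows are terms, columns are time bins, cells are counts.
TTM = tokens.groupby(["term_str", "time_bin"]).size().unstack("time_bin", fill_value=0)
print(TTM)
```

In this toy case the 1968 article fills bin 0 and the 1975 article fills bin 1, so each row of `TTM` shows how often a term appears early versus late in the corpus.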

In [6]:
# Creating the Doc-Term Matrix
DTM = uBOW.tfidf.unstack('term_str', fill_value=0)
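
One common way to assess document similarity from a document-term matrix is cosine similarity. The sketch below uses a made-up bag-of-words table (`bow`, with invented TF-IDF weights) indexed the same way as `uBOW`; it is an illustration under those assumptions, not a result from the corpus.

```python
import numpy as np
import pandas as pd

# Hypothetical toy bag-of-words table, indexed like uBOW.
bow = pd.DataFrame({
    "article_id": [1, 1, 2, 2, 3],
    "term_str": ["soul", "groove", "soul", "groove", "twang"],
    "tfidf": [1.0, 2.0, 2.0, 4.0, 3.0],
}).set_index(["article_id", "term_str"])

# Pivot to a document-term matrix; missing document/term pairs become 0.
dtm = bow.tfidf.unstack("term_str", fill_value=0)

# Cosine similarity: scale each row to unit length, then take dot products.
norms = np.linalg.norm(dtm.values, axis=1, keepdims=True)
unit = dtm.values / norms
sim = pd.DataFrame(unit @ unit.T, index=dtm.index, columns=dtm.index)
print(sim.round(2))
```

Articles 1 and 2 use the same terms in the same proportions, so their similarity is 1; article 3 shares no terms with either, so its similarity to both is 0.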

In [17]:
from sklearn.decomposition import PCA

# Project the document-term matrix onto its first two principal components
pca_engine = PCA(n_components = 2)

# Pass a plain array: DTM's column labels mix floats and strings,
# which scikit-learn does not accept as feature names
docPCs = pca_engine.fit_transform(DTM.to_numpy())
DOCTERMPC = pd.DataFrame(docPCs, index = DTM.index, columns = ['PC1', 'PC2'])
DOCTERMPC = DOCTERMPC.join(LIB)

In [19]:
px.scatter(DOCTERMPC,
          x = 'PC1',
          y = 'PC2',
          color = 'topic')
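
When reading a two-component scatter plot like this one, it can help to check how much of the total variance the two components retain. The sketch below does so on random stand-in data (`X` is an assumption, not the real `DTM`), using scikit-learn's `explained_variance_ratio_` attribute.

```python
import numpy as np
from sklearn.decomposition import PCA

# Stand-in data: 20 "documents" by 5 "terms" of random weights.
rng = np.random.default_rng(0)
X = rng.random((20, 5))

# Fit a two-component PCA, as in the cell above.
pca = PCA(n_components=2)
pcs = pca.fit_transform(X)

# Share of total variance captured by each component, largest first.
ratio = pca.explained_variance_ratio_
print(ratio, ratio.sum())
```

If the two ratios sum to a small fraction, the plot shows only a thin slice of the variation between documents, and apparent overlap between genres should be read with that in mind.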

---
## Clustering<a name="paragraph2"></a>

---
## Principal Component Analysis<a name="paragraph3"></a>

---
## Topic Modeling<a name="paragraph4"></a>

---
## Word Embeddings<a name="paragraph5"></a>

---
## Sentiment Analysis<a name="paragraph6"></a>