<a href="https://colab.research.google.com/github/columbia-data-club/meetings/blob/main/2023/april_6_intermediate_textual_data_analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

![A blue background with the spaCy logo and the words Data Club on it](https://raw.githubusercontent.com/columbia-data-club/meetings/main/assets/images/data-club-spacy.png)

# Intermediate Textual Data Analysis

April 6, 2023

by [Moacir P. de Sá Pereira](https://moacir.com) for the [Columbia Data Club](https://github.com/columbia-data-club/).

As we investigate our textual data in more detail, the techniques for analyzing such unstructured data rely on new libraries and models provided by machine learning. Here, we look to the cutting edge of contemporary Python text analysis libraries to learn how to mobilize their potential.

This notebook builds on the notebook “[Exploratory Analysis of Textual Data](https://github.com/columbia-data-club/meetings/blob/main/2023/march_23_textual_analysis.ipynb)” from March 2023. It is a continuation of the work in that notebook, and, as such, takes the contents of that notebook for granted.

We’ll start by installing the [Textacy](https://textacy.readthedocs.io/) library and [spaCy](https://spacy.io/) language models.

In [None]:
!python -m pip install textacy
!python -m spacy download en_core_web_sm

## Isolating Text and Building Metadata

In [the previous notebook](https://github.com/columbia-data-club/meetings/blob/main/2023/march_23_textual_analysis.ipynb), we limited ourselves to 393 articles published by Yahoo Finance on March 10, 2023. For this notebook, I’ve prepared a dataset that scraped every article published by Yahoo Finance on that date and then limited it to the articles from Yahoo Finance itself and the articles that did not yield errors of some sort.

In [None]:
import pandas as pd

df = pd.read_parquet("https://github.com/columbia-data-club/meetings/blob/main/assets/data/mar_10_articles_full.parquet?raw=true",
                     columns=["url", "headline", "hostname", "raw_html_text"])
print(len(df))
df.head()

We have 2504 total articles, then. When we captured the content of each article, we slurped up everything inside the `<article>` tag. This was generally a good idea at the time, but it includes things like the “Trending” sidebar, as well as other material ancillary to the text of the article itself.

That said, some of that ancillary material can give us metadata for the articles that we could use in our analysis. The `<header>` tag, for example, gives the source of the article as well as the headline. Similarly, we can find the article’s author in the `<div class="caas-attr-item-author">` as well as the estimated reading time in `<span class="caas-attr-mins-read">`.

The text of the news story resides entirely within `<p>` tags inside `<div class="caas-body">`, so we can extract those paragraphs to rebuild the text without the additional textual elements inside the `<article>` tag. Time for more [Beautiful Soup](https://beautiful-soup-4.readthedocs.io/en/latest/).

Then, let’s see where the articles come from and get a sense of the reading times.

In [None]:
# This creates a single, sample article for inspecting

# sample = df.sample(1, random_state=42)
# sample_article = sample.iloc[0]
# print(sample_article["url"])
# with open("article.html", "w") as file:
#     file.write(sample_article["raw_html_text"])

In [None]:
from bs4 import BeautifulSoup
from datetime import datetime
import numpy as np

def extract_textual_features(raw_html_text):
  html = BeautifulSoup(raw_html_text, "html.parser")
  provider = html.find("span", class_="caas-attr-provider").text.strip()
  publication_datetime = datetime.strptime(html.find("time").get("datetime"), "%Y-%m-%dT%H:%M:%S.%f%z")
  byline = html.find("div", class_="caas-attr-item-author").text.strip()
  read_time = np.nan
  read_time_span = html.find("span", class_="caas-attr-mins-read")
  if read_time_span:
    read_time = int(read_time_span.text.strip().replace(" min read", ""), base=10)
  text = " ".join([p.text.strip() for p in html.find("div", class_="caas-body").find_all("p")])
  char_count = len(text)

  return [provider, publication_datetime, byline, read_time, char_count, text]

# print(extract_textual_features(sample_article["raw_html_text"]))
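
To make the two string conversions inside `extract_textual_features` concrete, here is a minimal sketch on made-up values. The timestamp and reading-time strings are assumptions about the shape of the page markup, not values taken from the dataset.

```python
from datetime import datetime, timedelta

# Hypothetical attribute values, shaped like the ones the function expects.
raw_datetime = "2023-03-10T14:05:30.000Z"
raw_read_time = " 4 min read "

# %f consumes the fractional seconds; %z accepts "Z" (UTC) as well as offsets like "+0000".
published = datetime.strptime(raw_datetime, "%Y-%m-%dT%H:%M:%S.%f%z")

# Strip the whitespace and the label, then convert what remains to an integer.
read_time = int(raw_read_time.strip().replace(" min read", ""))

print(published.isoformat())
print(read_time)
```

Because the parsed datetime is timezone-aware, it can be compared safely with other aware datetimes later on.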

In [None]:
from tqdm.auto import tqdm

tqdm.pandas()

# ~1:13 to complete
df[["provider", "publication_datetime", "byline", "read_time", "char_count", "text"]] = df.progress_apply(
    lambda row: extract_textual_features(row["raw_html_text"]),
    axis=1,
    result_type="expand")

In [None]:
ax = df["provider"].value_counts().plot(
    kind="bar", 
    figsize=(10,2),
    title="Article Provider Distribution on March 10, 2023")
ax.set_xlabel("Provider")
ax.set_ylabel("Count")

In [None]:
df["read_time"].describe()

## Making Docs with spaCy and a Corpus with Textacy

Excellent work. With this metadata split out and the article text isolated, we can move forward and do the computationally heavy part of the workbook, converting each article into a spaCy [`Doc`](https://spacy.io/api/doc). 

In [None]:
import textacy
from textacy import preprocessing

preproc = preprocessing.make_pipeline(
    preprocessing.normalize.whitespace,
    preprocessing.normalize.quotation_marks,
    preprocessing.replace.emojis,
    preprocessing.replace.emails,
    preprocessing.replace.urls
)
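
As I understand it, `make_pipeline` returns a single callable that applies each preprocessing function in order. The sketch below imitates that idea with the standard library only. The helper names are hypothetical and the cleaning steps are far cruder than Textacy’s own, so treat it as an illustration of function composition rather than of Textacy’s implementation.

```python
import re
from functools import reduce

def make_simple_pipeline(*funcs):
    """Return a function that applies each of funcs to a string, left to right."""
    def pipeline(text):
        return reduce(lambda acc, func: func(acc), funcs, text)
    return pipeline

# Two hypothetical cleaning steps.
def collapse_whitespace(text):
    return re.sub(r"\s+", " ", text).strip()

def straighten_quotes(text):
    return text.replace("“", '"').replace("”", '"').replace("’", "'")

clean = make_simple_pipeline(collapse_whitespace, straighten_quotes)
cleaned = clean("  “Stocks   fell,”  she said.\n")
print(cleaned)
```

The order of the steps matters whenever one step’s output changes what a later step can match.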

In [None]:
def build_doc_from_row(row):
  metadata = {
      "title": row["headline"],
      "url": row["url"],
      "provider": row["provider"],
      "byline": row["byline"],
      "read_time": row["read_time"],
      "char_count": row["char_count"],
      "publication_datetime": row["publication_datetime"].strftime("%Y-%m-%dT%H:%M:%S.%f%z")
  }
  
  return textacy.make_spacy_doc((row["text"], metadata), lang="en_core_web_sm")


In [None]:
# ~ 7:00

df["doc"] = df.progress_apply(build_doc_from_row, axis=1)

Now that we have created all the `Doc`s, we can abandon `pandas` and collapse our dataset into a Textacy `Corpus`, which will let us work with the entire collection at once, instead of iteratively.

In [None]:
corpus = textacy.Corpus("en_core_web_sm", data=df["doc"])
print(corpus)

And what’s more, we can save the corpus to disk. We have not hit the computer too hard so far, but this is a good, safe step.

In [None]:
corpus.save("corpus.bin.gz")

Surprisingly, at 23 MB, the corpus takes up about half the disk space of the parquet file with the raw HTML of all the articles (42 MB). However, if we also gzip the parquet file, its size falls to 3.9 MB. Hopefully this gives a sense both of how useful gzip is for moving files around quickly and of how much data was added when we tokenized and parsed the text with spaCy to make those `Doc`s.
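
For a feel of what gzip does to markup-heavy text, here is a small sketch using Python’s built-in `gzip` module. The input is made up and far more repetitive than real article HTML, so the ratio it prints overstates what we should expect from the actual files.

```python
import gzip

# Made-up, highly repetitive markup standing in for scraped article HTML.
raw = ('<div class="caas-body"><p>Markets moved today.</p></div>\n' * 2000).encode("utf-8")
compressed = gzip.compress(raw)

print(f"raw: {len(raw):,} bytes")
print(f"gzipped: {len(compressed):,} bytes")

# Decompressing gives back exactly the bytes we started with.
restored = gzip.decompress(compressed)
```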

Anyway, we can download a saved copy of the corpus with our old friend, the Requests library, and then load it from disk like this.

In [None]:
import textacy # In case we're starting from this cell.
import requests

corpus_response = requests.get(
    "https://github.com/columbia-data-club/meetings/blob/main/assets/data/mar_10_articles_corpus.bin.gz?raw=true"
)
with open("mar_10_articles_corpus.bin.gz","wb") as f:
  f.write(corpus_response.content)

corpus = textacy.Corpus.load("en_core_web_sm", "./mar_10_articles_corpus.bin.gz")
print(corpus)

Textacy includes some properties that let us get some information about our corpus as a whole, and we’ll look at those now. Here, we’ll be using some of the tricks from the [Textacy `Corpus` tutorial](https://textacy.readthedocs.io/en/latest/tutorials/tutorial-2.html). Textacy’s topic modeling is just an extension of [scikit-learn’s decomposition models](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.decomposition), and we’ll be using the one for [latent Dirichlet allocation](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.LatentDirichletAllocation.html).

We’ll:

1. Extract named entities.
2. Lemmatize our `Doc`s, which bundles morphologically different versions of a word under a single base form (“was,” “is,” and “are” all get filed under “be”).
3. Vectorize the documents, which lets us compare them to each other and measure distinctness of documents, creating a matrix with documents along one axis and terms along the other.
4. Generate some topic models based on the categorical distinctions of the articles. 
5. Visualize the topics.
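
To make step 3 in the list above less abstract, here is a minimal sketch of a document-term matrix built by hand from three made-up, already lemmatized documents. Textacy will do this for us at scale and with weighting, so the toy names below are only for illustration.

```python
from collections import Counter

# Three tiny, made-up "documents", already tokenized and lemmatized.
toy_docs = [
    ["bank", "rate", "bank"],
    ["rate", "inflation"],
    ["bank", "collapse", "inflation"],
]

# The vocabulary maps each distinct term to a column index.
toy_vocab = {term: idx for idx, term in enumerate(sorted({t for doc in toy_docs for t in doc}))}

# One row per document, one column per term, raw counts as the cell values.
toy_matrix = []
for doc in toy_docs:
    counts = Counter(doc)
    toy_matrix.append([counts.get(term, 0) for term in toy_vocab])

print(toy_vocab)
for row in toy_matrix:
    print(row)
```

Each row is now a vector, which is what lets us compare documents numerically.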

Before continuing, I want to emphasize the complex nature of topic modeling and underscore that these topics are not necessarily of particular semantic use. Topic modeling works by splitting $n$ documents into $m$ topics, but the topics are not what the articles are “about.” “Topic” is a bit of a misleading term here, since we associate it with a summary or description of the information a chunk of text passes to the reader. That is not what happens in topic modeling. So if a topic appears with, say, aquatic terms in it, that does not mean that the documents in that topic are about the ocean. It simply means that they used those words, which seemed common enough to be one of the $m$ topics but distinct enough that they applied only to a specific subset of documents.

In [None]:
import random
from textacy import extract
# partial pre-fills some of a function's arguments and returns a new callable that accepts the rest
from functools import partial

def extract_terms(doc):
  return list(extract.terms(
    doc,
    # n-gram terms and settings
    ngs=partial(extract.ngrams, n=2, include_pos={"NOUN", "ADJ"}),
    # entity terms and settings. GPE = Geopolitical entity. Try `spacy.explain("GPE")`
    ents=partial(extract.entities, include_types={"PERSON", "ORG", "GPE", "LOC"}),
    # don't extract entities again if they were already extracted as n-grams
    dedupe=True
    ))
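
If `partial` is new to you, this minimal sketch shows what it does with a made-up function. `toy_ngrams` is hypothetical and much simpler than `extract.ngrams`.

```python
from functools import partial

def toy_ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Fix n=2 now; supply the tokens later.
toy_bigrams = partial(toy_ngrams, n=2)

print(toy_bigrams(["silicon", "valley", "bank"]))
```

In `extract_terms`, the same trick lets us hand `extract.terms` two pre-configured extractors that it can call with each `Doc`.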

In [None]:
sample_article = random.choice(corpus)
print(sample_article._.meta["url"])
terms = extract_terms(sample_article)
terms

In [None]:
list(extract.terms_to_strings(terms, by="lemma"))

In [None]:
docs_terms = (extract_terms(doc) for doc in tqdm(corpus))
tokenized_docs = (extract.terms_to_strings(doc_terms, by="lemma") for doc_terms in docs_terms)
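
Both assignments in the cell above are generator expressions, so as far as I can tell no terms are extracted (and the progress bar does not move) until something iterates over `tokenized_docs`. A generator can also be consumed only once. A minimal sketch with made-up numbers:

```python
# A generator expression does no work until something iterates over it.
squares = (n * n for n in range(4))

first_pass = list(squares)   # consumes the generator
second_pass = list(squares)  # nothing left to yield

print(first_pass)
print(second_pass)
```

If you want to build a second matrix from the same terms, rerun the cell above first to create fresh generators.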

In [None]:
from textacy import representations

doc_term_matrix, vocab = representations.build_doc_term_matrix(tokenized_docs,
  # minimum document frequency, as a proportion in [0, 1] or an absolute count
  min_df=5,
  # maximum document frequency
  max_df=0.7,
  # how to weight term frequency
  tf_type="linear",
  # which idf equation to use.
  # this is idf = log((n_docs + 1) / (df + 1)) + 1
  idf_type="smooth"
)
doc_term_matrix
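
Written with explicit parentheses, the smoothed weighting is idf = log((n_docs + 1) / (df + 1)) + 1. Here is a short sketch of how it behaves, assuming the natural logarithm and a made-up corpus of 1,000 documents.

```python
import math

def smooth_idf(n_docs, df):
    """Smoothed inverse document frequency: log((n_docs + 1) / (df + 1)) + 1."""
    return math.log((n_docs + 1) / (df + 1)) + 1

toy_n_docs = 1000  # hypothetical corpus size
for toy_df in (5, 100, 700):
    print(toy_df, round(smooth_idf(toy_n_docs, toy_df), 3))
```

Terms that appear in few documents get a larger weight, and a term that appears in every document bottoms out at exactly 1.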

In [None]:
import textacy.tm

model = textacy.tm.TopicModel("lda", n_topics=10)
model.fit(doc_term_matrix)
doc_topic_matrix = model.transform(doc_term_matrix)
doc_topic_matrix.shape # Should be (2340, 10) as we asked for 10 topics

In [None]:
doc_topic_matrix
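
One way to read a matrix like this is to ask which topic carries the most weight in each document. A minimal sketch on a made-up 3 × 4 matrix:

```python
# A made-up doc-topic matrix: 3 documents (rows) by 4 topics (columns).
toy_doc_topic = [
    [0.70, 0.10, 0.10, 0.10],
    [0.05, 0.05, 0.80, 0.10],
    [0.25, 0.40, 0.20, 0.15],
]

# The dominant topic of a document is the column holding its largest weight.
dominant_topics = [max(range(len(row)), key=row.__getitem__) for row in toy_doc_topic]
print(dominant_topics)
```

Assuming `doc_topic_matrix` is a NumPy array, `doc_topic_matrix.argmax(axis=1)` should give the same kind of answer for the real data.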

In [None]:
id_to_term = {id_: term for term, id_ in vocab.items()}
for topic_idx, terms in model.top_topic_terms(id_to_term, top_n=8):
  print(f"topic {topic_idx}: {'  '.join(terms)}")

In [None]:
for topic_idx, doc_idxs in model.top_topic_docs(doc_topic_matrix, top_n=8):
  print(f"topic {topic_idx}: {'   '.join(corpus[doc_idx]._.meta['provider'] for doc_idx in doc_idxs)}")

In [None]:
_ = model.termite_plot(doc_term_matrix, id_to_term, n_terms=30)


What can we tell from these descriptions of the ten topics we asked for? What sorts of changes might we want to implement if we wanted “better” topics? 

## Let’s Do This Again with Gensim

I was considering doing more individual, `Doc`-level analysis with spaCy, but I started wondering about how good our topic models were, and whether there was a way to test this. In what follows, I’ll be adapting from the topic modeling blueprints in [_Blueprints for Text Analytics in Python_](https://www.oreilly.com/library/view/blueprints-for-text/9781492074076/), a 2020 book written by Jens Albrecht, Sidharth Ramachandran, and Christian Winkler. In looking over their code, we can see how radically different their approach is, and not just because they are using [Gensim](https://radimrehurek.com/gensim/index.html) instead of Textacy/scikit-learn.

In [None]:
import re
import nltk
from gensim.corpora import Dictionary
from gensim.models import TfidfModel
from gensim.models.nmf import Nmf

nltk.download('stopwords')

stopwords = set(nltk.corpus.stopwords.words("english"))

In [None]:
# Lowercase each article, keep tokens of two or more word characters, and drop stopwords
gensim_docs = [
  [w for w in re.findall(r'\b\w\w+\b', article_text.lower()) if w not in stopwords]
  for article_text in df["text"]
]
dict_gensim_docs = Dictionary(gensim_docs)
dict_gensim_docs.filter_extremes(no_below=5, no_above=0.7)
bow_gensim_docs = [dict_gensim_docs.doc2bow(doc) for doc in gensim_docs]
tfidf_gensim_docs = TfidfModel(bow_gensim_docs)
vectors_gensim_docs = tfidf_gensim_docs[bow_gensim_docs]
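
The tokenization here is much blunter than what spaCy gave us. This sketch runs the same regular expression over one made-up sentence, with a tiny stopword set standing in for NLTK’s list.

```python
import re
from collections import Counter

toy_stopwords = {"the", "a", "of", "to", "in", "is"}  # tiny stand-in for NLTK's list
sentence = "The Fed is likely to raise rates in Q2, a sign of caution."

# \w\w+ keeps only runs of two or more word characters,
# so "a" would vanish even without the stopword list.
toy_tokens = [w for w in re.findall(r"\b\w\w+\b", sentence.lower()) if w not in toy_stopwords]
print(toy_tokens)

# A bag of words discards order and keeps only how often each token occurs.
toy_bow = Counter(toy_tokens)
print(toy_bow)
```

Gensim’s `doc2bow` stores the same information as `(token_id, count)` pairs, using the ids assigned by the `Dictionary`.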

In [None]:
nmf_gensim_docs = Nmf(vectors_gensim_docs, 
  num_topics=10, 
  id2word=dict_gensim_docs, 
  kappa=0.1, eval_every=5)


In [None]:
for (topic_idx, terms) in nmf_gensim_docs.show_topics(formatted=False):
  print(f"topic {topic_idx}: {'   '.join([term[0] for term in terms])}")

What do we think?

In [None]:
from gensim.models.coherencemodel import CoherenceModel

nmf_gensim_docs_coherence = CoherenceModel(model=nmf_gensim_docs, 
  texts=gensim_docs,
  dictionary=dict_gensim_docs,
  coherence="c_v")
print(nmf_gensim_docs_coherence.get_coherence())

When we moved to Gensim, we made a few other changes, like using [non-negative matrix factorization](https://en.wikipedia.org/wiki/Non-negative_matrix_factorization) as our model algorithm. Scikit-learn also provides NMF, but I had some trouble using it in testing this notebook and switched to LDA in a panic. Nevertheless, we can compare our Gensim NMF model to a Gensim LDA model using the same corpus. 

In [None]:
from gensim.models import LdaModel

lda_gensim_docs = LdaModel(corpus=bow_gensim_docs,
  id2word=dict_gensim_docs,
  chunksize=2000,
  alpha="auto",
  eta="auto",
  iterations=400,
  num_topics=10,
  passes=20,
  eval_every=None,
  random_state=42)
for (topic_idx, terms) in lda_gensim_docs.show_topics(formatted=False):
  print(f"topic {topic_idx}: {'   '.join([term[0] for term in terms])}")

In [None]:
lda_gensim_docs_coherence = CoherenceModel(model=lda_gensim_docs, 
  texts=gensim_docs,
  dictionary=dict_gensim_docs,
  coherence="c_v")
print(lda_gensim_docs_coherence.get_coherence())

The coherence score for the probabilistic LDA model is higher than the score for the decompositional NMF model. This is surprising because my understanding is that NMF is considered to provide more specific topics, and topics that are more closely related to semantic topics (as in, actually telling us what documents are _about_).
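
For some intuition about what a coherence score rewards, here is a sketch of a much simpler measure than the `c_v` score used above. It is a UMass-style score based only on how often a topic’s top terms show up in the same documents. This is a swapped-in illustration, not Gensim’s `c_v` calculation. The function name and toy documents are made up, and the function assumes every term occurs in at least one document.

```python
import math
from itertools import combinations

def umass_style_coherence(top_terms, docs):
    """Average log((D(w_i, w_j) + 1) / D(w_j)) over ordered pairs of a topic's top terms."""
    doc_sets = [set(doc) for doc in docs]

    def doc_freq(*terms):
        # Number of documents that contain every one of the given terms.
        return sum(all(t in s for t in terms) for s in doc_sets)

    scores = []
    # combinations preserves list order, so `earlier` always precedes `later`.
    for earlier, later in combinations(top_terms, 2):
        scores.append(math.log((doc_freq(later, earlier) + 1) / doc_freq(earlier)))
    return sum(scores) / len(scores)

toy_corpus = [
    ["bank", "loan", "rate"],
    ["bank", "loan"],
    ["bank", "rate", "loan"],
    ["ocean", "wave"],
    ["ocean", "wave", "tide"],
]

coherent = umass_style_coherence(["bank", "loan", "rate"], toy_corpus)
mixed = umass_style_coherence(["bank", "ocean", "wave"], toy_corpus)
print(round(coherent, 3), round(mixed, 3))
```

The topic whose terms co-occur scores higher than the one that mixes unrelated terms, which is the behavior we want from any coherence measure.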

The O’Reilly book includes some code for predicting the “correct” number of topics for a corpus. Unfortunately, calculating coherence takes a long time (this analysis takes over an hour), so here I move to a different notebook to see the results.

In closing, let’s try NMF with Textacy again.

## Back to Textacy

In [None]:
nmf_model = textacy.tm.TopicModel("nmf", n_topics=10)
nmf_model.fit(doc_term_matrix)
nmf_doc_topic_matrix = nmf_model.transform(doc_term_matrix)

print("--- LDA ---")

for topic_idx, terms in model.top_topic_terms(id_to_term, top_n=8):
  print(f"topic {topic_idx}: {'  '.join(terms)}")

print("--- NMF ---")

for topic_idx, terms in nmf_model.top_topic_terms(id_to_term, top_n=8):
  print(f"topic {topic_idx}: {'  '.join(terms)}")


In [None]:
_2 = nmf_model.termite_plot(doc_term_matrix, id_to_term, n_terms=30)


In [None]:
lda_matrix = pd.DataFrame(doc_topic_matrix)
nmf_matrix = pd.DataFrame(nmf_doc_topic_matrix)

In [None]:
lda_matrix.describe()

In [None]:
nmf_matrix.describe()