<a href="https://colab.research.google.com/github/columbia-data-club/meetings/blob/main/2023/march_23_textual_analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

![A blue background with the SQLite logo and the words Data Club on it](https://raw.githubusercontent.com/columbia-data-club/meetings/main/assets/images/data-club-textacy.png)

# Exploratory Analysis of Textual Data

March 23, 2023

by [Moacir P. de Sá Pereira](https://moacir.com) for the [Columbia Data Club](https://github.com/columbia-data-club/).

Working with unstructured, textual data in Python presents new challenges. We can use some of our familiar pandas idioms to organize our corpus of text documents, but even a surface knowledge of the corpus demands new tools for analyzing data. Here, we’ll build a corpus of text and begin looking for macro trends in it.

This notebook was inspired by “[A Full Guide on Scraping Yahoo Finance](https://www.octoparse.com/blog/how-to-scrape-yahoo-finance),” by Octoparse.

## Getting the Text

A corpus of text either comes ready-made or has to be collected by the researcher through a process that often calls for a certain amount of creativity. The Libraries’ [Guide on Text Mining](https://guides.library.columbia.edu/text-mining) includes a few resources for finding both free and licensed corpora. But often a ready-made corpus fits the needs of a researcher looking to improve an algorithm, not a researcher looking to glean information from the content of the corpus.

In this latter case, the researcher has to build the corpus. If the text is online, there are usually three ways to do this:

1. **Access the Site’s Database**: This is probably the easiest solution, but it’s also the least likely. Still, it probably does not hurt to reach out to the admin of a site you want to mine and ask for either a dump of their content or read access to their database. But, again, this is unlikely.
1. **Access via the Site’s API**: If the site has a [REST API](https://restfulapi.net/) or similar, then it’s often possible to interact with the site programmatically to get the site’s content. For example, both [Twitter](https://developer.twitter.com/en/docs/twitter-api) and [Reddit](https://www.reddit.com/dev/api/) once had friendly APIs that have since been somewhat curtailed. Twitter’s API is still rather open for academic researchers, but, of course, in March 2023, it’s hard to tell exactly what’s going on at Twitter. With Reddit, historical access to the content of posts no longer seems possible, but there are external tools like [PushShift](https://pushshift.io/) that can help.
1. **Scrape the Site**: This is the technique we will be using. This is the **WORST OPTION** for two reasons: first, it’s the finickiest, as what works today might stop working without warning, and, second, websites often **EXPLICITLY FORBID** scraping. The prohibition is often recorded in the site’s `robots.txt` file, which tells webcrawlers whether they are allowed to crawl the site and which parts are crawlable. Sometimes, as in [Facebook’s `robots.txt` file](https://facebook.com/robots.txt), the limitations are also spelled out in plain language.
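
Before scraping, it can be worth checking a site’s `robots.txt` programmatically. Here is a minimal sketch using the standard library’s `urllib.robotparser`. The rules, the user agent name `data-club-bot`, and the `example.com` URLs are all made up for illustration. A real check would point the parser at the live file with `set_url()` and `read()` instead of parsing an inline string.

```python
from urllib.robotparser import RobotFileParser

# A hypothetical robots.txt, inlined so the example runs offline.
sample_robots_txt = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(sample_robots_txt.splitlines())

# can_fetch() reports whether the given user agent may crawl the URL.
allowed = parser.can_fetch("data-club-bot", "https://example.com/news/")
blocked = parser.can_fetch("data-club-bot", "https://example.com/private/data")
print(allowed, blocked)
```

Note that this only covers the machine-readable rules. Any restrictions written out in prose, or in a site’s terms of service, still have to be read by a human.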

Since we’ll be scraping, let’s move on to strategies for doing that.

## Scraping and Parsing the Internet

First, sometimes [`wget`](https://www.gnu.org/software/wget/) is sufficient for collecting the contents of a website. A command-line tool, `wget` can recursively work its way through an entire sitemap, following links and downloading pages. 

Additionally, we could write our own webcrawling spider in Python using [Scrapy](https://scrapy.org/), which both downloads and parses html files, while `wget` only downloads.

Instead, we’ll be doing our downloading the old fashioned way, with vanilla `HTTP` requests. We’ll do this using the [Requests](https://requests.readthedocs.io/en/latest/) library. As for parsing, we’ll use the [Beautiful Soup](https://beautiful-soup-4.readthedocs.io/en/latest/) library.

That said, here’s where we need to start optimizing for the nature of the documents we hope to download.

## Yahoo Finance at a Glance

We’re going to be working with [Yahoo Finance Latest News](http://finance.yahoo.com/news), which, at the time of writing, looks like this:

![A screenshot of the Yahoo Finance news website](https://raw.githubusercontent.com/columbia-data-club/meetings/main/assets/images/yahoo-1.png)

Visually, the page has a simple structure. There’s a header bar with navigation links and little widgets giving information about the state of the market. There’s a sidebar with a few articles on it and a footer just out of view, but the bulk of the news content is on an infinite scroll bar that loads articles as you keep scrolling. In a production environment, this would cause problems, but we’re keeping things light for this workshop.

To smooth things along, we can investigate the semantic structure of the news page to see how it signals articles:

![A view of the DOM of the Yahoo Finance news page](https://raw.githubusercontent.com/columbia-data-club/meetings/main/assets/images/yahoo-2.png)

Note `<div id="Fin-Stream"…>`. This is the container that holds the stream of articles, which are members of an unordered list. Each list member seems to have the shape of `<li class="js-stream-content Pos(r)">`, so we can use that to parse out objects.
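
To see how matching on one of several classes works, here is a small offline sketch. The HTML fragment is invented to imitate the structure described above; the real page is far larger and may change at any time.

```python
from bs4 import BeautifulSoup

# A made-up fragment imitating the structure described above.
sample_html = """
<div id="Fin-Stream">
  <ul>
    <li class="js-stream-content Pos(r)"><h3>First headline</h3></li>
    <li class="js-stream-content Pos(r)"><h3>Second headline</h3></li>
    <li class="unrelated"><h3>Not an article</h3></li>
  </ul>
</div>
"""

soup = BeautifulSoup(sample_html, "html.parser")
stream = soup.find("div", id="Fin-Stream")

# class_ matches when the given name is any one of the tag's classes,
# so "js-stream-content" also finds <li class="js-stream-content Pos(r)">.
headlines = [
    li.h3.get_text()
    for li in stream.find_all("li", class_="js-stream-content")
]
print(headlines)
```

Searching on `js-stream-content` alone, rather than the full class string, means the match does not depend on the presentational `Pos(r)` class staying put.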

In [None]:
from bs4 import BeautifulSoup
import requests

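def getWebPage(url):
  """Return the HTML of url as text, or None if the request fails."""
  try:
    r = requests.get(url)
  except requests.RequestException as e:
    # Network-level failures (DNS, timeouts, ...) never produce a response
    print(f"getWebPage failed on {url} ({e})")
    return None
  if r.status_code == 200:
    return r.text
  print(f"getWebPage failed on {url} ({r.status_code})")
  return None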

In [None]:
from bs4 import BeautifulSoup

webpage = getWebPage("https://finance.yahoo.com/news")
parsed_doc = BeautifulSoup(webpage, "html.parser")

for i, article in enumerate(parsed_doc.find_all("li", class_="js-stream-content")):
  # The article headline is in an <h3> tag
  print(f"{i + 1}. “{article.h3.get_text()}”")

When we naively grab the news page, then, we only get the top 26 articles. If we were a business that had a crawler ping the site every five minutes and grab the top 26, then this might be a useful way to construct a corpus. But if we want to do historical research on Yahoo Finance or get the news more systematically, this is not a particularly efficient way to go about doing this.

Again, creativity.

## Scraping Yahoo Finance

There are a few options even within Yahoo Finance. One is a robot working every five minutes. Going forward, that’s not a terrible idea. Going backward, however, it is. Luckily, Yahoo Finance publishes a systematic index of [all of its articles](https://finance.yahoo.com/sitemap/) on its sitemap. What’s more, the map is broken apart into days, meaning we can target a specific date or range of dates and download all the articles. 
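
Because the sitemap is split by day, covering a range of dates comes down to generating one URL per day. The sketch below assumes the day pages follow the `YYYY_MM_DD` pattern used later in this notebook; `sitemap_urls` is a hypothetical helper name, not part of any library.

```python
import datetime

base_url = "https://finance.yahoo.com/sitemap/"

def sitemap_urls(start, end):
    """Return one sitemap URL per day from start to end, inclusive."""
    n_days = (end - start).days + 1
    return [
        base_url + (start + datetime.timedelta(days=offset)).strftime("%Y_%m_%d")
        for offset in range(n_days)
    ]

urls = sitemap_urls(datetime.date(2023, 3, 9), datetime.date(2023, 3, 11))
for url in urls:
    print(url)
```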

The structure of the page is pretty straightforward: there is a `<ul>` inside `<div class="sitemapcontent"…>` that holds the list of articles, where every article is `<li><a>Headline</a></li>`, more or less. As such, the code is not much different than above to see the list of articles.

The Silicon Valley Bank failure, recently in the news, happened on March 10, 2023. Initially, I planned on getting every article published on Yahoo Finance on that day, but it’s over 1000 articles, so instead let’s just get about 500 and call it a day.

In [None]:
import datetime

# Doing this in code should we want to automate getting more than
# one day's worth of articles.
date = datetime.date(2023, 3, 10)
base_url = "https://finance.yahoo.com/sitemap/"

articles_url = base_url + date.strftime("%Y_%m_%d")
webpage = getWebPage(articles_url)
parsed_doc = BeautifulSoup(webpage, "html.parser")
sitemap_content = parsed_doc.find("div", class_="sitemapcontent")

# print(sitemap_content)
for i, article in enumerate(sitemap_content.find_all("li")):
  print(f"{i + 1}. “{article.get_text()}”")

Now we get 50 articles, but we still want more. At the bottom of the list of articles is a “Next” button (and sometimes a “Start” button) that takes the form of:

```html
<div>
  <a 
    href="https://finance.yahoo.com/sitemap/2023_03_10_start{some epoch time}"…>
    Next
  </a>
</div>
```

We can look for this link and follow it when it’s available. However, now we’re about to start looping with our requests, which means it’s a good idea to space out the process a bit using `time.sleep()`. Many websites have limits on how many times you can hit their servers in a given period, and other sites may be set up to intercept and block scraping. In short, it’s polite not to slam a website with a flurry of requests.
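
One way to keep the spacing consistent across many requests is to wrap the waiting logic in a small helper. The `Throttle` class below is a hypothetical sketch, not something from Requests or any other library. It only sleeps for whatever part of the interval has not already elapsed.

```python
import time

class Throttle:
    """Enforce a minimum delay between consecutive calls to wait()."""

    def __init__(self, min_interval):
        self.min_interval = min_interval
        self._last_call = None

    def wait(self):
        now = time.monotonic()
        if self._last_call is not None:
            remaining = self.min_interval - (now - self._last_call)
            if remaining > 0:
                time.sleep(remaining)
        self._last_call = time.monotonic()

# A short interval keeps the demonstration quick; the notebook uses 5 seconds.
throttle = Throttle(min_interval=0.2)
start = time.monotonic()
for page in range(1, 4):
    throttle.wait()  # Call this right before each request
    print(f"Would request page {page} now")
elapsed = time.monotonic() - start
```

The first call goes through immediately and each later call waits, so three calls take roughly two intervals in total.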

In [None]:
import time

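def add_articles(url, page=1, articles=None):
  # Create a fresh list per top-level call; a mutable default ([]) would be
  # shared across calls and accumulate articles between runs.
  if articles is None:
    articles = []
  webpage = getWebPage(url)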
  parsed_doc = BeautifulSoup(webpage, "html.parser").find("div", class_="sitemapcontent")
  for article in parsed_doc.find_all("li"):
    articles.append(article)

  # We assume there is a sibling to the <ul> that holds the Next button
  nav_buttons = parsed_doc.ul.next_sibling
  # Stop when we hit 10 pages (500 articles)
  page += 1
  if page < 11 and nav_buttons:
    next_button = nav_buttons.find_all("a").pop()
    if next_button.text == "Next":
      new_url = next_button.get("href")
      print(f"Waiting five seconds before getting articles from page {page}")
      time.sleep(5)
      return add_articles(new_url, page, articles)

  return articles

In [None]:
articles = add_articles(base_url + date.strftime("%Y_%m_%d"))

We now have links to 500 Yahoo Finance articles from March 10th. Let’s see where these links point to.

In [None]:
from urllib.parse import urlparse
import pandas as pd

df = pd.DataFrame({
    "url": [article.find("a").get("href") for article in articles],
    "headline": [article.text for article in articles]
})
df["hostname"] = df["url"].apply(lambda x: urlparse(x).hostname)
df["hostname"].value_counts()

Super. In my initial testing, it looked like a lot of the links were to sites other than Yahoo Finance, which would cause a problem for scraping. Namely, every single website has a different way of presenting information, which means avid scrapers like us have to manage different ways of getting the information we want without a bunch of information we don’t want. 
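
To make the hostname check concrete without pandas, here is a small standard-library sketch. The URLs are hypothetical stand-ins for the scraped links.

```python
from collections import Counter
from urllib.parse import urlparse

# Hypothetical URLs standing in for the scraped links.
sample_urls = [
    "https://finance.yahoo.com/news/some-article.html",
    "https://finance.yahoo.com/news/another-article.html",
    "https://www.example.com/markets/story",
]

# .hostname is the lowercased host, without scheme, port, or path.
hostname_counts = Counter(urlparse(url).hostname for url in sample_urls)
print(hostname_counts.most_common())

# Keep only the links that stay on Yahoo Finance.
yahoo_only = [
    url for url in sample_urls
    if urlparse(url).hostname == "finance.yahoo.com"
]
print(yahoo_only)
```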

Let’s toss out everything except articles that point to `finance.yahoo.com` and grab their text. 

Luckily, Yahoo Finance puts all of their article content in an `<article>` tag, so we don’t need to do much to grab the text.

In [None]:
import numpy as np

def get_yahoo_finance_article_text(url):
  time.sleep(1)
  print(url)
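  webpage = getWebPage(url)
  if webpage is None:
    # The download failed, so there is nothing to parse
    return np.nan
  article = BeautifulSoup(webpage, "html.parser").find("article")
  # Some pages have no <article> tag; record those as missing
  return article.prettify() if article is not None else np.nan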

In [None]:
### This code is illustrative. The second line will make hundreds of calls
#   to Yahoo Finance, which is probably not what you want to do,
#   especially since all of the articles are already downloaded and saved to
#   our repository as a parquet file.

# df = df[df["hostname"] == "finance.yahoo.com"]
# df["raw_html_text"] = df["url"].apply(lambda url: get_yahoo_finance_article_text(url))
# df["raw_html_text"].isna().sum() #-> 50
# df = df.dropna()
# df.to_parquet("mar_10_articles.parquet")

The process of getting all 393 articles obviously takes a while if we wait a second between requests. Furthermore, there’s no real reason to download them all again for the purposes of this workshop. Instead, I drop the ones that failed to yield an article for whatever reason (50 of 393) and save the dataframe as a parquet file so we can download it and keep working from it.

This requires an extra little step because I save the raw HTML into the dataframe, not the parsed Beautiful Soup object, so when we import the parquet, we need to reparse everything.

So let’s start from scratch with what we’ve got, because technically the scraping is over.

## Parsing Yahoo Finance Articles

Let’s install a library and grab our articles so we can start doing some light textual analysis.


In [None]:
!python -m pip install textacy
!python -m spacy download en_core_web_sm

In [None]:
from bs4 import BeautifulSoup
import pandas as pd

df = pd.read_parquet("https://github.com/columbia-data-club/meetings/blob/main/assets/data/mar_10_articles.parquet?raw=true")
df.head()

Excellent! Let’s have a look at one of these articles to see what the basic structure is.

In [None]:
sample = df.sample(1, random_state=42)
print(sample["url"].iloc[0])
with open('article.html', 'w') as file:
    file.write(sample["raw_html_text"].iloc[0])

Here I kind of stop with the pre-fab code because I want to see how we all interpret this html and how we decide to go forward.

We’ll be using the library [textaCy](https://textacy.readthedocs.io/), which has many preprocessing tools available for us, so we can start by naively extracting the text from our articles.
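
A preprocessing pipeline is, at heart, a chain of string-to-string functions applied in order. The sketch below illustrates that idea with the standard library only. `make_simple_pipeline`, `collapse_whitespace`, and `straighten_quotes` are simplified, hypothetical stand-ins, not textaCy’s implementations, and they may behave differently from the library’s own normalizers.

```python
import re
from functools import reduce

def make_simple_pipeline(*funcs):
    """Return a function that applies each func in order to a string."""
    def pipeline(text):
        return reduce(lambda acc, func: func(acc), funcs, text)
    return pipeline

def collapse_whitespace(text):
    # Replace any run of whitespace with a single space and trim the ends.
    return re.sub(r"\s+", " ", text).strip()

def straighten_quotes(text):
    # Map curly quotation marks to their ASCII equivalents.
    table = str.maketrans({"“": '"', "”": '"', "‘": "'", "’": "'"})
    return text.translate(table)

clean = make_simple_pipeline(collapse_whitespace, straighten_quotes)
result = clean("  “Stocks   fell,”\n\n she said.  ")
print(result)
```

Order matters in a chain like this: each function sees the output of the one before it.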

**Usability note**: the textaCy that gets installed by default is above version 0.11.0 (0.12.0 as of this writing), but currently the documentation does not reflect the changes to the library, so things break. The [release notes for 0.12.0](https://github.com/chartbeat-labs/textacy/releases/tag/0.12.0) give some guidance.

In [None]:
df["naive_txt"] = df["raw_html_text"].apply(lambda x: BeautifulSoup(x, "html.parser").text)

In [None]:
print(df["naive_txt"].sample(1, random_state=42).iloc[0])

In [None]:
from textacy import preprocessing

preproc = preprocessing.make_pipeline(
    preprocessing.normalize.whitespace,
    preprocessing.normalize.quotation_marks,
    preprocessing.replace.emojis,
    preprocessing.replace.emails
)

df["preproc_txt"] = df["naive_txt"].apply(preproc)

In [None]:
print(df["preproc_txt"].sample(1, random_state=42).iloc[0])

If we’ve gotten this far, let’s convert everything to [spaCy docs](https://spacy.io/api/doc) using textaCy’s built-in generator.

In [None]:
import textacy

df["doc"] = df["preproc_txt"].apply(lambda x: textacy.make_spacy_doc(x, lang="en_core_web_sm"))

Now we can work through the textaCy [quick start](https://github.com/chartbeat-labs/textacy/blob/main/docs/source/quickstart.md) for 0.12.0, more or less, using the documentation on GitHub. Or do our own thing.

In [None]:
doc = df["doc"].sample(1, random_state=42).iloc[0]
print(doc._.preview)

In [None]:
list(textacy.extract.entities(doc, exclude_types="NUMERIC"))

In [None]:
from textacy import text_stats as ts
print(ts.n_words(doc), ts.n_unique_words(doc))
print(ts.diversity.ttr(doc))
print(ts.flesch_kincaid_grade_level(doc))
print(ts.counts.pos(doc))
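
For intuition about the type-token ratio (TTR) printed above, here is a rough standard-library sketch: the number of unique words divided by the total number of words. `simple_ttr` is a hypothetical helper that uses a crude regex tokenizer, so its numbers will not necessarily match what textaCy computes from spaCy’s tokens.

```python
import re

def simple_ttr(text):
    """Type-token ratio: unique words divided by total words."""
    # Lowercased regex tokens; spaCy's tokenizer will split text differently.
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        return 0.0
    return len(set(words)) / len(words)

# Seven words, five of them distinct ("the" and "bank" repeat).
ratio = simple_ttr("The bank failed and the bank closed")
print(ratio)
```

A ratio near 1 means almost every word is new, while a lower ratio signals more repetition. Longer texts tend to score lower, so comparisons work best between documents of similar length.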