<a href="https://colab.research.google.com/github/amckenny/text_analytics_intro/blob/main/notebooks/04_building_a_corpus.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#Prerequisites

In [None]:
# Get external files and install 3rd party packages
!mkdir -p texts
!mkdir -p texts/10ks
!wget -q https://www.dropbox.com/s/5ibk0k4mibcq3q6/AussieTop100private.zip?dl=1 -O ./texts/AussieTop100private.zip
!wget -q https://www.dropbox.com/s/u6m4k0uhhj9m2um/Sample_Qualtrics_Output.xlsx?dl=1 -O ./texts/Sample_Qualtrics_Output.xlsx
!unzip -qq -d ./texts/ ./texts/AussieTop100private.zip
!pip install -U sec-edgar-downloader

# Standard library imports
import glob, pprint, random, time
from pathlib import Path

# 3rd party imports
import pandas as pd
import requests
from bs4 import BeautifulSoup
from IPython.display import display
from sec_edgar_downloader import Downloader

#Module 4 - Building a Corpus
---

One of the most time-consuming aspects of text analysis is building the corpus of texts itself. The increasing availability of texts in electronic format over the Internet is making this easier and faster. However, it is still far from a trivial process.

In this module, we'll introduce several methods of obtaining and getting texts into Python for analysis. The goals for this module are:

* Load a corpus from text files.
* Load a corpus from Qualtrics survey exports.
* Load a corpus from an API.
* Load a corpus from web scraping.

**Note**: Whereas modules 1-3 were designed to be used in a self-directed manner, modules 4 and on are designed to be part of my workshop/course. There is far less prose explanation in these notebooks. However, with some tinkering you may still be able to work through these on your own.

##4.1. Building a Corpus from Text Files
---

The *prerequisites* code automatically loaded two text corpora into the './texts/About/' and './texts/PR/' directories. 

Go to the file navigator in Colab (on the left, it looks like a folder) and verify that they're there. If they're not there, ensure that you ran the prerequisites code above and click the refresh button just under "Files" in the file navigator (looks like a folder with a circle at the bottom-right).

If you don't see these folders here, neither will Python!

We want to load every '.txt' file in those directories, so first we need to tell Python where those directories are.

In [None]:
# Tell Python directories where texts are located
texts_dir = Path.cwd() / "texts"
about_dir =  texts_dir / "About" 
pr_dir = texts_dir / "PR" 

dirs_to_load = [about_dir, pr_dir]

We will then create two loops to get all the files in the directories:

1. Loop through all of the directories we want to load texts from (`dirs_to_load`)

2. Loop through all .txt files in that directory

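Then for each file, we save its contents to a list along with the text type (the name of its directory) and the text ID (its file name).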

In [None]:
# Loads the texts into a list called "texts"
texts = [] 
for directory in dirs_to_load: # Loop 1
  for file in glob.glob(f"{directory}/*.txt"): # Loop 2
    with open(file, 'r') as infile: # Open the text file
      text_type = Path(file).parent.name # The directory name (e.g., "About")
      text_id = Path(file).name # The file name
      texts.append({'text_type': text_type, 'text_id': text_id, 'text': infile.read()}) # Save the contents of the file to the "texts" list

# Creates a Pandas DataFrame from the corpus and saves the Dataframe as a .csv file
corpus_df = pd.DataFrame(texts) 
corpus_df.to_csv(texts_dir / "about_pr_texts.csv")

# Displays information about the corpus
print(f"There are {len(corpus_df)} texts in the corpus")
print("\nThe number of each type of text:")
display(corpus_df.groupby(by='text_type').agg('count')['text'])
print("\nA sample of what's in the table:")
display(corpus_df.head(5))

That's it! There are ways of getting Word files, PDF files, etc. into Python as well. They follow a similar pattern but are beyond the scope of this module.
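
The same two-loop pattern can also be written entirely with `pathlib`, which avoids splitting paths on "/" by hand. The sketch below is only an illustration: the `load_texts` helper, the file names, the contents, and the UTF-8 encoding are all assumptions, and it builds a throwaway directory rather than using the workshop corpus so that it runs anywhere.

```python
import tempfile
from pathlib import Path

def load_texts(directories):
    """Read every .txt file in each directory into a list of dicts."""
    loaded = []
    for directory in directories:                            # Loop 1: directories
        for path in sorted(Path(directory).glob("*.txt")):   # Loop 2: .txt files
            loaded.append({
                "text_type": path.parent.name,  # directory name, e.g. "About"
                "text_id": path.name,           # file name, e.g. "a1.txt"
                "text": path.read_text(encoding="utf-8"),  # encoding is an assumption
            })
    return loaded

# Build a tiny hypothetical corpus in a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for folder, name, body in [("About", "a1.txt", "About us."),
                               ("PR", "p1.txt", "Press release."),
                               ("PR", "notes.md", "Ignored: not a .txt file.")]:
        (root / folder).mkdir(exist_ok=True)
        (root / folder / name).write_text(body, encoding="utf-8")

    demo_texts = load_texts([root / "About", root / "PR"])

print(len(demo_texts))
print([t["text_type"] for t in demo_texts])
```

Only the two `.txt` files are picked up; the `.md` file is skipped by the `*.txt` pattern.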

##4.2. Building a Corpus from Qualtrics
---

Another source of texts you may want to analyze is results from free-response questions in a survey/experiment. Often we are able to export the full results from our survey instrument; however, it'd be nice not to have to break the spreadsheet into separate text documents. Let's see how this can be done using the Excel file exported by Qualtrics.

First, we tell Python the name and location of the Qualtrics export file. In this case we use Excel format.

In [None]:
# Tells Python where to find the qualtrics export file
qualtrics_file = texts_dir / "Sample_Qualtrics_Output.xlsx"

Once Python knows where the Qualtrics file can be found, one line of code imports that file into a Pandas DataFrame.

From there we probably want to drop the first row (Qualtrics has two header rows and by default Pandas reads one of them in as 'data'). We also may want to save only certain columns of relevance to our analysis (in this case we'll save only the Progress, Duration, and text) to make viewing the output easier.

In [None]:
# Load the Qualtrics export file into a Pandas DataFrame
qualtrics_df = pd.read_excel(qualtrics_file, header=0)

# Eliminates unneeded rows/columns and saves result to a csv file
qualtrics_df = qualtrics_df.drop(0)
qualtrics_df = qualtrics_df[["Progress","Duration (in seconds)", "sample_Text"]]
qualtrics_df.to_csv(texts_dir / "qualtrics_texts.csv")

# Displays the first five rows of the DataFrame
display(qualtrics_df.head(5))

As with loading text files, there are multiple ways of loading survey data into Python (e.g., from a csv file). These follow a similar procedure but are beyond the scope of this notebook.
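
To give a feel for the csv route, here is a minimal sketch that uses the standard library's `csv` module instead of Pandas. It assumes the csv export carries the same extra header row as the Excel export above (check your own export, since the number of header rows may differ), and the survey rows are invented for illustration.

```python
import csv
import io

# Hypothetical export: row 1 holds column names, row 2 repeats the question text
raw_csv = io.StringIO(
    "Progress,Duration (in seconds),sample_Text\n"
    "Progress,Duration (in seconds),Please describe your experience.\n"
    "100,245,The onboarding process was smooth.\n"
    "100,310,\"I struggled with the portal, but support helped.\"\n"
)

reader = csv.DictReader(raw_csv)   # the first row becomes the field names
rows = list(reader)
responses = rows[1:]               # drop the second header row, like drop(0) above

# Keep only the columns of interest
keep = ["Progress", "Duration (in seconds)", "sample_Text"]
responses = [{column: row[column] for column in keep} for row in responses]

for row in responses:
    print(row["Progress"], row["sample_Text"])
```

Note that the `csv` module reads every value as a string, so numeric columns such as Progress would need converting before analysis.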

##4.3. Building a Corpus from an API
---

Sometimes we want to collect texts from organizations that have built an 'API' or **A**pplication **P**rogramming **I**nterface to help us obtain texts. Myriad organizations have such APIs and each one works a little differently.

Depending on the API, sometimes you can find Python code written by someone else that will make using the API easier. We're going to look at how to get 10-K documents from the SEC EDGAR database. As it turns out, [Jad Chaar](https://github.com/jadchaar/) has written some Python code we're going to use to make our lives easier: the [sec-edgar-downloader](https://github.com/jadchaar/sec-edgar-downloader) package.

(The installation and loading of this package is done in the *Prerequisites* code.)



According to his package documentation, we first need to create a Downloader object and tell it where to store the files.

Notice that we're telling Python where to *store* them, not where to *find* them. Unlike the manual/Qualtrics corpus examples, here we're getting the files from an online source and the API already knows where to find them.

In [None]:
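# Initializes the SEC downloader and tells it where to store the 10-Ks
# Note: this call matches older releases of sec-edgar-downloader. Newer releases (5.x, as far as I can tell)
# also expect a company name and email address: Downloader(company_name, email_address, download_folder)
tenk_directory = Path.cwd() / "texts" / "10ks"
dl = Downloader(tenk_directory)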

We then tell the Downloader object what files we want and from what company. Let's download **all** of IBM and Apple's 10-K documents.

In [None]:
# Tells the API we want the IBM and Apple 10-Ks
company_tickers = ["IBM", "AAPL"]
for ticker in company_tickers:
  dl.get("10-K", ticker)

Go over to the file navigator in Colab and look in the './texts/10ks' directory. You'll see that there is now an entire directory tree housing the downloaded texts. They're not all downloaded into one directory like we had when we used our own texts.

You *could* copy and paste them all into one directory, but imagine if we had used a loop to get all 10-K documents for all S&P 500 companies. That would take forever! Let's use Python to go through these directories for us so we don't have to!

In [None]:
# Identifies the companies based on the tickers in the sec-edgar-filings directory
results_dir = tenk_directory / "sec-edgar-filings"
companies = [company.name for company in results_dir.iterdir() if company.is_dir()] 

texts = []

# Loops through the directories and collects all of the 10-K information in a list called 'texts'
for company in companies:
  company_10k_dir = results_dir / company / "10-K"
  filings = [filing.name for filing in company_10k_dir.iterdir() if filing.is_dir()] 

  for filing_id in filings:
    tenk_filename = company_10k_dir / filing_id / "full-submission.txt"
    if tenk_filename.exists():
      with open(tenk_filename, 'r') as infile:
        texts.append({'company': company, 'filing_type': "10-K", 'filing_id': filing_id, 'text': infile.read()})

# Converts the 'texts' list to a Pandas DataFrame and outputs it to a csv file
tenk_df = pd.DataFrame(texts)
tenk_df.to_csv(texts_dir / "tenk_texts.csv")

# Displays information about the corpus
print(f"There are {len(tenk_df)} texts in the corpus")
print("\nThe number of texts from each company:")
display(tenk_df.groupby(by='company').agg('count')['text'])

Now let's take a look at one of our texts and see what it looks like.

In [None]:
# Displays the first 10,000 characters of one of the texts in the corpus
print(tenk_df.iloc[-1]['text'][:10000])

Well... that's certainly a 10-K... however, that's not all text. That almost looks like the code behind an HTML file! ...and yep, that's how you get it. It looks like texts from this data source are going to require some cleaning before we use them in a text analysis!

Every API works a little bit differently, so you're often going to have to go to the API documentation and tinker a bit. However, once you have been through a few APIs, you'll generally see the same ideas implemented over and over again with a few tweaks from API to API.
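
Returning to the directory walk from earlier in this section: the nested loops could also be expressed with `Path.rglob`, which searches a whole directory tree for a file name. This is a sketch only. The `find_filings` helper is hypothetical, and the demo builds a small fake tree that mimics the company / filing type / filing ID layout seen above rather than downloading anything.

```python
import tempfile
from pathlib import Path

def find_filings(results_dir, filename="full-submission.txt"):
    """Return (company, filing_type, filing_id, path) for every filing found."""
    found = []
    for path in sorted(Path(results_dir).rglob(filename)):
        # Expected layout: <results_dir>/<company>/<filing_type>/<filing_id>/<filename>
        company, filing_type, filing_id = path.relative_to(results_dir).parts[:3]
        found.append((company, filing_type, filing_id, path))
    return found

# Build a hypothetical directory tree with placeholder files
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "sec-edgar-filings"
    for company, filing_id in [("AAPL", "0001"), ("IBM", "0002"), ("IBM", "0003")]:
        filing_dir = root / company / "10-K" / filing_id
        filing_dir.mkdir(parents=True)
        (filing_dir / "full-submission.txt").write_text("placeholder", encoding="utf-8")

    filings = [(c, t, f) for c, t, f, _ in find_filings(root)]

print(filings)
```

The trade-off is that `rglob` trusts the directory layout: if a file with that name sat at a different depth, the unpacking of `parts[:3]` would mislabel it or fail.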

##4.4. Scraping a Corpus
---


Sometimes the texts that you want are online and there is no API readily available to interface with. In this case, you're left with a decision: scrape the text or collect it manually.

There are advantages and disadvantages of each, and we'll talk about that in the workshop/course. However, one thing I want to put in the notebook is an ethical concern. Some websites explicitly disallow scraping. I've observed some scholars scraping these sites (I suspect without permission), but I don't agree with this approach. Use this knowledge/these tools for good and where permitted.

For our example, let's see what's going on at the [Kelley School of Business](https://news.iu.edu/tags/kelley-school-of-business). I searched through the site and didn't see anything prohibiting it, so it seems fair to use so long as we're mindful not to be too taxing on the system.
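
One machine-readable place where sites state what automated visitors may access is a `robots.txt` file, and Python's standard library can read one. The sketch below parses a made-up `robots.txt` (the rules and the example.com URLs are assumptions, not the IU site's actual rules) and asks whether two URLs may be fetched. A `robots.txt` file is only one signal; a site's terms of use may say more.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents, invented for this example
robots_lines = """
User-agent: *
Disallow: /private/
""".strip().splitlines()

parser = RobotFileParser()
parser.parse(robots_lines)   # parse rules we already have, so no network access is needed

checks = {}
for url in ("https://example.com/news/story.html",
            "https://example.com/private/report.html"):
    checks[url] = parser.can_fetch("*", url)   # "*" means any user agent
    print(url, "->", checks[url])
```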

###4.4.1. Building the List of URLs to Scrape
---

First let's have Python go out and get the news page and see what we see:

In [None]:
# Has Python call the webpage with the article links on it and displays whether the page was accessed successfully
url = "https://news.iu.edu/tags/kelley-school-of-business"
response = requests.get(url)

status = response.status_code
if status == 200:
  print(f"The status code was {status} - that means that we received the webpage back")
else:
  print(f"The status code was {status} - something didn't work")
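
Status codes other than 200 carry meaning too. As a small aside, the standard library's `http.HTTPStatus` can translate a code into its standard phrase. The `describe_status` helper below is hypothetical and only meant as a sketch of how the check above could be made more informative.

```python
from http import HTTPStatus

def describe_status(code):
    """Return a short human-readable description of an HTTP status code."""
    try:
        status = HTTPStatus(code)
    except ValueError:
        return f"{code}: not a status code known to http.HTTPStatus"
    outcome = "success" if 200 <= code < 300 else "not a success"
    return f"{code} {status.phrase} ({outcome})"

for code in (200, 404, 503):
    print(describe_status(code))
```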

Now let's look at the "text" we got back:

In [None]:
# Displays the contents of the webpage that was accessed
text = response.text
print(text)

Well that's massive, and again, in HTML... but if we [navigate to the website](https://news.iu.edu/tags/kelley-school-of-business) and compare what we see there to the code, a pattern emerges:

* Each story we want to access seems to be contained in a tag called: `<div class="grid-item--container">`

This insight enables us to extract from the HTML only the bits that surround the news articles. We do so with BeautifulSoup:

In [None]:
# Displays only the website text within the grid-item--container sections
bs_text = BeautifulSoup(text, "html.parser") # Naming the parser avoids a warning and keeps results consistent across machines
article_containers = bs_text.find_all('div', attrs={'class':'grid-item--container'})
print(article_containers)

What we want from here is just the URL to the articles. We see that within each container tag, the URLs are stored within an `<a href=...>` tag.

Let's get just those.

In [None]:
# Displays the URLs to the articles
for article in article_containers:
  print(article.a['href'])

OK, so now we can see the URLs... but there appear to be two different kinds:
* Stories: Start with /
* Blog entries: Contain the full URL.

We could do both, but that would require us to scrape two pages with two separate formats. Let's just do the stories for our demo.

We know that the stories all start with https://news.iu.edu, so let's prepend that and add those to a list of all articles:

In [None]:
# Creates a list of URLs for the selected articles starting with a forward slash (/)
article_urls = ["https://news.iu.edu"+article.a['href'] for article in article_containers if article.a['href'].startswith('/')]
print(article_urls)

That's great, but there's one more important piece of information... this isn't the last page of news stories... there are many more pages we need to get the links from. 

How can we tell this? Well if you look at the webpage, there is a "Next >" button when there are more pages of news.

If we look in our HTML, we see that there is a `<li class="next">` tag when that button is there. Let's take a look at that:

In [None]:
# Finds the HTML code for the 'next' button and prints it
next_page_code = bs_text.find('li', attrs={'class':'next'})
print(next_page_code)

It looks like it too has a URL within an `<a href=...>` tag... but this time starting with a question mark. That just means that at the end of "https://news.iu.edu/tags/kelley-school-of-business" we need to add a question mark and the page number like so:

`https://news.iu.edu/tags/kelley-school-of-business?page=2`

Let's get this URL as well so we know what page to get data from next:

In [None]:
# Extracts the URL from the HTML code for the 'next' button.
next_news_url = "https://news.iu.edu/tags/kelley-school-of-business"+next_page_code.a['href']
print(next_news_url)
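
String concatenation works here because we know exactly what each link looks like. As an alternative sketch, the standard library's `urllib.parse.urljoin` resolves a link relative to the page it was found on, and it handles all three kinds of link we have seen. The `href` values below are made up for illustration.

```python
from urllib.parse import urljoin

tag_page = "https://news.iu.edu/tags/kelley-school-of-business"

# Hypothetical href values of the three kinds discussed above
hrefs = ["/stories/example-story.html",       # story link starting with /
         "https://blogs.example.edu/post",    # blog entry with a full URL
         "?page=2"]                           # 'next' button link

resolved = [urljoin(tag_page, href) for href in hrefs]
for href, full_url in zip(hrefs, resolved):
    print(f"{href!r} -> {full_url}")
```

Links that are already full URLs come back unchanged, so filtering stories from blog entries would still need the `startswith('/')` check used above.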

Let's systematize what we've done a little bit:

In [None]:
def get_kelley_news_urls(url):
  # Get URL Data
  response = requests.get(url)
  status = response.status_code
  if status == 200:
    print(f"URL \"{url}\" successfully requested. Parsing...", end=" ")
  else:
    print(f"URL \"{url}\" failed with code {status}. Skipping...", end=" ")
    return ([], None)
  html_code = response.text
  bs_text = BeautifulSoup(html_code, "html.parser")

  # Parse news URLs
  article_containers = bs_text.find_all('div', attrs={'class':'grid-item--container'})
  article_urls = ["https://news.iu.edu"+article.a['href'] for article in article_containers if article.a['href'].startswith('/')]

  # Find "Next" button: Return URL if it's there or None if it isn't.
  next_page_code = bs_text.find('li', attrs={'class':'next'})
  print("Returning...", end=" ")
  if next_page_code is None:
    return (article_urls, None)
  else:
    next_news_url = "https://news.iu.edu/tags/kelley-school-of-business"+next_page_code.a['href']
    return (article_urls, next_news_url)

And let's see if it produces consistent results:

In [None]:
# Gets all article URLs and the 'next' URL from the specified webpage
url = "https://news.iu.edu/tags/kelley-school-of-business"
get_kelley_news_urls(url)

OK, now let's use a loop to get **all** of the URLs.

In [None]:
# Iteratively accesses the Kelley news page, extracting the article URLs and 'next' URL until there are no more articles to be extracted.
url = "https://news.iu.edu/tags/kelley-school-of-business"
list_of_article_urls = []
while True:
  result_tuple = get_kelley_news_urls(url)
  list_of_article_urls.extend(result_tuple[0])
  url = result_tuple[1]
  if not url:
    break
  else:
    print("Sleeping...")
    time.sleep(3)

print(f"\n\nThe full list of articles is: {list_of_article_urls}")
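
The `while True` loop relies on the site eventually leaving out the 'next' button. A slightly more defensive sketch of the same idea is shown below: it stops if it sees a page twice or reaches a page limit. To keep it runnable without touching the network, it walks a made-up in-memory "site"; `FAKE_PAGES`, `collect_links`, and `max_pages` are all hypothetical names, and the sleeping between requests is left out.

```python
# Hypothetical in-memory "site": each page maps to (article links, next page)
FAKE_PAGES = {
    "page1": (["/a", "/b"], "page2"),
    "page2": (["/c"], "page3"),
    "page3": (["/d", "/e"], None),   # last page: no 'next' link
}

def collect_links(start, fetch_page, max_pages=50):
    """Follow 'next' links from start, gathering the links from every page."""
    links, seen, url = [], set(), start
    while url and url not in seen and len(seen) < max_pages:
        seen.add(url)
        page_links, url = fetch_page(url)   # same (links, next_url) shape as above
        links.extend(page_links)
    return links

all_links = collect_links("page1", FAKE_PAGES.get)
print(all_links)
```

In the real loop, `fetch_page` would be `get_kelley_news_urls`, with a pause between calls.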

###4.4.2. Scraping the News Articles
---

Now that we have a list of the URLs for the articles themselves, we will largely repeat what we did above. The difference is that here we are looking for the text of the article, not the URLs to be scraped.

Let's start with one article, the first in our list:

In [None]:
url =  list_of_article_urls[0]
 
# Get URL Data
response = requests.get(url)
html_code = response.text
bs_text = BeautifulSoup(html_code, "html.parser")
print(bs_text)

And we're back to a mess of HTML again, but we see some valuable data in this HTML:

**Note**: You'll see I added 'try' and 'except' blocks here. This is because not every article has every field. If you try to access a field that doesn't exist, Python will throw an error at you. The 'try' block holds the code we want to run, and the 'except' block tells Python what to do if that code raises an error.

In [None]:
# The category
try:
  category = bs_text.find('div', attrs={'class': 'article-category'}).a.text.strip()
  print(category)
except AttributeError:
  print("There was no category for this article")

In [None]:
# The title 
title = bs_text.find('h1', attrs={'class': 'article--title'}).text.strip()
print(title)

In [None]:
# The subtitle
try:
  subtitle = bs_text.find('h2', attrs={'class': 'article--subtitle'}).text.strip()
  print(subtitle)
except AttributeError:
  print("There was no subtitle for this article")

In [None]:
# The author
try:
  author = bs_text.find('p', attrs={'class': 'byline author'}).text.replace("By\n", " ").strip()
  print(author)
except AttributeError:
  print("There was no author for this article")

In [None]:
# The date
try:
  date = bs_text.find('p', attrs={'class': 'byline date'}).text.strip()
  print(date)
except AttributeError:
  print("There was no date for this article")
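
Why do these lookups fail? When BeautifulSoup's `find` matches nothing it returns `None`, and asking `None` for `.text` raises an `AttributeError`. The sketch below reproduces that situation without any HTML; `FakeTag` and `text_or_default` are hypothetical stand-ins used only to show the pattern.

```python
class FakeTag:
    """Stand-in for a parsed HTML tag; only the .text attribute matters here."""
    def __init__(self, text):
        self.text = text

def text_or_default(tag, default="None"):
    """Return the stripped text of tag, or default when tag is None."""
    try:
        return tag.text.strip()
    except AttributeError:      # raised because None has no .text attribute
        return default

print(text_or_default(FakeTag("  Faculty research  ")))
print(text_or_default(None))
```

Wrapping the pattern in a small helper like this is one way to avoid repeating the same 'try' and 'except' block for every field.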

We also see the body of the text in the `<div class="text">` tags. However, unlike the others, there is more than one of them, and they contain HTML tags:

In [None]:
# The text body
body_text = bs_text.find_all('div', attrs={'class': 'text'})
for section in body_text:
  print(section)

Fortunately, BeautifulSoup has a `get_text()` method that will help us get only the printed text from this. We can stitch together the multiple sections ourselves.

In [None]:
# Extracts the displayed text from the HTML
fulltext = ""
for section in body_text:
  fulltext = fulltext + " " + section.get_text().strip()

print(fulltext)
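
To make "only the printed text" a little more concrete, here is a rough sketch of the same idea using the standard library's `html.parser` in place of BeautifulSoup. It is not how `get_text()` is implemented; `TextExtractor`, `visible_text`, and the sample HTML are all hypothetical.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects only the text between tags, ignoring the tags themselves."""
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called for each run of text between tags; skip whitespace-only runs
        if data.strip():
            self.chunks.append(data.strip())

def visible_text(html_sections):
    """Extract the text from each HTML section and join the sections with spaces."""
    pieces = []
    for html in html_sections:
        extractor = TextExtractor()
        extractor.feed(html)
        extractor.close()
        pieces.append(" ".join(extractor.chunks))
    return " ".join(pieces)

# Hypothetical sections shaped like the <div class="text"> blocks above
sections = ['<div class="text"><p>Kelley <em>students</em> won.</p></div>',
            '<div class="text"><p>The event was held in May.</p></div>']
print(visible_text(sections))
```

This simplified version inserts a space at every tag boundary and does not skip script or style content, so BeautifulSoup remains the better tool for real pages.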

Again, let's pull this together into one function:

In [None]:
def parse_kelley_news_page(url):
  # Get URL Data
  response = requests.get(url)
  status = response.status_code
  if status == 200:
    print(f"URL \"{url}\" successfully requested. Parsing...", end=" ")
  else:
    print(f"URL \"{url}\" failed with code {status}. Skipping...", end=" ")
    return None
  html_code = response.text
  bs_text = BeautifulSoup(html_code, "html.parser")

  # Parse HTML into article sections
  article = {}
  article["url"] = url
  
  # Not every article will have every field, so we use 'try' and 'except' statements to handle cases when it does not
  # (a missing field makes find() return None, and accessing .text on None raises an AttributeError)
  try:
    article["title"] = bs_text.find('h1', attrs={'class': 'article--title'}).text.strip()
  except AttributeError:
    article["title"] = "None"
  try:
    article["category"] = bs_text.find('div', attrs={'class': 'article-category'}).a.text.strip()
  except AttributeError:
    article["category"] = "None"
  try:
    article["subtitle"] = bs_text.find('h2', attrs={'class': 'article--subtitle'}).text.strip()
  except AttributeError:
    article["subtitle"] = "None"
  try:
    article["author"] = bs_text.find('p', attrs={'class': 'byline author'}).text.replace("By\n", " ").strip()
  except AttributeError:
    article["author"] = "None"
  try:
    article["date"] = bs_text.find('p', attrs={'class': 'byline date'}).text.strip()
  except AttributeError:
    article["date"] = "None"
  try:
    body_text = bs_text.find_all('div', attrs={'class': 'text'})
    article["fulltext"] = ""
    for section in body_text:
      article["fulltext"] = article["fulltext"] + " " + section.get_text().strip()
  except AttributeError:
    article["fulltext"] = ""

  print("Done...", end=" ")
  return article

Let's see if it produces results consistent with what we saw previously:

In [None]:
# Tests our custom parsing function to see if it pulls the right information
result_dict = parse_kelley_news_page(list_of_article_urls[2])
print(f"\n{result_dict}")

Now we will build a loop to take us through all of the article URLs we scraped:

In [None]:
article_texts = []

# Iterates through all article URLs, extracting the article information and text, and stores it to a list 'article_texts'
for url in list_of_article_urls:
  result_dict = parse_kelley_news_page(url)
  if result_dict:
    article_texts.append(result_dict)
  print("Sleeping...")
  time.sleep(3)
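
The fixed three-second pause is what keeps this loop from being too taxing on the server. One possible variation, sketched below, adds a small random extra to each pause so that requests are not perfectly regular. The `polite_pause` helper and its default values are assumptions, and the demo swaps in a do-nothing sleep function so that it finishes instantly.

```python
import random
import time

def polite_pause(base=3.0, jitter=1.0, sleep=time.sleep):
    """Sleep for base seconds plus a random extra of up to jitter seconds."""
    delay = base + random.uniform(0, jitter)
    sleep(delay)
    return delay

# Demo with a stand-in sleep function so the example does not actually wait
waited = polite_pause(base=3.0, jitter=1.0, sleep=lambda seconds: None)
print(f"Would have waited {waited:.2f} seconds")
```

In the loop above, `polite_pause()` with its default `sleep` would take the place of `time.sleep(3)`.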

In [None]:
# Creates a Pandas DataFrame from the results and saves the corpus to a csv file
kelleynews_df = pd.DataFrame(article_texts)
kelleynews_df.to_csv(texts_dir/"kelleynews_texts.csv")

# Displays the contents of the corpus
display(kelleynews_df)

Now you have a corpus of Kelley-related news articles scraped from the IU webpage. If you go to the file navigator on the left, you can download the corpus from the server to your local machine. 