<a href="https://colab.research.google.com/github/columbia-data-club/meetings/blob/main/WIS/2023/1_2_Building_a_Corpus_with_JSTOR_and_Introduction_to_pandas.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

![A blue background with the Python logo and the words Data Club on it](https://raw.githubusercontent.com/columbia-data-club/meetings/main/assets/images/data-club-python.png)

## Building a Corpus with JSTOR and an introduction to pandas

Dec. 1, 2023

by [Moacir P. de Sá Pereira](https://moacir.com) for [Women in STEM @ SIPA](https://sipa.campusgroups.com/wis/home/), modified from [Columbia Data Club](https://github.com/columbia-data-club/) notebooks by Roger Creel, Isha Shah, and others in the Data Club’s past.

This notebook underpins a ~75 minute presentation that uses a corpus of articles downloaded from [JSTOR](http://www.jstor.org) as an occasion to learn the basics of [pandas](https://pandas.pydata.org/) for complete beginners to Python and to programming.

## Data and Corpora


When we want to analyze data, no matter what the data are, we have to acquire them somehow. Sometimes this takes the form of collecting or measuring our own data, such as with a GPS tracker or with surveys of people conducted in the field. Other times this involves acquiring data from other sources.

If our data are textual in nature, it is very likely that we will be acquiring or compiling the data from some already existing repository, such as connecting to a website and downloading its textual data or interacting with a purpose-built dataset generator.

Today, we will be working with such a generator, at least to see how it works. We will download the actual data from a link I provide in this notebook, and that link will only work during this workshop. After today, you will have to build the dataset yourself if you wish to reproduce this notebook.

Our collection of documents will make up a corpus (pl. corpora). We anticipate that every document in our corpus will have more or less the same shape; that is, the metadata will be predictable, and what works to get the names of the authors, for example, from one document will work for every document. This is a **luxury**. Often data collected from different sources will have different shapes and properties, and combining all the data into one predictable format can be very time-consuming. Today, we cheat a little to move more quickly.

### Constellate and Other Corpus Building Tools



[JSTOR](http://jstor.org), the non-profit repository of academic journals, launched a tool, [Constellate](http://www.constellate.org), a few years ago to facilitate entry-level macroanalysis of the contents of the JSTOR archive. The Libraries do not subscribe to these entry-level features, but we do maintain access to the downloads of datasets we build in Constellate. We will use Constellate today to download the contents of the _Journal of Political Economy_, or at least all of what JSTOR has, which as of today runs through 2017.

Constellate, of course, is limited to the documents (articles, etc.) that JSTOR has. If you are eager to build a corpus using materials unavailable in JSTOR, please reach out to us at Research Data Services (`data@library.columbia.edu`), and we will try to help you. Alternatively, you can look at some of the resources I have collected in our [Text and Data Mining Guide](https://guides.library.columbia.edu/text-mining).

One particular tool worth mentioning is ProQuest's [TDM Studio](https://guides.library.columbia.edu/text-mining/proquest-tdm-studio). This tool lets users build datasets out of newspaper articles, including _The New York Times_. Unlike Constellate, TDM Studio does not let you download the data to your own computer. Instead, ProQuest provides a virtual computer where you can use the data in a Jupyter notebook, much like what we are doing today with these Colab notebooks.

### Using Constellate



![The front page of the Constellate website](https://raw.githubusercontent.com/columbia-data-club/meetings/main/WIS/assets/img/constellate/01-front-page.png)

Navigating to [Constellate](http://constellate.org) presents us with a lot of different things to explore at our leisure, but we will jump straight to logging in. Under the login, we choose to log in with Google and use our Lionmail account to access Constellate. Once we log in, we are brought to the Dashboard, but we can immediately move to the dataset builder by clicking on “Build a Dataset.”

![The Constellate Dataset Builder](https://raw.githubusercontent.com/columbia-data-club/meetings/main/WIS/assets/img/constellate/03-dataset-builder.png)


The Dataset Builder is a bit finicky. We cannot just type in the publication title we want (or, at least, I am incapable of getting it to work). Instead we have to browse the titles, tick the box for _Journal of Political Economy_, and then scroll to the top and choose “Search Journals.” Now in the builder we will see the name _Journal of Political Economy_ underneath the Publication Titles filter, and we will see that Constellate is offering 13k documents instead of 33m. This is helpful, since Constellate maxes out at 25k documents per dataset, but, of course, once we download the datasets we can combine them. Once we confirm that this is the dataset we want, we click on “Build,” give the dataset a nickname, and it takes a few minutes to build the dataset. Once it is ready, it appears in our collection of datasets, and we can proceed to download.

![The Constellate Dataset download options](https://raw.githubusercontent.com/columbia-data-club/meetings/main/WIS/assets/img/constellate/04-download-options.png)

The first two download options provide sampled data from the entire dataset. We want everything, though. Under “More Download Options,” we can find the option to download everything. It will take several minutes for Constellate to prepare the download and then more time to do the actual download. To keep things moving, however, I have set up the dataset in advance and made it available to you all for the duration of this workshop, so let’s download it!

## Getting our Data into Colab


We need to download the data that I have made available on my Google Drive into the Colab environment. There are a few ways to do this, including ways that work entirely in Python, but we will do it a little bit old school:

https://drive.google.com/file/d/19NSasyWmdcQS9oM372FdN-Qme_qw9XZQ/view?usp=drive_link

Download this file (it is about 70 MB) to your computer, and click on the little folder icon to the left of this text in Colab. The file sidebar will open, and you can drop the file in there.

**The Colab is temporary!** When this notebook is closed or has to restart, the file will be deleted, which is why Google warns you to ensure you have a copy on your own computer!

### A Word about File Formats

Constellate provides its data in the `jsonl` or "line delimited JSON" or "JSON lines" format. You will likely get your data in any of a number of formats, but there are a few guidelines to follow when saving your own:

* Unless you are working with “big data” (gigabytes, like we actually are today), it probably makes sense to stick to a plain text file format. For tabular data, that is typically a `csv`, or [comma-separated values](https://en.wikipedia.org/wiki/Comma-separated_values) file. Tabular data (that is, with rows and columns) often comes in the [`json`](https://en.wikipedia.org/wiki/JSON) format, which is also a plain text format.
* At large enough sizes, it becomes more efficient to use open binary file formats, like [`parquet`](https://en.wikipedia.org/wiki/Apache_Parquet). Binary file formats provide many efficiencies, but they cannot be read by any plain text parser.
* Our format today, `jsonl`, is a variant of `json`, where each “row” is its own line. This is in contrast to typical `json`, where line breaks mean nothing.
* If you have further questions, ask us!

In a Python context, we can consider a `json` or `jsonl` file as a file that, when read by a Python interpreter, outputs a **list** made up of several **dictionaries**, where each dictionary, typically, has the same **keys**, though that is not a requirement. So, for example, a regular `json` file could look like:

```python
[
  {
    "name": "Aisha",
    "id": 452
  },
  {
    "name": "Becky",
    "id": 792
  },
  {
    "name": "Chanda",
    "id": 223
  }
]
```

The same content as a `jsonl` file would look like:

```python
{"name": "Aisha", "id": 452}
{"name": "Becky", "id": 792}
{"name": "Chanda", "id": 223}
```

And as actual Python code:

In [None]:
team = [{"name": "Aisha", "id": 452}, {"name": "Becky", "id": 792}, {"name": "Chanda", "id": 223}]
print([person["name"] for person in team])
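To make the difference between the two formats concrete, here is a minimal sketch that parses the toy `jsonl` content above with Python's built-in `json` module. The `jsonl_text` string and the variable names are made up for illustration; this is not how Constellate's file is read later in the notebook.

```python
import json

# Hypothetical JSON Lines content, matching the toy example above.
jsonl_text = """{"name": "Aisha", "id": 452}
{"name": "Becky", "id": 792}
{"name": "Chanda", "id": 223}"""

# In a jsonl file, every non-empty line is its own complete JSON document,
# so we parse the text one line at a time and collect the results in a list.
team_from_jsonl = [
    json.loads(line) for line in jsonl_text.splitlines() if line.strip()
]

# A regular json file holds the whole list as one document.
json_text = json.dumps(team_from_jsonl)
team_from_json = json.loads(json_text)

print(team_from_jsonl == team_from_json)
print([person["name"] for person in team_from_jsonl])
```

Either way, we end up with the same list of dictionaries. The practical advantage of `jsonl` is that a program can read one line (one document) at a time without loading the whole file.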

## Pandas

Let’s now move to [pandas](https://pandas.pydata.org/pandas-docs/stable/index.html). Pandas is a library for analyzing data, and it is often the first point of entry to any Python project involving data. It provides an idiosyncratic but predictable syntax for organizing your data the way that you want.

The primary structure in pandas is the **DataFrame**. This is a two dimensional table that organizes your data into rows and columns. In other words, it is very similar to a _single_ Excel spreadsheet or Google Sheets sheet.

It is customary in Python to import pandas as `pd` and to call your DataFrame `df`. You can do whatever you like, of course, but I recommend sticking to this tradition for today.




#### Getting Data into pandas

Let’s read in the file we uploaded. Pandas can read many different formats, including Excel spreadsheets, `csv` files, regular `json` files, and more.

In [None]:
import pandas as pd # we will refer to pandas as "pd" going forward
import matplotlib.pyplot as plt # we may do some plotting with matplotlib

In [None]:
df = pd.read_json("jpe-full-nograms.jsonl", lines=True)
df.head()

We now have a variable, `df`, that holds the contents of our file. Running the `.head()` method displays the first five rows (by default) in a slightly interactive format.
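If you want to see what `lines=True` does without the workshop file, here is a small sketch that reads a made-up `jsonl` string instead of a file. The `jsonl_text` content and the name `toy` are assumptions for illustration only.

```python
import io

import pandas as pd

# Hypothetical JSON Lines content: one dictionary per line.
jsonl_text = (
    '{"name": "Aisha", "id": 452}\n'
    '{"name": "Becky", "id": 792}\n'
    '{"name": "Chanda", "id": 223}\n'
)

# io.StringIO lets pandas treat a string as if it were a file.
# lines=True tells read_json to parse each line as a separate row.
toy = pd.read_json(io.StringIO(jsonl_text), lines=True)

print(toy.shape)
print(toy.head(2))
```

Each dictionary becomes a row, and each key becomes a column.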

#### Initial Exploratory Data Analysis


The information here is somewhat unwieldy, so let's look at the 27 columns that we have and consider dropping some.

In [None]:
df.columns

Let’s also see what kinds of types are in these columns.

In [None]:
df.dtypes

Let’s take a few moments to consider all of these column names. Are there any we definitely want? Are there any we might want to drop? Are there any we want to know more about?

We can use a couple of methods, `.describe()` and `.value_counts()`, to get a sense of some of these columns.

In [None]:
df['docType'].value_counts()

In [None]:
df['publicationYear'].describe()

In [None]:
df['wordCount'].describe()

One thing I noticed when downloading this data was that the `fullText` column was often empty when I was sampling the dataset. I didn’t look into it systematically, but we are supposed to have the full text here, so one question is “how many articles actually have nothing in the `fullText` column?”

In [None]:
df['fullText'].isnull().sum()

In [None]:
print(f"{round(100 * df['fullText'].isnull().sum()/len(df))}% of the entries have no fullText")

What about authors, or “creators” in JSTOR’s language? Let’s check the publication year, too.

In [None]:
for columnName in ["fullText", "creator", "publicationYear"]:
  print(f"{round(100 * df[columnName].isnull().sum()/len(df))}% of the entries have no {columnName}")

Excellent. We can handle a drop off on full text, but it’s good to see most of the articles have a creator and all have a publication year.
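The loop above can also be written without a loop. As a sketch on a tiny, made-up dataframe (the name `toy` and its values are hypothetical, not taken from the JSTOR data), the mean of a `True`/`False` column is the share of `True` values, so `.isnull().mean()` gives the share of missing entries for every column at once:

```python
import pandas as pd

# Hypothetical miniature stand-in for the JSTOR dataframe.
toy = pd.DataFrame(
    {
        "fullText": [None, ["some text"], None, None],
        "creator": [["A. Author"], None, ["B. Author"], ["C. Author"]],
        "publicationYear": [1901, 1950, 1975, 2001],
    }
)

# isnull() marks every missing cell as True; mean() then averages
# each column, treating True as 1 and False as 0.
missing_share = toy.isnull().mean()

print((100 * missing_share).round())
```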

Let’s look into the articles that don't have full text, though.

In [None]:
df[df["fullText"].isnull()]["publicationYear"].hist()

In [None]:
df[df["fullText"].notnull()]["publicationYear"].hist()

It looks like it might be clear what articles are provided with full text and what articles aren’t!

In [None]:
df[df["fullText"].isnull()]["publicationYear"].describe()

In [None]:
df[df["fullText"].notnull()]["publicationYear"].describe()

Well, this is irritating, but it also explains why this file is only 70 MB!

I reached out to the Constellate team between preparing this notebook and this workshop, and they told me that Constellate will not have the full data on demand anytime soon. I misunderstood what it was for. For many text analysis operations, the unigrams that Constellate includes are sufficient, but for analysis where word order matters, you need to fill out a [Data for Research request](https://jstor.libwizard.com/f/dfr-request). These are handled by hand, so there is about a 48 hour turnaround.

#### The `[]` Operator

We learned in the first module this morning that `[]` attached to the end of a variable can access aspects of that variable. For example, in a list, we can use it to get a specific index, like `a[0]` will return the “zeroth” (first) element of the list, and `a[-1]` will return the last.

In a dictionary, we can use `d['key']` to access the value associated with a specific key.
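As a quick refresher, here is a short sketch of both uses with made-up values (the names `a` and `d` simply follow the text above):

```python
# A list: [] takes a position (an index).
a = ["first", "second", "third"]
print(a[0])   # the "zeroth" (first) element
print(a[-1])  # the last element

# A dictionary: [] takes a key.
d = {"name": "Aisha", "id": 452}
print(d["name"])
```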

In pandas, the `[]` operator lets us do many, many things!

* If we pass just a **string** (that is a column name), we get that entire column:

In [None]:
single_column_name_string = "docType"
df[single_column_name_string]

* If we pass a **list** of strings (column names), we get a collection of columns.

In [None]:
list_of_columns = ["docType", "docSubType"]
df[list_of_columns]

* If we pass a **Series** (a special data type in pandas) made up of `True`/`False` values, we can use that to filter our data frame. This one is complicated. First, let’s create a Series:

In [None]:
df["doi"] == "10.1086/250019"

In [None]:
print(type(df["doi"] == "10.1086/250019"))
print(type(df))

In [None]:
(df["doi"] == "10.1086/250019").value_counts()

Now if we feed that Series into the `[]` operator, it will work like a filter:

In [None]:
# Note the double `df`!
df[df["doi"] == "10.1086/250019"]

In [None]:
# This is the same as the previous cell
doi_series = df["doi"] == "10.1086/250019"
df[doi_series]

We already used this double `df` operation above when filtering out articles that had empty fields for `fullText` and so on:

```python
df[df["fullText"].notnull()]["publicationYear"].describe()
```

In the first `[]` we filter based on rows where `fullText` is `notnull()`. Then we add a second `[]` to limit our results to just the `publicationYear` column. Then we attach the standard pandas method `.describe()`, which works on dataframes and series, to get information about the series.

This filtering takes getting used to, but it is pretty ergonomic and powerful.
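Filters can also be combined. The following is a sketch on a small, invented dataframe (the name `toy` and its values are hypothetical) that joins two `True`/`False` Series with `&` and then uses `.loc`, which accepts the row filter and the column name together:

```python
import pandas as pd

# Hypothetical miniature dataframe with a few of the same column names.
toy = pd.DataFrame(
    {
        "docType": ["article", "article", "review", "article"],
        "publicationYear": [1899, 1925, 1925, 1990],
        "wordCount": [4000, 6500, 900, 12000],
    }
)

# Each comparison produces a True/False Series; & combines them row by row.
# The parentheses matter, because & binds more tightly than == and >.
mask = (toy["docType"] == "article") & (toy["publicationYear"] > 1900)

# .loc[rows, column] filters the rows and picks the column in one step.
selected = toy.loc[mask, "wordCount"]

print(selected)
```

Use `|` instead of `&` when a row should be kept if either condition holds.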

#### Grouping

Another complex topic is using pandas’s groups. This is very similar to using pivot tables in Excel, but I will show how to create something powerful with our dataset. First, let's look at a sample of the `creator` column. What does it look like?

In [None]:
df["creator"].sample(10, random_state=42)

With the `[]` notation, they kind of look like lists, so let’s see if they actually are lists:

In [None]:
# .iloc[] is how you capture a specific row based on its index.
creators = df.iloc[1626]["creator"]
print(type(creators))
print(creators)

Excellent! Now I'm curious to see if there's a relationship between the number of authors and time, like maybe the number of authors grows over the years. So we have this list of authors for every article, but we don't have anything in the dataframe that turns that into a specific number. Let’s change that.

First, let’s write a function that takes a row from our dataframe and outputs the number of authors:

In [None]:
def get_author_count(row):
  number_of_authors = len(row["creator"])
  return number_of_authors

The idea behind this function is that it is passed a row, much like the row we created with `df.iloc[1626]` above. Then it takes the `creator` column from that row and uses the `len()` function to get its length.

Let’s test it with the row we used above.

In [None]:
homan_and_noble = df.iloc[1626]
number_of_authors = get_author_count(homan_and_noble)
print(number_of_authors)
print(homan_and_noble['creator'])

Here is where things get exciting. Let’s create a new column, called `author_count`, for our dataframe and have it hold the value of the `get_author_count()` function we created. We’ll use the `apply` method, which iteratively executes a function on every row (if `axis = 1`) or column (otherwise) in the dataframe.

In [None]:
df["author_count"] = df.apply(get_author_count, axis=1)

**An ERROR!** What did we do wrong? The message reads:

> TypeError: object of type 'NoneType' has no len()

And that, actually, makes sense. What it’s saying is that when the `creator` column is empty, we do not have an empty list, where `len([])` is 0. Instead, we have the special value `None`, which is Python’s version of `null`/`nil`. And if we try to take the length of `None`:

In [None]:
len(None)

We get the exact same error as above. This means we have to change our `get_author_count()` function to react one way when `creator` is a list, and another way when it is `None`. There are many ways to do this, but this is a quick one:

In [None]:
def get_author_count(row):
  if row["creator"] is None:
    return 0

  number_of_authors = len(row["creator"])
  return number_of_authors

And now…

In [None]:
df["author_count"] = df.apply(get_author_count, axis=1)
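As one of the "many ways" mentioned above, here is an alternative sketch that applies a function to the `creator` column alone rather than to whole rows. The dataframe `toy`, the helper `count_creators`, and the author names are all invented for illustration.

```python
import pandas as pd

# Hypothetical miniature dataframe: lists of authors, plus a missing value.
toy = pd.DataFrame(
    {
        "creator": [["First Author", "Second Author"], None, ["Solo Author"], []],
    }
)


def count_creators(creators):
    # Anything that is not a list (such as None) counts as zero authors.
    if isinstance(creators, list):
        return len(creators)
    return 0


# Calling apply on a single column passes each cell, not each row,
# so there is no need for axis=1.
toy["author_count"] = toy["creator"].apply(count_creators)

print(toy)
```

Checking for a list with `isinstance` is a design choice: it also guards against other kinds of missing values, not only `None`.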

Let’s investigate our new column!

In [None]:
df["author_count"].describe()

Given that 75% of the articles have 2 or fewer authors, this may not be particularly illuminating, but let’s see what happens when we group things.



In [None]:
pub_year_group = df.groupby("publicationYear")
print(type(pub_year_group))
pub_year_group

As you can see, the group is not, by itself, interesting. Groups in pandas are sort of interstitial objects. You trigger them with the `.groupby()` method, but then you need to use a second method to get information out of them.

Below, we extract the `author_count` column out of the group and use the `.max()` method to get the number of authors in the article with the most authors for every year.

In [None]:
pub_year_group["author_count"].max()

In [None]:
pub_year_group["author_count"].mean()

In [None]:
pub_year_group["author_count"].mean().plot()

Does this tell us anything? Let's plot the number of articles published a year, too. I use the `id` column just because I assume every row has an `id`, and if I just plot the group by itself, it will plot all of the columns!

In [None]:
pub_year_group["id"].count().plot()

This isn’t my field, but already I have a handful of possible research topics to pursue!
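The separate `.max()`, `.mean()`, and `.count()` calls above can also be collected into one table. This is a sketch on invented numbers (the dataframe `toy` is hypothetical) using the `.agg()` method with a list of summary names:

```python
import pandas as pd

# Hypothetical miniature dataframe: five articles across two years.
toy = pd.DataFrame(
    {
        "publicationYear": [1900, 1900, 1950, 1950, 1950],
        "author_count": [1, 1, 2, 1, 3],
    }
)

# Group by year, pick one column, and compute several summaries at once.
# The result has one row per year and one column per summary.
summary = toy.groupby("publicationYear")["author_count"].agg(
    ["count", "mean", "max"]
)

print(summary)
```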

#### Removing Data

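Pandas is, for the most part, non-destructive. When you filter or transform your dataframe, pandas typically returns the result as a new dataframe (or series) and leaves the original alone. (Assigning a new column, as we did with `df["author_count"]`, is an exception: it changes the dataframe in place.) That is why you will often see commands that look like: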

```python
df = df...
```

This is a way of overwriting the data. But with notebooks, we can always go back to the top of the notebook and press play all over again and rebuild the data if we have made mistakes.
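To see this behavior on a small scale, here is a sketch with an invented three-row dataframe (the names `toy`, `with_authors`, and `titles_only` are hypothetical). Neither the filter nor `.drop()` changes `toy` itself:

```python
import pandas as pd

# Hypothetical miniature dataframe.
toy = pd.DataFrame({"title": ["a", "b", "c"], "author_count": [0, 2, 1]})

# Filtering returns a new dataframe; toy still has all three rows.
with_authors = toy[toy["author_count"] > 0]
print(len(toy), len(with_authors))

# drop() also returns a new dataframe by default; toy keeps both columns.
titles_only = toy.drop(columns=["author_count"])
print(list(toy.columns), list(titles_only.columns))
```

Only when we assign the result back to the same name, as in `toy = toy[...]`, do we lose access to the earlier version.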

Up above, I showed that using the `[]` operator with a list of column names will show you only those columns. That, it turns out, is an easy way to delete columns you don’t want by omission. But for this part of the exercise, we are going to remove all the rows (articles) that don’t have authors and see how the numbers for author counts change once we have at least one author.

We do this by creating a series of `True`/`False` based on the `author_count`, and then we save the filtered dataframe _as_ `df`, overwriting the first df:

In [None]:
print(f"df is currently {len(df)} rows large")
df = df[df["author_count"] > 0]
print(f"df is currently {len(df)} rows large")

We lost around 1500 rows this way, but let's see if the plots change.

In [None]:
pub_year_group = df.groupby("publicationYear")
pub_year_group["author_count"].mean().plot()

What does this new chart tell us?

#### Plotting with Matplotlib

[Matplotlib](https://matplotlib.org/stable/) is the standard plotting library in Python. Pandas is already leveraging it in the plots above, but in order to exercise better control over the plot, we should access Matplotlib directly.

We imported it at the same time that we imported pandas, and we renamed it to `plt`. This is, again, a Python convention.

In [None]:
plt.scatter(x=df["wordCount"], y=df["author_count"])
plt.title('Do many authors make articles longer?')
plt.xlabel('Word Count')
plt.ylabel('Author Count')
plt.show()

What seems to be the answer to the question posed by the title in this chart?

## Exporting Data from pandas

As a final step, we need to export our data so we can use it in the next module. Pandas can export to basically any format you might need, but we will use `parquet`.

Furthermore, we’ll export to our Google Drives. First we have to mount the drives by clicking on the little folder icon to the left and then the folder with the triangular Drive logo. This will make a `drive` folder appear after a minute or so. It has a folder called `MyDrive`, and we'll save our file in there.

In [None]:
df.to_parquet("drive/MyDrive/jpt-full-no-0-authors.parquet")

We may need to hit the little folder refresh icon on the side to see our new file, but let's get ready for part 3!