<a href="https://colab.research.google.com/github/columbia-data-club/meetings/blob/main/WIS/2023/1_3_Introdution_to_Text_Analysis_with_Python.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

![A blue background with the Python logo and the words Data Club on it](https://raw.githubusercontent.com/columbia-data-club/meetings/main/assets/images/data-club-python.png)

## Introduction to Text Analysis with Python

Dec. 1, 2023

by [Moacir P. de Sá Pereira](https://moacir.com) for [Women in STEM @ SIPA](https://sipa.campusgroups.com/wis/home/), modified from [Columbia Data Club](https://github.com/columbia-data-club/) notebooks.

This notebook underpins a ~75 minute presentation that uses a corpus of articles downloaded from [JSTOR](http://www.jstor.org) as an occasion to learn the basics of text analysis for complete beginners to Python and to programming.

Though some text analysis libraries are mentioned, time constraints will limit us to more straightforward data analysis.

## Determining Gender of Authors

Before we move to NLP of the sort requiring the libraries listed at the end of this notebook, Vanshika suggested we see if we can intuit anything from our _JPE_ dataset regarding female authors.

Given what we developed in the previous module, how can we go about seeing how _JPE_ has historically published women? What do we need to do?

### The Plan


Given a list of articles, here is a step-by-step description of how we can solve our problem:

1. We load in the data!

1. We turn our list of articles into a list of authors.

1. We ascertain the gender of each author.

1. We match our name-to-gender lookup table to the original list of articles so we can see which articles have women authors.

1. We group our dataset of articles by whatever criterion we like, to see the relationship between the number (or percentage) of women authors and that criterion.

### 1. We Load in the Data!

This should be straightforward. We saved the data in the last module as “drive/MyDrive/jpt-full-no-0-authors.parquet” in our Google Drive, so let’s mount the drive again by clicking on the folder icon and then on the folder icon with the little Drive icon.

Next, we need to import pandas and read the file!

In [None]:
import pandas as pd

file_path = "drive/MyDrive/jpt-full-no-0-authors.parquet"

df = pd.read_parquet(file_path)

In [None]:
df.head()

### 2. We Turn our List of Articles into a List of Authors.

`df` is a dataset of articles, and each article has a `creator` column that holds a list of all of that article’s authors. We want a single list containing only the authors. What we will do is:

1. Create an empty, master list of authors.
1. Iterate over `df`.
  1. For every article, we pull out the list of authors and then iterate over the list of authors
    1. For every author, we add them to the master list of authors.
1. The master list of authors will now have all the authors.
1. We turn the master list into a master **Set**, which will remove duplicates for us.
1. We now have a list of all the unique authors.

This may sound like a lot, but it is rather straightforward and takes just a few lines of code.

In [None]:
master_author_list = [] # empty list
def add_creators_to_master_list(creator_list):
  for creator in creator_list:
    master_author_list.append(creator)

df["creator"].apply(add_creators_to_master_list)

This is not the most elegant solution, but it does get us where we need to go. Let’s see how many authors were collected.
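As a possible alternative to the loop above, pandas can flatten a column of lists directly with `Series.explode()`. The sketch below runs on a tiny made-up dataframe (`toy_df` and every name in it are invented for illustration), not on the real `df`.

```python
import pandas as pd

# Toy stand-in for df; these articles and names are made up for illustration.
toy_df = pd.DataFrame({
    "title": ["Article A", "Article B", "Article C"],
    "creator": [["Ada Smith", "Bob Jones"], ["Ada Smith"], ["Carol Lee"]],
})

# explode() gives every element of every list its own row.
all_authors = toy_df["creator"].explode()
print(len(all_authors))  # total author credits, duplicates included

# unique() drops the duplicates, much like converting a list to a set.
unique_toy_authors = all_authors.unique().tolist()
print(unique_toy_authors)
```

On the real data, the same two calls on `df["creator"]` should give the equivalent of `master_author_list` and `unique_authors`, assuming the column holds lists as described.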

In [None]:
len(master_author_list)

Who are the most common duplicates?

In [None]:
pd.Series(master_author_list).value_counts().head(15)

Does that sound right?

Now let's get rid of the duplicates.

In [None]:
unique_authors = list(set(master_author_list))
len(unique_authors)

About ten thousand of the twenty thousand author credits are unique, meaning on average an author has two articles in _JPE_. No big deal.

### 3. We ascertain the gender of each author

We will start out by using the [gender-guesser](https://pypi.org/project/gender-guesser/) library in Python. But we have to install it. Unlike pandas, it is not built into the Colab environment.

In [None]:
!python -m pip install gender_guesser

In [None]:
import gender_guesser.detector as gender
detector = gender.Detector()



Let’s have a look at a random sample of 15 authors. Let's feed all 15 names into the gender detector.

In [None]:
sample_of_15 = pd.Series(unique_authors).sample(15, random_state=15)
sample_of_15

In [None]:
for name in sample_of_15:
  print(f"{name} is coded as {detector.get_gender(name)}")

That's entirely useless. If we look at the manual for `gender-guesser`, we see:

```python
>>> print(d.get_gender(u"Bob"))
male
>>> print(d.get_gender(u"Sally"))
female
>>> print(d.get_gender(u"Pauley")) # should be androgynous
andy
```

Oh. It only works on first names. Well, we can use the `.split()` method on a string to split a string into a list and then only test the first value of the list. As an example:

In [None]:
print("Charles S. Ascher".split(" "))
print("Charles S. Ascher".split(" ")[0])

Let’s try again!

In [None]:
for name in sample_of_15:
  first_name = name.split(" ")[0]
  print(f"{name} is coded as {detector.get_gender(first_name)}")

In short, we will not be able to ascertain the gender of someone based on their first initial even under the best of circumstances.
Let’s create a new data frame with all these names, though, and see what we can do.

In [None]:
names_df = pd.DataFrame(unique_authors, columns=["full_name"])
names_df.sample(15, random_state=15)
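Initials are one reason the detector cannot guess a gender. To get a feel for how many names are affected, one could flag first tokens that look like an initial. The helper below, `looks_like_initial`, is a hypothetical rule of thumb written for this sketch, and the example names are invented.

```python
def looks_like_initial(token):
    """Return True for tokens such as "J." or "J" (a rough rule of thumb)."""
    stripped = token.rstrip(".")
    return len(stripped) == 1 and stripped.isalpha()

# Made-up names for illustration only.
example_names = ["Charles S. Ascher", "J. Q. Public", "A. Example"]
for full_name in example_names:
    first_token = full_name.split(" ")[0]
    print(f"{full_name}: starts with an initial? {looks_like_initial(first_token)}")
```

This only detects the problem; it does not solve it, since an initial still tells us nothing about gender.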

Let's create a new column with just the first name.

In [None]:
names_df["first_name"] = names_df["full_name"].apply(lambda full_name: full_name.split(" ")[0])
names_df.sample(15, random_state=15)

OK. Not too shabby. Now let's add a column for gender.

In [None]:
names_df["gender"] = names_df["first_name"].apply(lambda first_name: detector.get_gender(first_name))
names_df.sample(15, random_state=15)

Notice these two times with `.apply()` we did not define a function ahead of time. Instead, we used a [lambda function](https://realpython.com/python-lambda/). These are anonymous functions that we can use when we need to do a transformation quickly on the fly. So, above,

```python
names_df["first_name"].apply(lambda first_name: detector.get_gender(first_name))
```

is the same as

```python
def get_gender(first_name):
  return detector.get_gender(first_name)

names_df["first_name"].apply(get_gender)
```

From my understanding, lambdas are not particularly idiomatic Python, but I am very used to them from JavaScript, where they are ubiquitous (and are called anonymous functions).

For longer functions that involve some kind of control flow, it makes sense to define a function. But for one-liners, a lambda is usually ok!
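For simple string operations, pandas also offers the `.str` accessor, which can replace a lambda entirely. A minimal sketch on a made-up dataframe (`toy_names_df` is invented for this example):

```python
import pandas as pd

# Toy stand-in for names_df; the second name is invented.
toy_names_df = pd.DataFrame({"full_name": ["Charles S. Ascher", "J. Q. Public"]})

# .str.split(" ") splits every string in the column; .str[0] takes the first piece.
toy_names_df["first_name"] = toy_names_df["full_name"].str.split(" ").str[0]
print(toy_names_df)
```

Whether this reads better than the lambda version is largely a matter of taste; both should produce the same `first_name` column.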

Back to gender, though. Let's see the breakdown of genders.

In [None]:
names_df["gender"].value_counts()

This probably is not telling us much that we don’t already know. _JPE_ is publishing mostly men. And, in fact, even if all the edge cases were reclassified as women, women would still account for only 1/3 of the total number of authors. Let's look at the list of people of unknown gender and see if it might make sense just to drop all of them.

Notice, I'm going to filter the data frame based on gender and then look at the value counts for first names in one line of code.

In [None]:
names_df[names_df["gender"] == "unknown"]["first_name"].value_counts()

So of the 2865 unknowns, there are 823 distinct first names. And we can see that some of the names look like legit first names, but the gender guesser opted not to guess.

### 4. We Match our Name-to-Gender Lookup Table to the Original List of Articles so We Can See Which Articles Have Women Authors

Now that we have one of six genders (`male`, `female`, `mostly_male`, `mostly_female`, `unknown`, and `andy`) assigned to all of our unique authors, we can feed that information back into our original `df`. But we will have to be careful.

First, let's just create six gender counter columns for `df` and set their values all to 0. Notice how simple it is to set an entire column with one value.


In [None]:
gender_counter_columns = ["male_count", "female_count", "mostly_male_count", "mostly_female_count",
                            "unknown_count", "andy_count"]
for gender_counter_column in gender_counter_columns:
    df[gender_counter_column] = 0

df.head()

Now is when things get tricky:

* Iterate over every row in `df` and with each row:
  * Create a dictionary with all the gender counters and fill it
    with the current counter values (will be 0).
  * Iterate over the list of authors and for every author,
    * Match the name with the name in `names_df` using the `.loc[]` indexer
    * Increment the matching gender counter in the dictionary by 1.
  * Return the gender counters dictionary.

Notice below how closely the actual function matches the shape of the bullet points above. This is very helpful when designing functions and algorithms.

In [None]:
def count_genders(row):
  # First, we create a dictionary for all six gender counters.
  gender_counters = {}
  # We fill the counters with the current values from df. These will all be 0.
  for gender_counter_column_name in gender_counter_columns:
    gender_counters[gender_counter_column_name] = row[gender_counter_column_name]
  # Next, we iterate over the list of authors.
  for author in row["creator"]:
    # Match the name to the name in names_df and get the gender
    # This could cause an error if no author exists with that name,
    # But we have not dropped any articles, so every author should be
    # findable.
    #
    # This is a complex line of code, but it basically says:
    # Filter the names_df dataframe and give me a new dataframe where
    # the "full_name" is equal to the author variable above.
    # Next, give me the "gender" column of that dataframe, so now I
    # have a series.
    # Then, give me the first row in the series using .iloc[0].
    detected_gender = names_df.loc[names_df["full_name"] == author]["gender"].iloc[0]

    # `detected_gender` is one of the six gender labels, but we want the
    # matching counter column name: "mostly_male" (or "mostly male"), say,
    # should become "mostly_male_count".
    gender_counter_column_name = detected_gender.replace(" ", "_") + "_count"

    # Now that we have the correct gender counter column name, we know what to
    # increment in the dictionary.
    # The `+=` increments the left-hand side by the right-hand side.
    gender_counters[gender_counter_column_name] += 1

  # And last, we return the dictionary with all of the gender counters
  return gender_counters

The above function is really only doing a few small things, but they are cognitively tricky and there are nested loops: we are iterating over each article and then, for each article we are iterating over each author. But now we can run this as one line of an `.apply()` call and we'll be done!

Below I do two things so I can update all six gender counter columns at once:
1. I’m feeding the list of columns to the `[]` operator on the left-hand side
1. Passing the keyword argument `result_type="expand"` tells pandas to expand the dictionary returned by `count_genders` for each row into columns, one per key. Because the keys match the names (and order) of the gender counter columns, the result lines up with the columns being assigned.

This is the first “big” command we are running, because it has to iterate over 10k articles and for each article it has to scan the 10k-long list of names at least once (if there is only one author), meaning it is doing $10,000^2$ (100 million) comparisons, _minimum_. And still it only takes about 30 seconds.

In [None]:
df[gender_counter_columns] = df.apply(count_genders, axis = 1, result_type="expand")
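The cost described above comes from filtering `names_df` once per author. One possible way to avoid it is to build a plain dictionary from name to gender once and look authors up in that. The sketch below uses made-up stand-ins (`toy_names_df`, `toy_df`, `toy_columns`, and `count_genders_with_lookup` are all invented here) and only three of the six counters.

```python
import pandas as pd

# Toy stand-ins for names_df and df; all values are invented for illustration.
toy_names_df = pd.DataFrame({
    "full_name": ["Ada Smith", "Bob Jones", "Q. Example"],
    "gender": ["female", "male", "unknown"],
})
toy_df = pd.DataFrame({"creator": [["Ada Smith", "Bob Jones"], ["Q. Example"]]})
toy_columns = ["male_count", "female_count", "unknown_count"]

# Build the lookup table once: full name -> gender label.
gender_by_name = dict(zip(toy_names_df["full_name"], toy_names_df["gender"]))

def count_genders_with_lookup(row):
    counters = {column: 0 for column in toy_columns}
    for author in row["creator"]:
        label = gender_by_name[author]  # a dictionary lookup, not a dataframe filter
        counters[label.replace(" ", "_") + "_count"] += 1
    return counters

toy_counts = toy_df.apply(count_genders_with_lookup, axis=1, result_type="expand")
print(toy_counts)
```

A dictionary lookup does not need to scan every name, so on the full dataset this approach should do far less work than filtering the dataframe for each author, though the actual timing has not been measured here.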

We can sort the articles by number of authors and see the top 10 to make sure that it seems like the number of gender counts sums to the author count, and it looks like it does.

In [None]:
df.sort_values("author_count", ascending=False)[["author_count"] + gender_counter_columns].head(10)
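Rather than checking the top rows by eye, the same sanity check can be done for every article at once. A minimal sketch on made-up numbers (`toy_df` and `toy_columns` are invented, and only three counters are shown):

```python
import pandas as pd

# Toy stand-in for df with invented counts.
toy_df = pd.DataFrame({
    "author_count": [2, 1, 3],
    "male_count": [1, 0, 2],
    "female_count": [1, 0, 0],
    "unknown_count": [0, 1, 1],
})
toy_columns = ["male_count", "female_count", "unknown_count"]

# Add up the gender counters across each row...
row_totals = toy_df[toy_columns].sum(axis=1)

# ...and keep only the rows where the total disagrees with author_count.
mismatches = toy_df[row_totals != toy_df["author_count"]]
print(f"{len(mismatches)} articles where the counts do not add up")
```

On the real data, substituting `df` and `gender_counter_columns` should report zero mismatches if the counting step worked as intended.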

## Analyzing Genders of Authors

Let’s look at the breakdown of authors by gender again from the `names_df` dataframe, this time in terms of percentages.

In [None]:
names_df["gender"].value_counts() / len(names_df)
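pandas can also compute these proportions directly: `value_counts()` accepts `normalize=True`. A small sketch on an invented series (`toy_genders` is made up for this example):

```python
import pandas as pd

# Invented labels, for illustration only.
toy_genders = pd.Series(["male", "male", "female", "unknown"])

# normalize=True returns each label's share of the total instead of a raw count.
shares = toy_genders.value_counts(normalize=True)
print(shares)
```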

Identifiably male authors are 65% of the total, unknown authors make up over a quarter, and identifiably female authors make up just 6% of the authors in _JPE_. But let's track these things over time.

In [None]:
jpe_by_year = df.groupby("publicationYear")
jpe_by_year["female_count"].sum().plot()

What does this chart show? What are potential problems with this chart based on the charts we looked at in the previous session?

Let’s plot the three largest gender categories alongside the total number of authors.

In [None]:
import matplotlib.pyplot as plt

fig, axes = plt.subplots(2, 2, figsize=(10, 8))
jpe_by_year["author_count"].sum().plot(ax=axes[0,0], ylabel="total authors")
jpe_by_year["female_count"].sum().plot(ax=axes[0,1], ylabel="female authors")
jpe_by_year["male_count"].sum().plot(ax=axes[1,0], ylabel="male authors")
jpe_by_year["unknown_count"].sum().plot(ax=axes[1,1], ylabel="unknown authors")
# Adjust the spacing once all four subplots (and their labels) exist.
fig.tight_layout()

What is this telling us? How can we make this analysis more sensitive? Where do we go from here?
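One possible direction, offered as a sketch and not as the answer: raw counts rise whenever the journal simply publishes more authors, so it may be more informative to look at the share of female authors per year. The dataframe below is invented (`toy_df`), though its column names follow the ones used above.

```python
import pandas as pd

# Toy stand-in for df with invented values.
toy_df = pd.DataFrame({
    "publicationYear": [1950, 1950, 2000, 2000],
    "author_count": [1, 2, 2, 3],
    "female_count": [0, 0, 1, 1],
})

# Sum both counters per year, then divide to get a share between 0 and 1.
yearly_totals = toy_df.groupby("publicationYear")[["female_count", "author_count"]].sum()
female_share = yearly_totals["female_count"] / yearly_totals["author_count"]
print(female_share)
```

Calling `.plot()` on the resulting series would chart the share over time. Note that authors coded as `unknown` still sit in the denominator, so this share is likely an undercount.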

## What Is Text Analysis?



In simplified terms, “text analysis,” along with “natural language processing” or “NLP,” can be understood as a set of methods that allow computers to algorithmically “read” textual data. There are many techniques that we can use to break apart our data:

* **Sentence parsing** We can break up the sentences in our corpus into subjects, verbs, and objects and analyze those items. Or we can see which prepositions are used most often and how. This practice relies on good “grammars” in place for your specific language, so it knows how to find a subject, verb, etc.
* **Topic modeling** This technique analyzes _documents_ in a corpus and tries to find particular words that link specific documents together. Although the word “topic” suggests that there is a semantic meaning to these groupings (they are arranged “by topic,” or “by what they are about”), the algorithm does not actually understand, semantically, what words are related to other words.
* **Classification** Similarly, we can use text analysis to classify documents based on certain features. This is a popular way to use machine learning with text analysis.
* **Sentiment Analysis** We can also use NLP to do vibe checks on our documents. If you have seen articles about how negative Facebook posts attract more attention than positive ones, the authors are likely using sentiment analysis to classify the posts as positive or negative.
* **Named Entity Recognition** NLP can find proper names and make reasonable guesses as to whether they refer to people, places, or institutions. This lets the researcher collect all the places in a corpus and build a map from that geographical data.

### Some Terminology

* **Corpus** In NLP, a “corpus” is typically a collection of **documents**.
* **Document** A “document” is a collection of text, typically unified in some way. A document can be any length; a tweet can be a document. The document provides the logical structure for classifying and organizing text objects. If you want to classify a handful of paragraphs from a novel as happy or sad, it probably makes sense to create a document for each paragraph and then analyze the paragraphs separately.
* **Token** and **Tokenization** A “token,” in simplest terms, corresponds more or less to a word. “Tokenization” is the process of taking a document from a corpus and breaking it up into words. This is actually a more complex process than it may initially seem. What kinds of obstacles could make tokenization difficult for a computer?
* **Lemma** and **Lemmatization** A “lemma” is the base root of a word, and “lemmatization” refers to the process whereby our NLP tools determine the root of a word. For example, “sings,” “singing” and “sang” all have the same lemma, “sing.”
* **Stop Words** Often we are not interested in tracking the use of certain words, particularly very common ones like “a” and “the.” Stop words are a list of words that we tell our algorithms to ignore.
* **Bag of Words** In many NLP applications, word order, paradoxically, does not matter. We say that these applications are working with a “Bag of Words.”
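Several of the terms above can be illustrated with nothing but the Python standard library. The sketch below is deliberately naive: the regular expression is a crude tokenizer, the stop word list is hand-picked for this one sentence, and real NLP libraries handle both far more carefully.

```python
import re
from collections import Counter

# A one-sentence "document" invented for this example.
document = "The singer sings, and the crowd sings along."

# Tokenization: lowercase the text and pull out runs of letters and apostrophes.
tokens = re.findall(r"[a-z']+", document.lower())

# Stop words: a tiny hand-picked list for this example only.
stop_words = {"the", "and", "a"}
content_tokens = [token for token in tokens if token not in stop_words]

# Bag of words: count how often each token appears, ignoring word order.
bag_of_words = Counter(content_tokens)
print(bag_of_words.most_common(3))
```

Notice that “singer” and “sings” are counted separately; grouping related word forms is the job of lemmatization, which this sketch does not attempt.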


## Python NLP Libraries

There are many libraries in Python that help with text analysis and NLP. I will mention five here:

1. [NLTK](https://www.nltk.org/) is the best known library for NLP in Python, and it more or less established the field. It has 50 modules or so for doing all kinds of text-related analysis. The “[NLTK Book](https://www.nltk.org/book/)” is a deep introduction to the library. The library [TextBlob](https://textblob.readthedocs.io/en/dev/index.html) sits atop NLTK and may make it easier for some users to use.

2. [spaCy](https://spacy.io/) is a new library. Unlike NLTK, spaCy limits options to let users get up and running with it quickly. It is also very, very fast and designed for production use (on websites, cell phones, etc.). [Quickstart](https://spacy.io/usage)

3. [Textacy](https://textacy.readthedocs.io/en/latest/) is a wrapper for spaCy that helps with preparing documents for spaCy. [Quickstart](https://textacy.readthedocs.io/en/latest/quickstart.html)

4. [Gensim](https://radimrehurek.com/gensim/#) is very purpose driven towards topic modelling and document organization and analysis. [Quickstart](https://radimrehurek.com/gensim/auto_examples/index.html#core-tutorials-new-users-start-here)

5. As a robust data analysis library, [scikit-learn](https://scikit-learn.org/stable/modules/decomposition.html) also has modules that help with classification and topic modeling. They have a [specific tutorial on working with text data](https://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html#).