# Modeling Commonalities Among Banned Books in US Prisons in 18 States

In this Jupyter notebook, I manipulate data drawn from The Marshall Project, using topic modeling and data visualization to explore
commonalities among the books banned in US prisons. Please see the accompanying paper for methodology and discussion of the results.

In [70]:
### Imports

import pandas as pd
import spacy
import matplotlib.pyplot as plt

nlp = spacy.load("en_core_web_sm")

In [71]:
"""
Conveniently, the project provides a list of all the states' data combined. See the paper for discussion of some complications around this.
We pull the csv from the Marshall Project website, save it into a Pandas dataframe, and print it to make sure everything looks good.
We also save the original csv locally. This isn't necessary, but it's good practice to keep our original data at hand as it allows us to reference things more easily
going forward, particularly if the original link dies at some point.
"""
print("Pulling CSV into the dataframe, please wait!")
books_df = pd.read_csv("https://s3.amazonaws.com/tmp-gfx-public-data/banned-books20220819/banned_book_data_combined_lists.csv?a")
books_df.to_csv("./banned_books.csv")
print("CSV loaded and saved locally! Here it is:")
books_df

Pulling CSV into the dataframe, please wait!
CSV loaded and saved locally! Here it is:


Unnamed: 0,publication,author,date,year,month,day,reason,state_arc
0,[ARABIC LANGUAGE CHARACTERS].,,2019-12-18 00:00:00,2019.0,12.0,18.0,,fl
1,[ARABIC LANGUAGE CHARACTERS]. TAJWEED QURAN.,,2021-09-09 00:00:00,2021.0,9.0,9.0,,fl
2,[KOREAN LANGUAGE CHARACTERS] HOLY SPIRIT LOVE.,,2021-02-25 00:00:00,2021.0,2.0,25.0,,fl
3,[KOREAN LANGUAGE CHARACTERS].,,2021-02-25 00:00:00,2021.0,2.0,25.0,,fl
4,[KOREAN LANGUAGE CHARACTERS].,,2021-02-25 00:00:00,2021.0,2.0,25.0,,fl
...,...,...,...,...,...,...,...,...
48371,Mentalism Companion,,2021-08-25 00:00:00,2021.0,8.0,25.0,,ri
48372,"You Can Read Anyone, Rules of the Game",,2021-09-22 00:00:00,2021.0,9.0,22.0,,ri
48373,Sex Slave never judge a Women by her desires,,2021-09-30 00:00:00,2021.0,9.0,30.0,,ri
48374,Moorish American Civics Guide,,2021-09-30 00:00:00,2021.0,9.0,30.0,,ri
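
The docstring above notes that the original link may die at some point. One way to make reruns independent of the remote file is to download it only when no local copy exists. The following is a minimal sketch, not part of the original notebook: `ensure_local_copy` is a hypothetical helper name, and it relies only on the standard library.

```python
from pathlib import Path
from urllib.request import urlretrieve


def ensure_local_copy(url, path):
    """Return `path`, downloading `url` to it only if the file is missing.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    local = Path(path)
    if not local.exists():
        # First run: fetch the remote file and store it locally.
        urlretrieve(url, str(local))
    # Later runs: the existing local file is reused and no request is made.
    return local
```

With a helper like this, the load step could be written as `books_df = pd.read_csv(ensure_local_copy(url, "./banned_books.csv"))`, where `url` stands for the Marshall Project link used above. Note that this keeps the raw downloaded file rather than a copy re-exported by pandas.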


In [72]:
# First, we clean the dataframe by removing the columns that aren't of interest to us: the time-related ones. Then we replace all NaN values with the string "None".
# This lets us process every cell with Spacy as text while still keeping track of how many missing values are present in the data.

books_df = books_df.drop(columns=["date", "year", "month", "day"])
books_df = books_df.fillna("None")
#books_df["author"]=books_df["author"].map(str)
# Then print again to see how it looks.

books_df

Unnamed: 0,publication,author,reason,state_arc
0,[ARABIC LANGUAGE CHARACTERS].,,,fl
1,[ARABIC LANGUAGE CHARACTERS]. TAJWEED QURAN.,,,fl
2,[KOREAN LANGUAGE CHARACTERS] HOLY SPIRIT LOVE.,,,fl
3,[KOREAN LANGUAGE CHARACTERS].,,,fl
4,[KOREAN LANGUAGE CHARACTERS].,,,fl
...,...,...,...,...
48371,Mentalism Companion,,,ri
48372,"You Can Read Anyone, Rules of the Game",,,ri
48373,Sex Slave never judge a Women by her desires,,,ri
48374,Moorish American Civics Guide,,,ri
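
The cleaning step above fills missing cells with the string "None" so that they can still be counted later. A complementary check is to count the missing cells per column before they are overwritten. The sketch below uses a small invented dataframe (`toy_df` and its rows are assumptions for illustration, not the Marshall Project data) to show that the two counts agree.

```python
import pandas as pd

# Toy stand-in for books_df; the rows are invented for illustration only.
toy_df = pd.DataFrame(
    {
        "publication": ["Book A", "Book B", "Book C"],
        "author": [None, "Jane Doe", None],
        "reason": ["security", None, None],
        "state_arc": ["fl", "ri", "fl"],
    }
)

# Count missing cells per column before they are overwritten.
missing_before = toy_df.isna().sum()

# Same replacement as in the cleaning cell above.
filled_df = toy_df.fillna("None")

# After filling, the placeholder count per column matches the earlier count.
placeholder_counts = (filled_df == "None").sum()

print(missing_before.to_dict())
print(placeholder_counts.to_dict())
```

Counting before filling also avoids confusing a missing value with a cell whose real text happens to read "None".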


In [86]:
# With itertuples we read each column of the dataframe into its own generator, which we later feed to Spacy.

publication_column = (row.publication for row in books_df.itertuples())
author_column = (row.author for row in books_df.itertuples())
reason_column = (row.reason for row in books_df.itertuples())
state_column = (row.state_arc for row in books_df.itertuples())
#We return to these variables later when we do some topic modeling, but first we do POS tagging in Spacy.
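
One property of the generator expressions above is worth keeping in mind: each can be consumed only once. The sketch below illustrates this with invented rows (the `Row` tuples are assumptions standing in for `books_df.itertuples()`, not real data).

```python
from collections import namedtuple

# Invented rows standing in for books_df.itertuples(); not real data.
Row = namedtuple("Row", ["publication", "state_arc"])
rows = [Row("Book A", "fl"), Row("Book B", "ri")]

# Same pattern as the cell above: a generator expression over the rows.
publication_column = (row.publication for row in rows)

first_pass = list(publication_column)   # consumes the generator
second_pass = list(publication_column)  # nothing is left to yield

print(first_pass)   # ['Book A', 'Book B']
print(second_pass)  # []
```

In the notebook, this means the cell defining the generators has to be rerun before the Spacy cell is rerun. Storing `books_df["publication"].tolist()` instead would be one reusable alternative, at the cost of holding the whole column in memory.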

In [87]:
# We move to Spacy and run through the nlp pipeline, removing stopwords and non-alpha characters from the publication and reason columns while we're at it.
# This gives us the Spacy data we will analyze.

publication_nouns = []
publication_proper = []
publication_verbs = []
author_names = []
reason_nouns = []
reason_adj = []
states = []

print("Running everything through the Spacy NLP pipeline. This may take some time, please be patient.")

# We process the publication column, pulling out alphabetic, non-stopword, lemmatized versions of the nouns, proper nouns, and verbs.
print("\nProcessing publication column.")
for doc in nlp.pipe(publication_column):
    for token in doc:
        if token.is_alpha and not token.is_stop:
            if token.pos_ == "NOUN":
                publication_nouns.append(token.lemma_.lower())
            elif token.pos_ == "PROPN":
                publication_proper.append(token.lemma_.lower())
            elif token.pos_ == "VERB":
                publication_verbs.append(token.lemma_.lower())
print("\n...\nPublication column processed.")

#We process the author column, just assembling one list of all the names.
print("\n...\nProcessing author column.")
for doc in nlp.pipe(author_column):
    for token in doc:
        if token.is_alpha:
            author_names.append(token.text.lower())
print("\n...\nAuthor column processed.")

#We process the reasons column, pulling out lemmatized nonstop nouns and adjectives.
print("\n...\nProcessing reason column.")
for doc in nlp.pipe(reason_column):
    for token in doc:
        if token.is_alpha and not token.is_stop:
            if token.pos_ in ("NOUN", "PROPN"):
                reason_nouns.append(token.lemma_.lower())
            elif token.pos_ == "ADJ":
                reason_adj.append(token.lemma_.lower())
print("\n...\nReason column processed.")

#The states do not need to be sorted by POS since there's only one type of information in the column. However, to make sure our data is consistent we use Spacy
#to filter out any potential non-alpha characters.
print("\n...\nProcessing state column.")
for doc in nlp.pipe(state_column):
    for token in doc:
        if token.is_alpha:
            states.append(token.text.lower())
print("\n...\nState column processed.\n...\nAll columns have been processed, thank you for waiting.")

Running everything through the Spacy NLP pipeline. This may take some time, please be patient.

Processing publication column.

...
Publication column processed.

...
Processing author column.

...
Author column processed.

...
Processing reason column.

...
Reason column processed.

...
Processing state column.

...
State column processed.
...
All columns have been processed, thank you for waiting.
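
When a part-of-speech tag is tested against several values, as in the reason loop, the form of the condition matters. A comparison written as `pos == "NOUN" or "PROPN"` accepts every tag, because the bare string on the right of `or` is always truthy. The sketch below contrasts it with a membership test. Both function names are hypothetical and plain strings stand in for `token.pos_`.

```python
def is_noun_like(pos):
    """Return True for common nouns and proper nouns (hypothetical helper)."""
    return pos in ("NOUN", "PROPN")


def always_true(pos):
    # `pos == "NOUN" or "PROPN"` parses as `(pos == "NOUN") or "PROPN"`;
    # the non-empty string "PROPN" is truthy, so this accepts every tag.
    return bool(pos == "NOUN" or "PROPN")


for tag in ("NOUN", "PROPN", "ADJ", "VERB"):
    print(tag, is_noun_like(tag), always_true(tag))
```

With the `or` form, an `elif` branch for adjectives would never be reached, so a list such as `reason_adj` would stay empty.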


In [90]:
states

['fl',
 'fl',
 'fl',
 ...