# Stanford Encyclopedia of Philosophy - Who is being cited?

## Introduction

The *Stanford Encyclopedia of Philosophy* (https://plato.stanford.edu/) is often considered to contain the condensed knowledge of (academic) philosophy. As such, it lends itself to analysis as a way of learning something about the field of academic philosophy. This is what I aim to do in this notebook.

In particular, I am interested in examining to what degree the different genders are represented within the encyclopedia. To this end, we will analyze a dataset containing all the references that appear in the encyclopedia, that is, all the books and journal articles that are cited (a description of how I gathered the data can be found in a different notebook). I will address the following questions:

* Overall, how often are works by women cited compared with works by men?
* Does the proportion of female-authored works change over time (i.e. if we look at different publication years)?
* Who are the most cited authors for each gender?
* Which university affiliations account for most citations?
* How are cited authors distributed geographically?

## The data set

Let us begin by loading the data set and examining its properties.

In [1]:
import math

import pandas as pd
import numpy as np
import pycountry
import gender_guesser.detector as gender
import googlemaps

In [2]:
df = pd.read_csv("data/dataset_full_names.csv")[["First author", "Year", "Title"]]
print(f"The dataset contains {len(df)} entries.")
df.head()

The dataset contains 146472 entries.


Unnamed: 0,First author,Year,Title
0,Peter Achinstein,2001,The Book of Evidence
1,Jonathan Adler,1994,"Testimony, Trust, Knowing"
2,Kent Bach,1979,Linguistic Communication and Speech Acts
3,Alexander Bird,1998,Philosophy of Science
4,John Bigelow,2010,"Quine, Mereology, and Inference to the Best Ex..."


The data set contains information on the *first author* of an article/book, the *year* in which it was published, and the *title* of the article/book. It contains 146472 entries. Erroneous references and references that did not fit the format were removed in a preprocessing step beforehand (see the data notebook for more details).

## Inferring an author's gender

We want to analyze how works by men and women are represented in the encyclopedia. The only features we have at our disposal, however, are the name of the first author, the year of publication and the title of the work. We have no direct information about an author's gender. An author's first name, however, allows us to guess their gender with relatively high reliability. This is, of course, a simplification: it disregards non-binary authors who do not identify with either gender, it might misclassify transgender authors, and it gets some names wrong that are used in atypical ways (Hilary Putnam, for instance, is a male philosopher with a first name that is more commonly given to girls). Nonetheless, under the assumption that these errors occur (roughly) equally often for authors categorized as *male* or *female*, the first name will allow us to obtain a relatively accurate estimate of the proportion of female-authored references in the encyclopedia. It is important to keep in mind, however, that this gives us only an estimate and not exact numbers.
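To make the assumption about classification errors more concrete, here is a minimal sketch of how name-based misclassification would shift the observed share of female-authored references. It is not part of the original analysis; the function name and the error rates are hypothetical and chosen only for illustration.

```python
def observed_female_share(true_share, female_as_male, male_as_female):
    """Share of references labelled female, given misclassification rates.

    true_share: actual proportion of female-authored references (0..1)
    female_as_male: fraction of female authors wrongly labelled male
    male_as_female: fraction of male authors wrongly labelled female
    """
    # Women correctly labelled female plus men wrongly labelled female
    return true_share * (1 - female_as_male) + (1 - true_share) * male_as_female


# Illustrative numbers only; they are not estimated from the data set
print(round(observed_female_share(0.20, 0.02, 0.02), 3))
```

With equal error *rates* for both groups, the observed share moves slightly toward 50% whenever the true share is below 50%, so the estimate would err on the side of overstating the female share rather than understating it.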

We use a package called *gender guesser*, which sorts first names into the following six categories:
* *male* (a first name that is significantly more frequently given to boys)
* *female* (a first name that is significantly more frequently given to girls)
* *mostly male* (a first name that is slightly more frequently given to boys)
* *mostly female* (a first name that is slightly more frequently given to girls)
* *andy* (a first name that is given to boys and girls equally often)
* *unknown* (the name was not found in the database)

In [3]:
# Initialize gender detector
d = gender.Detector()
df["Gender guess"] = df["First author"].apply(lambda x: d.get_gender(x.split(" ")[0]))
df.head()

Unnamed: 0,First author,Year,Title,Gender guess
0,Peter Achinstein,2001,The Book of Evidence,male
1,Jonathan Adler,1994,"Testimony, Trust, Knowing",male
2,Kent Bach,1979,Linguistic Communication and Speech Acts,male
3,Alexander Bird,1998,Philosophy of Science,male
4,John Bigelow,2010,"Quine, Mereology, and Inference to the Best Ex...",male
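The cell above passes the first whitespace-separated token of each name to the detector. If some authors are recorded with leading initials (an assumption; the data set may not contain such entries), that token would be an initial and end up as *unknown*. A minimal sketch of a more forgiving extraction, using a hypothetical helper `first_given_name`:

```python
def first_given_name(full_name):
    """Return the first non-initial given name, or '' if there is none.

    The last token is treated as the surname and never returned.
    """
    tokens = full_name.split()
    for token in tokens[:-1]:
        stripped = token.strip(".,")
        # Skip initials such as "G." or "E"
        if len(stripped) > 1:
            return stripped
    return ""


print(first_given_name("Peter Achinstein"))   # Peter
print(first_given_name("J. Michael Dunn"))    # Michael
print(repr(first_given_name("G. E. M. Anscombe")))  # ''
```

Note the trade-off: a single-token name has no given name under this rule, whereas the original cell would still look that token up.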


The only blatant mistake among the highly cited first authors seems to be that *Hilary Putnam* is categorized as *female*.

In [4]:
# Select all the rows in which the first author is "Hilary Putnam"
putnam_rows = df["First author"] == "Hilary Putnam"
# Set the gender guess for those rows to "male"
df.loc[putnam_rows, "Gender guess"] = "male"
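If more manual corrections of this kind were needed, they could be collected in a single lookup table instead of one cell per author. The following is a sketch on a toy frame with the notebook's column names; the `overrides` dictionary is a hypothetical construct, not something the original analysis uses.

```python
import pandas as pd

# Toy frame mirroring the notebook's columns
toy = pd.DataFrame({
    "First author": ["Hilary Putnam", "Kent Bach", "Hilary Putnam"],
    "Gender guess": ["female", "male", "female"],
})

# Hypothetical manual overrides, keyed by the exact author string
overrides = {"Hilary Putnam": "male"}

# Use the override where one exists, otherwise keep the automatic guess
toy["Gender guess"] = toy["First author"].map(overrides).fillna(toy["Gender guess"])
print(toy["Gender guess"].tolist())
```

Keeping the corrections in one dictionary makes it easy to see at a glance which guesses were changed by hand.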

## Analysis

Let us now explore the gender differences present in the data. We begin by looking at the overall distribution of the first authors' guessed gender across all references.

In [5]:
# Count citations for each gender guess, drop irrelevant columns, and rename the "Title" column to "Count"
df_total = df.groupby("Gender guess").count().sort_values("Title", ascending=False).reset_index()
df_total = df_total[["Gender guess", "Title"]]
df_total = df_total.rename(columns={"Title": "Count"})

# Get total number of citations
total_citations = sum(df_total["Count"])

# Add column percentage 
df_total["Percentage"] = df_total["Count"]/total_citations
df_total["Percentage"] = df_total["Percentage"].apply(lambda x: str(round(100*x,2))+"%")
df_total.head()

Unnamed: 0,Gender guess,Count,Percentage
0,male,110229,75.26%
1,female,22419,15.31%
2,unknown,8580,5.86%
3,mostly_male,2812,1.92%
4,mostly_female,1775,1.21%


In [6]:
# Import Plotly packages (only the names used in this notebook)
import plotly.express as px
from plotly.graph_objs import Bar, Figure

# Set the template
template="none"
color_male = px.colors.qualitative.D3[1]
color_female = px.colors.qualitative.D3[0]

In [22]:
# Barplot for distribution of total citations in SEP
fig = Figure(data=[Bar(
    x=df_total["Gender guess"],
    y=df_total["Count"],
    marker_color=[color_male,color_female,"lightgrey","lightgrey","lightgrey","lightgrey"],
    text=df_total["Percentage"]
)])

fig.update_layout(title_text="Distribution of total citations on SEP (of works published after 1950)", 
                  template=template)

As we can see, 75.26% of references are works (guessed to be) authored by men, while only 15.31% are works (guessed to be) authored by women. The other four categories together account for less than 10% of references. To simplify the analysis, we will therefore focus from here on only on references categorized as *male* or *female*. Moreover, we will restrict the analysis to works published in 1950 or later.
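Dropping the ambiguous categories changes the denominator, so it is worth checking how much the female share depends on that choice. The sketch below uses only the counts from the table above; the two bounding scenarios (all ambiguous names male, all ambiguous names female) are illustrative assumptions, not claims about the data.

```python
# Counts taken from the table above
male, female, total = 110229, 22419, 146472
other = total - male - female  # all remaining categories combined

share_binary = female / (male + female)   # ambiguous names dropped
share_low = female / total                # all ambiguous names assumed male
share_high = (female + other) / total     # all ambiguous names assumed female

print(f"{share_binary:.2%} {share_low:.2%} {share_high:.2%}")
# 16.90% 15.31% 24.74%
```

Restricting the analysis to the two main categories therefore puts the female share at roughly 17%, and even the most extreme reassignment of ambiguous names would leave it below 25%.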



In [8]:
# Import function to determine whether a publication date is valid and within a specific year range
from sep_functions import published_between
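The helper `published_between` lives in `sep_functions`, which is not shown in this notebook. The following is only a guess at what such a helper might look like; the name `published_between_sketch` and its exact behaviour are assumptions, not the actual implementation.

```python
def published_between_sketch(year, start, end):
    """Return True if `year` parses as a whole number within [start, end]."""
    try:
        value = float(str(year).strip())
    except ValueError:
        # Covers entries such as "forthcoming"
        return False
    # Reject NaN, infinities, and fractional values
    return value.is_integer() and start <= value <= end


print(published_between_sketch("1994", 1950, 2022))         # True
print(published_between_sketch("forthcoming", 1950, 2022))  # False
print(published_between_sketch("1949", 1950, 2022))         # False
```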

In [9]:
# Keep only rows whose year is a valid number between 1950 and 2022 (drops "forthcoming", etc.)
df_years = df.copy()
df_years["Year range"] = df_years["Year"].apply(lambda x: published_between(x,1950,2022))
df_years = df_years[df_years["Year range"]][["Year", "Gender guess","Title"]].copy()
df_years = df_years.rename(columns={"Title": "Count"})
# Assign the converted column back; pd.to_numeric does not modify its input in place
df_years["Year"] = pd.to_numeric(df_years["Year"])
df_years = df_years.sort_values("Year")

In [10]:
# Leave only male and female for simplicity
df_years = df_years[df_years["Gender guess"].isin(["male", "female"])]

For each year individually, we will now count the citations by gender.

In [11]:
# Group by year and gender guess, and count the references
df_years = df_years.groupby(["Year", "Gender guess"]).count().reset_index()
df_years = df_years.sort_values(["Year","Gender guess"]).reset_index(drop=True)
df_years.head()

Unnamed: 0,Year,Gender guess,Count
0,1950,female,8
1,1950,male,226
2,1951,female,8
3,1951,male,213
4,1952,female,10


Next, we will plot how the number of citations by gender changes with the publication year of the referenced book or journal article. We can see that the majority of cited works were published roughly between 1995 and 2015. Almost no publications by female authors from before 1970 are cited. After 1970, more and more publications by female authors are cited.

In [12]:
# Plot the proportion over time
fig = px.bar(df_years, 
             x="Year", 
             y="Count", 
             color = "Gender guess", 
             title="Number of citations by gender and publication year (1950 - present)",
             template=template)
fig.show()


Let us now plot the proportion of female citations throughout the years.

In [13]:
# Prepare data set with percentages of female-authored references per publication year
yearly_counts_total = df_years.groupby("Year")["Count"].sum().reset_index()
yearly_counts_female = df_years[df_years["Gender guess"]=="female"]
# Inner join on year: "Count_x" is the female count, "Count_y" the yearly total
df_percentage = pd.merge(
    yearly_counts_female,
    yearly_counts_total,
    how="inner",
    on="Year",
    sort=True,
    suffixes=("_x", "_y"),
)
df_percentage["Female percentage"] = df_percentage["Count_x"]/df_percentage["Count_y"]*100
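One property of the inner merge is that a publication year without any female-authored reference has no row in `yearly_counts_female` and therefore drops out of `df_percentage` instead of appearing as 0%. Whether such years occur after 1950 is not known from the output shown here. A sketch of an alternative that keeps them, using a toy frame in which 1951 has deliberately been given no female entry:

```python
import pandas as pd

# Toy counts in the notebook's column layout; 1951 has no "female" row on purpose
toy = pd.DataFrame({
    "Year": [1950, 1950, 1951],
    "Gender guess": ["female", "male", "male"],
    "Count": [8, 226, 213],
})

# One row per year, one column per gender; missing combinations become 0
wide = toy.pivot(index="Year", columns="Gender guess", values="Count").fillna(0)
wide["Female percentage"] = wide["female"] / (wide["female"] + wide["male"]) * 100
print(wide["Female percentage"].round(2).to_dict())
```

The pivoted layout also avoids the `Count_x`/`Count_y` suffixes, at the cost of assuming both a `female` and a `male` column exist.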

In [14]:
# Plot
fig = px.scatter(df_percentage, x="Year",
                 y="Female percentage", 
                 title='Proportion of female-authored works being cited in SEP articles',
                 labels={"Year": "Publication year of cited work", 
                         "Female percentage": "Percentage of female-authored works"},
                 trendline ="rolling",
                 trendline_options=dict(window=5),
                 trendline_color_override="black")
fig.update_layout(template=template)
fig.update_traces(mode = 'lines')
fig['data'][0]['showlegend']=True
fig['data'][0]['line']['color']=color_female
fig['data'][0]['name']='Year-by-year'
fig['data'][1]['showlegend']=True
fig['data'][1]['name']='5-year moving average'
fig.show()

The blue line represents the actual percentage of female-authored works for each publication year. The black line represents the 5-year moving average. As we can see, up to the publication year 1970 the proportion of female-authored references is somewhere around 5%. After that, the 5-year moving average increases (with a few exceptions) from year to year, and in 2021 it reaches over 28%.
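For readers unfamiliar with moving averages, the sketch below shows the computation on a short made-up series. It assumes a trailing window that yields a value only once five points are available, which is how a default pandas rolling mean behaves; the function `trailing_mean` is a hypothetical stand-in, not the code Plotly runs internally.

```python
def trailing_mean(values, window=5):
    """Trailing moving average; None until a full window is available."""
    result = []
    for i in range(len(values)):
        if i + 1 < window:
            result.append(None)
        else:
            # Average the current value and the (window - 1) values before it
            chunk = values[i + 1 - window : i + 1]
            result.append(sum(chunk) / window)
    return result


# Made-up percentages, for illustration only
print(trailing_mean([5, 5, 5, 5, 10, 15], window=5))
# [None, None, None, None, 6.0, 8.0]
```

Because each point averages the current year with the four preceding ones, the smoothed line lags behind sudden changes in the year-by-year values.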

In the medium-term future, we would ideally like to arrive at a state of the discipline where roughly half of the senior staff in philosophy, and hence the authors of influential philosophy articles, are women. As this analysis shows, we are still far from this. Nonetheless, this analysis also comes to the optimistic conclusion that female philosophers are gaining more influence on the field, and that the change we witness is a change in a positive direction. 

### Who are the most cited authors?

In [15]:
# Create a dataframe with the counts of citations of individual authors
most_cited_df = df.groupby(["First author","Gender guess"]).count().reset_index()
most_cited_df = most_cited_df.sort_values("Title", ascending=False)
most_cited_df = most_cited_df[["First author","Gender guess","Title"]]
most_cited_df = most_cited_df.rename(columns={"Title": "Count"})

print(len(most_cited_df))


29362


In [16]:
# Reduce to those who have been cited at least 75 times to make manual checking manageable
most_cited_df = most_cited_df[most_cited_df["Count"]>=75]
print(len(most_cited_df))

170


In [17]:
count_male = len(most_cited_df[most_cited_df["Gender guess"]=="male"])
count_female = len(most_cited_df[most_cited_df["Gender guess"]=="female"])
count_mmale = len(most_cited_df[most_cited_df["Gender guess"]=="mostly_male"])
count_mfemale = len(most_cited_df[most_cited_df["Gender guess"]=="mostly_female"])
count_andy = len(most_cited_df[most_cited_df["Gender guess"]=="andy"])
count_unknown = len(most_cited_df[most_cited_df["Gender guess"]=="unknown"])
print(f"Over 75 citations: \n Men: {count_male}"+
      f"\n Women: {count_female}"+
      f"\n Mostly male: {count_mmale}"+
      f"\n Mostly female: {count_mfemale}"+
      f"\n Androgenous: {count_andy}"+
      f"\n Unknown: {count_unknown}")

Over 75 citations: 
 Men: 143
 Women: 15
 Mostly male: 6
 Mostly female: 1
 Androgenous: 0
 Unknown: 5


A manual analysis of the 170 entries reveals that all "unknown", "mostly_male", and "mostly_female" entries seem to actually be male. Consequently, we change their gender guess value.

In [18]:
# Get the indices of all the "unknown", "mostly_male", and "mostly_female" entries
unknown_index = most_cited_df.loc[most_cited_df["Gender guess"]=="unknown"].index.values
mmale_index = most_cited_df.loc[most_cited_df["Gender guess"]=="mostly_male"].index.values
mfemale_index = most_cited_df.loc[most_cited_df["Gender guess"]=="mostly_female"].index.values

# Concatenate the indices
change_indices = np.concatenate([unknown_index, mmale_index, mfemale_index])

# Change the indices to male
for change_index in change_indices:
    most_cited_df.at[change_index,"Gender guess"]="male"

Now, let us look at the ten most cited men and women. To this end, we will create a dataframe which contains the names of the ten most cited men and women and their gender-specific ranking.

In [19]:
# Ten most cited men, ranked 1-10 (most_cited_df is already sorted by count, descending)
ten_male = most_cited_df[most_cited_df["Gender guess"]=="male"].head(10)
ten_male = ten_male.reset_index().reset_index()
ten_male["Rank"] = ten_male["level_0"]+1
ten_male["Description"] = "<b>"+ten_male["First author"]+"</b>"+ten_male["Count"].apply(lambda x: " ("+str(x)+")")
ten_male = ten_male[["First author", "Gender guess", "Count", "Rank","Description"]]

# Ten most cited women, ranked 1-10
ten_female = most_cited_df[most_cited_df["Gender guess"]=="female"].head(10)
ten_female = ten_female.reset_index().reset_index()
ten_female["Rank"] = ten_female["level_0"]+1
ten_female["Description"] = "<b>"+ten_female["First author"]+"</b>"+ten_female["Count"].apply(lambda x: " ("+str(x)+")")
ten_female = ten_female[["First author", "Gender guess", "Count", "Rank", "Description"]]

In [30]:
from plotly.subplots import make_subplots
fig = make_subplots(rows=1, cols=2, specs=[[{}, {}]], shared_yaxes=True, horizontal_spacing=0)

# Bar plot for male philosophers
fig.append_trace(Bar(x=ten_male["Count"], 
                     y=ten_male["Rank"],
                     orientation='h', 
                     showlegend=True, 
                     text=ten_male["Description"], 
                     name='Male philosophers',
                     marker_color=color_male,
                     hovertemplate=ten_male["Count"],
                     textposition=["inside","outside","outside","outside","outside","outside","outside","outside","outside","outside"],
                     ), 1, 1)

# Bar plot for female philosophers
fig.append_trace(Bar(x=ten_female["Count"], y=ten_female["Rank"], 
                        orientation='h', showlegend=True,
                        text=ten_female["Description"],
                        name='Female philosophers',
                        marker_color=color_female,
                        hovertemplate=ten_female["Count"],
                        textposition="outside",), 1, 2)

fig.update_xaxes(showgrid=True, dtick = 100, range=[600,0], row=1, col=1)
fig.update_xaxes(showgrid=True, dtick = 100, range=[0,600], row=1, col=2)

fig.update_yaxes(showgrid=False, categoryorder='total descending', 
                 ticksuffix=' ', dtick=1, showline=False)



fig['layout']['yaxis']['autorange'] = "reversed"

fig.update_layout(title='The ten most cited men and women on the SEP',
                  margin=dict(t=80, b=50, l=80, r=40),
                  xaxis_title="Number of citations",
                  yaxis_title="Ranking",
                  legend=dict(orientation="h", yanchor="bottom",
                              y=1, xanchor="center", x=0.5),
                  hoverlabel=dict(bgcolor="white", font_size=13, 
                                  font_family="Lato, sans-serif"),
                  template=template)

We can see that *David Lewis* is by far the most cited author on the SEP, followed by *John Rawls* and *Bertrand Russell*. Among female philosophers, *Martha Nussbaum* is the most cited, followed by *Christine Korsgaard* and *Julia Annas*. It is striking that each of the ten most cited male philosophers has significantly more citations than the female philosopher at the same rank. The most cited female philosopher, Martha Nussbaum, has fewer citations than the ninth most cited male philosopher. This further confirms that there is a strong imbalance between the representation of genders in the SEP.
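The rank-by-rank comparison described above can also be computed directly rather than read off the chart. The sketch below uses placeholder authors and made-up counts (not values from the data set) to show one way of assigning a within-gender rank and computing the gap at each rank.

```python
import pandas as pd

# Placeholder authors and counts, for illustration only
toy = pd.DataFrame({
    "First author": ["Author A", "Author B", "Author C", "Author D"],
    "Gender guess": ["male", "female", "male", "female"],
    "Count": [300, 120, 200, 90],
})

# Rank authors within each gender by citation count (1 = most cited)
toy = toy.sort_values("Count", ascending=False)
toy["Rank"] = toy.groupby("Gender guess").cumcount() + 1

# Put both genders side by side and compute the gap at each rank
by_rank = toy.pivot(index="Rank", columns="Gender guess", values="Count")
by_rank["Gap"] = by_rank["male"] - by_rank["female"]
print(by_rank)
```

`groupby(...).cumcount()` gives the same ranking as the double `reset_index()` used above, without needing a separate block per gender.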

### Affiliations and geographical representation

Let us now look at which universities the researchers cited in the SEP are affiliated with, and how citations are distributed geographically. To this end, we extract data on researcher affiliations from PhilPeople (http://www.philpeople.org). Unfortunately, it was possible to obtain affiliation data for only roughly half the citations (72140 of 147599, or ~49%). The missing affiliation data might be due to a number of reasons. Historical figures are typically not listed on PhilPeople, nor are researchers from fields other than philosophy. If this explanation is correct, we are not analyzing where citations on the SEP come from in general, but where contemporary philosophy citations come from. This, however, is an interesting question in its own right. All in all, we find that citations come from researchers at 1070 different universities.

In [261]:
# Load university data
uni_counts = pd.read_csv("data/university_counts.csv")
len(uni_counts[1:])

1070

In [263]:
# Rename some universities for better readability
# (assign the result back instead of using inplace=True on a column selection)
uni_counts["Affiliation"] = uni_counts["Affiliation"].replace(
    {"Rutgers University - New Brunswick":"Rutgers University (NB)",
     "University of Notre Dame": "Notre Dame",
     "University of Texas at Austin":"UT Austin"})
     #"Ludwig Maximilians Universität, München":"LMU München",
     #"Humboldt-Universität, Berlin":"HU Berlin",
     #"University of Groningen":"Groningen"

colors = [px.colors.qualitative.Plotly[0] for i in range(10)]
colors[1] = px.colors.qualitative.Plotly[1]

# Barplot for distribution of total citations in SEP
fig = Figure(data=[Bar(
    x=uni_counts[1:10]["Affiliation"],
    y=uni_counts[1:10]["Count"],
    marker_color=colors,
    text=uni_counts[1:10]["Count"].apply(lambda x: str(round(((x/sum(uni_counts["Count"]))*100),2))+"%")
)])

fig.update_layout(title_text="Most cited researchers' affiliations on the SEP", 
                  template=template)

We can see that works written by researchers affiliated with NYU account for 1.3% of all citations on the SEP. That is, more than one in every 100 citations is from an NYU researcher. Among the 10 universities which account for the most citations, 9 are in the US. The only one outside the US is Oxford in the UK.

It is, of course, not surprising that an English-language encyclopedia cites mostly works published in English. However, it would be interesting to see to what extent the English-speaking countries dominate the SEP, and where non-English citations come from.

In order to examine where the citations come from geographically, we need geocoding data. We obtain this using the Google Maps Geocoding API.

In [None]:
# The API key was removed for this notebook
api_key = ""

# Initialize the Google Maps Client
gmaps = googlemaps.Client(key=api_key)

# Initialize empty list
entries = []

# Iterate through the university names to obtain latitude, longitude, and country
for university in university_df["Affiliation"]:
    geocode = gmaps.geocode(university)
    # Skip names for which the API returns no match
    if not geocode:
        continue
    lat = geocode[0]["geometry"]["location"]["lat"]
    lng = geocode[0]["geometry"]["location"]["lng"]
    # Reset per university so a missing country is not carried over from the previous one
    country = None
    for address_component in geocode[0]["address_components"]:
        if "country" in address_component["types"]:
            country = address_component["long_name"]
            # Break inner loop
            break
    entry = [university, lat, lng, country]
    entries.append(entry)
    
# Turn this into a data frame
df_geocode = pd.DataFrame(entries, columns=["Affiliation","Latitude","Longitude","Country"])
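Because the API call requires a key, it can help to separate the parsing step from the request so that it can be checked offline. The sketch below does this with a hypothetical helper `parse_geocode_result`; the sample response is hand-written in the shape the loop above indexes into, and its coordinates and place names are placeholders rather than real API output.

```python
def parse_geocode_result(results):
    """Extract (lat, lng, country) from a geocoding response list."""
    if not results:
        # No match for the query
        return None, None, None
    first = results[0]
    location = first["geometry"]["location"]
    country = None
    for component in first.get("address_components", []):
        if "country" in component["types"]:
            country = component["long_name"]
            break
    return location["lat"], location["lng"], country


# Hand-written sample; coordinates and names are placeholders
sample = [{
    "geometry": {"location": {"lat": 52.0, "lng": 4.0}},
    "address_components": [
        {"long_name": "Example City", "types": ["locality"]},
        {"long_name": "Netherlands", "types": ["country", "political"]},
    ],
}]
print(parse_geocode_result(sample))  # (52.0, 4.0, 'Netherlands')
print(parse_geocode_result([]))      # (None, None, None)
```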

In [322]:
# Merge geocoded data with citation counts data
df_geo_counts = pd.merge(df_geocode,
        uni_counts,
        on="Affiliation").drop(columns=["Unnamed: 0"])

# Rename some universities for better readability
df_geo_counts["Affiliation"] = df_geo_counts["Affiliation"].replace(
    {"Ludwig Maximilians Universität, München":"LMU München",
     "Humboldt-University, Berlin":"HU Berlin",
     "University of Groningen":"U. of Groningen",
     "University of Amsterdam":"U. of Amsterdam",
     "University of Helsinki":"U. of Helsinki"})

# Group by country and sum the citation counts
df_geo_counts_grp = df_geo_counts.groupby("Country")["Count"].sum().sort_values(ascending=False).reset_index()

In [333]:
# Define condition to exclude English speaking countries
cond = ((df_geo_counts["Country"]!="United States") 
        & (df_geo_counts["Country"]!="United Kingdom") 
        & (df_geo_counts["Country"]!="Canada") 
        & (df_geo_counts["Country"]!="Australia")
       & (df_geo_counts["Country"]!="New Zealand"))

# Set colors according to country
color_scheme = {"Germany":4,
              "Sweden":6,
              "Switzerland":8,
              "Netherlands":5,
              "France":7,
              "Finland":9}

# Assign colors to list according to country
colors = [px.colors.qualitative.Plotly[color_scheme.get(country)] for country in df_geo_counts[cond][:10]["Country"]]

# Get text
countrycodes = [pycountry.countries.get(name=x).alpha_2 for x in df_geo_counts[cond][:10]["Country"]]
ticktext = [df_geo_counts[cond]["Affiliation"].iloc[i]+" ("+countrycodes[i]+")" for i in range(10)]

fig = Figure(data=[Bar(
    #x=df_geo_counts[cond][:10]["Affiliation"],
    x=ticktext,
    y=df_geo_counts[cond][:10]["Count"],
    marker_color=colors,
    text=df_geo_counts[cond][:10]["Count"].apply(lambda x: str(round(((x/sum(uni_counts["Count"]))*100),2))+"%"),
)])

fig.update_layout(title_text="Most cited researchers' affiliations on SEP (non-English countries)", 
                  template=template)
fig.update_xaxes(tickangle=28,
                 ticktext = ticktext)
fig.show()

Here we can see the ten most-cited institutions from non-English-speaking countries. The clear leader is the University of Amsterdam, followed by Ludwig-Maximilians-Universität München and Stockholm University.

Let us now generally look at how the citations are distributed by the countries in which authors' affiliations are located.

In [96]:
fig = px.pie(df_geo_counts_grp, values="Count", names="Country")
fig.update_traces(textposition='inside')
fig.update_layout(uniformtext_minsize=12, uniformtext_mode='hide')
fig.show()

We can see that over 64% of those citations for which we could identify the affiliation come from researchers who are affiliated with a US institution. The US hence accounts for the vast majority of citations. Next in line is the UK with 12.5% of citations, followed by Canada with ~5%, Australia with ~4%, and Germany with ~2%. It is unsurprising that English-speaking countries are overrepresented in an English-language encyclopedia, but the extent to which this is the case might still be surprising: together, they account for over 85% of (identifiable) citations.
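The combined figure follows from adding the country shares quoted above. The sketch below uses only those approximate percentages; since the US share is given as "over 64%" and New Zealand's share is not stated, the sum is a lower bound.

```python
# Approximate shares quoted in the text, in percent
shares = {
    "United States": 64,
    "United Kingdom": 12.5,
    "Canada": 5,
    "Australia": 4,
}

english_speaking_total = sum(shares.values())
print(english_speaking_total)  # 85.5
```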

In [234]:
def get_count_cat(count):
    # Map a citation count to a display category for the map legend
    if count==1:
        return "1"
    elif count<=10:
        return "2-10"
    elif count<=100:
        return "11-100"
    elif count<=500:
        return "101-500"
    else:
        return ">500"

df_geo_counts["Citations"] = df_geo_counts["Count"].apply(get_count_cat)

fig = px.scatter_geo(df_geo_counts, 
                     
                     lat=df_geo_counts["Latitude"],
                     lon=df_geo_counts["Longitude"],
                     color="Citations",
                     hover_name="Affiliation", 
                     size=df_geo_counts["Count"].apply(lambda x: math.log(x+1)),
                     projection="natural earth",
                     size_max=20
                    )

fig.update_geos(
    visible=False, resolution=110,
    showcoastlines=True, coastlinecolor="Black",
    showland=True, landcolor="LightGreen",
    showocean=True, oceancolor="LightBlue",
    showcountries=True,
)
fig.show()

Lastly, we can inspect the map, on which each institution that citations come from is shown as a marker whose size and color reflect its number of citations. There are dense clusters in North America and Europe, but also substantial contributions from some areas in South America, some countries in Asia, and Australia. Very few cited works come from authors who work at institutions on the African continent (with the exception of South Africa).
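The color categories on the map come from the chain of comparisons in `get_count_cat`. As an alternative sketch, the same binning can be expressed as a lookup over sorted upper bounds, which keeps the thresholds and labels in one place. The names `UPPER_BOUNDS`, `LABELS`, and `count_category` are hypothetical, the label `2-10` is used for the second bin so that the bins do not appear to overlap, and counts are assumed to be at least 1.

```python
from bisect import bisect_left

# Inclusive upper bound of each bin except the last, open-ended one
UPPER_BOUNDS = [1, 10, 100, 500]
LABELS = ["1", "2-10", "11-100", "101-500", ">500"]


def count_category(count):
    """Return the display label of the bin that `count` falls into."""
    # bisect_left finds the first bound that is >= count
    return LABELS[bisect_left(UPPER_BOUNDS, count)]


for value in [1, 2, 10, 11, 500, 501]:
    print(value, count_category(value))
```

Adding or moving a threshold then only requires editing the two lists.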