`Keywords`

- Wikipedia
- Philosophy
- Getting to Philosophy 
- First Page Philosophy
- First Link
- Navigability



# Abstract

In this study, I analyze a phenomenon on Wikipedia in which repeatedly clicking the first link of a page usually takes a user to the Philosophy page. I examine the percentage of pages on Wikipedia for which this holds true in an effort to understand how Wikipedia’s network is structured and what that means for user navigability and understanding. Previous research indicates that users’ page navigation is heavily focused on the lead of a Wikipedia article, rarely venturing beyond the first paragraph[@VisualClicks]; therefore, I limit my analysis to the first several links in this section. Further analysis with greater computing power could be done on the links within the entire article. Amongst these first several links, I seek to determine whether any other link location reaches a specific page, including the philosophy page, with abnormal frequency. To conduct my analysis, I construct a network using Wikipedia pages as nodes and the links on each page as directed links between nodes. I collected my data using a Breadth-First Search (BFS), meaning once I reach a page that has already been visited, I move on to another root page. With the network, I examine average path lengths to the philosophy page, the neighbors of the philosophy page that most commonly direct to it, and the structure of the first-link network itself. Furthermore, I examine the second link of Wikipedia pages and conduct an analysis of that network as well. My conclusions demonstrate the effectiveness of Wikipedia’s effort to make its introductory sentences and links sufficiently broad.

#  Introduction

Wikipedia pages are built with the user’s understanding in mind. To ensure consistency across pages and maintain reliability as a credible source, there are extensive guidelines on the structure of each page. As two of the most important components of a Wikipedia page, linked content and the content of the lead paragraph are tightly monitored. Links serve to “provide instant pathways to locations within and outside the project that can increase readers' understanding of the topic at hand.”[@Wikipedia:ManualofStyle] Users will click on links when a topic is unfamiliar to them, or if they are interested in learning more.

When arriving at a page, a user ought to have the topic explained to them as though they know little to nothing about it. The lead ought to orient the reader so as to “set the scene of the topic.”[@Wikipedia:ManualofStyle] Wikipedia explains the structure of the lead paragraph:

> In Wikipedia, the lead section is an introduction to an article and a summary of its most important contents. It is located at the beginning of the article, before the table of contents and the first heading. It is not a news-style lead or "lede" paragraph.

> The average Wikipedia visit is a few minutes long. The lead is the first thing most people will read upon arriving at an article, and may be the only portion of the article that they read. It gives the basics in a nutshell and cultivates interest in reading on—though not by teasing the reader or hinting at what follows. It should be written in a clear, accessible style with a neutral point of view.[@Wikipedia:ManualofStyle]

Wikipedia goes on to outline how the opening paragraph and sentence ought to be structured. It explains that “[the opening paragraph] should establish the context in which the topic is being considered by supplying the set of circumstances or facts that surround it. If appropriate, it should give the location and time.”[@Wikipedia:ManualofStyle] For example, a building’s first link will most likely be its location. Within that paragraph, the opening sentence is critical for my study as it will contain the first link. Editors are instructed that “the first sentence should tell the nonspecialist reader what or who the subject is, and often when or where.”[@Wikipedia:ManualofStyle] The guidelines go on to provide explicit instructions on what the first linked topic ought to be in an article:

> The first sentence should provide links to the broader or more elementary topics that are important to the article's topic or place it into the context where it is notable. 
    
> For example, an article about a building or location should include a link to the broader geographical area of which it is a part. 
    
> Arugam Bay is a [bay](https://en.wikipedia.org/wiki/Bay) on the [Indian Ocean](https://en.wikipedia.org/wiki/Indian_Ocean) in the dry zone of [Sri Lanka's](https://en.wikipedia.org/wiki/Sri_Lanka) southeast coast. 
    
> In an article about a technical or jargon term, the first sentence or paragraph should normally contain a link to the field of study that the term comes from. 
    
> In [heraldry](https://en.wikipedia.org/wiki/Heraldry), tinctures are the colours used to [emblazon](https://en.wikipedia.org/wiki/Blazon) a [coat of arms](https://en.wikipedia.org/wiki/Coat_of_arms). 
        
> The first sentence of an article about a person should link to the page or pages about the topic where the person achieved prominence. 
    
> *Harvey Lavan "Van" Cliburn Jr.* (July 12, 1934 – February 27, 2013) was an American [pianist](https://en.wikipedia.org/wiki/Pianist) who achieved worldwide recognition in 1958 at age 23, when he won the first quadrennial [International Tchaikovsky Piano Competition](https://en.wikipedia.org/wiki/International_Tchaikovsky_Competition) in Moscow, at the height of the [Cold War](https://en.wikipedia.org/wiki/Cold_War).
    
> Exactly what provides the context needed to understand a given topic varies greatly from topic to topic.[@Wikipedia:ManualofStyle]

As you continue to click the first link, the pages you reach become increasingly broad. These instructions create a picture of how a topic like philosophy can be at the center of Wikipedia’s first-link network. Conversely, it is doubtful that such a center exists for any other link placement. Even the second link in an article can be increasingly specific, moving laterally or even backwards in specificity rather than towards larger hubs such as philosophy. Take one of Wikipedia’s examples, Harvey Lavan "Van" Cliburn Jr.: his first-link path begins with pianist and then continues as follows: piano, keyboard instrument, musical instrument, music, art, creativity, psychology, mind, thought, consciousness, awareness, philosophy. With each passing link you can sense that your destiny on the philosophy page grows closer; the topics become broader and the connection to philosophy feels increasingly obvious. However, if we were to follow the second link, International Tchaikovsky Piano Competition, we would find ourselves on the following path: Saint Petersburg, Russia, Eastern Europe, Ural Mountains, Eurasia, Europe, peninsulas, mainland, continent, regions, Earth’s surface, hemispheres, etc. Unlike the first link, the second link gets stuck in geographic limbo without ever getting closer to a central topic like philosophy. I explore what a second-link network looks like further in my analysis and show that geography, broadly speaking, is the typical destination when clicking the second link.

There is special focus on the very beginning of a Wikipedia page because that is where users devote most of their attention. Dimitrov et al. utilize click data from Wikipedia’s navigation logs to construct a heat map of where users click the most on Wikipedia pages. The heat map shows two clear, dark red, high-density lines at the beginning of the page, directly where the lead is located, demonstrating that users' highest click rate is on links within the first few lines of the opening paragraph. The rest of the page is sparse beyond a preference for links on the left side of pages, a phenomenon the authors themselves do not fully understand.[@VisualClicks] However, the high click rate within the lead suggests that the network formed by the first few links in an article is representative of the network that users typically interact with.

Research has already been done into the size of the Giant Connected Component (GCC) of nodes that connect to the philosophy node. According to a study of Wikipedia’s navigability by language, as of 2017, 97.0% of pages in English connect to the philosophy page[@LamprechtNavigability], a slight increase of around 2.5% since 2011.[@Wikipedia:GettingtoPhilosophy] These numbers fluctuate across languages, with some languages having a center on pages such as "Psychology" in Spanish or "Person" in Japanese, each with a GCC of varying size but with the majority of nodes still reaching these pages[@LamprechtNavigability]; my study focuses only on the English network of Wikipedia. In the future, it would be interesting to study this phenomenon in other languages as I have done with English. In particular, previous studies indicate that Dutch has the smallest GCC, containing just 67.0% of nodes.[@LamprechtNavigability] I would like to compare its network to English to understand this discrepancy. However, the English network is already more than large enough for the scope of this study.

If you would like to see how this network is formed beyond clicking through Wikipedia pages on your own, the online tool [xefer](https://www.xefer.com/wikipedia) will quickly build out a network of pages and their first links until it reaches the philosophy page. It is a helpful way to visualize what this looks like in practice. However, it was designed to always reach the philosophy page, even for pages whose first-link path avoids it. It does this by skipping to the second link on a page when it realizes it will not be able to reach the philosophy page through the first link.[@xefer] Therefore, we need to construct our own network to understand these disconnected nodes.

To understand how a node can be disconnected, we ought to look at what makes philosophy the center of the network. If you click on the first link on the philosophy page, you go to the [existence page](https://en.wikipedia.org/wiki/Existence), which takes you to the [entity page](https://en.wikipedia.org/wiki/Entity), then right back to the existence page, forming a loop that makes up the "bottom" of this network. These pages are not nearly as central as philosophy; otherwise the phenomenon would be about one of them instead. For another node to avoid the philosophy node, its path must either end in a similar cycle or reach a page that lacks any link. A page in such a cycle is going to be a broad topic, as it has to be something that could similarly appear in the first sentence of a Wikipedia page. This eliminates hyper-specific pages from consideration despite them being the intuitive guess for what might manage to avoid philosophy. However, these specific pages can eventually lead to the broad pages that manage to cycle without hitting the philosophy page. Furthermore, there can also be pages with no links that function as dead-end pages. Wikipedia recently underwent an effort to remove all true dead-end pages (pages with zero links).[@Wikipedia:Dead-end_pages] Despite these efforts, there remain pages with no links as far as this study is concerned. For example, many sports pages have a lot of links, but they all lie within tables, which are not included in this phenomenon. On [2011-12 Exeter City F.C. season](https://en.wikipedia.org/wiki/2011-12_Exeter_City_F.C._season), there are many links but none in the *content* of the page. All of them are in tables or citations, meaning that this is a dead-end page for the philosophy phenomenon. Additionally, this study does not consider links in lists, a choice explained in greater detail in the methods section.
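The three possible fates of a page (reaching philosophy, falling into a cycle, or hitting a dead end) can be illustrated with a minimal sketch. The `first_link` dictionary below is a hand-built toy mapping rather than data from this study, `follow_first_links` is a hypothetical helper name, and the `loop_a`/`loop_b` pages are invented purely to show a cycle.

```python
from typing import Dict, List, Optional, Tuple

# Toy first-link mapping (illustrative only). None marks a page with no
# usable first link; pages missing from the mapping are treated the same way.
first_link: Dict[str, Optional[str]] = {
    "awareness": "philosophy",
    "philosophy": "existence",
    "existence": "entity",
    "entity": "existence",
    "2011-12_exeter_city_f.c._season": None,
    "loop_a": "loop_b",  # hypothetical pages forming a two-page cycle
    "loop_b": "loop_a",
}


def follow_first_links(
    start: str,
    mapping: Dict[str, Optional[str]],
    target: str = "philosophy",
) -> Tuple[str, List[str]]:
    """Follow first links from `start` and report how the walk ends."""
    path = [start]
    visited = {start}
    current = start
    while True:
        if current == target:
            return "reaches_target", path
        nxt = mapping.get(current)
        if nxt is None:
            return "dead_end", path
        if nxt in visited:
            # The walk has returned to a page it already saw.
            return "cycle", path + [nxt]
        path.append(nxt)
        visited.add(nxt)
        current = nxt


outcome, path = follow_first_links("awareness", first_link)
print(outcome, path)
```

Note that the existence/entity loop is only reached by walking past philosophy, which is why this sketch stops as soon as the target is hit.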

Amongst links in the lead, a page’s neighbors will remain semantically related to that page. In a study that constructed Wikipedia’s network using the first ten links in an article as a node’s edges, it was determined that the nodes form communities of semantically related terms.[@MatasDomain] The mathematics page will be in a community of other topics related to math, such as physics. For our purposes, this is an important result, as it helps to paint a picture of what the branches stemming from philosophy’s neighbors will look like. For example, we can now expect all scientific terms to be connected in communities, allowing them all to pass through the science page on their way to the philosophy page.

Beyond some of the quicker results such as the size of the GCC, the average path length to philosophy, the number of disconnected components, and the nature of networks from other link locations, I also look at the neighbors of the philosophy node. I also investigate the size of the GCC if the philosophy node is removed and the size of the other remaining large components. I hypothesized, and found, that the awareness node and its connecting parts form the basis of the GCC and that the component does not shrink by more than 10%. However, if awareness is removed as well, the GCC would shrink dramatically, as the awareness node serves as a bridge between all scientific topics and all location-based topics (buildings, monuments, historical figures).
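The node-removal experiment can be sketched with NetworkX as follows. This is an illustration on a six-node toy graph, not the study's network; the helper names are hypothetical, and measuring the largest component as a share of the *remaining* nodes is an assumption about how the comparison is made.

```python
import networkx as nx


def largest_component_share(G: nx.DiGraph) -> float:
    """Fraction of nodes in the largest weakly connected component."""
    if G.number_of_nodes() == 0:
        return 0.0
    largest = max(nx.weakly_connected_components(G), key=len)
    return len(largest) / G.number_of_nodes()


def share_after_removal(G: nx.DiGraph, nodes_to_remove) -> float:
    """Largest-component share after deleting the given nodes from a copy."""
    H = G.copy()
    H.remove_nodes_from(nodes_to_remove)
    return largest_component_share(H)


# Toy first-link graph (illustrative only).
toy = nx.DiGraph([
    ("science", "knowledge"),
    ("knowledge", "awareness"),
    ("awareness", "philosophy"),
    ("ethics", "philosophy"),
    ("aesthetics", "philosophy"),
])

print(largest_component_share(toy))
print(share_after_removal(toy, ["philosophy"]))
print(share_after_removal(toy, ["philosophy", "awareness"]))
```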

#  Methods

All of my analysis and data collection was done using [Python 3.10.12](https://www.python.org/downloads/release/python-31012/). 

## Finding First Links

*get_first_link(page_url)*

By far the most difficult task was writing the function that would find the first (or second) link on a Wikipedia page. What is an incredibly easy task for the human eye proved to be quite difficult to program. If the task were to get the first link of any form on a page, that would be quite easy. However, the phenomenon, taken literally, concerns the first link to another Wikipedia page in the content section of the page that is neither in parentheses nor a citation. Finding this link took a lot of trial and error, as Wikipedia pages vary far more than you might think.

I first tried to use the [Wikipedia API](https://pypi.org/project/wikipedia/). Its *links* attribute seemed to be an easy way to grab the links on a page. However, there is no functionality to get the links in order of appearance; instead, they are returned alphabetically. I briefly investigated ways to figure out which of these links came first by parsing the HTML but quickly found that it would be easier to do the entire thing using the HTML.

To read the Wikipedia pages, I used the [Requests](https://requests.readthedocs.io/en/latest/) and [Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) libraries. I then found all of the paragraph content on the Wikipedia page. I decided to exclude list components (bullet points) from my analysis as I felt they did not meet the same criteria as a link within a paragraph. This means that pages like [History of the Administrative Divisions of China](https://en.wikipedia.org/wiki/History_of_the_administrative_divisions_of_China) or [1965 Palanca Awards](https://en.wikipedia.org/wiki/1965_Palanca_Awards) will have 'no links' as there are no links in their primary content. In future analysis, I hope to include these links and compare the results to see which is a better measure.

From there, I found all of the hyperlinks in each paragraph, then got the href and class for each link. I used these to filter out any "bad" links such as citations, files, links that leave Wikipedia, and, most challenging of all, links within parentheses in the text. Excluding parenthetical links was a difficult decision, as sometimes the text there seems meaningful. For example, [the Creativity page](https://en.wikipedia.org/wiki/Creativity) has six parenthetical links before you reach the first link. However, most links inside parentheses are for translations and other self-referring content, as seen on [the Ancient Greece page](https://en.wikipedia.org/wiki/Ancient_Greece); treating Greek as the first link there would be wrong, as that link refers to the translation, not the content of the page itself. Therefore, parenthetical links were excluded using the *isValid(ref, paragraph)* function [from Christopher Chiche on Stack Overflow](https://stackoverflow.com/questions/18916616/get-first-link-of-wikipedia-article-using-wikipedia-api). Finally, on calendar pages such as [Thout 1](https://en.wikipedia.org/wiki/Thout_1), I only considered the first link in an actual paragraph. These pages all begin with a line containing the calendar date, the calendar, and the next date on the calendar, meaning that following it would take hundreds of pages to make it through the entire calendar. More importantly, this line is not a part of the actual content of the page, which is what we want. These lines were detected and avoided by searching for the exact paragraph style they are found in.
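The two filters described here can be sketched in plain Python. This is not the *isValid* function referenced above; `is_article_href` and `is_outside_parentheses` are hypothetical helpers, and the namespace prefixes and CSS class names they check are assumptions about Wikipedia's markup rather than details taken from this study.

```python
from typing import Iterable

# Assumed namespace prefixes for pages that are not ordinary articles.
NON_ARTICLE_PREFIXES = (
    "File:", "Help:", "Category:", "Wikipedia:",
    "Special:", "Template:", "Portal:", "Talk:",
)
# Assumed CSS classes marking links that should be skipped.
SKIPPED_CLASSES = {"external", "image", "new"}


def is_article_href(href: str, css_classes: Iterable[str] = ()) -> bool:
    """Return True if the href looks like a link to another article."""
    if not href or not href.startswith("/wiki/"):
        # Citation anchors ("#cite_note-1") and external URLs end up here.
        return False
    title = href[len("/wiki/"):]
    if title.startswith(NON_ARTICLE_PREFIXES):
        return False
    return not (set(css_classes) & SKIPPED_CLASSES)


def is_outside_parentheses(paragraph_text: str, link_start: int) -> bool:
    """Return True if position `link_start` is not inside open parentheses."""
    depth = 0
    for ch in paragraph_text[:link_start]:
        if ch == "(":
            depth += 1
        elif ch == ")" and depth > 0:
            depth -= 1
    return depth == 0


text = "Ancient Greece (Greek: Hellas) was a civilization"
print(is_outside_parentheses(text, text.index("Greek")))         # inside the parentheses
print(is_outside_parentheses(text, text.index("civilization")))  # after they close
```

Counting unmatched opening parentheses before the link is a simple approach; it assumes parentheses in the paragraph text are balanced.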

I then built a list of links and grabbed the first one. Typically, these hrefs were structured as /wiki/href. To get cleaner node names, I removed the /wiki/ prefix, as it would be repetitive to see it on every single page. If there were no functional links on a page, it served as a dead end for the network even though it may not fall under [Wikipedia's definition of a Dead-End page](https://en.wikipedia.org/wiki/Wikipedia:Dead-end_pages). This effectively only applied to [Disambiguation Pages](https://en.wikipedia.org/wiki/Wikipedia:Disambiguation#Disambiguation_pages). If an *AttributeError* or *TypeError* occurs, which is rare, a unique string "!FAIL!: " is added to the front of the URL so that it can be detected later and not confused with successful pages.

Finally, to find the second link on a page, I used a nearly identical function that returned the second item in the list or, if there was only one link, no links at all.
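Selecting a link by position can be sketched as a single helper. `pick_link` is a hypothetical name, and the list of links is assumed to have already been filtered as described above.

```python
from typing import Optional, Sequence


def pick_link(links: Sequence[str], position: int = 1) -> Optional[str]:
    """Return the link at a 1-based position, or None if there are too few."""
    if position < 1:
        raise ValueError("position is 1-based and must be at least 1")
    if len(links) < position:
        return None
    return links[position - 1]


links = ["Pianist", "International_Tchaikovsky_Competition", "Cold_War"]
print(pick_link(links, 1))        # first link
print(pick_link(links, 2))        # second link
print(pick_link(["Pianist"], 2))  # only one link, so there is no second link
```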

## Creating the First-Link Network

*network_expander(G, page_url, seen_pages, is_root, fails, disconnects,convergence_df, new_pages=100)*

This function is used to create or expand the network. This is done using a **Breadth-First Search (BFS)**. It takes many parameters, but most of them are simply empty lists on the first run. They exist primarily to give the option of expanding the network in multiple steps rather than in one giant run-through, as the function takes quite some time to run.

The function first checks if there have been any previous iterations or if it is starting new. It also has a list of "Notable Nodes" that I have manually set. These are the nodes that my analysis found to be the most important (central). Tracking their centrality lets me confirm that the network has converged, so that claims about the centrality of these nodes hold even though the network does not encompass all of the pages of Wikipedia. Additionally, the function monitors the average page distance from the philosophy page and the size of the network's weakly connected Giant Connected Component (GCC), as I will explain shortly. Finally, the function finds the name of the first page it will look at by splitting its URL.

Then, the bulk of the function occurs in a for-loop. Each pass through the loop adds a new "seed page" to the network, meaning it starts at a new page and works its way towards the philosophy page or to another page that loops back to itself. The *new_pages* parameter determines how many times this loop runs. For my final network, I set this to 50,000. However, this does not mean there are 50,000 pages in the network. Rather, the network contains the usable portion of those seed pages plus all of the pages in between those seed pages and the philosophy page, resulting in 23,169 pages. With greater time and computing power, I would like to conduct a larger analysis; however, none of the values I discuss would change in any statistically significant way, as demonstrated in the convergence section of my analysis.

Each loop starts with the URL of its seed page (*page_url*). For all but the first page, these pages are found using *wiki_random_page(seen_pages)*. This function runs a while loop that ends once it finds a new random page. On each pass, it uses the [Wikipedia API's random function](https://wikipedia.readthedocs.io/en/latest/code.html#api) to select a random Wikipedia page, then checks that the page is new, meaning it is not in the input parameter *seen_pages*, which contains a list of every page the function has previously seen. It also avoids two types of pages:

1. **List Pages**: These pages often do not contain any actual information and are just lists of other Wikipedia pages. While some would work for the network, many are unnecessary and lack any links in their primary content, creating issues for the network. See [List of painters by name beginning with "P"](https://en.wikipedia.org/wiki/List_of_painters_by_name_beginning_with_%22P%22) as an example. These are not pages that would impact Wikipedia's navigability and therefore we can exclude them as seed pages. They are not skipped if they are the first link on a page which can occur (e.g. [Sitting](https://en.wikipedia.org/wiki/Sitting)). 

2. **Disambiguation Pages**: These pages were a much easier decision to skip as they contain no information. They serve to point users to actual pages when their search term was too vague. Additionally, they all lack a first link and would skew statistics such as the size of the GCC. See [Category: Disambiguation Pages](https://en.wikipedia.org/wiki/Category:Disambiguation_pages) for more information and the [Art Disambiguation Page](https://en.wikipedia.org/wiki/Art_(disambiguation)) as an example.

Finally, *wiki_random_page* creates a proper page URL by replacing spaces with underscores and breaks the while loop. The function then returns the *random_page* name and its URL, *page_url*.
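The seed-selection loop can be sketched as follows. To keep the example runnable offline, the random-title source is passed in as a callable (`draw_title`, a hypothetical parameter) instead of calling the Wikipedia package directly. Detecting list and disambiguation pages from their titles is also an assumption; the actual function may use a different test.

```python
from typing import Callable, Iterable, Tuple

BASE_URL = "https://en.wikipedia.org/wiki/"


def random_unseen_page(
    seen_pages: Iterable[str],
    draw_title: Callable[[], str],
) -> Tuple[str, str]:
    """Draw titles until one is new and is neither a list nor a disambiguation page."""
    seen = set(seen_pages)
    while True:
        title = draw_title()
        if title in seen:
            continue  # already visited
        if title.startswith("List of"):
            continue  # assumed marker of a list page
        if title.endswith("(disambiguation)"):
            continue  # assumed marker of a disambiguation page
        page_url = BASE_URL + title.replace(" ", "_")
        return title, page_url


# A fixed sequence stands in for the random-page call in this example.
titles = iter(["Piano", "List of painters", "Art (disambiguation)", "Arugam Bay"])
name, url = random_unseen_page(["Piano"], lambda: next(titles))
print(name, url)
```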

It then gets the first link on that page using *get_first_link(page_url=page_url)*. It then double checks that *get_first_link* returned a string. If it did not, and returned a *NoneType* instead, there are two options: 

1. If that page is a seed page, it picks a new seed page and starts over. 
2. If it is the first link of a different seed page, its url is added to the *fails* list which is manually checked at the end to repair any issues. These are, however, very rare (<1 in 1,000).

Next, it normalizes the formatting for the first link by making it all lowercase. This is stored as a separate variable, as [Wikipedia URLs are case sensitive](https://en.wikipedia.org/wiki/Wikipedia:Manual_of_Style/Linking#:~:text=Wikipedia%20article%20titles%20almost%20always,characters%20after%20the%20initial%20one.). For the network, however, the capital-P Philosophy page should be no different than the lowercase-p philosophy page. Therefore, all nodes are lowercase.

Then, it checks whether the unique fail string "!FAIL!: ", discussed above, is in the string. If it is there, it follows the same procedure as if the function returned a *NoneType*, but instead adds the first link to the fails list.
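The naming conventions described so far (stripping the /wiki/ prefix, lowercasing node names, and detecting the fail marker) can be summarized in a short sketch. The helper names are hypothetical; only the "/wiki/" prefix and the "!FAIL!: " marker come from the text.

```python
WIKI_PREFIX = "/wiki/"
FAIL_MARKER = "!FAIL!: "


def href_to_title(href: str) -> str:
    """Drop the leading /wiki/ so that only the page title remains."""
    if href.startswith(WIKI_PREFIX):
        return href[len(WIKI_PREFIX):]
    return href


def node_name(title: str) -> str:
    """Lowercased name used as the node label in the network."""
    return title.lower()


def is_failed(value: str) -> bool:
    """True if the value carries the fail marker added after an error."""
    return FAIL_MARKER in value


title = href_to_title("/wiki/Philosophy")  # case-sensitive title, kept for building URLs
print(title, node_name(title))
print(is_failed("!FAIL!: https://en.wikipedia.org/wiki/Example"))
```

Keeping the original-case title alongside the lowercase node name means the page can still be requested by its exact URL.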

If there is a dead end node, that is added to a list of disconnects, then a new seed page is found using *wiki_random_page(seen_pages)*.

If it makes it through all of those checks, which most pages do, it is added as a node. Additionally, if it is not the seed page, an edge is added from the previous page to its first link.

Then, once all of the Notable Nodes are in the network (which should happen after just a few iterations), at every 1/100th of the total network size, the Betweenness Centrality, Closeness Centrality, and In-Degree Centrality of each notable node are calculated using [NetworkX Centrality functions](https://networkx.org/documentation/stable/reference/algorithms/centrality.html), along with the average distance from the philosophy page and the size of the GCC, which are calculated manually. These are then organized into a row of a [Pandas](https://pandas.pydata.org/docs/index.html) DataFrame called *convergence_df*. Out-Degree centrality was excluded because the out-degree of every page is 1, making all of their Out-Degree centralities identical. Additionally, eigenvector centrality was ignored, as the idea that high-degree nodes would be connected to one another is doubtful in this network. There is nothing to suggest that having a common first link makes that page itself also common. Given that the underlying assumption behind eigenvector centrality is not met by this network, it was not worth tracking. These are the values we will track to ensure the network is large enough that they are no longer changing. This is by far the most time-consuming part of the function. Once the network reaches a large enough size, these calculations can take several minutes, which is why they are only done 1% of the time.
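A single convergence snapshot could be computed along these lines. This is a sketch on a toy graph: `convergence_row` and its column names are hypothetical, and averaging the distance only over pages that actually reach the target is an assumption. The returned dictionary has the shape of one row that could be appended to a DataFrame such as *convergence_df*.

```python
import networkx as nx


def convergence_row(G: nx.DiGraph, notable_nodes, target: str = "philosophy") -> dict:
    """Collect centrality, distance, and GCC measures for one snapshot."""
    betweenness = nx.betweenness_centrality(G)
    in_degree = nx.in_degree_centrality(G)
    row = {"n_nodes": G.number_of_nodes()}
    for node in notable_nodes:
        row[f"{node}_betweenness"] = betweenness[node]
        # For directed graphs, NetworkX closeness uses incoming distances.
        row[f"{node}_closeness"] = nx.closeness_centrality(G, u=node)
        row[f"{node}_in_degree"] = in_degree[node]

    # Distance from every page to the target: search outward from the
    # target on the reversed graph.
    dist = nx.single_source_shortest_path_length(G.reverse(), target)
    others = [d for n, d in dist.items() if n != target]
    row["avg_distance_to_target"] = sum(others) / len(others) if others else float("nan")

    largest = max(nx.weakly_connected_components(G), key=len)
    row["gcc_share"] = len(largest) / G.number_of_nodes()
    return row


# Toy graph (illustrative only): three pages reach philosophy, two form a loop.
toy = nx.DiGraph([
    ("a", "b"), ("b", "philosophy"), ("c", "philosophy"),
    ("x", "y"), ("y", "x"),
])
row = convergence_row(toy, ["philosophy"])
print(row)
```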

Finally, the first link is checked to determine if it is the philosophy page, in which case we can move on to a new seed page as we know its outcome. Then, it is checked to see if it returned to the seed page, meaning the path looped back to itself; such pages are added to the list of disconnects. Most disconnects, however, occur among more central pages and have to be found later by looking at Smaller Connected Components. The first link is then checked to see if it has already been visited, in which case we know its eventual outcome and can select a new seed page. Finally, if none of these conditions are met, it is added to the seen pages and searched for its own first link. This process continues until one of the previous criteria is met.
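The stopping conditions for a single seed page can be sketched as follows. This is a simplified illustration, not the *network_expander* function itself: `add_seed_chain` is a hypothetical helper, the first-link lookup is a toy dictionary instead of a live page request, and fail handling and convergence tracking are omitted.

```python
import networkx as nx


def add_seed_chain(G, seed, first_link_of, seen_pages, disconnects, target="philosophy"):
    """Follow first links from one seed page, adding nodes and edges to G."""
    seed = seed.lower()
    current = seed
    seen_pages.add(current)
    G.add_node(current)
    while True:
        nxt = first_link_of(current)
        if nxt is None:
            disconnects.append(current)  # dead end: no usable first link
            return
        nxt = nxt.lower()
        G.add_edge(current, nxt)
        if nxt == target:
            return                       # reached philosophy
        if nxt == seed:
            disconnects.append(nxt)      # looped back to the seed page
            return
        if nxt in seen_pages:
            return                       # outcome already known
        seen_pages.add(nxt)
        current = nxt


# Toy first-link lookup (illustrative only).
toy_links = {
    "pianist": "piano", "piano": "music", "music": "philosophy",
    "guitarist": "guitar", "guitar": "music",
    "x": "y", "y": "x",
    "z": None,
}
G = nx.DiGraph()
seen, disconnects = set(), []
for seed in ["Pianist", "Guitarist", "x", "z"]:
    add_seed_chain(G, seed, toy_links.get, seen, disconnects)
print(G.number_of_nodes(), G.number_of_edges(), disconnects)
```

In this toy run, the second seed stops at the music page because it was already visited while following the first seed, which is what keeps the search from re-walking shared paths.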

After the loop is completed, it returns the Network (*G*), its *seen_pages*, *fails*, *disconnects*, and the Convergence DataFrame (*convergence_df*) to be analyzed.

Incidentally, the function can process roughly 1,000 seed pages every 10 minutes; however, this slows down as the network expands due to the convergence calculations. Hence, the network size is limited.

I then manually checked and fixed any failed pages to complete the network, which was saved to a GML file, with the convergence data saved to a CSV file.

## Creating the Second-Link Network

*second_link_network_expander(G, page_url, seen_pages, is_root, fails, disconnects, convergence_df, new_pages=100)*

The second-link network was created in a nearly identical fashion, with a couple of key differences. First, it uses *get_second_link(page_url)* to gather the pages' second links rather than their first. Additionally, since it is more difficult to know the "notable" pages in advance, these are selected by finding the top 3 pages in each centrality measure. Since we want to wait until the network reaches a critical size before doing this, it does not start until the network has at least 3000 seed pages. There is also no reason to check each page's distance from the philosophy page in this network, so that convergence measure has been eliminated. Similarly, it no longer stops looking for new pages at a specific page like it did for the philosophy page; now, it continues until it hits a page that we have already visited.
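Selecting notable pages by rank can be sketched with NetworkX as follows. `top_k_by_centrality` is a hypothetical helper, the toy graph is illustrative, and how ties are broken is left to the sort order.

```python
import networkx as nx


def top_k_by_centrality(G: nx.DiGraph, k: int = 3) -> dict:
    """Return the k highest-scoring nodes for each centrality measure."""
    measures = {
        "betweenness": nx.betweenness_centrality(G),
        "closeness": nx.closeness_centrality(G),
        "in_degree": nx.in_degree_centrality(G),
    }
    return {
        name: sorted(scores, key=scores.get, reverse=True)[:k]
        for name, scores in measures.items()
    }


# Toy graph (illustrative only): four leaves point to a hub, which points to a root.
toy = nx.DiGraph([(f"leaf_{i}", "hub") for i in range(4)] + [("hub", "root")])
notable = top_k_by_centrality(toy, k=3)
notable_nodes = set().union(*notable.values())
print(notable)
```

Because the same page often ranks highly on several measures, the union of the three lists will usually contain fewer than nine pages.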

I chose not to investigate any other link locations for a few reasons. First, as the link position increases, it becomes increasingly unlikely that a page has that many links on it, shrinking the network. Second, no patterns presented themselves within the second-link network that seemed necessary to investigate at other locations. And finally, as previously discussed, the opening line(s) of a Wikipedia article have significantly more guidance on their structure and links [@Wikipedia:ManualofStyle]. Link locations beyond these are unlikely to be more than a random assortment of pages with no notable patterns; however, work would have to be done to prove this claim.

## Plotting Methods

To create the plots needed for my analysis, I used [Matplotlib](https://matplotlib.org/stable/index.html), [Seaborn](https://seaborn.pydata.org/index.html), and [NetworkX's Drawing Tool](https://networkx.org/documentation/stable/reference/generated/networkx.drawing.nx_pylab.draw.html#draw).

#  Results

## First-Link Network

The first-link network finished with 23,169 total nodes.

### First-Link Convergence 

First, it is important to demonstrate that the size of the network is sufficient to make claims as to the nature of this phenomenon. Beginning with the centrality measures of the most important nodes in the network, we can see that they have all leveled off and that any additional nodes would not change their values in any statistically significant way.

!["Node Centrality Convergence"](../images/first-link-centrality-convergence.png)

In the left column, you can see the centrality measure across the entire network construction while the right column features the last iterations of the network construction to give a "zoomed in" view of the measures. All of them are almost entirely flat, with slopes that are less than 0.001. 
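One way to quantify such flatness is to fit a least-squares line to the last few snapshots of a measure and inspect its slope. The sketch below does this in plain Python; the `tail_slope` helper, the window size, and the sample values are all illustrative, and the slopes reported above may have been computed differently.

```python
from typing import Sequence


def tail_slope(values: Sequence[float], tail: int) -> float:
    """Least-squares slope of the last `tail` values against their index."""
    window = list(values)[-tail:]
    n = len(window)
    if n < 2:
        raise ValueError("need at least two values to fit a slope")
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(window) / n
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, window))
    denominator = sum((x - mean_x) ** 2 for x in xs)
    return numerator / denominator


# Made-up centrality snapshots that drop quickly and then level off.
snapshots = [0.5, 0.3, 0.2, 0.15, 0.12, 0.1201, 0.1202, 0.1201, 0.1200, 0.1201]
print(abs(tail_slope(snapshots, 5)) < 0.001)
```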

Next, we can see that the average distance from the philosophy page also flattened out with the network's expansion.

!["Average Distance from the Philosophy Page with Network Expansion"](../images/first-link-dist-from-phil.png)

While it is not quite as flat as the centrality measures, we can still see strong evidence that it has settled to a value a little above 11, with a final value of 11.058.

!["Size of the GCC with Network Expansion"](../images/first-link-gcc-convergence.png)

Finally, we see a negative slope in the size of the GCC of the network as new nodes are added, but it flattens out with 85.82% of pages ending up at the philosophy page.

### Notable Nodes and Paths to Philosophy

As was discussed earlier, there were several nodes that became apparent as the most important in the network. These nodes 'funneled' pages into the philosophy page and thus boasted the largest centrality measures in the network. This can perhaps best be seen by visualizing the nodes closest to the philosophy node using the force-directed [Kamada-Kawai Layout](https://networkx.org/documentation/stable/reference/generated/networkx.drawing.layout.kamada_kawai_layout.html).

!["Betweeness Centrality Kamada-Kawai Layout"](../images/first-link-betweeness-kk.png) !["Closeness Centrality Kamada-Kawai Layout"](../images/first-link-closeness-kk.png)

The largest nodes in these plots can be better seen using a spectral layout.

!["Betweeness Centrality Spectral Layout"](../images/first-link-betweeness-spectral.png) !["Closeness Centrality Spectral Layout"](../images/first-link-closeness-spectral.png)

These plots help visualize how nodes "flow" towards the philosophy page. You can see that it has several low degree neighbors in addition to some of these huge hubs. Within my search, I found 32 neighbors of the philosophy page but there are surely countless others that were not found within my search. For example, [Immanuel Kant's page's](https://en.wikipedia.org/wiki/Immanuel_Kant) first link is philosophy but it is unlikely to be the first link on more than a handful of other pages, making its chances of being found in a network of this size incredibly small. The same likely goes for numerous philosophers and adjacent topics. The inward-neighbors of Philosophy are listed below by degree: 

|    | Node                      |   Degree |
|---:|:--------------------------|---------:|
|  1 | political_philosophy      |       14 |
|  2 | modernism                 |        6 |
|  3 | aesthetics                |        5 |
|  4 | awareness                 |        4 |
|  5 | medical_specialty         |        4 |
|  6 | ethics                    |        4 |
|  7 | philosophy_of_culture     |        3 |
|  8 | outline_of_philosophy     |        3 |
|  9 | specialty_(medicine)      |        3 |
| 10 | philosophy_of_logic       |        2 |
| 11 | platonism                 |        2 |
| 12 | post-structuralist        |        2 |
| 13 | philosophical_school      |        2 |
| 14 | philosophy_of_mind        |        2 |
| 15 | natural_philosophy        |        2 |
| 16 | object_(philosophy)       |        2 |
| 17 | philosophies              |        2 |
| 18 | naturalism_(philosophy)   |        2 |
| 19 | art_theory                |        2 |
| 20 | metaphysics               |        2 |
| 21 | philosophical_tradition   |        2 |
| 22 | noticing                  |        2 |
| 23 | richard_velkley           |        1 |
| 24 | western_philosophy        |        1 |
| 25 | george_santayana          |        1 |
| 26 | alan_thomas_(philosopher) |        1 |
| 27 | philosophy_of_sex         |        1 |
| 28 | philosophy_of_science     |        1 |
| 29 | british_philosophy        |        1 |
| 30 | immanuel_kant             |        1 |
| 31 | platonic_philosophy       |        1 |
| 32 | moral_philosophy          |        1 |


However, this order does not represent the typical path to philosophy. Most pages end up on one of a few specific paths. Pages on most disciplines in the arts, sciences, or technology usually reach the science page, which leads to the knowledge page, then awareness, and finally philosophy. Although its degree is modest, awareness is by far the most heavily traveled neighbor of Philosophy. Its closeness centrality trails only philosophy's, making it the second largest in the network. Those two are followed by existence and entity, the first two nodes reached by repeatedly clicking the first link starting from the philosophy page. Next come knowledge and science, which both precede awareness on the main path, before a significant drop-off in the closeness centrality of the remaining nodes. Below are the full values of the top ten nodes by closeness centrality:

|    | Node                |   Closeness Centrality |
|---:|:--------------------|-----------------------:|
|  1 | philosophy          |              0.0773157 |
|  2 | awareness           |              0.0752007 |
|  3 | existence           |              0.0712966 |
|  4 | entity              |              0.0658643 |
|  5 | knowledge           |              0.065415  |
|  6 | science             |              0.0457291 |
|  7 | geography           |              0.0261291 |
|  8 | continent           |              0.0233602 |
|  9 | mind                |              0.0219937 |
| 10 | branches_of_science |              0.0217194 |
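
As a rough sketch of how a closeness ranking like this can be computed, the snippet below measures closeness from incoming click distances and scales it by the share of the network that can reach each page (the Wasserman-Faust scaling). The toy `first_link` map and the function name `inward_closeness` are illustrative assumptions built from page names quoted in this paper; this is not the code or the data behind the table.

```python
from collections import deque

# Toy first-link map (page -> the page its first link points to). The page
# names come from paths quoted in this paper, but the graph is illustrative
# and is not the crawled dataset.
first_link = {
    "biology": "science",
    "science": "knowledge",
    "education": "knowledge",
    "knowledge": "awareness",
    "awareness": "philosophy",
    "philosophy": "existence",
    "existence": "entity",
    "entity": "abstraction",
    "abstraction": "rules",
    "rules": "philosophy_of_logic",
    "philosophy_of_logic": "philosophy",
    "money": "payment",
    "payment": "money",
}


def inward_closeness(first_link, target):
    """Closeness of `target` from incoming distances, scaled by reachability."""
    nodes = set(first_link) | set(first_link.values())
    # Reverse adjacency: for each page, the pages whose first link points to it.
    incoming = {}
    for page, link in first_link.items():
        incoming.setdefault(link, []).append(page)
    # Breadth-first search backwards from the target yields every page's
    # click distance to it.
    dist = {target: 0}
    queue = deque([target])
    while queue:
        node = queue.popleft()
        for prev in incoming.get(node, []):
            if prev not in dist:
                dist[prev] = dist[node] + 1
                queue.append(prev)
    reachable = len(dist) - 1  # pages other than the target that can reach it
    total = sum(dist.values())
    if total == 0:
        return 0.0
    # reachable / total is the inverse mean distance; the second factor
    # discounts pages that only a small share of the network can reach.
    return (reachable / total) * (reachable / (len(nodes) - 1))


nodes = set(first_link) | set(first_link.values())
scores = {page: inward_closeness(first_link, page) for page in nodes}
ranking = sorted(scores, key=scores.get, reverse=True)
```

In this toy graph philosophy ranks first because both the main funnel and its own cycle feed into it, while the isolated money and payment loop scores near the bottom.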

To better understand how nodes reach the philosophy page, here are the top ten nodes by appearances in paths to philosophy as well as the percentage of paths they appear in:

|    | Node                |   Path Appearances | Percent of All Paths   |
|---:|:--------------------|-------------------:|:-----------------------|
|  1 | philosophy          |              19811 | 85.51%                 |
|  2 | awareness           |              17806 | 76.85%                 |
|  3 | knowledge           |              13151 | 56.76%                 |
|  4 | science             |               7227 | 31.19%                 |
|  5 | consciousness       |               4652 | 20.08%                 |
|  6 | thought             |               4509 | 19.46%                 |
|  7 | mind                |               4498 | 19.41%                 |
|  8 | sciences            |               4468 | 19.28%                 |
|  9 | branches_of_science |               4467 | 19.28%                 |
| 10 | psychology          |               3673 | 15.85%                 |
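
The counting behind a table like this can be sketched as follows. I assume here that every page contributes exactly one path (its chain of first links) and that percentages are taken over all pages in the network, including those that never reach philosophy. The toy map and the helper name `path_to` are hypothetical.

```python
from collections import Counter

# Illustrative first-link map; not the crawled dataset.
first_link = {
    "biology": "science",
    "science": "knowledge",
    "education": "knowledge",
    "knowledge": "awareness",
    "awareness": "philosophy",
    "money": "payment",
    "payment": "money",
}


def path_to(first_link, start, target):
    """Follow first links from `start`; return the path if it hits `target`."""
    path, seen = [start], {start}
    while path[-1] != target:
        nxt = first_link.get(path[-1])
        if nxt is None or nxt in seen:
            # Dead end, or a loop that never passes through the target.
            return None
        path.append(nxt)
        seen.add(nxt)
    return path


nodes = set(first_link) | set(first_link.values())
appearances = Counter()
for page in nodes:
    path = path_to(first_link, page, "philosophy")
    if path is not None:
        appearances.update(path)  # each page on the path appears once

# Share of all pages whose path passes through each node.
share = {page: count / len(nodes) for page, count in appearances.items()}
```

Under these assumptions the target's own share equals the fraction of pages that reach it, which is why philosophy's row matches the size of the giant component.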

The Awareness node is so central, in fact, that when the Philosophy node is removed from the network, severing Awareness from all of the other paths to philosophy, the new GCC is still 76.84% of the network, centered on the Awareness page. The next largest component of this network is headed by [the Philosophy of Logic page](https://en.wikipedia.org/wiki/Philosophy_of_logic). This is likely because many technical terms, particularly those of foreign origin, link first to the page for their language of origin. These pages are then directed to [the language page](https://en.wikipedia.org/wiki/Language), which eventually reaches [the Philosophy of Logic page](https://en.wikipedia.org/wiki/Philosophy_of_logic), which in turn leads to the philosophy page. Unfortunately, these nodes are too far from the Philosophy page to be visualized. Furthermore, the [communication](https://en.wikipedia.org/wiki/Communication), [information](https://en.wikipedia.org/wiki/Information), [ethnic group](https://en.wikipedia.org/wiki/Ethnic_group), and [genre](https://en.wikipedia.org/wiki/Genre) pages all also lead to [the Philosophy of Logic page](https://en.wikipedia.org/wiki/Philosophy_of_logic). However, this component is still just 6.74% of the network.

If the Awareness node is then removed in addition to the Philosophy node, the size of the remaining GCC falls dramatically to 56.77%, led by the Knowledge node, demonstrating that it is really the Awareness node that holds the network together. Removing Knowledge then drops the GCC to 31.2%, led by Science, whose removal drops it to 20.08%.
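
The removal experiment can be sketched in a few lines. The snippet below treats links as undirected (weak connectivity) and reports the largest surviving component as a share of the original node count. Both choices are assumptions about how the percentages above were taken, and the toy map is illustrative only.

```python
# Illustrative first-link map; not the crawled dataset.
first_link = {
    "biology": "science",
    "science": "knowledge",
    "education": "knowledge",
    "knowledge": "awareness",
    "awareness": "philosophy",
    "abstraction": "rules",
    "rules": "philosophy_of_logic",
    "philosophy_of_logic": "philosophy",
    "money": "payment",
    "payment": "money",
}


def largest_component_share(first_link, removed=()):
    """Largest weakly connected component, as a share of the original network."""
    removed = set(removed)
    all_nodes = set(first_link) | set(first_link.values())
    # Undirected adjacency over the surviving nodes only.
    adj = {node: set() for node in all_nodes - removed}
    for page, link in first_link.items():
        if page in adj and link in adj:
            adj[page].add(link)
            adj[link].add(page)
    best, seen = 0, set()
    for start in adj:
        if start in seen:
            continue
        # Depth-first flood fill to measure this component.
        stack, size = [start], 0
        seen.add(start)
        while stack:
            node = stack.pop()
            size += 1
            for neighbor in adj[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    stack.append(neighbor)
        best = max(best, size)
    return best / len(all_nodes)


baseline = largest_component_share(first_link)
without_philosophy = largest_component_share(first_link, ["philosophy"])
without_awareness = largest_component_share(first_link, ["philosophy", "awareness"])
```

Removing nodes one at a time in this way shows which hub each successive component is organized around.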

There are also several pages with high in-degree centralities that are not seen in these plots. Below are the largest nodes by Degree and In-Degree centrality and their paths to philosophy:

|    | Node                    |   Degree |   In-Degree Centrality | Path to Philosophy                                                                                                                                                                                                                      |
|---:|:----------------------------------|--------------:|-----------------------:|:----------------------------------------------------------------------------------|
|  1 | county_(united_states)  |      203 |             0.00871892 | ['county_(united_states)', 'united_states', 'north_america', 'continent', 'geography', 'science', 'knowledge', 'awareness', 'philosophy']                                                                                               |
|  2 | public_university       |      153 |             0.00656077 | ['public_university', 'university', 'educational_institution', 'education', 'knowledge', 'awareness', 'philosophy']                                                                                                                     |
|  3 | association_football    |      148 |             0.00634496 | ['association_football', 'team_sport', 'sport', 'physical_activity', 'exercise', 'human_body', 'human', 'species', 'biology', 'science', 'knowledge', 'awareness', 'philosophy']                                                        |
|  4 | family_(biology)        |       98 |             0.00418681 | ['family_(biology)', 'taxonomic_rank', 'biology', 'science', 'knowledge', 'awareness', 'philosophy']                                                                                                                                    |
|  5 | u.s._state              |       97 |             0.00414365 | ['u.s._state', 'united_states', 'north_america', 'continent', 'geography', 'science', 'knowledge', 'awareness', 'philosophy']                                                                                                           |
|  6 | capital_city            |       79 |             0.00336671 | ['capital_city', 'municipality', 'administrative_division', 'sovereign_state', 'state_(polity)', 'politics', 'decision-making', 'psychology', 'mind', 'thought', 'consciousness', 'awareness', 'philosophy']                            |
|  7 | tennis                  |       72 |             0.00306457 | ['tennis', 'list_of_racket_sports', 'game', 'play_(activity)', 'recreational', 'leisure', 'time', 'sequence', 'mathematics', 'knowledge', 'awareness', 'philosophy']                                                                    |
|  8 | rock_music              |       70 |             0.00297825 | ['rock_music', 'genre_(music)', 'music', 'the_arts', 'creativity', 'psychology', 'mind', 'thought', 'consciousness', 'awareness', 'philosophy']                                                                                         |
|  9 | county_seat             |       70 |             0.00297825 | ['county_seat', 'seat_of_government', 'government', 'state_(polity)', 'politics', 'decision-making', 'psychology', 'mind', 'thought', 'consciousness', 'awareness', 'philosophy']                                                       |
| 10 | rural_districts_of_iran |       70 |             0.00297825 | ['rural_districts_of_iran', 'administrative_divisions_of_iran', 'country_subdivision', 'sovereign_state', 'state_(polity)', 'politics', 'decision-making', 'psychology', 'mind', 'thought', 'consciousness', 'awareness', 'philosophy'] |

Most of these pages relate to geography. The biggest surprise here is [the Association Football](https://en.wikipedia.org/wiki/Association_football) page, which appears due to a surprisingly large number of football (soccer) pages. Whether this is due to random chance in the search or because football pages make up a large portion of Wikipedia's network is impossible to say given our sample size. However, *association_football* showed up consistently as one of the largest nodes by in-degree during the data collection process.
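
For reference, in-degree centrality in a first-link network is a simple count. The sketch below assumes centrality is in-degree divided by n - 1 and that the Degree column also counts a page's own outgoing first link. The table values appear consistent with this (for instance, 202 / 0.00871892 and 152 / 0.00656077 both give roughly 23,168), but it remains an inference. The leaf page names are hypothetical.

```python
from collections import Counter

# Illustrative first-link map; "county_a", "state_x" and similar leaves are
# hypothetical placeholder pages.
first_link = {
    "county_a": "county_(united_states)",
    "county_b": "county_(united_states)",
    "county_c": "county_(united_states)",
    "county_(united_states)": "united_states",
    "state_x": "u.s._state",
    "u.s._state": "united_states",
    "united_states": "north_america",
}

nodes = set(first_link) | set(first_link.values())

# In-degree: how many pages have this page as their first link.
in_degree = Counter(first_link.values())

# Total degree adds the page's own outgoing first link, if it has one.
degree = {page: in_degree[page] + (1 if page in first_link else 0) for page in nodes}

# Normalize by the number of other pages that could point here.
in_degree_centrality = {page: in_degree[page] / (len(nodes) - 1) for page in nodes}

top = max(nodes, key=in_degree_centrality.get)
```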

Finally, the page farthest from philosophy that still reached it was [Ski Area](https://en.wikipedia.org/wiki/ski_area); its path is shown below. It took 32 pages to reach the philosophy page. The path appears to take a long detour through pages describing what a vacation is, which extends its journey. It is the only page in the network at this distance.

!["Longest Path to Philosophy"](../images/longest-path-crop.png)

### First-Link Network Structure

The network also exhibits a typical long-tail degree distribution. Most nodes have an in-degree of 1, with only a tiny fraction reaching double digits.

!["First-Link Network Structure"](../images/first-link-degree-distribution.png)

### Degree vs. Distance from the Philosophy Page

I also plotted degree against distance from the philosophy page to see whether the two were correlated:

!["Degree vs. Distance from the Philosophy Page"](../images/distance-from-phil-degree.png)

However, as seen in the figure above, there is no obvious correlation, likely because of the network's long-tail degree distribution. With an r-value of -0.08, the linear relationship is extremely weak.
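
To make the r-value concrete, here is a small sketch of the Pearson correlation coefficient computed by hand. The two degree lists are made-up examples rather than measurements: one shows the pattern I had hoped for, and the other shows how a single hub among low-degree pages can leave r at zero.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical distances from the philosophy page.
distance = [1, 2, 3, 4, 5]

# Hypothesized pattern: degree falls steadily with distance.
broad_near = [10, 8, 6, 4, 2]

# Long-tail pattern: almost every page has degree 1, with one hub mid-way.
long_tail = [1, 1, 9, 1, 1]

r_hypothesis = pearson_r(distance, broad_near)
r_long_tail = pearson_r(distance, long_tail)
```

The first series gives a perfect negative correlation, while the second gives essentially none, which is closer in spirit to what the scatter plot shows.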

## Second-Link Network

The second-link network finished with 14,631 total nodes.

### Second-Link Convergence
Again, you can see that the centrality measures of the network's largest nodes all converge, showing little change during the final expansions of the network.

!["Second-Link Node Centrality Convergence"](../images/second-link-centrality-convergence.png)

Similarly, we can see that the GCC of the network converges to just over 16% of the network, with an exact value of 16.23%:

!["Second-Link GCC Convergence"](../images/second-link-gcc-convergence.png)

### Second-Link Largest Connected Components

As shown above, the largest connected component of the second-link network is quite small, at just around 16% of the network. This component's largest nodes are primarily geographic, led by India and Japan. It is visualized in the plots below.

!["Second-Link Network GCC"](../images/second-link-gcc.png)

In this second plot, you can see that, unlike in the first-link network, the network does not flow to one node but rather is much more scattered.

!["Second-Link Network GCC Kamada-Kawai"](../images/second-link-gcc-kk.png)

The next largest component is significantly smaller, at just 7.74% of the network. It is similarly led by geographic pages; its top 5 pages were France, Europe, Germany, Spain, and Earth. It is visualized below:

!["Second-Link Network Second Largest Connected Component"](../images/second-link-second-gcc.png)

### Second-Link Notable Nodes

As may be becoming clear, geographic nodes are the most prevalent in the second-link network. By far the largest node by degree was [the US State page](https://en.wikipedia.org/wiki/U.S._state). Its peers were almost all geographic pages as well. Here is the full list by degree:

|    | Node                 |   Degree |   In-Degree Centrality |
|---:|:---------------------|---------:|-----------------------:|
|  1 | u.s._state           |      198 |             0.0134655  |
|  2 | poland               |       77 |             0.00519481 |
|  3 | united_states        |       50 |             0.00334928 |
|  4 | india                |       33 |             0.00218729 |
|  5 | association_football |       32 |             0.00211893 |
|  6 | united_kingdom       |       30 |             0.00198223 |
|  7 | australia            |       28 |             0.00184552 |
|  8 | research_university  |       28 |             0.00184552 |
|  9 | iran                 |       27 |             0.00177717 |
| 10 | france               |       27 |             0.00177717 |

The top nodes by closeness centrality were also led by [the US State page](https://en.wikipedia.org/wiki/U.S._state), but it was followed by related terms concerning federalism and governance. Here are the top ten nodes by closeness centrality:

|    | Node                       |   Closeness Centrality |
|---:|:---------------------------|-----------------------:|
|  1 | u.s._state                 |             0.0172532  |
|  2 | federated_state            |             0.013413   |
|  3 | federation                 |             0.0110436  |
|  4 | political_union            |             0.0105248  |
|  5 | society                    |             0.00991123 |
|  6 | individual                 |             0.00956221 |
|  7 | organism                   |             0.00955794 |
|  8 | sun                        |             0.00938739 |
|  9 | administrative_subdivision |             0.00932515 |
| 10 | person                     |             0.0089184  |

Conversely, the most central nodes by betweenness centrality were more philosophical terms. Here are the top ten:

|    | Node                |  Betweenness Centrality |
|---:|:--------------------|------------------------:|
|  1 | society             |             0.000114381 |
|  2 | individual          |             0.000110876 |
|  3 | person              |             0.000104335 |
|  4 | organism            |             0.000101765 |
|  5 | living_system       |             9.78309e-05 |
|  6 | morality            |             9.69432e-05 |
|  7 | self-organization   |             9.2766e-05  |
|  8 | social_actions      |             8.95421e-05 |
|  9 | social_science      |             8.85655e-05 |
| 10 | action_(philosophy) |             8.244e-05   |
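
When every page contributes a single outgoing link, the route between any two pages is unique, so betweenness reduces to counting how often a page sits strictly between a start page and a later page on its chain. The sketch below relies on that simplification and on the usual directed normalization by (n - 1)(n - 2). Both are assumptions about how the values above were produced, and the page names are hypothetical.

```python
def single_link_betweenness(link):
    """Normalized betweenness when each page has at most one outgoing link."""
    nodes = set(link) | set(link.values())
    counts = dict.fromkeys(nodes, 0)
    for source in nodes:
        # Walk the unique chain of links from `source` until it ends or repeats.
        path, seen = [source], {source}
        nxt = link.get(source)
        while nxt is not None and nxt not in seen:
            path.append(nxt)
            seen.add(nxt)
            nxt = link.get(nxt)
        # path[i] lies strictly between `source` and every later page on the walk.
        last = len(path) - 1
        for i in range(1, last):
            counts[path[i]] += last - i
    n = len(nodes)
    scale = (n - 1) * (n - 2)
    return {page: (count / scale if scale else 0.0) for page, count in counts.items()}


# Hypothetical second-link map: one short chain plus a separate two-page loop.
second_link = {
    "town": "county",
    "county": "state",
    "state": "country",
    "x": "y",
    "y": "x",
}

betweenness = single_link_betweenness(second_link)
```

Because the normalizer grows with the square of the network size while chains in a scattered network stay short, very small values like those in the table are what this counting would be expected to produce.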

### Second-Link Network Structure

Finally, the second-link network also displayed a long-tail distribution, with the majority of nodes having a degree below 10, as seen below.

!["Second Link Degree Distribution"](../images/second-link-degree-distribution.png){width=75%}




#  Conclusions

## First-Link Network Conclusions

### Getting to Philosophy

First and foremost, it is clear that the Getting to Philosophy phenomenon still holds true: the vast majority of pages lead to the philosophy page when you click the first link on the page and on every page that follows. While there was little doubt about this, it is good to see it confirmed.

Among the pages that reach it, I found that it takes, on average, 11 pages to reach the philosophy page. However, it may not feel that way when you try it for yourself. Because I sampled totally random Wikipedia pages, many of them were very specific. Unless you are quite imaginative, the pages you think to test will likely be much closer to the philosophy page than the average node is.

Additionally, my analysis suggests that the share of the network reaching philosophy has shrunk relative to previous research's measurements. It will take further research to understand why this downward trend exists, but the share settles around 85.5% of the total network. This is a sharp drop from the previously found value of 97% [@LamprechtNavigability]. There are a couple of reasons for this. First, I would expect that they included links in lists in their analysis. More importantly, the 97% figure cited by Wikipedia [@Wikipedia:GettingtoPhilosophy] is not actually the share of pages that reach the Philosophy page in the first-link network. Rather, it is "the percentage of articles which eventually lead to a cycle when repeatedly following first links." [@LamprechtNavigability] Several of these cycles occur without ever connecting to the Philosophy page. For example, the first link on [the money page](https://en.wikipedia.org/wiki/Money) is payment, and the first link on [the payment page](https://en.wikipedia.org/wiki/Payment) is money. This loop blocks several nodes from ever reaching the philosophy page. Similar loops occur on the [name](https://en.wikipedia.org/wiki/Name), [accounting](https://en.wikipedia.org/wiki/Accounting), and [candidate](https://en.wikipedia.org/wiki/Candidate) pages, to name a few.

This difference in methodology likely accounts for most of the gap of roughly ten percentage points. Lamprecht et al. do report a figure for the share of pages that ultimately lead to the philosophy page: 92.1%. This much smaller difference is likely the result of their inclusion of list links, the size of our networks, their use of the Wikipedia API, and changes in Wikipedia's network structure over the past seven years. With so many interacting variables, it is difficult to make a strong claim about the true difference in the size of this component.
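
The distinction between "ends in a cycle" and "reaches philosophy" can be illustrated with a short sketch that follows first links until a page repeats and returns the loop the walk falls into. The function name `terminal_cycle` and the page `some_finance_page` are hypothetical, and the map is a toy built from links described in this paper.

```python
# Illustrative first-link map; "some_finance_page" is a hypothetical page.
first_link = {
    "knowledge": "awareness",
    "awareness": "philosophy",
    "philosophy": "existence",
    "existence": "entity",
    "entity": "abstraction",
    "abstraction": "rules",
    "rules": "philosophy_of_logic",
    "philosophy_of_logic": "philosophy",
    "some_finance_page": "money",
    "money": "payment",
    "payment": "money",
}


def terminal_cycle(first_link, start):
    """Return the loop a first-link walk from `start` ends in, or [] at a dead end."""
    position = {}  # page -> index at which the walk first visited it
    path = []
    page = start
    while page is not None and page not in position:
        position[page] = len(path)
        path.append(page)
        page = first_link.get(page)
    if page is None:
        return []
    # Everything from the first visit of the repeated page onward is the loop.
    return path[position[page]:]


nodes = set(first_link) | set(first_link.values())
cycles = {page: terminal_cycle(first_link, page) for page in nodes}

ends_in_cycle = sum(1 for loop in cycles.values() if loop) / len(nodes)
reaches_philosophy = sum(1 for loop in cycles.values() if "philosophy" in loop) / len(nodes)
```

In this toy network every page ends in some cycle, yet only the pages whose cycle contains philosophy count toward the narrower measure, which mirrors the difference between the two published figures.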

### Why is the Philosophy Node Significant and What is the Nature of the Most Common First Links?

One of the primary questions going into this study was why this phenomenon occurs. After all, there seems to be no obvious answer at first thought. Philosophy professors would tell you it is because philosophy is "the first science" and, in that regard, the study of nearly everything. Others might speculate that it is because philosophy is, by nature, a meta-discipline that contemplates the nature of other sciences. For any of these reasons and more, it appears as the first link on a few key pages, namely the [Awareness](https://en.wikipedia.org/wiki/Awareness) and [Philosophy of Logic](https://en.wikipedia.org/wiki/Philosophy_of_logic) pages. By being the first link on just these two pages, philosophy is already connected to 83.58% of the network, meaning the remainder of philosophy's connections make up only around 2% of the network.

Then, in order not to lead to an even more central node, philosophy has to link back to itself. The first-link network is so connected in the first place because of Wikipedia's instructions to make the first link on a page broad [@Wikipedia:ManualofStyle]. This ensures that few pages are capable of these loops. I found that the most common first links often dealt with geography. This suggests that Wikipedia's instructions, as laid out in the introduction, were executed quite well: pages are explained in their broadest terms first, particularly by locating the topic in the world. This results in county, state, and other geographic terms taking the lead. Why *association_football* is so common remains a mystery; the prevalence of football (soccer) pages on Wikipedia would be an interesting subject to explore in the future.

Philosophy follows a path back to itself that is actually quite a bit longer than many of the cycles found in the network, many of which bounce right back to themselves from their first link. Philosophy, on the other hand, takes five additional pages (existence, entity, abstraction, rules, philosophy of logic) to get back to itself. Thus, technically, all of these pages are connected to the same vast portion of the network. These pages are merely riding on philosophy's coattails through this technicality; philosophy remains a dramatically more central node. Therefore, whatever your explanation for why the neighbors of philosophy are connected to so many pages, philosophy's connection to just a few important nodes and its loop back to itself are the root of this phenomenon.

### Distance from Philosophy and Degree

Here, I had hypothesized that being closer to the philosophy page would mean a topic was broader, resulting in a higher degree. For this to hold, we would need an r-value near negative one, indicating a strong negative correlation. This would have been quite an interesting discovery, but, unfortunately, I found an r-value of -0.08, nowhere close to what the hypothesis would require. Whether such a correlation could arise with a greater network size is unclear, but I remain curious. I expect that the more specific topics near the philosophy page, such as individual philosophers or narrow philosophical topics, drag the correlation down. Additionally, my analysis was quite simple; with more time, I would like to apply a more complex statistical analysis to see if there is a correlation that is not visible to the eye.

## Second-Link Network Conclusions

There is not as much to say here. First, no page exhibited anything similar to what occurs with the philosophy page in the first-link network. I am further convinced that this is a phenomenon unique to the first-link location on Wikipedia pages. If the second link, which is the least random link besides the first, cannot form a connected component larger than even 20% of the network, one must imagine that any other link location would form increasingly random and disconnected networks.

The second-link network forms tidy loops for reasons similar to those discussed for the first-link network: its links are also prescribed to be broad. Geography takes an even greater role in this network, likely because, for pages such as buildings, the larger geographic context is still being established at the second link (for example, first link city, second link state). Meanwhile, many objects whose nature is described by the first link go straight to their geographical location with the second. These locations then get back to themselves by looping through their broader locational significance, as a state in a federalist system, for example.

While interesting in some regards, this network serves more as a reason to dismiss further studies of other link locations as they are increasingly unlikely to demonstrate a pattern of similar note to the first link network. 

## References

::: {#refs}
:::

