# Task 1 - Getting to Philosophy
**Write a Python script to check the "Getting to Philosophy" law.
https://en.wikipedia.org/wiki/Wikipedia:Getting_to_Philosophy
Clicking on the first link in the main body of a Wikipedia article and repeating the process
for subsequent articles usually leads to the article Philosophy.
The program should receive a Wikipedia link as input, follow the first normal link and
repeat this process until either the Philosophy page is reached, an article without
any outgoing wikilinks is reached, or the walk is stuck in a loop.
A "normal link" is a link in the main body of the article that is not in a box, is blue (red is for
non-existing articles), is not in parentheses, is not italic and is not a footnote. You don't have to
check style tables or other fancy things; it is enough that the script works with the current
Wikipedia style (for example, you can use the 'class' attribute in Wikipedia tags). For easy
validation, please print all visited links to the standard output.
Use a 0.5 second delay between queries to avoid heavy load on Wikipedia (the sleep function
from the time module).
You can use https://en.wikipedia.org/wiki/Special:Random to check this hypothesis at
home.**
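
The "normal link" rules can be partly approximated by looking only at a link's `href` and `class` values. The following is a rough sketch, not a complete implementation: `is_normal_href` is a hypothetical helper, and its rules (red links carry the class `new`, footnotes are `#` anchors, non-article pages contain a namespace colon) are assumptions about the current Wikipedia markup. It does not cover the parentheses, italic or box rules.

```python
from urllib.parse import urlparse

def is_normal_href(href, css_classes=()):
    """Rough check that an href points at an existing, regular article.

    Hypothetical helper: the rules are assumptions about the current
    Wikipedia markup, not an official specification.
    """
    if not href or href.startswith("#"):
        return False  # empty, or a footnote / same-page anchor
    if "new" in css_classes:
        return False  # assumed marker of a red (non-existing) article link
    parsed = urlparse(href)
    if parsed.netloc and parsed.netloc != "en.wikipedia.org":
        return False  # external site
    if not parsed.path.startswith("/wiki/"):
        return False  # e.g. /w/index.php?... edit links
    title = parsed.path[len("/wiki/"):]
    if ":" in title:
        # Approximation: drops File:, Help:, Special: pages, but also
        # the few real articles whose title contains a colon
        return False
    return True

for candidate in ["/wiki/Greek_language", "#cite_note-1", "/wiki/Help:IPA/English"]:
    print(candidate, is_normal_href(candidate))
```

With BeautifulSoup, the class list of an anchor is available as `anchor.get("class", [])` and could be passed as `css_classes`.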




---


**First, install the dependencies. BeautifulSoup and requests are third-party packages; urllib, time and sys ship with Python:**

*   Beautiful Soup is a Python package for parsing HTML and XML documents, including ones with malformed markup such as non-closed tags (it is named after "tag soup"). It creates a parse tree for parsed pages that can be used to extract data from HTML.

*   urllib is the URL handling package in Python's standard library.
   * urllib.parse parses and joins URLs; it is used here to turn relative links into full URLs.

*   time handles time-related tasks; its sleep function is used to pause between requests.
*   requests allows you to send HTTP requests from Python.
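
As a small illustration of the `urllib.parse` step, the sketch below resolves a relative `href` against the Wikipedia base URL and extracts the article title from a URL. The helper names `to_absolute` and `article_title` are hypothetical and not part of the script; only `urljoin`, `urlparse` and `unquote` come from the standard library.

```python
from urllib.parse import urljoin, urlparse, unquote

BASE = "https://en.wikipedia.org/"

def to_absolute(href):
    """Resolve an href as found in a page against the Wikipedia base URL."""
    return urljoin(BASE, href)

def article_title(url):
    """Return the title part of a /wiki/ URL, or None for other paths."""
    path = urlparse(url).path
    if not path.startswith("/wiki/"):
        return None
    return unquote(path[len("/wiki/"):])

print(to_absolute("/wiki/Greek_language"))
print(article_title("https://en.wikipedia.org/wiki/Exonym_and_endonym"))
```

An `href` that is already absolute is returned unchanged by `urljoin`, so the same call works for both relative and full links.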




In [31]:
from bs4 import BeautifulSoup
import urllib.parse
import time
import sys
import requests


In [32]:
start_url = "https://en.wikipedia.org/wiki/Special:Random"
target_url = "https://en.wikipedia.org/wiki/Philosophy"

# URLs of the visited articles, in visiting order
visited_urls = [start_url]

In [33]:
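def find_first_link(url):
    response = requests.get(url, timeout=10)
    soup = BeautifulSoup(response.text, "html.parser")

    # The article body is the .mw-parser-output element inside #mw-content-text
    content = soup.find(id="mw-content-text")
    body = content.find(class_="mw-parser-output") if content else None
    if body is None:
        return None

    link = None
    # Only top-level paragraphs, so boxes and tables are skipped
    for paragraph in body.find_all("p", recursive=False):
        # Only anchors that are direct children of the paragraph,
        # which skips links nested in italic or footnote markup
        anchor = paragraph.find("a", recursive=False)
        if anchor and anchor.get("href"):
            link = anchor.get("href")
            break

    if not link:
        return None

    # Build a full URL from the (usually relative) href
    first_link = urllib.parse.urljoin("https://en.wikipedia.org/", link)

    return first_link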
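
The task also asks to skip links in parentheses, which `find_first_link` does not check. One possible approach, sketched below under assumptions, is to drop parenthesized text from a paragraph's HTML before searching it for anchors. `strip_parentheses` is a hypothetical helper; it assumes parentheses are balanced within one paragraph and that no tag opens inside parentheses and closes outside them. Parentheses inside tags, such as in an `href`, are left untouched.

```python
def strip_parentheses(html):
    """Remove parenthesized text that lies outside HTML tags."""
    kept = []
    depth = 0       # nesting level of parentheses in visible text
    in_tag = False  # True while between '<' and '>'
    for ch in html:
        if ch == "<":
            in_tag = True
        elif ch == ">" and in_tag:
            in_tag = False
            if depth == 0:
                kept.append(ch)
            continue
        if not in_tag:
            if ch == "(":
                depth += 1
                continue
            if ch == ")" and depth > 0:
                depth -= 1
                continue
        # Keep the character only when it is not inside parentheses
        if depth == 0:
            kept.append(ch)
    return "".join(kept)

# Made-up paragraph: the first link sits in parentheses, the second does not
sample = ('<p>Mercury (<a href="/wiki/A_(b)">x</a>) is a '
          '<a href="/wiki/Planet_(c)">planet</a>.</p>')
print(strip_parentheses(sample))
```

The cleaned string could then be parsed again, for example with `BeautifulSoup(strip_parentheses(str(paragraph)), "html.parser")`, and searched for the first anchor.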

In [36]:
def continue_scraping(visited_urls, target_url):
    max_steps = 100
    # Stop when the Philosophy article is reached
    if visited_urls[-1] == target_url:
        print("Target ('Philosophy') article reached!")
        return False
    # Stop after the maximum number of steps
    elif len(visited_urls) > max_steps:
        print("Maximum (100) searches reached, interrupted.")
        return False
    # Stop when the latest URL was already visited (loop)
    elif visited_urls[-1] in visited_urls[:-1]:
        print("We are in a loop, interrupted.")
        return False
    else:
        return True

In [37]:
while continue_scraping(visited_urls, target_url):
    # Print the article currently being visited
    print(visited_urls[-1])

    first_link = find_first_link(visited_urls[-1])
    # Stop when the article has no usable link
    if not first_link:
        print("Arrived at an article with no links, search aborted.")
        break

    visited_urls.append(first_link)
    # Wait 0.5 seconds between queries to avoid heavy load on Wikipedia
    time.sleep(0.5)

# Reset the history so the cell can be re-run from the start URL
visited_urls = [start_url]

https://en.wikipedia.org/wiki/Special:Random
https://en.wikipedia.org/wiki/Persian_language
https://en.wikipedia.org/wiki/Exonym_and_endonym
https://en.wikipedia.org/wiki/Greek_language
Arrived at an article with no links, search aborted.
