# Articles crawled by class: Documentation

### A method for collecting the titles of all articles in a given class.

Given a class (FA, GA, A, B, C, Start or Stub), the crawler starts from the en.wikipedia.org/wiki/"Desired_Class"-Class_articles page, which lists all the categories that may contain articles with that classification. The crawler then visits each of those category pages and collects all the titles.

## Imported modules

In [1]:
import sys
from xml.dom.minidom import parseString
import csv
import time
from datetime import datetime
import requests
from pathlib import Path
import json
import file_manager
from checkpoint_manager import Checkpoint

## Getting the attributes in the html
Let's assume that you have already collected the HTML of the page. To identify elements in the page, such as the "div" that holds the links to the category pages, these two functions search the HTML for the desired "id" or "class".

In [2]:

def getElementById(elements,strId):
    for element in elements:
        if element.hasAttribute('id') and element.getAttribute('id') == strId:
            return element
    return None
        
def getElementsByClass(elements,strClass):
    selected = []
    for element in elements:
        if element.hasAttribute('class') and element.getAttribute('class') == strClass:
            selected.append(element)
            
    return selected
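
To see how the two helpers behave, the same attribute checks can be run on a small hand-written snippet. This is only a sketch: the markup is a hypothetical, well-formed stand-in for a category page (not real Wikipedia output), and `find_by_id` / `find_by_class` are illustrative names that mirror `getElementById` / `getElementsByClass`.

```python
from xml.dom.minidom import parseString

# Hypothetical, well-formed markup standing in for a category page.
SAMPLE = (
    '<html><body>'
    '<div id="mw-subcategories"><a href="/wiki/Category:X">X</a></div>'
    '<div class="CategoryTreeItem">first</div>'
    '<div class="CategoryTreeItem">second</div>'
    '<div class="other">ignored</div>'
    '</body></html>'
)

def find_by_id(elements, wanted_id):
    """Return the first element whose id equals wanted_id, else None."""
    for element in elements:
        if element.getAttribute("id") == wanted_id:
            return element
    return None

def find_by_class(elements, wanted_class):
    """Return every element whose class attribute equals wanted_class."""
    return [e for e in elements if e.getAttribute("class") == wanted_class]

divs = parseString(SAMPLE).getElementsByTagName("div")
subcats = find_by_id(divs, "mw-subcategories")
items = find_by_class(divs, "CategoryTreeItem")
item_texts = [item.firstChild.data for item in items]
print(subcats.getAttribute("id"), item_texts)
```

Note that the class comparison is an exact string match, so an element carrying several classes (for example `class="CategoryTreeItem extra"`) would not be selected.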

## Request
This function requests the page at the given URL and returns its HTML parsed into a DOM.

In [3]:
def requestCategory(urlToRequest):
    # Fetch the page (a plain GET needs no payload) and log the URL.
    r = requests.get(urlToRequest)
    print("Requesting: "+urlToRequest)

    # Parse the response body into a DOM.
    return parseString(r.text)
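
One thing worth keeping in mind is that `parseString` is an XML parser, so it only accepts well-formed markup and raises an exception otherwise. The sketch below illustrates this without touching the network; `parse_or_none` is a hypothetical helper, not part of the crawler.

```python
from xml.dom.minidom import parseString
from xml.parsers.expat import ExpatError

def parse_or_none(markup):
    """Parse markup with minidom; return None if it is not well-formed XML."""
    try:
        return parseString(markup)
    except ExpatError:
        return None

well_formed = parse_or_none("<div><p>ok</p></div>")
malformed = parse_or_none("<div><p>unclosed</div>")  # <p> is never closed
print(well_formed is not None, malformed is None)
```

In the crawler, a failure of this kind would surface as an exception from `requestCategory`, so the caller needs to handle it.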

## Getting the next page and the subcategories
Categories in Wikipedia can contain links to subcategories or to titles. If a category has many subcategories or titles, the page is paginated and includes a "next page" link. The crawler therefore has to collect the subcategory links and the "next page" link and add them to the urls_to_crawl list, so they can be explored later.

In [4]:
def getNextPage_Link(domHTML):
    mw_subcategories = getElementById(domHTML.getElementsByTagName("div"),"mw-subcategories")
    if mw_subcategories != None:
        page_links = mw_subcategories.getElementsByTagName("a")
        if len(page_links) > 1:
            for i in range(2):
                if(page_links[i].childNodes[0].data == "next page"):
                    if("href" in page_links[i].attributes):
                        return [page_links[i].attributes["href"].value]
    return []

def getSubcategory_Links(domHTML,category):
    subCats = getElementsByClass(domHTML.getElementsByTagName("div"),"CategoryTreeItem")
    to_crawl = []

    for subcategory in subCats:
        subcategory_spans = subcategory.getElementsByTagName("span")
        if  len(subcategory_spans)> 1:
            has_data = subcategory_spans[1].childNodes[0].data
            if(has_data != "(empty)" or has_data == "►"):
                all_links = subcategory.getElementsByTagName("a")
                subcategory_link = all_links[0].attributes["href"].value
                if(subcategory_link.find(f"Category:{category}")>=0):
                    to_crawl.append(subcategory_link)

    return to_crawl 
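
The two kinds of links can be illustrated on a small sample. This is a minimal sketch under stated assumptions: the markup is a hypothetical, simplified stand-in for the "mw-subcategories" div (the real page layout may differ), and the filtering mirrors the checks above rather than reusing the functions themselves.

```python
from xml.dom.minidom import parseString

# Hypothetical, simplified stand-in for the "mw-subcategories" div.
SAMPLE = (
    '<div id="mw-subcategories">'
    '<a href="/wiki/Category:A-Class_articles?subcatfrom=M">next page</a>'
    '<div class="CategoryTreeItem">'
    '<span>*</span><span>(3 C)</span>'
    '<a href="/wiki/Category:A-Class_physics_articles">A-Class physics articles</a>'
    '</div>'
    '<div class="CategoryTreeItem">'
    '<span>*</span><span>(empty)</span>'
    '<a href="/wiki/Category:A-Class_chess_articles">A-Class chess articles</a>'
    '</div>'
    '</div>'
)

root = parseString(SAMPLE).documentElement

# Pagination: a link whose text is exactly "next page".
next_links = [
    a.getAttribute("href")
    for a in root.getElementsByTagName("a")
    if a.firstChild is not None and a.firstChild.data == "next page"
]

# Subcategories: keep items whose second <span> is not "(empty)".
sub_links = []
for div in root.getElementsByTagName("div"):
    if div.getAttribute("class") != "CategoryTreeItem":
        continue
    spans = div.getElementsByTagName("span")
    if len(spans) > 1 and spans[1].firstChild.data != "(empty)":
        sub_links.append(div.getElementsByTagName("a")[0].getAttribute("href"))

to_crawl = next_links + sub_links
print(to_crawl)
```

The empty chess subcategory is dropped, while the pagination link and the non-empty physics subcategory are queued.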

## How the crawler navigates and collects the titles

First, the crawler requests the category page and gets its HTML. It then searches for links to explore later: the subcategory links and the category's "next page" link.

Next come the titles. The titles and the "next page" link for titles are in a div with the ID "mw-pages". For each link in that div, the crawler checks whether it is a "next page" link, which is saved for future exploration. Any other link points to the Talk page of an article, so the link text is enough to identify the article title, which is appended to a text file.

In [5]:
def get_url_cat_links(url_to_crawl,discovered_pages,log,category,checkpoint,output,get_sub_cats=False):
    try:
        domHTML = requestCategory(url_to_crawl)
    except Exception as ex:
        log(f"Error requesting url: {url_to_crawl} ERROR: {ex}")
        return []

    arrNewURLToCrawl = []

    # Get the "next page" link and the subcategory links.
    arrNewURLToCrawl = getNextPage_Link(domHTML) + getSubcategory_Links(domHTML,category)

    # Record every article title found and queue the "next page" link
    # (the "previous page" link is skipped).
    pages = getElementById(domHTML.getElementsByTagName("div"),"mw-pages")
    if(pages!=None):
        arrSubCats = pages.getElementsByTagName("a") 
        for catLink in arrSubCats:
            link_text = catLink.childNodes[0].data.strip()
            if("href" in catLink.attributes):
                link_url = catLink.attributes["href"].value
                if(link_url.find(f"Category:{category}-Class_")>=0):
                    if(link_text!= "previous page"):
                        arrNewURLToCrawl.append(link_url)
                else:

                    if(link_text not in discovered_pages):
                        discovered_pages.append(link_text)
                        file_manager.append_file(output,link_text)
    return arrNewURLToCrawl
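
The way links in "mw-pages" are split into pages to follow and titles to record can be sketched on a small sample. The markup here is a hypothetical stand-in for the real div, and `split_links` is an illustrative name. Unlike the function above, the sketch tracks seen titles in a set rather than a list and returns the titles instead of writing them to a file.

```python
from xml.dom.minidom import parseString

CATEGORY = "A"  # class being crawled, like `category` in the notebook

# Hypothetical, well-formed stand-in for the "mw-pages" div.
SAMPLE = (
    '<div id="mw-pages">'
    '<a href="/wiki/Category:A-Class_physics_articles?pagefrom=A">previous page</a>'
    '<a href="/wiki/Category:A-Class_physics_articles?pagefrom=M">next page</a>'
    '<a href="/wiki/Talk:Electron">Talk:Electron</a>'
    '<a href="/wiki/Talk:Photon">Talk:Photon</a>'
    '<a href="/wiki/Talk:Electron">Talk:Electron</a>'
    '</div>'
)

def split_links(pages_div, category, seen):
    """Return (urls_to_follow, new_titles) for one "mw-pages" div."""
    follow, titles = [], []
    for link in pages_div.getElementsByTagName("a"):
        text = link.firstChild.data.strip()
        url = link.getAttribute("href")
        if f"Category:{category}-Class_" in url:
            # Pagination link: follow it unless it points backwards.
            if text != "previous page":
                follow.append(url)
        elif text not in seen:
            # Any other link is an article's Talk page: keep its text once.
            seen.add(text)
            titles.append(text)
    return follow, titles

pages = parseString(SAMPLE).documentElement
follow, titles = split_links(pages, CATEGORY, seen=set())
print(follow, titles)
```

The duplicate "Talk:Electron" entry is recorded only once, and only the "next page" link is queued.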

## The Main

* **Notes:** 
  * The command-line arguments were replaced with hard-coded values so the code can run in this notebook. Follow the instructions in the comments to run the algorithm from the terminal.
  * The enable variable was added because exit() doesn't end the program in Jupyter.
  
  
While there are URLs left to crawl, the program takes one URL from the list (pop() removes the last one) and calls get_url_cat_links(), which collects the titles on that page and returns the links found there, to be requested later. The links that are new (not yet in the "added_urls" list) are saved in "new_urls_to_crawl". They are then appended to "added_urls", the list of all discovered URLs, and to "urls_to_crawl", to be explored later.

* Structure
  * "added_urls": Contains all the discovered URLs;
  * "urls_to_crawl": Contains the URLs that are waiting to be requested;
  * "arr_urls": Contains the URLs discovered in this iteration;
  * "new_urls_to_crawl": Contains the URLs discovered in this iteration that are new (not in "added_urls").
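
How these lists interact can be sketched without any network access. In the sketch below, `FAKE_LINKS` is a hypothetical link graph that stands in for the pages, and `crawl` is an illustrative name. It also assumes the start URL is seeded into both lists, which in the notebook is left to the checkpoint.

```python
# Hypothetical link graph: each URL maps to the links found on that page.
FAKE_LINKS = {
    "/root": ["/cat1", "/cat2"],
    "/cat1": ["/cat2", "/cat1?page=2"],
    "/cat2": ["/root"],
    "/cat1?page=2": [],
}

def crawl(start):
    added_urls = [start]       # every URL discovered so far
    urls_to_crawl = [start]    # URLs still waiting to be requested
    visit_order = []
    while urls_to_crawl:
        current_url = urls_to_crawl.pop()           # takes the last element
        visit_order.append(current_url)
        arr_urls = FAKE_LINKS.get(current_url, [])  # links found this iteration
        new_urls_to_crawl = []
        for url in arr_urls:
            # Keep only links that were never discovered before.
            if url not in added_urls and url not in new_urls_to_crawl:
                new_urls_to_crawl.append(url)
        added_urls += new_urls_to_crawl
        urls_to_crawl += new_urls_to_crawl
    return visit_order, added_urls

visit_order, added_urls = crawl("/root")
print(visit_order)
```

Because a URL is queued only when it is not yet in `added_urls`, each page is requested once even though the graph contains a cycle.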
  



In [6]:
if __name__ == "__main__":
    # To collect FA, GA, A, B, C, Start or Stub articles, pass the name of the class.
    # Specify whether you want to reset the JSON: T or F.
    # You also have to give the name of the JSON file used as the checkpoint.
    # Always reset the JSON with the desired category before starting the crawler.
    # Example: python get_wiki_category.py A T checkpoint.json to reset the JSON.
    #          python get_wiki_category.py A F checkpoint.json to start the crawler.
    # Pause whenever you want and your progress won't be lost!
    # Happy crawling!

    category = "A"
    reset = "T"
    file_name = "checkpoint.json"

    output = f"articles_{category}"
    file_manager.create_file_if_does_not_exist(output)

    checkpoint = Checkpoint(category,file_name)

    enable = True
    if(reset == "T"):
        checkpoint.reset_Checkpoint()
        print("Reseted")
        enable = False
        # exit()

    available_categories = ["FA","GA","A","B","C","Start","Stub"]
    if(category in available_categories):
        print("-----------------Starting the Crawler----------------------")
    else:
        print("This code is for getting FA, GA, A, B, C, START and STUB articles only.")	
        enable = False
        # exit()
        

    added_urls, urls_to_crawl = checkpoint.load_Checkpoint()
    discovered_pages = []
    #parameters
    domain = "https://en.wikipedia.org"
    log = file_manager.create_logger("erros")
  
    wasted_time = datetime.min - datetime.min  # exactly zero elapsed time
    if(enable):
        i = 1
        while len(urls_to_crawl)>0:

            time_before = datetime.now()

            new_urls_to_crawl = []
            current_url = urls_to_crawl.pop()
            setNewUrls = set([])
            arr_urls = None
            while(arr_urls == None):
                arr_urls = get_url_cat_links(current_url,discovered_pages,log,category, checkpoint, output, get_sub_cats=True)
            setNewUrls = set(arr_urls)

            new_urls_to_crawl = [domain+pageURL for pageURL in setNewUrls if domain+pageURL not in added_urls]

            added_urls = added_urls + new_urls_to_crawl

            urls_to_crawl = urls_to_crawl + new_urls_to_crawl
            # Save in the checkpoint
            checkpoint.add(urls_to_crawl,"urls_to_crawl")
            checkpoint.add(new_urls_to_crawl,"visited")

            # Running mean of the time spent per URL over the i URLs crawled so far.
            wasted_time = wasted_time + ((datetime.now() - time_before) - wasted_time)/i

            print("URL ("+str(i)+")"+current_url+" crawled. URLs to crawl: "+str(len(urls_to_crawl))+" Time needed per URL: "+str(wasted_time.total_seconds())+" seconds" )

            time.sleep(1)
            i = i+1

Reseted
-----------------Starting the Crawler----------------------
