# Data, classification task

Our data will be a set of articles downloaded from Wikipedia that we need to classify as concerning 'Medical' or 'Non-medical' topics. In other words, we need to split the documents into two classes: those with a 'Medical' tag and those without it.

## Data retrieval

To create a dataset for the classification task, we need to download articles through the Wikipedia API. Wikipedia already groups articles under categories, so for this binary classification task we can use the API to select a category and then download the articles belonging to it.

Wikipedia categories are highly hierarchical: for example, the category 'Medicine' contains few articles and many subcategories, which the API lists as members alongside the articles. To 'fix' the search, I did some exploratory analysis of the available categories to determine which had enough articles to build a decently sized dataset.
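As a rough illustration of that exploratory step, the sketch below counts how many members of a category are articles and how many are subcategories. It assumes the `categorymembers` response format, in which each member carries an `ns` namespace field (0 for articles, 14 for categories). The helper name `summarize_members` and the sample payload are hypothetical.

```python
from collections import Counter

# Namespace numbers used by MediaWiki: 0 = article, 14 = category.
ARTICLE_NS = 0
CATEGORY_NS = 14


def summarize_members(members):
    """Count articles and subcategories in a list of category members.

    `members` is assumed to look like response["query"]["categorymembers"],
    i.e. a list of dicts with at least an "ns" key.
    """
    counts = Counter(member["ns"] for member in members)
    return {
        "articles": counts[ARTICLE_NS],
        "subcategories": counts[CATEGORY_NS],
    }


# Hypothetical payload, shaped like a categorymembers response.
sample_members = [
    {"pageid": 1, "ns": 0, "title": "Some article"},
    {"pageid": 2, "ns": 14, "title": "Category:Some subcategory"},
    {"pageid": 3, "ns": 14, "title": "Category:Another subcategory"},
]
print(summarize_members(sample_members))
```

A category whose members are mostly subcategories is a poor source of documents on its own, which is why narrower categories may be a better choice.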

For the medical categories, I went with the following:

In [3]:
medical_categories = [
    "Category:Alternative medicine stubs",
    "Category:Evidence-based medicine",
    "Category:Veterinary medicine stubs",
    "Category:Vaccination",
    "Category:2018 disease outbreaks",
]

For the non-medical categories, I went with a mix of scientific and non-scientific categories. I explicitly added scientific and related categories to make the classification task more interesting: I expect an article belonging to 'Literature' to be classified correctly quite easily; an article belonging to 'Geology' or 'AI', not so much (due to possibly overlapping topics).

In [4]:
non_medical_categories = [
    "Category:Dark ages",
    "Category:Historiography of China",
    "Category:Sports controversies",
    "Category:Philosophy of artificial intelligence",
    "Category:Geology",
    "Category:Space",
    "Category:Literature",
    "Category:Music videos",
]

We use these categories to download the article ids with the following function:

In [7]:
import requests
def get_ids(url, categories):
    returned_ids = []

    for c in categories:
        params = {
            "action": "query",
            "cmtitle": c,
            "cmlimit": "500",
            # "cmtype": "subcat",
            "list": "categorymembers",
            "format": "json",
        }

        req = requests.get(url=url, params=params)
        pages = req.json()["query"]["categorymembers"]

        page_ids = [page["pageid"] for page in pages]

        for page_id in page_ids:
            new_params = {
                "format": "json",
                "action": "query",
                "prop": "extracts",
                "exintro": True,
                "explaintext": True,
                "redirects": 1,
                "pageids": page_id,
            }
            req = requests.get(url, params=new_params)
            try:
                title = req.json()["query"]["pages"][str(page_id)]["title"]
            except (KeyError, ValueError):
                # Missing page entry, or a response that is not valid JSON
                print(f"||Failed at id {page_id}||")
                continue
            # Skip category, template and portal pages: they are not articles
            if title.startswith(("Category", "Template", "Portal")):
                continue
            returned_ids.append(page_id)

    return returned_ids
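
Each request returns at most `cmlimit` members, so a category with more members than that would be truncated. One possible way to follow the API's continuation tokens is sketched below. `collect_member_ids` and `fetch` are hypothetical names: `fetch` is assumed to be a callable that takes the request parameters and returns the parsed JSON response (for example a thin wrapper around `requests.get(url, params=params).json()`), which also lets the sketch run without network access.

```python
def collect_member_ids(fetch, category, limit="500"):
    """Collect the page ids of all members of `category`, following continuation."""
    params = {
        "action": "query",
        "list": "categorymembers",
        "cmtitle": category,
        "cmlimit": limit,
        "format": "json",
    }
    page_ids = []
    while True:
        data = fetch(params)
        page_ids.extend(m["pageid"] for m in data["query"]["categorymembers"])
        # The API adds a "continue" object while more results are available.
        if "continue" not in data:
            return page_ids
        # Merge the continuation values (e.g. "cmcontinue") into the next request.
        params = {**params, **data["continue"]}
```

The ids returned this way would still need the same title filtering as in `get_ids`.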

In [None]:
url = "https://en.wikipedia.org/w/api.php"
# We download them like this
__M_IDS__ = get_ids(url, medical_categories)
__NON_M_IDS__ = get_ids(url, non_medical_categories)

Some ids and dataset sizes

In [16]:
# For presentation purposes, we load the ids that were already downloaded instead of computing them again
from ids import __M_IDS__, __NON_M_IDS__

print(f"Medical ids (clipped): {__M_IDS__[:20]} ...\n")
print(f"Non medical ids (clipped): {__NON_M_IDS__[:20]} ... \n")

print(f"Size of the medical dataset: {len(__M_IDS__)}")
print(f"Size of the non-medical dataset: {len(__NON_M_IDS__)}")
print(f"Total dataset size: {len(__M_IDS__) + len(__NON_M_IDS__)}")

Medical ids (clipped): [46303046, 5835531, 3887850, 52103535, 33076141, 5335383, 40440492, 4031180, 66655182, 55554061, 951614, 66012376, 10420896, 3740763, 3296243, 6868776, 38474884, 300772, 10616040, 36732930] ...

Non medical ids (clipped): [7410249, 18400571, 90138, 5571005, 53290497, 5832437, 48110, 1840762, 2958015, 8099572, 18472072, 4513331, 25508360, 3054853, 68092158, 34043, 440393, 36082813, 4175228, 13666328] ... 

Size of the medical dataset: 457
Size of the non-medical dataset: 399
Total dataset size: 856
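
With the sizes above, roughly 53% of the ids (457 of 856) are medical, so the two classes are fairly balanced. A small sanity check along these lines, which also looks for ids that ended up in both lists, could be sketched as follows. `summarize_ids` is a hypothetical helper and the ids in the example are toy values, not taken from the real dataset.

```python
def summarize_ids(medical_ids, non_medical_ids):
    """Return the class shares and the ids that appear in both lists."""
    total = len(medical_ids) + len(non_medical_ids)
    overlap = sorted(set(medical_ids) & set(non_medical_ids))
    return {
        "medical_share": len(medical_ids) / total,
        "non_medical_share": len(non_medical_ids) / total,
        "overlap": overlap,
    }


# Toy ids, for illustration only.
summary = summarize_ids([1, 2, 3], [3, 4])
print(summary)
```

An id that shows up in `overlap` would carry both labels, so it is worth deciding how to handle it before training.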


We now have two sets of ids: medical ids and non-medical ids. We will use them to download the articles and build our dataset.

In [17]:
# "ids": page ids of the documents to download
# "kind": medical or non-medical subfolder name (e.g. 'medicine'), used to separate the two types of documents into two folders for ease of use
def download_documents(ids, kind):
    for page_id in ids:
        new_params = {
            "format": "json",
            "action": "query",
            "prop": "revisions",
            "rvslots": "*",
            "rvprop": "content",
            "redirects": 1,
            "pageids": page_id,
        }
        req = requests.get(url, params=new_params)
        try:
            # Raw wikitext of the latest revision of the page
            content = req.json()["query"]["pages"][str(page_id)]["revisions"][0][
                "slots"
            ]["main"]["*"]
            # UTF-8 keeps non-ASCII text (e.g. Chinese characters) intact on every platform
            with open(f"./documents/{kind}/{page_id}.txt", "w", encoding="utf-8") as f:
                f.write(content)
        except (KeyError, IndexError, ValueError, OSError):
            print(f"||Failed at id {page_id}||")

We download all the documents and store them in 'documents/medicine' and 'documents/non_medicine'.
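
`download_documents` writes into `./documents/{kind}`, so those folders have to exist before it is called. A minimal sketch for creating them, assuming the two folder names mentioned above (`ensure_output_dirs` is a hypothetical helper):

```python
import os


def ensure_output_dirs(base="documents", kinds=("medicine", "non_medicine")):
    """Create base/kind for each kind if it is missing and return the paths."""
    paths = [os.path.join(base, kind) for kind in kinds]
    for path in paths:
        # exist_ok=True makes the call safe to repeat
        os.makedirs(path, exist_ok=True)
    return paths
```

After `ensure_output_dirs()`, the downloads would then be `download_documents(__M_IDS__, "medicine")` and `download_documents(__NON_M_IDS__, "non_medicine")`.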

### Examples of documents

**Medical**

{{Short description|University in Tianjin, China}}
'''Tianjin University of Traditional Chinese Medicine''' (天津中医药大学 in [[Chinese language|Chinese]]) is  a university in [[Tianjin]], [[China]], under the municipal government. Specialized in traditional Chinese Medicine, it is selected by the Chinese state [[Double First Class University Plan|Double First-Class University]], included in the national Double First Class University Plan.<ref name="Chinese Department of Education">{{Cite web |url=http://www.moe.gov.cn/srcsite/A22/moe_843/201709/t20170921_314942.html |title=教育部 财政部 国家发展改革委 关于公布世界一流大学和一流学科建设高校及建设 学科名单的通知 (Notice from the Ministry of Education and other national governmental departments announcing the list of double first class universities and disciplines)}}</ref>

== See also ==
[[Japan Campus of Foreign Universities]]

== References ==
{{Reflist}}

{{-}}
{{Universities and colleges in Tianjin}}

{{coord missing|Tianjin}}

{{authority control}}

{{DEFAULTSORT:Tianjin University of Traditional Chinese Medicine}}
[[Category:Universities and colleges in Tianjin]]
[[Category:Traditional Chinese medical schools in China]]
[[Category:Medical and health organizations based in China]]


{{China-university-stub}}
{{Alt-med-stub}}

**Non-medical**

{{Short description|Abbreviation of 1,000,000 years}}{{redirect|Million years ago|the [[Adele]] song|Million Years Ago (song)|1,000,000 BC|one million (disambiguation)}}
{{about|"million years" (Myr)|the [[Taake]] song|Noregs vaapen|other uses}}

'''Myr''' is an abbreviation for '''million years''', a [[unit of time]] equal to {{val|fmt=commas|1000000|u=years}} (i.e. {{val|1|e=6}} years), or 31.556926 [[Terasecond and longer#Teraseconds|teraseconds]].
It is equivalent to one ''[[megaannum]]'' (symbol Ma), based on the [[metric prefix]] [[mega-]].

==Usage==
Myr (million years) is in common use in fields such as [[Earth science]] and [[cosmology]]. Myr is also used with '''Mya''' or '''Ma''' (million years ago). Together they make a reference system, one to a quantity, the other to a particular place in a [[calendar era|year numbering system]] that is ''time before the present''.

Myr is deprecated in [[geology]], but in [[astronomy]] ''Myr'' is standard. Where "myr" ''is'' seen in geology it is usually "Myr" (a unit of mega-years). In astronomy it is usually "Myr" (Million years).

== Debate ==
In geology a debate remains open concerning the use of ''Myr'' (duration) plus ''Ma'' (million years ago) versus using only the term ''Ma''.<ref>{{cite web|last=Mozley|first=Peter|title=Discussion of GSA Time Unit Conventions|url=https://www.geosociety.org/TimeUnits/|work=web page|publisher=[[Geological Society of America]]|archive-url=https://web.archive.org/web/20160303232640/https://www.geosociety.org/TimeUnits/|archive-date=2016-03-03}}</ref><ref name="Biever-war">{{cite journal |first=Celeste |last=Biever |title=Push to define year sparks time war |journal=[[New Scientist]] |volume=210 |issue=2810 |pages=10 |url=https://www.newscientist.com/article/dn20423-push-to-define-year-sparks-time-war.html |date=April 27, 2011 |access-date=April 28, 2011|bibcode=2011NewSc.210R..10B |doi=10.1016/S0262-4079(11)60955-X }}</ref> In either case the term ''[[Year#SI prefix multipliers|Ma]]'' is used in geology literature conforming to [[ISO 31-1]] (now [[ISO 80000-3]]) and NIST 811 recommended practices. Traditional style geology literature is written {{Quote|The Cretaceous started 145 Ma and ended 66 Ma, lasting for 79 Myr.}}
The "ago" is implied, so that any such year number "X Ma" between 66 and 145 is "Cretaceous", for good reason. But the counter argument is that having ''myr'' for a duration and ''Mya'' for an age mixes unit systems, and tempts capitalization errors: "million" need not be capitalized, but "mega" must be; "ma" would technically imply a ''milliyear'' (a thousandth of a year, or 8 hours). On this side of the debate, one avoids ''myr'' and simply adds ''ago'' explicitly (or adds ''[[Before Present|BP]]''), as in {{Quote|The Cretaceous started 145 Ma ago and ended 66 Ma ago, lasting for 79 Ma.}}
In this case, "79 Ma" means only a quantity of 79 million years, without the meaning of "79 million years ago".

== See also ==
* [[Billion years|Byr]]
* [[Kyr]]
* [[Year#SI prefix multipliers|Megaannum]] (Ma)
* [[Year#Abbreviations yr and ya|Symbols y and yr]]

==References==
<references/>

{{Portal bar|Earth science|Mathematics|Astronomy|Stars}}

[[Category:Units of time]]
[[Category:Units of measurement in astronomy]]
[[Category:Geology]]

As you can see, the documents are very raw: they contain many symbols and words that Wikipedia uses to render and cross-reference the article but that are useless for classification. We need to clean these documents.

In [23]:
import re
import os

# Given a folder under './documents' containing the desired documents, clean them and save them in './documents'.
def clean_documents(folder):
    # Build paths explicitly instead of calling os.chdir, so the function
    # can be called for several folders in a row
    path = os.path.join("documents", folder)
    for file in os.listdir(path):
        # Only process plain-text files
        if not file.endswith(".txt"):
            continue

        # Clean the file line by line
        with open(os.path.join(path, file), "r", encoding="utf-8") as f:
            new_lines = [clean_string(line) for line in f]

        # Save to "./documents/{id}_c.txt"
        name = file.split(".")[0]
        new_path = os.path.join("documents", f"{name}_c.txt")
        with open(new_path, "w", encoding="utf-8") as fw:
            fw.writelines(new_lines)


# Clean a string. These are heuristic-based rules that work on Wikipedia articles
def clean_string(string):
    string = re.sub(r"<ref.*?</ref>", "", string)  # removes refs
    string = re.sub(r"<ref.*?/>", "", string)  # idem, for self-closing refs
    string = re.sub(r"\{.*?\}", "", string)  # removes "{...}"
    # removes everything from the first "|" up to the end of the line
    string = re.sub(r"\|.*\n?", "", string)
    # removes everything from "Category" up to the end of the line
    string = re.sub(r"(Category).*\n?", "", string)
    string = re.sub(r"(thumb\|.*?\|)", "", string)  # removes "thumb|...|"
    # removes "thumb" (cannot easily distinguish all cases)
    string = re.sub(r"(thumb)", "", string)
    # removes the target of links such as [[dieting|diet]] (everything up to "|")
    string = re.sub(r"\[\[.*?\|", "", string)
    # removes the remaining markup and punctuation: [ ] { } ' , . # = * | ` -
    string = re.sub(r"[\[\]{}',.#=*|`-]", "", string)
    string = re.sub(r"\n", "\n", string)  # newline -> newline, so line breaks are kept
    return string
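
Because the `"|"` rule in `clean_string` runs before the link rule, a line containing a piped link such as `[[Chinese language|Chinese]]` is cut at the first `|`. If keeping the visible label is preferable, one alternative is to unwrap links before applying the other rules. This is only a sketch, not part of the pipeline above, and `unwrap_links` is a hypothetical helper.

```python
import re

# [[target|label]] -> label, [[target]] -> target
_LINK = re.compile(r"\[\[(?:[^\[\]|]*\|)?([^\[\]]*)\]\]")


def unwrap_links(text):
    """Replace wiki links with their visible text."""
    return _LINK.sub(r"\1", text)


sample = "a university in [[Tianjin]], [[China]] ([[Chinese language|Chinese]])"
print(unwrap_links(sample))
```

Links with several pipes, such as image links, keep everything after the first `|`, so they would still need the other heuristics.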
