# Wiki Auto Deck

### Quick Demo

In [1]:
from generate_deck import generate_deck
from classes import ApiSession

All Wikipedia API requests are made through an ApiSession object.

In [2]:
S = ApiSession()

The user may input any term. If the term has a valid Wikipedia article associated with it, validate_term() returns that term.

In [3]:
term = "Buddhism"
article_title = S.validate_term(term)

Buddhism matches the page Buddhism on Wikipedia!


If the term doesn't match or redirect to a Wikipedia article, validate_term() returns a list of search results of Wikipedia articles that may match the user's term.

In [4]:
term = "artomobiles"
article_title = S.validate_term(term)

artomobiles doesn't match a page on Wikipedia.
Term not recognized. Please select a suggestion:
1 Automobiles Alpine
2 Automobiles Darracq France
3 Automobiles Gonfaronnaises Sportives
4 Automobiles L. Rosengart
5 Automobiles Talbot France
6 Automobiles Lombard
7 Car
8 Automobiles Rally
9 Automobiles Martini
10 Automobiles ERAD
Enter number of term: 7


In [5]:
article_title

'Car'

In cases where the entered term automatically redirects when searched on Wikipedia, validate_term() returns the redirect target.

In [6]:
term = "zen buddhism"
article_title = S.validate_term(term)

zen buddhism matches the page Zen on Wikipedia!
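How `validate_term()` detects a match or a redirect internally isn't shown in this notebook. As a rough sketch only, assuming the session sends a MediaWiki `action=query` request with `redirects=1` and `formatversion=2`, the reply could be interpreted like this. The function name `resolve_title` and the two hand-written replies are hypothetical, not captured output.

```python
def resolve_title(response):
    """Return the resolved article title, or None if no page matched.

    response: decoded JSON reply to an action=query request made with
    redirects=1 and formatversion=2 (an assumed request shape).
    """
    pages = response.get("query", {}).get("pages", [])
    if not pages or pages[0].get("missing") or pages[0].get("invalid"):
        return None
    # With redirects=1 the page entry already carries the redirect target.
    return pages[0]["title"]


# Hand-written example replies (placeholder page id, not real API output).
redirected = {
    "query": {
        "redirects": [{"from": "Zen buddhism", "to": "Zen"}],
        "pages": [{"pageid": 1, "ns": 0, "title": "Zen"}],
    }
}
missing = {"query": {"pages": [{"ns": 0, "title": "Artomobiles", "missing": True}]}}

print(resolve_title(redirected))  # Zen
print(resolve_title(missing))     # None
```

A `None` result is the point where a search for suggestions, like the numbered list shown for "artomobiles", would take over.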


Once we have our article title, we can set the number of related pages we'd like returned for our deck and the length of the descriptions we'd like associated with those cards. 

The script pulls the most closely related articles from a pool of the longest articles linked from your root page. The pool is ten times the size of the eventual deck, so for a deck of 30 cards we pull the 300 longest articles linked from Zen and run a TF-IDF comparison to find the pages most similar to Zen.
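The pool selection described above can be sketched in a few lines. This is an illustration under assumptions, not the script's actual code: `longest_linked`, `POOL_FACTOR`, and the toy `lengths` dictionary are hypothetical names and made-up values.

```python
import heapq

POOL_FACTOR = 10  # the pool is ten times the deck size, per the description above


def longest_linked(article_lengths, deck_size, pool_factor=POOL_FACTOR):
    """Return the titles of the longest linked articles, longest first.

    article_lengths: hypothetical dict mapping linked article title -> length.
    """
    pool_size = deck_size * pool_factor
    return heapq.nlargest(pool_size, article_lengths, key=article_lengths.get)


# Toy lengths, invented for illustration only.
lengths = {"Chan Buddhism": 90_000, "Kenshō": 40_000, "Koan": 55_000, "Stub": 1_200}
pool = longest_linked(lengths, deck_size=1, pool_factor=2)
print(pool)  # ['Chan Buddhism', 'Koan']
```

Restricting the pool to long articles presumably keeps short stubs out of the TF-IDF comparison, where they would contribute very little text to compare against.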

A note on TF-IDF implementation:

```python
vect = TfidfVectorizer(stop_words="english", ngram_range=(1,3))
tfidf = vect.fit_transform(corpus)
pairwise_similarity = tfidf * tfidf.T
arr = pairwise_similarity.toarray()
np.fill_diagonal(arr, np.nan)
root_similarity = arr[0]
```

We use English stop words and collect unigrams, bigrams, and trigrams for our analysis. We only need the similarity of each article in the pool to the root article, so we take only the first row of the pairwise similarity matrix.
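Multiplying the TF-IDF matrix by its transpose yields cosine similarities because `TfidfVectorizer` L2-normalises each row by default. The standard-library sketch below makes that arithmetic explicit for a toy corpus. It is a simplified illustration rather than the script's code: it uses unigrams only, skips stop-word removal, and the function names `tfidf_rows` and `root_similarity` are hypothetical.

```python
import math
import re
from collections import Counter


def tfidf_rows(corpus):
    """Build L2-normalised TF-IDF vectors (unigrams only) as dicts."""
    docs = [Counter(re.findall(r"[a-z]+", text.lower())) for text in corpus]
    n_docs = len(docs)
    # Number of documents that contain each term.
    doc_freq = Counter(term for doc in docs for term in doc)
    rows = []
    for doc in docs:
        # Smoothed IDF: ln((1 + n) / (1 + df)) + 1
        row = {t: tf * (math.log((1 + n_docs) / (1 + doc_freq[t])) + 1)
               for t, tf in doc.items()}
        norm = math.sqrt(sum(w * w for w in row.values())) or 1.0
        rows.append({t: w / norm for t, w in row.items()})
    return rows


def root_similarity(corpus):
    """Cosine similarity of corpus[0] (the root) to every document."""
    rows = tfidf_rows(corpus)
    root = rows[0]
    sims = [sum(w * row.get(t, 0.0) for t, w in root.items()) for row in rows]
    sims[0] = float("nan")  # mirrors np.fill_diagonal(arr, np.nan)
    return sims


corpus = ["zen meditation practice", "zen meditation school", "tang dynasty history"]
sims = root_similarity(corpus)
print(sims)
```

The second document shares two terms with the root and scores above zero, while the third shares none and scores exactly zero.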

In [7]:
deck_size = 30
desc_length = 0

In [8]:
cards, all_similars = generate_deck(S, article_title, deck_size, desc_length)

Pulling article lengths...


100%|██████████████████████████████████████████████████████████████████████████████████| 30/30 [00:20<00:00,  1.50it/s]


Pulling article extracts...


100%|████████████████████████████████████████████████████████████████████████████████| 300/300 [00:37<00:00,  7.91it/s]


In [9]:
type(cards)

dict

We now have a handy dictionary keyed by article title, whose values are the extracted introductions.

In [10]:
titles = list(cards.keys())
for i,title in enumerate(titles):
    print(i,title)

0 Chan Buddhism
1 Dharma transmission
2 Chinese Buddhism
3 Buddhism in the United States
4 Buddhist meditation
5 Kenshō
6 Buddhism in Japan
7 Korean Buddhism
8 Buddhism
9 Buddhism in the West
10 Dhyāna in Buddhism
11 Meditation
12 Buddhist texts
13 Faith in Buddhism
14 Mahayana
15 Buddha-nature
16 Bodhidharma
17 Buddhist philosophy
18 Buddhism and psychology
19 Tiantai
20 Tang dynasty
21 Tendai
22 Lotus Sutra
23 Vajrayana
24 Outline of Buddhism
25 Buddhist art
26 Chinese folk religion
27 Mahayana sutras
28 Tibetan Buddhism
29 Buddhist devotion


In [11]:
cards[titles[21]]

'Tendai (天台宗, Tendai-shū), also known as the Tendai Lotus School (天台法華宗 Tendai hokke shū, sometimes just "hokke shū") is a Mahāyāna Buddhist tradition (with significant esoteric elements) officially established in Japan in 806 by the Japanese monk Saichō (posthumously known as Dengyō Daishi). The Tendai school, which has been based on Mount Hiei since its inception, rose to prominence during the Heian period (794-1185). It gradually eclipsed the powerful Hossō school and competed with the rival Shingon school to become the most influential sect at the Imperial court.\nBy the Kamakura period (1185-1333), Tendai had become one of the dominant forms of Japanese Buddhism, with numerous temples and vast landholdings. During the Kamakura period, various monks left Tendai (seeing it as corrupt) to establish their own "new" or "Kamakura" Buddhist schools such as Jōdo-shū, Nichiren-shū and Sōtō Zen. The destruction of the head temple of Enryaku-ji by Oda Nobunaga in 1571, as well as the geograp

### A closer look at the TF-IDF results

By digging into all_similars, we can see how TF-IDF ranks pages behind the scenes.

In [12]:
all_similars

[('Zen', nan),
 ('Chan Buddhism', 0.45265095142070144),
 ('Dharma transmission', 0.19256214432855379),
 ('Chinese Buddhism', 0.18318075130470943),
 ('Buddhism in the United States', 0.18295429967032492),
 ('Buddhist meditation', 0.17817762615900373),
 ('Kenshō', 0.17673088183878244),
 ('Buddhism in Japan', 0.15965985073465755),
 ('Korean Buddhism', 0.1446147214122153),
 ('Buddhism', 0.14288809755153903),
 ('Buddhism in the West', 0.1424171337906724),
 ('Dhyāna in Buddhism', 0.1298075167521294),
 ('Meditation', 0.11779079537051218),
 ('Buddhist texts', 0.11711609699510328),
 ('Faith in Buddhism', 0.11621463959874004),
 ('Mahayana', 0.11620026539851243),
 ('Buddha-nature', 0.1151897881416502),
 ('Bodhidharma', 0.1127201297101709),
 ('Buddhist philosophy', 0.11196094803961604),
 ('Buddhism and psychology', 0.10879579678983695),
 ('Tiantai', 0.10837979607372757),
 ('Tang dynasty', 0.1041955795798917),
 ('Tendai', 0.09864107606867198),
 ('Lotus Sutra', 0.09356124401011262),
 ('Vajrayana', 0

Zen and Chan Buddhism come out as by far the most similar pair, which makes sense because Zen is derived from Chan. The rest of the top results give context about specific aspects or regional branches of the Zen tradition, along with a few important terms such as Kenshō and Dhyāna.
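Since `all_similars` is a list of `(title, score)` pairs whose root entry carries `nan`, a small helper can pull out the top matches without tripping over that value. This is a minimal sketch; `top_similar` is a hypothetical name, and the sample reuses the first few entries printed above.

```python
import math


def top_similar(all_similars, n):
    """Return the n highest-scoring (title, score) pairs, skipping NaN scores."""
    scored = [(title, score) for title, score in all_similars
              if not math.isnan(score)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:n]


sample = [
    ("Zen", float("nan")),
    ("Chan Buddhism", 0.45265095142070144),
    ("Dharma transmission", 0.19256214432855379),
    ("Chinese Buddhism", 0.18318075130470943),
]
print([title for title, _ in top_similar(sample, 2)])
# ['Chan Buddhism', 'Dharma transmission']
```

Filtering `nan` before sorting matters because comparisons against `nan` are always false, which can leave a sorted list in an unpredictable order.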

We can see a contrasting example by looking at a different subject.

In [13]:
term = "stephen king"
article_title = S.validate_term(term)
deck_size = 30
desc_length = 0
cards, all_similars = generate_deck(S, article_title, deck_size, desc_length)

stephen king matches the page Stephen King on Wikipedia!
Pulling article lengths...


100%|██████████████████████████████████████████████████████████████████████████████████| 20/20 [00:13<00:00,  1.48it/s]


Pulling article extracts...


100%|████████████████████████████████████████████████████████████████████████████████| 300/300 [00:58<00:00,  5.14it/s]


In [15]:
list(cards.keys())

['Steve King',
 'Carrie (novel)',
 'Stephen King bibliography',
 'The Dark Tower (series)',
 'Stephen King short fiction bibliography',
 'Rose Red (miniseries)',
 'Paul LePage',
 'Joe Hill (novelist)',
 'The Dark Tower (2017 film)',
 'The Stand (1994 miniseries)',
 'Randall Flagg',
 'Maximum Overdrive',
 'The Stand',
 'The Shining (franchise)',
 'The Shining (film)',
 '11/22/63',
 'Carrie (1976 film)',
 'Horror fiction',
 'Castle Rock (Stephen King)',
 'It (miniseries)',
 'Ramsey Campbell',
 'Simon & Schuster',
 'Genre fiction',
 'The Dark Tower (comics)',
 'University of Maine',
 'Under the Dome (novel)',
 'Doctor Sleep (2019 film)',
 'Dennis Etchison',
 'The Stand (2020 miniseries)',
 'Richard Matheson']

In [16]:
cards['Steve King']

'Steven Arnold King (born May 28, 1949) is an American politician and businessman who served as a U.S. representative from Iowa from 2003 to 2021. A member of the Republican Party, he represented the 5th congressional district until redistricting meant he began representing the 4th district.\nBorn in 1949 in Storm Lake, Iowa, King attended Northwest Missouri State University from 1967 to 1970. He founded a construction company in 1975 and worked in business and environmental study before seeking the Republican nomination for a seat in the Iowa Senate in 1996. He won the primary and the general election, and was reelected in 2000. In 2002 King was elected to the U.S. House of Representatives from Iowa\'s 5th congressional district after the incumbent, Tom Latham, was reassigned to the 4th district after redistricting. He was reelected four times before the 2010 United States Census removed the 5th district and placed King in the 4th, which he represented from 2013.\nKing is an opponent 

So we have a bug here: the politician Steve King has little to do with the author Stephen King, yet the algorithm ranks him as the most similar page, perhaps simply because the word "king" occurs so often on both pages. One way to address this would be to make the token counting more discerning and exclude tokens that appear in the article titles.
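One possible shape for that fix, sketched under assumptions and not tried against real article text: strip any token that appears in the root or candidate title before the text reaches the vectorizer. The helper names `title_tokens` and `strip_title_tokens` are hypothetical.

```python
import re


def title_tokens(titles):
    """Lower-cased word tokens that occur in any of the given titles."""
    return {tok for title in titles for tok in re.findall(r"\w+", title.lower())}


def strip_title_tokens(text, titles):
    """Remove every token of `text` that also appears in one of `titles`."""
    banned = title_tokens(titles)
    kept = [tok for tok in re.findall(r"\w+", text.lower()) if tok not in banned]
    return " ".join(kept)


root, candidate = "Stephen King", "Steve King"
text = "Steven Arnold King is an American politician. King represented Iowa."
print(strip_title_tokens(text, [root, candidate]))
# steven arnold is an american politician represented iowa
```

The trade-off is that title words are sometimes genuinely informative (for example "Buddhism" in the Zen deck), so a blanket exclusion might lower the scores of pages that really are related.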

In [17]:
all_similars

[('Stephen King', nan),
 ('Steve King', 0.21892323443059183),
 ('Carrie (novel)', 0.1683257157658417),
 ('Stephen King bibliography', 0.16638346527599432),
 ('The Dark Tower (series)', 0.15873452357204787),
 ('Stephen King short fiction bibliography', 0.13988239892484455),
 ('Rose Red (miniseries)', 0.12039102418502666),
 ('Paul LePage', 0.10800243471183366),
 ('Joe Hill (novelist)', 0.10666957210497816),
 ('The Dark Tower (2017 film)', 0.09471512119232056),
 ('The Stand (1994 miniseries)', 0.0935565049076569),
 ('Randall Flagg', 0.09314208354728183),
 ('Maximum Overdrive', 0.09285764125525463),
 ('The Stand', 0.09204918305809294),
 ('The Shining (franchise)', 0.08659239170059718),
 ('The Shining (film)', 0.08380003011372747),
 ('11/22/63', 0.08061063292253817),
 ('Carrie (1976 film)', 0.0790608850019245),
 ('Horror fiction', 0.07832873458350832),
 ('Castle Rock (Stephen King)', 0.07702165956166025),
 ('It (miniseries)', 0.07496228801722502),
 ('Ramsey Campbell', 0.07470724394658211),


Anyway, I think I'm going to wrap up this demo notebook. Thanks for taking the time to look at it! Please feel free to give the script a try yourself and let me know if you encounter any odd behavior or interesting results.