In [2]:
import requests, wikipedia, spacy
from bs4 import BeautifulSoup

import ast # for converting string representation of list to a list

In [3]:
URL = "https://en.wikipedia.org/wiki/Earth"
page = requests.get(URL)
soup = BeautifulSoup(page.content, "html.parser")
title = soup.find(id="firstHeading").text
print(title)

nlp = spacy.load("en_core_web_sm")

Earth


# First Version

Take the first paragraph of Wikipedia's "Earth" page, choose the sentence you would like to make a flashcard for, and finally choose which parts of that sentence to hide (the hidden parts become the answer).

In [4]:
all_paragraphs = soup.find_all("p")

In [5]:
# get the first non-empty paragraph
for p in all_paragraphs:
    if p.text.strip(): # skips the whitespace-only "\n" paragraph which comes first
        break # only one paragraph for now

### Using spacy to tokenise and sentence split our paragraph

In [6]:
p_text = p.text.strip()
doc = nlp(p_text)

sents = list(doc.sents) # all sentences are type spacy.tokens.span.Span
number_of_sents = len(sents)

### We begin the interaction with the user by asking which line they want to make a flashcard for

The user is shown the sentences ("lines") of the chosen paragraph (version one only includes the first paragraph). Line 6 seems to be the most interesting, so try that :)
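The interactive cell further down raises on invalid input, although its comments note that the real goal is to ask for a line number again. A minimal sketch of that re-prompt loop follows. `parse_line_choice` and `ask_for_line` are hypothetical helper names, not part of the notebook, and the `ask` parameter is an assumption that stands in for `input` so the loop can be exercised without a keyboard.

```python
def parse_line_choice(raw, number_of_sents):
    """Return a 1-based line number, or None if the user typed "b" (go back).

    Raises ValueError for anything that is not a valid line number.
    """
    raw = raw.strip()
    if raw == "b":
        return None
    choice = int(raw)  # raises ValueError for non-numeric input
    if not 1 <= choice <= number_of_sents:
        raise ValueError(f"line must be between 1 and {number_of_sents}")
    return choice


def ask_for_line(number_of_sents, ask=input):
    """Keep asking until the reply is a valid line number or "b"."""
    while True:
        try:
            return parse_line_choice(ask("What line do you want to learn? "), number_of_sents)
        except ValueError:
            print("Please try again and choose a valid line, or go back with b.")
```

Calling `ask_for_line(number_of_sents)` would replace the single `input` call; a `None` result means the user asked to go back.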

### Then we ask what span(s) of the sentence the user wishes to learn

E.g. choose sentence 6 and learn "magnetosphere" and "solar winds" (spans: [[14], [18,19]], where the numbers are the 1-based word positions within the sentence).
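The span input can take three shapes: a single word (`[5]`), one span (`[1,2,3]`), or several spans (`[[1,2], [5,6]]`). The sketch below normalises all three into a list of lists using `ast.literal_eval`, which the notebook already imports. `parse_spans` is a hypothetical helper name, and the stricter validation (rejecting empty or mixed input) is an assumption about what counts as valid rather than the notebook's exact behaviour.

```python
import ast


def parse_spans(text):
    """Turn "[5]", "[1,2,3]" or "[[1,2], [5,6]]" into a list of lists of ints."""
    spans = ast.literal_eval(text)  # evaluates Python literals only, never arbitrary code
    if not isinstance(spans, list) or not spans:
        raise ValueError("expected a non-empty list")
    if all(isinstance(item, int) for item in spans):
        spans = [spans]  # a single span is treated as one nested span
    for span in spans:
        if not isinstance(span, list) or not span or not all(isinstance(i, int) for i in span):
            raise ValueError("each span must be a non-empty list of word indices")
    return spans
```

Malformed text makes `ast.literal_eval` raise `ValueError` or `SyntaxError`, so a caller would catch both before asking the user to try again.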

### Finally we present the user with the question that will be the front of the flashcard (the original sentence but with words deleted), and the answers (the missing words).
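As a sketch of this last step, the function below builds the question and the answers from plain `(text, trailing_whitespace)` pairs. These pairs are an assumption that stands in for spaCy tokens (their `text` and `whitespace_` attributes), so the example runs without a language model. `build_flashcard` is a hypothetical name, and the token list is a shortened version of sentence 6. Like the notebook cell, it replaces each hidden word with its answer number, so a two-word span shows the placeholder twice.

```python
def build_flashcard(tokens, spans):
    """tokens: list of (text, trailing_whitespace); spans: lists of 1-based indices."""
    # map each hidden word index to the number of the answer it belongs to
    span_numbers = {idx: n for n, span in enumerate(spans, start=1) for idx in span}

    question = ""
    for i, (text, whitespace) in enumerate(tokens, start=1):
        shown = f"[{span_numbers[i]}]" if i in span_numbers else text
        question += shown + whitespace

    answers = []
    for span in spans:
        # keep whitespace between the words of an answer, but not after the last one
        answer = "".join(tokens[idx - 1][0] + tokens[idx - 1][1] for idx in span)
        answers.append(answer.rstrip())
    return question, answers


tokens = [("Earth", ""), ("'s", " "), ("magnetosphere", ""), (",", " "),
          ("deflecting", " "), ("solar", " "), ("winds", ""), (".", "")]
question, answers = build_flashcard(tokens, [[3], [6, 7]])
print(question)  # Earth's [1], deflecting [2] [2].
print(answers)   # ['magnetosphere', 'solar winds']
```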

In [10]:
print("Choose which line to learn from, or go back with \"b\".")

for i in range(1, len(sents) + 1):
    print(f"{i}) {sents[i - 1]}")

# ask user, line 6 is interesting enough to create a flashcard for
line_number_to_learn = input("What line do you want to learn? ")

if line_number_to_learn == "b":
    # going back here to choose another paragraph/line will be part of version two
    raise NotImplementedError("Going back is not supported yet.")
else:
    line_number_to_learn = int(line_number_to_learn)

if line_number_to_learn not in range(1, number_of_sents + 1):
    # raising instead of calling exit(), which kills the kernel; really we want to ask for a line number again
    raise ValueError("Please try again and choose a valid line, or go back with b.")

print("\nWhich words would you like to hide?")

line_to_learn = sents[line_number_to_learn - 1]
for i in range(1, len(line_to_learn) + 1):
    print(f"{i} {line_to_learn[i - 1]}")

# [[14], [18,19]] for line 6 generates two interesting answers
selected_spans_str = input("\nSelect a span (e.g. \"[1,2,3]\"), a single word (e.g. \"[5]\"), or even multiple spans (e.g. \"[[1,2], [5,6]]\")\n")

all_spans = [] # to track all the spans mentioned
span_numbers = {} # to track, for a given index, which span group (i.e. the answer) it corresponds to
answers = [] # to collect the answers for the given sub_spans

try:
    selected_spans = ast.literal_eval(selected_spans_str)
    assert isinstance(selected_spans, list)
    
    if isinstance(selected_spans[0], int):
        # if only have one span, e.g. "[1,2,3]" or "[5]"
        selected_spans = [selected_spans] # can trivially treat as a nested span
        
    elif not isinstance(selected_spans[0], list):
        raise ValueError("expected a list of word indices or a list of such lists")
    
    # now have a list of spans "[[1,2], [5,6]]", or simply "[[1,2,3]]"
    all_spans = [idx for sub_span in selected_spans for idx in sub_span] # simply extracting every word index from our list of lists
    for i, sub_span in enumerate(selected_spans, start = 1):
        # now building answer for given sub_span, and assigning span numbers (corresponding answer) for each index of the sub-span
        answer = ""

        for j, idx in enumerate(sub_span, start=1):
            span_numbers[idx] = i # assigns each individual index to a specific span (answer), to use when creating the flashcard
            
            if j != len(sub_span):
                answer += line_to_learn[idx - 1].text_with_ws
            else:
                # don't add whitespace if at the last sub_span index (don't want an answer with a space at the end)
                answer += line_to_learn[idx - 1].text
        
        answers.append(answer)

except Exception: # avoids swallowing KeyboardInterrupt, unlike a bare except
    print("Please try again with a valid span.")


print("\nYour new flashcard will look like this:")

question = ""
for i in range(1, len(line_to_learn) + 1):
    token = line_to_learn[i - 1]
    if i in all_spans:
        span_number = span_numbers[i] # look up which answer this word index belongs to
        question += f"[{span_number}]"
        question += token.whitespace_
    else:
        question += token.text_with_ws
print(question)

print("\nAnd the answers are:")
for i, answer in enumerate(answers, start = 1):
    print(f"{i} {answer}")

Choose which line to learn from, or go back with "b".
1) Earth is the third planet from the Sun and the only astronomical object known to harbor life.
2) While large volumes of water can be found throughout the Solar System, only Earth sustains liquid surface water.
3) About 71% of Earth's surface is made up of the ocean, dwarfing Earth's polar ice, lakes, and rivers.
4) The remaining 29% of Earth's surface is land, consisting of continents and islands.
5) Earth's surface layer is formed of several slowly moving tectonic plates, interacting to produce mountain ranges, volcanoes, and earthquakes.
6) Earth's liquid outer core generates the magnetic field that shapes Earth's magnetosphere, deflecting destructive solar winds.
What line do you want to learn? 6

Which words would you like to hide?
1 Earth
2 's
3 liquid
4 outer
5 core
6 generates
7 the
8 magnetic
9 field
10 that
11 shapes
12 Earth
13 's
14 magnetosphere
15 ,
16 deflecting
17 destructive
18 solar
19 winds
20 .

Select a span (