# Homework
Produce a `summary` of the text below using the `NLP` libraries `nltk` and `SpaCy`.

    The Orbiter Discovery, OV-103, is considered eligible for listing in the National Register of Historic Places (NRHP) in the context of the U.S. Space Shuttle Program (1969-2011) under Criterion A in the areas of Space Exploration and Transportation and under Criterion C in the area of Engineering. Because it has achieved significance within the past fifty years, Criteria Consideration G applies. Under Criterion A, Discovery is significant as the oldest of the three extant orbiter vehicles constructed for the Space Shuttle Program (SSP), the longest running American space program to date; she was the third of five orbiters built by NASA. Unlike the Mercury, Gemini, and Apollo programs, the SSP’s emphasis was on cost effectiveness and reusability, and eventually the construction of a space station. Including her maiden voyage (launched August 30, 1984), Discovery flew to space thirty-nine times, more than any of the other four orbiters; she was also the first orbiter to fly twenty missions. She had the honor of being chosen as the Return to Flight vehicle after both the Challenger and Columbia accidents. Discovery was the first shuttle to fly with the redesigned SRBs, a result of the Challenger accident, and the first shuttle to fly with the Phase II and Block I SSME. Discovery also carried the Hubble Space Telescope to orbit and performed two of the five servicing missions to the observatory. She flew the first and last dedicated Department of Defense (DoD) missions, as well as the first unclassified defense-related mission. In addition, Discovery was vital to the construction of the International Space Station (ISS); she flew thirteen of the thirty-seven total missions flown to the station by a U.S. Space Shuttle. She was the first orbiter to dock to the ISS, and the first to perform an exchange of a resident crew. Under Criterion C, Discovery is significant as a feat of engineering. 
According to Wayne Hale, a flight director from Johnson Space Center, the Space Shuttle orbiter represents a “huge technological leap from expendable rockets and capsules to a reusable, winged, hypersonic, cargo-carrying spacecraft.” Although her base structure followed a conventional aircraft design, she used advanced materials that both minimized her weight for cargo-carrying purposes and featured low thermal expansion ratios, which provided a stable base for her Thermal Protection System (TPS) materials. The Space Shuttle orbiter also featured the first reusable TPS; all previous spaceflight vehicles had a single-use, ablative heat shield. Other notable engineering achievements of the orbiter included the first reusable orbital propulsion system, and the first two-fault-tolerant Integrated Avionics System. As Hale stated, the Space Shuttle remains “the largest, fastest, winged hypersonic aircraft in history,” having regularly flown at twenty-five times the speed of sound.

### Hint
First, import the required libraries. For `SpaCy`, this is done with:

    import spacy

Note that `NLTK` may need additional data to be downloaded, such as the stop-word list or the tokenizer models.

    import nltk
    from nltk.corpus import stopwords
    from nltk.tokenize import word_tokenize, sent_tokenize

Before working with `SpaCy`, load the language model you need. For English, for example, we can load the `"en_core_web_sm"` model (it has to be installed first, e.g. with `python -m spacy download en_core_web_sm`):

    nlp = spacy.load('en_core_web_sm')

### Text preparation
Before building a `text summary`, the text has to be prepared. This involves removing unneeded characters, tokenization (splitting the text into individual words or sentences), removing stop words (words that carry little meaning on their own) and, where needed, further processing such as stemming or lemmatization.

### Text to process
    text = "This is an example sentence for tokenization and lemmatization."

### Tokenization
    doc = nlp(text)
    tokens = [token.text for token in doc]
    print(tokens)

`NLTK` also provides text-processing utilities. With `word_tokenize`, `sent_tokenize`, and `stopwords`, we can get tokenized words and sentences as well as a list of stop words.

    tokens = word_tokenize(text)
    sentences = sent_tokenize(text)
    stop_words = set(stopwords.words('english'))

And don't forget the punctuation marks:

    from string import punctuation
    punctuation = punctuation + '\n'
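
The stop-word list and the punctuation string are only useful once they are applied to the tokens. The block below is a minimal sketch of that filtering step; `filter_tokens` and the sample values are hypothetical names introduced here for illustration, with hard-coded lists standing in for the output of `word_tokenize` and `stopwords.words('english')`.

```python
from string import punctuation


def filter_tokens(tokens, stop_words, punct=punctuation):
    """Drop stop words and tokens made up only of punctuation characters."""
    kept = []
    for token in tokens:
        if token.lower() in stop_words:
            continue
        # all(...) checks every character, so multi-character marks like "..." are dropped too
        if all(char in punct for char in token):
            continue
        kept.append(token)
    return kept


# Hard-coded stand-ins for word_tokenize(text) and stopwords.words('english')
sample_tokens = ["This", "is", "an", "example", "sentence", ",", "really", "..."]
sample_stop_words = {"this", "is", "an"}
filtered = filter_tokens(sample_tokens, sample_stop_words)
print(filtered)
```

Checking every character of a token, rather than testing whether the token is a substring of `punctuation`, is a design choice of this sketch: it also catches marks such as `...` that are not contiguous in the punctuation string.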

We can also count how often each word occurs in the text (keeping in mind that punctuation should be excluded before counting).

    word_frequencies = {}
    for word in doc:
      if word.text.lower() not in stop_words:
        if word.text.lower() not in punctuation:
          if word.text not in word_frequencies.keys():
            word_frequencies[word.text] = 1
          else:
            word_frequencies[word.text] += 1
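
The same counting can be sketched with `collections.Counter` from the standard library. This is an alternative under stated assumptions, not the notebook's method: `count_words` is a hypothetical helper, the token list is a hard-coded stand-in, and, unlike the loop above, the keys are lower-cased so that `Space` and `space` are counted together.

```python
from collections import Counter
from string import punctuation


def count_words(words, stop_words, punct=punctuation):
    """Count lower-cased words, skipping stop words and pure-punctuation tokens."""
    counts = Counter()
    for word in words:
        key = word.lower()
        if key in stop_words or all(char in punct for char in key):
            continue
        counts[key] += 1
    return counts


# Stand-in token list; with SpaCy this could be [token.text for token in doc]
sample_words = ["Space", "space", "Shuttle", ",", "the", "space", "."]
frequencies = count_words(sample_words, stop_words={"the"})
print(frequencies.most_common(2))
```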




Once the text is prepared and `SpaCy` or `NLTK` has given us the information we need, we can build the summary. One way is to pick the most important sentences from the text, ranked by their weight or by the frequency of the words they contain.
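
One possible way to turn word frequencies into sentence weights is to sum the frequencies of the words in each sentence. The sketch below assumes that scoring rule; `score_sentences` is a hypothetical helper, the inputs are made-up stand-ins, and the regex tokenization is a crude substitute for `word_tokenize` used only to keep the example dependency-free.

```python
import re


def score_sentences(sentences, word_frequencies):
    """Score each sentence by the summed frequency of the words it contains."""
    scores = {}
    for sentence in sentences:
        # Crude regex tokenization, used only to keep the sketch dependency-free
        words = re.findall(r"[a-z0-9']+", sentence.lower())
        total = sum(word_frequencies.get(word, 0) for word in words)
        if total > 0:
            scores[sentence] = total
    return scores


sample_sentences = ["Discovery flew to space.", "It was a feat.", "Space is vast."]
sample_frequencies = {"discovery": 2, "space": 3, "flew": 1}
sentence_scores = score_sentences(sample_sentences, sample_frequencies)
print(sentence_scores)
```

Matching whole tokens instead of substrings avoids counting a short word every time its letters appear inside a longer one. Note that summed scores favour long sentences; dividing by the sentence length is a common adjustment if that matters.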

### The `heapq` library
`heapq` is part of the `Python` standard library and provides functions for working with the `heap` data structure. One of them, `nlargest`, finds the largest elements of an iterable.

    from heapq import nlargest

`nlargest(n, iterable, key=None)` takes three arguments:

* `n` - the number of largest elements you want to get
* `iterable` - the iterable to pick the largest elements from
* `key` (optional) - a function of one argument that extracts the comparison key from each element (for example, `key=str.lower`)

`nlargest` returns a list of the `n` largest elements of `iterable`, ordered from largest to smallest. If `n` is greater than the length of `iterable`, it returns all the elements, sorted in descending order.
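
A short illustration of these three arguments, using made-up sample data:

```python
from heapq import nlargest

numbers = [4, 1, 7, 3]
top_two = nlargest(2, numbers)       # two largest values, largest first
everything = nlargest(10, numbers)   # n exceeds the length: all elements, descending

words = ["pear", "fig", "banana"]
longest = nlargest(1, words, key=len)  # compare by length instead of alphabetically

print(top_two)
print(everything)
print(longest)
```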

So `from heapq import nlargest` lets us use `nlargest` to find the largest elements of any iterable.

    select_length = int(len(sentence_tokens))
    summary = nlargest(select_length, sentence_scores, key = sentence_scores.get)
    summary

Here `nlargest` finds the `select_length` largest elements of the `sentence_scores` dictionary. The dictionary keys are sentences and the values are their scores or weights. Iterating over a dictionary yields its keys, and `key=sentence_scores.get` makes `nlargest` compare those keys by the value that `get` returns for each of them, i.e. the sentence's score. As a result, `summary` holds a list of the `select_length` highest-scoring sentences from `sentence_scores`, in descending order of score.
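
Because `nlargest` orders its result by score, the selected sentences may come out in a different order than they appear in the text. If the summary should read in source order, one option is to filter the original sentence list by the selected set. This is an optional refinement, not part of the assignment; `summarize_in_order` and the sample data are hypothetical.

```python
from heapq import nlargest


def summarize_in_order(sentences, sentence_scores, select_length):
    """Pick the top-scoring sentences, then return them in source order."""
    top = set(nlargest(select_length, sentence_scores, key=sentence_scores.get))
    return [sentence for sentence in sentences if sentence in top]


sample_sentences = ["A first.", "B second.", "C third."]
sample_scores = {"A first.": 2, "B second.": 1, "C third.": 5}

by_score = nlargest(2, sample_scores, key=sample_scores.get)
in_order = summarize_in_order(sample_sentences, sample_scores, 2)
print(by_score)
print(in_order)
```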

## Imports

In [1]:
import numpy as np
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize, sent_tokenize
import spacy
from string import punctuation
from heapq import nlargest

## Load language model

In [2]:
nlp = spacy.load('en_core_web_sm')

## Define text for summary

In [3]:
text = "The Orbiter Discovery, OV-103, is considered eligible for listing in the National Register of Historic Places (NRHP) in the context of the U.S. Space Shuttle Program (1969-2011) under Criterion A in the areas of Space Exploration and Transportation and under Criterion C in the area of Engineering. Because it has achieved significance within the past fifty years, Criteria Consideration G applies. Under Criterion A, Discovery is significant as the oldest of the three extant orbiter vehicles constructed for the Space Shuttle Program (SSP), the longest running American space program to date; she was the third of five orbiters built by NASA. Unlike the Mercury, Gemini, and Apollo programs, the SSP’s emphasis was on cost effectiveness and reusability, and eventually the construction of a space station. Including her maiden voyage (launched August 30, 1984), Discovery flew to space thirty-nine times, more than any of the other four orbiters; she was also the first orbiter to fly twenty missions. She had the honor of being chosen as the Return to Flight vehicle after both the Challenger and Columbia accidents. Discovery was the first shuttle to fly with the redesigned SRBs, a result of the Challenger accident, and the first shuttle to fly with the Phase II and Block I SSME. Discovery also carried the Hubble Space Telescope to orbit and performed two of the five servicing missions to the observatory. She flew the first and last dedicated Department of Defense (DoD) missions, as well as the first unclassified defense-related mission. In addition, Discovery was vital to the construction of the International Space Station (ISS); she flew thirteen of the thirty-seven total missions flown to the station by a U.S. Space Shuttle. She was the first orbiter to dock to the ISS, and the first to perform an exchange of a resident crew. Under Criterion C, Discovery is significant as a feat of engineering. 
According to Wayne Hale, a flight director from Johnson Space Center, the Space Shuttle orbiter represents a “huge technological leap from expendable rockets and capsules to a reusable, winged, hypersonic, cargo-carrying spacecraft.” Although her base structure followed a conventional aircraft design, she used advanced materials that both minimized her weight for cargo-carrying purposes and featured low thermal expansion ratios, which provided a stable base for her Thermal Protection System (TPS) materials. The Space Shuttle orbiter also featured the first reusable TPS; all previous spaceflight vehicles had a single-use, ablative heat shield. Other notable engineering achievements of the orbiter included the first reusable orbital propulsion system, and the first two-fault-tolerant Integrated Avionics System. As Hale stated, the Space Shuttle remains “the largest, fastest, winged hypersonic aircraft in history,” having regularly flown at twenty-five times the speed of sound."

## Prepare tokens

In [None]:
nltk.download('punkt')
nltk.download('stopwords')

In [5]:
doc = nlp(text)
tokens = word_tokenize(text)
sentences = sent_tokenize(text)
stoplist = set(stopwords.words('english'))
punctuation_symbols = punctuation + '“”'

In [6]:
print(tokens)

['The', 'Orbiter', 'Discovery', ',', 'OV-103', ',', 'is', 'considered', 'eligible', 'for', 'listing', 'in', 'the', 'National', 'Register', 'of', 'Historic', 'Places', '(', 'NRHP', ')', 'in', 'the', 'context', 'of', 'the', 'U.S.', 'Space', 'Shuttle', 'Program', '(', '1969-2011', ')', 'under', 'Criterion', 'A', 'in', 'the', 'areas', 'of', 'Space', 'Exploration', 'and', 'Transportation', 'and', 'under', 'Criterion', 'C', 'in', 'the', 'area', 'of', 'Engineering', '.', 'Because', 'it', 'has', 'achieved', 'significance', 'within', 'the', 'past', 'fifty', 'years', ',', 'Criteria', 'Consideration', 'G', 'applies', '.', 'Under', 'Criterion', 'A', ',', 'Discovery', 'is', 'significant', 'as', 'the', 'oldest', 'of', 'the', 'three', 'extant', 'orbiter', 'vehicles', 'constructed', 'for', 'the', 'Space', 'Shuttle', 'Program', '(', 'SSP', ')', ',', 'the', 'longest', 'running', 'American', 'space', 'program', 'to', 'date', ';', 'she', 'was', 'the', 'third', 'of', 'five', 'orbiters', 'built', 'by', 'NAS

In [7]:
print(sentences)

['The Orbiter Discovery, OV-103, is considered eligible for listing in the National Register of Historic Places (NRHP) in the context of the U.S. Space Shuttle Program (1969-2011) under Criterion A in the areas of Space Exploration and Transportation and under Criterion C in the area of Engineering.', 'Because it has achieved significance within the past fifty years, Criteria Consideration G applies.', 'Under Criterion A, Discovery is significant as the oldest of the three extant orbiter vehicles constructed for the Space Shuttle Program (SSP), the longest running American space program to date; she was the third of five orbiters built by NASA.', 'Unlike the Mercury, Gemini, and Apollo programs, the SSP’s emphasis was on cost effectiveness and reusability, and eventually the construction of a space station.', 'Including her maiden voyage (launched August 30, 1984), Discovery flew to space thirty-nine times, more than any of the other four orbiters; she was also the first orbiter to fly

In [8]:
print(stoplist)

{'because', 'theirs', 'than', 'shan', 'what', 'hasn', 'there', 'nor', 'those', "needn't", 'further', 'other', 'not', 'against', 'himself', 'with', 'have', 'myself', 'most', 'were', 'some', 'off', "you'd", 'am', "mustn't", "doesn't", 'isn', 'couldn', 'do', "should've", 'is', 'the', 's', 'as', 'a', 'will', 'll', 'herself', 'was', "you'll", 'each', 'before', 'then', 'hadn', 'doing', "isn't", 'why', 'y', 'this', 'until', 'down', 'your', 'on', 'all', "weren't", "that'll", 'weren', 'ma', "won't", 'just', 'over', 'that', 'shouldn', 'i', 'no', 'should', 'an', "don't", 'did', 'mightn', 'he', 'her', 't', 'aren', 'ours', 'into', 'very', 'doesn', 'during', 've', "haven't", 'how', 'of', "it's", 'own', 'or', 'under', 'and', 'they', 'out', "wasn't", 'few', "didn't", 'their', 'it', 'both', 'didn', 'about', 'them', 'wouldn', 'from', 'by', 'so', 'same', 'up', 'wasn', 'our', 'she', 'these', 'won', 'ourselves', 'below', 'more', 'at', 'ain', 'you', 'can', 'which', "you've", 'such', 'but', 'now', 'we', 'you

In [9]:
print(punctuation_symbols)

!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~“”


In [10]:
word_frequencies = {}
for word in doc:
  if word.text.lower() not in stoplist:
    if word.text.lower() not in punctuation_symbols:
      if word.text not in word_frequencies.keys():
        word_frequencies[word.text] = 1
      else:
        word_frequencies[word.text] += 1

In [11]:
print(word_frequencies)

{'Orbiter': 1, 'Discovery': 7, 'OV-103': 1, 'considered': 1, 'eligible': 1, 'listing': 1, 'National': 1, 'Register': 1, 'Historic': 1, 'Places': 1, 'NRHP': 1, 'context': 1, 'U.S.': 2, 'Space': 10, 'Shuttle': 6, 'Program': 2, '1969': 1, '2011': 1, 'Criterion': 4, 'areas': 1, 'Exploration': 1, 'Transportation': 1, 'C': 2, 'area': 1, 'Engineering': 1, 'achieved': 1, 'significance': 1, 'within': 1, 'past': 1, 'fifty': 1, 'years': 1, 'Criteria': 1, 'Consideration': 1, 'G': 1, 'applies': 1, 'significant': 2, 'oldest': 1, 'three': 1, 'extant': 1, 'orbiter': 6, 'vehicles': 2, 'constructed': 1, 'SSP': 2, 'longest': 1, 'running': 1, 'American': 1, 'space': 3, 'program': 1, 'date': 1, 'third': 1, 'five': 3, 'orbiters': 2, 'built': 1, 'NASA': 1, 'Unlike': 1, 'Mercury': 1, 'Gemini': 1, 'Apollo': 1, 'programs': 1, '’s': 1, 'emphasis': 1, 'cost': 1, 'effectiveness': 1, 'reusability': 1, 'eventually': 1, 'construction': 2, 'station': 2, 'Including': 1, 'maiden': 1, 'voyage': 1, 'launched': 1, 'August'

In [12]:
select_length = 10
summary_words = nlargest(select_length, word_frequencies, key = word_frequencies.get)

In [13]:
print(summary_words)

['Space', 'first', 'Discovery', 'Shuttle', 'orbiter', 'Criterion', 'missions', 'space', 'five', 'flew']


In [14]:
top_10_words_score = np.sort([value for value in word_frequencies.values()])[-10:]

In [15]:
mean_word_frequency = np.mean(top_10_words_score)

In [17]:
sentence_scores = {}
for sentence in sentences:
  for word, freq in word_frequencies.items():
    if word.lower() in sentence.lower():
      if freq > mean_word_frequency:
        if sentence not in sentence_scores.keys():
          sentence_scores[sentence] = 1
        else:
          sentence_scores[sentence] += 1

In [18]:
print(sentence_scores)

{'The Orbiter Discovery, OV-103, is considered eligible for listing in the National Register of Historic Places (NRHP) in the context of the U.S. Space Shuttle Program (1969-2011) under Criterion A in the areas of Space Exploration and Transportation and under Criterion C in the area of Engineering.': 4, 'Under Criterion A, Discovery is significant as the oldest of the three extant orbiter vehicles constructed for the Space Shuttle Program (SSP), the longest running American space program to date; she was the third of five orbiters built by NASA.': 4, 'Unlike the Mercury, Gemini, and Apollo programs, the SSP’s emphasis was on cost effectiveness and reusability, and eventually the construction of a space station.': 1, 'Including her maiden voyage (launched August 30, 1984), Discovery flew to space thirty-nine times, more than any of the other four orbiters; she was also the first orbiter to fly twenty missions.': 4, 'Discovery was the first shuttle to fly with the redesigned SRBs, a res

In [19]:
select_length = 3
summary_sentences = nlargest(select_length, sentence_scores, key=sentence_scores.get)

In [20]:
for sentence in summary_sentences:
  print(sentence)

The Orbiter Discovery, OV-103, is considered eligible for listing in the National Register of Historic Places (NRHP) in the context of the U.S. Space Shuttle Program (1969-2011) under Criterion A in the areas of Space Exploration and Transportation and under Criterion C in the area of Engineering.
Under Criterion A, Discovery is significant as the oldest of the three extant orbiter vehicles constructed for the Space Shuttle Program (SSP), the longest running American space program to date; she was the third of five orbiters built by NASA.
Including her maiden voyage (launched August 30, 1984), Discovery flew to space thirty-nine times, more than any of the other four orbiters; she was also the first orbiter to fly twenty missions.
