# process

1. Clean the text.
2. Split the text into sentences.
3. Calculate the weighted frequency of each word (number of times the word occurs / number of times the most frequent word occurs).
4. Replace the words in the original sentences with their weighted frequencies.
5. Sort the sentences in descending order of the sum of their weighted frequencies.
6. Pick the top sentences as the summary.
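
As a small worked illustration of step 3, the sketch below computes weighted frequencies for a toy text. The text, the tiny stopword set, and the `weighted_frequencies` helper are assumptions made for illustration only; the notebook itself uses NLTK's tokenizer and English stopword list.

```python
import re
from collections import Counter

# Hypothetical miniature stopword set; the notebook uses NLTK's English list.
TOY_STOPWORDS = {"the", "a", "on", "and"}

def weighted_frequencies(text, stopwords=TOY_STOPWORDS):
    """Return {word: count / highest count} for the non-stopwords in text."""
    words = [w for w in re.findall(r"[a-z]+", text.lower()) if w not in stopwords]
    counts = Counter(words)
    if not counts:
        return {}
    top = max(counts.values())
    return {word: count / top for word, count in counts.items()}

toy_weights = weighted_frequencies("The tent stood on the slope. The tent was empty.")
print(toy_weights)
```

Here `tent` occurs twice, more than any other word, so its weight is 1.0; every word that occurs once gets 0.5.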

In [5]:
import bs4 as bs
import urllib.request
import re

scraped_data = urllib.request.urlopen('https://en.wikipedia.org/wiki/Dyatlov_Pass_incident')

article = scraped_data.read()

parsed_article = bs.BeautifulSoup(article, 'lxml')

paragraphs = parsed_article.find_all('p')

article_text = ''

for p in paragraphs:
    article_text += p.text

In [7]:
# preprocessing
# remove bracketed citation markers such as [12] and collapse extra whitespace
article_text = re.sub(r'\[[0-9]*\]', ' ', article_text)
article_text = re.sub(r'\s+', ' ', article_text)

# remove special chars and digits
formatted_article_text = re.sub('[^a-zA-Z]', ' ', article_text)
formatted_article_text = re.sub(r'\s+', ' ', formatted_article_text)

In [9]:
# converting text to sentences
import nltk
from nltk import sent_tokenize, word_tokenize
sentence_list = nltk.sent_tokenize(article_text)

stopwords = nltk.corpus.stopwords.words('english')

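# count how often each non-stopword occurs; lowercase first so the keys
# match the stopword list and the lowercased tokens used for sentence scoring
word_frequencies = {}
for word in nltk.word_tokenize(formatted_article_text.lower()):
    if word not in stopwords:
        word_frequencies[word] = word_frequencies.get(word, 0) + 1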

In [10]:
# weighted frequency of words

maximum_frequency = max(word_frequencies.values())

for word in word_frequencies.keys():
    word_frequencies[word] = (word_frequencies[word]/maximum_frequency)

In [13]:
# sentence scores

sentence_scores = {}

for sent in sentence_list:
    # only score sentences shorter than 30 words
    if len(sent.split(' ')) < 30:
        for word in nltk.word_tokenize(sent.lower()):
            if word in word_frequencies:
                if sent not in sentence_scores:
                    sentence_scores[sent] = word_frequencies[word]
                else:
                    sentence_scores[sent] += word_frequencies[word]
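
The scoring step can also be read as a small function. The sketch below is a rewrite for illustration, not the notebook's own code: `score_sentences`, the regex tokenizer, and the toy weights and sentences are all assumptions, chosen so the example runs without NLTK data.

```python
import re

def score_sentences(sentences, weights, max_words=30):
    """Sum the weights of known words in each sentence shorter than max_words."""
    scores = {}
    for sent in sentences:
        if len(sent.split(" ")) >= max_words:
            continue  # long sentences are left out of the scores entirely
        for word in re.findall(r"[a-z]+", sent.lower()):
            if word in weights:
                scores[sent] = scores.get(sent, 0) + weights[word]
    return scores

# Toy inputs, made up for this example.
toy_weights = {"tent": 1.0, "slope": 0.5, "empty": 0.5}
toy_sentences = [
    "The tent stood on the slope.",
    "The tent was empty.",
    "Nobody knows why.",
]
toy_scores = score_sentences(toy_sentences, toy_weights)
print(toy_scores)
```

Because a score is a sum over words, longer sentences tend to score higher, which is one reason to cap sentence length. A sentence containing no weighted word, like the third one here, gets no entry at all.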

In [14]:
# summary

import heapq
summary_sentences = heapq.nlargest(10, sentence_scores, key=sentence_scores.get)

summary = ' '.join(summary_sentences)

print(summary)

The foundation's stated aim is to continue investigation of the case and to maintain the Dyatlov Museum to preserve the memory of the dead hikers. There are indeed records of parachute mines being tested by the Soviet military in the area around the time the hikers were there. On 26 February, the searchers found the group's abandoned and badly damaged tent on Kholat Syakhl. A legal inquest started immediately after the first five bodies were found. The goal of the expedition was to reach Gora Otorten (Гора Отортен), a mountain 10 kilometres (6.2 mi) north of the site of the incident. Diaries and cameras found around their last campsite made it possible to track the group's route up to the day preceding the incident. An examination of the four bodies which were found in May shifted the narrative as to what had occurred during the incident. In 1967, Sverdlovsk writer and journalist Yuri Yarovoi (Russian: Юрий Яровой) published the novel Of the Highest Degree of Complexity, inspired by th
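
`heapq.nlargest` returns sentences from highest to lowest score, so the summary above does not follow the order of the article. If document order is preferred, one option is to select the top sentences first and then filter the original sentence list. This is a sketch of that variation, not part of the notebook; `top_sentences_in_order` and the toy data are hypothetical.

```python
import heapq

def top_sentences_in_order(sentences, scores, n=10):
    """Pick the n highest-scoring sentences, then restore document order."""
    chosen = set(heapq.nlargest(n, scores, key=scores.get))
    return [sent for sent in sentences if sent in chosen]

# Toy inputs, made up for this example.
toy_sentences = ["First point.", "Minor aside.", "Key finding.", "Closing remark."]
toy_scores = {
    "First point.": 0.9,
    "Minor aside.": 0.1,
    "Key finding.": 1.4,
    "Closing remark.": 0.3,
}
ordered = top_sentences_in_order(toy_sentences, toy_scores, n=2)
print(" ".join(ordered))
```

In the notebook this would be called with `sentence_list` and `sentence_scores` in place of the toy values.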