
# WikiScrape Summarizer



## Wikipedia API

The `wikipedia` Python package wraps the MediaWiki API for simple lookups. For scraping projects or heavy automated requests, consider alternatives such as Pywikipediabot or the MediaWiki API itself, which offer more features. Commonly used calls:

* wikipedia.search('keywords', results=2)
* wikipedia.suggest('keyword')
* wikipedia.summary('keywords', sentences=2)
* wikipedia.page('keywords')
* wikipedia.page('keywords').content
* wikipedia.page('keywords').references
* wikipedia.page('keywords').title
* wikipedia.page('keywords').url
* wikipedia.page('keywords').categories
* wikipedia.page('keywords').links
* wikipedia.geosearch(33.2075, 97.1526)
* wikipedia.set_lang('hi')
* wikipedia.languages()
* wikipedia.page('keywords').images[0]
* wikipedia.page('keywords').html()
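
`wikipedia.search` returns a list of titles, and that list can be empty, so indexing the first hit directly may raise an `IndexError`. The sketch below shows one way to guard the lookup. The helper `first_title` and the stand-in `fake_search` (with its tiny made-up index) are hypothetical names introduced only so the example runs offline; they are not part of the `wikipedia` package.

```python
from typing import Callable, List, Optional


def first_title(keyword: str, search: Callable[..., List[str]]) -> Optional[str]:
    """Return the top search hit for keyword, or None when nothing matches."""
    hits = search(keyword, results=1)
    return hits[0] if hits else None


def fake_search(keyword, results=1):
    """Offline stand-in for wikipedia.search (hypothetical data, illustration only)."""
    index = {'state religion': ['State religion', 'Secular state']}
    return index.get(keyword.lower(), [])[:results]


print(first_title('State religion', fake_search))  # a title string
print(first_title('no such page', fake_search))    # None, instead of an IndexError
```

With the package installed and a network connection available, passing `wikipedia.search` in place of `fake_search` should behave the same way.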

## Beautiful Soup

In [None]:
# pip install wikipedia

In [None]:
# https://kleiber.me/blog/2017/07/22/tutorial-lda-wikipedia/
import pandas as pd
import random
import wikipedia

# rtitles = wikipedia.random(5)

# get 5 Wikipedia page titles based on keywords
titles = []
keywords = ['ultranationalism', 'religion', 'religious facism', 'state religion', 'deifying rulers']
for key in keywords:
    results = wikipedia.search(key, results=5)
    if results:  # skip keywords that return no hits
        titles.append(results[0])

# print(titles)
data = []

for title in titles:
    url_title = title.strip().replace(' ', '_')
    url = f'https://en.wikipedia.org/wiki/{url_title}'
    try:
        # to also fetch content and summary, swap in the line below and use the four-column DataFrame
        # data.append([title, url, wikipedia.page(title, auto_suggest=False).content, wikipedia.summary(title, auto_suggest=False, sentences=15)])
        data.append([title, url])
    except wikipedia.exceptions.DisambiguationError as e:
        # ambiguous title: fall back to a randomly chosen option from the disambiguation page
        s = random.choice(e.options)
        s_url = 'https://en.wikipedia.org/wiki/' + s.strip().replace(' ', '_')
        data.append([s, s_url])

# df = pd.DataFrame(data, columns=['title', 'url', 'content', 'summary'])
pages = pd.DataFrame(data, columns=['title', 'url'])
pages.head()

Unnamed: 0,title,url
0,Ultranationalism,https://en.wikipedia.org/wiki/Ultranationalism
1,Religion,https://en.wikipedia.org/wiki/Religion
2,Fascism,https://en.wikipedia.org/wiki/Fascism
3,State religion,https://en.wikipedia.org/wiki/State_religion
4,Apotheosis,https://en.wikipedia.org/wiki/Apotheosis
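
Replacing spaces with underscores is enough for the five titles above, but titles with non-ASCII or reserved characters need percent-encoding to form a valid URL. A minimal sketch, assuming the standard library's `urllib.parse.quote` is acceptable here; `wiki_url` is a hypothetical helper name, not something the notebook defines.

```python
from urllib.parse import quote


def wiki_url(title, lang='en'):
    """Build an article URL from a page title (spaces become underscores)."""
    slug = title.strip().replace(' ', '_')
    # quote() percent-encodes anything outside the URL-safe character set
    return f'https://{lang}.wikipedia.org/wiki/{quote(slug)}'


print(wiki_url('State religion'))   # https://en.wikipedia.org/wiki/State_religion
print(wiki_url('Émile Durkheim'))   # https://en.wikipedia.org/wiki/%C3%89mile_Durkheim
```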


In [None]:
# wikiscrape
from bs4 import BeautifulSoup
import pandas as pd
import requests

data = []

def make_soup(page):
  # fetch the page and collect the text of every <h2> section heading
  soup = BeautifulSoup(requests.get(page['url']).text, 'html.parser')
  s = soup.find_all('h2')
  # strip the '[edit]' link text that Wikipedia appends to headings
  data.extend([[page['title'], page['url'], x.get_text().replace('[edit]', '')] for x in s])

x = pages.apply(make_soup, axis=1)
headings = pd.DataFrame(data, columns=['title', 'url', 'heading'])
drop_list = ['Contents', 'See also', 'References', 'External links', 'Notes', 'Sources', 'Further reading', 'Bibliography']
headings = headings[~headings['heading'].isin(drop_list)]
print(headings.shape)
headings.head()

(33, 3)


Unnamed: 0,title,url,heading
1,Ultranationalism,https://en.wikipedia.org/wiki/Ultranationalism,Background concepts and broader context
2,Ultranationalism,https://en.wikipedia.org/wiki/Ultranationalism,Historical movements and analysis
3,Ultranationalism,https://en.wikipedia.org/wiki/Ultranationalism,Ultranationalist political parties
4,Ultranationalism,https://en.wikipedia.org/wiki/Ultranationalism,Ultranationalist organizations
5,Ultranationalism,https://en.wikipedia.org/wiki/Ultranationalism,Ultranationalist terrorism
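
To see what the heading extraction and the `drop_list` filter do without touching the network, the sketch below runs the same idea on an inline HTML string. It swaps BeautifulSoup for the standard library's `html.parser` purely to stay dependency-free; `HeadingCollector` and `SAMPLE_HTML` are hypothetical, and the sample markup is only an approximation of Wikipedia's heading structure.

```python
from html.parser import HTMLParser


class HeadingCollector(HTMLParser):
    """Collect the text of every heading with the given tag name."""

    def __init__(self, tag='h2'):
        super().__init__()
        self.tag = tag
        self.in_heading = False
        self.parts = []
        self.headings = []

    def handle_starttag(self, tag, attrs):
        if tag == self.tag:
            self.in_heading = True
            self.parts = []

    def handle_endtag(self, tag):
        if tag == self.tag and self.in_heading:
            self.in_heading = False
            # drop the '[edit]' link text, as the notebook does
            text = ''.join(self.parts).replace('[edit]', '').strip()
            self.headings.append(text)

    def handle_data(self, data):
        if self.in_heading:
            self.parts.append(data)


# Made-up markup for illustration only.
SAMPLE_HTML = (
    '<h2><span id="History">History</span><span>[edit]</span></h2>'
    '<p>Some text.</p>'
    '<h2>See also</h2>'
)

collector = HeadingCollector('h2')
collector.feed(SAMPLE_HTML)

drop_list = ['Contents', 'See also', 'References', 'External links']
kept = [h for h in collector.headings if h not in drop_list]
print(collector.headings)  # ['History', 'See also']
print(kept)                # ['History']
```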


In [None]:
headings['title'].value_counts()

title
Religion            8
Apotheosis          8
Ultranationalism    6
Fascism             6
State religion      5
Name: count, dtype: int64

In [None]:
import re

CLEANR = re.compile('<.*?>')
def cleanhtml(raw_html):
  cleantext = re.sub(CLEANR, '', raw_html)
  return cleantext

data = []
def get_subs(row):
  heading1 = row['heading']
  title = row['title']
  url = row['url']
  soup = BeautifulSoup(requests.get(url).text, 'html.parser')
  txt = ''   # all markup under this h2 (used to find h3 subheadings)
  txt1 = ''  # paragraph markup only (used when there are no subheadings)
  # assumes each heading is wrapped as <h2><span id="Heading_text">...</span></h2>
  target = soup.find('span', attrs={'id': heading1.replace(' ', '_')}).parent
  for sib in target.find_next_siblings():
      if sib.name=='h2':
          break
      else:
          txt += str(sib)
          if sib.name=='p':
            txt1 += str(sib)

  soup2 = BeautifulSoup(txt, 'html.parser')
  s = soup2.find_all('h3')
  s_list2 = [x.get_text().replace('[edit]', '') for x in s]
  if len(s_list2) > 0:
    for heading2 in s_list2:
      txt=''
      target2 = soup.find('span', attrs={'id': heading2.replace(' ', '_')}).parent
      for sib in target2.find_next_siblings():
          # stop at the next subsection or the next top-level section
          if sib.name in ('h2', 'h3'):
              break
          if sib.name=='p':
              txt += sib.text

      data.append([title, url, heading1, heading2, cleanhtml(txt)])
  else:
      data.append([title, url, heading1, 'None', cleanhtml(txt1)])

x = headings.apply(get_subs, axis=1)
df = pd.DataFrame(data, columns=['title', 'url', 'heading', 'subheading', 'txt'])
print(df.shape)
df.head()

(92, 5)


Unnamed: 0,title,url,heading,subheading,txt
0,Ultranationalism,https://en.wikipedia.org/wiki/Ultranationalism,Background concepts and broader context,,British political theorist Roger Griffin has s...
1,Ultranationalism,https://en.wikipedia.org/wiki/Ultranationalism,Historical movements and analysis,,American historian Walter Skya has written in ...
2,Ultranationalism,https://en.wikipedia.org/wiki/Ultranationalism,Ultranationalist political parties,Currently represented in national governments ...,The following political parties have been char...
3,Ultranationalism,https://en.wikipedia.org/wiki/Ultranationalism,Ultranationalist political parties,Represented parties with former ultranationali...,The following political parties historically h...
4,Ultranationalism,https://en.wikipedia.org/wiki/Ultranationalism,Ultranationalist political parties,Formerly represented in national governments o...,Arising out of strident Sri Lankan Tamil natio...


## LDA (Latent Dirichlet Allocation)

In natural language processing, latent Dirichlet allocation (LDA) is a Bayesian network (and, therefore, a generative statistical model) for modeling automatically extracted topics in textual corpora. LDA is an example of a Bayesian topic model: observations (e.g., words) are collected into documents, and each word's presence is attributable to one of the document's topics. Each document is assumed to contain a small number of topics.

Sources:
 * https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation
 * https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.LatentDirichletAllocation.html
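
The generative story described above can be sketched in a few lines of standard-library Python: draw a topic mixture for the document from a Dirichlet distribution, then, for each word, pick a topic from that mixture and a word from that topic. This is a simplified illustration, not the scikit-learn implementation. The vocabulary and probabilities in `TOPIC_WORDS` are made up, the function names are hypothetical, and the topic-word distributions are fixed here, whereas full LDA also draws them from a Dirichlet prior.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw one Dirichlet sample by normalizing independent gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def generate_document(topic_words, alpha, n_words, rng):
    """Return the document's topic mixture and a list of (topic, word) pairs."""
    theta = sample_dirichlet(alpha, rng)  # this document's topic proportions
    topics = list(topic_words)
    doc = []
    for _ in range(n_words):
        # each word is attributed to exactly one of the document's topics
        topic = rng.choices(topics, weights=theta, k=1)[0]
        vocab, probs = zip(*topic_words[topic].items())
        word = rng.choices(vocab, weights=probs, k=1)[0]
        doc.append((topic, word))
    return theta, doc


# Hypothetical two-topic vocabulary, for illustration only.
TOPIC_WORDS = {
    'religion': {'church': 0.5, 'faith': 0.3, 'state': 0.2},
    'politics': {'party': 0.5, 'state': 0.3, 'nation': 0.2},
}

rng = random.Random(0)
theta, doc = generate_document(TOPIC_WORDS, alpha=[0.5, 0.5], n_words=8, rng=rng)
print([round(t, 3) for t in theta])
print(doc)
```

Fitting LDA runs this story in reverse: given only the words, it infers plausible topic mixtures and topic-word distributions.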

In [None]:
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import TfidfVectorizer

components = 10  # number of topics to fit
topics = 10      # number of top words to keep per topic

vectorizer = TfidfVectorizer(stop_words='english')
vectors = vectorizer.fit_transform(df['txt'].values.astype('U'))

# no random_state is set, so the fitted topics change from run to run
model = LatentDirichletAllocation(n_components=components)
model.fit(vectors)

feature_names = vectorizer.get_feature_names_out()
topics_dictionary = {}
for index, topic in enumerate(model.components_):
    # argsort is ascending, so the last `topics` indices are the highest-weighted words
    top_words = [feature_names[i] for i in topic.argsort()[-topics:]]
    print(f'Topic {index} top words: {top_words}')
    topics_dictionary[index] = top_words



Topic 0 top words: ['opposite', 'agnostic', 'describes', 'jewish', 'capitalism', 'definition', 'particularly', 'atheistic', 'judaism', 'religions']
Topic 1 top words: ['specific', 'holds', 'recognized', 'special', 'status', 'established', 'islam', 'state', 'religion', 'countries']
Topic 2 top words: ['converts', 'case', 'services', 'interfaith', 'dialogue', 'established', 'ethnic', 'religions', 'state', 'church']
Topic 3 top words: ['lully', 'jurisdictions', 'science', 'wrote', 'entitled', 'superstition', 'poem', 'criticism', 'apotheosis', 'religion']
Topic 4 top words: ['countries', 'called', 'ideology', 'government', 'women', 'fasces', 'political', '135', 'culture', 'sponsored']
Topic 5 top words: ['classification', 'theories', 'cognitive', 'characterised', 'good', 'study', 'tibetan', 'wealth', 'yuan', 'morality']
Topic 6 top words: ['world', 'religions', 'italy', 'religious', 'mussolini', 'italian', 'religion', 'political', 'fascist', 'fascism']
Topic 7 top words: ['state', 'reason'

In [None]:
def get_topics(row):
  # join the top words of the row's dominant topic into one string
  return ', '.join(topics_dictionary[row.topic_idx])

topic_results = model.transform(vectors)
df['topic_idx'] = topic_results.argmax(axis=1)

df['topics']= df.apply(get_topics, axis=1)
df.head()

Unnamed: 0,title,url,heading,subheading,txt,topic_idx,topics
0,Ultranationalism,https://en.wikipedia.org/wiki/Ultranationalism,Background concepts and broader context,,British political theorist Roger Griffin has s...,2,"converts, case, services, interfaith, dialogue..."
1,Ultranationalism,https://en.wikipedia.org/wiki/Ultranationalism,Historical movements and analysis,,American historian Walter Skya has written in ...,0,"opposite, agnostic, describes, jewish, capital..."
2,Ultranationalism,https://en.wikipedia.org/wiki/Ultranationalism,Ultranationalist political parties,Currently represented in national governments ...,The following political parties have been char...,6,"world, religions, italy, religious, mussolini,..."
3,Ultranationalism,https://en.wikipedia.org/wiki/Ultranationalism,Ultranationalist political parties,Represented parties with former ultranationali...,The following political parties historically h...,6,"world, religions, italy, religious, mussolini,..."
4,Ultranationalism,https://en.wikipedia.org/wiki/Ultranationalism,Ultranationalist political parties,Formerly represented in national governments o...,Arising out of strident Sri Lankan Tamil natio...,6,"world, religions, italy, religious, mussolini,..."


## spaCy

* https://spacy.io/
* https://medium.com/analytics-vidhya/text-summarization-using-spacy-ca4867c6b744

In [None]:
# download the small English model (only needed once per environment)
import spacy.cli

spacy.cli.download('en_core_web_sm')

✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


## Language Model and Pipelines

The examples below use `en_core_web_sm`, spaCy's small English pipeline.

* https://www.kdnuggets.com/2021/03/natural-language-processing-pipelines-explained.html
* https://spacy.io/usage/spacy-101
* https://en.wikipedia.org/wiki/Language_model
* https://builtin.com/data-science/beginners-guide-language-models

In [None]:
import spacy
from spacy.lang.en.stop_words import STOP_WORDS
from string import punctuation
from collections import Counter
from heapq import nlargest

nlp = spacy.load('en_core_web_sm')

In [None]:
# get example text
import textwrap

textwrap.fill(df.iloc[0]['txt'])

'British political theorist Roger Griffin has stated that\nultranationalism is essentially founded on xenophobia in a way that\nfinds supposed legitimacy "through deeply mythicized narratives of\npast cultural or political periods of historical greatness or of old\nscores to settle against alleged enemies". It can also draw on\n"vulgarized forms" of different aspects of the natural sciences such\nas anthropology and genetics, eugenics specifically playing a role, in\norder "to rationalize ideas of national superiority and destiny, of\ndegeneracy and subhumanness" in Griffin\'s opinion. Ultranationalists\nview the modern nation-state as, according to Griffin, a living\norganism directly akin to a physical person such that it can decay,\ngrow, die, and additionally experience rebirth. He has highlighted\nNazi Germany as a regime which was founded on ultranationalism.[3]\nUltranationalist activism can adopt varying attitudes towards\nhistorical traditions within the populace. For instance

In [None]:
import textwrap
import re

data = []  # [strength, sentence] pairs for the optional DataFrame at the end of this cell
# to score every row instead of only the first one:
# summary_text = ' '.join([re.sub(r"\[.*?\]", "", txt) for txt in df.txt])
# doc = nlp(summary_text)

# strip bracketed citation markers such as [3]
summary_text = re.sub(r"\[.*?\]", "", df.iloc[0]['txt'])
doc = nlp(summary_text)

# keep content words only: no stop words, no punctuation, selected parts of speech
keyword = []
pos_tag = ['PROPN', 'ADJ', 'NOUN', 'VERB']
for token in doc:
    if token.text in STOP_WORDS or token.text in punctuation:
        continue
    if token.pos_ in pos_tag:
        keyword.append(token.text)

# normalize each count by the count of the most frequent keyword
freq_word = Counter(keyword)
max_freq = freq_word.most_common(1)[0][1]
for word in freq_word.keys():
    freq_word[word] = freq_word[word] / max_freq

# score each sentence by summing the normalized frequencies of its keywords
sent_strength = {}
for sent in doc.sents:
    for word in sent:
        if word.text in freq_word.keys():
            if sent in sent_strength.keys():
                sent_strength[sent] += freq_word[word.text]
            else:
                sent_strength[sent] = freq_word[word.text]

    # sentences without any keyword never enter sent_strength, so report them as 0
    score = sent_strength.get(sent, 0)
    data.append([score, str(sent)])
    print(score)
    print(textwrap.fill(str(sent)))
    print()

# summary = nlargest(10, sent_strength, key=sent_strength.get)
# summary = ' '.join([w.text for w in summary])
# print(textwrap.fill(summary, 100))
# df2 = pd.DataFrame(data, columns=['strength', 'txt'])
# df2.sort_values(by=['strength'], ascending=False).head()

12.0
British political theorist Roger Griffin has stated that
ultranationalism is essentially founded on xenophobia in a way that
finds supposed legitimacy "through deeply mythicized narratives of
past cultural or political periods of historical greatness or of old
scores to settle against alleged enemies".

8.999999999999998
It can also draw on "vulgarized forms" of different aspects of the
natural sciences such as anthropology and genetics, eugenics
specifically playing a role, in order "to rationalize ideas of
national superiority and destiny, of degeneracy and subhumanness" in
Griffin's opinion.

6.999999999999997
Ultranationalists view the modern nation-state as, according to
Griffin, a living organism directly akin to a physical person such
that it can decay, grow, die, and additionally experience rebirth.

2.6666666666666665
He has highlighted Nazi Germany as a regime which was founded on
ultranationalism.

3.0
Ultranationalist activism can adopt varying attitudes towards
histor

In [None]:
len(sent_strength)

11

In [None]:
summary = nlargest(int(len(sent_strength)/2), sent_strength, key=sent_strength.get)
summary = ' '.join([w.text for w in summary])
summary = re.sub(r"\[.*?\]", "", summary)
print(textwrap.fill(summary))

According to American scholar Janusz Bugajski, summing up the doctrine
in practical terms, "in its most extreme or developed forms, ultra-
nationalism resembles fascism, marked by a xenophobic disdain of other
nations, support for authoritarian political arrangements verging on
totalitarianism, and a mythical emphasis on the 'organic unity'
between a charismatic leader, an organizationally amorphous movement-
type party, and the nation". British political theorist Roger Griffin
has stated that ultranationalism is essentially founded on xenophobia
in a way that finds supposed legitimacy "through deeply mythicized
narratives of past cultural or political periods of historical
greatness or of old scores to settle against alleged enemies". It can
also draw on "vulgarized forms" of different aspects of the natural
sciences such as anthropology and genetics, eugenics specifically
playing a role, in order "to rationalize ideas of national superiority
and destiny, of degeneracy and subhumannes
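
The scoring behind this summary can be sketched without spaCy: count content words, normalize by the most frequent one, sum those weights per sentence, and keep the highest-scoring sentences. The version below is a simplification under stated assumptions: it uses a naive regex sentence splitter and a tiny stand-in stop list (`STOP`), and it skips the part-of-speech filter, so its scores will not match the notebook's. All function names are hypothetical.

```python
import re
from collections import Counter
from heapq import nlargest

# Tiny stand-in stop list; the notebook uses spaCy's STOP_WORDS instead.
STOP = {'the', 'a', 'an', 'of', 'and', 'is', 'in', 'to', 'it', 'on', 'that'}


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r'(?<=[.!?])\s+', text) if s.strip()]


def tokenize(text):
    """Lowercase word tokens with stop words removed."""
    return [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOP]


def score_sentences(text):
    """Map each sentence to the sum of its words' normalized frequencies."""
    freq = Counter(tokenize(text))
    if not freq:
        return {}
    top = max(freq.values())
    weights = {word: count / top for word, count in freq.items()}
    return {s: sum(weights[w] for w in tokenize(s)) for s in split_sentences(text)}


def summarize(text, n=1):
    """Return the n highest-scoring sentences in their original order."""
    scores = score_sentences(text)
    best = set(nlargest(n, scores, key=scores.get))
    return ' '.join(s for s in split_sentences(text) if s in best)


text = 'Cats chase mice. Cats sleep. Dogs bark loudly at cats and mice.'
print(summarize(text, n=1))  # Dogs bark loudly at cats and mice.
```

Because scores are sums rather than averages, longer sentences tend to win, which is consistent with the long opening sentences dominating the summary above.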

In [None]:
# pip install spacy-llm

In [None]:
import re
import spacy
from spacy.lang.en.stop_words import STOP_WORDS
from string import punctuation

nlp = spacy.blank('en')
nlp.add_pipe('sentencizer')

# https://www.educative.io/answers/text-summarization-in-spacy-and-nltk
# df.iloc[0]['txt']
def summarizer(row):
  txt = row['txt']
  # strip bracketed citation markers and double quotes
  text = re.sub(r'\[.*?\]|"', '', txt)
  doc = nlp(text)

  # count every token that is neither a stop word nor punctuation
  word_frequencies = {}
  for token in doc:
      if token.text not in STOP_WORDS and token.text not in punctuation:
          if token.text not in word_frequencies:
              word_frequencies[token.text] = 1
          else:
              word_frequencies[token.text] += 1

  # rank sentences by the summed frequency of their words
  sorted_sentences = sorted(doc.sents, key=lambda sent: sum(word_frequencies[token.text]
                          for token in sent if token.text in word_frequencies), reverse=True)

  # keep the top quarter of the sentences; int() rounds down, so a text with
  # fewer than four sentences yields an empty summary
  return ' '.join(sent.text for sent in sorted_sentences[:int(len(sorted_sentences)/4)]).strip()

# print(textwrap.fill(summarizer(df.iloc[0])))

In [None]:
df['summary']= df.apply(summarizer, axis=1)
df.head()

Unnamed: 0,title,url,heading,subheading,txt,topic_idx,topics,summary
0,Ultranationalism,https://en.wikipedia.org/wiki/Ultranationalism,Background concepts and broader context,,British political theorist Roger Griffin has s...,2,"converts, case, services, interfaith, dialogue...","According to American scholar Janusz Bugajski,..."
1,Ultranationalism,https://en.wikipedia.org/wiki/Ultranationalism,Historical movements and analysis,,American historian Walter Skya has written in ...,0,"opposite, agnostic, describes, jewish, capital...","In late 2015, the Israeli political journalist..."
2,Ultranationalism,https://en.wikipedia.org/wiki/Ultranationalism,Ultranationalist political parties,Currently represented in national governments ...,The following political parties have been char...,6,"world, religions, italy, religious, mussolini,...",
3,Ultranationalism,https://en.wikipedia.org/wiki/Ultranationalism,Ultranationalist political parties,Represented parties with former ultranationali...,The following political parties historically h...,6,"world, religions, italy, religious, mussolini,...",
4,Ultranationalism,https://en.wikipedia.org/wiki/Ultranationalism,Ultranationalist political parties,Formerly represented in national governments o...,Arising out of strident Sri Lankan Tamil natio...,6,"world, religions, italy, religious, mussolini,...",The assassination of Pavlos Fyssas in Septembe...


In [None]:
df.to_csv('wikiscrape.csv')

In [None]:
df.iloc[2].txt

'The following political parties have been characterised as ultranationalist.\nThe following political parties have been described as having ultranationalist factions.\n'
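
This text has only two sentences, and `summarizer` keeps `int(len(sorted_sentences)/4)` of them. `int(2/4)` is 0, so the slice is empty, which would explain the blank `summary` cells for rows 2 and 3. One possible guard is sketched below; `summary_length` is a hypothetical helper, and the minimum of one sentence is an assumption rather than anything the notebook specifies.

```python
def summary_length(n_sentences, fraction=0.25, minimum=1):
    """Number of sentences to keep: a fraction of the text, but at least `minimum`."""
    if n_sentences <= 0:
        return 0
    # never ask for more sentences than the text contains
    return min(n_sentences, max(minimum, int(n_sentences * fraction)))


print(int(2 / 4))          # 0, the current behaviour for a two-sentence text
print(summary_length(2))   # 1
print(summary_length(11))  # 2
```

Inside `summarizer`, the slice bound could then be `summary_length(len(sorted_sentences))` instead of `int(len(sorted_sentences)/4)`.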

In [None]:
print(textwrap.fill(summarizer(df.iloc[0])))

According to American scholar Janusz Bugajski, summing up the doctrine
in practical terms, in its most extreme or developed forms, ultra-
nationalism resembles fascism, marked by a xenophobic disdain of other
nations, support for authoritarian political arrangements verging on
totalitarianism, and a mythical emphasis on the 'organic unity'
between a charismatic leader, an organizationally amorphous movement-
type party, and the nation. British political theorist Roger Griffin
has stated that ultranationalism is essentially founded on xenophobia
in a way that finds supposed legitimacy through deeply mythicized
narratives of past cultural or political periods of historical
greatness or of old scores to settle against alleged enemies. It can
also draw on vulgarized forms of different aspects of the natural
sciences such as anthropology and genetics, eugenics specifically
playing a role, in order to rationalize ideas of national superiority
and destiny, of degeneracy and subhumanness in Gr