<a href="https://colab.research.google.com/github/gitmystuff/INFO4080/blob/main/Week_02-The_Research_Question/Summarizing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Summarizing



## Mount Drive

## Beautiful Soup

In [1]:
# pip install wikipedia

In [2]:
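# Get the table of contents: scrape the section headings from the outline page
from bs4 import BeautifulSoup
import pandas as pd
import requests

url = "https://en.wikipedia.org/wiki/Outline_of_the_history_of_Western_civilization"
# Name the parser explicitly; otherwise bs4 guesses one and emits a warning
soup = BeautifulSoup(requests.get(url).text, 'html.parser')

# Headings were marked up as <span class="mw-headline"> when this cell was run;
# if Wikipedia's HTML changes, this selector returns an empty list
headlines = soup.find_all('span', {'class': 'mw-headline'})
toc = pd.DataFrame({'header': [h.get_text() for h in headlines]})
toc.to_csv('toc.csv', index=False)
toc.head()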

| | header |
|---|---|
| 0 | Nature of Western civilization |
| 1 | Antiquity: before 500 |
| 2 | Rise of Christendom |
| 3 | The Middle Ages |
| 4 | Early Middle Ages: 500–1000 |
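
To check a heading selector without touching the network, the same extraction can be tried on an inline HTML snippet. The sketch below swaps Beautiful Soup for the standard library's `html.parser` so it runs with no extra packages; the class name `HeadlineParser` and the sample markup are illustrative assumptions, not part of the notebook.

```python
from html.parser import HTMLParser


class HeadlineParser(HTMLParser):
    """Collect the text of every <span class="mw-headline"> element."""

    def __init__(self):
        super().__init__()
        self.headlines = []
        self._depth = 0      # > 0 while inside a headline span
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag != "span":
            return
        classes = (dict(attrs).get("class") or "").split()
        if self._depth or "mw-headline" in classes:
            self._depth += 1   # count nested spans so the matching close is found

    def handle_endtag(self, tag):
        if tag == "span" and self._depth:
            self._depth -= 1
            if self._depth == 0:
                self.headlines.append("".join(self._buffer).strip())
                self._buffer = []

    def handle_data(self, data):
        if self._depth:
            self._buffer.append(data)


# Hypothetical markup in the style the selector expects
SAMPLE_HTML = """
<h2><span class="mw-headline" id="Antiquity">Antiquity: before 500</span></h2>
<p>Body text that should be ignored.</p>
<h2><span class="mw-headline" id="Middle_Ages">The <i>Middle</i> Ages</span></h2>
"""

parser = HeadlineParser()
parser.feed(SAMPLE_HTML)
print(parser.headlines)
```

If the printed list is empty for real page markup, the page probably no longer uses that class and the selector needs updating.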


In [3]:
import pandas as pd

kws = pd.read_csv('/content/drive/MyDrive/INFO4080/Week 02 - The Research Question/toc.csv')
# kws.head()
# kws.iloc[0].header


## Wikipedia API

The `wikipedia` package wraps the MediaWiki API for simple lookups. If you intend to do any scraping projects or automated requests, consider alternatives such as Pywikibot (formerly Pywikipediabot) or the MediaWiki API itself, which offer more features.

* wikipedia.search('keywords', results=2)
* wikipedia.suggest('keyword')
* wikipedia.summary('keywords', sentences=2)
* wikipedia.page('keywords')
* wikipedia.page('keywords').content
* wikipedia.page('keywords').references
* wikipedia.page('keywords').title
* wikipedia.page('keywords').url
* wikipedia.page('keywords').categories
* wikipedia.page('keywords').links
* wikipedia.geosearch(33.2075, -97.1526)
* wikipedia.set_lang('hi')
* wikipedia.languages()
* wikipedia.page('keywords').images[0]
* wikipedia.page('keywords').html()
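
For heavier or automated use, the MediaWiki API can also be called directly. The following is a minimal sketch, assuming the `extracts` query property that English Wikipedia exposes; the helper names and the User-Agent string are hypothetical, and only the standard library is used.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

API_URL = "https://en.wikipedia.org/w/api.php"


def build_extract_url(title):
    """Return the query URL for the plain-text intro of one page."""
    params = {
        "action": "query",
        "prop": "extracts",
        "exintro": 1,        # intro section only
        "explaintext": 1,    # plain text instead of HTML
        "titles": title,
        "format": "json",
    }
    return f"{API_URL}?{urlencode(params)}"


def parse_extract(payload):
    """Return the first extract in a decoded JSON response, or None if absent."""
    pages = payload.get("query", {}).get("pages", {})
    for page in pages.values():
        return page.get("extract")
    return None


def fetch_intro(title, user_agent="INFO4080-notebook-example/0.1"):
    """Fetch and parse the intro of one page (needs network access)."""
    # The User-Agent value is a placeholder; identify your own project here
    request = Request(build_extract_url(title), headers={"User-Agent": user_agent})
    with urlopen(request, timeout=10) as response:
        return parse_extract(json.load(response))


print(build_extract_url("Late antiquity"))
```

Calling `fetch_intro("Late antiquity")` would perform the actual request; it is left out of the block so the sketch runs offline.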

In [4]:
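import wikipedia

idx = 1               # row of the table of contents used as the search phrase
results = 4           # number of search results to retrieve
components = results  # number of LDA topics (one per retrieved page)
topics = 10           # number of top words to keep per LDA topic

print(kws.iloc[idx])
wikipedia.search(kws.iloc[idx].header, results=results)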

header    Antiquity: before 500
Name: 1, dtype: object


['Outline of the history of Western civilization',
 'Metals of antiquity',
 'Late antiquity',
 'Age of Earth']

In [5]:
titles = wikipedia.search(kws.iloc[idx].header, results=results)
# print(titles)
for title in titles:
  print(wikipedia.page(title).url)

https://en.wikipedia.org/wiki/Outline_of_the_history_of_Western_civilization
https://en.wikipedia.org/wiki/Metals_of_antiquity
https://en.wikipedia.org/wiki/Late_antiquity
https://en.wikipedia.org/wiki/Age_of_Earth


In [6]:
# https://stackoverflow.com/questions/61651052/find-exact-match-of-a-wikipedia-page-title-using-python

data = []
for title in titles:
    try:
        page = wikipedia.page(title)  # fetch once and reuse
        if page.title != title:
            # the lookup resolved to a different page, so skip it
            print(f"no exact Wikipedia page for {title!r}")
        else:
            data.append([title, page.content, wikipedia.summary(title, sentences=5)])
    except wikipedia.exceptions.WikipediaException:
        # covers missing pages, disambiguation pages, and other library errors
        print(title + " not found.")

df = pd.DataFrame(data, columns=['title', 'content', 'summary'])
df.head()

| | title | content | summary |
|---|---|---|---|
| 0 | Outline of the history of Western civilization | The following outline is provided as an overvi... | The following outline is provided as an overvi... |
| 1 | Metals of antiquity | The metals of antiquity are the seven metals w... | The metals of antiquity are the seven metals w... |
| 2 | Late antiquity | Late antiquity is sometimes defined as spannin... | Late antiquity is sometimes defined as spannin... |
| 3 | Age of Earth | The age of Earth is estimated to be 4.54 ± 0.0... | The age of Earth is estimated to be 4.54 ± 0.0... |


## LDA (Latent Dirichlet Allocation)

In natural language processing, latent Dirichlet allocation (LDA) is a Bayesian network (and therefore a generative statistical model) for modeling automatically extracted topics in a text corpus. It is an example of a Bayesian topic model: observations (e.g., words) are collected into documents, and each word's presence is attributed to one of the document's topics. Each document is assumed to contain only a small number of topics.

Sources:
 * https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation
 * https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.LatentDirichletAllocation.html
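
The generative story behind LDA can be made concrete with a small simulation. This is a sketch under stated assumptions, not how scikit-learn fits the model: the topic-word probabilities are made-up toy values, the function names are hypothetical, and the Dirichlet draw is built from normalized gamma samples using only the standard library.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw one Dirichlet sample by normalizing independent gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def generate_document(topic_word, alpha, n_words, rng):
    """Generate one document: draw a topic mixture, then a topic and a word per position."""
    theta = sample_dirichlet(alpha, rng)      # this document's topic proportions
    topic_ids = list(range(len(topic_word)))
    words, assignments = [], []
    for _ in range(n_words):
        z = rng.choices(topic_ids, weights=theta)[0]         # topic for this word
        vocab, probs = zip(*topic_word[z].items())
        words.append(rng.choices(vocab, weights=probs)[0])   # word drawn from that topic
        assignments.append(z)
    return theta, assignments, words


# Toy topic-word distributions (illustrative values, not fitted to any data)
TOPIC_WORD = [
    {"empire": 0.5, "roman": 0.3, "europe": 0.2},
    {"metals": 0.5, "gold": 0.3, "iron": 0.2},
]

rng = random.Random(0)
theta, assignments, words = generate_document(
    TOPIC_WORD, alpha=[0.5, 0.5], n_words=8, rng=rng
)
print([round(t, 3) for t in theta])
print(list(zip(assignments, words)))
```

Fitting LDA runs this story in reverse: given only the words, it infers the topic mixtures and the topic-word distributions that plausibly produced them.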

In [7]:
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import TfidfVectorizer

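# Note: LDA is defined over word counts, so CountVectorizer is the more usual input;
# TF-IDF weights are kept here to match the output shown below
vectorizer = TfidfVectorizer(stop_words='english')
vectors = vectorizer.fit_transform(df['summary'].values.astype('U'))

# No random_state is set, so the topics can differ from run to run
model = LatentDirichletAllocation(n_components=components)
model.fit(vectors)

feature_names = vectorizer.get_feature_names_out()
topics_dictionary = {}
for index, topic in enumerate(model.components_):
    # argsort is ascending, so the last `topics` indices are the highest-weighted words
    top_words = [feature_names[i] for i in topic.argsort()[-topics:]]
    print(f'Topic {index} top words: {top_words}')
    topics_dictionary[index] = top_words

topics_dictionary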

Topic 0 top words: ['empire', 'generally', 'roman', 'development', 'following', 'western', 'europe', 'time', 'south', 'world']
Topic 1 top words: ['empire', 'generally', 'roman', 'development', 'following', 'western', 'europe', 'time', 'south', 'world']
Topic 2 top words: ['antiquated', 'history', 'provided', 'philosophy', 'science', 'ancient', 'contributions', 'western', 'civilizations', 'civilization']
Topic 3 top words: ['end', 'century', 'billion', 'late', 'period', 'years', 'known', 'age', 'antiquity', 'metals']


{0: ['empire',
  'generally',
  'roman',
  'development',
  'following',
  'western',
  'europe',
  'time',
  'south',
  'world'],
 1: ['empire',
  'generally',
  'roman',
  'development',
  'following',
  'western',
  'europe',
  'time',
  'south',
  'world'],
 2: ['antiquated',
  'history',
  'provided',
  'philosophy',
  'science',
  'ancient',
  'contributions',
  'western',
  'civilizations',
  'civilization'],
 3: ['end',
  'century',
  'billion',
  'late',
  'period',
  'years',
  'known',
  'age',
  'antiquity',
  'metals']}

In [8]:
topic_results = model.transform(vectors)
df['topic_idx'] = topic_results.argmax(axis=1)
df.head()

| | title | content | summary | topic_idx |
|---|---|---|---|---|
| 0 | Outline of the history of Western civilization | The following outline is provided as an overvi... | The following outline is provided as an overvi... | 2 |
| 1 | Metals of antiquity | The metals of antiquity are the seven metals w... | The metals of antiquity are the seven metals w... | 3 |
| 2 | Late antiquity | Late antiquity is sometimes defined as spannin... | Late antiquity is sometimes defined as spannin... | 3 |
| 3 | Age of Earth | The age of Earth is estimated to be 4.54 ± 0.0... | The age of Earth is estimated to be 4.54 ± 0.0... | 3 |


In [9]:
def get_topics(row):
    # join the top words of the row's dominant topic into one string
    return ', '.join(topics_dictionary[row.topic_idx])

df['topics'] = df.apply(get_topics, axis=1)
df.head()

| | title | content | summary | topic_idx | topics |
|---|---|---|---|---|---|
| 0 | Outline of the history of Western civilization | The following outline is provided as an overvi... | The following outline is provided as an overvi... | 2 | antiquated, history, provided, philosophy, sci... |
| 1 | Metals of antiquity | The metals of antiquity are the seven metals w... | The metals of antiquity are the seven metals w... | 3 | end, century, billion, late, period, years, kn... |
| 2 | Late antiquity | Late antiquity is sometimes defined as spannin... | Late antiquity is sometimes defined as spannin... | 3 | end, century, billion, late, period, years, kn... |
| 3 | Age of Earth | The age of Earth is estimated to be 4.54 ± 0.0... | The age of Earth is estimated to be 4.54 ± 0.0... | 3 | end, century, billion, late, period, years, kn... |


## SpaCy

* https://spacy.io/
* https://medium.com/analytics-vidhya/text-summarization-using-spacy-ca4867c6b744
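
The summarization approach used in this section is extractive: count the content words, normalize the counts by the largest one, score each sentence by the summed weights of its words, and keep the top-scoring sentences. The sketch below shows that idea with the standard library only. It swaps spaCy's tokenizer, part-of-speech filter, and stop-word list for a regex and a tiny hand-written stop list, both of which are simplifying assumptions, and `summarize` is a hypothetical helper name.

```python
import re
from collections import Counter
from heapq import nlargest

# Small illustrative stop-word list (an assumption; spaCy's STOP_WORDS is much larger)
STOP = {"the", "a", "an", "of", "and", "in", "is", "are", "was", "were", "to", "by", "that"}


def tokenize(sentence):
    """Lowercase, keep alphabetic words, and drop stop words."""
    return [w for w in re.findall(r"[a-z]+", sentence.lower()) if w not in STOP]


def summarize(text, n_sentences=1):
    """Return the n highest-scoring sentences, ordered by score."""
    # naive sentence split: ., ! or ? followed by whitespace
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

    counts = Counter(w for s in sentences for w in tokenize(s))
    if not counts:
        return []

    # normalize by the top count so every weight falls in (0, 1]
    max_count = max(counts.values())
    weights = {w: c / max_count for w, c in counts.items()}

    # a sentence's score is the sum of the weights of its words
    scores = {s: sum(weights[w] for w in tokenize(s)) for s in sentences}
    return nlargest(n_sentences, scores, key=scores.get)


text = (
    "The metals of antiquity are seven metals. "
    "Gold and silver were metals known in antiquity. "
    "The weather was mild."
)
print(summarize(text, n_sentences=1))
```

Because scores are sums, longer sentences tend to win; dividing by sentence length is one common variation if that bias matters.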

In [10]:
# uncomment to download the model
# import spacy.cli
# spacy.cli.download('en_core_web_sm')

✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


## Language Model and Pipelines

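This notebook loads `en_core_web_sm`, spaCy's small English pipeline. It handles tokenization, part-of-speech tagging, dependency parsing, and named entity recognition.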

* https://www.kdnuggets.com/2021/03/natural-language-processing-pipelines-explained.html
* https://spacy.io/usage/spacy-101
* https://en.wikipedia.org/wiki/Language_model
* https://builtin.com/data-science/beginners-guide-language-models

In [11]:
import spacy
from spacy.lang.en.stop_words import STOP_WORDS
from string import punctuation
from collections import Counter
from heapq import nlargest

nlp = spacy.load('en_core_web_sm')

In [12]:
# Build one document from all of the page summaries
summary_text = ' '.join(df.summary)
doc = nlp(summary_text)

# Keep content words only: drop stop words and punctuation, then filter by part of speech
keyword = []
pos_tag = ['PROPN', 'ADJ', 'NOUN', 'VERB']
for token in doc:
    if token.text in STOP_WORDS or token.text in punctuation:
        continue
    if token.pos_ in pos_tag:
        keyword.append(token.text)

# count most frequent words
freq_word = Counter(keyword)
print(freq_word.most_common(5))

# normalize by the top count so every weight falls in (0, 1]
max_freq = freq_word.most_common(1)[0][1]
for word in freq_word:
    freq_word[word] = freq_word[word] / max_freq

print(freq_word.most_common(5))

[('Western', 7), ('civilization', 6), ('century', 6), ('metals', 6), ('antiquity', 6)]
[('Western', 1.0), ('civilization', 0.8571428571428571), ('century', 0.8571428571428571), ('metals', 0.8571428571428571), ('antiquity', 0.8571428571428571)]


In [13]:
# score each sentence by summing the normalized weights of its keywords
sent_strength = {}
for sent in doc.sents:
    for word in sent:
        if word.text in freq_word:
            sent_strength[sent] = sent_strength.get(sent, 0) + freq_word[word.text]

print(sent_strength)

{The following outline is provided as an overview of and topical guide to the history of Western civilization:
History of Western civilization – record of the development of human civilization beginning in Ancient Greece and Ancient Rome, and generally spreading westwards.
: 8.000000000000002, Ancient Greek science, philosophy, democracy, architecture, literature, and art provided a foundation embraced and built upon by the Roman Empire as it swept up Europe, including the Hellenic world in its conquests in the 1st century BC.: 5.857142857142857, From its European and Mediterranean origins, Western civilization has spread to produce the dominant cultures of modern North America, South America, and much of Oceania, and has had immense global influence in recent centuries.


: 5.000000000000002, == Nature of Western civilization ==
Western world – The first civilizations made various unique contributions to the western civilizations.: 5.2857142857142865, These contributions, which are li

In [14]:
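import textwrap

# take the 10 highest-scoring sentences (ordered by score, not by position in the text)
top_sentences = nlargest(10, sent_strength, key=sent_strength.get)
summary = ' '.join(sent.text for sent in top_sentences)
print(textwrap.fill(summary, 100))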

Late antiquity is sometimes defined as spanning from the end of classical antiquity to the local
start of the Middle Ages, from around the late 3rd century up to the 7th or 8th century in Europe
and adjacent areas bordering the Mediterranean Basin depending on location. The following outline is
provided as an overview of and topical guide to the history of Western civilization: History of
Western civilization – record of the development of human civilization beginning in Ancient Greece
and Ancient Rome, and generally spreading westwards.  Calcium–aluminium-rich inclusions—the oldest
known solid constituents within meteorites that are formed within the Solar System—are 4.567 billion
years old, giving a lower limit for the age of the Solar System. Following the development of
radiometric age-dating in the early 20th century, measurements of lead in uranium-rich minerals
showed that some were in excess of a billion years old. Ancient Greek science, philosophy,
democracy, architecture, lit