<a href="https://colab.research.google.com/github/gitmystuff/INFO4080/blob/main/Week_06-Method_Section/Web_Page_Scraper.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Web Page Scraping



## Scraping and Summarizing Web Pages

* https://www.techtarget.com/whatis/feature/How-to-scrape-data-from-a-website
* https://towardsdatascience.com/web-scraping-basics-82f8b5acd45c
* https://www.termsfeed.com/blog/web-scraping-laws/
* https://webscraper.io/

Terms and Conditions

* https://futureplc.com/terms-conditions/
* https://www.quora.com/What-are-the-websites-that-allow-web-scraping
* https://www.linkedin.com/pulse/need-new-renaissance-lise-kingo
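
Alongside a site's terms and conditions, its `robots.txt` file states which paths automated clients are asked to avoid. The following is a minimal sketch of checking those rules with Python's standard-library `urllib.robotparser`. The rules, the `example.com` URLs, and the `may_fetch` helper are hypothetical and inlined so the sketch runs offline; they are not taken from any of the sites listed above.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the sketch needs no network access.
robots_lines = [
    "User-agent: *",
    "Disallow: /private/",
    "Allow: /",
]

parser = RobotFileParser()
parser.parse(robots_lines)  # parse rules from an iterable of lines


def may_fetch(url, user_agent="my-scraper"):
    """Return True if the parsed rules allow user_agent to fetch url."""
    return parser.can_fetch(user_agent, url)


print(may_fetch("https://example.com/articles/1"))
print(may_fetch("https://example.com/private/data"))
```

For a live site, `parser.set_url(...)` followed by `parser.read()` loads the real file instead of the inlined lines. A permissive `robots.txt` does not replace reading the site's terms and conditions.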

## Beautiful Soup

In [None]:
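# Scrape the page and print the bold text found inside each paragraph
from bs4 import BeautifulSoup
import requests

url = 'https://www.linkedin.com/pulse/need-new-renaissance-lise-kingo'

# Name the parser explicitly so the result does not depend on which parsers are installed
soup = BeautifulSoup(requests.get(url).text, 'html.parser')
for section in soup.find_all('p'):
  for heading in section.find_all('strong'):
    print(heading.text)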



It’s time for a re-birth of knowledge and reason 
We are at an inflection point
A crucial role for science and learning
The Leonardo Centre on Business for Society


In [None]:
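# Scrape the article body, printing bold headings set off by blank lines
from bs4 import BeautifulSoup
import pandas as pd  # used in later cells to build DataFrames
import requests
import textwrap

url = 'https://www.linkedin.com/pulse/need-new-renaissance-lise-kingo'

# Name the parser explicitly so the result does not depend on which parsers are installed
soup = BeautifulSoup(requests.get(url).text, 'html.parser')

# The class name is specific to this page's markup; find() returns None if it is absent
s = soup.find('div', {'class' : 'article-main__content'})
if s is None:
  raise ValueError('article-main__content not found; the page layout may have changed')

for para in s.find_all('p'):
  for st in para.find_all('strong'):
    print()
    print(st.text)
    print()

  # Wrap each paragraph at 100 characters for readability
  print(textwrap.fill(para.get_text(), 100))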

I was educated in the classics – in Greek and Roman culture and philosophy. That’s likely why the
renaissance has always fascinated me. The renaissance marked a rediscovery of Greek and Roman
ideals, including the writings of Roman stateman, philosopher and scholar Cicero, who spoke of the
moral duty of the state to govern in harmony with the universal principles of nature, and in
accordance with the principles of equality, liberty and rule of law – principles that have inspired
modern democracies and provided the tenets of the United Nations.   
The human race is extraordinary for its resilience. In fact, throughout the ages, some of the
greatest leaps forward for humanity have happened after deep and enduring crises.
The renaissance took off from the devastating impacts of the bubonic plague. It was a rebirth of
knowledge and reason and changed the world in just about every way we can think. To make sense of
the world – of light and shadow, the human anatomy, the laws of gravity, of 

## LDA (Latent Dirichlet Allocation)

In natural language processing, latent Dirichlet allocation (LDA) is a Bayesian network (and therefore a generative statistical model) for modeling topics that are extracted automatically from a text corpus. LDA is a Bayesian topic model: observations (e.g., words) are grouped into documents, and each word's presence is attributed to one of the document's topics. Each document is assumed to contain only a small number of topics.

Sources:
 * https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation
 * https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.LatentDirichletAllocation.html

In [None]:
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import TfidfVectorizer

# df is assumed to be loaded already, with one scraped section per row and its text in 'txt'
components = 10  # number of topics to fit
topics = 10      # number of top words to keep per topic

vectorizer = TfidfVectorizer(stop_words='english')
vectors = vectorizer.fit_transform(df['txt'].values.astype('U'))

model = LatentDirichletAllocation(n_components=components)
model.fit(vectors)

# Look up the vocabulary once rather than on every loop iteration
feature_names = vectorizer.get_feature_names_out()

topics_dictionary = {}
for index, topic in enumerate(model.components_):
    # argsort is ascending, so the last `topics` indices are the highest-weighted words
    top_words = [feature_names[i] for i in topic.argsort()[-topics:]]
    print(f'Topic {index} top words: {top_words}')
    topics_dictionary[index] = top_words



Topic 0 top words: ['ilkhanate', 'world', 'soil', 'culture', 'day', 'art', 'mongol', 'christian', 'dialogue', 'interfaith']
Topic 1 top words: ['include', 'samuel', 'african', 'africa', 'services', 'entitled', 'wrote', 'apotheosis', 'poem', 'mythology']
Topic 2 top words: ['beings', 'putin', 'definition', 'west', '33', 'control', 'sacred', 'dharma', 'case', 'women']
Topic 3 top words: ['apotheose', 'catholic', 'hero', 'lully', 'ultranationalist', 'parties', 'science', 'religiō', 'following', 'apotheosis']
Topic 4 top words: ['factions', 'sponsored', '138', 'yuan', 'tendencies', 'religions', 'confucianism', 'political', 'health', 'religion']
Topic 5 top words: ['global', 'abrahamic', 'follow', 'rates', 'religious', 'population', 'study', 'law', 'judaism', 'wealth']
Topic 6 top words: ['state', 'abolished', 'conquered', 'disappeared', 'fell', 'emperor', 'jurisdictions', 'roman', 'states', 'superstition']
Topic 7 top words: ['299', 'tyranny', '243', 'deification', 'implications', 'worship

In [None]:
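def get_topics(row):
  # Join the top words of the row's dominant topic into one string
  return ', '.join(topics_dictionary[row.topic_idx])

# Each row of topic_results is a document's topic distribution; keep the most probable topic
topic_results = model.transform(vectors)
df['topic_idx'] = topic_results.argmax(axis=1)

df['topics'] = df.apply(get_topics, axis=1)
df.head()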

Unnamed: 0,title,url,heading,subheading,txt,topic_idx,topics
0,Ultranationalism,https://en.wikipedia.org/wiki/Ultranationalism,Background concepts and broader context,,British political theorist Roger Griffin has s...,9,"italian, buddhism, political, church, religiou..."
1,Ultranationalism,https://en.wikipedia.org/wiki/Ultranationalism,Historical movements and analysis,,American historian Walter Skya has written in ...,0,"ilkhanate, world, soil, culture, day, art, mon..."
2,Ultranationalism,https://en.wikipedia.org/wiki/Ultranationalism,Ultranationalist political parties,Currently represented in national governments ...,The following political parties have been char...,3,"apotheose, catholic, hero, lully, ultranationa..."
3,Ultranationalism,https://en.wikipedia.org/wiki/Ultranationalism,Ultranationalist political parties,Represented parties with former ultranationali...,The following political parties historically h...,4,"factions, sponsored, 138, yuan, tendencies, re..."
4,Ultranationalism,https://en.wikipedia.org/wiki/Ultranationalism,Ultranationalist political parties,Formerly represented in national governments o...,Arising out of strident Sri Lankan Tamil natio...,9,"italian, buddhism, political, church, religiou..."


## spaCy

* https://spacy.io/
* https://medium.com/analytics-vidhya/text-summarization-using-spacy-ca4867c6b744

In [None]:
# download the small English model (only needed once per environment)
import spacy.cli

spacy.cli.download('en_core_web_sm')

✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


## Language Model and Pipelines

en_core_web_sm

* https://www.kdnuggets.com/2021/03/natural-language-processing-pipelines-explained.html
* https://spacy.io/usage/spacy-101
* https://en.wikipedia.org/wiki/Language_model
* https://builtin.com/data-science/beginners-guide-language-models

In [None]:
import spacy
from spacy.lang.en.stop_words import STOP_WORDS
from string import punctuation
from collections import Counter
from heapq import nlargest

nlp = spacy.load('en_core_web_sm')

In [None]:
import textwrap
import re

data = []
# Remove bracketed text (such as citation markers like [1]) before joining the sections
summary_text = ' '.join([re.sub(r"\[.*?\]", "", txt) for txt in df.txt])
doc = nlp(summary_text)

# Keep content words only: skip stop words and punctuation, keep selected parts of speech
keyword = []
stopwords = STOP_WORDS  # a set, so membership tests are fast
pos_tag = ['PROPN', 'ADJ', 'NOUN', 'VERB']
for token in doc:
    if token.text in stopwords or token.text in punctuation:
        continue
    if token.pos_ in pos_tag:
        keyword.append(token.text)

# Normalize counts so the most frequent keyword has weight 1.0
freq_word = Counter(keyword)
max_freq = freq_word.most_common(1)[0][1]
for word in freq_word.keys():
    freq_word[word] = freq_word[word] / max_freq

# Score each sentence as the sum of the weights of the keywords it contains
sent_strength = {}
for sent in doc.sents:
    for word in sent:
        if word.text in freq_word:
            sent_strength[sent] = sent_strength.get(sent, 0) + freq_word[word.text]

    # Sentences without any keyword receive no score and are skipped
    if sent in sent_strength:
        data.append([sent_strength[sent], str(sent)])

# The summary is the ten highest-scoring sentences
summary = nlargest(10, sent_strength, key=sent_strength.get)
summary = ' '.join([w.text for w in summary])
print(textwrap.fill(summary, 100))

# pd was imported in the scraping cell earlier in the notebook
df2 = pd.DataFrame(data, columns=['strength', 'txt'])
df2.sort_values(by=['strength'], ascending=False).head()

A number of disciplines study the phenomenon of religion: theology, comparative religion, history of
religion, evolutionary origin of religions, anthropology of religion, psychology of religion
(including neuroscience of religion and evolutionary psychology of religion), law and religion, and
sociology of religion.  The jurisdictions below give various degrees of recognition in their
constitutions to Eastern Orthodoxy, but without establishing it as the state religion: The following
states recognize some form of Protestantism as their state or official religion: The Anglican Church
of England is the established church in England as well as all three of the Crown Dependencies:
Jurisdictions where a Lutheran church has been fully or partially established as a state recognized
religion include the Nordic States.  Mussolini was aware that Italy did not have the military
capacity to carry out a long war with France or the United Kingdom and waited until France was on
the verge of imminent c

Unnamed: 0,strength,txt
177,10.305085,A number of disciplines study the phenomenon o...
842,5.864407,The jurisdictions below give various degrees o...
621,5.587571,Mussolini was aware that Italy did not have th...
212,5.457627,"In the field of comparative religion, a common..."
281,5.316384,"In West Africa, these religions include the Ak..."
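
The `strength` column is a sum of normalized word frequencies, which tends to favor long sentences that repeat common keywords. The scoring idea can be sketched with the standard library alone. This is a simplified illustration under stated assumptions: the text and the small `STOP` set are hypothetical, sentences are split with a regular expression, and there is no part-of-speech filter, so it is not equivalent to the spaCy cell.

```python
import re
from collections import Counter
from heapq import nlargest

# Hypothetical stop-word list and text, used only for illustration.
STOP = {"the", "a", "is", "of", "and", "in", "to", "it"}
text = (
    "Scraping collects text from web pages. "
    "Summaries rank sentences by word frequency. "
    "Frequent words make sentences rank higher in summaries."
)

# Naive sentence split on end-of-sentence punctuation followed by whitespace.
sentences = re.split(r"(?<=[.!?])\s+", text.strip())


def tokens(sentence):
    """Lowercase words in the sentence, with stop words removed."""
    return [w for w in re.findall(r"[a-z']+", sentence.lower()) if w not in STOP]


# Normalize counts so the most frequent word has weight 1.0.
freq = Counter(w for s in sentences for w in tokens(s))
max_freq = max(freq.values())
weights = {w: c / max_freq for w, c in freq.items()}

# A sentence's score is the sum of the weights of its words.
scores = {s: sum(weights[w] for w in tokens(s)) for s in sentences}

# Keep the two highest-scoring sentences as the summary.
summary = nlargest(2, scores, key=scores.get)
print(" ".join(summary))
```

Dividing each score by the number of tokens in the sentence is one way to reduce the bias toward long sentences.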
