### Extracting HTML web content using paragraph tags and natural language processing:

Step 1: Import the necessary libraries: 
We import the requests library to send a GET request to the website, the BeautifulSoup library to parse the HTML content, and the NLTK and spaCy libraries to perform NLP tasks. We also download the NLTK stopwords corpus and the punkt tokenizer models, which Step 6 relies on.

In [3]:
import requests
from bs4 import BeautifulSoup
import nltk
import spacy

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

nltk.download('stopwords')
nltk.download('punkt')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

Step 2: Specify the URL of the website to scrape: 
We then specify the URL of the website that we want to scrape. In this example, we are scraping the Wikipedia page on natural language processing. The same cell also loads spaCy's pre-trained English language model, "en_core_web_sm", which provides NLP (Natural Language Processing) functionality such as tokenization, part-of-speech tagging, and named entity recognition.

In [4]:
nlp = spacy.load('en_core_web_sm')
url = 'https://en.wikipedia.org/wiki/Natural_language_processing'

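Step 3: Send the request and store the response in a variable: 
We use the requests library to send a GET request to the website and store the response in a variable. Printing the response shows the HTTP status code; 200 means the request succeeded.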

In [5]:
response = requests.get(url)
print(response)

<Response [200]>
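
Printing the response confirms the status code, but later cells would still run if the server returned an error. A minimal sketch of a more defensive fetch is shown below. The `fetch_html` helper, the `timeout` value, and the User-Agent string are assumptions added for illustration and are not part of the original notebook.

```python
import requests


def fetch_html(url, session=None, timeout=10):
    """Return the raw page body, raising an exception on HTTP errors."""
    # Hypothetical helper: a session can be passed in to reuse connections.
    if session is None:
        session = requests.Session()
    response = session.get(
        url,
        timeout=timeout,  # assumed value; avoids waiting indefinitely
        headers={"User-Agent": "nlp-tutorial-example/0.1"},  # placeholder string
    )
    response.raise_for_status()  # raises requests.HTTPError on 4xx/5xx
    return response.content
```

The returned bytes could then be passed to BeautifulSoup in place of `response.content`.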


Step 4: Use BeautifulSoup to parse the HTML content and store the result in a variable named "soup"

In [6]:
soup = BeautifulSoup(response.content, 'html.parser')
soup


<!DOCTYPE html>

<html class="client-nojs vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-language-alert-in-sidebar-enabled vector-feature-sticky-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-enabled vector-feature-main-menu-pinned-disabled vector-feature-limited-width-enabled vector-feature-limited-width-content-enabled vector-feature-zebra-design-disabled" dir="ltr" lang="en">
<head>
<meta charset="utf-8"/>
<title>Natural language processing - Wikipedia</title>
<script>document.documentElement.className="client-js vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-language-alert-in-sidebar-enabled vector-feature-sticky-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-enabled vector-feature-main-menu-pinned-disabled vector-feature-limited-width-enabled vector-feature-limited-width-content-enabl
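
Echoing `soup` prints the entire document, which is long. As a lighter check, a sketch like the following inspects a couple of elements instead. The inline HTML string is a small stand-in for `response.content` (an assumption for illustration), with the title and `lang` attribute copied from the output above.

```python
from bs4 import BeautifulSoup

# Small stand-in for response.content; not the real page.
html = (
    '<html dir="ltr" lang="en"><head>'
    "<title>Natural language processing - Wikipedia</title>"
    "</head><body></body></html>"
)
soup = BeautifulSoup(html, "html.parser")

page_title = soup.title.get_text()   # text inside the <title> tag
page_lang = soup.html.get("lang")    # value of the lang attribute on <html>

print(page_title)
print(page_lang)
```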

###### Find all the HTML p (paragraph) tags in the page using the find_all method of the soup object (the BeautifulSoup object that represents the parsed HTML content), and store them in the paragraphs variable.

In [7]:
paragraphs = soup.find_all('p')
paragraphs

[<p><b>Natural language processing</b> (<b>NLP</b>) is an <a class="mw-redirect" href="/wiki/Interdisciplinary" title="Interdisciplinary">interdisciplinary</a> subfield of <a href="/wiki/Linguistics" title="Linguistics">linguistics</a>, <a href="/wiki/Computer_science" title="Computer science">computer science</a>, and <a href="/wiki/Artificial_intelligence" title="Artificial intelligence">artificial intelligence</a> concerned with the interactions between computers and human language, in particular how to program computers to process and analyze large amounts of <a href="/wiki/Natural_language" title="Natural language">natural language</a> data.  The goal is a computer capable of "understanding" the contents of documents, including the <a href="/wiki/Context_(language_use)" title="Context (language use)">contextual</a> nuances of the language within them. The technology can then accurately extract information and insights contained in the documents as well as categorize and organize t

Step 5: Extract the text from the paragraph tags and concatenate it into a single string

In [8]:
text = ''
for p in paragraphs:
    text += p.text

print(text)


Natural language processing (NLP) is an interdisciplinary subfield of linguistics, computer science, and artificial intelligence concerned with the interactions between computers and human language, in particular how to program computers to process and analyze large amounts of natural language data.  The goal is a computer capable of "understanding" the contents of documents, including the contextual nuances of the language within them. The technology can then accurately extract information and insights contained in the documents as well as categorize and organize the documents themselves.
Challenges in natural language processing frequently involve speech recognition, natural-language understanding, and natural-language generation.
Natural language processing has its roots in the 1950s. Already in 1950, Alan Turing published an article titled "Computing Machinery and Intelligence" which proposed what is now called the Turing test as a criterion of intelligence, though at the time that
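
The loop above builds the string with repeated `+=` and relies on each paragraph ending in a newline. A possible alternative, sketched here on a small hand-written HTML snippet (an assumption, not the Wikipedia page), strips each paragraph, skips empty ones, and joins the rest with explicit newlines.

```python
from bs4 import BeautifulSoup

# Hand-written sample markup used only for illustration.
html = (
    "<p>First <b>bold</b> paragraph.</p>"
    "<p>  </p>"
    "<div>Not a paragraph.</div>"
    "<p>Second one.</p>"
)
soup = BeautifulSoup(html, "html.parser")

pieces = []
for p in soup.find_all("p"):
    piece = p.get_text().strip()  # drop leading/trailing whitespace
    if piece:                     # skip paragraphs with no visible text
        pieces.append(piece)

text = "\n".join(pieces)
print(text)
```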

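Step 6: Tokenize the text with NLTK, remove English stop words, and print both token lists. The comparison is case-insensitive, so "The" is removed as well as "the". Punctuation tokens such as '(' and ',' are not stop words and remain in cleaned_tokens.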

In [9]:
stop_words = set(stopwords.words('english'))
tokens = word_tokenize(text)
cleaned_tokens = [token for token in tokens if token.lower() not in stop_words]
print("Original tokens:", tokens)
print("Cleaned tokens:", cleaned_tokens)

Original tokens: ['Natural', 'language', 'processing', '(', 'NLP', ')', 'is', 'an', 'interdisciplinary', 'subfield', 'of', 'linguistics', ',', 'computer', 'science', ',', 'and', 'artificial', 'intelligence', 'concerned', 'with', 'the', 'interactions', 'between', 'computers', 'and', 'human', 'language', ',', 'in', 'particular', 'how', 'to', 'program', 'computers', 'to', 'process', 'and', 'analyze', 'large', 'amounts', 'of', 'natural', 'language', 'data', '.', 'The', 'goal', 'is', 'a', 'computer', 'capable', 'of', '``', 'understanding', "''", 'the', 'contents', 'of', 'documents', ',', 'including', 'the', 'contextual', 'nuances', 'of', 'the', 'language', 'within', 'them', '.', 'The', 'technology', 'can', 'then', 'accurately', 'extract', 'information', 'and', 'insights', 'contained', 'in', 'the', 'documents', 'as', 'well', 'as', 'categorize', 'and', 'organize', 'the', 'documents', 'themselves', '.', 'Challenges', 'in', 'natural', 'language', 'processing', 'frequently', 'involve', 'speech',
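
If punctuation should be dropped along with stop words, one option is to keep only alphabetic tokens. The sketch below uses the first few tokens from the output above and a tiny hand-picked stop-word set as a stand-in for `stopwords.words('english')`, so that it runs without any NLTK downloads; both are assumptions for illustration.

```python
# First tokens copied from the output above.
tokens = ['Natural', 'language', 'processing', '(', 'NLP', ')', 'is', 'an',
          'interdisciplinary', 'subfield', 'of', 'linguistics', ',']

# Tiny stand-in for set(stopwords.words('english')).
stop_words = {'is', 'an', 'of', 'the', 'and'}

# str.isalpha() rejects punctuation, but also hyphenated words and numbers.
cleaned = [t for t in tokens if t.isalpha() and t.lower() not in stop_words]
print(cleaned)
```

Whether to discard hyphenated words and numbers this way depends on the downstream task; a regular expression would allow a finer-grained filter.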

Step 7: Perform part-of-speech tagging on cleaned_tokens (the list of cleaned words) with nltk.pos_tag, then perform named entity recognition by passing pos_tags to nltk.ne_chunk. The resulting chunk tree is saved in the ner_tags variable.



In [10]:
nltk.download('words')
nltk.download('averaged_perceptron_tagger')
nltk.download('maxent_ne_chunker')
pos_tags = nltk.pos_tag(cleaned_tokens)
ner_tags = nltk.ne_chunk(pos_tags)


[nltk_data] Downloading package words to /root/nltk_data...
[nltk_data]   Unzipping corpora/words.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping chunkers/maxent_ne_chunker.zip.



Step 8: Print the named entities, followed by the Part-of-Speech (POS) tags and the Named Entity Recognition (NER) tags for the cleaned tokens. The POS tags are a list of tuples, each containing a word and its tag, and the NER tags form a tree-like structure. Note that the "Named Entities:" list in the output is empty: with its default binary=False, ne_chunk labels chunks with specific types such as PERSON, ORGANIZATION, or GPE, so the test entity.label() == 'NE' never matches. Calling nltk.ne_chunk(pos_tags, binary=True) in Step 7, or dropping the label comparison, would list the entities.

In [11]:
print("Named Entities:")
for entity in ner_tags:
    if hasattr(entity, 'label') and entity.label() == 'NE':
        print(entity)
print("POS tags:", pos_tags)
print("NER tags:", ner_tags)

Named Entities:
POS tags: [('Natural', 'JJ'), ('language', 'NN'), ('processing', 'NN'), ('(', '('), ('NLP', 'NNP'), (')', ')'), ('interdisciplinary', 'JJ'), ('subfield', 'NN'), ('linguistics', 'NNS'), (',', ','), ('computer', 'NN'), ('science', 'NN'), (',', ','), ('artificial', 'JJ'), ('intelligence', 'NN'), ('concerned', 'VBN'), ('interactions', 'NNS'), ('computers', 'NNS'), ('human', 'JJ'), ('language', 'NN'), (',', ','), ('particular', 'JJ'), ('program', 'NN'), ('computers', 'NNS'), ('process', 'VBP'), ('analyze', 'JJ'), ('large', 'JJ'), ('amounts', 'NNS'), ('natural', 'JJ'), ('language', 'NN'), ('data', 'NNS'), ('.', '.'), ('goal', 'NN'), ('computer', 'NN'), ('capable', 'JJ'), ('``', '``'), ('understanding', 'JJ'), ("''", "''"), ('contents', 'NNS'), ('documents', 'NNS'), (',', ','), ('including', 'VBG'), ('contextual', 'JJ'), ('nuances', 'NNS'), ('language', 'NN'), ('within', 'IN'), ('.', '.'), ('technology', 'NN'), ('accurately', 'RB'), ('extract', 'JJ'), ('information', 'NN'), ('
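
A minimal sketch of collecting every labeled chunk regardless of its type, instead of filtering on 'NE', is shown below. The tree is hand-built to resemble what `nltk.ne_chunk` returns, so that the example runs without downloading any models; its contents are hypothetical and are not the notebook's actual output.

```python
from nltk.tree import Tree

# Hand-built stand-in for nltk.ne_chunk(pos_tags); hypothetical data.
ner_tags = Tree('S', [
    Tree('PERSON', [('Alan', 'NNP'), ('Turing', 'NNP')]),
    ('published', 'VBD'),
    ('article', 'NN'),
    Tree('ORGANIZATION', [('NLP', 'NNP')]),
])

named_entities = []
for node in ner_tags:
    # Entity chunks are subtrees; ordinary tokens are (word, tag) tuples.
    if isinstance(node, Tree):
        phrase = ' '.join(word for word, tag in node.leaves())
        named_entities.append((phrase, node.label()))

print(named_entities)
```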

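Step 9: Extract named entities of type ORG (organization) from each paragraph using spaCy. Only paragraphs containing at least one organization are printed. As the output shows, the small model labels the abbreviation "NLP" as an organization, so the results should be treated as approximate.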

In [12]:
for paragraph in paragraphs:
    text = paragraph.get_text()
    entities = []
    doc = nlp(text)  
    for entity in doc.ents:
        if entity.label_ == "ORG":
            entities.append(entity.text)

    if entities:
        print(f"Text: {text}")
        print(f"Organizations: {', '.join(entities)}")
        print("="*50)


Text: Natural language processing (NLP) is an interdisciplinary subfield of linguistics, computer science, and artificial intelligence concerned with the interactions between computers and human language, in particular how to program computers to process and analyze large amounts of natural language data.  The goal is a computer capable of "understanding" the contents of documents, including the contextual nuances of the language within them. The technology can then accurately extract information and insights contained in the documents as well as categorize and organize the documents themselves.

Organizations: NLP
Text: The premise of symbolic NLP is well-summarized by John Searle's Chinese room experiment: Given a collection of rules (e.g., a Chinese phrasebook, with questions and matching answers), the computer emulates natural language understanding (or other NLP tasks) by applying those rules to the data it confronts.

Organizations: NLP, NLP
Text: Up to the 1980s, most natural la
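
The same entity can be listed several times for one paragraph, as in "NLP, NLP" above. A small sketch of removing repeats while keeping the order of first appearance follows; the `unique_in_order` helper is a hypothetical name introduced here, not part of the original notebook.

```python
def unique_in_order(items):
    """Drop repeated items while keeping first-seen order."""
    # dict preserves insertion order, so its keys are the unique items.
    return list(dict.fromkeys(items))


entities = ["NLP", "NLP"]  # as in the second result above
print(f"Organizations: {', '.join(unique_in_order(entities))}")
```

In the Step 9 loop, this would amount to wrapping `entities` in `unique_in_order(...)` before joining.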