Collecting Data from JSON 
---------------------------------------

Problem
------------
You want to read a JSON file/object.

Solution
------------
The simplest way to do this is by using the requests and json libraries.

In [1]:
# Import the necessary libraries
import requests
import json

In [2]:
# Extract text from a JSON response

# JSON from "https://quotes.rest/qod.json"
r = requests.get("https://quotes.rest/qod.json")  # HTTP response
res = r.json()  # parse the response body as JSON
print(json.dumps(res, indent=4))  # pretty-print with 4 spaces per nesting level

# Note :
# From the Docs :
# If indent is a non-negative integer or string, then JSON array elements 
# and object members will be pretty-printed with that indent level. 
# An indent level of 0, negative, or "" will only insert newlines.

# None (the default) selects the most compact representation. 
# Using a positive integer indent indents that many spaces per level. 
# If indent is a string (such as "\t"), that string is used to indent 
# each level.

{
    "success": {
        "total": 1
    },
    "contents": {
        "quotes": [
            {
                "quote": "He who is not courageous enough to take risks will accomplish nothing in life.",
                "length": "78",
                "author": "Mohamad Ali",
                "tags": [
                    "courage",
                    "inspire",
                    "risk"
                ],
                "category": "inspire",
                "language": "en",
                "date": "2020-03-01",
                "permalink": "https://theysaidso.com/quote/mohamad-ali-he-who-is-not-courageous-enough-to-take-risks-will-accomplish-nothin",
                "id": "ifuqTGVbNWPSJIzhrGQakQeF",
                "background": "https://theysaidso.com/img/qod/qod-inspire.jpg",
                "title": "Inspiring Quote of the day"
            }
        ]
    },
    "baseurl": "https://theysaidso.com",
    "copyright": {
        "year": 2022,
        "url": "https://theysaidso.com"

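To make the note on indent concrete, here is a minimal sketch that serializes a small hand-made dictionary with a few different indent values. The `sample` data is an assumption for illustration only; it is not the API response.

```python
import json

# Hypothetical sample data, used only to illustrate the indent parameter
sample = {"author": "Ada", "tags": ["a", "b"]}

compact = json.dumps(sample)                   # indent=None (default): one line
newlines_only = json.dumps(sample, indent=0)   # newlines, but no indentation
four_spaces = json.dumps(sample, indent=4)     # 4 spaces per nesting level
tabbed = json.dumps(sample, indent="\t")       # one tab per nesting level

print(compact)
print(four_spaces)
```

The compact form is convenient for storage or transmission, while the indented forms are easier to read when exploring an unfamiliar response.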
In [3]:
# Extract the first quote object from the response
q = res['contents']['quotes'][0]
q

{'quote': 'He who is not courageous enough to take risks will accomplish nothing in life.',
 'length': '78',
 'author': 'Mohamad Ali',
 'tags': ['courage', 'inspire', 'risk'],
 'category': 'inspire',
 'language': 'en',
 'date': '2020-03-01',
 'permalink': 'https://theysaidso.com/quote/mohamad-ali-he-who-is-not-courageous-enough-to-take-risks-will-accomplish-nothin',
 'id': 'ifuqTGVbNWPSJIzhrGQakQeF',
 'background': 'https://theysaidso.com/img/qod/qod-inspire.jpg',
 'title': 'Inspiring Quote of the day'}

In [5]:
q = res['contents']['quotes'][0]['quote']
q

'He who is not courageous enough to take risks will accomplish nothing in life.'
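Chained indexing such as res['contents']['quotes'][0]['quote'] raises a KeyError or IndexError when the response has a different shape, for example when the service returns an error payload. The sketch below walks the structure defensively. The helper name `first_quote` and both sample payloads are hypothetical, and the shape of the error payload in particular is an assumption.

```python
def first_quote(payload):
    """Return the first quote's text, or None if the expected keys are missing."""
    quotes = payload.get("contents", {}).get("quotes", [])
    if not quotes:
        return None
    return quotes[0].get("quote")

# Hypothetical payloads: one shaped like the response above, one error-like
ok = {"contents": {"quotes": [{"quote": "Example quote.", "author": "Someone"}]}}
error = {"error": {"code": 429, "message": "Too Many Requests"}}

print(first_quote(ok))
print(first_quote(error))
```

Returning None instead of raising lets the caller decide whether a missing quote is an error or something to skip.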

Collecting Data from HTML
--------------------------------------

Problem
------------
You want to read and parse HTML pages.

Solution
------------
The simplest way to do this is by using the BeautifulSoup library (the bs4 package).

In [6]:
# Install and import all the necessary libraries
# Note: "bs4" on PyPI is a thin wrapper package that installs beautifulsoup4

!pip install bs4
import urllib.request as urllib2
from bs4 import BeautifulSoup

In [7]:
# Fetch the HTML file
# Pick any web page that you want to extract text from.
# Let’s pick Wikipedia for this example.
response = urllib2.urlopen('https://en.wikipedia.org/wiki/Natural_language_processing')
html_doc = response.read()

In [8]:
# Parse the HTML file
soup = BeautifulSoup(html_doc, 'html.parser')
# Just for information: the second argument selects the parser.
# For example, BeautifulSoup(html_doc, features='lxml') uses the lxml
# parser, which must be installed separately.

# Format the parsed HTML
strhtm = soup.prettify()

# Print the first 1,000 characters
print(strhtm[:1000])

<!DOCTYPE html>
<html class="client-nojs" dir="ltr" lang="en">
 <head>
  <meta charset="utf-8"/>
  <title>
   Natural language processing - Wikipedia
  </title>
  <script>
   document.documentElement.className="client-js";RLCONF={"wgBreakFrames":!1,"wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","October","November","December"],"wgMonthNamesShort":["","Jan","Feb","Mar","Apr","May","Jun","Jul","Aug","Sep","Oct","Nov","Dec"],"wgRequestId":"XlYemApAMMQAA9UNWeoAAADB","wgCSPNonce":!1,"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":!1,"wgNamespaceNumber":0,"wgPageName":"Natural_language_processing","wgTitle":"Natural language processing","wgCurRevisionId":942694720,"wgRevisionId":942694720,"wgArticleId":21652,"wgIsArticle":!0,"wgIsRedirect":!1,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["CS1 maint: location","Wiki

In [9]:
# Extracting tag value
# We can extract a tag value from the first instance of the tag 
# using the following code.

print(soup.title)
print(soup.title.string)
print(soup.a.string)
print(soup.b.string)



<title>Natural language processing - Wikipedia</title>
Natural language processing - Wikipedia
None
Natural language processing
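The None printed for soup.a.string is expected behaviour rather than a failure: .string only returns text when a tag has a single child that resolves to one string, while get_text() concatenates all nested text. The sketch below shows the difference on a small hand-written HTML snippet (an assumption for illustration, not markup taken from the Wikipedia page).

```python
from bs4 import BeautifulSoup

# Hand-written HTML for illustration (not taken from the Wikipedia page)
html = """
<p><a href="/a">plain link</a></p>
<p><a href="/b"><img src="x.png" alt="logo"></a></p>
<p><a href="/c">mixed <b>content</b> link</a></p>
"""
soup_demo = BeautifulSoup(html, "html.parser")
links = soup_demo.find_all("a")

# .string is None for the image-only link and for the link with mixed children
strings = [a.string for a in links]
# get_text() always returns a str, possibly empty
texts = [a.get_text() for a in links]

print(strings)
print(texts)
```

When every link's text is needed, get_text() is usually the safer choice; .string is handy when a tag is known to contain plain text only.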


In [10]:
# Extracting all instances of a particular tag
# Here we get all the instances of a tag that we are interested in:
for x in soup.find_all('a'):
    print(x.string)

None
Jump to navigation
Jump to search
Nonlinear programming
None
None
automated online assistant
customer service
[1]
linguistics
computer science
information engineering
artificial intelligence
natural language
speech recognition
natural language understanding
natural language generation
None
None
None
None
None
None
None
None
None
None
None
edit
history of natural language processing
Alan Turing
Computing Machinery and Intelligence
Turing test
clarification needed
Georgetown experiment
automatic translation
[2]
ALPAC report
statistical machine translation
SHRDLU
blocks worlds
ELIZA
Rogerian psychotherapist
Joseph Weizenbaum
ontologies
chatterbots
PARRY
Racter
Jabberwacky
machine learning
Moore's law
Chomskyan
transformational grammar
corpus linguistics
[3]
decision trees
part-of-speech tagging
hidden Markov models
statistical models
probabilistic
real-valued
cache language models
speech recognition
machine translation
textual corpora
Parliament of Canada
European Union
unsupervised
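Links are often more useful for their targets than for their text. As a dependency-free sketch, the block below collects and counts href values using the standard library's html.parser.HTMLParser, which is swapped in here in place of BeautifulSoup. The class name `LinkCollector` and the sample markup are hypothetical.

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag encountered."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            # attrs is a list of (name, value) pairs
            href = dict(attrs).get("href")
            if href is not None:
                self.hrefs.append(href)

# Hand-written sample markup (an assumption, not the Wikipedia page)
sample_html = (
    '<p><a href="/wiki/ELIZA">ELIZA</a> and '
    '<a href="/wiki/SHRDLU">SHRDLU</a> <a name="x">anchor</a></p>'
)

collector = LinkCollector()
collector.feed(sample_html)
print(collector.hrefs)
print(len(collector.hrefs))
```

The third `<a>` has no href, so it is skipped; the count therefore reflects real links rather than every anchor tag.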


In [13]:
# Print the text of every <a> tag and count how many there are
a = 0

for x in soup.find_all('a'):
    print(x.string)
    a += 1
print(a)


In [19]:
# Extracting all text of a particular tag
# Here we get the text in <p> tags:
for x in soup.find_all('p'):
    print(x.text)

    

Natural language processing (NLP) is a subfield of linguistics, computer science, information engineering, and artificial intelligence concerned with the interactions between computers and human (natural) languages, in particular how to program computers to process and analyze large amounts of natural language data.

Challenges in natural language processing frequently involve speech recognition, natural language understanding, and natural language generation.

The history of natural language processing (NLP) generally started in the 1950s, although work can be found from earlier periods.
In 1950, Alan Turing published an article titled "Computing Machinery and Intelligence" which proposed what is now called the Turing test as a criterion of intelligence[clarification needed].

The Georgetown experiment in 1954 involved fully automatic translation of more than sixty Russian sentences into English. The authors claimed that within three or five years, machine translation would be a solve
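The paragraph text still carries bracketed markers such as [clarification needed]. A minimal sketch for stripping them with a regular expression follows; the helper name `strip_markers` is hypothetical, the sample string is adapted from the output above with an extra [2] added for illustration, and the pattern assumes markers contain only digits or only letters and spaces.

```python
import re

# Matches bracketed markers such as [1], [23] or [clarification needed]
MARKER = re.compile(r"\[(?:\d+|[A-Za-z ]+)\]")

def strip_markers(text):
    """Remove bracketed reference/editorial markers from paragraph text."""
    return MARKER.sub("", text)

sample = "the Turing test as a criterion of intelligence[clarification needed].[2]"
print(strip_markers(sample))
```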

In [27]:
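import re

# A whitespace-tolerant pattern needs \s (backslash), not /s;
# regex.findall(strhtm) would also match words split by extra spaces or line breaks
regex = re.compile(r'natural\s*language\s*processing', re.I)

# Count case-insensitive occurrences of the exact phrase in the prettified HTML
a = re.findall('natural language processing', strhtm, re.I)

print(len(a))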

35
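In a regular expression, whitespace is matched by \s (with a backslash), whereas /s matches a literal slash followed by the letter s. The sketch below compares a literal phrase, a whitespace-tolerant pattern, and the slash variant on a hand-written string (an assumption for illustration, not the page content).

```python
import re

# Hand-written sample text (an assumption for illustration)
text = (
    "Natural language processing, natural  language\nprocessing "
    "and NATURAL LANGUAGE PROCESSING."
)

# Exact phrase with single spaces, ignoring case
literal = re.findall("natural language processing", text, re.I)
# Any run of whitespace (spaces, tabs, newlines) between the words
flexible = re.findall(r"natural\s+language\s+processing", text, re.I)
# "/s*" requires a literal "/" character, so nothing matches here
slash_typo = re.findall(r"natural/s*language/s*processing", text, re.I)

print(len(literal), len(flexible), len(slash_typo))
```

On prettified HTML, where phrases can be split across lines, the whitespace-tolerant pattern may find occurrences that the literal phrase misses.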
