## Extracting data from various data sources

### 1. From a URL Link

Python can fetch the HTML/XML of a webpage with the standard-library ```urllib``` module or the third-party ```requests``` package. We can then use ```BeautifulSoup``` to parse the markup and extract the actual text content!

In [38]:
from urllib import request

In [7]:
# The source
url = "https://electrek.co/2019/09/10/tesla-new-assembly-line-fremont-factory-model-y-production/" 

# Fetch the page as raw bytes and decode them to a string as UTF-8
html = request.urlopen(url).read().decode('utf8')
print(html[:100])

<!DOCTYPE html>
<html lang="en-US">
	<head>
				<meta charset="UTF-8" />
		<meta name="viewport" con


In [21]:
# Let's get the actual content
from bs4 import BeautifulSoup
# First parse the HTML into a BeautifulSoup object
soup = BeautifulSoup(html, 'html.parser')

In [33]:
# The first two top-level children of the parsed page
list(soup.children)[:2]

['html', '\n']

In [36]:
from nltk import word_tokenize
# Get word tokens from the soup - get_text() returns all the text, regardless of whether it's in a heading, the body, etc.
tokens = word_tokenize(soup.get_text())
tokens[:10]

['Tesla',
 'is',
 'working',
 'on',
 '5th',
 'assembly',
 'line',
 'at',
 'Fremont',
 'factory']
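Depending on the parser and library version, the text pulled out of a page can include the contents of `<script>` and `<style>` elements, which is rarely what we want to tokenize. As a rough sketch of how that filtering works, here is a small extractor built on the standard-library `html.parser` module rather than on BeautifulSoup. The class name `VisibleTextParser` and the miniature `sample_html` page are made up for illustration; they are not part of the notebook above.

```python
from html.parser import HTMLParser


class VisibleTextParser(HTMLParser):
    """Collect text nodes, skipping anything inside <script> or <style>."""

    SKIP_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside the skipped tags
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


# Hypothetical miniature page standing in for the downloaded article
sample_html = """
<html>
  <head><title>Demo page</title><style>p { color: red; }</style></head>
  <body>
    <h1>Tesla is working on 5th assembly line</h1>
    <script>var tracker = 1;</script>
    <p>At the Fremont factory.</p>
  </body>
</html>
"""

parser = VisibleTextParser()
parser.feed(sample_html)
visible_text = " ".join(parser.chunks)
print(visible_text)
```

The resulting string can be passed to `word_tokenize` in the same way as `soup.get_text()`.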

### Same example with ```requests```

In [41]:
import requests

response = requests.get(url)
response  # 200 means success; 404 means the page was not found

<Response [200]>
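The number in the response is an HTTP status code, and its first digit tells us the broad category. A minimal sketch using the standard-library `http.HTTPStatus` enum is shown below; the helper name `describe_status` is an assumption for this example and is not part of `requests`.

```python
from http import HTTPStatus


def describe_status(code):
    """Return a short label such as '404 Not Found (client error)'."""
    status = HTTPStatus(code)  # raises ValueError for codes the enum does not know
    categories = {
        1: "informational",
        2: "success",
        3: "redirection",
        4: "client error",
        5: "server error",
    }
    return f"{code} {status.phrase} ({categories[code // 100]})"


print(describe_status(200))
print(describe_status(404))
```

With `requests`, the numeric code itself is available as `response.status_code`.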

In [45]:
# Print the start of the raw HTML (response.content is bytes, hence the b'...' prefix in the output)
print(response.content[:100])

b'<!DOCTYPE html>\n<html lang="en-US">\n\t<head>\n\t\t\t\t<meta charset="UTF-8" />\n\t\t<meta name="viewport" con'
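The `b'...'` prefix shows that `response.content` holds raw bytes, whereas `response.text` holds the decoded string. The sketch below illustrates the difference without touching the network; the `raw` value is a shortened, made-up stand-in for the bytes a server might send.

```python
# Shortened, hypothetical stand-in for response.content
raw = b'<!DOCTYPE html>\n<html lang="en-US">\n\t<head>\n\t\t<title>Caf\xc3\xa9</title>'

# Decoding turns bytes into a str; the codec must match the page's encoding
text = raw.decode("utf-8")

print(type(raw).__name__, type(text).__name__)
print(text.splitlines()[0])
print(len(raw), len(text))  # the two-byte UTF-8 sequence becomes a single character
```

This is the same step that `.decode('utf8')` performed in the `urllib` example.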


In [48]:
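# We can then use BeautifulSoup and nltk again to get the words!
# Naming the parser explicitly avoids the "no parser was explicitly specified" warning
soup2 = BeautifulSoup(response.content, 'html.parser')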
print(soup2.prettify()[:100])

<!DOCTYPE html>
<html lang="en-US">
 <head>
  <meta charset="utf-8"/>
  <meta content="width=device-


### 2. Working with JSON libraries

We can read JSON from the web by requesting a URL, as done previously. The ```json``` package can then parse the JSON text into a Python dictionary.

In [50]:
import json

# A sample URL that returns JSON
json_query = 'https://api.opencorporates.com/companies/nl/17087985'

In [57]:
response = requests.get(json_query)
print(response.text)

{"api_version":"0.4.8","results":{"company":{"name":"Bover B.V.","company_number":"17087985","jurisdiction_code":"nl","incorporation_date":null,"dissolution_date":null,"company_type":"Besloten Vennootschap","registry_url":"https://www.kvk.nl/zoeken/handelsregister/#!uitgebreid-zoeken\u0026handelsnaam=\u0026kvknummer=17087985\u0026straat=\u0026postcode=\u0026huisnummer=\u0026plaats=\u0026hoofdvestiging=true\u0026rechtspersoon=true\u0026nevenvestiging=false\u0026zoekvervallen=1\u0026zoekuitgeschreven=1\u0026start=0\u0026initial=0\u0026searchfield=uitgebreidzoeken","branch":null,"branch_status":null,"inactive":false,"current_status":"Active","created_at":"2011-01-12T21:50:57+00:00","updated_at":"2020-01-12T20:42:49+00:00","retrieved_at":"2020-01-10T10:03:54+00:00","opencorporates_url":"https://opencorporates.com/companies/nl/17087985","source":{"publisher":"Kamer van Koophandel (KvK)","url":"https://www.kvk.nl/zoeken/handelsregister/#!uitgebreid-zoeken\u0026handelsnaam=\u0026kvknummer=170

In [63]:
# Parse the JSON text into a dictionary
output = json.loads(response.text)
type(output)  # dict (not displayed, because it is not the last expression in the cell)

# Look up an example key in the dictionary
print(output['api_version'])

0.4.8
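Nested JSON objects become nested dictionaries, so deeper values are reached by chaining keys. The sketch below uses a trimmed excerpt of the response printed earlier, so that it runs without a network connection; the `"website"` key is hypothetical and only shows what happens when a key is absent.

```python
import json

# Trimmed excerpt of the response shown earlier; the real payload has many more fields
sample_text = (
    '{"api_version":"0.4.8","results":{"company":{"name":"Bover B.V.",'
    '"company_number":"17087985","jurisdiction_code":"nl",'
    '"incorporation_date":null,"current_status":"Active"}}}'
)

data = json.loads(sample_text)

# Chain keys to walk down into the nested dictionaries
company = data["results"]["company"]
print(company["name"])

# JSON null is converted to Python None
print(company["incorporation_date"])

# .get() returns None instead of raising KeyError when a key is missing
print(company.get("website"))  # "website" is a hypothetical key, absent from the data
```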


### 3. Reading other file types

### Word files

In [64]:
from docx import Document

document = Document("C:\\Github\\nlp-analytics\\lectures\\S2 - Data sources and crawling\\sample.docx")

In [68]:
# We can get the paragraphs - but printing the list shows Paragraph objects, not their text
print(document.paragraphs[:10])

[<docx.text.paragraph.Paragraph object at 0x000001AE8395B088>, <docx.text.paragraph.Paragraph object at 0x000001AE839BE208>, <docx.text.paragraph.Paragraph object at 0x000001AE839CEE48>, <docx.text.paragraph.Paragraph object at 0x000001AE839CEF88>, <docx.text.paragraph.Paragraph object at 0x000001AE839CE608>, <docx.text.paragraph.Paragraph object at 0x000001AE839AA588>, <docx.text.paragraph.Paragraph object at 0x000001AE839AA848>, <docx.text.paragraph.Paragraph object at 0x000001AE839AA2C8>, <docx.text.paragraph.Paragraph object at 0x000001AE839AA788>, <docx.text.paragraph.Paragraph object at 0x000001AE839CE748>]


In [80]:
for p in document.paragraphs[:5]:
    print(p.text + "\n")



Template for Preparation of Papers for IEEE Sponsored Conferences & Symposia

Frank Anderson, Sam B. Niles, Jr., and Theodore C. Donald, Member, IEEE

Abstract—These instructions give you guidelines for preparing papers for IEEE conferences. Use this document as a template if you are using Microsoft Word 6.0 or later. Otherwise, use this document as an instruction set. Instructions about final paper and figure submissions in this document are for IEEE journals; please use this document as a “template” to prepare your manuscript. For submission guidelines, follow instructions on paper submission system as well as the Conference website. Do not delete the blank line immediately above the abstract; it sets the footnote at the bottom of this column.

INTRODUCTION
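Under the hood, a `.docx` file is a ZIP archive whose main text is stored as XML in `word/document.xml`. As a rough sketch of what a paragraph reader has to do, the example below extracts paragraph text using only the standard library. It is not how `python-docx` is implemented; the helper `docx_paragraphs` is a made-up name, and the in-memory archive is a stripped-down stand-in that holds just enough XML for the demonstration (it is not a complete Word file).

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Namespace used by WordprocessingML elements such as <w:p> and <w:t>
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def docx_paragraphs(file_like):
    """Return the text of each non-empty paragraph in a .docx archive."""
    with zipfile.ZipFile(file_like) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for p in root.iter(f"{{{W_NS}}}p"):
        # A paragraph's text is split across one or more <w:t> elements
        text = "".join(t.text or "" for t in p.iter(f"{{{W_NS}}}t"))
        if text.strip():
            paragraphs.append(text)
    return paragraphs


# Hypothetical stand-in archive containing only the document body
body_xml = (
    f'<w:document xmlns:w="{W_NS}"><w:body>'
    '<w:p><w:r><w:t>INTRODUCTION</w:t></w:r></w:p>'
    '<w:p/>'
    '<w:p><w:r><w:t>Frank Anderson, </w:t></w:r><w:r><w:t>Member, IEEE</w:t></w:r></w:p>'
    '</w:body></w:document>'
)
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("word/document.xml", body_xml)
buffer.seek(0)

sample_paragraphs = docx_paragraphs(buffer)
print(sample_paragraphs)
```

Skipping empty paragraphs here mirrors the blank lines seen in the printed output, which come from paragraphs with no text.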



### PDF Files

In [82]:
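# Note: the PdfFileReader/PdfFileWriter names (and the camelCase methods used below)
# were removed in PyPDF2 3.0, where the classes are called PdfReader/PdfWriter
from PyPDF2 import PdfFileReader, PdfFileWriter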

In [86]:
file = 'C:\\Github\\nlp-analytics\\lectures\\S2 - Data sources and crawling\\sample.pdf'

pdf = PdfFileReader(file)

In [87]:
# Get Document info
print(pdf.getDocumentInfo())

{'/Creator': 'Rave (http://www.nevrona.com/rave)', '/Producer': 'Nevrona Designs', '/CreationDate': 'D:20060301072826'}
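The `/CreationDate` value is a PDF date string of the form `D:YYYYMMDDHHmmSS`, optionally followed by a time-zone offset. A minimal sketch for turning it into a `datetime` follows; the helper name `parse_pdf_date` is an assumption for this example, and it only handles strings that contain all 14 digits, ignoring any time-zone suffix.

```python
from datetime import datetime


def parse_pdf_date(value):
    """Parse the leading 'D:YYYYMMDDHHmmSS' part of a PDF date string."""
    digits = value[2:] if value.startswith("D:") else value
    # Only the first 14 digits are used; a trailing time-zone offset is ignored
    return datetime.strptime(digits[:14], "%Y%m%d%H%M%S")


created = parse_pdf_date("D:20060301072826")
print(created.isoformat())
```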


In [88]:
# Get number of pages
print(pdf.getNumPages())

2


In [92]:
# We can also rotate pages, which may be useful for tables in PDFs

# Access first page
page = pdf.getPage(0)
page.rotateClockwise(90)

{'/Type': '/Page',
 '/Parent': {'/Type': '/Pages',
  '/Count': 2,
  '/Kids': [IndirectObject(4, 0), IndirectObject(6, 0)]},
 '/Resources': {'/Font': {'/F1': {'/Type': '/Font',
    '/Subtype': '/Type1',
    '/Name': '/F1',
    '/BaseFont': '/Helvetica',
    '/Encoding': '/WinAnsiEncoding'}},
  '/ProcSet': ['/PDF', '/Text']},
 '/MediaBox': [0, 0, 612, 792],
 '/Contents': {},
 '/Rotate': 360}