# Introduction to Text Analytics
> Lecturer: Binh Thanh Le

> School of Computer Science, University College Dublin

## Getting Data
This notebook shows how to download data available on the Internet. We cover the formats in which data is most commonly published, and practise with example Python code and command-line utilities for getting data.

    TOPIC1: Getting data from a Web URL, RSSFeed, PDF

    TOPIC2: Crawling/Scraping data from the Web (entire websites).

    TOPIC3: Getting data via APIs (JSON format).

## Tasks:

    1) Learn how to get data from URL, RSSFeed, PDF

    2) Learn how to crawl/scrape the Web

    3) Learn how to use JSON to get data from APIs


## TOPIC1: Getting data from a Web URL, RSSFeed, PDF

In [34]:
#Import package 'requests' for URL scraping
import requests

#Look at the package structure
print('PACKAGE STRUCTURE: \n' , dir(requests))
print('-'*100)

#Look at individual functions
print('PRINT THE HELP OF FUNCTION')
help(requests.get)
#Same as help() but opens a new window
# ?requests.get

PACKAGE STRUCTURE: 
----------------------------------------------------------------------------------------------------
PRINT THE HELP OF FUNCTION
Help on function get in module requests.api:

get(url, params=None, **kwargs)
    Sends a GET request.
    
    :param url: URL for the new :class:`Request` object.
    :param params: (optional) Dictionary or bytes to be sent in the query string for the :class:`Request`.
    :param \*\*kwargs: Optional arguments that ``request`` takes.
    :return: :class:`Response <Response>` object
    :rtype: requests.Response
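
The `params` argument described in the help text ends up in the query string of the URL. The sketch below uses only the standard library to illustrate the shape of such a URL; `requests` does this encoding for you when you pass `params`. The endpoint and parameter names are made up for illustration.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Hypothetical endpoint and parameters, used only for illustration
base_url = "https://example.org/search"
params = {"q": "text analytics", "page": 2}

# Encode the dictionary as a query string and append it to the URL
full_url = base_url + "?" + urlencode(params)
print(full_url)

# Decode the query string again to check the round trip
decoded = parse_qs(urlsplit(full_url).query)
print(decoded)
```

With `requests`, the equivalent call would be `requests.get(base_url, params=params)`.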



### Get a text file from URL

In [35]:
#Get a text file.
#Get book "Alice's Adventures in Wonderland" from Project Gutenberg, in text format

#If not imported already, import requests
import requests

#Give the URL for the file to be downloaded
url='http://www.gutenberg.org/cache/epub/19033/pg19033.txt'

#Look at the object returned by requests.get()
#(named 'response' so that it does not shadow the Python builtin 'object')
response = requests.get(url)
print('The encoding of text: ', response.encoding)
print('-'*100)
#Get the content from the downloaded text file
text_page = response.text
#print(text_page)

#Look at the first 1000 characters of the book
print('CONTENT OF TEXT: [0->1000] CHARACTERS \n')
print(text_page[0:1000])
print('-'*100)

The encoding of text:  utf-8
----------------------------------------------------------------------------------------------------
CONTENT OF TEXT: [0->1000] CHARACTERS 

﻿The Project Gutenberg EBook of Alice in Wonderland, by Lewis Carroll

This eBook is for the use of anyone anywhere at no cost and with
almost no restrictions whatsoever.  You may copy it, give it away or
re-use it under the terms of the Project Gutenberg License included
with this eBook or online at www.gutenberg.org


Title: Alice in Wonderland

Author: Lewis Carroll

Illustrator: Gordon Robinson

Release Date: August 12, 2006 [EBook #19033]

Language: English


*** START OF THIS PROJECT GUTENBERG EBOOK ALICE IN WONDERLAND ***




Produced by Jason Isbell, Irma Spehar, and the Online
Distributed Proofreading Team at http://www.pgdp.net









          [Illustration: Alice in the Room of the Duchess.]


                       _THE "STORYLAND" SERIES_



                   A
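
The output above starts with a licence header, and the book itself only begins after the line starting with `*** START OF`. A minimal sketch for dropping that header is shown below. The function name `strip_gutenberg_header` is hypothetical, and the sample string is a shortened copy of the output above.

```python
def strip_gutenberg_header(text, start_marker="*** START OF"):
    """Return the text that follows the first line starting with start_marker."""
    text = text.lstrip("\ufeff")  # drop a leading byte-order mark, if present
    lines = text.splitlines()
    for i, line in enumerate(lines):
        if line.startswith(start_marker):
            return "\n".join(lines[i + 1:]).strip()
    return text  # marker not found: return the text unchanged

# Shortened sample based on the output above
sample = (
    "\ufeffThe Project Gutenberg EBook of Alice in Wonderland, by Lewis Carroll\n"
    "Title: Alice in Wonderland\n"
    "*** START OF THIS PROJECT GUTENBERG EBOOK ALICE IN WONDERLAND ***\n"
    "\n"
    "Produced by Jason Isbell, Irma Spehar, and the Online\n"
)
body = strip_gutenberg_header(sample)
print(body)
```

On the downloaded book, the call would be `strip_gutenberg_header(text_page)`.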

### Get the data from Web URL

Let's fetch several pages from The Irish Times.

1) https://www.irishtimes.com/news/science/young-scientists-lauded-for-ethical-focus-on-climate-change-1.3630771

2) https://www.irishtimes.com/news/science/eu-contest-for-young-scientists-opens-in-dublin-1.3630668

3) https://www.irishtimes.com/business/retail-and-services/dunnes-stores-withdraws-from-talks-to-buy-donnybrook-fair-1.3629687
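
When saving several pages such as the ones listed above, each page needs its own file name. One possible approach, sketched here with a hypothetical helper `url_to_filename`, is to reuse the last path segment of the URL.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

urls = [
    "https://www.irishtimes.com/news/science/young-scientists-lauded-for-ethical-focus-on-climate-change-1.3630771",
    "https://www.irishtimes.com/news/science/eu-contest-for-young-scientists-opens-in-dublin-1.3630668",
    "https://www.irishtimes.com/business/retail-and-services/dunnes-stores-withdraws-from-talks-to-buy-donnybrook-fair-1.3629687",
]

def url_to_filename(url, extension=".html"):
    """Build a file name from the last segment of the URL path."""
    slug = PurePosixPath(urlparse(url).path).name
    return slug + extension

filenames = [url_to_filename(u) for u in urls]
for name in filenames:
    print(name)
```

Inside a loop over `urls`, the page text from `requests.get(url).text` could then be written to `url_to_filename(url)`.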


In [36]:
#Get an HTML file.
#Get news article from IrishTimes website.

#If not imported already, import requests
#import requests

#Give the URL for the file to be downloaded

url  = 'https://www.irishtimes.com/news/science/young-scientists-lauded-for-ethical-focus-on-climate-change-1.3630771'

#Get the content from the downloaded html file
html_page = requests.get(url).text

#Look at the format of the html file
print(html_page[:5000])

## Use the IPython magic function "store" to append the page to a file
%store html_page >> irishTest01.txt

## You can view the HTML tags using: https://codebeautify.org/htmlviewer/






								
																																																																																																																																																																																																																	

<!doctype html>
<!--[if lt IE 7]> <html class="no-js lt-ie9 lt-ie8 lt-ie7" lang="en" prefix="og:http://ogp.me/ns# fb:http://www.facebook.com/2008/fbml irishtimes:http://www.irishtimes.com/" version="HTML+RDFa 1.1"> <![endif]-->
<!--[if IE 7]>    <html class="no-js lt-ie9 lt-ie8" lang="en" prefix="og:http://ogp.me/ns# fb:http://www.facebook.com/2008/fbml irishtimes:http://www.irishtimes.com/" version="HTML+RDFa 1.1"> <![endif]-->
<!--[if IE 8]>    <html class="no-js lt-ie9" lang="en" prefix="og:http://ogp.me/ns# fb:http://www.facebook.com/2008/fbml irishtimes:http://www.irishtimes.com/" version="HTML+RDFa 1.1"> <![endif]-->
<!--[if IE 9]>    <html class="no-js eq-ie9" lang="en" prefix="og:http://ogp.me/ns# fb:http://www.facebook.com/2008/fbml irishtimes:http://www.ir

#### Use package 'beautifulsoup' to extract the content of HTML fields 

    You need to know the HTML structure and the tags that contain the information you want.

    To inspect the HTML file, open it in a text editor and look for the tags that contain the headline, subheadline and article body.

    If you don't have beautifulsoup4 installed, run in a shell: conda install beautifulsoup4
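
To see the idea of tag-based extraction in isolation, the sketch below parses a small inline HTML string with the standard-library `html.parser` module rather than beautifulsoup. The HTML snippet and the class `ArticleParser` are made up for illustration. The tag names and the `no_name` class mirror the ones used for The Irish Times pages in this notebook.

```python
from html.parser import HTMLParser

class ArticleParser(HTMLParser):
    """Collect the <title>, the meta description and <p class="no_name"> text."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.description = None
        self.paragraphs = []
        self._in_title = False
        self._in_paragraph = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and attrs.get("name") == "description":
            self.description = attrs.get("content")
        elif tag == "p" and attrs.get("class") == "no_name":
            self._in_paragraph = True
            self.paragraphs.append("")

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag == "p":
            self._in_paragraph = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        elif self._in_paragraph:
            self.paragraphs[-1] += data

# Made-up page with the same kinds of tags as a news article
sample_html = """<html><head><title>Sample headline</title>
<meta name="description" content="Sample subheadline"></head>
<body><article>
<p class="no_name">First paragraph.</p>
<p class="other">Skip me.</p>
<p class="no_name">Second paragraph.</p>
</article></body></html>"""

parser = ArticleParser()
parser.feed(sample_html)
print(parser.title)
print(parser.description)
print(parser.paragraphs)
```

beautifulsoup offers the same lookups with far less code, which is why the rest of this notebook uses it.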
    
#### Structure of IrishTimes:

    1) The headline and subheadline will be in the section:

   <img src="./supportDocs/irishtime_01.png">

    2) The content will be stored in the tags:

   <img src="./supportDocs/irishtime_02.png">

In [37]:
from bs4 import BeautifulSoup

# Function to parse the structure of an HTML page using the package beautifulsoup.
# The code looks for specific tags in the HTML structure and extracts their content.
def getArticleDetailsByUrl(url):
    page = requests.get(url)
    soup = BeautifulSoup(page.text, "html.parser")
    #soup.prettify()

    # Headline: the text of the <title> tag
    headline = soup.title.string
    # Subheadline: the content of <meta name="description"> in the page head
    subheadline = soup.head.find("meta", attrs={"name": "description"}).get('content')

    # Article body: concatenate all <p class="no_name"> paragraphs inside <article>
    doc_body = ''
    if "The Irish Times" in soup.text:
        for body_p_tag in soup.article.find_all("p", attrs={"class": "no_name"}):
            doc_body += body_p_tag.get_text() + " "

    # Label the source based on the URL
    source = "Other"
    if "irishtimes" in url:
        source = "IrishTimes"

    # Approximate the first sentence as the text up to the first full stop
    first_sentence = doc_body.split(".")[0]

    return [headline, subheadline, first_sentence, doc_body, source]



In [38]:
# Main code that calls our parsing method getArticleDetailsByUrl(url) for specific html pages.
article_url = 'https://www.irishtimes.com/news/science/young-scientists-lauded-for-ethical-focus-on-climate-change-1.3630771'
#print(getArticleDetailsByUrl(article_url))

print("\nField by field:\n")
[headline, subheadline, first_sentence, doc_body, source] = getArticleDetailsByUrl(article_url)
print("Headline:\n", headline, "\n")
print("Subheadline:\n", subheadline, "\n")
print("First sentence:\n", first_sentence, "\n")
print("Article body:\n", doc_body)


Field by field:

Headline:
 Young Scientists lauded for ethical focus on climate change 

Subheadline:
 Contributions critical to advancing Paris Agreement, Higgins tells EU contest in RDS 

First sentence:
 President Michael D Higgins has saluted young scientists for their “independence of thought, critical turn of mind and questioning of the status quo” 

Article body:
 President Michael D Higgins has saluted young scientists for their “independence of thought, critical turn of mind and questioning of the status quo”. Speaking at the opening of the European Union Contest for Young Scientists (EUCYS) at the RDS in Dublin on Saturday, Mr Higgins said such qualities were often combined with an ethical concern for “the community and the planet”. They were essential to meeting the challenges facing the world today, especially from climate change. “Some of you have sought to push barriers in the field of mathematics; to investigate causes of and solutions to a variety of social problems, 
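
Splitting on the first full stop works for the article above, but it cuts a sentence short whenever a full stop appears inside a number. A slightly more careful heuristic, sketched below with a hypothetical helper `extract_first_sentence` and a made-up sample sentence, splits only where a sentence-ending mark is followed by whitespace. Abbreviations such as "Mr." would still fool it.

```python
import re

def extract_first_sentence(text):
    """Return the text up to and including the first '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip(), maxsplit=1)
    return parts[0]

# Made-up sample text containing a decimal number
sample = "The price rose 3.5 per cent. Analysts were surprised."

naive = sample.split(".")[0]              # stops at the decimal point
better = extract_first_sentence(sample)   # stops at the end of the sentence

print(naive)
print(better)
```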

### Get the data from RSSFeed (XML file)
Get an XML file.

Get the whole RSS feed for the Irish Times news articles.

This is an XML file listing the URLs of individual news articles published online.

You need to know the structure of the XML to be able to extract text from specific tags.
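
As a sketch of what "knowing the structure" means, the example below reads the `<item>` entries of a small RSS document with the standard-library `xml.etree.ElementTree` module. The sample is a shortened feed in the RSS 2.0 layout used by The Irish Times, and the function name `list_rss_items` is hypothetical. The notebook itself uses feedparser, which handles many more feed variants.

```python
import xml.etree.ElementTree as ET

# Shortened sample in the RSS 2.0 layout: <rss> -> <channel> -> <item> -> <title>, <link>
rss_sample = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title><![CDATA[The Irish Times - News]]></title>
    <item>
      <title><![CDATA[Elderly punter helps staff foil three-man armed raid on bookie’s]]></title>
      <link>https://www.irishtimes.com/news/crime-and-law/elderly-punter-helps-staff-foil-three-man-armed-raid-on-bookie-s-1.3631156</link>
      <description><![CDATA[Customer wrestles masked raider before chasing him and companion out of shop]]></description>
    </item>
  </channel>
</rss>"""

def list_rss_items(xml_text):
    """Return a list of {'title': ..., 'link': ...} dictionaries, one per <item>."""
    root = ET.fromstring(xml_text)
    items = []
    for item in root.iter("item"):
        items.append({"title": item.findtext("title"), "link": item.findtext("link")})
    return items

entries = list_rss_items(rss_sample)
for entry in entries:
    print(entry["title"])
    print(entry["link"])
```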

In [39]:
import requests
#Feedparser is a library to parse RSS/XML feeds, which are files with a specific XML structure
#If you don't have it, install using conda or pip, e.g.,: pip install feedparser
import feedparser
#help(feedparser)

#Parse the XML file to retrieve the URLs for individual news articles.
#Parse each article's HTML page
def scrapRSSFeed(rss_feed):
    d = feedparser.parse(rss_feed)
    #print(d)
    #print(d['entries'], "\n")
        
    for item in d['entries']:
        #Extract an article URL
        article_url = item['link']
        [headline, subheadline, first_sentence, doc_body, source] = getArticleDetailsByUrl(article_url) 
        print("Article:", headline, "\n")

In [40]:
#The URL of the XML file
url='http://www.irishtimes.com/cmlink/news-1.1319192'
xml_page = requests.get(url).text

#Look at the structure of the XML file
#To have a proper look, open the XML file with a text editor

print('The XML content: \n',xml_page[:1000])
print('-'*100)


# Call the method that parses a given XML file
print('Just print the Headline of articles: \n')
scrapRSSFeed(url)

The XML content: 
 <?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title><![CDATA[The Irish Times - News]]></title>
    <link>/cmlink/the-irish-times-news-1.1319192</link>
    <description>
                    
          </description>
    <lastBuildDate>Sun, 16 Sep 2018 18:24:19 +0000</lastBuildDate>
    
                  <language></language>
              
          <item>
        <title><![CDATA[Elderly punter helps staff foil three-man armed raid on bookie’s]]></title>
        <link>https://www.irishtimes.com/news/crime-and-law/elderly-punter-helps-staff-foil-three-man-armed-raid-on-bookie-s-1.3631156</link>
        <description><![CDATA[Customer wrestles masked raider before chasing him and companion out of shop]]></description>
        <guid isPermaLink="false">1.3631156</guid>
        <pubDate>Sun, 16 Sep 2018 18:24:19 +0000</pubDate>
      </item>
          <item>
        <title><![CDATA[Number on trolleys could pass 1,000 this winter, IMO event told]]></title>
   

###  Get the data from PDF

In [41]:
#Get a PDF file, save it to disk.
#import requests

# Give url of the PDF file
url='http://www.greenteapress.com/thinkpython/thinkpython.pdf'
# Download the pdf file into request_object
request_object = requests.get(url)

#PDF is a binary format. Use request_object.content instead of request_object.text
#Write the binary content to your machine's disk in a file named 'thinkpython.pdf'
with open("thinkpython.pdf", "wb") as pdffile:
    # Look at the content of the file
    print(request_object.content[:500])
    
    #Write the content of the request_object to the file named "thinkpython.pdf"
    pdffile.write(request_object.content)

#Check that it downloaded the file to the current directory.
#%ls

b'%PDF-1.5\n%\xd0\xd4\xc5\xd8\n2 0 obj <<\n/Type /ObjStm\n/N 100\n/First 804\n/Length 1113      \n/Filter /FlateDecode\n>>\nstream\nx\xda\x9dV\xdbn\x9cH\x10}\x9f\xaf\xa8\xc7d\xb5\x8a\xe9\x0b\xdd\xb0\x8a\x12E\x9b8\xca\xc3*Vl%\xcf\x1d\xe8\x19\xa30\x80\x1a\xb0=\xfb\xf5{\x8a\x8b\xed\xec\xa5\x07\xed\x83M\x0f\xd49UuNu\x83\xa0\x84RR\x82\x0ciA9YA"\xa1\xdc\x90P$dFB\x930\x92\x84%\x91c\x99\x91\x14)\xfeHj\xbb\x93\x92\xa4\xc1\x1d\xe0\x93\x04KR\xe0\x919)\xa3p\x87T\x86\x8b\x02-\x9ek\xd2\xd2\x92\xb2\xa4S<\xcfH\xdb\x84\xf3\xa5\x89\xdciI)\xb2\xe8\x94R\x8d\x8b\xa14\xc3%\'#\xb0L\xc8(<Pd\x8c\xc5s26\xa7\xd4\xa2J`3\xb2\xa00\x82\xacU;\xd4h\xf3\x8cLJ\x19R\x1bC\x19\x12\x99\x9c\xb2\x1c\xcf\xd1\x10\xaa\xb3\x8a\xf2\tDy\x96\x00\x84F\x05P\xe81Q\x922n\xdc\xe8]\x86f\x13\x0b\x9a\x94\x84H2\xca \x85\x90\t\x88pE!y\x82\xab\x01\x07\xf4\x11\xb9%VE&9a)\xa4\xc2\x15|2e2\xc1\x02\xaa\x9dH\xc0\xc8\x02\x89\x04\x94J1?\xcbk\xf8\x0eHU.9\x13\x94\xe6_,\xbbF[\x82\x85\xd7)\x17\x01b\x9d\xe1\x1f\xc4\x17)\x07B~\x91j\xb9\x130@\xa4h\x19\t\xb1\
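
The bytes printed above begin with `%PDF-1.5`, the header that identifies a PDF file. A small check on this header, sketched below with a hypothetical helper `pdf_version`, can catch cases where the server returned something else, such as an HTML error page.

```python
def pdf_version(data):
    """Return the version from the PDF header (e.g. '1.5'), or None if data is not a PDF."""
    if not data.startswith(b"%PDF-"):
        return None
    header = data.splitlines()[0]  # first line, e.g. b'%PDF-1.5'
    return header[len(b"%PDF-"):].decode("ascii")

# First bytes of the download shown above, and a made-up HTML response
version = pdf_version(b"%PDF-1.5\n%\xd0\xd4\xc5\xd8\n2 0 obj <<")
not_pdf = pdf_version(b"<!doctype html><html>")

print(version)
print(not_pdf)
```

In the cell above, `pdf_version(request_object.content)` could be checked before writing the file to disk.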

In [42]:
# Extract the text from the pdf file.
# Use the 'pdftotext' command line utility for your operating system, e.g., Unix, macOS, Windows.
# Google "pdftotext install" and install for your operating system.

#Use 'pdftotext' to extract the text content from the pdf file to a text file with 
# the same name, but extension .txt
# Can call pdftotext in the command line or directly from Jupyter Notebook.
!pdftotext -enc UTF-8 thinkpython.pdf thinkpython.txt

#Read the text file with text extracted from the original PDF
with open('thinkpython.txt', 'rb') as f:
    text_content = f.read().decode('UTF-8')

print(text_content[:500]) 

Think Python
How to Think Like a Computer Scientist

Version 2.0.17

Think Python
How to Think Like a Computer Scientist

Version 2.0.17

Allen Downey

Green Tea Press
Needham, Massachusetts

Copyright © 2012 Allen Downey.
Green Tea Press
9 Washburn Ave
Needham MA 02492
Permission is granted to copy, distribute, and/or modify this document under the terms of the
Creative Commons Attribution-NonCommercial 3.0 Unported License, which is available at http:
//creativecommons.org/licenses/by-nc/3.
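
The `!` prefix only works inside Jupyter/IPython. In a plain Python script, the same `pdftotext` call could be made with the `subprocess` module, roughly as sketched below. The helper names `pdftotext_command` and `extract_text` are hypothetical, and the sketch assumes `pdftotext` is on the PATH.

```python
import shutil
import subprocess
from pathlib import Path

def pdftotext_command(pdf_path):
    """Build the pdftotext command that writes a .txt file next to the PDF."""
    pdf_path = Path(pdf_path)
    txt_path = pdf_path.with_suffix(".txt")
    return ["pdftotext", "-enc", "UTF-8", str(pdf_path), str(txt_path)]

def extract_text(pdf_path):
    """Run pdftotext and return the extracted text."""
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext is not installed or not on the PATH")
    command = pdftotext_command(pdf_path)
    subprocess.run(command, check=True)  # raises CalledProcessError on failure
    return Path(command[-1]).read_text(encoding="utf-8")

command = pdftotext_command("thinkpython.pdf")
print(command)
```

Calling `extract_text("thinkpython.pdf")` then returns the same text as the cell above.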


## TOPIC2: Crawling/Scraping data from the Web

As an alternative to using the Python package requests, you can use the wget utility to download an HTML page from a given URL, or an entire website. If you don't have wget on your computer, first install it for your platform.

For macOS:

        1) First install Homebrew: https://brew.sh
        
        2) Install wget with: brew install wget

For Windows:

        1) See the installation instructions here: http://gnuwin32.sourceforge.net/packages/wget.htm
        
For Linux: 

        1) Try: sudo apt install wget (https://www.cyberciti.biz/faq/how-to-install-wget-togetrid-of-error-bash-wget-command-not-found/)

In [43]:
!wget http://dmoz-odp.org/Computers/Programming/Languages/Python/Books/ -O dmoz_pythonbooks.html

--2018-09-16 21:31:53--  http://dmoz-odp.org/Computers/Programming/Languages/Python/Books/
Resolving dmoz-odp.org (dmoz-odp.org)... 72.10.193.113
Connecting to dmoz-odp.org (dmoz-odp.org)|72.10.193.113|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: ‘dmoz_pythonbooks.html’

dmoz_pythonbooks.ht     [  <=>               ]  45.29K   148KB/s    in 0.3s    

2018-09-16 21:31:53 (148 KB/s) - ‘dmoz_pythonbooks.html’ saved [46381]



The wget tool is well suited to crawling entire websites or parts of them. It recursively follows URLs up to a given depth. The example below downloads part of the website into a local folder named dmoz-odp.org. The parameter -r turns on recursive retrieval, -l sets the depth to which wget follows URLs from the original URL, and --no-parent tells wget never to ascend to the parent directory, so nothing outside the given path is downloaded. See http://linuxreviews.org/quicktips/wget/ for more details.
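
The effect of the depth limit and the no-parent rule can be sketched in Python as a breadth-first crawl. This is an illustration of the idea, not a description of how wget is implemented. The function `crawl`, the `get_links` callback and the in-memory site are all made up so that the example runs without network access.

```python
from collections import deque
from urllib.parse import urlparse

def crawl(start_url, get_links, max_depth=1):
    """Breadth-first crawl limited by depth and restricted to paths under start_url."""
    start = urlparse(start_url)
    seen = {start_url}
    queue = deque([(start_url, 0)])
    visited = []
    while queue:
        url, depth = queue.popleft()
        visited.append(url)
        if depth >= max_depth:
            continue  # depth limit reached: do not follow links from this page
        for link in get_links(url):
            parsed = urlparse(link)
            # Keep only links on the same host and below the starting path
            inside = parsed.netloc == start.netloc and parsed.path.startswith(start.path)
            if inside and link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return visited

# Made-up site: maps each page to the links found on it
fake_site = {
    "http://example.org/Python/": [
        "http://example.org/Python/Books/",
        "http://example.org/Java/",
        "http://example.org/Python/FAQs/",
    ],
    "http://example.org/Python/Books/": ["http://example.org/Python/Books/Free/"],
}

pages = crawl("http://example.org/Python/", lambda u: fake_site.get(u, []), max_depth=1)
print(pages)
```

With `max_depth=1`, the page under `/Java/` is skipped because it lies outside the starting path, and `/Python/Books/Free/` is skipped because it is two links away from the start.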

In [44]:
! wget http://dmoz-odp.org/Computers/Programming/Languages/Python/ -r -l 1 --no-parent

--2018-09-16 21:31:54--  http://dmoz-odp.org/Computers/Programming/Languages/Python/
Resolving dmoz-odp.org (dmoz-odp.org)... 72.10.193.113
Connecting to dmoz-odp.org (dmoz-odp.org)|72.10.193.113|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: ‘dmoz-odp.org/Computers/Programming/Languages/Python/index.html’

dmoz-odp.org/Comput     [  <=>               ]  42.21K  93.9KB/s    in 0.4s    

2018-09-16 21:31:54 (93.9 KB/s) - ‘dmoz-odp.org/Computers/Programming/Languages/Python/index.html’ saved [43228]

Loading robots.txt; please ignore errors.
--2018-09-16 21:31:54--  http://dmoz-odp.org/robots.txt
Reusing existing connection to dmoz-odp.org:80.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/plain]
Saving to: ‘dmoz-odp.org/robots.txt’

dmoz-odp.org/robots     [ <=>                ]     339  --.-KB/s    in 0s      

2018-09-16 21:31:54 (8.98 MB/s) - ‘dmoz-odp.org/robots.txt’ saved [339]

--2018-09-16 21:3
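
The log above shows wget loading robots.txt before it follows links. A Python crawler can consult the same file with the standard-library `urllib.robotparser` module. The sketch below parses a made-up robots.txt from a string, so the rules and URLs are illustrative only.

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt content for illustration
robots_txt = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

robots = RobotFileParser()
robots.parse(robots_txt.splitlines())

allowed = robots.can_fetch("*", "http://example.org/Python/")
blocked = robots.can_fetch("*", "http://example.org/private/page.html")
delay = robots.crawl_delay("*")  # seconds to wait between requests, if specified

print(allowed, blocked, delay)
```

For a live site, `robots.set_url(...)` followed by `robots.read()` downloads and parses the real file instead.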

In [45]:
#Data is downloaded to root folder named "dmoz-odp.org"
%ls dmoz-odp.org

Computers/  robots.txt


In [46]:
#Stop the crawl after a short while; otherwise it may fill the hard disk or get us banned by the website
! wget http://www.irishtimes.com/ -r -l 1 --no-parent

--2018-09-16 21:31:59--  http://www.irishtimes.com/
Resolving www.irishtimes.com (www.irishtimes.com)... 151.101.18.174
Connecting to www.irishtimes.com (www.irishtimes.com)|151.101.18.174|:80... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://www.irishtimes.com/ [following]
--2018-09-16 21:31:59--  https://www.irishtimes.com/
Connecting to www.irishtimes.com (www.irishtimes.com)|151.101.18.174|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 371775 (363K) [text/html]
Saving to: ‘www.irishtimes.com/index.html’


2018-09-16 21:31:59 (1.81 MB/s) - ‘www.irishtimes.com/index.html’ saved [371775/371775]

Loading robots.txt; please ignore errors.
--2018-09-16 21:31:59--  https://www.irishtimes.com/robots.txt
Reusing existing connection to www.irishtimes.com:443.
HTTP request sent, awaiting response... 200 OK
Length: 666 [text/plain]
Saving to: ‘www.irishtimes.com/robots.txt’


2018-09-16 21:31:59 (24.4 MB/s) - ‘www.irishtime

HTTP request sent, awaiting response... 200 OK
Length: 19678 (19K) [text/xml]
Saving to: ‘www.irishtimes.com/rss/noa-rss-1.3093339’


2018-09-16 21:32:00 (3.95 MB/s) - ‘www.irishtimes.com/rss/noa-rss-1.3093339’ saved [19678/19678]

--2018-09-16 21:32:00--  https://www.irishtimes.com/rss/construction-rss-1.3020527
Reusing existing connection to www.irishtimes.com:443.
HTTP request sent, awaiting response... 200 OK
Length: 16379 (16K) [text/xml]
Saving to: ‘www.irishtimes.com/rss/construction-rss-1.3020527’


2018-09-16 21:32:00 (11.8 MB/s) - ‘www.irishtimes.com/rss/construction-rss-1.3020527’ saved [16379/16379]

--2018-09-16 21:32:00--  https://www.irishtimes.com/rss/commercial-property-rss-1.3020524
Reusing existing connection to www.irishtimes.com:443.
HTTP request sent, awaiting response... 200 OK
Length: 15905 (16K) [text/xml]
Saving to: ‘www.irishtimes.com/rss/commercial-property-rss-1.3020524’


2018-09-16 21:32:00 (13.2 MB/s) - ‘www.irishtimes.com/rss/commercial-property-rss-1.3

HTTP request sent, awaiting response... 302 Found
Location: / [following]
--2018-09-16 21:32:04--  https://www.irishtimes.com/
Reusing existing connection to www.irishtimes.com:443.
HTTP request sent, awaiting response... 200 OK
Length: 371775 (363K) [text/html]
www.irishtimes.com/news: Is a directory

Cannot write to ‘www.irishtimes.com/news’ (Success).
--2018-09-16 21:32:04--  https://www.irishtimes.com/sport
Connecting to www.irishtimes.com (www.irishtimes.com)|151.101.18.174|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 225820 (221K) [text/html]
www.irishtimes.com/sport: Is a directory

Cannot write to ‘www.irishtimes.com/sport’ (Success).
--2018-09-16 21:32:04--  https://www.irishtimes.com/business
Connecting to www.irishtimes.com (www.irishtimes.com)|151.101.18.174|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 268028 (262K) [text/html]
www.irishtimes.com/business: Is a directory

Cannot write to ‘www.irishtimes.com/business’

HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: ‘www.irishtimes.com/news/world/us/trump-presidency’

www.irishtimes.com/     [ <=>                ] 167.37K  --.-KB/s    in 0.1s    

2018-09-16 21:32:20 (1.28 MB/s) - ‘www.irishtimes.com/news/world/us/trump-presidency’ saved [171387]

--2018-09-16 21:32:20--  https://www.irishtimes.com/news/world/terror-attacks
Reusing existing connection to www.irishtimes.com:443.
HTTP request sent, awaiting response... 200 OK
Length: 179431 (175K) [text/html]
Saving to: ‘www.irishtimes.com/news/world/terror-attacks’


2018-09-16 21:32:20 (1.78 MB/s) - ‘www.irishtimes.com/news/world/terror-attacks’ saved [179431/179431]

--2018-09-16 21:32:20--  https://www.irishtimes.com/news/politics/inside-politics
Reusing existing connection to www.irishtimes.com:443.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: ‘www.irishtimes.com/news/politics/inside-politics’

www.irishtimes.co

HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
www.irishtimes.com/culture/film: Is a directory

Cannot write to ‘www.irishtimes.com/culture/film’ (Success).
--2018-09-16 21:32:22--  https://www.irishtimes.com/news/politics/liadh-n%C3%AD-riada-named-as-sinn-f%C3%A9in-candidate-for-presidency-1.3630828
Connecting to www.irishtimes.com (www.irishtimes.com)|151.101.18.174|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 200753 (196K) [text/html]
Saving to: ‘www.irishtimes.com/news/politics/liadh-ní-riada-named-as-sinn-féin-candidate-for-presidency-1.3630828’


2018-09-16 21:32:22 (1.91 MB/s) - ‘www.irishtimes.com/news/politics/liadh-ní-riada-named-as-sinn-féin-candidate-for-presidency-1.3630828’ saved [200753/200753]

--2018-09-16 21:32:22--  https://www.irishtimes.com/polopoly_fs/1.3630964.1537108463!/image/image.jpg_gen/derivatives/box_140/image.jpg
Reusing existing connection to www.irishtimes.com:443.
HTTP request sent, awaiting 

Connecting to www.irishtimes.com (www.irishtimes.com)|151.101.18.174|:443... connected.
HTTP request sent, awaiting response... 302 Temporarily Moved
Location: https://www.irishtimes.com/life-and-style/fashion/simone-rocha-s-towering-veiled-hats-strike-a-grand-note-at-london-fashion-week-1.3631252?mode=sample&auth-failed=1&pw-origin=https%3A%2F%2Fwww.irishtimes.com%2Flife-and-style%2Ffashion%2Fsimone-rocha-s-towering-veiled-hats-strike-a-grand-note-at-london-fashion-week-1.3631252 [following]
--2018-09-16 21:32:23--  https://www.irishtimes.com/life-and-style/fashion/simone-rocha-s-towering-veiled-hats-strike-a-grand-note-at-london-fashion-week-1.3631252?mode=sample&auth-failed=1&pw-origin=https%3A%2F%2Fwww.irishtimes.com%2Flife-and-style%2Ffashion%2Fsimone-rocha-s-towering-veiled-hats-strike-a-grand-note-at-london-fashion-week-1.3631252
Reusing existing connection to www.irishtimes.com:443.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: ‘www.i

HTTP request sent, awaiting response... 200 OK
Length: 110679 (108K) [text/html]
Saving to: ‘www.irishtimes.com/life-and-style/health-family/we-ve-been-together-18-months-but-my-boyfriend-won-t-have-sex-with-me-1.3629590’


2018-09-16 21:32:28 (1.24 MB/s) - ‘www.irishtimes.com/life-and-style/health-family/we-ve-been-together-18-months-but-my-boyfriend-won-t-have-sex-with-me-1.3629590’ saved [110679/110679]

--2018-09-16 21:32:28--  https://www.irishtimes.com/polopoly_fs/1.3629589.1536943150!/image/image.jpg_gen/derivatives/box_140/image.jpg
Reusing existing connection to www.irishtimes.com:443.
HTTP request sent, awaiting response... 200 OK
Length: 3171 (3.1K) [image/jpeg]
Saving to: ‘www.irishtimes.com/polopoly_fs/1.3629589.1536943150!/image/image.jpg_gen/derivatives/box_140/image.jpg’


2018-09-16 21:32:28 (12.5 MB/s) - ‘www.irishtimes.com/polopoly_fs/1.3629589.1536943150!/image/image.jpg_gen/derivatives/box_140/image.jpg’ saved [3171/3171]

--2018-09-16 21:32:28--  https://www.irish

HTTP request sent, awaiting response... 200 OK
Length: 104868 (102K) [text/html]
Saving to: ‘www.irishtimes.com/business/economy/macron-waters-down-but-keeps-france-s-exit-tax-1.3631208’


2018-09-16 21:32:29 (3.89 MB/s) - ‘www.irishtimes.com/business/economy/macron-waters-down-but-keeps-france-s-exit-tax-1.3631208’ saved [104868/104868]

--2018-09-16 21:32:29--  https://www.irishtimes.com/business/health-pharma/seroba-takes-part-in-45m-funding-round-for-endotronix-1.3631210
Reusing existing connection to www.irishtimes.com:443.
HTTP request sent, awaiting response... 302 Temporarily Moved
Location: https://www.irishtimes.com/business/health-pharma/seroba-takes-part-in-45m-funding-round-for-endotronix-1.3631210?mode=sample&auth-failed=1&pw-origin=https%3A%2F%2Fwww.irishtimes.com%2Fbusiness%2Fhealth-pharma%2Fseroba-takes-part-in-45m-funding-round-for-endotronix-1.3631210 [following]
--2018-09-16 21:32:29--  https://www.irishtimes.com/business/health-pharma/seroba-takes-part-in-45m-fundi

HTTP request sent, awaiting response... 200 OK
Length: 7640 (7.5K) [image/png]
Saving to: ‘www.irishtimes.com/polopoly_fs/1.3630886.1537096913!/image/image.png_gen/derivatives/box_140/image.png’


2018-09-16 21:32:32 (12.8 KB/s) - ‘www.irishtimes.com/polopoly_fs/1.3630886.1537096913!/image/image.png_gen/derivatives/box_140/image.png’ saved [7640/7640]

--2018-09-16 21:32:32--  https://www.irishtimes.com/sport/other-sports/craig-breen-escapes-unharmed-as-his-car-goes-up-in-flames-at-turkey-rally-1.3630899
Reusing existing connection to www.irishtimes.com:443.
HTTP request sent, awaiting response... 200 OK
Length: 137103 (134K) [text/html]
Saving to: ‘www.irishtimes.com/sport/other-sports/craig-breen-escapes-unharmed-as-his-car-goes-up-in-flames-at-turkey-rally-1.3630899’


2018-09-16 21:32:32 (932 KB/s) - ‘www.irishtimes.com/sport/other-sports/craig-breen-escapes-unharmed-as-his-car-goes-up-in-flames-at-turkey-rally-1.3630899’ saved [137103/137103]

--2018-09-16 21:32:32--  https://www.

HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: ‘www.irishtimes.com/news/social-affairs/masked-men-secure-dublin-property-after-housing-activists-removed-1.3626537’

www.irishtimes.com/     [ <=>                ] 144.27K  --.-KB/s    in 0.06s   

2018-09-16 21:32:33 (2.27 MB/s) - ‘www.irishtimes.com/news/social-affairs/masked-men-secure-dublin-property-after-housing-activists-removed-1.3626537’ saved [147728]

--2018-09-16 21:32:33--  https://www.irishtimes.com/polopoly_fs/1.3626537.1536853570!/image/image.png_gen/derivatives/box_140/image.png
Reusing existing connection to www.irishtimes.com:443.
HTTP request sent, awaiting response... 200 OK
Length: 10417 (10K) [image/png]
Saving to: ‘www.irishtimes.com/polopoly_fs/1.3626537.1536853570!/image/image.png_gen/derivatives/box_140/image.png’


2018-09-16 21:32:33 (1.99 MB/s) - ‘www.irishtimes.com/polopoly_fs/1.3626537.1536853570!/image/image.png_gen/derivatives/box_140/image.png’ saved [10417/1041

HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: ‘www.irishtimes.com/culture/film/superfly-a-cliche-of-strippers-cocaine-and-one-last-job-crime-1.3626871’

www.irishtimes.com/     [ <=>                ] 100.07K  --.-KB/s    in 0.07s   

2018-09-16 21:32:36 (1.38 MB/s) - ‘www.irishtimes.com/culture/film/superfly-a-cliche-of-strippers-cocaine-and-one-last-job-crime-1.3626871’ saved [102471]

--2018-09-16 21:32:36--  https://www.irishtimes.com/polopoly_fs/1.3626870.1536762646!/image/image.jpg_gen/derivatives/box_140/image.jpg
Reusing existing connection to www.irishtimes.com:443.
HTTP request sent, awaiting response... 200 OK
Length: 3452 (3.4K) [image/jpeg]
Saving to: ‘www.irishtimes.com/polopoly_fs/1.3626870.1536762646!/image/image.jpg_gen/derivatives/box_140/image.jpg’


2018-09-16 21:32:37 (16.7 MB/s) - ‘www.irishtimes.com/polopoly_fs/1.3626870.1536762646!/image/image.jpg_gen/derivatives/box_140/image.jpg’ saved [3452/3452]

--2018-09-16 21:32:

HTTP request sent, awaiting response... 200 OK
Length: 17582 (17K) [image/jpeg]
Saving to: ‘www.irishtimes.com/polopoly_fs/1.3631099.1537118109!/image/image.jpg_gen/derivatives/box_300/image.jpg’


2018-09-16 21:32:37 (4.16 MB/s) - ‘www.irishtimes.com/polopoly_fs/1.3631099.1537118109!/image/image.jpg_gen/derivatives/box_300/image.jpg’ saved [17582/17582]

--2018-09-16 21:32:37--  https://www.irishtimes.com/sport/other-sports/tears-flow-on-a-golden-day-for-ireland-s-sanita-puspure-1.3631042
Reusing existing connection to www.irishtimes.com:443.
HTTP request sent, awaiting response... 302 Temporarily Moved
Location: https://www.irishtimes.com/sport/other-sports/tears-flow-on-a-golden-day-for-ireland-s-sanita-puspure-1.3631042?mode=sample&auth-failed=1&pw-origin=https%3A%2F%2Fwww.irishtimes.com%2Fsport%2Fother-sports%2Ftears-flow-on-a-golden-day-for-ireland-s-sanita-puspure-1.3631042 [following]
--2018-09-16 21:32:37--  https://www.irishtimes.com/sport/other-sports/tears-flow-on-a-golden-

HTTP request sent, awaiting response... 302 Temporarily Moved
Location: https://www.irishtimes.com/life-and-style/fashion/victoria-beckham-s-bold-first-london-fashion-week-show-1.3631020?mode=sample&auth-failed=1&pw-origin=https%3A%2F%2Fwww.irishtimes.com%2Flife-and-style%2Ffashion%2Fvictoria-beckham-s-bold-first-london-fashion-week-show-1.3631020 [following]
--2018-09-16 21:32:41--  https://www.irishtimes.com/life-and-style/fashion/victoria-beckham-s-bold-first-london-fashion-week-show-1.3631020?mode=sample&auth-failed=1&pw-origin=https%3A%2F%2Fwww.irishtimes.com%2Flife-and-style%2Ffashion%2Fvictoria-beckham-s-bold-first-london-fashion-week-show-1.3631020
Reusing existing connection to www.irishtimes.com:443.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: ‘www.irishtimes.com/life-and-style/fashion/victoria-beckham-s-bold-first-london-fashion-week-show-1.3631020’

www.irishtimes.com/     [ <=>                ] 106.82K  --.-KB/s    in 0.05s   


HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: ‘www.irishtimes.com/opinion/david-mcwilliams-dereliction-is-legalised-vandalism-for-the-property-owning-classes-1.3628403’

www.irishtimes.com/     [ <=>                ]  94.06K  --.-KB/s    in 0.08s   

2018-09-16 21:32:55 (1.22 MB/s) - ‘www.irishtimes.com/opinion/david-mcwilliams-dereliction-is-legalised-vandalism-for-the-property-owning-classes-1.3628403’ saved [96316]

--2018-09-16 21:32:55--  https://www.irishtimes.com/polopoly_fs/1.3310934.1512049386!/image/image.jpg_gen/derivatives/box_140_140/image.jpg
Reusing existing connection to www.irishtimes.com:443.
HTTP request sent, awaiting response... 200 OK
Length: 4129 (4.0K) [image/jpeg]
Saving to: ‘www.irishtimes.com/polopoly_fs/1.3310934.1512049386!/image/image.jpg_gen/derivatives/box_140_140/image.jpg’


2018-09-16 21:32:55 (17.0 MB/s) - ‘www.irishtimes.com/polopoly_fs/1.3310934.1512049386!/image/image.jpg_gen/derivatives/box_140_140/imag

HTTP request sent, awaiting response... 302 Temporarily Moved
Location: https://www.irishtimes.com/business/economy/united-ireland-would-see-living-standards-in-republic-fall-by-15-1.3629748?mode=sample&auth-failed=1&pw-origin=https%3A%2F%2Fwww.irishtimes.com%2Fbusiness%2Feconomy%2Funited-ireland-would-see-living-standards-in-republic-fall-by-15-1.3629748 [following]
--2018-09-16 21:32:59--  https://www.irishtimes.com/business/economy/united-ireland-would-see-living-standards-in-republic-fall-by-15-1.3629748?mode=sample&auth-failed=1&pw-origin=https%3A%2F%2Fwww.irishtimes.com%2Fbusiness%2Feconomy%2Funited-ireland-would-see-living-standards-in-republic-fall-by-15-1.3629748
Reusing existing connection to www.irishtimes.com:443.
HTTP request sent, awaiting response... 200 OK
Length: 105676 (103K) [text/html]
Saving to: ‘www.irishtimes.com/business/economy/united-ireland-would-see-living-standards-in-republic-fall-by-15-1.3629748’


2018-09-16 21:32:59 (1.86 MB/s) - ‘www.irishtimes.com/bus

www.irishtimes.com/     [ <=>                ] 142.78K  --.-KB/s    in 0.1s    

2018-09-16 21:33:01 (1.38 MB/s) - ‘www.irishtimes.com/business/what-have-cities-done-to-curb-the-rise-of-airbnb-1.3624689’ saved [146204]

--2018-09-16 21:33:01--  https://www.irishtimes.com/polopoly_fs/1.3624689.1536608878!/image/image.png_gen/derivatives/box_300/image.png
Reusing existing connection to www.irishtimes.com:443.
HTTP request sent, awaiting response... 200 OK
Length: 23590 (23K) [image/png]
Saving to: ‘www.irishtimes.com/polopoly_fs/1.3624689.1536608878!/image/image.png_gen/derivatives/box_300/image.png’


2018-09-16 21:33:01 (2.05 MB/s) - ‘www.irishtimes.com/polopoly_fs/1.3624689.1536608878!/image/image.png_gen/derivatives/box_300/image.png’ saved [23590/23590]

--2018-09-16 21:33:01--  https://www.irishtimes.com/news/ireland/the-danger-drug-damaging-irish-soldiers-1.3612517
Reusing existing connection to www.irishtimes.com:443.
HTTP request sent, awaiting response... 200 OK
Length: 139493 

HTTP request sent, awaiting response... 302 Temporarily Moved
Location: https://www.irishtimes.com/life-and-style/abroad/irish-citizens-in-britain-to-retain-rights-after-brexit-1.3628151?mode=sample&auth-failed=1&pw-origin=https%3A%2F%2Fwww.irishtimes.com%2Flife-and-style%2Fabroad%2Firish-citizens-in-britain-to-retain-rights-after-brexit-1.3628151 [following]
--2018-09-16 21:33:05--  https://www.irishtimes.com/life-and-style/abroad/irish-citizens-in-britain-to-retain-rights-after-brexit-1.3628151?mode=sample&auth-failed=1&pw-origin=https%3A%2F%2Fwww.irishtimes.com%2Flife-and-style%2Fabroad%2Firish-citizens-in-britain-to-retain-rights-after-brexit-1.3628151
Reusing existing connection to www.irishtimes.com:443.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: ‘www.irishtimes.com/life-and-style/abroad/irish-citizens-in-britain-to-retain-rights-after-brexit-1.3628151’

www.irishtimes.com/     [ <=>                ] 107.77K  --.-KB/s    in 0.03s   


HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: ‘www.irishtimes.com/special-reports/future-of-work/spies-and-space-cadets-the-jobs-of-the-future-1.3624059’

www.irishtimes.com/     [ <=>                ] 105.87K  --.-KB/s    in 0.04s   

2018-09-16 21:33:11 (2.87 MB/s) - ‘www.irishtimes.com/special-reports/future-of-work/spies-and-space-cadets-the-jobs-of-the-future-1.3624059’ saved [108415]

--2018-09-16 21:33:11--  https://www.irishtimes.com/polopoly_fs/1.3624056.1536576108!/image/image.jpg_gen/derivatives/box_300_160/image.jpg
Reusing existing connection to www.irishtimes.com:443.
HTTP request sent, awaiting response... 200 OK
Length: 7302 (7.1K) [image/jpeg]
Saving to: ‘www.irishtimes.com/polopoly_fs/1.3624056.1536576108!/image/image.jpg_gen/derivatives/box_300_160/image.jpg’


2018-09-16 21:33:11 (12.0 MB/s) - ‘www.irishtimes.com/polopoly_fs/1.3624056.1536576108!/image/image.jpg_gen/derivatives/box_300_160/image.jpg’ saved [7302/7302]

--2


2018-09-16 21:33:19 (6.22 MB/s) - ‘www.irishtimes.com/policy-and-terms/terms-conditions’ saved [191511/191511]

--2018-09-16 21:33:19--  https://www.irishtimes.com/policy-and-terms/privacy-policy
Reusing existing connection to www.irishtimes.com:443.
HTTP request sent, awaiting response... 200 OK
Length: 104461 (102K) [text/html]
Saving to: ‘www.irishtimes.com/policy-and-terms/privacy-policy’


2018-09-16 21:33:19 (8.56 MB/s) - ‘www.irishtimes.com/policy-and-terms/privacy-policy’ saved [104461/104461]

--2018-09-16 21:33:19--  https://www.irishtimes.com/policy-and-terms/community-standards
Reusing existing connection to www.irishtimes.com:443.
HTTP request sent, awaiting response... 200 OK
Length: 121528 (119K) [text/html]
Saving to: ‘www.irishtimes.com/policy-and-terms/community-standards’


2018-09-16 21:33:20 (32.3 MB/s) - ‘www.irishtimes.com/policy-and-terms/community-standards’ saved [121528/121528]

--2018-09-16 21:33:20--  https://www.irishtimes.com/policy-and-terms/copyright
R

HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: ‘www.irishtimes.com/business/companies’

www.irishtimes.com/     [ <=>                ] 144.58K  --.-KB/s    in 0.007s  

2018-09-16 21:33:25 (19.8 MB/s) - ‘www.irishtimes.com/business/companies’ saved [148048]

--2018-09-16 21:33:25--  https://www.irishtimes.com/business/technology
Reusing existing connection to www.irishtimes.com:443.
HTTP request sent, awaiting response... 200 OK
Length: 188131 (184K) [text/html]
www.irishtimes.com/business/technology: Is a directory

Cannot write to ‘www.irishtimes.com/business/technology’ (Success).
--2018-09-16 21:33:25--  https://www.irishtimes.com/business/work
Connecting to www.irishtimes.com (www.irishtimes.com)|151.101.18.174|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 176565 (172K) [text/html]
Saving to: ‘www.irishtimes.com/business/work’


2018-09-16 21:33:25 (1.94 MB/s) - ‘www.irishtimes.com/business/work’ saved [176565/

## TOPIC3: Getting data via APIs

### JSON format
JSON (JavaScript Object Notation) is a text format widely used for sharing resources on the web. Many APIs return data in JSON.

The Python code below writes a JSON string to a file named example.json.

In [47]:
json_string = """
{
    "glossary": {
        "title": "example glossary",
		"GlossDiv": {
            "title": "S",
			"GlossList": {
                "GlossEntry": {
                    "ID": "SGML",
					"SortAs": "SGML",
					"GlossTerm": "Standard Generalized Markup Language",
					"Acronym": "SGML",
					"Abbrev": "ISO 8879:1986",
					"GlossDef": {
                        "para": "A meta-markup language, used to create markup languages such as DocBook.",
						"GlossSeeAlso": ["GML", "XML"]
                    },
					"GlossSee": "markup"
                }
            }
        }
    }
}"""
with open("example.json", "w") as file:
    file.write(json_string)    

In [48]:
# Run shell command "cat" ("type" for Windows) to look at the file
# The sign ! tells Jupyter Notebook that the following command is a shell command.
# !type  example.json
!cat  example.json


{
    "glossary": {
        "title": "example glossary",
		"GlossDiv": {
            "title": "S",
			"GlossList": {
                "GlossEntry": {
                    "ID": "SGML",
					"SortAs": "SGML",
					"GlossTerm": "Standard Generalized Markup Language",
					"Acronym": "SGML",
					"Abbrev": "ISO 8879:1986",
					"GlossDef": {
                        "para": "A meta-markup language, used to create markup languages such as DocBook.",
						"GlossSeeAlso": ["GML", "XML"]
                    },
					"GlossSee": "markup"
                }
            }
        }
    }
}

In [49]:
#Library to parse JSON objects
import json
#Use a with-block so the file is closed after parsing
with open('example.json') as f:
    json_data = json.load(f)
#json_data is a nested Python dictionary
print(json_data)

{'glossary': {'title': 'example glossary', 'GlossDiv': {'title': 'S', 'GlossList': {'GlossEntry': {'ID': 'SGML', 'SortAs': 'SGML', 'GlossTerm': 'Standard Generalized Markup Language', 'Acronym': 'SGML', 'Abbrev': 'ISO 8879:1986', 'GlossDef': {'para': 'A meta-markup language, used to create markup languages such as DocBook.', 'GlossSeeAlso': ['GML', 'XML']}, 'GlossSee': 'markup'}}}}}


In [50]:
#We can refer to different fields of the json object
print(json_data['glossary']['title'])
print(json_data['glossary']['GlossDiv']['title'])
print(json_data['glossary']['GlossDiv']['GlossList']['GlossEntry']['ID'])

example glossary
S
SGML
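
Chained indexing such as `json_data['glossary']['GlossDiv']['title']` raises a `KeyError` as soon as one key along the path is missing, which is common with real API responses. The sketch below shows one possible way to look up nested fields with a fallback value. The helper `get_nested` is a hypothetical name introduced for this illustration, not part of the `json` module, and the JSON string is a shortened version of the glossary example above.

```python
import json

# A shortened version of the glossary document used above.
json_string = """
{
    "glossary": {
        "title": "example glossary",
        "GlossDiv": {
            "title": "S",
            "GlossList": {
                "GlossEntry": {
                    "ID": "SGML",
                    "GlossDef": {"GlossSeeAlso": ["GML", "XML"]}
                }
            }
        }
    }
}
"""


def get_nested(data, path, default=None):
    """Follow a sequence of keys/indices into nested dicts and lists.

    Returns `default` as soon as a key or index along the path is missing.
    (Hypothetical helper written for this example.)
    """
    current = data
    for key in path:
        try:
            current = current[key]
        except (KeyError, IndexError, TypeError):
            return default
    return current


json_data = json.loads(json_string)
entry_path = ["glossary", "GlossDiv", "GlossList", "GlossEntry"]

entry_id = get_nested(json_data, entry_path + ["ID"])
first_see_also = get_nested(json_data, entry_path + ["GlossDef", "GlossSeeAlso", 0])
missing = get_nested(json_data, entry_path + ["NoSuchField"], default="n/a")

print(entry_id)
print(first_see_also)
print(missing)
```

For a single level, the built-in `dict.get(key, default)` does the same job; a helper like this is only worth it when the path is several levels deep.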


> In the example below we use a URL called an API endpoint and the requests package to get a JSON file, in the same way we fetched data from a URL above.

In [51]:
import requests
url='https://data.colorado.gov/resource/4ykn-tg5h.json'
json_dataset = requests.get(url).text
print('Size of json file:',len(json_dataset),'\n')
print('-'*100)
print('Look at the first 500 characters of the json list')
print(json_dataset[:500])

with open("data_colorado_gov.json", "w") as file:
    file.write(json_dataset)


Size of json file: 767575 

----------------------------------------------------------------------------------------------------
Look at the first 500 characters of the json list
[ {
  "agentmailingstate" : "CO",
  "agentmailingzipcode" : "80538-9580",
  "entityname" : "The Colorado State Bee-Keepers' Association",
  "entityid" : "18881009142",
  "agentfirstname" : "Robert",
  "agentmailingaddress1" : "3905 Glade Road",
  "agentmailingcity" : "Loveland",
  "entitystatus" : "Good Standing",
  "agentprincipalcountry" : "US",
  "agentprincipaladdress1" : "3905 Glade Road",
  "principalcountry" : "US",
  "agentprincipalstate" : "CO",
  "agentprincipalcity" : "Loveland",
  "p

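The response text above is still a plain string. To work with the individual records it has to be parsed, for example with `json.loads`. The sketch below is a minimal illustration that runs without network access: `sample_text` stands in for the downloaded text, its first record reuses field names visible in the output above, and its second record is made up. The function name `summarize_records` is also an assumption of this example.

```python
import json
from collections import Counter

# Stand-in for `requests.get(url).text`: a JSON array of flat records.
# The first record reuses fields visible in the output above; the second
# record is invented for illustration.
sample_text = """
[
  {"entityname": "The Colorado State Bee-Keepers' Association",
   "entitystatus": "Good Standing",
   "agentmailingcity": "Loveland"},
  {"entityname": "Example Entity (hypothetical)",
   "entitystatus": "Delinquent",
   "agentmailingcity": "Denver"}
]
"""


def summarize_records(text):
    """Parse a JSON array of records and count them by 'entitystatus'."""
    records = json.loads(text)
    # .get() avoids a KeyError for records that lack the field
    status_counts = Counter(r.get("entitystatus", "unknown") for r in records)
    return records, status_counts


records, status_counts = summarize_records(sample_text)
print("Number of records:", len(records))
print("First entity:", records[0]["entityname"])
print("Status counts:", dict(status_counts))
```

When working with a live endpoint, `requests.get(url).json()` returns the parsed structure directly, so the separate `json.loads` step is not needed.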

### Twitter API

You need a Twitter account and Twitter OAuth credentials, available from https://apps.twitter.com/. For now you can use the credentials below, but Twitter may reject connections when too many people use the same credentials, so it is important to create and use your own authentication. The credentials below will be reset after this lab. To get your own, create a new application (using your own Twitter account) and then generate access tokens. See this tutorial for more details: http://socialmedia-class.org/twittertutorial.html
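
Once you have your own credentials, one possible way to keep them out of the notebook is to read them from environment variables. The sketch below is only an illustration: the variable names (`TWITTER_ACCESS_TOKEN` and so on) and the helper `load_credentials` are assumptions of this example, not something the Twitter API or the `twitter` library prescribes. The demonstration uses a fake environment dictionary so that it runs without real credentials.

```python
import os

# Names of the environment variables are an assumption made for this sketch.
CREDENTIAL_VARS = (
    "TWITTER_ACCESS_TOKEN",
    "TWITTER_ACCESS_SECRET",
    "TWITTER_CONSUMER_KEY",
    "TWITTER_CONSUMER_SECRET",
)


def load_credentials(environ=os.environ):
    """Return the four credentials in the order listed in CREDENTIAL_VARS.

    Raises RuntimeError naming every variable that is missing or empty.
    """
    missing = [name for name in CREDENTIAL_VARS if not environ.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return tuple(environ[name] for name in CREDENTIAL_VARS)


# Demonstration with a fake environment instead of real credentials
fake_environ = {name: "dummy-" + name.lower() for name in CREDENTIAL_VARS}
access_token, access_secret, consumer_key, consumer_secret = load_credentials(fake_environ)
print(access_token)
```

The returned tuple is in the same order as the `ACCESS_TOKEN`, `ACCESS_SECRET`, `CONSUMER_KEY`, `CONSUMER_SECRET` variables used in the cells below.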

In [52]:
# Using Twitter Search API to get public tweets from the past
# Initiate the connection to Twitter REST API
import json
# Import the necessary methods from "twitter" library
# Please make sure twitter is installed: pip install twitter
from twitter import Twitter, OAuth, TwitterHTTPError, TwitterStream

# Variables that contain the user credentials to access Twitter API 
# ACCESS_TOKEN = 'YOUR ACCESS TOKEN'
# ACCESS_SECRET = 'YOUR ACCESS TOKEN SECRET'
# CONSUMER_KEY = 'YOUR API KEY'
# CONSUMER_SECRET = 'ENTER YOUR API SECRET'
ACCESS_TOKEN = '2839893905-pBXUzdrHCNXyjfPuBpSwxNbH1zyEpRaa2sXK0Jd'
ACCESS_SECRET = 'eNtB7YTAfsMhPIQtKji8aQT7zQFpFfDPR2lQ89WKfgI1U'
CONSUMER_KEY = 'ZqPrfLpc0znZlz3kW2a22VmUa'
CONSUMER_SECRET = 'BHD19T0DmUV2XVvEhUAgvpXMx0nGfxevAtr53NbCd9jQjPyTqn'

oauth = OAuth(ACCESS_TOKEN, ACCESS_SECRET, CONSUMER_KEY, CONSUMER_SECRET)
twitter = Twitter(auth=oauth)
            
# Search for latest 100 tweets about "#analytics"
iterator = twitter.search.tweets(q='#analytics', result_type='recent', lang='en', count=100)
#print(json.dumps(iterator, indent=4))

# Write one tweet per line; the with-block closes the file when done
with open("twitter_search_100tweets.json", "w") as file:
    for tweet in iterator['statuses']:
        #print(json.dumps(tweet))
        file.write(json.dumps(tweet)+"\n")

Assuming the tweets from the previous step were saved in the file twitter_search_100tweets.json, read the file and look at the tweets. If you couldn't get the code above to work, use the sample file provided.

In [66]:
import json
# We use the file saved from last step as example
with open('twitter_search_100tweets.json', 'r') as f:
    tweets_file = f.readlines()
#print(tweets_file)

for line in tweets_file:
    #print(line)
    try:
        # Read in one line of the file, convert it into a json object 
        tweet = json.loads(line.strip())
        #print(tweet)
        if 'text' in tweet: # only messages that contain a 'text' field are tweets
#             print(tweet['id']) # This is the tweet's id
#             print(tweet['created_at']) # when the tweet was posted
#             print(tweet['text']) # content of the tweet
                        
#             print(tweet['user']['id']) # id of the user who posted the tweet
#             print(tweet['user']['name']) # name of the user, e.g. "Wei Xu"
#             print(tweet['user']['screen_name']) # name of the user account, e.g. "cocoweixu"

#             hashtags = []
#             for hashtag in tweet['entities']['hashtags']:
#             	hashtags.append(hashtag['text'])
#             print(hashtags)
            date = tweet['created_at']
            tweet_id = tweet['id']  # avoid shadowing the built-in id()
            text = tweet['text']
            nfollowers = tweet['user']['followers_count']
            nfriends = tweet['user']['friends_count']
            hashtags = [hashtag['text'] for hashtag in tweet['entities']['hashtags']]
            users = [user_mention['screen_name'] for user_mention in tweet['entities']['user_mentions']]
            urls = [url['expanded_url'] for url in tweet['entities']['urls']]

            media_urls = []
            if 'media' in tweet['entities']:
                media_urls = [media['media_url'] for media in tweet['entities']['media']]

            print([date, tweet_id, text, hashtags, users, urls, media_urls, nfollowers, nfriends])
    except json.JSONDecodeError:
        # the line is not valid JSON (this occasionally happens)
        print("JSON error!!!")
        continue

['Sun Sep 16 20:33:19 +0000 2018', 1041424817843388417, '13 steps to writing better #content #Infographic #SocialMedia #digitalmarketing @ASHISHCHOPRA001… https://t.co/4VP0o6bs4t', ['content', 'Infographic', 'SocialMedia', 'digitalmarketing'], ['ASHISHCHOPRA001'], ['https://twitter.com/i/web/status/1041424817843388417'], [], 5476, 6032]
['Sun Sep 16 20:33:13 +0000 2018', 1041424789162913792, "RT @Data_Mashup: Businesses can justify 'big data' processing on 'legitimate interests' grounds, says ICO https://t.co/bAk8NF77Gt #Analytic…", [], ['Data_Mashup'], ['http://tinyurl.com/lfr6fl5'], [], 31582, 88]
['Sun Sep 16 20:33:02 +0000 2018', 1041424743491092480, 'RT @IBMBizAnalytics: From introductory sessions for #analytics novices to highly technical sessions tailored for #DataScience and IT pros,…', ['analytics', 'DataScience'], ['IBMBizAnalytics'], [], [], 76803, 76471]
['Sun Sep 16 20:32:40 +0000 2018', 1041424653363892224, "Businesses can justify 'big data' processing on 'legitimate inte
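
The printed lists above are hard to reuse in later analysis. One option is to turn each tweet into a flat row and write the rows as CSV. The sketch below is a minimal illustration under stated assumptions: `sample_line` is a made-up tweet containing only the fields used here, and `tweet_to_row` is a hypothetical helper, not part of any Twitter library. It writes to an in-memory buffer rather than a file so that it runs on its own.

```python
import csv
import io
import json

# One line of a tweets file (one JSON object per line). The tweet content is
# made up for illustration and contains only the fields used below.
sample_line = json.dumps({
    "created_at": "Sun Sep 16 20:33:19 +0000 2018",
    "id": 1,
    "text": "Hello #analytics",
    "user": {"followers_count": 10, "friends_count": 5},
    "entities": {"hashtags": [{"text": "analytics"}], "user_mentions": [], "urls": []},
})


def tweet_to_row(tweet):
    """Flatten the fields of interest into a dict with one value per column."""
    entities = tweet.get("entities", {})
    hashtags = [h["text"] for h in entities.get("hashtags", [])]
    return {
        "created_at": tweet["created_at"],
        "id": tweet["id"],
        "text": tweet["text"],
        "hashtags": " ".join(hashtags),  # lists do not fit in one CSV cell
        "followers": tweet["user"]["followers_count"],
    }


rows = [tweet_to_row(json.loads(line)) for line in [sample_line]]

# Write to an in-memory buffer; replace it with an open file to save to disk
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()))
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)
```

To process the real file, replace `[sample_line]` with the lines read from twitter_search_100tweets.json.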

In [None]:
#Using Twitter API to stream tweets in real-time
#Gather all tweets containing a given keyword
#You can also gather all tweets of given user, check Twitter Streaming API details.

# Import the necessary package to process data in JSON format
import json
# Import the necessary methods from "twitter" library
# Twitter API returns data in JSON format
from twitter import Twitter, OAuth, TwitterHTTPError, TwitterStream

# Variables that contain the user credentials to access Twitter API 
# ACCESS_TOKEN = 'YOUR ACCESS TOKEN'
# ACCESS_SECRET = 'YOUR ACCESS TOKEN SECRET'
# CONSUMER_KEY = 'YOUR API KEY'
# CONSUMER_SECRET = 'ENTER YOUR API SECRET'
ACCESS_TOKEN = '2839893905-pBXUzdrHCNXyjfPuBpSwxNbH1zyEpRaa2sXK0Jd'
ACCESS_SECRET = 'eNtB7YTAfsMhPIQtKji8aQT7zQFpFfDPR2lQ89WKfgI1U'
CONSUMER_KEY = 'ZqPrfLpc0znZlz3kW2a22VmUa'
CONSUMER_SECRET = 'BHD19T0DmUV2XVvEhUAgvpXMx0nGfxevAtr53NbCd9jQjPyTqn'

oauth = OAuth(ACCESS_TOKEN, ACCESS_SECRET, CONSUMER_KEY, CONSUMER_SECRET)

# Initiate the connection to Twitter Streaming API
twitter_stream = TwitterStream(auth=oauth)

# Get a sample of the public data published on Twitter in real-time
#iterator = twitter_stream.statuses.sample()
# Get tweets in English containing the keyword "analytics"
iterator = twitter_stream.statuses.filter(track="analytics", language="en")

# Print each tweet in the stream to the screen and save it to a file.
# Here we set it to stop after getting 10 tweets.
# You don't have to set it to stop, but can continue running
# the Twitter API to collect data for days or even longer.
tweet_count = 10

# The with-block closes the file even if the stream is interrupted
with open("data_analytics_twitter_stream_10tweets.json", "w") as file:
    for tweet in iterator:
        tweet_count -= 1
        # Twitter Python Tool wraps the data returned by Twitter
        # as a TwitterDictResponse object.
        # We convert it back to the JSON format to print/store
        #print(json.dumps(tweet))
        file.write(json.dumps(tweet)+"\n")

        # The command below will do pretty printing for JSON data, try it out
        print(json.dumps(tweet, indent=4))

        if tweet_count <= 0:
            break

{
    "created_at": "Sun Sep 16 22:04:37 +0000 2018",
    "id": 1041447792198209536,
    "id_str": "1041447792198209536",
    "text": "RT @antgrasso: Intel\u00ae Quark\u2122 SE: Low-power-consumption system-on-chip that provides Edge Analytics by combining an x86 MCU with a sensor su\u2026",
    "source": "<a href=\"http://twitter.com\" rel=\"nofollow\">Twitter Web Client</a>",
    "truncated": false,
    "in_reply_to_status_id": null,
    "in_reply_to_status_id_str": null,
    "in_reply_to_user_id": null,
    "in_reply_to_user_id_str": null,
    "in_reply_to_screen_name": null,
    "user": {
        "id": 147709014,
        "id_str": "147709014",
        "name": "Digital SMEs Goals",
        "screen_name": "RoseTheReyes",
        "location": "United States",
        "url": null,
        "description": "Top Digital News for SMEs. Strategy. Digital Goals #EmergingTech #AI #Blockchain #IoT. Only on Twitter.",
        "translator_type": "none",
        "protected": false,
        "verifie

{
    "created_at": "Sun Sep 16 22:04:40 +0000 2018",
    "id": 1041447804252635136,
    "id_str": "1041447804252635136",
    "text": "RT @antgrasso: Intel\u00ae Quark\u2122 SE: Low-power-consumption system-on-chip that provides Edge Analytics by combining an x86 MCU with a sensor su\u2026",
    "source": "<a href=\"http://twitter.com\" rel=\"nofollow\">Twitter Web Client</a>",
    "truncated": false,
    "in_reply_to_status_id": null,
    "in_reply_to_status_id_str": null,
    "in_reply_to_user_id": null,
    "in_reply_to_user_id_str": null,
    "in_reply_to_screen_name": null,
    "user": {
        "id": 741341480,
        "id_str": "741341480",
        "name": "The Highway To AI",
        "screen_name": "TheHighway2AI",
        "location": "United Kingdom",
        "url": null,
        "description": "Learn online what is happening with #AI for Businesses. This is a Micro Blog with content from top Influencers.",
        "translator_type": "none",
        "protected": false,
       

{
    "created_at": "Sun Sep 16 22:05:03 +0000 2018",
    "id": 1041447902021804032,
    "id_str": "1041447902021804032",
    "text": "Global Healthcare Big Data Analytics Market Opportunities, Top Trends, Drivers, Challenge ... https://t.co/i2Ha3wzn5q",
    "source": "<a href=\"https://dlvrit.com/\" rel=\"nofollow\">dlvr.it</a>",
    "truncated": false,
    "in_reply_to_status_id": null,
    "in_reply_to_status_id_str": null,
    "in_reply_to_user_id": null,
    "in_reply_to_user_id_str": null,
    "in_reply_to_screen_name": null,
    "user": {
        "id": 856240505826496513,
        "id_str": "856240505826496513",
        "name": "Suriya Subramanian",
        "screen_name": "SuriyaSubraman",
        "location": "London, UK",
        "url": "https://www.linkedin.com/in/suriyansubramanian/",
        "description": "Change consultant focused on driving business transformations through data #datagovernance #bigdata #datascience #dataquality #IoT #gdpr #artificialintelligence",
        

{
    "created_at": "Sun Sep 16 22:05:31 +0000 2018",
    "id": 1041448016773869570,
    "id_str": "1041448016773869570",
    "text": "RT @Fisher85M: #Blockchain: Capital Markets Use Cases {Infographic}\n\n#Fintech #Analytics #BigData #AI #ML #Bitcoin #P2P #CyberSecurity #Ins\u2026",
    "source": "<a href=\"http://twitter.com\" rel=\"nofollow\">Twitter Web Client</a>",
    "truncated": false,
    "in_reply_to_status_id": null,
    "in_reply_to_status_id_str": null,
    "in_reply_to_user_id": null,
    "in_reply_to_user_id_str": null,
    "in_reply_to_screen_name": null,
    "user": {
        "id": 741343590,
        "id_str": "741343590",
        "name": "The Startup Mentor",
        "screen_name": "TheStartupMento",
        "location": "United States",
        "url": null,
        "description": "Daily updates and trends for the Startup Digital World. #Fintech #Startups #Digital #Tech",
        "translator_type": "none",
        "protected": false,
        "verified": false,
        

You may see the following error message:

        Exceeded connection limit for user
Because:

    Twitter generally doesn't allow you to establish too many connections to the streaming API at the same time with the same set of credentials.
    
So, in that case, please shut down Jupyter and restart the Notebook.
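
Since every extra connection counts against that limit, it helps to stop reading from a stream cleanly once enough tweets have been collected. The streaming loop earlier does this with a manual countdown. As an alternative sketch, `itertools.islice` from the standard library can cap the number of items taken from any iterator. The generator `fake_stream` below is a hypothetical stand-in for the streaming iterator so that the example runs without a Twitter connection.

```python
import itertools
import json


def fake_stream():
    """Hypothetical stand-in for a streaming iterator: yields tweets forever."""
    for n in itertools.count(1):
        yield {"id": n, "text": "tweet number %d about analytics" % n}


limit = 10
collected = []

# islice stops after `limit` items without requesting any further items
for tweet in itertools.islice(fake_stream(), limit):
    collected.append(json.dumps(tweet))

print(len(collected))
print(collected[0])
```

With a real stream, the same pattern would wrap the streaming iterator in place of `fake_stream()`.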

--------

There are several libraries for working with the Twitter API:

https://developer.twitter.com/en/docs/developer-utilities/twitter-libraries.html

