# <center>Web Scraping II</center>

References: 
https://marcobonzanini.com/2015/03/02/mining-twitter-data-with-python-part-1/

## 1. Different ways to access data on the web
 - Scrape HTML web pages (covered in Web Scraping I)
 - Download data file directly 
    * data files such as csv, txt
    * pdf files
 - Access data through Application Programming Interface (API), e.g. The Movie DB, Twitter
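
As a minimal sketch of the second approach (downloading a data file directly), the snippet below parses CSV text with Python's csv module. The function names `parse_csv_text` and `download_csv`, the sample rows, and any URL passed to `download_csv` are assumptions made for illustration. `download_csv` is defined but not called here because it needs network access.

```python
import csv
import io


def parse_csv_text(text):
    """Parse CSV text into a list of dictionaries keyed by the header row."""
    reader = csv.DictReader(io.StringIO(text))
    return [dict(row) for row in reader]


def download_csv(url):
    """Download a CSV file from a (hypothetical) url and parse it."""
    # imported here so that the parsing part runs without the requests package
    import requests

    r = requests.get(url)
    r.raise_for_status()  # stop on HTTP errors such as 404
    return parse_csv_text(r.text)


# made-up sample data standing in for a downloaded file
sample = "city,aqi\nLos Angeles,120\nNewark,85\n"
rows = parse_csv_text(sample)
print(rows[0]["city"], rows[0]["aqi"])
```

Note that every value comes back as a string, so numeric columns need an explicit conversion such as `int(row["aqi"])`.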

## 3. Scrape data through API (e.g. tweets)
- Online content providers usually provide APIs for you to access data. Two common types of APIs:
   * Python packages: e.g. the tweepy package for the Twitter API
   * REST APIs: e.g. TMDB APIs (https://developers.themoviedb.org/3/getting-started)
- You need to read the documentation of an API to figure out how to access its data



### 3.1. Access tweet stream through tweepy package
- **Stream**: transmitting or receiving data as a steady, continuous flow (the opposite is **batch**)

- Event **Listener** (or Event Handler): 
  - A procedure or function that waits for an event to occur.
  - Event examples: a user clicking or moving the mouse, pressing a key on the keyboard, an internal timer, or a tweet arriving.
  - A listener is in effect a loop that is programmed to react to an input or signal.
  
- Twitter Terminology (https://support.twitter.com/articles/166337)
  - **@{username}**: mentions the account {username} in a tweet
  - **\#{topic}**: a hashtag indicates a keyword or topic.
  - **follow**: Subscribing to a Twitter account 
  - **reply**: A response to another person’s Tweet
  - **Retweet (n.)**: A tweet that you forward to your followers
  - **like (n.)**: indicates appreciating a tweet. 
  - **timeline**: A timeline is a real-time stream of tweets. Your Home timeline, for instance, is where you see all the Tweets shared by your friends and other people you follow.
  - **Twitter emoji**: A Twitter emoji is a specific series of letters immediately preceded by the # sign which generates an icon on Twitter such as a national flag or another small image.
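
The stream and listener ideas above can be sketched in plain Python, without Twitter or tweepy. Everything here is hypothetical and for illustration only: `CollectingListener`, `fake_stream`, and `run` are made-up names, and the "tweets" are generated strings. The sketch mirrors the convention used later in this notebook, where a listener method returns False to stop listening.

```python
class CollectingListener:
    """A hypothetical listener that collects events and stops after a limit."""

    def __init__(self, max_events):
        self.max_events = max_events
        self.received = []

    def on_data(self, data):
        # react to one event; return False to ask the source to stop
        self.received.append(data)
        return len(self.received) < self.max_events


def fake_stream():
    """A generator standing in for a continuous, never-ending stream."""
    n = 0
    while True:
        n += 1
        yield "tweet %d" % n


def run(stream, listener):
    # the listening loop: hand each arriving event to the listener
    for event in stream:
        if not listener.on_data(event):
            break


listener = CollectingListener(max_events=3)
run(fake_stream(), listener)
print(listener.received)
```

The stream itself never ends; it is the listener's return value that decides when to stop consuming it.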


In [1]:
!pip install tweepy

Collecting tweepy
  Downloading tweepy-3.5.0-py2.py3-none-any.whl
Collecting requests-oauthlib>=0.4.1 (from tweepy)
  Downloading requests_oauthlib-0.8.0-py2.py3-none-any.whl
Collecting oauthlib>=0.6.2 (from requests-oauthlib>=0.4.1->tweepy)
  Downloading oauthlib-2.0.4.tar.gz (127kB)
Building wheels for collected packages: oauthlib
  Running setup.py bdist_wheel for oauthlib ... done
  Stored in directory: /home/ning/.cache/pip/wheels/f2/65/44/161426fc672522705a712b38a67376d8cc122ab7ca2e7dce2a
Successfully built oauthlib
Installing collected packages: oauthlib, requests-oauthlib, tweepy
Successfully installed oauthlib-2.0.4 requests-oauthlib-0.8.0 tweepy-3.5.0


In [3]:
# Exercise 3.1.1 define a listener which listens to tweets in real time


import tweepy
# to install tweepy, use: pip install tweepy

# import twitter authentication module
from tweepy import OAuthHandler

# import tweepy stream module
from tweepy import Stream

# import stream listener
from tweepy.streaming import StreamListener

# import the python package to handle datetime
import datetime

# set your keys to access tweets 
consumer_key = '9rbfp54yvlZbhzjflhc4HK4NA'
consumer_secret = 'fUSFIG74Eh6nkcChZZR2MyqccjgxqTdJyknE05TxeimMl1H7iE'
access_token = '376869783-lww06c9tSQltfTZLIXddZ0cNfyBvOhciVrFli2gm'
access_secret = 'rZcwReGs1LvR3o9OaJd1XD3flXC9YszZ7BgI1jcRTuiDr'
 
auth = OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_secret)
 
# Customize a tweet event listener 
# inherited from StreamListener provided by tweepy
# This listener reacts when a tweet arrives or an error happens

class MyListener(StreamListener):
    
    # constructor
    def __init__(self, output_file, time_limit):
        
        # attribute to get listener start time
        self.start_time=datetime.datetime.now()
        
        # attribute to set time limit (in minutes) for listening
        self.time_limit=time_limit
        
        # attribute to set the output file
        self.output_file=output_file
        
        # initiate superclass's constructor
        StreamListener.__init__(self)
    
    # on_data is invoked when a tweet comes in
    # override this method inherited from the superclass
    # when a tweet comes in, the tweet is passed as "data"
    def on_data(self, data):
        
        # get running time
        running_time=datetime.datetime.now()-self.start_time
        print(running_time)
        
        # check if running time (in minutes) is still under time_limit
        # total_seconds() covers the whole duration; .seconds would drop whole days
        if running_time.total_seconds()/60.0<self.time_limit:
            
            # ***Exception handling*** 
            # If an error is encountered, 
            # a try block code execution is stopped and transferred
            # down to the except block. 
            # If there is no error, "except" block is ignored
            try:
                # open file in "append" mode
                with open(self.output_file, 'a') as f:
                    # Write tweet string (in JSON format) into a file
                    f.write(data)
                    
                    # continue listening
                    return True
                
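            # if an error is encountered
            # print out the error message and continue listening
            # (catch Exception rather than BaseException so that
            # a keyboard interrupt can still stop the listener)
            except Exception as e:
                print("Error on_data:" , str(e))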
                
                # if return "True", the listener continues
                return True
            
        else:  # timeout, return False to stop the listener
            print("time out")
            return False
 
    # on_error is invoked if there is anything wrong with the listener
    # error status is passed to this method
    def on_error(self, status):
        print(status)
        # continue listening by "return True"
        return True
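
A side note on the time-limit check: a `datetime.timedelta` stores days, seconds and microseconds separately, so its `.seconds` attribute is only part of the duration, while `total_seconds()` gives all of it. The short sketch below shows the difference. The helper name `within_limit` is made up for illustration.

```python
import datetime

elapsed = datetime.timedelta(days=1, seconds=90)

# .seconds is only the seconds field of the timedelta; the day is not included
print(elapsed.seconds)          # 90

# total_seconds() converts the whole duration to seconds
print(elapsed.total_seconds())  # 86490.0


def within_limit(elapsed, limit_minutes):
    """Return True while the elapsed time is below the limit given in minutes."""
    return elapsed.total_seconds() / 60.0 < limit_minutes


print(within_limit(datetime.timedelta(seconds=59), 1))  # True
print(within_limit(elapsed, 2))                         # False
```

For limits of a few minutes the two give the same answer, but `total_seconds()` stays correct for any duration.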

In [4]:
# Exercise 3.1.2 Collect tweets with specific topics within 1 minute

# initiate an instance of MyListener 
tweet_listener=MyListener(output_file="python.txt",time_limit=1)

# start a stream instance using authentication and the listener
twitter_stream = Stream(auth, tweet_listener)
# filtering tweets by topics
twitter_stream.filter(track=['#python', '#java','#deeplearning','machinelearning'])

0:00:01.798200
0:00:10.230971
0:00:11.815853
0:00:12.590414
0:00:13.411504
0:00:19.137318
0:00:24.263912
0:00:27.637800
0:00:28.057109
0:00:28.407604
0:00:32.162213
0:00:32.166643
0:00:34.709059
0:00:38.010632
0:00:39.047699
0:00:41.567351
0:00:41.972013
0:00:42.280494
0:00:47.605007
0:00:47.706840
0:00:48.531797
0:00:49.664629
0:00:50.138622
0:00:50.549602
0:00:50.882732
0:00:51.155983
0:00:52.313820
0:00:55.736735
0:00:57.538054
0:00:57.744925
0:00:59.891111
0:00:59.899291
0:01:01.735505
time out


In [None]:
# Exercise 3.1.3. Collect 1% sample of all tweets within 30 seconds

tweet_listener=MyListener(output_file="tweets.txt",time_limit=0.5)
twitter_stream = Stream(auth, tweet_listener)
twitter_stream.sample()


## 4. JSON (JavaScript Object Notation)

### What is JSON
- A lightweight data-interchange format
- "Self-describing" and easy to understand
- The JSON format is text only 
- Language independent: can be read and used as a data format by any programming language

###  JSON Syntax Rules
JSON syntax is derived from JavaScript object notation syntax:
- Data is in name/value pairs
- Data is separated by commas
- Curly braces hold objects
- Square brackets hold arrays
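
The syntax rules above can be seen in a small example. The JSON text below is made up for illustration (it is not a real tweet). After parsing, JSON objects become Python dictionaries, arrays become lists, and `false`/`null` become `False`/`None`.

```python
import json

# a JSON document written as text; the content is made up for illustration
text = """
{
    "user": {"screen_name": "example_user", "followers_count": 10},
    "hashtags": ["python", "json"],
    "retweeted": false,
    "coordinates": null
}
"""

data = json.loads(text)

print(type(data).__name__)                     # dict
print(type(data["user"]).__name__)             # dict
print(type(data["hashtags"]).__name__)         # list
print(data["retweeted"], data["coordinates"])  # False None
```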

### A JSON file can be easily loaded into a dictionary or a list of dictionaries

In [5]:
# Exercise 4.1. Read/write JSON 
import json
tweets=[]

with open('python.txt', 'r') as f:
    # each line is one tweet string in JSON format
    for line in f: 
        # skip blank lines, which json.loads cannot parse
        if not line.strip():
            continue
        
        # load a string in JSON format as Python dictionary
        tweet = json.loads(line) 
              
        tweets.append(tweet)

# write the whole list back to JSON
with open("all_tweets.json", 'w') as f:
    json.dump(tweets, f)

# to load the whole list
# pay attention to json.load (reads from a file object)
# and json.loads (parses a string)
with open("all_tweets.json", 'r') as f:
    tweets=json.load(f)

# open "all_tweets.json" and "python.txt" to see the difference
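
To make the difference between the two files concrete, here is a minimal sketch that uses in-memory files (`io.StringIO`) in place of "python.txt" and "all_tweets.json". The two sample tweets are made up. One file holds one JSON object per line and is read with `json.loads` line by line; the other holds a single JSON array and is read with one `json.load` call.

```python
import io
import json

# made-up tweets with only two fields each
sample_tweets = [{"id": 1, "text": "first"}, {"id": 2, "text": "second"}]

# layout 1: one JSON object per line (like the file written by the listener)
lines_file = io.StringIO()
for t in sample_tweets:
    # dumps returns a string
    lines_file.write(json.dumps(t) + "\n")

# layout 2: the whole list as a single JSON array
array_file = io.StringIO()
# dump writes to a file object
json.dump(sample_tweets, array_file)

# read layout 1: loads parses one string at a time
lines_file.seek(0)
from_lines = [json.loads(line) for line in lines_file if line.strip()]

# read layout 2: load parses the whole file object at once
array_file.seek(0)
from_array = json.load(array_file)

print(from_lines == from_array == sample_tweets)
```

Calling `json.load` on the line-per-object layout would fail, because several objects in a row do not form one valid JSON document.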

In [6]:
# Exercise 4.2. Investigating a tweet

# A tweet is a dictionary
# Some values are dictionaries too!
# for details, check https://dev.twitter.com/overview/api/tweets

print("# of tweets:", len(tweets))
first_tweet=tweets[0]

print("\nprint out first tweet nicely:")
print(json.dumps(first_tweet, indent=4))   

# note the difference: json.dumps() returns a string, while json.dump() writes to a file object


('# of tweets:', 32)

print out first tweet nicely:
{
    "quote_count": 0, 
    "contributors": null, 
    "truncated": false, 
    "text": "RT @MikeQuindazzi: #IoT is driving #digital disruption in the physical world! #ai #machinelearning #datascience https://t.co/Em6ezOLBWQ", 
    "is_quote_status": false, 
    "in_reply_to_status_id": null, 
    "reply_count": 0, 
    "id": 912445033219002368, 
    "favorite_count": 0, 
    "source": "<a href=\"http://iot.com\" rel=\"nofollow\">The IoT Center</a>", 
    "retweeted": false, 
    "coordinates": null, 
    "timestamp_ms": "1506378822985", 
    "entities": {
        "user_mentions": [
            {
                "indices": [
                    3, 
                    17
                ], 
                "screen_name": "MikeQuindazzi", 
                "id": 2344530218, 
                "name": "Mike Quindazzi \u2728", 
                "id_str": "2344530218"
            }
        ], 
        "symbols": [], 
        "hashtags": [
  

In [None]:
# Exercise 4.3. Investigating attributes of a tweet

print("tweet text:", first_tweet["text"] )
# get all hashtags (i.e. topics) in this tweet
      
topics=[hashtag["text"] for hashtag in first_tweet["entities"]["hashtags"]]
print("\ntopics:", topics)

# get all user_mentions in this tweet
user_mentions=[user_mention["screen_name"] for user_mention in first_tweet["entities"]["user_mentions"]]
print("\nusers mentioned:", user_mentions)

In [2]:
# Exercise 4.4. count tweets per topic

# get the number of tweets for each topic as a dictionary
count_per_topic={}

# loop through each tweet in the list
for t in tweets:
    # check if "entities" exist and "hashtags" exist in "entities"
    if "entities" in t and "hashtags" in t["entities"]:
        # get all topics as a set (unique topics, in lowercase)
        topics=set([hashtag["text"].lower() for hashtag in t["entities"]["hashtags"]])
        
        for topic in topics:
            if topic in count_per_topic:
                count_per_topic[topic]+=1
            else:
                count_per_topic[topic]=1
        
print(count_per_topic)



In [None]:
# Exercise 4.5. Get top 20 topics

# convert the dictionary into a list of tuples (topic, count)
topic_count_list=list(count_per_topic.items())

# sort the list by count in descending order
sorted_topics=sorted(topic_count_list, key=lambda item:-item[1])
print(sorted_topics)

# get top 20 topics
top_20_topics=sorted_topics[0:20]

# split the list of tuples into two tuples
topics, counts=zip(*top_20_topics)
print("\nTopics and counts in separated tuples:")
print(topics, counts)
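
As an alternative sketch, the counting and sorting steps of Exercises 4.4 and 4.5 can also be written with `collections.Counter` from the standard library. The sample tweets below are made up and contain only the fields the code reads. The names `sample_tweets`, `sample_counts` and `top_topics` are chosen for illustration.

```python
from collections import Counter

# made-up tweets containing only the fields used below
sample_tweets = [
    {"entities": {"hashtags": [{"text": "Python"}, {"text": "python"}]}},
    {"entities": {"hashtags": [{"text": "python"}, {"text": "Java"}]}},
    {"text": "a tweet without entities"},
]

sample_counts = Counter()
for t in sample_tweets:
    # fall back to an empty list when "entities" or "hashtags" is missing
    hashtags = t.get("entities", {}).get("hashtags", [])
    # build a set so that each topic is counted once per tweet
    sample_counts.update({h["text"].lower() for h in hashtags})

# most_common(n) returns up to n (topic, count) tuples, highest count first
top_topics = sample_counts.most_common(20)
print(top_topics)
```

`most_common` replaces both the manual sort and the slice, and it returns fewer than 20 items when fewer topics exist.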

In [None]:
# Exercise 4.6. Plot the tweet count of each topic as a bar chart

# display plot inline
# %matplotlib is called "magic function"
# "inline" is a parameter to the magic function
# see http://ipython.readthedocs.io/en/stable/interactive/tutorial.html#magics-explained for details
%matplotlib inline

import matplotlib.pyplot as plt

# get a range as the horizontal position of each topic
x_pos = range(len(topics))

# plot the bar chart
plt.bar(x_pos, counts)

# label each bar with its topic
plt.xticks(x_pos, topics)

# add the label for Y-axis
plt.ylabel('Count of Tweets')

# add title
plt.title('Count of Tweets per Topic')

# rotate the topic labels so that they are vertical
plt.xticks(rotation=90) 

# display the plot
plt.show()


## 5. Visualization

- Often it is not easy to create nice plots using matplotlib
- Other visualization libraries may be helpful (https://blog.modeanalytics.com/python-data-visualization-libraries/)
- **Brunel** Visualization Package
    - Brunel provides very intuitive methods for nice visualization
    - Brunel Dependencies:
        - Brunel Visualization currently works only in IPython/Jupyter notebooks, which must be installed before Brunel.
        - Java 1.7+ must be installed

In [1]:
!pip install brunel



In [1]:
import brunel


In [2]:
# Exercise 4.7. Fancy plot using Brunel package

# For better visualization effect, use brunel
# To install brunel, use:  pip install brunel
# You need to have java 1.7+ installed
# For details of brunel, see https://github.com/Brunel-Visualization/Brunel/wiki

import pandas as pd
import brunel

df=pd.DataFrame(top_20_topics, columns=["topic","count"])
print(df)

%brunel data('df') x(topic) y(count) color(topic) bar sort(count) \
tooltip(#all) title("Count of Tweets per Topic") :: width=600, height=500
 


In [None]:
# Exercise 4.8. top topics in bubble chart

%brunel data('df') label(topic) size(count) color(topic) bubble tooltip(count)

In [None]:
# Exercise 4.9. top topics in cloud chart

%brunel data('df') label(topic) size(count) color(topic) cloud tooltip(count)

## 6. Scrape data by REST APIs (TMDB)
- A REST API is a web service that uses HTTP requests to GET, PUT, POST and DELETE data
- The requests package can be used to make REST API calls
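
For a GET request, parameters travel in the URL itself. The sketch below builds such a URL for the TMDB search endpoint used in Exercise 6.1, using only the standard library and making no network call. `YOUR_API_KEY` is a placeholder, not a working key.

```python
from urllib.parse import urlencode

base_url = "http://api.themoviedb.org/3/search/movie"

# YOUR_API_KEY is a placeholder; replace it with your own key
params = {"query": "finding dory", "api_key": "YOUR_API_KEY"}

# urlencode joins the parameters with "&" and escapes special characters
# (the space in the title becomes "+")
url = base_url + "?" + urlencode(params)
print(url)
```

With the requests package, the same dictionary can be passed as `requests.get(base_url, params=params)`, which performs this encoding for you.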

In [6]:
# Exercise 6.1. search movies by name

import requests
import json

title='finding dory'

# Search API: http://api.themoviedb.org/3/search/movie
# has two parameters: query string and api_key
# For the get methods, parameters are attached to API URL after a "?"
# Parameters are separated by "&"

# to test, apply for an api key and use the key here
url="http://api.themoviedb.org/3/search/movie?query="+title+"&api_key=26f55bcf8c4a8db24e494330dd2da118"

# invoke the API 
r = requests.get(url)

# if the API call returns a successful response
if r.status_code==200:
    
    # This API call returns a json object
    # r.json() gives the json object
    
    if "results" in r.json():
        results=r.json()["results"]
        print (json.dumps(results, indent=4))


[
    {
        "poster_path": "/z09QAf8WbZncbitewNk6lKYMZsh.jpg", 
        "title": "Finding Dory", 
        "overview": "Dory is reunited with her friends Nemo and Marlin in the search for answers about her past. What can she remember? Who are her parents? And where did she learn to speak Whale?", 
        "release_date": "2016-06-16", 
        "popularity": 133.87667, 
        "original_title": "Finding Dory", 
        "backdrop_path": "/3iSCdXjDmY3DuEOUYsElu35vQU6.jpg", 
        "vote_count": 4211, 
        "video": false, 
        "adult": false, 
        "vote_average": 6.8, 
        "genre_ids": [
            12, 
            16, 
            35, 
            10751
        ], 
        "id": 127380, 
        "original_language": "en"
    }, 
    {
        "poster_path": "/4tx2ynxwnsGdO64OMmEvMsY5jNZ.jpg", 
        "title": "Finding Dory: Marine Life Interviews", 
        "overview": "Interviews with the animals at the Marine Life Institute about their experiences with Dory.", 
  

In [9]:
r.json()

{u'page': 1,
 u'results': [{u'adult': False,
   u'backdrop_path': u'/3iSCdXjDmY3DuEOUYsElu35vQU6.jpg',
   u'genre_ids': [12, 16, 35, 10751],
   u'id': 127380,
   u'original_language': u'en',
   u'original_title': u'Finding Dory',
   u'overview': u'Dory is reunited with her friends Nemo and Marlin in the search for answers about her past. What can she remember? Who are her parents? And where did she learn to speak Whale?',
   u'popularity': 133.87667,
   u'poster_path': u'/z09QAf8WbZncbitewNk6lKYMZsh.jpg',
   u'release_date': u'2016-06-16',
   u'title': u'Finding Dory',
   u'video': False,
   u'vote_average': 6.8,
   u'vote_count': 4211},
  {u'adult': False,
   u'backdrop_path': u'/7Z152lh4xJYZ5nvCRZE2ahgTPJX.jpg',
   u'genre_ids': [],
   u'id': 427004,
   u'original_language': u'en',
   u'original_title': u'Finding Dory: Marine Life Interviews',
   u'overview': u'Interviews with the animals at the Marine Life Institute about their experiences with Dory.',
   u'popularity': 1.803511,
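
Once the response is a Python dictionary, individual fields can be picked out as usual. The sketch below works on a trimmed copy of the two results shown in the output, so it runs without calling the API. The second entry's release date is cut off in that output and is therefore left out here. The helper name `summarize` is made up for illustration.

```python
# a trimmed copy of two entries from the search results shown in the output
results = [
    {"id": 127380, "title": "Finding Dory",
     "release_date": "2016-06-16", "vote_average": 6.8},
    {"id": 427004, "title": "Finding Dory: Marine Life Interviews"},
]


def summarize(movie):
    """Return (id, title, release year or None) for one search result."""
    # .get() avoids a KeyError when a field is missing
    release_date = movie.get("release_date") or ""
    year = int(release_date[:4]) if release_date else None
    return movie["id"], movie["title"], year


for movie in results:
    print(summarize(movie))
```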
  

## 7. Scrape PDF files
- A number of Python libraries can handle PDFs (https://www.binpress.com/tutorial/manipulating-pdfs-with-python/167)
- Some popular libraries:
  * PyPDF2: supports both Python 2 and Python 3
    * To install, use: pip install PyPDF2
  * PDFMiner: supports only Python 2 (the pdfminer.six fork supports Python 3)
  * PDFQuery


In [4]:
!pip install PyPDF2

Collecting PyPDF2
  Downloading PyPDF2-1.26.0.tar.gz (77kB)
Building wheels for collected packages: PyPDF2
  Running setup.py bdist_wheel for PyPDF2 ... done
  Stored in directory: /home/ning/.cache/pip/wheels/86/6a/6a/1ce004a5996894d33d93e1fb1b67c30973dc945cc5875a1dd0
Successfully built PyPDF2
Installing collected packages: PyPDF2
Successfully installed PyPDF2-1.26.0


In [5]:
# Exercise 7.1. Download and parse pdf files 

import requests
from PyPDF2 import PdfFileReader

# First download the pdf file
pages=[]
r=requests.get("http://ciese.org/media/live/curriculum/airproj/docs/aqiworksheet.pdf")

# stop here if the download failed, so that a missing file is never parsed
r.raise_for_status()

# write the content to a local file
with open("some_pdf.pdf","wb") as f:
    f.write(r.content)

# Parse the pdf content. It may need further clean-up depending on the content
# The file must stay open while pages are read
with open("some_pdf.pdf", "rb") as f:
    pdfreader = PdfFileReader(f)

    # loop through each page of the pdf file
    for i in range(pdfreader.getNumPages()):
        # get each page
        page=pdfreader.getPage(i)
        # extract text
        page_content=page.extractText()
    
        # append the text to the list
        pages.append(page_content)
    
print(pages)

[u'EPA | NESCAUM | CIESE | Stevens Institute of Technology \n Student Worksheet : What Color is My Air? \n Name: _________________________________ Group: _______________  What Color is My Air? \nAccess the Air Quality Map and answer the following questions:  1.  What do the five colors on this map represent?  \n \n2.  Find Los Angeles, CA on the map.  What color is it?  Circle: GGrreeeenn    YYeellllooww  OOrraannggee  RReedd  PPuurrppllee  BBrroowwnn     \n3.  Find another city on the map that is RReedd.  Write the city and state below.  \n \n \n4.  Find two OOrraannggee cities on the map.  Write the city names and states below. \n   \n \n5.  Are there any GGrreeeenn cities on the map?   If so, list three.  \n \n \n \n \n6.  Write a sentence that compares the kinds of places where GGrreeeenn areas are found and the kind of areas where RReedd and OOrraannggee areas are found.  \n \n \n \n \n7.  Can you think of any factors or reasons that would cause poor air quality found in the RReed
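
The extracted text above shows why clean-up is often needed: whitespace is irregular, and the colour names come out with every letter doubled ("RReedd", "GGrreeeenn"). The sketch below is one possible heuristic for this particular document, not a general PDF fix. The function names `collapse_doubled` and `clean_page` are made up, and the rule could wrongly shorten a genuine word that happens to consist only of doubled letters.

```python
import re


def collapse_doubled(word):
    """Collapse a word such as 'GGrreeeenn' to 'Green'.

    Only words of at least four letters in which every character
    appears twice in a row are changed.
    """
    pairs_match = all(word[i] == word[i + 1] for i in range(0, len(word) - 1, 2))
    if len(word) >= 4 and len(word) % 2 == 0 and pairs_match:
        return word[::2]
    return word


def clean_page(text):
    """Normalize whitespace and collapse doubled-letter words."""
    # collapse runs of spaces and newlines into single spaces
    text = re.sub(r"\s+", " ", text).strip()
    # apply the doubled-letter rule to each alphabetic word
    return re.sub(r"[A-Za-z]+", lambda m: collapse_doubled(m.group()), text)


# one question copied from the extracted text above
sample = "3.  Find another city on the map that is RReedd.  Write the city and state below.  \n \n"
print(clean_page(sample))
```

In the notebook this could be applied to every page, for example `[clean_page(p) for p in pages]`.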

