# Creating a Google Cloud Platform (GCP) Data Scraper

For this exercise, which we will build on next week, we are going to stand up GCP resources to scrape [Reddit](https://www.reddit.com/).
The scraping is done by reading Reddit's RSS feeds.

[Read about RSS here](https://en.wikipedia.org/wiki/RSS).
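To make the feed format concrete, here is a minimal sketch that parses a tiny feed with Python's standard library. The sample XML is hand-written for illustration only (it is not real Reddit data, and real feeds carry more fields), and `parse_entries` is a hypothetical helper name. Later in this exercise the `feedparser` library does this work for you.

```python
import xml.etree.ElementTree as ET

# Hypothetical, hand-written Atom feed used only for illustration.
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <title>example feed</title>
  <entry>
    <title>First post</title>
    <link href="https://example.com/posts/1"/>
    <updated>2017-12-12T03:04:04+00:00</updated>
  </entry>
  <entry>
    <title>Second post</title>
    <link href="https://example.com/posts/2"/>
    <updated>2017-12-12T03:04:03+00:00</updated>
  </entry>
</feed>
"""

# Atom elements live in this XML namespace.
ATOM = {"atom": "http://www.w3.org/2005/Atom"}

def parse_entries(xml_text):
    """Return a list of dicts (title, link, updated), one per feed entry."""
    root = ET.fromstring(xml_text)
    entries = []
    for entry in root.findall("atom:entry", ATOM):
        entries.append({
            "title": entry.findtext("atom:title", namespaces=ATOM),
            "link": entry.find("atom:link", ATOM).get("href"),
            "updated": entry.findtext("atom:updated", namespaces=ATOM),
        })
    return entries

for e in parse_entries(SAMPLE_FEED):
    print(e["updated"], e["title"], e["link"])
```

Each entry in a feed is a small, regular record, which is why a feed is easier to scrape than the rendered web page.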

### Overview

 1. Create a Storage Bucket to collect the data
 1. Create a preemptible Compute Engine instance (VM)
 1. Install software on the Compute Engine instance
 1. Write additional code modules to collect data into the bucket
 1. Collect data and write to storage bucket
 
#### Data Scraper Concept Overview
 
![DataScraperStructure_Mini_Project1.png MISSING](../images/DataScraperStructure_Mini_Project1.png)


**Note:** Please use the <span style="background:yellow">**us-central1**</span> region for all activities!


# 1. Create a Storage Bucket

Link: https://console.cloud.google.com/storage/
 * Name: **dsa_mini_project**
 * Select a Regional storage class
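
The console steps above can also be scripted. The following is a minimal sketch, assuming the `google-cloud-storage` package is installed and your credentials are already configured. `is_plausible_bucket_name` and `create_bucket` are hypothetical helper names, and the name check covers only a simplified subset of the Cloud Storage naming rules.

```python
import re

# Simplified subset of the bucket naming rules (assumption: names without
# dots): 3-63 characters, lowercase letters, digits, dashes and underscores,
# starting and ending with a letter or digit.
_BUCKET_NAME = re.compile(r"^[a-z0-9][a-z0-9_-]{1,61}[a-z0-9]$")

def is_plausible_bucket_name(name):
    """Cheap local check before asking the service to create the bucket."""
    return bool(_BUCKET_NAME.match(name))

def create_bucket(name, location="us-central1"):
    """Create a bucket in the given location and return it."""
    if not is_plausible_bucket_name(name):
        raise ValueError("unlikely to be a valid bucket name: {!r}".format(name))
    # Imported lazily so the name check above works without the package.
    from google.cloud import storage
    client = storage.Client()
    return client.create_bucket(name, location=location)

print(is_plausible_bucket_name("dsa_mini_project"))
print(is_plausible_bucket_name("DSA Mini Project"))
```

Bucket names are shared across all of Cloud Storage, so if **dsa_mini_project** is already taken you may need to add a suffix of your own.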

# 2. Create a Preemptible Compute Engine (VM)

Link: https://console.cloud.google.com/compute/instances
 * Name: **dsa-mini-project**
 * Select Micro Instance
![DataScraperVM_Instance.png MISSING](../images/DataScraperVM_Instance.png)


**BE SURE TO MAKE IT PREEMPTIBLE**
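
If you prefer the command line to the console, the same VM can be requested with the `gcloud` tool. This sketch only builds and prints the command. The zone `us-central1-a` and the machine type `f1-micro` are assumptions (any zone in us-central1 and any micro machine type would fit the instructions above), and `preemptible_vm_command` is a hypothetical helper name.

```python
import shlex

def preemptible_vm_command(name, zone="us-central1-a", machine_type="f1-micro"):
    """Build the gcloud command line for a preemptible VM as a list of arguments."""
    return [
        "gcloud", "compute", "instances", "create", name,
        "--zone={}".format(zone),
        "--machine-type={}".format(machine_type),
        "--preemptible",  # the setting this exercise insists on
    ]

cmd = preemptible_vm_command("dsa-mini-project")
# Print a copy-and-paste friendly version of the command.
print(" ".join(shlex.quote(part) for part in cmd))
```

To run it from Python instead of pasting it into a shell, pass the list to `subprocess.run(cmd, check=True)` on a machine where `gcloud` is installed and authenticated.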

# 3. Install software on compute engine

**You will need to install software to your compute engine (VM)**
 * [RSS Feed Libraries](https://wiki.python.org/moin/RssLibraries)
 * [Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/)
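
Once you have installed the libraries, a quick check like the one below can confirm that the VM's Python can find them. It is a sketch that assumes you install with `pip` and that you chose `feedparser` as your RSS library. `missing_packages` is a hypothetical helper name.

```python
import importlib.util

# Import name -> pip package name (assumption: installing with pip).
REQUIRED = {"feedparser": "feedparser", "bs4": "beautifulsoup4"}

def missing_packages(required=REQUIRED):
    """Return the pip package names whose import name cannot be found."""
    return [pip_name for module, pip_name in required.items()
            if importlib.util.find_spec(module) is None]

missing = missing_packages()
if missing:
    print("Install with: python3 -m pip install " + " ".join(missing))
else:
    print("All required packages are available.")
```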
 

Also read through this helpful information about accessing Reddit RSS Feeds: 
https://www.reddit.com/r/pathogendavid/comments/tv8m9/pathogendavids_guide_to_rss_and_reddit/

##### Here is some sample Python code that pulls the Reddit feed and prints it out.

In [11]:
import feedparser
from bs4 import BeautifulSoup
from bs4.element import Comment

# Functions from: https://stackoverflow.com/questions/1936466/beautifulsoup-grab-visible-webpage-text

def tag_visible(element):
    if element.parent.name in ['style', 'script', 'head', 'title', 'meta', '[document]']:
        return False
    if isinstance(element, Comment):
        return False
    return True

def text_from_html(body):
    soup = BeautifulSoup(body, 'html.parser')
    texts = soup.findAll(text=True)
    visible_texts = filter(tag_visible, texts)  
    return u" ".join(t.strip() for t in visible_texts)

# Define URL of the RSS Feed I want
a_reddit_rss_url = 'http://www.reddit.com/new/.rss?sort=new'

feed = feedparser.parse( a_reddit_rss_url )

if (feed['bozo'] == 1):
    print("Error Reading/Parsing Feed XML Data")    
else:
    for item in feed[ "items" ]:
        dttm = item[ "date" ]
        title = item[ "title" ]
        summary_text = text_from_html(item[ "summary" ])
        link = item[ "link" ]
        
        print("====================")
        print("Title: {} ({})\nTimestamp: {}".format(title,link,dttm))
        print("--------------------\nSummary:\n{}".format(summary_text))
           

Title: My Dad and his sister, circa 1942 (https://www.reddit.com/r/ImagesOfThe1940s/comments/7j7oxn/my_dad_and_his_sister_circa_1942/)
Timestamp: 2017-12-12T03:04:04+00:00
--------------------
Summary:
     submitted by /u/ImagesOfNetwork to r/ImagesOfThe1940s   [link]  [comments] 
Title: HP Spectre X360 8550U came with toshiba nvme ssd (https://www.reddit.com/r/Hewlett_Packard/comments/7j7oxl/hp_spectre_x360_8550u_came_with_toshiba_nvme_ssd/)
Timestamp: 2017-12-12T03:04:03+00:00
--------------------
Summary:
Hi, Just bought this model : 15-bl108CA which came with toshiba nvme ssd instead of samsung P961 nvme ssd. The toshiba drive is significantly slower than samsung.  How can I get support from HP to get my laptop or drive replaced with samsung drive ?  Thanks in advance.  /u/minitt r/Hewlett_Packard [link] [comments]
Title: Best Wings in New Orleans? (https://www.reddit.com/r/NewOrleans/comments/7j7oxk/best_wings_in_new_orleans/)
Timestamp: 2017-12-12T03:04:02+00:00
----------------

# 4. Write additional code modules to collect data into the bucket

Since you have created a preemptible VM, it may be shut down at any time, so the data you collect should go into the bucket rather than stay only on the VM's disk.

### ADD MORE "Raw NBConvert" Cells as needed for code you want to save

### This is also part of your submitted work for this module

### <span style="background:yellow">To-Do</span>

Build on the RSS feed scraping code shown above to write a JSON-formatted file containing the title, URL, summary, and date/time of each item every time the code runs.
Each file should get a unique name; one option is to build the file name from the run time.
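
As a starting point for the To-Do, here is a minimal sketch of the two pieces it asks for: a file name derived from the run time and a JSON dump of the four fields. The helper names (`unique_file_name`, `to_records`), the `reddit` prefix, and the sample item are all hypothetical. The sketch writes into the system temporary directory so it does not clutter your working directory.

```python
import json
import os
import tempfile
from datetime import datetime, timezone

def unique_file_name(now=None, prefix="reddit"):
    """Build a sortable, run-specific file name (assumes `now` is in UTC)."""
    now = now or datetime.now(timezone.utc)
    return "{}_{}.json".format(prefix, now.strftime("%Y%m%dT%H%M%SZ"))

def to_records(items):
    """Keep only the four fields the To-Do asks for."""
    return [{"title": i["title"], "url": i["link"],
             "summary": i["summary"], "date": i["date"]} for i in items]

# Hypothetical item standing in for one parsed feed entry.
sample_items = [{"title": "Example", "link": "https://example.com/1",
                 "summary": "text", "date": "2017-12-12T03:04:04+00:00"}]

path = os.path.join(tempfile.gettempdir(), unique_file_name())
with open(path, "w") as outfile:
    json.dump(to_records(sample_items), outfile, indent=2)
print("wrote", path)
```

The timestamp has one-second resolution, so two runs started within the same second would produce the same name; for a script that runs every few minutes that is usually acceptable.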

### Helpful Link for Writing to the Cloud Storage

https://cloud.google.com/appengine/docs/standard/python/googlecloudstorageclient/read-write-to-cloud-storage
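
The link above describes the App Engine client library. On a Compute Engine VM, one alternative is the `google-cloud-storage` package; the sketch below uses it instead, assuming it is installed and that the VM's service account can write to your bucket. `object_name_for` and `upload_file` are hypothetical helper names, and the `reddit/` prefix is only a suggestion.

```python
import os

def object_name_for(local_path, prefix="reddit"):
    """Map a local file path to an object name such as reddit/<file name>."""
    return "{}/{}".format(prefix, os.path.basename(local_path))

def upload_file(bucket_name, local_path, object_name=None):
    """Upload one local file to the bucket and return its gs:// URL."""
    # Imported lazily so the helper above can be used without the package.
    from google.cloud import storage
    object_name = object_name or object_name_for(local_path)
    client = storage.Client()
    blob = client.bucket(bucket_name).blob(object_name)
    blob.upload_from_filename(local_path)
    return "gs://{}/{}".format(bucket_name, object_name)

print(object_name_for("/tmp/reddit_20171212T030405Z.json"))
```

Calling `upload_file("dsa_mini_project", file_name)` right after each file is written keeps the data safe even if the preemptible VM is stopped shortly afterwards.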

In [12]:
import time
import json
import feedparser
from bs4 import BeautifulSoup
from bs4.element import Comment

# Functions from: https://stackoverflow.com/questions/1936466/beautifulsoup-grab-visible-webpage-text

def tag_visible(element):
    if element.parent.name in ['style', 'script', 'head', 'title', 'meta', '[document]']:
        return False
    if isinstance(element, Comment):
        return False
    return True

def text_from_html(body):
    soup = BeautifulSoup(body, 'html.parser')
    texts = soup.findAll(text=True)
    visible_texts = filter(tag_visible, texts)  
    return u" ".join(t.strip() for t in visible_texts)

# Define URL of the RSS Feed I want
#a1_reddit_rss_url = 'http://www.reddit.com/user/politics/submitted/.rss'
a1_reddit_rss_url = 'http://www.reddit.com/r/politics/new/.rss?sort=new'

feed = feedparser.parse( a1_reddit_rss_url )

if feed['bozo'] == 1:
    print("Error Reading/Parsing Feed XML Data")
else:
    # Name the output file after the run time so each run gets its own file
    file_name = time.strftime("%d%m%Y%H%M%S.txt")

    # Collect title, link, date and visible summary text for every item
    data = []
    for item in feed["items"]:
        ele = {"title": item["title"],
               "link": item["link"],
               "date": item["date"],
               "summary_text": text_from_html(item["summary"])}
        data.append(ele)

    with open(file_name, 'w') as outfile:
        json.dump(data, outfile)

    # Print the last item collected as a quick check of what was written
    if data:
        last = data[-1]
        print("====================")
        print("Title: {} ({})\nTimestamp: {}".format(last["title"], last["link"], last["date"]))
        print("--------------------\nSummary:\n{}".format(last["summary_text"]))
        
        

Title: Trump Treasury Dept. Analysis of GOP Tax Plan Ripped as 'Pure Propaganda' (https://www.reddit.com/r/politics/comments/7j7bah/trump_treasury_dept_analysis_of_gop_tax_plan/)
Timestamp: 2017-12-12T02:02:35+00:00
--------------------
Summary:
     submitted by /u/interested21   [link]  [comments] 


In [13]:
import time
import json
import feedparser
from bs4 import BeautifulSoup
from bs4.element import Comment

# Functions from: https://stackoverflow.com/questions/1936466/beautifulsoup-grab-visible-webpage-text

def tag_visible(element):
    if element.parent.name in ['style', 'script', 'head', 'title', 'meta', '[document]']:
        return False
    if isinstance(element, Comment):
        return False
    return True

def text_from_html(body):
    soup = BeautifulSoup(body, 'html.parser')
    texts = soup.findAll(text=True)
    visible_texts = filter(tag_visible, texts)  
    return u" ".join(t.strip() for t in visible_texts)

# Define URL of the RSS Feed I want
a2_reddit_rss_url = 'http://www.reddit.com/r/technology/new/.rss?sort=new'
feed = feedparser.parse( a2_reddit_rss_url )

if feed['bozo'] == 1:
    print("Error Reading/Parsing Feed XML Data")
else:
    # Name the output file after the run time so each run gets its own file
    file_name = time.strftime("%d%m%Y%H%M%S.txt")

    # Collect title, link, date and visible summary text for every item
    data = []
    for item in feed["items"]:
        ele = {"title": item["title"],
               "link": item["link"],
               "date": item["date"],
               "summary_text": text_from_html(item["summary"])}
        data.append(ele)

    with open(file_name, 'w') as outfile:
        json.dump(data, outfile)

    # Print the last item collected as a quick check of what was written
    if data:
        last = data[-1]
        print("====================")
        print("Title: {} ({})\nTimestamp: {}".format(last["title"], last["link"], last["date"]))
        print("--------------------\nSummary:\n{}".format(last["summary_text"]))
        
        

Title: H-1B reduced computer programmer employment by up to 11%, study finds (https://www.reddit.com/r/technology/comments/7j5vfc/h1b_reduced_computer_programmer_employment_by_up/)
Timestamp: 2017-12-11T22:16:46+00:00
--------------------
Summary:
/u/CintaBonita [link] [comments]


# 5. Collect data and write to storage bucket

### 5.1 Package your scraping code into a script: `data_scrape1.py`.
You can either write the script locally and upload it, or create it directly on the VM.


### 5.2 Run the script a few times over a few minutes
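
Rather than launching the script by hand each time, you could wrap the scrape in a small loop. This is a sketch under stated assumptions: `run_periodically` is a hypothetical helper, `fake_scrape` is a stand-in for your own scrape-and-save function, and the counts and interval are arbitrary.

```python
import time

def run_periodically(task, runs=3, interval_seconds=60, sleep=time.sleep):
    """Call task() `runs` times, pausing between calls; return the results."""
    results = []
    for i in range(runs):
        results.append(task())
        if i < runs - 1:  # no need to wait after the final run
            sleep(interval_seconds)
    return results

# Stand-in task (hypothetical); replace with the function that scrapes and saves.
def fake_scrape():
    return time.strftime("%Y-%m-%dT%H:%M:%S")

# Two quick runs one second apart, just to show the loop working.
print(run_periodically(fake_scrape, runs=2, interval_seconds=1))
```

The `sleep` parameter exists so the loop can be tried out without actually waiting; in normal use leave it at its default.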

### 5.3 Get a listing of the contents of your bucket and paste into the cell below.
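
One way to produce the listing from Python is sketched below, assuming the `google-cloud-storage` package is installed and authenticated. `list_bucket` and `format_listing` are hypothetical helper names, and the entry printed at the end is made-up sample data, not real bucket contents.

```python
def format_listing(bucket_name, blobs):
    """Render (name, size) pairs as one line per object."""
    return ["gs://{}/{} ({} bytes)".format(bucket_name, name, size)
            for name, size in blobs]

def list_bucket(bucket_name):
    """Return (name, size) for every object in the bucket."""
    # Imported lazily so format_listing can be used without the package.
    from google.cloud import storage
    client = storage.Client()
    return [(blob.name, blob.size) for blob in client.list_blobs(bucket_name)]

# Made-up entry, only to show the output format.
for line in format_listing("dsa_mini_project", [("12122017030405.txt", 2048)]):
    print(line)
```

On the VM, `format_listing("dsa_mini_project", list_bucket("dsa_mini_project"))` would give the real listing to paste in.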

#### Optionally
You can grab a screenshot of the bucket contents from the console and embed it below using
```
![my screenshot](screen_shot.png)
```
and changing the cell to _Markdown_.




# You can now go to the VM console and stop your instance.

Then you can restart it next week and continue building on it instead of starting over.


---


# Where is this exercise going?

In the next module, you will be introduced to various GCP Cloud APIs, such as Vision and Natural Language.

You will be extending this scraper to utilize an API or two to process the data in the buckets.
The processing will produce analytical information that feeds into BigQuery tables, thereby facilitating analytics and visualizations!



# Save your Notebook!