# Web Scraping

## Part \#5: Web Scraping Project (30 pts)

The final step of these notebooks is to create your own project. Your project must have the following characteristics:

**Project Summary:** Create a Jupyter notebook using Python code that writes out a "self-updating website" about an American city. Every time you run the cells in this notebook, you will get an updated website.  

**Project Details:**

* **To earn a C, your project must include the following:**
  * Write Python code in this notebook that includes a function, html_output(), that concatenates an output variable and writes out its value to an HTML file
  * The HTML file is a website about a city from one of the following lists:
    * [American cities with Open Data portals](https://www.forbes.com/sites/metabrown/2017/06/30/quick-links-to-municipal-open-data-portals-for-85-us-cities/#43a0a6e02290)
    * [Global cities with Open Data portals](https://www.opendatasoft.com/a-comprehensive-list-of-all-open-data-portals-around-the-world/)
    * You can't use Chicago
    * You must pick a different city than your friends/neighbors
    * Optional: If you plan on earning extra credit on this project, you should verify that your chosen city has Open Data available in CSV format    
  * Your website must have custom CSS styling, a title, and a header with the city name
  * Your website must include a one-paragraph description of the city
    * Describe the location, geography, and history of your chosen city 
    * Between 100 and 150 words
    * Cite your sources with hyperlinks or URLs (it's okay to cite Wikipedia in this case)
  * html_output() must call a function that scrapes weather data to give up-to-date weather information for current conditions in the selected city
    * Example: "The weather in Chicago is currently **Cloudy** with a temperature of **47** degrees Fahrenheit."
    * If you can't find a weather feed for the selected city, pick one *near* your chosen city
  * html_output() must call a function that scrapes the top (first) headline from the RSS feed of a newspaper in this city to format an up-to-date "Top Headline" for your chosen city
    * Example: "Blackhawks trade Ryan Hartman to Predators for first-round pick. (Chicago Tribune)"
    * If you can't find a newspaper in the selected city, pick one *near* your chosen city
  * Your project code must be thoroughly explained using a combination of code comments and/or markdown cells
  * Your answers to "Written Response \#1" and "Written Response \#2" must be included below and be C-quality work or above 


* **To earn a B, your project must include the requirements above, as well as the following:**
  * html_output() must call a function that scrapes and displays an image of the city directly from Wikipedia
  * Instead of the weather report from the C-level requirements, html_output() must call a function that scrapes weather data to give a **descriptive**, up-to-date weather report
    * Example 1: If it is 34 degrees and cloudy: “The weather today is **cold** and **dry** with **cloudy skies**. The high temperature is **34** degrees Fahrenheit.”
    * Example 2: If it is 91 degrees and sunny: “The weather today is **hot** and **dry** with **clear skies**. The high temperature is **91** degrees.”
    * Example 3: If it is 68 degrees and raining: “The weather today is **cool** and **wet** with **rain. Bring your umbrella!** The high temperature is **68** degrees.”
    * Brainstorm other adjectives you may consider using, and customize the report to your own style/preferences
    * Also incorporate wind and humidity into your weather report    
  * Your answers to "Written Response \#1" and "Written Response \#2" must be included below and be B-quality work or above 


* **To earn an A, your project must include the requirements above, as well as the following:**
  * Instead of a single "Top Headline," html_output() must call a function that scrapes three different random headlines from the RSS feed of a newspaper in this city to format an up-to-date set of "News Alerts" for your chosen city
    * If you can't find a newspaper in the selected city, pick one *near* your chosen city
  * Change the background color generated for your webpage based on the temperature in the weather report. Use hex colors such as "#FF0000" instead of "RGB(255,0,0)" or a color name such as "red".
    * Below zero: Purple
    * Below 32: Blue
    * Below 40: Cyan
    * Below 50: Green
    * Below 60: Yellow
    * Below 70: Light Orange
    * Below 80: Dark Orange
    * Below 90: Red
    * Above 90: Dark Red
  * Your answers to "Written Response \#1" and "Written Response \#2" are included below and are A-quality work or above 


* **To earn Extra Credit, you can pick any (or all) of these optional enhancements to incorporate into your page. The number of points earned will depend on which enhancement(s) you pick and the quality of your work:**
  * html_output() calls a function that gets up-to-date data from a CSV file
    * That data must be formatted and displayed on your website
    * The CSV file must be loaded from a URL (this way, the data gets updated when you re-run the notebook)
    * For example, if you have a CSV with car thefts, you could include the time, location, and make/model of the most recent 10 car thefts in your city
  * Your Open Data output is turned into a graph or chart and displayed on the webpage as an image
    * For example, you might include a bar chart that shows how many traffic tickets were issued in your city each day for the past 7 days
  * Your page includes other kinds of interesting/relevant scraped data not listed in the above requirements or other kinds of JavaScript functionality that improves your page  

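The "Top Headline" and "News Alerts" requirements above can be sketched as follows. This is a minimal illustration, not a required approach: it parses a made-up sample feed with the standard library's `xml.etree.ElementTree` instead of downloading a real newspaper feed, and the names `SAMPLE_RSS`, `parse_headlines`, `top_headline`, and `random_headlines` are hypothetical. It assumes the feed uses plain RSS 2.0 `<item>` elements that each contain a `<title>` and a `<link>`.

```python
import random
import xml.etree.ElementTree as ET

# Hypothetical sample feed; a real project would download the newspaper's RSS XML instead.
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <title>Example City Times</title>
    <link>https://example.com/</link>
    <item><title>Headline one</title><link>https://example.com/1</link></item>
    <item><title>Headline two</title><link>https://example.com/2</link></item>
    <item><title>Headline three</title><link>https://example.com/3</link></item>
    <item><title>Headline four</title><link>https://example.com/4</link></item>
  </channel>
</rss>"""


def parse_headlines(rss_text):
    """Return a list of (title, link) pairs, one per <item> in the feed."""
    root = ET.fromstring(rss_text)
    stories = []
    for item in root.iter("item"):
        # Reading inside each <item> skips the feed's own <title> and <link>.
        title = item.findtext("title", default="").strip()
        link = item.findtext("link", default="").strip()
        stories.append((title, link))
    return stories


def top_headline(stories):
    """Return the first story, or None if the feed has no items."""
    return stories[0] if stories else None


def random_headlines(stories, count=3):
    """Return up to `count` distinct stories chosen at random."""
    return random.sample(stories, min(count, len(stories)))


stories = parse_headlines(SAMPLE_RSS)
print("Top headline:", top_headline(stories))
for title, link in random_headlines(stories):
    print("News alert:", title, "-", link)
```

Because `random.sample` picks without replacement, the three alerts are always different stories as long as the feed has at least three items.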

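The CSV extra-credit option above could look something like the sketch below. It is only one possible shape: the sample data, the column names, and the helpers `most_recent_rows` and `rows_to_html_table` are all invented for illustration, and the code assumes the date column is in ISO format (YYYY-MM-DD) so that sorting the text also sorts by date. In a real notebook the CSV text would be downloaded from the city's Open Data URL rather than taken from a string.

```python
import csv
import html
import io

# Hypothetical sample standing in for CSV text downloaded from an Open Data portal.
# In a notebook, the text could come from something like:
#     from urllib.request import urlopen
#     csv_text = urlopen(CSV_URL).read().decode("utf-8")   # CSV_URL is a placeholder
SAMPLE_CSV = """date,location,make
2018-03-01,Pine St,Honda
2018-03-03,1st Ave,Toyota
2018-03-02,Lake Rd,Ford
"""


def most_recent_rows(csv_text, date_column, count=10):
    """Return up to `count` rows as dictionaries, newest first.

    Assumes dates are ISO formatted (YYYY-MM-DD) so they sort correctly as text.
    """
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    rows.sort(key=lambda row: row[date_column], reverse=True)
    return rows[:count]


def rows_to_html_table(rows, columns):
    """Format the chosen columns of each row as a simple HTML table."""
    header = "".join("<th>" + html.escape(col) + "</th>" for col in columns)
    body = ""
    for row in rows:
        cells = "".join("<td>" + html.escape(row[col]) + "</td>" for col in columns)
        body += "<tr>" + cells + "</tr>"
    return "<table><tr>" + header + "</tr>" + body + "</table>"


recent = most_recent_rows(SAMPLE_CSV, "date", count=2)
print(rows_to_html_table(recent, ["date", "location", "make"]))
```

The returned table string can be appended to the page output the same way as any other HTML fragment. `html.escape` keeps characters such as `<` or `&` in the data from breaking the page.
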
### Written Response \#1

html_output() is an algorithm that includes other algorithms. Describe how each algorithm within html_output() functions independently, as well as in combination with others, to form html_output(). Also describe how html_output() helps to achieve the intended purpose of this computer program.

*Note: Your response cannot exceed 200 words.*

Your Answer: "html_output()" calls the "tag_extractor(url,tag)", "get_image_url(article_url)", and "return_image(wiki_url,image_width)" algorithms. It combines these with other functions and libraries to form a larger algorithm. "tag_extractor(url,tag)" uses the bs4 library to parse XML files from the internet and return the text between the requested tags. This is useful for "html_output()" because it keeps the information on the webpage up to date. "get_image_url(article_url)" and "return_image(wiki_url,image_width)" are used together to pull an image from Wikipedia onto the webpage.

### Written Response \#2

Give the name of a function in your project that serves as an abstraction. Explain how this function is an abstraction and how it helps you manage the complexity of your program. 

*Note: Your response cannot exceed 200 words.*

Your Answer: 

### Project Code:

Write the code for your project below. You can use as many code/markdown cells as you need to complete the project.

In [36]:
import numpy                                 # import numpy
from urllib.request import urlopen           # import to request url
from bs4 import BeautifulSoup                # import beautiful soup to reformat the page 
import datetime                              # import to get datetime information 
import random                                # import to get random numbers
import re                                    # import to use regular expressions
import time                                  # import to get times
from IPython.display import Image, display   # import to get images
import requests                              # import to download image files

def get_image_url(article_url):
    html_page = urlopen("http://en.wikipedia.org"+article_url)   #opens whatever page we are requesting
    bs_obj3 = BeautifulSoup(html_page, 'html.parser')    #Saves the html in a Beautiful Soup object
    #The next line finds specific HTML elements in the page opened - notice the familiar tags and properties
    try:
        image_url = bs_obj3.find("meta",{"property":"og:image"}).attrs['content']
    except AttributeError:
        image_url = False
    return image_url

def return_image(wiki_url,image_width):
    img = get_image_url(wiki_url)    #look up the image URL once instead of requesting the page twice
    if img:                          #get_image_url returns False when no image is found
        url_to_file = requests.get(img).content
        extension = img.split('.')[-1]
        name = "output/output_image." + extension
        with open(name, 'wb') as image:
            image.write(url_to_file)
        display(Image(filename=name,width=image_width))

def tag_extractor(url, tag):    
    xml_page = urlopen(url)   #opens whatever page we are requesting
    bs_obj = BeautifulSoup(xml_page, 'xml')
    return bs_obj.find(tag).getText()


def background_color():
    # choose a hex background color based on the current temperature in degrees Fahrenheit
    temp = float(tag_extractor('http://w1.weather.gov/xml/current_obs/KSEA.xml', 'temp_f'))
    if temp < 0:
        return "#9400D3"    # below zero: purple
    elif temp < 32:
        return "#0000FF"    # below 32: blue
    elif temp < 40:
        return "#00FFFF"    # below 40: cyan
    elif temp < 50:
        return "#00FF00"    # below 50: green
    elif temp < 60:
        return "#FFFF00"    # below 60: yellow
    elif temp < 70:
        return "#FFA500"    # below 70: light orange
    elif temp < 80:
        return "#FF8700"    # below 80: dark orange
    elif temp < 90:
        return "#FF0000"    # below 90: red
    else:
        return "#8B0000"    # 90 and above: dark red


def weather_report():
    temp = float(tag_extractor('http://w1.weather.gov/xml/current_obs/KSEA.xml', 'temp_f'))
    output = ""
    # each elif only runs when the checks above it failed, so one lower bound per branch is enough
    if temp >= 70:
        output += "It is hot with a temperature of " + str(temp) + " degrees. Wear some short sleeves!"
    elif temp > 50:
        output += "It is mild with a temperature of " + str(temp) + " degrees. Wear long sleeves to keep yourself warm."
    elif temp > 30:
        output += "It is chilly with a temperature of " + str(temp) + " degrees. Wear thick clothing and a light jacket."
    else:
        output += "It is freezing outside with a temperature of " + str(temp) + " degrees. Wear thick clothing with a heavy jacket."
    return output

xml_page2 = urlopen("https://www.seattletimes.com/sports/feed/")   #opens whatever page we are requesting
bs_obj2 = BeautifulSoup(xml_page2, 'xml')
headlines2 = bs_obj2.find_all('title')
headlines2 = [story.getText() for story in headlines2] # this is a list comprehension
headlines2 = headlines2[2:]   # skip the first two titles, which are not story headlines
urls2 = bs_obj2.find_all('link')
urls2 = [link.getText() for link in urls2]
urls2 = urls2[3:]             # skip the first three links so urls2 lines up with headlines2

# define the function for the webpage: 
def html_output():    
    output_string = """
    <html>
    <head>
        <style>
            body {
                padding: 5px 0;
                text-align: center;
                border: 12px blue;
                border-style: double;
                background-color: """ + background_color()+""";
                font-family: "Times New Roman"        
            }
            h1{
                font-size: 45px;
                color: blue;
                font-weight: bold;
            }
            h3{
                
                font-size: 35px;
                font-style: italic;
                color: red;
            }
            d{
                font-size: 22px;
            }
            a{
                font-size: 22px;
            }
        </style>
        <meta http-equiv="refresh" content="120">
    </head>
    
    <body>
    <h1>Seattle</h1>
    """
    url = get_image_url("/wiki/Seattle")
    if url:    # get_image_url returns False when the page has no og:image tag
        output_string += '<img src="' + url + '" width="400" height="208">'
    output_string += """
    <br>
    <d>Seattle is a city located in Washington. Seattle is located near the Puget Sound of Washington. This state is located in the northwest corner of the USA. Washington contains many mountainous regions,
    beaches, and lakes. Washington is home to many businesses such as Boeing, Microsoft, Amazon, and Eddie Bauer. Seattle was first created as a reservation for 
    the Indians. The tribe that had lived there was displaced by the new settlers. After living in poverty, the Indian tribe was almost wiped out. The settlers saw their wrongdoing and then 
    decided to name the city in honor of the Chief of the tribe. His name was Chief Seattle.</d>
    <a href='https://www.seattle.gov/cityarchives/seattle-facts/brief-history-of-seattle'>Click Here to Read More History</a>
    <h3>Seattle Weather</h3>
    <p> The most recent temperature measurement was taken at
    """
    output_string += tag_extractor('http://w1.weather.gov/xml/current_obs/KSEA.xml', 'observation_time_rfc822')
    output_string += '.'"""
    This was taken at 
    """
    output_string += tag_extractor('http://w1.weather.gov/xml/current_obs/KSEA.xml', 'location')
    output_string += '. This is located at ('
    output_string += tag_extractor('http://w1.weather.gov/xml/current_obs/KSEA.xml', 'latitude')
    output_string += ','
    output_string += tag_extractor('http://w1.weather.gov/xml/current_obs/KSEA.xml', 'longitude')
    output_string += ').'"""
    The wind speed is 
    """
    output_string += tag_extractor('http://w1.weather.gov/xml/current_obs/KSEA.xml', 'wind_mph')
    output_string += """
    mph.</p>
    <br>
    <br>
    """
    output_string += weather_report()
    output_string += """    
    <br>
    <h3>Seattle Sports News:</h3>
    """
    output_string += headlines2[0]
    output_string +="""
    <br>
    <br>
    """
    output_string +='READ MORE AT: ' + urls2[0]
    output_string += """
    <br>
    """
    output_string += headlines2[1]
    output_string +="""
    <br>
    <br>
    """
    output_string +='READ MORE AT: ' + urls2[1]
    output_string += """
    <br>
    """
    output_string += headlines2[2]
    output_string +="""
    <br>
    <br>
    """
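    output_string +='READ MORE AT: ' + urls2[2]
    # append the closing tags; a bare string on its own line is never added to the output
    output_string += """
    
    </body>
    </html>
    """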

    # the with statement closes the file automatically, even if the write fails
    with open("output/write.html","w") as html_file:
        html_file.write(output_string)
    
# now call the function: 

html_output()
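
In the code above, html_output() calls tag_extractor() once per field, and every call downloads the same weather XML again. One possible alternative, sketched below, is to parse the document once and read several tags from it. The sample XML and the helper name `extract_tags` are assumptions made for illustration, the values are made up, and the sketch uses the standard library's `xml.etree.ElementTree` rather than the bs4 parser used in the project.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample in the shape of a current-observation document; values are made up.
# In the notebook, the text would come from downloading the weather XML once.
SAMPLE_OBSERVATION = """<current_observation>
  <location>Example Airport, WA</location>
  <temp_f>47.0</temp_f>
  <wind_mph>5.8</wind_mph>
</current_observation>"""


def extract_tags(xml_text, tags):
    """Parse the XML once and return {tag: text} for each requested tag.

    A tag that is missing from the document maps to None.
    """
    root = ET.fromstring(xml_text)
    return {tag: root.findtext(".//" + tag) for tag in tags}


weather = extract_tags(SAMPLE_OBSERVATION, ["location", "temp_f", "wind_mph"])
temp = float(weather["temp_f"])
print("Observed at " + weather["location"] + ": " + str(temp) + " degrees, wind "
      + weather["wind_mph"] + " mph.")
```

With this shape, the page is built from a single download, so all of the values shown come from the same observation.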