# [Importing Data in Python](https://www.datacamp.com/completed/statement-of-accomplishment/course/987436db924b8f6ab6a975f730c5a359f71e4363)

[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/adamelliotfields/datacamp/blob/main/notebooks/courses/importing_data_in_python/notebook.ipynb)

## Contents

* [Web Scraping](#Web-Scraping-🔝)
* [APIs](#APIs-🔝)


## Web Scraping 🔝

Beautiful Soup is a Python library for parsing HTML and XML documents. It was created in 2004 (before jQuery) by Leonard Richardson, who maintains it to this day. The name comes from "tag soup", a term coined by Ken Holman to describe structurally or syntactically incorrect markup.

To fetch the HTML itself, you can use the `urllib` package from Python's standard library. For more advanced functionality, the third-party `requests` library is the most popular choice.

In [1]:
import sys
import pandas as pd
from pathlib import Path
from urllib.request import urlretrieve

url = "https://assets.datacamp.com/production/course_1606/datasets/winequality-red.csv"
filename = "winequality-red.csv"

# only download the file if it doesn't already exist locally
if not Path(filename).exists():
    urlretrieve(url, filename)

# pandas can also read from the url directly: pd.read_csv(url, sep=";")
df = pd.read_csv(filename, sep=";")
pd.set_option("display.width", sys.maxsize)

print(df.head())


   fixed acidity  volatile acidity  citric acid  residual sugar  chlorides  free sulfur dioxide  total sulfur dioxide  density    pH  sulphates  alcohol  quality
0            7.4              0.70         0.00             1.9      0.076                 11.0                  34.0   0.9978  3.51       0.56      9.4        5
1            7.8              0.88         0.00             2.6      0.098                 25.0                  67.0   0.9968  3.20       0.68      9.8        5
2            7.8              0.76         0.04             2.3      0.092                 15.0                  54.0   0.9970  3.26       0.65      9.8        5
3           11.2              0.28         0.56             1.9      0.075                 17.0                  60.0   0.9980  3.16       0.58      9.8        6
4            7.4              0.70         0.00             1.9      0.076                 11.0                  34.0   0.9978  3.51       0.56      9.4        5


In [2]:
from urllib.request import urlopen, Request

url = "https://campus.datacamp.com/courses/1606/4135?ex=2"
request = Request(url)

# the context manager closes the connection when the block exits
with urlopen(request) as response:
    html = response.read()

print(type(response))
print(type(html))


<class 'http.client.HTTPResponse'>
<class 'bytes'>
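
The output above shows that `response.read()` returns `bytes`, not `str`, so the content has to be decoded before it can be handled as text. The following is a minimal sketch of that step. It assumes the page is UTF-8 encoded, and the `raw` value is a made-up sample standing in for a real response body.

```python
# Hypothetical sample standing in for the bytes returned by response.read()
raw = "<title>Café</title>".encode("utf-8")

# Decode the bytes into a str (UTF-8 is an assumption, not read from a header)
text = raw.decode("utf-8")

print(type(raw))
print(type(text))
print(text)
```

With a live response, the declared encoding can usually be read with `response.headers.get_content_charset()`, which returns `None` when the server does not state one.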


In [3]:
import requests

url = "https://www.datacamp.com/teach/documentation"
r = requests.get(url)
text = r.text

print(type(text))


<class 'str'>


In [4]:
import requests
from bs4 import BeautifulSoup

url = "https://www.python.org/~guido"
r = requests.get(url)
html_doc = r.text
# name the parser explicitly to avoid a warning and parser-dependent results
soup = BeautifulSoup(html_doc, "html.parser")

# format HTML
pretty_soup = soup.prettify()

# get the title tag
guido_title = soup.title

# get all the text on the page
guido_text = soup.get_text()

# get all the hyperlinks on the page
a_tags = soup.find_all("a")

# loop over tags
for tag in a_tags:
    print(tag.get("href"))


pics.html
pics.html
http://www.washingtonpost.com/wp-srv/business/longterm/microsoft/stories/1998/raymond120398.htm
images/df20000406.jpg
http://neopythonic.blogspot.com/2016/04/kings-day-speech.html
http://www.python.org
Resume.html
Publications.html
bio.html
http://legacy.python.org/doc/essays/
http://legacy.python.org/doc/essays/ppt/
interviews.html
pics.html
http://neopythonic.blogspot.com
http://www.artima.com/weblogs/index.jsp?blogger=12088
https://twitter.com/gvanrossum
Resume.html
guido.au
http://legacy.python.org/doc/essays/
images/license.jpg
http://www.cnpbagwell.com/audio-faq
http://sox.sourceforge.net/
images/internetdog.gif
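
Many of the `href` values printed above are relative (for example `pics.html`), so they need to be resolved against the page's address before they can be requested. The sketch below shows one way to do this with `urljoin` from the standard library. The `base` value is an assumption: it is the page URL written with a trailing slash, which `urljoin` needs in order to treat `~guido` as a directory.

```python
from urllib.parse import urljoin

# Assumed base URL: the page address with a trailing slash
base = "https://www.python.org/~guido/"

# A few hrefs copied from the output above
hrefs = ["pics.html", "images/license.jpg", "http://sox.sourceforge.net/"]

# Relative links are resolved against the base; absolute links pass through unchanged
absolute = [urljoin(base, href) for href in hrefs]

for link in absolute:
    print(link)
```

When working with `requests`, `r.url` holds the final address after any redirects and is a reasonable choice for the base.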


## APIs 🔝

In [5]:
import requests

# this api key is from the course
url = "https://www.omdbapi.com?t=the+social+network&apikey=72bc447a"
r = requests.get(url)
j = r.json()

print(r.text)
print(j["Title"])


{"Title":"The Social Network","Year":"2010","Rated":"PG-13","Released":"01 Oct 2010","Runtime":"120 min","Genre":"Biography, Drama","Director":"David Fincher","Writer":"Aaron Sorkin, Ben Mezrich","Actors":"Jesse Eisenberg, Andrew Garfield, Justin Timberlake","Plot":"As Harvard student Mark Zuckerberg creates the social networking site that would become known as Facebook, he is sued by the twins who claimed he stole their idea and by the co-founder who was later squeezed out of the business.","Language":"English, French","Country":"United States","Awards":"Won 3 Oscars. 173 wins & 186 nominations total","Poster":"https://m.media-amazon.com/images/M/MV5BOGUyZDUxZjEtMmIzMC00MzlmLTg4MGItZWJmMzBhZjE0Mjc1XkEyXkFqcGdeQXVyMTMxODk2OTU@._V1_SX300.jpg","Ratings":[{"Source":"Internet Movie Database","Value":"7.8/10"},{"Source":"Rotten Tomatoes","Value":"96%"},{"Source":"Metacritic","Value":"95/100"}],"Metascore":"95","imdbRating":"7.8","imdbVotes":"735,077","imdbID":"tt1285016","Type":"movie","DVD
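
The request above packs its parameters into the URL by hand and then reads fields from the decoded JSON. The sketch below illustrates both steps offline using only the standard library. `YOUR_API_KEY` is a placeholder, and `sample` is a shortened, hand-copied fragment of the response shown above rather than a live result.

```python
import json
from urllib.parse import urlencode

# Build the query string from a dict instead of concatenating it by hand
params = {"t": "the social network", "apikey": "YOUR_API_KEY"}  # placeholder key
url = "https://www.omdbapi.com/?" + urlencode(params)
print(url)

# Shortened fragment of the response above, used here in place of a live call
sample = (
    '{"Title":"The Social Network","Year":"2010",'
    '"Ratings":[{"Source":"Internet Movie Database","Value":"7.8/10"}]}'
)

# json.loads turns the JSON text into nested dicts and lists
data = json.loads(sample)
print(data["Title"])
print(data["Ratings"][0]["Value"])
```

With `requests`, the same dictionary can be passed as `requests.get(url, params=params)`, which handles the encoding.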

In [6]:
import requests
from bs4 import BeautifulSoup

url = "https://en.wikipedia.org/w/api.php?action=query&prop=extracts&format=json&exintro=&titles=pizza"
r = requests.get(url)
j = r.json()

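# the page ID key ("24768" here) differs per article, so take the first page
pizza_extract = next(iter(j["query"]["pages"].values()))["extract"]
soup = BeautifulSoup(pizza_extract, "html.parser")

# we want the text within the last "p" tag
text = soup.find_all("p")[-1].get_text()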

print(text)


In 2017, the world pizza market was US$128 billion, and in the US it was $44 billion spread over 76,000 pizzerias.  Overall, 13% of the U.S. population aged two years and over consumed pizza on any given day.
