![NASA](http://www.nasa.gov/sites/all/themes/custom/nasatwo/images/nasa-logo.svg)

<center>
<h1><font size="+3">GSFC Python Bootcamp</font></h1>
</center>

---

<CENTER>
<H1 style="color:red">
Data Retrieval with Python
</H1>
</CENTER>

In [None]:
from __future__ import print_function

## <font color='red'>What will be Covered?</font>

+ FTP
+ wget
+ Accessing Web Pages with Requests
+ Web scraping with BeautifulSoup

## <font color='red'>Reference Documents</font>

* <a href="http://www.blog.pythonlibrary.org/2016/06/23/python-101-an-intro-to-ftplib/"> Python 101: An Intro to ftplib</a>
* <a href="http://zetcode.com/python/ftp/">Python FTP tutorial </a>
* <a href="https://pythonprogramming.net/urllib-tutorial-python-3/">Python urllib tutorial for Accessing the Internet </a>
* <a href="http://zetcode.com/python/requests/">Python Requests Tutorial</a>
* <a href="https://www.learndatasci.com/tutorials/ultimate-guide-web-scraping-w-python-requests-and-beautifulsoup/">Ultimate Guide to Web Scraping with Python Part 1: Requests and BeautifulSoup</a>

## <font color='red'> FTP </font>

<UL>
<LI> FTP (File Transfer Protocol) is a fast and convenient way to transfer files over the Internet. 
<LI> To make FTP work, you need a client (your machine) and a server (the machine to/from which you are putting/getting files).
</UL>

#### Basic ftp Session

- You first need to connect to a server.

In [None]:
import ftplib

In [None]:
# Session without userid and password requirement

ftp_server = 'ftp.cse.buffalo.edu'
ftp_session= ftplib.FTP(ftp_server)
ftp_session.login()
ftp_session.quit()

- Since we did not pass a username or password, `login()` logs us in anonymously. 
- If you need to connect to the FTP server using a non-standard port, then you can do so using the connect method.

In [None]:
ftp_session= ftplib.FTP()
PORT = 12345
ftp_session.connect(ftp_server, PORT)
ftp_session.quit()

In [None]:
# Session that needs a userid and a password

ftp_server = "ftp.nluug.nl"
my_userid  = "anonymous"
my_passwd  = "ftplib-example-1"

ftp_session = ftplib.FTP(ftp_server)
ftp_session.login(my_userid, my_passwd)
 
ftp_session.quit()

In [None]:
# Write a function that initiates an FTP session

def open_ftp_session(ftp_server, my_userid, my_passwd):
    """
       Open a ftp session given the server ftp address,
       the user's ID and the user's password.
       
       @param ftp_server: name of the ftp server (string)
       @param my_userid:  user ID on the ftp server (string)
       @param my_passwd:  user password on the ftp server (string)
    """
    
    ftp_session = ftplib.FTP(ftp_server)
    ftp_session.login(my_userid, my_passwd)
    
    return ftp_session

ftp_session = open_ftp_session(ftp_server, my_userid, my_passwd)

#### List Directories

In [None]:
# To list the top directories in the server

ftp_session.retrlines('LIST')

- The call above retrieves the directory listing and prints it.
- To keep the listing in a variable instead of printing it, pass a callback such as `list.append` to `dir()`.

In [None]:
def ftp_list_top_dirs(ftp_session):
    """
       List the top directories on a ftp server
       
       @param ftp_session: ftp session object
          
       Returned Value:
          - List of directories and files 
           (similar to the Unix command 'ls -l')
    """
 
    data = []

    # Get the list of files
    ftp_session.dir(data.append)
    
    return data

In [None]:
data = ftp_list_top_dirs(ftp_session)
for line in data:
    print("-", line)
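Each element of `data` is a raw text line whose layout depends on the server. As a minimal sketch, assuming the server answers in the common Unix `ls -l` style (nine whitespace-separated fields), a line can be split into a small dictionary. The helper name `parse_list_line` and the sample line are hypothetical and only serve the illustration.

```python
def parse_list_line(line):
    """Split one Unix-style 'ls -l' line into a small dictionary.

    Returns None when the line does not have the expected nine fields
    (for example a 'total 12' header line).
    """
    # maxsplit=8 keeps a file name that contains spaces in one piece
    fields = line.split(None, 8)
    if len(fields) < 9:
        return None
    return {
        "is_dir": fields[0].startswith("d"),   # permission string starts with 'd'
        "size": int(fields[4]),                # size in bytes (assumed 5th field)
        "name": fields[8],
    }

# Made-up sample line, not copied from a real server
sample = "drwxr-xr-x    2 ftp      ftp          4096 Jan 15  2020 pub"
print(parse_list_line(sample))
```

Servers that do not use this layout need a different parser. When the server supports it, the `mlsd()` method of the session object returns a machine-readable listing instead.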

#### Go to a Specific Directory

In [None]:
def ftp_dir_content(ftp_session, dir_name=None):
    """
       List the content of a directory on an ftp server.
       If the directory is not provided, will list the content
       of the top directory.
       
       @param ftp_session: ftp session object
       @param dir_name:    name of the directory you want to access (string)
        
       Returned Value:
          - List of directories and files 
           (similar to the Unix command 'ls -l')
    """ 
 
    if dir_name is not None:
        # Change directory
        ftp_session.cwd(dir_name)
    
    data = []

    # Get the list of files
    ftp_session.dir(data.append)

    return data

In [None]:
data = ftp_dir_content(ftp_session)
for line in data:
    print("-", line)

In [None]:
data = ftp_dir_content(ftp_session, dir_name='pub')
for line in data:
    print("-", line)

#### Download a File

In [None]:
import sys
 
def ftp_get_file(ftp_session, file_name):
    """
         Get a file from a ftp server

         @param ftp_session: ftp session object
         @param file_name: name of the file you want to download  
    """
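    try:
        # The context manager closes the local file even if the transfer fails
        with open(file_name, 'wb') as local_file:
            ftp_session.retrbinary("RETR " + file_name, local_file.write)
    except ftplib.all_errors as err:
        print("Error - Cannot obtain file: " + file_name + " (" + str(err) + ")")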

In [None]:
dir_name  = '/pub/'
file_name = 'README.nluug'

ftp_session.cwd(dir_name)   
ftp_get_file(ftp_session, file_name)
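`retrbinary` hands each received block to a callback, so the content does not have to go to disk. The following is a small sketch that collects the blocks in memory; the function name `ftp_read_bytes` is hypothetical, and the approach is only reasonable for files that fit comfortably in memory.

```python
import io

def ftp_read_bytes(ftp_session, file_name):
    """Return the content of a remote file as bytes, without writing to disk."""
    buffer = io.BytesIO()
    # retrbinary calls buffer.write once for every block it receives
    ftp_session.retrbinary("RETR " + file_name, buffer.write)
    return buffer.getvalue()
```

With an open session, `ftp_read_bytes(ftp_session, 'README.nluug').decode()` would give the file as text, assuming the file is UTF-8 encoded.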

#### Uploading a File

In [None]:
import os
 
def ftp_put_file(ftp_session, file_name):
    """
         Put a file to a ftp server.

         @param ftp_session: ftp session object
         @param file_name: name of the file you want to upload  
    """
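    file_ext = os.path.splitext(file_name)[1]
    if file_ext in (".txt", ".htm", ".html"):
        # storlines reads the file line by line and expects a binary-mode file
        with open(file_name, "rb") as local_file:
            ftp_session.storlines("STOR " + file_name, local_file)
    else:
        with open(file_name, "rb") as local_file:
            ftp_session.storbinary("STOR " + file_name, local_file, 1024)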

In [None]:
ftp_put_file(ftp_session, "README.nluug")

In [None]:
ftp_session.quit()

## <font color='red'>wget</font>

<UL>
<LI> <code>wget</code> is a command-line utility for downloading files from the Internet.
<LI> It supports:
    <OL> 
    <LI> Downloading multiple files
    <LI> Downloading in the background 
    <LI> Resuming downloads
    <LI> Limiting the bandwidth used for downloads and viewing headers.
    </OL>
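<LI> The cells below reproduce the basic download step in Python with <code>urllib.request.urlretrieve</code>.
</UL>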

In [None]:
import urllib.request

In [None]:
mylink = 'ftp://ftp.unidata.ucar.edu/pub/netcdf/netcdf-4.4.1.1.tar.gz'
urllib.request.urlretrieve(mylink, 'netcdf-4.4.1.1.tar.gz')

In [None]:
def wget_python(url_name, loc_file_name):
    """
       Implementation of wget.
       
       @param url_name: url pointing to the remote file name
       @param loc_file_name: local file name
    """
    urllib.request.urlretrieve(url_name, loc_file_name)

In [None]:
import os

list_urls = ["ftp://ftp.unidata.ucar.edu/pub/netcdf/netcdf-4.4.1.1.tar.gz",
            "ftp://ftp.unidata.ucar.edu/pub/netcdf/netcdf-4.4.0.tar.gz",
            "ftp://ftp.unidata.ucar.edu/pub/netcdf/netcdf-4.5.0.tar.gz"]
for url_name in list_urls:
    loc_file_name = os.path.basename(url_name)
    print("Retrieving: ", loc_file_name)
    wget_python(url_name, loc_file_name)
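`wget` reports progress while it downloads. `urlretrieve` accepts an optional `reporthook` callback that receives the block number, the block size and the total size, which is enough for a rough progress line. This is a sketch only; the name `wget_with_progress` is hypothetical.

```python
import urllib.request

def wget_with_progress(url_name, loc_file_name):
    """Download url_name to loc_file_name and print a simple progress line."""

    def report(block_num, block_size, total_size):
        # total_size is -1 when the server does not report a size
        if total_size > 0:
            done = min(block_num * block_size, total_size)
            print("  {:3.0f}% of {} bytes".format(100.0 * done / total_size, total_size))

    # urlretrieve returns the local file name and the response headers
    local_path, headers = urllib.request.urlretrieve(
        url_name, loc_file_name, reporthook=report)
    return local_path
```

It can replace `wget_python` in the loop above without other changes.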

## <font color='red'>Python requests</font>

* Requests is a simple and elegant Python HTTP (Hypertext Transfer Protocol) library. 
* It provides methods for accessing Web resources via HTTP. 
* Requests is a third-party package, not part of the standard library; install it with `pip install requests` or `conda install requests`.

In [None]:
import requests as reqs

print(reqs.__version__)
print(reqs.__copyright__)

The json module enables you to convert between JSON and Python Objects. 
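A short illustration of both directions with the standard `json` module. The text in `raw` is made up for the example; it is only shaped like a reply from a web service.

```python
import json

# Illustrative JSON text (not copied from a real response)
raw = '{"args": {"name": "Peter", "age": "23"}, "url": "https://httpbin.org/get"}'

record = json.loads(raw)          # JSON text -> Python dictionary
print(record["args"]["name"])

back = json.dumps(record, indent=2, sort_keys=True)   # Python dictionary -> JSON text
print(back)
```

When a server replies with JSON, the response object returned by Requests also offers a `json()` method that performs the `json.loads` step for you.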

**Reading a Web Page**
- We use the function `get()` to grab the content of a web page into an object.
- We extract from the object the HTML content of the page.

In [None]:
resp = reqs.get("http://www.webcode.me")

print(resp.text)

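We can use the module `re` to strip the HTML markup from the content. This is a quick approximation: the pattern removes the tags but keeps everything between them, including the content of `<script>` and `<style>` elements.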

In [None]:
import re

resp = reqs.get("http://www.webcode.me")

content = resp.text

stripped = re.sub('<[^<]+?>', '', content)
print(stripped)
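The regular expression removes tags but leaves script and style content in place and does not decode entities such as `&amp;`. As a sketch of a slightly more careful approach that uses only the standard library, the class below collects visible text with `html.parser`. The names `TextExtractor` and `html_to_text` and the sample string are hypothetical.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text, skipping the content of <script> and <style>."""

    def __init__(self):
        super().__init__()        # entities such as &amp; are decoded by default
        self.parts = []
        self._skip_depth = 0      # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep non-empty text that is not inside a skipped element
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())

def html_to_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.parts)

sample = ("<html><head><style>p {color: red}</style></head>"
          "<body><h1>Title</h1><p>Fish &amp; chips</p></body></html>")
print(html_to_text(sample))
```

Calling `html_to_text(resp.text)` applies the same extraction to a downloaded page.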

**HTTP Request**
- An HTTP request is a message sent from the client to the server to retrieve some information or to trigger some action.
- The `request()` function of `requests` creates and sends a new request.
- In practice we usually call the convenience functions of the `requests` module: `get()`, `post()`, or `put()`.

Create a `GET` request and send it to the web site.

In [None]:
resp = reqs.request(method='GET', url="http://www.webcode.me")
print(resp.text)

**Getting the Status of a Web Page**
- We perform an HTTP request with the `get()` method and check for the returned status.
- 200 is the standard response for a successful HTTP request; 404 indicates that the requested resource could not be found.

In [None]:
resp = reqs.get("http://www.webcode.me")
print(resp.status_code)

In [None]:
resp = reqs.get("http://www.webcode.me/news")
print(resp.status_code)
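The numeric codes are easier to read with their standard phrases. The standard library lists them in `http.HTTPStatus`; the helper `describe_status` below is a hypothetical convenience wrapper.

```python
from http import HTTPStatus

def describe_status(code):
    """Return the code together with its standard reason phrase."""
    try:
        status = HTTPStatus(code)
    except ValueError:
        # Not a code known to the standard library
        return "{} (unknown status code)".format(code)
    return "{} {}".format(status.value, status.phrase)

for code in (200, 301, 302, 404):
    print(describe_status(code))
```

For example, `describe_status(resp.status_code)` labels the code of a response. Requests can also raise an exception for 4xx and 5xx codes through `resp.raise_for_status()`.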

**`requests` `head()` Method**
- The `head()` method retrieves document headers. 
- The headers consist of fields, including date, server, content type, or last modification time.

In [None]:
resp = reqs.head("http://www.webcode.me")

print("Server: " + resp.headers['server'])
print("Last modified: " + resp.headers['last-modified'])
print("Content type: " + resp.headers['content-type'])

**`requests` `get()` Method**
- The `get()` method issues a GET request to the server. 
- The GET method requests a representation of the specified resource.

The following script sends a variable with a value to the httpbin.org server. The variable is specified directly in the URL.

In [None]:
resp = reqs.get("https://httpbin.org/get?name=Peter")
print(resp.text)

The `get()` method takes a params parameter where we can specify the query parameters.

We send a GET request to the web site and pass the data, which is specified in the params parameter.

In [None]:
payload = {'name': 'Peter', 'age': 23}
resp = reqs.get("https://httpbin.org/get", params=payload)

In [None]:
print(resp.url)

In [None]:
print(resp.text)
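The query string that ends up in the URL is the payload in URL-encoded form. The sketch below builds such a URL with the standard library and parses it back, without sending anything over the network.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

payload = {'name': 'Peter', 'age': 23}

# Dictionary -> query string appended to the base URL
url = "https://httpbin.org/get?" + urlencode(payload)
print(url)

# Going the other way: recover the parameters from the URL
query = parse_qs(urlsplit(url).query)
print(query)
```

Note that every value comes back as a list of strings, because a key may appear more than once in a query string.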

**`requests` Redirection**
- Redirection is a process of forwarding one URL to a different URL.
- The HTTP response status code 301 Moved Permanently is used for permanent URL redirection; 302 Found for a temporary redirection.

In the following example, we issue a GET request to a web page. This page redirects to another page; redirect responses are stored in the history attribute of the response.

In [None]:
resp = reqs.get("https://httpbin.org/redirect-to?url=/")

print(resp.status_code)
print(resp.history)
print(resp.url)      

A GET request to https://httpbin.org/redirect-to was 302 redirected to https://httpbin.org.

We can choose not to follow a redirect by using the `allow_redirects` parameter (set to `True` by default).

In [None]:
resp = reqs.get("https://httpbin.org/redirect-to?url=/", allow_redirects=False)

print(resp.status_code)
print(resp.url)

**`requests` `post()` Method**
- The `post()` method sends a POST request to the given URL and passes the key/value pairs as form data.

The following script sends a request with a `name` key whose value is `Peter`. The POST request is issued with the `post()` method.

In [None]:
data = {'name': 'Peter'}

resp = reqs.post("https://httpbin.org/post", data)
print(resp.text)

## <font color='red'>Python Beautiful Soup</font>

- Web scraping allows you to download the HTML of a website and extract the data that you need.
- Beautiful Soup is a Python library for scraping data from websites.
- Beautiful Soup creates a parse tree from parsed HTML and XML documents.

In [None]:
import requests as reqs
from bs4 import BeautifulSoup

In [None]:
source = reqs.get("http://www.webcode.me")
##source = reqs.get("https://pythonprogramming.net/parsememcparseface/")
print(source)

**Create a beautiful soup object**

In [None]:
soup = BeautifulSoup(source.text, 'html.parser')
print(soup.prettify())

**Title of the page**

In [None]:
print(soup.title)

**Get attribute**

In [None]:
print(soup.title.name)

**Get values**

In [None]:
print(soup.title.string)

In [None]:
print(soup.title.text)

**Beginning navigation**

In [None]:
print(soup.title.parent.name)

**Getting specific values**

- We want to find paragraph tags `<p>`.

In [None]:
print(soup.p)

In [None]:
print(soup.find_all('p'))

In [None]:
for paragraph in soup.find_all('p'):
    print(paragraph.string)
    print(str(paragraph.text))

**Grab the text**

- Use the method `get_text`.

In [None]:
print(soup.get_text())

## <font color="red">Application</font>

- We want to get all book titles from the historical New York Times Best Sellers lists (Business section).
- The purpose is to:
     1. Help to compile my reading list in 2020
     2. Serve as reference to use Python for simple web analytics
- We use the Python packages: `Pandas`, `Requests` and `Beautiful Soup`
- We save data in `pickle` and `csv` formats.

The example was taken from: <a href="https://towardsdatascience.com/building-my-2020-reading-list-with-a-simple-python-script-b610c7f2c223">Building my 2020 reading list with a simple Python script</a> by Pan Wu.

In [None]:
import requests
import pandas as pd
from bs4 import BeautifulSoup

# Create an empty Pandas dataframe
nylist = pd.DataFrame()

beg_year = 2013
end_year = 2020
for the_year in range(beg_year, end_year):
    for the_month in range(1, 13):
        cur_month = str(the_month).zfill(2) # month in two digits
        # build the URL from the known pattern, then use the Requests package to get its content
        url = 'https://www.nytimes.com/books/best-sellers/{0}/{1}/01/business-books/'.format(the_year, cur_month)
        page = requests.get(url)
        print(" --  try: {0}, {1} -- ".format(the_year, cur_month))
        
        # Ensure proper result is returned
        if page.status_code != 200:
            continue
        
        # one may want to use BeautifulSoup to parse the right elements out
        soup = BeautifulSoup(page.text, 'html.parser')
        
        # the specific class names are unique for this URL and they don't change across all URLs
        top_list = soup.findAll("ol", {"class": "css-12yzwg4"})[0].findAll("div", {"class": "css-xe4cfy"})
        print("Year: {} - Month: {} - How many in the top list: {}".format(the_year, the_month, len(top_list)))
        
        # loop through the Best Seller list in each Year-Month, and append the information into a pandas DataFrame
        for i in range(len(top_list)):
            book   = top_list[i].contents[0]
            title  = book.findAll("h3", {"class": "css-5pe77f"})[0].text
            author = book.findAll("p",  {"class": "css-hjukut"})[0].text
            review = book.get("href")
            # print("{0}, {1}; review: {2}".format(title, author, review))
            one_item = pd.Series([the_year, the_month, title, author, i+1, review], index=['year', 'month', 'title', 'author', 'rank', 'review'])
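            # DataFrame.append was removed in pandas 2.0; concatenate a one-row frame instead
            nylist = pd.concat([nylist, one_item.to_frame().T], ignore_index=True, sort=False)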

# write out the result to a pickle file for easy analysis later.
nylist.to_pickle("nylist.pkl")
#nylist.to_csv("nylist.csv", index=False)
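The URL pattern used in the loop above can be checked without contacting the site. The sketch below only generates the URLs; the generator name `monthly_urls` is hypothetical, and the format `{1:02d}` plays the role of `zfill(2)`.

```python
def monthly_urls(beg_year, end_year):
    """Yield (year, month, url) for each month from beg_year up to,
    but not including, end_year."""
    pattern = 'https://www.nytimes.com/books/best-sellers/{0}/{1:02d}/01/business-books/'
    for the_year in range(beg_year, end_year):
        for the_month in range(1, 13):
            yield the_year, the_month, pattern.format(the_year, the_month)

urls = list(monthly_urls(2013, 2020))
print(len(urls))       # 7 years x 12 months
print(urls[0][2])
```

Because `range` excludes its end value, `end_year = 2020` means the last list requested is the one for December 2019.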

In [None]:
nylist

#### Exercise

- Write a Python script that reads the file `nylist.pkl` and prints its content.