# Retrieving Product Information using AWS Product Advertising API - Part 1
The goal of this part of the project is to use the <b>Amazon Web Services Product Advertising API</b> to retrieve product specifications and information for advertising purposes. A previous project, "AWS API: Retrieving GTIN values from ASINs", used a similar process but logged only the GTIN values for each Amazon product (ASIN). This part of the project logs the entire XML response for each of the following <i>ItemLookup</i> response groups:
1. ItemAttributes
2. EditorialReview
3. Images

For more information about the AWS Product Advertising API, visit the <a href='https://docs.aws.amazon.com/AWSECommerceService/latest/DG/CHAP_ApiReference.html'>Official API Reference</a>.

## Import Libraries, Packages, and Modules

In [1]:
# import libraries
import pandas as pd # Data structures and data analysis tools library
import bottlenose # Amazon Product Advertising API Python Library
from time import sleep # Time access and conversions module
from datetime import date # Basic date and time types module
from bs4 import BeautifulSoup # HTML and XML document parsing package
import credentials # private file which includes keys and tag for AWS API

## Set Credentials for Amazon Product Advertising API
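For security, the real key and tag values are kept out of the published notebook; they are loaded from the private credentials module imported above.
Amazon's troubleshooting guide for Product Advertising API applications: https://docs.aws.amazon.com/AWSECommerceService/latest/DG/TroubleshootingApplications.html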

In [2]:
# Set AWS Request Credentials
AWS_ACCESS_KEY_ID = credentials.AWS_ACCESS_KEY_ID
AWS_SECRET_ACCESS_KEY = credentials.AWS_SECRET_ACCESS_KEY
AWS_ASSOCIATE_TAG = credentials.AWS_ASSOCIATE_TAG
amazon = bottlenose.Amazon(AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, AWS_ASSOCIATE_TAG) # set amazon credentials for bottlenose request
LIMIT = 1.1 # AWS Product Advertising API limits requests to 1 per second, maximum of 8640 requests per day

## Get list of ASINs as pandas DataFrame
The list of ASINs is hosted on Google Drive as a CSV file. The URL has been shortened for convenience; change the path as needed.

In [3]:
# download list of ASINs for data retrieval
!wget -O asin_needs_data.csv 'https://bit.ly/2RM45f5' -q
print('Download complete!')

Download complete!


In [4]:
# create pandas DataFrame with list from CSV file
column_number_with_ASIN = 0 # set this variable to the numerical index of the CSV column containing the desired ASINs
asin_responses_df = pd.read_csv('asin_needs_data.csv', header=0, sep=',', usecols=[column_number_with_ASIN]).astype('object')
print(asin_responses_df.head()) # preview DataFrame
print(asin_responses_df.shape) # print DataFrame Shape

         ASIN
0  B004XC6GJ0
1  B000P1DEHU
2  B004NEUJKA
3  B0011ULQNI
4  B001O0DP48
(107, 1)


In [5]:
# construct pandas series of ASIN column
asin_series = asin_responses_df['ASIN'].loc[0:]
type(asin_series) # check datatype

pandas.core.series.Series

## Define Function for Executing an AWS Product Advertising API ItemLookup Request

In [6]:
# define function to request one response group for an ASIN and return the entire XML response
def il_ia_request(asin, responsegroup):
    sleep(LIMIT) # prevent requests from exceeding AWS PA API request limit
    response = amazon.ItemLookup(ItemId=asin, ResponseGroup=responsegroup)
    # the 'lxml' HTML parser lowercases tag names and wraps the response in <html><body>
    return BeautifulSoup(response, 'lxml')
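
The request function above waits before each call but does not handle a call that fails anyway, for example when the API throttles a request. The helper below is a minimal sketch of one way to retry such a call with an increasing delay. The name `call_with_retries` and its parameters are hypothetical and not part of bottlenose or the original notebook.

```python
import time

def call_with_retries(request_fn, max_attempts=3, base_delay=1.1, sleep_fn=time.sleep):
    """Call request_fn(); on an exception, wait and retry, doubling the delay each time."""
    for attempt in range(1, max_attempts + 1):
        try:
            return request_fn()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            # wait base_delay, then 2 * base_delay, then 4 * base_delay, ...
            sleep_fn(base_delay * 2 ** (attempt - 1))
```

Under these assumptions, a single lookup could be wrapped as `call_with_retries(lambda: il_ia_request(asin, 'Images'))`. Catching every exception keeps the sketch short; a real run would probably retry only on network or throttling errors.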

In [7]:
# collect one row per ASIN, then reconstruct asin_responses_df from the rows
# (DataFrame.append was removed in pandas 2.0, so the rows are gathered in a list first)
current_date = date.today()
rows = []
for i, asin in enumerate(asin_series):
    print('Requesting Data for {}: {} of {}'.format(asin, str(i + 1).zfill(3), len(asin_series)))
    rows.append({
        'asin': asin,
        'item_attributes_xml': il_ia_request(asin, 'ItemAttributes'),
        'editorial_review_xml': il_ia_request(asin, 'EditorialReview'),
        'images_xml': il_ia_request(asin, 'Images'),
        'date': current_date,
    })
asin_responses_df = pd.DataFrame(rows, columns=['asin', 'item_attributes_xml', 'editorial_review_xml', 'images_xml', 'date'])
asin_responses_df.head()

Requesting Data for B004XC6GJ0: 001 of 107
Requesting Data for B000P1DEHU: 002 of 107
Requesting Data for B004NEUJKA: 003 of 107
Requesting Data for B0011ULQNI: 004 of 107
Requesting Data for B001O0DP48: 005 of 107
Requesting Data for B00DQV2BDO: 006 of 107
Requesting Data for B000E1B2SO: 007 of 107
Requesting Data for B00005K3B1: 008 of 107
Requesting Data for B00000J0E6: 009 of 107
Requesting Data for B0001B86GI: 010 of 107
Requesting Data for B075X4G51V: 011 of 107
Requesting Data for B006N9ZZS4: 012 of 107
Requesting Data for B003A1LK1O: 013 of 107
Requesting Data for B00NIAYOX8: 014 of 107
Requesting Data for B00IO3AQC2: 015 of 107
Requesting Data for B0028MD7F8: 016 of 107
Requesting Data for B001DFH0C2: 017 of 107
Requesting Data for B001DFPR9A: 018 of 107
Requesting Data for B00A4U1XVQ: 019 of 107
Requesting Data for B0091HQHZU: 020 of 107
Requesting Data for B00WLJLYJY: 021 of 107
Requesting Data for B000IJEZI6: 022 of 107
Requesting Data for B00PVV543G: 023 of 107
Requesting 

| | asin | item_attributes_xml | editorial_review_xml | images_xml | date |
|---|---|---|---|---|---|
| 0 | B004XC6GJ0 | `<?xml version="1.0" ?><html><body><itemlookupr...` | `<?xml version="1.0" ?><html><body><itemlookupr...` | `<?xml version="1.0" ?><html><body><itemlookupr...` | 2019-01-29 |
| 1 | B000P1DEHU | `<?xml version="1.0" ?><html><body><itemlookupr...` | `<?xml version="1.0" ?><html><body><itemlookupr...` | `<?xml version="1.0" ?><html><body><itemlookupr...` | 2019-01-29 |
| 2 | B004NEUJKA | `<?xml version="1.0" ?><html><body><itemlookupr...` | `<?xml version="1.0" ?><html><body><itemlookupr...` | `<?xml version="1.0" ?><html><body><itemlookupr...` | 2019-01-29 |
| 3 | B0011ULQNI | `<?xml version="1.0" ?><html><body><itemlookupr...` | `<?xml version="1.0" ?><html><body><itemlookupr...` | `<?xml version="1.0" ?><html><body><itemlookupr...` | 2019-01-29 |
| 4 | B001O0DP48 | `<?xml version="1.0" ?><html><body><itemlookupr...` | `<?xml version="1.0" ?><html><body><itemlookupr...` | `<?xml version="1.0" ?><html><body><itemlookupr...` | 2019-01-29 |
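
Because every ASIN triggers three separate requests and each request waits 1.1 seconds, the run time and daily capacity follow from simple arithmetic. The sketch below works this out for the 107 ASINs loaded earlier. The function name `estimate_run` is hypothetical, and the figure counts only the enforced waiting time, so it is a lower bound that ignores network latency.

```python
def estimate_run(n_asins, groups_per_asin=3, seconds_per_request=1.1, daily_limit=8640):
    """Estimate request count, minimum waiting time, and daily ASIN capacity."""
    total_requests = n_asins * groups_per_asin
    return {
        'total_requests': total_requests,
        'minutes': total_requests * seconds_per_request / 60,
        'max_asins_per_day': daily_limit // groups_per_asin,
    }

estimate = estimate_run(107)
print(estimate['total_requests'])      # 321
print(round(estimate['minutes'], 1))   # 5.9
print(estimate['max_asins_per_day'])   # 2880
```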


In [8]:
## Sample Data to ensure successful request and log
print('item_attributes_xml:')
print(asin_responses_df['item_attributes_xml'].loc[0])
print('editorial_review_xml:')
print(asin_responses_df['editorial_review_xml'].loc[0])
print('images_xml:')
print(asin_responses_df['images_xml'].loc[0])

item_attributes_xml:
<?xml version="1.0" ?><html><body><itemlookupresponse xmlns="http://webservices.amazon.com/AWSECommerceService/2013-08-01"><operationrequest><httpheaders><header name="UserAgent" value="Python-urllib/3.6"></header></httpheaders><requestid>b54f18a1-c433-4429-b30e-74d5922564ed</requestid><arguments><argument name="AWSAccessKeyId" value="AKIAIWPFMMIGTH6HUC3Q"></argument><argument name="AssociateTag" value="thorson-20"></argument><argument name="ItemId" value="B004XC6GJ0"></argument><argument name="Operation" value="ItemLookup"></argument><argument name="ResponseGroup" value="ItemAttributes"></argument><argument name="Service" value="AWSECommerceService"></argument><argument name="Timestamp" value="2019-01-29T22:01:18Z"></argument><argument name="Version" value="2013-08-01"></argument><argument name="Signature" value="J95AkDu2X7aECe41ZqGbd4dtlQewiTUbkRS+MGeRlLE="></argument></arguments><requestprocessingtime>0.0053823640000000</requestprocessingtime></operationrequest>

## Save asin_responses_df state in a Tab-delimited values file
While we are unlikely to use every XML element from the response, it is in our best interest to store the entire response. Given the AWS Product Advertising API daily request limit, we could not quickly obtain more data should we decide that we need something else from the response in future iterations of this project or in future projects altogether. Keeping a running record of the complete response means that we do not have to call the API again unless the data has changed. The date column helps determine whether the data is too stale for project requirements. The file is written with mode 'a' (append), so each run adds rows to the existing file.

In [9]:
# save as tsv
asin_responses_df.to_csv('aws_pa_item_attributes_responses.tsv', sep='\t' , index=False, mode='a')
print('Export complete.')

Export complete.
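
One caveat with append mode: `to_csv` writes the column header by default, so every additional run adds another header row in the middle of the file. The sketch below shows one possible way to write the header only when the file is new or empty. The helper name `append_tsv` is hypothetical and not part of the original notebook.

```python
import os
import pandas as pd

def append_tsv(df, path):
    """Append df to a tab-separated file, writing the header row only for a new or empty file."""
    write_header = not os.path.exists(path) or os.path.getsize(path) == 0
    df.to_csv(path, sep='\t', index=False, mode='a', header=write_header)
```

With this helper, the export cell would become `append_tsv(asin_responses_df, 'aws_pa_item_attributes_responses.tsv')`.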


## Discussion
A single AWS Product Advertising API request can carry a batch of up to ten queries. A future iteration of this project should therefore rewrite the request function to iterate through the provided series in batches of ten ASINs at a time. This would cut the number of requests, and the time spent waiting on the rate limit, by up to a factor of ten.
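
The batching idea can be sketched as a small generator that groups ASINs into comma-separated strings of at most ten. The name `batch_item_ids` is hypothetical, and passing a comma-separated string as `ItemId` is an assumption about how the batched lookup would be issued, so check it against the API reference before relying on it.

```python
def batch_item_ids(asins, batch_size=10):
    """Yield comma-separated ItemId strings covering asins in groups of batch_size."""
    asins = list(asins)
    for start in range(0, len(asins), batch_size):
        yield ','.join(asins[start:start + batch_size])

# Assumed usage (not run here): one request per batch instead of one per ASIN
# for item_ids in batch_item_ids(asin_series):
#     response = amazon.ItemLookup(ItemId=item_ids, ResponseGroup='ItemAttributes')
```

For the 107 ASINs in this notebook, this gives 11 batches per response group instead of 107 single lookups. Each batched response would then hold several items, so the parsing step would need to split them apart.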

Now that the AWS Product Advertising API responses have been saved for each queried product (ASIN), the next step is to parse this information. The following fields will be needed for advertising purposes:
1. ASIN
2. Title
3. Brand
4. GTIN(s)
5. MPN/Model
6. Specifications/Details
7. Product Description
8. Features
9. Product Category
10. Stock Image(s)

Information from the AWS Product Advertising API responses will be used to fill as many fields as possible. This process will be completed in Part 2 of this project.
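
As a starting point for that parsing step, the ten fields listed above could be collected in a simple record type. This is only a sketch: the class name `ProductRecord`, its field names, and the chosen types are hypothetical and may not match what Part 2 actually uses.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

@dataclass
class ProductRecord:
    asin: str
    title: Optional[str] = None
    brand: Optional[str] = None
    gtins: List[str] = field(default_factory=list)
    mpn: Optional[str] = None
    specifications: Dict[str, str] = field(default_factory=dict)
    description: Optional[str] = None
    features: List[str] = field(default_factory=list)
    category: Optional[str] = None
    image_urls: List[str] = field(default_factory=list)

    def missing_fields(self):
        """Return the names of fields that are still empty."""
        return [name for name, value in vars(self).items() if value in (None, [], {})]
```

A record starts with only its ASIN, and `missing_fields()` reports which fields the saved responses have not yet filled.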