# 2.1 Unpacking Google Patent Data

by Constantin Knoll, Christopher Mosch, Rohan Thavarajah

## Summary

The purpose of this notebook is to automate the downloading, unpacking, and subsequent uploading (to S3) of patent data available on Google. We scrape patent files with supplements and images from the URL found in the User Guide. The data is grouped by week and ranges from 2001 to 2015. Files from 2011 onward are compressed in .tar format, while the earlier ones are compressed as .zip. The code accounts for both formats by using the Python modules `zipfile` and `tarfile`.
The code found in Compact Version (which we used mainly) downloads a single week of patent data, extracts, uploads and deletes it before repeating the process with another week. This is optimized to work on very small storage space such as an SSD on our local machine.
Each week of patent data contains several folders with different content. For the purposes of our project, we are only interested in the abstracts, which are found in the .xml files of the patent application body. Thus, we need to get rid of all supplements (as defined by the user below) and all images that aren't conducive to data analysis. This is reflected in the main Code, where the tree of the downloaded data is searched for all unwanted folders, which are deleted.
It is important to realize that the "deleted" files are typically dropped into the recycle bin of the local machine, which defeats the purpose of running the space-optimized code. We therefore advise the user to use an automated recycle bin clean-up program. It should be set to clean every 10-15 minutes, which is the average time it takes to complete the cycle for one week of patent data.
Since we get rid of most of the data found in the patent files, the convoluted folder structure that remains is thrown out and replaced by a three-tiered one: 1) Root 2) Year 3) Week.
The uploading to S3 is controlled by the boto3 library, which takes care of most of the uploading automatically.

![Image](Data/Images/Workflow_2.1.png?raw=true)

## Table of Contents

* <a href='#Change Log'>Change Log</a>
    * <a href='#v1'>v1</a>
    * <a href='#v2'>v2</a>
    * <a href='#v3'>v3</a>
* <a href='#User Guide'>User Guide</a>
    * <a href='#Path Specification'>Path Specification</a>
* <a href='#Main Code'>Main Code</a>
    * <a href='#Extraction'>Extraction</a>
    * <a href='#Cleaning'>Cleaning</a>
    * <a href='#Uploading'>Uploading</a>
* <a href='#Compact Version'>Compact Version</a>

<a id='Change Log'></a>
## Change Log / Notes

<a id='v1'></a>
### v.1
- initial build
- only supports 'keep_supplements == False, etc.'
- unpacking can't be paused and resumed - only entire .tar files
- can't check available disk space
- can't deal with unexpected folders

<a id='v2'></a>
### v.2 
- Updated print messages

<a id='v3'></a>
### v.3
- Restructured to unpack onto SSD
- Developed functionality for pre-2005 patent data
- Included S3 Upload

<a id='User Guide'></a>
## User Guide

This program assumes that the patent data is in a .tar or .zip file format. These files can be accessed at https://www.google.com/googlebooks/uspto-patents-redbook.html. Note the .tar format was adopted for years 2011-present. 

<a id='Path Specification'></a>
#### Path Specification

Please replace `'Google Patent Data'` in the cell below with the location of the existing or desired patent directory. Be sure to omit the trailing /. If the patent data already exists, it should be organized into subfolders, one per year.

In [None]:
import os

In [None]:
#set path of raw data
directory_path = 'Google Patent Data'

#create folder if not there
if not os.path.exists(directory_path):
    os.makedirs(directory_path)

#### If you do not wish to download relevant files through this program, please skip to 'Output Format'.

In [None]:
from bs4 import BeautifulSoup, SoupStrainer
import requests, urllib

The following cell generates a list of all available files.

In [None]:
#scrape list of available files
r = requests.get('https://www.google.com/googlebooks/uspto-patents-redbook.html')
soup = BeautifulSoup(r.text, 'html.parser').find_all('a')
tarlist = []
for link in soup:
    href = link.get('href', '')
    #keep .tar archives and .ZIP archives that are not supplements
    if '.tar' in href or ('.ZIP' in href and 'SUPP' not in href):
        print ('\''+href+'\''+',')
        tarlist.append(href)

Please copy the download links of the desired files into the following array, deleting the trailing comma at the end of the array.

In [None]:
files_to_download = ['http://storage.googleapis.com/patents/redbook/grants/2001/20011225.ZIP']
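
Copying links by hand becomes tedious for more than a few weeks. As a rough sketch, the scraped list could instead be filtered by year in code. The helper `select_links` and its `years` parameter are hypothetical names introduced here for illustration, and the sketch assumes every link has the shape `.../grants/<year>/<file>` seen in the cell above.

```python
def select_links(hrefs, years):
    """Return the archive links whose year folder is in `years`.

    Assumes URLs shaped like .../grants/<year>/<file>.
    """
    wanted = set(str(y) for y in years)
    selected = []
    for href in hrefs:
        parts = href.rstrip('/').split('/')
        # the second-to-last path component is taken to be the year
        if len(parts) >= 2 and parts[-2] in wanted:
            selected.append(href)
    return selected


# illustrative input; in the notebook this would be `tarlist`
example_links = [
    'http://storage.googleapis.com/patents/redbook/grants/2001/20011225.ZIP',
    'http://storage.googleapis.com/patents/redbook/grants/2005/I20050111.ZIP',
]
files_to_download_example = select_links(example_links, [2001])
```

With the scraped list in hand, `select_links(tarlist, [2005, 2006])` would produce a list in the same format as `files_to_download`.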

#### From this point onward, you may wish to run all remaining cells. (EDIT: Beware that this also runs the Compact Version at the end of the notebook.)

In [None]:
#download selected files
try:
    from urllib.request import urlretrieve  #Python 3
except ImportError:
    from urllib import urlretrieve  #Python 2

for downloadfile in files_to_download:

    #URLs look like .../grants/<year>/<week file>, so split on '/' rather
    #than relying on fixed character offsets
    year = downloadfile.split('/')[-2]
    week = downloadfile.split('/')[-1]     #e.g. 20011225.ZIP
    week2 = os.path.splitext(week)[0]      #e.g. 20011225

    if os.path.exists(directory_path+'/'+year+'/'+week):
        print (directory_path+'/'+year+'/'+week+' already exists.')
    else:
        if not os.path.exists(directory_path+'/'+year):
            os.makedirs(directory_path+'/'+year)
        print ('Downloading '+week+'...')
        urlretrieve(downloadfile, directory_path+'/'+year+'/'+week)
        print ('Downloading '+week+' complete.')

#### Output Format

The following list determines the filetypes that are kept by the unpacker.

In [None]:
exts = ['.xml', '.sgm']

Also indicate whether you want to keep patent supplements (those which are in .xml format), design patents, DTDS, Plant and Reissue folders.

##### WARNING: Currently only supports FALSE for everything

In [None]:
keep_supplements = False
keep_Design = False
keep_DTDS = False
keep_Plant = False
keep_Reissue = False
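
To make the link between these flags and the folders they control explicit, the following is a minimal sketch of a lookup. The folder-name markers are the ones used in the commented-out cleaning cell later in this notebook; the names `FOLDER_MARKERS` and `is_unwanted` are hypothetical and not part of the original code.

```python
# flag name -> substring that identifies the corresponding folder
# (markers copied from the commented-out cleaning cell in this notebook)
FOLDER_MARKERS = {
    'keep_Design': 'DESIGN',
    'keep_Plant': 'PLANT',
    'keep_Reissue': 'REISSUE',
    'keep_DTDS': 'DTDS',
    'keep_supplements': 'SUPP',
}


def is_unwanted(dirname, flags):
    """Return True if `dirname` matches a marker whose keep flag is False."""
    for flag_name, marker in FOLDER_MARKERS.items():
        # a missing flag is treated as False, i.e. the folder is not kept
        if marker in dirname and not flags.get(flag_name, False):
            return True
    return False


# the only configuration currently supported by the notebook: keep nothing extra
all_false = {name: False for name in FOLDER_MARKERS}
```

A cleaning loop could then call `is_unwanted(currentdir, all_false)` once per folder instead of repeating one `if` block per flag.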

<a id='Main Code'></a>
## Code

In [None]:
import tarfile, zipfile, shutil, time
from unipath import Path

In [None]:
#function that displays any directory tree
def list_files(startpath):
    for root, dirs, files in os.walk(startpath):
        level = root.replace(startpath, '').count(os.sep)
        indent = ' ' * 4 * (level)
        print('{}{}/'.format(indent, os.path.basename(root)))
        subindent = ' ' * 4 * (level + 1)
        for f in files:
            print('{}{}'.format(subindent, f))

#calculates the size of a directory
def get_size(start_path):
    total_size = 0
    for dirpath, dirnames, filenames in os.walk(start_path):
        for f in filenames:
            fp = os.path.join(dirpath, f)
            total_size += os.path.getsize(fp)
    print ('The directory has a size of '+str(total_size/1000000)+' MB.')

#confirm patent directory
list_files(directory_path)
get_size(directory_path)
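
`get_size` prints its result, which is convenient for inspection but awkward for automated checks such as comparing the size before and after cleaning. A possible variant that returns the value is sketched below; the name `directory_size_mb` is hypothetical.

```python
import os


def directory_size_mb(start_path):
    """Return the total size of all files under start_path in MB (10**6 bytes)."""
    total_size = 0
    for dirpath, dirnames, filenames in os.walk(start_path):
        for f in filenames:
            total_size += os.path.getsize(os.path.join(dirpath, f))
    # divide by a float so the result is not truncated under Python 2
    return total_size / 1000000.0
```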

![Image](Data/Images/2.1 Directory Sample.png?raw=true)

<a id='Extraction'></a>
#### Before Proceeding
If the tree above does not mirror the intended structure, please make sure you provided the correct directory path, or double check the directory on your local machine. As a reminder, it should be sorted in subfolders corresponding to the respective year.

The subsequent code may take up to several hours to run. Even though there are further checks along the way, it is vital that the information up to this point is in order.

In [None]:
%%time

#unzip everything
if not os.path.exists(directory_path+'/Unpacked Data'):
    os.makedirs(directory_path+'/Unpacked Data')
for root, dirs, files in os.walk(directory_path, topdown=True):
    #skip the output folder (exact match rather than a substring test)
    dirs[:] = [d for d in dirs if d != 'Unpacked Data']
    for currentdir in dirs:
        print ('Creating unpacked '+currentdir+' directory.')
        newdir_year = (directory_path+'/Unpacked Data/'+currentdir)
        if not os.path.exists(newdir_year):
            os.makedirs(newdir_year)
        #separate names so the outer loop variables are not overwritten
        for yearroot, yeardirs, yearfiles in os.walk(directory_path+'/'+currentdir):
            for currentfile in yearfiles:
                print ('Extracting '+currentfile+'...')
                archive = directory_path+'/'+currentdir+'/'+currentfile
                newfile = os.path.splitext(currentfile)[0]
                newdir_week = (newdir_year+'/'+newfile)
                if not os.path.exists(newdir_week):
                    os.makedirs(newdir_week)
                    if '.tar' in currentfile:
                        #do not rebind the name tarfile, which would shadow the module
                        with tarfile.open(name=archive, mode='r') as tf:
                            tf.extractall(path=newdir_week)
                    elif '.ZIP' in currentfile:
                        with zipfile.ZipFile(archive) as zf:
                            zf.extractall(path=newdir_week)
                    else:
                        print ('File type not recognized.')
                    print ('Extraction of '+currentfile+' complete.')
                else:
                    print ('Extracted '+currentfile+' already exists.')
    #the year folders sit directly under directory_path, so one level is enough
    break
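
The cell above chooses between `tarfile` and `zipfile` based on the file name. As an alternative sketch, the archive type can be detected from the file contents with `zipfile.is_zipfile` and `tarfile.is_tarfile`, which does not depend on the extension or its capitalization. The function name `extract_archive` is hypothetical and this is not the approach used in the notebook.

```python
import os
import tarfile
import zipfile


def extract_archive(archive_path, dest_dir):
    """Extract a .zip or .tar archive into dest_dir and return the kind handled."""
    if not os.path.exists(dest_dir):
        os.makedirs(dest_dir)
    if zipfile.is_zipfile(archive_path):
        with zipfile.ZipFile(archive_path) as zf:
            zf.extractall(dest_dir)
        return 'zip'
    if tarfile.is_tarfile(archive_path):
        with tarfile.open(archive_path, mode='r') as tf:
            tf.extractall(dest_dir)
        return 'tar'
    raise ValueError('File type not recognized: ' + archive_path)
```

Both archives are opened in `with` blocks, so the file handles are released before any later step tries to delete the archive.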

Confirming the high-level output tree:

In [None]:
# list_files(directory_path+'/Unpacked Data')
# get_size(directory_path+'/Unpacked Data')

Based on your preferences in 'Path Specification', the following code will delete unwanted data. If something goes wrong in the subsequent lines of code, please return to 'Before Proceeding' to reprocess the raw .tar file.

<a id='Cleaning'></a>
#### Current Configuration (v. 3) automatically deletes all unwanted folders.

In [None]:
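#delete all non-text data and supplements
for root, dirs, files in os.walk(directory_path+'/Unpacked Data/'+year+'/'+week2+'/'+week2):
    #iterate over a copy because dirs is modified inside the loop
    for currentdir in list(dirs):
        print (currentdir)
        if 'UTIL' not in currentdir:
            shutil.rmtree(root+'/'+currentdir)
            print (root+'/'+currentdir+' deleted')
            #prune the entry so os.walk does not try to descend into it
            dirs.remove(currentdir)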

In [None]:
# for root, dirs, files in os.walk(directory_path+'/Unpacked Data'):
#     for currentdir in dirs:
#         if (keep_Design == False and 'DESIGN' in currentdir):
#             shutil.rmtree(root+'/'+currentdir)
#             print (root+'/'+currentdir+' deleted.')
#         if (keep_Plant == False and 'PLANT' in currentdir):
#             shutil.rmtree(root+'/'+currentdir)
#             print (root+'/'+currentdir+' deleted.')
#         if (keep_Reissue == False and 'REISSUE' in currentdir):
#             shutil.rmtree(root+'/'+currentdir)
#             print (root+'/'+currentdir+' deleted.')
#         if (keep_DTDS == False and 'DTDS' in currentdir):
#             shutil.rmtree(root+'/'+currentdir)
#             print (root+'/'+currentdir+' deleted.')
#         if (keep_supplements == False and 'SUPP' in currentdir):
#             shutil.rmtree(root+'/'+currentdir)
#             print (root+'/'+currentdir+' deleted.')

In [None]:
%%time

#unpack individual patents
for root, dirs, files in os.walk(directory_path+'/Unpacked Data'):
    for currentfile in files:
        #the with block closes the archive before it is removed
        with zipfile.ZipFile(root+'/'+currentfile) as z:
            z.extractall(root)
        newfile = os.path.splitext(currentfile)[0]
        os.remove(root+'/'+currentfile)
        #move the extracted folder two levels up
        shutil.move(root+'/'+newfile, Path((Path(root).parent)).parent)

In [None]:
%%time

#clean up tree structure
for root, dirs, files in os.walk(directory_path+'/Unpacked Data'):
    for currentfile in files:
        #wanted file types are copied one level up; every file is then
        #removed from its original location
        if any(currentfile.lower().endswith(ext) for ext in exts):
            shutil.copy(root+'/'+currentfile, Path(root).parent)
        os.remove(root+'/'+currentfile)

In [None]:
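%%time

#delete all anomalies (empty folders)
#walk bottom-up so a folder that only contained empty folders is removed too
for root, dirs, files in os.walk(directory_path+'/Unpacked Data', topdown=False):
    for currentdir in dirs:
        if not os.listdir(root+'/'+currentdir):
            os.rmdir(root+'/'+currentdir)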

Based on your preferences in 'Output Format', the following code will delete unwanted data.

In [None]:
%%time

#supplements are identified by more than one '-' in the file name
if not keep_supplements:
    for root, dirs, files in os.walk(directory_path+'/Unpacked Data'):
        for currentfile in files:
            if currentfile.count('-') > 1:
                os.remove(root+'/'+currentfile)

#remove folders left empty, walking bottom-up to catch nested empty folders
for root, dirs, files in os.walk(directory_path+'/Unpacked Data', topdown=False):
    for currentdir in dirs:
        if not os.listdir(root+'/'+currentdir):
            os.rmdir(root+'/'+currentdir)
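
Taken together, the notebook applies two rules to each file: the extension must be listed in `exts`, and a name with more than one `-` is treated as a supplement. The sketch below combines both rules into a single predicate, which makes them easier to check in isolation. The function name `should_keep` is hypothetical, and the file names in the usage note are made up for illustration rather than taken from the data.

```python
def should_keep(filename, exts, keep_supplements):
    """Apply the notebook's two file rules to a single file name."""
    # rule 1: only the listed extensions survive the tree clean-up
    if not any(filename.lower().endswith(ext) for ext in exts):
        return False
    # rule 2: names with more than one '-' are treated as supplements
    if not keep_supplements and filename.count('-') > 1:
        return False
    return True
```

For instance, with `exts = ['.xml', '.sgm']` and `keep_supplements = False`, a hypothetical name such as `US0000001-20050111.XML` is kept, while `US0000001-20050111-S00001.XML` and `US0000001-20050111.TIF` are not.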

View the result below:

In [None]:
list_files(directory_path+'/Unpacked Data')
get_size(directory_path+'/Unpacked Data')

In [None]:
shutil.make_archive(directory_path+'/'+week2, 'zip', directory_path+'/Unpacked Data/'+year+'/'+week2)

<a id='Uploading'></a>
### Uploading to S3

In [None]:
import boto3
s3 = boto3.resource('s3')

In [None]:
#create bucket to upload to
s3.create_bucket(Bucket='ckpatents', CreateBucketConfiguration={
    'LocationConstraint': 'us-west-1'})

In [None]:
bucket = s3.Bucket('ckpatents')

In [None]:
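%%time
#upload to S3
#open the archive in a with block so the file handle is closed after the upload
with open(directory_path+'/'+week2+'.zip', 'rb') as archive:
    s3.Object('ckpatents', year+'/'+week2+'.zip').put(Body=archive)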
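
The upload step can also be wrapped in a small function so that the object key, which follows the Year/Week layout, is built in one place. The sketch below uses `Bucket.upload_file(Filename, Key)` from boto3 instead of `put`; the function names `s3_key` and `upload_week` are hypothetical, and the `bucket` argument is assumed to be a boto3 `Bucket` resource such as `s3.Bucket('ckpatents')`.

```python
def s3_key(year, week2):
    """Object key used by the notebook: <year>/<week>.zip."""
    return year + '/' + week2 + '.zip'


def upload_week(bucket, directory_path, year, week2):
    """Upload one week's archive and return the key it was stored under."""
    local_zip = directory_path + '/' + week2 + '.zip'
    key = s3_key(year, week2)
    # upload_file reads the file from disk itself, so no open() call is needed
    bucket.upload_file(local_zip, key)
    return key
```

In the notebook this would be called as `upload_week(bucket, directory_path, year, week2)` once the archive has been created.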

In [None]:
#set bucket attributes
bucket.Acl().put(ACL='public-read')

![Image](Data/Images/2.1 S3.png?raw=true)

#### Deleting folders

In [None]:
#for SSD purposes, deletes all downloaded and unpacked files
#use directory_path so a customized location is cleaned up as well
shutil.rmtree(directory_path, ignore_errors=True)
print (directory_path+' deleted')

<a id='Compact Version'></a>
# Compact Version

The following code mirrors the format above, except that weeks are downloaded, unpacked, and uploaded one at a time. This is useful for running on a machine with little storage space (e.g. a local machine's SSD). The code can run in the background.
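
Since the compact version exists to stay within limited storage, one possible safeguard is to check the free space before each week is downloaded. The sketch below is not part of the original code: it assumes Python 3.3 or later, where `shutil.disk_usage` is available, and the function name `has_room` and the threshold in the usage note are hypothetical.

```python
import shutil


def has_room(path, required_bytes):
    """Return True if the filesystem holding `path` has at least required_bytes free."""
    free_bytes = shutil.disk_usage(path).free
    return free_bytes >= required_bytes
```

A loop over the weeks could, for example, stop early when `has_room('.', 20 * 10**9)` is false; the right threshold depends on the size of the largest weekly archive after extraction.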

In [None]:
import os

In [None]:
directory_path = 'Google Patent Data'

if not os.path.exists(directory_path):
    os.makedirs(directory_path)

In [None]:
from bs4 import BeautifulSoup, SoupStrainer
import requests, urllib

In [None]:
r = requests.get('https://www.google.com/googlebooks/uspto-patents-redbook.html')
soup = BeautifulSoup(r.text, 'html.parser').find_all('a')
tarlist = []
for link in soup:
    href = link.get('href', '')
    #keep .tar archives and .ZIP archives that are not supplements
    if '.tar' in href or ('.ZIP' in href and 'SUPP' not in href):
        print ('\''+href+'\''+',')
        tarlist.append(href)

In [None]:
exts = ['.xml', '.sgm']
keep_supplements = False
keep_Design = False
keep_DTDS = False
keep_Plant = False
keep_Reissue = False
import tarfile, zipfile, shutil, time
from unipath import Path
import boto3
s3 = boto3.resource('s3')
bucket = s3.Bucket('ckpatents')

In [None]:
files_to_download = ['http://storage.googleapis.com/patents/redbook/grants/2005/I20050111.ZIP',
'http://storage.googleapis.com/patents/redbook/grants/2005/I20050118.ZIP',
'http://storage.googleapis.com/patents/redbook/grants/2005/I20050125.ZIP',
'http://storage.googleapis.com/patents/redbook/grants/2005/I20050201.ZIP',
'http://storage.googleapis.com/patents/redbook/grants/2005/I20050208.ZIP',
'http://storage.googleapis.com/patents/redbook/grants/2005/I20050215.ZIP',
'http://storage.googleapis.com/patents/redbook/grants/2005/I20050222.ZIP',
'http://storage.googleapis.com/patents/redbook/grants/2005/I20050301.ZIP',
'http://storage.googleapis.com/patents/redbook/grants/2005/I20050308.ZIP',
'http://storage.googleapis.com/patents/redbook/grants/2005/I20050315.ZIP',
'http://storage.googleapis.com/patents/redbook/grants/2005/I20050322.ZIP',
'http://storage.googleapis.com/patents/redbook/grants/2005/I20050329.ZIP',
'http://storage.googleapis.com/patents/redbook/grants/2005/I20050405.ZIP',
'http://storage.googleapis.com/patents/redbook/grants/2005/I20050412.ZIP',
'http://storage.googleapis.com/patents/redbook/grants/2005/I20050419.ZIP',
'http://storage.googleapis.com/patents/redbook/grants/2005/I20050426.ZIP',
'http://storage.googleapis.com/patents/redbook/grants/2005/I20050503.ZIP',
'http://storage.googleapis.com/patents/redbook/grants/2005/I20050510.ZIP',
'http://storage.googleapis.com/patents/redbook/grants/2005/I20050517.ZIP',
'http://storage.googleapis.com/patents/redbook/grants/2005/I20050524.ZIP',
'http://storage.googleapis.com/patents/redbook/grants/2005/I20050531.ZIP',
'http://storage.googleapis.com/patents/redbook/grants/2005/I20050607.ZIP',
'http://storage.googleapis.com/patents/redbook/grants/2005/I20050614.ZIP',
'http://storage.googleapis.com/patents/redbook/grants/2005/I20050621.ZIP',
'http://storage.googleapis.com/patents/redbook/grants/2005/I20050628.ZIP',
'http://storage.googleapis.com/patents/redbook/grants/2005/I20050705.ZIP',
'http://storage.googleapis.com/patents/redbook/grants/2005/I20050712.ZIP',
'http://storage.googleapis.com/patents/redbook/grants/2005/I20050719.ZIP',
'http://storage.googleapis.com/patents/redbook/grants/2005/I20050726.ZIP',
'http://storage.googleapis.com/patents/redbook/grants/2005/I20050802.ZIP',
'http://storage.googleapis.com/patents/redbook/grants/2005/I20050809.ZIP',
'http://storage.googleapis.com/patents/redbook/grants/2005/I20050816.ZIP',
'http://storage.googleapis.com/patents/redbook/grants/2005/I20050823.ZIP',
'http://storage.googleapis.com/patents/redbook/grants/2005/I20050830.ZIP',
'http://storage.googleapis.com/patents/redbook/grants/2005/I20050906.ZIP',
'http://storage.googleapis.com/patents/redbook/grants/2005/I20050913.ZIP',
'http://storage.googleapis.com/patents/redbook/grants/2005/I20050920.ZIP',
'http://storage.googleapis.com/patents/redbook/grants/2005/I20050927.ZIP',
'http://storage.googleapis.com/patents/redbook/grants/2005/I20051004.ZIP',
'http://storage.googleapis.com/patents/redbook/grants/2005/I20051011.ZIP',
'http://storage.googleapis.com/patents/redbook/grants/2005/I20051018.ZIP',
'http://storage.googleapis.com/patents/redbook/grants/2005/I20051025.ZIP',
'http://storage.googleapis.com/patents/redbook/grants/2005/I20051101.ZIP',
'http://storage.googleapis.com/patents/redbook/grants/2005/I20051108.ZIP',
'http://storage.googleapis.com/patents/redbook/grants/2005/I20051115.ZIP',
'http://storage.googleapis.com/patents/redbook/grants/2005/I20051122.ZIP',
'http://storage.googleapis.com/patents/redbook/grants/2005/I20051129.ZIP',
'http://storage.googleapis.com/patents/redbook/grants/2005/I20051206.ZIP',
'http://storage.googleapis.com/patents/redbook/grants/2005/I20051213.ZIP',
'http://storage.googleapis.com/patents/redbook/grants/2005/I20051220.ZIP',
'http://storage.googleapis.com/patents/redbook/grants/2005/I20051227.ZIP',
'http://storage.googleapis.com/patents/redbook/grants/2006/I20060221.ZIP',
'http://storage.googleapis.com/patents/redbook/grants/2006/I20060228.ZIP',
'http://storage.googleapis.com/patents/redbook/grants/2006/I20060307.ZIP',
'http://storage.googleapis.com/patents/redbook/grants/2006/I20060314.ZIP',
'http://storage.googleapis.com/patents/redbook/grants/2006/I20060321.ZIP',
'http://storage.googleapis.com/patents/redbook/grants/2006/I20060328.ZIP',
'http://storage.googleapis.com/patents/redbook/grants/2006/I20060404.ZIP',
'http://storage.googleapis.com/patents/redbook/grants/2006/I20060411.ZIP',
'http://storage.googleapis.com/patents/redbook/grants/2006/I20060418.ZIP',
'http://storage.googleapis.com/patents/redbook/grants/2006/I20060425.ZIP',
'http://storage.googleapis.com/patents/redbook/grants/2006/I20060502.ZIP',
'http://storage.googleapis.com/patents/redbook/grants/2006/I20060509.ZIP',
'http://storage.googleapis.com/patents/redbook/grants/2006/I20060516.ZIP',
'http://storage.googleapis.com/patents/redbook/grants/2006/I20060523.ZIP',
'http://storage.googleapis.com/patents/redbook/grants/2006/I20060530.ZIP',
'http://storage.googleapis.com/patents/redbook/grants/2006/I20060606.ZIP',
'http://storage.googleapis.com/patents/redbook/grants/2006/I20060613.ZIP',
'http://storage.googleapis.com/patents/redbook/grants/2006/I20060620.ZIP',
'http://storage.googleapis.com/patents/redbook/grants/2006/I20060627.ZIP',
'http://storage.googleapis.com/patents/redbook/grants/2006/I20060704.ZIP',
'http://storage.googleapis.com/patents/redbook/grants/2006/I20060711.ZIP',
'http://storage.googleapis.com/patents/redbook/grants/2006/I20060718.ZIP',
'http://storage.googleapis.com/patents/redbook/grants/2006/I20060725.ZIP',
'http://storage.googleapis.com/patents/redbook/grants/2006/I20060801.ZIP',
'http://storage.googleapis.com/patents/redbook/grants/2006/I20060808.ZIP',
'http://storage.googleapis.com/patents/redbook/grants/2006/I20060815.ZIP',
'http://storage.googleapis.com/patents/redbook/grants/2006/I20060822.ZIP',
'http://storage.googleapis.com/patents/redbook/grants/2006/I20060829.ZIP',
'http://storage.googleapis.com/patents/redbook/grants/2006/I20060905.ZIP',
'http://storage.googleapis.com/patents/redbook/grants/2006/I20060912.ZIP',
'http://storage.googleapis.com/patents/redbook/grants/2006/I20060919.ZIP',
'http://storage.googleapis.com/patents/redbook/grants/2006/I20060926.ZIP',
'http://storage.googleapis.com/patents/redbook/grants/2006/I20061003.ZIP',
'http://storage.googleapis.com/patents/redbook/grants/2006/I20061010.ZIP',
'http://storage.googleapis.com/patents/redbook/grants/2006/I20061017.ZIP',
'http://storage.googleapis.com/patents/redbook/grants/2006/I20061024.ZIP',
'http://storage.googleapis.com/patents/redbook/grants/2006/I20061031.ZIP',
'http://storage.googleapis.com/patents/redbook/grants/2006/I20061107.ZIP',
'http://storage.googleapis.com/patents/redbook/grants/2006/I20061114.ZIP',
'http://storage.googleapis.com/patents/redbook/grants/2006/I20061121.ZIP',
'http://storage.googleapis.com/patents/redbook/grants/2006/I20061128.ZIP',
'http://storage.googleapis.com/patents/redbook/grants/2006/I20061205.ZIP',
'http://storage.googleapis.com/patents/redbook/grants/2006/I20061212.ZIP',
'http://storage.googleapis.com/patents/redbook/grants/2006/I20061219.tar',
'http://storage.googleapis.com/patents/redbook/grants/2006/I20061219.ZIP',
'http://storage.googleapis.com/patents/redbook/grants/2006/I20061226.ZIP']

In [None]:
try:
    from urllib.request import urlretrieve  #Python 3
except ImportError:
    from urllib import urlretrieve  #Python 2

#function that displays any directory tree
def list_files(startpath):
    for root, dirs, files in os.walk(startpath):
        level = root.replace(startpath, '').count(os.sep)
        indent = ' ' * 4 * (level)
        print('{}{}/'.format(indent, os.path.basename(root)))
        subindent = ' ' * 4 * (level + 1)
        for f in files:
            print('{}{}'.format(subindent, f))

#calculates the size of a directory
def get_size(start_path):
    total_size = 0
    for dirpath, dirnames, filenames in os.walk(start_path):
        for f in filenames:
            fp = os.path.join(dirpath, f)
            total_size += os.path.getsize(fp)
    print ('The directory has a size of '+str(total_size/1000000)+' MB.')

for downloadfile in files_to_download:

    #URLs look like .../grants/<year>/<week file>, so split on '/' rather
    #than relying on fixed character offsets
    year = downloadfile.split('/')[-2]
    week = downloadfile.split('/')[-1]     #e.g. I20050111.ZIP
    week2 = os.path.splitext(week)[0]      #e.g. I20050111
    print (week+' '+week2)
    if os.path.exists(directory_path+'/'+year+'/'+week):
        print (directory_path+'/'+year+'/'+week+' already exists.')
    else:
        if not os.path.exists(directory_path+'/'+year):
            os.makedirs(directory_path+'/'+year)
        print ('Downloading '+week+'...')
        urlretrieve(downloadfile, directory_path+'/'+year+'/'+week)
        print ('Downloading '+week+' complete.')

    #confirm patent directory
    list_files(directory_path)
    get_size(directory_path)
    
    if not os.path.exists(directory_path+'/Unpacked Data'):
        os.makedirs(directory_path+'/Unpacked Data')
    for root, dirs, files in os.walk(directory_path, topdown=True):
        #skip the output folder (exact match rather than a substring test)
        dirs[:] = [d for d in dirs if d != 'Unpacked Data']
        for currentdir in dirs:
            print ('Creating unpacked '+currentdir+' directory.')
            newdir_year = (directory_path+'/Unpacked Data/'+currentdir)
            if not os.path.exists(newdir_year):
                os.makedirs(newdir_year)
            #separate names so the outer loop variables are not overwritten
            for yearroot, yeardirs, yearfiles in os.walk(directory_path+'/'+currentdir):
                for currentfile in yearfiles:
                    print ('Extracting '+currentfile+'...')
                    archive = directory_path+'/'+currentdir+'/'+currentfile
                    newfile = os.path.splitext(currentfile)[0]
                    newdir_week = (newdir_year+'/'+newfile)
                    if not os.path.exists(newdir_week):
                        os.makedirs(newdir_week)
                        if '.tar' in currentfile:
                            #do not rebind the name tarfile, which would shadow the module
                            with tarfile.open(name=archive, mode='r') as tf:
                                tf.extractall(path=newdir_week)
                        elif '.ZIP' in currentfile:
                            with zipfile.ZipFile(archive) as zf:
                                zf.extractall(path=newdir_week)
                        else:
                            print ('File type not recognized.')
                        print ('Extraction of '+currentfile+' complete.')
                    else:
                        print ('Extracted '+currentfile+' already exists.')
        #the year folders sit directly under directory_path, so one level is enough
        break
                        
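    #delete all non-text data and supplements
    for root, dirs, files in os.walk(directory_path+'/Unpacked Data/'+year+'/'+week2+'/'+week2):
        #iterate over a copy because dirs is modified inside the loop
        for currentdir in list(dirs):
            print (currentdir)
            if 'UTIL' not in currentdir:
                shutil.rmtree(root+'/'+currentdir)
                print (root+'/'+currentdir+' deleted')
                #prune the entry so os.walk does not try to descend into it
                dirs.remove(currentdir)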
                    
    #unpack individual patents
    for root, dirs, files in os.walk(directory_path+'/Unpacked Data'):
        for currentfile in files:
            #the with block closes the archive before it is removed
            with zipfile.ZipFile(root+'/'+currentfile) as z:
                z.extractall(root)
            newfile = os.path.splitext(currentfile)[0]
            os.remove(root+'/'+currentfile)
            #move the extracted folder two levels up
            shutil.move(root+'/'+newfile, Path((Path(root).parent)).parent)

    #clean up tree structure
    for root, dirs, files in os.walk(directory_path+'/Unpacked Data'):
        for currentfile in files:
            #wanted file types are copied one level up; every file is then
            #removed from its original location
            if any(currentfile.lower().endswith(ext) for ext in exts):
                shutil.copy(root+'/'+currentfile, Path(root).parent)
            os.remove(root+'/'+currentfile)
                
    #delete empty folders, walking bottom-up to catch nested empty folders
    for root, dirs, files in os.walk(directory_path+'/Unpacked Data', topdown=False):
        for currentdir in dirs:
            if not os.listdir(root+'/'+currentdir):
                os.rmdir(root+'/'+currentdir)

    #supplements are identified by more than one '-' in the file name
    if not keep_supplements:
        for root, dirs, files in os.walk(directory_path+'/Unpacked Data'):
            for currentfile in files:
                if currentfile.count('-') > 1:
                    os.remove(root+'/'+currentfile)

    for root, dirs, files in os.walk(directory_path+'/Unpacked Data', topdown=False):
        for currentdir in dirs:
            if not os.listdir(root+'/'+currentdir):
                os.rmdir(root+'/'+currentdir)
                
    shutil.make_archive(directory_path+'/'+week2, 'zip', directory_path+'/Unpacked Data/'+year+'/'+week2)
    
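    #open the archive in a with block so the handle is closed before deletion
    with open(directory_path+'/'+week2+'.zip', 'rb') as archive:
        s3.Object('ckpatents', year+'/'+week2+'.zip').put(Body=archive)

    #use directory_path so a customized location is cleaned up as well
    shutil.rmtree(directory_path, ignore_errors=True)
    print (directory_path+' deleted')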