# Download and extract preprints from Google Drive

Contents:
1. Introduction
2. Set up the connection 
3. Set up download and extraction
4. Download and extract preprints
5. Addendum: Download all files

## 1. Introduction

This notebook downloads .tar files that are stored in my Google Drive. These .tar files come from arXiv's source bucket on Amazon S3.

Parts of this notebook can be reused to access these .tars, but its primary purpose was to fix errors in how my other notebook processed the preprints during its S3 download. There were two main problems. First, arXiv identifiers follow more than one scheme, which the earlier processing did not account for. Second, I had relied on "astro-ph" appearing in the submission filename to decide whether an article belongs to the astro-ph category; I now query the metadata through OAI2 and use that metadata instead. [EDIT: I no longer have the files on Google Drive. See 5. Addendum.]
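
The two identifier schemes can be told apart with a pair of regular expressions. The sketch below is illustrative only and is not necessarily the parsing used elsewhere in this project: `parse_identifier` and both patterns are hypothetical names, and it assumes old-style identifiers appear without a slash, as they do in the source filenames (e.g. `quant-ph0007104`). New-style identifiers carry no archive name at all, which is why the category has to come from metadata.

```python
import re

# New scheme: YYMM.NNNN (4 or 5 digits after the dot)
NEW_STYLE = re.compile(r'^(\d{4})\.(\d{4,5})$')
# Old scheme as it appears in filenames: archive name, then YYMMNNN, no slash
OLD_STYLE = re.compile(r'^([a-z-]+)(\d{7})$')


def parse_identifier(name):
    """Return (scheme, archive, yymm, number); archive is None for new-style ids."""
    match = NEW_STYLE.match(name)
    if match:
        return ('new', None, match.group(1), match.group(2))
    match = OLD_STYLE.match(name)
    if match:
        digits = match.group(2)
        return ('old', match.group(1), digits[:4], digits[4:])
    return ('unknown', None, None, None)


print(parse_identifier('0704.0009'))
print(parse_identifier('quant-ph0007104'))
```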

**REQUIRES** a `client_secrets.json` file for an installed application (see [help docs](https://developers.google.com/api-client-library/dotnet/guide/aaa_client_secrets)), which looks like

```
{
  "installed": {
    "client_id": "837647042410-75ifg...usercontent.com",
    "client_secret":"asdlkfjaskd",
    "redirect_uris": ["http://localhost", "urn:ietf:wg:oauth:2.0:oob"],
    "auth_uri": "https://accounts.google.com/o/oauth2/auth",
    "token_uri": "https://accounts.google.com/o/oauth2/token"
  }
}
```

Mine is in the local repo but not on GitHub. 
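
Because the file is not tracked on GitHub, a quick check before authenticating can save a confusing failure later. This is a minimal sketch, assuming the file sits in the working directory; `check_client_secrets` is a hypothetical helper, and the list of required keys is simply taken from the example above rather than from any official schema.

```python
import json
import os

# Keys copied from the example file above (an assumption, not an official schema)
REQUIRED_KEYS = ('client_id', 'client_secret', 'redirect_uris', 'auth_uri', 'token_uri')


def check_client_secrets(path='client_secrets.json'):
    """Return a list of problems found with the secrets file (empty if none)."""
    if not os.path.isfile(path):
        return ['file not found: ' + path]
    with open(path) as f:
        try:
            secrets = json.load(f)
        except json.JSONDecodeError as err:
            return ['invalid JSON: ' + str(err)]
    installed = secrets.get('installed') if isinstance(secrets, dict) else None
    if not isinstance(installed, dict):
        return ['missing top-level "installed" section']
    return ['missing key: ' + key for key in REQUIRED_KEYS if key not in installed]


print(check_client_secrets())
```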

## 2. Set up the connection 

Import dependencies:

In [1]:
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
import os, gzip, shutil, tarfile, pandas as pd

Connect to Google Drive:

In [2]:
def connect_to_google_drive():
    g_login = GoogleAuth()
    g_login.LocalWebserverAuth()
    drive = GoogleDrive(g_login)
    return drive

In [3]:
drive = connect_to_google_drive()

Your browser has been opened to visit:

    https://accounts.google.com/o/oauth2/auth?client_id=205689913441-4qvumj04tvu7o2h0j1cth62qhp0ck9ld.apps.googleusercontent.com&redirect_uri=http%3A%2F%2Flocalhost%3A8080%2F&scope=https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fdrive&access_type=offline&response_type=code

Authentication successful.


All the .tars are contained within the arxiv folder on my Google Drive. 

The ID of the arxiv folder:

In [5]:
query = "'root' in parents and trashed=false and title='arxiv' and mimeType='application/vnd.google-apps.folder'"
arxiv_folder_id = drive.ListFile({'q': query}).GetList()[0].metadata['id']
print(arxiv_folder_id)

1f2WO6FlQhT3NyyfuBL6UEvkX1RX3cO_6


List the files in the arxiv folder:

In [6]:
uploaded_tars_list = drive.ListFile({'q': "'" + arxiv_folder_id + "' in parents and trashed=false"}).GetList()
uploaded_tars_list = [x.metadata['title'] for x in uploaded_tars_list]
uploaded_tars_list.sort()
print('Number of uploaded tars: ' + str(len(uploaded_tars_list)))
uploaded_tars_list

Number of uploaded tars: 2158


['arXiv_src_0001_001.tar',
 'arXiv_src_0002_001.tar',
 'arXiv_src_0003_001.tar',
 'arXiv_src_0004_001.tar',
 'arXiv_src_0005_001.tar',
 'arXiv_src_0006_001.tar',
 'arXiv_src_0007_001.tar',
 'arXiv_src_0008_001.tar',
 'arXiv_src_0009_001.tar',
 'arXiv_src_0010_001.tar',
 'arXiv_src_0011_001.tar',
 'arXiv_src_0012_001.tar',
 'arXiv_src_0101_001.tar',
 'arXiv_src_0102_001.tar',
 'arXiv_src_0103_001.tar',
 'arXiv_src_0104_001.tar',
 'arXiv_src_0105_001.tar',
 'arXiv_src_0106_001.tar',
 'arXiv_src_0107_001.tar',
 'arXiv_src_0108_001.tar',
 'arXiv_src_0109_001.tar',
 'arXiv_src_0110_001.tar',
 'arXiv_src_0111_001.tar',
 'arXiv_src_0112_001.tar',
 'arXiv_src_0201_001.tar',
 'arXiv_src_0202_001.tar',
 'arXiv_src_0203_001.tar',
 'arXiv_src_0204_001.tar',
 'arXiv_src_0205_001.tar',
 'arXiv_src_0206_001.tar',
 'arXiv_src_0207_001.tar',
 'arXiv_src_0208_001.tar',
 'arXiv_src_0209_001.tar',
 'arXiv_src_0210_001.tar',
 'arXiv_src_0211_001.tar',
 'arXiv_src_0212_001.tar',
 'arXiv_src_0301_001.tar',
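
Note that an alphabetical sort places the 2000s tars (`0001`, `0002`, ...) ahead of the 1990s ones (`9xxx`). If chronological order is wanted, the `YYMM` part of the name can be turned into a sort key. A minimal sketch, assuming every title follows the `arXiv_src_YYMM_NNN.tar` pattern and that two-digit years from 91 upward belong to the 1990s; `chronological_key` is a hypothetical helper.

```python
import re

TAR_NAME = re.compile(r'^arXiv_src_(\d{2})(\d{2})_(\d{3})\.tar$')


def chronological_key(title):
    """Map 'arXiv_src_YYMM_NNN.tar' to a (year, month, sequence) tuple."""
    match = TAR_NAME.match(title)
    if match is None:
        raise ValueError('unexpected tar name: ' + title)
    yy, mm, seq = (int(group) for group in match.groups())
    # Assumption: years 91-99 are 1991-1999, everything else is 20YY
    year = 1900 + yy if yy >= 91 else 2000 + yy
    return (year, mm, seq)


titles = ['arXiv_src_0001_001.tar', 'arXiv_src_9905_001.tar', 'arXiv_src_1010_007.tar']
print(sorted(titles, key=chronological_key))
```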
 

## 3. Set up download and extraction

Get filenames for all the astro-ph preprints:

In [9]:
metadata_df = pd.read_csv('arXiv_metadata_astroph.csv', dtype={'filename': str, 'filename_parsed': str})
identifiers = metadata_df['filename_parsed']
identifiers

0               0704.0009
1               0704.0017
2               0704.0023
3               0704.0044
4               0704.0048
5               0704.0059
6               0704.0080
7               0704.0094
8               0704.0128
9               0704.0133
10              0704.0138
11              0704.0139
12              0704.0144
13              0704.0155
14              0704.0156
15              0704.0160
16              0704.0168
17              0704.0171
18              0704.0175
19              0704.0184
20              0704.0187
21              0704.0192
22              0704.0203
23              0704.0205
24              0704.0207
25              0704.0209
26              0704.0212
27              0704.0219
28              0704.0221
29              0704.0222
               ...       
250152    quant-ph0007104
250153    quant-ph0101091
250154    quant-ph0104067
250155    quant-ph0106059
250156    quant-ph0106076
250157    quant-ph0107011
250158    quant-ph0107070
250159    qu
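
When these identifiers are used to decide whether a submission should be extracted, it matters whether the comparison is an exact match or a substring match: a substring test would accept a partial name, and a regex-based test would also treat the `.` in `0704.0009` as a wildcard. One alternative is exact membership in a set, sketched below. The names `sample_identifiers` and `is_known_identifier` are hypothetical, and a plain list stands in for the `filename_parsed` column so the example is self-contained.

```python
# Stand-in for metadata_df['filename_parsed']
sample_identifiers = ['0704.0009', '0704.0017', 'quant-ph0007104']

# Building the set once makes each later lookup a constant-time operation
identifier_set = frozenset(sample_identifiers)


def is_known_identifier(name, known=identifier_set):
    """True only when name equals a known identifier exactly."""
    return name in known


print(is_known_identifier('0704.0009'))  # exact match
print(is_known_identifier('0704.000'))   # partial name, rejected
print(is_known_identifier('0704x0009'))  # would match '0704.0009' as a regex, rejected here
```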

Define functions to download a tar, as well as extract astro-ph preprints from this tar:

In [4]:
def download_file(file, title):
    '''
    Downloads given file from Google Drive.

    Parameters
    ----------
    file : pydrive.files.GoogleDriveFile
        The file to download
    title : str
        The local filepath to save the file to
    '''
    
    # Ensure the destination directory exists (e.g. 'src' for 'src/foo.tar')
    dest_dir = os.path.dirname(title)
    if dest_dir and not os.path.isdir(dest_dir):
        os.makedirs(dest_dir)
    
    # Download file
    print('Downloading ' + title + '...')
    file.GetContentFile(title) 
    print('Successfully downloaded drive://arxiv/{} to {}'.format(os.path.basename(title), title))
    
    
def extract_tar(filename):
    '''
    Extracts astro-ph preprints from given .tar.

    Parameters
    ----------
    filename : str
        Filepath of the .tar
    '''
    
    total_tex = 0
    tar_dir = 'latex/' + os.path.splitext(os.path.basename(filename))[0] + '/'
    
    # Quit file extraction if given file is not .tar
    if not tarfile.is_tarfile(filename):
        print('can\'t unzip ' + filename + ', not a .tar file')
        return

    # Create .tar directory if it doesn't exist
    if not os.path.isdir(tar_dir):
        os.makedirs(tar_dir)

    # Proceed with file extraction if .tar
    print('Opening ' + filename + '...')
    # Open .tar, read-only
    tar = tarfile.open(filename)
    # Iterate over .tar subfiles
    for subfile in tar.getmembers():
        # Open subfile only if .gz and is identified as astro-ph preprint
        # (regex=False so the '.' in new-style identifiers is matched literally)
        name = os.path.splitext(os.path.basename(subfile.name))[0]
        if subfile.name.endswith('.gz') and identifiers.str.contains(name, regex=False).any():
            # Create submission directory if it doesn't exist
            if not os.path.isdir(tar_dir + name):
                os.makedirs(tar_dir + name)
            try:
                # Open .gz, read-only
                gz_obj = tar.extractfile(subfile) 
                gz = tarfile.open(fileobj=gz_obj) 
                # Iterate over .gz subfiles
                for subsubfile in gz.getmembers():
                    # Check if current subfile is .tex or .ltx 
                    if subsubfile.name.endswith('.tex') or subsubfile.name.endswith('.ltx'):
                        # Extract the file
                        gz.extract(subsubfile, path=tar_dir + name)
                        total_tex += 1
            except tarfile.ReadError:
                # Extract the entire .gz because we cannot read it using tarfile 
                # Note that these .gzs are single .tex files with no extension specified
                tar.extract(subfile, path='temp')
                # Uncompress the .gz file using gzip instead and place it with the other .tex files
                with gzip.open('temp/' + subfile.name, 'rb') as f_in:
                    basename = os.path.splitext(os.path.basename(subfile.name))[0]
                    with open(tar_dir + name + '/' + basename + '.tex', 'wb+') as f_out:
                        shutil.copyfileobj(f_in, f_out)
                        total_tex += 1
    
    print(filename + ' extraction complete')
    print('Number of .tex files obtained: ' + str(total_tex) + '\n')
    # Delete the temporary folder for those wonky gz files
    shutil.rmtree('temp/', ignore_errors=True)
    # Close tar
    tar.close()
    
    
def process_tars(target):
    '''
    Handles the download and processing of .tars from Google Drive. 
    
    Parameters
    ----------
    target : str
        Name of the .tar to stop before: only .tars whose names sort
        before it in alphabetical order are downloaded and extracted
    '''

    print('Beginning tar download & extraction...\n')

    num_processed = 0
    query = "'root' in parents and trashed=false and title='arxiv' and mimeType='application/vnd.google-apps.folder'"
    arxiv_folder_id = drive.ListFile({'q': query}).GetList()[0].metadata['id']
    uploaded_tars_list = drive.ListFile({'q': "'" + arxiv_folder_id + "' in parents and trashed=false"}).GetList()
    for uploaded_tar in uploaded_tars_list:
        title = uploaded_tar['title']
        # Process the .tar only if its title sorts before the current target
        if title < target:
            # Download .tar
            download_file(uploaded_tar, 'src/' + title)
            # Extract astrophysics preprints from the .tar
            extract_tar('src/' + title)
            # Remove the .tar from local storage
            os.remove('src/' + title)
            # Lower the target, so later titles must sort before this one
            # (this relies on Drive listing the titles in descending order)
            target = title
            num_processed += 1
            
    print('Processed ' + str(num_processed) + ' tars')
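
`extract_tar` depends on `tarfile` raising `ReadError` when a submission is a single gzipped TeX file rather than a gzipped tar archive. The sketch below shows the same two-path logic working entirely in memory, without a `temp` directory. It is an illustration under stated assumptions, not a drop-in replacement: `read_tex_from_gz` is a hypothetical helper that takes the raw bytes of one `.gz` member and returns the TeX content it finds.

```python
import gzip
import io
import tarfile


def read_tex_from_gz(data, fallback_name):
    """Return {filename: bytes} with the .tex/.ltx content of one .gz submission.

    data is the raw bytes of the .gz member; fallback_name names the output
    when the submission is a single gzipped file instead of a tar archive.
    """
    try:
        with tarfile.open(fileobj=io.BytesIO(data)) as inner:
            return {
                member.name: inner.extractfile(member).read()
                for member in inner.getmembers()
                if member.isfile() and member.name.endswith(('.tex', '.ltx'))
            }
    except tarfile.ReadError:
        # Not a tar archive: treat the payload as one gzipped TeX file
        return {fallback_name + '.tex': gzip.decompress(data)}


# A plain gzipped file takes the fallback path
print(read_tex_from_gz(gzip.compress(b'\\documentclass{article}'), '0704.0009'))
```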

## 4. Download and extract preprints

In [None]:
process_tars('arXiv_src_1010_007.tar')

Beginning tar download & extraction...

Downloading src/arXiv_src_1010_006.tar...
Successfully downloaded drive://arxiv/arXiv_src_1010_006.tar to src/arXiv_src_1010_006.tar
Opening src/arXiv_src_1010_006.tar...
src/arXiv_src_1010_006.tar extraction complete
Number of .tex files obtained: 7

Downloading src/arXiv_src_1010_005.tar...
Successfully downloaded drive://arxiv/arXiv_src_1010_005.tar to src/arXiv_src_1010_005.tar
Opening src/arXiv_src_1010_005.tar...
src/arXiv_src_1010_005.tar extraction complete
Number of .tex files obtained: 56

Downloading src/arXiv_src_1010_004.tar...
Successfully downloaded drive://arxiv/arXiv_src_1010_004.tar to src/arXiv_src_1010_004.tar
Opening src/arXiv_src_1010_004.tar...
src/arXiv_src_1010_004.tar extraction complete
Number of .tex files obtained: 60

Downloading src/arXiv_src_1010_003.tar...
Successfully downloaded drive://arxiv/arXiv_src_1010_003.tar to src/arXiv_src_1010_003.tar
Opening src/arXiv_src_1010_003.tar...
src/arXiv_src_1010_003.tar extr
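
The log above shows the tars arriving in descending order, and `process_tars` depends on that: it compares each title with `target` and then lowers `target`, so a listing in a different order would skip files. A minimal sketch of an order-independent selection follows; `tars_before` is a hypothetical helper and the titles are sample values.

```python
def tars_before(titles, target):
    """Return the titles that sort before target, highest-sorting first."""
    return sorted((title for title in titles if title < target), reverse=True)


# Deliberately shuffled input: the result does not depend on listing order
titles = ['arXiv_src_1010_005.tar', 'arXiv_src_1010_007.tar',
          'arXiv_src_1010_006.tar', 'arXiv_src_1011_001.tar']
print(tars_before(titles, 'arXiv_src_1010_007.tar'))
```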

## 5. Addendum: Download all files

I now have room on my external hard drive to store all of the original arXiv .tar files in `../data/2020_03_03_original_arxiv_tars`, so I'm downloading all of them from Google Drive.

In [7]:
data_path = '../../data/2020_03_03_original_arxiv_tars/'
downloaded_tars = os.listdir(data_path)

def download_all_files():
    '''
    Downloads all .tars from Google Drive.
    '''

    print('Beginning download...\n')

    query = "'root' in parents and trashed=false and title='arxiv' and mimeType='application/vnd.google-apps.folder'"
    arxiv_folder_id = drive.ListFile({'q': query}).GetList()[0].metadata['id']
    uploaded_tars_list = drive.ListFile({'q': "'" + arxiv_folder_id + "' in parents and trashed=false"}).GetList()
    for uploaded_tar in uploaded_tars_list:
        title = uploaded_tar['title']
        # Download .tar
        if title not in downloaded_tars:
            download_file(uploaded_tar, data_path + title)
        else:
            print('Already downloaded ' + title + '...')
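
The "already downloaded" test above looks only at filenames, so a download that was interrupted partway would still count as done. One way to catch that is to compare local file sizes with the sizes expected on the remote side. The sketch below is a hedged illustration: `find_size_mismatches` is a hypothetical helper, and it assumes the expected sizes have been collected into a dict beforehand (with PyDrive these might come from each file's `fileSize` metadata field, which is an assumption worth verifying against your own listing).

```python
import os


def find_size_mismatches(expected_sizes, directory):
    """Return titles that are missing locally or whose local size differs.

    expected_sizes maps title -> size in bytes (int or numeric string).
    """
    mismatched = []
    for title, expected in sorted(expected_sizes.items()):
        path = os.path.join(directory, title)
        if not os.path.isfile(path) or os.path.getsize(path) != int(expected):
            mismatched.append(title)
    return mismatched
```

Any title this returns could then be removed from the list of downloaded tars so that it is fetched again.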

In [8]:
download_all_files()

Beginning download...

Already downloaded arXiv_src_9905_001.tar...
Already downloaded arXiv_src_9803_001.tar...
Already downloaded arXiv_src_9903_001.tar...
Already downloaded arXiv_src_9802_001.tar...
Already downloaded arXiv_src_9711_001.tar...
Already downloaded arXiv_src_9901_001.tar...
Already downloaded arXiv_src_9710_001.tar...
Already downloaded arXiv_src_9812_001.tar...
Already downloaded arXiv_src_9709_001.tar...
Already downloaded arXiv_src_9707_001.tar...
Already downloaded arXiv_src_9608_001.tar...
Already downloaded arXiv_src_9810_001.tar...
Already downloaded arXiv_src_9607_001.tar...
Already downloaded arXiv_src_9706_001.tar...
Already downloaded arXiv_src_9606_001.tar...
Already downloaded arXiv_src_9809_001.tar...
Already downloaded arXiv_src_9605_001.tar...
Already downloaded arXiv_src_9705_001.tar...
Already downloaded arXiv_src_9604_001.tar...
Already downloaded arXiv_src_9807_001.tar...
Already downloaded arXiv_src_9703_001.tar...
Already downloaded arXiv_src_950

Already downloaded arXiv_src_0306_001.tar...
Already downloaded arXiv_src_0305_001.tar...
Already downloaded arXiv_src_0304_001.tar...
Already downloaded arXiv_src_0303_001.tar...
Already downloaded arXiv_src_0302_001.tar...
Already downloaded arXiv_src_0301_001.tar...
Already downloaded arXiv_src_0212_001.tar...
Already downloaded arXiv_src_0211_001.tar...
Already downloaded arXiv_src_0210_001.tar...
Already downloaded arXiv_src_0209_001.tar...
Already downloaded arXiv_src_0208_001.tar...
Already downloaded arXiv_src_0207_001.tar...
Already downloaded arXiv_src_0206_001.tar...
Already downloaded arXiv_src_0205_001.tar...
Already downloaded arXiv_src_0204_001.tar...
Already downloaded arXiv_src_0203_001.tar...
Already downloaded arXiv_src_0202_001.tar...
Already downloaded arXiv_src_0201_001.tar...
Already downloaded arXiv_src_0112_001.tar...
Already downloaded arXiv_src_0111_001.tar...
Already downloaded arXiv_src_0110_001.tar...
Already downloaded arXiv_src_0109_001.tar...
Already do