# Extract MTA Data from archives

This notebook helps automate extracting data from the protobufs manually downloaded from [Historical GTFS data](http://web.mta.info/developers/data/archives.html), the latest source suggested at:
https://groups.google.com/d/msg/mtadeveloperresources/Whm5XTVINcE/z-LO12ANAAAJ

Note that another S3-hosted [historical datasource](http://web.mta.info/developers/MTA-Subway-Time-historical-data.html) is referenced on the MTA website, but it is outdated; the MTA Alert Archive above is the correct source.

NOTE: This notebook assumes that the protobufs have already been downloaded to <code>data/raw/status</code> e.g. <code>data/raw/status/201901.zip</code> from http://web.mta.info/developers/data/archives.html
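Because the archives are downloaded by hand, it is easy to miss a month. The following is a minimal sketch of a check for gaps, assuming the monthly files follow the <code>YYYYMM.zip</code> naming shown above. The helper name `missing_months` and its arguments are hypothetical and not part of this notebook.

```python
import os
import re


def missing_months(filenames, first, last):
    """Return the YYYYMM strings from first to last (inclusive) that have no archive.

    Assumes each monthly archive is named YYYYMM.zip; other files are ignored.
    """
    present = set()
    for path in filenames:
        match = re.fullmatch(r'(\d{6})\.zip', os.path.basename(path))
        if match:
            present.add(match.group(1))

    missing = []
    year, month = int(first[:4]), int(first[4:])
    while '{:04d}{:02d}'.format(year, month) <= last:
        label = '{:04d}{:02d}'.format(year, month)
        if label not in present:
            missing.append(label)
        # Advance one calendar month, rolling over at December
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return missing
```

For example, passing the result of a glob over the data directory together with the first and last month of interest returns the months that still need to be downloaded.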

In [None]:
import os
import pandas as pd
import sys
data_dir = '../data/raw/status'

In [None]:
import glob
protobuf_paths = glob.glob('{}/[0-9]*.zip'.format(data_dir))

if len(protobuf_paths) == 0:
    raise ValueError('No matching protobufs found in {}, please download from https://m.mymtaalerts.com/archive'.format(data_dir))
    
print(protobuf_paths)

### Helper cell for recursively unzipping monthly rollups
This is a bit finicky, as the layout and zip format vary from month to month, but it is a helpful tool for unzipping some of the archives.  It will fail for a handful of the monthly archives, and you will need to either modify it or handle those cases manually.  Watch out especially for <code>201812.zip</code>, as that contains <code>201812.7z</code>

Additionally, there are a small number of corrupted daily zips, so this absorbs and logs those errors.
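Before running the full extraction, it can help to see which members of a monthly archive are readable zips, which are corrupted, and which are not zips at all (such as a nested <code>.7z</code>). The sketch below uses only the standard library; the function name `classify_members` is hypothetical, and like the helper cell it reads each inner file fully into memory.

```python
import io
import zipfile


def classify_members(monthly_zip):
    """Sort the members of an open monthly ZipFile into three lists.

    'zip'     -> inner zips whose members all pass the CRC check
    'corrupt' -> inner zips that cannot be opened or have a bad member
    'other'   -> members that are not zip files (for example a nested .7z)
    """
    report = {'zip': [], 'corrupt': [], 'other': []}
    for name in monthly_zip.namelist():
        if name.endswith('/'):
            continue  # directory entry, nothing to inspect
        buffer = io.BytesIO(monthly_zip.read(name))
        if not zipfile.is_zipfile(buffer):
            report['other'].append(name)
            continue
        try:
            with zipfile.ZipFile(buffer) as inner:
                # testzip() returns the first member with a bad CRC, or None
                bad_member = inner.testzip()
        except zipfile.BadZipFile:
            bad_member = name
        report['corrupt' if bad_member else 'zip'].append(name)
    return report
```

Calling it on an opened monthly archive, e.g. `classify_members(zipfile.ZipFile(monthly_file))`, gives a quick picture of which entries will need manual handling.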

In [None]:
import zipfile
import shutil
import progressbar
import io

# Keep a list of files with failed extractions
failed_files = os.path.join(data_dir, 'failures.txt')

force = False

# unzip monthly rollups, then unzip the daily files inside
# This code is largely copied from: https://stackoverflow.com/questions/36285502/how-to-extract-zip-file-recursively-in-python
# The daily zipfiles are ~1GB, so there are big speed gains from unzipping in memory
#for monthly_file in protobuf_paths[-1:]:
for monthly_file in ['../data/raw/status/201906.zip',]:
    widgets = [progressbar.Percentage(), progressbar.Bar(), progressbar.Variable('failures')]    

    
    print("Extracting: " + monthly_file)
    z = zipfile.ZipFile(monthly_file)
    for i,f in enumerate(z.namelist()):
        print("{}/{}".format(i+1, len(z.namelist())))
        # get directory name from file
        dirname = os.path.join(data_dir, os.path.splitext(f)[0])
        # create new directory
        os.makedirs(dirname, exist_ok=True)
        # read inner zip file into bytes buffer 
        content = io.BytesIO(z.read(f))
        zip_file = zipfile.ZipFile(content)
        
        # Skip if already unzipped
        if not force:
            if len(glob.glob(dirname+'/*')) == len(zip_file.namelist()):
                print("Skipping " + os.path.basename(dirname))
                continue
         
        # Iterate through in-memory zipfile, dumping sub-minutely protobufs into daily directories
        bar = progressbar.ProgressBar(widgets=widgets, max_value=len(zip_file.namelist()), min_poll_interval=.5).start()
        failures = 0
        for j,f2 in enumerate(zip_file.namelist()):
            try:
                zip_file.extract(f2, dirname)
            except Exception:
                # At the moment, some files sporadically fail to extract; log them and continue
                with io.open(failed_files, 'a') as fh:
                    fh.write(f2+'\n')
                failures += 1
                
            sys.stdout.flush()
            bar.update(j+1, failures=failures)
        zip_file.close()
        
        bar.finish()
    
    