# Strava Export Parsing
Bulk Strava exports (obtained by following the ['Bulk Export' instructions from Strava Support](https://support.strava.com/hc/en-us/articles/216918437-Exporting-your-Data-and-Bulk-Export)) arrive as one large zip archive. Its top-level folder holds a variety of CSV and other files, along with a few subdirectories (`activities`, `clubs`, `media`, and `routes`). Luckily for us, most of the CSVs and subdirectories are either sparse or empty because users don't use those features, or they contain data not directly relevant to activity logging, such as device identifiers and app login tracking.
##### **Which files do we actually care about?**
In the `activities` subdirectory, there is one file for every activity logged by Strava. These files are what we mainly care about, and they tend to come in three different flavors:
- **FIT (.fit) or Compressed FIT (.fit.gz) Files**\
  FIT (Flexible and Interoperable data Transfer) is a [binary file type developed by Garmin](https://developer.garmin.com/fit/file-types/) that is able to store a wide range of data recorded by a fitness tracker (check out the link for what sort of data can be stored in there!). Because these are binary files, they are not human-readable even when uncompressed. We can use the [fitdecode Python library](https://pypi.org/project/fitdecode/) to read and parse these files after uncompressing them.

- **GPX Files**\
  GPX (GPS Exchange Format) is an open, XML-based file format for storing location data such as waypoints, tracks, and routes. XML is human-readable, which means we can view the raw data if desired. Although GPX focuses more on location data and typically contains less descriptive activity information than a FIT file could, it is simpler and more widely used. We can use the [gpxpy Python library](https://pypi.org/project/gpxpy/) to read and parse GPX files.

- **TCX Files**\
  TCX (Training Center XML) files are also there ¯\\\_(ツ)\_/¯
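
To make the "human-readable" point about GPX concrete, here is a minimal sketch that parses a tiny hand-written GPX document using only Python's standard library. The sample track, its coordinates, and the `parse_trackpoints` helper are illustrative assumptions rather than real export data. For real files, gpxpy remains the recommended tool.

```python
import xml.etree.ElementTree as ET

# A tiny hand-written GPX 1.1 document (illustrative values, not real export data)
SAMPLE_GPX = """<gpx version="1.1" creator="example" xmlns="http://www.topografix.com/GPX/1/1">
  <trk>
    <name>Morning Ride</name>
    <trkseg>
      <trkpt lat="40.0000" lon="-105.0000"><ele>1600.0</ele><time>2023-01-01T12:00:00Z</time></trkpt>
      <trkpt lat="40.0010" lon="-105.0010"><ele>1602.5</ele><time>2023-01-01T12:00:05Z</time></trkpt>
    </trkseg>
  </trk>
</gpx>
"""

# GPX 1.1 elements live in this XML namespace
GPX_NS = {'gpx': 'http://www.topografix.com/GPX/1/1'}


def parse_trackpoints(gpx_text):
    """Return a list of dicts (lat, lon, ele, time) for every track point."""
    root = ET.fromstring(gpx_text)
    points = []
    for trkpt in root.findall('.//gpx:trkpt', GPX_NS):
        ele_text = trkpt.findtext('gpx:ele', namespaces=GPX_NS)
        points.append({
            'lat': float(trkpt.get('lat')),
            'lon': float(trkpt.get('lon')),
            # Elevation and time are optional in GPX, so they may be None
            'ele': float(ele_text) if ele_text is not None else None,
            'time': trkpt.findtext('gpx:time', namespaces=GPX_NS),
        })
    return points


for point in parse_trackpoints(SAMPLE_GPX):
    print(point)
```

Each `trkpt` element carries latitude and longitude as attributes, with elevation and timestamp as optional child elements, which is why the raw file can be read in any text editor.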

This Jupyter Notebook aims to extract the desired information from the provided Strava bulk exports and load it into a MySQL relational schema.


In [None]:
'''
Package Imports and Constants

Run this cell to import all the packages we need and define some constants. 
You'll likely need to install any missing packages to your Python environment
with pip or your package manager of choice.
'''

import os                   # For navigating bulk export on filesystem
import gzip                 # For uncompressing .gz files
import shutil               # For copying decompressed file streams to disk
from zipfile import ZipFile # For uncompressing .zip files

import fitdecode            # For .fit file parsing
import gpxpy                # For .gpx file parsing


DATA_DIR_PATH = '../data/'  # Path of data directory relative to this Jupyter Notebook
ACTIVITY_DIR_PATH = os.path.join(DATA_DIR_PATH, 'export_activities')


We don't want to manually pull the data we need out of the compressed exports, so let's write some code to do that for us. We'll need a directory where we can expect to find all of our export files. Let's say they're in a directory called `data` in the repository. Let's see what's in there!

In [25]:
files_and_dirs = os.listdir(DATA_DIR_PATH)
zip_archives = []

# Let's see what's currently in the data directory
print("Files and directories in the data directory:")
for entry in files_and_dirs:
    print(f'\t{entry}')
    if entry.endswith('.zip'):
        zip_archives.append(entry)

num_zips = len(zip_archives)
print(f'Found {num_zips} zips! Ready to extract.')

# Extracting the entire activity directory from each zip
for i, entry in enumerate(zip_archives, start=1):
    print(f'\tExtracting activities from zip {i} of {num_zips} ...')
    export_name = os.path.splitext(entry)[0]
    with ZipFile(os.path.join(DATA_DIR_PATH, entry)) as archive:
        for file in archive.namelist():
            if file.startswith('activities/'):
                # Rewrite the member's path so it lands in <export name>/<file name>
                archive.getinfo(file).filename = export_name + '/' + file.split('/')[-1]
                archive.extract(file, ACTIVITY_DIR_PATH)
print(f'All zips extracted to \'{ACTIVITY_DIR_PATH}\'!')

Files and directories in the data directory:
	export_101635319.zip
	export_148511532.zip
	export_57141745.zip
	export_96589216.zip
Found 4 zips! Ready to extract.
	Extracting activities from zip 1 of 4 ...
	Extracting activities from zip 2 of 4 ...
	Extracting activities from zip 3 of 4 ...
	Extracting activities from zip 4 of 4 ...
All zips extracted to '../data/export_activities'!
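
The cell above renames each archive member by mutating `ZipInfo.filename` before extracting it. As an alternative, the same flat layout can be produced by reading each member and writing it out explicitly. This is only a sketch: the helper name `extract_activities` is hypothetical, and it assumes that activity files sit directly under `activities/` inside the archive.

```python
import os
import shutil
from zipfile import ZipFile


def extract_activities(zip_path, dest_root):
    """Copy every file under 'activities/' in zip_path into dest_root/<zip name>/.

    Returns the list of paths that were written.
    """
    export_name = os.path.splitext(os.path.basename(zip_path))[0]
    out_dir = os.path.join(dest_root, export_name)
    os.makedirs(out_dir, exist_ok=True)

    extracted = []
    with ZipFile(zip_path) as archive:
        for info in archive.infolist():
            # Skip directory entries and anything outside activities/
            if info.is_dir() or not info.filename.startswith('activities/'):
                continue
            # Keep only the file name so the output directory stays flat
            target = os.path.join(out_dir, os.path.basename(info.filename))
            with archive.open(info) as src, open(target, 'wb') as dst:
                shutil.copyfileobj(src, dst)
            extracted.append(target)
    return extracted
```

Writing the files explicitly leaves the archive's metadata untouched, at the cost of a few more lines than `archive.extract`.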


Nice, now we have the `activities` folder from each bulk export unpacked into a directory (with the same name as the export) inside `data/export_activities`. But our work isn't done yet! Some of the activity files (.fit, .gpx, etc.) have been compressed with gzip into .gz files, and we have to uncompress those too.

In [None]:
# Getting the paths to all of the activity export directories
activity_export_dirs = [os.path.join(ACTIVITY_DIR_PATH, entry) 
                        for entry 
                        in os.listdir(ACTIVITY_DIR_PATH) 
                        if os.path.isdir(os.path.join(ACTIVITY_DIR_PATH, entry))]
num_activity_exports = len(activity_export_dirs)
print(f'Found {num_activity_exports} activity export directories!')

# Iterating through all files in each activity directory,
# extracting the compressed ones and deleting their compressed versions
for i, export_dir in enumerate(activity_export_dirs, start=1):
    print(f'\tExtracting all activities from directory {i} of {num_activity_exports} ...')
    gzipped_files = [os.path.join(export_dir, entry)
                     for entry
                     in os.listdir(export_dir)
                     if entry.endswith('.gz')]
    for gzipped_file in gzipped_files:
        # Dropping the trailing '.gz' gives the output path (e.g. 'ride.fit.gz' -> 'ride.fit')
        with gzip.open(gzipped_file, 'rb') as f_in:
            with open(os.path.splitext(gzipped_file)[0], 'wb') as f_out:
                shutil.copyfileobj(f_in, f_out)
        os.remove(gzipped_file)
print('All gzips extracted in place!')

Found 4 activity export directories!
	Extracting all activities from directory 1 of 4 ...
	Extracting all activities from directory 2 of 4 ...
	Extracting all activities from directory 3 of 4 ...
	Extracting all activities from directory 4 of 4 ...
All gzips extracted in place!
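
Since FIT files are binary, a decompressed `.fit` file can't be eyeballed the way a GPX file can. One quick sanity check is to look at the file header: per Garmin's FIT file format, a file starts with a 12- or 14-byte header whose bytes 8 through 11 are the ASCII tag `.FIT`. The sketch below assumes that layout. The helper name `looks_like_fit` and the synthetic header values are made up for illustration, and full parsing is still left to fitdecode.

```python
import struct


def looks_like_fit(data):
    """Return True if the given bytes begin with a plausible FIT file header."""
    if len(data) < 12:
        return False
    # Little-endian: header size (1 byte), protocol version (1 byte),
    # profile version (2 bytes), data size (4 bytes), data type tag (4 bytes)
    header_size, _protocol, _profile, _data_size, tag = struct.unpack('<BBHI4s', data[:12])
    return header_size in (12, 14) and tag == b'.FIT'


# Synthetic 14-byte header (illustrative values) followed by a 2-byte CRC placeholder
fake_fit_header = struct.pack('<BBHI4s', 14, 0x20, 2132, 0, b'.FIT') + b'\x00\x00'

print(looks_like_fit(fake_fit_header))
print(looks_like_fit(b'<?xml version="1.0"?><gpx></gpx>'))
```

To check a file on disk, read its first 12 bytes in binary mode and pass them to the helper.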


Now we have all of the data we care about extracted into `data/export_activities/`! Each person's export is now a subdirectory containing .fit, .gpx, or even .tcx files.
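
To see how those file types are distributed, a small inventory helper can count file extensions per export directory. This is a sketch using only the standard library, and the function name `count_activity_types` is hypothetical.

```python
import os
from collections import Counter


def count_activity_types(activity_dir_path):
    """Map each export subdirectory name to a Counter of file extensions."""
    inventory = {}
    for entry in sorted(os.listdir(activity_dir_path)):
        export_dir = os.path.join(activity_dir_path, entry)
        if not os.path.isdir(export_dir):
            continue
        # Lowercase the extension so '.GPX' and '.gpx' are counted together
        inventory[entry] = Counter(
            os.path.splitext(name)[1].lower()
            for name in os.listdir(export_dir)
            if os.path.isfile(os.path.join(export_dir, name))
        )
    return inventory


# Example usage inside the notebook, where ACTIVITY_DIR_PATH is already defined:
# for export_name, counts in count_activity_types(ACTIVITY_DIR_PATH).items():
#     print(export_name, dict(counts))
```

Any `.gz` entries that still show up in the counts would point to files that the decompression step missed.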