# Strava Exports to CSV Files
Now that all of our data is uncompressed, we have to transform the separate files into something that can be loaded into a database. MySQL's `LOAD DATA INFILE` is efficient for bulk loading, so if we can transform the data for each table into a CSV file, loading it should be straightforward.

As we work through data for multiple users, it gets harder to keep track of what each piece of data represents and who it belongs to, so I think an object-oriented approach will serve us better. In the end, the strategy may be to load all the data for each table into a pandas DataFrame, write it out with `DataFrame.to_csv()`, and then import the generated CSVs into MySQL.
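
As a minimal sketch of that last step, the snippet below writes a small, made-up points table to CSV text with `DataFrame.to_csv()`. It assumes missing values should be written as `\N`, which `LOAD DATA INFILE` reads as NULL when the escape character is left at its default backslash. The column values are illustrative only, and the nullable `Int64` dtype is one possible choice rather than what this notebook uses.

```python
import io

import pandas as pd

# Made-up sample rows; the second point has no heart-rate reading
points = pd.DataFrame({
    'activity_id': [0, 0, 0],
    'latitude': [42.3398, 42.3399, 42.3401],
    'hr': pd.array([120, None, 124], dtype='Int64'),  # nullable integer column
})

# Write to an in-memory buffer here; passing a file path works the same way
buffer = io.StringIO()
points.to_csv(buffer, index=False, na_rep='\\N', lineterminator='\n')
csv_text = buffer.getvalue()
print(csv_text)
```

Keeping the column as a nullable integer means the missing reading is only turned into `\N` at export time, so the column stays numeric while the data is still in pandas.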

For background on the three file types used and their contents, I relied heavily on this tutorial: [Parsing fitness tracker data with Python](https://towardsdatascience.com/parsing-fitness-tracker-data-with-python-a59e7dc17418/)

In [1]:
'''
Package Imports, Constants, and Global Variables

Run this cell to import all the packages we need and define some constants.
You'll likely need to install any missing packages into your Python environment
with pip or your package manager of choice.
'''

import os
import gpxpy
import fitdecode
import pandas as pd
import csv

ACTIVITY_DIR_PATH = '../data/export_activities' # Parent directory of all exports
cur_activity_id = 0   # Global activity counter to give each activity a unique id across users

Let's define an `Activity` object. We will use this object to store all the data from an individual file in an export, whether it is a .fit, .gpx, or .tcx file. Because the path to the activity file is passed to the constructor, an `Activity` populates itself from that file as soon as it is instantiated.

In [2]:
class Activity:

  __mysql_null = 'NULL'
  __activity_summary_keys = ['user_id', 'activity_id', 'filename', 
                             'start_datetime', 'end_datetime', 
                             'distance_2d', 'distance_3d',
                             'avg_speed', 'max_speed',
                             'uphill', 'downhill',
                             'avg_hr', 'min_hr', 'max_hr',
                             'avg_cad','min_cad','max_cad',
                             'total_kcal']

  def __init__(self, activity_id: int, user_id: int, activity_filepath: os.PathLike):
    self.__activity_id = activity_id
    self.__user_id = user_id
    self.__activity_filepath = activity_filepath
    self.__points_df = pd.DataFrame()

    self.__activity_summary = dict.fromkeys(self.__activity_summary_keys, self.__mysql_null)
    self.__activity_summary.update({'user_id': self.__user_id, 
                                    'activity_id': self.__activity_id,
                                    'filename': os.path.basename(self.__activity_filepath)})

    self.__point_dict = {
      'activity_id': [],
      'latitude': [],
      'longitude': [],
      'elevation': [],
      'time': [],
      'speed': [],
      'hr': [],
      'cad': []
    }

    self.__load_from_file()
    
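  def __load_from_file(self) -> None:
    # os.fspath accepts both str and os.PathLike; splitext keeps the leading dot
    file_ext = os.path.splitext(os.fspath(self.__activity_filepath))[1].lower()
    match file_ext:
      case '.gpx':
        self.__load_from_gpx()
      case '.tcx':
        self.__load_from_tcx()
      case '.fit':
        self.__load_from_fit()
      case _:
        raise ValueError(f'Unsupported activity file type: {file_ext!r}')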

  def __load_from_gpx(self) -> None:

    with open(self.__activity_filepath) as f:
      gpx = gpxpy.parse(f)

      uphill, downhill = 0, 0

      if len(gpx.tracks) == 0:
        raise ValueError(f'No tracks found in gpx file {os.path.abspath(self.__activity_filepath)}')

      for track in gpx.tracks:

        uphill_downhill = track.get_uphill_downhill()
        uphill += uphill_downhill.uphill
        downhill += uphill_downhill.downhill
        
        for segment in track.segments:
          for point_idx, point in enumerate(segment.points):            
            self.__point_dict['activity_id'].append(self.__activity_id)
            self.__point_dict['time'].append(point.time)
            self.__point_dict['latitude'].append(point.latitude)
            self.__point_dict['longitude'].append(point.longitude)
            self.__point_dict['elevation'].append(point.elevation)

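            # Adding speed: the first point of a segment is set to 0; later points
            # fall back to the speed computed from the previous point when none is recorded
            point_speed = point.speed
            if point_idx == 0:
              point_speed = 0
            elif point_speed is None:
              point_speed = point.speed_between(segment.points[point_idx - 1])
            self.__point_dict['speed'].append(point_speed)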

            # Adding extensions: keep the first heart rate and cadence found, and
            # default to the null marker so each point adds exactly one value per column
            hr_value = '\\N'
            cad_value = '\\N'
            found_hr = False
            found_cad = False
            for extension in point.extensions:

              hr_element = extension.find('{http://www.garmin.com/xmlschemas/TrackPointExtension/v1}hr')
              cad_element = extension.find('{http://www.garmin.com/xmlschemas/TrackPointExtension/v1}cad')

              # Recording heart rate, if it exists
              if not found_hr and hr_element is not None and hr_element.text:
                hr_value = int(hr_element.text)
                found_hr = True

              # Recording cadence, if it exists
              if not found_cad and cad_element is not None and cad_element.text:
                cad_value = int(cad_element.text)
                found_cad = True

            self.__point_dict['hr'].append(hr_value)
            self.__point_dict['cad'].append(cad_value)
      
      # Creating the dataframe of points from the dictionary
      self.__points_df = pd.DataFrame(self.__point_dict)

      # Populating activity summary dictionary
      timebounds = gpx.get_time_bounds()
      self.__activity_summary.update({'start_datetime': timebounds.start_time,
                                      'end_datetime': timebounds.end_time,
                                      'distance_2d': round(gpx.length_2d(), 3),
                                      'distance_3d': round(gpx.length_3d(), 3),
                                      'avg_speed': round(self.__points_df['speed'].mean(), 3),
                                      'max_speed': round(self.__points_df['speed'].max(), 3),
                                      'uphill': round(uphill, 3),
                                      'downhill': round(downhill, 3)})
      
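      # hr and cad are only summarized when every point has a reading; a single
      # missing value puts the null marker string in the column, so it is no longer integer-typed
      questionable_cols = ['hr', 'cad']
      for col in questionable_cols:
        if pd.api.types.is_integer_dtype(self.__points_df[col]):
          self.__activity_summary.update({'avg_' + col: round(self.__points_df[col].mean(), 3),
                                          'min_' + col: self.__points_df[col].min(),
                                          'max_' + col: self.__points_df[col].max()})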
    

    # Ensure activity_id is stored as integer
    self.__points_df['activity_id'] = self.__points_df['activity_id'].astype(int)

  # Stubs to be replaced with real file parsing code
  def __load_from_tcx(self) -> None:
    pass

  def __load_from_fit(self) -> None:
    pass

  # Getter methods
  def get_summary(self) -> dict:
    return self.__activity_summary
  
  def get_points(self) -> pd.DataFrame:
    return self.__points_df
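
One possible starting point for the `.tcx` stub is sketched below using only the standard library's `xml.etree.ElementTree`. The function name `parse_tcx_points`, the returned dictionary keys, and the sample document are all hypothetical. The sketch assumes the file uses the Garmin TrainingCenterDatabase v2 namespace, and it has not been checked against real Strava exports.

```python
import xml.etree.ElementTree as ET

# Namespace assumed for TCX files (Garmin TrainingCenterDatabase v2)
TCX_NS = {'tcx': 'http://www.garmin.com/xmlschemas/TrainingCenterDatabase/v2'}


def _text(parent, path):
    """Return the stripped text of a child element, or None if it is absent."""
    element = parent.find(path, TCX_NS)
    if element is None or element.text is None:
        return None
    return element.text.strip()


def _number(parent, path, cast):
    """Return the child element's text converted with cast, or None if absent."""
    text = _text(parent, path)
    return cast(text) if text else None


def parse_tcx_points(tcx_text: str) -> list[dict]:
    """Hypothetical helper: extract one dict per Trackpoint from TCX text."""
    root = ET.fromstring(tcx_text.strip())
    points = []
    for tp in root.iterfind('.//tcx:Trackpoint', TCX_NS):
        points.append({
            'time': _text(tp, 'tcx:Time'),
            'latitude': _number(tp, 'tcx:Position/tcx:LatitudeDegrees', float),
            'longitude': _number(tp, 'tcx:Position/tcx:LongitudeDegrees', float),
            'elevation': _number(tp, 'tcx:AltitudeMeters', float),
            'hr': _number(tp, 'tcx:HeartRateBpm/tcx:Value', int),
            'cad': _number(tp, 'tcx:Cadence', int),
        })
    return points


# Made-up sample document; the second trackpoint only has a timestamp
SAMPLE_TCX = '''
<TrainingCenterDatabase xmlns="http://www.garmin.com/xmlschemas/TrainingCenterDatabase/v2">
  <Activities><Activity Sport="Running"><Lap><Track>
    <Trackpoint>
      <Time>2022-04-17T14:16:52Z</Time>
      <Position>
        <LatitudeDegrees>42.34</LatitudeDegrees>
        <LongitudeDegrees>-71.09</LongitudeDegrees>
      </Position>
      <AltitudeMeters>5.0</AltitudeMeters>
      <HeartRateBpm><Value>120</Value></HeartRateBpm>
      <Cadence>80</Cadence>
    </Trackpoint>
    <Trackpoint>
      <Time>2022-04-17T14:16:53Z</Time>
    </Trackpoint>
  </Track></Lap></Activity></Activities>
</TrainingCenterDatabase>
'''

tcx_points = parse_tcx_points(SAMPLE_TCX)
print(tcx_points)
```

Missing fields come back as `None` here; inside `Activity` they would still need to be mapped to whatever null marker the CSV export expects.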

Now let's define a `User`. An entire export directory of activity files belongs to one Strava user, so our `User` holds a list of `Activity` objects. Because the path to the export directory is passed to the constructor, a `User` loads all of its activities as soon as it is instantiated.

In [3]:
class User:

  def __init__(self, user_id: int, export_filepath: os.PathLike, name: str):
    self.__user_id = user_id
    self.__export_filepath = export_filepath
    self.__name = name
    self.__activities = []
    self.__load_all_activities()

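  def __load_all_activities(self) -> None:
    global cur_activity_id
    # Only .gpx is parsed so far; add 'tcx' and 'fit' once their loaders are implemented
    supported_exts = ('gpx',)
    # Sorting makes the order in which activity ids are assigned deterministic
    for file in sorted(os.listdir(self.__export_filepath)):
      file_ext = file.split('.')[-1].lower()
      if file_ext not in supported_exts:
        continue
      try:
        activity = Activity(activity_id = cur_activity_id,
                            user_id = self.__user_id,
                            activity_filepath = os.path.join(self.__export_filepath, file))
      except ValueError as ve:
        print(f'Error occurred when loading {file}: {ve}')
        continue
      # The id is only consumed when the activity loaded successfully
      self.__activities.append(activity)
      cur_activity_id += 1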

  # Getters for exporting
  def get_activity_summaries(self) -> pd.DataFrame:
    activity_summaries = [activity.get_summary() for activity in self.__activities]
    return pd.DataFrame(activity_summaries)

  def get_activity_points(self) -> pd.DataFrame:
    point_dfs = [activity.get_points() for activity in self.__activities]
    # pd.concat raises ValueError on an empty list, so return an empty frame instead
    if not point_dfs:
      return pd.DataFrame()
    return pd.concat(point_dfs)
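
To extend this to several users, the export directories under the parent directory would need to be enumerated and given user ids. The sketch below shows one way to do that. `find_export_dirs` is a hypothetical helper, and the `export_` prefix is an assumption based on the single directory name used in this notebook.

```python
import os
import tempfile


def find_export_dirs(parent_dir: str, prefix: str = 'export_') -> list[tuple[int, str]]:
    """Return (user_id, path) pairs for each export directory under parent_dir."""
    names = sorted(
        entry.name for entry in os.scandir(parent_dir)
        if entry.is_dir() and entry.name.startswith(prefix)
    )
    # User ids start at 1 and follow the sorted directory names
    return [(user_id, os.path.join(parent_dir, name))
            for user_id, name in enumerate(names, start=1)]


# Demonstration on a throwaway directory tree with made-up names
with tempfile.TemporaryDirectory() as parent:
    for name in ('export_222', 'export_111', 'notes'):
        os.mkdir(os.path.join(parent, name))
    exports = find_export_dirs(parent)
    export_names = [(uid, os.path.basename(path)) for uid, path in exports]

print(export_names)
```

Each returned pair could then be passed to `User(...)` along with a name, with the shared `cur_activity_id` counter keeping activity ids unique across users.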

Let's test it out by initializing a `User`:

In [4]:
steve = User(user_id = 1, export_filepath = '../data/export_activities/export_101635319', name='Steve')

Error occured when loading 7077892227.gpx: No tracks found in gpx file c:\Users\matth\Documents\Python\CS3200\CS3200_Strava_Secretary\data\export_activities\export_101635319\7077892227.gpx


Now let's export this user's points and activity summaries to CSV and preview the summaries:

In [6]:
points_df = steve.get_activity_points()
points_df.to_csv('../data/points.csv', index_label='seq_num', lineterminator='\n')

activities_df = steve.get_activity_summaries()
print(activities_df.head(15))
activities_df.to_csv('../data/activities_summaries.csv', index = False, lineterminator='\n')

    user_id  activity_id        filename            start_datetime  \
0         1            0  6997176516.gpx 2022-04-17 14:16:52+00:00   
1         1            1  7014531890.gpx 2022-04-20 19:42:28+00:00   
2         1            2  7019748989.gpx 2022-04-21 19:44:08+00:00   
3         1            3  7024451292.gpx 2022-04-22 19:48:17+00:00   
4         1            4  7034624054.gpx 2022-04-24 15:34:29+00:00   
5         1            5  7040713235.gpx 2022-04-25 20:15:24+00:00   
6         1            6  7046031446.gpx 2022-04-26 20:21:17+00:00   
7         1            7  7056695803.gpx 2022-04-28 20:14:49+00:00   
8         1            8  7056695936.gpx 2022-04-28 20:29:28+00:00   
9         1            9  7056725873.gpx 2022-04-28 20:59:31+00:00   
10        1           10  7061409366.gpx 2022-04-29 19:48:46+00:00   
11        1           11  7071266617.gpx 2022-05-01 15:12:59+00:00   
12        1           12  7071443617.gpx 2022-05-01 15:23:21+00:00   
13        1         