# Strava Exports to CSV Files
Now that all of our data is uncompressed, we have to transform the separate files into something that can be loaded into a database. MySQL's `LOAD DATA INFILE` is quite efficient according to the Google AI summary :skull:, so if we can transform the data for each table into a CSV file, loading it should be more straightforward.

As we work through data for multiple users, it gets harder to keep track of what each piece of data represents and who it belongs to, so I think an object-oriented approach might be better. In the end, the strategy may be to load all the data for each table into a pandas DataFrame, write it out with `DataFrame.to_csv()`, and then import the generated CSVs into MySQL.
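
As a rough sketch of that last step, the snippet below writes a small DataFrame to CSV and builds a matching `LOAD DATA` statement. The column names, the `activity` table name, and the output file name are placeholders rather than the real schema, and `LOCAL` is an assumption (it has to be enabled on both the MySQL client and server).

```python
import os
import tempfile

import pandas as pd

# Placeholder rows; the real columns depend on the schema chosen for each table.
activities = pd.DataFrame({
    'activity_id': [1, 2],
    'user_id': [1, 1],
    'name': ['Morning Run', 'Evening Ride'],
})

# index=False keeps the DataFrame index out of the file, so the CSV columns
# line up one-to-one with the table columns.
csv_path = os.path.join(tempfile.mkdtemp(), 'activity.csv')
activities.to_csv(csv_path, index=False)

# Statement to run from a MySQL client; IGNORE 1 LINES skips the header row.
load_sql = f"""
LOAD DATA LOCAL INFILE '{csv_path}'
INTO TABLE activity
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
LINES TERMINATED BY '\\n'
IGNORE 1 LINES
(activity_id, user_id, name);
"""
print(load_sql)
```

Listing the columns explicitly at the end of the statement means the load does not depend on the column order of the table definition.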

For information and advice on the three filetypes used and their contents, this tutorial was heavily referenced: [Parsing fitness tracker data with Python](https://towardsdatascience.com/parsing-fitness-tracker-data-with-python-a59e7dc17418/)

In [None]:
'''
Package Imports, Constants, Global Variable

Run this cell to import all the packages we need and define some constants. 
You'll likely need to install any missing packages to your Python environment
with pip or your package manager of choice.
'''

import os
import gpxpy
import fitdecode
import pandas as pd

ACTIVITY_DIR_PATH = '../data/export_activities' # Parent directory of all exports
cur_activity_id = 1   # Global counter to properly increment activity_id primary key

In [None]:
class Activity:

  def __init__(self, activity_id: int, user_id: int, filepath: os.PathLike):
    self.__activity_id = activity_id
    self.__user_id = user_id
    self.__activity_summary = {}       # Populated by the file loaders
    self.__points_df = pd.DataFrame()  # Populated by the file loaders
    self.__load_from_file(filepath)

  def __load_from_file(self, filepath: os.PathLike):
    # Dispatch on the file extension (match requires Python 3.10+)
    match os.path.splitext(filepath)[1].lower():
      case '.gpx':
        self.__load_from_gpx(filepath)
      case '.tcx':
        self.__load_from_tcx(filepath)
      case '.fit':
        self.__load_from_fit(filepath)

  # Stubs to be replaced with real file parsing code
  def __load_from_gpx(self, filepath: os.PathLike):
    pass

  def __load_from_tcx(self, filepath: os.PathLike):
    pass

  def __load_from_fit(self, filepath: os.PathLike):
    pass

  # Getter methods
  def get_summary(self) -> dict:
    return self.__activity_summary

  def get_points(self) -> pd.DataFrame:
    return self.__points_df


class User:

  def __init__(self, user_id: int, export_filepath: os.PathLike, name: str):
    self.__user_id = user_id
    self.__name = name
    self.__activities = []
    self.__load_all_activities(export_filepath)

  def __load_all_activities(self, export_filepath):
    global cur_activity_id  # Required to rebind the module-level counter
    files = os.listdir(export_filepath)
    for file in files:
      self.__activities.append(Activity(cur_activity_id, 
                                        self.__user_id, 
                                        os.path.join(export_filepath, file)))
      cur_activity_id += 1

  # Getters for exporting
  def get_activity_summaries(self) -> pd.DataFrame:
    activity_summaries = [activity.get_summary() for activity in self.__activities]
    return pd.DataFrame(activity_summaries)

  def get_activity_points(self) -> pd.DataFrame:
    point_dfs = [activity.get_points() for activity in self.__activities]
    return pd.concat(point_dfs)

The cell below pulls the track points out of every GPX file and puts them into pandas DataFrames, and can be used as a starting point for parsing .gpx files. Its main issue is that it doesn't link the data to a user at all, which is why I'm leaning towards the OO approach above.

In [None]:
activity_export_dirs = [os.path.join(ACTIVITY_DIR_PATH, entry) 
                        for entry 
                        in os.listdir(ACTIVITY_DIR_PATH) 
                        if os.path.isdir(os.path.join(ACTIVITY_DIR_PATH, entry))]

point_dfs = []
activity_dfs = []  # Not populated yet
for export_dir in activity_export_dirs:
  gpx_paths = [os.path.join(export_dir, entry)
               for entry
               in os.listdir(export_dir)
               if entry.split('.')[-1] == 'gpx']
  for gpx_path in gpx_paths:
    with open(gpx_path) as f:
      gpx = gpxpy.parse(f)
    point_dict = {
      'latitude': [],
      'longitude': [],
      'elevation': [],
      'time': []
    }
    # Flatten tracks -> segments -> points into one column-oriented dict
    for track in gpx.tracks:
      for segment in track.segments:
        for point in segment.points:
          point_dict['time'].append(point.time)
          point_dict['latitude'].append(point.latitude)
          point_dict['longitude'].append(point.longitude)
          point_dict['elevation'].append(point.elevation)
    point_dfs.append(pd.DataFrame(point_dict))

In [12]:
print(point_dfs[0].head())
print(len(point_dfs))

    latitude  longitude  elevation                      time
0  12.536197 -70.057429        5.2 2022-04-17 14:16:52+00:00
1  12.536202 -70.057520        5.3 2022-04-17 14:16:55+00:00
2  12.536173 -70.057601        5.8 2022-04-17 14:16:57+00:00
3  12.536177 -70.057652        5.8 2022-04-17 14:16:59+00:00
4  12.536251 -70.057729        5.7 2022-04-17 14:17:02+00:00
1610
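
The DataFrames in `point_dfs` carry no user or activity identifiers, so once they are concatenated there is no way to tell whose rows are whose. One possible fix, sketched below with made-up points and IDs, is to stamp each DataFrame with its keys before concatenating. `tag_points` is a hypothetical helper, not something already in this notebook.

```python
import pandas as pd

def tag_points(points: pd.DataFrame, user_id: int, activity_id: int) -> pd.DataFrame:
    """Return a copy of points with user_id and activity_id columns added."""
    return points.assign(user_id=user_id, activity_id=activity_id)

# Two tiny stand-ins for per-file DataFrames (values are illustrative only).
run = pd.DataFrame({'latitude': [12.5362, 12.5363],
                    'longitude': [-70.0574, -70.0575]})
ride = pd.DataFrame({'latitude': [40.0001],
                     'longitude': [-105.0001]})

# ignore_index=True gives the combined table a fresh 0..n-1 index.
all_points = pd.concat(
    [tag_points(run, user_id=1, activity_id=1),
     tag_points(ride, user_id=2, activity_id=2)],
    ignore_index=True,
)
print(all_points)
```

The same tagging could live inside `Activity`, which already receives both IDs in its constructor.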
