STRAVA V4

Welcome to my spare-time project. This code calls the Strava API and fetches my activities so that I can explore them in Tableau.
You can borrow this code and fill in your own credentials to create a file of your own Strava activities with photos.

Credit to the following resources which greatly helped me get started:

https://mixedanalytics.com/blog/list-actually-free-open-no-auth-needed-apis/

https://developers.strava.com/docs/authentication/#:~:text=OAuth%20Overview,-When%20OAuth%20is&text=After%20the%20user%20accepts%20or,scope%20accepted%20by%20the%20user.

https://developers.strava.com/docs/reference/#api-Activities-getActivityById

https://towardsdatascience.com/using-the-strava-api-and-pandas-to-explore-your-activity-data-d94901d9bfde

In [None]:
#IMPORTS

#The packages required throughout this project:
import requests
import numpy as np
import pandas as pd
import os
from datetime import datetime

CREDENTIALS

You will need credentials in order to access the Strava API.
Once you register an API application with Strava, you will get the following credentials:
-Client ID
-Client Secret
-Access Token
-Refresh Token
You then need to do some more work (covered in the guide below) to get an access token and a refresh token with "read all" scope. This code only uses the refresh token.

Use this guide to help you through the process:
https://towardsdatascience.com/using-the-strava-api-and-pandas-to-explore-your-activity-data-d94901d9bfde

In [None]:
#MY CREDENTIALS
#This section won't work when running from GitHub. Sorry! Can't give out all my credentials!
#It's reading in my credentials from a file that I didn't include in my Git repository

with open("C:/Users/Chloe/Desktop/Strava_API_credentials.txt", "r") as credentials_file:
    credentials = credentials_file.read()
print("Reading in credentials...")
client_id = credentials.partition("client id: ")[2][:6]
client_secret = credentials.partition("client secret: ")[2][:40]
read_all_refresh_token = credentials.partition("read all refresh token: ")[2][:40]
print("Success!")

#YOUR CREDENTIALS
#This section should work when running from GitHub as I included a template credential file in the Git repository.
#The file is Strava_API_credentials_TEMPLATE.txt
#You will need to update the template credential file with some real credentials for the rest of the code to work
#You can use this guide to help you get your own credentials:
#https://towardsdatascience.com/using-the-strava-api-and-pandas-to-explore-your-activity-data-d94901d9bfde

#Get current directory so can do relative filepaths
current_dir = os.getcwd()

with open("Strava_API_credentials_TEMPLATE.txt", "r") as credentials_file:
    credentials = credentials_file.read()
print("Reading in credentials...")
client_id = credentials.partition("client id: ")[2][:6]
print("client id:")
print(client_id)
client_secret = credentials.partition("client secret: ")[2][:40]
print("client secret:")
print(client_secret)
read_all_refresh_token = credentials.partition("read all refresh token: ")[2][:40]
print("Read all refresh token:")
print(read_all_refresh_token)

if client_id == "123456" or client_secret == "1a2b3c4d5e6f7g8h9j0k1l2m3n4o5p6q7r8s9t0u" or read_all_refresh_token == "1a2b3c4d5e6f7g8h9j0k1l2m3n4o5p6q7r8s9t0u":
    print("It looks like you are using the template credentials.")
    print("Remember to go into the template and update it to your real credentials.")
    print("The template file is Strava_API_credentials_TEMPLATE.txt")
    print("Use the first half of the following guide to help:")
    print("https://towardsdatascience.com/using-the-strava-api-and-pandas-to-explore-your-activity-data-d94901d9bfde")
else:
    print("Success!")
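
The slicing above assumes the client ID is exactly 6 characters long and that the client secret and refresh token are exactly 40. As a minimal sketch, assuming the credentials file keeps one "label: value" pair per line (an assumption about the file layout), each value could instead be read up to the end of its line. The helper name `parse_credentials` is hypothetical.

```python
def parse_credentials(text):
    """Return a dict of {label: value} from lines shaped like 'label: value'.

    Assumes one pair per line; lines without ': ' are skipped.
    """
    parsed = {}
    for line in text.splitlines():
        label, separator, value = line.partition(": ")
        if separator:  # only keep lines that actually contain "label: value"
            parsed[label.strip().lower()] = value.strip()
    return parsed


# Example using the placeholder values that the template check above looks for
example_text = (
    "client id: 123456\n"
    "client secret: 1a2b3c4d5e6f7g8h9j0k1l2m3n4o5p6q7r8s9t0u\n"
    "read all refresh token: 1a2b3c4d5e6f7g8h9j0k1l2m3n4o5p6q7r8s9t0u\n"
)
example_credentials = parse_credentials(example_text)
print(example_credentials["client id"])
```

With this approach the values no longer depend on a fixed length, only on the labels matching those in the file.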

In [None]:
#AUTHENTICATION AND ACCESS TOKENS

#You need the following to make any Strava API requests:
#Client ID
#Client Secret
#Access Token (read all)

#The access token expires after 6 hours, so we use the read all refresh token to get a new read all access token.
#I used this documentation as a guide, see refreshing an expired access token section:
#https://developers.strava.com/docs/authentication/#:~:text=OAuth%20Overview,-When%20OAuth%20is&text=After%20the%20user%20accepts%20or,scope%20accepted%20by%20the%20user.

#Do a POST request to this url:
refresh_read_all_access_token_url = "https://www.strava.com/oauth/token"

#With these parameters:
refresh_read_all_access_token_params = {
    "client_id": client_id,
    "client_secret": client_secret,
    "grant_type": "refresh_token",
    "refresh_token": read_all_refresh_token
}

#The actual POST request:
refresh_read_all_access_token = requests.post(refresh_read_all_access_token_url, data=refresh_read_all_access_token_params)

#See what that did:
print(refresh_read_all_access_token)
# print(refresh_read_all_access_token.json()) # prints all if interested
print("current read all access token: ")
read_all_access_token = refresh_read_all_access_token.json()["access_token"]
print(read_all_access_token)

#IMPORTANT
#This will not work if you have not run the above section with REAL credentials.
#Now we have a valid read all access token, but it expires after 6 hours.
#Re-run this section if more than 6 hours have elapsed since the last run.
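
Rather than remembering to re-run this section, the expiry can be checked in code. Strava's token response carries an `expires_at` field (a Unix timestamp in seconds) alongside `access_token`. A minimal sketch, assuming that value is read from `refresh_read_all_access_token.json()`; the helper name `token_needs_refresh` and the 5-minute safety margin are my own choices, not part of the Strava API:

```python
import time


def token_needs_refresh(expires_at, now=None, margin_seconds=300):
    """Return True if the token expires within margin_seconds of now.

    expires_at is a Unix timestamp in seconds (as in the token response).
    """
    if now is None:
        now = time.time()
    return now >= expires_at - margin_seconds


# Illustrative timestamps only (not real token data)
issued_at = 1_700_000_000
expires_at = issued_at + 6 * 60 * 60  # six hours after issue

print(token_needs_refresh(expires_at, now=issued_at))        # token just issued
print(token_needs_refresh(expires_at, now=expires_at - 60))  # one minute left
```

If the check returns True, the POST request above can be repeated before making further API calls.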

In [None]:
#GET MAIN DATA

#Using this documentation as a guide:
#https://developers.strava.com/docs/reference/#api-Activities-getLoggedInAthleteActivities

#Get the first page of activities (can only have 200 per page)
page = 1

#Do a GET request with this url:
all_activities_url = "https://www.strava.com/api/v3/athlete/activities?page=" + str(page) + "&per_page=200"

#The actual GET request:
all_activities_raw = requests.get(all_activities_url, headers={"Authorization": "Bearer " + read_all_access_token})

print(all_activities_raw)

all_activities = pd.json_normalize(all_activities_raw.json())
print(all_activities.shape)

#Get the second page of activities (two pages cover up to 400 activities in total)
page = 2

#Do a GET request with this url:
all_activities_url = "https://www.strava.com/api/v3/athlete/activities?page=" + str(page) + "&per_page=200"

#The actual GET request:
all_activities_raw = requests.get(all_activities_url, headers={"Authorization": "Bearer " + read_all_access_token})

print(all_activities_raw)

all_activities = pd.concat([all_activities, pd.json_normalize(all_activities_raw.json())], axis=0)


#See dimensions of the combined table to do a rough check
print(all_activities.shape)

print("Found " + str(len(all_activities.index)) + " activities.")
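
The two hard-coded pages above cover up to 400 activities. A minimal sketch of a loop that keeps requesting pages until an empty one comes back is shown below. The names `fetch_all_pages` and `make_strava_fetcher` are hypothetical, and the cap of 50 pages is an arbitrary safety limit of mine. `requests` is imported inside the fetcher only so that the paging logic can be tried out with a stand-in function.

```python
def fetch_all_pages(fetch_page, per_page=200, max_pages=50):
    """Collect records page by page until a page comes back empty.

    fetch_page(page, per_page) must return a list of records.
    """
    records = []
    for page in range(1, max_pages + 1):
        batch = fetch_page(page, per_page)
        if not batch:  # an empty page means there is nothing left
            break
        records.extend(batch)
    return records


def make_strava_fetcher(access_token):
    """Build a fetch_page function for the athlete activities endpoint."""
    import requests  # imported here so fetch_all_pages works without it

    def fetch_page(page, per_page):
        response = requests.get(
            "https://www.strava.com/api/v3/athlete/activities",
            headers={"Authorization": "Bearer " + access_token},
            params={"page": page, "per_page": per_page},
        )
        response.raise_for_status()  # stop early on an HTTP error
        return response.json()

    return fetch_page


# Stand-in for the API: 450 fake activities served in pages
fake_activities = [{"id": n} for n in range(450)]


def fake_fetch_page(page, per_page):
    start = (page - 1) * per_page
    return fake_activities[start:start + per_page]


print(len(fetch_all_pages(fake_fetch_page)))
```

With real credentials, the result of `fetch_all_pages(make_strava_fetcher(read_all_access_token))` could be passed to `pd.json_normalize` in place of the two separate requests.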

In [None]:
all_activities.columns

In [None]:
#CLEAN
#Create new dataframe with only columns I care about and rename column with . in name
cols = ['name', 'id', 'type', 'distance', 'moving_time', 'elapsed_time',
        'max_speed','total_elevation_gain', 'start_date_local', 'start_date', 
        'start_latlng', 'end_latlng', 'map.summary_polyline'
       ]

#Take a copy so that adding columns below does not trigger a SettingWithCopyWarning
all_activities_df = all_activities[cols].copy()

all_activities_df = all_activities_df.rename(columns={'map.summary_polyline':'map_polyline'})


#Break date into start time and date
all_activities_df['start_date_local'] = pd.to_datetime(all_activities_df['start_date_local'])
all_activities_df['start_time'] = all_activities_df['start_date_local'].dt.time
all_activities_df['start_date_local'] = all_activities_df['start_date_local'].dt.date
all_activities_df.head()


In [None]:
#EXPORT .csv
#I plan to add more sections below to practise using Pandas to explore this data,
#But for now I want to export as a .csv to read into Tableau.

#Get today's date
date_now = str(datetime.now())[:10]
print(date_now)

#Save all activities to file for use in Tableau (today's date appended for versioning)
all_activities_df.to_csv("all_activities_df.csv")
print("Data saved as 'all_activities_df.csv'")
all_activities_df.to_csv("back_ups/all_activities_df_"+date_now+".csv")
print("Data saved as 'all_activities_df_"+date_now+".csv' in the back_ups folder")
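
The back-up write above fails if the back_ups folder does not exist yet. A minimal sketch, assuming you are happy for the notebook to create the folder itself; `dated_backup_path` is a hypothetical helper, and the date used in the example is made up:

```python
import os
import tempfile
from datetime import date


def dated_backup_path(folder, stem, day=None):
    """Create folder if needed and return '<folder>/<stem>_<YYYY-MM-DD>.csv'."""
    if day is None:
        day = date.today()
    os.makedirs(folder, exist_ok=True)  # no error if it already exists
    return os.path.join(folder, stem + "_" + day.isoformat() + ".csv")


# Demonstrate in a temporary directory so nothing real is touched
with tempfile.TemporaryDirectory() as tmp_dir:
    example_path = dated_backup_path(
        os.path.join(tmp_dir, "back_ups"), "all_activities_df", date(2024, 1, 31)
    )
    example_folder_exists = os.path.isdir(os.path.dirname(example_path))
    example_file_name = os.path.basename(example_path)

print(example_file_name)
```

The returned path could then be passed straight to `all_activities_df.to_csv(...)`.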

In [None]:
#PICTURES AND DESCRIPTIONS
#Want to add pictures to the workbook which requires another API call using the ids from all my activities as a parameter
#Also need to add the description, which doesn't seem to be included in the all_activities API response (ANNOYING)

#First step, get a big list of all the ids of my activities:
all_activity_ids = all_activities["id"]
all_activity_ids_list = all_activity_ids.values.tolist()

#Then do the API call with each id.
#There is a problem with this as there is a limit on requests (100 every 15 minutes)
#When setting this code up I requested all my pictures and descriptions waiting 15 mins after each batch of 45 (to be safe) and saved them all to a master file
#Now I can read in this master file, match up the activity ids and only run the requests for ids not in the master file
#Then add the new pictures and descriptions to the master files so I don't need to re-request them next time.

master = pd.read_csv("activity_photos_descr_master.csv")
master = master[['id', 'description', 'pic_url', 'pic_id', 'no_pic_flag']]

#Check how many activities the master file currently has
print("There are " + str(master["id"].nunique()) + " activities currently in master photos and descriptions file.")

#Join data to see which activity ids are already in the master file of pictures
joined_data = pd.merge(all_activity_ids, master, on='id', how = "left")

#Filter to rows with null for description (if no description then also no pics)
new_ids = joined_data[(joined_data["description"].isnull())]
new_ids = new_ids["id"]

#Make the dataframe of new ids into a list
#If list is empty then don't need to do anything
#If list is not empty then request the pictures for the new ids
#Limit to 45 activities (2 requests each, so 90 requests) and notify if more pics & descriptions need to be requested in 15 mins.

if len(new_ids) > 45:
    print("There are more than 45 activities that require pictures and descriptions.")
    print("The API requests have been limited to the first 45 activities. This section will need re-running in 15 minutes.")
    new_ids = new_ids[:45].values.tolist()

if len(new_ids) == 0:
    print("There are no new activities that need pictures or descriptions.")
    
else:
    print("Requesting pictures and descriptions for " + str(len(new_ids)) + " new ids: ")
    print(new_ids)
    for activity_id in new_ids:
        #Descriptions
        get_activity_descr_url = "https://www.strava.com/api/v3/activities/" + str(activity_id)
        activity_descr_raw = requests.get(get_activity_descr_url, headers={"Authorization": "Bearer " + read_all_access_token})
        activity_descr = pd.json_normalize(activity_descr_raw.json())
        
        #Clean
        activity_descr['description'] = "\"" + activity_descr['description'] + "\""
        activity_descr = activity_descr[['id', 'description']]
        
        #Pics
        get_activity_photos_url = "https://www.strava.com/api/v3/activities/" + str(activity_id) + "/photos?size=500"
        activity_photos_raw = requests.get(get_activity_photos_url, headers={"Authorization": "Bearer " + read_all_access_token})
        activity_photos = pd.json_normalize(activity_photos_raw.json())
        #Clean and add no pics flag (only if there are pics for this activity)
        if activity_photos.size != 0:    
            activity_photos = activity_photos[['unique_id', 'activity_id', 'urls.500']]
            activity_photos = activity_photos.rename(columns={'unique_id' : 'pic_id', 'urls.500' : 'pic_url'})
            activity_photos = activity_photos.assign(no_pic_flag = 0)
            
            #Join
            activity_photos_descr = pd.merge(activity_descr, activity_photos, how='left', left_on='id', right_on='activity_id')
        
            #Clean
            activity_photos_descr = activity_photos_descr[['id', 'description', 'pic_url', 'pic_id', 'no_pic_flag']]

            #Append to master file
            master = pd.concat([master, activity_photos_descr], axis=0)
        
        #If no pics to add then add description and flag no pics
        else:
            master = pd.concat([master, activity_descr], axis=0)
            master['no_pic_flag'] = master['no_pic_flag'].fillna(1)
        
   
    master.to_csv("activity_photos_descr_master.csv")
    print("activity_photos_descr_master file has been updated.")
    print("There are now " + str(master["id"].nunique()) + " activities in master photos and descriptions file.")

master.to_csv("back_ups/activity_photos_descr_"+date_now+".csv")
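
The selection logic above (skip ids already in the master file, then cap the batch so it stays under the request limit) can be sketched without pandas. This is a simplified stand-in rather than the notebook's own code: `select_new_ids` is a hypothetical name, and the budget of 90 requests at 2 requests per activity mirrors the numbers used in this section.

```python
def select_new_ids(all_ids, known_ids, request_budget=90, requests_per_activity=2):
    """Return (batch, remaining_count) for ids not yet in known_ids.

    The batch is capped so batch_size * requests_per_activity <= request_budget.
    """
    known = set(known_ids)
    new_ids = [activity_id for activity_id in all_ids if activity_id not in known]
    max_activities = request_budget // requests_per_activity
    batch = new_ids[:max_activities]
    return batch, len(new_ids) - len(batch)


# Made-up ids: 100 activities, the first 40 already in the master file
example_all_ids = list(range(1000, 1100))
example_known_ids = list(range(1000, 1040))

example_batch, example_remaining = select_new_ids(example_all_ids, example_known_ids)
print(len(example_batch), example_remaining)
```

A non-zero remaining count signals that the section needs re-running once the 15-minute window has passed.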


In [None]:
#JOIN THE PHOTOS TO THE MAIN DATA

photos_descr = pd.read_csv("activity_photos_descr_master.csv")
activities = pd.read_csv("all_activities_df.csv")

print("Number of activities in activities file:")
print(str(len(activities.index)))
print("Number of activities in photos and descr file:")
print(str(photos_descr["id"].nunique()))


all_activities_with_photos = pd.merge(activities, photos_descr, on = "id", how = "left")


print("Number of activities in all activities with photos:")
print(str(all_activities_with_photos["id"].nunique()))

#Clean
cols2 = ['name', 'id', 'description', 'type', 'distance', 'moving_time', 'elapsed_time',
        'max_speed','total_elevation_gain', 'start_date_local', 'start_date', 
        'start_latlng', 'end_latlng', 'map_polyline', 'pic_url', 'pic_id', 'no_pic_flag'
       ]
all_activities_with_photos = all_activities_with_photos[cols2]
print(all_activities_with_photos.head())

all_activities_with_photos.to_csv("all_activities_with_photos.csv")
print("Data saved as 'all_activities_with_photos.csv'")
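
Because an activity can have several photos, the left join above returns one row per photo rather than one row per activity, which is why the checks count unique ids instead of rows. A small illustration with made-up data follows. The `validate` argument is an optional extra, not used in the notebook, that makes pandas raise an error if `id` is not unique in the activities table.

```python
import pandas as pd

# Made-up data: activity 1 has two photos, activity 2 has none,
# and activity 3 is missing from the photos and descriptions table
example_activities = pd.DataFrame(
    {"id": [1, 2, 3], "name": ["Run", "Ride", "Walk"]}
)
example_photos_descr = pd.DataFrame(
    {
        "id": [1, 1, 2],
        "pic_url": ["a.jpg", "b.jpg", None],
        "no_pic_flag": [0, 0, 1],
    }
)

example_joined = pd.merge(
    example_activities,
    example_photos_descr,
    on="id",
    how="left",
    validate="one_to_many",  # each activity id may appear once on the left only
)

print(len(example_joined.index))       # rows: one per photo (or one if none)
print(example_joined["id"].nunique())  # activities
```

Anything that counts or sums per activity in Tableau would need to account for these repeated rows.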

What to do next:

Make it so the whole script can be run from the command line:
https://stackoverflow.com/questions/35545402/how-to-run-an-ipynb-jupyter-notebook-from-terminal

Explore Pandas and use Python to explore the data (V5?)

#IF MASTER PHOTOS AND DESCRIPTIONS FILE GETS CORRUPTED THEN READ THIS SECTION

#***DO NOT RUN OTHERWISE***


#This is the code for setting up the master file above in case it ever becomes corrupted and needs doing again.

#Take one activity id
activity_id_eg = "11291925681"

#PHOTOS
print("PICS")

#Do one API call to get the picture info for it
get_activity_photos_url = "https://www.strava.com/api/v3/activities/" + str(activity_id_eg) + "/photos?size=500"
activity_photos_raw = requests.get(get_activity_photos_url, headers={"Authorization": "Bearer " + read_all_access_token})
activity_photos = pd.json_normalize(activity_photos_raw.json())
print(activity_photos_raw)
print("PICS RAW:")
print(activity_photos)

#Clean up and add a flag to say there is a pic for this activity
print(activity_photos.columns)
activity_photos = activity_photos[['unique_id', 'activity_id', 'urls.500']]
activity_photos = activity_photos.rename(columns={'unique_id' : 'pic_id', 'urls.500' : 'pic_url'})
activity_photos = activity_photos.assign(no_pic_flag = 0)
print("PICS CLEAN:")
print(activity_photos)
print(activity_photos["pic_url"][0])

#DESCRIPTION
print("DESCR")

#Do another API call to get the description
get_activity_descr_url = "https://www.strava.com/api/v3/activities/" + str(activity_id_eg)
activity_descr_raw = requests.get(get_activity_descr_url, headers={"Authorization": "Bearer " + read_all_access_token})
activity_descr = pd.json_normalize(activity_descr_raw.json())
print(activity_descr_raw)

#Clean up
print(activity_descr.columns)
activity_descr = activity_descr[['id', 'description']].copy()
activity_descr['description'] = "\"" + activity_descr['description'] + "\""
print("DESCR CLEAN")
print(activity_descr)

#JOIN
activity_photos_descr = pd.merge(activity_descr, activity_photos, how='left', left_on='id', right_on='activity_id')

#Clean and flag activities with no photos
activity_photos_descr = activity_photos_descr[['id', 'description', 'pic_url', 'pic_id', 'no_pic_flag']].copy()
activity_photos_descr['no_pic_flag'] = activity_photos_descr['no_pic_flag'].fillna(1)
print("BOTH CLEAN")
print(activity_photos_descr)

#Save the response as the master file.
activity_photos_descr.to_csv("activity_photos_descr_master.csv")
print("activity_photos_descr_master file has been created.")
print("!!!Remember to adapt and run the PICTURES AND DESCRIPTIONS section to fully populate the master file!!!")

#You can now run the PICTURES AND DESCRIPTIONS section above, although you will need to adapt its loop to stay under 100 calls per batch and re-run it every 15 minutes.
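
One way to adapt the loop is to split the ids into batches and pause between them. A minimal sketch using the numbers mentioned in this notebook (batches of 45 activities, 15-minute waits); `chunked`, `process_in_batches` and the `handle_id` callback are hypothetical names, and the `sleep` function is a parameter so that the pause can be swapped out when trying the code.

```python
import time


def chunked(items, size):
    """Yield consecutive slices of items, each at most size long."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def process_in_batches(ids, handle_id, batch_size=45, wait_seconds=15 * 60,
                       sleep=time.sleep):
    """Call handle_id on every id, sleeping between (not after) batches."""
    batches = list(chunked(ids, batch_size))
    for batch_number, batch in enumerate(batches, start=1):
        for activity_id in batch:
            handle_id(activity_id)
        if batch_number < len(batches):  # no need to wait after the last batch
            sleep(wait_seconds)


# Dry run with made-up ids and a fake sleep that only records the waits
handled_ids = []
recorded_waits = []
process_in_batches(list(range(100)), handled_ids.append, sleep=recorded_waits.append)
print(len(handled_ids), recorded_waits)
```

In real use, `handle_id` would make the two requests for one activity and append the result to the master dataframe.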