# Tour de France historic analysis

The Tour de France is one of the largest bike races in the world, if not the largest, and is one of the three Grand Tours (the Giro d'Italia and Vuelta a España being the other two). It is a stage-based race: riders must complete every stage, and their accumulated time determines the *general classification* (GC), which decides the overall winner.
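As a minimal sketch of how accumulated time produces a GC ranking, the snippet below sums per-stage times and sorts the totals. The riders and times are made up for illustration, and real GC standings also involve time bonuses and penalties, which are ignored here.

```julia
# Hypothetical stage times in seconds for three made-up riders over three stages
stage_times = Dict(
    "Rider A" => [14_400, 16_200, 3_600],
    "Rider B" => [14_410, 16_100, 3_650],
    "Rider C" => [14_400, 16_500, 3_580],
)

# General classification: total accumulated time, lowest total first
standings = sort(
    [(rider = r, total = sum(t)) for (r, t) in stage_times];
    by = x -> x.total,
)

for (pos, row) in enumerate(standings)
    println(pos, ". ", row.rider, ": ", row.total, " s")
end
```

Note that "Rider B" leads overall without having the best time on every stage, which is exactly what distinguishes the GC from stage wins.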

Alongside the GC, there are 3 other categories in which riders compete:

1. Points classification - typically for sprinters, achieved by winning points at sprints occurring during and at the end of certain stages
2. King/Queen of the Mountains (KoM) - points are attributed to riders in the order they scale classified mountains
3. Best Young Rider - the best placed rider under the age of 25

Competitions are denoted by differently coloured jerseys. While these differ across the Grand Tours, the Tour de France uses the yellow jersey for GC, the green jersey for points, the polka-dot jersey for KoM/QoM, and the white jersey for best young rider.

The competitions are not mutually exclusive: a single rider can lead any combination of the four. That rider wears only one jersey during a stage, however, and the other jerseys are 'lent' to the next-best rider in each classification.

<p align="center">
  <img src="https://library.sportingnews.com/2022-07/tour-de-france-jerseys-072222-getty-ftr.jpg" alt="The four TdF Jerseys"/>
</p>

Teams' different competition goals at the Tour often define both their tactics and their team composition. There is, therefore, a huge amount to analyse in the Tour, and I will use this dataset to explore it further.

Firstly, my thanks to user PABLOMONLEON for collating this dataset for use on Kaggle.

In [3]:
import Pkg

# Pkg.add(["DataFrames","Plots","MLJ","CSV","StatsPlots","Dictionaries","Dates","DataFramesMeta","HTTP","Gumbo","AbstractTrees","TableScraper","Cascadia"])
using DataFrames, DataFramesMeta, Plots, StatsPlots, Dictionaries, MLJ, CSV, Dates, HTTP, Gumbo, AbstractTrees, Cascadia, TableScraper

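# read in the data (left commented out here; the cells below expect `stages`, `stagesFull` and `winner` to be loaded)
# files = readdir(pwd() * "/TdFArchive/")

# stages, stagesFull, winner = [CSV.File(pwd() * "/TdFArchive/" * x) |> DataFrame for x in files];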

The data collected contains information on the stages themselves, the stage results, and the winner of each edition. First, we can take a look at how the stages have changed over time. For example, how does stage distance vary over time?

## Stages through the ages

In [2]:
@df stagesFull scatter(
    :Date,
    :Distance,
    label = "Stage distance (km)"
)

UndefVarError: UndefVarError: stagesFull not defined

First off, we can see just how incredibly long the Tour stages used to be compared to current lengths. This is all the more incredible given that, until 1937, participants were required to ride single-gear bikes with wooden rims. Following the Second World War and the death of the Tour's progenitor, Henri Desgrange, the Tour settled into a more typical 20-25 stages, each usually lasting a day. This much is clearly visible from the above plot. If we include information on the types of stages involved:

In [3]:
function standardTypes(types)
    # map each raw stage description onto one of four broad categories
    stageTypes = Dictionary(
        ["flat", "cobbles", "mountain", "tt"],
        [
            ("flat stage", "intermediate stage", "plain stage", "transition stage"),
            ("flat cobblestone stage", "plain stage with cobblestones"),
            ("high mountain stage", "hilly stage", "medium mountain stage", "mountain stage", "stage with mountain", "stage with mountain(s)"),
            ("individual time trial", "mountain time trial"),
        ],
    )
    out = String.(types) # work on a copy so the input column is not modified in place
    for key in keys(stageTypes)
        inds = findall(in(stageTypes[key]), lowercase.(types)) # rows whose description belongs to this category
        out[inds] .= key
    end
    return out # descriptions not listed above (e.g. team time trials) are left unchanged
end
stagesFull.Type = standardTypes(stagesFull.Type);

@df stagesFull scatter(
    :Date,
    :Distance,
    group = :Type,
    ylabel = "Distance (km)"
)

UndefVarError: UndefVarError: stagesFull not defined

Moving from those early flat or mountain stages, all brutally long, to the introduction of teams (much to Desgrange's chagrin) with team and individual time trials, there is a gradual increase in the variety of stage types seen throughout the Tour. It appears, however, that the mix of stages in the Tour hasn't changed too dramatically since the mid to late 1940s. Would it, therefore, be safe to assume that the formula for winning the Tour has also not varied much since those days?

## Winning qualities

In [4]:
@df winner scatter(
    :start_date,
    :weight./:height.^2,
    group=:nationality,
    legend=:best,
    legendcolumns=3,
    ylabel="BMI",
    markeralpha=0.8
)
# label the three multiple winners discussed below, just above their highest BMI value
bmi = winner.weight ./ winner.height .^ 2
for (name, label) in [("Lance Armstrong", "Armstrong"), ("Miguel Induráin", "Induráin"), ("Eddy Merckx", "Merckx")]
    inds = findall(winner.winner_name .== name)
    mid = inds[cld(length(inds), 2)] # middle win, used for the horizontal position of the label
    annotate!(winner.start_date[mid], maximum(skipmissing(bmi[inds])) + 0.2, text(label, 10, :bottom))
end

UndefVarError: UndefVarError: winner not defined

With a greater focus on nutrition and modern science applied to training, we see that there is indeed a downward trend in BMI over the years. Some notable exceptions are present: Eddy Merckx (wins between 1969 and 1974), Miguel Induráin (1991-1995), and Lance Armstrong (1999-2005).
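For reference, the BMI used here is weight divided by height squared. The sketch below assumes weight in kilograms and height in metres (which is what the `:weight ./ :height .^ 2` expression in the plot implies); the example rider is hypothetical and not taken from the dataset.

```julia
# Body mass index: weight (kg) divided by height (m) squared
bmi(weight_kg, height_m) = weight_kg / height_m^2

# Hypothetical rider: 70 kg and 1.80 m
println(round(bmi(70.0, 1.80); digits = 1))

# Broadcasting over columns keeps `missing` where either measurement is absent
values = bmi.([70.0, missing, 62.0], [1.80, 1.75, missing])
```

Because `missing` propagates through the arithmetic, winners without recorded height or weight simply drop out of the scatter plot rather than causing an error.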

Induráin left a lasting legacy on the Tour as arguably the first time trial specialist to dominate the race, winning individual time trials by such margins that he could 'get by' in the mountain stages.

Merckx was a phenomenal cyclist who has the joint-most stage wins at the Tour. He even won the points jersey in 1969, 1971, and 1972, and the KoM in 1969 and 1970, alongside winning the yellow jersey in each of those years.

Armstrong infamously 'won' 7 TdF titles via a sophisticated doping system (albeit during a period tainted by serial performance-enhancing drug scandals, indicative of a peloton-wide epidemic).

These three cyclists all show quite high BMI values: Induráin was known for his considerable size for a cyclist, and Armstrong's high BMI may be testament to his doping programme. Overall, however, a downward trend is clearly present. Another interesting quantity is the average speed at which winners covered the whole race.

In [5]:
scatter(
    winner.start_date,
    winner.distance./winner.time_overall,
    label="Overall speed (km/h)"
    # legend = nothing
    # legend=:topright
)

UndefVarError: UndefVarError: winner not defined

We see that speed has generally increased over time. An interesting follow-up is to compare speed trends in the 'doping era' of the 1980s-2000s with those of the surrounding years.
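One possible way to set up such a comparison is sketched below. The rows are invented placeholders rather than real results, and the era boundaries (1980 to 2009) are my own assumption for what "80s-2000s" covers.

```julia
using Statistics

# Hypothetical (year, distance in km, winning time in hours) rows, not real results
editions = [
    (year = 1960, distance = 4200.0, hours = 112.0),
    (year = 1975, distance = 4000.0, hours = 114.0),
    (year = 1995, distance = 3600.0, hours = 92.0),
    (year = 2003, distance = 3400.0, hours = 84.0),
    (year = 2015, distance = 3360.0, hours = 84.0),
]

# Average speed in km/h over the whole race
speed(e) = e.distance / e.hours

# Assumed era boundaries; adjust `from` and `to` as needed
inEra(e; from = 1980, to = 2009) = from <= e.year <= to

eraSpeed = mean(speed(e) for e in editions if inEra(e))
otherSpeed = mean(speed(e) for e in editions if !inEra(e))

println("Era mean: ", round(eraSpeed; digits = 2), " km/h")
println("Other years mean: ", round(otherSpeed; digits = 2), " km/h")
```

A plain comparison of means like this ignores the long-term upward trend, so with the real data it would be more informative to compare each era against a fitted trend line.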

In [6]:
using Unicode

function sepName(name)
    #=
    Generate a Regex that matches strings containing every part of `name`, in any order.
    Accents are removed from `name`, so they must also be stripped from the text being searched.

    Args:
        name:   string of full name

    Returns:
        Regex with one lookahead per name part

    Example:
        julia> sepName("Stêphen La Faîre")
        r"^(?=.*\bStephen\b)(?=.*\bLa\b)(?=.*\bFaire\b).*"
    =#
    name = Unicode.normalize(name, stripmark=true) # remove accents
    names = split(name) # split on whitespace into the individual names
    return Regex("^" * join(["(?=.*\\b" * x * "\\b)" for x in names]) * ".*")
end

sepName (generic function with 1 method)
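To illustrate the order-independent matching that `sepName` is aiming for, here is a small self-contained sketch. `nameRegex` and `stripAccents` are hypothetical helper names used only for this example; the rider strings are illustrative.

```julia
using Unicode

# Remove accents so that "Induráin" and "Indurain" compare equal
stripAccents(s) = Unicode.normalize(s, stripmark = true)

# Hypothetical helper: one lookahead per name part, so the parts may appear in any order
function nameRegex(name)
    parts = split(stripAccents(name))
    return Regex("^" * join("(?=.*\\b" * p * "\\b)" for p in parts) * ".*")
end

rx = nameRegex("Miguel Induráin")

println(occursin(rx, stripAccents("Induráin Miguel")))   # surname first
println(occursin(rx, stripAccents("Miguel Induráin")))   # given name first
println(occursin(rx, stripAccents("Miguel Martinez")))   # different rider
```

The match is case-sensitive, so a source that writes surnames in capitals would need lowercasing on both sides (or the `i` regex flag) before comparison.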

In [7]:
# calculate the number of riders per stage
stageSummary = @combine(groupby(stages,[:year,:Stage]),
    :N = length(:rank),
    :Age = mean(:age)
    )

@df stageSummary scatter(:year,
    :N,
    label="Number of riders")

maximum(stageSummary.year) # latest year covered (`out` is not defined until the next cell)

UndefVarError: UndefVarError: stages not defined

In [8]:
# out = DataFrame(Stage = String3[], Date = Date[], Distance = Float64[], Origin = String[], Destination = String31[], Type = String31[], Winner = String[], Winner_Country = Union{Missing, String31}, edition = Int64[], year = Int64[], stage_results_id = String15[], rank = String7[], time = Union{Missing, String3}, rider = String[], age = Union{Missing, Float64}, team = Union{Missing, String}, points = Union{Missing, Float64}, elapsed = Union{Missing, String3}, bib_number = Union{Missing, Float64})
out = DataFrame()
for (win,yr) in zip(winner[:,:winner_name],Dates.year.(Date.(winner.start_date)))
    indivInfo = stages[(occursin.(sepName(win),Unicode.normalize.(stages.rider,stripmark=true))) .& (yr .== stages.year),:]
    indivInfo.Stage .= replace.(indivInfo.stage_results_id,r"[^0-9]" => "")
    indivInfo.rank .= replace.(indivInfo.rank,r"[^0-9]" => "")
    stageInfo = stagesFull[(Dates.year.(Date.(stagesFull.Date)) .== yr),:]
    append!(out, innerjoin(indivInfo, stageInfo, on=:Stage); cols=:union, promote=true) # `cols` is a keyword argument, not a comparison
end
out.rank = parse.(Int64,out.rank);

UndefVarError: UndefVarError: winner not defined

In [9]:
# within each grouping, we want to define the relative rank of the rider, i.e. the rank/number of participants in that stage
# for (win,yr) in zip(winner[:,:winner_name],Dates.year.(Date.(winner.start_date)))
#     ranks = stages[(occursin.(sepName(win),Unicode.normalize.(stages.rider,stripmark=true))) .& (yr .== stages.year),:rank]
#     # ranks ./ stageSummary[(yr. == stageSummary.year),:N]
# end


# win = winner.winner_name[30]
# yr = Dates.year.(Date.(winner.start_date))[30]
# ranks = parse.(Int64,stages[(occursin.(sepName(win),Unicode.normalize.(stages.rider,stripmark=true))) .& (yr .== stages.year),:rank])


In [10]:
teamSummary = dropmissing(@combine(groupby(stages[occursin.(r"^[0-9]",stages.rank),:],[:team,:year]),
    :aveAge = mean(:age),
    :aveRank = mean(parse.(Int64,:rank))))
    
# plot(bar(teamSummary.year, teamSummary.aveRank))
    # @df teamSummary scatter(:year,
    # :aveAge,
    # label = "Average team age")
teamSummary

UndefVarError: UndefVarError: stages not defined

Cycling is a highly team-oriented sport. If we look at the average age of riders in a team, that number has gradually increased.

In [11]:
@df out scatter(:year,
    :rank,
    group = :Winner,
    legend = nothing,
    ylabel = "Finish rank",
    xlabel = "Year")

MethodError: MethodError: no method matching _extract_group_attributes(::Symbol, ::Symbol, ::Symbol)
Closest candidates are:
  _extract_group_attributes(!Matched::AbstractVector{T} where T, ::Any...; legend_entry) at C:\Users\arang\.julia\packages\RecipesPipeline\BGM3l\src\group.jl:10
  _extract_group_attributes(!Matched::Tuple, ::Any...) at C:\Users\arang\.julia\packages\RecipesPipeline\BGM3l\src\group.jl:27
  _extract_group_attributes(!Matched::NamedTuple, ::Any...) at C:\Users\arang\.julia\packages\RecipesPipeline\BGM3l\src\group.jl:36
  ...

In [12]:
# [findall.(" ",winner.winner_name[23]),length(winner.winner_name[23])]

inds = [collect(x)[1] for x in findall.(" ", winner.winner_name[23])]

# deleteat!(collect(1:length(winner.winner_name[23])),findall.(" ", winner.winner_name[23])[1])
prepend!(append!(inds,length(winner.winner_name[23])+1),0)
ranges = [collect(range(inds[x]+1,inds[x+1]-1)) for x in 1:length(inds)-1]
# r"+" * [r"(?=.*\b" * winner.winner_name[23][x] * r"\b)" for x in ranges]

# r"(?=.*\b" * winner.winner_name[23][ranges[1]] * r"\b)"
# print(match(r"^" * Regex(join(["(?=.*\b" * winner.winner_name[23][x] * "\b)" for x in ranges])) * r".*$",winner.winner_name[23]))
# r"^" * Regex(join(["(?=.*\b" * winner.winner_name[23][x] * "\b)" for x in ranges])) * r".*$",winner.winner_name[23]
names = [winner.winner_name[23][x] for x in ranges]
occursin(Regex(join(["^(?=.*\\b" * x * "\\b).*" for x in names])),winner.winner_name[21])
# ^(?=.*\bSidney\b)(?=.*\bAlice\b)(?=.*\bPeter\b).*$



UndefVarError: UndefVarError: winner not defined

In [13]:
occursin(Regex("(?=.*\\bMaurice\\b).*(?=.*\\bDe\\b).*(?=.*\\bWaele\\b).*"),winner.winner_name[23])

UndefVarError: UndefVarError: winner not defined

In [14]:
Unicode.normalize.(stages.rider,stripmark=true)

UndefVarError: UndefVarError: stages not defined

In [15]:
# scrape wikipedia 

## Scraping procyclingstats.com

To gain access to more data, we'll scrape some [procyclingstats](https://www.procyclingstats.com/index.php) pages to extract full stage data. This requires some wrangling, especially as TTT stages are not formatted like the others. However, it will give us far more information, and we can really dig into what each rider goes through, the team composition, and the winners' journeys to the yellow jersey.
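Each stage has its own results page, so part of the wrangling is turning a stage label into a URL. The sketch below shows one way to do this as a pure string operation. `stageURL` is a hypothetical helper, and the URL pattern is an assumption based on the stage pages used elsewhere in this notebook.

```julia
const BASE_URL = "https://www.procyclingstats.com/race/tour-de-france/"

# Hypothetical helper: build the results URL for one stage label
function stageURL(year::Int, stage::AbstractString)
    # drop a trailing marker in parentheses such as "(TTT)" or "(ITT)"
    label = replace(stage, r"\s*\([^)]*\)\s*$" => "")
    # "Stage 6b" -> "stage-6b"
    slug = lowercase(replace(strip(label), " " => "-"))
    return BASE_URL * string(year) * "/" * slug
end

println(stageURL(1974, "Stage 6b (TTT)"))
println(stageURL(1974, "Stage 1"))
println(stageURL(1974, "Prologue"))
```

Stripping the parenthesised marker explicitly, rather than splitting on the last space, keeps labels such as "Stage 1" intact.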

In [4]:
function numStages(year::Int)
    # list the stage names of the Tour de France in `year`, or `missing` if no race was run. Data pulled from procyclingstats.com
    try
        out = [split(x," | ")[1] for x in DataFrame(scrape_tables("https://www.procyclingstats.com/race/tour-de-france/" * string(year))[2]).Stage]
        filter!(x -> !occursin(r"^rest\s*day"i, x), out) # drop rest days, which are listed in the table but are not stages
        return out
    catch
        return missing
    end
end

function checkTTT(url)
    # check if stage is TTT
    r = HTTP.get(url)
    r_parsed = parsehtml(String(r.body))
    tables_elems = eachmatch(sel"table", r_parsed.root)
    return contains(getattr(tables_elems[1],"class"),"ttt")
end

function formatTimelag(tlag)
    # standardise timelag to HH:MM:SS

    # missing values (e.g. from `cols=:union`) are passed through unchanged
    ismissing(tlag) && return tlag
    # remove +
    tlag = replace(tlag, "+" => "")
    # empty values (to be copied from above) are left as they are
    isempty(tlag) && return tlag
    # if there is no hours component, pad to HH:MM:SS
    if !occursin(r":\d\d:\d\d", tlag)
        # single-digit minutes need an extra leading zero
        tlag = (occursin(r"\d\d:\d\d", tlag) ? "00:" : "00:0") * tlag
    end
    return tlag
end

# function fullTTT(url,stage)
#     # add times for each individual rider of TTT and return format similar to standard stages
#     df = DataFrame(scrape_tables(url)[1])
#     inds = append!(findall(df[:,"Pos."] .!= ""),nrow(df))
#     df.Rnk = missings(String, nrow(df))
#     df.Rider = missings(String, nrow(df))
#     df.Timelag = missings(String, nrow(df))
#     for x in 1:(length(inds) - 1)
#         df.Rnk[(inds[x]+1):(inds[x+1]-1)] .= df[inds[x],"Pos."]
#         df.Rider[(inds[x]+1):(inds[x+1]-1)] .= df.Team[(inds[x]+1):(inds[x+1]-1)] .* df.Team[inds[x]]
#         df.Team[(inds[x]+1):(inds[x+1]-1)] .= df.Team[inds[x]]
#         df.Timelag[(inds[x]+1):(inds[x+1]-1)] .= "+" * df.Timegap[inds[x]]
#     end
#     df.Rnk[nrow(df)] = df.Rnk[nrow(df) - 1]
#     df.Rider[nrow(df)] = df.Team[nrow(df)] * df.Team[inds[length(inds)]]
#     df.Team[nrow(df)] = df.Team[nrow(df) - 1]
#     df.Timelag[findall(skipmissing(df.Timelag .== "+"))] .= "+0.00"
#     deleteat!(df,unique(inds))
#     return insertcols!(df,1,:Stage => string(stage)),cols=:union
# end

function stageResults(year::Int)
    # extract all stage results for `year` Tour de France. Returns a DataFrame, or `missing` if no race was found
    numStag = numStages(year) # list all stages in tour if any
    ismissing(numStag) && return missing
    out = DataFrame()
    for stage in numStag
        # drop a trailing "(TTT)"/"(ITT)" marker, then build the slug, e.g. "Stage 6b (TTT)" -> "stage-6b"
        slug = lowercase(replace(replace(stage, r"\s*\(.*\)$" => ""), " " => "-"))
        url = "https://www.procyclingstats.com/race/tour-de-france/" * string(year) * "/" * slug
        # check if stage is TTT
        if checkTTT(url)
            # NOTE: `fullTTT` is commented out above and must be restored before TTT stages can be processed
            append!(out, fullTTT(url, stage), cols=:union)
        else
            append!(out, insertcols!(DataFrame(scrape_tables(url)[1]), 1, :Stage => string(stage)), cols=:union)
        end
    end
    out.Timelag = formatTimelag.(out.Timelag)
    return out
end

stageResults (generic function with 1 method)
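Once the time gaps are in a consistent text form, they are easier to analyse as numbers. As a rough sketch, the hypothetical helper `lagSeconds` below converts gaps such as `+0:13` or `2:10:14` into seconds. It assumes colon-separated fields only, so values written with a dot (as in the TTT rows, `+0.00`) would need handling separately.

```julia
# Hypothetical helper: convert "M:SS" or "H:MM:SS" (with optional leading "+") to seconds
function lagSeconds(tlag::AbstractString)
    clean = replace(tlag, "+" => "")
    isempty(clean) && return missing          # nothing recorded for this row
    parts = parse.(Int, split(clean, ":"))    # e.g. "2:10:14" -> [2, 10, 14]
    # fold left to right: each step shifts the running total by a factor of 60
    return foldl((acc, p) -> 60 * acc + p, parts; init = 0)
end

println(lagSeconds("+0:13"))
println(lagSeconds("2:10:14"))
println(lagSeconds(""))
```

Working in seconds makes it straightforward to sum gaps across stages or to plot how far behind the stage winner each rider finished.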

In [7]:
test = numStages(1974)
# if any(contains.(test,r"\d[a-z]")) # check for presence of multi-stage days

# end

# test
"https://www.procyclingstats.com/race/tour-de-france/" * string(1974) *"/" * lowercase(replace(rsplit(test[8]," ";limit=2)[1]," " => "-"))

"https://www.procyclingstats.com/race/tour-de-france/1974/stage-6b"

In [18]:
url="https://www.procyclingstats.com/race/tour-de-france/1974/stage-6b"
# check if stage is TTT
r = HTTP.get(url)
r_parsed = parsehtml(String(r.body))
tables_elems = eachmatch(sel"table", r_parsed.root)
# eachmatch(sel"tr th", tables_elems[1])


# [getattr(x,"href") for x in eachmatch(sel"a",tables_elems)]
test = eachmatch(sel"a",tables_elems[1])
getattr(test[2],"href")

"team/kas-kaskol-1974"

In [19]:
test = insertcols!(DataFrame(scrape_tables("https://www.procyclingstats.com/race/tour-de-france/1974/stage-6a")[1]),1,:Stage => string("Stage 6a"))
test[50:70,:]

Row,Stage,Rnk,GC,Timelag,BIB,H2H,Specialty,Rider,Age,Team,UCI,Pnt,Unnamed: 13_level_0,Time
Unnamed: 0_level_1,String,String,String,String,String,String,String,String,String,String,String,String,String,String
1,Stage 6a,50,,,101,,GC,PINGEON RogerJobo - Lejeune,33,Jobo - Lejeune,,,,",,0:27"
2,Stage 6a,51,12.0,+1:02,11,,TT,AGOSTINHO JoaquimBic,31,Bic,,,,",,0:27"
3,Stage 6a,52,10.0,+0:55,71,,TT,MANZANEQUE JesúsLa Casera - Bahamontes,31,La Casera - Bahamontes,,,,",,0:27"
4,Stage 6a,53,,,49,,Sprint,VAN NESTE WillySonolor - Gitane,30,Sonolor - Gitane,,,,",,0:27"
5,Stage 6a,54,,,9,,Sprint,SPRUYT JosMolteni,31,Molteni,,,,",,0:27"
6,Stage 6a,55,,,21,,GC,THÉVENET BernardPeugeot - BP - Michelin,26,Peugeot - BP - Michelin,,,,",,0:27"
7,Stage 6a,56,,,26,,Sprint,MOLLET AndréPeugeot - BP - Michelin,24,Peugeot - BP - Michelin,,,,",,0:27"
8,Stage 6a,57,,,85,,GC,MILLARD JoëlMerlin Plage - Shimano - Flandria,28,Merlin Plage - Shimano - Flandria,,,,",,0:27"
9,Stage 6a,58,,,32,,GC,AJA GonzaloKas - Kaskol,27,Kas - Kaskol,,,,",,0:27"
10,Stage 6a,59,,,3,,Classic,DELCROIX LudoMolteni,23,Molteni,,,,",,0:27"
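
In the table above, the scraped `Rider` column has the team name appended directly to the rider's name (for example `PINGEON RogerJobo - Lejeune`). Since the `Team` column holds the same team string, one way to recover the rider name is to strip that suffix. `splitRider` is a hypothetical helper sketched for illustration.

```julia
# Hypothetical helper: remove the team name appended to the rider name
function splitRider(rider::AbstractString, team::AbstractString)
    # leave the value untouched if it does not end with the team name
    endswith(rider, team) || return String(rider)
    # `chop` counts characters, so accented names are handled safely
    return String(chop(rider; tail = length(team)))
end

println(splitRider("PINGEON RogerJobo - Lejeune", "Jobo - Lejeune"))
println(splitRider("THÉVENET BernardPeugeot - BP - Michelin", "Peugeot - BP - Michelin"))
```

On a DataFrame this could be applied row-wise with broadcasting, e.g. `splitRider.(df.Rider, df.Team)`, assuming neither column contains `missing`.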


## What to scrape?

Thinking about what would be most useful for delving into the data, I decided that each rider's progression through every stage of the race would be a good start, and that information on the teams and their make-up for each year would also be intriguing.

Extracting (some of) the stage information is relatively simple thanks to the [TableScraper](https://github.com/xiaodaigh/TableScraper.jl) package. There are some issues there, which I'll get to later. However, the team information is unfortunately not so easy. While the majority of listed data is shown in tables, the data post-1929 is given as lists instead. So, we'll write something to work around this and put the data together in a readable format. In Julia, the package I have the most experience and familiarity with is `DataFrames.jl`, so that's what I'll use.

In [26]:
url="https://www.procyclingstats.com/race/tour-de-france/1950/startlist/"
r = HTTP.get(url)
r_parsed = parsehtml(String(r.body))
list_elems = eachmatch(sel".startlist_v4", r_parsed.root)
# eachmatch(sel"tr th", tables_elems[1])

length(list_elems[1].children)

14

In [9]:
# info to extract: rider name; rider team; rider nationality; rider age; rider height (if available); rider weight (if available)
# arg given: year

yr = 1929

# step 1: build url
url = "https://www.procyclingstats.com/race/tour-de-france/" * string(yr) * "/startlist/"
r = HTTP.get(url) # fetch the page
r_parsed = parsehtml(String(r.body)) # parse the HTML
# determine if full list available
Gumbo.children.(eachmatch(sel"h3", r_parsed.root))[1][1].text == "Individual participants"


true

In [157]:
table = eachmatch(sel"table", r_parsed.root)[1]

test = eachmatch(sel"tbody tr", table)
# tst = eachmatch(sel"th", test[1])
# occursin.("team",nodeText.(tst))

name = nodeText(eachmatch(sel"td",test[1])[3])
team = nodeText((eachmatch(sel"td",test[1]))[4])
nat = getattr(eachmatch(sel"td",test[1])[3].children[1],"class")[end-1:end]

# getattr(eachmatch(sel"td",test[1])[3],"class")


"it"

In [10]:
function fullListDetails(row)
    name = nodeText(eachmatch(sel"td",row)[3])
    name = match(r"\b[\p{Lu}]([\p{Ll}]+)*\b", name).match * " " * match(r"\b[\p{Lu}]+(?:\s+[\p{Lu}]+)*\b", name).match # reformat name
    team = nodeText((eachmatch(sel"td",row))[4])
    nat = getattr(eachmatch(sel"td",row)[3].children[1],"class")[end-1:end]
    return name, team, nat
end

fullListDetails (generic function with 1 method)

In [235]:
scrape_tables("https://www.procyclingstats.com/race/tour-de-france/1929")[2] |> DataFrame

Row,Date,Unnamed: 2_level_0,Stage,Winner
Unnamed: 0_level_1,String,String,String,String
1,30/06,,Stage 1 | Paris - Caen (206 km),DOSSCHE Aimé
2,01/07,,Stage 2 | Caen - Cherbourg (140 km),LEDUCQ André
3,02/07,,Stage 3 | Cherbourg - Dinan (199 km),TAVERNE Omer
4,03/07,,Stage 4 | Dinan - Brest (206 km),DELANNOY Louis
5,04/07,,Stage 5 | Brest - Vannes (208 km),VAN SLEMBROUCK Gustave
6,05/07,,Stage 6 | Vannes - Les Sables d'Olonne (206 km),LE DROGO Paul
7,06/07,,Stage 7 | Les Sables d'Olonne - Bordeaux (285 km),FRANTZ Nicolas
8,07/07,,Stage 8 | Bordeaux - Bayonne (182 km),MOINEAU Julien
9,08/07,,Restday,
10,09/07,,Stage 9 | Bayonne - Luchon (363 km),CARDONA Salvador


In [190]:
names = fullListDetails.(test)
names[16]

("Camille VAN DE CASTEELE", "J.B. Louvet", "be")

In [None]:
function getRiderInfo(name)
    # generate url
    url = "https://www.procyclingstats.com/rider/" * replace(lowercase(name), " " => "-")
    r = HTTP.get(url) # fetch the page
    r_parsed = parsehtml(String(r.body)) # parse the HTML
    # date of birth as text, e.g. "22 June 1979"; the selector depends on the current page layout
    bday = split(join(nodeText.(eachmatch(sel"div div div div div div",r_parsed.root)[1].children[3].children[[2,4]]))[2:end]," (")[1]
    return bday
end

In [231]:
url = "https://www.procyclingstats.com/rider/" * replace(lowercase("thomas-voeckler"), " " => "-")
r = HTTP.get(url) # scrape data
r_parsed = parsehtml(String(r.body)) # parse to string
split(join(nodeText.(eachmatch(sel"div div div div div div",r_parsed.root)[1].children[3].children[[2,4]]))[2:end]," (")[1]


"22 June 1979"

In [27]:
using Cascadia
url="https://www.procyclingstats.com/race/tour-de-france/1929/startlist/"
r = HTTP.get(url)
r_parsed = parsehtml(String(r.body))
table_elems = eachmatch(sel"table", r_parsed.root)

test = []
# for row in eachmatch(sel"tbody tr", table_elems[1])
#     tst = eachmatch(sel"td", row)



nodeText(tst[1])
tst[3]
# getattr(tst[3],"span")
replace(string(getattr(tst[3].children[1],"class")),"flag " => "")

[nodeText(tst[1]),nodeText(tst[2]),replace(string(getattr(tst[3].children[1],"class")),"flag " => ""),nodeText(tst[3].children[2].children[1])]

UndefVarError: UndefVarError: tst not defined

In [22]:
url="https://www.procyclingstats.com/race/tour-de-france/1929/startlist/"
r = HTTP.get(url)
r_parsed = parsehtml(String(r.body))
DataFrame(scrape_tables(url)[1])


Row,#,BIB,Rider,Contract team
Unnamed: 0_level_1,String,String,String,String
1,1,-,CRIPPA Alfonso,La Rafale
2,2,-,INNOCENTI Settimo,La Rafale
3,3,-,MARA Michele,La Rafale
4,4,-,ORECCHIA Michele,La Rafale
5,5,-,PANCERA Giuseppe,La Rafale
6,6,-,POMPOSI Mario,La Rafale
7,7,-,ROVIDA Carlo,La Rafale
8,8,-,BIDOT Marcel,La Française - Diamant - Dunlop
9,9,-,BUYSSE Lucien,J.B. Louvet
10,10,-,DE CORTE Raymond,J.B. Louvet


In [23]:

# findfirst(contains.([getattr(x,"href") for x in eachmatch(sel"a", r_parsed.root)],"team/"))
eachmatch(sel"a", r_parsed.root)[25]

HTMLElement{:a}:<a href="team/alpecin-deceuninck-2023">
  Alpecin-Deceuninck
</a>



In [49]:
test =[split(x," | ")[1] for x in DataFrame(scrape_tables("https://www.procyclingstats.com/race/tour-de-france/" * string(1974))[2]).Stage]

29-element Vector{SubString{String}}:
 "Prologue"
 "Stage 1"
 "Stage 2"
 "Stage 3"
 "Stage 4"
 "Stage 5"
 "Stage 6a"
 "Stage 6b (TTT)"
 "Stage 7"
 "Stage 8a"
 ⋮
 "Stage 16"
 "Stage 17"
 "Stage 18"
 "Stage 19a"
 "Stage 19b (ITT)"
 "Stage 20"
 "Stage 21a"
 "Stage 21b (ITT)"
 "Stage 22"

In [625]:
out

Row,Stage,Rnk,GC,Timelag,BIB,H2H,Specialty,Rider,Age,Team,UCI,Pnt,Unnamed: 13_level_0,Time,Avg,Pos.,Timegap,Speed,PCS points,UCI points
Unnamed: 0_level_1,String,String?,String?,String?,String?,String?,String?,String?,String?,String,String?,String?,String?,String,String?,String?,String?,String?,String?,String?
1,Prologue,1,1,+0:00,1,,TT,INDURAIN MiguelBanesto,27,Banesto,,100,,0:09:22,51.246,missing,missing,missing,missing,missing
2,Prologue,2,2,+0:02,89,,TT,ZÜLLE AlexO.N.C.E. - Look - Mavic,23,O.N.C.E. - Look - Mavic,,70,,0:020:02,51.064,missing,missing,missing,missing,missing
3,Prologue,3,3,+0:03,47,,TT,MARIE ThierryCastorama,29,Castorama,,50,,0:030:03,50.973,missing,missing,missing,missing,missing
4,Prologue,4,4,+0:04,136,,TT,NIJDAM JelleBuckler - Colnago - Decca,28,Buckler - Colnago - Decca,,40,,0:040:04,50.883,missing,missing,missing,missing,missing
5,Prologue,5,5,+0:11,4,,TT,DE LAS CUEVAS ArmandBanesto,24,Banesto,,32,,0:110:11,50.262,missing,missing,missing,missing,missing
6,Prologue,6,6,+0:12,6,,TT,GARMENDIA AitorBanesto,24,Banesto,,26,,0:120:12,50.174,missing,missing,missing,missing,missing
7,Prologue,7,7,+0:12,212,,GC,ALCALÁ RaúlPDM - Ultima - Concorde,28,PDM - Ultima - Concorde,,22,,",,0:12",50.174,missing,missing,missing,missing,missing
8,Prologue,8,8,+0:12,11,,Classic,BUGNO GianniGatorade - Chateau d'Ax,28,Gatorade - Chateau d'Ax,,18,,",,0:12",50.174,missing,missing,missing,missing,missing
9,Prologue,9,9,+0:12,138,,Classic,VAN HOOYDONCK EdwigBuckler - Colnago - Decca,25,Buckler - Colnago - Decca,,14,,",,0:12",50.174,missing,missing,missing,missing,missing
10,Prologue,10,10,+0:13,103,,TT,EKIMOV ViatcheslavPanasonic - Sportlife,26,Panasonic - Sportlife,,10,,0:130:13,50.087,missing,missing,missing,missing,missing


In [601]:
replace(out.Timelag[1], "+" => "")

MethodError: MethodError: no method matching _replace!(::Base.var"#new#352"{Tuple{Pair{String, String}}}, ::String, ::String, ::Int64)
Closest candidates are:
  _replace!(::Union{Function, Type}, !Matched::AbstractArray, !Matched::AbstractArray, ::Int64) at set.jl:716
  _replace!(::Union{Function, Type}, !Matched::Dict{K, V}, !Matched::AbstractDict, ::Int64) where {K, V} at set.jl:749
  _replace!(::Union{Function, Type}, !Matched::Set{T}, !Matched::AbstractSet, ::Int64) where T at set.jl:781
  ...

In [595]:
out.Timelag = chop.(out.Timelag,head=1,tail=0)
out.Timelag[findall(findall(isnothing.(match.(r"(:\d\d:\d\d)",out.Timelag))))]

3680-element Vector{SubString{String}}:
 "0:00"
 "0:02"
 "0:03"
 "0:04"
 "0:11"
 "0:12"
 "0:12"
 "0:12"
 "0:12"
 "0:13"
 ⋮
 "2:10:14"
 "2:49:32"
 "2:12:33"
 "1:43:24"
 "2:15:42"
 "2:36:55"
 "2:35:27"
 "2:32:38"
 "3:23:44"

In [465]:
standard = DataFrame(scrape_tables("https://www.procyclingstats.com/race/tour-de-france/" * string(2015) *"/" * lowercase(replace("Stage 2"," " => "-")))[1])[1:5,:]

Row,Rnk,GC,Timelag,BIB,H2H,Specialty,Rider,Age,Team,UCI,Pnt,Unnamed: 12_level_0,Time
Unnamed: 0_level_1,String,String,String,String,String,String,String,String,String,String,String,String,String
1,1,13,+0:59,75,,Sprint,GREIPEL AndréLotto Soudal,32,Lotto Soudal,20,100,,3:29:03
2,2,4,+0:33,47,,Sprint,SAGAN PeterTinkoff - Saxo,25,Tinkoff - Saxo,10,70,,",,0:00"
3,3,1,+0:00,143,,TT,CANCELLARA FabianTrek Factory Racing,34,Trek Factory Racing,6,50,,",,0:00"
4,4,21,+1:24,112,,Sprint,CAVENDISH MarkEtixx - Quick Step,30,Etixx - Quick Step,4,40,,",,0:00"
5,5,6,+0:42,64,,Classic,OSS DanielBMC Racing Team,28,BMC Racing Team,2,32,,",,0:00"


In [495]:


DataFramesMeta.dropmissing!(test)

# findall((test.Timelag .== "") .& (test.Timelag .!= missing))


Row,Pos.,Team,Time,Timegap,Speed,PCS points,UCI points,Rnk,Rider,Timelag
Unnamed: 0_level_1,String,String,String,String,String,String,String,String,String,String
1,,BMC Racing Team,,,,20,,1,OSS Daniel,+0.00
2,,BMC Racing Team,,,,20,,1,DENNIS Rohan,+0.00
3,,BMC Racing Team,,,,20,,1,CARUSO Damiano,+0.00
4,,BMC Racing Team,,,,20,,1,QUINZIATO Manuel,+0.00
5,,BMC Racing Team,,,,20,,1,VAN AVERMAET Greg,+0.00
6,,BMC Racing Team,,,,20,,1,SCHÄR Michael,+0.00
7,,BMC Racing Team,,,,20,,1,SÁNCHEZ Samuel,+0.00
8,,BMC Racing Team,,,,20,,1,WYSS Danilo,+0.00
9,,BMC Racing Team,,,,20,,1,VAN GARDEREN Tejay,+0.00
10,,Team Sky,,,,16,,2,THOMAS Geraint,+0:01
