# Data Concatenation / Initial Cleaning
Combines the disparate JSON Lines files from [OpenBeta](https://github.com/OpenBeta/climbing-data/tree/next) into one large CSV containing only non-bouldering routes. Along the way, the column types are converted to be easier to work with, and several new columns are created: numeric grade columns, a year-established column, and indicator columns for whether the grade carries a plus or minus. The final dataset is saved to a CSV.

In [110]:
import pandas as pd
import numpy as np
import os
import json
import re
import datefinder

#so no data is hidden
pd.set_option('display.max_colwidth', 100)
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)

In [None]:
route_file_list = []

for dirpath, subdirs, filenames in os.walk('./open-beta-routes/'):
    route_file_list.extend([os.path.join(dirpath, name) for name in filenames if 'routes' in name])

In [18]:
#https://sundararamanp.medium.com/a-relatively-faster-approach-for-reading-json-lines-file-into-pandas-dataframe-90b57353fd38
df_list = []
for file in route_file_list:
    with open(file, 'r') as f:
        lines = f.read().splitlines()
    df_inter = pd.DataFrame(lines)
    df_inter.columns = ['json_element']
    df_list.append(pd.json_normalize(df_inter['json_element'].apply(json.loads)))

df_final = pd.concat(df_list)

Unnamed: 0,route_name,safety,fa,description,location,protection,mp_sector_id,mp_route_id,grade.YDS,grade.French,...,grade.Font,type.boulder,type.sport,type.tr,type.alpine,grade.yds_aid,type.aid,type.snow,type.mixed,type.ice
0,Gravel Pit,,Jason Milford/ Matt Schutz Spring 2020,[Goes up slab on bolts to steep corner on gear...,,"[Chains on top, can lower off easy. Pro to 3"" ...",119029240,119029258,5.12b/c,7b+,...,,,,,,,,,,
1,Random Impulse,,"""Unknown"" or",[Some fun moves broken up by a few scree fille...,[25 feet to the right of Deep Springs Education.],[A small assortment of cams and maybe a nut or...,119100232,119101118,5.7,5a,...,,,,,,,,,,
2,The Tick Wall,,"7, July 2020",[Bouldering. Approximately 14’ tall and 20’ or...,[Park at Sycamore Creek bridge and walk upstre...,[None. Bring your own pad.],119181845,119181945,V-easy,,...,3,True,,,,,,,,
3,Orange Crush,,"Wade Griffith, Sterling Killion, Scott Williams",[Pretty cool orange arete that sports some int...,[The route is located on the far southern shou...,[7 QD's],105817198,105817201,5.11b/c,6c+,...,,,True,,,,,,,
4,Wimovi Wonder Winos,,Kroll and McHam,[Climb the most open looking slab on the more ...,"[Upper right of the wall, more facing the road.]",[Bolts],113627837,118979787,5.10-,6a,...,,,True,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2743,F-104,,unknown,[Well brushed steep slab at the lower corner o...,[The bottom corner of the west face.],"[5 bolts and gear, two bolt anchor for lowering.]",,,5.11b/c,6c+,...,,,,,,,,,,
2744,Echo Slab,,unknown,[Friction slab up past a bolt. Gear through a ...,"[From the main approach trail, when you hit th...","[One bolt. Then passive #2 BD cam, then red tr...",,,5.8,5b,...,,,,,,,,,,
2745,Dirty Boulevard,,unknown,[Face climb to the right of Mom's Meatfloaf. S...,[Right of Mom's Meatloaf. See approach on Mom'...,"[4 bolts, #2 TCU, bolted anchors.]",,,5.11a,6c,...,,,,,,,,,,
2746,Dime Store Mystery,,unknown,[Start's left of Mom's Meatloaf. Climb bolted ...,[To the left of Mom's meatloaf. See beta for a...,"[4 bolts, gear to 4"", bolted chain anchors.]",,,5.11a,6c,...,,,,True,,,,,,
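
The dotted column names in the output above (`grade.YDS`, `type.sport`, and so on) come from `pd.json_normalize` flattening the nested objects in each JSON line. A rough stdlib-only sketch of that flattening follows. The record and the helper name `flatten` are hypothetical, made up for illustration, and are not part of pandas or the OpenBeta data.

```python
import json


def flatten(record, parent_key="", sep="."):
    """Flatten nested dicts into one dict with dotted keys (hypothetical helper)."""
    flat = {}
    for key, value in record.items():
        full_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            # recurse into nested objects such as "grade" or "type"
            flat.update(flatten(value, full_key, sep))
        else:
            # scalars and lists are kept as-is
            flat[full_key] = value
    return flat


# Hypothetical JSON line, loosely modeled on the columns shown above
line = (
    '{"route_name": "Example Route", '
    '"grade": {"YDS": "5.10a", "French": "6a"}, '
    '"type": {"sport": true}, '
    '"description": ["Fun climbing."]}'
)

flat_record = flatten(json.loads(line))
print(flat_record)
```

Keys that are absent from a record (for example `type.trad` on a sport route) simply do not appear in the flattened dict, which is why the concatenated frame has so many nulls in the `type.*` columns.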


In [19]:
df_final.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 211123 entries, 0 to 2747
Data columns (total 31 columns):
 #   Column                   Non-Null Count   Dtype 
---  ------                   --------------   ----- 
 0   route_name               211123 non-null  object
 1   safety                   211123 non-null  object
 2   fa                       211123 non-null  object
 3   description              211123 non-null  object
 4   location                 211123 non-null  object
 5   protection               211123 non-null  object
 6   mp_sector_id             36604 non-null   object
 7   mp_route_id              36604 non-null   object
 8   grade.YDS                206795 non-null  object
 9   grade.French             130665 non-null  object
 10  grade.Ewbanks            130665 non-null  object
 11  grade.UIAA               130665 non-null  object
 12  grade.ZA                 130665 non-null  object
 13  grade.British            130665 non-null  object
 14  type.trad             

### Drop Unnecessary Rows/Columns

In [None]:
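#slice to get rid of bouldering routes - wrong grading system
#reset the index so row labels are unique again after concatenating the per-file frames
df_final = df_final[df_final['type.boulder'].isna()].drop(columns=['type.boulder']).reset_index(drop=True)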

In [45]:
#slice to get rid of aid routes - wrong grading system
df_final = df_final[df_final['grade.yds_aid'].isna()]

In [43]:
df_final.isna().sum()

route_name                      0
safety                          0
fa                              0
description                     0
location                        0
protection                      0
mp_sector_id               110089
mp_route_id                110089
grade.YDS                    4328
grade.French                 4328
grade.Ewbanks                4328
grade.UIAA                   4328
grade.ZA                     4328
grade.British                4328
type.trad                   70931
metadata.left_right_seq         0
metadata.parent_lnglat          0
metadata.parent_sector          0
metadata.mp_route_id            0
metadata.mp_sector_id           0
metadata.mp_path                0
grade.Font                 133993
type.sport                  70627
type.tr                    114457
type.alpine                128061
grade.yds_aid              132274
type.aid                   132274
type.snow                  133306
type.mixed                 132612
type.ice      

A lot of this information can be dropped. I would be tempted to use the `thresh` parameter of `dropna`, but a number of columns with many null values still need to be explored, because the route type likely has a large effect on the grade. Columns to drop:
- mp_sector_id: Missing for many rows, and the full sector ids are stored in the metadata.mp_sector_id
- mp_route_id: Same story as sector id
- any grade not grade.YDS: these grades all have a direct mapping, and YDS is the one we'll be working in
- metadata.left_right_seq: this tells if the routes are listed from left to right on the area page.

The more esoteric type columns, such as aid, snow, mixed, and ice will likely be dropped, as well as alpine and potentially tr (top rope). I would like to explore those in the more formal EDA though.
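
To illustrate why a blanket threshold is risky here, this small sketch runs `dropna(axis=1, thresh=...)` on a toy frame. The data and the threshold of 3 are made up for illustration. The sparse `type.*` columns fall below the threshold and disappear, even though they carry real information.

```python
import numpy as np
import pandas as pd

# Toy frame: type columns are True where applicable and NaN otherwise
toy = pd.DataFrame({
    "route_name": ["A", "B", "C", "D"],
    "type.sport": [True, np.nan, True, np.nan],
    "type.ice": [np.nan, np.nan, np.nan, True],
})

# Keep only the columns that have at least 3 non-null values
kept = toy.dropna(axis=1, thresh=3)
print(list(kept.columns))

# Share of non-null values per column, useful for inspecting sparsity before dropping
print(toy.notna().mean())
```

Listing the columns to drop by name, as done in the next cell, avoids losing the sparse indicators by accident.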

In [48]:
#every grade column except grade.YDS (the case-sensitive match also catches grade.yds_aid)
cols_to_drop = [x for x in df_final.columns if "grade" in x and "YDS" not in x]
cols_to_drop.extend(['mp_route_id', 'mp_sector_id', 'metadata.left_right_seq'])
df_final = df_final.drop(columns=cols_to_drop)

### Clean up Column Dtypes
Next I would like to amend the dtypes of the columns. The NLP features are stored as lists of strings, which would be better joined into single strings. The type indicator columns should be binary 0/1 flags rather than a mix of True and NaN. The rest (apart from the `metadata.parent_lnglat` list) are already strings and can remain that way for now.

In [50]:
for col in df_final.columns:
    print(f"{col} dtype: {type(df_final.loc[1, col])}")

route_name dtype: <class 'str'>
safety dtype: <class 'str'>
fa dtype: <class 'str'>
description dtype: <class 'list'>
location dtype: <class 'list'>
protection dtype: <class 'list'>
grade.YDS dtype: <class 'str'>
type.trad dtype: <class 'bool'>
metadata.parent_lnglat dtype: <class 'list'>
metadata.parent_sector dtype: <class 'str'>
metadata.mp_route_id dtype: <class 'str'>
metadata.mp_sector_id dtype: <class 'str'>
metadata.mp_path dtype: <class 'str'>
type.sport dtype: <class 'float'>
type.tr dtype: <class 'float'>
type.alpine dtype: <class 'float'>
type.aid dtype: <class 'float'>
type.snow dtype: <class 'float'>
type.mixed dtype: <class 'float'>
type.ice dtype: <class 'float'>


In [None]:
type_cols = [x for x in df_final.columns if "type" in x]
for col in type_cols:
    #any non-null entry marks the route as that type; NaN means it is not
    df_final[col] = df_final[col].notna().astype('int64')

In [69]:
nlp_cols = ['description', 'location', 'protection']
for col in nlp_cols:
    df_final[col] = df_final[col].apply(lambda x: " ".join(x))

In [70]:
for col in df_final.columns:
    print(f"{col} dtype: {type(df_final.loc[1, col])}")

route_name dtype: <class 'str'>
safety dtype: <class 'str'>
fa dtype: <class 'str'>
description dtype: <class 'str'>
location dtype: <class 'str'>
protection dtype: <class 'str'>
grade.YDS dtype: <class 'str'>
type.trad dtype: <class 'numpy.int64'>
metadata.parent_lnglat dtype: <class 'list'>
metadata.parent_sector dtype: <class 'str'>
metadata.mp_route_id dtype: <class 'str'>
metadata.mp_sector_id dtype: <class 'str'>
metadata.mp_path dtype: <class 'str'>
type.sport dtype: <class 'numpy.int64'>
type.tr dtype: <class 'numpy.int64'>
type.alpine dtype: <class 'numpy.int64'>
type.aid dtype: <class 'numpy.int64'>
type.snow dtype: <class 'numpy.int64'>
type.mixed dtype: <class 'numpy.int64'>
type.ice dtype: <class 'numpy.int64'>


### Transforming YDS Grades
To compare grades, we need to do some processing to make the grades more computationally legible. There are a few components to the grades ([adapted with help from this article](https://www.sportrock.com/post/understanding-climbing-grades)):
- Class: 5
    - The class indicates the difficulty of the terrain, with 1 being flat land. While 4th class terrain may require ropes, technical rock climbs are all rated 5th class.
- Difficulty: .0-.15
    - Most true climbing starts at a grade of 5.4 or 5.5, with lower grades being seen mostly as scrambles.
- Letter: a-d
    - The letter grade indicates the sub-difficulty of the climb within the number difficulty. Letter grades are only used for 5.10 climbs and up. This is how we distinguish between easy and hard climbs within a grade that do not quite fall into the neighboring grades.
- +/-:
    - The + and - after the grade are similar to the letter system, but less precise. These may be used on any grade of climb, and are never used in conjunction with letters. A + on older routes (think pre-1980) can often be construed to mean that the real feel of the route is much more difficult than the given grade. In general, a + can be thought of as similar to the letter grade c/d, and a - can be thought of as similar to the letter grade a/b
- Risk Rating: PG, PG13, R, X
    - This conveys how run out a route is, and the true physical danger of the route if the lead climber were to take a fall. A climb with an X risk rating is a climb you -do not- want to whip on.
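
The components above can be pulled apart with a regular expression. This is a minimal sketch, assuming grade strings carry no risk rating or extra text. The pattern and the helper name `split_grade` are illustrative and not part of the original notebook.

```python
import re

# Class 5, a one- or two-digit difficulty, then an optional letter
# (single or split, e.g. "b" or "b/c") and an optional trailing + or -
YDS_PATTERN = re.compile(
    r"^5\.(?P<number>\d{1,2})(?P<letter>[a-d](?:/[a-d])?)?(?P<modifier>[+-])?$"
)


def split_grade(yds):
    """Return the grade components as a dict, or None if the string is not a plain YDS grade."""
    match = YDS_PATTERN.match(yds)
    if match is None:
        return None
    return {
        "number": int(match.group("number")),
        "letter": match.group("letter"),      # None when no letter is present
        "modifier": match.group("modifier"),  # None when there is no + or -
    }


for grade in ["5.7", "5.10-", "5.12b/c", "V-easy"]:
    print(grade, split_grade(grade))
```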

In [96]:
def parse_grade(yds):
    '''
    Takes in a YDS rating with no risk rating or extra text and returns that rating as a number
    Valid ratings take the form "5.(0-15)(a-d | a/b, b/c, c/d)" or "5.(0-15)(+ | -)"
    Letters add a fraction to the base grade: a=.2, a/b=.25, b=.4, b/c=.5, c=.6, c/d=.75, d=.8
    This function does not take into account + and -, opting to hold those as a separate predictive feature
    '''
    #remove +/-
    if yds[-1] in "+-":
        yds = yds[:-1]
    
    #reduce to difficulty grade, the 5. is not informational
    yds = yds.split(".")[1]
    
    #take care of split letter grades
    if "/" in yds:
        slashes = {
            "a/b" : .25,
            "b/c" : .5,
            "c/d" : .75
        }
        return int(yds[:-3]) + slashes[yds[-3:]]
    
    #take care of further letter grades
    if yds[-1] in 'abcd':
        letters = {
            "a" : .2,
            "b" : .4,
            "c" : .6,
            "d" : .8
        }
        return int(yds[:-1]) + letters[yds[-1]]
    
    #no letter grades, return base grade
    return int(yds)

In [98]:
#narrow down to only 5th class routes
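#raw string avoids an invalid-escape warning; na=False treats missing grades as non-matches
#.copy() so the column assignments below act on an independent frame rather than a slice
df_final = df_final[df_final['grade.YDS'].str.contains(r"5\.", na=False)].copy()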

In [99]:
df_final['grade.YDS']

0         5.12b/c
1             5.7
2         5.11b/c
3           5.10-
4           5.10-
           ...   
133988    5.11b/c
133989        5.8
133990      5.11a
133991      5.11a
133992        5.6
Name: grade.YDS, Length: 127513, dtype: object

In [101]:
df_final.loc[:, 'grade_numeric'] = df_final['grade.YDS'].apply(parse_grade)

In [106]:
#create plus, minus, and plus_minus secondary target columns - not sure which version to use
df_final.loc[:, 'plus'] = df_final['grade.YDS'].apply(lambda x: 1 if x[-1] == '+' else 0) 
df_final.loc[:, 'minus'] = df_final['grade.YDS'].apply(lambda x: 1 if x[-1] == '-' else 0)
df_final.loc[:, 'plus_minus'] = df_final["grade.YDS"].apply(lambda x: 1 if x[-1] == '+' else (-1 if x[-1] == '-' else 0))


In [107]:
def parse_grade_plus_minus(yds):
    '''
    Takes in a YDS rating with no risk rating or extra text and returns that rating as a number
    Valid ratings take the form "5.(0-15)(a-d | a/b, b/c, c/d)" or "5.(0-15)(+ | -)"
    This function takes into account + and -, treating + like c/d (+.75) and - like a/b (+.25)
    '''
    plus_minus_map = {'+': 'c/d',
                      '-': 'a/b'}
    #map +/- to letter
    if yds[-1] in "+-":
        yds = yds[:-1] + plus_minus_map[yds[-1]]
        
    
    #reduce to difficulty grade, the 5. is not informational
    yds = yds.split(".")[1]
    
    #take care of split letter grades
    if "/" in yds:
        slashes = {
            "a/b" : .25,
            "b/c" : .5,
            "c/d" : .75
        }
        return int(yds[:-3]) + slashes[yds[-3:]]
    
    #take care of further letter grades
    if yds[-1] in 'abcd':
        letters = {
            "a" : .2,
            "b" : .4,
            "c" : .6,
            "d" : .8
        }
        return int(yds[:-1]) + letters[yds[-1]]
    
    #no letter grades, return base grade
    return int(yds)

In [108]:
df_final.loc[:, 'grade_numeric_plus_minus'] = df_final['grade.YDS'].apply(parse_grade_plus_minus)
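
To see how the two encodings differ, here is a condensed sketch that puts them side by side on a few sample grades. It is a compact re-implementation written for illustration (the name `grade_to_numeric` and the `fold_plus_minus` flag are hypothetical), not the notebook's own functions, though it is intended to give the same values for valid inputs.

```python
# Fractions added to the base grade, matching the values used in the parsers above
LETTER_VALUES = {
    "a": 0.2, "a/b": 0.25, "b": 0.4, "b/c": 0.5,
    "c": 0.6, "c/d": 0.75, "d": 0.8,
}
PLUS_MINUS_AS_LETTER = {"+": "c/d", "-": "a/b"}


def grade_to_numeric(yds, fold_plus_minus=False):
    """Convert a plain YDS grade to a number, optionally folding +/- into the value."""
    modifier = yds[-1] if yds[-1] in "+-" else ""
    body = yds[:-1] if modifier else yds
    body = body.split(".", 1)[1]       # drop the uninformative "5."
    digits = body.rstrip("abcd/")      # e.g. "12" from "12b/c"
    letter = body[len(digits):]        # e.g. "b/c", or "" when absent
    if fold_plus_minus and modifier:
        letter = PLUS_MINUS_AS_LETTER[modifier]
    return int(digits) + LETTER_VALUES.get(letter, 0)


for grade in ["5.7", "5.9+", "5.10-", "5.11a", "5.12b/c"]:
    print(grade, grade_to_numeric(grade), grade_to_numeric(grade, fold_plus_minus=True))
```

With the plain encoding, 5.10- and 5.10 collapse to the same value and the modifier survives only in the `plus`/`minus` columns. With the folded encoding, 5.10- sits between 5.10 and 5.10b.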

### Pulling Year out of FA
Currently the FA (first ascent, i.e. the team that established the route) is stored as a freeform string, although generally the names of the first ascensionists are listed first, followed by the year. The FFA (first free ascent) is sometimes listed as well. Because there are so many possible unique first ascensionists (not even counting FA teams), my first goal is to pull out the year, and potentially pull out names later if I think it will truly help the model. For FA to be more useful, I believe a smaller subset of known route establishers would have to be compiled, using deeper subject-area knowledge than I have.

In [109]:
df_final['fa'].head()

0             Jason Milford/ Matt Schutz Spring 2020
1                                       "Unknown" or
2    Wade Griffith, Sterling Killion, Scott Williams
3                                    Kroll and McHam
4                                      Bryan Carroll
Name: fa, dtype: object

In [111]:
#https://stackoverflow.com/questions/3276180/extracting-date-from-a-string-in-python
def extract_year(fa_string):
    matches = list(datefinder.find_dates(fa_string))

    if matches:
        # date returned will be a datetime.datetime object. here we are only using the first match.
        date = matches[0]
        year = date.year
        return year
    
    return np.nan

df_final.loc[:, 'year_established'] = df_final['fa'].apply(extract_year)

In [113]:
df_final['year_established'].notna().sum()

22533

In [112]:
df_final['year_established'].head()

0   NaN
1   NaN
2   NaN
3   NaN
4   NaN
Name: year_established, dtype: float64

A quick visual inspection shows that this captured many years, but row 0 contains a standalone year (2020) that `datefinder` missed. A regex pass picks up cases like this one.

In [114]:
def regex_year(fa_string, year_established):
    if year_established and not np.isnan(year_established):
        return year_established
    match = re.search(r'\d{4}', fa_string)
    if match:
        year = match.group()
        return int(year)
    return np.nan

df_final.loc[:, 'year_established'] = df_final.apply(lambda x: regex_year(x['fa'], x['year_established']), axis=1)

In [119]:
df_final['year_established'].notna().sum()

44041

In [117]:
df_final['year_established'].head()

0    2020.0
1       NaN
2       NaN
3       NaN
4       NaN
Name: year_established, dtype: float64

Now standalone four-digit years are captured as well! Although only about a third of the rows (44,041 of 127,513) have a year established, this is still potentially a feature for modeling.
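
One caveat: the `\d{4}` pattern in `regex_year` matches any run of four digits, including the first four digits of a longer number. A stricter variant might look like the sketch below. The 1800-2099 range and the name `find_year` are assumptions made for illustration, and swapping this in would change the counts reported above.

```python
import re

# Four-digit years from 1800 to 2099 that stand alone as a token;
# the year range is an assumption, not something taken from the data
YEAR_PATTERN = re.compile(r"\b(?:18|19|20)\d{2}\b")


def find_year(fa_string):
    """Return the first plausible standalone year in the string, or None."""
    match = YEAR_PATTERN.search(fa_string)
    return int(match.group()) if match else None


samples = [
    "Jason Milford/ Matt Schutz Spring 2020",
    "Kroll and McHam",
    "119029258",  # a long id, which the bare \d{4} pattern would read as 1190
]
for fa in samples:
    print(repr(fa), find_year(fa))
```

Two-digit years such as '85 would still be missed by either pattern.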

In [120]:
#final save to csv
df_final.to_csv("./data/routes.csv", index=False)