# Statistical Principles of Data Science - Group Project
## What makes a good climber?
**Hand-In Date**: xx.xx.xxxx <br/><br/>
Christina Kohlbacher, k11824719<br/>
David Obermann, k11717395<br/>
Fabio Pernegger, k11714227<br/>
Richard Wolfmayr, k11714228

## Imports

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder
from sklearn import preprocessing
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn import tree
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

## Load Data Set

In [2]:
random_state = 1337  # shared seed for all splits and models
df_climber = pd.read_csv('climber_df.csv')
df_climber_orig = df_climber.copy()
df_climber.head()  # last expression in the cell, so the preview is displayed

FileNotFoundError: [Errno 2] No such file or directory: 'climber_df.csv'

In [None]:
df_routes = pd.read_csv('routes_rated.csv')
df_routes_orig = df_routes.copy()
df_routes.head()

In [None]:
df_grades = pd.read_csv('grades_conversion_table.csv')
df_grades_orig = df_grades.copy()
df_grades.head()

## Data Understanding - Exploratory Analysis

We first take a look at the climbers dataframe by printing its `info()`.

As the output below shows, there are no missing values in the data set.

In [None]:
df_climber.info()
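The absence of missing values can also be checked explicitly instead of reading it off the `info()` output. The following is a minimal sketch on a made-up toy frame (the values are hypothetical and only stand in for `df_climber`):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for df_climber (all values are made up).
toy = pd.DataFrame({
    "height": [170, 182, np.nan],
    "weight": [60, 75, 68],
    "country": ["AUT", None, "ESP"],
})

# Number of missing entries per column; all zeros would mean no missing values.
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Single yes/no answer for the whole frame.
has_missing = bool(toy.isna().any().any())
print(has_missing)
```

Applied to the real data, `df_climber.isna().sum()` should then show only zeros.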

Next, we want to get a description with basic statistical measures of the features.

In [None]:
df_climber[['height', 'weight', 'age', 'years_cl', 'grades_count', 'grades_first', 
            'grades_last', 'grades_max', 'grades_mean', 'year_first', 'year_last']].describe()

The mode of the nominal features is shown below.

In [None]:
df_climber[['country', 'sex']].mode(axis=0)

Let's look at the specific features and their distributions explicitly.

In [None]:
def plot_description(title, xlabel, ylabel):
    plt.title(title)
    plt.xlabel(xlabel)
    plt.ylabel(ylabel)
    #plt.show()

In [None]:
def plot_my_boxplot(col, unit):
    # one boxplot per sex, side by side (0 = male, 1 = female)
    fig, axs = plt.subplots(1, 2, figsize=(10, 5))
    for ax, sex, label in zip(axs, [0, 1], ['male', 'female']):
        df_climber[df_climber.sex == sex][col].plot(kind='box', ax=ax)
        ax.set_title(f'{col} distribution {label}')
        ax.set_xlabel(' ')  # previously the female x-label was set on axs[0] by mistake
        ax.set_ylabel(f'{col} {unit}')

In [None]:
df_climber.sex.value_counts().plot(kind='bar')
plot_description('sex distribution', 'sex', 'count')

As the plot above shows, the data set is highly imbalanced with respect to sex. We therefore decided to split the data into two dataframes and to build the models for each group separately, because different features might be important for each of them and the features are distributed differently. The exploratory analysis is also carried out for both groups.
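To put a number on the imbalance, the share of each group and a per-group summary can be computed directly. This is a small sketch on made-up values; the toy frame and its numbers are hypothetical, only the column names `sex` and `height` follow the data set:

```python
import pandas as pd

# Hypothetical toy data: three male (0) and one female (1) climber.
toy = pd.DataFrame({
    "sex": [0, 0, 0, 1],
    "height": [180, 175, 185, 165],
})

# Relative frequency of each sex.
share = toy["sex"].value_counts(normalize=True)
print(share)

# Per-group count and mean of one feature.
per_group = toy.groupby("sex")["height"].agg(["count", "mean"])
print(per_group)
```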

In [None]:
fig, axs = plt.subplots(2, 1, figsize=(10, 12))
# label each axis explicitly; plt.title() and friends only act on the current (last) axis
df_climber[df_climber.sex == 0].country.value_counts().sort_index().plot(kind='bar', ax=axs[0])
axs[0].set(title='country distribution male', xlabel='country code', ylabel='count')
df_climber[df_climber.sex == 1].country.value_counts().sort_index().plot(kind='bar', ax=axs[1])
axs[1].set(title='country distribution female', xlabel='country code', ylabel='count')

In [None]:
plot_my_boxplot('height', 'in cm')

In [None]:
plot_my_boxplot('weight', 'in kg')

In [None]:
plot_my_boxplot('age', 'in years')

In [None]:
plot_my_boxplot('years_cl', 'in years')

In [None]:
plot_my_boxplot('grades_count', '')

In [None]:
plot_my_boxplot('grades_first', '')

In [None]:
plot_my_boxplot('grades_last', '')

In [None]:
plot_my_boxplot('grades_max', '')

In [None]:
plot_my_boxplot('grades_mean', '')

In [None]:
plot_my_boxplot('year_first', '')

Climber rows with `year_first` below 1950 should be omitted from the data set, since values such as 0 or 1100 are not realistic years.

In [None]:
plot_my_boxplot('year_last', '')

In [None]:
df_climber[['country','grades_mean']][df_climber.sex == 1].groupby('country').mean().sort_values('grades_mean', ascending=False).plot(kind='bar', title='avg grades mean per country (female)', ylabel = 'average grades mean', legend=False, figsize=(15,5))
plt.show()
df_climber[['country','grades_mean']][df_climber.sex == 0].groupby('country').mean().sort_values('grades_mean', ascending=False).plot(kind='bar', title='avg grades mean per country (male)', ylabel = 'average grades mean', legend=False, figsize=(15,5))
plt.show()

In [None]:
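# numeric_only=True skips the string column 'country', which newer pandas versions no longer drop silently
corr_plt = df_climber.drop(columns=['user_id', 'grades_max', 'grades_first', 'grades_last']).corr(numeric_only=True)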
corr_plt.style.background_gradient(cmap='coolwarm').format(precision=2)

## Preprocessing

Drop the rows with `year_first` < 1950.

In [None]:
df_climber.shape

In [None]:
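# keep 1950 itself, since only years below 1950 are considered unrealistic; copy() avoids chained-assignment warnings later
df_climber = df_climber[df_climber.year_first >= 1950].copy()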

In [None]:
df_climber.shape

3 rows were dropped
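Instead of comparing the two `shape` outputs by hand, the number of dropped rows can be computed in the same cell. A minimal sketch on a made-up toy frame (the years are hypothetical):

```python
import pandas as pd

# Hypothetical toy data with two implausible first years.
toy = pd.DataFrame({"year_first": [0, 1100, 1998, 2005, 2010]})

rows_before = len(toy)
toy = toy[toy["year_first"] >= 1950].copy()  # keep only plausible years
rows_dropped = rows_before - len(toy)

print(f"{rows_dropped} rows were dropped")
```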

In [None]:
df_climber.describe()

For the classification it would not make sense to use every single possible grade as a class, since the grade ids range from 0 to 85. This is why we decided to discretise `grades_mean` into three distinct classes: beginner = 0, enthusiast (intermediate) = 1, pro (expert) = 2.
We used our "expert knowledge" to find the following borders of these three classes: <br>
Below 6c (grade id 45) -> beginner <br>
From 6c up to, but excluding, 8a (grade id 61) -> enthusiast <br>
8a and above (grade id >= 61) -> pro

In [None]:
# df_grades
beginner_upperbound = 45
intermediate_upperbound = 61
df_climber["grades_mean_discrete"] = 0
df_climber.loc[df_climber["grades_mean"]<beginner_upperbound, ["grades_mean_discrete"]] = 0
df_climber.loc[(df_climber["grades_mean"]>=beginner_upperbound) & (df_climber["grades_mean"]<intermediate_upperbound), ["grades_mean_discrete"]] = 1
df_climber.loc[(df_climber["grades_mean"]>=intermediate_upperbound), ["grades_mean_discrete"]] = 2
#df_climber.describe()
df_climber.head()
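The same binning could also be written in one step with `pd.cut`. The sketch below uses made-up grade values; only the borders 45 and 61 are taken from the text above:

```python
import pandas as pd

# Hypothetical mean grade ids (made-up values); 45 = 6c and 61 = 8a as above.
grades_mean = pd.Series([30.0, 44.9, 45.0, 60.9, 61.0, 70.0])

bins = [-float("inf"), 45, 61, float("inf")]
labels = [0, 1, 2]  # beginner, enthusiast, pro

# right=False makes every bin closed on the left, e.g. [45, 61) -> class 1.
discrete = pd.cut(grades_mean, bins=bins, labels=labels, right=False).astype(int)
print(discrete.tolist())  # [0, 0, 1, 1, 2, 2]
```

With `right=False` the borders behave like the `<` and `>=` comparisons in the cell above, so a mean grade of exactly 45 lands in class 1 and exactly 61 in class 2.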

In [None]:
# all climbers in the "pro" class
goodies = df_climber[df_climber.grades_mean_discrete == 2]
goodies[['country','grades_mean_discrete']][goodies.sex == 1].groupby('country').count().sort_values('grades_mean_discrete', ascending=False).plot(kind='bar', title='pro climbers per country (female)', ylabel = 'pro climber count', legend=False, figsize=(15,5))
plt.show()
goodies[['country','grades_mean_discrete']][goodies.sex == 0].groupby('country').count().sort_values('grades_mean_discrete', ascending=False).plot(kind='bar', title='pro climbers per country (male)', ylabel = 'pro climber count', legend=False, figsize=(15,5))
plt.show()

In [None]:
le = LabelEncoder()
le.fit(df_climber['country'])
df_climber['countryenc'] = le.transform(df_climber['country'])
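Note that `LabelEncoder` assigns integers in sorted label order, so `countryenc` carries an arbitrary numeric order that a linear model will treat as meaningful. A possible alternative, shown here only as a sketch on made-up country codes and not used in the rest of this notebook, is one-hot encoding with `pd.get_dummies`:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical country codes (made-up sample).
countries = pd.Series(["AUT", "ESP", "USA", "AUT"], name="country")

# Label encoding: one integer per country, in alphabetical order.
encoded = LabelEncoder().fit_transform(countries)
print(encoded.tolist())  # [0, 1, 2, 0]

# One-hot encoding: one indicator column per country, no implied order.
one_hot = pd.get_dummies(countries, prefix="country")
print(one_hot.columns.tolist())
```

The trade-off is one extra column per country, which can become a lot for a data set with many countries.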

In [None]:
df_climber_f = df_climber[df_climber.sex == 1]
df_climber_m = df_climber[df_climber.sex == 0]

df_climber_f.info(), df_climber_m.info()

## Data Modeling

Splitting for Regression Tasks:

In [None]:
x_column_names = ['countryenc', 'sex', 'height', 'weight', 'age', 'years_cl', 'grades_count', 'year_first', 'year_last']
X = df_climber[x_column_names]
y = df_climber.grades_mean
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1337)

x_column_names = ['countryenc', 'height', 'weight', 'age', 'years_cl', 'grades_count', 'year_first', 'year_last']
X_m = df_climber_m[x_column_names]
y_m = df_climber_m.grades_mean
X_train_m, X_test_m, y_train_m, y_test_m = train_test_split(X_m, y_m, test_size=0.3, random_state=1337)

X_f = df_climber_f[x_column_names]
y_f = df_climber_f.grades_mean
X_train_f, X_test_f, y_train_f, y_test_f = train_test_split(X_f, y_f, test_size=0.3, random_state=1337)

scaler = preprocessing.StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

scaler = preprocessing.StandardScaler().fit(X_train_m)
X_train_m_scaled = scaler.transform(X_train_m)
X_test_m_scaled = scaler.transform(X_test_m)

scaler = preprocessing.StandardScaler().fit(X_train_f)
X_train_f_scaled = scaler.transform(X_train_f)
X_test_f_scaled = scaler.transform(X_test_f)
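Fitting the scaler on the training split only, as done above, avoids leaking test information into the model. The same pattern can be written more compactly with a scikit-learn pipeline. The following is a sketch on synthetic data (the arrays `X_toy` and `y_toy` are made up and not related to the climber data):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: the target depends on the first two of three features.
rng = np.random.default_rng(1337)
X_toy = rng.normal(size=(200, 3))
y_toy = 2.0 * X_toy[:, 0] - 1.0 * X_toy[:, 1] + rng.normal(scale=0.1, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.3, random_state=1337)

# The pipeline fits the scaler on the training data only and reuses it for prediction.
model = make_pipeline(StandardScaler(), LinearRegression())
model.fit(X_tr, y_tr)

mse = mean_squared_error(y_te, model.predict(X_te))
print(f"Mean squared error: {mse:.3f}")
```

One pipeline per group (all, male, female) would replace the three separate scaler objects and remove the risk of reusing the wrong `scaler` variable.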

### Regression

In [None]:
linreg = LinearRegression()
linreg.fit(X_train_scaled, y_train)

preds_linreg = linreg.predict(X_test_scaled)
print("Coefficients: \n")
for n, c in zip(linreg.coef_,['countryenc', 'sex\t', 'height\t', 'weight\t', 'age\t', 'years_cl', 'grades_count', 'year_first', 'year_last']):
    print(c + ':\t' + str(n))
print("Mean squared error: %.2f" % mean_squared_error(y_test, preds_linreg))

In [None]:
linreg_m = LinearRegression()
linreg_m.fit(X_train_m_scaled, y_train_m)

preds_linreg_m = linreg_m.predict(X_test_m_scaled)
print("Coefficients: \n")
for n, c in zip(linreg_m.coef_,['countryenc', 'height\t', 'weight\t', 'age\t', 'years_cl', 'grades_count', 'year_first', 'year_last']):
    print(c + ':\t' + str(n))
print("Mean squared error: %.2f" % mean_squared_error(y_test_m, preds_linreg_m))

In [None]:
linreg_f = LinearRegression()
linreg_f.fit(X_train_f_scaled, y_train_f)

preds_linreg_f = linreg_f.predict(X_test_f_scaled)
print("Coefficients: \n")
for n, c in zip(linreg_f.coef_,['countryenc', 'height\t', 'weight\t', 'age\t', 'years_cl', 'grades_count', 'year_first', 'year_last']):
    print(c + ':\t' + str(n))
print("Mean squared error: %.2f" % mean_squared_error(y_test_f, preds_linreg_f))

Interpreting the results:

Because the features were standardised, the magnitudes of the coefficients can be compared across features.
For both males and females, `years_cl` has the highest coefficient, so we can draw the obvious conclusion that climbing for more years goes along with better performance.
This is hardly surprising, but what else can we see that helps us understand the data?
For females, the second most important feature seems to be `year_last`.
This suggests that female climbers have been getting better in recent years.
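Reading the ranking off the printed coefficients is easier when they are sorted by absolute value. The sketch below illustrates this on synthetic data; the feature names are borrowed from the data set, but the values and the relationship between them are made up:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Synthetic data: the target depends strongly on the first feature,
# weakly on the second and not at all on the third.
rng = np.random.default_rng(0)
features = ["years_cl", "year_last", "height"]
X = rng.normal(size=(300, 3)) * np.array([5.0, 3.0, 10.0])
y = 1.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=300)

# Standardise so that coefficient magnitudes are comparable across features.
X_scaled = StandardScaler().fit_transform(X)
linreg = LinearRegression().fit(X_scaled, y)

# Sort the coefficients by absolute value, largest first.
ranking = pd.Series(linreg.coef_, index=features).sort_values(key=np.abs, ascending=False)
print(ranking)
```

In the notebook, the same line with `linreg_m.coef_` or `linreg_f.coef_` and `x_column_names` as index would give the ranking per group.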

### Tree

In [None]:
x_column_names_c = ['countryenc', 'height', 'weight', 'age', 'years_cl', 'grades_count', 'year_first', 'year_last']

X_m_c = df_climber_m[x_column_names_c]
y_m_c = df_climber_m.grades_mean_discrete
X_train_m_c, X_test_m_c, y_train_m_c, y_test_m_c = train_test_split(X_m_c, y_m_c, test_size=0.3, random_state=random_state)

# use the female subset here; df_climber would train the "female" model on both sexes
X_f_c = df_climber_f[x_column_names_c]
y_f_c = df_climber_f.grades_mean_discrete
X_train_f_c, X_test_f_c, y_train_f_c, y_test_f_c = train_test_split(X_f_c, y_f_c, test_size=0.3, random_state=random_state)


scaler = preprocessing.StandardScaler().fit(X_train_m_c)
X_train_m_scaled_c = scaler.transform(X_train_m_c)
X_test_m_scaled_c = scaler.transform(X_test_m_c)

scaler = preprocessing.StandardScaler().fit(X_train_f_c)
X_train_f_scaled_c = scaler.transform(X_train_f_c)
X_test_f_scaled_c = scaler.transform(X_test_f_c)

In [None]:

# Fit a decision tree and plot the tree
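print("To understand the tree: the left branch is always True, the right branch is always False. E.g. for a split on age, all samples that satisfy the condition go left. Thresholds are in standardised units because the features were scaled.")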
# male tree
tree_m = tree.DecisionTreeClassifier(criterion="entropy", random_state=random_state)
tree_m = tree_m.fit(X_train_m_scaled_c,y_train_m_c)
plt.figure(figsize=(15,25))
tree.plot_tree(tree_m, max_depth=2, feature_names=x_column_names_c, fontsize=8, class_names=["beginner", "enthusiast", "pro"])

# female tree
tree_f = tree.DecisionTreeClassifier(criterion="entropy", random_state=random_state)
tree_f = tree_f.fit(X_train_f_scaled_c,y_train_f_c)
plt.figure(figsize=(15,25))
tree.plot_tree(tree_f, max_depth=2, feature_names=x_column_names_c, fontsize=8, class_names=["beginner", "enthusiast", "pro"])

In [None]:

# check accuracy
y_pred_m_c = tree_m.predict(X_test_m_scaled_c)
accuracy_m_c = accuracy_score(y_test_m_c.values, y_pred_m_c)

y_pred_f_c = tree_f.predict(X_test_f_scaled_c)
accuracy_f_c = accuracy_score(y_test_f_c, y_pred_f_c)

print(f"Accuracy for male tree: {accuracy_m_c}")
print(f"Accuracy for female tree: {accuracy_f_c}")

# feature importance
feature_importances_c = tree_m.feature_importances_
plt.figure(figsize=(10,5))
plt.bar([i for i in range(0, len(feature_importances_c))], feature_importances_c)
plt.xticks([i for i in range(0, len(x_column_names_c))], x_column_names_c)
plt.title(f'feature importance for Male Decision Tree')
plt.xlabel('features')
plt.ylabel('importance score')

feature_importances_c = tree_f.feature_importances_
plt.figure(figsize=(10,5))
plt.bar([i for i in range(0, len(feature_importances_c))], feature_importances_c)
plt.xticks([i for i in range(0, len(x_column_names_c))], x_column_names_c)
plt.title(f'feature importance for Female Decision Tree')
plt.xlabel('features')
plt.ylabel('importance score')
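With three classes of unequal size, the accuracy values are easier to judge against a trivial baseline that always predicts the most frequent class. A minimal sketch with made-up labels (the class counts are hypothetical):

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Hypothetical, imbalanced class labels (0 = beginner, 1 = enthusiast, 2 = pro).
y_train = np.array([0] * 10 + [1] * 70 + [2] * 20)
y_test = np.array([0] * 3 + [1] * 21 + [2] * 6)

# The baseline ignores the features, so a constant placeholder column is enough.
X_train = np.zeros((len(y_train), 1))
X_test = np.zeros((len(y_test), 1))

baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
baseline_accuracy = accuracy_score(y_test, baseline.predict(X_test))
print(f"Baseline accuracy: {baseline_accuracy:.2f}")  # 21 of 30 test labels are class 1
```

A tree is only informative if its accuracy is clearly above this baseline for the respective group.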

### Forest

In [None]:
randforest_m = RandomForestClassifier(random_state=random_state)
randforest_m = randforest_m.fit(X_train_m_scaled_c, y_train_m_c)

randforest_f = RandomForestClassifier(random_state=random_state)
randforest_f = randforest_f.fit(X_train_f_scaled_c, y_train_f_c)

y_pred_m_c = randforest_m.predict(X_test_m_scaled_c)
accuracy_m_c = accuracy_score(y_test_m_c.values, y_pred_m_c)

y_pred_f_c = randforest_f.predict(X_test_f_scaled_c)
accuracy_f_c = accuracy_score(y_test_f_c, y_pred_f_c)

print(f"Accuracy for male forest: {accuracy_m_c}")
print(f"Accuracy for female forest: {accuracy_f_c}")

feature_importances_c = randforest_m.feature_importances_
plt.figure(figsize=(10,5))
plt.bar([i for i in range(0, len(feature_importances_c))], feature_importances_c)
plt.xticks([i for i in range(0, len(x_column_names_c))], x_column_names_c)
plt.title(f'feature importance for Male Random Forest')
plt.xlabel('features')
plt.ylabel('importance score')

feature_importances_c = randforest_f.feature_importances_
plt.figure(figsize=(10,5))
plt.bar([i for i in range(0, len(feature_importances_c))], feature_importances_c)
plt.xticks([i for i in range(0, len(x_column_names_c))], x_column_names_c)
plt.title(f'feature importance for Female Random Forest')
plt.xlabel('features')
plt.ylabel('importance score')
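The impurity-based `feature_importances_` plotted above are computed on the training data. As a cross-check, permutation importance on the test split could be used. The following sketch runs on synthetic data in which only the first feature carries signal; the data is made up and only the technique carries over:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data: the class depends on feature 0 only.
rng = np.random.default_rng(1337)
X = rng.normal(size=(400, 3))
y = (X[:, 0] > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=1337)
forest = RandomForestClassifier(n_estimators=50, random_state=1337).fit(X_tr, y_tr)

# Shuffle one feature at a time on the test split and measure the drop in accuracy.
result = permutation_importance(forest, X_te, y_te, n_repeats=5, random_state=1337)
print(result.importances_mean)
```

For the notebook, `randforest_m` with `X_test_m_scaled_c` and `y_test_m_c` (and the female counterparts) would take the place of the synthetic objects.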