This notebook analyzes an open-source powerlifting dataset to gain insight into the future of the sport by predicting how competitors' performance (TotalKg) depends on attributes such as Age, Sex, and bodyweight (BodyweightKg).

In [1]:
# Add dependencies
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score  # regression metric; TotalKg is continuous
from pathlib import Path
import warnings
import numpy as np
warnings.filterwarnings('ignore')

# Read the CSV and Perform Basic Data Cleaning

In [2]:
columns = ['Name', 'Sex', 'Event', 'Equipment', 'Age', 'AgeClass', 'Division',
       'BodyweightKg', 'WeightClassKg', 'Squat1Kg', 'Squat2Kg', 'Squat3Kg',
       'Squat4Kg', 'Best3SquatKg', 'Bench1Kg', 'Bench2Kg', 'Bench3Kg',
       'Bench4Kg', 'Best3BenchKg', 'Deadlift1Kg', 'Deadlift2Kg', 'Deadlift3Kg',
       'Deadlift4Kg', 'Best3DeadliftKg', 'TotalKg', 'Place', 'Wilks',
       'McCulloch', 'Glossbrenner', 'IPFPoints', 'Tested', 'Country',
       'Federation', 'Date', 'MeetCountry', 'MeetState', 'MeetName']

target = ['TotalKg']

In [3]:
# Read data into DataFrame
# Forward slashes avoid invalid backslash escapes and work on every platform
file_path = Path('../Resources/openpowerlifting.csv')
df = pd.read_csv(file_path)
df = df.loc[:, columns].copy()

In [4]:
df

Unnamed: 0,Name,Sex,Event,Equipment,Age,AgeClass,Division,BodyweightKg,WeightClassKg,Squat1Kg,...,McCulloch,Glossbrenner,IPFPoints,Tested,Country,Federation,Date,MeetCountry,MeetState,MeetName
0,Abbie Murphy,F,SBD,Wraps,29.0,24-34,F-OR,59.8,60,80.0,...,324.16,286.42,511.15,,,GPC-AUS,2018-10-27,Australia,VIC,Melbourne Cup
1,Abbie Tuong,F,SBD,Wraps,29.0,24-34,F-OR,58.5,60,100.0,...,378.07,334.16,595.65,,,GPC-AUS,2018-10-27,Australia,VIC,Melbourne Cup
2,Ainslee Hooper,F,B,Raw,40.0,40-44,F-OR,55.4,56,,...,38.56,34.12,313.97,,,GPC-AUS,2018-10-27,Australia,VIC,Melbourne Cup
3,Amy Moldenhauer,F,SBD,Wraps,23.0,20-23,F-OR,60.0,60,-105.0,...,345.61,305.37,547.04,,,GPC-AUS,2018-10-27,Australia,VIC,Melbourne Cup
4,Andrea Rowan,F,SBD,Wraps,45.0,45-49,F-OR,104.0,110,120.0,...,338.91,274.56,550.08,,,GPC-AUS,2018-10-27,Australia,VIC,Melbourne Cup
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1423349,Marian Cafalik,M,SBD,Raw,60.5,60-64,Masters 2,73.5,74,160.0,...,438.27,316.52,469.67,Yes,,PZKFiTS,2017-04-01,Poland,,Polish Classic Powerlifting Cup
1423350,Marian Piwowarczyk,M,SBD,Raw,55.5,55-59,Masters 2,63.5,66,90.0,...,372.60,295.66,423.03,Yes,Poland,PZKFiTS,2017-04-01,Poland,,Polish Classic Powerlifting Cup
1423351,Andrzej Bryniarski,M,SBD,Raw,62.5,60-64,Masters 2,94.4,105,140.0,...,382.36,264.22,378.84,Yes,,PZKFiTS,2017-04-01,Poland,,Polish Classic Powerlifting Cup
1423352,Stanisław Goroczko,M,SBD,Raw,63.5,60-64,Masters 2,80.8,83,-165.0,...,,,,Yes,,PZKFiTS,2017-04-01,Poland,,Polish Classic Powerlifting Cup


In [5]:
# Check the datatypes of the columns
df.dtypes

Name                object
Sex                 object
Event               object
Equipment           object
Age                float64
AgeClass            object
Division            object
BodyweightKg       float64
WeightClassKg       object
Squat1Kg           float64
Squat2Kg           float64
Squat3Kg           float64
Squat4Kg           float64
Best3SquatKg       float64
Bench1Kg           float64
Bench2Kg           float64
Bench3Kg           float64
Bench4Kg           float64
Best3BenchKg       float64
Deadlift1Kg        float64
Deadlift2Kg        float64
Deadlift3Kg        float64
Deadlift4Kg        float64
Best3DeadliftKg    float64
TotalKg            float64
Place               object
Wilks              float64
McCulloch          float64
Glossbrenner       float64
IPFPoints          float64
Tested              object
Country             object
Federation          object
Date                object
MeetCountry         object
MeetState           object
MeetName            object
dtype: object

Some columns, such as 'Place' and 'Date', need to be converted before they can be used in the model, and columns that will not be fit to the model can be dropped. For our analysis, we use Sex, Age, BodyweightKg, Best3SquatKg, Best3BenchKg, Best3DeadliftKg, and Date as the features and TotalKg as the target.

In [6]:
place_mask = df['Place'] == '1'
df = df.loc[place_mask]

event_mask = df['Event'] == 'SBD'
df = df.loc[event_mask]

age_mask = df['Age'] >= 18
df = df.loc[age_mask]

df = df.drop(['Equipment', 'AgeClass', 'Division','WeightClassKg','Squat1Kg','Squat2Kg','Squat3Kg','Squat4Kg', 'Bench1Kg', 'Bench2Kg', 'Bench3Kg',
       'Bench4Kg', 'Deadlift1Kg', 'Deadlift2Kg', 'Deadlift3Kg',
       'Deadlift4Kg','Name', 'Event',
       'McCulloch', 'Glossbrenner', 'IPFPoints', 'Tested', 'Country',
       'Federation','MeetCountry', 'MeetState', 'MeetName', 'Wilks'], axis=1)


Here, we filter the 'Place' column for values equal to '1', the 'Event' column for values equal to 'SBD', and the 'Age' column for values greater than or equal to 18. This keeps only the relevant samples: competitors aged 18 or older who placed 1st, with entries for the squat, bench, and deadlift. The 'Sex' column is converted to a category and then replaced by its integer codes. Because pandas orders categories alphabetically, '0' represents females ('F') and '1' represents males ('M').

In [7]:
df["Sex"] = df["Sex"].astype('category')
df["Sex"] = df["Sex"].cat.codes
df['Place'] = df['Place'].astype('int')
df['Date'] = pd.to_datetime(df['Date'])

df.insert(loc=0, column='ID', value=np.arange(len(df)))
df.dtypes

ID                          int32
Sex                          int8
Age                       float64
BodyweightKg              float64
Best3SquatKg              float64
Best3BenchKg              float64
Best3DeadliftKg           float64
TotalKg                   float64
Place                       int32
Date               datetime64[ns]
dtype: object
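
The filtering and encoding steps above can be checked on a few toy rows. This is a minimal sketch with hypothetical values, not the real dataset. It combines the three conditions into one boolean mask and prints which integer code each 'Sex' value receives.

```python
import pandas as pd

# Toy rows with hypothetical values, mimicking the columns used above
toy = pd.DataFrame({
    "Sex": ["F", "M", "M", "F"],
    "Event": ["SBD", "SBD", "B", "SBD"],
    "Age": [29.0, 17.5, 40.0, 23.0],
    "Place": ["1", "1", "1", "2"],
})

# Combine the three conditions into a single boolean mask
mask = (toy["Place"] == "1") & (toy["Event"] == "SBD") & (toy["Age"] >= 18)
filtered = toy.loc[mask].copy()  # .copy() gives an independent frame for later assignments

# Categories are sorted, so each code maps to a value in alphabetical order
sex = toy["Sex"].astype("category")
code_map = dict(enumerate(sex.cat.categories))
print(code_map)
print(len(filtered))
```

Inspecting `cat.categories` before calling `cat.codes` is a cheap way to confirm the meaning of each integer.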

In [8]:
df

Unnamed: 0,ID,Sex,Age,BodyweightKg,Best3SquatKg,Best3BenchKg,Best3DeadliftKg,TotalKg,Place,Date
6,0,0,23.0,59.8,125.0,70.0,150.0,345.0,1,2018-10-27
8,1,0,36.0,108.0,220.0,100.0,200.0,520.0,1,2018-10-27
9,2,0,37.0,74.8,200.0,95.0,180.0,475.0,1,2018-10-27
12,3,0,27.0,78.6,182.5,105.0,205.0,492.5,1,2018-10-27
16,4,0,50.0,55.2,137.5,70.0,182.5,390.0,1,2018-10-27
...,...,...,...,...,...,...,...,...,...,...
1423327,189809,1,26.5,99.6,305.5,195.0,400.0,900.5,1,2017-04-01
1423332,189810,1,24.5,116.9,295.0,195.0,320.0,810.0,1,2017-04-01
1423337,189811,1,27.5,137.3,270.0,205.0,295.0,770.0,1,2017-04-01
1423340,189812,1,40.5,58.7,183.0,142.5,190.0,515.5,1,2017-04-01


Next, the features (X) and the target (y) are defined.

In [10]:
# Determine y and x columns
X = pd.get_dummies(df, columns=['Sex','Age','Best3SquatKg','Best3BenchKg','Best3DeadliftKg','BodyweightKg','Date']).drop('TotalKg', axis=1)
y = df['TotalKg']
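
Passing continuous columns such as Age, the lift columns, and Date to `get_dummies` produces one indicator column per distinct value. As a hedged alternative (not the approach used in this notebook), the sketch below keeps the numeric columns as they are and reduces the date to a single numeric column. The toy rows, `feature_cols`, `X_alt`, `y_alt`, and the `Year` column are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical rows in the shape of the cleaned DataFrame
toy = pd.DataFrame({
    "Sex": [0, 1, 1],
    "Age": [23.0, 36.0, 27.5],
    "BodyweightKg": [59.8, 108.0, 74.8],
    "Best3SquatKg": [125.0, 220.0, 200.0],
    "Best3BenchKg": [70.0, 100.0, 95.0],
    "Best3DeadliftKg": [150.0, 200.0, 180.0],
    "TotalKg": [345.0, 520.0, 475.0],
    "Date": pd.to_datetime(["2018-10-27", "2018-10-27", "2017-04-01"]),
})

# Keep continuous features numeric instead of one-hot encoding them
feature_cols = ["Sex", "Age", "BodyweightKg",
                "Best3SquatKg", "Best3BenchKg", "Best3DeadliftKg"]
X_alt = toy[feature_cols].copy()

# Represent the date as one numeric column (here: the calendar year)
X_alt["Year"] = toy["Date"].dt.year
y_alt = toy["TotalKg"]
print(X_alt.shape)
```

This keeps the feature matrix at seven columns regardless of how many distinct ages, lifts, or meet dates appear in the data.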

In [11]:
X.describe()

Unnamed: 0,ID,Place,Sex_0,Sex_1,Age_18.0,Age_18.5,Age_19.0,Age_19.5,Age_20.0,Age_20.5,...,Date_2019-03-29 00:00:00,Date_2019-03-30 00:00:00,Date_2019-03-31 00:00:00,Date_2019-04-05 00:00:00,Date_2019-04-06 00:00:00,Date_2019-04-07 00:00:00,Date_2019-04-12 00:00:00,Date_2019-04-13 00:00:00,Date_2019-04-14 00:00:00,Date_2019-04-20 00:00:00
count,189814.0,189814.0,189814.0,189814.0,189814.0,189814.0,189814.0,189814.0,189814.0,189814.0,...,189814.0,189814.0,189814.0,189814.0,189814.0,189814.0,189814.0,189814.0,189814.0,189814.0
mean,94906.5,1.0,0.321236,0.678764,0.018407,0.029571,0.022749,0.023154,0.020373,0.025989,...,0.000248,0.002355,0.000348,0.000527,0.003883,0.000495,0.000395,0.001749,0.000174,1.6e-05
std,54794.726335,0.0,0.466952,0.466952,0.13442,0.169401,0.149101,0.150394,0.141272,0.159102,...,0.015734,0.048471,0.018644,0.022947,0.062191,0.022248,0.019874,0.041786,0.013184,0.003976
min,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,47453.25,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,94906.5,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,142359.75,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
max,189813.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


In [12]:
y.shape

(189814,)

In [13]:
# Check the distribution of our target values
y.value_counts()

600.0    1146
700.0    1027
500.0    1023
590.0    1002
580.0     988
         ... 
578.4       1
662.2       1
156.5       1
562.4       1
424.5       1
Name: TotalKg, Length: 3371, dtype: int64
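
With 3371 distinct values, `value_counts()` says little about a continuous target. A summary of the range and quartiles is often easier to read. The sketch below uses a hypothetical series `y_toy` rather than the real `y`.

```python
import pandas as pd

# Hypothetical totals standing in for the real target column
y_toy = pd.Series([345.0, 520.0, 475.0, 492.5, 390.0], name="TotalKg")

# describe() reports count, mean, std, min, quartiles, and max
summary = y_toy.describe()
print(summary[["min", "50%", "max"]])
```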

In [None]:
# Create training and test data
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
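
When no `test_size` is given, `train_test_split` holds out 25% of the rows for testing, and a fixed `random_state` makes the split reproducible. A small sketch on synthetic arrays (`X_demo` and `y_demo` are made up for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X_demo = np.arange(40).reshape(20, 2)  # 20 synthetic samples, 2 features
y_demo = np.arange(20)

# Default split: 75% train, 25% test
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=1)
print(X_tr.shape, X_te.shape)

# Same random_state, same split
X_tr2, X_te2, _, _ = train_test_split(X_demo, y_demo, random_state=1)
print(np.array_equal(X_te, X_te2))
```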

Because the target we are predicting (TotalKg) is continuous, we use linear regression as our supervised learning model. If we were predicting categorical, discrete outcomes, we would use logistic regression instead.
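
As a rough illustration of what the linear model can learn here: in the first row of the cleaned table above, the total equals the sum of the three best lifts (125.0 + 70.0 + 150.0 = 345.0). Assuming that relationship holds in general, a linear regression on the three lifts should recover coefficients close to 1 and an intercept close to 0. The sketch uses synthetic lifts, not the real data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Synthetic best squat, bench, and deadlift values in kg (assumption)
lifts = rng.uniform(50, 300, size=(200, 3))
total = lifts.sum(axis=1)  # total defined as the sum of the three lifts

reg = LinearRegression().fit(lifts, total)
print(np.round(reg.coef_, 3))
print(round(float(reg.intercept_), 3))
```

If the lift columns are included as features, the model has little left to learn from Age, Sex, or bodyweight, which is worth keeping in mind when interpreting its score.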

In [None]:
# Instantiate the model
model = LinearRegression()
model

In [None]:
# Fit the model on the training data (X_train, y_train)
model.fit(X_train, y_train)

In [None]:
# Validate the model by predicting on the test data
y_predicted = model.predict(X_test)

# Load the predicted outcome into a DataFrame alongside the y_test data
predicted_outcome = pd.DataFrame({"Prediction": y_predicted, "Actual": y_test}).reset_index(drop = True)
predicted_outcome.head()

In [None]:
# Test the simple ML model
# model.score returns R^2 for LinearRegression, equivalent to r2_score(y_test, y_predicted)
print("The R^2 score of the model is", model.score(X_test, y_test))
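
`balanced_accuracy_score` is a classification metric, so it does not apply to a continuous target such as TotalKg. Regression metrics such as R^2 and the root mean squared error (RMSE) are the usual choice. A minimal sketch on hypothetical actual and predicted totals:

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

y_true = np.array([345.0, 520.0, 475.0, 492.5])  # hypothetical actual totals (kg)
y_hat = np.array([350.0, 510.0, 480.0, 490.0])   # hypothetical predictions (kg)

r2 = r2_score(y_true, y_hat)                       # 1.0 would be a perfect fit
rmse = np.sqrt(mean_squared_error(y_true, y_hat))  # typical error size, in kg
print(f"R^2: {r2:.3f}, RMSE: {rmse:.2f} kg")
```

RMSE is expressed in the same unit as the target, which makes it easier to judge whether the error is acceptable for a given weight class.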