# Open International Powerlifting Federation (IPF)

# Data Dictionary

The following data dictionary is provided with the dataset. I have added a `Datatype` column based on my best guess from the (uncleaned) data.

| Column Name | Datatype | Description |
|----------|----------|----------|
| Name | String | Mandatory. The name of the lifter in UTF-8 encoding. Lifters who share the same name are distinguished by use of a `#` symbol followed by a unique number. For example, two lifters both named `John Doe` would have `Name` values `John Doe #1` and `John Doe #2` respectively. |
| Sex | String | Mandatory. The sex category in which the lifter competed, `M`, `F`, or `Mx`. Mx (pronounced *Muks*) is a gender-neutral title — like Mr and Ms — originating from the UK. It is a catch-all sex category that is particularly appropriate for non-binary lifters. The `Sex` column is defined by [crates/opltypes/src/sex.rs](https://gitlab.com/openpowerlifting/opl-data/blob/main/crates/opltypes/src/sex.rs). |
| Event | String | Mandatory. The type of competition that the lifter entered. The `Event` column is defined by [crates/opltypes/src/event.rs](https://gitlab.com/openpowerlifting/opl-data/blob/main/crates/opltypes/src/event.rs). |
| Equipment | String | Mandatory. The equipment category under which the lifts were performed. Note that this *does not mean that the lifter was actually wearing that equipment!* For example, GPC-affiliated federations do not have a category that disallows knee wraps. Therefore, all lifters, even if they only wore knee sleeves, nevertheless competed in the `Wraps` equipment category, because they were allowed to wear wraps. The `Equipment` column is defined by [crates/opltypes/src/equipment.rs](https://gitlab.com/openpowerlifting/opl-data/blob/main/crates/opltypes/src/equipment.rs). |
| Age | Decimal | Optional. The age of the lifter on the start date of the meet, if known. Ages can be one of two types: exact or approximate. Exact ages are given as integer numbers, for example `23`. Approximate ages are given as an integer plus `0.5`, for example `23.5`. Approximate ages mean that the lifter could be either of *two* possible ages. For an approximate age of `n + 0.5`, the possible ages are `n` or `n+1`. For example, a lifter with the given age `23.5` could be either `23` or `24` -- we don't have enough information to know. Approximate ages occur because some federations only provide us with birth year information. So another way to think about approximate ages is that `23.5` implies that the lifter turns `24` that year. The `Age` column is defined by [crates/opltypes/src/age.rs](https://gitlab.com/openpowerlifting/opl-data/blob/main/crates/opltypes/src/age.rs). |
| AgeClass | String | Optional. The age class in which the lifter falls, for example `40-45`. These classes are based on exact age of the lifter on the day of competition. AgeClass is mostly useful because sometimes a federation will report that a lifter competed in the 50-54 division without providing any further age information. This way, we can still tag them as 50-54, even if the `Age` column is empty. The full range available to `AgeClass` is defined by [crates/opltypes/src/ageclass.rs](https://gitlab.com/openpowerlifting/opl-data/blob/main/crates/opltypes/src/ageclass.rs). |
| BirthYearClass | String | Optional. The birth year class in which the lifter falls, for example `40-49`. The ages in the range are the oldest possible ages for the lifter that year. For example, `40-49` means "the year the lifter turns 40 through the full year in which the lifter turns 49." `BirthYearClass` is used primarily by the IPF and by IPF affiliates. Non-IPF federations tend to use `AgeClass` instead. The full range available to `BirthYearClass` is defined by [crates/opltypes/src/birthyearclass.rs](https://gitlab.com/openpowerlifting/opl-data/blob/main/crates/opltypes/src/birthyearclass.rs). |
| Division | String | Optional. Free-form UTF-8 text describing the division of competition, like `Open` or `Juniors 20-23` or `Professional`. Some federations are *configured* in our database, which means that we have agreed on a limited set of division options for that federation, and we have rewritten their results to only use that set, and tests enforce that. Even still, divisions are not standardized *between* configured federations: it really is free-form text, just to provide context. Information about age should not be extracted from the `Division`, but from the `AgeClass` column. |
| BodyweightKg | Decimal | Optional. The recorded bodyweight of the lifter at the time of competition, to two decimal places. |
| WeightClassKg | String | Optional. The weight class in which the lifter competed, to two decimal places. Weight classes can be specified as a maximum or as a minimum. Maximums are specified by just the number, for example `90` means "up to (and including) 90kg." Minimums are specified by a `+` to the right of the number, for example `90+` means "above (and excluding) 90kg." `WeightClassKg` is defined by [crates/opltypes/src/weightclasskg.rs](https://gitlab.com/openpowerlifting/opl-data/blob/main/crates/opltypes/src/weightclasskg.rs). |
| **Lift**1Kg * | Decimal | Squat1-3Kg, Bench1-3Kg, Deadlift1-3Kg. Optional. First, second, and third attempts for each of squat, bench, and deadlift. Maximum of two decimal places. Negative values indicate failed attempts. Not all federations report attempt information. Some federations only report Best attempts. |
| Best3**Lift**Kg * | Decimal | Best3SquatKg, Best3BenchKg, Best3DeadliftKg. Optional. Maximum of the first three successful attempts for the lift. Rarely may be negative: that is used by some federations to report the lowest weight the lifter attempted and failed. |
| TotalKg | Decimal | Optional. Sum of `Best3SquatKg`, `Best3BenchKg`, and `Best3DeadliftKg`, if all three lifts were a success. If one of the lifts was failed, or the lifter was disqualified for some other reason, the `TotalKg` is empty. Rarely, mostly for older meets, a federation will report the total but not *any* lift information. |
| Place | String | Mandatory. The recorded place of the lifter in the given division at the end of the meet. The `Place` column is defined by [crates/opltypes/src/place.rs](https://gitlab.com/openpowerlifting/opl-data/blob/main/crates/opltypes/src/place.rs). |
| Dots | Decimal | Optional. A positive number if Dots points could be calculated, empty if the lifter was disqualified. Dots is very similar to (and drop-in compatible with) the original Wilks formula. It uses an updated, simpler polynomial and is built against data from drug-tested Raw lifters, as opposed to against data from drug-tested Single-ply lifters. The Dots formula was created by Tim Konertz of the BVDK in 2019. The calculation of Dots points is defined by [crates/coefficients/src/dots.rs](https://gitlab.com/openpowerlifting/opl-data/blob/main/crates/coefficients/src/dots.rs). |
| Wilks | Decimal | Optional. A positive number if Wilks points could be calculated, empty if the lifter was disqualified. Wilks is the most common formula used for determining Best Lifter in a powerlifting meet. The calculation of Wilks points is defined by [crates/coefficients/src/wilks.rs](https://gitlab.com/openpowerlifting/opl-data/blob/main/crates/coefficients/src/wilks.rs). |
| Glossbrenner | Decimal | Optional. A positive number if Glossbrenner points could be calculated, empty if the lifter was disqualified. Glossbrenner was created by Herb Glossbrenner as an update of the Wilks formula. It is most commonly used by GPC-affiliated federations. The calculation of Glossbrenner points is defined by [crates/coefficients/src/glossbrenner.rs](https://gitlab.com/openpowerlifting/opl-data/blob/main/crates/coefficients/src/glossbrenner.rs). |
| Goodlift | Decimal | IPF GL Points. The successor to IPF Points (2019-01-01 through 2020-04-30). Optional. A positive number if IPF GL Points could be calculated, empty if the lifter was disqualified or IPF GL Points were undefined for the Event type. IPF GL Points roughly express relative performance to the expected performance of that weight class at an IPF World Championship event, as a percentage. The calculation of IPF GL points is defined by [crates/coefficients/src/goodlift.rs](https://gitlab.com/openpowerlifting/opl-data/blob/main/crates/coefficients/src/goodlift.rs). |
| Tested | String | Optional. `Yes` if the lifter entered a drug-tested category, empty otherwise. Note that this records whether the results *count as drug-tested*, which does not imply that the lifter actually took a drug test. Federations do not report which lifters, if any, were subject to drug testing. |
| Country | String | Optional. The home country of the lifter, if known. The full list of valid Country values is defined by [crates/opltypes/src/country.rs](https://gitlab.com/openpowerlifting/opl-data/blob/main/crates/opltypes/src/country.rs). |
| State | String | Optional. The home state/province/oblast/division/etc of the lifter, if known. The full list of valid State values is defined by [crates/opltypes/src/states.rs](https://gitlab.com/openpowerlifting/opl-data/blob/main/crates/opltypes/src/states.rs). Expanded names are given there in comments. |
| Federation | String | Mandatory. The federation that hosted the meet. Note that this may be different than the international federation that provided sanction to the meet. For example, USPA meets are sanctioned by the IPL, but we record USPA meets as `USPA`. The full list of valid Federation values is defined by [crates/opltypes/src/federation.rs](https://gitlab.com/openpowerlifting/opl-data/blob/main/crates/opltypes/src/federation.rs). Comments in that file help explain what each federation value means. |
| ParentFederation | String | Optional. The topmost federation that sanctioned the meet, usually the international body. For example, the `ParentFederation` for the `USAPL` and `EPA` is `IPF`. |
| Date | Date | Mandatory. The start date of the meet in [ISO 8601 format](https://en.wikipedia.org/wiki/ISO_8601). ISO 8601 looks like `YYYY-MM-DD`: as an example, `1996-12-04` would be December 4th, 1996. Meets that last more than one day only have the start date recorded. |
| MeetCountry | String | Mandatory. The country in which the meet was held. The full list of valid Country values is defined by [crates/opltypes/src/country.rs](https://gitlab.com/openpowerlifting/opl-data/blob/main/crates/opltypes/src/country.rs). |
| MeetState | String | Optional. The state, province, or region in which the meet was held. The full list of valid State values is defined by [crates/opltypes/src/state.rs](https://gitlab.com/openpowerlifting/opl-data/blob/main/crates/opltypes/src/state.rs). |
| MeetTown | String | Undefined in the provided data dictionary. |
| MeetName | String | Mandatory. The name of the meet. The name is defined to never include the year or the federation. For example, the meet officially called `2019 USAPL Raw National Championships` would have the MeetName `Raw National Championships`. |

\* The `Lift1Kg` and `Best3LiftKg` rows each describe several distinct columns. The `Lift1Kg` row covers the Squat1Kg, Squat2Kg, Squat3Kg, Bench1Kg, Bench2Kg, Bench3Kg, Deadlift1Kg, Deadlift2Kg, and Deadlift3Kg columns. The `Best3LiftKg` row covers the Best3SquatKg, Best3BenchKg, and Best3DeadliftKg columns.
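
A few of the encodings above (approximate ages, `+` weight classes, negative attempts) are easier to see in code. The helpers below are a minimal sketch of how I read those rules; the function names are my own and are not part of the OpenPowerlifting codebase.

```python
import math


def possible_ages(age):
    """Return the ages a lifter could be, given an `Age` value.

    Whole numbers are exact ages. Values ending in .5 are approximate
    and mean the lifter is either n or n + 1.
    """
    if age is None or (isinstance(age, float) and math.isnan(age)):
        return ()  # Age is optional, so it may be missing
    whole = math.floor(age)
    if age == whole:
        return (whole,)
    return (whole, whole + 1)


def parse_weight_class(value):
    """Split a `WeightClassKg` string into (limit_kg, is_minimum).

    '90'  means up to and including 90 kg -> (90.0, False)
    '90+' means above 90 kg               -> (90.0, True)
    """
    text = str(value).strip()
    is_minimum = text.endswith('+')
    return float(text.rstrip('+')), is_minimum


def attempt_succeeded(kg):
    """Negative attempt values mark failed lifts."""
    return kg > 0


print(possible_ages(23.5))        # (23, 24)
print(parse_weight_class('90+'))  # (90.0, True)
print(attempt_succeeded(-182.5))  # False
```

These helpers assume the column has already been read as a float (`Age`, attempts) or a string (`WeightClassKg`); missing weight classes would need to be filtered out before calling `parse_weight_class`.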


## Data Cleaning

I'm working with the data from the [OpenPowerlifting Data Service](https://openpowerlifting.gitlab.io/opl-csv/bulk-csv.html).

The purpose of this notebook is to produce a clean dataset that can be used for further exploration. I want to be able to answer questions about topics like the following:

- An assessment of different strength metrics (Dots, Wilks, etc.)
- Record progression by year
- Weight class cutoffs vs. records
- Meet location vs. lifter federation
- Patterns between weight classes and between M/F lifters
- Popularity of equipment over time (single-ply, multi-ply, raw, etc.)
- Performance by event type (full SBD, bench only, deadlift only)

No doubt some of these columns will require some work before I can pull meaningful information out of them.

In [2]:
# import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [3]:
openpl = pd.read_csv('data/openpowerlifting-2023-08-05-ae3dc469.csv')
openpl.head()


| | Name | Sex | Event | Equipment | Age | AgeClass | BirthYearClass | Division | BodyweightKg | WeightClassKg | ... | Tested | Country | State | Federation | ParentFederation | Date | MeetCountry | MeetState | MeetTown | MeetName |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Alona Vladi | F | SBD | Raw | 33.0 | 24-34 | 24-39 | O | 58.3 | 60 | ... | Yes | Russia | | GFP | | 2019-05-11 | Russia | | Bryansk | Open Tournament |
| 1 | Galina Solovyanova | F | SBD | Raw | 43.0 | 40-44 | 40-49 | M1 | 73.1 | 75 | ... | Yes | Russia | | GFP | | 2019-05-11 | Russia | | Bryansk | Open Tournament |
| 2 | Daniil Voronin | M | SBD | Raw | 15.5 | 16-17 | 14-18 | T | 67.4 | 75 | ... | Yes | Russia | | GFP | | 2019-05-11 | Russia | | Bryansk | Open Tournament |
| 3 | Aleksey Krasov | M | SBD | Raw | 35.0 | 35-39 | 24-39 | O | 66.65 | 75 | ... | Yes | Russia | | GFP | | 2019-05-11 | Russia | | Bryansk | Open Tournament |
| 4 | Margarita Pleschenkova | M | SBD | Raw | 26.5 | 24-34 | 24-39 | O | 72.45 | 75 | ... | Yes | Russia | | GFP | | 2019-05-11 | Russia | | Bryansk | Open Tournament |


In [5]:
print(f'Column names with mixed data type: {openpl.columns[[33,35,38]]}')

Column names with mixed data type: Index(['State', 'ParentFederation', 'MeetState'], dtype='object')
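
Mixed types in columns like these usually come from pandas inferring a type chunk by chunk while reading a large file, so a column that is mostly empty can end up holding both floats (NaN) and strings. One way to avoid that, assuming all three columns are meant to be text as the data dictionary says, is to declare their types up front. The sketch below uses a small invented CSV in place of the real file.

```python
import io

import pandas as pd

# Tiny stand-in for the real CSV; these rows are invented for illustration.
sample_csv = io.StringIO(
    "Name,State,ParentFederation,MeetState\n"
    "Lifter A,,IPF,\n"
    "Lifter B,CA,,TX\n"
)

# Declaring the columns as strings stops pandas from guessing their types.
text_columns = {'State': str, 'ParentFederation': str, 'MeetState': str}
sample = pd.read_csv(sample_csv, dtype=text_columns)

# Empty cells are still read as missing values, not as empty strings.
print(sample)
```

For the real dataset, the same `dtype=text_columns` argument would be added to the `pd.read_csv` call in the earlier cell.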


In [41]:
# Let's make sure this dataset actually has the columns that are described in the data dictionary.
openpl.columns

Index(['Name', 'Sex', 'Event', 'Equipment', 'Age', 'AgeClass',
       'BirthYearClass', 'Division', 'BodyweightKg', 'WeightClassKg',
       'Squat1Kg', 'Squat2Kg', 'Squat3Kg', 'Squat4Kg', 'Best3SquatKg',
       'Bench1Kg', 'Bench2Kg', 'Bench3Kg', 'Bench4Kg', 'Best3BenchKg',
       'Deadlift1Kg', 'Deadlift2Kg', 'Deadlift3Kg', 'Deadlift4Kg',
       'Best3DeadliftKg', 'TotalKg', 'Place', 'Dots', 'Wilks', 'Glossbrenner',
       'Goodlift', 'Tested', 'Country', 'State', 'Federation',
       'ParentFederation', 'Date', 'MeetCountry', 'MeetState', 'MeetTown',
       'MeetName'],
      dtype='object')
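
Rather than comparing the two lists by eye, the check can be done programmatically. This is a minimal sketch: `documented` is my own expansion of the data dictionary above (with the two starred rows expanded into their individual columns), and `actual` is copied from the output of `openpl.columns`; in the notebook it could be `list(openpl.columns)` instead.

```python
lifts = ['Squat', 'Bench', 'Deadlift']

# Columns described in the data dictionary.
documented = [
    'Name', 'Sex', 'Event', 'Equipment', 'Age', 'AgeClass', 'BirthYearClass',
    'Division', 'BodyweightKg', 'WeightClassKg', 'TotalKg', 'Place', 'Dots',
    'Wilks', 'Glossbrenner', 'Goodlift', 'Tested', 'Country', 'State',
    'Federation', 'ParentFederation', 'Date', 'MeetCountry', 'MeetState',
    'MeetTown', 'MeetName',
]
documented += [f'{lift}{n}Kg' for lift in lifts for n in (1, 2, 3)]
documented += [f'Best3{lift}Kg' for lift in lifts]

# Columns actually present in the dataset (copied from the output above).
actual = [
    'Name', 'Sex', 'Event', 'Equipment', 'Age', 'AgeClass',
    'BirthYearClass', 'Division', 'BodyweightKg', 'WeightClassKg',
    'Squat1Kg', 'Squat2Kg', 'Squat3Kg', 'Squat4Kg', 'Best3SquatKg',
    'Bench1Kg', 'Bench2Kg', 'Bench3Kg', 'Bench4Kg', 'Best3BenchKg',
    'Deadlift1Kg', 'Deadlift2Kg', 'Deadlift3Kg', 'Deadlift4Kg',
    'Best3DeadliftKg', 'TotalKg', 'Place', 'Dots', 'Wilks', 'Glossbrenner',
    'Goodlift', 'Tested', 'Country', 'State', 'Federation',
    'ParentFederation', 'Date', 'MeetCountry', 'MeetState', 'MeetTown',
    'MeetName',
]

undocumented = [col for col in actual if col not in documented]
missing = [col for col in documented if col not in actual]

print(undocumented)  # ['Squat4Kg', 'Bench4Kg', 'Deadlift4Kg']
print(missing)       # []
```

Every documented column is present, and the only columns the dictionary does not describe are the three fourth-attempt columns.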

In [42]:
# When this CSV is imported into MySQL, empty cells are read as empty strings ('') rather than NULL.
# MySQL reads \N as NULL, so I'll replace NaNs with \N.
openpl.replace(np.nan, '\\N', inplace=True)

In [45]:
# I'll write this to a csv for use later.
openpl.to_csv('openpl.csv', index=False)
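
An alternative to replacing NaNs in the DataFrame itself is to let `to_csv` write the placeholder through its `na_rep` parameter. This leaves the numeric columns as floats in memory, which may be convenient if the DataFrame is used again later in the notebook. The sketch below uses a small invented DataFrame and writes to an in-memory buffer rather than a file.

```python
import io

import numpy as np
import pandas as pd

# Small invented frame with missing values in a numeric and a text column.
sample = pd.DataFrame({
    'Name': ['Lifter A', 'Lifter B'],
    'Age': [23.5, np.nan],
    'State': [np.nan, 'CA'],
})

# na_rep controls how missing values are written; the frame is not modified.
buffer = io.StringIO()
sample.to_csv(buffer, index=False, na_rep='\\N')
csv_text = buffer.getvalue()

print(csv_text)
print(sample['Age'].dtype)  # still float64
```

With the real data, the equivalent call would be `openpl.to_csv('openpl.csv', index=False, na_rep='\\N')`, assuming the NaN replacement step has been skipped.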