## Imports and settings

In [1]:
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import make_column_transformer

In [2]:
pd.set_option('display.max_columns', None)

## Importing Data

In [3]:
df = pd.read_csv('Data/AIDA_Results_IA_Institut.csv')

## Data exploration

In [4]:
df.head()

Unnamed: 0,Start,Diver,Gender,Discipline,Line,Official Top,AP,RP,Card,Points,Remarks,Title Event,Event Type,Day,Category Event
0,1,Tasos Grillakis (GRC),M,FIM,,00:00,33,23 m,YELLOW,12.0,-,Depth Event 2016,Depth Competition,2016-07-17,other
1,2,Antonis Papantonatos (GRC),M,FIM,,00:00,55,47 m,YELLOW,38.0,-,Depth Event 2016,Depth Competition,2016-07-17,other
2,3,Dimitris Koumoulos (GRC),M,CNF,,00:00,55,55 m,WHITE,55.0,-,Depth Event 2016,Depth Competition,2016-07-17,other
3,4,Christos Papadopoulos (GRC),M,CWT,,00:00,55,55 m,WHITE,55.0,OK,Depth Event 2016,Depth Competition,2016-07-17,other
4,5,Anna Chalari (GRC),F,CWT,,00:00,15,15 m,WHITE,15.0,OK,Depth Event 2016,Depth Competition,2016-07-17,other


In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 26841 entries, 0 to 26840
Data columns (total 15 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Start           26841 non-null  int64  
 1   Diver           26841 non-null  object 
 2   Gender          26841 non-null  object 
 3   Discipline      26841 non-null  object 
 4   Line            4694 non-null   float64
 5   Official Top    26841 non-null  object 
 6   AP              26841 non-null  int64  
 7   RP              26841 non-null  object 
 8   Card            26841 non-null  object 
 9   Points          26841 non-null  object 
 10  Remarks         26836 non-null  object 
 11  Title Event     26841 non-null  object 
 12  Event Type      26841 non-null  object 
 13  Day             26841 non-null  object 
 14  Category Event  26841 non-null  object 
dtypes: float64(1), int64(2), object(12)
memory usage: 3.1+ MB


In [6]:
df.isna().sum()


Start                 0
Diver                 0
Gender                0
Discipline            0
Line              22147
Official Top          0
AP                    0
RP                    0
Card                  0
Points                0
Remarks               5
Title Event           0
Event Type            0
Day                   0
Category Event        0
dtype: int64

At this point, we already know that we will drop the columns "Start" and "Line": the first because it essentially duplicates the index, and the second because only 4694 of its 26841 values are present, which makes it of little use.

In addition, given our needs, we are not interested in the columns "Official Top" and "Title Event".

The column "Diver" will only be used to extract each diver's country.
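The decision to drop "Line" rests on its share of missing values (22147 of 26841, roughly 82.5%). As a minimal sketch, that share can be computed for every column at once. The toy frame and the `MAX_MISSING` cut-off below are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd

# Toy frame standing in for the real data (values are illustrative only).
toy = pd.DataFrame({
    "Start": [1, 2, 3, 4],
    "Line": [1.0, None, None, None],
    "Remarks": ["OK", "-", None, "OK"],
})

# Fraction of missing values per column, highest first.
missing_share = toy.isna().mean().sort_values(ascending=False)
print(missing_share)

# Hypothetical cut-off: flag columns missing more than half of their values.
MAX_MISSING = 0.5
to_drop = missing_share[missing_share > MAX_MISSING].index.tolist()
print(to_drop)
```

On the real DataFrame, the same two lines applied to `df` would flag only "Line", since "Remarks" is missing just 5 values.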

### Exploring Diver

In [7]:
df[df["Diver"].str.len() < 3]["Diver"]

2639     ()
4687     ()
5119     ()
5695     ()
5696     ()
5697     ()
7124     ()
7782     ()
7817     ()
7818     ()
7819     ()
8508     ()
8529     ()
8553     ()
8973     ()
8989     ()
9120     ()
9156     ()
9166     ()
10417    ()
10786    ()
10787    ()
10788    ()
11224    ()
13306    ()
13343    ()
Name: Diver, dtype: object

Note that some divers' names are missing (the value is just "()"); we will handle this later in the modifications part.

### Exploring Gender

In [8]:
df["Gender"].value_counts()

Gender
M    17434
F     9407
Name: count, dtype: int64

No problems here.

### Exploring Discipline

In [9]:
df["Discipline"].value_counts()

Discipline
CWT     10725
FIM      8075
CNF      4813
CWTB     3228
Name: count, dtype: int64

There are no problems here: only the values we were expecting are present, with no NaN according to the previous checks.

### Exploring AP

In [10]:
df["AP"].info()

<class 'pandas.core.series.Series'>
RangeIndex: 26841 entries, 0 to 26840
Series name: AP
Non-Null Count  Dtype
--------------  -----
26841 non-null  int64
dtypes: int64(1)
memory usage: 209.8 KB


The column contains only numeric data, so nothing needs fixing.

### Exploring RP

In [11]:
df["RP"].info()

<class 'pandas.core.series.Series'>
RangeIndex: 26841 entries, 0 to 26840
Series name: RP
Non-Null Count  Dtype 
--------------  ----- 
26841 non-null  object
dtypes: object(1)
memory usage: 209.8+ KB


The dtype is object, so the column does not contain only numeric data and we need to inspect the values manually.
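Rather than reading through every distinct value, one option is to list only the values that deviate from the plain "number followed by m" format. The sketch below uses a small illustrative sample and assumes that format is the expected one.

```python
import pandas as pd

# Illustrative sample of the formats that occur in "RP".
rp = pd.Series(["40 m", "-2 m", "-", "60 m NR", "40 m"])

# A value is "plain" when it is an optional minus sign, digits, then " m".
is_plain = rp.str.fullmatch(r"-?\d+ m")

# Distinct values that need special handling, with their frequencies.
odd_values = rp[~is_plain].value_counts()
print(odd_values)
```

Applied to `df["RP"]`, this would reduce the inspection to the '-' placeholder and the entries carrying an extra suffix.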

In [12]:
for value in df["RP"].value_counts().index:
    print(value)

40 m
50 m
30 m
45 m
35 m
60 m
0 m
55 m
70 m
25 m
42 m
65 m
52 m
51 m
43 m
20 m
80 m
46 m
36 m
48 m
41 m
53 m
75 m
61 m
56 m
32 m
47 m
57 m
62 m
38 m
33 m
-2 m
58 m
63 m
31 m
37 m
54 m
66 m
90 m
28 m
44 m
71 m
72 m
85 m
67 m
68 m
73 m
64 m
34 m
27 m
26 m
15 m
82 m
49 m
78 m
81 m
76 m
22 m
77 m
74 m
-
59 m
23 m
88 m
83 m
100 m
91 m
39 m
92 m
86 m
21 m
95 m
29 m
24 m
84 m
69 m
18 m
93 m
87 m
10 m
17 m
96 m
16 m
101 m
105 m
94 m
12 m
98 m
19 m
89 m
102 m
103 m
97 m
79 m
14 m
8 m
9 m
106 m
110 m
13 m
108 m
11 m
5 m
104 m
60 m NR
99 m
107 m
40 m NR
7 m
111 m
50 m NR
3 m
30 m NR
114 m
6 m
65 m NR
112 m
75 m NR
71 m NR
45 m NR
116 m
80 m NR
52 m NR
120 m
109 m
55 m NR
4 m
117 m
61 m NR
118 m
72 m NR
81 m NR
70 m NR
123 m
100 m NR
113 m
115 m
57 m NR
1 m
35 m NR
90 m NR
2 m
76 m NR
66 m NR
74 m NR
20 m NR
25 m NR
119 m
88 m NR
121 m
53 m NR
62 m NR
106 m NR
101 m NR
83 m NR
54 m NR
42 m NR
46 m NR
85 m NR
125 m
77 m NR
51 m NR
67 m NR
36 m NR
41 m NR
73 m NR
92 m NR
82 m NR
102 m NR
97 m NR
122

In [13]:
len(df[df["RP"] == '-'])

138

We will keep only the integer value, dropping the " m" unit and the "NR" suffix.

The 138 rows where no value is specified (marked '-') are replaced by 0 for the computation.

In [14]:
df["RP"] = df["RP"].replace('-', 0).apply(lambda x: int(str(x).strip().split(" ")[0]))

In [15]:
df["RP"].info()

<class 'pandas.core.series.Series'>
RangeIndex: 26841 entries, 0 to 26840
Series name: RP
Non-Null Count  Dtype
--------------  -----
26841 non-null  int64
dtypes: int64(1)
memory usage: 209.8 KB


The column now contains only integers.
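An alternative to splitting on spaces is a regular expression, which also makes it easy to keep the "NR" suffix as a separate flag instead of discarding it. This is only a sketch on sample values; the `is_nr` column is a hypothetical addition, and the meaning of "NR" (presumably a record marker) is not defined in the data.

```python
import pandas as pd

rp_raw = pd.Series(["40 m", "-2 m", "-", "60 m NR"])

# Capture the leading signed integer; rows without one (here "-") become NaN.
depth = rp_raw.str.extract(r"^(-?\d+)", expand=False)

# Same convention as the notebook: a missing depth is stored as 0.
rp_clean = pd.to_numeric(depth).fillna(0).astype(int)

# Hypothetical extra column: keep the "NR" suffix as a boolean flag.
is_nr = rp_raw.str.endswith("NR")

print(rp_clean.tolist())
print(is_nr.tolist())
```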

### Exploring Card

In [16]:
df["Card"].value_counts()

Card
WHITE     19189
YELLOW     4990
RED        2524
-           138
Name: count, dtype: int64

In [17]:
df[df["Card"] == "-"]

Unnamed: 0,Start,Diver,Gender,Discipline,Line,Official Top,AP,RP,Card,Points,Remarks,Title Event,Event Type,Day,Category Event
25564,1,Chiara Obino (ITA),F,CWT,1.0,11:00,98,0,-,-,-,AIDA Freediving World Cup August 2023,Depth Competition,2023-08-07,World Championship
25565,2,Daham Kim (KOR),M,CWTB,1.0,11:06,61,0,-,-,-,AIDA Freediving World Cup August 2023,Depth Competition,2023-08-07,World Championship
25566,3,Rüstem Derİn (TUR),M,FIM,1.0,11:12,96,0,-,-,-,AIDA Freediving World Cup August 2023,Depth Competition,2023-08-07,World Championship
25567,4,Mariam Shalan (EGY),F,FIM,1.0,11:18,60,0,-,-,-,AIDA Freediving World Cup August 2023,Depth Competition,2023-08-07,World Championship
25568,5,Alejandro Lemus (MEX),M,CWT,1.0,11:24,92,0,-,-,-,AIDA Freediving World Cup August 2023,Depth Competition,2023-08-07,World Championship
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
25697,19,Jindřiška Zajacová (CZE),F,CWT,1.0,12:48,70,0,-,-,-,AIDA Freediving World Cup August 2023,Depth Competition,2023-08-14,World Championship
25698,20,Lujain Talal (SAU),F,FIM,1.0,12:54,35,0,-,-,-,AIDA Freediving World Cup August 2023,Depth Competition,2023-08-14,World Championship
25699,21,Arseniy Telegin (RUS),M,FIM,1.0,13:00,70,0,-,-,-,AIDA Freediving World Cup August 2023,Depth Competition,2023-08-14,World Championship
25700,22,Essa Albarrk (SAU),M,FIM,1.0,13:06,35,0,-,-,-,AIDA Freediving World Cup August 2023,Depth Competition,2023-08-14,World Championship


In [18]:
df[df["Card"] == "-"]["RP"].value_counts()

RP
0    138
Name: count, dtype: int64

In [19]:
df[df["Card"] == "-"]["Remarks"].value_counts()

Remarks
-    138
Name: count, dtype: int64

Here we clearly see that these 138 rows are empty (no card, no points, no remarks, and an RP of 0 after our replacement), so they are of no interest. We simply drop them.

In [20]:
df = df[df["Card"] != "-"].copy()
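Before dropping rows like these, it can be reassuring to assert that the placeholder really appears in all the result columns together. The following is a sketch on a toy frame, not the notebook's own check; taking an explicit copy of the filtered frame also avoids the SettingWithCopyWarning that some pandas versions emit when columns are assigned later.

```python
import pandas as pd

toy = pd.DataFrame({
    "Card": ["WHITE", "-", "RED", "-"],
    "Points": ["55.0", "-", "0", "-"],
    "Remarks": ["OK", "-", "-", "-"],
})

placeholder = toy["Card"] == "-"

# Every row without a card should also lack points and remarks.
assert (toy.loc[placeholder, ["Points", "Remarks"]] == "-").all().all()

# .copy() gives an independent frame for later column assignments.
kept = toy.loc[~placeholder].copy()
print(len(kept))
```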

### Exploring Points

In [21]:
df["Points"].info()

<class 'pandas.core.series.Series'>
Index: 26703 entries, 0 to 26840
Series name: Points
Non-Null Count  Dtype 
--------------  ----- 
26703 non-null  object
dtypes: object(1)
memory usage: 417.2+ KB


In [22]:
for value in df["Points"].value_counts().index:
    print(value)

0.0
40.0
50.0
30.0
45.0
35.0
60.0
55.0
25.0
70.0
42.0
65.0
20.0
52.0
41.0
43.0
80.0
51.0
33.0
53.0
46.0
36.0
75.0
32.0
61.0
47.0
48.0
38.0
56.0
62.0
57.0
31.0
37.0
58.0
28.0
44.0
63.0
27.0
54.0
90.0
66.0
71.0
72.0
85.0
34.0
68.0
26.0
15.0
67.0
73.0
64.0
81.0
23.0
22.0
76.0
82.0
0
21.0
49.0
77.0
39.0
24.0
29.0
78.0
83.0
100.0
91.0
18.0
74.0
59.0
86.0
88.0
92.0
19.0
17.0
69.0
95.0
10.0
84.0
16.0
12.0
93.0
87.0
96.0
14.0
101.0
11.0
13.0
105.0
94.0
40
7.0
9.0
8.0
30
45
50
102.0
5.0
98.0
97.0
103.0
25
70
89.0
6.0
79.0
106.0
65
55
60
2.0
111.0
110.0
35
3.0
51
99.0
4.0
108.0
20
57
33
80
104.0
52
120.0
107.0
36
48
32
56
114.0
-4.0
116.0
1.0
53
109.0
63
42
41
22
38
112.0
28
61
71
31
62
75
37
46
117.0
-3.0
-15.0
73
-8.0
16
115.0
18
47
-1.0
68
-5.0
90
15
66
-10.0
29
26
-11.0
113.0
-12.0
72
54
39
100
85
-13.0
-9.0
-2.0
21
101
23
123.0
64
67
76
83
91
12
121.0
-14.0
-6.0
-16.0
-25.0
118.0
124.0
82
125.0
43
-31.0
93
122.0
24
27
84
-17.0
77
14
-7.0
49
34
17
-24.0
78
126.0
119.0
-34.0
19
11
88
9
81
59


In [23]:
df["Points"] = df["Points"].astype(float)

In [24]:
df["Points"].info()

<class 'pandas.core.series.Series'>
Index: 26703 entries, 0 to 26840
Series name: Points
Non-Null Count  Dtype  
--------------  -----  
26703 non-null  float64
dtypes: float64(1)
memory usage: 417.2 KB


All values are now floats, as expected.
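The conversion with astype(float) succeeds here because the rows holding '-' were dropped beforehand; a leftover placeholder would raise a ValueError. As a hedged sketch, `pd.to_numeric` with `errors="coerce"` is a more forgiving variant that also shows which values failed to parse. The sample values are illustrative.

```python
import pandas as pd

points_raw = pd.Series(["55.0", "40", "-4.0", "-"])

# errors="coerce" turns anything unparsable into NaN instead of raising.
points = pd.to_numeric(points_raw, errors="coerce")

# Values that failed to parse.
failed = points_raw[points.isna()].tolist()
print(failed)
```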

### Exploring Remarks

### Exploring Event Type

In [25]:
df["Event Type"].value_counts()

Event Type
Depth Competition      19297
Mixed Competition       3794
World Championship      1921
Competition             1447
Pool Competition         140
Worldrecord attempt       57
Team Competition          47
Name: count, dtype: int64

No missing values, and everything is as expected.

### Exploring Day

In [26]:
for value in df["Day"].value_counts().index:
    print(value)

2008-09-03
2012-11-20
2011-09-15
2013-09-15
2014-11-27
2014-05-25
2010-06-14
2010-09-26
2015-09-11
2015-05-23
2015-04-27
2020-08-15
2013-11-09
2008-07-12
2009-04-03
2023-06-24
2007-10-27
2012-09-09
2012-10-20
2009-04-21
2011-04-17
2003-05-29
2009-12-03
2006-12-09
2010-04-27
2008-04-18
2010-04-29
2013-05-20
2020-08-14
2015-09-05
2007-10-21
2022-06-18
2011-06-23
2015-08-27
2008-06-20
2009-09-06
2008-06-10
2013-09-09
2010-08-14
2002-10-31
2021-09-24
2017-08-26
2004-06-14
2007-08-14
2019-08-24
2021-09-25
2006-08-19
2016-04-30
2023-05-26
2006-09-09
2012-05-04
2015-06-06
2017-07-22
2019-06-09
2014-09-19
2005-09-01
2009-07-03
2015-11-01
2014-06-12
2023-03-10
2008-08-10
2013-10-05
2006-05-27
2022-08-06
2022-08-20
2015-06-20
2015-05-25
2014-06-15
2022-08-05
2016-05-02
2002-07-27
2009-07-19
2012-08-18
2022-09-11
2022-08-02
2008-07-25
2017-08-19
2016-05-01
2015-05-02
2021-09-17
2010-05-11
2021-06-19
2013-07-06
2009-06-22
2010-05-21
2009-08-13
2006-06-04
2013-10-30
2018-09-02
2019-09-21
2022-06-25

Data is consistent, so we may use it for further analysis.
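Consistency of the dates can also be checked programmatically instead of by eye. The sketch below assumes the expected format is ISO `YYYY-MM-DD`, as in the values printed above, and uses a deliberately invalid sample date.

```python
import pandas as pd

days = pd.Series(["2016-07-17", "1994-06-12", "2023-13-01"])

# Strict parsing against the expected format; invalid dates become NaT.
parsed = pd.to_datetime(days, format="%Y-%m-%d", errors="coerce")

# Entries that do not form a valid date.
invalid = days[parsed.isna()].tolist()
print(invalid)

# Range of the valid dates.
print(parsed.min(), parsed.max())
```

On the real column, an empty `invalid` list would confirm that every entry parses.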

### Exploring Category Event

In [27]:
df["Category Event"].value_counts()

Category Event
other                 18685
World Championship     4120
VB                     1880
Panglao                1339
NAC                     679
Name: count, dtype: int64

With about 70% of the rows (18685 of 26703) falling in the catch-all "other" category, this column is not worth keeping.
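The share of each category can be read directly from the counts. A small sketch, reusing the counts from the output above:

```python
import pandas as pd

# Counts copied from the value_counts() output above.
counts = pd.Series({
    "other": 18685,
    "World Championship": 4120,
    "VB": 1880,
    "Panglao": 1339,
    "NAC": 679,
})

# Relative frequency of each category, rounded to three decimals.
share = (counts / counts.sum()).round(3)
print(share)
```

On the DataFrame itself, `df["Category Event"].value_counts(normalize=True)` should give the same proportions in one call.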

## Modifications and creation of specific columns

In [28]:
# Dropping duplicates if there are any (based on the previous exploration, there are none).
df.drop_duplicates(inplace=True)

### Dropping columns

In [29]:
df = df.drop(["Start", "Line", "Official Top", "Title Event", "Category Event"], axis=1)

In [30]:
df.head()

Unnamed: 0,Diver,Gender,Discipline,AP,RP,Card,Points,Remarks,Event Type,Day
0,Tasos Grillakis (GRC),M,FIM,33,23,YELLOW,12.0,-,Depth Competition,2016-07-17
1,Antonis Papantonatos (GRC),M,FIM,55,47,YELLOW,38.0,-,Depth Competition,2016-07-17
2,Dimitris Koumoulos (GRC),M,CNF,55,55,WHITE,55.0,-,Depth Competition,2016-07-17
3,Christos Papadopoulos (GRC),M,CWT,55,55,WHITE,55.0,OK,Depth Competition,2016-07-17
4,Anna Chalari (GRC),F,CWT,15,15,WHITE,15.0,OK,Depth Competition,2016-07-17


### Creating new columns

In [31]:
# 1. Extracting the month from the 'Day' column
df['Month'] = pd.to_datetime(df['Day']).dt.month

# 2. Calculating total dive experience
# Sorting data by 'Day'
df.sort_values(by=['Day'], inplace=True)

# Cumulative count of dives per diver
df['Experience Dive'] = df.groupby('Diver').cumcount()

# 3. Calculating experience per discipline
# Cumulative count of dives per diver per discipline
df['Experience Discipline'] = df.groupby(['Diver', 'Discipline']).cumcount()

In [32]:
df.head(10)

Unnamed: 0,Diver,Gender,Discipline,AP,RP,Card,Points,Remarks,Event Type,Day,Month,Experience Dive,Experience Discipline
8929,Deborah Andollo (CUB),F,CWT,61,61,WHITE,61.0,OK,Worldrecord attempt,1994-06-12,6,0,0
3716,Umberto Pelizzari (ITA),M,CWT,72,72,WHITE,72.0,OK,Worldrecord attempt,1995-09-17,9,0,0
3713,Deborah Andollo (CUB),F,CWT,62,62,WHITE,62.0,OK,Worldrecord attempt,1996-10-05,10,1,1
5021,Michael Oliva (FRA),M,CWT,72,72,WHITE,72.0,OK,Worldrecord attempt,1996-10-11,10,0,0
3717,Alejandro Ravelo (CUB),M,CWT,73,73,WHITE,73.0,OK,Worldrecord attempt,1997-08-02,8,0,0
3715,Umberto Pelizzari (ITA),M,CWT,75,75,WHITE,75.0,OK,Worldrecord attempt,1997-09-13,9,1,1
8928,Deborah Andollo (CUB),F,CWT,0,65,WHITE,65.0,OK,Worldrecord attempt,1997-12-05,12,2,2
20860,Alexandra Louzine (CZE),F,CWT,35,35,WHITE,35.0,OK,Worldrecord attempt,1998-09-06,9,0,0
3712,Tanya Streeter (USA),F,CWT,67,67,WHITE,67.0,OK,Worldrecord attempt,1998-09-19,9,0,0
13224,Bernard Hugues (FRA),M,CWT,0,60,WHITE,60.0,OK,Competition,1999-01-01,1,0,0
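To make the two experience columns concrete, here is a toy example of the same sort-then-cumcount pattern. The divers "A" and "B" are made up, and the stable sort is an assumption added here so that dives sharing a date keep their original order; the notebook itself uses the default sort.

```python
import pandas as pd

toy = pd.DataFrame({
    "Diver": ["A", "B", "A", "A", "B"],
    "Discipline": ["CWT", "FIM", "FIM", "CWT", "FIM"],
    "Day": ["2020-01-03", "2020-01-01", "2020-01-02", "2020-01-01", "2020-01-05"],
})

# A stable sort keeps the original row order for dives on the same day.
toy = toy.sort_values("Day", kind="stable")

# Number of earlier dives by the same diver (0 for a first dive).
toy["Experience Dive"] = toy.groupby("Diver").cumcount()

# Number of earlier dives by the same diver in the same discipline.
toy["Experience Discipline"] = toy.groupby(["Diver", "Discipline"]).cumcount()

print(toy)
```

Because the grouping key is the raw "Diver" string, rows whose name is just "()" would all be counted as a single diver; whether that matters depends on how the experience columns are used later.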


### Creation of the "Country" column

In [33]:
df["Diver"]

8929       Deborah Andollo (CUB)
3716     Umberto Pelizzari (ITA)
3713       Deborah Andollo (CUB)
5021         Michael Oliva (FRA)
3717      Alejandro Ravelo (CUB)
                  ...           
26779    Alhadoom Almheiri (ARE)
26780         Firas Fayyad (PSE)
26781        Ahmed Abdulla (BHR)
26782        Dmitry Kataya (INT)
26783            Aws Lafta (IRQ)
Name: Diver, Length: 26703, dtype: object

In [34]:
df["Country"] = df["Diver"].apply(lambda x: str(x).strip().replace('()', '(NaN)')[-4:-1]).replace("NaN", "Unknown")

In [35]:
for value in df["Country"].value_counts().index:
    print(value)

KOR
FRA
JPN
USA
GBR
DEU
RUS
SWE
GRC
CAN
MEX
CHN
DNK
NZL
CHE
POL
TPE
AUS
HRV
COL
ITA
CZE
ISR
FIN
CYP
BRA
AUT
NLD
VEN
PHL
ESP
BEL
CHL
ZAF
UKR
SVN
MYS
NOR
SGP
PRT
ARG
HKG
EGY
THA
SRB
SVK
HUN
IRL
TUR
IDN
Unknown
SAU
TUN
LVA
ECU
ROU
OMN
KWT
GTM
BLR
BGR
HND
PER
SLV
CUB
ARE
BRB
URY
SYR
INT
EST
LBN
DMA
KAZ
LTU
PRI
MAR
PAN
MDV
GRD
CYM
DZA
ATG
PSE
BRN
MCO
BHS
AFG
IRN
NAM
LIE
BHR
BOL
MUS
AND
IND
YEM
ZWE
LUX
TTO
IRQ
ARM
ISL
LBY
MDA
MNE
ERI
MLT
GEO


Thanks to that, we now have countries for future visualizations.

Please note that missing values such as '()' or "Name Surname ()" have been marked as 'Unknown'.
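The slice-based extraction relies on every name ending with a three-letter code in parentheses. As an alternative sketch, a regular expression states that assumption explicitly and labels anything else as 'Unknown'. The sample names below are illustrative.

```python
import pandas as pd

divers = pd.Series([
    "Tasos Grillakis (GRC)",
    "()",
    "Name Surname ()",
    " Anna Chalari (GRC) ",
])

# Capture exactly three capital letters inside the trailing parentheses.
country = divers.str.strip().str.extract(r"\(([A-Z]{3})\)$", expand=False)

# Anything without a well-formed code is labeled "Unknown".
country = country.fillna("Unknown")
print(country.tolist())
```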

In [36]:
df.head()

Unnamed: 0,Diver,Gender,Discipline,AP,RP,Card,Points,Remarks,Event Type,Day,Month,Experience Dive,Experience Discipline,Country
8929,Deborah Andollo (CUB),F,CWT,61,61,WHITE,61.0,OK,Worldrecord attempt,1994-06-12,6,0,0,CUB
3716,Umberto Pelizzari (ITA),M,CWT,72,72,WHITE,72.0,OK,Worldrecord attempt,1995-09-17,9,0,0,ITA
3713,Deborah Andollo (CUB),F,CWT,62,62,WHITE,62.0,OK,Worldrecord attempt,1996-10-05,10,1,1,CUB
5021,Michael Oliva (FRA),M,CWT,72,72,WHITE,72.0,OK,Worldrecord attempt,1996-10-11,10,0,0,FRA
3717,Alejandro Ravelo (CUB),M,CWT,73,73,WHITE,73.0,OK,Worldrecord attempt,1997-08-02,8,0,0,CUB


## Transforming useful categorical data to numeric data

In [37]:
df.head()

Unnamed: 0,Diver,Gender,Discipline,AP,RP,Card,Points,Remarks,Event Type,Day,Month,Experience Dive,Experience Discipline,Country
8929,Deborah Andollo (CUB),F,CWT,61,61,WHITE,61.0,OK,Worldrecord attempt,1994-06-12,6,0,0,CUB
3716,Umberto Pelizzari (ITA),M,CWT,72,72,WHITE,72.0,OK,Worldrecord attempt,1995-09-17,9,0,0,ITA
3713,Deborah Andollo (CUB),F,CWT,62,62,WHITE,62.0,OK,Worldrecord attempt,1996-10-05,10,1,1,CUB
5021,Michael Oliva (FRA),M,CWT,72,72,WHITE,72.0,OK,Worldrecord attempt,1996-10-11,10,0,0,FRA
3717,Alejandro Ravelo (CUB),M,CWT,73,73,WHITE,73.0,OK,Worldrecord attempt,1997-08-02,8,0,0,CUB


In [38]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 26703 entries, 8929 to 26783
Data columns (total 14 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   Diver                  26703 non-null  object 
 1   Gender                 26703 non-null  object 
 2   Discipline             26703 non-null  object 
 3   AP                     26703 non-null  int64  
 4   RP                     26703 non-null  int64  
 5   Card                   26703 non-null  object 
 6   Points                 26703 non-null  float64
 7   Remarks                26698 non-null  object 
 8   Event Type             26703 non-null  object 
 9   Day                    26703 non-null  object 
 10  Month                  26703 non-null  int32  
 11  Experience Dive        26703 non-null  int64  
 12  Experience Discipline  26703 non-null  int64  
 13  Country                26703 non-null  object 
dtypes: float64(1), int32(1), int64(4), object(8)
memory usag

In [39]:
columns_to_transform = ["Gender", "Discipline", "Card", "Event Type"]

In [40]:
# Initializing one hot encoder
one_hot_encoder = OneHotEncoder()

In [41]:
# Initializing transformer with desired settings
transformer = make_column_transformer(
    (one_hot_encoder, columns_to_transform),
    remainder='passthrough')

In [42]:
# Transforming DataFrame's data
transformed = transformer.fit_transform(df)
transformed

array([[1.0, 0.0, 0.0, ..., 0, 0, 'CUB'],
       [0.0, 1.0, 0.0, ..., 0, 0, 'ITA'],
       [1.0, 0.0, 0.0, ..., 1, 1, 'CUB'],
       ...,
       [0.0, 1.0, 0.0, ..., 1, 0, 'BHR'],
       [0.0, 1.0, 0.0, ..., 1, 0, 'INT'],
       [0.0, 1.0, 0.0, ..., 2, 0, 'IRQ']], dtype=object)

In [43]:
# Convert data to DataFrame
transformed_df = pd.DataFrame(
    transformed, 
    columns=transformer.get_feature_names_out()
)
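One side effect worth keeping in mind: the transformer output is an array with dtype=object, so a DataFrame built from it stores every column as object and receives a fresh 0..n-1 index. The sketch below uses a small stand-in array rather than the real transformer output; it shows how `infer_objects()` can recover numeric dtypes and how passing `index=` preserves the original row labels.

```python
import numpy as np
import pandas as pd

# Stand-in for the transformer output: a mixed-type array has dtype=object.
values = np.array([[1.0, 61, "CUB"], [0.0, 72, "ITA"]], dtype=object)
original_index = pd.Index([8929, 3716])  # illustrative index labels

frame = pd.DataFrame(
    values,
    columns=["Gender_F", "AP", "Country"],
    index=original_index,
)
print(frame.dtypes)  # numeric columns are still stored as object here

# infer_objects() converts object columns holding only numbers to numeric dtypes.
typed = frame.infer_objects()
print(typed.dtypes)
```

In the notebook this would correspond to passing `index=df.index` when building `transformed_df` and calling `infer_objects()` on the result, if numeric dtypes are needed downstream.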

In [44]:
# Print all column names of transformed_df
transformer.get_feature_names_out()

array(['onehotencoder__Gender_F', 'onehotencoder__Gender_M',
       'onehotencoder__Discipline_CNF', 'onehotencoder__Discipline_CWT',
       'onehotencoder__Discipline_CWTB', 'onehotencoder__Discipline_FIM',
       'onehotencoder__Card_RED', 'onehotencoder__Card_WHITE',
       'onehotencoder__Card_YELLOW',
       'onehotencoder__Event Type_Competition',
       'onehotencoder__Event Type_Depth Competition',
       'onehotencoder__Event Type_Mixed Competition',
       'onehotencoder__Event Type_Pool Competition',
       'onehotencoder__Event Type_Team Competition',
       'onehotencoder__Event Type_World Championship',
       'onehotencoder__Event Type_Worldrecord attempt',
       'remainder__Diver', 'remainder__AP', 'remainder__RP',
       'remainder__Points', 'remainder__Remarks', 'remainder__Day',
       'remainder__Month', 'remainder__Experience Dive',
       'remainder__Experience Discipline', 'remainder__Country'],
      dtype=object)

In [45]:
# Checking data is complete
transformed_df.isna().sum()

onehotencoder__Gender_F                          0
onehotencoder__Gender_M                          0
onehotencoder__Discipline_CNF                    0
onehotencoder__Discipline_CWT                    0
onehotencoder__Discipline_CWTB                   0
onehotencoder__Discipline_FIM                    0
onehotencoder__Card_RED                          0
onehotencoder__Card_WHITE                        0
onehotencoder__Card_YELLOW                       0
onehotencoder__Event Type_Competition            0
onehotencoder__Event Type_Depth Competition      0
onehotencoder__Event Type_Mixed Competition      0
onehotencoder__Event Type_Pool Competition       0
onehotencoder__Event Type_Team Competition       0
onehotencoder__Event Type_World Championship     0
onehotencoder__Event Type_Worldrecord attempt    0
remainder__Diver                                 0
remainder__AP                                    0
remainder__RP                                    0
remainder__Points              

In [46]:
transformed_df.head()

Unnamed: 0,onehotencoder__Gender_F,onehotencoder__Gender_M,onehotencoder__Discipline_CNF,onehotencoder__Discipline_CWT,onehotencoder__Discipline_CWTB,onehotencoder__Discipline_FIM,onehotencoder__Card_RED,onehotencoder__Card_WHITE,onehotencoder__Card_YELLOW,onehotencoder__Event Type_Competition,onehotencoder__Event Type_Depth Competition,onehotencoder__Event Type_Mixed Competition,onehotencoder__Event Type_Pool Competition,onehotencoder__Event Type_Team Competition,onehotencoder__Event Type_World Championship,onehotencoder__Event Type_Worldrecord attempt,remainder__Diver,remainder__AP,remainder__RP,remainder__Points,remainder__Remarks,remainder__Day,remainder__Month,remainder__Experience Dive,remainder__Experience Discipline,remainder__Country
0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,Deborah Andollo (CUB),61,61,61.0,OK,1994-06-12,6,0,0,CUB
1,0.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,Umberto Pelizzari (ITA),72,72,72.0,OK,1995-09-17,9,0,0,ITA
2,1.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,Deborah Andollo (CUB),62,62,62.0,OK,1996-10-05,10,1,1,CUB
3,0.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,Michael Oliva (FRA),72,72,72.0,OK,1996-10-11,10,0,0,FRA
4,0.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,Alejandro Ravelo (CUB),73,73,73.0,OK,1997-08-02,8,0,0,CUB


In [47]:
# Exporting data to a csv for other notebooks
transformed_df.to_csv("Data/transformed_df.csv")