[Data in Kaggle](https://www.kaggle.com/heesoo37/120-years-of-olympic-history-athletes-and-results?select=athlete_events.csv)

# Import the libraries

In [6]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Mount the data from Drive

In [7]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


# Checking the data carefully
Load the data, inspect its structure, and look at its summary statistics.

In [8]:
df = pd.read_csv("/content/drive/MyDrive/athlete_events.csv")

In [9]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 271116 entries, 0 to 271115
Data columns (total 15 columns):
 #   Column  Non-Null Count   Dtype  
---  ------  --------------   -----  
 0   ID      271116 non-null  int64  
 1   Name    271116 non-null  object 
 2   Sex     271116 non-null  object 
 3   Age     261642 non-null  float64
 4   Height  210945 non-null  float64
 5   Weight  208241 non-null  float64
 6   Team    271116 non-null  object 
 7   NOC     271116 non-null  object 
 8   Games   271116 non-null  object 
 9   Year    271116 non-null  int64  
 10  Season  271116 non-null  object 
 11  City    271116 non-null  object 
 12  Sport   271116 non-null  object 
 13  Event   271116 non-null  object 
 14  Medal   39783 non-null   object 
dtypes: float64(3), int64(2), object(10)
memory usage: 31.0+ MB
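
The non-null counts above translate directly into missing-value shares. The following is a small sketch that uses only the counts printed by `df.info()`; the variable names are illustrative and not part of the notebook.

```python
# Non-null counts copied from the df.info() output above.
total_rows = 271116
non_null = {"Age": 261642, "Height": 210945, "Weight": 208241, "Medal": 39783}

# Share of missing values per column, in percent.
missing_pct = {col: round(100 * (total_rows - n) / total_rows, 1)
               for col, n in non_null.items()}
print(missing_pct)
# {'Age': 3.5, 'Height': 22.2, 'Weight': 23.2, 'Medal': 85.3}
```

Height and Weight are each missing in more than a fifth of the rows, which is why the cleaning steps estimate them instead of dropping those rows. Medal is mostly empty because most entries did not win a medal.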


In [10]:
df.describe()

Unnamed: 0,ID,Age,Height,Weight,Year
count,271116.0,261642.0,210945.0,208241.0,271116.0
mean,68248.954396,25.556898,175.33897,70.702393,1978.37848
std,39022.286345,6.393561,10.518462,14.34802,29.877632
min,1.0,10.0,127.0,25.0,1896.0
25%,34643.0,21.0,168.0,60.0,1960.0
50%,68205.0,24.0,175.0,70.0,1988.0
75%,102097.25,28.0,183.0,79.0,2002.0
max,135571.0,97.0,226.0,214.0,2016.0


# Data cleaning
Convert every string in the DataFrame to lowercase so that values differing only in letter case are not treated as distinct.


In [12]:
df = df.applymap(lambda x:x.lower() if type(x) == str else x)
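
To illustrate why lowercasing helps, here is a minimal sketch on made-up rows (the `toy` frame is hypothetical, not taken from the dataset). It uses `DataFrame.apply` with `Series.map` rather than `applymap`, since `applymap` is, as far as I know, deprecated in recent pandas releases.

```python
import pandas as pd

# Toy frame (made-up rows) where one team appears with two casings.
toy = pd.DataFrame({"Team": ["China", "china", "Spain"],
                    "Year": [1992, 1996, 2008]})

# Lowercase only the string cells; numbers pass through unchanged.
lowered = toy.apply(
    lambda col: col.map(lambda x: x.lower() if isinstance(x, str) else x)
)

print(toy["Team"].nunique(), lowered["Team"].nunique())
# 3 2
```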

Fix the Sex column: some athlete names are recorded with more than one sex, so each such name is assigned its most frequent value.

In [14]:
# Most frequent Sex per name; names with a single recorded Sex become NaN and are dropped below
names_with_def_sex=df.groupby('Name')['Sex'].apply(lambda x: np.nan if x.unique().size == 1 else x.mode().max())
names_with_def_sex = names_with_def_sex.dropna()
names = names_with_def_sex.keys()
for name in names:
  df.loc[df["Name"]==name,"Sex"] = names_with_def_sex[name]
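
The same correction can be sketched without the explicit loop by broadcasting each name's most frequent value with `groupby(...).transform`. This is an alternative formulation on made-up rows, not the notebook's own code; the `toy` frame is hypothetical.

```python
import pandas as pd

# Toy rows (made up): "ana" is recorded twice as f and once as m.
toy = pd.DataFrame({
    "Name": ["ana", "ana", "ana", "bo", "bo"],
    "Sex":  ["f",   "f",   "m",   "m",  "m"],
})

# For each name, broadcast the most frequent Sex to all of that name's rows.
# Ties are broken with max(), mirroring x.mode().max() in the cell above.
toy["Sex"] = toy.groupby("Name")["Sex"].transform(lambda s: s.mode().max())
print(toy["Sex"].tolist())
# ['f', 'f', 'f', 'm', 'm']
```

Grouping by Name treats two different athletes who share a name as one person. Grouping by ID instead may be safer, although that is an assumption about the dataset that this notebook does not verify.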

Filter the numerical outliers out of the DataFrame using the IQR rule, without dropping any NaN values, because those are estimated as well as possible later on.

In [15]:
df1 = df.groupby("Sport",as_index=True)

Find the IQR separately for every sport, because each sport has its own typical ages, heights, and weights. Filtering outliers across all sports together would flag athletes who are perfectly normal for their own sport.


In [16]:
sports = df["Sport"].unique()
# Seed frames built from the column names; the helper column 0 is dropped again below
filterd_df = pd.DataFrame(df.columns)
nan_df = pd.DataFrame(df.columns)
# Per-sport quartiles and Tukey fences for the numeric columns
num_data = df1[["Age","Height","Weight"]]
Q1 = num_data.quantile(0.25)
Q3 = num_data.quantile(0.75)
IQR = Q3 - Q1
lower_pd = Q1 - 1.5 * IQR
upper_pd = Q3 + 1.5 * IQR

for sport in sports:
  df_grouped = df1.get_group(sport)
  # True where every numeric value lies inside its fence or is missing
  mask = (df_grouped['Age'].between(lower_pd.loc[sport]["Age"], upper_pd.loc[sport]["Age"], inclusive="both") | df_grouped['Age'].isna())\
       & (df_grouped['Height'].between(lower_pd.loc[sport]["Height"], upper_pd.loc[sport]["Height"], inclusive="both") | df_grouped['Height'].isna())\
       & (df_grouped['Weight'].between(lower_pd.loc[sport]["Weight"], upper_pd.loc[sport]["Weight"], inclusive="both") | df_grouped['Weight'].isna())

  # Keep only the outlier rows (mask is False) and drop them from this sport
  mask = mask[~mask]
  filterd_sport_df = df[df["Sport"]==sport].drop(mask.index,inplace=False)

  # Fill missing Height and Weight with the sport's rounded mean, when one exists
  if not pd.isna(filterd_sport_df['Height'].mean()):
    filterd_sport_df['Height'] = filterd_sport_df['Height'].fillna(np.round(filterd_sport_df['Height'].mean(),0))

  if not pd.isna(filterd_sport_df['Weight'].mean()):
    filterd_sport_df['Weight'] = filterd_sport_df['Weight'].fillna(np.round(filterd_sport_df['Weight'].mean(),0))

  filterd_df = pd.concat([filterd_df, filterd_sport_df], axis=0)

# Drop the seed rows (they hold fewer than 5 non-NaN values) and the helper column
df = filterd_df.dropna(thresh=5,axis=0)
df = df.drop(columns=[0])
display(df)


Unnamed: 0,Age,City,Event,Games,Height,ID,Medal,NOC,Name,Season,Sex,Sport,Team,Weight,Year
0,24.0,barcelona,basketball men's basketball,1992 summer,180.0,1.0,,chn,a dijiang,summer,m,basketball,china,80.0,1992.0
167,19.0,beijing,basketball women's basketball,2008 summer,185.0,69.0,,esp,tamara abalde daz,summer,f,basketball,spain,72.0,2008.0
250,31.0,helsinki,basketball men's basketball,1952 summer,191.0,124.0,,egy,youssef mohamed abbas,summer,m,basketball,egypt,85.0,1952.0
264,29.0,sydney,basketball men's basketball,2000 summer,195.0,136.0,,ita,alessandro abbio,summer,m,basketball,italy,85.0,2000.0
346,25.0,munich,basketball men's basketball,1972 summer,189.0,192.0,,egy,ahmed el-sayed abdel hamid mobarak,summer,m,basketball,egypt,85.0,1972.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
230913,49.0,chamonix,alpinism mixed alpinism,1924 winter,,115888.0,gold,gbr,"edward lisle ""bill"" strutt",winter,m,alpinism,great britain,,1924.0
255672,47.0,chamonix,alpinism mixed alpinism,1924 winter,,128001.0,gold,gbr,arthur william wakefield,winter,m,alpinism,great britain,,1924.0
50275,26.0,paris,basque pelota men's two-man teams with cesta,1900 summer,,25866.0,gold,esp,jos de amzola y aspiza,summer,m,basque pelota,spain,,1900.0
252988,26.0,paris,basque pelota men's two-man teams with cesta,1900 summer,,126675.0,gold,esp,francisco villota y baquiola,summer,m,basque pelota,spain,,1900.0
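
The per-sport filter above can also be expressed without a loop. The sketch below is an illustrative alternative on made-up data, not the notebook's implementation. The helper `iqr_keep_mask` and the `toy` frame are hypothetical names. It builds the fences $Q_1 - 1.5\,\mathrm{IQR}$ and $Q_3 + 1.5\,\mathrm{IQR}$ per group and keeps NaN values, as the cell above does.

```python
import pandas as pd

# Toy data (made up): two sports with very different height ranges.
toy = pd.DataFrame({
    "Sport":  ["basketball"] * 5 + ["gymnastics"] * 5,
    "Height": [190, 195, 200, 205, 150, 150, 152, 155, 158, 200],
})

def iqr_keep_mask(frame, col, by="Sport", k=1.5):
    """True where `col` lies inside the per-group fences or is NaN."""
    grouped = frame.groupby(by)[col]
    q1 = grouped.transform(lambda s: s.quantile(0.25))
    q3 = grouped.transform(lambda s: s.quantile(0.75))
    iqr = q3 - q1
    inside = frame[col].between(q1 - k * iqr, q3 + k * iqr)
    return inside | frame[col].isna()

kept = toy[iqr_keep_mask(toy, "Height")]
print(kept.index.tolist())
# [0, 1, 2, 3, 5, 6, 7, 8]
```

A height of 150 is dropped for basketball but kept for gymnastics, and 200 is kept for basketball but dropped for gymnastics. This is the reason for computing the fences per sport.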


Create nan_df, which holds every row that still has a NaN in Age, Height, or Weight.

In [17]:
nan_df = pd.concat([df[df['Age'].isna()], df[df['Height'].isna()],df[df['Weight'].isna()]], axis=0)
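
One detail worth knowing about this construction: concatenating one filter per column repeats any row that is missing more than one of the three values. The notebook removes such repeats later with `drop_duplicates`. The sketch below, on made-up rows, contrasts the concatenation with a single row-wise mask; `toy`, `stacked`, and `once` are illustrative names.

```python
import numpy as np
import pandas as pd

# Toy rows (made up): row 1 is missing both Age and Height.
toy = pd.DataFrame({
    "Age":    [20.0, np.nan, np.nan],
    "Height": [180.0, np.nan, 170.0],
    "Weight": [75.0, 60.0, 65.0],
})
cols = ["Age", "Height", "Weight"]

# One filter per column, stacked: row 1 appears twice.
stacked = pd.concat([toy[toy[c].isna()] for c in cols], axis=0)

# A single row-wise mask selects each incomplete row exactly once.
once = toy[toy[cols].isna().any(axis=1)]

print(len(stacked), len(once))
# 3 2
```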

In [18]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 264612 entries, 0 to 214105
Data columns (total 15 columns):
 #   Column  Non-Null Count   Dtype  
---  ------  --------------   -----  
 0   Age     255155 non-null  float64
 1   City    264612 non-null  object 
 2   Event   264612 non-null  object 
 3   Games   264612 non-null  object 
 4   Height  264514 non-null  float64
 5   ID      264612 non-null  float64
 6   Medal   38741 non-null   object 
 7   NOC     264612 non-null  object 
 8   Name    264612 non-null  object 
 9   Season  264612 non-null  object 
 10  Sex     264612 non-null  object 
 11  Sport   264612 non-null  object 
 12  Team    264612 non-null  object 
 13  Weight  264401 non-null  float64
 14  Year    264612 non-null  float64
dtypes: float64(5), object(10)
memory usage: 32.3+ MB


In [19]:
nan_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 9766 entries, 2485 to 214105
Data columns (total 15 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   Age     254 non-null    float64
 1   City    9766 non-null   object 
 2   Event   9766 non-null   object 
 3   Games   9766 non-null   object 
 4   Height  9545 non-null   float64
 5   ID      9766 non-null   float64
 6   Medal   957 non-null    object 
 7   NOC     9766 non-null   object 
 8   Name    9766 non-null   object 
 9   Season  9766 non-null   object 
 10  Sex     9766 non-null   object 
 11  Sport   9766 non-null   object 
 12  Team    9766 non-null   object 
 13  Weight  9427 non-null   float64
 14  Year    9766 non-null   float64
dtypes: float64(5), object(10)
memory usage: 1.2+ MB


Fill the NaN values in nan_df with suitable estimates.

In [20]:
global player_names

In [21]:
player_names = nan_df.Name

In [22]:
# numeric_only=True restricts the mean to numeric columns (pandas 2.x raises on text columns otherwise)
meanbysex=df.groupby(["Sex"], as_index=False).mean(numeric_only=True)
meanbySport=df.groupby(["Sport"], as_index=True).Age.mean()

# Average height and weight per sex; row 0 is "f" and row 1 is "m"
favg_Height = meanbysex[meanbysex["Sex"]=="f"].Height
mavg_Height = meanbysex[meanbysex["Sex"]=="m"].Height
favg_Weight = meanbysex[meanbysex["Sex"]=="f"].Weight
mavg_Weight = meanbysex[meanbysex["Sex"]=="m"].Weight

In [23]:
names = player_names.unique()
def same_name(name,df):
  # Work on a copy so the assignments below do not trigger SettingWithCopyWarning
  name_df = df[df["Name"] == name].copy()
  name_Height = name_df["Height"].mean()
  name_Weight = name_df["Weight"].mean()

  # Age baseline: rounded mean age of the sport in this name's first row
  base_age = np.round(meanbySport[name_df.iloc[0]["Sport"]],0)
  first_year = name_df["Year"].min()
  if name_df.Games.count() == 1:
    name_df["Age"] = name_df["Age"].fillna(base_age)
  else:
    # Several rows: the earliest Games get the baseline, later Games add the elapsed years.
    # Note that this also overwrites any Age already recorded for these rows.
    name_df["Age"] = base_age + name_df["Year"] - first_year

  # No recorded height for this name: fall back to the average for the athlete's sex
  if pd.isna(name_Height):
    name_df.loc[name_df["Sex"]=="f",'Height'] = np.round(favg_Height[0],0)
    name_df.loc[name_df["Sex"]=="m",'Height'] = np.round(mavg_Height[1],0)
  else:
    name_df['Height'] = name_df['Height'].fillna(name_Height)

  # Same fallback for weight
  if pd.isna(name_Weight):
    name_df.loc[name_df["Sex"]=="f",'Weight'] = np.round(favg_Weight[0],0)
    name_df.loc[name_df["Sex"]=="m",'Weight'] = np.round(mavg_Weight[1],0)
  else:
    name_df['Weight'] = name_df['Weight'].fillna(name_Weight)

  return name_df

def nan_filler(names,df):
  # Fill the rows of every name in turn and stack the results, printing progress
  flag = True
  i = 0
  for name in names:
    if flag:
      new_df = same_name(name,df)
      flag=False
    else:
      new_df = pd.concat([new_df, same_name(name,df)],ignore_index = True)
      print(name)
    i = i+1
    print(i)
  return new_df

new_df = nan_filler(names,nan_df)

Streaming output truncated to the last 5000 lines.
giuseppe crivelli
3847
carlos crosta noceti
3848
ricardo santos da benta
3849
...
hernando carlos mara teresa fitz-james stuart y falc portocarrero y osorio
6440
jacobo mara del pilar
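
The age rule used by `same_name` is: take the sport's rounded mean age as the age at the athlete's first Games, then add the years elapsed for later Games. A vectorized sketch of that rule on made-up rows follows. The names `toy` and `mean_age_by_sport` and their values are hypothetical. Unlike the cell above, this version looks up each row's own sport and fills only the missing ages.

```python
import numpy as np
import pandas as pd

# Toy rows (made up): athlete "x" appears at three Games with no recorded age.
toy = pd.DataFrame({
    "Name":  ["x", "x", "x", "y"],
    "Sport": ["polo", "polo", "polo", "golf"],
    "Year":  [1900, 1908, 1904, 1900],
    "Age":   [np.nan, np.nan, np.nan, np.nan],
})
# Hypothetical per-sport mean ages, standing in for meanbySport.
mean_age_by_sport = pd.Series({"polo": 30.4, "golf": 27.6})

# Baseline age per row, then the offset from each athlete's first year.
base_age = toy["Sport"].map(mean_age_by_sport).round(0)
first_year = toy.groupby("Name")["Year"].transform("min")
estimate = base_age + (toy["Year"] - first_year)

# Fill only the missing ages; recorded ages would be left untouched.
toy["Age"] = toy["Age"].fillna(estimate)
print(toy["Age"].tolist())
# [30.0, 38.0, 34.0, 28.0]
```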

Drop the nan_df rows from the main DataFrame.

In [26]:
indexs = nan_df.index.tolist()
df = df.drop(indexs)
df

Unnamed: 0,Age,City,Event,Games,Height,ID,Medal,NOC,Name,Season,Sex,Sport,Team,Weight,Year
0,24.0,barcelona,basketball men's basketball,1992 summer,180.0,1.0,,chn,a dijiang,summer,m,basketball,china,80.0,1992.0
167,19.0,beijing,basketball women's basketball,2008 summer,185.0,69.0,,esp,tamara abalde daz,summer,f,basketball,spain,72.0,2008.0
250,31.0,helsinki,basketball men's basketball,1952 summer,191.0,124.0,,egy,youssef mohamed abbas,summer,m,basketball,egypt,85.0,1952.0
264,29.0,sydney,basketball men's basketball,2000 summer,195.0,136.0,,ita,alessandro abbio,summer,m,basketball,italy,85.0,2000.0
346,25.0,munich,basketball men's basketball,1972 summer,189.0,192.0,,egy,ahmed el-sayed abdel hamid mobarak,summer,m,basketball,egypt,85.0,1972.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
239604,26.0,london,motorboating mixed b-class (under 60 feet),1908 summer,181.0,120083.0,gold,gbr,isaac thomas thornycroft,summer,m,motorboating,gyrinus-1,77.0,1908.0
239605,26.0,london,motorboating mixed c-class,1908 summer,181.0,120083.0,gold,gbr,isaac thomas thornycroft,summer,m,motorboating,gyrinus-1,77.0,1908.0
239707,46.0,london,motorboating mixed a-class (open),1908 summer,181.0,120129.0,gold,fra,"ernest blakelock ""mile"" thubron",summer,m,motorboating,camille,77.0,1908.0
259371,29.0,london,motorboating mixed a-class (open),1908 summer,181.0,129853.0,,gbr,hugh richard arthur grosvenor,summer,m,motorboating,wolseley-siddeley-1,77.0,1908.0


Create cleaned_df, the cleaned version of the DataFrame, by combining the complete rows with the filled-in rows.

In [62]:
cleaned_df =pd.concat([df, new_df], axis=0)
cleaned_df.drop_duplicates(inplace=True)

In [63]:
cleaned_df

Unnamed: 0,Age,City,Event,Games,Height,ID,Medal,NOC,Name,Season,Sex,Sport,Team,Weight,Year
0,24.0,barcelona,basketball men's basketball,1992 summer,180.0,1.0,,chn,a dijiang,summer,m,basketball,china,80.0,1992.0
167,19.0,beijing,basketball women's basketball,2008 summer,185.0,69.0,,esp,tamara abalde daz,summer,f,basketball,spain,72.0,2008.0
250,31.0,helsinki,basketball men's basketball,1952 summer,191.0,124.0,,egy,youssef mohamed abbas,summer,m,basketball,egypt,85.0,1952.0
264,29.0,sydney,basketball men's basketball,2000 summer,195.0,136.0,,ita,alessandro abbio,summer,m,basketball,italy,85.0,2000.0
346,25.0,munich,basketball men's basketball,1972 summer,189.0,192.0,,egy,ahmed el-sayed abdel hamid mobarak,summer,m,basketball,egypt,85.0,1972.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9761,39.0,london,jeu de paume men's singles,1908 summer,178.0,79794.0,silver,gbr,eustace hamilton miles,summer,m,jeu de paume,great britain,74.0,1908.0
9762,32.0,london,jeu de paume men's singles,1908 summer,178.0,90545.0,,gbr,arthur page,summer,m,jeu de paume,great britain,74.0,1908.0
9763,21.0,london,jeu de paume men's singles,1908 summer,178.0,90836.0,,gbr,arnold nottage palmer,summer,m,jeu de paume,great britain,74.0,1908.0
9764,42.0,london,jeu de paume men's singles,1908 summer,181.0,105390.0,,usa,charles edward sands,summer,m,jeu de paume,united states,74.0,1908.0


Reindexing cleaned_df

In [64]:
cleaned_df = cleaned_df.reset_index(drop=True)
cleaned_df

Unnamed: 0,Age,City,Event,Games,Height,ID,Medal,NOC,Name,Season,Sex,Sport,Team,Weight,Year
0,24.0,barcelona,basketball men's basketball,1992 summer,180.0,1.0,,chn,a dijiang,summer,m,basketball,china,80.0,1992.0
1,19.0,beijing,basketball women's basketball,2008 summer,185.0,69.0,,esp,tamara abalde daz,summer,f,basketball,spain,72.0,2008.0
2,31.0,helsinki,basketball men's basketball,1952 summer,191.0,124.0,,egy,youssef mohamed abbas,summer,m,basketball,egypt,85.0,1952.0
3,29.0,sydney,basketball men's basketball,2000 summer,195.0,136.0,,ita,alessandro abbio,summer,m,basketball,italy,85.0,2000.0
4,25.0,munich,basketball men's basketball,1972 summer,189.0,192.0,,egy,ahmed el-sayed abdel hamid mobarak,summer,m,basketball,egypt,85.0,1972.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
263241,39.0,london,jeu de paume men's singles,1908 summer,178.0,79794.0,silver,gbr,eustace hamilton miles,summer,m,jeu de paume,great britain,74.0,1908.0
263242,32.0,london,jeu de paume men's singles,1908 summer,178.0,90545.0,,gbr,arthur page,summer,m,jeu de paume,great britain,74.0,1908.0
263243,21.0,london,jeu de paume men's singles,1908 summer,178.0,90836.0,,gbr,arnold nottage palmer,summer,m,jeu de paume,great britain,74.0,1908.0
263244,42.0,london,jeu de paume men's singles,1908 summer,181.0,105390.0,,usa,charles edward sands,summer,m,jeu de paume,united states,74.0,1908.0


In [76]:
cleaned_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 263246 entries, 0 to 263245
Data columns (total 15 columns):
 #   Column  Non-Null Count   Dtype  
---  ------  --------------   -----  
 0   Age     263246 non-null  float64
 1   City    263246 non-null  object 
 2   Event   263246 non-null  object 
 3   Games   263246 non-null  object 
 4   Height  263246 non-null  float64
 5   ID      263246 non-null  float64
 6   Medal   38731 non-null   object 
 7   NOC     263246 non-null  object 
 8   Name    263246 non-null  object 
 9   Season  263246 non-null  object 
 10  Sex     263246 non-null  object 
 11  Sport   263246 non-null  object 
 12  Team    263246 non-null  object 
 13  Weight  263246 non-null  float64
 14  Year    263246 non-null  float64
dtypes: float64(5), object(10)
memory usage: 30.1+ MB


In [67]:
cleaned_df.describe()

Unnamed: 0,Age,Height,ID,Weight,Year
count,263246.0,263246.0,263246.0,263246.0,263246.0
mean,25.315731,175.072841,68248.354076,70.119236,1978.5681
std,5.879985,9.525086,39033.233081,12.177562,29.7128
min,10.0,137.0,1.0,31.0,1896.0
25%,21.0,169.0,34630.0,62.0,1960.0
50%,24.0,175.0,68209.0,70.0,1988.0
75%,28.0,181.0,102091.0,77.0,2002.0
max,77.0,220.0,135571.0,135.0,2016.0


Note that the cleaning dropped only about 3% of the rows, which is acceptable.
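
The figure can be checked from the two row counts reported by `df.info()` and `cleaned_df.info()`. This is a quick arithmetic sketch; the variable names are illustrative.

```python
rows_before = 271116   # entries reported by the first df.info()
rows_after = 263246    # entries reported by cleaned_df.info()

dropped = rows_before - rows_after
dropped_pct = 100 * dropped / rows_before
print(dropped, round(dropped_pct, 1))
# 7870 2.9
```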

Save it as a CSV file.

In [75]:
cleaned_df.to_csv('filterd_athlete_events.csv')
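
By default, `to_csv` also writes the index as an unnamed first column, which reappears as `Unnamed: 0` when the file is read back. The sketch below shows the difference on a made-up frame, using an in-memory buffer instead of a file. Whether to pass `index=False` when saving `cleaned_df` is a design choice that the notebook does not state.

```python
import io
import pandas as pd

toy = pd.DataFrame({"Age": [24.0, 19.0], "Sport": ["basketball", "basketball"]})

# Default: the index is written as an extra, unnamed first column.
with_index = io.StringIO()
toy.to_csv(with_index)
with_index.seek(0)
cols_with_index = pd.read_csv(with_index).columns.tolist()

# index=False writes only the data columns.
without_index = io.StringIO()
toy.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = pd.read_csv(without_index).columns.tolist()

print(cols_with_index)     # ['Unnamed: 0', 'Age', 'Sport']
print(cols_without_index)  # ['Age', 'Sport']
```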