# **Data Understanding**

In [2]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# **Cyclists Dataset**

In [3]:
cyclists_df = pd.read_csv('../dataset/cyclists.csv')

cyclists_numeric = ["birth_year", "height", "weight"]
cyclists_categorical = ["nationality"]

In [None]:
cyclists_df.head()

### **Basic Checks**

### • Attribute Types

In [None]:
cyclists_df.info()

From an initial check there are no particular anomalies in the attribute types.

### • Non-Null Values Check

In [None]:
cyclists_df.isnull().any()

We plot a bar chart showing how many null values there are for each attribute to get a clearer view.

In [None]:
# Calculate the number of null values for each column
null_counts = cyclists_df.isnull().sum()

# Plot the histogram
null_counts.plot(kind='bar', figsize=(10, 6), title='Histogram of Null Values for Each Column in Cyclists Dataset')
plt.xlabel('Attributes')
plt.ylabel('Number of Null Values')
# Add y values over the columns
for i, v in enumerate(null_counts):
    plt.text(i, v + 50, str(v), ha='center', va='bottom')

plt.show()

The attributes with the most null values are weight and height.
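Absolute counts are easier to compare once they are expressed as a share of the rows. The following is only a minimal sketch on an invented toy frame (`toy_df` and its values are hypothetical stand-ins for `cyclists_df`):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for cyclists_df (values are invented)
toy_df = pd.DataFrame({
    "name": ["A", "B", "C", "D"],
    "height": [180.0, np.nan, np.nan, 175.0],
    "weight": [70.0, np.nan, 65.0, np.nan],
})

# Fraction of missing values per column, from most to least affected
null_share = toy_df.isnull().mean().sort_values(ascending=False)
print(null_share)
```

The same expression applied to `cyclists_df` would give the fraction of missing values for each attribute.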

### **Basic Statistics**

In [None]:
cyclists_df[cyclists_numeric].describe()

We briefly validate the min/max values by querying the web.

In [None]:
print('Cyclists corresponding to the min values:')
print(f'- {cyclists_df[cyclists_df["birth_year"] == 1933]["name"].values[0]} was born in 1933')
print(f'- {cyclists_df[cyclists_df["height"] == 154]["name"].values[0]} was 154 cm tall')
print(f'- {cyclists_df[cyclists_df["weight"] == 48]["name"].values[0]} weighed 48 kg')

print()

print('Cyclists corresponding to the max values:')
print(f'- {cyclists_df[cyclists_df["birth_year"] == 2004]["name"].values[0]} was born in 2004')
print(f'- {cyclists_df[cyclists_df["height"] == 204]["name"].values[0]} was 204 cm tall')
print(f'- {cyclists_df[cyclists_df["weight"] == 94]["name"].values[0]} weighed 94 kg')

All of these values are confirmed by the information found on the web.
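The lookup can also be written without hard-coding the extreme values, by asking pandas for the row that holds the minimum or maximum. This is a sketch under assumptions: `toy_df` is an invented frame and `extreme_names` is a hypothetical helper, not part of the notebook.

```python
import pandas as pd

# Hypothetical toy frame standing in for cyclists_df (values are invented)
toy_df = pd.DataFrame({
    "name": ["Rider A", "Rider B", "Rider C"],
    "birth_year": [1950.0, 1990.0, None],
    "height": [170.0, None, 190.0],
})

def extreme_names(df, column):
    """Return the names of the rows holding the min and the max of `column`."""
    values = df[column].dropna()
    return df.loc[values.idxmin(), "name"], df.loc[values.idxmax(), "name"]

for col in ["birth_year", "height"]:
    lowest, highest = extreme_names(toy_df, col)
    print(f"{col}: min -> {lowest}, max -> {highest}")
```

If several rows share the extreme value, `idxmin` and `idxmax` return the first one, which matches the `.values[0]` choice made in the cell above.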

### Column Analysis

#### _url

This categorical column contains the unique URL identifier of a cyclist. As we can see after a simple check, the 6134 total values are unique: there are no duplicates in the column nor null values.

In [None]:
print(cyclists_df['_url'].duplicated().sum(), 'duplicates found')


In [None]:
cyclists_df['_url'].isnull().sum()

#### Name

It is a categorical column containing the name of a cyclist. A first check shows that there are 7 duplicates in the column, so we analyze them in detail.

In [None]:
print(cyclists_df['name'].duplicated().sum(), 'duplicates found')

Since the ```_url``` values are unique, we can rule out duplicated records. Indeed, by inspecting the duplicated values in the ```name``` column and their associated ```_url``` values, we can assume that they refer to different people. For instance, ```Sergio Domínguez``` is associated with ```sergio-dominguez-rodriguez``` and ```sergio-dominguez-munoz```, which are two distinct, existing cyclists.

In this example, therefore, the value in the ```name``` column is simply a shortened name shared by two different cyclists. When the full name is also the same, the ```_url``` value carries a trailing number to tell the two cyclists apart. For example, ```Alessandro Pozzi``` and ```Andrea Peron``` are associated with ```alessandro-pozzi```, ```alessandro-pozzi2``` and ```andrea-peron```, ```andrea-peron-1``` respectively.

In [None]:
cyclists_df[cyclists_df.duplicated(subset='name', keep=False)][['_url', 'name']]

In [None]:
cyclists_df[cyclists_df['name'].isin(['Andrea  Peron', 'Alessandro  Pozzi'])]

Again, an online search allowed us to verify that these are four different people, and also validated the data associated with them in the other columns of the dataset. 

No single convention seems to be used in the ```_url``` values to distinguish two cyclists with the same name. For example, the dataset contains two such cyclists: ```Jesús López Carril``` (1949) and ```Jesús López Soriano``` (1955). For the first one, the ```_url``` is ```jesus-lopez-carril```, as expected. For the second one we would have expected an ```_url``` value like ```jesus-lopez-soriano```; instead it is ```jesus-lopez23```.

In [None]:
cyclists_df[cyclists_df['name'] == 'Jesús  López']
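To see how often a numeric suffix is used as a disambiguator, one option is to flag the identifiers that end in digits. A minimal sketch, assuming the identifiers quoted in the text; `toy_df` and `has_numeric_suffix` are illustrative names:

```python
import pandas as pd

# Identifiers taken from the examples discussed above
toy_df = pd.DataFrame({
    "_url": [
        "andrea-peron",
        "andrea-peron-1",
        "alessandro-pozzi2",
        "jesus-lopez23",
        "sergio-dominguez-munoz",
    ],
})

# Flag identifiers that end in one or more digits
has_numeric_suffix = toy_df["_url"].str.contains(r"\d+$", regex=True)
suffixed = toy_df.loc[has_numeric_suffix, "_url"].tolist()
print(suffixed)
```

Run on `cyclists_df['_url']`, the same mask would list every identifier that relies on a numeric suffix, whatever its exact form (`-1`, `2`, `23`).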

#### birth_year

This is a numerical attribute indicating the birth year of a cyclist. Duplicates are naturally allowed. We check whether there are null values.

In [None]:
int(cyclists_df['birth_year'].isnull().sum())

Since there are only 13 null values we show all the related rows.

In [None]:
cyclists_df[cyclists_df['birth_year'].isnull()]

As we can see from the table above, the ```weight``` and ```height``` values are also ```NaN``` when ```birth_year``` is null.
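Rather than reading this off the table, the observation can be checked programmatically. The sketch below uses an invented toy frame (`toy_df` is hypothetical); the same two lines should apply to `cyclists_df`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for cyclists_df (values are invented)
toy_df = pd.DataFrame({
    "birth_year": [1980.0, np.nan, 1975.0, np.nan],
    "height": [180.0, np.nan, np.nan, np.nan],
    "weight": [70.0, np.nan, 68.0, np.nan],
})

missing_birth = toy_df["birth_year"].isnull()
# For the rows without a birth year: are height and weight always missing too?
all_missing = toy_df.loc[missing_birth, ["height", "weight"]].isnull().all()
print(all_missing)
```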

##### Plots

In [None]:
plt.figure(figsize=(10, 6))
sns.histplot(cyclists_df['birth_year'].dropna())
plt.title('birth_year distribution')
plt.xlabel('Birth Year')
plt.ylabel('Frequency')
plt.show()

TODO: comment histogram

In [None]:
plt.figure(figsize=(5, 10))
sns.boxplot(y=cyclists_df['birth_year'])
plt.title('Boxplot of Birth Year')
plt.ylabel('Birth Year')
plt.show()

In [None]:
cyclists_df['birth_year'].describe()

Overall, the distribution of ```birth_year``` is centered around 1974, with most cyclists born between 1962 and 1987, and the full range spans from 1933 to 2004. There do not seem to be any outliers, as no individual points are plotted beyond the whiskers.
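The boxplot whiskers follow the usual 1.5 × IQR rule (the seaborn default), so the "no points beyond the whiskers" reading can be double-checked numerically. This is a sketch only: `iqr_outliers` is a hypothetical helper and the two series are invented.

```python
import pandas as pd

def iqr_outliers(series, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    values = series.dropna()
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return values[(values < lower) | (values > upper)]

# Invented example: evenly spread years, no outliers expected
toy_years = pd.Series([1960, 1965, 1970, 1975, 1980, 1985, 1990])
print(len(iqr_outliers(toy_years)))

# Invented example with one implausible year
toy_with_extreme = pd.Series([1960, 1965, 1970, 1975, 1980, 1985, 1850])
print(iqr_outliers(toy_with_extreme).tolist())
```

Calling `iqr_outliers(cyclists_df['birth_year'])` should return an empty series if the visual reading above is right.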

#### weight

This is a numerical attribute. Duplicates are allowed for obvious reasons. We check null values.

In [None]:
int(cyclists_df['weight'].isnull().sum())

In [None]:
cyclists_df[cyclists_df['weight'].isnull()].sample(25)

There are 3056 ```NaN``` values for the ```weight``` attribute. By picking a small random sample of the rows where the ```weight``` attribute is ```NaN```, we notice that the ```height``` is often ```NaN``` as well.

##### Plots

In [None]:
plt.figure(figsize=(10, 6))
sns.histplot(cyclists_df['weight'].dropna(), binwidth=1)
plt.title('weight distribution')
plt.xlabel('weight')
plt.ylabel('Frequency')
plt.show()

TODO: comment the histogram

In [None]:
plt.figure(figsize=(5, 10))
sns.boxplot(y=cyclists_df['weight'])
plt.title('Boxplot of weight')
plt.ylabel('Weight')
plt.show()

In [None]:
cyclists_df['weight'].describe()

There are some outliers, mostly beyond the upper whisker. This, together with the other characteristics of the boxplot, suggests positive skewness.
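The visual impression of positive skewness can be backed by a number: the sample skewness is positive for a right-skewed distribution, and the mean usually exceeds the median. A minimal sketch on invented weights (`toy_weights` is hypothetical):

```python
import pandas as pd

# Invented weights with a longer right tail
toy_weights = pd.Series([60, 62, 63, 64, 65, 66, 68, 80, 90], dtype=float)

skewness = toy_weights.skew()  # sample skewness; > 0 means right-skewed
print(f"skewness = {skewness:.2f}")
print(f"mean = {toy_weights.mean():.2f}, median = {toy_weights.median():.2f}")
```

On the real data, `cyclists_df['weight'].skew()` would play the same role.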

#### height

This is a numerical attribute. Duplicates are allowed for obvious reasons. We check null values.

In [None]:
int(cyclists_df['height'].isnull().sum())

There are 2991 ```NaN``` values for the ```height``` attribute. As mentioned above, ```height``` is very likely to be ```NaN``` when ```weight``` is also ```NaN```.
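How strongly the two kinds of missingness overlap can be quantified with a contingency table of the null indicators. The sketch below uses an invented toy frame; `toy_df`, `null_table` and `both_missing` are illustrative names.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for cyclists_df (values are invented)
toy_df = pd.DataFrame({
    "height": [180.0, np.nan, np.nan, 175.0, np.nan],
    "weight": [70.0, np.nan, np.nan, np.nan, 66.0],
})

weight_null = toy_df["weight"].isnull().rename("weight_is_null")
height_null = toy_df["height"].isnull().rename("height_is_null")

# Rows: is weight missing? Columns: is height missing?
null_table = pd.crosstab(weight_null, height_null)
print(null_table)

# Number of rows where both attributes are missing
both_missing = int((weight_null & height_null).sum())
print(both_missing)
```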

##### Plots

In [None]:
plt.figure(figsize=(10, 6))
sns.histplot(cyclists_df['height'].dropna(), binwidth=1)
plt.title('height distribution')
plt.xlabel('height')
plt.ylabel('Frequency')
plt.show()

In [None]:
plt.figure(figsize=(5, 10))
sns.boxplot(y=cyclists_df['height'])
plt.title('Boxplot of height')
plt.ylabel('Height')
plt.show()

In [None]:
cyclists_df['height'].describe()

There are some outliers, this time spread more or less equally beyond both whiskers. The boxplot, together with the histogram plotted before, suggests a relatively symmetric distribution of the ```height``` values.

#### nationality

This is a categorical attribute. Duplicates are allowed. We check the presence of null values.

In [None]:
int(cyclists_df['nationality'].isnull().sum())

In [None]:
cyclists_df[cyclists_df['nationality'].isnull()]

There is only one null value in the ```nationality``` column.

In [None]:
cyclists_df['nationality'].unique()

Looking at the unique values of ```nationality```, there are apparently no problematic values in the column. All values appear semantically correct; their syntactic correctness still needs to be checked.
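One way to carry out the syntactic check is to compare every value against an expected pattern. The sketch assumes a convention that is not confirmed by the dataset (capitalized words separated by single spaces); `toy_nationality` and `pattern` are hypothetical.

```python
import pandas as pd

# Invented values: two of them deliberately break the assumed convention
toy_nationality = pd.Series(["Italy", "France", " Spain", "belgium", None, "Great Britain"])

values = toy_nationality.dropna()
# Assumed convention: capitalized words separated by single spaces
pattern = r"^[A-Z][a-z]+(?: [A-Z][a-z]+)*$"
suspicious = values[~values.str.match(pattern)].tolist()
print(suspicious)
```

Values flagged this way are only candidates for inspection: legitimate names with accents or hyphens would also be reported under this simple pattern.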

##### Plots

In [None]:
plt.figure(figsize=(12, 6))
sns.countplot(x='nationality', data=cyclists_df, order=cyclists_df['nationality'].value_counts().index)
plt.title('Histogram of Nationality')
plt.xlabel('Nationality')
plt.ylabel('Count')
plt.xticks(rotation=90)  # Rotate x-axis labels for better readability
plt.show()

### Correlation Analysis

In [None]:
# Define the coefficients
coefficients = ['spearman', 'pearson', 'kendall']

# Plot the correlation matrices
plt.figure(figsize=(18, 6))

for i, c in enumerate(coefficients):
    plt.subplot(1, 3, i + 1)
    correlation_matrix = cyclists_df[cyclists_numeric].corr(method=c)
    sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', linewidths=0.5)
    plt.title(f'{c.capitalize()} Correlation Matrix')

plt.tight_layout()
plt.show()


In [None]:
# Calculate the correlation matrix excluding non-numeric columns
correlation_matrix = cyclists_df[cyclists_numeric].corr()

# Plot the heatmap
plt.figure(figsize=(12, 8))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', linewidths=0.5)
plt.title('Correlation Matrix of Cyclists Dataset')
plt.show()

# **Races Dataset**

In [36]:
races_df = pd.read_csv('../dataset/races.csv')

races_numeric_stage = ["length", "climb_total", "startlist_quality", "average_temperature"]
races_numeric = ["position", "cyclist_age", "delta"]
races_categorical = ["points", "uci_points", "profile"]
races_binary = ["is_tarmac", "is_cobbled", "is_gravel"]

### **Basic Checks**

#### • Attributes types

In [None]:
races_df.info()

#### • Check null values

In [None]:
# Calculate the number of null values for each column
null_counts = races_df.isnull().sum()

# Plot the histogram
ax = null_counts.plot(kind='bar', figsize=(10, 6), title='Histogram of Null Values for Each Column in Races Dataset')
plt.xlabel('Attributes')
plt.ylabel('Number of Null Values')

# Add y values over the columns, rotating only for "climb_total" and "profile"
for i, v in enumerate(null_counts):
    if null_counts.index[i] in ["climb_total", "profile"]:
        ax.text(i, v + 50, str(v), ha='center', va='bottom', rotation=45)
    else:
        ax.text(i, v + 50, str(v), ha='center', va='bottom')

plt.show()

#### • Duplicates

Each row related to a given stage of a race should refer to a different cyclist. We therefore check whether this holds, i.e. whether the same cyclist appears more than once in the same stage.

In [None]:
# Group by '_url' and 'cyclist' and count the occurrences
duplicate_cyclists = races_df.groupby(['_url', 'cyclist']).size().reset_index(name='count')

# Filter the groups where count is greater than 1
duplicate_cyclists = duplicate_cyclists[duplicate_cyclists['count'] > 1]

# Display the duplicate cyclists
print(duplicate_cyclists)


We assume that there are 123 duplicates in the dataset that should be handled during the data preparation process.

In [40]:
# # Count the duplicate values in the 'cyclist' column with the same value in the '_url' column
# duplicate_counts = races_df.groupby('_url')['cyclist'].value_counts()

# # filter only cyclists duplicated in the same stage
# duplicate_counts = duplicate_counts[duplicate_counts > 1]

# duplicate_counts.to_csv('../dataset/cyclist_duplicate.csv', header=True)

# #count number of total duplicates in the races dataset
# print("Number of total duplicates in the races dataset: ", duplicate_counts.count())

#### • Unique values

Checking unique values for each column in the dataset

In [None]:
# Number of distinct values for every column of the dataset
distinct_counts = races_df.nunique()
print("Number of distinct values per column:")
print(distinct_counts)

Another important thing we want to verify is that the details of a stage are the same across all its rows, i.e. that there are no inconsistent entries.

In [None]:
race_attributes = ["name", "points", "uci_points", "length", "climb_total", 
                   "startlist_quality", "average_temperature", "is_tarmac"] # TODO: are we sure all the attributes are listed?

# Use a set so that a stage inconsistent on several attributes is counted once
inconsistent_urls = set()

# Check each attribute in race_attributes
for attribute in race_attributes:
    # Group by _url and flag the stages with more than one distinct (non-null) value
    inconsistent = races_df.groupby('_url')[attribute].nunique() > 1
    # Collect the _url values with inconsistencies
    inconsistent_urls.update(inconsistent[inconsistent].index)

# Display the number of inconsistent _url values
print("Inconsistent _url values:", len(inconsistent_urls))
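The same consistency check can be expressed without an explicit loop, by counting distinct values for all attributes at once. This is a sketch on invented data; `toy_races` and `stage_attributes` are hypothetical names and only two attributes are used.

```python
import pandas as pd

# Hypothetical toy frame standing in for races_df (values are invented)
toy_races = pd.DataFrame({
    "_url": ["race-a/2020/stage-1"] * 3 + ["race-b/2021/stage-2"] * 2,
    "length": [150000.0, 150000.0, 150000.0, 120000.0, 125000.0],
    "climb_total": [2000.0, 2000.0, None, 900.0, 900.0],
})

stage_attributes = ["length", "climb_total"]
# Number of distinct non-null values of each attribute within each stage
distinct_per_stage = toy_races.groupby("_url")[stage_attributes].nunique()
# A stage is inconsistent if any attribute takes more than one value
is_inconsistent = (distinct_per_stage > 1).any(axis=1)
inconsistent_stages = distinct_per_stage.index[is_inconsistent].tolist()
print(inconsistent_stages)
```

Note that `nunique` ignores nulls, so a stage where an attribute is sometimes missing and sometimes filled is not reported as inconsistent.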



Analyzing numeric columns

In [None]:
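# Stage-level numeric attributes plus the stage identifier used for grouping
# (built as a new list so that races_numeric_stage is not modified in place)
stage_columns = races_numeric_stage + ["_url"]

# Keep one row per stage (first non-null value of each attribute)
numeric_df = races_df[stage_columns].groupby("_url").first().reset_index()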

numeric_df.describe()


### **Columns Analysis**

#### • '_url' column

In [None]:
races_df['_url'].nunique()

This column contains the unique identifier of a race stage. There are 5281 different _url values in total. An _url has the format "RACE_NAME/RACE_DATE/STAGE_NUMBER". For example, the URL "tour-de-france/1978/stage-6" denotes the 6th stage of the 1978 edition of the Tour de France. It is associated with some race stage details. TODO: consider listing which details.
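The three parts of the identifier can be separated into columns, which makes later grouping by race or by year easier. A minimal sketch, assuming every identifier has exactly three `/`-separated parts; the second URL and the column names `race`, `year`, `stage` are invented for illustration.

```python
import pandas as pd

# First value is the example from the text; the second one is invented
toy_urls = pd.Series(["tour-de-france/1978/stage-6", "example-race/2001/stage-2"])

# Split on '/' into one column per component
parts = toy_urls.str.split("/", expand=True)
parts.columns = ["race", "year", "stage"]
parts["year"] = parts["year"].astype(int)
print(parts)
```

If some identifiers had a different number of components, `expand=True` would produce a different number of columns and the renaming step would fail, which is itself a useful format check.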

In [None]:
races_df.head(10)

Obviously, if we look only at the column itself, we will find duplicate values. However, by looking at the first ten rows of the table, we assume that there is one entry for each rider who participated in a given stage. Given this observation, we have already checked whether the same rider appears more than once in the same stage (see the Duplicates section above for the results).

### • Name column

The attribute is categorical. First of all we check if there are null values.

In [None]:
races_df['name'].isnull().sum()

In [None]:
races_df['name'].nunique()

There are no null values in the column, which contains 61 unique values. They are the names of different races. We display those values to check qualitatively whether some of them are malformed. TODO: check the correct term to identify errors (e.g. "asfdnajsfa")

In [None]:
races_df['name'].unique()

In [None]:
# Split the _url column at the first '/' and take the first part
race_names = races_df['_url'].str.split('/', n=1).str[0]
# Count the unique values
race_names.nunique()


In [None]:
for name, group in races_df.groupby('name'):
    print(f"Race Name: {name}")
    unique_urls = group['_url'].str.split('/', n=1).str[0].unique()
    print(unique_urls)
    print()

We notice that only 27 distinct race names are used in the unique identifier. This is because, as we show below, the same RACE_NAME part of the identifier is associated with different values in the ```name``` column. This does not mean that the same identifier is associated with different races. Rather, the values in the ```name``` column for the same race differ only by a few letters (e.g. e instead of è), or the race has adopted a different name over the years, or it is simply known by different names.

In particular:
- ```san-sebastian``` is associated to _Clasica Ciclista San Sebastian_, _Clásica Ciclista San Sebastian_, _Clásica Ciclista San Sebastián_, _Clásica San Sebastián_, _Donostia San Sebastian Klasikoa_
- ```dauphine``` is associated to _Criterium du Dauphiné_, _Criterium du Dauphiné Libére_, _Critérium du Dauphiné_, _Critérium du Dauphiné Libéré_
- ```dwars-door-vlaanderen``` is associated to _Dwars door België / À travers la Belgique_, _Dwars door Vlaanderen_, _Dwars door Vlaanderen - A travers la Flandre ME_, _Dwars door Vlaanderen / A travers la Flandre_, _Dwars door Vlaanderen / A travers la Flandre ME_
- ```e3-harelbeke``` is associated to _E3 BinckBank Classic_, _E3 Harelbeke_, _E3 Prijs Vlaanderen_, _E3 Prijs Vlaanderen - Harelbeke_, _E3 Saxo Bank Classic_, _E3 Saxo Classic_, _E3-Prijs Harelbek_
- ```gp-quebec``` is associated to _Grand Prix Cycliste de Quebec_, _Grand Prix Cycliste de Québec_
- ```liege-bastogne-liege``` is associated to _Liège - Bastogne - Liège_, _Liège-Bastogne-Liège_
- ```strade-bianche``` is associated to _Monte Paschi Eroica_, _Montepaschi Strade Bianche - Eroica Toscana_, _Strade Bianche_
- ```omloop-het-nieuwsblad``` is associated to _Omloop Het Nieuwsblad ME_, _Omloop Het Volk_, _Omloop Het Volk ME_
- ```paris-roubaix``` is associated to _Paris - Roubaix_, _Paris-Roubaix_
- ```ronde-van-vlaanderen``` is associated to _Ronde van Vlaanderen - Tour des Flandres ME_, _Ronde van Vlaanderen / Tour des Flandres_, _Ronde van Vlaanderen / Tour des Flandres ME_
- ```volta-a-catalunya``` is associated to _Volta Ciclista a Catalunya_, _Volta a Catalunya_
- ```itzulia-basque-country``` is associated to _Vuelta Ciclista al País Vasco_, _Vuelta al País Vasco_
- ```world-championship``` is associated to _World Championships - Road Race_, _World Championships ME - Road Race_
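The list above can be derived by counting distinct names per RACE_NAME slug, and accent-only variants can be told apart from genuine renamings by normalizing the names first. The sketch below is an illustration under assumptions: the names come from the list, while the years and stage parts of the URLs are invented, and `strip_accents` is a hypothetical helper.

```python
import unicodedata

import pandas as pd

# Names taken from the list above; years and stage parts are invented
toy_races = pd.DataFrame({
    "_url": ["gp-quebec/2010/stage-1", "gp-quebec/2015/stage-1",
             "paris-roubaix/2001/stage-1", "paris-roubaix/2012/stage-1"],
    "name": ["Grand Prix Cycliste de Quebec", "Grand Prix Cycliste de Québec",
             "Paris - Roubaix", "Paris-Roubaix"],
})

def strip_accents(text):
    """Remove diacritics by decomposing characters and dropping combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

slug = toy_races["_url"].str.split("/", n=1).str[0]
# Distinct raw names per slug
raw_variants = toy_races.groupby(slug)["name"].nunique()
# Distinct names per slug once accents are removed
normalized_variants = toy_races["name"].map(strip_accents).groupby(slug).nunique()
print(raw_variants.to_dict())
print(normalized_variants.to_dict())
```

Here the two ```gp-quebec``` names collapse into one after normalization, while the ```paris-roubaix``` names still differ because of spacing around the hyphen.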

#### points column

This is a numerical attribute. The same points value can presumably be assigned to different cyclists. We now check for null values.

In [None]:
int(races_df['points'].isnull().sum())

In [None]:
races_df['points'].unique()

In [None]:
races_df['points'].value_counts()

##### Plots

In [None]:
plt.figure(figsize=(10, 6))
sns.histplot(races_df['points'].dropna(), bins=30, kde=False)
plt.title('Histogram of Points')
plt.xlabel('Points')
plt.ylabel('Frequency')
plt.show()

In [None]:
plt.figure(figsize=(10, 6))
sns.boxplot(y=races_df['points'])
plt.title('Boxplot of Points')
plt.ylabel('Points')
plt.show()

#### uci_points column

In [None]:
int(races_df['uci_points'].isnull().sum())

In [None]:
races_df['uci_points'].dropna().unique()

In [None]:
# Create the histogram
plt.figure(figsize=(10, 6))
ax = sns.histplot(races_df['uci_points'].dropna(), bins=30, kde=False)

# Add text labels on each bin with rotation
for p in ax.patches:
    height = p.get_height()
    if height > 0:
        ax.annotate(f'{int(height)}', (p.get_x() + p.get_width() / 2. + 0.1, height),
                    ha='center', va='center', xytext=(0, 25), textcoords='offset points', rotation=90)

# Set titles and labels
plt.title('Histogram of count values for UCI Points attribute')
plt.xlabel('UCI Points')
plt.ylabel('Frequency')

# Show the plot
plt.show()

##### Plots

In [None]:
plt.figure(figsize=(10, 6))
sns.histplot(races_df['uci_points'].dropna(), bins=30, kde=False)
plt.title('Histogram of UCI Points')
plt.xlabel('UCI Points')
plt.ylabel('Frequency')
plt.show()

In [None]:
plt.figure(figsize=(6, 10))
sns.boxplot(y=races_df['uci_points'])
plt.title('Boxplot of UCI Points')
plt.ylabel('UCI Points')
plt.show()

#### length column

In [None]:
int(races_df['length'].isnull().sum())

In [None]:
races_df['length'].nunique()

We check whether the stage lengths are consistent: all the rows of a given stage must report the same length.

In [None]:
# Group by '_url' and filter out null values in 'length'
length_consistency = races_df[~races_df['length'].isnull()].groupby('_url')['length'].nunique()

# Count the number of cases where the length is not consistent
inconsistent_length_count = (length_consistency > 1).sum()

inconsistent_length_count
#print(f"Number of cases where the length is not consistent: {inconsistent_length_count}")

Since the lengths are consistent, we can group by _url to inspect the stage lengths without counting repeated rows.

In [64]:
# Group by '_url' and keep one row per stage (first non-null value of each column)
grouped_races_df = races_df.groupby('_url').first().reset_index()

In [None]:
average_grouped_length = grouped_races_df['length'].mean()
print(f"The mean length of the grouped stages is: {average_grouped_length:.2f} meters")


In [None]:
plt.figure(figsize=(10, 6))
sns.histplot(races_df['length'].dropna(), binwidth=5000)
plt.title('Distribution of Length Attribute')
plt.xlabel('Length')
plt.ylabel('Frequency')
plt.show()

In [None]:
plt.figure(figsize=(6, 10))
sns.boxplot(y=grouped_races_df['length'])
plt.title('Boxplot of Race Length')
plt.ylabel('Length (meters)')
plt.show()

Should all the stages under 50 km be discarded?

In [None]:
short_stages_count = (grouped_races_df['length'] < 50000).sum()
print(f"Number of stages under 50 km: {short_stages_count}")

In [None]:
races_df['length'].describe()

#### climb_total column

In [None]:
# number of climb null
int(races_df['climb_total'].isnull().sum())

In [None]:
# Group by '_url' and filter out null values in 'climb_total'
climb_consistency = races_df[~races_df['climb_total'].isnull()].groupby('_url')['climb_total'].nunique()

# Count the number of cases where the climb is not consistent
inconsistent_climb_count = (climb_consistency > 1).sum()

inconsistent_climb_count
#print(f"Number of cases where the climb is not consistent: {inconsistent_climb_count}")

In [None]:
plt.figure(figsize=(10, 6))
sns.boxplot(y=grouped_races_df['climb_total'])
plt.title('Boxplot of Total Climb')
plt.ylabel('Total Climb')
plt.show()

In [None]:
# Column of interest
total_climb = grouped_races_df['climb_total']

# Create a DataFrame to store the statistics
stats_df = pd.DataFrame({
    'Statistic': ['Null Count', 'Unique Value Counts', 'Mean', 'Max', 'Min', 'Variance', 'Description'],
    'Value': [
        total_climb.isnull().sum(),
        total_climb.nunique(),
        total_climb.mean(),
        total_climb.max(),
        total_climb.min(),
        total_climb.var(),
        total_climb.describe().to_dict()
    ]
})

# Display the DataFrame
print(stats_df)

### profile column

FRANCESCO P. 

### position column

In [None]:
int(races_df['position'].isnull().sum())

For the same stage, we check if there are duplicates.

In [None]:
# Flag the rows whose position is shared with another row of the same stage.
# The mask keeps the original index of races_df, so it can be used directly for filtering
duplicate_positions = races_df.duplicated(subset=['_url', 'position'], keep=False)

# Filter the DataFrame to show only the rows with duplicate positions
races_df[duplicate_positions]

In [None]:
plt.figure(figsize=(10, 6))
sns.histplot(races_df['position'], bins=30, kde=False)
plt.title('Histogram of Position')
plt.xlabel('Position')
plt.ylabel('Frequency')
plt.show()

#### ```cyclist``` column

In [None]:
races_df['cyclist'].isnull().sum()

There are no null values in the column. Duplicates are expected, but we check that the same cyclist does not appear more than once in the same stage of a race.

In [None]:
races_df[races_df.duplicated(subset=['_url', 'cyclist'], keep=False)]

As we can see from the table above, there are duplicated ```cyclist``` values for the same stage. We can also notice that the rows of a duplicate differ in other values as well, such as ```date```, ```position``` or ```delta```.

#### ```cyclist_age``` column

This is a numerical attribute. First of all we check if there are null values and if all the entries are numerical.

In [None]:
races_df['cyclist_age'].isnull().sum()

In [None]:
races_df['cyclist_age'].unique()

As we can see, there are 113 null entries but the remaining values are numerical.

Duplicate values in the column are allowed, and the same cyclist can even appear more than once in the same stage, as we saw in the previous column analysis. We check whether each cyclist is associated with more than one age value, to gather more information for the data preparation phase.

In [None]:
# Group by 'cyclist' and check if there are different 'cyclist_age' values
age_inconsistencies = races_df.groupby('cyclist')['cyclist_age'].nunique()

# Filter the cyclists with more than one unique age value
age_inconsistencies[age_inconsistencies > 1]


As expected, the same cyclist is associated with more than one age value, since a cyclist's age changes from one season to the next.
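Because several ages per cyclist are normal, a more informative check compares each reported age with the age implied by the race date and the birth year. The sketch below rests on assumptions: that ```date``` can be parsed by `pd.to_datetime`, and that ```cyclist``` in the races table matches ```_url``` in the cyclists table. Both toy frames are invented.

```python
import pandas as pd

# Hypothetical toy frames standing in for cyclists_df and races_df (values are invented)
toy_cyclists = pd.DataFrame({"_url": ["rider-a", "rider-b"], "birth_year": [1980.0, 1990.0]})
toy_races = pd.DataFrame({
    "cyclist": ["rider-a", "rider-a", "rider-b"],
    "date": ["2005-07-10", "2010-07-10", "2015-04-01"],
    "cyclist_age": [25.0, 30.0, 40.0],
})

merged = toy_races.merge(toy_cyclists, left_on="cyclist", right_on="_url", how="left")
race_year = pd.to_datetime(merged["date"]).dt.year
expected_age = race_year - merged["birth_year"]
# Allow one year of slack because the birthday may fall after the race date
merged["age_mismatch"] = (merged["cyclist_age"] - expected_age).abs() > 1
print(merged.loc[merged["age_mismatch"], ["cyclist", "date", "cyclist_age"]])
```

Rows with a missing age or birth year are never flagged, because comparisons with `NaN` evaluate to `False`.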

##### Plots

In [None]:
plt.figure(figsize=(10, 6))
sns.histplot(races_df['cyclist_age'].dropna(), bins=30)
plt.title('Histogram of Cyclist Age')
plt.xlabel('Cyclist Age')
plt.ylabel('Frequency')
plt.show()

In [None]:
plt.figure(figsize=(6, 10))
sns.boxplot(y=races_df['cyclist_age'])
plt.title('Boxplot of Cyclist Age')
plt.ylabel('Cyclist Age')
plt.show()

#### ```is_tarmac``` column

This is a boolean column. We check the presence of null values, errors, and we plot the distribution of the True and False values.

In [None]:
races_df['is_tarmac'].isnull().sum()

In [None]:
races_df['is_tarmac'].unique()

In [None]:
plt.figure(figsize=(6, 10))
sns.histplot(races_df['is_tarmac'], bins=2)
plt.title('Histogram of Is Tarmac')
plt.xlabel('Is Tarmac')
plt.ylabel('Frequency')
plt.xticks([0, 1], ['False', 'True'])
plt.show()

FRANCESCO ALIP:


### Correlation Analysis

In [None]:
# Define the coefficients
coefficients = ['spearman']

# Exclude the '_url' column from the numeric columns list
races_numeric_stage_no_url = [col for col in races_numeric_stage + races_numeric + races_categorical if col != '_url']

# Plot the correlation matrices
plt.figure(figsize=(18, 6))

for i, c in enumerate(coefficients):
    plt.subplot(1, len(coefficients), i + 1)  # one panel per coefficient
    correlation_matrix = races_df[races_numeric_stage_no_url].corr(method=c)
    sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', linewidths=0.5)
    plt.title(f'{c.capitalize()} Correlation Matrix')

plt.tight_layout()
plt.show()

In [None]:
# Make sure 'profile' appears in the column list exactly once
# (it may already be included through races_categorical)
races_numeric_stage_with_profile = list(dict.fromkeys(races_numeric_stage_no_url + ['profile']))

# Calculate the correlation matrix including the 'profile' attribute
correlation_matrix_with_profile = races_df[races_numeric_stage_with_profile].corr(method='spearman')

# Plot the heatmap
plt.figure(figsize=(12, 8))
sns.heatmap(correlation_matrix_with_profile, annot=True, cmap='coolwarm', linewidths=0.5)
plt.title('Correlation Matrix Including Profile Attribute')
plt.show()

FEATURE ENGINEERING TODO

Since in cycling weight tends to be similar for a given height, we group height into categories of 5 cm intervals to look for possible weight outliers using conditional boxplots.

In [None]:
# Define the bins for the height category: 5 cm half-open intervals [a, b).
# The last edge is 205 because, with right=False, a last edge of 204 would exclude the maximum height (204)
bins = [154, 159, 164, 169, 174, 179, 184, 189, 194, 199, 205]
labels = list(range(10))

# Create a new column with the height category, keeping the null values
cyclists_df['height_category'] = pd.cut(cyclists_df['height'], bins=bins, labels=labels, right=False, include_lowest=True)

# Show the first 10 rows of the height and height_category columns
print(cyclists_df[['height', 'height_category']].head(10))
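One detail worth keeping in mind with `right=False` is the treatment of the upper edge: intervals are closed on the left and open on the right, so a value equal to the last edge falls outside every bin. A small sketch with invented heights and only two bins (`heights`, `edges` and `widened_edges` are illustrative names):

```python
import pandas as pd

heights = pd.Series([154.0, 158.9, 159.0, 203.9, 204.0])

# Half-open intervals [154, 159) and [159, 204): 204 itself is not covered
edges = [154, 159, 204]
half_open = pd.cut(heights, bins=edges, right=False)
print(half_open.isnull().tolist())

# Moving the last edge past the maximum keeps every value in a bin
widened_edges = [154, 159, 205]
widened = pd.cut(heights, bins=widened_edges, right=False)
print(widened.isnull().tolist())
```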

In [None]:
plt.figure(figsize=(12, 6))
sns.boxplot(x='height_category', y='weight', data=cyclists_df)
plt.title('Boxplot of Weight Conditioned on Height Category')
plt.xlabel('Height Category')
plt.ylabel('Weight')

# Set y-axis intervals
plt.yticks(np.arange(50, cyclists_df['weight'].max() + 5, 5))

plt.show()

We display the number of cyclists in each category we defined (excluding those with a null weight) to assess whether the number of outliers shown by the boxplots is significant.

In [None]:
# Exclude cyclists with null weight
cyclists_df_non_null_weight = cyclists_df[cyclists_df['weight'].notnull()]

# Count the number of cyclists in each height category
height_category_counts = cyclists_df_non_null_weight['height_category'].value_counts().sort_index()

# Plot the results
plt.figure(figsize=(10, 6))
ax = height_category_counts.plot(kind='bar')
plt.title('Number of Cyclists per Height Category (Excluding Null Weights)')
plt.xlabel('Height Category')
plt.ylabel('Number of Cyclists')

# Add the count above each bin
for i, count in enumerate(height_category_counts):
    ax.text(i, count + 5, str(count), ha='center', va='bottom')

# Rotate x-axis labels to horizontal
plt.xticks(rotation=0)

plt.show()
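To relate the outliers to the size of each category, they can be counted per category with the same 1.5 × IQR rule that the boxplots use by default. This is only a sketch on invented data; `toy_df`, `is_outlier` and `summary` are hypothetical names.

```python
import pandas as pd

# Hypothetical toy frame: two height categories with invented weights
toy_df = pd.DataFrame({
    "height_category": [0] * 5 + [1] * 5,
    "weight": [55.0, 56.0, 57.0, 58.0, 80.0, 70.0, 71.0, 72.0, 73.0, 74.0],
})

grouped = toy_df.groupby("height_category")["weight"]
# Per-category quartiles, broadcast back to every row of the category
q1 = grouped.transform(lambda s: s.quantile(0.25))
q3 = grouped.transform(lambda s: s.quantile(0.75))
iqr = q3 - q1
toy_df["is_outlier"] = (toy_df["weight"] < q1 - 1.5 * iqr) | (toy_df["weight"] > q3 + 1.5 * iqr)

# Number and share of outliers in each category
summary = toy_df.groupby("height_category")["is_outlier"].agg(n_outliers="sum", share="mean")
print(summary)
```

A high share in a sparsely populated category says little, which is why the counts plotted above matter when reading the boxplots.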

In [None]:
import pycountry_convert as pc

# Function to convert country name to continent
def country_to_continent(country_name):
    try:
        country_alpha2 = pc.country_name_to_country_alpha2(country_name)
        continent_code = pc.country_alpha2_to_continent_code(country_alpha2)
        continent_name = pc.convert_continent_code_to_continent_name(continent_code)
        return continent_name
    except Exception:  # unknown country name or missing (NaN) nationality
        return 'Unknown'

# Apply the function to create a new column 'continent'
cyclists_df['continent'] = cyclists_df['nationality'].apply(country_to_continent)

# Display the first few rows to verify
print(cyclists_df[['nationality', 'continent']].head(30))

In [None]:
# Create a pairplot excluding the 'height_category' column
sns.pairplot(cyclists_df.drop(columns=['height_category']), hue='continent')
plt.show()

In [None]:
# Elements in 'cyclist' column of races_data and not in '_url' column of cyclists_data
diff_races_not_in_cyclists = np.setdiff1d(races_df['cyclist'].unique(), cyclists_df['_url'].unique())
print("In 'races_data' but not in 'cyclists_data':", diff_races_not_in_cyclists)

# Elements in '_url' column of cyclists_data and not in 'cyclist' column of races_data
diff_cyclists_not_in_races = np.setdiff1d(cyclists_df['_url'].unique(), races_df['cyclist'].unique())
print("In 'cyclists_data' but not in 'races_data':", diff_cyclists_not_in_races)

print(races_df['cyclist'].unique().size)

len(diff_cyclists_not_in_races)