<div class="alert alert-block alert-info"><font size="6"><b>Sprint 6 Task 1 (S06_T01)</b></font><h6 align="right"><u>Author: Alberto Achaval</u></h6></div>

## <SPAN style=color:#1F618D>Level 1</SPAN>

### <SPAN style=color:#1F618D>Practice 1</SPAN>

<SPAN style=color:#1F618D>Pick a sports-themed dataset you like and select an attribute from it. Calculate the mode, median, standard deviation, and arithmetic mean.</SPAN>

To solve this practice we start by **importing** all the required libraries:

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
import statistics

For this practice, I'm going to use data from the ATP (Association of Tennis Professionals) matches played in 2018, the same dataset I used for Sprint 5.   
In this data set, the winner and loser are coded with an ID. We have another dataset that references this ID with player information.

For the matches dataset we have the following column reference (each column appears twice, prefixed with `w_` for the winner and `l_` for the loser):
* ace = absolute number of aces  
* df = number of double faults  
* svpt = total serve points  
* 1stIn = first serves in  
* 1stWon = points won on first serve  
* 2ndWon = points won on second serve  
* SvGms = serve games  
* bpSaved = break points saved  
* bpFaced = break points faced  

source: https://www.kaggle.com/datasets/pablodroca/atp-tennis-matches-20002019?select=atp_matches_2019.csv

In [None]:
ranking = pd.read_csv('atp_matches_2018.csv') # matches 2018
players = pd.read_csv('atp_players.csv') # players information

In [None]:
pd.set_option('display.max_columns', 50)
ranking.sample(5)

In [None]:
ranking.info()

In [None]:
players.head()

In [None]:
players.info()

I want to merge both dataframes so that the **winner_id** and **loser_id** in **matches** are replaced with the actual name and country from **players**.

In [None]:
# new column with the player's joined name

players['player'] = players['name_list'] + ', ' + players['name_first']

In [None]:
# merge dataframes, first winner and then loser

winners = pd.merge(ranking, players.loc[:, ['player_id', 'country', 'player']], 
                 left_on=['winner_id'], right_on=['player_id']).drop(columns = ['winner_id', 'player_id'])

atp_2018 = pd.merge(winners, players.loc[:, ['player_id', 'country', 'player']], how= 'inner', 
                  left_on=['loser_id'], right_on=['player_id']).drop(columns = ['loser_id', 'player_id'])
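
The `player_x` / `player_y` names used in the next cell come from merging twice against the same lookup table. Here is a minimal sketch of that behaviour on toy data. The frames `matches` and `people` and all their values are made up for illustration and are not taken from the ATP files.

```python
import pandas as pd

# Toy data (hypothetical values, not taken from the ATP files)
matches = pd.DataFrame({
    "winner_id": [1, 2],
    "loser_id": [2, 3],
    "score": ["6-4 6-4", "7-6 6-3"],
})
people = pd.DataFrame({
    "player_id": [1, 2, 3],
    "country": ["ESP", "SUI", "SRB"],
    "player": ["A, Player", "B, Player", "C, Player"],
})

# First merge: attach the winner's details, then drop the ID columns
step1 = pd.merge(matches, people, left_on="winner_id", right_on="player_id")
step1 = step1.drop(columns=["winner_id", "player_id"])

# Second merge: attach the loser's details. The column names that now
# overlap get the default suffixes _x (left side, winner) and _y (right side, loser)
step2 = pd.merge(step1, people, left_on="loser_id", right_on="player_id")
step2 = step2.drop(columns=["loser_id", "player_id"])

print(step2.columns.tolist())
```

Passing `suffixes=('_winner', '_loser')` to the second merge would give readable names directly, but the rename step in the next cell achieves the same thing.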

In [None]:
# reorder and rename columns

atp_2018 = atp_2018[['tourney_id', 'tourney_name', 'tourney_date', 'surface', 'player_x', 'country_x',  'player_y', 
                 'country_y', 'score', 'best_of', 'round', 'minutes', 'w_ace', 'w_df', 'w_svpt', 'w_1stIn', 
                 'w_1stWon', 'w_2ndWon', 'w_SvGms', 'w_bpSaved', 'w_bpFaced', 'l_ace', 'l_df', 'l_svpt', 'l_1stIn', 
                 'l_1stWon', 'l_2ndWon', 'l_SvGms', 'l_bpSaved', 'l_bpFaced', 'winner_rank', 'winner_rank_points', 
                 'loser_rank', 'loser_rank_points']]

atp_2018.rename(columns={'player_x': 'winner', 'country_x': 'country_winner', 
                       'player_y': 'loser', 'country_y': 'country_loser' }, inplace=True)

In [None]:
atp_2018.head()

In [None]:
atp_2018.info()

Now the dataframe has the shape we wanted! Let's compute the **mode, median, standard deviation, and arithmetic mean** of the number of aces per hour served by the winner of the match. This will be a new column, **aces**. Normalizing by match duration is needed because the dataset mixes best-of-3 and best-of-5 matches.

Let's see about **NaN** values in **w_ace** and **minutes** columns:

In [None]:
atp_2018[atp_2018['w_ace'].isnull() | atp_2018['minutes'].isnull()]

Half of these rows are walkovers (W/O in the **score** column), that is, matches that were not played or that ended early due to some problem.
The other rows come from early rounds of tournaments where no statistics were collected. We could fill these NaN values with the mean of the remaining rows, but early rounds tend to feature lower-ranked players, so imputing the average ace count of winners from advanced rounds could bias the result.
Both kinds of rows are missing **w_ace** and **minutes**.

Conclusion: I'll drop these rows. In the remaining rows, where only **minutes** is missing, I'll fill the gap with the mean duration. Only best-of-3 matches are missing minutes in that case.

In [None]:
atp_2018.dropna(subset = ['w_ace'], how = 'any', inplace = True) # rows with NaN values in -w_ace- column

In [None]:
atp_2018[atp_2018['w_ace'].isnull() | atp_2018['minutes'].isnull()] # rows with NaN in -minutes- column

In [None]:
grouped_mins = atp_2018.groupby('best_of')['minutes'].mean() # excluding missing values by default
grouped_mins

In [None]:
mean_bo3 = round(grouped_mins.loc[3], 2) # average minutes for best-of-3 matches
atp_2018['minutes'] = atp_2018['minutes'].fillna(mean_bo3) # assign back; inplace on a column selection may not update the dataframe
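
Here a single value is enough because only best-of-3 matches lack **minutes**. If both formats had gaps, one option would be to fill each row with the mean of its own group. This is a minimal sketch of that alternative on toy data (the frame `df` and its values are hypothetical), not the approach used in this notebook.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values)
df = pd.DataFrame({
    "best_of": [3, 3, 3, 5, 5, 5],
    "minutes": [90.0, 110.0, np.nan, 180.0, np.nan, 220.0],
})

# Mean duration of each format, broadcast back to every row of that format
group_mean = df.groupby("best_of")["minutes"].transform("mean")

# Fill each gap with the mean of its own format
df["minutes"] = df["minutes"].fillna(group_mean)
print(df)
```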

In [None]:
atp_2018[atp_2018['w_ace'].isnull() | atp_2018['minutes'].isnull()]

I will round the new **aces** column to whole numbers to get a useful **mode**. Otherwise the mode would be 0: the unrounded rate is practically continuous, so hardly any value repeats except 0 (matches without aces). Let's see:

In [None]:
atp_2018['aces'] = (atp_2018['w_ace'] * 60) / atp_2018['minutes'] # new column: aces per hour of play

atp_2018.groupby('aces')['minutes'].count()

In [None]:
atp_2018['aces'] = round(atp_2018['aces'], 0) # rounded
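
To see why rounding changes the mode, here is a small sketch on made-up rates (the series `rates` is hypothetical, not real match data). Before rounding only the zeros repeat. After rounding, nearby values fall into the same bin.

```python
import pandas as pd

# Toy per-hour rates (hypothetical values)
rates = pd.Series([0.0, 0.0, 3.91, 4.07, 4.12, 4.3, 6.55])

mode_raw = rates.mode().tolist()               # only 0.0 occurs more than once
mode_rounded = rates.round(0).mode().tolist()  # four values round to 4.0

print(mode_raw, mode_rounded)
```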

In [None]:
atp_2018[atp_2018['aces'].isnull()] # check for NaN values in the new column (there shouldn't be any)

In [None]:
atp_2018.to_csv('atp_2018.csv', index = False) # save the cleaned data to csv
atp_2018.sample(10) # last expression in the cell, so the sample is displayed

<u>**Mode:**</u> the value or values that occur most often.

In [None]:
w_ace_mode = atp_2018.aces.mode()
print(f"Aces from winner mode is {w_ace_mode[0]} per hour of match")

In [None]:
atp_2018.groupby('aces')['minutes'].count()

<u>**Median:**</u> the "middle" of a sorted list of numbers.

In [None]:
w_ace_median = round(atp_2018.aces.median(), 2)
print(f"Aces from winner median is {w_ace_median} per hour of match")

<u>**Standard Deviation:**</u> a measure of how spread out numbers are.

In [None]:
w_ace_sd = round(atp_2018.aces.std(), 2)
print(f'Aces from winner standard deviation is {w_ace_sd} per hour of match')

<u>**Arithmetic Mean:**</u> the average of the numbers.

In [None]:
w_ace_mean = round(atp_2018.aces.mean(), 2)
print(f'Aces from winner mean is {w_ace_mean} per hour of match')

In [None]:
atp_2018.aces.describe()
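
One detail worth keeping in mind when reading these numbers: pandas computes the *sample* standard deviation by default (`ddof=1`), while NumPy's `np.std` defaults to the *population* version (`ddof=0`). This is a short sketch on a made-up sample (`values` is hypothetical) showing all four statistics and the difference between the two standard deviations.

```python
import numpy as np
import pandas as pd

values = pd.Series([2, 4, 4, 4, 5, 5, 7, 9])  # hypothetical sample

sample_mode = values.mode()[0]        # most frequent value
sample_median = values.median()       # middle of the sorted values
sample_mean = values.mean()           # arithmetic mean
std_pandas = values.std()             # pandas default: ddof=1 (sample)
std_numpy = np.std(values.to_numpy()) # NumPy default: ddof=0 (population)

print(sample_mode, sample_median, sample_mean, std_pandas, std_numpy)
```

With thousands of matches the two standard deviations are almost identical, so the choice hardly matters for this dataset.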

In [None]:
sns.set_style('darkgrid')
fig, ax = plt.subplots(figsize = (8, 5))
sns.histplot(data = atp_2018, x = 'aces')
plt.show()

### <SPAN style=color:#1F618D>Practice 2</SPAN>

<SPAN style=color:#1F618D>Continuing with the sports-themed data, select two attributes and calculate their correlation.</SPAN>

For this practice we are going to study whether there is a linear correlation between **w_1stIn** (the number of first serves the match winner put in) and **w_1stWon** (the number of points the winner won on those first serves).

In [None]:
sns.set_style('darkgrid')
fig, ax = plt.subplots(figsize = (8, 5))
sns.regplot(data = atp_2018, x = 'w_1stIn', y = 'w_1stWon')
plt.show()

In [None]:
corr_1serve = round(atp_2018['w_1stIn'].corr(atp_2018.w_1stWon),2)
print(f'Correlation is = {corr_1serve}')

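The two variables are strongly correlated, as expected: a point can only be won on a first serve that went in, so **w_1stWon** can never exceed **w_1stIn**, and both counts grow with the length of the match.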
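
For reference, `Series.corr` computes the Pearson coefficient by default. This is a minimal sketch that writes the formula out by hand on toy numbers (the series `x` and `y` are hypothetical) and checks it against pandas.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values)
x = pd.Series([30, 42, 55, 61, 78], dtype=float)
y = pd.Series([22, 30, 41, 40, 60], dtype=float)

# Pearson r: sum of products of deviations, divided by the
# square root of the product of the summed squared deviations
dx = x - x.mean()
dy = y - y.mean()
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

r_pandas = x.corr(y)  # method='pearson' is the default
print(round(r_manual, 4), round(r_pandas, 4))
```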

## <SPAN style=color:#1F618D>Level 2</SPAN>

### <SPAN style=color:#1F618D>Practice 3</SPAN>

<SPAN style=color:#1F618D>Continuing with the sports-themed data, calculate the correlation of all the attributes with each other and represent them in a matrix with different color intensities.</SPAN>

In [None]:
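m_corr = atp_2018.select_dtypes(include = 'number').corr() # numeric columns only; recent pandas versions raise an error on text columns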
m_corr

In [None]:
mask = np.triu(np.ones_like(m_corr, dtype = bool)) # mask the upper triangle, easier to read
f, ax = plt.subplots(figsize = (14, 16))

sns.heatmap(m_corr, mask = mask, cmap = 'flare', annot = True, annot_kws = {"size": 10})

for t in ax.texts:
    if abs(float(t.get_text())) < 0.5:
        t.set_text("") # hide weak correlations, keep only annotations with |r| >= 0.5

plt.title('Correlation Matrix', fontsize = 20)
plt.show()
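
The heatmap is easier to read alongside a plain list of the strongest pairs. This is a sketch of one way to extract them from a correlation matrix. It uses a toy frame (`toy`, with hypothetical columns `a` to `d`) and the same 0.5 threshold as the annotations above. Applying it to `m_corr` should work the same way, though I have not shown that here.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): b tracks a, c runs against a, d is pure noise
rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "b": a + rng.normal(scale=0.1, size=200),
    "c": -a + rng.normal(scale=0.1, size=200),
    "d": rng.normal(size=200),
})

corr = toy.corr()

# Keep the strict lower triangle: each pair once, no diagonal
lower = corr.where(np.tril(np.ones(corr.shape, dtype=bool), k=-1))

# Long format, one row per pair; dropna() removes the masked cells
pairs = lower.stack().dropna()

# Pairs with |r| >= 0.5, strongest first
strong = pairs[pairs.abs() >= 0.5].sort_values(key=abs, ascending=False)
print(strong)
```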

## <SPAN style=color:#1F618D>Level 3</SPAN>

### <SPAN style=color:#1F618D>Practice 4</SPAN>

<SPAN style=color:#1F618D>Continuing with the sports-themed data, select an attribute and calculate the geometric mean and harmonic mean.</SPAN>

<u>**Geometric Mean:**</u> a special type of average where we multiply the n numbers together and then take the n-th root: a square root for two numbers, a cube root for three, and so on. It is useful for averaging quantities on very different scales or quantities that combine multiplicatively, such as ratios and growth rates.


![image.png](attachment:image.png)

In [None]:
geom_mean = stats.gmean(atp_2018.loc[:,'w_1stWon'])
print(f'The geometric mean is: {round(geom_mean,2)}')
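
As a sanity check on the definition, here is a small sketch using only the standard library (the list `values` is hypothetical). It computes the geometric mean as the n-th root of the product and, equivalently, as the exponential of the mean of the logarithms. The second form avoids overflowing the product on long lists.

```python
import math
import statistics

values = [2.0, 8.0, 4.0]  # hypothetical positive values

# Definition: n-th root of the product of the n values
gm_product = math.prod(values) ** (1 / len(values))

# Equivalent form: exponential of the mean of the logs
gm_logs = math.exp(sum(math.log(v) for v in values) / len(values))

# Standard-library implementation, for comparison
gm_stdlib = statistics.geometric_mean(values)

print(gm_product, gm_logs, gm_stdlib)
```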

<u>**Harmonic Mean:**</u> the reciprocal of the average of the reciprocals. It is barely affected by a few values much larger than the rest, but it is sensitive to values much smaller than the rest.

![image.png](attachment:image.png)

In [None]:
harm_mean = statistics.harmonic_mean(atp_2018['w_1stWon'])
print(f'The harmonic mean is: {round(harm_mean, 2)}')
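
To illustrate the sensitivity described above, this sketch (made-up numbers, standard library only) computes the harmonic mean by hand and then adds one extreme value to the sample, comparing how the arithmetic and harmonic means react.

```python
import statistics

values = [40.0, 60.0]  # hypothetical positive values

# Definition: reciprocal of the mean of the reciprocals
hm_manual = len(values) / sum(1 / v for v in values)
hm_stdlib = statistics.harmonic_mean(values)
print(hm_manual, hm_stdlib)

# Add one extreme value and compare how each mean reacts
with_large = values + [6000.0]
with_small = values + [0.6]
for sample in (with_large, with_small):
    print(sample, statistics.mean(sample), statistics.harmonic_mean(sample))
```

The very large value pulls the arithmetic mean into the thousands while the harmonic mean stays in the range of the original values. The very small value drags the harmonic mean down close to itself.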
