# Notes:

You are banned from using loops (`for` or `while` or any other) for this entire workshop!

You should almost never need loops with pandas anyway, so break the habit now.
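As a rough illustration of why loops are rarely needed, the sketch below builds the same filter twice: once with a loop (shown only for contrast) and once as a single vectorized comparison. The `toy` frame and the variable names are hypothetical and not part of the workshop data.

```python
import pandas as pd

# Hypothetical toy data, not one of the workshop datasets.
toy = pd.DataFrame({"visits": [1, 3, 2, 3]})

# Loop version (what this workshop bans): test each value one at a time.
loop_result = []
for value in toy["visits"]:
    loop_result.append(value >= 3)

# Vectorized version: one comparison applied to the whole column at once.
mask = toy["visits"] >= 3

# Both approaches flag the same rows.
print(mask.tolist() == loop_result)
```

The vectorized form is shorter, and the resulting boolean Series can be passed straight to `toy.loc[mask]`.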

## 1. DataFrame basics


Consider the following Python dictionary `data` and Python list `labels`:

``` python
data = {'animal': ['cat', 'cat', 'snake', 'dog', 'dog', 'cat', 'snake', 'cat', 'dog', 'dog'],
        'age': [2.5, 3, 0.5, np.nan, 5, 2, 4.5, np.nan, 7, 3],
        'visits': [1, 3, 2, 3, 2, 3, 1, 1, 2, 1],
        'priority': ['yes', 'yes', 'no', 'yes', 'no', 'no', 'no', 'yes', 'no', 'no']}

labels = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']
```
(This is just some meaningless data I made up with the theme of animals and trips to a vet.)

**1.** Create a DataFrame `df` from this dictionary `data` which has the index `labels`.

**2.** Select only the rows where visits are 3 or more. Which types of animals are these?

**3.** Select the rows where visits are 3 and the animal is a cat.

**4.** Calculate the sum of all visits in `df` (i.e. the total number of visits).

**5.** Calculate the mean age for each different animal in `df`.

**6.** Append a new row 'k' to `df` with your choice of values for each column. Then delete that row to return the original DataFrame.



In [1]:
import pandas as pd
import numpy as np

In [2]:
# Exercise 1.1

data = {'animal': ['cat', 'cat', 'snake', 'dog', 'dog', 'cat', 'snake', 'cat', 'dog', 'dog'],
        'age': [2.5, 3, 0.5, np.nan, 5, 2, 4.5, np.nan, 7, 3],
        'visits': [1, 3, 2, 3, 2, 3, 1, 1, 2, 1],
        'priority': ['yes', 'yes', 'no', 'yes', 'no', 'no', 'no', 'yes', 'no', 'no']}

labels = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']

df_pets = pd.DataFrame(data, columns = ['animal', 'age', 'visits', 'priority'], index = labels)
df_pets

Unnamed: 0,animal,age,visits,priority
a,cat,2.5,1,yes
b,cat,3.0,3,yes
c,snake,0.5,2,no
d,dog,,3,yes
e,dog,5.0,2,no
f,cat,2.0,3,no
g,snake,4.5,1,no
h,cat,,1,yes
i,dog,7.0,2,no
j,dog,3.0,1,no


In [3]:
# Exercise 1.2.1 Select only the rows where visits are 3 or more.

df_pets.loc[df_pets.visits >= 3]

Unnamed: 0,animal,age,visits,priority
b,cat,3.0,3,yes
d,dog,,3,yes
f,cat,2.0,3,no


In [4]:
# Exercise 1.2.2 Which types of animals are these?

animal_visits3 = df_pets[df_pets.visits >= 3]['animal'].unique()

print(f"The types of animals that have 3 visits or more are: {animal_visits3}")

The types of animals that have 3 visits or more are: ['cat' 'dog']


In [5]:
# Exercise 1.3 Select the rows where visits are 3 and the animal is a cat

df_pets.loc[
    (df_pets.visits == 3)
    & (df_pets.animal == 'cat')
]

Unnamed: 0,animal,age,visits,priority
b,cat,3.0,3,yes
f,cat,2.0,3,no


In [6]:
# Exercise 1.4 Calculate the sum of all visits in df (i.e. the total number of visits).

visit_sum = (
    df_pets.visits
       .sum()
)

print(f"The total number of visits is: {visit_sum}")

The total number of visits is: 19


In [7]:
# Exercise 1.5. Calculate the mean age for each different animal in df.

# select 'age' before averaging so the text columns are not passed to mean()
mean_age_pets = (
    df_pets.groupby('animal')['age']
       .mean()
)

mean_age_pets

animal
cat      2.5
dog      5.0
snake    2.5
Name: age, dtype: float64

In [8]:
# Exercise 1.6.1 Append a new row 'k' to df with your choice of values for each column. 

new_row = {'animal': ['cat'],
        'age': [7.0],
        'visits': [4],
        'priority': ['yes']}

labels_1 = ['k']

# create new df with row
df2 = pd.DataFrame(new_row, columns = ['animal', 'age', 'visits', 'priority'], index = labels_1)
df2

# append df2 to df_pets (DataFrame.append was removed in pandas 2.0)
df_pets = pd.concat([df_pets, df2])

df_pets

Unnamed: 0,animal,age,visits,priority
a,cat,2.5,1,yes
b,cat,3.0,3,yes
c,snake,0.5,2,no
d,dog,,3,yes
e,dog,5.0,2,no
f,cat,2.0,3,no
g,snake,4.5,1,no
h,cat,,1,yes
i,dog,7.0,2,no
j,dog,3.0,1,no
k,cat,7.0,4,yes


In [9]:
# Exercise 1.6.2 Then delete that row to return the original DataFrame
df_pets = df_pets.drop('k',axis=0)

# new 'k' row is now deleted
df_pets

Unnamed: 0,animal,age,visits,priority
a,cat,2.5,1,yes
b,cat,3.0,3,yes
c,snake,0.5,2,no
d,dog,,3,yes
e,dog,5.0,2,no
f,cat,2.0,3,no
g,snake,4.5,1,no
h,cat,,1,yes
i,dog,7.0,2,no
j,dog,3.0,1,no


# 2.1 Shifty problem

You have a DataFrame `df` with a column 'A' of integers. For example:
```python
df = pd.DataFrame({'A': [1, 2, 2, 3, 4, 5, 5, 5, 6, 7, 7]})
```

How do you filter out rows which contain the same integer as the row immediately above?

You should be left with a column containing the following values:

```python
1, 2, 3, 4, 5, 6, 7
```

### Hint: use the `shift()` method
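To make the hint concrete, here is a minimal sketch of what `shift()` does on a small, hypothetical Series `s`: every value moves down one position, so comparing a column with its shifted copy compares each row with the row above.

```python
import pandas as pd

s = pd.Series([1, 2, 2, 3])  # hypothetical toy Series

# shift() moves every value down one position; the first slot becomes NaN.
shifted = s.shift()

# True where a row differs from the row immediately above it.
# The first row compares against NaN, which never equals anything.
changed = s != shifted
print(changed.tolist())
```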

In [10]:
# Exercise 2.1 Filter out rows which contain the same integer as the row immediately above?

df_shifty = pd.DataFrame({'A': [1, 2, 2, 3, 4, 5, 5, 5, 6, 7, 7]})
df_shifty.loc[df_shifty.A.shift() != df_shifty.A]

Unnamed: 0,A
0,1
1,2
3,3
4,4
5,5
8,6
9,7


# 2.2 columns sum min

Suppose you have a DataFrame with 10 columns of real numbers, for example:

```python
df = pd.DataFrame(np.random.random(size=(5, 10)), columns=list('abcdefghij'))
```
Which column of numbers has the smallest sum? Return that column's label.

In [11]:
df_random = pd.DataFrame(np.random.random(size=(5, 10)), columns=list('abcdefghij'))
df_random

Unnamed: 0,a,b,c,d,e,f,g,h,i,j
0,0.260372,0.136255,0.938496,0.258332,0.291394,0.008483,0.946653,0.601258,0.381346,0.972684
1,0.698079,0.902073,0.20912,0.642411,0.48521,0.254741,0.918863,0.850099,0.257321,0.970257
2,0.432796,0.546197,0.323092,0.920123,0.782675,0.066733,0.614379,0.99344,0.66957,0.829539
3,0.52412,0.206033,0.750892,0.964766,0.661221,0.324943,0.180773,0.220518,0.32197,0.250505
4,0.374863,0.535815,0.01698,0.956682,0.970673,0.838714,0.152751,0.996399,0.365656,0.445755


In [12]:
df_random.sum()

a    2.290229
b    2.326371
c    2.238580
d    3.742313
e    3.191172
f    1.493614
g    2.813419
h    3.661714
i    1.995864
j    3.468741
dtype: float64

In [13]:
# Exercise 2.2 Return column label with the smallest sum.

# idxmin() returns the label of the smallest column sum
smallest_sum_label = df_random.sum(axis=0).idxmin()

print(f"The column with the smallest sum is: {smallest_sum_label}")

The column with the smallest sum is: f


# 2.3 Duplicates

How do you count how many unique rows a DataFrame has (i.e. ignore all rows that are duplicates)?

**hint:** There's a method that finds duplicate rows for you.
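The method in question is presumably `DataFrame.duplicated`. The sketch below, on a hypothetical `toy` frame, shows how its `keep` argument changes which rows are flagged: the default marks only the repeats, while `keep=False` marks every row that has a twin.

```python
import pandas as pd

toy = pd.DataFrame({"A": [1, 2, 2, 3]})  # hypothetical toy data

# Default keep='first': only the second and later copies are flagged.
repeats_only = toy.duplicated()

# keep=False: every row that appears more than once is flagged.
all_copies = toy.duplicated(keep=False)

print(repeats_only.tolist())
print(all_copies.tolist())
```

Which variant you want depends on whether "unique" means "distinct" or "appears exactly once".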

In [14]:
# Exercise 2.3 (I used the df from Exercise 2.1 as an example)

# keep=False drops every row that has a duplicate, leaving only the unique rows
len(df_shifty.drop_duplicates(keep=False))

4

# 2.4 Group Values

A DataFrame has a column of groups 'grps' and a column of integer values 'vals':

```python
df = pd.DataFrame({'grps': list('aaabbcaabcccbbc'), 
                   'vals': [12,345,3,1,45,14,4,52,54,23,235,21,57,3,87]})
```
For each *group*, find the sum of the three greatest values.  You should end up with the answer as follows:
```
grps
a    409
b    156
c    345
```

In [15]:
# Exercise 2.4

df_grps = pd.DataFrame({'grps': list('aaabbcaabcccbbc'), 
                   'vals': [12,345,3,1,45,14,4,52,54,23,235,21,57,3,87]})
df_grps

# sum(level=0) was removed in pandas 2.0; regroup on the first index level instead
(df_grps.groupby(['grps'])
 .vals
 .nlargest(3)
 .groupby(level=0)
 .sum()
)

grps
a    409
b    156
c    345
Name: vals, dtype: int64

# 3. Cleaning Data

### Making a DataFrame easier to work with

It happens all the time: someone gives you data containing malformed strings, Python lists and missing data. How do you tidy it up so you can get on with the analysis?

Take this monstrosity as the DataFrame to use in the following puzzles:

```python
df = pd.DataFrame({'From_To': ['LoNDon_paris', 'MAdrid_miLAN', 'londON_StockhOlm', 
                               'Budapest_PaRis', 'Brussels_londOn'],
              'FlightNumber': [10045, np.nan, 10065, np.nan, 10085],
              'RecentDelays': [[23, 47], [], [24, 43, 87], [13], [67, 32]],
                   'Airline': ['KLM(!)', '<Air France> (12)', '(British Airways. )', 
                               '12. Air France', '"Swiss Air"']})
```

Formatted, it looks like this:

```
            From_To  FlightNumber  RecentDelays              Airline
0      LoNDon_paris       10045.0      [23, 47]               KLM(!)
1      MAdrid_miLAN           NaN            []    <Air France> (12)
2  londON_StockhOlm       10065.0  [24, 43, 87]  (British Airways. )
3    Budapest_PaRis           NaN          [13]       12. Air France
4   Brussels_londOn       10085.0      [67, 32]          "Swiss Air"
```

**1.** Some values in the **FlightNumber** column are missing (they are `NaN`). These numbers are meant to increase by 10 with each row, so 10055 and 10075 need to be put in place. Modify `df` to fill in these missing numbers and make the column an integer column (instead of a float column).

In [16]:
df_flights = pd.DataFrame({'From_To': ['LoNDon_paris', 'MAdrid_miLAN', 'londON_StockhOlm', 
                               'Budapest_PaRis', 'Brussels_londOn'],
              'FlightNumber': [10045, np.nan, 10065, np.nan, 10085],
              'RecentDelays': [[23, 47], [], [24, 43, 87], [13], [67, 32]],
                   'Airline': ['KLM(!)', '<Air France> (12)', '(British Airways. )', 
                               '12. Air France', '"Swiss Air"']})
df_flights

Unnamed: 0,From_To,FlightNumber,RecentDelays,Airline
0,LoNDon_paris,10045.0,"[23, 47]",KLM(!)
1,MAdrid_miLAN,,[],<Air France> (12)
2,londON_StockhOlm,10065.0,"[24, 43, 87]",(British Airways. )
3,Budapest_PaRis,,[13],12. Air France
4,Brussels_londOn,10085.0,"[67, 32]","""Swiss Air"""


In [17]:
# Exercise 3.1 Fill in NaN in 'FlightNumber' column

# interpolate only the numeric column (the other columns hold strings and lists),
# then cast the filled values to int
df_flights.FlightNumber = (
    df_flights.FlightNumber
       .interpolate(method='linear')
       .astype(int)
)

df_flights

Unnamed: 0,From_To,FlightNumber,RecentDelays,Airline
0,LoNDon_paris,10045,"[23, 47]",KLM(!)
1,MAdrid_miLAN,10055,[],<Air France> (12)
2,londON_StockhOlm,10065,"[24, 43, 87]",(British Airways. )
3,Budapest_PaRis,10075,[13],12. Air France
4,Brussels_londOn,10085,"[67, 32]","""Swiss Air"""


# 3.2 column splitting

The **From\_To** column would be better as two separate columns! Split each string on the underscore delimiter `_` and add the two pieces to your dataframe as new columns `From` and `To`.
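A small sketch of the splitting step on two of the values above. The names `routes` and `parts` are hypothetical; the point is that `expand=True` makes `str.split` return one column per piece rather than a Series of lists.

```python
import pandas as pd

routes = pd.Series(["LoNDon_paris", "MAdrid_miLAN"])  # two values from From_To

# expand=True returns a DataFrame with one column per piece.
parts = routes.str.split("_", expand=True)
parts.columns = ["From", "To"]
print(parts)
```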

In [18]:
# Exercise 3.2 Split 'From_To' column on the underscore delimiter

df_flights[['From','To']] = df_flights.From_To.str.split("_",expand=True,)
df_flights

Unnamed: 0,From_To,FlightNumber,RecentDelays,Airline,From,To
0,LoNDon_paris,10045,"[23, 47]",KLM(!),LoNDon,paris
1,MAdrid_miLAN,10055,[],<Air France> (12),MAdrid,miLAN
2,londON_StockhOlm,10065,"[24, 43, 87]",(British Airways. ),londON,StockhOlm
3,Budapest_PaRis,10075,[13],12. Air France,Budapest,PaRis
4,Brussels_londOn,10085,"[67, 32]","""Swiss Air""",Brussels,londOn


# 3.3 Clean Text

Clean up the text in your dataframe:

- The **From** and **To** columns should be lowercase with only the first letter capitalized.

- In the **Airline** column, some extra punctuation and symbols have appeared around the airline names. Pull out just the airline name. E.g. `'(British Airways. )'` should become `'British Airways'`.
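One possible way to pull out the airline name is to capture the first run of letter-only words instead of deleting the unwanted characters. This is only a sketch: the pattern assumes names consist of ASCII letters separated by single spaces, which holds for the five values above but may not hold for other data.

```python
import pandas as pd

airlines = pd.Series(['KLM(!)', '<Air France> (12)', '(British Airways. )',
                      '12. Air France', '"Swiss Air"'])

# Capture the first run of letter-only words separated by single spaces.
names = airlines.str.extract(r"([A-Za-z]+(?: [A-Za-z]+)*)", expand=False)
print(names.tolist())
```

With `expand=False` and a single capture group, `str.extract` returns a Series, so the result can be assigned back to the column directly.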

In [19]:
# Exercise 3.3.1 Capitalize only first letter of 'From' and 'To' columns

df_flights.From = df_flights.From.str.title()
df_flights.To = df_flights.To.str.title()

df_flights

Unnamed: 0,From_To,FlightNumber,RecentDelays,Airline,From,To
0,LoNDon_paris,10045,"[23, 47]",KLM(!),London,Paris
1,MAdrid_miLAN,10055,[],<Air France> (12),Madrid,Milan
2,londON_StockhOlm,10065,"[24, 43, 87]",(British Airways. ),London,Stockholm
3,Budapest_PaRis,10075,[13],12. Air France,Budapest,Paris
4,Brussels_londOn,10085,"[67, 32]","""Swiss Air""",Brussels,London


In [20]:
# Exercise 3.3.2 Clean 'Airline' names

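df_flights.Airline = (
    df_flights.Airline
       .str.replace(r'[^\w\s]+', '', regex=True)  # drop punctuation and symbols
       .str.replace(r'[0-9]+', '', regex=True)    # drop digits
       .str.strip()                               # drop leftover spaces
)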

df_flights

Unnamed: 0,From_To,FlightNumber,RecentDelays,Airline,From,To
0,LoNDon_paris,10045,"[23, 47]",KLM,London,Paris
1,MAdrid_miLAN,10055,[],Air France,Madrid,Milan
2,londON_StockhOlm,10065,"[24, 43, 87]",British Airways,London,Stockholm
3,Budapest_PaRis,10075,[13],Air France,Budapest,Paris
4,Brussels_londOn,10085,"[67, 32]",Swiss Air,Brussels,London


# Exercise 4.1: Column Splitting

Given the unemployment data in `data/country_total.csv`, split the `month` column into two new columns: a `year` column and a `month` column, both integers
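A sketch of one way to do the split numerically, assuming `month` is stored as a float such as `1993.01` (year before the decimal point, month after it). The sample values are hypothetical. Working with numbers avoids a pitfall of the string route: the float `1993.10` prints as `'1993.1'`, so October can be mistaken for January.

```python
import numpy as np
import pandas as pd

month_col = pd.Series([1993.01, 1993.10, 2010.12])  # hypothetical sample values

# The integer part is the year.
year = np.floor(month_col).astype(int)

# The two decimals are the month; round() absorbs floating-point error.
month = ((month_col - year) * 100).round().astype(int)

print(year.tolist(), month.tolist())
```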

In [25]:
data_fetch = '/Users/celiagoogle/Documents/GitHub/m2-1-pandas/data/country_total.csv'
df_country_total = pd.read_csv(data_fetch)
df_country_total.head()

Unnamed: 0,country,seasonality,month,unemployment,unemployment_rate
0,at,nsa,1993.01,171000,4.5
1,at,nsa,1993.02,175000,4.6
2,at,nsa,1993.03,166000,4.4
3,at,nsa,1993.04,157000,4.1
4,at,nsa,1993.05,147000,3.9


In [26]:
df_country_total.dtypes

# turn float 'month' into a string with two decimals,
# so that October (1993.10) is not shortened to '1993.1'
df_country_total.month = (
    df_country_total.month
       .map('{:.2f}'.format)
)

In [27]:
# split the month column into two new columns: a year column and a month column

df_country_total[['year','month']] = (
    df_country_total.month
       .str.split(".",expand=True,)
)

In [28]:
# both 'month' and 'year' as integer column type

df_country_total.month = (
    df_country_total.month
       .astype(int)   
)


df_country_total.year = (
    df_country_total.year
       .astype(int)
)

# the 'month' and 'year' columns are now int64
df_country_total.dtypes

country               object
seasonality           object
month                  int64
unemployment           int64
unemployment_rate    float64
year                   int64
dtype: object

In [29]:
# Exercise 4.1

df_country_total

Unnamed: 0,country,seasonality,month,unemployment,unemployment_rate,year
0,at,nsa,1,171000,4.5,1993
1,at,nsa,2,175000,4.6,1993
2,at,nsa,3,166000,4.4,1993
3,at,nsa,4,157000,4.1,1993
4,at,nsa,5,147000,3.9,1993
...,...,...,...,...,...,...
20791,uk,trend,6,2429000,7.7,2010
20792,uk,trend,7,2422000,7.7,2010
20793,uk,trend,8,2429000,7.7,2010
20794,uk,trend,9,2447000,7.8,2010


In [None]:
'''
Matthieu's way shown in class is more readable.
Where possible, I've used it as a guide
to develop a less clunky code style,
which I've tried to implement in the previous
and next exercises. But it's still a work in progress.


df_country_total['year2'] = df_country_total.month.astype(int)

df_country_total['month'] = df_country_total.month
df_country_total.month = (
    df_country_total.month
    .astype(str)
    .str[-2:]
    .str.replace(".", "")
    .astype(int)
)

'''

# 4.2 Group Statistics

Given the unemployment data in `data/country_sex_age.csv`, give the average unemployment rate for:

- Each gender
- Each Age Group
- Both Together

**HINT:** The `seasonality` column makes it such that the data is repeated for each method of calculating unemployment (`nsa`, `trend`, etc.). Can you ignore this and group over it? Or should you take the average for each?
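To see what the hint is getting at, the sketch below compares the two options on a tiny made-up frame (the values are invented, only the column names match the dataset). Adding `seasonality` to the group keys gives one average per method, while leaving it out pools all methods together.

```python
import pandas as pd

# Invented values; only the column names follow country_sex_age.csv.
toy = pd.DataFrame({
    "seasonality": ["nsa", "nsa", "sa", "sa"],
    "sex": ["f", "m", "f", "m"],
    "unemployment_rate": [4.0, 6.0, 5.0, 7.0],
})

# One average per calculation method and sex.
by_method = toy.groupby(["seasonality", "sex"])["unemployment_rate"].mean()

# Grouping on 'sex' alone pools every method into a single average.
pooled = toy.groupby("sex")["unemployment_rate"].mean()

print(by_method)
print(pooled)
```

If the per-method averages differ noticeably, pooling hides that difference, which is one reason to pick a single method or report each separately.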

In [None]:
'''
For Exercises 4.2.1 and 4.2.2, I've ignored the seasonality group
when calculating the unemployment by sex and by age group separately.

But for Exercise 4.2.3, I've used a df that filters only for
'sa' when calculating both sex and age group together.

'''

In [30]:
# Exercise 4.2

data_fetch = '/Users/celiagoogle/Documents/GitHub/m2-1-pandas/data/country_sex_age.csv'
df = pd.read_csv(data_fetch)
df

Unnamed: 0,country,seasonality,sex,age_group,month,unemployment,unemployment_rate
0,at,nsa,f,y25-74,1993.01,61000,4.5
1,at,nsa,f,y25-74,1993.02,62000,4.5
2,at,nsa,f,y25-74,1993.03,62000,4.5
3,at,nsa,f,y25-74,1993.04,63000,4.6
4,at,nsa,f,y25-74,1993.05,63000,4.6
...,...,...,...,...,...,...,...
83155,uk,trend,m,y_lt25,2010.06,518000,21.1
83156,uk,trend,m,y_lt25,2010.07,513000,20.8
83157,uk,trend,m,y_lt25,2010.08,509000,20.5
83158,uk,trend,m,y_lt25,2010.09,513000,20.7


In [31]:
# Exercise 4.2.1 average unemployment rate for each sex

df.groupby('sex')['unemployment_rate'].mean()

sex
f    12.982629
m    11.671026
Name: unemployment_rate, dtype: float64

In [32]:
# Exercise 4.2.2 average unemployment rate for each age group

df.groupby('age_group')['unemployment_rate'].mean()

age_group
y25-74     6.905394
y_lt25    17.774057
Name: unemployment_rate, dtype: float64

In [33]:
# Exercise 4.2.3 average unemployment rate for both sex and age group

average = df.loc[
    df.seasonality =='sa'
]

grp = average.groupby(['sex', 'age_group'])
grp.unemployment_rate.mean()

sex  age_group
f    y25-74        7.579982
     y_lt25       18.323837
m    y25-74        6.256909
     y_lt25       17.067671
Name: unemployment_rate, dtype: float64

# 4.3 Estimating group size

Given that we have the unemployment **rate** as a % of total population, and the number of total unemployed, we can estimate the total population.

Give an estimate of the total population for men and women in each age group.

Does this change depending on the unemployment seasonality calculation method?
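As a worked example of the arithmetic, take the first row of the preview above (61000 unemployed at a rate of 4.5%). If the rate is the unemployed count as a percentage of some base population, then that base is the count divided by the rate as a fraction. The variable names are hypothetical.

```python
# Values from the first row of the country_sex_age.csv preview.
unemployed = 61000
rate_percent = 4.5

# rate = unemployed / population * 100, so population = unemployed / (rate / 100)
estimated_population = unemployed / (rate_percent / 100)

print(round(estimated_population))
```

The estimate is only as good as the rounding of the published rate, so small differences between rows for the same group are expected.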

In [None]:
'''
The population estimates for each age group
do vary depending on 'seasonality'.


After creating a new column 'est_labor_popu', 
containing an estimate of the labor population,
the three sets of population estimates grouped
by sex and age based on each seasonality average
appear below.

'''

In [41]:
# estimated population = unemployed / rate (in %) * 100
df['est_labor_popu'] = df.unemployment / df.unemployment_rate * 1e2

In [42]:
average_sa = df.loc[
    df.seasonality =='sa'
]

grp = average_sa.groupby(['sex', 'age_group'])
grp.est_labor_popu.mean()

sex  age_group
f    y25-74       3.360649e+06
     y_lt25       5.876348e+05
m    y25-74       4.533201e+06
     y_lt25       6.867673e+05
Name: est_labor_popu, dtype: float64

In [43]:
average_nsa = df.loc[
   df.seasonality =='nsa'
] 
grp = average_nsa.groupby(['sex', 'age_group'])
grp.est_labor_popu.mean()

sex  age_group
f    y25-74       3.006779e+06
     y_lt25       5.270112e+05
m    y25-74       4.110194e+06
     y_lt25       6.354751e+05
Name: est_labor_popu, dtype: float64

In [44]:
average_trend = df.loc[
   df.seasonality =='trend'
] 
grp = average_trend.groupby(['sex', 'age_group'])
grp.est_labor_popu.mean()

sex  age_group
f    y25-74       3.289520e+06
     y_lt25       5.848969e+05
m    y25-74       4.423824e+06
     y_lt25       6.809973e+05
Name: est_labor_popu, dtype: float64

# 5.1 Tennis

In `data/tennis.csv` you have games that Roger Federer played against various opponents. Questions:

1. How many games did Federer win?

2. What is Federer's win/loss ratio?

3. Who were Federer's top 5 opponents?
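A minimal sketch of the first two questions on invented data, assuming (as the later cells suggest) that the dataset has a boolean `win` column that is `True` when Federer won. The `toy` frame is hypothetical.

```python
import pandas as pd

# Invented matches; only the column names follow tennis.csv.
toy = pd.DataFrame({
    "winner": ["Roger Federer", "Rafael Nadal", "Roger Federer", "Roger Federer"],
    "win": [True, False, True, True],
})

# Summing a boolean column counts the True values.
wins = toy["win"].sum()
losses = (~toy["win"]).sum()

# Win/loss ratio as a single number: wins per loss.
ratio = wins / losses
print(wins, losses, ratio)
```

Expressing the ratio as wins per loss is one convention; reporting win and loss percentages is another.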

In [2]:
import pandas as pd
import numpy as np

In [3]:
# Exercise 5.1
data_fetch = '/Users/celiagoogle/Documents/GitHub/m2-1-pandas/data/tennis.csv'
df = pd.read_csv(data_fetch)

In [78]:
# Exercise 5.1.1 How many games did Federer win?

# summing the win flags counts the wins
F_wins = df.win.sum()

print(f"Federer won {F_wins} games")

Federer won 972 games


In [79]:
# Exercise 5.1.2 What is Federer's win/loss ratio?

# number of losses
F_losses = (df.win == 0).sum()

# number of total games
F_games = len(df.win)

# percentage of wins, rounded to the nearest whole percent
perc_wins = int(round(F_wins / F_games * 1e2))

# percentage of losses, rounded to the nearest whole percent
perc_losses = int(round(F_losses / F_games * 1e2))

print(f"Federer's win/loss ratio is {perc_wins}% win to {perc_losses}% loss")

Federer's win/loss ratio is 82% win to 18% loss


In [80]:
# Exercise 5.1.3 Who were Federer's top 5 opponents?
opponent_wins = df.loc[df.winner != 'Roger Federer']

(
    opponent_wins.winner
       .value_counts()
       [:5]
)

Rafael Nadal        18
Novak Djokovic      13
Andy Murray         10
David Nalbandian     8
Lleyton Hewitt       8
Name: winner, dtype: int64

# 5.2 Over time

1. What was Federer's best year, first in terms of money and then in terms of number of wins?

2. Did Federer get better or worse over time?

In [81]:
# clean up 'tournament prize money' string values

df['tournament prize money'] = (
    df['tournament prize money']
       .str.replace(r"[a-zA-Z$,]", '', regex=True)  # strip letters, '$' and ','
)

In [82]:
# exclude empty strings
df = df[df['tournament prize money'] != '']

In [83]:
# exclude nulls
df = df[~df['tournament prize money'].isnull()]

In [84]:
# change 'tournament prize money' to integer type
df['tournament prize money'] = df['tournament prize money'].astype(int)

In [85]:
# change 'year' to string type
df['year'] = df['year'].astype(str)

In [86]:
# Federer df where Federer is the winner
Federer = df[df['winner'].str.contains('Federer')]

In [87]:
# Exercise 5.2.1.1 What was Federer's best year, in terms of money?

best_year_money = (
    
    Federer.groupby(['year','win'])
       .agg(prizes=('tournament prize money','sum'))
       .reset_index()
    
)

best = best_year_money.sort_values(by=['prizes'],ascending=False)
best.year[0:1]

9    2007
Name: year, dtype: object

In [88]:
# Exercise 5.2.1.2 What was Federer's best year in terms of number of wins?

best_year_wins = (
    Federer.groupby('year')
       .agg(wins=('win', 'sum'))
       .reset_index()
)

best2 = best_year_wins.sort_values(by=['wins'], ascending=False)
best2.year[0:1]


8    2006
Name: year, dtype: object

In [None]:
# Exercise 5.2.2 Did Federer get better or worse over time? 

'''
Looking at his best years, Federer peaked
in 2007 in terms of money and in 2006
in terms of wins.

So he seems to have improved until 2006/2007 and then
declined, though he remained quite amazing.

'''

In [90]:
# best years, in descending order, in terms of money
best.year

9     2007
8     2006
6     2004
7     2005
12    2010
14    2012
11    2009
13    2011
10    2008
5     2003
4     2002
3     2001
2     2000
1     1999
0     1998
Name: year, dtype: object

In [91]:
# best years, in descending order, in terms of wins
best2.year

8     2006
7     2005
5     2003
9     2007
12    2010
14    2012
6     2004
10    2008
13    2011
11    2009
4     2002
3     2001
2     2000
1     1999
0     1998
Name: year, dtype: object

# 5.3 Total money won

In the data, you'll find the `tournament round`, one value of which, `F` indicates the final.

Assuming Federer wins the money in the `tournament prize money` if he wins a final in a tournament, how much money has Federer made in tournaments in this dataset?

In [93]:
# Exercise 5.3

# keep only the finals that Federer won
finals = df.loc[
    (df['win'] == True)
    & (df['tournament round'] == 'F')
]

finals['tournament prize money'].sum()

44934964