# Notes:

You are banned from using loops (`for` or `while` or any other) for this entire workshop!

You should almost never need loops with pandas anyway, so break the habit now.
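
As a rough illustration of what "no loops" means in practice, the sketch below doubles a column and filters rows using whole-column operations only. The frame `toy` and its column names are made up for this example and are not part of the workshop data.

```python
import pandas as pd

# Hypothetical toy frame, not part of the workshop data
toy = pd.DataFrame({'x': [1, 2, 3, 4]})

# Whole-column arithmetic replaces "for each row: multiply by 2"
toy['doubled'] = toy['x'] * 2

# A boolean mask replaces "for each row: keep it if x is even"
evens = toy[toy['x'] % 2 == 0]

print(toy)
print(evens)
```

Both operations act on the entire column at once, which is usually shorter and faster than an explicit loop.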

## 1. DataFrame basics


Consider the following Python dictionary `data` and Python list `labels`:

``` python
data = {'animal': ['cat', 'cat', 'snake', 'dog', 'dog', 'cat', 'snake', 'cat', 'dog', 'dog'],
        'age': [2.5, 3, 0.5, np.nan, 5, 2, 4.5, np.nan, 7, 3],
        'visits': [1, 3, 2, 3, 2, 3, 1, 1, 2, 1],
        'priority': ['yes', 'yes', 'no', 'yes', 'no', 'no', 'no', 'yes', 'no', 'no']}

labels = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']
```
(This is just some meaningless data I made up with the theme of animals and trips to a vet.)

**1.** Create a DataFrame `df` from this dictionary `data` which has the index `labels`.

**2.** Select only the rows where visits are 3 or more. Which types of animals are these?

**3.** Select the rows where visits are 3 and the animal is a cat.

**4.** Calculate the sum of all visits in `df` (i.e. the total number of visits).

**5.** Calculate the mean age for each different animal in `df`.

**6.** Append a new row 'k' to `df` with your choice of values for each column. Then delete that row to return the original DataFrame.



In [18]:
import pandas as pd
import numpy as np

data = {'animal': ['cat', 'cat', 'snake', 'dog', 'dog', 'cat', 'snake', 'cat', 'dog', 'dog'],
        'age': [2.5, 3, 0.5, np.nan, 5, 2, 4.5, np.nan, 7, 3],
        'visits': [1, 3, 2, 3, 2, 3, 1, 1, 2, 1],
        'priority': ['yes', 'yes', 'no', 'yes', 'no', 'no', 'no', 'yes', 'no', 'no']}

labels = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']

"""
1:
"""
# Pass the labels as the index directly instead of storing them as a column
df = pd.DataFrame(data, index=labels)


"""
2:
"""
three_visits = df.loc[df.visits >= 3]
print("The rows with three or more visits are:")
print(three_visits)


"""
3:
"""
three_visits_and_a_cat = df.query('visits == 3 & animal == "cat"')
print("The rows with three visits and a cat are:")
print(three_visits_and_a_cat)


"""
4:
"""
sum_of_all_visits = df.visits.sum()
print("There were a total of: {s_v} visits".format(s_v=sum_of_all_visits))


"""
5:
"""
# Group by animal so that each animal gets its own mean (NaN ages are skipped)
mean_age_by_animal = df.groupby('animal')['age'].mean()
print("The mean age for each animal is:")
print(mean_age_by_animal)


"""
6:
"""
# Values follow the column order: animal, age, visits, priority
df.loc['k'] = ['parrot', 4, 2, 'no']
print("Row 'k' present after adding it: {p}".format(p='k' in df.index))
df = df.drop('k')
print("Row 'k' present after dropping it: {p}".format(p='k' in df.index))


The rows with three or more visits are:
  animal  age  visits priority
b    cat  3.0       3      yes
d    dog  NaN       3      yes
f    cat  2.0       3       no
The rows with three visits and a cat are:
  animal  age  visits priority
b    cat  3.0       3      yes
f    cat  2.0       3       no
There were a total of: 19 visits
The mean age for each animal is:
animal
cat      2.5
dog      5.0
snake    2.5
Name: age, dtype: float64
Row 'k' present after adding it: True
Row 'k' present after dropping it: False


# 2.1 Shifty problem

You have a DataFrame `df` with a column 'A' of integers. For example:
```python
df = pd.DataFrame({'A': [1, 2, 2, 3, 4, 5, 5, 5, 6, 7, 7]})
```

How do you filter out rows which contain the same integer as the row immediately above?

You should be left with a column containing the following values:

```python
1, 2, 3, 4, 5, 6, 7
```

### Hint: use the `shift()` method

In [3]:
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 2, 3, 4, 5, 5, 5, 6, 7, 7]})
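# Keep a row only when its value differs from the value in the row above
df = df.loc[df['A'] != df['A'].shift()]
print(df)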
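
To see why the hint helps, the sketch below places column `A` next to its shifted copy and then compares the two. The names `comparison` and `deduplicated` are introduced here only for illustration.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 2, 3, 4, 5, 5, 5, 6, 7, 7]})

# shift() moves every value down one row; the first row becomes NaN
comparison = pd.DataFrame({'A': df['A'], 'row_above': df['A'].shift()})

# Keep a row only when it differs from the row above it
deduplicated = df[df['A'] != df['A'].shift()]

print(comparison.head(4))
print(deduplicated['A'].tolist())
```

The first row is always kept because its shifted neighbour is `NaN`, which never compares equal to a number.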


# 2.2 columns sum min

Suppose you have a DataFrame with 10 columns of real numbers, for example:

```python
df = pd.DataFrame(np.random.random(size=(5, 10)), columns=list('abcdefghij'))
```
Which column of numbers has the smallest sum? Return that column's label.

In [28]:
import numpy as np
import pandas as pd


df = pd.DataFrame(np.random.random(size=(5, 10)), columns=list('abcdefghij'))
summed = df.sum(axis=0)
min_index = summed.idxmin()

print(summed)
print("The column with the smallest sum is {label}".format(label=min_index))

a    2.038144
b    2.972550
c    2.182087
d    1.412166
e    2.578099
f    1.593930
g    2.613984
h    3.224152
i    1.674383
j    2.007061
dtype: float64
The column with the smallest sum is d


# 2.3 Duplicates

How do you count how many unique rows a DataFrame has (i.e. ignore all rows that are duplicates)?

**hint:** There's a method that finds duplicate rows for you.

In [45]:
import pandas as pd

df = pd.DataFrame({
    'first': [0,0,0,0,4],
    'second': [0,1, 0, 2, 4],
    'third': [0,3, 0, 6, 4]
})
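uniques = df.drop_duplicates()
print(uniques)
# The question asks for a count, so report the number of remaining rows
print("Number of unique rows: {n}".format(n=len(uniques)))

   first  second  third
0      0       0      0
1      0       1      3
3      0       2      6
4      4       4      4
Number of unique rows: 4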
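
The method the hint most likely refers to is `duplicated()`. The sketch below shows two readings of "unique rows" on the same data; which reading the exercise intends is an assumption, so both counts are computed.

```python
import pandas as pd

df = pd.DataFrame({
    'first': [0, 0, 0, 0, 4],
    'second': [0, 1, 0, 2, 4],
    'third': [0, 3, 0, 6, 4]
})

# duplicated() marks every repeat of an earlier row as True,
# so negating it counts each distinct row once
distinct_rows = (~df.duplicated()).sum()

# keep=False marks all members of a duplicated group, so the
# negation counts only rows that occur exactly once
rows_occurring_once = (~df.duplicated(keep=False)).sum()

print(distinct_rows, rows_occurring_once)
```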


# 2.4 Group Values

A DataFrame has a column of groups 'grps' and a column of integer values 'vals':

```python
df = pd.DataFrame({'grps': list('aaabbcaabcccbbc'), 
                   'vals': [12,345,3,1,45,14,4,52,54,23,235,21,57,3,87]})
```
For each *group*, find the sum of the three greatest values.  You should end up with the answer as follows:
```
grps
a    409
b    156
c    345
```

In [50]:
import pandas as pd

df = pd.DataFrame({'grps': list('aaabbcaabcccbbc'), 
                   'vals': [12,345,3,1,45,14,4,52,54,23,235,21,57,3,87]})

result = df.groupby('grps')['vals'].apply(lambda group: group.nlargest(3).sum())
print(result)

grps
a    409
b    156
c    345
Name: vals, dtype: int64
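
As a possible alternative that avoids `apply` with a lambda, the sketch below sorts once, keeps the first three rows of each group, and sums them. The intermediate name `top_three` is introduced here for illustration.

```python
import pandas as pd

df = pd.DataFrame({'grps': list('aaabbcaabcccbbc'),
                   'vals': [12, 345, 3, 1, 45, 14, 4, 52, 54, 23, 235, 21, 57, 3, 87]})

# After sorting in descending order, head(3) keeps the three largest rows per group
top_three = (df.sort_values('vals', ascending=False)
               .groupby('grps')
               .head(3))

result = top_three.groupby('grps')['vals'].sum()
print(result)
```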


# 3. Cleaning Data

### Making a DataFrame easier to work with

It happens all the time: someone gives you data containing malformed strings, Python lists and missing data. How do you tidy it up so you can get on with the analysis?

Take this monstrosity as the DataFrame to use in the following puzzles:

```python
df = pd.DataFrame({'From_To': ['LoNDon_paris', 'MAdrid_miLAN', 'londON_StockhOlm', 
                               'Budapest_PaRis', 'Brussels_londOn'],
              'FlightNumber': [10045, np.nan, 10065, np.nan, 10085],
              'RecentDelays': [[23, 47], [], [24, 43, 87], [13], [67, 32]],
                   'Airline': ['KLM(!)', '<Air France> (12)', '(British Airways. )', 
                               '12. Air France', '"Swiss Air"']})
```

Formatted, it looks like this:

```
            From_To  FlightNumber  RecentDelays              Airline
0      LoNDon_paris       10045.0      [23, 47]               KLM(!)
1      MAdrid_miLAN           NaN            []    <Air France> (12)
2  londON_StockhOlm       10065.0  [24, 43, 87]  (British Airways. )
3    Budapest_PaRis           NaN          [13]       12. Air France
4   Brussels_londOn       10085.0      [67, 32]          "Swiss Air"
```

**1.** Some values in the **FlightNumber** column are missing (they are `NaN`). These numbers are meant to increase by 10 with each row, so 10055 and 10075 need to be put in place. Modify `df` to fill in these missing numbers and make the column an integer column (instead of a float column).

In [75]:
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'From_To': ['LoNDon_paris', 'MAdrid_miLAN', 'londON_StockhOlm', 'Budapest_PaRis', 'Brussels_londOn'],
    'FlightNumber': [10045, np.nan, 10065, np.nan, 10085],
    'RecentDelays': [[23, 47], [], [24, 43, 87], [13], [67, 32]],
    'Airline': ['KLM(!)', '<Air France> (12)', '(British Airways. )', '12. Air France', '"Swiss Air"']
})

# The known flight numbers are evenly spaced, so linear interpolation
# fills each gap with the value halfway between its neighbours.
# astype(int) then turns the float column into an integer column.
df['FlightNumber'] = df['FlightNumber'].interpolate().astype(int)

print(df)


            From_To  FlightNumber  RecentDelays              Airline
0      LoNDon_paris         10045      [23, 47]               KLM(!)
1      MAdrid_miLAN         10055            []    <Air France> (12)
2  londON_StockhOlm         10065  [24, 43, 87]  (British Airways. )
3    Budapest_PaRis         10075          [13]       12. Air France
4   Brussels_londOn         10085      [67, 32]          "Swiss Air"


# 3.2 column splitting

The **From\_To** column would be better as two separate columns! Split each string on the underscore delimiter `_` to add two new columns, `From` and `To`, to your DataFrame.

In [152]:
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'From_To': ['LoNDon_paris', 'MAdrid_miLAN', 'londON_StockhOlm', 'Budapest_PaRis', 'Brussels_londOn'],
    'FlightNumber': [10045, np.nan, 10065, np.nan, 10085],
    'RecentDelays': [[23, 47], [], [24, 43, 87], [13], [67, 32]],
    'Airline': ['KLM(!)', '<Air France> (12)', '(British Airways. )', '12. Air France', '"Swiss Air"']
})

def create_from_to(df):
    # Split each route on the underscore; expand=True returns one column per part
    parts = df['From_To'].str.split('_', expand=True)
    parts.columns = ['From', 'To']

    # We add From and To onto the original dataframe
    return df.join(parts)

with_from_to = create_from_to(df)

print(with_from_to)


            From_To  FlightNumber  RecentDelays              Airline  \
0      LoNDon_paris       10045.0      [23, 47]               KLM(!)   
1      MAdrid_miLAN           NaN            []    <Air France> (12)   
2  londON_StockhOlm       10065.0  [24, 43, 87]  (British Airways. )   
3    Budapest_PaRis           NaN          [13]       12. Air France   
4   Brussels_londOn       10085.0      [67, 32]          "Swiss Air"   

       From         To  
0    LoNDon      paris  
1    MAdrid      miLAN  
2    londON  StockhOlm  
3  Budapest      PaRis  
4  Brussels     londOn  


# 3.3 Clean Text

Make the text in your dataframe:

- From and To columns should be lowercase with only first letter capitalized

- In the **Airline** column, you can see some extra punctuation and symbols have appeared around the airline names. Pull out just the airline name. E.g. `'(British Airways. )'` should become `'British Airways'`.

In [166]:
import re


def capitalize_from_to(df):
    df['From'] = df['From'].str.title()
    df['To'] = df['To'].str.title()
    

# We replace every run of non-letter characters with a single space, then strip
# leading and trailing whitespace so only the airline name remains
def cleanup_airline(df):
    df['Airline'] = df['Airline'].apply(lambda name: re.sub(r"[^a-zA-Z]+", ' ', name).strip())

capitalize_from_to(with_from_to)
cleanup_airline(with_from_to)


print(with_from_to)


            From_To  FlightNumber  RecentDelays          Airline      From  \
0      LoNDon_paris       10045.0      [23, 47]              KLM    London   
1      MAdrid_miLAN           NaN            []       Air France    Madrid   
2  londON_StockhOlm       10065.0  [24, 43, 87]  British Airways    London   
3    Budapest_PaRis           NaN          [13]       Air France  Budapest   
4   Brussels_londOn       10085.0      [67, 32]        Swiss Air  Brussels   

          To  
0      Paris  
1      Milan  
2  Stockholm  
3      Paris  
4     London  
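
A vectorized alternative for the airline names is sketched below using the pandas string accessor. It assumes each name is the first run of letters and spaces in the string, which holds for these five values but may not hold for messier data. The Series `airlines` is a standalone copy made for this example.

```python
import pandas as pd

airlines = pd.Series(['KLM(!)', '<Air France> (12)', '(British Airways. )',
                      '12. Air France', '"Swiss Air"'])

# Capture the first run of letters and whitespace, then trim the edges
cleaned = airlines.str.extract(r'([a-zA-Z\s]+)', expand=False).str.strip()

print(cleaned.tolist())
```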


# Exercise 4.1: Column Splitting

Given the unemployment data in `data/country_total.csv`, split the `month` column into two new columns: a `year` column and a `month` column, both integers

In [22]:
import pandas as pd

# Read month as text: parsed as a float, 2010.10 would become 2010.1
# and October would be indistinguishable from January
df = pd.read_csv('data/country_total.csv', dtype={'month': str})

print(df)


def create_year_month(df):
    # Split "YYYY.MM" into its two parts and convert both to integers
    parts = df['month'].str.split('.', expand=True)
    new_df = df.drop(['month'], axis=1)
    new_df['month'] = parts[1].astype(int)
    new_df['year'] = parts[0].astype(int)
    return new_df

df = create_year_month(df)
print(df[['year', 'month']].head())

      country seasonality    month  unemployment  unemployment_rate
0          at         nsa  1993.01        171000                4.5
1          at         nsa  1993.02        175000                4.6
2          at         nsa  1993.03        166000                4.4
3          at         nsa  1993.04        157000                4.1
4          at         nsa  1993.05        147000                3.9
...       ...         ...      ...           ...                ...
20791      uk       trend  2010.06       2429000                7.7
20792      uk       trend  2010.07       2422000                7.7
20793      uk       trend  2010.08       2429000                7.7
20794      uk       trend  2010.09       2447000                7.8
20795      uk       trend  2010.10       2455000                7.8

[20796 rows x 5 columns]
   year  month
0  1993      1
1  1993      2
2  1993      3
3  1993      4
4  1993      5
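
One pitfall with values such as `2010.10` is worth spelling out: if the column is parsed as a float, the trailing zero is lost. The sketch below demonstrates this on a made-up two-row excerpt (the CSV text is hypothetical and only mimics the shape of the `month` column).

```python
import io
import pandas as pd

# Hypothetical two-row excerpt in the same shape as the month column
csv_text = "country,month\nuk,2010.09\nuk,2010.10\n"

as_float = pd.read_csv(io.StringIO(csv_text))
as_text = pd.read_csv(io.StringIO(csv_text), dtype={'month': str})

# Parsed as a float, 2010.10 becomes 2010.1, so October looks like January
float_months = as_float['month'].astype(str).str.split('.').str[1]

# Parsed as text, the two digits after the dot survive
parts = as_text['month'].str.split('.', expand=True)
year = parts[0].astype(int)
month = parts[1].astype(int)

print(float_months.tolist())
print(month.tolist())
```

Reading the column as text before splitting is one way to avoid the problem; rounding the fractional part of the float would be another.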

# 4.2 Group Statistics

Given the unemployment data in `data/country_sex_age.csv`, give the average unemployment rate for:

- Each gender
- Each Age Group
- Both Together

**HINT:** The `seasonality` column makes it such that the data is repeated for each method of calculating unemployment (`nsa`, `trend`, etc.). Can you ignore this and group over it? Or should you take the average for each?

In [44]:
import pandas as pd
from itertools import product

df = pd.read_csv('data/country_sex_age.csv')
# print(df)
unique_genders = df['sex'].unique()
unique_age_group = df['age_group'].unique()
unique_gender_age_groups = list(product(unique_genders, unique_age_group))

genders = pd.DataFrame({'gender': unique_genders})
age_groups = pd.DataFrame({'age_group': unique_age_group})
# We create a third dataframe with the combinations of each unique gender and age group
gender_age_groups = pd.DataFrame(data=unique_gender_age_groups, columns=['gender','age_group'])


# We apply a merge and apply to each dataframe to create the means columns from data in the csv file

genders = genders.merge(
    genders.gender.apply(
        lambda row: pd.Series(
            {
                'mean': df.loc[df['sex'] == row, 'unemployment_rate'].mean()
            }
        )
    ),
    left_index=True, right_index=True
)
age_groups = age_groups.merge(
    age_groups.age_group.apply(
        lambda row: pd.Series(
            {
                'mean': df.loc[df['age_group'] == row, 'unemployment_rate'].mean()
            }
        )
    ),
    left_index=True, right_index=True
)

gender_age_groups = gender_age_groups.merge(
    gender_age_groups.apply(
        lambda row: pd.Series(
            {
                'mean': df.loc[(df['sex'] == row.gender) & (df['age_group'] == row.age_group), 'unemployment_rate'].mean()
            }
        ),
        axis=1
    ),
    left_index=True, right_index=True
)

print(genders)
print(age_groups)
print(gender_age_groups)


  gender       mean
0      f  12.982629
1      m  11.671026
  age_group       mean
0    y25-74   6.905394
1    y_lt25  17.774057
  gender age_group       mean
0      f    y25-74   7.566771
1      f    y_lt25  18.457435
2      m    y25-74   6.244016
3      m    y_lt25  17.098036
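
The same three tables can probably be produced more directly with `groupby`. The sketch below uses a small invented frame (`sample`) with the same column names as `country_sex_age.csv`; the numbers are made up. Grouping by `seasonality` as well is one way to check whether the calculation method changes the picture, as the hint asks.

```python
import pandas as pd

# Hypothetical miniature of country_sex_age.csv; the values are invented
sample = pd.DataFrame({
    'sex': ['f', 'f', 'm', 'm', 'f', 'm'],
    'age_group': ['y25-74', 'y_lt25', 'y25-74', 'y_lt25', 'y25-74', 'y_lt25'],
    'seasonality': ['nsa', 'nsa', 'nsa', 'nsa', 'trend', 'trend'],
    'unemployment_rate': [6.0, 18.0, 5.0, 16.0, 8.0, 20.0],
})

by_sex = sample.groupby('sex')['unemployment_rate'].mean()
by_age = sample.groupby('age_group')['unemployment_rate'].mean()
by_both = sample.groupby(['sex', 'age_group'])['unemployment_rate'].mean()

# Adding seasonality to the keys shows the mean for each calculation method
by_method = sample.groupby(['seasonality', 'sex'])['unemployment_rate'].mean()

print(by_both)
print(by_method)
```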


# 4.3 Estimating group size

Given that we have the unemployment **rate** as a % of total population, and the number of total unemployed, we can estimate the total population.

Give an estimate of the total population for men and women in each age group.

Does this change depending on the unemployment seasonality calculation method?
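
The estimate follows from rearranging the definition of the rate. A minimal worked example, using the figures from the first row of the `country_total.csv` printout above (171000 unemployed at a rate of 4.5%); the one-row frame `row` is only an illustration:

```python
import pandas as pd

# Figures taken from the first row of the country_total.csv printout above;
# the frame itself is only an illustration
row = pd.DataFrame({'unemployment': [171000], 'unemployment_rate': [4.5]})

# rate (%) = 100 * unemployed / population
# =>  population = 100 * unemployed / rate
row['pop_est'] = 100 * row['unemployment'] / row['unemployment_rate']

print(row)
```

With these numbers the estimate comes to 3.8 million people.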

In [67]:

"""
total population for men and women in each age group
"""
# Let's add a column of estimated total populations
df['pop_est'] = (100 / df['unemployment_rate']) * df['unemployment']
unemployed_by_genders_time_cumulative = df.groupby(['sex', 'age_group'])['pop_est'].mean()

"""
total population for men and women in each age group over each time period
"""
unemployed_by_genders_over_time = df.groupby(['sex', 'age_group', 'month'])['pop_est'].mean()


print("BY GENDER GROUPS ONLY:")
print(unemployed_by_genders_time_cumulative)
print(" ")
print("BY TIME PERIOD:")
print(unemployed_by_genders_over_time)
print(" ")

"""
unemployment seasonality calculation method
"""
# We check our dataframe of unemployed_by_gender_age_groups_over_time, since this df
# Is already filtered
# To remove seasonal variation, we will exclude data for sample populations under the age
# of 25, meaning, we will only consider adults in the workforce, who we
# assume to no longer be in school.
# First we calculate our seasonal baselines

seasonal_unemployed_males = df.loc[df['sex'] == 'm']
seasonal_unemployed_females = df.loc[df['sex'] == 'f']

seasonal_std_males_ot = seasonal_unemployed_males['unemployment'].std()
seasonal_average_males_ot = seasonal_unemployed_males['unemployment'].mean()
seasonal_std_females_ot = seasonal_unemployed_females['unemployment'].std()
seasonal_average_females_ot = seasonal_unemployed_females['unemployment'].mean()


# un-seasonal dataframes
unemployed_males = seasonal_unemployed_males.loc[seasonal_unemployed_males['age_group'] == 'y25-74']
unemployed_females = seasonal_unemployed_females.loc[seasonal_unemployed_females['age_group'] == 'y25-74']


std_males_ot = unemployed_males['unemployment'].std()
average_males_ot = unemployed_males['unemployment'].mean()
std_females_ot = unemployed_females['unemployment'].std()
average_females_ot = unemployed_females['unemployment'].mean()


# Each line reports a mean and its standard deviation (mean +/- std)
print("The mean number of seasonal unemployed males over time is {v} +/- {std}".format(
    v=seasonal_average_males_ot,
    std=seasonal_std_males_ot
))
print("The mean number of NON-seasonal unemployed males over time is {v} +/- {std}".format(
    v=average_males_ot,
    std=std_males_ot
))
print("The mean number of seasonal unemployed females over time is {v} +/- {std}".format(
    v=seasonal_average_females_ot,
    std=seasonal_std_females_ot
))
print("The mean number of NON-seasonal unemployed females over time is {v} +/- {std}".format(
    v=average_females_ot,
    std=std_females_ot
))

"""
We can determine that the seasonal calculation has no impact if the standard deviations
of seasonal vs non-seasonal pools of the population have an overlap
"""

# Males
seasonal_males_min_range = seasonal_average_males_ot - seasonal_std_males_ot
males_range = average_males_ot + std_males_ot
overlap = seasonal_males_min_range <= males_range
print("The seasonality is not significantly impactful for males" if overlap else "The seasonality is significantly impactful for males")


# Females
seasonal_females_min_range = seasonal_average_females_ot - seasonal_std_females_ot
females_range = average_females_ot + std_females_ot
fe_overlap = seasonal_females_min_range <= females_range
print("The seasonality is not significantly impactful for females" if fe_overlap else "The seasonality is significantly impactful for females")

BY GENDER GROUPS ONLY:
sex  age_group
f    y25-74       3.220602e+06
     y_lt25       5.665826e+05
m    y25-74       4.357668e+06
     y_lt25       6.679826e+05
Name: pop_est, dtype: float64
 
BY TIME PERIOD:
sex  age_group  month  
f    y25-74     1983.01    2.833058e+06
                1983.02    2.841320e+06
                1983.03    2.841420e+06
                1983.04    2.855877e+06
                1983.05    2.853827e+06
                               ...     
m    y_lt25     2010.08    5.692591e+05
                2010.09    5.687628e+05
                2010.10    6.319955e+05
                2010.11    4.636428e+05
                2010.12    4.612474e+05
Name: pop_est, Length: 1344, dtype: float64
 
The mean number of seasonal unemployed males over time is 206702.405002405 +/- 315546.4107734553
The mean number of NON-seasonal unemployed males over time is 289391.5824915825 +/- 396777.87992010266
The mean number of seasonal unemployed females over time is 188137.56613756614 +/- 284589

### 5.1 Tennis

In `data/tennis.csv` you have games that Roger Federer played against various opponents. Questions:

1. How many games did Federer win?

2. What is Federer's win/loss ratio?

3. Who were Federer's top 5 opponents?

In [97]:
import pandas as pd

df1 = pd.read_csv('data/tennis.csv')
df = df1.loc[df1['opponent'] != 'Bye']


federer_wins = df.loc[df['winner'] == 'Roger Federer']['winner'].count()
federer_losses = df['winner'].count() - federer_wins
# "Top" opponents are taken to be the ones Federer faced most often
opponents = df['opponent'].value_counts()
top5 = opponents.nlargest(5)
print("Federer has won {co} games".format(co=federer_wins))
print("Federer's win/loss ratio is {r}".format(r=federer_wins/federer_losses))
print("Federer's top 5 opponents are")
print(top5)

Federer has won 903 games
Federer's win/loss ratio is 4.36231884057971
Federer's top 5 opponents are
Novak Djokovic (SRB)       29
Rafael Nadal (ESP)         28
Lleyton Hewitt (AUS)       26
Andy Roddick (USA)         24
Nikolay Davydenko (RUS)    19
Name: opponent, dtype: int64
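
The win count, win/loss ratio and opponent ranking can also be read off a single boolean mask. The sketch below uses an invented four-match frame (`matches`); only the column names `winner` and `opponent` come from the tennis data.

```python
import pandas as pd

# Hypothetical match list; only the column names come from the tennis data
matches = pd.DataFrame({
    'winner': ['Roger Federer', 'Roger Federer', 'Rafael Nadal (ESP)', 'Roger Federer'],
    'opponent': ['Rafael Nadal (ESP)', 'Andy Roddick (USA)',
                 'Rafael Nadal (ESP)', 'Rafael Nadal (ESP)'],
})

# True where Federer won; summing a boolean Series counts the True values
won = matches['winner'] == 'Roger Federer'
wins = won.sum()
losses = (~won).sum()
ratio = wins / losses

# value_counts() sorts opponents by how often they appear
most_frequent = matches['opponent'].value_counts().head(2)

print(wins, losses, ratio)
print(most_frequent)
```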


# 5.2 Over time

1. What was Federer's best year, first in terms of prize money and then in terms of number of wins?

2. Did Federer get better or worse over time?

In [223]:
from scipy import stats

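# dropna, replace and apply all return new objects, so their results must be assigned.
# Working on a copy avoids SettingWithCopyWarning on the assignments below.
df = df.dropna(subset=['tournament prize money']).copy()
df = df.loc[df['tournament prize money'] != '']

# Strip currency symbols, thousands separators and letters, treat empty
# strings as zero, then convert the column to float
df['tournament prize money'] = (
    df['tournament prize money']
    .replace(r'[\$,a-zA-Z]', '', regex=True)
    .replace('', 0.0)
    .astype(float)
)
fed_wins = df.loc[df['winner'] == 'Roger Federer']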

def check_results():
    sums = pd.DataFrame({
        'year': fed_wins['year'].unique()
    })

    # We check how much money earned and how many wins
    sums = sums.merge(
        sums.year.apply(
            lambda year: pd.Series({
                'earned': fed_wins.loc[fed_wins['year'] == year]['tournament prize money'].sum(),
                'wins': fed_wins.loc[fed_wins['year'] == year]['tournament prize money'].count()
            })
        ),
        left_index=True, 
        right_index=True
    )
    sums['earning_per_win'] = sums['earned'] / sums['wins']
    # fed_wins.groupby(['year','tournament prize money'])['tournament prize money'].sum()
    top_year = sums.nlargest(1, 'earned').iloc[0]['year']
    top_year_wins = sums.nlargest(1, 'wins').iloc[0]['year']
    top_year_earning_per_wins = sums.nlargest(1, 'earning_per_win').iloc[0]['year']
    lin_reg = stats.linregress(sums['year'], sums['wins'])
    print(sums)
    print("Federer has gotten better over time" if lin_reg.slope > 0 else "Federer has not gotten better over time")
    print("But his best year in terms of earnings was {year}".format(year=top_year))
    print("And his best year in terms of wins was {y}".format(y=top_year_wins))
    print("And his best year in terms of earning per win was {y}".format(y=top_year_earning_per_wins))

check_results()


    year      earned  wins  earning_per_win
0   1998     21600.0   2.0     10800.000000
1   1999    365975.0  28.0     13070.535714
2   2000   1023498.0  30.0     34116.600000
3   2001   2444072.0  46.0     53132.000000
4   2002   6696200.0  55.0    121749.090909
5   2003  19881618.0  74.0    268670.513514
6   2004  38593375.0  70.0    551333.928571
7   2005  35922490.0  80.0    449031.125000
8   2006  50157595.0  90.0    557306.611111
9   2007  51186795.0  67.0    763982.014925
10  2008  22140900.0  63.0    351442.857143
11  2009  32362250.0  59.0    548512.711864
12  2010  34625940.0  66.0    524635.454545
13  2011  24006110.0  61.0    393542.786885
14  2012  32048985.0  67.0    478343.059701
Federer has gotten better over time
But his best year in terms of earnings was 2007.0
And his best year in terms of wins was 2006.0
And his best year in terms of earning per win was 2007.0
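
The per-year table can likely be built more compactly with `groupby` and named aggregation. The sketch below runs on an invented frame (`matches`); only the column names `year` and `tournament prize money` come from the tennis data, and the amounts are made up.

```python
import pandas as pd

# Hypothetical won matches; column names mirror the tennis data, values are invented
matches = pd.DataFrame({
    'year': [2003, 2003, 2004, 2004, 2004],
    'tournament prize money': [100.0, 50.0, 200.0, 25.0, 75.0],
})

# One pass over the groups yields both the total earned and the number of wins
per_year = (matches.groupby('year')['tournament prize money']
                   .agg(earned='sum', wins='count'))
per_year['earning_per_win'] = per_year['earned'] / per_year['wins']

# idxmax returns the index label (here, the year) of the largest value
best_year_by_money = per_year['earned'].idxmax()
best_year_by_wins = per_year['wins'].idxmax()

print(per_year)
```

Because `year` becomes the index, `idxmax` returns the year directly and no second lookup is needed.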




# 5.3 Total money won

In the data, you'll find the `tournament round`, one value of which, `F` indicates the final.

Assuming Federer wins the money in the `tournament prize money` if he wins a final in a tournament, how much money has Federer made in tournaments in this dataset?

In [224]:
# Keep only the finals that Federer won
fed_wins = fed_wins.loc[fed_wins['tournament round'] == "F"]
check_results()
# The question asks for a single figure: the total over all finals won
print("Total prize money from finals won: {t}".format(t=fed_wins['tournament prize money'].sum()))

    year     earned  wins  earning_per_win
0   1999    14400.0   1.0     14400.000000
1   2001    54000.0   1.0     54000.000000
2   2002   540600.0   3.0    180200.000000
3   2003  3026502.0   7.0    432357.428571
4   2004  6229377.0  11.0    566307.000000
5   2005  4733250.0  11.0    430295.454545
6   2006  7221635.0  12.0    601802.916667
7   2007  7245735.0   8.0    905716.875000
8   2008  1819800.0   4.0    454950.000000
9   2009  2938500.0   4.0    734625.000000
10  2010  4561045.0   5.0    912209.000000
11  2011  2579000.0   4.0    644750.000000
12  2012  3971120.0   6.0    661853.333333
Federer has gotten better over time
But his best year in terms of earnings was 2007.0
And his best year in terms of wins was 2006.0
And his best year in terms of earning per win was 2010.0
Total prize money from finals won: 44934964.0
