# Notes:

You are banned from using loops (`for` or `while` or any other) for this entire workshop!

You should almost never need loops with pandas anyway, so break the habit now.

## 1. DataFrame basics


Consider the following Python dictionary `data` and Python list `labels`:

``` python
data = {'animal': ['cat', 'cat', 'snake', 'dog', 'dog', 'cat', 'snake', 'cat', 'dog', 'dog'],
        'age': [2.5, 3, 0.5, np.nan, 5, 2, 4.5, np.nan, 7, 3],
        'visits': [1, 3, 2, 3, 2, 3, 1, 1, 2, 1],
        'priority': ['yes', 'yes', 'no', 'yes', 'no', 'no', 'no', 'yes', 'no', 'no']}

labels = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']
```
(This is just some meaningless data I made up with the theme of animals and trips to a vet.)

**1.** Create a DataFrame `df` from this dictionary `data` which has the index `labels`.

**2.** Select only the rows where visits are 3 or more. Which types of animals are these?

**3.** Select the rows where visits are 3 and the animal is a cat.

**4.** Calculate the sum of all visits in `df` (i.e. the total number of visits).

**5.** Calculate the mean age for each different animal in `df`.

**6.** Append a new row 'k' to `df` with your choice of values for each column. Then delete that row to return the original DataFrame.



In [2]:
#Question 1.1
import pandas as pd
import numpy as np

data = {'animal': ['cat', 'cat', 'snake', 'dog', 'dog', 'cat', 'snake', 'cat', 'dog', 'dog'],
        'age': [2.5, 3, 0.5, np.nan, 5, 2, 4.5, np.nan, 7, 3],
        'visits': [1, 3, 2, 3, 2, 3, 1, 1, 2, 1],
        'priority': ['yes', 'yes', 'no', 'yes', 'no', 'no', 'no', 'yes', 'no', 'no'],
        'labels' : ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']

}

df = pd.DataFrame.from_dict(data)
df = df.set_index('labels')
df

labels,animal,age,visits,priority
a,cat,2.5,1,yes
b,cat,3.0,3,yes
c,snake,0.5,2,no
d,dog,,3,yes
e,dog,5.0,2,no
f,cat,2.0,3,no
g,snake,4.5,1,no
h,cat,,1,yes
i,dog,7.0,2,no
j,dog,3.0,1,no


In [3]:
#Question 1.2
visits_3_more = df.visits >= 3
df.loc[visits_3_more] #Gives 2 cats and a dog

labels,animal,age,visits,priority
b,cat,3.0,3,yes
d,dog,,3,yes
f,cat,2.0,3,no


In [4]:
#Question 1.3
df.loc[
    (df.visits == 3)
    & (df.animal == 'cat')]

labels,animal,age,visits,priority
b,cat,3.0,3,yes
f,cat,2.0,3,no


In [5]:
#Question 1.4
df['visits'].sum()

19

In [6]:
#Question 1.5
#mean() skips NaN by default, so there is no need to drop rows first
df.groupby('animal')['age'].mean()

animal
cat      2.5
dog      5.0
snake    2.5
Name: age, dtype: float64

In [7]:
#Question 1.6 - note: this adds a column ('Favorites'), while the question asks for a new row 'k'
favorites = ['no', 'no', 'no', 'yes', 'yes', 'no', 'no', 'no', 'yes', 'yes']
df['Favorites'] = favorites
df

labels,animal,age,visits,priority,Favorites
a,cat,2.5,1,yes,no
b,cat,3.0,3,yes,no
c,snake,0.5,2,no,no
d,dog,,3,yes,yes
e,dog,5.0,2,no,yes
f,cat,2.0,3,no,no
g,snake,4.5,1,no,no
h,cat,,1,yes,no
i,dog,7.0,2,no,yes
j,dog,3.0,1,no,yes


In [8]:
df = df.drop(['Favorites'], axis = 1) #drop() returns a new DataFrame, so reassign to keep the change
df

labels,animal,age,visits,priority
a,cat,2.5,1,yes
b,cat,3.0,3,yes
c,snake,0.5,2,no
d,dog,,3,yes
e,dog,5.0,2,no
f,cat,2.0,3,no
g,snake,4.5,1,no
h,cat,,1,yes
i,dog,7.0,2,no
j,dog,3.0,1,no
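
Question 1.6 asks for a new *row*, whereas the cells above add and drop a column. A minimal sketch of the row version, assuming a small stand-in frame and made-up values for row 'k':

```python
import numpy as np
import pandas as pd

# Small stand-in for the vet DataFrame (same columns, fewer rows)
df = pd.DataFrame(
    {'animal': ['cat', 'snake', 'dog'],
     'age': [2.5, 0.5, np.nan],
     'visits': [1, 2, 3],
     'priority': ['yes', 'no', 'yes']},
    index=['a', 'c', 'd'])

# Assigning to a label that does not exist yet adds a row.
# The values for 'k' are made up for illustration.
df.loc['k'] = ['dog', 5.5, 2, 'no']
df_with_k = df.copy()

# drop() returns a new DataFrame, so reassign to actually remove the row
df = df.drop('k')
```

After the last line `df` has the index `a`, `c`, `d` again, while `df_with_k` still holds the extra row.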


# 2.1 Shifty problem

You have a DataFrame `df` with a column 'A' of integers. For example:
```python
df = pd.DataFrame({'A': [1, 2, 2, 3, 4, 5, 5, 5, 6, 7, 7]})
```

How do you filter out rows which contain the same integer as the row immediately above?

You should be left with a column containing the following values:

```python
1, 2, 3, 4, 5, 6, 7
```

### Hint: use the `shift()` method

In [9]:
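#easy way (only works here because repeats are adjacent; drop_duplicates removes every repeat, not just consecutive ones)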
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 2, 3, 4, 5, 5, 5, 6, 7, 7]})
drop = df.drop_duplicates()
drop

Unnamed: 0,A
0,1
1,2
3,3
4,4
5,5
8,6
9,7


In [10]:
#shift() way
df['repeats'] = df['A'].shift(1) == df['A'] #True where a row repeats the value in the row above
df

Unnamed: 0,A,repeats
0,1,False
1,2,False
2,2,True
3,3,False
4,4,False
5,5,False
6,5,True
7,5,True
8,6,False
9,7,False


In [11]:
df_filter = df[~df['repeats']] #keep only the rows that do not repeat the value above (drop the True rows)
df_filter

Unnamed: 0,A,repeats
0,1,False
1,2,False
3,3,False
4,4,False
5,5,False
8,6,False
9,7,False


In [12]:
df_filter.drop(['repeats'], axis = 1) #delete useless repeats column

Unnamed: 0,A
0,1
1,2
3,3
4,4
5,5
8,6
9,7
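
The same filter can be written in one step, without a helper column. This is a sketch of the `shift()` idea from the hint; the second frame (`other`) is a made-up example to show where it differs from `drop_duplicates()`:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 2, 3, 4, 5, 5, 5, 6, 7, 7]})

# Keep a row only when its value differs from the value in the row above.
# The first row is compared with NaN, so it is always kept.
result = df.loc[df['A'].shift() != df['A']]

# Made-up data where a value comes back later, but not in adjacent rows
other = pd.DataFrame({'A': [1, 2, 1]})
kept_by_shift = other.loc[other['A'].shift() != other['A']]  # keeps all three rows
kept_by_dedup = other.drop_duplicates()                      # drops the second 1
```

On the workshop data both approaches agree; on `other` only the `shift()` version keeps the second `1`, because it is not directly below the first.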


# 2.2 columns sum min

Suppose you have a DataFrame with 10 columns of real numbers, for example:

```python
df = pd.DataFrame(np.random.random(size=(5, 10)), columns=list('abcdefghij'))
```
Which column of numbers has the smallest sum? Return that column's label.

In [13]:
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.random(size=(5, 10)), columns=list('abcdefghij'))
df

Unnamed: 0,a,b,c,d,e,f,g,h,i,j
0,0.545836,0.606948,0.096357,0.480048,0.131087,0.101347,0.600561,0.542289,0.596332,0.73064
1,0.621824,0.576517,0.26422,0.390805,0.827957,0.66158,0.196113,0.732399,0.212533,0.542927
2,0.558283,0.867852,0.926982,0.975437,0.330219,0.211817,0.092365,0.332355,0.594377,0.811954
3,0.652838,0.20566,0.616704,0.945271,0.578023,0.823084,0.094506,0.230106,0.091865,0.864832
4,0.714141,0.905723,0.598105,0.801233,0.586906,0.359328,0.923883,0.89699,0.5628,0.384959


In [14]:
sums = df.sum(axis = 0) #sum of each column
ax = pd.concat([df, sums.to_frame().T], ignore_index = True) #add the sums as a new row (df.append was removed in pandas 2.0)
ax

Unnamed: 0,a,b,c,d,e,f,g,h,i,j
0,0.545836,0.606948,0.096357,0.480048,0.131087,0.101347,0.600561,0.542289,0.596332,0.73064
1,0.621824,0.576517,0.26422,0.390805,0.827957,0.66158,0.196113,0.732399,0.212533,0.542927
2,0.558283,0.867852,0.926982,0.975437,0.330219,0.211817,0.092365,0.332355,0.594377,0.811954
3,0.652838,0.20566,0.616704,0.945271,0.578023,0.823084,0.094506,0.230106,0.091865,0.864832
4,0.714141,0.905723,0.598105,0.801233,0.586906,0.359328,0.923883,0.89699,0.5628,0.384959
5,3.092922,3.1627,2.502368,3.592794,2.454191,2.157156,1.907428,2.734138,2.057907,3.335312


In [15]:
ax.idxmin(axis = 1) #column label of the minimum in each row; row 5 holds the sums, so the answer is g

0    c
1    g
2    g
3    i
4    f
5    g
dtype: object
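
The question only asks for one label, so the extra row is not strictly needed. A shorter sketch that returns just the label (the random data means the answer differs from run to run):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.random(size=(5, 10)), columns=list('abcdefghij'))

# sum() adds up each column; idxmin() returns the label of the smallest sum
smallest = df.sum().idxmin()
```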

# 2.3 Duplicates

How do you count how many unique rows a DataFrame has (i.e. ignore all rows that are duplicates)?

**hint:** There's a method that finds duplicate rows for you.

In [16]:
#Use df.pivot_table(index = ['column name', 'possible other column name'], aggfunc = 'size'). From the example below, we see how many times each value appears in column A.

import pandas as pd

df = pd.DataFrame({'A': [1, 2, 2, 3, 4, 5, 5, 5, 6, 7, 7]})
dff = df.pivot_table(index = ['A'], aggfunc = 'size')
dff

A
1    1
2    2
3    1
4    1
5    3
6    1
7    2
dtype: int64
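
The pivot table shows how often each value appears, but it does not give the single count the question asks for. The question can be read two ways, so this sketch computes both using `drop_duplicates()` and `duplicated()`:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 2, 3, 4, 5, 5, 5, 6, 7, 7]})

# Reading 1: number of distinct rows (each repeated row counted once)
n_distinct = len(df.drop_duplicates())

# Reading 2: number of rows that never repeat (every copy of a duplicate is ignored)
n_never_repeated = len(df) - df.duplicated(keep=False).sum()
```

For this data `n_distinct` is 7 and `n_never_repeated` is 4 (the values 1, 3, 4 and 6).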

# 2.4 Group Values

A DataFrame has a column of groups 'grps' and a column of integer values 'vals':

```python
df = pd.DataFrame({'grps': list('aaabbcaabcccbbc'), 
                   'vals': [12,345,3,1,45,14,4,52,54,23,235,21,57,3,87]})
```
For each *group*, find the sum of the three greatest values.  You should end up with the answer as follows:
```
grps
a    409
b    156
c    345
```

In [17]:
import pandas as pd

df = pd.DataFrame({'grps': list('aaabbcaabcccbbc'), 
                   'vals': [12,345,3,1,45,14,4,52,54,23,235,21,57,3,87]})
df

Unnamed: 0,grps,vals
0,a,12
1,a,345
2,a,3
3,b,1
4,b,45
5,c,14
6,a,4
7,a,52
8,b,54
9,c,23


In [18]:
#Extra - solved by mistake
df.groupby(['grps'], as_index = False)['vals'].sum() #Gets total of each

Unnamed: 0,grps,vals
0,a,416
1,b,160
2,c,380


In [19]:
#Desired answers
df.loc[df.grps == 'a'].sort_values(by = 'vals', ascending = False).iloc[ : 3].sum()

grps    aaa
vals    409
dtype: object

In [20]:
df.loc[df.grps == 'b'].sort_values(by = 'vals', ascending = False).iloc[ : 3].sum()

grps    bbb
vals    156
dtype: object

In [21]:
df.loc[df.grps == 'c'].sort_values(by = 'vals', ascending = False).iloc[ : 3].sum()

grps    ccc
vals    345
dtype: object
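
The three cells above repeat the same steps once per group. A sketch that handles every group in one expression, using `groupby` with `nlargest`:

```python
import pandas as pd

df = pd.DataFrame({'grps': list('aaabbcaabcccbbc'),
                   'vals': [12, 345, 3, 1, 45, 14, 4, 52, 54, 23, 235, 21, 57, 3, 87]})

# For each group, take the three largest values and add them up
top3 = df.groupby('grps')['vals'].apply(lambda s: s.nlargest(3).sum())
```

`top3` is a Series indexed by `grps` and does not need to know the group names in advance.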

# 3. Cleaning Data

### Making a DataFrame easier to work with

It happens all the time: someone gives you data containing malformed strings, Python lists and missing data. How do you tidy it up so you can get on with the analysis?

Take this monstrosity as the DataFrame to use in the following puzzles:

```python
df = pd.DataFrame({'From_To': ['LoNDon_paris', 'MAdrid_miLAN', 'londON_StockhOlm', 
                               'Budapest_PaRis', 'Brussels_londOn'],
              'FlightNumber': [10045, np.nan, 10065, np.nan, 10085],
              'RecentDelays': [[23, 47], [], [24, 43, 87], [13], [67, 32]],
                   'Airline': ['KLM(!)', '<Air France> (12)', '(British Airways. )', 
                               '12. Air France', '"Swiss Air"']})
```

Formatted, it looks like this:

```
            From_To  FlightNumber  RecentDelays              Airline
0      LoNDon_paris       10045.0      [23, 47]               KLM(!)
1      MAdrid_miLAN           NaN            []    <Air France> (12)
2  londON_StockhOlm       10065.0  [24, 43, 87]  (British Airways. )
3    Budapest_PaRis           NaN          [13]       12. Air France
4   Brussels_londOn       10085.0      [67, 32]          "Swiss Air"
```

**1.** Some values in the **FlightNumber** column are missing (they are `NaN`). These numbers are meant to increase by 10 with each row, so 10055 and 10075 need to be put in place. Modify `df` to fill in these missing numbers and make the column an integer column (instead of a float column).

In [22]:
import pandas as pd
import numpy as np

df = pd.DataFrame({'From_To': ['LoNDon_paris', 'MAdrid_miLAN', 'londON_StockhOlm', 
                               'Budapest_PaRis', 'Brussels_londOn'],
              'FlightNumber': [10045, np.nan, 10065, np.nan, 10085],
              'RecentDelays': [[23, 47], [], [24, 43, 87], [13], [67, 32]],
                   'Airline': ['KLM(!)', '<Air France> (12)', '(British Airways. )', 
                               '12. Air France', '"Swiss Air"']})
df

Unnamed: 0,From_To,FlightNumber,RecentDelays,Airline
0,LoNDon_paris,10045.0,"[23, 47]",KLM(!)
1,MAdrid_miLAN,,[],<Air France> (12)
2,londON_StockhOlm,10065.0,"[24, 43, 87]",(British Airways. )
3,Budapest_PaRis,,[13],12. Air France
4,Brussels_londOn,10085.0,"[67, 32]","""Swiss Air"""


In [23]:
#Question 3.1
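ax = df.copy()
ax['FlightNumber'] = ax['FlightNumber'].interpolate() #interpolate only the numeric column; recent pandas versions warn when interpolating object columns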
ax

Unnamed: 0,From_To,FlightNumber,RecentDelays,Airline
0,LoNDon_paris,10045.0,"[23, 47]",KLM(!)
1,MAdrid_miLAN,10055.0,[],<Air France> (12)
2,londON_StockhOlm,10065.0,"[24, 43, 87]",(British Airways. )
3,Budapest_PaRis,10075.0,[13],12. Air France
4,Brussels_londOn,10085.0,"[67, 32]","""Swiss Air"""
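
The output above still shows `FlightNumber` as floats, while the question also asks for an integer column. A minimal sketch of that last step, using only the `FlightNumber` column for brevity:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'FlightNumber': [10045, np.nan, 10065, np.nan, 10085]})

# Fill the gaps by linear interpolation, then cast the now-complete floats to int
df['FlightNumber'] = df['FlightNumber'].interpolate().astype(int)
```

The cast has to come after the interpolation, because a plain integer column cannot hold `NaN`.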


# 3.2 column splitting

The **From\_To** column would be better as two separate columns! Split each string on the underscore delimiter `_` to add two new columns, `From` and `To`, to your DataFrame.

In [24]:
ax[['From', 'To']] = ax.From_To.str.split('_', expand = True)
ax.drop(['From_To'], axis = 1, inplace = True) #Drop unnecessary column
bx = ax[ax.columns[ : : -1]] #reversed it so that it's not ugly
bx

Unnamed: 0,To,From,Airline,RecentDelays,FlightNumber
0,paris,LoNDon,KLM(!),"[23, 47]",10045.0
1,miLAN,MAdrid,<Air France> (12),[],10055.0
2,StockhOlm,londON,(British Airways. ),"[24, 43, 87]",10065.0
3,PaRis,Budapest,12. Air France,[13],10075.0
4,londOn,Brussels,"""Swiss Air""","[67, 32]",10085.0


# 3.3 Clean Text

Make the text in your dataframe:

- From and To columns should be lowercase with only first letter capitalized

- In the **Airline** column, you can see some extra punctuation and symbols have appeared around the airline names. Pull out just the airline name. E.g. `'(British Airways. )'` should become `'British Airways'`.

In [25]:
import pandas as pd
import numpy as np

df = pd.DataFrame({'From_To': ['LoNDon_paris', 'MAdrid_miLAN', 'londON_StockhOlm', 
                               'Budapest_PaRis', 'Brussels_londOn'],
              'FlightNumber': [10045, np.nan, 10065, np.nan, 10085],
              'RecentDelays': [[23, 47], [], [24, 43, 87], [13], [67, 32]],
                   'Airline': ['KLM(!)', '<Air France> (12)', '(British Airways. )', 
                               '12. Air France', '"Swiss Air"']})
df

Unnamed: 0,From_To,FlightNumber,RecentDelays,Airline
0,LoNDon_paris,10045.0,"[23, 47]",KLM(!)
1,MAdrid_miLAN,,[],<Air France> (12)
2,londON_StockhOlm,10065.0,"[24, 43, 87]",(British Airways. )
3,Budapest_PaRis,,[13],12. Air France
4,Brussels_londOn,10085.0,"[67, 32]","""Swiss Air"""


In [26]:
df[['From', 'To']] = df.From_To.str.split('_', expand = True)
df.drop(['From_To'], axis = 1, inplace = True)
df

Unnamed: 0,FlightNumber,RecentDelays,Airline,From,To
0,10045.0,"[23, 47]",KLM(!),LoNDon,paris
1,,[],<Air France> (12),MAdrid,miLAN
2,10065.0,"[24, 43, 87]",(British Airways. ),londON,StockhOlm
3,,[13],12. Air France,Budapest,PaRis
4,10085.0,"[67, 32]","""Swiss Air""",Brussels,londOn


In [27]:
df['From2'] = df['From'].str.title() #first letter uppercase, the rest lowercase
df['To2'] = df['To'].str.title()
ax = df.drop(['From', 'To'], axis = 1)
ax

Unnamed: 0,FlightNumber,RecentDelays,Airline,From2,To2
0,10045.0,"[23, 47]",KLM(!),London,Paris
1,,[],<Air France> (12),Madrid,Milan
2,10065.0,"[24, 43, 87]",(British Airways. ),London,Stockholm
3,,[13],12. Air France,Budapest,Paris
4,10085.0,"[67, 32]","""Swiss Air""",Brussels,London


In [28]:
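bx = ax.copy()
bx['FlightNumber'] = bx['FlightNumber'].interpolate()

#\w also matches digits, which is why the first pattern alone leaves the 12 behind
#regex = True is required: since pandas 2.0, str.replace treats the pattern as a literal string by default
bx['Fixed Airline'] = bx['Airline'].str.replace(r'[^\w\s]+', '', regex = True)
bx['Fixed Airline'] = bx['Fixed Airline'].str.replace(r'\d+', '', regex = True).str.strip() #strip() removes leftover spaces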
bx

Unnamed: 0,FlightNumber,RecentDelays,Airline,From2,To2,Fixed Airline
0,10045.0,"[23, 47]",KLM(!),London,Paris,KLM
1,10055.0,[],<Air France> (12),Madrid,Milan,Air France
2,10065.0,"[24, 43, 87]",(British Airways. ),London,Stockholm,British Airways
3,10075.0,[13],12. Air France,Budapest,Paris,Air France
4,10085.0,"[67, 32]","""Swiss Air""",Brussels,London,Swiss Air
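
Instead of deleting the unwanted characters, the name can also be pulled out directly. This is a sketch using `str.extract`, assuming airline names consist only of ASCII letters and spaces:

```python
import pandas as pd

airline = pd.Series(['KLM(!)', '<Air France> (12)', '(British Airways. )',
                     '12. Air France', '"Swiss Air"'])

# Capture the first run of letters and whitespace, then trim stray spaces
cleaned = airline.str.extract(r'([a-zA-Z\s]+)', expand=False).str.strip()
```

The pattern would need widening for names containing digits, hyphens or accented letters.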


# Exercise 4.1: Column Splitting

Given the unemployment data in `data/country_total.csv`, split the `month` column into two new columns: a `year` column and a `month` column, both integers.

In [29]:
import pandas as pd
import numpy as np

df = pd.read_csv('data/country_total.csv')

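df['month2'] = df.month.map('{:.2f}'.format) #fixed two decimals: astype(str) would turn 1993.10 into '1993.1' and October into month 1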
df[['Year', 'Month']] = df.month2.str.split('.', expand = True)
df.drop(['month2'], axis = 1, inplace = True)
df.drop(['month'], axis = 1, inplace = True)
df['Year'] = df['Year'].astype(int)
df['Month'] = df['Month'].astype(int)
df

Unnamed: 0,country,seasonality,unemployment,unemployment_rate,Year,Month
0,at,nsa,171000,4.5,1993,1
1,at,nsa,175000,4.6,1993,2
2,at,nsa,166000,4.4,1993,3
3,at,nsa,157000,4.1,1993,4
4,at,nsa,147000,3.9,1993,5
...,...,...,...,...,...,...
20791,uk,trend,2429000,7.7,2010,6
20792,uk,trend,2422000,7.7,2010,7
20793,uk,trend,2429000,7.7,2010,8
20794,uk,trend,2447000,7.8,2010,9
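
The split can also be done with arithmetic instead of strings. A sketch on a small made-up sample in the same `year.month` float format (the real CSV is not loaded here):

```python
import pandas as pd

# Hypothetical sample in the same year.month float format as the CSV
df = pd.DataFrame({'month': [1993.01, 1993.10, 2010.09, 2010.12]})

year = df['month'].astype(int)                     # truncates to the year
month = ((df['month'] - year) * 100).round()       # fractional part -> 1..12
df = df.assign(year=year, month=month.astype(int))
```

The `round()` matters: floating-point error makes `(1993.10 - 1993) * 100` slightly less than 10, and a plain integer cast would give 9.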


# 4.2 Group Statistics

Given the unemployment data in `data/country_sex_age.csv`, give the average unemployment rate for:

- Each gender
- Each Age Group
- Both Together

**HINT:** The `seasonality` column makes it such that the data is repeated for each method of calculating unemployment (`nsa`, `trend`, etc.). Can you ignore this and group over it? Or should you take the average for each?

In [30]:
import pandas as pd
import numpy as np

df = pd.read_csv('data/country_sex_age.csv')

df

Unnamed: 0,country,seasonality,sex,age_group,month,unemployment,unemployment_rate
0,at,nsa,f,y25-74,1993.01,61000,4.5
1,at,nsa,f,y25-74,1993.02,62000,4.5
2,at,nsa,f,y25-74,1993.03,62000,4.5
3,at,nsa,f,y25-74,1993.04,63000,4.6
4,at,nsa,f,y25-74,1993.05,63000,4.6
...,...,...,...,...,...,...,...
83155,uk,trend,m,y_lt25,2010.06,518000,21.1
83156,uk,trend,m,y_lt25,2010.07,513000,20.8
83157,uk,trend,m,y_lt25,2010.08,509000,20.5
83158,uk,trend,m,y_lt25,2010.09,513000,20.7


In [31]:
df2 = df.groupby(['sex', 'seasonality'])
df2.unemployment_rate.mean()

sex  seasonality
f    nsa            13.179216
     sa             12.908743
     trend          12.862163
m    nsa            11.708432
     sa             11.662290
     trend          11.643015
Name: unemployment_rate, dtype: float64

In [32]:
df3 = df.groupby(['age_group', 'seasonality'])
df3.unemployment_rate.mean()

age_group  seasonality
y25-74     nsa             6.870746
           sa              6.918446
           trend           6.926319
y_lt25     nsa            18.016902
           sa             17.690707
           trend          17.617224
Name: unemployment_rate, dtype: float64

In [33]:
df4 = df.groupby(['sex', 'age_group', 'seasonality'])
df4.unemployment_rate.mean()

sex  age_group  seasonality
f    y25-74     nsa             7.539839
                sa              7.579982
                trend           7.579934
     y_lt25     nsa            18.818593
                sa             18.323837
                trend          18.231025
m    y25-74     nsa             6.201653
                sa              6.256909
                trend           6.272703
     y_lt25     nsa            17.215211
                sa             17.067671
                trend          17.013327
Name: unemployment_rate, dtype: float64

# 4.3 Estimating group size

Given that we have the unemployment **rate** as a % of total population, and the number of total unemployed, we can estimate the total population.

Give an estimate of the total population for men and women in each age group.

Does this change depending on the unemployment seasonality calculation method?

In [34]:
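df['unemployed_pop_size'] = df['unemployment'] / df['unemployment_rate'] #rate is in percent, so this equals 1% of the estimated population; multiply by 100 for the full estimate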
df

Unnamed: 0,country,seasonality,sex,age_group,month,unemployment,unemployment_rate,unemployed_pop_size
0,at,nsa,f,y25-74,1993.01,61000,4.5,13555.555556
1,at,nsa,f,y25-74,1993.02,62000,4.5,13777.777778
2,at,nsa,f,y25-74,1993.03,62000,4.5,13777.777778
3,at,nsa,f,y25-74,1993.04,63000,4.6,13695.652174
4,at,nsa,f,y25-74,1993.05,63000,4.6,13695.652174
...,...,...,...,...,...,...,...,...
83155,uk,trend,m,y_lt25,2010.06,518000,21.1,24549.763033
83156,uk,trend,m,y_lt25,2010.07,513000,20.8,24663.461538
83157,uk,trend,m,y_lt25,2010.08,509000,20.5,24829.268293
83158,uk,trend,m,y_lt25,2010.09,513000,20.7,24782.608696


In [35]:
mean_factors = df.groupby(['sex', 'age_group'])
mean_factors.unemployed_pop_size.mean()

sex  age_group
f    y25-74       32206.017729
     y_lt25        5665.825758
m    y25-74       43576.680612
     y_lt25        6679.826296
Name: unemployed_pop_size, dtype: float64

In [36]:
mean_factors2 = df.groupby(['sex', 'age_group', 'seasonality'])
mean_factors2.unemployed_pop_size.mean()

sex  age_group  seasonality
f    y25-74     nsa            30067.785488
                sa             33606.490836
                trend          32895.200243
     y_lt25     nsa             5270.111609
                sa              5876.347541
                trend           5848.969402
m    y25-74     nsa            41101.938284
                sa             45332.013013
                trend          44238.241829
     y_lt25     nsa             6354.751334
                sa              6867.673103
                trend           6809.972722
Name: unemployed_pop_size, dtype: float64
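
Because the rate is a percentage, the population estimate is `unemployment / (rate / 100)`. A sketch on made-up rows shaped like `country_sex_age.csv` (the numbers are illustrative only, and `population_est` is a hypothetical column name):

```python
import pandas as pd

# Hypothetical rows with the same columns as country_sex_age.csv
df = pd.DataFrame({'sex': ['f', 'f', 'm', 'm'],
                   'age_group': ['y25-74', 'y_lt25', 'y25-74', 'y_lt25'],
                   'unemployment': [61000, 30000, 50000, 40000],
                   'unemployment_rate': [4.5, 10.0, 5.0, 20.0]})

# unemployed = population * rate / 100, so population = unemployed / (rate / 100)
df['population_est'] = df['unemployment'] / (df['unemployment_rate'] / 100)

by_group = df.groupby(['sex', 'age_group'])['population_est'].mean()
```

On the real data, rows with a rate of zero would divide by zero and should be filtered out first.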

# 5.1 Tennis

In `data/tennis.csv` you have games that Roger Federer played against various opponents. Questions:

1. How many games did Federer win?

2. What is Federer's win/loss ratio?

3. Who were Federer's top 5 opponents?

In [37]:
#Question 5.1.1
import pandas as pd
import numpy as np

df = pd.read_csv('data/tennis.csv')
df

Unnamed: 0,year,tournament,start date,type,surface,draw,atp points,atp ranking,tournament prize money,round,...,player2 break points converted won,player2 break points converted total,player2 return games played,player2 total service points won,player2 total service points total,player2 total return points won,player2 total return points total,player2 total points won,player2 total points total,win
0,1998,"Basel, Switzerland",1998-10-05,WS,Indoor: Hard,Draw: 32,1,396.0,"$9,800",R32,...,4.0,8.0,8.0,36.0,50.0,26.0,53.0,62.0,103.0,False
1,1998,"Toulouse, France",1998-09-28,WS,Indoor: Hard,Draw: 32,59,878.0,"$10,800",R32,...,0.0,1.0,8.0,33.0,65.0,8.0,41.0,41.0,106.0,True
2,1998,"Toulouse, France",1998-09-28,WS,Indoor: Hard,Draw: 32,59,878.0,"$10,800",R16,...,0.0,4.0,10.0,46.0,75.0,23.0,73.0,69.0,148.0,True
3,1998,"Toulouse, France",1998-09-28,WS,Indoor: Hard,Draw: 32,59,878.0,"$10,800",Q,...,3.0,10.0,10.0,44.0,63.0,26.0,67.0,70.0,130.0,False
4,1998,"Geneva, Switzerland",1998-08-24,CH,Outdoor: Clay,Draw: 32,1,680.0,$520,R32,...,,,,,,,,,,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1174,2012,"Australian Open, Australia",2012-01-16,GS,Outdoor: Hard,Draw: 128,720,3.0,"A$437,000",S,...,6.0,16.0,21.0,95.0,141.0,51.0,135.0,146.0,276.0,False
1175,2012,"Doha, Qatar",2012-01-02,250,Outdoor: Hard,Draw: 32,90,3.0,"$50,030",R32,...,0.0,0.0,8.0,22.0,45.0,9.0,41.0,31.0,86.0,True
1176,2012,"Doha, Qatar",2012-01-02,250,Outdoor: Hard,Draw: 32,90,3.0,"$50,030",R16,...,0.0,2.0,9.0,28.0,50.0,11.0,49.0,39.0,99.0,True
1177,2012,"Doha, Qatar",2012-01-02,250,Outdoor: Hard,Draw: 32,90,3.0,"$50,030",Q,...,4.0,9.0,16.0,47.0,78.0,34.0,95.0,81.0,173.0,True


In [38]:
df.win.value_counts() #972 wins

True     972
False    207
Name: win, dtype: int64

In [39]:
#Question 5.1.2 - win percentage (the win/loss ratio itself is 972 / 207, about 4.7)
df['win'].value_counts(normalize=True) * 100

True     82.442748
False    17.557252
Name: win, dtype: float64

In [40]:
#Question 5.1.3 - take 5 top point values to get top 5 players
ax = df.sort_values(by = 'player2 total points total', ascending = False).head(5)
ax['opponent']

867         Andy Roddick (USA)
779         Rafael Nadal (ESP)
545          Marat Safin (RUS)
95     Michel Kratochvil (SUI)
468     David Nalbandian (ARG)
Name: opponent, dtype: object
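
"Top opponents" is open to interpretation; sorting by points picks out the five longest single matches. Two other readings are sketched below on a made-up stand-in with only the `opponent` and `win` columns (the names A, B, C are placeholders, not real data):

```python
import pandas as pd

# Hypothetical stand-in for tennis.csv with only the columns used here
df = pd.DataFrame({'opponent': ['A', 'B', 'A', 'C', 'B', 'A'],
                   'win': [True, False, True, True, False, False]})

# Reading 1: opponents Federer met most often
most_played = df['opponent'].value_counts().head(5)

# Reading 2: opponents who beat Federer most often
most_losses_to = df.loc[~df['win'], 'opponent'].value_counts().head(5)
```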

# 5.2 Over time

1. What was Federer's best year? In terms of money, and then in terms of number of wins

2. Did Federer get better or worse over time?

In [41]:
#5.2.1 - help from Pamela
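df['tournament prize money'] = df['tournament prize money'].str.replace(r'[a-zA-Z]', '', regex = True) #strip currency letters such as the A in A$
df['tournament prize money'] = df['tournament prize money'].str.replace('$', '', regex = False) #literal dollar sign, not the regex end-of-string anchor
df['tournament prize money'] = df['tournament prize money'].str.replace(',', '', regex = False)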
df = df[df['tournament prize money'] != '']
df = df[~df['tournament prize money'].isnull()]
df

Unnamed: 0,year,tournament,start date,type,surface,draw,atp points,atp ranking,tournament prize money,round,...,player2 break points converted won,player2 break points converted total,player2 return games played,player2 total service points won,player2 total service points total,player2 total return points won,player2 total return points total,player2 total points won,player2 total points total,win
0,1998,"Basel, Switzerland",1998-10-05,WS,Indoor: Hard,Draw: 32,1,396.0,9800,R32,...,4.0,8.0,8.0,36.0,50.0,26.0,53.0,62.0,103.0,False
1,1998,"Toulouse, France",1998-09-28,WS,Indoor: Hard,Draw: 32,59,878.0,10800,R32,...,0.0,1.0,8.0,33.0,65.0,8.0,41.0,41.0,106.0,True
2,1998,"Toulouse, France",1998-09-28,WS,Indoor: Hard,Draw: 32,59,878.0,10800,R16,...,0.0,4.0,10.0,46.0,75.0,23.0,73.0,69.0,148.0,True
3,1998,"Toulouse, France",1998-09-28,WS,Indoor: Hard,Draw: 32,59,878.0,10800,Q,...,3.0,10.0,10.0,44.0,63.0,26.0,67.0,70.0,130.0,False
4,1998,"Geneva, Switzerland",1998-08-24,CH,Outdoor: Clay,Draw: 32,1,680.0,520,R32,...,,,,,,,,,,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1174,2012,"Australian Open, Australia",2012-01-16,GS,Outdoor: Hard,Draw: 128,720,3.0,437000,S,...,6.0,16.0,21.0,95.0,141.0,51.0,135.0,146.0,276.0,False
1175,2012,"Doha, Qatar",2012-01-02,250,Outdoor: Hard,Draw: 32,90,3.0,50030,R32,...,0.0,0.0,8.0,22.0,45.0,9.0,41.0,31.0,86.0,True
1176,2012,"Doha, Qatar",2012-01-02,250,Outdoor: Hard,Draw: 32,90,3.0,50030,R16,...,0.0,2.0,9.0,28.0,50.0,11.0,49.0,39.0,99.0,True
1177,2012,"Doha, Qatar",2012-01-02,250,Outdoor: Hard,Draw: 32,90,3.0,50030,Q,...,4.0,9.0,16.0,47.0,78.0,34.0,95.0,81.0,173.0,True


In [42]:
df['tournament prize money'] = df['tournament prize money'].astype(int)
federer_wins = df[df['winner'].str.contains('Federer')]
best_year_money = federer_wins.groupby(['year', 'win']).agg(prizes = ('tournament prize money', 'sum'))
best_year_money #based on money, best year was 2007

             prizes
year win           
1998 True     21600
1999 True    365975
2000 True   1023498
2001 True   2543712
2002 True   7058200
2003 True  20117218
2004 True  39034705
2005 True  36910840
2006 True  51748945
2007 True  52680795
2008 True  22940000
2009 True  33892300
2010 True  36160885
2011 True  25066780
2012 True  34535490

In [43]:
best_year_wins = federer_wins.groupby(['year', 'win']).agg(wins = ('win', 'sum'))
best_year_wins #2006 has highest number of wins

           wins
year win       
1998 True   2.0
1999 True  28.0
2000 True  30.0
2001 True  49.0
2002 True  59.0
2003 True  77.0
2004 True  72.0
2005 True  82.0
2006 True  94.0
2007 True  76.0
2008 True  72.0
2009 True  67.0
2010 True  76.0
2011 True  69.0
2012 True  74.0

In [44]:
#Question 5.2.2
#He got better with time. He wasn't making as much money after 2007 because he wasn't playing as many games.

# 5.3 Total money won

In the data, you'll find the `tournament round`, one value of which, `F` indicates the final.

Assuming Federer wins the money in the `tournament prize money` if he wins a final in a tournament, how much money has Federer made in tournaments in this dataset?

In [45]:
finals = df[(df['tournament round'] == 'F') & (df['win'] == True)]
finals

Unnamed: 0,year,tournament,start date,type,surface,draw,atp points,atp ranking,tournament prize money,round,...,player2 break points converted won,player2 break points converted total,player2 return games played,player2 total service points won,player2 total service points total,player2 total return points won,player2 total return points total,player2 total points won,player2 total points total,win
10,1999,"Brest, France",1999-10-25,CH,Indoor: Hard,Draw: 32,78,66.0,14400,W,...,,,,,,,,,,True
190,2001,"Milan, Italy",2001-01-29,WS,Indoor: Carpet,Draw: 32,175,27.0,54000,W,...,5.0,7.0,16.0,64.0,113.0,37.0,103.0,101.0,216.0,True
217,2002,"Vienna, Austria",2002-10-07,CS,Indoor: Hard,Draw: 32,250,13.0,119750,W,...,3.0,10.0,18.0,68.0,116.0,41.0,116.0,109.0,232.0,True
247,2002,"ATP Masters Series Hamburg, Germany",2002-05-13,SU,Outdoor: Clay,Draw: 64,500,14.0,372000,W,...,3.0,7.0,13.0,55.0,117.0,32.0,81.0,87.0,198.0,True
281,2002,"Sydney, Australia",2002-01-07,WS,Outdoor: Hard,Draw: 32,175,13.0,48850,W,...,0.0,0.0,9.0,29.0,47.0,10.0,48.0,39.0,95.0,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1125,2012,"Wimbledon, Great Britain",2012-06-25,GS,Outdoor: Grass,Draw: 128,2000,3.0,1150000,W,...,2.0,7.0,21.0,94.0,157.0,43.0,131.0,137.0,288.0,True
1147,2012,"ATP World Tour Masters 1000 Madrid, Spain",2012-05-06,1000,Outdoor: Clay,Draw: 56,1000,3.0,585800,W,...,3.0,9.0,16.0,63.0,100.0,38.0,104.0,101.0,204.0,True
1157,2012,"ATP World Tour Masters 1000 Indian Wells, CA, ...",2012-03-08,1000,Outdoor: Hard,Draw: 96,1000,3.0,1000000,W,...,0.0,3.0,10.0,46.0,71.0,9.0,56.0,55.0,127.0,True
1162,2012,"Dubai, U.A.E.",2012-02-27,500,Outdoor: Hard,Draw: 32,500,3.0,409170,W,...,1.0,3.0,11.0,40.0,66.0,19.0,64.0,59.0,130.0,True


In [46]:
tournament_money = finals['tournament prize money']
tournament_money.sum(axis = 0) #total prize money from tournament wins

44934964