# Notes:

You are banned from using loops (`for` or `while` or any other) for this entire workshop!

You should almost never need loops with pandas anyway, so break the habit now.

## 1. DataFrame basics


Consider the following Python dictionary `data` and Python list `labels`:

``` python
data = {'animal': ['cat', 'cat', 'snake', 'dog', 'dog', 'cat', 'snake', 'cat', 'dog', 'dog'],
        'age': [2.5, 3, 0.5, np.nan, 5, 2, 4.5, np.nan, 7, 3],
        'visits': [1, 3, 2, 3, 2, 3, 1, 1, 2, 1],
        'priority': ['yes', 'yes', 'no', 'yes', 'no', 'no', 'no', 'yes', 'no', 'no']}

labels = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']
```
(This is just some meaningless data I made up with the theme of animals and trips to a vet.)

**1.** Create a DataFrame `df` from this dictionary `data` which has the index `labels`.

**2.** Select only the rows where visits are 3 or more. Which types of animals are these?

**3.** Select the rows where visits are 3 and the animal is a cat.

**4.** Calculate the sum of all visits in `df` (i.e. the total number of visits).

**5.** Calculate the mean age for each different animal in `df`.

**6.** Append a new row 'k' to `df` with your choice of values for each column. Then delete that row to return the original DataFrame.



In [1]:
#1: create data frame with the index labels

import pandas as pd
import numpy as np

data = {'animal': ['cat', 'cat', 'snake', 'dog', 'dog', 'cat', 'snake', 'cat', 'dog', 'dog'],
        'age': [2.5, 3, 0.5, np.nan, 5, 2, 4.5, np.nan, 7, 3],
        'visits': [1, 3, 2, 3, 2, 3, 1, 1, 2, 1],
        'priority': ['yes', 'yes', 'no', 'yes', 'no', 'no', 'no', 'yes', 'no', 'no']}

labels = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']

df_vet = pd.DataFrame.from_dict(data, orient='index',
                       columns = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j'])
df_vet.T
#sneaky way of getting there quick : >

Unnamed: 0,animal,age,visits,priority
a,cat,2.5,1,yes
b,cat,3.0,3,yes
c,snake,0.5,2,no
d,dog,,3,yes
e,dog,5.0,2,no
f,cat,2.0,3,no
g,snake,4.5,1,no
h,cat,,1,yes
i,dog,7.0,2,no
j,dog,3.0,1,no


In [2]:
df = df_vet.T
df
#permanently convert DF to this format

Unnamed: 0,animal,age,visits,priority
a,cat,2.5,1,yes
b,cat,3.0,3,yes
c,snake,0.5,2,no
d,dog,,3,yes
e,dog,5.0,2,no
f,cat,2.0,3,no
g,snake,4.5,1,no
h,cat,,1,yes
i,dog,7.0,2,no
j,dog,3.0,1,no


In [3]:
df.describe()
#getting a macro view of the data
#and getting ready to prep the data so
#i can make calculations

Unnamed: 0,animal,age,visits,priority
count,10,8.0,10,10
unique,3,7.0,3,2
top,cat,3.0,1,no
freq,4,2.0,4,6


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 10 entries, a to j
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   animal    10 non-null     object
 1   age       8 non-null      object
 2   visits    10 non-null     object
 3   priority  10 non-null     object
dtypes: object(4)
memory usage: 400.0+ bytes
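
The `info()` output above reports every column as `object`, which is a side effect of building the frame row-wise and transposing it. As a minimal sketch of an alternative, assuming nothing beyond the `data` and `labels` from the exercise, the constructor can take the index directly so each column keeps its natural dtype:

```python
import numpy as np
import pandas as pd

data = {'animal': ['cat', 'cat', 'snake', 'dog', 'dog', 'cat', 'snake', 'cat', 'dog', 'dog'],
        'age': [2.5, 3, 0.5, np.nan, 5, 2, 4.5, np.nan, 7, 3],
        'visits': [1, 3, 2, 3, 2, 3, 1, 1, 2, 1],
        'priority': ['yes', 'yes', 'no', 'yes', 'no', 'no', 'no', 'yes', 'no', 'no']}
labels = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']

# Passing index= keeps the dict keys as columns and uses labels as row labels
df = pd.DataFrame(data, index=labels)

# age comes out as float64 and visits as int64, so no astype() step is needed
print(df.dtypes)
```

With numeric dtypes in place, `describe()` would also report numeric summaries (mean, std, quartiles) rather than the count/unique/top/freq view shown earlier.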


In [5]:
df["age"] = df.age.astype(float)
df.info()
#changed data type from object to float
#as the 'mean' func requires numerical data

<class 'pandas.core.frame.DataFrame'>
Index: 10 entries, a to j
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   animal    10 non-null     object 
 1   age       8 non-null      float64
 2   visits    10 non-null     object 
 3   priority  10 non-null     object 
dtypes: float64(1), object(3)
memory usage: 400.0+ bytes


In [6]:
#2. Select only the rows where visits are 3 or more.
df[(df.visits >= 3)]

Unnamed: 0,animal,age,visits,priority
b,cat,3.0,3,yes
d,dog,,3,yes
f,cat,2.0,3,no


In [7]:
#3. Select the rows where visits are 3 and the animal is a cat

df[(df['visits'] == 3) & (df['animal'] == 'cat')]

Unnamed: 0,animal,age,visits,priority
b,cat,3.0,3,yes
f,cat,2.0,3,no


In [8]:
#4. Calculate the sum of all visits in df (i.e. the total number of visits).
total_visits = df['visits'].sum()
print(total_visits)

19


In [9]:
import numpy as np
#5. Calculate the mean age for each different animal in df.
mean_pet_age = df.groupby('animal')['age'].mean()
print(mean_pet_age)

animal
cat      2.5
dog      5.0
snake    2.5
Name: age, dtype: float64


In [10]:
#6. Append a new row 'k' to df with your choice of values for each column.
#Then delete that row to return the original DataFrame.

In [11]:
#create a secondary data frame with a new row to append
df2=pd.DataFrame({'animal': ['shark'],
                 'age' : [7],
                 'visits' : [4],
                 'priority' : ['yes']})
#add row using pd.concat (DataFrame.append was removed in pandas 2.0)
df3 = pd.concat([df, df2], ignore_index=True)
df3

Unnamed: 0,animal,age,visits,priority
0,cat,2.5,1,yes
1,cat,3.0,3,yes
2,snake,0.5,2,no
3,dog,,3,yes
4,dog,5.0,2,no
5,cat,2.0,3,no
6,snake,4.5,1,no
7,cat,,1,yes
8,dog,7.0,2,no
9,dog,3.0,1,no


In [12]:
df4 = df3.reset_index()
df4

Unnamed: 0,index,animal,age,visits,priority
0,0,cat,2.5,1,yes
1,1,cat,3.0,3,yes
2,2,snake,0.5,2,no
3,3,dog,,3,yes
4,4,dog,5.0,2,no
5,5,cat,2.0,3,no
6,6,snake,4.5,1,no
7,7,cat,,1,yes
8,8,dog,7.0,2,no
9,9,dog,3.0,1,no


In [13]:
df5 = df4.drop(10)
df5 

Unnamed: 0,index,animal,age,visits,priority
0,0,cat,2.5,1,yes
1,1,cat,3.0,3,yes
2,2,snake,0.5,2,no
3,3,dog,,3,yes
4,4,dog,5.0,2,no
5,5,cat,2.0,3,no
6,6,snake,4.5,1,no
7,7,cat,,1,yes
8,8,dog,7.0,2,no
9,9,dog,3.0,1,no
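
The exercise asks for a row labelled `'k'`, while the cells above work with a positional index. A minimal sketch of the label-based route, using a small stand-in frame (the three rows and the shark values are illustrative assumptions, not workshop data):

```python
import numpy as np
import pandas as pd

# Small stand-in for the vet DataFrame, indexed by letters
df = pd.DataFrame({'animal': ['cat', 'snake', 'dog'],
                   'age': [2.5, 0.5, np.nan],
                   'visits': [1, 2, 3],
                   'priority': ['yes', 'no', 'yes']},
                  index=['a', 'b', 'c'])

# Assigning to a label that does not exist yet enlarges the frame by one row
df_k = df.copy()
df_k.loc['k'] = ['shark', 7.0, 4, 'yes']

# drop() returns a new frame without the labelled row
restored = df_k.drop('k')
print(restored)
```

Enlarging through `.loc` can change column dtypes in some pandas versions, so it is worth checking `restored.dtypes` if exact equality with the original matters.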


# 2.1 Shifty problem

You have a DataFrame `df` with a column 'A' of integers. For example:
```python
df = pd.DataFrame({'A': [1, 2, 2, 3, 4, 5, 5, 5, 6, 7, 7]})
```

How do you filter out rows which contain the same integer as the row immediately above?

You should be left with a column containing the following values:

```python
1, 2, 3, 4, 5, 6, 7
```

### Hint: use the `shift()` method

In [14]:
#generating dataframe
df = pd.DataFrame({'A': [1, 2, 2, 3, 4, 5, 5, 5, 6, 7, 7]})
df

Unnamed: 0,A
0,1
1,2
2,2
3,3
4,4
5,5
6,5
7,5
8,6
9,7


In [15]:
df.loc[df['A'].shift(1) != df['A']]

Unnamed: 0,A
0,1
1,2
3,3
4,4
5,5
8,6
9,7


In [16]:
df.shift(-1)!= df

Unnamed: 0,A
0,True
1,False
2,True
3,True
4,True
5,False
6,False
7,True
8,True
9,False


In [17]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11 entries, 0 to 10
Data columns (total 1 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   A       11 non-null     int64
dtypes: int64(1)
memory usage: 216.0 bytes


# 2.2 columns sum min

Suppose you have DataFrame with 10 columns of real numbers, for example:

```python
df = pd.DataFrame(np.random.random(size=(5, 10)), columns=list('abcdefghij'))
```
Which column of numbers has the smallest sum? Return that column's label.

In [19]:
#define dataframe
df6 = pd.DataFrame(np.random.random(size=(5, 10)), columns=list('abcdefghij'))
df6

Unnamed: 0,a,b,c,d,e,f,g,h,i,j
0,0.872548,0.705899,0.544556,0.3741,0.95303,0.763444,0.960986,0.18621,0.243582,0.908661
1,0.931654,0.598498,0.882757,0.232874,0.829851,0.671607,0.643466,0.955026,0.985251,0.697334
2,0.496249,0.497202,0.746878,0.909762,0.45131,0.507803,0.392373,0.449203,0.390897,0.486192
3,0.722314,0.075446,0.364824,0.923084,0.042825,0.639016,0.77271,0.854035,0.826058,0.466468
4,0.981957,0.684124,0.918208,0.851233,0.784275,0.782199,0.859417,0.844888,0.98597,0.969101


In [20]:
df6_sum = df6.sum(axis=0)
df6_sum

a    4.004722
b    2.561169
c    3.457223
d    3.291053
e    3.061292
f    3.364069
g    3.628952
h    3.289362
i    3.431757
j    3.527756
dtype: float64
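
The sums above are only half of the answer, since the question asks for the column's label. One way to get it is `idxmin()` on the Series of sums; in the run shown above that would point at `b`. A minimal sketch on a small fixed frame (the numbers are made up so the result is reproducible):

```python
import pandas as pd

# Fixed toy values instead of np.random so the outcome is deterministic
df = pd.DataFrame({'a': [0.9, 0.8],
                   'b': [0.1, 0.2],
                   'c': [0.5, 0.4]})

col_sums = df.sum(axis=0)      # one sum per column
smallest = col_sums.idxmin()   # label of the column with the smallest sum
print(smallest)
```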

# 2.3 Duplicates

How do you count how many unique rows a DataFrame has (i.e. ignore all rows that are duplicates)?

**hint:** There's a method that finds duplicate rows for you

In [21]:
#generate dataframe with duplicate rows (the 2 first here)

data2 = {'animal': ['cat', 'cat', 'snake', 'dog', 'dog', 'cat', 'snake', 'cat', 'dog', 'dog'],
        'age': [1, 1, 0.5, np.nan, 5, 2, 4.5, np.nan, 7, 3],
        'visits': [2, 2, 2, 3, 2, 3, 1, 1, 2, 1],
        'priority': ['yes', 'yes', 'no', 'yes', 'no', 'no', 'no', 'yes', 'no', 'no']}

labels = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']

df_dupe = pd.DataFrame.from_dict(data2, orient='index',
                       columns = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j'])
df_dupe.T
df7 = df_dupe.T
df7


Unnamed: 0,animal,age,visits,priority
a,cat,1.0,2,yes
b,cat,1.0,2,yes
c,snake,0.5,2,no
d,dog,,3,yes
e,dog,5.0,2,no
f,cat,2.0,3,no
g,snake,4.5,1,no
h,cat,,1,yes
i,dog,7.0,2,no
j,dog,3.0,1,no


In [168]:
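#grouping on every column and counting shows that the 1st and 2nd rows
#of the dataframe i generated are identical (Count = 2);
#note that groupby drops rows with NaN in a key by default, so rows d and h are missing below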

df7.groupby(['animal','age', 'visits', 'priority']).size().reset_index(name='Count')

Unnamed: 0,animal,age,visits,priority,Count
0,cat,1.0,2,yes,2
1,cat,2.0,3,no,1
2,dog,3.0,1,no,1
3,dog,5.0,2,no,1
4,dog,7.0,2,no,1
5,snake,0.5,2,no,1
6,snake,4.5,1,no,1
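
The hint points at `duplicated()`, which also keeps rows containing `NaN` in the count. A minimal sketch using the same `data2`, built directly with the constructor (that construction is a choice made here, not the approach in the cell above). Because "unique" can be read two ways, both counts are shown:

```python
import numpy as np
import pandas as pd

data2 = {'animal': ['cat', 'cat', 'snake', 'dog', 'dog', 'cat', 'snake', 'cat', 'dog', 'dog'],
         'age': [1, 1, 0.5, np.nan, 5, 2, 4.5, np.nan, 7, 3],
         'visits': [2, 2, 2, 3, 2, 3, 1, 1, 2, 1],
         'priority': ['yes', 'yes', 'no', 'yes', 'no', 'no', 'no', 'yes', 'no', 'no']}
labels = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']
df7 = pd.DataFrame(data2, index=labels)

# duplicated() flags every repeat of an earlier row; negating it counts distinct rows
n_distinct = (~df7.duplicated()).sum()   # same as len(df7.drop_duplicates())

# keep=False flags every member of a repeated set, so this counts rows that occur exactly once
n_unrepeated = (~df7.duplicated(keep=False)).sum()

print(n_distinct, n_unrepeated)
```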


# 2.4 Group Values

A DataFrame has a column of groups 'grps' and a column of integer values 'vals':

```python
df = pd.DataFrame({'grps': list('aaabbcaabcccbbc'), 
                   'vals': [12,345,3,1,45,14,4,52,54,23,235,21,57,3,87]})
```
For each *group*, find the sum of the three greatest values.  You should end up with the answer as follows:
```
grps
a    409
b    156
c    345
```

# 3. Cleaning Data

### Making a DataFrame easier to work with

It happens all the time: someone gives you data containing malformed strings, Python lists and missing data. How do you tidy it up so you can get on with the analysis?

Take this monstrosity as the DataFrame to use in the following puzzles:

```python
df = pd.DataFrame({'From_To': ['LoNDon_paris', 'MAdrid_miLAN', 'londON_StockhOlm', 
                               'Budapest_PaRis', 'Brussels_londOn'],
              'FlightNumber': [10045, np.nan, 10065, np.nan, 10085],
              'RecentDelays': [[23, 47], [], [24, 43, 87], [13], [67, 32]],
                   'Airline': ['KLM(!)', '<Air France> (12)', '(British Airways. )', 
                               '12. Air France', '"Swiss Air"']})
```

Formatted, it looks like this:

```
            From_To  FlightNumber  RecentDelays              Airline
0      LoNDon_paris       10045.0      [23, 47]               KLM(!)
1      MAdrid_miLAN           NaN            []    <Air France> (12)
2  londON_StockhOlm       10065.0  [24, 43, 87]  (British Airways. )
3    Budapest_PaRis           NaN          [13]       12. Air France
4   Brussels_londOn       10085.0      [67, 32]          "Swiss Air"
```

**1.** Some values in the **FlightNumber** column are missing (they are `NaN`). These numbers are meant to increase by 10 with each row, so 10055 and 10075 need to be put in place. Modify `df` to fill in these missing numbers and make the column an integer column (instead of a float column).

In [104]:
#generate dataframe
import numpy as np


df = pd.DataFrame({'From_To': ['LoNDon_paris', 'MAdrid_miLAN', 'londON_StockhOlm', 
                               'Budapest_PaRis', 'Brussels_londOn'],
              'FlightNumber': [10045, np.nan, 10065, np.nan, 10085],
              'RecentDelays': [[23, 47], [], [24, 43, 87], [13], [67, 32]],
                   'Airline': ['KLM(!)', '<Air France> (12)', '(British Airways. )', 
                               '12. Air France', '"Swiss Air"']})
df

Unnamed: 0,From_To,FlightNumber,RecentDelays,Airline
0,LoNDon_paris,10045.0,"[23, 47]",KLM(!)
1,MAdrid_miLAN,,[],<Air France> (12)
2,londON_StockhOlm,10065.0,"[24, 43, 87]",(British Airways. )
3,Budapest_PaRis,,[13],12. Air France
4,Brussels_londOn,10085.0,"[67, 32]","""Swiss Air"""


In [105]:
df['FlightNumber'] = df['FlightNumber'].interpolate().astype(int)
df

Unnamed: 0,From_To,FlightNumber,RecentDelays,Airline
0,LoNDon_paris,10045,"[23, 47]",KLM(!)
1,MAdrid_miLAN,10055,[],<Air France> (12)
2,londON_StockhOlm,10065,"[24, 43, 87]",(British Airways. )
3,Budapest_PaRis,10075,[13],12. Air France
4,Brussels_londOn,10085,"[67, 32]","""Swiss Air"""


# 3.2 column splitting

The **From\_To** column would be better as two separate columns! Split each string on the underscore delimiter `_` and add the results to your dataframe as two new columns, `From` and `To`.

In [106]:
df[['From', 'To']] = df['From_To'].str.split('_', expand=True)
#the assignment above already adds both columns to df in place;
#pd.concat would stack the rows a second time and fill the other columns with NaN
df.head()

Unnamed: 0,From_To,FlightNumber,RecentDelays,Airline,From,To
0,LoNDon_paris,10045,"[23, 47]",KLM(!),LoNDon,paris
1,MAdrid_miLAN,10055,[],<Air France> (12),MAdrid,miLAN
2,londON_StockhOlm,10065,"[24, 43, 87]",(British Airways. ),londON,StockhOlm
3,Budapest_PaRis,10075,[13],12. Air France,Budapest,PaRis
4,Brussels_londOn,10085,"[67, 32]","""Swiss Air""",Brussels,londOn


# 3.3 Clean Text

Make the text in your dataframe:

- From and To columns should be lowercase with only first letter capitalized

- In the **Airline** column, you can see some extra punctuation and symbols have appeared around the airline names. Pull out just the airline name. E.g. `'(British Airways. )'` should become `'British Airways'`.

In [108]:
#clean From and To columns case
df['From'] = df['From'].str.capitalize()
df['To'] = df['To'].str.capitalize()
#From_To is kept here; to remove it, assign the result: df = df.drop(columns='From_To')

#Only keep the first run of letters and spaces in the Airline column
df['Airline'] = (
    df['Airline'].str.extract(r"([a-zA-Z\s]+)", expand=False)
    .str.strip()
)
df

Unnamed: 0,From_To,FlightNumber,RecentDelays,Airline,From,To
0,LoNDon_paris,10045,"[23, 47]",KLM,London,Paris
1,MAdrid_miLAN,10055,[],Air France,Madrid,Milan
2,londON_StockhOlm,10065,"[24, 43, 87]",British Airways,London,Stockholm
3,Budapest_PaRis,10075,[13],Air France,Budapest,Paris
4,Brussels_londOn,10085,"[67, 32]",Swiss Air,Brussels,London


# Exercise 4.1: Column Splitting

Given the unemployment data in `data/country_total.csv`, split the `month` column into two new columns: a `year` column and a `month` column, both integers

In [125]:
import pandas as pd
df = pd.read_csv("../m2-1-pandas/data/country_total.csv")
df

Unnamed: 0,country,seasonality,month,unemployment,unemployment_rate
0,at,nsa,1993.01,171000,4.5
1,at,nsa,1993.02,175000,4.6
2,at,nsa,1993.03,166000,4.4
3,at,nsa,1993.04,157000,4.1
4,at,nsa,1993.05,147000,3.9
...,...,...,...,...,...
20791,uk,trend,2010.06,2429000,7.7
20792,uk,trend,2010.07,2422000,7.7
20793,uk,trend,2010.08,2429000,7.7
20794,uk,trend,2010.09,2447000,7.8


In [126]:
#truncate the float (e.g. 1993.01) to its integer part to get the year
df["year"] = df.month.astype(int)

#split month from year, change to single digit format when applicable
df['month'] = ((df['month'] - df['year']) * 100).round(0).astype(int)
df

Unnamed: 0,country,seasonality,month,unemployment,unemployment_rate,year
0,at,nsa,1,171000,4.5,1993
1,at,nsa,2,175000,4.6,1993
2,at,nsa,3,166000,4.4,1993
3,at,nsa,4,157000,4.1,1993
4,at,nsa,5,147000,3.9,1993
...,...,...,...,...,...,...
20791,uk,trend,6,2429000,7.7,2010
20792,uk,trend,7,2422000,7.7,2010
20793,uk,trend,8,2429000,7.7,2010
20794,uk,trend,9,2447000,7.8,2010


# 4.2 Group Statistics

Given the unemployment data in `data/country_sex_age.csv`, give the average unemployment rate for:

- Each gender
- Each Age Group
- Both Together

**HINT:** The `seasonality` column makes it such that the data is repeated for each method of calculating unemployment (`nsa`, `trend`, etc.). Can you ignore this and group over it? Or should you take the average for each?

In [131]:
df = pd.read_csv("../m2-1-pandas/data/country_sex_age.csv")
df.head(10)

Unnamed: 0,country,seasonality,sex,age_group,month,unemployment,unemployment_rate
0,at,nsa,f,y25-74,1993.01,61000,4.5
1,at,nsa,f,y25-74,1993.02,62000,4.5
2,at,nsa,f,y25-74,1993.03,62000,4.5
3,at,nsa,f,y25-74,1993.04,63000,4.6
4,at,nsa,f,y25-74,1993.05,63000,4.6
5,at,nsa,f,y25-74,1993.06,59000,4.3
6,at,nsa,f,y25-74,1993.07,57000,4.2
7,at,nsa,f,y25-74,1993.08,58000,4.2
8,at,nsa,f,y25-74,1993.09,58000,4.3
9,at,nsa,f,y25-74,1993.1,62000,4.5


In [132]:
#taking an agile path: seeing if doing a simple groupby for gender will
#generate a different answer than if i isolate seasonality

gender_simple = df.groupby('sex').unemployment_rate.mean()
print("Average unemployment rate by gender without taking seasonality into account:")
print(gender_simple)

Average unemployment rate by gender without taking seasonality into account:
sex
f    12.982629
m    11.671026
Name: unemployment_rate, dtype: float64


In [133]:
#now taking into consideration seasonality
season_gender = df.groupby(['sex', 'seasonality'], as_index=False)['unemployment_rate'].mean()
print(season_gender)

  sex seasonality  unemployment_rate
0   f         nsa          13.179216
1   f          sa          12.908743
2   f       trend          12.862163
3   m         nsa          11.708432
4   m          sa          11.662290
5   m       trend          11.643015


In [134]:
#conclusion: lack of additional filtering creates up to
# .3% difference. Filtering will matter depending on 
#data usage and rounding standards

In [135]:
#Average unemployment rate for Each Age Group
season_age = df.groupby(['age_group', 'seasonality'], as_index=False)['unemployment_rate'].mean()
print(season_age)

  age_group seasonality  unemployment_rate
0    y25-74         nsa           6.870746
1    y25-74          sa           6.918446
2    y25-74       trend           6.926319
3    y_lt25         nsa          18.016902
4    y_lt25          sa          17.690707
5    y_lt25       trend          17.617224


In [136]:
#Average unemployment rate by age group and gender
season_age_gender = df.groupby(['age_group', 'sex', 'seasonality'], as_index=False)['unemployment_rate'].mean()
print(season_age_gender)

   age_group sex seasonality  unemployment_rate
0     y25-74   f         nsa           7.539839
1     y25-74   f          sa           7.579982
2     y25-74   f       trend           7.579934
3     y25-74   m         nsa           6.201653
4     y25-74   m          sa           6.256909
5     y25-74   m       trend           6.272703
6     y_lt25   f         nsa          18.818593
7     y_lt25   f          sa          18.323837
8     y_lt25   f       trend          18.231025
9     y_lt25   m         nsa          17.215211
10    y_lt25   m          sa          17.067671
11    y_lt25   m       trend          17.013327


In [137]:
#Average unemployment rate by age group and gender, pooled across seasonality methods
season_age_gender = df.groupby(['age_group', 'sex'], as_index=False)['unemployment_rate'].mean()
print(season_age_gender)

  age_group sex  unemployment_rate
0    y25-74   f           7.566771
1    y25-74   m           6.244016
2    y_lt25   f          18.457435
3    y_lt25   m          17.098036


# 4.3 Estimating group size

Given that we have the unemployment **rate** as a % of total population, and the number of total unemployed, we can estimate the total population.

Give an estimate of the total population for men and women in each age group.

Does this change depending on the unemployment seasonality calculation method?

In [138]:
#total population
#total population = (['unemployment']*100/['unemployment_rate'])

df['total_population'] = df.unemployment * 100/ df.unemployment_rate
df

Unnamed: 0,country,seasonality,sex,age_group,month,unemployment,unemployment_rate,total_population
0,at,nsa,f,y25-74,1993.01,61000,4.5,1.355556e+06
1,at,nsa,f,y25-74,1993.02,62000,4.5,1.377778e+06
2,at,nsa,f,y25-74,1993.03,62000,4.5,1.377778e+06
3,at,nsa,f,y25-74,1993.04,63000,4.6,1.369565e+06
4,at,nsa,f,y25-74,1993.05,63000,4.6,1.369565e+06
...,...,...,...,...,...,...,...,...
83155,uk,trend,m,y_lt25,2010.06,518000,21.1,2.454976e+06
83156,uk,trend,m,y_lt25,2010.07,513000,20.8,2.466346e+06
83157,uk,trend,m,y_lt25,2010.08,509000,20.5,2.482927e+06
83158,uk,trend,m,y_lt25,2010.09,513000,20.7,2.478261e+06


In [139]:
#Does this change depending on the unemployment seasonality calculation method?
df.groupby('seasonality').total_population.mean()


seasonality
nsa      2.069865e+06
sa       2.298884e+06
trend    2.251533e+06
Name: total_population, dtype: float64
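
The question also asks for the estimate per sex and age group. A minimal sketch of that breakdown, run on made-up rows that only mimic the column layout of `country_sex_age.csv` (the numbers are invented for illustration):

```python
import pandas as pd

# Toy rows with the same column names as the real file; values are hypothetical
toy = pd.DataFrame({
    'seasonality': ['nsa', 'nsa', 'sa', 'sa'],
    'sex': ['f', 'm', 'f', 'm'],
    'age_group': ['y25-74'] * 4,
    'unemployment': [50000, 60000, 52000, 58000],
    'unemployment_rate': [5.0, 6.0, 5.2, 5.8],
})

# Same estimate as above: unemployed count divided by the rate (in percent)
toy['total_population'] = toy['unemployment'] * 100 / toy['unemployment_rate']

# One row per (sex, age_group), one column per seasonality method
by_group = (toy.groupby(['sex', 'age_group', 'seasonality'])['total_population']
               .mean()
               .unstack('seasonality'))
print(by_group)
```

Placing the seasonality methods side by side as columns makes it easy to see whether the estimate shifts between `nsa`, `sa` and `trend` within each group.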

# 5.1 Tennis

In `data/tennis.csv` you have games that Roger Federer played against various opponents. Questions:

1. How many games did Federer win?

2. What is Federer's win/loss ratio?

3. Who were Federer's top 5 opponents?

In [146]:
#loading data
df = pd.read_csv('data/tennis.csv')
df.head()

Unnamed: 0,year,tournament,start date,type,surface,draw,atp points,atp ranking,tournament prize money,round,...,player2 break points converted won,player2 break points converted total,player2 return games played,player2 total service points won,player2 total service points total,player2 total return points won,player2 total return points total,player2 total points won,player2 total points total,win
0,1998,"Basel, Switzerland",1998-10-05,WS,Indoor: Hard,Draw: 32,1,396.0,"$9,800",R32,...,4.0,8.0,8.0,36.0,50.0,26.0,53.0,62.0,103.0,False
1,1998,"Toulouse, France",1998-09-28,WS,Indoor: Hard,Draw: 32,59,878.0,"$10,800",R32,...,0.0,1.0,8.0,33.0,65.0,8.0,41.0,41.0,106.0,True
2,1998,"Toulouse, France",1998-09-28,WS,Indoor: Hard,Draw: 32,59,878.0,"$10,800",R16,...,0.0,4.0,10.0,46.0,75.0,23.0,73.0,69.0,148.0,True
3,1998,"Toulouse, France",1998-09-28,WS,Indoor: Hard,Draw: 32,59,878.0,"$10,800",Q,...,3.0,10.0,10.0,44.0,63.0,26.0,67.0,70.0,130.0,False
4,1998,"Geneva, Switzerland",1998-08-24,CH,Outdoor: Clay,Draw: 32,1,680.0,$520,R32,...,,,,,,,,,,False


In [153]:
#How many games did Federer win?
#checking out where I can get the winner info from : 'winner' column (17)
print(df.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1179 entries, 0 to 1178
Data columns (total 71 columns):
 #   Column                                 Non-Null Count  Dtype  
---  ------                                 --------------  -----  
 0   year                                   1179 non-null   int64  
 1   tournament                             1179 non-null   object 
 2   start date                             1179 non-null   object 
 3   type                                   1179 non-null   object 
 4   surface                                1179 non-null   object 
 5   draw                                   1179 non-null   object 
 6   atp points                             1139 non-null   object 
 7   atp ranking                            1177 non-null   float64
 8   tournament prize money                 1170 non-null   object 
 9   round                                  1179 non-null   object 
 10  opponent                               1179 non-null   object 
 11  rank

In [165]:
#validating that this is a Federer data set
df['player1 name'].value_counts()

Roger Federer    1179
Name: player1 name, dtype: int64

In [179]:
#Answering 2 questions:
#How many games did Federer win? 972
#Top 5 opponents: list below his name here

df.winner.value_counts()[:6]


Roger Federer       972
Rafael Nadal         18
Novak Djokovic       13
Andy Murray          10
Lleyton Hewitt        8
David Nalbandian      8
Name: winner, dtype: int64

In [191]:
#Federer win rate (wins / matches played); the win/loss ratio would be wins / losses instead
df.win.sum() / len(df)

0.8244274809160306
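
The value above is a win rate, whereas question 2 asks for a win/loss ratio. With 972 wins out of 1179 matches there are 207 losses, which gives a ratio of about 4.7. A minimal sketch of the calculation on a hypothetical boolean column standing in for `df.win`:

```python
import pandas as pd

# Hypothetical stand-in for the boolean 'win' column
win = pd.Series([True, True, False, True])

wins = win.sum()        # True counts as 1
losses = (~win).sum()   # negate the mask to count losses
ratio = wins / losses
print(ratio)
```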

# 5.2 Over time

1. What was Federer's best year? In terms of money, and then in terms of number of wins

2. Did Federer get better or worse over time?

In [194]:
#Top years in terms of money and wins : prep

#change complicated $$$ column name
df['wallet'] = df['tournament prize money']

#clean up to digits only
df.wallet = pd.to_numeric(df.wallet.str.replace(r"\D+", "", regex=True))
df.wallet

0         9800.0
1        10800.0
2        10800.0
3        10800.0
4          520.0
          ...   
1174    437000.0
1175     50030.0
1176     50030.0
1177     50030.0
1178     50030.0
Name: wallet, Length: 1179, dtype: float64

In [208]:
##Top years in terms of money and wins: top 5

best = df.groupby('year')[['wallet', 'win']].sum()
best.sort_values('wallet', ascending=False)[:5]

Unnamed: 0_level_0,wallet,win
year,Unnamed: 1_level_1,Unnamed: 2_level_1
2007,55246570.0,78.0
2006,52860895.0,96.0
2004,39142988.0,77.0
2005,38288058.0,83.0
2010,38058905.0,76.0
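
The table above is sorted by prize money only. For the "number of wins" part, and for the question of whether results improved over time, a per-year win count and win rate may help. A minimal sketch on invented match rows (the years and outcomes are hypothetical):

```python
import pandas as pd

# Hypothetical matches; only 'year' and 'win' mirror the real columns
matches = pd.DataFrame({'year': [2003, 2003, 2004, 2004, 2004],
                        'win': [True, False, True, True, False]})

# Named aggregation: wins, matches played and share of matches won per year
yearly = matches.groupby('year')['win'].agg(wins='sum', played='count', win_rate='mean')

best_by_wins = yearly['wins'].idxmax()   # year with the most wins
print(yearly)
print(best_by_wins)
```

Reading `win_rate` in year order gives a rough view of the trend, and it is less sensitive than raw win counts to how many matches were played in a given year.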


# 5.3 Total money won

In the data, you'll find the `tournament round` column; one of its values, `F`, indicates the final.

Assuming Federer wins the money in the `tournament prize money` if he wins a final in a tournament, how much money has Federer made in tournaments in this dataset?

In [211]:
df.loc[
    (df['tournament round'] == 'F')
    & (df.win)
].wallet.sum()

44934964.0