# Notes:

You are banned from using loops (`for` or `while` or any other) for this entire workshop!

You should almost never need loops with pandas anyway, so break the habit now.
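
As a small illustration of the no-loops rule, the sketch below derives a new column with a single vectorized expression instead of iterating over rows. The `toy` frame and the `visits_doubled` column are made-up names used only for this example.

```python
import pandas as pd

# Hypothetical toy data, not part of the workshop dataset.
toy = pd.DataFrame({"visits": [1, 3, 2]})

# One vectorized expression operates on the whole column at once,
# so no explicit for/while loop over the rows is needed.
toy["visits_doubled"] = toy["visits"] * 2

print(toy["visits_doubled"].tolist())
```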

## 1. DataFrame basics


Consider the following Python dictionary `data` and Python list `labels`:

``` python
data = {'animal': ['cat', 'cat', 'snake', 'dog', 'dog', 'cat', 'snake', 'cat', 'dog', 'dog'],
        'age': [2.5, 3, 0.5, np.nan, 5, 2, 4.5, np.nan, 7, 3],
        'visits': [1, 3, 2, 3, 2, 3, 1, 1, 2, 1],
        'priority': ['yes', 'yes', 'no', 'yes', 'no', 'no', 'no', 'yes', 'no', 'no']}

labels = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']
```
(This is just some meaningless data I made up with the theme of animals and trips to a vet.)

**1.** Create a DataFrame `df` from this dictionary `data` which has the index `labels`.

**2.** Select only the rows where visits are 3 or more. Which types of animals are these?

**3.** Select the rows where visits are 3 and the animal is a cat.

**4.** Calculate the sum of all visits in `df` (i.e. the total number of visits).

**5.** Calculate the mean age for each different animal in `df`.

**6.** Append a new row 'k' to `df` with your choice of values for each column. Then delete that row to return the original DataFrame.



In [1]:
#1.1 Create a DataFrame df from this dictionary data which has the index labels.
import pandas as pd
import numpy as np


data = {
        'animal': ['cat', 'cat', 'snake', 'dog', 'dog', 'cat', 'snake', 'cat', 'dog', 'dog'],
        'age': [2.5, 3, 0.5, np.nan, 5, 2, 4.5, np.nan, 7, 3],
        'visits': [1, 3, 2, 3, 2, 3, 1, 1, 2, 1],
        'priority': ['yes', 'yes', 'no', 'yes', 'no', 'no', 'no', 'yes', 'no', 'no']
        }
labels = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']

df = pd.DataFrame(data, columns = ['animal','age','visits','priority'],index=labels)
print(df)

  animal  age  visits priority
a    cat  2.5       1      yes
b    cat  3.0       3      yes
c  snake  0.5       2       no
d    dog  NaN       3      yes
e    dog  5.0       2       no
f    cat  2.0       3       no
g  snake  4.5       1       no
h    cat  NaN       1      yes
i    dog  7.0       2       no
j    dog  3.0       1       no


In [2]:
#1.2 Select only the rows where visits are 3 or more. Which types of animals are these?
visit_3_or_more = df[df.visits >=3]
print(visit_3_or_more)

  animal  age  visits priority
b    cat  3.0       3      yes
d    dog  NaN       3      yes
f    cat  2.0       3       no


In [3]:
#1.3 Select the rows where visits are 3 and the animal is a cat
visit_3_N_cat = df[(df.visits == 3) & (df.animal == 'cat')]
print(visit_3_N_cat)

  animal  age  visits priority
b    cat  3.0       3      yes
f    cat  2.0       3       no
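
Combining conditions is a common stumbling block, so here is a minimal sketch on a made-up frame (`pets` is a hypothetical name). It shows the parenthesised `&` form and, as one possible alternative, the same filter written with `DataFrame.query`.

```python
import pandas as pd

# Hypothetical miniature of the vet data.
pets = pd.DataFrame({"animal": ["cat", "dog", "cat"], "visits": [3, 3, 1]})

# Each comparison needs its own parentheses because & binds more
# tightly than == in Python.
mask = (pets["visits"] == 3) & (pets["animal"] == "cat")
selected = pets[mask]

# DataFrame.query expresses the same filter as a string expression.
same = pets.query("visits == 3 and animal == 'cat'")

print(selected)
```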


In [4]:
#1.4 Calculate the sum of all visits in df (i.e. the total number of visits).
total = df.visits.sum()
print(total)

19


In [5]:
#1.5 Calculate the mean age for each different animal in df.
age_mean = df.groupby('animal').age.mean()
print(age_mean)

animal
cat      2.5
dog      5.0
snake    2.5
Name: age, dtype: float64


In [6]:
#1.6.1 Append a new row 'k' to df with your choice of values for each column.
new_row=['dog',3,4,'no']
df.loc['k'] = new_row
print(df)

  animal  age  visits priority
a    cat  2.5       1      yes
b    cat  3.0       3      yes
c  snake  0.5       2       no
d    dog  NaN       3      yes
e    dog  5.0       2       no
f    cat  2.0       3       no
g  snake  4.5       1       no
h    cat  NaN       1      yes
i    dog  7.0       2       no
j    dog  3.0       1       no
k    dog  3.0       4       no


In [7]:
#1.6.2 Then delete that row to return the original DataFrame.
df = df.drop('k') 
print(df)

  animal  age  visits priority
a    cat  2.5       1      yes
b    cat  3.0       3      yes
c  snake  0.5       2       no
d    dog  NaN       3      yes
e    dog  5.0       2       no
f    cat  2.0       3       no
g  snake  4.5       1       no
h    cat  NaN       1      yes
i    dog  7.0       2       no
j    dog  3.0       1       no


# 2.1 Shifty problem

You have a DataFrame `df` with a column 'A' of integers. For example:
```python
df = pd.DataFrame({'A': [1, 2, 2, 3, 4, 5, 5, 5, 6, 7, 7]})
```

How do you filter out rows which contain the same integer as the row immediately above?

You should be left with a column containing the following values:

```python
1, 2, 3, 4, 5, 6, 7
```

### Hint: use the `shift()` method

In [8]:
import pandas as pd
import numpy as np

df1 = pd.DataFrame({'A': [1, 2, 2, 3, 4, 5, 5, 5, 6, 7, 7]})
# Keep a row only when its value differs from the row immediately above.
df1[df1.A != df1.A.shift(1)]

   A
0  1
1  2
3  3
4  4
5  5
8  6
9  7
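
To see why the `shift()` comparison works, the following sketch breaks it into its intermediate steps on a shorter, made-up series (`s`, `shifted` and `keep` are illustrative names).

```python
import pandas as pd

s = pd.Series([1, 2, 2, 3])

# shift() moves every value down one position; the first slot becomes NaN.
shifted = s.shift()

# A value is kept when it differs from its predecessor. The first
# comparison is against NaN, which never compares equal, so row 0 is kept.
keep = s != shifted

result = s[keep]
print(result.tolist())
```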


# 2.2 columns sum min

Suppose you have a DataFrame with 10 columns of real numbers, for example:

```python
df = pd.DataFrame(np.random.random(size=(5, 10)), columns=list('abcdefghij'))
```
Which column of numbers has the smallest sum? Return that column's label.

In [9]:
import pandas as pd
import numpy as np
df2 = pd.DataFrame(np.random.random(size=(5, 10)), columns=list('abcdefghij'))

print(df2)

col_sums = df2.sum(axis=0)
# Smallest sum together with its label; col_sums.idxmin() returns the label alone.
col_sums.sort_values(ascending=True).head(1)

          a         b         c         d         e         f         g  \
0  0.135272  0.975310  0.822433  0.598420  0.275009  0.520278  0.689836   
1  0.787001  0.237058  0.894021  0.283906  0.664316  0.671623  0.727137   
2  0.623741  0.055215  0.860298  0.297840  0.632029  0.663014  0.707577   
3  0.880443  0.901773  0.354073  0.251164  0.749390  0.375516  0.335041   
4  0.401831  0.464734  0.808728  0.644470  0.617130  0.672591  0.679733   

          h         i         j  
0  0.441835  0.742520  0.653503  
1  0.893765  0.684087  0.891531  
2  0.990720  0.627222  0.952283  
3  0.837684  0.738498  0.243736  
4  0.494535  0.026613  0.303161  


d    2.0758
dtype: float64
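
Since the exercise asks for the label rather than the sum itself, `idxmin()` on the column sums is one compact way to get it. The sketch below uses a small made-up frame (`small`) with known sums so the result is easy to check by hand.

```python
import pandas as pd

# Hypothetical frame whose column sums are a=6, b=3, c=9.
small = pd.DataFrame({"a": [1, 5], "b": [2, 1], "c": [4, 5]})

# sum() adds up each column; idxmin() returns the index label
# (here: the column name) of the smallest value.
label = small.sum().idxmin()
print(label)
```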

# 2.3 Duplicates

How do you count how many unique rows a DataFrame has (i.e. ignore all rows that are duplicates)?

**hint:** There's a method that finds duplicate rows for you.

In [10]:
#modified data from exercise 1
data = {
        'animal': ['cat', 'cat', 'snake', 'dog', 'dog', 'cat', 'snake', 'cat', 'dog', 'dog'],
        'age': [3, 3, 0.5, np.nan, 5, 2, 4.5, 3, 7, 3],
        'visits': [3, 3, 2, 3, 2, 3, 1, 3, 2, 1],
        'priority': ['yes', 'yes', 'no', 'yes', 'no', 'no', 'no', 'yes', 'no', 'no']
        }
labels = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']

df3 = pd.DataFrame(data, columns = ['animal','age','visits','priority'],index=labels)
#finding duplicates
dups = df3.pivot_table(index = ['animal','age','visits','priority'], aggfunc ='size')
#dropping duplicates
df3.drop_duplicates(inplace=True)
print(df3)
print(dups)
print("Before: "+str(len(labels))+". After: "+str(len(df3.index)))

  animal  age  visits priority
a    cat  3.0       3      yes
c  snake  0.5       2       no
d    dog  NaN       3      yes
e    dog  5.0       2       no
f    cat  2.0       3       no
g  snake  4.5       1       no
i    dog  7.0       2       no
j    dog  3.0       1       no
animal  age  visits  priority
cat     2.0  3       no          1
        3.0  3       yes         3
dog     3.0  1       no          1
        5.0  2       no          1
        7.0  2       no          1
snake   0.5  2       no          1
        4.5  1       no          1
dtype: int64
Before: 10. After: 8
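
The question can be read in two ways, and the sketch below shows both on a made-up frame (`rows` is a hypothetical name): counting distinct rows, where each duplicated row is counted once, and counting only rows that never repeat.

```python
import pandas as pd

# Hypothetical data: the row (1, "a") appears twice.
rows = pd.DataFrame({"x": [1, 1, 2, 3], "y": ["a", "a", "b", "c"]})

# Reading 1: number of distinct rows (duplicates collapsed to one).
n_distinct = len(rows.drop_duplicates())

# Reading 2: number of rows that have no duplicate at all.
# keep=False marks every member of a duplicated group as True.
n_never_repeated = int((~rows.duplicated(keep=False)).sum())

print(n_distinct, n_never_repeated)
```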


# 2.4 Group Values

A DataFrame has a column of groups 'grps' and a column of integer values 'vals':

```python
df = pd.DataFrame({'grps': list('aaabbcaabcccbbc'), 
                   'vals': [12,345,3,1,45,14,4,52,54,23,235,21,57,3,87]})
```
For each *group*, find the sum of the three greatest values.  You should end up with the answer as follows:
```
grps
a    409
b    156
c    345
```

In [11]:
import pandas as pd
import numpy as np

df4 = pd.DataFrame({'grps': list('aaabbcaabcccbbc'), 
                   'vals': [12,345,3,1,45,14,4,52,54,23,235,21,57,3,87]})


# Sum the three largest values within each group. groupby already returns
# the groups in sorted order, so no further filtering or sorting is needed.
df5 = df4.groupby('grps')['vals']\
         .apply(lambda x: x.nlargest(3).sum())\
         .reset_index()
df5

  grps  vals
0    a   409
1    b   156
2    c   345
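
As an alternative sketch that avoids the `lambda`, the rows can be sorted once by value, the first three rows of each group taken with `head(3)`, and those summed per group. The variable name `top3` is illustrative.

```python
import pandas as pd

df = pd.DataFrame({'grps': list('aaabbcaabcccbbc'),
                   'vals': [12, 345, 3, 1, 45, 14, 4, 52, 54, 23, 235, 21, 57, 3, 87]})

# Sort descending so the largest values come first, keep the first
# three rows of every group, then sum what is left per group.
top3 = (df.sort_values('vals', ascending=False)
          .groupby('grps')
          .head(3)
          .groupby('grps')['vals']
          .sum())
print(top3)
```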


# 3. Cleaning Data

### Making a DataFrame easier to work with

It happens all the time: someone gives you data containing malformed strings, Python lists and missing data. How do you tidy it up so you can get on with the analysis?

Take this monstrosity as the DataFrame to use in the following puzzles:

```python
df = pd.DataFrame({'From_To': ['LoNDon_paris', 'MAdrid_miLAN', 'londON_StockhOlm', 
                               'Budapest_PaRis', 'Brussels_londOn'],
              'FlightNumber': [10045, np.nan, 10065, np.nan, 10085],
              'RecentDelays': [[23, 47], [], [24, 43, 87], [13], [67, 32]],
                   'Airline': ['KLM(!)', '<Air France> (12)', '(British Airways. )', 
                               '12. Air France', '"Swiss Air"']})
```

Formatted, it looks like this:

```
            From_To  FlightNumber  RecentDelays              Airline
0      LoNDon_paris       10045.0      [23, 47]               KLM(!)
1      MAdrid_miLAN           NaN            []    <Air France> (12)
2  londON_StockhOlm       10065.0  [24, 43, 87]  (British Airways. )
3    Budapest_PaRis           NaN          [13]       12. Air France
4   Brussels_londOn       10085.0      [67, 32]          "Swiss Air"
```

**1.** Some values in the **FlightNumber** column are missing (they are `NaN`). These numbers are meant to increase by 10 with each row, so 10055 and 10075 need to be put in place. Modify `df` to fill in these missing numbers and make the column an integer column (instead of a float column).

In [12]:
#3.1. Cleaning Flight Number
import pandas as pd
import numpy as np
df6 = pd.DataFrame({'From_To': ['LoNDon_paris', 'MAdrid_miLAN', 'londON_StockhOlm', 
                               'Budapest_PaRis', 'Brussels_londOn'],
              'FlightNumber': [10045, np.nan, 10065, np.nan, 10085],
              'RecentDelays': [[23, 47], [], [24, 43, 87], [13], [67, 32]],
                   'Airline': ['KLM(!)', '<Air France> (12)', '(British Airways. )', 
                               '12. Air France', '"Swiss Air"']})

# The flight numbers increase linearly, so linear interpolation fills the gaps.
df6['FlightNumber'] = df6['FlightNumber'].interpolate().astype(int)
df6

            From_To  FlightNumber  RecentDelays              Airline
0      LoNDon_paris         10045      [23, 47]               KLM(!)
1      MAdrid_miLAN         10055            []    <Air France> (12)
2  londON_StockhOlm         10065  [24, 43, 87]  (British Airways. )
3    Budapest_PaRis         10075          [13]       12. Air France
4   Brussels_londOn         10085      [67, 32]          "Swiss Air"


# 3.2 column splitting

The **From\_To** column would be better as two separate columns! Split each string on the underscore delimiter `_` and add the results to your DataFrame as two new columns, `From` and `To`.

In [13]:
#3.2. Cleaning From and To Column
if "From_To" in df6.columns:
    # Split once on the underscore; expand=True returns one column per part.
    parts = df6.From_To.str.split('_', expand=True)
    # Title-case the city names and put them at the front of the frame.
    df6.insert(0, "From", parts[0].str.title())
    df6.insert(1, "To", parts[1].str.title())
    df6.drop(['From_To'], inplace=True, axis=1)
df6

       From         To  FlightNumber  RecentDelays              Airline
0    London      Paris         10045      [23, 47]               KLM(!)
1    Madrid      Milan         10055            []    <Air France> (12)
2    London  Stockholm         10065  [24, 43, 87]  (British Airways. )
3  Budapest      Paris         10075          [13]       12. Air France
4  Brussels     London         10085      [67, 32]          "Swiss Air"


# 3.3 Clean Text

Clean up the text in your DataFrame:

- The `From` and `To` columns should be lowercase with only the first letter capitalized.

- In the **Airline** column, some extra punctuation and symbols have appeared around the airline names. Pull out just the airline name, e.g. `'(British Airways. )'` should become `'British Airways'`.

In [14]:
#3.3. Cleaning Airline Column
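# Raw string avoids the invalid-escape warning for \s; expand=False returns a Series.
# strip() removes the leading space captured from values such as '12. Air France'.
df6['Airline'] = df6.Airline.str.extract(r'([a-zA-Z\s]+)', expand=False).str.strip()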
df6

       From         To  FlightNumber  RecentDelays          Airline
0    London      Paris         10045      [23, 47]              KLM
1    Madrid      Milan         10055            []       Air France
2    London  Stockholm         10065  [24, 43, 87]  British Airways
3  Budapest      Paris         10075          [13]       Air France
4  Brussels     London         10085      [67, 32]        Swiss Air


# Exercise 4.1: Column Splitting

Given the unemployment data in `data/country_total.csv`, split the `month` column into two new columns: a `year` column and a `month` column, both integers

In [15]:
import pandas as pd
import numpy as np

ct = pd.read_csv("data/country_total.csv")
df7 = ct.copy()

# `month` is assumed to be stored as a float of the form YYYY.MM (e.g. 1993.01).
# Scale to the integer YYYYMM first; round() guards against float error
# such as 199300.99999 being truncated to 199300.
yyyymm = (df7["month"] * 100).round().astype(int)
df7["Year"] = yyyymm // 100
df7["month"] = yyyymm % 100
df7

      country seasonality  month  unemployment  unemployment_rate  Year
0          at         nsa      1        171000                4.5  1993
1          at         nsa      2        175000                4.6  1993
2          at         nsa      3        166000                4.4  1993
3          at         nsa      4        157000                4.1  1993
4          at         nsa      5        147000                3.9  1993
...       ...         ...    ...           ...                ...   ...
20791      uk       trend      6       2429000                7.7  2010
20792      uk       trend      7       2422000                7.7  2010
20793      uk       trend      8       2429000                7.7  2010
20794      uk       trend      9       2447000                7.8  2010
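
The split can be tried without the CSV file. The sketch below assumes, as the solution code suggests, that `month` is a float of the form YYYY.MM; the `demo` frame and its values are made up for illustration.

```python
import pandas as pd

# Hypothetical values in the assumed YYYY.MM float layout.
demo = pd.DataFrame({"month": [1993.01, 1993.12, 2010.09]})

# Scale to an integer YYYYMM, rounding to absorb floating-point error.
as_int = (demo["month"] * 100).round().astype(int)

# Integer division and remainder separate the two parts.
demo["year"] = as_int // 100
demo["month"] = as_int % 100

print(demo)
```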


# 4.2 Group Statistics

Given the unemployment data in `data/country_sex_age.csv`, give the average unemployment rate for:

- Each gender
- Each Age Group
- Both Together

**HINT:** The `seasonality` column makes it such that the data is repeated for each method of calculating unemployment (`nsa`, `trend`, etc.). Can you ignore this and group over it? Or should you take the average for each?

In [16]:
import pandas as pd
import numpy as np

csa = pd.read_csv("data/country_sex_age.csv")

gender_avg = csa.groupby('sex').unemployment_rate.mean()
age_avg = csa.groupby('age_group').unemployment_rate.mean()
gender_age_avg=csa.groupby(['age_group','sex']).unemployment_rate.mean()
seasonality_gender_age_avg=csa.groupby(['seasonality','age_group','sex']).unemployment_rate.mean()
print(gender_avg)#4.2.1
print(age_avg)#4.2.2
print(gender_age_avg)#4.2.3
print(seasonality_gender_age_avg)#4.4

sex
f    12.982629
m    11.671026
Name: unemployment_rate, dtype: float64
age_group
y25-74     6.905394
y_lt25    17.774057
Name: unemployment_rate, dtype: float64
age_group  sex
y25-74     f       7.566771
           m       6.244016
y_lt25     f      18.457435
           m      17.098036
Name: unemployment_rate, dtype: float64
seasonality  age_group  sex
nsa          y25-74     f       7.539839
                        m       6.201653
             y_lt25     f      18.818593
                        m      17.215211
sa           y25-74     f       7.579982
                        m       6.256909
             y_lt25     f      18.323837
                        m      17.067671
trend        y25-74     f       7.579934
                        m       6.272703
             y_lt25     f      18.231025
                        m      17.013327
Name: unemployment_rate, dtype: float64
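
One way to approach the hint, sketched on made-up numbers, is to compute the averages separately for each `seasonality` value and lay them side by side; if the columns are close, grouping over the methods does little harm. The `demo` frame below is hypothetical and only reuses the column names from the dataset.

```python
import pandas as pd

# Hypothetical rates for two calculation methods.
demo = pd.DataFrame({
    "seasonality": ["nsa", "nsa", "sa", "sa"],
    "sex": ["f", "m", "f", "m"],
    "unemployment_rate": [8.0, 6.0, 7.0, 5.0],
})

# Mean per (method, sex), then pivot the method into columns so the
# values for each sex can be compared across methods.
by_method = (demo.groupby(["seasonality", "sex"])["unemployment_rate"]
                 .mean()
                 .unstack("seasonality"))
print(by_method)
```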


# 4.3 Estimating group size

Given that we have the unemployment **rate** as a % of total population, and the number of total unemployed, we can estimate the total population.

Give an estimate of the total population for men and women in each age group.

Does this change depending on the unemployment seasonality calculation method?
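
The arithmetic behind the estimate is a single division: if the rate is a percentage, then population is roughly unemployed / (rate / 100). The sketch below works this through for one pair of figures borrowed from the first row of the Exercise 4.1 output (171000 unemployed at a rate of 4.5); the variable names are illustrative.

```python
# Figures taken from the first row shown in Exercise 4.1.
unemployed = 171_000
rate_percent = 4.5

# rate_percent / 100 converts the percentage into a fraction,
# and dividing by that fraction scales the count up to the whole group.
population = unemployed / (rate_percent / 100)

print(round(population))
```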

In [17]:
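# For each age group and sex: mean number of unemployed and mean unemployment rate.
# Note: despite its name, max_un holds the mean, not the maximum.
g_a_count=csa.groupby(['age_group','sex'])\
             .agg(max_un=("unemployment",'mean'),mean_rate=('unemployment_rate','mean'))\
             .reset_index()
# The rate is a percentage, so population is unemployed / (rate / 100).
g_a_count["total population"] = g_a_count["max_un"]/(g_a_count["mean_rate"]/100)
print(g_a_count)

# Same estimate, split further by seasonality method.
g_a_s_count=csa.groupby(['seasonality','age_group','sex'])\
             .agg(max_un=("unemployment",'mean'),mean_rate=('unemployment_rate','mean'))\
             .reset_index()
g_a_s_count["total population"] = g_a_s_count["max_un"]/(g_a_s_count["mean_rate"]/100)
print(g_a_s_count)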

  age_group sex         max_un  mean_rate  total population
0    y25-74   f  263423.472823   7.566771      3.481319e+06
1    y25-74   m  289391.582492   6.244016      4.634703e+06
2    y_lt25   f  112851.659452  18.457435      6.114157e+05
3    y_lt25   m  124013.227513  17.098036      7.253069e+05
   seasonality age_group sex         max_un  mean_rate  total population
0          nsa    y25-74   f  244200.087758   7.539839      3.238797e+06
1          nsa    y25-74   m  267525.815416   6.201653      4.313782e+06
2          nsa    y_lt25   f  111623.665350  18.818593      5.931563e+05
3          nsa    y_lt25   m  121616.937253  17.215211      7.064505e+05
4           sa    y25-74   f  275348.154482   7.579982      3.632570e+06
5           sa    y25-74   m  302136.810603   6.256909      4.828851e+06
6           sa    y_lt25   f  114021.946701  18.323837      6.222602e+05
7           sa    y_lt25   m  125866.182129  17.067671      7.374538e+05
8        trend    y25-74   f  270308.535179

# 5.1 Tennis

In `data/tennis.csv` you have games that Roger Federer played against various opponents. Questions:

1. How many games did Federer win?

2. What is Federer's win/loss ratio?

3. Who were Federer's top 5 opponents?

In [18]:
import pandas as pd
import numpy as np

t = pd.read_csv("data/tennis.csv")
df8=pd.DataFrame(t)
#5.1.1
federer = df8[df8['winner'].str.contains('Federer')]
win_games=len(federer)
print("He won "+str(win_games)+" games")

#5.1.2
num_of_games = len(t)
win_ratio=win_games/num_of_games
loss_ratio=(num_of_games-win_games)/num_of_games
print("Federer win/loss ratio is "+"{:.0%}".format(win_ratio)+" vs. "+"{:.0%}".format(loss_ratio))

#5.1.3
non_Federer=df8[~df8['winner'].str.contains('Federer')]
top5 = non_Federer.groupby("winner").agg(num_of_wins=("win",'count'))
top5.sort_values(by=['num_of_wins'],ascending=False,inplace=True)
print("His top 5 opponents are below")
top5.head(5)

He won 972 games
Federer win/loss ratio is 82% vs. 18%
His top 5 opponents are below


                  num_of_wins
winner
Rafael Nadal               18
Novak Djokovic             13
Andy Murray                10
David Nalbandian            8
Lleyton Hewitt              8
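
For readers without the CSV at hand, the same three questions can be sketched on a made-up match list. Only the `winner` column name comes from the dataset; the rows are invented, and the ratio here is computed as wins divided by losses, which is one possible reading of "win/loss ratio".

```python
import pandas as pd

# Hypothetical matches; each row names the winner of one game.
matches = pd.DataFrame({"winner": ["Roger Federer", "Rafael Nadal", "Roger Federer",
                                   "Roger Federer", "Rafael Nadal", "Andy Murray"]})

# Boolean mask of games Federer won; summing it counts the True values.
is_win = matches["winner"].str.contains("Federer")
wins = int(is_win.sum())
losses = int((~is_win).sum())
ratio = wins / losses

# Opponents ranked by how often they beat Federer.
top = matches.loc[~is_win, "winner"].value_counts().head(5)
print(wins, losses, ratio)
print(top)
```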


# 5.2 Over time

1. What was Federer's best year, first in terms of money and then in terms of number of wins?

2. Did Federer get better or worse over time?

In [19]:
# Strip dollar signs, thousands separators and letters in one pass.
# regex=True is explicit because the default differs between pandas versions.
t["tournament prize money"] = t["tournament prize money"].str.replace(r"[$,a-zA-Z]", "", regex=True)
# Drop rows whose prize is missing or empty after cleaning; copy() avoids
# a SettingWithCopyWarning when the column is converted later.
t = t[t["tournament prize money"].notna() & (t["tournament prize money"] != "")].copy()
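
The cleaning step can be checked in isolation. The sketch below applies the same idea (remove `$`, commas and letters, then drop empty or missing entries) to a made-up series; `raw`, `digits` and `cleaned` are illustrative names, and the sample strings are invented.

```python
import pandas as pd

# Hypothetical prize strings, including a missing and an empty value.
raw = pd.Series(["$1,000,000", "A$500,000", None, ""])

# The character class removes dollar signs, commas and ASCII letters.
digits = raw.str.replace(r"[$,a-zA-Z]", "", regex=True)

# Keep only entries that are present and non-empty, then convert to int.
cleaned = digits[digits.notna() & (digits != "")].astype(int)
print(cleaned.tolist())
```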

In [37]:
t["tournament prize money"]=t["tournament prize money"].astype(int)
Federer=t[t['winner'].str.contains('Federer')]
# Total prize money of the tournaments in which Federer won matches, per year.
best_year_money=Federer.groupby(['year'])\
             .agg(prizes=("tournament prize money",'sum'))\
             .reset_index()
print(best_year_money)
print("In general, he got better over time.")
print("His best year in terms of money is below: "+str(best_year_money.nlargest(1,'prizes')))

    year    prizes
0   1998     21600
1   1999    365975
2   2000   1023498
3   2001   2543712
4   2002   7058200
5   2003  20117218
6   2004  39034705
7   2005  36910840
8   2006  51748945
9   2007  52680795
10  2008  22940000
11  2009  33892300
12  2010  36160885
13  2011  25066780
14  2012  34535490
In general, he got better over time.
His best year in terms of money is below:    year    prizes
9  2007  52680795


In [35]:
# Number of matches won per year.
best_year_play=Federer.groupby(['year'])\
             .agg(wins=("win",'count'))\
             .reset_index()
print("In general, he got better over time.")
print("His best year in terms of number of wins is below: ")
print(best_year_play.nlargest(1,'wins'))

In general, he got better over time.
His best year in terms of number of wins is below: 
   year  wins
8  2006    94


# 5.3 Total money won

In the data, you'll find the `tournament round` column, one value of which, `F`, indicates the final.

Assuming Federer wins the money in the `tournament prize money` column if he wins a final in a tournament, how much money has Federer made in tournaments in this dataset?

In [33]:

# Note: str.contains('F') keeps every round label that contains the letter F;
# an exact comparison (== 'F') is the stricter test for the final only.
tr_F=t[t['tournament round'].str.contains('F') & (t["win"] == True)]

total_money1=tr_F["tournament prize money"].sum()
print('He made ${:0,.2f}'.format(total_money1))

He made $44,934,964.00
