# Module 6. Data Manipulation and Analysis with Pandas.

**_Author: Favio Vázquez_**

**Expected time = 3 hours**

**Total points = 140 points**


## Assignment Overview

In this assignment you will practice and test your understanding of how to manipulate and analyse data with Pandas. You will begin by reviewing the basics of reading data, then you will learn about the Series and DataFrame APIs, their functionality and their methods. After that, you will index, select and edit data inside dataframes. In the final parts of the assignment you will combine, group and aggregate dataframes.

This assignment is designed to build your familiarity and comfort coding in Python while also helping you review key topics from each module. As you progress through the assignment, answers will get increasingly complex. It is important that you adopt a data scientist's mindset when completing this assignment. **Remember to run your code from each cell before submitting your assignment.** Running your code beforehand will notify you of errors and give you a chance to fix your errors before submitting. You should view your Vocareum submission as if you are delivering a final project to your manager or client. 

***Vocareum Tips***
- Do not add arguments or options to functions unless you are specifically asked to. This will cause an error in Vocareum.
- Do not use a library unless you are explicitly asked to in the question.


### Learning Objectives

- Use Pandas to build, extract, filter, and transform DataFrames.
- Describe Pandas data structures: DataFrames and Series.  
- Use Pandas objects for analyses. 

## Index:

#### Module 6: Data Manipulation and Analysis with Pandas.

- [Question 1](#Question-1)
- [Question 2](#Question-2)
- [Question 3](#Question-3)
- [Question 4](#Question-4)
- [Question 5](#Question-5)
- [Question 6](#Question-6)
- [Question 7](#Question-7)
- [Question 8](#Question-8)
- [Question 9](#Question-9)
- [Question 10](#Question-10)
- [Question 11](#Question-11)
- [Question 12](#Question-12)
- [Question 13](#Question-13)
- [Question 14](#Question-14)
- [Question 15](#Question-15)
- [Question 16](#Question-16)
- [Question 17](#Question-17)

In [1]:
# Let's start by importing Pandas
import pandas as pd

# Avoid warnings
import warnings
warnings.filterwarnings("ignore")

### Importing data

We will begin this assignment with a review of how to import data with Pandas. For several parts of this assignment we will use two datasets covering the past 120 years of Olympic history: athletes and results. The data can be found on Kaggle at this link:

https://www.kaggle.com/heesoo37/120-years-of-olympic-history-athletes-and-results

This is a historical dataset on the modern Olympic Games, including all the Games from Athens 1896 to Rio 2016. 

The file `athlete_events.csv` contains 271116 rows and 15 columns. Each row corresponds to an individual athlete competing in an individual Olympic event (athlete-events). The columns are:

- ID - Unique number for each athlete
- Name - Athlete's name
- Sex - M or F
- Age - Athlete's age
- Height - In centimeters
- Weight - In kilograms
- Team - Team name
- NOC - National Olympic Committee 3-letter code
- Games - Year and season
- Year - Year of game
- Season - Summer or Winter
- City - Host city
- Sport - Sport
- Event - Event
- Medal - Gold, Silver, Bronze, or NA

The file `noc_regions.csv` contains 230 rows and 3 columns. Each row contains information about one National Olympic Committee (NOC). The columns are:

- NOC - National Olympic Committee abbreviation
- region - Name of country in NOC
- notes - Notes about the region and NOC
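
As a minimal sketch of what `pd.read_csv` does with files like these, the example below parses a three-row excerpt held in an in-memory string. The excerpt, the reduced set of columns, and the names `sample_csv` and `sample` are illustrative assumptions, not the real files.

```python
import io
import pandas as pd

# Illustrative excerpt (assumed, abbreviated columns), not the real file.
sample_csv = io.StringIO(
    "ID,Name,Sex,Age,Medal\n"
    "1,A Dijiang,M,24,\n"
    "2,A Lamusi,M,23,\n"
    "4,Edgar Lindenau Aabye,M,34,Gold\n"
)

# read_csv accepts a path or any file-like object.
sample = pd.read_csv(sample_csv)

print(sample.shape)            # (rows, columns)
print(sample["Medal"].isna())  # empty fields are read as missing values
```

Missing medals show up as `NaN`, which is why later questions ask you to handle missing values explicitly.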


[Back to top](#Index:) 

### Question 1
*5 points*

Read the CSV file named `"athlete_events.csv"` in the `data/` folder and assign it to a dataframe called `df`.

In [2]:
### GRADED

### YOUR SOLUTION HERE
from pathlib import Path
file_path = Path('data/') / 'athlete_events.csv'


df = pd.read_csv(file_path)


###
### YOUR CODE HERE
###


In [3]:
df

Unnamed: 0,ID,Name,Sex,Age,Height,Weight,Team,NOC,Games,Year,Season,City,Sport,Event,Medal
0,1,A Dijiang,M,24.0,180.0,80.0,China,CHN,1992 Summer,1992,Summer,Barcelona,Basketball,Basketball Men's Basketball,
1,2,A Lamusi,M,23.0,170.0,60.0,China,CHN,2012 Summer,2012,Summer,London,Judo,Judo Men's Extra-Lightweight,
2,3,Gunnar Nielsen Aaby,M,24.0,,,Denmark,DEN,1920 Summer,1920,Summer,Antwerpen,Football,Football Men's Football,
3,4,Edgar Lindenau Aabye,M,34.0,,,Denmark/Sweden,DEN,1900 Summer,1900,Summer,Paris,Tug-Of-War,Tug-Of-War Men's Tug-Of-War,Gold
4,5,Christine Jacoba Aaftink,F,21.0,185.0,82.0,Netherlands,NED,1988 Winter,1988,Winter,Calgary,Speed Skating,Speed Skating Women's 500 metres,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
271111,135569,Andrzej ya,M,29.0,179.0,89.0,Poland-1,POL,1976 Winter,1976,Winter,Innsbruck,Luge,Luge Mixed (Men)'s Doubles,
271112,135570,Piotr ya,M,27.0,176.0,59.0,Poland,POL,2014 Winter,2014,Winter,Sochi,Ski Jumping,"Ski Jumping Men's Large Hill, Individual",
271113,135570,Piotr ya,M,27.0,176.0,59.0,Poland,POL,2014 Winter,2014,Winter,Sochi,Ski Jumping,"Ski Jumping Men's Large Hill, Team",
271114,135571,Tomasz Ireneusz ya,M,30.0,185.0,96.0,Poland,POL,1998 Winter,1998,Winter,Nagano,Bobsleigh,Bobsleigh Men's Four,


In [4]:
###
### AUTOGRADER TEST - DO NOT REMOVE
###


In [5]:
# Let's take a look at our dataframe df
df.head()

Unnamed: 0,ID,Name,Sex,Age,Height,Weight,Team,NOC,Games,Year,Season,City,Sport,Event,Medal
0,1,A Dijiang,M,24.0,180.0,80.0,China,CHN,1992 Summer,1992,Summer,Barcelona,Basketball,Basketball Men's Basketball,
1,2,A Lamusi,M,23.0,170.0,60.0,China,CHN,2012 Summer,2012,Summer,London,Judo,Judo Men's Extra-Lightweight,
2,3,Gunnar Nielsen Aaby,M,24.0,,,Denmark,DEN,1920 Summer,1920,Summer,Antwerpen,Football,Football Men's Football,
3,4,Edgar Lindenau Aabye,M,34.0,,,Denmark/Sweden,DEN,1900 Summer,1900,Summer,Paris,Tug-Of-War,Tug-Of-War Men's Tug-Of-War,Gold
4,5,Christine Jacoba Aaftink,F,21.0,185.0,82.0,Netherlands,NED,1988 Winter,1988,Winter,Calgary,Speed Skating,Speed Skating Women's 500 metres,


In [6]:
# Let's see the shape of our dataframe df
print("Number of rows: {}, number of columns: {}".format(df.shape[0],df.shape[1]))

Number of rows: 271116, number of columns: 15


[Back to top](#Index:) 

### Question 2
*5 points*

Read the CSV file named `"noc_regions.csv"` in the `data/` folder and assign it to a dataframe called `regions`.

In [169]:
### GRADED

### YOUR SOLUTION HERE

file_path = Path('data/') / 'noc_regions.csv'


regions = pd.read_csv(file_path)


###
### YOUR CODE HERE
###


In [10]:
regions

Unnamed: 0,NOC,region,notes
0,AFG,Afghanistan,
1,AHO,Curacao,Netherlands Antilles
2,ALB,Albania,
3,ALG,Algeria,
4,AND,Andorra,
...,...,...,...
225,YEM,Yemen,
226,YMD,Yemen,South Yemen
227,YUG,Serbia,Yugoslavia
228,ZAM,Zambia,


In [11]:
###
### AUTOGRADER TEST - DO NOT REMOVE
###


In [12]:
# Let's take a look at our dataframe regions
regions.head()

Unnamed: 0,NOC,region,notes
0,AFG,Afghanistan,
1,AHO,Curacao,Netherlands Antilles
2,ALB,Albania,
3,ALG,Algeria,
4,AND,Andorra,


In [13]:
# Let's see the shape of our dataframe regions
print("Number of rows: {}, number of columns: {}".format(regions.shape[0],regions.shape[1]))

Number of rows: 230, number of columns: 3


### Pandas Objects

In this part of the assignment we will begin studying the two most important objects exposed by Pandas: Series and Dataframes. As you remember:
- **Series** is a one-dimensional data structure in Pandas
- **DataFrame** is a two-dimensional data structure in Pandas, made up of columns and rows
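
The difference between the two structures can be sketched with a toy dataframe. The values and the names `toy`, `one_column` and `sub_frame` are assumptions made for illustration.

```python
import pandas as pd

# Toy data (assumed values) to contrast the two structures.
toy = pd.DataFrame({"Height": [180.0, 170.0, 185.0],
                    "Weight": [80.0, 60.0, 82.0]})

one_column = toy["Height"]      # single brackets -> Series (1-dimensional)
sub_frame = toy[["Height"]]     # list of names   -> DataFrame (2-dimensional)

print(type(one_column).__name__, one_column.shape)
print(type(sub_frame).__name__, sub_frame.shape)
```

Selecting with a single column name gives a Series, while selecting with a list of names, even a list of one, keeps the result a DataFrame.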

[Back to top](#Index:) 

### Question 3
*5 points*

Get a series from the dataframe `df` with the contents of the column `Height` and store it in a variable called `height`. 

In [14]:
### GRADED

### YOUR SOLUTION HERE
height = df['Height']

###
### YOUR CODE HERE
###


In [17]:
height

0         180.0
1         170.0
2           NaN
3           NaN
4         185.0
          ...  
271111    179.0
271112    176.0
271113    176.0
271114    185.0
271115    185.0
Name: Height, Length: 271116, dtype: float64

In [18]:
type(height)

pandas.core.series.Series

In [19]:
###
### AUTOGRADER TEST - DO NOT REMOVE
###


[Back to top](#Index:) 

### Question 4
*10 points*
    
Use a lambda function to rename the index (the labels) of the series `height` from above, so that each new label is the square of the original label. Like this:

$$
0 \rightarrow 0 \\
1 \rightarrow 1 \\
2 \rightarrow 4 \\
3 \rightarrow 9 \\
\vdots
$$

Save this new series in a variable called `height_new`.
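
A small sketch of relabelling on a four-element series, assuming the default integer labels; the names `s` and `s_squared` are hypothetical.

```python
import pandas as pd

s = pd.Series([180.0, 170.0, 185.0, 179.0])   # default labels 0, 1, 2, 3

# Passing a function to rename() maps every index label through it;
# the values are left untouched and a new Series is returned.
s_squared = s.rename(lambda label: label ** 2)

print(list(s_squared.index))
```

Because `rename` returns a new object, the original series keeps its labels unless you reassign it.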

In [20]:
### GRADED

### YOUR SOLUTION HERE
height_new = height.rename(lambda x: x ** 2) 

###
### YOUR CODE HERE
###


In [21]:
height_new

0              180.0
1              170.0
4                NaN
9                NaN
16             185.0
               ...  
73501174321    179.0
73501716544    176.0
73502258769    176.0
73502800996    185.0
73503343225    185.0
Name: Height, Length: 271116, dtype: float64

In [None]:
###
### AUTOGRADER TEST - DO NOT REMOVE
###


[Back to top](#Index:) 

### Question 5
*5 points*

Get a series from the dataframe `regions` with the contents of the column `region` and store it in a variable called `reg`.

In [22]:
### GRADED

### YOUR SOLUTION HERE
reg = regions['region']

###
### YOUR CODE HERE
###


In [23]:
reg

0      Afghanistan
1          Curacao
2          Albania
3          Algeria
4          Andorra
          ...     
225          Yemen
226          Yemen
227         Serbia
228         Zambia
229       Zimbabwe
Name: region, Length: 230, dtype: object

In [None]:
###
### AUTOGRADER TEST - DO NOT REMOVE
###


[Back to top](#Index:) 

### Question 6
*10 points*

Find how many regions in the series `reg` start with the letter `A` and save it in a variable called `a_number`. Then find how many regions in the series `reg` start with the letter `V` and save it in a variable called `v_number`.

**Hints:**

- **Make sure you don't count any missing values.**
- **There are 14 regions that start with A, and 6 regions that start with V.**
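
One way to keep missing values out of such a count is sketched below on a made-up series; `names`, `a_count` and `a_count_alt` are hypothetical names. It relies on the `na` parameter of `Series.str.startswith`.

```python
import numpy as np
import pandas as pd

names = pd.Series(["Albania", "Andorra", np.nan, "Venezuela", "Yemen"])

# na=False makes missing entries count as "does not start with".
starts_with_a = names.str.startswith("A", na=False)
a_count = int(starts_with_a.sum())   # True counts as 1

# Equivalent alternative: drop the missing entries first.
a_count_alt = int(names.dropna().str.startswith("A").sum())

print(a_count, a_count_alt)
```

Both variants give the same count; the first keeps the mask aligned with the original series, which is handy if you also want to filter with it.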

In [92]:
### GRADED

### YOUR SOLUTION HERE
# Drop missing regions first, then count the names that start with each letter.
a_number = int(reg.dropna().str.startswith('A').sum())
v_number = int(reg.dropna().str.startswith('V').sum())

###
### YOUR CODE HERE
###


In [95]:
print(a_number)
print(v_number)

14
6


In [None]:
###
### AUTOGRADER TEST - DO NOT REMOVE
###


[Back to top](#Index:) 

### Question 7
*5 points*
    
Create a new dataframe from the dataframe `df` that contains only the columns `ID`, `Age`, `Height`, `Weight` and `Sex`, in that specific order. Name this new dataframe `df_subset`.

In [98]:
### GRADED

### YOUR SOLUTION HERE
df_subset = df[['ID', 'Age', 'Height', 'Weight', 'Sex']]

###
### YOUR CODE HERE
###


In [99]:
df_subset

Unnamed: 0,ID,Age,Height,Weight,Sex
0,1,24.0,180.0,80.0,M
1,2,23.0,170.0,60.0,M
2,3,24.0,,,M
3,4,34.0,,,M
4,5,21.0,185.0,82.0,F
...,...,...,...,...,...
271111,135569,29.0,179.0,89.0,M
271112,135570,27.0,176.0,59.0,M
271113,135570,27.0,176.0,59.0,M
271114,135571,30.0,185.0,96.0,M


In [None]:
###
### AUTOGRADER TEST - DO NOT REMOVE
###


[Back to top](#Index:) 

### Question 8
*10 points*
    
Create a new column for the dataframe `df_subset` named `Ratio`, where you will be storing the following calculation:

$$
Ratio = \frac{Height + Weight}{Age}
$$

Make sure to store the new column in the same dataframe `df_subset`.
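
As a sketch of how such a derived column behaves, the example below applies the same formula to three rows; the frame `people` is an assumed stand-in for `df_subset`.

```python
import numpy as np
import pandas as pd

people = pd.DataFrame({"Age": [24.0, 23.0, 24.0],
                       "Height": [180.0, 170.0, np.nan],
                       "Weight": [80.0, 60.0, np.nan]})

# Arithmetic between columns is element-wise and aligned on the index;
# any NaN operand makes the result for that row NaN.
people["Ratio"] = (people["Height"] + people["Weight"]) / people["Age"]

print(people)
```

Rows with a missing height or weight end up with a missing ratio, which matches the `NaN` entries you should expect in the real data.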

In [100]:
### GRADED

### YOUR SOLUTION HERE

# Compute the ratio from the columns of df_subset itself.
df_subset['Ratio'] = (df_subset['Height'] + df_subset['Weight']) / df_subset['Age']

###
### YOUR CODE HERE
###


In [102]:
df_subset

Unnamed: 0,ID,Age,Height,Weight,Sex,Ratio
0,1,24.0,180.0,80.0,M,10.833333
1,2,23.0,170.0,60.0,M,10.000000
2,3,24.0,,,M,
3,4,34.0,,,M,
4,5,21.0,185.0,82.0,F,12.714286
...,...,...,...,...,...,...
271111,135569,29.0,179.0,89.0,M,9.241379
271112,135570,27.0,176.0,59.0,M,8.703704
271113,135570,27.0,176.0,59.0,M,8.703704
271114,135571,30.0,185.0,96.0,M,9.366667


In [103]:
###
### AUTOGRADER TEST - DO NOT REMOVE
###


### Indexing and selecting data from Dataframes

In this part of the assignment we will work with the dataframes from above to select specific data using the different indexing methods and attributes that Pandas provides. You have learned to use `loc[]` and `iloc[]` to do this.
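
The key difference between the two indexers is how slices are interpreted. The sketch below uses a made-up frame called `letters` with the default integer index.

```python
import pandas as pd

letters = pd.DataFrame({"x": [10, 20, 30, 40, 50],
                        "y": list("abcde")})

by_position = letters.iloc[1:3]   # positions 1 and 2 (end excluded)
by_label = letters.loc[1:3]       # labels 1, 2 and 3 (end included)

print(by_position)
print(by_label)
```

With the default index, labels and positions coincide, so the only visible difference is whether the end of the slice is included.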

[Back to top](#Index:) 

### Question 9
*5 points*

Select rows 4 through 12 and the first 6 columns from the dataframe `df` and store them in a new dataframe called `df_1`.

In [114]:
df.iloc[3:12,:6]

Unnamed: 0,ID,Name,Sex,Age,Height,Weight
3,4,Edgar Lindenau Aabye,M,34.0,,
4,5,Christine Jacoba Aaftink,F,21.0,185.0,82.0
5,5,Christine Jacoba Aaftink,F,21.0,185.0,82.0
6,5,Christine Jacoba Aaftink,F,25.0,185.0,82.0
7,5,Christine Jacoba Aaftink,F,25.0,185.0,82.0
8,5,Christine Jacoba Aaftink,F,27.0,185.0,82.0
9,5,Christine Jacoba Aaftink,F,27.0,185.0,82.0
10,6,Per Knut Aaland,M,31.0,188.0,75.0
11,6,Per Knut Aaland,M,31.0,188.0,75.0


In [None]:
### GRADED

### YOUR SOLUTION HERE
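# Assumes "rows 4 through 12" means the 4th to 12th rows (positions 3 to 11),
# as in the exploratory cell above.
df_1 = df.iloc[3:12, :6]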

###
### YOUR CODE HERE
###


In [None]:
###
### AUTOGRADER TEST - DO NOT REMOVE
###


[Back to top](#Index:) 

### Question 10
*10 points*

Select all the rows from the dataframe `df` when the `Year` is greater than 1980 and the `Team` is equal to "China", "United States", "Italy" or "Spain". Save your results in a dataframe called `df_2`.
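
A filter of this kind can be sketched on a five-row toy frame; `events`, `wanted`, `mask` and `selected` are hypothetical names and the rows are made up.

```python
import pandas as pd

events = pd.DataFrame({
    "Team": ["China", "Denmark", "Italy", "Spain", "United States"],
    "Year": [1992, 1920, 1976, 1988, 2008],
})

wanted = ["China", "United States", "Italy", "Spain"]

# Each condition yields a boolean Series; & combines them element-wise.
# The parentheses are required because & binds tighter than > .
mask = events["Team"].isin(wanted) & (events["Year"] > 1980)
selected = events[mask]

print(selected)
```

Using `isin` avoids chaining several `==` comparisons with `|`, although both approaches select the same rows.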

In [210]:
#df.loc[(df['Team'].isin(["China", "United States", "Italy","Spain"])) & (df['Year']>1980)]

df[(df['Year']>1980) & 
   ((df['Team']=='China') | 
    (df['Team']=='Italy') |
    (df['Team']=='Spain') |
    (df['Team']=='United States'))]

Unnamed: 0,ID,Name,Sex,Age,Height,Weight,Team,NOC,Games,Year,Season,City,Sport,Event,Medal
0,1,A Dijiang,M,24.0,180.0,80.0,China,CHN,1992 Summer,1992,Summer,Barcelona,Basketball,Basketball Men's Basketball,
1,2,A Lamusi,M,23.0,170.0,60.0,China,CHN,2012 Summer,2012,Summer,London,Judo,Judo Men's Extra-Lightweight,
10,6,Per Knut Aaland,M,31.0,188.0,75.0,United States,USA,1992 Winter,1992,Winter,Albertville,Cross Country Skiing,Cross Country Skiing Men's 10 kilometres,
11,6,Per Knut Aaland,M,31.0,188.0,75.0,United States,USA,1992 Winter,1992,Winter,Albertville,Cross Country Skiing,Cross Country Skiing Men's 50 kilometres,
12,6,Per Knut Aaland,M,31.0,188.0,75.0,United States,USA,1992 Winter,1992,Winter,Albertville,Cross Country Skiing,Cross Country Skiing Men's 10/15 kilometres Pu...,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
270850,135458,Rami Zur,M,27.0,175.0,77.0,United States,USA,2004 Summer,2004,Summer,Athina,Canoeing,"Canoeing Men's Kayak Doubles, 500 metres",
270851,135458,Rami Zur,M,31.0,175.0,77.0,United States,USA,2008 Summer,2008,Summer,Beijing,Canoeing,"Canoeing Men's Kayak Singles, 500 metres",
270852,135458,Rami Zur,M,31.0,175.0,77.0,United States,USA,2008 Summer,2008,Summer,Beijing,Canoeing,"Canoeing Men's Kayak Singles, 1,000 metres",
270891,135471,Jos Zurera Alberca,M,22.0,162.0,52.0,Spain,ESP,1988 Summer,1988,Summer,Seoul,Weightlifting,Weightlifting Men's Bantamweight,


In [211]:
### GRADED

### YOUR SOLUTION HERE
df_2 = df.loc[(df['Team'].isin(["China", "United States", "Italy","Spain"])) & \
              (df['Year']>1980)]

###
### YOUR CODE HERE
###


In [212]:
###
### AUTOGRADER TEST - DO NOT REMOVE
###


[Back to top](#Index:) 

### Question 11
*5 points*

Using the `iloc[]` indexer, select the rows at positions 0, 10, 20, 40, 43, 66 and the columns at positions 0, 3, 5 from the dataframe `df`. Store your results in a dataframe called `df_3`.

In [213]:
### GRADED

### YOUR SOLUTION HERE
df_3 = df.iloc[[0, 10, 20, 40, 43, 66],[0, 3, 5]]

###
### YOUR CODE HERE
###


In [214]:
df_3

Unnamed: 0,ID,Age,Weight
0,1,24.0,80.0
10,6,31.0,75.0
20,7,31.0,72.0
40,16,28.0,85.0
43,17,28.0,64.0
66,20,22.0,85.0


In [215]:
###
### AUTOGRADER TEST - DO NOT REMOVE
###


### Editing data in DataFrames

In this section we will modify the structure and data of dataframes, deleting some of their columns and transforming others.

[Back to top](#Index:) 

### Question 12
*15 points*

Drop the column `notes` from the dataframe `regions`. Then create a new column called `flag` that holds 1 if the value in the column `region` contains the letter "a" and 0 otherwise. Make sure to keep the changes in the dataframe `regions` itself. The final dataframe should look like this:


| NOC | region  | flag |
|----|-------|-----------|
| AFG  | Afghanistan | 1      |
| AHO  | Curacao | 1    |
| YEM  | Yemen  | 0     |
| ...  | ...     | ... |

Make sure to check whether `region` is `NaN`; if it is, set the new column `flag` to 0 for that row.
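
As an alternative to a row-by-row helper, the same kind of flag can be sketched with the vectorised string methods. The frame `places` is a made-up example, and this is one possible approach rather than the required one.

```python
import numpy as np
import pandas as pd

places = pd.DataFrame({"region": ["Afghanistan", "Yemen", np.nan, "Chile"]})

# regex=False treats "a" as a plain substring; na=False maps NaN to False.
has_a = places["region"].str.contains("a", regex=False, na=False)
places["flag"] = has_a.astype(int)

print(places)
```

The match is case-sensitive by default, so only a lowercase "a" sets the flag.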

In [286]:
### GRADED

### YOUR SOLUTION HERE
regions.drop('notes', axis=1, inplace=True)

def upflag(x):
    # Return 1 for strings containing a lowercase "a"; NaN (a float) and anything else gives 0.
    if isinstance(x, str) and 'a' in x:
        return 1
    return 0

# Apply the helper to each value of the `region` column.
regions['flag'] = regions['region'].apply(upflag)


###
### YOUR CODE HERE
###




In [287]:
regions

Unnamed: 0,NOC,region,flag
0,AFG,Afghanistan,1
1,AHO,Curacao,1
2,ALB,Albania,1
3,ALG,Algeria,1
4,AND,Andorra,1
...,...,...,...
225,YEM,Yemen,0
226,YMD,Yemen,0
227,YUG,Serbia,1
228,ZAM,Zambia,1


In [281]:
###
### AUTOGRADER TEST - DO NOT REMOVE
###


### Combining DataFrames

In this section we will be combining our dataframes `df` and `regions` and learn different ways of working with them.

[Back to top](#Index:) 

### Question 13
*10 points*
    
Combine the dataframes `df` and `regions` into a new dataframe called `merged`. The join should be a "left" join for `df` on the column `NOC`.
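
The behaviour of a left join can be sketched with two tiny frames. The names `left`, `right` and `joined` are hypothetical, and `XXX` is a made-up code with no match on the right-hand side.

```python
import pandas as pd

left = pd.DataFrame({"NOC": ["CHN", "DEN", "XXX"],
                     "Name": ["A Dijiang", "Gunnar Nielsen Aaby", "Unknown"]})
right = pd.DataFrame({"NOC": ["CHN", "DEN", "AFG"],
                      "region": ["China", "Denmark", "Afghanistan"]})

# how="left" keeps every row of `left`; unmatched keys get NaN on the right side.
joined = pd.merge(left, right, how="left", on="NOC")

print(joined)
```

Every row of the left frame survives, and keys that appear only on the right (`AFG` here) are dropped.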

In [291]:
regions['NOC'].unique()

array(['AFG', 'AHO', 'ALB', 'ALG', 'AND', 'ANG', 'ANT', 'ANZ', 'ARG',
       'ARM', 'ARU', 'ASA', 'AUS', 'AUT', 'AZE', 'BAH', 'BAN', 'BAR',
       'BDI', 'BEL', 'BEN', 'BER', 'BHU', 'BIH', 'BIZ', 'BLR', 'BOH',
       'BOL', 'BOT', 'BRA', 'BRN', 'BRU', 'BUL', 'BUR', 'CAF', 'CAM',
       'CAN', 'CAY', 'CGO', 'CHA', 'CHI', 'CHN', 'CIV', 'CMR', 'COD',
       'COK', 'COL', 'COM', 'CPV', 'CRC', 'CRO', 'CRT', 'CUB', 'CYP',
       'CZE', 'DEN', 'DJI', 'DMA', 'DOM', 'ECU', 'EGY', 'ERI', 'ESA',
       'ESP', 'EST', 'ETH', 'EUN', 'FIJ', 'FIN', 'FRA', 'FRG', 'FSM',
       'GAB', 'GAM', 'GBR', 'GBS', 'GDR', 'GEO', 'GEQ', 'GER', 'GHA',
       'GRE', 'GRN', 'GUA', 'GUI', 'GUM', 'GUY', 'HAI', 'HKG', 'HON',
       'HUN', 'INA', 'IND', 'IOA', 'IRI', 'IRL', 'IRQ', 'ISL', 'ISR',
       'ISV', 'ITA', 'IVB', 'JAM', 'JOR', 'JPN', 'KAZ', 'KEN', 'KGZ',
       'KIR', 'KOR', 'KOS', 'KSA', 'KUW', 'LAO', 'LAT', 'LBA', 'LBR',
       'LCA', 'LES', 'LIB', 'LIE', 'LTU', 'LUX', 'MAD', 'MAL', 'MAR',
       'MAS', 'MAW',

In [297]:
df['NOC'].unique()

array(['CHN', 'DEN', 'NED', 'USA', 'FIN', 'NOR', 'ROU', 'EST', 'FRA',
       'MAR', 'ESP', 'EGY', 'IRI', 'BUL', 'ITA', 'CHA', 'AZE', 'SUD',
       'RUS', 'ARG', 'CUB', 'BLR', 'GRE', 'CMR', 'TUR', 'CHI', 'MEX',
       'URS', 'NCA', 'HUN', 'NGR', 'ALG', 'KUW', 'BRN', 'PAK', 'IRQ',
       'UAR', 'LIB', 'QAT', 'MAS', 'GER', 'CAN', 'IRL', 'AUS', 'RSA',
       'ERI', 'TAN', 'JOR', 'TUN', 'LBA', 'BEL', 'DJI', 'PLE', 'COM',
       'KAZ', 'BRU', 'IND', 'KSA', 'SYR', 'MDV', 'ETH', 'UAE', 'YAR',
       'INA', 'PHI', 'SGP', 'UZB', 'KGZ', 'TJK', 'EUN', 'JPN', 'CGO',
       'SUI', 'BRA', 'FRG', 'GDR', 'MON', 'ISR', 'URU', 'SWE', 'ISV',
       'SRI', 'ARM', 'CIV', 'KEN', 'BEN', 'UKR', 'GBR', 'GHA', 'SOM',
       'LAT', 'NIG', 'MLI', 'AFG', 'POL', 'CRC', 'PAN', 'GEO', 'SLO',
       'CRO', 'GUY', 'NZL', 'POR', 'PAR', 'ANG', 'VEN', 'COL', 'BAN',
       'PER', 'ESA', 'PUR', 'UGA', 'HON', 'ECU', 'TKM', 'MRI', 'SEY',
       'TCH', 'LUX', 'MTN', 'CZE', 'SKN', 'TTO', 'DOM', 'VIN', 'JAM',
       'LBR', 'SUR',

In [300]:
### GRADED

# Let's read our data again to have the original datasets
df = pd.read_csv("data/athlete_events.csv")
regions = pd.read_csv("data/noc_regions.csv")

### YOUR SOLUTION HERE
merged = pd.merge(df, regions, how='left', on='NOC')

###
### YOUR CODE HERE
###


In [299]:
###
### AUTOGRADER TEST - DO NOT REMOVE
###


[Back to top](#Index:) 

### Question 14
*10 points*

You are given the dataframe `add`, which contains the columns `A`, `B`, `C` and `D` with 230 rows of random integers. Make a copy of the dataframe `regions` called `regions_add`, and append the columns `A` and `B` to the dataframe `regions_add`.
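
Appending columns from another frame can be sketched as follows; `base`, `extra`, `combined` and `combined_alt` are hypothetical names and the values are made up.

```python
import pandas as pd

base = pd.DataFrame({"NOC": ["AFG", "AHO", "ALB"]})
extra = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6], "C": [7, 8, 9]})

combined = base.copy()                    # leave `base` unchanged
combined[["A", "B"]] = extra[["A", "B"]]  # rows are matched by index label

# pd.concat along axis=1 is an alternative that builds a new frame.
combined_alt = pd.concat([base, extra[["A", "B"]]], axis=1)

print(combined)
```

Both approaches align rows by index label, so they work as expected here because the two frames share the same default index.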

In [301]:
### GRADED

import numpy as np

np.random.seed(42)
add = pd.DataFrame(np.random.randint(0,230,size=(230, 4)), columns=list('ABCD'))
add.reindex_like(regions)

### YOUR SOLUTION HERE
regions_add = regions.copy()

regions_add[['A','B']] = add[['A','B']]

###
### YOUR CODE HERE
###


In [302]:
# See your final dataframe
regions_add.head()

Unnamed: 0,NOC,region,notes,A,B
0,AFG,Afghanistan,,102,179
1,AHO,Curacao,Netherlands Antilles,106,71
2,ALB,Albania,,102,121
3,ALG,Algeria,,74,202
4,AND,Andorra,,99,103


In [303]:
###
### AUTOGRADER TEST - DO NOT REMOVE
###


### Grouping and aggregating dataframes

In this final section we will group and perform aggregations on our dataframes.

In [304]:
# Let's read our data again to have the original datasets
df = pd.read_csv("data/athlete_events.csv")
regions = pd.read_csv("data/noc_regions.csv")

[Back to top](#Index:) 

### Question 15
*10 points*

Get the top 5 countries with the most gold medals of all time from the dataframe `merged`. Save your results in a dataframe called `gold`.

**Hint: Use `reset_index(name='Medal')` to get the results in a dataframe.**
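
The pattern behind the hint can be sketched on a toy frame: filter, group, count, sort, take the top rows, then turn the resulting Series back into a dataframe. The frame `medals` and its contents are made up for illustration.

```python
import pandas as pd

medals = pd.DataFrame({
    "region": ["USA", "USA", "Italy", "USA", "Italy", "Spain", "USA"],
    "Medal": ["Gold", "Gold", "Gold", "Silver", "Gold", "Gold", "Gold"],
})

top = (medals[medals["Medal"] == "Gold"]
       .groupby("region")["Medal"]
       .count()                       # Series indexed by region
       .sort_values(ascending=False)
       .head(2)
       .reset_index(name="Medal"))    # back to a two-column DataFrame

print(top)
```

`reset_index(name=...)` moves the group labels into an ordinary column and names the column of counts.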

In [360]:
merged

Unnamed: 0,ID,Name,Sex,Age,Height,Weight,Team,NOC,Games,Year,Season,City,Sport,Event,Medal,region,notes
0,1,A Dijiang,M,24.0,180.0,80.0,China,CHN,1992 Summer,1992,Summer,Barcelona,Basketball,Basketball Men's Basketball,,China,
1,2,A Lamusi,M,23.0,170.0,60.0,China,CHN,2012 Summer,2012,Summer,London,Judo,Judo Men's Extra-Lightweight,,China,
2,3,Gunnar Nielsen Aaby,M,24.0,,,Denmark,DEN,1920 Summer,1920,Summer,Antwerpen,Football,Football Men's Football,,Denmark,
3,4,Edgar Lindenau Aabye,M,34.0,,,Denmark/Sweden,DEN,1900 Summer,1900,Summer,Paris,Tug-Of-War,Tug-Of-War Men's Tug-Of-War,Gold,Denmark,
4,5,Christine Jacoba Aaftink,F,21.0,185.0,82.0,Netherlands,NED,1988 Winter,1988,Winter,Calgary,Speed Skating,Speed Skating Women's 500 metres,,Netherlands,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
271111,135569,Andrzej ya,M,29.0,179.0,89.0,Poland-1,POL,1976 Winter,1976,Winter,Innsbruck,Luge,Luge Mixed (Men)'s Doubles,,Poland,
271112,135570,Piotr ya,M,27.0,176.0,59.0,Poland,POL,2014 Winter,2014,Winter,Sochi,Ski Jumping,"Ski Jumping Men's Large Hill, Individual",,Poland,
271113,135570,Piotr ya,M,27.0,176.0,59.0,Poland,POL,2014 Winter,2014,Winter,Sochi,Ski Jumping,"Ski Jumping Men's Large Hill, Team",,Poland,
271114,135571,Tomasz Ireneusz ya,M,30.0,185.0,96.0,Poland,POL,1998 Winter,1998,Winter,Nagano,Bobsleigh,Bobsleigh Men's Four,,Poland,


In [363]:

gold = merged[['NOC','ID']]\
[merged['Medal']=='Gold'].groupby('NOC').count().sort_values(by=['ID'],ascending=False).head()


gold.set_index('ID')
gold.rename(columns={'ID':'Medal'},inplace=True)
#gold.set_index('Medal')


Unnamed: 0_level_0,Medal
NOC,Unnamed: 1_level_1
USA,2638
URS,1082
GER,745
GBR,678
ITA,575


In [348]:
gold

Unnamed: 0_level_0,ID
region,Unnamed: 1_level_1
USA,2638
Russia,1599
Germany,1301
UK,678
Italy,575


In [364]:
### GRADED

### YOUR SOLUTION HERE
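# Count gold medals per country (region), keep the five largest counts and,
# following the hint, return a dataframe with the columns `region` and `Medal`.
gold = (merged[merged['Medal'] == 'Gold']
        .groupby('region')['Medal']
        .count()
        .sort_values(ascending=False)
        .head(5)
        .reset_index(name='Medal'))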

###
### YOUR CODE HERE
###


In [365]:
###
### AUTOGRADER TEST - DO NOT REMOVE
###


[Back to top](#Index:) 

### Question 16
*10 points*

What was the average height of the gold medal winners for each year of the Olympics (using the dataframe `merged`)? Store your result in a Series called `gold_height`.

**Hint: Remove NaNs before performing the operation.**
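
How missing values interact with a grouped mean can be sketched on made-up data; `results`, `implicit` and `explicit` are hypothetical names.

```python
import numpy as np
import pandas as pd

results = pd.DataFrame({
    "Year": [1896, 1896, 1900, 1900],
    "Height": [170.0, np.nan, 180.0, 176.0],
})

# mean() ignores NaN by default, so dropping them first gives the same answer here.
implicit = results.groupby("Year")["Height"].mean()
explicit = results.dropna(subset=["Height"]).groupby("Year")["Height"].mean()

print(implicit)
```

The two results differ only when a whole group is missing: dropping rows first removes that group, while the default mean keeps it with a `NaN` value.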

In [369]:
merged[['Year','Height']][merged['Medal']=='Gold'].groupby('Year').mean()

Unnamed: 0_level_0,Height
Year,Unnamed: 1_level_1
1896,175.153846
1900,177.407407
1904,177.215686
1906,179.125
1908,176.492537
1912,178.135135
1920,177.986014
1924,176.278689
1928,176.01
1932,174.176


In [376]:
### GRADED

### YOUR SOLUTION HERE
gold_height = merged[['Year','Height']][merged['Medal']=='Gold'].groupby('Year').mean()
gold_height=gold_height['Height']

###
### YOUR CODE HERE
###


In [380]:
gold_height

Year
1896    175.153846
1900    177.407407
1904    177.215686
1906    179.125000
1908    176.492537
1912    178.135135
1920    177.986014
1924    176.278689
1928    176.010000
1932    174.176000
1936    177.782609
1948    179.630952
1952    178.539683
1956    177.556098
1960    176.207602
1964    176.609023
1968    177.981087
1972    177.626327
1976    178.138340
1980    178.053743
1984    178.135088
1988    179.161716
1992    178.720339
1994    175.438095
1996    178.013937
1998    175.444444
2000    178.465961
2002    175.438272
2004    177.993976
2006    176.181818
2008    178.328358
2010    176.603448
2012    179.011094
2014    176.509901
2016    179.012048
Name: Height, dtype: float64

In [379]:
type(gold_height)

pandas.core.series.Series

In [None]:
###
### AUTOGRADER TEST - DO NOT REMOVE
###


[Back to top](#Index:) 

### Question 17
*10 points*

What was the average weight for the USA team for each year of the Olympics (using the `merged` dataframe)? Store your result in a Series called `weight_USA`.

**Hint: Remove NaNs before performing the operation.**

In [385]:
### GRADED

### YOUR SOLUTION HERE
# Average weight per Olympic year for USA athletes; mean() skips NaN values.
weight_USA = merged[merged['NOC'] == 'USA'].groupby('Year')['Weight'].mean()

###
### YOUR CODE HERE
###


In [386]:
weight_USA

Year
1896    72.461538
1900    74.441176
1904    71.600000
1906    71.525424
1908    76.148936
1912    73.622222
1920    74.174825
1924    71.116667
1928    72.750000
1932    71.721805
1936    72.985401
1948    72.045699
1952    75.421296
1956    72.787257
1960    72.039030
1964    72.129241
1968    72.223645
1972    71.113889
1976    70.893182
1980    70.673611
1984    71.477435
1988    71.409885
1992    71.878835
1994    72.620690
1996    73.171582
1998    72.063604
2000    73.598155
2002    72.405145
2004    73.594744
2006    71.753086
2008    74.447090
2010    73.398827
2012    74.692308
2014    72.378747
2016    73.750696
Name: Weight, dtype: float64

In [None]:
###
### AUTOGRADER TEST - DO NOT REMOVE
###


# Office Hours

In [224]:
# Pol talks about describe
df.describe()

Unnamed: 0,ID,Age,Height,Weight,Year
count,271116.0,261642.0,210945.0,208241.0,271116.0
mean,68248.954396,25.556898,175.33897,70.702393,1978.37848
std,39022.286345,6.393561,10.518462,14.34802,29.877632
min,1.0,10.0,127.0,25.0,1896.0
25%,34643.0,21.0,168.0,60.0,1960.0
50%,68205.0,24.0,175.0,70.0,1988.0
75%,102097.25,28.0,183.0,79.0,2002.0
max,135571.0,97.0,226.0,214.0,2016.0


In [227]:
n_row = df.shape[0]

In [228]:
n_columns = df.shape[1]

In [229]:
print(n_row,n_columns)

271116 15


In [265]:
import numpy as np
df1 = pd.DataFrame(np.zeros((4,4)),index="A B C D ".split(), columns="Q R S T".split())

In [272]:
df1

Unnamed: 0,Q,R,S,T
A,0.0,0.0,0.0,0.0
B,0.0,0.0,0.0,0.0
C,0.0,0.0,0.0,0.0
D,0.0,0.0,0.0,0.0


In [273]:
myseries = pd.Series([0.0,0.0,0.0,0.0])

In [274]:
type(myseries)

pandas.core.series.Series

In [277]:
df[df['Sex']=='M']

Unnamed: 0,ID,Name,Sex,Age,Height,Weight,Team,NOC,Games,Year,Season,City,Sport,Event,Medal
0,1,A Dijiang,M,24.0,180.0,80.0,China,CHN,1992 Summer,1992,Summer,Barcelona,Basketball,Basketball Men's Basketball,
1,2,A Lamusi,M,23.0,170.0,60.0,China,CHN,2012 Summer,2012,Summer,London,Judo,Judo Men's Extra-Lightweight,
2,3,Gunnar Nielsen Aaby,M,24.0,,,Denmark,DEN,1920 Summer,1920,Summer,Antwerpen,Football,Football Men's Football,
3,4,Edgar Lindenau Aabye,M,34.0,,,Denmark/Sweden,DEN,1900 Summer,1900,Summer,Paris,Tug-Of-War,Tug-Of-War Men's Tug-Of-War,Gold
10,6,Per Knut Aaland,M,31.0,188.0,75.0,United States,USA,1992 Winter,1992,Winter,Albertville,Cross Country Skiing,Cross Country Skiing Men's 10 kilometres,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
271111,135569,Andrzej ya,M,29.0,179.0,89.0,Poland-1,POL,1976 Winter,1976,Winter,Innsbruck,Luge,Luge Mixed (Men)'s Doubles,
271112,135570,Piotr ya,M,27.0,176.0,59.0,Poland,POL,2014 Winter,2014,Winter,Sochi,Ski Jumping,"Ski Jumping Men's Large Hill, Individual",
271113,135570,Piotr ya,M,27.0,176.0,59.0,Poland,POL,2014 Winter,2014,Winter,Sochi,Ski Jumping,"Ski Jumping Men's Large Hill, Team",
271114,135571,Tomasz Ireneusz ya,M,30.0,185.0,96.0,Poland,POL,1998 Winter,1998,Winter,Nagano,Bobsleigh,Bobsleigh Men's Four,
