![Callysto.ca Banner](https://github.com/callysto/curriculum-notebooks/blob/master/callysto-notebook-banner-top.jpg?raw=true)

In [None]:
#from IPython.display import HTML, display
#display(HTML("<table><tr><td><img src='data/rings2.png' width='620'></td><td><img src='data/sports.png' width='300'></td></tr></table>"))

### Prep work

In [None]:
#the libraries should be installed already; if not, uncomment the next line
#!pip install cufflinks ipywidgets

Run the next cells to load libraries and pre-defined functions:

In [None]:
!wget https://raw.githubusercontent.com/callysto/hackathon/master/Group3_Olympics/helper_code/olympics.py -P helper_code -nc

In [None]:
import pandas as pd

import cufflinks as cf
cf.go_offline()

colors20 = ['#e6194b', '#3cb44b', '#ffe119', '#4363d8', '#f58231', '#911eb4', '#46f0f0', 
          '#f032e6', '#bcf60c', '#fabebe', '#008080', '#e6beff', '#9a6324', '#fffac8', 
          '#800000', '#aaffc3', '#808000', '#ffd8b1', '#000075', '#808080', '#ffffff', '#000000']


#to enable plotting in colab
def enable_plotly_in_cell():
    #import display and HTML explicitly so the function does not rely on notebook globals
    from IPython.display import HTML, display
    from plotly.offline import init_notebook_mode
    display(HTML('''
        <script src="/static/components/requirejs/require.js"></script>
  '''))
    init_notebook_mode(connected=False)
    
get_ipython().events.register('pre_run_cell', enable_plotly_in_cell)

#helper code
from helper_code.olympics import *

# Group goal

 
Go through the analysis below and work on the challenges.


**Extra challenge**:

Is there anything else interesting you can find and visualize for these data? 

### Getting data
The Olympics dataset was downloaded from [Kaggle](https://www.kaggle.com/heesoo37/120-years-of-olympic-history-athletes-and-results/data#athlete_events.csv).

**Kaggle** is an online community of data scientists and machine learners, best known as a competition platform for predictive modeling and analytics.

In [None]:
#reading from cloud object storage
olympics_url = "https://swift-yeg.cloud.cybera.ca:8080/v1/AUTH_d22d1e3f28be45209ba8f660295c84cf/hackaton/olympics.csv"


In [None]:
olympics = pd.read_csv(olympics_url)

In [None]:
#how many rows and columns does the dataframe have?
olympics.shape

In [None]:
#what are the column names?
olympics.columns

Here is the column description from Kaggle:

**ID** - Unique number for each athlete  
**Name** - Athlete's name  
**Sex** - M or F  
**Age** - Integer  
**Height** - In centimeters  
**Weight** - In kilograms  
**Team** - Team name  
**NOC** - National Olympic Committee 3-letter code  
**Games** - Year and season  
**Year** - Integer  
**Season** - Summer or Winter  
**City** - Host city  
**Sport** - Sport  
**Event** - Event  
**Medal** - Gold, Silver, Bronze, or NA  
**region** - Country 

In [None]:
#display first 5 rows to explore what the data looks like
olympics.head()
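
The column description says that **Medal** can be NA, and other columns may have gaps too. One way to see how many values are missing in each column is sketched below on a small made-up table, so it runs without the real data. The name `toy` and all values in it are invented for illustration.

```python
import pandas as pd

# Toy stand-in for the Olympics table; the values are made up for illustration
toy = pd.DataFrame({
    "Name": ["Athlete A", "Athlete B", "Athlete C", "Athlete D"],
    "Age": [24, None, 31, 19],
    "Medal": ["Gold", None, None, "Bronze"],
})

# isna() marks missing cells as True; sum() counts the True values per column
missing_per_column = toy.isna().sum()
print(missing_per_column)
```

The same call on the real table would be `olympics.isna().sum()`.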

### Number of participants by year

In [None]:
#let's group by year and calculate the number of rows for every group
athletes_by_year = olympics.groupby(["Year"]).size()

#create additional column "count" to store the number of athletes
athletes_by_year = athletes_by_year.reset_index(name='count')

#print first 5 years and number of athletes
athletes_by_year.head()
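
Note that `size()` counts rows. Assuming each row is one athlete in one event (the Kaggle file is named `athlete_events.csv`), an athlete who enters several events is counted several times. The sketch below, on made-up data, contrasts the row count with the number of distinct athlete IDs. The table `toy` is hypothetical.

```python
import pandas as pd

# Toy data (made up): athlete 1 enters two events in 1992, athlete 3 enters three in 1996
toy = pd.DataFrame({
    "ID":   [1, 1, 2, 3, 3, 3],
    "Year": [1992, 1992, 1992, 1996, 1996, 1996],
})

# Number of rows per year (what size() counts)
rows_by_year = toy.groupby("Year").size()

# Number of distinct athlete IDs per year
unique_by_year = toy.groupby("Year")["ID"].nunique()

print(rows_by_year)
print(unique_by_year)
```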

In [None]:
#what is the maximum number of participants? (max() returns the maximum of each column)
athletes_by_year.max()
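
Calling `max()` on the whole table gives the largest value of each column separately, so it does not say which year had the largest count. A minimal sketch of one way to get the matching row, using `idxmax()` on made-up numbers:

```python
import pandas as pd

# Toy counts (made up), shaped like athletes_by_year
toy_counts = pd.DataFrame({
    "Year": [1988, 1992, 1996],
    "count": [10, 25, 18],
})

# idxmax() returns the row label of the largest value in the "count" column
busiest_row = toy_counts.loc[toy_counts["count"].idxmax()]
print(busiest_row["Year"], busiest_row["count"])
```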

In [None]:
#creating a line graph

athletes_by_year.set_index("Year").iplot(xTitle="Year",yTitle="Number of participants")

### Challenge

Find the minimum number of Olympics participants using the `min()` function.

Experiment with different kinds of plots:

 - Create a new cell by copying the call above, then change `iplot()` to `iplot(kind="bar")`, `iplot(kind="barh")`, or `iplot(kind="area",fill=True)`. Which plot helps you understand the data best?
 
 - What interesting things do you notice in this plot? What do you think happened between the years 1992 and 1994?

### Number of participants by year and by season

In [None]:
#in this case we call function "get_counts_by_group()" 
athletes_by_season = get_counts_by_group(olympics, "Season")

athletes_by_season.head()
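
The source of `get_counts_by_group()` is not shown in this notebook. Judging only by how its result is used (years as the index, one column per group value), a function with similar behaviour might look like the sketch below. The name `counts_by_group_sketch`, the toy data, and the choice to fill empty combinations with 0 are all assumptions, not the actual helper.

```python
import pandas as pd

def counts_by_group_sketch(df, column):
    """Hypothetical stand-in: count rows per Year and per value of `column`."""
    counts = df.groupby(["Year", column]).size()
    # unstack() turns the values of `column` into separate columns;
    # fill_value=0 is an assumption for year/group pairs with no rows
    return counts.unstack(fill_value=0)

# Toy data (made up for illustration)
toy = pd.DataFrame({
    "Year":   [1992, 1992, 1992, 1994, 1996, 1996],
    "Season": ["Summer", "Summer", "Winter", "Winter", "Summer", "Summer"],
})

toy_by_season = counts_by_group_sketch(toy, "Season")
print(toy_by_season)
```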

In [None]:
athletes_by_season.iplot(kind="bar", barmode="stack",xTitle="Year",yTitle="Number of participants")

It looks like the Summer and Winter Olympics were held in the same year before 1994!
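
The observation about both seasons sharing a year can also be checked in code rather than read off the chart. The sketch below keeps only the years in which both columns are above zero. It uses a made-up table shaped like `athletes_by_season`; the numbers are not real.

```python
import pandas as pd

# Toy table (made up), shaped like athletes_by_season: one row per year
toy_by_season = pd.DataFrame(
    {"Summer": [5, 6, 0, 7], "Winter": [2, 3, 4, 0]},
    index=pd.Index([1988, 1992, 1994, 1996], name="Year"),
)

# Keep only the years in which both seasons have at least one row
both_seasons = toy_by_season[(toy_by_season["Summer"] > 0) & (toy_by_season["Winter"] > 0)]
print(list(both_seasons.index))
```

Each comparison is wrapped in parentheses because `&` binds more tightly than `>` in Python.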

Let's find the year with the most participants in the Summer season. We will do this using the `sort_values()` function:

In [None]:
athletes_by_season.sort_values("Summer", ascending = False).head(10)

### Challenge

 - Using the example above, create new cell(s) and find the number of participants by year and by sex (using the "Sex" column)
 - Which year had the most female participants?

### Number of medals by country by year

In [None]:
#we will keep only the rows for athletes who got medals
medals = olympics.dropna(subset=["Medal"])

#let's select only the Winter season
medals_winter = medals[medals["Season"]=="Winter"]

#grouping by year and country and calculating the number of rows
medals_by_region = get_counts_by_group(medals_winter, "region")

#displaying first 5 rows
medals_by_region.head()
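
Counting rows here counts medal-winning athletes. Assuming each row is one athlete in one event, a team medal appears once for every team member, so these numbers can be higher than an official medal table. A rough sketch of one way to count each medal once per event, on made-up rows:

```python
import pandas as pd

# Toy medal rows (made up): three players on one gold-medal team, plus one solo silver
toy_medals = pd.DataFrame({
    "Year":   [1984, 1984, 1984, 1984],
    "Event":  ["Team Event", "Team Event", "Team Event", "Solo Event"],
    "Medal":  ["Gold", "Gold", "Gold", "Silver"],
    "region": ["Canada", "Canada", "Canada", "Canada"],
})

# Counting rows counts every medal-winning athlete
medals_per_athlete = len(toy_medals)

# Dropping repeated (Year, Event, Medal, region) combinations counts each team medal once
medals_per_event = len(toy_medals.drop_duplicates(subset=["Year", "Event", "Medal", "region"]))

print(medals_per_athlete, medals_per_event)
```

This is only an approximation: it would also merge two separate medals of the same kind won by one country in the same event.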

In [None]:
#we will display data for only a few countries; there are too many to plot all of them without the chart getting messy
medals_subset = medals_by_region[["Canada","Russia","USA","Norway","Japan","China"]]

medals_subset.iplot(kind="area",fill=True,xTitle="Year",yTitle="Number of medals")

### Challenge
 - Using the example above, create new code cell(s) and display the number of medals for the Summer Olympics
 - Is Canada more successful at winning medals in the Winter or in the Summer Olympics?
 - In which year did Canada win the most medals in the Winter Olympics? In the Summer Olympics?
     

### Extra

We can choose a country using interactive input.

**Note**: if you enter a country that doesn't exist in the data set, the code will raise an error. Re-run the cell to start over.

In [None]:
print("Enter country: ")

country = input()

medals_subset1 = medals_by_region[country]

medals_subset1.iplot(kind="area",fill=True,xTitle="Year",yTitle="Number of medals")
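
To avoid the error for an unknown country, the name can be checked against the table's columns before it is used. The sketch below shows the idea with a hypothetical helper, `pick_country`, and a made-up table shaped like `medals_by_region`. It uses fixed strings instead of `input()` so it runs on its own.

```python
import pandas as pd

def pick_country(table, name):
    """Return the column for `name`, or None if the table has no such column."""
    if name in table.columns:
        return table[name]
    print(f"'{name}' was not found. Check the spelling and capitalization.")
    return None

# Toy table (made up), shaped like medals_by_region
toy_by_region = pd.DataFrame(
    {"Canada": [1, 4], "Norway": [9, 7]},
    index=pd.Index([1988, 1992], name="Year"),
)

found = pick_country(toy_by_region, "Canada")
missing = pick_country(toy_by_region, "Atlantis")
```

In the notebook, the result would be plotted only when it is not `None`.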

### For Canada at the 1984 Summer Olympics, how many gold/silver/bronze medals in total and by sport

In [None]:
# subset by specific year, country and season
medals_by_country = medals[(medals["Season"]=="Summer") 
                            &(medals["Year"]==1984) 
                            &(medals["region"]=="Canada")]

In [None]:
#count number of rows
medals_by_kind = medals_by_country.groupby(["Medal"]).size()

#create additional column "count" to store the number of medals
medals_by_kind = medals_by_kind.reset_index(name='count')

medals_by_kind
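
As an alternative to `groupby(...).size()`, pandas also offers `value_counts()`, which counts how often each value occurs in a single column. A minimal sketch on a made-up medal column:

```python
import pandas as pd

# Toy medal column (made up for illustration)
toy = pd.DataFrame({"Medal": ["Gold", "Silver", "Gold", "Bronze", "Gold", "Silver"]})

# value_counts() counts how often each value occurs, most frequent first
medal_counts = toy["Medal"].value_counts()

# Turn the result into a two-column table like medals_by_kind
medal_table = medal_counts.rename_axis("Medal").reset_index(name="count")
print(medal_table)
```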

In [None]:
#using a new kind of plot, a pie chart; note that labels and values need to be set to specific columns
medals_by_kind.iplot(kind="pie", labels="Medal",values="count")

In [None]:
# calling function to get medal counts by sport
medal_by_sport = get_counts_by_medal(medals_by_country)

medal_by_sport
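
The source of `get_counts_by_medal()` is not shown here either. A table of medal counts by sport can be built in plain pandas with `pd.crosstab()`, as in the sketch below. Whether the helper works this way is an assumption, and the toy rows are made up.

```python
import pandas as pd

# Toy medal rows (made up for illustration)
toy = pd.DataFrame({
    "Sport": ["Rowing", "Rowing", "Swimming", "Swimming", "Swimming"],
    "Medal": ["Gold", "Silver", "Gold", "Gold", "Bronze"],
})

# crosstab() counts rows for every Sport/Medal pair; missing pairs become 0
toy_by_sport = pd.crosstab(toy["Sport"], toy["Medal"])
print(toy_by_sport)
```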

In [None]:
#note: barmode ='stack'  means bars stack on top of each other
medal_by_sport.iplot(kind = "bar", barmode = "stack",xTitle="Sport",yTitle="Count")

### Challenge

- Using the example above, create new cell(s) and analyze the number of medals for Russia in Summer 1980
  - What was the location of these Olympic games?

## Extra

In the plot below we can compare the number of participants with the number of medals. Feel free to play with different years, countries, and seasons.

In [None]:
summary = get_participation_counts(olympics ,year=1984, season="Summer", country="Canada")

summary.iplot(kind= "bar", barmode="stack",xTitle="Year",yTitle="Number of participants")
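
The internals of `get_participation_counts()` are not shown in this notebook. One simple way to get comparable numbers in plain pandas is to count all rows and the rows with a medal side by side. The sketch below does this per sport on made-up data. Grouping by `Sport` and the column names `entries` and `medals` are assumptions for illustration.

```python
import pandas as pd

# Toy rows (made up): one row per athlete per event, Medal is missing when none was won
toy = pd.DataFrame({
    "Sport": ["Rowing", "Rowing", "Rowing", "Swimming", "Swimming"],
    "Medal": ["Gold", None, None, "Bronze", None],
})

# "size" counts all rows per sport; "count" skips missing values, so it counts medal rows only
toy_summary = toy.groupby("Sport")["Medal"].agg(entries="size", medals="count")
print(toy_summary)
```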

[![Callysto.ca License](https://github.com/callysto/curriculum-notebooks/blob/master/callysto-notebook-banner-bottom.jpg?raw=true)](https://github.com/callysto/curriculum-notebooks/blob/master/LICENSE.md)