![alt text](https://github.com/callysto/callysto-sample-notebooks/blob/master/notebooks/images/Callysto_Notebook-Banner_Top_06.06.18.jpg?raw=true)

# Olympics - Let's play the game of numbers

**Submitted by: A, B, C, D**

<table><tr>
<td> <img src="data/rings2.png" alt="Drawing" style="width: 620px;"/> </td>
<td> <img src="data/sports.png" alt="Drawing" style="width: 300px;"/> </td>
</tr></table>
    
The [Olympics](https://en.wikipedia.org/wiki/Olympic_Games) are the most prestigious sports competition in the world, with more than 200 countries participating in about 35 sports. Thousands of athletes from around the world take part in the Summer and Winter Games to showcase their abilities and make their countries proud.

On a regular day, you would be watching your favourite game or athlete at the Olympics. In this hackathon, however, let us play with a dataset about the Olympic Games. Along the way, you will hopefully come across some interesting findings that would be hard to discover otherwise, while learning some new coding/hacking skills.

## Getting ready

This section sets up, behind the scenes, everything required to follow this notebook smoothly. Most of the code cells in this section are *ready-to-run*, so you won't have to modify them. You also do not need to understand every task performed by these cells to complete the challenges. However, feel free to ask mentors about anything that makes you curious.

### 1. Install/Import libraries

Run the cell below to download and install the required Python libraries. The cell may take a few minutes to finish.

In [None]:
! pip install cufflinks ipywidgets

Run the next cells to load the libraries and pre-defined functions that will help us complete the challenges later.

In [None]:
!wget https://raw.githubusercontent.com/callysto/hackathon/master/Group3_Olympics/helper_code/olympics.py -P helper_code -nc

In [None]:
# load libraries
import pandas as pd
import cufflinks as cf
cf.go_offline()

# color palette with more than 20 colors
colors20 = ['#e6194b', '#3cb44b', '#ffe119', '#4363d8', '#f58231', '#911eb4', '#46f0f0', 
          '#f032e6', '#bcf60c', '#fabebe', '#008080', '#e6beff', '#9a6324', '#fffac8', 
          '#800000', '#aaffc3', '#808000', '#ffd8b1', '#000075', '#808080', '#ffffff', '#000000']


# to enable plotting in colab
def enable_plotly_in_cell():
    import IPython
    from plotly.offline import init_notebook_mode
    display(IPython.core.display.HTML('''
        <script src="/static/components/requirejs/require.js"></script>
  '''))
    init_notebook_mode(connected=False)
    
get_ipython().events.register('pre_run_cell', enable_plotly_in_cell)

# load helper code
from helper_code.olympics import *

### 2. Import data and create a dataframe
The Olympics dataset is available on [Kaggle](https://www.kaggle.com/heesoo37/120-years-of-olympic-history-athletes-and-results/data#athlete_events.csv), an online community of data scientists and machine learners and a well-known competition platform for predictive modeling and analytics.

For this hackathon, the dataset is stored in cloud storage. Let us import it into this notebook. Executing the cells below will also create a dataframe and reveal some interesting facts about the dataset.

In [None]:
# reading from cloud object storage
olympics_url = "https://swift-yeg.cloud.cybera.ca:8080/v1/AUTH_d22d1e3f28be45209ba8f660295c84cf/hackaton/olympics.csv"

In [None]:
# reading the input file and creating dataframe
olympics = pd.read_csv(olympics_url) 

In [None]:
# how many rows and columns does the dataframe have?
olympics.shape

Did you notice the number of rows in this dataset? 
- Is it possible to go through all the rows manually?
- Would you consider it a **big dataset**?

In [None]:
# what are the column names?
olympics.columns

Now you know which columns are there in the dataset, but what do those columns refer to? Here is the description for some of the columns from Kaggle:

- **ID** - Unique number for each athlete  
- **Name** - Athlete's name  
- **Sex** - M or F  
- **Age** - Integer  
- **Height** - In centimeters  
- **Weight** - In kilograms  
- **Team** - Team name  
- **NOC** - National Olympic Committee 3-letter code  
- **Games** - Year and season  
- **Year** - Integer  
- **Season** - Summer or Winter  
- **City** - Host city  
- **Sport** - Sport  
- **Event** - Event  
- **Medal** - Gold, Silver, Bronze, or NA  
- **region** - Country 
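
The list above says that **Medal** can be NA, and other columns may have gaps too. As a small sketch of how you could check this, the example below uses a tiny made-up dataframe (not the real data) and counts the missing values in each column with `isna().sum()`. The same two calls should work on `olympics`.

```python
import pandas as pd

# tiny made-up sample that mimics a few columns of the Olympics dataset
sample = pd.DataFrame({
    "Name": ["Athlete 1", "Athlete 2", "Athlete 3", "Athlete 4"],
    "Age": [23.0, None, 31.0, 27.0],
    "Medal": ["Gold", None, None, "Bronze"],
})

# data type that pandas picked for every column
print(sample.dtypes)

# number of missing (NA) values in every column
missing_counts = sample.isna().sum()
print(missing_counts)
```

A large number of missing values in **Medal** is expected, since most athletes do not win a medal.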

In [None]:
# display first 5 rows to explore what the data looks like
olympics.head()

Now everything is set up for crunching the Olympics dataset. Your group can go through the rest of the notebook and work on challenges.

**While working on the challenges, feel free to add new code/markdown cells as needed.**

## Part A: Number of participants by year

Let us determine how many athletes participated in each of the Olympics held so far.

In [None]:
# group by year and calculate number of rows for every group
athletes_by_year = olympics.groupby(["Year"]).size()

# create additional column "count" to store the number of athletes
athletes_by_year = athletes_by_year.reset_index(name='count')

# print year and number of athletes for first 5 olympics
athletes_by_year.head()
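
One thing to keep in mind: `size()` counts rows. If, as in the original Kaggle file, each row is one athlete in one event, then an athlete who entered several events is counted several times. The sketch below uses a made-up sample to compare the number of rows per year with the number of distinct athletes per year, assuming the **ID** column identifies an athlete.

```python
import pandas as pd

# made-up sample: athlete 1 competes in two events in 2000
sample = pd.DataFrame({
    "ID": [1, 1, 2, 3, 3],
    "Year": [2000, 2000, 2000, 2004, 2004],
    "Event": ["100m", "200m", "Long Jump", "100m", "200m"],
})

# number of rows (athlete-event entries) per year
entries_by_year = sample.groupby("Year").size()

# number of distinct athletes per year
unique_by_year = sample.groupby("Year")["ID"].nunique()

print(entries_by_year)
print(unique_by_year)
```

Either count can be a reasonable measure of participation; just be clear about which one a chart is showing.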

In [None]:
# maximum value of each column (Year and count are evaluated separately)
athletes_by_year.max()
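
Note that `max()` on a dataframe looks at each column on its own, so the year and the count it prints do not have to come from the same row. As a sketch on made-up numbers, `idxmax()` finds the row with the largest count, and `loc[]` then shows the year that goes with it.

```python
import pandas as pd

# made-up counts with the same layout as athletes_by_year
counts = pd.DataFrame({
    "Year": [1996, 2000, 2004],
    "count": [500, 700, 600],
})

# max() works on each column separately
print(counts.max())   # Year 2004 and count 700 come from different rows

# idxmax() gives the row label of the largest count,
# and loc[] fetches that whole row
busiest = counts.loc[counts["count"].idxmax()]
print(busiest)
```

The same idea works for the smallest value with `idxmin()`.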

In [None]:
# creating a line graph
athletes_by_year.set_index("Year").iplot(xTitle="Year",yTitle="Number of participants")

### Challenges:

- Which Olympics had the minimum number of participants, and when was it held? (use the `min()` function)
- Create a bar chart or an area plot by changing `iplot()` to `iplot(kind="bar")` or `iplot(kind="area",fill=True)`. Which plot helps you understand the data better?

You might have observed unusual behaviour in the plot after 1992. Let us not worry about that for the moment; you will find the reason while solving the next set of challenges.


## Part B: Number of participants by year and season

Let us find out how many athletes participated in the Summer and Winter Olympic Games in a given year.

In [None]:
# call pre-defined function "get_counts_by_group()" 
athletes_by_season = get_counts_by_group(olympics, "Season")

athletes_by_season.head()
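
The helper `get_counts_by_group()` lives in `helper_code/olympics.py`, and its actual code is not shown in this notebook. Judging by how its result is used (one row per year, one column per group value), a rough sketch of what such a helper might do is given below. The function name `counts_by_group_sketch` and the sample data are made up for illustration.

```python
import pandas as pd

def counts_by_group_sketch(df, group_col):
    """Hypothetical stand-in for get_counts_by_group(): rows per Year and group."""
    counts = df.groupby(["Year", group_col]).size()
    # turn the values of group_col into columns; missing combinations become 0
    return counts.unstack(group_col, fill_value=0)

# made-up sample data
sample = pd.DataFrame({
    "Year": [1992, 1992, 1992, 1994, 1996],
    "Season": ["Summer", "Summer", "Winter", "Winter", "Summer"],
})

table = counts_by_group_sketch(sample, "Season")
print(table)
```

Having one column per group is what makes the stacked bar chart straightforward: every column becomes one layer of the stack.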

In [None]:
# create a stacked bar chart
athletes_by_season.iplot(kind="bar", barmode="stack",xTitle="Year",yTitle="Number of participants")

This chart contains a lot of useful information. Let us see how much of it you noticed.

### Challenges:

- When were the first Winter Olympics held?
- Did you see any change in the hosting pattern of the Winter Olympics after 1992?
- The bar chart is missing some bars around 1920 and 1940. Why? (Hint: [Olympic Games](https://en.wikipedia.org/wiki/Olympic_Games))

Let's find the year with the most participants in the Summer Olympic Games. We will use the `sort_values()` function.

In [None]:
# sort_values() function - sorts by a column or set of columns
athletes_by_season.sort_values("Summer", ascending = False).head(10)
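
To see what `sort_values()` does without the full dataset, here is a small sketch on a made-up table that is assumed to have the same layout as `athletes_by_season`. It also shows `nlargest()`, a pandas shortcut for sorting in descending order and keeping the top rows.

```python
import pandas as pd

# made-up table, assumed to have the same layout as athletes_by_season
table = pd.DataFrame(
    {"Summer": [300, 0, 450], "Winter": [80, 120, 0]},
    index=pd.Index([1992, 1994, 1996], name="Year"),
)

# largest Summer value first
by_summer = table.sort_values("Summer", ascending=False)
print(by_summer)

# nlargest() sorts in descending order and keeps only the top n rows
top_two = table.nlargest(2, "Summer")
print(top_two)
```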

### Challenges:

- Find the number of participants by year and sex (using the **Sex** column) and create a stacked bar chart.
- Which year had the most female participants?

## Part C: Number of medals by country and season

Let us count the number of medals won by countries in a given season.

In [None]:
# we will keep only the rows for athletes who got medals
medals = olympics.dropna(subset=["Medal"])

# let's select only the Winter season
medals_winter = medals[medals["Season"]=="Winter"]

# use predefined function to group by year and country and then calculate number of rows
medals_by_region = get_counts_by_group(medals_winter, "region")

# display top 5 rows
medals_by_region.head()
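
The first step above relies on `dropna(subset=["Medal"])`. As a sketch on made-up data, the example below shows why the `subset` argument matters: with it, only a missing **Medal** removes a row; without it, a missing value in any column does.

```python
import pandas as pd

# made-up sample with gaps in two different columns
sample = pd.DataFrame({
    "Name": ["Athlete 1", "Athlete 2", "Athlete 3"],
    "Height": [180.0, None, 165.0],
    "Medal": ["Gold", "Silver", None],
})

# drop rows only when Medal is missing; a missing Height is ignored
with_medal = sample.dropna(subset=["Medal"])
print(with_medal)

# without subset, a missing value in any column drops the row
complete_rows = sample.dropna()
print(complete_rows)
```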

In [None]:
# display data only for some countries; plotting all of them would make the chart too messy
countries_list = ["Canada","Russia","USA","Norway","Japan","China"]

# get the subset containing data for the countries in the above list
medals_subset = medals_by_region[countries_list]

# create an area chart
medals_subset.iplot(kind="area",fill=True,xTitle="Year",yTitle="Number of medals")

### Challenges:

- List the names of the countries that won medal(s) in the Summer Olympics. (Hint: print the column names of the `medals_by_region` dataframe)
- In which season is Canada more successful in winning medals? Summer or winter?
- In which year did Canada win the most medals in the Winter Olympics?
- In which year did Canada win the most medals in the Summer Olympics?

## Part D: Sport-wise medal table for a country in a selected year

Let us find out how many gold, silver, and bronze medals a country (e.g. Canada) won in different sports at a specific Olympics (e.g. 1984). Here, you have to specify the country and the year of the Olympics for which you want the medal table.

**Note:** If you enter a country or year that doesn't exist in the dataset, the code will give an error. Execute the cell again to start over.
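
If you would like a friendlier message than a raw error, one option is to check the input before filtering. The sketch below is only an illustration on made-up data: `check_selection` is a hypothetical helper (it is not part of `helper_code`), and it assumes the dataframe has **region** and **Year** columns.

```python
import pandas as pd

def check_selection(df, country, year_text):
    """Hypothetical helper: return a message describing the problem,
    or None if the input can be used."""
    if not year_text.strip().isdigit():
        return "Year must be a whole number, for example 1984."
    year = int(year_text)
    if country not in df["region"].dropna().unique().tolist():
        return f"No country named {country!r} in the data."
    if year not in df["Year"].unique().tolist():
        return f"No Olympics in {year} in the data."
    return None

# made-up sample data
sample = pd.DataFrame({
    "region": ["Canada", "Canada", "Norway"],
    "Year": [1984, 1988, 1984],
    "Medal": ["Gold", "Bronze", "Silver"],
})

print(check_selection(sample, "Canada", "1984"))   # usable input
print(check_selection(sample, "Canda", "1984"))    # misspelled country
print(check_selection(sample, "Canada", "soon"))   # not a number
```

Country names must match the spelling in the data exactly, including capital letters.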

In [None]:
# read user input for country name
print("Enter country: ")
country = input()

# read user input for year
print("Enter year: ")
year = input()

# subset by specific year and country
medals_by_country = medals[(medals["Year"]==float(year)) 
                            &(medals["region"]==country)]

# count number of rows
medals_by_kind = medals_by_country.groupby(["Medal"]).size()

# create additional column "count" to store the number of medals
medals_by_kind = medals_by_kind.reset_index(name='count')

# show the dataframe
medals_by_kind

In [None]:
# create a Pie chart
medals_by_kind.iplot(kind="pie", labels="Medal",values="count")

In [None]:
# use pre-defined function to get medal counts by sport
medal_by_sport = get_counts_by_medal(medals_by_country)

medal_by_sport
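
The code of `get_counts_by_medal()` is not shown here either. Since its result is plotted with sports on the x-axis and stacked by medal, a plausible sketch is a cross-tabulation of **Sport** against **Medal**. The function name `counts_by_medal_sketch`, the column order, and the sample data below are assumptions made for illustration.

```python
import pandas as pd

def counts_by_medal_sketch(df):
    """Hypothetical stand-in for get_counts_by_medal(): medals per Sport and kind."""
    table = pd.crosstab(df["Sport"], df["Medal"])
    # keep a fixed column order and add any medal kind that never occurs
    return table.reindex(columns=["Gold", "Silver", "Bronze"], fill_value=0)

# made-up sample data
sample = pd.DataFrame({
    "Sport": ["Rowing", "Rowing", "Swimming", "Rowing"],
    "Medal": ["Gold", "Gold", "Silver", "Bronze"],
})

table = counts_by_medal_sketch(sample)
print(table)
```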

In [None]:
# create a bar chart
medal_by_sport.iplot(kind = "bar", barmode = "stack",xTitle="Sport",yTitle="Count")

### Challenges:

- Create the sport-wise medal table for Russia in the 1980 Olympics.
  - In which sports did Russia win all three kinds of medals (gold, silver, and bronze)?
  - In which sports did Russia win the highest number of medals?
  - List the top 5 sports in which Russia won the highest number of *gold* medals. (Hint: Sort the `medal_by_sport` dataframe using the `sort_values()` function)

## Part E: Number of participants with/without medal

Let us try to gauge how successful athletes are at winning medals at the Olympics. The code cell below plots a stacked bar chart showing the number of Canadian athletes with and without medals in various sports at the 1984 Winter Olympics.

In [None]:
# use pre-defined function to get participants counts by sport
summary = get_participation_counts(olympics ,year=1984, season="Winter", country="Canada")

# create a stacked bar chart
summary.iplot(kind= "bar", barmode="stack",xTitle="Sport",yTitle="Number of participants")
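
As with the other helpers, the code of `get_participation_counts()` is not shown in this notebook. One way such a summary could be built is sketched below on made-up data: filter by year, season, and country, label each row as with or without a medal, and count per sport. The function name `participation_sketch` and the column labels are assumptions.

```python
import pandas as pd

def participation_sketch(df, year, season, country):
    """Hypothetical stand-in for get_participation_counts()."""
    subset = df[(df["Year"] == year)
                & (df["Season"] == season)
                & (df["region"] == country)]
    # a row counts as "With medal" when the Medal value is not missing
    status = subset["Medal"].notna().map(
        {True: "With medal", False: "Without medal"})
    table = pd.crosstab(subset["Sport"], status)
    return table.reindex(columns=["With medal", "Without medal"], fill_value=0)

# made-up sample data
sample = pd.DataFrame({
    "Year": [1984, 1984, 1984, 1984, 1988],
    "Season": ["Winter"] * 5,
    "region": ["Canada", "Canada", "Canada", "Norway", "Canada"],
    "Sport": ["Ice Hockey", "Speed Skating", "Speed Skating", "Biathlon", "Curling"],
    "Medal": [None, "Gold", None, "Silver", None],
})

summary_sketch = participation_sketch(sample, 1984, "Winter", "Canada")
print(summary_sketch)
```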

### Challenges:

Plot a similar bar chart for Canadian athletes at the 2014 Winter Olympics.
   - List the sports in which Canada won medals in the 2014 but not in the 1984 Winter Olympics.
   - In which sport did all the Canadian athletes win medals at the 2014 Winter Olympics?

Feel free to play with different countries and years to identify the sports in which athletes were more successful.

## Summary

This workbook analyzes the **Olympics** dataset from Kaggle with the help of Python code cells. The number of participants and the medals they won are analyzed for various countries and Olympic Games. A sport-wise medal table is also prepared and visualized using interactive plots while addressing the associated challenges.

By taking part in this hackathon and completing these challenges, you learnt how to analyze a big dataset that would be impractical to go through manually and how to create visualizations. Most importantly, you developed [*computational thinking*](https://en.wikipedia.org/wiki/Computational_thinking) abilities that can be used to solve a variety of problems.

![alt text](https://github.com/callysto/callysto-sample-notebooks/blob/master/notebooks/images/Callysto_Notebook-Banners_Bottom_06.06.18.jpg?raw=true)