# Live Code Lab - Data Science Introduction

## Register for Data Science Leaderboards
Before beginning the Live Code Lab Quiz, please sign up for the Data Science Leaderboard.

After you choose a username and password and sign up through this Jupyter notebook, your username will appear on the leaderboard at:

- http://ds-leaderboards.com

You may work alone or in groups of up to 3. The team with the highest rank at the end of the lab will receive an OZK Labs sponsored prize!

In [None]:
# Imports
import pandas as pd
import numpy as np

# API used to interact with the leaderboard app
import api.Leaderboard as leaderboard

### Please Enter Your Desired Username and Password Below:

In [None]:
username = 'dannydk6'
password = 'super'

leaderboard.signup(username=username,password=password)

## Part 1
The following code loads the olympics dataset (olympics.csv), which was derived from the Wikipedia entry on [All Time Olympic Games Medals](https://en.wikipedia.org/wiki/All-time_Olympic_Games_medal_table), and does some basic data cleaning.

The columns are organized as # of Summer games, Summer medals, # of Winter games, Winter medals, total # of games, total # of medals. Use this dataset to answer the questions below.

In [None]:
import pandas as pd

df = pd.read_csv('csv/olympics.csv', index_col=0, skiprows=1)

for col in df.columns:
    if col[:2]=='01':
        df.rename(columns={col:'Gold'+col[4:]}, inplace=True)
    if col[:2]=='02':
        df.rename(columns={col:'Silver'+col[4:]}, inplace=True)
    if col[:2]=='03':
        df.rename(columns={col:'Bronze'+col[4:]}, inplace=True)
    if col[:1]=='№':
        df.rename(columns={col:'#'+col[1:]}, inplace=True)

names_ids = df.index.str.split(r'\s\(') # split the index at the whitespace before '(' (raw string avoids an invalid-escape warning)

df.index = names_ids.str[0] # the [0] element is the country name (new index) 
df['ID'] = names_ids.str[1].str[:3] # the [1] element is the abbreviation or ID (take first 3 characters from that)

df = df.drop('Totals')
df.sample(7)
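
To see what the index-splitting step above does in isolation, here is a minimal sketch on a made-up two-row frame. The `toy` frame and its values are hypothetical and are not taken from olympics.csv; only the `str.split` / `.str[0]` / `.str[1].str[:3]` pattern mirrors the cleaning code.

```python
import pandas as pd

# Hypothetical frame whose index mimics the raw "Name (CODE)" labels.
toy = pd.DataFrame({"Gold": [0, 5]},
                   index=["Afghanistan (AFG)", "Algeria (ALG)"])

# Split each label at the whitespace that precedes the opening parenthesis.
parts = toy.index.str.split(r"\s\(")

toy.index = parts.str[0]          # left piece: the country name
toy["ID"] = parts.str[1].str[:3]  # right piece: keep the first 3 characters

print(toy)
```

Slicing with `.str[:3]` drops the closing parenthesis and anything after it, which is why the cleaning code does not need a second split.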

### Question 0 (Example)

What is the name of the country in the first row of df?

Points: 5

*This function should return a String.*

In [None]:
# You should write your whole answer within the function provided. The autograder will call
# these functions and compare the return value against the correct solution value
def answer_zero():
    # This function returns the name of the country in the first row (Afghanistan) as a string;
    # df.iloc[0] on its own would return the whole row as a Series. The assignment
    # question description will tell you the general format the autograder is expecting
    return df.iloc[0].name

# You can examine what your function returns by calling it in the cell. If you have questions
# about the assignment formats, feel free to ask me or any of the TAs!
answer_zero()

In [None]:
# Submit question 0 to Leaderboards
leaderboard.submit_question(qnumber=0,
                            response=answer_zero(),
                            username=username,
                            password=password)

### Question 1
Which country has won the most gold medals in summer games?

Points: 10

*This function should return a single string value.*

In [None]:
df[df.Gold == df.Gold.max()].iloc[0].name

In [None]:
def answer_one():
    return "YOUR ANSWER HERE"

In [None]:
# Submit question 1 to Leaderboards
leaderboard.submit_question(qnumber=1,
                            response=answer_one(),
                            username=username,
                            password=password)

### Question 2
Which country had the biggest difference between their summer and winter gold medal counts?

Points: 10

*This function should return a single string value.*

In [None]:
def answer_two():
    return "YOUR ANSWER HERE"

In [None]:
answer_two()

In [None]:
# Submit question 2 to Leaderboards
leaderboard.submit_question(qnumber=2,
                            response=answer_two(),
                            username=username,
                            password=password)

### Question 3
Which country has the biggest difference between their summer gold medal counts and winter gold medal counts relative to their total gold medal count? 

$$\frac{Summer~Gold - Winter~Gold}{Total~Gold}$$

Only include countries that have won at least 1 gold in both summer and winter.

Points: 10

*This function should return a single string value.*
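
As a worked illustration of the formula above, the sketch below evaluates it on made-up numbers. The `toy` frame and its column names (`summer`, `winter`, `total`) are hypothetical and differ from the real column names in `df`; it only shows the filter-then-divide pattern.

```python
import pandas as pd

# Hypothetical gold-medal counts for three made-up countries.
toy = pd.DataFrame({"summer": [10, 4, 0], "winter": [5, 1, 3]},
                   index=["A", "B", "C"])
toy["total"] = toy["summer"] + toy["winter"]

# Keep only rows with at least one gold in both seasons (this drops "C").
both = toy[(toy["summer"] > 0) & (toy["winter"] > 0)]

# (Summer Gold - Winter Gold) / Total Gold, computed row by row.
ratio = (both["summer"] - both["winter"]) / both["total"]

print(ratio.idxmax())  # index label of the largest ratio
```

Here A scores (10 - 5) / 15 and B scores (4 - 1) / 5, so filtering first matters: without it, a country with zero golds in one season would dominate the ratio.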

In [None]:
def answer_three():
    return "YOUR ANSWER HERE"

In [None]:
# Submit question 3 to Leaderboards
leaderboard.submit_question(qnumber=3,
                            response=answer_three(),
                            username=username,
                            password=password)

### Question 4
Write a function that creates a Series called "Points", a weighted total in which each gold medal (`Gold.2`) counts for 3 points, each silver medal (`Silver.2`) for 2 points, and each bronze medal (`Bronze.2`) for 1 point. The function should return only that Series, with the country names as its index.

Points: 15

*This function should return a Series named `Points` of length 146*
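
The weighting scheme can be sketched on made-up data as follows. The `toy` frame and its short column names (`g`, `s`, `b`) are hypothetical stand-ins, not the real `Gold.2`, `Silver.2`, and `Bronze.2` columns.

```python
import pandas as pd

# Hypothetical medal counts for two made-up countries.
toy = pd.DataFrame({"g": [1, 0], "s": [2, 1], "b": [0, 3]},
                   index=["A", "B"])

# Weighted sum: 3 points per gold, 2 per silver, 1 per bronze.
points = toy["g"] * 3 + toy["s"] * 2 + toy["b"] * 1
points.name = "Points"  # the Series carries its own name

print(points)
```

Arithmetic between columns is aligned on the index, so the result keeps the country labels without any extra work.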

In [None]:
def answer_four():
    return 'YOUR ANSWER HERE'

In [None]:
# Submit question 4 to Leaderboards
leaderboard.submit_four(answer_four(),
                        username=username,
                        password=password)

## Part 2
For the next set of questions, we will be using census data from the [United States Census Bureau](http://www.census.gov). Counties are political and geographic subdivisions of states in the United States. This dataset contains population data for counties and states in the US from 2010 to 2015. [See this document](https://www2.census.gov/programs-surveys/popest/technical-documentation/file-layouts/2010-2015/co-est2015-alldata.pdf) for a description of the variable names.

The census dataset (census.csv) should be loaded as census_df. Answer questions using this as appropriate.

### Question 5
Which state has the most counties in it? (Hint: consider the `SUMLEV` column carefully! You'll need it for future questions too.)

Points: 10

*This function should return a single string value.*
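
The summary-level hint above can be illustrated on a made-up frame. This sketch assumes the Census layout in which the summary-level column is named `SUMLEV`, with 40 marking state-level rows and 50 marking county rows; check the linked file-layout document before relying on it. The state and county names are invented.

```python
import pandas as pd

# Hypothetical rows: each state has one state-level row (40) plus county rows (50).
toy = pd.DataFrame({
    "SUMLEV":  [40, 50, 50, 40, 50],
    "STNAME":  ["Alpha", "Alpha", "Alpha", "Beta", "Beta"],
    "CTYNAME": ["Alpha", "Ash County", "Birch County", "Beta", "Cedar County"],
})

# Drop the state-level rows so they are not counted as counties.
counties = toy[toy["SUMLEV"] == 50]

# Count the remaining rows per state.
per_state = counties["STNAME"].value_counts()

print(per_state.idxmax())
```

Without the filter, every state would be credited with one extra "county" (its own summary row), and population totals would be double counted.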

In [None]:
census_df = pd.read_csv('csv/census.csv')
census_df.head()

In [None]:
def answer_five():
    return "YOUR ANSWER HERE"

In [None]:
answer_five()

In [None]:
# Submit question 5 to Leaderboards
leaderboard.submit_question(qnumber=5,
                            response=answer_five(),
                            username=username,
                            password=password)

### Question 6
**Only looking at the three most populous counties for each state**, what are the three most populous states (in order of highest population to lowest population)? Use `CENSUS2010POP`.

Points: 20

*This function should return a list of string values.*
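
A "top N within each group" computation like the one described above can be sketched with `groupby` and `nlargest`. The `toy` frame and its `state` and `pop` columns are hypothetical; this is one possible approach on invented numbers, not the reference solution.

```python
import pandas as pd

# Hypothetical populations: state "A" has four counties, state "B" has two.
toy = pd.DataFrame({
    "state": ["A", "A", "A", "A", "B", "B"],
    "pop":   [10, 40, 30, 20, 50, 25],
})

# For each state, sum only its three largest county populations,
# then rank states by that sum from highest to lowest.
top3_sum = (
    toy.groupby("state")["pop"]
       .apply(lambda s: s.nlargest(3).sum())
       .sort_values(ascending=False)
)

print(top3_sum.index.tolist())
```

`nlargest(3)` returns fewer than three values when a group is smaller, so state "B" is summed over its two counties without raising an error.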

In [None]:
# Three most populous counties for each state

In [None]:
def answer_six():
    return 'YOUR ANSWER HERE'

In [None]:
# Submit question 6 to Leaderboards
leaderboard.submit_six(answer_six(),
                        username=username,
                        password=password)

## Part 3
For the next set of questions, we will be using soccer data from a Kaggle competition.

The soccer dataset (soccer-data.csv) should be loaded as soccer_df. Answer questions using this as appropriate.

In [None]:
soccer_df = pd.read_csv('csv/soccer-data.csv')
soccer_df.head()

### Question 7
How many unique cities have games been played in?

Points: 10

*This function should return an integer.*

In [None]:
def answer_seven():
    return "YOUR ANSWER HERE"

In [None]:
# Submit question 7 to Leaderboards
leaderboard.submit_question(qnumber=7,
                            response=answer_seven(),
                            username=username,
                            password=password)

### Question 8
How many games were played in Glasgow, Scotland?

Points: 10

*This function should return an integer.*

In [None]:
def answer_eight():
    return "YOUR ANSWER HERE"

In [None]:
# Submit question 8 to Leaderboards
leaderboard.submit_question(qnumber=8,
                            response=answer_eight(),
                            username=username,
                            password=password)