![Callysto.ca Banner](https://github.com/callysto/curriculum-notebooks/blob/master/callysto-notebook-banner-top.jpg?raw=true)

In [None]:
#from IPython.display import HTML, display
#display(HTML("<table><tr><td><img src='data/map.png' width='550'></td><td><img src='data/globe.jpeg' width='420'></td></tr></table>"))

### Prep work

Run the next cell to load libraries and pre-defined functions:

In [None]:
import pandas as pd
from IPython.display import HTML, display
from plotly.offline import init_notebook_mode

# Enable Plotly plotting in Colab: load require.js before each cell runs
def enable_plotly_in_cell():
    display(HTML('''
        <script src="/static/components/requirejs/require.js"></script>
    '''))
    init_notebook_mode(connected=False)

get_ipython().events.register('pre_run_cell', enable_plotly_in_cell)

# Group goal

 
Work through the analysis below and complete the challenges.


**Extra challenge**:

Can you find and visualize anything else interesting in this data?

### Getting data

This dataset was created by [Bootstrap](https://www.bootstrapworld.org/index.shtml) and can be downloaded [here](https://docs.google.com/spreadsheets/d/19VoYxPw0tmuSViN1qFIkyUoepjNSRsuQCe0TZZDmrZs/edit#gid=213565368).


Data was aggregated from the following sources:

- The World Factbook:
  - [GDP (PPP)](https://www.cia.gov/library/publications/the-world-factbook/rankorder/2001rank.html)
  - [Life expectancy at birth](https://www.cia.gov/library/publications/the-world-factbook/fields/355rank.html)
  - [Population](https://www.cia.gov/library/publications/the-world-factbook/fields/335rank.html)
- Wikipedia:
  - [Universal Health Care](https://en.wikipedia.org/wiki/List_of_countries_with_universal_health_care)
 
    
Some countries/territories/regions were omitted from the dataset due to incomplete data.
   

In [None]:
#reading from cloud object storage
target_url="https://swift-yeg.cloud.cybera.ca:8080/v1/AUTH_d22d1e3f28be45209ba8f660295c84cf/hackaton/countries2.csv"

In [None]:
#reading the input file and creating dataframe
countries = pd.read_csv(target_url) 

In [None]:
#how many rows and columns does the dataframe have?
countries.shape

In [None]:
#what are the column names?
countries.columns

Column descriptions:

**gdp (\$US)** - the total value of all goods and services produced in the country, valued at prices prevailing in the United States.

**life-expectancy (yrs)** - the average number of years to be lived by a group of people born in the same year, if mortality at each age remains constant in the future. Life expectancy at birth is also a measure of overall quality of life in a country and summarizes the mortality at all ages.

**population** - the population of the country.

**has-univ-healthcare** - indicates whether the country has universal health care. Universal health coverage is a broad concept that has been implemented in several ways. The common denominator for all such programs is some form of government action aimed at extending access to health care as widely as possible and setting minimum standards.

**code** - the country code.

In [None]:
#display the first 5 rows to see what the data looks like
countries.head()

In [None]:
#let's create another column - GDP per person
countries['gdp ($US) person'] = countries['gdp ($US)']/countries["population"]
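
The division above is applied row by row: each country's GDP is divided by the population in the same row. A minimal sketch of the same operation on a tiny table follows; the three rows are made-up values for illustration only, not taken from the dataset.

```python
import pandas as pd

# Made-up numbers for illustration only (not taken from the dataset)
toy = pd.DataFrame({
    "country": ["A", "B", "C"],
    "gdp ($US)": [1000.0, 600.0, 90.0],
    "population": [10, 20, 3],
})

# Dividing one column by another works row by row:
# each GDP value is divided by the population in the same row
toy["gdp ($US) person"] = toy["gdp ($US)"] / toy["population"]

print(toy)
```

Countries B and C end up with the same GDP per person even though their total GDP differs, which is why a per-person column can tell a different story than the raw totals.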

### Exploring data by country

We can plot all the countries in the dataset using the `px.choropleth()` function.  
Let's try coloring the countries by the values in a specific column:

In [None]:
import plotly.express as px

In [None]:
fig = px.choropleth(countries, locations="code",
                    color="life-expectancy (yrs)", #coloring by life-expectancy
                    hover_name="country") #country name will appear when you hover your mouse over it
fig.show()

Look at the map: interestingly, Japan has the highest life expectancy!  
Let's find out the exact number:

In [None]:
countries[countries["country"]=="Japan"]
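
The filter above happens in two steps: the comparison produces one True/False value per row, and that mask then selects the matching rows. The sketch below spells the steps out on a small made-up table (the country names and values are placeholders, not real data).

```python
import pandas as pd

# Made-up values for illustration only (not taken from the dataset)
toy = pd.DataFrame({
    "country": ["A", "B", "C"],
    "life-expectancy (yrs)": [70.5, 82.1, 64.0],
})

# Step 1: the comparison gives True or False for every row
mask = toy["country"] == "B"

# Step 2: indexing with the mask keeps only the rows marked True
row = toy[mask]

# Step 3 (optional): pull a single value out of the matching row
value = row["life-expectancy (yrs)"].iloc[0]
print(value)
```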

### Challenge
 - Using the cells above as an example, create new cells and draw two maps: one colored by `population` and one colored by `gdp ($US)`.

 - Print the exact population of China.

 - Do the two maps you created look similar? Why do you think that is?

In [None]:
#plot by the newly created column
fig = px.choropleth(countries, locations="code",
                    color="gdp ($US) person",
                    hover_name="country")
fig.show()

For the next part of the notebook we need additional libraries loaded.

In [1]:
#library should be installed already
#!pip install cufflinks ipywidgets

In [None]:
import cufflinks as cf
cf.go_offline()

Let's find the 20 countries with the highest GDP per person.

In [None]:
#select only two columns - "gdp ($US) person" and "country"
gdp_person = countries[["gdp ($US) person","country"]]

#order by "gdp ($US) person", having highest numbers on top and get top 20
gdp_person = gdp_person.sort_values("gdp ($US) person", ascending = False).head(20)

gdp_person
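
The sort-then-`head()` pattern above can be tried on a small made-up table. The sketch also shows `DataFrame.nlargest()`, a pandas shortcut that gives the same rows here; it is offered as an alternative, not as the method this notebook uses.

```python
import pandas as pd

# Made-up values for illustration only (not taken from the dataset)
toy = pd.DataFrame({
    "country": ["A", "B", "C", "D"],
    "gdp ($US) person": [30.0, 100.0, 55.0, 10.0],
})

# Sort from highest to lowest, then keep the first 2 rows
top2_sorted = toy.sort_values("gdp ($US) person", ascending=False).head(2)

# Shortcut: ask directly for the 2 rows with the largest values
top2_nlargest = toy.nlargest(2, "gdp ($US) person")

print(top2_sorted)
```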

In [None]:
#plotting top 20 countries, setting index to country - so the bars are marked with country names
gdp_person.set_index("country").iplot(kind = "bar",  yTitle='GDP (USD) Per Person', xTitle="Country")

Some of the countries in the top 20 are quite small, like Luxembourg or Brunei.  
Let's find out the population of these countries.

In [None]:
# creating new column  - population in thousands 
countries["population_t"] = countries["population"]/1000

In [None]:
#this time we select 3 columns - "gdp ($US) person", "population_t" and "country"
gdp_person_pop = countries[["gdp ($US) person","population_t" ,"country"]]

#sorting again  by "gdp ($US) person"
gdp_person_pop = gdp_person_pop.sort_values("gdp ($US) person", ascending = False).head(20)


gdp_person_pop.set_index("country").iplot(kind = "bar",yTitle="Population (thousands) and GDP per person (USD)",xTitle="Country")

We can see that most of the countries in the top 20 have small populations, but the United States has a significantly larger population than the others, so there is likely no connection between GDP per person and population size.
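
One way to back up a statement like this with a number, rather than by eye, is a correlation coefficient. The sketch below uses pandas' `Series.corr()` (Pearson correlation by default) on made-up values chosen so that the two columns have no straight-line relationship; it is an illustration of the idea, not a result from the dataset.

```python
import pandas as pd

# Made-up values for illustration only (not taken from the dataset)
toy = pd.DataFrame({
    "gdp ($US) person": [1.0, 2.0, 3.0, 4.0, 5.0],
    "population_t": [2.0, 1.0, 3.0, 1.0, 2.0],
})

# Pearson correlation: close to +1 or -1 means a strong straight-line
# relationship, close to 0 means no straight-line relationship
r = toy["gdp ($US) person"].corr(toy["population_t"])
print(round(r, 3))
```

On the real data the same call would be `countries["gdp ($US) person"].corr(countries["population"])`. A value near 0 only rules out a straight-line relationship, so it is worth looking at a plot as well.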

### Challenge
 - Using the cells above as an example, create new cells and find the 20 countries with the lowest life expectancy.
 - Do these countries have Universal Health Care?

### Exploring data by continent

#### Number of countries per continent

In [None]:
#unique continents
continents = countries["continent"].unique()

#how many of them?
print(len(continents), "continents")

continents

In [None]:
#group by continent and calculate the number of rows (countries) in each group
counts_by_continent = countries.groupby("continent").size()

#Create additional column - count
counts_by_continent = counts_by_continent.reset_index(name="count")

counts_by_continent
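
The two steps above, counting rows per group and then turning the result back into a regular table, can be seen on a small made-up example (the five rows below are placeholders, not real countries).

```python
import pandas as pd

# Made-up rows for illustration only (not taken from the dataset)
toy = pd.DataFrame({
    "country": ["A", "B", "C", "D", "E"],
    "continent": ["Asia", "Europe", "Asia", "Africa", "Asia"],
})

# size() counts the rows in each group; the result is a Series
# whose index holds the continent names
sizes = toy.groupby("continent").size()

# reset_index(name="count") turns that Series into a DataFrame
# with two columns: "continent" and "count"
counts = sizes.reset_index(name="count")

print(counts)
```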

In [None]:
#using kind pie to create a pie chart
counts_by_continent.iplot(kind="pie",labels = "continent",values = "count")

It looks like Asia, Africa and Europe have almost equal numbers of countries.

#### Population by continent

Calculate which continent has the largest population:

In [None]:
#group by continent and calculate the sum for every column
sum_by_continent = countries.groupby("continent").sum()

#convert index(row names) into additional column
sum_by_continent = sum_by_continent.reset_index()

sum_by_continent
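
To see what the grouped sum does to each column, here is a minimal sketch on a made-up table. It assumes the `has-univ-healthcare` column holds True/False values, which may not match how the real file stores it; if it does, summing the column counts the rows marked True.

```python
import pandas as pd

# Made-up rows for illustration only (not taken from the dataset).
# Assumption: has-univ-healthcare is stored as True/False values.
toy = pd.DataFrame({
    "continent": ["Asia", "Europe", "Asia", "Europe"],
    "population": [100, 50, 200, 25],
    "has-univ-healthcare": [True, True, False, True],
})

# Sum every column within each continent, then turn the
# continent index back into a regular column
totals = toy.groupby("continent").sum().reset_index()

print(totals)
```

Populations are added up per continent, while the True/False column becomes a count of True values per continent.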

In [None]:
# we select only one column - population and create a pie chart
sum_by_continent.iplot(kind="pie", values="population",labels="continent")

### Challenge
  - Try changing the plot in the cell above to a bar graph. Which kind of plot gives a better understanding of this data?
  - Using the **sum_by_continent** dataframe, create new cell(s) and plot the `has-univ-healthcare` column to visualize which continent has more countries with Universal Health Care available.
  - Using the cell above as an example, calculate the averages for every column
      - Use the `mean()` function
  - Plot average life expectancy per continent

[![Callysto.ca License](https://github.com/callysto/curriculum-notebooks/blob/master/callysto-notebook-banner-bottom.jpg?raw=true)](https://github.com/callysto/curriculum-notebooks/blob/master/LICENSE.md)