![Callysto.ca Banner](https://github.com/callysto/curriculum-notebooks/blob/master/callysto-notebook-banner-top.jpg?raw=true)
 
<a href="https://hub.callysto.ca/jupyter/hub/user-redirect/git-pull?repo=https%3A%2F%2Fgithub.com%2Fcallysto%2Fhackathon&branch=master&subPath=HackathonNotebooks/Gapminder/gapminder.ipynb&depth=1" target="_parent"><img src="https://raw.githubusercontent.com/callysto/curriculum-notebooks/master/open-in-callysto-button.svg?sanitize=true" width="123" height="24" alt="Open in Callysto"></a>

# Data Science Hackathon - Gapminder

As our world becomes increasingly data-driven, the ability to analyze, visualize, and draw insights from large and complex datasets is becoming an essential skill. Data science can provide you with the tools to make better decisions and solve complex problems.

Click on the cell below, then click the Run button above to import and display a dataframe about *real* global data provided by [Gapminder](https://www.gapminder.org/).

In [None]:
import pandas as pd
import plotly.express as px

gapminder_data = px.data.gapminder() 
gapminder_data

To create visualizations, we will use [Plotly Express](https://plotly.com/python/plotly-express). Click on the cell below, then click Run to display a `histogram` graph.

In [None]:
import plotly.express as px
fig = px.histogram(gapminder_data, x='continent', y='pop', title='Total Continental Population (1952-2007)')
fig

## Beginner Challenges

Each of these challenges is worth 2 points, and uses the `gapminder_data` dataframe.

1. Make a *histogram* graph with `country` on the x-axis and `year` on the y-axis.

2. Make a *histogram* graph using the following parameters.
```
x='country', y='pop', color='year'
```

3. Add a title to the previous graph called `"Total Country Populations (1952-2007)"`

4. Recreate the histogram, adding the following parameter to calculate averages instead of totals.
```
, histfunc='avg'
```

4. Filter the dataframe to show only **Americas** using the following code: 
`new_data = gapminder_data.query('continent == "Americas"')`

5. Rename your dataframe `new_data` to an *appropriate* name based on the filtered data.

6. Reset the index of your newly named dataframe by adding `.reset_index(drop=True)` to the end of your dataframe.

7. Display your new dataframe using a *scatter* plot comparing `pop` and `lifeExp`. The scatter plot syntax is `px.scatter`.

8. Add `, size='gdpPercap'` as a parameter to your plot.

9. Pick out a country that has *less than* 150 million `pop`, less than 10000 `gdpPercap`, and greater than 30 `lifeExp`. Write down this country in the **markdown** cell below.

10. Comment on any *trends* you see in the figure above. You can comment on any of the variables used in the plot, such as `pop`, `gdpPercap`, `lifeExp`, or `continent`.

11. Sort the `gapminder_data` dataframe by `gdpPercap` using the following code.
```
gapminder_data.sort_values('gdpPercap')
```

12. Sort the `gapminder_data` dataframe in descending order by adding the following.
```
, ascending=False
```

Name the new dataframe `sorted_gapminder`.

13. Use `.head(5)` and `.tail(5)` to see the top 5 and bottom 5 `gdpPercap` countries in your `sorted_gapminder` dataframe.

14. Make a *histogram* of the sorted dataframe. 

15. Create a 3-dimensional plot using `px.scatter_3d`, where `x='pop'`, `y='lifeExp'`, and `z='gdpPercap'`, using `gapminder_data`.

16. Add the following parameters: 

```
, color='country'
```

and 

```
, symbol='continent'
```

17.  Display just one column of the dataframe using the following.
```
gapminder_data['year']
```

18. Display the maximum year in the column using `.max()`.

19. Find the average ([mean](https://simple.wikipedia.org/wiki/Mean)) of a column with the following.
```
gapminder_data['lifeExp'].mean()
```

20. *Query* the `gapminder_data` dataframe by the max year found earlier and name the new dataframe `queried`. Then, find the middle value ([median](https://simple.wikipedia.org/wiki/Median)) of the `lifeExp` column. 

21. Compare the *mean* values of the *min* (or lowest) year and the *max* (or highest) year. Which is higher, and what is a potential reason why?

22. Use your `queried` dataframe to create a scatter matrix. 
```
px.scatter_matrix(queried, dimensions=['gdpPercap', 'lifeExp', 'pop'], color='continent')
```

23. Create a correlation matrix using `.corr()` and name it `correlation`. The correlation matrix should only use the columns:
```
gapminder_data[['year', 'lifeExp', 'pop']]
```

24. Display the `correlation` matrix as a heatmap using the following.

```
px.imshow(correlation)
```

25. Find the highest and lowest correlation values in the correlation matrix. **Note**: The correlation of a column with itself, such as `pop` with `pop`, does not count, as it is always 1.0 (100%).

26.  Make a *pie chart* illustrating the percentage of the world's population by *continent*.
```
px.pie(gapminder_data, values='pop', names='continent')
```

27. Make another *pie chart*, but filter the data to the year 2007. Add an appropriate title to the pie chart.

28. Which continent had the biggest **total** percentage difference between the two pie charts?

29. Query the data by the year 2007 again. Create a *scatter_geo* map using your filtered dataframe:
```
geo_map = px.scatter_geo(your_dataframe, locations='iso_alpha', size='pop')
```

30. Make your `geo_map` more realistic by adding the following properties:

- showland=True, landcolor='Green'
- showocean=True, oceancolor='LightBlue'
- showrivers=True, rivercolor='Blue'
- showlakes=True, lakecolor='Turquoise'

This is done using:

```
geo_map.update_geos()
```

## Intermediate Challenges

Each of these challenges is worth 5 points, 