# Analyzing CIA Factbook Data Using SQL

In this project, I'll work with data from the [CIA World Factbook](https://www.cia.gov/library/publications/the-world-factbook/), a compendium of statistics about all of the countries on Earth. The Factbook contains demographic information like:

- `population` - The population as of `2015`.
- `population_growth` - The annual population growth rate, as a percentage.
- `area` - The total land and water area.

You can [download the SQLite database](https://dsserver-prod-resources-1.s3.amazonaws.com/257/factbook.db) `factbook.db`. In this project, I'll use SQL to explore and analyze data from this database.

## Introduction

First, connect the Jupyter Notebook to the database file.

In [0]:
%%capture
# Load the ipython-sql extension (a cell magic such as %%capture must be the first line of the cell)
%load_ext sql

In [13]:
# Load the database
%sql sqlite:///factbook.db

'Connected: @factbook.db'

Query the database to get information about its tables.

In [14]:
%%sql
SELECT * FROM sqlite_master WHERE type='table';

 * sqlite:///factbook.db
Done.


type,name,tbl_name,rootpage,sql
table,sqlite_sequence,sqlite_sequence,3,"CREATE TABLE sqlite_sequence(name,seq)"
table,facts,facts,47,"CREATE TABLE ""facts"" (""id"" INTEGER PRIMARY KEY AUTOINCREMENT NOT NULL, ""code"" varchar(255) NOT NULL, ""name"" varchar(255) NOT NULL, ""area"" integer, ""area_land"" integer, ""area_water"" integer, ""population"" integer, ""population_growth"" float, ""birth_rate"" float, ""death_rate"" float, ""migration_rate"" float)"
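
The same schema lookup can also be run outside the notebook with Python's built-in `sqlite3` module. The following is a minimal sketch, assuming a throwaway in-memory database with a simplified `facts` table (a hypothetical subset of the columns shown above, not the real `factbook.db`):

```python
import sqlite3

# In-memory stand-in for factbook.db (assumption: simplified schema for illustration).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE facts (id INTEGER PRIMARY KEY, code TEXT, name TEXT, "
    "area INTEGER, population INTEGER, population_growth REAL)"
)

# sqlite_master holds one row per schema object; keep only the tables.
tables = [row[0] for row in conn.execute(
    "SELECT name FROM sqlite_master WHERE type='table'"
)]

# PRAGMA table_info returns one row per column; index 1 is the column name.
columns = [row[1] for row in conn.execute("PRAGMA table_info(facts)")]

print(tables)
print(columns)
conn.close()
```

Pointing `sqlite3.connect` at the downloaded `factbook.db` file instead of `":memory:"` should list the real tables.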


## Overview of the Data

Let's see the first few rows of our `facts` table.

In [15]:
%%sql
SELECT * FROM facts LIMIT 5;

 * sqlite:///factbook.db
Done.


id,code,name,area,area_land,area_water,population,population_growth,birth_rate,death_rate,migration_rate
1,af,Afghanistan,652230,652230,0,32564342,2.32,38.57,13.89,1.51
2,al,Albania,28748,27398,1350,3029278,0.3,12.92,6.58,3.3
3,ag,Algeria,2381741,2381741,0,39542166,1.84,23.67,4.31,0.92
4,an,Andorra,468,468,0,85580,0.12,8.13,6.96,0.0
5,ao,Angola,1246700,1246700,0,19625353,2.78,38.78,11.49,0.46


Here are the descriptions for some of the columns:

- `name` - The name of the country.
- `area` - The country's total area (both land and water).
- `population` - The country's population.
- `population_growth` - The country's population growth as a percentage.
- `birth_rate` - The country's birth rate, or the number of births a year per 1,000 people.
- `death_rate` - The country's death rate, or the number of deaths a year per 1,000 people.
- `area_land` - The country's land area in square kilometers.
- `area_water` - The country's water area in square kilometers.

## Summary Statistics

Let's start by calculating some summary statistics and look for any outlier countries.

In [19]:
%%sql
SELECT
    MIN(population) min_pop,
    MAX(population) max_pop, 
    MIN(population_growth) min_pop_grwth,
    MAX(population_growth) max_pop_grwth 
FROM facts;

 * sqlite:///factbook.db
Done.


min_pop,max_pop,min_pop_grwth,max_pop_grwth
0,7256490011,0.0,4.02


A few things stand out from the summary statistics above:

- There's a country with a population of `0`.
- There's a country with a population of `7256490011` (more than 7.2 billion people).

Let's use subqueries to zoom in on just these countries.

In [21]:
%%sql
SELECT name country_name, population FROM facts
WHERE (
    population = (SELECT MIN(population) FROM facts)
    ) OR (
    population = (SELECT MAX(population) FROM facts)
    );

 * sqlite:///factbook.db
Done.


country_name,population
Antarctica,0
World,7256490011


The table contains a row for the whole world, which explains the population of over 7.2 billion. It also contains a row for Antarctica, which explains the population of 0. The latter matches the CIA Factbook [page for Antarctica](https://www.cia.gov/library/publications/the-world-factbook/geos/ay.html).
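
The subquery pattern used above can be unpacked into two steps: the inner `SELECT MIN(population)` yields a single value, and the outer query filters on it. Here is a minimal sketch of that equivalence, assuming an in-memory table seeded with four rows copied from the outputs above (not the full database):

```python
import sqlite3

# Small stand-in table (assumption: only a few rows taken from the notebook outputs).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE facts (name TEXT, population INTEGER)")
conn.executemany("INSERT INTO facts VALUES (?, ?)", [
    ("Afghanistan", 32564342),
    ("Andorra", 85580),
    ("Antarctica", 0),
    ("World", 7256490011),
])

# Step 1: run the inner query on its own; it returns one scalar.
min_pop = conn.execute("SELECT MIN(population) FROM facts").fetchone()[0]

# Step 2: feed that scalar to the outer query as a parameter.
two_step = conn.execute(
    "SELECT name FROM facts WHERE population = ?", (min_pop,)
).fetchall()

# The scalar subquery does both steps in a single statement.
one_step = conn.execute(
    "SELECT name FROM facts "
    "WHERE population = (SELECT MIN(population) FROM facts)"
).fetchall()

print(min_pop, two_step, one_step)
conn.close()
```

Both forms return the same rows; the subquery version simply avoids a round trip through Python.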

## Exploring Average Population and Area

Let's continue by calculating some averages.

In [23]:
%%sql
SELECT 
    ROUND(AVG(population)) average_pop,
    ROUND(AVG(area)) average_area
FROM facts;

 * sqlite:///factbook.db
Done.


average_pop,average_area
62094928.0,555094.0
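
Note that `AVG` here runs over every row in `facts`, including the `World` row found earlier, which pulls the population average upward. The sketch below shows one way to compare the averages with and without that row. It assumes an in-memory sample of rows copied from the earlier outputs, so the numbers it produces describe only this sample, not the full table:

```python
import sqlite3

# Sample rows only (assumption: three countries plus the World aggregate).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE facts (name TEXT, population INTEGER)")
conn.executemany("INSERT INTO facts VALUES (?, ?)", [
    ("Afghanistan", 32564342),
    ("Albania", 3029278),
    ("Algeria", 39542166),
    ("World", 7256490011),
])

# Average over all rows, aggregate row included.
avg_all = conn.execute("SELECT AVG(population) FROM facts").fetchone()[0]

# Average over countries only, with the aggregate row filtered out.
avg_countries = conn.execute(
    "SELECT AVG(population) FROM facts WHERE name <> 'World'"
).fetchone()[0]

print(avg_all, avg_countries)
conn.close()
```

Adding the same `WHERE name <> 'World'` filter to the queries in this notebook would be one way to keep the aggregate row out of the averages.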


## Finding Densely Populated Countries

I'll identify countries that have:

- Above average values for `population`.
- Below average values for `area`.

In [31]:
%%sql
SELECT name country_name, population / 1000000 population_millions, area FROM facts
WHERE (
    population > (SELECT AVG(population) FROM facts)
) AND (
    area < (SELECT AVG(area) FROM facts)
);

 * sqlite:///factbook.db
Done.


country_name,population_millions,area
Bangladesh,168,148460
Germany,80,357022
Japan,126,377915
Philippines,100,300000
Thailand,67,513120
United Kingdom,64,243610
Vietnam,94,331210
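
The `population_millions` values above are whole numbers because SQLite truncates the result when both operands of `/` are integers. The query also selects on population and area separately rather than computing a density. The sketch below illustrates both points, assuming an in-memory table holding the five sample rows from the `LIMIT 5` output earlier (the `density` alias is introduced here for illustration):

```python
import sqlite3

# Five sample rows from the earlier output (assumption: not the full table).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE facts (name TEXT, area INTEGER, population INTEGER)")
conn.executemany("INSERT INTO facts VALUES (?, ?, ?)", [
    ("Afghanistan", 652230, 32564342),
    ("Albania", 28748, 3029278),
    ("Algeria", 2381741, 39542166),
    ("Andorra", 468, 85580),
    ("Angola", 1246700, 19625353),
])

# INTEGER / INTEGER truncates; making one operand REAL keeps the fraction.
truncated, fractional = conn.execute(
    "SELECT population / 1000000, population / 1000000.0 "
    "FROM facts WHERE name = 'Afghanistan'"
).fetchone()

# People per unit of area; multiplying by 1.0 forces floating-point division.
density = conn.execute(
    "SELECT name, ROUND(1.0 * population / area, 1) AS density "
    "FROM facts ORDER BY density DESC"
).fetchall()

print(truncated, fractional)
for name, value in density:
    print(name, value)
conn.close()
```

Ranking by a computed density is a different criterion from the above-average-population, below-average-area filter, so the two approaches can surface different countries.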
