# Analyzing CIA Factbook Data Using SQL

In this project, we'll use SQL to explore data from the CIA World Factbook, which contains demographic information for countries around the world.

## Setup

You can use the following code if you need to install `ipython-sql`:

    !conda install -yc conda-forge ipython-sql
    
First, we'll connect our Jupyter Notebook to our database file.

In [1]:
%%capture
%load_ext sql
%sql sqlite:///factbook.db

Next, let's take a first look at the data to see what it looks like.

In [2]:
%%sql
SELECT *
  FROM facts
 LIMIT 5;

 * sqlite:///factbook.db
Done.


id,code,name,area,area_land,area_water,population,population_growth,birth_rate,death_rate,migration_rate,created_at,updated_at
1,af,Afghanistan,652230,652230,0,32564342,2.32,38.57,13.89,1.51,2015-11-01 13:19:49.461734,2015-11-01 13:19:49.461734
2,al,Albania,28748,27398,1350,3029278,0.3,12.92,6.58,3.3,2015-11-01 13:19:54.431082,2015-11-01 13:19:54.431082
3,ag,Algeria,2381741,2381741,0,39542166,1.84,23.67,4.31,0.92,2015-11-01 13:19:59.961286,2015-11-01 13:19:59.961286
4,an,Andorra,468,468,0,85580,0.12,8.13,6.96,0.0,2015-11-01 13:20:03.659945,2015-11-01 13:20:03.659945
5,ao,Angola,1246700,1246700,0,19625353,2.78,38.78,11.49,0.46,2015-11-01 13:20:08.625072,2015-11-01 13:20:08.625072


Here are descriptions for some of the columns:

* `name` - Country's name.
* `area` - Country's total area (land and water) in km^2.
* `area_land` - Country's land area in km^2.
* `area_water` - Country's water area in km^2.
* `population` - Country's population.
* `population_growth` - Country's population growth as a percentage.
* `birth_rate` - Country's number of births per year per 1,000 people.
* `death_rate` - Country's number of deaths per year per 1,000 people.
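
To check the column list without scanning a row of output, SQLite's `PRAGMA table_info` can be queried directly. The sketch below is a minimal, self-contained illustration using Python's built-in `sqlite3` module. It builds an in-memory stand-in for the `facts` table, and the column types in the `CREATE TABLE` statement are assumptions, not the real schema of `factbook.db`.

```python
import sqlite3

# In-memory stand-in for factbook.db. The column names come from the query
# output above; the column types are assumptions for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE facts (
        id INTEGER PRIMARY KEY,
        code TEXT,
        name TEXT,
        area INTEGER,
        area_land INTEGER,
        area_water INTEGER,
        population INTEGER,
        population_growth REAL,
        birth_rate REAL,
        death_rate REAL,
        migration_rate REAL,
        created_at TEXT,
        updated_at TEXT
    )
""")

# PRAGMA table_info returns one row per column:
# (cid, name, type, notnull, dflt_value, pk)
columns = [(row[1], row[2]) for row in conn.execute("PRAGMA table_info(facts)")]
for col_name, col_type in columns:
    print(f"{col_name:<18} {col_type}")
```

Against the real file, connecting with `sqlite3.connect("factbook.db")` and running the same `PRAGMA` would report the actual declared types.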


## Summary Statistics Pt.1

Next, we'll calculate some summary statistics and look for any outliers.

In [3]:
%%sql
SELECT MIN(population) AS min_pop,
       MAX(population) AS max_pop,
       MIN(population_growth) AS min_pop_growth,
       MAX(population_growth) AS max_pop_growth
  FROM facts;

 * sqlite:///factbook.db
Done.


min_pop,max_pop,min_pop_growth,max_pop_growth
0,7256490011,0.0,4.02


It looks like there's a country with a population of 0 and a country with a population of 7,256,490,011.

## Exploring Outliers

We'll need to take a closer look at these countries to understand the data.

In [4]:
%%sql
SELECT *
  FROM facts
 WHERE population = (SELECT MIN(population)
                       FROM facts
                    );

 * sqlite:///factbook.db
Done.


id,code,name,area,area_land,area_water,population,population_growth,birth_rate,death_rate,migration_rate,created_at,updated_at
250,ay,Antarctica,,280000,,0,,,,,2015-11-01 13:38:44.885746,2015-11-01 13:38:44.885746


In [5]:
%%sql
SELECT *
  FROM facts
 WHERE population = (SELECT MAX(population)
                       FROM facts
                    );

 * sqlite:///factbook.db
Done.


id,code,name,area,area_land,area_water,population,population_growth,birth_rate,death_rate,migration_rate,created_at,updated_at
261,xx,World,,,,7256490011,1.08,18.6,7.8,,2015-11-01 13:39:09.910721,2015-11-01 13:39:09.910721


A closer look at the outliers shows that the population of more than 7 billion belongs to a row for the entire world, and the population of 0 belongs to Antarctica. The [CIA Factbook page for Antarctica](https://www.cia.gov/library/publications/the-world-factbook/geos/ay.html) confirms that this figure is accurate.


Now that we have this info, we should recalculate the summary statistics but exclude the row for the entire world.


## Summary Statistics Pt.2

In [6]:
%%sql
SELECT MIN(population) AS min_pop,
       MAX(population) AS max_pop,
       MIN(population_growth) AS min_pop_growth,
       MAX(population_growth) AS max_pop_growth
  FROM facts
 WHERE name != 'World';

 * sqlite:///factbook.db
Done.


min_pop,max_pop,min_pop_growth,max_pop_growth
0,1367485388,0.0,4.02


We can see that we still have the `min_pop` of 0 for Antarctica, and now a more believable `max_pop` of roughly 1.37 billion.
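
The effect of the `WHERE name != 'World'` filter can be reproduced on a small scale. The following sketch assumes an in-memory table holding only a handful of rows copied from the outputs above (with a reduced, hypothetical column set), so its numbers differ from the full database. It also shows that `MIN` and `MAX` skip `NULL` values, such as Antarctica's missing growth rate.

```python
import sqlite3

# Reduced stand-in for the facts table; the values are copied from the
# query outputs above, and the three-column layout is a simplification.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE facts (name TEXT, population INTEGER, population_growth REAL)"
)
conn.executemany(
    "INSERT INTO facts VALUES (?, ?, ?)",
    [
        ("Afghanistan", 32564342, 2.32),
        ("Albania", 3029278, 0.3),
        ("Algeria", 39542166, 1.84),
        ("Antarctica", 0, None),  # growth rate is NULL in the source data
        ("World", 7256490011, 1.08),
    ],
)

query = (
    "SELECT MIN(population), MAX(population), "
    "MIN(population_growth), MAX(population_growth) FROM facts"
)
with_world = conn.execute(query).fetchone()
without_world = conn.execute(query + " WHERE name != 'World'").fetchone()

print("with World:   ", with_world)
print("without World:", without_world)
```

Only the maximum population changes between the two results, because `World` is the single aggregate row in the sample.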

## Exploring Average Population and Area

Next, we'll work toward population density. As a first step, let's compute the average population and the average land area across all countries, again excluding the `World` row.

In [7]:
%%sql
SELECT AVG(population) AS avg_population, AVG(area_land) AS avg_land_area
  FROM facts
 WHERE name != 'World';

 * sqlite:///factbook.db
Done.


avg_population,avg_land_area
32242666.56846473,522702.57723577233
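
One detail worth keeping in mind when reading these averages: SQLite's `AVG` ignores `NULL` values, so each average is taken only over the rows where that column is populated. The sketch below illustrates this with three rows copied from the outputs above, in a hypothetical reduced table where Antarctica's total `area` is `NULL`.

```python
import sqlite3

# Reduced stand-in table; values are taken from the query outputs above.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE facts (name TEXT, area INTEGER, area_land INTEGER)")
conn.executemany(
    "INSERT INTO facts VALUES (?, ?, ?)",
    [
        ("Afghanistan", 652230, 652230),
        ("Albania", 28748, 27398),
        ("Antarctica", None, 280000),  # total area is NULL in the source data
    ],
)

# COUNT(*) counts every row, COUNT(area) counts only non-NULL areas,
# and AVG(area) divides by the non-NULL count.
total_rows, rows_with_area, avg_area, avg_land = conn.execute(
    "SELECT COUNT(*), COUNT(area), AVG(area), AVG(area_land) FROM facts"
).fetchone()

print(total_rows, rows_with_area, avg_area, avg_land)
```

Comparing `COUNT(*)` with `COUNT(column)` on the real table is a quick way to see how many rows each average actually covers.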


## Finding Densely Populated Countries

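Next, we'll look at countries with above-average population and below-average land area. Note that the subqueries in this cell average over every row in `facts`, including `World`, so the population threshold is higher than the `avg_population` computed in the previous section.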

In [8]:
%%sql
SELECT *
  FROM facts
 WHERE population > (SELECT AVG(population)
                      FROM facts
                    )
   AND area_land < (SELECT AVG(area_land)
                      FROM facts
                   );

 * sqlite:///factbook.db
Done.


id,code,name,area,area_land,area_water,population,population_growth,birth_rate,death_rate,migration_rate,created_at,updated_at
14,bg,Bangladesh,148460,130170,18290,168957745,1.6,21.14,5.61,0.46,2015-11-01 13:20:52.753843,2015-11-01 13:20:52.753843
65,gm,Germany,357022,348672,8350,80854408,0.17,8.47,11.42,1.24,2015-11-01 13:25:21.942190,2015-11-01 13:25:21.942190
85,ja,Japan,377915,364485,13430,126919659,0.16,7.93,9.51,0.0,2015-11-01 13:27:08.040081,2015-11-01 13:27:08.040081
138,rp,Philippines,300000,298170,1830,100998376,1.61,24.27,6.11,2.09,2015-11-01 13:31:23.643550,2015-11-01 13:31:23.643550
173,th,Thailand,513120,510890,2230,67976405,0.34,11.19,7.8,0.0,2015-11-01 13:34:11.057976,2015-11-01 13:34:11.057976
185,uk,United Kingdom,243610,241930,1680,64088222,0.54,12.17,9.35,2.54,2015-11-01 13:35:09.362933,2015-11-01 13:35:09.362933
192,vm,Vietnam,331210,310070,21140,94348835,0.97,15.96,5.93,0.3,2015-11-01 13:35:42.896553,2015-11-01 13:35:42.896553
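
How much the `World` row shifts the population threshold can be checked on a small sample. The sketch below is an illustration under stated assumptions: it uses an in-memory table with six rows copied from the outputs above, so the thresholds and the resulting list differ from the full database. It compares `AVG(population)` with and without `World`, then runs a variant of the query that excludes `World` in both subqueries.

```python
import sqlite3

# Reduced stand-in table; values are copied from the query outputs above.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE facts (name TEXT, area_land INTEGER, population INTEGER)")
conn.executemany(
    "INSERT INTO facts VALUES (?, ?, ?)",
    [
        ("Afghanistan", 652230, 32564342),
        ("Albania", 27398, 3029278),
        ("Bangladesh", 130170, 168957745),
        ("Germany", 348672, 80854408),
        ("Antarctica", 280000, 0),
        ("World", None, 7256490011),  # land area is NULL in the source data
    ],
)

avg_with_world = conn.execute("SELECT AVG(population) FROM facts").fetchone()[0]
avg_without_world = conn.execute(
    "SELECT AVG(population) FROM facts WHERE name != 'World'"
).fetchone()[0]

# Variant of the query above with World excluded from both subqueries.
dense = [
    row[0]
    for row in conn.execute("""
        SELECT name
          FROM facts
         WHERE name != 'World'
           AND population > (SELECT AVG(population)
                               FROM facts
                              WHERE name != 'World')
           AND area_land < (SELECT AVG(area_land)
                              FROM facts
                             WHERE name != 'World')
         ORDER BY name
    """)
]

print(avg_with_world, avg_without_world, dense)
```

On the real table, running this variant would likely return a longer list than the one shown above, since the lower threshold admits more countries.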


Let's calculate the actual population density this time and look at the 20 most densely populated countries by land area. The casts to `float` guard against integer division.

In [9]:
%%sql
SELECT name, CAST(population AS float) / CAST(area_land AS float) AS density
  FROM facts
 ORDER BY density DESC
 LIMIT 20;

 * sqlite:///factbook.db
Done.


name,density
Macau,21168.964285714286
Monaco,15267.5
Singapore,8259.784570596797
Hong Kong,6655.27120223672
Gaza Strip,5191.819444444444
Gibraltar,4876.333333333333
Bahrain,1771.8592105263158
Maldives,1319.6409395973155
Malta,1310.01582278481
Bermuda,1299.925925925926
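
The reason for the `CAST` in the density query is that SQLite performs integer division when both operands are integers. The following sketch demonstrates the difference on three rows copied from the outputs above. It assumes that `population` and `area_land` are stored as integers, which the query output suggests but does not confirm.

```python
import sqlite3

# Reduced stand-in table; integer column types are an assumption.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE facts (name TEXT, area_land INTEGER, population INTEGER)")
conn.executemany(
    "INSERT INTO facts VALUES (?, ?, ?)",
    [
        ("Albania", 27398, 3029278),
        ("Andorra", 468, 85580),
        ("Bangladesh", 130170, 168957745),
    ],
)

# int_density truncates toward zero; density keeps the fractional part.
# The area_land > 0 filter skips rows with missing or zero land area.
rows = conn.execute("""
    SELECT name,
           population / area_land AS int_density,
           CAST(population AS REAL) / area_land AS density
      FROM facts
     WHERE area_land > 0
     ORDER BY density DESC
""").fetchall()

for name, int_density, density in rows:
    print(f"{name:<12} {int_density:>6} {density:10.2f}")
```

Casting only one operand is enough, since SQLite promotes the division to floating point as soon as either side is a `REAL`.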


## Conclusion & Next Steps

In this project we did some initial exploration of the CIA Factbook database and looked at some summary statistics for countries around the world.

Now that we have access to this database and some understanding of the data, here are some next steps to continue our exploration:

* Which country has the most people? Which country has the highest growth rate?
* Which countries have the highest ratios of water to land? Which countries have more water than land?
* Which countries will add the most people to their population next year?
* Which countries have a higher death rate than birth rate?
* Which countries have the highest population/area ratio, and how does that compare to the list we found above?
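
As a starting point for two of these questions, here is a minimal sketch run against four rows copied from the outputs above. Treating `population * population_growth / 100` as the number of people added next year is an assumption about how to read the growth rate, and the results apply only to this small sample, not to the full database.

```python
import sqlite3

# Reduced stand-in table; values are copied from the query outputs above.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE facts (
        name TEXT,
        population INTEGER,
        population_growth REAL,
        birth_rate REAL,
        death_rate REAL
    )
""")
conn.executemany(
    "INSERT INTO facts VALUES (?, ?, ?, ?, ?)",
    [
        ("Bangladesh", 168957745, 1.6, 21.14, 5.61),
        ("Germany", 80854408, 0.17, 8.47, 11.42),
        ("Japan", 126919659, 0.16, 7.93, 9.51),
        ("United Kingdom", 64088222, 0.54, 12.17, 9.35),
    ],
)

# Countries where deaths outnumber births per 1,000 people.
shrinking = [
    row[0]
    for row in conn.execute(
        "SELECT name FROM facts WHERE death_rate > birth_rate ORDER BY name"
    )
]

# Country expected to add the most people, under the assumption above.
top_adder = conn.execute("""
    SELECT name, population * population_growth / 100.0 AS added_people
      FROM facts
     ORDER BY added_people DESC
     LIMIT 1
""").fetchone()

print(shrinking, top_adder)
```

The same two queries, with a `WHERE name != 'World'` filter added, could be pasted into a `%%sql` cell to run against the full `facts` table.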

The idea for this project comes from the [DATAQUEST](https://app.dataquest.io/) **SQL Fundamentals** course.