# Introduction 

This project aims to analyze CIA Factbook Data, a compendium of statistics about all of the countries on Earth.

In [None]:
import pandas as pd
import sqlite3

conn = sqlite3.connect('../input/factbook.db')

q1 = "SELECT * FROM sqlite_master WHERE type='table';" 
pd.read_sql_query(q1, conn)

In [None]:
q2 = "SELECT * FROM facts LIMIT 5;"
pd.read_sql_query(q2, conn)

# Basic Statistics

In [None]:
q3 = '''
SELECT MIN(Population) min_population, MAX(Population) max_population,
MIN(population_growth) min_pop_growth, MAX(population_growth) max_pop_growth
FROM facts
'''
pd.read_sql_query(q3, conn)

# Outliers

The results show one row with a population of 0 and another with more than 7.2 billion people.

Let's find out which ones those are.

In [None]:
q4 = '''
SELECT name, population FROM facts
WHERE population = 0 OR population > 7000000000
'''
pd.read_sql_query(q4, conn)

According to the CIA Factbook page for Antarctica, there are no indigenous inhabitants even though there are both permanent and summer-only staffed research stations, which explains the population of 0. Also, the table contains a row for the entire world, whose population is over 7.2 billion people.
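
A minimal sketch of how such rows can be kept out of later statistics, assuming a filter on `population` alone is enough to identify them. It runs against an in-memory table whose names and numbers are invented for illustration; `demo_conn`, `demo_query` and `demo_filtered` are hypothetical names, not part of the Factbook database.

```python
import sqlite3
import pandas as pd

# Toy stand-in for the facts table; every name and number here is made up.
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute("CREATE TABLE facts (name TEXT, population INTEGER)")
demo_conn.executemany(
    "INSERT INTO facts VALUES (?, ?)",
    [
        ("Aland", 5_000_000),
        ("Bland", 120_000_000),
        ("Empty Territory", 0),             # plays the role of Antarctica
        ("Whole Planet", 7_000_000_000),    # plays the role of the World row
        ("Unknown", None),                  # missing population
    ],
)

# Drop the zero-population row and the aggregate row holding the maximum.
# Rows with a NULL population are dropped too, because NULL comparisons are not true.
demo_query = """
SELECT name, population FROM facts
WHERE population > 0
  AND population < (SELECT MAX(population) FROM facts)
ORDER BY name
"""
demo_filtered = pd.read_sql_query(demo_query, demo_conn)
print(demo_filtered)
```

The `MAX(population)` subquery only works as an exclusion rule because the aggregate row is guaranteed to be the largest one; filtering by name would be the alternative when that is not the case.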

# Analysis

Let's start the analysis by studying the birth/death rate, along with the population and population growth rate.

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

q5 = '''
SELECT population, population_growth, birth_rate, death_rate FROM facts
WHERE population < (SELECT MAX(Population) FROM facts) AND population > 0;
'''
df1 = pd.read_sql_query(q5, conn)

fig = plt.figure(figsize=(10,8));
ax1 = fig.add_subplot(1,1,1);
df1.hist(ax=ax1);
ax1.grid(False)

In most countries the birth rate is at most about 25 per 1,000 people, whereas the death rate does not exceed 15 per 1,000 people.

Looking at the peaks, the most common birth rate is 10-15 per 1,000 people and the most common death rate is around 7 per 1,000, so in a typical country births outnumber deaths.
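
As a rough worked example, the gap between the two peaks can be expressed as a rate of natural increase (births minus deaths, ignoring migration). The sketch below uses the approximate peak values quoted above; `natural_increase_pct` is a hypothetical helper name introduced only for this illustration.

```python
# Approximate peak values read off the histograms (per 1,000 people).
birth_rate_low, birth_rate_high = 10.0, 15.0
death_rate = 7.0

def natural_increase_pct(birth_per_1000, death_per_1000):
    """Return births minus deaths as a percentage of the population (migration ignored)."""
    # A rate per 1,000 divided by 10 is the same rate per 100, i.e. a percentage.
    return (birth_per_1000 - death_per_1000) / 10.0

low = natural_increase_pct(birth_rate_low, death_rate)
high = natural_increase_pct(birth_rate_high, death_rate)
print(f"Natural increase: roughly {low:.1f}% to {high:.1f}%")
```

This is only a back-of-the-envelope figure: the peaks of two separate histograms do not necessarily belong to the same countries.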

Interestingly, almost every country in the world, with a few exceptions, has a population of up to around 100 million people and a growth rate of up to 4%.

Let's find the 10 most populated countries that exceed this figure.

In [None]:
q6 = '''
SELECT name heavily_populated, population FROM facts
WHERE population > 100000000 AND name NOT IN ('World', 'European Union')
ORDER BY population DESC
LIMIT 10;
'''

# Top 10 countries by population; the aggregate rows are excluded in the query itself
heavily_populated = pd.read_sql_query(q6, conn)

ax = heavily_populated.plot(x='heavily_populated', y='population', kind='bar', rot=40, legend=False)
ax.set_title('Heavily Populated Countries')
ax.set_xlabel("Country")
ax.set_ylabel("Population")
ax.grid(False);

Clearly led by China and India, each with over 1.2 billion inhabitants, the most populated countries in the world also include the USA, Indonesia and Brazil.

Next, let's take a look at the population density per square kilometer and plot a histogram to visualize it.

In [None]:
q7 = '''
SELECT name, CAST(population as float)/CAST(area_land as float) population_density FROM facts
WHERE population < (SELECT MAX(Population) FROM facts) AND population > 0;
'''

population_density = pd.read_sql_query(q7, conn)
fig = plt.figure(figsize=(10,8));
ax3 = fig.add_subplot(1,1,1);
population_density.hist(ax=ax3);
ax3.grid(False)

Most countries have a population density of up to about 2,500 people per square kilometer, except for a few that have between 5,000 and 20,000 people per square kilometer.
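
The density calculation itself is just population divided by land area. A minimal pandas sketch, using invented rows (`toy` and its values are hypothetical), shows one way to guard against a zero land area. SQLite returns NULL when dividing by zero, whereas pandas would produce `inf`, so the zero is replaced with `NaN` first.

```python
import numpy as np
import pandas as pd

# Invented rows for illustration only; not taken from the Factbook.
toy = pd.DataFrame({
    "name": ["Small Dense", "Large Sparse", "No Land Data"],
    "population": [2_000_000, 50_000_000, 1_000],
    "area_land": [100, 5_000_000, 0],
})

# Replace a zero land area with NaN so the division yields NaN instead of inf.
toy["population_density"] = toy["population"] / toy["area_land"].replace(0, np.nan)

# Sort from densest to sparsest; rows without a usable density go last.
toy = toy.sort_values("population_density", ascending=False, na_position="last")
print(toy[["name", "population_density"]])
```

Rows with a missing density are skipped by `hist`, so they do not distort the plot.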

Let's find out which countries have the highest population density, and whether or not they are the same as the most populated.

In [None]:
population_density.sort_values('population_density', ascending=False, inplace=True)
print(population_density.head(20))

We can see that almost none of the top 10 most populated countries are among the most densely populated ones, which in many cases happen to be small countries or territories like Monaco, Malta or Gibraltar.

However, it would be interesting to see which of the most heavily populated countries has the highest population density. Let's find that out.

In [None]:
populated_countries = heavily_populated['heavily_populated'].tolist()
populated_countries_density = population_density[population_density['name'].isin(populated_countries)]
populated_countries_density = populated_countries_density.sort_values('population_density', ascending=False)

q8 = "SELECT * FROM facts;"
countries = pd.read_sql_query(q8, conn)
large_countries = countries[countries['name'].isin(populated_countries)].sort_values('area_land', ascending=False)

ax1 = populated_countries_density.plot(x='name', y='population_density', kind='bar', rot=40, legend=False)
ax1.set_title('Population Density of\n Highly Populated Countries')
ax1.set_xlabel("Country")
ax1.set_ylabel("Population Density (People per Square Kilometer)")
ax1.grid(False);

ax2 = large_countries.plot(x='name', y='area_land', kind='bar', rot=40, legend=False)
ax2.set_title('Area of Highly Populated Countries')
ax2.set_xlabel("Country")
ax2.set_ylabel("Area (Square Kilometers)")
ax2.grid(False);

Even though China and India are the two most populated countries in the world, only India keeps its second place in the density ranking, since China is much larger by area.

Russia has a smaller population and is the largest country in the world, which explains why it is the least densely populated of this top 10. Bangladesh, on the other hand, has a much smaller population than India or China but also a much smaller area, which makes it the leader in the population density graph.



Next, let's explore which countries have the highest ratios of water area to total area.

In [None]:
q9 = '''
SELECT name, CAST(area_water as float)/CAST(area as float) water_ratio FROM facts
WHERE population < (SELECT MAX(Population) FROM facts) AND population > 0;
'''

water_ratio = pd.read_sql_query(q9, conn).sort_values('water_ratio', ascending=False).head(20)
water_ratio

As we might expect, countries fully or partially surrounded by water lead this ranking. However, the gap between the country with the highest water ratio, the Virgin Islands, and the rest is large: about 80%, compared with about 35% for the second, Puerto Rico.

In fact, if we take a look at which countries have more water than land, we come across a surprising result:

In [None]:
q11 = '''
SELECT name water_countries FROM facts
WHERE area_water > area_land AND (population < (SELECT MAX(Population) FROM facts) AND population > 0);
'''

water_countries = pd.read_sql_query(q11, conn).head(20)
water_countries

Based on this data set, the only country in the world that has more water than land is the Virgin Islands.
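
The two checks used above are closely related. Assuming the total area is the sum of land and water area (an assumption, not verified against the data set), having more water than land is the same as having a water ratio above 50%. A small sketch with invented rows (`toy_areas` is a hypothetical name) illustrates this.

```python
import pandas as pd

# Invented rows for illustration only; not taken from the Factbook.
toy_areas = pd.DataFrame({
    "name": ["Mostly Water", "Mostly Land"],
    "area_land": [200, 900],
    "area_water": [800, 100],
})

# Assumption: total area = land area + water area.
toy_areas["area"] = toy_areas["area_land"] + toy_areas["area_water"]
toy_areas["water_ratio"] = toy_areas["area_water"] / toy_areas["area"]

# Two equivalent ways of selecting rows with more water than land.
more_water = toy_areas[toy_areas["area_water"] > toy_areas["area_land"]]
ratio_above_half = toy_areas[toy_areas["water_ratio"] > 0.5]
print(more_water["name"].tolist(), ratio_above_half["name"].tolist())
```

If the real table stores a total area that differs from the sum of the two parts, the two selections can disagree, which is worth checking before relying on either one.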