## 1: Introduction

In this mission, we'll be calculating summary statistics with SQL. We've often needed to count the number of records that matched a particular SQL query. So far, we've been able to do this by:

- Performing a SQL query with Python
- Retrieving the results and storing them as a list
- Finding the length of the list

While this approach works, it requires quite a bit of code, and it's also fairly slow. As we progress through this mission, we'll learn how to count with SQL only.

We'll be working with factbook.db, a SQLite database containing information about every country in the world. We'll use a table in the file called facts. Each row in facts represents a single country and contains several columns, including:

- name - The name of the country
- area - The total land and sea area of the country
- population - The country's population
- birth_rate - The country's birth rate
- created_at - The date the record was created
- updated_at - The date the record was updated

Here are the first few rows of facts:

| id | code | name        | area    | area_land | area_water | population | population_growth | birth_rate | death_rate | migration_rate | created_at                 | updated_at                 |
|----|------|-------------|---------|-----------|------------|------------|-------------------|------------|------------|----------------|----------------------------|----------------------------|
| 1  | af   | Afghanistan | 652230  | 652230    | 0          | 32564342   | 2.32              | 38.57      | 13.89      | 1.51           | 2015-11-01 13:19:49.461734 | 2015-11-01 13:19:49.461734 |
| 2  | al   | Albania     | 28748   | 27398     | 1350       | 3029278    | 0.3               | 12.92      | 6.58       | 3.3            | 2015-11-01 13:19:54.431082 | 2015-11-01 13:19:54.431082 |
| 3  | ag   | Algeria     | 2381741 | 2381741   | 0          | 39542166   | 1.84              | 23.67      | 4.31       | 0.92           | 2015-11-01 13:19:59.961286 | 2015-11-01 13:19:59.961286 |

#### Instructions: 

- Import sqlite3.
- Initialize a connection to factbook.db using the connect() method, and store it in the variable conn.
- Use conn, the execute() method, and the fetchall() method to fetch all of the records in the facts table. Assign the result to the facts variable.
- Print out the facts variable.
- Count the number of items in facts, and assign the result to facts_count.

In [3]:
import sqlite3

conn = sqlite3.connect("data/factbook.db")
facts = conn.execute("select * from facts").fetchall()
facts_count = len(facts)
print(facts_count)

261


## 2: Counting The Number Of Rows In SQL
Counting the number of records in a table is a common operation, and it feels like it should be more efficient than the code we just wrote on the last screen. Thankfully, SQL has a COUNT aggregation function that allows us to count the number of records in a table. We call it an aggregation function because it works across many rows to calculate an aggregate value. Here's an example:


``SELECT COUNT(*) FROM facts;`` 

The query above will count the number of rows in the facts table of factbook.db. If we want to count the number of non-null values in a single column instead, we can use the following syntax:

``SELECT COUNT(area_water) 
FROM facts;``

Note that this query will only count the total number of non-null values in the area_water column. That means it can return a different total than COUNT(*).
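To make the difference between the two forms concrete, here is a minimal sketch. It assumes a tiny in-memory table standing in for facts; the three rows are made up for illustration and are not taken from factbook.db.

```python
import sqlite3

# Hypothetical stand-in for the facts table (illustrative rows only).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE facts (name TEXT, area_water INTEGER)")
conn.executemany(
    "INSERT INTO facts VALUES (?, ?)",
    [("A", 0), ("B", 1350), ("C", None)],  # country C has no area_water value
)

# COUNT(*) counts every row, including the one with a null area_water.
all_rows = conn.execute("SELECT COUNT(*) FROM facts").fetchall()

# COUNT(area_water) skips nulls; a value of 0 is not null, so it is counted.
non_null = conn.execute("SELECT COUNT(area_water) FROM facts").fetchall()

print(all_rows)  # [(3,)]
print(non_null)  # [(2,)]
```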

Each of the queries above will return a list with a single tuple when we execute it in Python. The result will look like this:

``[(243,)]``

To get the integer count from the result, we'll need to extract the first element in the first tuple of the results.

This style saves typing, and it's also much faster for larger data sets. That's because we can do the counting inside the database, rather than having to pull all of the data into the Python environment first. In general, doing operations within a SQL database engine will be faster than doing the equivalent operations after pulling the data into a programming environment. This is because SQL database engines are optimized specifically for querying.
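The sketch below puts the two approaches side by side and shows how to pull the integer out of the result. It assumes a small in-memory table with made-up rows rather than the real factbook.db, so it only illustrates the mechanics, not the speed difference.

```python
import sqlite3

# Hypothetical stand-in for the facts table (illustrative rows only).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE facts (name TEXT, birth_rate REAL)")
conn.executemany(
    "INSERT INTO facts VALUES (?, ?)",
    [("A", 38.57), ("B", 12.92), ("C", 23.67)],
)

# Approach from the first screen: fetch every row, then count in Python.
count_in_python = len(conn.execute("SELECT * FROM facts").fetchall())

# Counting inside the database: a single row with a single value comes back.
result = conn.execute("SELECT COUNT(*) FROM facts").fetchall()
count_in_sql = result[0][0]  # first element of the first tuple

# fetchone() returns that single row directly, so one index is enough.
count_fetchone = conn.execute("SELECT COUNT(*) FROM facts").fetchone()[0]

print(count_in_python, count_in_sql, count_fetchone)  # 3 3 3
```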

#### Instructions: 

- Use the COUNT aggregation function to count the number of non-null values in the birth_rate column of the facts table.
- Extract the integer value from the result, and assign it to birth_rate_count.
- Print out birth_rate_count.

In [4]:
birth_rate_count = conn.execute("select count(birth_rate) from facts").fetchall()[0][0]
print(birth_rate_count)

228


## 3: Finding A Column's Minimum And Maximum Values In SQL
SQL has other aggregation functions, in addition to COUNT. MIN and MAX, for example, find the minimum and maximum values in a column. They're most useful with numeric columns, although SQLite will also compare text values using their sort order. Here's an example of how we can use these functions:


``SELECT MAX(birth_rate)
FROM facts;``

Just like the COUNT function, MIN and MAX will return a list with a single tuple. In this case, the result is:

``[(45.45,)]``

45.45 is the highest value in the birth_rate column of the facts table.
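As a quick illustration, the following sketch runs MIN and MAX against a small in-memory table. The table and its rows are assumptions made for this example, not real factbook.db data. It also shows that both functions skip null values.

```python
import sqlite3

# Hypothetical stand-in for the facts table (illustrative rows only).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE facts (name TEXT, birth_rate REAL)")
conn.executemany(
    "INSERT INTO facts VALUES (?, ?)",
    [("A", 38.57), ("B", 12.92), ("C", None)],  # C has no birth_rate
)

# Each query returns a list with a single one-element tuple; [0][0] unwraps it.
lowest = conn.execute("SELECT MIN(birth_rate) FROM facts").fetchall()[0][0]
highest = conn.execute("SELECT MAX(birth_rate) FROM facts").fetchall()[0][0]

print(lowest)   # 12.92
print(highest)  # 38.57
```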

#### Instructions:
- Use the MIN function to find the minimum value in the population_growth column.
    - Extract the numeric result and assign it to min_population_growth.
    - Print min_population_growth.
- Use the MAX function to find the maximum value in the death_rate column.
    - Extract the numeric result and assign it to max_death_rate.
    - Print max_death_rate.

In [10]:
conn = sqlite3.connect("data/factbook.db")

min_population_growth = conn.execute("select min(population_growth) from facts").fetchall()[0][0]
print(min_population_growth)

max_death_rate = conn.execute("select max(death_rate) from facts").fetchall()[0][0]
print(max_death_rate)

0.0
14.89


## 4: Calculating Sums And Averages In SQL
The final two aggregation functions we'll look at are SUM and AVG. SUM finds the total of all of the values in a numeric column:


``SELECT SUM(birth_rate) 
FROM facts;``

This function also returns a list with a single tuple. Our query will return this list:

``[(4406.909999999998,)]``

AVG finds the mean of all of the non-null values in a column:


``SELECT AVG(birth_rate) 
FROM facts;``


The result of this query is:

``[(19.32855263157894,)]``
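The relationship between SUM, AVG, and COUNT can be checked with a short sketch. It assumes a small in-memory table with made-up values (not factbook.db) and shows that AVG divides the sum by the number of non-null values, not by the number of rows.

```python
import sqlite3

# Hypothetical stand-in for the facts table (illustrative rows only).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE facts (name TEXT, birth_rate REAL)")
conn.executemany(
    "INSERT INTO facts VALUES (?, ?)",
    [("A", 10.0), ("B", 20.0), ("C", None), ("D", 30.0)],
)

total = conn.execute("SELECT SUM(birth_rate) FROM facts").fetchall()[0][0]
mean = conn.execute("SELECT AVG(birth_rate) FROM facts").fetchall()[0][0]
non_null = conn.execute("SELECT COUNT(birth_rate) FROM facts").fetchall()[0][0]

print(total)     # 60.0
print(mean)      # 20.0
print(non_null)  # 3 (the null row is ignored, so 60.0 / 3 == 20.0)
```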

#### Instructions:
- Use the SUM function to find the sum of the area_land column.
    - Extract the numeric result and assign it to total_land_area.
    - Print total_land_area.
- Use the AVG function to find the mean of the area_water column.
    - Extract the numeric result and assign it to avg_water_area.
    - Print avg_water_area.

In [8]:
conn = sqlite3.connect("data/factbook.db")

total_land_area = conn.execute("select sum(area_land) from facts").fetchall()[0][0]
print(total_land_area)

avg_water_area = conn.execute("select avg(area_water) from facts").fetchall()[0][0]
print(avg_water_area)

128584834
19067.59259259259


## 5: Combining Multiple Aggregation Functions

If we wanted to use the SUM, AVG, and MAX functions on a column, writing three different queries would be inefficient. Recall that we can query multiple columns by separating their names with commas, like this:

``SELECT birth_rate, death_rate, population_growth 
FROM facts;``

We can apply the same principle to combine multiple aggregation functions into a single query:

``SELECT COUNT(*), SUM(death_rate), AVG(population_growth) 
FROM facts;``

Because we've specified three aggregation functions in the query, it will return a list containing a tuple with three elements:

``[(261, 1783.2500000000002, 1.2009745762711865)]``

The order of the results corresponds to the order of the aggregation functions in the query. In our example, the first element in the tuple is the count of all the rows, the second is the sum of the death_rate column, and the third is the mean of the population_growth column.
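Because the values come back in query order, we can unpack the tuple straight into named variables. This is a minimal sketch on a made-up in-memory table (an assumption for illustration, not factbook.db); the variable names are our own.

```python
import sqlite3

# Hypothetical stand-in for the facts table (illustrative rows only).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE facts (death_rate REAL, population_growth REAL)")
conn.executemany(
    "INSERT INTO facts VALUES (?, ?)",
    [(10.0, 1.0), (6.0, 2.0), (8.0, 3.0)],
)

query = "SELECT COUNT(*), SUM(death_rate), AVG(population_growth) FROM facts"
stats = conn.execute(query).fetchall()
print(stats)  # [(3, 24.0, 2.0)]

# Unpack the single tuple in the same order as the functions in the query.
row_count, total_death_rate, mean_growth = stats[0]
```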

#### Instructions:
- Write a single query that calculates the following statistics about the facts table, in order:
    - The mean of the population column
    - The sum of the population column
    - The maximum value in the birth_rate column
- Assign the result of the query to facts_stats
- Print facts_stats

In [11]:
conn = sqlite3.connect("data/factbook.db")

query = "select avg(population), sum(population), max(birth_rate) from facts"
facts_stats = conn.execute(query).fetchall()
print(facts_stats)

[(62094928.32231405, 15026972654, 45.45)]


## 6: Aggregating Values For A Subset Of The Data

As you may recall from an earlier mission, we can use the WHERE statement to limit our query to certain rows in a SQL table:

``SELECT population 
FROM facts 
WHERE birth_rate > 10;``

The query above will select any values in the population column where the birth_rate is higher than 10. We can also use WHERE statements with aggregation functions to calculate statistics for a subset of rows:

``SELECT COUNT(*) 
FROM facts 
WHERE population > 5000000;``

The query above will count the number of rows where population is greater than 5000000.
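Here is a small sketch of filtering before aggregating, run on a made-up in-memory table (an assumption for illustration, not factbook.db). The second query passes the threshold as a parameter with the sqlite3 `?` placeholder, which is optional but handy when the value comes from a Python variable.

```python
import sqlite3

# Hypothetical stand-in for the facts table (illustrative rows only).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE facts (name TEXT, population INTEGER)")
conn.executemany(
    "INSERT INTO facts VALUES (?, ?)",
    [("A", 32564342), ("B", 3029278), ("C", 39542166)],
)

# WHERE filters the rows first; COUNT then runs on what is left.
query = "SELECT COUNT(*) FROM facts WHERE population > 5000000"
large_countries = conn.execute(query).fetchall()[0][0]
print(large_countries)  # 2

# The same filter with the threshold supplied as a query parameter.
threshold = 5000000
query = "SELECT COUNT(*) FROM facts WHERE population > ?"
large_countries_param = conn.execute(query, (threshold,)).fetchall()[0][0]
print(large_countries_param)  # 2
```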

#### Instructions: 
- Calculate the mean population_growth for countries with a population greater than 10000000.
    - Extract the numeric result and assign it to population_growth.
    - Print population_growth.

In [13]:
conn = sqlite3.connect("data/factbook.db")

query = "select avg(population_growth) from facts where population > 10000000"
population_growth = conn.execute(query).fetchall()[0][0]
print(population_growth)

1.4572222222222226


## 7: Selecting Unique Rows

There are times when we only want to select the unique values in a column or database, rather than each individual row. One example would be if our facts table had duplicate entries for each country, like this:

| id | code | name        | area   | area_land | area_water | population | population_growth | birth_rate | death_rate | migration_rate | created_at                 | updated_at                 |
|----|------|-------------|--------|-----------|------------|------------|-------------------|------------|------------|----------------|----------------------------|----------------------------|
| 1  | af   | Afghanistan | 652230 | 652230    | 0          | 32564342   | 2.32              | 38.57      | 13.89      | 1.51           | 2015-11-01 13:19:49.461734 | 2015-11-01 13:19:49.461734 |
| 2  | af   | Afghanistan | 652230 | 652230    | 0          | 32564342   | 2.32              | 38.57      | 13.89      | 1.51           | 2015-11-01 13:19:49.461734 | 2015-11-01 13:19:49.461734 |

To get a list of all of the countries in the world, we'll need to remove these duplicate entries. We can do this with the DISTINCT statement:


``SELECT DISTINCT name 
FROM facts;``

This query will return all of the unique values in the name column of facts. It won't return any duplicate values.

We can also use the DISTINCT statement with multiple columns to return unique pairings of those columns:


``SELECT DISTINCT name, population 
FROM facts;``

The query above will select the unique combinations of values in the population and name columns from facts.
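The sketch below shows both forms on a made-up in-memory table with deliberate duplicates (an assumption for illustration, not factbook.db). The ORDER BY clauses are added only to make the output order predictable.

```python
import sqlite3

# Hypothetical stand-in for the facts table, with a duplicated row.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE facts (name TEXT, population INTEGER)")
conn.executemany(
    "INSERT INTO facts VALUES (?, ?)",
    [
        ("Afghanistan", 32564342),
        ("Afghanistan", 32564342),  # exact duplicate
        ("Albania", 3029278),
        ("Albania", 3000000),       # same name, different population
    ],
)

# One column: each name appears once.
names = conn.execute("SELECT DISTINCT name FROM facts ORDER BY name").fetchall()
print(names)  # [('Afghanistan',), ('Albania',)]

# Two columns: uniqueness is judged on the (name, population) pair.
pairs = conn.execute(
    "SELECT DISTINCT name, population FROM facts ORDER BY name, population"
).fetchall()
print(pairs)  # [('Afghanistan', 32564342), ('Albania', 3000000), ('Albania', 3029278)]
```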

#### Instructions:

- Select all of the distinct values in the birth_rate column of the facts table, and assign the result to unique_birth_rates.
- Print unique_birth_rates.

In [15]:
conn = sqlite3.connect("data/factbook.db")

query = "select distinct birth_rate from facts"
unique_birth_rates = conn.execute(query).fetchall()
unique_birth_rates = [i[0] for i in unique_birth_rates] # list comprehension
print(unique_birth_rates)

[38.57, 12.92, 23.67, 8.13, 38.78, 15.85, 16.64, 13.61, 12.15, 9.41, 15.5, 13.66, 21.14, 11.87, 10.7, 11.41, 24.68, 36.02, 17.78, 22.76, 8.87, 20.96, 14.46, 17.32, 8.92, 42.03, 18.39, 42.01, 23.83, 36.17, 10.28, 20.33, 35.08, 36.6, 13.83, 12.49, 16.47, 27.84, 34.88, 35.85, 15.91, 28.67, 9.45, 9.9, 9.63, 10.27, 23.65, 15.41, 18.73, 18.51, 22.9, 16.46, 33.31, 30.0, 10.51, 37.27, 19.43, 10.72, 12.38, 34.49, 30.86, 12.74, 8.47, 31.09, 8.66, 16.03, 24.89, 35.74, 33.38, 15.59, 22.31, 23.14, 9.16, 13.91, 19.55, 16.72, 17.99, 31.45, 14.84, 18.48, 8.74, 18.16, 7.93, 25.37, 19.15, 26.4, 21.46, 14.52, 8.19, None, 19.91, 22.98, 24.25, 10.0, 14.59, 25.47, 34.41, 18.03, 10.45, 10.1, 11.37, 11.55, 32.61, 41.56, 19.71, 15.75, 44.99, 10.18, 25.6, 31.34, 13.29, 18.78, 20.54, 12.0, 6.65, 20.25, 10.42, 18.2, 38.58, 19.8, 24.95, 20.64, 10.83, 13.33, 45.45, 37.64, 12.14, 24.44, 22.58, 11.05, 18.32, 24.38, 16.37, 18.28, 24.27, 9.74, 9.27, 9.84, 9.14, 11.6, 33.75, 13.5, 13.7, 13.57, 20.87, 8.63, 34.23, 34.52,

## 8: Aggregating Unique Values

If we wanted to count the number of unique items in the population column, we could use the COUNT aggregation function along with the DISTINCT statement. Here's how it would work:


``SELECT COUNT(DISTINCT population) 
FROM facts;``

The query above will count all of the distinct values in the population column. We can also use other aggregation functions along with the DISTINCT statement:


``SELECT AVG(DISTINCT birth_rate) 
FROM facts;``

This query will find the mean of all of the distinct values in the birth_rate column.
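To see how DISTINCT changes an aggregate, here is a minimal sketch on a made-up in-memory table with one repeated value and one null (assumptions for illustration, not factbook.db).

```python
import sqlite3

# Hypothetical stand-in for the facts table (illustrative rows only).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE facts (name TEXT, birth_rate REAL)")
conn.executemany(
    "INSERT INTO facts VALUES (?, ?)",
    [("A", 10.0), ("B", 10.0), ("C", 20.0), ("D", None)],
)

query = """
SELECT COUNT(birth_rate), COUNT(DISTINCT birth_rate),
       AVG(birth_rate), AVG(DISTINCT birth_rate)
FROM facts
"""
count_all, count_unique, avg_all, avg_unique = conn.execute(query).fetchall()[0]

print(count_all, count_unique)  # 3 2  (the null is ignored in both)
print(avg_unique)               # 15.0 (mean of 10.0 and 20.0)
print(avg_all)                  # mean of 10.0, 10.0, and 20.0
```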

#### Instructions:
- Find the average of all of the distinct values in the birth_rate column where population is greater than 20000000.
    - Extract the numeric result and assign it to average_birth_rate.
    - Print average_birth_rate.
- Find the sum of all of the distinct values in the population column where area_land is greater than 1000000.
    - Extract the numeric result and assign it to sum_population.
    - Print sum_population.

In [16]:
conn = sqlite3.connect("data/factbook.db")

query = "select avg(distinct birth_rate) from facts where population > 20000000"
average_birth_rate = conn.execute(query).fetchall()[0][0]
print(average_birth_rate)

query = "select sum(distinct population) from facts where area_land > 1000000"
sum_population = conn.execute(query).fetchall()[0][0]
print(sum_population)

20.43473684210527
4233873015


## 9: Performing Arithmetic In SQL

Sometimes we'll want to perform some arithmetic on the columns in a SQL table. We might want to make the counts in the population column easier to understand by expressing them in terms of millions, for example. Instead of a number like 9766442, we'd want to display 9.766442. We could do this in Python, but it would be cumbersome to pull all of the data into the Python environment and then manipulate it. Fortunately, we can perform the math inside the SQL database engine instead. Here's an example:


``SELECT population / 1000000 
FROM facts;``

The query above will divide each value in the population column by 1000000, and return the result. Because the population column contains integers and we're dividing by an integer, the results will be integers as well. If we want to retain precision, we can specify a float instead:


``SELECT population / 1000000.0 
FROM facts;``

The query above will return a series of floats, instead of truncating the values to integers. Here are the rules for what an arithmetic operation will return:

- Two floats - Returns a float (ex. SELECT birth_rate / 1000000.0 FROM facts;)
- A float and an integer - Returns a float (ex. SELECT population / 1000000.0 FROM facts;)
- Two integers - Returns an integer (ex. SELECT population / 1000000 FROM facts;)
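These rules can be checked with a short sketch. It assumes a one-row in-memory table holding the example population from above; the table is made up for illustration and is not factbook.db.

```python
import sqlite3

# Hypothetical stand-in for the facts table (one illustrative row).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE facts (name TEXT, population INTEGER)")
conn.execute("INSERT INTO facts VALUES (?, ?)", ("A", 9766442))

query = "SELECT population / 1000000, population / 1000000.0 FROM facts"
as_int, as_float = conn.execute(query).fetchall()[0]

print(as_int)    # 9 (integer / integer drops the fractional part)
print(as_float)  # 9.766442 (integer / float keeps it)
```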

#### Instructions:
- Use arithmetic operators in a SQL query to express population_growth in terms of millions. Divide by a float so the query also returns a float.
    - Assign the result of the query to population_growth_millions.
    - Print population_growth_millions.

In [18]:
conn = sqlite3.connect("data/factbook.db")

query = "select population_growth / 1000000.0 from facts"
population_growth_millions = conn.execute(query).fetchall()

population_growth_millions = [i[0] for i in population_growth_millions]
print(population_growth_millions)

[2.32e-06, 3e-07, 1.8400000000000002e-06, 1.2e-07, 2.7799999999999996e-06, 1.24e-06, 9.300000000000001e-07, 1.5e-07, 1.0700000000000001e-06, 5.5e-07, 9.6e-07, 8.5e-07, 2.4100000000000002e-06, 1.6000000000000001e-06, 3.1e-07, 2.0000000000000002e-07, 7.6e-07, 1.87e-06, 2.7799999999999996e-06, 1.1100000000000002e-06, 1.56e-06, 1.3e-07, 1.21e-06, 7.7e-07, 1.6200000000000002e-06, 5.8e-07, 3.03e-06, 1.01e-06, 3.28e-06, 1.5800000000000001e-06, 2.5899999999999998e-06, 7.5e-07, 1.3600000000000001e-06, 2.13e-06, 1.89e-06, 8.2e-07, 4.5000000000000003e-07, 1.04e-06, 1.77e-06, 2.4500000000000003e-06, 2e-06, 1.22e-06, 1.91e-06, 1.3e-07, 1.5e-07, 1.4299999999999999e-06, 1.6e-07, 2.2e-07, 2.2e-06, 2.1e-07, 1.23e-06, 1.35e-06, 1.79e-06, 2.5e-07, 2.5099999999999997e-06, 2.25e-06, 5.5e-07, 2.8900000000000003e-06, 6.7e-07, 4.0000000000000003e-07, 4.3e-07, 1.9299999999999997e-06, 2.16e-06, 8e-08, 1.7000000000000001e-07, 2.1800000000000003e-06, 1e-08, 4.8e-07, 1.8200000000000002e-06, 2.63e-06, 1.91e-06, 2e-

## 10: Performing Arithmetic Between Columns

A few screens ago, we learned how to apply aggregation functions to columns in the SELECT statement, like so:


``SELECT AVG(birth_rate), SUM(population)
FROM facts;``

The aggregation functions modified the columns' values before SQLite returned them. SQL lets us perform many different kinds of manipulations on the columns we select. To calculate the ratio between births and deaths for each country, for example, we could divide the birth_rate column by the death_rate column. Here's how we would accomplish this:


``SELECT birth_rate / death_rate 
FROM facts;``

The query above will divide each value in the birth_rate column by the corresponding value in the death_rate column.

We can also perform more complex queries, such as finding the ratio of birth_rate plus migration_rate to death_rate. The results will help us discover whether the population is increasing or decreasing:


``SELECT (birth_rate + migration_rate) / death_rate 
FROM facts;``

The query will add together the birth_rate and migration_rate columns, then divide by the death_rate column. Arithmetic in SQL respects the order of operations and parentheses, so the addition step happens before the division step.
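The effect of the parentheses is easy to verify with a sketch. It assumes a two-row in-memory table with made-up rates (not factbook.db); the second row has a missing death_rate to show that arithmetic involving a null produces a null, which Python receives as None.

```python
import sqlite3

# Hypothetical stand-in for the facts table (illustrative rows only).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE facts (name TEXT, birth_rate REAL, migration_rate REAL, death_rate REAL)"
)
conn.executemany(
    "INSERT INTO facts VALUES (?, ?, ?, ?)",
    [("A", 20.0, 4.0, 8.0), ("B", 10.0, 2.0, None)],
)

query = """
SELECT birth_rate / death_rate,
       (birth_rate + migration_rate) / death_rate,
       birth_rate + migration_rate / death_rate
FROM facts
ORDER BY name
"""
ratios = conn.execute(query).fetchall()

print(ratios[0])  # (2.5, 3.0, 20.5): without parentheses, division happens first
print(ratios[1])  # (None, None, None): a null death_rate makes every result null
```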

#### Instructions:

- Use a SQL query to compute the population of each country a year from now.
     - Multiply the population column by population_growth divided by 100 (the growth rate is stored as a percentage), then add the population column to the result.
- Assign the result of the query to next_year_population.
- Print next_year_population.

In [23]:
conn = sqlite3.connect("data/factbook.db")

query = "SELECT (1 + (population_growth/100)) * population FROM facts;"
next_year_population = conn.execute(query).fetchall()
next_year_population = [i[0] for i in next_year_population]
print(next_year_population)

[33319834.734400004, 3038365.834, 40269741.8544, 85682.69600000001, 20170937.8134, 93582.2064, 43835802.5398, 3060966.5730000003, 22994449.849799998, 8713210.525, 9874675.488, 327356.0745, 1379066.3733, 171661068.92000002, 291504.87240000005, 9608868.378, 11410035.1948, 353864.8003, 10739119.3866, 750154.3009, 10969375.759200001, 3872082.1715, 2209129.8999, 205832612.55240002, 436606.2652, 7228576.979400001, 19505316.0858, 56889040.0806, 11094622.6528, 15956954.344800001, 24354063.7462, 35363084.77, 553418.5048, 5506378.780700001, 11851290.518399999, 17651827.732, 1373639072.2459998, 47222789.9712, 794794.1867000001, 81319826.832, 4850198.94, 4872876.5568, 23740242.2682, 4470648.2972, 11047980.149500001, 1206202.5171, 10661873.747200001, 5593782.3066, 846547.128, 73761.5747, 10607644.6988, 16082619.346, 90071320.3884, 6156703.375, 759335.6492999999, 6674562.0024999995, 1272379.81, 102340381.16909999, 915481.9062999999, 5498829.688, 66839947.193799995, 1738248.9848000002, 2010211.514400

## 11: Next Steps
In this mission, we explored how to calculate summary statistics in SQL. It's often advantageous to do these computations in the SQL database instead of a Python environment because it's faster to code and execute. In the next mission, we'll cover how to calculate more advanced statistics in SQL with the GROUP BY statement.