<div align="right" style=" font-size: 80%; text-align: center; margin: 0 auto">
<img src="https://raw.githubusercontent.com/Explore-AI/Pictures/master/ExploreAI_logos/Logo blue_dark.png"  style="width:25px" align="right";/>
</div>

# Initial data analysis with numeric functions
© ExploreAI Academy

In this notebook, we demonstrate how to extract basic metadata about a dataset using numeric functions.


> ⚠️ This notebook will not run on Google Colab because it cannot connect to a local database. Please make sure that this notebook is running on the same local machine as your MySQL Workbench installation and MySQL `united_nations` database.

## Learning objectives

By the end of this train, you should:
- Understand how to perform an initial data analysis using SQL numeric functions.

## Connecting to our MySQL database

Using the `Access_to_Basic_Services` table we created in MySQL Workbench, we want to answer some questions about the scope of our dataset: how many rows it has, which years and countries it covers, and what a typical value looks like. Once we connect to our MySQL server, the same queries run unchanged in MySQL Workbench and in this notebook. Since we have a MySQL database, we connect to it with a `mysql+pymysql` connection string, which relies on the `pymysql` driver.

In [None]:
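# Load and activate the SQL extension to allow us to execute SQL in a Jupyter notebook.
# If you get an error here, make sure that the SQL extension for Jupyter (for example, the ipython-sql package)
# and pymysql are installed in the same Python environment that runs this notebook.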

%load_ext sql

In [None]:
# Establish a connection to the local database using the '%sql' magic command.
# Replace 'password' with your connection password. If your database is not called
# 'united_nations', replace that part of the connection string with your database name.
# If you get an error here, please make sure the database name and password are correct.

%sql mysql+pymysql://root:password@localhost:3306/united_nations
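
If the connection string is rejected even though the password is correct, one possible cause is a special character such as `@` or `/` in the password, which clashes with the URL syntax. The sketch below shows one way to build the string safely with Python's standard library. The helper name `build_mysql_url` and the sample password are hypothetical, and the defaults simply mirror the values used in the cell above.

```python
from urllib.parse import quote


def build_mysql_url(user, password, host="localhost", port=3306,
                    database="united_nations"):
    """Return a mysql+pymysql URL with the user and password percent-encoded.

    Hypothetical helper for illustration; it is not part of any library.
    """
    # safe="" makes quote() encode every reserved character, including '/' and '@'.
    safe_user = quote(user, safe="")
    safe_password = quote(password, safe="")
    return f"mysql+pymysql://{safe_user}:{safe_password}@{host}:{port}/{database}"


# "p@ss/word" is a made-up password containing two characters that break URLs.
example_url = build_mysql_url("root", "p@ss/word")
print(example_url)
# mysql+pymysql://root:p%40ss%2Fword@localhost:3306/united_nations
```

The resulting string has the same shape as the connection string passed to `%sql` above, with only the password segment encoded.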

To run a query, we put the `%%sql` cell magic on the first line of a cell, leave one blank line, write the query as shown below, and then run the cell.

In [None]:
%%sql

SELECT 
    *
FROM
    Access_to_Basic_Services
LIMIT 5;

## Exercise
We want to determine the following:
1. What is the total number of entries in the dataset?
2. What are the earliest and latest years for which we have data?
3. How many countries are included in this dataset?
4. What is the average percentage of people who have access to managed drinking water services across all years and all countries included in our dataset?

### 1. What is the total number of entries in the dataset?

Count the number of entries in the dataset using the COUNT function. Return the result with an alias.

In [None]:
# Add your code here

### 2. What are the earliest and latest years for which we have data?

Determine the earliest and latest years by calculating the minimum and maximum of the `Time_period` column using the MIN and MAX functions respectively. Use aliases to name the results.

In [None]:
# Add your code here

### 3. How many countries are included in this dataset?

Count the number of countries in the `Country_name` column. Note that if we use the COUNT function on the column without an additional keyword, SQL returns the number of non-NULL entries in the column, including duplicates, so a country is counted once for every row it appears in. Use the DISTINCT keyword to count each unique country name only once. Return the result with an alias.

In [None]:
# Add your code here

### 4. What is the average percentage of people who have access to managed drinking water services across all years and all countries included in our dataset?

Use the AVG function to calculate the average of the `Pct_managed_drinking_water_services` column. Use an alias.

In [None]:
# Add your code here

## Solutions

### 1. What is the total number of entries in the dataset?

In [None]:
%%sql

SELECT
    COUNT(*) AS Number_of_observations
FROM united_nations.Access_to_Basic_Services;

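Here, we could also have counted a single column, for example `COUNT(Country_name)`, but this gives the same result only if that column contains no NULL values: `COUNT(column)` skips NULLs, whereas `COUNT(*)` counts every row.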
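
The difference between counting rows and counting a column can be checked on a toy table. The sketch below uses Python's built-in `sqlite3` module with an in-memory database as a stand-in for our MySQL server. The table `toy_services` and its four rows are made up for illustration and are not taken from the real dataset.

```python
import sqlite3

# In-memory SQLite database: an assumption made so the example runs anywhere.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE toy_services ("
    "Country_name TEXT, Pct_managed_drinking_water_services REAL)"
)
# Four made-up rows; one of them has a missing (NULL) percentage.
conn.executemany(
    "INSERT INTO toy_services VALUES (?, ?)",
    [("A", 90.0), ("B", None), ("C", 75.5), ("A", 60.0)],
)

count_all, count_non_null = conn.execute(
    "SELECT COUNT(*) AS Number_of_observations, "
    "COUNT(Pct_managed_drinking_water_services) AS Number_of_non_null_values "
    "FROM toy_services"
).fetchone()
conn.close()

print(count_all, count_non_null)  # 4 3
```

The two counts differ by exactly the number of NULL values in the counted column.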

### 2. What are the earliest and latest years for which we have data?

In [None]:
%%sql

SELECT
    MIN(Time_period) AS Min_time_period,
    MAX(Time_period) AS Max_time_period
FROM united_nations.Access_to_Basic_Services;

### 3. How many countries are included in this dataset?

In [None]:
%%sql

SELECT
    COUNT(DISTINCT Country_name) AS Number_of_countries
FROM united_nations.Access_to_Basic_Services;
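
To see why DISTINCT matters here, we can compare the two counts on a small example. This is a minimal sketch that assumes an in-memory SQLite database (via Python's `sqlite3` module) instead of our MySQL server, with a hypothetical table `toy_services` in which country "A" appears twice.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for the MySQL connection
conn.execute("CREATE TABLE toy_services (Country_name TEXT, Time_period INTEGER)")
# Made-up rows: country "A" is observed in two different years.
conn.executemany(
    "INSERT INTO toy_services VALUES (?, ?)",
    [("A", 2015), ("A", 2020), ("B", 2015), ("C", 2020)],
)

rows_with_country, unique_countries = conn.execute(
    "SELECT COUNT(Country_name) AS Number_of_entries, "
    "COUNT(DISTINCT Country_name) AS Number_of_countries "
    "FROM toy_services"
).fetchone()
conn.close()

print(rows_with_country, unique_countries)  # 4 3
```

Without DISTINCT, every row with a country name is counted, so a country with data for several years inflates the result.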

### 4. What is the average percentage of people who have access to managed drinking water services across all years and all countries included in our dataset?

In [None]:
%%sql

SELECT
    AVG(Pct_managed_drinking_water_services) AS AVG_managed_drinking_water_services
FROM united_nations.Access_to_Basic_Services;
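
When reading this average, it helps to know that AVG skips NULL values rather than treating them as zero. The sketch below illustrates this on a hypothetical three-row table, again assuming an in-memory SQLite database as a stand-in for MySQL. The `COALESCE` variant is included only to show what the result would be if missing values were counted as 0.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for the MySQL connection
conn.execute("CREATE TABLE toy_services (Pct_managed_drinking_water_services REAL)")
# Made-up values: two known percentages and one missing value.
conn.executemany(
    "INSERT INTO toy_services VALUES (?)",
    [(50.0,), (100.0,), (None,)],
)

avg_ignoring_nulls, avg_nulls_as_zero = conn.execute(
    "SELECT AVG(Pct_managed_drinking_water_services) AS Avg_ignoring_nulls, "
    "AVG(COALESCE(Pct_managed_drinking_water_services, 0)) AS Avg_nulls_as_zero "
    "FROM toy_services"
).fetchone()
conn.close()

print(avg_ignoring_nulls, avg_nulls_as_zero)  # 75.0 50.0
```

The plain AVG divides by the two non-NULL rows, which is usually what we want for percentages with missing observations.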

### Summary

We can also combine all of our queries into a single query to have a single return that includes all of the values.

In [None]:
%%sql

SELECT
    COUNT(*) AS Number_of_observations,
    MIN(Time_period) AS Min_time_period,
    MAX(Time_period) AS Max_time_period,
    COUNT(DISTINCT Country_name) AS Number_of_countries,
    AVG(Pct_managed_drinking_water_services) AS AVG_managed_drinking_water_services
FROM united_nations.Access_to_Basic_Services;

Note that our results table is neatly labelled with the appropriate column names because we used aliases, which are introduced with the AS keyword. Aliases are especially useful when a query returns multiple calculated columns, since each value would otherwise be labelled only by its expression.
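
The effect of the aliases can also be inspected programmatically. The following sketch runs the same combined query against a hypothetical four-row table in an in-memory SQLite database (an assumption, used so that the example runs without a MySQL server) and reads the column labels from the cursor.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for the MySQL connection
conn.execute(
    "CREATE TABLE toy_services ("
    "Country_name TEXT, Time_period INTEGER, "
    "Pct_managed_drinking_water_services REAL)"
)
# Made-up rows for two countries and two years, with one missing percentage.
conn.executemany(
    "INSERT INTO toy_services VALUES (?, ?, ?)",
    [("A", 2015, 90.0), ("A", 2020, 95.0), ("B", 2015, 60.0), ("B", 2020, None)],
)

cursor = conn.execute(
    "SELECT COUNT(*) AS Number_of_observations, "
    "MIN(Time_period) AS Min_time_period, "
    "MAX(Time_period) AS Max_time_period, "
    "COUNT(DISTINCT Country_name) AS Number_of_countries, "
    "AVG(Pct_managed_drinking_water_services) AS AVG_managed_drinking_water_services "
    "FROM toy_services"
)
# The first element of each description entry is the column label (the alias).
column_names = [column[0] for column in cursor.description]
summary = dict(zip(column_names, cursor.fetchone()))
conn.close()

for name, value in summary.items():
    print(name, value)
```

Because every expression has an alias, the result can be turned into a dictionary with meaningful keys instead of relying on column positions.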

#  
<div align="center" style=" font-size: 80%; text-align: center; margin: 0 auto">
<img src="https://raw.githubusercontent.com/Explore-AI/Pictures/master/ExploreAI_logos/EAI_Blue_Dark.png"  style="width:200px";/>
</div>