<div align="right" style=" font-size: 80%; text-align: center; margin: 0 auto">
<img src="https://raw.githubusercontent.com/Explore-AI/Pictures/master/ExploreAI_logos/Logo blue_dark.png"  style="width:25px" align="right";/>
</div>

# Using SQL string functions to clean data
© ExploreAI Academy

In this notebook, we will use SQL string functions to clean our data by identifying and removing unwanted characters.



> ⚠️ This notebook will not run on Google Colab because it cannot connect to a local database. Please make sure that this notebook is running on the same local machine as your MySQL Workbench installation and MySQL `united_nations` database.

## Learning objectives

In this train, we will learn how to:
- Identify and remove unwanted spaces from string values.
- Extract portions of a string based on specified start and end positions.


## Overview

Let's explore how string functions can be used to clean up data in our table `Access_to_Basic_Services`.
The country name column contains a number of entries with unwanted information inside parentheses. We need to extract the country name without the additional details.


## Connecting to the MySQL database

We'll start by connecting to the `united_nations` database. To connect to the MySQL server, run the cells below.


In [None]:
# Load and activate the SQL extension to allow us to execute SQL in a Jupyter notebook. 
# If you get an error here, make sure that mysql and pymysql are installed correctly. 

%load_ext sql

In [None]:
# Establish a connection to the local database using the '%sql' magic command.
# Replace 'password' with your connection password. 
# If you get an error here, please make sure the database name or password is correct.

%sql mysql+pymysql://root:password@localhost:3306/united_nations

## Exercise


Let's start by selecting all unique country names from the table `Access_to_Basic_Services`. We will then use the `WHERE` clause to filter for country names that contain information in parentheses.

In [None]:
%%sql

SELECT DISTINCT
    Country_name
FROM
    united_nations.Access_to_Basic_Services
WHERE
    Country_name LIKE '%(%)%';
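
To make the filter easier to reason about, here is a minimal Python sketch of what the `LIKE '%(%)%'` pattern is looking for: any text, an opening bracket, any text, a closing bracket, any text. The regular expression is an assumed translation of the SQL pattern, and the sample names are illustrative values rather than rows read from the database.

```python
import re

# Assumed regex translation of the SQL pattern '%(%)%':
# an opening bracket followed, anywhere later in the string, by a closing bracket.
PARENTHESES_PATTERN = re.compile(r"\(.*\)", re.DOTALL)

# Illustrative sample values, not read from the database.
sample_names = [
    "Iran (Islamic Republic of)",
    "Kenya",
    "Bolivia (Plurinational State of)",
    "China",
]

# Keep only the names that contain information in parentheses.
names_with_parentheses = [
    name for name in sample_names if PARENTHESES_PATTERN.search(name)
]

print(names_with_parentheses)
```

Only the names that contain both brackets, in that order, pass the filter. These are the rows the rest of the exercise works on.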

### 1. Extract country names without the information inside the parentheses

Building on the previous query, extract the part of each country name to the left of the opening bracket, using the position of the opening bracket as the length of the substring to extract. Store the results in a column called `New_country_name`.

Then, get the length of each value in the `New_country_name` column to help identify any extra characters. Store this as `New_country_name_length`.

In [None]:
%%sql
# Add your code here

### 2. Identify any extra characters

Refine the solution above to remove any extra characters from the `New_country_name` column. Use the `New_country_name_length` column to work out how many extra characters there are, then update the query and rerun it to confirm that the extra characters and spaces have been removed from `New_country_name`.

In [None]:
%%sql
# Add your code here

## Solutions

### 1. Extract country names without the information inside the parentheses

Our approach uses the SQL `POSITION` function to locate the opening bracket in each entry of the `Country_name` column. We then pass that position to the `LEFT` function as the length of the substring, so that `LEFT` returns the leftmost characters of each name up to that position. We store the result in a column called `New_country_name`.

To get the length of the newly formed country name, so that we can look for any discrepancies, we will nest the expression used to create the `New_country_name` column in a `LENGTH` function and save the result as `New_country_name_length`.


In [None]:
%%sql

SELECT DISTINCT
    Country_name,
    LEFT(Country_name, POSITION('(' IN Country_name)) AS New_country_name,
    LENGTH(LEFT(Country_name, POSITION('(' IN Country_name))) AS New_country_name_length
FROM
    Access_to_Basic_Services
WHERE
    Country_name LIKE '%(%)%';
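
To see what this query produces for a single value, the same logic can be traced on a Python string. This is only a sketch: `sql_position` and `sql_left` are hypothetical helpers written here to imitate the 1-based `POSITION` and `LEFT` functions, and the sample value is illustrative rather than read from the database.

```python
def sql_position(substr: str, text: str) -> int:
    """Imitate POSITION(substr IN text): 1-based index, or 0 when absent."""
    # str.find is 0-based and returns -1 when absent, so adding 1 gives 0.
    return text.find(substr) + 1


def sql_left(text: str, length: int) -> str:
    """Imitate LEFT(text, length): the leftmost `length` characters."""
    # Clamp at 0 so a non-positive length yields an empty string.
    return text[: max(length, 0)]


# Illustrative sample value, not read from the database.
country_name = "Iran (Islamic Republic of)"

new_country_name = sql_left(country_name, sql_position("(", country_name))
new_country_name_length = len(new_country_name)

print(repr(new_country_name), new_country_name_length)
```

Printing the value with `repr` makes the trailing characters visible, which is what the length column helps us spot in the SQL output.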

### 2. Identify any extra characters

The `New_country_name` column now shows the country name without the information inside the parentheses. However, it still ends with an extra character, the opening bracket. This is because `POSITION` returns the position of the bracket itself, so our substring length includes it. We will therefore refine our solution by subtracting one from the opening bracket's position so that the bracket is excluded.

We also notice a second issue: extra spaces. For example, the country name "Iran" should have a length of four characters, but it shows six. Removing the opening bracket accounts for one of the extra characters; the other is the space between the name and the bracket. To address this, we will apply the `RTRIM` function to the extracted country name to remove any trailing spaces.


In [None]:
%%sql

SELECT DISTINCT
    Country_name,
    RTRIM(LEFT(Country_name, POSITION('(' IN Country_name) - 1)) AS New_country_name,
    LENGTH(RTRIM(LEFT(Country_name, POSITION('(' IN Country_name) - 1))) AS New_country_name_length
FROM
    Access_to_Basic_Services
WHERE
    Country_name LIKE '%(%)%';
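
As a quick check of the refined logic, here is a minimal Python sketch of the same three steps: find the bracket, take everything before it, and strip trailing spaces. The function `clean_country_name` is a hypothetical helper, the sample names are illustrative values, and the clamp at zero assumes that MySQL's `LEFT` returns an empty string for a non-positive length.

```python
def clean_country_name(country_name: str) -> str:
    """Imitate RTRIM(LEFT(name, POSITION('(' IN name) - 1)) on a Python string."""
    # str.find is 0-based and returns -1 when absent, so adding 1 gives
    # a 1-based position like POSITION, or 0 when there is no bracket.
    bracket_position = country_name.find("(") + 1
    # Subtract one to exclude the bracket itself; clamp at 0 (assumed to
    # match LEFT's behaviour for a non-positive length).
    left_part = country_name[: max(bracket_position - 1, 0)]
    # RTRIM removes trailing spaces only, so strip " " rather than all whitespace.
    return left_part.rstrip(" ")


# Illustrative sample values, not read from the database.
sample_names = ["Iran (Islamic Republic of)", "Bolivia (Plurinational State of)"]

for name in sample_names:
    cleaned = clean_country_name(name)
    print(f"{name!r} -> {cleaned!r} (length {len(cleaned)})")
```

Under the stated assumption, a name with no opening bracket would come back empty from this expression, which is one reason to keep the `WHERE Country_name LIKE '%(%)%'` filter in the query.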

## Summary

The `New_country_name_length` column now accurately represents the length of the extracted country name without any additional spaces. By using nested functions and refining our approach step by step, we've successfully cleaned our data.


<div align="center" style=" font-size: 80%; text-align: center; margin: 0 auto">
<img src="https://raw.githubusercontent.com/Explore-AI/Pictures/master/ExploreAI_logos/EAI_Blue_Dark.png"  style="width:200px";/>
</div>