# **Maji Ndogo: From analysis to action**
# **Beginning Our Data-Driven Journey in Maji Ndogo**

### Overview:
Welcome to my personal data project focused on addressing a water crisis in Maji Ndogo. This project involves analyzing a database of 60,000 records collected by a dedicated team to extract insights and solutions for improving water quality and addressing pollution issues in the area.

### Main Goal:
The main goal of this project is to leverage data analysis techniques to understand different water sources, assess water quality, identify pollution issues, and ultimately contribute to solving the water crisis in Maji Ndogo.

## Main Points:

### 1. Get to Know Our Data:
- **Exploring the Foundational Tables and Their Structure:**
  - Understanding the structure of the database tables.
  - Identifying relationships between tables using "_id" columns.
  - Accessing the data dictionary for detailed documentation.

### 2. Dive into Sources:
- **Understanding Different Sources with SELECT:**
  - Identifying the table containing information on water sources.
  - Writing SQL queries to retrieve unique types of water sources.
  - Exploring the types of water sources like tap in home, tap in home broken, well, shared tap, and river.

### 3. Unpack the Visits:
- **Discovering the Visit Patterns:**
  - Identifying the table logging visits to water sources.
  - Writing SQL queries to retrieve records based on visit patterns, like time spent in queues.
  - Analyzing multiple visits to shared taps for queue time variations.

### 4. Water Source Quality:
- **Understanding Water Quality:**
  - Locating the table containing quality scores assigned by field surveyors.
  - Analyzing quality scores ranging from 1 to 10 for different water sources.
  - Investigating records with high quality scores for home taps and revisited sources.

### 5. Pollution Issues:
- **Correcting Pollution Data with LIKE and String Operations:**
  - Identifying the table recording contamination/pollution data for well sources.
  - Examining the pollutants, contamination levels, and classifications (Clean, Contaminated: Biological, Contaminated: Chemical).
  - Linking pollution data to specific sources in Maji Ndogo for further analysis and action.


This project aims to not only explore the database but also develop meaningful insights through data analysis, contributing towards sustainable solutions for the water crisis in Maji Ndogo. Let's dive into this project together and make a positive impact!

## 1. **Get to Know Our Data**


1. **Identify Tables:**
   - Use the `SHOW TABLES` query to list all tables in the database.

2. **Table Purpose:**
   - Understand the role of each table in storing specific types of data.

3. **Retrieve First Records:**
   - Write a `SELECT` statement to retrieve the first five records from each table.
   - Examine the columns and their data types in each table to understand the information they contain.

4. **Explore Data Content:**
   - Review the retrieved records to grasp the type of data present in each table.
   - Note down the information stored in each table to build a comprehensive understanding of the database content.

By following these steps, we will gain a solid foundation in understanding the data structure and content within the database, setting the stage for further exploration and analysis to extract meaningful insights for addressing the water crisis in Maji Ndogo.
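The steps above can be sketched as a small loop. This is only a rough sketch, not the notebook's actual setup: it assumes an in-memory SQLite database as a stand-in for the MySQL server, the helper name `preview_tables` is made up for illustration, and the handful of rows are copied from the sample outputs shown later in this notebook.

```python
import sqlite3

# In-memory SQLite stand-in for the MySQL database (an assumption for this sketch).
# The rows are a small subset copied from the sample outputs in this notebook.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE location (location_id TEXT, address TEXT, province_name TEXT,
                       town_name TEXT, location_type TEXT);
INSERT INTO location VALUES
  ('AkHa00000', '2 Addis Ababa Road', 'Akatsi', 'Harare', 'Urban'),
  ('AkHa00001', '10 Addis Ababa Road', 'Akatsi', 'Harare', 'Urban');
CREATE TABLE water_source (source_id TEXT, type_of_water_source TEXT,
                           number_of_people_served INTEGER);
INSERT INTO water_source VALUES
  ('AkHa00000224', 'tap_in_home', 956),
  ('AkHa00001224', 'tap_in_home_broken', 930);
""")


def preview_tables(connection, limit=5):
    """Return {table_name: (column_names, first `limit` rows)} for every table."""
    # SQLite lists its tables in sqlite_master; on MySQL this would be SHOW TABLES.
    names = [row[0] for row in connection.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]
    previews = {}
    for name in names:
        cursor = connection.execute(f'SELECT * FROM "{name}" LIMIT ?', (limit,))
        columns = [col[0] for col in cursor.description]
        previews[name] = (columns, cursor.fetchall())
    return previews


previews = preview_tables(conn)
for table, (columns, rows) in previews.items():
    print(table, columns, len(rows))
```

Looping over the table names keeps the "first five records" step identical for every table, which makes it harder to skip one by accident.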


We have an invaluable asset: a database of 60,000 records, meticulously collected by a devoted team of engineers, field workers, scientists, and analysts.

The next crucial phase of our mission begins. We need to make sense of this immense data trove and extract meaningful insights.

Let's load this database and thoroughly acquaint ourselves with it.

In [1]:
%load_ext sql
# Connect to MySQL database
%sql mysql+pymysql://root:123456@127.0.0.1:3306/md_water_services

Let's start by retrieving the first few records from each table. How many tables are there in our database? What are the names of these tables?

In [2]:
#Retrieve a list of all tables in the database
%sql SHOW TABLES;

 * mysql+pymysql://root:***@127.0.0.1:3306/md_water_services
8 rows affected.


Tables_in_md_water_services
data_dictionary
employee
global_water_access
location
visits
water_quality
water_source
well_pollution


It looks like someone took the time to name all of these tables pretty well because we can kind of figure out what each table is about
without really having to think too hard. water_source probably logs information about each source like where it is, what type of source
it is and so on.

So let's have a look at one of these tables. Let's start with location and retrieve its first five records.

In [3]:
%%sql
# Retrieve a sample of records from the location table
SELECT 
	*
FROM 
	location
LIMIT 5;

 * mysql+pymysql://root:***@127.0.0.1:3306/md_water_services
5 rows affected.


location_id,address,province_name,town_name,location_type
AkHa00000,2 Addis Ababa Road,Akatsi,Harare,Urban
AkHa00001,10 Addis Ababa Road,Akatsi,Harare,Urban
AkHa00002,9 Addis Ababa Road,Akatsi,Harare,Urban
AkHa00003,139 Addis Ababa Road,Akatsi,Harare,Urban
AkHa00004,17 Addis Ababa Road,Akatsi,Harare,Urban


So we can see that this table has information on a specific location, with an address, the province and town the location is in, and if it's
in a city (Urban) or not. We can't really see what location this is but we can see some sort of identifying number of that location.

Ok, so let's look at the visits table.

In [4]:
%%sql
# Retrieve a sample of records from the visits table
SELECT
	*
FROM
	visits
LIMIT 5;

 * mysql+pymysql://root:***@127.0.0.1:3306/md_water_services
5 rows affected.


record_id,location_id,source_id,time_of_record,visit_count,time_in_queue,assigned_employee_id
0,SoIl32582,SoIl32582224,2021-01-01 09:10:00,1,15,12
1,KiRu28935,KiRu28935224,2021-01-01 09:17:00,1,0,46
2,HaRu19752,HaRu19752224,2021-01-01 09:36:00,1,62,40
3,AkLu01628,AkLu01628224,2021-01-01 09:53:00,1,0,1
4,AkRu03357,AkRu03357224,2021-01-01 10:11:00,1,28,14


Each row records that an employee (assigned_employee_id) visited a location (location_id) at a given time (time_of_record) and found a 'source' there (source_id). Columns ending in "_id" usually refer to another table. In this case, source_id in the visits table refers to source_id in the water_source table.

Ok, so let's look at the water_source table to see what a 'source' is.

In [5]:
%%sql
# Retrieve a sample of records from the water_source table
SELECT
	*
FROM
	water_source
LIMIT 5;

 * mysql+pymysql://root:***@127.0.0.1:3306/md_water_services
5 rows affected.


source_id,type_of_water_source,number_of_people_served
AkHa00000224,tap_in_home,956
AkHa00001224,tap_in_home_broken,930
AkHa00002224,tap_in_home_broken,486
AkHa00003224,well,364
AkHa00004224,tap_in_home_broken,942


Nice! Ok, we're getting somewhere now... Water sources are where people get their water from! Ok, this database is actually complex,
so maybe a good idea for us is to look at the rest of the tables quickly.

In [6]:
%%sql
# Retrieve a sample of records from the employee table
SELECT 
	*
FROM 
	employee
LIMIT 5;

 * mysql+pymysql://root:***@127.0.0.1:3306/md_water_services
5 rows affected.


assigned_employee_id,employee_name,phone_number,email,address,province_name,town_name,position
0,Amara Jengo,99637993287,,36 Pwani Mchangani Road,Sokoto,Ilanga,Field Surveyor
1,Bello Azibo,99643864786,,129 Ziwa La Kioo Road,Kilimani,Rural,Field Surveyor
2,Bakari Iniko,99222599041,,18 Mlima Tazama Avenue,Hawassa,Rural,Field Surveyor
3,Malachi Mavuso,99945849900,,100 Mogadishu Road,Akatsi,Lusaka,Field Surveyor
4,Cheche Buhle,99381679640,,1 Savanna Street,Akatsi,Rural,Field Surveyor


In [7]:
%%sql
# Retrieve a sample of records from the water_quality table
SELECT 
	*
FROM 
	water_quality
LIMIT 5;

 * mysql+pymysql://root:***@127.0.0.1:3306/md_water_services
5 rows affected.


record_id,subjective_quality_score,visit_count
0,0,1
1,1,1
2,5,1
3,10,1
4,4,1


In [8]:
%%sql
# Retrieve a sample of records from the well_pollution table
SELECT
	*
FROM 
	well_pollution
LIMIT 5;

 * mysql+pymysql://root:***@127.0.0.1:3306/md_water_services
5 rows affected.


source_id,date,description,pollutant_ppm,biological,results
KiRu28935224,2021-01-04 09:17:00,Bacteria: Giardia Lamblia,0.0,495.898,Contaminated: Biological
AkLu01628224,2021-01-04 09:53:00,Bacteria: E. coli,0.0,6.09608,Contaminated: Biological
HaZa21742224,2021-01-04 10:37:00,"Inorganic contaminants: Zinc, Zinc, Lead, Cadmium",2.715,0.0,Contaminated: Chemical
HaRu19725224,2021-01-04 11:04:00,Clean,0.0288593,9.56996e-05,Clean
SoRu35703224,2021-01-04 11:29:00,Bacteria: E. coli,0.0,22.5009,Contaminated: Biological


In [9]:
%%sql
# Retrieve a sample of records from the global_water_access table
SELECT 
	*
FROM 
	global_water_access
LIMIT 5;

 * mysql+pymysql://root:***@127.0.0.1:3306/md_water_services
5 rows affected.


name,region,year,pop_n,pop_u,wat_bas_n,wat_lim_n,wat_unimp_n,wat_sur_n,wat_bas_r,wat_lim_r,wat_unimp_r,wat_sur_r,wat_bas_u,wat_lim_u,wat_unimp_u,wat_sur_u
Afghanistan,South Asia,2015,34413.6,24.803,61.3398,3.5112,22.1688,12.9802,52.9885,3.86114,26.5533,16.5971,86.6589,2.45027,8.87604,2.01475
Afghanistan,South Asia,2020,38928.3,26.026,75.0914,1.44754,14.5603,8.90078,66.3279,1.95682,19.6829,12.0323,100.0,0.0,0.0,0.0
Albania,Europe & Central Asia,2015,2890.52,57.434,93.3943,3.62638,2.97929,0.0,90.6273,5.26317,4.10955,0.0,95.4451,2.41331,2.14162,0.0
Albania,Europe & Central Asia,2020,2877.8,62.112,95.068,1.88466,3.04731,0.0,94.0914,2.30526,3.60338,0.0,95.6638,1.62809,2.7081,0.0
Algeria,Middle East & North Africa,2015,39728.0,70.848,93.4096,5.15778,1.27546,0.157193,88.3527,8.68575,2.58043,0.381108,95.4903,3.70612,0.73851,0.0650579


A data dictionary has been embedded into the database. If we query the data_dictionary table, an explanation of each column is
given there.

In [10]:
%%sql
# Retrieve the data dictionary
SELECT
	*
FROM
	data_dictionary;

 * mysql+pymysql://root:***@127.0.0.1:3306/md_water_services
49 rows affected.


table_name,column_name,description,datatype,related_to
employee,assigned_employee_id,Unique ID assigned to each employee,INT,visits
employee,employee_name,Name of the employee,VARCHAR(255),
employee,phone_number,Contact number of the employee,VARCHAR(15),
employee,email,Email address of the employee,VARCHAR(255),
employee,address,Residential address of the employee,VARCHAR(255),
employee,town_name,Name of the town where the employee resides,VARCHAR(255),
employee,province_name,Name of the province where the employee resides,VARCHAR(255),
employee,position,Position or job title of the employee,VARCHAR(255),
visits,record_id,Unique ID assigned to each visit,int,"water_quality, water_source"
visits,location_id,ID of the location visited,varchar(255),location


## 2. **Dive into Sources: Understanding Different Sources with SELECT**

1. **Identify Water Sources Table:**
   - Locate the table dedicated to storing information about different water sources

2. **Retrieve Unique Water Source Types:**
   - Write a `SELECT` query to extract all distinct types of water sources available in the `water_source` table.
   - Use the `DISTINCT` keyword to ensure only unique water source types are returned in the query results.

3. **Explore Water Source Details:**
   - Examine the columns in the `water_source` table to understand the attributes associated with each water source.
   - Utilize descriptive column names to identify key information such as the source ID, source type, and number of people served.

4. **Analyze Water Source Distribution:**
   - Calculate the frequency of each water source type by running aggregate functions like `COUNT` in combination with `GROUP BY`.
   - Gain insights into the distribution of different water sources to prioritize analysis and interventions based on prevalence.

By delving into the `water_source` table using SQL queries, we will uncover valuable insights into the types, distribution, and characteristics of water sources in Maji Ndogo. This exploration is crucial for understanding the landscape of available water sources and informing targeted strategies to address the water crisis effectively.
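The `COUNT` with `GROUP BY` step can be sketched as follows. This is a toy example on an in-memory SQLite stand-in (an assumption, not the real MySQL connection), loaded with only the five sample rows of `water_source` shown earlier, so the counts say nothing about the full table.

```python
import sqlite3

# In-memory SQLite stand-in (assumption); the five rows are the water_source
# sample shown earlier in this notebook, not the full table.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE water_source (
    source_id TEXT, type_of_water_source TEXT, number_of_people_served INTEGER)""")
conn.executemany("INSERT INTO water_source VALUES (?, ?, ?)", [
    ("AkHa00000224", "tap_in_home", 956),
    ("AkHa00001224", "tap_in_home_broken", 930),
    ("AkHa00002224", "tap_in_home_broken", 486),
    ("AkHa00003224", "well", 364),
    ("AkHa00004224", "tap_in_home_broken", 942),
])

# One output row per source type: how many records, and how many people they serve.
counts = conn.execute("""
    SELECT type_of_water_source,
           COUNT(*) AS number_of_sources,
           SUM(number_of_people_served) AS people_served
    FROM water_source
    GROUP BY type_of_water_source
    ORDER BY number_of_sources DESC, type_of_water_source
""").fetchall()

for row in counts:
    print(row)
```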

Now that we're familiar with the structure of the tables, let's dive deeper. We need to understand the types of water sources we're
dealing with. 

Let's write a SQL query to find all the unique types of water sources.

In [11]:
%%sql
# Retrieve all unique types of water sources
SELECT DISTINCT 
	type_of_water_source
FROM
	water_source;

 * mysql+pymysql://root:***@127.0.0.1:3306/md_water_services
5 rows affected.


type_of_water_source
tap_in_home
tap_in_home_broken
well
shared_tap
river


Let me quickly bring you up to speed on these water source types.

Characteristics and Implications of Each Water Source Type:

   1. **River:**
      - **Characteristics:** Open water source where people collect drinking water.
      - **Implications:** High risk of contamination with biological and other pollutants, making it the least desirable water source due to health risks.

   2. **Well:**
      - **Characteristics:** Draws water from underground sources, commonly shared by communities.
      - **Implications:** Less likely to be contaminated compared to a river; however, many wells may be unclean due to aging infrastructure and past corruption.

   3. **Shared Tap:**
      - **Characteristics:** Tap located in a public area shared by communities.
      - **Implications:** Serves multiple individuals, providing access to clean water for a significant number of people.

   4. **Tap in Home:**
      - **Characteristics:** Taps installed inside homes, serving about 6 people on average.
      - **Implications:** Provides convenient access to clean water within households, contributing to improved hygiene and quality of life.

   5. **Broken Tap in Home:**
      - **Characteristics:** Taps installed in homes but not functional due to infrastructure issues.
      - **Implications:** Represents a potential source of clean water that is currently unusable, highlighting the need for maintenance and repair to ensure access to safe drinking water.

Understanding the characteristics and implications of each water source type is essential for assessing water quality, identifying potential contamination risks, and implementing targeted interventions to improve access to clean and safe drinking water in Maji Ndogo.

**An important note on the home taps:** About 6-10 million people have running water installed in their homes in Maji Ndogo, including
broken taps. If we were to document this, we would have a row of data for each home, so that one record is one tap. That means our
database would contain about 1 million rows of data, which may slow our systems down. For now, the surveyors combined the data of
many households together into a single record.

For example, the first record, AkHa00000224, is for a tap_in_home that serves 956 people. This means the records of about 160 nearby homes were combined into one record, with an average of 6 people living in each house (160 × 6 = 960 ≈ 956). So one tap_in_home or tap_in_home_broken record actually refers to multiple households, and the sum of the people living in these homes equals number_of_people_served.
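The household arithmetic can be checked quickly. The helper below is hypothetical and simply divides number_of_people_served by the assumed average of 6 people per home; the three records are taken from the water_source sample shown earlier.

```python
def estimate_households(people_served, people_per_home=6):
    """Rough number of homes behind one combined record.

    Assumes the average of about 6 people per home quoted in the note above.
    """
    return round(people_served / people_per_home)


# source_id -> number_of_people_served, copied from the water_source sample.
sample_records = {
    "AkHa00000224": 956,
    "AkHa00001224": 930,
    "AkHa00002224": 486,
}

estimates = {sid: estimate_households(n) for sid, n in sample_records.items()}
print(estimates)
```

For AkHa00000224 this gives roughly 159 homes, in line with the "about 160" quoted above.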

## 3. **Unpack the Visits: Discovering the Visit Patterns**

1. **Identify Visits Table:**
   - Locate the `visits` table within the database, which likely contains records of visits to different water sources.

2. **Retrieve Visit Patterns:**
   - Write a `SELECT` query to extract data from the `visits` table, focusing on visit timestamps, source locations, and time spent in queues.
   - Utilize filtering conditions to identify patterns such as peak visit times, popular source locations, and unusually long queue times.

3. **Analyze Visit Frequency:**
   - Calculate the frequency of visits to each water source by using aggregate functions like `COUNT` in combination with `GROUP BY`.
   - Identify sources that have been visited more frequently than others to understand usage patterns and prioritize analysis.

4. **Explore Visit Distribution:**
   - Visualize the distribution of visits geographically by mapping source locations based on visit frequencies.
   - Identify clusters of high-traffic locations and analyze visit patterns to uncover insights into water source accessibility and usage.

By delving into the `visits` table and unraveling the visit patterns, we will gain valuable insights into how different water sources are utilized, the frequency of visits, and potential trends that can inform decision-making processes to address the water crisis effectively in Maji Ndogo.

We have a table in our database that logs the visits made to different water sources.

Let's write an SQL query that retrieves all records from this table where the time_in_queue is more than some crazy time, say 500 min. How
would it feel to queue 8 hours for water?


In [12]:
%%sql
# Retrieve all records from the visits table where the time spent in queue is more than 500 minutes
SELECT
	*
FROM 
	visits
WHERE
	time_in_queue > 500;

 * mysql+pymysql://root:***@127.0.0.1:3306/md_water_services
105 rows affected.


record_id,location_id,source_id,time_of_record,visit_count,time_in_queue,assigned_employee_id
899,SoRu35083,SoRu35083224,2021-01-16 10:14:00,6,515,28
2304,SoKo33124,SoKo33124224,2021-02-06 07:53:00,5,512,16
2315,KiRu26095,KiRu26095224,2021-02-06 14:32:00,3,529,8
3206,SoRu38776,SoRu38776224,2021-02-20 15:03:00,5,509,46
3701,HaRu19601,HaRu19601224,2021-02-27 12:53:00,3,504,0
4154,SoRu38869,SoRu38869224,2021-03-06 10:44:00,2,533,24
5483,AmRu14089,AmRu14089224,2021-03-27 18:15:00,4,509,12
9177,SoRu37635,SoRu37635224,2021-05-22 18:48:00,2,515,1
9648,SoRu36096,SoRu36096224,2021-05-29 11:24:00,2,533,3
11631,AkKi00881,AkKi00881224,2021-06-26 06:15:00,6,502,32


How is this possible? Can you imagine queueing 8 hours for water?

I am wondering what type of water sources take this long to queue for. We will have to find that information in another table that lists
the types of water sources. If I remember correctly, the table has type_of_water_source, and a source_id column. So let's write
down a couple of these source_id values from our results, and search for them in the other table.

In [13]:
%%sql
# Retrieve records for specific source_ids
SELECT
	*
FROM
	water_source
WHERE
	source_id IN('AkKi00881224','SoRu37635224','SoRu36096224','AkRu05234224','HaZa21742224');


 * mysql+pymysql://root:***@127.0.0.1:3306/md_water_services
5 rows affected.


source_id,type_of_water_source,number_of_people_served
AkKi00881224,shared_tap,3398
AkRu05234224,tap_in_home_broken,496
HaZa21742224,well,308
SoRu36096224,shared_tap,3786
SoRu37635224,shared_tap,3920


Checking these results, the three source_id values taken from the long-queue records (AkKi00881224, SoRu36096224, and SoRu37635224) are all shared taps. The field surveyors also
let us know that they measured sources that had queues a few times to see if the queue time changed.
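Copying source_id values by hand works for a handful of records, but a JOIN on source_id can attach the source type to every long-queue visit in one go. This is a sketch of that alternative, not the approach used in the cells above. It assumes an in-memory SQLite stand-in for MySQL, and the toy tables hold only a few rows (and a subset of the visits columns) copied from the outputs above.

```python
import sqlite3

# In-memory SQLite stand-in (assumption); rows copied from the outputs above,
# with only a subset of the visits columns.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE visits (record_id INTEGER, source_id TEXT,
                     visit_count INTEGER, time_in_queue INTEGER);
INSERT INTO visits VALUES
  (0,     'SoIl32582224', 1, 15),
  (9177,  'SoRu37635224', 2, 515),
  (9648,  'SoRu36096224', 2, 533),
  (11631, 'AkKi00881224', 6, 502);
CREATE TABLE water_source (source_id TEXT, type_of_water_source TEXT,
                           number_of_people_served INTEGER);
INSERT INTO water_source VALUES
  ('AkKi00881224', 'shared_tap', 3398),
  ('AkRu05234224', 'tap_in_home_broken', 496),
  ('HaZa21742224', 'well', 308),
  ('SoRu36096224', 'shared_tap', 3786),
  ('SoRu37635224', 'shared_tap', 3920);
""")

# Attach the source type to each visit with a queue longer than 500 minutes.
long_queues = conn.execute("""
    SELECT v.record_id, v.source_id, w.type_of_water_source, v.time_in_queue
    FROM visits AS v
    JOIN water_source AS w ON w.source_id = v.source_id
    WHERE v.time_in_queue > 500
    ORDER BY v.record_id
""").fetchall()

for row in long_queues:
    print(row)
```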

## 4. **Water Source Quality: Understanding Water Quality**

1. **Locate Water Quality Table:**
   - Identify the table storing information related to water quality.

2. **Retrieve Water Quality Scores:**
   - Write a `SELECT` query to extract quality scores assigned to different water sources from the `water_quality` table.
   - Review the range of quality scores (typically from 1 to 10) and their distribution across various water sources.

3. **Analyze Quality Trends:**
   - Calculate average quality scores for different types of water sources using aggregate functions like `AVG`.
   - Identify sources with consistently high or low quality scores to understand variations in water quality.

4. **Investigate Quality Factors:**
   - Relate the `water_quality` records to attributes held in other tables, such as source type and location, that may influence the scores.
   - Consider factors like source maintenance, contamination levels, and treatment processes affecting water quality assessments.

By delving into the `water_quality` table and examining the quality scores assigned to different water sources, we can gain valuable insights into the overall water quality landscape in Maji Ndogo. This analysis will help in identifying areas of concern, monitoring quality trends, and guiding interventions to improve water quality for the community.

The quality of our water sources is the whole point of this survey. We have a table that contains a quality score, assigned by a field surveyor, for each visit made to a water source. They scored each source from 1, being terrible, to 10 for a good, clean water source in a home. Shared taps are not rated as high, and the score also depends on how long the queue times are.



Let's check if this is true. The surveyors only made multiple visits to shared taps and did not revisit other types of water sources. So
there should be no records of second visits to locations where there are good water sources, like taps in homes.



So let's write a query to find records where the subjective_quality_score is 10 (a score we expect only for home taps) and where the source
was visited a second time.

In [14]:
%%sql
# Count records where subjective_quality_score is 10 (expected only for home taps) and the source was visited a second time
SELECT
	count(*)
FROM
	water_quality
WHERE
	subjective_quality_score = 10
AND 
	visit_count = 2;

 * mysql+pymysql://root:***@127.0.0.1:3306/md_water_services
1 rows affected.


count(*)
218


The count is 218 records. But this should not be happening! I think some of our employees may have made mistakes. To be honest, I'd be surprised if there were no errors in our data at this scale!
We have to recheck some of these sources. We can appoint an auditor to check some of the data independently and make sure we have the right information!
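The water_quality table has no source type, so the count alone cannot tell us which kind of source these records belong to. One way an auditor might break it down is to join water_quality to visits on record_id (the data dictionary lists that relationship) and then to water_source on source_id. The sketch below assumes an in-memory SQLite stand-in, and every row in it is made up purely for illustration (the `demo_` IDs do not exist in the real database).

```python
import sqlite3

# In-memory SQLite stand-in (assumption). ALL rows below are hypothetical
# demo data; none of them come from the real database.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE water_source (source_id TEXT, type_of_water_source TEXT);
CREATE TABLE visits (record_id INTEGER, source_id TEXT);
CREATE TABLE water_quality (record_id INTEGER,
                            subjective_quality_score INTEGER,
                            visit_count INTEGER);
INSERT INTO water_source VALUES
  ('demo_home_tap', 'tap_in_home'),
  ('demo_shared_tap', 'shared_tap');
INSERT INTO visits VALUES
  (1, 'demo_home_tap'), (2, 'demo_home_tap'),
  (3, 'demo_shared_tap'), (4, 'demo_shared_tap');
INSERT INTO water_quality VALUES
  (1, 10, 1), (2, 10, 2), (3, 6, 1), (4, 10, 2);
""")

# Break the suspicious records (score 10 on a second visit) down by source type.
suspicious = conn.execute("""
    SELECT w.type_of_water_source, COUNT(*) AS suspicious_records
    FROM water_quality AS q
    JOIN visits AS v ON v.record_id = q.record_id
    JOIN water_source AS w ON w.source_id = v.source_id
    WHERE q.subjective_quality_score = 10
      AND q.visit_count = 2
    GROUP BY w.type_of_water_source
    ORDER BY w.type_of_water_source
""").fetchall()

for row in suspicious:
    print(row)
```

A breakdown like this would show whether the problem is a home tap that was revisited or a shared tap that was scored 10, which are two different mistakes.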


## 5. **Pollution Issues: Correcting Pollution Data with LIKE and String Operations**

1. **Identify Pollution Data Table:**
   - Locate the table containing contamination/pollution data for well sources.

2. **Examine First Few Rows:**
   - Write a query to retrieve and display the first few rows of data from the `well_pollution` table.
   - Review the columns such as `source_id`, `date`, `description`, `pollutant_ppm`, `biological`, and `results` to understand the data structure.

3. **Understand Data Integrity:**
   - Analyze the descriptions in the `well_pollution` table, which are notes taken by scientists as text.
   - Check for any inconsistencies or errors in the data that may impact the accuracy of pollution assessments.

4. **Identify Clean but Contaminated Records:**
   - Write a query that checks if the `results` column is marked as "Clean" but the `biological` column has a value greater than 0.01.
   - Identify records where the cleanliness status contradicts the actual biological contamination levels for further investigation.

By addressing inconsistencies in pollution data through SQL queries and string operations, we can ensure the accuracy and reliability of the pollution assessments for well sources in Maji Ndogo. This process is crucial for maintaining data integrity and making informed decisions to safeguard public health and water quality.

It looks like our scientists diligently recorded the water quality of all the wells. Some are contaminated with biological contaminants, while others are polluted with an excess of heavy metals and other pollutants. Based on the results, each well was classified as Clean, Contaminated: Biological, or Contaminated: Chemical. It is important to know this because wells that are polluted with biological or other contaminants are not safe to drink from. It looks like they recorded the source_id for each test, so we can link each result to a source at some place in Maji Ndogo.

In the well_pollution table, the descriptions are notes taken by our scientists as text, so they will be challenging to process. The biological column is in units of CFU/mL, so it measures how much contamination is in the water. 0 is clean, and anything more than 0.01 is contaminated.

Let's check the integrity of the data. The worst case is if we have contamination, but we think we don't. People can get sick, so we need to make sure there are no errors here.

So, here is a query that checks whether results is Clean while the biological column is greater than 0.01.

In [15]:
%%sql
# Check for records where the results are 'Clean' but the biological column is greater than 0.01
SELECT
	*
FROM
	well_pollution
WHERE
	results = 'Clean'
AND
	biological > 0.01;

 * mysql+pymysql://root:***@127.0.0.1:3306/md_water_services
64 rows affected.


source_id,date,description,pollutant_ppm,biological,results
AkRu08936224,2021-01-08 09:22:00,Bacteria: E. coli,0.0406458,35.0068,Clean
AkRu06489224,2021-01-10 09:44:00,Clean Bacteria: Giardia Lamblia,0.0897904,38.467,Clean
SoRu38011224,2021-01-14 15:35:00,Bacteria: E. coli,0.0425095,19.2897,Clean
AkKi00955224,2021-01-22 12:47:00,Bacteria: E. coli,0.0812092,40.2273,Clean
KiHa22929224,2021-02-06 13:54:00,Bacteria: E. coli,0.0722537,18.4482,Clean
KiRu25473224,2021-02-07 15:51:00,Clean Bacteria: Giardia Lamblia,0.0630094,24.4536,Clean
HaRu17401224,2021-03-01 13:44:00,Clean Bacteria: Giardia Lamblia,0.0649209,25.8129,Clean
AkRu07137224,2021-03-04 13:41:00,Clean Bacteria: Giardia Lamblia,0.0656843,18.2978,Clean
KiRu27205224,2021-03-13 14:17:00,Clean Bacteria: Giardia Lamblia,0.0418018,49.4281,Clean
AkLu02307224,2021-03-13 15:41:00,Bacteria: E. coli,0.0709682,35.203,Clean


If we compare the results of this query to the entire table it seems like we have some inconsistencies in how the well statuses are
recorded. Specifically, it seems that some data input personnel might have mistaken the description field for determining the cleanliness of the water.

It seems like, in some cases, if the description field begins with the word “Clean”, the results have been classified as “Clean” in the results column, even though the biological column is > 0.01. Let’s dive deeper into the cause of the issue with the biological contamination data.

Descriptions should only have the word “Clean” if there is no biological contamination (and no chemical pollutants). Some data personnel must have copied the data from the scientist's notes into our database incorrectly. We need to find and
remove the “Clean” part from all the descriptions that do have a biological contamination so this mistake is not made again.

The second issue has arisen from this error, but it is much more problematic. Some of the field surveyors have marked wells as Clean in
the results column because the description had the word “Clean” in it, even though they have a biological contamination. So we need
to find all the results that have a value greater than 0.01 in the biological column and have been set to Clean in the results column.

Let's look at the descriptions. We need to identify the records that mistakenly have the word Clean in the description. However, it
is important to remember that not all of our field surveyors used the description to set the results; some checked the actual data.

In [16]:
%%sql
# Find descriptions that mistakenly include 'Clean' before the actual contamination information
SELECT
	count(*)
FROM
	well_pollution
WHERE
	description LIKE "Clean_%";

 * mysql+pymysql://root:***@127.0.0.1:3306/md_water_services
1 rows affected.


count(*)
38
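It is worth spelling out why the pattern is `Clean_%` and not `Clean%`. In a LIKE pattern, `_` matches exactly one character and `%` matches any run of characters (including none), so `Clean_%` only matches descriptions that start with Clean and have something after it. A genuinely clean well, whose description is just `Clean`, is left alone. Here is a minimal sketch on an in-memory SQLite stand-in (an assumption; SQLite uses the same two wildcards), with description values taken from this notebook.

```python
import sqlite3

# In-memory SQLite stand-in (assumption); descriptions taken from this notebook.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE well_pollution (description TEXT)")
conn.executemany("INSERT INTO well_pollution VALUES (?)", [
    ("Clean",),
    ("Bacteria: E. coli",),
    ("Clean Bacteria: E. coli",),
    ("Clean Bacteria: Giardia Lamblia",),
])

# '_' needs at least one character after 'Clean', so plain 'Clean' is not matched.
flagged = [row[0] for row in conn.execute("""
    SELECT description
    FROM well_pollution
    WHERE description LIKE 'Clean_%'
    ORDER BY description
""")]

print(flagged)
```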


Now we need to fix these descriptions so that we don’t encounter this issue again in the future.

Looking at the results, we can see two different descriptions that we need to fix:
1. All records that mistakenly have Clean Bacteria: E. coli should be updated to Bacteria: E. coli
2. All records that mistakenly have Clean Bacteria: Giardia Lamblia should be updated to Bacteria: Giardia Lamblia

The second issue we need to fix is in our results column. We need to update the results column from Clean to Contaminated: Biological where the biological column has a value greater than 0.01.

Ok, so here is how I did it:

-- Case 1a: Update descriptions that mistakenly mention
`Clean Bacteria: E. coli` to `Bacteria: E. coli`

-- Case 1b: Update descriptions that mistakenly mention
`Clean Bacteria: Giardia Lamblia` to `Bacteria: Giardia Lamblia`

-- Case 2: Update `results` to `Contaminated: Biological` where
`biological` is greater than 0.01 and the current `results` value is `Clean`

Now, when we change any data in the database, we need to be SURE there are no errors, as a bad update could fill the database with incorrect
values. A safer way to do the UPDATE is by testing the changes on a copy of the table first.

In [17]:
%%sql
/* Fix descriptions that mistakenly include 'Clean' before the actual contamination information
Case 1a: Fix descriptions mentioning 'Clean Bacteria: E. coli' 
Case 1b: Fix descriptions mentioning 'Clean Bacteria: Giardia Lamblia'
Case 2: Update the 'results' column to 'Contaminated: Biological' where necessary */
SET SQL_SAFE_UPDATES = 0;
CREATE TABLE well_pollution_copy AS SELECT * FROM well_pollution;
UPDATE
	well_pollution_copy
SET
	description = 'Bacteria: E. coli'
WHERE
	description = 'Clean Bacteria: E. coli';

UPDATE
	well_pollution_copy
SET
	description = 'Bacteria: Giardia Lamblia'
WHERE
	description = 'Clean Bacteria: Giardia Lamblia';

UPDATE
	well_pollution_copy
SET
	results = 'Contaminated: Biological'
WHERE
	biological > 0.01
AND 
	results = 'Clean';


 * mysql+pymysql://root:***@127.0.0.1:3306/md_water_services
0 rows affected.
17383 rows affected.
26 rows affected.
12 rows affected.
64 rows affected.


[]

In [18]:
%%sql
# Check if the updates are correct
SELECT
	*
FROM
	well_pollution_copy
WHERE
	description LIKE "Clean_%"
OR (results = "Clean" AND biological > 0.01);

 * mysql+pymysql://root:***@127.0.0.1:3306/md_water_services
0 rows affected.


source_id,date,description,pollutant_ppm,biological,results
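The copy-then-verify workflow can also be rehearsed end to end on a toy table before touching real data. The sketch below assumes an in-memory SQLite stand-in for MySQL and uses four rows (with a subset of the columns) copied from the well_pollution outputs in this notebook. It checks three things: the copy is fixed, the original is untouched, and a genuinely clean well stays Clean.

```python
import sqlite3

# In-memory SQLite stand-in (assumption); rows copied from the outputs in
# this notebook, with only a subset of the well_pollution columns.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE well_pollution (
    source_id TEXT, description TEXT, biological REAL, results TEXT)""")
conn.executemany("INSERT INTO well_pollution VALUES (?, ?, ?, ?)", [
    ("AkRu08936224", "Bacteria: E. coli", 35.0068, "Clean"),
    ("AkRu06489224", "Clean Bacteria: Giardia Lamblia", 38.467, "Clean"),
    ("HaRu19725224", "Clean", 9.56996e-05, "Clean"),
    ("KiRu28935224", "Bacteria: Giardia Lamblia", 495.898, "Contaminated: Biological"),
])

# Work on a copy first, exactly as in the cell above.
conn.execute("CREATE TABLE well_pollution_copy AS SELECT * FROM well_pollution")
conn.execute("""UPDATE well_pollution_copy
                SET description = 'Bacteria: Giardia Lamblia'
                WHERE description = 'Clean Bacteria: Giardia Lamblia'""")
conn.execute("""UPDATE well_pollution_copy
                SET results = 'Contaminated: Biological'
                WHERE biological > 0.01 AND results = 'Clean'""")

# Same integrity check as the verification query, expressed as a count.
check = """SELECT COUNT(*) FROM {table}
           WHERE description LIKE 'Clean_%'
              OR (results = 'Clean' AND biological > 0.01)"""
remaining_in_copy = conn.execute(check.format(table="well_pollution_copy")).fetchone()[0]
remaining_in_original = conn.execute(check.format(table="well_pollution")).fetchone()[0]

# The genuinely clean well must not be reclassified by the fix.
still_clean = conn.execute("""SELECT results FROM well_pollution_copy
                              WHERE source_id = 'HaRu19725224'""").fetchone()[0]

print(remaining_in_copy, remaining_in_original, still_clean)
```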


In [19]:
%%sql
# Update the actual well_pollution table
SET SQL_SAFE_UPDATES = 0;
UPDATE
	well_pollution
SET
	description = 'Bacteria: E. coli'
WHERE
	description = 'Clean Bacteria: E. coli';

UPDATE
	well_pollution
SET
	description = 'Bacteria: Giardia Lamblia'
WHERE
	description = 'Clean Bacteria: Giardia Lamblia';

UPDATE
	well_pollution
SET
	results = 'Contaminated: Biological'
WHERE
	biological > 0.01
AND 
	results = 'Clean';

# Drop the well_pollution_copy table
DROP TABLE well_pollution_copy;

 * mysql+pymysql://root:***@127.0.0.1:3306/md_water_services
0 rows affected.
26 rows affected.
12 rows affected.
64 rows affected.
0 rows affected.


[]