In [1]:
!pip install ipython-sql pymysql



In [2]:
%load_ext sql
%config SqlMagic.style = '_DEPRECATED_DEFAULT'
%config SqlMagic.autopandas = True

In [3]:
%sql mysql+pymysql://root:12345678@localhost:3306/md_water_services

### Maji Ndogo: From analysis to action
# Charting the course for Maji Ndogo's water future
We aim to convert our data into actionable knowledge. Understanding the situation is one thing, but it's the translation of
that understanding into informed decisions that will truly make a difference. We will shape our raw data into meaningful views, providing essential information to decision-makers. This will enable us to discern the materials we need, plan our budgets, and identify the areas requiring immediate attention. We're not just analysing data. We'll be creating job lists for our engineers. Their expertise will be invaluable in tackling the challenges we face, but they can only do their job effectively when they have clear, data-driven directions.

### Joining pieces together
1. Are there any specific provinces or towns where some sources are more abundant?
2. We identified that tap_in_home_broken taps are easy wins. Are there any towns where this is a particular problem?

To answer question 1, we will need province_name and town_name from the location table. We also need to know type_of_water_source and
number_of_people_served from the water_source table.  

The problem is that the location table uses location_id while water_source only has source_id, so we can't join these tables directly. However, the visits table maps location_id to source_id. If we use visits as the linking table, we can join location where the location_id matches, and water_source where the source_id matches.
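
The bridging idea can be sketched on toy data. This is only an illustration: it uses Python's built-in sqlite3 with an in-memory database rather than the MySQL server, and the rows are invented. Only the table and column names come from the text above.

```python
import sqlite3

# In-memory SQLite database with toy rows (invented for illustration).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE location (location_id TEXT, province_name TEXT, town_name TEXT);
CREATE TABLE water_source (source_id TEXT, type_of_water_source TEXT,
                           number_of_people_served INTEGER);
CREATE TABLE visits (location_id TEXT, source_id TEXT, visit_count INTEGER);

INSERT INTO location VALUES ('L1', 'ProvinceA', 'TownX'), ('L2', 'ProvinceA', 'TownY');
INSERT INTO water_source VALUES ('S1', 'well', 100), ('S2', 'river', 250);
INSERT INTO visits VALUES ('L1', 'S1', 1), ('L2', 'S2', 1);
""")

# visits holds both keys, so it can link location to water_source.
bridged_rows = conn.execute("""
SELECT l.town_name, ws.type_of_water_source, ws.number_of_people_served
FROM visits AS v
JOIN location AS l ON l.location_id = v.location_id
JOIN water_source AS ws ON ws.source_id = v.source_id
ORDER BY l.town_name
""").fetchall()

print(bridged_rows)
```

Each output row combines a town from location with a source type from water_source, even though those two tables share no column.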

### Joining location to visits.

In [7]:
%%sql
SELECT
province_name,
town_name,
visit_count,
l.location_id
FROM location AS l
JOIN visits AS v
ON l.location_id=v.location_id
LIMIT 5;


 * mysql+pymysql://root:***@localhost:3306/md_water_services
5 rows affected.


Unnamed: 0,province_name,town_name,visit_count,location_id
0,Akatsi,Harare,1,AkHa00000
1,Akatsi,Harare,1,AkHa00001
2,Akatsi,Harare,1,AkHa00002
3,Akatsi,Harare,1,AkHa00003
4,Akatsi,Harare,1,AkHa00004


### Joining the water_source table on the key shared between water_source and visits.

In [10]:
%%sql
SELECT
province_name,
town_name,
visit_count,
l.location_id,
type_of_water_source, number_of_people_served
FROM location AS l
JOIN visits AS v
ON l.location_id=v.location_id
JOIN water_source AS ws
ON v.source_id= ws.source_id
LIMIT 5;


 * mysql+pymysql://root:***@localhost:3306/md_water_services
5 rows affected.


Unnamed: 0,province_name,town_name,visit_count,location_id,type_of_water_source,number_of_people_served
0,Akatsi,Harare,1,AkHa00000,tap_in_home,956
1,Akatsi,Harare,1,AkHa00001,tap_in_home_broken,930
2,Akatsi,Harare,1,AkHa00002,tap_in_home_broken,486
3,Akatsi,Harare,1,AkHa00003,well,364
4,Akatsi,Harare,1,AkHa00004,tap_in_home_broken,942


Note that there are rows where visit_count > 1. These are sites our surveyors visited more than once to collect additional information, but every visit happened at the same source/location. For example:

In [14]:
%%sql
SELECT
province_name,
town_name,
visit_count,
l.location_id,
type_of_water_source, number_of_people_served
FROM location AS l
JOIN visits AS v
ON l.location_id=v.location_id
JOIN water_source AS ws
ON v.source_id= ws.source_id
WHERE v.location_id = 'AkHa00103'
LIMIT 5;


 * mysql+pymysql://root:***@localhost:3306/md_water_services
5 rows affected.


Unnamed: 0,province_name,town_name,visit_count,location_id,type_of_water_source,number_of_people_served
0,Akatsi,Harare,1,AkHa00103,shared_tap,3340
1,Akatsi,Harare,2,AkHa00103,shared_tap,3340
2,Akatsi,Harare,3,AkHa00103,shared_tap,3340
3,Akatsi,Harare,4,AkHa00103,shared_tap,3340
4,Akatsi,Harare,5,AkHa00103,shared_tap,3340


Here you can see that a single location, AkHa00103, has multiple records, one per visit. If we aggregate, we will include these repeated rows and count the same people several times, so our results will be incorrect. To fix this, we can select only rows where visits.visit_count = 1.
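
The effect of repeat visits on an aggregate can be checked with a small sketch. It assumes an in-memory SQLite database (not the MySQL server used in this notebook) and a made-up source 'S1' that was visited three times.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE water_source (source_id TEXT, number_of_people_served INTEGER);
CREATE TABLE visits (source_id TEXT, visit_count INTEGER);

-- One source serving 3340 people, visited three times (toy rows).
INSERT INTO water_source VALUES ('S1', 3340);
INSERT INTO visits VALUES ('S1', 1), ('S1', 2), ('S1', 3);
""")

join_sql = """
SELECT SUM(ws.number_of_people_served)
FROM visits AS v
JOIN water_source AS ws ON ws.source_id = v.source_id
"""

# Every visit row repeats the same source, so the sum is inflated.
inflated_total = conn.execute(join_sql).fetchone()[0]

# Keeping only the first visit counts each source exactly once.
correct_total = conn.execute(join_sql + " WHERE v.visit_count = 1").fetchone()[0]

print(inflated_total, correct_total)
```

Without the filter the 3340 people are counted once per visit. With it they are counted once.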

### Adding the location_type column from location and time_in_queue from visits to our results set.

In [15]:
%%sql
SELECT
province_name,
town_name,
visit_count,
l.location_id,
type_of_water_source,
location_type,
number_of_people_served,
time_in_queue
FROM location AS l
JOIN visits AS v
ON l.location_id=v.location_id
JOIN water_source AS ws
ON v.source_id= ws.source_id
WHERE v.visit_count = 1
LIMIT 5;


 * mysql+pymysql://root:***@localhost:3306/md_water_services
5 rows affected.


Unnamed: 0,province_name,town_name,visit_count,location_id,type_of_water_source,location_type,number_of_people_served,time_in_queue
0,Sokoto,Ilanga,1,SoIl32582,river,Urban,402,15
1,Kilimani,Rural,1,KiRu28935,well,Rural,252,0
2,Hawassa,Rural,1,HaRu19752,shared_tap,Rural,542,62
3,Akatsi,Lusaka,1,AkLu01628,well,Urban,210,0
4,Akatsi,Rural,1,AkRu03357,shared_tap,Rural,2598,28


We need to grab the results from the well_pollution table. The well_pollution table contains data only for wells. If we just use JOIN, we will do an inner join, so only records that are in both well_pollution and visits will be kept, and every non-well source will be dropped. We have to use a LEFT JOIN so that the well_pollution results are attached to well sources while all other sources stay in the results.
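
The difference between the two join types can be seen on toy data. The sketch below assumes an in-memory SQLite database, invented source IDs, and a hypothetical `results` column in well_pollution, which the text above does not name.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE visits (source_id TEXT, visit_count INTEGER);
-- 'results' is a hypothetical column name used only for this sketch.
CREATE TABLE well_pollution (source_id TEXT, results TEXT);

-- Three sources were visited, but only S2 is a well with pollution data.
INSERT INTO visits VALUES ('S1', 1), ('S2', 1), ('S3', 1);
INSERT INTO well_pollution VALUES ('S2', 'Clean');
""")

# Inner join: sources without a well_pollution row disappear.
inner_rows = conn.execute("""
SELECT v.source_id, wp.results
FROM visits AS v
JOIN well_pollution AS wp ON wp.source_id = v.source_id
ORDER BY v.source_id
""").fetchall()

# Left join: every visit is kept; missing pollution data becomes NULL (None).
left_rows = conn.execute("""
SELECT v.source_id, wp.results
FROM visits AS v
LEFT JOIN well_pollution AS wp ON wp.source_id = v.source_id
ORDER BY v.source_id
""").fetchall()

print(inner_rows)
print(left_rows)
```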

In [18]:
%%sql
SELECT
province_name,
town_name,
visit_count,
l.location_id,
type_of_water_source,
location_type,
number_of_people_served,
time_in_queue
FROM location AS l
JOIN visits AS v
ON l.location_id=v.location_id
LEFT JOIN well_pollution AS wp
ON wp.source_id = v.source_id
JOIN water_source AS ws
ON v.source_id= ws.source_id
WHERE v.visit_count = 1
LIMIT 5;


 * mysql+pymysql://root:***@localhost:3306/md_water_services
5 rows affected.


Unnamed: 0,province_name,town_name,visit_count,location_id,type_of_water_source,location_type,number_of_people_served,time_in_queue
0,Sokoto,Ilanga,1,SoIl32582,river,Urban,402,15
1,Kilimani,Rural,1,KiRu28935,well,Rural,252,0
2,Hawassa,Rural,1,HaRu19752,shared_tap,Rural,542,62
3,Akatsi,Lusaka,1,AkLu01628,well,Urban,210,0
4,Akatsi,Rural,1,AkRu03357,shared_tap,Rural,2598,28


The table above contains the data we need for our analysis. Now we want to analyse the data in the result set. We can either create a CTE and then query it, or create a view.
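
As a rough comparison of the two options, the sketch below runs the same filter once as a view and once as a CTE. It assumes an in-memory SQLite database and invented rows. The names `wells_only` and `wells_only_cte` are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE water_source (type_of_water_source TEXT, number_of_people_served INTEGER);
INSERT INTO water_source VALUES ('well', 100), ('river', 250), ('well', 50);

-- A view stores the query under a name, so later statements can reuse it.
CREATE VIEW wells_only AS
SELECT * FROM water_source WHERE type_of_water_source = 'well';
""")

view_total = conn.execute(
    "SELECT SUM(number_of_people_served) FROM wells_only"
).fetchone()[0]

# A CTE gives the same result, but exists only for this one statement.
cte_total = conn.execute("""
WITH wells_only_cte AS (
    SELECT * FROM water_source WHERE type_of_water_source = 'well'
)
SELECT SUM(number_of_people_served) FROM wells_only_cte
""").fetchone()[0]

print(view_total, cte_total)
```

A view is convenient when several later cells query the same joined table. A CTE keeps everything inside one statement and leaves nothing behind in the database.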

In [19]:
%%sql
CREATE VIEW combined_analysis_table AS(
SELECT
province_name,
town_name,
visit_count,
l.location_id,
type_of_water_source,
location_type,
number_of_people_served,
time_in_queue
FROM location AS l
JOIN visits AS v
ON l.location_id=v.location_id
LEFT JOIN well_pollution AS wp
ON wp.source_id = v.source_id
JOIN water_source AS ws
ON v.source_id= ws.source_id
WHERE v.visit_count = 1);


 * mysql+pymysql://root:***@localhost:3306/md_water_services
0 rows affected.


## The Last Analysis
We want to break down our data into provinces or towns and source types. If we understand where
the problems are, and what we need to improve at those locations, we can make an informed decision on where to send our repair teams.

In [24]:
%%sql
WITH province_totals AS (-- This CTE calculates the population of each province
SELECT
province_name,
SUM(number_of_people_served) AS total_ppl_serv
FROM combined_analysis_table
GROUP BY province_name
)
SELECT
ct.province_name,
-- These case statements create columns for each type of source.
-- The results are aggregated, and percentages are calculated
ROUND((SUM(CASE WHEN type_of_water_source = 'river'
THEN number_of_people_served ELSE 0 END) * 100.0 / pt.total_ppl_serv), 0) AS river,
ROUND((SUM(CASE WHEN type_of_water_source = 'shared_tap'
THEN number_of_people_served ELSE 0 END) * 100.0 / pt.total_ppl_serv), 0) AS shared_tap,
ROUND((SUM(CASE WHEN type_of_water_source = 'tap_in_home'
THEN number_of_people_served ELSE 0 END) * 100.0 / pt.total_ppl_serv), 0) AS tap_in_home,
ROUND((SUM(CASE WHEN type_of_water_source = 'tap_in_home_broken'
THEN number_of_people_served ELSE 0 END) * 100.0 / pt.total_ppl_serv), 0) AS tap_in_home_broken,
ROUND((SUM(CASE WHEN type_of_water_source = 'well'
THEN number_of_people_served ELSE 0 END) * 100.0 / pt.total_ppl_serv), 0) AS well
FROM
combined_analysis_table ct
JOIN
province_totals pt ON ct.province_name = pt.province_name
GROUP BY
ct.province_name
ORDER BY
ct.province_name;

In [None]:
province_totals is a CTE that calculates the sum of all the people surveyed, grouped by province. If you replace the query above with this one:
SELECT
*
FROM
province_totals;
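
The province_totals CTE and the CASE-based percentage columns can also be tried on toy data. This is a minimal sketch, assuming an in-memory SQLite database and a hypothetical table `combined` with invented rows standing in for combined_analysis_table. Only two source types are shown.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- 'combined' is a toy stand-in for combined_analysis_table.
CREATE TABLE combined (province_name TEXT, type_of_water_source TEXT,
                       number_of_people_served INTEGER);
INSERT INTO combined VALUES
  ('ProvinceA', 'river', 100),
  ('ProvinceA', 'well', 300),
  ('ProvinceB', 'river', 50),
  ('ProvinceB', 'well', 50);
""")

pct_rows = conn.execute("""
WITH province_totals AS (
    -- Total people served per province: the denominator for each percentage.
    SELECT province_name, SUM(number_of_people_served) AS total_ppl_serv
    FROM combined
    GROUP BY province_name
)
SELECT
    ct.province_name,
    ROUND(SUM(CASE WHEN type_of_water_source = 'river'
        THEN number_of_people_served ELSE 0 END) * 100.0 / pt.total_ppl_serv, 0) AS river,
    ROUND(SUM(CASE WHEN type_of_water_source = 'well'
        THEN number_of_people_served ELSE 0 END) * 100.0 / pt.total_ppl_serv, 0) AS well
FROM combined AS ct
JOIN province_totals AS pt ON ct.province_name = pt.province_name
GROUP BY ct.province_name
ORDER BY ct.province_name
""").fetchall()

for row in pct_rows:
    print(row)
```

Each CASE expression keeps the people served by one source type and replaces the rest with 0, so dividing the sum by the province total gives that source type's share of the province.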