# **Maji Ndogo: From analysis to action**
# **Clustering data to unveil Maji Ndogo's water crisis**

### **Overview:**
This personal project explores survey data on water sources, infrastructure, and employee performance to address water resource management challenges in Maji Ndogo. The insights are meant to guide decision-making for sustainable solutions.

### **Main Goal:**
The main goal of this project is to utilize data analysis techniques to understand and improve water resource management in the target community.

## Main Points:

### 1. Cleaning our data:
- **Updating employee data:**
  - Rectifying data inconsistencies and inaccuracies in employee records.
  - Generating an email address for each employee and trimming stray spaces from phone numbers.

### 2. Honoring the workers:
- **Finding our best:**
  - Recognizing and acknowledging top-performing field workers.
  - Identifying key contributors in data collection efforts.

### 3. Analyzing locations:
- **Understanding where the water sources are:**
  - Focusing on province_name, town_name, and location_type to pinpoint water source locations.
  - Creating queries to count the number of records per town for spatial insights.

### 4. Diving into the sources:
- **Seeing the scope of the problem:**
  - Exploring the water_source table to understand different water source types and user demographics.
  - Analyzing the number of people served by each source type to gauge impact and usage patterns.

### 5. Start of a solution:
- **Thinking about how we can repair:**
  - Prioritizing infrastructure repairs based on the total number of people served by each source type.
  - Using data-driven insights to guide decision-making on infrastructure improvements.

### 6. Analyzing queues:
- **Uncovering when citizens collect water:**
  - Examining queue times and patterns to understand water collection behaviors.
  - Identifying peak usage times and potential bottlenecks in water access.

### 7. Reporting insights:
- **Assembling our insights into a story:**
  - Summarizing key findings and insights from data analysis.
  - Crafting a narrative to communicate findings effectively to stakeholders and decision-makers.

Through this project, the aim is to leverage data analysis to drive informed decisions and actions towards enhancing water resource management practices in the community.

## 1. **Cleaning our data: Updating employee data**

1. **Identifying and rectifying data inconsistencies and inaccuracies in employee records:**
   - Ensure the correctness of employee information, such as names, email addresses, and phone numbers.
   - Use TRIM() function to remove any unwanted spaces in the data.

2. **Updating employee data:**
   - Generate an email address for every employee from their name and write it to the email column.
   - Use the employee table to count how many employees live in each town, identifying the number of employees in smaller communities in rural areas.

By following these steps, we can clean and update employee data, ensuring accurate and consistent information for further analysis.

In [1]:
%load_ext sql
# Connect to MySQL database
%sql mysql+pymysql://root:123456@127.0.0.1:3306/md_water_services

The employee table has info on all of our workers, but note that the email addresses have not been added. We will have to send
them reports and figures, so let's update it. Luckily the emails for our department are easy: first_name.last_name@ndogowater.gov.

We can determine the email address for each employee by:
- selecting the employee_name column
- replacing the space with a full stop
- making it lowercase
- and appending @ndogowater.gov to stitch it all together

We have to update the database again with these email addresses, so before we do, let's use a SELECT query to get the format right, then use
UPDATE and SET to make the changes.

In [2]:
%%sql
# Check if email query is working
SELECT
    CONCAT(LOWER(REPLACE(employee_name, ' ', '.')), '@ndogowater.gov') AS new_email
FROM
    employee;

 * mysql+pymysql://root:***@127.0.0.1:3306/md_water_services
56 rows affected.


new_email
amara.jengo@ndogowater.gov
bello.azibo@ndogowater.gov
bakari.iniko@ndogowater.gov
malachi.mavuso@ndogowater.gov
cheche.buhle@ndogowater.gov
zuriel.matembo@ndogowater.gov
deka.osumare@ndogowater.gov
lalitha.kaburi@ndogowater.gov
enitan.zuri@ndogowater.gov
farai.nia@ndogowater.gov
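
To double-check the format outside the database, the same transformation can be sketched in plain Python. This is only an illustration: `make_email` is a hypothetical helper that is not part of the project, and the two names are assumed to match the first two emails in the output above.

```python
def make_email(employee_name: str, domain: str = "ndogowater.gov") -> str:
    """Mirror CONCAT(LOWER(REPLACE(employee_name, ' ', '.')), '@ndogowater.gov')."""
    # Replace the space with a full stop, then lowercase the result
    local_part = employee_name.replace(" ", ".").lower()
    # Stitch the local part and the domain together
    return f"{local_part}@{domain}"


# Assumed sample names (hypothetical input for the sketch)
sample_names = ["Amara Jengo", "Bello Azibo"]
emails = [make_email(name) for name in sample_names]
print(emails)
```

Like the SQL version, this replaces every space, so a name with three parts would get two full stops.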


In [3]:
%%sql
# Update emails column in employee table
SET SQL_SAFE_UPDATES = 0;
UPDATE employee
SET email = CONCAT(LOWER(REPLACE(employee_name, ' ', '.')), '@ndogowater.gov');

 * mysql+pymysql://root:***@127.0.0.1:3306/md_water_services
0 rows affected.
56 rows affected.


[]

In [4]:
%%sql
# Check if emails have been updated correctly in employees table
SELECT *
FROM employee
LIMIT 5;

 * mysql+pymysql://root:***@127.0.0.1:3306/md_water_services
5 rows affected.


assigned_employee_id,employee_name,phone_number,email,address,province_name,town_name,position
0,Amara Jengo,99637993287,amara.jengo@ndogowater.gov,36 Pwani Mchangani Road,Sokoto,Ilanga,Field Surveyor
1,Bello Azibo,99643864786,bello.azibo@ndogowater.gov,129 Ziwa La Kioo Road,Kilimani,Rural,Field Surveyor
2,Bakari Iniko,99222599041,bakari.iniko@ndogowater.gov,18 Mlima Tazama Avenue,Hawassa,Rural,Field Surveyor
3,Malachi Mavuso,99945849900,malachi.mavuso@ndogowater.gov,100 Mogadishu Road,Akatsi,Lusaka,Field Surveyor
4,Cheche Buhle,99381679640,cheche.buhle@ndogowater.gov,1 Savanna Street,Akatsi,Rural,Field Surveyor


I picked up another bit we have to clean up. Often when databases are created and updated, or information is collected from different sources,
errors creep in. For example, if we look at the phone numbers in the phone_number column, the values are stored as strings.



The phone numbers should be 12 characters long, consisting of the plus sign, area code (99), and the phone number digits. However, when we use
the LENGTH(column) function, it returns 13 characters, indicating there's an extra character.

In [5]:
%%sql
# Check if phone numbers are more than 12 characters
SELECT
    LENGTH(phone_number)
FROM
    employee;

 * mysql+pymysql://root:***@127.0.0.1:3306/md_water_services
56 rows affected.


LENGTH(phone_number)
13
13
13
13
13
13
13
13
13
13


That's because there is a space at the end of the number! If we try to send an automated SMS to that number it will fail. This happens so often
that SQL has a function specifically for trimming off such spaces, called TRIM(column).
It removes any leading or trailing spaces from a string.

In [6]:
%%sql
# Function to trim extra space from the phone numbers
SELECT
    LENGTH(trim(phone_number))
FROM
    employee;

 * mysql+pymysql://root:***@127.0.0.1:3306/md_water_services
56 rows affected.


LENGTH(trim(phone_number))
12
12
12
12
12
12
12
12
12
12
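
The effect of trimming is easy to reproduce in Python. The stored value below is an assumption for illustration (a plus sign, the 99 area code, the digits, and one trailing space); it is not read from the database.

```python
# Hypothetical stored value: 12 real characters plus one trailing space
raw_phone = "+99637993287 "

# strip(" ") removes only leading and trailing spaces, like MySQL's default TRIM();
# calling strip() with no argument would also remove tabs and newlines
trimmed_phone = raw_phone.strip(" ")

print(len(raw_phone), len(trimmed_phone))
```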


In [7]:
%%sql
# Update employee table with new trimmed phone numbers
UPDATE employee
SET phone_number = trim(phone_number);

 * mysql+pymysql://root:***@127.0.0.1:3306/md_water_services
56 rows affected.


[]

In [8]:
%%sql
# Check that the stored phone numbers no longer carry the extra space
SELECT
    LENGTH(phone_number)
FROM
    employee;

 * mysql+pymysql://root:***@127.0.0.1:3306/md_water_services
56 rows affected.


LENGTH(phone_number)
12
12
12
12
12
12
12
12
12
12


## 2. **Honoring the workers: Finding our best**

1. **Identifying top-performing field workers:**
   - Count the number of records each employee collected to determine their performance.
   - Use the database to find the employee_ids of the top 3 field surveyors with the most location visits.

2. **Retrieving employee information:**
   - Use the employee_ids to get the names, email addresses, and phone numbers of the top 3 field surveyors.
   - Prepare a table with the top 3 field surveyors' information for acknowledgment and appreciation.

3. **The President's request:**
   - Send an email or message congratulating the top 3 field surveyors based on their performance.
   - Recognize and honor the workers who have made significant contributions to the data collection process.

4. **Top performers' contribution:**
   - Highlight the importance of top performers' efforts in the data collection process.
   - Acknowledge their dedication and hard work in gathering valuable data for analysis.

By following these steps, we can honor the workers and find the best-performing field surveyors, ensuring they are recognized for their contributions to the data exploration journey.

Before we dive into the analysis, let's get warmed up a bit!
Let's have a look at where our employees live.


In [9]:
%%sql
# Count how many employees live in each town
SELECT
    town_name,
    count(employee_name)
FROM
    employee
GROUP BY
    town_name;

 * mysql+pymysql://root:***@127.0.0.1:3306/md_water_services
9 rows affected.


town_name,count(employee_name)
Ilanga,3
Rural,29
Lusaka,4
Zanzibar,4
Dahabu,6
Kintampo,1
Harare,5
Yaounde,1
Serowe,3


Note how many of our workers are living in smaller communities in the rural parts of Maji Ndogo.

The president of Maji Ndogo congratulated the team for completing the survey, but we would not have this data were it not for our field workers. So let's gather some data on their performance in this process, so we can thank those who really put all their effort in.

The president of Maji Ndogo has asked we send out an email or message congratulating the top 3 field surveyors. So let's use the database to get the employee_ids and use those to get the names, email and phone numbers of the three field surveyors with the most location visits.

Let's first look at the number of records each employee collected. So let's find the correct table, figure out what function to use and how to group, order and limit the results to only see the top 3 employee_ids with the highest number of locations visited.

In [10]:
%%sql
# Finding the employeeID of top three field surveyors based on total visit count
SELECT 
    assigned_employee_id,
    sum(visit_count) as number_of_visits
FROM 
    visits
GROUP BY 
    assigned_employee_id
ORDER BY
    sum(visit_count) DESC
LIMIT 3;

 * mysql+pymysql://root:***@127.0.0.1:3306/md_water_services
3 rows affected.


assigned_employee_id,number_of_visits
1,8944
30,8800
34,8411


Let's create a query that looks up the employee's info.

In [11]:
%%sql
# Finding the name, email, and phone numbers of the top three field surveyors
SELECT
    employee_name,
    email,
    phone_number
FROM
    employee
WHERE
    assigned_employee_id IN (1, 30, 34);

 * mysql+pymysql://root:***@127.0.0.1:3306/md_water_services
3 rows affected.


employee_name,email,phone_number
Bello Azibo,bello.azibo@ndogowater.gov,99643864786
Pili Zola,pili.zola@ndogowater.gov,99822478933
Rudo Imani,rudo.imani@ndogowater.gov,99046972648
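
The two-step lookup (find the top three ids, then fetch their contact details) can be sketched in Python as well. Everything in this block is made up for illustration: the visit log, the placeholder names, and the contact details are hypothetical. Note that the sketch counts one row per visit, whereas the query above sums the visit_count column.

```python
from collections import Counter

# Hypothetical visit log: one assigned_employee_id per recorded visit
visit_log = [1, 30, 1, 34, 30, 1, 7, 34, 1, 30, 12]

# Step 1: count visits per employee and keep the three highest counts
visits_per_employee = Counter(visit_log)
top_three_ids = [emp_id for emp_id, _ in visits_per_employee.most_common(3)]

# Step 2: look up contact details for those ids (placeholder records)
employees = {
    1: {"name": "Employee A", "email": "employee.a@example.org"},
    7: {"name": "Employee B", "email": "employee.b@example.org"},
    30: {"name": "Employee C", "email": "employee.c@example.org"},
    34: {"name": "Employee D", "email": "employee.d@example.org"},
}
top_three_contacts = [employees[emp_id] for emp_id in top_three_ids]
print(top_three_ids)
```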


I'll send that off to the President of Maji Ndogo. But this survey is not primarily about our employees, so let's get working on the main task! We'll start looking at
some of the tables in the dataset at a larger scale, identify some trends, summarise important data, and draw insights.

## 3. **Analyzing locations: Understanding where the water sources are**

1. **Overview of the dataset:**
   - The dataset contains water source records for each province and town in the country.
   - The dataset is reliable and provides a comprehensive view of the water crisis across the country.

2. **Location distribution:**
   - 60% of all water sources in the dataset are in rural communities.
   - This insight highlights the need to consider rural areas when making decisions.

3. **Location analysis:**
   - The location table provides insights into the distribution of water sources across different provinces and towns.
   - The data shows that every province and town has many documented sources, ensuring a comprehensive view of the water crisis.


By analyzing the locations, we can gain valuable insights into the distribution of water sources across the country, helping inform decision-making for water resource management.

Looking at the location table, let’s focus on the province_name, town_name and location_type to understand where the water sources are in
Maji Ndogo.

Let's create a query that counts the number of records per town.

In [12]:
%%sql
# Number of records per town
SELECT 
    count(location_id) as number_of_records,
    town_name
FROM
    location
GROUP BY
    town_name;

 * mysql+pymysql://root:***@127.0.0.1:3306/md_water_services
25 rows affected.


number_of_records,town_name
1650,Harare
780,Kintampo
1070,Lusaka
23740,Rural
400,Abidjan
1090,Amina
930,Asmara
400,Bello
930,Dahabu
520,Pwani


Now let's count the records per province.

In [13]:
%%sql
# Number of records per province
SELECT 
    count(location_id) as number_of_records,
    province_name
FROM
    location
GROUP BY
    province_name;

 * mysql+pymysql://root:***@127.0.0.1:3306/md_water_services
5 rows affected.


number_of_records,province_name
8940,Akatsi
6950,Amanzi
6030,Hawassa
9510,Kilimani
8220,Sokoto


From the per-town counts, it's pretty clear that most of the water sources in the survey are situated in small rural communities, scattered across Maji Ndogo.
The per-province counts show that most provinces have a similar number of sources, so every province is well-represented in the survey.



Let's find a way to do the following:
1. Create a result set showing:
   - province_name
   - town_name
   - An aggregated count of records for each town (consider naming this records_per_town).
   - Ensure our data is grouped by both province_name and town_name.
2. Order our results primarily by province_name. Within each province, further sort the towns by their record counts in descending order.

In [14]:
%%sql
# Creating a table showing the number of records per province per town in descending order
SELECT
    province_name,
    town_name,
    count(location_id) as number_of_records
FROM
    location
GROUP BY
    province_name,
    town_name
ORDER BY
    province_name,
    count(location_id) DESC;


 * mysql+pymysql://root:***@127.0.0.1:3306/md_water_services
31 rows affected.


province_name,town_name,number_of_records
Akatsi,Rural,6290
Akatsi,Lusaka,1070
Akatsi,Harare,800
Akatsi,Kintampo,780
Amanzi,Rural,3100
Amanzi,Asmara,930
Amanzi,Dahabu,930
Amanzi,Amina,670
Amanzi,Pwani,520
Amanzi,Abidjan,400
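
The two-level ordering (province ascending, then record count descending within each province) is the part that is easy to get wrong, so here is a small Python sketch of the same sort. The rows are a shuffled subset of the output above; using a tuple key with a negated count is just one way to express it.

```python
# A few (province_name, town_name, number_of_records) rows from the output above,
# deliberately out of order
rows = [
    ("Amanzi", "Asmara", 930),
    ("Akatsi", "Harare", 800),
    ("Akatsi", "Rural", 6290),
    ("Amanzi", "Rural", 3100),
    ("Akatsi", "Lusaka", 1070),
]

# Sort by province ascending, then by record count descending (hence the minus sign)
ordered_rows = sorted(rows, key=lambda row: (row[0], -row[2]))

for province, town, count in ordered_rows:
    print(province, town, count)
```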


These results show us that our field surveyors did an excellent job of documenting the status of our country's water crisis. Every province and town
has many documented sources.
This makes me confident that the data we have is reliable enough to base our decisions on. This is an insight we can use to communicate data
integrity, so let's make a note of that.

Finally, let's look at the number of records for each location type.

In [15]:
%%sql
# Number of records for each location type
SELECT
    count(location_id) as num_sources,
    location_type
FROM
    location
GROUP BY
    location_type;

 * mysql+pymysql://root:***@127.0.0.1:3306/md_water_services
2 rows affected.


num_sources,location_type
15910,Urban
23740,Rural


We can see that there are more rural sources than urban, but it's really hard to understand those numbers. Percentages are more relatable.
If we use SQL as a very overpowered calculator:

In [16]:
%%sql
# Percentage of rural communities
SELECT 23740 / (15910 + 23740) * 100 as pct_rural;

 * mysql+pymysql://root:***@127.0.0.1:3306/md_water_services
1 rows affected.


pct_rural
59.8739
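
The same calculation can be written so that the percentages follow from the counts instead of being typed in by hand. A minimal Python sketch, using the two counts from the location_type query above:

```python
# Counts per location type, copied from the query output above
location_counts = {"Urban": 15910, "Rural": 23740}

total_locations = sum(location_counts.values())

# Share of each location type as a percentage, rounded to four decimals
pct_by_type = {
    location_type: round(count / total_locations * 100, 4)
    for location_type, count in location_counts.items()
}
print(pct_by_type)
```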


We can see that about 60% of all water sources in the data set are in rural communities.

So again, what are some of the insights we gained from the location table?
1. Our entire country was properly canvassed, and our dataset represents the situation on the ground.
2. 60% of our water sources are in rural communities across Maji Ndogo. We need to keep this in mind when we make decisions.

## 4. **Diving into the sources: Seeing the scope of the problem**

1. **Clustering data:**
   - The goal is to unveil the water crisis in Maji Ndogo by analyzing larger patterns and trends in the data.
   - By clustering data, we can unearth broader narratives and hidden correlations concealed within the dataset.

2. **Understanding the structure of data:**
   - Data is not just numbers or dates, but stories waiting to be deciphered.
   - The unique structure of data brims with valuable insights, which can be unlocked by processing and determining categories.

3. **Calculating the average number of people per source type:**
   - Calculate the average number of people served by a single instance of each water source type to understand the typical capacity or load on a single water source.
   - This can help decide which sources should be repaired or upgraded based on the average impact of each upgrade.

4. **Calculating the total number of people served by each type of water source:**
   - Calculate the total number of people served by each type of water source to make it easier to interpret.
   - Order the results so the most people served by a source are at the top.

5. **Analyzing the distribution of people served by each type of water source:**
   - Calculate the percentage of citizens served by each type of water source.
   - This can help understand the impact of each type of water source on the community and prioritize repairs or upgrades accordingly.

By diving into the sources and seeing the scope of the problem, we can gain a comprehensive understanding of the water crisis in Maji Ndogo, enabling informed decision-making for repairs and upgrades.

Ok, water_source is a big table, with lots of stories to tell, so strap in!

The way I look at this table, we have access to different water source types and the number of people using each source.
These are the questions that I am curious about.
1. How many people did we survey in total?
2. How many wells, taps and rivers are there?
3. How many people share particular types of water sources on average?
4. How many people are getting water from each type of source?


So firstly, we need to know how many people we surveyed in total, that is, the total number of people served.

In [17]:
%%sql
# Number of people served
SELECT
    sum(number_of_people_served) as total_people_served 
FROM
    water_source;


 * mysql+pymysql://root:***@127.0.0.1:3306/md_water_services
1 rows affected.


total_people_served
27628140


For the second question, we want to count how many of each of the different water source types there are.

In [18]:
%%sql
# Count how many of each of the different water source types there are
SELECT
    type_of_water_source,
    count(source_id)
FROM
    water_source
GROUP BY
    type_of_water_source
ORDER BY
    count(source_id) DESC;

 * mysql+pymysql://root:***@127.0.0.1:3306/md_water_services
5 rows affected.


type_of_water_source,count(source_id)
well,17383
tap_in_home,7265
tap_in_home_broken,5856
shared_tap,5767
river,3379


Which of those sources stands out? It is pretty clear that although there was a drought, water is still abundant in Maji Ndogo. This isn't just an informative result; we will need these numbers to understand how much all of these repairs will cost. If we know how many taps we need to install, and we know how much it will cost to install them, we can calculate how much it will cost to solve the water crisis.

Ok next up, question 3: What is the average number of people that are served by each water source?

In [19]:
%%sql
# Average number of people served by each water source
SELECT
    type_of_water_source,
    round(avg(number_of_people_served), 0) as avg_people_served
FROM
    water_source
GROUP BY
    type_of_water_source
ORDER BY
    avg(number_of_people_served) DESC;

 * mysql+pymysql://root:***@127.0.0.1:3306/md_water_services
5 rows affected.


type_of_water_source,avg_people_served
shared_tap,2071
river,699
tap_in_home_broken,649
tap_in_home,644
well,279


These results are telling us that 644 people share a tap_in_home on average. Does that make sense?

No it doesn’t, right?
The surveyors combined the data of many households together and added this as a single tap record, but each household actually has its own tap. In addition to this, there is an average of 6 people living in a home. So 6 people actually share 1 tap (not 644).

It is always important to think about data. We tend to just analyse, and calculate at the start of our careers, but the value we bring as data
practitioners is in understanding the meaning of results or numbers, and interpreting their meaning.
Imagine we were presenting this to the President and all of the Ministers, and one of them asks us: "Why does it say that 644 share a home tap?"
and we had no answer.

This means that 1 tap_in_home record actually represents 644 ÷ 6 ≈ 107, or roughly 100, taps.

Calculating the average number of people served by a single instance of each water source type helps us understand the typical capacity or load
on a single water source. This can help us decide which sources should be repaired or upgraded, based on the average impact of each upgrade.
For example, wells don't seem to be a problem, as fewer people are sharing them.

On the other hand, about 2,000 people share a single public tap on average! We saw some of the queue times last time, and now we can see why. So looking at
these results, we probably should focus on improving shared taps first.

Now let’s calculate the total number of people served by each type of water source. To make it easier to interpret, let's order the results so the source type serving the most people is at the top.

In [20]:
%%sql
# Total number of people served by each type of water source
SELECT
    type_of_water_source,
    sum(number_of_people_served) as sum_people_served
FROM
    water_source
GROUP BY
    type_of_water_source
ORDER BY
    sum(number_of_people_served) DESC;

 * mysql+pymysql://root:***@127.0.0.1:3306/md_water_services
5 rows affected.


type_of_water_source,sum_people_served
shared_tap,11945272
well,4841724
tap_in_home,4678880
tap_in_home_broken,3799720
river,2362544


It's a little hard to comprehend these numbers, but we can see that one of these is dominating. To make it a bit simpler to interpret, let's use
percentages. First, we need the total number of citizens (27628140, from the first query). Then we divide each SUM(number_of_people_served) by
that number and multiply by 100 to get percentages.

In [21]:
%%sql
# Total number of people served by each type of water source in percentages
SELECT
    type_of_water_source,
    round(sum(number_of_people_served) / 27628140 * 100, 0)  as pct_people_served
FROM
    water_source
GROUP BY
    type_of_water_source
ORDER BY
    round(sum(number_of_people_served) / 27628140 * 100, 0) DESC;

 * mysql+pymysql://root:***@127.0.0.1:3306/md_water_services
5 rows affected.


type_of_water_source,pct_people_served
shared_tap,43
well,18
tap_in_home,17
tap_in_home_broken,14
river,9


43% of our people are using shared taps in their communities, and as we saw earlier, about 2,000 people share one shared_tap on average.

By adding tap_in_home and tap_in_home_broken together, we see that 31% of people have water infrastructure installed in their homes, but 45%
(14/31) of these taps are not working! It isn't the taps themselves that are broken, but rather the infrastructure that serves these homes, like
treatment plants, reservoirs, pipes, and pumps.
18% of people are using wells, but only 4916 out of 17383 wells (28%) are clean (from part one).
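
As a sanity check on the figures quoted here, the arithmetic can be redone in Python from the totals in the query output above. This is just a worked calculation; the only inputs are numbers already shown in this notebook.

```python
# People served per source type, copied from the query output above
people_served = {
    "shared_tap": 11_945_272,
    "well": 4_841_724,
    "tap_in_home": 4_678_880,
    "tap_in_home_broken": 3_799_720,
    "river": 2_362_544,
}

total_people = sum(people_served.values())

# Percentage of all people served by each source type, rounded to whole percents
pct_served = {
    source: round(count / total_people * 100)
    for source, count in people_served.items()
}

# Share of in-home tap users whose infrastructure is broken
in_home_total = people_served["tap_in_home"] + people_served["tap_in_home_broken"]
pct_in_home_broken = round(people_served["tap_in_home_broken"] / in_home_total * 100)

# Share of wells that are clean (4916 of 17383, from part one)
pct_clean_wells = round(4916 / 17383 * 100)

print(pct_served, pct_in_home_broken, pct_clean_wells)
```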

## 5. **Start of a solution: Thinking about how we can repair**

1. **Understanding the problem:**
   - Analyze the data to identify the most critical issues in the water supply system.

2. **Focusing on improvements:**
   - Identify the areas where improvements can have the most significant impact.

3. **Calculating the total number of people served by each type of water source:**
   - Calculate the total number of people served by each type of water source to make it easier to interpret.
   - Order the results so the most people served by a source are at the top.

4. **Ranking water sources based on the number of people served:**
   - Use a window function to rank each type of water source based on the total number of people served.
   - This will help prioritize repairs and improvements based on the number of people affected.

5. **Analyzing the most used water sources:**
   - Identify the most used water sources, such as shared taps, and prioritize fixing them first.
   - This approach ensures that the greatest number of people benefit from the repairs and improvements made.

By following these steps, we can develop a data-driven strategy for repairing and improving the water supply system, ensuring that the most critical issues are addressed first and the greatest number of people benefit from the improvements.

At some point, we will have to fix or improve all of the infrastructure, so we should start thinking about how to make a data-driven decision
on the order of repairs. I think a simple approach is to fix the things that affect most people first. So let's write a query that ranks each type of source based on how many people in total use it. The word "rank" should tell us we are going to need a window function such as RANK() to do this, so let's think through the problem.

We will need the following columns:
- Type of sources -- Easy
- Total people served grouped by the types -- We did that earlier, so that's easy too.
- A rank based on the total people served, grouped by the types -- A little harder.

But think about this: If someone has a tap in their home, they already have the best source available. Since we can’t do anything more to improve
this, we should remove tap_in_home from the ranking before we continue.

So let's use a window function on the total people served column, converting it into a rank.

In [22]:
%%sql
# Window function on the total people served column, converting it into a rank
SELECT
    type_of_water_source,
    sum(number_of_people_served) as sum_people_served,
    rank() over(order by sum(number_of_people_served) DESC) as rank_by_population
FROM
    water_source
WHERE
    type_of_water_source != 'tap_in_home'
GROUP BY
    type_of_water_source;

 * mysql+pymysql://root:***@127.0.0.1:3306/md_water_services
4 rows affected.


type_of_water_source,sum_people_served,rank_by_population
shared_tap,11945272,1
well,4841724,2
tap_in_home_broken,3799720,3
river,2362544,4
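
To make the ranking logic concrete, here is a rough Python equivalent of RANK() OVER (ORDER BY ... DESC): a row's rank is one plus the number of rows with a strictly larger value. The totals are copied from the earlier output; the dictionary names are only illustrative.

```python
# Total people served per source type, copied from the earlier output
people_served = {
    "shared_tap": 11_945_272,
    "well": 4_841_724,
    "tap_in_home": 4_678_880,
    "tap_in_home_broken": 3_799_720,
    "river": 2_362_544,
}

# WHERE type_of_water_source != 'tap_in_home'
improvable = {src: n for src, n in people_served.items() if src != "tap_in_home"}

# RANK(): 1 + how many rows have a strictly larger total
rank_by_population = {
    src: 1 + sum(other > total for other in improvable.values())
    for src, total in improvable.items()
}
print(rank_by_population)
```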


Ok, so we should fix shared taps first, then wells, and so on. But the next question is, which shared taps or wells should be fixed first? We can use
the same logic; the most used sources should really be fixed first.

So let's create a query to do this, and keep these requirements in mind:
1. The sources within each type should be assigned a rank.
2. Limit the results to only improvable sources.
3. Think about how to partition, filter and order the results set.
4. Order the results to see the top of the list.


In [23]:
%%sql
# Rank sources within each type by number of people served
SELECT
    source_id,
    type_of_water_source,
    sum(number_of_people_served) as sum_people_served,
    dense_rank() over(partition by type_of_water_source order by sum(number_of_people_served) DESC) as priority_rank
FROM
    water_source
WHERE
    type_of_water_source != 'tap_in_home'
GROUP BY
    source_id,
    type_of_water_source
ORDER BY
    sum_people_served DESC
LIMIT 10;


 * mysql+pymysql://root:***@127.0.0.1:3306/md_water_services
10 rows affected.


source_id,type_of_water_source,sum_people_served,priority_rank
HaRu19509224,shared_tap,3998,1
AkRu05603224,shared_tap,3998,1
AkRu04862224,shared_tap,3996,2
KiHa22867224,shared_tap,3996,2
AmAs10911224,shared_tap,3996,2
HaRu19839224,shared_tap,3994,3
KiZu31330224,shared_tap,3994,3
KiRu28630224,shared_tap,3992,4
KiZu31415224,shared_tap,3992,4
SoRu38511224,shared_tap,3990,5
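
Because several sources serve the same number of people, the choice between RANK() and DENSE_RANK() matters. The Python sketch below applies both definitions to the ten values in the output above: DENSE_RANK() leaves no gaps after ties, while RANK() skips ahead.

```python
# sum_people_served for the ten rows shown above
served = [3998, 3998, 3996, 3996, 3996, 3994, 3994, 3992, 3992, 3990]

# DENSE_RANK(): position of the value among the distinct values, largest first
distinct_desc = sorted(set(served), reverse=True)
dense_ranks = [distinct_desc.index(value) + 1 for value in served]

# RANK(): 1 + number of rows with a strictly larger value, so ties leave gaps
ranks = [1 + sum(other > value for other in served) for value in served]

print(dense_ranks)
print(ranks)
```

For a priority list, the gap-free numbering is probably easier to read, since it tells an engineer directly how many distinct priority levels come before a source.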


Imagine yourself in an engineer's boots, and try to interpret the priority list. Thinking about the user
of a table helps us to design the table better.
In that line of thought, would it make sense to give them a list of source_ids? How would they know where to go? 
(we'll explore this problem in part 3 of the project)

## 6. **Analyzing queues: Uncovering when citizens collect water**

1. **Calculating the average queue time:**
   - Analyze the data to determine the average time citizens spend waiting in line to collect water.
   - This information can help identify bottlenecks and areas where improvements can be made.

2. **Identifying peak hours for water collection:**
   - Analyze the data to determine when the highest demand for water occurs.
   - This information can help in planning for infrastructure upgrades or repairs during off-peak hours, reducing the impact on citizens.

3. **Analyzing queue patterns:**
   - Identify patterns in queue times and water collection demand.
   - This information can help in understanding the underlying factors contributing to the water crisis and inform decision-making for sustainable solutions.

4. **Comparing queue times across different water sources:**
   - Analyze the data to compare queue times for different types of water sources.
   - This information can help prioritize repairs and improvements for water sources with the longest queue times, ensuring that the greatest number of citizens benefit from the improvements.

By analyzing queues and uncovering when citizens collect water, we can gain valuable insights into the water crisis, enabling informed decision-making for repairs, upgrades, and sustainable solutions.

A recap from last time:
The visits table documented all of the visits our field surveyors made to each location. For most sources, one visit was enough, but if there were
queues, they visited the location a couple of times to get a good idea of the time it took for people to queue for water. So we have the time that
they collected the data, how many times the site was visited, and how long people had to queue for water.

So, let's look at the information we have available, and think of what we could learn from it. Remember we can use some DateTime functions here to get
some deeper insight into the water queueing situation in Maji Ndogo, like which day of the week it was, and what time.


Ok, these are some of the things I think are worth looking at:
1. How long did the survey take?
2. What is the average total queue time for water?
3. What is the average queue time on different days?
4. How can we communicate this information efficiently?

Let's look at visits, especially the time_of_record column. It is an SQL DateTime datatype, so we can use all of the DateTime functions to aggregate
data for each day and even per hour.


**Question 1:**

To calculate how long the survey took, we need to get the first and last dates (MIN() and MAX() find the smallest and largest values) and subtract
them. Remember that with DateTime data, we can't just subtract the values. We have to use a function to get the difference in days.

In [24]:
%%sql
# Finding survey duration in days
SELECT
    DATEDIFF(max(time_of_record), min(time_of_record)) AS survey_duration_in_days
FROM
    visits;

 * mysql+pymysql://root:***@127.0.0.1:3306/md_water_services
1 rows affected.


survey_duration_in_days
924
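
For comparison, here is how the same day difference could be computed in Python. The two timestamps are hypothetical (the real first and last time_of_record values are not shown here); they are chosen to be 924 calendar days apart. Like DATEDIFF(), the sketch compares only the date parts and ignores the time of day.

```python
from datetime import datetime, timedelta

# Hypothetical first and last survey timestamps, 924 calendar days apart
first_record = datetime(2021, 1, 1, 9, 10)
last_record = first_record + timedelta(days=924, hours=4)

# Compare only the date parts, ignoring the time of day
survey_duration_in_days = (last_record.date() - first_record.date()).days

# Rough conversion to years
survey_duration_in_years = survey_duration_in_days / 365.25

print(survey_duration_in_days, round(survey_duration_in_years, 1))
```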


I get 924 days which is about 2 and a half years!

Just imagine all the visits, meeting all those people on the ground for two and a half years! It is sometimes easy to see data as meaningless numbers and text,
but remember that each person in that queue that day could have been someone who walked 10 kilometres, queued for 4-5 hours and then walked
all the way back home! Often these are children who need to do this, so they have less time to attend school. 

**Question 2:**

Let's see how long people have to queue on average in Maji Ndogo. Keep in mind that many sources like tap_in_home have no queues. These
are just recorded as 0 in the time_in_queue column, so when we calculate averages, we need to exclude those rows. Let's try using NULLIF() to do
this.

In [25]:
%%sql
# Average time in queue, excluding rows with no queue (time_in_queue = 0)
SELECT
    avg(nullif(time_in_queue, 0)) as avg_time_in_queue
FROM
    visits;


 * mysql+pymysql://root:***@127.0.0.1:3306/md_water_services
1 rows affected.


avg_time_in_queue
123.2574


So on average, people queue for about two hours (123 minutes) to fetch water if they don't have a tap in their homes.
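To see why the zeros matter, here is a small Python sketch of the same idea. The queue times are made-up numbers (an assumption for illustration only), and `avg_excluding_zeros` is a hypothetical helper that mimics what AVG() does once NULLIF() has turned the zeros into NULLs.

```python
def avg_excluding_zeros(values):
    """Mean of the non-zero values; None if nothing is left (like AVG over only NULLs)."""
    non_zero = [v for v in values if v != 0]
    return sum(non_zero) / len(non_zero) if non_zero else None


# Hypothetical queue times in minutes; the zeros stand for sources with no queue.
queue_times = [0, 0, 120, 60, 0, 180]

naive = sum(queue_times) / len(queue_times)  # zeros drag the mean down
filtered = avg_excluding_zeros(queue_times)  # zeros are ignored
print(naive, filtered)
```

With these sample values the naive average is half the filtered one, which is the kind of understatement we want to avoid.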

That may sound reasonable, but it is only an average. Some days are likely to be busier than others, for example because people only have time to collect water on certain days.

**Question 3:**

So let's look at the queue times aggregated across the different days of the week.

DAY() gives us the day of the month. If we want to aggregate data for each day of the week, we need to use another DateTime function,
DAYNAME(column). As the name suggests, it returns the day from a timestamp as a string. Using that on the time_of_record column will result
in a column with day names, Monday, Tuesday, etc., from the timestamp.

To do this, we need to calculate the average queue time, grouped by day of the week. 

In [26]:
%%sql
# Average queue time by day of week
SELECT
    DAYNAME(time_of_record) AS day_of_week,
    ROUND(AVG(NULLIF(time_in_queue, 0)), 0) AS avg_time_in_queue
FROM
    visits
GROUP BY
    DAYNAME(time_of_record)
ORDER BY
    avg_time_in_queue DESC;

 * mysql+pymysql://root:***@127.0.0.1:3306/md_water_services
7 rows affected.


day_of_week,avg_time_in_queue
Saturday,246
Monday,137
Friday,120
Tuesday,108
Thursday,105
Wednesday,97
Sunday,82


Wow, ok Saturdays have much longer queue times compared to the other days!
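The grouping step can be sketched in plain Python as well. This is only an illustration under stated assumptions: the rows are hypothetical, `avg_queue_by_day` is a made-up helper, and a day that has only zero queue times is simply left out here, whereas the SQL query would list it with a NULL average.

```python
from collections import defaultdict
from datetime import datetime

# datetime.weekday() returns 0 for Monday through 6 for Sunday.
DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]


def avg_queue_by_day(visits):
    """Average non-zero queue time per weekday, longest first."""
    buckets = defaultdict(list)
    for time_of_record, time_in_queue in visits:
        if time_in_queue != 0:  # same effect as NULLIF(time_in_queue, 0)
            buckets[DAY_NAMES[time_of_record.weekday()]].append(time_in_queue)
    # Note: Python's round() may differ from MySQL's ROUND() on exact halves.
    averages = {day: round(sum(v) / len(v)) for day, v in buckets.items()}
    return sorted(averages.items(), key=lambda item: item[1], reverse=True)


# Hypothetical (time_of_record, time_in_queue) rows, not real survey data.
visits = [
    (datetime(2021, 1, 2, 9, 0), 240),   # Saturday
    (datetime(2021, 1, 2, 17, 0), 260),  # Saturday
    (datetime(2021, 1, 3, 10, 0), 80),   # Sunday
    (datetime(2021, 1, 3, 11, 0), 0),    # Sunday, no queue: ignored
    (datetime(2021, 1, 4, 7, 0), 150),   # Monday
]
print(avg_queue_by_day(visits))
```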

**Question 4:**

We can also look at what time during the day people collect water. Let's try to order the results in a meaningful way.


In [27]:
%%sql
# Average queue time by hour of day
SELECT
    TIME_FORMAT(TIME(time_of_record), '%H:00') AS hour_of_day,
    ROUND(AVG(NULLIF(time_in_queue, 0)), 0) AS avg_time_in_queue
FROM
    visits
GROUP BY
    TIME_FORMAT(TIME(time_of_record), '%H:00')
ORDER BY
    avg_time_in_queue DESC;

 * mysql+pymysql://root:***@127.0.0.1:3306/md_water_services
14 rows affected.


hour_of_day,avg_time_in_queue
19:00,168
07:00,149
08:00,149
17:00,149
06:00,149
18:00,147
09:00,118
13:00,115
10:00,114
14:00,114
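Sorting by average queue time mixes up the hours, so one other way to read "meaningful" is chronological order. The Python sketch below reorders the ten rows displayed above (the query returned 14 rows, but only these ten are shown, so only these are used). The 140-minute cutoff is an arbitrary value picked for illustration, not a threshold from the survey.

```python
# The ten displayed rows (hour_of_day, avg_time_in_queue), in the order printed.
hourly = [
    ("19:00", 168), ("07:00", 149), ("08:00", 149), ("17:00", 149), ("06:00", 149),
    ("18:00", 147), ("09:00", 118), ("13:00", 115), ("10:00", 114), ("14:00", 114),
]

# Zero-padded "HH:00" labels sort chronologically as plain strings.
by_hour = sorted(hourly, key=lambda row: row[0])

# Arbitrary illustrative cutoff (assumption) to pick out the busiest hours.
PEAK_CUTOFF_MINUTES = 140
peak_hours = [hour for hour, minutes in by_hour if minutes >= PEAK_CUTOFF_MINUTES]

for hour, minutes in by_hour:
    print(hour, minutes)
print("Peak hours:", peak_hours)
```

Read in this order, the displayed rows fall into a morning block and an evening block of long queues, with shorter queues in between.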


Let's create a pivot table of queue times per hour and per day to better understand the situation.

In [28]:
%%sql
# Pivot table for queue times per hour and per day
SELECT
    TIME_FORMAT(TIME(time_of_record), '%H:00') AS hour_of_day,
    ROUND(AVG(
        CASE
            WHEN DAYNAME(time_of_record) = 'Sunday' THEN time_in_queue
            ELSE NULL
        END
    ), 0) AS Sunday,
    ROUND(AVG(
        CASE
            WHEN DAYNAME(time_of_record) = 'Monday' THEN time_in_queue
            ELSE NULL
        END
    ), 0) AS Monday,
    ROUND(AVG(
        CASE
            WHEN DAYNAME(time_of_record) = 'Tuesday' THEN time_in_queue
            ELSE NULL
        END
    ), 0) AS Tuesday,
    ROUND(AVG(
        CASE
            WHEN DAYNAME(time_of_record) = 'Wednesday' THEN time_in_queue
            ELSE NULL
        END
    ), 0) AS Wednesday,
    ROUND(AVG(
        CASE
            WHEN DAYNAME(time_of_record) = 'Thursday' THEN time_in_queue
            ELSE NULL
        END
    ), 0) AS Thursday,
    ROUND(AVG(
        CASE
            WHEN DAYNAME(time_of_record) = 'Friday' THEN time_in_queue
            ELSE NULL
        END
    ), 0) AS Friday,
    ROUND(AVG(
        CASE
            WHEN DAYNAME(time_of_record) = 'Saturday' THEN time_in_queue
            ELSE NULL
        END
    ), 0) AS Saturday
FROM
    visits
WHERE
    time_in_queue != 0 -- this excludes other sources with 0 queue times
GROUP BY
    hour_of_day
ORDER BY
    hour_of_day;

 * mysql+pymysql://root:***@127.0.0.1:3306/md_water_services
14 rows affected.


hour_of_day,Sunday,Monday,Tuesday,Wednesday,Thursday,Friday,Saturday
06:00,79,190,134,112,134,153,247
07:00,82,186,128,111,139,156,247
08:00,86,183,130,119,129,153,247
09:00,84,127,105,94,99,107,252
10:00,83,119,99,89,95,112,259
11:00,78,115,102,86,99,104,236
12:00,78,115,97,88,96,109,239
13:00,81,122,97,98,101,115,242
14:00,83,127,104,92,96,110,244
15:00,83,126,104,88,92,110,248


Now we can compare the queue times for each day, hour by hour!
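The pivot logic itself (one row per hour, one column per day, each cell an average) can be sketched in Python. This is an illustration under stated assumptions, not the method used above: the rows are hypothetical and `pivot_queue_times` is a made-up helper. A missing cell is simply absent from the inner dictionary, where the SQL query would show NULL.

```python
from collections import defaultdict
from datetime import datetime

# datetime.weekday() returns 0 for Monday through 6 for Sunday.
DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]


def pivot_queue_times(visits):
    """Return {hour_label: {day_name: rounded average}} for non-zero queue times."""
    cells = defaultdict(list)
    for time_of_record, time_in_queue in visits:
        if time_in_queue == 0:  # same filter as WHERE time_in_queue != 0
            continue
        hour = time_of_record.strftime("%H:00")
        day = DAY_NAMES[time_of_record.weekday()]
        cells[(hour, day)].append(time_in_queue)

    table = defaultdict(dict)
    for (hour, day), values in cells.items():
        table[hour][day] = round(sum(values) / len(values))
    # Sort rows by hour label, like ORDER BY hour_of_day.
    return {hour: table[hour] for hour in sorted(table)}


# Hypothetical (time_of_record, time_in_queue) rows, not real survey data.
visits = [
    (datetime(2021, 1, 2, 6, 15), 240),  # Saturday, 06:00 bucket
    (datetime(2021, 1, 2, 6, 45), 250),  # Saturday, 06:00 bucket
    (datetime(2021, 1, 3, 6, 30), 80),   # Sunday, 06:00 bucket
    (datetime(2021, 1, 4, 9, 5), 120),   # Monday, 09:00 bucket
    (datetime(2021, 1, 4, 9, 50), 0),    # no queue: excluded
]
pivot = pivot_queue_times(visits)
print(pivot)
```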

We can spot these patterns:
1. Queues are very long on a Monday morning and Monday evening as people rush to get water.
2. Wednesday has the shortest queues of the weekdays, but queues are long on Wednesday evening.
3. People have to queue pretty much twice as long on Saturdays compared to the weekdays. It looks like people spend their Saturdays queueing
for water, perhaps for the week's supply?
4. The shortest queues are on Sundays, and this is a cultural thing. The people of Maji Ndogo prioritise family and religion, so Sundays are spent
with family and friends.

## 7. **Reporting insights: Assembling our insights into a story**

1. **Share the insights we gathered**


2. **Develop an initial plan**


3. **Recommend practical solutions**

By reporting insights and assembling them into a story, we can effectively communicate the findings from the data analysis, inform decision-making, and build support for sustainable solutions to the water crisis in Maji Ndogo.

This survey aimed to identify the water sources people use and determine both the total and average number of users for each source.
Additionally, it examined the duration citizens typically spend in queues to access water.
So let's create a short summary report we can send off to the President of Maji Ndogo:

**Insights**
1. Most water sources are rural.
2. 43% of our people are using shared taps. 2000 people often share one tap.
3. 31% of our population has water infrastructure in their homes, but within that group, 45% face non-functional systems due to issues with pipes,
pumps, and reservoirs.
4. 18% of our people are using wells, of which only 28% are clean.
5. Our citizens often face long wait times for water, averaging more than 120 minutes.
6. In terms of queues:
    - Queues are very long on Saturdays.
    - Queues are longer in the mornings and evenings.
    - Wednesdays and Sundays have the shortest queues.
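The percentages in points 3 and 4 are nested: the 45% and 28% apply within their own groups, not to the whole population. A quick worked calculation puts them on the same scale. Treat the wells figure as a rough estimate only, since it assumes people are spread evenly across clean and contaminated wells, which the summary above does not state.

```python
# Shares taken from the insights above, as fractions of the whole population.
shared_taps = 0.43
home_taps = 0.31
wells = 0.18

# 45% of the home-tap group has non-functional infrastructure.
home_taps_broken = home_taps * 0.45

# Only 28% of wells are clean. Assumption: users are spread evenly over wells,
# so roughly 72% of well users rely on a well that is not clean.
wells_not_clean = wells * (1 - 0.28)

for label, share in [
    ("Shared taps", shared_taps),
    ("Home taps, not working", home_taps_broken),
    ("Wells, not clean (estimate)", wells_not_clean),
]:
    print(f"{label}: about {round(share * 100)}% of the population")
```

So broken home infrastructure and unclean wells each touch roughly one person in seven or eight, which is why both appear in the plan alongside shared taps.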

**Start of our plan**

We have started thinking about a plan:
1. We want to focus our efforts on improving the water sources that affect the most people.
    - Most people will benefit if we improve the shared taps first.
    - Wells are a good source of water, but many are contaminated. Fixing this will benefit a lot of people.
    - Fixing existing infrastructure will help many people. If they have running water again, they won't have to queue, thereby shortening queue times for
    others. So we can solve two problems at once.
    - Installing taps in homes will stretch our resources too thin, so for now, if the queue times are low, we won't improve that source.
2. Most water sources are in rural areas. Our teams need to know this, because they will have to make these repairs/upgrades in
rural areas where road conditions, supplies, and labour are harder challenges to overcome.


**Practical solutions**
1. If communities are using rivers, we can dispatch trucks to those regions to provide water temporarily in the short term, while we send out
crews to drill for wells, providing a more permanent solution.
2. If communities are using wells, we can install filters to purify the water. For wells with biological contamination, we can install UV filters that
kill microorganisms, and for *polluted wells*, we can install reverse osmosis filters. In the long term, we need to figure out why these sources
are polluted.
3. For shared taps, in the short term, we can send additional water tankers to the busiest taps, on the busiest days. We can use the queue time
pivot table we made to send tankers at the busiest times. Meanwhile, we can start the work on installing extra taps where they are needed.
According to UN standards, the maximum acceptable wait time for water is 30 minutes. With this in mind, our aim is to install taps to get
queue times below 30 min.
4. Shared taps with short queue times (< 30 min) represent a logistical challenge to further reduce waiting times. The most effective solution,
installing taps in homes, is resource-intensive and better suited as a long-term goal.
5. Addressing broken infrastructure offers a significant impact even with just a single intervention. It is expensive to fix, but so many people
can benefit from repairing one facility. For example, fixing a reservoir or pipe that multiple taps are connected to. We will have to find the
commonly affected areas though to see where the problem actually is.
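As a rough sketch of how the queue time pivot table could feed into tanker scheduling, the Python snippet below ranks the busiest slots and counts how many exceed the 30-minute target. The values are copied from the pivot table output shown earlier, which only displays 06:00 to 15:00, so the evening hours are missing here. `busiest_slots` is a hypothetical helper, not part of the analysis above.

```python
DAYS = ["Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday"]

# Average queue times in minutes, copied from the displayed pivot table rows.
PIVOT = {
    "06:00": [79, 190, 134, 112, 134, 153, 247],
    "07:00": [82, 186, 128, 111, 139, 156, 247],
    "08:00": [86, 183, 130, 119, 129, 153, 247],
    "09:00": [84, 127, 105, 94, 99, 107, 252],
    "10:00": [83, 119, 99, 89, 95, 112, 259],
    "11:00": [78, 115, 102, 86, 99, 104, 236],
    "12:00": [78, 115, 97, 88, 96, 109, 239],
    "13:00": [81, 122, 97, 98, 101, 115, 242],
    "14:00": [83, 127, 104, 92, 96, 110, 244],
    "15:00": [83, 126, 104, 88, 92, 110, 248],
}
TARGET_MINUTES = 30  # the wait-time target quoted in the text above


def busiest_slots(pivot, top_n=3):
    """Return the top_n (day, hour, minutes) slots with the longest average queues."""
    slots = [
        (minutes, day, hour)
        for hour, row in pivot.items()
        for day, minutes in zip(DAYS, row)
    ]
    slots.sort(reverse=True)
    return [(day, hour, minutes) for minutes, day, hour in slots[:top_n]]


top_three = busiest_slots(PIVOT)
over_target = sum(m > TARGET_MINUTES for row in PIVOT.values() for m in row)
total_slots = sum(len(row) for row in PIVOT.values())
print(top_three)
print(f"{over_target} of {total_slots} displayed slots exceed {TARGET_MINUTES} minutes")
```

In the displayed rows, the three busiest slots are all on Saturday, and every slot is above the 30-minute target, so tankers alone are only a short-term measure.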