<a href="https://colab.research.google.com/github/davidnene/Eldo_Hub_DS_Challenge/blob/main/EldoHub_DS_Challenge%5BDavidNene%5D.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Data Science/ Data Analytics challenge**

Consider a dataset providing information on the functionality of infrastructure resources,
for each water point it includes the name of the village it is in and its functional state.
Implement a data processing module in python which takes a dataset URL as input and
returns:

● The number of water points that are functional,

● The number of water points per community,

● The rank for each community by the percentage of broken water points.

data source: 
https://raw.githubusercontent.com/onaio/ona-tech/master/data/water_points.json


In [7]:
#install BeautifulSoup library
#pip install bs4

#define an automation function
def auto_analyzer(url):
  #import prerequisites
  import requests
  from bs4 import BeautifulSoup
  import pandas as pd
  
  #store the required url in a variable
  URL = "https://raw.githubusercontent.com/onaio/ona-tech/master/data/water_points.json"
  r=requests.get(URL)
  
  #parse url
  soup= BeautifulSoup(r.content,'lxml')
  raw=soup.findAll('p')[0].text
  
  #extract and read the json file
  page_d=pd.read_json(raw.split("window.pageData=")[0],orient='records') 
  page = page_d[['communities_villages','water_functioning']].copy()
  
  #start analysis
  #functional water points
  number_functional=page['water_functioning'].value_counts()

  #water points per community
  number_water_points=pd.DataFrame(page['water_functioning'].groupby(page['communities_villages']).count())
  no_of_waterpoints_per_community = number_water_points.sort_values(by='water_functioning', ascending=False)

  #The rank for each community by the percentage of broken water points.
  not_functioning = page[page['water_functioning']=='no']
  not_functioning['comm_rank_in_%'] = not_functioning['communities_villages'].rank(pct=True)
  community_rank = not_functioning[['communities_villages','comm_rank_in_%']].copy()
  
  #excecute
  print('Functional water points: Yes')
  print(number_functional)
  print(' ')

  print('Water points per community')
  print(no_of_waterpoints_per_community)
  print(' ')

  print('The rank for each community by the percentage of broken water points')
  print(community_rank)

In [8]:
#test and evaluate the function
auto_analyzer("https://raw.githubusercontent.com/onaio/ona-tech/master/data/water_points.json")

Functional water points: Yes
yes      623
no        87
na_dn      2
Name: water_functioning, dtype: int64
 
Water points per community
                      water_functioning
communities_villages                   
Kpatarigu                            51
Jagsa                                38
Nayoku                               35
Guuta                                32
Nabulugu                             31
...                                 ...
Suik                                  1
Jiniensa                              1
Gumaryili                             1
Kalaasa                               1
Garigu                                1

[65 rows x 1 columns]
 
The rank for each community by the percentage of broken water points
    communities_villages  comm_rank_in_%
8              Selinvoya        0.735632
15              Nabulugu        0.591954
31              Nabulugu        0.591954
34              Nabulugu        0.591954
37              Nabulugu        0.591954
..   

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
