### Capstone Project (Data Engineering and Intelligence Unit)

* Objective: To test the level of expertise across data acquisition, data analysis, data visualization, and data interpretation (insight generation).

* Data Category: Health, Energy, Religion, Agriculture, Hospitality, Finance

* Capstone Rules:

  * All datasets must cover Nigerian geography.
  * Try to make your deliverable as native to the Nigerian ecosystem as possible.
  * Be original in your deliverable.

* Task One: Select one or more data categories, search the internet, and acquire enough geospatial and statistical datasets.

  * Deliverables:

    * Acquired datasets and their referenced sources

* Task Two: Clean and analyze the acquired dataset, generating at least 10 analytical outputs you believe are useful (hint: focus on what can help create a solution, impact, or business intelligence).

  * Deliverables:

    * Cleaned dataset
    * Analytical report and method used

* Task Three: Create visualizations based on the analysis done. Visuals should include both statistical charts and maps (hint: your output here can be static, interactive, or both).

  * Deliverables:

    * Statistical chart(s)
    * Embellished map(s)

* Task Four: Interpret the maps and charts in a summarized format (no more than 500 words), providing relatable context on how each visualization relates to reality in terms of solution, impact, or profitability.

  * Deliverables:

    * A document containing all visualizations with their accompanying interpretations


#### Acquired Data: Health Facilities in Nigeria
#### Data Source: [https://data.humdata.org/dataset/nigeria-health-facilities] 

In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

In [2]:
df = pd.read_csv('nigeriahealthfacilities.csv')
df.shape

(46146, 15)

In [3]:
df.head()

Unnamed: 0,id,name,global_id,alternate_name,functional_status,type,ward_code,category,timestamp,accessibility,lga_name,lga_code,state_code,state_name,FID
0,1,G R A Nursing Home,af719462-abfd-4f47-9dc3-0987164e75ac,Nursing Home,Unknown,Primary,12413,Primary Health Center,2020-07-04T13:49:18Z,,Maiduguri,124,BR,Borno,sv_health_facilities.fid--3185df38_17a314b4ee6...
1,2,Gishili Health Center,a29b0328-d844-4358-b0ab-2e120b8fb30f,Nursing Home,Functional,Primary,12413,Primary Health Center,2020-07-04T13:49:18Z,Unknown,Maiduguri,124,BR,Borno,sv_health_facilities.fid--3185df38_17a314b4ee6...
2,3,Lehobi Primary Health Care,b685b769-5c83-4f83-a182-00e7e1b777d8,,Partially Functional,Primary,10207,Primary Health Center,2020-07-04T13:49:18Z,,Askira Uba,102,BR,Borno,sv_health_facilities.fid--3185df38_17a314b4ee6...
3,4,Dugja Idp Camp,78e64f7a-cbb8-4357-9e64-a7e502534527,Mandara Girau Dispensary,Not Functional,Primary,10503,Primary Health Center,2020-07-04T13:49:18Z,Unknown,Biu,105,BR,Borno,sv_health_facilities.fid--3185df38_17a314b4ee6...
4,5,Kopa Maikudiri Dispensary,409c97ce-7490-4dc2-a8f8-2b8d53ad2b12,,Partially Functional,Primary,10209,Dispensary,2020-07-04T13:49:18Z,Unknown,Askira Uba,102,BR,Borno,sv_health_facilities.fid--3185df38_17a314b4ee6...


In [5]:
# let's check unique values
columns=df.columns
for col in columns:
    print(col,':\n',df[col].unique())
    print(df[col].value_counts())
    print('\n',20*'**','\n')

id :
 [    1     2     3 ... 46606 46607 46608]
1        1
31039    1
31041    1
31042    1
31043    1
        ..
15546    1
15547    1
15548    1
15550    1
46608    1
Name: id, Length: 46146, dtype: int64

 **************************************** 

name :
 ['G R A Nursing Home' 'Gishili Health Center' 'Lehobi Primary Health Care'
 ... 'Ardo Kola Primary Health Care Center'
 'Ke Comprehensive Health Center' 'Bonny Primary Health Center']
Police Clinic                       17
Alheri Clinic                       11
Sabon Gari Primary Health Center    11
Model Health Center                 10
Sauki Clinic                         9
                                    ..
Elele Army Barracks                  1
Rumuokwuta Primary Health Center     1
Okira Primary Health Center          1
Obarany Health Post                  1
Bonny Primary Health Center          1
Name: name, Length: 43292, dtype: int64

 **************************************** 

global_id :
 ['af719462-abfd-4f47-9dc3-098

Name: state_name, dtype: int64

 **************************************** 

FID :
 ['sv_health_facilities.fid--3185df38_17a314b4ee6_9a6'
 'sv_health_facilities.fid--3185df38_17a314b4ee6_9a7'
 'sv_health_facilities.fid--3185df38_17a314b4ee6_9a8' ...
 'sv_health_facilities.fid--3185df38_17a326f927a_-421a'
 'sv_health_facilities.fid--3185df38_17a326f927a_-4219'
 'sv_health_facilities.fid--3185df38_17a326f927a_-4218']
sv_health_facilities.fid--3185df38_17a314b4ee6_9a6      1
sv_health_facilities.fid--3185df38_17a326f927a_-7e34    1
sv_health_facilities.fid--3185df38_17a326f927a_-7e32    1
sv_health_facilities.fid--3185df38_17a326f927a_-7e31    1
sv_health_facilities.fid--3185df38_17a326f927a_-7e30    1
                                                       ..
sv_health_facilities.fid--3185df38_17a314b4ee6_45bd     1
sv_health_facilities.fid--3185df38_17a314b4ee6_45be     1
sv_health_facilities.fid--3185df38_17a314b4ee6_45bf     1
sv_health_facilities.fid--3185df38_17a314b4ee6_45c0     1
sv
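
Printing every unique value is noisy for identifier-like columns such as `id`, `global_id` and `FID`, where each value occurs once. A more compact check is sketched below. It is only an illustration: `sample` is a small toy frame with made-up rows that reuses a few of the real column names, and it stands in for `df`.

```python
import pandas as pd

# Toy stand-in for df; the rows are illustrative, not taken from the dataset.
sample = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "type": ["Primary", "Primary", "Secondary", "Primary"],
    "state_name": ["Borno", "Borno", "Kano", "Kano"],
})

# Number of distinct values per column.
n_unique = sample.nunique()

# Columns where every row has its own value behave like identifiers.
id_like = n_unique[n_unique == len(sample)].index.tolist()

# Only the remaining columns are worth a full value_counts() listing.
for col in n_unique[n_unique < len(sample)].index:
    print(sample[col].value_counts(), "\n")
```

On the real frame, the same idea would skip the long listings for the identifier columns and keep the counts for columns such as `functional_status`, `type` and `state_name`.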

In [4]:
df.columns

Index(['id', 'name', 'global_id', 'alternate_name', 'functional_status',
       'type', 'ward_code', 'category', 'timestamp', 'accessibility',
       'lga_name', 'lga_code', 'state_code', 'state_name', 'FID'],
      dtype='object')

In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 46146 entries, 0 to 46145
Data columns (total 15 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   id                 46146 non-null  int64 
 1   name               46146 non-null  object
 2   global_id          46146 non-null  object
 3   alternate_name     4026 non-null   object
 4   functional_status  46146 non-null  object
 5   type               46146 non-null  object
 6   ward_code          46146 non-null  object
 7   category           46112 non-null  object
 8   timestamp          46146 non-null  object
 9   accessibility      134 non-null    object
 10  lga_name           46146 non-null  object
 11  lga_code           46146 non-null  int64 
 12  state_code         45020 non-null  object
 13  state_name         46146 non-null  object
 14  FID                46146 non-null  object
dtypes: int64(2), object(13)
memory usage: 5.3+ MB
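
`df.info()` reports `timestamp` as `object`, although the preview shows ISO 8601 strings, and several text columns hold only a handful of labels. One possible clean-up step is sketched below, under the assumption that every timestamp follows the format seen in `df.head()`. The frame `sample` is a toy stand-in with illustrative rows.

```python
import pandas as pd

# Toy rows that mimic a few real columns; values are illustrative only.
sample = pd.DataFrame({
    "functional_status": ["Unknown", "Functional", "Partially Functional"],
    "type": ["Primary", "Primary", "Primary"],
    "timestamp": ["2020-07-04T13:49:18Z"] * 3,
})

# Parse the ISO 8601 strings into timezone-aware datetimes.
sample["timestamp"] = pd.to_datetime(sample["timestamp"], utc=True)

# Low-cardinality text columns can be stored as categoricals.
for col in ["functional_status", "type"]:
    sample[col] = sample[col].astype("category")

print(sample.dtypes)
```

Whether the categorical conversion is worthwhile on the full dataset depends on how many distinct labels each column actually has.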


In [16]:
# Count fully duplicated rows (the result below shows there are none)
duplicates = df.duplicated()
duplicates.sum()

0
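
`df.duplicated()` compares whole rows, and because `id`, `global_id` and `FID` are unique per row it cannot flag a facility that was entered twice under different identifiers. The earlier `value_counts()` output shows that names do repeat (for example `Police Clinic`). A rough follow-up check, assuming that the same name inside the same LGA is suspicious, might look like this. The toy frame `sample` and the choice of key columns are assumptions for illustration.

```python
import pandas as pd

# Toy stand-in for df; rows are made up for illustration.
sample = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "name": ["Police Clinic", "Police Clinic", "Police Clinic", "Alheri Clinic"],
    "lga_code": [124, 124, 102, 124],
})

# Whole-row comparison: nothing matches because id differs on every row.
full_row_dupes = sample.duplicated().sum()

# Compare only on the chosen key; keep=False marks every member of a group.
same_name_same_lga = sample[sample.duplicated(subset=["name", "lga_code"], keep=False)]

print(full_row_dupes)
print(same_name_same_lga)
```

Rows flagged this way are candidates for manual review rather than automatic removal, since two distinct facilities in one LGA can legitimately share a name.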

In [17]:
df.isna().sum()

id                       0
name                     0
global_id                0
alternate_name       42120
functional_status        0
type                     0
ward_code                0
category                34
timestamp                0
accessibility        46012
lga_name                 0
lga_code                 0
state_code            1126
state_name               0
FID                      0
dtype: int64
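
The counts above mean that about 91% of `alternate_name` and over 99% of `accessibility` are missing, while `state_code` has gaps even though `state_name` is complete. One way to handle this is sketched below: drop columns that are mostly empty and recover `state_code` from rows of the same state. The 50% threshold and the toy frame `sample` are assumptions, not choices made in the notebook.

```python
import pandas as pd

# Toy stand-in for df; rows are illustrative only.
sample = pd.DataFrame({
    "name": ["A", "B", "C", "D"],
    "accessibility": [None, None, None, "Unknown"],
    "state_code": ["BR", None, "KN", None],
    "state_name": ["Borno", "Borno", "Kano", "Kano"],
})

# Fraction of missing values per column.
missing_share = sample.isna().mean()

# Drop columns that are more than half empty (threshold is an assumption).
sparse_cols = missing_share[missing_share > 0.5].index.tolist()
cleaned = sample.drop(columns=sparse_cols)

# Build a state_name -> state_code lookup from rows where the code is present.
code_by_state = (
    cleaned.dropna(subset=["state_code"])
    .drop_duplicates("state_name")
    .set_index("state_name")["state_code"]
)

# Fill missing codes from the lookup; states with no known code stay missing.
cleaned["state_code"] = cleaned["state_code"].fillna(
    cleaned["state_name"].map(code_by_state)
)

print(cleaned)
```

On the real data it would be worth checking first that each `state_name` maps to a single `state_code`, because the lookup keeps only the first code it sees per state.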

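In [None]:
# 'VendorID' is not a column in this dataset, so 'functional_status' is used instead
df['functional_status'].unique()
df['functional_status'].value_counts()
sns.countplot(x="functional_status", data=df)
df['functional_status'].value_counts().plot(kind='pie',autopct='%1.1f%%')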
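
For Task Two, one candidate analytical output is the share of facilities in each functional status per state. The sketch below uses `pd.crosstab` on a toy frame. The rows in `sample` are made up, and the label `Functional` is taken from the preview in `df.head()`.

```python
import pandas as pd

# Toy stand-in for df; rows are illustrative only.
sample = pd.DataFrame({
    "state_name": ["Borno", "Borno", "Borno", "Kano", "Kano"],
    "functional_status": [
        "Functional", "Not Functional", "Functional", "Functional", "Unknown",
    ],
})

# Row-normalised table: each state's statuses sum to 1.
status_share = pd.crosstab(
    sample["state_name"], sample["functional_status"], normalize="index"
)

# Rank states by the share of facilities marked Functional.
functional_rank = status_share["Functional"].sort_values(ascending=False)

print(status_share.round(2))
print(functional_rank.round(2))
```

The same table could feed both a stacked bar chart and a state-level map for Task Three.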