# Final Report Proposal
## Group Members
- Benjamin D. Fedoruk -- 100779076
- Amit Sarvate -- 100794129
- Ahad Abdul -- 100xxxxxx
- Lexa Torrance -- 100754032

## Dataset Selected
- The selected dataset was compiled as a Stata .do file as part of a 2021 report by Brady Deaton and Bethany Lipka titled “The Provision of Drinking Water in First Nations Communities and Ontario Municipalities: Insight into the Emergence of Water Sharing Arrangements”.
- The original scholarly article and the data (along with sources for the data collected) can be accessed at doi:10.5683/SP2/TTHJVN.
- The team wanted to tackle the issue of clean drinking water, especially after the horrific scenes seen on the news regarding Iqaluit’s water contamination crisis in recent weeks. 
- The specific dataset selected was gathered from a variety of sources. The dataset documents water supply data from First Nations communities and Northern Ontarian communities in 2009 and 2010, and provides several interesting research directions.
- These data were supplemented using municipal data sources by the original authors, to ensure that a sufficient amount of data was present. It should be noted that during the years of 2009-2010, the issue of water shortage was of lower priority at the provincial and federal levels. 
- Further labels were attached to each community using FedNor, the Canadian Census, and Natural Resources Canada. These data were, again, compiled by the original authors. They describe community characteristics, including elevation, First Nation status, Northern status (according to FedNor), and whether a water sharing agreement is in place.

## Data Attributes
**NB.** All major descriptors used herein are quoted directly from the original source (Deaton and Lipka, 2021). The team thinks it best to use information directly from the source material so that the data are not misinterpreted. The team has also summarized each descriptor in an indented sub-bullet.
- *Dist* -- This variable captures the distance from each community (census subdivision) in the data set to its closest neighbouring census subdivision with water infrastructure. These distances are measured as the distance from community boundary to neighbour’s centroid (in kilometers).
    - Number of kilometers from the community's boundary to the centroid of the nearest community with water infrastructure.
- *Elev* -- This variable captures the elevation of each community (census subdivision) in our data set relative to mean sea level. It is measured at the community centroid in 10s of meters. 
    - Number of tens of meters above sea level for the community. 
- *FN* -- This variable identifies the communities (census subdivisions) in our data set that are First Nations communities. 
    - Boolean of whether or not community is designated First Nations. 
- *Inc* -- This variable captures the 2005 regional (census division) median income for each community (census subdivision) in our data set, as reported by the 2006 Canadian census. This regional median income is measured in \[1000s of dollars\]. 
    - Median income of the community's census division, in thousands of dollars.
- *lnDist* -- This variable captures the natural log of Dist, defined above. 
    - The natural logarithm of the Dist variable. 
- *lnPD* -- This variable captures the natural log of PD, defined below. 
    - This is the natural logarithm of the PD variable, which will be described later.
- *North* -- This variable identifies communities (census subdivisions) in our data set that are located in Northern Ontario census divisions as defined by FedNor (the government of Canada’s economic development organization for Northern Ontario). 
    - Boolean of whether or not community is in the north as defined by FedNor. 
- *PD* -- The variable captures the population density of each community (census subdivision) in our data set, in 100s of persons per square kilometer, as reported in the 2006 Canadian census. 
    - Number of hundreds of persons per square kilometer in a community. 
- *WSA* -- This variable identifies communities (census subdivisions) in our data set that receive at least some portion of their water supply through some form of water sharing agreement (WSA). 
    - Boolean of whether or not the community receives at least part of its water through a water sharing agreement with another community. 
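
The logged columns can be checked against their source columns. The sketch below is only an illustration: it uses the first five rows of the data preview, copied by hand into a hypothetical `sample` frame, and it includes `lnInc`, which appears in the loaded table but is not described in the list above. The tolerance is an assumption that allows for the rounding in the displayed values.

```python
import numpy as np
import pandas as pd

# First five rows, copied by hand from the preview of cleanwater.tab
sample = pd.DataFrame({
    'PD':     [0.213, np.nan, 0.280, 7.471, 0.203],
    'Dist':   [4.661, 3.667, 3.022, 1.090, 7.561],
    'Inc':    [23.940] * 5,
    'lnPD':   [-1.546463, np.nan, -1.272966, 2.011029, -1.594549],
    'lnDist': [1.539230, 1.299374, 1.105919, 0.086178, 2.023003],
    'lnInc':  [3.175551] * 5,
})

# Compare each logged column with the natural log of its source column.
# equal_nan=True treats a missing PD and its missing lnPD as a match.
checks = {}
for raw, logged in [('PD', 'lnPD'), ('Dist', 'lnDist'), ('Inc', 'lnInc')]:
    checks[logged] = bool(
        np.allclose(np.log(sample[raw]), sample[logged], atol=1e-5, equal_nan=True)
    )

print(checks)
```

Running the same loop on the full DataFrame would flag any row where a logged column disagrees with its source.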
    
## Proposal
The team would like to research access to clean drinking water in Northern Ontario’s communities. This interest was sparked by the Iqaluit water crisis of October 2021. The dataset was published recently (2021), although the observations it documents date from 2009 and 2010, so it offers a recent scholarly treatment of the issue rather than current measurements. Water is critical to human civilization, so research into water contamination can directly benefit an at-risk community of Canadians. 

## Other Datasets
These datasets take a global perspective and focus on the health effects of water contamination, which was slightly outside our scope. We wanted a Canadian perspective, which these datasets do not provide. 
- https://ourworldindata.org/water-access
- https://www.sdg6data.org/


## Research Questions
- To what extent does First Nations status affect the water infrastructure of a community?
- What is an optimal strategy to ensure all Ontarians have access to clean drinking water?
- Are water sharing agreements (WSAs) an effective way to combat water contamination and scarcity?
- How is median income correlated with the water infrastructure in the community? (And by extension, should federal and provincial governments provide more economic resources to underprivileged communities?)
- Does communal isolation have an impact on clean drinking water availability? 

## Methodology
- We expect to load the data using pandas and to derive statistical insights using methods from statsmodels and base Python. 
- We will use matplotlib and seaborn to visualize the data, including histograms, box-and-whisker plots, scatter plots (with fitted regression lines), and kernel density estimates (KDEs). 
- We will estimate trends and regression metrics using scikit-learn. 
- We may also apply NLTK and other natural language processing methods to text data, which would potentially need to be scraped from the internet.  
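
As a rough illustration of the regression step, the sketch below fits a scikit-learn logistic regression of `WSA` on `FN`, `North`, and `lnDist`. Everything about the data here is an assumption: the rows are synthetic, generated from a made-up rule so that the example has a signal to recover, and they say nothing about the real communities. The choice of logistic regression is also only one plausible option for a Boolean outcome, not a method the team has committed to.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the real table (NOT the Deaton and Lipka data)
rng = np.random.default_rng(0)
n = 300
synthetic = pd.DataFrame({
    'FN': rng.integers(0, 2, size=n),
    'North': rng.integers(0, 2, size=n),
    'lnDist': rng.normal(2.0, 1.0, size=n),
})

# Hypothetical generating rule: agreements become less likely with distance
log_odds = 0.5 - 1.0 * synthetic['lnDist']
prob = 1.0 / (1.0 + np.exp(-log_odds))
synthetic['WSA'] = (rng.random(n) < prob).astype(int)

# Fit the classifier and label each coefficient with its predictor
features = ['FN', 'North', 'lnDist']
model = LogisticRegression()
model.fit(synthetic[features], synthetic['WSA'])
coefs = pd.Series(model.coef_[0], index=features)

print(coefs)
```

On the real data, rows with missing values would need to be dropped or imputed before fitting, since scikit-learn's `LogisticRegression` does not accept NaN inputs.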

## Potentially Foreseen Libraries
- numpy
- scikit-learn
- matplotlib
- seaborn
- nltk
- statsmodels
- pandas
- scrapy

## Future Potential
- The conclusions generated herein can help to ensure that the Iqaluit situation does not occur in Ontarian communities where water quality is at risk.
- The end goal of this research is to produce a list of policy proposals and calls to actions for various levels of government, which will be data-driven. 
- It must be ensured that at-risk communities are not disproportionately harmed compared to the typical Ontarian community, or the Canadian community at large.
- Improving access to clean drinking water would save lives, which in turn would reduce hospital occupancy and governmental medical costs. Thus, the net cost to government is not projected to be drastically different when a focus is placed on accessible and clean drinking water. 


In [25]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [9]:
df = pd.read_table('cleanwater.tab')
df

Unnamed: 0,WSA,FN,North,PD,Dist,Elev,Inc,lnPD,lnDist,lnInc
0,0.0,0,0.0,0.213,4.661,6.0,23.940,-1.546463,1.539230,3.175551
1,0.0,1,0.0,,3.667,5.0,23.940,,1.299374,3.175551
2,1.0,0,0.0,0.280,3.022,9.0,23.940,-1.272966,1.105919,3.175551
3,0.0,0,0.0,7.471,1.090,6.0,23.940,2.011029,0.086178,3.175551
4,0.0,0,0.0,0.203,7.561,9.0,23.940,-1.594549,2.023003,3.175551
...,...,...,...,...,...,...,...,...,...,...
427,0.0,1,1.0,0.065,37.750,20.0,23.667,-2.733368,3.630985,3.164082
428,0.0,1,1.0,0.122,57.555,27.0,23.667,-2.103734,4.052741,3.164082
429,0.0,1,1.0,0.536,38.728,26.0,23.667,-0.623621,3.656563,3.164082
430,0.0,1,1.0,0.004,27.182,21.0,23.667,-5.521461,3.302555,3.164082
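
Row 1 of the preview has no `PD` value (and therefore no `lnPD`), so missing data should be counted before any modelling. The sketch below shows one way to do that. It uses a hypothetical `excerpt` frame holding three columns of the first five rows, copied by hand from the preview, so the counts apply to the excerpt only and not to the full dataset.

```python
import numpy as np
import pandas as pd

# Three columns of the first five rows, copied by hand from the preview
excerpt = pd.DataFrame({
    'FN':   [0, 1, 0, 0, 0],
    'PD':   [0.213, np.nan, 0.280, 7.471, 0.203],
    'Dist': [4.661, 3.667, 3.022, 1.090, 7.561],
})

# Number of missing entries in each column
missing_per_column = excerpt.isna().sum()
print(missing_per_column)

# Rows that are complete in every column
complete_rows = excerpt.dropna()
print(f'{len(complete_rows)} of {len(excerpt)} rows are complete.')
```

The same two calls, `df.isna().sum()` and `df.dropna()`, apply unchanged to the full DataFrame.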


In [24]:
# Number of records
print(f'There are {df.shape[0]} records/rows.\n\n')

# Preview of the first five records
print(f'Here is a print of the first few records:\n{df.head()}\n\n')

# Observed [min, max] range of each continuous variable
for label, col in [('population density', 'PD'), ('distance', 'Dist'), ('elevation', 'Elev')]:
    print(f'The {label} ranges on [{df[col].min()}, {df[col].max()}]\n\n')
print(f'The income ranges on [{df["Inc"].min()}, {df["Inc"].max()}]')

# Column count and names
print(f'There are {df.shape[1]} columns. They are: {list(df.columns)}')

There are 432 records/rows.


Here is a print of the first few records:
   WSA  FN  North     PD   Dist  Elev    Inc      lnPD    lnDist     lnInc
0  0.0   0    0.0  0.213  4.661   6.0  23.94 -1.546463  1.539230  3.175551
1  0.0   1    0.0    NaN  3.667   5.0  23.94       NaN  1.299374  3.175551
2  1.0   0    0.0  0.280  3.022   9.0  23.94 -1.272966  1.105919  3.175551
3  0.0   0    0.0  7.471  1.090   6.0  23.94  2.011029  0.086178  3.175551
4  0.0   0    0.0  0.203  7.561   9.0  23.94 -1.594549  2.023003  3.175551


The population density ranges on [0.003, 39.724]


The distance ranges on [0.132, 178.416]


The elevation ranges on [1.0, 49.0]


The income ranges on [19.894, 35.433]
There are 10 columns. They are: ['WSA', 'FN', 'North', 'PD', 'Dist', 'Elev', 'Inc', 'lnPD', 'lnDist', 'lnInc']


In [27]:
# Pass the DataFrame through `data=` so seaborn draws one horizontal box per column.
# Passing it as `x=` raised "ValueError: The truth value of a Series is ambiguous."
sns.boxplot(data=df, orient='h')