# Next Steps (as of October 20, 2018)
So we got the file put into a potentially readable format for a bipartite network in networkx, but we still have a lot of work to do!

**Audrey**
* Figure out what format this file needs to be in to load into networkx as a bipartite network
* Create a bipartite network!

**Cassie**
* Figure out and get the thresholds for pollutants in California (esp. if they're different than what we have)
* Put those thresholds into a format that can be easily used by the code we've written

# Water Quality Network
For our network project, Cassie and I are thinking of creating a bipartite network whose two sets of nodes are pollutants and water facilities, with a link wherever a facility's measurement of a pollutant is above the threshold. By doing this linking, we hope to see what the biggest pollutants are, how pollutants might connect to one another through shared facilities, and whether there are any pollutants that we should be worried about.
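
As a rough sketch of that structure (not the project's final code), a bipartite network can be held as two adjacency maps, one per node set. The rows below are made up for illustration, and the names `build_bipartite` and `project_pollutants` are hypothetical. Projecting onto the pollutant side links two pollutants whenever both exceed their thresholds at the same facility.

```python
from collections import defaultdict
from itertools import combinations

# Made-up (site, pollutant) exceedances, purely for illustration
exceedances = [
    ("site_A", "beryllium"),
    ("site_A", "arsenic"),
    ("site_B", "arsenic"),
    ("site_B", "beryllium"),
    ("site_C", "arsenic"),
]

def build_bipartite(rows):
    """Return (site -> pollutants, pollutant -> sites) adjacency maps."""
    site_to_pollutants = defaultdict(set)
    pollutant_to_sites = defaultdict(set)
    for site, pollutant in rows:
        site_to_pollutants[site].add(pollutant)
        pollutant_to_sites[pollutant].add(site)
    return site_to_pollutants, pollutant_to_sites

def project_pollutants(site_to_pollutants):
    """Count, for each pair of pollutants, the sites where both exceed."""
    shared = defaultdict(int)
    for pollutants in site_to_pollutants.values():
        for pair in combinations(sorted(pollutants), 2):
            shared[pair] += 1
    return dict(shared)

site_to_pollutants, pollutant_to_sites = build_bipartite(exceedances)
# Degree of a pollutant = number of sites where it exceeds its threshold
pollutant_degree = {p: len(sites) for p, sites in pollutant_to_sites.items()}
co_occurrence = project_pollutants(site_to_pollutants)
print(pollutant_degree)
print(co_occurrence)
```

The pollutant degrees speak to "what the biggest pollutants are", and the co-occurrence counts speak to "how pollutants might connect to one another".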

## Data Sources
We got our data from USGS (U.S. Geological Survey). I'll need to get the exact link later.

## Looking at the data
Since I'm not too familiar with the data, let's load it in and take a look. There are some preliminary things that I do know, which will become apparent when you look at the code. For instance, you can break up the file into chunks by splitting on instances of "#\n". The # comes from the file header, and most "sections" within the header are separated by a # followed by a new line. The last item of the split is the actual data.
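
Here is a tiny illustration of that split on a made-up miniature of the file layout (the real header text differs). Note that only a bare `#` immediately followed by a newline acts as a separator; a line such as `# ` with a trailing space stays inside its chunk.

```python
# A made-up miniature of the file: "#"-prefixed header sections,
# separated by a bare "#" line, followed by tab-separated data.
sample = (
    "# File created (illustrative)\n"
    "#\n"
    "# U.S. Geological Survey\n"
    "#\n"
    "site_no\tp00010\n"
    "123\t18.5\n"
)

chunks = sample.split("#\n")
# Everything but the last chunk is header text; the last chunk is the table
header_chunks, table_text = chunks[:-1], chunks[-1]
print(len(chunks))
print(repr(table_text))
```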

Let's start by loading in everything!

## MUST PIP INSTALL
* numpy
* pandas
* matplotlib
* xlrd

In [1]:
import re
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

In [2]:
def load_data(file_name):
    data = []
    with open(file_name, 'r') as file:
        data = file.read().split("#\n")

    if not data:
        print("File {} was unable to be read.".format(file_name))
    return data

In [3]:
file_name = "LA_Water_Quality_Data.txt"
data = load_data(file_name)
print("Number of sections:", len(data))
print("2nd section:", data[2], sep = "\n")
print("Number of characters in actual data:", len(data[15]))

Number of sections: 16
2nd section:
# U.S. Geological Survey
# 
# This file contains selected water-quality data for stations in the National Water Information 
# System water-quality database.  Explanation of codes found in this file are followed by
# the retrieved data.

Number of characters in actual data: 9736982


Great! We have 16 chunks of text (indices 0 through 15) when doing that split (which feels a bit better than doing it line-by-line). Here's a breakdown of what's inside:

    #  0:                               #  8: coll_ent_cd
    #  1: File created...               #  9: medium_cd
    #  2: U.S. Geological Survey        # 10: tu_id
    #  3: The data you have...          # 11: body_part_id
    #  4: To view additional...         # 12: remark_cd
    #  5: Param_id      - parameter     # 13: Data for the following sites...
    #  6: sample_start_time_datum_cd    # 14: WARNING: some spreadsheet...
    #  7: tm_datum_rlbty_cd             # 15: Data!

I've already glanced at the file in Excel and figured out how to parse all of the parameters and their descriptions, which will probably be useful later on. 

In [4]:
def get_parameter_def(param_header):
    """Map each parameter label (lower case) to its description.

    param_header is the header chunk that lists one "# label - description" per line.
    """
    params_dict = {}
    # Raw string so that \w is read as a regex class, not a string escape
    params_pattern = re.compile(r"# +(\w+) +- +(.+)")

    for line in param_header.split("\n"):
        match = params_pattern.search(line)
        if match:
            params_dict[match.group(1).lower()] = match.group(2)

    return params_dict

In [5]:
params_dict = get_parameter_def(data[5])
print("Total number of parameters measured:", len(params_dict))

Total number of parameters measured: 1046
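
To see what the regular expression in `get_parameter_def` picks up, here is a small standalone sketch. The header lines are illustrative stand-ins in the `# label - description` shape, not copied from the file.

```python
import re

# Illustrative header lines; only lines shaped "# label - description" match
sample_header = (
    "# P00010 - Temperature, water, degrees Celsius\n"
    "# this line has no label, so it is skipped\n"
    "# P01010 - Beryllium, water, filtered, micrograms per liter\n"
)

pattern = re.compile(r"# +(\w+) +- +(.+)")
parsed = {}
for line in sample_header.split("\n"):
    match = pattern.search(line)
    if match:
        # Group 1 is the label, group 2 is everything after the " - "
        parsed[match.group(1).lower()] = match.group(2)
print(parsed)
```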


In [341]:
needed_params = ["site_no", "sample_dt", "sample_tm"]
data_params = [param for param in params_dict if param[0] == "p"]
#params_dict

In [342]:
from io import StringIO  # pd.compat.StringIO was removed in newer pandas releases
data_to_use = pd.read_csv(StringIO(data[15]), sep='\t', low_memory=False, header=0, skiprows=[1])
# I find this a bit concerning...: https://stackoverflow.com/questions/24251219/pandas-read-csv-low-memory-and-dtype-options
# I'd want to specify type, but it seems like there are a lot of strings within supposedly numerical columns...
print(data_to_use.shape)
display(data_to_use.head(2))

(7763, 1046)


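| | agency_cd | site_no | sample_dt | sample_tm | sample_end_dt | sample_end_tm | sample_start_time_datum_cd | tm_datum_rlbty_cd | coll_ent_cd | medium_cd | ... | p99856 | p99871 | p99931 | p99947 | p99958 | p99959 | p99963 | p99972 | p99994 | p99995 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | USGS | 332031118504001 | 2000-10-24 | 14:30 | | | PDT | T | USGS-WRD | WG | ... | | 0.0 | | | | | | | | |
| 1 | USGS | 333420118060501 | 2000-11-09 | 09:30 | | | PST | T | USGS-WRD | WG | ... | | 0.0 | | | | | | | | |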


In [344]:
data_to_use_numbers = data_to_use[data_params]
data_to_use_numbers.head()

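| | p00003 | p00004 | p00005 | p00008 | p00009 | p00010 | p00011 | p00020 | p00021 | p00025 | ... | p99856 | p99871 | p99931 | p99947 | p99958 | p99959 | p99963 | p99972 | p99994 | p99995 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | | | | | | | | | | | ... | | 0.0 | | | | | | | | |
| 1 | | | | | | | | | | | ... | | 0.0 | | | | | | | | |
| 2 | | | | | | 18.5 | | | | | ... | | 0.0 | | | | | | 973.0 | 90.7 | 99.8 |
| 3 | | | | | | | | | | | ... | | | | | | | | | | |
| 4 | | | | | | | | | | | ... | | | | | | | | | | |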


In [350]:
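# Strip remark characters ("<", ">", letters) and spaces so that only the number is left.
# Note: this turns a censored value like "< 0.5" into a plain 0.5.
data_to_use_numbers = data_to_use_numbers.replace("[<> A-Za-z]", "", regex=True)
# Cells that held only remark characters are now empty strings; mark them as missing
data_to_use_numbers = data_to_use_numbers.replace(r'^$', np.nan, regex=True)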

In [352]:
data_to_use_numbers = data_to_use_numbers.astype(np.float64)
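
The cleaning above throws the remark characters away, so a "less than" reading ends up looking like an ordinary measurement, and a value written in scientific notation would lose its `E`. As a hedged alternative, here is a minimal sketch that keeps the remark next to the number. The helper name `split_remark` and the sample strings are hypothetical, not taken from the data.

```python
import re

# Optional remark ("<", ">", or a letter code), then the number itself
VALUE_PATTERN = re.compile(
    r"^\s*([<>A-Za-z]*)\s*(-?\d*\.?\d+(?:[eE][-+]?\d+)?)\s*$"
)

def split_remark(raw):
    """Return (remark, value); value is None when no number can be read."""
    match = VALUE_PATTERN.match(raw)
    if not match:
        return ("", None)
    return (match.group(1), float(match.group(2)))

for raw in ["< 0.5", "18.5", "E 2.0", "1.5E-05", ""]:
    print(repr(raw), "->", split_remark(raw))
```

Keeping the remark would make it possible to decide later whether "< 0.5" should count toward a threshold at all.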

In [360]:
#total_data = pd.concat([data_to_use[needed_params], data_to_use_numbers], axis=1)

In [361]:
#total_data.head()

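| | site_no | sample_dt | sample_tm | p00003 | p00004 | p00005 | p00008 | p00009 | p00010 | p00011 | ... | p99856 | p99871 | p99931 | p99947 | p99958 | p99959 | p99963 | p99972 | p99994 | p99995 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 332031118504001 | 2000-10-24 | 14:30 | | | | | | | | ... | | 0.0 | | | | | | | | |
| 1 | 333420118060501 | 2000-11-09 | 09:30 | | | | | | | | ... | | 0.0 | | | | | | | | |
| 2 | 333420118060501 | 2006-08-30 | 08:20 | | | | | | 18.5 | | ... | | 0.0 | | | | | | 973.0 | 90.7 | 99.8 |
| 3 | 333420118060501 | 2006-08-30 | 08:30 | | | | | | | | ... | | | | | | | | | | |
| 4 | 333420118060501 | 2006-08-30 | 08:40 | | | | | | | | ... | | | | | | | | | | |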


## Splitting params into filtered and unfiltered
* unfiltered = the ground water sample is analyzed as collected, so it still carries extra sediment
* filtered = the ground water sample is passed through a filter before analysis
* do the analysis with filtered and unfiltered, but DON'T MIX
* expect higher pollutant readings for unfiltered parameters

#### Note
There are some discrepancies in these parameters. We'll deal with this later.

In [330]:
# Method 1: keep only parameters explicitly labeled "filtered" or "unfiltered".
filtered_params_1 = {}
unfiltered_params_1 = {}
for label, description in params_dict.items():
    components = description.split(", ")
    component = components[0].lower()
    if "filtered" in components:
        filtered_params_1.setdefault(component, []).append(label)
    elif "unfiltered" in components:
        unfiltered_params_1.setdefault(component, []).append(label)

# Method 2: treat every parameter that is not labeled "unfiltered" as filtered.
# This also sweeps in parameters that carry neither label, which is why
# filtered_params_2 ends up larger than filtered_params_1.
filtered_params_2 = {}
unfiltered_params_2 = {}
for label, description in params_dict.items():
    components = description.split(", ")
    component = components[0].lower()
    if "unfiltered" not in components:
        filtered_params_2.setdefault(component, []).append(label)
    elif "filtered" not in components:
        unfiltered_params_2.setdefault(component, []).append(label)

In [338]:
print("For first method:")
print(len(set(filtered_params_1).intersection(set(unfiltered_params_1))))
print(len(filtered_params_1))
print(len(unfiltered_params_1))

print("For second method:")
print(len(set(filtered_params_2).intersection(set(unfiltered_params_2))))
print(len(filtered_params_2))
print(len(unfiltered_params_2))

print(len(set(filtered_params_2).difference(filtered_params_1)))
print(len(set(unfiltered_params_2).difference(unfiltered_params_1)))
print(len(filtered_params_2) - len(filtered_params_1))  # 518 - 197

For first method:
57
197
360
For second method:
127
518
360
321
0
321
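
The gap between the two methods comes from descriptions that say neither "filtered" nor "unfiltered": the second method files those under filtered. One way to make that visible, sketched here with made-up descriptions and a hypothetical helper `classify`, is to give such parameters their own bucket.

```python
def classify(description):
    """Label a description as 'filtered', 'unfiltered', or 'unlabeled'."""
    components = [part.strip().lower() for part in description.split(",")]
    # List membership compares whole components, so "unfiltered"
    # does not count as a match for "filtered"
    if "filtered" in components:
        return "filtered"
    if "unfiltered" in components:
        return "unfiltered"
    return "unlabeled"

# Made-up descriptions in the comma-separated style of the header
examples = {
    "p_a": "Beryllium, water, filtered, micrograms per liter",
    "p_b": "Beryllium, water, unfiltered, micrograms per liter",
    "p_c": "Temperature, water, degrees Celsius",
}
labels = {code: classify(text) for code, text in examples.items()}
print(labels)
```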


In [10]:
# This is writing the params into a file
#unfiltered_file = open("Unfiltered_params.txt", "w")
#filtered_file = open("Filtered_params.txt", "w")
#
#for param in unfiltered_params.items():
#    formatSTR = param[0] + "\t" + "\t".join(param[1]) + "\r\n"
#    unfiltered_file.write(formatSTR.lower())
#unfiltered_file.close()
#for param in filtered_params.items():
#    formatSTR = param[0] + "\t" + "\t".join(param[1]) + "\r\n"
#    filtered_file.write(formatSTR.lower())
#filtered_file.close()

## Next Step: Getting the pollutants 

In [203]:
pollutant_file = pd.ExcelFile("Thresholds_hh_USGScompatible.xlsx")
convert_to_str = {name:str for name in pd.read_excel("Thresholds_hh_USGScompatible.xlsx").columns.values.tolist()}
pollutant_info = pollutant_file.parse(converters=convert_to_str)

In [204]:
pollutant_col = "Pollutant (P = priority pollutant)"
threshold_col = "Human Health for the consumption of\xa0Water + Organism (µg/L)"

# Drop rows without a pollutant name and normalize the names for matching
pollutant_info = pollutant_info[pollutant_info[pollutant_col].notna()].copy()
pollutant_info[pollutant_col] = pollutant_info[pollutant_col].str.strip().str.lower()

# Strip "<" and surrounding whitespace from the thresholds
pollutant_info[threshold_col] = pollutant_info[threshold_col].str.replace("<", "").str.strip()

# Set aside thresholds given as ranges; they cannot be compared against directly
has_ranges = pollutant_info[threshold_col].str.contains("-")
ranges = pollutant_info[has_ranges]
pollutant_info = pollutant_info[~has_ranges].copy()
pollutant_info[threshold_col] = pd.to_numeric(pollutant_info[threshold_col])
pollutant_info.head()

| | Pollutant (P = priority pollutant) | CAS Number | Human Health for the consumption of Water + Organism (µg/L) | Human Health for the consumption of Organism Only (µg/L) | Publication Year | Notes |
|---|---|---|---|---|---|---|
| 0 | acenaphthene | 83329 | 70.0 | 90.0 | 2015 | The criterion for organoleptic (taste and odor... |
| 1 | acrylonitrile | 107131 | 0.061 | 7.0 | 2015 | This criterion is based on carcinogenicity of ... |
| 2 | aldrin | 309002 | 7.7e-07 | 7.7e-07 | 2015 | This criterion is based on carcinogenicity of ... |
| 3 | alpha-hch | 319846 | 0.00036 | 0.00039 | 2015 | This criterion is based on carcinogenicity of ... |
| 4 | alpha-endosulfan | 959988 | 20.0 | 30.0 | 2015 | |
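
One thing to watch with the range check: a plain search for "-" cannot tell a range from a negative exponent if a threshold ever arrives as text in scientific notation (for example `7.7e-07`). Below is a minimal sketch of a stricter parser, assuming ranges are written as `low-high`. The function name `parse_threshold` and the sample strings are hypothetical.

```python
import re

NUMBER = r"\d*\.?\d+(?:[eE][-+]?\d+)?"
RANGE_PATTERN = re.compile(rf"^\s*<?\s*({NUMBER})\s*-\s*({NUMBER})\s*$")
SINGLE_PATTERN = re.compile(rf"^\s*<?\s*({NUMBER})\s*$")

def parse_threshold(text):
    """Return (low, high) for a range, (value, value) for one number, else None."""
    match = RANGE_PATTERN.match(text)
    if match:
        return (float(match.group(1)), float(match.group(2)))
    match = SINGLE_PATTERN.match(text)
    if match:
        value = float(match.group(1))
        return (value, value)
    return None

for text in ["6.5-9", "7.7e-07", "<0.5", "n/a"]:
    print(repr(text), "->", parse_threshold(text))
```

Returning a pair in both cases keeps the downstream comparison uniform; which end of a range to compare against is a decision this sketch leaves open.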


In [397]:
common_pollutants_p = []   # USGS parameter codes for pollutants that have a threshold
reverse_params_dict = {}   # parameter code -> pollutant name
pollutant_info["Pollutant (P = priority pollutant)"] = pollutant_info["Pollutant (P = priority pollutant)"].str.lower()
pollutant_names = pollutant_info["Pollutant (P = priority pollutant)"].dropna()
for pollutant in pollutant_names:
    if pollutant in filtered_params_2:
        for column in filtered_params_2[pollutant]:
            common_pollutants_p.append(column)
            reverse_params_dict[column] = pollutant

In [394]:
total_data = pd.concat([data_to_use[needed_params], data_to_use_numbers[common_pollutants_p]], axis=1)
#data_to_use_subset.head()

#total_data.notna().sum() # <- this is useful when looking at how many values each column has

## TEST - IT WORKS!!!! 

In [379]:
column = "p01010"
threshold = pollutant_info["Human Health for the consumption of\xa0Water + Organism (µg/L)"][pollutant_info["Pollutant (P = priority pollutant)"]==reverse_params_dict[column]]
actual_threshold = threshold.values[0]
print("Column:", reverse_params_dict[column], "\tThreshold:", actual_threshold)

Column: beryllium 	Threshold: 4.0


In [389]:
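# Compare only the rows that actually have a measurement for this column
measured = total_data[column].notna()
passed_threshold = total_data[column][measured] > actual_threshold
has_passed_site = total_data["site_no"][measured][passed_threshold]
has_passed_nums = total_data[column][measured][passed_threshold]

# Print one line per exceedance: site, pollutant, measured value
for site, value in zip(has_passed_site, has_passed_nums):
    print(site, reverse_params_dict[column], value, sep="\t")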

335943118042215	beryllium	5.0
335943118042224	beryllium	10.0
335943118042226	beryllium	10.0
335943118042228	beryllium	10.0
335943118042243	beryllium	5.0
344415118130501	beryllium	10.0
10278300	beryllium	130.0
10278300	beryllium	100.0
10278300	beryllium	50.0
10278300	beryllium	100.0
10278300	beryllium	50.0
10278300	beryllium	100.0
10278300	beryllium	50.0
10278300	beryllium	70.0


## Let's do this

In [406]:
to_write_to_file = ""
formatSTR = "{}\t{}\t{}\n"
pollutant_col = "Pollutant (P = priority pollutant)"
threshold_col = "Human Health for the consumption of\xa0Water + Organism (µg/L)"

In [408]:
for column in common_pollutants_p:
    threshold = pollutant_info[threshold_col][pollutant_info[pollutant_col]==reverse_params_dict[column]].values[0]
    passed_threshold = total_data[column][total_data[column].notna()] > threshold
    has_passed_site = total_data["site_no"][total_data[column].notna()][passed_threshold]
    has_passed_nums = total_data[column][total_data[column].notna()][passed_threshold]
    
    for i in range(len(has_passed_site)):
        to_add = formatSTR.format(has_passed_site.iloc[i], reverse_params_dict[column], has_passed_nums.iloc[i])
        to_write_to_file += to_add

In [411]:
with open("potentially_a_bipartite.tsv", "w") as file:
    file.write(to_write_to_file)
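
Because the same site can exceed the same threshold on several sampling dates, the file can contain repeated site/pollutant rows. A possible next step, sketched here as an assumption rather than a decided design, is to collapse the repeats into one weighted edge per pair before building the network. The sample rows reuse a few values from the beryllium test output.

```python
import csv
import io
from collections import Counter

# A few rows in the site<TAB>pollutant<TAB>value layout
tsv_text = (
    "335943118042215\tberyllium\t5.0\n"
    "10278300\tberyllium\t130.0\n"
    "10278300\tberyllium\t100.0\n"
)

# Weight of an edge = number of measurements above the threshold
edge_weights = Counter()
for site, pollutant, _value in csv.reader(io.StringIO(tsv_text), delimiter="\t"):
    edge_weights[(site, pollutant)] += 1
print(edge_weights)
```

With the real file, `io.StringIO(tsv_text)` would be replaced by an open file handle.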

In [412]:
len(common_pollutants_p)

56

In [413]:
pollutant_info.shape

(87, 6)

### Tangent: Looking at the amount of data per parameter 

In [15]:
# Let's see how much data we have for each parameter...
# (assumes `a` was meant to be the numeric measurement table, data_to_use_numbers)
params_counts_dict = {}
for param in data_to_use_numbers:
    count = data_to_use_numbers[param].notna().sum()
    if count > 0:
        params_counts_dict[params_dict[param]] = count
params_counts = list(params_counts_dict.values())

In [None]:
plt.hist(list(params_counts_dict.values()))

In [None]:
counts_of_counts = {}
for count in params_counts_dict.values():
    if count not in counts_of_counts:
        counts_of_counts[count] = 0
    counts_of_counts[count] += 1

param_counts = list(params_counts_dict.values())
count_counts = list(counts_of_counts.values())
bin_edges = np.logspace(np.log10(min(param_counts)), 
                        np.log10(max(param_counts)),
                        num = 10)
density, _ = np.histogram(param_counts, bins=bin_edges, density=True)

In [None]:
log_be = np.log10(bin_edges)
x = 10**((log_be[1:] + log_be[:-1])/2)

plt.loglog(x, density, marker='o', linestyle='none')

In [None]:
log_be