# Next Steps (as of November 12, 2018)
We have our network! But now we need to calculate the values.

**Audrey**
* [ ] Make sure all of the values we have are measured in micrograms/L!!!  
* [ ] Do conversions for all values that aren't micrograms/L
* [ ] Revamp code to only include the things we used  
* [ ] Calculate values like the average degree, etc.  

# Water Quality Network
For our network project, Cassie and I are planning to build a bipartite network whose two node sets are pollutants and water facilities, with a link wherever a facility's measurement of a pollutant is above the threshold. With this linking, we hope to see what the biggest pollutants are, how pollutants might connect to one another, and whether there are any pollutants that we should be worried about.

## Data Sources
We got our data from USGS (U.S. Geological Survey). I'll need to get the exact link later.

### MUST PIP INSTALL
* numpy
* pandas
* matplotlib
* networkx
* xlrd

In [1]:
import re
import numpy as np
import pandas as pd
import networkx as nx
from io import StringIO
import matplotlib.pyplot as plt
from networkx.algorithms import bipartite
%matplotlib inline

## Looking at the Quality data
Since I'm not too familiar with the data, let's load it in and take a look. There are some preliminary things that I do know, which will become apparent when you look at the code. For instance, you can break up the file into chunks by splitting on instances of "#\n". The # comes from the file header, and most "sections" within the header are separated by a # followed by a newline. The last item of the split is the actual data.

Let's start by loading in everything!

In [2]:
def load_quality_data(file_name):
    quality_data = []
    with open(file_name, 'r') as file:
        quality_data = file.read().split("#\n")

    if not quality_data:
        print("File {} was unable to be read.".format(file_name))
    return quality_data

In [3]:
file_name = "LA_Water_Quality_Data.txt"
quality_data = load_quality_data(file_name)
print("Number of sections:", len(quality_data))
print("2nd section:", quality_data[2], sep = "\n")
print("Number of characters in actual data:", len(quality_data[15]))

Number of sections: 16
2nd section:
# U.S. Geological Survey
# 
# This file contains selected water-quality data for stations in the National Water Information 
# System water-quality database.  Explanation of codes found in this file are followed by
# the retrieved data.

Number of characters in actual data: 9736982


Great! We have 16 chunks of text (indices 0 through 15) when doing that split (which feels a bit better than doing it line-by-line). Here's a breakdown of what's inside:
    #  0:                               #  8: coll_ent_cd  
    #  1: File created...               #  9: medium_cd  
    #  2: U.S. Geological Survey        # 10: tu_id  
    #  3: The data you have...          # 11: body_part_id  
    #  4: To view additional...         # 12: remark_cd  
    #  5: Param_id      - parameter     # 13: Data for the following sites...  
    #  6: sample_start_time_datum_cd    # 14: WARNING: some spreadsheet...  
    #  7: tm_datum_rlbty_cd             # 15: Data!  

I've already glanced at the file in Excel and figured out how to parse all of the parameters and their descriptions, which will probably be useful later on. 

### Parameters
There are about a thousand parameters, and most of them are in the format of a p + 5-digit number (e.g. `p62168`). That's not very descriptive, but the 6th entry of our data header has the actual descriptions for each parameter. These descriptions carry a lot of information, such as the pollutant name, filtered vs unfiltered, and units of measurement.

Later we want to extract the pollutant and whether the sample was filtered or unfiltered. For now, let's just make a dictionary to get the description for each parameter.

`get_params_def`: function
* Input is `param_header`, which is a giant string with `"\n"` characters separating each line
* Each "line" has the `p#####` parameter label followed by its description

In [4]:
# inputs the part of the header that contains the parameter label followed by its meaning
# outputs a dictionary where the key is the label (lower case) and the value is the description
def get_params_def(param_header):
    params_def_dict = {} # Key is p#####, value is description, which contains the pollutant
    params = param_header.split("\n")
    # Raw string so that "\w" is passed to the regex engine untouched
    params_pattern = re.compile(r"# +(\w+) +- +(.+)")

    for param in params:
        a = params_pattern.search(param)

        if a: 
            altered_description = re.sub("  +", ", ", 
                                         a.group(2).replace("filtered (", "filtered, ("))
            params_def_dict[a.group(1).lower()] = altered_description
            #print(a.group(1) + ":", a.group(2).split(',')[-1])

    return params_def_dict

In [5]:
params_def_dict = get_params_def(quality_data[5])
print("Total number of parameters measured:", len(params_def_dict))
ex_param_dict = list(params_def_dict.keys())[len(params_def_dict.keys())//2]
print("Example:\n\tkey   = {}\n\tvalue = {}".format(ex_param_dict, params_def_dict[ex_param_dict]))

Total number of parameters measured: 1046
Example:
	key   = p62168
	value = Fipronil sulfone, water, filtered, recoverable, micrograms per liter


We have the list of all the parameters, but we're really only interested in a few of the metadata ones and we'll need to go through the measurement ones to determine which ones we want.

Since we know the metadata parameters we want, we can put them into `needed_params`:
* the site number, `site_no`
* the date the measurement was taken, `sample_dt`
* the time the measurement was taken, `sample_tm`

Since we specifically want to go through the measurements (the p + 5-digit params), we'll collect all of them in a list named `data_params` that we'll use later to subset the data.

In [6]:
# Get only the parameters we're interested in, which are the site info, date/times of sampling,
#  and all of the measurements (the p + 5-digit params)
needed_params = ["site_no", "sample_dt", "sample_tm"]
data_params = [param for param in params_def_dict if param[0] == "p"]

### Quality Data 
Now that we have the parameters, we can grab the actual measurements from `quality_data`. The data is a tab- and newline-separated chunk of text: tabs separate the parameter values, and each line holds the measurements taken at one site at a specific time. The data also has an extra row underneath the header that doesn't seem to be of any use to us, so we can disregard it.

In [7]:
test = quality_data[15]

In [8]:
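# Temporarily mask "USGS" so its leading "U" is not stripped, drop spaces and
# the "<" / ">" qualifiers, remove a single A/E/R/M/U/V letter that directly
# follows a tab, then restore "USGS".
masked_test = test.replace("USGS", "~~~~")
stripped_test = masked_test.replace(" ", "").replace("<", "").replace(">", "")
subbed_test = re.sub("\t[AERMUV]", "\t", stripped_test).replace("~~~~", "USGS")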

In [9]:
types_per_param = {param:"float" if param[0]=="p" else "str" for param in params_def_dict}

In [10]:
data_to_use = pd.read_csv(StringIO(subbed_test), sep='\t', 
                          dtype=types_per_param, header=0, skiprows=[1])
# I find this a bit concerning...: https://stackoverflow.com/questions/24251219/pandas-read-csv-low-memory-and-dtype-options
# I'd want to specify type, but it seems like there's a lot of strings within supposedly numerical columns...
print(data_to_use.shape)
display(data_to_use.head(2))

(7763, 1046)


Unnamed: 0,agency_cd,site_no,sample_dt,sample_tm,sample_end_dt,sample_end_tm,sample_start_time_datum_cd,tm_datum_rlbty_cd,coll_ent_cd,medium_cd,...,p99856,p99871,p99931,p99947,p99958,p99959,p99963,p99972,p99994,p99995
0,USGS,332031118504001,2000-10-24,14:30,,,PDT,T,USGS-WRD,WG,...,,0.0,,,,,,,,
1,USGS,333420118060501,2000-11-09,09:30,,,PST,T,USGS-WRD,WG,...,,0.0,,,,,,,,
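
To make the cleaning chain and the `skiprows=[1]` load easier to follow, here is a minimal sketch on a toy string. The toy rows, the placeholder second row, and the names `toy`, `cleaned`, `toy_types`, and `toy_df` are made up for illustration; the real file has many more columns.

```python
import re
from io import StringIO

import pandas as pd

# Toy stand-in for the data section: a header row, the extra row beneath it,
# and one sample whose values carry a "<" qualifier and an "E" prefix.
toy = (
    "site_no\tcoll_ent_cd\tp00010\tp00613\n"
    "15s\t8s\t12s\t12s\n"
    "3320\tUSGS-WRD\t< 0.5\tE 1.2\n"
)

# Same steps as the notebook cell: mask "USGS", drop spaces and "<"/">",
# strip one A/E/R/M/U/V letter after a tab, then restore "USGS".
masked = toy.replace("USGS", "~~~~")
stripped = masked.replace(" ", "").replace("<", "").replace(">", "")
cleaned = re.sub("\t[AERMUV]", "\t", stripped).replace("~~~~", "USGS")

# Measurement columns are floats, everything else stays a string.
toy_types = {"site_no": "str", "coll_ent_cd": "str",
             "p00010": "float", "p00613": "float"}

# header=0 keeps the first line as column names; skiprows=[1] drops the extra row.
toy_df = pd.read_csv(StringIO(cleaned), sep="\t", dtype=toy_types,
                     header=0, skiprows=[1])
print(toy_df)
```

Without the temporary mask, the `U` in `USGS-WRD` would be removed along with the prefix letters, since that value directly follows a tab.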


Now that we've loaded in all the data successfully, let's get the subset of the data that only contains measurements. We'll call it `data_to_use_numbers`.

In [11]:
data_to_use_numbers = data_to_use[data_params]
data_to_use_numbers.head()

Unnamed: 0,p00003,p00004,p00005,p00008,p00009,p00010,p00011,p00020,p00021,p00025,...,p99856,p99871,p99931,p99947,p99958,p99959,p99963,p99972,p99994,p99995
0,,,,,,,,,,,...,,0.0,,,,,,,,
1,,,,,,,,,,,...,,0.0,,,,,,,,
2,,,,,,18.5,,,,,...,,0.0,,,,,,973.0,90.7,99.8
3,,,,,,,,,,,...,,,,,,,,,,
4,,,,,,,,,,,...,,,,,,,,,,


In [12]:
#string = data_to_use_numbers.to_string()

In [13]:
#string.replace("NaN", "~~~").replace("\t", "===============")

In [14]:
data_to_use.columns.values.tolist()

['agency_cd',
 'site_no',
 'sample_dt',
 'sample_tm',
 'sample_end_dt',
 'sample_end_tm',
 'sample_start_time_datum_cd',
 'tm_datum_rlbty_cd',
 'coll_ent_cd',
 'medium_cd',
 'tu_id',
 'body_part_id',
 'p00003',
 'p00004',
 'p00005',
 'p00008',
 'p00009',
 'p00010',
 'p00011',
 'p00020',
 'p00021',
 'p00025',
 'p00028',
 'p00029',
 'p00032',
 'p00058',
 'p00059',
 'p00060',
 'p00061',
 'p00065',
 'p00070',
 'p00075',
 'p00076',
 'p00080',
 'p00085',
 'p00090',
 'p00095',
 'p00098',
 'p00191',
 'p00300',
 'p00301',
 'p00310',
 'p00335',
 'p00340',
 'p00400',
 'p00403',
 'p00405',
 'p00410',
 'p00419',
 'p00440',
 'p00445',
 'p00447',
 'p00450',
 'p00452',
 'p00453',
 'p00510',
 'p00515',
 'p00520',
 'p00530',
 'p00535',
 'p00540',
 'p00550',
 'p00556',
 'p00572',
 'p00573',
 'p00600',
 'p00602',
 'p00605',
 'p00607',
 'p00608',
 'p00610',
 'p00613',
 'p00615',
 'p00618',
 'p00620',
 'p00623',
 'p00624',
 'p00625',
 'p00630',
 'p00631',
 'p00650',
 'p00660',
 'p00665',
 'p00666',
 'p00671

In [15]:
#data_to_use_numbers = data_to_use_numbers.replace("[<> A-Za-z]", "", regex=True)
#data_to_use_numbers = data_to_use_numbers.replace(" ", "", regex=True)
#data_to_use_numbers = data_to_use_numbers.replace(r'^$', np.nan, regex=True)

In [16]:
#data_to_use_numbers = data_to_use_numbers.astype(np.float64)

In [17]:
all(data_to_use_numbers.dtypes == "float64")

True

In [18]:
#total_data = pd.concat([data_to_use[needed_params], data_to_use_numbers], axis=1)

In [19]:
#total_data.head()

## Splitting params into filtered and unfiltered
* unfiltered = the groundwater sample is measured as is, so it still contains extra sediment
* filtered = the groundwater sample is filtered before it is measured
* do the analysis with filtered and with unfiltered, but DON'T MIX them
* expect higher pollution values for unfiltered parameters

#### Note
There are some discrepancies in these parameters. We'll deal with this later.

In [20]:
def filter_params_by(params_def_dict, by="filtered", opposite=False):
    filtered_params = {}
    for param_def in params_def_dict.items():
        component_def = param_def[1].split(", ")
        component = component_def[0].lower()
        
        if (opposite and by not in component_def) or (not opposite and by in component_def):
            if component not in filtered_params:
                filtered_params[component] = []
            filtered_params[component].append(param_def[0])
            
    return filtered_params

In [21]:
#said_filtered_pollutants = filter_params_by(params_def_dict, by="filtered", opposite=False)
all_filtered_params = filter_params_by(params_def_dict, by="unfiltered", opposite=True)
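
As a small illustration of what the `by="unfiltered", opposite=True` call keeps, the sketch below runs the same function on three toy descriptions. The function is restated so the example runs on its own, and the dictionary `toy_params` is written for this example rather than taken from the file.

```python
def filter_params_by(params_def_dict, by="filtered", opposite=False):
    """Group parameter labels by component name, keeping those whose
    description does (or, with opposite=True, does not) contain `by`."""
    filtered_params = {}
    for label, description in params_def_dict.items():
        component_def = description.split(", ")
        component = component_def[0].lower()
        has_marker = by in component_def
        if has_marker != opposite:
            filtered_params.setdefault(component, []).append(label)
    return filtered_params


# Toy descriptions (illustrative only)
toy_params = {
    "p01049": "Lead, water, filtered, micrograms per liter",
    "p01051": "Lead, water, unfiltered, recoverable, micrograms per liter",
    "p00010": "Temperature, water, degrees Celsius",
}

not_unfiltered = filter_params_by(toy_params, by="unfiltered", opposite=True)
said_filtered = filter_params_by(toy_params, by="filtered", opposite=False)
print(not_unfiltered)
print(said_filtered)
```

Note that "everything not marked unfiltered" also lets through parameters that carry no filter marker at all (the temperature entry here), which is one source of the discrepancies mentioned in the note above.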

In [22]:
# This is writing the params into a file
#unfiltered_file = open("Unfiltered_params.txt", "w")
#filtered_file = open("Filtered_params.txt", "w")
#
#for param in unfiltered_params.items():
#    formatSTR = param[0] + "\t" + "\t".join(param[1]) + "\r\n"
#    unfiltered_file.write(formatSTR.lower())
#unfiltered_file.close()
#for param in filtered_params.items():
#    formatSTR = param[0] + "\t" + "\t".join(param[1]) + "\r\n"
#    filtered_file.write(formatSTR.lower())
#filtered_file.close()

## Next Step: Getting the pollutants 

In [23]:
# pollutant_file = pd.ExcelFile("Thresholds_hh_USGScompatible.xlsx")
# convert_to_str = {name:str for name in pd.read_excel("Thresholds_hh_USGScompatible.xlsx").columns.values.tolist()}
# pollutant_info = pollutant_file.parse(converters=convert_to_str)

In [24]:
# pollutant_info = pollutant_info[pollutant_info["Pollutant (P = priority pollutant)"].notna()]
# pollutant_info["Pollutant (P = priority pollutant)"] = pollutant_info["Pollutant (P = priority pollutant)"].str.strip()
# pollutant_info["Pollutant (P = priority pollutant)"] = pollutant_info["Pollutant (P = priority pollutant)"].str.lower()
# #pollutant_info["Human Health for the consumption of\xa0Water + Organism (µg/L)"] = pollutant_info["Human Health for the consumption of\xa0Water + Organism (µg/L)"].astype(str)
# pollutant_info["Human Health for the consumption of\xa0Water + Organism (µg/L)"] = pollutant_info["Human Health for the consumption of\xa0Water + Organism (µg/L)"].str.replace("<", "")
# pollutant_info["Human Health for the consumption of\xa0Water + Organism (µg/L)"] = pollutant_info["Human Health for the consumption of\xa0Water + Organism (µg/L)"].str.strip()
# has_ranges = pollutant_info["Human Health for the consumption of\xa0Water + Organism (µg/L)"].str.contains("-")
# ranges = pollutant_info[has_ranges]
# pollutant_info = pollutant_info[~has_ranges]
# pollutant_info["Human Health for the consumption of\xa0Water + Organism (µg/L)"] = pd.to_numeric(pollutant_info["Human Health for the consumption of\xa0Water + Organism (µg/L)"])
# pollutant_info.head()

In [25]:
# # Where we got the columns with the same pollutant names
# common_pollutants_p = []
# reverse_params_def_dict = {}
# pollutant_info["Pollutant (P = priority pollutant)"]=pollutant_info["Pollutant (P = priority pollutant)"].str.lower()
# a = pollutant_info["Pollutant (P = priority pollutant)"]
# a = a[a.notna()]
# for pollutant in a:
#     pollutant = pollutant
#     if pollutant in all_filtered_params:
#         for column in all_filtered_params[pollutant]:
#             common_pollutants_p.append(column)
#             reverse_params_def_dict[column] = pollutant
# #reverse_params_def_dict
# #common_pollutants_p

In [26]:
# reverse_params_def_dict

In [27]:
# total_data = pd.concat([data_to_use[needed_params], data_to_use_numbers[common_pollutants_p]], axis=1)
# #data_to_use_subset.head()

# #total_data.notna().sum() # <- this is useful when looking at how many values each column has

In [28]:
# total_data.shape

In [29]:
# common_pollutants_p

### Try the CA Pollutants instead 

In [30]:
ca_pollutant_file = pd.ExcelFile("CA_thresholds_compliation_filtered.xlsx")
ca_pollutant_info = ca_pollutant_file.parse()

In [31]:
ca_pollutant_info.head()

Unnamed: 0,Name1,Organic_Inorganic,CA_Prim_MCL,CA_Prim_MCL_2,CA_Prim_MCL_unit,CA_Prim_MCL_date,CA_Sec_MCL,CA_Sec_MCL_2,CA_Sec_MCL_unit,CA_Sec_MCL_date,...,CA_BayEst_Health_unit,CA_BayEst_Health_date,NAWQC_Health_WF,NAWQC_Health_WF_2,NAWQC_Health_WF_unit,NAWQC_Health_WF_date,NAWQC_Health_F,NAWQC_Health_F_2,NAWQC_Health_F_unit,NAWQC_Health_F_date
0,Acenaphthene,Organic,,,,NaT,,,,NaT,...,,2000-05-18,70.0,,,2015-06-29,90.0,,,2015-06-29
1,Acenaphthylene,Organic,,,,NaT,,,,NaT,...,,NaT,,,,NaT,,,,NaT
2,Acetochlor,Organic,,,,NaT,,,,NaT,...,,NaT,,,,NaT,,,,NaT
3,Acetone,Organic,,,,NaT,,,,NaT,...,,NaT,,,,NaT,,,,NaT
4,Acetonitrile,Organic,,,,NaT,,,,NaT,...,,NaT,,,,NaT,,,,NaT


In [76]:
ca_pollutant_info.iloc[:, :2] = ca_pollutant_info.iloc[:, :2].replace(to_replace=np.nan, value="")
#names = set(ca_pollutant_info["Name1"].str.lower().tolist()) # gives us all the names we're working with
ca_pollutant_info["Name1"] = ca_pollutant_info["Name1"].str.lower()
names = ca_pollutant_info["Name1"]


In [64]:
all_params = set(list(all_filtered_params.keys()))

In [47]:
len(names.intersection(all_params))

143

In [65]:
temp_intersecting_pollutants = names.intersection(all_params)

In [52]:
temp_not_intersecting_pollutants = names.symmetric_difference(all_params)
with open("different_params.tsv", "w") as temp_file:
    for param in temp_not_intersecting_pollutants:
        # Record which side of the comparison each unmatched name came from
        source = "names" if param in names else "param"
        temp_file.write(param + "\t" + source + "\n")

In [48]:
len(names.symmetric_difference(all_params))

494

In [38]:
print("Length of names", len(names))
print("Length of params", len(set(list(all_filtered_params.keys()))))

Length of names 263
Length of params 517


In [40]:
with open("filtered_params.tsv", "w") as temp_file:
    for param in all_filtered_params.keys():
        temp_file.write(param + "\n")

In [93]:
count = 0
all_params = list(all_filtered_params.keys())# + list(unfiltered_params_2.keys())
rows_containing_pollutant = [] # will be used to grab the rows from ca_pollutant_info
names_of_pollutant = []        # 
common_pollutants_p = []
reverse_params_def_dict = {}
temp_supposedly_intersecting = set()
for column in all_params:
    if np.any(names.str.match(column)):
        temp_supposedly_intersecting.add(column)
        match_col = np.where(names.str.match(column))[0]
        print(column, match_col, type(match_col))
        rows_containing_pollutant += match_col.tolist()
        for i in range(len(match_col)):
            names_of_pollutant.append(column)
        for p_column in all_filtered_params[column]:
            reverse_params_def_dict[p_column] = column
            common_pollutants_p.append(p_column)
        count += 1
total_data = pd.concat([data_to_use[needed_params], data_to_use_numbers[common_pollutants_p]], axis=1)
#reverse_params_def_dict = {column:pollutant for column,pollutant in zip(common_pollutants_p,names_of_pollutant)}

nitrite [187] <class 'numpy.ndarray'>
nitrate [186] <class 'numpy.ndarray'>
phosphorus [208 209] <class 'numpy.ndarray'>
cyanide [91] <class 'numpy.ndarray'>
sodium [229] <class 'numpy.ndarray'>
chloride [66] <class 'numpy.ndarray'>
sulfate [232] <class 'numpy.ndarray'>
fluoride [148] <class 'numpy.ndarray'>
arsenic [19] <class 'numpy.ndarray'>
barium [22] <class 'numpy.ndarray'>
beryllium [33] <class 'numpy.ndarray'>
boron [40] <class 'numpy.ndarray'>
cadmium [58] <class 'numpy.ndarray'>
chromium [81 82] <class 'numpy.ndarray'>
cobalt [84] <class 'numpy.ndarray'>
copper [85] <class 'numpy.ndarray'>
iron [163] <class 'numpy.ndarray'>
lead [166] <class 'numpy.ndarray'>
manganese [169] <class 'numpy.ndarray'>
thallium [239] <class 'numpy.ndarray'>
molybdenum [182] <class 'numpy.ndarray'>
nickel [185] <class 'numpy.ndarray'>
silver [227] <class 'numpy.ndarray'>
strontium [230] <class 'numpy.ndarray'>
vanadium [260] <class 'numpy.ndarray'>
antimony [18] <class 'numpy.ndarray'>
aluminum [14

In [86]:
# temp_count = 0
# temp_names = set(ca_pollutant_info["Name1"].tolist())
# temp_all_params = set(list(all_filtered_params.keys()))
# temp_rows_containing_pollutant = [] # will be used to grab the rows from ca_pollutant_info
# temp_names_of_pollutant = []        # 
# temp_common_pollutants_p = []
# temp_reverse_params_def_dict = {}
# for column in temp_names.intersection(temp_all_params):
#     match_col = [i for i,name in enumerate(temp_names) if name == column]
#     print(column, match_col, type(match_col))
#     temp_rows_containing_pollutant += match_col
#     for i in range(len(match_col)):
#         temp_names_of_pollutant.append(column)
#     for p_column in all_filtered_params[column]:
#         temp_reverse_params_def_dict[p_column] = column
#         temp_common_pollutants_p.append(p_column)
#     temp_count += 1

trifluralin [8] <class 'list'>
2-methylnaphthalene [6] <class 'list'>
dinoseb [10] <class 'list'>
ethane [12] <class 'list'>
copper [14] <class 'list'>
chlordane (technical) [16] <class 'list'>
napropamide [17] <class 'list'>
mercury [18] <class 'list'>
alkalinity [19] <class 'list'>
fluoranthene [21] <class 'list'>
bisphenol a [22] <class 'list'>
fenamiphos [25] <class 'list'>
fluoride [26] <class 'list'>
cypermethrin [27] <class 'list'>
p,p'-ddt [33] <class 'list'>
chlorpyrifos [34] <class 'list'>
phenol [36] <class 'list'>
benfluralin [40] <class 'list'>
iodide [41] <class 'list'>
aldrin [42] <class 'list'>
chromium(vi) [43] <class 'list'>
aluminum [46] <class 'list'>
isopropylbenzene [49] <class 'list'>
cobalt [50] <class 'list'>
beryllium [52] <class 'list'>
acifluorfen [53] <class 'list'>
norflurazon [54] <class 'list'>
bromoxynil [56] <class 'list'>
warfarin [64] <class 'list'>
atrazine [57] <class 'list'>
methomyl [60] <class 'list'>
phosphorus [61] <class 'list'>
propargite [6

In [78]:
np.where(names.str.match("aldrin"))[0]

array([12])

In [79]:
temp_names = names.tolist()

In [89]:
temp_count

143

In [58]:
print(len(common_pollutants_p), len(reverse_params_def_dict))

188 188


In [59]:
len(common_pollutants_p)

188

In [60]:
#print(sorted(rows_containing_pollutant))
print(count)

142


In [90]:
temp_names.intersection(temp_all_params).difference(temp_supposedly_intersecting)

{'benzo[a]pyrene',
 'bis(2-ethylhexyl) phthalate',
 'chlordane (technical)',
 'chromium(vi)'}

In [92]:
np.where(names.str.match('benzo[a]pyrene'))

(array([], dtype=int64),)
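
The empty result above is consistent with `Series.str.match` treating its argument as a regular expression anchored at the start of the string: `[a]` is read as a character class, and a bare name like `heptachlor` also matches `heptachlor epoxide`. A minimal sketch of the difference, using a toy `toy_names` series (made up for this example) and plain equality as one possible alternative:

```python
import pandas as pd

# Toy list of pollutant names (illustrative only)
toy_names = pd.Series(
    ["benzo[a]pyrene", "heptachlor", "heptachlor epoxide", "chromium(vi)"]
)

# Regex prefix matching: special characters are interpreted, prefixes match.
regex_bracket_hits = int(toy_names.str.match("benzo[a]pyrene").sum())
regex_prefix_hits = int(toy_names.str.match("heptachlor").sum())

# Plain equality: the name is compared literally and must match in full.
exact_bracket_hits = int((toy_names == "benzo[a]pyrene").sum())
exact_prefix_hits = int((toy_names == "heptachlor").sum())

print("regex:", regex_bracket_hits, regex_prefix_hits)
print("exact:", exact_bracket_hits, exact_prefix_hits)
```

This would also explain why names such as `heptachlor` and `2,4-d` appear more than once in `names_of_pollutant`, and why the four names containing brackets or parentheses are missing from the matched set.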

In [72]:
temp_intersecting_pollutants.difference(set(names_of_pollutant))

set()

In [61]:
len(names_of_pollutant)

150

In [94]:
chosen_ca_pollutant_info = ca_pollutant_info.iloc[rows_containing_pollutant, :]
chosen_ca_pollutant_info.shape

(150, 90)

In [95]:
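# Keep only the threshold value columns: skip the first seven columns and
# anything holding units, dates, or agricultural goals.
data_cols = [
    column
    for column in chosen_ca_pollutant_info.columns.values.tolist()[7:]
    if "unit" not in column and "date" not in column and "Ag_Goals" not in column
]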
chosen_ca_pollutant_data = chosen_ca_pollutant_info[data_cols]
chosen_ca_pollutant_data.head()

Unnamed: 0,CA_Sec_MCL_2,USEPA_Prim_MCL,USEPA_Prim_MCL_2,USEPA_Sec_MCL,USEPA_Sec_MCL_2,USEPA_MCL_Goal,USEPA_MCL_Goal_2,CA_PHG,CA_PHG_2,CA_Action_Level,...,CA_Inland_Health_DW,CA_Inland_Health_DW_2,CA_Inland_Health_Other,CA_Inland_Health_Other_2,CA_BayEst_Health,CA_BayEst_Health_2,NAWQC_Health_WF,NAWQC_Health_WF_2,NAWQC_Health_F,NAWQC_Health_F_2
187,,1000.0,,,,1000.0,,1000.0,1000.0,,...,,,,,,,,,,
186,,10000.0,,,,10000.0,,10000.0,10000.0,,...,,,,,,,10000.0,,,
208,,,,,,,,,,,...,,,,,,,,,,
209,,,,,,,,,,,...,,,,,,,,,,
91,,200.0,,,,200.0,,150.0,,,...,700.0,,220000.0,,220000.0,,4.0,,400.0,


In [96]:
# For checking the min values
#min_counts = []
#row_num = []
#for row in chosen_ca_pollutant_data.iterrows():
#    min_count = 999999999999
#    nonna_count = 0
#    for item in row[1]:
#        #if type(item) != float:
#        #    item = np.nan
#        if not np.isnan(item):
#            nonna_count += 1
#            if item < min_count:
#                min_count = item
#    
#    if min_count != 999999999999:
#        min_counts.append(min_count)
#        row_num.append(row[0])
#print(min_counts)
#print(row_num)

In [97]:
x = chosen_ca_pollutant_data.iloc[19,:]
x[(x.notna()) &  (x>0)]

USEPA_Prim_MCL           15.00
CA_PHG                    0.20
CalEPA_Cancer_Potency     4.10
Prop65_Cancer             7.50
Prop65_Repro              0.25
Name: 166, dtype: float64

In [98]:
def find_threshold_minimum(x):
    x_data = x[(x.notna()) &  (x > 0)]
    if len(x_data) > 0:
        return min(x_data)
    else:
        return 999999999
    
min_thresholds = chosen_ca_pollutant_data.apply(find_threshold_minimum, axis=1)
min_thresholds_cols = chosen_ca_pollutant_data.idxmin(axis=1)
#min_thresholds_cols = min_thresholds_cols[min_thresholds_cols.notna()]
#min_thresholds = chosen_ca_pollutant_data[min_thresholds_cols]
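
One thing to watch in the cell above: `find_threshold_minimum` ignores values that are not positive, while `idxmin` looks at every value in the row, so the reported column could disagree with the reported minimum if a row contains a zero or negative entry. A minimal sketch of one way to keep the two in step, assuming non-positive entries should be ignored in both; the helper `min_positive_thresholds` and the toy frame are hypothetical, not part of the notebook:

```python
import numpy as np
import pandas as pd


def min_positive_thresholds(thresholds: pd.DataFrame) -> pd.DataFrame:
    """Return, per row, the smallest positive value and the column holding it."""
    positive = thresholds.where(thresholds > 0)    # non-positive values become NaN
    positive = positive.dropna(axis=0, how="all")  # drop rows with no usable threshold
    return pd.DataFrame({
        "Min Value": positive.min(axis=1),
        "Threshold Adhered To": positive.idxmin(axis=1),
    })


# Toy threshold table (made-up numbers and column names)
toy_thresholds = pd.DataFrame(
    {"A": [15.0, np.nan, 0.0], "B": [0.2, np.nan, 5.0]},
    index=[166, 208, 91],
)
toy_result = min_positive_thresholds(toy_thresholds)
print(toy_result)
```

Dropping the all-empty rows up front also removes the need for the `999999999` sentinel, since those rows never reach the result.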

In [99]:
for row in chosen_ca_pollutant_data.iterrows():
    x = row[1]
    x_data = x[(x.notna()) &  (x > 0)]
    if len(x_data) > 0:
        if max(x_data)/min(x_data) > 1000:
            print(chosen_ca_pollutant_info["Name1"][row[0]])
            print(min(x_data), max(x_data), )
            print("\n")

cyanide
4.0 220000.0


arsenic
0.0014 10.0


beryllium
1.0 30000.0


cadmium
0.0023 5.0


chromium(vi)
0.02 21.0


antimony
1.0 4300.0


alpha-hch
0.0039 500.0


p,p'-dde
0.00059 1.0


aldrin
8e-05 0.3


lindane
0.019 500.0


p,p'-ddd
0.00083 1.0


p,p'-ddt
0.00059 3.5


dieldrin
0.00014 0.5


toxaphene
0.00073 8.75


heptachlor
0.00021 10.0


151    heptachlor epoxide
151    heptachlor epoxide
Name: Name1, dtype: object
0.0001 10.0


151    heptachlor epoxide
151    heptachlor epoxide
Name: Name1, dtype: object
0.0001 10.0


pcbs
0.00017 50.0


2,4-dinitrophenol
10.0 14000.0


2,4-dinitrotoluene
0.05 1000.0


alachlor
0.4 700.0


acifluorfen
1.0 2000.0


phenol
2000.0 4600000.0




In [100]:
names_of_pollutant

['nitrite',
 'nitrate',
 'phosphorus',
 'phosphorus',
 'cyanide',
 'sodium',
 'chloride',
 'sulfate',
 'fluoride',
 'arsenic',
 'barium',
 'beryllium',
 'boron',
 'cadmium',
 'chromium',
 'chromium',
 'cobalt',
 'copper',
 'iron',
 'lead',
 'manganese',
 'thallium',
 'molybdenum',
 'nickel',
 'silver',
 'strontium',
 'vanadium',
 'antimony',
 'aluminum',
 'selenium',
 'propachlor',
 'hexazinone',
 'butylate',
 'bromacil',
 'terbacil',
 'diphenamid',
 'simazine',
 'prometryn',
 'prometon',
 'cyanazine',
 'fonofos',
 'radium-226',
 'alkalinity',
 'alpha-hch',
 'alpha-endosulfan',
 "p,p'-dde",
 'dicamba',
 'dicrotophos',
 'linuron',
 'bentazon',
 'dichlorvos',
 'fluometuron',
 'oxamyl',
 'chlorpyrifos',
 'aldrin',
 'lindane',
 "p,p'-ddd",
 "p,p'-ddt",
 'dieldrin',
 'endrin',
 'toxaphene',
 'heptachlor',
 'heptachlor',
 'metolachlor',
 'heptachlor epoxide',
 'pcbs',
 'malathion',
 'parathion',
 'diazinon',
 'atrazine',
 '2,4-d',
 '2,4-d',
 '2,4-d',
 '2,4-d',
 '2,4,5-t',
 'alachlor',
 'prop

In [101]:
len(min_thresholds)

150

In [102]:
len(min_thresholds_cols)

150

In [103]:
len(rows_containing_pollutant)

150

In [104]:
ca_pollutant_thresholds = pd.DataFrame({"Pollutant":names_of_pollutant, "Min Value":min_thresholds, "Threshold Adhered To":min_thresholds_cols})
ca_pollutant_thresholds = ca_pollutant_thresholds[ca_pollutant_thresholds["Threshold Adhered To"].notna()]
ca_pollutant_thresholds.head()

Unnamed: 0,Pollutant,Min Value,Threshold Adhered To
187,nitrite,700.0,USEPA_IRIS_RfD
186,nitrate,10000.0,USEPA_Prim_MCL
209,phosphorus,0.1,USEPA_HA_NonCancer
91,cyanide,4.0,NAWQC_Health_WF
229,sodium,20000.0,USEPA_HA_NonCancer


In [106]:
common_pollutants_p = []
reverse_params_def_dict = {}
for pollutant in ca_pollutant_thresholds["Pollutant"]:
    for column in all_filtered_params[pollutant]:
        reverse_params_def_dict[column] = pollutant
        common_pollutants_p.append(column)

In [107]:
ca_pollutant_thresholds.shape

(140, 3)

In [108]:
common_pollutants_p

['p00613',
 'p71856',
 'p00618',
 'p71851',
 'p00666',
 'p00721',
 'p00930',
 'p00940',
 'p00945',
 'p00950',
 'p91002',
 'p01000',
 'p01001',
 'p01003',
 'p01005',
 'p01006',
 'p01010',
 'p01020',
 'p01025',
 'p01026',
 'p01028',
 'p01030',
 'p01031',
 'p01030',
 'p01031',
 'p01040',
 'p01041',
 'p01043',
 'p01044',
 'p01046',
 'p01170',
 'p01049',
 'p01050',
 'p01052',
 'p01054',
 'p01056',
 'p01057',
 'p01060',
 'p01065',
 'p01066',
 'p01068',
 'p01075',
 'p01076',
 'p01080',
 'p01085',
 'p01095',
 'p01106',
 'p01145',
 'p01146',
 'p01148',
 'p04024',
 'p04025',
 'p04028',
 'p04029',
 'p63189',
 'p04032',
 'p82665',
 'p04033',
 'p04035',
 'p04036',
 'p04037',
 'p63226',
 'p04041',
 'p04095',
 'p09503',
 'p09511',
 'p34253',
 'p34362',
 'p39389',
 'p34653',
 'p39368',
 'p38442',
 'p38454',
 'p38478',
 'p82666',
 'p38711',
 'p38775',
 'p38811',
 'p38866',
 'p38933',
 'p63195',
 'p39333',
 'p39341',
 'p39343',
 'p39363',
 'p39373',
 'p39381',
 'p39383',
 'p39393',
 'p39403',
 'p39413',

In [109]:
# Parallel test for new pollutant data - WORKS
column = "p01010"
threshold = ca_pollutant_thresholds["Min Value"][ca_pollutant_thresholds["Pollutant"]==reverse_params_def_dict[column]]
actual_threshold = threshold.values[0]

In [110]:
# passed_threshold = gives gauges that exceeded threshold
passed_threshold = total_data[column][total_data[column].notna()] > actual_threshold
has_passed_site = total_data["site_no"][total_data[column].notna()][passed_threshold]
has_passed_nums = total_data[column][total_data[column].notna()][passed_threshold]

In [111]:
total_data.columns.values.tolist()

['site_no',
 'sample_dt',
 'sample_tm',
 'p00613',
 'p71856',
 'p00618',
 'p71851',
 'p00666',
 'p00721',
 'p00930',
 'p00940',
 'p00945',
 'p00950',
 'p91002',
 'p01000',
 'p01001',
 'p01003',
 'p01005',
 'p01006',
 'p01010',
 'p01020',
 'p01025',
 'p01026',
 'p01028',
 'p01030',
 'p01031',
 'p01035',
 'p01036',
 'p01040',
 'p01041',
 'p01043',
 'p01044',
 'p01046',
 'p01170',
 'p01049',
 'p01050',
 'p01052',
 'p01054',
 'p01056',
 'p01057',
 'p01060',
 'p01065',
 'p01066',
 'p01068',
 'p01075',
 'p01076',
 'p01080',
 'p01085',
 'p01095',
 'p01106',
 'p01145',
 'p01146',
 'p01148',
 'p04024',
 'p04025',
 'p04028',
 'p04029',
 'p63189',
 'p04032',
 'p82665',
 'p04033',
 'p04035',
 'p04036',
 'p04037',
 'p63226',
 'p04041',
 'p04095',
 'p09503',
 'p09511',
 'p29801',
 'p29802',
 'p39036',
 'p39086',
 'p34253',
 'p34362',
 'p39389',
 'p34653',
 'p39368',
 'p38442',
 'p38454',
 'p38478',
 'p82666',
 'p38711',
 'p38775',
 'p38811',
 'p38866',
 'p38933',
 'p63195',
 'p39333',
 'p39341',
 'p

In [112]:
to_write_to_file = ""
header = "Site\tPollutant\tDate\t% Error\tValue\tThreshold\tThreshold Adhered To\n"
formatSTR = "{}\t{}\t{}\t{}\t{}\t{}\t{}\n"
pollutant_col = "Pollutant"
threshold_col = "Min Value"
which_threshold_col = "Threshold Adhered To"
date_col = "sample_dt"

In [113]:
reverse_params_def_dict

{'p00613': 'nitrite',
 'p71856': 'nitrite',
 'p00618': 'nitrate',
 'p71851': 'nitrate',
 'p00666': 'phosphorus',
 'p00721': 'cyanide',
 'p00930': 'sodium',
 'p00940': 'chloride',
 'p00945': 'sulfate',
 'p00950': 'fluoride',
 'p91002': 'fluoride',
 'p01000': 'arsenic',
 'p01001': 'arsenic',
 'p01003': 'arsenic',
 'p01005': 'barium',
 'p01006': 'barium',
 'p01010': 'beryllium',
 'p01020': 'boron',
 'p01025': 'cadmium',
 'p01026': 'cadmium',
 'p01028': 'cadmium',
 'p01030': 'chromium',
 'p01031': 'chromium',
 'p01040': 'copper',
 'p01041': 'copper',
 'p01043': 'copper',
 'p01044': 'iron',
 'p01046': 'iron',
 'p01170': 'iron',
 'p01049': 'lead',
 'p01050': 'lead',
 'p01052': 'lead',
 'p01054': 'manganese',
 'p01056': 'manganese',
 'p01057': 'thallium',
 'p01060': 'molybdenum',
 'p01065': 'nickel',
 'p01066': 'nickel',
 'p01068': 'nickel',
 'p01075': 'silver',
 'p01076': 'silver',
 'p01080': 'strontium',
 'p01085': 'vanadium',
 'p01095': 'antimony',
 'p01106': 'aluminum',
 'p01145': 'seleni

In [114]:
# Go through all the p##### column names and use the pollutant associated with each
#  column to look up its threshold, then record every measurement above that threshold
#  (site, pollutant, date, relative excess, value, threshold, threshold source).
for column in common_pollutants_p:
    columns_with_pollutant_info = ca_pollutant_thresholds[pollutant_col]==reverse_params_def_dict[column]
    threshold = ca_pollutant_thresholds[threshold_col][columns_with_pollutant_info].values[0]
    passed_threshold = total_data[column][total_data[column].notna()] > threshold
    has_passed_site = total_data["site_no"][total_data[column].notna()][passed_threshold]
    has_passed_nums = total_data[column][total_data[column].notna()][passed_threshold]
    has_passed_pcts = (has_passed_nums - threshold)/threshold
    has_passed_date = total_data[date_col][total_data[column].notna()][passed_threshold]
    which_threshold = ca_pollutant_thresholds[which_threshold_col][ca_pollutant_thresholds[pollutant_col]==reverse_params_def_dict[column]].values[0]
    for i in range(len(has_passed_site)):
        to_add = formatSTR.format(has_passed_site.iloc[i], reverse_params_def_dict[column], 
                                  has_passed_date.iloc[i], has_passed_pcts.iloc[i], 
                                  has_passed_nums.iloc[i], threshold, which_threshold)
        to_write_to_file += to_add
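
For a single column, the same selection can be expressed with one boolean mask instead of repeated indexing. This is only a sketch on toy data: the frame `toy_measurements`, the threshold of `1.0`, and the column name `relative_excess` are assumptions made for the example.

```python
import pandas as pd

# Toy measurements for one parameter column (illustrative values)
toy_measurements = pd.DataFrame({
    "site_no": ["A", "B", "C"],
    "sample_dt": ["2000-10-24", "2001-03-02", "2005-07-19"],
    "p01010": [0.5, 4.0, None],
})
toy_threshold = 1.0  # hypothetical threshold, same units as the column

values = toy_measurements["p01010"]
# Missing values never count as exceedances
exceeds = values.notna() & (values > toy_threshold)

exceedances = toy_measurements.loc[exceeds, ["site_no", "sample_dt", "p01010"]].copy()
# Fraction by which the value exceeds the threshold (3.0 means 300% above it)
exceedances["relative_excess"] = (exceedances["p01010"] - toy_threshold) / toy_threshold
print(exceedances)
```

The quantity computed here matches `has_passed_pcts` in the loop: it is a fraction of the threshold, so it would need multiplying by 100 to be read as a percentage.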

In [116]:
with open("potentially_a_bipartite_US.tsv", "w") as file:
    file.write("#" + header + to_write_to_file)

In [None]:
def network_by_year():
    # TODO: not implemented yet
    pass

## PUT IT INTO NETWORKX 

In [None]:
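# Reload the file. The header line was written with a leading "#", so the first
# column is read back as "#Site"; rename it so the groupby below can find "Site".
graph_info = pd.read_csv("potentially_a_bipartite_US.tsv", sep="\t").rename(columns={"#Site": "Site"})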
display(graph_info.head())

#df.groupby(['Col1', 'Col2']).size()
graph_site_pollutant = graph_info.groupby(["Site", "Pollutant"]).size().reset_index()[["Site", "Pollutant"]]
display(graph_site_pollutant.head())

In [None]:
for decade in range(1970, 2030, 10):
    graph_info["Date"][0][:4]
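
The loop above only reads the first four characters of the first date, so it does not yet split anything by decade. Here is one possible sketch of the split, assuming every `Date` string starts with a four-digit year. The toy rows and the `Decade` column are hypothetical additions and are not part of the real file.

```python
import pandas as pd

# Toy stand-in for graph_info; the sites and dates are made up.
toy_info = pd.DataFrame({
    "Site": ["A", "B", "A", "C"],
    "Pollutant": ["nickel", "silver", "nickel", "nickel"],
    "Date": ["1974-05-01", "1988-11-23", "1989-01-15", "2003-07-09"],
})

# Assumption: every Date string starts with a four-digit year.
years = toy_info["Date"].str[:4].astype(int)
toy_info["Decade"] = (years // 10) * 10  # e.g. 1988 -> 1980

# One edge list of unique (Site, Pollutant) pairs per decade.
edges_by_decade = {
    decade: list(
        group[["Site", "Pollutant"]]
        .drop_duplicates()
        .itertuples(index=False, name=None)
    )
    for decade, group in toy_info.groupby("Decade")
}
print(edges_by_decade)
```

Each value in `edges_by_decade` can then be passed to `add_edges_from` on a fresh `nx.Graph()` to get one network per decade.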

In [None]:
edges = [tuple(row[1]) for row in graph_site_pollutant.iterrows()]

In [None]:
pollutant_graph = nx.Graph()

In [None]:
#pollutant_graph.add_nodes_from(data_to_use['site_no'], bipartite=0)
#pollutant_graph.add_nodes_from(filtered_params_2.keys(), bipartite=1)
#pollutant_graph.add_nodes_from(graph_site_pollutant["Site"], bipartite=0)
#pollutant_graph.add_nodes_from(graph_site_pollutant["Pollutant"], bipartite=1)
pollutant_graph.add_edges_from(edges)

In [None]:
sites, pollutants = bipartite.sets(pollutant_graph)
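
`bipartite.sets` can only infer the two node sets on its own when the graph is connected. On a disconnected graph it raises `AmbiguousSolution` unless `top_nodes` is given, and the component sizes checked in the next cell suggest this graph may be disconnected. Here is a minimal sketch of one way around that, using made-up edges and the `bipartite` node attribute convention from the commented-out lines above.

```python
import networkx as nx
from networkx.algorithms import bipartite

# Made-up (site, pollutant) pairs forming two disconnected components.
toy_edges = [("site1", "nickel"), ("site2", "nickel"), ("site3", "silver")]

toy_graph = nx.Graph()
# Label each side explicitly so the split does not have to be inferred.
toy_graph.add_nodes_from({s for s, _ in toy_edges}, bipartite=0)
toy_graph.add_nodes_from({p for _, p in toy_edges}, bipartite=1)
toy_graph.add_edges_from(toy_edges)

# Recover the two sides from the node attribute.
toy_sites = {n for n, d in toy_graph.nodes(data=True) if d["bipartite"] == 0}
toy_pollutants = set(toy_graph) - toy_sites

# Passing top_nodes lets bipartite.sets handle the disconnected graph too.
same_sites, same_pollutants = bipartite.sets(toy_graph, top_nodes=toy_sites)
```

Labelling the nodes when they are added keeps the site/pollutant split independent of how the graph happens to be connected.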

In [None]:
max([len(i) for i in nx.connected_components(pollutant_graph)])

In [None]:
pos = nx.spring_layout(pollutant_graph)

In [None]:
plt.figure(figsize=(12,8))

communities = [sites, pollutants]
colors = ["salmon", "lightblue"]
nx.draw_networkx_edges(pollutant_graph, pos=pos, width = 1, edge_color="darkgray")
for community, color in zip(communities, colors):
    nx.draw_networkx_nodes(pollutant_graph, pos=pos, 
                           nodelist=community,
                           node_color=color,
                           node_size=40)

    #nx.draw_networkx_labels(pollutant_graph,pos=pos)
_=plt.axis("off")

In [None]:
site_graph = bipartite.projected_graph(pollutant_graph, sites)
[len(i) for i in nx.connected_components(site_graph)]

In [None]:
pos = nx.spring_layout(site_graph, k=0.0001)

In [None]:
plt.figure(figsize=(12,8))
nx.draw_networkx_edges(site_graph, pos=pos, width = 1, edge_color="darkgray")
nx.draw_networkx_nodes(site_graph, pos=pos)
_=plt.axis("off")

In [None]:
pollute_graph = bipartite.projected_graph(pollutant_graph, pollutants)
[len(i) for i in nx.connected_components(pollute_graph)]

In [None]:
nx.write_edgelist(site_graph, "site_graph_edges.tsv", delimiter="\t", data = False)
nx.write_edgelist(pollute_graph, "pollute_graph_edges.tsv", delimiter="\t", data = False)

## TEST - IT WORKS!!!! 

column = "p01010"
threshold = pollutant_info["Human Health for the consumption of\xa0Water + Organism (µg/L)"][pollutant_info["Pollutant (P = priority pollutant)"]==reverse_params_def_dict[column]]
actual_threshold = threshold.values[0]
print("Column:", reverse_params_def_dict[column], "\tThreshold:", actual_threshold)

passed_threshold = total_data[column][total_data[column].notna()] > actual_threshold
has_passed_site = total_data["site_no"][total_data[column].notna()][passed_threshold]
has_passed_nums = total_data[column][total_data[column].notna()][passed_threshold]
#len(passed_threshold)
#len(total_data)

## Let's do this

to_write_to_file = ""
formatSTR = "{}\t{}\t{}\n"
pollutant_col = "Pollutant (P = priority pollutant)"
threshold_col = "Human Health for the consumption of\xa0Water + Organism (µg/L)"

for column in common_pollutants_p:
    threshold = pollutant_info[threshold_col][pollutant_info[pollutant_col]==reverse_params_def_dict[column]].values[0]
    passed_threshold = total_data[column][total_data[column].notna()] > threshold
    has_passed_site = total_data["site_no"][total_data[column].notna()][passed_threshold]
    has_passed_nums = total_data[column][total_data[column].notna()][passed_threshold]
    
    for i in range(len(has_passed_site)):
        to_add = formatSTR.format(has_passed_site.iloc[i], reverse_params_def_dict[column], has_passed_nums.iloc[i])
        to_write_to_file += to_add

# Use a context manager so the file is closed even if the write fails
with open("potentially_a_bipartite.tsv", "w") as file:
    file.write(to_write_to_file)

len(common_pollutants_p)

pollutant_info.shape

### Tangent: Looking at the amount of data per parameter 

In [None]:
# Let's see how much data we have for each parameter...
params_counts_dict = {}
for param in a:
    count = a[param].notna().sum()  # number of non-missing values for this parameter
    if count > 0:
        params_counts_dict[params_def_dict[param]] = count
    
    #print(params_def_dict[param] + ":", count)
params_counts = list(params_counts_dict.values())

In [None]:
plt.hist(list(params_counts_dict.values()))

In [None]:
counts_of_counts = {}
for count in params_counts_dict.values():
    if count not in counts_of_counts:
        counts_of_counts[count] = 0
    counts_of_counts[count] += 1

param_counts = list(params_counts_dict.values())
count_counts = list(counts_of_counts.values())
bin_edges = np.logspace(np.log10(min(param_counts)), 
                        np.log10(max(param_counts)),
                        num = 10)
density, _ = np.histogram(param_counts, bins=bin_edges, density=True)

In [None]:
log_be = np.log10(bin_edges)
x = 10**((log_be[1:] + log_be[:-1])/2)

plt.loglog(x, density, marker='o', linestyle='none')
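
The two cells above put the per-parameter counts into logarithmically spaced bins and plot each density at the geometric midpoint of its bin. Here is a small sketch of the same idea on made-up counts. All the `toy_` names are hypothetical and used only for illustration.

```python
import numpy as np

toy_counts = [1, 2, 2, 5, 10, 40, 100, 1000]  # made-up counts

# Bin edges evenly spaced in log10, from the smallest to the largest count.
toy_bin_edges = np.logspace(np.log10(min(toy_counts)),
                            np.log10(max(toy_counts)),
                            num=4)

# density=True normalizes so that sum(density * bin_width) == 1.
toy_density, _ = np.histogram(toy_counts, bins=toy_bin_edges, density=True)

# Geometric midpoint of each bin: sqrt(left_edge * right_edge), which is
# the same as averaging the log10 edges and exponentiating.
toy_mid = np.sqrt(toy_bin_edges[1:] * toy_bin_edges[:-1])
```

Log-spaced bins keep the sparse tail of large counts from being spread over many nearly empty bins, which is why the plot uses `plt.loglog`.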

In [None]:
log_be