## Using ChatGPT for NLP labeling of the water-year summaries dataset
#### NOTE THAT CODE DOES NOT RUN AND IS FOR RESEARCH PURPOSES.
- The code can be easily adapted to other NLP tasks, though note that a new OpenAI API key is needed. Keys are free with an account, which comes preloaded with $18.00 of requests.
### Introduction to OpenAI:
ChatGPT is a recently developed, powerful natural language processing (NLP) machine learning model. It has mainly been used for interactive, personal tasks, but it is trained on a wide variety of data and can be adapted to many complex problems. We aimed to use this technology to process the location, gauge, and remark information that the USGS records for each stream gauge site. Our goal is to filter out sites that do not meet our criteria for valid stream gauge sites. The criteria are easy to change in this approach; for this project we want sites whose data reflect natural flow without man-made changes. Disqualifying features include, but are not limited to, dams, concrete stream beds, irrigation diversions, and other impactful man-made structures. Note that this code was not used in the final testing of the model, but it found some use in regular classification.
### Pros:
- Very interpretable and easy to understand.
- Setting up is not too difficult.
- Changing criteria is trivial.
- Very powerful NLP model.
- Does not need to be trained and tuning is minimal.
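
As a rough illustration of why changing criteria is trivial, the criteria can live in a plain list that is pasted into the prompt. This is only a sketch: `build_prompt`, `DISQUALIFYING_FEATURES`, and the prompt wording are hypothetical and are not the prompt used in this project.

```python
from typing import Sequence

# Hypothetical criteria wording; the prompt actually used in this project is not recorded here.
DISQUALIFYING_FEATURES = [
    "dams or reservoirs that regulate flow",
    "concrete-lined channels or stream beds",
    "diversions for irrigation or other uses",
]


def build_prompt(site_text: str, features: Sequence[str] = DISQUALIFYING_FEATURES) -> str:
    """Assemble a yes/no classification prompt from a list of criteria."""
    bullet_list = "\n".join(f"- {feature}" for feature in features)
    return (
        "Does the stream gauge description below mention any of these "
        "man-made influences?\n"
        f"{bullet_list}\n"
        "Answer with exactly one word: yes or no.\n\n"
        f"Description: {site_text.strip()}\n"
        "Answer:"
    )


example = build_prompt("REMARKS - No regulation or diversion upstream from station.")
print(example)
```

Under this layout, changing the criteria means editing the list (or passing a different one) rather than rewriting the prompt by hand.
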
### Cons:
- Not consistent in producing correctly formatted output.
- Language used in remarks and descriptions differs from state to state, making labeling difficult.
- OpenAI charges for requests made to the API, so it is not completely free. Note that there is a free trial that includes more than enough for most single projects.
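
One way to cope with inconsistently formatted output is to normalize each completion before storing it and to flag anything that cannot be read as a clear answer. The sketch below assumes the prompt asks for a one-word yes/no answer; `normalize_label` is a hypothetical helper, not part of this notebook.

```python
import re
from typing import Optional


def normalize_label(raw: str) -> Optional[str]:
    """Return "yes" or "no" if the completion starts with one of them, else None."""
    # Lowercase and keep only alphabetic words, which drops leading
    # newlines, punctuation, and capitalization differences.
    words = re.findall(r"[a-z]+", raw.lower())
    if words and words[0] in ("yes", "no"):
        return words[0]
    return None  # Ambiguous output: leave it for manual review.


for sample in ["\n\nYes.", " No, the flow is unregulated", "The site has a dam"]:
    print(repr(sample), "->", normalize_label(sample))
```

Returning `None` instead of guessing keeps malformed responses visible, so they can be re-requested or labeled by hand.
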

### Associated Files:
- modelTesting.txt: Some call and response from given prompts, real data, and fabricated data. See file for details on testing.
- openai_test_data: File of Washington stream gauges, contains location, remarks, and gauge information.
### Conclusion:
Though it was most definitely an interesting idea, prompt engineering is much too variable for consistent use in this project. Given the different kinds of language used in the data and the ease with which GPT-3 gets confused by complicated prompts, it does not make sense to keep trying to use this model. Note that OpenAI offers other kinds of models, some of which can be trained and tuned more finely.

In [None]:
!pip install openai
# OpenAI is the creator of ChatGPT and other similar natural language processing models.


In [None]:
import openai

# Your personal OpenAI API key goes here. Accounts include up to 18 dollars of free requests; see OpenAI pricing for more details.
openai.api_key = ""

# Function for processing remarks. Sends the prompt (the gage criteria) followed by
# the site text to the model and returns the raw completion text.
def getResponse(remarks, prompt):
    response = openai.Completion.create(
        engine="text-davinci-002",
        prompt=prompt + "\n\n" + remarks,  # The original call never passed the remarks to the model.
        temperature=0.25,
        max_tokens=1024,
        n=1, best_of=1, frequency_penalty=0, presence_penalty=0, stop=None,
    )
    return response["choices"][0]["text"]


#GAGE REMARK TESTING:
#remarks = "REMARKS - Two small dams may cause slight regulation at times. Some small diversions for domestic use upstream from station. Echo Lake conduit (station 11434500) diverts from Echo Lake (station 10336608), to South Fork American River Basin. 10/01/2013-09/30/2014: Records good except for estimated daily discharges, which are poor. 10/01/2014-09/30/2015: Records good except for estimated daily discharges, which are poor. 10/01/2015-09/30/2016: Records good except for estimated daily discharges, which are poor. 10/01/2016-09/30/2017: Records fair except for estimated daily discharges, which are poor. 10/01/2017-09/30/2018: Records fair except for estimated discharges, which are poor. 10/01/2018-09/30/2019: Records fair except for estimated discharges, which are poor. 10/01/2019-09/30/2020: Records fair except for estimated discharges, which are poor. 10/01/2020-09/30/2021: Records poor."
#remarks = "REMARKS - No regulation or diversion upstream from station. See schematic diagram of San Gabriel River and Los Angeles River Basins available from the California Water Science Center."
#remarks = "REMARKS - Flow regulated since July 1931 by Big Tujunga Flood-Control Reservoir, capacity, 5,690 acre-ft, and since September 1940 by Hansen Flood-Control Reservoir, capacity, 25,450 acre-ft. Several small diversions for domestic use and irrigation. Since about 1948, Los Angeles County Department of Public Works has diverted water 0.3 mi upstream from gage to spreading grounds. See schematic diagram of San Gabriel River and Los Angeles River Basins available from the California Water Science Center."
#remarks = "REMARKS - Flow regulated by South Lake (station 10270700). Green Creek Conduit (station 10270680) diverts water into basin at South Lake. Water is used for power development downstream. See schematic diagram of Bishop Creek Basin available from the California Water Science Center."
#remarks = "REMARKS - Sample and flow data collected for the San Joaquin River Restoration Project. Instantaneous discharges are from USGS flow measurements made concurrently with samples. No data collected for the 2014, 2015 and 2016 Water Years due to flow restrictions."

#FABRICATED GAGE REMARKS:
#remarks = "REMARKS - This streambed contains no diversions."
#remarks = "REMARKS - There are no upstream dams or resiviors on this river"
#remarks = "REMARKS - There is an upstream resiviour however it does not affect anything and can be treated as non-existant"

#TESTING WITH GREATER INFORMATION
#remarks = "REMARKS - Flow regulated since July 1931 by Big Tujunga Flood-Control Reservoir, capacity, 5,690 acre-ft, and since September 1940 by Hansen Flood-Control Reservoir, capacity, 25,450 acre-ft. Several small diversions for domestic use and irrigation. Since about 1948, Los Angeles County Department of Public Works has diverted water 0.3 mi upstream from gage to spreading grounds. See schematic diagram of San Gabriel River and Los Angeles River Basins available from the California Water Science Center."
#gages = "GAGE - Water-stage recorder and concrete-lined flood-control channel. Datum of gage is 945.75 ft above NAVD of 1988. See WSP 1735 for history of changes prior to Oct. 1, 1953."
#location = "LOCATION - Referenced to North American Datum of 1927, Los Angeles County, CA, Hydrologic Unit 18070105, in Mission San Fernando Grant, in city of Los Angeles, on left bank of concrete outlet channel, 0.1 mi upstream from Glen Oaks Boulevard, 0.5 mi downstream from Hansen Dam, and 3 mi southeast of San Fernando."

In [None]:
import pandas as pd
cleanCSV = pd.read_csv("clean_csv_ water-year summary.csv")

In [None]:
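# Labeling using GPT and cleaned water year summary data.
# Note that this code is no longer functional, but can be used for reference.
import time

# Placeholder: the criteria prompt is not recorded in this notebook and must be filled in.
prompt = ""

cleanCSV["ContainsBadInfo"] = None
for i, row in cleanCSV.iterrows():
    # Treat missing fields as empty strings before joining them.
    location = "" if pd.isna(row["Location"]) else str(row["Location"])
    gage = "" if pd.isna(row["Gage"]) else str(row["Gage"])
    remarks = "" if pd.isna(row["Remarks"]) else str(row["Remarks"])
    text = (location + " " + gage + " " + remarks).strip()
    if text == "":
        # No description at all: flag the site without spending a request on it.
        cleanCSV.at[i, "ContainsBadInfo"] = "yes"
        continue
    # iterrows() yields copies of each row, so write the label back through the DataFrame itself.
    cleanCSV.at[i, "ContainsBadInfo"] = getResponse(text, prompt)
    print("Request success!")
    time.sleep(10)  # Pause between requests.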

Lat 32°48'44", long 114°30'51" referenced to North American Datum of 1927, in SE 1/4 NE 1/4 sec.35, T.15 S., R.24 E., Imperial County, CA, Hydrologic Unit 15030107, San Bernardino meridian, on right bank 1.4 mi downstream from Laguna Dam, 2.8 mi northeast of Bard, CA, and 10 mi northeast of Yuma, AZ. Water-stage recorder. Datum of gage is 123.05 ft above NAVD of 1988.  Record is rated fair.   Natural flow of the Colorado River at this point is affected by transmountain diversions, storage reservoirs, power developments, ground-water withdrawals, diversions for irrigation, municipal, and industrial uses, and return flows from irrigated areas.  Flow past station consists mainly of water released through Imperial Dam, sludge from the desilting basins at Imperial Dam, seepage through Imperial Dam, and seepage from the All-American Canal and the Gila Gravity Main Canal.
Request success!
Lat 32°43'54", long 114°37'55" referenced to North American Datum of 1927, in SW 1/4 SW 1/4 sec.26, T.16 

ValueError: Length of values (276) does not match length of index (158)
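
pandas raises a `ValueError` of this kind when a list of values is assigned as a column and its length differs from the length of the DataFrame's index. One possible way to avoid it, sketched below, is to key each label by its row index and let pandas align the result. The sketch is an assumption-laden illustration: `label_rows` and `stub_classifier` are hypothetical names, and the keyword stub stands in for the API call so the example runs offline.

```python
import pandas as pd


def label_rows(frame, classify):
    """Return a Series of labels aligned to frame's index."""
    labels = {}
    for idx, row in frame.iterrows():
        parts = [row.get(col) for col in ("Location", "Gage", "Remarks")]
        # Skip missing fields, then join whatever text remains.
        text = " ".join(str(p) for p in parts if pd.notna(p)).strip()
        # Rows with no text are flagged directly, as in the loop above.
        labels[idx] = "yes" if text == "" else classify(text)
    return pd.Series(labels, name="ContainsBadInfo")


def stub_classifier(text):
    # Placeholder for the model request: a crude keyword check, for illustration only.
    return "yes" if "dam" in text.lower() else "no"


sites = pd.DataFrame({
    "Location": ["0.5 mi downstream from Hansen Dam", None, "Left bank"],
    "Gage": ["Water-stage recorder", None, None],
    "Remarks": [None, None, "No regulation or diversion upstream from station."],
})
# Assigning a Series aligns on the index, so the lengths cannot drift apart.
sites["ContainsBadInfo"] = label_rows(sites, stub_classifier)
print(sites["ContainsBadInfo"].tolist())
```

Because every label is stored under the index of the row it came from, an interrupted or repeated run cannot leave the label list longer or shorter than the table.
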