## Cleaning Coordinate Data

This notebook will take in the name of a csv file, extract the Site Names, Latitude, and Longitude. It will then clean the data...

# IMPORTANT: Before converting to csv, replace all commas with a slash

#### Cleaning Site Names:
* Replaces any commas with a '/' so as to not add an extra element to our csv file

#### Cleaning Coordinates
* Converts any coordinates in degree format to decimal format
* Removes any special characterse
* In the event there is more than one coordinate in a latitude/longitude cell, it will randomly select one as the official coordinate per Dr. Castro's instructions

After all the data has been cleaned, it will output a new csv file formatted for use in Google Maps


### Imports

Pandas allows us to read our csv files

Re allows us to make regex functions. In this notebook, we use the regex functions to remove all special characters (apart from '.' and '-') from our coordinates

In [1]:
import pandas as pd
import re

### Extracting from csv file

This function will take in the name of a file in csv format, and put the data in the "Site" "Latitude" and "Longitude" columns into lists...

#### Cleaning the lists
* We will need to delete every row that is missing coordinate data. This can either be from the cells being empty, or holding fake coordinates as placeholders (not in the constraints of coordinate range)
    * Latitudes only range from: -90 to 90
    * Longitudes only range from: -180 to 180

## Defining Helper Functions

In [25]:
#See if the coordinates fall within the acceptable range
def is_legit_latitude(latitude):
    if -90 <= latitude <= 90:
        return true
    return false

def is_legit_longitude(longitude):
    if -180 <= longitude <= 180:
        return true
    return false


#Merging columns in the event there is more than one latitude of longitude column
def merge_two_columns(site_column, column1, column2):
    print("Merge\n")
    merged_list = []
    position_counter = 0
    
    for site in site_column:
        if column1[position_counter]:
            merged_list.append(column1[position_counter])
        elif column2[position_counter]:
            merged_list.append(column2[position_counter])
        else:
            merged_list.append(" ")
        position_counter += 1
    
    return merged_list


#Increases the length of a coordinate list in the event the list of names is longer
def extend_columns(original_column, desired_length):
    original_column.extend([None] * desired_length)
    extended_column = original_column[:desired_length]
    return extended_column


def preamble_to_merging(site_column, site_length, column1, column2):
    print("Preamble\n")
    #creates lists for the two different columns
    column1_list = df[column1].tolist()
    column2_list = df[column2].tolist()
                            
    #determine the length of each list... 
    column1_length = len(column1_list)
    column2_length = len(column2_list)
    
        
    print("Column1: ", column1_length)
    print(column1_list)
    print("\nColumn2: ", column2_length)
    print(column2_list)
    print("\n")
                            
    #If one/both the lists are shorter than the list of site names,
    #we will need to extend them to comply with formatting
    if column1_length < site_length:
        print("Preamble: First if\n")
        column1_list = extend_columns(column1_list, site_length)
    if column2_length < site_length:
        print("Preamble: Second if\n")
        column2_list = extend_columns(column2_list, site_length)
    
    #with the two lists the same size, we will merge them together
    return merge_two_columns(site_column, column1_list, column2_list)

## Defining Main Function

In [26]:
#for optional inputs, set equal to none
def extract_from_csv(site_column, latitude_column1, longitude_column1, latitude_column2 = None, longitude_column2 = None):
    #This acts as a basis for how long our coordinate lists should be
    site_column_length = len(site_column)
    
    
    #If there is a second column for latitude data, we will need to combine them before turning it into our final list
    if latitude_column2:
        print("Extract: Entering first if statement\n")
        latitude_list = preamble_to_merging(site_column, site_column_length, latitude_column1, latitude_column2)
    #If there is only one column of latitude data, we will immediately create a list
    else:
        latitude_list = df[latitude_column1].tolist()
    
    #If there is a second column for latitude data, we will need to combine them before turning it into our final list
    if longitude_column2:
        print("Extract: Entering second if statement\n")
        longitude_list = preamble_to_merging(site_column, site_column_length, longitude_column1, longitude_column2)
    #If there is only one column of longitude data, we will immediately create a list
    else:
        longitude_list = df[longitude_column2].tolist()
    
    
    position_of_valid_coordinates = []
    position_counter = 0
    
    
    for site in site_list:
        if latitude_list[position_counter] and longitude_list[position_counter]:
            position_of_valid_coordinates.append(position_counter)
        position_counter += 1
        

In [27]:
df = pd.read_csv('OldMediterranean.csv')
site_list = df['Site'].tolist()

latitude_list = extract_from_csv('site_list','Latitude1', 'Longitude1', 'Latitude2', 'Longitude2')

Extract: Entering first if statement

Preamble

Column1:  234
[' 36° 7\'51.15"N', ' 36°12\'24.72"N', ' 45°28\'21.26"N', '37°33\'33.7"N\xa0', '37°33\'34.3"N\xa0', '42°22\'05.9"N ', '36.976556', ' 39°55\'19.18"N', '43°10\'03.2"N ', '43°17\'59.9"N', '43°17\'59.0"N ', '43°17\'59.6"N ', '37°02\'28.3"N ', '37°02\'33.1"N ', ' 42°31\'32.66"N', ' 31°20\'0.16"N', '37°02\'25.3"N ', '32°33\'21.1"N ', ' 31°19\'45.70"N', '39°49\'25.4"N\xa0', ' 34°47\'18.00"N', '35°20\'49.9"N ', ' 41°29\'3.07"N', '37°52\'16.0"N ', '37°52\'16.0"N ', '43°01\'31.8"N ', ' 43° 0\'37.83"N', ' 43°43\'27.17"N', ' 43°25\'30.61"N', ' 43° 9\'53.55"N', ' 43°11\'45.62"N', ' 43°11\'44.52"N', ' 43°15\'49.17"N', ' 43° 0\'24.17"N', ' 45° 7\'35.14"N', ' 43° 8\'47.77"N', ' 39°10\'30.76"N', ' 39°18\'55.81"N', ' 41°51\'48.04"N', ' 43° 1\'54.44"N', '43°32\'32.5"N ', ' 43° 4\'4.34"N', '43°01\'04.5"N ', '43°16\'18.5"N ', '35°31\'54.9"N ', '43°02\'59.8"N ', '43°23\'56.4"N ', ' 43°25\'24.66"N', '43°02\'29.1"N ', '43°24\'34.2"N ', ' 43°40\'25.

IndexError: list index out of range

This takes our rows for latitude and longitude and turns them into a list of strings. Now we can start accessing the characters

### Code Segment Two

Now, we need to look at the list of strings and remove the special characters... I think in the event there is more than one potential coordinate listed, we can't assume the characters separating the two will be the same for every instance so it will be best to remove them and then check if there are any numbers separated by spaces at the end

OKAY, there are some instances where the coordinates are NOT in decimal format so we've got to figure out how to work with that -_-

#### Changing to Decimal Format

Alright, we're going to need to be able to isolate the coordinate in the incorrect data format, hold it's position in the spreadsheet, alter it, then place it back in position

In [None]:
latToConvert = []
longToConvert = []

latCounter = []
longCounter = []

counter = 0
for latitude in latList:
    if "N" in latitude or "S" in latitude:
        latToConvert.append(latitude)
        latCounter.append(counter)
    counter += 1
    
counter = 0
for longitude in longList:
    if "E" in longitude or "W" in longitude:
        longToConvert.append(longitude)
        longCounter.append(counter)
    counter += 1

In [None]:
print(latToConvert)
print(latCounter)

print(longToConvert)
print(longCounter)

Alright! Now we've got our list of coordinates in the wrong data format... now all we need to do is convert

In this experiment list, some of the characters separating the degrees/minutes/seconds are incorrect.. so I think it'll be best to replace special characters with spaces, change all mutlispaces to one space, and do the calculations based on that :)

In [None]:
convertedLat = []
convertedLong = []

def converter(coordinate):
    temp = []
    data = re.sub(r"[^0-9.-]+", ' ', coordinate)
    for c in data.split():
        try:
            temp.append(float(c))
        except ValueError:
            pass
    input = temp[0] + (temp[1]/60) + (temp[2]/3600)
    return input

for latitude in latToConvert:
    if "N" in latitude:
        input = converter(coordinate = latitude)
        convertedLat.append(input)
    if "S" in latitude:
        input = -1 * converter(coordinate = latitude)
        convertedLat.append(input)
print(convertedLat)

for longitude in longToConvert:
    if "E" in longitude:
        input = converter(coordinate = longitude)
        convertedLong.append(input)
    if "W" in longitude:
        input = -1 * converter(coordinate = longitude)
        convertedLong.append(input)
print(convertedLong)

Now that we habve the converted coordinate points it is time to use them to replace the main list...

In [None]:
counter = 0
for position in latCounter:
    latList[position] = str(convertedLat[counter])
    counter += 1

counter = 0
for position in longCounter:
    longList[position] = str(convertedLong[counter])
    counter += 1


#### Deleting extra characters

In [None]:
cleanLatList = []
cleanLongList = []
cleanSiteList = []

for latitude in latList:
    cleanLatList.append(re.sub(r"[^0-9.-]+", ' ', latitude))
    
for longitude in longList:
    cleanLongList.append(re.sub(r"[^0-9.-]+", ' ', longitude))

for site in siteList:
    cleanSiteList.append(site.replace(",", " /"))

The print statements below verify our converted coordinates are in the master list

In [None]:
print(cleanLongList[176])
print(cleanLongList[379])

print(cleanLatList[176])
print(cleanLatList[379])

print(cleanSiteList[176])
print(cleanSiteList[379])
print(cleanSiteList[334])

### Okey-dokey, I was hoping we could just look at the data point with 

In [None]:
for point in cleanLongList:
    if " " in point.strip():
        print(point)

### Code Segment Three

Alright, here we are going to make a brand new csv file with all our data points ready to be uploaded into a map :)

In [None]:
import csv

counter = 0
with open('EarlyModernPostClean.csv', 'w', newline='') as csvfile:
    filewriter = csv.writer(csvfile, delimiter=',',quotechar='|', quoting=csv.QUOTE_MINIMAL)
    filewriter.writerow(['Site', 'Latitude', 'Longitude'])
    for site in cleanSiteList:
        filewriter.writerow([cleanSiteList[counter], cleanLatList[counter], cleanLongList[counter]])
        counter += 1
        

In [None]:
df2 = pd.read_csv('EarlyModernPostClean.csv')

df2.head()