## Cleaning Coordinate Data

This notebook will take in the name of a csv file, extract the Site Names, Latitude, and Longitude. It will then clean the data...

# IMPORTANT: Before converting to csv, replace all commas with a slash

#### Cleaning Site Names:
* Replaces any commas with a '/' so as to not add an extra element to our csv file

#### Cleaning Coordinates
* Converts any coordinates in degree format to decimal format
* Removes any special characterse
* In the event there is more than one coordinate in a latitude/longitude cell, it will randomly select one as the official coordinate per Dr. Castro's instructions

After all the data has been cleaned, it will output a new csv file formatted for use in Google Maps


### Imports

Pandas allows us to read our csv files

Re allows us to make regex functions. In this notebook, we use the regex functions to remove all special characters (apart from '.' and '-') from our coordinates

In [1]:
import pandas as pd
import re

### Extracting from csv file

This function will take in the name of a file in csv format, and put the data in the "Site" "Latitude" and "Longitude" columns into lists...

#### Cleaning the lists
* We will need to delete every row that is missing coordinate data. This can either be from the cells being empty, or holding fake coordinates as placeholders (not in the constraints of coordinate range)
    * Latitudes only range from: -90 to 90
    * Longitudes only range from: -180 to 180

## Defining Helper Functions

In [2]:
#Merging columns in the event there is more than one latitude of longitude column
def merge_two_columns(site_column, column1, column2):
    merged_list = []
    position_counter = 0
    
    for site in site_column:
        if (str(column1[position_counter]) != 'nan'):
            merged_list.append(column1[position_counter])
        elif (str(column2[position_counter]) != 'nan'):
            merged_list.append(column2[position_counter])
        else:
            merged_list.append('nan')
        position_counter += 1
    
    return merged_list


#Increases the length of a coordinate list in the event the list of names is longer
def extend_columns(original_column, desired_length):
    original_column.extend(['nan'] * desired_length)
    extended_column = original_column[:desired_length]
    return extended_column


#Makes sure all the lists are the correct length, etc, before merging together
def preamble_to_merging(site_column, site_length, column1, column2):
    #creates lists for the two different columns
    column1_list = df[column1].tolist()
    column2_list = df[column2].tolist()
    
    #determine the length of each list... 
    column1_length = len(column1_list)
    column2_length = len(column2_list)
              
    #If one/both the lists are shorter than the list of site names,
    #we will need to extend them to comply with formatting
    if column1_length < site_length:
        column1_list = extend_columns(column1_list, site_length)
    if column2_length < site_length:
        column2_list = extend_columns(column2_list, site_length)
    
    #with the two lists the same size, we will merge them together
    return merge_two_columns(site_column, column1_list, column2_list)


#for optional inputs, set equal to none
def extract_from_csv(site_column, coordinate_column1, coordinate_column2 = None):
    #This acts as a basis for how long our coordinate lists should be
    site_column_length = len(site_column)
    
    #If there is a second column for coordinate data, we will need to combine them before turning it into our final list
    if coordinate_column2:
        coordinate_list = preamble_to_merging(site_column, site_column_length, coordinate_column1, coordinate_column2)
    #If there is only one column of latitude data, we will immediately create a list
    else:
        coordinate_list = df[coordinate_column1].tolist()
    
    return coordinate_list

## Defining Main Function

In [3]:
df = pd.read_csv('OldMediterranean.csv')
site_list = df['Site'].tolist()

latitude_list = extract_from_csv(site_list,'Latitude1','Latitude2')
longitude_list = extract_from_csv(site_list,'Longitude1','Longitude2')

## Now we see if the coordinates exist

In [4]:
#checks if one or more of the coordinates is nan
def is_nan(latitude, longitude):
    if (latitude == 'nan' or longitude == 'nan'):
        return True
    return False


In [6]:
#probably excessive, but we will be bouncing between lists
temp_site_list = []
temp_latitude_list = []
temp_longitude_list = []

counter = 0

#If nan is false, we add the site name and coordinates to a new list
for site in site_list:
    if (is_nan(latitude_list[counter], longitude_list[counter]) == False):
        temp_site_list.append(site)
        temp_latitude_list.append(latitude_list[counter])
        temp_longitude_list.append(longitude_list[counter])
    counter += 1

#give our old lists the correct data!
site_list = temp_site_list
latitude_list = temp_latitude_list
longitude_list = temp_longitude_list

This takes our rows for latitude and longitude and turns them into a list of strings. Now we can start accessing the characters

### Code Segment Two

Now, we need to look at the list of strings and remove the special characters... I think in the event there is more than one potential coordinate listed, we can't assume the characters separating the two will be the same for every instance so it will be best to remove them and then check if there are any numbers separated by spaces at the end

OKAY, there are some instances where the coordinates are NOT in decimal format so we've got to figure out how to work with that -_-

#### Changing to Decimal Format

Alright, we're going to need to be able to isolate the coordinate in the incorrect data format, hold it's position in the spreadsheet, alter it, then place it back in position

In [7]:
#Creates list of coordinates that need to be converted to decimal format
#saves the location of the coordinate in the original list
def decimal_coordinate_needs_to_be_converted(coordinate_list, key_value1, key_value2):
    #holds the coordinates we need to convert from degree to decimal
    coordinate_to_convert = []
    #holds the position of what needs to be replaced
    position_to_convert = []
    
    #the position being updated
    counter = 0
    
    #Puts the coordinates and their locations into a list
    for coordinate in coordinate_list:
        if key_value1 in coordinate or key_value2 in coordinate:
            coordinate_to_convert.append(coordinate)
            position_to_convert.append(counter)
        counter += 1
        
    return coordinate_to_convert, position_to_convert

#Converts from degree format to decimal
def converter(coordinate):
    #Holds the three different parts of the degree coordinate
    temp = []
    #gets rid of extra characters
    data = re.sub(r"[^0-9.-]+", ' ', coordinate)
    
    #Converts the parts of the coordinate to numeric value
    for c in data.split():
        try:
            temp.append(float(c))
        except ValueError:
            pass
    
    #performs conversion calculations
    input = temp[0] + (temp[1]/60) + (temp[2]/3600)
    return input

#Mulitplies by -1 if coordinate is West or South
def finalize_converted_coordinates(coordinate_list, key_value1, key_value2):
    converted_list = []
    #looks at each coordinate and sees if it should be multiplied by -1
    for coordinate in coordinate_list:
        print(coordinate)
        if key_value1 in coordinate:
            converted_value = converter(coordinate)
            converted_list.append(converted_value)
        if key_value2 in coordinate:
            converted_value = -1 * converter(coordinate)
            converted_list.append(converted_value)
        
    return converted_list

In [8]:
lat_to_convert, lat_counter =  decimal_coordinate_needs_to_be_converted(latitude_list, "N", "S")
long_to_convert, long_counter =  decimal_coordinate_needs_to_be_converted(longitude_list, "E", "W")

converted_lat = finalize_converted_coordinates(lat_to_convert, "N", "S")
converted_long = finalize_converted_coordinates(long_to_convert, "E", "W")

 36° 7'51.15"N
 36°12'24.72"N
 45°28'21.26"N
37°33'33.7"N 
37°33'34.3"N 
42°22'05.9"N 
 39°55'19.18"N
43°10'03.2"N 
43°17'59.9"N
43°17'59.0"N 
43°17'59.6"N 
37°02'28.3"N 
37°02'33.1"N 
 42°31'32.66"N
 31°20'0.16"N
37°02'25.3"N 
32°33'21.1"N 
 31°19'45.70"N
39°49'25.4"N 
 34°47'18.00"N
35°20'49.9"N 
 41°29'3.07"N
37°52'16.0"N 
37°52'16.0"N 
43°01'31.8"N 
 43° 0'37.83"N
 43°43'27.17"N
 43°25'30.61"N
 43° 9'53.55"N
 43°11'45.62"N
 43°11'44.52"N
 43°15'49.17"N
 43° 0'24.17"N
 45° 7'35.14"N
 43° 8'47.77"N
 39°10'30.76"N
 39°18'55.81"N
 41°51'48.04"N
 43° 1'54.44"N
43°32'32.5"N 
 43° 4'4.34"N
43°01'04.5"N 
43°16'18.5"N 
35°31'54.9"N 
43°02'59.8"N 
43°23'56.4"N 
 43°25'24.66"N
43°02'29.1"N 
43°24'34.2"N 
 43°40'25.89"N
 43°40'31.88"N
 43°10'58.19"N
 43°26'40.06"N
 42°20'53.85"N
 41°51'34.20"N
 45°45'55.28"N
 42°30'54.52"N
 43°31'16.93"N
43°24'49.9"N 
 41°21'60.00"N
 43°40'32.63"N
 43°17'0.69"N
 42°21'41.12"N
 41°56'37.56"N
 43° 1'17.17"N
 44°42'46.87"N
 41°19'25.64"N
 43°43'27.17"N
 43° 1'16.

IndexError: list index out of range

In [9]:
print(converted_lat)
print(converted_long)

NameError: name 'converted_lat' is not defined

Alright! Now we've got our list of coordinates in the wrong data format... now all we need to do is convert

In this experiment list, some of the characters separating the degrees/minutes/seconds are incorrect.. so I think it'll be best to replace special characters with spaces, change all mutlispaces to one space, and do the calculations based on that :)

In [10]:

converted_lat = []
converted_long = []



Now that we habve the converted coordinate points it is time to use them to replace the main list...

In [11]:
counter = 0
for position in latCounter:
    latList[position] = str(convertedLat[counter])
    counter += 1

counter = 0
for position in longCounter:
    longList[position] = str(convertedLong[counter])
    counter += 1


NameError: name 'latCounter' is not defined

#### Deleting extra characters

In [12]:
cleanLatList = []
cleanLongList = []
cleanSiteList = []

for latitude in latList:
    cleanLatList.append(re.sub(r"[^0-9.-]+", ' ', latitude))
    
for longitude in longList:
    cleanLongList.append(re.sub(r"[^0-9.-]+", ' ', longitude))

for site in siteList:
    cleanSiteList.append(site.replace(",", " /"))

NameError: name 'latList' is not defined

The print statements below verify our converted coordinates are in the master list

In [13]:
print(cleanLongList[176])
print(cleanLongList[379])

print(cleanLatList[176])
print(cleanLatList[379])

print(cleanSiteList[176])
print(cleanSiteList[379])
print(cleanSiteList[334])

IndexError: list index out of range

### Okey-dokey, I was hoping we could just look at the data point with 

In [14]:
for point in cleanLongList:
    if " " in point.strip():
        print(point)

### Code Segment Three

Alright, here we are going to make a brand new csv file with all our data points ready to be uploaded into a map :)

In [15]:
import csv

counter = 0
with open('EarlyModernPostClean.csv', 'w', newline='') as csvfile:
    filewriter = csv.writer(csvfile, delimiter=',',quotechar='|', quoting=csv.QUOTE_MINIMAL)
    filewriter.writerow(['Site', 'Latitude', 'Longitude'])
    for site in cleanSiteList:
        filewriter.writerow([cleanSiteList[counter], cleanLatList[counter], cleanLongList[counter]])
        counter += 1
        

In [16]:
df2 = pd.read_csv('EarlyModernPostClean.csv')

df2.head()

Unnamed: 0,Site,Latitude,Longitude
