# Validate and Clean Location-Based Data

This notebook illustrates the use of ITALLIC software to validate, clean and impute missing values in location-based data.   

## Load modules

In [1]:
# Import the module
from IaaGeoDataCleaning.CleaningUtils.validate_data import *
from IaaGeoDataCleaning.CleaningUtils.impute_annotation import *
from IaaGeoDataCleaning.MapTools.iaa_explore import *

## Load data (file or DataFrame)

The sample data consists of location-based plant breeding data with location information (location ID, country name, location name, region, latitude, longitude and elevation) and climate-related information (maximum temperature, minimum temperature, precipitation and environment type). This sample dataset was sub-sampled from a dataset originally obtained from CIMMYT. To illustrate functionality, the sub-sampled dataset was altered to introduce common errors we observed in location-based plant breeding data.

In [2]:
# Load data from CSV file
df=read_file('resources/PlantBreedingData.csv')
# Display the first 3 rows. 
df.head(3)

Unnamed: 0,LocationID,Country,Location,Region,Latitude,Longitude,ElevM,MAX_TEMP,MIN_TEMP,PRE,MegaEnviron
0,1,ANGOLA,MAZOZO,Eastern and Southern Africa,-9.1,13.7,41,30.0,23.16,467,Dry Lowland
1,2,ANGOLA,CABINDA,Eastern and Southern Africa,-5.6,12.2,22,31.120001,23.42,577,Dry Lowland
2,3,ANGOLA,ST. VINCENT,Eastern and Southern Africa,-5.57,12.2,57,31.18,23.5,578,Dry Lowland


## Validate data
To validate location data, four columns corresponding to latitude, longitude, location name and country name (lat_col, lng_col, loc_col and ctry_col) are required. See the sample data above. 

In [3]:
vData = ValidateData(df,lat_col='Latitude',lng_col='Longitude',loc_col='Location',ctry_col='Country')
vData.validate()

## Visualize data
The MapTool class is used to visualize location data. It uses a shapefile to validate entries or mark them as potential errors. The Python package folium is used to visualize these points. By default, a shapefile with level 3 administrative boundaries (ADM3) is used, but users have the option of using their own shapefile with higher-resolution administrative boundaries. The documentation provides detailed information on how to use this tool and other advanced features.
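
Checking a coordinate against a boundary from a shapefile is, conceptually, a point-in-polygon test. The sketch below illustrates the idea with a ray-casting test in plain Python. It is not the MapTool or ValidateData implementation: `point_in_polygon` and `toy_country` are hypothetical names, and the rectangle is a toy stand-in for a real country boundary.

```python
def point_in_polygon(lat, lng, polygon):
    """Ray-casting test. `polygon` is a list of (lat, lng) vertices."""
    inside = False
    n = len(polygon)
    for i in range(n):
        lat1, lng1 = polygon[i]
        lat2, lng2 = polygon[(i + 1) % n]
        # Does this edge straddle the line of constant latitude through the point?
        if (lat1 > lat) != (lat2 > lat):
            # Longitude at which the edge crosses that line
            cross_lng = lng1 + (lat - lat1) * (lng2 - lng1) / (lat2 - lat1)
            if lng < cross_lng:
                inside = not inside
    return inside


# Hypothetical rectangular "country" (toy data, not a real boundary)
toy_country = [(-10.0, 10.0), (-10.0, 20.0), (0.0, 20.0), (0.0, 10.0)]

print(point_in_polygon(-5.6, 12.2, toy_country))  # True: inside the rectangle
print(point_in_polygon(12.2, -5.6, toy_country))  # False: lat/lng swapped
```

A point that falls outside the polygon of its stated country would be marked as a potential error.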

In [4]:
# Initialize MapTool 
mTool = MapTool()

### Visualize all data

In [5]:
# Create a basemap of the world. create_map takes two parameters: the focal point of
# the map, given as a (latitude, longitude) pair, e.g., (1,23), and the zoom level, e.g., 4.
# See the documentation for more details.
wm = mTool.create_map(center=(1,23),zoom=4)
all_points = vData.plot_all_data(clr='purple', as_cluster=False)
mTool.add_to_map(wm, all_points)

### Visualize validated (correct) data
As the visualization above illustrates, there are errors in the data. Some of the data points are plotted in
the ocean, an indication that their latitude and longitude values are incorrect, since none of the plant breeding
stations in the dataset are in the ocean. Below we visualize the data where latitude and longitude match the country.

In [6]:
# Create a new map object
wm = mTool.create_map(center=(1,23),zoom=4)
# Visualize validated data
correct_points = vData.plot_matched_latlng_data(clr='black', as_cluster=False)
mTool.add_to_map(wm, correct_points)

### Visualize data points with potential errors
Add data points where latitude and longitude values do not match the country. 

In [7]:
point_errors = vData.plot_mismatched_latlng(clr='red', as_cluster=False)
mTool.add_to_map(wm, point_errors)

### Save map
The interactive map can be saved and viewed outside the notebook.

In [8]:
mTool.save("FiguresTables/Figure2.html")

## Correct flipped location data
Now that we have visually inspected the data and know there are errors in the latitude and longitude values, we can try to correct them. This step fixes entries where the lat/lng values have been flipped or their signs have been altered.

To fix entries with flipped lat/lng values or altered signs, the function **fix_flipped_latlng** generates all 8 possible lat/lng combinations: (lat, lng), (lat, -lng), (-lat, lng), (-lat, -lng), (lng, lat), (lng, -lat), (-lng, lat), (-lng, -lat). It then checks whether any of these combinations results in a location that matches the country name. Lat/lng pairs that match the country name are identified as possible correct values.
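
The enumeration described above can be sketched in a few lines of plain Python. This is an illustration only, not the `fix_flipped_latlng` implementation: `candidate_latlng_pairs`, `plausible_fixes` and `in_toy_country` are hypothetical names, and the rectangular "country" is a toy stand-in for a real boundary lookup.

```python
def candidate_latlng_pairs(lat, lng):
    """Return the 8 sign/order combinations of a lat/lng pair."""
    pairs = []
    for a, b in ((lat, lng), (lng, lat)):   # original order, then swapped
        for sign_a in (1, -1):
            for sign_b in (1, -1):
                pairs.append((sign_a * a, sign_b * b))
    return pairs


def plausible_fixes(lat, lng, in_country):
    """Keep candidates that are valid coordinates and fall in the country."""
    return [
        (a, b)
        for a, b in candidate_latlng_pairs(lat, lng)
        if -90 <= a <= 90 and -180 <= b <= 180 and in_country(a, b)
    ]


def in_toy_country(lat, lng):
    # Hypothetical rectangular "country" (toy data, not a real boundary)
    return -10.0 <= lat <= 0.0 and 10.0 <= lng <= 20.0


# A record stored as (12.2, 5.6): values swapped and one sign lost
print(plausible_fixes(12.2, 5.6, in_toy_country))  # [(-5.6, 12.2)]
```

When more than one candidate matches the country, some additional rule is needed to choose among them; the sketch simply returns all matches.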

In [9]:
# Fix lat/lng values that are flipped
vData.fix_flipped_latlng()
# Check to see if there are any mismatched rows
vData.get_mismatched_latlng_df().head(2)

Unnamed: 0,Coordinate_Error,Country,ElevM,Flipped_Lat,Flipped_Lng,Flipped_Type,ISO2,Latitude,Location,LocationID,Longitude,MAX_TEMP,MIN_TEMP,Matched_Country_ISO2,MegaEnviron,PRE,Region,geometry
3,Mismatched country,MALI,0,0.0,0.0,Original,ML,0.0,RUE MOHAMED,101,0.0,0.0,0.0,KI,Wet Lowland,0,Western Africa,POINT (0.00000 0.00000)
5,Mismatched country,UGANDA,1026,-10.66,35.58,Original,UG,-10.66,"LIKONDE,SONGEA",187,35.58,27.039999,18.18,KI,Wet Upper Mid-altitude,1080,Eastern and Southern Africa,POINT (35.58000 -10.66000)


## Geocode mismatched values
Geocode entries that could not be fixed by trying different combinations of latitude and longitude pairs. 

In [10]:
vData.geocode_mismatched_latlng()

## Combine datasets
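Retrieve the matched, flipped, geocoded and still-mismatched subsets, check their row counts against the original data, and then combine them into a single dataset.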

In [11]:
x1 = vData.get_matched_latlng_df()
x2 = vData.get_flipped_latlng_df()
x3 = vData.get_geocoded_latlng_df()
x4 = vData.get_mismatched_latlng_df()
print("df=",df.shape[0])
print("x1=",x1.shape[0])
print("x2=",x2.shape[0])
print("x3=",x3.shape[0])
print("x4=",x4.shape[0])

df= 229
x1= 220
x2= 4
x3= 5
x4= 0
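
A quick sanity check on the counts printed above: if the four subsets are disjoint (an assumption, since the notebook does not state it), their row counts should add up to the number of rows in the original data. The snippet below hard-codes the printed counts; the variable names are hypothetical.

```python
# Row counts copied from the output above
total_rows = 229
subset_rows = {"matched": 220, "flipped": 4, "geocoded": 5, "mismatched": 0}

# Every original row should be accounted for by exactly one subset
accounted = sum(subset_rows.values())
assert accounted == total_rows, "some rows are missing or counted twice"
print(accounted)  # 229
```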


In [12]:
vData.combine_validated_data()
combined = vData.get_combined_data_df()
print("combined=",combined.shape[0])
combined.head(5)

combined= 229


Unnamed: 0,LocationID,Country,Location,Region,Latitude,Longitude,ElevM,MAX_TEMP,MIN_TEMP,PRE,MegaEnviron,Validated_Lat,Validated_Lng,Geocode_Type
0,2,ANGOLA,CABINDA,Eastern and Southern Africa,-5.6,12.2,22,31.120001,23.42,577,Dry Lowland,-5.6,12.2,original
1,7,ANGOLA,CHIANGA,Eastern and Southern Africa,-12.73,15.83,1693,24.92,14.12,1049,Wet Upper Mid-altitude,-12.73,15.83,original
2,6,ANGOLA,HUMPATA,Eastern and Southern Africa,-15.03,13.43,1890,25.620001,13.76,619,Wet Upper Mid-altitude,-15.03,13.43,original
3,4,ANGOLA,KILOMBA,Eastern and Southern Africa,-8.9,14.7,514,28.66,19.799999,819,Wet Lower Mid-altitude,-8.9,14.7,original
4,8,ANGOLA,MALANGE,,-9.533,16.333,1149,27.879999,16.120001,720,,-9.533,16.333,original


## Impute missing values

### Visualize regions

In [13]:
# Initialize MapTool (uses the default shapefile)
mTool = MapTool()

# Create a basemap of the world
wm = mTool.create_map(center=(1,23),zoom=4)
points = mTool.plot_data_generic(data=combined, plot_col='Region', lat_col='Validated_Lat', lng_col='Validated_Lng', 
                                 as_cluster=False) # return a list of markers
mTool.add_to_map(wm, points )

### Save map
The interactive map can be saved and viewed outside the notebook.

In [14]:
mTool.save("FiguresTables/Figure3.html")

### Impute missing region values
Points in red are missing region information. Impute the missing values.
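
The `predict_Y(n_neighbors=3)` call used in this notebook suggests a nearest-neighbor approach, though the internals of `AnnotateTool` are not shown here. As a rough sketch under that assumption, the code below fills a missing label by majority vote among the 3 nearest labeled points. `knn_impute` is a hypothetical helper, and plain Euclidean distance on degrees is a simplification. The rows are taken from the combined table shown earlier.

```python
from collections import Counter
import math


def knn_impute(rows, n_neighbors=3):
    """rows: list of (lat, lng, label); label is None when missing."""
    labeled = [r for r in rows if r[2] is not None]
    result = []
    for lat, lng, label in rows:
        if label is None:
            # Sort labeled points by straight-line distance in degree space
            nearest = sorted(
                labeled, key=lambda r: math.hypot(r[0] - lat, r[1] - lng)
            )[:n_neighbors]
            # Majority vote among the nearest neighbors
            label = Counter(r[2] for r in nearest).most_common(1)[0][0]
        result.append((lat, lng, label))
    return result


# (Validated_Lat, Validated_Lng, Region) for the first rows of the combined table
rows = [
    (-5.6, 12.2, "Eastern and Southern Africa"),     # CABINDA
    (-12.73, 15.83, "Eastern and Southern Africa"),  # CHIANGA
    (-15.03, 13.43, "Eastern and Southern Africa"),  # HUMPATA
    (-8.9, 14.7, "Eastern and Southern Africa"),     # KILOMBA
    (-9.533, 16.333, None),                          # MALANGE, Region missing
]

imputed = knn_impute(rows)
print(imputed[-1])  # (-9.533, 16.333, 'Eastern and Southern Africa')
```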

In [15]:
xcolumns=['Validated_Lat','Validated_Lng']
y_column ='Region'
aTool = AnnotateTool(combined,xcolumns,y_column)
aTool.predict_Y(n_neighbors=3)

### Visually check results

In [16]:
updated_regions = aTool.get_updated_df()
updated_regions.to_csv('FiguresTables/Table1.csv') 
mTool = MapTool()
wm = mTool.create_map(center=(1,23),zoom=4)
points = mTool.plot_data_generic(data=updated_regions, plot_col='updated_Region', lat_col='Validated_Lat', lng_col='Validated_Lng', 
                                 as_cluster=False, plot_col_type='Region_type') # return a list of markers
mTool.add_to_map(wm, points )

### Save map
The interactive map can be saved and viewed outside the notebook.

In [17]:
mTool.save("FiguresTables/Figure4.html")

### Visualize MegaEnviron

In [18]:
# Initialize MapTool (uses the default shapefile)
mTool = MapTool()

# Create a basemap of the world
wm = mTool.create_map(center=(1,23),zoom=4)
points = mTool.plot_data_generic(data=updated_regions, plot_col='MegaEnviron', lat_col='Validated_Lat', lng_col='Validated_Lng', 
                                 as_cluster=False) # return a list of markers
mTool.add_to_map(wm, points )

### Impute missing MegaEnviron values
Points in red are missing MegaEnviron information. Impute the missing values.

In [19]:
xcolumns_2=['Validated_Lat','Validated_Lng']
y_column_2 ='MegaEnviron'
aTool2 = AnnotateTool(updated_regions,xcolumns_2,y_column_2)
aTool2.predict_Y(n_neighbors=3)

### Visually check results

In [20]:
updated_MegaEnviron = aTool2.get_updated_df()
mTool = MapTool()
wm = mTool.create_map(center=(1,23),zoom=4)
points = mTool.plot_data_generic(data=updated_MegaEnviron, plot_col='updated_MegaEnviron', lat_col='Validated_Lat', lng_col='Validated_Lng', 
                                 as_cluster=False, plot_col_type='MegaEnviron_type') # return a list of markers
mTool.add_to_map(wm, points )

In [21]:
updated_MegaEnviron.to_csv("FiguresTables/ValidatedPlantBreedingData.csv")
geocoded_loc = updated_MegaEnviron[updated_MegaEnviron['Geocode_Type'] == 'Geocoded'] 
geocoded_loc.head(2)

Unnamed: 0,LocationID,Country,Location,Region,Latitude,Longitude,ElevM,MAX_TEMP,MIN_TEMP,PRE,MegaEnviron,Validated_Lat,Validated_Lng,Geocode_Type,updated_Region,Region_type,updated_MegaEnviron,MegaEnviron_type
224,101,MALI,RUE MOHAMED,Western Africa,0.0,0.0,0,0.0,0.0,0,Wet Lowland,12.646425,-7.997405,Geocoded,Western Africa,original,Wet Lowland,original
225,187,UGANDA,"LIKONDE,SONGEA",Eastern and Southern Africa,-10.66,35.58,1026,27.039999,18.18,1080,Wet Upper Mid-altitude,1.645562,31.220322,Geocoded,Eastern and Southern Africa,original,Wet Upper Mid-altitude,original


In [22]:
flipped_loc = updated_MegaEnviron[updated_MegaEnviron['Geocode_Type'] == 'Flipped'] 
# Initialize MapTool (uses the default shapefile)
mTool = MapTool()

# Create a basemap of the world
wm = mTool.create_map(center=(1,23),zoom=4)
points = mTool.plot_pair_with_line(data=flipped_loc[0:2], loc_col='Location', ctry_col='Country', 
                                 lat1='Latitude', lng1='Longitude',lat2='Validated_Lat', lng2='Validated_Lng',
                                 clr1="red", clr2="green",clrLine='orange') # return a list of markers and polylines
mTool.add_to_map(wm, points )