#### Cleaning Steps - World Temperature Data

We need to join the I94 records with the World Temperature Data, but most of our ports of entry are identified by City and State, and the temperature data has no State column. Furthermore, the US has duplicated city names, so it is dangerous to assume a city is in a specific State. Fortunately, the table contains Latitude and Longitude. Unfortunately, these coordinates do not match the cities' actual locations: if we plot an arbitrary sample of them on Google Maps, none lands on the city in question; each is offset by roughly 20-50 km.

In [1]:
from datetime import datetime, timedelta
from numpy import nan
import numbers
from pyspark.sql import SparkSession, SQLContext, GroupedData
from pyspark.sql.functions import *
from pyspark.sql.types import DoubleType
import pandas as pd
from os import getcwd

from immigration_lib.aws_tools import create_spark_session

pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)
pd.set_option('display.max_rows', 500)

aws_config = f"{getcwd()}/aws_config.cfg"

spark = create_spark_session(aws_config, {}, "Cleaning World Temperature Data")

  - Latitude and Longitude appear to be stored as decimal degrees followed by a compass letter. Evidence: http://berkeleyearth.lbl.gov/locations/32.95N-117.77W
  - The numeric value is kept as-is, except that a compass letter of S or W flips the sign (a factor of -1).

In [2]:
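def parse_dms(dms):
    # Despite the name, the input is decimal degrees plus a compass letter, e.g. "32.95N".
    dm, compass = dms[:-1], dms[-1:]
    dm_number = float(dm)
    # South and West are negative in signed decimal degrees.
    if compass == 'W' or compass == 'S':
        dm_number *= -1
    return dm_number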

temperature_data = './temperature-data/GlobalLandTemperaturesByCity.csv'
parse_dms_udf = udf(parse_dms, DoubleType())
temperature_data_df = spark.read\
                            .option("header",True) \
                            .option("dateFormat", "yyyy-MM-dd") \
                            .option("inferSchema", "true") \
                            .option("nullValue", "null") \
                            .csv(temperature_data) \
                            .withColumn("Latitude", parse_dms_udf(col("Latitude"))) \
                            .withColumn("Longitude", parse_dms_udf(col("Longitude"))) \
                            .withColumnRenamed("dt","Date")

temperature_data_df.printSchema()

us_cities_temperatures = temperature_data_df.filter(col("Country").contains("United States"))
us_cities_temperatures.count()
# For reference in Pandas
# temperature_data_df = pd.read_csv(temperature_data, parse_dates=['dt'])
# temperature_data_df.rename(columns = {'dt':'Date'}, inplace = True)
# display(temperature_data_df.head())
# display(temperature_data_df[temperature_data_df.Country.str.contains("United States")].head())

root
 |-- Date: timestamp (nullable = true)
 |-- AverageTemperature: double (nullable = true)
 |-- AverageTemperatureUncertainty: double (nullable = true)
 |-- City: string (nullable = true)
 |-- Country: string (nullable = true)
 |-- Latitude: double (nullable = true)
 |-- Longitude: double (nullable = true)



687289

In [3]:
country_list = temperature_data_df.select('Country').distinct().toPandas()['Country'].values.tolist()
pr_in = 'Puerto Rico' in country_list
us_in = 'United States' in country_list
gu_in = 'Guam' in country_list
vi_in = 'Virgin Islands' in country_list
print(f'Has Puerto Rico? {pr_in}')
print(f'Has United States?  {us_in}')
print(f'Has Guam?  {gu_in}')
print(f'Has Virgin Islands?  {vi_in}')

Has Puerto Rico? True
Has United States?  True
Has Guam?  False
Has Virgin Islands?  False


In [4]:
us_cities_temperatures.filter(col("City") == "Washington").select("Latitude", "Longitude").show(5)

+--------+---------+
|Latitude|Longitude|
+--------+---------+
|   39.38|   -76.99|
|   39.38|   -76.99|
|   39.38|   -76.99|
|   39.38|   -76.99|
|   39.38|   -76.99|
+--------+---------+
only showing top 5 rows



In [5]:
more_data = temperature_data_df \
    .filter((col("Country").contains("United States")) | col("Country").contains("Puerto Rico"))

more_data.count()

694120

In [6]:
print(f'There are {more_data.select("Latitude", "Longitude", "City", "Country").distinct().count()} combinations of City, Latitude, and Longitude')

us_cities_temperatures_df = more_data.select("Latitude", "Longitude", "City", "Country").distinct()
us_cities_temperatures_df.show()

There are 260 combinations of City, Latitude, and Longitude
+--------+---------+-----------+-------------+
|Latitude|Longitude|       City|      Country|
+--------+---------+-----------+-------------+
|   45.81|   -93.46| Saint Paul|United States|
|   42.59|   -78.55|    Buffalo|United States|
|   36.17|   -75.58|    Hampton|United States|
|   45.81|   -93.46|Minneapolis|United States|
|   32.95|  -117.77|  Fullerton|United States|
|   34.56|   -118.7|Simi Valley|United States|
|   34.56|   -91.46|Little Rock|United States|
|   39.38|   -95.72|     Topeka|United States|
|   32.95|   -90.96|    Jackson|United States|
|   34.56|   -118.7|Los Angeles|United States|
|   32.95|  -117.77|     Corona|United States|
|   34.56|  -116.76|    Fontana|United States|
|   36.17|  -115.36|  Henderson|United States|
|   39.38|   -85.32| Cincinnati|United States|
|   36.17|  -115.36|   Paradise|United States|
|   37.78|  -122.03|   Berkeley|United States|
|   39.38|  -104.05|Westminster|United States|


In [7]:
us_cities = us_cities_temperatures_df.toPandas()
latlongs_df = us_cities[['Latitude', 'Longitude', 'City', 'Country']]
latlongs = latlongs_df.values.tolist()

display(latlongs_df[latlongs_df['Country'] == 'United States'].head())
display(latlongs_df[latlongs_df['Country'] == 'Puerto Rico'].head())

Unnamed: 0,Latitude,Longitude,City,Country
0,45.81,-93.46,Saint Paul,United States
1,42.59,-78.55,Buffalo,United States
2,36.17,-75.58,Hampton,United States
3,45.81,-93.46,Minneapolis,United States
4,32.95,-117.77,Fullerton,United States


Unnamed: 0,Latitude,Longitude,City,Country
55,18.48,-65.92,Ponce,Puerto Rico
138,18.48,-65.92,San Juan,Puerto Rico
210,18.48,-65.92,Carolina,Puerto Rico


To recap: we need to join the I94 records with the World Temperature Data on City and State, but the temperature data has no State, and its Latitudes and Longitudes do not match the actual city locations (and rounding is not the problem).


The Berkeley Earth page http://berkeleyearth.lbl.gov/locations/29.74N-97.85W links to Google Maps. The link is deprecated, but it still shows how the Latitude and Longitude were used. Here is an example:

> https://mapsengine.google.com/11291863457841367551-04024907758807700184-4/mapview/?lat=29.7428&lng=-97.8462&z=8

Both Latitude and Longitude were passed directly to Google Maps as the `lat` and `lng` query parameters.

If we plot an arbitrary sample of these Latitudes and Longitudes on Google Maps, none of them lands on the city in question; each is offset by roughly 20-50 km.
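
One way to quantify such an offset is the great-circle (haversine) distance between a dataset coordinate and the city's real position. The sketch below uses the Washington coordinate from the dataset (39.38, -76.99). The reference coordinate for central Washington, DC (38.9072, -77.0369) is an approximate value assumed for illustration and is not part of the dataset, and `haversine_km` is a hypothetical helper name.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points in signed decimal degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = radians(lat2 - lat1)
    dlambda = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


dataset_washington = (39.38, -76.99)      # from the temperature dataset
assumed_washington = (38.9072, -77.0369)  # assumption: approximate city centre

offset_km = haversine_km(*dataset_washington, *assumed_washington)
print(f"Offset: {offset_km:.1f} km")
```

Under these assumptions the offset comes out at roughly 50 km, in line with the range observed on the map.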

To correct for the inaccurate Latitudes and Longitudes, we pass a `components` filter to Google's Geocoding API, built from the City name and the Country we are filtering on. There are 257 US cities to look up, and we cache each response locally to avoid paying for repeated API calls.

In [8]:
import requests
import json
import os

API_KEY = os.environ['GOOGLE_GEOCODING_API_KEY']

if not os.path.isdir('./google-maps'):
    os.mkdir('./google-maps')

def process_google_results(google_payload):
    return [result["formatted_address"] for result in google_payload["results"]]
    
def process_lat_long(latlong):
    latitude = latlong['Latitude']
    longitude = latlong['Longitude']
    city = latlong['City']
    local_tmp_file = f"./google-maps/{city}.json"
    addresses = None
    if not os.path.exists(local_tmp_file):
        # Let requests URL-encode the city name and the '|' separator.
        params = {
            'components': f'locality:{city}|country:US',
            'latlng': f'{latitude},{longitude}',
            'key': API_KEY,
        }
        response = requests.get('https://maps.googleapis.com/maps/api/geocode/json', params=params)
        google_api_json = response.json()
        with open(local_tmp_file, 'w') as outfile:
            json.dump(google_api_json, outfile)
        return process_google_results(google_api_json)
    else:
        with open(local_tmp_file) as json_file:
            city_data = json.load(json_file)
            return process_google_results(city_data)

us_latlongs_df = latlongs_df[latlongs_df['Country'] == 'United States'].copy()
pr_latlongs_df = latlongs_df[latlongs_df['Country'] == 'Puerto Rico'].copy()
        
if "IsValidCoordinate" in us_latlongs_df.columns:
    us_latlongs_df = us_latlongs_df.drop('IsValidCoordinate', axis=1)

us_latlongs_df['GeocodingApiResults'] = us_latlongs_df.apply(process_lat_long, axis=1)

us_latlongs_df.head()

Unnamed: 0,Latitude,Longitude,City,Country,GeocodingApiResults
0,45.81,-93.46,Saint Paul,United States,"[St Paul, MN, USA]"
1,42.59,-78.55,Buffalo,United States,"[Buffalo, NY, USA]"
2,36.17,-75.58,Hampton,United States,"[Hampton, VA, USA]"
3,45.81,-93.46,Minneapolis,United States,"[Minneapolis, MN, USA]"
4,32.95,-117.77,Fullerton,United States,"[Fullerton, CA, USA]"


Having a result for each of the 257 entries does not mean that each result is a match. To validate each entry, we apply the following strategies in order:
1. Extract the city part of the address returned by Google.
2. Compare the city name directly to the API result.
3. Using `libpostal`, expand abbreviations such as St. to Saint, and compare the expansions with the city.
4. If that does not match, use a fuzzy comparison; a score above 90 counts as a valid result.
5. Anything else is an invalid result.
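
The cascade above can be sketched with the standard library alone. This is a simplified stand-in, not the notebook's implementation: `difflib.SequenceMatcher` replaces `fuzzywuzzy`, a tiny hypothetical abbreviation table replaces `libpostal`, and the names `ABBREVIATIONS`, `expand_abbreviations`, and `is_city_match` are assumptions made for illustration.

```python
from difflib import SequenceMatcher

# Hypothetical, deliberately tiny stand-in for libpostal's address expansion.
ABBREVIATIONS = {"st": "saint", "st.": "saint", "ft": "fort", "ft.": "fort"}


def expand_abbreviations(name):
    """Replace known abbreviations word by word, e.g. 'st paul' -> 'saint paul'."""
    return " ".join(ABBREVIATIONS.get(word, word) for word in name.split())


def is_city_match(city, formatted_address, threshold=0.9):
    """Return True if the first chunk of a formatted address names the given city."""
    city = city.lower().strip()
    # Step 1: the city is the first comma-separated chunk of the address.
    candidate = formatted_address.split(",")[0].lower().strip()
    # Step 2: exact comparison.
    if candidate == city:
        return True
    # Step 3: comparison after expanding abbreviations.
    if expand_abbreviations(candidate) == city:
        return True
    # Steps 4 and 5: fuzzy comparison (ratio is in [0, 1]); anything else is invalid.
    return SequenceMatcher(None, city, candidate).ratio() > threshold


print(is_city_match("Buffalo", "Buffalo, NY, USA"))
print(is_city_match("Saint Paul", "St Paul, MN, USA"))
print(is_city_match("Buffalo", "Hampton, VA, USA"))
```

The similarity ratio of `SequenceMatcher` is not the same metric as `fuzz.token_set_ratio`, so the 0.9 threshold here is only an analogue of the 90 score used in the notebook.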

In [9]:
from postal.expand import expand_address
from fuzzywuzzy import fuzz

def validate_coordinate(row):
    city = row['City'].lower()
    geocoding_results = [result.split(',') for result in row['GeocodingApiResults']]
    if len(geocoding_results) == 0:
        if city == 'nuevo laredo':
            return (True, ['Laredo', 'TX', 'USA'])
        if city == 'corona':
            return (True, ['Corona', 'CA', 'USA'])
        return (False, [])
    for result in geocoding_results:
        lresult = [the_str.lower() for the_str in result]
        if len(lresult)  == 0:
            return (False, result)
        if lresult[0] == city:
            return (True, result)
        possibles = expand_address(lresult[0])
        if any([city in possible for possible in possibles]):
            return (True, result)
        if (fuzz.token_set_ratio(city, lresult[0])) > 90:
            return (True, result)
    return (False, [])

us_latlongs_df['IsValidCoordinate'] = us_latlongs_df.apply(validate_coordinate, axis=1)

valid_cities = us_latlongs_df[us_latlongs_df['IsValidCoordinate'].map(lambda x: x[0])].shape[0]
invalid_cities = us_latlongs_df[us_latlongs_df['IsValidCoordinate'].map(lambda x: not x[0])]

print(f"""Out of {us_latlongs_df.shape[0]} cities with latitude and longitudes,
only {valid_cities - 2} cities were found in the Google Api Results using
the same method as Berkeley Earth (e.g., http://berkeleyearth.lbl.gov/locations/29.74N-97.85W).
""")

print(f"{invalid_cities.shape[0]} invalid cities")

Out of 257 cities with latitude and longitudes,
only 255 cities were found in the Google Api Results using
the same method as Berkeley Earth (e.g., http://berkeleyearth.lbl.gov/locations/29.74N-97.85W).

0 invalid cities


Corona is in California. Nuevo Laredo is in Mexico, but according to this [Berkeley Earth page](http://berkeleyearth.lbl.gov/locations/28.13N-99.09W) it was confused with Laredo, TX, so we mark it as such. Google returned no results for these two cases, and since we are already limiting the search to the US, we add a hard-coded clause for each.

Washington is another special case since it is not in any State. To differentiate it, we use DC as the State name.
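
As a sketch of how the State can be pulled from a geocoding result: in a formatted address such as `Buffalo, NY, USA`, the second-to-last comma-separated chunk holds the state code. The helper name `extract_state` and the strict two-letter pattern are assumptions made for illustration (the notebook's own code uses a looser pattern), and the addresses with a ZIP code and without a state chunk are hypothetical samples.

```python
import re

# Exactly two capital letters standing alone as a word, e.g. "NY" in " NY".
STATE_CODE = re.compile(r"\b[A-Z]{2}\b")


def extract_state(formatted_address):
    """Return the two-letter code in the second-to-last address chunk, or None."""
    chunks = formatted_address.split(",")
    if len(chunks) < 2:
        return None
    match = STATE_CODE.search(chunks[-2])
    return match.group(0) if match else None


print(extract_state("Buffalo, NY, USA"))
print(extract_state("Laredo, TX 78040, USA"))  # hypothetical address with a ZIP code
print(extract_state("Washington, USA"))        # hypothetical address with no state chunk
```

An address without a state chunk yields `None`, which is why a case like Washington needs its own explicit rule.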

In [10]:
import re

def fill_state_from_valid_coordinate(row):
    is_valid, address_chunks = row['IsValidCoordinate']
    if not is_valid:
        return None
    pattern = '[A-Z].*[A-Z]'
    result = re.findall(pattern, address_chunks[-2]) 
    if len(result) == 0:
        if address_chunks[-2] == "Washington":
            return "DC"
        return None
    return result[0]
    

us_latlongs_df['State'] = us_latlongs_df.apply(fill_state_from_valid_coordinate, axis=1)
pr_latlongs_df['State'] = 'PR'  # every Puerto Rico row gets the same State code
dataframes = [us_latlongs_df.drop(["GeocodingApiResults", "IsValidCoordinate"], axis=1), pr_latlongs_df ]
temperatures_final_df = pd.concat(dataframes)

temperatures_final_df.head()

display(temperatures_final_df[temperatures_final_df['Country'] == 'United States'].iloc[:2])
display(temperatures_final_df[temperatures_final_df['Country'] == 'Puerto Rico'].iloc[:2])

Unnamed: 0,Latitude,Longitude,City,Country,State
0,45.81,-93.46,Saint Paul,United States,MN
1,42.59,-78.55,Buffalo,United States,NY


Unnamed: 0,Latitude,Longitude,City,Country,State
55,18.48,-65.92,Ponce,Puerto Rico,PR
138,18.48,-65.92,San Juan,Puerto Rico,PR


The final step is to merge the Google data into the original table to produce the final representation for the staging table. We drop Latitude and Longitude since it is easier to plot by City and State.

In [11]:
from pyspark.sql.types import StringType
from pyspark.sql import Row

cities_with_states = spark.createDataFrame(temperatures_final_df)
cities_with_states.createOrReplaceTempView('cities_with_states')
us_cities_temperatures.createOrReplaceTempView('us_cities_temperatures')

us_cities_final = spark.sql("""
SELECT 
    t.Date,
    t.AverageTemperature, 
    t.AverageTemperatureUncertainty,
    t.City, 
    c.State,
    t.Country
FROM us_cities_temperatures AS t
JOIN cities_with_states AS c
    ON c.Latitude = t.Latitude
    AND c.Longitude = t.Longitude
    AND c.City = t.City
""")

us_cities_final.printSchema()

root
 |-- Date: timestamp (nullable = true)
 |-- AverageTemperature: double (nullable = true)
 |-- AverageTemperatureUncertainty: double (nullable = true)
 |-- City: string (nullable = true)
 |-- State: string (nullable = true)
 |-- Country: string (nullable = true)



In [12]:
display(us_cities_final.select("*").limit(5).toPandas())


us_cities_final.filter(isnull(col("State"))).toPandas()

Unnamed: 0,Date,AverageTemperature,AverageTemperatureUncertainty,City,State,Country
0,1820-01-01 00:07:00,3.293,3.278,Plano,TX,United States
1,1820-02-01 00:07:00,8.423,2.879,Plano,TX,United States
2,1820-03-01 00:07:00,12.046,2.347,Plano,TX,United States
3,1820-04-01 00:07:00,18.946,2.092,Plano,TX,United States
4,1820-05-01 00:07:00,22.195,1.832,Plano,TX,United States


Unnamed: 0,Date,AverageTemperature,AverageTemperatureUncertainty,City,State,Country


#### Conclusions for Cleaning World Temperature Data

1. Always validate coordinates against an external service (in this case, Google Maps).
2. Make sure the dataset can be joined with the other datasets. In this case, we needed the State, and we obtained it.


In [13]:
us_cities_final.filter(col("State") == "DC").show(5)

+-------------------+------------------+-----------------------------+----------+-----+-------------+
|               Date|AverageTemperature|AverageTemperatureUncertainty|      City|State|      Country|
+-------------------+------------------+-----------------------------+----------+-----+-------------+
|1743-11-01 00:00:00|             5.339|                        1.828|Washington|   DC|United States|
|1743-12-01 00:00:00|              null|                         null|Washington|   DC|United States|
|1744-01-01 00:00:00|              null|                         null|Washington|   DC|United States|
|1744-02-01 00:00:00|              null|                         null|Washington|   DC|United States|
|1744-03-01 00:00:00|              null|                         null|Washington|   DC|United States|
+-------------------+------------------+-----------------------------+----------+-----+-------------+
only showing top 5 rows



In [14]:
output_data = "s3a://claudiordgz-udacity-dend"
us_cities_final.write.parquet(f'{output_data}/capstone/staging_world_temperature_data', mode='append')