# Data Methodology

First, import the Python libraries and load the CSV file into memory.

In [3]:
import pandas as pd 
import re
import os
from pathlib import Path
from geopy.geocoders import OpenMapQuest
from geopy.extra.rate_limiter import RateLimiter

In [4]:
data = pd.read_csv("Swastika Data.csv")

Check the dataframe's columns and decide which to remove.

In [5]:
print(data.columns)

Index(['website', 'date of discovery or report', 'date', 'city', 'STATE',
       'source', 'reported phenomenon other than swastika',
       'reported phenomenon other than swastika 2',
       'reported phenomenon other than swastika 3', 'accompanying text',
       'accompanying visual signs', 'Nazi Reference', 'media', 'place',
       'category of place', 'structure', 'url_to_jpg', 'target', 'target.1',
       'culprit', ' ', 'notes', 'Actor 1', 'Move 1', 'Actor 1\nMove 1',
       'Actor 2', 'Move 2', 'Actor 2\nMove 2', 'Actor 3', 'Move 3',
       'Actor 3\nMove 3', 'Actor 4', 'Move 4', 'Actor 4\nMove 4',
       'Number of Moves', 'Classification of Incident',
       'reported phenomenon broad classification', 'combined moves',
       'combined actors', 'combined targets', 'Police Involvement',
       'Police Move', 'Religious Leader Involvement',
       'Letters/Statements Issued', 'Clean Up/Cover Up',
       'Clean Up/Cover Up Actor', 'suspension/denial of access to space',
       '

For my data visualization, I only need the following columns: website, date of discovery or report, city, STATE, category of place, Nazi Reference, and target. I remove the rest from the dataframe to make exploring the relevant components easier. I'll also remove any rows that lack a city.

In [6]:
data = data[['website', 'date of discovery or report', 'city', 'STATE', 'category of place', 'Nazi Reference', 'target']].copy()  # Copy so later edits do not touch a view of the original.
data = data.dropna(subset=['city'])  # Remove any rows that lack a city.
data.tail()

Unnamed: 0,website,date of discovery or report,city,STATE,category of place,Nazi Reference,target
1332,https://richmond.com/news/local/crime-and-cour...,1/12/2021,Ashland,VA,local business,Yes,
1333,https://denver.cbslocal.com/2021/01/13/littlet...,1/12/2021,Littleton,CO,local business,Yes,Asian American Community
1334,https://www.king5.com/article/news/local/black...,1/12/2021,Shoreline,WA,local business,Yes,Black American Community
1335,https://www.sanjoseinside.com/news/police-inve...,1/16/2021,Morgan Hill,CA,religious institution,Yes,Jewish Community
1336,https://www.metroweekly.com/2021/01/gay-d-c-re...,1/18/2021,Washington,DC,private property,Yes,LGBTQ


I need to convert the text in these columns into relevant datatypes, like date and latitude/longitude, so that they can be graphed and mapped. First, I start with date.

In [7]:
data = data.rename(columns={'date of discovery or report':'date'}) #Change the column name to something easier to reference.
data['date'] = pd.to_datetime(data['date'])  # Convert the strings into datetime values.
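
As a minimal sketch, the conversion can also be given an explicit format so that unexpected values are easy to spot. The sample values below are made up for illustration, and the `%m/%d/%Y` format is an assumption based on the dates shown in the table above. With `errors="coerce"`, anything that cannot be parsed becomes `NaT` instead of raising an error.

```python
import pandas as pd

# Hypothetical sample in the same month/day/year style as the dataset.
sample = pd.DataFrame({"date": ["1/12/2021", "1/16/2021", "not a date"]})

# Assumed format: month/day/year. Unparseable strings become NaT.
sample["date"] = pd.to_datetime(sample["date"], format="%m/%d/%Y", errors="coerce")

# Count the rows that failed to parse so they can be inspected by hand.
print(sample["date"].isna().sum())
```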

Now it's time to add latitude and longitude. I do this through a process called "geocoding", which relies on services provided by others. I want to submit the city and state name as a string to a third-party service so that it can be converted into latitude and longitude. I am using the OpenMapQuest API, which allows for up to 15,000 requests for free. I have removed my API key for security purposes. You can set up your own here: https://developer.mapquest.com/user/login/sign-up

In [8]:
data['location'] = data['city'] + ", " + data['STATE'] # First, we combine city and state into a single string. 

In [9]:
%%time
apikey = "put-your-key-here"
geolocator = OpenMapQuest(api_key=apikey)
geocode = RateLimiter(geolocator.geocode, min_delay_seconds=15, error_wait_seconds=30)  # This ensures I am not spamming their service.

data['place'] = data['location'].apply(geocode)  # Call the rate-limited wrapper, not geolocator.geocode directly.


KeyboardInterrupt
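
Many incidents share a city, so one way to reduce the number of requests is to geocode each distinct location once and map the results back onto every row. The sketch below assumes that approach. `fake_geocode` is a hypothetical stand-in for the rate-limited `geocode` callable, and its coordinates are rough placeholders, so the example runs without an API key or network access.

```python
import pandas as pd

calls = []  # Records every lookup so we can see how many were made.

def fake_geocode(query):
    """Hypothetical stand-in for the rate-limited geocode callable."""
    calls.append(query)
    # Placeholder coordinates for illustration only.
    known = {"Seattle, WA": (47.6, -122.3), "Madison, WI": (43.1, -89.4)}
    return known.get(query)  # None when the location is not found.

sample = pd.DataFrame(
    {"location": ["Seattle, WA", "Madison, WI", "Seattle, WA", "Nowhere, ZZ"]}
)

# Look up each distinct location once, then map results back onto every row.
lookup = {loc: fake_geocode(loc) for loc in sample["location"].unique()}
sample["place"] = sample["location"].map(lookup.get)

print(len(calls))  # 3 lookups for 4 rows
```

With a long delay between requests, skipping repeated cities can shorten the run considerably, though the saving depends on how many duplicates the data contains.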



I created a new column called "place" that stores structured geographic information about each of the cities included in the dataset. Most importantly, it contains the latitude and longitude. Here is an example of what its output looks like:

Location(Broken Arrow, Tulsa County, Oklahoma, United States of America, (36.0525993, -95.7908195, 0.0))

For convenience, I am going to save the latitude and longitude as separate columns. 

In [152]:
print(data.loc[data['place'].isna(), 'location'])  # List the locations the geocoder could not resolve.

In [144]:
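def NullLocation(location):
    """Return (latitude, longitude), or (0, 0) if geocoding returned None."""
    if location is not None:
        print(location)
        return location.latitude, location.longitude
    else:
        return 0, 0

# Unpack the (latitude, longitude) tuples into two separate columns.
data['latitude'], data['longitude'] = zip(*data['place'].apply(NullLocation))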


Broken Arrow, Tulsa County, Oklahoma, United States of America
Jacksonville, Duval County, Florida, United States of America
Spring Valley, Clark County, Nevada, NV 89113, United States of America
New Brunswick, Middlesex County, New Jersey, United States of America
Rocky Point, Suffolk County, New York, 11778, United States of America
Jackson, Ocean County, New Jersey, United States of America
Newtonville, Middlesex County, Massachusetts, 02460, United States of America
Hollywood, Assembly Way, Metropolis at Metrotown, Metrotown, Burnaby, Metro Vancouver, British Columbia, V5H 4M1, Canada
Bellingham, Whatcom County, Washington, United States of America
Seattle, King County, Washington, United States of America
Philadelphia, Philadelphia County, Pennsylvania, United States of America
New Brunswick, Middlesex County, New Jersey, United States of America
Madison, Dane County, Wisconsin, United States of America
San Luis Obispo, San Luis Obispo County, California, United States of America
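
Returning `0, 0` for failed lookups places those rows at a real point in the Gulf of Guinea, which can be misleading on a map. As an alternative sketch (not the approach used above), missing coordinates can be stored as NaN and dropped before plotting. `FakeLocation` is a hypothetical stand-in for the geocoder's result object, and the coordinates come from the example output shown earlier.

```python
from dataclasses import dataclass

import pandas as pd

@dataclass
class FakeLocation:
    """Hypothetical stand-in exposing the two attributes used above."""
    latitude: float
    longitude: float

sample = pd.DataFrame({"place": [FakeLocation(36.0525993, -95.7908195), None]})

def coordinates(location):
    """Return (latitude, longitude), or (NaN, NaN) when geocoding failed."""
    if location is None:
        return float("nan"), float("nan")
    return location.latitude, location.longitude

# Unpack the tuples into two separate columns.
sample["latitude"], sample["longitude"] = zip(*sample["place"].apply(coordinates))

# Rows without coordinates can then be excluded from the map.
mapped = sample.dropna(subset=["latitude", "longitude"])
print(len(mapped))
```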

In [115]:
data.to_csv("datawithlatlong.csv") # Save to file

# The dataset is now ready to be visualized in JavaScript!