# "Coding Question: Write a function for calculating the location entropy of a given dataset"

Based on my understanding of location entropy, the original formula calculates an entropy for each location, based on the number of users visiting that location and how many times each user visits it. However, the question presented here asks for the entropy of the entire dataset rather than of each location. Hence, I will be making the following assumptions to adapt the formula:

- Each visit to a particular location is effectively treated as a unique visitor

- The probability of a given location is the number of times it is visited divided by the total number of visits in the given dataset

Given these conditions, the formula is relatively straightforward to implement. The function I wrote accepts a list of how many times each location is visited as input and returns a single value, which is the entropy of the entire dataset.
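Written out, the two assumptions above amount to the standard Shannon entropy over the distribution of visits. The symbols here are my own notation for illustration and do not come from the original question: $c_i$ is the number of visits to location $i$, $N$ is the total number of visits, and $k$ is the number of distinct locations.

$$
p_i = \frac{c_i}{N}, \qquad N = \sum_{j=1}^{k} c_j, \qquad H = -\sum_{i=1}^{k} p_i \log_2 p_i
$$

Because the logarithm is taken in base 2, $H$ is measured in bits.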

In [2]:
import numpy as np
import pandas as pd

In [3]:
# this function accepts a list of how many times each location is visited
# (every count must be positive) and returns the Shannon entropy in bits
def entropy(locations):
    total = sum(locations)
    return -sum(location/total * np.log2(location/total) for location in locations)
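As a quick sanity check of the function, here is a minimal sketch with made-up counts (none of these numbers come from the dataset). It assumes the usual properties of Shannon entropy: equal counts give the maximum of log2(k) bits, and a single location gives zero.

```python
import numpy as np

# Same definition as in the cell above, repeated so this snippet runs on its own.
def entropy(locations):
    total = sum(locations)
    return -sum(location/total * np.log2(location/total) for location in locations)

# Four locations visited equally often: the maximum for k = 4, i.e. log2(4) = 2 bits.
uniform = entropy([5, 5, 5, 5])

# A single location receiving every visit: no uncertainty, so zero bits.
single = entropy([20])

# Skewed counts should fall strictly between the two extremes.
skewed = entropy([17, 1, 1, 1])

print(uniform, single, skewed)
```

The single-location case may display as `-0.0` rather than `0.0`, since the function negates a floating-point zero; the value is still zero.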

This dataset is the West Nile Virus mosquito dataset available on Kaggle at https://www.kaggle.com/c/predict-west-nile-virus/data

In [4]:
df = pd.read_csv("train.csv")

In [5]:
df.shape

(10506, 12)

In [6]:
df.head()

,Date,Address,Species,Block,Street,Trap,AddressNumberAndStreet,Latitude,Longitude,AddressAccuracy,NumMosquitos,WnvPresent
0,2007-05-29,"4100 North Oak Park Avenue, Chicago, IL 60634,...",CULEX PIPIENS/RESTUANS,41,N OAK PARK AVE,T002,"4100 N OAK PARK AVE, Chicago, IL",41.95469,-87.800991,9,1,0
1,2007-05-29,"4100 North Oak Park Avenue, Chicago, IL 60634,...",CULEX RESTUANS,41,N OAK PARK AVE,T002,"4100 N OAK PARK AVE, Chicago, IL",41.95469,-87.800991,9,1,0
2,2007-05-29,"6200 North Mandell Avenue, Chicago, IL 60646, USA",CULEX RESTUANS,62,N MANDELL AVE,T007,"6200 N MANDELL AVE, Chicago, IL",41.994991,-87.769279,9,1,0
3,2007-05-29,"7900 West Foster Avenue, Chicago, IL 60656, USA",CULEX PIPIENS/RESTUANS,79,W FOSTER AVE,T015,"7900 W FOSTER AVE, Chicago, IL",41.974089,-87.824812,8,1,0
4,2007-05-29,"7900 West Foster Avenue, Chicago, IL 60656, USA",CULEX RESTUANS,79,W FOSTER AVE,T015,"7900 W FOSTER AVE, Chicago, IL",41.974089,-87.824812,8,4,0


In [7]:
df.columns

Index(['Date', 'Address', 'Species', 'Block', 'Street', 'Trap',
       'AddressNumberAndStreet', 'Latitude', 'Longitude', 'AddressAccuracy',
       'NumMosquitos', 'WnvPresent'],
      dtype='object')

In [8]:
# dropping the irrelevant data
df.drop(['Date', 'Species', 'Address', 'Trap', 'AddressNumberAndStreet', 'AddressAccuracy', 'NumMosquitos', 'WnvPresent'], axis=1, inplace=True)

In [9]:
df.head()

,Block,Street,Latitude,Longitude
0,41,N OAK PARK AVE,41.95469,-87.800991
1,41,N OAK PARK AVE,41.95469,-87.800991
2,62,N MANDELL AVE,41.994991,-87.769279
3,79,W FOSTER AVE,41.974089,-87.824812
4,79,W FOSTER AVE,41.974089,-87.824812


In [10]:
# creating a combined value for the coordinates
df['Combined'] = [str(i[0]) + ' ' + str(i[1]) for i in df[['Latitude', 'Longitude']].values]

In [11]:
df.head()

,Block,Street,Latitude,Longitude,Combined
0,41,N OAK PARK AVE,41.95469,-87.800991,41.95469 -87.800991
1,41,N OAK PARK AVE,41.95469,-87.800991,41.95469 -87.800991
2,62,N MANDELL AVE,41.994991,-87.769279,41.994991 -87.769279
3,79,W FOSTER AVE,41.974089,-87.824812,41.974089 -87.824812
4,79,W FOSTER AVE,41.974089,-87.824812,41.974089 -87.824812


In [12]:
# each row in which a location appears is counted as one unique visit
locations = df.Combined.value_counts().values
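The string key is one way to identify a location. As an alternative sketch, assuming a location is fully identified by its exact (Latitude, Longitude) pair, the same counts can be obtained by grouping on the two columns directly. The small DataFrame `toy` below is made-up illustration data, not rows from train.csv.

```python
import pandas as pd

# Hypothetical toy data standing in for the Latitude/Longitude columns.
toy = pd.DataFrame({
    "Latitude":  [41.95, 41.95, 41.99, 41.97, 41.97, 41.97],
    "Longitude": [-87.80, -87.80, -87.77, -87.82, -87.82, -87.82],
})

# Count rows per distinct coordinate pair; each row is treated as one visit.
pair_counts = toy.groupby(["Latitude", "Longitude"]).size()

# The same counts via a combined string key, as done in this notebook.
combined = toy["Latitude"].astype(str) + " " + toy["Longitude"].astype(str)
key_counts = combined.value_counts()

print(sorted(pair_counts.values), sorted(key_counts.values))
```

Both approaches should yield the same multiset of counts, so either can be passed to `entropy`; grouping simply avoids building an intermediate string column.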

In [13]:
print("The entropy of this dataset is", entropy(locations))

The entropy of this dataset is 6.523038135932779
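To put a value like this in context: under the assumptions above, the entropy can be at most log2(k) bits for k distinct locations, which is reached when every location is visited equally often. The following is a minimal sketch of that comparison using made-up counts; the names `max_entropy` and `normalized` are mine and are not part of the original question.

```python
import numpy as np

# Same definition as earlier, repeated so this snippet runs on its own.
def entropy(locations):
    total = sum(locations)
    return -sum(location/total * np.log2(location/total) for location in locations)

# Hypothetical visit counts for four locations (not taken from train.csv).
counts = [8, 4, 2, 2]

h = entropy(counts)                  # entropy of the observed visit distribution
max_entropy = np.log2(len(counts))   # upper bound: all locations equally likely
normalized = h / max_entropy         # 1.0 would mean perfectly uniform visits

print(h, max_entropy, normalized)
```

For the notebook's data, the same comparison would use `len(locations)` as k.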
