We want to see:


* at least two graphs containing exploration of the dataset
* a statement of your question (or questions!) and how you arrived there
* the explanation of at least two new columns you created and how you did it
* the comparison between two classification approaches, including a qualitative discussion of simplicity, time to run the model, and accuracy, precision, and/or recall
* the comparison between two regression approaches, including a qualitative discussion of simplicity, time to run the model, and accuracy, precision, and/or recall
* an overall conclusion, with a preliminary answer to your initial question(s), next steps, and what other data you would like to have in order to better answer your question(s)

In [45]:
import pandas as pd
import numpy as np
import sys
from matplotlib import pyplot as plt
df = pd.read_csv("profiles.csv")
cols = df.shape[1]
rows = df.shape[0]
pd.set_option('display.max_columns', cols*2)

In [4]:
df.tail(1)

Unnamed: 0,age,body_type,diet,drinks,drugs,education,essay0,essay1,essay2,essay3,essay4,essay5,essay6,essay7,essay8,essay9,ethnicity,height,income,job,last_online,location,offspring,orientation,pets,religion,sex,sign,smokes,speaks,status
59945,39,average,,socially,,graduated from masters program,"is it odd that having a little ""enemy"" status ...",i work with elderly people (psychotherapy and ...,i'm a great bullshitter. i don't know what it ...,"either that i am funny/sarcastic, or that i am...","i just read the help by kathryn stockett, sooo...",1. family &amp; friends &amp; other humans - i...,"sex, myself, other people, how amazing everyth...","out at happy hour with my friends, running int...",i wish i could cry like holly hunter in broadc...,"if you have a back-bone, an opinion, a sense o...",white,68.0,-1,medicine / health,2012-06-29-00-42,"san francisco, california",,gay,likes dogs and likes cats,catholicism and laughing about it,m,gemini and it&rsquo;s fun to think about,sometimes,english,single


In [9]:
import json
with open ("us_states.json", 'r') as f:
    us_state_abbrev = json.load(f)

<h3>Adding new columns - city, state, in the US, in CA, longitude, and latitude</h3>

I wanted to add geolocation data for geomapping, but along the way found out that almost all of the profiles are in California.  Because of this, I only fetched geolocation data for California cities.

The geocoding API I found could not resolve every city, but it returned coordinates for almost all of them.

In [46]:
df['city'] = df.location.map(lambda l : l.split(", ")[0])
df['state'] = df.location.map(lambda l : l.split(", ")[1])
df['stateAbbr'] = df.state.map(lambda s : us_state_abbrev[s] if s in us_state_abbrev else s)
df['inUS'] = df.state.map(lambda s : s in us_state_abbrev)
df['inCA'] = df.state.map(lambda s : s == 'california')

inUS = df.inUS.value_counts()
inCA = df.inCA.value_counts()
print("{}% in US".format(inUS[1]/rows*100))
print("{}% in CA".format(inCA[1]/rows*100))

99.97831381576752% in US
99.84819671037268% in CA


Because 99.978% of the data are in the US and 99.848% of the data are in California, all non-California observations are going to be ignored for any geo-location summary information.
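The city/state split above assumes every location string contains a `", "` separator. As a minimal sketch of a more defensive version (the helper name `split_location` and the sample strings are hypothetical, not taken from the dataset), the split can return `None` for a missing state instead of raising an `IndexError`:

```python
def split_location(location):
    """Split 'city, state' into (city, state); state is None when absent."""
    parts = location.split(", ")
    city = parts[0]
    # Same indexing as the cell above, but guarded against a missing state.
    state = parts[1] if len(parts) > 1 else None
    return city, state

# Hypothetical sample values used only for illustration.
samples = ["san francisco, california", "oakland, california", "oakland"]
pairs = [split_location(s) for s in samples]

# Share of samples whose state is California, as a percentage.
in_ca_percent = sum(state == "california" for _, state in pairs) / len(pairs) * 100
```

With pandas, the same function could be passed to `df.location.map` in place of the two lambdas.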

In [57]:
import requests
def getLatLong(location):
    print(".", end="")
    response = requests.get("http://open.mapquestapi.com/geocoding/v1/address?key=pxVoRtOLk6bTqJAGmcCdLY9ZrLGA707h&location={}".format(location))  
    try:
        response = response.json()
        state = location.split(", ")[1]
        latLong = None
        for data in response['results'][0]['locations']:
            if state == data['adminArea3']:
                latLong = data['latLng']
        if not latLong: # could not find the actual state's data
            raise ValueError("no result in the requested state")
        lat = latLong['lat']
        long = latLong['lng']
        return lat, long
    except Exception:
        print(" {} missing ".format(location), end="")
        return None, None

First, we need to find the unique locations and get the latitude and longitude for each of them.  I am using the MapQuest geocoding API (with an API key) to do this.

In [68]:
locations = df.location.value_counts().index
location_counts = list(df.location.value_counts())
location_dict = {}

for location, count in zip(locations,location_counts):
    city = location.split(", ")[0]
    state = location.split(", ")[1]
    if not state == 'california':
        continue
    lat, long = getLatLong("{}, CA, US".format(city))
    if lat is None:
        continue
    location_dict[city] = {"count": count, "lat": lat, "long": long}

.......................... martinez, CA, US missing ..... albany, CA, US missing .............. green brae, CA, US missing ..... belvedere tiburon, CA, US missing .... crockett, CA, US missing . el granada, CA, US missing ...... piedmont, CA, US missing .. westlake, CA, US missing .......... san diego, CA, US missing . santa cruz, CA, US missing ..... bayshore, CA, US missing .... kensington, CA, US missing ....... irvine, CA, US missing ..... union city, CA, US missing ...... modesto, CA, US missing ........... long beach, CA, US missing . pasadena, CA, US missing . canyon, CA, US missing .. arcadia, CA, US missing ........... chico, CA, US missing ....... ashland, CA, US missing .
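The loop above makes one geocoding request per distinct location rather than one per profile. A minimal sketch of that pattern, with the network call replaced by a stand-in (`geocode_unique`, `fake_geocode`, and the approximate coordinates are assumptions for illustration, not the MapQuest API):

```python
from collections import Counter

def geocode_unique(locations, geocode):
    """Geocode each distinct location once and keep its row count."""
    result = {}
    for location, count in Counter(locations).most_common():
        lat, lng = geocode(location)
        if lat is None or lng is None:
            continue  # skip locations the geocoder could not resolve
        result[location] = {"count": count, "lat": lat, "long": lng}
    return result

calls = []

def fake_geocode(location):
    # Stand-in for a real geocoding request; coordinates are approximate.
    calls.append(location)
    table = {"oakland, california": (37.80, -122.27)}
    return table.get(location, (None, None))

sample_rows = ["oakland, california"] * 3 + ["nowhere, california"]
sample_dict = geocode_unique(sample_rows, fake_geocode)
```

Here four rows lead to only two geocoder calls, and the unresolved location is left out of the result, mirroring the "missing" entries in the output above.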

In [71]:
df['lat'] = df.city.map(lambda c : location_dict[c]['lat'] if c in location_dict else None)
df['long'] = df.city.map(lambda c : location_dict[c]['long'] if c in location_dict else None)

Because fetching this data takes a while, I am saving the dataframe and the geolocation data (with counts) in the next cell and reloading them in the following cell.  I can start from there in the future.

In [77]:
df.to_csv("profilesWithGeoData.csv")
import json
with open('location_dict.json', 'w') as f:
    json.dump(location_dict, f)
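One way to avoid re-running the slow step by hand is a small "load if cached, otherwise build and save" helper. This is a sketch only: `load_or_build` and `slow_build` are hypothetical names, and the placeholder data stands in for the real geocoding loop.

```python
import json
import os
import tempfile

def load_or_build(path, build):
    """Return cached JSON from path, or call build() and cache its result."""
    if os.path.exists(path):
        with open(path, "r") as f:
            return json.load(f)
    data = build()
    with open(path, "w") as f:
        json.dump(data, f)
    return data

build_calls = []

def slow_build():
    # Stand-in for the geocoding loop; the values are placeholders.
    build_calls.append(1)
    return {"oakland": {"count": 3, "lat": 37.8, "long": -122.27}}

with tempfile.TemporaryDirectory() as tmp:
    cache_path = os.path.join(tmp, "location_dict.json")
    first = load_or_build(cache_path, slow_build)   # runs slow_build and writes the file
    second = load_or_build(cache_path, slow_build)  # reads the file, no rebuild
```

The second call returns the same dictionary without invoking `slow_build` again.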

<h3>Start from here in the future (no need to redo getting geolocation data)</h3>

In [3]:
import pandas as pd
import numpy as np
import sys, json
from matplotlib import pyplot as plt

with open('location_dict.json', 'r') as f:
    location_dict = json.load(f)

df = pd.read_csv("profilesWithGeoData.csv")
cols = df.shape[1]
rows = df.shape[0]
pd.set_option('display.max_columns', cols*2)
df.head(1)

Unnamed: 0.1,Unnamed: 0,age,body_type,diet,drinks,drugs,education,essay0,essay1,essay2,essay3,essay4,essay5,essay6,essay7,essay8,essay9,ethnicity,height,income,job,last_online,location,offspring,orientation,pets,religion,sex,sign,smokes,speaks,status,city,state,stateAbbr,inUS,inCA,lat,long
0,0,22,a little extra,strictly anything,socially,never,working on college/university,about me:<br />\r\n<br />\r\ni would love to t...,currently working as an international agent fo...,making people laugh.<br />\r\nranting about a ...,"the way i look. i am a six foot half asian, ha...","books:<br />\r\nabsurdistan, the republic, of ...",food.<br />\r\nwater.<br />\r\ncell phone.<br ...,duality and humorous things,trying to find someone to hang out with. i am ...,i am new to california and looking for someone...,you want to be swept off your feet!<br />\r\ny...,"asian, white",75.0,-1,transportation,2012-06-28-20-30,"south san francisco, california","doesn&rsquo;t have kids, but might want them",straight,likes dogs and likes cats,agnosticism and very serious about it,m,gemini,sometimes,english,single,south san francisco,california,CA,True,True,37.654949,-122.408125


In [13]:
!{sys.executable} -m pip install -q folium
import folium
import math


# log of the largest per-location count, used to normalize the color scale
MAX_COUNT = math.log(list(df.location.value_counts())[0])
MIN_LISTINGS = 10
color_scaler = MAX_COUNT / 255
def percToRed(perc):
    return '#%02x%02x%02x' % (255, 255-int(perc*255), 255-int(perc*255))

# map centered on California
dating_map = folium.Map(location=[36.7783, -119.4179], zoom_start=6)

for location, data in location_dict.items():
    count = data['count']
    if count < MIN_LISTINGS:
        continue # ignore locations with fewer than MIN_LISTINGS listings
    percent = math.log(count)/MAX_COUNT
    color = percToRed(percent)
    folium.CircleMarker(
        [data['lat'], data['long']],
        radius=5,
        popup=folium.Popup("{} ({} listings)".format(location, count), parse_html=True),
        fill=True,
        color=color,
        fill_color=color,
        fill_opacity=0.6
        ).add_to(dating_map)

<h3>First 'graph' - listing density by geolocation</h3>

The map below shows the density of listings by geolocation.  The log of the number of listings per location was taken, as otherwise everything but San Francisco would be white. 

(the scale for the map markers is white to red, white being low density of listings and red being high)
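To illustrate why the log is taken, here is a small standalone sketch of the white-to-red scale comparing linear and log scaling. The function names and the two counts are hypothetical, chosen only to show the effect:

```python
import math

def perc_to_red(perc):
    """Map a value in [0, 1] to a hex color from white (0) to red (1)."""
    level = 255 - int(perc * 255)
    return '#%02x%02x%02x' % (255, level, level)

def log_scaled(count, max_count):
    """Fraction of the maximum on a log scale (counts must be >= 1)."""
    return math.log(count) / math.log(max_count)

# Hypothetical counts: one dominant city and one much smaller one.
max_count, small_count = 30000, 100

linear_color = perc_to_red(small_count / max_count)          # stays white
log_color = perc_to_red(log_scaled(small_count, max_count))  # visibly tinted

white = perc_to_red(0.0)  # '#ffffff'
red = perc_to_red(1.0)    # '#ff0000'
```

On a linear scale the smaller city is indistinguishable from white, while the log scale gives it a clearly pink marker.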

In [14]:
dating_map