# Final Project: San Francisco Bike Theft Predictions

# Part 1: Proposal
*Frame the problem, criteria, and data source(s)*

## Problem Statement

1. Hypothesis/assumptions
2. Goals and Success Metrics
3. Risk/Limitations
4. Data Source

**Stolen and not-stolen bike data from the [BikeIndex API v3](https://bikeindex.org/documentation/api_v3).**
- Data is collected in real time.
- API scraped as of November 5, 2018.

**San Francisco City Survey data from [datasf.org](https://data.sfgov.org/City-Management-and-Ethics/San-Francisco-City-Survey-Data-1996-2017/huch-6k5m).**
- Data is collected every two years.
- Documentation recommends using data from 2015 forward.
- Most complete zipcode-related data is from 2017 survey.

For this project, I downloaded all the bike data in BikeIndex and merged it with the San Francisco city survey data via zipcodes. The following code documents how I cleaned the bike data and combined it with the San Francisco survey data in a meaningful way. Because the survey answers are recorded as categorical codes, I used dummy variables to convert this information into per-zipcode proportions.

The bike retrieval code can be found on [my GitHub](https://github.com/chanwinyee/ds_foundations/blob/master/final_project/bike_index_data_retrieval_DONE.ipynb).


In [1]:
# Data cleaning, exploration, and analysis tools
import pandas as pd
import seaborn as sns
import numpy as np
from ast import literal_eval
import re


### Clean the BikeIndex data

In [2]:
# Import the CSV of stolen and not-stolen bike data into a pandas DataFrame
# (pd.read_csv already returns a DataFrame, so no extra wrapping is needed)
bike_df = pd.read_csv('bike_index_api_stolenessall.csv')
bike_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 188574 entries, 0 to 188573
Data columns (total 14 columns):
Unnamed: 0           188574 non-null int64
date_stolen          62097 non-null float64
frame_colors         188574 non-null object
frame_model          170000 non-null object
id                   188574 non-null int64
is_stock_img         188574 non-null bool
large_img            78875 non-null object
manufacturer_name    188567 non-null object
serial               188211 non-null object
stolen               188574 non-null bool
stolen_location      61129 non-null object
thumb                78875 non-null object
title                188573 non-null object
year                 133836 non-null float64
dtypes: bool(2), float64(2), int64(2), object(8)
memory usage: 17.6+ MB


This dataset contains every record retrieved from the BikeIndex API. What I really want are the records for bikes in San Francisco.

**Assumption 1:** As a first pass, I narrow the dataset to bikes associated with California, identified by a "stolen_location" value that contains "CA", "California", or "San Francisco". I exclude all records that lack this indicator. The later merge with the survey data on zipcode narrows the set further to San Francisco.

In [3]:
bike_df_clean = bike_df.copy()
bike_df_clean = bike_df_clean.dropna(subset=['stolen_location'])
bike_df_clean = bike_df_clean[bike_df_clean['stolen_location'].apply(str).str.contains('CA|California|San Francisco')]
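
One caveat with this pattern, sketched here on made-up location strings (not rows from the dataset): `str.contains` is case-sensitive and unanchored, so the bare token `CA` also matches any location that merely contains those two capital letters. The stricter variant with `\b` word boundaries is an assumption about what might be preferable, not part of the original filter.

```python
import pandas as pd

# Hypothetical location strings for illustration only
locations = pd.Series([
    "San Francisco,CA,94105",
    "CALGARY,AB,T2P 1J9",
    "Portland,OR,97201",
])

# Pattern used in the notebook: matches "CA" anywhere in the string
loose = locations.str.contains('CA|California|San Francisco')

# Stricter variant (assumption): require "CA" to stand alone as a token
strict = locations.str.contains(r'\bCA\b|California|San Francisco')

print(loose.tolist())
print(strict.tolist())
```

On this toy input the loose pattern keeps the Calgary row while the stricter one drops it.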

In [4]:

def split_zipcode(x):
    # Return the first 5-digit run in the location string, or None if there is none
    matches = re.findall(r'\d{5}', str(x))
    return matches[0] if matches else None
            
bike_df_clean['stolen_zipcode'] = bike_df_clean['stolen_location'].apply(split_zipcode)
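
For reference, roughly the same extraction can be written with the vectorized `Series.str.extract`. This is a sketch on made-up location strings. Like `split_zipcode`, it takes the first 5-digit run it finds, so a location that began with a 5-digit street number would be misread as a zipcode.

```python
import pandas as pd

# Hypothetical location strings for illustration only
locations = pd.Series([
    "San Francisco,CA,94105",
    "Oakland,CA",              # no zipcode present
    "Berkeley,CA,94704",
])

# Capture the first run of five digits; rows without one become NaN
zipcodes = locations.str.extract(r'(\d{5})', expand=False)
print(zipcodes.tolist())
```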

In [5]:
# Confirm that the zipcode data has been successfully parsed out into a new column
bike_df_clean.head()

Unnamed: 0.1,Unnamed: 0,date_stolen,frame_colors,frame_model,id,is_stock_img,large_img,manufacturer_name,serial,stolen,stolen_location,thumb,title,year,stolen_zipcode
33,33,1541264000.0,['Blue'],Cross-Check,462239,False,https://files.bikeindex.org/uploads/Pu/140320/...,Surly,YS-PC20270,True,"San Francisco,CA,94105",https://files.bikeindex.org/uploads/Pu/140320/...,2014 Surly Cross-Check,2014.0,94105
39,39,1541236000.0,"['Red', 'Silver, gray or bare metal']",OCR 3,461486,False,https://files.bikeindex.org/uploads/Pu/140180/...,Giant,absent,True,"San Francisco,CA,94114",https://files.bikeindex.org/uploads/Pu/140180/...,2007 Giant OCR 3,2007.0,94114
40,40,1541236000.0,"['Silver, gray or bare metal']",Thin 7,461723,False,https://files.bikeindex.org/uploads/Pu/140253/...,Sondors,MT17004959,True,"Berkeley,CA,94704",https://files.bikeindex.org/uploads/Pu/140253/...,Sondors Thin 7,,94704
41,41,1541221000.0,['Black'],N/a,461764,False,https://files.bikeindex.org/uploads/Pu/140264/...,Not visible on bike,absent,True,"San Francisco,CA,94110",https://files.bikeindex.org/uploads/Pu/140264/...,Not visible on bike N/a,,94110
42,42,1541259000.0,['White'],Lightweight 6061 Aluminum Frame,460962,False,,SXL,absent,True,"Los Angeles,CA,90007",,2018 SXL Lightweight 6061 Aluminum Frame,2018.0,90007


In [6]:
# The size of the California dataset is much smaller than the original dataset
bike_df_clean.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 13313 entries, 33 to 99992
Data columns (total 15 columns):
Unnamed: 0           13313 non-null int64
date_stolen          13313 non-null float64
frame_colors         13313 non-null object
frame_model          12308 non-null object
id                   13313 non-null int64
is_stock_img         13313 non-null bool
large_img            8185 non-null object
manufacturer_name    13313 non-null object
serial               13282 non-null object
stolen               13313 non-null bool
stolen_location      13313 non-null object
thumb                8185 non-null object
title                13313 non-null object
year                 11010 non-null float64
stolen_zipcode       12881 non-null object
dtypes: bool(2), float64(2), int64(2), object(9)
memory usage: 1.4+ MB


In [7]:
# Convert the date_stolen into something readable and extract the year
bike_df_clean['date_stolen'] = pd.to_datetime(bike_df_clean['date_stolen'],unit='s')
bike_df_clean['year_stolen'] = bike_df_clean['date_stolen'].dt.year
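
A note on what `unit='s'` does, sketched with two illustrative epoch values rather than rows from the dataset: the numbers are read as seconds since 1970-01-01 UTC, so the extracted year is the UTC year. For thefts close to New Year, the local (Pacific) year can differ. Whether that matters here is a judgment call, since only the year is used downstream.

```python
import pandas as pd

# Two epoch timestamps in seconds (illustrative values)
epoch_seconds = pd.Series([1541264400, 1514772000])

# unit='s' interprets the numbers as seconds since 1970-01-01 UTC
utc_times = pd.to_datetime(epoch_seconds, unit='s', utc=True)
utc_years = utc_times.dt.year

# Converting to Pacific time can move a theft near New Year into the previous year
local_times = utc_times.dt.tz_convert('America/Los_Angeles')
local_years = local_times.dt.year

print(utc_years.tolist())
print(local_years.tolist())
```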

In [8]:
# Confirm that the year has been extracted correctly.
bike_df_clean

Unnamed: 0.1,Unnamed: 0,date_stolen,frame_colors,frame_model,id,is_stock_img,large_img,manufacturer_name,serial,stolen,stolen_location,thumb,title,year,stolen_zipcode,year_stolen
33,33,2018-11-03 17:00:00,['Blue'],Cross-Check,462239,False,https://files.bikeindex.org/uploads/Pu/140320/...,Surly,YS-PC20270,True,"San Francisco,CA,94105",https://files.bikeindex.org/uploads/Pu/140320/...,2014 Surly Cross-Check,2014.0,94105,2018
39,39,2018-11-03 09:00:00,"['Red', 'Silver, gray or bare metal']",OCR 3,461486,False,https://files.bikeindex.org/uploads/Pu/140180/...,Giant,absent,True,"San Francisco,CA,94114",https://files.bikeindex.org/uploads/Pu/140180/...,2007 Giant OCR 3,2007.0,94114,2018
40,40,2018-11-03 09:00:00,"['Silver, gray or bare metal']",Thin 7,461723,False,https://files.bikeindex.org/uploads/Pu/140253/...,Sondors,MT17004959,True,"Berkeley,CA,94704",https://files.bikeindex.org/uploads/Pu/140253/...,Sondors Thin 7,,94704,2018
41,41,2018-11-03 05:00:00,['Black'],N/a,461764,False,https://files.bikeindex.org/uploads/Pu/140264/...,Not visible on bike,absent,True,"San Francisco,CA,94110",https://files.bikeindex.org/uploads/Pu/140264/...,Not visible on bike N/a,,94110,2018
42,42,2018-11-03 15:24:15,['White'],Lightweight 6061 Aluminum Frame,460962,False,,SXL,absent,True,"Los Angeles,CA,90007",,2018 SXL Lightweight 6061 Aluminum Frame,2018.0,90007,2018
47,47,2018-11-02 23:00:00,"['Silver, gray or bare metal']",Mountain,462623,False,,Genesis,no number,True,"Chico,CA,95973",,Genesis Mountain,,95973,2018
48,48,2018-11-02 21:00:44,"['Silver, gray or bare metal']",Cadent 1,460772,False,https://files.bikeindex.org/uploads/Pu/140095/...,Raleigh,u149k14722,True,"San Francisco,CA,94103",https://files.bikeindex.org/uploads/Pu/140095/...,2015 Raleigh Cadent 1,2015.0,94103,2018
49,49,2018-11-02 21:00:00,"['Silver, gray or bare metal']",Bike DB APEX,460776,False,https://files.bikeindex.org/uploads/Pu/140096/...,Diamondback,DAA16F000473,True,"San Diego,CA,92109",https://files.bikeindex.org/uploads/Pu/140096/...,2016 Diamondback Bike DB APEX,2016.0,92109,2018
56,56,2018-11-02 18:56:45,"['Blue', 'Blue']","19"" frame size kent bayside.",460699,False,,Kent,GS72696,True,"Santa Ana,CA,92705",,"Kent 19"" frame size kent bayside.",,92705,2018
69,69,2018-11-02 04:00:00,['Black'],Volare,69412,False,https://files.bikeindex.org/uploads/Pu/46648/l...,Schwinn,SNMNG 14C37721,True,"San Francisco,CA,94118",https://files.bikeindex.org/uploads/Pu/46648/s...,2014 Schwinn Volare,2014.0,94118,2018


In [9]:
# Check that the year values are clean: no repeated years or inconsistent notations
year_stolen = np.sort(bike_df_clean['year_stolen'].unique())
year_stolen

array([1990, 1998, 2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008,
       2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018])

**Assumption 2:** The San Francisco City Survey documentation recommends using only data from 2015 forward, and the most complete data comes from the 2017 survey. I therefore base my model on bike data from 2015 forward and assume that conditions in San Francisco between 2015 and the present day do not differ much from what the 2017 survey captured.

The BikeIndex dataset also appears to be richer in the years after 2014.

In [10]:
bike_df_clean.groupby('year_stolen').count()

year_stolen,Unnamed: 0,date_stolen,frame_colors,frame_model,id,is_stock_img,large_img,manufacturer_name,serial,stolen,stolen_location,thumb,title,year,stolen_zipcode
1990,1,1,1,1,1,1,0,1,1,1,1,0,1,1,0
1998,6,6,6,6,6,6,0,6,6,6,6,0,6,6,6
2000,20,20,20,17,20,20,3,20,20,20,20,3,20,18,20
2001,2,2,2,1,2,2,1,2,2,2,2,1,2,1,2
2002,6,6,6,6,6,6,1,6,6,6,6,1,6,5,6
2003,2,2,2,2,2,2,0,2,2,2,2,0,2,2,2
2004,10,10,10,10,10,10,0,10,10,10,10,0,10,8,10
2005,108,108,108,103,108,108,15,108,108,108,108,15,108,96,108
2006,149,149,149,145,149,149,14,149,149,149,149,14,149,136,149
2007,199,199,199,189,199,199,64,199,199,199,199,64,199,166,199


In [11]:
# Select data from year_stolen 2015 to 2018
# year_stolen is an integer column, so filter with integers rather than strings
bike_df_final = bike_df_clean[bike_df_clean['year_stolen'].isin([2018, 2017, 2016, 2015])]
bike_df_final['year_stolen'].unique()

array([2018, 2017, 2016, 2015])
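
Since `year_stolen` holds integers, the same selection can also be expressed as a numeric range with `Series.between`. A minimal sketch on made-up values, with `toy` standing in for `bike_df_clean`:

```python
import pandas as pd

# Toy frame standing in for bike_df_clean (values are made up)
toy = pd.DataFrame({'year_stolen': [2007, 2014, 2015, 2017, 2018]})

# between() is inclusive on both ends by default
recent = toy[toy['year_stolen'].between(2015, 2018)]
print(recent['year_stolen'].tolist())
```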

In [12]:
bike_df_final.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 7302 entries, 33 to 36403
Data columns (total 16 columns):
Unnamed: 0           7302 non-null int64
date_stolen          7302 non-null datetime64[ns]
frame_colors         7302 non-null object
frame_model          6644 non-null object
id                   7302 non-null int64
is_stock_img         7302 non-null bool
large_img            4993 non-null object
manufacturer_name    7302 non-null object
serial               7271 non-null object
stolen               7302 non-null bool
stolen_location      7302 non-null object
thumb                4993 non-null object
title                7302 non-null object
year                 5972 non-null float64
stolen_zipcode       6970 non-null object
year_stolen          7302 non-null int64
dtypes: bool(2), datetime64[ns](1), float64(1), int64(3), object(9)
memory usage: 870.0+ KB


In [13]:
# Store final data into csv
# bike_df_final.to_csv(path_or_buf='/Users/lizchan/ds_foundations/final_project/bike_data_clean.csv')

### Cleaning the San Francisco Survey Data

In [14]:
# Import the survey data (pd.read_csv already returns a DataFrame)
survey_df = pd.read_csv('San_Francisco_City_Survey_Data_1996-2017.csv')
total_columns = survey_df.columns
print(len(total_columns))
print(survey_df.columns.nunique())

92
92


Because the survey dataset has 92 columns, I read through the survey data dictionary to understand what was available and selected columns that could be relevant to the zipcodes where bikes are stolen.

**Assumption 3:** I selected columns that the data dictionary groups under "survey" and "demographics". I assumed that the demographics of a zipcode can help predict whether or not a bike will get stolen.

In [15]:
# Data Dictionary for survey data
column_names = {
    'id':'Unique id',
    'year':'Survey year',
    'mode':'survey mode',
    'language':'survey language',
    'dlivedsf':'Length of SF residence 1996-2009 (Groupings change in 2011)', #Made Contiguous 
    'primlang_1':'primary language 1',
    'primlang_2':'primary language',
    'primlang_3':'primary language',
    'primlang_4':'primary_language',
    'dage':'Respondents age group (Age groups change in 2011, 2017)', #Made Contiguous 
    'dethnic':'Respondents ethnicity',
    'mixed_1':'mixed race or ethnics',
    'mixed_2':'mixed race or ethnics',
    'mixed_3':'mixed race or ethnics',
    'mixed_4':'mixed race or ethnics',
    'deduc':'Respondents highest education completed',
    'dincome':'Household income year prior to survey', #Made Contiguous 
    'dhouse':'Number of people in household', #Made Contiguous 
    'ownrenhm':'Own or rent home',
    'gender':'Respondents sex',
    'dsexornt':'Respondents sexual orientation',
    'zipcode':'zipcode',
    'district':'Supervisorial District',
    'movesf':'Likelihood of moving away from SF in the next 3 years',
    'disablephys':'physically disabled',
    'disablement':'mentally disabled'    
}

After looking at the demographic data, I noticed that there are fields that may not relate to bike-stolenness.

**Assumption 4:** I assume that people with mental or physical disabilities are less likely to ride bikes, so I removed these two columns from consideration in my model.

**Assumption 5:** Survey year, mode, survey language, years lived in San Francisco, age, ethnicity, education, income, number of people in household, gender, and likelihood of moving away from SF are variables to consider when estimating the likelihood of a bike being stolen.

**Assumption 6:** When making values contiguous (replacing each bracketed category with a single number), I assigned the open-ended top bracket of the following columns to be:

- dlivedsf: 30+ changed to 40; years lived in SF
- dage: 65+ changed to 65; age
- dincome: 200001+ changed to 300000; income
- dhouse: 6 or more changed to 6; number of people in household
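
The substitution in Assumption 6 can be sketched on a few made-up income codes. The mapping mirrors the income brackets used in this notebook; the input codes are invented for illustration. Codes mapped to `None` (the "not available" answer) come out as NaN.

```python
import pandas as pd

# Income codes as they might appear in the survey (7 = top bracket, 8 = not available)
codes = pd.Series([1.0, 7.0, 8.0])

# Upper bound of each bracket; the open-ended top bracket gets an assumed value of 300000
dincome_upper = {1: 10000, 2: 25000, 3: 35000, 4: 50000,
                 5: 100000, 6: 200000, 7: 300000, 8: None}

income = codes.map(dincome_upper)
print(income.tolist())
```

The value chosen for the top bracket is subjective, and it pulls any per-zipcode average toward itself wherever many respondents fall in that bracket.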

In [16]:
# Collect an array of column titles to keep

survey_info = ['id','year','mode','language'] # Columns classified as survey-related
demographics = ['dlivedsf','dage','dethnic','deduc','dincome','dhouse','gender','zipcode','movesf'] # Columns classified as demographics-related
active_columns = survey_info + demographics

# Collect an array of column titles to discard
discard_columns = [t for t in total_columns if t not in active_columns]

# Create clean DataFrame with relevant survey columns (active_columns)
survey_df_clean = survey_df.drop(columns=discard_columns)
# 'year' is an integer column, so filter with integers rather than strings
survey_df_clean = survey_df_clean[survey_df_clean['year'].isin([2018, 2017, 2016, 2015])]


In [17]:
survey_df_clean.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4345 entries, 16699 to 37971
Data columns (total 13 columns):
id          4345 non-null int64
year        4345 non-null int64
mode        4345 non-null float64
language    4345 non-null float64
dhouse      4345 non-null float64
dlivedsf    4345 non-null float64
movesf      4345 non-null float64
dincome     4345 non-null float64
dage        4345 non-null float64
gender      4345 non-null float64
dethnic     4345 non-null float64
deduc       4345 non-null float64
zipcode     2166 non-null float64
dtypes: float64(11), int64(2)
memory usage: 475.2 KB


Because I am planning to merge the survey data with my bike data using zipcodes, any data that does not have a zipcode to link is unusable. I am discarding all survey data that has a missing zipcode.

In [18]:
# Drop null zipcodes (at this point zipcode is the only column with missing values)
survey_df_clean_null = survey_df_clean[survey_df_clean.isnull().any(axis=1)]
survey_df_clean_value = survey_df_clean.dropna()

In [19]:
print('Years of survey with null zipcodes: ',survey_df_clean_null['year'].unique())
print('Years of survey with zipcodes: ',survey_df_clean_value['year'].unique())

Years of survey with null zipcodes:  [2015]
Years of survey with zipcodes:  [2017]


Only the 2017 survey data has zipcodes that I can use to merge with my BikeIndex dataset. (See **Assumption 2**)

In [20]:
# Check data size. It appears deceptively complete because "not available" answers are still stored as numeric codes.
survey_df_clean_value.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2166 entries, 16699 to 37971
Data columns (total 13 columns):
id          2166 non-null int64
year        2166 non-null int64
mode        2166 non-null float64
language    2166 non-null float64
dhouse      2166 non-null float64
dlivedsf    2166 non-null float64
movesf      2166 non-null float64
dincome     2166 non-null float64
dage        2166 non-null float64
gender      2166 non-null float64
dethnic     2166 non-null float64
deduc       2166 non-null float64
zipcode     2166 non-null float64
dtypes: float64(11), int64(2)
memory usage: 236.9 KB


In [21]:
# Manually type in the survey and demographics key into dictionaries
# Note any "not available" datapoints as NaN
# Convert columns that can be contiguous

mode_dict={
    1:'phone',
    2:'mail',
    3:'web/phone',
    4:'web/mail'
}

language_dict={
    1:'English',
    2:'Spanish',
    3:'Chinese',
    4:'Tagalog'
}

#Made Contiguous
dlivedsf_dict={
    1:2,
    2:5,
    3:10,
    4:20,
    5:30,
    6:40, #30+; I gave this value an extra subjective weight
    7:None
}

# Made contiguous
dage_dict={
    1:24,
    2:34,
    3:44,
    4:54,
    5:59,
    6:64,
    7:65, #65+
    8:None
}

dethnic_dict={
    1:'Black/African American',
    2:'Asian or Pacific Islander',
    3:'Latino/Hispanic',
    4:'Native American/Indian',
    5:'White/Caucasian',
    6:'Other',
    7:'Mixed Ethnicity',
    8:'Dont know',
    9:None,
    10:'Pacific Islander',
    11:'Arab / Middle Eastern /North African ( 2015 Only); Arab,Middle Eastern, South Asian (2017)',
    12:'Mixed Unspecified',
    13:'Caribbean (2017)'
}

deduc_dict={
    1:'Less than high school',
    2:'High school',
    3:'Less than 4 years of college',
    4:'4 or more years of college/Post Graduate',
    5:None,
}

# Made contiguous
dincome_dict={
    1:10000,
    2:25000,
    3:35000,
    4:50000,
    5:100000,
    6:200000,
    7:300000, #200001+
    8:None
}

# Made contiguous
dhouse_dict={
    1:1,
    2:2,
    3:3,
    4:4,
    5:5,
    6:6, #6 or more
    7:None,
}

gender_dict={
    1:'Female',
    2:'Male',
    3:'Other',
    4:None,
}

movesf_dict={
    1:'Very likely',
    2:'Somewhat likely',
    3:'Not too likely',
    4:'Not at all likely',
    5:None
}

In [22]:
survey_df_clean_value.head()

Unnamed: 0,id,year,mode,language,dhouse,dlivedsf,movesf,dincome,dage,gender,dethnic,deduc,zipcode
16699,201711681,2017,1.0,3.0,1.0,3.0,2.0,2.0,2.0,1.0,2.0,2.0,94114.0
18495,201711805,2017,1.0,1.0,2.0,6.0,4.0,7.0,7.0,2.0,5.0,4.0,94124.0
18885,201711881,2017,1.0,1.0,2.0,6.0,4.0,3.0,4.0,2.0,1.0,3.0,94115.0
20949,201711908,2017,1.0,1.0,2.0,4.0,2.0,6.0,3.0,1.0,4.0,4.0,94110.0
29172,201710361,2017,1.0,1.0,2.0,3.0,4.0,8.0,4.0,1.0,9.0,4.0,94132.0


In [23]:
# Apply each dictionary to its column
survey_df_clean_value_a = survey_df_clean_value.copy()
column_dicts = {
    'mode': mode_dict,
    'language': language_dict,
    'dhouse': dhouse_dict,
    'dlivedsf': dlivedsf_dict,
    'movesf': movesf_dict,
    'dincome': dincome_dict,
    'dage': dage_dict,
    'gender': gender_dict,
    'dethnic': dethnic_dict,
    'deduc': deduc_dict,
}
for column, mapping in column_dicts.items():
    survey_df_clean_value_a[column] = survey_df_clean_value_a[column].map(mapping)

survey_df_clean_value_a.head()

Unnamed: 0,id,year,mode,language,dhouse,dlivedsf,movesf,dincome,dage,gender,dethnic,deduc,zipcode
16699,201711681,2017,phone,Chinese,1.0,10.0,Somewhat likely,25000.0,34.0,Female,Asian or Pacific Islander,High school,94114.0
18495,201711805,2017,phone,English,2.0,40.0,Not at all likely,300000.0,65.0,Male,White/Caucasian,4 or more years of college/Post Graduate,94124.0
18885,201711881,2017,phone,English,2.0,40.0,Not at all likely,35000.0,54.0,Male,Black/African American,Less than 4 years of college,94115.0
20949,201711908,2017,phone,English,2.0,20.0,Somewhat likely,200000.0,44.0,Female,Native American/Indian,4 or more years of college/Post Graduate,94110.0
29172,201710361,2017,phone,English,2.0,10.0,Not at all likely,,54.0,Female,,4 or more years of college/Post Graduate,94132.0


In [24]:
# Check to see that NaN values are accurately recorded
survey_df_clean_value_a.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2166 entries, 16699 to 37971
Data columns (total 13 columns):
id          2166 non-null int64
year        2166 non-null int64
mode        2166 non-null object
language    2166 non-null object
dhouse      2145 non-null float64
dlivedsf    2159 non-null float64
movesf      2137 non-null object
dincome     1781 non-null float64
dage        2113 non-null float64
gender      2133 non-null object
dethnic     2066 non-null object
deduc       2124 non-null object
zipcode     2166 non-null float64
dtypes: float64(5), int64(2), object(6)
memory usage: 236.9+ KB


In [25]:
# Use dummy variables to turn categorical answers into 0/1 columns;
# averaging them per zipcode gives the share of respondents in each category

survey_df_clean_value_a_dummies = pd.get_dummies(survey_df_clean_value_a)
survey_df_final = survey_df_clean_value_a_dummies.copy()
survey_df_final = survey_df_final.groupby('zipcode').mean()

survey_df_final.head()

zipcode,id,year,dhouse,dlivedsf,dincome,dage,mode_phone,mode_web/phone,language_Chinese,language_English,...,dethnic_Latino/Hispanic,dethnic_Mixed Ethnicity,dethnic_Native American/Indian,dethnic_Other,dethnic_Pacific Islander,dethnic_White/Caucasian,deduc_4 or more years of college/Post Graduate,deduc_High school,deduc_Less than 4 years of college,deduc_Less than high school
94102.0,201711200.0,2017.0,2.246914,22.95122,77323.943662,48.691358,1.0,0.0,0.073171,0.902439,...,0.0,0.02439,0.378049,0.0,0.097561,0.109756,0.414634,0.243902,0.231707,0.085366
94103.0,201711000.0,2017.0,2.294872,21.076923,111805.555556,48.423077,0.987179,0.012821,0.012821,0.974359,...,0.025641,0.0,0.435897,0.0,0.102564,0.205128,0.538462,0.179487,0.25641,0.025641
94104.0,201711200.0,2017.0,2.545455,10.727273,113500.0,40.25,1.0,0.0,0.083333,0.75,...,0.083333,0.0,0.333333,0.0,0.0,0.25,0.583333,0.333333,0.0,0.083333
94105.0,201711000.0,2017.0,1.944444,14.555556,189666.666667,45.833333,0.944444,0.055556,0.0,1.0,...,0.055556,0.0,0.5,0.0,0.055556,0.0,0.944444,0.055556,0.0,0.0
94107.0,201711100.0,2017.0,2.392157,26.45098,156914.893617,50.62,0.960784,0.039216,0.0,1.0,...,0.019608,0.019608,0.588235,0.0,0.078431,0.058824,0.647059,0.117647,0.196078,0.039216
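
Why the per-zipcode mean of a dummy column reads as a share can be seen on a toy table. The respondents are made up, and `dtype=float` is passed only so the 0/1 columns average cleanly regardless of pandas version.

```python
import pandas as pd

# Made-up respondents for illustration
toy = pd.DataFrame({
    'zipcode': [94102, 94102, 94103],
    'gender': ['Female', 'Male', 'Female'],
    'dincome': [10000, 30000, 50000],
})

# One 0/1 column per category; numeric columns pass through unchanged
dummies = pd.get_dummies(toy, dtype=float)

# The mean of a 0/1 column is the share of rows where it equals 1
by_zip = dummies.groupby('zipcode').mean()
print(by_zip)
```

A respondent with a missing answer gets 0 in every dummy column for that question, so the shares for one question can sum to less than 1 within a zipcode.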


In [26]:
# Store as a CSV
# survey_df_final.to_csv(path_or_buf='/Users/lizchan/ds_foundations/final_project/survey_clean_2017.csv')

### Combine the BikeIndex data with San Francisco City Survey Data

In [27]:
# Import CSVs

bike_df = pd.read_csv('bike_data_clean.csv')

survey_df = pd.read_csv('survey_clean_2017.csv')

In [28]:
bike_df.head()

Unnamed: 0.2,Unnamed: 0,Unnamed: 0.1,date_stolen,frame_colors,frame_model,id,is_stock_img,large_img,manufacturer_name,serial,stolen,stolen_location,thumb,title,year,stolen_zipcode,year_stolen
0,33,33,2018-11-03 17:00:00,['Blue'],Cross-Check,462239,False,https://files.bikeindex.org/uploads/Pu/140320/...,Surly,YS-PC20270,True,"San Francisco,CA,94105",https://files.bikeindex.org/uploads/Pu/140320/...,2014 Surly Cross-Check,2014.0,94105.0,2018
1,39,39,2018-11-03 09:00:00,"['Red', 'Silver, gray or bare metal']",OCR 3,461486,False,https://files.bikeindex.org/uploads/Pu/140180/...,Giant,absent,True,"San Francisco,CA,94114",https://files.bikeindex.org/uploads/Pu/140180/...,2007 Giant OCR 3,2007.0,94114.0,2018
2,40,40,2018-11-03 09:00:00,"['Silver, gray or bare metal']",Thin 7,461723,False,https://files.bikeindex.org/uploads/Pu/140253/...,Sondors,MT17004959,True,"Berkeley,CA,94704",https://files.bikeindex.org/uploads/Pu/140253/...,Sondors Thin 7,,94704.0,2018
3,41,41,2018-11-03 05:00:00,['Black'],N/a,461764,False,https://files.bikeindex.org/uploads/Pu/140264/...,Not visible on bike,absent,True,"San Francisco,CA,94110",https://files.bikeindex.org/uploads/Pu/140264/...,Not visible on bike N/a,,94110.0,2018
4,42,42,2018-11-03 15:24:15,['White'],Lightweight 6061 Aluminum Frame,460962,False,,SXL,absent,True,"Los Angeles,CA,90007",,2018 SXL Lightweight 6061 Aluminum Frame,2018.0,90007.0,2018


In [29]:
survey_df.head()

Unnamed: 0,zipcode,id,year,dhouse,dlivedsf,dincome,dage,mode_phone,mode_web/phone,language_Chinese,...,dethnic_Latino/Hispanic,dethnic_Mixed Ethnicity,dethnic_Native American/Indian,dethnic_Other,dethnic_Pacific Islander,dethnic_White/Caucasian,deduc_4 or more years of college/Post Graduate,deduc_High school,deduc_Less than 4 years of college,deduc_Less than high school
0,94102.0,201711200.0,2017.0,2.246914,22.95122,77323.943662,48.691358,1.0,0.0,0.073171,...,0.0,0.02439,0.378049,0.0,0.097561,0.109756,0.414634,0.243902,0.231707,0.085366
1,94103.0,201711000.0,2017.0,2.294872,21.076923,111805.555556,48.423077,0.987179,0.012821,0.012821,...,0.025641,0.0,0.435897,0.0,0.102564,0.205128,0.538462,0.179487,0.25641,0.025641
2,94104.0,201711200.0,2017.0,2.545455,10.727273,113500.0,40.25,1.0,0.0,0.083333,...,0.083333,0.0,0.333333,0.0,0.0,0.25,0.583333,0.333333,0.0,0.083333
3,94105.0,201711000.0,2017.0,1.944444,14.555556,189666.666667,45.833333,0.944444,0.055556,0.0,...,0.055556,0.0,0.5,0.0,0.055556,0.0,0.944444,0.055556,0.0,0.0
4,94107.0,201711100.0,2017.0,2.392157,26.45098,156914.893617,50.62,0.960784,0.039216,0.0,...,0.019608,0.019608,0.588235,0.0,0.078431,0.058824,0.647059,0.117647,0.196078,0.039216


In [30]:
# Merge datasets on stolen_zipcode == zipcode
# (default inner join: bikes whose zipcode is not in the survey table are dropped)
bikedata_df = pd.merge(bike_df, survey_df, left_on='stolen_zipcode', right_on='zipcode')
bikedata_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2687 entries, 0 to 2686
Data columns (total 50 columns):
Unnamed: 0                                        2687 non-null int64
Unnamed: 0.1                                      2687 non-null int64
date_stolen                                       2687 non-null object
frame_colors                                      2687 non-null object
frame_model                                       2479 non-null object
id_x                                              2687 non-null int64
is_stock_img                                      2687 non-null bool
large_img                                         2065 non-null object
manufacturer_name                                 2687 non-null object
serial                                            2678 non-null object
stolen                                            2687 non-null bool
stolen_location                                   2687 non-null object
thumb                                           
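
The inner merge keeps only bikes whose `stolen_zipcode` appears in the survey table, which would explain the drop from 7302 rows in the cleaned bike data to 2687 rows after merging. One way to inspect the rows that fail to match is a left merge with `indicator=True`. The sketch below uses made-up rows, and the column values are hypothetical.

```python
import pandas as pd

# Made-up rows standing in for the bike and survey tables
bikes = pd.DataFrame({'id': [1, 2, 3],
                      'stolen_zipcode': [94105.0, 94704.0, 94110.0]})
survey = pd.DataFrame({'zipcode': [94105.0, 94110.0],
                       'dage': [45.8, 47.2]})

# how='left' keeps every bike; _merge records whether a survey row was found
checked = pd.merge(bikes, survey, left_on='stolen_zipcode',
                   right_on='zipcode', how='left', indicator=True)

unmatched = checked[checked['_merge'] == 'left_only']
print(unmatched['stolen_zipcode'].tolist())
```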

# Part 2: Brief
*Perform EDA on the dataset*

## Exploratory Data Summary
1. Create an exploratory data analysis notebook.
2. Perform statistical analysis, along with any visualizations.
3. Determine how to handle sampling or missing values.
4. Clearly identify shortcomings, assumptions, and next steps.


In [31]:
# Data cleaning, exploration, and analysis tools
import pandas as pd
import seaborn as sns
import numpy as np
from ast import literal_eval
import re


# Part 3: Technical Notebook

*A detailed Jupyter Notebook with a summary of your analysis, approach, and evaluation metrics.*

Note: Here are some things to consider in your notebook: sample size, correlations, feature importance, unexplained variance or outliers, variable selection, train/test comparison, and any relationships between your target and independent variables.

In [32]:
# Import the Machine Learning Libraries

# Data preparation for machine learning models
from sklearn.model_selection import train_test_split  # split data into training and testing sets
from sklearn.feature_selection import SelectKBest  # keep the k features that score highest against y
from sklearn.feature_selection import mutual_info_regression  # scoring function passed to SelectKBest
from sklearn.preprocessing import StandardScaler  # standardize features to zero mean and unit variance

# Machine learning models
from sklearn.neighbors import KNeighborsRegressor
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LinearRegression
# reg = LinearRegression(fit_intercept=True)
# fit_intercept=True (the default) adds an intercept term to the linear model; it is rarely set to False

# Error measures
from sklearn.dummy import DummyRegressor
# Use DummyRegressor as a naive baseline to compare the linear regression against

from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
from sklearn.metrics import accuracy_score
from sklearn.metrics import mean_squared_error, mean_absolute_error
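
A minimal sketch of how some of these imports might fit together, comparing a linear regression against the `DummyRegressor` baseline. It runs on synthetic data, not the bike data, and the feature, target, and split sizes are all assumptions chosen for illustration.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic data (assumption): y depends linearly on one feature plus noise
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = 3.0 * X[:, 0] + rng.normal(0, 1, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Baseline that always predicts the mean of the training targets
baseline = DummyRegressor(strategy='mean').fit(X_train, y_train)
model = LinearRegression(fit_intercept=True).fit(X_train, y_train)

baseline_mae = mean_absolute_error(y_test, baseline.predict(X_test))
model_mae = mean_absolute_error(y_test, model.predict(X_test))
print(baseline_mae, model_mae)
```

A model is only worth keeping if its test error is clearly below the baseline's; on real data the gap may be far smaller than in this synthetic case.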