# Final Capstone Project - Week 5 - The Battle of the Neighborhoods - Applied Data Science Capstone - IBM/Coursera

> ## Targeted Marketing Campaign Prediction & User Behavior Analysis using Classification & Regression Algorithms

> ## Brandy Guillory

> ## December 26, 2019

## Table of contents
1. [Introduction – Business Understanding](#0)
2. [Data Science Methodology](#datasciencemethodology)
3. [Results](#results)
4. [Discussion](#discussion)
5. [Conclusion](#conclusion)

## 1. Introduction – Business Understanding<a id="0"></a>

### 1.1 Background

Marketing has evolved considerably. Starting in the 1900s, on both radio and television, the marketing focus was on "selling". The Golden Age of Advertising introduced such ads as "Uncle Sam Wants You for the Army" and "Eat Your Wheaties". Marketing then became more personalized, with a focus on brand awareness and problem solving. Next came the digital ad revolution, which began with online advertising in the 1990s and mobile ads in 2000. Today a plethora of data is collected daily about users, and harnessing this data makes it possible to produce more targeted and personalized ad campaigns that improve the customer experience and generate revenue.

### 1.2 Problem

An employee at a fictitious big data marketing company, Insights, LLC, has been tasked with helping its customer determine four ideal marketing campaigns, based on user behavior in specific geographic regions and on customer segmentation derived from this data.

### 1.3 Interest

Insights, LLC has a customer who would like to create more personalized ad campaigns for its target customer segments to increase revenue and customer satisfaction. Given the volume of data collected on its customers and the speed at which it is collected, the customer wants to harness that data to either (a) increase revenue through new products and services or (b) identify user behavior that signals positive and negative trends in customer satisfaction. I am using data science methodology to solve this business problem.

## 2. Data Science Methodology<a name="datasciencemethodology"></a>

### 2.1 Data Requirements – Data Tooling, Sources & Collection

I will use Python for data cleansing, data manipulation, data modeling, data analytics, and visualization, and a Jupyter notebook for sharing code and data analysis, pushed to GitHub for source control.


The customer has asked me to gather insights for three specific cities:
- Atlanta
- San Francisco
- New Orleans

I am using the following data:
- Wikipedia web scrape: neighborhood data for the various cities
- Nominatim geocoder (via geopy): retrieval of the latitude and longitude of the neighborhoods
- Foursquare Places API: venue and check-in data for these neighborhoods
- Kaggle datasets: crime data

The data I will be using is both structured and unstructured.

Factors that will influence the marketing campaign: likes per venue, crime, and the venues most frequented per neighborhood.


In [24]:
%%capture
!pip install geocoder
!pip install folium


# Import libraries for data analysis/visualization and geographical data retrieval

import pandas as pd
import geocoder
from geopy.geocoders import Nominatim
import requests
from bs4 import BeautifulSoup
from tabulate import tabulate
import folium
import urllib.request
from urllib.request import urlopen
from pandas import json_normalize  # pandas >= 1.0; replaces the deprecated pandas.io.json import
from io import StringIO
from zipfile import ZipFile



### Wikipedia web scrape for neighborhood data

#### Atlanta neighborhood data web scrape

In [2]:
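# Scrape every HTML table on the page and concatenate them into one DataFrame.
# Note: this URL is Wikipedia's list of Georgia counties, so the raw result below is county-level data.
res = requests.get("https://en.wikipedia.org/wiki/List_of_counties_in_Georgia")
soup = BeautifulSoup(res.content,'lxml')
table = soup.find_all('table')
df = pd.read_html(StringIO(str(table)))  # wrap the HTML in StringIO; newer pandas expects a file-like object
df = pd.concat(df)
df.head(5)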

Unnamed: 0,0,1,Area[10],Counties of Georgia,Counties of Georgia.1,County,County seat[10],Density,Est.[10],Etymology[11],FIPS code[9],Map,Origin[11],Population[12],vteLists of United States counties and county equivalents,vteLists of United States counties and county equivalents.1,vte State of Georgia,vte State of Georgia.1,vte State of Georgia.2
0,,,,AP AT BC BX BL BA BW BR BH BE BI BY BT BO BN B...,AP AT BC BX BL BA BW BR BH BE BI BY BT BO BN B...,,,,,,,,,,,,,,
1,,,,Location,State of Georgia,,,,,,,,,,,,,,
2,,,,Number,159,,,,,,,,,,,,,,
3,,,,Populations,"Greatest: 1,041,423 (Fulton)Least: 1,680 (Tali...",,,,,,,,,,,,,,
4,,,,Areas,"Largest: 908 square miles (2,350 km2) (Ware)Sm...",,,,,,,,,,,,,,


#### San Francisco neighborhood data web scrape

In [3]:
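# Scrape every HTML table on the ZIP-code page and concatenate them into one DataFrame.
res = requests.get("http://www.healthysf.org/bdi/outcomes/zipmap.htm")
soup = BeautifulSoup(res.content,'lxml')
table = soup.find_all('table')
df1 = pd.read_html(StringIO(str(table)))  # wrap the HTML in StringIO; newer pandas expects a file-like object
df1 = pd.concat(df1)
df1.head(20)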

Unnamed: 0,0,1,2,3,4
0,,San Francisco Burden of Disease & Injury Study...,,,
1,,About Site Determinants Health Outcomes Web Li...,,,
0,About Site,Determinants,Health Outcomes,Web Links,Site Map
0,About Site,Determinants,Health Outcomes,Web Links,Site Map
0,San Francisco Neighborhoods as ZIP Codes Zip C...,Zip Code Neighborhood 94102 Hayes Valley/Tende...,,,
0,Zip Code,Neighborhood,Population (Census 2000),,
1,94102,Hayes Valley/Tenderloin/North of Market,28991,,
2,94103,South of Market,23016,,
3,94107,Potrero Hill,17368,,
4,94108,Chinatown,13716,,
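
The concatenated frame above mixes page navigation with the actual ZIP-code table. Below is a minimal sketch of how the neighborhood rows could be isolated, assuming the layout shown in the output (ZIP code in column `0`, neighborhood in column `1`, population in column `2`). The helper name `extract_zip_rows` and the small sample frame are illustrative and not part of the original notebook.

```python
import pandas as pd


def extract_zip_rows(raw):
    """Keep only rows whose first column looks like a 5-digit ZIP code."""
    first = raw[0].astype(str).str.strip()
    is_zip = first.str.fullmatch(r"\d{5}")
    out = raw.loc[is_zip, [0, 1, 2]].copy()
    out.columns = ["zip_code", "neighborhood", "population"]
    out["population"] = out["population"].astype(int)
    return out.reset_index(drop=True)


# Sample rows copied from the output above (columns 0, 1 and 2 only).
sample = pd.DataFrame({
    0: ["About Site", "Zip Code", "94102", "94103", "94107", "94108"],
    1: ["Determinants", "Neighborhood",
        "Hayes Valley/Tenderloin/North of Market",
        "South of Market", "Potrero Hill", "Chinatown"],
    2: ["Health Outcomes", "Population (Census 2000)",
        "28991", "23016", "17368", "13716"],
})

sf_neighborhoods = extract_zip_rows(sample)
print(sf_neighborhoods)
```

On the real scrape, the same call would be `extract_zip_rows(df1)`, provided the column labels are the integers `0`, `1`, `2` as assumed here.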


#### New Orleans neighborhood data web scrape

In [4]:
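# Scrape every HTML table on the page and concatenate them into one DataFrame.
res = requests.get("https://en.wikipedia.org/wiki/Neighborhoods_in_New_Orleans")
soup = BeautifulSoup(res.content,'lxml')
table = soup.find_all('table')
df3 = pd.read_html(StringIO(str(table)))  # wrap the HTML in StringIO; newer pandas expects a file-like object
df3 = pd.concat(df3)
df3.head(5)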

| | Latitude | Longitude | Neighborhood | vte City of New Orleans | vte City of New Orleans.1 | vte City of New Orleans.2 |
|---|---|---|---|---|---|---|
| 0 | 29.946085 | -90.026093 | U.S. NAVAL BASE | | | |
| 1 | 29.952462 | -90.051606 | ALGIERS POINT | | | |
| 2 | 29.9472 | -90.042357 | WHITNEY | | | |
| 3 | 29.932994 | -90.12145 | AUDUBON | | | |
| 4 | 29.92444 | -90.0 | OLD AURORA | | | |


### Foursquare data collection: coordinates of three SF neighborhoods, venues near those neighborhoods, and like/check-in data for the venues

#### Hide code below for API creds

In [5]:
# The code was removed by Watson Studio for sharing.

#### To create a geocoder instance, we need to define a user_agent. We name our agent <em>foursquare_agent</em>, as shown below, and get the coordinates of three SF neighborhoods via a function
#### In addition, I pull all nearby venues within a 100-meter radius of those coordinates

In [6]:
def getcoords(address):
    """Return (latitude, longitude) for an address using the Nominatim geocoder."""
    geolocator = Nominatim(user_agent="foursquare_agent")
    location = geolocator.geocode(address)
    if location is None:
        # geocode() returns None when the address cannot be resolved
        raise ValueError('Could not geocode address: {}'.format(address))
    return location.latitude, location.longitude


address = 'the castro san francisco'
#address = 'Potrero Hill san francisco'
#address = 'Chinatown san francisco'
coords = getcoords(address)
print('Coordinates of {}: {}'.format(address, coords))



Coordinates of the castro san francisco: (37.7608561, -122.434957)
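
The cell above geocodes one address at a time by commenting lines in and out. Here is a rough sketch of looping over all three neighborhoods instead, with a pause between lookups because the public Nominatim service asks for at most one request per second. The names `geocode_all`, `FakeLocation` and `fake_geocode` are hypothetical. The fake geocoder only knows the Castro coordinates printed above and returns `None` for everything else, so the sketch runs without network access.

```python
import time


def geocode_all(addresses, geocode_fn, pause_seconds=1.0):
    """Return {address: (lat, lon) or None}, pausing between lookups."""
    coords = {}
    for i, address in enumerate(addresses):
        if i > 0:
            time.sleep(pause_seconds)  # stay under the geocoder's rate limit
        location = geocode_fn(address)
        if location is None:
            coords[address] = None
        else:
            coords[address] = (location.latitude, location.longitude)
    return coords


class FakeLocation:
    """Offline stand-in for the object returned by geolocator.geocode."""

    def __init__(self, latitude, longitude):
        self.latitude = latitude
        self.longitude = longitude


def fake_geocode(address):
    # Only the Castro coordinates are known from the notebook output.
    known = {"the castro san francisco": FakeLocation(37.7608561, -122.434957)}
    return known.get(address)


addresses = [
    "the castro san francisco",
    "Potrero Hill san francisco",
    "Chinatown san francisco",
]
result = geocode_all(addresses, fake_geocode, pause_seconds=0)
print(result)
```

With the real service, `geocode_fn` would be `Nominatim(user_agent="foursquare_agent").geocode` and `pause_seconds` would stay at its default.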


#### Retrieving Category Type from rows and returning the names

In [7]:
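def get_category_type(row):
    """Return the name of the first category in a venue row, or None if there is none."""
    try:
        categories_list = row['categories']
    except KeyError:
        # explore results nest the categories under the 'venue.' prefix
        categories_list = row['venue.categories']

    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']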
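
To make the behavior of `get_category_type` concrete, here is a small self-contained sketch that applies it to plain dictionaries standing in for DataFrame rows. The three sample rows are invented for illustration. They only mimic the shape the function expects, namely a list of category dictionaries with a `name` key.

```python
def get_category_type(row):
    """Return the name of the first category in a venue row, or None."""
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']
    if len(categories_list) == 0:
        return None
    return categories_list[0]['name']


# Hypothetical rows: prefixed key, unprefixed key, and an empty category list.
rows = [
    {'venue.categories': [{'name': 'Coffee Shop'}, {'name': 'Bakery'}]},
    {'categories': [{'name': 'Bookstore'}]},
    {'venue.categories': []},
]

names = [get_category_type(r) for r in rows]
print(names)
```

Only the first category is kept, so a venue tagged with several categories is reduced to a single label.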

#### Using Foursquare API to explore venues then convert result into a pandas dataframe from JSON

In [8]:
radius = 100  # search radius in meters
# CLIENT_ID, CLIENT_SECRET, VERSION and LIMIT are defined in the hidden credentials cell
geolocator = Nominatim(user_agent="foursquare_agent")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
    CLIENT_ID,
    CLIENT_SECRET,
    VERSION,
    latitude,
    longitude,
    radius,
    LIMIT)
results = requests.get(url).json()
venues = results['response']['groups'][0]['items']
nearby_venues = json_normalize(venues)  # flatten the nested JSON into a DataFrame
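
As an alternative to assembling the query string with `format`, the parameters can be kept in a dictionary and encoded in one step. The sketch below uses only the standard library. The function name `build_explore_url` and the placeholder values `YOUR_ID`, `YOUR_SECRET` and `20180605` are assumptions for illustration, since the real credentials and version live in the hidden cell.

```python
from urllib.parse import urlencode


def build_explore_url(client_id, client_secret, version, lat, lng, radius_m, limit):
    """Build a Foursquare v2 explore URL from named parameters."""
    base = "https://api.foursquare.com/v2/venues/explore"
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,
        "ll": "{},{}".format(lat, lng),
        "radius": radius_m,  # meters
        "limit": limit,
    }
    # urlencode percent-escapes special characters such as the comma in "ll"
    return base + "?" + urlencode(params)


url = build_explore_url(
    "YOUR_ID", "YOUR_SECRET", "20180605",  # placeholder credentials and version
    37.7608561, -122.434957, 100, 100,
)
print(url)
```

Passing the same dictionary as `requests.get(base, params=params)` would have a similar effect and avoids building the string by hand.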

#### Filter columns by Venue ID, Venue Name, Venue Category & Venue Lat & Long

In [9]:

# keep only the venue ID, name, category and coordinates
filtered_columns = ['venue.id','venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
nearby_venues = nearby_venues.loc[:, filtered_columns].copy()
# replace the category list in each row with the name of its first category
nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis=1)
# clean column names by keeping only the part after the last dot
nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]
nearby_venues.head(10)

| | id | name | categories | lat | lng |
|---|---|---|---|---|---|
| 0 | 49fe3629f964a5207f6f1fe3 | Yoga Tree Castro | Yoga Studio | 37.761051 | -122.436003 |
| 1 | 5613f02d498e46b274c37a65 | Philz Coffee | Coffee Shop | 37.760104 | -122.434829 |
| 2 | 58f18932d7627e564887c486 | The Castro Fountain | Ice Cream Shop | 37.760052 | -122.435024 |
| 3 | 52efb33f498e5300bf66e245 | Réveille Coffee Co. | Coffee Shop | 37.761104 | -122.43443 |
| 4 | 574f605b498ee997a05fa2d7 | Dog Eared Books | Bookstore | 37.761206 | -122.434959 |
| 5 | 4a8cbbdff964a520070f20e3 | Body Clothing | Clothing Store | 37.761713 | -122.435155 |
| 6 | 49e55498f964a520bf631fe3 | Blush! Wine Bar | Wine Bar | 37.76126 | -122.435116 |
| 7 | 552d9a9a498ef6abfdae677a | Lark | Mediterranean Restaurant | 37.760993 | -122.434275 |
| 8 | 4ce453c3f5f9236a1bd58461 | GLBT History Museum | History Museum | 37.760828 | -122.43561 |
| 9 | 584339f68ae36321db2805d3 | Project Juice | Juice Bar | 37.760716 | -122.435023 |


#### Set up to pull the likes from the API based on venue ID

In [10]:
# collect the venue IDs returned by the explore call
venue_id_list = nearby_venues['id'].tolist()

like_list = []

# query the likes endpoint for each venue and store the like counts in a list
for venue_id in venue_id_list:
    venue_url = 'https://api.foursquare.com/v2/venues/{}/likes?client_id={}&client_secret={}&v={}'.format(venue_id, CLIENT_ID, CLIENT_SECRET, VERSION)
    result = requests.get(venue_url).json()
    like_list.append(result['response']['likes']['count'])
print(like_list)

[82, 294, 61, 632, 51, 24, 159, 142, 68, 70, 182, 232, 157, 12, 24, 245, 192, 14, 109, 67, 225, 229, 15, 2]
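
Because the like counts are returned in the same order as the venue IDs, they can be attached to the venue table as a new column and then aggregated. The sketch below does this for the first ten venues and the first ten like counts shown above. Treating total likes per category as a popularity signal is an assumption made for illustration, not a result from the notebook.

```python
import pandas as pd

# First ten venues and like counts, copied from the outputs above.
venues = pd.DataFrame({
    "name": [
        "Yoga Tree Castro", "Philz Coffee", "The Castro Fountain",
        "Réveille Coffee Co.", "Dog Eared Books", "Body Clothing",
        "Blush! Wine Bar", "Lark", "GLBT History Museum", "Project Juice",
    ],
    "categories": [
        "Yoga Studio", "Coffee Shop", "Ice Cream Shop", "Coffee Shop",
        "Bookstore", "Clothing Store", "Wine Bar",
        "Mediterranean Restaurant", "History Museum", "Juice Bar",
    ],
})
venues["likes"] = [82, 294, 61, 632, 51, 24, 159, 142, 68, 70]

# Rank individual venues and whole categories by likes.
top_venues = venues.sort_values("likes", ascending=False)
likes_by_category = (
    venues.groupby("categories")["likes"].sum().sort_values(ascending=False)
)

print(top_venues.head(3))
print(likes_by_category.head(3))
```

In the notebook itself, the equivalent step would be `nearby_venues['likes'] = like_list`, which works only if both have the same length and order.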


In [11]:
#### Foursquare check-in data (optional)

### Kaggle Data Collection - Crime by Neighborhood

In [15]:
# Link to the Kaggle data set & name of zip file
login_url = 'https://www.kaggle.com/kaggle/san-francisco-crime-classification/data/san-francisco-crime-classification.zip'


In [19]:
# The code was removed by Watson Studio for sharing.

In [28]:
import io
import zipfile

# Log in to Kaggle and retrieve the data (kaggle_info is defined in the hidden cell above).
r = requests.post(login_url, data=kaggle_info, stream=True)
r.raise_for_status()

# The request returns bytes, so open the zip archive in memory instead of from disk.
# This assumes the response body really is the zip archive.
mlz = zipfile.ZipFile(io.BytesIO(r.content))

# See what is in the zip file
print(mlz.namelist())

# Extract and read the training CSV from the archive
crimeData = pd.read_csv(mlz.open('train.csv'))
crimeData.head(10)
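
The bytes-to-zip step in the cell above can be tried offline. The following sketch builds a tiny archive in memory to stand in for the downloaded bytes, then opens it with `io.BytesIO` and reads a CSV member without touching the disk. The member name `train.csv` matches the cell above, while the column names and values are placeholders rather than the real Kaggle schema.

```python
import csv
import io
import zipfile

# Build a small zip archive in memory to play the role of the response body.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("train.csv", "col_a,col_b\nx1,y1\nx2,y2\n")
payload = buffer.getvalue()  # bytes, like r.content

# Open the archive straight from bytes, with no file on disk.
with zipfile.ZipFile(io.BytesIO(payload)) as archive:
    names = archive.namelist()
    with archive.open("train.csv") as handle:
        text = io.TextIOWrapper(handle, encoding="utf-8")
        rows = list(csv.DictReader(text))

print(names)
print(rows[0])
```

If the archive cannot be opened this way, the response was probably not a zip file (for example an HTML login page), which is worth checking before parsing.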

### 2.3 Data Cleansing & Manipulation

The data that I downloaded or web scraped was stored in pandas DataFrames for further analysis.
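
As one possible cleansing step, the empty navigation columns picked up by the New Orleans scrape (the `vte ...` columns) can be dropped, and the upper-case neighborhood names normalized. The sketch below uses three rows copied from that output. Title-casing the names is an optional choice assumed here for readability.

```python
import pandas as pd

# Three rows copied from the New Orleans output, with one empty 'vte' column.
raw = pd.DataFrame({
    "Latitude": [29.946085, 29.952462, 29.9472],
    "Longitude": [-90.026093, -90.051606, -90.042357],
    "Neighborhood": ["U.S. NAVAL BASE", "ALGIERS POINT", "WHITNEY"],
    "vte City of New Orleans": [None, None, None],
})

# Drop columns that contain no data at all, then tidy the names.
clean = raw.dropna(axis=1, how="all").copy()
clean["Neighborhood"] = clean["Neighborhood"].str.title()

print(clean)
```

On the full frame, rows that come from the navigation tables have no coordinates, so `dropna(subset=["Latitude", "Longitude"])` would be a natural follow-up.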

### 2.4 Exploratory Data Analysis

In [None]:
venues_map = folium.Map(location=[latitude, longitude], zoom_start=13) # generate map centred around the Castro

# add a red circle marker to represent the Castro neighborhood
folium.CircleMarker(
    [latitude, longitude],
    radius=10,
    color='red',
    popup='the castro san francisco',
    fill = True,
    fill_color = 'red',
    fill_opacity = 0.6
).add_to(venues_map)

# add the venues within the proximity as blue circle markers
for lat, lng, label in zip(nearby_venues.lat, nearby_venues.lng, nearby_venues.categories):
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        color='blue',
        popup=label,
        fill = True,
        fill_color='blue',
        fill_opacity=0.6
    ).add_to(venues_map)

# display map
venues_map
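
Alongside the map, a quick numeric check of how far each venue lies from the neighborhood center can be useful. The sketch below computes great-circle distances with the haversine formula, using the Castro coordinates and two venue coordinates from the outputs above. The function name `haversine_m` and the Earth radius of 6,371 km are assumptions of this sketch.

```python
import math


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two (lat, lon) points."""
    earth_radius_m = 6371000.0  # mean Earth radius, an approximation
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * earth_radius_m * math.asin(math.sqrt(a))


# Center and venue coordinates copied from the notebook outputs.
center = (37.7608561, -122.434957)
venues = {
    "Philz Coffee": (37.760104, -122.434829),
    "Yoga Tree Castro": (37.761051, -122.436003),
}

distances = {name: haversine_m(*center, *pos) for name, pos in venues.items()}
for name, meters in distances.items():
    print("{}: {:.0f} m from center".format(name, meters))
```

Both sample venues come out under the 100-meter search radius, which is consistent with the `radius` value passed to the explore call.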

> #### 2.5.1 Regression

> #### 2.5.2 Classification

## 3. Results<a name="results"></a>

## 4. Discussion<a name="discussion"></a>

## 5. Conclusion<a name="conclusion"></a>