# Final Capstone Project - Week 5 - The Battle of the Neighborhoods - Applied Data Science Capstone - IBM/Coursera

> ## Targeted Marketing Campaign Prediction & User Behavior Analysis using Classification & Regression Algorithms

> ## Brandy Guillory

> ## December 26, 2019

## Table of contents
1. [Introduction – Business Understanding](#0)
2. [Data Science Methodology](#datasciencemethodology)
3. [Results](#results)
4. [Discussion](#discussion)
5. [Conclusion](#conclusion)

## 1. Introduction – Business Understanding<a id="0"></a>

### 1.1 Background

Marketing has evolved considerably. Since the 1900s, starting with radio and television, the marketing focus was on "selling". The Golden Age of Advertising introduced ads such as "Uncle Sam Wants You for the Army" and "Eat Your Wheaties". Marketing then became more personalized, with a focus on brand awareness and problem solving. The digital ad revolution followed, beginning with online advertising in the 1990s and mobile ads in 2000. Today a plethora of data about users is collected daily, and harnessing it makes it possible to produce more targeted, personalized ad campaigns that improve customer experience and generate revenue.

### 1.2 Problem

An employee at a fictitious big data marketing company, Insights, LLC, has been tasked with helping its customer determine four ideal marketing campaigns based on user behavior in specific geographic regions, along with appropriate customer segmentation based on this data.

### 1.3 Interest

Insights, LLC has a customer who would like to create more personalized ad campaigns for its target customer segments to increase revenue and customer satisfaction. Given the plethora of data collected on its customers and the speed at which it is collected, the customer wants to harness that data to either a) increase revenue via new products/services or b) identify user behavior behind both positive & negative trends in customer satisfaction. I am using data science methodology to solve this business problem.

## 2. Data Science Methodology<a name="datasciencemethodology"></a>

### 2.1 Data Requirements – Data Tooling, Sources & Collection

The data tooling I will be using is the Python language (for data cleansing, data manipulation, data modeling, data analytics & visualization) and a Jupyter notebook for sharing code & data analysis, pushed to GitHub for source control.


The customer has asked me to gather insights for three specific cities: 
- Atlanta
- San Francisco
- New Orleans
 

I am using the following data: 
- Wikipedia web scrape: neighborhood data for the various cities
- Nominatim geocoder (via geopy): retrieval of the latitude and longitude of the neighborhoods
- Foursquare Places API: venue & likes data for these neighborhoods
- Kaggle datasets: SF crime data & Atlanta salary/occupation data

The data that I will be using will be both structured & unstructured.

Factors that will influence the marketing campaign: likes per venue, crime, and the venues most frequented per neighborhood.
 

In [1]:
%%capture
!pip install geocoder
!pip install folium


#import statements, installation of libraries for data visualization/analysis & geographical data retrieval
import pandas as pd
import geocoder
from geopy.geocoders import Nominatim
import requests
from bs4 import BeautifulSoup
from tabulate import tabulate
import folium 
import urllib.request
from urllib.request import urlopen
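try:
    from pandas import json_normalize  # top-level location in pandas >= 1.0
except ImportError:
    from pandas.io.json import json_normalize  # location in older pandas releases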

### Wikipedia web scrape for neighborhood data

#### Atlanta neighborhood data web scrape

In [2]:
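# Note: this Wikipedia page lists Georgia counties rather than Atlanta neighborhoods
res = requests.get("https://en.wikipedia.org/wiki/List_of_counties_in_Georgia")
soup = BeautifulSoup(res.content,'lxml')
table = soup.find_all('table')
# read_html returns one DataFrame per <table> element; stack them into a single frame
atlantaneighborhood_df = pd.read_html(str(table))
# sort=True keeps the alphabetical column order that the later positional rename relies on
atlantaneighborhood_df = pd.concat(atlantaneighborhood_df, sort=True)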
#atlantaneighborhood_df.head(50)
atlantaneighborhood_df.tail(50)

Unnamed: 0,0,1,Area[10],Counties of Georgia,Counties of Georgia.1,County,County seat[10],Density,Est.[10],Etymology[11],FIPS code[9],Map,Origin[11],Population[12],vteLists of United States counties and county equivalents,vteLists of United States counties and county equivalents.1,vte State of Georgia,vte State of Georgia.1,vte State of Georgia.2
120,,,324 sq mi(839 km2),,,Richmond County,Augusta,625.27,1777.0,"Charles Lennox, 3rd Duke of Richmond (1735–180...",245.0,,St Paul Parish,202587.0,,,,,
121,,,131 sq mi(339 km2),,,Rockdale County,Conyers,655.11,1870.0,"Rockdale Church, which was so named for the su...",247.0,,Henry and Newton counties,85820.0,,,,,
122,,,168 sq mi(435 km2),,,Schley County,Ellaville,29.7,1857.0,"William Schley (1786–1858), governor of Georgi...",249.0,,Marion and Sumter counties,4990.0,,,,,
123,,,"648 sq mi(1,678 km2)",,,Screven County,Sylvania,21.92,1793.0,"General James Screven (1744–1778), a hero of t...",251.0,,Burke and Effingham Counties,14202.0,,,,,
124,,,238 sq mi(616 km2),,,Seminole County,Donalsonville,37.59,1920.0,Seminole Nation,253.0,,Decatur and Early Counties,8947.0,,,,,
125,,,198 sq mi(513 km2),,,Spalding County,Griffin,322.55,1851.0,"Thomas Spalding (1774–1851), U.S. Congressman,...",255.0,,"Fayette, Henry, and Pike County",63865.0,,,,,
126,,,179 sq mi(464 km2),,,Stephens County,Toccoa,144.64,1905.0,"Alexander Stephens (1812–83), U.S. Congressman...",257.0,,Franklin and Habersham Counties,25891.0,,,,,
127,,,"459 sq mi(1,189 km2)",,,Stewart County,Lumpkin,13.16,1830.0,"General Daniel Stewart (1759–1829), a hero of ...",259.0,,Randolph County,6042.0,,,,,
128,,,"485 sq mi(1,256 km2)",,,Sumter County,Americus,65.06,1831.0,"General Thomas Sumter (1734–1832), the ""Fighti...",261.0,,Lee County,31554.0,,,,,
129,,,"393 sq mi(1,018 km2)",,,Talbot County,Talbotton,16.58,1827.0,"Matthew Talbot (1762–1827), served in the Geor...",263.0,,Muscogee County,6517.0,,,,,


#### San Francisco neighborhood data web scrape

In [3]:
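res = requests.get("http://www.healthysf.org/bdi/outcomes/zipmap.htm")
soup = BeautifulSoup(res.content,'lxml')
table = soup.find_all('table')
# read_html returns one DataFrame per <table> element; stack them into a single frame
sanfrancisconeighborhood_df = pd.read_html(str(table))
# sort=True makes the column ordering explicit across pandas versions
sanfrancisconeighborhood_df = pd.concat(sanfrancisconeighborhood_df, sort=True)
sanfrancisconeighborhood_df.head(5)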

Unnamed: 0,0,1,2,3,4
0,,San Francisco Burden of Disease & Injury Study...,,,
1,,About Site Determinants Health Outcomes Web Li...,,,
0,About Site,Determinants,Health Outcomes,Web Links,Site Map
0,About Site,Determinants,Health Outcomes,Web Links,Site Map
0,San Francisco Neighborhoods as ZIP Codes Zip C...,Zip Code Neighborhood 94102 Hayes Valley/Tende...,,,


#### New Orleans neighborhood data web scrape

In [4]:
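res = requests.get("https://en.wikipedia.org/wiki/Neighborhoods_in_New_Orleans")
soup = BeautifulSoup(res.content,'lxml')
table = soup.find_all('table')
# read_html returns one DataFrame per <table> element; stack them into a single frame
neworleansneighborhood_df = pd.read_html(str(table))
# sort=True makes the column ordering explicit and avoids the pandas FutureWarning about sorting
neworleansneighborhood_df = pd.concat(neworleansneighborhood_df, sort=True)
neworleansneighborhood_df.head(5)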


Unnamed: 0,Latitude,Longitude,Neighborhood,vte City of New Orleans,vte City of New Orleans.1,vte City of New Orleans.2
0,29.946085,-90.026093,U.S. NAVAL BASE,,,
1,29.952462,-90.051606,ALGIERS POINT,,,
2,29.9472,-90.042357,WHITNEY,,,
3,29.932994,-90.12145,AUDUBON,,,
4,29.92444,-90.0,OLD AURORA,,,


### Foursquare data collection: coordinates of three SF neighborhoods, venues near those neighborhoods, and like/check-in data for the venues

#### Hide code below for API creds

In [None]:
# The code was removed by Watson Studio for sharing.

#### To create an instance of the geocoder, we need to define a user_agent. We name our agent <em>foursquare_agent</em>, as shown below, and get the coordinates of three SF neighborhoods via a function
#### In addition, I pull all nearby venues around those coordinates, using the radius set in the code (radius=100; the Foursquare radius parameter is expressed in meters)

In [None]:
def getcoords(address):
    """Return (latitude, longitude) for an address using the Nominatim geocoder."""
    geolocator = Nominatim(user_agent="foursquare_agent")
    location = geolocator.geocode(address)
    # geocode() returns None when the address cannot be resolved
    if location is None:
        raise ValueError('No geocoding result for {!r}'.format(address))
    return location.latitude, location.longitude

address = 'the castro san francisco'
#address = 'Potrero Hill san francisco'
#address = 'Chinatown san francisco'
coords = getcoords(address)
print('Coordinates of {}: {}'.format(address, coords))

Coordinates of the castro san francisco: (37.7608561, -122.434957)
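
The lookup above handles one address at a time, with the other two neighborhoods commented out. A minimal sketch of geocoding all three in one pass follows. The helper name `geocode_all` and its `delay_seconds` parameter are hypothetical, and the one-second pause is an assumption meant to stay within the public Nominatim service's usage limits. The geocoder is passed in as a callable so the helper can be tried without network access.

```python
import time
from typing import Callable, Dict, Iterable, Optional, Tuple

Coords = Tuple[float, float]


def geocode_all(addresses: Iterable[str],
                geocode: Callable[[str], object],
                delay_seconds: float = 1.0) -> Dict[str, Optional[Coords]]:
    """Geocode each address; store None where the geocoder finds nothing."""
    results: Dict[str, Optional[Coords]] = {}
    for address in addresses:
        location = geocode(address)
        if location is None:
            results[address] = None
        else:
            results[address] = (location.latitude, location.longitude)
        # pause between requests (assumed courtesy delay for a shared service)
        time.sleep(delay_seconds)
    return results


# Usage with geopy (requires network access):
# from geopy.geocoders import Nominatim
# geolocator = Nominatim(user_agent="foursquare_agent")
# coords_by_address = geocode_all(
#     ["the castro san francisco",
#      "Potrero Hill san francisco",
#      "Chinatown san francisco"],
#     geolocator.geocode,
# )
```

Returning `None` for unresolved addresses, rather than raising, keeps one bad address from aborting the whole batch; whether that is preferable depends on how the results are used downstream.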


#### Retrieving Category Type from rows and returning the names

In [None]:
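def get_category_type(row):
    """Return the name of a venue's first category, or None if it has none."""
    # the category list lives under 'categories' or 'venue.categories',
    # depending on how the dataframe was built
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']

    if len(categories_list) == 0:
        return None
    return categories_list[0]['name']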

#### Using Foursquare API to explore venues then convert result into a pandas dataframe from JSON

In [None]:
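radius = 100  # search radius passed to Foursquare (meters)
geolocator = Nominatim(user_agent="foursquare_agent")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
# CLIENT_ID, CLIENT_SECRET, VERSION and LIMIT come from the hidden credentials cell
url = 'https://api.foursquare.com/v2/venues/explore'
params = {
    'client_id': CLIENT_ID,
    'client_secret': CLIENT_SECRET,
    'v': VERSION,
    'll': '{},{}'.format(latitude, longitude),
    'radius': radius,
    'limit': LIMIT,
}
# let requests build and encode the query string
results = requests.get(url, params=params).json()
venues = results['response']['groups'][0]['items']
# flatten the nested venue JSON into one row per venue
nearby_venues_df = json_normalize(venues)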
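
To make the flattening step concrete, here is a small sketch of what `json_normalize` does to nested venue records. The payload below is hypothetical and heavily trimmed; it only mimics the shape of `results['response']['groups'][0]['items']` and is not real API output. The sketch assumes pandas 1.0 or later, where the function is available as `pd.json_normalize`.

```python
import pandas as pd

# Hypothetical, trimmed records shaped like the Foursquare explore items
sample_items = [
    {"venue": {"id": "a1", "name": "Example Cafe",
               "categories": [{"name": "Coffee Shop"}],
               "location": {"lat": 37.76, "lng": -122.43}}},
    {"venue": {"id": "b2", "name": "Example Books",
               "categories": [],
               "location": {"lat": 37.77, "lng": -122.44}}},
]

# Nested dicts become dotted column names; lists are left as list values
flat = pd.json_normalize(sample_items)
print(sorted(flat.columns))
```

Nested dictionaries turn into dotted column names such as `venue.location.lat`, which is why the filtering step selects columns by those names, while `venue.categories` remains a list that still needs `get_category_type`.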

#### Filter columns by Venue ID, Venue Name, Venue Category & Venue Lat & Long

In [None]:
filtered_columns = ['venue.id', 'venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
nearby_venues_df = nearby_venues_df.loc[:, filtered_columns]
# replace each row's category list with the name of its first category
nearby_venues_df['venue.categories'] = nearby_venues_df.apply(get_category_type, axis=1)
# clean column names: keep only the part after the last dot
nearby_venues_df.columns = [col.split(".")[-1] for col in nearby_venues_df.columns]
nearby_venues_df.rename(columns={'lat': 'Latitude', 'lng': 'Longitude'}, inplace=True)
nearby_venues_df.head(5)

Unnamed: 0,id,name,categories,Latitude,Longitude
0,49fe3629f964a5207f6f1fe3,Yoga Tree Castro,Yoga Studio,37.761051,-122.436003
1,5613f02d498e46b274c37a65,Philz Coffee,Coffee Shop,37.760104,-122.434829
2,58f18932d7627e564887c486,The Castro Fountain,Ice Cream Shop,37.760052,-122.435024
3,52efb33f498e5300bf66e245,Réveille Coffee Co.,Coffee Shop,37.761104,-122.43443
4,574f605b498ee997a05fa2d7,Dog Eared Books,Bookstore,37.761206,-122.434959


#### Set up to pull the likes from the API based on venue ID

In [None]:
# set up to pull the likes from the API based on venue ID
venue_id_list = nearby_venues_df['id'].tolist()

like_list = []

# call the likes endpoint for each venue and collect the like counts in a list
for venue_id in venue_id_list:
    venue_url = 'https://api.foursquare.com/v2/venues/{}/likes?client_id={}&client_secret={}&v={}'.format(venue_id, CLIENT_ID, CLIENT_SECRET, VERSION)
    result = requests.get(venue_url).json()
    like_list.append(result['response']['likes']['count'])
print(like_list)

[82, 293, 61, 632, 51, 159, 24, 68, 142, 70, 184, 12, 3, 232, 24, 157, 14, 245, 192, 109, 16, 225, 229, 2, 15]
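
Because `like_list` is built in the same order as the venue IDs, the counts can be attached to the venue table as a new column. The sketch below uses only the first five venues and the first five like counts printed above. The column name `likes` is an assumption for illustration.

```python
import pandas as pd

# First five venues from the table above, in their original row order
venues = pd.DataFrame({
    "name": ["Yoga Tree Castro", "Philz Coffee", "The Castro Fountain",
             "Réveille Coffee Co.", "Dog Eared Books"],
    "categories": ["Yoga Studio", "Coffee Shop", "Ice Cream Shop",
                   "Coffee Shop", "Bookstore"],
})
likes = [82, 293, 61, 632, 51]  # first five values of like_list

# Guard against a length mismatch before aligning by position
if len(likes) != len(venues):
    raise ValueError("expected one like count per venue")

venues["likes"] = likes
ranked = venues.sort_values("likes", ascending=False).reset_index(drop=True)
print(ranked)
```

Positional alignment only holds while the venue table is not re-sorted or filtered between the two steps; merging on the venue `id` would be a more defensive choice.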


#### Foursquare check-in data (optional)

### Kaggle Data Collection - San Francisco Crime by Neighborhood

In [None]:
# The code was removed by Watson Studio for sharing.

Unnamed: 0,Dates,Category,Description,DayOfWeek,Neighborhood,Resolution,Address,Latitude,Longitude
0,5/13/15 23:53,WARRANTS,WARRANT ARREST,Wednesday,NORTHERN,"ARREST, BOOKED",OAK ST / LAGUNA ST,-122.425892,37.774599
1,5/13/15 23:53,OTHER OFFENSES,TRAFFIC VIOLATION ARREST,Wednesday,NORTHERN,"ARREST, BOOKED",OAK ST / LAGUNA ST,-122.425892,37.774599
2,5/13/15 23:33,OTHER OFFENSES,TRAFFIC VIOLATION ARREST,Wednesday,NORTHERN,"ARREST, BOOKED",VANNESS AV / GREENWICH ST,-122.424363,37.800414
3,5/13/15 23:30,LARCENY/THEFT,GRAND THEFT FROM LOCKED AUTO,Wednesday,NORTHERN,NONE,1500 Block of LOMBARD ST,-122.426995,37.800873
4,5/13/15 23:30,LARCENY/THEFT,GRAND THEFT FROM LOCKED AUTO,Wednesday,PARK,NONE,100 Block of BRODERICK ST,-122.438738,37.771541
5,5/13/15 23:30,LARCENY/THEFT,GRAND THEFT FROM UNLOCKED AUTO,Wednesday,INGLESIDE,NONE,0 Block of TEDDY AV,-122.403252,37.713431
6,5/13/15 23:30,VEHICLE THEFT,STOLEN AUTOMOBILE,Wednesday,INGLESIDE,NONE,AVALON AV / PERU AV,-122.423327,37.725138
7,5/13/15 23:30,VEHICLE THEFT,STOLEN AUTOMOBILE,Wednesday,BAYVIEW,NONE,KIRKWOOD AV / DONAHUE ST,-122.371274,37.727564
8,5/13/15 23:00,LARCENY/THEFT,GRAND THEFT FROM LOCKED AUTO,Wednesday,RICHMOND,NONE,600 Block of 47TH AV,-122.508194,37.776601
9,5/13/15 23:00,LARCENY/THEFT,GRAND THEFT FROM LOCKED AUTO,Wednesday,CENTRAL,NONE,JEFFERSON ST / LEAVENWORTH ST,-122.419088,37.807802
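
In the output above, the `Latitude` column holds values near -122 and the `Longitude` column holds values near 37, which suggests the two labels are swapped for San Francisco. A minimal sketch, using the first five rows shown, of swapping the labels, parsing the timestamps and counting incidents per neighborhood follows. The date format string is an assumption inferred from the displayed values.

```python
import pandas as pd

# First five rows of the crime table above (selected columns only)
crime = pd.DataFrame({
    "Dates": ["5/13/15 23:53", "5/13/15 23:53", "5/13/15 23:33",
              "5/13/15 23:30", "5/13/15 23:30"],
    "Category": ["WARRANTS", "OTHER OFFENSES", "OTHER OFFENSES",
                 "LARCENY/THEFT", "LARCENY/THEFT"],
    "Neighborhood": ["NORTHERN", "NORTHERN", "NORTHERN", "NORTHERN", "PARK"],
    "Latitude": [-122.425892, -122.425892, -122.424363, -122.426995, -122.438738],
    "Longitude": [37.774599, 37.774599, 37.800414, 37.800873, 37.771541],
})

# Swap the two column labels so each name matches the values it holds
crime = crime.rename(columns={"Latitude": "Longitude", "Longitude": "Latitude"})

# Parse the timestamp strings (format assumed from the sample rows)
crime["Dates"] = pd.to_datetime(crime["Dates"], format="%m/%d/%y %H:%M")

# Incident count per neighborhood, largest first
incidents_per_neighborhood = (
    crime.groupby("Neighborhood").size().sort_values(ascending=False)
)
print(incidents_per_neighborhood)
```

A per-neighborhood count like this is one simple way to turn the crime data into the "crime" factor listed among the campaign inputs.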


### 2.3 Data Cleansing & Preprocessing

Now that we have sourced our data through web scraping and downloads, we must cleanse & preprocess it. We will identify missing data, duplicates, anomalies and corruption within our dataframes, and fix or remove them. We will also rename columns and merge some of the dataframes.
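
As a rough illustration of these checks, the sketch below counts missing values and duplicate rows on a toy dataframe and then removes them. The frame and its `Unused` column are made up for the example, with two zip codes borrowed from the San Francisco data.

```python
import pandas as pd

# Toy frame: one duplicated row, one fully empty row, one fully empty column
df = pd.DataFrame({
    "Zip Code": ["94102", "94103", "94103", None],
    "Neighborhood": ["Hayes Valley/Tenderloin/North of Market",
                     "South of Market", "South of Market", None],
    "Unused": [None, None, None, None],
})

missing_per_column = df.isna().sum()   # missing values in each column
duplicate_rows = df.duplicated().sum() # rows identical to an earlier row

cleaned = (
    df.dropna(axis=1, how="all")       # drop columns that are entirely empty
      .dropna(axis=0, how="all")       # drop rows that are entirely empty
      .drop_duplicates()               # keep the first copy of repeated rows
      .reset_index(drop=True)
)
print(missing_per_column, duplicate_rows, cleaned, sep="\n\n")
```

Inspecting the counts before dropping anything makes it easier to judge whether removal or repair is the better fix for a given column.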

#### Atlanta neighborhood data cleanse/preprocessing



In [None]:
# Drop the first 8 and last 11 rows, where most values are blank
atlantaneighborhood_df = atlantaneighborhood_df.iloc[8:-11]
# Drop columns whose values are all blank (axis=1 selects columns)
atlantaneighborhood_df = atlantaneighborhood_df.dropna(axis=1, how='all')
# Drop irrelevant columns that contain data unrelated to our problem/project
atlantaneighborhood_df = atlantaneighborhood_df.drop(columns=['County seat[10]', 'Density', 'Est.[10]', 'Etymology[11]', 'FIPS code[9]', 'Origin[11]'])
# Rename the remaining columns to be more meaningful (relies on their alphabetical order)
atlantaneighborhood_df.columns = ['Area', 'County', 'Population']
atlantaneighborhood_df.head()

Unnamed: 0,Area,County,Population
0,"509 sq mi(1,318 km2)",Appling County,18368.0
1,338 sq mi(875 km2),Atkinson County,8284.0
2,285 sq mi(738 km2),Bacon County,11198.0
3,343 sq mi(888 km2),Baker County,3366.0
4,258 sq mi(668 km2),Baldwin County,46367.0
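
The `Area` column is still free text such as `509 sq mi(1,318 km2)`, so it cannot be used numerically as it stands. A minimal sketch of extracting both figures with a regular expression follows. The helper name `parse_area` and the pattern are assumptions based on the five values shown above.

```python
import re
from typing import Optional, Tuple

# Matches strings like "509 sq mi(1,318 km2)", allowing thousands separators
AREA_PATTERN = re.compile(r"([\d,.]+)\s*sq\s*mi\s*\(\s*([\d,.]+)\s*km2\)")


def parse_area(text: str) -> Optional[Tuple[float, float]]:
    """Return (square miles, square kilometres), or None if the text does not match."""
    match = AREA_PATTERN.search(text)
    if match is None:
        return None
    sq_mi, sq_km = (float(group.replace(",", "")) for group in match.groups())
    return sq_mi, sq_km


print(parse_area("509 sq mi(1,318 km2)"))

# Possible use on the cleansed frame:
# atlantaneighborhood_df['Area_sq_mi'] = (
#     atlantaneighborhood_df['Area'].map(lambda s: parse_area(s)[0])
# )
```

Values that do not follow the pattern come back as `None`, so the commented usage would need a guard if the scraped table contains any such rows.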


#### San Francisco neighborhood data cleanse/preprocessing

In [None]:
# Rename columns to more meaningful names
sanfrancisconeighborhood_df.columns = ['Zip Code', 'Neighborhood' , 'Population', 'Drop1', 'Drop2']
# Drop NaN columns 
sanfrancisconeighborhood_df = sanfrancisconeighborhood_df.drop(columns=['Drop1', 'Drop2'])
# Drop duplicate values in column Zip Code
sanfrancisconeighborhood_df = sanfrancisconeighborhood_df.drop_duplicates(subset='Zip Code', keep='first')
# Sort values by zip code so irrelevant data goes to bottom
sanfrancisconeighborhood_df = sanfrancisconeighborhood_df.sort_values('Zip Code')
# Count of remaining records
print(sanfrancisconeighborhood_df.shape[0])
# Drop the last 6 rows, which contain irrelevant data
sanfrancisconeighborhood_df = sanfrancisconeighborhood_df.iloc[0:-6,]
sanfrancisconeighborhood_df

27


Unnamed: 0,Zip Code,Neighborhood,Population
1,94102,Hayes Valley/Tenderloin/North of Market,28991
2,94103,South of Market,23016
3,94107,Potrero Hill,17368
4,94108,Chinatown,13716
5,94109,Polk/Russian Hill (Nob Hill),56322
6,94110,Inner Mission/Bernal Heights,74633
7,94112,Ingelside-Excelsior/Crocker-Amazon,73104
8,94114,Castro/Noe Valley,30574
9,94115,Western Addition/Japantown,33115
10,94116,Parkside/Forest Hill,42958


#### Render San Francisco crime incidents by neighborhood on a map

In [None]:
tmap=folium.Map(location=[latitude,longitude],zoom_start=11)
# Note: in sfcrimebyneighborhood_df the 'Latitude' column holds longitudes (about -122) and
# the 'Longitude' column holds latitudes (about 37), so the two are swapped while unpacking
for lng,lat,descr,neighb in zip(sfcrimebyneighborhood_df['Latitude'],sfcrimebyneighborhood_df['Longitude'],sfcrimebyneighborhood_df['Description'],sfcrimebyneighborhood_df['Neighborhood']):
    label='{}, {}'.format(descr,neighb)
    label=folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat,lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7).add_to(tmap)
tmap

### 2.4 Exploratory Data Analysis

In [None]:
venues_map=folium.Map(location=[latitude, longitude], zoom_start=13) # generate map centred around the Castro

# add a red circle marker to represent the castro neighborhood
folium.CircleMarker(
    [latitude, longitude],
    radius=10,
    color='red',
    popup='the castro san francisco',
    fill = True,
    fill_color = 'red',
    fill_opacity = 0.6
).add_to(venues_map)

# add the venues within the proximity as blue circle markers
for lat, lng, label in zip(nearby_venues_df.Latitude, nearby_venues_df.Longitude, nearby_venues_df.categories):
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        color='blue',
        popup=label,
        fill = True,
        fill_color='blue',
        fill_opacity=0.6
    ).add_to(venues_map)

# display map
venues_map
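
Alongside the map, a frequency count of venue categories gives a quick, non-visual view of what kinds of venues surround a neighborhood. The sketch below uses only the five venues shown earlier for the Castro; the full `nearby_venues_df` would be used in the same way.

```python
import pandas as pd

# First five Castro venues from the earlier output
sample_venues = pd.DataFrame({
    "name": ["Yoga Tree Castro", "Philz Coffee", "The Castro Fountain",
             "Réveille Coffee Co.", "Dog Eared Books"],
    "categories": ["Yoga Studio", "Coffee Shop", "Ice Cream Shop",
                   "Coffee Shop", "Bookstore"],
})

# Number of venues per category, most common first
category_counts = sample_venues["categories"].value_counts()
print(category_counts)
```

Running the same count per neighborhood would be one way to approximate the "venues most frequented per neighborhood" factor, under the assumption that venue density reflects how often a category is visited.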

> #### 2.5.1 Regression

> #### 2.5.2 Classification

## 3. Results<a name="results"></a>

## 4. Discussion<a name="discussion"></a>

## 5. Conclusion<a name="conclusion"></a>