## Background:
One overarching goal of this research project is to determine the feasibility of using the online Yelp platform to collect business listings for retailers of Electronic Nicotine Delivery Systems (ENDS; e.g., vape shops) in Pennsylvania. To this end, I used the Yelp API to collect data for all metropolitan and micropolitan census regions in Pennsylvania (some of which extend into surrounding states). Each API call was set to the maximum search radius of 25 miles, centered on a zip code within each census region (so some geographic overlap was present). 

Data have been collected from the Yelp API at roughly 1-month intervals from September 2016 through the present, and are currently stored in CSV files. Each monthly round of API calls was repeated with several distinct search terms, such as "vape", "vaping", and "ecig", to maximize sensitivity. 

As a result of overlap in search radii and search terms, each monthly data file contains many redundant listings. As the searches were repeated several times, search results are also redundant among the data files. Also, several results are outside of Pennsylvania, so additional data cleaning is needed to limit the scope of results to Pennsylvania ENDS retailers. 
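
As a rough sketch of that cleaning step: assuming each listing carries a unique business identifier (the `id` column name here is an assumption) and that addresses follow the `street | city ST zip` pattern of the `display_address` field, redundant rows can be dropped and out-of-state rows filtered out. Some of the sample rows below are made up for illustration.

```python
import pandas as pd

# Illustrative rows standing in for two monthly files; `id` is an assumed column name
# and the New Jersey listing is invented for the example.
month_a = pd.DataFrame({
    "id": ["vape-flow-trexlertown", "nj-vapes-phillipsburg", "vape-flow-trexlertown"],
    "display_address": [
        "7150 Hamilton Blvd | Trexlertown PA 18087",
        "1 Main St | Phillipsburg NJ 08865",
        "7150 Hamilton Blvd | Trexlertown PA 18087",
    ],
})
month_b = pd.DataFrame({
    "id": ["vape-flow-trexlertown", "get-your-vape-on-emmaus"],
    "display_address": [
        "7150 Hamilton Blvd | Trexlertown PA 18087",
        "610 State Ave | Emmaus PA 18049",
    ],
})

combined = pd.concat([month_a, month_b], ignore_index=True)

# Keep one row per business, then keep only addresses that end in "PA <5-digit zip>".
unique = combined.drop_duplicates(subset="id")
in_pa = unique[unique["display_address"].str.contains(r"\bPA \d{5}$", regex=True)]
pa_ids = sorted(in_pa["id"])
print(pa_ids)
```

Dropping duplicates across months discards the collection date, so for longitudinal questions it may be better to deduplicate within each monthly file and keep the date as a column.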

Data were collected as part of an ongoing research project at the University of Pittsburgh's [Center for Research on Media, Technology, and Health](http://mth.pitt.edu/) and are stored in [CRMTH's GitHub "YelpEpi" repository](https://github.com/CRMTH/YelpEpi/). For the purposes of this project, I forked that repo to [my personal GitHub "YelpEpi" repository](https://github.com/colditzjb/YelpEpi/).



### Loading the data

First, I needed to clone the repository to my computer. I did this from the Bash command line, but I'm including the step here for posterity (it may not work as expected within Jupyter).

In [None]:
!git clone https://github.com/colditzjb/YelpEpi/

Depending on your current working directory (e.g., where you started Jupyter), this repo may end up in a different location on your own computer. Change this next line to point to the correct directory:

In [2]:
dir_in = '/home/jason/repos/YelpEpi/'

Okay, now that we're (hopefully) on the same page, let's navigate to the data subdirectory and see what we're working with...

In [16]:
import os
data_dir = dir_in+'data/'
files = sorted(os.listdir(data_dir)) # Use sorted() to list them in ascending order
files

['2016-09-02.csv',
 '2016-09-30.csv',
 '2016-10-31.csv',
 '2016-11-30.csv',
 '2017-01-03.csv',
 '2017-01-31.csv',
 '2017-02-28.csv',
 '2017-04-03.csv',
 '2017-05-01.csv',
 '2017-05-30.csv',
 'README.md']

A couple of issues:
* There is a README.md file among the CSV data files, so we'll need to ignore that.
* The data collection dates are in the file names, and we'll need those for later.

Let's clean that up a bit:

In [20]:
# Figuring out the slicing options
f = '2017-05-01.csv'
print('extension is:\t'+f[-4:]) # Use this to select only CSV files
print('filename is:\t'+f[:-4]) # Use this for parsing out dates, later on...

extension is:	.csv
filename is:	2017-05-01


In [22]:
# Keep only the CSV files
files = [f for f in files if f.endswith('.csv')]
files

['2016-09-02.csv',
 '2016-09-30.csv',
 '2016-10-31.csv',
 '2016-11-30.csv',
 '2017-01-03.csv',
 '2017-01-31.csv',
 '2017-02-28.csv',
 '2017-04-03.csv',
 '2017-05-01.csv',
 '2017-05-30.csv']
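
Since the collection dates live in the file names, here is a minimal sketch of how they could be parsed for later use. It assumes every name follows the `YYYY-MM-DD.csv` pattern shown above; `sample_files` and `dates` are names introduced only for this example.

```python
from datetime import datetime

sample_files = ['2016-09-02.csv', '2017-05-30.csv']  # two names from the listing above

# Strip the 4-character ".csv" extension, then parse the remaining YYYY-MM-DD stem.
dates = {f: datetime.strptime(f[:-4], '%Y-%m-%d').date() for f in sample_files}

for name, day in dates.items():
    print(name, '->', day.isoformat())
```

A file name that does not match the pattern would raise a `ValueError`, which is a useful early warning if other files end up in the data directory.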

Now we'll read in the first CSV file as a Pandas DataFrame and examine it.

In [26]:
import pandas as pd
df = pd.read_csv(data_dir+files[0])
df.head(3) # Only display the first 3 records

Unnamed: 0,termnum,term,category,radius_miles,loci,location,lat,lng,distance,i,...,is_closed,rating,review_count,name,phone,display_address,url,yelpcats,isWTS,isTobShop
0,0,vape,,25,1,Allentown-Bethlehem-Easton | PA-NJ Metro Area ...,40.549806,-75.491105,,1,...,False,4.0,4,Get Your Vape On,6104218310,610 State Ave | Emmaus PA 18049,http://www.yelp.com/biz/get-your-vape-on-emmau...,vapeshops |,,
1,0,vape,,25,1,Allentown-Bethlehem-Easton | PA-NJ Metro Area ...,40.549495,-75.597692,,2,...,False,4.5,2,Vape Flow,4846498347,7150 Hamilton Blvd | Trexlertown PA 18087,http://www.yelp.com/biz/vape-flow-trexlertown-...,vapeshops |,,
2,0,vape,,25,1,Allentown-Bethlehem-Easton | PA-NJ Metro Area ...,40.629023,-75.477516,,3,...,False,1.0,1,Blue Monkey Vape,6102310555,250 Lehigh Valley Mall | Whitehall PA 18052,http://www.yelp.com/biz/blue-monkey-vape-white...,vapeshops |,,


Next, let's confirm that all of the files are Pandas-readable...

_Spoiler alert:_ they're not all good!

In [29]:
for f in files:
    try:
        df = pd.read_csv(data_dir+f)
        print(f+' is all good')
    except Exception:  # Catch read/parse errors without swallowing KeyboardInterrupt
        print(f+' is NOT good')

2016-09-02.csv is all good
2016-09-30.csv is all good
2016-10-31.csv is all good
2016-11-30.csv is all good
2017-01-03.csv is NOT good
2017-01-31.csv is NOT good
2017-02-28.csv is NOT good
2017-04-03.csv is NOT good
2017-05-01.csv is all good
2017-05-30.csv is all good


What's going on with the "NOT good" files?!

Let's iterate to find the error...

In [68]:
for line in open(data_dir+'2017-01-03.csv'):
    pass

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 2336: invalid continuation byte

Let's examine the file in question...

In [47]:
lineNum = 0
try:
    with open(data_dir+'2017-01-03.csv') as fh:
        for line in fh:
            lineNum += 1  # Count each line that decodes cleanly
except UnicodeDecodeError:
    print('Last good line was #'+str(lineNum)+' and had this data:\n')
    print(line)


Last good line was #944 and had this data:

0,vape,,25,36,York-Hanover | PA Metro Area; Pennsylvania (Spring Grove | PA),39.9766769,-76.7686615,5516.60113329,18,djs-westgate-beverage-york,False,4.0,1,DJ's Westgate Beverage,7177641550,1550 Kenneth Rd | York PA 17408,https://www.yelp.com/biz/djs-westgate-beverage-york?adjust_creative=wDwCvDADIHyvJYDHOmNK2g&utm_campaign=yelp_api&utm_medium=api_v2_search&utm_source=wDwCvDADIHyvJYDHOmNK2g,beer_and_wine | tobaccoshops | 



At this point, we know that there is a UTF-8 decoding error around line #944 in the '2017-01-03.csv' file (and also in some subsequent files). I didn't see any obviously strange text characters when reviewing the raw data, so we're going to try a different encoding strategy.
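
Text-mode reading decodes the file in chunks, so the "last good line" may sit some distance before the line that actually holds the bad byte. One way to pin down the exact lines is to read in binary mode and decode each line separately. The sketch below is a hypothetical helper (`find_undecodable_lines` is not part of the repository), demonstrated on a made-up throwaway file rather than the real data.

```python
import os
import tempfile

def find_undecodable_lines(path, encoding='utf-8'):
    """Return (line_number, raw_bytes) for each line that fails to decode."""
    bad = []
    with open(path, 'rb') as fh:  # binary mode: no decoding happens on read
        for line_num, raw in enumerate(fh, start=1):
            try:
                raw.decode(encoding)
            except UnicodeDecodeError:
                bad.append((line_num, raw))
    return bad

# Demo on a throwaway file; the second row holds a Latin-1 encoded "é" (byte 0xe9).
with tempfile.NamedTemporaryFile('wb', suffix='.csv', delete=False) as tmp:
    tmp.write(b'name,city\n')
    tmp.write('Caf\u00e9 Vape,York\n'.encode('latin-1'))
    tmp.write(b'Vape Flow,Trexlertown\n')
    demo_path = tmp.name

bad_lines = find_undecodable_lines(demo_path)
os.remove(demo_path)
print(bad_lines)
```

Pointing the same helper at `data_dir+'2017-01-03.csv'` should list every line that needs a closer look, with the raw bytes intact for inspection.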

After some trial and error (and various StackOverflow pages), _"latin-1"_ looks like a viable encoding when interpreting text from an API that is international in scope. The offending byte, 0xe9, is "é" in Latin-1, which fits that guess. Let's try that...

In [66]:
for f in files:
    try:
        df = pd.read_csv(data_dir+f, encoding='latin-1')
        print(f+' is all good')
    except Exception:  # Catch read/parse errors without swallowing KeyboardInterrupt
        print(f+' is NOT good')

2016-09-02.csv is all good
2016-09-30.csv is all good
2016-10-31.csv is all good
2016-11-30.csv is all good
2017-01-03.csv is all good
2017-01-31.csv is all good
2017-02-28.csv is all good
2017-04-03.csv is all good
2017-05-01.csv is all good
2017-05-30.csv is all good


__Success - it's all good!__
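
One caveat worth keeping in mind: Latin-1 assigns a character to every possible byte value, so decoding with it cannot raise an error. "All good" here means that nothing crashed, not that every character was interpreted correctly. A more cautious approach, sketched below with a hypothetical helper (`first_working_encoding` is not a Pandas or standard-library function), is to try UTF-8 first and fall back to Latin-1 only when needed.

```python
def first_working_encoding(raw, candidates=('utf-8', 'latin-1')):
    """Return the first candidate encoding that decodes `raw` without error."""
    for enc in candidates:
        try:
            raw.decode(enc)
            return enc
        except UnicodeDecodeError:
            continue
    raise ValueError('no candidate encoding worked')

# Made-up sample rows: one plain ASCII, one containing a Latin-1 encoded "é".
ascii_row = b'Vape Flow,Trexlertown\n'
latin_row = 'Caf\u00e9 Vape,York\n'.encode('latin-1')

ascii_enc = first_working_encoding(ascii_row)
latin_enc = first_working_encoding(latin_row)
print(ascii_enc, latin_enc)

# Latin-1 maps every byte value to a character, so it accepts any input at all.
all_bytes = bytes(range(256))
decoded_len = len(all_bytes.decode('latin-1'))
print(decoded_len)
```

In practice, the bytes of each CSV file could be read with `open(path, 'rb').read()`, and the returned name passed to the `encoding` parameter of `pd.read_csv`.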

Update the repository...

Note on how I "save" my work:

In [83]:
#!git config user.email "colditzjb@gmail.com"
#!git commit -m "update files"
#!git push

### Documenting the data

* Read these CSV files into four Pandas DataFrames
    * Read the `id` column in as an index (using the `index_col` parameter of the `read_csv` function). See the [column and index locations documentation](http://pandas.pydata.org/pandas-docs/stable/io.html#column-and-index-locations-and-names) for more information.
    * Parse any date columns (using the `parse_dates` and `infer_datetime_format` parameters of the `read_csv` function). See the [datetime-handling documentation](http://pandas.pydata.org/pandas-docs/stable/io.html#datetime-handling) for more information.
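
The two `read_csv` options above can be sketched as follows. The CSV text is made up, and the `collected` column is an assumption for illustration, since the collection dates in this project are stored in the file names rather than in a column.

```python
import io
import pandas as pd

# Made-up CSV text; `id` and `collected` are assumed column names for illustration.
csv_text = (
    "id,name,collected\n"
    "vape-flow-trexlertown,Vape Flow,2016-09-02\n"
    "get-your-vape-on-emmaus,Get Your Vape On,2016-09-30\n"
)

demo_df = pd.read_csv(
    io.StringIO(csv_text),
    index_col='id',             # use the `id` column as the row index
    parse_dates=['collected'],  # convert this column to datetime64
)
print(demo_df.dtypes)
```

With the index set this way, a single listing can be looked up by identifier with `demo_df.loc['vape-flow-trexlertown']`.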
    