This notebook helps you prepare all the data the project needs and save it to AWS S3.

Steps:
1. Download a kaggle.json file (your API token) from the Kaggle website and move it to ~/.kaggle/kaggle.json.
2. Download the dataset from Kaggle to your local machine.
3. Use Yelp API to download category data.
4. Scrape weather data from wunderground.
5. Copy local files to Amazon S3.
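
Step 1 can also be done from Python. The following is a minimal sketch, not part of the original workflow: `install_kaggle_token` is a hypothetical helper, and where your browser saved the token is an assumption you need to fill in.

```python
import os
import shutil
from pathlib import Path


def install_kaggle_token(downloaded, home=None):
    """Copy a downloaded kaggle.json to <home>/.kaggle/kaggle.json (hypothetical helper)."""
    home = Path(home) if home is not None else Path.home()
    target_dir = home / ".kaggle"
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / "kaggle.json"
    shutil.copyfile(downloaded, target)
    # Owner read/write only, the same effect as `chmod 600`.
    os.chmod(target, 0o600)
    return target
```

For example, `install_kaggle_token(Path.home() / "Downloads" / "kaggle.json")` would apply if the token was saved to a Downloads folder.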


## Yelp dataset from Kaggle

In [1]:
import os
import json
import requests
import numpy as np
import pandas as pd
import configparser
import datetime
from selenium import webdriver
from selenium.webdriver.chrome.options import Options 

In [2]:
# Path of the project
project_path = '/mnt/data-ubuntu/Projects/data_engineering_projects/project_6_capstone/'

config = configparser.ConfigParser()
config.read(project_path + 'resource/project.cfg') # The project config file is not included in the repo.

os.environ['AWS_ACCESS_KEY_ID']=config['AWS']['ACCESS_KEY']
os.environ['AWS_SECRET_ACCESS_KEY']=config['AWS']['SECRET_KEY']
os.environ['REGION']=config['AWS']['REGION']
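
Because project.cfg is not in the repo, here is a sketch of the layout this notebook appears to expect, inferred only from the `config[...]` lookups used in the cells (`AWS`, `YELP`, `MAPQUEST`, `WD`). All values are placeholders, and the real file may contain other sections or keys.

```python
import configparser

# Placeholder values only; section and key names are inferred from the
# config[...] lookups in this notebook, not from the real project.cfg.
SAMPLE_CFG = """
[AWS]
ACCESS_KEY = <your-access-key-id>
SECRET_KEY = <your-secret-access-key>
REGION = <your-region>

[YELP]
API_KEY = <your-yelp-api-key>

[MAPQUEST]
API_KEY = <your-mapquest-api-key>

[WD]
API_KEY = <your-weather-api-key>
"""

sample = configparser.ConfigParser()
sample.read_string(SAMPLE_CFG)

# Fail early if a key the notebook reads is missing.
required = {
    "AWS": ["ACCESS_KEY", "SECRET_KEY", "REGION"],
    "YELP": ["API_KEY"],
    "WD": ["API_KEY"],
}
missing = [
    f"{section}.{key}"
    for section, keys in required.items()
    for key in keys
    if not sample.has_option(section, key)
]
```

Running the same `required` check against your own config before the later cells can save a failed scrape or upload halfway through.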

In [3]:
# For your security, ensure that other users of your computer do not have read access to your credentials. 
# On Unix-based systems you can do this with the following command:
!chmod 600 ~/.kaggle/kaggle.json

In [4]:
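# Prepend the directory where the kaggle CLI is installed to PATH.
# (Overwriting PATH entirely would hide other commands such as chmod and aws.)
os.environ['PATH'] = '/home/rick/anaconda3/envs/de_capstone/bin' + os.pathsep + os.environ['PATH']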
!kaggle datasets download -d yelp-dataset/yelp-dataset -p {project_path + 'data/'}

In [5]:
# Unzipping data set
import zipfile
with zipfile.ZipFile(project_path + 'data/yelp-dataset.zip', 'r') as zip_ref:
    zip_ref.extractall(project_path + 'data/yelp-dataset')

## Category data from Yelp API

In [7]:
# Set working dir and import data_preparation.py
os.chdir(project_path)
from src.data_preparation import *

In [8]:
# Extract yelp business category data through yelp API
api_key=config['YELP']['API_KEY']
create_yelp_category(api_key, project_path + 'data/yelp-dataset')

yelp_category.csv is created in /mnt/data-ubuntu/Projects/data_engineering_projects/project_6_capstone/data/yelp-dataset
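
`create_yelp_category` lives in src/data_preparation.py and is not shown in this notebook. As a rough illustration of what such a step could involve, the sketch below flattens a category payload into CSV rows. It assumes the payload has the shape returned by the Yelp Fusion all-categories endpoint (a `categories` list whose items carry `alias`, `title`, and `parent_aliases`). `flatten_categories`, `rows_to_csv`, and the sample payload are hypothetical, and the real function may work differently.

```python
import csv
import io


def flatten_categories(payload):
    """Return one row per (category, parent) pair; top-level categories get an empty parent."""
    rows = []
    for cat in payload.get("categories", []):
        parents = cat.get("parent_aliases") or [""]
        for parent in parents:
            rows.append({"alias": cat["alias"], "title": cat["title"], "parent_alias": parent})
    return rows


def rows_to_csv(rows):
    """Serialize rows to CSV text with a header line."""
    buf = io.StringIO()
    writer = csv.DictWriter(
        buf, fieldnames=["alias", "title", "parent_alias"], lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


# Made-up sample payload for illustration only.
sample_payload = {
    "categories": [
        {"alias": "restaurants", "title": "Restaurants", "parent_aliases": []},
        {"alias": "pizza", "title": "Pizza", "parent_aliases": ["restaurants"]},
    ]
}
csv_text = rows_to_csv(flatten_categories(sample_payload))
```

Emitting one row per parent keeps the table flat, which makes it easier to join against the business data later.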



## Weather Data
Get weather data based on latitude and longitude, so we can analyze the relationship between reviews and weather.

After analyzing the data, AZ has the largest number of businesses, while other states don't have enough data.
Thus, we will focus only on businesses in AZ in order to build a meaningful data model.

The reviews span 2004 to 2019.

Steps to get historical weather data:
    1. Check missing values for column latitude and longitude.
    2. Impute missing values if any.
    3. Scrape weather data from www.wunderground.com based on latitude and longitude.
    4. Save weather data by station as csv files.

In [9]:
# First, let's check the location info in df_business.
df_business = pd.read_json(project_path + 'data/yelp-dataset/yelp_academic_dataset_business.json', lines = True)
df_business_address = df_business[df_business['state']=='AZ'].loc[:,['business_id',
                                                                     'latitude',
                                                                     'longitude']]
# Replace field that's entirely space (or empty) with NaN
df_business_address = df_business_address.replace(r'^\s*$', np.nan, regex=True)

In [10]:
# Check all missing values
df_business_address.isnull().sum().sort_values(ascending = False)
# There are no missing values

In [11]:
df_business_address.count()

In [12]:
# Keep only one decimal place for latitude and longitude to reduce the number of locations to look up.
df_business_address['latit_s'] = df_business_address['latitude'].round(1)
df_business_address['longi_s'] = df_business_address['longitude'].round(1)
df_business_address.reset_index(drop = True, inplace = True)
df_business_address
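
To see why rounding helps: at one decimal place, nearby businesses collapse onto the same grid cell (0.1 degrees of latitude is roughly 11 km), so far fewer distinct locations need a weather lookup. Here is a small sketch with made-up coordinates, using plain Python rather than the DataFrame above. `grid_cells` is a hypothetical helper.

```python
def grid_cells(points, ndigits=1):
    """Round (lat, lon) pairs and return the distinct rounded cells, sorted."""
    return sorted({(round(lat, ndigits), round(lon, ndigits)) for lat, lon in points})


# Hypothetical coordinates, not taken from the Yelp dataset.
points = [
    (33.4484, -112.0740),
    (33.4373, -112.0512),
    (33.4255, -111.9400),
    (33.4210, -111.9330),
    (33.4942, -111.9261),
]
cells = grid_cells(points)
```

The trade-off is precision: every business in a cell shares the same weather lookup, which is reasonable for daily weather but would be too coarse for anything street-level.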

In [13]:
# I was planning to use city and state to look up the weather, but handling the missing values is a real headache.
# MapQuest API key. You can use the free 15,000 transactions per month.
# mapquest_api= config['MAPQUEST']['API_KEY']
# for i, row in df_business_address.loc[:,['latitude', 'longitude']].iterrows():
#     url = r'''
#         http://open.mapquestapi.com/geocoding/v1/reverse?key={}&location={},{}
#         &includeRoadMetadata=true
#         &includeNearestIntersection=true 
#         '''.format(mapquest_api, row['latitude'], row['longitude'])
       
#     r = requests.get(url)
#     df_business.at[i, 'city'] = r.json()['results'][0]['locations'][0]['adminArea5']
#     df_business_address.at[i, 'city'] = r.json()['results'][0]['locations'][0]['adminArea5']
# # Save the city and state as csv.
# df_business_address.to_csv(project_path + 'data/yelp-dataset/address_imputation.csv')

**Look up weather data based on city and state**

I built a scraper to get weather data (cities in AZ, from 2004 to 2019) from www.wunderground.com.

In [14]:
# Create a table to look up the weather
df_geo = df_business_address.loc[:,['latit_s', 'longi_s']] \
    .drop_duplicates() \
    .reset_index(drop = True)
df_geo.to_csv(project_path + 'data/yelp-dataset/geo_location.csv')
# We need to find all historical weather info for these geo locations.

**I tried calling the weather_scraper function to download the data, but hit a bug I couldn't fix when calling it from a Jupyter notebook. So I ran data_preparation.py directly to scrape the data.**

In [15]:
# # Make sure the chromedriver file is executable
# data_weather, df_city_state = weather_scraper(df_city_state, 
#                                               project_path + 'resource/chromedriver',
#                                               '2004-01-01',
#                                               '2020-01-01',
#                                               config['WD']['API_KEY'])
# df_city_state.to_csv(project_path + 'data/yelp-dataset')
# # Parse weather data and save it as csv by station
# parse_data(data_weather, project_path + 'data/weather-data')
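
`parse_data` is defined in src/data_preparation.py and is not shown here. The sketch below only illustrates the "one CSV per station" idea from the steps above. `save_by_station` is a hypothetical helper, and it assumes each record is a dict with a `station` field, which may not match the real scraper output.

```python
import csv
from collections import defaultdict
from pathlib import Path


def save_by_station(records, out_dir):
    """Write one <station>.csv per station and return the written paths (hypothetical helper)."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    # Group records by their station identifier.
    by_station = defaultdict(list)
    for rec in records:
        by_station[rec["station"]].append(rec)

    written = []
    for station, rows in sorted(by_station.items()):
        path = out_dir / f"{station}.csv"
        with open(path, "w", newline="") as fh:
            # Column order follows the first record of each station.
            writer = csv.DictWriter(fh, fieldnames=list(rows[0].keys()))
            writer.writeheader()
            writer.writerows(rows)
        written.append(path)
    return written
```

Splitting by station keeps each file small and matches the file names seen in the upload output (for example KPHX.csv).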

## Upload data to S3

In [4]:
yelp_data = project_path + 'data/yelp-dataset'
weather_data = project_path + 'data/weather-data'
bucket = 'sparkify-de'
# Create a bucket to store the data files
!aws s3 mb s3://{bucket}
# Copy the JSON files to the S3 bucket; this will take a while...
!aws s3 cp {yelp_data} s3://{bucket}/test --recursive --exclude "*"  --include "*.json"

In [5]:
# csv files for yelp_data
!aws s3 cp {yelp_data} s3://{bucket}/yelp-dataset --recursive --exclude "*"  --include "*.csv"

upload: ../data/yelp-dataset/geo_location.csv to s3://sparkify-de/yelp-dataset/geo_location.csv
upload: ../data/yelp-dataset/yelp_category.csv to s3://sparkify-de/yelp-dataset/yelp_category.csv


In [6]:
# Copy weather data to s3
!aws s3 cp {weather_data} s3://{bucket}/weather-data --recursive --exclude "*"  --include "*.csv"

upload: ../data/weather-data/KVGT.csv to s3://sparkify-de/weather-data/KVGT.csv
upload: ../data/weather-data/KPHX.csv to s3://sparkify-de/weather-data/KPHX.csv
upload: ../data/weather-data/KIWA.csv to s3://sparkify-de/weather-data/KIWA.csv
upload: ../data/weather-data/KLAS.csv to s3://sparkify-de/weather-data/KLAS.csv
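
The `--recursive --exclude "*" --include "*.csv"` combination above uploads only files matching one pattern. For readers who prefer to preview the selection from Python first, here is a minimal sketch. `plan_uploads` is a hypothetical helper that only lists what would be copied and does not talk to AWS.

```python
from pathlib import Path


def plan_uploads(local_dir, prefix, pattern="*.csv"):
    """List (local_path, s3_key) pairs for files under local_dir that match pattern."""
    root = Path(local_dir)
    pairs = []
    for path in sorted(root.rglob(pattern)):
        if path.is_file():
            # Keys keep the directory layout relative to local_dir.
            key = f"{prefix}/{path.relative_to(root).as_posix()}"
            pairs.append((path, key))
    return pairs
```

For example, `plan_uploads(weather_data, 'weather-data')` would list the station files. Each pair could then be passed to a boto3 S3 client's `upload_file(Filename, Bucket, Key)`. This notebook uses the AWS CLI, not boto3, so treat that as an assumption about your environment.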
