# Term Project 
During the course, you will be working on a term project to either pull data from an API or scrape a
webpage. You will need to select either an API (different than Twitter) or a Webpage and create a
process in Python that will extract data into a formatted dataset.

There are no restrictions on what API or Webpage you use, other than you cannot use Twitter or the
Webpages used in the exercises from your book.
You will turn in your project at the end of the term.
Submit the following to the assignment link, either directly or as a link to your GitHub repository:
- Your formatted dataset with at least 15-20 variables (if the API or Webpage you selected doesn’t have that many fields available on it, you will want to search again, or do multiple!)
- Your code or screenshots of your code outlining the steps and process you had to take to pull data from the API or web page and the steps you took to format the data.
- 2 Data Transformation/Clean-up Steps (can be any that we learned in class)
- A 250-word paper summarizing your steps and any challenges you ran into during the project. Discuss the importance and relevance of this type of process if you were a data scientist. How often do you think you would have to do this to get the data you need?



In [44]:
from twitter_keys import google_maps as secret  # assumption: this dict holds the 'api_key' entry used below
import requests 
import json
import pandas as pd
from pandas import json_normalize  # the pandas.io.json import path is deprecated since pandas 1.0

In [76]:
search_params = {'input':'Bellevue%20University','type': 'textquery', 'fields':'place_id', 'api_key': secret['api_key']}

In [99]:
def find_place(params):
    '''
    Searches for a place and returns its place ID.
    Params:
        a dict with the following keys:
        input: URL-encoded term to search for
        type: type of search to perform (e.g. 'textquery')
        fields: data to return from the search (always place_id)
        api_key: secret API key used to make the request
    '''
    url = 'https://maps.googleapis.com/maps/api/place/findplacefromtext/json?input=_IN_&inputtype=_TP_&fields=_F_&key=_K_'
    # Fill in the URL placeholders from the params dict (including the key,
    # so the function does not depend on a global variable)
    url = (url.replace('_IN_', params['input'])
              .replace('_TP_', params['type'])
              .replace('_F_', params['fields'])
              .replace('_K_', params['api_key']))
    r = requests.get(url)
    result = r.json()
    # Take the first (best-matching) candidate
    candidate = result['candidates'][0]
    return candidate['place_id']

In [102]:
place_id = find_place(search_params)
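
The string replacement in `find_place` works because the search term is already percent-encoded (`Bellevue%20University`). As a hedged alternative, the sketch below builds the same query string with the standard library's `urllib.parse.urlencode`, which escapes each value for you. The helper name `build_find_place_url` and the placeholder key `YOUR_API_KEY` are assumptions made for illustration; they are not part of the notebook above.

```python
from urllib.parse import urlencode, quote

FIND_PLACE_URL = 'https://maps.googleapis.com/maps/api/place/findplacefromtext/json'

def build_find_place_url(text, api_key, fields='place_id'):
    # Hypothetical helper: assemble the Find Place request URL.
    query = {
        'input': text,             # plain text, not pre-encoded
        'inputtype': 'textquery',
        'fields': fields,
        'key': api_key,
    }
    # quote_via=quote encodes spaces as %20 (the default, quote_plus, uses '+')
    return FIND_PLACE_URL + '?' + urlencode(query, quote_via=quote)

url = build_find_place_url('Bellevue University', 'YOUR_API_KEY')
print(url)
```

With this approach the search term can be written as plain text, and characters such as spaces or commas are escaped in one place instead of by hand.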

In [113]:
find_place_by_id_params = {'fields':'rating,formatted_address,geometry,name,permanently_closed,place_id,plus_code,scope,type,url,vicinity,website,formatted_phone_number,international_phone_number',
                           'placeid':place_id,
                           'api_key': secret['api_key']
                          }

In [114]:
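def get_place_by_id(params_dict):
    '''
    Looks up a place's details once its ID is known and returns them as a dataframe.
    Converting the raw JSON into a dataframe serves as the first data cleanup step.
    Params:
        a dict with the following keys:
        placeid: place ID returned by find_place
        fields: comma-separated list of fields to return
        api_key: secret API key used to make the request
    '''
    url = 'https://maps.googleapis.com/maps/api/place/details/json?placeid=_P_&fields=_F_&key=_K_'
    # Fill in the URL placeholders from the params dict
    url = (url.replace('_P_', params_dict['placeid'])
              .replace('_F_', params_dict['fields'])
              .replace('_K_', params_dict['api_key']))
    r = requests.get(url)
    # Parse the JSON text, then flatten the nested result into a single row
    response = json.loads(r.text)
    df = json_normalize(response)
    return df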
    

In [115]:
df = get_place_by_id(find_place_by_id_params)

In [116]:
df

Unnamed: 0,html_attributions,result.formatted_address,result.formatted_phone_number,result.geometry.location.lat,result.geometry.location.lng,result.geometry.viewport.northeast.lat,result.geometry.viewport.northeast.lng,result.geometry.viewport.southwest.lat,result.geometry.viewport.southwest.lng,result.international_phone_number,...,result.place_id,result.plus_code.compound_code,result.plus_code.global_code,result.rating,result.scope,result.types,result.url,result.vicinity,result.website,status
0,[],"1000 Galvin Rd S, Bellevue, NE 68005, USA",(402) 293-2000,41.150977,-95.919596,41.155714,-95.915233,41.146619,-95.92562,+1 402-293-2000,...,ChIJf4DKhmqIk4cR63XRQHKFjoM,"532J+95 Bellevue, NE, United States",86H6532J+95,3.8,GOOGLE,"[university, point_of_interest, establishment]",https://maps.google.com/?cid=9479660991421707755,"1000 Galvin Road South, Bellevue",http://www.bellevue.edu/,OK
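
The dotted column names in the output above (for example `result.geometry.location.lat`) come from flattening the nested JSON. To make that step concrete, here is a minimal pure-Python sketch of the same idea. It is an illustration only, not how pandas implements `json_normalize`; the `flatten` helper is hypothetical, and the sample response is trimmed down to a few of the values shown above.

```python
def flatten(record, parent_key='', sep='.'):
    # Recursively flatten nested dicts into one level with dotted keys.
    flat = {}
    for key, value in record.items():
        full_key = parent_key + sep + key if parent_key else key
        if isinstance(value, dict):
            flat.update(flatten(value, full_key, sep))
        else:
            flat[full_key] = value  # lists and scalars are kept as-is
    return flat

# Trimmed-down response using values from the output above
sample = {
    'html_attributions': [],
    'result': {
        'name': 'Bellevue University',
        'geometry': {'location': {'lat': 41.150977, 'lng': -95.919596}},
        'types': ['university', 'point_of_interest', 'establishment'],
    },
    'status': 'OK',
}

flat = flatten(sample)
for key in sorted(flat):
    print(key, '=', flat[key])
```

Each nested level adds one dotted segment to the key, which is why every place field ends up with a `result.` prefix.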


## Data Cleanup #2
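Drop the columns that hold response metadata rather than place data (`html_attributions` and `status`), then strip the `result.` prefix from the remaining column names.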

In [124]:
df = df.drop(columns=['html_attributions','status'])

In [125]:
df.columns

Index(['result.formatted_address', 'result.formatted_phone_number',
       'result.geometry.location.lat', 'result.geometry.location.lng',
       'result.geometry.viewport.northeast.lat',
       'result.geometry.viewport.northeast.lng',
       'result.geometry.viewport.southwest.lat',
       'result.geometry.viewport.southwest.lng',
       'result.international_phone_number', 'result.name', 'result.place_id',
       'result.plus_code.compound_code', 'result.plus_code.global_code',
       'result.rating', 'result.scope', 'result.types', 'result.url',
       'result.vicinity', 'result.website'],
      dtype='object')

In [129]:
# Map each column name to the same name without the 'result.' prefix
columns_dict = {column: column.replace('result.', '') for column in df.columns}
df = df.rename(columns=columns_dict)

In [130]:
df

Unnamed: 0,formatted_address,formatted_phone_number,geometry.location.lat,geometry.location.lng,geometry.viewport.northeast.lat,geometry.viewport.northeast.lng,geometry.viewport.southwest.lat,geometry.viewport.southwest.lng,international_phone_number,name,place_id,plus_code.compound_code,plus_code.global_code,rating,scope,types,url,vicinity,website
0,"1000 Galvin Rd S, Bellevue, NE 68005, USA",(402) 293-2000,41.150977,-95.919596,41.155714,-95.915233,41.146619,-95.92562,+1 402-293-2000,Bellevue University,ChIJf4DKhmqIk4cR63XRQHKFjoM,"532J+95 Bellevue, NE, United States",86H6532J+95,3.8,GOOGLE,"[university, point_of_interest, establishment]",https://maps.google.com/?cid=9479660991421707755,"1000 Galvin Road South, Bellevue",http://www.bellevue.edu/


# A 250-word paper summarizing your steps and any challenges you ran into during the project. Discuss the importance and relevance of this type of process if you were a data scientist. How often do you think you would have to do this to get the data you need?

For my term project, I chose the Google Maps API and searched for information about Bellevue University. In Google Maps, every place has a unique identifier, the `place_id`. I first query the API for Bellevue University, which returns the place ID. I then pass that place ID to a second function, which makes another request to Google Maps for more detailed information about the university. More fields were available, but I chose to request:
-	rating
-	formatted_address
-	geometry
-	name
-	permanently_closed
-	place_id
-	plus_code
-	scope
-	type
-	url
-	vicinity
-	website
-	formatted_phone_number
-	international_phone_number


Once the request is made, the API returns formatted JSON. For the first transformation, I chose to move the data into a dataframe. I use the `json.loads()` function to parse the response into a more digestible format, then use pandas' `json_normalize` function to load the parsed data into a dataframe, which the function returns.

I think performing this project as a two-step process speaks to some of the challenges a data scientist will face. The first step was to find the unique identifier for Bellevue University. Google has a very robust API that can intuitively find what you are looking for, and exactly what I wanted came back as the first result. In real life, that may not happen, and you may need to adjust your search parameters until the results are specific enough. The second step can be performed whether or not the first query returned the right result; the place ID could even be hard-coded. I think this two-step process, finding the unique identifier and then querying the remaining information about it, would probably be part of a data scientist's day-to-day work.
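
The case where the search does not return what you expect can be sketched in code. Assuming the Find Place response carries the `candidates` list used in `find_place` along with a `status` field, the hypothetical helper below returns `None` when nothing matched, rather than failing with an `IndexError` on an empty list. The two sample responses are hand-written for illustration.

```python
def first_place_id(result):
    # Hypothetical helper: return the first candidate's place_id,
    # or None when the search matched nothing.
    candidates = result.get('candidates', [])
    if result.get('status') != 'OK' or not candidates:
        return None
    return candidates[0].get('place_id')

# Hand-written sample responses (assumed shape)
found = {'status': 'OK', 'candidates': [{'place_id': 'ChIJf4DKhmqIk4cR63XRQHKFjoM'}]}
empty = {'status': 'ZERO_RESULTS', 'candidates': []}

print(first_place_id(found))
print(first_place_id(empty))
```

A caller can then check for `None` and retry with adjusted search parameters before making the second request.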
