# EXTRACTION
## Data Source
- Use Events API: https://api.predicthq.com/v1/events?
## Methodology
### Data Analysis - View Data
1. Convert the API response to JSON and read it. Determine which variables are useful and required for the deliverable.

FINDINGS: The API returns at most 50 entries per page, for up to 100 pages. Each call therefore returns only 50 results, so collecting the maximum of 5000 entries takes 100 calls.
### Data Analysis - Determine and Extract Data Required

- Number of entries required: 5000.
    1. Loop over offsets in steps of 50 (0, 50, 100, 150, and so on up to 4950), since the API returns at most 5000 results.
- Variables required: id, country, category, title, start_date, end_date, location
    2. Use pandas to keep only the required variables.
        - *location holds the coordinates
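
The offset loop described above can be sketched without touching the network. This is a minimal illustration, not the notebook's actual request code: `fetch_page` is a hypothetical callable standing in for the API call, and `fake_fetch` with its 120 fake events is an assumption made only so the sketch runs on its own.

```python
from typing import Callable, Dict, List

PAGE_SIZE = 50      # entries per call, per the finding above
MAX_RESULTS = 5000  # maximum number of results, per the finding above


def page_offsets(total: int = MAX_RESULTS, page_size: int = PAGE_SIZE) -> List[int]:
    """Return the offsets 0, 50, 100, ... needed to cover `total` entries."""
    return list(range(0, total, page_size))


def collect_events(fetch_page: Callable[[int, int], List[Dict]],
                   total: int = MAX_RESULTS,
                   page_size: int = PAGE_SIZE) -> List[Dict]:
    """Call `fetch_page(offset, limit)` for each offset and gather the results.

    `fetch_page` is a hypothetical stand-in for the real API request.
    The loop stops early if a page comes back empty.
    """
    events: List[Dict] = []
    for offset in page_offsets(total, page_size):
        page = fetch_page(offset, page_size)
        if not page:
            break
        events.extend(page)
    return events


# Stand-in for the API: 120 fake events served in pages (assumption for the demo).
fake_events = [{"id": f"event-{n}"} for n in range(120)]


def fake_fetch(offset: int, limit: int) -> List[Dict]:
    return fake_events[offset:offset + limit]


offsets = page_offsets()
collected = collect_events(fake_fetch)
print(offsets[:4], offsets[-1], len(offsets))
print(len(collected))
```

Passing the fetch step in as a function keeps the paging logic separate from credentials and HTTP details, so it can be checked before spending real API calls.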

# TRANSFORMATION
- Concatenate the dataframes together (from all 100 pages) and save the final dataframe to a variable.

# LOAD
- Load data into PostgreSQL
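
One possible shape for the load step, as a hedged sketch: it uses the standard-library `sqlite3` in-memory database as a stand-in so that it runs anywhere. The table name `events` and the column names `longitude` and `latitude` are assumptions, and the two sample rows reuse ids, titles and coordinates printed later in this notebook. For a real PostgreSQL target, `DataFrame.to_sql` would instead be given a SQLAlchemy engine (assuming SQLAlchemy and a PostgreSQL driver are installed).

```python
import sqlite3

import pandas as pd

# Two sample rows in the shape of the extracted dataframe.
events = pd.DataFrame({
    "id": ["UZGsZMeY5YegR7MMdD", "57NrcFdTDNZcHGZ5Jw"],
    "title": ["Messy Play Sessions", "Future of Financial Services"],
    "category": ["community", "conferences"],
    "country": ["AU", "AU"],
    "location": [[145.18578, -38.261048], [144.959213, -37.82341]],
})

# A Python list cannot be stored in one SQL column, so split the
# coordinate pair into two numeric columns (assumed names).
coords = pd.DataFrame(
    events["location"].tolist(),
    columns=["longitude", "latitude"],
    index=events.index,
)
load_df = pd.concat([events.drop(columns="location"), coords], axis=1)

# sqlite3 in-memory database used as a stand-in for PostgreSQL.
conn = sqlite3.connect(":memory:")
load_df.to_sql("events", conn, if_exists="replace", index=False)

row_count = conn.execute("SELECT COUNT(*) FROM events").fetchone()[0]
first_longitude = conn.execute(
    "SELECT longitude FROM events WHERE id = ?", ("UZGsZMeY5YegR7MMdD",)
).fetchone()[0]
print(row_count, first_longitude)
conn.close()
```

Splitting `location` before loading avoids a type error on insert and leaves the coordinates directly queryable.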

-------------------------------

# EXTRACTION

In [1]:
# Dependencies
import json
import pprint

import pandas as pd
import requests

# Credential File: py_config.py containing variable ACCESS_TOKEN = "xxxxxxxxx"
import py_config

## Data Analysis - View Data
Convert the API response to JSON and read it. Determine which variables are useful and required for the deliverable.

In [28]:
# Connect to the API URL and get data
# ACCESS_TOKEN is defined in py_config.py, which is listed in .gitignore
response = requests.get(
    url="https://api.predicthq.com/v1/events?",
    headers={
      "Authorization": f"Bearer {py_config.ACCESS_TOKEN}",
      "Accept": "application/json"
    },
    # A small limit (10) keeps this exploratory call light
    params={
      "limit": 10,
      "country": "AU",
      "start": "2021-01-01",
      "end": "2022-12-31"
    }
)

In [29]:
# Convert data to JSON format (all data)
# Save to variable "data"
data = response.json()

# Print json (formatted) and analyse which variables to use for deliverable
print(json.dumps(data, indent=4, sort_keys=True))

{
    "count": 5000,
    "next": "https://api.predicthq.com/v1/events/?country=AU&end=2022-12-31&limit=10&offset=10&start=2021-01-01",
    "overflow": true,
    "previous": null,
    "results": [
        {
            "aviation_rank": null,
            "brand_safe": true,
            "category": "community",
            "country": "AU",
            "description": "The Busy Peacock is a place to come and let the children experiment and create. Be it mud, paint, glue, goo, water play, sensory rice play or otherwise, all those things you really want your child to experience but just not in your own home.\n\n45 minute play session running Tuesday \u2013 Sunday, check the calendar for session times and availability.",
            "duration": 2700,
            "end": "2021-07-21T00:15:00Z",
            "entities": [
                {
                    "category": "community",
                    "entity_id": "Dn7bwAXxAEg29PzwifNf5V",
                    "labels": [
                        

In [30]:
# Extract data within dictionary key "results"
# Save to variable "data1"
data1 = data['results']
print(json.dumps(data1, indent=4, sort_keys=True))
print('-----------------')

[
    {
        "aviation_rank": null,
        "brand_safe": true,
        "category": "community",
        "country": "AU",
        "description": "The Busy Peacock is a place to come and let the children experiment and create. Be it mud, paint, glue, goo, water play, sensory rice play or otherwise, all those things you really want your child to experience but just not in your own home.\n\n45 minute play session running Tuesday \u2013 Sunday, check the calendar for session times and availability.",
        "duration": 2700,
        "end": "2021-07-21T00:15:00Z",
        "entities": [
            {
                "category": "community",
                "entity_id": "Dn7bwAXxAEg29PzwifNf5V",
                "labels": [
                    "community",
                    "event-group",
                    "recurring"
                ],
                "name": "Messy Play Sessions",
                "type": "event-group"
            }
        ],
        "first_seen": "2020-11-18T06:35:0

In [31]:
# Flatten the nested "entities" list of every event into its own dataframe (one row per entity)
f_address_df = pd.json_normalize(data1, ['entities'], errors='ignore')
f_address_df.head(3)

Unnamed: 0,entity_id,name,type,category,labels,formatted_address
0,Dn7bwAXxAEg29PzwifNf5V,Messy Play Sessions,event-group,community,"[community, event-group, recurring]",
1,G3EQ2FUNKgmcdjgtCy7YkX,Future of Financial Services,event-group,conferences,"[conference, event-group]",8 Whiteman St Southbank VIC 3006
2,nSefEmFFYeKvx6YFVUThjj,The Pub at Crown,venue,,,8 Whiteman St\nSouthbank VIC 3006\nAustralia


In [32]:
# View json data in dataframe format using pandas. (Note: variable "data" includes all data)
# variable "events_df" dataframe from dictionary data > results 
events_df = pd.json_normalize(data, ['results'], errors='ignore')
events_df.head(3)

Unnamed: 0,relevance,id,title,description,category,labels,rank,local_rank,aviation_rank,phq_attendance,...,updated,first_seen,timezone,location,scope,country,place_hierarchies,state,brand_safe,private
0,1.0,UZGsZMeY5YegR7MMdD,Messy Play Sessions,The Busy Peacock is a place to come and let th...,community,[community],0,0,,,...,2020-11-18T06:48:00Z,2020-11-18T06:35:06Z,Australia/Melbourne,"[145.18578, -38.261048]",locality,AU,"[[6295630, 6255151, 2077456, 2145234, 7839813,...",active,True,False
1,1.0,eQN8f47RzTXvibpxY7,INVOLVING FAMILIES AND OTHERS IN THE CARE OF Y...,This workshop will explore the benefits and ba...,conferences,"[conference, health]",0,0,,,...,2021-01-26T07:52:26Z,2021-01-26T07:50:49Z,Australia/Melbourne,"[144.947904, -37.780223]",locality,AU,"[[6295630, 6255151, 2077456, 2145234, 7839805,...",active,True,False
2,1.0,57NrcFdTDNZcHGZ5Jw,Future of Financial Services,,conferences,[conference],46,56,0.0,600.0,...,2021-03-19T15:48:49Z,2021-02-07T15:50:42Z,Australia/Melbourne,"[144.959213, -37.82341]",locality,AU,"[[6295630, 6255151, 2077456, 2145234, 7839805,...",active,True,False


In [36]:
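# events_df.join(f_address_df)
# Note: merging on the index pairs row n of events_df with row n of f_address_df.
# An event can have zero or several entities, so paired rows do not necessarily
# belong to the same event (compare "title" and "name" in row 1 of the output).
events_final_df = pd.merge(events_df, f_address_df, left_index=True, right_index=True)
events_final_df.head(3)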

Unnamed: 0,relevance,id,title,description,category_x,labels_x,rank,local_rank,aviation_rank,phq_attendance,...,place_hierarchies,state,brand_safe,private,entity_id,name,type,category_y,labels_y,formatted_address
0,1.0,UZGsZMeY5YegR7MMdD,Messy Play Sessions,The Busy Peacock is a place to come and let th...,community,[community],0,0,,,...,"[[6295630, 6255151, 2077456, 2145234, 7839813,...",active,True,False,Dn7bwAXxAEg29PzwifNf5V,Messy Play Sessions,event-group,community,"[community, event-group, recurring]",
1,1.0,eQN8f47RzTXvibpxY7,INVOLVING FAMILIES AND OTHERS IN THE CARE OF Y...,This workshop will explore the benefits and ba...,conferences,"[conference, health]",0,0,,,...,"[[6295630, 6255151, 2077456, 2145234, 7839805,...",active,True,False,G3EQ2FUNKgmcdjgtCy7YkX,Future of Financial Services,event-group,conferences,"[conference, event-group]",8 Whiteman St Southbank VIC 3006
2,1.0,57NrcFdTDNZcHGZ5Jw,Future of Financial Services,,conferences,[conference],46,56,0.0,600.0,...,"[[6295630, 6255151, 2077456, 2145234, 7839805,...",active,True,False,nSefEmFFYeKvx6YFVUThjj,The Pub at Crown,venue,,,8 Whiteman St\nSouthbank VIC 3006\nAustralia
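
Because the entity rows above are matched to events purely by position, one alternative is to carry each event's `id` into the flattened entity rows and merge on that key. The sketch below runs on a small stand-in for `data["results"]`; the sample ids, titles and names are invented for illustration, and the `event_` prefix is an arbitrary choice.

```python
import pandas as pd

# Minimal stand-in for data["results"]: one event without entities, one with two.
results = [
    {"id": "event-a", "title": "Event A", "entities": []},
    {"id": "event-b", "title": "Event B", "entities": [
        {"entity_id": "entity-1", "name": "Group B", "type": "event-group"},
        {"entity_id": "entity-2", "name": "Venue B", "type": "venue"},
    ]},
]

# Event-level columns only.
events = pd.json_normalize(results)[["id", "title"]]

# One row per entity, each tagged with the id of the event it came from.
entities = pd.json_normalize(
    results, record_path="entities", meta=["id"], meta_prefix="event_"
)

# A left merge keeps events that have no entities (their entity columns are NaN).
merged = events.merge(entities, how="left", left_on="id", right_on="event_id")
print(merged[["id", "title", "name", "type"]])
```

With this approach an event with several entities appears on several rows, which may or may not suit the deliverable; keeping only `venue` entities before merging is one way to reduce that.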


In [6]:
# View all columns (in dictionary: events > results). Determine required columns for deliverable.
events_df.columns

Index(['relevance', 'id', 'title', 'description', 'category', 'labels', 'rank',
       'local_rank', 'aviation_rank', 'phq_attendance', 'entities', 'duration',
       'start', 'end', 'updated', 'first_seen', 'timezone', 'location',
       'scope', 'country', 'place_hierarchies', 'state', 'brand_safe',
       'private', 'predicted_end'],
      dtype='object')

## Data Analysis - Determine and Extract Data Required
- Number of entries required: 5000.
    1. Loop over offsets in steps of 50 (0, 50, 100, 150, and so on up to 4950), since the API returns at most 5000 results.
- Variables required: id, country, category, title, start_date, end_date (the API fields `start` and `end`), location
    2. Use pandas to keep only the required variables.
        - *location holds the coordinates

In [7]:
# Test the for loop on a short range
for i in range(0, 500, 50):
    print(i)

0
50
100
150
200
250
300
350
400
450


In [11]:
# Create the list "entries" to store one dataframe per page of results.
entries = []

# Loop over offsets in multiples of 50 (the API returns at most 50 entries per call).
# range(0, 1000, 50) fetches the first 1000 entries; range(0, 5000, 50) would cover all 5000.
for i in range(0, 1000, 50):

    response = requests.get(
        url="https://api.predicthq.com/v1/events?",
        headers={
            "Authorization": f"Bearer {py_config.ACCESS_TOKEN}",
            "Accept": "application/json"
        },
        params={
            "offset": i,
            "limit": 50,
            "country": "AU",
            "start": "2021-01-01",
            "end": "2022-12-31"
        }
    )

    # TEST: print(i)

    # Save response to variable "data"
    data = response.json()

    # Flatten data > results into the dataframe "events_df"
    events_df = pd.json_normalize(data, ['results'], errors='ignore')

    # Keep only the required variables (column headings)
    events_df = events_df[["id", "title", "description", "category", "start", "end", "country", "location"]]

    entries.append(events_df)

# Print the collected dataframes once, after the loop has finished
print(entries)

[                    id                                              title  \
0   UZGsZMeY5YegR7MMdD                                Messy Play Sessions   
1   eQN8f47RzTXvibpxY7  INVOLVING FAMILIES AND OTHERS IN THE CARE OF Y...   
2   57NrcFdTDNZcHGZ5Jw                       Future of Financial Services   
3   5gt6YrELDVgyH9H6N4          The Trevor Kennedy Collection: Highlights   
4   65grzvvvAQcmZH7rim                                 Gunnedah Saleyards   
5   7yXLeY7ZcWoLx8YQU9  PlayUP: The Right to Have an Opinion and Be Heard   
6   9JFGB6yX8SjKcm3BZm  Indigenous Stock Workers and Rodeo Riders Disp...   
7   Aii4mGwYmU6GLoCY97                 Fitness Industry Technology Summit   
8   EJqNn748zA5y6wMUQq                                  The Polished Opal   
9   EuRTzyaAcG7KyFV5E3                      Truth, Power and a Free Press   
10  NshvSdyCLo8Fy2tjHV                             Democracy. Are You In?   
11  PRcjfW7FWEqhUi43kp                              Cute box making class  

# TRANSFORMATION
- Concatenate the dataframes together (one per page fetched) and save the final dataframe to a variable.

In [9]:
# Concatenate all the dataframes within list "entries",
# i.e. combine the per-page dataframes into a single dataframe.
# Save into variable "entries_df"
entries_df = pd.concat(entries)
entries_df

Unnamed: 0,id,title,category,start,end,country,location
0,rii6TAkfVYBDb5dQDe,SunKIDZ at SunPAC,community,2021-07-19T23:30:00Z,2021-07-20T00:30:00Z,AU,"[153.073449, -27.572601]"
1,6aXi7LBhouqXBCy4Hq,Democracy. Are You In?,expos,2021-07-19T23:00:00Z,2021-07-20T07:00:00Z,AU,"[149.129959, -35.30196]"
2,7DLJtjGczTVybEszeb,Sandcastle Workshops for Children and Adults,community,2021-07-19T23:00:00Z,2021-07-20T00:30:00Z,AU,"[153.101805, -26.406535]"
3,Bx8DkaQfMtrgkVZnbN,The Gruffalo Spotter,expos,2021-07-19T23:00:00Z,2021-07-20T07:00:00Z,AU,"[152.959309, -26.557275]"
4,RwMoGyruxqZ4FuRRxS,Flower Crown Group Class,community,2021-07-19T23:00:00Z,2021-07-20T07:00:00Z,AU,"[151.17119, -33.78903]"
...,...,...,...,...,...,...,...
45,WWmgc3ZLpzXdutFvYh,Winter Night Market,community,2021-07-14T07:00:00Z,2021-07-14T12:00:00Z,AU,"[144.958081, -37.807671]"
46,WfonVPwTxfKdWYuQjj,Art After Hours,community,2021-07-14T07:00:00Z,2021-07-14T12:00:00Z,AU,"[151.216202, -33.868829]"
47,ZJE8EKUvasC8mjWuP6,Mindil Beach Sunset Market,festivals,2021-07-14T06:30:00Z,2021-07-14T12:30:00Z,AU,"[130.828231, -12.448136]"
48,GHsxqJRekrjsPahqnt,Weekly Youth Art Classes,community,2021-07-14T05:45:00Z,2021-07-14T07:15:00Z,AU,"[152.921212, -29.671462]"
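
The row labels in the output above restart on each page because a plain `pd.concat` keeps every page's own index. A small sketch of one way to get a clean running index, and to guard against the same event appearing on two pages, follows. Whether duplicates actually occur in this data is an assumption, and the two tiny pages are invented stand-ins.

```python
import pandas as pd

# Two small stand-in pages; the second repeats one event id from the first.
page_1 = pd.DataFrame({"id": ["a", "b"], "category": ["community", "expos"]})
page_2 = pd.DataFrame({"id": ["b", "c"], "category": ["expos", "sports"]})

# Plain concat keeps each page's own index, so labels repeat: 0, 1, 0, 1.
stacked = pd.concat([page_1, page_2])

# ignore_index=True renumbers the rows; drop_duplicates keeps the first row per id.
combined = (
    pd.concat([page_1, page_2], ignore_index=True)
    .drop_duplicates(subset="id")
    .reset_index(drop=True)
)
print(stacked.index.tolist())
print(combined)
```

A unique index makes later label-based selection with `.loc` unambiguous, which matters once the frame is filtered or joined.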


In [10]:
# Analyse the category types and entries within each category.
entries_df['category'].value_counts()

community          313
expos              262
sports             234
performing-arts    122
festivals           33
concerts            17
conferences         16
school-holidays      2
observances          1
Name: category, dtype: int64