# EXTRACTION
## Data Source
- Use Events API: https://api.predicthq.com/v1/events?
## Methodology
### Data Analysis - View Data
1. Convert to json format and read json. Determine which variables are useful and required for the deliverable.

FINDINGS: The API returns at most 50 entries per page, and up to 100 pages, i.e. each call returns at most 50 results.
### Data Analysis - Determine and Extract Data Required

- Number of entries required: 5000.
    1. Loop over the offset in steps of 50 (0, 50, 100, 150, etc. until it reaches 4950). The API returns at most 5000 results.
- Variables required: id, country, category, title, start_date (API field `start`), end_date (API field `end`), location
    2. Use pandas to filter for the variables required.
        - *location is the coordinates
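
The offset arithmetic in step 1 can be checked with a short sketch. The helper name `page_offsets` is hypothetical and only illustrates the plan; it makes no API calls.

```python
def page_offsets(total_entries: int = 5000, page_size: int = 50) -> list[int]:
    """Return the offset for every page needed to cover total_entries."""
    return list(range(0, total_entries, page_size))


offsets = page_offsets()
# 5000 entries at 50 per page gives 100 offsets, from 0 up to 4950.
print(len(offsets), offsets[0], offsets[-1])
```

Note that the end value passed to `range` is exclusive, so 5000 itself is never used as an offset.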

# TRANSFORMATION
- Concatenate the dataframes together (from all 100 pages) and save the final dataframe to a variable.

# LOAD
- Load data into PostgreSQL
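
The notebook below does not reach the load step, so the following is only a sketch of one way it might look. It assumes the final dataframe keeps `location` as a `[longitude, latitude]` list (as in the sample output), and it uses an in-memory SQLite database as a stand-in so that it runs without a server. The table name `events` and all row values are made up for illustration.

```python
import sqlite3

import pandas as pd

# Hypothetical rows shaped like the final events dataframe (values are made up).
events = pd.DataFrame({
    "id": ["evt-1", "evt-2"],
    "title": ["Example Expo", "Example Match"],
    "category": ["expos", "sports"],
    "start": ["2021-07-20T23:00:00Z", "2021-07-11T04:00:00Z"],
    "end": ["2021-07-21T07:00:00Z", "2021-07-11T04:00:00Z"],
    "country": ["AU", "AU"],
    "location": [[149.1, -35.3], [153.1, -26.7]],
})

# A Python list cannot be stored in a single SQL column, so split the pair.
# The sample output lists longitude first, then latitude.
coords = pd.DataFrame(
    events["location"].tolist(),
    columns=["longitude", "latitude"],
    index=events.index,
)
table = pd.concat([events.drop(columns="location"), coords], axis=1)

# Stand-in connection; this is SQLite, not PostgreSQL.
connection = sqlite3.connect(":memory:")
table.to_sql("events", connection, index=False, if_exists="replace")

row_count = connection.execute("SELECT COUNT(*) FROM events").fetchone()[0]
print(row_count)
```

For PostgreSQL, `DataFrame.to_sql` would typically be given a SQLAlchemy engine (created with `sqlalchemy.create_engine` and a `postgresql://` URL) in place of the SQLite connection. The connection details are not given in this document.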

-------------------------------

# EXTRACTION

In [1]:
# Dependencies
import json

import pandas as pd
import requests

# Credential File: py_config.py containing variable ACCESS_TOKEN = "xxxxxxxxx"
import py_config

## Data Analysis - View Data
Convert to json format and read json. Determine which variables are useful and required for the deliverable.

In [2]:
# Connect to API url and get data
# ACCESS_TOKEN is defined in py_config.py, which is listed in .gitignore
response = requests.get(
    url="https://api.predicthq.com/v1/events?",
    headers={
        "Authorization": f"Bearer {py_config.ACCESS_TOKEN}",
        "Accept": "application/json",
    },
    params={
        "limit": 10,
        "country": "AU",
        "start": "2021-01-01",
        "end": "2022-12-31",
    },
)

In [3]:
# Convert data to json format (all data)
# Save to variable "data"
data = response.json()

# Print json (formatted) and analyse which variables to use for deliverable
print(json.dumps(data, indent=4, sort_keys=True))

{
    "count": 5000,
    "next": "https://api.predicthq.com/v1/events/?country=AU&end=2022-12-31&limit=10&offset=10&start=2021-01-01",
    "overflow": true,
    "previous": null,
    "results": [
        {
            "aviation_rank": null,
            "brand_safe": true,
            "category": "community",
            "country": "AU",
            "description": "The Busy Peacock is a place to come and let the children experiment and create. Be it mud, paint, glue, goo, water play, sensory rice play or otherwise, all those things you really want your child to experience but just not in your own home.\n\n45 minute play session running Tuesday \u2013 Sunday, check the calendar for session times and availability.",
            "duration": 2700,
            "end": "2021-07-21T00:15:00Z",
            "entities": [
                {
                    "category": "community",
                    "entity_id": "Dn7bwAXxAEg29PzwifNf5V",
                    "labels": [
                        

In [4]:
# Extract data within dictionary key "results"
# Save to variable "data1"
data1 = data['results']
print(json.dumps(data1, indent=4, sort_keys=True))
print('-----------------')

[
    {
        "aviation_rank": null,
        "brand_safe": true,
        "category": "community",
        "country": "AU",
        "description": "The Busy Peacock is a place to come and let the children experiment and create. Be it mud, paint, glue, goo, water play, sensory rice play or otherwise, all those things you really want your child to experience but just not in your own home.\n\n45 minute play session running Tuesday \u2013 Sunday, check the calendar for session times and availability.",
        "duration": 2700,
        "end": "2021-07-21T00:15:00Z",
        "entities": [
            {
                "category": "community",
                "entity_id": "Dn7bwAXxAEg29PzwifNf5V",
                "labels": [
                    "community",
                    "event-group",
                    "recurring"
                ],
                "name": "Messy Play Sessions",
                "type": "event-group"
            }
        ],
        "first_seen": "2020-11-18T06:35:0

In [5]:
address_df = pd.json_normalize(data1,['entities'], errors='ignore')
address_df.head(20)

Unnamed: 0,entity_id,name,type,category,labels,formatted_address
0,Dn7bwAXxAEg29PzwifNf5V,Messy Play Sessions,event-group,community,"[community, event-group, recurring]",
1,G3EQ2FUNKgmcdjgtCy7YkX,Future of Financial Services,event-group,conferences,"[conference, event-group]",8 Whiteman St Southbank VIC 3006
2,nSefEmFFYeKvx6YFVUThjj,The Pub at Crown,venue,,,8 Whiteman St\nSouthbank VIC 3006\nAustralia
3,aTApFCCi542T6J2Eu43jGX,The Trevor Kennedy Collection: Highlights,event-group,expos,"[event-group, expo, recurring]",
4,3B3FqxeXvKHJTL7hN4wS3UL,National Museum of Australia,venue,,,Lawson Crescent Acton Peninsula\nCanberra ACT ...
5,7YcCaPcvZ7ANVcuYkgvJLv,Gunnedah Saleyards,event-group,expos,"[event-group, expo, recurring]",
6,itALLWYpzvpGMAqC8ASvsT,PlayUP: The Right to Have an Opinion and Be Heard,event-group,expos,"[event-group, expo, recurring]",
7,KCMHGXPVBUW8gD2iSsZndN,Indigenous Stock Workers and Rodeo Riders Disp...,event-group,expos,"[community, event-group, expo, recurring]",
8,3AdSUYSNnAXV7jnbvJS48BH,Dockside,venue,,,Wheat Road\nSydney NSW 2000\nAustralia
9,WyuSXvLNHDDBxnVAUuwKbE,The Polished Opal,event-group,community,"[community, event-group, recurring]",


In [6]:
# View json data in dataframe format using pandas. (Note: variable "data" includes all data)
# variable "events_df" dataframe from dictionary data > results 
events_df = pd.json_normalize(data, ['results'], errors='ignore')
events_df.head(20)

Unnamed: 0,relevance,id,title,description,category,labels,rank,local_rank,aviation_rank,phq_attendance,...,updated,first_seen,timezone,location,scope,country,place_hierarchies,state,brand_safe,private
0,1.0,UZGsZMeY5YegR7MMdD,Messy Play Sessions,The Busy Peacock is a place to come and let th...,community,[community],0,0,,,...,2020-11-18T06:48:00Z,2020-11-18T06:35:06Z,Australia/Melbourne,"[145.18578, -38.261048]",locality,AU,"[[6295630, 6255151, 2077456, 2145234, 7839813,...",active,True,False
1,1.0,eQN8f47RzTXvibpxY7,INVOLVING FAMILIES AND OTHERS IN THE CARE OF Y...,This workshop will explore the benefits and ba...,conferences,"[conference, health]",0,0,,,...,2021-01-26T07:52:26Z,2021-01-26T07:50:49Z,Australia/Melbourne,"[144.947904, -37.780223]",locality,AU,"[[6295630, 6255151, 2077456, 2145234, 7839805,...",active,True,False
2,1.0,57NrcFdTDNZcHGZ5Jw,Future of Financial Services,,conferences,[conference],46,56,0.0,600.0,...,2021-03-19T15:48:49Z,2021-02-07T15:50:42Z,Australia/Melbourne,"[144.959213, -37.82341]",locality,AU,"[[6295630, 6255151, 2077456, 2145234, 7839805,...",active,True,False
3,1.0,5gt6YrELDVgyH9H6N4,The Trevor Kennedy Collection: Highlights,Discover objects of rare beauty and items of c...,expos,[expo],45,70,,562.0,...,2021-04-20T07:05:10Z,2021-04-20T06:41:52Z,Australia/Sydney,"[149.119532, -35.292481]",locality,AU,"[[6295630, 6255151, 2077456, 2177478, 2172517]]",active,True,False
4,1.0,65grzvvvAQcmZH7rim,Gunnedah Saleyards,Experience one of the largest stock selling ce...,expos,[expo],45,73,,562.0,...,2021-04-13T06:46:31Z,2021-04-13T06:35:03Z,Australia/Sydney,"[150.224511, -30.958274]",locality,AU,"[[6295630, 6255151, 2077456, 2155400, 7839725,...",active,True,False
5,1.0,7yXLeY7ZcWoLx8YQU9,PlayUP: The Right to Have an Opinion and Be Heard,PlayUP is the Museum of Australian Democracy's...,expos,[expo],45,70,,562.0,...,2021-03-23T08:31:56Z,2021-03-23T06:01:49Z,Australia/Sydney,"[149.129768, -35.301112]",locality,AU,"[[6295630, 6255151, 2077456, 2177478, 2172517]]",active,True,False
6,1.0,9JFGB6yX8SjKcm3BZm,Indigenous Stock Workers and Rodeo Riders Disp...,An informative and visual display in recogniti...,expos,"[community, expo]",45,75,,562.0,...,2021-01-18T22:08:33Z,2020-12-17T06:33:48Z,Australia/Brisbane,"[141.081268, -17.668348]",locality,AU,"[[6295630, 6255151, 2077456, 2152274, 7839568,...",active,True,False
7,1.0,Aii4mGwYmU6GLoCY97,Fitness Industry Technology Summit,,conferences,"[conference, health, technology]",40,47,0.0,300.0,...,2021-02-04T14:08:50Z,2020-09-07T14:32:51Z,Australia/Sydney,"[151.202228, -33.871976]",locality,AU,"[[6295630, 6255151, 2077456, 2155400, 6619279,...",active,True,False
8,1.0,EJqNn748zA5y6wMUQq,The Polished Opal,The workshop starts with a short talk about op...,community,[community],0,0,,,...,2020-07-02T23:02:56Z,2019-12-09T07:03:51Z,Australia/Sydney,"[150.333328, -33.702448]",locality,AU,"[[6295630, 6255151, 2077456, 2155400, 2175228,...",active,True,False
9,1.0,EuRTzyaAcG7KyFV5E3,"Truth, Power and a Free Press","Truth, Power and a Free press is a compelling ...",expos,[expo],45,71,,562.0,...,2021-03-31T06:58:42Z,2020-07-02T06:35:22Z,Australia/Sydney,"[149.129959, -35.30196]",locality,AU,"[[6295630, 6255151, 2077456, 2177478, 2172517]]",active,True,False


In [7]:
# Merge events with their entities by matching the event title to the entity name
events_final_df = pd.merge(events_df, address_df, left_on="title", right_on="name")
events_final_df[["title","name"]].head(10)

Unnamed: 0,title,name
0,Messy Play Sessions,Messy Play Sessions
1,Future of Financial Services,Future of Financial Services
2,The Trevor Kennedy Collection: Highlights,The Trevor Kennedy Collection: Highlights
3,Gunnedah Saleyards,Gunnedah Saleyards
4,PlayUP: The Right to Have an Opinion and Be Heard,PlayUP: The Right to Have an Opinion and Be Heard
5,Indigenous Stock Workers and Rodeo Riders Disp...,Indigenous Stock Workers and Rodeo Riders Disp...
6,The Polished Opal,The Polished Opal
7,"Truth, Power and a Free Press","Truth, Power and a Free Press"
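
Matching on title and name keeps only entities whose name equals the event title, so venue entities such as "The Pub at Crown" (which hold most of the addresses in the sample) fall out of the merge. One possible alternative, sketched below on made-up data, is to carry the event `id` into the entities table with the `meta` argument of `pd.json_normalize` and merge on that instead. The sample records and the `venues_df` / `merged_df` names are assumptions for illustration, not part of the notebook.

```python
import pandas as pd

# Two hypothetical events shaped like entries of data["results"] (values are made up).
sample_results = [
    {"id": "evt-1", "title": "Example Conference", "entities": [
        {"entity_id": "ent-a", "name": "Example Conference",
         "type": "event-group", "formatted_address": ""},
        {"entity_id": "ent-b", "name": "Example Venue",
         "type": "venue", "formatted_address": "1 Example St"},
    ]},
    {"id": "evt-2", "title": "Example Workshop", "entities": [
        {"entity_id": "ent-c", "name": "Other Venue",
         "type": "venue", "formatted_address": "2 Example St"},
    ]},
]

# meta=["id"] copies each event's id onto every one of its entity rows.
entities_df = pd.json_normalize(sample_results, record_path="entities", meta=["id"])

# Keep only venue entities, then attach them to the events by id.
venues_df = entities_df[entities_df["type"] == "venue"]
events_df = pd.DataFrame(sample_results)[["id", "title"]]
merged_df = events_df.merge(
    venues_df[["id", "name", "formatted_address"]], on="id", how="left"
)
print(merged_df)
```

With `how="left"`, events without a venue entity would still appear, with empty venue columns.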


In [8]:
# View all columns (in dictionary: events > results). Determine required columns for deliverable.
events_final_df.columns

Index(['relevance', 'id', 'title', 'description', 'category_x', 'labels_x',
       'rank', 'local_rank', 'aviation_rank', 'phq_attendance', 'entities',
       'duration', 'start', 'end', 'updated', 'first_seen', 'timezone',
       'location', 'scope', 'country', 'place_hierarchies', 'state',
       'brand_safe', 'private', 'entity_id', 'name', 'type', 'category_y',
       'labels_y', 'formatted_address'],
      dtype='object')

## Data Analysis - Determine and Extract Data Required
- Number of entries required: 5000.
    1. Loop over the offset in steps of 50 (0, 50, 100, 150, etc. until it reaches 4950). The API returns at most 5000 results.
- Variables required: id, country, category, title, start_date (API field `start`), end_date (API field `end`), location
    2. Use pandas to filter for the variables required.
        - *location is the coordinates

In [9]:
# Test the for loop
for i in range(0, 500, 50):
    print(i)

0
50
100
150
200
250
300
350
400
450


In [31]:
# Create lists "events_entries" and "address_entries" to store the dataframes;
# each loop iteration appends one pandas dataframe (one page of results) to each list.
events_entries = []
address_entries = []


# Loop over offsets in multiples of 50 (the API returns at most 50 entries per call).
# This run covers offsets 0 to 1450; the full 5000 entries would need range(0, 5000, 50).
for i in range(0, 1500, 50):

    response = requests.get(
        url=f"https://api.predicthq.com/v1/events?offset={i}&limit=50",
        headers={
            "Authorization": f"Bearer {py_config.ACCESS_TOKEN}",
            "Accept": "application/json",
        },
        params={
            "country": "AU",
            "start": "2021-01-01",
            "end": "2022-12-31",
        },
    )

    # TEST: print(i)

    # Save response to variable "data" and "data1"
    data = response.json()
    data1 = data['results']
    # Save to variable "events_df" the dictionary (data> results) 
    events_df = pd.json_normalize(data, ['results'], errors='ignore')
    # print(events_df.head(3))
    address_df = pd.json_normalize(data1,['entities'], errors='ignore')
    # print(address_df.head(3))

    # Extract out only required variables (column headings)
    events_df = events_df[["id","title","description","category","start","end","country","location"]]
    print(events_df.head(3))
    address_df = address_df[["name","formatted_address"]]
    print(address_df.head(3))
        
    events_entries.append(events_df)
    address_entries.append(address_df)

    # TEST
    # print(events_entries)
    # print(address_entries)

                   id                                              title  \
0  UZGsZMeY5YegR7MMdD                                Messy Play Sessions   
1  eQN8f47RzTXvibpxY7  INVOLVING FAMILIES AND OTHERS IN THE CARE OF Y...   
2  57NrcFdTDNZcHGZ5Jw                       Future of Financial Services   

                                         description     category  \
0  The Busy Peacock is a place to come and let th...    community   
1  This workshop will explore the benefits and ba...  conferences   
2                                                     conferences   

                  start                   end country  \
0  2021-07-20T23:30:00Z  2021-07-21T00:15:00Z      AU   
1  2021-07-20T23:30:00Z  2021-07-21T06:30:00Z      AU   
2  2021-07-20T23:00:00Z  2021-07-21T08:00:00Z      AU   

                   location  
0   [145.18578, -38.261048]  
1  [144.947904, -37.780223]  
2   [144.959213, -37.82341]  
                           name                             formatted
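
The loop above always requests a fixed number of pages. A variation that stops as soon as a page comes back empty is sketched below. The function `collect_pages` and its `fetch_page` argument are hypothetical names. A fake fetcher stands in for the HTTP call so the sketch runs offline.

```python
from typing import Callable


def collect_pages(
    fetch_page: Callable[[int, int], dict],
    page_size: int = 50,
    max_results: int = 5000,
) -> list[dict]:
    """Request pages at increasing offsets until one is empty or the cap is hit."""
    collected = []
    for offset in range(0, max_results, page_size):
        payload = fetch_page(offset, page_size)
        results = payload.get("results", [])
        if not results:
            break  # no more data, so stop paging early
        collected.extend(results)
    return collected


def fake_fetch(offset: int, limit: int) -> dict:
    """Stand-in for the API call: serves slices of 120 made-up events."""
    events = [{"id": f"evt-{n}"} for n in range(120)]
    return {"results": events[offset:offset + limit]}


all_events = collect_pages(fake_fetch)
print(len(all_events))
```

In real use, `fetch_page` would wrap the `requests.get` call shown above and could call `response.raise_for_status()` before returning `response.json()`, so that a failed request stops the run instead of raising a `KeyError` on `data['results']`.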

# TRANSFORMATION
- Concatenate the dataframes together (from all 100 pages) and save the final dataframe to a variable.

In [42]:
# Concatenate all the dataframes within list "events_entries"
# into a single dataframe.
# Save into variable "events_entries_df"
events_entries_df = pd.concat(events_entries)

# Replace blank fields with NaN
nan_value = float("NaN")
events_entries_df.replace("", nan_value, inplace=True)

# Drop rows that have no title
events_entries_df.dropna(subset=["title"], inplace=True)
events_entries_df

Unnamed: 0,id,title,description,category,start,end,country,location
0,UZGsZMeY5YegR7MMdD,Messy Play Sessions,The Busy Peacock is a place to come and let th...,community,2021-07-20T23:30:00Z,2021-07-21T00:15:00Z,AU,"[145.18578, -38.261048]"
1,eQN8f47RzTXvibpxY7,INVOLVING FAMILIES AND OTHERS IN THE CARE OF Y...,This workshop will explore the benefits and ba...,conferences,2021-07-20T23:30:00Z,2021-07-21T06:30:00Z,AU,"[144.947904, -37.780223]"
2,57NrcFdTDNZcHGZ5Jw,Future of Financial Services,,conferences,2021-07-20T23:00:00Z,2021-07-21T08:00:00Z,AU,"[144.959213, -37.82341]"
3,5gt6YrELDVgyH9H6N4,The Trevor Kennedy Collection: Highlights,Discover objects of rare beauty and items of c...,expos,2021-07-20T23:00:00Z,2021-07-21T07:00:00Z,AU,"[149.119532, -35.292481]"
4,65grzvvvAQcmZH7rim,Gunnedah Saleyards,Experience one of the largest stock selling ce...,expos,2021-07-20T23:00:00Z,2021-07-21T01:00:00Z,AU,"[150.224511, -30.958274]"
...,...,...,...,...,...,...,...,...
45,h8Tgn52FZ9ehPwsngi,Queensland NPL 2 Youth - Sunshine Coast U23 vs...,,sports,2021-07-11T04:00:00Z,2021-07-11T04:00:00Z,AU,"[153.119022, -26.731251]"
46,kWzqFJbmKmN2tU82rQ,Private photography class in Sydney,"Develop your composition and shooting skills, ...",community,2021-07-11T04:00:00Z,2021-07-11T07:00:00Z,AU,"[151.208565, -33.858768]"
47,7URTbV4WsM39x4UTsC,Queensland NPL Youth League - Brisbane Roar U2...,,sports,2021-07-11T03:45:00Z,2021-07-11T03:45:00Z,AU,"[153.262453, -27.532772]"
48,LauEV6prCSrCp5hjbr,Queensland NPL 2 Youth - Souths United U23 vs ...,,sports,2021-07-11T03:45:00Z,2021-07-11T03:45:00Z,AU,"[153.069671, -27.590388]"
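
The index in the output above restarts for every page (the last rows are labelled 45 to 48), because `pd.concat` keeps each page's own index by default. A sketch of the same clean-up with a continuous index, on two made-up pages, might look like this:

```python
import numpy as np
import pandas as pd

# Two hypothetical pages of results (values are made up); one title is blank.
page_1 = pd.DataFrame({"id": ["a", "b"], "title": ["First", ""]})
page_2 = pd.DataFrame({"id": ["c", "d"], "title": ["Third", "Fourth"]})

# ignore_index=True renumbers the rows 0..n-1 across all pages.
combined = pd.concat([page_1, page_2], ignore_index=True)

# Blank strings become NaN, rows without a title are dropped,
# and the index is renumbered once more after the drop.
cleaned = (
    combined.replace("", np.nan)
    .dropna(subset=["title"])
    .reset_index(drop=True)
)
print(cleaned)
```

Chaining the calls returns a new dataframe instead of modifying one in place, which makes the cell safe to re-run.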


In [43]:
# Concatenate all the dataframes within list "address_entries"
# into a single dataframe.
# Save into variable "address_entries_df"
address_entries_df = pd.concat(address_entries)

# Replace blank fields with NaN
nan_value = float("NaN")
address_entries_df.replace("", nan_value, inplace=True)

# Drop rows that have no name
address_entries_df.dropna(subset=["name"], inplace=True)
address_entries_df

Unnamed: 0,name,formatted_address
0,Messy Play Sessions,
1,Future of Financial Services,8 Whiteman St Southbank VIC 3006
2,The Pub at Crown,8 Whiteman St\nSouthbank VIC 3006\nAustralia
3,The Trevor Kennedy Collection: Highlights,
4,National Museum of Australia,Lawson Crescent Acton Peninsula\nCanberra ACT ...
...,...,...
43,Capitol Theatre,13 Campbell Street\nHaymarket NSW 2000\nAustralia
44,Kawana SC Field 1,Milieu Place\nKawana Waters\nAustralia
45,Arthur & Allan Morris Field (Cleveland Showgro...,"60 - 76 Waterloo Street, Cleveland\nBrisbane\n..."
46,Wakerley Park,23 Dew St.\nRuncorn QLD 4113\nAustralia


- Merge dataframes together by title and name

In [46]:
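# At this stage, there are 2 dataframes: events_entries_df and address_entries_df.
# Merge dataframes by title and name. (events_entries_df contains col heading 'title', address_entries_df contains col heading 'name')
events_final_df = pd.merge(events_entries_df, address_entries_df, how="left", left_on="title", right_on="name")
# Note: DataFrame has no 'dropduplicates' attribute; the method is drop_duplicates().
# 'location' holds lists, which are unhashable, so deduplicate on string columns only.
events_final_df = events_final_df.drop_duplicates(subset=["id", "name", "formatted_address"])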

In [30]:
# Analyse the category types and entries within each category.
events_final_df['category'].value_counts()

expos              20
community          15
festivals           2
conferences         2
performing-arts     1
sports              1
Name: category, dtype: int64