# EXTRACTION
## Data Source
- Use the PredictHQ Events API: https://api.predicthq.com/v1/events?
## Methodology
### Data Analysis - View Data
1. Convert the response to JSON and inspect it to determine which variables are useful and required for the deliverable.

FINDINGS: The API returns at most 50 entries per page and up to 100 pages, so each call yields only 50 results and collecting more requires repeated calls.
### Data Analysis - Determine and Extract Data Required

- Number of entries required: 5000.
    1. Loop with the offset increasing in steps of 50 (0, 50, 100, 150, ... up to 4950), since the API returns a maximum of 5000 results.
- Variables required: id, country, category, title, start_date, end_date, location
    2. Use pandas to filter for the variables required.
        - *location holds the coordinates
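
The offset loop in step 1 can be sketched without touching the network. This is a minimal illustration rather than the notebook's actual code: `fetch_all` and `fake_fetch_page` are hypothetical names, and the fetcher is passed in as an argument so that a real `requests` call can replace the stand-in. The sketch also stops early when a page comes back short, which is an addition to the plan above.

```python
from typing import Callable, Dict, List

PAGE_SIZE = 50      # entries per call, per the findings above
MAX_ENTRIES = 5000  # number of entries required


def fetch_all(fetch_page: Callable[[int, int], List[Dict]],
              total: int = MAX_ENTRIES,
              page_size: int = PAGE_SIZE) -> List[Dict]:
    """Collect up to `total` records by stepping the offset by `page_size`."""
    records: List[Dict] = []
    for offset in range(0, total, page_size):
        page = fetch_page(offset, page_size)
        records.extend(page)
        if len(page) < page_size:
            # A short (or empty) page means there is nothing left to fetch.
            break
    return records


def fake_fetch_page(offset: int, limit: int) -> List[Dict]:
    """Hypothetical stand-in for the API: pretends 120 events exist."""
    available = [{"id": n} for n in range(120)]
    return available[offset:offset + limit]


events = fetch_all(fake_fetch_page)
capped = fetch_all(fake_fetch_page, total=100)
print(len(events), len(capped))
```

Here the first call stops after the third page because only 20 records remain, while the second call stops at the requested total of 100.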

# TRANSFORMATION
- Concatenate the dataframes (from all 100 pages) and save the final dataframe to a variable.

# LOAD
- Load data into PostgreSQL

-------------------------------

# EXTRACTION

In [52]:
# Dependencies
import json
import pprint

import numpy as np
import pandas as pd
import requests
from sqlalchemy import create_engine

# Credential File: py_config.py containing variable ACCESS_TOKEN = "xxxxxxxxx"
import py_config

## Data Analysis - View Data
Convert the response to JSON and inspect it. Determine which variables are useful and required for the deliverable.

In [2]:
# Connect to the API URL and get data
# ACCESS_TOKEN is defined in py_config.py, which is listed in .gitignore
response = requests.get(
    url="https://api.predicthq.com/v1/events?",
    headers={
      "Authorization": f"Bearer {py_config.ACCESS_TOKEN}",
      "Accept": "application/json"
    },
    # Use "limit": 50 to request the maximum page size
    params={
      "limit": 10,
      "country": "AU",
      "start": "2021-01-01",
      "end": "2022-12-31"
    }
)
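
To see how these parameters end up in the request URL, the query string can be built with the standard library alone. The sketch below mirrors the parameters used above and adds an `offset`, which the later paging loop relies on; `build_events_url` is a hypothetical helper, and nothing is sent to the API.

```python
from urllib.parse import urlencode

BASE_URL = "https://api.predicthq.com/v1/events"


def build_events_url(limit: int, offset: int = 0, **filters: str) -> str:
    """Return the full URL for one page of results (hypothetical helper)."""
    query = {"limit": limit, "offset": offset, **filters}
    return f"{BASE_URL}?{urlencode(query)}"


url = build_events_url(10, country="AU", start="2021-01-01", end="2022-12-31")
print(url)
```

Passing the same dictionary through `params=` in `requests.get` produces an equivalent query string, so the values do not need to be written into the URL by hand.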

In [4]:
# Convert the response to JSON (all data)
# Save to variable "data"
data = response.json()

# Print json (formatted) and analyse which variables to use for deliverable
print(json.dumps(data, indent=4, sort_keys=True))

 ],
            "local_rank": 80,
            "location": [
                143.355924,
                -36.269452
            ],
            "phq_attendance": 562,
            "place_hierarchies": [
                [
                    "6295630",
                    "6255151",
                    "2077456",
                    "2145234",
                    "7839534",
                    "2171731"
                ]
            ],
            "private": false,
            "rank": 45,
            "relevance": 1.0,
            "scope": "locality",
            "start": "2021-07-22T23:30:00Z",
            "state": "active",
            "timezone": "Australia/Melbourne",
            "title": "Charlton Badminton Club Ladies Tournament",
            "updated": "2020-08-11T06:36:24Z"
        },
        {
            "aviation_rank": null,
            "brand_safe": true,
            "category": "expos",
            "country": "AU",
            "description": "Democracy. Are You In? is a contem

In [18]:
data['results'][0]

{'relevance': 1.0,
 'id': '8MpZhBcd3DTi98XpjA',
 'title': 'Entertainment Fridays at Exchange Hotel Gawler',
 'description': "Fridays Live is back again - DJs playing tracks on the deck every week! It's always an awesome night!\n\nGet keen for groovy tunes all night!",
 'category': 'performing-arts',
 'labels': ['performing-arts'],
 'rank': 45,
 'local_rank': 73,
 'aviation_rank': None,
 'phq_attendance': 562,
 'entities': [{'entity_id': 'QkQyzzMwBEnGQd5axxn7Tj',
   'name': 'Entertainment Fridays at Exchange Hotel Gawler',
   'type': 'event-group',
   'category': 'performing-arts',
   'labels': ['event-group', 'performing-arts', 'recurring']}],
 'duration': 53940,
 'start': '2021-07-22T23:30:00Z',
 'end': '2021-07-23T14:29:00Z',
 'updated': '2021-01-18T06:42:51Z',
 'first_seen': '2021-01-18T06:35:24Z',
 'timezone': 'Australia/Adelaide',
 'location': [138.749307, -34.596536],
 'scope': 'locality',
 'country': 'AU',
 'place_hierarchies': [['6295630',
   '6255151',
   '2077456',
   '206132

## Data Analysis - Determine and Extract Data Required
- Number of entries required: 5000.
    1. Loop with the offset increasing in steps of 50 (0, 50, 100, 150, ... up to 4950), since the API returns a maximum of 5000 results.
- Variables required: id, country, category, title, start_date, end_date, location
    2. Use pandas to filter for the variables required.
        - *location holds the coordinates
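
Step 2 can be tried on a couple of hand-written records before calling the API. The sketch below assumes records shaped like the sample output above (only a few keys are reproduced) and keeps the API's own field names `start` and `end`, which correspond to start_date and end_date in the list.

```python
import pandas as pd

# Two abbreviated records modelled on the sample output; "rank" and "scope"
# stand in for the many fields that are not needed.
results = [
    {"id": "8MpZhBcd3DTi98XpjA",
     "title": "Entertainment Fridays at Exchange Hotel Gawler",
     "category": "performing-arts", "country": "AU", "rank": 45,
     "scope": "locality",
     "start": "2021-07-22T23:30:00Z", "end": "2021-07-23T14:29:00Z",
     "location": [138.749307, -34.596536]},
    {"id": "u6KMRTCBHZQ2EqxXq2",
     "title": "Charlton Badminton Club Ladies Tournament",
     "category": "sports", "country": "AU", "rank": 45,
     "scope": "locality",
     "start": "2021-07-22T23:30:00Z", "end": "2021-07-23T05:00:00Z",
     "location": [143.355924, -36.269452]},
]

required = ["id", "country", "category", "title", "start", "end", "location"]

# Flatten the records into a DataFrame, then keep only the required columns.
events_df = pd.json_normalize(results)[required]
print(events_df.columns.tolist())
```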

In [43]:
# Test for loop 
for i in range(0, 500, 50):
    print(i)

0
50
100
150
200
250
300
350
400
450


In [44]:
# Create list "events_entries" to store one DataFrame per page of results.
events_entries = []

# Helpers that pull a single field out of each row's "entities" list,
# falling back to a placeholder when the entry or key is missing.
def getEntitiesName(entities):
    try:
        return entities[0]['name']
    except (IndexError, KeyError, TypeError):
        return 'no name'

def getEntitiesAddress(entities):
    try:
        return entities[0]['formatted_address']
    except (IndexError, KeyError, TypeError):
        return 'no address'

def getEntitiesVenue(entities):
    try:
        return entities[1]['name']
    except (IndexError, KeyError, TypeError):
        return 'no venue'

# Loop over offsets 0 to 1450 in steps of 50 (the API returns at most 50 entries per call).
# This run collects 1500 entries; range(0, 5000, 50) would cover the full 5000.
for i in range(0, 1500, 50):

    response = requests.get(
        url="https://api.predicthq.com/v1/events?",
        headers={
            "Authorization": f"Bearer {py_config.ACCESS_TOKEN}",
            "Accept": "application/json"
        },
        params={
            "offset": i,
            "limit": 50,
            "country": "AU",
            "start": "2021-01-01",
            "end": "2022-12-31"
        }
    )

    # Save the parsed JSON response to variable "data"
    data = response.json()
    # Flatten the list under data["results"] into DataFrame "events_df"
    events_df = pd.json_normalize(data, ['results'], errors='ignore')

    # Add one column per extracted entity field
    events_df['name'] = events_df.entities.apply(getEntitiesName)
    events_df['formatted_address'] = events_df.entities.apply(getEntitiesAddress)
    events_df['venue_name'] = events_df.entities.apply(getEntitiesVenue)

    # Extract only the required variables (column headings)
    events_df = events_df[["id","title","description","category","start","end","country","location","rank","name","venue_name","formatted_address"]]

    events_entries.append(events_df)
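
The three helper functions above differ only in the list position, the key, and the fallback text, so they could be folded into one. A minimal sketch follows; `entity_field` is a hypothetical name, and the sample entity is abbreviated from the output shown earlier.

```python
from typing import Any, List


def entity_field(entities: Any, index: int, key: str, default: str) -> str:
    """Return entities[index][key], or `default` if either lookup fails."""
    try:
        return entities[index][key]
    except (IndexError, KeyError, TypeError):
        # IndexError: list too short; KeyError: key absent;
        # TypeError: the cell is not a list at all (e.g. NaN).
        return default


entities: List[dict] = [
    {"entity_id": "QkQyzzMwBEnGQd5axxn7Tj",
     "name": "Entertainment Fridays at Exchange Hotel Gawler",
     "type": "event-group"}
]

name = entity_field(entities, 0, "name", "no name")
address = entity_field(entities, 0, "formatted_address", "no address")
venue = entity_field(entities, 1, "name", "no venue")
missing = entity_field(float("nan"), 0, "name", "no name")
print(name, address, venue, missing, sep=" | ")
```

On a DataFrame the same helper could be applied with `events_df.entities.apply(entity_field, args=(0, 'name', 'no name'))`.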

# TRANSFORMATION
- Concatenate the dataframes from all fetched pages and save the final dataframe to a variable.

In [46]:
# Concatenate all the DataFrames within list "events_entries"
# into a single DataFrame.
# Save into variable "events_entries_df"
events_entries_df = pd.concat(events_entries)

# Rename columns
events_entries_df = events_entries_df.rename(columns={'start': 'start_date','end':'end_date','location':'coords','name':'title2'})
# events_entries_df

# Drop column title2
events_entries_df = events_entries_df.drop(columns=['title2'])
events_entries_df

| | id | title | description | category | start_date | end_date | country | coords | rank | venue_name | formatted_address |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 8MpZhBcd3DTi98XpjA | Entertainment Fridays at Exchange Hotel Gawler | Fridays Live is back again - DJs playing track... | performing-arts | 2021-07-22T23:30:00Z | 2021-07-23T14:29:00Z | AU | [138.749307, -34.596536] | 45 | no venue | no address |
| 1 | u6KMRTCBHZQ2EqxXq2 | Charlton Badminton Club Ladies Tournament | The Charlton Badminton Club conducts an annual... | sports | 2021-07-22T23:30:00Z | 2021-07-23T05:00:00Z | AU | [143.355924, -36.269452] | 45 | no venue | no address |
| 2 | 5DCJjv8bUXro8FjRk5 | Democracy. Are You In? | Democracy. Are You In? is a contemporary exhib... | expos | 2021-07-22T23:00:00Z | 2021-07-23T07:00:00Z | AU | [149.129959, -35.30196] | 45 | Museum of Australian Democracy | no address |
| 3 | 5MqdzFWEQag8zjmuzV | Education Show | | expos | 2021-07-22T23:00:00Z | 2021-07-24T08:00:00Z | AU | [144.953111, -37.825394] | 61 | no venue | 1 Convention Centre Place\nSouth Wharf VIC 300... |
| 4 | 6s3gqBBLNqXUEYBHAb | PlayUP: The Right to Have an Opinion and Be Heard | PlayUP is the Museum of Australian Democracy's... | expos | 2021-07-22T23:00:00Z | 2021-07-23T07:00:00Z | AU | [149.129768, -35.301112] | 45 | no venue | no address |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 45 | mDD6Pyyt2fy2NA3hGE | Bowen parkrun | Parkrun is a free, weekly, timed five kilometr... | sports | 2021-07-13T21:00:00Z | 2021-07-13T22:30:00Z | AU | [148.252191, -19.986951] | 45 | no venue | no address |
| 46 | 6goKpCode6wsHBgxg7 | Making Meditation Mainstream Free Beach Medita... | Making Meditation Mainstream is a community mo... | community | 2021-07-13T20:30:00Z | 2021-07-13T21:00:00Z | AU | [153.120592, -26.680521] | 45 | no venue | no address |
| 47 | 7DDBUNp6j3VhR8pnEU | Absolute Beginners Salsa Classes | Never danced before? Not a problem! Tropical S... | community | 2021-07-13T20:00:00Z | 2021-07-13T20:30:00Z | AU | [151.164101, -33.887906] | 0 | no venue | no address |
| 48 | 6sfpWP6VN8duPsMCnU | Screen Coach \| Acting For Screen Class | Want to learn a new creative skill or develop ... | community | 2021-07-13T17:30:00Z | 2021-07-14T08:30:00Z | AU | [151.27134, -33.900037] | 0 | no venue | no address |
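
In the output above the last rows are labelled 45 to 48 even though many more rows were collected. The row labels restart on every page because `pd.concat` keeps each frame's own index by default. If a single running index is preferred, `ignore_index=True` renumbers the rows. A small sketch with two made-up pages:

```python
import pandas as pd

# Two hypothetical pages of results, each with its own 0-based index.
page_1 = pd.DataFrame({"id": ["a", "b"], "rank": [45, 61]})
page_2 = pd.DataFrame({"id": ["c", "d"], "rank": [0, 45]})

# Default behaviour: each frame keeps its original row labels.
kept = pd.concat([page_1, page_2])

# ignore_index=True discards the old labels and numbers rows 0..n-1.
renumbered = pd.concat([page_1, page_2], ignore_index=True)

print(kept.index.tolist())
print(renumbered.index.tolist())
```

The load step uses `index=False`, so the choice only affects how the DataFrame reads in the notebook, not what is written to the database.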


In [47]:
events_entries_df.columns

Index(['id', 'title', 'description', 'category', 'start_date', 'end_date',
       'country', 'coords', 'rank', 'venue_name', 'formatted_address'],
      dtype='object')

In [57]:
# Check dtypes
events_entries_df.dtypes

id                   object
title                object
description          object
category             object
start_date           object
end_date             object
country              object
coords               object
rank                  int64
venue_name           object
formatted_address    object
dtype: object
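
`start_date` and `end_date` are stored as plain strings (`object`). If date arithmetic or a timestamp column in the database is wanted, they can be parsed first. The following is a minimal sketch using the first event's values from the output above; whether to convert before loading is a design choice the notebook leaves open.

```python
import pandas as pd

df = pd.DataFrame({
    "start_date": ["2021-07-22T23:30:00Z"],
    "end_date": ["2021-07-23T14:29:00Z"],
})

# Parse the ISO 8601 strings into timezone-aware (UTC) timestamps.
for col in ["start_date", "end_date"]:
    df[col] = pd.to_datetime(df[col], utc=True)

# With real datetimes, the event length is a simple subtraction.
duration_seconds = (df["end_date"] - df["start_date"]).dt.total_seconds()
print(duration_seconds.iloc[0])  # 53940.0
```

The result matches the `duration` value of 53940 seconds that the API reports for this event.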

In [58]:
# Analyse the category types and entries within each category.
events_entries_df['category'].value_counts()

community          499
expos              454
sports             255
performing-arts    178
festivals           47
conferences         37
concerts            26
school-holidays      2
observances          2
Name: category, dtype: int64

# LOAD

In [59]:
# Connect to local database
rds_connection_string = "postgres:postgres@localhost:5432/events_db"
engine = create_engine(f'postgresql://{rds_connection_string}')

In [63]:
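# Create database events_db and add the tables to Postgres as per schema.sql before running this cell.
# engine.table_names() is deprecated since SQLAlchemy 1.4; the inspector is the supported replacement.
from sqlalchemy import inspect
inspect(engine).get_table_names()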

['events_table']

In [64]:
# Load pandas dataframe events_entries_df to database events_db, table 'events_table'
events_entries_df.to_sql(name='events_table', con=engine, if_exists='append', index=False)
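
Because `if_exists='append'` adds rows every time the cell runs, re-running the load duplicates the data (or fails, if the schema declares `id` as a primary key). One way to guard against that is to skip ids that are already stored. The sketch below is an illustration only: it uses an in-memory SQLite database as a stand-in for PostgreSQL, a two-column table instead of the real schema, and a hypothetical helper named `append_new_events`.

```python
import sqlite3

import pandas as pd


def append_new_events(df: pd.DataFrame, con, table: str = "events_table") -> int:
    """Append only rows whose id is not yet in `table`; return rows written."""
    existing = pd.read_sql(f"SELECT id FROM {table}", con)["id"]
    new_rows = df[~df["id"].isin(existing)]
    if not new_rows.empty:
        new_rows.to_sql(name=table, con=con, if_exists="append", index=False)
    return len(new_rows)


# Stand-in database and table (assumption: the real table comes from schema.sql).
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE events_table (id TEXT PRIMARY KEY, title TEXT)")

batch = pd.DataFrame({
    "id": ["8MpZhBcd3DTi98XpjA", "u6KMRTCBHZQ2EqxXq2"],
    "title": ["Entertainment Fridays at Exchange Hotel Gawler",
              "Charlton Badminton Club Ladies Tournament"],
})

first = append_new_events(batch, con)   # both rows are new
second = append_new_events(batch, con)  # nothing left to add
stored = pd.read_sql("SELECT COUNT(*) AS n FROM events_table", con)["n"].iloc[0]
print(first, second, stored)
```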