# EXTRACTION
## Data Source
- Use Events API: https://api.predicthq.com/v1/events?
## Methodology
### Data Analysis - View Data
1. Convert the API response to JSON and inspect it. Determine which variables are useful and required for the deliverable.

FINDINGS: The API returns 50 entries per page, for up to 100 pages, i.e. each call returns at most 50 results.
### Data Analysis - Determine and Extract Data Required

- Number of entries required: 5000.
    1. Loop over offsets in steps of 50 (0, 50, 100, 150, ... up to 4950). The API returns at most 5000 results.
- Variables required: id, title, category, start (start date), end (end date), country, location
    2. Use pandas to filter for the variables required.
        - *location holds the coordinates
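
The offset arithmetic above can be checked with a short sketch. The constant names `PAGE_SIZE` and `MAX_RESULTS` are chosen here for illustration only; their values come from the findings above (50 entries per page, 5000 results maximum).

```python
# Illustrative constants (names are assumptions; values are from the findings above)
PAGE_SIZE = 50       # entries returned per API call
MAX_RESULTS = 5000   # maximum number of results available from the API

# One offset per page: 0, 50, 100, ..., 4950
offsets = list(range(0, MAX_RESULTS, PAGE_SIZE))

print(len(offsets))   # 100 (number of API calls needed)
print(offsets[:4])    # [0, 50, 100, 150]
print(offsets[-1])    # 4950
```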

# TRANSFORMATION
- Concatenate the dataframes from all 100 pages into one final dataframe and save it to a variable.

# LOAD
- Load data into PostgreSQL
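
A minimal sketch of what the load step could look like, assuming the final records are available as a list of dictionaries and that a DB-API connection is used. The table name `events`, its column layout (including separate `longitude` and `latitude` columns), and the helper `load_events` are all hypothetical. `sqlite3` stands in for PostgreSQL here only so the sketch runs without a database server; with a PostgreSQL driver such as psycopg2 the placeholder would be `%s` rather than `?`.

```python
import sqlite3

# Hypothetical table layout; names are quoted because END is an SQL keyword
COLUMNS = ["id", "title", "category", "start", "end", "country", "longitude", "latitude"]


def load_events(conn, rows, placeholder="?"):
    """Create the hypothetical "events" table if needed and insert rows (a list of dicts)."""
    quoted = ", ".join(f'"{name}"' for name in COLUMNS)
    marks = ", ".join([placeholder] * len(COLUMNS))
    cur = conn.cursor()
    cur.execute(
        'CREATE TABLE IF NOT EXISTS events ('
        '"id" TEXT PRIMARY KEY, "title" TEXT, "category" TEXT, '
        '"start" TEXT, "end" TEXT, "country" TEXT, '
        '"longitude" REAL, "latitude" REAL)'
    )
    cur.executemany(
        f"INSERT INTO events ({quoted}) VALUES ({marks})",
        [tuple(row[name] for name in COLUMNS) for row in rows],
    )
    conn.commit()


# One row built from values that appear in the notebook output
rows = [{
    "id": "rii6TAkfVYBDb5dQDe",
    "title": "SunKIDZ at SunPAC",
    "category": "community",
    "start": "2021-07-19T23:30:00Z",
    "end": "2021-07-20T00:30:00Z",
    "country": "AU",
    "longitude": 153.073449,
    "latitude": -27.572601,
}]

conn = sqlite3.connect(":memory:")  # stand-in for a PostgreSQL connection
load_events(conn, rows)
count = conn.execute("SELECT COUNT(*) FROM events").fetchone()[0]
print(count)  # 1
```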

-------------------------------

# EXTRACTION

In [1]:
# Dependencies
import json

import pandas as pd
import requests

# Credential File: py_config.py containing variable ACCESS_TOKEN = "xxxxxxxxx"
import py_config

## Data Analysis - View Data
Convert the API response to JSON and inspect it. Determine which variables are useful and required for the deliverable.

In [2]:
# Connect to the API URL and get data
# ACCESS_TOKEN is defined in py_config.py, which is listed in .gitignore
response = requests.get(
    url="https://api.predicthq.com/v1/events?",
    headers={
      "Authorization": f"Bearer {py_config.ACCESS_TOKEN}",
      "Accept": "application/json"
    },
    params={
        "limit": 50,
        # "offset": 100,
        # "q": "England"
    }
)

In [3]:
# Convert the response to JSON (all data)
# Save to variable "data"
data = response.json()

# Print json (formatted) and analyse which variables to use for deliverable
print(json.dumps(data, indent=4, sort_keys=True))

{
    "count": 5000,
    "next": "https://api.predicthq.com/v1/events/?limit=50&offset=50",
    "overflow": true,
    "previous": null,
    "results": [
        {
            "aviation_rank": null,
            "brand_safe": true,
            "category": "concerts",
            "country": "US",
            "description": "",
            "duration": 0,
            "end": "2021-07-19T23:30:00Z",
            "entities": [
                {
                    "entity_id": "yTUkSTXiPVwcNMNXpTKL27",
                    "formatted_address": "1349 Bomber Rd\nFt.Worth, TX 76108\nUnited States of America",
                    "name": "The Point",
                    "type": "venue"
                }
            ],
            "first_seen": "2021-01-15T12:25:57Z",
            "id": "4FxycKBAweP7GkjBdT",
            "labels": [
                "concert",
                "music"
            ],
            "local_rank": 0,
            "location": [
                -97.453308,
                32.7834

In [4]:
# Extract data within dictionary key "results"
# Save to variable "data1"
data1 = data['results']
print(json.dumps(data1, indent=4, sort_keys=True))
print('-----------------')

[
    {
        "aviation_rank": null,
        "brand_safe": true,
        "category": "concerts",
        "country": "US",
        "description": "",
        "duration": 0,
        "end": "2021-07-19T23:30:00Z",
        "entities": [
            {
                "entity_id": "yTUkSTXiPVwcNMNXpTKL27",
                "formatted_address": "1349 Bomber Rd\nFt.Worth, TX 76108\nUnited States of America",
                "name": "The Point",
                "type": "venue"
            }
        ],
        "first_seen": "2021-01-15T12:25:57Z",
        "id": "4FxycKBAweP7GkjBdT",
        "labels": [
            "concert",
            "music"
        ],
        "local_rank": 0,
        "location": [
            -97.453308,
            32.783417
        ],
        "phq_attendance": null,
        "place_hierarchies": [
            [
                "6295630",
                "6255149",
                "6252001",
                "4736286",
                "4685912",
                "4691930"
   

In [5]:
# View json data in dataframe format using pandas. (Note: variable "data" includes all data)
# variable "events_df" dataframe from dictionary data > results 
events_df = pd.json_normalize(data, ['results'], errors='ignore')
events_df

Unnamed: 0,relevance,id,title,description,category,labels,rank,local_rank,aviation_rank,phq_attendance,...,first_seen,timezone,location,scope,country,place_hierarchies,state,brand_safe,private,predicted_end
0,1.0,4FxycKBAweP7GkjBdT,Rob Owen,,concerts,"[concert, music]",0,0.0,,,...,2021-01-15T12:25:57Z,America/Chicago,"[-97.453308, 32.783417]",locality,US,"[[6295630, 6255149, 6252001, 4736286, 4685912,...",active,True,False,
1,1.0,CLzr9XYshKobzR5PY6,The Bulldogs,,concerts,"[concert, music]",0,0.0,,,...,2020-11-02T21:00:57Z,America/Indiana/Indianapolis,"[-85.671104, 40.258144]",locality,US,"[[6295630, 6255149, 6252001, 4921868, 4923124,...",active,True,False,
2,1.0,MmLncHJEX8VHGDneqd,Riley County Democratic Party Monthly Meeting,Every Second Monday at 6:30 p.m.\nBlue Hills Room,community,"[community, family, politics]",17,38.0,,22.0,...,2020-12-28T03:16:21Z,America/Chicago,"[-96.566167, 39.178959]",locality,US,"[[6295630, 6255149, 6252001, 4273857, 4278061,...",active,True,False,
3,1.0,rii6TAkfVYBDb5dQDe,SunKIDZ at SunPAC,SunPAC and Harmonie Music Centre present\n\nSu...,community,[community],0,0.0,,,...,2021-03-30T06:42:00Z,Australia/Brisbane,"[153.073449, -27.572601]",locality,AU,"[[6295630, 6255151, 2077456, 2152274, 7839562,...",active,True,False,
4,1.0,uAcGR7kQs4tnhBAkjT,Planning Board Work Session,Planning Board Work Session,community,[community],17,38.0,,22.0,...,2020-07-06T03:15:57Z,America/New_York,"[-75.049463, 39.910542]",locality,US,"[[6295630, 6255149, 6252001, 5101760, 4501019,...",active,True,False,
5,1.0,ytHcNGZePu5KU2hXcn,Justin Bieber (Rescheduled from 8/24/2020),,concerts,"[concert, music]",71,90.0,0.0,11520.0,...,2019-12-25T18:24:56Z,America/New_York,"[-78.876438, 42.875439]",locality,US,"[[6295630, 6255149, 6252001, 5128638, 5116642,...",active,True,False,
6,1.0,THRJFby3jgtzzFQvR3,San Diego Padres vs Atlanta Braves,,sports,"[baseball, mlb, sport]",74,95.0,61.0,15903.0,...,2020-07-17T18:32:15Z,America/New_York,"[-84.467771, 33.890785]",locality,US,"[[6295630, 6255149, 6252001, 4197000, 4188547,...",active,True,False,2021-07-20T02:10:00Z
7,1.0,3X7TLwKNzrtvPA2sRz,New York Mets vs Cincinnati Reds,,sports,"[baseball, mlb, sport]",73,92.0,62.0,14856.0,...,2020-07-11T18:29:03Z,America/New_York,"[-84.508151, 39.097931]",locality,US,"[[6295630, 6255149, 6252001, 6254925, 4286705,...",active,True,False,2021-07-20T02:00:00Z
8,1.0,9RP6cuRKcpBJmG9TZa,Baltimore Orioles vs Tampa Bay Rays,,sports,"[baseball, mlb, sport]",67,90.0,61.0,7320.0,...,2020-07-18T18:31:00Z,America/New_York,"[-82.653392, 27.768225]",locality,US,"[[6295630, 6255149, 6252001, 4155751, 4168618,...",active,True,False,2021-07-20T02:00:00Z
9,1.0,B5QE5Dm2o92rQBPa6K,Texas Rangers vs Detroit Tigers,,sports,"[baseball, mlb, sport]",66,87.0,0.0,6356.0,...,2021-01-15T00:57:31Z,America/Detroit,"[-83.04852, 42.338998]",locality,US,"[[6295630, 6255149, 6252001, 5001836, 5014227,...",active,True,False,2021-07-20T02:00:00Z


In [6]:
# View all columns (in dictionary: events > results). Determine required columns for deliverable.
events_df.columns

Index(['relevance', 'id', 'title', 'description', 'category', 'labels', 'rank',
       'local_rank', 'aviation_rank', 'phq_attendance', 'entities', 'duration',
       'start', 'end', 'updated', 'first_seen', 'timezone', 'location',
       'scope', 'country', 'place_hierarchies', 'state', 'brand_safe',
       'private', 'predicted_end'],
      dtype='object')

## Data Analysis - Determine and Extract Data Required
- Number of entries required: 5000.
    1. Loop over offsets in steps of 50 (0, 50, 100, 150, ... up to 4950). The API returns at most 5000 results.
- Variables required: id, title, category, start (start date), end (end date), country, location
    2. Use pandas to filter for the variables required.
        - *location holds the coordinates
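
The filtering step can be illustrated on a single record with plain Python before applying it to a dataframe. This is a sketch only: the helper `select_fields` is hypothetical, and the `[longitude, latitude]` ordering of `location` is an assumption based on the sample values in the output (for example, the Brisbane event at 153.07, -27.57).

```python
REQUIRED_FIELDS = ["id", "title", "category", "start", "end", "country", "location"]


def select_fields(event):
    """Keep only the required keys and split location into longitude/latitude."""
    record = {key: event.get(key) for key in REQUIRED_FIELDS}
    # Assumption: location is ordered [longitude, latitude]
    longitude, latitude = record.pop("location")
    record["longitude"] = longitude
    record["latitude"] = latitude
    return record


# Sample record built from values that appear in the notebook output
sample = {
    "id": "rii6TAkfVYBDb5dQDe",
    "title": "SunKIDZ at SunPAC",
    "category": "community",
    "start": "2021-07-19T23:30:00Z",
    "end": "2021-07-20T00:30:00Z",
    "country": "AU",
    "location": [153.073449, -27.572601],
    "timezone": "Australia/Brisbane",  # not required, so it is dropped
    "scope": "locality",               # not required, so it is dropped
}

record = select_fields(sample)
print(record)
```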

In [7]:
# Test the for loop on a small range of offsets
for i in range(0, 500, 50):
    print(i)

0
50
100
150
200
250
300
350
400
450


In [8]:
# Create list "entries" to store one dataframe per page.
entries = []

# Loop over offsets in multiples of 50 (the API returns at most 50 entries per call).
# This range covers offsets 0 to 950, i.e. 20 pages or up to 1000 entries.
# Use range(0, 5000, 50) to retrieve all 100 pages (5000 entries).
for i in range(0, 1000, 50):

    response = requests.get(
        url="https://api.predicthq.com/v1/events",
        headers={
            "Authorization": f"Bearer {py_config.ACCESS_TOKEN}",
            "Accept": "application/json"
        },
        params={
            "offset": i,
            "limit": 50,
            "country": "AU",
            "start": "2021-01-01",
            "end": "2022-12-31"
        }
    )

    # TEST: print(i)

    # Save response to variable "data"
    data = response.json()

    # Save to variable "events_df" the dataframe built from data > results
    events_df = pd.json_normalize(data, ['results'], errors='ignore')

    # Extract only the required variables (column headings)
    events_df = events_df[["id", "title", "category", "start", "end", "country", "location"]]

    entries.append(events_df)

# Print the list of dataframes once, after the loop has finished
print(entries)

[                    id                                              title  \
0   rii6TAkfVYBDb5dQDe                                  SunKIDZ at SunPAC   
1   6aXi7LBhouqXBCy4Hq                             Democracy. Are You In?   
2   7DLJtjGczTVybEszeb       Sandcastle Workshops for Children and Adults   
3   Bx8DkaQfMtrgkVZnbN                               The Gruffalo Spotter   
4   RwMoGyruxqZ4FuRRxS                           Flower Crown Group Class   
5   XPR9sumhtKbzk6fdPy  Behind the Lines 2020: The year in political c...   
6   Z55sEJQreUZJWjdZP7  PlayUP: The Right to Have an Opinion and Be Heard   
7   j7HgWRugZfDT8D6n3P  Australian Primary Principals Association & Ne...   
8   m7NKn9rqQM5JL3P2w4                                 Happy and Glorious   
9   oKgCSdkVDHMsDsszyT          The Trevor Kennedy Collection: Highlights   
10  qPdzQb77TNkppYLeyj      onetoeight: Australia’s first prime ministers   
11  suhVhvyz6UmfvSektq                                 Gunnedah Saleyards  
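
The loop above uses a fixed range, which assumes the number of pages is known in advance. A minimal sketch of an alternative that stops as soon as the API reports no further page is given below. It assumes the `next` field seen in the JSON output is null on the final page. The helper `fetch_all` and the `fetch_page` callable are hypothetical, and `fake_fetch_page` is a stand-in so the sketch runs without network access or a token.

```python
def fetch_all(fetch_page, page_size=50, max_results=5000):
    """Collect results page by page until the API reports no further page.

    fetch_page is a hypothetical callable taking (offset, limit) and returning
    the decoded JSON body, i.e. a dict with "results" and "next" keys.
    """
    results = []
    for offset in range(0, max_results, page_size):
        body = fetch_page(offset, page_size)
        results.extend(body.get("results", []))
        # Assumption: "next" is null (None in Python) on the last page
        if not body.get("next"):
            break
    return results


def fake_fetch_page(offset, limit, total=120):
    """Stand-in for the real API call: serves 120 fake events in pages."""
    items = [{"id": n} for n in range(offset, min(offset + limit, total))]
    has_more = offset + limit < total
    return {"results": items, "next": "more" if has_more else None}


events = fetch_all(fake_fetch_page)
print(len(events))  # 120
```

In the notebook, `fetch_page` would wrap the `requests.get` call from the cell above and return `response.json()`.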

# TRANSFORMATION
- Concatenate the dataframes from all retrieved pages into one final dataframe and save it to a variable.

In [9]:
# Concatenate all the dataframes within list "entries".
# i.e. combine the per-page dataframes into a single dataframe.
# Save into variable "entries_df"
entries_df = pd.concat(entries)
entries_df

Unnamed: 0,id,title,category,start,end,country,location
0,rii6TAkfVYBDb5dQDe,SunKIDZ at SunPAC,community,2021-07-19T23:30:00Z,2021-07-20T00:30:00Z,AU,"[153.073449, -27.572601]"
1,6aXi7LBhouqXBCy4Hq,Democracy. Are You In?,expos,2021-07-19T23:00:00Z,2021-07-20T07:00:00Z,AU,"[149.129959, -35.30196]"
2,7DLJtjGczTVybEszeb,Sandcastle Workshops for Children and Adults,community,2021-07-19T23:00:00Z,2021-07-20T00:30:00Z,AU,"[153.101805, -26.406535]"
3,Bx8DkaQfMtrgkVZnbN,The Gruffalo Spotter,expos,2021-07-19T23:00:00Z,2021-07-20T07:00:00Z,AU,"[152.959309, -26.557275]"
4,RwMoGyruxqZ4FuRRxS,Flower Crown Group Class,community,2021-07-19T23:00:00Z,2021-07-20T07:00:00Z,AU,"[151.17119, -33.78903]"
...,...,...,...,...,...,...,...
45,WWmgc3ZLpzXdutFvYh,Winter Night Market,community,2021-07-14T07:00:00Z,2021-07-14T12:00:00Z,AU,"[144.958081, -37.807671]"
46,WfonVPwTxfKdWYuQjj,Art After Hours,community,2021-07-14T07:00:00Z,2021-07-14T12:00:00Z,AU,"[151.216202, -33.868829]"
47,ZJE8EKUvasC8mjWuP6,Mindil Beach Sunset Market,festivals,2021-07-14T06:30:00Z,2021-07-14T12:30:00Z,AU,"[130.828231, -12.448136]"
48,GHsxqJRekrjsPahqnt,Weekly Youth Art Classes,community,2021-07-14T05:45:00Z,2021-07-14T07:15:00Z,AU,"[152.921212, -29.671462]"


In [10]:
# Analyse the category types and entries within each category.
entries_df['category'].value_counts()

community          313
expos              262
sports             234
performing-arts    122
festivals           33
concerts            17
conferences         16
school-holidays      2
observances          1
Name: category, dtype: int64