<a href="https://colab.research.google.com/github/aaubs/ds-master/blob/main/notebooks/M1-API-JSON-MongoDB.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Working with web- & app-data

In this session, you will learn to work with Python dictionaries and the JSON (JavaScript Object Notation) format. We will also deal (a bit) with APIs and MongoDB.

In [None]:
# some necessary installs
!pip install "pymongo[srv]"
!pip install cloudscraper
!pip install srsly

In [None]:
import pandas as pd

## Python dictionaries (recap)

Dictionaries are a very flexible data type: they map keys to values, and the values can themselves be nested structures such as lists or other dictionaries.

In [None]:
capitals = {"USA":"Washington D.C.", "France":"Paris", "India":"New Delhi"}

In [None]:
capitals['Denmark'] = 'Copenhagen'

In [None]:
capitals['Denmark']

In [None]:
capitals.keys()

In [None]:
capitals.values()

In [None]:
country_info = {}

In [None]:
country_info['USA'] = {'capital':"Washington D.C.", 'population': 329.5, 'cities': ["Washington D.C.",'New York','San Francisco']}

In [None]:
country_info['Denmark'] = {'capital':"Copenhagen", 'population': 5.831, 'cities': ["Copenhagen",'Århus','Odense','Aalborg']}

In [None]:
country_info
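
Nested values are reached by chaining keys and list indices. A minimal sketch that rebuilds the `country_info` structure from the cells above so it runs on its own; the `'Norway'` lookup is only there to illustrate a missing key.

```python
# Rebuild the nested dictionary from the cells above so the block is self-contained
country_info = {
    "USA": {"capital": "Washington D.C.", "population": 329.5,
            "cities": ["Washington D.C.", "New York", "San Francisco"]},
    "Denmark": {"capital": "Copenhagen", "population": 5.831,
                "cities": ["Copenhagen", "Århus", "Odense", "Aalborg"]},
}

# Chain keys (and list indices) to reach nested values
second_danish_city = country_info["Denmark"]["cities"][1]

# .get() returns a default instead of raising KeyError for a missing key
norway_capital = country_info.get("Norway", {}).get("capital", "unknown")

# .items() iterates over key/value pairs
capitals_by_country = {country: info["capital"] for country, info in country_info.items()}
```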

### Dicts in Pandas?

In [None]:
pd.DataFrame(country_info)

In [None]:
pd.DataFrame(country_info).T

In [None]:
pd.DataFrame(country_info).T.to_json('country_info.json')
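
As an alternative to transposing, `pd.DataFrame.from_dict` with `orient="index"` treats the outer keys (the countries) as rows directly. A small sketch using the same data as above:

```python
import pandas as pd

country_info = {
    "USA": {"capital": "Washington D.C.", "population": 329.5,
            "cities": ["Washington D.C.", "New York", "San Francisco"]},
    "Denmark": {"capital": "Copenhagen", "population": 5.831,
                "cities": ["Copenhagen", "Århus", "Odense", "Aalborg"]},
}

# orient="index": outer keys become the row index, inner keys become columns
df = pd.DataFrame.from_dict(country_info, orient="index")

# Individual cells can then be looked up by row label and column name
danish_capital = df.loc["Denmark", "capital"]
```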

### Introducing srsly: Modern high-performance serialization utilities for Python

https://pypi.org/project/srsly/

In [None]:
import srsly

In [None]:
json_string = srsly.json_dumps(country_info)

In [None]:
json_string

In [None]:
srsly.write_json("country_info_str.json", country_info)

In [None]:
country_info_from_str = srsly.read_json("country_info_str.json")
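
For comparison, here is a sketch of the same round trip using only the standard `json` module. The file name mirrors the cell above; writing to a temporary directory is an assumption made to keep the example self-contained.

```python
import json
import os
import tempfile

country_info = {
    "Denmark": {"capital": "Copenhagen", "population": 5.831,
                "cities": ["Copenhagen", "Århus"]},
}

# dict -> JSON string; ensure_ascii=False keeps characters such as "Å" readable
json_string = json.dumps(country_info, ensure_ascii=False)

# JSON string -> dict
restored = json.loads(json_string)

# dict -> file -> dict, using a temporary directory (assumption for this sketch)
path = os.path.join(tempfile.mkdtemp(), "country_info_str.json")
with open(path, "w", encoding="utf-8") as f:
    json.dump(country_info, f, ensure_ascii=False)
with open(path, encoding="utf-8") as f:
    restored_from_file = json.load(f)
```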

### Nomadlist API and Digital Nomads

We will look at https://nomadlist.com and its API to gather data about trips around the world by digital nomads.
* We will start with the `requests` library to perform simple API interaction.
* We will use the standard `json` library to parse the response.

To do so, we need to identify a user of the platform and then add `.json` to the profile URL.

In [None]:
# import the relevant libraries
import requests as rq
import json

In [None]:
# get data
result = rq.get('https://nomadlist.com/@ambroisedebret.json')

In [None]:
# parse the response content
doc = json.loads(result.content)

In [None]:
handles = srsly.read_gzip_json("handles-out.json.gz")

In [None]:
handles

In [None]:
# alternatively
# handles = pd.read_json('/content/handles-out.json.gz', typ='series')

In [None]:
# prepare handles for scrape
handle_list = pd.Series(handles).sample(10).to_list()

In [None]:
# managing time...
import time

In [None]:
profiles = []

In [None]:
for i in handle_list:
  url = 'https://nomadlist.com' + i + '.json'
  result = rq.get(url)
  result_json = json.loads(result.content)
  profiles.append(result_json)
  time.sleep(1)

In [None]:
result

Oh no!!! 🤯
Most modern web apps are protected against fully automated scraping.
https://www.cloudflare.com/learning/bots/what-is-content-scraping/
However, it's a cat-and-mouse game...
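
When a request is blocked, the body is typically an HTML page rather than JSON, so `json.loads` fails. Below is a minimal sketch of a defensive parse. `parse_json_body` is a hypothetical helper, not part of `requests`; with `requests` you would pass it `result.status_code` and `result.content`. The status codes and bodies are made up for illustration.

```python
import json

def parse_json_body(status_code, body):
    """Return the decoded JSON document, or None if the response is not usable."""
    if status_code != 200:
        # e.g. a blocked request (the exact code depends on the site)
        return None
    try:
        return json.loads(body)
    except json.JSONDecodeError:
        # An HTML challenge page is not valid JSON
        return None

# Made-up responses to exercise the three cases
ok = parse_json_body(200, b'{"username": "example", "trips": []}')
blocked = parse_json_body(403, b"<html>Access denied</html>")
html_with_200 = parse_json_body(200, b"<html>Just a moment...</html>")
```

Returning `None` lets the scraping loop skip a profile instead of crashing halfway through.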


**cloudscraper**

> A simple Python module to bypass Cloudflare's anti-bot page (also known as "I'm Under Attack Mode", or IUAM), implemented with Requests. Cloudflare changes their techniques periodically, so I will update this repo frequently.







In [None]:
import cloudscraper

In [None]:
# Instantiate scraper
scraper = cloudscraper.create_scraper()

In [None]:
profiles = []

In [None]:
for i in handle_list:
  url = 'https://nomadlist.com' + i + '.json'
  result = scraper.get(url)
  result_json = json.loads(result.content)
  profiles.append(result_json)
  time.sleep(1)

In [None]:
len(profiles)

In [None]:
trips = pd.DataFrame(profiles)['trips'].to_list()

In [None]:
trips[1]

In [None]:
# naive attempt: may raise a TypeError if a profile has no trips (NaN is a float)
trips_all = []

for trip in trips:
  trips_all.extend(trip)

In [None]:
trips_all = []

for trip in trips:
  # profiles without trips show up as NaN (a float), so only extend with real lists
  if isinstance(trip, list):
    trips_all.extend(trip)

In [None]:
pd.DataFrame(trips_all)
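
The type check is needed because pandas fills a missing `trips` value with `NaN`, which is a float and cannot be iterated. A small sketch with made-up data that reproduces the problem and then flattens only the real lists:

```python
# Made-up data: two users with trips, one without
trips = [
    [{"place": "Copenhagen"}, {"place": "Berlin"}],
    float("nan"),  # what pandas stores for a profile without a 'trips' key
    [{"place": "Lisbon"}],
]

# Extending a list with NaN raises TypeError because a float is not iterable
try:
    [].extend(trips[1])
    error = None
except TypeError as exc:
    error = exc

# Keep only real lists, then flatten them into one list of trips
trips_all = [
    trip
    for user_trips in trips
    if isinstance(user_trips, list)
    for trip in user_trips
]
```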

## Introducing NoSQL with MongoDB
What if there was a database where we could insert all these documents?
Well, there is one: it's called MongoDB, and it powers many of the apps that you use daily...

You can install it on your machine or on uCloud, or use the free tier of MongoDB Atlas at https://cloud.mongodb.com/.

Learn more with [this course](https://campus.datacamp.com/courses/introduction-to-using-mongodb-for-data-science-with-python/)

In [None]:
# pymongo is the official Python driver for MongoDB; it lets Python connect to the database
import pymongo

In [None]:
from pymongo import MongoClient
client = MongoClient('xxx')

In [None]:
db = client.test_database


In [None]:
collection = db.test_collection

In [None]:
collection.insert_one(profiles[0])

In [None]:
profiles[0]

In [None]:
collection.find_one()

In [None]:
profiles[0]

In [None]:
collection.insert_many(profiles[1:])

In [None]:
a = collection.find({'trips.country_code' : 'DK' })

In [None]:
collection.count_documents({'trips.country_code' : 'DK' })

In [None]:
list(a)

### More complex aggregation pipeline

MongoDB allows us to perform really nice data manipulation within the database itself, which is very fast.
However, a slightly different syntax is needed (not beginner-friendly, but doable).

In [None]:
# aggregation

c = collection.aggregate([
   {'$match':{'trips.country_code' : 'DK' }}, # keep profiles with at least one trip to DK
   {'$project': 
   {'trips': 1} #project (reveal) only data from trips key
   }, 
   {'$unwind': '$trips'} # unwind...flatten
  ])

In [None]:
# turning cursor into list
c = list(c)

In [None]:
# flatten using list comprehension
c = [doc['trips'] for doc in c]

In [None]:
# making it a DF for in memory analysis
pd.DataFrame(c)

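#### Sequence matters

In the previous pipeline, `$match` ran before `$unwind` and therefore selected whole profiles with at least one trip to DK, so all trips of those users were returned. Running `$match` after `$unwind` filters the individual trips instead.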

In [None]:
# aggregation

c = collection.aggregate([
   
  { '$project': 
   {'trips': 1} #project (reveal) only data from trips key
   }, 
   {'$unwind': '$trips'}, # unwind...flatten
   {'$match':{'trips.country_code' : 'DK' }} # Find all trips that went to DK
  ])

In [None]:
# turning cursor into list
c = list(c)

In [None]:
# flatten using list comprehension
c = [doc['trips'] for doc in c]

In [None]:
# making it a DF for in memory analysis
pd.DataFrame(c)
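
To see the effect of stage order without a database, here is a plain-Python imitation of the two pipelines on made-up documents. The helper functions only mimic `$match` and `$unwind` for this one case; they are hypothetical and not part of pymongo.

```python
# Made-up profiles: user "a" has been to DK and DE, user "b" only to PT
docs = [
    {"username": "a", "trips": [{"country_code": "DK"}, {"country_code": "DE"}]},
    {"username": "b", "trips": [{"country_code": "PT"}]},
]

def match_docs_with_dk(documents):
    # Before unwinding, 'trips.country_code' matches if ANY trip in the array is in DK
    return [d for d in documents if any(t["country_code"] == "DK" for t in d["trips"])]

def unwind_trips(documents):
    # One output document per element of the 'trips' array
    return [{**d, "trips": t} for d in documents for t in d["trips"]]

def match_dk_trips(unwound):
    # After unwinding, 'trips' is a single embedded document
    return [d for d in unwound if d["trips"]["country_code"] == "DK"]

# Order 1: match, then unwind -> all trips of users who have been to DK
match_then_unwind = unwind_trips(match_docs_with_dk(docs))

# Order 2: unwind, then match -> only the DK trips themselves
unwind_then_match = match_dk_trips(unwind_trips(docs))
```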

### Adding some EDA viz to it

In [None]:
# aggregation

c = collection.aggregate([
  { '$project': 
   {'trips': 1} 
   }, 
   {'$unwind': '$trips'}
  
  ])

In [None]:

c = list(c)

In [None]:

trips_df = pd.DataFrame([x['trips'] for x in c])

In [None]:
trips_df

In [None]:
pd.to_datetime(trips_df.epoch_start, unit='s')

In [None]:
trips_df.epoch_start = pd.to_datetime( trips_df.epoch_start, unit='s')
trips_df.epoch_end = pd.to_datetime( trips_df.epoch_end, unit='s')

In [None]:
trips_df['length'] = pd.to_timedelta(trips_df.epoch_duration, unit='s')
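
The `unit='s'` argument tells pandas that the numbers are seconds since 1970-01-01 UTC (Unix time). A sketch of the same conversion for a single trip using only the standard library; the two values are made up.

```python
from datetime import datetime, timedelta, timezone

epoch_start = 1577836800             # made-up value: seconds since 1970-01-01 UTC
epoch_duration = 3 * 24 * 60 * 60    # made-up value: three days in seconds

# Seconds since the epoch -> timezone-aware timestamp
start = datetime.fromtimestamp(epoch_start, tz=timezone.utc)

# Number of seconds -> duration
length = timedelta(seconds=epoch_duration)

# Start plus duration gives the end of the trip
end = start + length
```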

In [None]:
trips_df.place.value_counts()[:20]

In [None]:
trips_df.groupby('place')['length'].median().nlargest(20)

In [None]:
trips_2020 = trips_df[(trips_df.epoch_start > pd.to_datetime('2020')) & (trips_df.epoch_end < pd.to_datetime('2021'))]

In [None]:
trips_2020.length.mean()

In [None]:
len(trips_2020)

In [None]:
trips_2021 = trips_df[(trips_df.epoch_start > pd.to_datetime('2021')) & (trips_df.epoch_end < pd.to_datetime('2022'))]

In [None]:
trips_2021.length.mean()

In [None]:
len(trips_2021)
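
Note that the filters above keep only trips that both start and end within the same year, so a trip that crosses New Year is counted in neither. A sketch with made-up trips that contrasts this with selecting by start year via the `.dt.year` accessor:

```python
import pandas as pd

# Made-up trips: one inside 2020, one crossing into 2021, one inside 2021
trips_df = pd.DataFrame({
    "place": ["Lisbon", "Berlin", "Bali"],
    "epoch_start": pd.to_datetime(["2020-02-01", "2020-12-20", "2021-03-01"]),
    "epoch_end": pd.to_datetime(["2020-02-15", "2021-01-10", "2021-03-20"]),
})

# Same condition as in the cells above: start and end both inside 2020
inside_2020 = trips_df[(trips_df.epoch_start > pd.to_datetime("2020"))
                       & (trips_df.epoch_end < pd.to_datetime("2021"))]

# Alternative: every trip that started in 2020, wherever it ended
started_2020 = trips_df[trips_df.epoch_start.dt.year == 2020]
```

Which definition is appropriate depends on the question being asked; the mean trip length will differ between the two.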

In [None]:
trips_df.epoch_start.dt.year.hist()

## Introducing Altair

https://altair-viz.github.io/
Probably the best open-source interactive plotting library for Python.

In [None]:
trips_df.groupby('place')['length'].median().nlargest(20)

In [None]:
import altair as alt

In [None]:
source = trips_df.groupby('place')['length'].median().nlargest(20).reset_index()

# convert the timedelta to whole days for plotting
source['length'] = source['length'].dt.days

alt.Chart(source).mark_bar().encode(
    x='length',
    y='place:N',
    tooltip=['length']
).interactive()