# Extracting and transforming data

## Goal

> Check which videos from Vimeo's staff picks are featured at the blog Motionographer.com.


## Libs

Notes
* Generate an APP
* Vimeo's API: https://developer.vimeo.com/api/guides/start
* Token: https://developer.vimeo.com/apps/168643#personal_access_tokens
* API wrapper pip install PyVimeo

In [221]:
import numpy as np
import pandas as pd
import requests
import json
import multiprocessing
import glob
import datetime
import re

from bs4 import BeautifulSoup
from sqlalchemy import create_engine
from pandas.io.json import json_normalize
import math
from tqdm.notebook import tqdm

pd.options.display.max_rows = 500
pd.options.display.max_columns = 500

In [3]:
def get_page(current_page):    

    headers = {"Authorization": "Bearer 6d797fb7512534142b202cc24aaab742"}
    endpoint = f'https://api.vimeo.com/channels/staffpicks/videos?page={current_page}&per_page=100'
    vimeo_page = requests.get(endpoint, headers=headers)
    page_content = vimeo_page.json()
    return page_content

def return_data(num_page):
    response = get_page(num_page)
    page_data = pd.json_normalize(response['data'])
    page_data.to_csv(f'./downloaded_pages/v_page_{num_page:0>4}.csv')
    print(f'Page {num_page:0>4} saved.')

In [4]:
first_page_response = get_page(1)
total_pages = math.ceil(first_page_response['total'] / first_page_response['per_page'])

## First function written (before using map for multiprocessing) 

### Deprecated

In [None]:
def all_pages():
    '''
    Takes current_page number. Returns dict.
    Return = {
        'paging': {
            'next': next page's uri or none
        }, 
        'data': [
            {},
            ...
        ]
    }
    '''

    response = get_page(1)
    response_df = pd.json_normalize(response['data'])
    pages_info = response['paging']
    total_pages = math.ceil(response['total'] / response['per_page'])

    for i in tqdm( range ( 1, ( total_pages + 1 ) ) ) :
        response = get_page(i)
        page_data = pd.json_normalize(response['data'])
        page_data.to_csv(f'./downloaded_pages/v_page_{i:0>4}.csv')
        print(f'Page {i:0>4} saved in disk.')


## Reading first page

In [5]:
# columns to filter
cols = ['name', 'link', 'duration', 'release_time', 'content_rating', 'tags', 
'categories', 'stats.plays', 'user.name', 'user.link', 'user.gender', 'user.websites', 
'user.account', 'user.websites', 'user.location_details.formatted_address','user.short_bio','user.skills', 'user.available_for_hire', 'user.location_details.latitude',
'user.location_details.longitude', 'user.location_details.city',
'user.location_details.state', 'user.location_details.neighborhood', 'user.location_details.sub_locality',
'user.location_details.state_iso_code', 'user.location_details.country',
'user.location_details.country_iso_code', 'width', 'height']


In [None]:
first_page = pd.read_csv('./downloaded_pages/v_page_0001.csv', usecols=cols)
# l = pd.json_normalize(first_page['tags'][1])
# type(first_page['tags'][1])
# first_page['tags'] = pd.to_numeric(first_page['tags'], errors='ignore')
# first_page['tags'][1]
# json.load(first_page['tags'])
json.loads(element for element in first_page['tags'])
# json.loads(var)
# first_page.dtypes
# usecols=cols
# 'tags', 'categories', 'user.websites', 'user.skills'

# release_time
# first_page.sample(15)

> Each .csv has 100 rows, corresponding to 100 videos, and 175 columns

In [17]:
list(first_page[cols])

['name',
 'link',
 'duration',
 'release_time',
 'content_rating',
 'tags',
 'categories',
 'stats.plays',
 'user.name',
 'user.link',
 'user.gender',
 'user.websites',
 'user.account',
 'user.websites',
 'user.location_details.formatted_address',
 'user.short_bio',
 'user.skills',
 'user.available_for_hire',
 'user.location_details.latitude',
 'user.location_details.longitude',
 'upload.size',
 'user.location_details.city',
 'user.location_details.state',
 'user.location_details.neighborhood',
 'user.location_details.sub_locality',
 'user.location_details.state_iso_code',
 'user.location_details.country',
 'user.location_details.country_iso_code',
 'width',
 'height']

In [34]:
first_page.sample(10)

Unnamed: 0,name,link,duration,release_time,content_rating,tags,categories,stats.plays,user.name,user.link,user.gender,user.short_bio,user.websites,user.location_details.formatted_address,user.location_details.latitude,user.location_details.longitude,user.skills,user.available_for_hire,user.account
30,VERT,https://vimeo.com/398274283,733,2020-03-17T16:20:07+00:00,['safe'],[],"[{'uri': '/categories/narrative', 'name': 'Nar...",158701.0,Kate Cox,https://vimeo.com/katecox,n,Kate&#039;s aim as a director is to unearth fe...,[],"London, UK",51.507351,-0.127758,[],False,pro
20,FLUT by Malte Stein,https://vimeo.com/399313424,595,2020-03-20T22:21:38+00:00,['language'],[],"[{'uri': '/categories/animation', 'name': 'Ani...",70025.0,maltestein,https://vimeo.com/user17714648,,,[],,,,[],False,basic
13,Magnetic Fields,https://vimeo.com/400100317,106,2020-03-24T02:05:38+00:00,['safe'],[],"[{'uri': '/categories/experimental', 'name': '...",6794.0,Benjamin Bardou,https://vimeo.com/benjaminbardou,n,► benjaminbardou.com ► benjaminbardou@gmail.co...,"[{'name': 'website', 'link': 'http://benjaminb...","Paris, France",48.856613,2.352222,"[{'uri': '/marketplace/skills/59', 'name': 'Fi...",True,plus
74,The Last Video Store,https://vimeo.com/arthurcauty/thelastvideostore,478,2020-02-22T17:04:04+00:00,['language'],"[{'uri': '/tags/documentary', 'name': 'documen...",[],34635.0,Arthur Cauty | Filmmaker,https://vimeo.com/arthurcauty,m,Multi award-winning filmmaker | inquiries: ac@...,"[{'name': ""Arthur's Official Website"", 'link':...","Bristol, UK",51.454514,-2.58791,"[{'uri': '/marketplace/skills/17', 'name': 'Di...",True,plus
96,S+C+A+R+R - The Rest Of My Days,https://vimeo.com/391501121,234,2020-02-14T14:36:00+00:00,['safe'],[],"[{'uri': '/categories/music', 'name': 'Music',...",18918.0,Passion Paris,https://vimeo.com/passionparis,n,Soci&eacute;t&eacute; ind&eacute;pendante de p...,"[{'name': 'Site Passion Paris', 'link': 'http:...",Paris,,,[],False,pro
51,GIRLFRIENDS,https://vimeo.com/395282487,1168,2020-03-03T20:25:31+00:00,['safe'],[],[],126542.0,Travelling distribution,https://vimeo.com/travellingdistribution,n,"For more than 10 years, Travelling has been re...","[{'name': None, 'link': 'www.travellingdistrib...","Trois-Rivières, Québec, Canada",,,[],False,pro
68,Zoe and Hanh,https://vimeo.com/393553415,536,2020-02-24T22:56:00+00:00,['safe'],"[{'uri': '/tags/comedy', 'name': 'Comedy', 'ta...","[{'uri': '/categories/narrative', 'name': 'Nar...",11582.0,Kim Tran,https://vimeo.com/kimtrantexas,,"Kim Tran is a writer, filmmaker and middle chi...","[{'name': 'Instagram', 'link': 'https://www.in...","Austin, TX, USA",30.267153,-97.743057,[],False,basic
33,JUTLAND II | Breath of the Seasons,https://vimeo.com/397912933,212,2020-03-16T07:59:44+00:00,['safe'],"[{'uri': '/tags/timelapse', 'name': 'timelapse...","[{'uri': '/categories/travel', 'name': 'Travel...",20523.0,Jonas Høholt,https://vimeo.com/jonashoholt,m,I bend and warp time and motion,"[{'name': 'Instagram', 'link': 'http://www.ins...","Aarhus, Danmark",56.162937,10.203921,"[{'uri': '/marketplace/skills/93', 'name': 'Ti...",True,plus
7,Jesse Jams,https://vimeo.com/400592143,951,2020-03-25T13:45:36+00:00,['safe'],[],"[{'uri': '/categories/documentary', 'name': 'D...",10483.0,Trevor Anderson,https://vimeo.com/trevoranderson,m,"Sundance Film Festival, Drumheller Prison, pla...","[{'name': 'Trevor Anderson Films', 'link': 'ht...",,,,[],False,plus
19,Quilt Fever,https://vimeo.com/399322718,945,2020-03-20T23:03:54+00:00,['safe'],"[{'uri': '/tags/quilt', 'name': 'quilt', 'tag'...","[{'uri': '/categories/documentary', 'name': 'D...",67259.0,Olivia Loomis Merrion,https://vimeo.com/oliviamerrion,f,"Filmmaker based in Oakland, CA","[{'name': 'oliviamerrion.com', 'link': 'http:/...","Oakland, CA, USA",,,"[{'uri': '/marketplace/skills/17', 'name': 'Di...",True,plus


# Sending Parallel Requests to Vimeo

> Saves pages in a csv

In [None]:
%%time
pool = multiprocessing.Pool()
result = pool.map(return_data, range(1, total_pages + 1))
pool.terminate()
pool.join()

> The Challenge: I begun the process by getting 25 videos per page, without multiprocessing and it was taking a whole night to download the pages, and either the kernel broke or I got some error in the middle of the process. Waiting for data was the most time consuming task in the project.

## Filtering columns and merging all pages

In [None]:
path = './downloaded_pages/'
all_files = glob.glob(path + "*.csv")
each_csv = (pd.read_csv(f)[cols] for f in all_files)
sp_df = pd.concat(each_csv, ignore_index=True)

In [340]:
# Exporting filtered .csv
date_time = datetime.datetime.now().strftime("%d%b%Y").replace('/', '').lower() 
sp_df.to_csv(f'./downloaded_pages/staffpicks_{date_time}.csv')


## Vimeo's Dataset

In [66]:
to_export = sp_df.sort_values(by='release_time', ascending=False)
to_export.to_csv(f'./downloaded_pages/sp_{date_time}_tableau.csv')

> When I wrote the function to get all pages I forgot to add '.csv' when I named. I tried all possible methods to concat the files and got several errors. To sum it up, I was trying to concat .txt files, so when I imported the merged file, I was getting a very strange renderization (the file wasn't separated by comma, it was plai text!). 

# Web Scraping Motionographer

> Motionographer - curated motion design content: http://motionographer.com/

• https://motionographer.com/wp-json/wp/v2/posts

In [387]:
# first page
first_page = 1
first_page_link = f'http://motionographer.com/articles/page/{first_page}'
m_soup = BeautifulSoup(requests.get(first_page_link).content)

last_page = int(m_soup.select('body div nav li a')[-2].text)
last_page_link = m_soup.select('body div nav li a')[-2]['href']

In [425]:
all_pages_url = [f"http://motionographer.com/articles/page/{item}" for item in range(first_page, last_page + 1)]
# all_posts_content = [download_page(page) for page in all_pages_url]
# all_posts = [get_post_url(content) for content in all_posts_content]
all_pages_url

['http://motionographer.com/articles/page/1',
 'http://motionographer.com/articles/page/2',
 'http://motionographer.com/articles/page/3',
 'http://motionographer.com/articles/page/4',
 'http://motionographer.com/articles/page/5',
 'http://motionographer.com/articles/page/6',
 'http://motionographer.com/articles/page/7',
 'http://motionographer.com/articles/page/8',
 'http://motionographer.com/articles/page/9',
 'http://motionographer.com/articles/page/10',
 'http://motionographer.com/articles/page/11',
 'http://motionographer.com/articles/page/12',
 'http://motionographer.com/articles/page/13',
 'http://motionographer.com/articles/page/14',
 'http://motionographer.com/articles/page/15',
 'http://motionographer.com/articles/page/16',
 'http://motionographer.com/articles/page/17',
 'http://motionographer.com/articles/page/18',
 'http://motionographer.com/articles/page/19',
 'http://motionographer.com/articles/page/20',
 'http://motionographer.com/articles/page/21',
 'http://motionographe

In [434]:
def download_page(page_link):
    '''From page url, makes the soup with content.'''
    downloaded_page = BeautifulSoup(requests.get(page_link).content)
#     naming = re.findall('[0-9]+', str(page_link))
#     with open(f"./motionographer/m_{naming[0]}.html", "w") as file:
#         file.write(str(downloaded_page))
#     print('page downloaded!')
    return downloaded_page

In [432]:
def get_post_url(page_content):
    '''From soup, gets posts url.'''
    page_posts = page_content.select('article.post > div.article-header > a')
    url_list = [link['href'] for link in page_posts]
#     print('page downloaded.')
    return url_list

In [401]:
def crawl_posts(post_soup):
    
    '''Returns a pandas dataframe from soup.'''
    
    title = soup.select('body article h1')[0].text # post title
    video_url = 'https://vimeo.com/' + re.findall('/(\d+)', soup.select('div.video > iframe')[0]['src']) # vimeo links
    date = soup.select('body article p time')[0]['datetime'] # date / time
    author = soup.select('body article p a')[0].text # author
    author_url = soup.select('body article p a')[0]['href'] # author link
    content = soup.select('body div article')[0] # article content
    
    post_page = {'Title': title,'URL': video_url, 'Date': date, 'Author': author, 
                 'Author_URL': author_url, 'Content': content}
    post_df = pd.DataFrame(post_page)
    
    return post_df

In [426]:
def get_all_posts_urls(pages_list):
    all_posts_url = []
    
    for page in pages_list:
        cur_page = download_page(page)
        for content in cur_page:
            all_posts_url.append(get_post_url(content))
    return all_posts_url

In [382]:
%%time
pool = multiprocessing.Pool()
result = pool.map(download_page, all_pages)
pool.terminate()
pool.join()

page downloaded!
page downloaded!
page downloaded!
page downloaded!
page downloaded!
page downloaded!
page downloaded!
page downloaded!
page downloaded!
page downloaded!
page downloaded!
page downloaded!
page downloaded!
page downloaded!
page downloaded!
page downloaded!
page downloaded!
page downloaded!
page downloaded!
page downloaded!
page downloaded!
page downloaded!
page downloaded!
page downloaded!
page downloaded!
page downloaded!
page downloaded!
page downloaded!
page downloaded!
page downloaded!
page downloaded!
page downloaded!
page downloaded!
page downloaded!
page downloaded!
page downloaded!
page downloaded!
page downloaded!
page downloaded!
page downloaded!
page downloaded!
page downloaded!
page downloaded!
page downloaded!
page downloaded!
page downloaded!
page downloaded!
page downloaded!
page downloaded!
page downloaded!
page downloaded!
page downloaded!
page downloaded!
page downloaded!
page downloaded!
page downloaded!
page downloaded!
page downloaded!
page downloade

page downloaded!
page downloaded!
page downloaded!
page downloaded!
page downloaded!
page downloaded!
page downloaded!
page downloaded!
page downloaded!
page downloaded!
page downloaded!
page downloaded!
page downloaded!
page downloaded!
page downloaded!
page downloaded!
page downloaded!
page downloaded!
page downloaded!
page downloaded!
page downloaded!
page downloaded!
page downloaded!
page downloaded!
page downloaded!
page downloaded!
page downloaded!
page downloaded!
page downloaded!
page downloaded!
page downloaded!
page downloaded!
page downloaded!
page downloaded!
page downloaded!
page downloaded!
page downloaded!
page downloaded!
page downloaded!
page downloaded!
page downloaded!
page downloaded!
page downloaded!
page downloaded!
page downloaded!
page downloaded!
page downloaded!
page downloaded!
page downloaded!
page downloaded!
page downloaded!
page downloaded!
page downloaded!
page downloaded!
page downloaded!
page downloaded!
page downloaded!
page downloaded!
page downloade

# Storing data in a database

## Imports final dataset

In [7]:
sp_df = pd.read_csv('./downloaded_pages/sp_02apr2020_tableau.csv')

## Creates engines

In [12]:
vimeo_engine = create_engine('postgresql+psycopg2://postgres:123@localhost')
# motionographer_engine = create_engine('postgresql+psycopg2://postgres:admin@localhost/motionographer')
engines = vimeo_engine
conn = engines.connect()

## Runs engines

In [13]:
sp_df.to_sql('staff_picks', conn, index=False, if_exists='append')

# Next Steps

1. Crawl all Motionographer pages
2. Create dataset from it
3. Filter useful information from Motionographer's posts
4. Consolidate Pipeline
5. Save Vimeo and Motionographer's data in a SQL database
6. Update remote repo

Extra:
* Clean data
* Share in Kaggle
* *Write content from it, with data visualization*
* Share on LinkedIn with the community of designers/filmakers
* Have 100% of functions with proper docstring description
