# Introduction
This part of the repository downloads the image files for the previously scraped articles.

# Environment setup

## Google Drive mount
I use Google Colaboratory as my default platform, so the environment needs to be integrated with Google Drive. You can skip this section if you're working locally.

1. Mount Google Drive on the runtime to be able to read and write files. This will ask you to log in to your Google Account and provide an authorization code.
2. Create a symbolic link to the working directory.
3. Change into the directory where the repository is cloned.


In [None]:
from google.colab import drive
drive.mount('/content/gdrive', force_remount=True)

Mounted at /content/gdrive


In [None]:
!ln -s /content/gdrive/My\ Drive/Colab\ Notebooks/dezeenAI /mydrive
!ls /mydrive

ln: failed to create symbolic link '/mydrive/dezeenAI': File exists
classes.gsheet	dezeen_basic-detection.ipynb  files
darknet		dezeen_download.ipynb	      LICENSE
data		dezeen_scrape.ipynb	      OIDv4_ToolKit
dezeenAI	dezeen_test.ipynb	      README.md


In [None]:
%cd /mydrive

/content/gdrive/My Drive/Colab Notebooks/dezeenAI
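
The `File exists` message in the output above appears when the cell is re-run and `/mydrive` is already linked. A minimal sketch of an idempotent alternative in Python, assuming the same target and link paths as the cell above; the helper name `ensure_symlink` is hypothetical and not part of the repository:

```python
import os

def ensure_symlink(target, link):
    """Create `link` pointing at `target` unless it already does (hypothetical helper)."""
    if os.path.islink(link):
        if os.readlink(link) == target:
            return False  # link already in place, nothing to do
        os.remove(link)   # stale link pointing elsewhere: replace it
    os.symlink(target, link)
    return True

# In Colab this might be called as (paths taken from the cell above):
# ensure_symlink('/content/gdrive/My Drive/Colab Notebooks/dezeenAI', '/mydrive')
```

If `link` exists as a regular file or directory rather than a symbolic link, `os.symlink` raises `FileExistsError`, which is probably the safest behaviour here.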


## Libraries & functions
- `pandas` - data manipulation & analysis
- `timeit` - cell runtime check
- `tqdm` - loop progress bar
- `os` - operating system interfaces
- `os.path` - pathname manipulation
- `urllib.request` - URL opening
- `json` - JSON file handling

In [None]:
import pandas as pd
import numpy as np
import timeit
from tqdm import tqdm
import os
from os.path import basename
import urllib.request
import json

# Files download

## Load DataFrame
Imports the base DataFrame from the previously exported pickle file.

In [None]:
articles_df = pd.read_pickle('data/articles.pkl')
articles_df

Unnamed: 0,id,title,url,images
0,1588230,Daosheng Design creates monochromatic bar with...,https://www.dezeen.com/2020/11/19/the-flow-of-...,[https://static.dezeen.com/uploads/2020/11/the...
1,1588111,Issey Miyake store in Osaka is splashed with w...,https://www.dezeen.com/2020/11/19/issey-miyake...,[https://static.dezeen.com/uploads/2020/11/iss...
2,1586860,Nook Pod is a gabled workspace,https://www.dezeen.com/2020/11/18/nook-pod-dez...,[https://static.dezeen.com/uploads/2020/11/noo...
3,1587217,Project #13 is an office for Studio Wills + Ar...,https://www.dezeen.com/2020/11/17/project-13-h...,[https://static.dezeen.com/uploads/2020/11/pro...
4,1586339,AHEAD Europe 2020 awards winners announced in ...,https://www.dezeen.com/2020/11/16/ahead-europe...,[]
...,...,...,...,...
4946,70,Marcel Wanders launches Crochet Chair,https://www.dezeen.com/2006/12/10/marcel-wande...,[https://static.dezeen.com/uploads/2006/12/Mar...
4947,67,WOKmedia show at Design Miami,https://www.dezeen.com/2006/12/10/wokmedia-sho...,[https://static.dezeen.com/uploads/2006/12/WOK...
4948,54,Zaha Hadid furniture exhibited in New York,https://www.dezeen.com/2006/12/07/zaha-hadid-f...,[https://static.dezeen.com/uploads/2006/12/Zah...
4949,48,Thomas Heatherwick beach cafe takes shape,https://www.dezeen.com/2006/12/04/thomas-heath...,[https://static.dezeen.com/uploads/2006/12/Tho...
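
Before starting a long download it may be worth checking that every row has the fields the download loop relies on. The sketch below is one possible check, not part of the original notebook; `find_invalid_rows` is a hypothetical helper that works on plain dictionaries, and the column names are taken from the output above.

```python
def find_invalid_rows(records, required=('id', 'title', 'url', 'images')):
    """Return indices of records missing a required key or whose 'images' is not a list."""
    bad = []
    for i, rec in enumerate(records):
        # any() short-circuits, so rec['images'] is only read when the key exists
        if any(key not in rec for key in required) or not isinstance(rec['images'], list):
            bad.append(i)
    return bad

# Possible usage with the DataFrame loaded above:
# find_invalid_rows(articles_df.to_dict('records'))
```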


## Download files

1. Iterate through rows of the DataFrame to extract necessary data as variables
2. Create a unique folder for each article to store the files
3. Save a JSON file in each folder containing information about the origin of the files downloaded
4. Iterate through the list of images for each article to:
  - download the image file
  - save the images' paths to a list
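
Steps 2 and 4 both depend on turning an article id and an image URL into a local path. A small sketch of that mapping, assuming the `data/dezeen/<id>/` layout used in this notebook; `local_image_path` is a hypothetical helper, and the URL in the comment is made up for illustration:

```python
import posixpath
from urllib.parse import urlsplit

def local_image_path(article_id, image_url, root='data/dezeen'):
    """Build the per-article destination path for an image URL (hypothetical helper)."""
    # Use only the URL path so a query string or fragment never ends up in the filename.
    filename = posixpath.basename(urlsplit(image_url).path)
    return posixpath.join(root, str(article_id), filename)

# Example with a made-up URL:
# local_image_path(1588230, 'https://static.dezeen.com/uploads/2020/11/example.jpg?x=1')
```

Unlike a plain `basename(image)`, this drops any query string from the filename, which could matter if image URLs ever carry parameters.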

In [None]:
start = timeit.default_timer() # start the timer

listofpaths = []
for _, row in tqdm(articles_df.iterrows()):
  article_id = row['id'] # avoid shadowing the built-in id()
  title = row['title']
  url = row['url']
  images = row['images']

  # one folder per article
  path = 'data/dezeen/'+str(article_id)+'/'
  os.makedirs(path, exist_ok=True)

  # record where the downloaded files came from
  info = {
      'id': article_id,
      'title': title,
      'url': url,
      'images': images
  }
  with open(path+str(article_id)+'.json', 'w') as fp:
    json.dump(info, fp)

  for image in images:
    filename = basename(image)
    if not os.path.exists(path+filename):
      try:
        urllib.request.urlretrieve(image, path+filename)
      except OSError: # covers URLError and HTTPError
        continue # skip this image, keep trying the rest
    # also list files kept from earlier runs
    listofpaths.append('/mydrive/'+path+filename)

stop = timeit.default_timer() # stop the timer
print('\nRuntime: {} seconds.'.format(stop-start))

4951it [1:22:13,  1.00it/s]


Runtime: 4933.615870068999 seconds.
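
A run of this length (over 80 minutes here) is likely to hit occasional transient network errors. A minimal sketch of a retry wrapper around `urllib.request.urlretrieve`, offered as one possible approach rather than what the notebook does; the name `download_with_retry` and its `retries`, `delay`, and `fetch` parameters are assumptions introduced for illustration:

```python
import time
import urllib.request

def download_with_retry(url, dest, retries=3, delay=1.0, fetch=urllib.request.urlretrieve):
    """Try to download `url` to `dest`, retrying on OSError; return True on success."""
    for attempt in range(1, retries + 1):
        try:
            fetch(url, dest)
            return True
        except OSError:  # URLError and HTTPError are subclasses of OSError
            if attempt < retries:
                time.sleep(delay)  # brief pause before the next attempt
    return False
```

The `fetch` parameter exists only so the function can be exercised without network access; in the notebook the default would be used.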





## Export list of files
Export the list of image file paths to an external text file for later use.

In [None]:
with open('files/dezeen-images.txt', 'w') as output:
    output.write('\n'.join(listofpaths))
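
When the list is read back later, it may help to confirm that every listed file is still on disk. A short sketch under that assumption; `missing_files` is a hypothetical helper, and it expects one path per line as written by the cell above:

```python
import os

def missing_files(list_path):
    """Return the paths listed in `list_path` (one per line) that do not exist on disk."""
    with open(list_path) as fh:
        paths = [line.strip() for line in fh if line.strip()]
    return [p for p in paths if not os.path.exists(p)]

# Possible usage:
# missing_files('files/dezeen-images.txt')
```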