# Vogue Business
## Fashion Job Losses Tracker 2020

### _Feb 2020 - Ongoing_

ADMIN SITE: https://fashionjoblossestracker.herokuapp.com/admin/

API: https://fashionjoblossestracker.herokuapp.com/api/countries

GITHUB REPO: https://github.com/andyclarkemedia/fashionjoblosses

SOURCE LIST: https://docs.google.com/spreadsheets/d/1zXYchoIP6KXfFetMQXt9qI8JWmBH2NEW1DYqhRCVrHo/edit?usp=sharing

_An online weekly web scraper that looks for mentions of job losses (and synonyms) in news articles from around 50 digital fashion publishers._

------

### Notebook Description

> This notebook provides code to retrieve the data generated by this project's news article crawler, which runs weekly (Mondays at 10am), from an API hosted on Heroku.

> With an API request you can retrieve either all data in the database or only the data from the latest crawl.

> Data is stored locally as a CSV or JSON file.

> Instructions on how to add sources to the site list are also given below.

-----

### Contents

> #### <a href="#packages" >Packages Needed</a>

> #### <a href="#add_new_sources" >How to Add New Sources</a>

> #### <a href="#imports" >Imports & Configuration</a>

> #### <a href="#download_all" >Download All Data</a>

> #### <a href="#download_latest" >Download Latest Data</a>

> #### <a href="#notes" >Notes</a>

------

<h3 id="#packages" >Packages Needed</h3>

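To run this notebook you need the following third-party packages installed:

- requests
- pandas (provides `json_normalize`)

The notebook also uses the `json`, `csv` and `datetime` modules, which ship with Python's standard library and need no separate installation.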

----

<h3 id="#add_new_sources" >How to Add New Sources</h3>

The <a href="https://github.com/andyclarkemedia/fashionjoblosses/blob/master/news_crawler/modules/sources_dictionary.py" >sources dictionary</a> uses XPath expressions to recognise the HTML characteristics of each source site.

Example XPath expression: `//span[contains(@class, 'tie-date')]`

> ##### A Sample Source Entry

```
"Drapers - Home": {
    "country": "United Kingdom",
    "language": "English",
    "landing_urls": ["https://www.drapersonline.com/"],
    "landing_characteristics": '//div[contains(@id, "main-content")]//a/@href',
    "article_characteristics": "//p",
    "headline_characteristics": "//h1",
    "published_date_characteristics": "//span[contains(@class, 'tie-date')]",
    "article_url_prefix": "https://www.drapersonline.com",
    "fashionb2b": True
},
```

#### Key Explanations:

`country` - main country covered in publication

`language` - language used by publication

`landing_urls` - URLs of the landing pages where the scraper looks for article links

`landing_characteristics` - characteristics of the HTML `<a>` tags leading to articles

`article_characteristics` - characteristics of HTML where article text is stored

`headline_characteristics` - characteristics of HTML where article headline is stored

`published_date_characteristics` - characteristics of HTML where article date information is stored

`article_url_prefix` - (if required) URL prefix to prepend to article links found on the landing page

`fashionb2b` - Boolean specifying whether publication exclusively publishes content about the fashion industry

#### Adding a New Source:

To add a new source, find the site characteristics and make a new entry in the <a href="https://github.com/andyclarkemedia/fashionjoblosses/blob/master/news_crawler/modules/sources_dictionary.py" >sources dictionary</a> below the existing entries.
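
Before committing a new entry, it can help to check that it carries the same keys as the sample above. The following is a minimal sketch of such a check. The helper `validate_source_entry` and the split into required and optional keys are assumptions made for illustration; they are not part of the repository, and the crawler itself may be more lenient or stricter.

```python
# Keys and types are taken from the sample entry above.
# Treating everything except "article_url_prefix" as required is an assumption.
REQUIRED_KEYS = {
    "country": str,
    "language": str,
    "landing_urls": list,
    "landing_characteristics": str,
    "article_characteristics": str,
    "headline_characteristics": str,
    "published_date_characteristics": str,
    "fashionb2b": bool,
}
# Described in the key explanations as "(if required)".
OPTIONAL_KEYS = {"article_url_prefix": str}


def validate_source_entry(name, entry):
    """Return a list of problems found in one sources-dictionary entry (hypothetical helper)."""
    problems = []
    for key, expected_type in REQUIRED_KEYS.items():
        if key not in entry:
            problems.append(f"{name}: missing key '{key}'")
        elif not isinstance(entry[key], expected_type):
            problems.append(f"{name}: '{key}' should be a {expected_type.__name__}")
    for key, expected_type in OPTIONAL_KEYS.items():
        if key in entry and not isinstance(entry[key], expected_type):
            problems.append(f"{name}: '{key}' should be a {expected_type.__name__}")
    # Flag misspelt or unexpected keys so they are not silently ignored.
    for key in sorted(set(entry) - set(REQUIRED_KEYS) - set(OPTIONAL_KEYS)):
        problems.append(f"{name}: unexpected key '{key}'")
    return problems
```

Calling `validate_source_entry("Drapers - Home", entry)` on the sample entry returns an empty list; any returned strings describe what to fix.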

#### Negotiating Pagination:

For sites where you want the scraper to visit multiple landing pages with similar URLs, use the <a href="https://github.com/andyclarkemedia/fashionjoblosses/blob/master/news_crawler/modules/url_helpers.py" >url helpers</a> file.

> ##### Example

_Create a list of the following URLs that feeds directly into the sources dictionary:_

> https://uk.reuters.com/news/archive/businessNews?view=page&page=1  
https://uk.reuters.com/news/archive/businessNews?view=page&page=2  
https://uk.reuters.com/news/archive/businessNews?view=page&page=3  

##### Sources Dictionary

`"landing_urls": reuters_url_list_creator(),`

##### URL Helper

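```
def reuters_url_list_creator():
    """Return the URLs of the first three Reuters business-news archive pages."""
    base_url = "https://uk.reuters.com/news/archive/businessNews?view=page&page="

    # Pages are numbered from 1, so range(1, 4) yields pages 1, 2 and 3
    return [base_url + str(n) for n in range(1, 4)]
```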
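
If several sites paginate in the same way, the same idea can be written once and reused. The sketch below is one possible generalisation; the function name `paginated_url_list` and its parameters are hypothetical and do not exist in the url helpers file.

```python
def paginated_url_list(base_url, first_page, last_page):
    """Build landing URLs for pages first_page..last_page, both inclusive.

    Hypothetical helper: assumes the page number is simply appended to base_url.
    """
    return [f"{base_url}{page}" for page in range(first_page, last_page + 1)]


# Reproduces the three Reuters URLs from the example above.
reuters_urls = paginated_url_list(
    "https://uk.reuters.com/news/archive/businessNews?view=page&page=", 1, 3
)
```

The resulting list can be assigned to `landing_urls` in the same way as the output of `reuters_url_list_creator()`.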

-------

<h3 id="#imports" >Imports & Configuration</h3>

In [62]:
import requests
import json
import csv
import pandas as pd
from pandas import json_normalize  # top-level since pandas 1.0 (previously pandas.io.json)
import datetime

In [7]:
# Set API Address
api_url_all = "https://fashionjoblossestracker.herokuapp.com/api/countries/"

-----

<h3 id="#download_all" >Download All Data</h3>

In [23]:
# Make the request
response_all = requests.get(api_url_all)
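
The request above does not check whether the API answered successfully, and a failed call would only surface later as a JSON parsing error. The following is a small sketch of a more defensive fetch. The helper name `fetch_json` and the 30-second timeout are assumptions; the `get` argument is expected to behave like `requests.get`, and is passed in so the helper can be tried out without network access.

```python
def fetch_json(url, get, timeout=30):
    """Fetch `url` using the supplied `get` callable and return the decoded JSON.

    Hypothetical helper: `get` is expected to behave like requests.get.
    """
    response = get(url, timeout=timeout)
    # Raise an exception on 4xx/5xx responses instead of parsing an error page.
    response.raise_for_status()
    return response.json()
```

In this notebook it could be used as `json_data_all = fetch_json(api_url_all, requests.get)`.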

In [27]:
# Save the content of the response
content_all = response_all.text

In [30]:
# Parse the JSON response
json_data_all = json.loads(content_all)

In [55]:
# Flatten the nested data: one row per job-loss record, carrying its country and id
flattened_format_all = json_normalize(json_data_all, 'job_losses', ['country', 'id'], record_prefix="job_loss_")
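
To make the flattening step concrete, here is a plain-Python sketch of roughly what that `json_normalize` call produces: one row per entry in `job_losses`, with the record's keys prefixed by `job_loss_` and the parent's `country` and `id` copied onto each row. The sample payload and its inner field names (`headline`, `url`) are invented for illustration, since the real API fields are not listed here. Unlike `json_normalize`, this sketch does not expand dictionaries nested inside a record.

```python
# Hypothetical payload: only "id", "country" and "job_losses" are known
# from the json_normalize call; the inner fields are made up.
sample = [
    {
        "id": 1,
        "country": "United Kingdom",
        "job_losses": [
            {"headline": "Retailer A cuts jobs", "url": "https://example.com/a"},
            {"headline": "Retailer B closes stores", "url": "https://example.com/b"},
        ],
    },
    {
        "id": 2,
        "country": "France",
        "job_losses": [
            {"headline": "Brand C restructures", "url": "https://example.com/c"},
        ],
    },
]


def flatten_countries(countries, record_path="job_losses",
                      meta=("country", "id"), record_prefix="job_loss_"):
    """Return one flat dict per nested record, with parent fields attached."""
    rows = []
    for country in countries:
        for record in country[record_path]:
            # Prefix the record's own keys, as record_prefix does.
            row = {f"{record_prefix}{key}": value for key, value in record.items()}
            # Copy the parent-level (meta) fields onto every row.
            for field in meta:
                row[field] = country[field]
            rows.append(row)
    return rows


rows = flatten_countries(sample)
```

Each element of `rows` corresponds to one row of the flattened DataFrame, which is why the CSV export contains one line per job-loss mention rather than one per country.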

#### CSV

In [56]:
# Download all data as csv - FLATTENED FORM
flattened_format_all.to_csv(r'fjl_all_data_csv_flattened.csv', index=False)

#### JSON

In [None]:
# Download all data as JSON - ORIGINAL FORM
with open('fjl_all_data_json_original.json', 'w') as fjl_all_data_json:
    json.dump(json_data_all, fjl_all_data_json)

In [None]:
# Download all data as JSON - FLATTENED FORM
flattened_format_all.to_json(r'fjl_all_data_json_flattened.json')

-----

<h3 id="#download_latest" >Download Latest Data</h3>

In [79]:
# Set the latest crawl date to the Monday of the current week (today, if today is a Monday)
today = datetime.date.today()
latest_crawl_date = today - datetime.timedelta(days=today.weekday())
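
Because the crawler runs on Mondays at 10am, the calculation above points at a crawl that has not happened yet when the notebook is run early on a Monday. The sketch below shows one way to fall back to the previous week's crawl in that case. The helper name `most_recent_crawl_date` is hypothetical, and it assumes the 10am schedule is in the same time zone as the local clock, which is not stated in this notebook. It also does not account for how long a crawl takes to finish.

```python
import datetime

CRAWL_WEEKDAY = 0                   # Monday, in datetime's weekday() numbering
CRAWL_TIME = datetime.time(10, 0)   # 10am; time zone assumed to match the local clock


def most_recent_crawl_date(now):
    """Return the date of the most recent crawl that has already started."""
    days_since_crawl_day = (now.weekday() - CRAWL_WEEKDAY) % 7
    crawl_date = now.date() - datetime.timedelta(days=days_since_crawl_day)
    # On a Monday before 10am this week's crawl has not run yet,
    # so step back to the previous Monday.
    if days_since_crawl_day == 0 and now.time() < CRAWL_TIME:
        crawl_date -= datetime.timedelta(days=7)
    return crawl_date
```

With this helper, `latest_crawl_date = most_recent_crawl_date(datetime.datetime.now())` could replace the two lines above.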

In [80]:
# Set API Address - for latest data
api_url_latest = "https://fashionjoblossestracker.herokuapp.com/api/joblossmentions/?date_accessed=" + str(latest_crawl_date)

In [81]:
# Make the request
response_latest = requests.get(api_url_latest)

In [82]:
# Save the content of the response
content_latest = response_latest.text

In [83]:
# Parse the JSON response
json_data_latest = json.loads(content_latest)

In [84]:
# Build a DataFrame from the returned records
flattened_format_latest = pd.DataFrame(json_data_latest)

#### CSV

In [88]:
# Download latest data as csv - FLATTENED FORM
flattened_format_latest.to_csv(r'fjl_latest_data_csv_flattened.csv', index=False)

#### JSON

In [87]:
# Download latest data as JSON - FLATTENED FORM
flattened_format_latest.to_json(r'fjl_latest_data_json_flattened.json')