# Exercise: Data Extraction Fundamentals

## Using the CSV module

In [8]:
"""
Your task is to process the supplied file and use the csv module to extract data from it.
The data comes from the NREL (National Renewable Energy Laboratory) website. Each file
contains information from one meteorological station, in particular the amount of
solar and wind energy for each hour of the day.

Note that the first line of the datafile is neither a data entry nor a header. It is a line
describing the data source. You should extract the name of the station from it.

The data should be returned as a list of lists (not dictionaries).
You can use the csv module's "reader" function to get data in such a format.
Another useful function is next(), which gets the next line from the iterator.
You should only change the parse_file function.
"""
import csv
import os

DATADIR = "data/"
DATAFILE = "745090.csv"


def parse_file(datafile):
    name = ""
    data = []
    with open(datafile, 'r', newline='') as f:
        reader = csv.reader(f)
        name = next(reader)[1]  # first line describes the source; field 1 is the station name
        next(reader)  # skip the column names
        data = list(reader)  # each remaining row is already a list of strings

    return (name, data)


datafile = os.path.join(DATADIR, DATAFILE)
name, data = parse_file(datafile)

print(name)
print(data[2])

MOUNTAIN VIEW MOFFETT FLD NAS
['01/01/2005', '03:00', '0', '0', '0', '2', '0', '0', '2', '0', '0', '2', '0', '0', '2', '0', '0', '2', '0', '0', '2', '0', '0', '2', '0', '8', 'E', '9', '8', 'E', '9', '7.0', 'A', '7', '6.0', 'A', '7', '93', 'A', '7', '1013', 'A', '7', '120', 'A', '7', '2.1', 'A', '7', '16100', 'A', '7', '2100', 'A', '7', '1.1', 'E', '8', '0.099', 'F', '8', '0.160', 'F', '8', '0', '1', 'A', '7']
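
The pattern used in parse_file (consume the source line, skip the header, keep the rest) can be tried without the NREL file. The following is a minimal sketch on an in-memory sample; the sample text and the helper name `parse_lines` are made up for illustration and only mimic the layout described in the task.

```python
import csv
import io

# Hypothetical sample that mimics the described layout:
# line 1 = source description, line 2 = column names, then data rows.
SAMPLE = (
    '745090,"MOUNTAIN VIEW MOFFETT FLD NAS",CA\n'
    'Date,Time,Value\n'
    '01/01/2005,01:00,0\n'
    '01/01/2005,02:00,3\n'
)


def parse_lines(lines):
    """Return (station name, data rows) from an iterable of CSV lines."""
    reader = csv.reader(lines)
    name = next(reader)[1]   # second field of the source line
    next(reader)             # discard the column names
    return name, list(reader)


name, rows = parse_lines(io.StringIO(SAMPLE))
print(name)
print(rows[0])
```

Because csv.reader accepts any iterable of lines, the same helper would work on an open file object as well as on the StringIO buffer used here.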


## Excel to CSV

In [27]:
# -*- coding: utf-8 -*-
'''
Find the time and value of the max load for each of the regions
COAST, EAST, FAR_WEST, NORTH, NORTH_C, SOUTHERN, SOUTH_C, WEST
and write the result out to a csv file, using the pipe character | as the delimiter.

An example output can be seen in the "example.csv" file.
'''

import xlrd
import csv

datafile = "data/2013_ERCOT_Hourly_Load_Data.xls"
outfile = "data/2013_Max_Loads.csv"


def parse_file(datafile):
    workbook = xlrd.open_workbook(datafile)
    sheet = workbook.sheet_by_index(0)   

    data = []
    data.append(['Station', 'Year', 'Month', 'Day', 'Hour', 'Max Load'])

    stations = sheet.row_values(0)[1:9]
    timecol = sheet.col_values(0, start_rowx=1)

    for i in range(8):
        col = sheet.col_values(i+1, start_rowx=1)
        maxcol = max(col)
        maxidx = col.index(maxcol)
        time = xlrd.xldate_as_tuple(timecol[maxidx], 0) 
        data.append([stations[i]] + list(time[:4]) + [maxcol])
  
    return data

def save_file(data, filename):
    # newline='' lets the csv module control line endings (avoids blank rows on Windows)
    with open(filename, 'w', newline='') as csvfile:
        w = csv.writer(csvfile, delimiter='|')
        w.writerows(data)
    
    
data = parse_file(datafile)
save_file(data, outfile)

data


[['Station', 'Year', 'Month', 'Day', 'Hour', 'Max Load'],
 ['COAST', 2013, 8, 13, 17, 18779.025510000003],
 ['EAST', 2013, 8, 5, 17, 2380.1654089999956],
 ['FAR_WEST', 2013, 6, 26, 17, 2281.2722140000024],
 ['NORTH', 2013, 8, 7, 17, 1544.7707140000005],
 ['NORTH_C', 2013, 8, 7, 18, 24415.570226999993],
 ['SOUTHERN', 2013, 8, 8, 16, 5494.157645],
 ['SOUTH_C', 2013, 8, 8, 18, 11433.30491600001],
 ['WEST', 2013, 8, 7, 17, 1862.6137649999998]]
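
The two steps that do the real work here, finding the position of a column's maximum and writing pipe-delimited rows, do not depend on xlrd. A minimal sketch with made-up numbers (the `hours` and `loads` lists are illustrative, not taken from the ERCOT file):

```python
import csv
import io

# Illustrative values only: one timestamp tuple per load reading.
hours = [(2013, 8, 13, 16), (2013, 8, 13, 17), (2013, 8, 13, 18)]
loads = [18500.5, 18779.0, 18620.25]

# max() gives the peak value; index() gives the row where it first occurs,
# which is then used to look up the matching timestamp.
peak = max(loads)
peak_time = hours[loads.index(peak)]

rows = [
    ['Station', 'Year', 'Month', 'Day', 'Hour', 'Max Load'],
    ['COAST'] + list(peak_time) + [peak],
]

# Write to an in-memory buffer instead of a file so the result can be inspected.
buffer = io.StringIO()
csv.writer(buffer, delimiter='|').writerows(rows)
print(buffer.getvalue())
```

Note that index() returns the first occurrence, so if the maximum appears more than once only the earliest hour is reported.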

## Wrangling JSON

If you want to know more, or to query the site yourself, please read the [NYTimes Developer Documentation for the Most Popular API](http://developer.nytimes.com/docs/most_popular_api/) and [apply for your own API Key for NY Times](http://developer.nytimes.com/page).

In [68]:
"""
This exercise shows some important concepts that you should be aware of:
- using the codecs module to write unicode files
- using authentication with web APIs
- using an offset when accessing web APIs

To run this code locally you have to register at the NYTimes developer site 
and get your own API key. You will be able to complete this exercise in our UI
without doing so, as we have provided a sample result. (See the file 
'popular-viewed-1.json' from the tabs above.)

Your task is to modify the article_overview() function to process the saved
file that represents the most popular articles (by view count) from the last
day, and return a tuple of variables containing the following data:
- labels: list of dictionaries, where the keys are the "section" values and
  values are the "title" values for each of the retrieved articles.
- urls: list of URLs for all 'media' entries with "format": "Standard Thumbnail"

All your changes should be in the article_overview() function. 
The rest of the functions are provided for your convenience, in case you want to access
the API yourself.
"""
import json
import codecs
import requests

URL_MAIN = "http://api.nytimes.com/svc/"
URL_POPULAR = URL_MAIN + "mostpopular/v2/"
API_KEY = { "popular": "",
            "article": ""}



def article_overview(kind, period):
    data = get_popular(URL_POPULAR, kind, period)
    data = data['results']
    titles = []
    urls = []

    for item in data:
        titles.append({item['section']: item['title']})
        for media in item['media']:
            # keep only the URLs of the "Standard Thumbnail" renditions
            for meta in media['media-metadata']:
                if meta['format'] == 'Standard Thumbnail':
                    urls.append(meta['url'])

    return (titles, urls)


def query_site(url, target, offset):
    # This will set up the query with the API key and offset.
    # Web services often use an offset parameter to return data in small chunks.
    # NYTimes returns 20 articles per request; if you want the next 20,
    # you have to provide the offset parameter.
    if API_KEY["popular"] == "" or API_KEY["article"] == "":
        print("You need to register for a NYTimes Developer account to run this program.")
        print("See Instructor notes for information")
        return False
    params = {"api-key": API_KEY[target], "offset": offset}
    r = requests.get(url, params = params)

    if r.status_code == requests.codes.ok:
        return r.json()
    else:
        r.raise_for_status()


def get_popular(url, kind, days, section="all-sections", offset=0):
    # This function will construct the query according to the requirements of the site
    # and return the data, or print an error message if called incorrectly
    if days not in [1, 7, 30]:
        print("Time period can be 1, 7 or 30 days only")
        return False
    if kind not in ["viewed", "shared", "emailed"]:
        print("kind can be only one of viewed/shared/emailed")
        return False

    url += "most{0}/{1}/{2}.json".format(kind, section, days)
    data = query_site(url, "popular", offset)

    return data


def save_file(kind, period):
    # This will process all results by calling the API repeatedly with the supplied offset value,
    # combine the data and then write all results to a file.
    data = get_popular(URL_POPULAR, kind, period)
    num_results = data["num_results"]
    full_data = []
    with codecs.open("data/popular-{0}-{1}.json".format(kind, period), encoding='utf-8', mode='w') as v:
        for offset in range(0, num_results, 20):
            data = get_popular(URL_POPULAR, kind, period, offset=offset)
            full_data += data["results"]

        v.write(json.dumps(full_data, indent=2))



titles, urls = article_overview("viewed", 1)
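
To see what article_overview() has to pick out, it helps to run the same extraction on a small hand-written structure. The sketch below assumes each result carries "section", "title" and a "media" list whose entries hold a "media-metadata" list, as the code above implies; the sample records, the example.com URLs, the "Large" format name and the helper name `overview` are all invented for illustration.

```python
# Hand-written stand-in for the "results" list; not real API output.
SAMPLE_RESULTS = [
    {
        "section": "World",
        "title": "First headline",
        "media": [
            {"media-metadata": [
                {"format": "Standard Thumbnail", "url": "https://example.com/a-thumb.jpg"},
                {"format": "Large", "url": "https://example.com/a-large.jpg"},
            ]},
        ],
    },
    {
        "section": "Opinion",
        "title": "Second headline",
        "media": [],  # an article without media contributes no URL
    },
]


def overview(results):
    """Return ([{section: title}, ...], [thumbnail URL, ...])."""
    labels = [{item["section"]: item["title"]} for item in results]
    urls = [
        meta["url"]
        for item in results
        for media in item["media"]
        for meta in media["media-metadata"]
        if meta["format"] == "Standard Thumbnail"
    ]
    return labels, urls


labels, thumbs = overview(SAMPLE_RESULTS)
print(labels)
print(thumbs)
```

Filtering inside the comprehension means an article with no matching rendition is simply skipped instead of raising an error.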


In [69]:
titles

[{'Opinion': 'Donald, This I Will Tell You'},
 {'Magazine': 'Trump vs. Congress: Now What?'},
 {'U.S.': 'After Barring Girls for Leggings, United Airlines Defends Decision'},
 {'U.S.': 'Trump Becomes Ensnared in Fiery G.O.P. Civil War'},
 {'World': 'Alone in the Wild for a Year, TV Contestants Learn Their Show Was Canceled'},
 {'Opinion': 'I Loved My Grandmother. But She Was a Nazi.'},
 {'World': 'Canadians Adopted Refugee Families for a Year. Then Came ‘Month 13.’'},
 {'Business Day': 'One Nation, Under Fox: 18 Hours With a Network That Shapes America'},
 {'World': 'Protesters Gather in 99 Cities Across Russia; Top Putin Critic Is Arrested'},
 {'U.S.': 'Boris Epshteyn, Trump TV Surrogate, Is Leaving White House Job'},
 {'Well': 'The Best Exercise for Aging Muscles'},
 {'Opinion': 'Trump’s Triumph of Incompetence'},
 {'U.S.': 'Jeanine Pirro Calls for Paul Ryan to Step Down After Health Bill Failure'},
 {'U.S.': 'Who Stopped the Republican Health Bill?'},
 {'Opinion': 'Break Up the Libe

In [70]:
urls

['https://static01.nyt.com/images/2017/03/26/sunday-review/26Dowd2/26Dowd2-thumbStandard.jpg',
 'https://static01.nyt.com/images/2017/04/02/magazine/02trump6/02trump6-thumbStandard-v2.jpg',
 'https://static01.nyt.com/images/2017/03/27/us/27xp-leggings/27xp-leggings-thumbStandard.jpg',
 'https://static01.nyt.com/images/2017/03/26/us/26Trump-1/26Trump-1-thumbStandard.jpg',
 'https://static01.nyt.com/images/2017/03/25/arts/25xp-eden/25xp-eden-thumbStandard.jpg',
 'https://static01.nyt.com/images/2017/03/25/opinion/25shattuckWweb/25shattuckWweb-thumbStandard.jpg',
 'https://static01.nyt.com/images/2017/03/24/world/canada/24canada-slide-0VZF/24canada-slide-0VZF-thumbStandard-v2.jpg',
 'https://static01.nyt.com/images/2017/03/25/business/25fox-top/25fox-top-thumbStandard-v2.jpg',
 'https://static01.nyt.com/images/2017/03/27/world/27russia-web1/27russia-web1-thumbStandard.jpg',
 'https://static01.nyt.com/images/2017/03/26/us/26boris/26boris-thumbStandard.jpg',
 'https://static01.nyt.com/image