# Problem set: Data in more complex formats

## HTML
### Carrier List

In [8]:

"""
Your task in this exercise is to modify 'extract_carriers()' to get a list of
all airlines. Exclude all of the combination values like "All U.S. Carriers"
from the data that you return. You should return a list of codes for the
carriers.
"""

from bs4 import BeautifulSoup
html_page = "data/airport.html"


def extract_carriers(page):

    with open(page, "r") as html:
        soup = BeautifulSoup(html, "lxml")
        carriers = soup.find(id="CarrierList")
        data = [option['value'] for option in carriers.find_all('option')
               if not option['value'].startswith('All')]
            
    return data


data = extract_carriers(html_page)
data

['AS',
 'G4',
 'AA',
 '5Y',
 'DL',
 'MQ',
 'EV',
 'F9',
 'HA',
 'B6',
 'OO',
 'WN',
 'NK',
 'UA',
 'VX']

### Airport List

In [7]:
"""
Complete the 'extract_airports()' function so that it returns a list of airport
codes, excluding any combinations like "All".

"""

def extract_airports(page):

    with open(page, "r") as html:
        soup = BeautifulSoup(html, "lxml")
        airports = soup.find(id="AirportList")
        data = [option['value'] for option in airports.find_all('option') 
                if not option['value'].startswith('All')]
        
    return data



data = extract_airports(html_page)
data[:5]


['ATL', 'BWI', 'BOS', 'CLT', 'MDW']
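
The two functions above differ only in the element id they look up, so the filtering logic can live in one helper. The sketch below is a minimal illustration of that idea, not the notebook's method: it swaps BeautifulSoup for the standard library's `html.parser` so that it runs without extra packages, it assumes the lists are `<select>` elements, and the `SAMPLE` markup, `OptionCollector`, and `extract_codes` are hypothetical names invented for the example.

```python
from html.parser import HTMLParser


class OptionCollector(HTMLParser):
    """Collect the value of every <option> inside the <select> with a given id."""

    def __init__(self, element_id):
        super().__init__()
        self.element_id = element_id
        self.inside = False
        self.values = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "select" and attrs.get("id") == self.element_id:
            self.inside = True
        elif tag == "option" and self.inside and "value" in attrs:
            self.values.append(attrs["value"])

    def handle_endtag(self, tag):
        if tag == "select":
            self.inside = False


def extract_codes(html_text, element_id):
    """Return option values for element_id, skipping combined 'All...' entries."""
    parser = OptionCollector(element_id)
    parser.feed(html_text)
    return [value for value in parser.values if not value.startswith("All")]


# Made-up markup for illustration only; the real page is data/airport.html.
SAMPLE = """
<select id="CarrierList">
  <option value="All">All U.S. and Foreign Carriers</option>
  <option value="AllUS">All U.S. Carriers</option>
  <option value="AA">American Airlines Inc.</option>
  <option value="DL">Delta Air Lines Inc.</option>
</select>
<select id="AirportList">
  <option value="All">All</option>
  <option value="AllMajors">All Major Airports</option>
  <option value="ATL">Atlanta, GA</option>
</select>
"""

carrier_codes = extract_codes(SAMPLE, "CarrierList")
airport_codes = extract_codes(SAMPLE, "AirportList")
```

With BeautifulSoup the same shape applies: pass the id as a parameter and keep the `startswith('All')` filter in one place.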

### Processing all


In [16]:
"""
Let's assume that you combined the code from the previous 2 exercises with code
from the lesson on how to build requests, and downloaded all the data locally.
The files are in a directory "data", named after the carrier and airport:
"{}-{}.html".format(carrier, airport), for example "FL-ATL.html".

The rows of the table with flight info have class="dataTDRight". Your task is to
use 'process_file()' to extract the flight data from that table as a list of
dictionaries, each dictionary containing relevant data from the file and table
row. This is an example of the data structure you should return:

data = [{"courier": "FL",
         "airport": "ATL",
         "year": 2012,
         "month": 12,
         "flights": {"domestic": 100,
                     "international": 100}
        },
         {"courier": "..."}
]

Note - year, month, and the flight data should be integers.
You should skip the rows that contain the TOTAL data for a year.

"""
from bs4 import BeautifulSoup

def process_file(f):
    """
    This function extracts data from the file given as the function argument in
    a list of dictionaries. This is an example of the data structure you should
    return:

    data = [{"courier": "FL",
             "airport": "ATL",
             "year": 2012,
             "month": 12,
             "flights": {"domestic": 100,
                         "international": 100}
            },
            {"courier": "..."}
    ]


    Note - year, month, and the flight data should be integers.
    You should skip the rows that contain the TOTAL data for a year.
    """
    data = []
    info = {}
    info["courier"], info["airport"] = f[:6].split("-")
    info["flights"] = {}
    # Note: create a new dictionary for each entry in the output data list.
    # The loop below reuses the single info dictionary defined here, so each
    # element in the returned list is a reference to that same dictionary.
    with open("{}/{}".format(datadir, f), "r") as html:

        soup = BeautifulSoup(html, 'lxml')
        items = soup.find_all('tr', class_='dataTDRight')
        for item in items:
            infolist = [it.text for it in item.find_all('td')]
            if infolist[1] != 'TOTAL':
                info['year'] = int(infolist[0])
                info['month'] = int(infolist[1])
                info['flights']['domestic'] = int(infolist[2].replace(',',''))
                info['flights']['international'] = int(infolist[3].replace(',',''))
                data.append(info)        

    return data


datadir = 'data'
data = process_file('FL-ALT.html')

data[:2]

[{'airport': 'ALT',
  'courier': 'FL',
  'flights': {'domestic': 798879, 'international': 97094},
  'month': 12,
  'year': 2003},
 {'airport': 'ALT',
  'courier': 'FL',
  'flights': {'domestic': 798879, 'international': 97094},
  'month': 12,
  'year': 2003}]
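
The two records printed above are identical. This is what happens when every `data.append(info)` stores a reference to the same `info` dictionary: each pass through the loop overwrites the values that all earlier entries point to. A minimal sketch of one way to avoid this is to build a fresh dictionary per row. The helpers `split_name` and `rows_to_records` and the sample rows are hypothetical and made up for illustration; the rows stand in for the `infolist` values read from the table. `split_name` also avoids the fixed `f[:6]` slice, which assumes a two-letter carrier and a three-letter airport.

```python
import os


def split_name(filename):
    """Split 'FL-ATL.html' into ('FL', 'ATL') without assuming code lengths."""
    stem = os.path.splitext(os.path.basename(filename))[0]
    courier, airport = stem.split("-", 1)
    return courier, airport


def rows_to_records(courier, airport, rows):
    """Build one independent dictionary per table row, skipping TOTAL rows."""
    records = []
    for cells in rows:
        if cells[1] == "TOTAL":
            continue
        # A new dict (and a new nested "flights" dict) is created on every
        # iteration, so later rows cannot overwrite earlier records.
        records.append({
            "courier": courier,
            "airport": airport,
            "year": int(cells[0]),
            "month": int(cells[1]),
            "flights": {
                "domestic": int(cells[2].replace(",", "")),
                "international": int(cells[3].replace(",", "")),
            },
        })
    return records


# Made-up rows for illustration only; they are not taken from the data files.
rows = [
    ["2002", "10", "815,489", "92,565"],
    ["2002", "11", "766,775", "91,342"],
    ["2002", "TOTAL", "1,582,264", "183,907"],
]
courier, airport = split_name("FL-ATL.html")
records = rows_to_records(courier, airport, rows)
```

Note that the nested `flights` dictionary needs the same treatment: copying only the outer dictionary would still leave every record sharing one `flights` object.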

## XML
### Patent database

In [18]:
"""
This and the following exercise use the US Patent database. The patent.data
file is a small excerpt of the much larger data files that are available for
download from the US Patent website. These files are pretty large (>100 MB each).
The original file is ~600 MB, so you might not be able to open it in a text
editor.

The data itself is in XML; however, there is a problem with how it's formatted.
Please run this script and observe the error. Then find the line that is
causing the error. You can do that by just looking at the datafile in the web
UI, or programmatically. For quiz purposes it does not matter, but as an
exercise we suggest that you try to do it programmatically.

NOTE: You do not need to correct the error - for now, just find where the error
is occurring.
"""

import xml.etree.ElementTree as ET

PATENTS = 'data/patent.data'

def get_root(fname):

    tree = ET.parse(fname)
    return tree.getroot()


get_root(PATENTS)

ParseError: junk after document element: line 657, column 0 (<string>)

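The error above is caused by line 657 of patent.data, which reads "<?xml version="1.0" encoding="UTF-8"?>". That line is the XML declaration of a second document, and ElementTree expects a file to contain a single document with a single root element.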
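
To find the offending line programmatically, one option is to catch the `ParseError` and read its `position` attribute, a `(line, column)` tuple, then print that line of the file. The sketch below is a minimal illustration under stated assumptions: `find_parse_error` is a hypothetical helper, and the two-document sample written to a temporary file is made up to stand in for patent.data.

```python
import os
import tempfile
import xml.etree.ElementTree as ET


def find_parse_error(fname):
    """Return (line_number, line_text) for the first parse error, or None."""
    try:
        ET.parse(fname)
    except ET.ParseError as err:
        line_no, _column = err.position  # line numbers are 1-based
        with open(fname, "r") as f:
            for current, text in enumerate(f, start=1):
                if current == line_no:
                    return line_no, text.rstrip("\n")
    return None


# Made-up stand-in for patent.data: two XML documents concatenated.
sample = (
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    '<patent><id>1</id></patent>\n'
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    '<patent><id>2</id></patent>\n'
)
with tempfile.NamedTemporaryFile("w", suffix=".data", delete=False) as tmp:
    tmp.write(sample)
    path = tmp.name

result = find_parse_error(path)
os.remove(path)
```

Calling `find_parse_error(PATENTS)` on the real file would, under the same assumptions, point at the line reported in the traceback.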

### Processing patent

In [29]:
# So, the problem is that the gigantic file is actually not valid XML, because
# it has several root elements and XML declarations.
# It is, as a matter of fact, a collection of many concatenated XML documents.
# So, one solution would be to split the file into separate documents,
# so that you can process the resulting files as valid XML documents.

import xml.etree.ElementTree as ET
DIR = 'data/'
PATENTS = 'patent.data'


def split_file(filename):
    """
    Split the input file into separate files, each containing a single patent.
    As a hint - each patent declaration starts with the same line that was
    causing the error found in the previous exercise.
    
    The new files should be saved with filename in the following format:
    "{}-{}".format(filename, n) where n is a counter, starting from 0.
    """

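    out = None
    with open(DIR+filename, 'r') as f:
        i = 0
        for line in f:
            if line.startswith('<?xml'):
                # A new declaration starts a new patent: close the previous
                # output file before opening the next one.
                if out:
                    out.close()
                out_filename = "{}-{}".format(filename, i)
                out = open(DIR+out_filename, 'w')
                i += 1
            # Skip any lines that appear before the first declaration.
            if out:
                out.write(line)
    if out:
        out.close()  # make sure the last patent is flushed to disk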



def test():
    split_file(PATENTS)
    for n in range(4):
        fname = "{}-{}".format(PATENTS, n)
        try:
            # The with block closes the file even if reading fails.
            with open(DIR+fname, "r") as f:
                if not f.readline().startswith("<?xml"):
                    print("You have not split the file {} at the correct boundary!".format(fname))
        except IOError:
            print("Could not find file {}. Check if the filename is correct!".format(fname))


test()
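
Once the document boundaries are known, the pieces can also be parsed without writing intermediate files. The sketch below is a minimal alternative to `split_file()`, not a replacement for the exercise: `iter_documents` is a hypothetical generator that groups lines by the `<?xml` declaration, and the sample text is made up for illustration.

```python
import xml.etree.ElementTree as ET


def iter_documents(lines):
    """Yield each concatenated XML document as a single string."""
    current = []
    for line in lines:
        # A declaration marks the start of the next document.
        if line.startswith("<?xml") and current:
            yield "".join(current)
            current = []
        current.append(line)
    if current:
        yield "".join(current)


# Made-up stand-in for patent.data: two XML documents concatenated.
sample = (
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    '<patent><id>1</id></patent>\n'
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    '<patent><id>2</id></patent>\n'
)

# Encode to bytes so the parser honors the declared UTF-8 encoding.
roots = [
    ET.fromstring(doc.encode("utf-8"))
    for doc in iter_documents(sample.splitlines(keepends=True))
]
ids = [root.findtext("id") for root in roots]
```

For the real file, passing an open file object to `iter_documents` would keep only one patent in memory at a time, which matters for the ~600 MB original.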