In this problem set you work with cities infobox data: you audit it, come up
with a cleaning idea, and then clean it up.

In the first exercise we want you to audit
the datatypes that can be found in some particular fields in the dataset.

The possible types of values can be:

- NoneType, if the value is the string "NULL" or an empty string ""
- list, if the value starts with "{"
- int, if the value can be cast to int
- float, if the value can be cast to float, but CANNOT be cast to int. For example, '3.23e+07' should be considered a float because it can be cast to float, but int('3.23e+07') raises a ValueError
- str, for all other values
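
The int/float rule can be checked directly in the interpreter. This is only a minimal sketch; the helper name `can_cast` is hypothetical and not part of the exercise:

```python
def can_cast(value, caster):
    """Return True if caster(value) succeeds, False if it raises ValueError."""
    try:
        caster(value)
        return True
    except ValueError:
        return False

# '3.23e+07' parses as a float but not as an int, so it counts as a float.
sci = '3.23e+07'
sci_is_int = can_cast(sci, int)      # False
sci_is_float = can_cast(sci, float)  # True

# A plain digit string parses as both; it counts as int because int is checked first.
plain_is_int = can_cast('1140', int)  # True
```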

The audit_file function should return a dictionary that maps each field name to
a SET of the types that can be found in that field, e.g.
{"field1": set([type(float()), type(int()), type(str())]),
 "field2": set([type(str())]),
  ....
}
The type() function returns a type object describing the argument given to the 
function. You can also use examples of objects to create type objects, e.g.
type(1.1) for a float: see the test function below for examples.
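
As a quick illustration of how type objects behave inside a set (a sketch only; the variable names are made up for this example):

```python
# type() applied to an example object gives the same object as the built-in type name.
float_type = type(1.1)
none_type = type(None)

# Type objects are hashable, so a set keeps one entry per distinct type.
seen = set([type(1.1), type(2.5), type(3), type(None)])

same_float = float_type is float   # True
distinct_types = len(seen)         # 3: float, int and NoneType
```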

All the data initially is a string, so you have to do some checks on the values
first.
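
To see why the checks are needed, here is a small sketch that reads a one-row CSV held in memory. The sample text is made up for illustration, and the snippet assumes Python 3 (`io.StringIO` with a plain string literal). Every value comes back from `csv.DictReader` as a string, whatever it looks like:

```python
import csv
import io

# Hypothetical sample data, only for illustration.
sample = io.StringIO("name,populationTotal,elevation\nKud,1140,1855.0\n")

rows = list(csv.DictReader(sample))
value_types = set(type(v) for v in rows[0].values())
# value_types holds only str: '1140' and '1855.0' are still text at this point.
```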

In [1]:
import codecs
import csv
import json
import pprint

CITIES = 'data/cities.csv'

FIELDS = ["name", "timeZone_label", "utcOffset", "homepage", "governmentType_label", "isPartOf_label", 
          "areaCode", "populationTotal", "elevation", "maximumElevation", "minimumElevation", "populationDensity", 
          "wgs84_pos#lat", "wgs84_pos#long", "areaLand", "areaMetro", "areaUrban"]

In [8]:
with open(CITIES,'r') as infile:
    soup = csv.DictReader(infile)
    for i,row in enumerate(soup):
        if 'dbpedia' in row['URI']:
            print i,'-->',[{key:row[key]} for key in FIELDS],'\n'

3 --> [{'name': 'Kud'}, {'timeZone_label': 'Indian Standard Time'}, {'utcOffset': '+5:30'}, {'homepage': 'NULL'}, {'governmentType_label': 'NULL'}, {'isPartOf_label': '{Jammu and Kashmir|Udhampur district}'}, {'areaCode': 'NULL'}, {'populationTotal': '1140'}, {'elevation': '1855.0'}, {'maximumElevation': 'NULL'}, {'minimumElevation': 'NULL'}, {'populationDensity': 'NULL'}, {'wgs84_pos#lat': '33.08'}, {'wgs84_pos#long': '75.28'}, {'areaLand': 'NULL'}, {'areaMetro': 'NULL'}, {'areaUrban': 'NULL'}] 

4 --> [{'name': 'Kuju'}, {'timeZone_label': 'Indian Standard Time'}, {'utcOffset': '+5:30'}, {'homepage': 'NULL'}, {'governmentType_label': 'NULL'}, {'isPartOf_label': '{Jharkhand|Ramgarh district}'}, {'areaCode': 'NULL'}, {'populationTotal': '18049'}, {'elevation': '426.0'}, {'maximumElevation': 'NULL'}, {'minimumElevation': 'NULL'}, {'populationDensity': 'NULL'}, {'wgs84_pos#lat': '23.72'}, {'wgs84_pos#long': '85.5'}, {'areaLand': 'NULL'}, {'areaMetro': 'NULL'}, {'areaUrban': 'NULL'}] 

5 -

In [23]:
data = []
with open(CITIES,'r') as infile:
    soup = csv.DictReader(infile)
    for row in soup:
        if 'dbpedia' in row['URI']:
            data.append({key:row[key] for key in FIELDS})

In [25]:
data[0]

{'areaCode': 'NULL',
 'areaLand': 'NULL',
 'areaMetro': 'NULL',
 'areaUrban': 'NULL',
 'elevation': '1855.0',
 'governmentType_label': 'NULL',
 'homepage': 'NULL',
 'isPartOf_label': '{Jammu and Kashmir|Udhampur district}',
 'maximumElevation': 'NULL',
 'minimumElevation': 'NULL',
 'name': 'Kud',
 'populationDensity': 'NULL',
 'populationTotal': '1140',
 'timeZone_label': 'Indian Standard Time',
 'utcOffset': '+5:30',
 'wgs84_pos#lat': '33.08',
 'wgs84_pos#long': '75.28'}

In [2]:
def isfloat(value):
    """Return True if value can be cast to float."""
    try:
        float(value)
        return True
    except ValueError:
        return False

def parse(value):
    """Return the type object that a raw CSV string should be audited as."""
    if value == 'NULL' or value == '':
        # Missing values are audited as NoneType
        typo = type(None)
    elif value.startswith('{'):
        # Values such as '{a|b}' hold several items
        typo = type([])
    elif value.isdigit():
        # Digits only; a signed value such as '-5' is not matched here
        typo = type(1)
    elif isfloat(value):
        typo = type(1.1)
    else:
        typo = type('string')
    return typo
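
Note that `isdigit()` only accepts unsigned digit strings, so a value such as '-5' is reported as a float even though int('-5') works. A possible variant that follows the "can be cast to int" rule literally is sketched below. `parse_by_cast` is a hypothetical name, and the outputs shown in this notebook were produced with `parse`, not with this variant:

```python
def parse_by_cast(value):
    """Classify a raw CSV string by trying the casts in the order the rules list them."""
    if value == 'NULL' or value == '':
        return type(None)
    if value.startswith('{'):
        return type([])
    try:
        int(value)
        return type(1)
    except ValueError:
        pass
    try:
        float(value)
        return type(1.1)
    except ValueError:
        return type('string')

samples = ['NULL', '{a|b}', '1140', '-5', '3.23e+07', '+5:30']
results = [parse_by_cast(v) for v in samples]
```

With this variant '-5' is classified as int, while '3.23e+07' is still a float and '+5:30' is still a str.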

In [78]:
for di in data:
    print parse(di['areaCode']),'\n'

<type 'NoneType'> 

<type 'NoneType'> 

<type 'NoneType'> 

<type 'NoneType'> 

<type 'str'> 

<type 'NoneType'> 

<type 'str'> 

<type 'int'> 

<type 'int'> 

<type 'int'> 

<type 'int'> 

<type 'int'> 

<type 'int'> 

<type 'int'> 

<type 'int'> 

<type 'int'> 

<type 'int'> 

<type 'int'> 

<type 'int'> 

<type 'int'> 

<type 'int'> 

<type 'int'> 

<type 'int'> 

<type 'int'> 

<type 'int'> 

<type 'int'> 

<type 'int'> 

<type 'int'> 

<type 'int'> 

<type 'int'> 

<type 'str'> 

<type 'str'> 

<type 'str'> 

<type 'int'> 

<type 'int'> 

<type 'int'> 

<type 'str'> 

<type 'int'> 

<type 'int'> 



Dictionary learning section

In [113]:
mydict = {}

In [114]:
mydict.update({'i1':['v1']})
print mydict

{'i1': ['v1']}


In [115]:
mydict.update({'i2':[45]})
print mydict

{'i1': ['v1'], 'i2': [45]}


The `update` method overwrites the previously stored value.

In [116]:
mydict.update({'i1':['v1.2']})
print mydict

{'i1': ['v1.2'], 'i2': [45]}


To keep the earlier values, we need to append to the existing list instead of calling `update`.

In [117]:
mydict['i1'].append('v2')
print mydict

{'i1': ['v1.2', 'v2'], 'i2': [45]}


In [118]:
mydict['i2'].append(56)
print mydict

{'i1': ['v1.2', 'v2'], 'i2': [45, 56]}
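
The create-or-append steps can also be folded into a single call. A small sketch using `dict.setdefault`, a standard dictionary method; the dictionary name `collected` is only for this example:

```python
collected = {}

# setdefault returns the existing list for the key, or stores and returns the default.
collected.setdefault('i1', []).append('v1')
collected.setdefault('i1', []).append('v2')
collected.setdefault('i2', []).append(45)

# collected is now {'i1': ['v1', 'v2'], 'i2': [45]}
```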


In [122]:
try:
    mydict['i3']
    print 1
except KeyError:
    # 'i3' was never added, so the lookup fails
    print 0

0
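
The same check can be written without exceptions. A sketch using the `in` operator and `dict.get`, both standard; `lookup` is a throwaway name for this example:

```python
lookup = {'i1': ['v1.2', 'v2'], 'i2': [45, 56]}

has_i3 = 'i3' in lookup          # False: the key was never added
has_i1 = 'i1' in lookup          # True

# get() returns a default instead of raising KeyError for a missing key.
missing = lookup.get('i3', [])   # []
```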


And then the same pattern is applied to every field of every row:

In [3]:
fieldtypes = {}
with open(CITIES,'r') as infile:
    soup = csv.DictReader(infile)
    for row in soup:
        if 'dbpedia' in row['URI']:
            for key in FIELDS:
                try:
                    # Key already seen: append the new type to its list
                    fieldtypes[key].append(parse(row[key]))
                except KeyError:
                    # First time this key is seen: start its list
                    fieldtypes[key] = [parse(row[key])]

In [4]:
fieldtypes['areaLand']

[NoneType,
 NoneType,
 NoneType,
 NoneType,
 NoneType,
 NoneType,
 list,
 list,
 list,
 list,
 list,
 list,
 list,
 list,
 list,
 list,
 float,
 float,
 float,
 list,
 list,
 list,
 list,
 list,
 list,
 list,
 list,
 list,
 list,
 list,
 list,
 list,
 list,
 list,
 list,
 NoneType,
 NoneType,
 list,
 list]

In [5]:
for key in fieldtypes.keys():
    fieldtypes[key] = set(fieldtypes[key])
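
Collecting into sets from the start would avoid the separate conversion step. A sketch with `collections.defaultdict`; the `observed` pairs are invented stand-ins for (field, type) results, not real audit output:

```python
from collections import defaultdict

# Hypothetical (field, type) observations standing in for parse() results.
observed = [('areaLand', type(None)), ('areaLand', type(1.1)),
            ('areaLand', type(1.1)), ('areaMetro', type(None))]

types_by_field = defaultdict(set)
for field, found in observed:
    # add() ignores duplicates, so each type is stored once per field.
    types_by_field[field].add(found)

summary = dict(types_by_field)
```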

In [6]:
fieldtypes

{'areaCode': {int, NoneType, str},
 'areaLand': {float, NoneType, list},
 'areaMetro': {float, NoneType},
 'areaUrban': {float, NoneType},
 'elevation': {float, NoneType, list},
 'governmentType_label': {NoneType, str},
 'homepage': {NoneType, str},
 'isPartOf_label': {NoneType, str, list},
 'maximumElevation': {NoneType},
 'minimumElevation': {NoneType},
 'name': {NoneType, str, list},
 'populationDensity': {float, NoneType, list},
 'populationTotal': {int, NoneType},
 'timeZone_label': {NoneType, str},
 'utcOffset': {float, NoneType, str, list},
 'wgs84_pos#lat': {float},
 'wgs84_pos#long': {float}}
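
One way to read this result is to list the fields that mix more than one non-null type, since those are plausible candidates for cleaning. A sketch on a hand-copied subset of the dictionary above; the names `subset` and `mixed` are only for this example:

```python
NoneType = type(None)

# Subset of the audit result above, typed in by hand for this example.
subset = {'areaCode': {int, NoneType, str},
          'areaMetro': {float, NoneType},
          'populationTotal': {int, NoneType},
          'utcOffset': {float, NoneType, str, list}}

# Keep the fields that hold more than one type once NoneType is set aside.
mixed = sorted(field for field, types in subset.items()
               if len(types - {NoneType}) > 1)
# mixed == ['areaCode', 'utcOffset']
```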

In [77]:
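def audit_file(filename, fields):
    fieldtypes = {field: set() for field in fields}  # field name -> set of types seen
    with open(filename, 'r') as infile:
        for row in csv.DictReader(infile):
            if 'dbpedia' in row['URI']:  # same row filter as the cells above
                for field in fields:
                    fieldtypes[field].add(parse(row[field]))
    return fieldtypes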

In [None]:
def test():
    fieldtypes = audit_file(CITIES, FIELDS)

    pprint.pprint(fieldtypes)

    assert fieldtypes["areaLand"] == set([type(1.1), type([]), type(None)])
    assert fieldtypes['areaMetro'] == set([type(1.1), type(None)])
    
if __name__ == "__main__":
    test()