# Data Normalization

The assignment:
>In this assignment, you must take a file from the Nebraska Department of Environmental Quality and make it useful. I want to know how many leaking underground storage tanks there are in each city in Nebraska.  
>  
>To do this, you will need to:  
>
>1. Get the file from the DEQ. The file you want is called spillfac.csv, but keep this page handy because it has some filter conditions you're going to need.
>2. The file that comes from the state is not UTF-8. Follow the walkthrough. Use Excel and csvkit to zap the non-UTF-8 characters.
>3. Normalize the data using OpenRefine. Specifically, the fields you need to normalize are the owner company -- OWNCO -- and the city the tank is in -- SPCITY.
>4. Export your newly cleaned data into a new csv file.
>5. Import your newly cleaned up data into Agate.
>6. Filter out any leaking underground storage tanks that aren't leaking. (See the documentation from where you downloaded the file.)
>7. Group it by the OWNCO and count them.
>8. Sort it.
>9. Print the top 20 to the screen.
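
The assignment handles the encoding problem with Excel and csvkit. As a rough alternative sketch in plain Python (not the walkthrough's method), decoding with `errors="ignore"` drops any bytes that are not valid UTF-8. The sample bytes and the function name `zap_non_utf8` are made up for illustration.

```python
def zap_non_utf8(raw_bytes):
    """Decode bytes as UTF-8, silently dropping any invalid byte sequences."""
    return raw_bytes.decode("utf-8", errors="ignore")


# Made-up sample, NOT real DEQ data: the byte 0xE9 is not valid UTF-8 here.
raw = b"OWNCO,SPCITY\nCAF\xe9 FUEL,OMAHA\n"
cleaned_text = zap_non_utf8(raw)

# The cleaned text now round-trips through UTF-8 without errors.
cleaned_bytes = cleaned_text.encode("utf-8")
print(cleaned_text)
```

Dropping bytes loses the original character, so this only suits cases where the stray characters are noise rather than meaningful data.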


Import agate.

In [1]:
import agate

Create a table named `lust` from the cleaned CSV file.

In [2]:
lust = agate.Table.from_csv('filteredlustfinal.csv')

Print it to see column headers.

In [3]:
print(lust)

|--------------------------------------+------------|
|  column                              | data_type  |
|--------------------------------------+------------|
|  SPILLNO-------                      | Text       |
|  S                                   | Text       |
|  OWNCO--------------------           | Text       |
|  OWNSTREET-----------                | Text       |
|  OWNCITY-------------                | Text       |
|  OS                                  | Text       |
|  OZIP                                | Text       |
|  TY                                  | Number     |
|  DIDATE----                          | Date       |
|  SPLOC     ------------------------- | Text       |
|  SPCITY-------------------           | Text       |
|  SPCOUN-------------------           | Text       |
|  MATERIAL----------------------      | Text       |
|  SFM_ID--                            | Text       |
|  FAC_NAME-----------------           | Text       |
|--------------------------------------+------------|
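
Step 6 of the assignment asks for records that are not leaking to be dropped. A minimal sketch of that kind of row filter in plain Python follows. It assumes the one-letter `S` column holds the status, and the code `"L"` is a placeholder, not the DEQ's actual value; the real filter conditions are on the DEQ download page. The sample rows are made up.

```python
import csv
import io

# Made-up sample rows, NOT real DEQ data.
SAMPLE = """SPILLNO,S,OWNCO,SPCITY
001,L,ACME OIL,LINCOLN
002,C,ACME OIL,OMAHA
003,L,UNKNOWN,LINCOLN
"""

# Hypothetical status code; replace with the value given in the DEQ documentation.
LEAKING_CODE = "L"


def keep_leaking(rows, status_column="S", code=LEAKING_CODE):
    """Return only the rows whose status column equals the given code."""
    return [row for row in rows if row[status_column].strip() == code]


rows = list(csv.DictReader(io.StringIO(SAMPLE)))
leaking = keep_leaking(rows)
print(len(leaking), "of", len(rows), "sample rows kept")
```

In agate, the same idea would be expressed with `Table.where`, which takes a function that returns `True` for the rows to keep.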

Create a new table with records grouped by OWNCO.

In [4]:
by_OWNCO = lust.group_by('OWNCO--------------------')
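
The column names in this file carry trailing dash padding, which is why `group_by` needs the full `'OWNCO--------------------'` string. One way to tidy them, sketched here in plain Python, is to strip the dashes and any leftover whitespace. The helper name `clean_header` is made up for illustration.

```python
def clean_header(name):
    """Drop trailing dash padding, then any surrounding whitespace."""
    return name.rstrip("-").strip()


# A few of the headers printed above, copied as they appear in the table.
raw_headers = [
    "SPILLNO-------",
    "OWNCO--------------------",
    "SPLOC     -------------------------",
    "TY",
]

# Map each original header to its cleaned form.
cleaned = {raw: clean_header(raw) for raw in raw_headers}
for raw, new in cleaned.items():
    print(f"{raw!r} -> {new!r}")
```

A mapping like `cleaned` could then be passed to agate's `Table.rename(column_names=...)` to get a table with shorter column names, after which `group_by('OWNCO')` would work.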

Here we count how many records each OWNCO has.

In [5]:
totals = by_OWNCO.aggregate([
        ('count', agate.Count())
    ])

Then we sort by that count, largest first.

In [6]:
sorted_OWNCO = totals.order_by('count', reverse=True)
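
To make the group, count and sort steps concrete, here is the same logic sketched in plain Python with `collections.Counter`. This is an illustration of what the agate calls compute, not how agate implements them, and the owner names are a made-up sample rather than the DEQ data.

```python
from collections import Counter

# Made-up sample of owner names, one entry per record.
owners = ["UNKNOWN", "BNSF", "UNKNOWN", "UPRR", "BNSF", "UNKNOWN"]

# Group identical names and count them (group_by + Count).
counts = Counter(owners)

# Sort by the count, largest first (order_by with reverse=True).
ranked = sorted(counts.items(), key=lambda pair: pair[1], reverse=True)

# Show the top rows (print_table with max_rows).
for owner, count in ranked[:20]:
    print(f"{owner:<10} {count:>3}")
```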

And we print the top 20 rows.

In [7]:
sorted_OWNCO.print_table(max_rows=20)

|-----------------------+--------|
|  OWNCO------------... | count  |
|-----------------------+--------|
|  UNKNOWN              |   277  |
|  BNSF                 |    28  |
|  BENSON 66 SERVICE... |    13  |
|  BOSSELMAN INC        |    10  |
|  PILOT/FLYING J       |     9  |
|  UPRR                 |     9  |
|  SAPP BROS INC        |     8  |
|  WHITEHEAD OIL CO     |     7  |
|  OFFUTT A F B         |     6  |
|  ARMY CORPS OF ENG... |     5  |
|  CITY OF OMAHA        |     5  |
|  CONTEMPORARY INDU... |     5  |
|  RITEWAY OIL & GAS... |     5  |
|  AG VALLEY COOP       |     4  |
|  BUCKYS INC           |     4  |
|  CASEYS GENERAL STORE |     4  |
|  KWIK STOP            |     4  |
|  NEBR DEPT OF ROADS   |     4  |
|  SINCLAIR OIL         |     4  |
|  CENTERGAS INC        |     3  |
|  ...                  |   ...  |
|-----------------------+--------|
