# Cleaning Data Sets - TRI Data Command Line Tool

There is a lot of info in these data sets! Quite a few of these columns will not be of help to our research, and we want to handle our NaN values and similar issues up-front, so let's pare them down a bit!

In [1]:
%pylab inline
import pandas as pd

Populating the interactive namespace from numpy and matplotlib


In [2]:
# TRI national data for 2012 - Using for examples in this notebook
# Functions defined here for trimming and cleaning data sets will be copied into command-line callable functions to 
# apply to all relevant TRI data sets.

us2012_initial_df = pd.read_csv('./data/usdata/TRI_2012_US.csv')



In [3]:
# All available columns
for column in us2012_initial_df.columns:
    print(column)

YEAR
TRI_FACILITY_ID
FRS_ID
FACILITY_NAME
STREET_ADDRESS
CITY
COUNTY
ST
ZIP
BIA_CODE
TRIBE
LATITUDE
LONGITUDE
FEDERAL_FACILITY
INDUSTRY_SECTOR_CODE
INDUSTRY_SECTOR
PRIMARY_SIC
SIC_2
SIC_3
SIC_4
SIC_5
SIC_6
PRIMARY_NAICS
NAICS_2
NAICS_3
NAICS_4
NAICS_5
NAICS_6
DOC_CTRL_NUM
CHEMICAL
CAS_#/COMPOUND_ID
SRS_ID
CLEAR_AIR_ACT_CHEMICAL
CLASSIFICATION
METAL
METAL_CATEGORY
CARCINOGEN
FORM_TYPE
UNIT_OF_MEASURE
5.1_FUGITIVE_AIR
5.2_STACK_AIR
5.3_WATER
5.4_UNDERGROUND
5.4.1_UNDERGROUND_CLASS_I
5.4.2_UNDERGROUND_CLASS_II-V
5.5.1_LANDFILLS
5.5.1A_RCRA_C_LANDFILLS
5.5.1B_OTHER_LANDFILLS
5.5.2_LAND_TREATMENT
5.5.3_SURFACE_IMPOUNDMENT
5.5.3A_RCRA_C_SURFACE_IMP.
5.5.3B_Other_SURFACE_IMP.
5.5.4_OTHER_DISPOSAL
ON-SITE_RELEASE_TOTAL
6.1_POTW-TRANSFERS_FOR_RELEASE
6.1_POTW-TRANSFERS_FOR_TREATM.
6.1_POTW-TOTAL_TRANSFERS
6.2_M10
6.2_M41
6.2_M62
6.2_M71
6.2_M81
6.2_M82
6.2_M72
6.2_M63
6.2_M66
6.2_M67
6.2_M64
6.2_M65
6.2_M73
6.2_M79
6.2_M90
6.2_M94
6.2_M99
OFF-SITE_RELEASE_TOTAL
6.2_M20
6.2_M24
6.2_M26
6.2_M28
6

# Which Attributes To Keep? - Data Dictionary for TRI Data
***

In the TRI data, there are several columns that deal with numerical/proprietary facility ID numbers, government codes, and things like street address. There are also a bunch of unnamed columns that need to be taken out.

For our use, looking at release amounts across industry sectors and facilities, the following columns seem most relevant, grouped by category (<b>Note: '-->' means that we're renaming that attribute to the name listed on the right side of the arrow.</b>):

##### YEAR

<u><h2> Facility Information </h2></u>
##### FACILITY_NAME
##### FEDERAL_FACILITY: 
    Indicates whether the facility is a federal facility (YES/NO).
##### PARENT_COMPANY_NAME: 
    Name of facility parent company.
##### INDUSTRY_SECTOR: 
    Name (string) of the industry sector type.
<u><h2>Location Information</h2></u>
##### ZIP
##### ST --> STATE:
    Facility state.
##### CITY
##### COUNTY
##### LATITUDE
##### LONGITUDE

<u><h2>Chemical Information</h2></u>
##### CHEMICAL: 
    Chemical for that record. A single facility can report several chemicals and will thus appear in the data multiple times, once for each chemical it reports.
##### UNIT_OF_MEASURE: 
    Unit in which CHEMICAL was reported (e.g. 'pounds')
##### CARCINOGEN:
    Indicates if CHEMICAL is classified as a carcinogen.
##### CLEAR_AIR_ACT_CHEMICAL --> CAA_CHEMICAL: 
    Indicates if CHEMICAL is classified as a CAA chemical. CAA chemicals are classified as "hazardous air pollutants" by the U.S. government. The column is listed as 'CLEAR' Air Act, which is actually a typo on the part of the EPA(!), so we will rename it to 'CAA_CHEMICAL'.
<u><h2>Release Information</h2></u>
##### TOTAL_RELEASES:
    Sum of ON-SITE_RELEASE_TOTAL and OFF-SITE_RELEASE_TOTAL.
##### ON-SITE_RELEASE_TOTAL --> ON_SITE_RELEASE_TOTAL (changing the initial dash to an underscore to match the format of the other attributes)
##### OFF-SITE_RELEASE_TOTAL --> OFF_SITE_RELEASE_TOTAL
##### OFF-SITE_RECYCLED_TOTAL --> OFF_SITE_RECYCLED_TOTAL
##### 8.4_RECYCLING_ON-SITE --> ON_SITE_RECYCLED_TOTAL:
    Total on-site recycling. We rename this attribute to 'ON_SITE_RECYCLED_TOTAL' to match OFF_SITE_RECYCLED_TOTAL.
##### 8.8_ONE-TIME_RELEASE --> ONE_TIME_RELEASES  :
    An interesting stat! Facilities report this number as the total amount of accidental or non-recurring releases. This may help us identify facilities that regularly have accidental releases in their areas, and further, compare parent companies and evaluate their ability to minimize on-site accidents across facilities. We will rename this to ONE_TIME_RELEASES.
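
As a quick sanity check on the release columns above, the sketch below tests whether TOTAL_RELEASES matches the sum of the on-site and off-site totals. The helper name `check_total_releases` and the tolerance are assumptions made for illustration; the two sample rows reuse the release values of the first two records in the 2012 output shown further down.

```python
import numpy as np
import pandas as pd

def check_total_releases(tri_df, tol=1e-6):
    """Return a boolean Series: True where TOTAL_RELEASES equals on-site + off-site.

    Hypothetical helper; it expects the original (not yet renamed) TRI column names.
    Rows with NaN in any of the three columns come back as False.
    """
    expected = tri_df['ON-SITE_RELEASE_TOTAL'] + tri_df['OFF-SITE_RELEASE_TOTAL']
    return pd.Series(np.isclose(tri_df['TOTAL_RELEASES'], expected, atol=tol),
                     index=tri_df.index)

# Release values taken from the first two rows of the 2012 output
sample = pd.DataFrame({
    'TOTAL_RELEASES': [19623.6, 95.23],
    'ON-SITE_RELEASE_TOTAL': [19604.5, 1.02],
    'OFF-SITE_RELEASE_TOTAL': [19.10, 94.21],
})
matches = check_total_releases(sample)
```

If any entry of `matches` were False on the full data set, that would be a sign the total is defined differently than assumed here.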

In [4]:
# Function to capture/rename the attributes listed above from a TRI DataFrame
# Some input validation could be done here... but we're all going to agree to only use valid inputs!!
def trim_tri_df(tri_df):
    # list of the categories we want from each TRI data set
    desired_categories = ['YEAR', 'FACILITY_NAME', 'FEDERAL_FACILITY', 'PARENT_COMPANY_NAME',
                          'INDUSTRY_SECTOR', 'ZIP', 'ST', 'CITY', 'COUNTY', 'LATITUDE',
                          'LONGITUDE', 'CHEMICAL', 'UNIT_OF_MEASURE', 'CARCINOGEN',
                          'CLEAR_AIR_ACT_CHEMICAL', 'TOTAL_RELEASES', 'ON-SITE_RELEASE_TOTAL',
                          'OFF-SITE_RELEASE_TOTAL', 'OFF-SITE_RECYCLED_TOTAL',
                          '8.4_RECYCLING_ON-SITE', '8.8_ONE-TIME_RELEASE']

    # select the columns, then rename as described above; rename() without inplace=True
    # returns a new DataFrame, which avoids SettingWithCopyWarning on the column slice
    return tri_df[desired_categories].rename(columns={
        'ST': 'STATE',
        'CLEAR_AIR_ACT_CHEMICAL': 'CAA_CHEMICAL',
        'ON-SITE_RELEASE_TOTAL': 'ON_SITE_RELEASE_TOTAL',
        'OFF-SITE_RELEASE_TOTAL': 'OFF_SITE_RELEASE_TOTAL',
        '8.4_RECYCLING_ON-SITE': 'ON_SITE_RECYCLED_TOTAL',
        'OFF-SITE_RECYCLED_TOTAL': 'OFF_SITE_RECYCLED_TOTAL',
        '8.8_ONE-TIME_RELEASE': 'ONE_TIME_RELEASES'})

In [5]:
# Testing trim_tri_df function to verify dataframe output
us2012_trimmed = trim_tri_df(us2012_initial_df)
us2012_trimmed


Unnamed: 0,YEAR,FACILITY_NAME,FEDERAL_FACILITY,PARENT_COMPANY_NAME,INDUSTRY_SECTOR,ZIP,STATE,CITY,COUNTY,LATITUDE,...,CHEMICAL,UNIT_OF_MEASURE,CARCINOGEN,CAA_CHEMICAL,TOTAL_RELEASES,ON_SITE_RELEASE_TOTAL,OFF_SITE_RELEASE_TOTAL,OFF_SITE_RECYCLED_TOTAL,ON_SITE_RECYCLED_TOTAL,ONE_TIME_RELEASES
0,2012,FLINT HILLS RESOURCES PINE BEND LLC,NO,KOCH INDUSTRIES INC,Petroleum,55068,MN,ROSEMOUNT,DAKOTA,44.768400,...,TOLUENE,Pounds,NO,YES,1.962360e+04,1.960450e+04,19.10,2.000000e-01,1700.000,
1,2012,GEORGIA BIOMASS LLC WAYCROSS FACILITY,NO,GEORGIA BIOMASS LLC,Wood Products,31503,GA,WAYCROSS,WARE,31.256800,...,LEAD COMPOUNDS,Pounds,NO,YES,9.523000e+01,1.020000e+00,94.21,0.000000e+00,0.000,
2,2012,TECHNICAL PRODUCTS INC,NO,,Chemical Wholesalers,44102,OH,CLEVELAND,CUYAHOGA,41.461430,...,ETHYLENE GLYCOL,Pounds,NO,YES,0.000000e+00,0.000000e+00,0.00,0.000000e+00,0.000,0.0
3,2012,INGREDION INC ARGO PLANT,NO,INGREDION INC,Food,60501,IL,BEDFORD PARK,COOK,41.778470,...,AMMONIA,Pounds,NO,NO,2.425000e+03,2.425000e+03,0.00,0.000000e+00,5200.000,
4,2012,COLEMAN CABLE - TEXARKANA FACILITY,NO,COLEMAN CABLE INC,Electrical Equipment,71854,AR,TEXARKANA,MILLER,33.433670,...,LEAD,Pounds,YES,YES,0.000000e+00,0.000000e+00,0.00,0.000000e+00,0.000,
5,2012,SOUTH CAROLINA ELECTRIC & GAS CO COPE STATION,NO,SCANA CORP,Electric Utilities,29038,SC,COPE,ORANGEBURG,33.365120,...,MANGANESE COMPOUNDS,Pounds,NO,YES,2.954800e+04,2.954800e+04,0.00,0.000000e+00,0.000,
6,2012,FUTURE FINISHES INC,NO,,Fabricated Metals,45015,OH,HAMILTON,BUTLER,39.337330,...,CYANIDE COMPOUNDS,Pounds,NO,YES,1.288000e+01,1.288000e+01,0.00,1.850000e+01,0.000,
7,2012,SHEAROUSE LUMBER CO,NO,,Wood Products,31322,GA,POOLER,CHATHAM,32.119993,...,COPPER COMPOUNDS,Pounds,NO,NO,0.000000e+00,0.000000e+00,0.00,0.000000e+00,0.000,0.0
8,2012,VALERO CHARLES CITY PLANT,NO,VALERO ENERGY CORP,Chemicals,50616,IA,CHARLES CITY,FLOYD,43.111110,...,TOLUENE,Pounds,NO,YES,4.400000e+01,4.400000e+01,0.00,0.000000e+00,0.000,
9,2012,OXY VINYLS LP DEER PARK-VCM PLANT,NO,OCCIDENTAL CHEMICAL HOLDING CORP,Chemicals,77536,TX,DEER PARK,HARRIS,29.710967,...,CHLOROBENZENE,Pounds,NO,YES,2.990000e+00,2.990000e+00,0.00,0.000000e+00,0.000,


# Handling Missing Values
***

Now that the TRI file is trimmed and includes only the desired attributes, we want to handle any possible NaN or undesired values. Converting these values now will save us from potential headaches when using the TRI CSV files for analysis.
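
Before filling anything, it can help to see where the NaN values actually are. The following is a minimal sketch, assuming a hypothetical helper `count_missing`; the three-row frame is toy data with TRI-style column names, not values from the real file.

```python
import numpy as np
import pandas as pd

def count_missing(tri_df):
    """Return the NaN count for every column that has at least one missing value."""
    counts = tri_df.isnull().sum()
    return counts[counts > 0].sort_values(ascending=False)

# Toy data for illustration only
toy = pd.DataFrame({
    'FACILITY_NAME': ['A', 'B', 'C'],
    'PARENT_COMPANY_NAME': ['X', None, 'Y'],
    'ONE_TIME_RELEASES': [np.nan, np.nan, 0.0],
})
missing = count_missing(toy)
```

Here `missing` lists ONE_TIME_RELEASES (two NaN values) ahead of PARENT_COMPANY_NAME (one), and omits FACILITY_NAME, which has none.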

In [6]:
# Function to handle missing values in our trimmed TRI data frame
# Some input validation could also be done here... but this will be chained with the previous trim_tri_df() function
# in the source for the actual utility program... so again we'll just agree to not give bad inputs to this function!
# The output and input to this function is very specific to this particular data set and project, of course.
def handle_nan_tri_df(tri_df):
    # Filling NaN categories for attributes that may need to be summed across
    fill_zero_categories = ['TOTAL_RELEASES', 'ON_SITE_RELEASE_TOTAL', 'OFF_SITE_RELEASE_TOTAL',
                            'ON_SITE_RECYCLED_TOTAL', 'OFF_SITE_RECYCLED_TOTAL', 'ONE_TIME_RELEASES']

    # work on a copy and assign the filled columns back in one step; calling
    # fillna(inplace=True) on a selected column is what raises SettingWithCopyWarning
    tri_df = tri_df.copy()
    tri_df[fill_zero_categories] = tri_df[fill_zero_categories].fillna(0)

    #return now-cleaned (after running through trim_tri_df) data frame
    return tri_df

In [7]:
# Testing handle_nan_tri_df function to verify output
us2012_clean = handle_nan_tri_df(us2012_trimmed)
us2012_clean


Unnamed: 0,YEAR,FACILITY_NAME,FEDERAL_FACILITY,PARENT_COMPANY_NAME,INDUSTRY_SECTOR,ZIP,STATE,CITY,COUNTY,LATITUDE,...,CHEMICAL,UNIT_OF_MEASURE,CARCINOGEN,CAA_CHEMICAL,TOTAL_RELEASES,ON_SITE_RELEASE_TOTAL,OFF_SITE_RELEASE_TOTAL,OFF_SITE_RECYCLED_TOTAL,ON_SITE_RECYCLED_TOTAL,ONE_TIME_RELEASES
0,2012,FLINT HILLS RESOURCES PINE BEND LLC,NO,KOCH INDUSTRIES INC,Petroleum,55068,MN,ROSEMOUNT,DAKOTA,44.768400,...,TOLUENE,Pounds,NO,YES,1.962360e+04,1.960450e+04,19.10,2.000000e-01,1700.000,0.0
1,2012,GEORGIA BIOMASS LLC WAYCROSS FACILITY,NO,GEORGIA BIOMASS LLC,Wood Products,31503,GA,WAYCROSS,WARE,31.256800,...,LEAD COMPOUNDS,Pounds,NO,YES,9.523000e+01,1.020000e+00,94.21,0.000000e+00,0.000,0.0
2,2012,TECHNICAL PRODUCTS INC,NO,,Chemical Wholesalers,44102,OH,CLEVELAND,CUYAHOGA,41.461430,...,ETHYLENE GLYCOL,Pounds,NO,YES,0.000000e+00,0.000000e+00,0.00,0.000000e+00,0.000,0.0
3,2012,INGREDION INC ARGO PLANT,NO,INGREDION INC,Food,60501,IL,BEDFORD PARK,COOK,41.778470,...,AMMONIA,Pounds,NO,NO,2.425000e+03,2.425000e+03,0.00,0.000000e+00,5200.000,0.0
4,2012,COLEMAN CABLE - TEXARKANA FACILITY,NO,COLEMAN CABLE INC,Electrical Equipment,71854,AR,TEXARKANA,MILLER,33.433670,...,LEAD,Pounds,YES,YES,0.000000e+00,0.000000e+00,0.00,0.000000e+00,0.000,0.0
5,2012,SOUTH CAROLINA ELECTRIC & GAS CO COPE STATION,NO,SCANA CORP,Electric Utilities,29038,SC,COPE,ORANGEBURG,33.365120,...,MANGANESE COMPOUNDS,Pounds,NO,YES,2.954800e+04,2.954800e+04,0.00,0.000000e+00,0.000,0.0
6,2012,FUTURE FINISHES INC,NO,,Fabricated Metals,45015,OH,HAMILTON,BUTLER,39.337330,...,CYANIDE COMPOUNDS,Pounds,NO,YES,1.288000e+01,1.288000e+01,0.00,1.850000e+01,0.000,0.0
7,2012,SHEAROUSE LUMBER CO,NO,,Wood Products,31322,GA,POOLER,CHATHAM,32.119993,...,COPPER COMPOUNDS,Pounds,NO,NO,0.000000e+00,0.000000e+00,0.00,0.000000e+00,0.000,0.0
8,2012,VALERO CHARLES CITY PLANT,NO,VALERO ENERGY CORP,Chemicals,50616,IA,CHARLES CITY,FLOYD,43.111110,...,TOLUENE,Pounds,NO,YES,4.400000e+01,4.400000e+01,0.00,0.000000e+00,0.000,0.0
9,2012,OXY VINYLS LP DEER PARK-VCM PLANT,NO,OCCIDENTAL CHEMICAL HOLDING CORP,Chemicals,77536,TX,DEER PARK,HARRIS,29.710967,...,CHLOROBENZENE,Pounds,NO,YES,2.990000e+00,2.990000e+00,0.00,0.000000e+00,0.000,0.0


### Note:
There are still quite a few categories that may contain NaN values, but these won't be summed across for statistical analysis, and so should be dealt with or dropped as needed when we actually use them.
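
As one example of dealing with a remaining NaN column at the point of use, the sketch below drops records without a parent company before grouping by it. The helper name `rows_with_known_parent` is hypothetical; the two sample rows reuse values from the 2012 output above.

```python
import pandas as pd

def rows_with_known_parent(tri_df):
    """Return a new frame without the rows whose PARENT_COMPANY_NAME is NaN."""
    return tri_df.dropna(subset=['PARENT_COMPANY_NAME'])

# Two records from the 2012 output: one with a parent company, one without
sample = pd.DataFrame({
    'FACILITY_NAME': ['FLINT HILLS RESOURCES PINE BEND LLC', 'TECHNICAL PRODUCTS INC'],
    'PARENT_COMPANY_NAME': ['KOCH INDUSTRIES INC', None],
    'TOTAL_RELEASES': [19623.6, 0.0],
})
known = rows_with_known_parent(sample)
by_parent = known.groupby('PARENT_COMPANY_NAME')['TOTAL_RELEASES'].sum()
```

Whether to drop such rows or fill them with a placeholder label depends on the analysis; dropping is just the simplest option to show.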

# Outputting DataFrame to CSV
***

Now that we've kept only the columns that we want, and dealt with potentially problematic NaN values, we are ready to write the DataFrame back to CSV for further use. Again, the functions here will be put into a single Python script that takes a TRI data file as its argument and writes a cleaned version of that CSV to a `clean_TRI_Data/` folder under the current working directory.

In [8]:
# Function to write the DataFrame to csv. For the purposes of this project, we will call the folder clean_TRI_Data/
# and the files will be in the format 'TRI_<YEAR>_US_CLEAN.csv'
import os

def write_cleaned_tri_to_csv(tri_df):
    directory = './clean_TRI_Data'
    # create the target directory if it doesn't already exist
    if not os.path.exists(directory):
        os.makedirs(directory)

    # name the file after the year in the first row (by position, whatever the index labels are)
    filename = 'TRI_%s_US_CLEAN.csv' % tri_df['YEAR'].iloc[0]
    tri_df.to_csv(os.path.join(directory, filename))

In [9]:
# Testing write_cleaned_tri_to_csv function to verify output
write_cleaned_tri_to_csv(us2012_clean)

### Before writing to local directory
![Before](./images/before_tri_clean.png)

### After writing to local directory
![After](./images/after_tri_clean.png)

It works! Now let's load in this newly created 'clean' CSV file to check that we're getting the correct output.

In [10]:
# Importing the newly created CSV file to verify contents
ostensibly_cleaned_us2012_df = pd.read_csv('./clean_TRI_Data/TRI_2012_US_CLEAN.csv')
ostensibly_cleaned_us2012_df

Unnamed: 0.1,Unnamed: 0,YEAR,FACILITY_NAME,FEDERAL_FACILITY,PARENT_COMPANY_NAME,INDUSTRY_SECTOR,ZIP,STATE,CITY,COUNTY,...,CHEMICAL,UNIT_OF_MEASURE,CARCINOGEN,CAA_CHEMICAL,TOTAL_RELEASES,ON_SITE_RELEASE_TOTAL,OFF_SITE_RELEASE_TOTAL,OFF_SITE_RECYCLED_TOTAL,ON_SITE_RECYCLED_TOTAL,ONE_TIME_RELEASES
0,0,2012,FLINT HILLS RESOURCES PINE BEND LLC,NO,KOCH INDUSTRIES INC,Petroleum,55068,MN,ROSEMOUNT,DAKOTA,...,TOLUENE,Pounds,NO,YES,1.962360e+04,1.960450e+04,19.10,2.000000e-01,1700.000,0.0
1,1,2012,GEORGIA BIOMASS LLC WAYCROSS FACILITY,NO,GEORGIA BIOMASS LLC,Wood Products,31503,GA,WAYCROSS,WARE,...,LEAD COMPOUNDS,Pounds,NO,YES,9.523000e+01,1.020000e+00,94.21,0.000000e+00,0.000,0.0
2,2,2012,TECHNICAL PRODUCTS INC,NO,,Chemical Wholesalers,44102,OH,CLEVELAND,CUYAHOGA,...,ETHYLENE GLYCOL,Pounds,NO,YES,0.000000e+00,0.000000e+00,0.00,0.000000e+00,0.000,0.0
3,3,2012,INGREDION INC ARGO PLANT,NO,INGREDION INC,Food,60501,IL,BEDFORD PARK,COOK,...,AMMONIA,Pounds,NO,NO,2.425000e+03,2.425000e+03,0.00,0.000000e+00,5200.000,0.0
4,4,2012,COLEMAN CABLE - TEXARKANA FACILITY,NO,COLEMAN CABLE INC,Electrical Equipment,71854,AR,TEXARKANA,MILLER,...,LEAD,Pounds,YES,YES,0.000000e+00,0.000000e+00,0.00,0.000000e+00,0.000,0.0
5,5,2012,SOUTH CAROLINA ELECTRIC & GAS CO COPE STATION,NO,SCANA CORP,Electric Utilities,29038,SC,COPE,ORANGEBURG,...,MANGANESE COMPOUNDS,Pounds,NO,YES,2.954800e+04,2.954800e+04,0.00,0.000000e+00,0.000,0.0
6,6,2012,FUTURE FINISHES INC,NO,,Fabricated Metals,45015,OH,HAMILTON,BUTLER,...,CYANIDE COMPOUNDS,Pounds,NO,YES,1.288000e+01,1.288000e+01,0.00,1.850000e+01,0.000,0.0
7,7,2012,SHEAROUSE LUMBER CO,NO,,Wood Products,31322,GA,POOLER,CHATHAM,...,COPPER COMPOUNDS,Pounds,NO,NO,0.000000e+00,0.000000e+00,0.00,0.000000e+00,0.000,0.0
8,8,2012,VALERO CHARLES CITY PLANT,NO,VALERO ENERGY CORP,Chemicals,50616,IA,CHARLES CITY,FLOYD,...,TOLUENE,Pounds,NO,YES,4.400000e+01,4.400000e+01,0.00,0.000000e+00,0.000,0.0
9,9,2012,OXY VINYLS LP DEER PARK-VCM PLANT,NO,OCCIDENTAL CHEMICAL HOLDING CORP,Chemicals,77536,TX,DEER PARK,HARRIS,...,CHLOROBENZENE,Pounds,NO,YES,2.990000e+00,2.990000e+00,0.00,0.000000e+00,0.000,0.0


Everything looks good! Now to take the functions we've defined in this notebook and create a command line script that will clean our TRI files.
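
One detail worth noting in the re-read frame above: it carries an extra `Unnamed: 0` column. That is what pandas produces when `to_csv` writes the row index and `read_csv` then reads it back as ordinary data. The sketch below shows the difference on a toy two-row frame (not the TRI data). Passing `index=False` is a suggested adjustment, not something the functions in this notebook currently do.

```python
import io
import pandas as pd

# Toy frame for illustration only
toy = pd.DataFrame({'YEAR': [2012, 2012], 'STATE': ['MN', 'GA']})

# Default behaviour: the row index is written as an unnamed first column
with_index = pd.read_csv(io.StringIO(toy.to_csv()))

# index=False leaves the row index out of the file
without_index = pd.read_csv(io.StringIO(toy.to_csv(index=False)))

columns_with_index = list(with_index.columns)        # includes 'Unnamed: 0'
columns_without_index = list(without_index.columns)  # only the real columns
```

An alternative that leaves the writer untouched would be to read the cleaned file with `pd.read_csv(path, index_col=0)`, which treats that first column as the index again.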

### Command Line Script
This script will also be uploaded as a .py file in the group repository under jacob/


```python
#clean_tri_csv.py
#importing utility modules
import os, sys
import pandas as pd

# Function to capture/rename the attributes listed above from a TRI DataFrame
# Some input validation could be done here... but we're all going to agree to only use valid inputs!!
def trim_tri_df(tri_df):
    # list of the categories we want from each TRI data set
    desired_categories = ['YEAR', 'FACILITY_NAME', 'FEDERAL_FACILITY', 'PARENT_COMPANY_NAME',
                          'INDUSTRY_SECTOR', 'ZIP', 'ST', 'CITY', 'COUNTY', 'LATITUDE',
                          'LONGITUDE', 'CHEMICAL', 'UNIT_OF_MEASURE', 'CARCINOGEN',
                          'CLEAR_AIR_ACT_CHEMICAL', 'TOTAL_RELEASES', 'ON-SITE_RELEASE_TOTAL',
                          'OFF-SITE_RELEASE_TOTAL', 'OFF-SITE_RECYCLED_TOTAL',
                          '8.4_RECYCLING_ON-SITE', '8.8_ONE-TIME_RELEASE']

    # select the columns, then rename as described above; rename() without inplace=True
    # returns a new DataFrame, which avoids SettingWithCopyWarning on the column slice
    return tri_df[desired_categories].rename(columns={
        'ST': 'STATE',
        'CLEAR_AIR_ACT_CHEMICAL': 'CAA_CHEMICAL',
        'ON-SITE_RELEASE_TOTAL': 'ON_SITE_RELEASE_TOTAL',
        'OFF-SITE_RELEASE_TOTAL': 'OFF_SITE_RELEASE_TOTAL',
        '8.4_RECYCLING_ON-SITE': 'ON_SITE_RECYCLED_TOTAL',
        'OFF-SITE_RECYCLED_TOTAL': 'OFF_SITE_RECYCLED_TOTAL',
        '8.8_ONE-TIME_RELEASE': 'ONE_TIME_RELEASES'})

# Function to handle missing values in our trimmed TRI data frame
# Some input validation could also be done here... but this will be chained with the previous trim_tri_df() function
# in the source for the actual utility program... so again we'll just agree to not give bad inputs to this function!
# The output and input to this function is very specific to this particular data set and project, of course.
def handle_nan_tri_df(tri_df):
    # Filling NaN categories for attributes that may need to be summed across
    fill_zero_categories = ['TOTAL_RELEASES', 'ON_SITE_RELEASE_TOTAL', 'OFF_SITE_RELEASE_TOTAL',
                            'ON_SITE_RECYCLED_TOTAL', 'OFF_SITE_RECYCLED_TOTAL', 'ONE_TIME_RELEASES']

    # work on a copy and assign the filled columns back in one step; calling
    # fillna(inplace=True) on a selected column is what raises SettingWithCopyWarning
    tri_df = tri_df.copy()
    tri_df[fill_zero_categories] = tri_df[fill_zero_categories].fillna(0)

    #return now-cleaned (after running through trim_tri_df) data frame
    return tri_df

# Function to write the DataFrame to csv. For the purposes of this project, we will call the folder clean_TRI_Data/
# and the files will be in the format 'TRI_<YEAR>_US_CLEAN.csv'
def write_cleaned_tri_to_csv(tri_df):
    #change this if you want files written to a different directory
    directory = './clean_TRI_Data'
    #create target directory if it doesn't already exist
    if not os.path.exists(directory):
        os.makedirs(directory)

    #name the file after the year in the first row (by position, whatever the index labels are)
    filename = 'TRI_%s_US_CLEAN.csv' % tri_df['YEAR'].iloc[0]
    tri_df.to_csv(os.path.join(directory, filename))

if len(sys.argv) == 2:
    print("""
        This script is particularly for TRI US data sets. Read documentation at
        <insert GitHub URL here> for example.

        Cleaning CSV File...
    """)
    #read in csv file
    initial_df = pd.read_csv(sys.argv[1])

    print('Trimming and renaming columns')
    #trim and rename columns
    trimmed_df = trim_tri_df(initial_df)

    print('Converting NaN values in numerical columns to zeroes')
    #fill NaN values in appropriate categories
    clean_df = handle_nan_tri_df(trimmed_df)

    #year of the first row, used in the output file name
    year = clean_df['YEAR'].iloc[0]

    print('Writing cleaned DataFrame to ./clean_TRI_Data/TRI_%s_US_CLEAN.csv' % year)
    #write cleaned TRI DataFrame to csv and store in local folder - create if it doesn't exist yet
    write_cleaned_tri_to_csv(clean_df)

    #it probably works!
    print('TRI CSV File cleaned and placed in ./clean_TRI_Data/TRI_%s_US_CLEAN.csv' % year)

else:
    print("""
        Provide a single TRI data file for cleaning.

        This script is particularly for TRI US data sets. Read documentation at
        <insert GitHub URL here> for example.
    """)
```
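
Since the goal is to apply the same cleaning to every year's file, a small driver could loop over the raw files and call the script once per file. This is a sketch under assumptions: the helper names `find_tri_files` and `clean_all` are hypothetical, and the `TRI_<YEAR>_US.csv` naming pattern is inferred from the 2012 example rather than confirmed for other years.

```python
import glob
import os
import subprocess
import sys

def find_tri_files(directory):
    """Return the raw TRI files named like TRI_<YEAR>_US.csv in `directory`, sorted."""
    # the pattern ends in '_US.csv', so already-cleaned '*_US_CLEAN.csv' files are skipped
    pattern = os.path.join(directory, 'TRI_*_US.csv')
    return sorted(glob.glob(pattern))

def clean_all(directory, script='clean_tri_csv.py'):
    """Run the cleaning script once per raw TRI file and return the paths processed."""
    paths = find_tri_files(directory)
    for path in paths:
        # same invocation as running `python clean_tri_csv.py <file>` by hand
        subprocess.check_call([sys.executable, script, path])
    return paths
```

Calling `clean_all('./data/usdata')` from the directory that holds `clean_tri_csv.py` would then produce one cleaned CSV per year in `./clean_TRI_Data/`.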

## Testing the Script

### Before

![beforecmd](./images/before_tri_cmd_clean.png)

### After
![aftercmd](./images/after_tri_cmd_clean.png)

### Verify CSV in IPython
![verifycmd](./images/verify_tri_cmd_clean.png)

# Was there any point in doing this??!?!
***

Yes! As we have seen, we can now quickly trim and clean all of our desired TRI files in a standard way. This script also significantly reduces the file size of the CSV:

### Original TRI files (notice TRI_2012_US.csv)
The original file size is ~57MB for each year.
![filesizebefore](./images/filesize_before.png)

### Cleaned TRI file (TRI_2012_US.csv --> TRI_2012_US_CLEAN.csv)
The file size for our 2012 TRI data is now just under 15MB. The file contains only the columns we're interested in, and NaN values in the numerical release columns have been filled with zeroes. Good!
![filesizeafter](./images/filesize_after.png)
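
The same comparison can be made without screenshots. A minimal sketch, assuming hypothetical helpers `size_mb` and `report_sizes`; the paths in the usage note are the ones used earlier in this notebook.

```python
import os

def size_mb(path):
    """Return the size of the file at `path` in megabytes (1 MB = 1024 * 1024 bytes)."""
    return os.path.getsize(path) / (1024.0 * 1024.0)

def report_sizes(original_path, cleaned_path):
    """Return a one-line summary comparing the raw and cleaned file sizes."""
    original = size_mb(original_path)
    cleaned = size_mb(cleaned_path)
    return 'original: %.1f MB, cleaned: %.1f MB (%.0f%% of original)' % (
        original, cleaned, 100.0 * cleaned / original)
```

For the 2012 data this would be called as `report_sizes('./data/usdata/TRI_2012_US.csv', './clean_TRI_Data/TRI_2012_US_CLEAN.csv')`.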



***
##### -jacob