Files and Printing
------------------
You'll often read data from a file, or write the output of your Python scripts back into a file. Python makes this easy: open the file in the appropriate mode with the `open` function, then read or write as needed. The `open` function takes the name of the file and, optionally, the mode (it defaults to `'r'`). The mode is a short string that specifies whether you're reading from a file, writing to a file, or appending to the end of an existing file. The function returns a file object that you use for those tasks: `a_file = open(filename, mode)`. The modes are:

+ `'r'`: open a file for reading
+ `'w'`: open a file for writing. Caution: this will overwrite any previously existing file
+ `'a'`: append. Write to the end of a file. 

When reading, you typically want to iterate through the lines in a file using a for loop, as above. Some other common methods for dealing with files are: 

+ `file.read()`: read the entire contents of a file into a string
+ `file.write(some_string)`: writes to the file. Note that this doesn't automatically add a newline. Also note that writes are sometimes buffered: Python will wait until you have several writes pending, and perform them all at once
+ `file.flush()`: write out any buffered writes
+ `file.close()`: close the open file. This will free up some computer resources occupied by keeping a file open.
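
A common way to make sure `close()` always happens is the `with` statement, which closes the file automatically when its block ends. Here is a minimal sketch of the three modes; the file name `demo.txt` and the temporary folder are assumptions made only for this illustration:

```python
import os
import tempfile

# Work in a throwaway folder so the sketch leaves no files behind
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo.txt")  # hypothetical file name

    # 'w' creates (or overwrites) the file; it is closed when the block ends
    with open(demo_path, "w") as f:
        f.write("first line\n")
        f.write("second line\n")

    # 'a' adds to the end of the existing file
    with open(demo_path, "a") as f:
        f.write("third line\n")

    # 'r' reads; looping over the file object gives one line at a time
    demo_lines = []
    with open(demo_path, "r") as f:
        for line in f:
            demo_lines.append(line.rstrip("\n"))

print(demo_lines)
```

Because each `with` block closes its file on exit, there is no separate `close()` call to forget.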

Here is an example using files:

#### Writing a file to disk

In [None]:
# Create the file temp.txt, and get it ready for writing
f = open("temp.txt", "w")
f.write("This is my first file! The end!\n")
f.write("Oh wait, I wanted to say something else.")
f.close()

In [None]:
# Let's check that we did everything as expected
# the command below is one of the IPython "magics" - notebook commands that are not part of the Python language
# %magic prints an overview of the magic system and %lsmagic lists all available magic commands
%more temp.txt

#### Reading a file from disk

In [None]:
# We now open the file for reading
f = open("temp.txt", "r")
# And we read the full content of the file in memory, as a big string
content = f.read()
f.close()

In [None]:
content

Once we read the file, we have the lines in a big string. Let's process that big string a little bit:

In [None]:
# We read the file in the cell above; its content is in the variable content

# Split the content of the file using the newline character \n
lines = content.split("\n")

# Iterate through the lines variable (it is a list of strings)
# and then print the length of each line
for line in lines:
    print(line, " ===> ", len(line))

In [None]:
# Create a file numbers.txt and write the numbers from 0 to 24 there
f = open("numbers.txt", "w")
for num in range(25):
    f.write(str(num)+'\n')
f.close()

In [None]:
# Let's check that we did everything as expected
%more numbers.txt

In [None]:
# We now open the file for reading
f = open("numbers.txt", "r")
# And we read the full content of the file in memory, as a big string
content = f.read()
f.close()
content

In [None]:
# here we convert the strings into integers
# we have the conditional to avoid trying to parse the string '' that 
# is at the end of the list
numbers = []
lines = content.split("\n")
for line in lines:
    if len(line) > 0:
        numbers.append(int(line))
print(numbers)
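
As an aside, the empty string at the end of the list comes from the final `\n` in the file: `split("\n")` keeps whatever follows the last newline, even when that is nothing. A small sketch, using a made-up three-number string rather than the file itself, compares it with `splitlines()`, which does not produce that trailing item:

```python
# made-up stand-in for the content of a small numbers file
sample_content = "0\n1\n2\n"

by_split = sample_content.split("\n")        # keeps a trailing ''
by_splitlines = sample_content.splitlines()  # no trailing ''

print(by_split)
print(by_splitlines)

# with splitlines() there is no '' to guard against
sample_numbers = [int(line) for line in by_splitlines]
print(sample_numbers)
```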

In [None]:
# Let's clean up
# windows
#!del temp.txt
#!del numbers.txt

# macOS / Linux
#!rm temp.txt
#!rm numbers.txt

## Python `os` standard library
Another addition to our file handling toolkit is the `os` library, which provides ways to move files, make directories, and gather data about the file system. Like other standard library modules, we need to import it before using it, via `import os`.

In [None]:
import os

# let's get information about our current working directory - the folder our Python applications are working within
os.getcwd()

In [None]:
# next, let's list everything in the directory
os.listdir()

In [None]:
# ... and see what our login name is
os.getlogin()

In [None]:
# ... and change our directory.
os.chdir("sql-class")
os.getcwd()

In [None]:
# ... and change it back
os.chdir("..")  # ".." is a relative directory shortcut that means "my parent folder" no matter where you are
os.getcwd()

In [None]:
# ... and find out the "separator" used in constructing file paths
# every operating system is different, and this value enables your python
# to be cross-platform 
os.path.sep

In [None]:
# ... now we can create our own paths for new files - important for creating a "clean" version of source data
dir_list = os.getcwd().split(os.path.sep)
print(dir_list)

In [None]:
# ... now let's create an output file in a new sub-folder called 7-tmp

# first - append the new folder name 
# and create a file path string using os.path.sep and the string.join() method
dir_list.append("7-tmp")
dir_string = os.path.sep.join(dir_list)

# second - create the directory using the file path
os.mkdir(dir_string)

# third - add the file name "tmp2.txt" to the path
dir_list.append("tmp2.txt")
dir_string = os.path.sep.join(dir_list)
print(dir_string)

In [None]:
# we can now open the file for writing using the absolute path 
# note: we could have also used os.chdir("7-tmp") and open("tmp2.txt")
# but it's better programming to nearly always use absolute paths
file_handle = open(dir_string,"w")

In [None]:
file_handle.write("test file\nsecond line\n")
file_handle.close()
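
Joining path pieces by hand works, but the standard library also offers `os.path.join`, which inserts the right separator for us, and `os.makedirs(..., exist_ok=True)`, which does not fail if the folder already exists (plain `os.mkdir` raises an error when the cell is run a second time). A minimal sketch, using a temporary folder as a stand-in for the working directory:

```python
import os
import tempfile

# a throwaway folder stands in for os.getcwd() in this sketch
with tempfile.TemporaryDirectory() as base_dir:
    # build the sub-folder path without touching os.path.sep directly
    out_dir = os.path.join(base_dir, "7-tmp")

    # safe to call more than once: no error if the folder is already there
    os.makedirs(out_dir, exist_ok=True)
    os.makedirs(out_dir, exist_ok=True)

    # add the file name to the path and write the file
    out_path = os.path.join(out_dir, "tmp2.txt")
    with open(out_path, "w") as fh:
        fh.write("test file\nsecond line\n")

    created = os.listdir(out_dir)

print(created)
```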

# **Putting our Python to work:** 
# **Exploring and Cleaning the Census data file**

So far in our class project, we have identified three data files (IRS tax return counts for NYC, US census data for NYC, and NYC film permits data) and we have skipped over the cleaning step to transform them in SQL.

Now, we'll take one step back and begin cleaning them, starting with the smallest file - the US Census data.

Our approach will be as follows:
1. Inspect the data thoroughly to understand its benefits and risks for processing, and what information lies within
2. Clean/fix the data to remove any issues that would prevent it from working with SQL, such as weird characters, too many columns, missing data, splitting values into multiple rows, or combining multiple values into a single row
3. Structure the data found in the file as a Python native structure so we can manipulate and prepare it for use in SQL
4. Migrate our approach to a dedicated Python file

We'll use file read/write, loops, nested structures, and user-defined functions (UDFs) to do all of this. (Some will be homework.)

## **1- INSPECT DATA** 
First - we need to understand the data in this file - using either Excel or JupyterLab's CSV reader.

Upon inspection that way, here's what we learn:
* there are 350+ columns
* there are zipcodes for just NYC
* values are a mix of letters and numbers
* we have both percent and number values
* it appears to be comma-separated
* there are two column header lines: one with codes, and another with human-readable labels

Next - we inspect with Python in two ways: first we'll look at the raw file as a string, then we'll probe whether the comma-separation is fail safe or not.

### **REVIEW DATA AS STRING**

In [3]:
# we'll use our newly learned ability to read files to capture the Census file as a string
# to use this command, we'll need the full file path for our census file
import os
os.getcwd()

'C:\\Users\\colling\\!dwd_spring2019\\classes\\class7\\IPYNB solved'

In [4]:
# find the location of the raw_census_2010.csv file you downloaded and open it for reading into a variable
file_handle = open('C:\\Users\\colling\\!dwd_spring2019\\classes\\class7\\raw_census_2010.csv',"r")
census_data = file_handle.read()
file_handle.close()

In [5]:
# print the first 5000 characters of the census file
census_data[:5000]

'GEO.id,GEO.id2,GEO.display-label,HD01_S001,HD02_S001,HD01_S002,HD02_S002,HD01_S003,HD02_S003,HD01_S004,HD02_S004,HD01_S005,HD02_S005,HD01_S006,HD02_S006,HD01_S007,HD02_S007,HD01_S008,HD02_S008,HD01_S009,HD02_S009,HD01_S010,HD02_S010,HD01_S011,HD02_S011,HD01_S012,HD02_S012,HD01_S013,HD02_S013,HD01_S014,HD02_S014,HD01_S015,HD02_S015,HD01_S016,HD02_S016,HD01_S017,HD02_S017,HD01_S018,HD02_S018,HD01_S019,HD02_S019,HD01_S020,HD02_S020,HD01_S021,HD02_S021,HD01_S022,HD02_S022,HD01_S023,HD02_S023,HD01_S024,HD02_S024,HD01_S025,HD02_S025,HD01_S026,HD02_S026,HD01_S027,HD02_S027,HD01_S028,HD02_S028,HD01_S029,HD02_S029,HD01_S030,HD02_S030,HD01_S031,HD02_S031,HD01_S032,HD02_S032,HD01_S033,HD02_S033,HD01_S034,HD02_S034,HD01_S035,HD02_S035,HD01_S036,HD02_S036,HD01_S037,HD02_S037,HD01_S038,HD02_S038,HD01_S039,HD02_S039,HD01_S040,HD02_S040,HD01_S041,HD02_S041,HD01_S042,HD02_S042,HD01_S043,HD02_S043,HD01_S044,HD02_S044,HD01_S045,HD02_S045,HD01_S046,HD02_S046,HD01_S047,HD02_S047,HD01_S048,HD02_S048,HD01_S

In [6]:
# and the last 5000 characters
census_data[-5000:]

',812,12.3,400,6.0,9,0.1,86,1.3,16,0.2,34,0.5,267,4.0,3.0, ( X ) ,11.6, ( X ) ,5805,100.0,2766,47.6,6571, ( X ) ,2.38, ( X ) ,3039,52.4,6819, ( X ) ,2.24, ( X ) \n8600000US14903,14903,ZCTA5 14903,7567,100.0,439,5.8,491,6.5,494,6.5,484,6.4,426,5.6,448,5.9,425,5.6,414,5.5,532,7.0,613,8.1,643,8.5,557,7.4,451,6.0,298,3.9,257,3.4,217,2.9,203,2.7,175,2.3,41.6, ( X ) ,6049,79.9,5837,77.1,5566,73.6,1404,18.6,1150,15.2,3690,48.8,220,2.9,253,3.3,249,3.3,250,3.3,203,2.7,212,2.8,225,3.0,191,2.5,255,3.4,304,4.0,335,4.4,272,3.6,229,3.0,137,1.8,122,1.6,99,1.3,80,1.1,54,0.7,40.9, ( X ) ,2923,38.6,2810,37.1,2675,35.4,626,8.3,492,6.5,3877,51.2,219,2.9,238,3.1,245,3.2,234,3.1,223,2.9,236,3.1,200,2.6,223,2.9,277,3.7,309,4.1,308,4.1,285,3.8,222,2.9,161,2.1,135,1.8,118,1.6,123,1.6,121,1.6,42.1, ( X ) ,3126,41.3,3027,40.0,2891,38.2,778,10.3,658,8.7,7567,100.0,7434,98.2,7159,94.6,129,1.7,10,0.1,101,1.3,19,0.3,35,0.5,5,0.1,7,0.1,8,0.1,2,0.0,25,0.3,2,0.0,0,0.0,2,0.0,0,0.0,0,0.0,33,0.4,133,1.8,17,0.2,24,0.3,74,1

### **REVIEW COMMA-based SEPARABILITY**

In [7]:
# in the first 5000 characters, we see that the census has "code labels" for each column
# none of which have funky characters that could trip us up with comma separation
# we need to work with just that first line - we'll use readline on the file handle
file_handle = open('C:\\Users\\colling\\!dwd_spring2019\\classes\\class7\\raw_census_2010.csv',"r")
census_first_line = file_handle.readline()
file_handle.close()
census_first_line

'GEO.id,GEO.id2,GEO.display-label,HD01_S001,HD02_S001,HD01_S002,HD02_S002,HD01_S003,HD02_S003,HD01_S004,HD02_S004,HD01_S005,HD02_S005,HD01_S006,HD02_S006,HD01_S007,HD02_S007,HD01_S008,HD02_S008,HD01_S009,HD02_S009,HD01_S010,HD02_S010,HD01_S011,HD02_S011,HD01_S012,HD02_S012,HD01_S013,HD02_S013,HD01_S014,HD02_S014,HD01_S015,HD02_S015,HD01_S016,HD02_S016,HD01_S017,HD02_S017,HD01_S018,HD02_S018,HD01_S019,HD02_S019,HD01_S020,HD02_S020,HD01_S021,HD02_S021,HD01_S022,HD02_S022,HD01_S023,HD02_S023,HD01_S024,HD02_S024,HD01_S025,HD02_S025,HD01_S026,HD02_S026,HD01_S027,HD02_S027,HD01_S028,HD02_S028,HD01_S029,HD02_S029,HD01_S030,HD02_S030,HD01_S031,HD02_S031,HD01_S032,HD02_S032,HD01_S033,HD02_S033,HD01_S034,HD02_S034,HD01_S035,HD02_S035,HD01_S036,HD02_S036,HD01_S037,HD02_S037,HD01_S038,HD02_S038,HD01_S039,HD02_S039,HD01_S040,HD02_S040,HD01_S041,HD02_S041,HD01_S042,HD02_S042,HD01_S043,HD02_S043,HD01_S044,HD02_S044,HD01_S045,HD02_S045,HD01_S046,HD02_S046,HD01_S047,HD02_S047,HD01_S048,HD02_S048,HD01_S

In [8]:
# let's separate that first line into a list and see exactly how many columns we have
header1_list = census_first_line.split(",")
len(header1_list)

375

We knew it had 350+ and this is too many columns to effectively work with. We know from our transform work together that we need to weight up/down the data, which means percentages won't be useful, so we can remove those, and we can also remove some of the family occupancy data because it was decided to focus on gender, income and ethnicity in the project.

Before we get to removing data, we need to put our data into a structure we can slice and dice. In other words, we need to transform our "string" data into a list of lists - a nested data structure where each row is a list, and each column value is an item within that row's list.

For example: let's look at this simple 3x3 table:

In [9]:
example_list = [ 
["a","b","c"],
[123,456,789],
[987,654,321]
]
print(example_list[0])
print(example_list[1])
print(example_list[2])
print("row 1, column 3 ==>",example_list[0][2])
print("row 2, column 2 ==>",example_list[1][1])
print("row 3, column 1 ==>",example_list[2][0])

['a', 'b', 'c']
[123, 456, 789]
[987, 654, 321]
row 1, column 3 ==> c
row 2, column 2 ==> 456
row 3, column 1 ==> 987
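
Once data is in this shape, working with a whole column means visiting every row and picking the same position in each. A minimal sketch using the same 3x3 table (the variable names are only for illustration):

```python
example_list = [
    ["a", "b", "c"],
    [123, 456, 789],
    [987, 654, 321],
]

# collect column 1 (index 0) from every row
first_column = []
for row in example_list:
    first_column.append(row[0])
print(first_column)

# build a new table without the middle column (index 1)
without_middle = []
for row in example_list:
    without_middle.append(row[:1] + row[2:])
print(without_middle)
```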


## **2- CLEANING UP WONKY DATA**

But before we can even create a nested structure - we need to be confident we can split the data correctly for every single line.

Unfortunately, CSV data can be tricky because some data sources use commas inside column labels and surround those labels with `""`, which Excel treats correctly.

A plain comma split in Python won't be so forgiving, so we need to test for `"` in the data - we'll do this using a for loop.

In [10]:
# use a for loop
quote_cnt = 0
for char in census_data:
    if char == '"':
        quote_cnt += 1
print(quote_cnt)

28
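
The loop above spells out the counting step by step. For reference, Python strings also have a built-in `count` method that gives the same number. A minimal sketch on a made-up sample string (not the census file):

```python
# made-up sample with two quoted values
sample_text = 'id,"Number; a, b",value,"Percent; c, d"'

# count with a loop, as in the cell above
loop_cnt = 0
for char in sample_text:
    if char == '"':
        loop_cnt += 1

# count with the built-in string method
method_cnt = sample_text.count('"')

print(loop_cnt, method_cnt)
```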


Uh-oh! Those quotes spell trouble, so now we need to see where the first quote appears and whether a comma appears after it.

In [11]:
quote_pos = census_data.find('"')
comma_pos = census_data.find(",",quote_pos)
census_data[quote_pos - 10: comma_pos + 20]

' 18 years,"Number; HOUSEHOLDS BY TYPE - Total households - Family households (families) [7] - Male householder, no wife present","'

We suspect that the quoted, human-readable column headers on the second line of data are hiding commas ("...**,**...") and that any nesting based on comma-separation will create extra columns.

Let's prove it.

In [12]:
# we need to isolate the second line 
# and we can do that using find: 
# the second line is between the first and second \n
first_nl = census_data.find("\n")
second_nl = census_data.find("\n",first_nl+1)
second_line = census_data[first_nl+1:second_nl]

# split the second line using commas to see how many columns we get
len(second_line.split(","))

391
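
For comparison, Python's standard `csv` module understands quoted values. A minimal sketch on a made-up sample line (not the census file) shows how a plain split over-counts columns while `csv.reader` does not:

```python
import csv

# made-up line: three columns, the middle one quoted and containing a comma
sample_line = 'id,"Number; Male householder, no wife present",total'

# a plain split treats the comma inside the quotes as a column break
naive_fields = sample_line.split(",")

# csv.reader accepts any iterable of lines and respects the quotes
parsed_fields = next(csv.reader([sample_line]))

print(len(naive_fields), naive_fields)
print(len(parsed_fields), parsed_fields)
```

In this notebook we clean the string ourselves instead, which is good practice with loops and string methods.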

Now that we've proven those will be a problem, we are going to use Python to clean them up with our own User Defined Function (UDF), because the same problem may appear in another data source, too.

Before we create the UDF, we will describe what it will do:
1. it will accept a string input
2. it will remove `"` characters from the input
3. it will replace any `,` found between a pair of `"` with a `-`, but it will NOT change any other `,` since those separate the columns of the CSV file
4. it will return a string output

And, we will create a `test_input` and an `exp_output` (the expected output) to test our UDF during development:

In [13]:
#@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
#@@@@@  UDF unquotable() @@@@@@@@@@@@@@@
# removes pesky , within "" values before changing , to \t in program
def unquotable(input_string):
    
    # remove " char 
    # we remove it by splitting the input string into a list using that char
    tmp_list = input_string.split('"')
 
    # remove , char
    for i in range(len(tmp_list)):
        # skip items that start/end with commas 
        # as these aren't quoted items 
        # and we don't want to remove their commas 
        if (tmp_list[i][0] == ",") or (tmp_list[i][-1] == ","):
            continue
        # replace , with - in quoted items
        else:
            tmp_list[i] = tmp_list[i].replace(",","-")

    # rejoin items into a string
    output = ",".join(tmp_list)

    # get rid of any ",," and ",,," that could exist as it will add extra columns
    output = output.replace(",,,",",")
    output = output.replace(",,",",")
    
    # return string
    return output

In [14]:
# our test data
test_input = 'something,"test string: a,b,c",other thing'
exp_output = 'something,test string: a-b-c,other thing'

# our testing lines - run the UDF
test_output = unquotable(test_input)

# compare UDF run to expected output
test_output == exp_output

True
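
The UDF above decides which pieces are quoted by looking for commas at the start or end of each piece, and it collapses `,,` afterwards, which would also collapse genuinely empty columns if a file had any. One alternative sketch, not the version used in the rest of this notebook, relies on position instead: after splitting on `"`, every odd-numbered piece sat between a pair of quotes. The name `unquote_by_position` is hypothetical, and the sketch assumes quotes are balanced and never escaped inside a value:

```python
# hypothetical alternative to unquotable(); assumes balanced, unescaped quotes
def unquote_by_position(input_string):
    # splitting on " puts quoted text at odd positions: 1, 3, 5, ...
    pieces = input_string.split('"')

    # replace , with - only inside the quoted pieces
    for i in range(1, len(pieces), 2):
        pieces[i] = pieces[i].replace(",", "-")

    # glue the pieces back together without adding any commas
    return "".join(pieces)

alt_test_input = 'something,"test string: a,b,c",other thing'
alt_output = unquote_by_position(alt_test_input)
print(alt_output)

# an empty column (the ,, near the end) survives untouched
alt_output_2 = unquote_by_position('"a,b","c,d",,x')
print(alt_output_2)
```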

In [15]:
# now we try our second_line variable storing the problem row for cleaning 
# proof it works: no " and len after CSV split = 375
new_row = unquotable(second_line)
print('"' in new_row)
print(len(new_row.split(",")))

False
375


It works! Now, let's fix our original input data string, `census_data`

In [16]:
clean_census_data = unquotable(census_data)

## **3-CREATING A NESTED STRUCTURE**
Now that we've cleaned up the wonkiness in the data, we can create our nested structure using a `for` loop.

Let's start by reviewing our simple 3x3 table example:

In [17]:
example_list = [ 
["a","b","c"],
[123,456,789],
[987,654,321]
]
print(example_list[0])
print(example_list[1])
print(example_list[2])
print("row 1, column 3 ==>",example_list[0][2])
print("row 2, column 2 ==>",example_list[1][1])
print("row 3, column 1 ==>",example_list[2][0])

['a', 'b', 'c']
[123, 456, 789]
[987, 654, 321]
row 1, column 3 ==> c
row 2, column 2 ==> 456
row 3, column 1 ==> 987


In our example, the data is structured as:

`list[row][column]`

And we will use split commands to do the same for our data, this time creating another UDF.

Again - as before - let's state what it will do and create test data:
1. the UDF will take a data string as input
2. it will create a list where each line in the data, identified using `\n`, is an item
3. for each item in the first list `list[row]`, it will create a list of columns, using comma-separation (`,`) to identify each item
4. the UDF user should be able to specify the row separator and the column separator independently
5. the UDF will return the nested data structure

In [18]:
#@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@    
#@@@@@@@@ UDF nester() @@@@@@@@@
# take a dataset string where each row is separated by input_row_delim 
# and each column is separated by input_col_delim to create a nested object of lists
def nester(input_string,input_row_delim,input_col_delim):

    # create a list item for each row in the file using the row delimiter
    row_list = input_string.split(input_row_delim)

    #output var
    nested_data = []
    
    # create the nested structure to store each column separately
    # (a list of rows, where each row is a list of columns)
    for i in range(len(row_list)): 
        row = row_list[i]
        col = row.split(input_col_delim)
        nested_data.append(col)
    
    # return the nested structure
    return nested_data

In [19]:
# our test data
test_input = 'r1c1,r1c2,r1c3\nr2c1,r2c2,r2c3\nr3c1,r3c2,r3c3'
exp_output = [
    ['r1c1','r1c2','r1c3'],
    ['r2c1','r2c2','r2c3'],
    ['r3c1','r3c2','r3c3']
]

# our testing - run the UDF
test_output = nester(test_input,'\n',',')

# compare UDF output to expected output
test_output == exp_output

True

Now we can create a nested structure of our census data:

In [20]:
# structure our original string
census_struc = nester(clean_census_data,'\n',',')

In [21]:
# let's explore it for a few moments
print(census_struc[2])

['8600000US06390', '06390', 'ZCTA5 06390', '236', '100.0', '6', '2.5', '9', '3.8', '16', '6.8', '11', '4.7', '7', '3.0', '7', '3.0', '7', '3.0', '13', '5.5', '20', '8.5', '27', '11.4', '22', '9.3', '29', '12.3', '19', '8.1', '20', '8.5', '6', '2.5', '6', '2.5', '8', '3.4', '3', '1.3', '49.4', ' ( X ) ', '200', '84.7', '196', '83.1', '193', '81.8', '55', '23.3', '43', '18.2', '118', '50.0', '2', '0.8', '7', '3.0', '8', '3.4', '4', '1.7', '4', '1.7', '3', '1.3', '2', '0.8', '7', '3.0', '9', '3.8', '17', '7.2', '8', '3.4', '21', '8.9', '4', '1.7', '11', '4.7', '3', '1.3', '4', '1.7', '4', '1.7', '0', '0.0', '49.2', ' ( X ) ', '100', '42.4', '99', '41.9', '96', '40.7', '24', '10.2', '22', '9.3', '118', '50.0', '4', '1.7', '2', '0.8', '8', '3.4', '7', '3.0', '3', '1.3', '4', '1.7', '5', '2.1', '6', '2.5', '11', '4.7', '10', '4.2', '14', '5.9', '8', '3.4', '15', '6.4', '9', '3.8', '3', '1.3', '2', '0.8', '4', '1.7', '3', '1.3', '49.7', ' ( X ) ', '100', '42.4', '97', '41.1', '97', '41.1', '3
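
One quick sanity check on a nested structure is to count how many columns each row has; every row should have the same number. A minimal sketch on made-up data (the census file may or may not end with a newline, so the extra row shown here is an assumption about the input, not a finding):

```python
# made-up data string that ends with a newline
sample_data = "h1,h2,h3\n1,2,3\n4,5,6\n"

# same splitting idea as nester(): rows on \n, columns on ,
sample_rows = []
for line in sample_data.split("\n"):
    sample_rows.append(line.split(","))

# tally how many rows have each column count
width_counts = {}
for row in sample_rows:
    width = len(row)
    width_counts[width] = width_counts.get(width, 0) + 1
print(width_counts)

# the trailing newline leaves one row holding a single empty string; drop it
cleaned_rows = []
for row in sample_rows:
    if row != [""]:
        cleaned_rows.append(row)
print(len(cleaned_rows))
```

If the tally shows more than one column count, the odd rows are worth inspecting before any columns are removed.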

## **4-MOVING FROM NOTEBOOKS TO \*.py FILES**
Before going further, we need to start moving the findings of our exploration into a dedicated Python file for cleaning the Census data. We're making this migration because notebooks are great for exploring data, but as our files and project grow larger, it is simpler to run Python files outside notebooks, and very large files can sometimes cause our notebooks to crash.

Here's what that new Python file needs to do:
1. read the census data file into a variable
2. clean the data by removing the "...,..." problem using a UDF
3. create a nested data structure
4. remove unwanted columns
5. create a new string from the nested structure
6. write the file to disk

We've already done #1, #2, #3, and #6 together, so we'll isolate those below, along with comments marking the other work, before putting it all in its own file.

And recall that our UDFs have to be put ahead of the main program to work, because Python must have already run each `def` before the main program can call that function.
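
To see why the order matters, here is a small sketch with a made-up function, `shout`, called once before its `def` has run and once after:

```python
# calling the function before Python has seen its def fails
try:
    early_result = shout("hello")
except NameError as err:
    error_message = str(err)
    print("too early:", error_message)

# the definition (a made-up UDF just for this sketch)
def shout(text):
    return text.upper() + "!"

# calling it after the def works
late_result = shout("hello")
print(late_result)
```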

Below is the exact code that will be put into its own file named `clean_census.py` and we'll run it together.

```python
#@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
#@@@@@  UDF unquotable() @@@@@@@@@@@@@@@
# removes pesky , within "" values before changing , to \t in program
def unquotable(input_string):
    
    # remove " char 
    # we remove it by splitting the input string into a list using that char
    tmp_list = input_string.split('"')
 
    # remove , char
    for i in range(len(tmp_list)):
        # skip items that start/end with commas as these aren't quoted items
        if (tmp_list[i][0] == ",") or (tmp_list[i][-1] == ","):
            continue
        # replace , with - in quoted items
        else:
            tmp_list[i] = tmp_list[i].replace(",","-")

    # rejoin items into a string
    output = ",".join(tmp_list)

    # get rid of any ",," and ",,," as it will add extra columns
    output = output.replace(",,,",",")
    output = output.replace(",,",",")
    
    # return string
    return output

#@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@    
#@@@@@@@@ UDF nester() @@@@@@@@@
# take a dataset string where each row is separated by input_row_delim 
# and each column is separated by input_col_delim to create a nested object of lists
def nester(input_string,input_row_delim,input_col_delim):

    # create a list item for each row in the file using the row delimiter
    row_list = input_string.split(input_row_delim)

    #output var
    nested_data = []
    
    # create the nested structure to store each column separately
    # (a list of rows, where each row is a list of columns)
    for i in range(len(row_list)): 
        row = row_list[i]
        col = row.split(input_col_delim)
        nested_data.append(col)
    
    # return the nested structure
    return nested_data


################################################################
##### MAIN PROGRAM #############################################
################################################################

# 1. read the census data file into a variable
file_handle = open('C:\\Users\\colling\\!dwd_spring2019\\classes\\class7\\raw_census_2010.csv',"r")
census_data = file_handle.read()
file_handle.close()

# 2. clean the data by removing the "...,..." problem using a UDF
clean_census_data = unquotable(census_data)

# 3. create a nested data structure with a UDF
nested_census_data = nester(clean_census_data,"\n",",")

# 4. remove unwanted columns using a UDF
# HOMEWORK

# 5. create a new string from the nested structure using a UDF
# HOMEWORK

# 6. write the file to disk
file_handle = open('C:\\Users\\colling\\!dwd_spring2019\\classes\\class7\\clean_census_2010.csv',"w")
file_handle.write('final data var from step #5 goes here')
file_handle.close()
```