# Dealing with Data Spring 2022 – Class 5

---

# Files and Printing

Often you will need to read data from a file, or write the output of a Python script back to a file. 

We use the `open` function to open a file. It takes two arguments: 

1. the name of the file,
2. and the mode. 

> `a_file = open(filename, mode)`

The `mode` is a single-letter string that specifies whether you are going to read from a file, write to a file, or append to the end of an existing file. The modes are: 

+ `'r'` : open a file for reading
+ `'w'` : open a file for writing (beware, this will overwrite any previously existing file) 
+ `'a'` : append (write to the end of a file) 

When reading a file, you usually want to iterate through the lines in that file using a `for` loop. Some other common methods for dealing with files are: 

+ `file.read()` : read the entire contents of a file into a string
+ `file.write(some_string)` : writes to the file (note, this doesn't automatically include new lines) 
+ `file.close()` : close the open file
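
The three modes and the line-by-line `for` loop can be seen together in a minimal sketch. The file name `modes_demo.txt` and the temporary folder are assumptions made only for illustration, so nothing in your working directory is touched.

```python
import os
import tempfile

# hypothetical file inside a throwaway folder, so no real file is overwritten
demo_dir = tempfile.mkdtemp()
demo_path = os.path.join(demo_dir, "modes_demo.txt")

f = open(demo_path, "w")   # 'w' creates the file (or overwrites an existing one)
f.write("first line\n")    # write() does not add the newline for us
f.close()

f = open(demo_path, "a")   # 'a' keeps what is there and adds to the end
f.write("second line\n")
f.close()

demo_lines = []
f = open(demo_path, "r")   # 'r' opens the file for reading
for line in f:             # a file can be looped over one line at a time
    demo_lines.append(line.rstrip("\n"))  # drop the trailing newline
f.close()

print(demo_lines)

# tidy up the throwaway file and folder
os.remove(demo_path)
os.rmdir(demo_dir)
```

Looping over the file handle reads one line at a time, which helps when a file is too large to comfortably hold in memory as a single string.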

# Writing a file to disk

In [None]:
# create the file temp.txt, and get it ready for writing

f = open("temp.txt", "w") # "w" meaning we want to open it for writing
f.write("This is my first file! The end!\n") # writing our text to our file
f.write("Oh wait, I wanted to say something else.") # writing some more text to our file
f.close() # closing the file, as a best practice

In [None]:
# the command below is one of the IPython "magics" - commands provided by the notebook rather than by Python itself
# %magic prints an overview of the magic system and %lsmagic lists all the available magic commands

# for more info, check out https://www.dataquest.io/blog/jupyter-notebook-tips-tricks-shortcuts/

%less temp.txt

# Reading a file from disk

In [None]:
# open the file for reading

f = open("temp.txt", "r") # opening the file with read permissions
content = f.read() # read the full content of the file in memory, as a big string
f.close()

In [None]:
content

Once we read the file, we have the lines in a big string. Let's process that big string a little bit:

In [None]:
# read the file in the cell above and split the content of the file using the newline character '\n'

lines = content.split("\n") # let's split on the new line return; remember the output is going to be a list

for line in lines: # iterate through the lines variable (a list of strings) and print each line with its length
    print(line, " ===> ", len(line))

In [None]:
# create a new file numbers.txt and write the numbers from 0 to 24 there

f = open("numbers.txt", "w")

for num in range(25): 
    f.write(str(num)+'\n') # write each number followed by a new line
    
f.close()

In [None]:
%less numbers.txt

In [None]:
f = open("numbers.txt", "r") # now open the file for reading

content = f.read() # and read the full content of the file in memory, as a big string

f.close()

content

In [None]:
# here we convert the strings into integers

lines = content.split("\n")

for line in lines: 
    if line != "": # skip the empty string left behind by the final newline
        print(int(line))

In [None]:
# Let's clean up

# windows
#!del temp.txt
#!del numbers.txt

# macOS
!rm temp.txt # remove temp.txt file
!rm numbers.txt # remove numbers.txt file

^Please note that in Colab it takes a minute or so for the !rm changes to come into effect.
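
A cross-platform alternative to the shell commands is the `os.remove` function from the standard library's `os` module (covered in the next section), which behaves the same way on Windows, macOS, and Colab. This is a minimal sketch; the helper name `remove_if_exists` and the file name `scratch_demo.txt` are hypothetical, and the file is created in a temporary folder.

```python
import os
import tempfile

def remove_if_exists(path):
    # hypothetical helper: delete the file only if it is actually there
    if os.path.exists(path):
        os.remove(path)
        return True
    return False

# create a small throwaway file so there is something to delete
scratch_dir = tempfile.mkdtemp()
scratch_path = os.path.join(scratch_dir, "scratch_demo.txt")
f = open(scratch_path, "w")
f.write("delete me\n")
f.close()

first_try = remove_if_exists(scratch_path)   # file exists, so it is removed
second_try = remove_if_exists(scratch_path)  # already gone, nothing to do
print(first_try, second_try)

os.rmdir(scratch_dir)  # remove the now-empty throwaway folder
```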

---

# ⭕ **QUESTIONS?**

---

# Python `os` standard library
Another addition to our file handling toolkit is the `os` library, which provides ways to move files, make directories, and gather data about the file system. Like other standard libraries, we need to import it before we can use it, via `import os`.

In [None]:
import os

# let's get info about our current working directory - the folder our Python applications are working in
os.getcwd() # cwd = "current working directory"

^ Remember, all of the above applies to our Colab environment. If you open up your computer's terminal or command line and type ```pwd``` (print working directory) you will most likely get a different response.

In [None]:
# next, let's list everything in the directory

os.listdir() # list directory contents

In [None]:
# ... and find out the "separator" used in constructing file paths
# every operating system is different, and this value enables your python to be cross-platform 

os.path.sep

In [None]:
# ... now we can create our own paths for new files - important for creating a "clean" version of source data

dir_list = os.getcwd().split(os.path.sep)
print(dir_list)

In [None]:
# ... now let's create an output file in a new sub-folder called Class5_tmp

# first - append the new folder name and create a file path string using os.path.sep and the string.join() method

dir_list.append("Class5_tmp") # add a new suffix to our list
dir_string = os.path.sep.join(dir_list) # and join everything in our directory list with the path separator

# second - create the directory using the file path

os.mkdir(dir_string)

# third - add the file name "tmp2.txt" to the path

dir_list.append("tmp2.txt")
dir_string = os.path.sep.join(dir_list)

print(dir_string)

In [None]:
# we can now open the file for writing using the absolute path 

file_handle = open(dir_string,"w")

In [None]:
file_handle.write("test file\nsecond line\n")
file_handle.close()

In [None]:
%less /content/Class5_tmp/tmp2.txt

In [None]:
# clean up 

!rm Class5_tmp/tmp2.txt
!rmdir Class5_tmp
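
The standard library also offers `os.path.join`, which inserts the correct separator for the current operating system, so the split/append/join steps can be shortened. This is a sketch of an alternative, not the approach used in the cells above; the folder name `Class5_demo` and the file name `tmp_demo.txt` are made up for the example, and everything is created inside a temporary folder.

```python
import os
import tempfile

base_dir = tempfile.mkdtemp()                       # throwaway parent folder
demo_dir = os.path.join(base_dir, "Class5_demo")    # hypothetical sub-folder name
demo_file = os.path.join(demo_dir, "tmp_demo.txt")  # hypothetical file name

os.mkdir(demo_dir)                 # create the sub-folder

f = open(demo_file, "w")           # write a file using the full path
f.write("test file\nsecond line\n")
f.close()

listing = os.listdir(demo_dir)     # the new file shows up in the folder listing
print(listing)

# the path built by os.path.join matches the one built with os.path.sep
same_path = os.path.sep.join([base_dir, "Class5_demo", "tmp_demo.txt"])
print(demo_file == same_path)

# clean up: remove the file first, then the (now empty) folders
os.remove(demo_file)
os.rmdir(demo_dir)
os.rmdir(base_dir)
```

The clean-up at the end uses `os.remove` and `os.rmdir`, which do the same job as `!rm` and `!rmdir` without depending on the operating system's shell.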

---

# ⭕ **QUESTIONS?**

---

# **Putting our Python to work: Exploring and Cleaning the Census Data file**

Our approach will be as follows:
* Inspect the data thoroughly to understand its benefits and risks for processing, and what information lies within
* Clean/fix the data to remove any issues that would prevent it from working with SQL, such as weird characters, too many columns, missing data, splitting values into multiple rows, or combining multiple values into a single row
* Structure the data found in the file as a Python native structure so we can manipulate and prepare it for use in SQL
* Migrate our approach to a dedicated Python file

We'll use: file read/write, loops, nested structures and UDFs to do all of this. (Some will be homework.)

---

## **1) Inspect our Data**

Upon visual inspection:
* there are 350+ columns
* there are zipcodes for just NYC 
* there is a mix of letters and numbers for values
* we have both percent and number values
* it appears to be comma-separated
* there are two column header lines: one with codes, and another with human-readable labels

Next - we inspect with Python in two ways: first we'll look at the raw file as a string, then we'll probe whether the comma-separation is fail-safe or not.

In [None]:
# REVIEWING DATA AS A STRING

import os
os.getcwd()

import shutil # lets us copy and remove files from within Python

To work in Google Colab, we must manually upload our CSV into the environment...

In [None]:
# find the location of the raw_census_2010.csv file you downloaded, copy, and open it for reading into a variable

shutil.copyfile('/content/raw_census_2010.csv','/content/raw_census_2010_copy.csv')

file_handle = open('/content/raw_census_2010_copy.csv',"r")

census_data = file_handle.read()
file_handle.close()

In [None]:
# print the first 500 characters of the census file

census_data[:500]

In [None]:
# and the last 500 characters
census_data[-500:]

# **Review Comma-Based Separability**

In the first 500 characters, we see that the census has "code labels" for each column, none of which have funky characters that could trip us up with comma separation. 

We need to work with just that first line, though, so we'll use `readline` (which reads one entire line from the file) on the file handle.

In [None]:
file_handle = open('/content/raw_census_2010_copy.csv',"r")
census_first_line = file_handle.readline()
file_handle.close()
census_first_line

In [None]:
# let's separate that first line into a list and see exactly how many columns we have

header1_list = census_first_line.split(",")
len(header1_list)

We knew it had 350+, and this is too many columns to work with effectively. 

Before we get to removing data, though, we need to put our data into a structure we can slice and dice. In other words, we need to transform our "string" data into a list of lists - a nested data structure where each row is a list, and within that list, each column is an item.

For example: let's look at this simple 3x3 table:

In [None]:
example_list = [ 
    ["a","b","c"],
    [123,456,789],
    [987,654,321]
    ]

print(example_list[0])
print(example_list[1])
print(example_list[2])

print("row 1, column 3 ==>",example_list[0][2])
print("row 2, column 2 ==>",example_list[1][1])
print("row 3, column 1 ==>",example_list[2][0])

## **2) Cleaning Up Wonky Data**

But before we can even create a nested structure - we need to be confident we can split the data correctly for every single line.

Unfortunately, CSV data is known to be particularly tricky: some data sources use commas inside column labels and surround those labels with double quotes (`""`), because Excel will then treat each label as a single value. 

Python's `split` won't be so forgiving, so we need to test for `"` characters in the data - we'll do this using a `for` loop.

In [None]:
# using a for loop

quote_count = 0

for char in census_data:
    if char == '"':
        quote_count += 1
    else:
        continue
        
print(quote_count)
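
The same count can be obtained in one step with the string method `count`. The sketch below compares the two on a made-up sample string rather than the census file:

```python
# hypothetical sample row; the real census_data string is much longer
sample = 'Id,"Total; Male: - 5 to 9 years","Percent, Male"\n'

# count quotes with a loop, as in the cell above
loop_count = 0
for char in sample:
    if char == '"':
        loop_count += 1

# count quotes with the built-in string method
method_count = sample.count('"')

print(loop_count, method_count)
```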

Uh-oh! Those quotes spell trouble, so now we need to see where the first quote appears and whether a comma appears after it.

In [None]:
quote_pos = census_data.find('"')
comma_pos = census_data.find(",",quote_pos) # quote_pos is the starting index of that "
census_data[quote_pos - 10: comma_pos + 20] # let's take a look at ten characters before and 20 after to check it out

We suspect that those `""` in the human-readable column headers on the second line of data are hiding "...**,**..." and will cause any nesting that uses comma-separation to create extra columns.

Let's prove it.

In [None]:
# we need to isolate the second line using "find": 
# the second line is between the first and second \n

first_nl = census_data.find("\n")
second_nl = census_data.find("\n",first_nl+1)
second_line = census_data[first_nl+1:second_nl]

# split the second line using commas to see how many columns we get

len(second_line.split(","))

Now that we've proven those will be a problem, we are going to clean them up with our own User Defined Function (UDF), because the same problem may appear in another data source, too.

Before we create the UDF, let's describe what we want our function to do: 

1. it will accept a string input
2. it will remove " characters from the input
3. when it finds a `,` between `""`, it will replace that `,` with a `-`. However, it will NOT change any other `,`, since those separate the columns of the CSV file.
4. it will return a string output

And we will create a `test_input` and an `exp_output` to test our UDF during development:

In [None]:
def unquotable(input_string): # will remove those , within "" before changing , to \t 
    
    tmp_list = input_string.split('"') # remove " char by splitting the input string into a list using that char

    # after the split, items alternate: even positions (0, 2, 4...) were outside quotes
    # and odd positions (1, 3, 5...) were inside quotes, so we only touch the odd ones
    for i in range(1, len(tmp_list), 2):
        tmp_list[i] = tmp_list[i].replace(",","-") # replace , with - in quoted items

    # rejoin items into a string; the unquoted items still carry their own commas,
    # so we join on an empty string and empty columns (",,") are left untouched
    output = "".join(tmp_list)
    
    return output # return string

In [None]:
# our test data

test_input = 'something,"test string: a,b,c",other thing'
exp_output = 'something,test string: a-b-c,other thing'

test_output = unquotable(test_input) # our testing lines - run the UDF

test_output == exp_output # compare UDF run to expected output

In [None]:
# now we try our second_line variable storing the problem row for cleaning 
# proof it works: no " and len after CSV split = 375

new_row = unquotable(second_line)

print('"' in new_row)
print(len(new_row.split(",")))

It works! Now, let's fix our original input data string, `census_data`

In [None]:
clean_census_data = unquotable(census_data)
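
As a cross-check, Python's standard `csv` module understands quoted fields, so it can confirm how many columns a line really has. This is a sketch of an alternative to the `unquotable` approach above, not part of it, and the sample line is made up:

```python
import csv
import io

# hypothetical header line with commas hidden inside a quoted label
sample_line = 'something,"test string: a,b,c",other thing'

naive_columns = sample_line.split(",")  # a plain split breaks the quoted label apart

# csv.reader expects a file-like object, so wrap the string in io.StringIO
reader = csv.reader(io.StringIO(sample_line))
parsed_columns = next(reader)           # first (and only) row as a list of strings

print(len(naive_columns), naive_columns)
print(len(parsed_columns), parsed_columns)
```

The plain split reports five columns, while the `csv` reader reports the three that the line actually contains and keeps the commas inside the quoted label.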

---

## **3) Creating a Nested Structure**

Now that we've cleaned up the wonkiness in the data, we can create our nested structure using a `for` loop.

Let's start by reviewing our simple 3x3 table example:

In [None]:
example_list = [ 
    ["a","b","c"],
    [123,456,789],
    [987,654,321]
    ]

print(example_list[0])
print(example_list[1])
print(example_list[2])

print("row 1, column 3 ==>",example_list[0][2])
print("row 2, column 2 ==>",example_list[1][1])
print("row 3, column 1 ==>",example_list[2][0])

In our example, the data is structured as:

`list[row][column]`
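
Because every row is itself a list, a single column can be collected by taking the same index from each row. A minimal sketch using the 3x3 example (the variable names `third_column` and `data_rows` are made up for illustration):

```python
example_list = [
    ["a", "b", "c"],
    [123, 456, 789],
    [987, 654, 321]
    ]

# take item 2 (the third column) from every row
third_column = []
for row in example_list:
    third_column.append(row[2])

print(third_column)

# slicing works on the outer list too: skip the header row
data_rows = example_list[1:]
print(data_rows)
```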

And we will use split commands to do the same for our data, this time creating another UDF.

Again - as before - let's state what it will do and create test data:
1. the UDF will take a data string as input
2. it will create a list where each line in the data, identified using `\n`, is an item
3. for each item in the first list `list[row]`, it will create a list of columns, using comma-separation (`,`) to identify each item
4. it would be nice for the UDF user to specify what character separates rows and columns, separately
5. the UDF will return the nested data structure

In [None]:
# take a dataset string where each row is separated by input_row_delim and each column is separated by 
# input_col_delim to create a nested object of lists

def nester(input_string,input_row_delim,input_col_delim):

    row_list = input_string.split(input_row_delim) # create a list item for each row in the file using the row delimiter

    nested_data = [] #output var

    # create a nested structure to store each column separately (a list of rows where each row is a list of columns)
    for i in range(len(row_list)): 
        row = row_list[i]
        col = row.split(input_col_delim)
        nested_data.append(col)
    
    return nested_data # return the nested structure

In [None]:
test_input = 'r1c1,r1c2,r1c3\nr2c1,r2c2,r2c3\nr3c1,r3c2,r3c3' # our test data
exp_output = [
    ['r1c1','r1c2','r1c3'],
    ['r2c1','r2c2','r2c3'],
    ['r3c1','r3c2','r3c3']
    ]

test_output = nester(test_input,'\n',',') # our testing - run the UDF

test_output == exp_output # compare UDF output to expected output

Now we can create a nested structure of our census data:

In [None]:
census_struc = nester(clean_census_data,'\n',',') # structure our original string

In [None]:
print(census_struc[2]) # let's explore more
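
One quick sanity check on a nested structure is to confirm that every row has the same number of columns. The sketch below uses a small made-up structure in place of `census_struc`; note that when a file ends with a newline, splitting on `\n` leaves a final row holding a single empty string, which is worth looking for.

```python
# hypothetical stand-in for census_struc; the last row mimics what
# splitting on '\n' leaves behind when the file ends with a newline
small_struc = [
    ["r1c1", "r1c2", "r1c3"],
    ["r2c1", "r2c2", "r2c3"],
    ["r3c1", "r3c2", "r3c3"],
    [""]
    ]

# tally how many rows have each column count
length_counts = {}
for row in small_struc:
    n_cols = len(row)
    length_counts[n_cols] = length_counts.get(n_cols, 0) + 1

print(length_counts)

# list the positions of rows that do not match the first row's length
expected = len(small_struc[0])
odd_rows = []
for i in range(len(small_struc)):
    if len(small_struc[i]) != expected:
        odd_rows.append(i)

print(odd_rows)
```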

---

## **4) Moving from Notebooks to \*.py Files** 

Before going further, we need to start moving the findings of our exploration into a dedicated Python file for cleaning the Census data. 

We're making this migration because notebooks are great for exploring data, but as our files and project grow larger, it is simpler to run the Python files outside notebooks AND sometimes very large files can cause our notebooks to crash.

Here's what that new Python file needs to do:

1. read the census data file into a variable
2. clean the data by removing the "...,..." problem using a UDF
3. create a nested data structure
4. remove unwanted columns
5. create a new string from the nested structure
6. write the file to disk

We've already done #1, #2, #3, and #6, so we'll isolate those below, along with comments marking where the other work goes, before putting everything in its own file.

And recall that our UDFs have to be placed ahead of the main program, because the main program can only call functions that have already been defined.
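
The ordering rule can be seen in a small sketch: Python runs a file from top to bottom, so calling a function before its `def` has run raises a `NameError`. The function name `shout` is hypothetical.

```python
# calling a function before Python has seen its definition fails
try:
    shout("hello")
    early_call = "worked"
except NameError:
    early_call = "NameError"

def shout(text):  # hypothetical UDF
    return text.upper() + "!"

# after the def has run, the same call succeeds
late_call = shout("hello")

print(early_call, late_call)
```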

Below is the exact code that will be put into its own file named `clean_census.py`.

```python
################################################################
##### UNQUOTABLE #############################################
################################################################

def unquotable(input_string): # removes pesky , within "" values before changing , to \t in program
    
    # remove " char by splitting the input string into a list using that char
    tmp_list = input_string.split('"')
 
    # after the split, odd positions (1, 3, 5...) hold the items that were inside quotes
    for i in range(1, len(tmp_list), 2):
        # replace , with - in quoted items
        tmp_list[i] = tmp_list[i].replace(",","-")

    # rejoin items into a string; unquoted items keep their own commas,
    # so joining on "" leaves empty columns (",,") untouched
    output = "".join(tmp_list)
    
    # return string
    return output

################################################################
##### NESTER #############################################
################################################################

# take a dataset string where each row is separated by input_row_delim and each column
# is separated by input_col_delim to create a nested object of lists
def nester(input_string,input_row_delim,input_col_delim):

    # create a list item for each row in the file using the row delimiter
    row_list = input_string.split(input_row_delim)

    #output var
    nested_data = []
    
    # create a nested structure to store each column separately
    # (a list of rows where each row is a list of columns)
    for i in range(len(row_list)): 
        row = row_list[i]
        col = row.split(input_col_delim)
        nested_data.append(col)
    
    # return the nested structure
    return nested_data


################################################################
##### MAIN PROGRAM #############################################
################################################################

# 1. read the census data file into a variable

file_handle = open('raw_census_2010_copy.csv',"r")
census_data = file_handle.read()
file_handle.close()

# 2. clean the data by removing the "...,..." problem using a UDF

clean_census_data = unquotable(census_data)

# 3. create a nested data structure with a UDF

nested_census_data = nester(clean_census_data,"\n",",")

# 4. remove unwanted columns using a UDF

# HOMEWORK

# 5. create a new string from the nested structure using a UDF

# HOMEWORK

# 6. write the file to disk

file_handle = open('raw_census_2010_copy.csv',"w") # "w" for writing; note this overwrites our working copy
file_handle.write('final data var from step #5 goes here')
file_handle.close()
```

---

# ⭕ **QUESTIONS?**

---