# Working With Files and Doing Data Cleaning

## First, a Review on Planning Your Programs

One of the most difficult things about learning to program is learning how to start: what to do before you write the code, and how to work your way through the coding process.

A good general approach is to think through the problem you want to solve first, just conceptually.  How will you know you have solved it?  Are there tests that you can use to be sure?  Can you break the problem down into smaller components, and solve those sequentially?  This step is conceptualizing your algorithm, or your plan for the code.

What approach would you use to solve each of those components? Can you describe those steps in English? We call this step writing 'pseudo-code'.

Finally, there is the coding step. And the inevitable debugging step.  You really can't do one without the other.

Generally, it is good practice to work your way through problems in this way, and write the code for each building block, testing it to be sure it works for all the kinds of cases you can imagine, then test them together.  You'll end up being more productive, and far less frustrated, using a systematic, problem-solving approach.  

And by all means, don't try to tackle it all at once.  Below is an example of how to work through this process.

### Phase 1: Conceptual Plan

What do we need to do to determine whether each number between 1 and 100 is a prime number?

1. We need to see if any given number can be divided evenly by any other number besides 1 and itself. 
2. We need to test this for every number between 1 and 100.

### Phase 2: Pseudo-Code

1. Write a function (isprime) to test whether a number passed to it as an argument (x) is a prime number.
Iterate over all values from x to 1.
At each iteration, test whether the original x is evenly divisible by this iteration value.
Keep track of how many times you get an evenly divisible result.
If the count is exactly 2 (x is evenly divisible only by 1 and by itself), call x a prime number.

2. Write a loop from 1 to 100, call this value (z).
Within the loop, call function isprime, and pass it the value of z.
If isprime reports that z is prime, add z to a list.
After the loop, print the list of prime numbers.
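
The plan above can be sketched in Python roughly as follows. This is one straightforward translation, not the most efficient approach: it counts how many values from x down to 1 divide x evenly, and treats a number with exactly two such divisors as prime.

```python
def isprime(x):
    """Return True if x has exactly two divisors (1 and itself)."""
    divisor_count = 0
    # iterate over all values from x down to 1
    for candidate in range(x, 0, -1):
        # count every value that divides x evenly
        if x % candidate == 0:
            divisor_count += 1
    return divisor_count == 2

# test every number from 1 to 100 and collect the primes
primes = []
for z in range(1, 101):
    if isprime(z):
        primes.append(z)

print(primes)
```

Note that 1 has only one divisor, so this function does not report it as prime.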

### Phase 3: Code Incrementally, Test, and Document

Generally, build the code one step at a time, and test that step.  Add comments to explain your logic.  Make sure you include narrative in your assignments explaining your reasoning, and add explanatory comments in the code every few lines to explain what you are doing in each part.

## Working With Files

Most data you might want to work with is likely to be in some kind of file.  You often will want to work with data in comma separated text files, or in spreadsheet tables, or in tables stored in a database.  And you will increasingly find data online, not only as text or spreadsheets to download, but as Open Data APIs, returned from a web service as a JSON object.  This session covers how to work with a variety of file formats in Python, and how to begin processing data in files to clean data.

### Basics of Reading and Writing Files in Python

Let's start by creating a simple file, and then reading it back.  We will use 'open' to open a file we will call 'tempfile.txt' in 'write' (w) mode.  We will assign the file object, which is **iterable**, to a variable we will arbitrarily call 'f'.

In [24]:
f = open('tempfile.txt', 'w')
for i in range(10):
    f.write('this is line ' + str(i) + '\n')
f.close()

In [9]:
with open('mytextfile.txt', 'w') as m:
    m.write('this is my text file \nThis is the second line in it \nand the third')


In [29]:
m = open('mytextfile.txt', 'r')
m.readline()


'this is my text file \n'

In [30]:
m.readline()

'This is the second line in it \n'

You can open the text file in an editor to verify that this code wrote the file as expected.  Now open the file we just created in Python, in read mode (r).

In [25]:
f = open('tempfile.txt', 'r')

The first method to read a file is to read it in all at once, with the readlines() method. Here we load the whole file into memory and assign it to a, a list containing one string per line. Note that we re-open the file here to start from the beginning. If a file object has already been read to the end, further reads return nothing (an empty string from read(), an empty list from readlines()).

In [2]:
f = open('tempfile.txt', 'r')
a = f.readlines()
a

['this is line 0\n',
 'this is line 1\n',
 'this is line 2\n',
 'this is line 3\n',
 'this is line 4\n',
 'this is line 5\n',
 'this is line 6\n',
 'this is line 7\n',
 'this is line 8\n',
 'this is line 9\n']

In [3]:
type(a)

list

An alternative approach is to step through reading each line of the file with the readline() method.  Notice that each time you execute this it advances to the next line.  Each line is read into a string object.  In this case we are not doing anything with that object except printing it.

In [None]:
f = open('tempfile.txt', 'r')
print(f.readline())
print(f.readline())

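Here is another way to loop through the lines of the file and print them all out. Notice that printing a line shows it without the quotes, and that the newline at the end of each line is printed as an actual line break. Passing end='' stops print from adding a second newline of its own.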

In [None]:
f = open('tempfile.txt', 'r')
for line in f:
    print(line, end='')
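
If you want the text of each line without its trailing newline, one common option is to strip it as you read. The sketch below writes a small throwaway file first so that it runs on its own; the file name example_lines.txt and the variable clean_lines are made up for this illustration.

```python
import os
import tempfile

# write a small throwaway file (hypothetical name) so the example is self-contained
path = os.path.join(tempfile.mkdtemp(), 'example_lines.txt')
with open(path, 'w') as f:
    for i in range(3):
        f.write('this is line ' + str(i) + '\n')

# read it back, removing the trailing newline from each line
clean_lines = []
with open(path, 'r') as f:
    for line in f:
        clean_lines.append(line.rstrip('\n'))

print(clean_lines)
```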

The plural version of the readline() method, readlines(), generates a list with one string for each line of the file.  Notice that this list contains the raw text contents, including the newline string '\n'.

In [None]:
f = open('tempfile.txt', 'r')
a = f.readlines()
a

Using **with** is a handy way to open a file, load its data, and automatically close the file.

In [31]:
with open('tempfile.txt', 'r') as f:
    read_data = f.read()
print(read_data)
f.closed


this is line 0
this is line 1
this is line 2
this is line 3
this is line 4
this is line 5
this is line 6
this is line 7
this is line 8
this is line 9



True

In [34]:
with open('mytextfile.txt', 'r') as m:
    print(m.readlines())

['this is my text file \n', 'This is the second line in it \n', 'and the third']


### Working with JSON

JSON (JavaScript Object Notation) is a common format for data exchanged with web services and web browsers, which generally run JavaScript.

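The json dumps() method converts Python objects to a JSON-formatted string, using the counterpart format for each data type: a dict becomes a JSON object, a list or tuple becomes an array, a str becomes a string, int and float become numbers, True and False become true and false, and None becomes null.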

In [35]:
import json

json.dumps([1,2,3])

'[1, 2, 3]'
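
To see how individual Python types are translated, here is a small sketch that encodes a dictionary holding several types and then decodes it again with json.loads(), the string counterpart of load(). The record itself is made up for illustration.

```python
import json

# a made-up record mixing several Python types
record = {
    'name': 'foster city',
    'price': 2495,
    'vacant': True,
    'sqft': None,
    'coords': (-122.27, 37.5538),
}

# encode to a JSON-formatted string
encoded = json.dumps(record)
print(encoded)

# decode the string back into Python objects
decoded = json.loads(encoded)
print(decoded)
```

Notice that the round trip is not always exact: the tuple comes back as a list, because JSON only has arrays.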

Notice that objects can be complex, containing multiple types of data, and still be easily translated between Python objects and JSON format.  The following example converts a Python list, containing one element that is a dictionary, to JSON.

In [None]:
json.dumps([1,2,3,{'foo': 'bar'}])

Below we convert the contents of tempfile to a JSON-formatted string.

In [None]:
f = open('tempfile.txt', 'r')

x = json.dumps(f.readlines())
x

With the dump() method, we can write JSON data to a file.  Here we read tempfile, and create a new JSON formatted file into which we write the contents of tempfile.

In [None]:
# use with so both files are closed (and the JSON data flushed to disk) when we are done
with open('tempfile.txt', 'r') as f, open('temp.json', 'w') as j:
    json.dump(f.readlines(), j)

In [36]:
json.dump?

Using the load() method, we can read JSON formatted data and load it into a Python object.

In [None]:
j = open('temp.json', 'r')
x = json.load(j)
x
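
The dump() and load() steps can be combined into a round trip. The sketch below uses **with** for both steps so that the file is closed, and its contents fully written, before it is read back. The file name example.json and the sample lines are placeholders for this illustration.

```python
import json
import os
import tempfile

# sample data standing in for the lines of a text file
lines = ['this is line 0\n', 'this is line 1\n']

# hypothetical output path in a temporary directory
json_path = os.path.join(tempfile.mkdtemp(), 'example.json')

# write the list to a JSON file; the file is closed when the block ends
with open(json_path, 'w') as j:
    json.dump(lines, j)

# read the JSON file back into a Python list
with open(json_path, 'r') as j:
    restored = json.load(j)

print(restored == lines)
```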

### Working with CSV Files

CSV (Comma Separated Values) is probably the most common format of data you will encounter.  Files in this format are often exported from a database table or from Excel, or just used as a simple, standard text (ASCII) file format for ease of use.

Let's begin by writing a CSV file like the JSON example above, by importing the csv module, and writing a file with several columns, separated by commas.

In [37]:
my_data = []
for i in range(10):
    my_data.append([i, i*2, i+2])
my_data

[[0, 0, 2],
 [1, 2, 3],
 [2, 4, 4],
 [3, 6, 5],
 [4, 8, 6],
 [5, 10, 7],
 [6, 12, 8],
 [7, 14, 9],
 [8, 16, 10],
 [9, 18, 11]]

Now we will write the CSV file using my_data, and adding a header row first with column names.  Note that we open the file as before, in write mode, but now use the writerow() method to write one row with the header, and writerows() to iterate over the rows and write them to the file.

In [40]:
import csv
with open('my_data.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(["x", "y", "z"])
    writer.writerows(my_data)

In [45]:
import csv
with open('my_data2.csv', 'w', newline = '') as f:
    csv.writer(f).writerow(['x'])

Reading a CSV file is very similar to writing one, but simpler.  We create a reader object that is iterable, and then we can iterate over the rows and do things, like print each row.

In [None]:
with open('my_data.csv', newline='') as f:
    reader = csv.reader(f)
    for row in reader:
        print(row)

In [56]:
a = []
with open('my_data.csv', newline='') as f:
    for r in csv.reader(f):
        a.append(r)

a

[['x', 'y', 'z'],
 ['0', '0', '2'],
 ['1', '2', '3'],
 ['2', '4', '4'],
 ['3', '6', '5'],
 ['4', '8', '6'],
 ['5', '10', '7'],
 ['6', '12', '8'],
 ['7', '14', '9'],
 ['8', '16', '10'],
 ['9', '18', '11']]

If we want to actually work with the data, then we need to assign it to an object rather than just printing it.  Here we can use the list() function to convert the iterable reader object to a list of lists, one inner list per row.

In [50]:
with open('my_data.csv', newline='') as f:
    reader = csv.reader(f)
    my_data = list(reader)
my_data

[['x', 'y', 'z'],
 ['0', '0', '2'],
 ['1', '2', '3'],
 ['2', '4', '4'],
 ['3', '6', '5'],
 ['4', '8', '6'],
 ['5', '10', '7'],
 ['6', '12', '8'],
 ['7', '14', '9'],
 ['8', '16', '10'],
 ['9', '18', '11']]

If you want to skip the header row in order to have the data without the header, you can use **next** after instantiating the reader object, to advance one row in the CSV file.

In [None]:
with open('my_data.csv', newline='') as f:
    reader = csv.reader(f)
    next(reader)
    my_data = list(reader)
my_data

Since the data is now available as an object, you can do normal Python processing on it, like selecting the first entry of each row and printing it.

In [None]:
for row in my_data:
    print(row[0])
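
Keep in mind that the csv reader returns every value as a string, so you need to convert the values before doing arithmetic on them. Here is a minimal sketch, using an in-memory string in place of my_data.csv so that it runs on its own; the names csv_text, numeric_rows and column_y_total are made up for this example.

```python
import csv
import io

# in-memory stand-in for the first few rows of my_data.csv
csv_text = 'x,y,z\n0,0,2\n1,2,3\n2,4,4\n'

reader = csv.reader(io.StringIO(csv_text))
header = next(reader)  # skip (and keep) the header row

# convert every value in every row from a string to an int
numeric_rows = []
for row in reader:
    numeric_rows.append([int(value) for value in row])

print(header)
print(numeric_rows)

# now arithmetic works, for example the total of the y column
column_y_total = sum(row[1] for row in numeric_rows)
print(column_y_total)
```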

### Reading a CSV File and Computing Statistics With it

Use rain.csv to calculate the mean and maximum values in a column.

In [None]:
with open('rain.csv', 'r') as csvfile:
    
    # initialize a counter and variables to contain our descriptive stats
    count = 0 #at the end, divide cumulative_sum by this to get the mean
    cumulative_sum = 0 #our rolling sum
    max_value = -1 #start below any possible value (rainfall is never negative)
    
    # open the file and skip the header row
    my_csv = csv.reader(csvfile)
    next(my_csv)
    
    # loop through each data row
    for row in my_csv:
        
        # rainfall amount is in column 1, only process this row's value if not an empty string
        if not row[1] == '':
            
            # increment the counter and extract this row's rainfall as a float
            count = count + 1
            rainfall = float(row[1])
            
            # add this row's rainfall to the cumulative sum
            cumulative_sum = cumulative_sum + rainfall
            
            # if this row's rainfall is greater than the current max value, update with the new max
            if rainfall > max_value:
                max_value = rainfall

    # after looping through all the rows, divide the cumulative sum by the count and round to get the mean
    mean_value = round(cumulative_sum / count, 1)
    
    # print out the mean and max values
    print('mean:', mean_value, 'inches')
    print('max:', max_value, 'inches')

How would you find the minimum rainfall amount?

In [65]:
a = []
with open('Data\\rain.csv', 'r') as csvfile:
    raindt = csv.reader(csvfile)
    next(raindt)
    for row in raindt:
        if not row[1] == '':
            a.append(float(row[1]))

minrainfall = min(a)  
minrainfall

0.7
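
Another option is to track the minimum inside the loop, mirroring the way the maximum was tracked earlier. The sketch below uses made-up sample rows in place of rain.csv (the only thing assumed about the real file is that rainfall is in column 1), and starts from float('inf') so that any real value is smaller.

```python
import csv
import io

# hypothetical sample data standing in for rain.csv
sample = 'month,rainfall\nJan,4.5\nFeb,\nMar,0.7\nApr,2.1\n'

# start above any possible value so the first real value replaces it
min_value = float('inf')

reader = csv.reader(io.StringIO(sample))
next(reader)  # skip the header row

for row in reader:
    # only process rows where the rainfall value is not missing
    if row[1] != '':
        rainfall = float(row[1])
        if rainfall < min_value:
            min_value = rainfall

print('min:', min_value, 'inches')
```

This avoids building a list of all the values, which matters little here but can help with very large files.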

### Cleaning up Messy Data

Let's look at another data file - one that contains a few Craigslist rental listings, that we have already done some cleanup on.

In [20]:
with open('Data//rents_raw.csv', 'r') as csvfile:
    my_csv = csv.reader(csvfile)
    for row in my_csv:
        print(row)

['neighborhood', 'price', 'bedrooms', 'date', 'sqft', 'longitude', 'latitude']
['foster city', '2495', '1', '11/14/2014 12:26', '755', '-122.27', '37.5538']
['palo alto', '2695', '', '11/14/2014 12:25', '443', '-122.161524', '37.450289']
['brisbane', '3150', '2', '11/14/2014 12:24', '1242', '-122.417912', '37.692415']
['palo alto', '2800', '2', '11/14/2014 12:24', '', '', '']
['san mateo', '2196', '1', '11/14/2014 12:24', '676', '-122.2998', '37.5395']
['santa clara', '3264', '3', '11/14/2014 12:28', '1138', '', '']
['san jose south', '2000', '2', '11/14/2014 12:28', '822', '-121.902268', '37.253503']
['sunnyvale', '4740', '3', '11/14/2014 12:28', '1406', '-122.034683', '37.368445']
['inner sunset / UCSF', '3395', '2', '11/14/2014 12:32', '', '-122.479345', '37.764582']
['richmond / seacliff', '2699', '1', '11/14/2014 12:32', '', '-122.503781', '37.7718']
['SOMA / south beach', '3620', '1', '11/14/2014 12:30', '860', '-122.395195', '37.775133']
['dublin / pleasanton / livermore', '2025

In [22]:
# the column headers are the first row in the data file
# use next to iterate our csv reader to the first row to grab the headers
with open('Data//rents_raw.csv', 'r') as csvfile:
    my_csv = csv.reader(csvfile)
    headers = next(my_csv)
    print(headers)

['neighborhood', 'price', 'bedrooms', 'date', 'sqft', 'longitude', 'latitude']


In [71]:
# what is the 1st column (zero-indexed) in our data set?
headers[0]

'neighborhood'

In [23]:
# for each row in the data set, print the price column's value
with open('Data//rents_raw.csv', 'r') as csvfile:
    my_csv = csv.reader(csvfile)
    for row in my_csv:
        print(row[1])

price
2495
2695
3150
2800
2196
3264
2000
4740
3395
2699
3620
2025

1795
4299


In [24]:
# create a new list to contain the column of prices in the data set
prices = []
with open('Data//rents_raw.csv', 'r') as csvfile:
    my_csv = csv.reader(csvfile)
    for row in my_csv:
        prices.append(row[1])  
prices

['price',
 '2495',
 '2695',
 '3150',
 '2800',
 '2196',
 '3264',
 '2000',
 '4740',
 '3395',
 '2699',
 '3620',
 '2025',
 '',
 '1795',
 '4299']

This list has a couple of problems. First, it includes the header. Second, it's all strings even though prices are numeric data. Third, it contains some empty strings. We'll have to clean it up.

In [25]:
# to remove the first element of the list, we can just capture position 1 through the end of the list
prices_noheader = prices[1:]
prices_noheader

['2495',
 '2695',
 '3150',
 '2800',
 '2196',
 '3264',
 '2000',
 '4740',
 '3395',
 '2699',
 '3620',
 '2025',
 '',
 '1795',
 '4299']

In [26]:
# now let's convert the price strings to integers
for price in prices_noheader:
    print(int(float(price)), ' ')

2495  
2695  
3150  
2800  
2196  
3264  
2000  
4740  
3395  
2699  
3620  
2025  


ValueError: could not convert string to float: 

In [27]:
# you can't convert an empty string to a numeric type
for price in prices_noheader:
    if not price == '':
        print(int(float(price)))
    else:
        print('None')

2495
2695
3150
2800
2196
3264
2000
4740
3395
2699
3620
2025
None
1795
4299


In [28]:
# skip empty strings, and use try/except to catch any other value that can't be converted
for price in prices_noheader:
    try:
        if not price == '':
            print(int(float(price)))
    except ValueError:
        print('None')

2495
2695
3150
2800
2196
3264
2000
4740
3395
2699
3620
2025
1795
4299


In [29]:
# encapsulate this functionality inside a new function
def extract_int_price(price):
    if not price == '':
        return int(float(price))
    else:
        return None

In [30]:
# use our function to convert each element in the list of prices to an integer
for price in prices_noheader:
    print(extract_int_price(price))

2495
2695
3150
2800
2196
3264
2000
4740
3395
2699
3620
2025
None
1795
4299


In [31]:
# rather than just printing each converted value, turn it into a new list called int_prices
int_prices = []
for price in prices_noheader:
    int_prices.append(extract_int_price(price))
print(int_prices)

[2495, 2695, 3150, 2800, 2196, 3264, 2000, 4740, 3395, 2699, 3620, 2025, None, 1795, 4299]
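
The same loop can be written more compactly as a list comprehension, and a second comprehension can drop the missing values when you need only the valid prices. This is a sketch on a shortened sample of the price strings; valid_prices is a name introduced here for illustration.

```python
def extract_int_price(price):
    # convert a price string to an int, or return None for an empty string
    if not price == '':
        return int(float(price))
    else:
        return None

# a shortened sample of the price strings, including one missing value
prices_noheader = ['2495', '2695', '', '1795']

# build the converted list in a single expression
int_prices = [extract_int_price(price) for price in prices_noheader]

# keep only the prices that are not missing
valid_prices = [price for price in int_prices if price is not None]

print(int_prices)
print(valid_prices)
```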


### Now let's clean up our neighborhood names


In [33]:
# replace any forward slashes in neighborhood name with a hyphen
with open('Data//rents_raw.csv', 'r') as csvfile:
    my_csv = csv.reader(csvfile)
    next(my_csv) #skip the header row
    for row in my_csv:
        print(row[0].replace('/', '-')) #use string.replace() method

foster city
palo alto
brisbane
palo alto
san mateo
santa clara
san jose south
sunnyvale
inner sunset - UCSF
richmond - seacliff
SOMA - south beach
dublin - pleasanton - livermore
concord - pleasant hill - martinez
hercules, pinole, san pablo, el sob
corte madera


### Create a new data set with cleaned up variables

In [34]:
# create a new function to convert bedrooms from a string to an int
def extract_int_bedrooms(bedrooms):
    if not bedrooms == '':
        return int(float(bedrooms))
    else:
        return None

In [35]:
# create a new function to replace forward slashes with hyphens and remove commas
def clean_neighborhood(neighborhood_name):
    # you can daisy chain multiple str.replace() calls
    return neighborhood_name.replace('/', '-').replace(',', '')
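
Following the advice to test each building block, it is worth checking a small function like this on a few known inputs before using it on the whole file. A minimal sketch, using two neighborhood names from the data set:

```python
def clean_neighborhood(neighborhood_name):
    # replace forward slashes with hyphens, then remove commas
    return neighborhood_name.replace('/', '-').replace(',', '')

# check the function against inputs where we know the answer we want
slash_result = clean_neighborhood('inner sunset / UCSF')
comma_result = clean_neighborhood('hercules, pinole, san pablo, el sob')

assert slash_result == 'inner sunset - UCSF'
assert comma_result == 'hercules pinole san pablo el sob'

print(slash_result)
print(comma_result)
```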

In [37]:
# clean the data set by calling the cleaning functions and save the results to variables
rentals_cleaned = []
with open('Data//rents_raw.csv', 'r') as csvfile:
    my_csv = csv.reader(csvfile)
    next(my_csv)
    for row in my_csv:
        neighborhood_cleaned = clean_neighborhood(row[0])
        price_cleaned = extract_int_price(row[1])
        bedrooms_cleaned = extract_int_bedrooms(row[2])
        rentals_cleaned.append([neighborhood_cleaned, price_cleaned, bedrooms_cleaned])      

# display our nested lists of data        
rentals_cleaned

[['foster city', 2495, 1],
 ['palo alto', 2695, None],
 ['brisbane', 3150, 2],
 ['palo alto', 2800, 2],
 ['san mateo', 2196, 1],
 ['santa clara', 3264, 3],
 ['san jose south', 2000, 2],
 ['sunnyvale', 4740, 3],
 ['inner sunset - UCSF', 3395, 2],
 ['richmond - seacliff', 2699, 1],
 ['SOMA - south beach', 3620, 1],
 ['dublin - pleasanton - livermore', 2025, 1],
 ['concord - pleasant hill - martinez', None, 2],
 ['hercules pinole san pablo el sob', 1795, 1],
 ['corte madera', 4299, 3]]
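
Once the data is cleaned, you will usually want to save it. Here is a sketch of writing a few of the cleaned rows to a new CSV file and reading them back. The output file name rents_cleaned.csv is made up for this example. Note that csv.writer writes None as an empty field, so missing values come back as empty strings.

```python
import csv
import os
import tempfile

# a few rows taken from the cleaned data above
rentals_cleaned = [
    ['foster city', 2495, 1],
    ['palo alto', 2695, None],
    ['concord - pleasant hill - martinez', None, 2],
]

# hypothetical output path in a temporary directory
out_path = os.path.join(tempfile.mkdtemp(), 'rents_cleaned.csv')

# write a header row followed by the cleaned rows
with open(out_path, 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['neighborhood', 'price', 'bedrooms'])
    writer.writerows(rentals_cleaned)

# read the file back to check what was written
with open(out_path, newline='') as f:
    rows_read = list(csv.reader(f))

print(rows_read)
```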

# Exercise: 

1. Calculate the price per square foot, and write the result to a new file.  
2. Calculate the average price per square foot. 
3. Explain how you have dealt with missing data in 1 and 2, and how that might affect your result. 

### Question 1

In [38]:
import csv
#/Users//eugene//Documents//Github//ce599//04-Data Files and Cleaning//

In [39]:
sqrts = []
with open('Data//rents_raw.csv', 'r') as csvfile:
    my_csv = csv.reader(csvfile)
    for row in my_csv:
        sqrts.append(row[4])  
sqrts
sqrts_noheader = sqrts[1:]

In [40]:
def extract_int_price(price):
    if not price == '':
        return int(float(price))
    else:
        return None
def extract_sqrt(sqrt):
    if not sqrt == '':
        return int(float(sqrt))
    else:
        return None

In [41]:
priceraw = []
for price in prices_noheader:
    priceraw.append(extract_int_price(price))

In [42]:
sqftraw = []
for sqft in sqrts_noheader:
    sqftraw.append(extract_sqrt(sqft))
sqftraw

[755,
 443,
 1242,
 None,
 676,
 1138,
 822,
 1406,
 None,
 None,
 860,
 636,
 1019,
 715,
 1533]

In [43]:
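cumsump, countp, cumsq, counts = 0, 0, 0, 0
# sum and count the non-missing prices, then take the (integer) mean
for price in priceraw:
    if price is not None:
        countp += 1
        cumsump += price
meanprice = cumsump//countp

# do the same for the non-missing square footage values
for s in sqftraw:
    if s is not None:
        counts += 1
        cumsq += s
meansqft = cumsq//counts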



In [44]:
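# replace each missing price with the mean price
prices = []
for price in priceraw:
    if price is None:
        price = meanprice
    prices.append(price)
prices

# replace each missing square footage with the mean square footage
sqft = []
for s in sqftraw:
    if s is None:
        s = meansqft
    sqft.append(s)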

In [46]:
#price per sqft = ppsf
ppsf = []
for i in range(len(prices)):
    ppsqft = prices[i]/sqft[i]
    ppsf.append(round(ppsqft,2))
ppsf

[3.3,
 6.08,
 2.54,
 2.99,
 3.25,
 2.87,
 2.43,
 3.37,
 3.62,
 2.88,
 4.21,
 3.18,
 2.89,
 2.51,
 2.8]

In [57]:
with open('price_per_sqft.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['Price/sqft'])
    writer.writerows([r] for r in ppsf)

### Question 2

In [66]:
with open('price_per_sqft.csv', 'r') as csvfile:
    count = 0
    cumsum = 0
    my_csv = csv.reader(csvfile)
    next(my_csv)
    for row in my_csv:
        count += 1
        cumsum += float(row[0])
#average price per square foot
avg_ppsf = cumsum/count
avg_ppsf

3.261333333333333

The mean price and mean square footage were used in place of missing values in the files we read. This could bias the computed statistics, because the real prices and areas of those listings could differ significantly from the mean.
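
An alternative way to handle missing data is to skip any listing that lacks a price or a square footage, rather than filling in the mean. The sketch below applies that approach to the price and square footage values shown earlier; ppsf_complete and avg_complete are names introduced here for illustration. Comparing its result with the average computed above gives a sense of how much the choice matters.

```python
# price and square footage values from the data set, with None for missing entries
prices = [2495, 2695, 3150, 2800, 2196, 3264, 2000, 4740,
          3395, 2699, 3620, 2025, None, 1795, 4299]
sqft = [755, 443, 1242, None, 676, 1138, 822, 1406,
        None, None, 860, 636, 1019, 715, 1533]

# compute price per square foot only where both values are present
ppsf_complete = []
for price, area in zip(prices, sqft):
    if price is not None and area is not None:
        ppsf_complete.append(price / area)

# average over the complete listings only
avg_complete = sum(ppsf_complete) / len(ppsf_complete)

print(len(ppsf_complete), 'complete listings')
print(round(avg_complete, 2))
```

Skipping rows avoids inventing values, but it shrinks the sample and can introduce its own bias if the listings with missing data differ systematically from the rest.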

## What You've Learnt

f = open('file.extension', mode), where mode is 'w' or 'r'
w for write mode
r for read mode
then f.close(), or use with open(...) as f: to close the file automatically

### JSON 

import json


json.dumps(  ) - to convert Python objects to a JSON-formatted string
json.dump(  ) - to write Python objects to a file in JSON format
json.load(  ) - to read a JSON file into a Python object

### CSV

In [None]:
import csv

csv.writer(   ) to write rows to a CSV file
csv.reader(   ) to read in a CSV file
