# File Handling

Today we will learn how to read from and write to files on your computer using a Python script! Credit to Rochelle Terman.

In [1]:
## Import required libraries
import tweepy
import json

### Reading from a file

Reading a file requires three steps:

1. Opening the file
2. Reading the file
3. Closing the file

In a Jupyter notebook, a leading exclamation point `!` runs the rest of the line as a shell command, for example in [bash](https://en.wikipedia.org/wiki/Bash_(Unix_shell)). The `touch` command creates an empty file if it does not already exist. Its argument is the name of the file to create.

In [5]:
!touch sample.txt

In [6]:
my_file = open("sample.txt", "r")
text = my_file.read()
my_file.close()

print("--" + text + "--")
print(len(text))

----
0


We see that when we create a new file using bash, it's empty. Let's try reading from a file with text in it; for example, `example.txt`.

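After we read from the file, we must be sure to close it. An unclosed file keeps holding an operating-system resource, and for a file opened for writing, buffered data may not reach the disk until the file is closed, which can lead to security or data integrity problems within the program.

(Also note that `"\n"` is the newline character in Python.)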
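One way to make sure the file is closed even if reading raises an error is a `try`/`finally` block. The following is a minimal sketch; the file name `demo.txt` and the temporary directory are assumptions made so that it runs on its own.

```python
import os
import tempfile

# Create a small throwaway file so the sketch is self-contained.
path = os.path.join(tempfile.mkdtemp(), "demo.txt")
setup_file = open(path, "w")
setup_file.write("first line\nsecond line\n")
setup_file.close()

my_file = open(path, "r")
try:
    text = my_file.read()
finally:
    # This runs whether or not read() raised an exception.
    my_file.close()

print(repr(text))
print(my_file.closed)
```

The `closed` attribute of a file object reports whether `close()` has already been called on it.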

In [7]:
my_file = open("example.txt", "r")
text = my_file.read()
my_file.close()

print("--\n" + text + "--")
print(len(text))

--
This is line 1.
This is line 2.
This is line 3.
This is line 4.
This is line 5.
--
80


However, if you use the `with open` syntax, Python will automatically close the file for you when the indented block ends. The `'r'` indicates that you are reading the file, as opposed to, say, writing to it (`'w'`). If we don't include the mode argument, `open` defaults to read-only mode.

In [9]:
# better code
with open('example.txt', 'r') as my_file:
    text = my_file.read()
# my_file.read() would fail here, because the file is already closed
print("--\n" + text + "--")
print(len(text))

--
This is line 1.
This is line 2.
This is line 3.
This is line 4.
This is line 5.
--
80


The `with` statement keeps the file open for as long as the program is inside the indented block. Once outside, the file is closed, and you can no longer read from it. You can only access what you have saved to a variable.
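The behaviour described above can be checked directly. This is a small sketch, assuming a throwaway file called `demo.txt` in a temporary directory (not one of the lesson's files).

```python
import os
import tempfile

# Create a throwaway file so the sketch is self-contained.
path = os.path.join(tempfile.mkdtemp(), "demo.txt")
with open(path, "w") as setup_file:
    setup_file.write("hello\n")

with open(path, "r") as my_file:
    text = my_file.read()
    inside = my_file.closed   # still open inside the block

outside = my_file.closed      # closed once the block has ended

try:
    my_file.read()
    error_message = None
except ValueError as err:
    # Reading from a closed file raises ValueError.
    error_message = str(err)

print(inside, outside)
print(error_message)
print(repr(text))  # the saved variable is still available
```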

### Reading a file as a list

Often, we want to read a file line by line, storing those lines as a list. To do that, Python has syntax that reads very much like English: we simply say `for line in my_file`.

In [10]:
stored = []
with open('example.txt', 'r') as my_file:
    for line in my_file:
        stored.append(line)

In [11]:
stored

['This is line 1.\n',
 'This is line 2.\n',
 'This is line 3.\n',
 'This is line 4.\n',
 'This is line 5.\n']

As we learned in the Python review, we can use the String `strip` [method](https://github.com/dlab-berkeley/python-intensive/blob/master/Glossary.md#method) to get rid of those newline breaks at the end of each line.

In [18]:
stored = []
with open('example.txt', 'r') as my_file:
    for line in my_file:
        stored.append(line.strip())

In [15]:
stored

['This is line 1.',
 'This is line 2.',
 'This is line 3.',
 'This is line 4.',
 'This is line 5.']
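The same list can be built more compactly. The sketch below shows two common alternatives, a list comprehension and `read().splitlines()`, on an assumed two-line throwaway file rather than `example.txt`.

```python
import os
import tempfile

# Throwaway file with two lines, so the sketch is self-contained.
path = os.path.join(tempfile.mkdtemp(), "lines_demo.txt")
with open(path, "w") as setup_file:
    setup_file.write("This is line 1.\nThis is line 2.\n")

# Option 1: a list comprehension over the file object.
with open(path, "r") as my_file:
    stripped = [line.strip() for line in my_file]

# Option 2: read everything, then split on line breaks.
with open(path, "r") as my_file:
    split = my_file.read().splitlines()

print(stripped)
print(split)
```

One difference to keep in mind: `strip()` also removes leading and trailing spaces and tabs, while `splitlines()` removes only the line breaks.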

### Writing to a file

We can use the same `with open` syntax for writing files as well.

In [19]:
# this is okay...
new_file = open("example2.txt", "w")
bees = ['bears', 'beets', 'Battlestar Galactica']
for i in bees:
    new_file.write(i + '\n')
new_file.close()

Another useful bash command is `cat`, which requires a single parameter filename. When you run `cat filename`, the contents of the file named `filename` will be printed out.

In [20]:
!cat example2.txt

bears
beets
Battlestar Galactica


In [21]:
# but this is better...
bees = ['bears', 'beets', 'Battlestar Galactica']
with open('example2.txt', 'w') as new_file:
    for i in bees:
        new_file.write(i + '\n')

In [22]:
!cat example2.txt

bears
beets
Battlestar Galactica


### Using the CSV Module

It is often useful to have a program write its results to a CSV file. Python's standard library includes a `csv` module, which makes this process easy. With `csv.DictReader`, a CSV file is read as a sequence of dictionaries, one per row, keyed by the column headers.

In [23]:
import csv

In [24]:
# read the CSV into a list of dictionaries
capitals = [] # make empty list
with open('capitals.csv', 'r') as csvfile: # open file
    reader = csv.DictReader(csvfile) # create a reader
    for row in reader: # loop through rows
        capitals.append(row)

In [25]:
capitals[:5]

[{'Capital': 'Kabul',
  'Country': 'Afghanistan',
  'Latitude': "34¡28'N",
  'Longitude': "69¡11'E"},
 {'Capital': 'Tirane',
  'Country': 'Albania',
  'Latitude': "41¡18'N",
  'Longitude': "19¡49'E"},
 {'Capital': 'Algiers',
  'Country': 'Algeria',
  'Latitude': "36¡42'N",
  'Longitude': "03¡08'E"},
 {'Capital': 'Pago Pago',
  'Country': 'American Samoa',
  'Latitude': "14¡16'S",
  'Longitude': "170¡43'W"},
 {'Capital': 'Andorra la Vella',
  'Country': 'Andorra',
  'Latitude': "42¡31'N",
  'Longitude': "01¡32'E"}]
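To experiment with `csv.DictReader` without a file on disk, the text can be wrapped in `io.StringIO`, which behaves like an open file. This is only a sketch: the two-row sample is an assumption loosely modelled on `capitals.csv`.

```python
import csv
import io

# A tiny in-memory stand-in for a CSV file: header row, then data rows.
sample = io.StringIO("Country,Capital\nAfghanistan,Kabul\nAlbania,Tirane\n")

# Each row becomes a mapping from column header to cell value.
rows = [dict(row) for row in csv.DictReader(sample)]

print(rows[0]["Capital"])
print(rows)
```

Every value comes back as a string, so numeric columns need an explicit conversion such as `int(...)` or `float(...)`.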

Writing a list of dictionaries to a CSV file is similar:

In [26]:
print(len(capitals))

200


In [27]:
# get the keys in each dictionary
keys = capitals[1].keys()
print(keys)
# convert the data type to a list
keys = list(keys)
print(keys)

dict_keys(['Longitude', 'Capital', 'Latitude', 'Country'])
['Longitude', 'Capital', 'Latitude', 'Country']


In [28]:
# write rows; newline='' lets the csv module control the line endings
with open('capitals2.csv', 'w', newline='') as output_file:
    dict_writer = csv.DictWriter(output_file, ['Country', 'Capital', 'Latitude', 'Longitude'])
    dict_writer.writeheader()
    dict_writer.writerows(capitals)
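The `csv` module documentation recommends opening CSV files with `newline=''`, so that the module handles line endings itself. A short round-trip sketch, with made-up rows and an assumed temporary file name, shows writing and then reading back the same data.

```python
import csv
import os
import tempfile

rows = [
    {"Country": "Afghanistan", "Capital": "Kabul"},
    {"Country": "Albania", "Capital": "Tirane"},
]
path = os.path.join(tempfile.mkdtemp(), "capitals_demo.csv")

# Write the header row, then one line per dictionary.
with open(path, "w", newline="") as output_file:
    dict_writer = csv.DictWriter(output_file, fieldnames=["Country", "Capital"])
    dict_writer.writeheader()
    dict_writer.writerows(rows)

# Read the file back to confirm nothing was lost.
with open(path, "r", newline="") as input_file:
    round_trip = [dict(row) for row in csv.DictReader(input_file)]

print(round_trip == rows)
```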

### Challenge 1: Read in a list

The file `counties.txt` has a column of counties in California. Read the data into a list called `counties_lst`.

In [31]:
counties_lst = []
with open('counties.txt', 'r') as counties:
    for line in counties:
        counties_lst.append(line.strip())

print(counties_lst)
print(len(counties_lst))

['Alameda', 'Alpine', 'Amador', 'Butte', 'Calaveras', 'Colusa', 'Contra Costa', 'Del Norte', 'El Dorado', 'Fresno', 'Glenn', 'Humboldt', 'Imperial', 'Inyo', 'Kern', 'Kings', 'Lake', 'Lassen', 'Los Angeles', 'Madera', 'Marin', 'Mariposa', 'Mendocino', 'Merced', 'Modoc', 'Mono', 'Monterey', 'Napa', 'Nevada', 'Orange', 'Placer', 'Plumas', 'Riverside', 'Sacramento', 'San Benito', 'San Bernardino', 'San Diego', 'San Francisco', 'San Joaquin', 'San Luis Obispo', 'San Mateo', 'Santa Barbara', 'Santa Clara', 'Santa Cruz', 'Shasta', 'Sierra', 'Siskiyou', 'Solano', 'Sonoma', 'Stanislaus', 'Sutter', 'Tehama', 'Trinity', 'Tulare', 'Tuolumne', 'Ventura', 'Yolo', 'Yuba']
58


### Challenge 2: Writing a CSV file

Below is a list of dictionaries representing US states. Write this [object](https://github.com/dlab-berkeley/python-intensive/blob/master/Glossary.md#object) as a CSV file called `states.csv`.

In [33]:
states = [{'state': 'Ohio', 'population': 11.6, 'year in union': 1803, 'state bird': 'Northern cardinal', 'capital': 'Columbus'},
          {'state': 'Michigan', 'population': 9.9, 'year in union': 1837, 'capital': 'Lansing'},
          {'state': 'California', 'population': 39.1, 'year in union': 1850, 'state bird': 'California quail', 'capital': 'Sacramento'},
          {'state': 'Florida', 'population': 20.2, 'year in union': 1845, 'capital': 'Tallahassee'},
          {'state': 'Alabama', 'population': 4.9, 'year in union': 1819, 'capital': 'Montgomery'}]

In [36]:
keys = []

# get a comprehensive list of keys, since not all states have all keys
for state in states:
    for key in state.keys():
        if key not in keys:
            keys.append(key)

print(keys)

with open('states.csv', 'w', newline='') as csv_file:
    dict_write = csv.DictWriter(csv_file, keys)
    dict_write.writeheader()
    dict_write.writerows(states)

['state bird', 'year in union', 'population', 'state', 'capital']
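When a dictionary lacks one of the keys, `csv.DictWriter` leaves that cell empty by default. Its `restval` parameter sets a different filler. The sketch below uses a shortened, assumed version of the `states` list and writes to an in-memory buffer instead of `states.csv`.

```python
import csv
import io

# Shortened stand-in for the states list: Michigan has no 'state bird'.
states_demo = [
    {"state": "Ohio", "state bird": "Northern cardinal"},
    {"state": "Michigan"},
]

# Collect every key that appears in any dictionary, without duplicates.
keys = []
for state in states_demo:
    for key in state:
        if key not in keys:
            keys.append(key)

buffer = io.StringIO()
dict_write = csv.DictWriter(buffer, fieldnames=keys, restval="unknown")
dict_write.writeheader()
dict_write.writerows(states_demo)

lines = buffer.getvalue().splitlines()
print(lines)
```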


### Challenge 3: Write CSV Data to a Numpy array

In [40]:
"""
From last week: As we saw on the website, the 14 attributes used in the
published experiment are as follows. We will use these fields to retrieve
data from the CSV file, and write them into a list of Numpy arrays.
"""
fields = ["age", "sex", "chest_pain_type", "rest_blood_pressure",
          "cholestoral", "fasting_blood_sugar","rest_ecg", "max_hr",
          "ex_ang", "oldpeak", "slope", "ca", "thal", "num"]

In [46]:
import numpy as np

"""
As we saw last week, in order to construct an array, we first create a
list and then use the Numpy np.array(lst) constructor. Now we will
apply this technique to construct a multidimensional array, or matrix.

Use the Python CSV library to read from the CSV data file. Once you
read the data into a list, construct an array that has one row of
values. We add each row to a list, and then create a matrix or array of
arrays from that list.
"""
# for each row, create an array of values corresponding to the fields
arrays = []

# alternative: let NumPy parse the whole file in one call
# (arrays_2 is not used below)
from numpy import genfromtxt
arrays_2 = genfromtxt('processed_cleveland_data.csv', delimiter=',')

with open('processed_cleveland_data.csv', 'r') as my_file:
    reader = csv.DictReader(my_file, fields)
    for row in reader:
        lst = []
        for field in fields:
            value = row[field]
            if value == "?":
                value = 0
            lst.append(value)
        arr = np.array(lst)
        arrays.append(arr)
# note that some values are omitted from the dataset (marked "?"), so you
# might run into errors

# print(arrays)
matrix = np.array(arrays)
print(matrix)

[['63' '1' '1' ..., '0' '6' '0']
 ['67' '1' '4' ..., '3' '3' '2']
 ['67' '1' '4' ..., '2' '7' '1']
 ..., 
 ['57' '1' '4' ..., '1' '7' '3']
 ['57' '0' '2' ..., '1' '3' '1']
 ['38' '1' '3' ..., '0' '3' '0']]


In [None]:
"""
For extra practice on the stuff we learned last week, try to transpose the
array and find the average age of the heart disease patients from the study.
"""

# numpy stuff

### Challenge 4: Writing Twitter API data to a CSV

We will learn (probably next week) about how to use APIs to get both data and functionality from other websites. Below, we initialize some variables necessary to use the Twitter API. The details will be explained next week.

In [None]:
## Our access key, mentioned above (replace each placeholder with your own
## credential; never publish real keys in a shared notebook)
consumer_key = 'YOUR_CONSUMER_KEY'
## Our signature, also given upon app creation
consumer_secret = 'YOUR_CONSUMER_SECRET'
## Our access token, generated upon request
access_token = 'YOUR_ACCESS_TOKEN'
## Our secret access token, also generated upon request
access_token_secret = 'YOUR_ACCESS_TOKEN_SECRET'

## Set of Tweepy authorization commands
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)
api = tweepy.API(auth)

Now we make a query string using URL formatting (also coming next week!), and we send it to Twitter to retrieve data about Hillary Clinton and Donald Trump. The query will return a list of Twitter statuses, each of which has data that we will write to the CSV.

In [None]:
# Search for tweets mentioning 'hillary' or 'clinton' together with a
# positive emoticon; %20 is a URL-encoded space and %3A%29 is ":)"
query1 = "hillary%20OR%20clinton%20%3A%29"

# Search for tweets mentioning 'donald' or 'trump' together with a
# positive emoticon
query2 = "donald%20OR%20trump%20%3A%29"

# note: Tweepy 4.0 and later name this method api.search_tweets
results1 = api.search(q=query1)
results2 = api.search(q=query2)
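The `%20` and `%3A%29` pieces in the query strings are URL-encoded characters. As a small illustration (the variable names here are assumptions, not part of the lesson), the standard library's `urllib.parse` can convert between the readable and encoded forms.

```python
from urllib.parse import quote, unquote

# Encode a readable query: spaces become %20, ':' becomes %3A, ')' becomes %29.
readable = "hillary OR clinton :)"
encoded = quote(readable)

# Decode an already-encoded query back into readable text.
decoded = unquote("donald%20OR%20trump%20%3A%29")

print(encoded)
print(decoded)
```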

*Remember*: in order to write a list of dictionaries to a CSV file, we need two things: a list of **all** keys found in any of the dictionaries, and the list of dictionaries itself.

In [None]:
'''
Things to know:
- results1 and results2 are lists
- Each item in lists results1 and results2 is a Twitter status object, which
  has a _json attribute
- This _json attribute can be accessed from the status using "dot notation"
- This _json attribute can be used as a dictionary
- We also need a list of keys *without duplicates* in order to write to a
  CSV file
'''

# Your variables here are:
## "keys1": a list of keys for the first set of statuses
## "lst_1": a list of _json dictionary objects
keys1 = []
lst_1 = []

for status in results1:
    dictionary = status._json # access this using dot notation!
    lst_1.append(dictionary) # function for adding to a list
    for key in dictionary.keys():
        if key not in keys1: # check for duplicates
            keys1.append(key)

print("KEYS 1: " + str(keys1) + "\n")

# Your variables here are:
## "keys2": a list of keys for the second set of statuses
## "lst_2": a list of _json dictionary objects
keys2 = []
lst_2 = []
for status in results2:
    dictionary = status._json
    lst_2.append(dictionary)
    for key in dictionary.keys():
        if key not in keys2:
            keys2.append(key)
            
print("KEYS 2: " + str(keys2) + "\n")

In [None]:
# write rows for each dictionary
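One possible way to finish this step is sketched below. It is not the official solution: the helper name `write_dicts_to_csv`, the output file name, and the made-up dictionaries standing in for `lst_1` are all assumptions, so that the code runs without Twitter access.

```python
import csv
import os
import tempfile

def write_dicts_to_csv(path, dictionaries):
    """Hypothetical helper: write dictionaries to a CSV file, return the header."""
    # Collect every key that appears in any dictionary, without duplicates.
    keys = []
    for dictionary in dictionaries:
        for key in dictionary:
            if key not in keys:
                keys.append(key)
    with open(path, "w", newline="", encoding="utf-8") as csv_file:
        dict_writer = csv.DictWriter(csv_file, fieldnames=keys)
        dict_writer.writeheader()
        dict_writer.writerows(dictionaries)
    return keys

# Made-up stand-ins for the _json dictionaries collected in lst_1.
fake_statuses = [
    {"id": 1, "text": "first made-up status", "lang": "en"},
    {"id": 2, "text": "second made-up status", "retweet_count": 3},
]

path = os.path.join(tempfile.mkdtemp(), "tweets_demo.csv")
header = write_dicts_to_csv(path, fake_statuses)

# Read the file back to check what was written.
with open(path, "r", newline="", encoding="utf-8") as csv_file:
    rows = [dict(row) for row in csv.DictReader(csv_file)]

print(header)
print(rows)
```

With real status dictionaries, nested values such as dictionaries or lists are written as their string representation, so you may prefer to pick out a few flat fields first.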