# Files

* Python is an excellent language for working with textual data.
* In this tutorial, we cover how to read and write text files.
* Download the file from http://www.gutenberg.org/cache/epub/100/pg100.txt and save it as "shakespeare.txt".

In [2]:
# "open" will return a "file" object. 
# Note that we open the file for reading by passing "r" as the mode argument
shakespeare_works = open("shakespeare.txt", "r")
print(type(shakespeare_works))

<type 'file'>


* The file must be in your current directory (the directory of this Jupyter notebook, or where your ".py" script lives).
* Another option is to provide the "full path" to where the file lives on disk.

* Modes include: **"r" for reading**, **"w" for writing**, and **"a" for appending**
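To check where Python will look for a file given by name only, the standard `os` module can help. This is a minimal sketch: it only reports the current directory and the full path that "shakespeare.txt" would resolve to, and it assumes nothing about whether you have already downloaded the file.

```python
import os

# Directory that relative paths such as "shakespeare.txt" are resolved against
current_dir = os.getcwd()

# Full (absolute) path that open("shakespeare.txt", "r") would try to read
full_path = os.path.abspath("shakespeare.txt")

print(current_dir)
print(full_path)

# Checking first avoids an error from open() when the file is missing
if os.path.exists(full_path):
    print("Found the file")
else:
    print("No such file here; download it or pass its full path to open()")
```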

In [4]:
# Read the whole file as one big string
f = open("shakespeare.txt", "r")
f = f.read()  # "f" now names the string, not the file; open("shakespeare.txt", "r").read() will also work
print(type(f))

<type 'str'>


In [5]:
# It's a very big string
print(len(f))

5589889


In [7]:
print(f[:25])

﻿The Project Gutenberg 


In [10]:
print(f[:50])

﻿The Project Gutenberg EBook of The Complete Wor


In [11]:
print(f[100:200])



This eBook is for the use of anyone anywhere at no cost and with
almost no restrictions whatsoe


In [12]:
# readlines() returns a list of all the lines in the file
f = open("shakespeare.txt", "r").readlines()
print(type(f))

<type 'list'>


In [13]:
print(len(f))

124787


In [14]:
print(f[:5])

['\xef\xbb\xbfThe Project Gutenberg EBook of The Complete Works of William Shakespeare, by\r\n', 'William Shakespeare\r\n', '\r\n', 'This eBook is for the use of anyone anywhere at no cost and with\r\n', 'almost no restrictions whatsoever.  You may copy it, give it away or\r\n']
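Each string returned by `readlines()` keeps its line terminator ("\r\n" in this file). Stripping it is a common next step. The sketch below uses a few sample lines copied from the output above rather than the real file, and the variable names are only illustrative.

```python
# Sample lines standing in for entries of the list returned by readlines()
raw_lines = [
    "William Shakespeare\r\n",
    "\r\n",
    "This eBook is for the use of anyone anywhere at no cost and with\r\n",
]

# rstrip("\r\n") removes any trailing carriage-return and newline characters
clean_lines = [line.rstrip("\r\n") for line in raw_lines]

# Drop the lines that are empty once the terminator is gone
non_empty = [line for line in clean_lines if line]

print(clean_lines)
print(non_empty)
```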


In [15]:
print(f[-1])

*** END: FULL LICENSE ***



In [None]:
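# Close your files!
# "f" now refers to a list, so we close the file object we kept in "shakespeare_works"
shakespeare_works.close()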
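Because it is easy to forget `close()`, Python also offers the `with` statement, which closes the file automatically when the block ends. Here is a minimal sketch on a scratch file; the temporary directory and the name "with_demo.txt" are assumptions made for illustration.

```python
import os
import tempfile

# Hypothetical scratch file so the example does not touch your own files
demo_path = os.path.join(tempfile.mkdtemp(), "with_demo.txt")

# The file is closed as soon as the indented block finishes
with open(demo_path, "w") as out_f:
    out_f.write("To be, or not to be\n")

with open(demo_path, "r") as in_f:
    text = in_f.read()

print(in_f.closed)  # True: the file was closed when the with block ended
print(text)
```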

In [28]:
out_f = open("./small_text.txt", "w")

In [29]:
question = "Do you like Python?\n"  # Note we add "\n" because write() does not add a newline for us
out_f.write(question)
out_f.close()  # Close the file so the text is flushed to disk before we reopen it

In [31]:
out_f = open("./small_text.txt", "a") # Note the "a" (append) mode

In [32]:
another_question = "Do you like Lua?\n"
out_f.write(another_question)
out_f.close()
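To see the difference between "w" and "a", it helps to read the file back after both writes. The sketch below repeats the two steps on a scratch copy (the temporary directory is an assumption, so your own "small_text.txt" is left alone) and then reads the result.

```python
import os
import tempfile

# Hypothetical scratch location for the demonstration
path = os.path.join(tempfile.mkdtemp(), "small_text.txt")

out_f = open(path, "w")   # "w" starts from an empty file
out_f.write("Do you like Python?\n")
out_f.close()             # closing flushes the text to disk

out_f = open(path, "a")   # "a" keeps what is already there
out_f.write("Do you like Lua?\n")
out_f.close()

in_f = open(path, "r")
lines = in_f.readlines()
in_f.close()

print(lines)
```

Opening the file with "w" a second time instead of "a" would have discarded the first question.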

# More with files

In [27]:
from collections import defaultdict

def get_dict(sentences):
    """
    Count how often each word occurs in a list of sentences.

    arguments:
        sentences: a list of sentences (strings)
    returns:
        a dictionary whose keys are lowercased words and whose values are word frequencies
    """
    word_freq = defaultdict(int)
    for sent in sentences:
        words = sent.lower().split()
        for w in words:
            word_freq[w] += 1
    return word_freq

# Count the words in the first 10 lines of the file
lines = open("shakespeare.txt", "r").readlines()
sentences = lines[:10]
freqs = get_dict(sentences)
# Sort by count (the values of the "freqs" dictionary) in reverse order so that the highest counts come first
d = sorted(freqs.items(), key=lambda x: x[1], reverse=True)  # "d" is now a list of tuples!
for i in d:
    # For readability let's assign each item in the tuple to a meaningfully named variable
    w = i[0]
    freq = i[-1]
    if freq > 2:
        print("The word \"{}\" occurs \"{}\" times.".format(w, freq))

The word "the" occurs "5" times.
The word "**" occurs "4" times.
The word "this" occurs "4" times.
The word "of" occurs "4" times.
The word "ebook" occurs "3" times.
The word "gutenberg" occurs "3" times.
The word "project" occurs "3" times.
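The same counting can also be done with `collections.Counter` from the standard library, which is a dictionary subclass designed for tallies. This is an alternative sketch, not the method used above, and the three sample sentences are made up for illustration.

```python
from collections import Counter

# Made-up sample sentences standing in for lines of the file
sentences = [
    "The Project Gutenberg EBook\r\n",
    "This eBook is for the use of anyone\r\n",
    "the end of the ebook\r\n",
]

word_freq = Counter()
for sent in sentences:
    # update() adds one count for every word in the list
    word_freq.update(sent.lower().split())

# most_common(n) returns the n highest counts, already sorted
top_two = word_freq.most_common(2)
print(top_two)
```

`most_common()` replaces the explicit `sorted(...)` call with a `lambda` key.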


In [31]:
# This will write the word frequencies to a file
lines = open("shakespeare.txt", "r").readlines()
# We only take the first 1000 lines
sentences = lines[:1000]
freqs = get_dict(sentences)
d = sorted(freqs.items(), key=lambda x: x[1], reverse=True)
word_freqs = open("./word_list.txt", "w")
for i in d:
    # For readability
    w = i[0]
    freq = i[-1]
    # One "word<TAB>frequency" pair per line
    word_freqs.write(w + "\t" + str(freq) + "\n")
    
word_freqs.close()
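A tab-separated file like this is easy to load again later. The sketch below writes a tiny stand-in for "word_list.txt" to a temporary directory (the three word counts are sample values, not real results) and reads it back into a dictionary.

```python
import os
import tempfile

# Hypothetical scratch copy of word_list.txt with sample values
path = os.path.join(tempfile.mkdtemp(), "word_list.txt")

with open(path, "w") as out_f:
    for w, freq in [("the", 5), ("of", 4), ("ebook", 3)]:
        out_f.write(w + "\t" + str(freq) + "\n")

loaded = {}
with open(path, "r") as in_f:
    # Iterating over the file object yields one line at a time
    for line in in_f:
        w, freq = line.rstrip("\n").split("\t")
        loaded[w] = int(freq)  # counts were written as text, so convert back

print(loaded)
```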
    