# Files
* Python is an excellent language for working with textual data.
* In this tutorial, we cover how to read and write text files.

# Motivations
* Why would we need to process files?
* Real-world examples?

# Practice set-up
* Download the file "shakespeare.txt" from Canvas >> Files >> data >> "shakespeare.txt".
* FYI: The file is also available at: http://www.gutenberg.org/cache/epub/100/pg100.txt.

In [4]:
# "open" will return a "file" object (also called file-like object or a stream). 
# Note we are opening the file for reading (by providing a mode --> "r" argument)
shakespeare_works=open("shakespeare.txt", "r")
print(type(shakespeare_works)) 
#More info.: https://docs.python.org/3/library/io.html#io.TextIOWrapper

<class '_io.TextIOWrapper'>


* The file must be in your current directory (same as this Jupyter notebook or where your ".py" script lives)
* Another option is to provide the full path to where the file lives on disk.
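
A minimal sketch of checking the current directory and building a full path with the standard `pathlib` module. The variable names `current_dir` and `data_path` are illustrative and not part of the tutorial.

```python
from pathlib import Path

# The directory that relative file names are resolved against
current_dir = Path.cwd()
print(current_dir)

# Build a full (absolute) path; this works even if the file is not there yet
data_path = current_dir / "shakespeare.txt"
print(data_path)

# Check whether the file actually exists before trying to open it
print(data_path.exists())
```

If `exists()` prints `False`, either move the file next to the notebook or pass the full path to `open`.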

* Modes include: **"r" for reading**, **"w" for writing**, and **"a" for appending**
* The mode **"x"** (exclusive creation) was added in Python 3.3. It is similar to **"w"**, but fails if the file already exists (see: https://docs.python.org/3/library/functions.html#open).
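
The four modes can be compared side by side in a small sketch. It assumes a throwaway file named `demo.txt` (a hypothetical name) inside a temporary directory, so no real files are touched, and it uses `with` blocks so each file is closed automatically.

```python
import os
import tempfile

# Work in a temporary directory; "demo.txt" is a hypothetical file name
tmp_dir = tempfile.mkdtemp()
demo_path = os.path.join(tmp_dir, "demo.txt")

# "x": create the file; this fails if the file already exists
with open(demo_path, "x") as out_f:
    out_f.write("first line\n")

# "x" a second time: the file exists now, so FileExistsError is raised
try:
    with open(demo_path, "x") as out_f:
        out_f.write("never written\n")
    exclusive_failed = False
except FileExistsError:
    exclusive_failed = True

# "a": add to the end of the existing content
with open(demo_path, "a") as out_f:
    out_f.write("second line\n")

# "r": read everything back
with open(demo_path, "r") as in_f:
    after_append = in_f.read()

# "w": empty the file and start over
with open(demo_path, "w") as out_f:
    out_f.write("replaced\n")

with open(demo_path, "r") as in_f:
    after_write = in_f.read()

print(exclusive_failed)
print(repr(after_append))
print(repr(after_write))
```

Note that "w" silently discards the old content, which is why "x" is the safer choice when overwriting would be a mistake.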

In [57]:
# Read the whole file as a big "string"
f =open("shakespeare.txt", "r")
f=f.read()
print(type(f))

<class 'str'>


In [5]:
f =open("shakespeare.txt", "r")
f=f.read() 

# Below will also work:
#-----------------------
# f =open("shakespeare.txt", "r").read()

In [6]:
# It's a very big string
print(len(f))

5465395


In [7]:
print(f[:25])

﻿The Project Gutenberg EB


In [8]:
print(f[:50])

﻿The Project Gutenberg EBook of The Complete Works


In [9]:
s= "Welcome everyone"
new= s[0:7]
print(new)

Welcome


In [10]:
print(f[100:200])

This eBook is for the use of anyone anywhere at no cost and with
almost no restrictions whatsoever. 


In [58]:
# readlines() will return a list of all the lines in the file
f =open("shakespeare.txt", "r").readlines()
print(type(f))

<class 'list'>


In [59]:
print(len(f))

124796


In [60]:
print(f[:5])

['\ufeffThe Project Gutenberg EBook of The Complete Works of William Shakespeare, by \n', 'William Shakespeare\n', '\n', 'This eBook is for the use of anyone anywhere at no cost and with\n', 'almost no restrictions whatsoever.  You may copy it, give it away or\n']
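
The `'\ufeff'` at the start of the first line is a byte order mark (BOM) stored at the beginning of the file. One way to drop it, sketched here on a small hypothetical sample file rather than on "shakespeare.txt", is to open the file with `encoding="utf-8-sig"`. This assumes the file is UTF-8 encoded.

```python
import os
import tempfile

# Hypothetical sample file that starts with a UTF-8 byte order mark
path = os.path.join(tempfile.mkdtemp(), "bom_sample.txt")
with open(path, "wb") as out_f:
    out_f.write(b"\xef\xbb\xbfThe Project Gutenberg EBook\n")

# Plain UTF-8 keeps the mark as an invisible first character
with open(path, "r", encoding="utf-8") as in_f:
    with_bom = in_f.read()

# "utf-8-sig" removes the mark if it is present
with open(path, "r", encoding="utf-8-sig") as in_f:
    without_bom = in_f.read()

print(repr(with_bom[:4]))
print(repr(without_bom[:3]))
```

Removing the mark matters for counting words later: otherwise `'\ufeffthe'` and `'the'` are treated as two different words.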


In [14]:
print(f[-1])

*** END: FULL LICENSE ***
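
`read()` and `readlines()` both load the whole file into memory. For large files, an alternative is to loop over the file object, which yields one line at a time. The sketch below uses a small hypothetical sample file so that it runs on its own.

```python
import os
import tempfile

# Hypothetical sample file with four lines, one of them empty
path = os.path.join(tempfile.mkdtemp(), "lines_sample.txt")
with open(path, "w") as out_f:
    out_f.write("first\nsecond\n\nfourth\n")

line_count = 0
non_empty = []
with open(path, "r") as in_f:
    for line in in_f:            # the file object yields one line per iteration
        line_count += 1
        stripped = line.strip()  # remove the trailing "\n" and other whitespace
        if stripped:
            non_empty.append(stripped)

print(line_count)
print(non_empty)
```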



In [61]:
out_f = open("small_text.txt", "w")

In [62]:
# This will give an error since the file already exists
out_f = open("small_text.txt", "x")

FileExistsError: [Errno 17] File exists: 'small_text.txt'

In [63]:
# Even though the file exists, we do not get an error
out_f = open("small_text.txt", "w")

In [64]:
text="Do you like Python?\n" # Note we add "\n" as write does not automatically do that for us
out_f.write(text)

20

In [65]:
out_f.close()

In [72]:
out_f = open("small_text.txt", "a") # Note the "a" (append) mode

In [73]:
another_question="Do you like Lua?\n"
out_f.write(another_question)

17

In [74]:
out_f.close()

In [75]:
text3="Do you like Java?\n"
out_f = open("small_text.txt", "a") # Note the "a" (append) mode
out_f.write(text3)
out_f.close()
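
The numbers printed after each `write` call above (20 and 17) are the number of characters written. The sketch below shows this, and also shows a `with` block as an alternative to calling `close()` by hand. It writes to a hypothetical file in a temporary directory so that "small_text.txt" is left alone.

```python
import os
import tempfile

# Hypothetical file in a temporary directory
path = os.path.join(tempfile.mkdtemp(), "small_text_demo.txt")

questions = ["Do you like Python?\n", "Do you like Lua?\n", "Do you like Java?\n"]

# The "with" block closes the file automatically, even if an error occurs inside it
with open(path, "w") as out_f:
    # write() returns the number of characters written
    counts = [out_f.write(q) for q in questions]

print(counts)
print(out_f.closed)

# Read the lines back to confirm what was written
with open(path, "r") as in_f:
    lines_back = in_f.readlines()
print(lines_back)
```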

# More with files

In [77]:
from collections import defaultdict

def get_dict(sentences):
    """
    arguments:
        sentences: a list of sentences (strings)
    returns:
        a dictionary whose keys are lowercased words and whose values are word frequencies
    """
    word_freq=defaultdict(int)
    for sent in sentences:
        words=sent.lower().split()
        for w in words:
            word_freq[w]+=1
    return word_freq

# Count the words in the first 10 lines of the file
lines=open("shakespeare.txt", "r").readlines()
sentences=lines[:10]
freqs=get_dict(sentences)
print(freqs)

defaultdict(<class 'int'>, {'\ufeffthe': 1, 'project': 3, 'gutenberg': 3, 'ebook': 3, 'of': 4, 'the': 5, 'complete': 1, 'works': 1, 'william': 2, 'shakespeare,': 1, 'by': 1, 'shakespeare': 1, 'this': 4, 'is': 2, 'for': 1, 'use': 1, 'anyone': 1, 'anywhere': 1, 'at': 2, 'no': 2, 'cost': 1, 'and': 1, 'with': 2, 'almost': 1, 'restrictions': 1, 'whatsoever.': 1, 'you': 1, 'may': 1, 'copy': 1, 'it,': 1, 'give': 1, 'it': 2, 'away': 1, 'or': 2, 're-use': 1, 'under': 1, 'terms': 1, 'license': 1, 'included': 1, 'online': 1, 'www.gutenberg.org': 1, '**': 4, 'a': 1, 'copyrighted': 1, 'ebook,': 1, 'details': 1, 'below': 1, 'please': 1, 'follow': 1, 'copyright': 1, 'guidelines': 1, 'in': 1, 'file.': 1})
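
In the output above, `'shakespeare,'` and `'shakespeare'` are counted as different words because `split()` only separates on whitespace. One possible refinement, sketched below with a hypothetical `count_words` function and `collections.Counter`, strips punctuation from the edges of each word before counting. It is a simple approach, not a full tokenizer: it also removes leading apostrophes, so `'tis` becomes `tis`.

```python
import string
from collections import Counter

def count_words(sentences):
    """Count lowercased words, ignoring punctuation at the edges of each word."""
    counts = Counter()
    for sent in sentences:
        for w in sent.lower().split():
            w = w.strip(string.punctuation)  # "shakespeare," -> "shakespeare"
            if w:                            # skip tokens that were only punctuation, like "**"
                counts[w] += 1
    return counts

sample = [
    "William Shakespeare, by\n",
    "William Shakespeare\n",
    "** This is a COPYRIGHTED eBook, **\n",
]
sample_freqs = count_words(sample)
print(sample_freqs["shakespeare"])
print(sample_freqs["william"])
print("**" in sample_freqs)
```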


In [80]:
# This will sort by count/value of the "freqs" dictionary in reverse order such that the highest values occur first
# See additional part on this sorting in accompanying code from instructor.
d=sorted(freqs.items(), key = lambda x: x[1], reverse=True) # "d" is now a list of tuples!
print(type(d))
#print(d)

<class 'list'>


In [85]:
print(d[0])

('the', 5)


In [86]:
print(d)

[('the', 5), ('of', 4), ('this', 4), ('**', 4), ('project', 3), ('gutenberg', 3), ('ebook', 3), ('william', 2), ('is', 2), ('at', 2), ('no', 2), ('with', 2), ('it', 2), ('or', 2), ('\ufeffthe', 1), ('complete', 1), ('works', 1), ('shakespeare,', 1), ('by', 1), ('shakespeare', 1), ('for', 1), ('use', 1), ('anyone', 1), ('anywhere', 1), ('cost', 1), ('and', 1), ('almost', 1), ('restrictions', 1), ('whatsoever.', 1), ('you', 1), ('may', 1), ('copy', 1), ('it,', 1), ('give', 1), ('away', 1), ('re-use', 1), ('under', 1), ('terms', 1), ('license', 1), ('included', 1), ('online', 1), ('www.gutenberg.org', 1), ('a', 1), ('copyrighted', 1), ('ebook,', 1), ('details', 1), ('below', 1), ('please', 1), ('follow', 1), ('copyright', 1), ('guidelines', 1), ('in', 1), ('file.', 1)]
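
To see what the `key` argument does, here is a small sketch on a hand-made dictionary (the sample values are made up for illustration). With `reverse=True`, words that have the same count keep their original relative order. Sorting on a tuple key is one way to break such ties alphabetically.

```python
sample_freqs = {"this": 4, "the": 5, "of": 4, "ebook": 3}

pairs = list(sample_freqs.items())  # a list of (word, count) tuples

# Sort by count only; "this" stays ahead of "of" because it came first in the input
by_count = sorted(pairs, key=lambda x: x[1], reverse=True)
print(by_count)

# Sort by descending count, then alphabetically by word to break ties
by_count_then_word = sorted(pairs, key=lambda x: (-x[1], x[0]))
print(by_count_then_word)
```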


In [93]:
for i in d:
    w= i[0]
    freq=i[-1]
    if freq > 2:
        print(i)

('the', 5)
('of', 4)
('this', 4)
('**', 4)
('project', 3)
('gutenberg', 3)
('ebook', 3)


In [111]:
lines=open("shakespeare.txt", "r").readlines()
sentences=lines[:10000]
freqs=get_dict(sentences)
#print(freqs)

In [96]:
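def keep_top_n(d, n=2):
    """Keep the (word, frequency) pairs in d whose frequency is greater than n.

    Note: despite the name, n is a frequency threshold, not a number of words.
    """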
    short_d={}
    for i in d:
        w=i[0]
        freq=i[-1]
        if freq > n:
            short_d[w]=freq
    return short_d
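
`keep_top_n` keeps every word whose frequency is greater than `n`. If what you want is the `n` most frequent words instead, one option is `Counter.most_common`. The sketch below contrasts the two on made-up sample counts.

```python
from collections import Counter

sample_freqs = {"the": 5, "of": 4, "this": 3, "project": 3, "william": 2, "by": 1}

# Threshold: every word seen more than 2 times (the behavior of keep_top_n with n=2)
above_threshold = {w: c for w, c in sample_freqs.items() if c > 2}
print(above_threshold)

# Rank: the 2 most frequent words, as a list of (word, count) tuples
top_two = Counter(sample_freqs).most_common(2)
print(top_two)
```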

In [112]:
d=sorted(freqs.items(), key = lambda x: x[1], reverse=True) # "d" is now a list of tuples!

my_new_d= keep_top_n(d, n=200)
print(my_new_d)


{'the': 1931, 'and': 1771, 'i': 1476, 'to': 1471, 'of': 1319, 'my': 1063, 'a': 934, 'that': 901, 'in': 862, 'you': 755, 'is': 708, 'not': 605, 'for': 588, 'with': 555, 'his': 545, 'but': 505, 'thy': 480, 'be': 477, 'it': 472, 'thou': 470, 'have': 469, 'he': 464, 'your': 461, 'me': 401, 'this': 397, 'as': 384, 'so': 351, 'by': 328, 'will': 302, 'what': 301, 'which': 285, 'all': 281, 'shall': 278, 'do': 272, 'her': 267, 'or': 262, 'if': 258, 'are': 253, 'our': 252, 'him': 248, 'when': 240, 'from': 236, 'no': 235, 'we': 233, 'antony.': 223, 'on': 207}


In [113]:

dd=sorted(my_new_d.items(), key = lambda x: x[1], reverse=True)
dd

[('the', 1931),
 ('and', 1771),
 ('i', 1476),
 ('to', 1471),
 ('of', 1319),
 ('my', 1063),
 ('a', 934),
 ('that', 901),
 ('in', 862),
 ('you', 755),
 ('is', 708),
 ('not', 605),
 ('for', 588),
 ('with', 555),
 ('his', 545),
 ('but', 505),
 ('thy', 480),
 ('be', 477),
 ('it', 472),
 ('thou', 470),
 ('have', 469),
 ('he', 464),
 ('your', 461),
 ('me', 401),
 ('this', 397),
 ('as', 384),
 ('so', 351),
 ('by', 328),
 ('will', 302),
 ('what', 301),
 ('which', 285),
 ('all', 281),
 ('shall', 278),
 ('do', 272),
 ('her', 267),
 ('or', 262),
 ('if', 258),
 ('are', 253),
 ('our', 252),
 ('him', 248),
 ('when', 240),
 ('from', 236),
 ('no', 235),
 ('we', 233),
 ('antony.', 223),
 ('on', 207)]

In [34]:
# "d" is a list of tuples at this point, so look the word up in the "freqs" dictionary
freqs["guidelines"]

1

In [129]:
# This will write freqs to a file
lines=open("shakespeare.txt", "r").readlines()
# To count only the first 10,000 lines, use the two commented lines instead:
#sentences=lines[:10000]
#freqs=get_dict(sentences)
# Here we count the words in all lines of the file
freqs=get_dict(lines)

lt=freqs.items()

my_new_d= keep_top_n(lt, n=200)

d=sorted(my_new_d.items(), key = lambda x: x[1], reverse=True) # "d" is now a list of tuples!
word_freqs=open("./more_frequent_words.txt", "w")
for i in d:
    # For readability
    w=i[0]
    freq=i[-1]
    word_freqs.write(w+"\t"+str(freq)+"\n")
word_freqs.close()
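
The file written above has one word and its count per line, separated by a tab. Below is a minimal sketch of reading such a file back into a dictionary. It writes its own small sample file (a hypothetical name in a temporary directory) so that it runs without "more_frequent_words.txt".

```python
import os
import tempfile

# Hypothetical sample file in the same "word<TAB>count" format
path = os.path.join(tempfile.mkdtemp(), "freq_sample.txt")
pairs = [("the", 27729), ("and", 26099), ("i", 19540)]

with open(path, "w") as out_f:
    for w, freq in pairs:
        out_f.write(w + "\t" + str(freq) + "\n")

# Read the file back: split each line on the tab and convert the count to an int
loaded = {}
with open(path, "r") as in_f:
    for line in in_f:
        w, freq = line.rstrip("\n").split("\t")
        loaded[w] = int(freq)

print(loaded)
```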
    

In [118]:
type(lt)

dict_items

In [119]:
type(my_new_d)

dict

In [133]:
!less "more_frequent_words.txt"

the     27729
and     26099
i       19540
to      18763
of      18126
a       14436
my      12455
in      10730
you     10696
that    10501
is      9168
for     8000
with    7981
not     7663
your    6878
his     6749
be      6717
this    5930
as      5893
but     5891
he      5886
it      5879
have    5683
thou    5138

# Exercise
* Download the file "pos.swn.txt" from Canvas >> Files >> data.
* Write a function that reads the file's lines into a list.

* Your function should:
* a. Keep only the first 20 lines in the list.
* b. Replace each "_" with a space " ".
* c. Return the list.

* Call the function and print the list.
