### Raw Strings
In Python, we place an `r` before the opening quotation mark of a string (single or double) to declare a raw string. A raw string does not interpret escape sequences: Python treats each backslash as a literal character rather than as the start of an escape sequence, so `\n` is printed as a backslash followed by `n`. Raw strings are helpful if you are typing string values that contain many backslashes, such as the strings used for regular expressions (which we will cover in our next meeting).


In [2]:
print(
    r"The history of Budapest began when an early Celtic settlement transformed into the Roman town of Aquincum, \n the capital of Lower Pannonia.\n"
)

The history of Budapest began when an early Celtic settlement transformed into the Roman town of Aquincum, \n the capital of Lower Pannonia.\n
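To make the difference concrete, here is a small sketch comparing a regular string with its raw counterpart. The Windows-style path is a made-up example, not a file used in this course.

```python
# Compare a regular string with a raw string (the path is a hypothetical example).
regular = "C:\new_folder\table.txt"   # \n and \t are interpreted as a newline and a tab
raw = r"C:\new_folder\table.txt"      # every backslash is kept as a literal character

# Each escape sequence collapses two typed characters into one,
# so the regular string is two characters shorter than the raw one.
print(len(regular), len(raw))

# Doubling each backslash in a regular string gives the same text as the raw string.
escaped = "C:\\new_folder\\table.txt"
print(raw == escaped)
```

The raw form is usually the easiest to read when a string contains many backslashes.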



### Multiline Strings with Triple Quotes
While you can use the `\n` escape character to put a newline into a string, it is often easier to use multiline strings. A multiline string in Python begins and ends with either three single quotes or three double quotes. Any quotes, tabs, or newlines in between the "triple quotes" are considered part of the string. Python's indentation rules for blocks do not apply to lines inside a multiline string.

In [3]:
budapest_paragraph = """The history of Budapest began when an early Celtic settlement transformed into the Roman town of Aquincum,[15][16] the capital of Lower Pannonia.[15] The Hungarians arrived in the territory in the late 9th century.[17] 
The area was pillaged by the Mongols in 1241.[18] Buda, the settlements on the west bank of the river,
became one of the centres of Renaissance humanist culture by the 15th century.[19][20][21] 
The Battle of Mohács, in 1526, was followed by nearly 150 years of Ottoman rule.[22]
After the reconquest of Buda in 1686, the region entered a new age of prosperity. 
Pest-Buda became a global city with the unification of Buda, Óbuda, and Pest on 17 November 1873, with the name
'Budapest' given to the new capital.[12][23] Budapest also became the co-capital of the Austro-Hungarian Empire,
[24] a great power that dissolved in 1918, following World War I. The city was the focal point of the Hungarian 
Revolution of 1848, the Battle of Budapest in 1945, 
and the Hungarian Revolution of 1956.[25][26]"""


In [4]:
type(budapest_paragraph)

str
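As a quick check of how triple quotes behave, the sketch below builds the same text twice, once with a multiline string and once with `\n` escape characters. The sample lines are placeholders.

```python
# A triple-quoted string keeps the line breaks typed between the quotes.
multiline = """First line
Second line
Third line"""

# The same text written on one line with explicit \n escape characters.
escaped = "First line\nSecond line\nThird line"

print(multiline == escaped)      # True
print(multiline.splitlines())    # ['First line', 'Second line', 'Third line']
```

Both forms produce an ordinary `str`; the triple-quoted version is simply easier to type and read for longer passages.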


### Reading files 

We can read a string from a file and store it in a variable. Let's do this using the file named `Hello.txt`.

In Python, we open a file using the built-in `open()` function and pass it the path of the file we want to open. Usually, we can open a file to read or to write. The main exception is PDF: we can read PDFs, but we can only write a PDF by combining content from existing PDFs. More on this later. Text files with the `.txt` extension or Python script files with the `.py` extension are examples of plain text files.

We are going to start reading and writing text files. Your programs can easily read the contents of plaintext files and treat them as an ordinary string value. (Binary files are all other file types, such as word processing documents, PDFs, images, spreadsheets, etc.)

In [5]:
import os


In [6]:
pwd

'/Users/ariedamuco/Dropbox (CEU Econ)/Python-Programming-and-Text-Analysis/Code/Class2/Part2'


In [7]:
os.chdir("../")


In [8]:
pwd

'/Users/ariedamuco/Dropbox (CEU Econ)/Python-Programming-and-Text-Analysis/Code/Class2'


In [9]:
hello = open("/Users/ariedamuco/Downloads/Introduction-to-Python/input/Hello.txt")

FileNotFoundError: [Errno 2] No such file or directory: '/Users/ariedamuco/Downloads/Introduction-to-Python/input/Hello.txt'


In [None]:
os.chdir("/Users/ariedamuco/Downloads/Introduction-to-Python-main")

In [None]:
os.listdir()

In [None]:
file_open = open("input/Hello.txt")

In [None]:
file_open

In [None]:
file_open.read()

In [None]:
os.listdir("input")
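A `FileNotFoundError` like the one above usually means the path does not match the current working directory. The sketch below shows one way to check a path before opening it and to handle the error when the file is missing. It writes a throwaway file in a temporary folder so that it runs anywhere; the file names are placeholders, not the course files.

```python
import os
import tempfile

# Create a throwaway folder and file so the example does not depend on the course data.
folder = tempfile.mkdtemp()
path = os.path.join(folder, "Hello.txt")
with open(path, "w", encoding="utf-8") as f:
    f.write("Hello, world!")

missing = os.path.join(folder, "does_not_exist.txt")

# os.path.exists() tells us whether a path points to something on disk.
print(os.path.exists(path))     # True
print(os.path.exists(missing))  # False

# Alternatively, catch the error instead of letting the program stop.
try:
    open(missing)
    opened = True
except FileNotFoundError:
    opened = False
print(opened)                   # False

# The with statement closes the file automatically when the block ends.
with open(path, encoding="utf-8") as f:
    content = f.read()
print(content)                  # Hello, world!
```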

### Access Modes

Access modes define in which way you want to open a file, whether you want to open a file in:

-- read-only mode -- denoted by `r`, the default mode 

-- write-only mode -- denoted by `w` 

-- append mode -- denoted by `a` 

The most commonly used modes are read and write. Online examples sometimes open a file for both reading and writing; in Python this mode is denoted by `r+` (the string `rw` is not a valid mode and raises a `ValueError`). Be careful when opening a file for reading and writing at the same time, since you may modify the original file, which you usually do not want.
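The following sketch illustrates two points about access modes: opening an existing file in `w` mode erases its previous content, and `rw` is rejected by `open()`. It uses a throwaway file in a temporary folder (a placeholder name) so that no real file is touched.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "modes_demo.txt")  # throwaway file

# "w" creates the file, or erases its content if it already exists.
with open(path, "w") as f:
    f.write("first version\n")
with open(path, "w") as f:
    f.write("second version\n")

# "r" is the default mode, so passing it is optional.
with open(path, "r") as f:
    content = f.read()
print(repr(content))  # 'second version\n'

# "rw" is not a valid mode string in Python; open() raises a ValueError.
try:
    open(path, "rw")
    rw_is_valid = True
except ValueError:
    rw_is_valid = False
print(rw_is_valid)    # False
```

Because `w` silently overwrites, it is worth double-checking the path before opening a file for writing.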

Now let's read the content of our file.

In [None]:
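file_open.seek(0)  # rewind: the earlier read() left the position at the end of the file
hello_string = file_open.read()  # use file_open; `hello` was never created because the first open() failed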

In [None]:
type(hello_string)

In [None]:
budapest = open("../input/budapest.txt", mode='r', encoding="utf-8" )

In [None]:
budapest.readlines()

In [None]:
type(budapest.readlines())
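Note that a file object remembers how far it has been read. The sketch below, which uses a throwaway file with placeholder content, shows that a second call to `readlines()` returns an empty list until the position is reset with `seek(0)`.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "position_demo.txt")  # throwaway file
with open(path, "w", encoding="utf-8") as f:
    f.write("line one\nline two\n")

with open(path, encoding="utf-8") as f:
    first = f.readlines()    # reads every line and moves to the end of the file
    second = f.readlines()   # nothing is left to read
    f.seek(0)                # go back to the beginning
    third = f.read()         # the whole content as a single string

print(first)        # ['line one\n', 'line two\n']
print(second)       # []
print(repr(third))  # 'line one\nline two\n'
```

`read()` returns one string, while `readlines()` returns a list with one string per line, each still ending in `\n`.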

We can also write our own text files.

In [None]:
my_file = open("../output/first_file.txt",'w') 

In [None]:
my_file.write("This is the first file that I write with Python.")

To see what's written in the file, we should first close it. Alternatively, a `with` statement closes the file automatically when the block ends:
```
with open("../output/first_file.txt", 'w') as my_file:
    my_file.write("This is the first file that I write with Python.")
```

In [None]:
my_file.close()

In [None]:
my_file = open("../output/first_file.txt",'w') 

In [None]:
lines = ["First Line","Second Line","Third Line","Fourth Line"]

In [None]:
with open("../output/first_file.txt",'w') as my_file:
    for line in lines:
        my_file.write(line + "\n")

In [None]:
with open("../output/first_file.txt", 'r') as f:  # the file is closed when the block ends
    first_file_lines = f.readlines()
first_file_lines

In case you want to append the list to the first line you wrote previously, use append mode.
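A minimal sketch of append mode, assuming a throwaway file in a temporary folder rather than the `output` folder used above: the first sentence is written in `w` mode and the list is then added in `a` mode without erasing it.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "append_demo.txt")  # throwaway file
lines = ["First Line", "Second Line", "Third Line", "Fourth Line"]

# Write the first sentence; "w" starts from an empty file.
with open(path, "w") as my_file:
    my_file.write("This is the first file that I write with Python.\n")

# "a" keeps the existing content and adds new text at the end.
with open(path, "a") as my_file:
    for line in lines:
        my_file.write(line + "\n")

with open(path) as my_file:
    result = my_file.read().splitlines()
print(result)
```

The resulting list starts with the original sentence, followed by the four lines from `lines`.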

In [None]:
#!pip install Wikipedia

In [None]:
import wikipedia

In [None]:
wikipedia.search("Budapest")

In [None]:
budapest_wiki = wikipedia.page('Budapest')

In [None]:
budapest_wiki.title

In [None]:
budapest_wiki.url

In [None]:
budapest_summary = wikipedia.summary("Budapest")

In [None]:
budapest_summary

In [None]:
budapest_content = budapest_wiki.content

In [None]:
budapest_content

### Challenge: Write the string stored in `budapest_content` to a file named `Budapest_wiki.txt`. Save the file in the `Output` folder.

In [None]:
import re
budapest_content_clean = re.sub('[^a-zA-Z]+', ' ', budapest_content)

In [None]:
budapest_content_tokens=budapest_content_clean.lower().split()

In [None]:
budapest_content_tokens[0:10]
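To see what the cleaning step does, the sketch below applies the same pattern to a short made-up sentence. Note that `[^a-zA-Z]` treats digits, punctuation, and accented letters such as `Ó` as non-letters, so they are replaced by spaces.

```python
import re

# A made-up sample sentence, not taken from the Wikipedia page.
sample = "Budapest, in 1873: Buda + Óbuda + Pest!"

# Replace every run of characters that are not plain ASCII letters with one space.
clean = re.sub('[^a-zA-Z]+', ' ', sample)

# Lowercase the text and split on whitespace to get a list of tokens.
tokens = clean.lower().split()
print(tokens)  # ['budapest', 'in', 'buda', 'buda', 'pest']
```

Here `Óbuda` loses its first letter and becomes `buda`, which is worth keeping in mind when working with text that contains Hungarian names.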

### Challenge: Open the file `nltk_stopwords.txt` and store its content in a list called `nltk_stopwords` where each element of the list is a line of the document. Use the method `replace()` if needed.

In [None]:
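# Read one stopword per line and strip the trailing newline; the with block closes the file
with open("../input/nltk_stopwords.txt", mode='r') as stopwords_file:
    nltk_stopwords = [element.replace("\n", "") for element in stopwords_file.readlines()]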

In [None]:
budapest_stopwords_removed = [word for word in budapest_content_tokens if word not in nltk_stopwords and len(word)>3]

In [None]:
from collections import Counter

dict_bp = Counter(budapest_stopwords_removed)
dict_bp
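A `Counter` behaves like a dictionary that maps each element to the number of times it appears. The small sketch below uses a made-up word list to show the most useful operations.

```python
from collections import Counter

# A made-up list of tokens, standing in for budapest_stopwords_removed.
words = ["danube", "city", "danube", "bridge", "city", "danube"]
counts = Counter(words)

print(counts["danube"])        # 3
print(counts.most_common(2))   # [('danube', 3), ('city', 2)]
print(counts["missing"])       # 0: absent keys count as zero instead of raising KeyError
```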

In [None]:
#!conda install WordCloud
import matplotlib.pyplot as plt
from wordcloud import WordCloud

In [None]:
wc = WordCloud(width=1800, height=900, background_color="white",
               max_words = 20, relative_scaling = 0.5, 
               normalize_plurals = False).generate_from_frequencies(dict_bp)
plt.imshow(wc)

### Challenge: Create a function named `count_elements` that counts the unique elements of a list and returns a dictionary whose keys are the unique elements of the list and whose values are the number of times each element appears in the list. Use `count_elements` to plot the word cloud.

In [None]:
# Write the word frequencies to a semicolon-separated text file
with open('../output/Budapest_counter_freq.txt', 'w') as write_budapest:
    write_budapest.write('word' + ";" + "frequency" + "\n")
    # dict_bp is already a Counter, so there is no need to wrap it in Counter() again
    for key, value in dict_bp.items():
        write_budapest.write(key + ';' + str(value) + "\n")

In [None]:
"""
# Using the csv library
import csv

# newline='' lets the csv module control line endings
with open('../output/Budapest_counter.csv', 'w', newline='') as csv_file:
    writer = csv.writer(csv_file, delimiter=';')
    writer.writerow(['word','frequency'])
    for key, value in dict_bp.items():
        writer.writerow([key, value])
"""
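The `csv` approach can be tried end to end with the sketch below. It writes a small made-up `Counter` to a throwaway file and reads it back; the file name and the words are placeholders.

```python
import csv
import os
import tempfile
from collections import Counter

counts = Counter(["danube", "city", "danube"])               # made-up frequencies
path = os.path.join(tempfile.mkdtemp(), "counter_demo.csv")  # throwaway file

# newline="" lets the csv module control line endings on every platform.
with open(path, "w", newline="", encoding="utf-8") as csv_file:
    writer = csv.writer(csv_file, delimiter=";")
    writer.writerow(["word", "frequency"])
    for word, frequency in counts.items():
        writer.writerow([word, frequency])

# Read the file back; every value comes back as a string.
with open(path, newline="", encoding="utf-8") as csv_file:
    rows = list(csv.reader(csv_file, delimiter=";"))
print(rows)  # [['word', 'frequency'], ['danube', '2'], ['city', '1']]
```

Compared with writing the separators by hand, `csv.writer` also takes care of quoting values that contain the delimiter.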

With Python we can also open, read, and write other types of files. One of them is the `pdf` format. While we can read PDFs, we cannot write a PDF from scratch; we can only write content that is already in PDF format. To do this, we'll use a third-party module called PyPDF2.

In [None]:
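#install third party module
#!pip install "PyPDF2<3.0"  # the cells below use the camelCase API (PdfFileReader, getPage, ...) that PyPDF2 3.0 removed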

In [None]:
import PyPDF2

In [None]:
AEJ_health = open('../input/app20170295.pdf', "rb") #read binary
reader_health = PyPDF2.PdfFileReader(AEJ_health)

In [None]:
reader_health

In [None]:
reader_health.numPages

In [None]:
page1_health = reader_health.getPage(0)

In [None]:
page1_health.extractText()

In [None]:
writer_AEJ_health=PyPDF2.PdfFileWriter()

In [None]:
writer_AEJ_health.addPage(page1_health)

In [None]:
outputFile_health=open('../output/AEJ_health_abstract.pdf','wb')
writer_AEJ_health.write(outputFile_health)
outputFile_health.close()

In [None]:
#read other AEJ file
AEJ_temperature=open('../input/app20170223.pdf', "rb") 
reader_temperature=PyPDF2.PdfFileReader(AEJ_temperature)
page1_temperature=reader_temperature.getPage(0)

In [None]:
writer_AEJ=PyPDF2.PdfFileWriter()
writer_AEJ.addPage(page1_temperature)
writer_AEJ.addPage(page1_health)
outputFile=open('../output/AEJ_combined.pdf','wb')
writer_AEJ.write(outputFile)
outputFile.close()

### Challenge: Write a function that combines odd pages from `app20170295.pdf` and even pages from `app20170223.pdf`.

### References 
https://www.datacamp.com/community/tutorials/reading-writing-files-python

https://automatetheboringstuff.com/chapter6

https://pypi.org/project/wikipedia/

https://medium.com/@Alexander_H/scraping-wikipedia-with-python-8000fc9c9e6c

https://automatetheboringstuff.com/chapter8/

https://automatetheboringstuff.com/chapter13/