# **File Handling**

File handling is an important part of programming: it lets us perform CRUD operations (Create, Read, Update and Delete) on files. In Python, we open files with the built-in `open()` function.

`open("filename", mode)`, where `mode` is one of `r`, `a`, `w` or `x`, optionally combined with `t` or `b` (for example `"rb"`).

- `"r"` - read - Default. Opens a file for reading. Raises an error (`FileNotFoundError`) if the file doesn't exist.
- `"a"` - append - Opens a file for appending. Creates the file if it doesn't exist.
- `"w"` - write - Opens a file for writing, discarding any existing content. Creates the file if it doesn't exist.
- `"x"` - create - Creates the specified file. Raises an error (`FileExistsError`) if the file already exists.
- `"t"` - text - Default. Text mode.
- `"b"` - binary - Binary mode (e.g. for images).
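As a rough illustration of how these modes behave, the sketch below works in a temporary directory so that no real files are touched. The file name `demo.txt` and the variable names are made up for this example.

```python
import os
import tempfile

# Work in a throwaway directory; "demo.txt" is a hypothetical file name.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo.txt")

    # "r" fails because the file does not exist yet.
    try:
        open(path, "r")
        read_missing_failed = False
    except FileNotFoundError:
        read_missing_failed = True

    # "x" creates the file; a second "x" fails because it now exists.
    f = open(path, "x")
    f.write("first line\n")
    f.close()
    try:
        open(path, "x")
        create_existing_failed = False
    except FileExistsError:
        create_existing_failed = True

    # "a" keeps the existing content and adds to the end.
    f = open(path, "a")
    f.write("second line\n")
    f.close()
    f = open(path, "r")
    after_append = f.read()
    f.close()

    # "w" discards the existing content before writing.
    f = open(path, "w")
    f.write("replaced\n")
    f.close()
    f = open(path, "r")
    after_write = f.read()
    f.close()

print(read_missing_failed, create_existing_failed)
print(repr(after_append))
print(repr(after_write))
```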

In [2]:
f = open("./files/reading_file_example.txt")
print(f)

<_io.TextIOWrapper name='./files/reading_file_example.txt' mode='r' encoding='UTF-8'>


An open file object offers several reading methods: `read()`, `readline()` and `readlines()`. Once we are done with it, the file should be closed with the `close()` method.

### **.read()**

Reads the whole file as a single string. To limit the number of characters read, pass an integer to the method: `.read(`*number*`)`.

In [5]:
f = open("./files/reading_file_example.txt")
txt = f.read()
print(type(txt))
print(txt)
f.close()

<class 'str'>
This is an example to show how to open a file and read.
This is the second line of the text.


In [6]:
# Printing the first 10 characters of the text file.
f = open("./files/reading_file_example.txt")
txt = f.read(10)
print(type(txt))
print(txt)
f.close()

<class 'str'>
This is an


### **.readline()**

Reads a single line. On a freshly opened file this is the first line; each further call returns the next one.

In [7]:
f = open("./files/reading_file_example.txt")
line = f.readline()
print(type(line))
print(line)
f.close()

<class 'str'>
This is an example to show how to open a file and read.
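To see how successive `readline()` calls move through a file, here is a minimal sketch. It writes the two example lines to a temporary file (an assumption made so the snippet runs on its own) and reads them back; at the end of the file `readline()` returns an empty string.

```python
import os
import tempfile

# Create a small temporary file that mirrors the example text.
tmp = tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False, encoding="utf-8")
tmp.write("This is an example to show how to open a file and read.\n")
tmp.write("This is the second line of the text.")
tmp.close()

f = open(tmp.name, encoding="utf-8")
first = f.readline()    # first line, including the trailing "\n"
second = f.readline()   # the next call continues with the second line
at_end = f.readline()   # at the end of the file an empty string is returned
f.close()

# A file object can also be iterated line by line.
f = open(tmp.name, encoding="utf-8")
stripped_lines = [line.rstrip("\n") for line in f]
f.close()

os.remove(tmp.name)  # clean up the temporary file

print(repr(first))
print(repr(second))
print(repr(at_end))
print(stripped_lines)
```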



### **.readlines()**

Reads all the text line by line and returns a list of lines.

In [8]:
f = open("./files/reading_file_example.txt")
lines = f.readlines()
print(type(lines))
print(lines)
f.close()

<class 'list'>
['This is an example to show how to open a file and read.\n', 'This is the second line of the text.']


Another way to get all the lines as a list is using `splitlines()`:

In [9]:
f = open('./files/reading_file_example.txt')
lines = f.read().splitlines()
print(type(lines))
print(lines)
f.close()

<class 'list'>
['This is an example to show how to open a file and read.', 'This is the second line of the text.']


It is important to always close a file after opening it. Since it is easy to forget, Python provides the `with` statement, which closes the file automatically when the block ends.

In [10]:
with open("./files/reading_file_example.txt") as f:
    lines = f.read().splitlines()
    print(lines)
    print(type(lines))

['This is an example to show how to open a file and read.', 'This is the second line of the text.']
<class 'list'>
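A short sketch of what the `with` statement guarantees, using a temporary file created only for this illustration: the `closed` attribute of the file object shows that the file is closed when the block ends, even if an error is raised inside it.

```python
import os
import tempfile

# Hypothetical temporary file, used only for this sketch.
tmp = tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False)
tmp.write("some text")
tmp.close()

with open(tmp.name) as f:
    inside_closed = f.closed      # False: the file is open inside the block
after_closed = f.closed           # True: closed when the block ended

# The file is closed even if an error occurs inside the block.
try:
    with open(tmp.name) as g:
        raise ValueError("something went wrong while reading")
except ValueError:
    closed_after_error = g.closed  # True

os.remove(tmp.name)  # clean up the temporary file
print(inside_closed, after_closed, closed_after_error)
```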


### **Opening files for writing and updating**

To write to a file, we pass a mode as the second argument to `open()`: `"a"` appends to the end of an existing file, while `"w"` overwrites it.

In [11]:
with open("./files/reading_file_example.txt", "a") as f:
    f.write("This text has to be appended at the end")

In [12]:
# "w" mode creates the file if it doesn't exist and overwrites it if it does.
with open("./files/writing_file_example.txt", "w") as f:
    f.write("This text will be written in a newly created file.")
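Two details of `write()` are easy to miss: it does not add a newline on its own, and it returns the number of characters written. The sketch below shows both in a temporary directory; the file name `notes.txt` is made up for this example.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "notes.txt")  # hypothetical file name

    with open(path, "w", encoding="utf-8") as f:
        written = f.write("first line\n")   # returns the number of characters written
        f.write("second line")              # no newline is added automatically

    with open(path, "a", encoding="utf-8") as f:
        f.write("\nappended line")          # start a new line before appending

    with open(path, encoding="utf-8") as f:
        notes_lines = f.read().splitlines()

print(written)
print(notes_lines)
```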

### **Deleting files**

In a previous section we saw how to create and remove a directory using the `os` module. The same module is used to remove a file.

import os

os.remove("./files/example.txt")

If the file doesn't exist, `os.remove()` raises a `FileNotFoundError`. It is therefore good practice to check first:

import os

if os.path.exists("./files/example.txt"):
    os.remove("./files/example.txt")
else:
    print("The file does not exist")
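Checking first and removing afterwards leaves a small window in which another process could delete the file. An alternative sketch attempts the removal and handles the error instead; the helper name `remove_if_exists` is made up for this example.

```python
import os
import tempfile

def remove_if_exists(path):
    """Delete path and report whether a file was actually removed."""
    try:
        os.remove(path)
        return True
    except FileNotFoundError:
        return False

# Create an empty temporary file to delete.
tmp = tempfile.NamedTemporaryFile(delete=False)
tmp.close()

first_attempt = remove_if_exists(tmp.name)    # the file exists, so it is removed
second_attempt = remove_if_exists(tmp.name)   # the file is already gone
print(first_attempt, second_attempt)
```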

## **File Types**

### **JSON**

JSON stands for JavaScript Object Notation. It is a text format whose structure closely resembles a JavaScript object or a Python dictionary written out as a string.

*Example:*

In [13]:
# Dictionary
person_dct = {
    "name": "Alex",
    "country": "Colombia",
    "city": "Neiva",
    "skills": ["Python", "Docker", "Power BI", "SQL"]
}

# JSON: a string form of the dictionary (JSON strings must use double quotes)
person_json = '{"name": "Alex", "country": "Colombia", "city": "Neiva", "skills": ["Python", "Docker", "Power BI", "SQL"]}'

# Using triple quotes to make it more readable
person_json = """{
    "name": "Alex",
    "country": "Colombia",
    "city": "Neiva",
    "skills": ["Python", "Docker", "Power BI", "SQL"]
}"""

### **Changing JSON to a dictionary**

In [14]:
import json
# JSON
person_json = """{
    "name": "Alex",
    "country": "Colombia",
    "city": "Neiva",
    "skills": ["Python", "Docker", "Power BI", "SQL"]
}"""

person_dct = json.loads(person_json)
print(type(person_dct))
print(person_dct)
print(person_dct["name"])

<class 'dict'>
{'name': 'Alex', 'country': 'Colombia', 'city': 'Neiva', 'skills': ['Python', 'Docker', 'Power BI', 'SQL']}
Alex
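`json.loads()` is strict about the format: JSON strings must use double quotes, so the `repr` of a Python dictionary (with single quotes) is rejected. A minimal sketch with made-up sample strings:

```python
import json

valid = '{"name": "Alex", "skills": ["Python", "SQL"]}'
invalid = "{'name': 'Alex'}"   # single quotes are not valid JSON

parsed = json.loads(valid)

try:
    json.loads(invalid)
    invalid_rejected = False
except json.JSONDecodeError:
    invalid_rejected = True

print(parsed["skills"])
print(invalid_rejected)
```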


### **Changing a dictionary to JSON**

To convert a dictionary to a JSON string, we use the `dumps()` function from the `json` module.

In [19]:
import json
# Dictionary
person_dct = {
    "name": "Alex",
    "country": "Colombia",
    "city": "Neiva",
    "skills": ["Python", "Docker", "Power BI", "SQL"]
}

person_json = json.dumps(person_dct, indent=4) # indent sets the number of spaces per nesting level (commonly 2 or 4) and makes the JSON easier to read
print(type(person_json))
print(person_json)

<class 'str'>
{
    "name": "Alex",
    "country": "Colombia",
    "city": "Neiva",
    "skills": [
        "Python",
        "Docker",
        "Power BI",
        "SQL"
    ]
}


### **Saving a JSON file**

We can also save our data as a JSON file. To write one, we use the `json.dump()` function, which takes the dictionary and the output file object, plus optional arguments such as `ensure_ascii` and `indent`.

In [None]:
import json
# Dictionary
person = {
    "name": "Alex",
    "country": "Colombia",
    "city": "Neiva",
    "skills": ["Python", "Docker", "Power BI", "SQL"]
}

with open("./files/json_example.json", "w", encoding="utf-8") as f:
    json.dump(person, f, ensure_ascii=False, indent=4)
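The counterpart of `json.dump()` is `json.load()`, which reads a JSON file back into Python objects. The round trip below is a small sketch that writes to a temporary directory instead of `./files`, so it can be run anywhere.

```python
import json
import os
import tempfile

person = {
    "name": "Alex",
    "country": "Colombia",
    "city": "Neiva",
    "skills": ["Python", "Docker", "Power BI", "SQL"]
}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "json_example.json")

    # Write the dictionary to disk as JSON.
    with open(path, "w", encoding="utf-8") as f:
        json.dump(person, f, ensure_ascii=False, indent=4)

    # Read it back into a Python dictionary.
    with open(path, encoding="utf-8") as f:
        loaded = json.load(f)

print(loaded == person)
print(loaded["skills"])
```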

### **CSV**

CSV stands for Comma Separated Values. CSV is a simple file format used to store structured data, such as a spreadsheet or database. CSV is a very common data format in data science.

Example:

"name", "country", "city", "skills"

"Alex", "Colombia", "Neiva", "Python"

In [1]:
import csv
with open("./files/csv_example.csv") as f:
    csv_reader = csv.reader(f, delimiter=",") # we use reader method to read csv
    line_count = 0
    for row in csv_reader:
        if line_count == 0:
            print(f"column names are: {', '.join(row)}")
            line_count += 1
        else:
            print(f"\t{row[0]} is a data analyst. He lives in {row[1]}, {row[2]}.")
            line_count += 1
    print(f"Number of lines: {line_count}")

column names are: name, country, city, skills
	Alex is a data analyst. He lives in  "Colombia",  "Neiva".
Number of lines: 2
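The quotes in the output above most likely come from the spaces after the commas: by default `csv.reader` only treats a quote as a quoting character when it is the first character of a field. One way to handle such input, sketched here on an in-memory string rather than the file used above, is to pass `skipinitialspace=True`. The sketch also uses `csv.DictReader`, which maps each row to the column names from the header.

```python
import csv
import io

# In-memory stand-in for a CSV file with a space after each comma.
csv_text = '"name", "country", "city", "skills"\n"Alex", "Colombia", "Neiva", "Python"\n'

reader = csv.DictReader(io.StringIO(csv_text), skipinitialspace=True)
rows = list(reader)            # each row is a mapping from column name to value
column_names = reader.fieldnames

print(column_names)
print(rows[0]["name"], rows[0]["country"], rows[0]["city"])
```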


### **XLSX**

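To read Excel files we can use the `xlrd` package, which has to be installed first. Note that `xlrd` 2.0 and later reads only the older `.xls` format; for `.xlsx` files with those versions, a package such as `openpyxl` is needed instead.

import xlrd

excel_book = xlrd.open_workbook("sample.xlsx")

print(excel_book.nsheets)

print(excel_book.sheet_names())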

### **XML**

XML is another structured data format which looks like HTML. In XML the tags are not predefined. In the example below, the first line is an XML declaration, the `person` tag is the root of the document, and it has a `gender` attribute.

```
<?xml version="1.0"?>
<person gender="female">
  <name>Asabeneh</name>
  <country>Finland</country>
  <city>Helsinki</city>
  <skills>
    <skill>JavaScript</skill>
    <skill>React</skill>
    <skill>Python</skill>
  </skills>
</person>
```

```
import xml.etree.ElementTree as ET
tree = ET.parse('./files/xml_example.xml')
root = tree.getroot()
print('Root tag:', root.tag)
print('Attribute:', root.attrib)
for child in root:
    print('field: ', child.tag)
```
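To go one step further and read the values, a minimal sketch parses the same document from a string with `ET.fromstring()` (so it does not depend on the file) and uses `find()`, `findall()` and `get()` to pull out text and attributes.

```python
import xml.etree.ElementTree as ET

xml_text = """<?xml version="1.0"?>
<person gender="female">
  <name>Asabeneh</name>
  <country>Finland</country>
  <city>Helsinki</city>
  <skills>
    <skill>JavaScript</skill>
    <skill>React</skill>
    <skill>Python</skill>
  </skills>
</person>"""

root = ET.fromstring(xml_text)

gender = root.get("gender")                       # attribute of the root element
name = root.find("name").text                     # text of the first <name> child
skills = [skill.text for skill in root.findall("./skills/skill")]

print(root.tag, gender)
print(name)
print(skills)
```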

### **Exercises**

##### **Level 1**

1. Write a function which counts the number of lines and the number of words in a text. All the files are in the files folder: a) Read the obama_speech.txt file and count the number of lines and words; b) Read the michelle_obama_speech.txt file and count the number of lines and words; c) Read the donald_speech.txt file and count the number of lines and words; d) Read the melina_trump_speech.txt file and count the number of lines and words.

In [16]:
# Steps
# define function by passing a filepath as argument
# read the file
# split the number of lines into a list
# count the number of elements of the list
# store the value into a counter

# Grab each element of the list and count the words
# store the number of words into a counter
# return the counter of lines and the counter of words

import re

def txt_counters(filepath: str):
    """
    Returns the total number of lines and total number of words of a given .txt file
    """
    # Count of lines
    with open(filepath, "r") as f:
        lines_list = f.read().splitlines()
        total_lines = len(lines_list)
    
    # Count of words
    total_word_count = 0
    for line in lines_list:
        word_list = re.findall(r'[a-zA-Z]+', line)
        num_words_per_line = len(word_list)
        total_word_count += num_words_per_line
    
    return total_lines, total_word_count

In [17]:
obama_num_lines, obama_num_words = txt_counters(filepath="./files/obama_speech.txt")
michelle_num_lines, michelle_num_words = txt_counters(filepath="./files/michelle_obama_speech.txt")
trump_num_lines, trump_num_words = txt_counters(filepath="./files/donald_speech.txt")
melina_num_lines, melina_num_words = txt_counters(filepath="./files/melina_trump_speech.txt")

print(f"Obama's speech has {obama_num_lines} lines and {obama_num_words} total words.")
print(f"Michelle Obama's speech has {michelle_num_lines} lines and {michelle_num_words} total words.")
print(f"Trump's speech has {trump_num_lines} lines and {trump_num_words} total words.")
print(f"Melania Trump's speech has {melina_num_lines} lines and {melina_num_words} total words.")

Obama's speech has 66 lines and 2401 total words.
Michelle Obama's speech has 83 lines and 2217 total words.
Trump's speech has 48 lines and 1264 total words.
Melania Trump's speech has 33 lines and 1371 total words.


2. Read the countries_data.json file in the data directory and create a function that finds the ten most spoken languages.

In [38]:
import json
import itertools
from collections import Counter

def most_spoken_languages(filepath: str):
    """
    Returns a list of tuples with the top 10 most spoken languages in the world and the number of countries that speak the language in descending order.
    """
    
    # Read .json file
    with open(filepath, "r") as f:
        countries_info = json.load(f)
    
    # Get languages info and store it into a list
    languages_info = itertools.chain.from_iterable(country.get("languages", []) for country in countries_info)

    # Count frequency of languages, sort them by number in descending order and return the first 10
    language_count = dict(Counter(languages_info))
    most_spoken = sorted(language_count.items(), key=lambda x: x[1], reverse=True) # sorted function returns a list object
    return most_spoken[:10]

most_spoken_languages(filepath="./files/countries_data.json")

[('English', 91),
 ('French', 45),
 ('Arabic', 25),
 ('Spanish', 24),
 ('Portuguese', 9),
 ('Russian', 9),
 ('Dutch', 8),
 ('German', 7),
 ('Chinese', 5),
 ('Serbian', 4)]
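`Counter` can also do the sorting and slicing itself through its `most_common()` method. The sketch below shows this on a small made-up sample with the same shape as the `languages` lists used above; the country names and counts are illustrative only.

```python
from collections import Counter

# Made-up sample in the same shape as the countries data.
countries = [
    {"name": "A", "languages": ["English", "French"]},
    {"name": "B", "languages": ["English"]},
    {"name": "C", "languages": ["Spanish", "English"]},
    {"name": "D", "languages": ["French"]},
    {"name": "E"},  # a country without a "languages" key is skipped
]

language_count = Counter(
    language
    for country in countries
    for language in country.get("languages", [])
)

# most_common(n) returns the n most frequent items as (item, count) tuples.
top_two = language_count.most_common(2)
print(top_two)
```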

3. Read the countries_data.json file in the data directory and create a function that returns a list of the ten most populated countries.

In [41]:
import json

def most_populated_countries(filepath: str):
    """
    Returns a list of the top 10 most populated countries and their population.
    """
    
    # Read .json file
    with open(filepath, "r") as f:
        countries_info = json.load(f)
    
    # Get countries names and population data
    countries_names = [country.get("name", None) for country in countries_info]
    countries_pop = [country.get("population", None) for country in countries_info]

    # Store info into a dictionary, sort them by population in descending order and return the first 10
    population_dict = dict(zip(countries_names, countries_pop))
    sorted_pop = sorted(population_dict.items(), key=lambda x: x[1], reverse=True)
    return sorted_pop[:10]

most_populated_countries(filepath="./files/countries_data.json")

[('China', 1377422166),
 ('India', 1295210000),
 ('United States of America', 323947000),
 ('Indonesia', 258705000),
 ('Brazil', 206135893),
 ('Pakistan', 194125062),
 ('Nigeria', 186988000),
 ('Bangladesh', 161006790),
 ('Russian Federation', 146599183),
 ('Japan', 126960000)]

##### **Level 2**

4. Extract all incoming email addresses as a list from the email_exchange_big.txt file.

In [83]:
import re

def get_emails(filepath: str):
    """
    Returns a list with all incoming emails from the emails file.
    """
    with open(filepath, "r", encoding="utf-8") as f:
        lines = f.read().splitlines()

    emails_list = []
    for i in lines:
        match = re.findall(r'From\s([\w.-]+)@([\w.-]+)', i)
        if match != []:
            email = match[0][0] + "@" + match[0][1]
            emails_list.append(email)
    return emails_list

get_emails(filepath="./files/email_exchange_big.txt")


['stephen.marquard@uct.ac.za',
 'louis@media.berkeley.edu',
 'zqian@umich.edu',
 'rjlowe@iupui.edu',
 'zqian@umich.edu',
 'rjlowe@iupui.edu',
 'cwen@iupui.edu',
 'cwen@iupui.edu',
 'gsilver@umich.edu',
 'gsilver@umich.edu',
 'zqian@umich.edu',
 'gsilver@umich.edu',
 'wagnermr@iupui.edu',
 'zqian@umich.edu',
 'antranig@caret.cam.ac.uk',
 'gopal.ramasammycook@gmail.com',
 'david.horwitz@uct.ac.za',
 'david.horwitz@uct.ac.za',
 'david.horwitz@uct.ac.za',
 'david.horwitz@uct.ac.za',
 'stephen.marquard@uct.ac.za',
 'louis@media.berkeley.edu',
 'louis@media.berkeley.edu',
 'ray@media.berkeley.edu',
 'cwen@iupui.edu',
 'cwen@iupui.edu',
 'cwen@iupui.edu',
 'ray@media.berkeley.edu',
 'cwen@iupui.edu',
 'zqian@umich.edu',
 'cwen@iupui.edu',
 'zqian@umich.edu',
 'zqian@umich.edu',
 'zqian@umich.edu',
 'mmmay@indiana.edu',
 'cwen@iupui.edu',
 'zqian@umich.edu',
 'zqian@umich.edu',
 'zqian@umich.edu',
 'cwen@iupui.edu',
 'zqian@umich.edu',
 'cwen@iupui.edu',
 'ray@media.berkeley.edu',
 'zqian@umic

5. Find the most common words in the English language. Name your function find_most_common_words. It takes two parameters, a string or a file and a positive integer indicating the number of words, and returns a list of tuples in descending order of frequency.

In [9]:
import re
from collections import Counter

def find_most_common_words(filepath: str, total_words: int):
    with open(filepath, "r") as f:
        file_content = f.read()
    
    word_list = re.findall(r'[a-zA-Z]+', file_content)
    word_count = dict(Counter(word_list))
    sorted_word_count = sorted(word_count.items(), key=lambda x: x[1], reverse=True)
    
    return sorted_word_count[:total_words]

find_most_common_words(filepath="./files/reading_file_example.txt", total_words=10)

[('This', 3),
 ('to', 3),
 ('the', 3),
 ('is', 2),
 ('text', 2),
 ('an', 1),
 ('example', 1),
 ('show', 1),
 ('how', 1),
 ('open', 1)]

6. Use the function `find_most_common_words` to find: a) The ten most frequent words used in Obama's speech; b) The ten most frequent words used in Michelle's speech; c) The ten most frequent words used in Trump's speech; d) The ten most frequent words used in Melania's speech.

In [11]:
print(f"Barack Obama: {find_most_common_words(filepath='./files/obama_speech.txt', total_words=10)}")
print(f"Michelle Obama: {find_most_common_words(filepath='./files/michelle_obama_speech.txt', total_words=10)}")
print(f"Donald Trump: {find_most_common_words(filepath='./files/donald_speech.txt', total_words=10)}")
print(f"Melania Trump: {find_most_common_words(filepath='./files/melina_trump_speech.txt', total_words=10)}")

Barack Obama: [('the', 120), ('and', 107), ('of', 81), ('to', 66), ('our', 58), ('we', 50), ('that', 49), ('a', 48), ('is', 36), ('us', 23)]
Michelle Obama: [('to', 84), ('and', 80), ('the', 78), ('of', 46), ('that', 43), ('a', 41), ('in', 36), ('s', 29), ('I', 28), ('he', 28)]
Donald Trump: [('the', 61), ('and', 53), ('will', 40), ('of', 38), ('to', 32), ('our', 30), ('we', 27), ('is', 20), ('We', 17), ('America', 17)]
Melania Trump: [('and', 73), ('to', 54), ('the', 48), ('is', 29), ('I', 28), ('for', 27), ('of', 25), ('a', 22), ('that', 19), ('you', 18)]


7. Write a Python application that checks the similarity between two texts. It takes two files or strings as parameters and evaluates how similar the two texts are. For instance, check the similarity between the transcripts of Michelle's and Melania's speeches. You may need a couple of functions: one to clean the text (`clean_text`), one to remove support words (`remove_support_words`) and finally one to check the similarity (`check_text_similarity`). A list of stop words is in the data directory.
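
One possible starting point for this exercise is sketched below. The exercise does not say how similarity should be measured, so this sketch assumes the Jaccard index (shared vocabulary divided by combined vocabulary). The short `STOP_WORDS` set is a made-up stand-in for the stop-word list in the data directory, and the functions take strings rather than files.

```python
import re

# Hypothetical, tiny stop-word list; the exercise refers to a fuller one.
STOP_WORDS = {"a", "an", "and", "the", "is", "of", "to", "in", "that", "i"}

def clean_text(text):
    """Lowercase the text and return its alphabetic words as a list."""
    return re.findall(r"[a-z]+", text.lower())

def remove_support_words(words, stop_words=STOP_WORDS):
    """Drop stop words from a list of words."""
    return [word for word in words if word not in stop_words]

def check_text_similarity(text_a, text_b):
    """Return the Jaccard similarity (0.0 to 1.0) of the two vocabularies."""
    words_a = set(remove_support_words(clean_text(text_a)))
    words_b = set(remove_support_words(clean_text(text_b)))
    if not words_a and not words_b:
        return 1.0  # two empty texts are treated as identical
    return len(words_a & words_b) / len(words_a | words_b)

score = check_text_similarity(
    "The quick brown fox jumps.",
    "A quick brown dog jumps high.",
)
print(score)
```

To compare two speeches, the contents of each file could be read with `open()` and passed to `check_text_similarity` as strings.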