# Python for data science - Part 1: Reading & writing
Welcome to Python for data science, part 1. Because data science and analytics start with data, and getting data is always a challenge, this notebook aims to help you develop some data handling skills & tricks using Python.

---

### Download data
To start off, let's use this rather simplistic command to download some data. You can just execute the cell below.

In [None]:
! wget https://github.com/datamind-dotfit/python_for_data_science/raw/master/ex_1/file_1.txt https://github.com/datamind-dotfit/python_for_data_science/raw/master/ex_1/file_2.txt https://github.com/datamind-dotfit/python_for_data_science/raw/master/ex_1/file_3.txt https://github.com/datamind-dotfit/python_for_data_science/raw/master/ex_1/file_4.txt https://github.com/datamind-dotfit/python_for_data_science/raw/master/ex_1/file_5.txt https://github.com/datamind-dotfit/python_for_data_science/raw/master/ex_1/file_6.txt https://github.com/datamind-dotfit/python_for_data_science/raw/master/ex_1/file_7.txt https://github.com/datamind-dotfit/python_for_data_science/raw/master/ex_1/file_8.txt    

---
## Opening a single file
In Python, there is no need to import an external library to read and write files. Python provides built-in functions for creating, writing and reading files.

To open a single file in Python we can use the **open** function. The open function accepts several arguments (the values you pass to a function), such as the file name and the mode. Let's first use **! ls** to execute a shell command that lists all files in the current working directory.

In [29]:
! ls

data                    file_2.txt              file_6.txt
ex1_data_handling.ipynb file_3.txt              file_7.txt
ex2_pandas.ipynb        file_4.txt              file_8.txt
file_1.txt              file_5.txt


It appears that we have a bunch of files named file_1 to file_8, and judging by the extension they are all .txt files. Let's see if we can use the **open()** function to open one of them.

**Question:**
- Open a .txt file from the working directory using the open() function and assign this to the variable **f**.
- What does the mode argument in open mean? What options do you have and when would you use them?
- Try to print out the variable **f** using the **print()** function, what do you see?

In [56]:
# The file mode specifies what type of stream to open
# File modes: https://www.guru99.com/reading-and-writing-files-in-python.html#5
f = open('file_1.txt', 'r')

# Printing f shows the file object (an IO wrapper), not the file's contents
print(f)

<_io.TextIOWrapper name='file_1.txt' mode='r' encoding='UTF-8'>
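
The question about the mode argument can be explored with a small experiment. The sketch below is only an illustration: it works in a temporary directory, and the file name `modes_demo.txt` is made up for this example so that none of the downloaded files are touched.

```python
import os
import tempfile

# Work in a throwaway directory so no notebook files are overwritten.
workdir = tempfile.mkdtemp()
path = os.path.join(workdir, 'modes_demo.txt')  # hypothetical file name

# 'w' creates the file, or empties it first if it already exists.
with open(path, 'w') as f:
    f.write('first line\n')

# 'a' appends to the end and keeps the existing contents.
with open(path, 'a') as f:
    f.write('second line\n')

# 'r' (the default) opens for reading only; the file must already exist.
with open(path, 'r') as f:
    contents = f.read()
print(contents)

# 'x' creates a new file and raises FileExistsError if the path is taken.
try:
    open(path, 'x')
    exclusive_create_failed = False
except FileExistsError:
    exclusive_create_failed = True
print(exclusive_create_failed)
```

Adding `b` to any of these modes (for example `'rb'`) opens the file as raw bytes instead of text.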


---
### The file object
So we can open a file using **open()**, but Python hands it back as a file object rather than as the file's contents. Everything in Python is an object, and objects have properties (attributes) and functions that belong to them (methods). This might sound confusing but it's actually really simple.
<br><br>
Our file object "f", for instance, has the method *read*, which is part of the object. This means that if we call **read()** on the object, it will read the file contents.

Remember that we can call methods on objects by using dot notation. So, using the file object, we may access any method or property of this object by writing the object name, a dot, and the member name, as in **f.read()**.

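**Note**: reading moves the file position to the end of the file, so a second **read()** call on the same file object returns an empty string. It is therefore wise to put the open and read in the same cell if you want to experiment with different functions.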
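
A minimal sketch of what the note describes, assuming a small temporary file (the name `position_demo.txt` is made up for this example). It also shows **seek(0)**, which moves the position back to the start so the file can be read again without reopening it.

```python
import os
import tempfile

# Create a small throwaway file to read from (hypothetical name).
path = os.path.join(tempfile.mkdtemp(), 'position_demo.txt')
with open(path, 'w') as f:
    f.write('alpha\nbeta\n')

f = open(path, 'r')
first_read = f.read()    # reads everything and leaves the position at the end
second_read = f.read()   # nothing left to read, so this is an empty string
f.seek(0)                # move the position back to the start of the file
third_read = f.read()    # the full contents again
f.close()

print(repr(first_read))
print(repr(second_read))
print(first_read == third_read)
```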


**Question:**
- Open a text file using **open()**
- Print its contents
- What is the difference between the **read()** and **readlines()** functions?

In [None]:
### Your solution, approximately 4 lines of code ###

# Print contents
f = open('file_1.txt', 'r')
print(f.read())
f.close()

# readlines() returns a list of lines (each still ending in a newline); read() returns the entire contents as one string.
f = open('file_1.txt', 'r')
print(f.readlines())
f.close()
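
To make the difference concrete without depending on the downloaded data, here is a small sketch on a temporary file (the name `lines_demo.txt` and its contents are made up for this example). It also shows a third option, looping over the file object directly, which yields one line at a time.

```python
import os
import tempfile

# Create a three-line throwaway file (hypothetical name and contents).
path = os.path.join(tempfile.mkdtemp(), 'lines_demo.txt')
with open(path, 'w') as f:
    f.write('row 1\nrow 2\nrow 3\n')

# read() returns one string containing the whole file.
with open(path, 'r') as f:
    as_string = f.read()

# readlines() returns a list with one string per line, newlines included.
with open(path, 'r') as f:
    as_list = f.readlines()

# Iterating over the file object yields lines one by one;
# rstrip('\n') removes the trailing newline from each.
with open(path, 'r') as f:
    stripped = [line.rstrip('\n') for line in f]

print(type(as_string), len(as_string))
print(as_list)
print(stripped)
```

Iterating line by line avoids loading the whole file into memory at once, which matters for large files.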

---
### Writing data to a file
We may also use the **open()** function to create a new file to write output to. We only have to change the mode to something that allows us to write instead of read.

**Question:**
- Use the **open()** function to create a new file called "output_file.txt", make sure we can write to it.
- Create a string that you will insert into the file
- Insert the string in the file
- Before you can view the result you'll have to close the file, so that the contents are flushed to disk
- Verify that your output file contains the string


In [None]:
### Your solution, approximately 4 lines of code ###

In [44]:
output = open('output_file.txt', 'w')
output_string = 'This should go into'
output.write(output_string)
output.close()
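
The last step of the question, verifying the result, could be sketched as follows. This is one possible approach rather than the reference solution: it writes to a temporary directory so the notebook's own `output_file.txt` is left alone, and it uses a **with** block, which closes the file automatically when the block ends.

```python
import os
import tempfile

# Write to a throwaway directory instead of the working directory.
output_path = os.path.join(tempfile.mkdtemp(), 'output_file.txt')
output_string = 'This should go into'

# write() returns the number of characters written.
with open(output_path, 'w') as output:
    chars_written = output.write(output_string)

# Reopen the file for reading and compare with what we wrote.
with open(output_path, 'r') as f:
    read_back = f.read()

print(chars_written)
print(read_back == output_string)
```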

---
# Python for data science - part 2: Merging files
In practice, it often happens that data is periodically dumped in a certain location. This could be a network disk, your local computer, a data lake, a blob storage, e-mail or an FTP server.

<br>If you're lucky the data has the same format across files, but it is divided into parts. In this part we'll use Python file objects to read in multiple files and write the contents to a single output file.

Up to now you've learned how to open a single file for reading and writing. Combine this skill with some Python basics to merge all the separate files into a single output file.

**Question**:
- Loop over all of the .txt files and store them in a single output file.

<br>

**Bonus**:
- Record the number of rows in each file, and automatically check if the single output file rows match the sum of the rows in each file.

In [55]:
## Variables
OUTPUT_FILENAME = 'output_file.txt'
INPUT_FILENAME = 'file_{}.txt'

In [54]:
## Solution 0 without using with
output_file = open(OUTPUT_FILENAME, 'w')
for i in range(1,9):
    f = open(INPUT_FILENAME.format(i), 'r')
    output_file.write(f.read())
    # Without with, every input file has to be closed by hand
    f.close()

output_file.close()

In [48]:
## Solution 1 using with
with open(OUTPUT_FILENAME, 'w') as output:
    for i in range(1,9):
        with open(INPUT_FILENAME.format(i), 'r') as f:
            output.write(f.read())

In [9]:
## Solution 2 using with and list comprehension
files = [INPUT_FILENAME.format(x) for x in range(1,9)]

with open(OUTPUT_FILENAME, 'w') as output:
    for file_name in files:
        with open(file_name, 'r') as f:
            output.write(f.read())
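
The solutions above hard-code the range 1 to 8. When the number of files is not known in advance, the standard library's `glob` module can discover them by pattern. The sketch below is an alternative under assumptions, not part of the original exercise: it creates three small sample files in a temporary directory so that it runs on its own.

```python
import glob
import os
import tempfile

# Create a few sample input files in a throwaway directory (made-up contents).
workdir = tempfile.mkdtemp()
for i in range(1, 4):
    with open(os.path.join(workdir, 'file_{}.txt'.format(i)), 'w') as f:
        f.write('contents of file {}\n'.format(i))

# glob returns matching paths in arbitrary order, so sort them explicitly.
input_paths = sorted(glob.glob(os.path.join(workdir, 'file_*.txt')))
output_path = os.path.join(workdir, 'output_file.txt')

# Merge every discovered file into the single output file.
with open(output_path, 'w') as output:
    for input_path in input_paths:
        with open(input_path, 'r') as f:
            output.write(f.read())

with open(output_path, 'r') as f:
    merged = f.read()
print(merged)
```

Note that `sorted()` orders the names as text, so `file_10.txt` would come before `file_2.txt`; with more than nine files a numeric sort key would be needed. The pattern `file_*.txt` does not match `output_file.txt`, so the output is never merged into itself.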

In [10]:
## View result
with open(OUTPUT_FILENAME, 'r') as f:
    contents = f.read()

print(contents)

In [52]:
## Bonus
rows = 0
# Solution 1
with open(OUTPUT_FILENAME, 'w') as output:
    for i in range(1,9):
        with open(INPUT_FILENAME.format(i), 'r') as f:
            lines = f.readlines()
            rows += len(lines)
            output.writelines(lines)
            
with open(OUTPUT_FILENAME, 'r') as f:
    lines = len(f.readlines())
    
print(lines == rows)

True
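
The bonus also asks to record the number of rows in each file, while the solution above only keeps a running total. One possible variation stores the counts in a dictionary keyed by file name. The sample files and their row counts below are made up so that the sketch runs on its own in a temporary directory.

```python
import os
import tempfile

# Hypothetical sample data: file name -> number of rows to generate.
workdir = tempfile.mkdtemp()
sample_rows = {'file_1.txt': 2, 'file_2.txt': 3}
for name, n_rows in sample_rows.items():
    with open(os.path.join(workdir, name), 'w') as f:
        f.writelines('row {}\n'.format(r) for r in range(n_rows))

rows_per_file = {}
output_path = os.path.join(workdir, 'output_file.txt')

with open(output_path, 'w') as output:
    for name in sorted(sample_rows):
        with open(os.path.join(workdir, name), 'r') as f:
            lines = f.readlines()
        rows_per_file[name] = len(lines)  # remember the count per file
        output.writelines(lines)

# Count the rows in the merged file by iterating over it line by line.
with open(output_path, 'r') as f:
    total_rows = sum(1 for _ in f)

# Fail loudly if the merged file does not contain every input row.
assert total_rows == sum(rows_per_file.values()), 'row counts do not match'
print(rows_per_file, total_rows)
```

One caveat: if an input file does not end with a newline, its last line is glued to the first line of the next file, and the merged row count comes out lower than the sum.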


# Closing notes
So now you've learned how the **open()** function works in Python, and you're able to merge multiple files into a single file. In the future you can use the same function to save more than just text files. The same syntax applies to machine learning models, which may be stored as .joblib files: in that case you would open the .joblib file in binary mode and pass the file object to joblib, the library commonly used to persist scikit-learn models. We also provide a machine learning training where we take a deep dive into bringing your model to production.
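
As a rough illustration of the binary-mode idea, the sketch below uses the standard library's `pickle` module as a stand-in for joblib, so that it runs without extra installs. The dictionary posing as a "model" and the file name `model.pkl` are made up for this example; a real workflow would store a trained estimator.

```python
import os
import pickle
import tempfile

# A plain dictionary stands in for a trained model (hypothetical values).
model = {'weights': [0.1, 0.2], 'bias': 0.5}
model_path = os.path.join(tempfile.mkdtemp(), 'model.pkl')

# 'wb' opens the file for writing bytes instead of text.
with open(model_path, 'wb') as f:
    pickle.dump(model, f)

# 'rb' opens the file for reading bytes.
with open(model_path, 'rb') as f:
    restored = pickle.load(f)

print(restored == model)
```

Only load pickled files from sources you trust, since unpickling can execute arbitrary code.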