## Interacting with the OS and filesystem

The `os` module in Python provides many functions for interacting with the OS and the filesystem. Let's import it and try out some examples.

In [24]:
import os

In [25]:
os.getcwd()  # check the current working directory

'C:\\Users\\luthr'

In [26]:
os.listdir('.') # relative path to get the list of files in a directory

['.android',
 '.cache',
 '.eclipse',
 '.ipynb_checkpoints',
 '.ipython',
 '.jovian',
 '.jovianrc',
 '.jupyter',
 '.keras',
 '.lemminx',
 '.m2',
 '.matplotlib',
 '.p2',
 '.spyder-py3',
 '.VirtualBox',
 '3D Objects',
 'ACCURACY VISUALIZATION ON ML ALGO WATER QUALITY PREDICTION USING ML (1).ipynb',
 'anaconda3',
 'AppData',
 'Application Data',
 'climate.txt',
 'climate_result.txt',
 'Contacts',
 'Cookies',
 'DATA ANALYSIS WITH PYTHON (LECTURE 1&2).ipynb',
 'DATA ANALYSIS WITH PYTHON (NUMPY).ipynb',
 'DATA ANALYSIS WITH PYTHON (OS).ipynb',
 'DATASCIENCE BOOTCAMP (PYTHON BASICS).ipynb',
 'Desktop',
 'Diabetes.ipynb',
 'Documents',
 'Downloads',
 'eclipse',
 'eclipse-workspace',
 'Favorites',
 'Gauransh_practice',
 'GUI PRACTICE.ipynb',
 'GUI.ipynb',
 'IntelGraphicsProfiles',
 'Jedi',
 'KERAS.ipynb',
 'Links',
 'Local Settings',
 'measurement_values.png',
 'Music',
 'My Documents',
 'NetHood',
 'New Folder',
 'NTUSER.DAT',
 'ntuser.dat.LOG1',
 'ntuser.dat.LOG2',
 'NTUSER.DAT{55860434-1ada-1

In [27]:
os.makedirs('./os_data',exist_ok=True)

The `exist_ok` parameter, accepted by both `os.makedirs` and `Path.mkdir`, controls whether an error is raised when the directory already exists. With `exist_ok=True`, the call does nothing if the directory is already there; with the default `exist_ok=False`, it raises a `FileExistsError`.
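
To make the effect of `exist_ok` concrete, here is a minimal sketch that runs inside a temporary directory (the names `target` and `nested` are only illustrative). It creates a directory, creates it again safely, and then shows the error raised when `exist_ok` is left at its default.

```python
import os
import tempfile
from pathlib import Path

# Work inside a throwaway directory so the sketch leaves nothing behind.
with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, 'os_data')

    os.makedirs(target)                 # first call creates the directory
    os.makedirs(target, exist_ok=True)  # already exists, so nothing happens

    try:
        os.makedirs(target)             # exist_ok defaults to False
        raised = False
    except FileExistsError:
        raised = True

    # pathlib equivalent; parents=True also creates missing parent directories
    nested = Path(tmp) / 'a' / 'b'
    nested.mkdir(parents=True, exist_ok=True)
    nested_created = nested.is_dir()

print(raised)          # True
print(nested_created)  # True
```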

In [28]:
os.listdir('C:\\Users\\luthr\\Desktop\\APPLY')  # an absolute path works as well

['Gauransh_coverletter.docx',
 'Gauransh_Resume.pdf',
 'Gauransh_transcript.docx',
 'Transcript.docx']

In [29]:
'os_data' in os.listdir('.')

True

In [30]:
os.listdir('os_data')

['loans1.txt', 'loans2.txt', 'loans3.txt']

Let us download some files into the `os_data` directory using the `urllib` module.

In [31]:
url1 = 'https://gist.githubusercontent.com/aakashns/257f6e6c8719c17d0e498ea287d1a386/raw/7def9ef4234ddf0bc82f855ad67dac8b971852ef/loans1.txt'
url2 = 'https://gist.githubusercontent.com/aakashns/257f6e6c8719c17d0e498ea287d1a386/raw/7def9ef4234ddf0bc82f855ad67dac8b971852ef/loans2.txt'
url3 = 'https://gist.githubusercontent.com/aakashns/257f6e6c8719c17d0e498ea287d1a386/raw/7def9ef4234ddf0bc82f855ad67dac8b971852ef/loans3.txt'

In [32]:
from urllib.request import urlretrieve

In [33]:
urlretrieve(url1, './os_data/loans1.txt')

('./os_data/loans1.txt', <http.client.HTTPMessage at 0x263bb6a8cd0>)

In [34]:
# url1 is reused here, so loans2.txt is a copy of loans1.txt; pass url2 to fetch the second file
urlretrieve(url1, './os_data/loans2.txt')

('./os_data/loans2.txt', <http.client.HTTPMessage at 0x263bd942080>)

In [35]:
# url1 is reused here, so loans3.txt is a copy of loans1.txt; pass url3 to fetch the third file
urlretrieve(url1, './os_data/loans3.txt')

('./os_data/loans3.txt', <http.client.HTTPMessage at 0x263bd941780>)

In [36]:
os.listdir('./os_data')

['loans1.txt', 'loans2.txt', 'loans3.txt']
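
Re-running the download cells fetches the files again every time. As a rough sketch, a small helper can skip the download when the file is already on disk. The function name `download_if_missing` is hypothetical and not part of `urllib`; only `urlretrieve`, `os.path.exists`, `os.path.dirname`, and `os.makedirs` are standard library calls.

```python
import os
from urllib.request import urlretrieve

def download_if_missing(url, path):
    """Download url to path unless path already exists.

    Hypothetical helper: returns True if a download happened, False otherwise.
    """
    if os.path.exists(path):
        return False
    # Make sure the destination directory exists before downloading
    directory = os.path.dirname(path)
    if directory:
        os.makedirs(directory, exist_ok=True)
    urlretrieve(url, path)
    return True

# Example call (needs network access the first time it runs):
# download_if_missing(url1, './os_data/loans1.txt')
```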

## Reading from a file 

To read the contents of a file, we first need to open it using the built-in `open` function. The `open` function returns a file object, which provides several methods for interacting with the file's contents.

In [37]:
file1 = open('./os_data/loans1.txt', mode='r')

The `open` function also accepts a `mode` argument that specifies how we can interact with the file. The following options are supported:

```
    ========= ===============================================================
    Character Meaning
    --------- ---------------------------------------------------------------
    'r'       open for reading (default)
    'w'       open for writing, truncating the file first
    'x'       create a new file and open it for writing
    'a'       open for writing, appending to the end of the file if it exists
    'b'       binary mode
    't'       text mode (default)
    '+'       open a disk file for updating (reading and writing)
    'U'       universal newline mode (deprecated)
    ========= ===============================================================
```
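
The following sketch exercises a few of these modes on a scratch file in a temporary directory, so it does not touch the `os_data` files. The file name `notes.txt` and the variable names are only illustrative.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'notes.txt')

    with open(path, 'x') as f:   # 'x' creates a new file
        f.write('first\n')

    with open(path, 'a') as f:   # 'a' adds to the end of the file
        f.write('second\n')

    with open(path, 'r') as f:   # 'r' reads (text mode by default)
        after_append = f.read()

    with open(path, 'w') as f:   # 'w' discards the old contents first
        f.write('replaced\n')

    with open(path) as f:        # mode defaults to 'r'
        after_write = f.read()

    try:
        open(path, 'x')          # 'x' refuses to overwrite an existing file
        exclusive_failed = False
    except FileExistsError:
        exclusive_failed = True

    with open(path, 'rb') as f:  # 'b' returns bytes instead of str
        raw = f.read()

print(repr(after_append))  # 'first\nsecond\n'
print(repr(after_write))   # 'replaced\n'
print(exclusive_failed)    # True
print(type(raw))           # <class 'bytes'>
```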

To view the contents of the file, we can use the `read` method of the file object.

In [38]:
file1_contents=file1.read()

In [39]:
print(file1_contents)

amount,duration,rate,down_payment
100000,36,0.08,20000
200000,12,0.1,
628400,120,0.12,100000
4637400,240,0.06,
42900,90,0.07,8900
916000,16,0.13,
45230,48,0.08,4300
991360,99,0.08,
423000,27,0.09,47200


The file contains information about loans. It is a set of comma-separated values (CSV). 

> **CSVs**: A comma-separated values (CSV) file is a delimited text file that uses a comma to separate values. Each line of the file is a data record. Each record consists of one or more fields, separated by commas. A CSV file typically stores tabular data (numbers and text) in plain text, in which case each line will have the same number of fields. (Wikipedia)

The first line of the file is the header, indicating what each of the numbers on the remaining lines represents. Each of the remaining lines provides information about a loan. Thus, the second line `100000,36,0.08,20000` represents a loan with:

* an *amount* of `$100000`, 
* *duration* of `36` months, 
* *rate of interest* of `8%` per annum, and 
* a down payment of `$20000`

CSV is a standard file format used for sharing data for analysis and visualization. Over the course of this tutorial, we will read the data from these CSV files, process it, and write the results back to files. Before we continue, let's close the file using the `close` method. Until a file is closed, the program keeps holding an operating-system file handle for it, and on some platforms other programs may be unable to modify or delete the file.

In [40]:
file1.close()

In [41]:
file1.read()  # once a file is closed you can't read it

ValueError: I/O operation on closed file.

## Closing files automatically using `with`

To close a file automatically after you've processed it, you can open it using the `with` statement.

In [42]:
with open('./os_data/loans2.txt') as file2:
    file2_contents = file2.read()
    print(file2_contents)

amount,duration,rate,down_payment
100000,36,0.08,20000
200000,12,0.1,
628400,120,0.12,100000
4637400,240,0.06,
42900,90,0.07,8900
916000,16,0.13,
45230,48,0.08,4300
991360,99,0.08,
423000,27,0.09,47200


Once the statements within the `with` block are executed, the `.close` method on `file2` is automatically invoked. Let's verify this by trying to read from the file object again.

In [43]:
file2.read()

ValueError: I/O operation on closed file.
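
For intuition, a `with` block behaves roughly like a `try`/`finally` pair that always calls `close`. The sketch below compares the two on a small scratch file and uses the `closed` attribute of file objects to check the state; the file name and variable names are illustrative.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'sample.txt')
    with open(path, 'w') as f:
        f.write('amount,duration\n100000,36\n')

    # Manual version: close() runs even if read() raises an exception
    f1 = open(path)
    try:
        manual = f1.read()
    finally:
        f1.close()

    # Equivalent version using with
    with open(path) as f2:
        managed = f2.read()
        open_inside = f2.closed  # still open inside the block

print(open_inside)        # False
print(f1.closed)          # True
print(f2.closed)          # True
print(manual == managed)  # True
```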

## Reading a file line by line


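File objects provide a `readlines` method that reads the whole file and returns its lines as a list of strings. Each string keeps its trailing newline character, except possibly the last one.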

In [44]:
with open('./os_data/loans3.txt', 'r') as file3:
    file3_lines = file3.readlines()

In [45]:
file3_lines

['amount,duration,rate,down_payment\n',
 '100000,36,0.08,20000\n',
 '200000,12,0.1,\n',
 '628400,120,0.12,100000\n',
 '4637400,240,0.06,\n',
 '42900,90,0.07,8900\n',
 '916000,16,0.13,\n',
 '45230,48,0.08,4300\n',
 '991360,99,0.08,\n',
 '423000,27,0.09,47200']
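
`readlines` loads every line into memory at once. For large files, a common alternative is to loop over the file object itself, which yields one line at a time. The sketch below assumes a small scratch file holding the first three lines of the loans data.

```python
import os
import tempfile

sample = (
    'amount,duration,rate,down_payment\n'
    '100000,36,0.08,20000\n'
    '200000,12,0.1,\n'
)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'loans_sample.txt')
    with open(path, 'w') as f:
        f.write(sample)

    lines = []
    with open(path) as f:
        for line in f:                       # one line per iteration
            lines.append(line.rstrip('\n'))  # drop the trailing newline

print(lines)
# ['amount,duration,rate,down_payment', '100000,36,0.08,20000', '200000,12,0.1,']
```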

## Processing data from files

Before performing any operations on the data stored in a file, we need to convert the file's contents from one large string into Python data types. For the file `loans1.txt` containing information about loans in a CSV format, we can do the following:

* Read the file line by line
* Parse the first line to get a list of the column names or headers
* Split each remaining line and convert each value into a float
* Create a dictionary for each loan using the headers as keys
* Create a list of dictionaries to keep track of all the loans

Since we will perform the same operations for multiple files, it would be useful to define a function `read_csv`. We'll also define some helper functions to build up the functionality step by step. 

Let's start by defining a function `parse_headers` that takes a line as input and returns a list of column headers.

In [46]:
def parse_headers(header_line):
    return header_line.strip().split(',')

The `strip` method removes leading and trailing whitespace, including the newline character `\n`. The `split` method breaks a string into a list using the given separator (`,` in this case).

In [47]:
file3_lines[0]

'amount,duration,rate,down_payment\n'

In [48]:
headers = parse_headers(file3_lines[0])

In [49]:
headers

['amount', 'duration', 'rate', 'down_payment']

Next, let's define a function `parse_values` that takes a line containing some data and returns a list of floating-point numbers.

In [50]:
def parse_values(data_line):
    values = []
    for item in data_line.strip().split(','):
        values.append(float(item))
    return values

In [52]:
file3_lines[1]

'100000,36,0.08,20000\n'

In [53]:
parse_values(file3_lines[1])

[100000.0, 36.0, 0.08, 20000.0]

The values were parsed and converted to floating point numbers, as expected. Let's try it for another line from the file, which does not contain a value for the down payment.

In [54]:
file3_lines[2]

'200000,12,0.1,\n'

In [55]:
parse_values(file3_lines[2])

ValueError: could not convert string to float: ''

The code above leads to a `ValueError` because the empty string `''` cannot be converted to a float. We can enhance the `parse_values` function to handle this *edge case*. We will also handle the case where the value is not a float.

In [56]:
def parse_values(data_line):
    values = []
    for item in data_line.strip().split(','):
        if item == '':
            values.append(0.0)
        else:
            try:
                values.append(float(item))
            except ValueError:
                values.append(item)
    return values

In [57]:
parse_values(file3_lines[2])

[200000.0, 12.0, 0.1, 0.0]
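
The non-numeric case is not exercised above, so here is a small sketch of it. It also illustrates one design choice: replacing a missing value with `0.0` makes it indistinguishable from a real zero. The function `parse_values_with_default` and its `missing` parameter are hypothetical additions for illustration; the logic is otherwise the same as `parse_values`.

```python
def parse_values_with_default(data_line, missing=0.0):
    """Hypothetical variant of parse_values with a configurable placeholder."""
    values = []
    for item in data_line.strip().split(','):
        if item == '':
            # Empty field: use the placeholder instead of failing
            values.append(missing)
        else:
            try:
                values.append(float(item))
            except ValueError:
                # Not a number: keep the original string
                values.append(item)
    return values

mixed = parse_values_with_default('Home loan,36,,0.08\n')
with_none = parse_values_with_default('200000,12,0.1,\n', missing=None)

print(mixed)      # ['Home loan', 36.0, 0.0, 0.08]
print(with_none)  # [200000.0, 12.0, 0.1, None]
```

Using `None` keeps missing values visibly different from zero, at the cost of having to handle `None` in later calculations.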

Next, let's define a function `create_item_dict` that takes a list of values and a list of headers as inputs and returns a dictionary with the values associated with their respective headers as keys.

In [59]:
def create_item_dict(values, headers):
    result = {}
    for value, header in zip(values, headers):
        result[header] = value
    return result

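The built-in `zip` function pairs up the elements of two sequences position by position. Here is a small example of what it produces:

In [60]: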
for item in zip([1,2,3], ['a', 'b', 'c']):
    print(item)

(1, 'a')
(2, 'b')
(3, 'c')
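
Because `dict` accepts an iterable of key-value pairs, the loop in `create_item_dict` could also be written in a single expression. The sketch below uses the hypothetical name `create_item_dict_compact` to avoid replacing the function defined above. It also shows that `zip` stops at the shorter input, so extra headers are silently dropped.

```python
def create_item_dict_compact(values, headers):
    # zip(headers, values) yields (header, value) pairs, which dict() consumes
    return dict(zip(headers, values))

headers = ['amount', 'duration', 'rate', 'down_payment']

full = create_item_dict_compact([100000.0, 36.0, 0.08, 20000.0], headers)
short = create_item_dict_compact([1.0, 2.0], headers)

print(full)
# {'amount': 100000.0, 'duration': 36.0, 'rate': 0.08, 'down_payment': 20000.0}
print(short)
# {'amount': 1.0, 'duration': 2.0}
```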


In [62]:
file3_lines[1]

'100000,36,0.08,20000\n'

In [63]:
values1 = parse_values(file3_lines[1])
create_item_dict(values1, headers)

{'amount': 100000.0, 'duration': 36.0, 'rate': 0.08, 'down_payment': 20000.0}

In [64]:
values2 = parse_values(file3_lines[2])
create_item_dict(values2, headers)

{'amount': 200000.0, 'duration': 12.0, 'rate': 0.1, 'down_payment': 0.0}

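The values and headers are combined into a dictionary, as expected. We can now put the helper functions together and define the `read_csv` function.

In [65]: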
def read_csv(path):
    result = []
    # Open the file in read mode
    with open(path, 'r') as f:
        # Get a list of lines
        lines = f.readlines()
        # Parse the header
        headers = parse_headers(lines[0])
        # Loop over the remaining lines
        for data_line in lines[1:]:
            # Parse the values
            values = parse_values(data_line)
            # Create a dictionary using values & headers
            item_dict = create_item_dict(values, headers)
            # Add the dictionary to the result
            result.append(item_dict)
    return result

In [67]:
with open('./os_data/loans2.txt') as file2:
    print(file2.read())

amount,duration,rate,down_payment
100000,36,0.08,20000
200000,12,0.1,
628400,120,0.12,100000
4637400,240,0.06,
42900,90,0.07,8900
916000,16,0.13,
45230,48,0.08,4300
991360,99,0.08,
423000,27,0.09,47200


In [68]:
read_csv('./os_data/loans2.txt')

[{'amount': 100000.0, 'duration': 36.0, 'rate': 0.08, 'down_payment': 20000.0},
 {'amount': 200000.0, 'duration': 12.0, 'rate': 0.1, 'down_payment': 0.0},
 {'amount': 628400.0,
  'duration': 120.0,
  'rate': 0.12,
  'down_payment': 100000.0},
 {'amount': 4637400.0, 'duration': 240.0, 'rate': 0.06, 'down_payment': 0.0},
 {'amount': 42900.0, 'duration': 90.0, 'rate': 0.07, 'down_payment': 8900.0},
 {'amount': 916000.0, 'duration': 16.0, 'rate': 0.13, 'down_payment': 0.0},
 {'amount': 45230.0, 'duration': 48.0, 'rate': 0.08, 'down_payment': 4300.0},
 {'amount': 991360.0, 'duration': 99.0, 'rate': 0.08, 'down_payment': 0.0},
 {'amount': 423000.0, 'duration': 27.0, 'rate': 0.09, 'down_payment': 47200.0}]
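
For comparison, Python's standard library includes a `csv` module whose `DictReader` class handles the header parsing and dictionary creation. The sketch below is an alternative to the hand-written `read_csv`, not a replacement prescribed by this tutorial. The function name `read_csv_stdlib` is hypothetical, and the sample data is the first three lines of the loans file, read from an in-memory `io.StringIO` object instead of disk.

```python
import csv
import io

sample = (
    'amount,duration,rate,down_payment\n'
    '100000,36,0.08,20000\n'
    '200000,12,0.1,\n'
)

def read_csv_stdlib(file_obj):
    """Hypothetical helper: parse an open CSV file into a list of dictionaries."""
    result = []
    # DictReader uses the first row as the keys for every following row
    for row in csv.DictReader(file_obj):
        item = {}
        for key, value in row.items():
            # Same convention as parse_values: empty fields become 0.0
            item[key] = float(value) if value != '' else 0.0
        result.append(item)
    return result

loans = read_csv_stdlib(io.StringIO(sample))
print(loans)
# [{'amount': 100000.0, 'duration': 36.0, 'rate': 0.08, 'down_payment': 20000.0},
#  {'amount': 200000.0, 'duration': 12.0, 'rate': 0.1, 'down_payment': 0.0}]
```

When reading a real file with the `csv` module, the documentation recommends opening it with `newline=''`, for example `open(path, newline='')`.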