In [1]:
import os

In [2]:
os.getcwd()

'/Users/aaqib/Documents/jupyterNotebook/Practice/dataAnalysis/numpy'

In [3]:
os.listdir('.')

['python-os-ans-filesystem.ipynb',
 '.DS_Store',
 '100-numpy-exercises.ipynb',
 'numpy-array-operations.ipynb',
 '.jovianrc',
 '.ipynb_checkpoints']

In [4]:
os.makedirs('./data')

In [5]:
os.listdir('.')

['python-os-ans-filesystem.ipynb',
 '.DS_Store',
 '100-numpy-exercises.ipynb',
 'numpy-array-operations.ipynb',
 '.jovianrc',
 '.ipynb_checkpoints',
 'data']
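
Note that calling `os.makedirs('./data')` a second time raises `FileExistsError`, which is easy to hit when re-running notebook cells. Below is a minimal sketch of a re-runnable version. It uses a temporary directory (an assumption made here so the example does not touch your working directory):

```python
import os
import tempfile

# Work inside a throwaway directory so the example leaves no files behind.
with tempfile.TemporaryDirectory() as tmp:
    data_dir = os.path.join(tmp, 'data')

    # exist_ok=True makes the call safe to repeat.
    os.makedirs(data_dir, exist_ok=True)
    os.makedirs(data_dir, exist_ok=True)  # no FileExistsError the second time

    created = os.path.isdir(data_dir)
    listing = os.listdir(tmp)

print(created, listing)
```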

Let us download some files into the `data` directory using the `urllib` module.

In [6]:
url1 = 'https://gist.githubusercontent.com/aakashns/257f6e6c8719c17d0e498ea287d1a386/raw/7def9ef4234ddf0bc82f855ad67dac8b971852ef/loans1.txt'
url2 = 'https://gist.githubusercontent.com/aakashns/257f6e6c8719c17d0e498ea287d1a386/raw/7def9ef4234ddf0bc82f855ad67dac8b971852ef/loans2.txt'
url3 = 'https://gist.githubusercontent.com/aakashns/257f6e6c8719c17d0e498ea287d1a386/raw/7def9ef4234ddf0bc82f855ad67dac8b971852ef/loans3.txt'

In [3]:
from urllib.request import urlretrieve

In [8]:
urlretrieve(url1,'./data/loans1.txt')

('./data/loans1.txt', <http.client.HTTPMessage at 0x7fc1fd4325b0>)

In [10]:
urlretrieve(url2,'./data/loans2.txt')

('./data/loans2.txt', <http.client.HTTPMessage at 0x7fc1fd482100>)

In [9]:
urlretrieve(url3,'./data/loans3.txt')

('./data/loans3.txt', <http.client.HTTPMessage at 0x7fc1fd45ea00>)

Let's verify that the files were downloaded.

In [11]:
os.listdir('./data')

['loans2.txt', 'loans3.txt', 'loans1.txt']
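
The three downloads above differ only in the file number, so they could also be written as a loop. The sketch below wraps this in a hypothetical helper, `download_all` (not part of the original notebook), and demonstrates it on local `file://` URLs so that it runs without network access. With the `url1`, `url2` and `url3` values defined earlier, you would pass those instead.

```python
import os
import tempfile
from pathlib import Path
from urllib.request import urlretrieve

def download_all(urls, target_dir):
    """Download each URL into target_dir, using the last path segment as the file name."""
    os.makedirs(target_dir, exist_ok=True)
    saved = []
    for url in urls:
        filename = url.rsplit('/', 1)[-1]
        path = os.path.join(target_dir, filename)
        urlretrieve(url, path)
        saved.append(path)
    return saved

with tempfile.TemporaryDirectory() as tmp:
    # Stand-in source files; real code would use the gist URLs defined earlier.
    source_dir = Path(tmp) / 'source'
    source_dir.mkdir()
    urls = []
    for i in range(1, 4):
        src = source_dir / 'loans{}.txt'.format(i)
        src.write_text('amount,duration,rate,down_payment\n')
        urls.append(src.as_uri())

    saved = download_all(urls, os.path.join(tmp, 'data'))
    downloaded = sorted(os.listdir(os.path.join(tmp, 'data')))

print(downloaded)
```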

## Reading from a file 

To read the contents of a file, we first need to open the file using the built-in `open` function. The `open` function returns a file object and provides several methods for interacting with the file's contents.

The `open` function also accepts a `mode` argument that specifies how we can interact with the file. The following options are supported:

```
    ========= ===============================================================
    Character Meaning
    --------- ---------------------------------------------------------------
    'r'       open for reading (default)
    'w'       open for writing, truncating the file first
    'x'       create a new file and open it for writing
    'a'       open for writing, appending to the end of the file if it exists
    'b'       binary mode
    't'       text mode (default)
    '+'       open a disk file for updating (reading and writing)
    'U'       universal newline mode (deprecated)
    ========= ===============================================================
```

To view the contents of the file, we can use the `read` method of the file object.

In [13]:
with open('./data/loans1.txt','r') as f:
    file1_content = f.read()
    print(file1_content)

amount,duration,rate,down_payment
100000,36,0.08,20000
200000,12,0.1,
628400,120,0.12,100000
4637400,240,0.06,
42900,90,0.07,8900
916000,16,0.13,
45230,48,0.08,4300
991360,99,0.08,
423000,27,0.09,47200


## Reading line by line

In [18]:
with open('./data/loans2.txt','r') as f:
    file2_content = f.readlines()

In [22]:
file2_content

['amount,duration,rate,down_payment\n',
 '828400,120,0.11,100000\n',
 '4633400,240,0.06,\n',
 '42900,90,0.08,8900\n',
 '983000,16,0.14,\n',
 '15230,48,0.07,4300']

We can get rid of the trailing newline character (`\n`) using the `strip` method.

In [23]:
file2_content[0].strip()

'amount,duration,rate,down_payment'

In [24]:
for line in file2_content:
    print(line.strip())

amount,duration,rate,down_payment
828400,120,0.11,100000
4633400,240,0.06,
42900,90,0.08,8900
983000,16,0.14,
15230,48,0.07,4300
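
`readlines` loads every line into a list at once. For larger files, a file object can also be iterated over directly, which yields one line at a time. Here is a small sketch, using a temporary file containing the first rows of `loans2.txt` as stand-in data:

```python
import os
import tempfile

sample = 'amount,duration,rate,down_payment\n828400,120,0.11,100000\n4633400,240,0.06,\n'

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'sample.txt')
    with open(path, 'w') as f:
        f.write(sample)

    # Iterating over the file object yields lines lazily, newline included.
    with open(path, 'r') as f:
        stripped = [line.strip() for line in f]

print(stripped)
```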


In [56]:
with open('./data/loans3.txt', 'r') as file3:
    file3_lines = file3.readlines()

In [49]:
file3_lines

['amount,duration,rate,down_payment\n',
 '45230,48,0.07,4300\n',
 '883000,16,0.14,\n',
 '100000,12,0.1,\n',
 '728400,120,0.12,100000\n',
 '3637400,240,0.06,\n',
 '82900,90,0.07,8900\n',
 '316000,16,0.13,\n',
 '15230,48,0.08,4300\n',
 '991360,99,0.08,\n',
 '323000,27,0.09,4720010000,36,0.08,20000\n',
 '528400,120,0.11,100000\n',
 '8633400,240,0.06,\n',
 '12900,90,0.08,8900']

## Processing data from files

Before performing any operations on the data stored in a file, we need to convert the file's contents from one large string into Python data types. For the file `loans1.txt` containing information about loans in a CSV format, we can do the following:

* Read the file line by line
* Parse the first line to get a list of the column names or headers
* Split each remaining line and convert each value into a float
* Create a dictionary for each loan using the headers as keys
* Create a list of dictionaries to keep track of all the loans

Since we will perform the same operations for multiple files, it would be useful to define a function `read_csv`. We'll also define some helper functions to build up the functionality step by step. 

Let's start by defining a function `parse_header` that takes a line as input and returns a list of column headers.

In [25]:
def parse_header(line_header):
    return line_header.strip().split(',')

In [27]:
header = parse_header(file2_content[0])

In [28]:
header

['amount', 'duration', 'rate', 'down_payment']

Next, let's define a function `parse_values` that takes a line containing some data and returns a list of floating-point numbers.

In [41]:
def parse_values(data_line):
    values = []
    for value in data_line.strip().split(','):
        values.append(float(value))
    return values    

In [44]:
parse_values(file2_content[1])

[828400.0, 120.0, 0.11, 100000.0]

The values were parsed and converted to floating point numbers, as expected. Let's try it for another line from the file, which does not contain a value for the down payment.

In [45]:
file2_content[2]

'4633400,240,0.06,\n'

In [50]:
parse_values(file2_content[2])

ValueError: could not convert string to float: ''

The code above leads to a `ValueError` because the empty string `''` cannot be converted to a float. We can enhance the `parse_values` function to handle this *edge case* by substituting `0.0` for empty values. We will also handle values that cannot be converted to a float by keeping them as strings.

In [51]:
def parse_values(data_line):
    values = []
    for value in data_line.strip().split(','):
        if value == '':
            values.append(0.0)
        else:  
            try:
                values.append(float(value))
            except ValueError:
                values.append(value)
    return values    

In [52]:
parse_values(file2_content[2])

[4633400.0, 240.0, 0.06, 0.0]
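
Substituting `0.0` for an empty field is convenient for the EMI calculation, but it makes a missing down payment indistinguishable from a down payment of zero. If that distinction matters, one option is to let the caller choose the placeholder. The variant below is only a sketch. The `missing` parameter and the name `parse_values_with_default` are not part of the original notebook.

```python
def parse_values_with_default(data_line, missing=0.0):
    """Like parse_values, but lets the caller pick the value used for empty fields."""
    values = []
    for value in data_line.strip().split(','):
        if value == '':
            values.append(missing)
        else:
            try:
                values.append(float(value))
            except ValueError:
                # Keep non-numeric fields as strings.
                values.append(value)
    return values

default_row = parse_values_with_default('4633400,240,0.06,\n')
none_row = parse_values_with_default('4633400,240,0.06,\n', missing=None)
print(default_row)
print(none_row)
```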

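Next, let's define a helper function `create_item_dict` that takes a list of headers and a list of values and combines them into a dictionary.

In [63]: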
def create_item_dict(headers,values):
    result = {}
    for header , value in zip(headers,values):
        result[header] = value
    return result

In [54]:
header

['amount', 'duration', 'rate', 'down_payment']

In [61]:
value1 = parse_values(file3_lines[1])
value1

[45230.0, 48.0, 0.07, 4300.0]

In [64]:
create_item_dict(header,value1)

{'amount': 45230.0, 'duration': 48.0, 'rate': 0.07, 'down_payment': 4300.0}

As expected, the headers and values are combined to create a dictionary with the appropriate key-value pairs.
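
`zip` stops at the shorter of its two inputs, so a row with too many or too few fields is silently truncated rather than reported. The sketch below shows an equivalent construction with `dict(zip(...))` inside a hypothetical stricter helper, `create_item_dict_strict` (an illustration, not the notebook's function), which raises an error when the lengths differ:

```python
def create_item_dict_strict(headers, values):
    """Build a dict from headers and values, refusing rows of the wrong length."""
    if len(headers) != len(values):
        raise ValueError(
            'expected {} values, got {}'.format(len(headers), len(values)))
    return dict(zip(headers, values))

headers = ['amount', 'duration', 'rate', 'down_payment']

good = create_item_dict_strict(headers, [45230.0, 48.0, 0.07, 4300.0])
print(good)

# A row with seven fields, like the run-together line in loans3.txt.
try:
    create_item_dict_strict(
        headers, [323000.0, 27.0, 0.09, 4720010000.0, 36.0, 0.08, 20000.0])
except ValueError as error:
    message = str(error)
    print(message)
```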

We are now ready to put it all together and define the `read_csv` function.

In [74]:
def read_csv(path):
    result = []
    with open(path, 'r') as f:
        # Read all the lines of the file into a list
        file_lines = f.readlines()
    # Parse the header line to get the column names
    header = parse_header(file_lines[0])
    # Process every line after the header
    for line in file_lines[1:]:
        # Parse the line into a list of values
        values = parse_values(line)
        # Combine the headers and values into a dictionary
        item_dict = create_item_dict(header, values)
        # Add the dictionary to the list of results
        result.append(item_dict)
    return result

In [73]:
with open('./data/loans3.txt', 'r') as f:
        file_lines = f.read()
print(file_lines)

amount,duration,rate,down_payment
45230,48,0.07,4300
883000,16,0.14,
100000,12,0.1,
728400,120,0.12,100000
3637400,240,0.06,
82900,90,0.07,8900
316000,16,0.13,
15230,48,0.08,4300
991360,99,0.08,
323000,27,0.09,4720010000,36,0.08,20000
528400,120,0.11,100000
8633400,240,0.06,
12900,90,0.08,8900


In [80]:
loans3 = read_csv('./data/loans3.txt')

In [81]:
loans3

[{'amount': 45230.0, 'duration': 48.0, 'rate': 0.07, 'down_payment': 4300.0},
 {'amount': 883000.0, 'duration': 16.0, 'rate': 0.14, 'down_payment': 0.0},
 {'amount': 100000.0, 'duration': 12.0, 'rate': 0.1, 'down_payment': 0.0},
 {'amount': 728400.0,
  'duration': 120.0,
  'rate': 0.12,
  'down_payment': 100000.0},
 {'amount': 3637400.0, 'duration': 240.0, 'rate': 0.06, 'down_payment': 0.0},
 {'amount': 82900.0, 'duration': 90.0, 'rate': 0.07, 'down_payment': 8900.0},
 {'amount': 316000.0, 'duration': 16.0, 'rate': 0.13, 'down_payment': 0.0},
 {'amount': 15230.0, 'duration': 48.0, 'rate': 0.08, 'down_payment': 4300.0},
 {'amount': 991360.0, 'duration': 99.0, 'rate': 0.08, 'down_payment': 0.0},
 {'amount': 323000.0,
  'duration': 27.0,
  'rate': 0.09,
  'down_payment': 4720010000.0},
 {'amount': 528400.0,
  'duration': 120.0,
  'rate': 0.11,
  'down_payment': 100000.0},
 {'amount': 8633400.0, 'duration': 240.0, 'rate': 0.06, 'down_payment': 0.0},
 {'amount': 12900.0, 'duration': 90.0, '

In [7]:
# All of the functions defined above, collected in one cell
def parse_header(line_header):
    return line_header.strip().split(',')

def parse_values(data_line):
    values = []
    for value in data_line.strip().split(','):
        if value == '':
            values.append(0.0)
        else:
            try:
                values.append(float(value))
            except ValueError:
                values.append(value)
    return values

def create_item_dict(headers, values):
    result = {}
    for header, value in zip(headers, values):
        result[header] = value
    return result

def read_csv(path):
    result = []
    with open(path, 'r') as f:
        # Read all the lines of the file into a list
        file_lines = f.readlines()
    # Parse the header line to get the column names
    header = parse_header(file_lines[0])
    # Process every line after the header
    for line in file_lines[1:]:
        # Parse the line into a list of values
        values = parse_values(line)
        # Combine the headers and values into a dictionary
        item_dict = create_item_dict(header, values)
        # Add the dictionary to the list of results
        result.append(item_dict)
    return result

In [78]:
loans2 = read_csv('./data/loans2.txt')

In [79]:
loans2

[{'amount': 828400.0,
  'duration': 120.0,
  'rate': 0.11,
  'down_payment': 100000.0},
 {'amount': 4633400.0, 'duration': 240.0, 'rate': 0.06, 'down_payment': 0.0},
 {'amount': 42900.0, 'duration': 90.0, 'rate': 0.08, 'down_payment': 8900.0},
 {'amount': 983000.0, 'duration': 16.0, 'rate': 0.14, 'down_payment': 0.0},
 {'amount': 15230.0, 'duration': 48.0, 'rate': 0.07, 'down_payment': 4300.0}]

In [77]:
import math

def loan_emi(amount, duration, rate, down_payment=0):
    """Calculates the equal monthly installment (EMI) for a loan.
    
    Arguments:
        amount - Total amount to be spent (loan + down payment)
        duration - Duration of the loan (in months)
        rate - Rate of interest (monthly)
        down_payment (optional) - Optional initial payment (deducted from amount)
    """
    loan_amount = amount - down_payment
    try:
        emi = loan_amount * rate * ((1+rate)**duration) / (((1+rate)**duration)-1)
    except ZeroDivisionError:
        emi = loan_amount / duration
    emi = math.ceil(emi)
    return emi
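
For reference, the expression inside the `try` block corresponds to the standard EMI formula, where $P$ is the loan amount after the down payment, $r$ is the monthly interest rate, and $n$ is the duration in months (symbol names chosen here for illustration):

$$
\text{EMI} = \frac{P \, r \, (1 + r)^n}{(1 + r)^n - 1}
$$

When $r = 0$, the denominator is zero, which is why the function catches `ZeroDivisionError` and falls back to $P / n$.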

In [84]:
# Calculate the EMI for each loan using the loan_emi function, reading the inputs
# from the loan's dictionary, and store the result under a new key 'emi'
for loan in loans2:
    loan['emi'] = loan_emi(loan['amount'], 
                           loan['duration'], 
                           loan['rate']/12, # the CSV contains yearly rates
                           loan['down_payment'])

In [83]:
loans2

[{'amount': 828400.0,
  'duration': 120.0,
  'rate': 0.11,
  'down_payment': 100000.0,
  'emi': 10034},
 {'amount': 4633400.0,
  'duration': 240.0,
  'rate': 0.06,
  'down_payment': 0.0,
  'emi': 33196},
 {'amount': 42900.0,
  'duration': 90.0,
  'rate': 0.08,
  'down_payment': 8900.0,
  'emi': 504},
 {'amount': 983000.0,
  'duration': 16.0,
  'rate': 0.14,
  'down_payment': 0.0,
  'emi': 67707},
 {'amount': 15230.0,
  'duration': 48.0,
  'rate': 0.07,
  'down_payment': 4300.0,
  'emi': 262}]

You can see that each loan now has a new key `emi`, which provides the EMI for the loan. We can extract this logic into a function so that we can use it for other files too.

In [85]:
def compute_emis(loans):
    for loan in loans:
        loan['emi'] = loan_emi(
            loan['amount'], 
            loan['duration'], 
            loan['rate']/12, # the CSV contains yearly rates
            loan['down_payment'])

## Writing to files

Now that we have performed some processing on the data, it would be good to write the results back to a CSV file. We can create/open a file in `w` mode using `open` and write to it using the `.write` method. The string `format` method will come in handy here.

In [86]:
loans1 = read_csv('./data/loans1.txt')

In [89]:
compute_emis(loans1)

In [90]:
loans1

[{'amount': 100000.0,
  'duration': 36.0,
  'rate': 0.08,
  'down_payment': 20000.0,
  'emi': 2507},
 {'amount': 200000.0,
  'duration': 12.0,
  'rate': 0.1,
  'down_payment': 0.0,
  'emi': 17584},
 {'amount': 628400.0,
  'duration': 120.0,
  'rate': 0.12,
  'down_payment': 100000.0,
  'emi': 7582},
 {'amount': 4637400.0,
  'duration': 240.0,
  'rate': 0.06,
  'down_payment': 0.0,
  'emi': 33224},
 {'amount': 42900.0,
  'duration': 90.0,
  'rate': 0.07,
  'down_payment': 8900.0,
  'emi': 487},
 {'amount': 916000.0,
  'duration': 16.0,
  'rate': 0.13,
  'down_payment': 0.0,
  'emi': 62664},
 {'amount': 45230.0,
  'duration': 48.0,
  'rate': 0.08,
  'down_payment': 4300.0,
  'emi': 1000},
 {'amount': 991360.0,
  'duration': 99.0,
  'rate': 0.08,
  'down_payment': 0.0,
  'emi': 13712},
 {'amount': 423000.0,
  'duration': 27.0,
  'rate': 0.09,
  'down_payment': 47200.0,
  'emi': 15428}]

In [91]:
with open('./data/emis2.txt', 'w') as f:
    for loan in loans2:
        f.write('{},{},{},{},{}\n'.format(
            loan['amount'], 
            loan['duration'], 
            loan['rate'], 
            loan['down_payment'], 
            loan['emi']))

In [92]:
os.listdir('data')

['loans2.txt', 'loans3.txt', 'loans1.txt', 'emis2.txt']

In [93]:
with open('./data/emis2.txt', 'r') as f:
    print(f.read())

828400.0,120.0,0.11,100000.0,10034
4633400.0,240.0,0.06,0.0,33196
42900.0,90.0,0.08,8900.0,504
983000.0,16.0,0.14,0.0,67707
15230.0,48.0,0.07,4300.0,262



Great, looks like the loan details (along with the computed EMIs) were written into the file.

Let's define a generic function `write_csv` which takes a list of dictionaries and writes it to a file in CSV format. We will also include the column headers in the first line.

In [94]:
def write_csv(items, path):
    # Open the file in write mode
    with open(path, 'w') as f:
        # Return if there's nothing to write
        if len(items) == 0:
            return
        
        # Write the headers in the first line
        headers = list(items[0].keys())
        f.write(','.join(headers) + '\n')
        
        # Write one item per line
        for item in items:
            values = []
            for header in headers:
                values.append(str(item.get(header, "")))
            f.write(','.join(values) + "\n")

In [95]:
loans3 = read_csv('./data/loans3.txt')

In [96]:
compute_emis(loans3)

In [97]:
write_csv(loans3, './data/emis3.txt')

In [98]:
with open('./data/emis3.txt', 'r') as f:
    print(f.read())

amount,duration,rate,down_payment,emi
45230.0,48.0,0.07,4300.0,981
883000.0,16.0,0.14,0.0,60819
100000.0,12.0,0.1,0.0,8792
728400.0,120.0,0.12,100000.0,9016
3637400.0,240.0,0.06,0.0,26060
82900.0,90.0,0.07,8900.0,1060
316000.0,16.0,0.13,0.0,21618
15230.0,48.0,0.08,4300.0,267
991360.0,99.0,0.08,0.0,13712
323000.0,27.0,0.09,4720010000.0,-193751447
528400.0,120.0,0.11,100000.0,5902
8633400.0,240.0,0.06,0.0,61853
12900.0,90.0,0.08,8900.0,60



With just four lines of code, we can now read each downloaded file, calculate the EMIs, and write the results back to new files:

In [99]:
for i in range(1,4):
    loans = read_csv('./data/loans{}.txt'.format(i))
    compute_emis(loans)
    write_csv(loans, './data/emis{}.txt'.format(i))

In [100]:
os.listdir('./data')

['loans2.txt',
 'loans3.txt',
 'loans1.txt',
 'emis3.txt',
 'emis2.txt',
 'emis1.txt']

In [1]:
import jovian 

In [None]:
jovian.commit()

## Using Pandas to Read and Write CSVs

There are some limitations to the `read_csv` and `write_csv` functions we've defined above:

* The `read_csv` function fails to create a proper dictionary if any of the values in the CSV files contains commas
* The `write_csv` function fails to create a proper CSV if any of the values to be written contains commas

When a value in a CSV file contains a comma (`,`), the value is generally placed within double quotes. Double quotes (`"`) in values are converted into two double quotes (`""`). Here's an example:

```
title,description
Fast & Furious,"A movie, a race, a franchise"
The Dark Knight,"Gotham, the ""Batman"", and the Joker"
Memento,A guy forgets everything every 15 minutes

```

Let's try it out.

In [2]:
movies_url = "https://gist.githubusercontent.com/aakashns/afee0a407d44bbc02321993548021af9/raw/6d7473f0ac4c54aca65fc4b06ed831b8a4840190/movies.csv"

In [5]:
urlretrieve(movies_url, 'data/movies.csv')

('movies.csv', <http.client.HTTPMessage at 0x7fbcc0c67580>)

In [8]:
movies = read_csv('data/movies.csv')

In [9]:
movies

[{'title': 'Fast & Furious', 'description': '"A movie'},
 {'title': 'The Dark Knight', 'description': '"Gotham'},
 {'title': 'Memento',
  'description': 'A guy forgets everything every 15 minutes'}]

As you can see above, the movie descriptions weren't parsed properly.

To read this CSV properly, we can use the `pandas` library.

In [11]:
import pandas as pd

In [13]:
movies_dataframe = pd.read_csv('data/movies.csv')

In [14]:
movies_dataframe

             title                                description
0   Fast & Furious               A movie, a race, a franchise
1  The Dark Knight        Gotham, the "Batman", and the Joker
2          Memento  A guy forgets everything every 15 minutes


In [15]:
movies = movies_dataframe.to_dict('records')

In [16]:
movies 

[{'title': 'Fast & Furious', 'description': 'A movie, a race, a franchise'},
 {'title': 'The Dark Knight',
  'description': 'Gotham, the "Batman", and the Joker'},
 {'title': 'Memento',
  'description': 'A guy forgets everything every 15 minutes'}]

If you don't pass the argument `'records'`, you get a dictionary of dictionaries instead, keyed first by column name and then by row index.

In [17]:
movies_dict = movies_dataframe.to_dict()

In [18]:
movies_dict

{'title': {0: 'Fast & Furious', 1: 'The Dark Knight', 2: 'Memento'},
 'description': {0: 'A movie, a race, a franchise',
  1: 'Gotham, the "Batman", and the Joker',
  2: 'A guy forgets everything every 15 minutes'}}
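
Python's standard library also includes a `csv` module that understands the quoting rules described above. As a sketch of an alternative that does not require `pandas`, the example below parses the sample movie data from an in-memory string (used here instead of the downloaded file so the example is self-contained) and then writes it back out:

```python
import csv
import io

text = (
    'title,description\n'
    'Fast & Furious,"A movie, a race, a franchise"\n'
    'The Dark Knight,"Gotham, the ""Batman"", and the Joker"\n'
    'Memento,A guy forgets everything every 15 minutes\n'
)

# DictReader uses the first row as keys and handles quoted commas and quotes.
rows = list(csv.DictReader(io.StringIO(text)))
print(rows[1]['description'])

# DictWriter adds quotes only where a value needs them.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=['title', 'description'], lineterminator='\n')
writer.writeheader()
writer.writerows(rows)
print(buffer.getvalue() == text)
```

To work with a file on disk instead, the same reader and writer objects accept a file opened with `open(path, newline='')`.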

In [19]:
jovian.commit()

[jovian] Committed successfully! https://jovian.ai/ajmehdi5/python-os-ans-filesystem-1007f


'https://jovian.ai/ajmehdi5/python-os-ans-filesystem-1007f'