# Count the number of lines in each file

In [34]:
path_to_zips = "../data/Challenge"

## 1) Command Line

Use shell commands with the `!` notation to count the number of lines in `bookings.csv.bz2` and `searches.csv.bz2`.

In [35]:
! ls -l {path_to_zips}

total 2121264
-rw-r--r--@ 1 aalmagro  staff  554970628  5 may 08:57 bookings.csv.bz2
-rw-r--r--  1 aalmagro  staff   42445466  5 may 10:11 bookings.sample.csv
-rw-r--r--  1 aalmagro  staff    5473249  5 may 10:00 bookings.sample.csv.bz2
-rw-r--r--@ 1 aalmagro  staff  483188920  5 may 08:58 searches.csv.bz2


In [22]:
!bzcat {path_to_zips}/bookings.csv.bz2 | wc -l

 10000011


That took a while... Let's make a 100,000-line sample to use for the rest of this class:

```shell
! bzcat {path_to_zips}/bookings.csv.bz2 | head -100000 | bzip2 -c > {path_to_zips}/bookings.sample.csv.bz2
```

In [23]:
! bzcat {path_to_zips}/bookings.csv.bz2 | head -100000 | bzip2 -c > {path_to_zips}/bookings.sample.csv.bz2


bzcat: I/O or other error, bailing out.  Possible reason follows.
bzcat: Broken pipe
	Input file = ../data/Challenge/bookings.csv.bz2, output file = (stdout)


In [25]:
! bzcat {path_to_zips}/bookings.csv.bz2 | head -100000 > {path_to_zips}/bookings.sample.csv


bzcat: I/O or other error, bailing out.  Possible reason follows.
bzcat: Broken pipe
	Input file = ../data/Challenge/bookings.csv.bz2, output file = (stdout)
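
The `Broken pipe` message is expected: `head` exits after 100,000 lines and closes the pipe while `bzcat` is still writing. The sample files are still created, as the next listing shows. As an alternative, here is a minimal Python sketch that builds the same kind of sample without the pipe. The helper name `write_sample` and the demo file names are hypothetical, and the demo runs on a throwaway file rather than the course data.

```python
import bz2
import itertools
import os
import tempfile


def write_sample(src_path, dst_path, n_lines):
    """Copy the first n_lines of a bz2 file into a new bz2 file."""
    with bz2.open(src_path, "rb") as src, bz2.open(dst_path, "wb") as dst:
        # islice stops after n_lines, so the rest of the source is never read
        dst.writelines(itertools.islice(src, n_lines))


# Demo on a temporary 1000-line file (made-up rows, not the bookings data)
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_src = os.path.join(tmp_dir, "demo.csv.bz2")
    demo_dst = os.path.join(tmp_dir, "demo.sample.csv.bz2")
    with bz2.open(demo_src, "wb") as f:
        f.writelines(b"row %d\n" % i for i in range(1000))

    write_sample(demo_src, demo_dst, 100)

    with bz2.open(demo_dst, "rb") as f:
        sample_line_count = sum(1 for _ in f)

print(sample_line_count)
```

For the real data, the call would presumably look like `write_sample(path_to_zips + "/bookings.csv.bz2", path_to_zips + "/bookings.sample.csv.bz2", 100000)`.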


In [36]:
!ls -lrt {path_to_zips}

total 2121264
-rw-r--r--@ 1 aalmagro  staff  554970628  5 may 08:57 bookings.csv.bz2
-rw-r--r--@ 1 aalmagro  staff  483188920  5 may 08:58 searches.csv.bz2
-rw-r--r--  1 aalmagro  staff    5473249  5 may 10:00 bookings.sample.csv.bz2
-rw-r--r--  1 aalmagro  staff   42445466  5 may 10:11 bookings.sample.csv


In [39]:
testfile=path_to_zips+"/bookings.sample.csv"

In [40]:
count=0
with open(testfile,'r') as file:
    for line in file:
        count += 1
count

100000
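
The counting loop above can also be wrapped in a small reusable function. This is only a sketch: the name `count_lines` and the demo file are hypothetical, and the default encoding is an assumption that may need changing for the real files.

```python
import os
import tempfile


def count_lines(path, encoding="utf-8"):
    """Return the number of lines in a text file, reading it lazily."""
    with open(path, "r", encoding=encoding) as handle:
        # The generator yields one 1 per line, so memory use stays constant
        return sum(1 for _ in handle)


# Demo on a temporary three-line file (made-up content)
fd, demo_path = tempfile.mkstemp(suffix=".csv")
with os.fdopen(fd, "w") as f:
    f.write("a^b\n1^2\n3^4\n")

demo_count = count_lines(demo_path)
os.remove(demo_path)
print(demo_count)
```

With the sample file, `count_lines(testfile)` should give the same result as the explicit loop.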

## 2) Python

We have two options:

* count the lines directly from the compressed file, without uncompressing it

* uncompress the whole file first, then count the lines in the result

#### 2a) Python without uncompressing

In [41]:
import bz2

In [42]:
!ls {path_to_zips}/bookings.sample.csv.bz2

../data/Challenge/bookings.sample.csv.bz2


In [44]:
bookings_path = path_to_zips + "/bookings.sample.csv.bz2"

bz2_file = bz2.BZ2File(bookings_path)

In [45]:
k=0
for line in bz2_file:
    k += 1
    
print("%s has %s lines." % (bookings_path, k))

../data/Challenge/bookings.sample.csv.bz2 has 100000 lines.
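
Note that the `BZ2File` object above is never closed, and once the loop has consumed it, iterating over it again yields no lines. A possible variant, sketched here with a hypothetical helper `count_lines_bz2` and a throwaway demo file, opens the archive inside a `with` block so it is closed automatically.

```python
import bz2
import os
import tempfile


def count_lines_bz2(path):
    """Count lines in a bz2 file, decompressing on the fly."""
    with bz2.open(path, "rb") as handle:
        # Binary mode avoids having to guess the text encoding just to count
        return sum(1 for _ in handle)


# Demo on a temporary compressed file with 250 made-up lines
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_bz2_path = os.path.join(tmp_dir, "demo.csv.bz2")
    with bz2.open(demo_bz2_path, "wb") as f:
        f.writelines(b"line %d\n" % i for i in range(250))
    bz2_line_count = count_lines_bz2(demo_bz2_path)

print(bz2_line_count)
```

Each call opens the file afresh, so the function can be called repeatedly, for example once for `bookings.csv.bz2` and once for `searches.csv.bz2`.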


#### 2b) Python on the raw uncompressed file

First, uncompress the file using shell utilities:

In [46]:
uncompressed_path = '.'.join(bookings_path.split('.')[:-1])

!bunzip2 -fc {bookings_path} > {uncompressed_path}

Then count the lines:

In [47]:
with open(uncompressed_path, "r", encoding='latin-1') as file_input:
    k = 0
    for line in file_input:
        k += 1

print("%s has %s lines." % (bookings_path, k))

# The with block closes the file on exit, so this evaluates to True
file_input.closed

../data/Challenge/bookings.sample.csv.bz2 has 100000 lines.


True
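
If shell utilities are not available, the uncompress step can also be done from Python. The following is a minimal sketch, not the method used in this notebook: `decompress_bz2` is a hypothetical helper, and the demo works on a temporary file.

```python
import bz2
import os
import shutil
import tempfile


def decompress_bz2(src_path, dst_path):
    """Write the decompressed contents of src_path to dst_path."""
    with bz2.open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        # copyfileobj streams in chunks, so the file never sits fully in memory
        shutil.copyfileobj(src, dst)


# Demo: compress 500 made-up lines, decompress them, then count the result
with tempfile.TemporaryDirectory() as tmp_dir:
    packed_path = os.path.join(tmp_dir, "demo.csv.bz2")
    unpacked_path = os.path.join(tmp_dir, "demo.csv")
    with bz2.open(packed_path, "wb") as f:
        f.writelines(b"record %d\n" % i for i in range(500))

    decompress_bz2(packed_path, unpacked_path)

    with open(unpacked_path, "rb") as f:
        unpacked_line_count = sum(1 for _ in f)

print(unpacked_line_count)
```

For the sample file, `decompress_bz2(bookings_path, uncompressed_path)` would play the role of the `bunzip2 -fc` cell.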