# Count the number of lines in each file using Python

In [1]:
# Where am I?

! pwd

/home/dsc/Repos/AmadeusChallenge


In [2]:
# Let's go to the right folder

% cd /home/dsc/Data/challenge

/home/dsc/Data/challenge


In [3]:
! ls -lsS

total 1014224
541968 -rwxr-x--- 1 dsc dsc 554970628 mar 13  2018 bookings.csv.bz2
471872 -rwxr-x--- 1 dsc dsc 483188920 mar 13  2018 searches.csv.bz2
   264 -rw-r--r-- 1 dsc dsc    270148 ene 30 19:23 bookings_sample.csv.bz2
   120 -rw-r--r-- 1 dsc dsc    120210 ene 30 19:23 searches_sample.csv.bz2


## 1) Command Line

Use shell commands with the `!` notation to count the number of lines in `bookings.csv.bz2` and `searches.csv.bz2`.

In [4]:
#! bzcat bookings.csv.bz2 | wc -l
# Out 10000011

In [5]:
#! bzcat searches.csv.bz2 | wc -l
# Out 20390198

In [6]:
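# The files are too big, so let's work with smaller samples: the first 5000 lines.
# bzcat reports "Broken pipe" below because head exits after 5000 lines; the sample is still written.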

! bzcat bookings.csv.bz2 | head -5000 > bookings_sample.csv


bzcat: I/O or other error, bailing out.  Possible reason follows.
bzcat: Broken pipe
	Input file = bookings.csv.bz2, output file = (stdout)


In [7]:
# Same for the searches file: keep only the first 5000 lines.
# bzcat reports "Broken pipe" below because head exits after 5000 lines; the sample is still written.

! bzcat searches.csv.bz2 | head -5000 > searches_sample.csv


bzcat: I/O or other error, bailing out.  Possible reason follows.
bzcat: Broken pipe
	Input file = searches.csv.bz2, output file = (stdout)


In [8]:
# Have the samples been created?

! ls -ls

total 1017188
541968 -rwxr-x--- 1 dsc dsc 554970628 mar 13  2018 bookings.csv.bz2
  2072 -rw-r--r-- 1 dsc dsc   2119069 ene 30 20:12 bookings_sample.csv
   264 -rw-r--r-- 1 dsc dsc    270148 ene 30 19:23 bookings_sample.csv.bz2
471872 -rwxr-x--- 1 dsc dsc 483188920 mar 13  2018 searches.csv.bz2
   892 -rw-r--r-- 1 dsc dsc    910597 ene 30 20:12 searches_sample.csv
   120 -rw-r--r-- 1 dsc dsc    120210 ene 30 19:23 searches_sample.csv.bz2


In [9]:
# Counting the lines in the bookings sample

! cat bookings_sample.csv | wc -l

5000


In [10]:
# Counting the lines in the searches sample

! cat searches_sample.csv | wc -l

5000


In [11]:
# Compressing the samples

! bzip2 -fk bookings_sample.csv
! bzip2 -fk searches_sample.csv

In [12]:
# Have the samples been compressed?

! ls -ls

total 1017188
541968 -rwxr-x--- 1 dsc dsc 554970628 mar 13  2018 bookings.csv.bz2
  2072 -rw-r--r-- 1 dsc dsc   2119069 ene 30 20:12 bookings_sample.csv
   264 -rw-r--r-- 1 dsc dsc    270148 ene 30 20:12 bookings_sample.csv.bz2
471872 -rwxr-x--- 1 dsc dsc 483188920 mar 13  2018 searches.csv.bz2
   892 -rw-r--r-- 1 dsc dsc    910597 ene 30 20:12 searches_sample.csv
   120 -rw-r--r-- 1 dsc dsc    120210 ene 30 20:12 searches_sample.csv.bz2
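
For comparison, the sampling step above (`bzcat ... | head -5000`, then `bzip2`) can also be done in one pass from Python. This is only a sketch: the function name `write_sample_bz2` and its parameters are hypothetical and do not appear in the original notebook. Because nothing is piped between processes, no "Broken pipe" message is produced; reading simply stops after the requested number of lines.

```python
import bz2
from itertools import islice


def write_sample_bz2(source, target, n_lines=5000):
    """Copy the first n_lines lines of a bz2 file into a new bz2 file.

    Hypothetical helper, not part of the original notebook.
    Returns the number of lines actually written.
    """
    written = 0
    # Both files are opened in binary mode, so lines are copied byte for byte.
    with bz2.open(source, "rb") as src, bz2.open(target, "wb") as dst:
        # islice stops after n_lines lines without reading the rest of the file.
        for line in islice(src, n_lines):
            dst.write(line)
            written += 1
    return written
```

With the files listed above, `write_sample_bz2('bookings.csv.bz2', 'bookings_sample.csv.bz2')` would be the assumed equivalent of cells 6 and 11, without the intermediate uncompressed `bookings_sample.csv`.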


## 2) Python

We have two options:

* decompress the whole file to disk, then read from the result.

* read the compressed file directly: better, because it needs no extra storage and leaves no temporary files on disk.
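
The second option can be sketched as a small function that streams the compressed file and counts lines as they are decompressed in memory. The name `count_lines_bz2` is an assumption made for this illustration; `bz2.open` is the standard-library call, used here with a `with` block so the file is closed when counting finishes.

```python
import bz2


def count_lines_bz2(path):
    """Count the lines of a bz2-compressed file without writing a decompressed copy.

    Hypothetical helper, not part of the original notebook.
    """
    # Iterating over the file object yields one decompressed line at a time,
    # so only a small part of the data is held in memory at any moment.
    with bz2.open(path, "rb") as handle:
        return sum(1 for _ in handle)
```

Assuming the samples built in section 1, `count_lines_bz2('bookings_sample.csv.bz2')` should return 5000.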


#### Python without decompressing

In [13]:
import bz2

In [14]:
# BZ2File implements a complete file interface, including readline(), readlines(), writelines()...

bookings = bz2.BZ2File('bookings_sample.csv.bz2')

In [15]:
# BZ2File implements a complete file interface, including readline(), readlines(), writelines()...

searches = bz2.BZ2File('searches_sample.csv.bz2')

In [16]:
k = 0
for line in bookings:
    k += 1
print('Number of lines in the bookings file: %s' % k)

Number of lines in the bookings file: 5000


In [17]:
k = 0
for line in searches:
    k += 1
print('Number of lines in the searches file: %s' % k)

Number of lines in the searches file: 5000
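
For the full files, with tens of millions of lines, looping line by line in Python may be slow. One possible alternative, sketched here under the assumption that counting newline bytes is acceptable (the same definition `wc -l` uses), is to read the decompressed stream in fixed-size chunks. The function name and the `chunk_size` default are hypothetical choices for this example.

```python
import bz2


def count_newlines_bz2(path, chunk_size=1024 * 1024):
    """Count newline bytes in a bz2 file by reading decompressed chunks.

    Hypothetical helper, not part of the original notebook.
    Like `wc -l`, a final line without a trailing newline is not counted.
    """
    total = 0
    with bz2.open(path, "rb") as handle:
        while True:
            # read() returns at most chunk_size decompressed bytes,
            # and an empty bytes object at end of file.
            chunk = handle.read(chunk_size)
            if not chunk:
                break
            total += chunk.count(b"\n")
    return total
```

Whether this is actually faster than the loops above depends on the machine and the data, so it is worth timing both on the sample before running either on the full files.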
