### Stimulating the data
1>Start by writing some code to stimulate files containing random DNA,protein and binary data
  - Using np.random.choice Using np.random.choice, generate 100 megabytes of random data containing 100%, 90%, 80%, 70%, 60%, and 50% zeros. 
  - The number of percentage of zeros determined by changing p values to 0 or 1
  - Call np.packbits on the data before writing it to a file. 
  - Write the data to a file in home directory
  

In [7]:
import numpy as np 
# Set 100% of zeros 
myvar1 = np.random.choice([0, 1], size=(8,1024,1024,100),replace=True, p=[1, 0])
myvar1 = np.packbits(myvar1)

#Set 90% of zeros 
myvar2 = np.random.choice([0, 1], size=(8,1024,1024,100), replace=True, p=[0.9, 0.1])
myvar2 = np.packbits(myvar2)

#Set 80% of zeros
myvar3 = np.random.choice([0, 1], size=(8,1024,1024,100), replace=True, p=[0.8, 0.2])
myvar3 = np.packbits(myvar3)

#Set 70% of zeros
myvar4 = np.random.choice([0, 1], size=(8,1024,1024,100), replace=True, p=[0.7, 0.3])
myvar4 = np.packbits(myvar4)

#Set 60% of zeros
myvar5 = np.random.choice([0, 1], size=(8,1024,1024,100), replace=True, p=[0.6, 0.4])
myvar5 = np.packbits(myvar5)

#Set 50% of zeros
myvar6 = np.random.choice([0, 1], size=(8,1024,1024,100), replace=True, p=[0.5, 0.5])
myvar6 = np.packbits(myvar6)


In [9]:
# Write into the home directory. 
open("zeros_100p", "wb").write(myvar1)
open("zeros_90p", "wb").write(myvar2)
open("zeros_80p", "wb").write(myvar3)
open("zeros_70p", "wb").write(myvar4)
open("zeros_60p", "wb").write(myvar5)
open("zeros_50p", "wb").write(myvar6)

104857600

After writing into the home directory, the output of the file is about 105MB for each case. 

#### Generate DNA and protein sequences 100 million letters long and write those to your home directory
The process is similar as described above, but with different parameters. For the DNA sequence generation, the np.random choices becomes four types of nucleotide( A,T,C,G). and each letter would have equal probability.

For the protein sequence, the random choices are 20 amino acids with equal probability.

In [10]:
my_seq= np.random.choice(['A', 'T','C','G'], size=100000000, replace=True, p=[0.25, 0.25,0.25,0.25])
my_protein=np.random.choice(['A','R','N','D','C','E','Q','G','H','I','L','K','M','F','P','S','T','W','Y','V'], size=100000000, replace=True)

In [11]:
open('nucleotide_seq','w').write(''.join(my_seq))
open('protein_seq','w').write(''.join(my_protein))

100000000

### Compressing the data
From the terminal to compress the data.
- On each of the files generated above, run gzip, bzip, pbzip2 and ArithmeticCompress as follows:  
time gzip –k file_name  
time bzip2 –k file_name   
time pbzip2 –k file_name  
time ArithmeticCompress file_name 

- Keep track of the size of the input files, the size of the output files, and the time each command took to run in a table in your iPython notebook.



### Result after the compress
#### All the 0/1 input files size are 105MB, and the nucleotide and protein file size are 100MB 

#### 100% zeros
time gzip -k zeros_100p  

real    0m0.732s  
user    0m0.696s  
sys     0m0.036s  

output file: 102kb

time bzip2 -k zeros_100p

real    0m1.017s  
user    0m0.977s  
sys     0m0.040s  

output file:113b

time pbzip2 -k zeros_100p

real    0m0.108s  
user    0m1.947s  
sys     0m0.113s  

output file:5.62kb
    
time ArithmeticCompress zeros_100p zeros_100p.art

real    0m14.886s  
user    0m14.830s  
sys     0m0.056s   

output file:1.03kb



#### 90% zeros
time gzip -k zeros_90p

real    0m19.091s  
user    0m18.938s  
sys     0m0.152s

output file:58.7MB

time bzip2 -k zeros_90p

real    0m10.638s  
user    0m10.569s  
sys     0m0.068s

output file:61.2MB

time pbzip2 -k zeros_90p

real    0m0.785s  
user    0m18.999s  
sys     0m0.994s

output file:61.2MB

time ArithmeticCompress zeros_90p zeros_90p.art

real    0m28.906s  
user    0m28.706s  
sys     0m0.200s

output file:49.2MB

#### 80% zeros
time gzip -k zeros_80p

real    0m14.080s  
user    0m13.925s  
sys     0m0.128s

output file:81.2MB
    
time bzip2 -k zeros_80p

real    0m11.990s  
user    0m11.894s  
sys     0m0.096s

output file:86.6MB

time pbzip2 -k zeros_80p

real    0m0.956s  
user    0m23.806s  
sys     0m0.938s

output file:86.7MB

time ArithmeticCompress zeros_80p zeros_80p.art

real    0m35.604s  
user    0m35.359s  
sys     0m0.212s

output file:75.7MB
    

#### 70% zeros
time gzip -k zeros_70p

real    0m6.078s  
user    0m5.937s  
sys     0m0.140s

output file:93.6MB
    
time bzip2 -k zeros_70p

real    0m13.738s  
user    0m13.642s  
sys     0m0.096s

output file:99.8MB

time pbzip2 -k zeros_70p

real    0m1.145s  
user    0m29.959s  
sys     0m0.887s

output file:99.8MB
    
time ArithmeticCompress zeros_70p zeros_70p.art

real    0m39.585s  
user    0m39.299s  
sys     0m0.237s

output file:92.4MB

#### 60% zeros
time gzip -k zeros_60p

real    0m4.247s  
user    0m4.179s  
sys     0m0.069s

output file:102MB
    
time bzip2 -k zeros_60p

real    0m15.710s  
user    0m15.605s  
sys     0m0.104s

output file:105MB

time pbzip2 -k zeros_60p

real    0m1.387s  
user    0m37.009s  
sys     0m0.856s

output file:105MB

time ArithmeticCompress zeros_60p zeros_60p.art

real    0m40.953s  
user    0m40.648s  
sys     0m0.280s

output file:102MB

#### 50% zeros
time gzip -k zeros_50p

real    0m3.503s  
user    0m3.434s  
sys     0m0.068s

output file:105MB

time bzip2 -k zeros_50p

real    0m16.669s  
user    0m16.544s  
sys     0m0.125s

output file:105MB
    
time pbzip2 -k zeros_50p

real    0m1.532s  
user    0m40.325s  
sys     0m0.937s

output file:105MB
    
time ArithmeticCompress zeros_50p zeros_50p.art

real    0m40.867s  
user    0m40.654s  
sys     0m0.213s  

output file:105MB

#### nucleotide sequence
time gzip -k nucleotide_seq

real    0m12.158s  
user    0m12.057s  
sys     0m0.101s  

output file:29.2 MB

time bzip2 -k nucleotide_seq

real    0m9.474s  
user    0m9.426s  
sys     0m0.048s  

output file:27.3 MB

time pbzip2 -k nucleotide_seq

real    0m0.679s  
user    0m16.024s  
sys     0m0.842s  

output file:27.3 MB

time ArithmeticCompress nucleotide_seq nucleotide_seq.art

real    0m21.268s  
user    0m21.191s  
sys     0m0.076s  

output file:25 MB

#### Protein sequence
time gzip -k protein_seq

real    0m4.211s  
user    0m4.154s  
sys     0m0.057s

output file:60.6 MB
    
time bzip2 -k protein_seq

real    0m9.956s  
user    0m9.895s  
sys     0m0.060s

output file:55.3 MB
    
time pbzip2 -k protein_seq

real    0m0.773s  
user    0m18.709s  
sys     0m0.737s

output file:55.3 MB

time ArithmeticCompress protein_seq protein_seq.art

real    0m29.594s  
user    0m29.404s  
sys     0m0.189s

output file:54 MB

In [30]:
print("            File output size and compression time            ")

print("              ")

print("Compression     gzip        bzip2       pbzip2      Arithmetic")

print("              ")

print("100%           102 KB      113 B        5.62 KB      1.03 KB")

print("               0.732s      1.017s       0.108s       14.886s")

print("              ")

print("90%            58.7 MB     61.2 MB      61.2 MB      49.2MB")

print("               19.091s     10.638s      0.785s       28.906s")

print("              ")
  
print("80%            81.2 MB     86.6 MB      86.7 MB      75.7 MB")

print("               14.080s     11.990s      0.956s       35.604s")

print("              ")

print("70%            93.6 MB     99.8 MB      99.8 MB      92.4 MB")

print("               6.078s      13.738s      1.145s       39.585s")

print("              ")

print("60%            102 MB      105 MB       105 MB       102 MB")

print("               4.247s      15.710s      1.387s       40.953s")

print("              ")

print("50%            105 MB      105 MB       105 MB       105 MB")

print("               3.503s      16.669s      1.532s       40.867s")

print("              ")

print("nucleotide     29.2 MB      27.3 MB      27.3 MB      25 MB")

print("               12.158s      9.474s       0.679s       21.268s")

print("              ")

print("protein        60.6 MB      55.3 MB      55.3 MB      54 MB")

print("               4.211s       9.956s       0.773s       29.594s")

            File output size and compression time            
              
Compression     gzip        bzip2       pbzip2      Arithmetic
              
100%           102 KB      113 B        5.62 KB      1.03 KB
               0.732s      1.017s       0.108s       14.886s
              
90%            58.7 MB     61.2 MB      61.2 MB      49.2MB
               19.091s     10.638s      0.785s       28.906s
              
80%            81.2 MB     86.6 MB      86.7 MB      75.7 MB
               14.080s     11.990s      0.956s       35.604s
              
70%            93.6 MB     99.8 MB      99.8 MB      92.4 MB
               6.078s      13.738s      1.145s       39.585s
              
60%            102 MB      105 MB       105 MB       102 MB
               4.247s      15.710s      1.387s       40.953s
              
50%            105 MB      105 MB       105 MB       105 MB
               3.503s      16.669s      1.532s       40.867s
              
nucleotide     29.2 MB    

#### Questions

##### -Which algorithm achieves the best level of compression on each file type?  

The best level of the compression is achieved under the Arithmetic algorithm for all the cases. Since it has the the smallest output file size. 

##### -Which algorithm is the fastest?  

pbzip2 algorithm is the fastest overall. The most files get compressed within and around 1 second. 

##### -What is the difference between bzip2 and pbzip2? Do you expect one to be faster and why?  

Both Bzip2 and pbzip2 will compress the file and produce a new file with the extension .bz2. But,bzip2 compresses data in blocks of size between 100 and 900 kB and uses the Burrows–Wheeler transform to convert frequently-recurring character sequences into strings of identical letters. 

pbzip2 have multi-cores and looking for a file compressing utility that uses all the available CPU cores ( significantly improving the performance & reducing the time it takes) 
So, from the time prespective, the pbzip2 is much faster than the Bzip2.

##### -How does the level of compression change as the percentage of zeros increases? Why does this happen?

As the file containing the percentage of zeros increases, the level of compression decreases, the file compressed size is smaller and the compression time is less as well. This is because there more zeros in the file, it takes up less percentaage of file.
##### -What is the minimum number of bits required to store a single DNA base?

2 bits 

##### -What is the minimum number of bits required to store an amino acid letter?

Theoretically, for an amino acid letter we need log2(20) which is about of 4.32 bits. and round up to 5 bits

#### -In your tests, how many bits did gzip and bzip2 actually require to store your random DNA and protein sequences?

Random DNA compressed in gzip file requires 29.2 MB, (using the convertion unit 8 bits/byte * 1024 bytes/kilobyte * 1024 kilobytes/megabyte * 100) = 2449473536 bits

Using the convertion factor above, bzip2 file is 27.3 MB (2290089984 bits)

Random protein sequence compressed in gzip file is 60.6 MB (5083496448 bits)
                      in bzip2 file is 55.3 MB (4638900224 bits)
##### -Are gzip and bzip2 performing well on DNA and proteins?
For the DNA compression, the file can be compressed from 100MB to 20MB. the protein compression is from 100MB to 60MB. The compression time for DNA under the bip2 is less than the time compression used in protein. On the oppposite, the gzip is fast for the protein sequence compression

### Compress the real data
1>Find the nucleic acid sequences of gp120 homologs from at least  10 different HIV isolates and concatenate them together into a single multi-FASTA.

2> Compress the multi-FASTA using gzip, bzip2, and arithmetic coding.



In [54]:
from Bio import SeqIO
from Bio import Entrez
Seq=[]
New=[]
Entrez.email="zihuixu1@berkeley.edu"
handle=Entrez.esearch(db='nucleotide',
                     term='gp120',
                     sort='relevance',
                     idtype='acc')

for i in Entrez.read(handle)['IdList']:
    handle=Entrez.efetch(db='nucleotide',id=i,rettype='fasta',retmode='text')
    temp=SeqIO.read(handle,'fasta')
    Seq.append(">" + temp.description + '\n' + str(temp.seq) + '\n')

open('gp120.fa','w').write(''.join(str(Seq)))


10069

#### The result of compress file 
time gzip -k gp120.fa

real    0m0.004s  
user    0m0.000s  
sys     0m0.003s  
output file: 1.88kb
    
time bzip2 -k gp120.fa

real    0m0.006s  
user    0m0.003s  
sys     0m0.003s  
output file: 1.91kb
    
time pbzip2 -k gp120.fa

real    0m0.008s  
user    0m0.004s  
sys     0m0.004s  
output file: 1.91kb
    
time ArithmeticCompress gp120.fa gp120.fa.art

real    0m0.010s  
user    0m0.010s  
sys     0m0.000s  
output file: 3.04kb

In [56]:
# The initial file is 8.44kb
print("           File output size and compression time for HIV gp120 multi-FASTA               ")

print("              ")

print("Compression                                 gzip        bzip2         pbzip2       Arithmetic")

print("              ")

print("Run output file size                        1.88 KB      1.91KB        1.91 KB      3.04 KB")

print("              ")

print("Real running time                           0.004s       0.006s        0.008s       0.01s")
print("              ")

print("Real compression data ratio                 0.22         0.21          0.226        0.36")
print("              ")

print("random compression data ratio               0.29         0.27          0.27         0.25")
 

 





           File output size and compression time for HIV gp120 multi-FASTA               
              
Compression                                 gzip        bzip2         pbzip2       Arithmetic
              
Run output file size                        1.88 KB      1.91KB        1.91 KB      3.04 KB
              
Real running time                           0.004s       0.006s        0.008s       0.01s
              
Real compression data ratio                 0.22         0.21          0.226        0.36
              
random compression data ratio               0.29         0.27          0.27         0.25


#### Analysis:
    
    Compressing the real data is much better than compressing the random data. This may because that the real data with different fasta sequence that are really similar and the mutual information contain is high which save some bites during the compression. The random data is uniform distribution.
    
    After comparing the compression ratio between the file output and input size, the ratio within the real data is less than the random data. 

### Estimating compression of 1000 terabytes
Most of the data, say 80%, is re-sequencing of genomes and plasmids that are very similar to each other. Another 10% might be protein sequences, and the last 10% are binary microscope
images which we’ll assume follow the worst-case scenario of being completely random.

resequncing genome and plasmids:
    Using the gzip algothrim in this case, the file can be compressed into the smaller file since the sequences are really similar to each other. gzip can also compress the file in the fast rate. After the compression, 60% of bit can be saved.
    
protein sequence:
    for the protein compression, I decided to use the pbzip algorithm, since it can compress the file fast and into the small size. it may be about 70 % storage can be saved.
    
microscope image:
    the information contain is really random. using the arithmetic compression to achieve the better result.As long as the time is not a big consideration, the compressed file can save up to 80%.
    
MAYBE if the compressed file can achieve 1% reduction in overall data, that can provide me $500 saving. 
    