## Local files

### Download

The remote working directory can be downloaded with the `--download` parameter:

```bash
python spark-ec2-helper.py --download
```

This downloads every file in the remote working directory, including the IPython Notebook files and anything your program generated on the server (pickle files, etc.).
Files are saved to the `./remote_files` directory.
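
To check what arrived after a download, a short script can walk `./remote_files`. This is a minimal sketch using only the standard library; the function name `list_downloaded` is illustrative and is not part of `spark-ec2-helper.py`.

```python
import os

def list_downloaded(root="./remote_files"):
    """Return (relative path, size in bytes) for every file under root, sorted by path."""
    found = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            found.append((os.path.relpath(full, root), os.path.getsize(full)))
    return sorted(found)

if __name__ == "__main__":
    # Prints nothing if ./remote_files does not exist yet.
    for rel_path, size in list_downloaded():
        print("%10d  %s" % (size, rel_path))
```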

### Upload

You can upload a single file or all files in a directory with the `--upload` parameter:

```bash
python spark-ec2-helper.py --upload path/to/a/file
python spark-ec2-helper.py --upload path/to/a/directory
```

Use this to make a local text file available on the server, for example when your program needs to read from it.


## S3 files

The `s3helper` object helps you access files stored on S3.

In [None]:
help(s3helper)

To access S3 files, first set your AWS credentials.

In [None]:
%cd /root/ipython/AWS-Spark-Cluster/
%run Credentials.ipynb

In [None]:
sc.stop()

In [None]:
from pyspark import SparkContext,SparkConf
sparkConfig=SparkConf()
sparkConfig.set("spark.executor.memory","20g")
sparkConfig.set("spark.worker.memory","20g")
sparkConfig.set("spark.driver.cores","8")
sparkConfig.set("spark.python.worker.memory","20g")
sparkConfig.getAll()

In [None]:
sc=SparkContext(conf=sparkConfig)

In [None]:
RDD=sc.parallelize(range(100))

In [None]:
RDD.count()

In [None]:
s3helper.set_credential(AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY)

Then open the bucket that has your files.

In [None]:
s3helper.open_bucket('yoav-faces')

Now you can list your files in the bucket.

In [None]:
print s3helper.ls()
filenames=s3helper.ls('output/')
filenames[:10]

In [None]:
!mkdir /mnt/output

In [None]:
%cd /mnt/output
s3helper.get_file(filenames[0])
!ls -l

# Drop the 7-character 'output/' prefix to get the local file name
filename=filenames[0][7:]
print filename
!tar -xzvf "$filename"
!ls -l
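
The slice `filenames[0][7:]` above works because the `'output/'` prefix is exactly seven characters long. A slightly more robust variant, sketched here with a made-up key name and assuming `s3helper.ls` returns plain key strings, takes the last path component instead, so it keeps working if the prefix changes.

```python
import posixpath

def local_name(s3_key):
    """Return the last path component of an S3 key (keys use '/' as separator)."""
    return posixpath.basename(s3_key)

key = "output/data1.tgz"   # hypothetical key, for illustration only
print(local_name(key))     # same result as key[7:] for this prefix
```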

In [None]:
!ls -l /mnt/output/data1/output/*

In [None]:
from pyspark import StorageLevel
StorageLevel.MEMORY_AND_DISK_SER

In [None]:
partition_no=10
DATA=sc.parallelize(range(partition_no),partition_no)
DATA.persist(StorageLevel.MEMORY_AND_DISK_SER)
DATA.count()

In [None]:
%cd /mnt/output
video_names={}
video_index=0
data=[]   # accumulates (descriptor, frame) pairs across all files

from sys import getsizeof
from glob import glob
import pickle
import re
import numpy as np

# Captures the video name and the window number from '<video>_windows<N>.pkl'
pattern=re.compile(r'.*/([^/]+)_windows(\d+)\.pkl')

file_list=glob('/mnt/output/data1/output/*')   # avoid shadowing the builtin `list`
for file in file_list:
    match=re.search(pattern,file)
    if match:
        video_name=match.group(1)
        if video_name not in video_names:
            video_names[video_name]=video_index
            video_index+=1
        video_num=video_names[video_name]
        window_num=int(match.group(2))
    else:
        print 'COULD NOT FIND NUMBER IN',file
        continue

    # Binary mode is the safe choice when reading pickles
    with open(file,'rb') as pkl_file:
        In = pickle.load(pkl_file)
    print window_num,len(In),
    Full=[]
    for f in In:
        descriptor={'video_num':video_num, 'track_num':window_num,
                    'frame no':f[0],
                    'ulx':f[1],'uly':f[2],'size':f[3]}
        Full.append((descriptor,np.array(f[-1],dtype=np.uint16)))
    In=[]
    data.extend(Full)
    Full=[]
    print window_num,len(file_list),len(data)
print 'size of data=',getsizeof(data)
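
The regular expression in the cell above captures two groups from each file path: the video name and the window number. The sketch below applies the same pattern to made-up paths (the names `clipA` and `clipB` are hypothetical) and gives each new video name the next integer index, as the loop does. The helper names `parse_path` and `index_videos` are illustrative only.

```python
import re

pattern = re.compile(r'.*/([^/]+)_windows(\d+)\.pkl')

def parse_path(path):
    """Return (video_name, window_num), or None if the path does not match."""
    m = pattern.search(path)
    if m is None:
        return None
    return m.group(1), int(m.group(2))

def index_videos(paths):
    """Map each video name to an index in order of first appearance."""
    video_names = {}
    rows = []
    for path in paths:
        parsed = parse_path(path)
        if parsed is None:
            continue  # skip files that are not window pickles
        name, window_num = parsed
        # len(video_names) is evaluated before insertion, so new names get 0, 1, 2, ...
        video_num = video_names.setdefault(name, len(video_names))
        rows.append((video_num, window_num))
    return video_names, rows

paths = [  # hypothetical file names
    '/mnt/output/data1/output/clipA_windows3.pkl',
    '/mnt/output/data1/output/clipB_windows0.pkl',
    '/mnt/output/data1/output/clipA_windows12.pkl',
    '/mnt/output/data1/output/README.txt',
]
video_names, rows = index_videos(paths)
print(video_names)
print(rows)
```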

In [None]:
type(data), len(data)

In [None]:
data[0]

In [None]:
New=sc.parallelize(data, numSlices=len(data))

In [None]:
New = New.cache()
New.count()

In [None]:
from sys import getsizeof
print getsizeof(data),getsizeof(data[0][-1])

In [None]:
300*300*2*len(data)
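
`getsizeof(data)` reports only the size of the list object itself, not the frames it refers to, which is why the payload is estimated by hand above. The sketch below illustrates the difference with the standard library only. It assumes 300x300 frames at 2 bytes per pixel, which is inferred from the expression `300*300*2*len(data)` rather than stated anywhere, and uses `array('H')` as a stand-in for the `uint16` NumPy arrays.

```python
import sys
from array import array

FRAME_SIDE = 300        # assumed frame width and height, in pixels
BYTES_PER_PIXEL = 2     # uint16
n_frames = 5            # small made-up count, for illustration only

# One flat buffer of unsigned 16-bit values per frame
frames = [array('H', [0]) * (FRAME_SIDE * FRAME_SIDE) for _ in range(n_frames)]

shallow = sys.getsizeof(frames)                       # the list container only
payload = sum(f.itemsize * len(f) for f in frames)    # the pixel data itself
estimate = FRAME_SIDE * FRAME_SIDE * BYTES_PER_PIXEL * n_frames

print(shallow, payload, estimate)
```

The shallow size stays tiny however large the frames are, while the payload matches the hand-computed estimate.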

In [None]:
data[:2]

In [None]:
frame=data[0][-1]
%pylab inline
max(ravel(frame))

In [None]:
hist(ravel(frame),bins=100);

In [None]:
array(frame,dtype=uint16)

In [None]:
match=re.search(pattern,file)
if match:
    video_name=match.group(1)
    window_num=int(match.group(2))
else:
    print 'COULD NOT FIND NUMBER IN',file
video_name,window_num

In [None]:
!df

To read the files, you have two options.

(1) Get a list of S3 file paths and pass it to Spark.

In [None]:
files = s3helper.get_path('/model-feb')
print files
rdd = sc.textFile(','.join(files))

(2) Load the S3 files into HDFS and read them from there.

In [None]:
files = s3helper.load_path('/model-feb', '/feb')
print files
rdd = sc.textFile(','.join(files))

In [None]:
rdd.count()
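
Both options hand `sc.textFile` a single comma-separated string, which Spark treats as a list of input paths. A minimal sketch of that step follows, with placeholder paths and a guard against an empty list. The helper name `to_textfile_arg` is hypothetical, and the sketch assumes no path contains a comma.

```python
def to_textfile_arg(paths):
    """Join input paths into the comma-separated form that sc.textFile accepts."""
    paths = [p for p in paths if p]   # drop empty entries
    if not paths:
        raise ValueError("no input paths to read")
    return ",".join(paths)

files = ["/feb/part-00000", "/feb/part-00001"]   # placeholder paths
arg = to_textfile_arg(files)
print(arg)
# rdd = sc.textFile(arg)   # requires a running SparkContext
```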

## Parquet Files

In [None]:
s3helper.open_bucket("mas-dse-public")

files = s3helper.load_path('/Weather/US_Weather.parquet', '/US_Weather.parquet')
files[:10]

In [None]:
from pyspark import SparkContext
from pyspark.sql import SQLContext

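# Only one SparkContext can be active at a time; call sc.stop() first if one is running.
# master_url is assumed to be defined earlier in the session.
sc = SparkContext(master=master_url)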
sqlContext = SQLContext(sc)

In [None]:
df = sqlContext.sql("SELECT station, measurement FROM parquet.`/US_Weather.parquet`")
df.head()