# Beginner’s Crash Course in Data Science with Vinita Silaparasetty

## PySpark

### Create a Spark Context

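This is how we connect Spark to Jupyter Notebook. The findspark package locates the local Spark installation so that pyspark can be imported, and the Spark context is our connection to Spark.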

In [1]:
import findspark
findspark.init()
import pyspark

In [2]:
from pyspark import SparkContext
from pyspark import SparkConf

# Spark configuration: run locally, using all available cores.
conf = (SparkConf()
        .setMaster("local[*]")
        .setAppName("pyspark"))
sc = SparkContext(conf=conf)  # Name the context 'sc' for convenience.

### Create RDD

A Resilient Distributed Dataset (RDD) is a collection of elements that can be divided across multiple nodes in a cluster and processed in parallel.

Here, I am loading a CSV file called foo.csv into an RDD. Each line of the file becomes one string element of the RDD.

In [19]:
a = sc.textFile("/Users/vinitasilaparasetty/Skillshare/foo.csv") #As RDD
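
Because `textFile` yields raw lines, the header and the numeric rows both arrive as plain strings. A minimal sketch of turning such lines into numbers follows. `parse_row` is a hypothetical helper, not part of PySpark, and the sample lines are copied from the output shown further down in this post.

```python
def parse_row(line):
    """Split one comma-separated line into a list of floats."""
    return [float(field) for field in line.split(",")]

# Sample lines in the same shape as foo.csv: a header, then numeric rows.
lines = [
    "A,B,C,D",
    "0.358084394621,0.685509002981,-0.459335773064,-1.24134547808",
    "1.25014397425,1.01892490788,-0.0959350533482,-0.455941189274",
]

header = lines[0]
# Skip the header, since float("A") would raise a ValueError.
rows = [parse_row(line) for line in lines if line != header]
print(rows[0])
```

With the RDD itself, the same helper could be applied as `a.filter(lambda line: line != header).map(parse_row)`, taking `header = a.first()`.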

### Repartitioning

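Spark automatically splits data into partitions and executes computations on the partitions in parallel. However, sometimes you may want to set the number of partitions yourself. In the example below I repartition the data into 4 partitions. Note that repartition returns a new RDD rather than changing the original, so assign the result to a variable if you want to keep using it.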

In [53]:
a.repartition(4)

MapPartitionsRDD[66] at coalesce at NativeMethodAccessorImpl.java:0

### Selecting

The collect() command brings the elements of your RDD back to the notebook as a Python list.

Note: collect() takes no arguments, so the output is always all the elements of the RDD. You can see an example below. If you only want the first few elements, use take(n) instead, for example a.take(5).

In [39]:
a.collect()

['A,B,C,D',
 '0.358084394621,0.685509002981,-0.459335773064,-1.24134547808',
 '1.25014397425,1.01892490788,-0.0959350533482,-0.455941189274',
 '0.319967935886,2.16400126217,0.141344032066,-0.586861828853',
 '0.82632758854,1.79984468045,0.557307212616,-1.80471064451',
 '0.122304709276,1.44799950965,0.753386792181,-1.56956785238',
 '-1.05808514059,0.123358906305,0.436120535349,-2.77662767157',
 '-0.26468364774,1.06261684753,1.84055876232,-3.21747258414',
 '0.259064008543,0.7126045925,1.56790352421,-3.49873508484',
 '-1.20060038469,2.64931940525,0.0360814598808,-3.51300887073',
 '-2.01287303681,3.0143142038,0.164115444861,-2.49404920146',
 '-2.12217636141,3.88345395924,-2.30262672913,-3.40361993192',
 '-2.51423762751,1.99528307042,-2.62775066979,-3.42330657985',
 '-1.9830279352,2.34739382421,-1.72984223659,-4.06305388463',
 '-1.7725523041,2.50894797081,-2.69242245777,-3.25473846119',
 '-2.0497064552,1.67617229378,-1.47837430569,-5.98097113104',
 '-1.31311195215,1.3823147691,-3.0478312226,

### Creating Sample Datasets

We can take a random sample of data from our RDD and create smaller datasets that are easier to work with. I am going to create two sample datasets.

In [32]:
b1 = a.sample(False, 0.2, 42)
b2 = a.sample(False, 0.2, 43)

#Checking if the dataset has been created successfully.
b1.count(),b2.count()

(216, 215)
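
The three arguments to `sample` are `withReplacement`, `fraction` and `seed`. The fraction is the chance that each element is kept, not an exact sample size, which is why the two counts above differ slightly. The idea can be sketched in plain Python. `bernoulli_sample` is a hypothetical helper written for illustration and should not be read as Spark's internal implementation.

```python
import random

def bernoulli_sample(items, fraction, seed):
    """Keep each item independently with probability `fraction`."""
    rng = random.Random(seed)
    return [item for item in items if rng.random() < fraction]

data = list(range(1000))
s1 = bernoulli_sample(data, 0.2, 42)
s2 = bernoulli_sample(data, 0.2, 43)

# Sizes land near 200 but are not guaranteed to be exactly 200.
print(len(s1), len(s2))

# Reusing a seed reproduces the same sample.
print(s1 == bernoulli_sample(data, 0.2, 42))
```

Fixing the seed, as the cell above does with 42 and 43, makes each sample reproducible from run to run.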

### Subtraction

We are now going to remove from b1 every element that also appears in b2. The result contains only the elements that are in b1 but not in b2.

In [48]:
b1.subtract(b2).collect()

['0.789408346492,2.81279570306,-9.9688984291,-0.858277283503',
 '-3.40079254276,3.49259893566,-8.32435201563,-8.04939516521',
 '-25.3535869876,8.36029785222,-10.5294133407,-19.039001702',
 '-25.7349528614,2.25796750493,-4.63646013157,-25.3207557222',
 '-29.0084998101,4.55590518068,-6.44915605988,-20.0671520705',
 '-36.9470062971,5.20056911874,-13.7251547875,-26.6577015071',
 '-35.6256567596,1.65797444923,-11.4504932449,-40.1664770135',
 '-29.7107402578,5.52312528548,-13.8260507968,-44.1180879401',
 '-19.1445742253,4.80523646348,-9.53176548006,-54.1026583924',
 '-19.1565175968,3.84368081955,-8.58519733556,-56.0320674306',
 '-15.5800980367,-1.23344947301,-11.3135960926,-53.7124771278',
 '-17.4883243693,5.23492669881,-5.25679424267,-53.7884470759',
 '-24.2285606487,7.16014714414,-9.13784842217,-58.2601170418',
 '-22.9442025838,6.86123291488,-19.0848985795,-67.0893950618',
 '-26.4607294834,2.99146847161,-20.2659637981,-71.001525321',
 '-30.2622702792,-1.53857533029,-18.5553404959,-66.99024
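
As a rough plain-Python picture of what `subtract` computes, the sketch below uses two made-up lists as stand-ins for the contents of `b1` and `b2`.

```python
# Hypothetical stand-ins for the contents of b1 and b2.
b1_items = ["row1", "row2", "row3", "row4"]
b2_items = ["row2", "row4", "row5"]

# Keep what is in b1_items but not in b2_items.
exclude = set(b2_items)
only_in_b1 = [item for item in b1_items if item not in exclude]
print(only_in_b1)  # ['row1', 'row3']
```

Note that "row5" exists only in the second list, so it never appears in the result. Unlike this list version, Spark does not promise any particular order for the elements it returns.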

### Cartesian Product

The Cartesian product of two datasets is the collection of all ordered pairs (x, y), where x is taken from the first dataset and y from the second. It is also known as the cross product.

In the example below, we find the Cartesian product of b1 and b2.

In [49]:
b1.cartesian(b2).collect()

[('A,B,C,D', '-1.05808514059,0.123358906305,0.436120535349,-2.77662767157'),
 ('A,B,C,D', '-0.26468364774,1.06261684753,1.84055876232,-3.21747258414'),
 ('A,B,C,D', '0.259064008543,0.7126045925,1.56790352421,-3.49873508484'),
 ('A,B,C,D', '-1.20060038469,2.64931940525,0.0360814598808,-3.51300887073'),
 ('A,B,C,D', '-2.01287303681,3.0143142038,0.164115444861,-2.49404920146'),
 ('A,B,C,D', '-2.51423762751,1.99528307042,-2.62775066979,-3.42330657985'),
 ('A,B,C,D', '-1.31311195215,1.3823147691,-3.0478312226,-5.68918630695'),
 ('A,B,C,D', '-1.90322827316,1.40618129456,-3.53217500102,-7.38234597327'),
 ('A,B,C,D', '-3.15253921628,1.25394646383,-2.49168024578,-8.78848952272'),
 ('A,B,C,D', '-2.13904367159,2.67077091737,-4.03693040059,-10.4747716194'),
 ('A,B,C,D', '-1.58420145117,0.852240650527,-3.44415480091,-10.2530124502'),
 ('A,B,C,D', '0.910902632335,0.935290943992,-5.29084156973,-8.85966905614'),
 ('A,B,C,D', '4.89726501421,0.952815409542,-9.75244969101,-7.33204167986'),
 ('A,B,C,D', '
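
The same pairing can be seen on a small scale with `itertools.product` from the Python standard library. The two short lists here are made up for illustration.

```python
from itertools import product

left = ["a", "b", "c"]
right = [1, 2]

# Every element of `left` is paired with every element of `right`.
pairs = list(product(left, right))
print(pairs)
# [('a', 1), ('a', 2), ('b', 1), ('b', 2), ('c', 1), ('c', 2)]

# The number of pairs is the product of the two sizes.
print(len(pairs) == len(left) * len(right))  # True
```

For the two samples counted earlier, with 216 and 215 elements, that comes to 216 × 215 = 46,440 pairs, so calling collect() on a Cartesian product can return a very long list.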

### Sorting

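We can sort the elements of the dataset by any key that we require. The sortBy command takes a function that returns the sort key for each element. In the example below, I have sorted b1 using a lambda as the key function. Because each element is a string, x[1] is the second character of the line, so the rows are ordered by that character rather than by a column value.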

In [52]:
b1.sortBy(lambda x: x[1]).collect()

['A,B,C,D',
 '0.319967935886,2.16400126217,0.141344032066,-0.586861828853',
 '0.82632758854,1.79984468045,0.557307212616,-1.80471064451',
 '4.61618399107,1.77757880267,-8.12515202634,-7.73919971443',
 '4.84404249375,0.916221056031,-7.86632821431,-5.5888854068',
 '6.9954752085,2.29308260193,-4.93496802298,-4.58160728567',
 '2.14795552365,1.21952265912,-6.72220345028,-2.77859821772',
 '0.789408346492,2.81279570306,-9.9688984291,-0.858277283503',
 '3.72997786124,1.7127812937,-12.4526534982,0.0110313610012',
 '-1.7725523041,2.50894797081,-2.69242245777,-3.25473846119',
 '-1.90322827316,1.40618129456,-3.53217500102,-7.38234597327',
 '-1.12952180639,3.08679027737,-7.60260675791,0.64508165354',
 '-11.7751707453,11.6588689345,-9.87241794672,-10.682079154',
 '-13.1073944697,13.4283234154,-10.4245257937,-14.5600210505',
 '-11.625244916,12.7272353719,-12.2568809525,-15.8738114269',
 '-11.9476067506,12.0168767969,-12.7729942045,-15.3887796301',
 '-11.4692763707,10.8931237112,-9.52248057973,-15.758
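
To sort by a numeric column rather than by a single character, the key function needs to split the line and convert the field to a number. A minimal sketch follows. `column_b` is a hypothetical helper, and the three lines are taken from the output above.

```python
def column_b(line):
    """Return the second comma-separated field (column B) as a float."""
    return float(line.split(",")[1])

lines = [
    "0.319967935886,2.16400126217,0.141344032066,-0.586861828853",
    "0.82632758854,1.79984468045,0.557307212616,-1.80471064451",
    "4.84404249375,0.916221056031,-7.86632821431,-5.5888854068",
]

# Rows come out in ascending order of column B.
for line in sorted(lines, key=column_b):
    print(line)
```

On the RDD, this could be used as `b1.filter(lambda line: line != "A,B,C,D").sortBy(column_b)`. The header has to be filtered out first, because it cannot be converted to a number.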

### Saving

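You can save an RDD as text with saveAsTextFile. Other formats, such as CSV, JSON, Parquet or ORC, are written through Spark's DataFrame API.

For example, I have saved the RDD as text. Note that Spark creates a directory called a.txt containing one part file per partition, rather than a single file.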

In [None]:
a.saveAsTextFile("a.txt")

### Stop Spark Context

When you are done using a Spark context, it is best to stop it so that it does not continue to run in the background.

In [54]:
sc.stop()