# *arg

Use `*arg` when the number of positional arguments passed to a function is not known in advance; inside the function, `arg` is a tuple of those arguments.


The same idea applies to an unknown number of keyword arguments: use `**kwargs`, which collects them into a dict. The names `args` and `kwargs` are only a convention; the `*` and `**` do the work.
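A minimal sketch of how the two forms collect arguments. The function name `describe_call` and the keyword names `sep` and `verbose` are made up for illustration.

```python
def describe_call(*args, **kwargs):
    """Return the positional and keyword arguments exactly as received."""
    # args is a tuple of positional arguments; kwargs is a dict of keyword arguments
    return args, kwargs

positional, keyword = describe_call(1, 2, 3, sep=", ", verbose=True)
print(positional)  # (1, 2, 3)
print(keyword)     # {'sep': ', ', 'verbose': True}
```

Calling `describe_call()` with no arguments gives an empty tuple and an empty dict, so neither form requires any input.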


In [1]:
from functools import reduce

# intersection uses & (| would give the union); inputs are converted to sets first
intersection_among_lists = lambda x, y: set(x) & set(y)

ls1 = [2, 3, 4]
ls2 = [3, 4, 5]
ls3 = [3]

def common_elements(*arg):
    # start from the first list, then intersect with each remaining list
    result = set(arg[0])
    for other in arg[1:]:
        result = result & set(other)

    return list(result)



In [2]:
common_elements(ls1, ls2, ls3)

[3]

In [3]:
# using map and reduce

def common_elements(*args):
    # convert each list to a set
    args = map(lambda x: set(x), args)  # map(set, args) also works, since set is already callable
    # reduce all the lists to a single set
    args = reduce(lambda x, y: x & y, args)
    return list(args)

common_elements(ls1, ls2, ls3)

[3]

## Review: Why is Map Reduce Better?

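Because the map and reduce steps are easy to parallelize. Each map call depends only on its own element, and a reduce with an associative operation (such as set intersection) can be computed on separate chunks and then combined. Frameworks such as Spark use this to spread the work across many CPU cores or machines. Python's built-in `map` and `functools.reduce` on their own still run on a single core.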
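A small sketch of why an associative reduce can be split up. It runs sequentially here; the chunks only stand in for work that a framework could hand to separate workers. The function name `intersect_in_chunks` and the `chunk_size` parameter are assumptions for illustration, and the input is assumed to be non-empty.

```python
from functools import reduce

def intersect_in_chunks(lists, chunk_size=2):
    """Intersect many lists by reducing chunks first, then combining the partial results."""
    # map step: each list becomes a set, independently of the others
    sets = list(map(set, lists))
    # split the sets into chunks; each chunk could be handled by a separate worker
    chunks = [sets[i:i + chunk_size] for i in range(0, len(sets), chunk_size)]
    # partial reduce: intersect within each chunk
    partials = [reduce(lambda x, y: x & y, chunk) for chunk in chunks]
    # final reduce: intersect the partial results
    return sorted(reduce(lambda x, y: x & y, partials))

data = [[2, 3, 4], [3, 4, 5], [3], [1, 3, 9]]
print(intersect_in_chunks(data))  # [3]
```

The result does not depend on `chunk_size`, which is what makes it safe to distribute the partial reductions.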

## Pyspark installation on local machine

PySpark is the Python API for Apache Spark, a widely used Big Data platform (written mainly in Scala and running on the JVM).

Running PySpark on a local machine is not the best setup, according to Milad: AWS is preferable, and Databricks even more so. A local install cannot give you a real multi-machine cluster, but it is fine for learning.

1. Install JDK 8 or higher from https://www.oracle.com/technetwork/java/javase/downloads/jdk8-downloads-2133151.html
2. Install Spark with `brew install apache-spark`
3. Install `findspark` and `pyspark` via
    - `pip install findspark`
    - `pip install pyspark`
4. In your `.bashrc`, `.zshrc`, or whichever shell config you use, add the following (adjust the path to match your Python install):
    * `export PYSPARK_PYTHON=python3`
    * `export SPARK_HOME=/usr/local/lib/python3.7/site-packages/pyspark`
5. In your text editor with a Python file open, type the following:



## Why we use Pyspark

1. Big Data array manipulation
2. Big dataframe manipulation
3. Train ML models on Big Data


In [4]:
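from pyspark import SparkContext
from pyspark.sql import SparkSession

# getOrCreate() reuses a running context, so re-running this cell does not raise an error
sc = SparkContext.getOrCreate()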

spark = SparkSession \
   .builder \
   .appName("Python Spark regression example") \
   .config("spark.some.config.option", "some-value") \
   .getOrCreate()

Spark downloads page: https://spark.apache.org/downloads.html

In [8]:

from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.regression import LinearRegressionWithSGD as lrSGD


regressionDataFrame = spark.read.csv('Advertising.csv',header=True, inferSchema = True)
# drop the index column (_c0), not the header row
regressionDataFrame = regressionDataFrame.drop('_c0')
# show the top 10 rows
regressionDataFrame.show(10)
# convert the DataFrame to an RDD of plain lists
regressionDataRDD = regressionDataFrame.rdd.map(list)
# target (label) is column 3, sales; predictors are columns 0-2: TV, radio, newspaper
regressionDataLabelPoint = regressionDataRDD.map(lambda data : LabeledPoint(data[3], data[0:3]))
# random split into roughly 70% training and 30% test data
regressionLabelPointSplit = regressionDataLabelPoint.randomSplit([0.7, 0.3])

regressionLabelPointTrainData = regressionLabelPointSplit[0]

regressionLabelPointTestData = regressionLabelPointSplit[1]

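# train the model with stochastic gradient descent
# (LinearRegressionWithSGD is deprecated since Spark 2.0 in favor of pyspark.ml.regression.LinearRegression)
ourModelWithLinearRegression = lrSGD.train(data=regressionLabelPointTrainData,
                                           iterations=200, step=0.02, intercept=True)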

+-----+-----+---------+-----+
|   TV|radio|newspaper|sales|
+-----+-----+---------+-----+
|230.1| 37.8|     69.2| 22.1|
| 44.5| 39.3|     45.1| 10.4|
| 17.2| 45.9|     69.3|  9.3|
|151.5| 41.3|     58.5| 18.5|
|180.8| 10.8|     58.4| 12.9|
|  8.7| 48.9|     75.0|  7.2|
| 57.5| 32.8|     23.5| 11.8|
|120.2| 19.6|     11.6| 13.2|
|  8.6|  2.1|      1.0|  4.8|
|199.8|  2.6|     21.2| 10.6|
+-----+-----+---------+-----+
only showing top 10 rows
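To make the `LabeledPoint(data[3], data[0:3])` mapping concrete without needing Spark, here is a plain-Python sketch using the first three rows of the output above. The helper name `to_label_and_features` is hypothetical; it only mimics how each row is split, not what `LabeledPoint` does internally.

```python
# Plain-Python stand-in for the LabeledPoint mapping; no Spark required.
# Column order after dropping _c0: TV, radio, newspaper, sales
rows = [
    [230.1, 37.8, 69.2, 22.1],
    [44.5, 39.3, 45.1, 10.4],
    [17.2, 45.9, 69.3, 9.3],
]

def to_label_and_features(row):
    """Split a row the same way as LabeledPoint(data[3], data[0:3])."""
    # label is sales (index 3); features are TV, radio, newspaper (indices 0-2)
    return row[3], row[0:3]

pairs = [to_label_and_features(row) for row in rows]
print(pairs[0])  # (22.1, [230.1, 37.8, 69.2])
```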

