# Big Data Concepts in Python

### Lambda Functions
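* Anonymous functions that are defined inline and limited to a single expression
* In the example below, the lambda is the sort `key`, and `sorted()` does the following:
1. Takes an iterable
2. Lowercases each item for comparison only (the items themselves are unchanged)
3. Sorts the items by those lowercase keys and returns a new list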

In [2]:
x = ['Python', 'programming', 'is', 'dank']
print(sorted(x))
print(sorted(x, key=lambda arg: arg.lower()))

['Python', 'dank', 'is', 'programming']
['dank', 'is', 'programming', 'Python']
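
As a minimal sketch of what "defined inline" means, the lambda used as the sort key can be compared with an equivalent named function. The names `lower_key`, `lower_key_named`, and `words` are made up for this illustration.

```python
# A lambda is an anonymous, single-expression function.
lower_key = lambda arg: arg.lower()

def lower_key_named(arg):
    # Equivalent named function: same input, same return value.
    return arg.lower()

words = ['Python', 'programming', 'is', 'dank']

sorted_with_lambda = sorted(words, key=lower_key)
sorted_with_def = sorted(words, key=lower_key_named)

print(sorted_with_lambda == sorted_with_def)  # True
```

The lambda form is convenient when the function is short and only used once, as with a `key` argument.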


#### Filter( )

* Function: Filters out items based on a condition, typically written as a lambda function
1. Takes an iterable
2. Calls the lambda function on each item
3. Returns only the items for which the lambda returns a truthy value

* `filter()` in Method 1 returns a lazy iterator rather than a list of the actual items
    * Useful for big data sets: items are produced one at a time, so a dataset that may be terabytes in size never has to be held in memory all at once
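
A small sketch of that laziness, assuming a hypothetical generator `numbers` that stands in for a data source far too large to hold in memory. Only the items actually requested are ever produced.

```python
from itertools import islice

def numbers(limit):
    # Hypothetical stand-in for a very large data source.
    for n in range(limit):
        yield n

# Building the filter does no work yet; it only wraps the generator.
lazy_evens = filter(lambda n: n % 2 == 0, numbers(10**12))

# Pull just the first three matches; the rest is never generated.
first_three = list(islice(lazy_evens, 3))
print(first_three)  # [0, 2, 4]
```

Wrapping the same `filter()` call in `list()` would instead try to materialize every match at once.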

In [3]:
# Method 1
print(list(filter(lambda arg: len(arg) < 8, x)))

# Method 2
def is_less_than_8_characters(item):
    return len(item) < 8

x = ['Python', 'programming', 'is', 'dank!']
results = []

for item in x:
    if is_less_than_8_characters(item):
        results.append(item)

print(results)

['Python', 'is', 'dank']
['Python', 'is', 'dank!']


#### Map( )

* Function: Applies a function to every item, mapping each original item 1:1 to that function's return value

1. Takes an iterable
2. Calls the lambda function on each item
3. Returns the mapped outputs as a lazy iterator

* Note: `map()` (as opposed to `filter()`) always returns the same number of items that were passed in
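
The difference in item counts can be checked directly. This is a small sketch using a made-up `words` list.

```python
words = ['Python', 'programming', 'is', 'dank!']

# map() yields exactly one output per input item.
mapped = list(map(len, words))

# filter() may drop items, so the output can be shorter.
filtered = list(filter(lambda w: len(w) < 8, words))

print(mapped)                                  # [6, 11, 2, 5]
print(len(words), len(mapped), len(filtered))  # 4 4 3
```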

In [4]:
# Method 1
print(list(map(lambda arg: arg.upper(), x)))

# Method 2:
results = []

x = ['Python', 'programming', 'is', 'dank!']
for item in x:
    results.append(item.upper())
    
print(results)

['PYTHON', 'PROGRAMMING', 'IS', 'DANK!']
['PYTHON', 'PROGRAMMING', 'IS', 'DANK!']


#### Reduce( )
* Function: Applies a two-argument function cumulatively to the elements of an iterable to reduce them to a single value

1. Takes an iterable
2. Calls the function on the running result and the next element (starting with the first two elements)
3. Returns the final combined value once every element has been consumed

* Note: The items in the iterable are combined from left to right into a single item

In [None]:
from functools import reduce
x = ['Python', 'programming', 'is', 'awesome!']
print(reduce(lambda val1, val2: val1 + val2, x))
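
To make the left-to-right combination visible, the sketch below uses `itertools.accumulate`, which applies the same two-argument function but keeps every intermediate result. The variable names are illustrative only.

```python
from functools import reduce
from itertools import accumulate

words = ['Python', 'programming', 'is', 'awesome!']

# accumulate() keeps each running result; reduce() keeps only the last one.
steps = list(accumulate(words, lambda val1, val2: val1 + val2))
total = reduce(lambda val1, val2: val1 + val2, words)

for step in steps:
    print(step)

print(steps[-1] == total)  # True
```

Each printed line is the running value after one more element has been folded in. The final line matches what `reduce()` returns.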

### Spark and PySpark

#### What is Spark?
* Apache Spark is a general-purpose engine for processing large amounts of data
* Written primarily in Scala and runs on the JVM

#### What is PySpark?
* Python-based wrapper on top of Spark's Scala API
    * A library for processing large amounts of data on a single machine or across a cluster
    * Comparable to getting the benefits of multithreading / multiprocessing without using those modules directly
* PySpark's programming model follows from Spark's design:
    * Scala supports a functional programming style
    * Functional code (no shared mutable state) is easier to parallelize
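
As a rough illustration of why functional code parallelizes easily, the sketch below runs the same pure function sequentially and through a thread pool from the standard library. This is not how Spark distributes work, and threads in CPython will not speed up this kind of task. The point is only that a function with no shared state gives the same result either way. The name `shout` is made up for the example.

```python
from concurrent.futures import ThreadPoolExecutor

def shout(word):
    # Pure function: the output depends only on the input, no shared state.
    return word.upper()

words = ['Python', 'programming', 'is', 'dank!']

# Sequential version using the built-in map().
sequential = list(map(shout, words))

# Same mapping spread across two worker threads.
with ThreadPoolExecutor(max_workers=2) as pool:
    # Executor.map returns results in input order.
    parallel = list(pool.map(shout, words))

print(sequential == parallel)  # True
```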

#### PySpark API and Data Structures
* RDDs: ***R***esilient ***D***istributed ***D***atasets
    * Specialized data structures used within Spark
    * Loosely comparable to a pandas DataFrame
    * Hide the complexity of transforming and distributing data across multiple nodes via a scheduler
* SparkContext:
    * Entry point of a PySpark program; it connects to a Spark cluster and creates RDDs
* RDDs can be created from common data structures like lists and tuples
    * Done via the `parallelize()` function
    * Because RDDs are evaluated lazily, nothing is computed until a result is requested; functions like `take()` let you see a small sample of the data without destroying your machine!
    
``` 
      ------<------------ Worker Node: {Executor: [task, task], Cache: []}
      |                 /
Spark Context --> Cluster Manager 
      |                 \
      -------<----------- Worker Node: {Executor: [task, task], Cache: []}

```

##### Example
```python
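from pyspark import SparkContext

# Assumes a working Spark installation; reuses a running context if there is one
sc = SparkContext.getOrCreate()

big_list = range(10000)
rdd = sc.parallelize(big_list, 2)        # split the data into 2 partitions
odds = rdd.filter(lambda x: x % 2 != 0)  # lazy: nothing is computed yet
odds.take(5)                             # computes only what is needed
# [1, 3, 5, 7, 9]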
```
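
For readers without a Spark installation, the idea behind partitions and `take()` can be modelled in plain Python. This is a toy sketch, not Spark's implementation. The helpers `partition` and `take` are hypothetical names written for this example.

```python
from itertools import chain, islice

def partition(data, num_partitions):
    # Split a sequence into contiguous slices of roughly equal size.
    size = -(-len(data) // num_partitions)  # ceiling division
    return [data[i:i + size] for i in range(0, len(data), size)]

def take(partitions, n):
    # Pull at most n items, reading the partitions lazily and in order.
    return list(islice(chain.from_iterable(partitions), n))

big_list = range(10000)
parts = partition(big_list, 2)

# Each partition gets its own lazy filter; no item is tested yet.
odd_parts = [filter(lambda x: x % 2 != 0, p) for p in parts]

print(take(odd_parts, 5))  # [1, 3, 5, 7, 9]
```

Only the first few items of the first partition are examined. The second partition is never touched.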

#### MapReduce

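* Map: Processes each block of input, filtering and sorting the data into intermediate key-value pairs
* Reduce: Aggregates the intermediate values for each key, reducing the size of the data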

```
            |Block 1 <-> Map|-
         /                    \               Shuffle
Input ----> |Block 2 <-> Map|----> Combine ->    &    -- > Reduce -> Output
         \                    /                Sort
            |Block 3 <-> Map|-
```
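
The flow in the diagram can be sketched in plain Python with a word count, the usual introductory MapReduce example. This is a single-process illustration, not a real MapReduce job. The optional combine step is left out, and the sample `blocks` and all function names are invented for the sketch.

```python
from collections import defaultdict
from functools import reduce

# Each string stands in for one input block.
blocks = [
    "spark is fast",
    "python is fun",
    "spark is fun",
]

def map_phase(block):
    # Map: emit a (word, 1) pair for every word in the block.
    return [(word, 1) for word in block.split()]

def shuffle_and_sort(mapped_blocks):
    # Shuffle & sort: group values by key, then order the keys.
    grouped = defaultdict(list)
    for pairs in mapped_blocks:
        for key, value in pairs:
            grouped[key].append(value)
    return sorted(grouped.items())

def reduce_phase(grouped):
    # Reduce: collapse each key's list of values into a single count.
    return {key: reduce(lambda a, b: a + b, values) for key, values in grouped}

mapped = [map_phase(block) for block in blocks]
word_counts = reduce_phase(shuffle_and_sort(mapped))
print(word_counts)
# {'fast': 1, 'fun': 2, 'is': 3, 'python': 1, 'spark': 2}
```

In a real cluster the map calls run on different nodes and the shuffle moves data over the network, but the three phases keep the same shape.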


#### Spark vs. MapReduce

* Spark Pros:
  * Spark is a great fit when your working data can be held in memory
      * It keeps intermediate data in RAM, which can make it up to ~100x faster than MapReduce, which writes intermediate results to disk
  * Spark offers APIs in several languages, which makes it easy to adopt
      * Python, Scala, Java, SQL
      * MapReduce does have Hive and Pig, which allow for more access than just pure Java
* MapReduce Pros:
  * MapReduce is great for data that can't fit in memory