# Intro to Spark with Python

### What is Spark
* Distributed data processing framework
* "Yet another Hadoop"
* Based on Resilient Distributed Datasets
* Used for 'big' data processing



* Relational databases are great, but they don't scale above one box.
* Relational databases: query optimization

### Resilient Distributed Datasets
* How to distribute/parallelize a big set of objects
* We can divide it into slices and keep each slice on a different node
    * **Values are computed only when needed**: speed
    * To guarantee **fault tolerance** we also keep info about how we calculated each slice, so we can re-generate it if a node fails
    * We can hint to keep in cache, or even save on disk
* Immutable: not designed for in-place read/write
    * Instead, transform an existing one into a new one
* It is basically a huge list
    * But distributed over many computers
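
The fault-tolerance idea above can be sketched in plain Python. This is only a toy model, not Spark code: the names `source`, `make_slice`, and `nodes` are made up for illustration, and a dictionary stands in for the cluster. The point is that keeping the recipe for a slice is enough to rebuild it after a failure.

```python
# Toy model of slices plus lineage (illustration only, not Spark code).
source = list(range(10))   # the original data
num_slices = 3

def make_slice(i):
    """The 'lineage' of slice i: take every num_slices-th element, then square it."""
    return [x * x for x in source[i::num_slices]]

# Each "node" holds one computed slice.
nodes = {i: make_slice(i) for i in range(num_slices)}

# Simulate losing node 1, then regenerate its slice from the recipe.
lost = nodes.pop(1)
nodes[1] = make_slice(1)
recovered = nodes[1]

print(recovered)  # [1, 16, 49]
```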
    
### Shared Spark Variables
* **Broadcast variables**
    * copy is kept at each node
* **Accumulators**
    * you can only add; main node can read
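
A rough single-process imitation of the two kinds of shared variable, to make the roles concrete. Everything here is hypothetical (`country_names`, `AddOnlyCounter`, `process_slice`); in PySpark the real objects come from `sc.broadcast(...)` and `sc.accumulator(...)`, which this sketch does not use.

```python
# Toy imitation of Spark's shared variables (not the PySpark API).
country_names = {"us": "United States", "de": "Germany"}  # "broadcast": read-only lookup

class AddOnlyCounter:
    """Stand-in for an accumulator: workers may only add to it."""
    def __init__(self):
        self._total = 0

    def add(self, n):
        self._total += n

    @property
    def value(self):
        # Only the driver ("main node") is meant to read the total.
        return self._total

unknown_codes = AddOnlyCounter()

def process_slice(codes):
    """Work done on one node: translate codes, count the ones we cannot translate."""
    out = []
    for code in codes:
        if code in country_names:
            out.append(country_names[code])
        else:
            unknown_codes.add(1)
    return out

slices = [["us", "xx"], ["de", "us", "yy"]]
results = [process_slice(s) for s in slices]

print(results)              # [['United States'], ['Germany', 'United States']]
print(unknown_codes.value)  # 2
```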

### Functional programming in Python
* A lot of these concepts are already in Python, even though it is primarily an object-oriented language
    * But Python community tends to promote loops
    * **List comprehensions are more similar to functional programming**
* Functional tools in python
    * `map`: applies a function to each element in a list; returns another list of results; the SELECT of Python
    * `filter`: will only select the elements in a list that satisfy a given function; the WHERE of Python
    * `reduce`: the AGG of Python; aggregates; reduces the elements in a list into a single value or values by applying a function repeatedly to pairs of elements until you get only one value
    * `lambda`: writing functions, simplified
    * itertools
        * `chain`
    * `flatMap`: not built into Python; provided by Spark as an RDD method
        
### Map in Python
* Python supports the map operation, over any list
* We apply an operation to each element of a list, return a new list with the results
* **Note:** While in Python `map` is a function, in Spark, `map` is a method of the RDD object (call with `RDD.map()`).

```
a = [1, 2, 3]
def add1(x):
    return x + 1
```
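* `list(map(add1, a))` $\Rightarrow$ `[2, 3, 4]`
* `list(map(add1, [1, 2, 3]))` $\Rightarrow$ `[2, 3, 4]`
    * In Python 3, `map` returns a lazy iterator, so wrap it in `list()` to see the results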
* We usually do this with a for loop
* This (`map`) is a slightly different way of thinking
* **Important to note:** the original list here is never changed, rather a new list is created.
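
To compare the styles mentioned above, here is the same operation written as a for loop, with `map`, and as a list comprehension. The variable names `looped`, `mapped`, and `comprehended` are just labels for this sketch.

```python
a = [1, 2, 3]

def add1(x):
    return x + 1

# For-loop version: build the result list step by step
looped = []
for x in a:
    looped.append(add1(x))

# map version: map returns a lazy iterator in Python 3, so materialize it with list()
mapped = list(map(add1, a))

# List-comprehension version
comprehended = [add1(x) for x in a]

print(looped, mapped, comprehended)  # [2, 3, 4] [2, 3, 4] [2, 3, 4]
print(a)                             # [1, 2, 3]  (the original list is unchanged)
```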

### Filter
* Select only certain elements from a list
* Example:

```
a = [1, 2, 3, 4]
def isOdd(x):
    return x%2==1
```
* `list(filter(isOdd, a))` $\Rightarrow$ `[1, 3]`
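
A small sketch of `filter` next to its list-comprehension equivalent, plus a combined filter-then-map step in the spirit of the SELECT/WHERE analogy. The names `odds`, `odds_comp`, and `odd_squares` are illustrative only.

```python
a = [1, 2, 3, 4]

def isOdd(x):
    return x % 2 == 1

# filter is lazy in Python 3, so wrap it in list() to see the elements
odds = list(filter(isOdd, a))

# Equivalent list comprehension
odds_comp = [x for x in a if isOdd(x)]

# "SELECT x*x WHERE isOdd(x)": filter first, then map
odd_squares = list(map(lambda x: x * x, filter(isOdd, a)))

print(odds, odds_comp, odd_squares)  # [1, 3] [1, 3] [1, 9]
```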

### Reduce in Python
* Combines the elements of a list pairwise with a two-argument function until ONE value remains; returns that value, not a list
* In Spark, `reduce` is an action: it runs immediately and is not lazy
* Example:

```
a = [1, 2, 3, 4]
def add(x, y):
    return x + y
```
* `reduce(add, a)` $\Rightarrow$ `10`
    * `add(add(add(1, 2), 3), 4)` (Python's `functools.reduce` folds from the left)
* **Best suited to functions that are commutative and associative, so the order and grouping of the pairs do not matter**
    * Jobs in Spark work in parallel
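
Why this matters can be seen without Spark. The sketch below splits the list into two made-up "partitions" (`parts`), reduces each one separately, and then combines the partial results, which is roughly what a parallel reduce has to do. Addition survives the regrouping; subtraction does not.

```python
from functools import reduce

a = [1, 2, 3, 4]

def add(x, y):
    return x + y

def sub(x, y):
    return x - y

# Sequential left fold: ((1 + 2) + 3) + 4
total = reduce(add, a)

# Imitate two partitions reduced separately, then combined: (1 + 2) + (3 + 4)
parts = [a[:2], a[2:]]
total_parts = reduce(add, [reduce(add, p) for p in parts])

# Subtraction is not associative, so the grouping changes the answer
diff = reduce(sub, a)                                      # ((1 - 2) - 3) - 4
diff_parts = reduce(sub, [reduce(sub, p) for p in parts])  # (1 - 2) - (3 - 4)

print(total, total_parts)  # 10 10
print(diff, diff_parts)    # -8 0
```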
    
### Lambda
* When doing map/reduce/filter, we end up with many tiny functions
* Lambda allows us to define a function as a value, without giving it a name
* example: `lambda x: x + 1`
    * Can only have one expression
    * Do not write return
    * Parentheses around it are optional; the syntax usually does not require them
* `(lambda x: x + 1)(3)` $\Rightarrow$ `4`
* `list(map(lambda x: x + 1, [1, 2, 3]))` $\Rightarrow$ `[2, 3, 4]`

#### Exercises (1)
* `(lambda x: 2*x)(3)` $\Rightarrow$ **`6`**
* `list(map(lambda x: 2*x, [1, 2, 3]))` $\Rightarrow$ **`[2, 4, 6]`**
* `list(map(lambda t: t[0], [(1,2), (3,4), (5,6)]))` $\Rightarrow$ **`[1, 3, 5]`**
* `reduce(lambda x,y: x+y, [1,2,3])` $\Rightarrow$ **`6`**
* `reduce(lambda x,y: x+y, map(lambda t: t[0], [(1,2),(3,4),(5,6)]))` $\Rightarrow$ **`9`**

#### Exercises (2)
* Given: `a = [(1,2), (3,4), (5,6)]`

    * **(a)** Write an expression to get only the second elements of each tuple
    * **(b)** Write an expression to get the sum of the second elements
    * **(c)** Write an expression to get the sum of the odd first elements

* `list(map(lambda t: t[1], a))`
* `reduce(lambda x,y: x+y, map(lambda t: t[1], a))`
* `reduce(lambda x, y: x+y, filter(isOdd, map(lambda t: t[0], a)))`

In [1]:
from functools import reduce

In [3]:
a = [(1,2), (3,4), (5,6)]

In [7]:
list(map(lambda t: t[1], a))

[2, 4, 6]

In [8]:
reduce(lambda x,y: x+y, map(lambda t: t[1], a))

12

In [2]:
def isOdd(x):
    return x%2==1

In [4]:
reduce(lambda x, y: x+y, filter(isOdd, map(lambda t: t[0], a)))

9

### Flatmap
* Sometimes we end up with a list of lists, and we want a "flat" list
* Python doesn't actually have a flatmap function, but provides something similar with `itertools.chain`
* Many functional programming languages (and Spark) provide a function called flatMap, which flattens such a list
* For example:
    * `map(lambda t: range(t[0], t[1]), [(1,5),(7,10)])` # Returns a list of lists (in Python 3, a lazy iterator of `range` objects)
* `itertools.chain.from_iterable` flattens an iterable of iterables into a single flat sequence
    * And so enables us to define our own flatmap

In [6]:
from itertools import chain

In [10]:
list(chain.from_iterable(map(lambda t: range(t[0], t[1]), [(1, 5), (7,10)])))

[1, 2, 3, 4, 7, 8, 9]
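
One way to define our own flatmap in plain Python is sketched below. The helper name `flat_map` is made up for this example; it is not a standard-library function. Note that it relies on `chain.from_iterable`, because calling `chain(...)` with a single iterable of iterables does not flatten it.

```python
from itertools import chain

def flat_map(f, items):
    """Apply f to each item (f must return an iterable) and flatten one level."""
    return list(chain.from_iterable(map(f, items)))

pairs = [(1, 5), (7, 10)]
flat = flat_map(lambda t: range(t[0], t[1]), pairs)

# A typical use: split lines into words and get one flat list of words
words = flat_map(str.split, ["hello big data", "hello spark"])

print(flat)   # [1, 2, 3, 4, 7, 8, 9]
print(words)  # ['hello', 'big', 'data', 'hello', 'spark']
```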


### Creating RDDs in Spark
* All Spark commands operate on RDDs (think big distributed list)
* You can use `sc.parallelize` to go from list to RDD
* Later we will see how to read from files
* Many commands are lazy (they don't actually compute the results until you need them)
* In PySpark, `sc` represents your SparkContext

### Transformations vs Actions
* We divide RDD methods into two kinds:
    * Transformation
        * return another RDD
        * are not really performed until an action is called (lazy)
    * Actions
        * return a value other than an RDD
        * are performed immediately
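
The split between lazy transformations and eager actions can be imitated on one machine. `ToyRDD` below is a hypothetical stand-in written for these notes, not Spark's implementation: transformations only record a step, and an action replays the recorded steps. The `calls` list exists only so we can observe when the work really happens.

```python
from functools import reduce

class ToyRDD:
    """Single-machine stand-in for an RDD: records steps, runs them on demand."""

    def __init__(self, data, steps=()):
        self._data = data
        self._steps = steps  # recorded transformations, in order

    # Transformations: return a new ToyRDD and compute nothing yet
    def map(self, f):
        return ToyRDD(self._data, self._steps + (("map", f),))

    def filter(self, f):
        return ToyRDD(self._data, self._steps + (("filter", f),))

    def _run(self):
        # Replay the recorded steps over the data using the built-in map/filter
        items = iter(self._data)
        for kind, f in self._steps:
            items = map(f, items) if kind == "map" else filter(f, items)
        return items

    # Actions: trigger the computation and return a plain Python value
    def collect(self):
        return list(self._run())

    def reduce(self, f):
        return reduce(f, self._run())

calls = []

def noisy_double(x):
    calls.append(x)  # record each call so we can see when it runs
    return 2 * x

rdd = ToyRDD([1, 2, 3, 4]).map(noisy_double).filter(lambda x: x > 2)
calls_before_action = list(calls)  # still empty: nothing has run yet
result = rdd.collect()             # the action triggers the work

print(calls_before_action)  # []
print(result)               # [4, 6, 8]
```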
        
### Some RDD methods

#### Transformations
* `.map(f)`: returns a new RDD applying f to each element
* `.filter(f)`: returns a new RDD containing elements that satisfy f 
* `.flatMap(f)`: returns a new, 'flattened' RDD: applies f to each element and concatenates the results

#### Actions
* `.reduce(f)`: returns a value reducing RDD elements with f
* `.take(n)`: returns n items from the RDD
* `.collect()`: returns all elements as a list
* `.sum()`: sum of (numeric) elements of an RDD
    * `max`, `min`, `mean`...

credle.io as a resume builder tool: makes a great data science resume and is easy to use
* Write a summary that focuses on data science or analytics skills used in most recent project(s)
    * highlight these skills early and obviously 
* Always include name in document title (maybe with current date)
* Your resume should explicitly include only the exact items that will help you get a job interview.
* “I have experience building really fast and accurate machine-learning models in Python. I also understand big data technology like Hadoop.”
* “I have experience using stats and machine-learning to find useful insights in data. I also have experience presenting those insights with dashboards and automated reports, and I am good at public speaking.”
* “I am an experienced data scientist, I have a great math background, and I am good at explaining complicated stuff.”
* How proficient do I have to be before I put a skill or technology on my resume?
    * My general rule of thumb is that you should not put something on your resume unless you have actually used it. Just having read about it does not count. Generally, you don’t have to have used it in a massive-scale production environment, but you should have at least used it in a personal project.
* Which things should I emphasize?
    * To decide what to emphasize, you have two great sources of information. One is the job description itself. If the job description is all about R, you should obviously emphasize R. Another, more subtle, source is the collection of skills that current employees list on LinkedIn. If someone is part of your network or has a public profile, you can see their LinkedIn profile (if you can’t see their profile, it might be worth getting a free trial for LinkedIn premium). If all of the team members have 30 endorsements for Hive, then they probably use Hive at work. You should definitely list Hive if you know it.
* Which things should I not include?
    * Because your resume is there to tell a targeted story in order to get an interview, you really should not list any skills or technologies that do not fit with that story.
    


* Including general skills like HTML and CSS is probably good, but you probably do not need to list that you are an expert in Knockout.JS and elastiCSS. This advice is doubly true for non-technical skills like “customer service” or “phone direct sales.” Including things like that actually makes the rest of your resume look worse, because it emphasizes that you have been focused on a lot of things other than data science and, worse, that you do not really understand what the team is looking for.
* If you want to include something like that to add color to your resume, you should add it in the “Additional Info” section at the end of the resume, not in the “Skills and Technologies” section.

#### What if I have no working experience?
* If you have no working experience as a data scientist, then you have to figure out how to signal that you can do the job anyway. There are three main ways to do this: independent projects, education, and competence triggers.

#### Independent projects
If you don’t have any experience as a data scientist, then you absolutely have to do independent projects. Luckily, it is very easy to get started. The simplest way to get started is to do a Kaggle competition. Kaggle is a competition site for data science problems, and there are lots of great problems with clean datasets. I wrote a step-by-step tutorial for trying your first competition using R. I recommend working through a couple of Kaggle tutorials and posting your code on Github. Posting your code is extremely important. In fact, having a Github repository posted online is a powerful signal that you are a competent data scientist (it is a competence trigger, which we will discuss in a moment).

Kaggle is great, because steps 1 and 2 are completed for you. But a huge amount of data science is exactly those parts, so Kaggle can’t fully prepare you for a job as a data scientist. I will help you now with steps 1 and 2 by giving you a list of a few ideas for independent data science projects. I encourage you to steal these.
* 1) Use Latent Semantic Analysis to extract topics from tweets. Pull the data using the Twitter API.
* 2) Use a bag of words model to cluster the top questions on /r/AskReddit. Pull the data using the Reddit API.
* 3) Identify interesting traffic volume spikes for certain Wikipedia pages and correlate them to news events. Access and analyze the data by using AWS Open Datasets and Amazon Elastic MapReduce.
* 4) Find topic networks in Wikipedia by examining its link graph. Use another of the AWS Open Datasets.
* A few other project ideas in [How to Become a Data Hacker](https://will-stanton.com/becoming-an-effective-data-hacker/)

#### Education
Another way to prove your ability is through your educational background. If you have a Masters or a PhD in a relevant field, you should absolutely list relevant coursework and brag about your thesis. Make sure that you put your thesis work in the context of data science as much as possible. Be creative! If you really can’t think of any way that your thesis is relevant to data science, then you probably should not make a big deal out of it on your resume.

#### Competence triggers and social proof
* A Github page
* A Kaggle profile
* A StackExchange or Quora profile
* A technical blog

#### Resume rules of thumb
* Keep it to one side of one page: Most recruiters only look at a resume for a few seconds. They should be able to see that you are a good candidate immediately, without turning the page.
* Use simple formatting: Don’t do anything too fancy. It should not be hard to parse what your resume says.
* Use appropriate industry lingo, but otherwise keep it simple: Again, this goes to readability.
* Don’t use weird file types: PDF is good, but you should probably also attach a DOCX file. You basically should not use any other file formats, because your resume is useless if people can’t open it.

#### Libraries
* pandas**
* numpy**
* matplotlib**
* seaborn**
* plotly**
* scipy**
* scikit image**
* imageio**
* statsmodels**
* scikit learn**
* sqlalchemy**
* NetworkX**
* datetime**
* pandas_profiling**
* geopandas**
* pycaret**
* catboost**
* hyperopt**
* random**
* lightgbm**
* bayes_opt**
* pickle**
* (22)
***
* io
* pydotplus
* zipfile
* requests
* json
* ppscore
* shap
* pathlib
* os
* featuretools
* (10)