# Module 38 Topic Review: Big Data with PySpark
<img src="images/social_media.png" width=1000>

## What *is* big data?  
There is no clear definition or consensus on exactly how much data counts as ***big*** data. There are, however, some rules of thumb that point in the right direction. Some things to consider when determining whether you are dealing with "big data" or just a large traditional dataset:  
- Anything smaller than a terabyte is probably not big data
- Big data usually needs to be stored in a distributed database (i.e., the data cannot all fit on a single machine)
- Big data almost always needs to be computed on a distributed network (i.e., the computation is too expensive to take place on a single machine)

### 3 V's of Big Data

#### Volume:
Refers to how much data is generated.  
<img src="images/rank_users.png" width=1000>

#### Velocity:
Refers to how quickly data is generated.  
<img src="images/internet_minute.jpg" width=600>

#### Variety:
Refers to the range of data types and formats generated.  
<img src="images/unstructured_data.png" width=600>

## How to Handle Big Data 

### Distributed Processing  
The two most common ways of organizing computers into a distributed system are the client-server system and peer-to-peer system.

In the client-server architecture, nodes make requests to a central server. The server decides whether to accept or reject each request and sends its responses or further instructions back out to the client nodes.

Peer-to-peer systems allow nodes to communicate with one another directly without requiring approval from a server.

<img src='images/types_of_network.png' width=1000>
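
The difference between the two layouts can be sketched with a toy, in-memory model. This is only an illustration under stated assumptions: no real networking is involved, and the `Server` and `Peer` classes, the allow-list rule, and the node names are all hypothetical.

```python
# Toy, in-memory model of the two topologies; no real networking is involved.


class Server:
    """Central server that accepts or rejects client requests.

    The allow-list rule is a made-up stand-in for whatever policy a real server applies.
    """

    def __init__(self, allowed_clients):
        self.allowed_clients = set(allowed_clients)

    def handle(self, client_id, request):
        # The server alone decides whether a request goes through.
        if client_id in self.allowed_clients:
            return f"accepted: {request}"
        return f"rejected: {request}"


class Peer:
    """Node that can message any other peer directly."""

    def __init__(self, name):
        self.name = name
        self.inbox = []

    def send(self, other, message):
        # No central approval step: the message goes straight to the other peer.
        other.inbox.append((self.name, message))


# Client-server: every request passes through the central server.
server = Server(allowed_clients={"node_a"})
client_server_replies = [
    server.handle("node_a", "read block 1"),
    server.handle("node_b", "read block 1"),
]

# Peer-to-peer: nodes talk to each other without a gatekeeper.
peer_a, peer_b = Peer("peer_a"), Peer("peer_b")
peer_a.send(peer_b, "here is block 1")

print(client_server_replies)
print(peer_b.inbox)
```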

### Parallel Processing
When using a well-developed distributed system, multiple processors can accomplish tasks in a fraction of the time a single processor would need.  
<img src='images/parallel.png' width=800>
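
As a minimal sketch of the split, work, combine pattern behind parallel processing, the example below divides a list into chunks, hands each chunk to a worker, and sums the partial results. The helper names and the data are illustrative assumptions. Threads are used only to keep the sketch self-contained; in standard CPython, CPU-bound work like this would typically need separate processes or a cluster to see a real speed-up.

```python
from concurrent.futures import ThreadPoolExecutor


def sum_of_squares(chunk):
    """Work done by one worker on its share of the data."""
    return sum(x * x for x in chunk)


def split_into_chunks(data, n_chunks):
    """Split data into n_chunks roughly equal contiguous pieces."""
    size = -(-len(data) // n_chunks)  # ceiling division
    return [data[i:i + size] for i in range(0, len(data), size)]


data = list(range(1_000))
chunks = split_into_chunks(data, n_chunks=4)

# Each chunk is handed to a separate worker; partial results are combined at the end.
with ThreadPoolExecutor(max_workers=4) as pool:
    partial_sums = list(pool.map(sum_of_squares, chunks))

parallel_total = sum(partial_sums)
serial_total = sum_of_squares(data)

# The chunked computation gives the same answer as the single-worker one.
print(parallel_total == serial_total)
```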

#### Using the MapReduce Paradigm
Even with parallel and distributed processing, there are limits to how much data can be processed and how quickly. Some techniques are available to, in a sense, *massage* the data before the actual computation takes place.  
One of the techniques used for this purpose is the MapReduce paradigm.  

<img src='images/word_count.png' width=1000>
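
The word-count flow can be sketched in plain Python to show the map, shuffle, and reduce phases. This runs on a single machine and only mimics the stages; in a real MapReduce job each phase would be spread across many machines. The function names and the sample lines are assumptions made for illustration.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(mapped_pairs):
    """Shuffle: group all emitted values by key (the word)."""
    grouped = defaultdict(list)
    for word, count in mapped_pairs:
        grouped[word].append(count)
    return grouped


def reduce_phase(grouped):
    """Reduce: collapse each word's list of counts into one total."""
    return {word: sum(counts) for word, counts in grouped.items()}


# Hypothetical input; each line stands in for a split handled by one mapper.
lines = ["deer bear river", "car car river", "deer car bear"]

mapped = [pair for line in lines for pair in map_phase(line)]
word_counts = reduce_phase(shuffle_phase(mapped))
print(word_counts)
```

Because each line is mapped independently and each word is reduced independently, both phases can be distributed without the workers needing to coordinate beyond the shuffle step.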

## Using PySpark 
Base Python cannot handle big data on its own. However, with (Py)Spark you can write Python programs that access big data and use it for data exploration and machine learning tasks. 

<img src='images/spark_structure.png'>

When you are writing Spark code, your code is the "Driver Program" pictured here. Your code needs to instantiate a SparkContext in order to use Spark's unstructured (RDD) API.  
<img src='images/cluster-overview.png' width=800>
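
A minimal sketch of a driver program might look like the following. It assumes PySpark and a compatible Java runtime are installed and runs Spark locally rather than on a cluster. The function name `run_driver_demo`, the app name, and the data are illustrative choices.

```python
def run_driver_demo():
    """Driver-program sketch: start a local SparkContext, build an RDD, collect results."""
    # Imported inside the function so it can be defined even where PySpark is missing.
    from pyspark import SparkConf, SparkContext

    # Assumption: "local[*]" runs Spark on this machine using all available cores.
    # A real deployment would point at a cluster manager instead.
    conf = SparkConf().setMaster("local[*]").setAppName("driver_demo")
    sc = SparkContext.getOrCreate(conf=conf)

    try:
        # parallelize() distributes a local collection across partitions as an RDD.
        rdd = sc.parallelize(range(10), numSlices=4)

        # map() is lazy; collect() triggers the work and returns results to the driver.
        squares = rdd.map(lambda x: x * x).collect()
    finally:
        # Always release the resources held by the context.
        sc.stop()

    return squares
```

Calling `run_driver_demo()` on a machine with a working Spark installation should return the squared values as an ordinary Python list on the driver.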

## Recap

- Big Data usually refers to datasets that grow so large that they become awkward to work with using traditional database management systems and analytical approaches
- Big data typically refers to data that is terabytes (TB) to petabytes (PB) in size
- MapReduce can be used to split big datasets into smaller sets that are distributed over several machines for Big Data Analytics
- PySpark can be installed directly on your computer using conda or in a Docker container
- When you start working with PySpark, you have to create a SparkContext or SparkSession
- The creation of RDDs is essential when working with PySpark
- Examples of actions and transformations include collect(), count(), filter(), first(), take(), and reduce() (filter() is a transformation; the others are actions)
- Machine Learning on the scale of big data can be done with Spark using the ml library
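
The transformations and actions listed above can be tried out with a short sketch like this one. It assumes a local PySpark installation; the function name `run_rdd_actions_demo`, the app name, and the sample numbers are hypothetical.

```python
def run_rdd_actions_demo():
    """Apply one transformation (filter) and several actions to a small RDD."""
    # Imported inside the function so it can be defined even where PySpark is missing.
    from pyspark.sql import SparkSession

    # Assumption: a local session; the underlying SparkContext is exposed as an attribute.
    spark = (
        SparkSession.builder
        .master("local[*]")
        .appName("rdd_actions_demo")
        .getOrCreate()
    )
    sc = spark.sparkContext

    try:
        numbers = sc.parallelize(range(1, 11))

        # Transformation: lazy, only describes a new RDD of the even numbers.
        evens = numbers.filter(lambda x: x % 2 == 0)

        # Actions: each one triggers computation and returns a value to the driver.
        return {
            "count": evens.count(),
            "first": evens.first(),
            "take": evens.take(3),
            "reduce": evens.reduce(lambda a, b: a + b),
            "collect": evens.collect(),
        }
    finally:
        spark.stop()
```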