# Introducing PySpark


Python, pandas, SciPy, and scikit-learn are all great tools for exploring and learning from small to mid-size datasets. Like most programming languages, Python is primarily designed to work in a certain way: load data from local disk or from a database into memory, process that data, and store the results back to disk.

But what if we want to process data that’s too big for one computer? In the past, we mostly avoided this problem by sampling. We don’t necessarily need to train a model on a huge dataset, and a good statistical sample can often work just as well. That said, sampling has its problems. How do we know that the sample really reflects a diverse dataset? What if the most important features are in a subset of the data that the sample never sees?
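To make the sampling concern concrete, here is a small sketch in plain Python (no Spark needed). The dataset, the 0.1% rare-group rate, and the sample size are all invented for illustration.

```python
import random

random.seed(0)  # fixed seed so the run is repeatable

# Hypothetical dataset: 100,000 rows, of which only 100 (0.1%) are "rare".
population = ["rare"] * 100 + ["common"] * 99_900

# Draw a 500-row simple random sample without replacement.
sample = random.sample(population, 500)

rare_in_population = population.count("rare")
rare_in_sample = sample.count("rare")

print(f"rare rows in full data: {rare_in_population}")
print(f"rare rows in sample:    {rare_in_sample}")
```

On average, a sample of this size contains fewer than one rare row (500 × 0.001 = 0.5), so a model trained on it may never see that subset at all.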

Beyond that, there are certain types of models for which sampling simply does not work. Recommenders are a good example. They rely on comparing users with other users to evaluate their preferences. With only a subset of the data, there may not be a cohort of users who are similar enough to one another. Recommenders really need to run on the full dataset, or not at all.

So what is the answer? The answer has been obvious for a long time: split the problem up across multiple computers. That said, this is easier said than done. Many problems aren’t easily split up. Even when it is possible to partition the work, developers often have trouble writing parallel code and end up having to solve many of the complex issues around multiprocessing themselves.
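As a rough sketch of what "splitting the problem up" means, the plain-Python example below partitions a list, computes a partial result per chunk, and combines the partials. The helper names `partition` and `partial_sum` are hypothetical, and threads merely stand in for separate machines here.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(data, n_parts):
    """Split data into n_parts contiguous chunks of nearly equal size."""
    size, extra = divmod(len(data), n_parts)
    chunks, start = [], 0
    for i in range(n_parts):
        end = start + size + (1 if i < extra else 0)
        chunks.append(data[start:end])
        start = end
    return chunks


def partial_sum(chunk):
    """Work done independently on one chunk (one 'machine')."""
    return sum(chunk)


data = list(range(1, 1001))
chunks = partition(data, 4)

# Each worker processes its own chunk; no chunk needs data from another.
with ThreadPoolExecutor(max_workers=4) as pool:
    partials = list(pool.map(partial_sum, chunks))

# Combine the partial results into the final answer.
total = sum(partials)
print(total)  # 500500
```

A sum splits cleanly because the partial results combine with another sum. Problems where each record must be compared with many others do not partition this neatly, which is where the difficulty starts.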


### Running this notebook

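This notebook will not run in an ordinary Jupyter notebook server, because the code cells expect PySpark to have already created the Spark context `sc`.

Here is how we can launch PySpark so that it uses Jupyter notebooks as its front end.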

```bash
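  # Python interpreter that PySpark should use
  export PYSPARK_PYTHON=python3
  # Run the driver through Jupyter instead of the plain Python shell
  export PYSPARK_DRIVER_PYTHON="jupyter"
  export PYSPARK_DRIVER_PYTHON_OPTS="notebook"
  # Start PySpark (assumes Spark is installed under ~/spark)
  ~/spark/bin/pyspark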
```

The last command starts a Jupyter server much like the one you are used to. We can then open this notebook in it.

In [None]:
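# sc is the SparkContext that the pyspark launcher creates for us
lines = sc.textFile("README.md")
# Return the first five lines of the file as a Python list
lines.take(5)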

### Viewing the Spark UI

Congratulations! You just ran your first Spark job.

We can view the Spark UI at YOURMACHINE:4040. If you're running on a laptop or desktop, then localhost:4040 will probably work. Let's open up a browser window and go to that location. You should see something like this:

![Spark UI](img/pyspark_4040.png "Spark UI")



Now that we've loaded the README file, let's try a simple operation. We'll apply a filter that keeps only the lines containing the word "Spark".

In [None]:
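# Keep only lines containing "Spark" (the substring test is case-sensitive)
linesWithSpark = lines.filter(lambda line: "Spark" in line)
# count() is an action, so this is the step that actually runs the job
linesWithSpark.count()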
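For intuition, the same filter-then-count pattern can be mimicked on an ordinary Python list. This is only an analogy, not how Spark executes the job: the sample lines below are invented, `count_matching` is a hypothetical helper, and a real RDD is partitioned and evaluated lazily, so `filter` does no work until an action such as `count()` runs.

```python
# Invented stand-in for a few lines of a README file.
sample_lines = [
    "# Apache Spark",
    "",
    "Spark is a unified analytics engine.",
    "See https://spark.apache.org/ for documentation.",
    "Building requires Maven.",
]


def count_matching(lines, word):
    """Count the lines that contain word (case-sensitive substring test)."""
    # The generator is lazy, loosely like an RDD transformation.
    matching = (line for line in lines if word in line)
    # Consuming the generator is what triggers the work, like an action.
    return sum(1 for _ in matching)


capitalized = count_matching(sample_lines, "Spark")
lowercase = count_matching(sample_lines, "spark")
print(capitalized, lowercase)  # 2 1
```

The two counts differ because the substring test is case-sensitive, so the exact string passed to the filter matters.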