# Strategies for dealing with big data 
## Process your Python code with speed

## _By Jeff Hale_

Python is the most popular language for scientific and numerical computing, and pandas is the most popular Python package for doing data science.

Using pandas with Python allows you to handle much more data than you can with Excel, Sheets, or Numbers. Python and its data science libraries also have many advantages over SQL and its cloud variations when it comes to expressiveness and the ability to quickly do data analysis, statistics, and machine learning (including deep learning).

Unfortunately, if you are working locally, the amount of data that pandas can handle is limited by the amount of memory on your machine. And if you're working in the cloud, more memory costs more money. Regardless, we want our operations to happen quickly so we can GSD (Get Stuff Done)!

Don't prematurely optimize. If you can, stay in pandas. Don't worry about these issues if you aren't having problems and you don't expect your data to balloon. 

If you want to time things in a Jupyter notebook, you can use the `%time` or `%%timeit` magic commands. Or, in a script or notebook, `import time`, call `time.perf_counter()` before and after the code you care about, and find the difference. Note that different machines and software versions can cause variation, and caching will sometimes mess with results. Wall time and CPU time can also differ. As with all experimentation, hold everything you can constant and note what you can't hold constant.
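Here's a minimal sketch of script-style timing, assuming `time.perf_counter()` from the standard library as the timer. The `slow_sum` function is a hypothetical stand-in for whatever code you want to measure.

```python
import time


def slow_sum(n):
    """Hypothetical workload: sum the squares of the integers below n."""
    total = 0
    for i in range(n):
        total += i * i
    return total


start = time.perf_counter()            # high-resolution timer reading before the work
result = slow_sum(1_000_000)
elapsed = time.perf_counter() - start  # seconds of wall time that passed

print(f"slow_sum took {elapsed:.4f} seconds")
```

Run it a few times; the numbers will wobble, which is why `%%timeit` averages over many runs.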

## Things to always do
These are just good coding practices.
1. Avoid nested loops whenever possible. [Here's](https://stackabuse.com/big-o-notation-and-algorithm-analysis-with-python-examples/) a brief primer on Big-O notation and algorithm analysis. One for loop nested inside another for loop generally leads to quadratic time, O(n²). If you have more than a few items to search through, this begins to take a while. See the chart and nice explanation [here](https://skerritt.blog/big-o/).
1. Use list comprehensions (and dict comprehensions) whenever possible. Creating a list on demand is faster than loading the append attribute of the list and repeatedly calling it as a function - hat tip [here](https://stackoverflow.com/a/30245465/4590385). However, in general, don't sacrifice clarity for speed, so be careful with nesting list comprehensions.
1. In pandas, use built-in vectorized functions. The principle is really the same as the reason for list and dict comprehensions. Applying a function to a whole data structure at once is much faster than repeatedly calling a function.
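To make the comprehension tip concrete, here's a small sketch that builds the same list with a loop and with a comprehension. The variable names are just for illustration.

```python
numbers = range(10)

# Loop version: looks up and calls list.append on every iteration.
squares_loop = []
for n in numbers:
    squares_loop.append(n ** 2)

# List comprehension: builds the same list in a single expression.
squares_comp = [n ** 2 for n in numbers]

# Dict comprehension: the same idea for key/value pairs.
square_lookup = {n: n ** 2 for n in numbers}

print(squares_loop == squares_comp)
```

Both versions give the same result; the comprehension just skips the repeated `append` lookups and calls.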

If you find yourself reaching for `apply`, think about whether you really need to. It's looping over rows or columns. Vectorized methods are usually faster and less code, so they are a win-win. 🚀
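Here's a rough sketch of the difference, using a tiny made-up DataFrame with hypothetical `price` and `quantity` columns. On three rows you won't notice any speed gap, but the pattern is the same on millions.

```python
import pandas as pd

df = pd.DataFrame({"price": [10.0, 20.0, 30.0], "quantity": [1, 2, 3]})

# Row-wise apply: calls a Python function once per row.
total_apply = df.apply(lambda row: row["price"] * row["quantity"], axis=1)

# Vectorized: one operation over the whole columns at once.
total_vectorized = df["price"] * df["quantity"]

print(total_vectorized.tolist())
```

The vectorized line is shorter and does the arithmetic in compiled code instead of a Python-level loop.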

Know the other pandas Series and DataFrame methods that loop over your data: `applymap`, `iterrows`, `itertuples`.

When you need to swap values, using the `replace()` method on a DataFrame instead of any of those will save you lots of time.
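As a minimal sketch, assuming you want to swap short codes for readable labels, you can pass `replace()` a nested dict of `{column: {old: new}}`. The column names and values here are made up.

```python
import pandas as pd

df = pd.DataFrame({"size": ["S", "M", "L", "M"], "color": ["r", "g", "b", "r"]})

# Map old values to new ones, column by column, in a single call.
replacements = {
    "size": {"S": "small", "M": "medium", "L": "large"},
    "color": {"r": "red", "g": "green", "b": "blue"},
}
df_clean = df.replace(replacements)

print(df_clean)
```

No explicit loop over rows is needed, and the original `df` is left untouched.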

TK Real Python and Stack Overflow have discussions of the time implications of each of these methods. `itertuples` is usually fastest if you have to use one. Also, you can create your own vectorized functions.

Notice that these rules might not hold for very small amounts of data, but in that case, the stakes are low, so who cares. 😉

## Things to do with pretty big data (roughly millions of rows):
1. Use a subset of your data to explore, clean, and make a baseline model if you're doing machine learning. Solve 90% of your problems fast and save time and resources.
1. Load only the columns that you need with the [`usecols`](https://pandas.pydata.org/pandas-docs/stable/user_guide/scale.html#load-less-data) argument when reading in your DataFrame. Less data in = win!
1. Use dtypes efficiently. Downcast numeric columns to the smallest dtypes that make sense with `pandas.to_numeric()`. Convert columns with low cardinality (just a few values) to a categorical dtype. [Here's](https://pandas.pydata.org/pandas-docs/stable/user_guide/scale.html#use-efficient-datatypes) a pandas guide on efficient dtypes.
1. Parallelize model training in scikit-learn to use more processing cores whenever possible. By default, scikit-learn uses just one of your machine's cores. Many computers have 4 or more cores. You can use them all for parallelizable tasks by passing the argument `n_jobs=-1` when doing cross validation with GridSearchCV and many other classes.
1. Save pandas DataFrames in feather or pickle formats for faster reading and writing. Hat tip to Martin Skarzynski, who links to evidence and code [here](https://mobile.twitter.com/marskar/status/1296833212568735751).
1. Use [pd.eval](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.eval.html#pandas.eval) to speed up pandas operations. Pass the function your usual code as a string. It can do the operation much faster on large DataFrames. Here's a chart from tests with a 100-column DataFrame.
![img]('') Image from [this article](https://towardsdatascience.com/speed-up-your-numpy-and-pandas-with-numexpr-package-25bd1ab0836b) on the topic. [df.query](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.query.html#pandas.DataFrame.query) is the same deal, but it's a DataFrame method instead of a top-level pandas function.
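The `usecols` and dtype tips from the list above can be sketched together. This assumes a small, made-up CSV held in a string (in practice you'd pass a file path), and the column names are hypothetical.

```python
import io

import pandas as pd

# Hypothetical CSV with a text column we don't need.
csv_text = "id,city,temp,notes\n" + "\n".join(
    f"{i},{'Paris' if i % 2 else 'Tokyo'},{i % 40},unused text" for i in range(1000)
)

# Load only the columns we need.
df = pd.read_csv(io.StringIO(csv_text), usecols=["id", "city", "temp"])
before = df.memory_usage(deep=True).sum()

# Downcast integers to the smallest dtype that can hold the values.
df["id"] = pd.to_numeric(df["id"], downcast="unsigned")
df["temp"] = pd.to_numeric(df["temp"], downcast="integer")

# A low-cardinality string column becomes a categorical.
df["city"] = df["city"].astype("category")

after = df.memory_usage(deep=True).sum()
print(f"{before} bytes -> {after} bytes")
```

Check `df.dtypes` afterward to confirm the columns ended up with the types you expected.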

See the docs for `pd.eval` and `df.query` because there are some gotchas. ⚠️

pandas uses [numexpr](https://numexpr.readthedocs.io/projects/NumExpr3/en/latest/intro.html) under the hood when it's installed. Numexpr also works with NumPy. Hat tip to Chris Conlan and his book [Fast Python](https://chrisconlan.com/fast-python/). It's an excellent read for learning how to speed up your Python code.
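Here's a minimal sketch of `pd.eval` and `df.query` next to the usual pandas code, using a small random DataFrame with made-up column names. On data this small you shouldn't expect a speedup; the point is only to show the string syntax.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((1000, 3)), columns=["a", "b", "c"])

# The usual pandas code.
plain = df["a"] + df["b"] * df["c"]

# The same expression passed to pd.eval as a string.
evaluated = pd.eval("df.a + df.b * df.c")

# df.query filters rows with a string expression.
filtered = df.query("a > 0.5 and b < 0.5")
plain_filtered = df[(df["a"] > 0.5) & (df["b"] < 0.5)]

print(len(filtered), len(plain_filtered))
```

Time both versions on your own data before committing to the string form, since the payoff depends on the size of the DataFrame.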
   


## Things to do with really big data (roughly tens of millions of rows and up):
1. Use [numba](https://pandas.pydata.org/pandas-docs/stable/user_guide/enhancingperf.html#using-numba). It makes things super fast if you're doing mathematical calcs. To use it, install numba and import it. Then use the `@numba.jit` decorator when you need to loop over NumPy arrays and can't use vectorized methods. It works with NumPy arrays rather than pandas objects, so use `.to_numpy()` with pandas DataFrames and Series.
1. Use sparse arrays when it makes sense. Scikit-learn outputs sparse arrays automatically with some transformers, such as CountVectorizer. When your dataset is mostly 0s or missing values, you can convert it to sparse dtypes in pandas. Read more [here](https://pandas.pydata.org/pandas-docs/stable/user_guide/sparse.html).
1. Read data in chunks and write it out in chunks using [`dask`](https://dask.org/). Install and import it. It has a subset of the pandas API. Dask also has sister packages for scikit-learn and other data science libraries. It plays nicely with PyTorch, too.
1. GPUs aren't just for deep learning. Use PyTorch or TensorFlow with a GPU - as I showed in this article on [sorting]( ), you can get big speedups.
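For the sparse tip in the list above, here's a rough sketch that converts a mostly-zero Series to a pandas sparse dtype. The data is made up, and it assumes zeros are the values you want to leave unstored.

```python
import numpy as np
import pandas as pd

# Hypothetical mostly-zero data: 100 non-zero values out of 100,000.
values = np.zeros(100_000)
values[::1000] = 1.0
dense = pd.Series(values)

# Store only the non-zero entries; zeros become the implicit fill value.
sparse = dense.astype(pd.SparseDtype("float64", fill_value=0.0))

print(dense.memory_usage(), "bytes dense")
print(sparse.memory_usage(), "bytes sparse")
print(sparse.sparse.density)  # fraction of values actually stored
```

The denser your data gets, the smaller the savings, so check the density before converting.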


## To keep an eye on/experiment with if all else fails

You really don't want to use any of these if you don't need to. The APIs aren't complete/don't match and there are likely to be configuration issues. They can be great, but probably not if you're just working locally on a CPU.

1. Do you have access to lots of CPU cores? And currently (mid-2020), more than 32 columns? Maybe use Modin - it uses Apache Arrow (via Ray) or Dask under the hood. Modin's Dask support is experimental. Some things aren't fast - reading in data from NumPy arrays is really slow right now. Memory management was an issue in my tests.
1. You can use [jax](https://github.com/google/jax) in place of NumPy. Jax is an open source Google product that's bleeding edge. It uses autograd, XLA, JIT, a vectorizer, and a parallelizer. It looks like it will be simpler than using PyTorch or TF. It's good for deep learning. It uses the NumPy API. It's good on GPU, TPU, or CPU. Good stuff. There's no pandas version yet, but you could convert a DataFrame to TensorFlow and then use jax, or just convert to NumPy and then use jax. Intro [here](https://iaml.it/blog/jax-intro-english). PyTorch is best known as a deep learning library. It can do most of what jax can to boost speed. TensorFlow should also be able to do much of what jax can, but I haven't dug in.
1. NVIDIA's open source Rapids cuDF. It's likely using the Thrust library under the hood. Rapids may be using Arrow.
1. Do you have really big data for ETL? Use Rapids Dask-cuDF. It gives you the best of multiple machines and GPU power. 👍


The pandas docs have sections on [enhancing performance](https://pandas.pydata.org/pandas-docs/stable/user_guide/enhancingperf.html) and [scaling to large datasets](https://pandas.pydata.org/pandas-docs/stable/user_guide/scale.html). Some of this is adapted from that.



Notes:
- Avoid sneaky hidden loops, like `sum()` with a list. Hat tip: Fast Python
- make examples of eval/query
- make example of numba
- make example of PyTorch
- make examples of Dask

Maybe add: smaller chunks when reading in data - Dask does this automatically - https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html#iterating-through-files-chunk-by-chunk
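A minimal sketch of chunked reading in plain pandas, assuming a made-up single-column CSV held in a string (in practice you'd pass a file path). The `chunksize` argument makes `read_csv` yield DataFrames one piece at a time, so only one chunk is in memory at once.

```python
import io

import pandas as pd

# Hypothetical CSV with one numeric column and 10,000 rows.
csv_text = "value\n" + "\n".join(str(i) for i in range(10_000))

total = 0
n_rows = 0

# Each chunk is a regular DataFrame with at most 1,000 rows.
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=1_000):
    total += chunk["value"].sum()
    n_rows += len(chunk)

print(total / n_rows)  # mean computed without loading everything at once
```

This works for anything you can accumulate piece by piece, like sums and counts.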

Other considerations:



GitHub can only handle X size. 

There's a solution for big GitHub data. TK

SQL is fast and lots can be done to make faster.

Data versioning is good too!