# Profiling
## Lecture objectives

1. Demonstrate how to profile the performance of particular functions

In the previous lecture, we briefly mentioned profiling as a strategy to handle big data. There are numerous specialist software packages that can help with this, but the basics are straightforward to consider.

Profiling means measuring your code to identify its slowest parts, so that you know where to focus your efforts to speed it up. Sometimes the bottleneck will be obvious. In other cases, you need to time the alternatives to find out.
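As a rough illustration of "identifying the slowest parts", here is a minimal sketch using Python's built-in `cProfile` module, which records how much time is spent inside each function. The data and the function names (`deduplicate`, `count_stopwords`, `pipeline`) are hypothetical and made up for this illustration.

```python
import cProfile
import io
import pstats

# Hypothetical stand-ins: a small "stopword" list and a long list of tokens
STOP_LIST = [f"stop{i}" for i in range(300)]
TOKENS = [f"stop{i}" if i < 10 else f"word{i}" for i in range(20_000)]


def deduplicate(tokens):
    # cheap step: drop duplicate tokens while keeping their order
    return list(dict.fromkeys(tokens))


def count_stopwords(tokens):
    # expensive step: each membership test scans the whole list
    count = 0
    for token in tokens:
        if token in STOP_LIST:
            count += 1
    return count


def pipeline():
    return count_stopwords(deduplicate(TOKENS))


# run the pipeline under the profiler and keep its return value
profiler = cProfile.Profile()
result = profiler.runcall(pipeline)

# collect the per-function report as text, sorted by cumulative time
buffer = io.StringIO()
pstats.Stats(profiler, stream=buffer).sort_stats("cumulative").print_stats()
report = buffer.getvalue()
print(report)
```

The report lists each function with its call count and time, which points you to the function worth optimizing first.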

For example, imagine you have a long list of words stored in a `DataFrame`, and you want to flag the stopwords (very common words such as "the" or "and"). One approach is to loop over each row of the `DataFrame` using `iterrows()`. Let's try this using a word list kindly hosted by Prof. Eric Price at MIT. [Here are the details.](https://stackoverflow.com/questions/18834636/random-word-generator-python)

In [None]:
import pandas as pd
import requests

word_site = "https://www.mit.edu/~ecprice/wordlist.10000"
response = requests.get(word_site)
# use .text (str) rather than .content (bytes), so the words can match NLTK's string stopwords
wordDf = pd.DataFrame(response.text.splitlines(), columns=['word'])

print(len(wordDf))
wordDf.head()

Let's add a column to indicate whether each word is a stopword.

First, we loop over each row of the dataframe and apply a function to that row. Our function sets the `is_stopword` column to `True` if the word is a stopword, and to `False` otherwise. (This is a really inefficient way of doing things!)

Then, we use the `%timeit` magic function to see how long that line takes to run. (Magic functions are special IPython/Jupyter commands prefixed with `%` that help you run or analyze your code. `%timeit` runs a statement repeatedly and reports the average time per loop.)

In [None]:
import nltk
from nltk.corpus import stopwords

# download the stopword list (only needed the first time)
nltk.download('stopwords')

# initialize the column
wordDf['is_stopword'] = None

# write the function
def exclude_stopwords(wdf):
    for idx, row in wdf.iterrows():
        if row['word'] in stopwords.words('english'):
            wdf.loc[idx, 'is_stopword'] = True
        else:
            wdf.loc[idx, 'is_stopword'] = False
    return wdf
           
# to use %timeit, put it in front of the statement you want to time
%timeit newdf = exclude_stopwords(wordDf)

That took more than 1 second per loop on my computer. That's not a big deal here, but it might matter if you are dealing with tens of thousands of long documents.
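`%timeit` only works inside IPython or Jupyter. In a plain Python script, the standard-library `timeit` module gives a similar measurement. The sketch below is illustrative only: `stop_list`, `words`, and `flag_stopwords` are hypothetical stand-ins, not the NLTK list or the `DataFrame` used above.

```python
import timeit

# Hypothetical stand-ins for the stopword list and the word list
stop_list = ["the", "a", "an", "and", "of", "to", "in"]
words = ["profiling", "the", "code", "of", "a", "lecture"] * 1000


def flag_stopwords(word_list):
    # return one boolean per word
    return [w in stop_list for w in word_list]


# 5 repeats, each calling the function 20 times; returns 5 total times in seconds
timings = timeit.repeat(lambda: flag_stopwords(words), number=20, repeat=5)

# the minimum is usually the least noisy estimate; divide by 20 for time per call
best_per_call = min(timings) / 20
print(f"best of 5: {best_per_call * 1e3:.3f} ms per call")
```

Taking the best of several repeats reduces the influence of whatever else your computer happens to be doing at the time.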

What about using `apply` with a `lambda` function? For me, it takes just over half the time.

In [None]:
%timeit wordDf['is_stopword'] = wordDf.word.apply(lambda x: x in stopwords.words('english'))

What else could we do to speed this up? Perhaps calling `stopwords.words('english')` for every word incurs some overhead. What if we call it once and store the result in a separate variable?

Here, we use `%%timeit` to time the whole cell rather than a single line.

In [None]:
%%timeit
swords = stopwords.words('english')
wordDf['is_stopword'] = wordDf.word.apply(lambda x: x in swords)

Much faster! The stopword list is now retrieved once per run, rather than once per word.
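One further idea you could time in the same way: checking membership in a `list` scans its entries one by one, whereas a `set` uses a hash lookup. The sketch below compares the two on hypothetical data (the names `stop_list`, `stop_set`, and `words` are made up for illustration). Timings will vary by machine, so treat it as an experiment to run rather than a guaranteed result.

```python
import timeit

# Hypothetical data: a 200-entry "stopword" list and 10,000 words to check
stop_list = [f"stop{i}" for i in range(200)]
stop_set = set(stop_list)
words = [f"stop{i}" if i % 50 == 0 else f"word{i}" for i in range(10_000)]


def flag_with_list(word_list):
    # membership test scans the list entry by entry
    return [w in stop_list for w in word_list]


def flag_with_set(word_list):
    # membership test is a hash lookup
    return [w in stop_set for w in word_list]


# best of 3 repeats, each running the function 3 times
list_time = min(timeit.repeat(lambda: flag_with_list(words), number=3, repeat=3))
set_time = min(timeit.repeat(lambda: flag_with_set(words), number=3, repeat=3))
print(f"list: {list_time:.4f} s, set: {set_time:.4f} s")
```

In the notebook, the equivalent experiment would be to wrap the stored stopwords in `set(...)` before the `apply` call, or to try the vectorized pandas method `Series.isin`, and compare the `%%timeit` results.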

<div class="alert alert-block alert-info">
<h3>Key Takeaways</h3>
<ul>
  <li>The %timeit magic function is a simple way to assess how long a piece of code takes to run.</li>
</ul>
</div>