# Tips for Writing Efficient Python

Picture this - you just spent an hour putting together some code to clean a data set with a million records, only to find out it takes 3 hours to run. You're just excited it works, so you switch to other tasks while your code runs. Sometimes a quick and dirty solution is all you need for ad-hoc tasks; however, refactoring is often needed when moving to production environments. This notebook covers tips for speeding up your Python code. We'll also walk through a few examples of code that originally took hours/days to run and were reduced to minutes/seconds after refactoring.

## Downloading PyCharm

We will use PyCharm as the primary integrated development environment (IDE) for this tutorial, but feel free to use your preferred IDE. To install PyCharm, select the Community Edition from [this link](https://www.jetbrains.com/pycharm/download/#section=mac) (it's free!).

## TL;DR: Don't Optimize Before You Need To!

Your code has to run before you know it's slow :). But even if it's "slow", that doesn't necessarily mean you need to jump into optimization. However, there are some good practices you should always follow, such as moving values that don't change out of a for loop.
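
As a minimal sketch of that last point, the two functions below return the same result, but the second builds a value that never changes once, before the loop, instead of on every pass. The function names and sample titles are made up for illustration.

```python
import string


def count_punctuation_slow(titles):
    """Rebuilds the same set on every iteration of the loop."""
    total = 0
    for title in titles:
        punctuation = set(string.punctuation)  # never changes, yet rebuilt each time
        total += sum(1 for char in title if char in punctuation)
    return total


def count_punctuation_fast(titles):
    """Builds the set once, before the loop starts."""
    punctuation = set(string.punctuation)
    total = 0
    for title in titles:
        total += sum(1 for char in title if char in punctuation)
    return total


# Hypothetical sample titles
titles = ["UK report says Linux is 'viable'", "Business digest"]
print(count_punctuation_slow(titles), count_punctuation_fast(titles))
```

On two titles the difference is negligible; the habit pays off when the loop runs millions of times or the repeated value is expensive to build.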

1. Write code that runs
2. Make sure you get the output you expect
3. Check to see if current speed is acceptable
4. Optimize if needed
5. Repeat 2-4
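
For step 3, one option is the standard library's `timeit` module, which runs a snippet several times so that a single noisy measurement is less likely to mislead you. The workload below is a hypothetical stand-in, not part of this notebook's pipeline.

```python
import timeit


def title_lengths(titles):
    """Hypothetical workload: character count of each title."""
    return [len(title) for title in titles]


titles = ["Business digest"] * 5000

# Call the function 10 times per measurement and take 5 measurements
timings = timeit.repeat(lambda: title_lengths(titles), number=10, repeat=5)

# The minimum is usually the least noisy estimate; divide by `number` for per-call time
best_per_call = min(timings) / 10
print(f"best of 5: {best_per_call:.6f} seconds per call")
```

The cells in this notebook use a single `time.time()` difference, which is fine for spotting large gaps but can fluctuate from run to run for operations that take only a few milliseconds.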

For more Python performance tips, check out the [Python wiki](https://wiki.python.org/moin/PythonSpeed/PerformanceTips). Tools and libraries such as Hadoop and vaex can also help when working with big data.

## When Should I Optimize?

Below are a few examples of when you might want to optimize your code.

- Working with big data
- Creating a reusable component
- Deploying to the cloud where compute resources aren't free
- Improving user experience
- Quickly iterating for a proof of concept
- Ethical concerns such as carbon footprints of data centers

## About the Data

We will use the [AG News Data set](https://www.kaggle.com/amananandrai/ag-news-classification-dataset) for the following examples. This data set contains titles, descriptions, and categories for over one million news articles. We will only be using a sample of this data.

In [1]:
import time
import pandas as pd

# Load data and sample 5,000 records
df = pd.read_csv("data/train.csv")
df = df.sample(n=5000, random_state=50)
df.head()

Unnamed: 0,Class Index,Title,Description
110913,4,HOW IT WORKS Giving Gamers Another Window on T...,THE newest piece of hardware in the video game...
75136,3,UK report says Linux is 'viable',A UK government study finds the open-source Li...
48815,4,Sony Corp abandons copy-control music CDs,Sony Corp #39;s music unit is abandoning its C...
17853,4,Botswana Donates 500 Elephants to Mozambique (...,"Reuters - Botswana, which has the\largest elep..."
61101,3,Business digest,NEW YORK - Investors sent stocks falling sharp...


## Looping over a DataFrame

One of the great things about Python, or any programming language really, is that there is no single "right" way to solve a problem - two people can come up with entirely different solutions to the same problem. Let's walk through an example of calculating the number of characters in the Title of each news article using the following methods to see how each method affects performance.

1. for loop
2. iterrows()
3. itertuples()
4. list comprehension
5. apply()
6. vectorization with Pandas series

## For Loop

For loops are used across many languages and are one of the basic ways to iterate over data.

In [2]:
start_time = time.time()

length_list = []
for i in range(0, len(df)):
    length = len(df.iloc[i]["Title"])
    length_list.append(length)
    
df["length"] = length_list

elapsed_time = time.time() - start_time
print(elapsed_time)

df.head()

0.23601508140563965


Unnamed: 0,Class Index,Title,Description,length
110913,4,HOW IT WORKS Giving Gamers Another Window on T...,THE newest piece of hardware in the video game...,56
75136,3,UK report says Linux is 'viable',A UK government study finds the open-source Li...,32
48815,4,Sony Corp abandons copy-control music CDs,Sony Corp #39;s music unit is abandoning its C...,41
17853,4,Botswana Donates 500 Elephants to Mozambique (...,"Reuters - Botswana, which has the\largest elep...",54
61101,3,Business digest,NEW YORK - Investors sent stocks falling sharp...,15


## Iterrows()

This function is specific to Pandas DataFrames and allows you to iterate over DataFrame rows as (Index, Series) pairs.

In [3]:
start_time = time.time()

length_list = []
for index, row in df.iterrows():
    length = len(row["Title"])
    length_list.append(length)
    
df["length"] = length_list

elapsed_time = time.time() - start_time
print(elapsed_time)

df.head()

0.1403789520263672


Unnamed: 0,Class Index,Title,Description,length
110913,4,HOW IT WORKS Giving Gamers Another Window on T...,THE newest piece of hardware in the video game...,56
75136,3,UK report says Linux is 'viable',A UK government study finds the open-source Li...,32
48815,4,Sony Corp abandons copy-control music CDs,Sony Corp #39;s music unit is abandoning its C...,41
17853,4,Botswana Donates 500 Elephants to Mozambique (...,"Reuters - Botswana, which has the\largest elep...",54
61101,3,Business digest,NEW YORK - Investors sent stocks falling sharp...,15


## Itertuples()

This function is specific to Pandas DataFrames and allows you to iterate over DataFrame rows as namedtuples.

In [4]:
start_time = time.time()

length_list = []
for row in df.itertuples():
    length = len(row.Title)
    length_list.append(length)
    
df["length"] = length_list

elapsed_time = time.time() - start_time
print(elapsed_time)

df.head()

0.008417129516601562


Unnamed: 0,Class Index,Title,Description,length
110913,4,HOW IT WORKS Giving Gamers Another Window on T...,THE newest piece of hardware in the video game...,56
75136,3,UK report says Linux is 'viable',A UK government study finds the open-source Li...,32
48815,4,Sony Corp abandons copy-control music CDs,Sony Corp #39;s music unit is abandoning its C...,41
17853,4,Botswana Donates 500 Elephants to Mozambique (...,"Reuters - Botswana, which has the\largest elep...",54
61101,3,Business digest,NEW YORK - Investors sent stocks falling sharp...,15
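
To make "rows as namedtuples" concrete, here is a minimal sketch using only the standard library. The `Pandas` tuple name and the leading `Index` field mirror what `itertuples()` produces by default, and the row values are copied from the sample output above. Column names that are not valid Python identifiers, such as `Class Index`, are replaced with positional names, which `namedtuple(..., rename=True)` reproduces here.

```python
from collections import namedtuple

# Field names as they appear in the DataFrame: the index plus three columns
columns = ["Index", "Class Index", "Title", "Description"]

# rename=True swaps invalid identifiers for positional names such as "_1"
Row = namedtuple("Pandas", columns, rename=True)

row = Row(
    75136,
    3,
    "UK report says Linux is 'viable'",
    "A UK government study finds the open-source Li...",
)

print(Row._fields)
print(row.Title, len(row.Title))  # attribute access, as in the loop above
print(row._1)                     # "Class Index" is reachable by its positional name
```

Building a lightweight tuple per row is cheaper than building a full Series per row, which is one plausible reason for the gap between the `iterrows()` and `itertuples()` timings.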


## List Comprehension

A list comprehension offers a shorter syntax for creating a new list based on the values of an existing iterable, such as a list or a Pandas Series.

In [5]:
start_time = time.time()

length_list = [len(x) for x in df["Title"]]
    
df["length"] = length_list

elapsed_time = time.time() - start_time
print(elapsed_time)

df.head()

0.004917144775390625


Unnamed: 0,Class Index,Title,Description,length
110913,4,HOW IT WORKS Giving Gamers Another Window on T...,THE newest piece of hardware in the video game...,56
75136,3,UK report says Linux is 'viable',A UK government study finds the open-source Li...,32
48815,4,Sony Corp abandons copy-control music CDs,Sony Corp #39;s music unit is abandoning its C...,41
17853,4,Botswana Donates 500 Elephants to Mozambique (...,"Reuters - Botswana, which has the\largest elep...",54
61101,3,Business digest,NEW YORK - Investors sent stocks falling sharp...,15
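
As a small illustration, a list comprehension and an explicit loop with `append()` build the same list. The titles are taken from the sample output; the variable names are only for this sketch.

```python
titles = ["Business digest", "Sony Corp abandons copy-control music CDs"]

# Explicit loop: create an empty list, then append one item at a time
lengths_loop = []
for title in titles:
    lengths_loop.append(len(title))

# List comprehension: the same result in a single expression
lengths_comprehension = [len(title) for title in titles]

# A condition can be added to filter items while building the list
long_titles = [title for title in titles if len(title) > 20]

print(lengths_loop == lengths_comprehension)
```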


## Apply()

This function is available on both Pandas DataFrames and Series. On a DataFrame it applies a function along an axis; on a Series, as in the cell below, it applies the function to each value. Apply is often used with lambdas to create inline functions.

In [6]:
start_time = time.time()
    
df["length"] = df["Title"].apply(lambda x: len(x))

elapsed_time = time.time() - start_time
print(elapsed_time)

df.head()

0.0039517879486083984


Unnamed: 0,Class Index,Title,Description,length
110913,4,HOW IT WORKS Giving Gamers Another Window on T...,THE newest piece of hardware in the video game...,56
75136,3,UK report says Linux is 'viable',A UK government study finds the open-source Li...,32
48815,4,Sony Corp abandons copy-control music CDs,Sony Corp #39;s music unit is abandoning its C...,41
17853,4,Botswana Donates 500 Elephants to Mozambique (...,"Reuters - Botswana, which has the\largest elep...",54
61101,3,Business digest,NEW YORK - Investors sent stocks falling sharp...,15


## Vectorization with Pandas Series

Vectorization allows you to execute operations on entire arrays instead of on each individual item. This moves the loop out of your own Python code and into the library.

In [7]:
start_time = time.time()
    
df["length"] = df["Title"].str.len()

elapsed_time = time.time() - start_time
print(elapsed_time)

df.head()

0.0029397010803222656


Unnamed: 0,Class Index,Title,Description,length
110913,4,HOW IT WORKS Giving Gamers Another Window on T...,THE newest piece of hardware in the video game...,56
75136,3,UK report says Linux is 'viable',A UK government study finds the open-source Li...,32
48815,4,Sony Corp abandons copy-control music CDs,Sony Corp #39;s music unit is abandoning its C...,41
17853,4,Botswana Donates 500 Elephants to Mozambique (...,"Reuters - Botswana, which has the\largest elep...",54
61101,3,Business digest,NEW YORK - Investors sent stocks falling sharp...,15


Now, let's dive into a couple of examples to put it all together.

## Example 1 - Original Code

For this example, we want to clean the Title of our news article by:

1. Lowercasing the text
2. Replacing certain tokens from a dictionary
3. Removing numbers from the text
4. Removing punctuation from the text

In [8]:
dictionary_df = pd.read_csv("data/dictionary.csv")
dictionary_df.head()

Unnamed: 0,from,to
0,uk,united kingdom
1,it,information technology
2,nba,national basketball association
3,ibm,international business machines
4,us,united states


In [9]:
import re
import string
import time
import warnings
import pandas as pd

warnings.filterwarnings("ignore")

def dict_replace(title):
    dictionary_df = pd.read_csv("data/dictionary.csv")
    clean_tokens = []
    tokens = title.split()
    for old_token in tokens:
        new_token = dictionary_df[dictionary_df["from"] == old_token]
        if new_token.empty:
            clean_tokens.append(old_token)
        else:
            clean_tokens.append(new_token["to"].tolist()[0])
            
    title = " ".join(clean_tokens)
    return title

def preprocess_title(df):
    # Lowercase
    df["title_clean"] = df["Title"].str.lower()
    
    # Replace digits with a space (regex=True is required on recent pandas versions)
    df["title_clean"] = df["title_clean"].str.replace(r"\d", " ", regex=True)
    
    # Remove punctuation
    df["title_clean"] = df["title_clean"].str.replace(".", " ")
    df["title_clean"] = df["title_clean"].str.replace("#", " ")
    df["title_clean"] = df["title_clean"].str.replace(";", " ")
    df["title_clean"] = df["title_clean"].str.replace("'", " ")
    df["title_clean"] = df["title_clean"].str.replace("-", " ")
    df["title_clean"] = df["title_clean"].str.replace("/", " ")
    df["title_clean"] = df["title_clean"].str.replace("$", " ")
    df["title_clean"] = df["title_clean"].str.replace("(", " ")
    df["title_clean"] = df["title_clean"].str.replace(")", " ")
    df["title_clean"] = df["title_clean"].str.replace("\\", " ")
    df["title_clean"] = df["title_clean"].str.replace("?", " ")
    df["title_clean"] = df["title_clean"].str.replace(":", " ")
    df["title_clean"] = df["title_clean"].str.replace(",", " ")
    # Collapse runs of whitespace into a single space
    df["title_clean"] = df["title_clean"].str.replace(r"\s+", " ", regex=True)
    
    df["title_clean"] = df["title_clean"].apply(lambda x: dict_replace(x))

In [10]:
start_time = time.time()

preprocess_title(df)

elapsed_time = time.time() - start_time
print(elapsed_time)

df[["Title", "title_clean"]].head()

10.512485980987549


Unnamed: 0,Title,title_clean
110913,HOW IT WORKS Giving Gamers Another Window on T...,how information technology works giving gamers...
75136,UK report says Linux is 'viable',united kingdom report says linux is viable
48815,Sony Corp abandons copy-control music CDs,sony corp abandons copy control music cds
17853,Botswana Donates 500 Elephants to Mozambique (...,botswana donates elephants to mozambique reuters
61101,Business digest,business digest


## Example 1 - Refactor

The refactored version below reads the dictionary file once and converts it to a plain Python dict, so each token lookup is a hash lookup with `dict.get()` rather than a DataFrame filter. It also replaces the chain of `str.replace()` calls with a single `str.translate()` pass. Note that `string.punctuation` covers all ASCII punctuation, which is a superset of the characters removed in the original code.

In [11]:
def dict_replace_v2(dictionary, title):
    return " ".join([dictionary.get(word, word) for word in title.split()])

def preprocess_title_v2(title):
    # Lowercase
    clean_title = title.lower()
    
    # Replace digits with space
    clean_title = re.sub(r"\d", " ", clean_title)
    
    # Remove all punctuation
    clean_title = clean_title.translate(
        str.maketrans(string.punctuation, " " * len(string.punctuation)))
    
    return " ".join(clean_title.split())

In [12]:
start_time = time.time()

dictionary_df = pd.read_csv("data/dictionary.csv")
lookup_dict = pd.Series(dictionary_df["to"].values, index=dictionary_df["from"]).to_dict()

df["title_clean_v2"] = df["Title"].apply(lambda x: preprocess_title_v2(x))
df["title_clean_v2"] = df["title_clean_v2"].apply(lambda x: dict_replace_v2(lookup_dict, x))

elapsed_time = time.time() - start_time
print(elapsed_time)

df[["Title", "title_clean", "title_clean_v2"]].head()

0.032651662826538086


Unnamed: 0,Title,title_clean,title_clean_v2
110913,HOW IT WORKS Giving Gamers Another Window on T...,how information technology works giving gamers...,how information technology works giving gamers...
75136,UK report says Linux is 'viable',united kingdom report says linux is viable,united kingdom report says linux is viable
48815,Sony Corp abandons copy-control music CDs,sony corp abandons copy control music cds,sony corp abandons copy control music cds
17853,Botswana Donates 500 Elephants to Mozambique (...,botswana donates elephants to mozambique reuters,botswana donates elephants to mozambique reuters
61101,Business digest,business digest,business digest
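
To see the two core techniques in isolation, here is a minimal, pandas-free sketch. The tiny `lookup` dictionary is a hypothetical stand-in for `data/dictionary.csv` (its entries are taken from the rows shown earlier), and `clean_title` is an illustrative name rather than part of the notebook.

```python
import re
import string

# Hypothetical stand-in for the dictionary file
lookup = {"uk": "united kingdom", "it": "information technology"}

# Build the translation table once; it maps every punctuation character to a space
PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def clean_title(title, lookup):
    """Lowercase, drop digits and punctuation, then expand known tokens."""
    text = title.lower()
    text = re.sub(r"\d", " ", text)
    text = text.translate(PUNCT_TO_SPACE)
    # split() with no arguments also collapses runs of whitespace
    return " ".join(lookup.get(token, token) for token in text.split())


print(clean_title("UK report says Linux is 'viable'", lookup))
```

`lookup.get(token, token)` returns the replacement when the token is a key and the token itself otherwise, so no branching or DataFrame filtering is needed per token.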


## Example 2 - Original Code

In this example we want to count the number of tokens in the Title of the news article. In order to be considered a token, the following must be true.

1. Lemmatized token > 1 character
2. Lemmatized token not a stop word
3. Lemmatized token is in the NLTK vocabulary

Check out my course [Natural Language Processing with Python](https://www.skillshare.com/classes/Natural-Language-Processing-with-Python) to learn more about these common text preprocessing steps.

In [13]:
import time

import nltk
import pandas as pd
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

nltk.download('wordnet')
nltk.download('stopwords')
nltk.download('words')

def count_tokens_in_nltk_vocab(text):
    # Parameter renamed from `string` so it no longer shadows the string module
    wnl = WordNetLemmatizer()
    
    counter = 0
    for token in text.split():
        token = wnl.lemmatize(token)
        # Slow on purpose: both sets are rebuilt inside the loop
        if len(token) > 1 and token not in set(stopwords.words("english")):
            if token in set(nltk.corpus.words.words()):
                counter += 1
    return counter

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\31628\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\31628\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\stopwords.zip.
[nltk_data] Downloading package words to
[nltk_data]     C:\Users\31628\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\words.zip.


In [14]:
# Delete dataframe from earlier and sample 50 records this time
del df

df = pd.read_csv("data/train.csv")
df = df.sample(n=50, random_state=50)

In [15]:
# Apply preprocessing from first example
df["title_clean"] = df["Title"].apply(lambda x: preprocess_title_v2(x))
df["title_clean"] = df["title_clean"].apply(lambda x: dict_replace_v2(lookup_dict, x))

start_time = time.time()

df["token_count"] = df["title_clean"].apply(lambda x: count_tokens_in_nltk_vocab(x))

elapsed_time = time.time() - start_time
print(elapsed_time)
df[["title_clean", "token_count"]].head()

27.24153232574463


Unnamed: 0,title_clean,token_count
110913,how information technology works giving gamers...,7
75136,united kingdom report says linux is viable,5
48815,sony corp abandons copy control music cds,5
17853,botswana donates elephants to mozambique reuters,2
61101,business digest,2


## Example 2 - Refactor

The refactor creates the lemmatizer, the stop word set, and the vocabulary set once, before the `apply()` call. In the original, `stopwords.words("english")` and `nltk.corpus.words.words()` were converted to sets again inside the token loop, which likely accounts for most of the difference in runtime.

In [16]:
# Apply preprocessing from first example
df["title_clean"] = df["Title"].apply(lambda x: preprocess_title_v2(x))
df["title_clean"] = df["title_clean"].apply(lambda x: dict_replace_v2(lookup_dict, x))

start_time = time.time()

wnl = WordNetLemmatizer()
stop_words = set(stopwords.words("english"))
nltk_vocab = set(nltk.corpus.words.words())

df["token_count_v2"] = df["title_clean"].apply(
    lambda x: len(
        [
            token
            for token in x.split()
            if (len(wnl.lemmatize(token)) > 1)
               and wnl.lemmatize(token) not in stop_words
               and wnl.lemmatize(token) in nltk_vocab
        ]
    )
)

elapsed_time = time.time() - start_time
print(elapsed_time)
df[["title_clean", "token_count", "token_count_v2"]].head()

0.09496784210205078


Unnamed: 0,title_clean,token_count,token_count_v2
110913,how information technology works giving gamers...,7,7
75136,united kingdom report says linux is viable,5,5
48815,sony corp abandons copy control music cds,5,5
17853,botswana donates elephants to mozambique reuters,2,2
61101,business digest,2,2
