# How to iterate over rows in a DataFrame

This notebook demonstrates how to iterate over the rows in a DataFrame, taking data sets of different sizes into account and using vectorization to achieve large speed gains on large DataFrames.

### Goal
The goal is to compute the average number of _attempts_ needed to solve the popular [Wordle puzzle game](https://www.nytimes.com/games/wordle/index.html) by examining Wordle scores posted to Twitter.

The first data set we work with is small, with approximately 17K rows.

In [1]:
import pandas as pd

df_tweets = pd.read_csv('tweetsSmall.csv', delimiter=',', header='infer')

The "tweet_text" column in the DataFrame contains the text posted to Twitter. We need to pull out the number of attempts out of six by parsing the text in each row.

![](images/wordle_score.png)
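
Before looping over the whole DataFrame, it may help to see the parsing step on a single string. The sketch below uses a made-up sample tweet (an assumption; the real rows in the CSV may be formatted differently) and assumes the character just before the first `/` is a single digit.

```python
# Hypothetical sample text; the real tweets in the CSV may look different.
sample_tweet = "Wordle 210 4/6"

# Locate the first '/' and read the single character just before it.
slash_index = sample_tweet.index('/')
attempts = float(sample_tweet[slash_index - 1])

print(attempts)  # 4.0
```

If the character before the slash is not a digit, `float()` raises a `ValueError`, so real data may need extra handling.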

## Iterating with the `iterrows` method

The most straightforward way to iterate over a DataFrame is with the `iterrows` method. This iterates over one row at a time.
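
As a small illustration of what `iterrows` hands back, the sketch below loops over a tiny made-up DataFrame (`df_demo` and its contents are assumptions, not the notebook's data). Each iteration yields the row's index label and a Series holding that row's values.

```python
import pandas as pd

# Tiny hypothetical frame standing in for the tweet data.
df_demo = pd.DataFrame({'tweet_text': ['Wordle 210 4/6', 'Wordle 210 3/6']})

rows = []
for index, row in df_demo.iterrows():
    # `row` is a pandas Series keyed by column name, built fresh for every row.
    rows.append((index, type(row).__name__, row['tweet_text']))

print(rows)
```

Building a new Series for every row is part of why this approach becomes slow as the row count grows.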

In [2]:
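for index, row in df_tweets.iterrows():
    # The score looks like "4/6": take the character just before the first '/'
    slash_index = row['tweet_text'].index('/')
    attempts = float(row['tweet_text'][slash_index - 1])
    # Write the parsed value back into the 'attempts' column for this row
    df_tweets.loc[index, 'attempts'] = attempts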

print(df_tweets["attempts"].mean())


3.9035848841191916


This works fine for smaller data sets, and can be used when appropriate. However, `iterrows` builds a new Series for every row, so it performs poorly on larger data sets and should be avoided there.

Now we'll load 175K rows into the DataFrame and run the `iterrows` method again to loop over the data.

In [3]:
df_tweets = pd.read_csv('tweetsBig.csv', delimiter=',', header='infer')

In [4]:
for index, row in df_tweets.iterrows():
    # Same parsing as before, now over the larger data set
    slash_index = row['tweet_text'].index('/')
    attempts = float(row['tweet_text'][slash_index - 1])
    df_tweets.loc[index, 'attempts'] = attempts

print(df_tweets["attempts"].mean())


4.067405564351956


This time the execution takes over 40 seconds. For this reason, avoid `iterrows` on large data sets and use vectorized methods instead.

## About Vectorization

Vectorization refers to features in Pandas and Python that are optimized to run operations on multiple rows at once, taking full advantage of your CPU. This can make them orders of magnitude faster than `iterrows`, so they should almost always be preferred.
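
As one illustration of operating on a whole column at once, the sketch below uses the `.str` accessor, which Pandas documents as vectorized string functions. This is an alternative to the notebook's own approach, not a replacement for it; the sample data and the regular expression `(\d)/` (a single digit followed by a slash) are assumptions about how the tweets are formatted.

```python
import pandas as pd

# Hypothetical sample rows; real tweets may be formatted differently.
df_demo = pd.DataFrame(
    {'tweet_text': ['Wordle 210 4/6', 'Wordle 211 3/6', 'Wordle 212 5/6']}
)

# Pull the first digit that is directly followed by '/' from every row in one call,
# then convert the resulting strings to floats.
df_demo['attempts'] = (
    df_demo['tweet_text']
    .str.extract(r'(\d)/', expand=False)
    .astype(float)
)

print(df_demo['attempts'].mean())  # 4.0
```

Rows that do not match the pattern come back as missing values, so a check with `isna()` may be worthwhile on real data.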

### Series Apply

`Series.apply` is a Pandas method that applies a function to each value in a series (column) of data and returns a new series (column) of data. Strictly speaking, it still calls the Python function once per value, but it avoids the per-row overhead of `iterrows`. In the following example, we define a function that is "applied" to the "tweet_text" column in the DataFrame and returns a new column that contains the attempts.

In [5]:
def count_attempts_apply(tweet_text: str) -> float:
    # apply() passes one string at a time, not the whole series
    slash_index = tweet_text.index('/')
    attempts = float(tweet_text[slash_index - 1])
    return attempts

df_tweets['attempts'] = df_tweets['tweet_text'].apply(count_attempts_apply)

print(df_tweets["attempts"].mean())


4.067405564351956


This method runs in under a second, compared with the more than 40 seconds that `iterrows` took.
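
To check the difference on your own machine, a rough timing sketch along these lines may help. It uses synthetic stand-in data (an assumption, since the CSV files are not reproduced here) and `time.perf_counter`, so the absolute numbers will differ from those quoted in this notebook.

```python
import time
import pandas as pd

# Synthetic stand-in data (hypothetical); far smaller than the notebook's CSV files.
df_demo = pd.DataFrame({'tweet_text': ['Wordle 210 4/6', 'Wordle 210 3/6'] * 1000})

def count_attempts(tweet_text: str) -> float:
    # Read the single character just before the first '/'
    slash_index = tweet_text.index('/')
    return float(tweet_text[slash_index - 1])

# Time the row-by-row loop with iterrows.
start = time.perf_counter()
for index, row in df_demo.iterrows():
    df_demo.loc[index, 'attempts_iterrows'] = count_attempts(row['tweet_text'])
iterrows_seconds = time.perf_counter() - start

# Time the same parsing with Series.apply.
start = time.perf_counter()
df_demo['attempts_apply'] = df_demo['tweet_text'].apply(count_attempts)
apply_seconds = time.perf_counter() - start

print(f"iterrows: {iterrows_seconds:.3f}s, apply: {apply_seconds:.3f}s")
```

Both columns should hold identical values; only the elapsed time differs. In a notebook, the `%%timeit` cell magic is another option for more stable measurements.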