# Exploring Texts with `pandas` and `nltk`

This notebook introduces some methods to explore text using `pandas` and the Natural Language Toolkit, [`nltk`](https://www.nltk.org/). We'll keep building on our AITA DataFrame, discussing some simple ways to explore data.

This notebook is designed to help you:
1. Do some more data operations in `pandas`;
2. Use NLTK's `Text()` object to perform some basic distant reading operations on a subreddit.

## Retrieving the Dataset

Let's get the data. Make sure you're in the `data` directory when importing by running the magic command `%pwd`.
If you're not in the right directory, use `os.chdir` to navigate there.

In [None]:
%pwd

In [None]:
import os
# We include two ../ because we want to go two levels up in the file structure
os.chdir('../../data')

Let's now import the processed data from the last notebook.

**Note:** You can replace this line with whichever processed data you have saved.

In [None]:
import pandas as pd
df = pd.read_csv('aita_sub_top_sm_lemmas.csv')

## Diving Deeper into `pandas`

Let's get started by looking at some more useful `pandas` operations. First, let's use `head()` to remind ourselves what the data look like. 

In [None]:
df.head(3)

Using the `.sort_values()` method we can sort the DataFrame by particular columns. We use two parameters: the `by` parameter indicates which column (or list of columns) to sort by, and the `ascending` parameter indicates whether to sort in ascending or descending order.

In [None]:
# Sort dataframe by highest scores
df.sort_values(by=['score'], ascending=False)[:3]
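
As a small illustration of `by` and `ascending`, the sketch below sorts by two columns at once. The `toy` DataFrame, its values, and the `num_comments` column are invented for this example and are not taken from the AITA data.

```python
import pandas as pd

# Hypothetical toy data, invented for illustration only
toy = pd.DataFrame({
    'title': ['a', 'b', 'c', 'd'],
    'score': [10, 50, 50, 5],
    'num_comments': [3, 7, 9, 1],
})

# Sort by score (highest first); break ties by num_comments (lowest first)
toy_sorted = toy.sort_values(by=['score', 'num_comments'],
                             ascending=[False, True])
print(toy_sorted)
```

Passing a list to `ascending` sets the direction for each column in `by` separately.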

One thing we often do when we're exploring a dataset is filter it based on a given condition. For example, we might need to find all the rows in our dataset where the score is 500 or higher. We can use the `.loc[]` indexer to do so.

`.loc[]` is a powerful tool for filtering or pruning your dataset based on some condition. For more info, see [here](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.loc.html).

The basic structure of `.loc[]` is `df.loc[rows, columns]`, where `rows` is the rows to select (usually a conditional as in the cell below), and `columns` is a list of names of columns to select. Either value can be replaced by `:` to mean 'select all rows/columns'.

For instance, if we only want rows with a score of 500 or higher:

In [None]:
df_top = df.loc[df['score'] >= 500, :]
len(df_top)
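
The `rows` and `columns` slots can also be combined. The following is a minimal sketch on an invented `toy` DataFrame (the values are made up) that filters on two conditions and keeps only two columns.

```python
import pandas as pd

# Hypothetical toy data, invented for illustration only
toy = pd.DataFrame({
    'title': ['a', 'b', 'c', 'd'],
    'score': [100, 500, 900, 700],
    'flair_text': ['Asshole', 'Not the A-hole', 'Asshole', 'Not the A-hole'],
})

# Combine conditions with & (and) or | (or); wrap each one in parentheses
mask = (toy['score'] >= 500) & (toy['flair_text'] == 'Asshole')

# Keep the rows where the mask is True, and only two of the columns
subset = toy.loc[mask, ['title', 'score']]
print(subset)
```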

There are a lot of different ways to subset data, but `.loc[]` is one of the most versatile.

## Value Counts
We can also look at unique value counts for a column by running `value_counts()`. Value counts indicate the number of samples with each unique value for a column.

Let's have a look at two particularly interesting columns: `"flair_text"` and `"flair_css_class"`. On Reddit, a flair lets users tag posts or usernames in certain subreddits to add context or humor. Here, the flairs attached to the posts are assigned by moderators after the community votes on whether an OP was or wasn't the asshole. There are also other flairs such as "No assholes here" or "Everyone sucks". See [here](https://www.reddit.com/r/AmItheAsshole/wiki/faq) for more information.

This community-driven data segmentation is quite helpful for us!

In [None]:
df.flair_text.value_counts()
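
Raw counts can be hard to compare across datasets of different sizes. As a sketch, `value_counts(normalize=True)` returns proportions instead; the `flairs` Series below is invented for illustration.

```python
import pandas as pd

# Hypothetical flair labels, invented for illustration only
flairs = pd.Series(['Asshole', 'Not the A-hole',
                    'Not the A-hole', 'Not the A-hole'])

counts = flairs.value_counts()                 # absolute counts per value
shares = flairs.value_counts(normalize=True)   # proportions that sum to 1

print(counts)
print(shares)
```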

## Type-token ratio
Next, let's figure out the **type-token ratio** (TTR) for our posts. The type-token ratio is a crude measure of lexical diversity, often used as a rough proxy for language complexity. First, we'll create a function that computes the TTR. From just reading the code, what does a higher TTR correspond to?

In [None]:
def typeTokenRatio(tokens):
    """Calculates type-token ratio on tokens."""
    numTokens = len(tokens)
    numTypes = len(set(tokens))
    return numTypes / numTokens

Now we loop over the first 10 lemmatized submissions in our DataFrame and compute the TTR of each. Do you think that the TTR is capturing a variation in language complexity?

The benefit of the TTR is that it is a single number summarizing one characteristic of the text; the drawback is just that: a single number will not capture all of the complexity of a text. However, in concert with other techniques, this can be a helpful tool.

In [None]:
df['lemmas'][:10]

In [None]:
for text in df['lemmas'][:10]:
    tokens = text.split()
    print('Text:\n', text)
    print('TTR:', typeTokenRatio(tokens), '\n')
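
One caveat when comparing TTR values is that the ratio depends on text length: longer texts repeat words more, so their TTR tends to be lower. The sketch below shows this with a made-up sentence; `type_token_ratio` is a stand-in defined here so the example runs on its own.

```python
def type_token_ratio(tokens):
    """Number of unique tokens (types) divided by the total number of tokens."""
    return len(set(tokens)) / len(tokens)

# Made-up example text, not from the dataset
short_text = 'the dog chased the cat'.split()
repeated_text = short_text * 3   # same vocabulary, three times as long

ttr_short = type_token_ratio(short_text)        # 4 types / 5 tokens
ttr_repeated = type_token_ratio(repeated_text)  # 4 types / 15 tokens

print(ttr_short, ttr_repeated)
```

For this reason, TTR is most informative when the texts being compared are of similar length.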

## Processing and Analyzing Language with `Text()`

Let's have another look at our data. As another powerful tool for initially exploring texts, `nltk` provides a `Text()` class that can process text and supports operations such as **counting**, **concordancing**, and **collocation discovery**.

Let's use the "lemmas" data we created in the last notebook. To get a list of all tokens, all we need to do is run `split()` on each row.

In [None]:
# Run if you do not have nltk installed
!pip install nltk

In [None]:
tokens = []
for row in df['lemmas']:
    # Notice that we put all tokens in the same list
    tokens.extend(row.split())

In [None]:
import nltk
nltk.download('stopwords')
from nltk.text import Text

aita_tokens = Text(tokens)

Let's print out the "docstring" of the [`Text()`](https://www.nltk.org/api/nltk.text.html) object, as well as all the things you can do with this object. Have a read through this to see what it allows you to do! We will talk about five simple operations here, but there are many others that can be helpful.

In [None]:
help(Text)

### Concordances 
One of the most basic, but quite helpful, ways to quickly get an overview of the contexts in which a word appears is through a **concordance** view. `Text.concordance()` takes a word and a display `width` (measured in characters, not words) and prints occurrences of that word with a window of context around each. By default it prints only the first 25 matches; pass a larger `lines` value to see more.

In [None]:
aita_tokens.concordance('mistake', width=50)
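
`concordance()` only prints its results. If you want to work with the matches programmatically, `Text.concordance_list()` returns them as objects instead. The following is a minimal sketch on a made-up token list (`toy_tokens` is invented for illustration).

```python
from nltk.text import Text

# Made-up tokens, invented for illustration only
toy_tokens = 'i make a mistake and she say the mistake be mine'.split()
toy_text = Text(toy_tokens)

# Each hit records the token offset and the formatted context line
hits = toy_text.concordance_list('mistake', width=40, lines=10)
for hit in hits:
    print(hit.offset, '|', hit.line)
```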

### Collocations
A **collocation** is a sequence of words that often appear together. The `.collocation_list()` method will help identify words that often appear together in our texts. 


In [None]:
aita_tokens.collocation_list()

In [None]:
# Change input arguments
aita_tokens.collocation_list(num=30, window_size=3)
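
For more control over how collocations are scored and filtered, `nltk.collocations` can be used directly. The sketch below assumes a likelihood-ratio score and a minimum frequency of 2; both are illustrative choices, and the tokens are made up. It is not necessarily identical to what `collocation_list()` does internally.

```python
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

# Made-up tokens, invented for illustration only
toy_tokens = ('my mother in law visit . my mother in law call . '
              'the law be clear . my sister visit').split()

# Count adjacent word pairs (window_size=2 means directly neighbouring words)
finder = BigramCollocationFinder.from_words(toy_tokens, window_size=2)

# Drop pairs that occur fewer than 2 times
finder.apply_freq_filter(2)

# Rank the remaining pairs by likelihood ratio and keep the top 3
top_pairs = finder.nbest(BigramAssocMeasures.likelihood_ratio, 3)
print(top_pairs)
```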

### Word Plotting

Using the `dispersion_plot()` method we can visualize where words occur in the text. We pass a list of words, and the plot marks each position (token offset) at which a word appears, from the first token to the last.

This is useful for seeing how word usage changes over the course of a corpus, e.g., when novel words begin to appear. Note: the x-axis is position in the token list, so it only reflects time if the DataFrame was sorted by date before the tokens were collected.

In [None]:
aita_tokens.dispersion_plot(["unhappy", "marriage"])
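
The information behind a dispersion plot is simply the list of token offsets at which each word occurs. As a rough sketch in plain Python (the tokens are made up), these offsets can be collected like this:

```python
# Made-up tokens, invented for illustration only
toy_tokens = 'i be unhappy in my marriage but my marriage be not over'.split()
targets = ['unhappy', 'marriage']

# For each target word, record every position at which it occurs
offsets = {
    word: [i for i, tok in enumerate(toy_tokens) if tok == word]
    for word in targets
}
print(offsets)
```

Each offset corresponds to one mark on the plot's x-axis for that word.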

### Similar Words

We might also want to look at similar words in the dataset, where "similar" means appearing in the same contexts. This is **distributional similarity**.

In [None]:
aita_tokens.similar('partner')

### Common Context

The `.common_contexts()` method lets us study the **common contexts** of two or more words: the cases where they occur in the same immediate context. We pass the words as a single list, i.e., inside square brackets within the method's parentheses, separated by commas.

In [None]:
aita_tokens.common_contexts(['mom', 'dad'])
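
To make the idea of a shared context concrete, here is a simplified sketch in plain Python. It assumes a word's context is just the pair (previous word, next word), which is a simplification for illustration; the tokens are made up.

```python
from collections import defaultdict

# Made-up tokens, invented for illustration only
toy_tokens = 'my mom say no but my dad say yes and my mom be right'.split()

# Map each word to the set of (previous word, next word) pairs around it
contexts = defaultdict(set)
for left, word, right in zip(toy_tokens, toy_tokens[1:], toy_tokens[2:]):
    contexts[word].add((left, right))

# Contexts in which both words occur
shared = contexts['mom'] & contexts['dad']
print(shared)
```

Words that share many contexts in this sense are the ones that distributional similarity treats as similar.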

## Incorporating Time

One of the most valuable metadata variables when working with text is time. If we know when a text was created, then we can explore all sorts of properties in the text, and their dynamics.

In this dataset, the time is contained in the `created` column. It is a Unix timestamp: the number of seconds that have elapsed since the Unix epoch, minus leap seconds; the Unix epoch is 00:00:00 UTC on 1 January 1970. While this works great for computers, it's not a particularly helpful format for humans! So we use `pd.to_datetime()` to convert this number to a more readable format.

In [None]:
pd.to_datetime(1207632114, unit='s')
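
`pd.to_datetime()` works on a single number and on a whole Series alike. The short sketch below shows both; the `stamps` Series is invented for illustration.

```python
import pandas as pd

# A single Unix timestamp (seconds since 1970-01-01 00:00:00 UTC)
ts = pd.to_datetime(1207632114, unit='s')
print(ts)  # 2008-04-08 05:21:54

# A Series of timestamps is converted element by element
stamps = pd.Series([0, 86400, 1207632114])
dates = pd.to_datetime(stamps, unit='s')
print(dates)
```

Without `unit='s'`, integers are interpreted as nanoseconds, so the unit matters here.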

Creating a new column in `pandas` is usually as easy as using bracket notation with a new column name and assigning to it, which appends the column at the end. Here we use `.insert()` instead so we can choose the position of the new column (`loc=3`), and we apply `pd.to_datetime()` to the entire "created" column.

In [None]:
df.insert(loc=3, column='created_datetime', value=pd.to_datetime(df['created'], unit='s'))
df.head(3)

We can select particular components of the timestamps by accessing `.year`, `.month`, `.day`, etc. through the column's `.dt` accessor, or on a `DatetimeIndex` built from the column.
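
As a brief sketch of the `.dt` accessor route, the example below extracts the year and month name from a small datetime Series. The timestamps are invented for illustration.

```python
import pandas as pd

# Three made-up Unix timestamps: 1 Jan, 1 Feb and 1 Jul 2021 (UTC)
created = pd.Series(pd.to_datetime([1609459200, 1612137600, 1625097600],
                                   unit='s'))

years = created.dt.year            # integer year of each timestamp
months = created.dt.month_name()   # English month name of each timestamp

print(years.tolist())
print(months.tolist())
```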

Thinking back to the flair column for this dataset, let's see if we can find out whether people are considered assholes more frequently in particular months. 
We'll first create a new df with just the submissions from 2021.

In [None]:
df_2021 = df.loc[pd.DatetimeIndex(df['created_datetime']).year == 2021, :]
len(df_2021)

Using the `month_name()` method of `DatetimeIndex`, we can see which month each post was written in:


In [None]:
months_array = pd.DatetimeIndex(df_2021['created_datetime']).month_name()
months_array

Now we have all we need to visualize the data. We will use the seaborn library to plot `df_2021`, using the array of month names we just created to separate counts on the x-axis. Do you see any patterns in the flair assignment across months?

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
# Apply the default seaborn theme and set the figure size in one call
sns.set_theme(rc={'figure.figsize': (7, 5)})

p = sns.histplot(
    data=df_2021, 
    x=months_array,
    hue="flair_css_class",
    multiple="stack")

plt.xticks(rotation=70)
plt.tight_layout()
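
If you prefer numbers to a chart, the same month-by-flair breakdown can be tabulated with `pd.crosstab`. This is a sketch on invented data; the `month` column and the flair class values are made up for illustration.

```python
import pandas as pd

# Hypothetical toy data, invented for illustration only
toy = pd.DataFrame({
    'month': ['January', 'January', 'February', 'February', 'February'],
    'flair_css_class': ['ass', 'not', 'not', 'not', 'ass'],
})

# Rows are months, columns are flair classes, cells are post counts
table = pd.crosstab(toy['month'], toy['flair_css_class'])
print(table)
```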

Let's save our data with the datetime objects!

In [None]:
df.to_csv('aita_sub_top_sm_lemmas.csv', index=False)
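
One thing to keep in mind is that a CSV file stores everything as text, so datetime columns come back as plain strings unless you ask `pandas` to parse them. The sketch below round-trips an invented DataFrame through an in-memory buffer (used here instead of a real file) and restores the datetime type with `parse_dates`.

```python
import io
import pandas as pd

# Hypothetical toy data, invented for illustration only
toy = pd.DataFrame({
    'created_datetime': pd.to_datetime([1609459200, 1612137600], unit='s'),
    'score': [10, 20],
})

# Write to an in-memory text buffer instead of a file on disk
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)

# parse_dates tells read_csv which columns to convert back to datetimes
restored = pd.read_csv(buffer, parse_dates=['created_datetime'])
print(restored.dtypes)
```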

Today we covered some techniques for summarizing text:

- Using `pandas` to sort, filter and summarize text.
- Using metrics like the type-token ratio.
- Using the `nltk` package to quickly process and visualize text.

These techniques are particularly helpful for getting to know your dataset, and exploring text in an efficient way. In the next notebook we will talk about converting text to numeric data that can be used for many useful machine learning techniques.