# Data Science for Social Justice Workshop: Module 2

## Distant reading 

This notebook introduces some methods for a simple distant reading using NLTK. We'll keep building on our AITA DataFrame, discussing some simple ways to explore data.

**After completing this notebook, you will be able to:**
1. Do some more data operations in Pandas;
2. Use NLTK's `Text()` object to perform some basic distant reading operations on a subreddit.

There are several basic programming exercises scattered throughout for those who need them.


## Retrieving the dataset
Let's get the data. Run the magic command `%pwd` to check that you're in the "Data" directory before importing.
If you're not in the right directory, use `os.chdir` to navigate there.

In [None]:
%pwd

In [None]:
import os
# We include two ../ because we want to go two levels up in the file structure
os.chdir('../../Data')

Now we can import the data with Pandas:

In [None]:
import pandas as pd

df = pd.read_csv('aita_sub_top_sm_lemmas.csv')

# 1. A couple more Pandas operations

Let's have a look at the data.

In [None]:
df.head(3)

### Sorting a DF
Using the `.sort_values()` method we can sort the df by particular columns. We use two parameters: the `by` parameter indicates which column we want to sort by, and the `ascending` parameter indicates whether the sort is in ascending or descending order.

In [None]:
df.sort_values(by=['score'], ascending=False)[:3]
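
To make the two parameters concrete, here is a minimal sketch on a made-up DataFrame. The `toy` frame and its values are invented for illustration and are not part of the AITA data.

```python
import pandas as pd

# Hypothetical toy data, not the AITA dataset
toy = pd.DataFrame({
    "title": ["a", "b", "c", "d"],
    "score": [120, 980, 45, 510],
})

# Highest scores first
top_down = toy.sort_values(by="score", ascending=False)

# Lowest scores first (ascending=True is the default)
bottom_up = toy.sort_values(by="score")

print(top_down["title"].tolist())
print(bottom_up["title"].tolist())
```

Note that `.sort_values()` returns a new DataFrame and leaves `toy` itself unchanged.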

One thing we often do when we're exploring a dataset is filter the data based on a given condition. For example, we might need to find all the rows in our dataset where the score is at least 500. We can use the `.loc[]` indexer to do so.

`.loc[]` is a powerful indexer that comes in handy whenever you want to filter or prune your dataset based on some condition. For more info, see [here](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.loc.html).

For instance, if we only want rows with a score of 500 or higher:

In [None]:
df_top = df.loc[df.score >= 500]
len(df_top)

Note that we could also just access this data using `df[df.score >= 500]` (without the `.loc`). This is a bit shorter but has some drawbacks. See [this post](https://stackoverflow.com/questions/38886080/python-pandas-series-why-use-loc) for more info.
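
One thing `.loc[]` can do that is worth seeing in a small example is select rows and columns in a single step. The sketch below uses a made-up `toy` frame (its values are invented for illustration).

```python
import pandas as pd

# Hypothetical toy data, not the AITA dataset
toy = pd.DataFrame({
    "title": ["a", "b", "c", "d"],
    "score": [120, 980, 45, 510],
})

# Rows only: keep posts with a score of at least 500
high = toy.loc[toy["score"] >= 500]

# Rows and columns at once: same filter, but keep only the titles
high_titles = toy.loc[toy["score"] >= 500, "title"]

print(len(high))
print(high_titles.tolist())
```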

### Converting to datetime
Did you ever wonder which format the "created" column is in? It is a Unix timestamp: the number of seconds that have elapsed since the Unix epoch, minus leap seconds; the Unix epoch is 00:00:00 UTC on 1 January 1970.

In [None]:
pd.to_datetime(1207632114,unit='s')
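
To see that there is nothing Pandas-specific about Unix timestamps, the same conversion can be sketched with Python's standard library. This is only an illustration; the variable names `stamp` and `moment` are ours.

```python
from datetime import datetime, timezone

stamp = 1207632114  # seconds since 1970-01-01 00:00:00 UTC

# Interpret the timestamp explicitly as UTC so the result does not
# depend on the time zone of the machine running the code
moment = datetime.fromtimestamp(stamp, tz=timezone.utc)

print(moment.isoformat())
```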

Pandas allows us to create a new column that converts the Unix timestamps into more readable datetimes using the `pd.to_datetime()` function.

Creating a new column in Pandas is usually as easy as using bracket notation to write a new column name and assigning to it. Here we use the `.insert()` method instead, because it lets us choose the position of the new column (`loc=3`). For the values, we call `pd.to_datetime()` again, this time on the entire "created" column.

In [None]:
df.insert(loc=3, column='created_datetime', value=pd.to_datetime(df['created'],unit='s'))
df.head(3)
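
For comparison, here is what the bracket notation looks like on a small made-up frame (the `toy` frame and its two timestamps are invented for illustration). With brackets, the new column is appended at the end rather than placed at a chosen position.

```python
import pandas as pd

# Hypothetical toy data: two timestamps exactly one day apart
toy = pd.DataFrame({"id": ["x", "y"], "created": [0, 86400]})

# Bracket notation: name the new column and assign values to it
toy["created_datetime"] = pd.to_datetime(toy["created"], unit="s")

print(toy.columns.tolist())
print(toy["created_datetime"].dt.strftime("%Y-%m-%d").tolist())
```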

Let's save this new DF again.

In [None]:
df.to_csv('aita_sub_top_sm_lemmas.csv', index=False)

Our new "created_datetime" column is in a datetime format that Pandas can work with. When we wrap the data in this column (also called a Series) in `pd.DatetimeIndex`, we get an index of datetimes, the building block of a so-called time series. This is a data type that allows for specific functionalities. For instance, we can check which years our data contains.

In [None]:
pd.DatetimeIndex(df['created_datetime']).year.value_counts()

Looks like most of our data was written in 2021. 

## Checking value counts
We can also look at unique value counts for a column by running `value_counts()`, or easily visualize those counts using `hist()`.

Let's have a look at two particularly interesting columns: "flair_text" and "flair_css_class". A flair, in Reddit, allows users to tag posts or usernames in certain subreddits to add context or humor. Here, the flairs attached to the posts are created by moderators after the community votes on whether an OP (original poster) was or wasn't the asshole. There are also other flairs such as "No assholes here" or "Everyone sucks". See [here](https://www.reddit.com/r/AmItheAsshole/wiki/faq) for more information.

This community-driven data segmentation is quite helpful for us!

In [None]:
df.flair_text.value_counts()

In [None]:
df.flair_css_class.hist()

Let's see if we can find out whether people are considered assholes more frequently in particular months. 
We'll first create a new DF with just the submissions from 2021.

In [None]:
df_2021 = df[pd.DatetimeIndex(df['created_datetime']).year == 2021]
len(df_2021)

Using the `month_name()` method of `DatetimeIndex`, we can see which month each post was written in:

In [None]:
months_array = pd.DatetimeIndex(df_2021['created_datetime']).month_name()
months_array

Now we have all we need to visualize the data. We will use the seaborn library to plot `df_2021`, using the array of month names we just created to separate counts on the x-axis.

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
# Apply the default seaborn theme and set the figure size in one call
sns.set_theme(rc={'figure.figsize': (7, 5)})

p = sns.histplot(
    data=df_2021, 
    x=months_array,
    hue="flair_css_class",
    multiple="stack"
)

plt.xticks(rotation=70)
plt.tight_layout()
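
If you would rather read the numbers behind a stacked plot like this one, a cross-tabulation gives the same counts as a table. The sketch below runs on a made-up `toy` frame; the dates and the flair values `"ass"` and `"not"` are placeholders we invented, not values taken from the dataset.

```python
import pandas as pd

# Hypothetical toy data with invented dates and flair labels
toy = pd.DataFrame({
    "created_datetime": pd.to_datetime(
        ["2021-01-05", "2021-01-20", "2021-02-11", "2021-02-12"]
    ),
    "flair_css_class": ["ass", "not", "not", "not"],
})

# Month name for every row, then count the rows per (month, flair) pair
months = toy["created_datetime"].dt.month_name()
counts = pd.crosstab(months, toy["flair_css_class"])

print(counts)
```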

## Calculating Type-token ratio
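Next, let's figure out the type-token ratio (TTR) for our posts: the number of unique words (types) divided by the total number of words (tokens). The TTR is a crude measure of lexical variety, often used as a rough gauge of language complexity. First, we'll create a function that computes the TTR.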

In [None]:
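def typeTokenRatio(tokens):
    """Return the number of unique tokens divided by the total number of tokens."""
    numTokens = len(tokens)
    numTypes = len(set(tokens))
    # An empty post has no tokens; return 0 instead of dividing by zero
    if numTokens == 0:
        return 0.0
    return numTypes / numTokens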

Then we loop over the first 10 submissions in `df_2021`, splitting each post's "selftext" into tokens on whitespace.

In [None]:
for x in df_2021['selftext'][:10]:
    t = x.split()
    print(typeTokenRatio(t))
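
A small worked example helps with reading these numbers. The sketch below defines its own helper, `type_token_ratio` (a name we chose for illustration), and applies it to a made-up sentence. It also shows why the TTR should be compared with care: the ratio tends to drop as a text gets longer, because words start repeating.

```python
def type_token_ratio(tokens):
    """Return unique tokens divided by total tokens (0.0 for empty input)."""
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)

# Made-up example sentence: 6 tokens, 5 distinct types ("the" occurs twice)
short = "the cat sat on the mat".split()

# The same words repeated three times: 18 tokens, still only 5 types
longer = short * 3

print(type_token_ratio(short))
print(type_token_ratio(longer))
```

The second ratio is lower even though the vocabulary is identical, so posts of very different lengths are not directly comparable on this measure.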

### Distant reading with NLTK `Text()`
Let's have another look at our data. NLTK provides a `Text()` class, which is a "wrapper" that allows for initial exploration of texts. It supports counting, concordancing, collocation discovery, etc.

Let's use the "lemmas" column we created in the last notebook. All we need to do is run `split()` on each row to get our tokens.

In [None]:
total = []
for row in df_2021['lemmas']:
    # split() without an argument splits on any whitespace and skips empty strings
    total.extend(row.split())

In [None]:
len(total)
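
One detail worth knowing when tokenizing this way is that `split(' ')` and `split()` treat repeated spaces and empty strings differently. A quick sketch on made-up lemma strings (the `rows` list is invented for illustration):

```python
# Hypothetical lemma strings; the second has a double space, the third is empty
rows = ["i be wrong", "she  forgive i", ""]

with_space = []
no_arg = []
for row in rows:
    with_space.extend(row.split(' '))  # splits on every single space
    no_arg.extend(row.split())         # splits on runs of whitespace

print(with_space)
print(no_arg)
```

The version with an explicit `' '` leaves empty strings in the token list, which then count as tokens in any later statistics.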

In [None]:
import nltk
from nltk.text import Text
# The stopword list is used by Text.collocations() further down
nltk.download('stopwords')

aita_t = Text(total)

Let's print out the "docstring" of NLTK's `Text()` object, as well as all the things you can do with this object. Have a read through this to see what it allows you to do!

In [None]:
help(Text)

### Concordances 
One of the most basic, but quite helpful, ways to quickly get an overview of the contexts in which a word appears is through a concordance view. 

In [None]:
aita_t.concordance('mistake', width=115)
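
To get a feel for what a concordance view does, here is a minimal keyword-in-context sketch in plain Python. It is a simplified stand-in, not NLTK's implementation; the function name `kwic`, its `window` parameter, and the example sentence are all our own.

```python
def kwic(tokens, word, window=3):
    """Return (left, keyword, right) context tuples for each match."""
    hits = []
    for i, tok in enumerate(tokens):
        if tok.lower() == word.lower():
            # Up to `window` tokens on either side of the keyword
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            hits.append((left, tok, right))
    return hits

# Made-up token list for illustration
tokens = "i make a mistake and she say the mistake be mine".split()
hits = kwic(tokens, "mistake")

for left, kw, right in hits:
    # Right-align the left context so the keywords line up in a column
    print(f"{left:>20} [{kw}] {right}")
```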

### Collocations
A collocation is a sequence of words that often appear together. The `.collocations()` method can find these in our data.

In [None]:
aita_t.collocations()
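
The basic idea behind collocation discovery can be sketched by counting adjacent word pairs (bigrams). This is a deliberately naive version on a made-up sentence: as far as we know, NLTK's method goes further, filtering out stopwords and ranking pairs with a statistical association measure rather than raw counts.

```python
from collections import Counter

# Made-up token list for illustration
tokens = "my best friend say my best friend be right".split()

# Pair every token with its right-hand neighbour and count the pairs
bigrams = Counter(zip(tokens, tokens[1:]))

print(bigrams.most_common(2))
```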

### Word plotting
Using the `dispersion_plot()` method we can easily visualize where in the text a word appears. We have to feed it a list of one or more words.

If our df is sorted by date, the plot lets us look "through time" and see whether particular words start (dis)appearing in our data.

In [None]:
aita_t.dispersion_plot(["forgive"])
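
What a dispersion plot draws is the position (word offset) of every occurrence of a word in the token list. A minimal sketch of collecting those offsets by hand, on a made-up sentence:

```python
# Made-up token list for illustration
tokens = "i forgive he but he never forgive i".split()

# Offsets (0-based positions) at which the word occurs
offsets = [i for i, tok in enumerate(tokens) if tok == "forgive"]

print(offsets)
```

Each offset becomes one tick mark on the plot's x-axis.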

### Similar words
Using the `.similar()` method we can look at "distributional similarity": it finds other words that appear in the same contexts as the specified word.

In [None]:
aita_t.similar('girlfriend')

### Common context
The `.common_contexts()` method allows us to study the contexts that two or more words share. We pass the words as a list: comma-separated strings inside square brackets, placed within the method's parentheses.

In [None]:
aita_t.common_contexts(['mom', 'dad'])
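
The notion of a shared "context" can be sketched in plain Python by treating a word's context as the pair of words directly before and after it. This is a simplified illustration, not NLTK's implementation; the helper name `contexts`, the boundary markers, and the example sentence are our own.

```python
def contexts(tokens, word):
    """Collect the (previous, next) word pairs around each occurrence."""
    found = set()
    for i, tok in enumerate(tokens):
        if tok == word:
            # Use placeholder markers at the edges of the token list
            prev = tokens[i - 1] if i > 0 else "*START*"
            nxt = tokens[i + 1] if i < len(tokens) - 1 else "*END*"
            found.add((prev, nxt))
    return found

# Made-up token list for illustration
tokens = "my mom say no but my dad say yes and her mom be upset".split()

# Contexts that both words appear in
shared = contexts(tokens, "mom") & contexts(tokens, "dad")

print(shared)
```

Words that share many contexts in this sense are the ones a distributional similarity search would surface as similar.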