# Remark<div class='tocSkip'/>

The code in this notebook differs slightly from the printed book. For example, we frequently use pretty print (`pp.pprint`) instead of `print` and `tqdm`'s `progress_apply` instead of Pandas' `apply`.

Moreover, several layout and formatting commands, such as `figsize` to control figure size or subplot commands, are removed in the book.

You may also find some lines marked with three hashes (`###`). Those lines are not in the book either, because they don't contribute to the concept.

All of this is done to simplify the code in the book and put the focus on the important parts instead of formatting.

# Setup<div class='tocSkip'/>

## Determine Environment<div class='tocSkip'/>

In [None]:
import sys
ON_COLAB = 'google.colab' in sys.modules

if ON_COLAB:
    BASE_DIR = "/content"
    print("You are working on Google Colab.")
    print(f'Files will be downloaded to "{BASE_DIR}".')
    # adjust release
    GIT_ROOT = "https://github.com/blueprints-for-text-analytics-python/early-release/raw/master"
else:
    BASE_DIR = "../"
    print("You are working on a local system.")
    print(f'Files will be searched relative to "{BASE_DIR}".')

## Download files on Google Colab<div class='tocSkip'/>

If you are on Colab, copy the following statements into the code cell below and execute them.

```bash
!wget -P $BASE_DIR $GIT_ROOT/settings.py

!mkdir -p $BASE_DIR/data/un-general-debates

!wget -P $BASE_DIR/data/un-general-debates $GIT_ROOT/data/un-general-debates/un-general-debates-blueprint.csv.gz 
```

## Install required libraries<div class='tocSkip'/>

Still todo: setup pip requirements.txt

If you are on Colab, copy the following statements into the code cell below and execute them.

```bash
!pip install textacy
```

In [None]:
import nltk
# make sure stop words are available
nltk.download('stopwords')

## Load Python Settings<div class="tocSkip"/>

Common imports, defaults for formatting in Matplotlib, Pandas etc.

In [None]:
%matplotlib inline
%config InlineBackend.figure_format = 'png'

%run "$BASE_DIR/settings.py"

%reload_ext autoreload
%autoreload 2

# Gaining Early Insights from Textual Data
## What you will learn and what we will build


# Exploratory Data Analysis


# Introducing the Dataset


In [None]:
file = f"{BASE_DIR}/data/un-general-debates/un-general-debates-blueprint.csv.gz"
df = pd.read_csv(file)
df.sample(2, random_state=53)


# Blueprint: Getting an Overview of the Data with Pandas


## Calculating Summary Statistics for Columns


In [None]:
df['length'] = df['text'].str.len()

df.describe().T

In [None]:
df[['country', 'speaker']].describe(include='O').T

## Checking for Missing Data


In [None]:
df.isna().sum()

In [None]:
df['speaker'] = df['speaker'].fillna('unknown')

In [None]:
df[df['speaker'].str.contains('Bush')]['speaker'].value_counts()

## Plotting Value Distributions


In [None]:
df['length'].plot(kind='box', vert=False, figsize=(8, 1))


In [None]:
df['length'].plot(kind='hist', bins=30, figsize=(8,2))


In [None]:
# Not in book: seaborn plot with gaussian kernel density estimate
import seaborn as sns

plt.figure(figsize=(8, 2))
# histplot replaces the deprecated distplot (requires seaborn >= 0.11)
sns.histplot(df['length'], bins=30, kde=True);

## Comparing Value Distributions across Categories


In [None]:

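# catplot is figure-level and cannot draw into existing axes,
# so use the axes-level boxplot/violinplot functions instead
fig, axes = plt.subplots(1, 2, figsize=(10, 3))
where = df['country'].isin(['USA', 'FRA', 'GBR', 'CHN', 'RUS'])
sns.boxplot(data=df[where], x="country", y="length", ax=axes[0])
sns.violinplot(data=df[where], x="country", y="length", ax=axes[1])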



## Visualizing Developments over Time


In [None]:
df.groupby('year').size().plot(title="Number of Countries", figsize=(6,2))


In [None]:
df.groupby('year').agg({'length': 'mean'}) \
  .plot(title="Avg. Speech Length", ylim=(0,30000), figsize=(6,2))


In [None]:

# create the two side-by-side axes used below
fig, axes = plt.subplots(1, 2, figsize=(12, 2))
df.groupby('year').size().plot(title="Number of Countries", ax=axes[0])
df.groupby('year').agg({'length': 'mean'}).plot(title="Avg. Speech Length", ax=axes[1], ylim=(0,30000))



# Blueprint: Building a Simple Text Preprocessing Pipeline


## Tokenization with Regular Expressions


In [None]:
import regex as re

def tokenize(text):
    return re.findall(r'[\w-]*\p{L}[\w-]*', text)

In [None]:
text = "Let's defeat SARS-CoV-2 together in 2020!"
tokens = tokenize(text)
print("|".join(tokens))
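
The pattern above relies on `\p{L}` (any Unicode letter), which only the third-party `regex` package understands. As a rough sketch, the same idea can be approximated with the standard library, assuming that `[^\W\d_]` (a word character that is neither a digit nor an underscore) is close enough to "letter" for your data. The names `std_re`, `TOKEN_PATTERN`, and `tokenize_stdlib` are made up for this illustration and are not part of the book's code.

```python
import re as std_re  # aliased so it does not shadow the notebook's `regex as re`

# [^\W\d_] approximates \p{L}: a word character that is not a digit or "_".
# It may also accept a few non-letter numeric symbols, so treat it as a sketch.
TOKEN_PATTERN = std_re.compile(r'[\w-]*[^\W\d_][\w-]*')

def tokenize_stdlib(text):
    """Return all runs of word characters/hyphens containing at least one letter."""
    return TOKEN_PATTERN.findall(text)

sample_tokens = tokenize_stdlib("Let's defeat SARS-CoV-2 together in 2020!")
print("|".join(sample_tokens))  # Let|s|defeat|SARS-CoV-2|together|in
```

Note that `2020` is dropped because it contains no letter, while `SARS-CoV-2` survives as a single token thanks to the hyphen in the character class.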

## Treating Stop Words


In [None]:
import nltk

stopwords = set(nltk.corpus.stopwords.words('english'))

def remove_stop(tokens):
    return [t for t in tokens if t.lower() not in stopwords]

In [None]:
include_stopwords = {'dear', 'regards', 'must', 'would', 'also'}
exclude_stopwords = {'against'}

stopwords |= include_stopwords
stopwords -= exclude_stopwords

## Processing a Pipeline with one Line of Code


In [None]:
pipeline = [str.lower, tokenize, remove_stop]

def prepare(text, pipeline):
    tokens = text
    for transform in pipeline:
        tokens = transform(tokens)
    return tokens
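
To see what `prepare` does without loading the dataset, here is a minimal self-contained sketch of the same chaining idea. The helpers `toy_tokenize`, `toy_remove_stop`, `run_pipeline`, and the tiny stop word set are stand-ins invented for this example; the notebook itself uses `tokenize`, `remove_stop`, and the NLTK stop word list.

```python
# Toy stand-ins for illustration only.
toy_stopwords = {'the', 'of', 'and'}

def toy_tokenize(text):
    # Whitespace split instead of the regex tokenizer used in the notebook.
    return text.split()

def toy_remove_stop(tokens):
    return [t for t in tokens if t not in toy_stopwords]

def run_pipeline(text, pipeline):
    # Feed the output of each step into the next one.
    result = text
    for transform in pipeline:
        result = transform(result)
    return result

toy_pipeline = [str.lower, toy_tokenize, toy_remove_stop]
toy_result = run_pipeline("The Spirit of Cooperation", toy_pipeline)
print(toy_result)  # ['spirit', 'cooperation']
```

Because the pipeline is just a list of callables, steps can be added, removed, or reordered without touching the function that runs them.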

In [None]:
df['tokens'] = df['text'].progress_apply(prepare, pipeline=pipeline)

In [None]:
df['no_tokens'] = df['tokens'].progress_map(len)

# Analyzing Word Frequencies


## Blueprint: Counting Words with a Counter


In [None]:
from collections import Counter

tokens = tokenize("She likes my cats and my cats like my sofa.")

counter = Counter(tokens)
print(counter)

In [None]:
more_tokens = tokenize("She likes dogs and cats.")
counter.update(more_tokens)

print(counter)

In [None]:
counter = Counter()

_ = df['tokens'].map(counter.update)

In [None]:
pp.pprint(counter.most_common(5))

In [None]:
def count_words(df, column='tokens', preprocess=None, min_freq=2):

    # process tokens and update counter
    def update(doc):
        tokens = doc if preprocess is None else preprocess(doc)
        counter.update(tokens)

    # create counter and run through all data
    counter = Counter()
    df[column].progress_map(update)

    # transform counter into data frame
    freq_df = pd.DataFrame.from_dict(counter, orient='index', columns=['freq'])
    freq_df = freq_df.query('freq >= @min_freq')
    freq_df.index.name = 'token'
    
    return freq_df.sort_values('freq', ascending=False)

In [None]:
freq_df = count_words(df)
freq_df.head(5)

## Blueprint: Creating a Frequency Diagram


In [None]:
ax = freq_df.head(15).plot(kind='barh', width=0.95, figsize=(8,3))
ax.invert_yaxis()
ax.set(xlabel='Frequency', ylabel='Token', title='Top Words')


## Blueprint: Creating Word Clouds


In [None]:
from wordcloud import WordCloud
from matplotlib import pyplot as plt

text = df.query("year==2015 and country=='USA'")['text'].values[0]

wc = WordCloud(max_words=100, stopwords=stopwords)
wc.generate(text)
plt.imshow(wc, interpolation='bilinear')
plt.axis("off")


In [None]:

def wordcloud(word_freq, title=None, max_words=200, stopwords=None):

    wc = WordCloud(width=800, height=400, 
                   background_color= "black", colormap="Paired", 
                   max_font_size=150, max_words=max_words)
    
    # convert data frame into dict
    if isinstance(word_freq, pd.Series):
        counter = Counter(word_freq.fillna(0).to_dict())
    else:
        counter = word_freq

    # filter stop words in frequency counter
    if stopwords is not None:
        counter = {token:freq for (token, freq) in counter.items() 
                              if token not in stopwords}
    wc.generate_from_frequencies(counter)
 
    plt.title(title) 

    plt.imshow(wc, interpolation='bilinear')
    plt.axis("off")

In [None]:
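freq_2015_df = count_words(df[df['year']==2015])
plt.figure(figsize=(12,4))
plt.subplot(1,2,1)  # left: all words
wordcloud(freq_2015_df['freq'], max_words=100)
plt.subplot(1,2,2)  # right: without the 50 most frequent words overall
wordcloud(freq_2015_df['freq'], max_words=100, stopwords=freq_df.head(50).index)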


## Blueprint: Ranking with TF-IDF


In [None]:
def idf(df, column='tokens', preprocess=None, min_df=2):

    def update(doc):
        tokens = doc if preprocess is None else preprocess(doc)
        counter.update(set(tokens))

    # count tokens
    counter = Counter()
    df[column].progress_map(update)

    # create data frame and compute idf
    idf_df = pd.DataFrame.from_dict(counter, orient='index', columns=['df'])
    idf_df = idf_df[idf_df['df'] >= min_df]
    idf_df['idf'] = np.log(len(df)/idf_df['df'])+0.1
    idf_df.index.name = 'token'
    return idf_df
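
The IDF formula used above, `log(N / df) + 0.1`, can be checked by hand on a tiny example. The sketch below uses three made-up token lists and plain Python instead of Pandas; all variable names are chosen for this illustration only.

```python
import math
from collections import Counter

# Three made-up, already tokenized documents (illustrative data only).
docs = [
    ['peace', 'security', 'peace'],
    ['peace', 'climate'],
    ['peace', 'security', 'development'],
]

# Document frequency: count each token at most once per document.
doc_freq = Counter()
for doc_tokens in docs:
    doc_freq.update(set(doc_tokens))

# Same formula as in the blueprint: log(N / df) + 0.1
n_docs = len(docs)
toy_idf = {token: math.log(n_docs / count) + 0.1
           for token, count in doc_freq.items()}

for token, value in sorted(toy_idf.items(), key=lambda kv: kv[1]):
    print(f"{token:12s} df={doc_freq[token]} idf={value:.3f}")
```

A token that occurs in every document (`peace`) gets `log(1) + 0.1 = 0.1`, so the added constant keeps ubiquitous terms from being zeroed out completely, while rarer tokens receive larger weights.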

In [None]:
idf_df = idf(df)

In [None]:
freq_df = freq_df.join(idf_df)
freq_df['tfidf'] = freq_df['freq'] * freq_df['idf']

In [None]:
freq_1970 = count_words(df[df['year'] == 1970])
freq_2015 = count_words(df[df['year'] == 2015])

freq_1970['tfidf'] = freq_1970['freq'] * idf_df['idf']
freq_2015['tfidf'] = freq_2015['freq'] * idf_df['idf']

# draw the four clouds into a 2x2 grid instead of on top of each other
plt.figure(figsize=(12,6))
plt.subplot(2,2,1)
wordcloud(freq_1970['freq'], title='1970 - TF', 
          stopwords=['twenty-fifth', 'twenty-five'])
plt.subplot(2,2,2)
wordcloud(freq_2015['freq'], title='2015 - TF', 
          stopwords=['seventieth'])
plt.subplot(2,2,3)
wordcloud(freq_1970['tfidf'], title='1970 - TF-IDF', 
          stopwords=['twenty-fifth', 'twenty-five', 'twenty', 'fifth'])
plt.subplot(2,2,4)
wordcloud(freq_2015['tfidf'], title='2015 - TF-IDF', 
          stopwords=['seventieth'])


# Blueprint: Finding a Keyword in Context (KWIC)


In [None]:
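import random  # used below for sampling; may already be imported by settings.py
# KWIC lives here in textacy < 0.11; newer releases provide
# keyword_in_context in textacy.extract instead
from textacy.text_utils import KWIC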

def kwic(doc_series, keyword, window=35, print_samples=None):

    def add_kwic(text):
        kwic_list.extend(KWIC(text, keyword, ignore_case=True, 
                              window_width=window, print_only=False))

    kwic_list = []
    doc_series.progress_map(add_kwic)

    if print_samples is None or print_samples==0:
        return kwic_list
    else:
        k = min(print_samples, len(kwic_list))
        print(f"{k} random samples out of {len(kwic_list)} " + \
              f"contexts for '{keyword}':")
        for sample in random.sample(list(kwic_list), k):
            print(re.sub(r'[\n\t]', ' ', sample[0])+'  '+ \
                  sample[1]+'  '+\
                  re.sub(r'[\n\t]', ' ', sample[2]))

In [None]:
kwic(df[df['year'] == 2015]['text'], 'sdgs', window=35, print_samples=5)
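
If `textacy` is not installed, the keyword-in-context idea can still be tried with the standard library. The following is a minimal sketch under that assumption, not textacy's implementation: `simple_kwic` is a hypothetical helper that returns the text to the left of each match, the match itself, and the text to the right, each side cut to `window` characters.

```python
import re as std_re  # aliased so it does not shadow the notebook's `regex as re`

def simple_kwic(text, keyword, window=35):
    """Hypothetical helper: yield (left, match, right) context triples."""
    pattern = std_re.compile(std_re.escape(keyword), std_re.IGNORECASE)
    for match in pattern.finditer(text):
        left = text[max(0, match.start() - window):match.start()]
        right = text[match.end():match.end() + window]
        yield left, match.group(), right

sample = "We adopted the SDGs last year. The SDGs guide our policy."
contexts = list(simple_kwic(sample, 'sdgs', window=12))
for left, kw, right in contexts:
    # right-align the left context so the keywords line up in one column
    print(f"{left:>12}  {kw}  {right}")
```

Unlike the blueprint function, this sketch matches the keyword anywhere, including inside longer words, and does no random sampling.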

# Blueprint: Analyzing N-Grams


In [None]:
text = "the visible manifestation of the global climate change"
tokens = tokenize(text)

def ngrams(tokens, n=2, sep=' '):
    return [sep.join(ngram) for ngram in zip(*[tokens[i:] for i in range(n)])]

print("|".join(ngrams(tokens, 2)))
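
The one-liner in `ngrams` packs two steps into a single expression. As a small illustration with made-up tokens, the sketch below shows the intermediate values: first one shifted copy of the token list per n-gram position, then `zip` pairing them up element by element.

```python
toy_tokens = ['global', 'climate', 'change', 'today']
size = 2  # length of the n-grams

# One shifted copy of the list per position in the n-gram.
shifted = [toy_tokens[i:] for i in range(size)]
print(shifted)
# [['global', 'climate', 'change', 'today'], ['climate', 'change', 'today']]

# zip() stops at the shortest list, so incomplete n-grams are dropped.
pairs = list(zip(*shifted))
print(pairs)
# [('global', 'climate'), ('climate', 'change'), ('change', 'today')]
```

Joining each tuple with the separator then yields the strings returned by `ngrams`.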

In [None]:
def ngrams(tokens, n=2, sep=' ', stopwords=set()):
    return [sep.join(ngram) for ngram in zip(*[tokens[i:] for i in range(n)])
            if len([t for t in ngram if t in stopwords])==0]

tokens = prepare(text, [str.lower, tokenize]) # keep full list of tokens

print("Bigrams:", "|".join(ngrams(tokens, 2, stopwords=stopwords)))
print("Trigrams:", "|".join(ngrams(tokens, 3, stopwords=stopwords)))

In [None]:
df['bigrams'] = df['text'].progress_apply(prepare, pipeline=[str.lower, tokenize]) \
                          .progress_apply(ngrams, n=2, stopwords=stopwords)

count_words(df, 'bigrams').head(5)

In [None]:
# concatenate existing IDF data frame with bigram IDFs
idf_df = pd.concat([idf_df, idf(df, 'bigrams', min_df=10)])

freq_df = count_words(df[df['year'] == 2015], 'bigrams')
freq_df['tfidf'] = freq_df['freq'] * idf_df['idf']

In [None]:
wordcloud(freq_df['tfidf'], title='all bigrams', max_words=50)

where = freq_df.index.str.contains('climate')
wordcloud(freq_df[where]['freq'], title='"climate" bigrams', max_words=50)


# Blueprint: Comparing Frequencies across Time-Intervals and Categories


## Creating Frequency Timelines


In [None]:
def count_keywords(tokens, keywords): 
    tokens = [t for t in tokens if t in keywords]
    counter = Counter(tokens)
    return [counter.get(k, 0) for k in keywords]

In [None]:
keywords = ['nuclear', 'terrorism', 'climate', 'freedom']
tokens = ['nuclear', 'climate', 'climate', 'freedom', 'climate', 'freedom']

print(count_keywords(tokens, keywords))

In [None]:
def count_keywords_by(df, by, column='tokens', keywords=keywords):
    
    freq_matrix = df[column].progress_apply(count_keywords, keywords=keywords)
    freq_df = pd.DataFrame.from_records(freq_matrix, columns=keywords)
    freq_df[by] = df[by] # copy the grouping column(s)
    
    return freq_df.groupby(by=by).sum().sort_values(by)

In [None]:
freq_df = count_keywords_by(df, by='year', keywords=keywords)

In [None]:
pd.options.display.max_rows = 4

In [None]:
freq_df

In [None]:
pd.options.display.max_rows = 60

In [None]:
freq_df.plot(kind='line', figsize=(8, 3))


In [None]:
# analyzing mentions of 'climate' before 1980
kwic(df.query('year < 1980')['text'], 'climate', window=35, print_samples=5)

## Creating Frequency Heat Maps


In [None]:
keywords = ['terrorism', 'terrorist', 'nuclear', 'war', 'oil',
            'syria', 'syrian', 'refugees', 'migration', 'peacekeeping', 
            'humanitarian', 'climate', 'change', 'sustainable', 'sdgs']  

freq_df = count_keywords_by(df, by='year', keywords=keywords)

# compute relative frequencies based on total number of tokens per year
freq_df = freq_df.div(df.groupby('year')['no_tokens'].sum(), axis=0)
# apply square root as sublinear filter for better contrast
freq_df = freq_df.apply(np.sqrt)

sns.heatmap(data=freq_df.T, 
            xticklabels=True, yticklabels=True, cbar=False, cmap="Reds")
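
The two scaling steps in the cell above can be followed with plain arithmetic. The numbers below are invented for illustration; they show how dividing by the total token count makes years comparable and how the square root compresses the range so that small values remain visible in the heat map.

```python
import math

# Made-up counts for illustration: keyword hits and total tokens per year.
keyword_hits = {1970: 4, 2015: 100}
total_tokens = {1970: 10_000, 2015: 40_000}

relative = {}
scaled = {}
for year in keyword_hits:
    relative[year] = keyword_hits[year] / total_tokens[year]
    scaled[year] = math.sqrt(relative[year])
    print(year, relative[year], round(scaled[year], 4))

# Ratio between the two years before and after the square root.
print(relative[2015] / relative[1970], scaled[2015] / scaled[1970])
```

In this toy case the relative frequencies differ by a factor of about 6.25, but after the square root only by about 2.5, which is the "sublinear" effect the comment refers to.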


In [None]:
df.info(memory_usage='deep')

# Closing Remarks
