# HW01: Intro to Text Data

In this assignment, we explore how to load a text classification dataset (AG's news, originally posted [here](http://groups.di.unipi.it/~gulli/AG_corpus_of_news_articles.html)), preprocess it, and extract useful information from real-world data. First, we download the data. We use only a subset with four classes.

In [None]:
!wget https://raw.githubusercontent.com/mhjabreel/CharCnn_Keras/master/data/ag_news_csv/train.csv

## Inspect Data

In [None]:
import pandas as pd
df = pd.read_csv("train.csv", header=None)
df.info()
df.head()

Let's make the data more human-readable by adding a header and replacing the numeric labels with class names.

In [None]:
df.columns = ["label", "title", "lead"]
label_map = {1: "world", 2: "sport", 3: "business", 4: "sci/tech"}

def replace_label(x):
    # look up the class name for a numeric label
    return label_map[x]

df["label"] = df["label"].apply(replace_label)
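
The label replacement above can also be written with `Series.map`, which accepts a dictionary directly. The following is a minimal sketch on a made-up `toy` frame (not the assignment data). One difference worth noting: `map` yields `NaN` for a label that is missing from the dictionary, whereas the `apply` version raises a `KeyError`.

```python
import pandas as pd

label_map = {1: "world", 2: "sport", 3: "business", 4: "sci/tech"}

# Toy data for illustration only; label 5 is deliberately not in label_map.
toy = pd.DataFrame({"label": [3, 1, 4, 5]})

# Series.map looks each value up in the dictionary; unknown keys become NaN.
toy["label_name"] = toy["label"].map(label_map)
print(toy)
```

Which behavior is preferable depends on whether an unexpected label should stop the notebook or be inspected afterwards.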

In [None]:
df.head()

In [None]:
# TODO implement a new column text which contains the lowercased title and lead
# 1. merge them
# 2. lowercase the text

df["text"] = (df['title'] + ' ' + df['lead']).str.lower()
df.head()
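
Concatenating with `+` propagates missing values: if either `title` or `lead` were missing in a row, the combined text would be missing too. Whether this dataset contains such rows is not stated here (`df.info()` would show it), so the following is only a sketch on made-up data showing how `Series.str.cat` with `na_rep` avoids the problem.

```python
import pandas as pd

# Toy data for illustration only; the second row has a missing lead.
toy = pd.DataFrame({
    "title": ["Stocks Rally", "Cup Final"],
    "lead": ["Markets closed higher.", None],
})

# Plain `+` propagates the missing value into the combined column.
plain = (toy["title"] + " " + toy["lead"]).str.lower()

# str.cat with na_rep substitutes an empty string for missing entries;
# str.strip removes the trailing separator left behind in that case.
safe = toy["title"].str.cat(toy["lead"], sep=" ", na_rep="").str.lower().str.strip()

print(plain.isna().tolist())
print(safe.tolist())
```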

In [None]:
# TODO print the number of documents for each label
df['label'].value_counts()

## Document Length

In [None]:
# TODO create a new column with the number of words for each text
# TODO plot the average number of words per label

import matplotlib.pyplot as plt

# replace special characters with spaces so they do not stick to words
def curate(string):
    for char in "()-.,\\'?!":
        string = string.replace(char, ' ')
    return string

df['text_clean'] = df['text'].apply(curate)

# create column with number of words (whitespace-separated tokens)
df['word_count'] = df['text_clean'].apply(lambda x: len(x.split()))

# avg number of words/label
ax = df.groupby('label')['word_count'].mean().plot.bar()
ax.set_ylabel('Average word count')
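
An alternative to replacing a fixed list of characters is to define directly what counts as a word and extract those tokens with a regular expression. The sketch below assumes that a word is a maximal run of lowercase letters or digits; the `TOKEN` pattern, the `count_words` helper, and the sample sentence are illustrative and not part of the assignment.

```python
import re

# Assumption: a "word" is a maximal run of lowercase letters or digits.
TOKEN = re.compile(r"[a-z0-9]+")

def count_words(text):
    """Return the number of alphanumeric tokens in an already lowercased text."""
    return len(TOKEN.findall(text))

sample = "rate-cut hopes lift stocks (reuters)"
print(len(sample.split()))   # whitespace split keeps "rate-cut" as one token
print(count_words(sample))   # regex tokens split on the hyphen
```

The two approaches can disagree: a standalone symbol such as `&` counts as a word after a whitespace split but is ignored by the token pattern, so the choice slightly shifts the averages.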

## Word Frequency 

Let's implement a keyword search (similar to the keyword counting behind the Baker-Bloom-Davis Economic Policy Uncertainty index) and compute how often some given keywords ("play", "tax", "blackberry", "israel") appear in the different classes of our data.

In [None]:
import re

keywords = ["play", "tax", "blackberry", "israel"]
for keyword in keywords:
    # TODO implement a regex pattern
    # \b anchors the match at word boundaries, so "play" does not match "player"
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b")

    def count_keyword_frequencies(text):
        # TODO implement a function which counts how often a pattern appears in a text
        return len(pattern.findall(text))

    # Now, we can print how often a keyword appears in the data
    print("keyword:", keyword, ", total:", df["text"].apply(count_keyword_frequencies).sum())
    # and we want to find out how often the keyword appears within each class
    for label in df["label"].unique():
        # TODO print how often the keyword appears in this class
        count = df[df.label == label]["text"].apply(count_keyword_frequencies).sum()
        print("label:", label, ", keyword:", keyword, ", count:", count)
    print("*" * 100)
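
The per-class counts can also be computed without an explicit loop over labels. The sketch below uses `Series.str.count`, which counts regex matches per row, followed by a `groupby` sum. The `toy` frame is made up for illustration, and `re.escape` is an added safeguard for keywords that contain regex metacharacters.

```python
import re
import pandas as pd

# Toy data for illustration only.
toy = pd.DataFrame({
    "label": ["sport", "sport", "business"],
    "text": [
        "the play was the best play of the game",
        "a new player joins",
        "tax cut in play",
    ],
})

keyword = "play"
# \b restricts matches to whole words, so "player" is not counted.
pattern = r"\b" + re.escape(keyword) + r"\b"

# Series.str.count counts non-overlapping regex matches in each text.
per_doc = toy["text"].str.count(pattern)

# Sum the per-document counts within each class.
per_class = per_doc.groupby(toy["label"]).sum()
print(per_class)
```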

As a last exercise, we extend the keyword search implemented above to a fuzzy match and plot the total number of occurrences of "tax" and its variations (e.g. taxation, taxes) for each class in the dataset. Hint: have a look at the [pandas bar plot with group by](https://queirozf.com/entries/pandas-dataframe-plot-examples-with-matplotlib-pyplot).

In [None]:
import re
import matplotlib.pyplot as plt

# "tax" at the start of a word, followed by any number of lowercase letters
pattern = re.compile(r"\btax[a-z]*")

def count_keyword_frequencies(text):
    # TODO implement a function which counts how often different versions of "tax" appear in a text
    return len(pattern.findall(text))

# TODO create a bar plot for the word counts of "tax" for the different classes in the dataset
df["tax_count"] = df["text"].apply(count_keyword_frequencies)
ax = df.groupby("label")["tax_count"].sum().plot(kind="bar")
ax.set_ylabel('Number of "tax" occurrences')
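
A pattern of the form "tax followed by any letters" is deliberately permissive, which means it also matches unrelated words such as "taxi". The sketch below contrasts it with a narrower pattern that lists variants explicitly. The `narrow` pattern and its list of suffixes are a hypothetical choice for illustration, not part of the assignment, and the sample sentence is made up.

```python
import re

# Broad pattern: "tax" at a word start followed by any lowercase letters.
broad = re.compile(r"\btax[a-z]*")

# Hypothetical narrower alternative listing the intended variants explicitly.
narrow = re.compile(r"\btax(?:es|ed|ing|ation|able|payers?)?\b")

sample = "taxes rise as taxi drivers protest new taxation rules; taxpayers object"
print(broad.findall(sample))
print(narrow.findall(sample))
```

The broad pattern is simpler and catches variants nobody thought to list; the narrow one trades recall for fewer false positives. Inspecting a sample of matched words is a reasonable way to decide between them.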