# Monthly Keyword Analysis 

This notebook ingests the preprocessed data from `../interim/text` downloaded by `download_datasets.ipynb` and uses TF-IDF to identify the top 10 keywords for each month. The procedure is as follows: for each month, we fit a separate TF-IDF model, collect the 10 highest-scoring words for each email, and count how often each word appears across those per-email lists. The 10 most frequent words become that month's keywords.

Finally, the results are saved as CSV files (one per month) and pushed to remote storage for visualization with Superset.

In [2]:
import pandas as pd
import numpy as np
import os
import datetime
from sklearn.feature_extraction.text import TfidfVectorizer
from collections import Counter
import gc
import sys
from pathlib import Path
from dotenv import load_dotenv

load_dotenv("../../.env")
sys.path.append("../..")
from src import utils  # noqa

In [3]:
BASE_PATH = os.getenv("LOCAL_DATA_PATH", "../../data")

LAST_MONTH_DATE = datetime.datetime.now().replace(day=1) - datetime.timedelta(
    days=1
)
year = LAST_MONTH_DATE.year
month = LAST_MONTH_DATE.month

In [4]:
if os.getenv("RUN_IN_AUTOMATION"):
    df = pd.read_csv(
        f"{BASE_PATH}/interim/text/fedora-devel-{year}-{month}.mbox.csv"
    )
    df.head()

else:
    df = utils.load_dataset(f"{BASE_PATH}/interim/text/")
    df.head()

In [5]:
df.shape

(3962, 3)

<!-- ## Text Preprocessing

Due to the casual nature of email writing, along with some known useless artifacts present in our textual dataset, we need to clean our data a bit before performing our analysis.  -->

In [8]:
df["Date"] = df["Date"].apply(lambda x: pd.to_datetime(x))
df["Chunk"] = df["Date"].apply(lambda x: datetime.date(x.year, x.month, 1))
df = df.sort_values(by="Date")
df.reset_index(inplace=True, drop=True)
df.head()

Unnamed: 0,Message-ID,Date,Body,Chunk
0,<e0c38165-50de-dfb8-77ce-596489e2ecbf@compton.nu>,2020-12-01 00:06:04+00:00,Do we really have to go through this all again...,2020-12-01
1,<CAB-QmhR9rhdwBfW+cGvOW0wdsJMbGhuk9_9Mj43A5_GY...,2020-12-01 01:12:20+01:00,"On Tue, Dec 1, 2020 at 1:06 AM Tom Hughes via ...",2020-12-01
2,<1f51986b-c5b6-b073-d908-a8c713e9c9ad@redhat.com>,2020-12-01 01:58:12+01:00,ntirely. That was exactly my reasoning. Miro H...,2020-12-01
3,<d59cdf62-19a9-62d8-dab0-a2b73f0489e8@redhat.com>,2020-11-30 17:05:38-08:00,"We never came to a conclusion on this, because...",2020-11-01
4,<32b36717-bec7-1747-d7ea-55627c1779dc@redhat.com>,2020-11-30 17:37:05-08:00,e: False positive because it matches the regex...,2020-11-01


In [9]:
df.tail()

Unnamed: 0,Message-ID,Date,Body,Chunk
3957,<346ef226-3317-c310-d80c-283e4cc7dc2d@redhat.com>,2021-02-27 20:30:45+01:00,"Hi Benjamin, Ray, I noticed this problem while...",2021-02-01
3958,<CAA_UwzK-njEiGSvq6FfGWteCz93Cm-Uk-KGLdC4f=Bq1...,2021-02-27 14:56:02-05:00,"ah i think we need to pull in Ray y. l, .org",2021-02-01
3959,<8dee2ff2-e118-bdb2-5d77-20ca82759727@gmail.com>,2021-02-27 20:59:59+01:00,"Hi, I am trying to test some Renoir s2idle pat...",2021-02-01
3960,<CAA_Uwz+nM0n85OyaAd6=55_ANw4yefwAqJ3k40e91Yui...,2021-02-27 15:16:43-05:00,"Hi, seems like this is already in updates. you...",2021-02-01
3961,<4199adc3-49c8-4d3d-d768-84327df177fa@gmail.com>,2021-02-27 18:56:52-05:00,The assimp license field for version 5.0.1 has...,2021-02-01


## Single month example 

Here we prototype our method for identifying the top N keywords for a single month, to make sure it works properly before applying it to the entire dataset.

In [10]:
if not os.getenv("RUN_IN_AUTOMATION"):
    corpus = df[df.Chunk == datetime.date(2020, 11, 1)].Body
    vectorizer = TfidfVectorizer()
    X = vectorizer.fit_transform(corpus)
    print(X.shape)

(4, 71)


In [11]:
if not os.getenv("RUN_IN_AUTOMATION"):
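    # get_feature_names() was removed in scikit-learn 1.2;
    # get_feature_names_out() is available from 1.0 onward.
    feature_array = np.array(vectorizer.get_feature_names_out())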

    Document = []
    for i, j in enumerate(X[0].toarray()[0]):
        if j > 0:
            Document.append((i, j))

    top_10 = sorted(Document, key=lambda x: x[1], reverse=True)[0:10]
    top_10_keys = [x[0] for x in top_10]
    print([feature_array[i] for i in top_10_keys], end="\n\n")
    print(corpus.iloc[0])
    del feature_array
    gc.collect()

['to', 'was', 'be', 'came', 'cmake', 'conclusion', 'going', 'never', 'on', 'or']

We never came to a conclusion on this, because it was unclear whether make was going to be a weak or strong dependency of cmake. Tom


It looks like our keywords are reasonable given the email above.

### Run full analysis on entire dataset 

Now that we are confident our approach works, we will break it up into manageable functions and apply it to each month's dataset.

In [12]:
def train_monthly_tfidf(corpus):
    vectorizer = TfidfVectorizer(stop_words="english")
    x = vectorizer.fit_transform(corpus)
    return x, vectorizer


def top_words_per_email(email_vector, feature_array, top_words=10):
    document = []
    for i, j in enumerate(email_vector.toarray()[0]):
        if j > 0:
            document.append((i, j))
    top_n = sorted(document, key=lambda x: x[1], reverse=True)[0:top_words]
    top_n_keys = [x[0] for x in top_n]
    top_n_words = [feature_array[i] for i in top_n_keys]
    return top_n_words


def get_monthly_keywords(corpus, chunk):
    x, vectorizer = train_monthly_tfidf(corpus)
    # get_feature_names() was removed in scikit-learn 1.2.
    feature_array = np.array(vectorizer.get_feature_names_out())
    keywords = []
    for i in range(x.shape[0]):
        keywords.extend(top_words_per_email(x[i], feature_array))

    keywords = Counter(keywords).most_common(10)
    keywords = pd.DataFrame(keywords, columns=["word", "count"])
    keywords["month"] = chunk
    del feature_array
    gc.collect()

    return keywords

In [13]:
dataset_base_path = Path(f"{BASE_PATH}/processed/keywords/")
dataset_base_path.mkdir(parents=True, exist_ok=True)

In [14]:
# For each document collect the top 10 words, then sum the top 10 for each month.

months = df.Chunk.unique()
new_files = []

for month in months:
    corpus = df[df.Chunk == month].Body
    monthly_keywords = get_monthly_keywords(corpus, month)
    # Index by word so the CSV rows read: word,count,month
    monthly_keywords = monthly_keywords.set_index("word")
    out_path = f"{BASE_PATH}/processed/keywords/keywords-{month}.csv"
    monthly_keywords.to_csv(out_path, header=False)
    new_files.append(out_path)
    print(month)

2020-12-01
2020-11-01
2021-01-01
2021-02-01


In [14]:
monthly_keywords

word,count,month
aarch64,116,2020-11-01
x8664,115,2020-11-01
ttest,94,2020-11-01
tests,77,2020-11-01
failed,64,2020-11-01
cloudbaseqcow2qcow2,56,2020-11-01
soft,55,2020-11-01
test,39,2020-11-01
uefiurl,36,2020-11-01
package,31,2020-11-01


## Upload results to S3

In [15]:
if os.getenv("RUN_IN_AUTOMATION"):
    utils.upload_files(
        (f, f"processed/keywords/{Path(f).stem}.csv") for f in new_files
    )
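
The generator above pairs each local file with the key it should have in remote storage. The signature of `utils.upload_files` is not shown in this notebook, so the sketch below only illustrates the pairs that are produced, assuming the helper accepts `(local_path, remote_key)` tuples. The paths in `new_files` here are made-up examples shaped like the ones written earlier.

```python
from pathlib import Path

# Hypothetical local paths, shaped like the entries appended to new_files.
new_files = [
    "../../data/processed/keywords/keywords-2020-11-01.csv",
    "../../data/processed/keywords/keywords-2020-12-01.csv",
]

# Same expression as in the upload cell, materialized into a list for inspection.
pairs = [(f, f"processed/keywords/{Path(f).stem}.csv") for f in new_files]

for local_path, remote_key in pairs:
    print(local_path, "->", remote_key)
```

Because `Path(f).stem` drops both the directory and the `.csv` suffix, the remote key does not depend on the value of `BASE_PATH`.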