# A simple example to demonstrate Pandas data frames

[Pandas](https://pandas.pydata.org/) is a data analysis library for Python. The aim of this notebook is to read the summary page of the [Living Review of Machine Learning for Particle Physics](https://iml-wg.github.io/HEPML-LivingReview/) into a pandas data frame and perform some basic statistical analysis. We will read in the text and process it with the natural language toolkit `nltk`.

We will try to answer the following questions:

* How many articles are included in the review?
* What are the most common words in the article titles?
* How do the trends in common terms change with arXiv date?


In [None]:
import re
import string
# natural language toolkit
import nltk

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from collections import Counter
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

In [None]:
nltk.download('punkt')
nltk.download('stopwords')

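# stopwords.words() with no argument returns the stop words of every language
# in the corpus; a set makes the membership test in cleaning() much faster
STOP_WORDS = set(stopwords.words())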

In [None]:
%matplotlib inline

In [None]:
# if using google colab
!git clone https://github.com/enocera/FIS0204.git

In [None]:
import sys
sys.path.append('/content/FIS0204/Lectures/Lecture_08/PandasExample/')
%cd /content/FIS0204/Lectures/Lecture_08/PandasExample/

* First we define a function which will clean spurious/useless text from each line

In [None]:
def cleaning(text):
    """
    Remove URL links, HTML-like tags, punctuation and special characters,
    then tokenize and remove stop words.
    The input is expected to be lower case already.
    """

    # raw strings avoid invalid-escape warnings for \S and \.
    text = re.sub(r'https?://\S+|www\.\S+', '', text)
    text = re.sub(r'<.*?>+', '', text)
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)
    text = re.sub(r'\n', '', text)
    text = re.sub(r'[’“”…]', '', text)

    # remove the stop words
    text_tokens = word_tokenize(text)
    tokens_without_sw = [
        word for word in text_tokens if word not in STOP_WORDS]

    return " ".join(tokens_without_sw)
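
To see what the substitutions do before any tokenizing happens, here is a minimal sketch that applies only the regular-expression steps to a single line. The helper name `strip_markup` and the sample line (including its URL) are made up for illustration; they are not part of the original notebook or of the README.

```python
import re
import string


def strip_markup(text):
    """Apply the same regex substitutions as cleaning(), without the nltk step."""
    text = re.sub(r'https?://\S+|www\.\S+', '', text)   # drop URLs
    text = re.sub(r'<.*?>+', '', text)                  # drop HTML-like tags
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)  # ASCII punctuation
    text = re.sub(r'\n', '', text)                      # drop line breaks
    text = re.sub(r'[’“”…]', '', text)                  # typographic quotes, ellipsis
    return text


# Hypothetical line in the style of a markdown bullet with a link
sample = "* [jet images: <b>deep</b> learning](https://arxiv.org/abs/0000.00000)\n"
result = strip_markup(sample)
print(result.split())
```

Note that `\S+` in the URL pattern also swallows the closing bracket of the markdown link, which is harmless here because the punctuation is removed in the next steps anyway.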

In [None]:
# README.md taken from https://github.com/iml-wg/HEPML-LivingReview
with open('README.md', 'r') as file:
    data = file.read()

* Now read the whole file into a pandas data frame, with one row per line of the file

In [None]:
df = pd.DataFrame([x.split(';') for x in data.split('\n')])

* Let's see what we've got with df.info()

In [None]:
df.info()

From this we can see the number of lines, and we can look at their contents using df[<column label>]. From the line-by-line split there is only one data column, labelled 0, which holds the text of each line; the index gives the line number.

In [None]:
df[0]

The Length is indicative of the number of articles, but it is not exact, since section headings and other descriptions are also included.

Q: Can you filter the text to find the number of articles?
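
One possible starting point, as a sketch only: it assumes that every article appears as a markdown bullet with a link, i.e. a line containing `* [`, which is the same pattern the vim commands later in this notebook rely on. The sample text below is a made-up stand-in for the README, not real data.

```python
import pandas as pd

# Hypothetical stand-in for the README contents; the real file is much longer
sample = "\n".join([
    "## Classification",
    "* [A made-up paper title](https://arxiv.org/abs/0000.00001)",
    "    * [Another made-up title](https://arxiv.org/abs/0000.00002)",
    "* Jet images",
    "",
])

lines = pd.Series(sample.split("\n"))

# regex=False treats "* [" as a literal substring rather than a pattern
is_article = lines.str.contains("* [", regex=False)
n_articles = int(is_article.sum())
print(n_articles)
```

With the real data the same mask could be built from `df[0]`. The count is of matching lines, so an article listed in more than one section would be counted more than once.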

## Counting occurrences of keywords

In [None]:
# Now put everything in lower case
df['text'] = df[0].str.lower()

In [None]:
# and filter the words according to the rules defined above
dt = df['text'].apply(cleaning)

In [None]:
# now we can split into words and count the frequency
word_count = Counter(" ".join(dt).split()).most_common(30)
word_frequency = pd.DataFrame(word_count, columns = ['Word', 'Frequency'])
print(word_frequency)

## Looking at the appearance of keywords over time

For this analysis, a new file was prepared by applying regular-expression substitutions to the original README.md using vim.

Exercise: Achieve the same result using only Python.
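
A hedged sketch of one way to approach this with the `re` module. The pattern is a slightly stricter adaptation of the vim substitution (it requires a literal dot after the four date digits), and the function name `readme_to_rows` and the sample lines are invented for illustration.

```python
import re

# "* [Title](<prefix without digits>YYMM.NNNNN)" as in arXiv-style links
PATTERN = re.compile(r'\* \[([^\]]*)\]\([^0-9]*(\d{2})(\d{2})\.\d*\)')


def readme_to_rows(text):
    """Return (title, year, month) tuples for every line with a matching link."""
    rows = []
    for line in text.split("\n"):
        match = PATTERN.search(line)
        if match:
            title, year, month = match.groups()
            rows.append((title, int(year), int(month)))
    return rows


# Hypothetical lines in the style of the README
sample = "\n".join([
    "## Section",
    "* [Made-up title one](https://arxiv.org/abs/2103.00001)",
    "    * [Made-up title two](https://arxiv.org/abs/1911.12345) [[DOI](https://doi.org/xx)]",
    "* [No arXiv link](https://example.org/page)",
])
rows = readme_to_rows(sample)
print(rows)
```

The resulting list could then be passed to `pd.DataFrame(rows, columns=['title', 'year', 'month'])`, which also avoids the stray spaces in the column names that the hand-edited file introduces.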

In [None]:
# process README.md in vim using
# :g!/\* \[/d
# :%s/* \[\([^]]*\)\]([^0-9]*.\([0-9][0-9]\)\([0-9][0-9]\).[0-9]*).*$/"\1", \2, \3/
# :%s/^\s*//
# :g/\*/d
# add top line - title, year, month
# save as READMEtest.md
df2 = pd.read_csv('READMEtest.md')

In [None]:
# thanks to the better formatting, read_csv finds the column structure and
# assigns the names according to the headings provided
df2.info()

In [None]:
# watch out for the space in ' year' and ' month' - bad preprocessing. Can you fix it?
df2['title-clean'] = df2['title'].str.lower().apply(cleaning)
df2['date'] = df2[' year'] + (df2[' month']-1)/12
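
One possible fix for the column names, sketched on a small made-up data frame rather than on `READMEtest.md` itself: strip the whitespace from the labels once, then use the clean names. The frame `df_demo` and its rows are hypothetical.

```python
import pandas as pd

# Hypothetical rows mimicking READMEtest.md, including the stray header spaces
df_demo = pd.DataFrame({
    'title': ['Made-up title one', 'Made-up title two'],
    ' year': [21, 19],
    ' month': [3, 11],
})

# Remove leading/trailing whitespace from every column label
df_demo.columns = df_demo.columns.str.strip()

# Same fractional-year definition as above, now with clean names
df_demo['date'] = df_demo['year'] + (df_demo['month'] - 1) / 12
print(df_demo)
```

Applied to `df2`, the same `columns.str.strip()` line would let the later cells refer to `'year'` and `'month'` without the leading space.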

In [None]:
# filter by year using df.drop to look at the last year only
df2filter_test = df2.drop(df2[df2[' year'] < 23].index)[['title-clean', 'date']]
df2filter_test.info()

In [None]:
# this is the list of words from before:
word_frequency['Word']

In [None]:
# now we can filter by some common title keywords
# a boolean mask keeps only the rows whose cleaned title contains the keyword;
# regex=False treats the keyword as a literal substring
titles = df2['title-clean']
df2filter1 = df2[titles.str.contains(word_frequency['Word'][0], regex=False)]
df2filter2 = df2[titles.str.contains(word_frequency['Word'][6], regex=False)]
df2filter3 = df2[titles.str.contains(word_frequency['Word'][16], regex=False)]
df2filter4 = df2[titles.str.contains(word_frequency['Word'][21], regex=False)]
df2filter5 = df2[titles.str.contains(word_frequency['Word'][24], regex=False)]
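
One thing to keep in mind is that `str.contains` matches substrings, so a short keyword is also found inside longer words. The sketch below contrasts that with a whole-word match using the regex word boundary `\b`; the titles are made up for illustration.

```python
import pandas as pd

# Hypothetical cleaned titles
titles_demo = pd.Series(['jet tagging', 'subjets in boosted tops', 'graph networks'])

# Literal substring match: 'jet' is also found inside 'subjets'
substring_hits = titles_demo.str.contains('jet', regex=False)

# Whole-word match: \b requires a word boundary on both sides of 'jet'
whole_word_hits = titles_demo.str.contains(r'\bjet\b', regex=True)

print(substring_hits.tolist())
print(whole_word_hits.tolist())
```

Which behaviour is preferable depends on the question being asked; for trend plots, counting `jets` together with `jet` may well be what is wanted.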

In [None]:
# we can plot the appearances of the most common word from 2010 to 2023
mybins = np.histogram_bin_edges(df2filter1['date'], bins=50, range=(10,23))

plt.hist(df2filter1['date'], density=False, bins=mybins, alpha=0.5, label=word_frequency['Word'][0])
plt.xlim([10,23])
plt.xlabel('date')
plt.legend(loc='upper left')

## Exercise

Make a plot of the appearances over the last 10 years of the keywords ranked 6th, 16th, 21st and 28th most frequent.