<a href="https://colab.research.google.com/github/archivesunleashed/notebooks/blob/master/parquet_text_analyis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Working with Archives Unleashed Full Text Parquet Derivatives

In this notebook, we'll set up an environment, then download a dataset of web archive collection derivatives that were produced with the [Archives Unleashed Toolkit](https://github.com/archivesunleashed/aut/). These derivatives are in the [Apache Parquet](https://parquet.apache.org/) format, which is a [columnar storage](http://en.wikipedia.org/wiki/Column-oriented_DBMS) format. They are generally small enough to work with on your local machine, and can be easily converted to Pandas DataFrames, as demonstrated below.

**Web Pages**

`.webpages()` 

Produces a DataFrame with the following columns:
  - `crawl_date`
  - `url`
  - `mime_type_web_server`
  - `mime_type_tika`
  - `content`


# Dataset

We will need a web archive dataset to work with.

The one we'll use in this example notebook comes from [Bibliothèque et Archives nationales du Québec](https://www.banq.qc.ca/accueil/). It is a web archive collection of the Ministry of Environment of Québec (2011-2014) that has been processed by the [Archives Unleashed Toolkit](https://github.com/archivesunleashed/aut/). Merci beaucoup, BAnQ!

[![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.3598450.svg)](https://doi.org/10.5281/zenodo.3598450)

Curious about the size of the derivative Parquet output compared to the size of the web archive collection?

The total size of all 12 Parquet derivatives is 1.9G, with `webpages` being the largest (1.8G) since it has a column with full text (`content`).

```
2.5M	./videos
344K	./domains
1.7M	./word-processor-files
24K	./presentation-program-files
1.7M	./spreadsheets
880K	./audio
4.4M	./images
1.8G	./webpages
1.7M	./text-files
3.9M	./pdfs
29M	./webgraph
22M	./imagelinks
1.9G	.
```

The total size of the web archive collection is 165G.
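
To put those two figures side by side, here is a small arithmetic sketch. It only uses the sizes quoted above (1.9G and 165G); the variable names are made up for illustration.

```python
# Sizes reported above, in gigabytes.
parquet_size_gb = 1.9
collection_size_gb = 165

# Fraction of the original collection size taken up by the derivatives.
ratio = parquet_size_gb / collection_size_gb
# How many times larger the collection is than the derivatives.
factor = collection_size_gb / parquet_size_gb

print(f"Derivatives are {ratio:.2%} of the collection size")
print(f"The collection is roughly {factor:.0f}x larger")
```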

In [0]:
%%capture

!curl -L "https://zenodo.org/record/3598450/files/environnement-qc.tar.gz?download=1" > environment-qc-parquet.tar.gz
!tar -xzf environment-qc-parquet.tar.gz
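
If you would rather peek inside the archive from Python before extracting it, the standard-library `tarfile` module can list its members. This is only a sketch: `top_level_entries` is a hypothetical helper name, and it assumes the archive is gzip-compressed, as the `.tar.gz` extension suggests.

```python
import tarfile

def top_level_entries(archive_path):
    """Return the sorted top-level names inside a gzip-compressed tar archive."""
    with tarfile.open(archive_path, "r:gz") as archive:
        names = archive.getnames()
    entries = set()
    for name in names:
        # Drop empty components and a leading "." so "./parquet/x" maps to "parquet".
        parts = [part for part in name.split("/") if part not in ("", ".")]
        if parts:
            entries.add(parts[0])
    return sorted(entries)

# Example call (assumes the download cell above has already run):
# top_level_entries("environment-qc-parquet.tar.gz")
```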

In [0]:
!ls -1 parquet

audio
domains
imagelinks
images
pdfs
presentation-program-files
spreadsheets
text-files
videos
webgraph
webpages
word-processor-files


# Environment

Next, we'll set up our environment so we can work with the Parquet output with Pandas.

In [0]:
import numpy as np
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
import matplotlib.pyplot as plt

# Loading our Archives Unleashed Datasets as DataFrames

Next, we'll load up our dataset to work with, and show a preview.

## Pages

In [0]:
pages_parquet = pq.read_table('parquet/webpages')
pages = pages_parquet.to_pandas()
pages

Unnamed: 0,crawl_date,url,mime_type_web_server,mime_type_tika,content
0,20121218,http://www.mddefp.gouv.qc.ca/infuseur/communiq...,text/html,text/html,HTTP/1.1 200 OK\r\nConnection: close\r\nDate: ...
1,20121218,http://www.vehiculeselectriques.gouv.qc.ca/eng...,text/html,application/xhtml+xml,HTTP/1.1 200 OK\r\nConnection: close\r\nDate: ...
2,20121218,http://www.mddep.gouv.qc.ca/index_en.asp,text/html,application/xhtml+xml,HTTP/1.1 200 OK\r\nConnection: close\r\nDate: ...
3,20121218,http://www.vehiculeselectriques.gouv.qc.ca/eng...,text/html,application/xhtml+xml,HTTP/1.1 200 OK\r\nConnection: close\r\nDate: ...
4,20121218,http://www.vehiculeselectriques.gouv.qc.ca/eng...,text/html,application/xhtml+xml,HTTP/1.1 200 OK\r\nConnection: close\r\nDate: ...
5,20121218,http://www.mddep.gouv.qc.ca/climat/surveillanc...,text/html,application/xhtml+xml,HTTP/1.1 200 OK\r\nConnection: close\r\nDate: ...
6,20121218,http://www.vehiculeselectriques.gouv.qc.ca/eng...,text/html,application/xhtml+xml,HTTP/1.1 200 OK\r\nConnection: close\r\nDate: ...
7,20121218,http://www.registres.mddep.gouv.qc.ca/message_...,text/html,application/xhtml+xml,HTTP/1.1 200 OK\r\nContent-Length: 1364\r\nCon...
8,20121218,http://www.registres.mddep.gouv.qc.ca/index_SE...,text/html,application/xhtml+xml,HTTP/1.1 200 OK\r\nConnection: close\r\nDate: ...
9,20121218,http://www.registres.mddep.gouv.qc.ca/includes...,text/html,application/xhtml+xml,HTTP/1.1 200 OK\r\nContent-Length: 4417\r\nCon...
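
The `crawl_date` values in the preview look like `YYYYMMDD` strings. Assuming that holds for the whole column, a sketch for turning one value into a real date is shown below; `parse_crawl_date` is a hypothetical helper, not part of the toolkit. For the whole column, `pd.to_datetime(pages['crawl_date'], format='%Y%m%d')` should do the same job on string values.

```python
from datetime import datetime

def parse_crawl_date(value):
    """Parse a YYYYMMDD crawl date (string or integer) into a date object."""
    return datetime.strptime(str(value), "%Y%m%d").date()

print(parse_crawl_date("20121218"))
```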


# Data Analysis

Now that we have all of our dataset loaded up, we can begin to work with it!

## Text Analysis

With the `webpages` derivative, we get a `content` column, which holds the full text of each web page. In the preview above, the values still begin with the HTTP response headers. Since we have this column, it opens up the whole world of text analysis to us!
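
The preview above shows `content` values that begin with HTTP response headers (`HTTP/1.1 200 OK\r\n...`). If that is the case in your copy of the data, you may want to separate the headers from the page text before analysis. The sketch below is one way to do it, assuming the headers end at the first blank line; `split_http_response` is a hypothetical helper, not a toolkit function.

```python
def split_http_response(raw):
    """Split a raw HTTP response into (headers, body) at the first blank line."""
    # Leave values that do not look like an HTTP response untouched.
    if not raw.startswith("HTTP/"):
        return "", raw
    for separator in ("\r\n\r\n", "\n\n"):
        headers, found, body = raw.partition(separator)
        if found:
            return headers, body
    # No blank line found: treat the whole value as body.
    return "", raw

sample = "HTTP/1.1 200 OK\r\nConnection: close\r\n\r\nBonjour"
headers, body = split_http_response(sample)
print(body)
```

Applied to the DataFrame, this could look like `pages['content'].apply(lambda raw: split_http_response(raw)[1])`.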

A note of caution: working with raw text can be very memory intensive, and we're limited on memory in Colab. You'll need to set up an environment with more memory if you want to analyze a larger collection. You might notice us using `.head()` in some of the examples below; that's because we can't load all of the text at once.
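
To get a feel for how much memory a text column takes, Pandas can report per-column usage. The sketch below runs on a tiny, made-up DataFrame standing in for `pages`; the same `memory_usage(deep=True)` call should work on the real one.

```python
import pandas as pd

# Hypothetical stand-in for the `pages` DataFrame.
sample = pd.DataFrame({
    "url": ["http://example.com/a", "http://example.com/b"],
    "content": ["short text", "a much longer piece of text " * 100],
})

# deep=True counts the actual string payloads, not just the object pointers.
bytes_per_column = sample.memory_usage(deep=True)
total_mb = bytes_per_column.sum() / 1024 ** 2

print(bytes_per_column)
print(f"Total: {total_mb:.3f} MB")
```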

Let's start the text analysis section by installing and importing [`textblob`](https://textblob.readthedocs.io/en/dev/), a really robust text processing Python library, and download some helpful items from [`nltk`](https://www.nltk.org/).

In [0]:
%%capture

!pip install textblob
!python -m nltk.downloader all

import nltk
from textblob import TextBlob

### Tokenization

Let's add a new column to our `pages` DataFrame that is the tokenized output of the `content` column.

**THIS TAKES A LONG TIME**

In [0]:
from nltk.tokenize import word_tokenize

# Tokenize the full text of every page; this is slow on the whole collection.
pages['tokenized_text'] = pages['content'].apply(word_tokenize)
pages
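
If tokenization is new to you, the sketch below shows the idea on a single sentence. It uses a simple regular expression as a stand-in, not NLTK's `word_tokenize`, so the results will differ on trickier input; `simple_tokenize` is a hypothetical name.

```python
import re

def simple_tokenize(text):
    """Split text into word tokens and single punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", text)

print(simple_tokenize("Bonjour, le monde!"))
```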


#### Let's add one more column, a count of tokenized words.

In [0]:
pages['tokenized_text_count'] = pages['tokenized_text'].apply(len)
pages

### Basic word count statistics

In [0]:
pages['tokenized_text_count'].mean()

In [0]:
pages['tokenized_text_count'].std()

In [0]:
pages['tokenized_text_count'].max()

In [0]:
pages['tokenized_text_count'].min()
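
For reference, here are the same four statistics computed with the standard library on a made-up list of counts. Note that Pandas' `.std()` is the sample standard deviation by default, which corresponds to `statistics.stdev` rather than `statistics.pstdev`.

```python
import statistics

# Hypothetical token counts for three pages.
token_counts = [2, 4, 6]

count_mean = statistics.mean(token_counts)
count_std = statistics.stdev(token_counts)  # sample standard deviation

print(count_mean, count_std, max(token_counts), min(token_counts))
```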

### Pages with the most words

Let's create a bar chart that shows the pages with the most words.

First, let's show the query to get the data for our chart.




In [0]:
word_count = pages[['url', 'tokenized_text_count']].sort_values(by='tokenized_text_count', ascending=False).head(25)
word_count

Next, let's create a bar chart of this.

In [0]:
word_count_chart = word_count.plot(kind='bar', x='url', figsize=(20,10))
word_count_chart.set_title('Pages with most words\n', fontsize=22)
word_count_chart.set_ylabel('Count', fontsize=20)
word_count_chart.set_xlabel('Page', fontsize=20)
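
Full URLs can make for very long tick labels on a chart like this. One option, sketched here, is to shorten them first; `shorten_label` and its `max_len` default are hypothetical choices, not something the original chart uses.

```python
from urllib.parse import urlparse

def shorten_label(url, max_len=40):
    """Drop the scheme and truncate long URLs so they fit as axis labels."""
    parsed = urlparse(url)
    label = parsed.netloc + parsed.path
    if len(label) > max_len:
        # Reserve three characters for the trailing ellipsis.
        label = label[:max_len - 3] + "..."
    return label

print(shorten_label("http://www.mddep.gouv.qc.ca/index_en.asp"))
```

You could then plot against a shortened column, for example `word_count.assign(url=word_count['url'].map(shorten_label))`.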


### Sentiment Analysis

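TextBlob's `sentiment` property returns a polarity score (from -1.0 to 1.0) and a subjectivity score (from 0.0 to 1.0). Here we compute it over the text of the first 10,000 pages. Keep in mind that TextBlob's default analyzer is built for English, so scores for French-language pages in this collection should be read with caution.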


In [0]:
# The full-text column is `content`; join pages with a space so words are not fused together.
text_blob = TextBlob(" ".join(pages['content'].head(10000).values))
text_blob.sentiment

### n-grams

In [0]:
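from itertools import islice
from nltk.util import ngrams

# ngrams() is lazy and expects a sequence of tokens, so pass the blob's words and preview the first ten 5-grams.
list(islice(ngrams(text_blob.words, 5), 10))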
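
An n-gram is just a run of `n` consecutive tokens. The plain-Python sketch below shows the idea on a short, made-up token list; `word_ngrams` is a hypothetical helper that mirrors what `nltk.util.ngrams` does for simple input, not its actual implementation.

```python
def word_ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    # zip() over n shifted copies of the list stops at the shortest one.
    return list(zip(*(tokens[i:] for i in range(n))))

print(word_ngrams(["a", "b", "c", "d"], 2))
```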

### Word Cloud

Word clouds are always fun, right?!

Let's set up some dependencies here: install the [`word_cloud`](https://github.com/amueller/word_cloud) library, and set up some stop words via `nltk`.


In [0]:
%%capture

!pip install wordcloud
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator
from nltk.corpus import stopwords

french_stop_words = set(stopwords.words('french'))

In [0]:
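# The full-text column is `content`; join pages with a space so words at page boundaries are not fused together.
all_text = " ".join(pages['content'].values)
wordcloud = WordCloud(stopwords=french_stop_words, width=2000, height=1500, scale=10,
                      max_font_size=250, max_words=100, background_color="white").generate(all_text)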
plt.figure(figsize=[35,10])
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis("off")
plt.show()
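
A word cloud is a picture of word frequencies with the stop words left out. If you want the numbers behind the picture, a sketch along these lines may help. It uses a simple regular expression and a tiny, made-up stop word set rather than the `nltk` list, and `top_words` is a hypothetical helper.

```python
import re
from collections import Counter

def top_words(text, stop_words, limit=5):
    """Count lowercase word tokens, skipping stop words, and return the most common."""
    tokens = re.findall(r"\w+", text.lower())
    counts = Counter(token for token in tokens if token not in stop_words)
    return counts.most_common(limit)

# Hypothetical subset of French stop words, for illustration only.
sample_stop_words = {"le", "la", "de", "et"}
sample_text = "La forêt et la rivière et la forêt"

print(top_words(sample_text, sample_stop_words))
```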