# Cleaning the NOW corpus
This notebook demonstrates the cleaning process that will be used for the ADA project. Only a sample of the data is used here ([available from corpusdata.org](https://www.corpusdata.org/now_corpus.asp)), but the same methods should apply once scaled to the full database.

The NOW database is composed of billions of words taken from online newspapers and magazines in 20 different countries. The data we downloaded comes in several files, which can be used together or independently. These files are:

1. **now-samples-lexicon.txt**: the full dictionary of the English language (a lexicon). It contains four columns: `wID`, the word ID; `word`, the actual word; `lemma`, the base form the word belongs to (e.g. if the word is "walked", the lemma is "walk"); and `PoS`, the part of speech.
2. **now-samples_sources.txt**: the source of every text. In order, it contains the text ID, the number of words, the date, the country, the website, the URL and the title of the article.
3. **text.txt**: the complete texts of the articles. The first column is the `textID` in the format `@@textID`; the second column is the full text, complete with HTML paragraph and header tags. Note that, to prevent plagiarism, 10 words out of every 200 are replaced by the string "@ @ @ @ @ @ @ @ @ @". Contractions are also split (for example, "can't" is written as "ca n't"), and punctuation is surrounded by spaces.
4. **wordLemPoS.txt**: the `word`, `lemma` and `PoS` for each word in the texts, one word per row, so the texts can be read by reading down the columns. Each row also carries the `textID` of the text the word comes from and an `ID (seq)` column whose purpose we have not yet determined.
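
To make the `text.txt` layout described above concrete, here is a minimal sketch in plain Python (no Spark) that splits one raw line into its `textID` and body and counts the masked spans. The helper names `parse_text_line` and `count_masked_spans` are hypothetical, the sample line is made up rather than taken from the corpus, and the pattern assumes that each line starts with `@@` followed by digits and whitespace.

```python
import re

# The placeholder described above: ten "@" signs separated by single spaces.
MASK = " ".join(["@"] * 10)

# Assumed line layout: "@@<digits>", whitespace, then the article text.
LINE_PATTERN = re.compile(r"^@@(\d+)\s+(.*)$", re.DOTALL)


def parse_text_line(line):
    """Split one raw line into (textID, body); return None if it does not match."""
    match = LINE_PATTERN.match(line)
    if match is None:
        return None
    return match.group(1), match.group(2)


def count_masked_spans(body):
    """Count how many ten-word placeholders occur in the body."""
    return body.count(MASK)


# Made-up line for illustration only; it is not taken from the corpus.
sample = "@@99999 <p> It ca n't rain all week , @ @ @ @ @ @ @ @ @ @ said the forecaster . </p>"
text_id, body = parse_text_line(sample)
print(text_id, count_masked_spans(body))
```

The ID is kept as a string here, since a regular-expression extraction also yields strings; casting to an integer would be a separate decision.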

In [1]:
import findspark
findspark.init()

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import re
from pyspark.sql import *
from pyspark.sql.functions import *

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

## text.txt

In [2]:
# Load the raw file and keep only lines longer than 20 characters.
text_rdd = sc.textFile('sample_data/text.txt') \
            .filter(lambda r: len(r) > 20)

In [3]:
# Wrap each line in a Row so that Spark can build a single-column DataFrame.
text_raw_schema = text_rdd.map(lambda r: Row(text=r))
text_raw = spark.createDataFrame(text_raw_schema)

In [7]:
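# Extract the numeric ID from the leading "@@textID" marker (raw string avoids escape warnings).
text_raw = text_raw.withColumn('textID', regexp_extract('text', r'^@@(\d+)', 1))
# Strip the "@@textID " prefix so that only the article body remains in `text`.
text = text_raw.withColumn('text', regexp_replace('text', r'^@@\d+ ', '')) \
               .select('text', 'textID')
text.show()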

+--------------------+------+
|                text|textID|
+--------------------+------+
|<p> Sol Yurick , ...| 11241|
|<h> That 's What ...| 11242|
|<h> A sublime cro...| 11243|
|<h> Reflecting on...| 11244|
|<h> Ask Ars : Doe...| 21242|
|<p> NEW YORK -- A...| 21243|
|<p> IRELAND 'S Ol...| 31240|
|<h> Shakira launc...| 31241|
|<p> ENTREPRENEUR ...| 31242|
|<p> Syrian women ...| 41240|
|<h> Published byS...| 41241|
|<h> The Bay Bridg...| 41244|
|<h> MPAA Lobbies ...| 51243|
|<h> Mum 's fight ...| 61240|
|<h> IPPC to inves...| 61242|
|<p> North America...| 71240|
|<h> James Ferguss...| 71241|
|<h> From Richard ...| 71242|
|<h> ' Incompatibl...| 71243|
|<h> Mary Leakey ,...| 71244|
+--------------------+------+
only showing top 20 rows
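
The rows above still carry the `<p>` and `<h>` markers, and the bodies may contain the ten-`@` placeholders. One possible way to strip both is sketched below in plain Python. This is an illustration under assumptions, not necessarily the step this project ends up using: the function name `strip_markup` is hypothetical, the sample string is made up, and the tag pattern assumes that only `<p>` and `<h>` tags (with optional closing forms) occur.

```python
import re

# Ten "@" signs separated by single spaces, as described for text.txt.
MASK_PATTERN = re.compile(r"(?:@ ){9}@")
# Assumption: only <p>, </p>, <h> and </h> appear in the texts.
TAG_PATTERN = re.compile(r"</?[ph]>")


def strip_markup(body, mask_token=""):
    """Remove paragraph/header tags and masked spans, then collapse whitespace."""
    without_tags = TAG_PATTERN.sub(" ", body)
    without_masks = MASK_PATTERN.sub(mask_token, without_tags)
    return " ".join(without_masks.split())


# Made-up text for illustration only; it is not taken from the corpus.
sample = "<h> A made-up headline <p> Some words @ @ @ @ @ @ @ @ @ @ and more words ."
print(strip_markup(sample))
```

Passing a non-empty `mask_token` keeps a visible marker where words were removed, which may matter if later analysis depends on word positions.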

