# Cleaning the NOW corpus
This notebook shows the cleaning process that will be used for the ADA project. Here, only a sample of the data is used (from [here](https://www.corpusdata.org/now_corpus.asp)), but the methods should be the same once scaled to the full database.

The NOW database is composed of billions of words from online newspapers and magazines from 20 different countries. The data we downloaded comes in different files which can be used together or independently. These files are:

1. **now-samples-lexicon.txt**: this is the lexicon, the full dictionary of the English language. It contains four columns: `wID` (the word ID), `word` (the actual word), `lemma` (the base form of the word, e.g. if the word is "walked", the lemma is "walk") and `PoS` (the part of speech).
2. **now-samples_sources.txt**: this is the source of every text. In order, it contains the text ID, the number of words, the date, the country, the website, the URL and the title of the article.
3. **text.txt**: this file has the complete texts of the articles. The first column is the `textID` in the format `@@textID`, and the second column is the full text, complete with HTML paragraph and header tags. Note that, to prevent plagiarism, 10 words out of every 200 are replaced by the string "@ @ @ @ @ @ @ @ @ @". Contractions are also split (for example, "can't" is written as "ca n't"), and punctuation is surrounded by spaces.
4. **wordLemPoS.txt**: finally, this file contains the `word`, `lemma` and `PoS` for each word in the texts, one per line, so one can read a text by reading down the columns. Each line also carries the `textID` of the text the word comes from and an `ID (seq)` whose purpose we have not yet determined.
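
To make the `text.txt` conventions described above concrete, here is a minimal plain-Python sketch, independent of Spark. It splits one line into its ID and body, removes the `<p>` and `<h>` markers, and counts the masked spans. The function name `parse_text_line` and the sample line are made up for illustration, and the sketch assumes that the mask is exactly ten `@` signs separated by single spaces.

```python
import re

# Assumption: the mask is ten "@" signs separated by single spaces.
MASK = " ".join(["@"] * 10)
# Only the <p> and <h> markers seen in the sample are handled here.
TAG_RE = re.compile(r"<(?:p|h)>")
LINE_RE = re.compile(r"^@@(\d+)\s+(.*)$", re.DOTALL)


def parse_text_line(line):
    """Return (text_id, cleaned_text, n_masked_spans) for one line of text.txt."""
    match = LINE_RE.match(line)
    if match is None:
        raise ValueError("line does not start with @@textID")
    text_id = int(match.group(1))
    body = match.group(2)
    n_masked = body.count(MASK)          # number of masked 10-word spans
    body = body.replace(MASK, " ")       # drop the masks
    body = TAG_RE.sub(" ", body)         # drop paragraph/header markers
    cleaned = " ".join(body.split())     # collapse runs of whitespace
    return text_id, cleaned, n_masked


# Hypothetical line written in the format described above.
sample = "@@11241 <p> Sol Yurick did n't stop @ @ @ @ @ @ @ @ @ @ writing . <h> End"
text_id, cleaned, n_masked = parse_text_line(sample)
print(text_id, n_masked)
print(cleaned)
```

The split contractions ("did n't") are left untouched on purpose: they match the tokens used in the word/lemma/PoS file.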

In [30]:
import findspark
findspark.init()

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import re
from pyspark.sql import *
from pyspark.sql.functions import *

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

In [52]:
# pip install stop-words
from stop_words import get_stop_words
stop_words = get_stop_words('en')
len(stop_words)

174

## Word-Lemma-PoS processing


The goal of this part is to extract the useful data from the wlp (word-lemma-PoS) text files, since they contain all the words of all the articles along with the lemmas to replace them with.
This code serves as an example of how to process such a file on the cluster.
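
Before moving to Spark, the row format can be illustrated with a small plain-Python sketch. It assumes five tab-separated columns in the order text ID, sequence ID, word, lemma, PoS. The names `WlpToken` and `parse_wlp_line` and the header line in the sample are hypothetical. Unlike the cells in this notebook, which drop the first three lines explicitly, this sketch skips any line that is not a five-column row with numeric IDs.

```python
from typing import NamedTuple, Optional


class WlpToken(NamedTuple):
    textid: int
    idseq: int
    word: str
    lemma: str
    pos: str


def parse_wlp_line(line: str) -> Optional[WlpToken]:
    """Parse one tab-separated line; return None if it is not a data row."""
    parts = line.rstrip("\n").split("\t")
    if len(parts) != 5 or not (parts[0].isdigit() and parts[1].isdigit()):
        return None
    return WlpToken(int(parts[0]), int(parts[1]), parts[2], parts[3], parts[4])


rows = [
    "textID\tID (seq)\tword\tlemma\tPoS",   # hypothetical header line
    "11241\t1095362496\t@@11241\t\tfo",     # text marker, empty lemma
    "11241\t1095362498\tSol\tsol\tnp1",
    "11241\t1095362500\t,\t\t,",            # punctuation, empty lemma
]
tokens = [t for t in map(parse_wlp_line, rows) if t is not None]
# Keep only non-empty lemmas, as the Spark pipeline does.
lemmas = [t.lemma for t in tokens if t.lemma]
print(lemmas)
```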

In [31]:
#first read the text file
wlp_rdd = sc.textFile('sample_data/wordLem_poS.txt')

In [32]:
#the first 3 lines are header lines that carry no data
header = wlp_rdd.take(3)

In [33]:
#so let's remove those header lines
noheaders = wlp_rdd.filter(lambda l: l not in header)

In [34]:
#we split the elements separated by tabs
lines = noheaders.map(lambda l: l.split("\t"))
#identify the columns
frame = lines.map(lambda p: Row(textid=int(p[0]),idseq=int(p[1]),word=p[2],lemma=p[3],pos=p[4]))
df = spark.createDataFrame(frame)

In [35]:
df.show(5)

+----------+------+----+------+-------+
|     idseq| lemma| pos|textid|   word|
+----------+------+----+------+-------+
|1095362496|      |  fo| 11241|@@11241|
|1095362497|      |null| 11241|    <p>|
|1095362498|   sol| np1| 11241|    Sol|
|1095362499|yurick| np1| 11241| Yurick|
|1095362500|      |   ,| 11241|      ,|
+----------+------+----+------+-------+
only showing top 5 rows



In [36]:
#keep only useful information
df = df.drop("idseq", "pos", "word").filter(df["lemma"]!='')

In [37]:
df.show(5)

+------+------+
| lemma|textid|
+------+------+
|   sol| 11241|
|yurick| 11241|
|   the| 11241|
|writer| 11241|
| whose| 11241|
+------+------+
only showing top 5 rows



In [59]:
#remove the stop words
df3 = df.filter(~df["lemma"].isin(stop_words))

In [60]:
df3.groupBy("lemma").count().sort("count", ascending=False).show(200)

+-------------+-----+
|        lemma|count|
+-------------+-----+
|           's| 9878|
|          say| 9221|
|         will| 6407|
|            '| 5976|
|         year| 4272|
|          one| 3897|
|          can| 3860|
|          n't| 3841|
|         make| 3448|
|         also| 3361|
|         time| 3169|
|          new| 3122|
|          get| 3031|
|       people| 2913|
|           go| 2859|
|         take| 2667|
|         like| 2356|
|          use| 2244|
|        first| 2155|
|         work| 2137|
|           us| 2134|
|         come| 2116|
|          see| 2115|
|         just| 2082|
|          two| 2055|
|          now| 1943|
|          day| 1819|
|         last| 1769|
|        state| 1713|
|      company| 1698|
|      comment| 1667|
|         need| 1654|
|         know| 1652|
|          may| 1611|
|         want| 1579|
|         look| 1564|
|        world| 1553|
|   government| 1551|
|            -| 1524|
|         show| 1480|
|         give| 1480|
|         many| 1468|
|      cou

In [23]:
#collect the lemmas of each text into a single list
df2 = df.groupBy("textid").agg(collect_list("lemma").alias("lemma"))\
    .sort("textid")

In [24]:
df2.show(5)

+------+--------------------+
|textid|               lemma|
+------+--------------------+
| 11241|[, , sol, yurick,...|
| 11242|[, , that, be, wh...|
| 11243|[, , a, sublime, ...|
| 11244|[, , reflect, on,...|
| 21242|[, , ask, ars, , ...|
+------+--------------------+
only showing top 5 rows



## text.txt

In [12]:
#read the texts and drop lines of 20 characters or fewer
text_rdd = sc.textFile('sample_data/text.txt') \
            .filter(lambda r: len(r)>20)

In [13]:
text_raw_schema = text_rdd.map(lambda r: Row(text=r)) 
text_raw = spark.createDataFrame(text_raw_schema)

In [14]:
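#the textID is the first run of digits on the line (right after the leading @@)
text_raw = text_raw.withColumn('textID', regexp_extract('text', r'(\d+)', 1))
#strip the leading "@@textID " marker from the text itself
text = text_raw.rdd.map(lambda r: (re.sub(r'@@\d+ ', '', r[0]), r[1])).map(lambda r: Row(text=r[0], textID=r[1])).toDF()
text.show()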

+--------------------+------+
|                text|textID|
+--------------------+------+
|<p> Sol Yurick , ...| 11241|
|<h> That 's What ...| 11242|
|<h> A sublime cro...| 11243|
|<h> Reflecting on...| 11244|
|<h> Ask Ars : Doe...| 21242|
|<p> NEW YORK -- A...| 21243|
|<p> IRELAND 'S Ol...| 31240|
|<h> Shakira launc...| 31241|
|<p> ENTREPRENEUR ...| 31242|
|<p> Syrian women ...| 41240|
|<h> Published byS...| 41241|
|<h> The Bay Bridg...| 41244|
|<h> MPAA Lobbies ...| 51243|
|<h> Mum 's fight ...| 61240|
|<h> IPPC to inves...| 61242|
|<p> North America...| 71240|
|<h> James Ferguss...| 71241|
|<h> From Richard ...| 71242|
|<h> ' Incompatibl...| 71243|
|<h> Mary Leakey ,...| 71244|
+--------------------+------+
only showing top 20 rows

