# Cleaning the NOW corpus
This notebook shows the cleaning process that will be used for the ADA project. Here, only a sample of the data is used (from [here](https://www.corpusdata.org/now_corpus.asp)), but the methods should be the same once scaled to the full database available on the cluster.

The NOW database is composed of billions of words from online newspapers and magazines from 20 different countries. The data we downloaded comes in different files which can be used together or independently. These files are:

1. **now-samples-lexicon.txt**: this is the lexicon, a full dictionary of the English language. It contains four columns: `wID`, the word id; `word`, the actual word; `lemma`, the base form of the word (e.g. if the word is "walked", the lemma is "walk"); and `PoS`, the part of speech.
2. **now-samples-sources.txt**: this is the source of every text. In order, it contains the text id, the number of words, the date, the country, the website, the URL and the title of the article.
3. **text.txt**: this file has the complete texts of the articles. The first column is the `textID` in the format `@@textID`; the second column is the full text, complete with HTML paragraph and header tags. It is important to note that, to prevent plagiarism, 10 words out of every 200 are replaced by the string "@ @ @ @ @ @ @ @ @ @". Contractions are also split (for example, "can't" is written as "ca n't") and punctuation is surrounded by spaces.
4. **wordLemPoS.txt**: finally, this file contains the `word`, `lemma` and `PoS` of each word in the texts, one word per line, so the texts can be read by reading down the `word` column. Each line also carries the `textID` of the text the word comes from and an `ID (seq)` whose purpose we have not identified.
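
As an illustration of the `text.txt` conventions described above, here is a minimal sketch of how one raw line could be turned back into readable text in plain Python. The function name `clean_text_line`, the list of re-attached clitics and the sample line are our own assumptions, not part of the corpus tooling, and the tag pattern assumes markers of the form `<p>` and `<h>`.

```python
import re

# The placeholder that replaces 10 words out of every 200 in text.txt.
MASK = "@ @ @ @ @ @ @ @ @ @"

def clean_text_line(line):
    """Return (text_id, cleaned_text) for one raw line of text.txt."""
    match = re.match(r"@@(\d+)\s+(.*)", line, flags=re.DOTALL)
    if match is None:
        raise ValueError("line does not start with an @@textID marker")
    text_id, body = int(match.group(1)), match.group(2)
    body = body.replace(MASK, " ")         # drop the masked spans
    body = re.sub(r"<[a-z]+>", " ", body)  # drop <p> / <h> style markers (assumed form)
    # Re-attach split contractions such as "ca n't" or "It 's" (assumed list of clitics).
    body = re.sub(r"\s+(n't|'s|'re|'ve|'ll|'d|'m)\b", r"\1", body)
    body = re.sub(r"\s+([,.;:!?])", r"\1", body)  # re-attach punctuation
    return text_id, re.sub(r"\s+", " ", body).strip()

# Made-up sample line, for illustration only.
raw = "@@42 <h> A made-up headline <p> We ca n't stop . @ @ @ @ @ @ @ @ @ @ It 's late ."
print(clean_text_line(raw))
```

Whether the masked spans should be dropped or kept as an explicit marker depends on the analysis; here they are simply removed.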

In [1]:
import findspark
findspark.init()

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import re
from pyspark.sql import *
from pyspark.sql.functions import *
from pyspark.sql.types import DateType

#from pyspark.sql import SparkSession
#from pyspark import SparkContext

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

In [2]:
# pip install stop-words
from stop_words import get_stop_words
stop_words = get_stop_words('en')
len(stop_words)

174

In [3]:
not_these_words = pd.read_csv('stop-words.txt',delimiter=', ',header=None, engine='python').T

In [4]:
stopwords = not_these_words[0].values.tolist()

## Word-Lemma-PoS processing


The goal of this part is to extract the useful data from the wlp (word-lemma-PoS) text files, which contain every word of every article together with the lemma to replace it with.
This code serves as an example of how to process such a file on the cluster.
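
When scaling to the full database, a single malformed line is enough to make an `int(...)` conversion fail. The sketch below shows one possible defensive parser in plain Python; the helper name `parse_wlp_line` is hypothetical, and the column order (textID, idseq, word, lemma, PoS) is taken from the parsing code in this notebook rather than from official documentation.

```python
def parse_wlp_line(line):
    """Parse one tab-separated wlp line; return None for headers or malformed lines."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 5:
        return None  # wrong number of columns: header or damaged line
    text_id, id_seq, word, lemma, pos = fields
    try:
        return {
            "textID": int(text_id),
            "idseq": int(id_seq),
            "word": word,
            "lemma": lemma,
            "pos": pos,
        }
    except ValueError:
        return None  # the id columns are not numeric

print(parse_wlp_line("11241\t1095362498\tSol\tsol\tnp1"))
print(parse_wlp_line("some header text"))
```

On an RDD, such a function could be applied with `map` followed by a `filter` that drops the `None` results, which would also remove the header lines without comparing against them explicitly.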

In [5]:
#first read the text file
wlp_rdd = sc.textFile('sample_data/wordLem_poS.txt')

In [6]:
#the first 3 lines are header lines, not data
header = wlp_rdd.take(3)

#so let's remove those header lines
noheaders = wlp_rdd.filter(lambda r: r != header[0])\
            .filter(lambda r: r != header[1])\
            .filter(lambda r: r != header[2])

In [7]:
#we split the elements separated by tabs
lines = noheaders.map(lambda r: r.split('\t'))
#identify the columns
wlp_schema = lines.map(lambda r: Row(textID=int(r[0]),idseq=int(r[1]),word=r[2],lemma=r[3],pos=r[4]))
wlp = spark.createDataFrame(wlp_schema)

In [8]:
wlp.show(5)

+----------+------+----+------+-------+
|     idseq| lemma| pos|textID|   word|
+----------+------+----+------+-------+
|1095362496|      |  fo| 11241|@@11241|
|1095362497|      |null| 11241|    <p>|
|1095362498|   sol| np1| 11241|    Sol|
|1095362499|yurick| np1| 11241| Yurick|
|1095362500|      |   ,| 11241|      ,|
+----------+------+----+------+-------+
only showing top 5 rows



In [9]:
#keep only useful information
wlp = wlp.drop('idseq', 'pos', 'word').filter(wlp['lemma']!='')
wlp.show(5)

+------+------+
| lemma|textID|
+------+------+
|   sol| 11241|
|yurick| 11241|
|   the| 11241|
|writer| 11241|
| whose| 11241|
+------+------+
only showing top 5 rows



In [22]:
#drop the stop words (~ negates the isin condition)
wlp_nostop = wlp.filter(~wlp['lemma'].isin(stopwords))
wlp_nostop.groupBy('lemma').count().sort('count', ascending=False).show()

+----------+-----+
|     lemma|count|
+----------+-----+
|        's| 9878|
|         '| 5976|
|      year| 4272|
|       n't| 3841|
|      time| 3169|
|    people| 2913|
|      take| 2667|
|       use| 2244|
|      work| 2137|
|       day| 1819|
|     state| 1713|
|   company| 1698|
|   comment| 1667|
|      need| 1654|
|      want| 1579|
|      look| 1564|
|     world| 1553|
|government| 1551|
|         -| 1524|
|      show| 1480|
+----------+-----+
only showing top 20 rows



In [11]:
wlp_bytext=wlp.groupBy('textID').agg(collect_list('lemma'))\
            .sort('textID')\
            .withColumnRenamed('collect_list(lemma)','lemma')
wlp_bytext.show(5)

+------+--------------------+
|textID|               lemma|
+------+--------------------+
| 11241|[sol, yurick, the...|
| 11242|[that, be, what, ...|
| 11243|[a, sublime, croi...|
| 11244|[reflect, on, a, ...|
| 21242|[ask, ars, do, fa...|
+------+--------------------+
only showing top 5 rows



## List all lemmas to extract the most and least common
This part is by Luca. It counts how often each lemma occurs so that the most and least common lemmas can be removed from the rest of our word lists. It is an alternative to using stop words, and it uses SQL (because why not).
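
The idea of frequency-based filtering can be sketched in plain Python on a toy example. The function name `frequency_filter`, the lower bound `min_count` and the toy data are assumptions made for illustration; only the upper bound of 7000 occurrences comes from this notebook.

```python
from collections import Counter

def frequency_filter(lemmas_by_text, min_count=2, max_count=7000):
    """Keep only lemmas whose corpus-wide count is in [min_count, max_count)."""
    counts = Counter(
        lemma for lemmas in lemmas_by_text.values() for lemma in lemmas
    )
    keep = {lemma for lemma, n in counts.items() if min_count <= n < max_count}
    return {
        text_id: [lemma for lemma in lemmas if lemma in keep]
        for text_id, lemmas in lemmas_by_text.items()
    }

# Toy corpus: "the" is too common, "dog" is too rare, "cat" is kept.
toy = {1: ["the", "cat", "the"], 2: ["the", "dog", "cat"]}
print(frequency_filter(toy, min_count=2, max_count=3))
```

Unlike a fixed stop-word list, the thresholds adapt to the corpus, but they have to be tuned again when moving from the sample to the full data.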

In [10]:
all_lemma = wlp.drop('textID')
all_lemma.show(5)

+------+
| lemma|
+------+
|   sol|
|yurick|
|   the|
|writer|
| whose|
+------+
only showing top 5 rows



In [18]:
lemma_freq = all_lemma.groupby('lemma').count().sort('count', ascending=False)
lemma_tokeep = lemma_freq.where('count<7000')

In [17]:
lemma_tokeep.where('lemma=="sol"').show()

+-----+-----+
|lemma|count|
+-----+-----+
|  sol|    4|
+-----+-----+



In [12]:
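#register both DataFrames as temporary views so they can be queried with SQL
#(createOrReplaceTempView replaces registerTempTable, deprecated since Spark 2.0)
wlp.createOrReplaceTempView('wlp')
lemma_tokeep.createOrReplaceTempView('lemma_tokeep')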

In [27]:
query = """
SELECT wlp.lemma, wlp.textID
FROM wlp
INNER JOIN lemma_tokeep ON wlp.lemma = lemma_tokeep.lemma
order by textID
"""

wlp_kept = spark.sql(query)
wlp_kept_bytext = wlp_kept.groupBy('textID').agg(collect_list('lemma'))\
                    .sort('textID')\
                    .withColumnRenamed('collect_list(lemma)','lemma')
wlp_kept_bytext.show(5)

+------+--------------------+
|textID|               lemma|
+------+--------------------+
| 11241|[1970s, some, dec...|
| 11242|[online, some, so...|
| 11243|[art, some, some,...|
| 11244|[..., ..., art, a...|
| 21242|[..., online, som...|
+------+--------------------+
only showing top 5 rows



## text.txt

In [50]:
text_rdd = sc.textFile('sample_data/text.txt') \
            .filter(lambda r: len(r)>20)

In [51]:
text_raw_schema = text_rdd.map(lambda r: Row(text=r)) 
text_raw = spark.createDataFrame(text_raw_schema)

In [52]:
#pull the id out of the leading @@textID marker, then strip the marker from the text
text_raw = text_raw.withColumn('textID', regexp_extract('text', r'(\d+)', 1))
text = text_raw.rdd.map(lambda r: (re.sub(r'@@\d+ ', '', r[0]), r[1]))\
            .map(lambda r: Row(text=r[0], textID=int(r[1]))).toDF()
text.show(5)

+--------------------+------+
|                text|textID|
+--------------------+------+
|<p> Sol Yurick , ...| 11241|
|<h> That 's What ...| 11242|
|<h> A sublime cro...| 11243|
|<h> Reflecting on...| 11244|
|<h> Ask Ars : Doe...| 21242|
+--------------------+------+
only showing top 5 rows



## now-samples-sources.txt

In [53]:
sources_rdd = sc.textFile('sample_data/now-samples-sources.txt')\
                .map(lambda r: r.split('\t'))

header = sources_rdd.take(3)
sources_rdd = sources_rdd.filter(lambda l: l != header[0])\
                .filter(lambda l: l != header[1])\
                .filter(lambda l: l != header[2])

In [54]:
#name the columns and cast the numeric ones
sources_schema = sources_rdd.map(lambda r: Row(textID=int(r[0]), nwords=int(r[1]), date=r[2], country=r[3], website=r[4], url=r[5], title=r[6]))
sources = spark.createDataFrame(sources_schema)
sources = sources.withColumn('date',to_date(sources.date, 'yy-MM-dd'))
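
The same parsing can be checked on a single line without Spark. The sketch below is a plain-Python equivalent under stated assumptions: the helper name `parse_source_line` and the sample line (including its URL and title) are made up, and the two-digit-year date format is inferred from the `yy-MM-dd` pattern used above.

```python
from datetime import date, datetime

# Column order as used in this notebook.
SOURCE_FIELDS = ("textID", "nwords", "date", "country", "website", "url", "title")

def parse_source_line(line):
    """Parse one tab-separated sources line; return None if it is not a data line."""
    values = line.rstrip("\n").split("\t")
    if len(values) != len(SOURCE_FIELDS):
        return None
    record = dict(zip(SOURCE_FIELDS, values))
    try:
        record["textID"] = int(record["textID"])
        record["nwords"] = int(record["nwords"])
        # %y-%m-%d is the strptime counterpart of Spark's yy-MM-dd pattern.
        record["date"] = datetime.strptime(record["date"], "%y-%m-%d").date()
    except ValueError:
        return None
    return record

# Made-up sample line, for illustration only.
sample = "11241\t397\t13-01-06\tUS\tKotaku\thttp://example.com/a\tA made-up title"
print(parse_source_line(sample))
```

Returning `None` for lines that do not parse is one way to skip the header lines without comparing each line against them.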

In [55]:
sources.printSchema()

root
 |-- country: string (nullable = true)
 |-- date: date (nullable = true)
 |-- nwords: long (nullable = true)
 |-- textID: long (nullable = true)
 |-- title: string (nullable = true)
 |-- url: string (nullable = true)
 |-- website: string (nullable = true)



In [56]:
sources.show(5)

+-------+----------+------+------+--------------------+--------------------+-------------------+
|country|      date|nwords|textID|               title|                 url|            website|
+-------+----------+------+------+--------------------+--------------------+-------------------+
|     US|2013-01-06|   397| 11241|Author of The War...|http://kotaku.com...|             Kotaku|
|     US|2013-01-06|   757| 11242|That's What They ...|http://michiganra...|     Michigan Radio|
|     US|2013-01-06|   755| 11243|Best of New York:...|http://www.nydail...|New York Daily News|
|     US|2013-01-06|  1677| 11244|Reflecting on a q...|http://www.oregon...|     OregonLive.com|
|     US|2013-01-11|   794| 21242|Ask Ars: Does Fac...|http://arstechnic...|       Ars Technica|
+-------+----------+------+------+--------------------+--------------------+-------------------+
only showing top 5 rows

