# Exercise 2. Preprocessing
### Text, Web and Social Media Analytics Lab 

In this exercise, we use the 20-Newsgroups dataset. This version of the dataset contains about 11k newsgroup posts on 20 different topics. We will perform the following steps:
- A) Import and examine data
- B) Remove initial text metadata
- C) Remove numbers, punctuation, tabs and convert to lower case with gensim
- D) Remove stop words and short words
- E) Stemming and Lemmatization

We first import all the necessary libraries.

In [1]:
import pandas as pd
import re
from gensim.parsing.preprocessing import STOPWORDS, strip_tags, strip_numeric, strip_punctuation, strip_multiple_whitespaces, remove_stopwords, strip_short, stem_text
import pickle
import en_core_web_sm
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


## Part A: Import and examine data

We read the JSON file from the URL and then show the head of the dataframe to see what our data looks like.

In [2]:
df = pd.read_json('https://raw.githubusercontent.com/selva86/datasets/master/newsgroups.json')
df.head()

Unnamed: 0,content,target,target_names
0,From: lerxst@wam.umd.edu (where's my thing)\nS...,7,rec.autos
1,From: guykuo@carson.u.washington.edu (Guy Kuo)...,4,comp.sys.mac.hardware
2,From: twillis@ec.ecn.purdue.edu (Thomas E Will...,4,comp.sys.mac.hardware
3,From: jgreen@amber (Joe Green)\nSubject: Re: W...,1,comp.graphics
4,From: jcm@head-cfa.harvard.edu (Jonathan McDow...,14,sci.space


We now show more detailed information about our dataset, including the number of entries, as well as the column names, the count of non-nulls and the data type of each column.

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 11314 entries, 0 to 11313
Data columns (total 3 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   content       11314 non-null  object
 1   target        11314 non-null  int64 
 2   target_names  11314 non-null  object
dtypes: int64(1), object(2)
memory usage: 353.6+ KB


We now show all the target names available in our dataset, together with the number of records that belong to each target.

In [4]:
df['target_names'].value_counts()

rec.sport.hockey            600
soc.religion.christian      599
rec.motorcycles             598
rec.sport.baseball          597
sci.crypt                   595
rec.autos                   594
sci.med                     594
sci.space                   593
comp.windows.x              593
comp.os.ms-windows.misc     591
sci.electronics             591
comp.sys.ibm.pc.hardware    590
misc.forsale                585
comp.graphics               584
comp.sys.mac.hardware       578
talk.politics.mideast       564
talk.politics.guns          546
alt.atheism                 480
talk.politics.misc          465
talk.religion.misc          377
Name: target_names, dtype: int64

We now print the content of our first record to see what a raw post looks like and how it is structured.

In [5]:
print(df['content'][0])

From: lerxst@wam.umd.edu (where's my thing)
Subject: WHAT car is this!?
Nntp-Posting-Host: rac3.wam.umd.edu
Organization: University of Maryland, College Park
Lines: 15

 I was wondering if anyone out there could enlighten me on this car I saw
the other day. It was a 2-door sports car, looked to be from the late 60s/
early 70s. It was called a Bricklin. The doors were really small. In addition,
the front bumper was separate from the rest of the body. This is 
all I know. If anyone can tellme a model name, engine specs, years
of production, where this car is made, history, or whatever info you
have on this funky looking car, please e-mail.

Thanks,
- IL
   ---- brought to you by your neighborhood Lerxst ----







After reading the previous text, we print the corresponding target name to check that it matches the content.

In [6]:
print(df['target_names'][0])

rec.autos


## Part B: Remove initial text metadata

In order to remove the text metadata, we first create two lists. The first list contains words that start a specific line of metadata, which we want to remove. The second list contains specific words that do not bring any value to the meaning of the text and that we also want to remove.

In [7]:
starting_words = ['From:', 'Article-I.D.:', 'Organization:', 'Lines:', 'NNTP-Posting-Host:', 'Distribution:', 'Reply-To:', 'X-Newsreader:', 'Expires:', ' -']
single_words = ['Subject:', 'Summary:', 'Keywords:']

We now create a copy of our dataframe. For each marker in the first list, we delete the text from that marker through the end of its line. Note that the pattern is not anchored, so a marker is matched wherever it occurs in a line, not only at the start. For each word in the second list, we delete only the word itself and keep the rest of the line. This way, mostly the text that carries the meaning of the message remains.

In [8]:
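data = df.copy()

# Drop everything from each header marker to the end of its line.
# re.escape makes characters such as the dots in 'Article-I.D.:' match literally.
for word in starting_words:
    data['content'] = data['content'].apply(lambda content: re.sub(re.escape(word) + '.*\n', '', content, flags=re.IGNORECASE))

# Drop only the label itself and keep the text that follows it.
for word in single_words:
    data['content'] = data['content'].apply(lambda content: re.sub(re.escape(word), '', content, flags=re.IGNORECASE))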
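The cell above matches each marker anywhere in a line. As a point of comparison, the following is a minimal sketch of a stricter variant that only removes lines that actually begin with a header marker, using `re.MULTILINE` so that `^` matches at every line start. The names `HEADER_PREFIXES`, `INLINE_LABELS` and `strip_headers` are hypothetical and the marker lists are shortened; this is not the method used in the notebook.

```python
import re

# Hypothetical, shortened marker lists for illustration only.
HEADER_PREFIXES = ["From:", "Organization:", "Lines:", "Nntp-Posting-Host:"]
INLINE_LABELS = ["Subject:"]


def strip_headers(text):
    """Remove whole header lines, then remove inline labels but keep their text."""
    # ^ with re.MULTILINE anchors the match to the start of each line;
    # re.escape keeps characters in the markers from acting as regex syntax.
    prefixes = "|".join(re.escape(p) for p in HEADER_PREFIXES)
    text = re.sub(r"^(?:" + prefixes + r").*\n?", "", text,
                  flags=re.IGNORECASE | re.MULTILINE)
    labels = "|".join(re.escape(label) for label in INLINE_LABELS)
    return re.sub(labels, "", text, flags=re.IGNORECASE)


sample = (
    "From: lerxst@wam.umd.edu (where's my thing)\n"
    "Subject: WHAT car is this!?\n"
    "Nntp-Posting-Host: rac3.wam.umd.edu\n"
    "Organization: University of Maryland, College Park\n"
    "Lines: 15\n"
    "\n"
    "It was a 2-door sports car, looked to be from the late 60s.\n"
)

cleaned = strip_headers(sample)
print(cleaned)
```

Anchoring trades recall for precision: indented lines such as the signature in the first record would no longer be touched, so whether this variant is preferable depends on how much trailing noise is acceptable.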

We print the content of the first record to see how the text looks after removing the metadata.

In [9]:
print(data.loc[0, 'content'])

 WHAT car is this!?

 I was wondering if anyone out there could enlighten me on this car I saw
the other day. It was a 2-door sports car, looked to be from the late 60s/
early 70s. It was called a Bricklin. The doors were really small. In addition,
the front bumper was separate from the rest of the body. This is 
all I know. If anyone can tellme a model name, engine specs, years
of production, where this car is made, history, or whatever info you
have on this funky looking car, please e-mail.

Thanks,
- IL
  






## Part C: Remove numbers, punctuation, tabs and convert to lower case with gensim

We now perform a few more transformations: we remove all digits, replace punctuation symbols with spaces, collapse every run of whitespace (including tabs and line breaks) into a single space, and finally convert the whole text to lowercase. This way only individual words remain, each separated by a single space and in the same format.

In [10]:
data['content'] = (data['content'].apply(strip_numeric)
                                  .apply(strip_punctuation)
                                  .apply(strip_multiple_whitespaces)
                                  .str.lower())

We now print the content of the first record again to see how the text looks after the previous transformations.

In [11]:
print(data['content'][0])

 what car is this i was wondering if anyone out there could enlighten me on this car i saw the other day it was a door sports car looked to be from the late s early s it was called a bricklin the doors were really small in addition the front bumper was separate from the rest of the body this is all i know if anyone can tellme a model name engine specs years of production where this car is made history or whatever info you have on this funky looking car please e mail thanks il 
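To make the four steps of Part C concrete without installing gensim, here is a minimal sketch that approximates them with the standard library. The regular expressions are stand-ins written for this example and are only assumed to behave like `strip_numeric`, `strip_punctuation` and `strip_multiple_whitespaces` on simple ASCII input; the function name `normalize` is hypothetical.

```python
import re
import string

# Stand-ins for the gensim helpers (assumption: similar behaviour on ASCII text).
_DIGITS = re.compile(r"[0-9]+")
_PUNCT = re.compile("[" + re.escape(string.punctuation) + "]+")
_SPACES = re.compile(r"\s+")


def normalize(text):
    """Strip digits, turn punctuation into spaces, collapse whitespace, lowercase."""
    text = _DIGITS.sub("", text)    # "2-door" -> "-door"
    text = _PUNCT.sub(" ", text)    # "-door"  -> " door"
    text = _SPACES.sub(" ", text)   # tabs, newlines and repeated spaces -> one space
    return text.lower()


sample = "It was a 2-door sports car, looked to be from the late 60s/\nearly 70s."
normalized = normalize(sample)
print(normalized)
```

The order matters: punctuation is replaced by spaces rather than deleted, so the whitespace step has to run afterwards to merge the gaps it leaves behind. It also explains the stray single letters in the output, such as the "s" left over from "60s".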


## Part D: Remove stop words and short words

At the beginning, we imported stopwords from two different libraries. 

Here we print the stopwords from 'gensim', sorted alphabetically, to see what they look like.

In [12]:
print(sorted(STOPWORDS))

['a', 'about', 'above', 'across', 'after', 'afterwards', 'again', 'against', 'all', 'almost', 'alone', 'along', 'already', 'also', 'although', 'always', 'am', 'among', 'amongst', 'amoungst', 'amount', 'an', 'and', 'another', 'any', 'anyhow', 'anyone', 'anything', 'anyway', 'anywhere', 'are', 'around', 'as', 'at', 'back', 'be', 'became', 'because', 'become', 'becomes', 'becoming', 'been', 'before', 'beforehand', 'behind', 'being', 'below', 'beside', 'besides', 'between', 'beyond', 'bill', 'both', 'bottom', 'but', 'by', 'call', 'can', 'cannot', 'cant', 'co', 'computer', 'con', 'could', 'couldnt', 'cry', 'de', 'describe', 'detail', 'did', 'didn', 'do', 'does', 'doesn', 'doing', 'don', 'done', 'down', 'due', 'during', 'each', 'eg', 'eight', 'either', 'eleven', 'else', 'elsewhere', 'empty', 'enough', 'etc', 'even', 'ever', 'every', 'everyone', 'everything', 'everywhere', 'except', 'few', 'fifteen', 'fify', 'fill', 'find', 'fire', 'first', 'five', 'for', 'former', 'formerly', 'forty', 'found

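Here we print the stopwords from 'nltk'. Because `stopwords.words()` is called without a language argument, it returns the stopwords of every language that NLTK provides, which is why the list is much longer and contains repeated entries such as 'a'. To get only the English list, we would call `stopwords.words('english')`.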

In [13]:
print(sorted(stopwords.words()))

['a', 'a', 'a', 'a', 'a', 'a', 'a', 'aan', 'abban', 'abbia', 'abbiamo', 'abbiano', 'abbiate', 'aber', 'abia', 'about', 'above', 'acea', 'aceasta', 'această', 'aceea', 'aceeasi', 'acei', 'aceia', 'acel', 'acela', 'acelasi', 'acele', 'acelea', 'acest', 'acesta', 'aceste', 'acestea', 'acestei', 'acestia', 'acestui', 'aceşti', 'aceştia', 'ad', 'ad', 'ada', 'adalah', 'adanya', 'adapun', 'adica', 'af', 'after', 'again', 'against', 'agak', 'agaknya', 'agar', 'agl', 'agli', 'ahhoz', 'ahogy', 'ahol', 'ai', 'ai', 'ai', 'aia', 'aibă', 'aici', 'aie', 'aient', 'aies', 'ain', 'ait', 'akan', 'akankah', 'akhir', 'akhiri', 'akhirnya', 'aki', 'akik', 'akkor', 'aku', 'akulah', 'al', 'al', 'al', 'al', 'ala', 'alatt', 'ale', 'alea', 'algo', 'algunas', 'algunos', 'ali', 'all', 'all', 'alla', 'alla', 'alle', 'alle', 'alle', 'alle', 'allem', 'allen', 'aller', 'alles', 'alles', 'allo', 'allt', 'als', 'als', 'also', 'alt', 'alt', 'alta', 'altceva', 'altcineva', 'alte', 'altfel', 'alti', 'altii', 'altijd', 'altm

We perform two more transformations on the content of our dataset. We first remove the stopwords using the list from 'gensim', and then we remove all words shorter than three characters. The second step removes fragments left over from the previous transformations, such as the 's' from '60s' or the 'e' from 'e-mail', which do not carry any meaning.

In [14]:
data['content'] = data['content'].apply(remove_stopwords).apply(strip_short)

We now print the content of the first record as we have done before to see the changes in the text.

In [15]:
print(data['content'][0])

car wondering enlighten car saw day door sports car looked late early called bricklin doors small addition bumper separate rest body know tellme model engine specs years production car history info funky looking car mail thanks
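The two filters of Part D can be sketched with plain Python as follows. This is an illustration under stated assumptions, not gensim's code: `MINI_STOPWORDS` is a hypothetical, very small stop list (the real `STOPWORDS` set is far larger), and `drop_stopwords` and `drop_short` are hypothetical helper names.

```python
# Hypothetical mini stop list for illustration; gensim's STOPWORDS is far larger.
MINI_STOPWORDS = frozenset(
    {"what", "is", "this", "i", "was", "if", "anyone", "could", "me", "please"}
)


def drop_stopwords(text, stop=MINI_STOPWORDS):
    """Keep only the whitespace-separated tokens that are not in the stop list."""
    return " ".join(word for word in text.split() if word not in stop)


def drop_short(text, minsize=3):
    """Keep only the tokens with at least `minsize` characters."""
    return " ".join(word for word in text.split() if len(word) >= minsize)


sample = ("what car is this i was wondering if anyone could enlighten me "
          "please e mail thanks il")
filtered = drop_short(drop_stopwords(sample))
print(filtered)
```

Both filters assume the text is already lowercased and free of punctuation, which is why they come after Part C: a token like "This" or "this," would not match an entry in the stop list.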


## Part E: Stemming and Lemmatization

We create a copy of our data and apply stemming to it. Stemming strips common suffixes from each word using rule-based heuristics (gensim's `stem_text` uses the Porter stemmer), leaving only a stem. The stem is not necessarily a real word, as we can see when we print the content of the first record: 'day' becomes 'dai' and 'early' becomes 'earli'.

In [16]:
stemmed_data = data.copy()
stemmed_data['content'] = stemmed_data['content'].apply(stem_text)

print(stemmed_data['content'][0])

car wonder enlighten car saw dai door sport car look late earli call bricklin door small addit bumper separ rest bodi know tellm model engin spec year product car histori info funki look car mail thank


We first load spaCy's English model, create another copy of our data, and apply lemmatization to the content of our dataset. Lemmatization also reduces inflected forms, but it maps each word to its dictionary form (the lemma) instead of chopping off suffixes, so the result is an actual word. For example, 'saw' becomes 'see' and 'early' stays 'early', whereas the stemmer in the previous step produced 'saw' and 'earli'. We can see this when we print the content of the first record again.

In [17]:
nlp = en_core_web_sm.load()

lemmatized_data = data.copy()
lemmatized_data['content'] = lemmatized_data['content'].apply(lambda content: ' '.join([token.lemma_ for token in nlp(content)]))

print(lemmatized_data['content'][0])

car wonder enlighten car see day door sport car look late early call bricklin door small addition bumper separate rest body know tellme model engine specs year production car history info funky looking car mail thank
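To see exactly where the two approaches disagree, the two printed outputs for the first record can be compared token by token. The sketch below copies both output strings from the cells above and lists the positions where the stem and the lemma differ; the variable names are chosen for this example.

```python
# Outputs for the first record, copied from the stemming and lemmatization cells.
stemmed = (
    "car wonder enlighten car saw dai door sport car look late earli call "
    "bricklin door small addit bumper separ rest bodi know tellm model engin "
    "spec year product car histori info funki look car mail thank"
)
lemmatized = (
    "car wonder enlighten car see day door sport car look late early call "
    "bricklin door small addition bumper separate rest body know tellme model engine "
    "specs year production car history info funky looking car mail thank"
)

# Both outputs have one token per input word, so they can be aligned with zip.
pairs = list(zip(stemmed.split(), lemmatized.split()))
differences = [(stem, lemma) for stem, lemma in pairs if stem != lemma]

for stem, lemma in differences:
    print(f"{stem:10} -> {lemma}")
```

Most differences are truncated stems such as 'histori' versus 'history', while 'saw' versus 'see' shows the lemmatizer resolving an irregular verb form that suffix stripping cannot handle.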


Here we mount Google Drive so that the notebook can access its folders and files.

In [18]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


Finally, we save our transformed datasets to Google Drive using 'pickle'. By saving the datasets to files, we don't have to run all the transformations again every time. We can just load the files back and continue working on them.

In [19]:
with open('/content/drive/MyDrive/Colab Notebooks/TWSM Analytics Lab/storage/stemmed_data.p', 'wb') as f:
    pickle.dump(stemmed_data, f)
with open('/content/drive/MyDrive/Colab Notebooks/TWSM Analytics Lab/storage/lemmatized_data.p', 'wb') as f:
    pickle.dump(lemmatized_data, f)
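
Loading the files back works the same way in reverse. The following is a minimal round-trip sketch that writes to a temporary directory instead of Google Drive and uses a small list of dictionaries as a stand-in for the dataframes; the `records` object and the temporary path are assumptions made so that the example runs anywhere.

```python
import os
import pickle
import tempfile

# Stand-in object; the notebook pickles pandas DataFrames instead.
records = [{"content": "car wonder enlighten", "target_names": "rec.autos"}]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "stemmed_data.p")

    # Write in binary mode; the context manager closes the file afterwards.
    with open(path, "wb") as fh:
        pickle.dump(records, fh)

    # Read the object back in binary mode.
    with open(path, "rb") as fh:
        restored = pickle.load(fh)

print(restored == records)
```

Pickle files can execute arbitrary code when they are loaded, so only files from a trusted source, such as the ones written by this notebook, should be unpickled.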