# An NLP workshop - Categorizing tweets into relevant or non-relevant
#### adapted from https://github.com/hundredblocks/concrete_NLP_tutorial.git

## 2. Preprocessing

Let's clean up the data based on what we observed during our EDA

In [1]:
import pandas as pd
import nltk
import re
import itertools

In [2]:
%matplotlib inline

### Let's load the data again

In [3]:
questions = pd.read_csv("socialmedia_relevant_cols.csv", encoding='ISO-8859-1')
questions.columns=['text', 'choose_one', 'class_label']
questions.head()

|   | text | choose_one | class_label |
|---|------|------------|-------------|
| 0 | Just happened a terrible car crash | Relevant | 1 |
| 1 | Our Deeds are the Reason of this #earthquake M... | Relevant | 1 |
| 2 | Heard about #earthquake is different cities, s... | Relevant | 1 |
| 3 | there is a forest fire at spot pond, geese are... | Relevant | 1 |
| 4 | Forest fire near La Ronge Sask. Canada | Relevant | 1 |


## Data Cleansing

Let's create a function to clean up our data, and save it back to disk for future use.

For now, all we do is convert everything to lower case and remove URLs. After examining the data, you might want to add code that removes punctuation marks or other irrelevant characters or words.

In [4]:
def standardize_text(df, text_field):
    df[text_field] = df[text_field].str.lower()
    df[text_field] = df[text_field].apply(lambda elem: re.sub(r"http\S+", "", elem))  # get rid of URLs
    # Add additional cleanup code here
    return df

In [5]:
questions = standardize_text(questions, "text")

Let's take a look at the effects

In [6]:
pd.set_option('display.max_colwidth', 100)

In [8]:
questions.text.sample(10)

3787                                                                            self destruction mode! ???? 
6574                            enter the world of extreme diving   9 stories up and into the volga river 
9678     guys he can run so fast he creates a tornado without breaking a sweat. he makes superman look li...
1508                       womens satchel lattice chain studded cross body multi colour shoulder bags blue  
9653                                         9:35 pm. thunderstorm. no rain. 90 degrees. this weather weird.
6681     being stuck on a sleeper train for 24 hours after de-railing due to a landslide was most definit...
443      short reading\r\n\r\napocalypse 21:1023 \r\n\r\nin the spirit the angel took me to the top of an...
2390     'it hasn't collapsed because the greek people are still being played for as fools by tsipras he ...
3316     i got my wisdom teeth out yesterday and i just demolished a whole bowl of chicken alfredo like i...
Name: text, dtype: 

In [10]:
questions.text.loc[6880]

'@flgovscott we allow farrakhan to  to challenge 10000 males to rise up &amp; commit mass murder as he just did in miami? '

Go back to `standardize_text` and add any other cleanup you think would be useful.
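As one possible starting point, here is a minimal sketch of extra cleanup steps suggested by the samples above: the `&amp;` HTML entity, the `@flgovscott` mention, embedded `\r\n` line breaks, and stray punctuation. The helper name `clean_tweet` is hypothetical, and which characters to keep (here `#` and `'`) is an assumption you should adjust after looking at your own data.

```python
import html
import re


def clean_tweet(text):
    """Hypothetical helper: extra cleanup for a single tweet string."""
    text = html.unescape(text)                  # "&amp;" -> "&"
    text = text.lower()
    text = re.sub(r"http\S+", "", text)         # URLs, same pattern as standardize_text
    text = re.sub(r"@\w+", "", text)            # @mentions
    text = re.sub(r"[^a-z0-9#' ]+", " ", text)  # everything else (incl. \r\n) -> space
    text = re.sub(r"\s+", " ", text).strip()    # collapse repeated whitespace
    return text


sample = ("@flgovscott we allow farrakhan to  to challenge 10000 males "
          "to rise up &amp; commit mass murder as he just did in miami? ")
print(clean_tweet(sample))
```

To use it in the notebook, you could call `df[text_field].apply(clean_tweet)` inside `standardize_text`. Note that the character filter assumes English-only text; it would also strip accented letters and emoji.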

Let's check the impact on the vocabulary

In [11]:
from nltk.tokenize import RegexpTokenizer
tokenizer = RegexpTokenizer(r'\w+')
questions["tokens"] = questions["text"].apply(tokenizer.tokenize)
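It helps to know what the `\w+` pattern keeps and drops before trusting the vocabulary count. The sketch below uses `re.findall` as a stand-in, on the assumption that it extracts the same matches as `RegexpTokenizer(r'\w+')` for plain strings; the helper name `word_tokens` is hypothetical.

```python
import re


def word_tokens(text):
    """Stand-in for RegexpTokenizer(r'\\w+').tokenize (assumed equivalent)."""
    return re.findall(r"\w+", text)


# Apostrophes split contractions into two tokens
print(word_tokens("it hasn't collapsed"))
# The '#' is dropped, so hashtags merge with ordinary words
print(word_tokens("reason of this #earthquake"))
# Punctuation inside times and sentences is discarded
print(word_tokens("9:35 pm. thunderstorm."))
```

If hashtags or contractions carry signal for your classifier, a different pattern (or a tweet-aware tokenizer) may be a better fit.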

In [12]:
all_words = {word for tokens in questions.tokens for word in tokens}
print(f"Total number of unique tokens {len(all_words)}")

Total number of unique tokens 21338


In [None]:
from pprint import pprint
pprint(all_words)
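Printing the whole vocabulary produces a very long list. A frequency count is often easier to scan: the most common tokens show what dominates the corpus, and tokens that occur only once tend to expose leftover noise. This is a sketch on a toy list of strings standing in for `questions["text"]`; the variable names are illustrative, and `re.findall` is again assumed to match what the `\w+` tokenizer extracts.

```python
import re
from collections import Counter

# Toy stand-in for the cleaned text column
texts = [
    "forest fire near la ronge sask. canada",
    "there is a forest fire at spot pond",
    "just happened a terrible car crash",
]

token_lists = [re.findall(r"\w+", t) for t in texts]

# Count every token occurrence across all tweets
counts = Counter(word for tokens in token_lists for word in tokens)
print(f"Total tokens: {sum(counts.values())}, unique tokens: {len(counts)}")

# Most frequent tokens
print(counts.most_common(3))

# Tokens seen exactly once: a good place to look for noise
rare = sorted(word for word, n in counts.items() if n == 1)
print(rare[:10])
```

In the notebook, the same `Counter` can be built directly from `questions.tokens`.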

Once we're happy we've cleaned as much as we want to, let's write the clean data back to disk

In [13]:
questions.to_csv("clean_data.csv")
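Two things are worth knowing about the round trip through CSV. With default settings, `to_csv` also writes the DataFrame index, which comes back as an extra unnamed column on reload, and list-valued columns such as `tokens` come back as plain strings. The sketch below illustrates both on a toy frame; the file name is hypothetical, and passing `index=False` is an optional variation, not what the cell above does. Whether later notebooks expect the index column is not something this section tells us, so check before changing it.

```python
import os
import tempfile

import pandas as pd

# Toy frame standing in for `questions`
df = pd.DataFrame({
    "text": ["forest fire near la ronge", "just happened a terrible car crash"],
    "class_label": [1, 1],
})
df["tokens"] = df["text"].str.split()

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "clean_data_demo.csv")  # hypothetical file name
    df.to_csv(path, index=False)  # skip the index so no unnamed column appears on reload
    reloaded = pd.read_csv(path)

print(list(reloaded.columns))
# Lists are stored as their string representation, not as lists
print(type(reloaded.loc[0, "tokens"]))
```

If you need the token lists later, it is usually simpler to re-tokenize the `text` column after loading than to parse the stored strings.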