## DATA 255 - Lab 2 - Part 2: NLP

In [1]:
import pandas as pd
import numpy as np
import re
from collections import Counter

In [3]:
labels = ['toxicity', 'severe_toxicity', 'obscene', 'threat', 
          'insult', 'identity_attack', 'sexual_explicit']

train_data = pd.read_csv("train.csv")
train_data.head()

,id,text,toxicity,severe_toxicity,obscene,threat,insult,identity_attack,sexual_explicit
0,0,"This is so cool. It's like, 'would you want yo...",0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,1,Thank you!! This would make my life a lot less...,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,2,This is such an urgent design problem; kudos t...,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,3,Is this something I'll be able to install on m...,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,4,haha you guys are a bunch of losers.,0.893617,0.021277,0.0,0.0,0.87234,0.021277,0.0


**Filling missing (NA) text values with empty strings**

In [4]:
train_data['text'] = train_data['text'].fillna('')
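
A likely reason for this step (an assumption, since the notebook does not state it): pandas reads an empty text cell as a float `NaN`, and the string methods used later during cleaning do not exist on floats. A minimal, standard-library-only sketch of that failure, where `missing` and `try_lower` are hypothetical names used only for illustration:

```python
# Stand-in for a missing text cell: pandas stores it as a float NaN.
missing = float('nan')

def try_lower(value):
    """Return the lowercased value, or the error message if lowercasing fails."""
    try:
        return value.lower()
    except AttributeError as exc:
        return f"AttributeError: {exc}"

filled = ''                        # what fillna('') substitutes for NaN
nan_result = try_lower(missing)    # fails: float has no .lower()
filled_result = try_lower(filled)  # succeeds and stays an empty string

print(nan_result)
print(filled_result == '')
```

With the empty string in place, every row can go through the same string-cleaning function without a type check.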

In [5]:
train_data.head()

,id,text,toxicity,severe_toxicity,obscene,threat,insult,identity_attack,sexual_explicit
0,0,"This is so cool. It's like, 'would you want yo...",0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,1,Thank you!! This would make my life a lot less...,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,2,This is such an urgent design problem; kudos t...,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,3,Is this something I'll be able to install on m...,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,4,haha you guys are a bunch of losers.,0.893617,0.021277,0.0,0.0,0.87234,0.021277,0.0


In [6]:
train_data.isna().sum()

id                 0
text               0
toxicity           0
severe_toxicity    0
obscene            0
threat             0
insult             0
identity_attack    0
sexual_explicit    0
dtype: int64

In [7]:
# pandas, numpy, and re are already imported in the first cell.
import nltk
from nltk.stem import PorterStemmer
from tqdm import tqdm
from joblib import Parallel, delayed

# Downloads the punkt tokenizer models; PorterStemmer itself does not need them.
nltk.download('punkt')

stemmer = PorterStemmer()

[nltk_data] Downloading package punkt to /Users/shitgupt/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


### Function to clean and stem the text

In [9]:

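def clean_text(text):
    """Lowercase, filter, and stem a single comment string."""
    text = text.lower()
    text = text.strip()

    # Drop tokens of two characters or fewer (measured before punctuation is removed).
    text = ' '.join([word for word in text.split() if len(word) > 2])

    # Remove non-printable characters.
    text = ''.join(char for char in text if char.isprintable())

    # Keep only letters and whitespace.
    text = re.sub(r'[^a-zA-Z\s]', '', text)

    # Stem each remaining word with the Porter stemmer.
    words = text.split()
    text = ' '.join([stemmer.stem(word) for word in words])

    return text

def parallelize_dataframe(df, func, num_workers=8):
    """Apply func to every element of df (a Series here) using num_workers joblib workers."""
    # tqdm wraps the input iterable, so the bar tracks items handed to joblib.
    result = Parallel(n_jobs=num_workers)(delayed(func)(text) for text in tqdm(df))
    return result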
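
One detail of `clean_text` worth illustrating: the length filter runs before non-letter characters are stripped, so a short word with punctuation attached (for example `is.`) passes the filter and survives as `is`. The sketch below reproduces only the lowercase, filter, and regex steps, leaving out stemming and the printable-character step so it needs no third-party packages. `strip_and_filter` and its `filter_first` flag are hypothetical names introduced for the comparison, not part of the notebook.

```python
import re

def strip_and_filter(text, filter_first=True):
    """Hypothetical helper: the notebook's cleaning steps without stemming."""
    text = text.lower().strip()
    if filter_first:
        # Notebook order: drop short tokens, then remove non-letters.
        text = ' '.join(w for w in text.split() if len(w) > 2)
        text = re.sub(r'[^a-zA-Z\s]', '', text)
    else:
        # Alternative order: remove non-letters, then drop short tokens.
        text = re.sub(r'[^a-zA-Z\s]', '', text)
        text = ' '.join(w for w in text.split() if len(w) > 2)
    # Collapse any leftover whitespace, as the final split/join in clean_text does.
    return ' '.join(text.split())

sample = "So it is. I'm 42!"
notebook_order = strip_and_filter(sample, filter_first=True)
alternative_order = strip_and_filter(sample, filter_first=False)

print(repr(notebook_order))     # 'is im'
print(repr(alternative_order))  # ''
```

Which order is preferable depends on whether such short tokens should count as noise; the notebook's order keeps them whenever punctuation pushes the raw token past two characters.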

In [10]:
train_data['text'] = parallelize_dataframe(train_data['text'], clean_text)

100%|███████████████████████████████| 1804874/1804874 [07:15<00:00, 4141.79it/s]
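
Assigning the returned list straight back to the `text` column relies on the results being in the same order as the input rows, which joblib's `Parallel` provides by default. The sketch below illustrates that property using the standard library's `ThreadPoolExecutor.map` as a stand-in (a substitution made to keep the example dependency-free, not the notebook's method); `slow_upper` is a hypothetical placeholder for `clean_text`.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def slow_upper(text):
    """Hypothetical stand-in for clean_text; shorter inputs take longer to finish."""
    time.sleep(0.01 * (5 - len(text)))
    return text.upper()

rows = ["a", "bb", "ccc", "dddd"]

# map() yields results in input order, even though later rows finish first here.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(slow_upper, rows))

print(results)
```

Because the output order matches the input order, position `i` of the result corresponds to row `i` of the original column.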


In [11]:
train_data.head()

,id,text,toxicity,severe_toxicity,obscene,threat,insult,identity_attack,sexual_explicit
0,0,thi cool it like would you want your mother re...,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,1,thank you thi would make life lot less anxiety...,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,2,thi such urgent design problem kudo you for ta...,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,3,thi someth ill abl instal site when will you r...,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,4,haha you guy are bunch loser,0.893617,0.021277,0.0,0.0,0.87234,0.021277,0.0


**Saving the cleaned data as a pickle file**

In [12]:
train_data.to_pickle('cleaned_data_stemmed.pkl')

**Loading the test set and performing the same cleaning steps**

In [13]:
test_data = pd.read_csv("test.csv")
test_data.head()

,id,text
0,0,[ Integrity means that you pay your debts.]\n\...
1,1,This is malfeasance by the Administrator and t...
2,2,@Rmiller101 - Spoken like a true elitist. But ...
3,3,"Paul: Thank you for your kind words. I do, in..."
4,4,Sorry you missed high school. Eisenhower sent ...


In [14]:
test_data.isna().sum()

id      0
text    0
dtype: int64

In [15]:
test_data['text'] = parallelize_dataframe(test_data['text'], clean_text)

100%|███████████████████████████████████| 97320/97320 [00:18<00:00, 5152.55it/s]


In [16]:
test_data.head()

,id,text
0,0,integr mean that you pay your debt doe thi app...
1,1,thi malfeas the administr and the board they a...
2,2,rmiller spoken like true elitist but look out ...
3,3,paul thank you for your kind word do inde have...
4,4,sorri you miss high school eisenhow sent troop...


In [17]:
test_data.to_pickle('cleaned_testdata_stemmed.pkl')

## Thank you - Modeling in a separate notebook