# 1 Data Wrangling

The main goal of this notebook is to wrangle the Wikipedia comments dataset. I import the data, inspect its format, check for missing values, and organize it for the next step, exploratory data analysis (EDA).

In [1]:
import pandas as pd
import numpy as np

For this project, I use the Toxic Comment Classification dataset by Jigsaw/Conversation AI from Kaggle. Human raters labeled each comment with a binary flag for toxic behavior under six different categories. The dataset is split into train and test sets.

The data is stored in CSV format on Kaggle. I downloaded it and stored it on my local machine. In the next step, I import the raw train dataset to perform the analysis.

In [2]:
train_data = pd.read_csv('Library/train.csv')

First, let's see how many data points and columns this file has.

In [3]:
train_data.shape

(159571, 8)

Next, I want to see the first few rows of the dataset.

In [4]:
train_data.head()

Unnamed: 0,id,comment_text,toxic,severe_toxic,obscene,threat,insult,identity_hate
0,0000997932d777bf,Explanation\nWhy the edits made under my usern...,0,0,0,0,0,0
1,000103f0d9cfb60f,D'aww! He matches this background colour I'm s...,0,0,0,0,0,0
2,000113f07ec002fd,"Hey man, I'm really not trying to edit war. It...",0,0,0,0,0,0
3,0001b41b1c6bb37e,"""\nMore\nI can't make any real suggestions on ...",0,0,0,0,0,0
4,0001d958c54c6e35,"You, sir, are my hero. Any chance you remember...",0,0,0,0,0,0


I inspect the data type of each column.

In [5]:
train_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 159571 entries, 0 to 159570
Data columns (total 8 columns):
 #   Column         Non-Null Count   Dtype 
---  ------         --------------   ----- 
 0   id             159571 non-null  object
 1   comment_text   159571 non-null  object
 2   toxic          159571 non-null  int64 
 3   severe_toxic   159571 non-null  int64 
 4   obscene        159571 non-null  int64 
 5   threat         159571 non-null  int64 
 6   insult         159571 non-null  int64 
 7   identity_hate  159571 non-null  int64 
dtypes: int64(6), object(2)
memory usage: 9.7+ MB


The toxic, severe_toxic, obscene, threat, insult and identity_hate columns are int64, which is suitable for the analysis. The comment_text column is object (string), which is also suitable.

Let's see if there are any missing values in the dataset.
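An integer dtype alone does not guarantee that the labels are binary. The following is a minimal sketch of a check that every label column holds only 0 or 1. It runs on a small made-up frame rather than the real `train_data`, and the helper name `non_binary_columns` is hypothetical.

```python
import pandas as pd

# Label column names taken from the train_data.info() output above.
LABEL_COLS = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]


def non_binary_columns(df, label_cols=LABEL_COLS):
    """Return the label columns that contain any value other than 0 or 1."""
    return [col for col in label_cols if not df[col].isin([0, 1]).all()]


# Toy frame standing in for train_data (values are made up for illustration).
toy = pd.DataFrame({
    "id": ["a", "b", "c"],
    "comment_text": ["first", "second", "third"],
    "toxic": [0, 1, 0],
    "severe_toxic": [0, 0, 0],
    "obscene": [0, 1, 0],
    "threat": [0, 0, 2],  # deliberately invalid value
    "insult": [0, 0, 0],
    "identity_hate": [0, 0, 0],
})

bad_cols = non_binary_columns(toy)
print(bad_cols)
```

On the real data, an empty list from `non_binary_columns(train_data)` would indicate that all six label columns are strictly binary.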

In [6]:
# Check if there is any missing value. 
train_data.isnull().sum()

id               0
comment_text     0
toxic            0
severe_toxic     0
obscene          0
threat           0
insult           0
identity_hate    0
dtype: int64

No column contains null values.

I perform the same steps for the test dataset and then combine it with the train dataset. The test dataset comes in two files: the first contains the comment text and the second contains the toxicity labels.
First, I compare the two files and merge them so that the comments and their labels sit in a single dataframe.
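Note that `isnull()` flags only missing values, so an empty or whitespace-only comment would still count as non-null. A minimal sketch of an additional check follows. It uses a made-up frame, and the helper name `count_blank_comments` is hypothetical.

```python
import pandas as pd


def count_blank_comments(df, text_col="comment_text"):
    """Count rows whose text is empty or contains only whitespace."""
    return int(df[text_col].str.strip().eq("").sum())


# Toy frame standing in for train_data (values are made up for illustration).
toy = pd.DataFrame({"comment_text": ["hello", "", "   ", "\n", "ok"]})

n_null = int(toy["comment_text"].isnull().sum())  # isnull() sees nothing missing
n_blank = count_blank_comments(toy)               # but three rows carry no text
print(n_null, n_blank)
```

Applied to the real data, a result of zero from `count_blank_comments(train_data)` would confirm that every comment contains some text.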

In [7]:
test_comment = pd.read_csv('Library/test.csv')

In [8]:
test_label = pd.read_csv('Library/test_labels.csv')

In [9]:
test_comment.shape

(153164, 2)

In [10]:
test_comment.head()

Unnamed: 0,id,comment_text
0,00001cee341fdb12,Yo bitch Ja Rule is more succesful then you'll...
1,0000247867823ef7,== From RfC == \n\n The title is fine as it is...
2,00013b17ad220c46,""" \n\n == Sources == \n\n * Zawe Ashton on Lap..."
3,00017563c3f7919a,":If you have a look back at the source, the in..."
4,00017695ad8997eb,I don't anonymously edit articles at all.


In [11]:
test_label.shape

(153164, 7)

In [12]:
test_label.head()

Unnamed: 0,id,toxic,severe_toxic,obscene,threat,insult,identity_hate
0,00001cee341fdb12,-1,-1,-1,-1,-1,-1
1,0000247867823ef7,-1,-1,-1,-1,-1,-1
2,00013b17ad220c46,-1,-1,-1,-1,-1,-1
3,00017563c3f7919a,-1,-1,-1,-1,-1,-1
4,00017695ad8997eb,-1,-1,-1,-1,-1,-1


Let's compare the id column of the two datasets.

In [13]:
test_comment['id'].equals(test_label['id'])

True

Therefore, the id columns of the two datasets are identical, and I can merge them.

In [14]:
test_data = pd.merge(test_comment, test_label, how='outer', on='id')

In [15]:
test_data.shape

(153164, 8)
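As an optional safeguard, `pd.merge` accepts a `validate` argument that raises an error if the join keys are not unique on either side. The sketch below illustrates this on made-up frames. It is an alternative formulation, not the merge used in this notebook, and the frame names are hypothetical.

```python
import pandas as pd

# Toy frames standing in for test_comment and test_label (values are made up).
comments = pd.DataFrame({"id": ["a", "b", "c"], "comment_text": ["x", "y", "z"]})
labels = pd.DataFrame({"id": ["a", "b", "c"], "toxic": [-1, 0, 1]})

# validate="one_to_one" makes pandas check that each id appears once per frame.
merged = pd.merge(comments, labels, how="inner", on="id", validate="one_to_one")

# A duplicated id in the labels is reported instead of silently adding rows.
dup_labels = pd.concat([labels, labels.iloc[[0]]], ignore_index=True)
try:
    pd.merge(comments, dup_labels, on="id", validate="one_to_one")
    duplicate_detected = False
except pd.errors.MergeError:
    duplicate_detected = True

print(merged.shape, duplicate_detected)
```

Because the id columns here were already shown to be identical, inner and outer joins give the same rows. The `validate` check only adds protection against duplicated ids.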

In [16]:
test_data.head()

Unnamed: 0,id,comment_text,toxic,severe_toxic,obscene,threat,insult,identity_hate
0,00001cee341fdb12,Yo bitch Ja Rule is more succesful then you'll...,-1,-1,-1,-1,-1,-1
1,0000247867823ef7,== From RfC == \n\n The title is fine as it is...,-1,-1,-1,-1,-1,-1
2,00013b17ad220c46,""" \n\n == Sources == \n\n * Zawe Ashton on Lap...",-1,-1,-1,-1,-1,-1
3,00017563c3f7919a,":If you have a look back at the source, the in...",-1,-1,-1,-1,-1,-1
4,00017695ad8997eb,I don't anonymously edit articles at all.,-1,-1,-1,-1,-1,-1


According to the Kaggle dataset description, rows labeled -1 were not used for scoring, so I remove them from the dataset.

In [17]:
test_data = test_data[test_data['toxic'] != -1]

In [18]:
test_data.shape

(63978, 8)
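The filter above looks only at the `toxic` column. The sketch below, on made-up data, tests the assumption that the -1 marker applies to all six label columns of a row at once. It also resets the index after filtering, which is an optional choice and not something the notebook does.

```python
import pandas as pd

LABEL_COLS = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]

# Toy frame standing in for test_data (values are made up for illustration).
toy = pd.DataFrame({
    "id": ["a", "b", "c", "d"],
    "comment_text": ["w", "x", "y", "z"],
    **{col: [-1, 0, -1, 1] for col in LABEL_COLS},
})

# True where a label cell carries the "not scored" marker.
unscored = toy[LABEL_COLS].eq(-1)

# Consistent if every row is either entirely -1 or contains no -1 at all.
consistent = bool((unscored.all(axis=1) == unscored.any(axis=1)).all())

# Keep the scored rows and renumber them from 0.
scored = toy.loc[~unscored.any(axis=1)].reset_index(drop=True)
print(consistent, scored.shape)
```

If `consistent` were False on the real data, filtering on `toxic` alone could leave -1 values in other label columns.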

In [19]:
test_data.head(10)

Unnamed: 0,id,comment_text,toxic,severe_toxic,obscene,threat,insult,identity_hate
5,0001ea8717f6de06,Thank you for understanding. I think very high...,0,0,0,0,0,0
7,000247e83dcc1211,:Dear god this site is horrible.,0,0,0,0,0,0
11,0002f87b16116a7f,"""::: Somebody will invariably try to add Relig...",0,0,0,0,0,0
13,0003e1cccfd5a40a,""" \n\n It says it right there that it IS a typ...",0,0,0,0,0,0
14,00059ace3e3e9a53,""" \n\n == Before adding a new product to the l...",0,0,0,0,0,0
16,000663aff0fffc80,this other one from 1897,0,0,0,0,0,0
17,000689dd34e20979,== Reason for banning throwing == \n\n This ar...,0,0,0,0,0,0
19,000844b52dee5f3f,|blocked]] from editing Wikipedia. |,0,0,0,0,0,0
21,00091c35fa9d0465,"== Arabs are committing genocide in Iraq, but ...",1,0,0,0,0,0
22,000968ce11f5ee34,Please stop. If you continue to vandalize Wiki...,0,0,0,0,0,0


In [20]:
test_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 63978 entries, 5 to 153156
Data columns (total 8 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   id             63978 non-null  object
 1   comment_text   63978 non-null  object
 2   toxic          63978 non-null  int64 
 3   severe_toxic   63978 non-null  int64 
 4   obscene        63978 non-null  int64 
 5   threat         63978 non-null  int64 
 6   insult         63978 non-null  int64 
 7   identity_hate  63978 non-null  int64 
dtypes: int64(6), object(2)
memory usage: 4.4+ MB


In [21]:
test_data.isnull().sum()

id               0
comment_text     0
toxic            0
severe_toxic     0
obscene          0
threat           0
insult           0
identity_hate    0
dtype: int64

I save the train and test dataframes as CSV files for the next step.

In [22]:
path = '../Notebooks/Library/train_df.csv'
train_data.to_csv(path, index=False)

In [23]:
path = '../Notebooks/Library/test_df.csv'
test_data.to_csv(path, index=False)
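Comments can contain newlines, quotes and commas, so it can be worth confirming that a saved file reads back unchanged. The sketch below shows a round-trip check on a made-up frame written to a temporary directory. The file name `toy_df.csv` is hypothetical, and reading `id` as a string is an assumption that guards against numeric-looking identifiers being reinterpreted.

```python
import os
import tempfile

import pandas as pd

# Toy frame with awkward text (values are made up for illustration).
toy = pd.DataFrame({
    "id": ["0001", "0002"],
    "comment_text": ["line one\nline two", 'has "quotes", and a comma'],
    "toxic": [0, 1],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_df.csv")
    toy.to_csv(path, index=False)
    # Read id as text so that leading zeros survive the round trip.
    reloaded = pd.read_csv(path, dtype={"id": str})

# Compare column by column, value by value.
round_trip_ok = reloaded.to_dict("list") == toy.to_dict("list")
print(round_trip_ok)
```

The same comparison could be run on `train_df.csv` and `test_df.csv` before moving on to the EDA notebook.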