## 1. Data Acquisition

This project uses the Disaster Identification Tweet dataset, which consists of the following files:


1.  train.csv
2.  test.csv
3.  sample_submission.csv

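As part of this step, we read the training and testing data. The cells below mount Google Drive, import the required libraries, and load the training split from `clean_train.csv`.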
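
As a rough sketch of what reading one split involves, the snippet below loads a CSV with pandas and reports its shape and missing values. The helper name `load_split` and the two in-memory sample rows are illustrative assumptions, not part of the project. In the notebook, the argument would be a file path such as `train.csv` or `test.csv`.

```python
import io

import pandas as pd

# Stand-in for a real file: two made-up rows using the column names
# that appear in this notebook's training data.
SAMPLE_CSV = io.StringIO(
    "id,keyword,location,text,target\n"
    "1,,,Example tweet about a flood,1\n"
    "2,fire,Paris,Nothing happening here,0\n"
)


def load_split(source):
    """Read one split and print its shape and per-column missing counts."""
    frame = pd.read_csv(source)
    print("rows, columns:", frame.shape)
    print(frame.isna().sum())
    return frame


train = load_split(SAMPLE_CSV)
```

Checking the missing-value counts right after loading is a cheap way to spot sparsely populated columns before any text cleaning starts.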

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
%cd /content/drive/MyDrive/NLP-Project

/content/drive/MyDrive/NLP-Project


In [3]:
import pandas as pd
import re
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [4]:
train_dataset = pd.read_csv("clean_train.csv")
print("Training Dataset Features:")
print(train_dataset.keys())
# Report the number of distinct values in each column.
for column in train_dataset.columns:
  print(f"Unique Values in {column} =", train_dataset[column].nunique())

Training Dataset Features:
Index(['id', 'keyword', 'location', 'text', 'target'], dtype='object')
Unique Values in id = 7613
Unique Values in keyword = 221
Unique Values in location = 3291
Unique Values in text = 6949
Unique Values in target = 2


## 3. Pre-processing

The pre-processing step involves the following operations:

1. Remove hyperlinks
2. Tokenize
3. Remove stop words
4. Remove numbers
5. Remove special characters
6. Convert to lower case
7. Remove empty tokens
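
The listed operations can be sketched for a single string as follows. This is a minimal illustration, not the notebook's implementation: the function name `preprocess`, the regular expressions, and the tiny `STOP_WORDS` set (a stand-in for the NLTK English list, so the sketch runs without downloaded data) are all assumptions. Hyperlinks are removed first because stripping special characters would otherwise turn a URL into an ordinary-looking token.

```python
import re

# Assumption: a small stand-in stop-word list so the sketch runs without NLTK data.
STOP_WORDS = {"a", "an", "the", "is", "in", "on", "at", "of", "and", "to"}

URL_PATTERN = re.compile(r"https?://\S+", re.IGNORECASE)
NON_LETTER_PATTERN = re.compile(r"[^a-zA-Z]")


def preprocess(text, stop_words=STOP_WORDS):
    """Apply the listed cleaning operations to one string and return its tokens."""
    text = URL_PATTERN.sub(" ", text)                         # remove hyperlinks while "://" is intact
    tokens = text.split()                                     # tokenize on whitespace
    tokens = [NON_LETTER_PATTERN.sub("", t) for t in tokens]  # strip numbers and special characters
    tokens = [t.lower() for t in tokens]                      # convert to lower case
    tokens = [t for t in tokens if t]                         # drop tokens that became empty
    return [t for t in tokens if t not in stop_words]         # remove stop words


print(preprocess("Fire at the Mall!! 3 injured http://t.co/abc123 #news"))
# ['fire', 'mall', 'injured', 'news']
```

The order matters in two places: URL removal must precede character stripping, and lower-casing must precede the stop-word lookup, since a stop-word list typically stores lower-case entries.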


In [15]:
def removehyperlink(text):
  # Split each tweet on whitespace, drop URL tokens, and lower-case the rest.
  temptext = []
  for i in text:
    temp = i.split()
    # Compare in lower case so that 'HTTP://...' is filtered as well.
    temp = [j.lower() for j in temp if not j.lower().startswith(('http://', 'https://'))]
    temptext.append(temp)
  return temptext

In [5]:
def TokenizeText(text):
  temptext = []
  for i in text:
    temptext.append(i.split())
  return temptext

In [6]:
def removespecialcharacter(text):
  temptext = []
  for i in text:
    temp = []
    for j in i:
      temp.append(re.sub(r"[^a-zA-Z0-9]","",j))
    temptext.append(temp)
  return temptext

In [7]:
def removeemptyspace(text):
  temptext = []
  for i in text:
    temp = []
    for j in i:
      if j != '':
        temp.append(j)
    temptext.append(temp)
  return temptext

In [8]:
def removenumbers(text):
  temptext = []
  for i in text:
    temp = []
    for j in i:
      if j != '':
        # Strip digits only; '~' and '^' inside a character class are literals, not operators.
        temp.append(re.sub(r'[0-9]', '', j))
    temptext.append(temp)
  return temptext

In [9]:
def lowercase(text):
  temptext = []
  for i in text:
    temp = []
    for j in i:
      temp.append(j.lower())
    temptext.append(temp)
  return temptext

In [10]:
def removestopwords(text):
  temptext = []
  stop_words = set(stopwords.words('english'))
  for i in text:
    temp = []
    for j in i:
      if j not in stop_words:
        temp.append(j)
    temptext.append(temp)
  return temptext

In [11]:
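# removehyperlink splits each tweet on whitespace itself, so it takes the place of
# TokenizeText here. It must run before special characters are stripped, otherwise
# 'http://' no longer matches.
clean = removehyperlink(train_dataset['text'])
clean = removespecialcharacter(clean)
clean = removeemptyspace(clean)
clean = removenumbers(clean)
clean = removeemptyspace(clean)  # removing digits can leave new empty tokens
clean = lowercase(clean)
clean = removestopwords(clean)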

In [12]:
# Join each token list back into a single space-separated string.
vocab_sentence = [" ".join(i) for i in clean]

print(vocab_sentence)

