# Kaggle Tweet Analytics on Natural Disaster 
This is a Kaggle competition on applying natural language processing techniques to tweets in order to classify whether a tweet discusses a natural disaster. Here I use the original dataset from Kaggle.com and apply several machine learning and neural network techniques to classify whether each tweet contains information about natural disasters.

Link to the competition: https://www.kaggle.com/c/nlp-getting-started/data

While working on this project, I was inspired by a few websites on how to handle tweets and text data.

- https://towardsdatascience.com/sentiment-analysis-of-a-tweet-with-naive-bayes-ff9bdb2949c7
    
- https://stackabuse.com/removing-stop-words-from-strings-in-python/

This project was created by Kelvin Kong on Sept 30 2021.

Last Modified on Nov 27 2021


Project Update Log:


Nov 27 2021

- Successfully installed scikit-learn on the M1 MacBook Air.

Oct 9 2021

- Updated Explanation on Each Code Block

Remarks: All initial coding work was done on a 2020 M1 MacBook Air, and all validation work was done on my Ubuntu 20.04 LTS machine. Because Apple Silicon is a new platform, some Python packages may not work on M1 systems. I am trying to reduce the chance of relying on packages that do not function on the M1 MacBook Air.

Remarks 2: Known issue: gensim is not yet supported on M1 systems. I was not able to install gensim through pip (Oct 9 2021).

## Step 0: Import required modules and data

In [1]:
import pandas as pd
import numpy as np

In [2]:
train = pd.read_csv('data/train.csv')
test = pd.read_csv('data/test.csv')

## Step 1: Understanding data structure

In [3]:
#Understanding Data Structure
train.head()

Unnamed: 0,id,keyword,location,text,target
0,1,,,Our Deeds are the Reason of this #earthquake M...,1
1,4,,,Forest fire near La Ronge Sask. Canada,1
2,5,,,All residents asked to 'shelter in place' are ...,1
3,6,,,"13,000 people receive #wildfires evacuation or...",1
4,7,,,Just got sent this photo from Ruby #Alaska as ...,1


In [4]:
test.head()

Unnamed: 0,id,keyword,location,text
0,0,,,Just happened a terrible car crash
1,2,,,"Heard about #earthquake is different cities, s..."
2,3,,,"there is a forest fire at spot pond, geese are..."
3,9,,,Apocalypse lighting. #Spokane #wildfires
4,11,,,Typhoon Soudelor kills 28 in China and Taiwan


Short introduction to the data:

Our goal is to classify whether a tweet is talking about a natural disaster. Below are descriptions of all the columns.

    id: The tweet ID.

    keyword: A keyword from the tweet. According to the description on Kaggle.com, this field may be blank.

    location: Where the tweet was sent from. According to the description on Kaggle.com, only some of the tweets contain location data. We will double check later.

    text: The main tweet text, in its original format. No text preprocessing has been performed on the original training and testing data.

    target: Whether the tweet is about a disaster or not. 1 means it is about a natural disaster and 0 means it is not. This is what we want to predict for the testing data, so this column only appears in the training data. We will use the labels in the training data set for model building, training and improvement.

## Step 1a: Checking data's properties

In [5]:
# Check whether any column consists entirely of NaN values
train.isnull().all(axis=0)

id          False
keyword     False
location    False
text        False
target      False
dtype: bool

In [6]:
test.isnull().all(axis=0)

id          False
keyword     False
location    False
text        False
dtype: bool

In [7]:
train.dtypes

id           int64
keyword     object
location    object
text        object
target       int64
dtype: object

In [8]:
test.dtypes

id           int64
keyword     object
location    object
text        object
dtype: object
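
The all-NaN check above only tells us that no column is completely empty. A possible follow-up, sketched here on a made-up toy DataFrame rather than the real `train.csv`, is to look at the fraction of missing values per column and at the balance of the `target` labels. The `toy` data and its values are assumptions for illustration only.

```python
import pandas as pd

# Toy stand-in for train.csv; the rows are invented for illustration.
toy = pd.DataFrame({
    "id": [1, 4, 5, 6],
    "keyword": [None, "wildfire", None, "flood"],
    "location": [None, None, "Canada", None],
    "text": ["forest fire near la ronge", "smoke everywhere",
             "what a sunny day", "river overflowing"],
    "target": [1, 1, 0, 1],
})

# Fraction of missing values in each column (0.0 = complete, 1.0 = empty).
missing_fraction = toy.isnull().mean()
print(missing_fraction)

# Share of each class in the label column.
class_balance = toy["target"].value_counts(normalize=True)
print(class_balance)
```

Running the same two calls on `train` and `test` would show how sparse `keyword` and `location` really are before deciding whether to use them as features.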

## Step 2: Preprocessing data

### Step 2a: Clean tweets

In this step we clean the tweets so that they are machine readable and ready to be compared using different machine learning algorithms.

First, we convert all characters to lowercase so that the text is uniform. After that, we remove special characters, including hashtag symbols, @ signs, full stops, commas and so on. These characters carry little meaning for analyzing the tweet content, and they may affect the prediction.

Finally, we remove stopwords. This is an important step because stopwords are generally considered neutral: they do not carry any sentiment and can be removed before analysis. Removing stopwords reduces noise, which can improve accuracy in natural language processing tasks.
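
To make the three steps concrete, here is a minimal sketch that applies them to a single tweet. It is an illustration only: `clean_tweet` and `TOY_STOPWORDS` are hypothetical names, the stopword set is a small hand-picked stand-in for the NLTK list used in this notebook, and a plain whitespace split stands in for a real tokenizer.

```python
import re

# Small hand-picked stopword set, used only to keep the example self-contained.
# The notebook itself takes its stopwords from NLTK.
TOY_STOPWORDS = {"our", "are", "the", "of", "this", "in", "and", "is", "a"}

def clean_tweet(text):
    """Lowercase, replace non-alphanumeric characters with spaces, drop stopwords."""
    text = text.lower()
    text = re.sub(r"[^a-z0-9]", " ", text)
    tokens = text.split()  # whitespace split stands in for a real tokenizer
    return [tok for tok in tokens if tok not in TOY_STOPWORDS]

example = "Our Deeds are the Reason of this #earthquake"
print(clean_tweet(example))  # ['deeds', 'reason', 'earthquake']
```

The cells that follow perform the same steps on the whole `text` column, one step per cell.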

In [9]:
# preprocessing tweets
# turn all characters to lowercase.
train["text"] = train["text"].str.lower()
test["text"] = test["text"].str.lower()

In [10]:
# Remove special Characters
import re

def remove_special(text_str):
    new_text = re.sub(r'[^A-Za-z0-9]', ' ', text_str)

    return new_text

train["text"] = train["text"].apply(remove_special)
test["text"] = test["text"].apply(remove_special)

In [11]:
# Checking the cleaned data
train.head()

Unnamed: 0,id,keyword,location,text,target
0,1,,,our deeds are the reason of this earthquake m...,1
1,4,,,forest fire near la ronge sask canada,1
2,5,,,all residents asked to shelter in place are ...,1
3,6,,,13 000 people receive wildfires evacuation or...,1
4,7,,,just got sent this photo from ruby alaska as ...,1


In [12]:
test.head()

Unnamed: 0,id,keyword,location,text
0,0,,,just happened a terrible car crash
1,2,,,heard about earthquake is different cities s...
2,3,,,there is a forest fire at spot pond geese are...
3,9,,,apocalypse lighting spokane wildfires
4,11,,,typhoon soudelor kills 28 in china and taiwan


In [13]:
# Remove stopwords
from nltk.corpus import stopwords
stopwords.words('english') #Showing stopwords in English

['i',
 'me',
 'my',
 'myself',
 'we',
 'our',
 'ours',
 'ourselves',
 'you',
 "you're",
 "you've",
 "you'll",
 "you'd",
 'your',
 'yours',
 'yourself',
 'yourselves',
 'he',
 'him',
 'his',
 'himself',
 'she',
 "she's",
 'her',
 'hers',
 'herself',
 'it',
 "it's",
 'its',
 'itself',
 'they',
 'them',
 'their',
 'theirs',
 'themselves',
 'what',
 'which',
 'who',
 'whom',
 'this',
 'that',
 "that'll",
 'these',
 'those',
 'am',
 'is',
 'are',
 'was',
 'were',
 'be',
 'been',
 'being',
 'have',
 'has',
 'had',
 'having',
 'do',
 'does',
 'did',
 'doing',
 'a',
 'an',
 'the',
 'and',
 'but',
 'if',
 'or',
 'because',
 'as',
 'until',
 'while',
 'of',
 'at',
 'by',
 'for',
 'with',
 'about',
 'against',
 'between',
 'into',
 'through',
 'during',
 'before',
 'after',
 'above',
 'below',
 'to',
 'from',
 'up',
 'down',
 'in',
 'out',
 'on',
 'off',
 'over',
 'under',
 'again',
 'further',
 'then',
 'once',
 'here',
 'there',
 'when',
 'where',
 'why',
 'how',
 'all',
 'any',
 'both',
 'each

### Tokenize Text Data

In [14]:
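# Tokenize text and remove stopwords
# Requires the NLTK tokenizer and stopwords data to be downloaded (nltk.download).
from nltk.tokenize import word_tokenize

# Build the stopword set once instead of reloading the list for every word.
# Note: stopwords.words() without a language returns the stopwords of every
# language in the corpus, not only English (this is why 'la' is removed below).
all_stopwords = set(stopwords.words())

def remove_stopword(text_str):
    new_text = word_tokenize(text_str)

    tokens_without_sw = [word for word in new_text if word not in all_stopwords]

    return tokens_without_sw

train['text'] = train['text'].apply(remove_stopword)

test['text'] = test['text'].apply(remove_stopword)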

In [15]:
# Check the modified dataframe
train.head()

Unnamed: 0,id,keyword,location,text,target
0,1,,,"[deeds, reason, earthquake, may, allah, forgiv...",1
1,4,,,"[forest, fire, near, ronge, sask, canada]",1
2,5,,,"[residents, asked, shelter, place, notified, o...",1
3,6,,,"[13, 000, people, receive, wildfires, evacuati...",1
4,7,,,"[got, sent, photo, ruby, alaska, smoke, wildfi...",1


In [16]:
test.head()

Unnamed: 0,id,keyword,location,text
0,0,,,"[happened, terrible, car, crash]"
1,2,,,"[heard, earthquake, different, cities, stay, s..."
2,3,,,"[forest, fire, spot, pond, geese, fleeing, acr..."
3,9,,,"[apocalypse, lighting, spokane, wildfires]"
4,11,,,"[typhoon, soudelor, kills, 28, china, taiwan]"


## Step 2 Part II: TF-IDF

In [11]:
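from sklearn.feature_extraction.text import TfidfVectorizer

# TfidfVectorizer expects strings, so join any token lists back into text.
def to_document(tokens):
    return tokens if isinstance(tokens, str) else ' '.join(tokens)

tfidf = TfidfVectorizer(min_df = 2, max_df = 0.5, ngram_range = (1,2))
train_text = tfidf.fit_transform(train['text'].apply(to_document))
test_text = tfidf.transform(test['text'].apply(to_document))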
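
For readers unfamiliar with TF-IDF, the following is a rough pure-Python sketch of the weighting on three made-up tokenized documents. It assumes the smoothed inverse document frequency `ln((1 + n) / (1 + df)) + 1` followed by L2 normalization, which to my understanding corresponds to scikit-learn's default settings. It ignores `min_df`, `max_df` and `ngram_range`, and the names `docs` and `tfidf_weights` are hypothetical.

```python
import math
from collections import Counter

# Made-up tokenized documents, for illustration only.
docs = [
    ["forest", "fire", "near", "canada"],
    ["forest", "fire", "spot", "pond"],
    ["typhoon", "kills", "china"],
]

n_docs = len(docs)
# Document frequency: in how many documents each term appears.
doc_freq = Counter(term for doc in docs for term in set(doc))

def tfidf_weights(doc):
    """TF-IDF weights for one tokenized document, L2-normalized."""
    term_freq = Counter(doc)
    # Smoothed idf: ln((1 + n) / (1 + df)) + 1
    raw = {
        term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
        for term, count in term_freq.items()
    }
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {term: w / norm for term, w in raw.items()}

weights = tfidf_weights(docs[0])
print(weights)
```

In the first document, "near" and "canada" get a higher weight than "forest" and "fire", because the latter also appear in the second document and are therefore less distinctive.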

## Step 3: Import algorithms for training models



Random Forest Classifier

In [12]:
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier

model_1 = RandomForestClassifier(random_state=42)
model_1.fit(train_text, train['target'])
predicted_1 = model_1.predict(test_text)

## Step 4: Evaluating the model accuracy
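
One way to estimate model quality before submitting is k-fold cross-validation on the training data. The sketch below is an illustration on made-up toy tweets, not the evaluation used in this project. Wrapping the vectorizer and the classifier in a pipeline, the 3-fold split, `n_estimators=50` and the F1 scoring are all assumptions chosen for the example.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Toy labelled tweets, made up for illustration (1 = disaster, 0 = not).
texts = [
    "forest fire near town", "earthquake shakes city", "flood warning issued",
    "wildfire evacuation ordered", "typhoon kills dozens", "storm destroys homes",
    "love this sunny day", "great pizza tonight", "new phone arrived",
    "watching a movie", "coffee with friends", "my cat is cute",
]
labels = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

# Putting the vectorizer inside the pipeline means it is refit on the
# training part of every fold, so the held-out fold does not leak into it.
pipeline = make_pipeline(
    TfidfVectorizer(),
    RandomForestClassifier(n_estimators=50, random_state=42),
)

# One F1 score per fold; the mean is a rough estimate of generalization.
scores = cross_val_score(pipeline, texts, labels, cv=3, scoring="f1")
print(scores, scores.mean())
```

With the real data, `texts` would be the cleaned training tweets as strings and `labels` would be `train['target']`.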

## Step 5: Improve prediction model accuracy

## Step 6: Generate submission file for Kaggle Competition (Optional)

In [13]:
# Column names are lowercase to match the header of Kaggle's sample_submission.csv
result = pd.DataFrame({'id': test['id'],
                       'target': predicted_1})

result.to_csv('submission.csv', index = False)

In [14]:
# View Result
result.head(5)

Unnamed: 0,id,target
0,0,1
1,2,1
2,3,1
3,9,0
4,11,0
