# Disaster Tweets: Project Intro
### For this project, we will use a dataset of Twitter post text found on Kaggle. The goal is to come up with an algorithm that most accurately classifies a tweet as indicating a real disaster or not. 

In [1]:
%matplotlib inline

# General libraries.
import re
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# SK-learn libraries for learning.
from sklearn.pipeline import Pipeline
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import BernoulliNB
from sklearn.naive_bayes import MultinomialNB
from sklearn.cluster import *
from sklearn import metrics

# SK-learn Decomp
from sklearn.decomposition import TruncatedSVD
from sklearn.decomposition import PCA

# SK-learn libraries for feature extraction from text.
from sklearn.feature_extraction.text import *

# NLP processors
import nltk
from nltk.tokenize import TweetTokenizer
from nltk.stem import WordNetLemmatizer
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/benjamin.mok/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     /Users/benjamin.mok/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


### First, we will read in the "train" dataset, which contains the correct labels. We will hold out 1/3.5 (about 29%) of the rows, split evenly between dev and test, and train on the remaining ~71%. For ease of analysis and text processing, the training data will be further split into "pos" (label = 1) and "neg" (label = 0) dataframes, and into text-only series. 

In [407]:
#read in data
# df = pd.read_csv(r'C:\Users\lwu31\OneDrive - JNJ\Documents\train.csv')
df = pd.read_csv('data/nlp-getting-started/train.csv')
# sample the data, acts as shuffling the data on row

# hold out 1/3.5 (about 29%) of the rows, split evenly between test and dev;
# the remaining ~71% is used for training. More data is allocated to training
# this way; I'll run some analysis to see whether my cluster bootstrap can
# improve the models we run.
numtest = int(len(df)/3.5)
df_train = df[numtest:].reset_index(drop=True)
df_test = df[:int(numtest/2)].reset_index(drop=True)
df_dev = df[int(numtest/2):numtest].reset_index(drop=True)

train_data, train_label = df_train.text, df_train.target
dev_data, dev_label = df_dev.text, df_dev.target
test_data, test_label = df_test.text, df_test.target

#split into disaster and non disaster data
df_neg = df_train.loc[df_train.target == 0]
df_pos = df_train.loc[df_train.target == 1]

#split into disaster and nondisaster tweets only
neg_text = df_neg.text
pos_text = df_pos.text

print("Some data metrics\n")
print("Shape of train data:", df_train.shape)
print("\nMissing data in each column:\n" + str(df.isnull().sum()))
print("\nNumber of disaster tweets:\n"+ str(train_label.value_counts()))

Some data metrics

Shape of train data: (5438, 5)

Missing data in each column:
id             0
keyword       61
location    2533
text           0
target         0
dtype: int64

Number of disaster tweets:
0    2996
1    2442
Name: target, dtype: int64
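
The positional slicing above only gives a fair split if the rows are already in random order. As a minimal sketch of the shuffle-then-slice idea, assuming `DataFrame.sample(frac=1)` is the shuffle the comment in the cell alludes to, here is the same arithmetic on a toy frame. The toy rows and `random_state=0` are illustrative assumptions, not the project's data or settings.

```python
import pandas as pd

# Toy stand-in for train.csv (hypothetical rows; only 'text' and 'target' matter here).
toy = pd.DataFrame({
    "text": [f"tweet {i}" for i in range(70)],
    "target": [0, 1] * 35,
})

# Shuffle the rows reproducibly; random_state=0 is an arbitrary choice.
shuffled = toy.sample(frac=1, random_state=0).reset_index(drop=True)

# Same arithmetic as the cell above: hold out 1/3.5 of the rows,
# half for test and half for dev, and train on the rest.
numtest = int(len(shuffled) / 3.5)
toy_test = shuffled[: numtest // 2].reset_index(drop=True)
toy_dev = shuffled[numtest // 2 : numtest].reset_index(drop=True)
toy_train = shuffled[numtest:].reset_index(drop=True)

print(len(toy_train), len(toy_dev), len(toy_test))  # 50 10 10
```

Fixing `random_state` keeps the three subsets identical across notebook runs, which makes model comparisons repeatable.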


### Because the available tags, keyword and location, are sparse and their method of construction is unclear to us, we create new tags from the text that we may be able to train on later. 

In [84]:
df_train['hashtag'] = df_train['text'].apply(lambda s: re.findall(r'#(\w+)', s))
df_train['mentions'] = df_train['text'].apply(lambda x: re.findall(r"@(\w+)", x))
# match both http:// and https:// links
df_train['links'] = df_train['text'].apply(lambda x: re.findall(r"https?://(\w+)", x))
df_train['retweet'] = df_train['text'].apply(lambda x: "rt" in x.lower().split())

df_train['mentions_ind'] = df_train.mentions.apply(lambda y: 0 if len(y)==0 else 1)
df_train['hashtag_ind'] = df_train.hashtag.apply(lambda y: 0 if len(y)==0 else 1)
df_train['links_ind'] = df_train.links.apply(lambda y: 0 if len(y)==0 else 1)
df_train['retweet_ind'] = df_train.retweet.apply(int)

print(df_train.head(10))

     id             keyword        location  \
0  8367                ruin         Belfast   
1  4164               drown             NaN   
2  9232   suicide%20bombing             NaN   
3  7587            outbreak  Fukuoka, Japan   
4  8816              sirens       Hollywood   
5  9673             tornado         Midwest   
6  2538           collision             NaN   
7   798              battle             NaN   
8  7108            military          Alaska   
9  7228  natural%20disaster             NaN   

                                                text  target  \
0  And then I go a ruin it all with something awf...       0   
1        @GraysonDolan only if u let me drown you ??       0   
2  meek mill should join isis since he loves suic...       0   
3  Families to sue over Legionnaires: More than 4...       1   
4  @TravelElixir Any idea what's going on? I hear...       1   
6  my favorite lady came to our volunteer meeting...       1   
7  Dragon Ball Z: Battle Of Gods (
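
To make the tag patterns concrete, here is a small check on a made-up tweet (the tweet text and the handle are assumptions for illustration). It also shows that a pattern anchored on `http://` alone skips `https://` links, while `https?://` catches both, and that the `(\w+)` capture stops at the first dot of the host name, which is enough for a 0/1 indicator.

```python
import re

# Hypothetical tweet, not taken from the dataset.
tweet = "RT @example_user: Tornado warning for #Moore #OK https://t.co/abc and http://t.co/xyz"

hashtags = re.findall(r"#(\w+)", tweet)
mentions = re.findall(r"@(\w+)", tweet)
links_http_only = re.findall(r"http://(\w+)", tweet)
links_any_scheme = re.findall(r"https?://(\w+)", tweet)
is_retweet = "rt" in tweet.lower().split()

print(hashtags)          # ['Moore', 'OK']
print(mentions)          # ['example_user']
print(links_http_only)   # ['t']
print(links_any_scheme)  # ['t', 't']
print(is_retweet)        # True
```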

In text classification problems, text pre-processing is a crucial part of preparing our data for analysis. Ours is implemented in the `preprocess` function below. Pre-processing steps we considered include:
* removing numbers, symbols, and punctuation
* standardizing to lowercase text
* removing stop words (considered, but left disabled in the code)
* word stemming or lemmatization
* stripping trailing spaces

In [3]:
def preprocess(text, method=None, tokenizer=None):
    """Clean a tweet; optionally lemmatize (method='l') or stem (method='s')."""
    # replace line breaks with a space so adjacent words are not fused
    text = re.sub(r"\n", " ", text)

    # convert to lowercase
    text = text.lower()

    # remove hyperlinks (done before punctuation is stripped, while the URL is intact)
    #text = re.sub(r'https?:\/\/.*[\r\n]*', '', text)
    text = re.sub(r'http\S+', '', text)

    # remove dates (must run before digits are stripped, or the pattern never matches)
    text = re.sub(r'\d+[./-]\d+[./-]\d+', '', text)

    # remove digits and currency symbols
    text = re.sub(r'[\d$]', '', text)

    # replace non-ascii characters with a space
    text = re.sub(r'[^\x00-\x7f]', ' ', text)

    # remove punctuation
    text = re.sub(r'[^\w\s]', '', text)

    # remove trailing spaces
    text = re.sub(r'[ \t]+$', '', text)

    # stop words are intentionally kept
    # filtered_tokens = [word for word in word_tokenize(text) if word not in stop_words]
    # text = " ".join(filtered_tokens)

    # method=None returns the cleaned text without stemming or lemmatizing
    method = (method or '').lower()
    if tokenizer is None:
        tokenizer = word_tokenize

    if method == 'l':
        lemmer = WordNetLemmatizer()
        lemm_tokens = [lemmer.lemmatize(word) for word in tokenizer(text)]
        return " ".join(lemm_tokens)

    elif method == 's':
        porter = PorterStemmer()
        stem_tokens = [porter.stem(word) for word in tokenizer(text)]
        return " ".join(stem_tokens)

    return text
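
The order of the regex substitutions determines which patterns can still match: once digits are stripped, a date pattern has nothing left to find. The sketch below checks this on a made-up string (the sample text is an assumption). It also shows that for this sample both orderings end up with the same cleaned text, because punctuation stripping removes the leftover separators.

```python
import re

DATE = re.compile(r"\d+[./-]\d+[./-]\d+")
sample = "flood warning 08/05/2015 until 10pm"  # hypothetical input

# Digits first: the date collapses to "//", so the date pattern cannot match.
digits_first = re.sub(r"\d+", "", sample)
print(DATE.search(digits_first))  # None

# Dates first: the whole date is removed in one step.
dates_first = DATE.sub("", sample)
print(repr(dates_first))  # 'flood warning  until 10pm'

def finish(text):
    """Strip remaining digits, then punctuation."""
    text = re.sub(r"\d+", "", text)
    return re.sub(r"[^\w\s]", "", text)

print(finish(digits_first) == finish(dates_first))  # True
```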

#### Clean the data and strip

### After the data has been cleaned and the text pre-processed, we can begin exploring different algorithms. The three machine learning algorithms we will focus on are:
* Naive Bayes
* Logistic Regression
* SVM
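
As a rough sketch of how the three model families could be compared on equal footing, the example below wraps each classifier in the same bag-of-words `Pipeline` and scores it on a tiny made-up dev set. The toy sentences, the choice of `CountVectorizer`, and the use of `LinearSVC` as the SVM variant are all assumptions for illustration, not the settings this project ends up with.

```python
from sklearn import metrics
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy data: 1 = real disaster, 0 = not a disaster.
toy_train_text = [
    "wildfire spreading near the highway evacuate now",
    "earthquake damage reported downtown",
    "flood waters rising homes destroyed",
    "tornado touched down several injured",
    "this new album is fire",
    "my exam was a total disaster lol",
    "traffic is killing me today",
    "that movie bombed at the box office",
]
toy_train_label = [1, 1, 1, 1, 0, 0, 0, 0]
toy_dev_text = ["flood damage reported near the highway", "this movie is fire lol"]
toy_dev_label = [1, 0]

models = {
    "naive_bayes": MultinomialNB(),
    "logistic_regression": LogisticRegression(max_iter=1000),
    "svm": LinearSVC(random_state=0),
}

scores = {}
for name, clf in models.items():
    # Identical feature extraction for every model keeps the comparison fair.
    pipe = Pipeline([("vec", CountVectorizer()), ("clf", clf)])
    pipe.fit(toy_train_text, toy_train_label)
    pred = pipe.predict(toy_dev_text)
    scores[name] = metrics.accuracy_score(toy_dev_label, pred)
    print(name, scores[name])
```

Swapping the `vec` step, or passing a cleaning function through the vectorizer's `preprocessor` argument, changes the features for all three models at once.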
