# Disaster Tweets: Project Intro
### For this project, we use a dataset of Twitter post text from Kaggle. The goal is to build an algorithm that most accurately classifies each tweet as indicating a real disaster or not.

In [1]:
%matplotlib inline

# General libraries.
import re
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# SK-learn libraries for learning.
from sklearn.pipeline import Pipeline
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import BernoulliNB
from sklearn.naive_bayes import MultinomialNB

# SK-learn libraries for feature extraction from text.
from sklearn.feature_extraction.text import *

import nltk

### First, we read in the "train" dataset, which contains the correct labels, and split it 50/50 into train and dev (held-out) sets. For ease of analysis and text processing, the training half is further split into "pos" (label = 1) and "neg" (label = 0) dataframes, and then into text-only series.

In [26]:
#read in data
df = pd.read_csv(r'C:\Users\lwu31\OneDrive - JNJ\Documents\train.csv')

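#50/50 split between train and dev; .copy() makes each half an independent
#dataframe, so adding columns later does not raise SettingWithCopyWarning
numtest = int(len(df)/2)
df_train = df[:numtest].copy()
df_dev = df[numtest:].copy()
train_data, train_label = df_train.text, df_train.target
dev_data, dev_label = df_dev.text, df_dev.target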

#split into disaster and non disaster data
df_neg = df_train.loc[df_train.target == 0]
df_pos = df_train.loc[df_train.target == 1]

#split into disaster and nondisaster tweets only
neg_text = df_neg.text
pos_text = df_pos.text

print("Some data metrics\n")
print("Shape of train data:", df_train.shape)
print("\nMissing data in each column:\n" + str(df.isnull().sum()))
print("\nNumber of disaster tweets:\n"+ str(train_label.value_counts()))

Some data metrics

Shape of train data: (3806, 5)

Missing data in each column:
id             0
keyword       61
location    2533
text           0
target         0
dtype: int64

Number of disaster tweets:
0    2252
1    1554
Name: target, dtype: int64
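
The split above takes the first half of the file as train and the second half as dev, which assumes the rows are in no meaningful order. If that assumption is in doubt, a shuffled split is one alternative. The sketch below is a minimal standard-library illustration on toy rows; `shuffled_split`, its parameters, and the toy data are hypothetical and do not appear in the notebook.

```python
import random

def shuffled_split(rows, train_frac=0.5, seed=0):
    """Shuffle a copy of `rows` and split it into (train, dev) lists."""
    shuffled = list(rows)                   # copy, so the input is left untouched
    random.Random(seed).shuffle(shuffled)   # fixed seed makes the split reproducible
    cut = int(len(shuffled) * train_frac)
    return shuffled[:cut], shuffled[cut:]

# Toy (text, target) pairs standing in for rows of train.csv (assumption).
rows = [("tweet %d" % i, i % 2) for i in range(10)]
train_rows, dev_rows = shuffled_split(rows)
print(len(train_rows), len(dev_rows))
```

With a dataframe, the same idea could be expressed by shuffling the rows before slicing; the fixed seed keeps the train/dev assignment stable between runs.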


### Because the available tags, keyword and location, are sparse and their method of construction is unclear to us, we created new tags from the tweet text that we may be able to use as features later on.

In [24]:
# Extract hashtags, mentions, and links from each tweet, and flag retweets.
df_train['hashtag'] = df_train['text'].apply(lambda s: re.findall(r'#(\w+)', s))
df_train['mentions'] = df_train['text'].apply(lambda s: re.findall(r'@(\w+)', s))
# Match both http and https, and keep the whole URL rather than only its first word characters.
df_train['links'] = df_train['text'].apply(lambda s: re.findall(r'https?://\S+', s))
df_train['retweet'] = df_train['text'].apply(lambda s: 'rt' in s.lower().split())

# Binary indicators: 1 if the tweet has at least one such tag, else 0.
df_train['mentions_ind'] = df_train.mentions.apply(lambda y: int(len(y) > 0))
df_train['hashtag_ind'] = df_train.hashtag.apply(lambda y: int(len(y) > 0))
df_train['links_ind'] = df_train.links.apply(lambda y: int(len(y) > 0))
df_train['retweet_ind'] = df_train.retweet.astype(int)

print(df_train.head(10))


   id keyword location                                               text  \
0   1     NaN      NaN  Our Deeds are the Reason of this #earthquake M...   
1   4     NaN      NaN             Forest fire near La Ronge Sask. Canada   
2   5     NaN      NaN  All residents asked to 'shelter in place' are ...   
3   6     NaN      NaN  13,000 people receive #wildfires evacuation or...   
4   7     NaN      NaN  Just got sent this photo from Ruby #Alaska as ...   
5   8     NaN      NaN  #RockyFire Update => California Hwy. 20 closed...   
6  10     NaN      NaN  #flood #disaster Heavy rain causes flash flood...   
7  13     NaN      NaN  I'm on top of the hill and I can see a fire in...   
8  14     NaN      NaN  There's an emergency evacuation happening now ...   
9  15     NaN      NaN  I'm afraid that the tornado is coming to our a...   

   target                         hashtag rt link mentions links  retweet  \
0       1                    [earthquake]               []    []    False  

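
To make the tagging step easier to check on a single tweet, the same extraction can be sketched as a plain function that needs no dataframe. This is only an illustration: `extract_tags`, the compiled pattern names, and the sample tweet are hypothetical, and the link pattern here matches both `http` and `https` URLs in full, which is an assumption about what the tag is meant to capture.

```python
import re

HASHTAG_RE = re.compile(r"#(\w+)")
MENTION_RE = re.compile(r"@(\w+)")
LINK_RE = re.compile(r"https?://\S+")

def extract_tags(text):
    """Return the tags found in one tweet plus 0/1 indicator values."""
    hashtags = HASHTAG_RE.findall(text)
    mentions = MENTION_RE.findall(text)
    links = LINK_RE.findall(text)
    # A tweet counts as a retweet if "rt" appears as a standalone token.
    retweet = "rt" in text.lower().split()
    return {
        "hashtag": hashtags,
        "mentions": mentions,
        "links": links,
        "retweet": retweet,
        "hashtag_ind": int(bool(hashtags)),
        "mentions_ind": int(bool(mentions)),
        "links_ind": int(bool(links)),
        "retweet_ind": int(retweet),
    }

# Made-up tweet used only for illustration.
sample = "RT @user: #flood warning http://t.co/abc and https://t.co/xyz"
tags = extract_tags(sample)
print(tags)
```

Such a function could be applied row by row to the `text` column to produce the same tag columns.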

In text classification problems, text pre-processing is a crucial part of preparing the data for analysis. Ours lives in the text_clean function. The pre-processing considerations we have made include:
* removing numbers, symbols, and punctuation
* standardizing to lowercase text
* removing stop words
* stemming words
* more...

In [4]:
def text_clean(text):
  #replace line breaks with a space so adjacent words are not fused
  text = re.sub(r"[\r\n]+", " ", text)

  #convert to lowercase
  text = text.lower()

  #remove hyperlinks first, while "://" is still present to match on
  text = re.sub(r"https?://\S+", "", text)

  #remove dates such as 08/05/2015 before bare digits are stripped
  text = re.sub(r"\d+[./-]\d+[./-]\d+", "", text)

  #remove currency signs and any remaining digits
  text = re.sub(r"[$\d]+", "", text)

  #replace non-ascii characters with a space
  text = re.sub(r"[^\x00-\x7f]", " ", text)

  #remove punctuation (anything that is not a word character or whitespace)
  text = re.sub(r"[^\w\s]", "", text)

  #replace extra whitespace with a single space
  text = re.sub(r"\s+", " ", text).strip()
  return text
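
The list above mentions stop-word removal and stemming, which text_clean itself does not perform. A minimal sketch of those two steps follows. It is not the notebook's implementation: the tiny `STOP_WORDS` set and the suffix-stripping `naive_stem` are hypothetical stand-ins chosen to keep the example dependency-free. Since the notebook imports nltk, its stop-word corpus and `PorterStemmer` would be the more usual choice.

```python
# Small illustrative stop-word set (assumption, far from complete).
STOP_WORDS = {"a", "an", "the", "is", "are", "of", "to", "in", "and", "this"}

# Suffixes tried in order; this is a crude heuristic, not the Porter algorithm.
SUFFIXES = ("ing", "ed", "s")

def naive_stem(word):
    """Strip one common suffix, keeping at least three leading characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def remove_stops_and_stem(text):
    """Lowercase, split on whitespace, drop stop words, stem what remains."""
    tokens = text.lower().split()
    return [naive_stem(t) for t in tokens if t not in STOP_WORDS]

tokens = remove_stops_and_stem("the floods are flooding the flooded town")
print(tokens)
```

Collapsing inflected forms onto one token shrinks the vocabulary, at the cost of occasional over-stemming (for example, this heuristic would turn "news" into "new").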

### Once the data has been cleaned and the text pre-processed, we can begin exploring different algorithms. The three machine learning algorithms we will focus on are:
* Naive Bayes
* Logistic Regression
* SVM
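
To give a feel for how the first of these works on text, here is a from-scratch sketch of multinomial Naive Bayes with add-one smoothing over whitespace tokens. It is only an illustration of the idea: `train_nb`, `predict_nb`, and the four toy tweets are hypothetical, and in the notebook itself the imported `MultinomialNB` together with a vectorizer from `sklearn.feature_extraction.text` would presumably be used instead.

```python
import math
from collections import Counter, defaultdict

def train_nb(texts, labels, alpha=1.0):
    """Estimate log priors and smoothed log likelihoods per class."""
    word_counts = defaultdict(Counter)      # label -> word -> count
    class_counts = Counter(labels)          # label -> number of documents
    vocab = set()
    for text, label in zip(texts, labels):
        tokens = text.lower().split()
        word_counts[label].update(tokens)
        vocab.update(tokens)
    model = {"log_prior": {}, "log_like": {}}
    for label, n_docs in class_counts.items():
        model["log_prior"][label] = math.log(n_docs / len(labels))
        # Add-alpha smoothing keeps unseen (word, class) pairs from zeroing the score.
        denom = sum(word_counts[label].values()) + alpha * len(vocab)
        model["log_like"][label] = {
            w: math.log((word_counts[label][w] + alpha) / denom) for w in vocab
        }
    return model

def predict_nb(model, text):
    """Return the label with the highest log posterior score."""
    tokens = text.lower().split()
    scores = {}
    for label, prior in model["log_prior"].items():
        like = model["log_like"][label]
        # Words outside the training vocabulary are skipped.
        scores[label] = prior + sum(like[t] for t in tokens if t in like)
    return max(scores, key=scores.get)

# Toy training data (made up): 1 = disaster, 0 = not a disaster.
texts = ["forest fire near town", "earthquake destroys homes",
         "i love this song", "great day at the beach"]
labels = [1, 1, 0, 0]
model = train_nb(texts, labels)
print(predict_nb(model, "fire destroys forest"))
print(predict_nb(model, "love this beach"))
```

Working in log space avoids numerical underflow when many small word probabilities are multiplied together.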
