<a href="https://colab.research.google.com/github/chupati/disasterdetection/blob/master/Disaster_Detection_From_Social_Networks.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Detecting disaster events from Twitter

This program uses Natural Language Processing (NLP) and Machine Learning to train two classification algorithms, Random Forest and SVM. Once trained, the two classifiers can predict whether a Twitter post describes a disaster-related event.

**Import Libraries**

In [0]:
import pandas as pd
from nltk.tokenize import TweetTokenizer
from sklearn.model_selection import train_test_split
from google.colab import files
import textwrap
import numpy as np

**Import Data**

Upload the comma-delimited (.csv) data file.

Download the file from the link below to your computer, then upload it using the "Choose Files" button.

[Social Network Disaster Data (CSV)](https://raw.githubusercontent.com/chupati/disasterdetection/master/socialmedia-disaster-tweets-DFE-utf8.csv)


In [2]:


uploaded = files.upload()

for fn in uploaded.keys():
  print('User uploaded file "{name}" with length {length} bytes'.format(
      name=fn, length=len(uploaded[fn])))

Saving socialmedia-disaster-tweets-DFE-utf8.csv to socialmedia-disaster-tweets-DFE-utf8.csv
User uploaded file "socialmedia-disaster-tweets-DFE-utf8.csv" with length 2213376 bytes


**Data Pre-processing**

The following code checks that each line in the CSV file is valid UTF-8. Note that the pandas library will fail to read lines that are not valid UTF-8.

In [3]:
bad_lines = 0
total_lines = 0
# Read raw bytes so that each line is decoded, and can fail, on its own.
# (Opening in text mode with encoding='utf-8' would raise on the first bad
# byte instead of counting it.)
with open('socialmedia-disaster-tweets-DFE-utf8.csv', 'rb') as f:
    for line in f:
        try:
            line.decode('utf-8')
        except UnicodeDecodeError:
            bad_lines += 1
        total_lines += 1
print("Total Lines read:", total_lines, '\n', "Total Non UTF8 lines read:", bad_lines)  #check for non utf8 formatted twitter messages (bad lines)

Total Lines read: 12267 
 Total Non UTF8 lines read: 0
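
The check can be tried on a few in-memory lines to see how an invalid byte sequence is counted. This is a minimal sketch: `count_bad_utf8_lines` and the sample bytes are made up for illustration and are not part of the notebook or the dataset.

```python
import io

def count_bad_utf8_lines(raw):
    """Return (total_lines, bad_lines) for a bytes object split on newlines."""
    total = 0
    bad = 0
    for line in io.BytesIO(raw):
        try:
            line.decode('utf-8')
        except UnicodeDecodeError:
            bad += 1
        total += 1
    return total, bad

# Hypothetical sample: two valid lines and one line starting with invalid bytes
sample = 'caf\u00e9 fire\n'.encode('utf-8') + b'\xff\xfe broken\n' + b'plain ascii\n'
print(count_bad_utf8_lines(sample))  # (3, 1)
```

Decoding raw bytes line by line is what allows a single bad line to be counted without aborting the whole read.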


**Data Processing**

Using the pandas library, the following code loads the CSV data into a DataFrame and prints summary statistics for its numeric columns. A pandas DataFrame is a data structure that holds data in rows and columns.


NOTE: The columns listed here are not used as features (direct inputs to the ML model). However, "choose_one:confidence" is used to build the target label that the ML model learns to predict.

In [4]:
data = pd.read_csv('socialmedia-disaster-tweets-DFE-utf8.csv', delimiter=',')
data.describe()

Unnamed: 0,_unit_id,_trusted_judgments,choose_one:confidence,tweetid,userid
count,10876.0,10876.0,10876.0,10876.0,10789.0
mean,778250300.0,6.022527,0.842037,6.240055e+17,1231422000.0
std,3200.574,10.463834,0.168086,5.603918e+16,1167599000.0
min,778243800.0,3.0,0.3342,1.0,3840.0
25%,778247500.0,5.0,0.7149,6.29059e+17,187002700.0
50%,778250300.0,5.0,0.8049,6.29092e+17,634217300.0
75%,778253000.0,5.0,1.0,6.292342e+17,2416228000.0
max,778261100.0,157.0,1.0,6.29365e+17,3404474000.0


**Model Setup - Defining Labels for train and test phases**

Label = 0    Non-Disaster

Label = 1    Disaster

The following code adds a **target** column to the pandas DataFrame. A target is the value we would like the ML model to predict. A tweet gets target 1 only if it is marked 'Relevant' with a confidence above 0.95; every other tweet gets target 0.

In [5]:
data['target'] = 0
data.loc[(data['choose_one'] == 'Relevant') & (data['choose_one:confidence'] > 0.95), ['target']] = 1
data.describe()

Unnamed: 0,_unit_id,_trusted_judgments,choose_one:confidence,tweetid,userid,target
count,10876.0,10876.0,10876.0,10876.0,10789.0,10876.0
mean,778250300.0,6.022527,0.842037,6.240055e+17,1231422000.0,0.208533
std,3200.574,10.463834,0.168086,5.603918e+16,1167599000.0,0.406278
min,778243800.0,3.0,0.3342,1.0,3840.0,0.0
25%,778247500.0,5.0,0.7149,6.29059e+17,187002700.0,0.0
50%,778250300.0,5.0,0.8049,6.29092e+17,634217300.0,0.0
75%,778253000.0,5.0,1.0,6.292342e+17,2416228000.0,0.0
max,778261100.0,157.0,1.0,6.29365e+17,3404474000.0,1.0
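
The labeling rule can be spelled out on a handful of rows to show its edge cases. This is a minimal sketch using plain dicts instead of a DataFrame: `label_tweet` is a hypothetical helper and the rows are invented, not taken from the dataset.

```python
def label_tweet(choose_one, confidence, threshold=0.95):
    """Return 1 for a confidently 'Relevant' tweet, otherwise 0."""
    return int(choose_one == 'Relevant' and confidence > threshold)

# Invented rows covering the cases of interest
rows = [
    {'choose_one': 'Relevant', 'choose_one:confidence': 1.0},      # confident disaster
    {'choose_one': 'Relevant', 'choose_one:confidence': 0.95},     # exactly at threshold
    {'choose_one': 'Relevant', 'choose_one:confidence': 0.7149},   # low confidence
    {'choose_one': 'Not Relevant', 'choose_one:confidence': 1.0},  # confident non-disaster
]
labels = [label_tweet(r['choose_one'], r['choose_one:confidence']) for r in rows]
print(labels)  # [1, 0, 0, 0]
```

Because the comparison is strict, a 'Relevant' tweet with confidence exactly 0.95 is labeled 0, as is any 'Relevant' tweet the annotators were less sure about.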


**Data Exploration**

Count the tweets labeled as disasters in the data. The low count (about 21% of all tweets) shows that disaster-labeled tweets are a minority class.

In [6]:
disastercount = data['target'].sum() 
print(disastercount, ' out of ', data.shape[0] )

2268  out of  10876
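
The imbalance can be quantified directly from the two counts printed above. The short sketch below also computes the accuracy of a trivial model that always predicts "non-disaster"; that baseline is an illustration added here, not something the notebook computes.

```python
disaster_count = 2268
total_count = 10876

# Share of tweets labeled as disasters
disaster_share = disaster_count / total_count

# Accuracy of always predicting the majority class (label 0)
majority_baseline_accuracy = 1 - disaster_share

print(round(disaster_share, 4), round(majority_baseline_accuracy, 4))  # 0.2085 0.7915
```

Since a constant prediction already reaches about 79% accuracy, accuracy alone says little on this dataset; per-class measures such as precision and recall are more informative.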


**Sample Twitter messages** (TODO: NEED TO UPDATE CODE)

The following code separates the labels (y) from the tweet text (x). For now it prints only the first tweet as a sample; it does not yet show examples from both classes.


In [7]:
y = data['target']  #Labels
x = data['text']    
for line in x:
    print(line)
    break

Just happened a terrible car crash


**Tokenizing (Breaking up twitter messages into words)**

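The TweetTokenizer class from NLTK is used here to split each tweet into a list of tokens (e.g. hashtags, words, or mentions). While tokenizing, the code also records, for each lowercased token, how often it occurs, how many tweets contain it, and how often it occurs in disaster-labeled tweets.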


In [9]:
document_frequencies = dict()  # token -> number of tweets containing it
term_index = dict()
token_counts = dict()          # token -> total number of occurrences
token_disasters = dict()       # token -> occurrences in disaster tweets
tokenizer = TweetTokenizer()
token_count = 0
for index, row in data.iterrows():
    target = row['target']
    document_terms = set()     # tokens already seen in this tweet
    for token in tokenizer.tokenize(row['text']):
        key = token.lower()
        token_counts[key] = token_counts.get(key, 0) + 1
        # Count each token at most once per tweet
        if key not in document_terms:
            document_frequencies[key] = document_frequencies.get(key, 0) + 1
            document_terms.add(key)
        # Note: this counts every occurrence, so a token repeated within
        # one disaster tweet is counted more than once
        token_disasters[key] = token_disasters.get(key, 0) + target
        token_count += 1

print('Token Count: ', token_count)

Token Count:  183921
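
The difference between a token count and a document frequency is easiest to see on a toy corpus. This is a minimal sketch under stated assumptions: the three tweets are invented, whitespace splitting stands in for `TweetTokenizer`, and the `demo_` names are hypothetical so they do not clash with the notebook's variables.

```python
from collections import Counter

# Invented toy corpus; each string plays the role of one tweet (document)
demo_tweets = ["fire fire downtown", "no fire here", "sunny day downtown"]

demo_token_counts = Counter()          # total occurrences of each token
demo_document_frequencies = Counter()  # number of tweets containing each token

for tweet in demo_tweets:
    tokens = tweet.lower().split()     # simple stand-in for TweetTokenizer
    demo_token_counts.update(tokens)
    demo_document_frequencies.update(set(tokens))  # each token once per tweet

print(demo_token_counts['fire'], demo_document_frequencies['fire'])  # 3 2
```

"fire" occurs three times in total but in only two tweets, which is why the two dictionaries are kept separately.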


**Splitting the dataset**

The following code splits the dataset 80/20: 80% of the data is used for training and 20% for testing. A fixed random_state makes the split reproducible.

In [10]:
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)

print('Text Observations (X) and Disaster Labels (y) for TRAIN set')
print('X_train: ', X_train.shape)
print('y_train', y_train.shape, '\n')

print('Text Observations (X) and Disaster Labels (y) for TEST set')
print('X_test:', X_test.shape)
print('y_test', y_test.shape)

Text Observations (X) and Disaster Labels (y) for TRAIN set
X_train:  (8700,)
y_train (8700,) 

Text Observations (X) and Disaster Labels (y) for TEST set
X_test: (2176,)
y_test (2176,)
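
The set sizes follow from rounding the test share up: 20% of 10876 is 2175.2, which becomes 2176 test rows and leaves 8700 for training. The sketch below shows a shuffled split with the standard library only; `simple_split` is a hypothetical helper and does not reproduce the exact rows that `train_test_split` selects.

```python
import math
import random

def simple_split(items, test_size=0.2, seed=42):
    """Shuffle indices with a fixed seed and cut off the test portion."""
    n_test = math.ceil(len(items) * test_size)  # round the test share up
    indices = list(range(len(items)))
    random.Random(seed).shuffle(indices)
    test_items = [items[i] for i in indices[:n_test]]
    train_items = [items[i] for i in indices[n_test:]]
    return train_items, test_items

# Stand-in data: one integer per tweet in the dataset
demo_train, demo_test = simple_split(list(range(10876)))
print(len(demo_train), len(demo_test))  # 8700 2176
```

Given the class imbalance noted earlier, passing `stratify=y` to `train_test_split` would keep the disaster share equal in both sets; the notebook does not do this, so it is mentioned only as an option.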


**Generate Document Frequencies**

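A document is an NLP term for one unit of text. In our case, one tweet is one document. The code below uses the document frequencies collected earlier to score each token by its pointwise mutual information (PMI) with the disaster label, log2(p(x, y) / (p(x) p(y))). Tokens that never occur in a disaster tweet get a fixed floor value of log2(1e-10), about -33.22, because log2(0) is undefined.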

In [19]:
token_counts = dict()
x_train_disaster_probabilities = list()
N = x.shape[0]  # total number of tweets (documents)
print('N: ', N)
p_y = np.sum(y)/N  # prior probability that a tweet is a disaster
print('p(y) =', p_y)
min_pmi = np.log2(1e-10)  # Floor value, because np.log2(0) == -inf
x_train_features = np.zeros((X_train.shape[0], 7))
for row in X_train:
    dedented_text = textwrap.dedent(row).strip()
    # Wrap the tweet at 80 columns for display
    print('Twitter message:', '"%s"' % textwrap.fill(dedented_text, width=80))

    tokens = tokenizer.tokenize(row)
    disaster_probabilities = list()        # PMI value of each token
    disaster_probabilities_round = list()  # same values, rounded for display
    for token in tokens:
        p_x = document_frequencies[token.lower()]/N  # p(token)
        p_x_y = token_disasters[token.lower()]/N     # p(token, disaster), from counts in disaster tweets
        if p_x_y == 0:
            pmi = min_pmi
        else:
            pmi = np.log2(p_x_y/(p_x * p_y))
        disaster_probabilities.append(pmi)
        disaster_probabilities_round.append(round(pmi, 2))

    print('\n','\n', 'Disaster Text Probabilities')
    print(disaster_probabilities_round)
    break  # only the first training tweet is shown
    

N:  10876
p(y) = 0.20853254873115115
Twitter message: "i dont even remember slsp happening i just remember being like wtf and then the
lights turned off and everyone screamed for the encore"

 
 Disaster Text Probabilities
[-1.86, -1.2, -1.05, -0.1, -33.22, -1.91, -1.86, -1.93, -0.1, -0.35, -1.79, -1.74, -0.44, -1.18, 0.21, -33.22, -1.74, -0.24, -0.44, -0.68, -33.22, 0.16, 0.21, -33.22]
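
The values in the list above are PMI scores, not probabilities: a positive score means the token appears in disaster tweets more often than chance would predict, a negative score means less often, and -33.22 is the floor used for tokens never seen in a disaster tweet. A minimal standard-library sketch of the same calculation follows; `pmi_score` and the token counts passed to it are hypothetical, while the dataset size and disaster count are the ones printed earlier.

```python
import math

def pmi_score(doc_freq, disaster_count, n_docs, prior, floor=math.log2(1e-10)):
    """PMI between a token and the disaster label, with a floor for zero counts."""
    p_x = doc_freq / n_docs          # p(token)
    p_x_y = disaster_count / n_docs  # p(token, disaster)
    if p_x_y == 0:
        return floor                 # avoid log2(0)
    return math.log2(p_x_y / (p_x * prior))

n_docs = 10876
prior = 2268 / n_docs  # p(y), share of disaster tweets

# Hypothetical token counts, chosen only to show the three regimes
never_in_disasters = pmi_score(100, 0, n_docs, prior)
mostly_in_disasters = pmi_score(100, 50, n_docs, prior)
rarely_in_disasters = pmi_score(100, 10, n_docs, prior)

print(round(never_in_disasters, 2))  # -33.22
print(mostly_in_disasters > 0, rarely_in_disasters < 0)  # True True
```

A token whose disaster share equals the overall disaster share scores exactly 0, which is what makes PMI a measure of association rather than of raw frequency.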
