<div class="alert alert-info alertinfo" style="margin-top: 0px">
<h1> Natural Language Processing with Disaster Tweets </h1>
</div>

<div class="alert-success" style="margin-top: 0px">
<h1> Imports </h1>
</div> 

In [1]:
# standard
import pandas as pd
import numpy as np

# visualization
import plotly.express as px

<div class="alert-success" style="margin-top: 0px">
<h1> Data exploration </h1>
</div> 

In [2]:
# definitions
def summary_table(df):
    '''
    Creates a summary info table about the given data frame
    
    Args:
        df: data frame
        
    Returns:
        summary: info data frame
    '''
    print('There are {} rows in the original data'.format(df.shape[0]))
    summary = pd.DataFrame(df.dtypes, columns = ['dtypes'])
    summary = summary.reset_index()
    summary['Name'] = summary['index']
    summary = summary[['Name', 'dtypes']]
    summary['Missing'] = df.isnull().sum().values
    summary['Uniques'] = df.nunique().values
    return summary

def show_target_distribution(df, target_column):
    '''
    Creates a plot of target distribution
    
    Args:
        df: data frame
        target_column: target column name
        
    Returns:
        target_distribution: data frame
    '''
    target_distribution = df[target_column].value_counts().reset_index()
    fig = px.bar(target_distribution, x=target_column, y='count', title='Distribution of Target Variable', width=600, height=400)
    fig.show()
    return target_distribution

def show_distribution_of_text_leghts(df, text_column):
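    '''
    Plots the distribution of text lengths and prints sample tweets

    Note: adds a 'text_length' column to df in place and expects
    df to contain a 'target' column.

    Args:
        df: data frame
        text_column: text column name
        
    Returns:
        None
    '''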
    df['text_length'] = df[text_column].apply(len)
    fig = px.histogram(df, x='text_length', nbins=30, title='Distribution of Text Lengths', labels={'text_length': 'Text Length', 'count': 'Frequency'})
    fig.show()
    long_border = 0.8 * df['text_length'].values.max()
    sample_long = df[df['text_length'] >= long_border][text_column].head()
    short_border = 1.2 * df['text_length'].values.min()
    sample_short = df[df['text_length'] <= short_border][text_column].head()
    sample_long = sample_long.values
    print('Some long tweets:')
    print('-----------------')
    for x in sample_long:
        print(x)
    print('Some short tweets:')
    print('------------------')
    for x in sample_short:
        print(x)
    # shortest texts classified as target
    print('shortest text length classified as target:', df[df['target'] == 1]['text_length'].min())
    print('------------------')
    df_target = df[df['target'] == 1].sort_values(by='text_length', ascending=True)
    df_target = df_target[[text_column]].head(5)
    for x in df_target[text_column].values:
        print(x)

### 1. Read data set

In [3]:
df = pd.read_csv('train.csv')
df.head()

|   | id | keyword | location | text | target |
|---|---|---|---|---|---|
| 0 | 1 |  |  | Our Deeds are the Reason of this #earthquake M... | 1 |
| 1 | 4 |  |  | Forest fire near La Ronge Sask. Canada | 1 |
| 2 | 5 |  |  | All residents asked to 'shelter in place' are ... | 1 |
| 3 | 6 |  |  | 13,000 people receive #wildfires evacuation or... | 1 |
| 4 | 7 |  |  | Just got sent this photo from Ruby #Alaska as ... | 1 |


### 2. Data overview

In [4]:
summary_table(df)

There are 7613 rows in the original data


|   | Name | dtypes | Missing | Uniques |
|---|---|---|---|---|
| 0 | id | int64 | 0 | 7613 |
| 1 | keyword | object | 61 | 221 |
| 2 | location | object | 2533 | 3341 |
| 3 | text | object | 0 | 7503 |
| 4 | target | int64 | 0 | 2 |
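
The summary lists 7503 unique values for the 7613 `text` entries, so some tweets appear more than once. A minimal sketch of how such repeats could be inspected, including whether copies of the same text carry different labels. The helper name `find_conflicting_duplicates` and the toy data frame are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd

def find_conflicting_duplicates(df, text_column, target_column):
    """Return texts that occur more than once with more than one distinct label."""
    # keep only rows whose text occurs more than once
    duplicated = df[df.duplicated(subset=text_column, keep=False)]
    # count distinct labels per repeated text
    labels_per_text = duplicated.groupby(text_column)[target_column].nunique()
    return labels_per_text[labels_per_text > 1].index.tolist()

# toy data for illustration only (not taken from train.csv)
toy = pd.DataFrame({
    'text': ['fire downtown', 'fire downtown', 'nice day', 'nice day', 'flood'],
    'target': [1, 0, 0, 0, 1],
})
conflicts = find_conflicting_duplicates(toy, 'text', 'target')
print(conflicts)
```

On the real data the same call would be `find_conflicting_duplicates(df, 'text', 'target')`; whether conflicting labels actually exist there is not shown in this notebook.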


### 3. Target variable

In [5]:
# the classes are moderately imbalanced (4342 vs. 3271, about 57% / 43%), not extremely so
target_distribution = show_target_distribution(df, 'target')
target_distribution

|   | target | count |
|---|---|---|
| 0 | 0 | 4342 |
| 1 | 1 | 3271 |
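
To put a number on the imbalance, the counts can be turned into class shares. This is a small sketch that reuses the two counts printed above; the variable names `counts` and `shares` are illustrative.

```python
import pandas as pd

# counts copied from the target distribution output above
counts = pd.Series({0: 4342, 1: 3271}, name='count')

# share of each class among all rows, rounded to three decimals
shares = (counts / counts.sum()).round(3)
print(shares)
```

With the data frame at hand, `df['target'].value_counts(normalize=True)` should give the same shares directly.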


### 4. Explore text data

In [6]:
# distribution of text lengths
show_distribution_of_text_leghts(df, 'text')

Some long tweets:
-----------------
All residents asked to 'shelter in place' are being notified by officers. No other evacuation or shelter in place orders are expected
Haha South Tampa is getting flooded hah- WAIT A SECOND I LIVE IN SOUTH TAMPA WHAT AM I GONNA DO WHAT AM I GONNA DO FVCK #flooding
Barbados #Bridgetown JAMAICA ÛÒ Two cars set ablaze: SANTA CRUZ ÛÓ Head of the St Elizabeth Police Superintende...  http://t.co/wDUEaj8Q4J
First night with retainers in. It's quite weird. Better get used to it; I have to wear them every single night for the next year at least.
SANTA CRUZ ÛÓ Head of the St Elizabeth Police Superintendent Lanford Salmon has r ... - http://t.co/vplR5Hka2u http://t.co/SxHW2TNNLf
Some short tweets:
------------------
LOOOOOOL
The end!
Crushed
fatality
Bad day
shortest text length classified as target: 14
------------------
Omg earthquake
hurricane?? sick!
I see a massacre!!
My hand is burning
Earthquake drill ??
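
The length check above stores a `text_length` column on `df` as a side effect. A possible alternative, sketched here under the assumption that a per-class summary is what is wanted, computes the lengths on the fly and leaves the data frame untouched. The helper name `text_length_by_target` and the toy rows are hypothetical.

```python
import pandas as pd

def text_length_by_target(df, text_column, target_column):
    """Summarize text lengths per class without adding a column to df."""
    # character count of every text, kept as a separate Series
    lengths = df[text_column].str.len()
    # group the lengths by the class label of the corresponding row
    return lengths.groupby(df[target_column]).agg(['min', 'median', 'max'])

# toy data for illustration only (texts and labels are made up)
toy = pd.DataFrame({
    'text': ['quake!', 'big flood downtown', 'hi', 'nice day'],
    'target': [1, 1, 0, 0],
})
stats = text_length_by_target(toy, 'text', 'target')
print(stats)
```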


### 5. Other feature exploration

In [7]:
# keywords
keywords = set(df['keyword'].values)
list(keywords)[0:20]

['drown',
 'quarantine',
 'typhoon',
 'annihilated',
 'upheaval',
 'suicide%20bomb',
 'attack',
 'detonate',
 'inundation',
 'crushed',
 'crush',
 'obliterated',
 'casualties',
 'wreckage',
 'outbreak',
 'sinkhole',
 'seismic',
 'survived',
 'whirlwind',
 'exploded']
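
Some keywords, such as `'suicide%20bomb'`, contain a percent-encoded space (`%20`). A minimal sketch of how they could be decoded with the standard library; the three-element sample Series is made up for illustration, and decoding keywords is a suggestion, not a step the notebook performs.

```python
from urllib.parse import unquote

import pandas as pd

# small sample for illustration; the missing value mimics rows without a keyword
keywords = pd.Series(['suicide%20bomb', 'attack', None])

# decode percent-encoding, leaving missing values untouched
decoded = keywords.map(unquote, na_action='ignore')
print(decoded.tolist())
```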

In [8]:
# locations
locations = set(df['location'].values)
list(locations)[0:20]

['Suva, Fiji Islands.',
 'Yeezy Taught Me , NV',
 'Spokane, Washington',
 'West Virginia, USA',
 'Ottawa, Ontario',
 'TonyJ@Centralizedhockey.com',
 'Ashland, Oregon',
 'SoDak',
 'Neverland',
 '36 & 38',
 'Halton Region',
 'Gotham City,USA',
 'Uganda',
 'In Hell',
 'Miami Beach, Fl',
 'port matilda pa',
 'BodÌü, Norge',
 'Enfield, UK',
 'Lindenhurst',
 'Canada BC']

##### This is the end of the preliminary data exploration. Deeper text exploration follows in part 2, data cleaning.