# CS4447 Final Project - Predicting Real Disasters from Tweets
## Hafez Gharbiah, Tyler Christeson
## data: https://www.kaggle.com/vbmokin/nlp-with-disaster-tweets-cleaning-data

## Rubric + Guidelines
1. Proper tagging of Github repository for final report as per deadlines (0.5 = 0.25 + 0.25 points)
1. Dataset and motivation slide (1 point)
    - How/why the dataset was collected and a description of the metadata of your dataset.
1. Actual task definition/research question (2 points)
    - What real-world problem are you trying to solve? What are the input and output of your analysis?
1. Literature review (2 points)
    - What other work has been done in this area, and how is your work novel compared to others?
1. Quality of cleaning (6 points, 2 points each) 
    - Data cleaning and type conversion activity. Please share anything unusual you faced during this activity.
    - What did you do about missing values and why? Handling missing values properly is very important.
    - New feature/attribute creation and data summary statistics and interpretation.
1. Visualization (8 points, 2 points each)
    - Data visualization activity (box plot, bar plot, violin plot, and pairplot to see relationships and distribution, etc.).
    - Describe anything you find in the data after each visualization.
    - Which data visualization helped you understand the data distribution.
    - What you did about possible outliers identified through the data distribution visualizations. (Did you confirm with your client whether each one is actually an outlier, or put a disclosure statement in your notebook if you decided to remove it?)

- The problem we're trying to solve is predicting whether a tweet is about a real disaster or not, which can be used to determine if emergency services need to be sent.

- We have a collection of 10,000 tweets. The attributes of the dataset are a unique identifier for each tweet, the text of the tweet, the location the tweet was sent from, keywords that could be used to identify disasters, and a label indicating whether or not the tweet is about a real disaster (available for only some of the tweets).

- Examples of records:
    - "Heard about # earthquake is different cities, stay safe everyone ." 
    - "Please like and share our new page for our Indoor Trampoline Park Aftershock opening this fall !" 
    - " nowplaying Alfons - Ablaze 2015 on Puls Radio pulsradio" 
    - "Coincidence Or # Curse ? Still # Unresolved Secrets From Past # accident"

- This is a noisy dataset because not all of the tweets are about disasters, and certain disaster keywords are used in contexts that have nothing to do with disasters. For example, while "ablaze" is used in several real disaster tweets about ongoing fires, in the above example it is used as a song title. The same is true for many keywords, like "accident" and "aftershock" above.
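
One way to quantify this ambiguity is to compute, for each keyword, the fraction of its tweets that are labeled as real disasters. The following is a minimal sketch on a toy frame; the rows are invented for illustration only, and the column names `keyword` and `target` are assumptions about the CSV rather than something confirmed above.

```python
import pandas as pd

# Toy rows, invented for illustration only (not taken from the real data).
# The column names `keyword` and `target` are assumed.
toy = pd.DataFrame({
    "keyword": ["ablaze", "ablaze", "ablaze", "aftershock", "aftershock", "earthquake"],
    "target":  [1, 0, 1, 0, 0, 1],
})

# For each keyword: share of tweets labeled as a real disaster, and tweet count.
keyword_rate = toy.groupby("keyword")["target"].agg(["mean", "count"])
print(keyword_rate)
```

A keyword whose mean is close to 0.5 carries little signal on its own, while values near 0 or 1 suggest the keyword is used mostly in one sense.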

- Feature engineering can be used in this dataset to:
    - extract years to see if we're tweeting about events that aren't current
    - extract news network names to determine whether the accident is being reported on or not
    - apply topic modeling to extract relevant topics as features
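
The year-extraction idea could be sketched with a regular expression, as below. This is only an illustration under stated assumptions: the helper name `extract_years` and the 1900 to 2099 range are hypothetical choices, not part of the project code.

```python
import re

# Four-digit years from 1900 to 2099 that are not part of a longer digit run
# (so phone numbers and IDs are not mistaken for years). The range is an
# assumption made for this sketch.
YEAR_RE = re.compile(r"(?<!\d)(19\d{2}|20\d{2})(?!\d)")

def extract_years(text):
    """Return all four-digit years mentioned in a tweet, as integers."""
    return [int(year) for year in YEAR_RE.findall(text)]

# One of the example records shown above.
example = " nowplaying Alfons - Ablaze 2015 on Puls Radio pulsradio"
print(extract_years(example))
```

If the tweet text lives in a column named `text` (an assumption), the helper could be applied with `df["text"].apply(extract_years)` to build a new feature column.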

In [1]:
import pandas as pd

In [14]:
df = pd.read_csv('train_data_cleaning.csv',index_col=0)

4342

In [18]:
#ratio of tweets w/ locations to all tweets
df.target[df.location.notna()].count() / df.target.count()

0.6672796532247471

In [19]:
#ratio of tweets about disasters w/ locations to all tweets about disasters
df.target[df.location.notna()].sum() / df.target.sum()

0.671354325894222

In [20]:
#ratio of tweets about disasters w/ locations to all tweets with locations
df.target[df.location.notna()].sum() / df.target[df.location.notna()].count()

0.4322834645669291

In [21]:
#ratio of tweets about disasters to all tweets
df.target.sum() / df.target.count()

0.4296597924602653
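
The ratios above are computed one cell at a time. A possible alternative is to group by whether a location is present and summarize both groups in one step. The sketch below uses a toy frame with invented rows, so its numbers do not correspond to the outputs above; only the column names `location` and `target` come from the cells in this notebook.

```python
import pandas as pd

# Toy frame invented for illustration; only the column names mirror the notebook.
toy = pd.DataFrame({
    "location": ["Ohio", None, "Lagos", None, "Tokyo"],
    "target":   [1, 0, 0, 1, 1],
})

# True where the tweet has a location, False where it is missing.
has_location = toy["location"].notna()

# Disaster rate (mean of target) and group size, with and without a location.
summary = toy.groupby(has_location)["target"].agg(["mean", "count"])
summary.index.name = "has_location"
print(summary)
```

Comparing the two `mean` values side by side shows directly whether the presence of a location is associated with a different disaster rate.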