# Disaster Tweets

## Objectives

**In this project I predict whether a tweet is about a real disaster or not.**

**What should I expect the data format to be?**

Each sample in the train and test set has the following information:

- The text of a tweet
- A keyword from that tweet (although this may be blank!)
- The location the tweet was sent from (may also be blank)

**What am I predicting?**

You are predicting whether a given tweet is about a real disaster or not. If so, predict a 1. If not, predict a 0.

**Files**

- train.csv - the training set
- test.csv - the test set
- sample_submission.csv - a sample submission file in the correct format

**Columns**

- id - a unique identifier for each tweet
- text - the text of the tweet
- location - the location the tweet was sent from (may be blank)
- keyword - a particular keyword from the tweet (may be blank)
- target - in train.csv only, this denotes whether a tweet is about a real disaster (1) or not (0)
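
As a quick check of the layout described above, the sketch below parses a few rows shaped like train.csv from an in-memory string and counts the blank fields per column. The rows are made up for illustration and are not taken from the competition files.

```python
import io
import pandas as pd

# Hypothetical rows that mimic the documented columns; not real competition data.
sample_csv = io.StringIO(
    "id,keyword,location,text,target\n"
    "1,,,Forest fire near the lake,1\n"
    "2,ablaze,London,This mixtape is ablaze,0\n"
    "3,flood,,Streets flooded after heavy rain,1\n"
)

sample = pd.read_csv(sample_csv)

# Blank keyword/location fields are parsed as NaN, so isna() counts them.
missing_per_column = sample.isna().sum()

# Confirm the header matches the documented column names.
expected_columns = ["id", "keyword", "location", "text", "target"]
has_expected_columns = list(sample.columns) == expected_columns
```

The same two checks can be run on the real `train.csv` right after loading it.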

## Data Setup

In [32]:
import pandas as pd

In [33]:
df = pd.read_csv('train.csv')

In [34]:
df.sample(20)

Unnamed: 0,id,keyword,location,text,target
4547,6464,injured,Mumbai,Udhampur terror attack: Militants attack polic...,1
354,509,army,"Memphis, TN",Salvation Army hosts rally to reconnect father...,1
2808,4039,disaster,"Alexandria, VA",Four Technologies That Could Let Humans Surviv...,0
2759,3962,devastation,contactSimpleNews@gmail.com,70 Years After Atomic Bombs Japan Still Strugg...,1
4952,7058,meltdown,,@nashhmu have a meltdown he noticed you,0
4492,6388,hurricane,"#1 Vacation Destination,HAWAII",HURRICANE GUILLERMO LIVE NOAA TRACKING / LOOPI...,1
208,294,ambulance,"Davidson, NC",People who try to j-walk while an ambulance is...,0
2245,3212,deluged,"Karachi, Pakistan",#Glimpses: Hyderabad deluged by heavy rainfall...,1
7347,10519,wildfire,,Solitude Fire Update August 6 2015 (Solitude W...,1
2,5,,,All residents asked to 'shelter in place' are ...,1


In [35]:
df['location'].unique()

array([nan, 'Birmingham', 'Est. September 2012 - Bristol', ...,
       'Vancouver, Canada', 'London ', 'Lincoln'], dtype=object)

In [36]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7613 entries, 0 to 7612
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   id        7613 non-null   int64 
 1   keyword   7552 non-null   object
 2   location  5080 non-null   object
 3   text      7613 non-null   object
 4   target    7613 non-null   int64 
dtypes: int64(2), object(3)
memory usage: 297.5+ KB


In [37]:
df.ffill()  # preview only: the result is not assigned, so df keeps its NaN values

Unnamed: 0,id,keyword,location,text,target
0,1,,,Our Deeds are the Reason of this #earthquake M...,1
1,4,,,Forest fire near La Ronge Sask. Canada,1
2,5,,,All residents asked to 'shelter in place' are ...,1
3,6,,,"13,000 people receive #wildfires evacuation or...",1
4,7,,,Just got sent this photo from Ruby #Alaska as ...,1
...,...,...,...,...,...
7608,10869,wrecked,Lincoln,Two giant cranes holding a bridge collapse int...,1
7609,10870,wrecked,Lincoln,@aria_ahrary @TheTawniest The out of control w...,1
7610,10871,wrecked,Lincoln,M1.94 [01:04 UTC]?5km S of Volcano Hawaii. htt...,1
7611,10872,wrecked,Lincoln,Police investigating after an e-bike collided ...,1


In [39]:
df['location'].value_counts()

USA                             104
New York                         71
United States                    50
London                           45
Canada                           29
                               ... 
cereal aisle #17:i4               1
iPhone: 33.104393,-96.628624      1
??t?a                             1
#Bummerville otw                  1
Mountains                         1
Name: location, Length: 3341, dtype: int64
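
The counts above list `USA` and `United States` as separate values, and the `unique()` output earlier contains `'London '` with a trailing space. A minimal sketch of how such variants could be folded together is shown below. The alias table and the function name are hypothetical, and whether this helps the classifier is untested here.

```python
# Hypothetical alias table; extend it after inspecting value_counts().
LOCATION_ALIASES = {
    "usa": "united states",
    "us": "united states",
    "u.s.": "united states",
}


def normalize_location(location):
    """Return a lower-cased, whitespace-trimmed location, or 'unknown' if missing."""
    # Missing values arrive as NaN (a float) or None, so any non-string counts as missing.
    if not isinstance(location, str) or not location.strip():
        return "unknown"
    cleaned = " ".join(location.split()).lower()  # collapse runs of whitespace
    return LOCATION_ALIASES.get(cleaned, cleaned)
```

It could be applied with `df['location'].apply(normalize_location)`, which would also cover the missing-value case.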

In [40]:
df['location'] = df['location'].fillna('Unknown')

In [41]:
df['merged'] = df['location'] + " " + df['text']

In [43]:
df['merged'].sample(4)

4335    Unknown Criminals Who Hijack Lorries And Buses...
786     Daruka (near Tamworth) NSW Get a load of this ...
5322    Durham, NC Element of Freedom: The Biggest Par...
5925    Unknown I just got screamed at for asking my d...
Name: merged, dtype: object

## NLP Process

In [60]:
def clearing(text):
    # To prepare the data for the NLP methods, the lines below lowercase the text
    # and are meant to strip numeric data and symbols from it.
    text = text.lower()
    # NOTE: str.replace matches its first argument literally, not as a regex,
    # so the three pattern lines below leave ordinary tweets unchanged.
    text = text.replace(r"[^\w\s]", "")
    text = text.replace(r"\d+", "")
    text = text.replace("\n", " ").replace("\r", "")
    text = text.replace(r"\&\#[0-9]+\;", "")

    return text
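
Python's `str.replace` searches for literal text, so a pattern such as `[^\w\s]` passed to it is looked up verbatim and almost never matches. A regex-based variant of the same cleaning steps could look like the sketch below. The name `clearing_regex` is hypothetical, and the final whitespace collapse is an extra step that is not in the original function.

```python
import re


def clearing_regex(text):
    """Regex-based variant of clearing(); the name is hypothetical."""
    text = text.lower()
    text = re.sub(r"&#[0-9]+;", "", text)   # drop numeric HTML entities such as &#39;
    text = re.sub(r"[^\w\s]", "", text)     # drop punctuation and symbols
    text = re.sub(r"\d+", "", text)         # drop digits
    text = text.replace("\n", " ").replace("\r", "")
    return re.sub(r"\s+", " ", text).strip()  # collapse leftover whitespace (added step)
```

Switching to this variant changes the vocabulary the model sees, so the accuracy recorded further down would need to be measured again.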

In [61]:
df['merged'] = df['merged'].apply(clearing)

In [63]:
df['merged'] = df['merged'].replace({r'&#[0-9]+;': ''}, regex=True)  # strip numeric HTML entities such as &#39;

In [64]:
df['merged']

0       unknown our deeds are the reason of this #eart...
1          unknown forest fire near la ronge sask. canada
2       unknown all residents asked to 'shelter in pla...
3       unknown 13,000 people receive #wildfires evacu...
4       unknown just got sent this photo from ruby #al...
                              ...                        
7608    unknown two giant cranes holding a bridge coll...
7609    unknown @aria_ahrary @thetawniest the out of c...
7610    unknown m1.94 [01:04 utc]?5km s of volcano haw...
7611    unknown police investigating after an e-bike c...
7612    unknown the latest: more homes razed by northe...
Name: merged, Length: 7613, dtype: object

## Modelling

In [65]:
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from textblob import TextBlob
from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer('english') 

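def split_into_lemmas(text):
    # Despite the name, this returns Snowball stems rather than dictionary lemmas.
    text = str(text).lower()
    words = TextBlob(text).words  # tokenize; TextBlob excludes punctuation tokens
    return [stemmer.stem(word) for word in words]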

In [66]:
x,y=df['merged'],df['target']

In [68]:
y

0       1
1       1
2       1
3       1
4       1
       ..
7608    1
7609    1
7610    1
7611    1
7612    1
Name: target, Length: 7613, dtype: int64

In [78]:
x_train,x_test,y_train,y_test=train_test_split(x,y,random_state=50)

## Vectorizing 

In [79]:
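# Because analyzer is a callable, scikit-learn ignores lowercase, stop_words and
# ngram_range here; split_into_lemmas alone decides the tokens.
vect=CountVectorizer(lowercase=True, stop_words='english', ngram_range=(1,2), analyzer=split_into_lemmas)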
x_train_dtm=vect.fit_transform(x_train)
x_test_dtm=vect.transform(x_test)
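
When `CountVectorizer` receives a callable `analyzer`, it uses the tokens that callable returns and skips its own lowercasing, stop-word filtering and n-gram generation. If stop words and bigrams are wanted, the callable has to produce them itself. The sketch below shows one way to do that. To stay dependency-free it uses a plain regex tokenizer and a tiny hypothetical stop-word list rather than TextBlob and the Snowball stemmer.

```python
import re

# Tiny illustrative stop-word list (hypothetical); a real run would use a fuller one.
STOP_WORDS = {"a", "an", "the", "is", "of", "in", "at", "and"}


def unigrams_and_bigrams(text):
    """Tokenize, drop stop words, and emit unigrams followed by bigrams."""
    tokens = [
        token
        for token in re.findall(r"[a-z0-9#@']+", str(text).lower())
        if token not in STOP_WORDS
    ]
    # Pair each token with its successor to form the bigrams.
    bigrams = [f"{first} {second}" for first, second in zip(tokens, tokens[1:])]
    return tokens + bigrams
```

A function like this can be passed as `CountVectorizer(analyzer=unigrams_and_bigrams)`; stemming could be added inside the list comprehension.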

In [80]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.naive_bayes import BernoulliNB
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

In [81]:
b=MultinomialNB()
model=b.fit(x_train_dtm,y_train)
b_predict=b.predict(x_test_dtm)

In [82]:
accuracy_score(y_test,b_predict)

0.7935924369747899
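
Accuracy is the share of held-out tweets that received the correct label. It does not show how the errors split between the two classes, so the sketch below derives precision, recall and F1 for the positive class from the same predictions. The label lists are made up for illustration; scikit-learn offers `precision_score`, `recall_score` and `f1_score` in `sklearn.metrics` for the real data.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for the positive class (label 1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    correct = sum(1 for t, p in pairs if t == p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": correct / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Made-up labels for illustration only.
metrics = binary_metrics([1, 0, 1, 1, 0, 0, 1, 0], [1, 0, 0, 1, 0, 1, 1, 0])
```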

## Test Dataset

In [88]:
test = pd.read_csv('test.csv')  # load the test set listed under Files
test['text'] = test['text'].apply(clearing)

In [89]:
test['location'] = test['location'].fillna('Unknown')

In [90]:
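# Note: unlike the training 'merged' column, there is no " " separator here and
# the location is not lowercased, so rows start with tokens like "Unknownjust".
test['merged'] = test['location'] + test['text']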

In [91]:
test['merged']

0               Unknownjust happened a terrible car crash
1       Unknownheard about #earthquake is different ci...
2       Unknownthere is a forest fire at spot pond, ge...
3         Unknownapocalypse lighting. #spokane #wildfires
4       Unknowntyphoon soudelor kills 28 in china and ...
                              ...                        
3258    Unknownearthquake safety los angeles ûò safet...
3259    Unknownstorm in ri worse than last hurricane. ...
3260    Unknowngreen line derailment in chicago http:/...
3261    Unknownmeg issues hazardous weather outlook (h...
3262    Unknown#cityofcalgary has activated its munici...
Name: merged, Length: 3263, dtype: object
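
The training column was built as `location + " " + text` and then cleaned as a whole, while the test column is joined without a separator after only the text was cleaned. The first token of every test row (for example `Unknownjust`) therefore never occurred in training. One way to keep the two in step is to build both columns with a single helper, sketched below. The helper name is hypothetical, and `clean` stands in for the notebook's `clearing` function.

```python
def build_merged(location, text, clean=str.lower):
    """Join location and text with a space, then clean the combined string.

    `clean` stands in for the notebook's clearing() function; it defaults to
    str.lower only so that this sketch runs on its own.
    """
    # Missing locations arrive as NaN (a float) or None; treat any non-string as unknown.
    if not isinstance(location, str):
        location = "Unknown"
    return clean(f"{location} {text}")
```

For example, `[build_merged(loc, txt, clearing) for loc, txt in zip(test['location'], test['text'])]` would produce the test column, and the same call on `df` would produce the training column.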

In [93]:
def vectorizing(text):
    
    return vect.transform([text])

In [94]:
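test_vectors = test['merged'].apply(vectorizing)  # new name so the fitted vectorizer `vect` is not overwritten

In [95]:
list = []  # note: this name shadows the built-in list type

In [96]:
for v in test_vectors:
    list.append(model.predict(v))
list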

[array([1], dtype=int64),
 array([1], dtype=int64),
 array([1], dtype=int64),
 array([1], dtype=int64),
 array([1], dtype=int64),
 array([1], dtype=int64),
 array([0], dtype=int64),
 array([0], dtype=int64),
 array([0], dtype=int64),
 array([0], dtype=int64),
 array([0], dtype=int64),
 array([0], dtype=int64),
 array([0], dtype=int64),
 array([0], dtype=int64),
 array([0], dtype=int64),
 array([1], dtype=int64),
 array([0], dtype=int64),
 array([1], dtype=int64),
 array([0], dtype=int64),
 array([0], dtype=int64),
 array([0], dtype=int64),
 array([1], dtype=int64),
 array([0], dtype=int64),
 array([1], dtype=int64),
 array([0], dtype=int64),
 array([0], dtype=int64),
 array([0], dtype=int64),
 array([0], dtype=int64),
 array([0], dtype=int64),
 array([1], dtype=int64),
 array([0], dtype=int64),
 array([1], dtype=int64),
 array([1], dtype=int64),
 array([0], dtype=int64),
 array([1], dtype=int64),
 array([0], dtype=int64),
 array([1], dtype=int64),
 array([0], dtype=int64),
 array([0], dtype=int64),
 ...
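
Predicting one row at a time yields a list of one-element arrays, as the output above shows. scikit-learn can also vectorize and classify a whole column in a single call, which returns a flat array. The sketch below bundles a vectorizer and the classifier in a pipeline to show this. The miniature corpus is made up, and the default `CountVectorizer()` is used instead of the notebook's custom analyzer to keep the example self-contained.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Made-up miniature corpus; the notebook would pass x_train / y_train instead.
train_texts = [
    "forest fire spreading near the town",
    "earthquake damages several buildings",
    "flood waters rising evacuate now",
    "this new album is fire",
    "my phone battery died again",
    "traffic was a total disaster lol",
]
train_labels = [1, 1, 1, 0, 0, 0]
new_texts = [
    "wild fire near the highway",
    "great album",
    "buildings collapse after earthquake",
]

# The pipeline keeps the fitted vectorizer and classifier together,
# so raw strings go in and labels come out.
pipeline = make_pipeline(CountVectorizer(), MultinomialNB())
pipeline.fit(train_texts, train_labels)

# One call vectorizes and classifies every row; the result is a flat 1-D array.
predictions = pipeline.predict(new_texts)
```

With the notebook's objects, the equivalent single call would be `model.predict(vect.transform(test['merged']))`, provided `vect` still refers to the fitted vectorizer.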

In [100]:
df = pd.DataFrame(list)

In [101]:
df

Unnamed: 0,0
0,1
1,1
2,1
3,1
4,1
...,...
3258,1
3259,1
3260,1
3261,1


In [102]:
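# Pair each prediction with its tweet id (id and target columns, the layout Kaggle's
# sample_submission.csv uses) and write a real .csv file without the index.
pd.DataFrame({'id': test['id'], 'target': df[0]}).to_csv('submission1.csv', index=False)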
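
The reference for the output layout is `sample_submission.csv`. Assuming it holds one `id` column and one `target` column (the two names come from the column list at the top; the exact layout of the sample file is an assumption here), a dependency-free writer could look like the sketch below. The function name and the sample values are hypothetical.

```python
import csv
import io


def write_submission(ids, predictions, handle):
    """Write one `id,target` row per tweet to an open text handle."""
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow(["id", "target"])
    for tweet_id, label in zip(ids, predictions):
        # int() also unwraps NumPy scalars into plain integers.
        writer.writerow([int(tweet_id), int(label)])


# Made-up ids and labels for illustration.
buffer = io.StringIO()
write_submission([0, 2, 3], [1, 1, 0], buffer)
submission_text = buffer.getvalue()
```

To write a real file, pass `open('submission1.csv', 'w', newline='')` as the handle.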