<h1>Disaster Tweets Classification</h1>

<h4 style='text-align: justify;'>
This notebook mainly preprocesses the dataset. First, I test the accuracy that trivial prediction methods can reach. Afterwards, the data is processed so that it can be used by BERT. The machine learning models are not shown in this notebook because I cover them in a separate notebook. Neural networks with BERT were the main interest, yet other algorithms were used too.
<br/>
For the machine learning models, check the link below:
    <a href=''>Machine Learning Notebook</a>
</h4>

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt
import nltk
import tensorflow_hub as hub
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.callbacks import ModelCheckpoint
from collections import OrderedDict,Counter
import sklearn
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import confusion_matrix
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.manifold import TSNE

In [None]:
n_splits=7
random_state=27

<h2><b>Functions</b></h2>

<h4>
This section defines the helper functions used for cross-validation, prediction and submission.
</h4>

In [None]:
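def get_keyword_mean(train_df,val_df,column='keyword',target='target'):
    '''
    Add the mean of the target for each value of `column`, computed on train_df only.
    '''
    aggregate_column=column+'_mean'
    # Work on a copy so that a slice of a larger dataframe is not modified in place
    train_df=train_df.copy()
    train_df[aggregate_column]=train_df.groupby(column)[target].transform('mean')
    
    val_df = val_df.merge(
                train_df[[column, aggregate_column]].drop_duplicates(),
                on=column,
                how="left",
            )
    # Values never seen in train_df get the overall training mean instead of NaN
    val_df[aggregate_column]=val_df[aggregate_column].fillna(train_df[target].mean())
    return train_df,val_df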

In [None]:
def cv_datasets(df,model,regex='^BERT',target_column='target',use_keyword=False,model_function=None,compile_dic={},**kwargs):
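    '''
    Cross-validate a model and print accuracy, precision and recall for each fold.

    df: Dataframe to divide the data.
    model: Model used to fit and evaluate the data.
    regex: Regex used to pick the feature columns. By default, it selects the columns starting with BERT.
    target_column: Name of the column containing the output.
    use_keyword: If True, the per-keyword target mean (computed on the training fold only) is added as a feature.
    model_function: Function that returns a new, uncompiled keras model. It is only used for keras models.
    compile_dic: Keyword arguments passed to compile(). It is only used for keras models.
    kwargs: Extra arguments passed to fit().
    Returns a dictionary with one confusion matrix per fold and their sum under 'Total'.
    '''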
    kfold=KFold(n_splits=n_splits,shuffle=True,random_state=random_state)
    results={'Total':0}
    i=0
    
    #iterate through the folds
    for train_index,val_index in kfold.split(df):
        
        if 'keras' in str(type(model)):
            #keras
            model_copy=model_function()
            model_copy.compile(**compile_dic)
    
            model_copy.summary()
        else:
            #sklearn
            model_copy= sklearn.base.clone(model)
            
        train_df=df.iloc[train_index]
        val_df=df.iloc[val_index]
        
        if use_keyword:
            train_df,val_df=get_keyword_mean(train_df,val_df)
            # Append the keyword pattern only once instead of once per fold
            if '^keyword_' not in regex:
                regex+='|^keyword_'
        
        X_train=train_df.filter(regex=regex,axis=1)
        y_train=train_df[target_column]
        
        X_val=val_df.filter(regex=regex,axis=1)
        y_val=val_df[target_column]
        
        #print(X_train)
        X_train=X_train.copy().values
        y_train=y_train.copy().values
        
        X_val=X_val.copy().values
        y_val=y_val.copy().values
        
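        # Keras models cannot be cloned like sklearn ones, hence a fresh model is built for each fold above
        model_copy.fit(X_train,y_train,**kwargs)
        
        # Keras models output probabilities, so the predictions are thresholded to get class labels
        predicted=(np.asarray(model_copy.predict(X_val)).ravel()>0.5).astype(int)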
        
        metric=confusion_matrix(y_val,predicted)
        
        tn, fp, fn, tp = metric.ravel()
        
        results_matrix=metric
        results['Batch '+str(i)]=results_matrix 
        results['Total']+=np.array(results_matrix)
        i+=1
        
        print(f'{i}th run:\naccuracy: {(tp+tn)/(tn+fp+fn+tp)}\nprecision:{tp/(tp+fp)}\nrecall:{tp/(tp+fn)}\n')
    
    
    tn, fp, fn, tp = results['Total'].ravel()
    
    print(f'Total:\naccuracy: {(tp+tn)/(tn+fp+fn+tp)}\nprecision:{tp/(tp+fp)}\nrecall:{tp/(tp+fn)}\n')
    
    return results
        

In [None]:
def fit_and_predict(model,train_df,test_df,use_keyword=False,regex='^BERT',target_column='target',**kwargs):
    if use_keyword:
        train_df,test_df=get_keyword_mean(train_df,test_df)
        regex+='|^keyword_'

    X_train=train_df.filter(regex=regex,axis=1)
    y_train=train_df[target_column]

    X_test=test_df.filter(regex=regex,axis=1)
    
    model.fit(X_train,y_train,**kwargs)
    
    test_df['predictions']=(model.predict(X_test)>0.5).astype(int)
    
    return test_df

In [None]:
def generate_submission(test_df,predictions_column='predictions',name='submission'):
#     new_submission=pd.DataFrame()
#     new_submission['id']=test_df['id']
#     new_submission['target']=predictions
#     new_submission=new_submission.set_index('id')
    
    new_submission=test_df[['id',predictions_column]].set_index('id').rename(columns={predictions_column:'target'})
    print(f'saving the results in {name}.csv')
    new_submission.to_csv(name+'.csv')
    print('finished saving')
    
    return new_submission

In [None]:
def generate_results(df,model,name='submission',regex='^BERT',target_column='target',use_keyword=False,compile_dic={},model_function=None,**kwargs):
    df['keyword']=df['keyword'].fillna('unk')
    df['location']=df['location'].fillna('unk_location')
    
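    # Copies avoid modifying slices of df when new columns are added later
    train_df=df[df['trainable']==1].copy()
    test_df=df[~(df['trainable']==1)].copy()
    
    results=cv_datasets(train_df,model,regex=regex,target_column=target_column,model_function=model_function,use_keyword=use_keyword,compile_dic=compile_dic,**kwargs)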
    
    test_df=fit_and_predict(model,train_df,test_df,use_keyword=use_keyword,regex=regex,target_column=target_column,**kwargs)
    
    test_df['predictions']=test_df['predictions'].values>0.5
    test_df['predictions']=test_df['predictions'].astype(int)
    
    submission=generate_submission(test_df,predictions_column='predictions',name=name)
    
    return results,test_df,submission

In [None]:
df=pd.read_csv('/kaggle/input/bert-features/BERT_processed.csv')
results,test_df,submission=generate_results(df,model=LogisticRegression(solver='sag'),use_keyword=True)

In [None]:
df=pd.read_csv('/kaggle/input/bert-features/BERT_processed.csv')

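rf_parameters={
    'n_estimators':300,
    'max_depth':20,
    'min_samples_split':80,
    'min_samples_leaf':30,
    'min_weight_fraction_leaf':0.0,
    # 'sqrt' is what 'auto' meant for classifiers; 'auto' was removed in recent scikit-learn versions
    'max_features':'sqrt',
    'max_leaf_nodes':40,
    'min_impurity_decrease':0.0,
    # 'min_impurity_split' was dropped because scikit-learn 1.0 removed it
}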
results,test_df,submission=generate_results(df,model=RandomForestClassifier(**rf_parameters),use_keyword=True)

In [None]:
#Best till now
df=pd.read_csv('/kaggle/input/bert-features/BERT_processed.csv')
results,test_df,submission=generate_results(df,model=SVC(C=5),use_keyword=True)

In [None]:
df=pd.read_csv('/kaggle/input/bert-features/BERT_processed.csv')
results,test_df,submission=generate_results(df,model=KNeighborsClassifier(200,weights='distance',p=2),use_keyword=True)

In [None]:
results,test_df,submission
results['Total']

In [None]:
tn, fp, fn, tp = results['Total'].ravel()
(tn+tp)/(tn+fp+fn+tp)

<h2><b>Get The Data</b></h2>

In [None]:
train_df=pd.read_csv('/kaggle/input/nlp-getting-started/train.csv')
test_df=pd.read_csv('/kaggle/input/nlp-getting-started/test.csv')
sample_submission=pd.read_csv('/kaggle/input/nlp-getting-started/sample_submission.csv')

In [None]:
sample_submission

In [None]:
train_df[train_df['keyword'].notnull()]

In [None]:
test_df['keyword']=test_df['keyword'].fillna('unk')
test_df

In [None]:
train_df.describe()

In [None]:
pd.isna(train_df).sum()/len(train_df)

In [None]:
positive_class=train_df['target'].sum()/len(train_df)
print(f'the percentage of ones in the dataset is {np.round(positive_class,2)}')

<h2><b>Simple Methods</b></h2>

<h4 style='text-align: justify;'>
First, I submitted the sample submission, which achieved a score of 57%. Afterwards, I generated output files from the keyword column and from the location column, one file each. This gives a sense of what score simple methods can reach and what more advanced ones can add.
</h4>
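
<h4 style='text-align: justify;'>
As a rough illustration of why a constant submission can already score above 50%, the sketch below computes the accuracy of always predicting the same class. The toy labels are made up for the example and are not the competition data. It also assumes that the score behaves like accuracy, as described above.
</h4>

```python
import pandas as pd

# Toy stand-in for the training labels (hypothetical values, not the real dataset)
toy_train = pd.DataFrame({"target": [0, 0, 0, 0, 1, 1, 1]})

positive_rate = toy_train["target"].mean()

# The accuracy of a constant prediction equals the share of rows with that label
accuracy_all_zeros = 1 - positive_rate
accuracy_all_ones = positive_rate
majority_baseline = max(accuracy_all_zeros, accuracy_all_ones)

print(f"positive rate: {positive_rate:.2f}")
print(f"majority-class baseline accuracy: {majority_baseline:.2f}")
```

<h4>
Any model worth keeping should beat this number, which is why it is a useful reference point before trying the keyword and location columns.
</h4>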

In [None]:
# One row per keyword: number of tweets and share of positive targets
keyword_df=(
    train_df[['keyword','target']].fillna('unk')
    .groupby('keyword')['target']
    .agg(['count','mean'])
    .rename(columns={'mean':'% positive'})
    .reset_index()
)
#keyword_df.to_csv('keyword_df.csv')

In [None]:
# One row per location: number of tweets and share of positive targets
location_df=(
    train_df[['location','target']].fillna('unk')
    .groupby('location')['target']
    .agg(['count','mean'])
    .rename(columns={'mean':'% positive'})
    .reset_index()
)
#location_df.to_csv('location_df.csv')

<h4>The idea is to group the rows that share the same column value (the keyword column for keyword_df, the location column for location_df). Each group is then summarized with some useful information. The most important value, and the one used in the model, is % positive, which is the mean of the target within the group.</h4>
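
<h4>
A minimal sketch of how the group mean can be turned into a prediction, using made-up toy data. The names toy_train, toy_test and fallback are hypothetical. The fallback to the overall mean for keywords never seen in training is my own assumption and is not part of the original method.
</h4>

```python
import pandas as pd

# Toy data (hypothetical values, not the real dataset)
toy_train = pd.DataFrame({
    "keyword": ["fire", "fire", "fire", "flood", "flood", None],
    "target":  [1, 1, 0, 0, 0, 1],
})
toy_test = pd.DataFrame({"id": [10, 11, 12], "keyword": ["fire", "flood", "quake"]})

# Mean of the target per keyword; missing keywords are grouped under 'unk'
positive_share = (
    toy_train.fillna({"keyword": "unk"})
             .groupby("keyword")["target"]
             .mean()
)

# Look up each test keyword; unseen keywords fall back to the overall mean (assumption)
fallback = toy_train["target"].mean()
toy_test["% positive"] = toy_test["keyword"].map(positive_share).fillna(fallback)

# Predict a disaster when at least half of the training tweets with that keyword were disasters
toy_test["target"] = (toy_test["% positive"] >= 0.5).astype(int)
print(toy_test)
```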

In [None]:
keyword_df.head()

In [None]:
print(keyword_df[keyword_df['keyword']==test_df.iloc[0]['keyword']]['% positive'].values[0])
new_submission=pd.DataFrame()
new_submission['id']=test_df['id']
new_submission['target']=test_df.apply(lambda row: int(keyword_df[keyword_df.keyword==row['keyword']]['% positive'].values[0]>=0.5),axis=1)

In [None]:
#new_submission.rename({0:'target'},axis=1)
new_submission.to_csv('to_submit.csv',index=False)

In [None]:
new_submission

<h4>
The keyword output file scored 72%, while the location output file scored lower. This is expected because the location column contains a lot of null values. <br/>
Notice the difference between the sample submission (57%) and the result using only the keyword (72%). This shows the value of the keyword column for this task. It is expected because the keyword is a one-word summary of the tweet.
</h4>
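
<h4>
The gap between the two baselines can also be estimated locally, without submitting, by scoring the keyword rule on held-out folds. The sketch below uses made-up toy data and a simple manual split into 5 folds. The fold count, the seed and the fallback for unseen keywords are assumptions chosen for the example.
</h4>

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): 'fire' tweets are mostly disasters, 'movie' tweets mostly not
toy = pd.DataFrame({
    "keyword": ["fire"] * 10 + ["movie"] * 10,
    "target":  [1] * 8 + [0] * 2 + [0] * 9 + [1] * 1,
})

# Shuffle the row positions and split them into 5 validation folds
rng = np.random.default_rng(27)
folds = np.array_split(rng.permutation(len(toy)), 5)

correct = 0
for val_idx in folds:
    val = toy.iloc[val_idx]
    train = toy.drop(toy.index[val_idx])

    # The keyword means come from the training part only, to avoid leaking the labels
    share = train.groupby("keyword")["target"].mean()
    fallback = train["target"].mean()
    pred = (val["keyword"].map(share).fillna(fallback) >= 0.5).astype(int)
    correct += int((pred.values == val["target"].values).sum())

keyword_accuracy = correct / len(toy)
majority_accuracy = max(toy["target"].mean(), 1 - toy["target"].mean())

print(f"keyword rule accuracy: {keyword_accuracy:.2f}")
print(f"majority-class accuracy: {majority_accuracy:.2f}")
```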

<h2><b>Visualizations</b></h2>

In [None]:
train_ls=list(train_df['text'].str.split(' ').to_numpy())

In [None]:
Tokenizer=keras.preprocessing.text.Tokenizer(oov_token=0)
Tokenizer.fit_on_texts(train_ls)

In [None]:
i=0
n_words=len(Tokenizer.word_counts)
for word,count in Counter(Tokenizer.word_counts).most_common():
    # Stop at the first word that appears fewer than 100 times
    if count<100:
        break
    print(f'{word}: {count}')
    i+=1

print(f'{i} words appear at least 100 times, out of {n_words} distinct words ({i/n_words*100:.2f}%)')

In [None]:
sns.countplot(x='target',data=train_df)

In [None]:
max_characters=train_df['text'].str.len().max()
print(f'the max number of characters in a tweet is {max_characters}\nThus, the number of words in a tweet is < {max_characters}')

In [None]:
type(Counter(Tokenizer.word_counts).most_common())

In [None]:
top_100=Counter(Tokenizer.word_counts).most_common()[:100]

x_top_100=[x for x,y in top_100]
y_top_100=[y for x,y in top_100]
ax=sns.barplot(x=y_top_100,y=x_top_100)
ax.figure.set_size_inches(15,20)

<h4 style='text-align: justify;'>
Looking at the top 100 words shows that the most used words, like 'a', give little information about whether a disaster occurred. Actually, only 'emergency' and 'disaster' are words that indicate the possibility of a disaster occurring. Hence, most of the top words are stop words that do not give much information about the tweet itself. These words will be removed.
</h4>
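
<h4>
To illustrate the effect of removing stop words on the word counts, here is a small sketch with made-up tweets. The stop list toy_stop is a short hand-written set for the example only. The notebook itself uses the English stop word list from nltk.
</h4>

```python
from collections import Counter

# Small hand-written stop list for illustration only (assumption, not nltk's list)
toy_stop = {"a", "the", "in", "of", "is", "to"}

# Made-up tweets, not taken from the dataset
toy_tweets = [
    "a fire is spreading in the forest",
    "the emergency crew is on the way to the fire",
]

# Count every word, then count again without the stop words
all_counts = Counter(word for tweet in toy_tweets for word in tweet.lower().split())
content_counts = Counter({w: c for w, c in all_counts.items() if w not in toy_stop})

print(all_counts.most_common(3))
print(content_counts.most_common(3))
```

<h4>
Before filtering, the most frequent word is a stop word. After filtering, the most frequent word is one that actually relates to the topic of the tweets.
</h4>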

In [None]:
bottom_100=Counter(Tokenizer.word_counts).most_common()[-100:]

x_bottom_100=[x for x,y in bottom_100]
y_bottom_100=[y for x,y in bottom_100]

#Horizontal barplot
ax=sns.barplot(x=y_bottom_100,y=x_bottom_100)
ax.figure.set_size_inches(15,20)

<h4>
The bottom 100 words are all links. I do not think that the presence of a link means a disaster happened: I would not expect a disaster just because someone tweeted 'check this link http:...'. A link could be informative only if it pointed to a source, such as a radio station, that reports exclusively on disasters. Hence, the links will be removed too.
</h4>
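
<h4>
One possible way to drop links is a regular expression applied before any other cleaning. This is only a sketch: the pattern URL_PATTERN and the helper strip_links are hypothetical names. The preprocessing in this notebook takes a different route and discards every token that contains 'http'.
</h4>

```python
import re

# Matches http(s) links and bare www links up to the next whitespace (illustrative pattern)
URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")

def strip_links(text: str) -> str:
    """Remove links from the text and collapse the leftover whitespace."""
    without_links = URL_PATTERN.sub(" ", text)
    return " ".join(without_links.split())

print(strip_links("Forest fire near La Ronge http://t.co/abc123 stay safe"))
```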

<h2><b>Process Tweets</b></h2>

<h4>
The tweets are processed using the nltk library.
</h4>

In [None]:
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from nltk.tokenize import RegexpTokenizer
import string
import re

In [None]:
stop=set(stopwords.words('english')) # a set makes the membership test in preprocess_sentence fast

# for i in stop:
#     print(i)

In [None]:
train_df['trainable']=1
test_df['trainable']=0

# ignore_index gives every row a unique index after the concatenation
total_df=pd.concat((train_df,test_df),ignore_index=True)
total_df

In [None]:
train_ls_tmp=[]
stemmer= PorterStemmer()

In [None]:
def get_stem(word:str,stemmer):
    '''
    Get the stem of the word
    '''
    return stemmer.stem(word)

In [None]:
def preprocess_sentence(text,stemmer):
    new_sentence=[]
    
    text = text.replace(r"won't ", "will not ")
    text = text.replace(r"can't ", "can not ")
    text = text.replace(r"ain't ", "am not ")
    
    text = text.replace(r"n't ", " not ")
    text = text.replace(r"'re ", " are ")
    text = text.replace(r"'s ", " is ")
    text = text.replace(r"'d ", " would ")
    text = text.replace(r"'ll ", " will ")
    text = text.replace(r"'t ", " not ")
    text = text.replace(r"'ve ", " have ")
    text = text.replace(r"'m ", " am ")
    
    text = text.translate(str.maketrans('','',string.punctuation))
    
    text=re.sub(r'[^a-zA-Z0-9 ]', '', text)
    #print(text)
    sentence=text.split(' ')
    
    sentence=list(filter(lambda a: a != '', sentence))
    
    for word in sentence:
        #word=stemmer.stem(word)
        word_lower=word.lower()
        if (word_lower not in stop) and ('http' not in word_lower):# and ():
            new_sentence.append(word_lower)
    
    return ' '.join(new_sentence)

preprocess_sentence("Hey I'm Yann, ° Ñ  how're you and how's it going ? That's interesting: I'd love to hear more about it. http:idk.com",stemmer)

In [None]:
total_df['preprocessed_text']=total_df['text'].apply(lambda text: preprocess_sentence(text,stemmer))
total_df.head()

In [None]:
total_df.to_csv('preprocessed_text.csv')

<h1>Conclusion</h1>

<h4 style='text-align: justify;'>
This notebook started by generating outputs using simple methods from the non-tweet features. The keyword output got 72% accuracy. Afterwards, some plots were shown for visualization. The plots showed the effect of stop words and links. Finally, the text was preprocessed to be used with BERT.
</h4>

<h4>
    Check out my BERT notebook <a href=''>Here</a>
</h4>