
<h5 style="text-align: center; color: #BD6C49;"> <i> Ecole Polytechnique de Thiès <br>  Département Génie Informatique et Télécommunication </i> </h5>
<h3 style="text-align: center; color: orange"> Disaster Tweets Detection 🥇 Exploratory Data Analysis </h3>
<h5 style="text-align: center; color: green"> By Kikia DIA 🤝🏾 Mouhamadou Naby DIA 🤝🏾 Ndeye Awa SALANE </h5>

<a id="0"></a> <br>
### Overview
#### [Introduction](#1)
1. [Exercise 1: The PIL library](#2)
1. [Exercise 2: NumPy, Matplotlib](#3)
1. [Exercise 3: scikit-learn](#4)
1. [Exercise 4: SciPy](#8)
#### [Conclusion](#5)
* <i>[References](#6)</i>
* <i>[Authors](#7)</i>

<a id="1"></a> 
#### Introduction [⏮️]()[👆🏽](#0)[⏭️](#2)

<div style="display: flex;">
     <div style="flex: 1;">
         <img src="https://storage.googleapis.com/kaggle-media/competitions/tweet_screenshot.png" alt="Descriptive Image" style="height:90%;">
     </div>
     <div style="flex: 4; padding-top: 10px;">
         <p>
             ♻️ Twitter has become an important communication channel in times of emergency.
             <br><br>
             ♻️ The ubiquity of smartphones lets people report an emergency they are witnessing in real time. For this reason, more and more organizations, such as disaster relief organizations and news agencies, are interested in programmatically monitoring Twitter.
             <br><br>
             ♻️ However, it is not always clear whether a person's words actually announce a disaster (as the image alongside shows).
             <br><br>
             ♻️ The author explicitly uses the word "ABLAZE" (meaning "on fire") but uses it metaphorically. This is immediately clear to a human, especially with the visual aid, but it is less clear to a machine.
             <br><br>
             ♻️ This is why we chose to use a language model that predicts which tweets are about real disasters and which are not. We will use a dataset of 10,000 tweets that have already been labeled.
         </p>
     </div>
</div>


In [2]:
# Add the parent directory to the path so that project modules can be imported
import sys
sys.path.append('..')

In [3]:
# Logging
from src.logging.main import LoggerManager

log = LoggerManager('disaster_tweets_logging.ipynb')

In [20]:
# Imports
import pandas as pd
import numpy as np
import plotly.express as px
from plotly.subplots import make_subplots
import plotly.graph_objects as go
import seaborn as sns
import matplotlib.pyplot as plt
from nltk.corpus import stopwords
from nltk.util import ngrams
from collections import Counter, defaultdict
from nltk.tokenize import word_tokenize
from tqdm import tqdm
import string
import re
from sklearn.feature_extraction.text import CountVectorizer
from wordcloud import WordCloud
from spellchecker import SpellChecker

plt.style.use('ggplot')

In [5]:
train = pd.read_csv('../data/raw/train.csv')
test = pd.read_csv('../data/raw/test.csv')

♻️ Removing URLs

In [10]:
example="New competition launched: https://www.kaggle.com/c/nlp-getting-started"

In [11]:
def remove_URL(text):
    url = re.compile(r'https?://\S+|www\.\S+')
    return url.sub(r'',text)

remove_URL(example)

'New competition launched: '

In [12]:
train['text'] = train['text'].apply(remove_URL)
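
Before applying the pattern to the whole column, it can help to check what it actually matches. The sketch below redefines `remove_URL` so that it runs on its own. The sample strings are made up for illustration and are not taken from the dataset.

```python
import re

def remove_URL(text):
    # Same pattern as above: http(s) links or bare www. links, up to the next whitespace
    url = re.compile(r'https?://\S+|www\.\S+')
    return url.sub('', text)

# Hypothetical tweets, for illustration only
samples = [
    "Fire downtown http://t.co/abc123 stay safe",
    "See www.example.com for updates",
    "Read this (https://example.com/page).",
]

cleaned = [remove_URL(s) for s in samples]
for before, after in zip(samples, cleaned):
    print(repr(before), '->', repr(after))
```

Two side effects are visible here: removing a link in the middle of a sentence leaves a double space, and because `\S+` runs up to the next whitespace, punctuation glued to the end of a link (such as `).`) is removed together with it.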

♻️ Removing HTML tags

In [13]:
example = """<div>
<h1>Real or Fake</h1>
<p>Kaggle </p>
<a href="https://www.kaggle.com/c/nlp-getting-started">getting started</a>
</div>"""

In [14]:
def remove_html(text):
    # Non-greedy match of anything between '<' and '>'
    html = re.compile(r'<.*?>')
    return html.sub(r'', text)

print(remove_html(example))


Real or Fake
Kaggle 
getting started



In [15]:
train['text'] = train['text'].apply(remove_html)
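
Removing tags does not decode HTML entities such as `&amp;`. If the tweets contain such entities (an assumption, not something checked above), a minimal sketch using the standard library's `html.unescape` could look like this. The helper name `remove_html_and_entities` and the sample string are hypothetical.

```python
import html
import re

TAG_RE = re.compile(r'<.*?>')

def remove_html_and_entities(text):
    # Hypothetical helper: strip tags first, then decode entities such as &amp; or &lt;
    without_tags = TAG_RE.sub('', text)
    return html.unescape(without_tags)

sample = "<p>Fire &amp; smoke near the bridge</p>"
result = remove_html_and_entities(sample)
print(result)
```

Tags are stripped before decoding so that a decoded `<` (for example from `&lt;3`) is not mistaken for the start of a tag.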

♻️ Removing Emojis

In [16]:
# Reference : https://gist.github.com/slowkow/7a7f61f495e3dbb7e3d767f97bd7304b
def remove_emoji(text):
    emoji_pattern = re.compile("["
                           u"\U0001F600-\U0001F64F"  # emoticons
                           u"\U0001F300-\U0001F5FF"  # symbols & pictographs
                           u"\U0001F680-\U0001F6FF"  # transport & map symbols
                           u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
                           u"\U00002702-\U000027B0"
                           u"\U000024C2-\U0001F251"
                           "]+", flags=re.UNICODE)
    return emoji_pattern.sub(r'', text)

remove_emoji("Omg another Earthquake 😔😔")

'Omg another Earthquake '

In [17]:
train['text'] = train['text'].apply(remove_emoji)
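
The character ranges in this pattern are worth a quick check, because the last range (`\U000024C2-\U0001F251`) is very wide. The sketch below repeats the same pattern so that it runs on its own. The sample strings are made up for illustration.

```python
import re

def remove_emoji(text):
    # Same ranges as the function above
    emoji_pattern = re.compile("["
                               "\U0001F600-\U0001F64F"  # emoticons
                               "\U0001F300-\U0001F5FF"  # symbols & pictographs
                               "\U0001F680-\U0001F6FF"  # transport & map symbols
                               "\U0001F1E0-\U0001F1FF"  # flags
                               "\U00002702-\U000027B0"
                               "\U000024C2-\U0001F251"
                               "]+", flags=re.UNICODE)
    return emoji_pattern.sub('', text)

samples = [
    "Omg another Earthquake \U0001F614\U0001F614",  # pensive face: removed
    "\u5730\u9707 earthquake",                      # two CJK characters: also removed
    "deal \U0001F91D",                              # handshake (U+1F91D): kept
]

for s in samples:
    print(repr(s), '->', repr(remove_emoji(s)))
```

The pattern therefore both over-matches (CJK text falls inside the wide range) and under-matches (newer emoji above `U+1F6FF` are outside every range). Whether that matters depends on how much non-Latin text and how many recent emoji the dataset contains.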

♻️ Removing punctuation

In [18]:
def remove_punct(text):
    table=str.maketrans('','',string.punctuation)
    return text.translate(table)

example="I am a #king"
print(remove_punct(example))

I am a king


In [19]:
train['text'] = train['text'].apply(remove_punct)
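
`string.punctuation` only contains ASCII punctuation, and it includes `#`, `@` and the apostrophe. The sketch below shows what that means for hashtags, mentions, contractions and typographic quotes. The sample strings are made up for illustration.

```python
import string

def remove_punct(text):
    # Same approach as above: delete every character listed in string.punctuation
    table = str.maketrans('', '', string.punctuation)
    return text.translate(table)

samples = [
    "#wildfire near @CityHall!!!",       # hashtag and mention markers are dropped
    "don't panic",                       # the apostrophe is dropped too
    "\u201cquoted\u201d text\u2026",     # curly quotes and ellipsis are not ASCII: kept
]

results = [remove_punct(s) for s in samples]
for before, after in zip(samples, results):
    print(repr(before), '->', repr(after))
```

If hashtags or mentions are to be used as features later, they would need to be extracted before this step, since the `#` and `@` markers no longer exist afterwards.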

♻️ Spelling Correction

In [21]:
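spell = SpellChecker()

def correct_spellings(text):
    # Leave non-string values (e.g. missing values) untouched
    if not isinstance(text, str):
        return text
    words = text.split()
    # Words that the dictionary does not recognize.
    # Note: by default pyspellchecker lowercases words in unknown(),
    # so capitalized words may not match the membership test below.
    misspelled_words = spell.unknown(words)
    corrected_text = []
    for word in words:
        corrected_word = spell.correction(word) if word in misspelled_words else word
        # spell.correction can return None when it finds no candidate
        corrected_text.append(corrected_word if corrected_word is not None else "")
    return " ".join(corrected_text)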
        
text = "corect me plese"
correct_spellings(text)

'correct me please'

In [29]:
# Apply the function to the 'text' column
train['text'] = train['text'].apply(lambda x: correct_spellings(x))
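
Correcting every word of every tweet can be slow on a whole column, because the same words are looked up again and again. One possible way to reduce the work is to correct each distinct word once and reuse the result. The sketch below is an assumption-laden illustration, not the method used in this notebook: `correct_column` is a hypothetical helper, and `toy_correct` is a stand-in for a real corrector so that the example runs without `pyspellchecker`.

```python
def correct_column(texts, correct_word):
    """Correct each distinct word once, then rebuild every text from the cache."""
    cache = {}
    corrected = []
    for text in texts:
        if not isinstance(text, str):
            corrected.append(text)  # keep missing values untouched
            continue
        words = []
        for word in text.split():
            if word not in cache:
                fixed = correct_word(word)
                # Mirror the notebook's convention: no candidate -> empty string
                cache[word] = fixed if fixed is not None else ""
            words.append(cache[word])
        corrected.append(" ".join(words))
    return corrected

# Toy stand-in for a real spelling corrector (hypothetical, for illustration only)
TOY_FIXES = {"corect": "correct", "plese": "please"}
calls = []

def toy_correct(word):
    calls.append(word)  # record how often the corrector is actually called
    return TOY_FIXES.get(word, word)

result = correct_column(["corect me plese", "plese corect", None], toy_correct)
print(result)
print(len(calls))  # number of corrector calls, one per distinct word
```

With a real corrector, `correct_word` would be a function that returns the corrected form of a single word. Here it is called three times for five words, and the saving grows with the amount of repeated vocabulary in the column.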

In [None]:
# Apply the function to the 'keyword' column
train['keyword'] = train['keyword'].apply(lambda x: correct_spellings(x))

In [None]:
# Apply the function to the 'location' column
train['location'] = train['location'].apply(lambda x: correct_spellings(x))
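
The cleaning steps above are applied to `train` one at a time. A minimal sketch of how the regex and punctuation steps could be bundled into one function, so that the same sequence can be reused on another column or on `test`, is shown below. `clean_text` is a hypothetical name, the sample string is made up, and spelling correction is left out to keep the example dependency-free.

```python
import re
import string

URL_RE = re.compile(r'https?://\S+|www\.\S+')
HTML_RE = re.compile(r'<.*?>')
EMOJI_RE = re.compile("["
                      "\U0001F600-\U0001F64F"
                      "\U0001F300-\U0001F5FF"
                      "\U0001F680-\U0001F6FF"
                      "\U0001F1E0-\U0001F1FF"
                      "\U00002702-\U000027B0"
                      "\U000024C2-\U0001F251"
                      "]+", flags=re.UNICODE)
PUNCT_TABLE = str.maketrans('', '', string.punctuation)

def clean_text(text):
    # Hypothetical wrapper: same steps, same order as the cells above
    if not isinstance(text, str):
        return text  # leave missing values untouched
    text = URL_RE.sub('', text)
    text = HTML_RE.sub('', text)
    text = EMOJI_RE.sub('', text)
    return text.translate(PUNCT_TABLE)

sample = "<b>Omg</b> another #Earthquake \U0001F614 https://t.co/xyz"
cleaned = clean_text(sample)
print(repr(cleaned))

# Order matters: stripping punctuation first breaks the URL pattern
wrong_order = URL_RE.sub('', sample.translate(PUNCT_TABLE))
print(repr(wrong_order))
```

With pandas, such a function could be passed directly to `Series.apply`, for example on the `text` column of `test`. The second print shows why URLs are removed before punctuation: once `://` and the dots are gone, the link is reduced to `httpstcoxyz` and the URL pattern no longer matches it.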

<a id="2"></a> 
#### 1. Exercise 1: The PIL library [⏮️](#1)[👆🏽](#0)[⏭️](#3)

<a id="3"></a> 
#### 2. Exercise 2: NumPy, Matplotlib [⏮️](#2)[👆🏽](#0)[⏭️](#4)

<a id="4"></a> 
#### 3. Exercise 3: scikit-learn [⏮️](#3)[👆🏽](#0)[⏭️](#8)

<a id="8"></a> 
#### 4. Exercise 4: SciPy [⏮️](#4)[👆🏽](#0)[⏭️](#5)

<a id="5"></a> 
#### Conclusion [⏮️](#8)[👆🏽](#0)[⏭️](#6)

<a id="6"></a> 
#### <i>References</i> [⏮️](#5)[👆🏽](#0)[⏭️](#7)

Here are some references for more information on the libraries used:

- [Pandas documentation](https://pandas.pydata.org/docs/)
- [NumPy documentation](https://numpy.org/doc/stable/)

<a id="7"></a> 
#### <i>Authors</i> [⏮️](#6)[👆🏽](#0)[⏭️]()

🍀 Authors
- 🧑🏾‍💻 Kikia DIA
- 🧑🏾‍💻 Mouhamadou Naby DIA
- 🧑🏾‍💻 Ndeye Awa SALANE

🍀 Affiliation
- 🎓 Ecole Polytechnique de Thiès

🍀 Department
- 💻 Génie Informatique et Télécommunication

🍀 Level
- 📚 DIC2