## Data Understanding

### Extracting
The main goal of this notebook is to explore the existing data and to prototype the steps that will later go into the ETL pipeline.

In [2]:
import pandas as pd

Let's read the messages data.

In [3]:
messages = pd.read_csv('data/csv/disaster_messages.csv', index_col=False)
messages.head()

Unnamed: 0,id,message,original,genre
0,2,Weather update - a cold front from Cuba that c...,Un front froid se retrouve sur Cuba ce matin. ...,direct
1,7,Is the Hurricane over or is it not over,Cyclone nan fini osinon li pa fini,direct
2,8,Looking for someone but no name,"Patnm, di Maryani relem pou li banm nouvel li ...",direct
3,9,UN reports Leogane 80-90 destroyed. Only Hospi...,UN reports Leogane 80-90 destroyed. Only Hospi...,direct
4,12,"says: west side of Haiti, rest of the country ...",facade ouest d Haiti et le reste du pays aujou...,direct


In [4]:
messages.shape

(26248, 4)

So we have 26,248 disaster messages, each with an English version and, where available, the text in the original language. Now let's see which genres are available.

In [5]:
messages['genre'].unique()

array(['direct', 'social', 'news'], dtype=object)

So there are three genres available: direct, social and news.
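Beyond listing the unique values, it can help to see how many messages fall into each genre. A minimal sketch, assuming a small hypothetical dataframe (`toy_messages`, made-up rows) in place of the real `messages` data:

```python
import pandas as pd

# Hypothetical stand-in for `messages`; the rows are invented for illustration
toy_messages = pd.DataFrame({
    "message": ["need water", "storm coming", "roads closed", "send food"],
    "genre": ["direct", "social", "news", "direct"],
})

# Count the number of messages per genre, most frequent first
genre_counts = toy_messages["genre"].value_counts()
print(genre_counts)
```

On the real data the same call would be `messages['genre'].value_counts()`.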

In [6]:
categories = pd.read_csv('data/csv/disaster_categories.csv', index_col=False)
categories.head()

Unnamed: 0,id,categories
0,2,related-1;request-0;offer-0;aid_related-0;medi...
1,7,related-1;request-0;offer-0;aid_related-1;medi...
2,8,related-1;request-0;offer-0;aid_related-0;medi...
3,9,related-1;request-1;offer-0;aid_related-1;medi...
4,12,related-1;request-0;offer-0;aid_related-0;medi...


In [7]:
categories.shape

(26248, 2)

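Both dataframes have the same number of rows, so let's concatenate them column-wise. Note that pd.concat with axis=1 aligns rows by index, not by 'id', so this relies on both files listing the ids in the same order.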

In [8]:
df = pd.concat([messages, categories], axis=1)
df.head()

Unnamed: 0,id,message,original,genre,id.1,categories
0,2,Weather update - a cold front from Cuba that c...,Un front froid se retrouve sur Cuba ce matin. ...,direct,2,related-1;request-0;offer-0;aid_related-0;medi...
1,7,Is the Hurricane over or is it not over,Cyclone nan fini osinon li pa fini,direct,7,related-1;request-0;offer-0;aid_related-1;medi...
2,8,Looking for someone but no name,"Patnm, di Maryani relem pou li banm nouvel li ...",direct,8,related-1;request-0;offer-0;aid_related-0;medi...
3,9,UN reports Leogane 80-90 destroyed. Only Hospi...,UN reports Leogane 80-90 destroyed. Only Hospi...,direct,9,related-1;request-1;offer-0;aid_related-1;medi...
4,12,"says: west side of Haiti, rest of the country ...",facade ouest d Haiti et le reste du pays aujou...,direct,12,related-1;request-0;offer-0;aid_related-0;medi...
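Since a column-wise concatenation pairs rows by index rather than by 'id', it may be worth confirming that the ids line up first. The following is only a sketch on hypothetical toy frames (`toy_messages`, `toy_categories`), not a check that was run on the real data:

```python
import pandas as pd

# Hypothetical stand-ins for `messages` and `categories`
toy_messages = pd.DataFrame({"id": [2, 7, 8], "message": ["a", "b", "c"]})
toy_categories = pd.DataFrame({
    "id": [2, 7, 8],
    "categories": ["related-1;request-0", "related-1;request-0", "related-1;request-1"],
})

# True only if both 'id' columns hold the same values in the same order
ids_aligned = toy_messages["id"].equals(toy_categories["id"])
print(ids_aligned)

# If they match, the second 'id' column is redundant and can be dropped up front
combined = pd.concat([toy_messages, toy_categories.drop(columns=["id"])], axis=1)
print(combined)
```

If the check came back False, joining on 'id' (for example with `DataFrame.merge`) would be a safer alternative, although duplicated ids would then need separate handling.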


### Cleaning

Now that the data is extracted, let's clean it.

As a first step, let's remove the two 'id' columns.

In [9]:
df = df.drop(columns=['id'])
df.head()

Unnamed: 0,message,original,genre,categories
0,Weather update - a cold front from Cuba that c...,Un front froid se retrouve sur Cuba ce matin. ...,direct,related-1;request-0;offer-0;aid_related-0;medi...
1,Is the Hurricane over or is it not over,Cyclone nan fini osinon li pa fini,direct,related-1;request-0;offer-0;aid_related-1;medi...
2,Looking for someone but no name,"Patnm, di Maryani relem pou li banm nouvel li ...",direct,related-1;request-0;offer-0;aid_related-0;medi...
3,UN reports Leogane 80-90 destroyed. Only Hospi...,UN reports Leogane 80-90 destroyed. Only Hospi...,direct,related-1;request-1;offer-0;aid_related-1;medi...
4,"says: west side of Haiti, rest of the country ...",facade ouest d Haiti et le reste du pays aujou...,direct,related-1;request-0;offer-0;aid_related-0;medi...


Let's see which columns contain missing values.

In [13]:
df.isna().sum()

message           0
original      16064
genre             0
categories        0
dtype: int64

It looks like only the 'original' column contains missing values; the 'message' column has none. So let's see what the messages look like when 'original' is missing.

In [15]:
df[(df['original']).isna()]['message']

7433              NOTES: It mark as not enough information
9902     My thoughts and prayers go out to all the live...
9903     I m sorry for the poor people in Haiti tonight...
9904     RT selenagomez UNICEF has just announced an em...
9905     lilithia yes 5.2 magnitude earthquake hit mani...
                               ...                        
26243    The training demonstrated how to enhance micro...
26244    A suitable candidate has been selected and OCH...
26245    Proshika, operating in Cox's Bazar municipalit...
26246    Some 2,000 women protesting against the conduc...
26247    A radical shift in thinking came about as a re...
Name: message, Length: 16064, dtype: object

These messages are already in English, so when 'original' is missing there is nothing to fill in. From here on we only rely on the 'message' column, which means no rows need to be dropped because of NaN values. We could drop the 'original' column entirely, but let's keep it, as saving space is not a priority here.
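To see where the missing 'original' values concentrate, one option is to count them per genre. This is a sketch on a hypothetical toy frame (`toy`), assuming a per-genre breakdown is of interest; it does not reproduce results from the real data:

```python
import pandas as pd

# Hypothetical rows: None marks a message with no original-language text
toy = pd.DataFrame({
    "message": ["m1", "m2", "m3", "m4", "m5"],
    "original": ["o1", None, None, "o4", None],
    "genre": ["direct", "social", "news", "direct", "news"],
})

# Boolean mask of missing originals, summed within each genre
missing_by_genre = toy["original"].isna().groupby(toy["genre"]).sum()
print(missing_by_genre)
```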

Next, let's check for duplicated messages and remove them.

In [18]:
df['message'].duplicated().sum()

71

In [22]:
df = df.drop_duplicates(['message'])
df['message'].duplicated().sum()

0
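Two details of `drop_duplicates` matter for the following steps: by default it keeps the first occurrence of each duplicated value, and it leaves the original index labels in place, so the index is no longer contiguous afterwards. A small sketch on a hypothetical frame (`toy`):

```python
import pandas as pd

# Hypothetical frame with one repeated message
toy = pd.DataFrame({
    "message": ["help", "help", "water"],
    "genre": ["direct", "news", "direct"],
})

# Keeps the first 'help' row and drops the second one
deduped = toy.drop_duplicates(subset=["message"])
print(deduped)

# Index labels are preserved, so label 1 is now missing
print(list(deduped.index))
```

Because later concatenations align on these labels, keeping the index as-is is fine here; `reset_index(drop=True)` would be an option if a contiguous index were needed.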

In [23]:
df.head()

Unnamed: 0,message,original,genre,categories
0,Weather update - a cold front from Cuba that c...,Un front froid se retrouve sur Cuba ce matin. ...,direct,related-1;request-0;offer-0;aid_related-0;medi...
1,Is the Hurricane over or is it not over,Cyclone nan fini osinon li pa fini,direct,related-1;request-0;offer-0;aid_related-1;medi...
2,Looking for someone but no name,"Patnm, di Maryani relem pou li banm nouvel li ...",direct,related-1;request-0;offer-0;aid_related-0;medi...
3,UN reports Leogane 80-90 destroyed. Only Hospi...,UN reports Leogane 80-90 destroyed. Only Hospi...,direct,related-1;request-1;offer-0;aid_related-1;medi...
4,"says: west side of Haiti, rest of the country ...",facade ouest d Haiti et le reste du pays aujou...,direct,related-1;request-0;offer-0;aid_related-0;medi...


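Now let's clean the 'categories' column. Each entry is a single string of name-value pairs separated by ';' (for example related-1;request-0), so we split it into one numeric column per category.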

In [24]:
# split the 'categories' string into one column per category
categories = df['categories'].str.split(pat=';', expand=True)

# derive the column names from the first row (the text before the hyphen)
row = categories.iloc[0]
category_colnames = [r.split('-')[0] for r in row]
categories.columns = category_colnames

for column in categories:
    # keep only the value after the hyphen
    categories[column] = categories[column].str.split('-').str[1]

    # convert column from string to numeric
    categories[column] = pd.to_numeric(categories[column], downcast='integer')

# replace the raw 'categories' column with the expanded columns
df = df.drop(columns=['categories'])
df = pd.concat([df, categories], axis=1)
df.head()

Unnamed: 0,message,original,genre,related,request,offer,aid_related,medical_help,medical_products,search_and_rescue,...,aid_centers,other_infrastructure,weather_related,floods,storm,fire,earthquake,cold,other_weather,direct_report
0,Weather update - a cold front from Cuba that c...,Un front froid se retrouve sur Cuba ce matin. ...,direct,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,Is the Hurricane over or is it not over,Cyclone nan fini osinon li pa fini,direct,1,0,0,1,0,0,0,...,0,0,1,0,1,0,0,0,0,0
2,Looking for someone but no name,"Patnm, di Maryani relem pou li banm nouvel li ...",direct,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,UN reports Leogane 80-90 destroyed. Only Hospi...,UN reports Leogane 80-90 destroyed. Only Hospi...,direct,1,1,0,1,0,1,0,...,0,0,0,0,0,0,0,0,0,0
4,"says: west side of Haiti, rest of the country ...",facade ouest d Haiti et le reste du pays aujou...,direct,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
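After the split it may be worth verifying that every category column really contains only 0 and 1. The sketch below repeats the parsing on a hypothetical toy series (`toy`) and adds such a check; it assumes the flags are meant to be binary, which the notebook does not state explicitly:

```python
import pandas as pd

# Hypothetical 'categories' strings in the same name-value format
toy = pd.Series(["related-1;request-0;offer-0", "related-0;request-1;offer-0"])

# One column per category, named after the text before the hyphen
flags = toy.str.split(";", expand=True)
flags.columns = flags.iloc[0].str.split("-").str[0].tolist()

# Keep the part after the hyphen and convert it to integers
flags = flags.apply(lambda col: col.str.split("-").str[1].astype(int))

# True only if every value in every column is 0 or 1
is_binary = bool(flags.isin([0, 1]).all().all())
print(flags)
print(is_binary)
```

If the check fails on the real data, the offending rows would need a decision (for example dropping them or mapping the value) before the table is used for modelling.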
