# ETL Pipeline Preparation
Follow the steps below to build your ETL pipeline.
### 1. Import libraries and load datasets.
- Import Python libraries
- Load `messages.csv` into a dataframe and inspect the first few lines.
- Load `categories.csv` into a dataframe and inspect the first few lines.

In [1]:
# import libraries
import pandas as pd
from sqlalchemy import create_engine

In [2]:
# load messages dataset
messages = pd.read_csv("messages.csv")
messages.head()

Unnamed: 0,id,message,original,genre
0,2,Weather update - a cold front from Cuba that c...,Un front froid se retrouve sur Cuba ce matin. ...,direct
1,7,Is the Hurricane over or is it not over,Cyclone nan fini osinon li pa fini,direct
2,8,Looking for someone but no name,"Patnm, di Maryani relem pou li banm nouvel li ...",direct
3,9,UN reports Leogane 80-90 destroyed. Only Hospi...,UN reports Leogane 80-90 destroyed. Only Hospi...,direct
4,12,"says: west side of Haiti, rest of the country ...",facade ouest d Haiti et le reste du pays aujou...,direct


In [3]:
messages.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 26248 entries, 0 to 26247
Data columns (total 4 columns):
id          26248 non-null int64
message     26248 non-null object
original    10184 non-null object
genre       26248 non-null object
dtypes: int64(1), object(3)
memory usage: 820.3+ KB


In [4]:
messages.describe(include="all")

Unnamed: 0,id,message,original,genre
count,26248.0,26248,10184,26248
unique,,26177,9630,3
top,,#NAME?,Nap fe ou konnen ke apati de jodi a sevis SMS ...,news
freq,,4,20,13068
mean,15224.078368,,,
std,8826.069156,,,
min,2.0,,,
25%,7445.75,,,
50%,15660.5,,,
75%,22923.25,,,


In [5]:
messages.id.value_counts()[:10]

24779    3
7747     2
14246    2
25512    2
17553    2
13914    2
29119    2
14135    2
14592    2
17919    2
Name: id, dtype: int64

In [6]:
messages[messages.id.isin(messages.id.value_counts()[:10].index)]

Unnamed: 0,id,message,original,genre
6843,7747,where we can paticipate in the law reorganizat...,ki kote nou ka patisipe nan dwa reyamenajman yo?,direct
6844,7747,where we can paticipate in the law reorganizat...,ki kote nou ka patisipe nan dwa reyamenajman yo?,direct
12051,13914,Falta tan tan poco.. ¬¨¬¥Simply Red - Holding ...,,social
12052,13914,Falta tan tan poco.. ¬¨¬¥Simply Red - Holding ...,,social
12162,14135,in my village after collapse of whole infrastr...,mera gaon pura selab se tehas -nehas ho chuka ...,direct
12163,14135,in my village after collapse of whole infrastr...,mera gaon pura selab se tehas -nehas ho chuka ...,direct
12201,14246,Mera ghar selab ki waja sy gir giya hy. aur do...,a. a. a. ...,direct
12202,14246,Mera ghar selab ki waja sy gir giya hy. aur do...,a. a. a. ...,direct
12316,14592,MAOZA U REP ',MAOZA MUZAFAR u REP ',direct
12317,14592,MAOZA U REP ',MAOZA MUZAFAR u REP ',direct


In [7]:
# load categories dataset
categories = pd.read_csv("categories.csv")
categories.head()

Unnamed: 0,id,categories
0,2,related-1;request-0;offer-0;aid_related-0;medi...
1,7,related-1;request-0;offer-0;aid_related-1;medi...
2,8,related-1;request-0;offer-0;aid_related-0;medi...
3,9,related-1;request-1;offer-0;aid_related-1;medi...
4,12,related-1;request-0;offer-0;aid_related-0;medi...


In [8]:
categories.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 26248 entries, 0 to 26247
Data columns (total 2 columns):
id            26248 non-null int64
categories    26248 non-null object
dtypes: int64(1), object(1)
memory usage: 410.2+ KB


In [9]:
categories.describe(include="all")

Unnamed: 0,id,categories
count,26248.0,26248
unique,,4003
top,,related-0;request-0;offer-0;aid_related-0;medi...
freq,,6125
mean,15224.078368,
std,8826.069156,
min,2.0,
25%,7445.75,
50%,15660.5,
75%,22923.25,


In [10]:
categories.id.value_counts()[:10]

24779    3
7747     2
14246    2
25512    2
17553    2
13914    2
29119    2
14135    2
14592    2
17919    2
Name: id, dtype: int64

In [11]:
categories[categories.id.isin(categories.id.value_counts()[:10].index)]

Unnamed: 0,id,categories
6843,7747,related-1;request-0;offer-0;aid_related-0;medi...
6844,7747,related-0;request-0;offer-0;aid_related-0;medi...
12051,13914,related-2;request-0;offer-0;aid_related-0;medi...
12052,13914,related-1;request-0;offer-0;aid_related-0;medi...
12162,14135,related-1;request-1;offer-0;aid_related-1;medi...
12163,14135,related-1;request-1;offer-0;aid_related-1;medi...
12201,14246,related-2;request-0;offer-0;aid_related-0;medi...
12202,14246,related-2;request-0;offer-0;aid_related-0;medi...
12316,14592,related-2;request-0;offer-0;aid_related-0;medi...
12317,14592,related-2;request-0;offer-0;aid_related-0;medi...


### 2. Merge datasets.
- Merge the messages and categories datasets using the common id
- Assign this combined dataset to `df`, which will be cleaned in the following steps
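
Both inputs contain repeated `id` values, so an inner merge on `id` pairs every matching row on the left with every matching row on the right. This is consistent with the row count rising from 26248 to 26386 after the merge. The sketch below reproduces the effect on toy data; the frame names and values are made up for illustration.

```python
import pandas as pd

# Toy frames (hypothetical values): id 7 appears twice on both sides.
messages_demo = pd.DataFrame({"id": [2, 7, 7], "message": ["a", "b", "b"]})
categories_demo = pd.DataFrame(
    {"id": [2, 7, 7], "categories": ["related-1", "related-1", "related-0"]}
)

# One row for id 2, plus 2 x 2 = 4 rows for id 7.
merged_demo = pd.merge(messages_demo, categories_demo, on="id", how="inner")
print(len(merged_demo))

# validate= asks pandas to raise instead of silently multiplying rows.
try:
    pd.merge(messages_demo, categories_demo, on="id", validate="one_to_one")
    keys_unique = True
except pd.errors.MergeError:
    keys_unique = False
print(keys_unique)
```

The extra rows are not necessarily a problem here, because the duplicate-removal step later in the notebook drops identical rows, but checking with `validate` makes the behaviour explicit.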

In [12]:
# merge datasets
df = pd.merge(messages, categories, on="id", how="inner")
df.head()

Unnamed: 0,id,message,original,genre,categories
0,2,Weather update - a cold front from Cuba that c...,Un front froid se retrouve sur Cuba ce matin. ...,direct,related-1;request-0;offer-0;aid_related-0;medi...
1,7,Is the Hurricane over or is it not over,Cyclone nan fini osinon li pa fini,direct,related-1;request-0;offer-0;aid_related-1;medi...
2,8,Looking for someone but no name,"Patnm, di Maryani relem pou li banm nouvel li ...",direct,related-1;request-0;offer-0;aid_related-0;medi...
3,9,UN reports Leogane 80-90 destroyed. Only Hospi...,UN reports Leogane 80-90 destroyed. Only Hospi...,direct,related-1;request-1;offer-0;aid_related-1;medi...
4,12,"says: west side of Haiti, rest of the country ...",facade ouest d Haiti et le reste du pays aujou...,direct,related-1;request-0;offer-0;aid_related-0;medi...


### 3. Split `categories` into separate category columns.
- Split the values in the `categories` column on the `;` character so that each value becomes a separate column. You'll find [this method](https://pandas.pydata.org/pandas-docs/version/0.23/generated/pandas.Series.str.split.html) very helpful! Make sure to set `expand=True`.
- Use the first row of the `categories` dataframe to create column names for the categories data.
- Rename columns of `categories` with new column names.
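
As a minimal sketch of these three steps on toy data (the input strings are invented, and only three categories are shown), the split and rename might look like this. It uses `rsplit` on the last `-` as one possible way to strip the trailing value; other approaches work equally well.

```python
import pandas as pd

# Hypothetical input in the same "name-value;name-value" layout.
raw = pd.Series(["related-1;request-0;offer-0", "related-0;request-1;offer-0"])

# expand=True returns a DataFrame with one column per split piece.
cats_demo = raw.str.split(";", expand=True)

# Take the text before the final "-" in the first row as the column name.
first_row = cats_demo.iloc[0]
cats_demo.columns = first_row.str.rsplit("-", n=1).str[0].tolist()
print(list(cats_demo.columns))
```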

In [13]:
# create a dataframe of the 36 individual category columns
categories = df.categories.str.split(";", expand=True)
categories.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,26,27,28,29,30,31,32,33,34,35
0,related-1,request-0,offer-0,aid_related-0,medical_help-0,medical_products-0,search_and_rescue-0,security-0,military-0,child_alone-0,...,aid_centers-0,other_infrastructure-0,weather_related-0,floods-0,storm-0,fire-0,earthquake-0,cold-0,other_weather-0,direct_report-0
1,related-1,request-0,offer-0,aid_related-1,medical_help-0,medical_products-0,search_and_rescue-0,security-0,military-0,child_alone-0,...,aid_centers-0,other_infrastructure-0,weather_related-1,floods-0,storm-1,fire-0,earthquake-0,cold-0,other_weather-0,direct_report-0
2,related-1,request-0,offer-0,aid_related-0,medical_help-0,medical_products-0,search_and_rescue-0,security-0,military-0,child_alone-0,...,aid_centers-0,other_infrastructure-0,weather_related-0,floods-0,storm-0,fire-0,earthquake-0,cold-0,other_weather-0,direct_report-0
3,related-1,request-1,offer-0,aid_related-1,medical_help-0,medical_products-1,search_and_rescue-0,security-0,military-0,child_alone-0,...,aid_centers-0,other_infrastructure-0,weather_related-0,floods-0,storm-0,fire-0,earthquake-0,cold-0,other_weather-0,direct_report-0
4,related-1,request-0,offer-0,aid_related-0,medical_help-0,medical_products-0,search_and_rescue-0,security-0,military-0,child_alone-0,...,aid_centers-0,other_infrastructure-0,weather_related-0,floods-0,storm-0,fire-0,earthquake-0,cold-0,other_weather-0,direct_report-0


In [14]:
# select the first row of the categories dataframe
row = categories.head(1)

# use this row to extract a list of new column names for categories.
# here a lambda splits each string on "-" and keeps the part before it,
# e.g. "related-1" becomes "related"
category_colnames = row.apply(lambda x: x.str.split("-")[0][0]).values
print(category_colnames)

['related' 'request' 'offer' 'aid_related' 'medical_help'
 'medical_products' 'search_and_rescue' 'security' 'military' 'child_alone'
 'water' 'food' 'shelter' 'clothing' 'money' 'missing_people' 'refugees'
 'death' 'other_aid' 'infrastructure_related' 'transport' 'buildings'
 'electricity' 'tools' 'hospitals' 'shops' 'aid_centers'
 'other_infrastructure' 'weather_related' 'floods' 'storm' 'fire'
 'earthquake' 'cold' 'other_weather' 'direct_report']


In [15]:
# rename the columns of `categories`
categories.columns = category_colnames
categories.head()

Unnamed: 0,related,request,offer,aid_related,medical_help,medical_products,search_and_rescue,security,military,child_alone,...,aid_centers,other_infrastructure,weather_related,floods,storm,fire,earthquake,cold,other_weather,direct_report
0,related-1,request-0,offer-0,aid_related-0,medical_help-0,medical_products-0,search_and_rescue-0,security-0,military-0,child_alone-0,...,aid_centers-0,other_infrastructure-0,weather_related-0,floods-0,storm-0,fire-0,earthquake-0,cold-0,other_weather-0,direct_report-0
1,related-1,request-0,offer-0,aid_related-1,medical_help-0,medical_products-0,search_and_rescue-0,security-0,military-0,child_alone-0,...,aid_centers-0,other_infrastructure-0,weather_related-1,floods-0,storm-1,fire-0,earthquake-0,cold-0,other_weather-0,direct_report-0
2,related-1,request-0,offer-0,aid_related-0,medical_help-0,medical_products-0,search_and_rescue-0,security-0,military-0,child_alone-0,...,aid_centers-0,other_infrastructure-0,weather_related-0,floods-0,storm-0,fire-0,earthquake-0,cold-0,other_weather-0,direct_report-0
3,related-1,request-1,offer-0,aid_related-1,medical_help-0,medical_products-1,search_and_rescue-0,security-0,military-0,child_alone-0,...,aid_centers-0,other_infrastructure-0,weather_related-0,floods-0,storm-0,fire-0,earthquake-0,cold-0,other_weather-0,direct_report-0
4,related-1,request-0,offer-0,aid_related-0,medical_help-0,medical_products-0,search_and_rescue-0,security-0,military-0,child_alone-0,...,aid_centers-0,other_infrastructure-0,weather_related-0,floods-0,storm-0,fire-0,earthquake-0,cold-0,other_weather-0,direct_report-0


In [16]:
categories.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 26386 entries, 0 to 26385
Data columns (total 36 columns):
related                   26386 non-null object
request                   26386 non-null object
offer                     26386 non-null object
aid_related               26386 non-null object
medical_help              26386 non-null object
medical_products          26386 non-null object
search_and_rescue         26386 non-null object
security                  26386 non-null object
military                  26386 non-null object
child_alone               26386 non-null object
water                     26386 non-null object
food                      26386 non-null object
shelter                   26386 non-null object
clothing                  26386 non-null object
money                     26386 non-null object
missing_people            26386 non-null object
refugees                  26386 non-null object
death                     26386 non-null object
other_aid                 2

In [17]:
categories.describe()

Unnamed: 0,related,request,offer,aid_related,medical_help,medical_products,search_and_rescue,security,military,child_alone,...,aid_centers,other_infrastructure,weather_related,floods,storm,fire,earthquake,cold,other_weather,direct_report
count,26386,26386,26386,26386,26386,26386,26386,26386,26386,26386,...,26386,26386,26386,26386,26386,26386,26386,26386,26386,26386
unique,3,2,2,2,2,2,2,2,2,1,...,2,2,2,2,2,2,2,2,2,2
top,related-1,request-0,offer-0,aid_related-0,medical_help-0,medical_products-0,search_and_rescue-0,security-0,military-0,child_alone-0,...,aid_centers-0,other_infrastructure-0,weather_related-0,floods-0,storm-0,fire-0,earthquake-0,cold-0,other_weather-0,direct_report-0
freq,20042,21873,26265,15432,24287,25067,25661,25915,25523,26386,...,26077,25231,19043,24209,23922,26104,23925,25853,25007,21273


In [18]:
cat_desc = categories.describe().T
cat_desc[cat_desc.unique != 2]

Unnamed: 0,count,unique,top,freq
related,26386,3,related-1,20042
child_alone,26386,1,child_alone-0,26386


In [19]:
len(categories[(categories.related == "related-2")])

204

In [20]:
categories[(categories.related == "related-2")].duplicated(keep=False).value_counts()

True    204
dtype: int64

### 4. Convert category values to just numbers 0 or 1.
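- Iterate through the category columns in `df` to keep only the last character of each string (the 1 or 0). For example, `related-0` becomes `0`, `related-1` becomes `1`. Convert the string to a numeric value. Note that `related` also contains the value `2` (204 rows in the merged data), which the cell below maps to `1`.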
- You can perform [normal string actions on Pandas Series](https://pandas.pydata.org/pandas-docs/stable/text.html#indexing-with-str), like indexing, by including `.str` after the Series. You may need to first convert the Series to be of type string, which you can do with `astype(str)`.
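
The conversion can be sketched on a small made-up frame as follows. Capping values with `clip(upper=1)` is shown here as an alternative to an explicit `replace`; treating `2` as `1` is an assumption about the labels, and dropping those rows would be another defensible choice.

```python
import pandas as pd

# Hypothetical category strings, including one "related-2" value.
cats_demo = pd.DataFrame(
    {
        "related": ["related-1", "related-2", "related-0"],
        "request": ["request-0", "request-1", "request-0"],
    }
)

for column in cats_demo.columns:
    # keep the last character of each string, then convert it to an integer
    cats_demo[column] = pd.to_numeric(cats_demo[column].astype(str).str[-1])

# Cap every value at 1 so each column is strictly binary.
binary_demo = cats_demo.clip(upper=1)
print(binary_demo)
```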

In [21]:
# categories.apply(lambda x: x.str.split("-", expand=True)[1]).head(5)

In [22]:
for column in categories:
    # set each value to be the last character of the string
    categories[column] = categories[column].str[-1:]
    
    # convert column from string to numeric
    categories[column] = pd.to_numeric(categories[column])
    
# replace any integers > 1 with 1
categories.replace([2, 3, 4, 5, 6, 7, 8, 9], 1, inplace=True)

categories.head()

Unnamed: 0,related,request,offer,aid_related,medical_help,medical_products,search_and_rescue,security,military,child_alone,...,aid_centers,other_infrastructure,weather_related,floods,storm,fire,earthquake,cold,other_weather,direct_report
0,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,1,0,0,1,0,0,0,0,0,0,...,0,0,1,0,1,0,0,0,0,0
2,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,1,1,0,1,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [23]:
# cat_desc2 = 
categories.describe().T
# cat_desc2[
# (cat_desc2.min != 0.0) | 
# cat_desc2.max != 1.0
# ]

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
related,26386.0,0.767301,0.42256,0.0,1.0,1.0,1.0,1.0
request,26386.0,0.171038,0.376549,0.0,0.0,0.0,0.0,1.0
offer,26386.0,0.004586,0.067564,0.0,0.0,0.0,0.0,1.0
aid_related,26386.0,0.415144,0.492756,0.0,0.0,0.0,1.0,1.0
medical_help,26386.0,0.07955,0.2706,0.0,0.0,0.0,0.0,1.0
medical_products,26386.0,0.049989,0.217926,0.0,0.0,0.0,0.0,1.0
search_and_rescue,26386.0,0.027477,0.163471,0.0,0.0,0.0,0.0,1.0
security,26386.0,0.01785,0.13241,0.0,0.0,0.0,0.0,1.0
military,26386.0,0.032707,0.177871,0.0,0.0,0.0,0.0,1.0
child_alone,26386.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


### 5. Replace `categories` column in `df` with new category columns.
- Drop the `categories` column from the `df` dataframe since it is no longer needed.
- Concatenate the `df` and `categories` dataframes column-wise.
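
`pd.concat(..., axis=1)` aligns rows by index label, not by position. That works in this notebook because `categories` was derived from `df` and shares its index. The toy example below (hypothetical frames) shows what happens when the indexes no longer match.

```python
import pandas as pd

left = pd.DataFrame({"id": [2, 7, 8]})
right = pd.DataFrame({"related": [1, 1, 0]})  # same default index 0..2

# Matching indexes: rows line up one to one.
aligned = pd.concat([left, right], axis=1)

# After filtering, `filtered` has index labels 0 and 2 only.
filtered = left[left.id != 7]

# concat performs an outer join on the index, so label 1 reappears with NaN.
misaligned = pd.concat([filtered, right], axis=1)
print(misaligned)
```

If rows are filtered between the split and the concatenation, calling `reset_index(drop=True)` on both frames, or filtering both the same way, avoids the stray `NaN` rows.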

In [24]:
# drop the original categories column from `df`
df.drop(columns="categories", inplace=True)

df.head()

Unnamed: 0,id,message,original,genre
0,2,Weather update - a cold front from Cuba that c...,Un front froid se retrouve sur Cuba ce matin. ...,direct
1,7,Is the Hurricane over or is it not over,Cyclone nan fini osinon li pa fini,direct
2,8,Looking for someone but no name,"Patnm, di Maryani relem pou li banm nouvel li ...",direct
3,9,UN reports Leogane 80-90 destroyed. Only Hospi...,UN reports Leogane 80-90 destroyed. Only Hospi...,direct
4,12,"says: west side of Haiti, rest of the country ...",facade ouest d Haiti et le reste du pays aujou...,direct


In [25]:
# concatenate the original dataframe with the new `categories` dataframe
df = pd.concat([df, categories], axis=1)
df.head()

Unnamed: 0,id,message,original,genre,related,request,offer,aid_related,medical_help,medical_products,...,aid_centers,other_infrastructure,weather_related,floods,storm,fire,earthquake,cold,other_weather,direct_report
0,2,Weather update - a cold front from Cuba that c...,Un front froid se retrouve sur Cuba ce matin. ...,direct,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,7,Is the Hurricane over or is it not over,Cyclone nan fini osinon li pa fini,direct,1,0,0,1,0,0,...,0,0,1,0,1,0,0,0,0,0
2,8,Looking for someone but no name,"Patnm, di Maryani relem pou li banm nouvel li ...",direct,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,9,UN reports Leogane 80-90 destroyed. Only Hospi...,UN reports Leogane 80-90 destroyed. Only Hospi...,direct,1,1,0,1,0,1,...,0,0,0,0,0,0,0,0,0,0
4,12,"says: west side of Haiti, rest of the country ...",facade ouest d Haiti et le reste du pays aujou...,direct,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [26]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 26386 entries, 0 to 26385
Data columns (total 40 columns):
id                        26386 non-null int64
message                   26386 non-null object
original                  10246 non-null object
genre                     26386 non-null object
related                   26386 non-null int64
request                   26386 non-null int64
offer                     26386 non-null int64
aid_related               26386 non-null int64
medical_help              26386 non-null int64
medical_products          26386 non-null int64
search_and_rescue         26386 non-null int64
security                  26386 non-null int64
military                  26386 non-null int64
child_alone               26386 non-null int64
water                     26386 non-null int64
food                      26386 non-null int64
shelter                   26386 non-null int64
clothing                  26386 non-null int64
money                     26386 non-null i

### 6. Remove duplicates.
- Check how many duplicates are in this dataset.
- Drop the duplicates.
- Confirm duplicates were removed.
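
The `keep` argument of `duplicated` changes what is counted, which is why the cells below report two different numbers (273 and 171). A small sketch on invented data:

```python
import pandas as pd

demo = pd.DataFrame(
    {"id": [1, 1, 1, 2, 3, 3], "message": ["a", "a", "a", "b", "c", "c"]}
)

# keep=False flags every row that has at least one identical twin.
in_any_group = int(demo.duplicated(keep=False).sum())

# keep="first" flags only the extra copies, i.e. the rows that would be dropped.
extra_copies = int(demo.duplicated(keep="first").sum())

# drop_duplicates(keep="first") is equivalent to demo[~demo.duplicated(keep="first")].
deduped = demo.drop_duplicates(keep="first")
print(in_any_group, extra_copies, len(deduped))
```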

In [27]:
# check number of duplicates
df.duplicated(keep=False).value_counts()

False    26113
True       273
dtype: int64

In [28]:
df.duplicated(keep="first").value_counts()

False    26215
True       171
dtype: int64

In [29]:
# drop fully duplicated rows, keeping the first occurrence
df = df[~df.duplicated(keep="first")]
df.shape

(26215, 40)

In [30]:
# confirm that no fully duplicated rows remain
df.duplicated(keep=False).value_counts()

False    26215
dtype: int64

In [31]:
# check for rows that share the same message text
df.message.duplicated(keep=False).value_counts()

False    26141
True        74
Name: message, dtype: int64

In [32]:
df.message.duplicated(keep="last").value_counts()

False    26177
True        38
Name: message, dtype: int64

In [33]:
# drop rows with duplicated message text, keeping the last occurrence
df = df[~df.message.duplicated(keep="last")]
df.shape

(26177, 40)

In [34]:
# confirm that no duplicated message text remains
df.message.duplicated(keep=False).value_counts()

False    26177
Name: message, dtype: int64

In [35]:
df[df.message == "#NAME?"]

Unnamed: 0,id,message,original,genre,related,request,offer,aid_related,medical_help,medical_products,...,aid_centers,other_infrastructure,weather_related,floods,storm,fire,earthquake,cold,other_weather,direct_report
22863,26309,#NAME?,,news,1,0,1,1,1,1,...,0,0,0,0,0,0,0,0,0,1


In [36]:
# looks like spreadsheet error rather than meaningful message
df = df[~(df.message == "#NAME?")]
df.shape

(26176, 40)

In [37]:
df.describe(include="all")

Unnamed: 0,id,message,original,genre,related,request,offer,aid_related,medical_help,medical_products,...,aid_centers,other_infrastructure,weather_related,floods,storm,fire,earthquake,cold,other_weather,direct_report
count,26176.0,26176,10153,26176,26176.0,26176.0,26176.0,26176.0,26176.0,26176.0,...,26176.0,26176.0,26176.0,26176.0,26176.0,26176.0,26176.0,26176.0,26176.0,26176.0
unique,,26176,9630,3,,,,,,,...,,,,,,,,,,
top,,ioletGjok haha at least you have a home gym in...,Nap fe ou konnen ke apati de jodi a sevis SMS ...,news,,,,,,,...,,,,,,,,,,
freq,,1,20,13035,,,,,,,...,,,,,,,,,,
mean,15226.183985,,,,0.766313,0.1705,0.00447,0.41412,0.079462,0.050084,...,0.011805,0.043819,0.278232,0.08206,0.093215,0.010773,0.093674,0.020209,0.052453,0.193498
std,8827.169602,,,,0.423184,0.376078,0.066708,0.492579,0.270464,0.218123,...,0.108008,0.204696,0.448137,0.274461,0.290739,0.103236,0.29138,0.140718,0.222942,0.395047
min,2.0,,,,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,7448.75,,,,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,15663.5,,,,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,22925.25,,,,1.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [38]:
df.describe(include="all").related

count     26176.000000
unique             NaN
top                NaN
freq               NaN
mean          0.766313
std           0.423184
min           0.000000
25%           1.000000
50%           1.000000
75%           1.000000
max           1.000000
Name: related, dtype: float64

In [39]:
df.describe(include="all").child_alone

count     26176.0
unique        NaN
top           NaN
freq          NaN
mean          0.0
std           0.0
min           0.0
25%           0.0
50%           0.0
75%           0.0
max           0.0
Name: child_alone, dtype: float64

In [40]:
df[df.message.duplicated(keep=False)]

Unnamed: 0,id,message,original,genre,related,request,offer,aid_related,medical_help,medical_products,...,aid_centers,other_infrastructure,weather_related,floods,storm,fire,earthquake,cold,other_weather,direct_report


### 7. Save the clean dataset into an SQLite database.
You can do this with the pandas [`to_sql` method](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.to_sql.html) combined with the SQLAlchemy library. Remember to import SQLAlchemy's `create_engine` in the first cell of this notebook to use it below.
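
A quick way to check the save step is to write a frame and read it back. The sketch below is an assumption-laden variant: it uses an in-memory database through the standard-library `sqlite3` module, which `to_sql` also accepts, instead of a SQLAlchemy engine, and the frame contents are invented.

```python
import sqlite3

import pandas as pd

demo = pd.DataFrame({"id": [2, 7], "message": ["a", "b"], "related": [1, 0]})

# ":memory:" creates a throwaway database that disappears on close.
conn = sqlite3.connect(":memory:")

# if_exists="replace" drops and recreates the table; index=False skips the index column.
demo.to_sql("Message", conn, if_exists="replace", index=False)

restored = pd.read_sql("SELECT * FROM Message", conn)
conn.close()
print(restored)
```

With `if_exists="replace"`, rerunning the cell overwrites the table rather than failing or appending, which suits a notebook that is executed repeatedly.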

In [41]:
engine = create_engine('sqlite:///DisasterResponse_AD.db')
df.to_sql('Message', engine, if_exists="replace", index=False)

### 8. Use this notebook to complete `etl_pipeline.py`
Use the template file attached in the Resources folder to write a script that runs the steps above to create a database based on new datasets specified by the user. Alternatively, you can complete `etl_pipeline.py` in the classroom in the `Project Workspace IDE`, which is introduced later.
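
The template itself is not shown here, so the following is only one possible shape for such a script. The function names `clean_data` and `run_pipeline` are hypothetical, the inputs are toy frames rather than the CSV files, and CSV loading and command-line parsing are left out. It also uses a `sqlite3` connection in place of a SQLAlchemy engine to stay dependency-light.

```python
import sqlite3

import pandas as pd


def clean_data(df):
    """Expand `categories` into binary columns and remove duplicate rows."""
    cats = df["categories"].str.split(";", expand=True)
    cats.columns = cats.iloc[0].str.rsplit("-", n=1).str[0].tolist()
    for column in cats.columns:
        cats[column] = pd.to_numeric(cats[column].str[-1])
    cats = cats.clip(upper=1)  # assumption: values above 1 are treated as 1

    df = pd.concat([df.drop(columns="categories"), cats], axis=1)
    df = df.drop_duplicates()
    df = df[~df["message"].duplicated(keep="last")]
    return df[df["message"] != "#NAME?"]  # spreadsheet error, not a message


def run_pipeline(messages, categories, conn, table="Message"):
    """Merge, clean, and store the two frames; returns the cleaned frame."""
    df = pd.merge(messages, categories, on="id", how="inner")
    df = clean_data(df)
    df.to_sql(table, conn, if_exists="replace", index=False)
    return df


# Toy inputs (hypothetical values) standing in for the two CSV files.
messages = pd.DataFrame(
    {"id": [2, 7, 7, 9], "message": ["a", "b", "b", "#NAME?"], "genre": ["direct"] * 4}
)
categories = pd.DataFrame(
    {
        "id": [2, 7, 7, 9],
        "categories": [
            "related-1;request-0",
            "related-2;request-1",
            "related-2;request-1",
            "related-0;request-0",
        ],
    }
)

conn = sqlite3.connect(":memory:")
result = run_pipeline(messages, categories, conn)
stored_rows = conn.execute("SELECT COUNT(*) FROM Message").fetchone()[0]
conn.close()
print(result)
```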

In [42]:
genre_counts = df.groupby('genre').count()['message']
genre_counts

genre
direct    10747
news      13035
social     2394
Name: message, dtype: int64

In [43]:
genre_names = list(genre_counts.index)
genre_names

['direct', 'news', 'social']

In [44]:
cat_names = ['related', 'request', 'offer', 'aid_related', 'medical_help',
    'medical_products', 'search_and_rescue', 'security', 'military', 'child_alone',
    'water', 'food', 'shelter', 'clothing', 'money', 'missing_people', 'refugees',
    'death', 'other_aid', 'infrastructure_related', 'transport', 'buildings',
    'electricity', 'tools', 'hospitals', 'shops', 'aid_centers',
    'other_infrastructure', 'weather_related', 'floods', 'storm', 'fire',
    'earthquake', 'cold', 'other_weather', 'direct_report']
cat_names

['related',
 'request',
 'offer',
 'aid_related',
 'medical_help',
 'medical_products',
 'search_and_rescue',
 'security',
 'military',
 'child_alone',
 'water',
 'food',
 'shelter',
 'clothing',
 'money',
 'missing_people',
 'refugees',
 'death',
 'other_aid',
 'infrastructure_related',
 'transport',
 'buildings',
 'electricity',
 'tools',
 'hospitals',
 'shops',
 'aid_centers',
 'other_infrastructure',
 'weather_related',
 'floods',
 'storm',
 'fire',
 'earthquake',
 'cold',
 'other_weather',
 'direct_report']

In [45]:
cat_counts = df[cat_names].sum().sort_values(ascending=False)
cat_counts

related                   20059
aid_related               10840
weather_related            7283
direct_report              5065
request                    4463
other_aid                  3439
food                       2917
earthquake                 2452
storm                      2440
shelter                    2309
floods                     2148
medical_help               2080
infrastructure_related     1701
water                      1669
other_weather              1373
buildings                  1329
medical_products           1311
transport                  1198
death                      1192
other_infrastructure       1147
refugees                    874
military                    858
search_and_rescue           723
money                       603
electricity                 532
cold                        529
security                    471
clothing                    404
aid_centers                 309
missing_people              298
hospitals                   283
fire    

In [46]:
cat_counts.index

Index(['related', 'aid_related', 'weather_related', 'direct_report', 'request',
       'other_aid', 'food', 'earthquake', 'storm', 'shelter', 'floods',
       'medical_help', 'infrastructure_related', 'water', 'other_weather',
       'buildings', 'medical_products', 'transport', 'death',
       'other_infrastructure', 'refugees', 'military', 'search_and_rescue',
       'money', 'electricity', 'cold', 'security', 'clothing', 'aid_centers',
       'missing_people', 'hospitals', 'fire', 'tools', 'shops', 'offer',
       'child_alone'],
      dtype='object')