# ETL Pipeline Preparation
This notebook follows the steps below to build the ETL pipeline.
### 1. Import libraries and load datasets.
- Import Python libraries
- Load `disaster_messages.csv` into a dataframe and inspect the first few lines.
- Load `disaster_categories.csv` into a dataframe and inspect the first few lines.

In [5]:
# import libraries
import sqlite3
import pandas as pd
from sqlalchemy import create_engine
import os

In [6]:
# load messages dataset
messages = pd.read_csv('disaster_messages.csv', encoding='latin-1')
messages.head()

Unnamed: 0,id,message,original,genre
0,2,Weather update - a cold front from Cuba that c...,Un front froid se retrouve sur Cuba ce matin. ...,direct
1,7,Is the Hurricane over or is it not over,Cyclone nan fini osinon li pa fini,direct
2,8,Looking for someone but no name,"Patnm, di Maryani relem pou li banm nouvel li ...",direct
3,9,UN reports Leogane 80-90 destroyed. Only Hospi...,UN reports Leogane 80-90 destroyed. Only Hospi...,direct
4,12,"says: west side of Haiti, rest of the country ...",facade ouest d Haiti et le reste du pays aujou...,direct


In [7]:
messages.describe(include ='all')

Unnamed: 0,id,message,original,genre
count,26248.0,26248,10184,26248
unique,,26177,9630,3
top,,#NAME?,Nap fe ou konnen ke apati de jodi a sevis SMS ...,news
freq,,4,20,13068
mean,15224.078368,,,
std,8826.069156,,,
min,2.0,,,
25%,7445.75,,,
50%,15660.5,,,
75%,22923.25,,,


In [8]:
# load categories dataset
categories = pd.read_csv('disaster_categories.csv',encoding='latin-1')
categories.head()

Unnamed: 0,id,categories
0,2,related-1;request-0;offer-0;aid_related-0;medi...
1,7,related-1;request-0;offer-0;aid_related-1;medi...
2,8,related-1;request-0;offer-0;aid_related-0;medi...
3,9,related-1;request-1;offer-0;aid_related-1;medi...
4,12,related-1;request-0;offer-0;aid_related-0;medi...


In [9]:
categories.describe(include='all')

Unnamed: 0,id,categories
count,26248.0,26248
unique,,4003
top,,related-0;request-0;offer-0;aid_related-0;medi...
freq,,6125
mean,15224.078368,
std,8826.069156,
min,2.0,
25%,7445.75,
50%,15660.5,
75%,22923.25,


### 3. Split `categories` into separate category columns.
- Split the values in the `categories` column on the `;` character so that each value becomes a separate column. You'll find [this method](https://pandas.pydata.org/pandas-docs/version/0.23/generated/pandas.Series.str.split.html) very helpful! Make sure to set `expand=True`.
- Use the first row of categories dataframe to create column names for the categories data.
- Rename columns of `categories` with new column names.

In [10]:
# create a dataframe of the 36 individual category columns
newcategories = categories["categories"].str.split(";", expand=True)
newcategories.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,26,27,28,29,30,31,32,33,34,35
0,related-1,request-0,offer-0,aid_related-0,medical_help-0,medical_products-0,search_and_rescue-0,security-0,military-0,child_alone-0,...,aid_centers-0,other_infrastructure-0,weather_related-0,floods-0,storm-0,fire-0,earthquake-0,cold-0,other_weather-0,direct_report-0
1,related-1,request-0,offer-0,aid_related-1,medical_help-0,medical_products-0,search_and_rescue-0,security-0,military-0,child_alone-0,...,aid_centers-0,other_infrastructure-0,weather_related-1,floods-0,storm-1,fire-0,earthquake-0,cold-0,other_weather-0,direct_report-0
2,related-1,request-0,offer-0,aid_related-0,medical_help-0,medical_products-0,search_and_rescue-0,security-0,military-0,child_alone-0,...,aid_centers-0,other_infrastructure-0,weather_related-0,floods-0,storm-0,fire-0,earthquake-0,cold-0,other_weather-0,direct_report-0
3,related-1,request-1,offer-0,aid_related-1,medical_help-0,medical_products-1,search_and_rescue-0,security-0,military-0,child_alone-0,...,aid_centers-0,other_infrastructure-0,weather_related-0,floods-0,storm-0,fire-0,earthquake-0,cold-0,other_weather-0,direct_report-0
4,related-1,request-0,offer-0,aid_related-0,medical_help-0,medical_products-0,search_and_rescue-0,security-0,military-0,child_alone-0,...,aid_centers-0,other_infrastructure-0,weather_related-0,floods-0,storm-0,fire-0,earthquake-0,cold-0,other_weather-0,direct_report-0


In [11]:
# select the first row of the categories dataframe
row = newcategories.iloc[0]
print("Raw categories\n", row)
# use this row to extract a list of new column names for categories.
# one way is to apply a lambda function that drops the last two
# characters of each string (the "-0" or "-1" suffix) with slicing
newcategory_colnames = row.apply(lambda x: x[0:-2])
print("Cleaned Category names\n", newcategory_colnames)

Raw categories
 0                    related-1
1                    request-0
2                      offer-0
3                aid_related-0
4               medical_help-0
5           medical_products-0
6          search_and_rescue-0
7                   security-0
8                   military-0
9                child_alone-0
10                     water-0
11                      food-0
12                   shelter-0
13                  clothing-0
14                     money-0
15            missing_people-0
16                  refugees-0
17                     death-0
18                 other_aid-0
19    infrastructure_related-0
20                 transport-0
21                 buildings-0
22               electricity-0
23                     tools-0
24                 hospitals-0
25                     shops-0
26               aid_centers-0
27      other_infrastructure-0
28           weather_related-0
29                    floods-0
30                     storm-0
31                     

In [12]:
# rename the columns of `categories`
newcategories.columns = newcategory_colnames
newcategories.head()

Unnamed: 0,related,request,offer,aid_related,medical_help,medical_products,search_and_rescue,security,military,child_alone,...,aid_centers,other_infrastructure,weather_related,floods,storm,fire,earthquake,cold,other_weather,direct_report
0,related-1,request-0,offer-0,aid_related-0,medical_help-0,medical_products-0,search_and_rescue-0,security-0,military-0,child_alone-0,...,aid_centers-0,other_infrastructure-0,weather_related-0,floods-0,storm-0,fire-0,earthquake-0,cold-0,other_weather-0,direct_report-0
1,related-1,request-0,offer-0,aid_related-1,medical_help-0,medical_products-0,search_and_rescue-0,security-0,military-0,child_alone-0,...,aid_centers-0,other_infrastructure-0,weather_related-1,floods-0,storm-1,fire-0,earthquake-0,cold-0,other_weather-0,direct_report-0
2,related-1,request-0,offer-0,aid_related-0,medical_help-0,medical_products-0,search_and_rescue-0,security-0,military-0,child_alone-0,...,aid_centers-0,other_infrastructure-0,weather_related-0,floods-0,storm-0,fire-0,earthquake-0,cold-0,other_weather-0,direct_report-0
3,related-1,request-1,offer-0,aid_related-1,medical_help-0,medical_products-1,search_and_rescue-0,security-0,military-0,child_alone-0,...,aid_centers-0,other_infrastructure-0,weather_related-0,floods-0,storm-0,fire-0,earthquake-0,cold-0,other_weather-0,direct_report-0
4,related-1,request-0,offer-0,aid_related-0,medical_help-0,medical_products-0,search_and_rescue-0,security-0,military-0,child_alone-0,...,aid_centers-0,other_infrastructure-0,weather_related-0,floods-0,storm-0,fire-0,earthquake-0,cold-0,other_weather-0,direct_report-0
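The slicing above relies on every value ending in a dash plus a single character. As a hedged alternative, the sketch below splits each string once from the right on `-` instead; the toy `row` is made up for illustration and is not taken from the dataset.

```python
import pandas as pd

# Toy row in the same "name-value" format as the categories data (illustrative only).
row = pd.Series(["related-1", "request-0", "aid_related-0"])

# Split once from the right on "-" and keep the part before the dash.
colnames = row.str.rsplit("-", n=1).str[0].tolist()
print(colnames)  # ['related', 'request', 'aid_related']
```

Splitting on the separator keeps working if a value ever has more than one digit, which fixed-width slicing would not.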


### 4. Convert category values to just numbers 0 or 1.
- Iterate through the category columns in df to keep only the last character of each string (the 1 or 0). For example, `related-0` becomes `0`, `related-1` becomes `1`. Convert the string to a numeric value.
- You can perform [normal string actions on Pandas Series](https://pandas.pydata.org/pandas-docs/stable/text.html#indexing-with-str), like indexing, by including `.str` after the Series. You may need to first convert the Series to be of type string, which you can do with `astype(str)`.
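The `.str` indexing mentioned above can be sketched on a small made-up dataframe. The `toy` frame and the `converted` name are assumptions for illustration; the notebook's own cell uses `apply` with a lambda on each value instead.

```python
import pandas as pd

# Toy frame with values in the "name-value" format (illustrative only).
toy = pd.DataFrame({
    "related": ["related-1", "related-0"],
    "request": ["request-0", "request-1"],
})

# For each column: cast to string, take the last character, convert to int.
converted = toy.apply(lambda col: col.astype(str).str[-1].astype(int))
print(converted)
```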

In [13]:
for column in newcategories:
    # set each value to be the last character of the string
    newcategories[column] = newcategories[column].astype("str").apply(lambda x: x[-1])
    # convert column from string to numeric
    newcategories[column] = newcategories[column].astype("int32")
    
newcategories.head()

Unnamed: 0,related,request,offer,aid_related,medical_help,medical_products,search_and_rescue,security,military,child_alone,...,aid_centers,other_infrastructure,weather_related,floods,storm,fire,earthquake,cold,other_weather,direct_report
0,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,1,0,0,1,0,0,0,0,0,0,...,0,0,1,0,1,0,0,0,0,0
2,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,1,1,0,1,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [14]:
newcategories.describe(include='all')

Unnamed: 0,related,request,offer,aid_related,medical_help,medical_products,search_and_rescue,security,military,child_alone,...,aid_centers,other_infrastructure,weather_related,floods,storm,fire,earthquake,cold,other_weather,direct_report
count,26248.0,26248.0,26248.0,26248.0,26248.0,26248.0,26248.0,26248.0,26248.0,26248.0,...,26248.0,26248.0,26248.0,26248.0,26248.0,26248.0,26248.0,26248.0,26248.0,26248.0
mean,0.774002,0.17068,0.004534,0.414432,0.079511,0.050061,0.027583,0.017944,0.032764,0.0,...,0.011772,0.043851,0.278269,0.082216,0.093264,0.010744,0.093531,0.020192,0.052423,0.193577
std,0.435472,0.376236,0.067181,0.492633,0.27054,0.218075,0.163778,0.132751,0.178023,0.0,...,0.107862,0.204767,0.448155,0.274698,0.290808,0.103095,0.291181,0.140659,0.222883,0.395108
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
max,2.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
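The `max` row above shows a value of 2 in `related`, so not every column is strictly 0 or 1. A minimal sketch of one way to list such columns, using a made-up `toy` frame rather than the real data:

```python
import pandas as pd

# Toy frame where one column contains a value other than 0 or 1 (illustrative only).
toy = pd.DataFrame({"related": [1, 0, 2, 1], "request": [0, 1, 0, 0]})

# A column counts as binary only if every one of its cells is 0 or 1.
is_binary = toy.isin([0, 1]).all()
non_binary_cols = is_binary[~is_binary].index.tolist()
print(non_binary_cols)  # ['related']

# Distribution of values in the offending column.
print(toy["related"].value_counts().sort_index())
```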


In [15]:
categories.describe(include='all')


Unnamed: 0,id,categories
count,26248.0,26248
unique,,4003
top,,related-0;request-0;offer-0;aid_related-0;medi...
freq,,6125
mean,15224.078368,
std,8826.069156,
min,2.0,
25%,7445.75,
50%,15660.5,
75%,22923.25,


In [16]:
Newcategories2 = pd.concat([categories, newcategories], axis=1)

In [17]:
# Newcategories2.describe(include='all')
Newcategories2.groupby(by=["related"]).count()

#categories.groupby(by=["categories"]).count()

related,id,categories,request,offer,aid_related,medical_help,medical_products,search_and_rescue,security,military,...,aid_centers,other_infrastructure,weather_related,floods,storm,fire,earthquake,cold,other_weather,direct_report
0,6125,6125,6125,6125,6125,6125,6125,6125,6125,6125,...,6125,6125,6125,6125,6125,6125,6125,6125,6125,6125
1,19930,19930,19930,19930,19930,19930,19930,19930,19930,19930,...,19930,19930,19930,19930,19930,19930,19930,19930,19930,19930
2,193,193,193,193,193,193,193,193,193,193,...,193,193,193,193,193,193,193,193,193,193


### 2. Merge datasets.
- Merge the messages and categories datasets using the common id
- Assign this combined dataset to `df`, which will be cleaned in the following steps

In [18]:
# merge datasets
df = messages.merge(Newcategories2, how='inner', on=["id"])
df.head()

Unnamed: 0,id,message,original,genre,categories,related,request,offer,aid_related,medical_help,...,aid_centers,other_infrastructure,weather_related,floods,storm,fire,earthquake,cold,other_weather,direct_report
0,2,Weather update - a cold front from Cuba that c...,Un front froid se retrouve sur Cuba ce matin. ...,direct,related-1;request-0;offer-0;aid_related-0;medi...,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,7,Is the Hurricane over or is it not over,Cyclone nan fini osinon li pa fini,direct,related-1;request-0;offer-0;aid_related-1;medi...,1,0,0,1,0,...,0,0,1,0,1,0,0,0,0,0
2,8,Looking for someone but no name,"Patnm, di Maryani relem pou li banm nouvel li ...",direct,related-1;request-0;offer-0;aid_related-0;medi...,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,9,UN reports Leogane 80-90 destroyed. Only Hospi...,UN reports Leogane 80-90 destroyed. Only Hospi...,direct,related-1;request-1;offer-0;aid_related-1;medi...,1,1,0,1,0,...,0,0,0,0,0,0,0,0,0,0
4,12,"says: west side of Haiti, rest of the country ...",facade ouest d Haiti et le reste du pays aujou...,direct,related-1;request-0;offer-0;aid_related-0;medi...,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [19]:
df.describe(include='all')

Unnamed: 0,id,message,original,genre,categories,related,request,offer,aid_related,medical_help,...,aid_centers,other_infrastructure,weather_related,floods,storm,fire,earthquake,cold,other_weather,direct_report
count,26386.0,26386,10246,26386,26386,26386.0,26386.0,26386.0,26386.0,26386.0,...,26386.0,26386.0,26386.0,26386.0,26386.0,26386.0,26386.0,26386.0,26386.0,26386.0
unique,,26177,9630,3,4003,,,,,,...,,,,,,,,,,
top,,Shelter materials (thick polyesters) are being...,Nap fe ou konnen ke apati de jodi a sevis SMS ...,news,related-0;request-0;offer-0;aid_related-0;medi...,,,,,,...,,,,,,,,,,
freq,,9,20,13128,6140,,,,,,...,,,,,,,,,,
mean,15217.885886,,,,,0.775032,0.171038,0.004586,0.415144,0.07955,...,0.011711,0.043773,0.278292,0.082506,0.093383,0.010687,0.093269,0.0202,0.052263,0.193777
std,8823.741128,,,,,0.435692,0.376549,0.067564,0.492756,0.2706,...,0.107583,0.204594,0.448166,0.275139,0.290974,0.102828,0.290815,0.140687,0.22256,0.395264
min,2.0,,,,,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,7438.25,,,,,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,15650.5,,,,,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,22916.75,,,,,1.0,0.0,0.0,1.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


## The merged df has 26386 rows, more than the 26248 rows in each of messages and categories. An inner merge on `id` can only add rows when `id` values are repeated in both inputs.
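The growth in row count can be reproduced on toy data. The sketch below uses made-up frames `left` and `right` in which one `id` appears twice on each side; it also shows the `validate` argument of `merge`, which raises an error instead of silently multiplying rows.

```python
import pandas as pd

# Toy inputs: id 2 appears twice in each frame (illustrative only).
left = pd.DataFrame({"id": [1, 2, 2], "message": ["a", "b", "b"]})
right = pd.DataFrame({"id": [1, 2, 2], "related": [1, 0, 0]})

# Each repeated id on the left pairs with each repeated id on the right.
merged = left.merge(right, how="inner", on="id")
print(len(left), len(right), len(merged))  # 3 3 5

# Number of rows whose id already appeared earlier in the frame.
print(left["id"].duplicated().sum())  # 1

# validate="one_to_one" makes pandas check key uniqueness before merging.
try:
    left.merge(right, on="id", validate="one_to_one")
except pd.errors.MergeError as exc:
    print("merge check failed:", exc)
```

Dropping duplicates from each input before merging, or after merging as this notebook does, both remove the extra rows when the repeated rows are identical.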

### 5. Replace `categories` column in `df` with new category columns.
- Drop the categories column from the df dataframe since it is no longer needed.
- Concatenate df and categories data frames.

In [20]:
# drop the original categories column from `df`
df.drop(["categories"], axis=1, inplace=True)
df.head()

Unnamed: 0,id,message,original,genre,related,request,offer,aid_related,medical_help,medical_products,...,aid_centers,other_infrastructure,weather_related,floods,storm,fire,earthquake,cold,other_weather,direct_report
0,2,Weather update - a cold front from Cuba that c...,Un front froid se retrouve sur Cuba ce matin. ...,direct,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,7,Is the Hurricane over or is it not over,Cyclone nan fini osinon li pa fini,direct,1,0,0,1,0,0,...,0,0,1,0,1,0,0,0,0,0
2,8,Looking for someone but no name,"Patnm, di Maryani relem pou li banm nouvel li ...",direct,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,9,UN reports Leogane 80-90 destroyed. Only Hospi...,UN reports Leogane 80-90 destroyed. Only Hospi...,direct,1,1,0,1,0,1,...,0,0,0,0,0,0,0,0,0,0
4,12,"says: west side of Haiti, rest of the country ...",facade ouest d Haiti et le reste du pays aujou...,direct,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [21]:
# let's check new df summary stats 
df.describe(include='all')

Unnamed: 0,id,message,original,genre,related,request,offer,aid_related,medical_help,medical_products,...,aid_centers,other_infrastructure,weather_related,floods,storm,fire,earthquake,cold,other_weather,direct_report
count,26386.0,26386,10246,26386,26386.0,26386.0,26386.0,26386.0,26386.0,26386.0,...,26386.0,26386.0,26386.0,26386.0,26386.0,26386.0,26386.0,26386.0,26386.0,26386.0
unique,,26177,9630,3,,,,,,,...,,,,,,,,,,
top,,Shelter materials (thick polyesters) are being...,Nap fe ou konnen ke apati de jodi a sevis SMS ...,news,,,,,,,...,,,,,,,,,,
freq,,9,20,13128,,,,,,,...,,,,,,,,,,
mean,15217.885886,,,,0.775032,0.171038,0.004586,0.415144,0.07955,0.049989,...,0.011711,0.043773,0.278292,0.082506,0.093383,0.010687,0.093269,0.0202,0.052263,0.193777
std,8823.741128,,,,0.435692,0.376549,0.067564,0.492756,0.2706,0.217926,...,0.107583,0.204594,0.448166,0.275139,0.290974,0.102828,0.290815,0.140687,0.22256,0.395264
min,2.0,,,,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,7438.25,,,,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,15650.5,,,,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,22916.75,,,,1.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


### 6. Remove duplicates.
- Check how many duplicates are in this dataset.
- Drop the duplicates.
- Confirm duplicates were removed.

In [22]:
# check number of duplicates
print("Total rows=", df.count()) 
print("Fraction of duplicate rows=",df.duplicated(subset=None, keep='first').mean())

Total rows= id                        26386
message                   26386
original                  10246
genre                     26386
related                   26386
request                   26386
offer                     26386
aid_related               26386
medical_help              26386
medical_products          26386
search_and_rescue         26386
security                  26386
military                  26386
child_alone               26386
water                     26386
food                      26386
shelter                   26386
clothing                  26386
money                     26386
missing_people            26386
refugees                  26386
death                     26386
other_aid                 26386
infrastructure_related    26386
transport                 26386
buildings                 26386
electricity               26386
tools                     26386
hospitals                 26386
shops                     26386
aid_centers               26

In [23]:
def remove_duplicates_from_df(df):
    """
    Remove duplicate rows from a dataframe, modifying it in place.
    
    Parameters:
    df (dataframe): The dataframe from which duplicates should be removed.
    
    Returns:
    dataframe, str: The same dataframe with duplicates removed, and a message
    stating how many duplicate rows were removed.
    """
    original_row_count = df.shape[0]
    # inplace=True changes the caller's dataframe as well as the returned one
    df.drop_duplicates(inplace=True)
    new_row_count = df.shape[0]
    duplicates_removed = original_row_count - new_row_count
    if duplicates_removed == 0:
        return df, "No duplicates found in the dataframe."
    else:
        return df, f"{duplicates_removed} duplicates removed from the dataframe."


In [24]:
# drop duplicate rows
df, dupknt = remove_duplicates_from_df(df)
print(dupknt)

170 duplicates removed from the dataframe.
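`drop_duplicates()` with no arguments only removes rows that match in every column, so rows that share an `id` but differ in a label survive. A hedged sketch on made-up data (the `toy` frame and variable names are assumptions for illustration) shows the difference:

```python
import pandas as pd

# Toy frame: id 1 is an exact duplicate, id 2 repeats with a different label.
toy = pd.DataFrame({
    "id": [1, 1, 2, 2],
    "message": ["a", "a", "b", "b"],
    "related": [1, 1, 0, 1],
})

full_dupes = toy.duplicated().sum()           # rows identical in every column
id_dupes = toy.duplicated(subset="id").sum()  # rows repeating an earlier id
print(full_dupes, id_dupes)  # 1 2

# Without inplace=True a new frame is returned and toy is left unchanged.
deduped = toy.drop_duplicates()
remaining_id_dupes = deduped.duplicated(subset="id").sum()
print(remaining_id_dupes)  # 1
```

Whether repeated ids with conflicting labels should also be dropped is a modelling decision that this notebook does not address.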


In [25]:
# check number of duplicate rows after
print("Total rows=\n", df.count()) 
print("Fraction of duplicate rows=",df.duplicated(subset=None, keep='first').mean())

Total rows=
 id                        26216
message                   26216
original                  10170
genre                     26216
related                   26216
request                   26216
offer                     26216
aid_related               26216
medical_help              26216
medical_products          26216
search_and_rescue         26216
security                  26216
military                  26216
child_alone               26216
water                     26216
food                      26216
shelter                   26216
clothing                  26216
money                     26216
missing_people            26216
refugees                  26216
death                     26216
other_aid                 26216
infrastructure_related    26216
transport                 26216
buildings                 26216
electricity               26216
tools                     26216
hospitals                 26216
shops                     26216
aid_centers               2

### 7. Save the clean dataset into an sqlite database.
You can do this with pandas [`to_sql` method](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.to_sql.html) combined with the SQLAlchemy library. Remember to import SQLAlchemy's `create_engine` in the first cell of this notebook to use it below.

In [26]:
# Derive the table name from the database filename
database_filename = 'DisasterResponse.db'
tablename = database_filename.split('.')[0]
print(tablename)

# If the DB file already exists, delete it before creating the SQLite engine
if os.path.exists(database_filename):
    os.remove(database_filename)

# Create the SQLite engine
engine = create_engine(f'sqlite:///{database_filename}')

# Copy the df dataframe to the SQLite database
df.to_sql(tablename, engine, index=False)


DisasterResponse


26216
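Deleting the database file is one way to make the cell re-runnable. Another option, sketched below, is the `if_exists="replace"` argument of `to_sql`, which drops and recreates the table. The sketch uses a made-up `toy` frame and an in-memory `sqlite3` connection so that it is self-contained; both are assumptions for illustration, and the notebook itself writes through a SQLAlchemy engine.

```python
import sqlite3

import pandas as pd

# Toy data standing in for the cleaned dataframe (illustrative only).
toy = pd.DataFrame({"id": [1, 2], "related": [1, 0]})

# In-memory database, used here only to keep the sketch self-contained.
con = sqlite3.connect(":memory:")

# Writing twice with if_exists="replace" leaves a single copy of the rows.
toy.to_sql("DisasterResponse", con, index=False, if_exists="replace")
toy.to_sql("DisasterResponse", con, index=False, if_exists="replace")

n_rows = pd.read_sql("SELECT COUNT(*) AS n FROM DisasterResponse", con)["n"].iloc[0]
print(n_rows)  # 2
con.close()
```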

In [27]:
# validate that the records are written to the database
print("dataframe has rows:  ", df.shape[0])
numrows = pd.read_sql("SELECT COUNT(*) as rows FROM DisasterResponse", engine)
print("Db table has ", numrows.iloc[0])

dataframe has rows:   26216
Db table has  rows    26216
Name: 0, dtype: int64


In [31]:
## validate the contents of the database table
import sqlite3
con = sqlite3.connect("DisasterResponse.db")
cursor = con.cursor()
## cursor.execute("SELECT name FROM sqlite_master WHERE type='table';")

cursor.execute("SELECT related, count(*) FROM DisasterResponse group by related;")
res= cursor.fetchall()

for tb in res:
    print("Before related=", tb)

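# Remove rows with related = 2 (if this cell has already been run once,
# none remain and the counts before and after are identical)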
cursor.execute("delete FROM DisasterResponse where related =2;")
con.commit()

cursor.execute("SELECT related, count(*) FROM DisasterResponse group by related;")
res= cursor.fetchall()

for tb in res:
    print("\n After related=", tb)

con.close()

Before related= (0, 6122)
Before related= (1, 19906)

 After related= (0, 6122)

 After related= (1, 19906)
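The cell above removes the `related = 2` rows from the database after it has been written. A possible alternative is to filter them out in pandas before saving, so the dataframe and the table stay in sync. The sketch below uses a made-up `toy` frame; dropping the rows (rather than, say, recoding 2 as 1) simply mirrors what the SQL `delete` does.

```python
import pandas as pd

# Toy frame with one row where related is 2 (illustrative only).
toy = pd.DataFrame({"id": [1, 2, 3], "related": [1, 2, 0]})

# Keep only the rows whose related value is not 2.
cleaned = toy[toy["related"] != 2].reset_index(drop=True)
print(cleaned["related"].value_counts().sort_index())
```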


### 8. Use this notebook to complete `etl_pipeline.py`
Use the template file attached in the Resources folder to write a script that runs the steps above to create a database from new datasets specified by the user. Alternatively, you can complete `etl_pipeline.py` in the classroom using the `Project Workspace IDE`, which comes later.
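As a rough starting point, the steps in this notebook could be arranged into a script along the following lines. The function names `load_data`, `clean_data`, `save_data`, and `main` are hypothetical and are not taken from the template, the `latin-1` encoding mirrors the cells above, and writing through a plain `sqlite3` connection is an assumption made to keep the sketch short.

```python
import sqlite3

import pandas as pd


def load_data(messages_path, categories_path):
    """Read both CSV files and return them as dataframes."""
    messages = pd.read_csv(messages_path, encoding="latin-1")
    categories = pd.read_csv(categories_path, encoding="latin-1")
    return messages, categories


def clean_data(messages, categories):
    """Expand the categories column, merge on id, and drop duplicate rows."""
    expanded = categories["categories"].str.split(";", expand=True)
    # Column names come from the first row, with the "-0"/"-1" suffix removed.
    expanded.columns = expanded.iloc[0].str.rsplit("-", n=1).str[0].tolist()
    for column in expanded.columns:
        expanded[column] = expanded[column].str[-1].astype(int)
    labelled = pd.concat([categories[["id"]], expanded], axis=1)
    df = messages.merge(labelled, how="inner", on="id")
    return df.drop_duplicates().reset_index(drop=True)


def save_data(df, database_path, table_name="DisasterResponse"):
    """Write the dataframe to an SQLite table, replacing any existing table."""
    con = sqlite3.connect(database_path)
    try:
        df.to_sql(table_name, con, index=False, if_exists="replace")
    finally:
        con.close()


def main(argv):
    """Run the pipeline; argv holds the two CSV paths and the database path."""
    messages_path, categories_path, database_path = argv
    messages, categories = load_data(messages_path, categories_path)
    save_data(clean_data(messages, categories), database_path)
```

A script built this way could call `main(sys.argv[1:])` at the bottom, after importing `sys`, so that the three paths are read from the command line.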