**First, I'll import the necessary libraries into the project.**

In [1]:
# This cell contains the necessary imports for the project to work
import numpy as np
import pandas as pd
import re
from sqlalchemy import create_engine
from sklearn.model_selection import train_test_split
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
import nltk
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report
import joblib
import warnings

nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

warnings.simplefilter('ignore')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Daniel\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Daniel\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Daniel\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


## **Getting the data**

**In this section I will retrieve the data to be used to train and evaluate the model.**

In [3]:
# Creating a connection and reading the data
engine = create_engine('sqlite:///../data/DisasterResponse.db')
df_disaster_categories = pd.read_sql_table('disaster_categories', engine)

## **Building the model**

**Before building the model, I will split the data into predictor and target variables.**

In [7]:
# Splitting the data
X = df_disaster_categories.message
y = df_disaster_categories.iloc[:, 2:]

**Some adjustments are required because the predictor, the message column, is free text. I also need to check whether any target column has no variability, that is, all of its values are equal.**

**To do this, two steps will be performed:**
1. **Tokenize the message column.**
2. **Find and remove the columns without variability.**

#### **Tokenization**

**Now, a function will be created to clean the text data. It will:**
1. **strip URLs and punctuation**
2. **remove stopwords**
3. **reduce words to their base (dictionary) form through lemmatization**
4. **normalize case**

In [None]:
def tokenize(text):
    """Clean a message and return a list of lowercased, lemmatized tokens."""
    stopwords_list = set(stopwords.words('english'))

    # Replace URLs and any non-word characters with spaces
    url_regex = r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'
    new_text = re.sub(url_regex, ' ', text)
    new_text = re.sub(r'\W', ' ', new_text)

    # Lowercase before the stopword check so capitalized stopwords are removed too
    tokens = word_tokenize(new_text.lower())
    lemmatizer = WordNetLemmatizer()
    new_tokens = [lemmatizer.lemmatize(word) for word in tokens if word not in stopwords_list]

    return new_tokens
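
**A minimal sketch of why the order of these steps matters: if stopwords are filtered before the text is lowercased, capitalized stopwords such as "The" slip through. The tiny stopword set and both helper names are hypothetical, used only for illustration; the real function relies on NLTK's English stopword list.**

```python
import re

# Hypothetical, tiny stopword set used only for this illustration
STOPWORDS = {"the", "is", "in", "a"}

def filter_then_lower(text):
    """Filter stopwords first, lowercase afterwards (capitalized stopwords survive)."""
    tokens = re.sub(r"\W", " ", text).split()
    return [t.lower() for t in tokens if t not in STOPWORDS]

def lower_then_filter(text):
    """Lowercase first, then filter stopwords."""
    tokens = re.sub(r"\W", " ", text.lower()).split()
    return [t for t in tokens if t not in STOPWORDS]

message = "The water is rising in Santiago!"
print(filter_then_lower(message))  # ['the', 'water', 'rising', 'santiago']
print(lower_then_filter(message))  # ['water', 'rising', 'santiago']
```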

#### **Searching and removing columns**

In [None]:
# Searching for columns without variability (a single distinct value)
y.columns[y.nunique() == 1]

In [None]:
# Dropping the columns without variability
y = y.drop(columns=y.columns[y.nunique() == 1])
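
**Two ways of finding constant columns can be compared on a hypothetical toy frame (the column names are invented for illustration): a zero sum only catches all-zero columns, while counting distinct values also catches a column made entirely of ones.**

```python
import pandas as pd

# Invented toy targets: one varying column and two constant ones
toy = pd.DataFrame({
    "mixed":     [1, 0, 1, 0],
    "all_zeros": [0, 0, 0, 0],
    "all_ones":  [1, 1, 1, 1],
})

zero_sum = list(toy.columns[toy.sum() == 0])      # columns that add up to zero
constant = list(toy.columns[toy.nunique() == 1])  # columns with one distinct value

print(zero_sum)  # ['all_zeros']
print(constant)  # ['all_zeros', 'all_ones']
```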

#### **Building and training a Pipeline**

**To assemble the pipeline, I will use ``CountVectorizer`` with the ``tokenize`` function created above to turn the messages into token-count vectors, ``TfidfTransformer`` to weight those counts, and finally a ``DecisionTreeClassifier`` trained on the result.**

In [None]:
# Splitting into training and testing
X_train, X_test, y_train, y_test = train_test_split(
    X, 
    y,
    test_size=.30,
    random_state=41
)

In [None]:
# Building the pipeline
pipe = Pipeline([('vectorizer', CountVectorizer(tokenizer=tokenize)),
                 ('tfidf', TfidfTransformer()),
                 ('clf', DecisionTreeClassifier())])
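
**A minimal sketch of what the first two pipeline steps produce, on an invented three-message corpus and with the default tokenizer rather than the custom ``tokenize`` function (both are assumptions made to keep the example self-contained):**

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Invented messages, for illustration only
corpus = [
    "we need water",
    "we need food and water",
    "roads are blocked",
]

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(corpus)          # sparse matrix of raw term counts
tfidf = TfidfTransformer().fit_transform(counts)   # counts re-weighted by TF-IDF

print(sorted(vectorizer.vocabulary_))
# ['and', 'are', 'blocked', 'food', 'need', 'roads', 'water', 'we']
print(counts.shape)  # (3, 8)
```

**With the default settings, each row of ``tfidf`` is scaled to unit length, so long and short messages end up on a comparable scale.**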

In [None]:
# Training the pipeline
pipe.fit(X_train, y_train)

In [None]:
# Checking the metrics
print(classification_report(y_test, pipe.predict(X_test), target_names=y_test.columns))
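
**For reading the report: each row is one category, with precision, recall and F1 computed from that category's column alone. A hand-rolled sketch for a single label, on invented values and assuming the usual definitions (the helper name is hypothetical):**

```python
def precision_recall_f1(y_true, y_pred):
    """Precision, recall and F1 for one binary label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

# Invented labels for a rare category: 3 positives out of 8 messages
y_true = [1, 0, 0, 1, 0, 0, 0, 1]
y_pred = [1, 0, 0, 0, 0, 1, 0, 0]

precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(round(precision, 2), round(recall, 2), round(f1, 2))  # 0.5 0.33 0.4
```

**Here 5 of the 8 predictions are correct, yet only one of the three positive messages is found, which is why recall and F1 say more than accuracy for rare categories.**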

**The metrics weren't good, so let's try tuning the hyperparameters.**

#### **Tuning hyperparameters**

In [None]:
# Setting the parameters
params = {'clf__max_depth':[6, 7, 15],
          'clf__min_samples_split':[20, 30, 40],
          'clf__criterion':['entropy', 'gini']}

In [None]:
# Looking for the best combination
grid_clf = GridSearchCV(pipe, params, cv=5)
grid_clf.fit(X_train, y_train)
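
**To get a feel for the cost of this search, the grid can be enumerated by hand. This is only a counting sketch with the same parameter values and fold count as above; it does not train anything.**

```python
from itertools import product

params = {'clf__max_depth': [6, 7, 15],
          'clf__min_samples_split': [20, 30, 40],
          'clf__criterion': ['entropy', 'gini']}
cv_folds = 5

# Every combination of one value per parameter
combinations = [dict(zip(params, values)) for values in product(*params.values())]
n_fits = len(combinations) * cv_folds  # one fit per combination and fold

print(len(combinations))  # 18
print(n_fits)             # 90
```

**With the default ``refit=True``, ``GridSearchCV`` then fits the best combination once more on the whole training set; after fitting, ``grid_clf.best_params_`` shows which combination won.**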

In [None]:
# Checking the metrics
print(classification_report(y_test, grid_clf.predict(X_test), target_names=y_test.columns))

**In the end, the scores remained low, so improving them will require either more data or a different approach to processing the data.**

#### **Saving the model**

In [None]:
# Saving the fitted grid search (which includes the best pipeline) to disk
joblib.dump(grid_clf, 'model.pkl')
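
**A sketch of loading a saved file back, shown with a small stand-in object instead of the fitted search so that it runs on its own. The temporary path and the stand-in dictionary are assumptions for illustration. One caveat for the real ``model.pkl``: the pipeline references the custom ``tokenize`` function, and pickled functions are stored by reference, so ``tokenize`` has to be defined or importable in whatever script loads the file.**

```python
import os
import tempfile

import joblib

# Hypothetical stand-in for the fitted model
stand_in = {'clf__max_depth': 7, 'labels': ['water', 'food']}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'model.pkl')
    joblib.dump(stand_in, path)    # same call as used for the fitted search
    restored = joblib.load(path)   # rebuilds an equivalent Python object

print(restored == stand_in)  # True
```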