# ML Pipeline Preparation
Follow the instructions below to help you create your ML pipeline.
### 1. Import libraries and load data from database.
- Import Python libraries
- Load dataset from database with [`read_sql_table`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_sql_table.html)
- Define feature and target variables X and Y

In [2]:
# import libraries
from sqlalchemy import create_engine
import pandas as pd

In [3]:
def load_data(database_name, table_name):
    engine = create_engine(f'sqlite:///{database_name}.db')

    df = pd.read_sql_table(table_name, con=engine)
   
    print(df.columns)
    print(df.head(2))

    X = df['message']
    Y = df.iloc[:, 4:]

    return X, Y

X, Y = load_data('InsertDatabaseName', 'InsertTableName1')

Index(['id', 'message', 'original', 'genre', 'related', 'request', 'offer',
       'aid_related', 'medical_help', 'medical_products', 'search_and_rescue',
       'security', 'military', 'child_alone', 'water', 'food', 'shelter',
       'clothing', 'money', 'missing_people', 'refugees', 'death', 'other_aid',
       'infrastructure_related', 'transport', 'buildings', 'electricity',
       'tools', 'hospitals', 'shops', 'aid_centers', 'other_infrastructure',
       'weather_related', 'floods', 'storm', 'fire', 'earthquake', 'cold',
       'other_weather', 'direct_report'],
      dtype='object')
   id                                            message  \
0   2  Weather update - a cold front from Cuba that c...   
1   7            Is the Hurricane over or is it not over   

                                            original   genre  related  \
0  Un front froid se retrouve sur Cuba ce matin. ...  direct        1   
1                 Cyclone nan fini osinon li pa fini  direct        1   



### 2. Write a tokenization function to process your text data

In [6]:
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

def tokenize(text):
    """
    Tokenize and clean text by breaking it into words, lemmatizing, 
    converting to lower case and removing leading/trailing white space.

    Parameters:
    text (str): Text to be tokenized.

    Returns:
    list: List of clean, lemmatized tokens.
    """
    # Initialize WordNet lemmatizer
    lemmatizer = WordNetLemmatizer()

    # Tokenize text into words
    tokens = word_tokenize(text)

    # Initialize an empty list to hold the cleaned tokens
    clean_tokens = []

    # Iterate over each token
    for token in tokens:
        # Lemmatize, convert to lower case and remove leading/trailing white space
        clean_token = lemmatizer.lemmatize(token).lower().strip()

        # Add the clean token to the list
        clean_tokens.append(clean_token)

    return clean_tokens

### 3. Build a machine learning pipeline
This machine pipeline should take in the `message` column as input and output classification results on the other 36 categories in the dataset. You may find the [MultiOutputClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.multioutput.MultiOutputClassifier.html) helpful for predicting multiple target variables.

In [7]:
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.multioutput import MultiOutputClassifier
import nltk

# Create a machine learning pipeline
pipeline = Pipeline([
    ('vect', CountVectorizer(tokenizer=tokenize)),
    ('tfidf', TfidfTransformer()),
    ('clf', MultiOutputClassifier(RandomForestClassifier()))
])

### 4. Train pipeline
- Split data into train and test sets
- Train pipeline

In [13]:
from sklearn.model_selection import train_test_split

def train_classifier(X, Y, pipeline, test_size=0.2, random_state=42):
    """
    Train a classifier using a training pipeline.

    Parameters:
    X: Features dataset.
    Y: Labels dataset.
    pipeline: The machine learning pipeline that includes the preprocessing and the classifier.
    test_size: The proportion of the dataset to include in the test split (default is 0.2).
    random_state: The seed used by the random number generator (default is 42).

    Returns:
    The trained pipeline.
    """
    # Split data into training and test sets
    X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=test_size, random_state=random_state)

    # Train classifier
    pipeline.fit(X_train, Y_train)

    # Return the trained pipeline and the test set for further evaluation
    return pipeline, X_train, X_test, Y_train, Y_test

trained_pipeline, X_train, X_test, Y_train, Y_test = train_classifier(X, Y, pipeline)

In [9]:
# Split data into training and test sets
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.2, random_state = 42)

# Train classifier
pipeline.fit(X_train, Y_train)

### 5. Test your model
Report the f1 score, precision and recall for each output category of the dataset. You can do this by iterating through the columns and calling sklearn's `classification_report` on each.

In [11]:
import pandas as pd
from sklearn.metrics import classification_report

def evaluate_model(pipeline, X_test, Y_test):
    """
    Predict on test data and print classification report for each label feature.

    Parameters:
    pipeline: The trained machine learning pipeline.
    X_test: Test dataset features.
    Y_test: Test dataset labels.

    Returns:
    Y_pred_df: DataFrame containing the predictions.
    """
    # Predict on test data using the pipeline
    Y_pred = pipeline.predict(X_test)

    # Get a list of the column names of Y for iteration
    target_names = Y_test.columns.tolist()

    # Convert Y_test and Y_pred into DataFrames for easier manipulation
    Y_test_df = pd.DataFrame(Y_test, columns=target_names)
    Y_pred_df = pd.DataFrame(Y_pred, columns=target_names)

    # Print classification report for each feature
    for column in target_names:
        print('------------------------------------------------------\n')
        print(f'FEATURE: {column}\n')
        print(classification_report(Y_test_df[column], Y_pred_df[column]))

    return Y_pred_df

Y_pred_df = evaluate_model(trained_pipeline, X_test, Y_test)

------------------------------------------------------

FEATURE: related

              precision    recall  f1-score   support

           0       0.74      0.27      0.39      1266
           1       0.80      0.97      0.88      3938
           2       0.83      0.12      0.22        40

    accuracy                           0.79      5244
   macro avg       0.79      0.45      0.50      5244
weighted avg       0.78      0.79      0.75      5244

------------------------------------------------------

FEATURE: request

              precision    recall  f1-score   support

           0       0.89      0.99      0.94      4349
           1       0.90      0.41      0.56       895

    accuracy                           0.89      5244
   macro avg       0.90      0.70      0.75      5244
weighted avg       0.89      0.89      0.87      5244

------------------------------------------------------

FEATURE: offer

              precision    recall  f1-score   support

           0     

  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


              precision    recall  f1-score   support

           0       1.00      1.00      1.00      5244

    accuracy                           1.00      5244
   macro avg       1.00      1.00      1.00      5244
weighted avg       1.00      1.00      1.00      5244

------------------------------------------------------

FEATURE: water

              precision    recall  f1-score   support

           0       0.95      1.00      0.97      4905
           1       0.89      0.26      0.40       339

    accuracy                           0.95      5244
   macro avg       0.92      0.63      0.69      5244
weighted avg       0.95      0.95      0.94      5244

------------------------------------------------------

FEATURE: food

              precision    recall  f1-score   support

           0       0.93      0.99      0.96      4649
           1       0.91      0.39      0.54       595

    accuracy                           0.93      5244
   macro avg       0.92      0.69      

  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


              precision    recall  f1-score   support

           0       0.96      1.00      0.98      5007
           1       0.89      0.14      0.23       237

    accuracy                           0.96      5244
   macro avg       0.92      0.57      0.61      5244
weighted avg       0.96      0.96      0.95      5244

------------------------------------------------------

FEATURE: other_aid

              precision    recall  f1-score   support

           0       0.87      1.00      0.93      4549
           1       0.69      0.01      0.03       695

    accuracy                           0.87      5244
   macro avg       0.78      0.51      0.48      5244
weighted avg       0.85      0.87      0.81      5244

------------------------------------------------------

FEATURE: infrastructure_related

              precision    recall  f1-score   support

           0       0.94      1.00      0.97      4916
           1       0.50      0.00      0.01       328

    accuracy     

  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


              precision    recall  f1-score   support

           0       0.99      1.00      0.99      5177
           1       0.00      0.00      0.00        67

    accuracy                           0.99      5244
   macro avg       0.49      0.50      0.50      5244
weighted avg       0.97      0.99      0.98      5244

------------------------------------------------------

FEATURE: other_infrastructure

              precision    recall  f1-score   support

           0       0.96      1.00      0.98      5021
           1       0.00      0.00      0.00       223

    accuracy                           0.96      5244
   macro avg       0.48      0.50      0.49      5244
weighted avg       0.92      0.96      0.94      5244

------------------------------------------------------

FEATURE: weather_related

              precision    recall  f1-score   support

           0       0.87      0.96      0.91      3806
           1       0.86      0.62      0.72      1438

    accuracy 

### 6. Improve your model
Use grid search to find better parameters. 

In [15]:
from sklearn.model_selection import GridSearchCV

def perform_grid_search(pipeline, X_train, Y_train):
    """
    Perform grid search to find the best parameters for the pipeline.

    Parameters:
    pipeline: The machine learning pipeline on which to perform grid search.
    X_train: Training dataset features.
    Y_train: Training dataset labels.

    Returns:
    cv: The fitted GridSearchCV object.
    """
    # Define the parameter grid to search
    parameters = {
        'clf__estimator__n_estimators': [50, 100],
        'clf__estimator__min_samples_split': [2, 5],
        'clf__estimator__min_samples_leaf': [2, 4],
    }

    # Create GridSearchCV object with the pipeline, parameter grid, and verbose output
    cv = GridSearchCV(pipeline, param_grid=parameters, verbose=3)

    # Fit GridSearchCV
    cv.fit(X_train, Y_train)

    # Return the GridSearchCV object to access the results
    return cv


cv = perform_grid_search(pipeline, X_train, Y_train)
print(cv.best_params_)

Fitting 5 folds for each of 8 candidates, totalling 40 fits
[CV 1/5] END clf__estimator__min_samples_leaf=2, clf__estimator__min_samples_split=2, clf__estimator__n_estimators=50;, score=0.200 total time= 6.2min
[CV 2/5] END clf__estimator__min_samples_leaf=2, clf__estimator__min_samples_split=2, clf__estimator__n_estimators=50;, score=0.195 total time= 2.7min
[CV 3/5] END clf__estimator__min_samples_leaf=2, clf__estimator__min_samples_split=2, clf__estimator__n_estimators=50;, score=0.214 total time= 4.2min
[CV 4/5] END clf__estimator__min_samples_leaf=2, clf__estimator__min_samples_split=2, clf__estimator__n_estimators=50;, score=0.190 total time= 3.4min
[CV 5/5] END clf__estimator__min_samples_leaf=2, clf__estimator__min_samples_split=2, clf__estimator__n_estimators=50;, score=0.214 total time= 2.0min
[CV 1/5] END clf__estimator__min_samples_leaf=2, clf__estimator__min_samples_split=2, clf__estimator__n_estimators=100;, score=0.202 total time= 3.7min
[CV 2/5] END clf__estimator__min_

### 7. Test your model
Show the accuracy, precision, and recall of the tuned model.  

Since this project focuses on code quality, process, and  pipelines, there is no minimum performance metric needed to pass. However, make sure to fine tune your models for accuracy, precision and recall to make your project stand out - especially for your portfolio!

In [16]:
import pandas as pd
from sklearn.metrics import classification_report

def predict_and_evaluate(pipeline, X_test, Y_test):
    """
    Predict on test data, print classification report for each label, and calculate accuracy.

    Parameters:
    pipeline: The trained machine learning pipeline.
    X_test: Test dataset features.
    Y_test: Test dataset labels.

    Returns:
    accuracy: The mean accuracy across all label predictions.
    """
    # Predict on test data using the pipeline
    Y_pred = pipeline.predict(X_test)

    # Get a list of the column names of Y for iteration
    target_names = Y_test.columns.tolist()

    # Convert Y_test and Y_pred into DataFrames for easier manipulation
    Y_test_df = pd.DataFrame(Y_test, columns=target_names)
    Y_pred_df = pd.DataFrame(Y_pred, columns=target_names)

    # Print classification report for each feature
    for column in target_names:
        print('------------------------------------------------------\n')
        print('FEATURE: {}\n'.format(column))
        print(classification_report(Y_test_df[column], Y_pred_df[column]))

    # Calculate and return the mean accuracy across all label predictions
    accuracy = (Y_pred_df == Y_test_df).mean()
    return accuracy


accuracy = predict_and_evaluate(pipeline, X_test, Y_test)
print('Mean Accuracy:', accuracy)

------------------------------------------------------

FEATURE: related

              precision    recall  f1-score   support

           0       0.73      0.26      0.38      1266
           1       0.80      0.97      0.87      3938
           2       0.75      0.07      0.14        40

    accuracy                           0.79      5244
   macro avg       0.76      0.43      0.46      5244
weighted avg       0.78      0.79      0.75      5244

------------------------------------------------------

FEATURE: request

              precision    recall  f1-score   support

           0       0.89      0.99      0.94      4349
           1       0.90      0.41      0.56       895

    accuracy                           0.89      5244
   macro avg       0.89      0.70      0.75      5244
weighted avg       0.89      0.89      0.87      5244

------------------------------------------------------

FEATURE: offer

              precision    recall  f1-score   support

           0     

  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


              precision    recall  f1-score   support

           0       1.00      1.00      1.00      5244

    accuracy                           1.00      5244
   macro avg       1.00      1.00      1.00      5244
weighted avg       1.00      1.00      1.00      5244

------------------------------------------------------

FEATURE: water

              precision    recall  f1-score   support

           0       0.95      1.00      0.97      4905
           1       0.87      0.24      0.38       339

    accuracy                           0.95      5244
   macro avg       0.91      0.62      0.67      5244
weighted avg       0.94      0.95      0.93      5244

------------------------------------------------------

FEATURE: food

              precision    recall  f1-score   support

           0       0.93      0.99      0.96      4649
           1       0.91      0.43      0.59       595

    accuracy                           0.93      5244
   macro avg       0.92      0.71      

  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


              precision    recall  f1-score   support

           0       0.99      1.00      0.99      5191
           1       0.00      0.00      0.00        53

    accuracy                           0.99      5244
   macro avg       0.49      0.50      0.50      5244
weighted avg       0.98      0.99      0.98      5244

------------------------------------------------------

FEATURE: earthquake

              precision    recall  f1-score   support

           0       0.98      0.99      0.98      4766
           1       0.88      0.75      0.81       478

    accuracy                           0.97      5244
   macro avg       0.93      0.87      0.90      5244
weighted avg       0.97      0.97      0.97      5244

------------------------------------------------------

FEATURE: cold

              precision    recall  f1-score   support

           0       0.98      1.00      0.99      5127
           1       0.77      0.09      0.15       117

    accuracy                      

ValueError: Can only compare identically-labeled DataFrame objects

### 8. Try improving your model further. Here are a few ideas:
* try other machine learning algorithms
* add other features besides the TF-IDF

### 9. Export your model as a pickle file

In [28]:
import pickle

with open('model.pkl', 'wb') as file:
    pickle.dump(cv, file)
    

### 10. Use this notebook to complete `train.py`
Use the template file attached in the Resources folder to write a script that runs the steps above to create a database and export a model based on a new dataset specified by the user.