# ML Pipeline Preparation
Follow the instructions below to help you create your ML pipeline.
### 1. Import libraries and load data from database.
- Import Python libraries
- Load dataset from database with [`read_sql_table`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_sql_table.html)
- Define feature and target variables X and Y

In [1]:
# import libraries
import pandas as pd
import re
from sqlalchemy import create_engine

In [3]:
# load data from database
engine = create_engine('sqlite:///DisasterResponse.db')
df = pd.read_sql_table('response_table',con=engine)
X = df.loc[:,'message'].values
Y = df.iloc[:,4:]
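
The loading step can be tried end to end on a toy table. This is a minimal sketch, not the project's data: the in-memory SQLite engine, the sample rows, and every column name other than `message` are assumptions, chosen so that the target columns start at position 4 as in the cell above.

```python
import pandas as pd
from sqlalchemy import create_engine

# Hypothetical toy data: four leading columns, then the category columns.
toy = pd.DataFrame({
    'id': [1, 2],
    'message': ['need water', 'road is blocked'],
    'original': ['n/a', 'n/a'],
    'genre': ['direct', 'news'],
    'related': [1, 1],
    'request': [1, 0],
})

# In-memory SQLite database (an assumption for this sketch only).
toy_engine = create_engine('sqlite://')
toy.to_sql('response_table', toy_engine, index=False)

# Read the table back and split it into features and targets.
df_toy = pd.read_sql_table('response_table', con=toy_engine)
X_toy = df_toy.loc[:, 'message'].values   # 1-D array of message strings
Y_toy = df_toy.iloc[:, 4:]                # every column from position 4 onward

print(X_toy)
print(list(Y_toy.columns))
```

Selecting targets by position is compact but fragile: if a column is added before the categories, `iloc[:, 4:]` silently shifts. Dropping the non-target columns by name is a possible alternative.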

### 2. Write a tokenization function to process your text data

Look up a complicated text to work on:
- create a dictionary that keeps all URLs
- look up a complicated text
- test our code on this text

In [None]:
# regex pattern to identify a URL (raw string, so the backslashes reach the regex engine unchanged)
url_regex = r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'

# a dictionary comprehension mapping the index of each message that contains
# more than one URL to its list of URLs, in case we wish to process them further later
urls_dict = \
{df[df.message==text].index[0]:re.findall(url_regex, text) for text in df.message if len(re.findall(url_regex, text))>1}

In [None]:
# look up a complicated text to work on
text = df.message[12409]
text

In [None]:
# we use the above-mentioned text to get its index in the data frame, which will later help us identify URLs
# and where they are located in the dataframe (their index)
df[df.message == 'Wind 16.0 mph NNE. Barometer 982.09 mb, gust 31.0, Temp 56.9 &amp;deg;F. Rain 0.00 in. Humidity 95% hurricane cam http://t.co/Sq2ekENuWind 16.0 mph NNE. Barometer 981.68 mb, gust 26.0, Temp 57.1 &amp;deg;F. Rain 0.00 in. Humidity 95% hurricane cam http://t.co/Sq2ekENu'].index

In [None]:
# a pandas Index is the basic object storing axis labels for all pandas objects; here we need the index value itself
type(df[df.message == 'Wind 16.0 mph NNE. Barometer 982.09 mb, gust 31.0, Temp 56.9 &amp;deg;F. Rain 0.00 in. Humidity 95% hurricane cam http://t.co/Sq2ekENuWind 16.0 mph NNE. Barometer 981.68 mb, gust 26.0, Temp 57.1 &amp;deg;F. Rain 0.00 in. Humidity 95% hurricane cam http://t.co/Sq2ekENu'].index[0])

In [None]:
# replace every URL found in the text with the string "url"
text = re.sub(url_regex, 'url', text)
text

In [None]:
# normalize with lower case
text = text.lower()
text

In [None]:
# remove punctuation
text = re.sub(r"[^a-zA-Z0-9]", " ", text)
text
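
The three normalization steps above (URL replacement, lower-casing, punctuation removal) can be combined into one helper. The sketch below uses only the standard library; the function name `normalize` and the final whitespace split are assumptions added for illustration, and the sample string is an excerpt of the message shown in this notebook.

```python
import re

# Same pattern as in the notebook, written as a raw string.
URL_REGEX = r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'

def normalize(text):
    """Replace URLs, lower-case, strip punctuation, and split on whitespace."""
    text = re.sub(URL_REGEX, 'url', text)        # 1. URLs become the token 'url'
    text = text.lower()                          # 2. lower-case everything
    text = re.sub(r"[^a-zA-Z0-9]", " ", text)    # 3. punctuation becomes spaces
    return text.split()                          # simple whitespace tokenization

sample = 'Humidity 95% hurricane cam http://t.co/Sq2ekENu'
print(normalize(sample))
# ['humidity', '95', 'hurricane', 'cam', 'url']
```

Note that the pattern consumes every character up to the next whitespace, which is why in the full message the word `Wind` glued to the end of the first URL disappears together with it.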

In [None]:
import nltk
from nltk.tokenize import word_tokenize
from nltk.tokenize import sent_tokenize
from nltk.corpus import stopwords
nltk.download(['punkt', 'words', 'stopwords', 'averaged_perceptron_tagger'])

In [None]:
from nltk.corpus import stopwords
print(stopwords.words("english"))

In [None]:
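# tokenize and remove stop words (the stop-word set is built once instead of once per token)
stop_words = set(stopwords.words("english"))
words = [w for w in word_tokenize(text) if w not in stop_words]
words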

In [None]:
# import part-of-speech (POS) tagging and named entity recognition (NER) helpers
from nltk import pos_tag, ne_chunk
words = pos_tag(words)
words

In [None]:
from nltk.stem.wordnet import WordNetLemmatizer
nltk.download('wordnet')
# Reduce words to their root form, lemmatizing every word as a verb (pos='v')
lemmed = [WordNetLemmatizer().lemmatize(w[0], pos = 'v') for w in words]
lemmed

In [None]:
from nltk.stem.porter import PorterStemmer

# Reduce words to their stems
stemmed = [PorterStemmer().stem(w) for w in lemmed]
stemmed
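
Stems are often not dictionary words. The sketch below runs the Porter stemmer on its own, without the preceding lemmatization step, on three words taken from the sample message; the word list is chosen here for illustration. The stemmer is purely rule-based, so no NLTK data download is needed for this part.

```python
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()

# Three nouns from the sample message.
sample_words = ['barometer', 'humidity', 'hurricane']
stems = [stemmer.stem(w) for w in sample_words]

print(stems)
# ['baromet', 'humid', 'hurrican']
```

Creating the stemmer once and reusing it, rather than calling `PorterStemmer()` inside the list comprehension, avoids building a new object for every word.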

In [16]:
# credit to https://github.com/iris-theof/Disaster_response_pipeline
def tokenize(text):
    '''
    Function that splits text into words and returns the root form of the words
    after removing the stop words
    
    Input: text(str): the message
    Output: lemm(list of str): a list of the root form of the message words
    '''
    # Regex to find URLs (raw string, so the backslashes reach the regex engine unchanged)
    url_regex = r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'
    
    # Finds all urls from the provided text
    detected_urls = re.findall(url_regex, text)
    
    # Replaces every URL found with the string "urlplaceholder"
    for url in detected_urls:
        text = text.replace(url, "urlplaceholder")
        
    # Normalize text
    text = re.sub(r"[^a-zA-Z0-9]", " ", text.lower())    
        
    # Extracts the word tokens from the provided text    
    tokens = word_tokenize(text)
      
    # Remove stop words
    stop = stopwords.words("english")
    words = [t for t in tokens if t not in stop]
    
    # Lemmatizer to reduce inflected forms of a word to a common base form
    lemmatizer = WordNetLemmatizer()

    # Makes a list of clean tokens
    clean_tokens = []
    for tok in words:
        clean_tok = lemmatizer.lemmatize(tok).lower().strip()
        clean_tokens.append(clean_tok)

    return clean_tokens


In [4]:
import re
from sqlalchemy import create_engine
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk import pos_tag
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.stem.porter import PorterStemmer
import nltk
nltk.download(['punkt', 'words', 'stopwords', 'averaged_perceptron_tagger', 'wordnet'])
def tokenize(text):
    """
    Input:
    a text string found in each record (str)
    Output:
    a list of stems

    Description:
    Function that creates stems (word tokens):
    1. replaces URLs with the 'url' string
    2. lower-cases the text and replaces punctuation marks with white space
    3. splits the text into a list of words and drops English stop words
    4. assigns a part of speech to every word
    5. reduces words to their root form, lemmatizing every word as a verb
    6. reduces words to their stems, which are not necessarily readable words
    """
    # regex pattern to identify a URL (raw string, so the backslashes reach the regex engine unchanged)
    url_regex = r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'
    # replace URLs with a 'url' string
    text = re.sub(url_regex, 'url', text)
    # text normalization - remove punctuation and lower case
    text = re.sub(r"[^a-zA-Z0-9]", " ", text.lower())
    # tokenize text into words and drop stop words (the stop-word set is built once, not per token)
    stop_words = set(stopwords.words("english"))
    words = [w for w in word_tokenize(text) if w not in stop_words]
    # assign a part of speech (POS) to every word - each item becomes a (word, tag) tuple;
    # the tags themselves are not used below
    words = pos_tag(words)
    # reduce words to their root form, lemmatizing every word as a verb
    lemmatizer = WordNetLemmatizer()
    lemmed = [lemmatizer.lemmatize(w[0], pos='v') for w in words]
    # reduce words to their stems - a stem is not necessarily a readable word
    stemmer = PorterStemmer()
    stemmed = [stemmer.stem(w) for w in lemmed]

    return stemmed

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package words to /root/nltk_data...
[nltk_data]   Unzipping corpora/words.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.


In [10]:
text = df.message[12409]
text

'Wind 16.0 mph NNE. Barometer 982.09 mb, gust 31.0, Temp 56.9 &amp;deg;F. Rain 0.00 in. Humidity 95% hurricane cam http://t.co/Sq2ekENuWind 16.0 mph NNE. Barometer 981.68 mb, gust 26.0, Temp 57.1 &amp;deg;F. Rain 0.00 in. Humidity 95% hurricane cam http://t.co/Sq2ekENu'

In [11]:
text = tokenize(text)
text

['wind',
 '16',
 '0',
 'mph',
 'nne',
 'baromet',
 '982',
 '09',
 'mb',
 'gust',
 '31',
 '0',
 'temp',
 '56',
 '9',
 'amp',
 'deg',
 'f',
 'rain',
 '0',
 '00',
 'humid',
 '95',
 'hurrican',
 'cam',
 'url',
 '16',
 '0',
 'mph',
 'nne',
 'baromet',
 '981',
 '68',
 'mb',
 'gust',
 '26',
 '0',
 'temp',
 '57',
 '1',
 'amp',
 'deg',
 'f',
 'rain',
 '0',
 '00',
 'humid',
 '95',
 'hurrican',
 'cam',
 'url']

### 3. Build a machine learning pipeline
This machine learning pipeline should take in the `message` column as input and output classification results for the other 36 categories in the dataset. You may find the [MultiOutputClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.multioutput.MultiOutputClassifier.html) helpful for predicting multiple target variables.

In [12]:
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report
from sklearn.multioutput import MultiOutputClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer, TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import GradientBoostingClassifier

In [13]:
pipeline = Pipeline([
        ('vect', CountVectorizer(tokenizer = tokenize)),
        ('tfidf', TfidfTransformer()),
        ('clf', MultiOutputClassifier(GradientBoostingClassifier(max_depth=6)))
    ])

In [25]:
# choose parameters
parameters = {'clf__estimator__n_estimators': [100, 140]}

# create grid search object
model = GridSearchCV(pipeline, param_grid=parameters, scoring='recall_micro', cv=4)
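
The parameter name `clf__estimator__n_estimators` reads from the outside in: the pipeline step `clf`, then the `estimator` wrapped by `MultiOutputClassifier`, then that estimator's `n_estimators`. The sketch below shows the same naming on a toy problem. The eight messages, the two labels, the small grid, and the swap to `RandomForestClassifier` (used here only because it trains quickly on a handful of rows) are all assumptions for illustration, not the notebook's configuration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import GridSearchCV
from sklearn.multioutput import MultiOutputClassifier
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus with two target columns: [water, food].
toy_docs = ['need water', 'need food', 'water please', 'food please',
            'send water now', 'send food now',
            'water shortage here', 'food shortage here']
toy_Y = np.array([[1, 0], [0, 1]] * 4)

toy_pipeline = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', MultiOutputClassifier(RandomForestClassifier(random_state=0))),
])

# step name -> wrapper parameter -> wrapped estimator's parameter
toy_grid = {'clf__estimator__n_estimators': [5, 10]}

toy_search = GridSearchCV(toy_pipeline, param_grid=toy_grid,
                          scoring='recall_micro', cv=2)
toy_search.fit(toy_docs, toy_Y)

print(toy_search.best_params_)
toy_pred = toy_search.predict(['water needed', 'food needed'])
print(toy_pred.shape)  # one row per message, one column per label
```

After fitting, the search object behaves like the best pipeline refit on all the training data, so `predict` can be called on it directly.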

### 4. Train pipeline
- Split data into train and test sets
- Train pipeline

In [14]:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2)

`CountVectorizer` (Bag of Words)

In [15]:
# initialize count vectorizer object
vect = CountVectorizer(tokenizer=tokenize)

In [21]:
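# get counts of each token (word) in text data
# note: `text` is the token list returned by tokenize() above, so each token is treated as a one-word document
X_count = vect.fit_transform(text)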

In [22]:
# convert sparse matrix to numpy array to view
X_count.toarray()

array([[0, 0, 0, ..., 0, 0, 1],
       [0, 0, 0, ..., 0, 0, 0],
       [1, 0, 0, ..., 0, 0, 0],
       ..., 
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 1, 0]], dtype=int64)

`TfidfTransformer`

In [18]:
# initialize tf-idf transformer object
transformer = TfidfTransformer(smooth_idf=False)

In [23]:
# use counts from count vectorizer results to compute tf-idf values
tfidf = transformer.fit_transform(X_count)

In [24]:
# convert sparse matrix to numpy array to view
tfidf.toarray()

array([[ 0.,  0.,  0., ...,  0.,  0.,  1.],
       [ 0.,  0.,  0., ...,  0.,  0.,  0.],
       [ 1.,  0.,  0., ...,  0.,  0.,  0.],
       ..., 
       [ 0.,  0.,  0., ...,  0.,  0.,  0.],
       [ 0.,  0.,  0., ...,  0.,  0.,  0.],
       [ 0.,  0.,  0., ...,  0.,  1.,  0.]])

`TfidfVectorizer` = `CountVectorizer` + `TfidfTransformer`
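
This equivalence can be checked numerically. The sketch below uses three made-up sentences and the default settings of all three classes; the match holds only when both routes use the same options, so a transformer created with `smooth_idf=False` or a vectorizer given a custom `tokenizer` would need the same arguments on the other side.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer, TfidfTransformer, TfidfVectorizer)

# Hypothetical sample documents.
sample_docs = ['need water and food', 'water is needed', 'food and shelter']

# Route 1: raw counts first, then tf-idf weighting.
counts = CountVectorizer().fit_transform(sample_docs)
two_step = TfidfTransformer().fit_transform(counts)

# Route 2: both steps in a single object.
one_step = TfidfVectorizer().fit_transform(sample_docs)

same = np.allclose(two_step.toarray(), one_step.toarray())
print(same)
# True
```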

In [25]:
from sklearn.feature_extraction.text import TfidfVectorizer
# initialize tf-idf vectorizer object
vectorizer = TfidfVectorizer()

In [26]:
# compute bag of word counts and tf-idf values
X_vect = vectorizer.fit_transform(X_train)

In [27]:
# convert sparse matrix to numpy array to view
X_vect.toarray()

array([[ 0.,  0.,  0., ...,  0.,  0.,  0.],
       [ 0.,  0.,  0., ...,  0.,  0.,  0.],
       [ 0.,  0.,  0., ...,  0.,  0.,  0.],
       ..., 
       [ 0.,  0.,  0., ...,  0.,  0.,  0.],
       [ 0.,  0.,  0., ...,  0.,  0.,  0.],
       [ 0.,  0.,  0., ...,  0.,  0.,  0.]])

In [28]:
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier()
clf.fit(X_vect, Y_train)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)

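In [29]:
# reuse the vocabulary learned from X_train: transform, do not refit.
# Refitting with vectorizer.fit_transform(X_test) builds a different vocabulary and raised:
# ValueError: Number of features of the model must match the input. Model n_features is 30475 and input n_features is 14342
X_test_vect = vectorizer.transform(X_test)
clf.predict(X_test_vect)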
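
The feature-count mismatch reported for this cell comes from fitting the vectorizer a second time: each call to `fit_transform` learns a new vocabulary, so the test matrix ends up with a different number of columns than the model was trained on. A minimal sketch with made-up sentences:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical documents standing in for X_train and X_test.
train_docs = ['need water now', 'food and shelter needed', 'road is blocked']
test_docs = ['water and food']

vec = TfidfVectorizer()
train_matrix = vec.fit_transform(train_docs)   # learns the vocabulary from the training data
test_matrix = vec.transform(test_docs)         # reuses that vocabulary

# Fitting a fresh vectorizer on the test data yields a separate, smaller vocabulary.
refit_matrix = TfidfVectorizer().fit_transform(test_docs)

print(train_matrix.shape[1], test_matrix.shape[1], refit_matrix.shape[1])
# 10 10 3
```

Wrapping the vectorizer and the classifier in a single `Pipeline`, as in step 3, avoids this class of mistake because the pipeline only calls `transform` at prediction time.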

### 5. Test your model
Report the f1 score, precision and recall for each output category of the dataset. You can do this by iterating through the columns and calling sklearn's `classification_report` on each.
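
One possible way to iterate through the columns is sketched below. The helper name `report_per_category` and the toy labels are assumptions; in the notebook the inputs would be `Y_test` and the array returned by the fitted model's `predict`. The `zero_division` argument assumes a reasonably recent scikit-learn release.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import classification_report

def report_per_category(Y_true, Y_pred):
    """Return {category name: classification_report text} for each target column."""
    Y_pred = np.asarray(Y_pred)
    reports = {}
    for i, column in enumerate(Y_true.columns):
        # Column i of the predictions corresponds to column i of the targets.
        reports[column] = classification_report(
            Y_true[column], Y_pred[:, i], zero_division=0)
    return reports

# Hypothetical true labels and predictions for two categories.
Y_true_toy = pd.DataFrame({'water': [1, 0, 1, 0], 'food': [0, 1, 1, 0]})
Y_pred_toy = np.array([[1, 0], [0, 1], [0, 1], [0, 0]])

reports = report_per_category(Y_true_toy, Y_pred_toy)
for name, text in reports.items():
    print(name)
    print(text)
```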

### 6. Improve your model
Use grid search to find better parameters. 

In [None]:
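# reuse the grid defined in step 3
parameters = {'clf__estimator__n_estimators': [100, 140]}

cv = GridSearchCV(pipeline, param_grid=parameters, scoring='recall_micro', cv=4)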

### 7. Test your model
Show the accuracy, precision, and recall of the tuned model.

Since this project focuses on code quality, process, and pipelines, there is no minimum performance metric needed to pass. However, make sure to fine-tune your models for accuracy, precision, and recall to make your project stand out, especially for your portfolio!

### 8. Try improving your model further. Here are a few ideas:
* try other machine learning algorithms
* add other features besides the TF-IDF
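
As one illustration of the second idea, an extra feature can be placed next to TF-IDF with `FeatureUnion`. The sketch below adds the character length of each message; the class name `TextLengthExtractor`, the choice of feature, and the two sample messages are assumptions for illustration, not part of the project template.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import FeatureUnion

class TextLengthExtractor(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: one column holding each message's length."""

    def fit(self, X, y=None):
        return self  # nothing to learn

    def transform(self, X):
        return np.array([[len(text)] for text in X], dtype=float)

features = FeatureUnion([
    ('tfidf', TfidfVectorizer()),
    ('length', TextLengthExtractor()),
])

sample_docs = ['need water', 'food and shelter needed urgently']
combined = features.fit_transform(sample_docs)

# Seven vocabulary columns from TF-IDF plus one length column.
print(combined.shape)
# (2, 8)
```

In a full pipeline the `FeatureUnion` would take the place of the vectorizer steps, with the classifier as the next step. Because the length column is on a very different scale from TF-IDF weights, scaling it may be worth considering for scale-sensitive models.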

### 9. Export your model as a pickle file
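
A minimal sketch of the export step with the standard `pickle` module follows. The tiny stand-in model, the file name `classifier.pkl`, and the temporary directory are assumptions; in the notebook the object to save would be the fitted pipeline or grid search.

```python
import os
import pickle
import tempfile

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Hypothetical stand-in for the trained model.
small_model = Pipeline([('tfidf', TfidfVectorizer()), ('clf', LogisticRegression())])
small_model.fit(['need water', 'need food', 'water please', 'food please'],
                [1, 0, 1, 0])

# Write the fitted model to disk (hypothetical file name, temporary directory).
model_path = os.path.join(tempfile.mkdtemp(), 'classifier.pkl')
with open(model_path, 'wb') as f:
    pickle.dump(small_model, f)

# Load it back and confirm that it predicts the same as the original.
with open(model_path, 'rb') as f:
    restored = pickle.load(f)

same_predictions = bool(
    (restored.predict(['water']) == small_model.predict(['water'])).all())
print(same_predictions)
```

Pickle stores a custom function such as `tokenize` by reference rather than by value, so the code that loads the model must be able to import that function from the same module path.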

### 10. Use this notebook to complete `train.py`
Use the template file attached in the Resources folder to write a script that runs the steps above to create a database and export a model based on a new dataset specified by the user.