## Model Considerations

In [2]:
import pandas as pd

In [3]:
import os
os.path.abspath('data/Sentiment-Analysis-Dataset.zip')

'/home/jovyan/datacourse/extra_notebooks/data/Sentiment-Analysis-Dataset.zip'

In [4]:
df = pd.read_csv('/home/jovyan/datacourse/extra_notebooks/data/Sentiment-Analysis-Dataset.zip',
                 compression='zip', error_bad_lines=False)
# only 2 rows were bad!
# (error_bad_lines was removed in pandas 2.0; on newer versions use on_bad_lines='warn')

b'Skipping line 8836: expected 4 fields, saw 5\n'
b'Skipping line 535882: expected 4 fields, saw 7\n'


In [5]:
df['SentimentText'].head(50)
# Lots of slang, plus HTML entities ('&lt;--') and handles ('@ginaa') that we should remove. Tricky...

0                          is so sad for my APL frie...
1                        I missed the New Moon trail...
2                               omg its already 7:30 :O
3               .. Omgaga. Im sooo  im gunna CRy. I'...
4              i think mi bf is cheating on me!!!   ...
5                     or i just worry too much?        
6                    Juuuuuuuuuuuuuuuuussssst Chillin!!
7            Sunny Again        Work Tomorrow  :-|  ...
8           handed in my uniform today . i miss you ...
9              hmmmm.... i wonder how she my number @-)
10                        I must think about positive..
11          thanks to all the haters up in my face a...
12                       this weekend has sucked so far
13               jb isnt showing in australia any more!
14                                 ok thats it you win.
15        &lt;-------- This is the way i feel right ...
16        awhhe man.... I'm completely useless rt no...
17        Feeling strangely fine. Now I'm gonna 

## Model Development

In [6]:
from sklearn.feature_extraction.text import HashingVectorizer

#### Could use `TweetTokenizer` from the `nltk` package, but it has to be installed first.

In [7]:
pip install --user -U nltk


The following command must be run outside of the IPython shell:

    $ pip install --user -U nltk

The Python package manager (pip) can only be used from outside of IPython.
Please reissue the `pip` command in a separate terminal or command prompt.

See the Python documentation for more information on how to install packages:

    https://docs.python.org/3/installing/


In [8]:
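from nltk import TweetTokenizer
# nltk is not installed here (the pip cell above failed inside IPython),
# so the TweetTokenizer cells below were not run; just followed the videos instead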

ModuleNotFoundError: No module named 'nltk'

In [None]:
tweet_tokenizer = TweetTokenizer(preserve_case=False, strip_handles=True)

In [None]:
tweet_tokenizer.tokenize(df.iloc[0, -1])

#### Or use our own preprocessor instead of `TweetTokenizer`:

In [9]:

from html import unescape

def preprocessor(doc):
    return unescape(doc).lower()

In [10]:
unescape('&lt')

'<'
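
The earlier look at the data noted that handles such as `@ginaa` should be removed, which the `preprocessor` above does not do. A minimal sketch of one way to add that step with the standard library only; the name `preprocess_tweet` and the handle regex are illustrative assumptions, not part of the notebook:

```python
import re
from html import unescape

# Hypothetical pattern: '@' followed by word characters, but only when the
# '@' is not preceded by a word character (so 'a@b.com' is left alone).
HANDLE_RE = re.compile(r'(?<!\w)@\w+')

def preprocess_tweet(doc):
    """Unescape HTML entities, drop @handles, lowercase, collapse whitespace."""
    text = unescape(doc)
    text = HANDLE_RE.sub('', text)
    return ' '.join(text.lower().split())

print(preprocess_tweet('@ginaa &lt;-- This is the way I feel'))
```

A function like this could be passed as `preprocessor=` to the vectorizer in place of `preprocessor`; whether stripping handles actually helps the score would need to be checked on the data.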

In [11]:
import spacy
from spacy.lang.en import STOP_WORDS

In [12]:
# 'en' is a spaCy 2.x shortcut; spaCy 3 needs a full package name such as 'en_core_web_sm'
nlp = spacy.load('en', disable=['ner', 'parser', 'tagger'])

In [13]:
def lemmatizer(doc):
    return [word.lemma_ for word in nlp(doc)]

In [14]:
STOP_WORDS_lemma = [word.lemma_ for word in nlp(' '.join(list(STOP_WORDS)))]
STOP_WORDS_lemma = set(STOP_WORDS_lemma).union({',', '.', ';'})

In [15]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split

In [16]:
def construct_model():
    vectorizer = HashingVectorizer(preprocessor=preprocessor,
                            # tokenizer=lemmatizer,
                            alternate_sign=False,
                            # ngram_range=(1, 2),
                            stop_words=STOP_WORDS_lemma) # kernel dies even without removing:(
    clf = MultinomialNB()
    pipe = Pipeline([
        ('vectorizer', vectorizer), ('classifier', clf)
    ])
    
    return pipe

In [17]:
X = df['SentimentText']
y = df['Sentiment']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

In [18]:
pipe = construct_model()
pipe.fit(X_train, y_train)

Pipeline(memory=None,
         steps=[('vectorizer',
                 HashingVectorizer(alternate_sign=False, analyzer='word',
                                   binary=False, decode_error='strict',
                                   dtype=<class 'numpy.float64'>,
                                   encoding='utf-8', input='content',
                                   lowercase=True, n_features=1048576,
                                   ngram_range=(1, 1), norm='l2',
                                   preprocessor=<function preprocessor at 0x7f2f35ca0400>,
                                   stop_words={',', '.', '3', ';', 'a', 'about',
                                               'a...',
                                               'afterwards', 'again', 'against',
                                               'all', 'almost', 'alone',
                                               'along', 'already', 'also',
                                               'although', 'always', 'am

In [19]:
pipe.score(X_train, y_train) #0.81

0.8059869077963305

In [20]:
pipe.score(X_test, y_test) # 0.77!

0.769585997852548

### Persisting Model
We want to use the model without retraining it, so we serialize it to a pickle/dill file and dump it to disk to load later. The classifier can also be updated with new data via `partial_fit`.
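
Updating a model with `partial_fit` could look roughly like the sketch below. It relies on two facts: `HashingVectorizer` is stateless, so it can transform new text without being refit, and `MultinomialNB` supports `partial_fit`. scikit-learn's `Pipeline` does not expose `partial_fit` itself, so the two steps are called separately here. The toy sentences and labels are made up for illustration and stand in for the tweet data:

```python
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy data standing in for the tweet dataset (made up for illustration).
first_batch = ['i love this', 'this is awful']
first_labels = [1, 0]
new_batch = ['love it so much', 'awful awful day']
new_labels = [1, 0]

# HashingVectorizer is stateless, so transform() needs no prior fit.
vectorizer = HashingVectorizer(alternate_sign=False)
clf = MultinomialNB()

# The full set of classes must be given on the first partial_fit call.
clf.partial_fit(vectorizer.transform(first_batch), first_labels, classes=[0, 1])

# Later batches update the same classifier without retraining from scratch.
clf.partial_fit(vectorizer.transform(new_batch), new_labels)

pred = clf.predict(vectorizer.transform(['love love love']))
print(pred)
```

With the fitted pipeline from this notebook, the same idea would go through `pipe.named_steps['vectorizer']` and `pipe.named_steps['classifier']`.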

In [24]:
import gzip
import dill

In [25]:
def serialize_model():
    model = construct_model()
    model.fit(X_train, y_train)  # fit before saving so the stored model is ready to use
    
    with gzip.open('sentiment_model.dill.gz', 'wb') as f: # write mode
        dill.dump(model, f, recurse=True)

In [27]:
!ls -alh sentiment_model.dill.gz


In [28]:
with gzip.open('sentiment_model.dill.gz', 'rb') as f: # read mode
    model = dill.load(f)

In [29]:
model.score(X_test, y_test) # 0.77 as before!

0.769585997852548

**Note:** The functions `construct_model` and `serialize_model` encapsulate the above process (done).

**Next step:** The ML part is done. Next, load the dill file to use the model in a web application (a Python Flask app).

## Flask App Local Development Summary

### GET Requests
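
A minimal sketch of serving predictions over a GET request, assuming Flask is installed. The route name `/predict`, the `text` query parameter, and the `StubModel` class are hypothetical; the stub only exists so the example runs on its own, and a real app would instead load the pipeline from `sentiment_model.dill.gz` with `gzip` and `dill`:

```python
from flask import Flask, jsonify, request

app = Flask(__name__)


class StubModel:
    """Stand-in for the saved pipeline; not the real sentiment model."""

    def predict(self, docs):
        # Toy rule for illustration only: 1 if the text contains 'good'.
        return [1 if 'good' in doc.lower() else 0 for doc in docs]


model = StubModel()


@app.route('/predict', methods=['GET'])  # hypothetical route name
def predict():
    # Read the text from the query string, e.g. /predict?text=a+good+day
    text = request.args.get('text', '')
    label = int(model.predict([text])[0])
    return jsonify({'text': text, 'sentiment': label})


# Exercise the route with Flask's test client, without starting a server.
client = app.test_client()
response = client.get('/predict', query_string={'text': 'a good day'})
print(response.get_json())
```

Loading the model once at import time, rather than inside the view function, avoids reading the file on every request.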

### Using Model with Twitter API

### POST Requests

### Deployment to Web via Heroku
Heroku is a cloud platform for hosting web apps.

In [31]:
pip install Flask


The following command must be run outside of the IPython shell:

    $ pip install Flask

The Python package manager (pip) can only be used from outside of IPython.
Please reissue the `pip` command in a separate terminal or command prompt.

See the Python documentation for more information on how to install packages:

    https://docs.python.org/3/installing/
