## What is supervised learning?
- Form of machine learning
- Problem has predefined training data
- This data has a label (or outcome) you want the model to learn
- Example classification problem: the iris flower dataset
- Goal: Make good hypotheses about the species based on geometric features
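
As a minimal sketch of this setup, the following uses the iris dataset bundled with scikit-learn: the four geometric measurements are the features and the species is the label. The classifier (`GaussianNB`) and the split parameters are assumptions chosen for illustration, not something these notes prescribe.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Features: sepal/petal length and width; labels: species encoded as 0, 1, 2
iris = load_iris()
X, y = iris.data, iris.target

# Hold out a third of the flowers to check the learned hypotheses
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=53)

# Assumed classifier for illustration; any scikit-learn classifier would fit here
model = GaussianNB()
model.fit(X_train, y_train)

# Fraction of held-out flowers whose species was predicted correctly
iris_accuracy = model.score(X_test, y_test)
print(iris_accuracy)
```

The labels are known in advance for every training example, which is what makes the problem supervised.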


## Supervised learning with NLP

- Need to use **language** instead of geometric features
- scikit-learn: Powerful open-source library
- How to create supervised learning data from text?
- Use bag-of-words models or tf-idf as features
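
To make the two feature types concrete, here is a small sketch in plain Python. The toy documents and the helper `tf_idf` are hypothetical, and the formula is the basic textbook form; library implementations such as scikit-learn's `TfidfVectorizer` use a smoothed variant, so their numbers will differ.

```python
import math
from collections import Counter

# Hypothetical toy documents, already lowercased; tokens are split on whitespace
docs = [
    "the spaceship lands on the alien planet",
    "the detective chases the thief",
    "the alien attacks the spaceship",
]
tokenized = [doc.split() for doc in docs]

# Bag of words: one token -> count mapping per document
bows = [Counter(tokens) for tokens in tokenized]

# Document frequency: number of documents that contain each token
doc_freq = Counter(token for tokens in tokenized for token in set(tokens))

def tf_idf(bow, n_docs):
    """Weight each relative count by log(n_docs / document frequency)."""
    total = sum(bow.values())
    return {
        token: (count / total) * math.log(n_docs / doc_freq[token])
        for token, count in bow.items()
    }

weights = [tf_idf(bow, len(docs)) for bow in bows]
print(bows[0]["the"])        # raw count of "the" in the first document: 2
print(weights[0]["the"])     # 0.0, because "the" appears in every document
print(weights[0]["planet"])  # positive, because "planet" appears in only one
```

Either representation gives each document a numeric value per token, which is the form a supervised model needs.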


## IMDB Movie Dataset


![](https://i.imgur.com/tdlKkz1.png)

- Goal: Predict movie genre based on plot summary
- Categorical features generated using preprocessing


## Supervised learning steps
- Collect and preprocess our data
- Determine a label (Example: Movie genre)
- Split data into training and test sets
- Extract features from the text to help predict the label
    - Bag-of-words vector built into scikit-learn
- Evaluate trained model using the test set
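
The steps above can be sketched end to end on a tiny, hand-written dataset. The plots and labels below are hypothetical, and the classifier (`MultinomialNB`) is an assumed choice for illustration; with so few examples the resulting score is not meaningful, only the workflow is.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics

# Collected data with a label per plot (1 = sci-fi, 0 = action); all made up
plots = [
    "an alien spaceship lands on earth",
    "a robot travels through space and time",
    "astronauts find an alien planet",
    "a spaceship crew fights a rogue robot",
    "a cop chases a gang through the city",
    "a soldier fights to rescue hostages",
    "a spy escapes an explosion in the city",
    "a boxer fights for the title",
]
genres = [1, 1, 1, 1, 0, 0, 0, 0]

# Split into training and test sets, keeping both genres in each part
X_train, X_test, y_train, y_test = train_test_split(
    plots, genres, test_size=0.25, random_state=0, stratify=genres)

# Extract bag-of-words features; the vocabulary comes from the training text only
vectorizer = CountVectorizer(stop_words="english")
train_counts = vectorizer.fit_transform(X_train)
test_counts = vectorizer.transform(X_test)

# Train a classifier on the training features
classifier = MultinomialNB()
classifier.fit(train_counts, y_train)

# Evaluate the trained model on the held-out test set
predictions = classifier.predict(test_counts)
toy_accuracy = metrics.accuracy_score(y_test, predictions)
print(toy_accuracy)
```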

## Let's Practice

# Building word count vectors with scikit-learn


## Predicting movie genre
- Dataset consisting of movie plots and corresponding genre
- Goal: Create bag-of-words vectors for the movie plots
    - Can we predict genre based on the words used in the plot summary?
    
    
## CountVectorizer with Python

```python
# Import the required libraries
In [1]: import pandas as pd
In [2]: from sklearn.model_selection import train_test_split
In [3]: from sklearn.feature_extraction.text import CountVectorizer

# Load the dataset into a DataFrame
In [4]: df = ... # Load data into DataFrame

# Create y: the labels (outcomes) we want the model to learn
In [5]: y = df['Sci-Fi']  # 1 if sci-fi, 0 if action

# Split the features (plots) and labels into training and test sets.
# test_size is the fraction held out for testing (here 33% of the data).
# random_state seeds the shuffle so the split can be reproduced.
In [6]: X_train, X_test, y_train, y_test = train_test_split(
            df['plot'], y,
            test_size=0.33,
            random_state=53)

# CountVectorizer turns text into bag-of-words vectors (similar to a gensim corpus)
# and removes English stop words from the plot summaries as a preprocessing step.
# Each token acts as a feature for the classification problem, like the flower
# measurements in the iris dataset.
In [7]: count_vectorizer = CountVectorizer(stop_words='english')

# fit_transform learns the vocabulary from the training data and vectorizes it;
# transform reuses that vocabulary to vectorize the test data
In [8]: count_train = count_vectorizer.fit_transform(X_train.values)
In [9]: count_test = count_vectorizer.transform(X_test.values)
```
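
To see what `fit_transform` and `transform` produce, here is a small sketch on made-up plot snippets (the variable names and texts are hypothetical stand-ins for `X_train` and `X_test`). It illustrates that the vocabulary is learned from the training text only, so words that appear only in the test text are ignored.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical plot snippets standing in for X_train and X_test
train_plots = ["The spaceship lands on the alien planet",
               "The detective chases the thief"]
test_plots = ["An alien detective boards a submarine"]

vectorizer = CountVectorizer(stop_words="english")
train_matrix = vectorizer.fit_transform(train_plots)  # learns the vocabulary
test_matrix = vectorizer.transform(test_plots)        # reuses that vocabulary

# Vocabulary: token -> column index, learned from the training text only
vocab = vectorizer.vocabulary_
print(sorted(vocab))

# Rows are documents, columns are vocabulary tokens, values are counts
print(train_matrix.toarray())
print(test_matrix.toarray())
```

Here "submarine" and "boards" never occur in the training plots, so they have no column and contribute nothing to the test vector; only "alien" and "detective" are counted.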

## Let's Practice

# Training and testing a classification model with scikit-learn

## Naive Bayes classifier
- Naive Bayes Model
    - Commonly used for testing NLP classification problems
    - Basis in probability
- Given a particular piece of data, how likely is a particular outcome?
- Examples:
    - If the plot has a spaceship, how likely is it to be sci-fi?
    - Given a spaceship and an alien, how likely is it now to be sci-fi?
- Each word from CountVectorizer acts as a feature
- Naive Bayes: Simple and effective
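
The spaceship and alien questions can be worked through by hand. This is a minimal sketch of a multinomial Naive Bayes calculation with add-one smoothing in plain Python; the training plots, labels, and the `posterior` helper are hypothetical, and scikit-learn's implementation differs in its details.

```python
import math
from collections import Counter

# Hypothetical training plots, already tokenized, with genre labels
training = [
    (["spaceship", "alien", "planet"], "sci-fi"),
    (["alien", "invasion", "spaceship"], "sci-fi"),
    (["cop", "chase", "explosion"], "action"),
    (["spaceship", "chase", "explosion"], "action"),
]

labels = sorted({label for _, label in training})
vocab = sorted({tok for toks, _ in training for tok in toks})
word_counts = {label: Counter() for label in labels}
doc_counts = Counter()
for tokens, label in training:
    word_counts[label].update(tokens)
    doc_counts[label] += 1

def posterior(tokens):
    """Return P(label | tokens) for each label, using add-one smoothing."""
    log_scores = {}
    for label in labels:
        total = sum(word_counts[label].values())
        # Prior: share of training plots with this label
        score = math.log(doc_counts[label] / len(training))
        for tok in tokens:
            # Smoothed likelihood of the word given the label
            score += math.log((word_counts[label][tok] + 1) / (total + len(vocab)))
        log_scores[label] = score
    # Normalize so the probabilities over labels sum to 1
    norm = sum(math.exp(s) for s in log_scores.values())
    return {label: math.exp(s) / norm for label, s in log_scores.items()}

p_one = posterior(["spaceship"])
p_two = posterior(["spaceship", "alien"])
print(p_one["sci-fi"])  # 0.6
print(p_two["sci-fi"])  # about 0.82
```

With this toy data, a spaceship alone makes sci-fi somewhat more likely than action, and adding an alien raises that probability further, because each word contributes its own evidence independently.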

## Naive Bayes with scikit-learn

```python
In [10]: from sklearn.naive_bayes import MultinomialNB
In [11]: from sklearn import metrics
In [12]: nb_classifier = MultinomialNB()
In [13]: nb_classifier.fit(count_train, y_train)
In [14]: pred = nb_classifier.predict(count_test)
In [15]: metrics.accuracy_score(y_test, pred)
```

## Confusion Matrix

```python
In [16]: metrics.confusion_matrix(y_test, pred, labels=[0,1])
Out[16]:
array([[6410,  563],
       [ 864, 2242]])
```


![](https://i.imgur.com/yJowLkX.png)
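
As a sketch of how to read these counts, the snippet below derives accuracy, precision, and recall from the confusion matrix shown above. It assumes scikit-learn's convention that rows are true labels and columns are predicted labels, in the order `labels=[0, 1]`, with 1 meaning sci-fi; the variable names are illustrative.

```python
# Counts copied from the confusion matrix above
matrix = [[6410, 563],
          [864, 2242]]

# Treating label 1 (sci-fi) as the positive class
(tn, fp), (fn, tp) = matrix
total = tn + fp + fn + tp

accuracy = (tn + tp) / total   # share of all plots classified correctly
precision = tp / (tp + fp)     # of plots predicted sci-fi, share that are sci-fi
recall = tp / (tp + fn)        # of true sci-fi plots, share that were found

print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f}")
# prints: accuracy=0.858 precision=0.799 recall=0.722
```

The diagonal entries (6410 and 2242) are the correct predictions; the off-diagonal entries show which genre the model confuses for the other.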



## Let's Practice

# Simple NLP, Complex Problems

## Translation
![](https://i.imgur.com/PfopGKE.png)

(source: https://twitter.com/Lupintweets/status/865533182455685121)

## Sentiment Analysis

![](https://i.imgur.com/3NDgJQg.png)

(source: https://nlp.stanford.edu/projects/socialsent/)

## Language Biases

![](https://i.imgur.com/sCT6oSV.png)

(related talk: https://www.youtube.com/watch?v=j7FwpZB1hWc)


## Let's Practice