1\. Classifying fake news using supervised learning with NLP
------------------------------------------------------------

00:00 - 00:13

In this video, we'll be learning about supervised machine learning with NLP. Throughout this chapter, you'll apply these skills and ideas to classifying fake news.

2\. What is supervised learning?
--------------------------------

00:13 - 00:59

Supervised learning is a form of machine learning where you are given, or create, training data. This data has a label or outcome which you want the model or algorithm to learn. One common introductory machine learning example is Fisher's iris data; we have a few example rows of it here. The data has several features: sepal length and width, and petal length and width. The label we want to learn and predict is the species. This is a classification problem, so you want to be able to classify or categorize some data based on what you already know or have learned. Our goal is to use the dataset to make intelligent hypotheses about the species based on the geometric features.
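
As a rough illustration of features and labels, the sketch below loads the copy of the iris data bundled with scikit-learn. The variable names `X`, `y`, and `species` are our own choices rather than anything shown in the video.

```python
from sklearn.datasets import load_iris

# Load the iris data that ships with scikit-learn (no download needed).
iris = load_iris()

X = iris.data                 # features: one row per flower, four measurements
y = iris.target               # labels: species encoded as 0, 1, or 2
species = iris.target_names   # maps the integer codes back to species names

print(iris.feature_names)     # names of the sepal and petal measurements
print(X[:3])                  # first three feature rows
print(species[y[:3]])         # species labels for those rows
```

Each row of `X` is what the model sees, and the matching entry of `y` is what it is asked to predict.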

3\. Supervised learning with NLP
--------------------------------

00:59 - 01:20

But instead of geometric features like those in the iris dataset, we need to use language. To help create features and train a model, we will use scikit-learn, a powerful open-source library. One of the ways you can create supervised learning data from text is by using bag-of-words models or tf-idf as features.

4\. IMDB Movie Dataset
----------------------

01:20 - 02:09

Let's say I have a dataset full of movie plots and genres from the IMDB database, as shown in this chart. I've separated the action and sci-fi movies, removing any movies labeled both action and sci-fi. I want to predict whether a movie is action or sci-fi based on the plot summary. The dataset we've extracted has categorical features generated using some preprocessing. We can see the plot summary alongside the Sci-Fi and Action columns. The Sci-Fi column is 1 for movies that are sci-fi and 0 for movies that are action, and the Action column is the inverse of the Sci-Fi column.
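
The layout described here can be sketched with a tiny, invented DataFrame. The plots, the `Genre` helper column, and the exact column names are assumptions for illustration, not the real IMDB extract.

```python
import pandas as pd

# Invented rows; the real dataset is not reproduced here.
movies = pd.DataFrame({
    "Plot": [
        "A crew travels to a distant planet.",
        "A cop chases a gang across the city.",
        "An android questions its creators.",
    ],
    "Genre": ["sci-fi", "action", "sci-fi"],  # hypothetical helper column
})

# Binary label columns: Sci-Fi is 1 for sci-fi movies, Action is its inverse.
movies["Sci-Fi"] = (movies["Genre"] == "sci-fi").astype(int)
movies["Action"] = 1 - movies["Sci-Fi"]

print(movies[["Plot", "Sci-Fi", "Action"]])
```

Because the two columns are inverses, either one alone is enough to serve as the label.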

5\. Supervised learning steps
-----------------------------

02:09 - 03:08

In the next video, we'll use scikit-learn to predict a movie's genre from its plot. But first, let's review the supervised learning process as a whole. To begin, we collect and preprocess our data. Then, we determine a label: this is what we want the model to learn, in our case, the genre of the movie. Next, we split our data into training and testing datasets, keeping them separate so we can build our model using only the training data. The test data remains unseen so we can test how well our model performs after it is trained. This is an essential part of supervised learning! We also need to extract features from the text to predict the label. We will use a bag-of-words vectorizer built into scikit-learn to do so. After the model is trained, we can then test it using the test dataset. There are also other methods to evaluate model performance, such as k-fold cross-validation; you can check out DataCamp's Machine Learning curriculum to learn that and more!
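
The steps above can be sketched end to end on a toy dataset. The plots and labels below are invented, and `MultinomialNB` is only an assumed stand-in classifier, since this video does not name a model; with so little data the accuracy figure is not meaningful.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# 1. Collect data: invented plot summaries.
plots = [
    "A starship crew battles aliens near a distant planet",
    "Scientists build a robot that travels through time",
    "Colonists on Mars discover an alien signal",
    "An android escapes a laboratory on a space station",
    "A cop chases armed robbers through the city",
    "A soldier fights to rescue hostages from terrorists",
    "A driver outruns gangsters in a stolen car",
    "A boxer takes revenge on a crime boss",
]

# 2. Determine the label: 1 = sci-fi, 0 = action.
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# 3. Split into training and test data; the test rows stay unseen in training.
X_train, X_test, y_train, y_test = train_test_split(
    plots, labels, test_size=0.25, random_state=0, stratify=labels
)

# 4. Extract bag-of-words features, fitting the vocabulary on training data only.
vectorizer = CountVectorizer(stop_words="english")
count_train = vectorizer.fit_transform(X_train)
count_test = vectorizer.transform(X_test)

# 5. Train the model (an assumed choice of classifier).
model = MultinomialNB()
model.fit(count_train, y_train)

# 6. Test it on the held-out data.
predictions = model.predict(count_test)
accuracy = accuracy_score(y_test, predictions)
print(accuracy)
```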

6\. Let's practice!
-------------------

03:08 - 03:16

Let's review some of the supervised learning steps, like splitting the data into training and testing sets, before applying them to our movie plot data.

Which possible features?
========================

Which of the following are possible features for a text classification problem?

##### Answer the question

#### Possible Answers

Select one answer

-   Number of words in a document.

-   Specific named entities.

-   Language.

-   All of the above.

Training and testing
====================

What datasets are needed for supervised learning?

##### Answer the question

#### Possible Answers

Select one answer

-   Training data.

-   Testing data.

-   Both training and testing data.

-   A label or outcome.

1\. Building word count vectors with scikit-learn
-------------------------------------------------

00:00 - 00:07

In this video, we'll build our first scikit-learn vectors from the movie plot and genre dataset.

2\. Predicting movie genre
--------------------------

00:07 - 00:22

We have a dataset full of movie plots and what genre the movie is -- either action or sci-fi. We want to create bag of words vectors for these movie plots to see if we can predict the genre based on the words used in the plot summary.

3\. Count Vectorizer with Python
--------------------------------

00:22 - 03:28

To do so, we first need to import some necessary tools from scikit-learn. Once the data is loaded, we can create y, which traditionally refers to the labels or outcome you want the model to learn. We can use the Sci-Fi column, which has 1 if the movie is sci-fi and 0 if it is action. Then, scikit-learn's train_test_split function can be used to split the DataFrame into training and testing data. This function splits the features, here the plot summary in the column PLOT, and the labels (y) based on a given test_size such as 0.33, representing 33 percent. I have also set random_state so we have a repeatable result; it operates similarly to setting a random seed and ensures I get the same split when I run the code again. The function marks 33% of the rows as test data and removes them from the training data. The test data is later used to see what my model has learned. The results of train_test_split are the training data (X_train), the training labels (y_train), the testing data (X_test), and the testing labels (y_test).

Next, we create a CountVectorizer, which turns my text into bag-of-words vectors similar to a Gensim corpus; it will also remove English stop words from the movie plot summaries as a preprocessing step. Each token now acts as a feature for the machine learning classification problem, just like the flower measurements in the iris dataset. We can then call fit_transform on the training data to create the bag-of-words vectors. fit_transform is a handy shortcut that calls the model's fit and then transform methods, which here generates a mapping of words to IDs and vectors representing how many times each word appears in the plot. fit_transform operates differently for each model, but generally fit will find parameters or norms in the data and transform will apply the model's underlying algorithm or approximation, similar to preprocessing but with a specific use case in mind.

For the CountVectorizer class, fit_transform will create the bag-of-words dictionary and vectors for each document using the training data. After calling fit_transform on the training data, we call transform on the test data to create bag-of-words vectors using the same dictionary. The training and test vectors need to use a consistent set of words, so the trained model can understand the test input. If we don't have much data, there can be an issue with words in the test set which don't appear in the training data. CountVectorizer's transform does not raise an error for such words; it silently ignores them, so the model never sees them. If too many test words are dropped this way, you will need to add more training data. In only a few lines of Python, we have transformed text into bag-of-words vectors and generated test and training datasets. Scikit-learn is a great aid in helping make NLP machine learning simple and accessible.
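
The fit-on-train, transform-on-test pattern can be checked on a few invented plots. In this sketch, `train_plots` and `test_plots` are hypothetical, and the test plot deliberately contains a word ("robot") that never appears in the training data.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Invented training and test plots.
train_plots = [
    "A spaceship crew explores a distant planet",
    "A detective chases a bank robber",
]
test_plots = ["A robot explores the planet"]

vectorizer = CountVectorizer(stop_words="english")

# fit_transform learns the vocabulary from the training data and counts words.
count_train = vectorizer.fit_transform(train_plots)

# transform reuses that vocabulary; it never adds new words.
count_test = vectorizer.transform(test_plots)

print(count_train.shape, count_test.shape)   # same number of columns
print("robot" in vectorizer.vocabulary_)     # unseen word has no column
print(count_test.toarray())                  # only known words are counted
```

The test vector has the same columns as the training vectors, and the unseen word is dropped without an error, so only "explores" and "planet" contribute counts.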

4\. Let's practice!
-------------------

03:28 - 03:34

Now it's your turn to create some scikit-learn vectors!

CountVectorizer for text classification
=======================================

It's time to begin building your text classifier! The [data](https://s3.amazonaws.com/assets.datacamp.com/production/course_3629/fake_or_real_news.csv) has been loaded into a DataFrame called `df`. Explore it in the IPython Shell to investigate what columns you can use. The `.head()` method is particularly informative.

In this exercise, you'll use `pandas` alongside scikit-learn to create a sparse text vectorizer you can use to train and test a simple supervised model. To begin, you'll set up a `CountVectorizer` and investigate some of its features.

Instructions
------------

-   Import `CountVectorizer` from `sklearn.feature_extraction.text` and `train_test_split` from `sklearn.model_selection`.
-   Create a Series `y` to use for the labels by assigning the `.label` attribute of `df` to `y`.
-   Using `df["text"]` (features) and `y` (labels), create training and test sets using `train_test_split()`. Use a `test_size` of `0.33` and a `random_state` of `53`.
-   Create a `CountVectorizer` object called `count_vectorizer`. Ensure you specify the keyword argument `stop_words="english"` so that stop words are removed.
-   Fit and transform the training data `X_train` using the `.fit_transform()` method of your `CountVectorizer` object. Do the same with the test data `X_test`, except using the `.transform()` method.
-   Print the first 10 features of the `count_vectorizer` using its `.get_feature_names_out()` method (called `.get_feature_names()` before scikit-learn 1.0).

```python
# Import the necessary modules
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

# Print the head of df (df is preloaded in the exercise environment)
print(df.head())

# Create a series to store the labels: y
y = df.label

# Create training and test sets
X_train, X_test, y_train, y_test = train_test_split(df['text'], y, test_size=0.33, random_state=53)

# Initialize a CountVectorizer object: count_vectorizer
count_vectorizer = CountVectorizer(stop_words='english')

# Fit and transform the training data using only the 'text' column values: count_train
count_train = count_vectorizer.fit_transform(X_train)

# Transform the test data using only the 'text' column values: count_test
count_test = count_vectorizer.transform(X_test)

# Print the first 10 features of the count_vectorizer
# (get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out())
print(count_vectorizer.get_feature_names_out()[:10])
```
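
To see what `test_size=0.33` and `random_state=53` do in isolation, the sketch below splits 100 placeholder documents twice. The `texts` and `labels` lists are invented stand-ins for `df["text"]` and `df.label`.

```python
from sklearn.model_selection import train_test_split

# Placeholder data: 100 short "documents" with alternating labels.
texts = [f"document {i}" for i in range(100)]
labels = [i % 2 for i in range(100)]

# Two splits with the same random_state.
split_a = train_test_split(texts, labels, test_size=0.33, random_state=53)
split_b = train_test_split(texts, labels, test_size=0.33, random_state=53)

X_train, X_test, y_train, y_test = split_a

print(len(X_train), len(X_test))   # 67 training rows, 33 test rows
print(split_a == split_b)          # True: same seed, same split
```

Changing or omitting `random_state` would generally give a different shuffle on each run, which makes results harder to compare.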

TfidfVectorizer for text classification
=======================================

Similar to the sparse `CountVectorizer` created in the previous exercise, you'll work on creating tf-idf vectors for your documents. You'll set up a `TfidfVectorizer` and investigate some of its features.

In this exercise, you'll use `pandas` and `sklearn` along with the same `X_train`, `X_test`, `y_train`, and `y_test` Series you created in the last exercise.

Instructions
------------

-   Import `TfidfVectorizer` from `sklearn.feature_extraction.text`.
-   Create a `TfidfVectorizer` object called `tfidf_vectorizer`. When doing so, specify the keyword arguments `stop_words="english"` and `max_df=0.7`.
-   Fit and transform the training data. 
-   Transform the test data.
-   Print the first 10 features of `tfidf_vectorizer`.
-   Print the first 5 vectors of the tfidf training data using slicing on the `.A` (or array) ***attribute*** of `tfidf_train`.

```python
# Import TfidfVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

# Initialize a TfidfVectorizer object: tfidf_vectorizer
tfidf_vectorizer = TfidfVectorizer(stop_words='english', max_df=0.7)

# Fit and transform the training data: tfidf_train
tfidf_train = tfidf_vectorizer.fit_transform(X_train)

# Transform the test data: tfidf_test
tfidf_test = tfidf_vectorizer.transform(X_test)

# Print the first 10 features
# (get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out())
print(tfidf_vectorizer.get_feature_names_out()[:10])

# Print the first 5 vectors of the tfidf training data
# (.A is shorthand for .toarray() on the sparse matrix)
print(tfidf_train.A[:5])
```

Inspecting the vectors
======================

To get a better idea of how the vectors work, you'll investigate them by converting them into `pandas` DataFrames.

Here, you'll use the same data structures you created in the previous two exercises (`count_train`, `count_vectorizer`, `tfidf_train`, `tfidf_vectorizer`) as well as `pandas`, which is imported as `pd`.

Instructions
------------

-   Create the DataFrames `count_df` and `tfidf_df` by using `pd.DataFrame()` and specifying the values as the first argument and the columns (or features) as the second argument.
    -   The values can be accessed by using the `.A` attribute of, respectively, `count_train` and `tfidf_train`.
    -   The columns can be accessed using the `.get_feature_names_out()` methods of `count_vectorizer` and `tfidf_vectorizer` (called `.get_feature_names()` before scikit-learn 1.0).
-   Print the head of each DataFrame to investigate their structure. *This has been done for you.*
-   Test if the column names are the same for each DataFrame by creating a new object called `difference` to see the difference between the columns that `count_df` has from `tfidf_df`. Columns can be accessed using the `.columns` attribute of a DataFrame. Subtract the set of `tfidf_df.columns` from the set of `count_df.columns`.
-   Test if the two DataFrames are equivalent by using the `.equals()` method on `count_df` with `tfidf_df` as the argument.

```python
# pandas is already imported as pd in the exercise; repeated here for completeness
import pandas as pd

# Create the CountVectorizer DataFrame: count_df
count_df = pd.DataFrame(count_train.A, columns=count_vectorizer.get_feature_names_out())

# Create the TfidfVectorizer DataFrame: tfidf_df
tfidf_df = pd.DataFrame(tfidf_train.A, columns=tfidf_vectorizer.get_feature_names_out())

# Print the head of count_df
print(count_df.head())

# Print the head of tfidf_df
print(tfidf_df.head())

# Calculate the difference in columns: difference
difference = set(count_df.columns) - set(tfidf_df.columns)
print(difference)

# Check whether the DataFrames are equal
print(count_df.equals(tfidf_df))
```