1\. Classifying fake news using supervised learning with NLP
------------------------------------------------------------

00:00 - 00:13

In this video, we'll be learning about supervised machine learning with NLP. Throughout this chapter, you will apply these skills and ideas to classifying fake news.

2\. What is supervised learning?
--------------------------------

00:13 - 00:59

Supervised learning is a form of machine learning where you are given or create training data. This data has a label or outcome which you want the model or algorithm to learn. One common introductory machine learning example is Fisher's iris data; we have a few example rows of it here. The data has several features: sepal length and width, and petal length and width. The label we want to learn and predict is the species. This is a classification problem, so you want to be able to classify or categorize some data based on what you already know or have learned. Our goal is to use the dataset to make intelligent hypotheses about the species based on the geometric features.

* Form of machine learning
  * Problem has predefined training data
  * This data has a label (or outcome) you want the model to learn
  * Classification problem  
  * Goal: Make good hypotheses about the species based on geometric features

| Sepal length | Sepal width | Petal length | Petal width | Species |
|--------------|-------------|--------------|-------------|----------|
| 5.1 | 3.5 | 1.4 | 0.2 | I. setosa |
| 7.0 | 3.2 | 4.7 | 1.4 | I. versicolor |
| 6.3 | 3.3 | 6.0 | 2.5 | I. virginica |
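
To make the feature and label split concrete, here is a minimal sketch that loads the copy of the iris data bundled with scikit-learn. The variable names `X` and `y` are a common convention rather than something taken from the slides.

```python
from sklearn.datasets import load_iris

# Load the copy of the iris data that ships with scikit-learn (no download needed)
iris = load_iris()

X = iris.data    # features: sepal length, sepal width, petal length, petal width (cm)
y = iris.target  # labels: species encoded as the integers 0, 1, 2

print(iris.feature_names)
print(iris.target_names)

# First row of features together with its species name
print(X[0], iris.target_names[y[0]])
```

Each row of `X` is one flower, and the matching entry of `y` is the label a classifier would be trained to predict.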

3\. Supervised learning with NLP
--------------------------------

00:59 - 01:20

But instead of using geometric features like the iris dataset, we need to use language. To help create features and train a model, we will use scikit-learn, a powerful open-source library. One of the ways you can create supervised learning data from text is by using bag-of-words models or tf-idf as features.

* Need to use language instead of geometric features
* `scikit-learn`: Powerful open-source library 
* How to create supervised learning data from text?
  * Use bag-of-words models or tf-idf as features
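
As a rough illustration of the two feature types, the sketch below vectorizes two made-up sentences with scikit-learn's `CountVectorizer` and `TfidfVectorizer`. The sentences are invented for this example and are not part of the course data.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Two invented documents, used only for illustration
docs = [
    "the spaceship landed on the alien planet",
    "the hero fights the villain in the city",
]

# Bag of words: each column is a word, each value is a raw count
count_vectorizer = CountVectorizer()
counts = count_vectorizer.fit_transform(docs)

# tf-idf: same columns, but counts are reweighted and each row is normalized
tfidf_vectorizer = TfidfVectorizer()
tfidf = tfidf_vectorizer.fit_transform(docs)

vocab = count_vectorizer.vocabulary_  # maps each word to its column index
print(counts.toarray()[:, vocab["the"]])  # how often "the" occurs in each document
print(tfidf.toarray().round(2))
```

Both vectorizers produce one row per document and one column per word; they differ only in how the values in each cell are computed.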

4\. IMDB Movie Dataset
----------------------

01:20 - 02:09

Let's say I have a dataset full of movie plots and genres from the IMDB database, as shown in this chart. I've separated the action and sci-fi movies, removing any movies labeled both action and sci-fi. I want to predict whether a movie is action or sci-fi based on the plot summary. The dataset we've extracted has categorical features generated using some preprocessing. We can see the plot summary alongside the Sci-Fi and Action columns. The Sci-Fi column is 1 for movies that are sci-fi and 0 for movies that are action. The Action column is the inverse of the Sci-Fi column.

| Plot | Sci-Fi | Action |
|------|--------|---------|
| In a post-apocalyptic world in human decay, a ... | 1 | 0 |
| Mohei is a wandering swordsman. He arrives in ... | 0 | 1 |
| #137 is a SCI/FI thriller about a girl, Marla,... | 1 | 0 |

* Goal: Predict movie genre based on plot summary
* Categorical features generated using preprocessing

5\. Supervised learning steps
-----------------------------

02:09 - 03:08

In the next video, we'll use scikit-learn to predict a movie's genre from its plot. But first, let's review the supervised learning process as a whole. To begin, we collect and preprocess our data. Then, we determine a label - this is what we want the model to learn, in our case, the genre of the movie. We can split our data into training and testing datasets, keeping them separate so we can build our model using only the training data. The test data remains unseen so we can test how well our model performs after it is trained. This is an essential part of supervised learning! We also need to extract features from the text to predict the label. We will use a bag-of-words vectorizer built into scikit-learn to do so. After the model is trained, we can then test it using the test dataset. There are also other methods to evaluate model performance, such as k-fold cross-validation, and you can check out DataCamp's Machine Learning curriculum to learn that and more!

* Collect and preprocess our data
* Determine a label (Example: Movie genre)
* Split data into training and test sets 
* Extract features from the text to help predict the label
  * Bag-of-words vectorizer built into `scikit-learn`
* Evaluate trained model using the test set
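
The steps above can be sketched end to end on a tiny invented corpus. Everything in this example (the plots, the labels, the split sizes) is hypothetical, and `MultinomialNB` is used only as a convenient stand-in classifier. The final lines show k-fold cross-validation with `cross_val_score` as one alternative way to evaluate.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Steps 1-2: collect data and choose a label (hypothetical toy corpus: 1 = sci-fi, 0 = action)
plots = [
    "aliens invade earth in a spaceship",
    "a robot travels through time",
    "astronauts explore a distant planet",
    "a scientist builds a time machine",
    "a cop chases a gang across the city",
    "a soldier fights to rescue hostages",
    "a boxer trains for the final fight",
    "a spy escapes an explosion",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Step 3: split into training and test sets (stratified so both genres appear in each)
X_train, X_test, y_train, y_test = train_test_split(
    plots, labels, test_size=0.25, random_state=0, stratify=labels)

# Step 4: extract bag-of-words features and train; the pipeline fits the
# vectorizer on the training text only
model = make_pipeline(CountVectorizer(stop_words="english"), MultinomialNB())
model.fit(X_train, y_train)

# Step 5: evaluate on the unseen test set
test_accuracy = model.score(X_test, y_test)
print(test_accuracy)

# Alternative evaluation: 4-fold cross-validation over the whole corpus
cv_scores = cross_val_score(model, plots, labels, cv=4)
print(cv_scores)
```

With only eight documents the scores themselves mean very little; the point is the order of the steps and the fact that the test data never influences fitting.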

6\. Let's practice!
-------------------

03:08 - 03:16

Let's review some of the supervised learning steps, like splitting testing and training data, before applying them to our movie plot data.

Which possible features?
========================

Which of the following are possible features for a text classification problem?

#### Possible Answers

Select one answer

-   Number of words in a document.

-   Specific named entities.

-   Language.

-   All of the above.

Training and testing
====================

What datasets are needed for supervised learning?

#### Possible Answers

Select one answer

-   Training data.

-   Testing data.

-   Both training and testing data.

-   A label or outcome.

1\. Building word count vectors with scikit-learn
-------------------------------------------------

00:00 - 00:07

In this video, we'll build our first scikit-learn vectors from the movie plot and genre dataset.

2\. Predicting movie genre
--------------------------

00:07 - 00:22

We have a dataset full of movie plots and what genre the movie is -- either action or sci-fi. We want to create bag of words vectors for these movie plots to see if we can predict the genre based on the words used in the plot summary.

* Dataset consisting of movie plots and corresponding genre
* Goal: Create bag-of-words vectors for the movie plots
  * Can we predict genre based on the words used in the plot summary?

3\. Count Vectorizer with Python
--------------------------------

00:22 - 03:28

To do so, we first need to import some necessary tools from scikit-learn. Once the data is loaded, we can create y, which traditionally refers to the labels or outcome you want the model to learn. We can use the Sci-Fi column, which has 1 if the movie is sci-fi and 0 if it is action. Then, scikit-learn's train_test_split function can be used to split the DataFrame into training and testing data. This function splits the features (the plot summary, in the plot column) and the labels (y) based on a given test_size such as 0.33, representing 33 percent. I have also set random_state so we have a repeatable result; it operates like setting a random seed and ensures I get the same split when I run the code again. The function will mark 33% of the rows as test data and remove them from the training data. The test data is later used to see what my model has learned. The results of train_test_split are the training data (X_train), the training labels (y_train), the testing data (X_test) and the testing labels (y_test).

Next, we create a CountVectorizer, which turns my text into bag-of-words vectors similar to a Gensim corpus; it will also remove English stop words from the movie plot summaries as a preprocessing step. Each token now acts as a feature for the machine learning classification problem, just like the flower measurements in the iris dataset. We can then call fit_transform on the training data to create the bag-of-words vectors. fit_transform is a handy shortcut which calls the model's fit and then transform methods; here, that generates a mapping of words to IDs and vectors representing how many times each word appears in the plot. fit_transform operates differently for each model, but generally fit will find parameters or norms in the data and transform will apply the model's underlying algorithm or approximation, similar to preprocessing but with a specific use case in mind.

For the CountVectorizer class, fit_transform will create the bag-of-words dictionary and vectors for each document using the training data. After calling fit_transform on the training data, we call transform on the test data to create bag-of-words vectors using the same dictionary. The training and test vectors need to use a consistent set of words, so the trained model can understand the test input. If we don't have much data, there can be an issue with words in the test set which don't appear in the training data. CountVectorizer's transform does not raise an error for such words; it silently leaves them out of the test vectors, so the model never sees them. If that happens often, you will need to add more training data. In only a few lines of Python, we have transformed text into bag-of-words vectors and generated test and training datasets. scikit-learn is a great aid in helping make NLP machine learning simple and accessible.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer

df = ... # Load data into DataFrame
y = df['Sci-Fi']
X_train, X_test, y_train, y_test = train_test_split(
                    df['plot'], y,
                    test_size=0.33,
                    random_state=53)

count_vectorizer = CountVectorizer(stop_words='english')
count_train = count_vectorizer.fit_transform(X_train.values)
count_test = count_vectorizer.transform(X_test.values) 
```
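
A small sketch of how the fitted dictionary is reused may help here. The sentences are invented for illustration; the point is that `fit_transform` builds the vocabulary from the training text only, and `transform` maps new text onto those same columns, so a word such as `dragon` that was never seen during fitting gets no column.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Invented training and test sentences, for illustration only
train_docs = ["the spaceship lands", "the hero fights"]
test_docs = ["the spaceship fights a dragon"]

vectorizer = CountVectorizer()

# fit_transform learns the vocabulary from the training text and vectorizes it
train_vectors = vectorizer.fit_transform(train_docs)

# transform reuses that vocabulary; it does not add new words
test_vectors = vectorizer.transform(test_docs)

print(sorted(vectorizer.vocabulary_))  # columns, in alphabetical order
print(test_vectors.toarray())          # counts for the test sentence
```

The test vector has exactly as many columns as the training vectors, which is what lets a model trained on one be applied to the other.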

4\. Let's practice!
-------------------

03:28 - 03:34

Now it's your turn to create some scikit-learn vectors!

CountVectorizer for text classification
=======================================

It's time to begin building your text classifier! The [data](https://s3.amazonaws.com/assets.datacamp.com/production/course_3629/fake_or_real_news.csv) has been loaded into a DataFrame called `df`. Explore it in the IPython Shell to investigate what columns you can use. The `.head()` method is particularly informative.

In this exercise, you'll use `pandas` alongside scikit-learn to create a sparse text vectorizer you can use to train and test a simple supervised model. To begin, you'll set up a `CountVectorizer` and investigate some of its features.

Instructions
------------

-   Import `CountVectorizer` from `sklearn.feature_extraction.text` and `train_test_split` from `sklearn.model_selection`.
-   Create a Series `y` to use for the labels by assigning the `.label` attribute of `df` to `y`.
-   Using `df["text"]` (features) and `y` (labels), create training and test sets using `train_test_split()`. Use a `test_size` of `0.33` and a `random_state` of `53`.
-   Create a `CountVectorizer` object called `count_vectorizer`. Ensure you specify the keyword argument `stop_words="english"` so that stop words are removed.
-   Fit and transform the training data `X_train` using the `.fit_transform()` method of your `CountVectorizer` object. Do the same with the test data `X_test`, except using the `.transform()` method.
-   Print the first 10 features of the `count_vectorizer` using its `.get_feature_names()` method (`.get_feature_names_out()` on scikit-learn 1.0 and later).

In [None]:
# Import the necessary modules
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

# Print the head of df
print(df.head())

# Create a series to store the labels: y
y = df.label

# Create training and test sets
X_train, X_test, y_train, y_test = train_test_split(df['text'], y, test_size=0.33, random_state=53)

# Initialize a CountVectorizer object: count_vectorizer
count_vectorizer = CountVectorizer(stop_words='english')

# Fit the vectorizer on the training text and transform it: count_train
count_train = count_vectorizer.fit_transform(X_train)

# Transform the test text using the vocabulary learned from the training data: count_test
count_test = count_vectorizer.transform(X_test)

# Print the first 10 features of the count_vectorizer
# (get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it)
print(count_vectorizer.get_feature_names_out()[:10])

TfidfVectorizer for text classification
=======================================

Similar to the sparse `CountVectorizer` created in the previous exercise, you'll work on creating tf-idf vectors for your documents. You'll set up a `TfidfVectorizer` and investigate some of its features.

In this exercise, you'll use `pandas` and `sklearn` along with the same `X_train`, `y_train` and `X_test`, `y_test` DataFrames and Series you created in the last exercise.

Instructions
------------

-   Import `TfidfVectorizer` from `sklearn.feature_extraction.text`.
-   Create a `TfidfVectorizer` object called `tfidf_vectorizer`. When doing so, specify the keyword arguments `stop_words="english"` and `max_df=0.7`.
-   Fit and transform the training data. 
-   Transform the test data.
-   Print the first 10 features of `tfidf_vectorizer`.
-   Print the first 5 vectors of the tfidf training data by slicing the `.A` (array) attribute of `tfidf_train`. The `.toarray()` method returns the same dense array.

In [None]:
# Import TfidfVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

# Initialize a TfidfVectorizer object: tfidf_vectorizer
tfidf_vectorizer = TfidfVectorizer(stop_words='english', max_df=0.7)

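# Fit the vectorizer on the training data and transform it: tfidf_train
tfidf_train = tfidf_vectorizer.fit_transform(X_train)

# Transform the test data with the fitted vectorizer: tfidf_test
tfidf_test = tfidf_vectorizer.transform(X_test)

# Print the first 10 features
# (get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it)
print(tfidf_vectorizer.get_feature_names_out()[:10])

# Print the first 5 vectors of the tfidf training data
# (.toarray() returns the same dense array as the .A attribute)
print(tfidf_train.toarray()[:5])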
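
To see what `max_df=0.7` does, here is a minimal sketch on four invented documents. With `max_df` given as a float, `TfidfVectorizer` drops any term that appears in more than that fraction of the documents, on the assumption that such terms carry little signal.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Four invented documents; "news" and "about" each appear in 3 of 4 (75%)
docs = [
    "news about the election",
    "news about the economy",
    "news about sports",
    "weather report",
]

# Terms found in more than 70% of the documents are removed from the vocabulary
vectorizer = TfidfVectorizer(max_df=0.7)
vectorizer.fit(docs)

print(sorted(vectorizer.vocabulary_))
```

Here "news" and "about" are dropped, while "the" (present in half of the documents) is kept because no stop word list was requested in this sketch.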

Inspecting the vectors
======================

To get a better idea of how the vectors work, you'll investigate them by converting them into `pandas` DataFrames.

Here, you'll use the same data structures you created in the previous two exercises (`count_train`, `count_vectorizer`, `tfidf_train`, `tfidf_vectorizer`) as well as `pandas`, which is imported as `pd`.

Instructions
------------

-   Create the DataFrames `count_df` and `tfidf_df` by using `pd.DataFrame()` and specifying the values as the first argument and the columns (or features) as the second argument.
    -   The values can be accessed by using the `.A` attribute (equivalently, the `.toarray()` method) of, respectively, `count_train` and `tfidf_train`.
    -   The columns can be accessed using the `.get_feature_names()` methods of `count_vectorizer` and `tfidf_vectorizer` (`.get_feature_names_out()` on scikit-learn 1.0 and later).
-   Print the head of each DataFrame to investigate their structure. *This has been done for you.*
-   Test if the column names are the same for each DataFrame by creating a new object called `difference` to see the difference between the columns that `count_df` has from `tfidf_df`. Columns can be accessed using the `.columns` attribute of a DataFrame. Subtract the set of `tfidf_df.columns` from the set of `count_df.columns`.
-   Test if the two DataFrames are equivalent by using the `.equals()` method on `count_df` with `tfidf_df` as the argument.

In [None]:
# Create the CountVectorizer DataFrame: count_df
# (.toarray() matches .A; get_feature_names_out() replaces get_feature_names(), removed in scikit-learn 1.2)
count_df = pd.DataFrame(count_train.toarray(), columns=count_vectorizer.get_feature_names_out())

# Create the TfidfVectorizer DataFrame: tfidf_df
tfidf_df = pd.DataFrame(tfidf_train.toarray(), columns=tfidf_vectorizer.get_feature_names_out())

# Print the head of count_df
print(count_df.head())

# Print the head of tfidf_df
print(tfidf_df.head())

# Calculate the difference in columns: difference
difference = set(count_df.columns) - set(tfidf_df.columns)
print(difference)

# Check whether the DataFrames are equal
print(count_df.equals(tfidf_df))

1\. Training and testing a classification model with scikit-learn
-----------------------------------------------------------------

00:00 - 00:07

In this video, we'll use the features we have extracted to train and test a supervised classification model.

2\. Naive Bayes classifier
--------------------------

00:07 - 01:12

A Naive Bayes model is commonly used for testing NLP classification problems because of its basis in probability. The Naive Bayes algorithm attempts to answer the question: given a particular piece of data, how likely is a particular outcome? For example, thinking back to our movie genres dataset: if the plot has a spaceship, how likely is it that the movie is sci-fi? And given a spaceship and an alien, how likely is it now that the movie is sci-fi? Each word from our CountVectorizer acts as a feature, helping classify our text using probability. Naive Bayes has been used for text classification problems since the 1960s and continues to be used today despite the growth of many other models, algorithms and neural network architectures. It is not always the best tool for the job, but it is a simple and effective one you will use to build a fake news classifier.

* Naive Bayes Model
  * Commonly used for testing NLP classification problems
  * Basis in probability
* Given a particular piece of data, how likely is a particular outcome?
* Examples:
  * If the plot has a spaceship, how likely is it to be sci-fi?
  * Given a spaceship and an alien, how likely now is it sci-fi?
* Each word from `CountVectorizer` acts as a feature
* Naive Bayes: Simple and effective
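
The spaceship and alien questions can be worked through by hand. The sketch below applies Bayes' rule with the "naive" assumption that words are independent given the genre. All of the probabilities are made-up numbers chosen for illustration, not values estimated from the movie dataset, and `posterior_scifi` is a hypothetical helper name.

```python
# Hypothetical prior probability of each genre
prior = {"sci-fi": 0.3, "action": 0.7}

# Hypothetical probability of seeing each word in a plot of that genre
likelihood = {
    "sci-fi": {"spaceship": 0.20, "alien": 0.15},
    "action": {"spaceship": 0.02, "alien": 0.01},
}


def posterior_scifi(words, prior, likelihood):
    """Return P(sci-fi | words), treating words as independent given the genre."""
    scores = {}
    for genre in prior:
        score = prior[genre]
        for word in words:
            score *= likelihood[genre][word]
        scores[genre] = score
    # Normalize so the genre probabilities sum to 1
    return scores["sci-fi"] / sum(scores.values())


p_spaceship = posterior_scifi(["spaceship"], prior, likelihood)
p_both = posterior_scifi(["spaceship", "alien"], prior, likelihood)

print(round(p_spaceship, 3))
print(round(p_both, 3))
```

Under these assumed numbers, adding the second word raises the sci-fi probability further, which is the intuition behind each word acting as a separate piece of evidence.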

3\. Naive Bayes with scikit-learn
---------------------------------

01:12 - 03:04

We'll use scikit-learn's Naive Bayes to take a look at our sci-fi versus action plot classification problem. Recall the data we're using is simply IMDB plot summaries, and whether the movie is science fiction or action. First, we import the Naive Bayes model class, MultinomialNB (multinomial Naive Bayes), which works well with count vectorizers as it expects integer inputs. MultinomialNB can also be used when there are more than two labels to choose from. This model may not work as well with floats, such as tf-idf weighted inputs. For those, consider support vector machines or even linear models, although I recommend trying Naive Bayes first to determine if it can also work well. We use the metrics module to evaluate model performance. We initialize our class and call fit with the training data. If you recall from the previous video, this will determine the internal parameters based on the dataset. We pass the training count vectors first and the training labels second. After fitting the model, we call predict with the test count vectors. predict will use the trained model to predict the label based on the test data vectors. We save the predicted labels in the variable pred to test the accuracy. Finally, we test accuracy using accuracy_score from the metrics module, passing the test labels and the predicted labels. Accuracy for our model means the percentage of correct genre guesses out of total guesses. Our model has about 86% accuracy, which is pretty good for a first try! You'll be applying the Multinomial Naive Bayes classifier to the fake news dataset in the following exercises.

```python
from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics

nb_classifier = MultinomialNB()

nb_classifier.fit(count_train, y_train)
pred = nb_classifier.predict(count_test)
metrics.accuracy_score(y_test, pred)

0.8584184938982042
```

4\. Confusion matrix
--------------------

03:04 - 04:16

To further evaluate our model, we can also check the confusion matrix, which shows correct and incorrect labels. The confusion_matrix function from the metrics module takes the test labels, the predictions and a list of labels. If the label list is not passed, scikit-learn will use the labels found in the data in sorted order. The confusion matrix is a bit easier to read when we transform it into a table. The first and last values of the matrix (the main diagonal) show the correct classifications of both action and sci-fi films based on the plot bag-of-words vectors. In a confusion matrix, the predicted labels are shown across the top and the true labels are shown down the side. This confusion matrix shows 864 sci-fi movies incorrectly labeled as action and 563 action movies incorrectly labeled as sci-fi. We can see from the distribution of true positives and negatives that our dataset is a bit skewed: we have many more action films than sci-fi. This could be one reason that our action movies are predicted more accurately.

```python
metrics.confusion_matrix(y_test, pred, labels=[0,1])

array([[6410,  563],
       [ 864, 2242]])

# Table format:
#           Action  Sci-Fi
# Action    6410    563
# Sci-Fi    864     2242
```
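As a quick check on how the numbers relate, the sketch below recomputes the overall accuracy and the per-genre share of correct predictions directly from the matrix shown above. It uses only NumPy; the variable names are illustrative.

```python
import numpy as np

# Confusion matrix from above: rows are true labels, columns are predictions,
# in the order [Action, Sci-Fi]
cm = np.array([[6410, 563],
               [864, 2242]])

# Accuracy: correct predictions (main diagonal) over all predictions
accuracy = np.trace(cm) / cm.sum()

# Share of each true genre that was predicted correctly (per-class recall)
recall_per_class = np.diag(cm) / cm.sum(axis=1)

print(accuracy)
print(dict(zip(["Action", "Sci-Fi"], recall_per_class.round(3))))
```

The accuracy matches the score reported earlier, and the per-genre values make the imbalance visible: a larger share of action plots than sci-fi plots is classified correctly.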

5\. Let's practice!
-------------------

04:16 - 04:22

Now it's your turn to train and test a Naive Bayes model for the fake news problem!

Text classification models
==========================

Which of the below is the most reasonable model to use when training a new supervised model using text vector data?

#### Possible Answers

Select one answer

-   Random Forests

-   Naive Bayes

-   Linear Regression

-   Deep Learning

Training and testing the "fake news" model with CountVectorizer
===============================================================

Now it's your turn to train the "fake news" model using the features you identified and extracted. In this first exercise you'll train and test a Naive Bayes model using the `CountVectorizer` data.

The training and test sets have been created, and `count_vectorizer`, `count_train`, and `count_test` have been computed.

Instructions
------------

-   Import the `metrics` module from `sklearn` and `MultinomialNB` from `sklearn.naive_bayes`.
-   Instantiate a `MultinomialNB` classifier called `nb_classifier`.
-   Fit the classifier to the training data.
-   Compute the predicted tags for the test data.
-   Calculate and print the accuracy score of the classifier.
-   Compute the confusion matrix. To make it easier to read, specify the keyword argument `labels=['FAKE', 'REAL']`.

In [None]:
# Import the necessary modules
from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics

# Instantiate a Multinomial Naive Bayes classifier: nb_classifier
nb_classifier = MultinomialNB()

# Fit the classifier to the training data
nb_classifier.fit(count_train, y_train)

# Create the predicted tags: pred
pred = nb_classifier.predict(count_test)

# Calculate the accuracy score: score
score = metrics.accuracy_score(y_test, pred)
print(score)

# Calculate the confusion matrix: cm
cm = metrics.confusion_matrix(y_test, pred, labels=['FAKE', 'REAL'])
print(cm)

Training and testing the "fake news" model with TfidfVectorizer
===============================================================

Now that you have evaluated the model using the `CountVectorizer`, you'll do the same using the `TfidfVectorizer` with a Naive Bayes model.

The training and test sets have been created, and `tfidf_vectorizer`, `tfidf_train`, and `tfidf_test` have been computed. Additionally, `MultinomialNB` and `metrics` have been imported from, respectively, `sklearn.naive_bayes` and `sklearn`.

Instructions
------------

-   Instantiate a `MultinomialNB` classifier called `nb_classifier`.
-   Fit the classifier to the training data.
-   Compute the predicted tags for the test data.
-   Calculate and print the accuracy score of the classifier.
-   Compute the confusion matrix. As in the previous exercise, specify the keyword argument `labels=['FAKE', 'REAL']` so that the resulting confusion matrix is easier to read.

In [None]:
# Create a Multinomial Naive Bayes classifier: nb_classifier
nb_classifier = MultinomialNB()

# Fit the classifier to the training data
nb_classifier.fit(tfidf_train, y_train)

# Create the predicted tags: pred
pred = nb_classifier.predict(tfidf_test)

# Calculate the accuracy score: score
score = metrics.accuracy_score(y_test, pred)
print(score)

# Calculate the confusion matrix: cm
cm = metrics.confusion_matrix(y_test, pred, labels=['FAKE', 'REAL'])
print(cm)

1\. Simple NLP, complex problems
--------------------------------

00:00 - 00:27

Congratulations, you've learned so much about Natural Language Processing fundamentals in this course! In this video, we'll talk more about how complex these problems can be and how to use the skills you have learned to start a longer exploration of working with language in Python. In the exercises, you will investigate your fake news classification model further to see if it has really learned what you wanted.

2\. Translation
---------------

00:27 - 01:12

Translation, although it might work well for some languages, still has a long way to go. This tweet by Lupin, showing an attempt to translate some legal or bureaucratic text about economics and industry, is a funny and sadly accurate example of how using word vectors between two languages can end poorly. The German text uses words like Nationalökonomie and Wirtschaftswissenschaft, but the English text simply says that the economics of economics (including economics, economics and so forth) is a part of economics. The German text has many different words related to economics, and they are all simply closest to the English vector for economics, leading to a hilarious but woefully inaccurate translation.

> **Lupin** @Lupintweets · Follow
> 
> god bless the german language
> 
> [*Translation interface showing:*]
>
> **German (detected):**  
> Die Volkswirtschaftslehre (auch Nationalökonomie, Wirtschaftliche Staatswissenschaften oder Sozialökonomie, kurz VWL), ist ein Teilgebiet der Wirtschaftswissenschaft
>
> **English translation:**  
> The economics of economics (including economics, economics, economics, economics, economics, economics) is a part of economics
>
> 🔄 9,595 Retweets   ❤️ 16,327 Likes

[Source](https://twitter.com/Lupintweets/status/865533182455685121)

3\. Sentiment analysis
----------------------

01:12 - 02:04

Sentiment analysis is far from a solved problem. Complex issues like snark or sarcasm, and difficult problems with negation (for example: I liked it BUT it could have been better), make it an open field of research. There is also active research regarding how separate communities use the same words differently. This is a graphic from a project called SocialSent, created by a group of researchers at Stanford. The project compares sentiment changes in words over time and across different communities. Here, the authors compare sentiment in word usage between two different Reddit communities: r/TwoX, a women-centered subreddit, and r/sports. The graphic illustrates that the SAME word can be used with very different sentiments depending on the communal understanding of the word.

| Word | Example context in r/sports | Example context in r/TwoX | More positive in |
|------|-----------------------------|---------------------------|------------------|
| soft | "big men are very _soft_" | "some _soft_ pajamas" | r/TwoX |
| animal | "freakin raging _animal_" | "stuffed _animal_" | r/TwoX |
| ladies | "went from the _ladies_ tees" | "lovely _ladies_" | r/TwoX |
| dogs | "two _dogs_ fighting" | "hiking with the _dogs_" | r/TwoX |
| hit | "being able to _hit_" | "it didn't really _hit_ me" | r/TwoX |
| difficult | "insanely _difficult_ saves" | "a _difficult_ time" | r/sports |
| shot | "amazing _shot_" | "totally _shot_ me down" | r/sports |
| insane | "his stats are _insane_" | "people are just _insane_" | r/sports |
| crazy | "he is still _crazy_ good" | "overreacting _crazy_ woman" | r/sports |

[Bar chart: sentiment difference per word on a scale from -10 (more positive in r/sports, more negative in r/TwoX) to +10 (more positive in r/TwoX, more negative in r/sports)]

_(source: https://nlp.stanford.edu/projects/socialsent/)_

4\. Language biases
-------------------

02:04 - 02:46

Finally, we must remember language can contain its own prejudices and unfair treatment towards groups. When we then train word vectors on these prejudiced texts, our word vectors will likely reflect those problems. Here, we take text in a language with gendered pronouns, like English, and translate it to Turkish, a language with no gendered pronouns. When we click to translate it back, we see the genders have switched. This phenomenon was studied in a recent article by Princeton researcher Aylin Caliskan alongside several ethical machine learning researchers. She also gave a talk at the 33rd annual Chaos Communication Congress, the Chaos Computer Club's conference in Hamburg, which I can definitely recommend viewing.

```markdown
## Google Translate (English → Turkish)

She's a professor. He's a babysitter.
↔
O bir profesör. O bir bebek bakıcısı.

---

## Google Translate (Turkish → English)

O bir profesör. O bir bebek bakıcısı.
↔
He's a professor. She's a babysitter.

_(related talk: https://www.youtube.com/watch?v=j7FwpZB1hWc)_
```

5\. Let's practice!
-------------------

02:46 - 03:07

As we have seen in just a few examples, the field of natural language processing still has plenty of unsolved problems. In fact, our fake news detector is likely one of them! Let's do some investigation into what it has learned to determine if the model will be widely applicable or if the problem is likely more complex than simple word counts.

Improving the model
===================

What are possible next steps you could take to improve the model?

#### Possible Answers

Select one answer

-   Tweaking alpha levels.

-   Trying a new classification model.

-   Training on a larger dataset.

-   Improving text preprocessing.

-   All of the above.

Improving your model
====================

Your job in this exercise is to test a few different alpha levels using the `Tfidf` vectors to determine if there is a better performing combination.

The training and test sets have been created, and `tfidf_vectorizer`, `tfidf_train`, and `tfidf_test` have been computed.

Instructions
------------

-   Create a list of alphas to try using `np.arange()`. Values should range from `0` to `1` with steps of `0.1`.
-   Create a function `train_and_predict()` that takes in one argument: `alpha`. The function should:
    -   Instantiate a `MultinomialNB` classifier with `alpha=alpha`.
    -   Fit it to the training data.
    -   Compute predictions on the test data.
    -   Compute and return the accuracy score.
-   Using a `for` loop, print the `alpha`, `score` and a newline in between. Use your `train_and_predict()` function to compute the `score`. Does the score change along with the alpha? What is the best alpha?

In [None]:
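import numpy as np

# Create the list of alphas: 0.0, 0.1, ..., 0.9 (np.arange excludes the stop value)
alphas = np.arange(0, 1, .1)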

# Define train_and_predict()
def train_and_predict(alpha):
    # Instantiate the classifier: nb_classifier
    nb_classifier = MultinomialNB(alpha=alpha)
    # Fit to the training data
    nb_classifier.fit(tfidf_train, y_train)
    # Predict the labels: pred
    pred = nb_classifier.predict(tfidf_test)
    # Compute accuracy: score
    score = metrics.accuracy_score(y_test, pred)
    return score

# Iterate over the alphas and print the corresponding score
for alpha in alphas:
    print('Alpha: ', alpha)
    print('Score: ', train_and_predict(alpha))
    print()
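
For intuition about what `alpha` controls, here is a minimal sketch of additive smoothing, the estimate that scikit-learn's documentation gives for `MultinomialNB` feature probabilities: each word count gets `alpha` added before normalizing. The counts and the helper name `smoothed_word_probs` are made up for illustration.

```python
import numpy as np


def smoothed_word_probs(word_counts, alpha):
    """Estimate P(word | class) as (count + alpha) / (total + alpha * n_words)."""
    word_counts = np.asarray(word_counts, dtype=float)
    return (word_counts + alpha) / (word_counts.sum() + alpha * word_counts.size)


# Hypothetical counts of three words within one class; the third word never occurs
counts = [3, 1, 0]

for alpha in (0.0, 0.5, 1.0):
    print(alpha, smoothed_word_probs(counts, alpha))
```

With `alpha=0` the unseen word gets probability zero, which would rule out the class for any document containing it; larger values of `alpha` move the estimates toward a uniform distribution.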


Inspecting your model
=====================

Now that you have built a "fake news" classifier, you'll investigate what it has learned. You can map the important vector weights back to actual words using some simple inspection techniques.

You have your well-performing tf-idf Naive Bayes classifier available as `nb_classifier`, and the vectorizer as `tfidf_vectorizer`.

Instructions
------------

-   Save the class labels as `class_labels` by accessing the `.classes_` attribute of `nb_classifier`.
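-   Extract the features using the `.get_feature_names()` method of `tfidf_vectorizer` (`.get_feature_names_out()` on scikit-learn 1.0 and later).
-   Create a zipped array of the classifier coefficients with the feature names and sort them by the coefficients. To do this, first use `zip()` with the arguments `nb_classifier.coef_[0]` and `feature_names`. Then, use `sorted()` on this. On scikit-learn 1.1 and later, where `.coef_` was removed from `MultinomialNB`, use `nb_classifier.feature_log_prob_[1]` in its place.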
-   Print the *top* 20 weighted features for the first label of `class_labels` and print the bottom 20 weighted features for the second label of `class_labels`. *This has been done for you.*

In [None]:
# Get the class labels: class_labels
class_labels = nb_classifier.classes_

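# Extract the features: feature_names
# (get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it)
feature_names = tfidf_vectorizer.get_feature_names_out()

# Zip the feature names together with the coefficient array and sort by weights: feat_with_weights
# (MultinomialNB.coef_ was removed in scikit-learn 1.1; for two classes, coef_[0] equals feature_log_prob_[1])
feat_with_weights = sorted(zip(nb_classifier.feature_log_prob_[1], feature_names))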

# Print the first class label and the top 20 feat_with_weights entries
print(class_labels[0], feat_with_weights[:20])

# Print the second class label and the bottom 20 feat_with_weights entries
print(class_labels[1], feat_with_weights[-20:])