### CountVectorizer for text classification
It's time to begin building your text classifier! The data has been loaded into a DataFrame called df. Explore it in the IPython Shell to investigate what columns you can use. The .head() method is particularly informative.

In this exercise, you'll use pandas alongside *scikit-learn* to create a sparse text vectorizer you can use to train and test a simple supervised model. To begin, you'll set up a CountVectorizer and investigate some of its features.

#### Instructions
- Import CountVectorizer from sklearn.feature_extraction.text and train_test_split from sklearn.model_selection.
- Create a Series y to use for the labels by assigning the .label attribute of df to y.
- Using df["text"] (features) and y (labels), create training and test sets using train_test_split(). Use a test_size of 0.33 and a random_state of 53.
- Create a CountVectorizer object called count_vectorizer. Ensure you specify the keyword argument stop_words="english" so that stop words are removed.
- Fit and transform the training data X_train using the .fit_transform() method of your CountVectorizer object. Do the same with the test data X_test, except using the .transform() method.
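- Print the first 10 features of the count_vectorizer using its .get_feature_names() method. (This method was removed in scikit-learn 1.2; on recent releases use .get_feature_names_out() instead.)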
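
The difference between .fit_transform() on the training data and .transform() on the test data can be sketched in plain Python. This is only an illustration of the idea, not scikit-learn's implementation: the toy documents, the two-word stop list, and the helper names tokenize, fit_vocabulary and transform are all hypothetical.

```python
import re
from collections import Counter

# Hypothetical toy documents, not taken from the news dataset
train_docs = ["the senate passed the bill", "voters passed on the bill"]
test_docs = ["the senate rejected the bill"]

STOP_WORDS = {"the", "on"}  # tiny stand-in for an English stop-word list

def tokenize(text):
    # Lowercase, keep tokens of two or more word characters, drop stop words
    return [t for t in re.findall(r"\b\w\w+\b", text.lower()) if t not in STOP_WORDS]

def fit_vocabulary(docs):
    # "fit": learn a sorted vocabulary from the training documents only
    terms = sorted({tok for doc in docs for tok in tokenize(doc)})
    return {term: idx for idx, term in enumerate(terms)}

def transform(docs, vocabulary):
    # "transform": count known terms; terms outside the vocabulary are dropped
    rows = []
    for doc in docs:
        counts = Counter(tokenize(doc))
        rows.append([counts.get(term, 0) for term in vocabulary])
    return rows

vocabulary = fit_vocabulary(train_docs)
count_train = transform(train_docs, vocabulary)
count_test = transform(test_docs, vocabulary)

print(list(vocabulary))
print(count_train)
print(count_test)
```

The word "rejected" occurs only in the test document, so it has no column in the test vectors. This is why the test set is transformed with the vocabulary learned from the training set rather than fitted again.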

In [2]:
# Import the necessary modules
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
import pandas as pd


df = pd.read_csv('fake_or_real_news.csv')
# Print the head of df
print(df.head())

# Create a series to store the labels: y
y = df['label']

# Create training and test sets
X_train, X_test, y_train, y_test = train_test_split(df["text"], y, test_size=0.33, random_state=53)

# Initialize a CountVectorizer object: count_vectorizer
count_vectorizer = CountVectorizer(stop_words="english")

# Fit the vectorizer on the training text and transform it: count_train
count_train = count_vectorizer.fit_transform(X_train)

# Transform the test text with the already-fitted vectorizer: count_test
count_test = count_vectorizer.transform(X_test)

# Print the first 10 features of the count_vectorizer
print(count_vectorizer.get_feature_names()[:10])

   Unnamed: 0                                              title  \
0        8476                       You Can Smell Hillary’s Fear   
1       10294  Watch The Exact Moment Paul Ryan Committed Pol...   
2        3608        Kerry to go to Paris in gesture of sympathy   
3       10142  Bernie supporters on Twitter erupt in anger ag...   
4         875   The Battle of New York: Why This Primary Matters   

                                                text label  
0  Daniel Greenfield, a Shillman Journalism Fello...  FAKE  
1  Google Pinterest Digg Linkedin Reddit Stumbleu...  FAKE  
2  U.S. Secretary of State John F. Kerry said Mon...  REAL  
3  — Kaydee King (@KaydeeKing) November 9, 2016 T...  FAKE  
4  It's primary day in New York and front-runners...  REAL  
['00', '000', '0000', '00000031', '000035', '00006', '0001', '0001pt', '000ft', '000km']


### TfidfVectorizer for text classification
As with the sparse CountVectorizer output from the previous exercise, you'll now create tf-idf vectors for your documents. You'll set up a TfidfVectorizer and investigate some of its features.

In this exercise, you'll use pandas and sklearn along with the same X_train, y_train, X_test, and y_test Series you created in the last exercise.

#### Instructions

- Import TfidfVectorizer from sklearn.feature_extraction.text.
- Create a TfidfVectorizer object called tfidf_vectorizer. When doing so, specify the keyword arguments stop_words="english" and max_df=0.7.
- Fit and transform the training data.
- Transform the test data.
- Print the first 10 features of tfidf_vectorizer.
- Print the first 5 vectors of the tfidf training data using slicing on the .A attribute of tfidf_train (equivalent to calling its .toarray() method).
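
To make the tf-idf weighting and the max_df cut-off concrete, here is a small plain-Python sketch. It assumes the smoothed idf formula ln((1 + n) / (1 + df)) + 1 followed by L2 normalization, which is how scikit-learn documents its default behaviour. The toy corpus and the helper names idf and tfidf_vector are hypothetical.

```python
import math
from collections import Counter

# Hypothetical, already-tokenized toy corpus
docs = [
    ["election", "vote", "vote"],
    ["election", "senate"],
    ["election", "vote", "poll"],
    ["election", "poll"],
]
n_docs = len(docs)

# Document frequency: in how many documents each term appears
doc_freq = Counter(term for doc in docs for term in set(doc))

# max_df=0.7 drops terms found in more than 70% of the documents
max_df = 0.7
vocabulary = sorted(t for t, df in doc_freq.items() if df / n_docs <= max_df)

def idf(term):
    # Smoothed idf (assumed default): ln((1 + n) / (1 + df)) + 1
    return math.log((1 + n_docs) / (1 + doc_freq[term])) + 1

def tfidf_vector(doc):
    counts = Counter(doc)
    weights = [counts[t] * idf(t) for t in vocabulary]
    norm = math.sqrt(sum(w * w for w in weights))
    # L2-normalize so every non-empty document vector has length 1
    return [w / norm if norm else 0.0 for w in weights]

tfidf_train = [tfidf_vector(doc) for doc in docs]

print(vocabulary)
for row in tfidf_train:
    print([round(w, 3) for w in row])
```

Here "election" appears in every document, so the max_df filter removes it from the vocabulary. Terms that are that common carry little information for telling documents apart.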

In [3]:
# Import TfidfVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

# Initialize a TfidfVectorizer object: tfidf_vectorizer
tfidf_vectorizer = TfidfVectorizer(stop_words="english", max_df=0.7)

# Fit the vectorizer on the training data and transform it: tfidf_train
tfidf_train = tfidf_vectorizer.fit_transform(X_train)

# Transform the test data with the fitted vectorizer: tfidf_test
tfidf_test = tfidf_vectorizer.transform(X_test)

# Print the first 10 features
print(tfidf_vectorizer.get_feature_names()[:10])

# Print the first 5 vectors of the tfidf training data
print(tfidf_train.A[:5])

['00', '000', '0000', '00000031', '000035', '00006', '0001', '0001pt', '000ft', '000km']
[[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]


### Inspecting the vectors
To get a better idea of how the vectors work, you'll investigate them by converting them into pandas DataFrames.

Here, you'll use the same data structures you created in the previous two exercises (count_train, count_vectorizer, tfidf_train, tfidf_vectorizer) as well as pandas, which is imported as pd.

#### Instructions

- Create the DataFrames count_df and tfidf_df by using pd.DataFrame() and specifying the values as the first argument and the columns (or features) as the second argument.
  - The values can be accessed by using the .A attribute of, respectively, count_train and tfidf_train.
  - The columns can be accessed using the .get_feature_names() methods of count_vectorizer and tfidf_vectorizer.
- Print the head of each DataFrame to investigate their structure. This has been done for you.
- Test whether the column names are the same for each DataFrame by creating a new object called difference that holds the columns present in count_df but not in tfidf_df. Columns can be accessed using the .columns attribute of a DataFrame. Subtract the set of tfidf_df.columns from the set of count_df.columns.
- Test if the two DataFrames are equivalent by using the .equals() method on count_df with tfidf_df as the argument.


In [4]:
# Create the CountVectorizer DataFrame: count_df
count_df = pd.DataFrame(count_train.A, columns=count_vectorizer.get_feature_names())
count_df.info()
# Create the TfidfVectorizer DataFrame: tfidf_df
tfidf_df = pd.DataFrame(tfidf_train.A, columns=tfidf_vectorizer.get_feature_names())
tfidf_df.info()
# Print the head of count_df
print(count_df.head())

# Print the head of tfidf_df
print(tfidf_df.head())

# Calculate the difference in columns: difference
difference = set(count_df.columns) - set(tfidf_df.columns)
print(difference)

# Check whether the DataFrames are equal
print(count_df.equals(tfidf_df))


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4244 entries, 0 to 4243
Columns: 56922 entries, 00 to ยงade
dtypes: int64(56922)
memory usage: 1.8 GB
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4244 entries, 0 to 4243
Columns: 56922 entries, 00 to ยงade
dtypes: float64(56922)
memory usage: 1.8 GB
   00  000  0000  00000031  000035  00006  0001  0001pt  000ft  000km  ...  \
0   0    0     0         0       0      0     0       0      0      0  ...   
1   0    0     0         0       0      0     0       0      0      0  ...   
2   0    0     0         0       0      0     0       0      0      0  ...   
3   0    0     0         0       0      0     0       0      0      0  ...   
4   0    0     0         0       0      0     0       0      0      0  ...   

   حلب  عربي  عن  لم  ما  محاولات  من  هذا  والمرضى  ยงade  
0    0     0   0   0   0        0   0    0        0      0  
1    0     0   0   0   0        0   0    0        0      0  
2    0     0   0   0   0        0   0    0
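
The 1.8 GB memory usage reported by .info() above follows directly from the shape of the dense DataFrame. A sparse matrix stores only non-zero entries, while the dense conversion stores every zero as well. The arithmetic below uses the row and column counts from the output and assumes 8 bytes per value, which is the size of both int64 and float64.

```python
# Shape reported by DataFrame.info() in the output above
n_rows, n_cols = 4244, 56922
bytes_per_value = 8  # int64 and float64 both use 8 bytes per cell

dense_bytes = n_rows * n_cols * bytes_per_value
dense_gib = dense_bytes / 1024 ** 3

print(f"{n_rows * n_cols:,} cells -> {dense_bytes:,} bytes (about {dense_gib:.1f} GiB)")
```

For a vocabulary of this size, inspecting a small slice of the matrix is usually enough and avoids allocating the full dense array twice.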

### Training and testing a classification model with scikit-learn

### Training and testing the "fake news" model with CountVectorizer
Now it's your turn to train the "fake news" model using the features you identified and extracted. In this first exercise you'll train and test a Naive Bayes model using the CountVectorizer data.

The training and test sets have been created, and count_vectorizer, count_train, and count_test have been computed.

#### Instructions

- Import the metrics module from sklearn and MultinomialNB from sklearn.naive_bayes.
- Instantiate a MultinomialNB classifier called nb_classifier.
- Fit the classifier to the training data.
- Compute the predicted tags for the test data.
- Calculate and print the accuracy score of the classifier.
- Compute the confusion matrix. To make it easier to read, specify the keyword argument labels=['FAKE', 'REAL'].
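
The fit and predict steps listed above can be sketched in plain Python to show what a multinomial Naive Bayes model computes. This is a simplified illustration and makes no claim to mirror scikit-learn's implementation. The helper names fit_multinomial_nb and predict, and the toy count vectors, are hypothetical.

```python
import math
from collections import Counter, defaultdict

def fit_multinomial_nb(count_rows, labels, alpha=1.0):
    """Return class log-priors and per-class feature log-probabilities."""
    n_features = len(count_rows[0])
    class_counts = Counter(labels)

    # Sum the term counts of all documents belonging to each class
    feature_totals = defaultdict(lambda: [0] * n_features)
    for row, label in zip(count_rows, labels):
        for j, count in enumerate(row):
            feature_totals[label][j] += count

    log_prior, log_prob = {}, {}
    for label, n_docs in class_counts.items():
        log_prior[label] = math.log(n_docs / len(labels))
        # Additive smoothing: alpha is added to every feature count
        total = sum(feature_totals[label]) + alpha * n_features
        log_prob[label] = [math.log((c + alpha) / total) for c in feature_totals[label]]
    return log_prior, log_prob

def predict(count_rows, log_prior, log_prob):
    preds = []
    for row in count_rows:
        # Score = log prior + sum over terms of count * log P(term | class)
        scores = {
            label: log_prior[label] + sum(c * lp for c, lp in zip(row, log_prob[label]))
            for label in log_prior
        }
        preds.append(max(scores, key=scores.get))
    return preds

# Hypothetical counts over the vocabulary ["aliens", "budget", "senate", "shocking"]
count_train = [[2, 0, 0, 1], [1, 0, 0, 2], [0, 2, 1, 0], [0, 1, 2, 0]]
y_train = ["FAKE", "FAKE", "REAL", "REAL"]
count_test = [[1, 0, 0, 1], [0, 1, 1, 0]]

log_prior, log_prob = fit_multinomial_nb(count_train, y_train)
pred = predict(count_test, log_prior, log_prob)
print(pred)
```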



In [5]:
# Import the necessary modules
from sklearn import metrics
from sklearn.naive_bayes import MultinomialNB

# Instantiate a Multinomial Naive Bayes classifier: nb_classifier
nb_classifier = MultinomialNB()

# Fit the classifier to the training data
nb_classifier.fit(count_train, y_train)

# Create the predicted tags: pred
pred = nb_classifier.predict(count_test)

# Calculate the accuracy score: score
score = metrics.accuracy_score(y_test, pred)
print(score)

# Calculate the confusion matrix: cm
cm = metrics.confusion_matrix(y_test, pred, labels=['FAKE', 'REAL'])
print(cm)


0.893352462936394
[[ 865  143]
 [  80 1003]]
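
The accuracy score and the confusion matrix above describe the same predictions, and the first can be recovered from the second. The short check below assumes the scikit-learn convention that rows are true labels and columns are predicted labels, both in the order given by labels=['FAKE', 'REAL'].

```python
# Confusion matrix printed above; rows are true labels, columns are predictions,
# both in the order ['FAKE', 'REAL']
cm = [[865, 143],
      [80, 1003]]

correct = cm[0][0] + cm[1][1]        # diagonal: predictions matching the true label
total = sum(sum(row) for row in cm)  # every article in the test set
accuracy = correct / total

print(correct, total, accuracy)
```

The off-diagonal entries are the errors: 143 FAKE articles were predicted as REAL, and 80 REAL articles were predicted as FAKE.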


### Training and testing the "fake news" model with TfidfVectorizer
Now that you have evaluated the model using the CountVectorizer, you'll do the same using the **TfidfVectorizer** with a Naive Bayes model.

The training and test sets have been created, and tfidf_vectorizer, tfidf_train, and tfidf_test have been computed. Additionally, MultinomialNB and metrics have been imported from, respectively, sklearn.naive_bayes and sklearn.

#### Instructions

- Instantiate a MultinomialNB classifier called nb_classifier.
- Fit the classifier to the training data.
- Compute the predicted tags for the test data.
- Calculate and print the accuracy score of the classifier.
- Compute the confusion matrix. As in the previous exercise, specify the keyword argument labels=['FAKE', 'REAL'] so that the resulting confusion matrix is easier to read.

In [6]:
# Create a Multinomial Naive Bayes classifier: nb_classifier
nb_classifier = MultinomialNB()

# Fit the classifier to the training data
nb_classifier.fit(tfidf_train, y_train)

# Create the predicted tags: pred
pred = nb_classifier.predict(tfidf_test)

# Calculate the accuracy score: score
score = metrics.accuracy_score(y_test, pred)
print(score)

# Calculate the confusion matrix: cm
cm = metrics.confusion_matrix(y_test, pred, labels=['FAKE', 'REAL'])
print(cm)

0.8565279770444764
[[ 739  269]
 [  31 1052]]
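
Accuracy alone hides where the two models differ. Per-class recall, computed from the two confusion matrices printed in this and the previous exercise, shows it more clearly. The dictionary layout and variable names below are just one hypothetical way to organize the numbers.

```python
labels = ["FAKE", "REAL"]

# Confusion matrices copied from the two exercise outputs
confusion = {
    "CountVectorizer": [[865, 143], [80, 1003]],
    "TfidfVectorizer": [[739, 269], [31, 1052]],
}

recall = {}
for name, cm in confusion.items():
    # Recall for class i = correctly predicted i / all articles whose true label is i
    recall[name] = {label: cm[i][i] / sum(cm[i]) for i, label in enumerate(labels)}
    print(name, {label: round(value, 3) for label, value in recall[name].items()})
```

With the default settings, the tf-idf model recovers more of the REAL articles but misses noticeably more of the FAKE ones, which accounts for its lower overall accuracy.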


### Simple NLP, complex problems

### Improving your model
Your job in this exercise is to test a few different alpha levels using the tf-idf vectors to determine whether there is a better-performing combination.

The training and test sets have been created, and **tfidf_vectorizer, tfidf_train, and tfidf_test** have been computed.


#### Instructions

- Create a list of alphas to try using np.arange(). Values should range from 0 up to, but not including, 1 in steps of 0.1.
- Create a function train_and_predict() that takes in one argument: alpha. The function should:
  - Instantiate a MultinomialNB classifier with alpha=alpha.
  - Fit it to the training data.
  - Compute predictions on the test data.
  - Compute and return the accuracy score.
- Using a for loop, print the alpha, score and a newline in between. Use your train_and_predict() function to compute the score. Does the score change along with the alpha? What is the best alpha?
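
The alpha parameter controls additive smoothing: alpha is added to every feature count before the counts are turned into probabilities. The sketch below shows the effect on a term never seen in a class and on a frequent term. The function name smoothed_probability and all the numbers are hypothetical.

```python
def smoothed_probability(count, class_total, vocab_size, alpha):
    # Additive (Laplace/Lidstone) smoothing for one term within one class
    return (count + alpha) / (class_total + alpha * vocab_size)

# Hypothetical numbers: a 1,000-term vocabulary and 5,000 term occurrences in the class
vocab_size, class_total = 1_000, 5_000

for alpha in (0.1, 0.5, 1.0):
    unseen = smoothed_probability(0, class_total, vocab_size, alpha)
    frequent = smoothed_probability(50, class_total, vocab_size, alpha)
    print(f"alpha={alpha}: unseen={unseen:.6f}, frequent={frequent:.6f}")
```

A larger alpha moves probability mass from observed terms to unseen ones, so the model becomes less sensitive to rare words. Whether that helps depends on the data, which is why the exercise tries several values.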



In [9]:
import numpy as np

# Create the list of alphas: alphas
alphas = np.arange(0, 1, 0.1)

# Track the best score seen so far
max_score = 0.0

# Define train_and_predict()
def train_and_predict(alpha):
    # Instantiate the classifier: nb_classifier
    nb_classifier = MultinomialNB(alpha=alpha)
    # Fit to the training data
    nb_classifier.fit(tfidf_train, y_train)
    # Predict the labels: pred
    pred = nb_classifier.predict(tfidf_test)
    # Compute accuracy: score
    score = metrics.accuracy_score(y_test, pred)
    return score

# Iterate over the alphas and print the corresponding score
for alpha in alphas:
    print('Alpha: ', alpha)
    score = train_and_predict(alpha)
    print('Score: ', score)
    if score > max_score:
        max_score = score
        print('Max Score is ',max_score, ' at Alpha = ', alpha)



Alpha:  0.0
Score:  0.8813964610234337
Max Score is  0.8813964610234337  at Alpha =  0.0
Alpha:  0.1
Score:  0.8976566236250598
Max Score is  0.8976566236250598  at Alpha =  0.1
Alpha:  0.2
Score:  0.8938307030129125
Alpha:  0.30000000000000004
Score:  0.8900047824007652
Alpha:  0.4
Score:  0.8857006217120995
Alpha:  0.5
Score:  0.8842659014825442
Alpha:  0.6000000000000001
Score:  0.874701099952176
Alpha:  0.7000000000000001
Score:  0.8703969392635102
Alpha:  0.8
Score:  0.8660927785748446
Alpha:  0.9
Score:  0.8589191774270684
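
The long decimals in the output above, such as 0.30000000000000004, come from floating-point arithmetic and do not indicate a problem with the model. The same values appear in plain Python when an index is multiplied by a float step. Dividing integers instead is one possible way to get cleaner values.

```python
# Multiplying by a float step reproduces the long decimals printed above
stepped = [i * 0.1 for i in range(10)]

# Dividing integers gives the closest float to each intended value
clean = [i / 10 for i in range(10)]

print(stepped[3], clean[3])
print(stepped[6], clean[6])
```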


### Inspecting your model
Now that you have built a "fake news" classifier, you'll investigate what it has learned. You can map the important vector weights back to actual words using some simple inspection techniques.

You have your well-performing tf-idf Naive Bayes classifier available as nb_classifier, and the vectorizer as tfidf_vectorizer.
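
Mapping weights back to words comes down to pairing each weight with its feature name and sorting the pairs. The sketch below uses made-up weights. It assumes, as in scikit-learn releases that expose coef_ on MultinomialNB, that for a two-class model the weights are log-probabilities of each feature under the second class label.

```python
# Hypothetical log-probabilities of four features under the second class label
class_labels = ["FAKE", "REAL"]
feature_names = ["aliens", "budget", "senate", "shocking"]
weights = [-9.2, -6.1, -5.8, -8.7]

# Descending sort: highest weights first, lowest weights last
feat_with_weights = sorted(zip(weights, feature_names), reverse=True)

# Highest weights: features most associated with the second class
print(class_labels[1], feat_with_weights[:2])

# Lowest weights: features least associated with the second class
print(class_labels[0], feat_with_weights[-2:])
```

Without reverse=True the order flips, so the slice paired with each label has to flip as well.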

#### Instructions

- Save the class labels as class_labels by accessing the .classes_ attribute of nb_classifier.
- Extract the features using the .get_feature_names() method of tfidf_vectorizer.
- Create a zipped array of the classifier coefficients with the feature names and sort them by the coefficients. To do this, first use zip() with the arguments nb_classifier.coef_[0] and feature_names. Then, use sorted() on this.
- Print the second label of class_labels with the 20 highest-weighted features, and the first label of class_labels with the 20 lowest-weighted features. This has been done for you.

In [11]:
# Get the class labels: class_labels
class_labels = nb_classifier.classes_

# Extract the features: feature_names
feature_names = tfidf_vectorizer.get_feature_names()

# Zip the feature names together with the coefficient array and sort by weights,
# highest first: feat_with_weights
# For a two-class model, coef_[0] holds the log-probabilities for class_labels[1]
feat_with_weights = sorted(zip(nb_classifier.coef_[0], feature_names), reverse=True)

# Print the second class label and the 20 highest-weighted entries
print(class_labels[1], feat_with_weights[:20])

# Print the first class label and the 20 lowest-weighted entries
print(class_labels[0], feat_with_weights[-20:])


REAL [(-6.328486029596207, 'trump'), (-6.5653954389926845, 'said'), (-6.724771332488041, 'clinton'), (-7.108263114902301, 'obama'), (-7.2166878130917755, 'sanders'), (-7.2854057032685775, 'president'), (-7.33649923948987, 'campaign'), (-7.35748985914622, 'republican'), (-7.410910239085596, 'state'), (-7.411148410203476, 'cruz'), (-7.448027933291952, 'party'), (-7.516447881078008, 'new'), (-7.545789237823644, 'people'), (-7.616556675728881, 'percent'), (-7.6246184189367, 'bush'), (-7.649398936153309, 'republicans'), (-7.654774992495461, 'house'), (-7.703583809227384, 'voters'), (-7.717550034444668, 'rubio'), (-7.742481952533027, 'states')]
FAKE [(-11.316312804238807, '023'), (-11.316312804238807, '020'), (-11.316312804238807, '01am'), (-11.316312804238807, '018'), (-11.316312804238807, '015'), (-11.316312804238807, '014'), (-11.316312804238807, '00pm'), (-11.316312804238807, '00p'), (-11.316312804238807, '00am'), (-11.316312804238807, '0099'), (-11.316312804238807, '008s'), (-11.3163128