# Natural Language Processing

In [1]:
my_string = "Let's write RegEx!  Won't that be fun?  I sure think so.  Can you find 4 sentences?  Or perhaps, all 19 words?"

In [5]:
import re  # import the regular expression module

# Practicing regular expressions: re.split() and re.findall()
Now you'll get a chance to write some regular expressions to match digits, strings and non-alphanumeric characters. Take a look at my_string first by printing it in the IPython Shell, to determine how you might best match the different steps.

Note: Prefix your regex patterns with r so that Python treats them as raw strings and passes backslashes to the regex engine unchanged. Otherwise, you may run into problems with escape sequences. For example, "\n" in a normal Python string is a newline, but with the r prefix, r"\n" is the raw string made of the character "\" followed by the character "n", not a newline.
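To make the note above concrete, here is a minimal sketch comparing a normal string with a raw string. The variable names and the sample text are illustrative only.

```python
import re

newline = "\n"   # one character: a line feed
raw = r"\n"      # two characters: a backslash followed by "n"
print(len(newline), len(raw))

# In a normal string "\b" is a backspace character, so the pattern never matches.
# In a raw string the regex engine receives \b, which means "word boundary".
text = "Can you find 4 sentences?"
without_prefix = re.findall("\bfind\b", text)
with_prefix = re.findall(r"\bfind\b", text)
print(without_prefix, with_prefix)
```

The second comparison is the kind of silent failure the r prefix protects against: the unprefixed pattern is valid, it just matches nothing.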

The regular expression module re has already been imported for you.

Remember from the video that the syntax for the regex library is to always pass the pattern first and the string second.

In [8]:
# Write a pattern to match sentence endings: sentence_endings
sentence_endings = r"[.?!]"

# Split my_string on sentence endings and print the result
print(re.split(sentence_endings, my_string))


["Let's write RegEx", "  Won't that be fun", '  I sure think so', '  Can you find 4 sentences', '  Or perhaps, all 19 words', '']


In [10]:
# Find all capitalized words in my_string and print the result
capitalized_words = r"[A-Z]\w+"
print(re.findall(capitalized_words, my_string))

['Let', 'RegEx', 'Won', 'Can', 'Or']
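The output above shows two side effects of the pattern `[A-Z]\w+`: it stops at the apostrophe (`Let`, `Won`) and it skips the one-letter word `I`, because `\w+` requires at least one more character. The variants below are one possible way to relax it, not the exercise's expected answer.

```python
import re

my_string = "Let's write RegEx!  Won't that be fun?  I sure think so.  Can you find 4 sentences?  Or perhaps, all 19 words?"

# Original pattern: a capital letter followed by one or more word characters.
strict = re.findall(r"[A-Z]\w+", my_string)

# Allow zero following characters so single-letter words such as "I" match.
with_single = re.findall(r"[A-Z]\w*", my_string)

# Also allow apostrophes inside the word so contractions stay whole.
with_apostrophes = re.findall(r"[A-Z][\w']*", my_string)

print(strict)
print(with_single)
print(with_apostrophes)
```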


In [9]:
# Split my_string on spaces and print the result
spaces = r"\s+"
print(re.split(spaces, my_string))

["Let's", 'write', 'RegEx!', "Won't", 'that', 'be', 'fun?', 'I', 'sure', 'think', 'so.', 'Can', 'you', 'find', '4', 'sentences?', 'Or', 'perhaps,', 'all', '19', 'words?']


In [7]:
# Find all digits in my_string and print the result
digits = r"\d+"
print(re.findall(digits, my_string))

['4', '19']


# CountVectorizer for text classification
It's time to begin building your text classifier! The data has been loaded into a DataFrame called df. Explore it in the IPython Shell to investigate what columns you can use. The .head() method is particularly informative.

In this exercise, you'll use pandas alongside scikit-learn to create a sparse text vectorizer you can use to train and test a simple supervised model. To begin, you'll set up a CountVectorizer and investigate some of its features.
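As a rough picture of what a count-based vectorizer produces, the pure-Python sketch below builds a tiny bag-of-words matrix by hand. The corpus and stop-word list are made up, and the tokenizer (lowercasing, tokens of two or more word characters) only approximates CountVectorizer's defaults; it is not scikit-learn's implementation.

```python
import re
from collections import Counter

# Hypothetical mini-corpus and stop-word list, for illustration only.
docs = ["The cat sat on the mat", "The dog chased the cat"]
stop_words = {"the", "on"}

def tokenize(text):
    # Lowercase, keep tokens of 2+ word characters, drop stop words.
    tokens = re.findall(r"\b\w\w+\b", text.lower())
    return [t for t in tokens if t not in stop_words]

# One column per distinct token, in sorted order.
vocabulary = sorted({tok for doc in docs for tok in tokenize(doc)})

# One row per document, holding the count of each vocabulary term.
matrix = [[Counter(tokenize(doc))[term] for term in vocabulary] for doc in docs]

print(vocabulary)
for row in matrix:
    print(row)
```

Most entries in a real matrix of this kind are zero, which is why the vectorizer returns a sparse matrix rather than a dense array.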

In [32]:
import pandas as pd
import numpy as np

df = pd.read_csv("fake_or_real_news.csv", dtype=str)
df = df[['Unnamed: 0','title', 'text','label']]

df['title'] = df['title'].astype(str)
df['text'] = df['text'].astype(str)
df['label'] = df['label'].astype(str)

In [33]:
df.dtypes

Unnamed: 0    object
title         object
text          object
label         object
dtype: object

In [34]:
# Import the necessary modules
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

# Print the head of df
print(df.head())

# Create a series to store the labels: y
y = df.label

# Create training and test sets
X_train, X_test, y_train, y_test = train_test_split(df['text'], y, test_size=0.33, random_state=53)


  Unnamed: 0                                              title  \
0       8476                       You Can Smell Hillary’s Fear   
1      10294  Watch The Exact Moment Paul Ryan Committed Pol...   
2       3608        Kerry to go to Paris in gesture of sympathy   
3      10142  Bernie supporters on Twitter erupt in anger ag...   
4        875   The Battle of New York: Why This Primary Matters   

                                                text label  
0  Daniel Greenfield, a Shillman Journalism Fello...  FAKE  
1  Google Pinterest Digg Linkedin Reddit Stumbleu...  FAKE  
2  U.S. Secretary of State John F. Kerry said Mon...  REAL  
3  — Kaydee King (@KaydeeKing) November 9, 2016 T...  FAKE  
4  It's primary day in New York and front-runners...  REAL  


Import CountVectorizer from sklearn.feature_extraction.text and train_test_split from sklearn.model_selection. <br>
Create a Series y to use for the labels by assigning the .label attribute of df to y.<br>
Using df["text"] (features) and y (labels), create training and test sets using train_test_split(). Use a test_size of 0.33 and a random_state of 53.<br>
Create a CountVectorizer object called count_vectorizer. Ensure you specify the keyword argument stop_words="english" so that stop words are removed.<br>
Fit and transform the training data X_train using the .fit_transform() method of your CountVectorizer object. Do the same with the test data X_test, except using the .transform() method.<br>
Print the first 10 features of the count_vectorizer using its .get_feature_names() method.
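The reason for calling .fit_transform() on the training data but only .transform() on the test data is that the vocabulary should come from the training set alone. A minimal hand-rolled sketch of that idea follows; the documents and the `transform` helper are hypothetical and tokenization is reduced to whitespace splitting.

```python
# Hypothetical training and test documents, for illustration only.
train_docs = ["fake news spreads fast", "real news takes time"]
test_docs = ["breaking news spreads"]

# "Fit": learn the vocabulary from the training documents only.
vocabulary = sorted({word for doc in train_docs for word in doc.split()})

def transform(doc):
    # "Transform": count only words that are already in the vocabulary.
    words = doc.split()
    return [words.count(term) for term in vocabulary]

test_vector = transform(test_docs[0])
print(vocabulary)
print(test_vector)
```

The word "breaking" never appeared in training, so it has no column and is simply ignored. This keeps the training and test matrices the same width.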

In [35]:
# Initialize a CountVectorizer object: count_vectorizer
count_vectorizer = CountVectorizer(stop_words='english')

# Transform the training data using only the 'text' column values: count_train 
count_train = count_vectorizer.fit_transform(X_train)

# Transform the test data using only the 'text' column values: count_test 
count_test = count_vectorizer.transform(X_test)

# Print the first 10 features of the count_vectorizer
# (scikit-learn 1.2+ removed get_feature_names(); use get_feature_names_out() there)
print(count_vectorizer.get_feature_names()[:10])


['00', '000', '0000', '000000031', '00000031', '0001', '0002', '000ft', '000x', '001']


# TfidfVectorizer for text classification
Import TfidfVectorizer from sklearn.feature_extraction.text. <br>
Create a TfidfVectorizer object called tfidf_vectorizer. When doing so, specify the keyword arguments stop_words="english" and max_df=0.7.<br>
Fit and transform the training data.<br>
Transform the test data.<br>
Print the first 10 features of tfidf_vectorizer.<br>
Print the first 5 vectors of the tfidf training data using slicing on the .A (or array) attribute of tfidf_train.
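To see what the two keyword arguments and the resulting weights mean, here is a small hand computation. It assumes the smoothed formula idf(t) = ln((1 + n) / (1 + df(t))) + 1 followed by L2 normalization of each row, which to my knowledge matches TfidfVectorizer's defaults. The corpus is made up and already tokenized.

```python
import math

# Hypothetical tokenized documents, for illustration only.
docs = [["cat", "sat", "mat"], ["dog", "chased", "cat"], ["cat", "cat", "purred"]]
n_docs = len(docs)

vocabulary = sorted({t for d in docs for t in d})
doc_freq = {t: sum(t in d for d in docs) for t in vocabulary}

# max_df=0.7: drop terms that occur in more than 70% of the documents.
max_df = 0.7
kept = [t for t in vocabulary if doc_freq[t] / n_docs <= max_df]

# Smoothed inverse document frequency (assumed default formula).
idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in kept}

def tfidf_row(doc):
    # Term frequency times idf, then scale the row to unit length.
    weights = [doc.count(t) * idf[t] for t in kept]
    norm = math.sqrt(sum(w * w for w in weights))
    return [w / norm if norm else 0.0 for w in weights]

rows = [tfidf_row(d) for d in docs]
print(kept)
for row in rows:
    print([round(w, 4) for w in row])
```

Here "cat" appears in every document, so the max_df filter removes it, much as it would remove a word that is too common to help tell FAKE from REAL.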


In [41]:
# Import TfidfVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

# Initialize a TfidfVectorizer object: tfidf_vectorizer
tfidf_vectorizer = TfidfVectorizer(stop_words='english', max_df=0.7)

# Transform the training data: tfidf_train 
tfidf_train = tfidf_vectorizer.fit_transform(X_train)

# Transform the test data: tfidf_test 
tfidf_test = tfidf_vectorizer.transform(X_test)

# Print the first 10 features
print(tfidf_vectorizer.get_feature_names()[:10])

# Print the first 5 vectors of the tfidf training data
print(tfidf_train.A[:5])


['00', '000', '0000', '000000031', '00000031', '0001', '0002', '000ft', '000x', '001']
[[0.         0.06262927 0.         ... 0.         0.         0.        ]
 [0.         0.         0.         ... 0.         0.         0.        ]
 [0.         0.         0.         ... 0.         0.         0.        ]
 [0.         0.         0.         ... 0.         0.         0.        ]
 [0.         0.         0.         ... 0.         0.         0.        ]]


# Inspecting the vectors
To get a better idea of how the vectors work, you'll investigate them by converting them into pandas DataFrames. <br>

Create the DataFrames count_df and tfidf_df by using pd.DataFrame() and specifying the values as the first argument and the columns (or features) as the second argument. <br>
The values can be accessed by using the .A attribute of, respectively, count_train and tfidf_train. <br>
The columns can be accessed using the .get_feature_names() methods of count_vectorizer and tfidf_vectorizer. <br>
Print the head of each DataFrame to investigate their structure. This has been done for you. <br>
Test if the column names are the same for each DataFrame by creating a new object called difference to see the difference between the columns that count_df has from tfidf_df. Columns can be accessed using the .columns attribute of a DataFrame. Subtract the set of tfidf_df.columns from the set of count_df.columns. <br>
Test if the two DataFrames are equivalent by using the .equals() method on count_df with tfidf_df as the argument. <br>

In [43]:
# Create the CountVectorizer DataFrame: count_df
count_df = pd.DataFrame(count_train.A, columns=count_vectorizer.get_feature_names())

# Create the TfidfVectorizer DataFrame: tfidf_df
tfidf_df = pd.DataFrame(tfidf_train.A, columns=tfidf_vectorizer.get_feature_names())

# Print the head of count_df
print(count_df.head())



   00  000  0000  000000031  00000031  0001  0002  000ft  000x  001  ...  שתי  \
0   0    7     0          0         0     0     0      0     0    0  ...    0   
1   0    0     0          0         0     0     0      0     0    0  ...    0   
2   0    0     0          0         0     0     0      0     0    0  ...    0   
3   0    0     0          0         0     0     0      0     0    0  ...    0   
4   0    0     0          0         0     0     0      0     0    0  ...    0   

   תאמצנה  תוצאה  תחל  תיירות  תנותק  תעודת  תתרכז  عربي  ยงade  
0       0      0    0       0      0      0      0     0      0  
1       0      0    0       0      0      0      0     0      0  
2       0      0    0       0      0      0      0     0      0  
3       0      0    0       0      0      0      0     0      0  
4       0      0    0       0      0      0      0     0      0  

[5 rows x 56478 columns]


In [44]:
# Print the head of tfidf_df
print(tfidf_df.head())


    00       000  0000  000000031  00000031  0001  0002  000ft  000x  001  \
0  0.0  0.062629   0.0        0.0       0.0   0.0   0.0    0.0   0.0  0.0   
1  0.0  0.000000   0.0        0.0       0.0   0.0   0.0    0.0   0.0  0.0   
2  0.0  0.000000   0.0        0.0       0.0   0.0   0.0    0.0   0.0  0.0   
3  0.0  0.000000   0.0        0.0       0.0   0.0   0.0    0.0   0.0  0.0   
4  0.0  0.000000   0.0        0.0       0.0   0.0   0.0    0.0   0.0  0.0   

   ...  שתי  תאמצנה  תוצאה  תחל  תיירות  תנותק  תעודת  תתרכז  عربي  ยงade  
0  ...  0.0     0.0    0.0  0.0     0.0    0.0    0.0    0.0   0.0    0.0  
1  ...  0.0     0.0    0.0  0.0     0.0    0.0    0.0    0.0   0.0    0.0  
2  ...  0.0     0.0    0.0  0.0     0.0    0.0    0.0    0.0   0.0    0.0  
3  ...  0.0     0.0    0.0  0.0     0.0    0.0    0.0    0.0   0.0    0.0  
4  ...  0.0     0.0    0.0  0.0     0.0    0.0    0.0    0.0   0.0    0.0  

[5 rows x 56478 columns]


In [46]:
# Calculate the difference in columns: difference
difference = set(count_df.columns) - set(tfidf_df.columns)
print(difference)

# Check whether the DataFrames are equal
print(count_df.equals(tfidf_df))

set()
False
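An empty set together with False suggests the two DataFrames share their column names but hold different values (raw counts versus tf-idf weights). Note that set subtraction is one-directional: it only reports columns present on the left and missing on the right. The sketch below uses made-up column sets to show the difference between that and a symmetric check.

```python
# Hypothetical column sets, for illustration only.
count_columns = {"00", "000", "news", "said"}
tfidf_columns = {"00", "000", "news"}  # assume "said" was filtered out

only_in_count = count_columns - tfidf_columns   # left minus right
only_in_tfidf = tfidf_columns - count_columns   # right minus left
in_exactly_one = count_columns ^ tfidf_columns  # symmetric difference

print(only_in_count, only_in_tfidf, in_exactly_one)
```

If both directions matter, the symmetric difference (`^`) covers them in one step.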


# Training and testing the "fake news" model with CountVectorizer
Import the metrics module from sklearn and MultinomialNB from sklearn.naive_bayes. <br>
Instantiate a MultinomialNB classifier called nb_classifier.<br>
Fit the classifier to the training data.<br>
Compute the predicted tags for the test data.<br>
Calculate and print the accuracy score of the classifier.<br>
Compute the confusion matrix. To make it easier to read, specify the keyword argument labels=['FAKE', 'REAL'].

In [47]:
from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics

# Instantiate a Multinomial Naive Bayes classifier: nb_classifier
nb_classifier = MultinomialNB()

# Fit the classifier to the count vectors and training labels
nb_classifier.fit(count_train, y_train)  # count_train is the vectorized version of X_train

# Use the classifier to predict labels for the test data
pred = nb_classifier.predict(count_test)

# Get accuracy
metrics.accuracy_score(y_test, pred)

0.8453167508744656

In [54]:
# Calculate the confusion matrix: cm
cm = metrics.confusion_matrix(y_test, pred, labels=['FAKE', 'REAL'])
print(cm)

print("True Positive (TP) : Observation is positive, and is predicted to be positive. This happened 905 times")
print("False Negative (FN) : Observation is positive, but is predicted negative. This happened 146 times")
print("True Negative (TN) : Observation is negative, and is predicted to be negative. This happened 956 times")
print("False Positive (FP) : Observation is negative, but is predicted positive. This happened 79 times")

[[905 146]
 [ 79 956]]
True Positive (TP) : Observation is positive, and is predicted to be positive. This happened 905 times
False Negative (FN) : Observation is positive, but is predicted negative. This happened 146 times
True Negative (TN) : Observation is negative, and is predicted to be negative. This happened 956 times
False Positive (FP) : Observation is negative, but is predicted positive. This happened 79 times
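The layout of the matrix follows from the labels argument: rows are the true labels and columns the predicted labels, both in the order given. The sketch below rebuilds a confusion matrix by hand on a small made-up set of labels and derives accuracy, precision and recall from it, treating FAKE as the positive class. It is an illustration, not the data from this exercise.

```python
# Hypothetical true and predicted labels, for illustration only.
labels = ["FAKE", "REAL"]
y_true = ["FAKE", "FAKE", "FAKE", "REAL", "REAL", "REAL", "REAL", "FAKE"]
y_pred = ["FAKE", "REAL", "FAKE", "REAL", "FAKE", "REAL", "REAL", "FAKE"]

# cm[i][j] counts observations with true label i predicted as label j.
index = {label: i for i, label in enumerate(labels)}
cm = [[0, 0], [0, 0]]
for truth, guess in zip(y_true, y_pred):
    cm[index[truth]][index[guess]] += 1

(tp, fn), (fp, tn) = cm  # FAKE is treated as the positive class

accuracy = (tp + tn) / (tp + fn + fp + tn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
print(cm, accuracy, precision, recall)
```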


# Training and testing the "fake news" model with TfidfVectorizer
Now that you have evaluated the model using the CountVectorizer, you'll do the same using the TfidfVectorizer with a Naive Bayes model.

In [55]:
# Create a Multinomial Naive Bayes classifier: nb_classifier
nb_classifier = MultinomialNB()

# Fit the classifier to the training data
nb_classifier.fit(tfidf_train, y_train)

# Create the predicted tags: pred
pred = nb_classifier.predict(tfidf_test)

# Calculate the accuracy score: score
score = metrics.accuracy_score(y_test, pred)
print(score)

# Calculate the confusion matrix: cm
cm = metrics.confusion_matrix(y_test, pred, labels=['FAKE', 'REAL'])
print(cm)


0.792071511853867
[[ 713  339]
 [  21 1017]]


# Improving your model
Your job in this exercise is to test a few different alpha levels using the Tfidf vectors to determine if there is a better performing combination. <br>

Create a list of alphas to try using np.arange(). Values should range from 0 to 1 with steps of 0.1.<br>
Create a function train_and_predict() that takes in one argument: alpha. The function should:<br>
Instantiate a MultinomialNB classifier with alpha=alpha.<br>
Fit it to the training data.<br>
Compute predictions on the test data.<br>
Compute and return the accuracy score.<br>
Using a for loop, print the alpha, score and a newline in between.<br>
Use your train_and_predict() function to compute the score. Does the score change along with the alpha? What is the best alpha?
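For intuition about what alpha controls, the sketch below applies additive smoothing, (count + alpha) / (total + alpha * vocabulary size), to made-up word counts for a single class. This is the textbook form of the smoothing that a multinomial Naive Bayes model uses. The function name and the counts are hypothetical.

```python
# Hypothetical word counts for one class, for illustration only.
word_counts = {"election": 3, "senate": 1, "shocking": 0}

def smoothed_probs(counts, alpha):
    """Additive smoothing: (count + alpha) / (total + alpha * vocab_size)."""
    total = sum(counts.values())
    vocab_size = len(counts)
    return {w: (c + alpha) / (total + alpha * vocab_size) for w, c in counts.items()}

for alpha in (0.0, 0.1, 1.0):
    probs = smoothed_probs(word_counts, alpha)
    print(alpha, {w: round(p, 3) for w, p in probs.items()})
```

With alpha at 0, an unseen word gets probability zero, which would wipe out the whole product of word probabilities for that class. Larger alphas pull all words toward the same probability, so the best value is usually found by trying several.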

In [56]:
# Create the list of alphas: alphas
alphas = np.arange(0, 1, .1)

# Define train_and_predict()
def train_and_predict(alpha):
    # Instantiate the classifier: nb_classifier
    nb_classifier = MultinomialNB(alpha=alpha)
    # Fit to the training data
    nb_classifier.fit(tfidf_train, y_train)
    # Predict the labels: pred
    pred = nb_classifier.predict(tfidf_test)
    # Compute accuracy: score
    score = metrics.accuracy_score(y_test, pred)
    return score

# Iterate over the alphas and print the corresponding score
for alpha in alphas:
    print('Alpha: ', alpha)
    print('Score: ', train_and_predict(alpha))
    print()
    

Alpha:  0.0
Score:  0.8449280994947532

Alpha:  0.1
Score:  0.8530897784687136

Alpha:  0.2
Score:  0.8422075398367664

Alpha:  0.30000000000000004
Score:  0.8359891177613681

Alpha:  0.4
Score:  0.8286047415468325

Alpha:  0.5
Score:  0.8208317139525846

Alpha:  0.6000000000000001
Score:  0.8150019432568986

Alpha:  0.7000000000000001
Score:  0.8080062184220754

Alpha:  0.8
Score:  0.8006218422075398

Alpha:  0.9
Score:  0.7982899339292654
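Rather than reading the best alpha off the printout, it can be picked programmatically. The sketch below copies the scores printed above into a dictionary (the dictionary itself is just for illustration) and also shows one way to build the alpha grid without floating-point artifacts such as 0.30000000000000004.

```python
# Scores copied from the output above, keyed by rounded alpha.
scores = {
    0.0: 0.8449280994947532,
    0.1: 0.8530897784687136,
    0.2: 0.8422075398367664,
    0.3: 0.8359891177613681,
    0.4: 0.8286047415468325,
    0.5: 0.8208317139525846,
    0.6: 0.8150019432568986,
    0.7: 0.8080062184220754,
    0.8: 0.8006218422075398,
    0.9: 0.7982899339292654,
}

# The alpha whose score is highest.
best_alpha = max(scores, key=scores.get)
print(best_alpha, scores[best_alpha])

# A grid of clean one-decimal alphas, built from integers.
alphas = [round(0.1 * i, 1) for i in range(10)]
print(alphas)
```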



# Best model is with an alpha of 0.1 above (score of about 0.853)

# Inspecting your model
Now that you have built a "fake news" classifier, you'll investigate what it has learned. You can map the important vector weights back to actual words using some simple inspection techniques.

In [60]:
# Get the class labels: class_labels
class_labels = nb_classifier.classes_

# Extract the features: feature_names
feature_names = tfidf_vectorizer.get_feature_names()

# Zip the feature names together with the coefficient array and sort by weights: feat_with_weights
feat_with_weights = sorted(zip(nb_classifier.coef_[0], feature_names))

# Print the first class label and the first 20 (lowest-weight) feat_with_weights entries
print(class_labels[0], feat_with_weights[:20])


  [(-10.941624166627946, '00'), (-10.941624166627946, '000'), (-10.941624166627946, '0000'), (-10.941624166627946, '000000031'), (-10.941624166627946, '00000031'), (-10.941624166627946, '0001'), (-10.941624166627946, '0002'), (-10.941624166627946, '000ft'), (-10.941624166627946, '000x'), (-10.941624166627946, '001'), (-10.941624166627946, '003'), (-10.941624166627946, '004'), (-10.941624166627946, '005'), (-10.941624166627946, '00684'), (-10.941624166627946, '006s'), (-10.941624166627946, '007'), (-10.941624166627946, '007s'), (-10.941624166627946, '008'), (-10.941624166627946, '008s'), (-10.941624166627946, '009')]


In [61]:
# Print the second class label and the last 20 (highest-weight) feat_with_weights entries
print(class_labels[1], feat_with_weights[-20:])


 1948. [57]  [(-10.941624166627946, 'שינוי'), (-10.941624166627946, 'שיתעקש'), (-10.941624166627946, 'שכל'), (-10.941624166627946, 'שכמוני'), (-10.941624166627946, 'של'), (-10.941624166627946, 'שלו'), (-10.941624166627946, 'שנדרש'), (-10.941624166627946, 'שני'), (-10.941624166627946, 'שעת'), (-10.941624166627946, 'שתי'), (-10.941624166627946, 'תאמצנה'), (-10.941624166627946, 'תוצאה'), (-10.941624166627946, 'תחל'), (-10.941624166627946, 'תיירות'), (-10.941624166627946, 'תנותק'), (-10.941624166627946, 'תעודת'), (-10.941624166627946, 'תתרכז'), (-10.941624166627946, 'عربي'), (-10.941624166627946, 'ยงade'), (-10.248476986068, 'nan')]
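The sorted(zip(...)) idiom used above can be tried on a toy example. Tuples sort by their first element and fall back to the second on ties, which would explain why entries that share one weight come out in plain alphabetical order. The feature names and weights below are made up.

```python
# Hypothetical features and log-probability weights, for illustration only.
feature_names = ["election", "shocking", "senate", "share", "policy"]
weights = [-6.1, -9.4, -5.8, -9.0, -6.5]

# Pair each weight with its feature and sort ascending by weight.
feat_with_weights = sorted(zip(weights, feature_names))

lowest = feat_with_weights[:2]    # most negative weights
highest = feat_with_weights[-2:]  # least negative weights
print(lowest)
print(highest)

# Ties on the weight are broken by the feature name.
tied = sorted(zip([-1.0, -1.0, -1.0], ["zebra", "apple", "mango"]))
print(tied)
```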
