<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Classifying-Fake-News-using-Supervised-Learning-with-NLP" data-toc-modified-id="Classifying-Fake-News-using-Supervised-Learning-with-NLP-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Classifying Fake News using Supervised Learning with NLP</a></span><ul class="toc-item"><li><span><a href="#Supervised-learning-steps" data-toc-modified-id="Supervised-learning-steps-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>Supervised learning steps</a></span></li></ul></li><li><span><a href="#Building-word-count-vectors-with-scikit-learn" data-toc-modified-id="Building-word-count-vectors-with-scikit-learn-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Building word count vectors with scikit-learn</a></span></li><li><span><a href="#Training-and-Testing-a-Classification-Model-with-scikit-learn" data-toc-modified-id="Training-and-Testing-a-Classification-Model-with-scikit-learn-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Training and Testing a Classification Model with scikit-learn</a></span></li><li><span><a href="#Simple-NLP,-Complex-Problems" data-toc-modified-id="Simple-NLP,-Complex-Problems-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Simple NLP, Complex Problems</a></span></li><li><span><a href="#Extra-Links" data-toc-modified-id="Extra-Links-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Extra Links</a></span></li></ul></div>

## Classifying Fake News using Supervised Learning with NLP

- How to create supervised learning data from text?
  - Use bag-of-words models or tf-idf as features

### Supervised learning steps

- Collect and preprocess our data
- Determine a label (Example: Movie genre)
- Split data into training and test sets
- Extract features from the text to help predict the label
  - Bag-of-words vector built into scikit-learn
- Evaluate trained model using the test set

```
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer

df = ...  # Load data into DataFrame

# y traditionally refers to the labels or outcome you want the model to learn
y = df['Sci-Fi']

# Hold out a third of the data for testing; random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(df['plot'],
                                                    y,
                                                    test_size=0.33,
                                                    random_state=53
                                                    )
# Training data: X_train
# Training labels: y_train

# Learn the vocabulary from the training data only, then reuse it on the test data
count_vectorizer = CountVectorizer(stop_words='english')
count_train = count_vectorizer.fit_transform(X_train.values)
count_test = count_vectorizer.transform(X_test.values)
```

## Building word count vectors with scikit-learn

```
# Import the necessary modules
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

# Print the head of df
print(df.head())

# Create a series to store the labels: y
y = df.label

# Create training and test sets
X_train, X_test, y_train, y_test = train_test_split(df["text"], y, test_size=0.33, random_state=53)

# Initialize a CountVectorizer object: count_vectorizer
count_vectorizer = CountVectorizer(stop_words="english")

# Transform the training data using only the 'text' column values: count_train 
count_train = count_vectorizer.fit_transform(X_train.values)

# Transform the test data using only the 'text' column values: count_test 
count_test = count_vectorizer.transform(X_test.values)

# Print the first 10 features of the count_vectorizer
# (get_feature_names() was removed in newer scikit-learn releases)
print(count_vectorizer.get_feature_names_out()[:10])
```

```
# Import TfidfVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

# Initialize a TfidfVectorizer object: tfidf_vectorizer
tfidf_vectorizer = TfidfVectorizer(stop_words="english", max_df=0.7)

# Transform the training data: tfidf_train 
tfidf_train = tfidf_vectorizer.fit_transform(X_train.values)

# Transform the test data: tfidf_test 
tfidf_test = tfidf_vectorizer.transform(X_test.values)

# Print the first 10 features
print(tfidf_vectorizer.get_feature_names_out()[:10])

# Print the first 5 vectors of the tfidf training data as a dense array
print(tfidf_train.toarray()[:5])
```

```
# Create the CountVectorizer DataFrame: count_df
count_df = pd.DataFrame(count_train.toarray(), columns=count_vectorizer.get_feature_names_out())

# Create the TfidfVectorizer DataFrame: tfidf_df
tfidf_df = pd.DataFrame(tfidf_train.toarray(), columns=tfidf_vectorizer.get_feature_names_out())

# Print the head of count_df
print(count_df.head())

# Print the head of tfidf_df
print(tfidf_df.head())

# Calculate the difference in columns: difference
difference = set(count_df.columns) - set(tfidf_df.columns)
print(difference)

# Check whether the DataFrames are equal
print(count_df.equals(tfidf_df))
```

## Training and Testing a Classification Model with scikit-learn

Naive Bayes classifier

Examples:
- If the plot has a spaceship, how likely is it to be sci-fi?
- Given a spaceship and an alien, how likely now is it sci-fi?

Each word from `CountVectorizer` acts as a feature
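
The two example questions above can be sketched on a tiny, made-up set of plots. This is an illustration under stated assumptions, not the course dataset: `plots`, `labels`, and `new_plots` are hypothetical, and the probabilities only reflect these four training sentences.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy plots and labels, for illustration only
plots = [
    "a spaceship crew meets an alien",
    "an alien invades from a distant planet",
    "a detective hunts a killer in the city",
    "a lawyer defends a killer in court",
]
labels = ["sci-fi", "sci-fi", "crime", "crime"]

# Each remaining word becomes one count feature
vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(plots)

clf = MultinomialNB()
clf.fit(X, labels)

# Class probabilities for two unseen plots; words outside the
# training vocabulary are ignored by transform()
new_plots = ["a spaceship arrives", "a spaceship and an alien arrive"]
probs = clf.predict_proba(vectorizer.transform(new_plots))

sci_fi_col = list(clf.classes_).index("sci-fi")
for plot, p in zip(new_plots, probs):
    print(f"{plot!r}: P(sci-fi) = {p[sci_fi_col]:.2f}")
```

Seeing both "spaceship" and "alien" gives a higher sci-fi probability than seeing "spaceship" alone, because each word contributes its own evidence for the class.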

```
from sklearn.naive_bayes import MultinomialNB
# Evaluate model performance
from sklearn import metrics

nb_classifier = MultinomialNB()

# Fit on the training count vectors and training labels
nb_classifier.fit(count_train, y_train)

# Predict labels for the test count vectors
pred = nb_classifier.predict(count_test)
```

**Evaluation**

```
# test accuracy
metrics.accuracy_score(y_test, pred)

# check the confusion matrix
metrics.confusion_matrix(y_test, pred, labels=['FAKE', 'REAL'])
```
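
To show how the confusion matrix is laid out, here is a small sketch with made-up labels (`y_true` and `y_pred` are hypothetical, not model output). With `labels=["FAKE", "REAL"]`, rows are the true labels and columns are the predicted labels, in that order.

```python
from sklearn import metrics

# Hypothetical true and predicted labels, for illustration only
y_true = ["FAKE", "FAKE", "FAKE", "REAL", "REAL", "REAL"]
y_pred = ["FAKE", "FAKE", "REAL", "REAL", "REAL", "FAKE"]

accuracy = metrics.accuracy_score(y_true, y_pred)
cm = metrics.confusion_matrix(y_true, y_pred, labels=["FAKE", "REAL"])

# Row 0: true FAKE articles; row 1: true REAL articles
fake_as_fake, fake_as_real = cm[0]
real_as_fake, real_as_real = cm[1]

print(f"accuracy: {accuracy:.2f}")
print(cm)
```

The diagonal holds the correct predictions, so accuracy is the diagonal sum divided by the total number of samples.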

```
import numpy as np

# Create the list of alphas: alphas
# (alpha=0 means no smoothing and may trigger a warning in scikit-learn)
alphas = np.arange(0, 1, 0.1)

# Define train_and_predict()
def train_and_predict(alpha):
    # Instantiate the classifier: nb_classifier
    nb_classifier = MultinomialNB(alpha=alpha)
    # Fit to the training data
    nb_classifier.fit(tfidf_train, y_train)
    # Predict the labels: pred
    pred = nb_classifier.predict(tfidf_test)
    # Compute accuracy: score
    score = metrics.accuracy_score(y_test, pred)
    return score

# Iterate over the alphas and print the corresponding score
for alpha in alphas:
    print('Alpha: ', alpha)
    print('Score: ', train_and_predict(alpha))
    print()
```
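
As a rough sketch of what `alpha` controls: it is the additive smoothing strength, so a word never seen in a class still gets a small non-zero probability. The tiny count matrix below is made up, and `smoothed_probs` is a hypothetical helper name used only for this illustration.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Hypothetical word counts: 2 documents (rows) x 3 vocabulary words (columns)
X = np.array([[2, 1, 0],
              [0, 1, 3]])
y = np.array(["FAKE", "REAL"])

def smoothed_probs(alpha):
    """Return P(word | class) for each class (rows) and word (columns)."""
    clf = MultinomialNB(alpha=alpha).fit(X, y)
    return np.exp(clf.feature_log_prob_)

# The third word never occurs in the FAKE document; its probability
# under FAKE is (0 + alpha) / (3 + alpha * 3)
for alpha in (0.1, 1.0):
    print("Alpha:", alpha)
    print(smoothed_probs(alpha).round(3))
```

Larger values of `alpha` pull all word probabilities toward a uniform distribution, while smaller values trust the observed counts more, which is why accuracy can change as `alpha` varies.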

## Simple NLP, Complex Problems

- Translation https://twitter.com/Lupintweets/status/865533182455685121

- Sentiment Analysis https://nlp.stanford.edu/projects/socialsent/
![wordsentiment](img/wordsentiment.png)

- Language Biases https://www.youtube.com/watch?v=j7FwpZB1hWc

## Extra Links

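[Study plan | A getting-started roadmap for machine learning and NLP](https://www.pkudodo.com/2019/03/20/1-10/)

[Interview notes | Reflections on NLP interviews at Microsoft, Toutiao, Didi, and iQiyi](https://www.pkudodo.com/2019/03/10/1-9/)

[Why has natural language processing (NLP) developed more slowly than computer vision (CV)?](https://www.zhihu.com/question/295962495)

[Which areas of natural language processing are suitable for independent research?](https://www.zhihu.com/question/335289475)

[A simple way to call functions defined in a Jupyter notebook file](https://blog.csdn.net/wangjian1204/article/details/67633614/)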