# Introduction to Text Mining with NLTK

A short introduction to processing textual data, with some basic applications in sentiment analysis.

# Basic Setup


Install the nltk library for text processing and download the resources it requires. We also install the textblob library, which we use later for sentiment analysis.

In [None]:
!pip install nltk
import nltk
nltk.download('punkt')
nltk.download('punkt_tab')  # tokenizer data that newer NLTK releases look for instead of 'punkt'
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('omw-1.4')

!pip install textblob

In [None]:
# we import a series of specific functions from the nltk package for processing the texts.
from nltk.corpus import stopwords
from nltk.tokenize import RegexpTokenizer
from nltk.stem import WordNetLemmatizer
from nltk.stem import SnowballStemmer
from nltk import FreqDist

# we import pandas for reading in files
import pandas as pd

# textblob for performing more advanced sentiment analysis
from textblob import TextBlob

## Read in the data

In [None]:
corpus = pd.read_csv("https://github.com/casbdai/notebooks2023/raw/main/Module2/Textmining/fake_news.csv")
corpus.head()

We extract the second document and save it as an object named `text`.

In [None]:
______
print(______)

## Pre-Processing Textual Data

### Convert text to lower case:

In [None]:
lower_text = ______.______
print(lower_text)

### Tokenize text

Break the text down into tokens, i.e., split the sentences into single words for analysis.

In [None]:
word_tokens = ______.______
print(word_tokens)

We need a better tokenizer, because punctuation and numbers are also retained as tokens. Very short words are turned into tokens as well.


In [None]:
better_tokenizer = RegexpTokenizer(r'[a-zA-Z]{3,}')

# [a-zA-Z] means that only letters are retained as tokens
# {3,} means that only tokens with at least three characters are retained
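To see what this pattern keeps and drops, here is a minimal sketch that applies the same regular expression with Python's built-in `re` module instead of NLTK. The sample sentence is made up for illustration. With its default settings, `RegexpTokenizer` should behave much like `re.findall` for a simple pattern like this one.

```python
import re

# Hypothetical sample sentence, not taken from the data set
sample = "In 2023, AI is on the rise: 3 labs said so!"

# Same pattern as above: runs of at least three letters
pattern = r"[a-zA-Z]{3,}"

# Lower-case first, then collect every match of the pattern
tokens = re.findall(pattern, sample.lower())
print(tokens)  # ['the', 'rise', 'labs', 'said']
```

Numbers, punctuation, and the two-letter words ("in", "ai", "is", "on", "so") are all dropped.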

In [None]:
word_tokens = ______.______
print(word_tokens)

## Remove stop words

Remove irrelevant words using the nltk stop word list. Words such as "is", "the", and "a" carry little information and can be dropped from the sentences.

In [None]:
stopword = ______.______
stopword

To get rid of stop words, we must compare each token against the words in the stop word list. This can easily be done with a list comprehension. List comprehensions are a compact alternative to for loops.

A for loop that prints out every token:

In [None]:
_________

The same for loop can be rewritten as a list comprehension. List comprehensions are considered very readable and are therefore used frequently by Pythonistas.

In [None]:
_________

We extend our list comprehension so that only tokens that are NOT in the stop word list are retained.

In [None]:
clean_tokens = [word for word in word_tokens ______]
print(clean_tokens)
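The step from a for loop to a filtering list comprehension can be sketched on a small example. The token list and the two-word stop list below are hypothetical and only serve to keep the example self-contained. The real stop word list from nltk is much longer.

```python
# Hypothetical tokens and a tiny, made-up stop word list
tokens = ["the", "virus", "spreads", "and", "the", "city", "reacts"]
stop_words = {"the", "and"}

# Version 1: explicit for loop with a condition
kept_loop = []
for word in tokens:
    if word not in stop_words:
        kept_loop.append(word)

# Version 2: the same logic as a list comprehension
kept = [word for word in tokens if word not in stop_words]

print(kept)  # ['virus', 'spreads', 'city', 'reacts']
```

Both versions produce the same list. The comprehension simply moves the loop and the condition into a single expression.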

## Stemming


Often we want to map the different forms of the same word to the same root word, e.g. "walks", "walking", "walked" should all be the same as "walk".


In [None]:
 ______
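As an illustration of the idea, the sketch below runs the three example words through NLTK's `SnowballStemmer`. Choosing the English Snowball stemmer is an assumption made for this example. The notebook does not prescribe which stemmer to use.

```python
from nltk.stem import SnowballStemmer

# Assumption: English Snowball stemmer; other stemmers may give different roots
stemmer = SnowballStemmer("english")

words = ["walks", "walking", "walked"]

# Reduce every word form to its stem
stems = [stemmer.stem(word) for word in words]
print(stems)  # ['walk', 'walk', 'walk']
```

Note that a stem is not always a dictionary word. Stemmers cut off suffixes by rule, so some roots can look truncated.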

## Get word frequency

Count the most frequently used words in a text document.

In [None]:
freq = FreqDist(stemmed_tokens)
print(freq.most_common(5))
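`FreqDist` is built on Python's `collections.Counter`, so the counting step can be sketched with the standard library alone. The token list below is hypothetical and stands in for the stemmed tokens.

```python
from collections import Counter

# Hypothetical stemmed tokens, used only for illustration
sample_tokens = ["virus", "spread", "virus", "citi", "virus", "spread"]

# Count how often each token occurs
counts = Counter(sample_tokens)

# The two most frequent tokens with their counts
top_two = counts.most_common(2)
print(top_two)  # [('virus', 3), ('spread', 2)]
```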

# Very Basic Sentiment Analysis

Using a dictionary of positive and negative words, we can now perform a very basic sentiment analysis.

In [None]:
______

In [None]:
# collect a 1 for every token that appears in the list of negative words
negative_words = ["virus", "infect", "gun"]
neg_sent = [1 for word in stemmed_tokens if word in negative_words]

sum(neg_sent)

In [None]:
______
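A dictionary-based score can be sketched as the number of positive hits minus the number of negative hits. The negative words are the ones used above. The positive words and the sample tokens are hypothetical and chosen only for illustration.

```python
# Negative words as used above; positive words are a made-up example list
negative_words = {"virus", "infect", "gun"}
positive_words = {"help", "safe", "good"}

# Hypothetical stemmed tokens
sample_tokens = ["virus", "infect", "citi", "help", "doctor", "safe", "gun"]

# Count the hits in each word list
neg_count = sum(1 for word in sample_tokens if word in negative_words)
pos_count = sum(1 for word in sample_tokens if word in positive_words)

# Net score: above zero leans positive, below zero leans negative
score = pos_count - neg_count
print(pos_count, neg_count, score)  # 2 3 -1
```

This approach ignores context and negation ("not safe" still counts as positive), which is one reason to move to a dedicated library.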

## Better Sentiment Analysis with TextBlob

Let's have a look at the original text we started with:



In [None]:
text

In [None]:
______.______

TextBlob returns the polarity and subjectivity of a text.

- Polarity lies between -1 and 1; -1 indicates a negative sentiment and 1 a positive sentiment.
- Subjectivity lies between 0 and 1. It quantifies how much of the text is personal opinion rather than factual information: the higher the score, the more opinion the text contains.

Apply TextBlob sentiment analysis to the fourth document in the data set:

In [None]:
text = corpus["text"][3]
TextBlob(text).sentiment

## Apply TextBlob to an entire column in a pandas DataFrame

A pandas DataFrame cannot be passed directly to TextBlob, because TextBlob expects a single string as input. We can, however, write two small helper functions that let us apply TextBlob to each row of a DataFrame column.

In [None]:
# function that only returns the polarity score of TextBlob
def polarity(text):
    try:
        return TextBlob(text).sentiment.polarity
    except Exception:
        # e.g. missing values that are not strings
        return None

# function that only returns the subjectivity score of TextBlob
def subjectivity(text):
    try:
        return TextBlob(text).sentiment.subjectivity
    except Exception:
        # e.g. missing values that are not strings
        return None

In [None]:
______

In [None]:
corpus[______]= corpus["text"].______(______)
corpus.head()
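The general pattern of applying a fault-tolerant helper to every row of a column can be sketched as follows. To keep the example independent of TextBlob, it uses a hypothetical stand-in scorer, `exclamation_share`, and a small made-up DataFrame. Neither is part of the notebook's data.

```python
import pandas as pd

def exclamation_share(text):
    """Hypothetical stand-in scorer: share of characters that are '!'."""
    try:
        return text.count("!") / len(text)
    except (AttributeError, TypeError, ZeroDivisionError):
        # missing or empty values get no score
        return None

# Made-up example data with one missing text
df = pd.DataFrame({"text": ["Shocking!!", "Calm report", None]})

# apply() calls the helper once per row and collects the results in a new column
df["score"] = df["text"].apply(exclamation_share)
print(df)
```

The row with the missing text ends up with a missing score instead of raising an error, which is the purpose of the `try`/`except` in the helper.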

## Comparing Fake and Real News

In [None]:
______

In [None]:
corpus.groupby("label")[["polarity", "subjectivity"]].mean().______
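To show what such a grouped comparison returns, here is a minimal sketch on made-up scores. The label values `"fake"` and `"real"` and all numbers are assumptions for illustration and are not results from the data set.

```python
import pandas as pd

# Hypothetical scores, not computed from the fake news corpus
scores = pd.DataFrame({
    "label": ["fake", "fake", "real", "real"],
    "polarity": [0.5, 0.25, 0.0, 0.25],
    "subjectivity": [0.75, 0.5, 0.25, 0.5],
})

# One row per label, holding the mean of each score column
means = scores.groupby("label")[["polarity", "subjectivity"]].mean()
print(means)
```

The result is a DataFrame indexed by label, so individual values can be read with, for example, `means.loc["fake", "polarity"]`.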