<a href="https://colab.research.google.com/github/python-for-data-analytic/data-science-in-economics/blob/master/003_text_mining_preprocessing_and_sentiment_analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Text Mining - Preprocessing and Sentiment Analysis

Text mining is the process of exploring sizeable bodies of text to find patterns. It works on the text itself: counting word frequencies, measuring sentence length, and checking for the presence or absence of specific words are all text mining tasks.
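
As a small illustration of these three measurements, the sketch below uses only the Python standard library. The sample sentence and the `keywords` list are made up for the example and are not taken from the dataset.

```python
from collections import Counter

# Hypothetical sample text, not taken from the tweet dataset
sample = "Cash is short. Banks are crowded and cash machines are empty."

# Split into lowercase words, stripping simple punctuation
words = [w.strip(".,!?").lower() for w in sample.split()]

# 1. Frequency count of each word
word_counts = Counter(words)

# 2. Length of each sentence, measured in words
sentences = [s.strip() for s in sample.split(".") if s.strip()]
sentence_lengths = [len(s.split()) for s in sentences]

# 3. Presence or absence of specific words (illustrative keyword list)
keywords = ["cash", "queue"]
presence = {k: k in word_counts for k in keywords}

print(word_counts.most_common(2))
print(sentence_lengths)
print(presence)
```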

Natural language processing (NLP) is one of the components of text mining. NLP helps identify the sentiment of a text, find entities in a sentence, and assign a category to a blog post or article. Text mining produces the preprocessed data that text analytics builds on: in text analytics, statistical and machine learning algorithms are used to classify that information.

In this section, we use a dataset of tweets about demonetization as our text mining case study.

The demonetization of ₹500 and ₹1000 banknotes was a step taken by the Government of India on 8 November 2016, ceasing the usage of all ₹500 and ₹1000 banknotes of the Mahatma Gandhi Series as a form of legal tender in India from 9 November 2016.

The announcement was made by the Prime Minister of India Narendra Modi in an unscheduled live televised address to the nation at 20:15 Indian Standard Time (IST) the same day. In the announcement, Modi declared circulation of all ₹500 and ₹1000 banknotes of the Mahatma Gandhi Series as invalid and announced the issuance of new ₹500 and ₹2000 banknotes of the Mahatma Gandhi New Series in exchange for the old banknotes. 

The data contains the 6000 most recent tweets on #demonetization at the time of collection. There are 6000 rows (one for each tweet) and 14 columns:


1.   text (the tweet text)
2.   favorited
3.   favoriteCount
4.   replyToSN
5.   created
6.   truncated
7.   replyToSID
8.   id
9.   replyToUID
10.   statusSource
11.   screenName
12.   retweetCount
13.   isRetweet
14.   retweeted

Source: https://www.kaggle.com/arathee2/demonetization-in-india-twitter-data

***Import Library***

We start by importing the libraries used in this notebook.

In [None]:
# Import library for Text Analytics
import nltk
nltk.download('vader_lexicon')

In [None]:
# Import Libraries for Data Manipulation
import pandas as pd
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')

***Import Data***

Next, load the demonetization tweets dataset into the notebook using the pandas library.

In [None]:
# Import Data
tweets = pd.read_csv(
    'https://raw.githubusercontent.com/dhitology/temporary/master/tweet.csv',
    encoding='ISO-8859-1',
)
tweets.head()

## **Text Preprocessing**

Text preprocessing is traditionally an important step for natural language processing (NLP) tasks. It transforms text into a more digestible form so that machine learning algorithms can perform better.

***Select Data***

We will use only the `text` column (the tweet text) in this text mining exercise.

In [None]:
# Select Only Text Column (copy so later edits do not touch the original dataframe)
text = tweets[['text']].copy()
text.head()

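***Clean the Dataset***

The cells below clean the tweets step by step: regex cleaning, stopword removal, tokenization, stemming, and lemmatization. Stemming and lemmatization are both applied to the tokenized text; the final output is built from the lemmatized tokens, and the stemmed column is not used afterwards.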

In [None]:
# Create Text Cleaning Function using Regex
import re

def clean_text(df, text_field):
    # Lowercase everything so later matching is case-insensitive
    df[text_field] = df[text_field].str.lower()
    # Remove @mentions, non-alphanumeric characters, URLs, and a leading "rt" retweet marker
    df[text_field] = df[text_field].apply(lambda elem: re.sub(r"(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)|^rt|http.+?", "", elem))
    # Remove numbers
    df[text_field] = df[text_field].apply(lambda elem: re.sub(r"\d+", "", elem))
    return df

# Apply to the data
text_clean = clean_text(text, 'text')
text_clean.head()
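
To see what the cleaning step does to a single tweet, the sketch below applies the same two substitutions to one string. The sample tweet and the helper name `clean_string` are made up for illustration, and the final whitespace collapse is an extra step that `clean_text` does not perform.

```python
import re

# Same pattern as the first substitution in clean_text
PATTERN = r"(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)|^rt|http.+?"

def clean_string(s):
    """Apply the notebook's cleaning steps to one string (illustrative helper)."""
    s = s.lower()
    s = re.sub(PATTERN, "", s)   # mentions, punctuation, URLs, leading "rt"
    s = re.sub(r"\d+", "", s)    # numbers
    return " ".join(s.split())   # extra step: collapse leftover whitespace

# Hypothetical tweet, not taken from the dataset
sample = "RT @user123: Paid 2000 rupees at the ATM! #Demonetization https://t.co/abc123"
cleaned = clean_string(sample)
print(cleaned)
```

The mention, the hashtag symbol, the URL, and the amount are all removed, so only lowercase words remain.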

In [None]:
# Download the Stopword List
nltk.download('stopwords')

# Remove Stopwords from the dataframe
from nltk.corpus import stopwords
stop = set(stopwords.words('english'))  # a set gives fast membership tests

text_clean['nostopword'] = text_clean['text'].apply(lambda x: ' '.join(word for word in x.split() if word not in stop))
text_clean.head()
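
The filtering itself is a plain membership test on each word. A minimal sketch, assuming a small hand-picked stopword set instead of NLTK's full English list and a hypothetical cleaned tweet:

```python
# Small hand-picked stopword set for illustration only;
# the notebook uses the full list from nltk.corpus.stopwords
stop_demo = {"the", "at", "is", "a", "of", "and"}

# Hypothetical cleaned tweet
cleaned_tweet = "paid rupees at the atm demonetization"

# Keep only the words that are not stopwords
kept = " ".join(word for word in cleaned_tweet.split() if word not in stop_demo)
print(kept)
```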

In [None]:
# Download the Punkt tokenizer data
nltk.download('punkt')
nltk.download('punkt_tab')  # recent NLTK releases look for 'punkt_tab' instead of 'punkt'

# Tokenize
from nltk.tokenize import word_tokenize
text_clean['tokenize'] = text_clean['nostopword'].apply(word_tokenize)
text_clean.head()

In [None]:
# Import Stemmer
from nltk.stem import PorterStemmer

# Create the stemmer once and reuse it for every token
stemmer = PorterStemmer()

# Create Stemmer Function
def word_stemmer(text):
    stem_text = [stemmer.stem(i) for i in text]
    return stem_text

# Apply to the dataframe
text_clean['stemming'] = text_clean['tokenize'].apply(word_stemmer)
text_clean.head()

In [None]:
# Import Wordnet Library
nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer

# Create the lemmatizer once and reuse it for every token
lemmatizer = WordNetLemmatizer()

# Create Lemmatization Function (pos='v' treats every token as a verb)
def word_lemmatizer(text):
    lem_text = [lemmatizer.lemmatize(i, pos='v') for i in text]
    return lem_text

# Apply to the dataframe
text_clean['lemmatization'] = text_clean['tokenize'].apply(word_lemmatizer)
text_clean.head()

In [None]:
# Convert to a New Dataframe: join each token list back into one space-separated string
text_preprocessed = text_clean['lemmatization'].str.join(' ').to_frame(name='text')
text_preprocessed

In [None]:
# Save as CSV
text_preprocessed.to_csv('text_preprocessed.csv', index=False)

## **Sentiment Analysis**

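Text classification is the process of assigning tags or categories to text according to its content. It’s one of the fundamental tasks in Natural Language Processing (NLP) with broad applications such as sentiment analysis, topic labeling, spam detection, and intent detection. Here we score each preprocessed tweet with VADER, the lexicon- and rule-based sentiment analyzer shipped with NLTK, which returns negative, neutral, positive, and compound scores.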

In [None]:
# Import Module
from nltk.sentiment.vader import SentimentIntensityAnalyzer 

# Sentiment Analysis: score every tweet with VADER
sid = SentimentIntensityAnalyzer()
listy = []
for index, row in text_preprocessed.iterrows():
    ss = sid.polarity_scores(row['text'])  # dict with keys neg, neu, pos, compound
    listy.append(ss)

se = pd.Series(listy)
text_preprocessed['polarity'] = se.values
display(text_preprocessed.head(5))
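
Each entry in the `polarity` column is a dictionary with `neg`, `neu`, `pos`, and `compound` keys. One common way to turn it into a single label is to threshold `compound`. The sketch below assumes the ±0.05 cut-offs suggested in the VADER documentation; the helper name `label_from_scores` and the sample dictionaries are made up for illustration.

```python
def label_from_scores(scores, threshold=0.05):
    """Map a VADER-style score dict to 'positive', 'negative', or 'neutral'."""
    compound = scores["compound"]
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"

# Hypothetical score dictionaries in the shape returned by polarity_scores
examples = [
    {"neg": 0.0, "neu": 0.6, "pos": 0.4, "compound": 0.64},
    {"neg": 0.5, "neu": 0.5, "pos": 0.0, "compound": -0.57},
    {"neg": 0.0, "neu": 1.0, "pos": 0.0, "compound": 0.0},
]
predicted = [label_from_scores(s) for s in examples]
print(predicted)
```

On the notebook's dataframe this could be applied as `text_preprocessed['polarity'].apply(label_from_scores)`.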

In [None]:
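# Pie Chart Visualization
# Average the neg/neu/pos scores over all tweets, not just one tweet
labels = ['negative', 'neutral', 'positive']
scores = pd.DataFrame(text_preprocessed['polarity'].tolist())
sizes = scores[['neg', 'neu', 'pos']].mean().tolist()
plt.pie(sizes, labels=labels, autopct='%1.1f%%')
plt.axis('equal')
plt.show()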