# Building Features from Text Data

## Introduction

Text data differs from structured tabular data, so building features from it requires a different approach. In this guide, you will learn how to extract features from raw text for predictive modeling, including how to perform text preprocessing steps and how to create TF-IDF and bag-of-words (BOW) feature matrices. We will begin by exploring the data.
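To make the idea of a bag-of-words matrix concrete before working with the real data, here is a minimal sketch that uses only the Python standard library. The two toy documents and the whitespace tokenizer are assumptions chosen for illustration; they are not part of the tweet dataset or of the preprocessing used later in this guide.

```python
from collections import Counter

# Two toy documents (hypothetical examples, not taken from the dataset)
docs = ["apple makes great phones", "great phones great price"]

# Tokenize on whitespace; real pipelines usually also lowercase and strip punctuation
tokenized = [doc.split() for doc in docs]

# Vocabulary: sorted unique tokens across all documents
vocab = sorted({tok for toks in tokenized for tok in toks})

# Bag-of-words matrix: one row per document, one column per vocabulary term,
# each cell holding how often the term occurs in that document
bow = []
for toks in tokenized:
    counts = Counter(toks)
    bow.append([counts[term] for term in vocab])

print(vocab)
for row in bow:
    print(row)
```

Each row is a fixed-length numeric vector, which is what allows raw text to be passed to a predictive model. Word order is discarded, which is why the representation is called a "bag" of words.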


## Data

In this guide, we will be using tweet data about the company 'Apple'. The objective is to create features that can be used to build a sentiment prediction model.

The dataset contains 1181 observations and 3 variables, as described below:

- Tweet: Consists of the Twitter comments posted by users. The Twitter data is publicly available.

- Avg: Average sentiment of the tweets (-2 means extremely negative while +2 means extremely positive). This classification was done using Amazon Mechanical Turk.

- Sentiment: Consists of the sentiment labels - positive, negative, and neutral.

## Loading the Required Libraries and Modules

In [8]:
# Import required libraries
import re
import string
import warnings

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import nltk
from nltk.corpus import stopwords

# Render plots inline (IPython/Jupyter magic) and silence deprecation warnings
%matplotlib inline
warnings.filterwarnings("ignore", category=DeprecationWarning)

# English stop words; requires the NLTK 'stopwords' corpus to be downloaded
stop = stopwords.words('english')

## Loading the Data and Performing Basic Data Checks

The first line of code below reads in the data as a pandas DataFrame, while the second line prints the shape: 1,181 observations of 3 variables.

The third line prints the first five observations.

In [12]:
dat = pd.read_csv('datatweets.csv')
print(dat.shape)
dat.head(5)

(5, 3)


| | Tweet | Avg | Sentiment |
|---|---|---|---|
| 0 | iphone 5c is ugly as heck what the freak @appl | -2.0 | Negative |
| 1 | freak YOU @APPLE | -2.0 | Negative |
| 2 | freak you @apple | -2.0 | Negative |
| 3 | @APPLE YOU RUINED MY LIFE | -2.0 | Negative |
| 4 | @apple I hate apple!!!!! | -2.0 | Negative |
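Beyond the shape and the first few rows, a few extra checks are often useful before building features. The sketch below rebuilds the five rows shown above as a small stand-in DataFrame named `sample` (a hypothetical name; in the guide itself the data lives in `dat`) and inspects missing values, column types, and the range of `Avg`.

```python
import pandas as pd

# Stand-in frame built from the five rows displayed above;
# in the guide, the full data is loaded into `dat` with pd.read_csv
sample = pd.DataFrame({
    "Tweet": [
        "iphone 5c is ugly as heck what the freak @appl",
        "freak YOU @APPLE",
        "freak you @apple",
        "@APPLE YOU RUINED MY LIFE",
        "@apple I hate apple!!!!!",
    ],
    "Avg": [-2.0] * 5,
    "Sentiment": ["Negative"] * 5,
})

# Number of missing values in each column
missing_per_column = sample.isnull().sum()

# Data type of each column (text columns show up as object/string)
column_types = sample.dtypes

# Observed range of the average sentiment score
avg_range = (sample["Avg"].min(), sample["Avg"].max())

print(missing_per_column)
print(column_types)
print(avg_range)
```

The same three checks can be run on `dat` directly. On the full dataset, `Avg` would be expected to stay within the -2 to +2 scale described earlier.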


We will start by performing a basic analysis of the data. The line of code below prints the number of tweets for each 'Sentiment' label. The output shows that the negative sentiment has the highest number of tweets, while the positive sentiment has the lowest.

In [13]:
# Count the number of tweets for each sentiment label
dat.groupby('Sentiment')['Tweet'].count()

Sentiment
Negative    5
Name: Tweet, dtype: int64
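An alternative to `groupby(...).count()` for this kind of check is `Series.value_counts`, which sorts the labels by frequency and can also return proportions. The sketch below uses a short, hypothetical list of labels purely for illustration; it does not reflect the real class distribution of the dataset.

```python
import pandas as pd

# Hypothetical labels for illustration only (not the real class distribution)
labels = pd.Series(
    ["Negative", "Negative", "Neutral", "Positive", "Negative", "Neutral"],
    name="Sentiment",
)

# Absolute counts per label, sorted from most to least frequent
counts = labels.value_counts()

# Relative frequencies per label (fractions that sum to 1)
shares = labels.value_counts(normalize=True)

print(counts)
print(shares)
```

On the guide's data, the equivalent call would be `dat['Sentiment'].value_counts()`. Looking at proportions as well as counts makes class imbalance easier to spot, which matters when evaluating a sentiment classifier.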