# NLP Basics: Exploring the dataset

### Read in text data

In [6]:
# Pass header=None because the raw dataset has no column names;
# otherwise the first row of data would be read as the header.

import pandas as pd

fullCorpus = pd.read_csv('SMSSpamCollection.tsv', sep='\t', header=None)
fullCorpus.columns = ['label', 'body_text']    # Name the two columns

fullCorpus.head()

| | label | body_text |
|---|---|---|
| 0 | ham | I've been searching for the right words to tha... |
| 1 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 2 | ham | Nah I don't think he goes to usf, he lives aro... |
| 3 | ham | Even my brother is not like to speak with me. ... |
| 4 | ham | I HAVE A DATE ON SUNDAY WITH WILL!! |
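
To see why `header=None` matters, here is a minimal sketch on a made-up two-row sample (the rows and the variable names `raw`, `wrong`, and `right` are illustrative assumptions, not part of the dataset). It also shows that `read_csv` accepts a `names=` argument, which is one way to label the columns in the same call.

```python
import io
import pandas as pd

# Made-up, header-less TSV content standing in for the real file
raw = "ham\tsee you at noon\nspam\tWIN a prize now\n"

# Without header=None, the first message is consumed as the column names
wrong = pd.read_csv(io.StringIO(raw), sep="\t")

# header=None keeps every row as data; names= labels the columns in one step
right = pd.read_csv(io.StringIO(raw), sep="\t", header=None,
                    names=["label", "body_text"])

print(len(wrong), list(wrong.columns))
print(len(right), list(right.columns))
```

In the first read, one row of data is silently lost to the header, which is the failure the comment in the cell above warns about.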


### Explore the dataset

In [2]:
# What is the shape of the dataset?

print("Input data has {} rows and {} columns".format(len(fullCorpus), len(fullCorpus.columns)))

Input data has 5568 rows and 2 columns


In [7]:
fullCorpus.shape   # same information, returned as a (rows, columns) tuple

(5568, 2)

In [3]:
# How many spam/ham are there?

print("Out of {} rows, {} are spam, {} are ham".format(len(fullCorpus),
                                                       len(fullCorpus[fullCorpus['label']=='spam']),
                                                       len(fullCorpus[fullCorpus['label']=='ham'])))

Out of 5568 rows, 746 are spam, 4822 are ham


In [9]:
fullCorpus[fullCorpus['label'] == 'spam']    # the same filter without len(): displays the spam rows themselves

| | label | body_text |
|---|---|---|
| 1 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 6 | spam | WINNER!! As a valued network customer you have... |
| 7 | spam | Had your mobile 11 months or more? U R entitle... |
| 9 | spam | SIX chances to win CASH! From 100 to 20,000 po... |
| 10 | spam | URGENT! You have won a 1 week FREE membership ... |
| ... | ... | ... |
| 5533 | spam | Want explicit SEX in 30 secs? Ring 02073162414... |
| 5536 | spam | ASKED 3MOBILE IF 0870 CHATLINES INCLU IN FREE ... |
| 5543 | spam | Had your contract mobile 11 Mnths? Latest Moto... |
| 5562 | spam | REMINDER FROM O2: To get 2.50 pounds free call... |
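
The per-label counts can also be obtained in one call with `Series.value_counts()`, and `normalize=True` turns them into proportions. The sketch below runs on a small made-up frame (`toy` is a hypothetical stand-in for `fullCorpus`), so the numbers it prints are not the dataset's.

```python
import pandas as pd

# Made-up stand-in for fullCorpus with the same two columns
toy = pd.DataFrame({
    "label": ["ham", "spam", "ham", "ham"],
    "body_text": ["a", "b", "c", "d"],
})

counts = toy["label"].value_counts()                 # rows per label
shares = toy["label"].value_counts(normalize=True)   # fraction of rows per label

print(counts["ham"], counts["spam"])
print(shares["spam"])
```

Applied to `fullCorpus['label']`, the same call should reproduce the spam and ham totals printed above without filtering the frame twice.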


In [4]:
# How much missing data is there?

print("Number of null in label: {}".format(fullCorpus['label'].isnull().sum()))
print("Number of null in text: {}".format(fullCorpus['body_text'].isnull().sum()))

Number of null in label: 0
Number of null in text: 0


In [12]:
fullCorpus['label'].isnull().sum()    # same count for the label column, returned as a bare number

0
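
Calling `isnull().sum()` on the whole DataFrame checks every column at once instead of one column per line. This is a small sketch on made-up data (`toy` and its deliberately missing values are assumptions for illustration; the real dataset reported no nulls above).

```python
import pandas as pd

# Made-up frame with one missing value planted in each column
toy = pd.DataFrame({
    "label": ["ham", None, "spam"],
    "body_text": ["hi", "yo", None],
})

nulls = toy.isnull().sum()   # Series: null count per column
total = int(nulls.sum())     # null count across the whole frame

print(nulls)
print(total)
```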