### Processing Text Data


#### Feature extraction of text data

1. Number of words
2. Number of characters
3. Average word length
4. Number of stop words
5. Number of special characters
6. Number of uppercase words


### Basic text processing

1. Lower casing
2. Special character removal
3. Stop word removal
4. Spelling correction
5. Clean words
6. Frequent words
7. Tokenization
8. Stemming
9. Lemmatization


### Advanced text processing

1. N-grams
2. Frequency
3. TF-IDF
4. Word embedding
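
As a rough illustration of the first item, an n-gram is a run of `n` consecutive tokens. The sketch below uses only the standard library; the helper name `ngrams` and the sample sentence are assumptions made for this example, not part of the notebook.

```python
def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) over a list of tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "thank you user for you follow".split()

print(ngrams(tokens, 2))
# [('thank', 'you'), ('you', 'user'), ('user', 'for'), ('for', 'you'), ('you', 'follow')]

print(len(ngrams(tokens, 3)))  # 4 trigrams from 6 tokens
```

A token list shorter than `n` simply produces an empty list.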


In [28]:
import os
import pandas as pd
import numpy as np


In [2]:
os.listdir(os.path.join(os.getcwd(), 'Dataset'))

['application_test.csv',
 'application_train.csv',
 'bureau.csv',
 'bureau_balance.csv',
 'cityofla',
 'codebook.csv',
 'codebook.xlsx',
 'credit_card_balance.csv',
 'data-science-for-good-city-of-los-angeles.zip',
 'dataset',
 'data_description.txt',
 'HomeCredit_columns_description.csv',
 'house-prices-advanced-regression-techniques.zip',
 'installments_payments.csv',
 'movielen',
 'POS_CASH_balance.csv',
 'previous_application.csv',
 'sample_submission.csv',
 'sentiment',
 'test.csv',
 'test_tweets_anuFYb8.csv',
 'text',
 'text2',
 'text3',
 'train.csv',
 'train_twitter.csv']

In [29]:
train = pd.read_csv('Dataset/train_twitter.csv')
train.head()

Unnamed: 0,id,label,tweet
0,1,0,@user when a father is dysfunctional and is s...
1,2,0,@user @user thanks for #lyft credit i can't us...
2,3,0,bihday your majesty
3,4,0,#model i love u take with u all the time in ...
4,5,0,factsguide: society now #motivation


In [30]:
train.shape

(31962, 3)

In [31]:
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 31962 entries, 0 to 31961
Data columns (total 3 columns):
id       31962 non-null int64
label    31962 non-null int64
tweet    31962 non-null object
dtypes: int64(2), object(1)
memory usage: 749.2+ KB


In [32]:
train['tweet']

0         @user when a father is dysfunctional and is s...
1        @user @user thanks for #lyft credit i can't us...
2                                      bihday your majesty
3        #model   i love u take with u all the time in ...
4                   factsguide: society now    #motivation
                               ...                        
31957    ate @user isz that youuu?ðððððð...
31958      to see nina turner on the airwaves trying to...
31959    listening to sad songs on a monday morning otw...
31960    @user #sikh #temple vandalised in in #calgary,...
31961                     thank you @user for you follow  
Name: tweet, Length: 31962, dtype: object

In [33]:
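# Number of words
# Note: split(" ") splits on every single space, so leading or repeated spaces
# produce empty strings that are counted as words; split() would ignore them.
train['word_count'] = train['tweet'].apply(lambda x: len(str(x).split(" ")))
train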

Unnamed: 0,id,label,tweet,word_count
0,1,0,@user when a father is dysfunctional and is s...,21
1,2,0,@user @user thanks for #lyft credit i can't us...,22
2,3,0,bihday your majesty,5
3,4,0,#model i love u take with u all the time in ...,17
4,5,0,factsguide: society now #motivation,8
...,...,...,...,...
31957,31958,0,ate @user isz that youuu?ðððððð...,6
31958,31959,0,to see nina turner on the airwaves trying to...,25
31959,31960,0,listening to sad songs on a monday morning otw...,15
31960,31961,1,"@user #sikh #temple vandalised in in #calgary,...",13


In [34]:
train['tweet'][0]

' @user when a father is dysfunctional and is so selfish he drags his kids into his dysfunction.   #run'
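
The tweet above has a leading space and a run of three spaces before `#run`, which explains why its `word_count` is 21. A small check, using only the string shown in the output (the variable names are chosen for this sketch):

```python
# First tweet, copied from the output above
# (leading space, and three spaces before "#run").
tweet = " @user when a father is dysfunctional and is so selfish he drags his kids into his dysfunction.   #run"

naive_tokens = tweet.split(" ")  # splits on every single space
clean_tokens = tweet.split()     # splits on runs of whitespace

print(len(naive_tokens))       # 21
print(naive_tokens.count(""))  # 3 empty strings counted as "words"
print(len(clean_tokens))       # 18
```

If the goal is the number of actual words, `split()` without an argument is probably the safer choice.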

In [35]:
# Number of characters
train['char_count'] = train['tweet'].str.len()
train

Unnamed: 0,id,label,tweet,word_count,char_count
0,1,0,@user when a father is dysfunctional and is s...,21,102
1,2,0,@user @user thanks for #lyft credit i can't us...,22,122
2,3,0,bihday your majesty,5,21
3,4,0,#model i love u take with u all the time in ...,17,86
4,5,0,factsguide: society now #motivation,8,39
...,...,...,...,...,...
31957,31958,0,ate @user isz that youuu?ðððððð...,6,68
31958,31959,0,to see nina turner on the airwaves trying to...,25,131
31959,31960,0,listening to sad songs on a monday morning otw...,15,63
31960,31961,1,"@user #sikh #temple vandalised in in #calgary,...",13,67


In [36]:
# Average word length
def avg_word(sentence):
    words = sentence.split()
    # Guard against empty tweets to avoid a ZeroDivisionError
    return sum(len(word) for word in words) / len(words) if words else 0.0

train['avg_word'] = train['tweet'].apply(avg_word)
train

Unnamed: 0,id,label,tweet,word_count,char_count,avg_word
0,1,0,@user when a father is dysfunctional and is s...,21,102,4.555556
1,2,0,@user @user thanks for #lyft credit i can't us...,22,122,5.315789
2,3,0,bihday your majesty,5,21,5.666667
3,4,0,#model i love u take with u all the time in ...,17,86,4.928571
4,5,0,factsguide: society now #motivation,8,39,8.000000
...,...,...,...,...,...,...
31957,31958,0,ate @user isz that youuu?ðððððð...,6,68,12.600000
31958,31959,0,to see nina turner on the airwaves trying to...,25,131,4.652174
31959,31960,0,listening to sad songs on a monday morning otw...,15,63,3.769231
31960,31961,1,"@user #sikh #temple vandalised in in #calgary,...",13,67,5.500000


In [37]:
# Stop words (requires a one-time nltk.download('stopwords'))
from nltk.corpus import stopwords

stop = stopwords.words('english')


In [38]:
# Number of stop words per tweet
train['stopwords'] = train['tweet'].apply(lambda x: len([w for w in x.split() if w in stop]))


In [39]:
train[['tweet', 'stopwords']]

Unnamed: 0,tweet,stopwords
0,@user when a father is dysfunctional and is s...,10
1,@user @user thanks for #lyft credit i can't us...,5
2,bihday your majesty,1
3,#model i love u take with u all the time in ...,5
4,factsguide: society now #motivation,1
...,...,...
31957,ate @user isz that youuu?ðððððð...,1
31958,to see nina turner on the airwaves trying to...,9
31959,listening to sad songs on a monday morning otw...,5
31960,"@user #sikh #temple vandalised in in #calgary,...",2
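
Checking membership in a list scans the list for every token, while a set lookup takes constant time on average. The sketch below counts stop words with a set; `stop_set` is a small hypothetical stop list written out for this example, whereas the notebook uses NLTK's full English list.

```python
# Hypothetical mini stop list (assumption); the notebook uses NLTK's English list.
stop_set = {"a", "and", "for", "he", "his", "i", "into", "is", "so", "the", "when"}

def count_stopwords(text, stop_words):
    """Count whitespace-separated tokens that appear in stop_words."""
    return sum(1 for word in text.split() if word in stop_words)

# First tweet, copied from the earlier output.
tweet = " @user when a father is dysfunctional and is so selfish he drags his kids into his dysfunction.   #run"

print(count_stopwords(tweet, stop_set))  # 10
```

With the NLTK list, the same idea would be `stop_set = set(stop)` before the `apply` call.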


In [40]:
# Number of hashtags (tokens starting with '#')

train['hastags'] = train['tweet'].apply(lambda x: len([w for w in x.split() if w.startswith('#')]))

# Number of numeric tokens (tokens made up entirely of digits)

train['numerics'] = train['tweet'].apply(lambda x: len([w for w in x.split() if w.isdigit()]))
train[['hastags', 'numerics', 'tweet']]

Unnamed: 0,hastags,numerics,tweet
0,1,0,@user when a father is dysfunctional and is s...
1,3,0,@user @user thanks for #lyft credit i can't us...
2,0,0,bihday your majesty
3,1,0,#model i love u take with u all the time in ...
4,1,0,factsguide: society now #motivation
...,...,...,...
31957,0,0,ate @user isz that youuu?ðððððð...
31958,2,0,to see nina turner on the airwaves trying to...
31959,0,0,listening to sad songs on a monday morning otw...
31960,4,0,"@user #sikh #temple vandalised in in #calgary,..."


In [41]:
# Number of uppercase words

train['upper'] = train['tweet'].apply(lambda x: len([w for w in x.split() if w.isupper()]))
train

Unnamed: 0,id,label,tweet,word_count,char_count,avg_word,stopwords,hastags,numerics,upper
0,1,0,@user when a father is dysfunctional and is s...,21,102,4.555556,10,1,0,0
1,2,0,@user @user thanks for #lyft credit i can't us...,22,122,5.315789,5,3,0,0
2,3,0,bihday your majesty,5,21,5.666667,1,0,0,0
3,4,0,#model i love u take with u all the time in ...,17,86,4.928571,5,1,0,0
4,5,0,factsguide: society now #motivation,8,39,8.000000,1,1,0,0
...,...,...,...,...,...,...,...,...,...,...
31957,31958,0,ate @user isz that youuu?ðððððð...,6,68,12.600000,1,0,0,0
31958,31959,0,to see nina turner on the airwaves trying to...,25,131,4.652174,9,2,0,0
31959,31960,0,listening to sad songs on a monday morning otw...,15,63,3.769231,5,0,0,0
31960,31961,1,"@user #sikh #temple vandalised in in #calgary,...",13,67,5.500000,2,4,0,0


### 2. Basic text preprocessing



In [42]:
# 1. Lower case (joining on split() also collapses repeated whitespace)
train['tweet'] = train['tweet'].apply(lambda x: " ".join(w.lower() for w in x.split()))

In [43]:
type(train['tweet'])

pandas.core.series.Series

In [44]:
train['tweet'].apply(lambda x: x.split())

0        [@user, when, a, father, is, dysfunctional, an...
1        [@user, @user, thanks, for, #lyft, credit, i, ...
2                                  [bihday, your, majesty]
3        [#model, i, love, u, take, with, u, all, the, ...
4                 [factsguide:, society, now, #motivation]
                               ...                        
31957    [ate, @user, isz, that, youuu?ðððð...
31958    [to, see, nina, turner, on, the, airwaves, try...
31959    [listening, to, sad, songs, on, a, monday, mor...
31960    [@user, #sikh, #temple, vandalised, in, in, #c...
31961                [thank, you, @user, for, you, follow]
Name: tweet, Length: 31962, dtype: object

In [45]:
# 2. Remove punctuation
# regex=True is needed in pandas 2.0+, where str.replace matches literally by default
train['tweet'] = train['tweet'].str.replace(r'[^\w\s]', '', regex=True)
train

Unnamed: 0,id,label,tweet,word_count,char_count,avg_word,stopwords,hastags,numerics,upper
0,1,0,user when a father is dysfunctional and is so ...,21,102,4.555556,10,1,0,0
1,2,0,user user thanks for lyft credit i cant use ca...,22,122,5.315789,5,3,0,0
2,3,0,bihday your majesty,5,21,5.666667,1,0,0,0
3,4,0,model i love u take with u all the time in urð...,17,86,4.928571,5,1,0,0
4,5,0,factsguide society now motivation,8,39,8.000000,1,1,0,0
...,...,...,...,...,...,...,...,...,...,...
31957,31958,0,ate user isz that youuuðððððððððâï,6,68,12.600000,1,0,0,0
31958,31959,0,to see nina turner on the airwaves trying to w...,25,131,4.652174,9,2,0,0
31959,31960,0,listening to sad songs on a monday morning otw...,15,63,3.769231,5,0,0,0
31960,31961,1,user sikh temple vandalised in in calgary wso ...,13,67,5.500000,2,4,0,0
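
To see what the pattern `[^\w\s]` removes, it may help to apply it to single strings with the standard `re` module. The helper name `strip_punctuation` is an assumption for this sketch; the sample strings are taken from the tweets shown earlier.

```python
import re

def strip_punctuation(text):
    """Remove every character that is neither a word character nor whitespace."""
    return re.sub(r"[^\w\s]", "", text)

print(strip_punctuation("@user thanks for #lyft credit i can't use"))
# user thanks for lyft credit i cant use

print(strip_punctuation("factsguide: society now #motivation"))
# factsguide society now motivation
```

Note that `\w` includes the underscore, so underscores survive this step, and apostrophes are dropped (`can't` becomes `cant`).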


In [46]:
train['tweet']

0        user when a father is dysfunctional and is so ...
1        user user thanks for lyft credit i cant use ca...
2                                      bihday your majesty
3        model i love u take with u all the time in urð...
4                        factsguide society now motivation
                               ...                        
31957                   ate user isz that youuuðððððððððâï
31958    to see nina turner on the airwaves trying to w...
31959    listening to sad songs on a monday morning otw...
31960    user sikh temple vandalised in in calgary wso ...
31961                        thank you user for you follow
Name: tweet, Length: 31962, dtype: object

In [47]:
# 3. Remove stop words
train['tweet'] = train['tweet'].apply(lambda x: ' '.join(w for w in x.split() if w not in stop))
train['tweet']

0        user father dysfunctional selfish drags kids d...
1        user user thanks lyft credit cant use cause do...
2                                           bihday majesty
3                    model love u take u time urð ðððð ððð
4                            factsguide society motivation
                               ...                        
31957                        ate user isz youuuðððððððððâï
31958    see nina turner airwaves trying wrap mantle ge...
31959      listening sad songs monday morning otw work sad
31960    user sikh temple vandalised calgary wso condem...
31961                                    thank user follow
Name: tweet, Length: 31962, dtype: object
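
The "Frequent words" step from the outline is not run in this notebook. One possible sketch, assuming the goal is to count words across tweets and optionally drop the most common ones, uses `collections.Counter` on a few cleaned tweets copied from the output above:

```python
from collections import Counter

# Four cleaned tweets copied from the output above.
cleaned = [
    "bihday majesty",
    "factsguide society motivation",
    "listening sad songs monday morning otw work sad",
    "thank user follow",
]

# Count every word across all tweets.
word_freq = Counter(word for tweet in cleaned for word in tweet.split())
print(word_freq.most_common(1))  # [('sad', 2)]

# Assumed next step: remove the most frequent words from each tweet.
frequent = {word for word, _ in word_freq.most_common(1)}
without_frequent = [
    " ".join(w for w in tweet.split() if w not in frequent) for tweet in cleaned
]
print(without_frequent[2])  # listening songs monday morning otw work
```

On the full DataFrame, something like `pd.Series(" ".join(train['tweet']).split()).value_counts()` should give the same counts.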

In [3]:
from sklearn.feature_extraction.text import TfidfVectorizer
corpus = ['This is the first document',
          'This document is the second document',
          'And this is the third',
          'Is this the first document?']

vectorizer = TfidfVectorizer()

In [4]:
vectorizer.fit(corpus)

TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
                dtype=<class 'numpy.float64'>, encoding='utf-8',
                input='content', lowercase=True, max_df=1.0, max_features=None,
                min_df=1, ngram_range=(1, 1), norm='l2', preprocessor=None,
                smooth_idf=True, stop_words=None, strip_accents=None,
                sublinear_tf=False, token_pattern='(?u)\\b\\w\\w+\\b',
                tokenizer=None, use_idf=True, vocabulary=None)

In [5]:
x = vectorizer.transform(corpus)
x

<4x8 sparse matrix of type '<class 'numpy.float64'>'
	with 20 stored elements in Compressed Sparse Row format>

In [6]:
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
print(vectorizer.get_feature_names_out().tolist())

['and', 'document', 'first', 'is', 'second', 'the', 'third', 'this']
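
To make the numbers behind the vectorizer less opaque, the sketch below recomputes the vocabulary and one TF-IDF row by hand with the standard library. It assumes the default settings printed above (`lowercase=True`, `smooth_idf=True`, `norm='l2'`, `sublinear_tf=False`); the variable names are chosen for this example.

```python
import math
import re

corpus = [
    "This is the first document",
    "This document is the second document",
    "And this is the third",
    "Is this the first document?",
]

# Same tokenization as the default token_pattern: lowercase, words of 2+ characters.
docs = [re.findall(r"(?u)\b\w\w+\b", text.lower()) for text in corpus]
vocab = sorted({tok for doc in docs for tok in doc})

n_docs = len(docs)
# Document frequency: number of documents containing each term.
df = {term: sum(term in doc for doc in docs) for term in vocab}
# Smoothed IDF (smooth_idf=True): ln((1 + n) / (1 + df)) + 1
idf = {term: math.log((1 + n_docs) / (1 + df[term])) + 1 for term in vocab}

# TF-IDF row for the first document (raw counts times IDF), then L2-normalized.
row = [docs[0].count(term) * idf[term] for term in vocab]
length = math.sqrt(sum(v * v for v in row))
row = [v / length for v in row]

print(vocab)
print([round(v, 3) for v in row])
```

Terms that occur in every document (`is`, `the`, `this`) get an IDF of exactly 1, the minimum under this formula, so rarer terms such as `first` carry more weight in the row.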
