### Feature Extraction of Text with Scikit-learn

In [79]:
import numpy as np
import pandas as pd

In [80]:
df = pd.read_csv('spam.csv', encoding='latin-1')
df = df.drop(labels = ["Unnamed: 2", "Unnamed: 3", "Unnamed: 4"], axis = 1)
df.columns = ["label", "message"]

df.head()

Unnamed: 0,label,message
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


In [81]:
df.isnull().sum()

label      0
message    0
dtype: int64

In [82]:
df['label'].value_counts()

ham     4825
spam     747
Name: label, dtype: int64
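
The classes are imbalanced, so accuracy alone can flatter a model. As a rough reference point (not part of the original notebook), the sketch below uses the counts printed above to compute the accuracy a trivial classifier would reach by always predicting the majority class.

```python
# Class counts copied from the value_counts() output above.
class_counts = {"ham": 4825, "spam": 747}

total = sum(class_counts.values())
majority_label = max(class_counts, key=class_counts.get)

# Accuracy of a classifier that always predicts the majority class.
baseline_accuracy = class_counts[majority_label] / total

print(majority_label, round(baseline_accuracy, 4))
```

Any model trained on this data should be judged against this baseline, and per-class recall for `spam` is worth checking alongside overall accuracy.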

In [83]:
from sklearn.model_selection import train_test_split

In [84]:
X = df['message']
y = df['label']

In [85]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
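
The sizes of the two splits follow directly from `test_size`. The arithmetic below is a small sketch, assuming scikit-learn's behaviour of rounding the test share up and giving the remaining rows to training; it reproduces the 3733 training rows and 1839 test rows that appear in the outputs of this notebook.

```python
import math

n_samples = 5572   # rows in the SMS dataset
test_size = 0.33   # fraction passed to train_test_split

# The test share is rounded up; the remainder goes to the training set.
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(n_train, n_test)
```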

In [86]:
X

0       Go until jurong point, crazy.. Available only ...
1                           Ok lar... Joking wif u oni...
2       Free entry in 2 a wkly comp to win FA Cup fina...
3       U dun say so early hor... U c already then say...
4       Nah I don't think he goes to usf, he lives aro...
                              ...                        
5567    This is the 2nd time we have tried 2 contact u...
5568                Will Ì_ b going to esplanade fr home?
5569    Pity, * was in mood for that. So...any other s...
5570    The guy did some bitching but I acted like i'd...
5571                           Rofl. Its true to its name
Name: message, Length: 5572, dtype: object

### Using CountVectorizer

In [87]:
from sklearn.feature_extraction.text import CountVectorizer

In [88]:
# Initialize the vectorizer

count_vect = CountVectorizer()

# Learn the vocabulary from the training messages and transform them
# into a document-term count matrix in a single step
# (fit_transform already fits, so a separate fit() call is not needed)

X_train_counts = count_vect.fit_transform(X_train)

X_train_counts



<3733x7057 sparse matrix of type '<class 'numpy.int64'>'
	with 49296 stored elements in Compressed Sparse Row format>

In [89]:
X_train_counts.shape

(3733, 7057)
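
Each of the 7057 columns stands for one distinct token seen in the training messages, and each cell holds how often that token occurs in a message. The pure-Python sketch below mimics that idea on a hypothetical toy corpus. It assumes the default behaviour of `CountVectorizer` (lowercasing, tokens of two or more word characters, alphabetical column order) and returns dense lists rather than a sparse matrix; the helper names are illustrative only.

```python
import re
from collections import Counter

# Words of two or more characters, similar to CountVectorizer's default pattern.
TOKEN_RE = re.compile(r"\b\w\w+\b")

def tokenize(text):
    return TOKEN_RE.findall(text.lower())

def fit_vocabulary(docs):
    """Map each distinct token to a column index (alphabetical order)."""
    tokens = sorted({tok for doc in docs for tok in tokenize(doc)})
    return {tok: idx for idx, tok in enumerate(tokens)}

def transform(docs, vocabulary):
    """Return one row of counts per document; unseen tokens are ignored."""
    rows = []
    for doc in docs:
        counts = Counter(tokenize(doc))
        rows.append([counts.get(tok, 0) for tok in vocabulary])
    return rows

toy_docs = ["Free entry to win", "Ok, I win", "free FREE prize"]
vocab = fit_vocabulary(toy_docs)
matrix = transform(toy_docs, vocab)

print(vocab)
print(matrix)
```

Note that the single-character token `I` is dropped and that a new document can only be described with tokens already in the vocabulary.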

### Using TfidfTransformer

In [90]:
from sklearn.feature_extraction.text import TfidfTransformer

In [91]:
tfidf_transformer = TfidfTransformer()

In [92]:
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)

X_train_tfidf.shape

(3733, 7057)
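
The shape is unchanged because TF-IDF only reweights the counts: terms that appear in many documents are scaled down, and each row is then normalised. The sketch below works through that calculation by hand, assuming the `TfidfTransformer` defaults (smoothed IDF, `idf = ln((1 + n) / (1 + df)) + 1`, followed by L2 normalisation of each row). The count matrix is hypothetical.

```python
import math

def tfidf_rows(count_rows):
    """TF-IDF with smoothed IDF and L2-normalised rows."""
    n_docs = len(count_rows)
    n_terms = len(count_rows[0])
    # Document frequency: number of documents that contain each term.
    df = [sum(1 for row in count_rows if row[j] > 0) for j in range(n_terms)]
    idf = [math.log((1 + n_docs) / (1 + df_j)) + 1 for df_j in df]

    result = []
    for row in count_rows:
        weighted = [tf * w for tf, w in zip(row, idf)]
        norm = math.sqrt(sum(v * v for v in weighted))
        result.append([v / norm if norm else 0.0 for v in weighted])
    return result

# Hypothetical counts: three documents (rows) and three terms (columns).
counts = [
    [1, 1, 0],
    [0, 1, 0],
    [0, 2, 1],
]
tfidf = tfidf_rows(counts)

for row in tfidf:
    print([round(v, 3) for v in row])
```

In the first row both terms occur once, yet the first term ends up with the larger weight because it appears in only one document.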

In [93]:
# TfidfVectorizer combines CountVectorizer and TfidfTransformer in one step

from sklearn.feature_extraction.text import TfidfVectorizer

In [94]:
vectorizer = TfidfVectorizer()

X_train_tfidfv = vectorizer.fit_transform(X_train) # takes the raw text in X_train directly

X_train_tfidfv.shape

(3733, 7057)

In [95]:
# train a Linear Support Vector Classifier

from sklearn.svm import LinearSVC

clf = LinearSVC()


clf.fit(X_train_tfidfv, y_train)

LinearSVC()

**Note:**

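Only the training data has been vectorized so far, using a vocabulary learned from the training messages. To run the classifier on the test data, we would first have to pass it through the same fitted vectorizer (with `transform`, not `fit_transform`) and then call `predict`, and we would have to repeat this for every new batch of text. This gets tiresome and is easy to get wrong.

So we create a Pipeline!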
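
To make the contrast concrete, here is a minimal sketch on hypothetical toy data (the texts, labels and variable names are made up for illustration). The manual route reuses the fitted vectorizer with `transform()` so the test matrix has the same columns as the training matrix; the pipeline hides the same two steps behind a single `fit`/`predict` interface.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy data standing in for X_train / y_train / X_test.
train_texts = ["win a free prize now", "free entry to win cash",
               "are we meeting for lunch", "see you at home later"]
train_labels = ["spam", "spam", "ham", "ham"]
test_texts = ["free cash prize", "lunch at home"]

# Manual route: fit on the training text only, then transform() the test text.
vectorizer = TfidfVectorizer()
train_matrix = vectorizer.fit_transform(train_texts)
test_matrix = vectorizer.transform(test_texts)  # reuses the training vocabulary
manual_clf = LinearSVC(random_state=0).fit(train_matrix, train_labels)
manual_preds = manual_clf.predict(test_matrix)

# Pipeline route: vectorizing and classifying happen inside fit() and predict().
pipe = Pipeline([("tfidf", TfidfVectorizer()),
                 ("clf", LinearSVC(random_state=0))])
pipe_preds = pipe.fit(train_texts, train_labels).predict(test_texts)

print(train_matrix.shape[1] == test_matrix.shape[1])
print(list(manual_preds), list(pipe_preds))
```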

In [96]:
from sklearn.pipeline import Pipeline

In [97]:
text_clf = Pipeline([('tfidf', TfidfVectorizer()), ('clf', LinearSVC())])

# call fit() on the pipeline

text_clf.fit(X_train, y_train)

Pipeline(steps=[('tfidf', TfidfVectorizer()), ('clf', LinearSVC())])

In [98]:
preds = text_clf.predict(X_test)

preds

array(['ham', 'ham', 'spam', ..., 'ham', 'ham', 'spam'], dtype=object)

In [99]:
from sklearn.metrics import confusion_matrix, classification_report

In [100]:
print(confusion_matrix(y_test, preds))

[[1581    6]
 [  27  225]]


In [101]:
print(classification_report(y_test, preds))

              precision    recall  f1-score   support

         ham       0.98      1.00      0.99      1587
        spam       0.97      0.89      0.93       252

    accuracy                           0.98      1839
   macro avg       0.98      0.94      0.96      1839
weighted avg       0.98      0.98      0.98      1839
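
The figures in the report can be recovered by hand from the confusion matrix printed earlier. The sketch below assumes the usual scikit-learn layout, with rows as true labels and columns as predicted labels, both in the order `['ham', 'spam']`.

```python
# Confusion matrix copied from the output above.
labels = ["ham", "spam"]
cm = [[1581, 6],
      [27, 225]]

total = sum(sum(row) for row in cm)
accuracy = sum(cm[i][i] for i in range(len(labels))) / total

for i, label in enumerate(labels):
    predicted_as_label = sum(row[i] for row in cm)  # column sum
    actually_label = sum(cm[i])                     # row sum (support)
    precision = cm[i][i] / predicted_as_label
    recall = cm[i][i] / actually_label
    f1 = 2 * precision * recall / (precision + recall)
    print(f"{label}: precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")

print(f"accuracy={accuracy:.2f}")
```

The 27 spam messages classified as ham are what pull spam recall down to 0.89, even though overall accuracy is 0.98.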



In [102]:
# make predictions

text_clf.predict(["Congraturations Brian! You have passed the interview"])

array(['ham'], dtype=object)

### Text Classification on IMDB Movie Reviews

In [103]:
movie_df = pd.read_csv('IMDB Dataset.csv')
movie_df

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive
...,...,...
49995,I thought this movie did a down right good job...,positive
49996,"Bad plot, bad dialogue, bad acting, idiotic di...",negative
49997,I am a Catholic taught in parochial elementary...,negative
49998,I'm going to have to disagree with the previou...,negative


In [104]:
movie_df.shape

(50000, 2)

In [105]:
len(movie_df)

50000

In [106]:
movie_df.loc[movie_df['sentiment'] == 'positive']

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive
5,"Probably my all-time favorite movie, a story o...",positive
...,...,...
49983,"I loved it, having been a fan of the original ...",positive
49985,Imaginary Heroes is clearly the best film of t...,positive
49989,I got this one a few weeks ago and love it! It...,positive
49992,John Garfield plays a Marine who is blinded by...,positive


In [107]:
print(movie_df['review'][1])

A wonderful little production. <br /><br />The filming technique is very unassuming- very old-time-BBC fashion and gives a comforting, and sometimes discomforting, sense of realism to the entire piece. <br /><br />The actors are extremely well chosen- Michael Sheen not only "has got all the polari" but he has all the voices down pat too! You can truly see the seamless editing guided by the references to Williams' diary entries, not only is it well worth the watching but it is a terrificly written and performed piece. A masterful production about one of the great master's of comedy and his life. <br /><br />The realism really comes home with the little things: the fantasy of the guard which, rather than use the traditional 'dream' techniques remains solid then disappears. It plays on our knowledge and our senses, particularly with the scenes concerning Orton and Halliwell and the sets (particularly of their flat with Halliwell's murals decorating every surface) are terribly well done.


In [108]:
movie_df.isnull().sum()

review       0
sentiment    0
dtype: int64

In [109]:
# Collect the index of every review that consists only of whitespace

blanks = []
for idx, review, sentiment in movie_df.itertuples():
    if review.isspace():
        blanks.append(idx)

In [110]:
blanks

[]
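
One caveat with `str.isspace()` is that it returns `False` for a completely empty string, so such rows would slip through the check. The sketch below shows a slightly broader test on hypothetical rows; `is_blank` is an illustrative helper, not something the notebook defines.

```python
def is_blank(text):
    """True for empty strings and for strings made only of whitespace."""
    return not text.strip()

# Hypothetical (index, review) pairs, shaped like the rows iterated above.
rows = [(0, "Great film"), (1, "   "), (2, ""), (3, "\n\t")]

blank_indices = [idx for idx, review in rows if is_blank(review)]
print(blank_indices)

# str.isspace() on its own misses the empty string at index 2.
print("".isspace())
```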

In [111]:
X = movie_df['review']
y = movie_df['sentiment']

In [112]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=42)

In [113]:
# Create a Pipeline
text_clf = Pipeline([('tfidf', TfidfVectorizer()),
                     ('clf', LinearSVC())])

In [114]:
text_clf.fit(X_train, y_train)

Pipeline(steps=[('tfidf', TfidfVectorizer()), ('clf', LinearSVC())])

In [115]:
preds = text_clf.predict(X_test)

In [116]:
preds

array(['negative', 'positive', 'negative', ..., 'negative', 'positive',
       'positive'], dtype=object)

In [117]:
confusion_matrix(y_test, preds)

array([[6626,  785],
       [ 695, 6894]])

In [118]:
from sklearn.metrics import accuracy_score

print(accuracy_score(y_test, preds))

0.9013333333333333


In [119]:
text_clf.predict(['crazy yes'])

array(['positive'], dtype=object)

## VADER Sentiment Analysis with Python and NLTK

In [120]:
import nltk

nltk.download('vader_lexicon')

[nltk_data] Downloading package vader_lexicon to /root/nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


True

In [121]:
from nltk.sentiment.vader import SentimentIntensityAnalyzer

# Create an analyzer instance. polarity_scores() takes a string and returns a
# dictionary of four scores: 'neg', 'neu', 'pos' and 'compound'
# (compound is a normalized aggregate of the word-level valence scores)

sid = SentimentIntensityAnalyzer()


In [122]:
sample_text = 'The is super awesome. I would repray it when bored!'

sid.polarity_scores(sample_text) # every score has a maximum of 1.0; compound can go as low as -1.0

{'neg': 0.128, 'neu': 0.366, 'pos': 0.506, 'compound': 0.8016}

In [123]:
sample_text = 'This was the BEST movie I have watched!!'
sid.polarity_scores(sample_text)

{'neg': 0.0, 'neu': 0.521, 'pos': 0.479, 'compound': 0.7592}

In [124]:
sample_text = 'Worst movie. Noting makes SENSE!!'
sid.polarity_scores(sample_text)

{'neg': 0.539, 'neu': 0.461, 'pos': 0.0, 'compound': -0.6892}
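
The `compound` value stays between -1 and 1 because the summed word valences are squashed by a normalisation step. The sketch below reproduces that step as I understand it from the VADER reference implementation, `x / sqrt(x^2 + alpha)` with `alpha = 15`; treat the constant and the raw inputs as assumptions for illustration rather than values taken from the examples above.

```python
import math

def normalize(raw_score, alpha=15):
    """Squash an unbounded valence sum into the open interval (-1, 1)."""
    return raw_score / math.sqrt(raw_score * raw_score + alpha)

# Hypothetical raw valence sums.
for raw in (-10.0, -2.0, 0.0, 2.0, 10.0):
    print(raw, round(normalize(raw), 4))
```

Because the curve flattens for large inputs, texts with several strong sentiment words quickly approach, but never reach, a compound score of 1.0 or -1.0.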

## Amazon Reviews with VADER and NLTK

In [125]:
amazon_df = pd.read_csv('amazonreviews.tsv', sep='\t')

In [126]:
amazon_df.head()

Unnamed: 0,label,review
0,pos,Stuning even for the non-gamer: This sound tra...
1,pos,The best soundtrack ever to anything.: I'm rea...
2,pos,Amazing!: This soundtrack is my favorite music...
3,pos,Excellent Soundtrack: I truly like this soundt...
4,pos,"Remember, Pull Your Jaw Off The Floor After He..."


In [127]:
amazon_df['label'].value_counts()

neg    5097
pos    4903
Name: label, dtype: int64

In [128]:
amazon_df.isnull().sum()

label     0
review    0
dtype: int64

In [129]:
sid.polarity_scores(amazon_df.iloc[0]['review'])

{'neg': 0.088, 'neu': 0.669, 'pos': 0.243, 'compound': 0.9454}

In [130]:
amazon_df['scores'] = amazon_df['review'].apply(lambda review: sid.polarity_scores(review))

In [131]:
amazon_df.head()

Unnamed: 0,label,review,scores
0,pos,Stuning even for the non-gamer: This sound tra...,"{'neg': 0.088, 'neu': 0.669, 'pos': 0.243, 'co..."
1,pos,The best soundtrack ever to anything.: I'm rea...,"{'neg': 0.018, 'neu': 0.837, 'pos': 0.145, 'co..."
2,pos,Amazing!: This soundtrack is my favorite music...,"{'neg': 0.04, 'neu': 0.692, 'pos': 0.268, 'com..."
3,pos,Excellent Soundtrack: I truly like this soundt...,"{'neg': 0.09, 'neu': 0.615, 'pos': 0.295, 'com..."
4,pos,"Remember, Pull Your Jaw Off The Floor After He...","{'neg': 0.0, 'neu': 0.746, 'pos': 0.254, 'comp..."


In [132]:

# Extract the compound score from each scores dictionary
amazon_df['compound'] = amazon_df['scores'].apply(lambda scores: scores['compound'])

In [133]:
amazon_df.head()

Unnamed: 0,label,review,scores,compound
0,pos,Stuning even for the non-gamer: This sound tra...,"{'neg': 0.088, 'neu': 0.669, 'pos': 0.243, 'co...",0.9454
1,pos,The best soundtrack ever to anything.: I'm rea...,"{'neg': 0.018, 'neu': 0.837, 'pos': 0.145, 'co...",0.8957
2,pos,Amazing!: This soundtrack is my favorite music...,"{'neg': 0.04, 'neu': 0.692, 'pos': 0.268, 'com...",0.9858
3,pos,Excellent Soundtrack: I truly like this soundt...,"{'neg': 0.09, 'neu': 0.615, 'pos': 0.295, 'com...",0.9814
4,pos,"Remember, Pull Your Jaw Off The Floor After He...","{'neg': 0.0, 'neu': 0.746, 'pos': 0.254, 'comp...",0.9781


In [151]:
# Label each review 'pos' if its compound score is at least 0.5, otherwise 'neg'
amazon_df['compound_score'] = amazon_df['compound'].apply(lambda score: 'pos' if score >= 0.5 else 'neg')

In [152]:
amazon_df.loc[amazon_df['compound_score'] == 'neg'].head()

Unnamed: 0,label,review,scores,compound,compound_score
6,neg,"Buyer beware: This is a self-published book, a...","{'neg': 0.124, 'neu': 0.806, 'pos': 0.069, 'co...",-0.8744,neg
10,neg,The Worst!: A complete waste of time. Typograp...,"{'neg': 0.36, 'neu': 0.586, 'pos': 0.054, 'com...",-0.9274,neg
14,neg,Awful beyond belief!: I feel I have to write t...,"{'neg': 0.171, 'neu': 0.755, 'pos': 0.074, 'co...",-0.9312,neg
15,neg,Don't try to fool us with fake reviews.: It's ...,"{'neg': 0.105, 'neu': 0.832, 'pos': 0.063, 'co...",-0.5414,neg
19,neg,sizes recomended in the size chart are not rea...,"{'neg': 0.0, 'neu': 0.935, 'pos': 0.065, 'comp...",0.4926,neg
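
Row 19 shows the effect of the cutoff: its compound score of 0.4926 is positive, but it falls just short of 0.5 and is therefore labelled `neg`. The 0.5 threshold is this notebook's choice; the VADER authors suggest treating scores at or above 0.05 as positive and at or below -0.05 as negative, with a neutral band in between. The sketch below, using a hypothetical helper, shows how the label changes with the threshold.

```python
def label_from_compound(score, threshold=0.5):
    """Label a review 'pos' when its compound score reaches the threshold."""
    return "pos" if score >= threshold else "neg"

# Compound scores copied from rows displayed in this notebook.
for compound in (0.9454, 0.4926, -0.5414):
    strict = label_from_compound(compound)                   # 0.5 cutoff
    lenient = label_from_compound(compound, threshold=0.0)   # cutoff at zero
    print(compound, strict, lenient)
```

Since the threshold directly trades precision against recall for each class, it is worth reporting alongside any accuracy figure.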


### Evaluation

In [153]:
print(accuracy_score(amazon_df['label'], amazon_df['compound_score']))

0.7472


In [154]:
print(classification_report(amazon_df['label'], amazon_df['compound_score']))

              precision    recall  f1-score   support

         neg       0.80      0.67      0.73      5097
         pos       0.71      0.82      0.76      4903

    accuracy                           0.75     10000
   macro avg       0.75      0.75      0.75     10000
weighted avg       0.75      0.75      0.75     10000



In [155]:
print(confusion_matrix(amazon_df['label'], amazon_df['compound_score']))

[[3440 1657]
 [ 871 4032]]
