# Text classification using TF-IDF

### 1. Load the dataset from sklearn.datasets

In [1]:
from sklearn.datasets import fetch_20newsgroups

In [2]:
categories = ['alt.atheism', 'soc.religion.christian', 'comp.graphics', 'sci.med']

### 2. Training data

In [3]:
twenty_train = fetch_20newsgroups(subset='train', categories=categories, shuffle=True, random_state=42)

### 3. Test data

In [4]:
twenty_test = fetch_20newsgroups(subset='test', categories=categories, shuffle=True, random_state=42)

### a. Access the values of the target variable with the `.target` attribute

In [5]:
twenty_train.target

array([1, 1, 3, ..., 2, 2, 2], dtype=int64)

### b. Access the class names of the target variable with `.target_names`

In [6]:
twenty_train.target_names

['alt.atheism', 'comp.graphics', 'sci.med', 'soc.religion.christian']

In [7]:
twenty_train.data[0:5]

['From: sd345@city.ac.uk (Michael Collier)\nSubject: Converting images to HP LaserJet III?\nNntp-Posting-Host: hampton\nOrganization: The City University\nLines: 14\n\nDoes anyone know of a good way (standard PC application/PD utility) to\nconvert tif/img/tga files into LaserJet III format.  We would also like to\ndo the same, converting to HPGL (HP plotter) files.\n\nPlease email any response.\n\nIs this the correct group?\n\nThanks in advance.  Michael.\n-- \nMichael Collier (Programmer)                 The Computer Unit,\nEmail: M.P.Collier@uk.ac.city                The City University,\nTel: 071 477-8000 x3769                      London,\nFax: 071 477-8565                            EC1V 0HB.\n',
 "From: ani@ms.uky.edu (Aniruddha B. Deglurkar)\nSubject: help: Splitting a trimming region along a mesh \nOrganization: University Of Kentucky, Dept. of Math Sciences\nLines: 28\n\n\n\n\tHi,\n\n\tI have a problem, I hope some of the 'gurus' can help me solve.\n\n\tBackground of the probl

### 4. With features and targets available for both the train and test datasets, use TfidfVectorizer to fit and transform the training data, transform the test data, and get the TF-IDF features for both

Hint: Use ".fit_transform" on Train set and ".transform" on Test set

In [8]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [9]:
#initialize vectorizer
vect = TfidfVectorizer(ngram_range=(1,2),stop_words='english', min_df = 0.01)

In [10]:
#fit and transform the training set
twenty_train_vec = vect.fit_transform(twenty_train.data)

In [11]:
#transform the test set
twenty_test_vec = vect.transform(twenty_test.data)
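The split between `fit_transform` and `transform` matters because the vocabulary and IDF weights are learned during fitting. A minimal sketch on a made-up corpus (the four documents below are assumptions, not 20 Newsgroups data) suggests how the test matrix reuses the training vocabulary, so both matrices share the same columns and unseen test words are ignored:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpora (hypothetical, not taken from 20 Newsgroups)
train_docs = ["graphics card driver", "doctor prescribed medicine"]
test_docs = ["new graphics tablet", "medicine for headache"]

toy_vect = TfidfVectorizer()
train_vec = toy_vect.fit_transform(train_docs)  # learns vocabulary and IDF weights
test_vec = toy_vect.transform(test_docs)        # reuses them; unseen words are ignored

print(sorted(toy_vect.vocabulary_))
print(train_vec.shape, test_vec.shape)
```

Calling `fit_transform` on the test set instead would learn a different vocabulary, and the resulting columns would no longer line up with what the classifier was trained on.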

In [12]:
twenty_train_vec.shape, twenty_test_vec.shape

((2257, 2543), (1502, 2543))
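The vectorizer above uses `min_df = 0.01`. In scikit-learn a float `min_df` is read as a proportion of documents, while an integer is an absolute document count. As a rough worked example, assuming the threshold is applied as "document frequency >= min_df * number of documents", the 2257 training documents imply the following cut-off:

```python
import math

n_train_docs = 2257   # rows of twenty_train_vec reported above
min_df = 0.01         # float => proportion of documents

# Smallest integer document count that satisfies df >= min_df * n_docs
min_doc_count = math.ceil(min_df * n_train_docs)
print(min_doc_count)  # 23
```

Under that assumption, only terms occurring in at least 23 training documents survive, which is consistent with the fairly small vocabulary of 2543 features.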

### 5. Use LogisticRegression with the TF-IDF features as input and the targets as output, train the model, and report the train and test accuracy scores

In [13]:
from sklearn.linear_model import LogisticRegression

In [14]:
logreg = LogisticRegression(max_iter=500)

In [15]:
logreg.fit(twenty_train_vec, twenty_train.target)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=500,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

In [16]:
# make class predictions with train data
pred_train = logreg.predict(twenty_train_vec)

In [17]:
pred_train

array([1, 1, 3, ..., 2, 2, 2], dtype=int64)

In [18]:
# make class predictions with test data
pred_test = logreg.predict(twenty_test_vec)

In [19]:
pred_test

array([2, 2, 2, ..., 2, 2, 1], dtype=int64)

In [20]:
# calculate accuracy of class predictions with train data
from sklearn import metrics

metrics.accuracy_score(twenty_train.target, pred_train)

0.9898094816127603

In [21]:
# calculate accuracy of class predictions with test data
metrics.accuracy_score(twenty_test.target, pred_test)

0.8854860186418109
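Accuracy is simply the share of documents whose predicted class equals the true class. The sketch below (the helper name `accuracy` and the toy labels are assumptions for illustration) computes it by hand and checks that the reported test score corresponds to 1330 correct predictions out of 1502 test documents:

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction equals the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

# Toy labels (hypothetical)
y_true = [1, 1, 3, 2, 2, 0]
y_pred = [1, 1, 3, 2, 0, 0]
toy_acc = accuracy(y_true, y_pred)
print(toy_acc)

# The reported test accuracy matches 1330 correct out of 1502 documents
reported = 1330 / 1502
print(reported)
```

The gap between the train score (about 0.99) and the test score (about 0.89) indicates some overfitting to the training vocabulary.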

## Sentiment analysis

The objective of this problem is to perform sentiment analysis on tweets that users posted about various mobile devices.
Based on the text of a tweet, we will classify whether the sentiment the user directs at a particular mobile device is positive or not.

### 6. Read the dataset (tweets.csv) and drop rows with missing values while reading it

Hint: pd.read_csv('./tweets.csv',encoding = "ISO-8859-1").dropna()

In [22]:
import pandas as pd
import numpy as np

In [23]:
data = pd.read_csv('tweets.csv',encoding = "ISO-8859-1").dropna()

In [24]:
data.shape

(3291, 3)

In [25]:
data.head()

| | tweet_text | emotion_in_tweet_is_directed_at | is_there_an_emotion_directed_at_a_brand_or_product |
|---|---|---|---|
| 0 | .@wesley83 I have a 3G iPhone. After 3 hrs twe... | iPhone | Negative emotion |
| 1 | @jessedee Know about @fludapp ? Awesome iPad/i... | iPad or iPhone App | Positive emotion |
| 2 | @swonderlin Can not wait for #iPad 2 also. The... | iPad | Positive emotion |
| 3 | @sxsw I hope this year's festival isn't as cra... | iPad or iPhone App | Negative emotion |
| 4 | @sxtxstate great stuff on Fri #SXSW: Marissa M... | Google | Positive emotion |


In [26]:
data.columns

Index(['tweet_text', 'emotion_in_tweet_is_directed_at',
       'is_there_an_emotion_directed_at_a_brand_or_product'],
      dtype='object')

### 7. Preprocess the text and add the preprocessed text in a column with name `text` in the dataframe.

In [27]:
import re
import string

def preprocess(text):
    try:
        # Drop every punctuation character (this includes '@', '#', ':', '/' and '.')
        nopunc = ''.join(char for char in text if char not in string.punctuation)
        # Convert text to lower case
        nopunc = nopunc.lower()
        # Remove URLs. Punctuation is already gone at this point, so in practice
        # only the bare 'http...' pattern can still match (e.g. 'httpbitly...')
        nopunc = re.sub(r'((www\.[^\s]+)|(https?://[^\s]+))', '', nopunc)
        nopunc = re.sub(r'http\S+', '', nopunc)
        # Remove usernames and the '#' in hashtags. Both symbols were stripped
        # above, so these two calls are no-ops and usernames stay in the text
        nopunc = re.sub(r'@[^\s]+', '', nopunc)
        nopunc = re.sub(r'#([^\s]+)', r'\1', nopunc)
        return nopunc
    except Exception:
        # Non-iterable input (e.g. a float NaN) ends up here
        return ""

In [28]:
data['text'] = [preprocess(text) for text in data.tweet_text]

In [29]:
data.shape

(3291, 4)

In [30]:
data.head(3)

| | tweet_text | emotion_in_tweet_is_directed_at | is_there_an_emotion_directed_at_a_brand_or_product | text |
|---|---|---|---|---|
| 0 | .@wesley83 I have a 3G iPhone. After 3 hrs twe... | iPhone | Negative emotion | wesley83 i have a 3g iphone after 3 hrs tweeti... |
| 1 | @jessedee Know about @fludapp ? Awesome iPad/i... | iPad or iPhone App | Positive emotion | jessedee know about fludapp awesome ipadiphon... |
| 2 | @swonderlin Can not wait for #iPad 2 also. The... | iPad | Positive emotion | swonderlin can not wait for ipad 2 also they s... |
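The preprocessed text above still contains usernames such as `wesley83`, because the `@` sign is removed together with the rest of the punctuation before the username pattern runs. If dropping usernames and URLs is the goal, one possible reordering is sketched below. The function name `preprocess_ordered` and the sample tweet are hypothetical, and this variant is not what produced the outputs in this notebook:

```python
import re
import string

def preprocess_ordered(text):
    """Hypothetical variant: strip URLs and @usernames before punctuation."""
    text = text.lower()
    text = re.sub(r'(www\.\S+)|(https?://\S+)', '', text)  # URLs
    text = re.sub(r'@\S+', '', text)                       # @usernames
    text = re.sub(r'#(\S+)', r'\1', text)                  # keep the hashtag word
    text = text.translate(str.maketrans('', '', string.punctuation))
    return ' '.join(text.split())                          # collapse whitespace

sample = ".@wesley83 I have a 3G iPhone. See http://example.com #SXSW"
cleaned = preprocess_ordered(sample)
print(cleaned)
```

Switching to this ordering would change the vocabulary and therefore the feature counts and accuracy figures reported later.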


### 8. Consider only rows having a Positive or Negative emotion and remove other rows from the dataframe.

Hint: Use df = df[(df["col_name"] == "Positive emotion") | (df["col_name"] == "Negative emotion")]

In [31]:
data = data[(data["is_there_an_emotion_directed_at_a_brand_or_product"]=="Positive emotion") | 
            (data["is_there_an_emotion_directed_at_a_brand_or_product"]=="Negative emotion")]

In [32]:
data.shape

(3191, 4)
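Boolean masks on a pandas Series are combined with the element-wise `|` operator rather than the Python keyword `or`. As a small sketch on a made-up frame (the short column name `sentiment` and the label `Other` are placeholders, not values confirmed from tweets.csv), `isin` gives the same mask more compactly:

```python
import pandas as pd

# Toy frame (hypothetical rows); "sentiment" stands in for the long column name
df = pd.DataFrame({
    "text": ["love it", "hate it", "meh", "so good"],
    "sentiment": ["Positive emotion", "Negative emotion", "Other", "Positive emotion"],
})

keep = ["Positive emotion", "Negative emotion"]
mask_or = (df["sentiment"] == keep[0]) | (df["sentiment"] == keep[1])
mask_isin = df["sentiment"].isin(keep)  # equivalent, and scales to more labels

filtered = df[mask_isin]
print(filtered.shape)
```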

In [33]:
data.head()

| | tweet_text | emotion_in_tweet_is_directed_at | is_there_an_emotion_directed_at_a_brand_or_product | text |
|---|---|---|---|---|
| 0 | .@wesley83 I have a 3G iPhone. After 3 hrs twe... | iPhone | Negative emotion | wesley83 i have a 3g iphone after 3 hrs tweeti... |
| 1 | @jessedee Know about @fludapp ? Awesome iPad/i... | iPad or iPhone App | Positive emotion | jessedee know about fludapp awesome ipadiphon... |
| 2 | @swonderlin Can not wait for #iPad 2 also. The... | iPad | Positive emotion | swonderlin can not wait for ipad 2 also they s... |
| 3 | @sxsw I hope this year's festival isn't as cra... | iPad or iPhone App | Negative emotion | sxsw i hope this years festival isnt as crashy... |
| 4 | @sxtxstate great stuff on Fri #SXSW: Marissa M... | Google | Positive emotion | sxtxstate great stuff on fri sxsw marissa maye... |


In [34]:
data["is_there_an_emotion_directed_at_a_brand_or_product"].value_counts()

Positive emotion    2672
Negative emotion     519
Name: is_there_an_emotion_directed_at_a_brand_or_product, dtype: int64

### 9. Represent text as numerical data using `CountVectorizer` and get the document term frequency matrix

Hint: Perform fit (".fit") and transform (".transform") on the whole dataset here; later we will apply CountVectorizer "fit_transform" and "transform" to the train and test sets separately

In [35]:
from sklearn.feature_extraction.text import CountVectorizer

In [36]:
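# min_df=1 and max_df=1.0 are the defaults and keep every term; an integer
# min_df of 0 has the same effect but may be rejected by newer scikit-learn releases
vect = CountVectorizer(ngram_range=(1,2), stop_words='english', min_df=1, max_df=1.0)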
#vect = CountVectorizer(ngram_range=(1,2), min_df=0, max_df=1.0)

In [37]:
vect.fit(data['text'])
data_vec = vect.transform(data['text'])
data_vec.shape

(3191, 23945)

In [38]:
data_vec_df = pd.DataFrame(data_vec.toarray())
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out() there
data_vec_df.columns = vect.get_feature_names()

In [39]:
data_vec_df.head()

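| | 02 | 02 symbian | 03 | 03 blackberry | 0310 | 0310 weve | 10 | 10 attendees | 10 dangerous | 10 hot | ... | ûó sxsw | ûó theft | ûójust | ûójust macbooks | ûólewis | ûólewis carroll | ûómention | ûómention connectedbrands | ûóthe | ûóthe right |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |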


### 10. Find the number of different words in the vocabulary

#### Tip: To see all available attributes and methods of an object, use `dir`, then pick the appropriate one to find the number of different words in the vocabulary

In [40]:
dir(vect)

['__class__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__getstate__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__le__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__setstate__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 '_char_ngrams',
 '_char_wb_ngrams',
 '_check_stop_words_consistency',
 '_check_vocabulary',
 '_count_vocab',
 '_get_param_names',
 '_get_tags',
 '_limit_features',
 '_more_tags',
 '_sort_features',
 '_stop_words_id',
 '_validate_custom_analyzer',
 '_validate_params',
 '_validate_vocabulary',
 '_warn_for_unused_params',
 '_white_spaces',
 '_word_ngrams',
 'analyzer',
 'binary',
 'build_analyzer',
 'build_preprocessor',
 'build_tokenizer',
 'decode',
 'decode_error',
 'dtype',
 'encoding',
 'fit',
 'fit_transform',
 'fixed_vocabulary_',
 'get_feature_names',
 'get_params',
 'get_stop_words'

In [41]:
vect.vocabulary_

{'wesley83': 23017,
 '3g': 300,
 'iphone': 10511,
 'hrs': 9272,
 'tweeting': 21951,
 'riseaustin': 17466,
 'dead': 4907,
 'need': 14269,
 'upgrade': 22197,
 'plugin': 15836,
 'stations': 19084,
 'sxsw': 19739,
 'wesley83 3g': 23018,
 '3g iphone': 307,
 'iphone hrs': 10617,
 'hrs tweeting': 9273,
 'tweeting riseaustin': 21955,
 'riseaustin dead': 17467,
 'dead need': 4909,
 'need upgrade': 14309,
 'upgrade plugin': 22201,
 'plugin stations': 15838,
 'stations sxsw': 19085,
 'jessedee': 10931,
 'know': 11313,
 'fludapp': 6884,
 'awesome': 2130,
 'ipadiphone': 10457,
 'app': 1192,
 'youll': 23709,
 'likely': 11871,
 'appreciate': 1641,
 'design': 5083,
 'theyre': 21135,
 'giving': 7602,
 'free': 7073,
 'ts': 21840,
 'jessedee know': 10932,
 'know fludapp': 11317,
 'fludapp awesome': 6885,
 'awesome ipadiphone': 2148,
 'ipadiphone app': 10458,
 'app youll': 1360,
 'youll likely': 23714,
 'likely appreciate': 11872,
 'appreciate design': 1642,
 'design theyre': 5110,
 'theyre giving': 21139
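`vocabulary_` maps each term to its column index in the document-term matrix, so its length is the number of different terms. A minimal sketch on a two-document toy corpus (the documents are made up for illustration) shows that this length matches the number of matrix columns:

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["free ipad app", "ipad app crash"]  # toy corpus (hypothetical)
toy_vect = CountVectorizer(ngram_range=(1, 2))
dtm = toy_vect.fit_transform(docs)

vocab_size = len(toy_vect.vocabulary_)  # distinct terms (unigrams + bigrams)
print(vocab_size, dtm.shape)            # vocabulary size equals the column count
print(dtm.toarray())
```

Applied to the fitted `vect` above, `len(vect.vocabulary_)` should agree with the second dimension of `data_vec.shape`.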

### 11. Find out how many Positive and Negative emotions there are.

Hint: Use value_counts on that column

In [42]:
data["is_there_an_emotion_directed_at_a_brand_or_product"].value_counts()

Positive emotion    2672
Negative emotion     519
Name: is_there_an_emotion_directed_at_a_brand_or_product, dtype: int64

### 12. Change the labels for Positive and Negative emotions to 1 and 0 respectively and store them in a different column named 'Label' in the same dataframe

Hint: use map on that column and give labels

In [43]:
# Note: this overwrites the original column in place instead of creating a separate 'Label' column
data["is_there_an_emotion_directed_at_a_brand_or_product"] = data["is_there_an_emotion_directed_at_a_brand_or_product"].map({'Positive emotion':1, 'Negative emotion':0})

In [44]:
data["is_there_an_emotion_directed_at_a_brand_or_product"].value_counts()

1    2672
0     519
Name: is_there_an_emotion_directed_at_a_brand_or_product, dtype: int64
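One detail of `map` with a dictionary is worth keeping in mind: values that are missing from the dictionary become NaN. The sketch below uses a made-up Series with a placeholder label `Other` to illustrate this. In the notebook the rows were already restricted to the two emotions, so no NaN appears in the counts above:

```python
import pandas as pd

# Toy labels (hypothetical); "Other" is not in the mapping
labels = pd.Series(["Positive emotion", "Negative emotion", "Positive emotion", "Other"])
encoded = labels.map({"Positive emotion": 1, "Negative emotion": 0})

print(encoded.tolist())           # the unmapped "Other" becomes NaN
print(int(encoded.isna().sum()))  # number of unmapped rows
```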

### 13. Define the feature set (independent variable, X) as the `text` column and the labels as the target (dependent variable), split into train and test datasets, and display the shapes

In [45]:
# sklearn.cross_validation was removed in scikit-learn 0.20; train_test_split lives in sklearn.model_selection
from sklearn.model_selection import train_test_split
x = data['text']
y = data['is_there_an_emotion_directed_at_a_brand_or_product']

In [46]:
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.30, random_state=42)

In [47]:
x_train.shape, y_train.shape

((2233,), (2233,))

In [48]:
x_test.shape, y_test.shape

((958,), (958,))
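Because the two classes are imbalanced (2672 positive versus 519 negative), a plain random split can shift the class ratio between train and test. The split above does not stratify. As an optional sketch on made-up labels, passing `stratify=y` to `train_test_split` keeps the ratio the same in both parts:

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Toy imbalanced labels (hypothetical): 80 positive, 20 negative
y = [1] * 80 + [0] * 20
x = [f"tweet {i}" for i in range(len(y))]

x_tr, x_te, y_tr, y_te = train_test_split(
    x, y, test_size=0.30, random_state=42, stratify=y
)
print(Counter(y_tr), Counter(y_te))  # both parts keep the 80/20 ratio
```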

### 14. Predicting the sentiment

Use (i) Naive Bayes and (ii) Logistic Regression, and print their accuracy scores for predicting the sentiment of the given text

In [49]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn import metrics

In [50]:
vect = CountVectorizer()

# create document-term matrices
x_train_dtm = vect.fit_transform(x_train)
x_test_dtm = vect.transform(x_test)

In [51]:
nb_clf = MultinomialNB()

In [52]:
nb_clf.fit(x_train_dtm, y_train)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

In [53]:
pred_train = nb_clf.predict(x_train_dtm)

In [54]:
pred_train

array([1, 1, 1, ..., 1, 0, 1], dtype=int64)

In [55]:
metrics.accuracy_score(y_train, pred_train)

0.9493954321540529

In [56]:
pred_test = nb_clf.predict(x_test_dtm)

In [57]:
#pred_test

In [58]:
metrics.accuracy_score(y_test, pred_test)

0.8674321503131524

In [59]:
lr_clf = LogisticRegression(max_iter=500)

In [60]:
lr_clf.fit(x_train_dtm, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=500,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

In [61]:
pred_train = lr_clf.predict(x_train_dtm)

In [62]:
pred_train

array([1, 1, 1, ..., 1, 0, 1], dtype=int64)

In [63]:
metrics.accuracy_score(y_train, pred_train)

0.9816390506045678

In [64]:
pred_test = lr_clf.predict(x_test_dtm)

In [65]:
#pred_test

In [66]:
metrics.accuracy_score(y_test, pred_test)

0.8716075156576201
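To put these accuracy scores in context, it helps to compare them with a majority-class baseline. The short calculation below uses the class counts reported earlier for the full filtered data. The ratio inside the test split may differ slightly, so treat the figure as approximate:

```python
n_pos, n_neg = 2672, 519  # value_counts() reported earlier

# Accuracy of a classifier that always predicts the larger class
majority_baseline = max(n_pos, n_neg) / (n_pos + n_neg)
print(round(majority_baseline, 4))
```

Always predicting "Positive" would already score roughly 0.84, so the test accuracies of about 0.87 for both models are only a few points above that baseline.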

### 15. Create a function called `tokenize_predict` that takes a count vectorizer object as input, creates document-term matrices from x_train and x_test, builds and trains a model on them, and prints the accuracy

In [67]:
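def tokenize_predict(vect):
    """Fit `vect` on x_train, train Naive Bayes, and print the test accuracy.

    Uses the global x_train, x_test, y_train and y_test defined above.
    """
    # learn the vocabulary on the training text only
    x_train_dtm = vect.fit_transform(x_train)
    print('Features: ', x_train_dtm.shape[1])
    # reuse that vocabulary for the test text
    x_test_dtm = vect.transform(x_test)
    nb = MultinomialNB()
    nb.fit(x_train_dtm, y_train)
    y_pred_class = nb.predict(x_test_dtm)
    print('Accuracy: ', metrics.accuracy_score(y_test, y_pred_class))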
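`tokenize_predict` reads its data from global variables, which is convenient in a notebook but makes the function harder to reuse. One possible variant, sketched here under the hypothetical name `tokenize_predict_explicit` and run on made-up toy tweets, takes the data as arguments and returns the results instead of printing them:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

def tokenize_predict_explicit(vect, x_train, y_train, x_test, y_test):
    """Hypothetical variant of tokenize_predict that takes the data as arguments."""
    x_train_dtm = vect.fit_transform(x_train)  # vocabulary from training text only
    x_test_dtm = vect.transform(x_test)
    model = MultinomialNB().fit(x_train_dtm, y_train)
    acc = accuracy_score(y_test, model.predict(x_test_dtm))
    return x_train_dtm.shape[1], acc

# Toy data (hypothetical)
toy_train = ["love this phone", "great app love it",
             "hate this phone", "terrible app hate it"]
toy_labels = [1, 1, 0, 0]
n_features, acc = tokenize_predict_explicit(
    CountVectorizer(), toy_train, toy_labels, ["love it", "hate it"], [1, 0]
)
print(n_features, acc)
```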

### 16. Create a count vectorizer with ngram_range=(1, 2) and pass it to the tokenize_predict function to print the accuracy score

Hint: vect = CountVectorizer(ngram_range=(1, 2))

In [68]:
# include 1-grams and 2-grams
#ngram range = (1,2) and default values for all other parameters

vect = CountVectorizer(ngram_range=(1,2))
tokenize_predict(vect)

Features:  24045
Accuracy:  0.8684759916492694
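With `ngram_range=(1, 2)` every document contributes its single words plus every pair of adjacent words, which is why the feature count grows to 24045. The helper below (`word_ngrams` is a hypothetical name, and this is an illustration of the idea rather than scikit-learn's own code) sketches how such n-grams are formed from a token list:

```python
def word_ngrams(tokens, min_n=1, max_n=2):
    """Return all contiguous word n-grams for n in [min_n, max_n]."""
    grams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams

grams = word_ngrams("can not wait".split())
print(grams)  # ['can', 'not', 'wait', 'can not', 'not wait']
```

Bigrams such as `can not` keep a little word order that single words lose, at the cost of a much larger and sparser feature space.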


### 17. Create a count vectorizer with stop_words='english' (keeping ngram_range=(1, 2)) and pass it to the tokenize_predict function to print the accuracy score

In [69]:
# remove English stop words
#ngram range = (1,2), stop_words='english' and default values for all other parameters

vect = CountVectorizer(ngram_range=(1,2),stop_words='english')
tokenize_predict(vect)

Features:  18629
Accuracy:  0.8747390396659708
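To see what `stop_words='english'` does to a single tweet, the vectorizer's analyzer can be called directly via `build_analyzer()`. In this small sketch (the input string is taken from the preprocessed sample shown earlier), common words such as "can", "not" and "for" should be dropped before the bigrams are built, so a bigram can join two words that were not adjacent in the original text:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Analyzer = lowercasing + tokenizing + stop-word removal + n-gram building
analyzer = CountVectorizer(ngram_range=(1, 2), stop_words='english').build_analyzer()
terms = analyzer("can not wait for ipad 2 also")
print(terms)
```

Note that removing "not" discards negation, which can matter for sentiment. Here the accuracy moved only slightly, from about 0.868 to 0.875.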


### 18. Create a count vectorizer with stop_words='english' and max_features=300 (keeping ngram_range=(1, 2)) and pass it to the tokenize_predict function to print the accuracy score

In [70]:
# remove English stop words and only keep 300 features
#ngram range = (1,2), stop_words='english', max_features=300 and default values for all other parameters

vect = CountVectorizer(ngram_range=(1,2),stop_words='english', max_features=300)
tokenize_predict(vect)

Features:  300
Accuracy:  0.7964509394572025
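`max_features` keeps only the terms with the highest total count across the corpus. A minimal sketch on a made-up three-document corpus, with counts chosen to avoid ties, illustrates the selection:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus (hypothetical): total counts are ipad=4, app=3, crash=1
docs = ["ipad ipad ipad app", "ipad app app", "crash"]
toy_vect = CountVectorizer(max_features=2)
toy_vect.fit(docs)

kept = sorted(toy_vect.vocabulary_)
print(kept)  # the two most frequent terms
```

With only 300 features, many informative but less frequent terms are discarded, which fits the drop in accuracy to about 0.80.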


### 19. Create a count vectorizer with ngram_range=(1, 2) and max_features=15000 and pass it to the tokenize_predict function to print the accuracy score

In [71]:
# include 1-grams and 2-grams, and limit the number of features to 15000
#ngram range = (1,2), max_features=15000 and default values for all other parameters

vect = CountVectorizer(ngram_range=(1,2), max_features=15000)
tokenize_predict(vect)

Features:  15000
Accuracy:  0.8663883089770354


### 20. Create a count vectorizer with ngram_range=(1, 2) that includes only terms appearing in at least 2 documents (min_df=2) and pass it to the tokenize_predict function to print the accuracy score

In [72]:
# include 1-grams and 2-grams, and only include terms that appear in at least 2 documents
#ngram range = (1,2), min_df=2 and default values for all other parameters

vect = CountVectorizer(ngram_range=(1,2), min_df=2)
tokenize_predict(vect)

Features:  7149
Accuracy:  0.8705636743215032
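An integer `min_df` counts documents, not total occurrences. In the toy corpus below (made up for illustration), "crash" occurs three times but only in one document, so `min_df=2` removes it, while "ipad" occurs once in each of two documents and is kept:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus (hypothetical)
docs = ["crash crash crash", "ipad app", "ipad"]
toy_vect = CountVectorizer(min_df=2)
toy_vect.fit(docs)

kept = sorted(toy_vect.vocabulary_)
print(kept)  # only terms found in at least 2 documents
```

In the notebook this filter shrinks the vocabulary from 24045 to 7149 features while the accuracy stays at about 0.87.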
