# **Twitter Sentiment Analysis**

# Problem Statement

The objective of this task is to detect hate speech in tweets. For the sake of simplicity, we say a tweet contains hate speech if it has a racist or sexist sentiment associated with it. So, the task is to distinguish racist or sexist tweets from other tweets.

## Data Description

You are provided with a train set and a test set. The train set contains the tweets and their corresponding labels, which indicate whether each tweet is racist/sexist or not. You should train the model using this training set. Here is a description of the files:

1. train.csv - For training the models, we provide a labelled dataset of 31,962 tweets. The dataset is a CSV file in which each line stores a tweet id, its label and the tweet.
2. test_tweets.csv - The test data file contains only tweet ids and the tweet text, with each tweet on a new line.

Your objective is to predict the labels of the given tweets in the test dataset.

The training set contains three variables:
1. id -> Unique identifier
2. label -> Binary target variable, where 1 denotes the tweet is racist/sexist and 0 denotes the tweet is not racist/sexist
3. tweet -> Text of the tweet

The variable label is not present in the test set and you have to predict this.

### Data Download

You can download the dataset from the [Datahack Platform](https://datahack.analyticsvidhya.com/contest/practice-problem-twitter-sentiment-analysis/). Several Machine Learning and Deep
Learning competitions are hosted on this platform. It gives you the following functionalities:

- You can view and compare your score against that of other participants on the leaderboard.
- You can also keep track of your submissions.

## Approach to the Problem
- Text pre-processing: inspect the data (for example with a word cloud) and clean it. This can include removing stop words, unimportant words, etc.
- Create meta features: count of words, average word length, count of mentions, etc.
- Train your model
- Apply the same pre-processing steps to the test set as you did to the training set, and create the same features
- Generate predictions for the test set using the trained model
- Save the predictions in a csv file (to check the format, refer to the sample submission file provided on the problem page)
- Submit your predictions on the problem page and check your rank on the leaderboard.
- To improve the score further, try out the advanced features like text representations: Bag of words (BOW) and tf-idf features
- You can also use the word embedding to improve the performance of your model
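
The meta features mentioned in the list above could be computed along these lines. This is a minimal sketch using only the standard library; the function name `meta_features` and the exact choice of features are illustrative assumptions, not part of the competition material.

```python
import re


def meta_features(tweet: str) -> dict:
    """Compute a few simple count-based features for one raw tweet."""
    words = tweet.split()
    n_words = len(words)
    # Average length of whitespace-separated tokens (0.0 for an empty tweet).
    avg_len = sum(len(w) for w in words) / n_words if n_words else 0.0
    return {
        "word_count": n_words,
        "avg_word_length": avg_len,
        "mention_count": len(re.findall(r"@\w+", tweet)),
        "hashtag_count": len(re.findall(r"#\w+", tweet)),
    }


example = meta_features("@user @user thanks for #lyft credit")
print(example)
```

Such features have to be computed on the raw tweet, before mentions and hashtags are stripped by the cleaning step.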

# Submission Details
The following 3 files are to be uploaded.

- test_predictions.csv - This should contain the 0/1 label for the tweets in test_tweets.csv, in the same order corresponding to the tweets in test_tweets.csv. Each 0/1 label should be in a new line.
- A .zip file of source code - The code should produce the output file submitted and must be properly commented.
 

Evaluation Metric:
The metric used for evaluating the performance of classification model would be F1-Score.

The metric can be understood as -

**True Positives (TP)** - These are the correctly predicted positive values which means that the value of actual class is yes and the value of predicted class is also yes.

**True Negatives (TN)** - These are the correctly predicted negative values which means that the value of actual class is no and value of predicted class is also no.

**False Positives (FP)** – When actual class is no and predicted class is yes.

**False Negatives (FN)** – When actual class is yes but predicted class is no.

**Precision** = TP / (TP + FP)

**Recall** = TP / (TP + FN)

**F1 Score** = 2*(Recall * Precision) / (Recall + Precision)

F1 is usually more useful than accuracy, especially when the class distribution is uneven.
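
To make the difference between accuracy and F1 concrete, here is a small sketch that computes both from 0/1 labels. The helper `f1_from_labels` and the toy data (7 positive tweets out of 100) are assumptions made for illustration; returning 0.0 when a denominator is zero is a convention chosen here, not something the problem statement specifies.

```python
def f1_from_labels(y_true, y_pred):
    """Return (accuracy, precision, recall, f1) for binary 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    accuracy = (tp + tn) / len(pairs)
    # Convention used here: a ratio with a zero denominator counts as 0.0.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return accuracy, precision, recall, f1


# Toy data: 100 tweets, 7 of them labelled 1 (an illustrative ratio).
y_true = [1] * 7 + [0] * 93

# A classifier that never predicts the positive class.
all_zero = [0] * 100
print(f1_from_labels(y_true, all_zero))  # (0.93, 0.0, 0.0, 0.0)

# A classifier with 5 true positives, 2 false negatives and 2 false positives.
some_hits = [1] * 5 + [0] * 2 + [1] * 2 + [0] * 91
print(f1_from_labels(y_true, some_hits))
```

The always-negative classifier reaches 93% accuracy while finding no hateful tweet at all, which is why F1 is the more informative score on this kind of data.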


# Get Data

Downloaded from hackathon site.  https://datahack.analyticsvidhya.com/contest/practice-problem-twitter-sentiment-analysis/#ProblemStatement

In [1]:
# from google.colab import drive
# drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [1]:
import pandas as pd
#data =pd.read_csv('train_E6oV3lV.csv', index_col=0)
train = pd.read_csv('train_E6oV3lV.csv', index_col=0)
test = pd.read_csv('test_tweets_anuFYb8.csv', index_col=0)


In [2]:
print(train.shape)
print(test.shape)

(31962, 2)
(17197, 1)


In [3]:
train.head()

id,label,tweet
1,0,@user when a father is dysfunctional and is s...
2,0,@user @user thanks for #lyft credit i can't us...
3,0,bihday your majesty
4,0,#model i love u take with u all the time in ...
5,0,factsguide: society now #motivation


# Preprocessing

Using pre-trained [VADER](https://predictivehacks.com/how-to-run-sentiment-analysis-in-python-using-vader/) which stands for **V**alence **A**ware **D**ictionary and s**E**ntiment **R**easoner and is specifically attuned to sentiments expressed in social media.

In [4]:
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

# import nltk
nltk.download('vader_lexicon')

sid = SentimentIntensityAnalyzer()

[nltk_data] Downloading package vader_lexicon to
[nltk_data]     C:\Users\czwea\AppData\Roaming\nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


In [8]:
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.stem.wordnet import WordNetLemmatizer
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\czwea\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [1]:
# print(stopwords.words('english'))

In [10]:
import re
from nltk import word_tokenize
from nltk.corpus import stopwords
# Tutorial about Python regular expressions: https://pymotw.com/2/re/
stop_words = set(stopwords.words('english'))  # a set makes membership tests fast

def preprocessing(document):
  """Lower-case, strip noise, lemmatize and remove stop words from each tweet."""
  document = list(document)
  lemm = WordNetLemmatizer()
  for i in range(0, len(document)):
    document[i] = document[i].lower()
    # drop the anonymised handle; note this also removes 'user' inside longer words
    document[i] = document[i].replace('user','')
    #regex to remove all the words with numbers for reference: https://stackoverflow.com/a/18082370/4084039
    document[i] = re.sub(r"\S*\d\S*", "", document[i]).strip()
    #remove special character: https://stackoverflow.com/a/5843547/4084039
    document[i] = re.sub('[^A-Za-z0-9]+', ' ', document[i])
    document[i] = word_tokenize(document[i])
    document[i] = [lemm.lemmatize(words) for words in document[i]]
    document[i] = ' '.join(e.lower() for e in document[i] if e.lower() not in stop_words)
  return document
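
The regular-expression steps of `preprocessing` can be tried in isolation on a single string. The sketch below uses only the standard library and leaves out tokenising, lemmatising and stop-word removal. The function name `clean_tweet` and the pattern `@user\b` are assumptions: the pattern is a possible alternative to `str.replace('user', '')` that leaves words such as "users" intact, and it is not what the notebook itself ran.

```python
import re


def clean_tweet(text: str) -> str:
    """Regex-only cleaning of one tweet (no tokenising or lemmatising)."""
    text = text.lower()
    # Drop the anonymised @user handle only, not every occurrence of "user".
    text = re.sub(r"@user\b", "", text)
    # Drop any whitespace-delimited token that contains a digit.
    text = re.sub(r"\S*\d\S*", "", text)
    # Replace every run of non-alphanumeric characters with a single space.
    text = re.sub(r"[^a-z0-9]+", " ", text)
    return text.strip()


print(clean_tweet("@user thanks, users of #lyft: 2 rides can't wait"))
```

With the original `replace('user', '')`, the word "users" in this example would be reduced to a stray "s".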

[Punkt Sentence Tokenizer](https://findanyanswer.com/what-is-nltk-punkt): This tokenizer divides a text into a list of sentences, by using an unsupervised algorithm to build a model for abbreviation words, collocations, and words that start sentences. 

[Wordnet](https://www.guru99.com/wordnet-nltk.html#:~:text=What%20is%20Wordnet%3F%20Wordnet%20is%20an%20NLTK%20corpus,it%20as%20a%20semantically%20oriented%20dictionary%20of%20English.): Wordnet is an NLTK corpus reader, a lexical database for English. It can be used to find the meaning of words, synonym or antonym. One can define it as a semantically oriented dictionary of English. Stats reveal that there are 155287 words and 117659 synonym sets included with English WordNet.

In [11]:
nltk.download('wordnet')
nltk.download('punkt')
from nltk import word_tokenize
preprocessed_train = preprocessing(train['tweet'])
preprocessed_test = preprocessing(test['tweet'])

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\czwea\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\czwea\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [13]:
preprocessed_train[:20]

['father dysfunctional selfish drag kid dysfunction run',
 'thanks lyft credit use cause offer wheelchair van pdx disapointed getthanked',
 'bihday majesty',
 'model love u take u time ur',
 'factsguide society motivation',
 'huge fan fare big talking leave chaos pay dispute get allshowandnogo',
 'camping tomorrow danny',
 'next school year year exam think school exam hate imagine actorslife revolutionschool girl',
 'love land allin cavs champion cleveland clevelandcavaliers',
 'welcome',
 'ireland consumer price index mom climbed previous may blog silver gold forex',
 'selfish orlando standwithorlando pulseshooting orlandoshooting biggerproblems selfish heabreaking value love',
 'get see daddy today gettingfed',
 'cnn call michigan middle school build wall chant tcot',
 'comment australia opkillingbay seashepherd helpcovedolphins thecove helpcovedolphins',
 'ouch junior junior yugyoem omg',
 'thankful paner thankful positive',
 'retweet agree',
 'friday smile around via ig cooky make 

In [14]:
train.shape,len(preprocessed_train)

((31962, 2), 31962)

# Using Unigram, Bigram, Trigram and Tetragrams in TFIDF Vectorizer

## Split Data

In [26]:
y = train['label']
x = preprocessed_train
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test=train_test_split(x, y, test_size = 0.33, stratify = y)


print("x_train: ", len(x_train), "x_test: ", len(x_test))
print("y_train: ", len(y_train), "y_test: ", len(y_test))

x_train:  21414 x_test:  10548
y_train:  21414 y_test:  10548


Here is a [good resource](https://towardsdatascience.com/hacking-scikit-learns-vectorizers-9ef26a7170af) to understand the following code.

Of course this helps too:  [sklearn.feature_extraction.text.TfidfVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html)

In [29]:
from sklearn.feature_extraction.text import TfidfVectorizer

#Unigram features
vectorizer_unigram = TfidfVectorizer()
vectorizer_unigram.fit(x_train)
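# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() needs scikit-learn >= 1.0
tfidf_words = set(vectorizer_unigram.get_feature_names_out())  # set for fast membership tests later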
x_tfidf = vectorizer_unigram.transform(x)
x_train_tfidf = vectorizer_unigram.transform(x_train)
x_test_tfidf = vectorizer_unigram.transform(x_test)
x_CV_tfidf = vectorizer_unigram.transform(preprocessed_test)
print(x_train_tfidf.shape)

(21414, 26737)

Bigram Features

In [31]:
# min_df: float or int, default=1
# When building the vocabulary, ignore terms whose document frequency is strictly lower than the given threshold (also called the cut-off in the literature).
# A float in [0.0, 1.0] is a proportion of documents; an integer is an absolute count. Ignored if vocabulary is not None.

vectorizer_bigram = TfidfVectorizer(ngram_range = (1, 2), min_df = 1)
vectorizer_bigram.fit(x_train)
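# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() needs scikit-learn >= 1.0
tfidf_bigram_words = vectorizer_bigram.get_feature_names_out()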
x_tfidf_bigram = vectorizer_bigram.transform(x)
x_train_tfidf_bigram = vectorizer_bigram.transform(x_train)
x_test_tfidf_bigram = vectorizer_bigram.transform(x_test)
x_CV_tfidf_bigram = vectorizer_bigram.transform(preprocessed_test)
print(x_train_tfidf_bigram.shape)

(21414, 131724)
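
To see what `ngram_range` and `min_df` do to the vocabulary, here is a pure-Python sketch of the idea. The helpers `word_ngrams` and `vocabulary` are hypothetical names written for this illustration and are not scikit-learn functions. The sketch splits on whitespace and supports integer `min_df` only, whereas `TfidfVectorizer`'s default tokeniser also drops single-character tokens.

```python
from collections import Counter


def word_ngrams(tokens, n_min, n_max):
    """Return all contiguous word n-grams with n_min <= n <= n_max."""
    return [
        " ".join(tokens[i:i + n])
        for n in range(n_min, n_max + 1)
        for i in range(len(tokens) - n + 1)
    ]


def vocabulary(docs, ngram_range=(1, 1), min_df=1):
    """Keep the n-grams that occur in at least min_df documents."""
    doc_freq = Counter()
    for doc in docs:
        # set(): document frequency counts each n-gram once per document.
        doc_freq.update(set(word_ngrams(doc.split(), *ngram_range)))
    return sorted(term for term, count in doc_freq.items() if count >= min_df)


docs = ["love land", "love land allin", "bihday majesty"]
vocab_all = vocabulary(docs, ngram_range=(1, 2), min_df=1)
vocab_min2 = vocabulary(docs, ngram_range=(1, 2), min_df=2)
print(len(vocab_all), vocab_all)
print(len(vocab_min2), vocab_min2)
```

Raising `min_df` from 1 to 2 removes every n-gram that appears in a single document, which is consistent with the (1, 3)-gram vocabulary below (with `min_df = 2`) being much smaller than the (1, 2)-gram vocabulary above (with `min_df = 1`).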


Trigram Features

In [32]:
#Trigram features
vectorizer_trigram = TfidfVectorizer(ngram_range = (1,3), min_df = 2)
vectorizer_trigram.fit(x_train)
x_train_tfidf_trigram = vectorizer_trigram.transform(x_train)
x_test_tfidf_trigram = vectorizer_trigram.transform(x_test)
x_CV_tfidf_trigram =vectorizer_trigram.transform(preprocessed_test)
print(x_train_tfidf_trigram.shape)

(21414, 28802)


Tetragram Features

In [33]:
#Tetragram features
vectorizer_tetragram = TfidfVectorizer(ngram_range = (1, 4), min_df = 2)
vectorizer_tetragram.fit(x_train)
x_train_tfidf_tetragram = vectorizer_tetragram.transform(x_train)
x_test_tfidf_tetragram = vectorizer_tetragram.transform(x_test)
x_CV_tfidf_tetragram = vectorizer_tetragram.transform(preprocessed_test)
print(x_train_tfidf_tetragram.shape)

(21414, 33344)


Pentagram Features

In [34]:
#Pentagram features
vectorizer_pentagram = TfidfVectorizer(ngram_range = (1, 5), min_df = 2)
vectorizer_pentagram.fit(x_train)
x_train_tfidf_pentagram = vectorizer_pentagram.transform(x_train)
x_test_tfidf_pentagram = vectorizer_pentagram.transform(x_test)
x_CV_tfidf_pentagram = vectorizer_pentagram.transform(preprocessed_test)
print(x_train_tfidf_pentagram.shape)

(21414, 36850)


Polarity scores come from VADER

Recall this was previously imported:

from nltk.sentiment.vader import SentimentIntensityAnalyzer
sid = SentimentIntensityAnalyzer()

Recalling 'x':

In [37]:
x[: 10]

['father dysfunctional selfish drag kid dysfunction run',
 'thanks lyft credit use cause offer wheelchair van pdx disapointed getthanked',
 'bihday majesty',
 'model love u take u time ur',
 'factsguide society motivation',
 'huge fan fare big talking leave chaos pay dispute get allshowandnogo',
 'camping tomorrow danny',
 'next school year year exam think school exam hate imagine actorslife revolutionschool girl',
 'love land allin cavs champion cleveland clevelandcavaliers',
 'welcome']

In [38]:
ss_tr = []
ss_data = []
ss_test = []
ss_cv = []

for i in x:
    s = sid.polarity_scores(i)
    ss_data.append(list(s.values()))

for i in x_train:
    s = sid.polarity_scores(i)
    ss_tr.append(list(s.values()))

for i in x_test:
    s=sid.polarity_scores(i)
    ss_test.append(list(s.values()))

for i in preprocessed_test:
    s = sid.polarity_scores(i)
    ss_cv.append(list(s.values()))

In [39]:
from scipy.sparse import hstack
# Stack unigram tf-idf, (1, 2)-gram tf-idf and the 4 VADER scores column-wise.
# Note: the (1, 2)-gram vectorizer already contains the unigrams, so they appear twice.
data = hstack((x_tfidf, x_tfidf_bigram, ss_data)).tocsr()
x_tr = hstack((x_train_tfidf, x_train_tfidf_bigram, ss_tr)).tocsr()
x_cv = hstack((x_CV_tfidf, x_CV_tfidf_bigram, ss_cv)).tocsr()
x_tst = hstack((x_test_tfidf, x_test_tfidf_bigram, ss_test)).tocsr()
print(data.shape, x_tr.shape, x_tst.shape, x_cv.shape)

(31962, 158465) (21414, 158465) (10548, 158465) (17197, 158465)


**Using LinearSVC Algorithm**

In [41]:
from sklearn.svm import LinearSVC
from sklearn.metrics import roc_auc_score,f1_score,accuracy_score
# Sweep the regularisation strength C and compare train vs. hold-out scores.
C=[0.001,0.01,0.1,0.5,1,10,100,1000]
for i in C:
  model = LinearSVC(penalty='l2', C=i, dual=False, random_state=0, max_iter=1000)
  model.fit(x_tr, y_train)
  y_pred_tr = model.predict(x_tr)
  y_pred_te = model.predict(x_tst)
  print("C: {}".format(i))
  # sklearn metrics take (y_true, y_pred); accuracy and binary F1 are unchanged by the order
  print('Train Accuracy:', accuracy_score(y_train, y_pred_tr))
  print("Train F1 Score: ", f1_score(y_train, y_pred_tr))
  print('Test Accuracy:', accuracy_score(y_test, y_pred_te))
  print("Test F1 Score: ", f1_score(y_test, y_pred_te))

C: 0.001
Train Accuracy: 0.929858970766788
Train F1 Score:  0.0
Test Accuracy: 0.9298445202882063
Test F1 Score:  0.0
C: 0.01
Train Accuracy: 0.9360698608387037
Train F1 Score:  0.1647345942647956
Test Accuracy: 0.9346795601061813
Test F1 Score:  0.13333333333333333
C: 0.1
Train Accuracy: 0.9688521527972355
Train F1 Score:  0.7172530733361595
Test Accuracy: 0.9549677664012135
Test F1 Score:  0.5489078822412156
C: 0.5
Train Accuracy: 0.9992528252545064
Train F1 Score:  0.9946524064171123
Test Accuracy: 0.9635949943117179
Test F1 Score:  0.6805324459234608
C: 1
Train Accuracy: 0.9997198094704399
Train F1 Score:  0.998
Test Accuracy: 0.9642586272279106
Test F1 Score:  0.6957223567393059
C: 10
Train Accuracy: 0.9998132063136266
Train F1 Score:  0.9986666666666667
Test Accuracy: 0.9636897990140311
Test F1 Score:  0.7033307513555384
C: 100
Train Accuracy: 0.9998132063136266
Train F1 Score:  0.9986666666666667
Test Accuracy: 0.9633105802047781
Test F1 Score:  0.7006960556844547
C: 1000
Train 

In [43]:
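best_C = 10  # highest hold-out F1 among the C values shown in the sweep above
model = LinearSVC(penalty = 'l2', C = best_C, dual = False, random_state = 0, max_iter = 1000)
# Refit on all labelled data for the final submission. Because `data` contains the
# hold-out rows, the "Test" scores printed below are inflated by leakage and are not
# an estimate of generalisation; the hold-out score from the sweep is the one to trust.
model.fit(data,y)
y_pred_tr = model.predict(x_tr)
y_pred_te = model.predict(x_tst)
print("C: {}".format(best_C))
print('Train Accuracy:', accuracy_score(y_train, y_pred_tr))
print("Train F1 Score: ", f1_score(y_train, y_pred_tr))
print('Test Accuracy:', accuracy_score(y_test, y_pred_te))
print("Test F1 Score: ", f1_score(y_test, y_pred_te))

C: 10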
Train Accuracy: 0.9997665078920333
Train F1 Score:  0.9983327775925308
Test Accuracy: 0.9994311717861206
Test F1 Score:  0.9959294436906377


In [234]:
y_pred_cv=model.predict(x_cv)
y_pred_cv=pd.DataFrame(y_pred_cv)
y_pred_cv.to_csv('/content/drive/My Drive/Twitter Sentiment Analysis/test_predictions_linearSVC.csv', index=True)

**Using Decision Trees Classifier**

In [44]:
from sklearn.tree import DecisionTreeClassifier
depth=[1,5,10,50,100, 500, 1000]
samples_split=[5, 10, 20, 45, 75, 100, 135, 270, 500]
train_f1_score=[]
test_f1_score=[]
for i in depth:
    for j in samples_split:
        clf=DecisionTreeClassifier(max_depth=i,min_samples_split=j,class_weight='balanced',random_state=0)
        clf.fit(x_tr, y_train)
        y_pred_tr = clf.predict(x_tr)
        y_pred_te=clf.predict(x_tst)
        train_f1 = f1_score(y_train, y_pred_tr)
        test_f1 = f1_score(y_test, y_pred_te)
        print("Depth: {} Samples_Split: {} Train F1 Score: {} Test F1 Score: {}".format(i,j,train_f1,test_f1))
        train_f1_score.append(train_f1)
        test_f1_score.append(test_f1)
    print('\n')

Depth: 1 Samples_Split: 5 Train F1 Score: 0.17929810187081796 Test F1 Score: 0.18522632602054986
Depth: 1 Samples_Split: 10 Train F1 Score: 0.17929810187081796 Test F1 Score: 0.18522632602054986
Depth: 1 Samples_Split: 20 Train F1 Score: 0.17929810187081796 Test F1 Score: 0.18522632602054986
Depth: 1 Samples_Split: 45 Train F1 Score: 0.17929810187081796 Test F1 Score: 0.18522632602054986
Depth: 1 Samples_Split: 75 Train F1 Score: 0.17929810187081796 Test F1 Score: 0.18522632602054986
Depth: 1 Samples_Split: 100 Train F1 Score: 0.17929810187081796 Test F1 Score: 0.18522632602054986
Depth: 1 Samples_Split: 135 Train F1 Score: 0.17929810187081796 Test F1 Score: 0.18522632602054986
Depth: 1 Samples_Split: 270 Train F1 Score: 0.17929810187081796 Test F1 Score: 0.18522632602054986
Depth: 1 Samples_Split: 500 Train F1 Score: 0.17929810187081796 Test F1 Score: 0.18522632602054986


Depth: 5 Samples_Split: 5 Train F1 Score: 0.32712915961646927 Test F1 Score: 0.32494279176201374
Depth: 5 Samples

In [None]:
clf = DecisionTreeClassifier(max_depth = 1000, min_samples_split = 500, class_weight = 'balanced', random_state=0)
clf.fit(x_tr, y_train)
y_pred_tr = clf.predict(x_tr)
y_pred_te = clf.predict(x_tst)
y_pred_cv = clf.predict(x_cv)  # predictions for the unlabelled competition test set
train_score = f1_score(y_train, y_pred_tr)
test_score = f1_score(y_test, y_pred_te)
# no cv_score: the competition test set has no labels to score against
print(train_score, test_score)

**Part 3: Using BOW features**

In [222]:
from sklearn.feature_extraction.text import CountVectorizer
count_vect = CountVectorizer() #in scikit-learn
count_vect.fit(x_train)

x_train_bow = count_vect.transform(x_train)
x_test_bow = count_vect.transform(x_test)
x_cv_bow = count_vect.transform(preprocessed_test)

In [223]:
x_tr = hstack((x_train_bow, ss_tr)).tocsr()
x_cv = hstack((x_cv_bow, ss_cv)).tocsr()
x_tst = hstack((x_test_bow, ss_test)).tocsr()
print(x_tr.shape, x_tst.shape, x_cv.shape)

(21414, 26942) (10548, 26942) (17197, 26942)


In [224]:
from sklearn.svm import LinearSVC
from sklearn.metrics import roc_auc_score,f1_score,accuracy_score
# Note: this sweep fits on the BOW counts only; the BOW + VADER matrices
# (x_tr, x_tst) built in the previous cell are not used here.
C=[0.001,0.01,0.1,0.5,1,10,100,1000]
for i in C:
  model = LinearSVC(penalty = 'l2', C = i, dual = False, random_state = 0, max_iter = 1000)
  model.fit(x_train_bow, y_train)
  y_pred_tr = model.predict(x_train_bow)
  y_pred_te = model.predict(x_test_bow)
  print("C: {}".format(i))
  # sklearn metrics take (y_true, y_pred); accuracy and binary F1 are unchanged by the order
  print('Train Accuracy:', accuracy_score(y_train, y_pred_tr))
  print("Train F1 Score: ", f1_score(y_train, y_pred_tr))
  print('Test Accuracy:', accuracy_score(y_test, y_pred_te))
  print("Test F1 Score: ", f1_score(y_test, y_pred_te))

C: 0.001
Train Accuracy: 0.9300457644531614
Train F1 Score:  0.007947019867549669
Test Accuracy: 0.9298445202882063
Test F1 Score:  0.0
C: 0.01
Train Accuracy: 0.9516671336508826
Train F1 Score:  0.4868616757560733
Test Accuracy: 0.94871065604854
Test F1 Score:  0.44169246646026833
C: 0.1
Train Accuracy: 0.9860371719435883
Train F1 Score:  0.8899521531100479
Test Accuracy: 0.9635949943117179
Test F1 Score:  0.6740237691001698
C: 0.5
Train Accuracy: 0.9977117773419258
Train F1 Score:  0.9834515366430261
Test Accuracy: 0.9654910883579826
Test F1 Score:  0.7187017001545596
C: 1
Train Accuracy: 0.9987858410385729
Train F1 Score:  0.9912810194500337
Test Accuracy: 0.962836556693212
Test F1 Score:  0.7048192771084338
C: 10
Train Accuracy: 0.9994396189408798
Train F1 Score:  0.9959946595460614
Test Accuracy: 0.9534508911642018
Test F1 Score:  0.6587908269631689
C: 100
Train Accuracy: 0.9995797142056598
Train F1 Score:  0.996996996996997
Test Accuracy: 0.949089874857793
Test F1 Score:  0.63691

**Part 3: Using Tfidf Weighted Word2Vec features**

In [194]:
import pickle
with open('/content/drive/My Drive/glove_vectors', 'rb') as f:
    model = pickle.load(f)
    glove_words =  set(model.keys())

In [195]:
# Map each unigram to its idf value (get_feature_names() was removed in scikit-learn 1.2)
dictionary = dict(zip(vectorizer_unigram.get_feature_names_out(), list(vectorizer_unigram.idf_)))

In [196]:
import numpy as np
from tqdm import tqdm
tfidf_w2v_vectors_tr = []  # the tf-idf weighted w2v vector of each tweet is stored in this list
for sentence in tqdm(x_train):  # for each tweet
    vector = np.zeros(300)  # accumulator with the same length as the 300-dimensional word vectors
    tf_idf_weight = 0  # running sum of the tf-idf weights used in this tweet
    for word in sentence.split():  # for each word in the tweet
        if (word in glove_words) and (word in tfidf_words):
            vec = model[word]  # getting the vector for each word
            # tf-idf = idf value (dictionary[word]) * tf value (sentence.count(word)/len(sentence.split()))
            # note: sentence.count(word) counts substring matches, not whole tokens
            tf_idf = dictionary[word]*(sentence.count(word)/len(sentence.split()))
            vector += (vec * tf_idf)  # accumulate the tf-idf weighted word vector
            tf_idf_weight += tf_idf
    if tf_idf_weight != 0:
        vector /= tf_idf_weight
    tfidf_w2v_vectors_tr.append(vector)

100%|██████████| 21414/21414 [00:33<00:00, 634.78it/s]


In [197]:
tfidf_w2v_vectors_te = []  # the tf-idf weighted w2v vector of each hold-out tweet is stored in this list
for sentence in tqdm(x_test):  # for each tweet
    vector = np.zeros(300)  # accumulator with the same length as the 300-dimensional word vectors
    tf_idf_weight = 0  # running sum of the tf-idf weights used in this tweet
    for word in sentence.split():  # for each word in the tweet
        if (word in glove_words) and (word in tfidf_words):
            vec = model[word]  # getting the vector for each word
            # tf-idf = idf value (dictionary[word]) * tf value (sentence.count(word)/len(sentence.split()))
            tf_idf = dictionary[word]*(sentence.count(word)/len(sentence.split()))
            vector += (vec * tf_idf)  # accumulate the tf-idf weighted word vector
            tf_idf_weight += tf_idf
    if tf_idf_weight != 0:
        vector /= tf_idf_weight
    tfidf_w2v_vectors_te.append(vector)

100%|██████████| 10548/10548 [00:17<00:00, 610.55it/s]


In [198]:
from tqdm import tqdm

tfidf_w2v_vectors_cv = []  # the tf-idf weighted w2v vector of each competition test tweet is stored in this list

for sentence in tqdm(preprocessed_test):  # for each tweet
    vector = np.zeros(300)  # accumulator with the same length as the 300-dimensional word vectors
    tf_idf_weight = 0  # running sum of the tf-idf weights used in this tweet
    for word in sentence.split():  # for each word in the tweet
        if (word in glove_words) and (word in tfidf_words):
            vec = model[word]  # getting the vector for each word
            # tf-idf = idf value (dictionary[word]) * tf value (sentence.count(word)/len(sentence.split()))
            tf_idf = dictionary[word]*(sentence.count(word)/len(sentence.split()))
            vector += (vec * tf_idf)  # accumulate the tf-idf weighted word vector
            tf_idf_weight += tf_idf
    if tf_idf_weight != 0:
        vector /= tf_idf_weight
    tfidf_w2v_vectors_cv.append(vector)

100%|██████████| 17197/17197 [00:28<00:00, 612.95it/s]


In [199]:
tfidf_w2v_vectors_tr=np.array(tfidf_w2v_vectors_tr)
tfidf_w2v_vectors_te=np.array(tfidf_w2v_vectors_te)
tfidf_w2v_vectors_cv=np.array(tfidf_w2v_vectors_cv)
print(tfidf_w2v_vectors_tr.shape,tfidf_w2v_vectors_te.shape)

(21414, 300) (10548, 300)
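
The three loops above repeat the same computation, so it could be factored into one helper. The following is a minimal standard-library sketch under stated assumptions: the name `tfidf_weighted_vector`, the toy two-dimensional `vectors` and the toy `idf` values are all made up for illustration. Unlike the notebook loops, it counts whole tokens with `Counter` (instead of substring matches via `sentence.count`) and adds each distinct word once.

```python
from collections import Counter


def tfidf_weighted_vector(sentence, vectors, idf, dim):
    """Average the word vectors of a sentence, weighted by tf * idf."""
    tokens = sentence.split()
    total = [0.0] * dim
    weight_sum = 0.0
    for word, count in Counter(tokens).items():
        # Skip words without a word vector or without an idf value.
        if word in vectors and word in idf:
            weight = idf[word] * count / len(tokens)  # tf * idf
            total = [t + weight * v for t, v in zip(total, vectors[word])]
            weight_sum += weight
    if weight_sum:
        total = [t / weight_sum for t in total]
    return total


# Toy inputs (assumed values, not taken from GloVe or the fitted vectorizer).
vectors = {"love": [1.0, 0.0], "hate": [0.0, 1.0]}
idf = {"love": 1.0, "hate": 3.0}
print(tfidf_weighted_vector("love hate unknown", vectors, idf, dim=2))
```

In this toy case the rarer word "hate" has three times the idf of "love", so the result sits three quarters of the way towards the "hate" vector.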


In [154]:
tfidf_w2v_vectors_tr.shape,len(ss_tr[0])

((21414, 300), 4)

In [201]:
from sklearn.svm import LinearSVC
from sklearn.metrics import roc_auc_score,f1_score,accuracy_score
C=[0.001,0.01,0.1,0.5,1,10,100,1000]
for i in C:
  model = LinearSVC(penalty = 'l2', C = i, dual = False, random_state = 0, max_iter = 1000)
  model.fit(tfidf_w2v_vectors_tr, y_train)
  y_pred_tr = model.predict(tfidf_w2v_vectors_tr)
  y_pred_te = model.predict(tfidf_w2v_vectors_te)
  print("C: {}".format(i))
  # sklearn metrics take (y_true, y_pred); accuracy and binary F1 are unchanged by the order
  print('Train Accuracy:', accuracy_score(y_train, y_pred_tr))
  print("Train F1 Score: ", f1_score(y_train, y_pred_tr))
  print('Test Accuracy:', accuracy_score(y_test, y_pred_te))
  print("Test F1 Score: ", f1_score(y_test, y_pred_te))

C: 0.001
Train Accuracy: 0.9374241150649109
Train F1 Score:  0.24379232505643342
Test Accuracy: 0.9373340917709518
Test F1 Score:  0.2547914317925592
C: 0.01
Train Accuracy: 0.9425609414401793
Train F1 Score:  0.39048562933597625
Test Accuracy: 0.9416003033750474
Test F1 Score:  0.39130434782608686
C: 0.1
Train Accuracy: 0.9448958625198468
Train F1 Score:  0.44652908067542213
Test Accuracy: 0.9429275692074327
Test F1 Score:  0.4331450094161958
C: 0.5
Train Accuracy: 0.9453628467357803
Train F1 Score:  0.4563197026022304
Test Accuracy: 0.943022373909746
Test F1 Score:  0.43779232927970063
C: 1
Train Accuracy: 0.9454095451573736
Train F1 Score:  0.45653184565318455
Test Accuracy: 0.9428327645051194
Test F1 Score:  0.4369747899159664
C: 10
Train Accuracy: 0.9453628467357803
Train F1 Score:  0.4568245125348189
Test Accuracy: 0.9427379598028062
Test F1 Score:  0.43656716417910446
C: 100
Train Accuracy: 0.9453628467357803
Train F1 Score:  0.4568245125348189
Test Accuracy: 0.9427379598028062
