<a href="https://colab.research.google.com/github/bucuram/foundations-of-NLP-labs/blob/main/Lab3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#Word Representations

###Bag of Words

A bag-of-words is a representation of text that describes the occurrence of words within a document. 

It is called a ***bag*** of words because any information about the order or structure of words in the document is discarded. The model is only concerned with whether (and how often) known words occur in the document, not where they occur.
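As a minimal sketch of this idea, the snippet below builds count vectors by hand using only the standard library. The whitespace tokenizer and the helper names `tokenize` and `bag_of_words` are illustrative assumptions, not how any particular library works internally.

```python
from collections import Counter

def tokenize(text):
    # Naive assumption: lowercase and split on whitespace (no punctuation handling).
    return text.lower().split()

def bag_of_words(documents):
    tokenized = [tokenize(doc) for doc in documents]
    # The vocabulary is the sorted set of all tokens seen in the corpus.
    vocabulary = sorted({token for tokens in tokenized for token in tokens})
    # Each document becomes a vector of counts, one entry per vocabulary word.
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append([counts[word] for word in vocabulary])
    return vocabulary, vectors

toy_docs = ["the cat sat", "the dog sat on the cat"]
toy_vocabulary, toy_vectors = bag_of_words(toy_docs)
print(toy_vocabulary)
print(toy_vectors)
```

Each column corresponds to one vocabulary word, so the position of a word in the original sentence plays no role in the resulting vector.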

If your dataset is small and its vocabulary is domain-specific, BoW may work better than word embeddings, because pre-trained embedding models may not contain vectors for some of your words.

![bow](https://miro.medium.com/max/554/0*B9GC_f3BMtjGMdQ-.png)

[Photo source](https://medium.com/analytics-vidhya/does-tf-idf-work-differently-in-textbooks-and-sklearn-routine-cc7a7d1b580d)

In [60]:
corpus = ["Flora is all the plant life present in a particular region or time, generally the naturally occurring (indigenous) native plants.",
    "The corresponding term for animal life is fauna. Flora, fauna, and other forms of life, such as fungi, are collectively referred to as biota.",
    "Sometimes bacteria and fungi are also referred to as flora, as in the terms gut flora or skin flora."]

corpus

['Flora is all the plant life present in a particular region or time, generally the naturally occurring (indigenous) native plants.',
 'The corresponding term for animal life is fauna. Flora, fauna, and other forms of life, such as fungi, are collectively referred to as biota.',
 'Sometimes bacteria and fungi are also referred to as flora, as in the terms gut flora or skin flora.']

We will create the BoW vectors using `CountVectorizer`.

In [54]:
from sklearn.feature_extraction.text import CountVectorizer

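# lowercase=False keeps case, so 'Flora' and 'flora' become separate features
vectorizer = CountVectorizer(lowercase=False)
bow_representation = vectorizer.fit_transform(corpus)
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
vocabulary = vectorizer.get_feature_names_out().tolist()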

print(vocabulary)
print(len(vocabulary))
print(bow_representation.toarray())

['Flora', 'Sometimes', 'The', 'all', 'also', 'and', 'animal', 'are', 'as', 'bacteria', 'biota', 'collectively', 'corresponding', 'fauna', 'flora', 'for', 'forms', 'fungi', 'generally', 'gut', 'in', 'indigenous', 'is', 'life', 'native', 'naturally', 'occurring', 'of', 'or', 'other', 'particular', 'plant', 'plants', 'present', 'referred', 'region', 'skin', 'such', 'term', 'terms', 'the', 'time', 'to']
43
[[1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 1 1 1 1 1 1 0 1 0 1 1 1 1 0 1
  0 0 0 0 2 1 0]
 [1 0 1 0 0 1 1 1 2 0 1 1 1 2 0 1 1 1 0 0 0 0 1 2 0 0 0 1 0 1 0 0 0 0 1 0
  0 1 1 0 0 0 1]
 [0 1 0 0 1 1 0 1 2 1 0 0 0 0 3 0 0 1 0 1 1 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0
  1 0 0 1 1 0 1]]


####N-grams encoding

N-gram encoding extracts features from text while capturing local word order, by counting sequences of n consecutive words over a sliding window.
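The sliding window can be sketched in a few lines of plain Python. The helper name `ngrams` and the whitespace tokenization are assumptions made for illustration only.

```python
from collections import Counter

def ngrams(tokens, n):
    # Slide a window of size n over the token list, one position at a time.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

sentence_tokens = "gut flora or skin flora".split()
bigrams = ngrams(sentence_tokens, 2)
bigram_counts = Counter(bigrams)

print(bigrams)
print(bigram_counts["skin flora"])
```

With `n=1` the function returns the individual tokens, which is the plain bag-of-words case.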

![ngrams](https://i.stack.imgur.com/8ARA1.png)


In [56]:
bigram = CountVectorizer(lowercase=False, ngram_range=(2, 2))
bigram_representation = bigram.fit_transform(corpus)

# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
bigram_vocabulary = bigram.get_feature_names_out().tolist()

print(bigram_vocabulary)
print(len(bigram_vocabulary))
print(bigram_representation.toarray())

['Flora fauna', 'Flora is', 'Sometimes bacteria', 'The corresponding', 'all the', 'also referred', 'and fungi', 'and other', 'animal life', 'are also', 'are collectively', 'as biota', 'as flora', 'as fungi', 'as in', 'bacteria and', 'collectively referred', 'corresponding term', 'fauna Flora', 'fauna and', 'flora as', 'flora or', 'for animal', 'forms of', 'fungi are', 'generally the', 'gut flora', 'in particular', 'in the', 'indigenous native', 'is all', 'is fauna', 'life is', 'life present', 'life such', 'native plants', 'naturally occurring', 'occurring indigenous', 'of life', 'or skin', 'or time', 'other forms', 'particular region', 'plant life', 'present in', 'referred to', 'region or', 'skin flora', 'such as', 'term for', 'terms gut', 'the naturally', 'the plant', 'the terms', 'time generally', 'to as']
56
[[0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 1 1 0 0 1 0 1
  1 1 0 0 1 0 1 1 1 0 1 0 0 0 0 1 1 0 1 0]
 [1 0 0 1 0 0 0 1 1 0 1 1 0 1 0 0 1 1 1 1 0 0 1 1 1 0 0 0 0 

###TF-IDF

TF-IDF represents text data by indicating how important a word is to a document, relative to how common that word is across the whole corpus.

A problem with scoring raw word frequency is that highly frequent words start to dominate the document representation (they receive larger scores), even though they may carry less “informational content” for the model than rarer, domain-specific words.

One approach is to rescale the frequency of words by how often they appear in all documents, so that the scores for frequent words like “the” that are also frequent across all documents are penalized.

This approach to scoring is called Term Frequency – Inverse Document Frequency, or TF-IDF for short, where:

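- **Term Frequency**: how often a given term occurs in a document.

- **Inverse Document Frequency**: a measure of how rare a term is across the corpus, typically the logarithm of the total number of documents divided by the number of documents that contain the term.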

![tf](https://www.affde.com/uploads/article/5516/PVpklt43xBCKRFBa.png)

TF-IDF penalizes stopwords, so they will not receive high scores, but stopword removal may still be used to reduce the dimensionality of the input space.
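To make the two factors concrete, here is a minimal sketch of one common textbook variant, where TF is the count divided by the document length and IDF is the natural logarithm of N / df. The helper name `tf_idf` and the toy documents are assumptions for illustration. Note that scikit-learn's `TfidfVectorizer` by default uses a smoothed IDF and L2-normalizes each row, so its numbers will differ from this sketch.

```python
import math
from collections import Counter

def tf_idf(tokenized_docs):
    n_docs = len(tokenized_docs)
    # Document frequency: the number of documents that contain each term.
    doc_freq = Counter(term for doc in tokenized_docs for term in set(doc))
    scores = []
    for doc in tokenized_docs:
        counts = Counter(doc)
        # TF (relative frequency in this document) times IDF (log of N / df).
        scores.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores

toy_docs = [["the", "flora", "of", "the", "region"],
            ["the", "fauna", "of", "the", "region"],
            ["the", "gut", "flora"]]
toy_scores = tf_idf(toy_docs)

print(toy_scores[0]["the"])    # "the" occurs in every document
print(toy_scores[2]["gut"])    # "gut" occurs in a single document
print(toy_scores[2]["flora"])  # "flora" occurs in two of three documents
```

In this variant a word that appears in every document gets an IDF of zero, while a word confined to a single document gets the largest IDF.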

In [68]:
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf_vectorizer = TfidfVectorizer(lowercase = False)
tfidf_representation = tfidf_vectorizer.fit_transform(corpus)
print(tfidf_representation.toarray())

[[0.18334923 0.         0.         0.2410822  0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.2410822  0.         0.18334923 0.2410822  0.18334923 0.18334923
  0.2410822  0.2410822  0.2410822  0.         0.18334923 0.
  0.2410822  0.2410822  0.2410822  0.2410822  0.         0.2410822
  0.         0.         0.         0.         0.36669846 0.2410822
  0.        ]
 [0.15630031 0.         0.20551613 0.         0.         0.15630031
  0.20551613 0.15630031 0.31260063 0.         0.20551613 0.20551613
  0.20551613 0.41103226 0.         0.20551613 0.20551613 0.15630031
  0.         0.         0.         0.         0.15630031 0.31260063
  0.         0.         0.         0.20551613 0.         0.20551613
  0.         0.         0.         0.         0.15630031 0.
  0.         0.20551613 0.20551613 0.         0.         0.
  0.15630031]
 [0.         0.21348818 0.         0.         0.21348818 0.16236326
  

####Limitations

- **Vocabulary**: The vocabulary requires careful design, specifically to manage its size, which impacts the sparsity of the document representations.

- **Sparsity:** Sparse representations are harder to model, both for computational reasons (space and time complexity) and for information reasons: the model has to make use of very little information spread over a very large representational space.

- **Meaning**: Discarding word order ignores the context, and in turn the meaning, of words in the document (semantics). Context and meaning can offer a lot to the model: if captured, they could distinguish the same words arranged differently (“this is interesting” vs “is this interesting”), recognize synonyms (“old bike” vs “used bike”), and much more.
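The word-order limitation can be checked directly on the example phrases. The sketch below uses a naive whitespace tokenizer and hypothetical helpers `bow_counts` and `bigram_set`, both introduced only for this illustration.

```python
from collections import Counter

def bow_counts(text):
    # Illustrative whitespace tokenizer; real vectorizers tokenize differently.
    return Counter(text.lower().split())

def bigram_set(text):
    # Pairs of adjacent tokens, which retain some local word order.
    tokens = text.lower().split()
    return set(zip(tokens, tokens[1:]))

statement = "this is interesting"
question = "is this interesting"

same_bag = bow_counts(statement) == bow_counts(question)
same_bigrams = bigram_set(statement) == bigram_set(question)
shared_words = set(bow_counts("old bike")) & set(bow_counts("used bike"))

print(same_bag)       # identical bags of words
print(same_bigrams)   # different bigrams
print(shared_words)   # the synonyms "old" and "used" do not overlap
```

Bigrams recover part of the lost ordering, but neither representation knows that "old" and "used" are related.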

##Sentiment analysis

In [69]:
import nltk
nltk.download('twitter_samples')

[nltk_data] Downloading package twitter_samples to /root/nltk_data...
[nltk_data]   Unzipping corpora/twitter_samples.zip.


True

In [75]:
from nltk.corpus import twitter_samples

pos_tweets = twitter_samples.strings('positive_tweets.json')
print(len(pos_tweets))

neg_tweets = twitter_samples.strings('negative_tweets.json')
print(len(neg_tweets))

5000
5000


In [85]:
import pandas as pd
pos_df = pd.DataFrame(pos_tweets, columns = ['tweet'])
pos_df['label'] = 1

In [86]:
neg_df = pd.DataFrame(neg_tweets, columns = ['tweet'])
neg_df['label'] = 0

In [87]:
data_df = pd.concat([pos_df, neg_df], ignore_index=True)
data_df

| | tweet | label |
|---|---|---|
| 0 | #FollowFriday @France_Inte @PKuchly57 @Milipol... | 1 |
| 1 | @Lamb2ja Hey James! How odd :/ Please call our... | 1 |
| 2 | @DespiteOfficial we had a listen last night :)... | 1 |
| 3 | @97sides CONGRATS :) | 1 |
| 4 | yeaaaah yippppy!!! my accnt verified rqst has... | 1 |
| ... | ... | ... |
| 9995 | I wanna change my avi but uSanele :( | 0 |
| 9996 | MY PUPPY BROKE HER FOOT :( | 0 |
| 9997 | where's all the jaebum baby pictures :(( | 0 |
| 9998 | But but Mr Ahmad Maslan cooks too :( https://t... | 0 |


####Split data into train and test

In [88]:
from sklearn.model_selection import train_test_split

train_df, test_df = train_test_split(data_df, test_size=0.2, shuffle = True)
print(train_df)
print(test_df)

                                                  tweet  label
3520  @MusicMetrop @LostInMuzic @karenak @MosesMo @c...      1
9447                           @sugarymgc i miss you :(      0
8347                        Miss u :-( @deepikapadukone      0
2274  @sculptorfred You're doing well for a beginner...      1
9425                            I miss my boyfriend :-(      0
...                                                 ...    ...
5669  So Much New Music that I cant record yet coz m...      0
3601  💃💃💃"@Boitumelo_SA: @Roooosta happy birthday to...      1
7579  @itsNotMirna so true, I voted for them so many...      0
2530  @savkra wow. So much hate came your way for sp...      1
9603                                      I need hug :(      0

[8000 rows x 2 columns]
                                                  tweet  label
6186  @RichelleMead it's being more than enough sinc...      0
6326  @VMilas dont  really know sorry.:( try opening...      0
4241  @katie_taylorkay yes omg

####TF-IDF Vectorization

In [91]:
tfidf_vectorizer = TfidfVectorizer(lowercase=False)
# Fit the vocabulary and IDF weights on the training tweets only
tfidf_vectorizer.fit(train_df['tweet'])

X_train = tfidf_vectorizer.transform(train_df['tweet'])
X_test = tfidf_vectorizer.transform(test_df['tweet'])

y_train = train_df['label']
y_test = test_df['label']


####Classification

In [92]:
from sklearn.linear_model import LogisticRegression

logreg = LogisticRegression()
logreg.fit(X_train, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

In [97]:
print('Train Score', logreg.score(X_train, y_train))
print('Test Score', logreg.score(X_test, y_test))

Train Score 0.8995
Test Score 0.7675
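For scikit-learn classifiers, `score` returns the mean accuracy, i.e. the fraction of examples whose predicted label matches the true label. A minimal sketch of that computation, using a hypothetical helper `accuracy` and made-up toy labels:

```python
def accuracy(y_true, y_pred):
    # Fraction of positions where the prediction equals the true label.
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

toy_true = [1, 0, 1, 1, 0]
toy_pred = [1, 0, 0, 1, 1]
toy_accuracy = accuracy(toy_true, toy_pred)
print(toy_accuracy)
```

The gap between the train and test scores above (0.8995 vs 0.7675) suggests that the model fits the training tweets better than it generalizes to unseen ones.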


##Assignment

To be uploaded here: https://forms.gle/qTzLy6F6jkUtQrvy7

Investigate the effect of text normalization.

- Search for a dataset for classification (or experiment with the same dataset from this lab)
- Preprocess the text
- Compare the vocabulary size with and without preprocessing
- Get the numerical representation of the text
- Train a model
- Test your model 
- Compare the performance of your model with and without text normalization
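
As a possible starting point for the preprocessing and vocabulary-size steps, here is a hedged sketch of one normalization pipeline. The helper names `normalize_tweet` and `vocabulary_size`, the specific regular expressions, and the toy tweets are all assumptions; other choices (stemming, stopword removal, keeping hashtags) are equally valid.

```python
import re

def normalize_tweet(text):
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)  # drop URLs
    text = re.sub(r"@\w+", " ", text)          # drop user mentions
    text = re.sub(r"[^a-z\s]", " ", text)      # keep letters only (this also removes emoticons)
    return " ".join(text.split())              # collapse repeated whitespace

def vocabulary_size(texts):
    # Number of distinct whitespace-separated tokens across all texts.
    return len({token for text in texts for token in text.split()})

toy_tweets = ["@97sides CONGRATS :)",
              "Congrats @97sides!!! :)",
              "MY PUPPY BROKE HER FOOT :("]

raw_size = vocabulary_size(toy_tweets)
normalized_size = vocabulary_size([normalize_tweet(t) for t in toy_tweets])
print(raw_size, normalized_size)
```

Emoticons such as `:)` and `:(` may carry sentiment signal, so whether stripping them helps or hurts is worth testing as part of the comparison.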