<a href="https://colab.research.google.com/github/bikash119/learn_tensorflow/blob/main/learn_tf_005.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Natural Language Processing


In [1]:
!nvidia-smi -L

/bin/bash: nvidia-smi: command not found


In [2]:
!wget "https://storage.googleapis.com/ztm_tf_course/nlp_getting_started.zip"

--2023-07-11 22:00:38--  https://storage.googleapis.com/ztm_tf_course/nlp_getting_started.zip
Resolving storage.googleapis.com (storage.googleapis.com)... 74.125.203.128, 74.125.204.128, 64.233.187.128, ...
Connecting to storage.googleapis.com (storage.googleapis.com)|74.125.203.128|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 607343 (593K) [application/zip]
Saving to: ‘nlp_getting_started.zip’


2023-07-11 22:00:40 (983 KB/s) - ‘nlp_getting_started.zip’ saved [607343/607343]



In [6]:
import zipfile
import os

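def unzip_file(file):
  """
  Unzips a file into the current working directory.

  Args:
    file (str): Path to the zip archive.

  Returns:
    None
  """
  # The context manager closes the archive even if extraction fails
  with zipfile.ZipFile(file) as zip_ref:
    zip_ref.extractall()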


In [7]:
unzip_file("/content/nlp_getting_started.zip")

In [9]:
import pandas as pd

train_df = pd.read_csv("train.csv")
test_df = pd.read_csv("test.csv")
train_df.head()

| | id | keyword | location | text | target |
|---|---|---|---|---|---|
| 0 | 1 | | | Our Deeds are the Reason of this #earthquake M... | 1 |
| 1 | 4 | | | Forest fire near La Ronge Sask. Canada | 1 |
| 2 | 5 | | | All residents asked to 'shelter in place' are ... | 1 |
| 3 | 6 | | | 13,000 people receive #wildfires evacuation or... | 1 |
| 4 | 7 | | | Just got sent this photo from Ruby #Alaska as ... | 1 |


In [10]:
train_df['target'].value_counts()

0    4342
1    3271
Name: target, dtype: int64

In [15]:
train_df[["target","text"]]

| | target | text |
|---|---|---|
| 0 | 1 | Our Deeds are the Reason of this #earthquake M... |
| 1 | 1 | Forest fire near La Ronge Sask. Canada |
| 2 | 1 | All residents asked to 'shelter in place' are ... |
| 3 | 1 | 13,000 people receive #wildfires evacuation or... |
| 4 | 1 | Just got sent this photo from Ruby #Alaska as ... |
| ... | ... | ... |
| 7608 | 1 | Two giant cranes holding a bridge collapse int... |
| 7609 | 1 | @aria_ahrary @TheTawniest The out of control w... |
| 7610 | 1 | M1.94 [01:04 UTC]?5km S of Volcano Hawaii. htt... |
| 7611 | 1 | Police investigating after an e-bike collided ... |


In [13]:
# Let's visualize some random training samples
import random

# Pick a random start index, leaving room for 5 consecutive rows
random_index = random.randint(0,len(train_df)-5)
for row in train_df[["text","target"]][random_index:random_index+5].itertuples():
  # Each tuple is (index, text, target); the index is not needed here
  _,text,target = row
  print(f"Target : {target}","(real disaster)" if target > 0 else "(not a real disaster)")
  print(f"Text:\n{text}\n")
  print("---\n")

Target : 0 (not a real disaster)
Text:
This LA Startup Is So Hot that Their Flowers Come Straight from a Volcano http://t.co/R3PDdjPiEe via @LATechWatch

---

Target : 0 (not a real disaster)
Text:
Sitting still in the #CityofMemphis traffic is like sitting in a war zone! They don't move for the Police.. They don't care

---

Target : 0 (not a real disaster)
Text:
Zone of the Enders MGS2 God of War. RT @D_PageXXI: Quote this with your favorite PS2 game

---

Target : 0 (not a real disaster)
Text:
Bedroom clean  bathroom clean  laundry done .. Shit was looking like a war zone in here ??

---

Target : 0 (not a real disaster)
Text:
This bed looks like a war zone.

---



### Split the data into training and validation sets

In [17]:
from sklearn.model_selection import train_test_split

train_sentences, val_sentences, train_labels, val_labels = train_test_split(train_df["text"]
                                                                           ,train_df["target"]
                                                                           ,test_size=0.2
                                                                           ,random_state=42)

len(train_sentences),len(val_sentences),len(train_labels),len(val_labels)

(6090, 1523, 6090, 1523)
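
To make the split sizes above concrete, here is a minimal sketch of a shuffled 80/20 split in plain Python. It is not scikit-learn's implementation: `split_indices` is a hypothetical helper, and the sketch assumes the validation size is rounded up, which is consistent with the 6090/1523 counts shown above.

```python
import math
import random

def split_indices(n_samples, test_size, seed):
    """Return (train_indices, val_indices) for a shuffled split (hypothetical helper)."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)       # shuffle row positions reproducibly
    n_val = math.ceil(n_samples * test_size)   # validation size, rounded up (assumption)
    return indices[n_val:], indices[:n_val]

train_idx, val_idx = split_indices(7613, test_size=0.2, seed=42)
print(len(train_idx), len(val_idx))  # 6090 1523
```

Because the rows are shuffled before splitting, the resulting pandas Series keep their original, non-contiguous index labels.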

## Convert text to numbers, also known as **numericalization**
In NLP, there are two main concepts for turning text into numbers:
* Tokenization
* Embeddings

**Tokenization** - A direct mapping from a word, a character, or a sub-word to a numerical value. There are three main levels of tokenization:
  * Character-level tokenization
  * Word-level tokenization
  * Sub-word-level tokenization
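
As a rough illustration of the first two levels, the sketch below builds a toy vocabulary by hand. The sentence and the ID assignment (order of first appearance) are assumptions made for the example; sub-word tokenization is left out because it typically relies on a learned vocabulary rather than a simple split.

```python
# Toy illustration of word-level vs. character-level tokenization.
sentence = "forest fire near la ronge"

# Word level: one integer per unique word, assigned in order of first appearance
word_vocab = {}
for word in sentence.split():
    word_vocab.setdefault(word, len(word_vocab))
word_ids = [word_vocab[word] for word in sentence.split()]

# Character level: one integer per unique character (spaces included)
char_vocab = {}
for char in sentence:
    char_vocab.setdefault(char, len(char_vocab))
char_ids = [char_vocab[char] for char in sentence]

print(word_ids)                        # [0, 1, 2, 3, 4]
print(len(char_vocab), len(char_ids))  # 12 25
```

Word-level tokenization gives short sequences but a large vocabulary; character-level tokenization gives a tiny vocabulary but much longer sequences.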

**Embeddings** - An embedding is a learnable representation of natural language, stored as one feature vector per token.
  * Create your own embedding - Once the text has been converted to numbers, we can pass it through an embedding layer, and an embedding representation is learned during model training.
  * Reuse a pretrained embedding - Many pretrained embeddings are available online.
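
Conceptually, an embedding layer can be pictured as a lookup table with one row per token ID. The sketch below shows only the lookup, in plain Python, with made-up sizes and an arbitrary initial value range; in a real model the rows would be updated during training.

```python
import random

random.seed(0)
vocab_size, embedding_dim = 6, 4  # hypothetical sizes, chosen for readability

# One randomly initialised vector per token ID (the value range is an arbitrary choice)
embedding_table = [
    [random.uniform(-0.05, 0.05) for _ in range(embedding_dim)]
    for _ in range(vocab_size)
]

token_ids = [2, 5, 0, 0]  # a tokenized sentence, padded with 0s
sentence_vectors = [embedding_table[token_id] for token_id in token_ids]

print(len(sentence_vectors), len(sentence_vectors[0]))  # 4 4
print(sentence_vectors[2] == sentence_vectors[3])       # True
```

Identical token IDs, such as the two padding zeros here, always map to the same vector.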

### Text Vectorization
To tokenize our sentences, we will use the preprocessing layer `tf.keras.layers.TextVectorization` (found under `tf.keras.layers.experimental.preprocessing` in older TensorFlow releases).

In [18]:
import tensorflow as tf
from tensorflow.keras import layers

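# Instantiate the layer with its default settings spelled out explicitly
text_vectorizer = layers.TextVectorization(max_tokens=None # no limit on vocabulary size
                                           ,standardize="lower_and_strip_punctuation" # lowercase and remove punctuation
                                           ,split="whitespace" # split tokens on whitespace
                                           ,ngrams=None # do not group tokens into n-grams
                                           ,output_mode="int" # map each token to an integer index
                                           ,output_sequence_length=None) # no padding or truncation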
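
The `standardize` and `split` settings can be approximated in plain Python as shown below. This is a sketch only: `standardize` is a hypothetical helper, and `string.punctuation` is assumed to be close to, but not necessarily identical to, the set of characters the layer strips.

```python
import string

def standardize(text):
    """Lowercase the text and drop ASCII punctuation (hypothetical helper)."""
    lowered = text.lower()
    return lowered.translate(str.maketrans("", "", string.punctuation))

# Whitespace splitting after standardization yields the word-level tokens
tokens = standardize("@brobread looks like mudslide????").split()
print(tokens)  # ['brobread', 'looks', 'like', 'mudslide']
```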

In [25]:
# Avg number of words in a tweet
round(sum([len(i.split()) for i in train_sentences])/len(train_sentences))

15

In [26]:
max_vocab_length=10000
max_length=15
text_vectorizer = layers.TextVectorization(max_tokens=max_vocab_length
                                           ,output_mode="int"
                                           ,output_sequence_length=max_length)
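
Setting `output_sequence_length` means every sentence comes out with exactly `max_length` token IDs: shorter sentences are padded with zeros and longer ones are cut off. A minimal sketch of that behaviour in plain Python, where `pad_or_truncate` is a hypothetical helper and not part of Keras:

```python
def pad_or_truncate(token_ids, length, pad_id=0):
    """Return exactly `length` IDs: pad at the end or keep only the first `length`."""
    return (token_ids + [pad_id] * length)[:length]

short = pad_or_truncate([1, 273, 25, 387], length=15)
long = pad_or_truncate(list(range(1, 21)), length=15)

print(short)      # [1, 273, 25, 387, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
print(len(long))  # 15
```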

In [29]:
# Fit the vectorizer's vocabulary on the training sentences only
text_vectorizer.adapt(train_sentences)

In [32]:
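# Convert to a list first: random.choice picks by position, but the Series is indexed by its original (shuffled) labels
random_sentence = random.choice(train_sentences.to_list())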
print(f" Raw text : {random_sentence}")
text_vectorizer([random_sentence])

 Raw text : Creation of AI
Climate change
Bioterrorism
Mass automation of workforce
Contact with other life
Wealth inequality

Yea we've got it easy


<tf.Tensor: shape=(1, 15), dtype=int64, numpy=
array([[   1,    6,    1,  867,  280,  745,  158, 5604,    6, 6070,    1,
          14,  505,  116, 2045]])>

In [36]:
words_in_vocab=text_vectorizer.get_vocabulary()
print(f" Top 5 common words :{words_in_vocab[:5]}")
print(f" Top 5 un-common words :{words_in_vocab[-5:]}")

 Top 5 common words :['', '[UNK]', 'the', 'a', 'in']
 Top 5 un-common words :['mideast', 'middleeasteye', 'midday', 'microwave', 'microphone']
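
The vocabulary above starts with two reserved entries, `''` for padding (index 0) and `[UNK]` for out-of-vocabulary words (index 1), followed by words in order of frequency. The sketch below imitates that layout in plain Python; `build_vocab` and `encode` are hypothetical helpers, not the layer's actual implementation.

```python
from collections import Counter

def build_vocab(sentences, max_tokens):
    """Reserve index 0 (padding) and 1 ([UNK]), then add the most frequent words."""
    counts = Counter(word for sentence in sentences for word in sentence.lower().split())
    most_common = [word for word, _ in counts.most_common(max_tokens - 2)]
    return ["", "[UNK]"] + most_common

def encode(sentence, vocab):
    """Map each word to its vocabulary index, falling back to 1 ([UNK])."""
    index = {word: i for i, word in enumerate(vocab)}
    return [index.get(word, 1) for word in sentence.lower().split()]

vocab = build_vocab(["the fire in the city", "the fire"], max_tokens=4)
print(vocab)                            # ['', '[UNK]', 'the', 'fire']
print(encode("the flood fire", vocab))  # [2, 1, 3]
```

This is why the vectorized sentences shown earlier contain the value 1 wherever a word fell outside the 10,000-word vocabulary.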


### Create an embedding using an Embedding layer

We can see what an embedding looks like by using `tf.keras.layers.Embedding`

In [38]:
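tf.random.set_seed(42)
from tensorflow.keras import layers
embedding = layers.Embedding(input_dim=max_vocab_length # size of the input vocabulary
                             ,output_dim=128 # size of each embedding vector
                             ,embeddings_initializer="uniform" # start from random uniform values
                             ,input_length=max_length # number of tokens in each input sequence
                             ,name="embedding_1")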

In [42]:
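# Pick by position from a list: indexing the Series with a random integer is a label lookup and can raise a KeyError
random_sentence = random.choice(train_sentences.to_list())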
print(f"random sentence :{random_sentence}")
print(f"Vectorized Version :{text_vectorizer(random_sentence)}")
sentence_embeddings = embedding(text_vectorizer(random_sentence))
print(f"Embeddings : {sentence_embeddings}")
print(f"Embeddings Shape: {sentence_embeddings.shape}")

random sentence :@brobread looks like mudslide????
Vectorized Version :[  1 273  25 387   0   0   0   0   0   0   0   0   0   0   0]
Embeddings : [[-0.04624858 -0.00230882  0.01583031 ... -0.04633186  0.0030102
  -0.02077737]
 [ 0.00421077  0.03818896  0.02766346 ...  0.02217532 -0.02499047
   0.01551687]
 [-0.01642071 -0.00016425  0.01188429 ...  0.03702518  0.03157497
  -0.00243261]
 ...
 [-0.02550201 -0.01700755 -0.03625858 ...  0.02770311 -0.017689
   0.03074707]
 [-0.02550201 -0.01700755 -0.03625858 ...  0.02770311 -0.017689
   0.03074707]
 [-0.02550201 -0.01700755 -0.03625858 ...  0.02770311 -0.017689
   0.03074707]]
Embeddings Shape: (15, 128)


In [47]:
sentence_embeddings[0]

<tf.Tensor: shape=(128,), dtype=float32, numpy=
array([-0.04624858, -0.00230882,  0.01583031, -0.01589829, -0.04974852,
        0.04314191,  0.00948632,  0.0114125 ,  0.01952113,  0.01776084,
       -0.00078173,  0.03914021,  0.04415382,  0.02867979, -0.04913266,
       -0.01775808,  0.02228029,  0.01711312, -0.02547604,  0.03076637,
       -0.04906312, -0.03966314, -0.04418211, -0.01446499, -0.00787419,
       -0.0086859 , -0.04114062, -0.04950554, -0.0245211 ,  0.01627352,
        0.00306585, -0.02590847, -0.0275027 , -0.04454866,  0.03783581,
        0.04825044, -0.04926258, -0.00946664, -0.02107433,  0.01821114,
        0.02811838, -0.04290438, -0.04256919, -0.03598412, -0.02264677,
        0.04069329, -0.00330695, -0.01186788,  0.00515471, -0.04625354,
        0.00793714,  0.02927386,  0.02211311,  0.04343529,  0.02081007,
       -0.01098495,  0.02011273, -0.0178793 ,  0.01169036, -0.01547121,
       -0.04830229, -0.02883117, -0.02653791, -0.04631009, -0.0330305 ,
       -0.031057