# Natural Language Processing with TensorFlow

**In this notebook I will be using TensorFlow to process tweets and predict whether they pertain to a disaster or not.**

---

**Notes**  

- This notebook was created using the guide found [here](https://www.kaggle.com/code/calebreigada/tensorflow-natural-language-processing-guide/notebook). 
- I do not take any credit for the methods displayed, as I am following a guide and simply applying the techniques there to a different set of data.
- As suggested in the linked guide, I researched unfamiliar topics beforehand to better understand the workflow. I am certainly not an expert in this topic and am still learning.

---


# Loading the Data

For this notebook, I will be using the [Natural Language Processing with Disaster Tweets](https://www.kaggle.com/competitions/nlp-getting-started) dataset. This dataset is a good starting point for NLP models and applications, and for understanding how machine learning algorithms can be applied to non-numeric data through data transformations.  

The dataset contains tweets that fall into one of two categories: either they discuss a real-life disaster, or they do not. This may seem fairly easy to discern, but as noted on the dataset's page, there are cases where the wording alone makes it difficult to identify the correct category.

In [1]:
# Importing libraries
import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

In [2]:
# Reading in data
train_full = pd.read_csv("../input/nlp-getting-started/train.csv")
test_full = pd.read_csv("../input/nlp-getting-started/test.csv")

train_text = train_full["text"]
train_labels = train_full["target"]

test_text = test_full["text"]
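# test.csv has no "target" column (the competition withholds test labels),
# so there are no test labels to load here.
test_labels = None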
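
Since every tweet belongs to one of two categories, it can help to check how balanced those categories are before modelling. The following is a minimal sketch rather than part of the original guide: `label_balance` is a hypothetical helper, and `toy_labels` is a stand-in for `train_labels` so that the snippet runs without the Kaggle files.

```python
from collections import Counter


def label_balance(labels):
    """Return {label: (count, fraction)} for an iterable of class labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: (n, n / total) for label, n in counts.items()}


# Toy labels standing in for train_labels (1 = disaster, 0 = non-disaster).
toy_labels = [1, 0, 0, 1, 0]
balance = label_balance(toy_labels)

for label, (n, frac) in sorted(balance.items()):
    print(f"label {label}: {n} tweets ({frac:.0%})")
```

Passing `train_labels` instead of `toy_labels` should give the same kind of summary for the real training set.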

Now that our data has been loaded, let's take a quick look at some tweet examples. We'll investigate a random sample of the training data to see what a disaster-related tweet looks like as well as a non-disaster-related tweet.

In [3]:
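# Print sample observations from training data
print("Example training data observations")
print("---------------\n")
np.random.seed(777)  # fixed seed so the same tweets are shown on every run
# Note: indices are drawn from the first 100 rows only, and randint samples
# with replacement, so the same index can come up more than once.
for i in np.random.randint(0, high=100, size=10):
    tweet_type = "Disaster" if train_labels[i] == 1 else "Non-disaster"
    print("Tweet classifier:", tweet_type)
    print("Tweet:", train_text[i], "\n")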

Example training data observations
---------------

Tweet classifier: Non-disaster
Tweet: Building the perfect tracklist to life leave the streets ablaze 

Tweet classifier: Disaster
Tweet: How the West was burned: Thousands of wildfires ablaze in #California alone http://t.co/iCSjGZ9tE1 #climate #energy http://t.co/9FxmN0l0Bd 

Tweet classifier: Disaster
Tweet: Barbados #Bridgetown JAMAICA ÛÒ Two cars set ablaze: SANTA CRUZ ÛÓ Head of the St Elizabeth Police Superintende...  http://t.co/wDUEaj8Q4J 

Tweet classifier: Disaster
Tweet: RT nAAYf: First accident in years. Turning onto Chandanee Magu from near MMA. Taxi rammed into me while I was halfway turned. Everyone confÛ_ 

Tweet classifier: Non-disaster
Tweet: First night with retainers in. It's quite weird. Better get used to it; I have to wear them every single night for the next year at least. 

Tweet classifier: Non-disaster
Tweet: #stlouis #caraccidentlawyer Speeding Among Top Causes of Teen Accidents https://t.co/k4zoMOF319 

Great! This printout shows how much the tweets vary in their wording, and also how words that appear disaster-related can show up in non-disaster tweets. For example, the first tweet listed above contains the word "ablaze", which would typically be associated with a disaster-related tweet about a fire. Here, however, it is used figuratively in a non-disaster tweet. This sort of occurrence is an obstacle to predicting a tweet's category, and represents just one type of challenge faced when performing NLP.
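To see how often an ambiguous word such as "ablaze" appears in each category, one option is to count matching tweets per label. This is only a sketch: `keyword_counts_by_label` is a hypothetical helper, and the toy tweets below are shortened versions of the samples printed above rather than the full training data.

```python
from collections import Counter


def keyword_counts_by_label(texts, labels, keyword):
    """Count tweets containing `keyword` (case-insensitive), grouped by label."""
    counts = Counter()
    keyword = keyword.lower()
    for text, label in zip(texts, labels):
        if keyword in text.lower():
            counts[label] += 1
    return dict(counts)


# Shortened sample tweets (1 = disaster, 0 = non-disaster).
toy_texts = [
    "Building the perfect tracklist to life leave the streets ablaze",
    "Thousands of wildfires ablaze in #California alone",
    "Two cars set ablaze",
    "First night with retainers in. It's quite weird.",
]
toy_labels = [0, 1, 1, 0]

print(keyword_counts_by_label(toy_texts, toy_labels, "ablaze"))
```

The same call with `train_text` and `train_labels` would show how evenly the word is split across the two categories in the real data. Note that this is a plain substring match, so it would also count longer words that happen to contain the keyword.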

Other observations:
- Many of the tweets contain links to images, something to consider when cleaning the text
- There are a few unusual character sequences in the tweets, such as "ÛÒ" and "Û_", which look like encoding artifacts
- Hashtags may need to be cleaned so that the words following the "#" symbol are read properly
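
The observations above could be turned into a cleaning step along the following lines. This is a rough sketch, not the method used in the guide: `clean_tweet` and both regular expressions are my own assumptions, and stripping every non-ASCII character is a blunt choice that would also remove legitimate accented letters and emoji.

```python
import re

# Assumed patterns for illustration only.
URL_PATTERN = re.compile(r"https?://\S+")          # links such as http://t.co/...
NON_ASCII_PATTERN = re.compile(r"[^\x00-\x7F]+")   # sequences such as "ÛÒ"


def clean_tweet(text):
    """Remove links and non-ASCII characters, and drop the '#' from hashtags."""
    text = URL_PATTERN.sub(" ", text)
    text = NON_ASCII_PATTERN.sub(" ", text)
    text = text.replace("#", "")      # keep the hashtag word, drop the symbol
    return " ".join(text.split())     # collapse repeated whitespace


sample = "Barbados #Bridgetown JAMAICA ÛÒ Two cars set ablaze: http://t.co/wDUEaj8Q4J"
print(clean_tweet(sample))
```

Whether hashtags and links should be removed entirely, or replaced with a placeholder token, is a design choice worth testing against model performance.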