
# Intro To Natural Language Processing (NLP)

The goal of this exercise is to give you a brief introduction to NLP: a starting point for a project of your own, and practice looking up documentation and unfamiliar terms.

The data comes from this [Kaggle competition](https://www.kaggle.com/competitions/tweet-sentiment-extraction/data). Feel free to submit what you come up with and to branch out from what we have given you here.

In [None]:
import nltk
import time
import random
import string
import numpy as np
import pandas as pd
nltk.download('punkt')
nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [None]:
# 1) Spell Check
# 2) Stem/tokenize word
# 3) Bag of Words
# 4) Algo: methods such as regression, K-Nearest Neighbors, Neural Nets, etc.

In [None]:
train = pd.read_csv("https://raw.githubusercontent.com/eliotjmartin/uodsc-club/main/twitter_train.csv")
train = train.dropna()

In [None]:
train.head()

Unnamed: 0,textID,text,selected_text,sentiment
0,cb774db0d1,"I`d have responded, if I were going","I`d have responded, if I were going",neutral
1,549e992a42,Sooo SAD I will miss you here in San Diego!!!,Sooo SAD,negative
2,088c60f138,my boss is bullying me...,bullying me,negative
3,9642c003ef,what interview! leave me alone,leave me alone,negative
4,358bd9e861,"Sons of ****, why couldn`t they put them on t...","Sons of ****,",negative


## Preprocessing Techniques

##### How can we make our textual data easier to work with?
- Remove "stop words" that do not add meaning to the sentence
- Map synonyms to a single word to reduce the number of unique features and increase the frequency of important words
- Reduce the feature space by stemming each word to its root form
- Create a representation of the sentence using techniques such as bag of words or embeddings
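
The synonym-mapping idea from the list above is not implemented later in this notebook. As a minimal sketch, assuming a small hand-written dictionary (the `SYNONYMS` table and the `map_synonyms` name are hypothetical and not part of NLTK), it could look like this:

```python
# Hypothetical synonym table: each key is replaced by its canonical form.
SYNONYMS = {
    "glad": "happy",
    "joyful": "happy",
    "unhappy": "sad",
}

def map_synonyms(tokens):
    """Replace each token with its canonical synonym, if one is listed."""
    return [SYNONYMS.get(token, token) for token in tokens]

print(map_synonyms(["glad", "to", "see", "you"]))  # ['happy', 'to', 'see', 'you']
```

A real project would probably need a much larger table or a lexical resource; this only shows where such a step would sit in the pipeline.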

#### First, let's create a function to tokenize a sentence:

In [None]:
def tokenizer(sentence):
    # remove punctuation from sentence
    sentence = ''.join(
        char for char in sentence if char not in string.punctuation
    )
    # tokenizing the sentence
    tokens = nltk.word_tokenize(sentence)
    return [token.lower() for token in tokens]

In [None]:
example = train.loc[0, 'text']
example

' I`d have responded, if I were going'

In [None]:
tokenizer(example)

['id', 'have', 'responded', 'if', 'i', 'were', 'going']
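
To see what the punctuation step is doing, here is a rough standard-library sketch of the same idea. It assumes plain whitespace splitting in place of `nltk.word_tokenize`, so it will not match NLTK on every input; `simple_tokenizer` is a name made up for this illustration.

```python
import string

# Translation table that deletes every character in string.punctuation
# (this includes the backtick that appears in "I`d").
_STRIP_PUNCT = str.maketrans("", "", string.punctuation)

def simple_tokenizer(sentence):
    """Drop punctuation, lowercase, and split on whitespace."""
    return sentence.translate(_STRIP_PUNCT).lower().split()

print(simple_tokenizer(" I`d have responded, if I were going"))
# ['id', 'have', 'responded', 'if', 'i', 'were', 'going']
```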

#### Let's define our stop words (common words that add little to the meaning of a sentence)

*Example: what sort of words are in the stop words list?*


In [None]:
stop_words = set(stopwords.words('english'))
'do' in stop_words, 'when' in stop_words

(True, True)

In [None]:
type(stop_words)

set

### Your turn:

Write a function that accepts a list of tokens and returns the same list of tokens without stop words:

In [None]:
def stopword_destroyer(tokens): 
  tokens_no_sw = []

  for word in tokens:
    if word not in stop_words:
      tokens_no_sw.append(word)

  return tokens_no_sw

In [None]:
assert stopword_destroyer(tokenizer(example)) == ['id', 'responded', 'going']

#### Stemming

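Stemming reduces words to a root form by stripping word endings; the Porter stemmer used here removes suffixes such as "-ing" and "-s". This shrinks the feature space and increases the frequency of related words: "run", "running", and "runs" all become "run". The resulting stem is not always a dictionary word (for example, "yummy" becomes "yummi").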

In [None]:
# initialize the porter stemmer from NLTK
stemmer = PorterStemmer()

In [None]:
stemmer.stem("running")

'run'

### Your turn

As above, this function should accept a list of tokens (a list of words) as an argument and return the stemmed tokens as a list.

In [None]:
def stemmerizer(tokens):
  stemmed_tokens = []
  for i in tokens:
    stemmed_tokens.append(stemmer.stem(i))
  return stemmed_tokens

In [None]:
assert stemmerizer(tokenizer(example)) == ['id', 'have', 'respond', 'if', 'i', 'were', 'go']

#### A Preprocessing Function

Now we can combine the preprocessing techniques described above into a single function that we can use to remove noise and irrelevant information from our data.

The `preprocess` function in the cell below takes a sentence as input, removes punctuation and stop words, stems each remaining word, and returns the preprocessed sentence as a list of words.

### Your Turn

Fill in the missing parts of the `preprocess` function using the functions we made above:

In [None]:
def preprocess(sentence):
    """
    This function takes a sentence as input and performs various text preprocessing steps on it,
    including removing punctuation, stop words, and stemming each word in the sentence.
    """
    # tokenizing the sentence
    tokens = tokenizer(sentence)
    
    # removing stop words
    tokens = stopword_destroyer(tokens)

    # stemming each word in the sentence
    tokens = stemmerizer(tokens)

    # return the preprocessed sentence as a list of words
    return tokens

What exactly does our preprocessing do?

In [None]:
tokens = preprocess(example)
tokens

['id', 'respond', 'go']

First, the `preprocess` function removes every character listed in `string.punctuation` from the sentence. The sentence is then split into a list of words with `nltk.word_tokenize`, and each word is lowercased.

Next, the function removes stop words, which are common words that carry little meaning on their own, such as "a", "an", "the", and "of". Here, the function uses NLTK's predefined English stop word list.

After that, the function performs stemming on each word in the sentence, which involves converting the words into their root or base form, called their stem. The function uses a stemmer to perform this task.

Finally, the preprocessed words are returned as a list.

#### A bag-of-words representation of a sentence

The main goal of the `bag_of_words` function is to convert a sentence into a numerical representation that captures the presence or absence of each word in our vocabulary (we will build the vocabulary soon!).

Here is how the function works:

The `bag_of_words` function takes a tokenized sentence and a dictionary that maps each known word in the vocabulary to its index. It initializes the bag with a zero for every word in the vocabulary, then sets the entry to 1 for each word in the sentence that exists in the vocabulary. Words that are not in the vocabulary are ignored. The function returns a NumPy array with 1 for each known word that appears in the sentence and 0 otherwise.

In [None]:
def bag_of_words(tokenized_sentence, word_to_index):
    """
    Create a bag of words representation for a given tokenized sentence.

    word_to_index maps each vocabulary word to its position in the bag.
    """
    # initialize the bag with zeros for each word in the vocabulary
    bag = np.zeros(len(word_to_index), dtype=np.int8)

    # set the bag to 1 for each word in the sentence that exists in the
    # vocabulary; words outside the vocabulary are skipped
    for token in tokenized_sentence:
        if token in word_to_index:
            bag[word_to_index[token]] = 1

    return bag
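
As a small worked example of the idea, the sketch below uses plain Python lists and a made-up five-word vocabulary (both are assumptions for illustration; the notebook itself uses a NumPy array and a vocabulary built from the tweets):

```python
# Hypothetical five-word vocabulary mapped to positions in the bag.
word_to_index = {"go": 0, "happi": 1, "miss": 2, "respond": 3, "sad": 4}

def toy_bag_of_words(tokens, word_to_index):
    """Return a 0/1 list marking which vocabulary words occur in tokens."""
    bag = [0] * len(word_to_index)
    for token in tokens:
        index = word_to_index.get(token)
        if index is not None:  # unknown words such as 'id' are skipped
            bag[index] = 1
    return bag

print(toy_bag_of_words(["id", "respond", "go"], word_to_index))  # [1, 0, 0, 1, 0]
```

Note that word order and repeat counts are lost: the bag only records whether each word is present.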

## Loading and Preprocessing the Data


We will now preprocess the tweets we loaded earlier using the logic defined above and convert each one into its bag-of-words representation.

In [None]:
def fullDataPrep(df, map=None, all_words_list=None):
  # count how often each word occurs if no vocabulary map was passed in
  if map is None:
    all_words = {}

  preprocessed_list = []
  for sentence in df['text']:
    preprocessed = preprocess(sentence)
    preprocessed_list.append(preprocessed)

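    if map is None:
      # count occurrences, starting at 1 the first time a word is seen
      for token in preprocessed:
        if token in all_words:
          all_words[token] += 1
        else:
          all_words[token] = 1

  # drop rare words (those seen 5 times or fewer) to keep the vocabulary small
  if map is None:
    keys_to_delete = []
    for key, value in all_words.items():
        if value <= 5:
            keys_to_delete.append(key)

    for key in keys_to_delete:
        del all_words[key]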

  # sort the remaining words into a list if map is none
  if map is None:
    all_words_list = sorted(list(all_words.keys()))
  
  # create a mapping from each word to its column index if map is none,
  # so bag_of_words can look up an index without searching the list
  if map is None:
    map = {}
    for i in range(len(all_words_list)):
      word = all_words_list[i]
      map[word] = i

  # build new dataframe with bow repr
  bow_array = []
  for sentence in preprocessed_list:
    row = bag_of_words(sentence, map)
    bow_array.append(row)

  bow_array = np.array(bow_array)
    
  bow_dict = {}
  for i in range(len(all_words_list)):
    word = all_words_list[i]
    bow_dict[word] = bow_array[:, i]

  return pd.DataFrame(bow_dict), map, all_words_list
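
The vocabulary-building step inside `fullDataPrep` (count words, drop rare ones, sort, assign indices) can be sketched more compactly with `collections.Counter`. The names `build_vocabulary` and `min_count` are invented here, and the threshold is illustrative rather than the one used above.

```python
from collections import Counter

def build_vocabulary(tokenized_sentences, min_count=2):
    """Map each word seen at least min_count times to a column index."""
    counts = Counter(
        token for sentence in tokenized_sentences for token in sentence
    )
    kept = sorted(word for word, n in counts.items() if n >= min_count)
    return {word: index for index, word in enumerate(kept)}

sentences = [["miss", "san", "diego"], ["miss", "sad"], ["sad", "go"]]
print(build_vocabulary(sentences))  # {'miss': 0, 'sad': 1}
```

The vocabulary should be built from the training data only and then reused for the test data, which is why `fullDataPrep` accepts an existing map.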

In [None]:
new_train, map, all_words = fullDataPrep(train)
new_train

Unnamed: 0,0,09,1,10,100,1000,10th,11,12,13,...,yucki,yum,yummi,yup,zealand,zero,zombi,zone,zoo,ï¿½
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
27475,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
27476,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
27477,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
27478,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


## Creating and Training a Model

First, we define our input features (`X_train`) and our target vector (`y_train`).

In [None]:
X_train, y_train = new_train, train['sentiment']
y_train.head()

0     neutral
1    negative
2    negative
3    negative
4    negative
Name: sentiment, dtype: object

In [None]:
def y_encode(row):
  """
  encode the target column into integers we can work with
  """
  if row == 'negative':
    return 0
  elif row =='neutral':
    return 1
  return 2

In [None]:
y_train = pd.Series(y_train.apply(y_encode))
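
One alternative to the if/elif chain is a dictionary lookup, sketched below in plain Python. `LABEL_TO_INT` and `encode_label` are hypothetical names. Unlike `y_encode`, which returns 2 for any label that is not 'negative' or 'neutral', this version raises a `KeyError` on an unexpected label.

```python
# Same integer codes as y_encode above.
LABEL_TO_INT = {"negative": 0, "neutral": 1, "positive": 2}
INT_TO_LABEL = {value: key for key, value in LABEL_TO_INT.items()}

def encode_label(label):
    """Encode a sentiment string; raises KeyError on an unexpected label."""
    return LABEL_TO_INT[label]

encoded = [encode_label(label) for label in ["neutral", "negative", "positive"]]
print(encoded)                             # [1, 0, 2]
print([INT_TO_LABEL[i] for i in encoded])  # ['neutral', 'negative', 'positive']
```

Keeping the reverse mapping around makes it easy to turn model predictions back into readable labels.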

### Your Turn: Fit and Predict a model

You do not have to use Logistic Regression, but it is a simple method to start with and makes a good baseline. There are many other models out there, so check them out!

In [None]:
from sklearn.linear_model import LogisticRegression

lr = LogisticRegression(max_iter=1000, random_state=42)
reg = lr.fit(X_train, y_train)

### Test Data

Now, try the same process on the test data.

In [55]:
test = pd.read_csv("https://raw.githubusercontent.com/eliotjmartin/uodsc-club/main/twitter_test.csv")
test = test.dropna()

In [75]:
test['sentiment']

0        neutral
1       positive
2       negative
3       positive
4       positive
          ...   
3529    negative
3530    positive
3531    negative
3532    positive
3533    positive
Name: sentiment, Length: 3534, dtype: object

In [57]:
new_test, map, all_words = fullDataPrep(test, map, all_words)

In [78]:
X_test, y_test = new_test, test['sentiment']
y_test = pd.Series(y_test.apply(y_encode))

### Get the accuracy of your model

In [83]:
from sklearn.metrics import accuracy_score
# Predict the labels for the test data
y_pred = lr.predict(X_test)

# Calculate accuracy (true labels first, then predictions)
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)
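
To make the accuracy number easier to interpret, here is a plain-Python sketch of what accuracy measures, along with a majority-class baseline to compare against. The function names and the label lists are made up for illustration and are not results from this notebook.

```python
from collections import Counter

def accuracy(y_true, y_pred):
    """Fraction of positions where the predicted label equals the true one."""
    matches = sum(t == p for t, p in zip(y_true, y_pred))
    return matches / len(y_true)

def majority_baseline(y_train, n):
    """Predict the most common training label n times."""
    most_common_label = Counter(y_train).most_common(1)[0][0]
    return [most_common_label] * n

y_true = [1, 2, 0, 2, 2]  # made-up labels, not real results
y_pred = [1, 2, 0, 1, 2]
print(accuracy(y_true, y_pred))                         # 0.8
print(accuracy(y_true, majority_baseline(y_true, 5)))   # 0.6
```

A model is only useful if it beats the baseline, so it is worth computing both on the test set.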