<a href="https://colab.research.google.com/github/fradicus/Shellhacks2025/blob/main/NLP_ShellHacks25.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **1. Natural Language Processing**

Import handling
```
nltk - Natural Language Toolkit
```



In [21]:
!pip install nltk



In [42]:
import nltk
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


# **2. Dataset input**

Read in semi-structured data

In [43]:
# Read the raw file contents; the context manager closes the file afterwards
with open("theTableofTables2.tsv") as f:
    userPrompt = f.read()
userPrompt[0:250]

'Query\tLabel\nWhere can I learn how to invest in the stock market?\tInvalid\nWhat color is my dog?\tInvalid\nWhere can a broke college student from a low-income family get financial aid?\tValid\nWhere can I apply for TANF benefits?\tValid\nHow can I start inve'

# **3. Interpreting Data**

We use Pandas to translate our semi-structured data into a format that a machine can understand.


In [44]:
import pandas as pd
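# header=None keeps the file's "Query<TAB>Label" line as data row 0, and `names` is
# applied in file order: 'label' receives the query text, 'body_text' the Valid/Invalid value.
inputData = pd.read_csv('theTableofTables2.tsv', sep='\t', names=['label', 'body_text'], header=None)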
inputData.head()

| | label | body_text |
|---|---|---|
| 0 | Query | Label |
| 1 | Where can I learn how to invest in the stock market? | Invalid |
| 2 | What color is my dog? | Invalid |
| 3 | Where can a broke college student from a low-income family get financial aid? | Valid |
| 4 | Where can I apply for TANF benefits? | Valid |
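
The raw preview in section 2 shows the file starting with a `Query<TAB>Label` header line, with the query in the first column. A minimal sketch of a load that consumes that header and names the columns in file order might look like the following. The inline `sample_tsv` string and the `sample` variable are stand-ins introduced here for illustration; they are not part of the notebook.

```python
import io
import pandas as pd

# Inline stand-in for theTableofTables2.tsv, built from rows shown in the preview.
sample_tsv = (
    "Query\tLabel\n"
    "Where can I learn how to invest in the stock market?\tInvalid\n"
    "What color is my dog?\tInvalid\n"
    "Where can I apply for TANF benefits?\tValid\n"
)

# header=0 consumes the "Query\tLabel" line; names are listed in file order,
# so the query text lands in 'body_text' and the class in 'label'.
sample = pd.read_csv(
    io.StringIO(sample_tsv),
    sep="\t",
    header=0,
    names=["body_text", "label"],
)
print(sample)
```

With this mapping, the later cleaning steps that read `body_text` would operate on the query text rather than on the class values.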


# **4. Cleaning our Data**

4a.) Removing Punctuation
```
Our vectorizer only counts the words within the user's prompt.
Punctuation is irrelevant, so special characters are stripped out.
```

In [45]:
import string
string.punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

In [46]:
def delPunc(text):
    # Keep every character that is not listed in string.punctuation
    dePunctuate = "".join([char for char in text if char not in string.punctuation])
    return dePunctuate
inputData['cleanBody'] = inputData['body_text'].apply(delPunc)

inputData.head()

| | label | body_text | cleanBody |
|---|---|---|---|
| 0 | Query | Label | Label |
| 1 | Where can I learn how to invest in the stock market? | Invalid | Invalid |
| 2 | What color is my dog? | Invalid | Invalid |
| 3 | Where can a broke college student from a low-income family get financial aid? | Valid | Valid |
| 4 | Where can I apply for TANF benefits? | Valid | Valid |
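
As an alternative sketch of the same idea, punctuation can also be removed with `str.translate`, which avoids the per-character list comprehension. The helper name `strip_punctuation` is hypothetical and chosen here for illustration only.

```python
import string

# Translation table that deletes every character in string.punctuation.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def strip_punctuation(text):
    """Return text with all ASCII punctuation characters removed."""
    return text.translate(_PUNCT_TABLE)

cleaned = strip_punctuation(
    "Where can a broke college student from a low-income family get financial aid?"
)
print(cleaned)
```

Note that deleting the hyphen joins "low-income" into the single word "lowincome"; the same happens with the `delPunc` approach, since both simply drop the character.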


4b.) Tokenize our Text
```
Separate the text of the user's prompt into individual words or "tokens",
providing structure to formerly unstructured text.
```

In [47]:
import re

def tokenize(text):
    # Raw string so that \W reaches the regex engine as "non-word character"
    token = re.split(r'\W+', text)
    return token

inputData['tokenText'] = inputData['cleanBody'].apply(lambda x: tokenize(x.lower()))

inputData.head()

| | label | body_text | cleanBody | tokenText |
|---|---|---|---|---|
| 0 | Query | Label | Label | [label] |
| 1 | Where can I learn how to invest in the stock market? | Invalid | Invalid | [invalid] |
| 2 | What color is my dog? | Invalid | Invalid | [invalid] |
| 3 | Where can a broke college student from a low-income family get financial aid? | Valid | Valid | [valid] |
| 4 | Where can I apply for TANF benefits? | Valid | Valid | [valid] |
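
One detail of splitting on `\W+` is worth seeing on a small input: if the text still ends with punctuation, the split leaves an empty string at the end of the list. The sketch below contrasts that with `re.findall`, which only returns the word matches. The helper name `tokenize_words` is hypothetical.

```python
import re

def tokenize_words(text):
    """Lowercase the text and return its runs of word characters."""
    return re.findall(r"\w+", text.lower())

# Splitting on non-word characters keeps an empty trailing token
# when the input ends with punctuation.
split_tokens = re.split(r"\W+", "what color is my dog?")
print(split_tokens)

# Collecting the word matches directly yields no empty tokens.
found_tokens = tokenize_words("What color is my dog?")
print(found_tokens)
```

In the notebook's pipeline the punctuation is removed before tokenizing, so the trailing empty token does not arise there.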


4c.) Removing Stopwords
```
We delete filler words, or "stopwords", that help human communication but carry little information for a model,
so that we focus only on what the text tells us about our data.
We combine NLTK's stopword lists for 29 languages.
```

In [48]:
import nltk

# Languages whose NLTK stopword lists we combine
languages = ['arabic', 'azerbaijani', 'basque', 'bengali', 'catalan', 'chinese',
             'danish', 'dutch', 'english', 'finnish', 'french', 'german', 'greek',
             'hebrew', 'hinglish', 'hungarian', 'indonesian', 'italian', 'kazakh',
             'nepali', 'norwegian', 'portuguese', 'romanian', 'russian', 'slovene',
             'spanish', 'swedish', 'tajik', 'turkish']

# A set gives constant-time membership tests when filtering tokens
stopwords = {word for lang in languages
             for word in nltk.corpus.stopwords.words(lang)}


In [49]:
def delStopwords(tokenList):
  text = [word for word in tokenList if word not in stopwords]
  return text

inputData['body_text_nostop'] = inputData['tokenText'].apply(lambda x: delStopwords(x))

inputData.head()

| | label | body_text | cleanBody | tokenText | body_text_nostop |
|---|---|---|---|---|---|
| 0 | Query | Label | Label | [label] | [label] |
| 1 | Where can I learn how to invest in the stock market? | Invalid | Invalid | [invalid] | [invalid] |
| 2 | What color is my dog? | Invalid | Invalid | [invalid] | [invalid] |
| 3 | Where can a broke college student from a low-income family get financial aid? | Valid | Valid | [valid] | [valid] |
| 4 | Where can I apply for TANF benefits? | Valid | Valid | [valid] | [valid] |
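
To make the filtering step concrete, here is a small sketch that runs the same list-comprehension filter on one tokenized query. The stopword set `DEMO_STOPWORDS` is a hand-picked, hypothetical stand-in so the example runs without downloading any NLTK data; it is not NLTK's list.

```python
# Hypothetical, hand-picked stopwords for illustration only (not NLTK's list).
DEMO_STOPWORDS = {"where", "can", "i", "for", "what", "is", "my"}

def remove_stopwords(tokens, stopword_set):
    """Keep only the tokens that are not in the stopword set."""
    return [tok for tok in tokens if tok not in stopword_set]

tokens = ["where", "can", "i", "apply", "for", "tanf", "benefits"]
kept = remove_stopwords(tokens, DEMO_STOPWORDS)
print(kept)
```

Only the content-bearing tokens remain, which is what the later stemming and vectorizing steps operate on.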


4d.) Stemming our data
```
As the term implies, we are breaking down a word to its main component.
We remove suffixes, transforming our word into its "stem form".
```



In [50]:
ps = nltk.PorterStemmer()

def stemming(tokenizedText):
    # Reduce each token to its Porter stem
    text = [ps.stem(word) for word in tokenizedText]
    return text

inputData['stemmedText'] = inputData['body_text_nostop'].apply(stemming)

inputData.head()

| | label | body_text | cleanBody | tokenText | body_text_nostop | stemmedText |
|---|---|---|---|---|---|---|
| 0 | Query | Label | Label | [label] | [label] | [label] |
| 1 | Where can I learn how to invest in the stock market? | Invalid | Invalid | [invalid] | [invalid] | [invalid] |
| 2 | What color is my dog? | Invalid | Invalid | [invalid] | [invalid] | [invalid] |
| 3 | Where can a broke college student from a low-income family get financial aid? | Valid | Valid | [valid] | [valid] | [valid] |
| 4 | Where can I apply for TANF benefits? | Valid | Valid | [valid] | [valid] | [valid] |
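
A short sketch of what the Porter stemmer does to a few words taken from the sample queries. It assumes only that `nltk` is installed; the stemmer itself needs no downloaded corpus. The word list is chosen here for illustration.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Words taken from the sample queries in the dataset preview.
words = ["investing", "benefits", "apply"]
stems = [stemmer.stem(word) for word in words]

for word, stem in zip(words, stems):
    print(word, "->", stem)
```

A stem is not guaranteed to be a dictionary word: "apply" is reduced to "appli", which is why stemming is usually described as suffix stripping rather than as finding a word's root meaning.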


4e.) Lemmatizing

```
We find the base form, or "lemma", of a word through a dictionary-based approach that performs a morphological analysis.
Unlike a stem, a lemma found this way is a real dictionary word, so the meaning of the word is preserved.
```

In [51]:
wordNet = nltk.WordNetLemmatizer()

def lemmatizing(tokenText):
    text = [wordNet.lemmatize(word) for word in tokenText]
    return text

inputData['lemmatizedText'] = inputData['body_text_nostop'].apply(lambda x: lemmatizing(x))

inputData.head(10)

| | label | body_text | cleanBody | tokenText | body_text_nostop | stemmedText | lemmatizedText |
|---|---|---|---|---|---|---|---|
| 0 | Query | Label | Label | [label] | [label] | [label] | [label] |
| 1 | Where can I learn how to invest in the stock market? | Invalid | Invalid | [invalid] | [invalid] | [invalid] | [invalid] |
| 2 | What color is my dog? | Invalid | Invalid | [invalid] | [invalid] | [invalid] | [invalid] |
| 3 | Where can a broke college student from a low-income family get financial aid? | Valid | Valid | [valid] | [valid] | [valid] | [valid] |
| 4 | Where can I apply for TANF benefits? | Valid | Valid | [valid] | [valid] | [valid] | [valid] |
| 5 | How can I start investing in the stock market? | Invalid | Invalid | [invalid] | [invalid] | [invalid] | [invalid] |
| 6 | Where is the nearest SNAP office | Valid | Valid | [valid] | [valid] | [valid] | [valid] |
| 7 | Can I apply for finanical aid online | Invalid | Invalid | [invalid] | [invalid] | [invalid] | [invalid] |
| 8 | What documents do I need to bring to apply for TANF at the local office? | Valid | Valid | [valid] | [valid] | [valid] | [valid] |
| 9 | How do I apply for a student loan? | Invalid | Invalid | [invalid] | [invalid] | [invalid] | [invalid] |


In [52]:
print(inputData.head())

                                                                           label  \
0                                                                          Query   
1                           Where can I learn how to invest in the stock market?   
2                                                          What color is my dog?   
3  Where can a broke college student from a low-income family get financial aid?   
4                                           Where can I apply for TANF benefits?   

  body_text cleanBody  tokenText body_text_nostop stemmedText lemmatizedText  
0     Label     Label    [label]          [label]     [label]        [label]  
1   Invalid   Invalid  [invalid]        [invalid]   [invalid]      [invalid]  
2   Invalid   Invalid  [invalid]        [invalid]   [invalid]      [invalid]  
3     Valid     Valid    [valid]          [valid]     [valid]        [valid]  
4     Valid     Valid    [valid]          [valid]     [valid]        [valid]  


# **5. Vectorizing our Data**
```
We convert our text to integers in order to create feature vectors,
which permit our model to work with language.
```

5a.) Bag-Of-Words (BoW) Model

```
We create a "bag of words" by recording which words occur in our data, ignoring the order in which they appear.
```

In [53]:
import pandas as pd
import re
import string
import nltk

pd.set_option('display.max_colwidth', 100)

# Languages whose NLTK stopword lists we combine
languages = ['arabic', 'azerbaijani', 'basque', 'bengali', 'catalan', 'chinese',
             'danish', 'dutch', 'english', 'finnish', 'french', 'german', 'greek',
             'hebrew', 'hinglish', 'hungarian', 'indonesian', 'italian', 'kazakh',
             'nepali', 'norwegian', 'portuguese', 'romanian', 'russian', 'slovene',
             'spanish', 'swedish', 'tajik', 'turkish']

# A set gives constant-time membership tests when filtering tokens
stopwords = {word for lang in languages
             for word in nltk.corpus.stopwords.words(lang)}
ps = nltk.PorterStemmer()

inputData = pd.read_csv("theTableofTables2.tsv", sep='\t')
inputData.columns = ['label', 'body_text']

Cleaning our text

```
This step eliminates our punctuation, tokenizes the text, deletes stopwords, and stems the remaining words.
```

In [54]:
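def cleaningText(text):
    # Lowercase and drop punctuation characters
    text = "".join([word.lower() for word in text if word not in string.punctuation])
    # Split on runs of non-word characters (raw string keeps \W intact)
    token = re.split(r'\W+', text)
    # Drop stopwords, then stem what remains
    text = [ps.stem(word) for word in token if word not in stopwords]
    return text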

5a.) (Continued) - CountVectorizer

```
Using a specific implementation of the BoW model, known as CountVectorizer, we store the count of each word in our document matrix.
We transform the text data into a matrix of token counts, where each row represents a document and each column represents a word from the vocabulary.
```


In [55]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(analyzer=cleaningText)
X_counts = vectorizer.fit_transform(inputData['body_text'])
print(X_counts.shape)
print(vectorizer.get_feature_names_out())

(15, 2)
['invalid' 'valid']
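
To show what a matrix of token counts looks like, here is a minimal pure-Python sketch of the bag-of-words idea, without scikit-learn. The two token lists in `docs` are hypothetical, already-cleaned documents invented for illustration.

```python
from collections import Counter

# Hypothetical, already-cleaned documents (one token list per document).
docs = [
    ["apply", "tanf", "benefits"],
    ["apply", "student", "loan"],
]

# The vocabulary is the sorted set of all tokens; each term becomes one column.
vocabulary = sorted({tok for doc in docs for tok in doc})

# One row per document, one count per vocabulary term.
matrix = []
for doc in docs:
    counts = Counter(doc)
    matrix.append([counts[term] for term in vocabulary])

print(vocabulary)
for row in matrix:
    print(row)
```

Each row is a document and each column a vocabulary word, which is the same layout that `CountVectorizer` produces (there as a sparse matrix).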


In [56]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {
    'n_estimators': [10, 150, 300],
    'max_depth': [30, 60, 90, None]
}

In [57]:
print("Number of documents (rows):", X_counts.shape[0])
print("Number of labels:", len(inputData['label']))


Number of documents (rows): 15
Number of labels: 15


5b.) N-Grams

```
An n-gram is a sequence of 'n' adjacent letters or words.
Counting n-grams instead of single words lets us capture which letter or word tends to follow another,
for a chosen length 'n' (n = 2 gives pairs, also called bigrams).
```


In [58]:
from sklearn.feature_extraction.text import CountVectorizer

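# Note: scikit-learn applies ngram_range only when `analyzer` is not a callable,
# so with analyzer=cleaningText the features are still single tokens.
ngramVect = CountVectorizer(ngram_range=(2,2),analyzer=cleaningText)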
X_counts = ngramVect.fit_transform(inputData['body_text'])
print(X_counts.shape)
print(ngramVect.get_feature_names_out())

(15, 2)
['invalid' 'valid']




5c.) TF-IDF
```
Term Frequency (TF): Measures how frequently a term appears in a document.
Inverse Document Frequency (IDF): Measures how rare a term is across all documents; rarer terms receive a higher weight.
Together they give the frequency of a word in one document, weighted against how many documents contain it.
```
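Equation: $w_{i,j} = tf_{i,j} \times \log\left(\frac{N}{df_i}\right)$, where $tf_{i,j}$ is the number of times term $i$ appears in document $j$, $N$ is the total number of documents, and $df_i$ is the number of documents that contain term $i$.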

In [59]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidfVect = TfidfVectorizer(analyzer=cleaningText)
X_tfidf = tfidfVect.fit_transform(inputData['body_text'])
print(X_tfidf.shape)
print(tfidfVect.get_feature_names_out())

(15, 2)
['invalid' 'valid']
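
The weighting formula from 5c can be worked through by hand on a tiny example. The sketch below applies tf × log(N / df) directly in pure Python; the token lists in `docs` are hypothetical. By default `TfidfVectorizer` uses a smoothed IDF and normalizes each row, so its numbers will differ from this plain version.

```python
import math
from collections import Counter

# Hypothetical, already-cleaned documents (one token list per document).
docs = [
    ["apply", "tanf", "benefits"],
    ["apply", "student", "loan"],
]

N = len(docs)
# df[term] = number of documents that contain the term at least once.
df = Counter(term for doc in docs for term in set(doc))

def tfidf(doc):
    """Weight each term in doc by tf * log(N / df)."""
    tf = Counter(doc)
    return {term: tf[term] * math.log(N / df[term]) for term in tf}

weights = [tfidf(doc) for doc in docs]
for w in weights:
    print(w)
```

"apply" occurs in both documents, so log(2 / 2) = 0 and its weight vanishes, while terms unique to one document receive log(2).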
