<a href="https://colab.research.google.com/github/cartmarsh/TwitterSentiment/blob/main/nlp_neural_networks_and_deep_learning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Neural Networks and Embeddings for Natural Language Processing

Outline:
- Download the Data
- Prepare Data for Training
- Logistic Regression Model
- Feed Forward Neural Network


Dataset: https://www.kaggle.com/c/quora-insincere-questions-classification

## Download the Data

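Upload your `kaggle.json` API token to Colab's working directory (`/content`), which the cells below register as the Kaggle config directory.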

In [12]:
!pwd

/content


In [13]:
import os

In [14]:
os.environ["KAGGLE_CONFIG_DIR"] = "/content"
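
Before running the download, it can help to confirm that the token file is where the Kaggle CLI will look for it. The following is a minimal sketch, not part of the original notebook: `kaggle_credentials_ok` is a hypothetical helper, and the demo uses a temporary directory with placeholder values rather than a real token.

```python
import json
import os
import tempfile


def kaggle_credentials_ok(config_dir):
    """Return True if config_dir holds a kaggle.json with username and key fields."""
    path = os.path.join(config_dir, "kaggle.json")
    if not os.path.isfile(path):
        return False
    try:
        with open(path) as f:
            creds = json.load(f)
    except (OSError, json.JSONDecodeError):
        return False
    return isinstance(creds, dict) and {"username", "key"} <= creds.keys()


# Demo in a throwaway directory (placeholder credentials, not a real token).
with tempfile.TemporaryDirectory() as demo_dir:
    missing = kaggle_credentials_ok(demo_dir)
    with open(os.path.join(demo_dir, "kaggle.json"), "w") as f:
        json.dump({"username": "placeholder", "key": "placeholder"}, f)
    present = kaggle_credentials_ok(demo_dir)

print(missing, present)
```

In the notebook itself the call would be `kaggle_credentials_ok(os.environ["KAGGLE_CONFIG_DIR"])`.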

In [15]:
!kaggle competitions download -c quora-insincere-questions-classification

Downloading quora-insincere-questions-classification.zip to /content
100% 6.02G/6.03G [01:18<00:00, 34.1MB/s]
100% 6.03G/6.03G [01:18<00:00, 82.7MB/s]


In [16]:
!ls


kaggle.json  quora-insincere-questions-classification.zip  sample_data


In [17]:
!unzip quora-insincere-questions-classification.zip

Archive:  quora-insincere-questions-classification.zip
  inflating: embeddings.zip          
  inflating: sample_submission.csv   
  inflating: test.csv                
  inflating: train.csv               


In [18]:
import pandas as pd


In [19]:
!ls


embeddings.zip				      sample_data	     train.csv
kaggle.json				      sample_submission.csv
quora-insincere-questions-classification.zip  test.csv


In [20]:
raw_df = pd.read_csv("train.csv")
raw_test = pd.read_csv("test.csv")
raw_sub = pd.read_csv("sample_submission.csv")

In [21]:
raw_df.head()

| | qid | question_text | target |
|---|---|---|---|
| 0 | 00002165364db923c7e6 | How did Quebec nationalists see their province... | 0 |
| 1 | 000032939017120e6e44 | Do you have an adopted dog, how would you enco... | 0 |
| 2 | 0000412ca6e4628ce2cf | Why does velocity affect time? Does velocity a... | 0 |
| 3 | 000042bf85aa498cd78e | How did Otto von Guericke used the Magdeburg h... | 0 |
| 4 | 0000455dfa3e01eae3af | Can I convert montra helicon D to a mountain b... | 0 |


## Prepare Data for Training


Outline:
- Convert text to TF-IDF Vectors
- Split training & validation set
- Convert to PyTorch tensors
- Create PyTorch DataLoaders

### Conversion to TF-IDF Vectors

In [22]:
import nltk
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.tokenize import word_tokenize
from nltk.stem import SnowballStemmer
from nltk.corpus import stopwords


In [23]:
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [24]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [25]:
word_tokenize("what is the fuck lol, law)=7")

['what', 'is', 'the', 'fuck', 'lol', ',', 'law', ')', '=7']

In [27]:
stemmer = SnowballStemmer(language='english')

In [28]:
def tokenize(text):
  return [stemmer.stem(token) for token in word_tokenize(text)]

In [29]:
tokenize("is this token,ize wo wokring or whaaa")

['is', 'this', 'token', ',', 'ize', 'wo', 'wokr', 'or', 'whaaa']

In [31]:
english_stopwords = stopwords.words('english')

In [32]:
", ".join(english_stopwords)

"i, me, my, myself, we, our, ours, ourselves, you, you're, you've, you'll, you'd, your, yours, yourself, yourselves, he, him, his, himself, she, she's, her, hers, herself, it, it's, its, itself, they, them, their, theirs, themselves, what, which, who, whom, this, that, that'll, these, those, am, is, are, was, were, be, been, being, have, has, had, having, do, does, did, doing, a, an, the, and, but, if, or, because, as, until, while, of, at, by, for, with, about, against, between, into, through, during, before, after, above, below, to, from, up, down, in, out, on, off, over, under, again, further, then, once, here, there, when, where, why, how, all, any, both, each, few, more, most, other, some, such, no, nor, not, only, own, same, so, than, too, very, s, t, can, will, just, don, don't, should, should've, now, d, ll, m, o, re, ve, y, ain, aren, aren't, couldn, couldn't, didn, didn't, doesn, doesn't, hadn, hadn't, hasn, hasn't, haven, haven't, isn, isn't, ma, mightn, mightn't, mustn, mus

In [33]:
# Configure the vectorizer with the stemming tokenizer, the NLTK stop-word list, and a vocabulary capped at 1,200 terms
vectorizer = TfidfVectorizer(tokenizer=tokenize, stop_words=english_stopwords, max_features=1200)
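
To make the weights concrete, here is a small pure-Python sketch of the weighting that `TfidfVectorizer` applies with its default settings: raw term counts multiplied by a smoothed inverse document frequency, then L2-normalized per row. The toy corpus and the whitespace tokenizer are assumptions for illustration. Stemming, stop-word removal, and the `max_features` cap are deliberately left out.

```python
import math
from collections import Counter

# Toy corpus (illustrative only, not taken from the Quora data).
docs = [
    "why is the sky blue",
    "why is the sea blue",
    "how do planes fly",
]

tokenized = [d.split() for d in docs]  # whitespace tokens instead of NLTK + stemming
vocab = sorted({t for doc in tokenized for t in doc})
n_docs = len(tokenized)

# Document frequency: in how many documents each term appears.
df = Counter(t for doc in tokenized for t in set(doc))

# Smoothed IDF: ln((1 + n) / (1 + df)) + 1
idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}


def tfidf_row(tokens):
    """Return one L2-normalized TF-IDF vector ordered like vocab."""
    counts = Counter(tokens)
    raw = [counts[t] * idf[t] for t in vocab]
    norm = math.sqrt(sum(w * w for w in raw))
    return [w / norm if norm else 0.0 for w in raw]


matrix = [tfidf_row(doc) for doc in tokenized]
first = dict(zip(vocab, matrix[0]))
print(round(first["sky"], 3), round(first["why"], 3))
```

The term `sky` occurs in only one document, so it receives a larger weight than `why`, which occurs in two.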

In [36]:
sample_df = raw_df.sample(100_000)

In [37]:
vectorizer.fit(sample_df.question_text)

  "The parameter 'token_pattern' will not be used"


TfidfVectorizer(max_features=1200,
                stop_words=['i', 'me', 'my', 'myself', 'we', 'our', 'ours',
                            'ourselves', 'you', "you're", "you've", "you'll",
                            "you'd", 'your', 'yours', 'yourself', 'yourselves',
                            'he', 'him', 'his', 'himself', 'she', "she's",
                            'her', 'hers', 'herself', 'it', "it's", 'its',
                            'itself', ...],
                tokenizer=<function tokenize at 0x7f3fa8645d40>)

In [38]:
vectorizer.get_feature_names_out()[1000:]

array(['small', 'smart', 'smoke', 'social', 'societi', 'softwar', 'solar',
       'soldier', 'solut', 'solv', 'someon', 'someth', 'sometim', 'son',
       'song', 'sound', 'sourc', 'south', 'space', 'speak', 'special',
       'specif', 'speech', 'speed', 'spend', 'sport', 'stand', 'standard',
       'star', 'start', 'startup', 'state', 'statement', 'status', 'stay',
       'step', 'still', 'stock', 'stop', 'store', 'stori', 'straight',
       'strategi', 'stream', 'street', 'stress', 'strong', 'structur',
       'student', 'studi', 'stupid', 'style', 'subject', 'success',
       'sudden', 'suffer', 'suggest', 'suicid', 'summer', 'super',
       'support', 'suppos', 'sure', 'surviv', 'switch', 'symptom',
       'system', 'take', 'taken', 'talk', 'tamil', 'tax', 'teach',
       'teacher', 'team', 'tech', 'technic', 'techniqu', 'technolog',
       'teenag', 'tell', 'temperatur', 'term', 'terrorist', 'test',
       'text', 'themselv', 'theori', 'thing', 'think', 'though',
       'thought',

In [40]:
inputs = vectorizer.transform(sample_df.question_text)

In [41]:
test_inputs = vectorizer.transform(raw_test.question_text)

In [42]:
inputs.shape

(100000, 1200)

In [49]:
len(raw_df)

1306122

In [None]:
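# Pass the text column, not the whole DataFrame (iterating a DataFrame yields its column names)
inputs = vectorizer.transform(sample_df.question_text)
targets = sample_df.target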

In [47]:
inputs.shape

(100000, 1200)

In [48]:
inputs


<100000x1200 sparse matrix of type '<class 'numpy.float64'>'
	with 570562 stored elements in Compressed Sparse Row format>

### Split training and validation set

In [44]:
from sklearn.model_selection import train_test_split

In [52]:
train_inputs, val_inputs, train_targets, val_targets = train_test_split(inputs, sample_df.target, test_size=0.3)

In [55]:
train_inputs.shape


(70000, 1200)

In [57]:
train_targets

54471      1
85651      0
271311     1
328405     0
1183313    0
          ..
995008     1
143480     0
1191035    0
212533     0
682550     0
Name: target, Length: 70000, dtype: int64
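
The split above is random and unseeded, so the share of positive labels can differ between the two sets and between runs. One common option, shown here as a sketch on synthetic data rather than as the notebook's method, is to pass `stratify` and `random_state` to `train_test_split`. The 10% positive rate is an arbitrary choice for the demo.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.random((1000, 5))             # stand-in feature matrix
y = np.array([1] * 100 + [0] * 900)   # 10% positives, chosen for illustration

X_train, X_val, y_train, y_val = train_test_split(
    X,
    y,
    test_size=0.3,
    stratify=y,        # keep the class ratio equal in both splits
    random_state=42,   # make the split reproducible
)

print(X_train.shape, X_val.shape)
print(y_train.mean(), y_val.mean())
```

Applied to the notebook's variables, this would mean adding `stratify=sample_df.target` and a fixed `random_state` to the existing call.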

### Convert to PyTorch Tensors

In [58]:
import torch

In [60]:
train_input_tensors = torch.tensor(train_inputs.toarray()).float()

In [62]:
val_input_tensors = torch.tensor(val_inputs.toarray()).float()

In [63]:
train_input_tensors

tensor([[0.0000, 0.0000, 0.0000,  ..., 0.0000, 0.0000, 0.0000],
        [0.0000, 0.0000, 0.0000,  ..., 0.0000, 0.0000, 0.0000],
        [0.0000, 0.0000, 0.0000,  ..., 0.0000, 0.0000, 0.0000],
        ...,
        [0.0000, 0.0000, 0.0000,  ..., 0.0000, 0.5926, 0.5917],
        [0.0000, 0.0000, 0.0000,  ..., 0.0000, 0.0000, 0.0000],
        [0.0000, 0.0000, 0.0000,  ..., 0.0000, 0.0000, 0.0000]])
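
Calling `.toarray()` stores every zero explicitly, so memory use grows with rows × columns rather than with the number of non-zero entries. The arithmetic below is a rough estimate using the shapes reported above. The 4-byte index size for the sparse format is an assumption, and `dense_bytes` and `csr_bytes` are hypothetical helpers.

```python
def dense_bytes(rows, cols, bytes_per_value):
    """Size of a dense rows x cols array."""
    return rows * cols * bytes_per_value


def csr_bytes(rows, stored, value_bytes=8, index_bytes=4):
    """Approximate CSR size: values + column indices + row pointers.

    Assumes 4-byte integer indices.
    """
    return stored * (value_bytes + index_bytes) + (rows + 1) * index_bytes


train_f64 = dense_bytes(70_000, 1200, 8)     # result of train_inputs.toarray()
train_f32 = dense_bytes(70_000, 1200, 4)     # after .float()
full_f32 = dense_bytes(1_306_122, 1200, 4)   # all of raw_df as float32
sample_csr = csr_bytes(100_000, 570_562)     # the sparse `inputs` matrix

for name, size in [
    ("train dense float64", train_f64),
    ("train dense float32", train_f32),
    ("full data dense float32", full_f32),
    ("sample sparse CSR", sample_csr),
]:
    print(f"{name}: {size / 1e6:.1f} MB")
```

By this estimate, a dense float32 matrix for all 1,306,122 rows would need over 6 GB, which is one practical argument for working with a 100,000-row sample.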

In [64]:
train_target_tensors = torch.tensor(train_targets.values)
val_target_tensors = torch.tensor(val_targets.values)

In [66]:
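# NOTE: `inputs` is the 100,000-row training sample, not the test set, which is why the shape below has 100,000 rows
test_input_tensors = torch.tensor(inputs.toarray()).float()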

In [68]:
test_input_tensors.shape

torch.Size([100000, 1200])
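
The outline lists creating PyTorch DataLoaders as the last preparation step. A minimal sketch of that step follows, using random stand-in tensors with the same 1,200-feature layout so that it runs on its own. The batch size of 64 is an arbitrary assumption, not a value from the notebook.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Stand-in tensors shaped like the notebook's (rows x 1200 features, integer labels).
features = torch.rand(256, 1200)
labels = torch.randint(0, 2, (256,))

# Pair each feature row with its label, then serve them in shuffled mini-batches.
dataset = TensorDataset(features, labels)
loader = DataLoader(dataset, batch_size=64, shuffle=True)

batch_inputs, batch_targets = next(iter(loader))
print(batch_inputs.shape, batch_targets.shape)
print(len(loader))  # number of batches per epoch
```

With the notebook's variables, the training set would be `TensorDataset(train_input_tensors, train_target_tensors)` with `shuffle=True`, and the validation set `TensorDataset(val_input_tensors, val_target_tensors)` with shuffling turned off.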

## Logistic Regression Model

## Feed Forward Neural Network

## Make Predictions and Submit