# Text Classification with Bag of Words - Natural Language Processing

![](https://i.imgur.com/hlEQ5X8.png)

> _"Natural language processing (NLP) is a subfield of linguistics, computer science, and artificial intelligence concerned with the interactions between computers and human language, in particular how to program computers to process and analyze large amounts of natural language data."_ - Wikipedia

> _**Bag of Words**: The bag-of-words (BOW) model is a representation that turns arbitrary text into fixed-length vectors by counting how many times each word appears._
 
Outline:

1. Download and explore a real-world dataset
2. Apply text preprocessing techniques
3. Implement the bag of words model
4. Train ML models for text classification


Dataset: https://www.kaggle.com/c/quora-insincere-questions-classification

## Download & Explore Data

Outline:

* Download the data
* Explore the data
* Pick a working sample

### Data Download

In [15]:
import os
from pathlib import Path

In [16]:
IS_KAGGLE = 'KAGGLE_KERNEL_RUN_TYPE' in os.environ
print('IS_KAGGLE:', IS_KAGGLE)

IS_KAGGLE: False


In [17]:
if IS_KAGGLE:
    data_dir = Path('../input/quora-insincere-questions-classification')
else:
    !pip install opendatasets --upgrade --quiet
    import opendatasets as od

    dataset_url = 'https://www.kaggle.com/c/quora-insincere-questions-classification'
    od.download(dataset_url)
    data_dir = Path('./quora-insincere-questions-classification')

Skipping, found downloaded files in "./quora-insincere-questions-classification" (use force=True to force download)


In [20]:
data_dir

PosixPath('quora-insincere-questions-classification')

In [10]:
os.listdir(data_dir)

['sample_submission.csv', 'train.csv', 'embeddings.zip', 'test.csv']

In [9]:
import pandas as pd

### Data Exploration

In [21]:
raw_df = pd.read_csv(data_dir/'train.csv')
test_df = pd.read_csv(data_dir/'test.csv')
sub_df = pd.read_csv(data_dir/'sample_submission.csv')

In [22]:
raw_df

Unnamed: 0,qid,question_text,target
0,00002165364db923c7e6,How did Quebec nationalists see their province...,0
1,000032939017120e6e44,"Do you have an adopted dog, how would you enco...",0
2,0000412ca6e4628ce2cf,Why does velocity affect time? Does velocity a...,0
3,000042bf85aa498cd78e,How did Otto von Guericke used the Magdeburg h...,0
4,0000455dfa3e01eae3af,Can I convert montra helicon D to a mountain b...,0
...,...,...,...
1306117,ffffcc4e2331aaf1e41e,What other technical skills do you need as a c...,0
1306118,ffffd431801e5a2f4861,Does MS in ECE have good job prospects in USA ...,0
1306119,ffffd48fb36b63db010c,Is foam insulation toxic?,0
1306120,ffffec519fa37cf60c78,How can one start a research project based on ...,0


In [23]:
raw_df.question_text.values[0]

'How did Quebec nationalists see their province as a nation in the 1960s?'

In [25]:
# Sincere questions
raw_df[raw_df.target == 0].question_text.sample(10).values

array(['What are the best topics for paper presentation for ECE?',
       'How does federalism protect against tyranny?',
       'What do you think of bullies who believe their victims deserved it?',
       'What is the other name given to lines of latitude?',
       'What is the eligibility for getting a gold medal in University Of Rajasthan? Also what have been the highest marks obtained over the years in M.A. History?',
       'What are some strong arguments against homosexuality?',
       "Gf (22) & I don't spend much time while in school cause of busy schedules, I tell her lets spend time together when on summer/winter/spring break and she always picks family over me. Probably hangout 2 times the whole break. What should I do?",
       'What is the function of a plasma cell?',
       'How much money does a member of special forces such as navy seal, ranger, delta, etc… earn annually?',
       'Why is my cat peeing on my bed every time I return from a long trip?'],
      dtype=object)

In [26]:
# Insincere questions
raw_df[raw_df.target == 1].question_text.sample(10).values

array(['How do pro-Chinese or Chinese who are pro-PRC feel when they are referred to as "commies"?',
       'Do men with money test their girlfriend to see if they are gold diggers?',
       'Is thin in America basically fat everywhere else?',
       'Why cant the Chinese have there own ideas instead of making Rip offs. Is it really that hard for them?',
       'Do Pakistani terrorist army ISI learn their brutality from their equally brutal Islamic prophet Muhammad?',
       'Does it surprise you that most Quora subscribers are flaming liberals who attack if you don’t agree with them?',
       'Why are Kerala Muslims reproducing twice as fast as humans in the state?',
       'Do Asian Americans cheat more in academics than white Americans?',
       'How long will western democracies tolerate Hindu fascism as a suitable counterweight to Chinese dictatorship? Do strategists genuinely think that a country that wants to turn into a medieval theocracy can take on a modern industrial giant?'

In [35]:
pd.options.plotting.backend = "plotly"

In [37]:
raw_df.target.value_counts(normalize=True).plot(kind='bar')

In [27]:
test_df

Unnamed: 0,qid,question_text
0,0000163e3ea7c7a74cd7,Why do so many women become so rude and arroga...
1,00002bd4fb5d505b9161,When should I apply for RV college of engineer...
2,00007756b4a147d2b0b3,What is it really like to be a nurse practitio...
3,000086e4b7e1c7146103,Who are entrepreneurs?
4,0000c4c3fbe8785a3090,Is education really making good people nowadays?
...,...,...
375801,ffff7fa746bd6d6197a9,How many countries listed in gold import in in...
375802,ffffa1be31c43046ab6b,Is there an alternative to dresses on formal p...
375803,ffffae173b6ca6bfa563,Where I can find best friendship quotes in Tel...
375804,ffffb1f7f1a008620287,What are the causes of refraction of light?


In [28]:
sub_df

Unnamed: 0,qid,prediction
0,0000163e3ea7c7a74cd7,0
1,00002bd4fb5d505b9161,0
2,00007756b4a147d2b0b3,0
3,000086e4b7e1c7146103,0
4,0000c4c3fbe8785a3090,0
...,...,...
375801,ffff7fa746bd6d6197a9,0
375802,ffffa1be31c43046ab6b,0
375803,ffffae173b6ca6bfa563,0
375804,ffffb1f7f1a008620287,0


Questions to ask:
- What is the distribution of question length (in words/characters)?
- What are the most common words in the questions?
- ??? (add more here)
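
The first two questions can be explored with a few lines of Python. The sketch below is only an illustration: `questions` is a hard-coded stand-in for `raw_df.question_text` (five questions copied from the outputs above), and splitting on whitespace is a simplification used here because proper tokenization has not been introduced yet.

```python
from collections import Counter

# Stand-in for raw_df.question_text (assumption: a few questions copied from the outputs above)
questions = [
    'Is foam insulation toxic?',
    'Who are entrepreneurs?',
    'What is the function of a plasma cell?',
    'How does federalism protect against tyranny?',
    'What is the other name given to lines of latitude?',
]

# Question length in characters and in whitespace-separated words
char_lengths = [len(q) for q in questions]
word_lengths = [len(q.split()) for q in questions]

# Most common words (lowercased, surrounding punctuation stripped)
word_counts = Counter(
    word.strip('?.,!').lower()
    for q in questions
    for word in q.split()
)

print(char_lengths)
print(word_lengths)
print(word_counts.most_common(3))
```

On the full dataset, the same lengths can be computed with pandas string methods such as `raw_df.question_text.str.len()` and `raw_df.question_text.str.split().str.len()`.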

### Working Sample

In [29]:
if IS_KAGGLE:
    # SAMPLE_SIZE = len(raw_df)
    SAMPLE_SIZE = 100_000
else:
    SAMPLE_SIZE = 100_000

In [30]:
sample_df = raw_df.sample(SAMPLE_SIZE, random_state=42)

In [31]:
sample_df

Unnamed: 0,qid,question_text,target
443046,56d324bb1e2c29f43b12,What is the most effective classroom managemen...,0
947549,b9ad893dc78c577f8a63,Can I study abroad after 10th class from Bangl...,0
523769,6689ebaeeb65b209a412,How can I make friends as a college junior?,0
949821,ba1e2c4a0fef09671516,How do I download free APK Minecraft: Pocket E...,0
1030397,c9ea2b69bf0d74626f46,"Like Kuvera, is ""Groww"" also a free online inv...",0
...,...,...,...
998930,c3c03a307a29c69971b4,How do I research list of reliable charcoal im...,0
66641,0d119aba95ee6684f506,"What are petroleum products, and what is petro...",0
90024,11a46cd148a104b271cf,What are some services that will let you quick...,0
130113,1973e6e2111a0c93193a,What credit card processors do online marketpl...,0


## Text Preprocessing Techniques


Outline:

- Bag of words intuition
- Tokenization & stop word removal
- Stemming / Lemmatization

### Bag of Words Intuition



How to create a bag of words model:

1. Make a list of all possible words & number them (starting from 0)
2. Encode each document as a vector of word frequencies


Example:

![](https://i.imgur.com/OpwwKBv.png)
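
The two steps can also be sketched in plain Python. This is a minimal illustration on a made-up toy corpus (the `documents` list and the `encode` helper are hypothetical, not part of the notebook), and it splits on whitespace instead of using a real tokenizer.

```python
# Toy corpus (hypothetical documents, not taken from the dataset)
documents = [
    'the cat sat on the mat',
    'the dog sat on the log',
    'cats and dogs',
]

# Step 1: list all unique words and number them starting from 0
vocabulary = {}
for doc in documents:
    for word in doc.split():
        if word not in vocabulary:
            vocabulary[word] = len(vocabulary)

# Step 2: encode each document as a vector of word frequencies
def encode(doc):
    vector = [0] * len(vocabulary)
    for word in doc.split():
        if word in vocabulary:  # words outside the vocabulary are ignored
            vector[vocabulary[word]] += 1
    return vector

vectors = [encode(doc) for doc in documents]

print(vocabulary)
for vector in vectors:
    print(vector)
```

Every document becomes a vector of the same length (10 here), regardless of how many words it contains. Note that `cat` and `cats` receive separate indices, so the third document shares no features with the first two.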


Issues:

1. There might be too many unique words (tens of thousands)
2. Some words may be too frequent and some too rare
3. The same root word may have many different forms (e.g. bird/birds, go/going/gone)

This is why we require preprocessing.

In [43]:
q0 = sample_df[sample_df.target == 0].question_text.iloc[0]
q0

'What is the most effective classroom management skill/technique to create a good learning environment?'

In [44]:
q1 = sample_df[sample_df.target == 1].question_text.iloc[1]

In [71]:
q2 = """
Ain't nothing but a heartache,
Ain't nothing but a mistake, (yeah)
I never wanna hear you say,
I want it that way!
Can't you do that for me?
"""

### Tokenization

Tokenization is the process of splitting a sentence or document into individual words (tokens).

In [40]:
import nltk
from nltk.tokenize import word_tokenize

In [41]:
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [53]:
q0

'What is the most effective classroom management skill/technique to create a good learning environment?'

In [63]:
q0_tok = word_tokenize(q0)
q0_tok 

['What',
 'is',
 'the',
 'most',
 'effective',
 'classroom',
 'management',
 'skill/technique',
 'to',
 'create',
 'a',
 'good',
 'learning',
 'environment',
 '?']

In [54]:
q1

'If Macedonia is supposedly "purely" Greek and so important to Greece then why they have an English Cross and not the Vergina Sun on their flag?'

In [64]:
q1_tok = word_tokenize(q1)
q1_tok

['If',
 'Macedonia',
 'is',
 'supposedly',
 '``',
 'purely',
 "''",
 'Greek',
 'and',
 'so',
 'important',
 'to',
 'Greece',
 'then',
 'why',
 'they',
 'have',
 'an',
 'English',
 'Cross',
 'and',
 'not',
 'the',
 'Vergina',
 'Sun',
 'on',
 'their',
 'flag',
 '?']

In [72]:
print(q2)


Ain't nothing but a heartache,
Ain't nothing but a mistake, (yeah)
I never wanna hear you say,
I want it that way!
Can't you do that for me?



In [73]:
word_tokenize(q2)

['Ai',
 "n't",
 'nothing',
 'but',
 'a',
 'heartache',
 ',',
 'Ai',
 "n't",
 'nothing',
 'but',
 'a',
 'mistake',
 ',',
 '(',
 'yeah',
 ')',
 'I',
 'never',
 'wan',
 'na',
 'hear',
 'you',
 'say',
 ',',
 'I',
 'want',
 'it',
 'that',
 'way',
 '!',
 'Ca',
 "n't",
 'you',
 'do',
 'that',
 'for',
 'me',
 '?']
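
To see why a dedicated tokenizer is useful, it helps to compare it with simpler approaches. The sketch below contrasts plain whitespace splitting with a small regular expression. The pattern is an assumption made for illustration and is not how NLTK's `word_tokenize` works internally.

```python
import re

text = "Can't you do that for me?"

# Naive approach: split on whitespace; punctuation stays glued to words
whitespace_tokens = text.split()

# Simple regex: a run of word characters (optionally with an inner apostrophe),
# or any single character that is neither a word character nor whitespace
regex_tokens = re.findall(r"\w+(?:'\w+)?|[^\w\s]", text)

print(whitespace_tokens)
print(regex_tokens)
```

Whitespace splitting yields `'me?'` as a single token, while the regex separates the question mark. Unlike `word_tokenize`, which splits `Can't` into `Ca` and `n't` as shown above, this regex keeps the contraction as one token.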

### Stop Word Removal

Removing commonly occurring words (pronouns, prepositions, articles, etc.) that carry little meaning on their own.

In [58]:
from nltk.corpus import stopwords

In [59]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [60]:
english_stopwords = stopwords.words('english')

In [62]:
len(english_stopwords)

179

In [61]:
", ".join(english_stopwords)

"i, me, my, myself, we, our, ours, ourselves, you, you're, you've, you'll, you'd, your, yours, yourself, yourselves, he, him, his, himself, she, she's, her, hers, herself, it, it's, its, itself, they, them, their, theirs, themselves, what, which, who, whom, this, that, that'll, these, those, am, is, are, was, were, be, been, being, have, has, had, having, do, does, did, doing, a, an, the, and, but, if, or, because, as, until, while, of, at, by, for, with, about, against, between, into, through, during, before, after, above, below, to, from, up, down, in, out, on, off, over, under, again, further, then, once, here, there, when, where, why, how, all, any, both, each, few, more, most, other, some, such, no, nor, not, only, own, same, so, than, too, very, s, t, can, will, just, don, don't, should, should've, now, d, ll, m, o, re, ve, y, ain, aren, aren't, couldn, couldn't, didn, didn't, doesn, doesn't, hadn, hadn't, hasn, hasn't, haven, haven't, isn, isn't, ma, mightn, mightn't, mustn, mus

In [66]:
def remove_stopwords(tokens):
    return [token for token in tokens if token not in english_stopwords]

In [67]:
q0_tok_stp = remove_stopwords(q0_tok)

In [68]:
q0_tok

['What',
 'is',
 'the',
 'most',
 'effective',
 'classroom',
 'management',
 'skill/technique',
 'to',
 'create',
 'a',
 'good',
 'learning',
 'environment',
 '?']

In [69]:
q0_tok_stp

['What',
 'effective',
 'classroom',
 'management',
 'skill/technique',
 'create',
 'good',
 'learning',
 'environment',
 '?']
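
Notice that `'What'` survives the filter: the stop word list is all lowercase and the comparison is case-sensitive. One possible variation, sketched below, lowercases each token before the lookup and uses a `set` for faster membership tests. The helper name `remove_stopwords_ci` and the small `stopword_set` are assumptions made to keep the example self-contained; in the notebook you would build the set from `english_stopwords`.

```python
# Small hard-coded subset of the NLTK English stop words (assumption: enough for this example)
stopword_set = {'what', 'is', 'the', 'most', 'to', 'a'}

def remove_stopwords_ci(tokens):
    """Drop tokens whose lowercased form is a stop word (hypothetical helper)."""
    return [token for token in tokens if token.lower() not in stopword_set]

tokens = ['What', 'is', 'the', 'most', 'effective', 'classroom',
          'management', 'skill/technique', 'to', 'create', 'a',
          'good', 'learning', 'environment', '?']

print(remove_stopwords_ci(tokens))
```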

Try removing stop words from `q1` and `q2`.

### Stemming

Fix "birds vs. bird" by removing suffixes attached to words.

In [75]:
from nltk.stem.snowball import SnowballStemmer

In [76]:
stemmer = SnowballStemmer(language='english')

In [78]:
q0_tok_stp_stm = [stemmer.stem(word) for word in q0_tok_stp]

In [79]:
q0_tok_stp

['What',
 'effective',
 'classroom',
 'management',
 'skill/technique',
 'create',
 'good',
 'learning',
 'environment',
 '?']

In [81]:
q0_tok_stp_stm

['what',
 'effect',
 'classroom',
 'manag',
 'skill/techniqu',
 'creat',
 'good',
 'learn',
 'environ',
 '?']

Note that the stemmer also lowercases each word, and that the resulting stems (e.g. `manag`, `creat`) are not always valid words.

Repeat with `q1` and `q2`

### Lemmatization

Similar to stemming, but maps each word to a valid dictionary form (its lemma).

In [84]:
from nltk.stem import WordNetLemmatizer

In [85]:
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /root/nltk_data...


True

In [86]:
nltk.download('omw-1.4')

[nltk_data] Downloading package omw-1.4 to /root/nltk_data...


True

In [87]:
lemmatizer = WordNetLemmatizer()

In [93]:
q0_tok_stp

['What',
 'effective',
 'classroom',
 'management',
 'skill/technique',
 'create',
 'good',
 'learning',
 'environment',
 '?']

In [189]:
if not IS_KAGGLE:
    q0_tok_stp_lm = [lemmatizer.lemmatize(word) for word in q0_tok_stp]
    print(q0_tok_stp_lm)

['What', 'effective', 'classroom', 'management', 'skill/technique', 'create', 'good', 'learning', 'environment', '?']


In [190]:
q0_tok_stp_stm

['what',
 'effect',
 'classroom',
 'manag',
 'skill/techniqu',
 'creat',
 'good',
 'learn',
 'environ',
 '?']

Repeat with `q1` and `q2`.

Find the difference between stemming and lemmatization.
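
As a starting point for that comparison, here is a toy sketch of the two ideas. It is deliberately simplistic and is not the Snowball or WordNet algorithm: `toy_stem` chops a fixed list of suffixes, while `toy_lemmatize` looks words up in a tiny hand-written table (`LEMMAS`, a hypothetical stand-in for a real dictionary). One likely reason the lemmatized output above is identical to its input is that `WordNetLemmatizer.lemmatize` treats each word as a noun unless a `pos` argument is given.

```python
# Toy illustration only: NOT the Snowball or WordNet algorithms
SUFFIXES = ('ing', 'ment', 'es', 's', 'e')

def toy_stem(word):
    """Lowercase and chop the first matching suffix, even if the result is not a word."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 2:
            return word[:-len(suffix)]
    return word

# Hypothetical lookup table standing in for a real dictionary such as WordNet
LEMMAS = {'birds': 'bird', 'going': 'go', 'gone': 'go', 'went': 'go'}

def toy_lemmatize(word):
    """Return the dictionary form if known, otherwise the lowercased word unchanged."""
    return LEMMAS.get(word.lower(), word.lower())

for word in ['birds', 'going', 'gone', 'create', 'management']:
    print(word, '->', toy_stem(word), '|', toy_lemmatize(word))
```

The stemmer is rule-based and can produce fragments such as `gon` or `creat`, whereas the lemmatizer only returns forms it knows about and needs a dictionary to do so.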

## Bag of Words Implementation

- Create a vocabulary 
- Convert text to vectors
- Finetune conversion parameters

https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html

### Vocabulary

In [98]:
from sklearn.feature_extraction.text import CountVectorizer

In [99]:
small_vectorizer = CountVectorizer()

In [101]:
small_df = sample_df.sample(10)

In [102]:
small_df

Unnamed: 0,qid,question_text,target
966429,bd57ac4079c035afbf34,Why did Indians not come on roads to protest a...,1
910921,b27e76f16ff101a2c26a,I want to become beta tester of iPhone mobile ...,0
1074805,d29b078be9ba07f0f6be,How it feels to be with a man of your life?,0
1141505,dfaf71c3a5d5c9fc4659,What have been the accomplishments of the acto...,0
693717,87e07df565389d63a9c2,Who is the strongest taijutsu user in Naruto?,0
1111379,d9c772f1df1508cb0b6e,What is the derivative of y = (x^5+5x+8) (x^3+...,0
118743,174026d4df02d6d3ef69,Is clubbing baby seals funny?,1
836155,a3da86b76f8a0b594856,How did Tom Stagliano find out that he was int...,0
800427,9cd64eb9a69b2f4343f1,What are some top affordable online universiti...,0
38703,0791ba4a83778c65fa80,Would God throw gay's out of his church and no...,1


In [103]:
small_df.question_text.values

array(['Why did Indians not come on roads to protest against Modi who first implemented Demonetization without preparation causing troubles to poor and then did not allow RBI to tell the actual figures for so many months? Are Indians taken for granted?',
       'I want to become beta tester of iPhone mobile games, where do I start and what are the qualifications?',
       'How it feels to be with a man of your life?',
       'What have been the accomplishments of the actor Rishi Deepak?',
       'Who is the strongest taijutsu user in Naruto?',
       'What is the derivative of y = (x^5+5x+8) (x^3+7x+9)?',
       'Is clubbing baby seals funny?',
       'How did Tom Stagliano find out that he was interested in applied physics, not theoretical physics, at MIT?',
       "What are some top affordable online universities to study bachelor's in computer science?",
       "Would God throw gay's out of his church and not throw other sinners out?"],
      dtype=object)

In [104]:
small_vectorizer.fit(small_df.question_text)

CountVectorizer()

In [105]:
len(small_vectorizer.vocabulary_)

105

In [106]:
small_vectorizer.get_feature_names_out()

array(['5x', '7x', 'accomplishments', 'actor', 'actual', 'affordable',
       'against', 'allow', 'and', 'applied', 'are', 'at', 'baby',
       'bachelor', 'be', 'become', 'been', 'beta', 'causing', 'church',
       'clubbing', 'come', 'computer', 'deepak', 'demonetization',
       'derivative', 'did', 'do', 'feels', 'figures', 'find', 'first',
       'for', 'funny', 'games', 'gay', 'god', 'granted', 'have', 'he',
       'his', 'how', 'implemented', 'in', 'indians', 'interested',
       'iphone', 'is', 'it', 'life', 'man', 'many', 'mit', 'mobile',
       'modi', 'months', 'naruto', 'not', 'of', 'on', 'online', 'other',
       'out', 'physics', 'poor', 'preparation', 'protest',
       'qualifications', 'rbi', 'rishi', 'roads', 'science', 'seals',
       'sinners', 'so', 'some', 'stagliano', 'start', 'strongest',
       'study', 'taijutsu', 'taken', 'tell', 'tester', 'that', 'the',
       'then', 'theoretical', 'throw', 'to', 'tom', 'top', 'troubles',
       'universities', 'user', 'want

### Transformation

In [108]:
small_question_vectors = small_vectorizer.transform(small_df.question_text)

In [109]:
small_question_vectors.shape

(10, 105)

In [110]:
small_question_vectors.toarray()[0]

array([0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1,
       0, 0, 1, 0, 2, 0, 0, 1, 0, 1, 2, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0,
       2, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 2, 0, 1, 0, 0, 0, 0, 1, 1,
       1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0,
       0, 3, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0])

In [111]:
small_df.question_text.iloc[0]

'Why did Indians not come on roads to protest against Modi who first implemented Demonetization without preparation causing troubles to poor and then did not allow RBI to tell the actual figures for so many months? Are Indians taken for granted?'

### Configured Vectorizer

In [112]:
stemmer = SnowballStemmer(language='english', ignore_stopwords=True)

def tokenize(text):
    return [stemmer.stem(word) for word in word_tokenize(text)]

In [113]:
tokenize(q0)

['what',
 'is',
 'the',
 'most',
 'effect',
 'classroom',
 'manag',
 'skill/techniqu',
 'to',
 'creat',
 'a',
 'good',
 'learn',
 'environ',
 '?']

In [114]:
", ".join(english_stopwords)

"i, me, my, myself, we, our, ours, ourselves, you, you're, you've, you'll, you'd, your, yours, yourself, yourselves, he, him, his, himself, she, she's, her, hers, herself, it, it's, its, itself, they, them, their, theirs, themselves, what, which, who, whom, this, that, that'll, these, those, am, is, are, was, were, be, been, being, have, has, had, having, do, does, did, doing, a, an, the, and, but, if, or, because, as, until, while, of, at, by, for, with, about, against, between, into, through, during, before, after, above, below, to, from, up, down, in, out, on, off, over, under, again, further, then, once, here, there, when, where, why, how, all, any, both, each, few, more, most, other, some, such, no, nor, not, only, own, same, so, than, too, very, s, t, can, will, just, don, don't, should, should've, now, d, ll, m, o, re, ve, y, ain, aren, aren't, couldn, couldn't, didn, didn't, doesn, doesn't, hadn, hadn't, hasn, hasn't, haven, haven't, isn, isn't, ma, mightn, mightn't, mustn, mus

In [115]:
vectorizer2 = CountVectorizer(lowercase=True, 
                              tokenizer=tokenize,
                              stop_words=english_stopwords, 
                              max_features=1000)

In [116]:
%%time
vectorizer2.fit(sample_df.question_text)


The parameter 'token_pattern' will not be used since 'tokenizer' is not None'


Your stop_words may be inconsistent with your preprocessing. Tokenizing the stop words generated tokens ["'d", "'s", 'could', 'might', 'must', "n't", 'need', 'sha', 'wo', 'would'] not in stop_words.



CPU times: user 26.2 s, sys: 4.4 ms, total: 26.2 s
Wall time: 26.1 s


CountVectorizer(max_features=1000,
                stop_words=['i', 'me', 'my', 'myself', 'we', 'our', 'ours',
                            'ourselves', 'you', "you're", "you've", "you'll",
                            "you'd", 'your', 'yours', 'yourself', 'yourselves',
                            'he', 'him', 'his', 'himself', 'she', "she's",
                            'her', 'hers', 'herself', 'it', "it's", 'its',
                            'itself', ...],
                tokenizer=<function tokenize at 0x7ff7c4ef73b0>)

In [118]:
len(vectorizer2.vocabulary_)

1000
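
The vocabulary has exactly 1,000 entries because `max_features=1000` keeps only the most frequent terms across the corpus. The sketch below illustrates that idea on a toy corpus. It is not scikit-learn's implementation (tie-breaking between equally frequent terms may differ), and `build_vocabulary` and `tokenized_docs` are hypothetical names.

```python
from collections import Counter

# Toy corpus of already-tokenized documents (hypothetical data)
tokenized_docs = [
    ['good', 'learn', 'environ'],
    ['good', 'classroom', 'manag'],
    ['learn', 'good', 'skill'],
]

def build_vocabulary(docs, max_features):
    """Keep the max_features most frequent tokens, then index them alphabetically."""
    counts = Counter(token for doc in docs for token in doc)
    top_tokens = [token for token, _ in counts.most_common(max_features)]
    return {token: index for index, token in enumerate(sorted(top_tokens))}

vocab = build_vocabulary(tokenized_docs, max_features=2)
print(vocab)
```

Tokens that do not make the cut (`environ`, `classroom`, `manag`, `skill` in this toy example) are simply ignored when documents are converted to vectors.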

In [143]:
vectorizer2.get_feature_names_out()[:100]

array(['!', '$', '%', '&', "'", "''", "'m", "'s", '(', ')', ',', '-', '.',
       '1', '10', '100', '12', '12th', '15', '2', '20', '2017', '2018',
       '3', '30', '4', '5', '6', '7', '8', ':', '?', '[', ']', '``',
       'abl', 'abroad', 'abus', 'accept', 'access', 'accomplish',
       'accord', 'account', 'achiev', 'act', 'action', 'activ', 'actor',
       'actual', 'ad', 'add', 'address', 'admiss', 'adult', 'advanc',
       'advantag', 'advic', 'affect', 'africa', 'african', 'age', 'agre',
       'air', 'allow', 'almost', 'alon', 'alreadi', 'also', 'altern',
       'alway', 'amazon', 'america', 'american', 'amount', 'analysi',
       'android', 'anim', 'anoth', 'answer', 'anxieti', 'anyon', 'anyth',
       'apart', 'app', 'appear', 'appl', 'appli', 'applic', 'approach',
       'arab', 'area', 'armi', 'around', 'art', 'asian', 'ask', 'associ',
       'atheist', 'attack', 'attend'], dtype=object)

In [119]:
%%time
inputs = vectorizer2.transform(sample_df.question_text)

CPU times: user 25.8 s, sys: 6.2 ms, total: 25.8 s
Wall time: 25.7 s


In [120]:
targets = sample_df.target

In [129]:
test_df

Unnamed: 0,qid,question_text
0,0000163e3ea7c7a74cd7,Why do so many women become so rude and arroga...
1,00002bd4fb5d505b9161,When should I apply for RV college of engineer...
2,00007756b4a147d2b0b3,What is it really like to be a nurse practitio...
3,000086e4b7e1c7146103,Who are entrepreneurs?
4,0000c4c3fbe8785a3090,Is education really making good people nowadays?
...,...,...
375801,ffff7fa746bd6d6197a9,How many countries listed in gold import in in...
375802,ffffa1be31c43046ab6b,Is there an alternative to dresses on formal p...
375803,ffffae173b6ca6bfa563,Where I can find best friendship quotes in Tel...
375804,ffffb1f7f1a008620287,What are the causes of refraction of light?


In [127]:
%%time
test_inputs = vectorizer2.transform(test_df.question_text)

## ML Models for Text Classification

How machine learning works:

![](https://i.imgur.com/UMUfBRj.png)


Outline
* Create training, validation, and test set
* Make submission to Kaggle leaderboard
* Train, evaluate & improve different ML models




### Create Training, Validation & Test Set

In [121]:
from sklearn.model_selection import train_test_split

In [122]:
train_inputs, val_inputs, train_targets, val_targets = train_test_split(inputs, targets, test_size=0.3)

In [123]:
train_inputs.shape

(70000, 1000)

In [141]:
train_inputs[20].toarray()

array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 

In [124]:
val_inputs.shape

(30000, 1000)

In [125]:
train_targets.shape

(70000,)

In [126]:
val_targets.shape

(30000,)

In [132]:
test_inputs.shape

(375806, 1000)

### Make submission to Kaggle leaderboard

We'll start with a random submission

In [133]:
import numpy as np

In [146]:
def get_random_preds(inputs):
    return np.random.choice((0, 1), inputs.shape[0])

In [147]:
random_train_preds = get_random_preds(train_inputs)

In [151]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

In [153]:
def evaluate_preds(targets, preds):
    print('Accuracy: ', accuracy_score(targets, preds))
    print('Precision: ', precision_score(targets, preds))
    print('Recall: ', recall_score(targets, preds))
    print('F1 score:', f1_score(targets, preds))

In [154]:
evaluate_preds(train_targets, random_train_preds)

Accuracy:  0.4997857142857143
Precision:  0.06019312078619586
Recall:  0.498698224852071
F1 score: 0.10742053072981722
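
These numbers make sense once the metrics are written out in terms of true/false positives and negatives. With random guessing, about half of all predictions are correct (accuracy near 0.5) and about half of the insincere questions are caught (recall near 0.5), but precision is roughly the share of insincere questions in the data, which the output above suggests is around 6%. The sketch below computes the same four metrics by hand on made-up labels; `confusion_counts` and `compute_metrics` are hypothetical helpers, not scikit-learn functions.

```python
def confusion_counts(targets, preds):
    """Count true positives, false positives, false negatives and true negatives."""
    tp = sum(1 for t, p in zip(targets, preds) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(targets, preds) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(targets, preds) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(targets, preds) if t == 0 and p == 0)
    return tp, fp, fn, tn

def compute_metrics(targets, preds):
    """Return (accuracy, precision, recall, f1), using 0.0 when a ratio is undefined."""
    tp, fp, fn, tn = confusion_counts(targets, preds)
    accuracy = (tp + tn) / len(targets)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, precision, recall, f1

# Hypothetical toy labels: 2 positives out of 10, predictions alternate 0/1
toy_targets = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
toy_preds = [0, 1, 0, 1, 0, 1, 0, 1, 0, 1]

print(compute_metrics(toy_targets, toy_preds))
```

Because the classes are so imbalanced, accuracy alone is misleading here: always predicting `0` would score about 94% accuracy while catching no insincere questions, which is why the F1 score is the more informative number to track.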


In [155]:
random_val_preds = get_random_preds(val_inputs)

In [156]:
evaluate_preds(val_targets, random_val_preds)

Accuracy:  0.5054666666666666
Precision:  0.06289857018191582
Recall:  0.516822945394374
F1 score: 0.11214841412327947


In [157]:
random_test_preds = get_random_preds(test_inputs)

In [159]:
sub_df

Unnamed: 0,qid,prediction
0,0000163e3ea7c7a74cd7,0
1,00002bd4fb5d505b9161,0
2,00007756b4a147d2b0b3,0
3,000086e4b7e1c7146103,0
4,0000c4c3fbe8785a3090,0
...,...,...
375801,ffff7fa746bd6d6197a9,0
375802,ffffa1be31c43046ab6b,0
375803,ffffae173b6ca6bfa563,0
375804,ffffb1f7f1a008620287,0


In [160]:
def generate_submission_file(preds):
    submission_df = pd.read_csv(data_dir/'sample_submission.csv')
    submission_df.prediction = preds
    submission_df.to_csv('submission.csv', index=False)

In [161]:
# generate_submission_file(random_test_preds)

To submit on Kaggle:
- Save this notebook on Jovian
- Run it on Kaggle using "Run on Kaggle"
- Connect to the Quora competition & turn off internet
- Click "Submit" and create a version of the notebook
- Go to the "Data" tab in the output and submit the file

### Train & Evaluate ML Models

In [168]:
from sklearn.linear_model import LogisticRegression

In [173]:
logreg_model = LogisticRegression(solver='sag', max_iter=2000)

In [174]:
%%time
logreg_model.fit(train_inputs, train_targets)

CPU times: user 43 s, sys: 5.57 ms, total: 43 s
Wall time: 42.6 s



The max_iter was reached which means the coef_ did not converge



LogisticRegression(max_iter=2000, solver='sag')

In [178]:
train_preds = logreg_model.predict(train_inputs)

In [180]:
evaluate_preds(train_targets, train_preds)

Accuracy:  0.9486571428571429
Precision:  0.6670195870831127
Recall:  0.2982248520710059
F1 score: 0.41216879293424924


In [181]:
val_preds = logreg_model.predict(val_inputs)

In [182]:
evaluate_preds(val_targets, val_preds)

Accuracy:  0.9466333333333333
Precision:  0.6305418719211823
Recall:  0.2824048538334253
F1 score: 0.3900952380952381


In [183]:
test_preds = logreg_model.predict(test_inputs)

In [184]:
generate_submission_file(test_preds)

How to improve the model:

- Use a larger training sample (or the entire dataset)
- Experiment with tokenizer, stemmer, stop words etc.
- Experiment with CountVectorizer parameters (vocab size etc.)
- Experiment with logistic regression parameters
- Experiment with different ML models
- Use n-grams instead of single words for BoW
- Use a different technique for converting text to vectors

Try applying the above ideas and see how far you can get!
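
As a hint for the n-gram idea, the sketch below shows what n-grams are using a hypothetical `ngrams` helper. With scikit-learn you would not need to write this yourself: `CountVectorizer` accepts an `ngram_range` parameter, e.g. `ngram_range=(1, 2)` to use both single words and pairs of consecutive words.

```python
def ngrams(tokens, n):
    """Return all runs of n consecutive tokens, joined with a space."""
    return [' '.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = ['is', 'foam', 'insulation', 'toxic']

unigrams = ngrams(tokens, 1)
bigrams = ngrams(tokens, 2)

print(unigrams + bigrams)
```

Bigrams such as `foam insulation` preserve some word order that single-word counts discard, at the cost of a much larger vocabulary.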

## Summary and References


Summary

* Downloaded and explored the data
* Applied text preprocessing techniques
* Converted text to bag-of-words vectors
* Trained ML models & made predictions

References

* Quora Insincere Question Classification competition: https://www.kaggle.com/c/quora-insincere-questions-classification
* What is NLP: https://www.ibm.com/cloud/learn/natural-language-processing
* Text preprocessing with NLTK: https://realpython.com/nltk-nlp-python/
* CountVectorizer documentation: https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html
* LogisticRegression documentation: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html
* Data Analysis with Python: https://zerotopandas.com
* Machine Learning with Python: https://zerotogbms.com
* Open Datasets Library: https://github.com/JovianML/opendatasets


In [191]:
if not IS_KAGGLE:
    import jovian
    jovian.commit()

[jovian] Detected Colab notebook...
[jovian] Uploading colab notebook to Jovian...
Committed successfully! https://jovian.ai/aakashns/natural-langauge-processing-bag-of-words
