# Quora Insincere Questions Classification

## Context
Quora is a popular website where anyone can ask and/or answer questions. It has more than 100 million unique visitors per month.

Like any other forum, Quora is facing a problem: toxic questions and comments.

As you can imagine, Quora's teams cannot check every question and answer by hand, so they asked the data science community to help them classify insincere questions automatically.

## Data
This challenge was launched on Kaggle: https://www.kaggle.com/c/quora-insincere-questions-classification

Read the competition overview on Kaggle. Quora provided a dataset of labeled questions with the following columns:

+ qid: a unique identifier for each question, a hexadecimal number
+ question_text: the text of the question
+ target: either 1 (insincere question) or 0 (sincere question)

In this competition, submissions are evaluated with the F1 score.
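
For reference, the F1 score is the harmonic mean of precision and recall for the positive class (here, insincere questions). The sketch below computes it by hand on a made-up set of labels; the function name `f1_from_labels` and the toy data are illustrative assumptions, not part of the competition material.

```python
def f1_from_labels(y_true, y_pred, positive=1):
    """Return (precision, recall, f1) for the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Toy labels (hypothetical): 3 insincere questions out of 10
y_true = [0, 0, 0, 0, 1, 1, 1, 0, 0, 0]
y_pred = [0, 0, 0, 0, 1, 0, 0, 1, 0, 0]

precision, recall, f1 = f1_from_labels(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Accuracy looks acceptable on this toy example even though only one of the three insincere questions is found, which is why F1 is a better fit than accuracy for imbalanced data.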

## EDA

In [1]:
import pandas as pd
import numpy as np
import os

#TODO: Read the training data from the CSV file
df = pd.read_csv("train.csv")
df.shape

(1306122, 3)

In [2]:
#TODO: Check data
df.head()

Unnamed: 0,qid,question_text,target
0,00002165364db923c7e6,How did Quebec nationalists see their province...,0
1,000032939017120e6e44,"Do you have an adopted dog, how would you enco...",0
2,0000412ca6e4628ce2cf,Why does velocity affect time? Does velocity a...,0
3,000042bf85aa498cd78e,How did Otto von Guericke used the Magdeburg h...,0
4,0000455dfa3e01eae3af,Can I convert montra helicon D to a mountain b...,0


In [3]:
# TODO: Print the class distribution 
toxic = df[df["target"]==1]["target"].count() / df.shape[0] * 100
nontoxic = df[df["target"]==0]["target"].count() / df.shape[0] * 100

print("Ratio of toxic question {} %".format(toxic))
print("Ratio of non toxic question {} %".format(nontoxic))

Ratio of toxic question 6.187017751787352 %
Ratio of non toxic question 93.81298224821265 %
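
The same distribution can be obtained in a single call with `value_counts(normalize=True)`. A minimal sketch, using a small made-up DataFrame (`toy`) in place of the real training data:

```python
import pandas as pd

# Hypothetical stand-in for the training data: 1 insincere question out of 8
toy = pd.DataFrame({"target": [0, 0, 0, 1, 0, 0, 0, 0]})

# Share of each class, in percent, indexed by the class label
class_pct = toy["target"].value_counts(normalize=True) * 100
print(class_pct)
```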


In [4]:
pd.set_option('display.max_colwidth', 1000)
df[df.target==1].head(n=10)

Unnamed: 0,qid,question_text,target
22,0000e91571b60c2fb487,Has the United States become the largest dictatorship in the world?,1
30,00013ceca3f624b09f42,Which babies are more sweeter to their parents? Dark skin babies or light skin babies?,1
110,0004a7fcb2bf73076489,If blacks support school choice and mandatory sentencing for criminals why don't they vote Republican?,1
114,00052793eaa287aff1e1,"I am gay boy and I love my cousin (boy). He is sexy, but I dont know what to do. He is hot, and I want to see his di**. What should I do?",1
115,000537213b01fd77b58a,Which races have the smallest penis?,1
119,00056d45a1ce63856fc6,Why do females find penises ugly?,1
127,0005de07b07a17046e27,How do I marry an American woman for a Green Card? How much do they charge?,1
144,00068875d7c82a5bcf88,"Why do Europeans say they're the superior race, when in fact it took them over 2,000 years until mid 19th century to surpass China's largest economy?",1
156,0006ffd99a6599ff35b3,Did Julius Caesar bring a tyrannosaurus rex on his campaigns to frighten the Celts into submission?,1
167,00075f7061837807c69f,In what manner has Republican backing of 'states rights' been hypocritical and what ways have they actually restricted the ability of states to make their own laws?,1


The dataset is quite big, so let's work with a sample of 10,000 rows.

In [5]:
from sklearn.utils import resample

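#TODO: sample 10000 questions
# replace=False draws distinct rows; resample() samples with replacement (bootstrap) by default
df_sample = resample(df, n_samples=10000, replace=False)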
df_sample.shape

(10000, 3)

Check the proportion of toxic questions in our sample:

In [6]:
#TODO

toxic = df_sample[df_sample["target"]==1]["target"].count() / df_sample.shape[0] * 100
nontoxic = df_sample[df_sample["target"]==0]["target"].count() / df_sample.shape[0] * 100


print("Ratio of toxic question {} %".format(toxic))
print("Ratio of non toxic question {} %".format(nontoxic))


Ratio of toxic question 5.82 %
Ratio of non toxic question 94.17999999999999 %
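
A plain random sample does not reproduce the class ratio exactly (5.82 % here versus 6.19 % in the full data). If that matters, one option is to sample each class separately. The sketch below assumes pandas 1.1 or later (for `DataFrameGroupBy.sample`) and uses a made-up DataFrame (`toy`); it is not what this notebook does.

```python
import pandas as pd

# Hypothetical data: 1000 questions, 6 % of them insincere
toy = pd.DataFrame({
    "question_text": [f"question {i}" for i in range(1000)],
    "target": [1] * 60 + [0] * 940,
})

# Draw 10 % of each class so the sample keeps the original class ratio
toy_sample = toy.groupby("target").sample(frac=0.1, random_state=0)

print(toy_sample["target"].value_counts(normalize=True) * 100)
```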


## Text Preprocessing

In [7]:
contraction_mapping = {"ain't": "is not", "aren't": "are not","can't": "cannot", "'cause": "because", 
                       "could've": "could have", "couldn't": "could not", "didn't": "did not",  "doesn't": "does not",
                       "don't": "do not", "hadn't": "had not", "hasn't": "has not", "haven't": "have not", 
                       "he'd": "he would","he'll": "he will", "he's": "he is", "how'd": "how did", "how'd'y": "how do you", 
                       "how'll": "how will", "how's": "how is",  "I'd": "I would", "I'd've": "I would have", "I'll": "I will", 
                       "I'll've": "I will have","I'm": "I am", "I've": "I have", "i'd": "i would", "i'd've": "i would have", 
                       "i'll": "i will",  "i'll've": "i will have","i'm": "i am", "i've": "i have", "isn't": "is not", 
                       "it'd": "it would", "it'd've": "it would have", "it'll": "it will", "it'll've": "it will have",
                       "it's": "it is", "let's": "let us", "ma'am": "madam", "mayn't": "may not", "might've": "might have",
                       "mightn't": "might not","mightn't've": "might not have", "must've": "must have", "mustn't": "must not", 
                       "mustn't've": "must not have", "needn't": "need not", "needn't've": "need not have","o'clock": "of the clock", 
                       "oughtn't": "ought not", "oughtn't've": "ought not have", "shan't": "shall not", "sha'n't": "shall not", 
                       "shan't've": "shall not have", "she'd": "she would", "she'd've": "she would have", "she'll": "she will", 
                       "she'll've": "she will have", "she's": "she is", "should've": "should have", "shouldn't": "should not", 
                       "shouldn't've": "should not have", "so've": "so have","so's": "so as", "this's": "this is","that'd": "that would",
                       "that'd've": "that would have", "that's": "that is", "there'd": "there would", "there'd've": "there would have", 
                       "there's": "there is", "here's": "here is","they'd": "they would", "they'd've": "they would have", 
                       "they'll": "they will", "they'll've": "they will have", "they're": "they are", "they've": "they have", 
                       "to've": "to have", "wasn't": "was not", "we'd": "we would", "we'd've": "we would have", "we'll": "we will", 
                       "we'll've": "we will have", "we're": "we are", "we've": "we have", "weren't": "were not", "what'll": "what will", 
                       "what'll've": "what will have", "what're": "what are",  "what's": "what is", "what've": "what have", 
                       "when's": "when is", "when've": "when have", "where'd": "where did", "where's": "where is", 
                       "where've": "where have", "who'll": "who will", "who'll've": "who will have", "who's": "who is", 
                       "who've": "who have", "why's": "why is", "why've": "why have", "will've": "will have", "won't": "will not", 
                       "won't've": "will not have", "would've": "would have", "wouldn't": "would not", "wouldn't've": "would not have", 
                       "y'all": "you all", "y'all'd": "you all would","y'all'd've": "you all would have","y'all're": "you all are",
                       "y'all've": "you all have","you'd": "you would", "you'd've": "you would have", "you'll": "you will", 
                       "you'll've": "you will have", "you're": "you are", "you've": "you have" }

#TODO: normalize the text by using contraction mapping

for contraction in contraction_mapping:
    df_sample["question_text"] = df_sample["question_text"].apply(lambda x: str(x).replace(contraction, contraction_mapping[contraction]))
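
Calling `str.replace` once per contraction scans the column more than a hundred times and is case-sensitive, so a sentence starting with "Don't" is left untouched. One possible alternative is a single compiled regular expression. The sketch below uses a shortened mapping and a hypothetical helper `expand_contractions`; it also normalizes the typographic apostrophe, which is an assumption about the data.

```python
import re

# Shortened mapping for illustration; the full contraction_mapping would be used in practice
small_mapping = {"don't": "do not", "can't": "cannot", "what's": "what is", "i'm": "i am"}

# Longest keys first so that longer contractions win over their prefixes
pattern = re.compile(
    r"\b(" + "|".join(re.escape(k) for k in sorted(small_mapping, key=len, reverse=True)) + r")\b",
    flags=re.IGNORECASE,
)

def expand_contractions(text):
    # Normalize the typographic apostrophe before matching
    text = text.replace("\u2019", "'")
    # Look up the lowercased match; the expansion is always lowercase
    return pattern.sub(lambda m: small_mapping[m.group(0).lower()], text)

print(expand_contractions("Don't you think what's said can\u2019t be true?"))
```

Keys that begin with an apostrophe, such as `'cause`, would need separate handling, because `\b` does not match between a space and an apostrophe.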


In [8]:
from nltk import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

def text_preprocess(text):
    text = text.lower()
    # After inspecting the out-of-vocabulary tokens, we decided to apply the following replacements
    text = text.replace("-", " ").replace("/", " ").replace("\\", " ").replace("'", " ")
    tokens = word_tokenize(text)
    tokens = [t for t in tokens if t.isalpha()]
    return tokens

#TODO: you can further remove stop words and use a lemmatizer
lemmatizer = WordNetLemmatizer()
# A set makes the "word not in stop_words" test constant-time
stop_words = set(stopwords.words("english"))

In [9]:
df_sample["tokens"] = df_sample["question_text"].apply(lambda x: text_preprocess(x))
df_sample.head()

Unnamed: 0,qid,question_text,target,tokens
1306057,fffcc30ffad59abeb37c,Can profanity make you have bad breath?,0,"[can, profanity, make, you, have, bad, breath]"
1054851,ceb482134d2febdf1ca9,What is the scope of food microbiology in India?,0,"[what, is, the, scope, of, food, microbiology, in, india]"
458916,59e38ecfecac0d30ff17,What is the Maharashtrian Panchkarma?,0,"[what, is, the, maharashtrian, panchkarma]"
777189,983ba415e1b49aa501ef,What's the best alternative to Craigslist's casual encounters?,0,"[what, s, the, best, alternative, to, craigslist, s, casual, encounters]"
106550,14df3cd54d22e3f7d989,Do left-leaning parties mostly win in the U.K. and the E.U. because of the voting rights extended to most immigrants?,1,"[do, left, leaning, parties, mostly, win, in, the, and, the, because, of, the, voting, rights, extended, to, most, immigrants]"


In [10]:
#TODO: you can further remove stop words and use a lemmatizer
df_sample["tokens_no_stop"] = df_sample["tokens"].apply(lambda x : [word for word in x if word not in stop_words])
df_sample["lemmatized"] = df_sample["tokens_no_stop"].apply(lambda x : [lemmatizer.lemmatize(word) for word in x])
df_sample

Unnamed: 0,qid,question_text,target,tokens,tokens_no_stop,lemmatized
1306057,fffcc30ffad59abeb37c,Can profanity make you have bad breath?,0,"[can, profanity, make, you, have, bad, breath]","[profanity, make, bad, breath]","[profanity, make, bad, breath]"
1054851,ceb482134d2febdf1ca9,What is the scope of food microbiology in India?,0,"[what, is, the, scope, of, food, microbiology, in, india]","[scope, food, microbiology, india]","[scope, food, microbiology, india]"
458916,59e38ecfecac0d30ff17,What is the Maharashtrian Panchkarma?,0,"[what, is, the, maharashtrian, panchkarma]","[maharashtrian, panchkarma]","[maharashtrian, panchkarma]"
777189,983ba415e1b49aa501ef,What's the best alternative to Craigslist's casual encounters?,0,"[what, s, the, best, alternative, to, craigslist, s, casual, encounters]","[best, alternative, craigslist, casual, encounters]","[best, alternative, craigslist, casual, encounter]"
106550,14df3cd54d22e3f7d989,Do left-leaning parties mostly win in the U.K. and the E.U. because of the voting rights extended to most immigrants?,1,"[do, left, leaning, parties, mostly, win, in, the, and, the, because, of, the, voting, rights, extended, to, most, immigrants]","[left, leaning, parties, mostly, win, voting, rights, extended, immigrants]","[left, leaning, party, mostly, win, voting, right, extended, immigrant]"
...,...,...,...,...,...,...
1115727,daa2076beefd50f49dae,What are pulses and lentils?,0,"[what, are, pulses, and, lentils]","[pulses, lentils]","[pulse, lentil]"
601322,75c7dbd950bf92b7583b,"Is ""black girl magic"" a way to make black women feel better about their plight and the way they are perceived by society ?",1,"[is, black, girl, magic, a, way, to, make, black, women, feel, better, about, their, plight, and, the, way, they, are, perceived, by, society]","[black, girl, magic, way, make, black, women, feel, better, plight, way, perceived, society]","[black, girl, magic, way, make, black, woman, feel, better, plight, way, perceived, society]"
610079,7777d238611d8533815e,"What are the best ways to get from Tallahassee, FL to Ft. Lauderdale, FL?",0,"[what, are, the, best, ways, to, get, from, tallahassee, fl, to, lauderdale, fl]","[best, ways, get, tallahassee, fl, lauderdale, fl]","[best, way, get, tallahassee, fl, lauderdale, fl]"
424017,531dbc157892ba1dec04,Are there any essays about technology's penchant for shifting a person into multiple concurrent situations and realities?,0,"[are, there, any, essays, about, technology, s, penchant, for, shifting, a, person, into, multiple, concurrent, situations, and, realities]","[essays, technology, penchant, shifting, person, multiple, concurrent, situations, realities]","[essay, technology, penchant, shifting, person, multiple, concurrent, situation, reality]"


## Word Embeddings

Should we preprocess the text with our classic methods? Not necessarily. First, let's check what proportion of our document vocabulary is covered by the embeddings.

In [11]:
import numpy as np

# Read a pretrained GloVe file and return the list of words and a dictionary mapping each word to its embedding
def read_glove_vecs(glove_file):
    with open(glove_file, 'r', encoding = 'utf-8') as f:
        words = []
        word_to_vec_map = {}
        bad = 0
        
        for line in f:
            line = line.strip().split()
            curr_word = line[0]
            words.append(curr_word)
            try :
                word_to_vec_map[curr_word] = np.array(line[1:], dtype=np.float64)
            except ValueError:
                bad +=1
            
        print(f'There are {bad} bad lines')
    return words, word_to_vec_map

In [12]:
#TODO: Load Glove embedding
glove_file = "glove.6B.50d.txt"
words, word_to_vec_map = read_glove_vecs(glove_file)

There are 0 bad lines


In [15]:
import operator

#TODO: check if any token is not in the GloVe embedding

def is_in_vocab(tokens_list):
    in_vocab = {}
    out_vocab = {}
    for lst in tokens_list:
        for token in lst:
            # Look the token up in the dict: a membership test on the `words` list is O(n) per token
            if token.lower() in word_to_vec_map:
                in_vocab[token] = 1
            else:
                out_vocab[token] = out_vocab.get(token, 0) + 1

    # Most frequent out-of-vocabulary tokens first
    out_vocab_ordered = sorted(out_vocab.items(), key=operator.itemgetter(1))[::-1]
    return in_vocab, out_vocab_ordered

In [16]:
text = np.array(df_sample.tokens)
in_vocab, out_vocab = is_in_vocab(text)

in_vocab_ratio = len(in_vocab.keys())/(len(in_vocab.keys()) + len(out_vocab))
out_vocab_ratio = len(out_vocab)/(len(in_vocab.keys()) + len(out_vocab))
                                                             
print("proportion of words in word embedding vocab: ", in_vocab_ratio*100, "%")
print("proportion of words not in word embedding vocab: ", out_vocab_ratio*100, "%")

proportion of words in word embedding vocab:  93.39843212763031 %
proportion of words not in word embedding vocab:  6.601567872369688 %
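
The ratio above counts each distinct token once. A complementary view is to weight tokens by how often they occur, since rare out-of-vocabulary tokens affect few questions. A minimal sketch with `collections.Counter`, where `toy_tokens` and `toy_vocab` are made-up stand-ins for `df_sample.tokens` and the GloVe vocabulary:

```python
from collections import Counter

toy_tokens = [
    ["what", "is", "blockchain"],
    ["what", "is", "the", "best", "cryptocurrency"],
    ["is", "the", "blockchain", "safe"],
]
toy_vocab = {"what", "is", "the", "best", "safe"}

# Number of occurrences of each distinct token
counts = Counter(token for tokens in toy_tokens for token in tokens)

# Coverage over distinct tokens (what the notebook computes)
vocab_coverage = sum(1 for t in counts if t in toy_vocab) / len(counts)
# Coverage over token occurrences
occurrence_coverage = sum(c for t, c in counts.items() if t in toy_vocab) / sum(counts.values())

print(f"distinct tokens covered: {vocab_coverage:.1%}")
print(f"token occurrences covered: {occurrence_coverage:.1%}")
```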


In [17]:
len(out_vocab)

960

In [18]:
out_vocab

[('bitsat', 6),
 ('quorans', 6),
 ('blockchain', 5),
 ('cryptocurrencies', 4),
 ('pdpu', 4),
 ('iiser', 4),
 ('upvotes', 4),
 ('numericals', 3),
 ('ntse', 3),
 ('cryptocurrency', 3),
 ('viteee', 3),
 ('flipkart', 3),
 ('passout', 3),
 ('friendzone', 3),
 ('ibps', 3),
 ('redmi', 3),
 ('chatbot', 3),
 ('iisers', 3),
 ('duolingo', 3),
 ('ethereum', 3),
 ('upvote', 3),
 ('jiren', 3),
 ('intjs', 2),
 ('potterheads', 2),
 ('intps', 2),
 ('async', 2),
 ('dyi', 2),
 ('upwork', 2),
 ('ocytocin', 2),
 ('indianization', 2),
 ('afcat', 2),
 ('jvzoo', 2),
 ('divs', 2),
 ('binance', 2),
 ('rotj', 2),
 ('rnsit', 2),
 ('mpsc', 2),
 ('downvotes', 2),
 ('kardashev', 2),
 ('bittrex', 2),
 ('displaystyle', 2),
 ('selfie', 2),
 ('cpec', 2),
 ('fiverr', 2),
 ('uchicago', 2),
 ('regenepure', 2),
 ('jaigaon', 2),
 ('psycopath', 2),
 ('mehanical', 2),
 ('kavalireddi', 2),
 ('argota', 2),
 ('coep', 2),
 ('pentene', 2),
 ('suncream', 2),
 ('coinbase', 2),
 ('qty', 2),
 ('manaphy', 2),
 ('whatare', 2),
 ('antifa'

(OPTIONAL)
How to improve this rate:
* Should we remove punctuation?
* Should we remove numbers?
* Should we remove stopwords?
* Should we stem / lemmatize?

We could also use TextBlob to correct misspellings.
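
Several of the out-of-vocabulary tokens listed above are plain misspellings (`psycopath`, `mehanical`). As a dependency-free illustration of the spelling-correction idea, the sketch below maps an unknown token to its closest in-vocabulary word with the standard library's `difflib`. This is a swapped-in technique, not TextBlob, and the vocabulary, the helper name `closest_in_vocab`, and the 0.85 cutoff are all assumptions. Scanning the full GloVe vocabulary this way would be slow.

```python
import difflib

# Tiny stand-in for the embedding vocabulary
toy_vocab = ["mechanical", "medical", "psychopath", "telepath", "question"]

def closest_in_vocab(token, vocab, cutoff=0.85):
    """Return the most similar in-vocabulary word, or the token itself if none is close enough."""
    if token in vocab:
        return token
    matches = difflib.get_close_matches(token, vocab, n=1, cutoff=cutoff)
    return matches[0] if matches else token

for token in ["mehanical", "psycopath", "bitsat"]:
    print(token, "->", closest_in_vocab(token, toy_vocab))
```

Tokens such as `bitsat` are not misspellings but domain terms, so they are left unchanged rather than forced onto an unrelated word.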

In [19]:
# Compute the embedding of each question as the mean of its word embeddings
def get_vector_from(tokens):
    # Dict lookup instead of scanning the `words` list for every token
    vectors = [word_to_vec_map[t] for t in tokens if t in word_to_vec_map]
    if not vectors:
        # No token is covered by the embeddings: return NaN so that dropna() removes the row later
        return np.nan
    return np.mean(vectors, axis=0).astype("float64")

In [23]:
df_sample["vector"] = df_sample["tokens"].apply(lambda x : get_vector_from(x))
df_sample.head()

Unnamed: 0,qid,question_text,target,tokens,tokens_no_stop,lemmatized,vector
1306057,fffcc30ffad59abeb37c,Can profanity make you have bad breath?,0,"[can, profanity, make, you, have, bad, breath]","[profanity, make, bad, breath]","[profanity, make, bad, breath]","[0.2735097285714286, 0.05876699999999997, 0.056122857142857155, -0.375049, 0.0704845142857143, 0.15373485714285712, 0.017307428571428547, -0.015300399999999976, -0.08001171428571427, 0.5773367142857143, -0.1778442857142857, 0.5031191428571429, -0.04093857142857145, 0.026195285714285716, 0.7030200000000002, 0.35897428571428575, 0.10021428571428574, -0.01711428571428571, -0.1063357142857143, -0.9463, -0.34211299999999994, 0.34108171428571427, 0.8415585714285714, 0.2369317142857143, 0.4061657142857143, -1.4598014285714285, -0.5350414285714286, 0.4181371428571429, 0.8740585714285712, -0.5098187142857143, 3.0746428571428575, 0.6231571428571429, -0.19968942857142857, -0.33182042857142857, 0.005309999999999997, 0.143164, 0.1407133857142857, -0.1728361428571428, 0.2896671428571428, -0.36075552857142856, 0.03326571428571429, 0.3396215714285714, 0.012548571428571418, 0.67879, 0.28333142857142857, 0.3194828571428571, 0.1407157142857143, 0.18043328571428568, -0.010018999999999986, 0.3801571428..."
1054851,ceb482134d2febdf1ca9,What is the scope of food microbiology in India?,0,"[what, is, the, scope, of, food, microbiology, in, india]","[scope, food, microbiology, india]","[scope, food, microbiology, india]","[0.26735222222222227, 0.1263556666666667, -0.4713877777777778, 0.12336433333333334, 0.38066144444444444, 0.1495381111111111, -0.05365067777777781, -0.45607711111111104, 0.522403641111111, -0.20170766666666667, 0.17037688888888888, -0.09953077777777777, -0.1056943333333333, -0.34298, 0.19559116666666668, 0.3171564688888889, 0.15637926666666668, 0.16883233333333333, -0.2752677777777778, -0.00864044444444445, 0.31375322222222224, 0.025500888888888896, 0.13465855555555553, -0.08030177777777779, 0.09772866666666667, -1.5335244444444445, -0.5496866666666667, -0.02418655555555555, -0.08116377777777777, 0.06215588888888887, 3.0917703333333337, -0.04429, -0.19633677777777775, -0.5331522222222221, 0.010794125555555553, -0.051371522222222216, -0.19677566666666665, 0.262649, 0.16964566666666667, 0.11761744444444444, -0.27918511111111116, 0.06670253333333333, 0.10599711111111111, 0.009764666666666692, -0.059343333333333276, 0.41201999999999994, -0.00021323333333334027, 0.1760566666666667, 0.004..."
458916,59e38ecfecac0d30ff17,What is the Maharashtrian Panchkarma?,0,"[what, is, the, maharashtrian, panchkarma]","[maharashtrian, panchkarma]","[maharashtrian, panchkarma]","[0.4911175, 0.14684275, -0.7289025, -0.002835000000000011, 0.5833625, 0.31701325, -0.18990652500000002, -0.12743625000000003, -0.2273625575, -0.022062499999999992, 0.154975, 0.009442499999999993, 0.05002774999999998, 0.008412500000000031, 0.310776375, 0.200337, 0.22200999999999999, 0.32897999999999994, -0.1579, -0.07375499999999997, -0.29808775, -0.09314, -0.263776, 0.4692765, 0.2922925, -1.4767825, -0.7988175000000001, 0.21444525, 0.05886150000000001, -0.15591100000000002, 2.6559725000000003, -0.009847500000000009, -0.21625999999999995, -0.201905, -0.026730717499999994, -0.121178775, -0.09519, -0.03971925, 0.04669275, 0.003631750000000003, -0.25042575, 0.1369925, -0.22264450000000002, 0.211798, -0.140708, 0.22685999999999998, -0.132297275, 0.06205, 0.1790175, 0.04411]"
777189,983ba415e1b49aa501ef,What's the best alternative to Craigslist's casual encounters?,0,"[what, s, the, best, alternative, to, craigslist, s, casual, encounters]","[best, alternative, craigslist, casual, encounters]","[best, alternative, craigslist, casual, encounter]","[0.26287000000000005, 0.24856470000000003, -0.05383500000000001, 0.2535116, 0.37770000000000004, 0.10803951, -0.369654, -0.2999075, 0.099873177, 0.20097199999999998, -0.1618011, 0.03411259999999999, -0.277228, 0.0389016, 0.37124405, -0.011123100000000007, -0.08957260000000002, -0.11411669999999999, -0.24060699999999996, -0.3232131, 0.0831878, 0.2526446, -0.03518800000000001, 0.42797100000000005, 0.10403100000000001, -1.192477, -0.4633010000000001, -0.08083470000000001, 0.22203499999999998, -0.1351104, 2.515914, 0.10661400000000001, -0.30100866600000004, -0.18184399999999998, 0.03959891299999999, -0.286501429, -0.0841674, -0.1339672, -0.3354867, -0.4263293000000001, 0.153457, -0.03991760000000001, -0.0013958500000000027, 0.2794663, 0.06239179999999998, -0.0013390999999999763, 0.10113608999999998, 0.034419899999999996, 0.26385169999999997, 0.315264]"
106550,14df3cd54d22e3f7d989,Do left-leaning parties mostly win in the U.K. and the E.U. because of the voting rights extended to most immigrants?,1,"[do, left, leaning, parties, mostly, win, in, the, and, the, because, of, the, voting, rights, extended, to, most, immigrants]","[left, leaning, parties, mostly, win, voting, rights, extended, immigrants]","[left, leaning, party, mostly, win, voting, right, extended, immigrant]","[0.16921894736842105, 0.040345473684210514, -0.09927273684210525, -0.08561805263157896, 0.19972473684210526, 0.2122246315789474, -0.48332221052631585, 0.055205473684210554, -0.28147698368421054, -0.14968936842105263, -0.0028764105263157764, -0.34974242105263154, -0.15371484210526323, -0.02815231578947367, 0.23334671052631578, -0.0033001052631578967, 0.1637976842105263, -0.2856836157894737, -0.21952210526315788, -0.3368117894736843, -0.05661942105263157, 0.051310810526315795, 0.13674094736842102, 0.12000321052631578, -0.04337121052631581, -1.557054210526316, -0.2516126368421053, -0.12944405263157893, -0.011500894736842106, -0.12386026315789471, 3.3999894736842107, 0.26771911578947366, -0.4143708947368421, -0.44870905263157895, -0.09155924263157894, -0.16445562631578942, -0.06166684210526313, 0.07429142105263159, -0.11561926315789472, -0.062279736842105256, -0.34679715789473686, 0.048625125789473686, 0.08223668421052632, 0.22430173684210525, -0.3311406842105263, 0.0907856263157895, -..."


In [24]:
X = df_sample.vector.apply(lambda x : pd.Series(x))
X = X.set_index(df_sample.index)
X.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,40,41,42,43,44,45,46,47,48,49
1306057,0.27351,0.058767,0.056123,-0.375049,0.070485,0.153735,0.017307,-0.0153,-0.080012,0.577337,...,0.033266,0.339622,0.012549,0.67879,0.283331,0.319483,0.140716,0.180433,-0.010019,0.380157
1054851,0.267352,0.126356,-0.471388,0.123364,0.380661,0.149538,-0.053651,-0.456077,0.522404,-0.201708,...,-0.279185,0.066703,0.105997,0.009765,-0.059343,0.41202,-0.000213,0.176057,0.004581,-0.165927
458916,0.491117,0.146843,-0.728903,-0.002835,0.583363,0.317013,-0.189907,-0.127436,-0.227363,-0.022062,...,-0.250426,0.136992,-0.222645,0.211798,-0.140708,0.22686,-0.132297,0.06205,0.179017,0.04411
777189,0.26287,0.248565,-0.053835,0.253512,0.3777,0.10804,-0.369654,-0.299907,0.099873,0.200972,...,0.153457,-0.039918,-0.001396,0.279466,0.062392,-0.001339,0.101136,0.03442,0.263852,0.315264
106550,0.169219,0.040345,-0.099273,-0.085618,0.199725,0.212225,-0.483322,0.055205,-0.281477,-0.149689,...,-0.346797,0.048625,0.082237,0.224302,-0.331141,0.090786,-0.420932,-0.123158,-0.030519,-0.393429


In [25]:
y = df_sample.target
y.head()

1306057    0
1054851    0
458916     0
777189     0
106550     1
Name: target, dtype: int64

In [26]:
df_new = pd.concat([X, y], axis=1)
df_new = df_new.dropna()

## ML model

In [27]:
y = df_new.target
X = df_new.drop("target", axis=1)

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

In [28]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report

# TODO: Train the model
lr = LogisticRegression()
lr.fit(X_train, y_train)

LogisticRegression()

In [29]:
# TODO: Evaluate the model on the test set
y_pred = lr.predict(X_test)
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.95      0.99      0.97      1886
           1       0.48      0.13      0.21       114

    accuracy                           0.94      2000
   macro avg       0.72      0.56      0.59      2000
weighted avg       0.92      0.94      0.93      2000



## Can you explain why the performance for class 1 (insincere questions) is bad? Can we improve it?

In [None]:
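# The classes are heavily imbalanced (about 6% insincere questions), so recall on class 1 is low
# This is only a simple model, and part of the vocabulary is missing from the embeddings: improve the preprocessing
# Try multiple algorithms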
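
One way to act on the class imbalance is to reweight the classes and to choose the decision threshold that maximizes F1 on held-out data instead of using the default 0.5. The sketch below illustrates this on synthetic data from `make_classification` (an assumption standing in for the averaged GloVe vectors). On the real data the gain is not guaranteed, and the threshold should be tuned on a validation split rather than on the final test set.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the question vectors (about 6 % positives)
X_toy, y_toy = make_classification(
    n_samples=5000, n_features=50, n_informative=10,
    weights=[0.94, 0.06], random_state=0,
)

# stratify keeps the class ratio identical in both splits
X_tr, X_val, y_tr, y_val = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=0
)

# class_weight="balanced" penalizes errors on the rare class more heavily
clf = LogisticRegression(class_weight="balanced", max_iter=1000)
clf.fit(X_tr, y_tr)

# Scan thresholds on the predicted probability of class 1 and keep the best F1
proba = clf.predict_proba(X_val)[:, 1]
thresholds = np.arange(0.1, 0.91, 0.05)
scores = [f1_score(y_val, (proba >= t).astype(int), zero_division=0) for t in thresholds]
best_threshold = thresholds[int(np.argmax(scores))]

print(f"best threshold: {best_threshold:.2f}, F1: {max(scores):.3f}")
```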
