In [1]:
## 2.1 Introduction<a id='2.2_Introduction'></a>

This is an original dataset, made publicly available for researchers.

We collected 60,000 Stack Overflow questions from 2016-2020 and classified them into three categories:

HQ: High-quality posts with 30+ score and without a single edit.
LQ_EDIT: Low-quality posts with a negative score and with multiple community edits. However, they still remain open after the edits.
LQ_CLOSE: Low-quality posts that were closed by the community without a single edit.
Notes:

Questions are sorted according to Question Id.
Question body is in HTML format.
All dates are in UTC format.

## 2.2 Objectives

There are some fundamental questions to resolve in this notebook before you move on.

* Do you think you may have the data you need to tackle the desired question?
    * Have you identified the required target value?
    * Do you have potentially useful features?
* Do you have any fundamental issues with the data?

## 2.3 Imports <a id='2.3_Imports'></a>

In [2]:
#Code task 1#
#Import pandas, matplotlib.pyplot, and seaborn in the correct lines below
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import os
import numpy as np
import nltk
from bs4 import BeautifulSoup
from nltk.tokenize import sent_tokenize
from nltk.tokenize import word_tokenize
from textblob import TextBlob

## 2.4 Load The Stack Overflow Data<a id='2.5_Load_The_Stack_Overflow_Data'></a>

In [3]:
# the supplied CSV data file is the raw_data directory

# stack_data = pd.read_csv('https://www.kaggle.com/imoore/60k-stack-overflow-questions-with-quality-rate?select=data.csv', error_bad_lines=False)

stack_data = pd.read_csv('../850380_1463404_compressed_data.csv/data.csv')

In [4]:
#Code task 2#
#Call the info method on stack_data to see a summary of the data
stack_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 60197 entries, 0 to 60196
Data columns (total 6 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   Id            60171 non-null  object
 1   Title         60011 non-null  object
 2   Body          59999 non-null  object
 3   Tags          59998 non-null  object
 4   CreationDate  59998 non-null  object
 5   Y             59997 non-null  object
dtypes: object(6)
memory usage: 2.8+ MB


In [5]:
#Code task 3#
#Call the head method on ski_data to print the first several rows of the data
stack_data.head()

Unnamed: 0,Id,Title,Body,Tags,CreationDate,Y
0,34552656,Java: Repeat Task Every Random Seconds,<p>I'm already familiar with repeating tasks e...,<java><repeat>,1/1/2016 0:21,LQ_CLOSE
1,34552974,How to get all the child records from differen...,I am having 4 different tables like \r\nselect...,<sql><sql-server>,1/1/2016 1:44,LQ_EDIT
2,34553034,Why are Java Optionals immutable?,<p>I'd like to understand why Java 8 Optionals...,<java><optional>,1/1/2016 2:03,HQ
3,34553174,Text Overlay Image with Darkened Opacity React...,<p>I am attempting to overlay a title over an ...,<javascript><image><overlay><react-native><opa...,1/1/2016 2:48,HQ
4,34553318,Why ternary operator in swift is so picky?,"<p>The question is very simple, but I just cou...",<swift><operators><whitespace><ternary-operato...,1/1/2016 3:30,HQ


In [6]:
stack_data.isnull().sum()

Id               26
Title           186
Body            198
Tags            199
CreationDate    199
Y               200
dtype: int64

In [7]:
stack_data.dropna(inplace = True)
stack_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 59997 entries, 0 to 60196
Data columns (total 6 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   Id            59997 non-null  object
 1   Title         59997 non-null  object
 2   Body          59997 non-null  object
 3   Tags          59997 non-null  object
 4   CreationDate  59997 non-null  object
 5   Y             59997 non-null  object
dtypes: object(6)
memory usage: 3.2+ MB


In [8]:
stack_data['Id'].value_counts().head(25)

39573575    1
53309070    1
39909015    1
38638829    1
45805486    1
52251489    1
59097259    1
42027534    1
42174632    1
55230628    1
36822544    1
39092896    1
41308834    1
42291869    1
51905712    1
37745459    1
52953078    1
51135511    1
52135983    1
50835374    1
50441031    1
52243672    1
38360689    1
45412795    1
49466084    1
Name: Id, dtype: int64

In [9]:
stack_data[['Id', 'Title', 'Body']].nunique()

Id       59997
Title    59987
Body     59997
dtype: int64

In [10]:
stack_data['Title'].value_counts().head(8)

#NAME?                                                                                  6
Regular Expression                                                                      3
Regular expression                                                                      3
SyntaxError: Unexpected token }                                                         2
"Fatal error: Call to a member function query() on a non-object" Database connection    1
Webserver attack solutions?                                                             1
SSMS WORKSPACE SYNTAX ERRORS AFTER PASTING SQL STATEMENT COPIED FROM SUBLIME TEXT       1
Object.constructor() vs Array.prototype.join() in terms of performance                  1
Name: Title, dtype: int64

In [11]:
stack_data = stack_data[stack_data['Y'].notna()]

In [12]:
stack_data.isnull().sum()

Id              0
Title           0
Body            0
Tags            0
CreationDate    0
Y               0
dtype: int64

In [13]:
stack_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 59997 entries, 0 to 60196
Data columns (total 6 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   Id            59997 non-null  object
 1   Title         59997 non-null  object
 2   Body          59997 non-null  object
 3   Tags          59997 non-null  object
 4   CreationDate  59997 non-null  object
 5   Y             59997 non-null  object
dtypes: object(6)
memory usage: 3.2+ MB


In [14]:
stack_data['Y'].value_counts().head(8)

HQ          20000
LQ_CLOSE    19999
LQ_EDIT     19998
Name: Y, dtype: int64

In [15]:
stack_data[stack_data['Y'].apply(lambda x:x not in ['HQ', 'LQ_CLOSE', 'LQ_EDIT'])]

Unnamed: 0,Id,Title,Body,Tags,CreationDate,Y


In [16]:
stack_data.isnull().sum()

Id              0
Title           0
Body            0
Tags            0
CreationDate    0
Y               0
dtype: int64

In [17]:
print(len(stack_data))

59997


In [18]:
stack_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 59997 entries, 0 to 60196
Data columns (total 6 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   Id            59997 non-null  object
 1   Title         59997 non-null  object
 2   Body          59997 non-null  object
 3   Tags          59997 non-null  object
 4   CreationDate  59997 non-null  object
 5   Y             59997 non-null  object
dtypes: object(6)
memory usage: 3.2+ MB


In [19]:
stack_data.dropna()

Unnamed: 0,Id,Title,Body,Tags,CreationDate,Y
0,34552656,Java: Repeat Task Every Random Seconds,<p>I'm already familiar with repeating tasks e...,<java><repeat>,1/1/2016 0:21,LQ_CLOSE
1,34552974,How to get all the child records from differen...,I am having 4 different tables like \r\nselect...,<sql><sql-server>,1/1/2016 1:44,LQ_EDIT
2,34553034,Why are Java Optionals immutable?,<p>I'd like to understand why Java 8 Optionals...,<java><optional>,1/1/2016 2:03,HQ
3,34553174,Text Overlay Image with Darkened Opacity React...,<p>I am attempting to overlay a title over an ...,<javascript><image><overlay><react-native><opa...,1/1/2016 2:48,HQ
4,34553318,Why ternary operator in swift is so picky?,"<p>The question is very simple, but I just cou...",<swift><operators><whitespace><ternary-operato...,1/1/2016 3:30,HQ
...,...,...,...,...,...,...
60192,60467932,C++ The correct way to multiply an integer and...,<p>I try to multiply an integer by a double bu...,<c++>,2/29/2020 17:46,LQ_CLOSE
60193,60468018,How can I make a c# application outside of vis...,<p>I'm very new to programming and I'm teachin...,<c#><visual-studio>,2/29/2020 17:55,LQ_CLOSE
60194,60468378,WHY DJANGO IS SHOWING ME THIS ERROR WHEN I TRY...,*URLS.PY*\r\n //URLS.PY FILE\r\n fro...,<django><django-views><django-templates>,2/29/2020 18:35,LQ_EDIT
60195,60469392,PHP - getting the content of php page,<p>I have a controller inside which a server i...,<javascript><php><html>,2/29/2020 20:32,LQ_CLOSE


In [20]:
# remove punctuations and beautify the strings in title and body column
# stack_data['Body'] = stack_data['Body'].str.replace(r'[^\w\s]+', '')
stack_data['Body'] = [BeautifulSoup(text).get_text() for text in stack_data['Body']]
stack_data['Title'] = [BeautifulSoup(text).get_text() for text in stack_data['Title']]

In [21]:
stack_data.head()

Unnamed: 0,Id,Title,Body,Tags,CreationDate,Y
0,34552656,Java: Repeat Task Every Random Seconds,I'm already familiar with repeating tasks ever...,<java><repeat>,1/1/2016 0:21,LQ_CLOSE
1,34552974,How to get all the child records from differen...,I am having 4 different tables like \r\nselect...,<sql><sql-server>,1/1/2016 1:44,LQ_EDIT
2,34553034,Why are Java Optionals immutable?,I'd like to understand why Java 8 Optionals we...,<java><optional>,1/1/2016 2:03,HQ
3,34553174,Text Overlay Image with Darkened Opacity React...,I am attempting to overlay a title over an ima...,<javascript><image><overlay><react-native><opa...,1/1/2016 2:48,HQ
4,34553318,Why ternary operator in swift is so picky?,"The question is very simple, but I just could ...",<swift><operators><whitespace><ternary-operato...,1/1/2016 3:30,HQ


In [22]:
# stack_data['Y_clean'] = stack_data['Y'].str.replace('LQ_CLOSE', 'LQ')
# stack_data['Y_clean'] = stack_data['Y'].str.replace('LQ_EDIT', 'LQ')
stack_data['Y_clean'] = stack_data['Y'].map(lambda x: "LQ" if x in ['LQ_CLOSE', 'LQ_EDIT'] else x)
stack_data.head()

Unnamed: 0,Id,Title,Body,Tags,CreationDate,Y,Y_clean
0,34552656,Java: Repeat Task Every Random Seconds,I'm already familiar with repeating tasks ever...,<java><repeat>,1/1/2016 0:21,LQ_CLOSE,LQ
1,34552974,How to get all the child records from differen...,I am having 4 different tables like \r\nselect...,<sql><sql-server>,1/1/2016 1:44,LQ_EDIT,LQ
2,34553034,Why are Java Optionals immutable?,I'd like to understand why Java 8 Optionals we...,<java><optional>,1/1/2016 2:03,HQ,HQ
3,34553174,Text Overlay Image with Darkened Opacity React...,I am attempting to overlay a title over an ima...,<javascript><image><overlay><react-native><opa...,1/1/2016 2:48,HQ,HQ
4,34553318,Why ternary operator in swift is so picky?,"The question is very simple, but I just could ...",<swift><operators><whitespace><ternary-operato...,1/1/2016 3:30,HQ,HQ


In [23]:
# extract words from the body column
# nltk.download('punkt')
words_body = stack_data['Body'].apply(lambda text: [sent for sent in sent_tokenize(text)
                                       if any(True for w in word_tokenize(sent))])
# if w.lower() in searched_words

In [24]:
words_body

0        [I'm already familiar with repeating tasks eve...
1        [I am having 4 different tables like \r\nselec...
2        [I'd like to understand why Java 8 Optionals w...
3        [I am attempting to overlay a title over an im...
4        [The question is very simple, but I just could...
                               ...                        
60192    [I try to multiply an integer by a double but ...
60193    [I'm very new to programming and I'm teaching ...
60194    [*URLS.PY*\r\n    //URLS.PY FILE\r\n    from d...
60195    [I have a controller inside which a server is ...
60196    [So i was recently helping someone out with so...
Name: Body, Length: 59997, dtype: object

In [25]:
len(words_body)

59997

In [26]:
# count the number of words
stack_data['word_count_Body'] = stack_data['Body'].apply(lambda x: len(str(x).split(" ")))
stack_data['word_count_Title'] = stack_data['Title'].apply(lambda x: len(str(x).split(" ")))
stack_data[['Title', 'Body', 'word_count_Body', 'word_count_Title']].head()

Unnamed: 0,Title,Body,word_count_Body,word_count_Title
0,Java: Repeat Task Every Random Seconds,I'm already familiar with repeating tasks ever...,55,6
1,How to get all the child records from differen...,I am having 4 different tables like \r\nselect...,105,18
2,Why are Java Optionals immutable?,I'd like to understand why Java 8 Optionals we...,19,5
3,Text Overlay Image with Darkened Opacity React...,I am attempting to overlay a title over an ima...,1233,8
4,Why ternary operator in swift is so picky?,"The question is very simple, but I just could ...",91,8


In [27]:
# count the number of characters
stack_data['char_count_Body'] = stack_data['Body'].str.len() ## this also includes spaces
stack_data['char_count_Title'] = stack_data['Title'].str.len() ## this also includes spaces
stack_data[['Title', 'Body', 'char_count_Body', 'char_count_Title']].head()

Unnamed: 0,Title,Body,char_count_Body,char_count_Title
0,Java: Repeat Task Every Random Seconds,I'm already familiar with repeating tasks ever...,306,38
1,How to get all the child records from differen...,I am having 4 different tables like \r\nselect...,573,93
2,Why are Java Optionals immutable?,I'd like to understand why Java 8 Optionals we...,106,33
3,Text Overlay Image with Darkened Opacity React...,I am attempting to overlay a title over an ima...,4509,53
4,Why ternary operator in swift is so picky?,"The question is very simple, but I just could ...",498,42


In [28]:
# count the number of stop words
# nltk.download('stopwords')
from nltk.corpus import stopwords
stop = stopwords.words('english')

stack_data['stopwords_Body'] = stack_data['Body'].apply(lambda x: len([x for x in x.split() if x in stop]))
stack_data['stopwords_Title'] = stack_data['Title'].apply(lambda x: len([x for x in x.split() if x in stop]))
stack_data[['Title', 'Body', 'stopwords_Title', 'stopwords_Body']].head()

Unnamed: 0,Title,Body,stopwords_Title,stopwords_Body
0,Java: Repeat Task Every Random Seconds,I'm already familiar with repeating tasks ever...,0,18
1,How to get all the child records from differen...,I am having 4 different tables like \r\nselect...,6,47
2,Why are Java Optionals immutable?,I'd like to understand why Java 8 Optionals we...,1,8
3,Text Overlay Image with Darkened Opacity React...,I am attempting to overlay a title over an ima...,1,54
4,Why ternary operator in swift is so picky?,"The question is very simple, but I just could ...",3,31


In [29]:
# count the number of numerics
stack_data['numerics'] = stack_data['Body'].apply(lambda x: len([x for x in x.split() if x.isdigit()]))
stack_data[['Body','numerics']].head()

Unnamed: 0,Body,numerics
0,I'm already familiar with repeating tasks ever...,0
1,I am having 4 different tables like \r\nselect...,1
2,I'd like to understand why Java 8 Optionals we...,1
3,I am attempting to overlay a title over an ima...,1
4,"The question is very simple, but I just could ...",3


In [30]:
# make text to lowercase
stack_data['Body'] = stack_data['Body'].apply(lambda x: " ".join(x.lower() for x in x.split()))
stack_data['Title'] = stack_data['Title'].apply(lambda x: " ".join(x.lower() for x in x.split()))
stack_data[['Title', 'Body']].head()

Unnamed: 0,Title,Body
0,java: repeat task every random seconds,i'm already familiar with repeating tasks ever...
1,how to get all the child records from differen...,i am having 4 different tables like select * f...
2,why are java optionals immutable?,i'd like to understand why java 8 optionals we...
3,text overlay image with darkened opacity react...,i am attempting to overlay a title over an ima...
4,why ternary operator in swift is so picky?,"the question is very simple, but i just could ..."


In [31]:
# removing punctuations
stack_data['Body'] = stack_data['Body'].str.replace('[^\w\s]','')
stack_data['Title'] = stack_data['Title'].str.replace('[^\w\s]','')
stack_data[['Title', 'Body']].head()

Unnamed: 0,Title,Body
0,java repeat task every random seconds,im already familiar with repeating tasks every...
1,how to get all the child records from differen...,i am having 4 different tables like select fr...
2,why are java optionals immutable,id like to understand why java 8 optionals wer...
3,text overlay image with darkened opacity react...,i am attempting to overlay a title over an ima...
4,why ternary operator in swift is so picky,the question is very simple but i just could n...


In [32]:
# removal of stop words

from nltk.corpus import stopwords
stop = stopwords.words('english')
stack_data['Body'] = stack_data['Body'].apply(lambda x: " ".join(x for x in x.split() if x not in stop))
stack_data['Title'] = stack_data['Title'].apply(lambda x: " ".join(x for x in x.split() if x not in stop))
stack_data[['Title', 'Body']].head()

Unnamed: 0,Title,Body
0,java repeat task every random seconds,im already familiar repeating tasks every n se...
1,get child records different tables based given...,4 different tables like select system select s...
2,java optionals immutable,id like understand java 8 optionals designed i...
3,text overlay image darkened opacity react native,attempting overlay title image image darkened ...
4,ternary operator swift picky,question simple could find answer doesnt retur...


In [33]:
# common words
freq_Title = pd.Series(' '.join(stack_data['Title']).split()).value_counts()[:10]
print('Freq_Title', freq_Title)
freq_Body = pd.Series(' '.join(stack_data['Body']).split()).value_counts()[:10]
print('Freq_Body', freq_Body)


Freq_Title using      4110
c          3772
python     3182
error      3016
string     2560
array      2539
file       2448
get        2437
android    2418
code       2356
dtype: int64
Freq_Body 1         28802
code      25057
0         24958
new       23171
using     20783
int       20195
like      19858
error     19805
string    19578
public    19519
dtype: int64


In [34]:
# common words removal

freq_Title = list(freq_Title.index)
stack_data['Title'] = stack_data['Title'].apply(lambda x: " ".join(x for x in x.split() if x not in freq_Title))
print(stack_data['Title'].head())

freq_Body = list(freq_Body.index)
stack_data['Body'] = stack_data['Body'].apply(lambda x: " ".join(x for x in x.split() if x not in freq_Body))
print(stack_data['Body'].head())

0                java repeat task every random seconds
1    child records different tables based given par...
2                             java optionals immutable
3     text overlay image darkened opacity react native
4                         ternary operator swift picky
Name: Title, dtype: object
0    im already familiar repeating tasks every n se...
1    4 different tables select system select set se...
2    id understand java 8 optionals designed immuta...
3    attempting overlay title image image darkened ...
4    question simple could find answer doesnt retur...
Name: Body, dtype: object


In [35]:
# rare words

rare_Title = pd.Series(' '.join(stack_data['Title']).split()).value_counts()[-10:]
print(rare_Title.head())
rare_Body = pd.Series(' '.join(stack_data['Body']).split()).value_counts()[-10:]
print(rare_Body.head())

refreshtoken    1
htmllayout      1
htmlxml         1
pnotify         1
heigth          1
dtype: int64
comandroidsupportdesign2700       1
basicmoviesusecasesmoduleclass    1
bluecolored                       1
setimageurl4string                1
nostriroi                         1
dtype: int64


In [36]:
# rare words removal

rare_Title = list(rare_Title.index)
stack_data['Title'] = stack_data['Title'].apply(lambda x: " ".join(x for x in x.split() if x not in rare_Title))
print(stack_data['Title'].head())

rare_Body = list(rare_Body.index)
stack_data['Body'] = stack_data['Body'].apply(lambda x: " ".join(x for x in x.split() if x not in rare_Body))
print(stack_data['Body'].head())

0                java repeat task every random seconds
1    child records different tables based given par...
2                             java optionals immutable
3     text overlay image darkened opacity react native
4                         ternary operator swift picky
Name: Title, dtype: object
0    im already familiar repeating tasks every n se...
1    4 different tables select system select set se...
2    id understand java 8 optionals designed immuta...
3    attempting overlay title image image darkened ...
4    question simple could find answer doesnt retur...
Name: Body, dtype: object


In [37]:
# Lemmatization
# Lemmatization is a more effective option than stemming because it converts the word into its root word,
#  rather than just stripping the suffices.
# It makes use of the vocabulary and does a morphological analysis to obtain the root word.
# Therefore, we usually prefer using lemmatization over stemming.

from textblob import Word
stack_data['Title'] = stack_data['Title'].apply(lambda x: " ".join([Word(word).lemmatize() for word in x.split()]))
print(stack_data['Title'].head())
stack_data['Body'] = stack_data['Body'].apply(lambda x: " ".join([Word(word).lemmatize() for word in x.split()]))
print(stack_data['Body'].head())

0                 java repeat task every random second
1    child record different table based given paren...
2                             java optionals immutable
3     text overlay image darkened opacity react native
4                         ternary operator swift picky
Name: Title, dtype: object
0    im already familiar repeating task every n sec...
1    4 different table select system select set sel...
2    id understand java 8 optionals designed immuta...
3    attempting overlay title image image darkened ...
4    question simple could find answer doesnt retur...
Name: Body, dtype: object


In [38]:
#  N-grams
TextBlob(stack_data['Body'][0]).ngrams(2)

[WordList(['im', 'already']),
 WordList(['already', 'familiar']),
 WordList(['familiar', 'repeating']),
 WordList(['repeating', 'task']),
 WordList(['task', 'every']),
 WordList(['every', 'n']),
 WordList(['n', 'second']),
 WordList(['second', 'javautiltimer']),
 WordList(['javautiltimer', 'javautiltimertask']),
 WordList(['javautiltimertask', 'let']),
 WordList(['let', 'say']),
 WordList(['say', 'want']),
 WordList(['want', 'print']),
 WordList(['print', 'hello']),
 WordList(['hello', 'world']),
 WordList(['world', 'console']),
 WordList(['console', 'every']),
 WordList(['every', 'random']),
 WordList(['random', 'second']),
 WordList(['second', '15']),
 WordList(['15', 'unfortunately']),
 WordList(['unfortunately', 'im']),
 WordList(['im', 'bit']),
 WordList(['bit', 'rush']),
 WordList(['rush', 'dont']),
 WordList(['dont', 'show']),
 WordList(['show', 'far']),
 WordList(['far', 'help']),
 WordList(['help', 'would']),
 WordList(['would', 'apriciated'])]

In [39]:
# term frequency
# Term frequency is simply the ratio of the count of a word present in a sentence, to the length of the sentence.

tf1 = (stack_data['Body'][0:100]).apply(lambda x: pd.value_counts(x.split(" "))).sum(axis = 0)
tf1.columns = ['words','tf']
tf1.head(30)

# NOTE: perform it in a better way/approach by using .extend keyword. Idea-> change the column in the list of words
# and perform the above action.

im                   32.0
second                6.0
every                 6.0
world                 1.0
hello                 6.0
15                    2.0
unfortunately         2.0
would                14.0
print                15.0
rush                  1.0
far                   2.0
apriciated            1.0
javautiltimer         1.0
bit                   5.0
say                   8.0
want                 28.0
already               5.0
n                     9.0
repeating             1.0
familiar              1.0
javautiltimertask     1.0
let                  13.0
task                  2.0
dont                 15.0
show                  8.0
random                1.0
console               1.0
help                 22.0
set                  27.0
item                 10.0
dtype: float64

In [40]:
tf1.columns
type(tf1)

pandas.core.series.Series

In [41]:
# Inverse Document Frequency
# The intuition behind inverse document frequency (IDF) is that
# a word is not of much use to us if it’s appearing in all the documents.

# for i,row in enumerate(tf1):
#   tf1.loc[i, 'idf'] = np.log(stack_data.shape[0]/(len(stack_data[stack_data['Body'].str.contains(row)])))
#     print(row);
# tf1

from sklearn.feature_extraction.text import TfidfVectorizer
corpus = stack_data['Body']
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)
print(vectorizer.get_feature_names())
print(X.shape)

IOPub data rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_data_rate_limit`.

Current values:
NotebookApp.iopub_data_rate_limit=1000000.0 (bytes/sec)
NotebookApp.rate_limit_window=3.0 (secs)



In [42]:
# Bag of Words
#  Bag of Words (BoW) refers to the representation of text which describes the presence of words within the text data.
# The intuition behind this is that two similar text fields will contain similar kind of words,
# and will therefore have a similar bag of words.

from sklearn.feature_extraction.text import CountVectorizer
bow = CountVectorizer(max_features=1000, lowercase=True, ngram_range=(1,1),analyzer = "word")
stack_data_bow = bow.fit_transform(stack_data['Body'])
stack_data_bow

<59997x1000 sparse matrix of type '<class 'numpy.int64'>'
	with 1272278 stored elements in Compressed Sparse Row format>

In [43]:
# Sentiment Analysis
# If you recall, our problem was to detect the sentiment of the tweet.
# So, before applying any ML/DL models (which can have a separate feature detecting the sentiment using the textblob library).

stack_data['Body'][:5].apply(lambda x: TextBlob(x).sentiment)

0                    (-0.08750000000000001, 0.5)
1    (-0.03571428571428571, 0.40714285714285714)
2                                     (0.0, 0.0)
3      (0.06360544217687075, 0.3096371882086168)
4     (-0.08779761904761905, 0.4467261904761905)
Name: Body, dtype: object

__A:__ Above, you can see that it returns a tuple representing polarity and subjectivity of each tweet. Here, we only extract polarity as it indicates the sentiment as value nearer to 1 means a positive sentiment and values nearer to -1 means a negative sentiment. This can also work as a feature for building a machine learning model.

In [44]:
stack_data['sentiment'] = stack_data['Body'].apply(lambda x: TextBlob(x).sentiment[0] )
stack_data[['Body','sentiment']].head()

Unnamed: 0,Body,sentiment
0,im already familiar repeating task every n sec...,-0.0875
1,4 different table select system select set sel...,-0.035714
2,id understand java 8 optionals designed immuta...,0.0
3,attempting overlay title image image darkened ...,0.063605
4,question simple could find answer doesnt retur...,-0.087798


In [45]:
noun_phrases_list = []
for i, row in stack_data.iterrows():
    text = row['Body']
    blob = TextBlob(text)
#     print(row)
    noun_phrase = blob.noun_phrases
#     print(text)
    noun_phrases_list.extend(noun_phrase)
    if i >= 5:
        break
print(noun_phrases_list)
# do pre-processing to remove stars and unwanted characters for every text


['javautiltimer javautiltimertask', 'print hello world console', 'im bit rush dont show', 'different table', 'select system', 'select set', 'select item', 'select version system id n noof', 'qill n item item n noof version system n', 'n item item n version', 'systemid retrieve record', 'version item single storedprocedure', 'immutable threadsafety', 'overlay title image image', 'opacity effect', 'dim fix look custom component article preview image row article preview component component article preview', 'touchable image', 'r feed api user keyword interest parse homejs parse db need', 'heart parse db need', 'press google news var', 'requirereactnative var view stylesheet text image touchablehighlight', 'dimension var dimension requiredimensions var window dimensionsgetwindow var imagebutton requirecommonimagebutton var keywordbox requireauthenticationonboardingkeywordbox', 'additional library moduleexports reactcreateclass onpress function trigger button', 'thispropstext property', 'so

In [46]:
df = pd.DataFrame(stack_data)
df.reset_index(inplace = True, drop = True)
df

Unnamed: 0,Id,Title,Body,Tags,CreationDate,Y,Y_clean,word_count_Body,word_count_Title,char_count_Body,char_count_Title,stopwords_Body,stopwords_Title,numerics,sentiment
0,34552656,java repeat task every random second,im already familiar repeating task every n sec...,<java><repeat>,1/1/2016 0:21,LQ_CLOSE,LQ,55,6,306,38,18,0,0,-0.087500
1,34552974,child record different table based given paren...,4 different table select system select set sel...,<sql><sql-server>,1/1/2016 1:44,LQ_EDIT,LQ,105,18,573,93,47,6,1,-0.035714
2,34553034,java optionals immutable,id understand java 8 optionals designed immuta...,<java><optional>,1/1/2016 2:03,HQ,HQ,19,5,106,33,8,1,1,0.000000
3,34553174,text overlay image darkened opacity react native,attempting overlay title image image darkened ...,<javascript><image><overlay><react-native><opa...,1/1/2016 2:48,HQ,HQ,1233,8,4509,53,54,1,1,0.063605
4,34553318,ternary operator swift picky,question simple could find answer doesnt retur...,<swift><operators><whitespace><ternary-operato...,1/1/2016 3:30,HQ,HQ,91,8,498,42,31,3,3,-0.087798
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
59992,60467932,correct way multiply integer double,try multiply integer double obtain wrong resul...,<c++>,2/29/2020 17:46,LQ_CLOSE,LQ,151,11,762,55,28,4,3,-0.062500
59993,60468018,make application outside visual studio,im programming im teaching made calculator cal...,<c#><visual-studio>,2/29/2020 17:55,LQ_CLOSE,LQ,108,11,558,57,55,3,0,0.187302
59994,60468378,django showing try open new page hyperlink,urlspy urlspy file djangocontrib import admin ...,<django><django-views><django-templates>,2/29/2020 18:35,LQ_EDIT,LQ,249,16,1128,78,20,0,0,0.000000
59995,60469392,php getting content php page,controller inside server connected network sea...,<javascript><php><html>,2/29/2020 20:32,LQ_CLOSE,LQ,95,8,574,37,25,2,0,0.000000


In [47]:
df.to_csv('formatted_stack_data.csv')

In [48]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

import re
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

def clean_tags(T):
    T=str(T).lower()
    text=re.sub(r'<','',T)
    text=re.sub(r'>',' ',text)
    return text


def clean_body(x):
    x=str(x).lower()
    x=re.sub(r'[^(a-zA-Z)\s]','', x)
    return x


if __name__ == '__main__':
    train_df = pd.read_csv('C:/Users/opa/Downloads/archive/train.csv')
    print(train_df.columns)

    valid_df = pd.read_csv('C:/Users/opa/Downloads/archive/valid.csv')
    print(valid_df.columns)
    #lb = LabelEncoder()
    #new_data = lb.fit_transform(train_df.CreationDate)
    #print(new_data)

    train_df.drop(['Id', 'CreationDate'], axis=1, inplace=True)
    valid_df.drop(['Id', 'CreationDate'], axis=1, inplace=True)
    print(train_df.head())

    train_df['Y'] = train_df.Y.map({'LQ_CLOSE': 0, 'LQ_EDIT': 1, 'HQ': 2})
    valid_df['Y'] = valid_df.Y.map({'LQ_CLOSE': 0, 'LQ_EDIT': 1, 'HQ': 2})

    train_df['Tags'] = train_df['Tags'].map(clean_tags)
    valid_df['Tags'] = valid_df['Tags'].map(clean_tags)

    train_df['Body'] = train_df.Body.map(clean_body)
    valid_df['Body'] = valid_df.Body.map(clean_body)

    train_df['Title'] = train_df.Title.map(clean_body)
    valid_df['Title'] = valid_df.Title.map(clean_body)

    train_df['text'] = train_df['Title'] + ' ' + train_df['Body'] + ' ' + train_df['Tags']
    valid_df['text'] = valid_df['Title'] + ' ' + valid_df['Body'] + ' ' + valid_df['Tags']

    tfidf = TfidfVectorizer()
    transform_text_train = tfidf.fit_transform(train_df.text)
    transform_text_test = tfidf.transform(valid_df.text)

    train_y = train_df['Y']
    valid_y = valid_df['Y']

    lr_classifier = LogisticRegression(C=1.)
    lr_classifier.fit(transform_text_train, train_y)

    print(f"Validation Accuracy of Logsitic Regression Classifier is: {(lr_classifier.score(transform_text_test, valid_y)) * 100:.2f}%")
    pred = lr_classifier.predict(transform_text_test)

    print(pred)
    # target_names = ['LQ', 'HQ']
    print(classification_report(valid_y, pred))

Index(['Id', 'Title', 'Body', 'Tags', 'CreationDate', 'Y'], dtype='object')
Index(['Id', 'Title', 'Body', 'Tags', 'CreationDate', 'Y'], dtype='object')
                                               Title  \
0             Java: Repeat Task Every Random Seconds   
1                  Why are Java Optionals immutable?   
2  Text Overlay Image with Darkened Opacity React...   
3         Why ternary operator in swift is so picky?   
4                 hide/show fab with scale animation   

                                                Body  \
0  <p>I'm already familiar with repeating tasks e...   
1  <p>I'd like to understand why Java 8 Optionals...   
2  <p>I am attempting to overlay a title over an ...   
3  <p>The question is very simple, but I just cou...   
4  <p>I'm using custom floatingactionmenu. I need...   

                                                Tags         Y  
0                                     <java><repeat>  LQ_CLOSE  
1                                   <java><o

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


In [49]:
# save and load the model
import joblib
filename = 'finalized_model.sav'
joblib.dump(lr_classifier, filename)
 
# some time later...
 
# load the model from disk
loaded_model = joblib.load(filename)
result = loaded_model.score(transform_text_test, valid_y)
print(result)

0.8820666666666667


## Naive Bayes Algorithm
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import GaussianNB

vectorizer = CountVectorizer(max_features=2000)
x = vectorizer.fit_transform(train_df.text)
x_test = vectorizer.fit_transform(valid_df.text)

classifier = GaussianNB()
classifier.fit(x.toarray(), train_y)

y_pred = classifier.predict(x_test.toarray())

from sklearn import metrics
cm = metrics.confusion_matrix(valid_y, y_pred) 
print(cm)
pred = lr_classifier.predict(x_test.toarray())
print(classification_report(valid_y, pred))
# accuracy = metrics.accuracy_score(valid_y, y_pred) 
# print("Accuracy score:",accuracy)
# precision = metrics.precision_score(valid_y, y_pred)
# print("Precision score:",precision)
# recall = metrics.recall_score(valid_y, y_pred) 
# print("Recall score:",recall)

In [51]:
#Random Forest Classifier
from sklearn.ensemble import RandomForestClassifier
classifier = RandomForestClassifier(n_estimators=10, criterion="entropy", random_state=0)
classifier.fit(transform_text_train, train_y)
print(classifier.predict(transform_text_test))
pred = classifier.predict(transform_text_test)
print(classification_report(valid_y, pred))

[0 1 0 ... 1 0 2]
              precision    recall  f1-score   support

           0       0.61      0.68      0.64      5000
           1       0.76      0.81      0.78      5000
           2       0.77      0.63      0.69      5000

    accuracy                           0.71     15000
   macro avg       0.71      0.71      0.71     15000
weighted avg       0.71      0.71      0.71     15000



In [52]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

import re

from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import classification_report

def clean_tags(T):
    T=T.lower()
    text=re.sub(r'<','',T)
    text=re.sub(r'>',' ',text)
    return text


def clean_body(x):
    x=x.lower()
    x=re.sub(r'[^(a-zA-Z)\s]','', x)
    return x


if __name__ == '__main__':
    train_df = pd.read_csv('C:/Users/opa/Downloads/archive/train.csv')
    print(train_df.columns)

    valid_df = pd.read_csv('C:/Users/opa/Downloads/archive/valid.csv')
    print(valid_df.columns)
    #lb = LabelEncoder()
    #new_data = lb.fit_transform(train_df.CreationDate)
    #print(new_data)

    train_df.drop(['Id', 'CreationDate'], axis=1, inplace=True)
    valid_df.drop(['Id', 'CreationDate'], axis=1, inplace=True)
    print(train_df.head())

    train_df['Y'] = train_df.Y.map({'LQ_CLOSE': 0, 'LQ_EDIT': 1, 'HQ': 2})
    valid_df['Y'] = valid_df.Y.map({'LQ_CLOSE': 0, 'LQ_EDIT': 1, 'HQ': 2})

    train_df['Tags'] = train_df['Tags'].map(clean_tags)
    valid_df['Tags'] = valid_df['Tags'].map(clean_tags)

    train_df['Body'] = train_df.Body.map(clean_body)
    valid_df['Body'] = valid_df.Body.map(clean_body)

    train_df['Title'] = train_df.Title.map(clean_body)
    valid_df['Title'] = valid_df.Title.map(clean_body)

    train_df['text'] = train_df['Title'] + ' ' + train_df['Body'] + ' ' + train_df['Tags']
    valid_df['text'] = valid_df['Title'] + ' ' + valid_df['Body'] + ' ' + valid_df['Tags']

    tfidf = TfidfVectorizer(max_features=2000)
    transform_text_train = tfidf.fit_transform(train_df.text).toarray()
    transform_text_test = tfidf.transform(valid_df.text).toarray()

    train_y = train_df['Y']
    valid_y = valid_df['Y']

    lr_classifier = GaussianNB()#LogisticRegression(C=1.)
    lr_classifier.fit(transform_text_train, train_y)

    print(f"Validation Accuracy of Logsiti Regression Classifier is: {(lr_classifier.score(transform_text_test, valid_y)) * 100:.2f}%")
    pred = lr_classifier.predict(transform_text_test)

    print(pred)
    # target_names = ['LQ', 'HQ']
    print(classification_report(valid_y, pred))

Index(['Id', 'Title', 'Body', 'Tags', 'CreationDate', 'Y'], dtype='object')
Index(['Id', 'Title', 'Body', 'Tags', 'CreationDate', 'Y'], dtype='object')
                                               Title  \
0             Java: Repeat Task Every Random Seconds   
1                  Why are Java Optionals immutable?   
2  Text Overlay Image with Darkened Opacity React...   
3         Why ternary operator in swift is so picky?   
4                 hide/show fab with scale animation   

                                                Body  \
0  <p>I'm already familiar with repeating tasks e...   
1  <p>I'd like to understand why Java 8 Optionals...   
2  <p>I am attempting to overlay a title over an ...   
3  <p>The question is very simple, but I just cou...   
4  <p>I'm using custom floatingactionmenu. I need...   

                                                Tags         Y  
0                                     <java><repeat>  LQ_CLOSE  
1                                   <java><o

# Summary

#### DATA:
Stackoverflow is one of the best website on question and answer for professional and enthusiast programmer. The data has been taken from Kaggle inorder to analyze the quality rate of 60k stack-overflow questions.
As per the data table, 4 of the columns i.e. title, body, tags and quality can be taken into account to apply the machine learning algorythms. I chose to work on title, body and quality column for applying the algorythms, as these are the most informative columns w.r.t text and the quality ratings provided initially.

#### DATA CLEANING:
1. Problem 1-> The datasets in title and body is completely user-entered information:
So, there are many empty title, body and ratings. The dataset has null values which we need to remove so that we can achieve the equal distribution of data through out the columns.
2. Problem 2-> LQ_CLOSE and LQ_EDIT quality rating less significant:
Then we can tag the LQ_CLOSE and LQ_EDIT with a single term LQ to bind the data and can differentiate between the HQ and LQ with more accuracy.
3. Problem 3-> special characters, stop words, least significant words/rare words, punctuations, root words
Analysing the data more closely needs to remove the use of special characters and unwanted symbols inbetween or we can say beautification of text. 

#### FEATURE ENGINEERING:
Find out the bag of words and the noun phrases for the sentences in the body and title column.

##### NOTE: The overall logic is that after rectifying the above 3 problems we can easily figure out the type of words used for HQ and LQ. And then by finding out the bag of words responsible for HQ and LQ will be helpful in analysing the questions with what kind of words or phrases are included in HQ and LQ.

#### MACHINE LEARNING ALGORITHMS:
As the dataset is completely on to NLP, I tested the dataset on the 4 different algorithms provided, and on that tfidf vectorizer using logarithmic regression performed the best. This gives the high accuracy rate with a balanced precision and recall.

#### PREDICTION:
Using the saved model for tdidf vectorizer, one can use the same flavour dataset to figure out the acuracy of the model and can distinguish the words/phrases responsible for high quality or low quality.

#### FUTURE WORK:
1. In future I would love to spend more time on the Tags/labels column to figure out the predicted labels with accuracy. That can be a chosen idea behind categorising the quality ratings. The words responsible for HQ, LQ_CLOSE and LQ_EDIT can be easily figured out by this.

2. Due to RAM constraints on local machine, I had to train 2000 features/sample out of 337250. Without resource limitations, I would love to train on the full dataset.