## 2.1 Introduction<a id='2.2_Introduction'></a>

This is an original dataset, made publicly available for researchers.

We collected 60,000 Stack Overflow questions from 2016-2020 and classified them into three categories:

* HQ: High-quality posts with a score of 30+ and without a single edit.
* LQ_EDIT: Low-quality posts with a negative score and multiple community edits. However, they remain open after the edits.
* LQ_CLOSE: Low-quality posts that were closed by the community without a single edit.

Notes:

* Questions are sorted according to Question Id.
* Question body is in HTML format.
* All dates are in UTC format.
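
The notes above imply two preprocessing steps: extracting plain text from the HTML body and parsing the dates as UTC timestamps. The following is a minimal sketch on a made-up one-row frame. The helper `strip_html`, the column `BodyText`, and the date format string are assumptions for illustration and are not part of the dataset or the original notebook.

```python
from html.parser import HTMLParser

import pandas as pd


class _TextExtractor(HTMLParser):
    """Collects the text nodes of an HTML fragment, dropping the tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def strip_html(fragment):
    """Return the visible text of an HTML fragment (hypothetical helper)."""
    parser = _TextExtractor()
    parser.feed(fragment)
    parser.close()
    return "".join(parser.chunks).strip()


# Made-up sample that mimics the Body and CreationDate columns.
sample = pd.DataFrame(
    {
        "Body": ["<p>How do I <b>repeat</b> a task?</p>"],
        "CreationDate": ["1/1/2016 0:21"],
    }
)

sample["BodyText"] = sample["Body"].apply(strip_html)

# Assumed format: month/day/year hour:minute; utc=True makes the result tz-aware.
sample["CreationDate"] = pd.to_datetime(
    sample["CreationDate"], format="%m/%d/%Y %H:%M", utc=True
)

print(sample[["BodyText", "CreationDate"]])
```

If the real file stores dates in a different layout, the `format` argument would need to change accordingly.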

## 2.2 Objectives

There are some fundamental questions to resolve in this notebook before you move on.

* Do you think you may have the data you need to tackle the desired question?
    * Have you identified the required target value?
    * Do you have potentially useful features?
* Do you have any fundamental issues with the data?

## 2.3 Imports <a id='2.3_Imports'></a>

In [2]:
#Code task 1#
#Import pandas, matplotlib.pyplot, and seaborn in the correct lines below
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import os
import numpy as np

## 2.4 Load The Stack Overflow Data<a id='2.5_Load_The_Stack_Overflow_Data'></a>

In [3]:
# the supplied CSV data file is in the raw_data directory

# stack_data = pd.read_csv('https://www.kaggle.com/imoore/60k-stack-overflow-questions-with-quality-rate?select=data.csv', error_bad_lines=False)

stack_data = pd.read_csv('../850380_1463404_compressed_data.csv/data.csv')

In [4]:
#Code task 2#
#Call the info method on stack_data to see a summary of the data
stack_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 60197 entries, 0 to 60196
Data columns (total 6 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   Id            60171 non-null  object
 1   Title         60011 non-null  object
 2   Body          59999 non-null  object
 3   Tags          59998 non-null  object
 4   CreationDate  59998 non-null  object
 5   Y             59997 non-null  object
dtypes: object(6)
memory usage: 2.8+ MB


In [5]:
#Code task 3#
#Call the head method on stack_data to print the first several rows of the data
stack_data.head()

Unnamed: 0,Id,Title,Body,Tags,CreationDate,Y
0,34552656,Java: Repeat Task Every Random Seconds,<p>I'm already familiar with repeating tasks e...,<java><repeat>,1/1/2016 0:21,LQ_CLOSE
1,34552974,How to get all the child records from differen...,I am having 4 different tables like \r\nselect...,<sql><sql-server>,1/1/2016 1:44,LQ_EDIT
2,34553034,Why are Java Optionals immutable?,<p>I'd like to understand why Java 8 Optionals...,<java><optional>,1/1/2016 2:03,HQ
3,34553174,Text Overlay Image with Darkened Opacity React...,<p>I am attempting to overlay a title over an ...,<javascript><image><overlay><react-native><opa...,1/1/2016 2:48,HQ
4,34553318,Why ternary operator in swift is so picky?,"<p>The question is very simple, but I just cou...",<swift><operators><whitespace><ternary-operato...,1/1/2016 3:30,HQ


In [6]:
stack_data.isnull().sum()

Id               26
Title           186
Body            198
Tags            199
CreationDate    199
Y               200
dtype: int64
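
Before dropping incomplete rows, it can help to look at which rows are affected and how large a share of the data they represent. The sketch below uses a made-up frame named `demo` (an assumption, not the real data) to show one way of doing this with a boolean row mask.

```python
import pandas as pd

# Made-up frame: one complete row and two rows with missing fields.
demo = pd.DataFrame(
    {
        "Id": ["1", "2", None],
        "Title": ["a", None, None],
        "Y": ["HQ", None, None],
    }
)

# True for every row that has at least one missing value.
has_missing = demo.isnull().any(axis=1)

# Inspect the incomplete rows before deciding to drop them.
print(demo[has_missing])
print(f"{has_missing.mean():.1%} of rows have at least one missing value")
```

Applied to `stack_data`, the same mask would show exactly which rows a later `dropna` call removes.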

In [9]:
stack_data.dropna(inplace = True)
stack_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 59997 entries, 0 to 60196
Data columns (total 6 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   Id            59997 non-null  object
 1   Title         59997 non-null  object
 2   Body          59997 non-null  object
 3   Tags          59997 non-null  object
 4   CreationDate  59997 non-null  object
 5   Y             59997 non-null  object
dtypes: object(6)
memory usage: 3.2+ MB


In [156]:
stack_data['Id'].value_counts().head(25)

&lt;/tr&gt;                                                                       16
&lt;/td&gt;                                                                       16
&lt;tr&gt;                                                                        15
&lt;td colspan=""6""&gt; &lt;/td&gt;                                               8
&lt;td&gt;                                                                         8
&lt;td colspan=""2""&gt; &lt;/td&gt;                                               8
&lt;td class=""small""&gt;                                                         8
&lt;td&gt;S                                                                        7
\t\t\t\t\t&lt;/td&gt;                                                              7
&lt;td class=""center""&gt;ZZ                                                      7
&lt;td class=""right""&gt;1.0&lt;/td&gt;                                           6
&lt;td class=""right""&gt;86.09  NOK&lt;/td&gt;                  
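
The value counts above show HTML fragments in the `Id` column, which hints that some rows were split or shifted when the CSV was parsed. One possible way to isolate such rows is to keep only those whose `Id` parses as a number. This is a sketch on made-up data; the names `demo`, `numeric_id`, and `clean` are illustrative assumptions.

```python
import pandas as pd

# Made-up frame: two valid rows and one row where a stray HTML fragment
# ended up in the Id column.
demo = pd.DataFrame(
    {
        "Id": ["34552656", "&lt;/td&gt;", "34552974"],
        "Y": ["LQ_CLOSE", None, "LQ_EDIT"],
    }
)

# Ids that cannot be parsed as numbers become NaN.
numeric_id = pd.to_numeric(demo["Id"], errors="coerce")
is_valid = numeric_id.notna()

print(f"{(~is_valid).sum()} row(s) with a non-numeric Id")

# Keep the valid rows and store Id as an integer column.
clean = demo[is_valid].copy()
clean["Id"] = numeric_id[is_valid].astype("int64")
```

Filtering on `Id` rather than relying only on `dropna` also catches malformed rows that happen to have no missing values.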

In [157]:
stack_data[['Id', 'Title', 'Body']].nunique()

Id       60057
Title    59993
Body     59999
dtype: int64

In [158]:
stack_data['Title'].value_counts().head(8)

  15.0% &lt;/td&gt;                7
#NAME?                             6
Regular expression                 3
Regular Expression                 3
SyntaxError: Unexpected token }    2
 resolve)                          2
 16 biter&lt;/td&gt;               2
How to read a lifetime error       1
Name: Title, dtype: int64
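
Repeated titles such as `#NAME?` or table fragments are worth inspecting as full rows rather than only as counts. Values like `#NAME?` often indicate that a file passed through a spreadsheet at some point, though that is only a guess here. The sketch below, on made-up titles, lists every entry whose title occurs more than once.

```python
import pandas as pd

# Made-up titles standing in for the Title column.
titles = pd.Series(
    ["#NAME?", "Regular expression", "#NAME?", "How to read a lifetime error"],
    name="Title",
)

# keep=False marks every occurrence of a repeated value, not just the later ones.
repeated = titles[titles.duplicated(keep=False)]
print(repeated)
```

On the real frame, `stack_data[stack_data['Title'].duplicated(keep=False)]` would return the corresponding full rows for review.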

In [172]:
stack_data = stack_data[stack_data['Y'].notna()]

In [173]:
stack_data.isnull().sum()

Id              0
Title           0
Body            0
Tags            0
CreationDate    0
Y               0
dtype: int64

In [179]:
stack_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 59997 entries, 0 to 60196
Data columns (total 6 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   Id            59997 non-null  object
 1   Title         59997 non-null  object
 2   Body          59997 non-null  object
 3   Tags          59997 non-null  object
 4   CreationDate  59997 non-null  object
 5   Y             59997 non-null  object
dtypes: object(6)
memory usage: 3.2+ MB


In [180]:
stack_data['Y'].value_counts().head(8)

HQ          20000
LQ_CLOSE    19999
LQ_EDIT     19998
Name: Y, dtype: int64
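
The counts above show that the three classes are almost perfectly balanced, at roughly one third each. A quick way to express this as proportions, and to confirm that no unexpected labels are present, is sketched below on made-up labels. The names `labels`, `expected`, and `unexpected` are assumptions for illustration.

```python
import pandas as pd

# Made-up labels standing in for the Y column.
labels = pd.Series(
    ["HQ", "LQ_CLOSE", "LQ_EDIT", "HQ", "LQ_EDIT", "LQ_CLOSE"], name="Y"
)

# Share of each class instead of raw counts.
proportions = labels.value_counts(normalize=True)
print(proportions)

# Any label outside the three documented categories would show up here.
expected = {"HQ", "LQ_CLOSE", "LQ_EDIT"}
unexpected = set(labels.unique()) - expected
print(f"Unexpected labels: {unexpected or 'none'}")
```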

In [187]:
stack_data[~stack_data['Y'].isin(['HQ', 'LQ_CLOSE', 'LQ_EDIT'])]

Unnamed: 0,Id,Title,Body,Tags,CreationDate,Y


In [188]:
stack_data.isnull().sum()

Id              0
Title           0
Body            0
Tags            0
CreationDate    0
Y               0
dtype: int64

In [189]:
print(len(stack_data))

59997


In [190]:
stack_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 59997 entries, 0 to 60196
Data columns (total 6 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   Id            59997 non-null  object
 1   Title         59997 non-null  object
 2   Body          59997 non-null  object
 3   Tags          59997 non-null  object
 4   CreationDate  59997 non-null  object
 5   Y             59997 non-null  object
dtypes: object(6)
memory usage: 3.2+ MB


In [191]:
stack_data.dropna()

Unnamed: 0,Id,Title,Body,Tags,CreationDate,Y
0,34552656,Java: Repeat Task Every Random Seconds,<p>I'm already familiar with repeating tasks e...,<java><repeat>,1/1/2016 0:21,LQ_CLOSE
1,34552974,How to get all the child records from differen...,I am having 4 different tables like \r\nselect...,<sql><sql-server>,1/1/2016 1:44,LQ_EDIT
2,34553034,Why are Java Optionals immutable?,<p>I'd like to understand why Java 8 Optionals...,<java><optional>,1/1/2016 2:03,HQ
3,34553174,Text Overlay Image with Darkened Opacity React...,<p>I am attempting to overlay a title over an ...,<javascript><image><overlay><react-native><opa...,1/1/2016 2:48,HQ
4,34553318,Why ternary operator in swift is so picky?,"<p>The question is very simple, but I just cou...",<swift><operators><whitespace><ternary-operato...,1/1/2016 3:30,HQ
...,...,...,...,...,...,...
60192,60467932,C++ The correct way to multiply an integer and...,<p>I try to multiply an integer by a double bu...,<c++>,2/29/2020 17:46,LQ_CLOSE
60193,60468018,How can I make a c# application outside of vis...,<p>I'm very new to programming and I'm teachin...,<c#><visual-studio>,2/29/2020 17:55,LQ_CLOSE
60194,60468378,WHY DJANGO IS SHOWING ME THIS ERROR WHEN I TRY...,*URLS.PY*\r\n //URLS.PY FILE\r\n fro...,<django><django-views><django-templates>,2/29/2020 18:35,LQ_EDIT
60195,60469392,PHP - getting the content of php page,<p>I have a controller inside which a server i...,<javascript><php><html>,2/29/2020 20:32,LQ_CLOSE
