## Analyzing an initial dataset
In this notebook, we will quickly explore a real dataset of questions from writers.stackexchange.com. The dataset was initially sourced from the [archive](https://archive.org/details/stackexchange).

First, we will load the data. If you are loading a different CSV, make sure you have pre-processed the raw XML using the `ml_editor` Python package.
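For readers starting from a raw dump, the pre-processing step might look roughly like the sketch below. It is not the `ml_editor` implementation. It assumes the dump is an XML file of `<row>` elements that carry each post's fields as attributes, and the helper names (`strip_html`, `posts_xml_to_rows`) and the sample XML are made up for illustration. It uses only the standard library.

```python
import csv
import io
import xml.etree.ElementTree as ElT
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects the text nodes of an HTML fragment, dropping the tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def strip_html(html):
    """Return the plain text of an HTML fragment (hypothetical helper)."""
    parser = _TextExtractor()
    parser.feed(html or "")
    parser.close()
    return "".join(parser.chunks)


def posts_xml_to_rows(xml_text):
    """Turn each <row .../> element into a dict of its attributes plus body_text."""
    root = ElT.fromstring(xml_text)
    rows = []
    for elem in root.iter("row"):
        row = dict(elem.attrib)
        row["body_text"] = strip_html(row.get("Body"))
        rows.append(row)
    return rows


# Tiny made-up sample in the assumed shape of the dump (not real data)
sample_xml = """<posts>
  <row Id="1" PostTypeId="1" Title="How do I start?" Body="&lt;p&gt;Any tips?&lt;/p&gt;" />
  <row Id="2" PostTypeId="2" ParentId="1" Body="&lt;p&gt;Read a lot.&lt;/p&gt;" />
</posts>"""

rows = posts_xml_to_rows(sample_xml)

# Write the rows as CSV; attributes missing from a row become empty cells
fieldnames = sorted({key for row in rows for key in row})
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=fieldnames)
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()
```

The resulting CSV text has one column per attribute seen in the dump plus a `body_text` column holding the post body with HTML tags removed, which is the column the rest of this notebook relies on.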

In [7]:
import json
from tqdm import tqdm
from bs4 import BeautifulSoup
import xml.etree.ElementTree as ElT
import matplotlib.pyplot as plt
from matplotlib.patches import Rectangle
import pandas as pd

from pathlib import Path
import sys

import warnings
warnings.filterwarnings('ignore')

pd.set_option('display.max_rows', 50)
pd.set_option('display.max_columns', 50)
pd.set_option('display.width', 1000)
%matplotlib inline
# %load_ext autoreload
# %autoreload 2

In [2]:
df = pd.read_csv(r'../data/posts.csv')

### Data format
Let's start by thinking through how we would like to format the data. Among other things, we will need to decide which label to give our model.

We want a model that measures the quality of a question. To that end, we could use:

* The number of upvotes a question gets
* The number of answers a question gets, or whether it gets an answer at all
* Whether an answer was marked as accepted or not

First, let's format our dataset to reconcile questions and associated answers, and verify that they match up.

We will start by filling missing values, as well as adding two features (text_len and is_question) we will use later.

In [3]:

# Start by changing types to make processing easier
df["AnswerCount"] = df["AnswerCount"].fillna(-1)
df["AnswerCount"] = df["AnswerCount"].astype(int)
df["PostTypeId"] = df["PostTypeId"].astype(int)
df["Id"] = df["Id"].astype(int)
df.set_index("Id", inplace=True, drop=False)

# Add measure of the length of a post
df["full_text"] = df["Title"].str.cat(df["body_text"], sep=" ", na_rep="")
df["text_len"] = df["full_text"].str.len()

# A question is a post with PostTypeId 1
df["is_question"] = df["PostTypeId"] == 1
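The candidate labels listed above can each be expressed as a column. The sketch below does so on a small made-up frame that reuses the dataset's column names. The `label_*` column names are hypothetical, and the `-1` in `AnswerCount` mirrors the fill value used for posts that have no answer count.

```python
import pandas as pd

# Made-up posts using the dataset's column names (not real data)
toy = pd.DataFrame(
    {
        "Id": [1, 2, 3, 4],
        "PostTypeId": [1, 2, 1, 1],
        "Score": [32, 5, 0, 7],
        "AnswerCount": [10, -1, 0, 2],
        "AcceptedAnswerId": [2.0, None, None, None],
    }
).set_index("Id", drop=False)

# Labels only make sense for questions (PostTypeId 1)
questions = toy[toy["PostTypeId"] == 1].copy()

# One column per candidate label
questions["label_score"] = questions["Score"]
questions["label_answer_count"] = questions["AnswerCount"]
questions["label_has_answer"] = questions["AnswerCount"] > 0
questions["label_has_accepted"] = questions["AcceptedAnswerId"].notna()
```

Keeping the candidates side by side makes it easy to compare how they are distributed before committing to one of them.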

### Data quality
Let's examine the quality of the data in this dataset, starting by answering the questions below:

* How much of the data is missing?
* What is the quality of the text?
* Do the answers match the questions?
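For the first question, one quick way to quantify missing data is the fraction of null values per column. The following is a minimal sketch on a made-up frame; the same expression could be applied to `df`.

```python
import pandas as pd

# Made-up frame with some missing values (not real data)
toy = pd.DataFrame(
    {
        "Body": ["<p>a</p>", None, "<p>c</p>", "<p>d</p>"],
        "Title": ["t", None, None, None],
        "Score": [1, 0, 3, 2],
    }
)

# isna() marks nulls as True; the mean of booleans is the fraction of nulls
missing_fraction = toy.isna().mean().sort_values(ascending=False)
print(missing_fraction)
```

Columns such as `Title` are expected to be mostly null in the full dataset, since only questions carry a title.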

In [9]:
df.head(2)

Id (index),Unnamed: 0,Id,PostTypeId,AcceptedAnswerId,CreationDate,Score,ViewCount,Body,OwnerUserId,LastEditorUserId,LastEditorDisplayName,LastEditDate,LastActivityDate,Title,Tags,AnswerCount,CommentCount,FavoriteCount,ClosedDate,ContentLicense,body_text,ParentId,CommunityOwnedDate,OwnerDisplayName,full_text,text_len,is_question
1,0,1,1,15.0,2010-11-18T20:40:32.857,32,1481.0,<p>I've always wanted to start writing (in a t...,8.0,32946.0,user29032,2019-02-10T04:06:33.283,2019-03-31T20:10:59.657,What are some online guides for starting writers?,<resources><first-time-author>,10,7,19.0,2019-09-09T15:44:30.727,CC BY-SA 3.0,I've always wanted to start writing (in a tota...,,,,What are some online guides for starting write...,352,True
2,1,2,1,16.0,2010-11-18T20:42:31.513,23,9777.0,<p>What kind of story is better suited for eac...,8.0,,user29032,2018-04-29T19:35:55.850,2018-04-29T19:35:55.850,What is the difference between writing in the ...,<fiction><grammatical-person><third-person>,7,0,5.0,,CC BY-SA 3.0,What kind of story is better suited for each p...,,,,What is the difference between writing in the ...,331,True


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 44829 entries, 1 to 55098
Data columns (total 27 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   Unnamed: 0             44829 non-null  int64  
 1   Id                     44829 non-null  int32  
 2   PostTypeId             44829 non-null  int32  
 3   AcceptedAnswerId       5290 non-null   float64
 4   CreationDate           44829 non-null  object 
 5   Score                  44829 non-null  int64  
 6   ViewCount              10495 non-null  float64
 7   Body                   44742 non-null  object 
 8   OwnerUserId            41779 non-null  float64
 9   LastEditorUserId       15192 non-null  float64
 10  LastEditorDisplayName  998 non-null    object 
 11  LastEditDate           16015 non-null  object 
 12  LastActivityDate       44829 non-null  object 
 13  Title                  10495 non-null  object 
 14  Tags                   10495 non-null  object 
 15  An

We have a little over 44,000 posts, consisting of both questions and answers.

Looking at the Body column, it appears that it is null in 44829 - 44742 = 87 rows. Let's take a look at these rows to see if we should remove them.

In [5]:
df[df["Body"].isna()]

Id (index),Unnamed: 0,Id,PostTypeId,AcceptedAnswerId,CreationDate,Score,ViewCount,Body,OwnerUserId,LastEditorUserId,...,FavoriteCount,ClosedDate,ContentLicense,body_text,ParentId,CommunityOwnedDate,OwnerDisplayName,full_text,text_len,is_question
2145,1956,2145,5,,2011-03-22T19:49:56.600,0,,,20.0,20.0,...,,,CC BY-SA 2.5,,,,,,1,False
2147,1958,2147,5,,2011-03-22T19:51:05.897,0,,,20.0,20.0,...,,,CC BY-SA 2.5,,,,,,1,False
2215,2026,2215,5,,2011-03-24T19:35:10.353,0,,,-1.0,-1.0,...,,,CC BY-SA 2.5,,,,,,1,False
2218,2029,2218,5,,2011-03-24T19:41:38.677,0,,,-1.0,-1.0,...,,,CC BY-SA 2.5,,,,,,1,False
2225,2036,2225,5,,2011-03-24T19:58:59.833,0,,,-1.0,-1.0,...,,,CC BY-SA 2.5,,,,,,1,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
45437,36660,45437,5,,2019-05-26T12:04:39.597,0,,,23253.0,23253.0,...,,,CC BY-SA 4.0,,,,,,1,False
46197,37364,46197,5,,2019-06-24T15:25:41.197,0,,,32946.0,32946.0,...,,,CC BY-SA 4.0,,,,,,1,False
50756,41174,50756,5,,2020-04-11T11:59:12.140,0,,,23253.0,23253.0,...,,,CC BY-SA 4.0,,,,,,1,False
54356,44188,54356,5,,2021-01-06T22:33:28.910,0,,,-1.0,-1.0,...,,,CC BY-SA 4.0,,,,,,1,False


All of the null bodies belong to posts with PostTypeId 4 or 5.
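Rather than reading this off the truncated display, the post types of the null-body rows can be counted directly. This is a minimal sketch on a made-up frame; the same two expressions could be run against `df`.

```python
import pandas as pd

# Made-up posts: only the rows with PostTypeId 4 or 5 lack a body (not real data)
toy = pd.DataFrame(
    {
        "PostTypeId": [1, 2, 5, 4, 5],
        "Body": ["question", "answer", None, None, None],
    }
)

null_body = toy["Body"].isna()

# How many null-body rows there are for each post type
null_body_types = toy.loc[null_body, "PostTypeId"].value_counts()

# True only if every null-body row has PostTypeId 4 or 5
only_4_or_5 = toy.loc[null_body, "PostTypeId"].isin([4, 5]).all()
```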

The readme file that accompanied the archive only mentions PostTypeIds of 1 (questions) and 2 (answers). We will remove all rows not marked PostTypeId 1 or 2, since we are only interested in questions and answers.

In [10]:
df = df[df["PostTypeId"].isin([1,2])]
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 44076 entries, 1 to 55098
Data columns (total 27 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   Unnamed: 0             44076 non-null  int64  
 1   Id                     44076 non-null  int32  
 2   PostTypeId             44076 non-null  int32  
 3   AcceptedAnswerId       5290 non-null   float64
 4   CreationDate           44076 non-null  object 
 5   Score                  44076 non-null  int64  
 6   ViewCount              10495 non-null  float64
 7   Body                   44076 non-null  object 
 8   OwnerUserId            41031 non-null  float64
 9   LastEditorUserId       14446 non-null  float64
 10  LastEditorDisplayName  990 non-null    object 
 11  LastEditDate           15262 non-null  object 
 12  LastActivityDate       44076 non-null  object 
 13  Title                  10495 non-null  object 
 14  Tags                   10495 non-null  object 
 15  An

Now let's look at a few questions and answers and verify that they match, and that the text is readable.

In [12]:
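# Keep only questions that have an accepted answer
questions_with_accepted_answers = df[df['is_question'] & ~(df['AcceptedAnswerId'].isna())]
# Look up each accepted answer's text by matching AcceptedAnswerId against the Id index
q_and_a = questions_with_accepted_answers.join(df['body_text'], on='AcceptedAnswerId', how='left', rsuffix="_answer")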

# Setting this option allows us to display all the data
pd.options.display.max_colwidth = 500
q_and_a[["body_text", "body_text_answer"]][:3]

Id,body_text,body_text_answer
1,"I've always wanted to start writing (in a totally amateur way), but whenever I want to start something I instantly get blocked having a lot of questions and doubts.\nAre there some resources on how to start becoming a writer?\nI'm thinking something with tips and easy exercises to get the ball rolling.\n","When I'm thinking about where I learned most how to write, I think that reading was the most important guide to me. This may sound silly, but by reading good written newspaper articles (facts, opinions, scientific articles and most of all, criticisms of films and music), I learned how others did the job, what works and what doesn't. In my own writing, I try to mimic other people's styles that I liked. Moreover, I learn new things by reading, giving me a broader background that I need when re..."
2,"What kind of story is better suited for each point of view? Are there advantages or disadvantages inherent to them?\nFor example, writing in the first person you are always following a character, while in the third person you can ""jump"" between story lines.\n","With a story in first person, you are intending the reader to become much more attached to the main character. Since the reader sees what that character sees and feels what that character feels, the reader will have an emotional investment in that character. Third person does not have this close tie; a reader can become emotionally invested but it will not be as strong as it will be in first person.\nContrarily, you cannot have multiple point characters when you use first person without ex..."
3,"I finished my novel, and everyone I've talked to says I need an agent. How do I find one?\n","Try and find a list of agents who write in your genre. Check out their websites!\nFind out if they are accepting new clients. If they aren't, then check out another agent. But if they are, try sending them a few chapters from your story, a brief, and a short cover letter asking them to represent you.\nIn the cover letter mention your previous publication credits. If sent via post, then I suggest you give them a means of reply, whether it be an email or a stamped, addressed envelope.\nAgents ..."
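Reading a few pairs is a useful sanity check, and the match can also be verified programmatically. The sketch below assumes that an answer's `ParentId` holds the `Id` of the question it belongs to, and checks on a made-up frame that every accepted answer points back to its own question.

```python
import pandas as pd

# Made-up posts: questions 1 and 3, with accepted answers 2 and 4 (not real data)
toy = pd.DataFrame(
    {
        "Id": [1, 2, 3, 4],
        "PostTypeId": [1, 2, 1, 2],
        "AcceptedAnswerId": [2.0, None, 4.0, None],
        "ParentId": [None, 1.0, None, 3.0],
    }
).set_index("Id", drop=False)

# Questions that have an accepted answer
questions = toy[(toy["PostTypeId"] == 1) & toy["AcceptedAnswerId"].notna()]

# For each question, look up the ParentId of its accepted answer by Id
accepted_parent = questions["AcceptedAnswerId"].astype(int).map(toy["ParentId"])

# The accepted answer should point back to the question it was joined to
matches = accepted_parent == questions["Id"]
print(f"{matches.sum()} of {len(matches)} accepted answers match their question")
```

A mismatch here would suggest a problem in the pre-processing or in the join, and would be worth investigating before training on the data.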
