# Preprocessing

To address the stated hypotheses, only the posts, users, and tags tables were considered. The identifier columns were dropped from these tables, and a query was run to derive the new attributes described below.

Three groups of features were obtained for the analysis:

**1. Features of the user asking the question**

To determine the influence of the reputation and experience of the user asking the question, the following features are proposed:

* **age_user**: age of the user's account, in days, from registration to the creation date of the post; computed from **users_creation_date** and **post_creation_date**.
* **users_reputation**: reputation of the user, taken from the dataset.
* **users_up_votes**: taken from the dataset.
* **users_down_votes**: taken from the dataset.
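
As a worked illustration of the account-age feature, the sketch below computes the whole number of days between a user's registration and the creation of a post using only the standard library. The helper name `account_age_days` is hypothetical, and the timestamp layout is an assumption based on the sample rows shown in the notebook output; the notebook itself performs this step with pandas.

```python
from datetime import datetime

# Timestamp layout assumed from the sample rows of the dataset.
TIMESTAMP_FORMAT = "%Y-%m-%d %H:%M:%S.%f"


def account_age_days(users_creation_date: str, post_creation_date: str) -> int:
    """Whole days elapsed between account registration and post creation."""
    registered = datetime.strptime(users_creation_date, TIMESTAMP_FORMAT)
    posted = datetime.strptime(post_creation_date, TIMESTAMP_FORMAT)
    # timedelta.days floors the difference, so a post dated slightly
    # before the registration timestamp yields -1 rather than 0.
    return (posted - registered).days


# Timestamps taken from the first sample row of the dataset.
print(account_age_days("2011-08-08 21:57:35.103", "2016-10-27 14:24:53.537"))  # 1906
```

A user who posts a few minutes after registering gets an age of 0 days, which is the value that appears for brand-new accounts in the sample rows.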
 
Additionally, the scores of the questions, answers, and comments previously posted by the user are obtained:
* **score_prev_acceptans**: sum of the scores of the user's answers that were marked as accepted.
* **score_prev_ans**: sum of the scores of the user's answers that were not accepted.
* **score_prev_comment**: sum of the scores of the user's comments.
* **score_prev_question**: sum of the scores of the user's questions.
* **score_prev_favquestion**: sum of the favorite marks received by the user's questions.

To obtain these attributes, posts without a registered user were removed.

**2. Features of the post**

The title length was obtained as a feature, replacing **post_title**:
* **title_length**: number of characters.

In addition, features were extracted from the post body, **post_body**:

* **num_block_code**: number of code blocks, counted as "pre" tags.
* **code_length**: number of characters in the code blocks.
* **num_i_sentences**: number of sentences that start with “I”.
* **num_wh_words**: number of sentences that start with a question word (How, What, etc.).
* **num_words**: number of words in the post after removing stop words.
* **num_y_sentences**: number of sentences that start with “You”, intended to flag posts that include a prior explanation.
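
The two code-related features can be illustrated with a small standard-library sketch. The notebook relies on BeautifulSoup for this step; the version below swaps in `html.parser` so that it runs without extra dependencies, and the names `PreBlockCounter` and `code_features` are hypothetical. It counts every `<pre>` tag and sums the characters of the text found inside them, which approximates, but may not exactly match, the notebook's character count for unusual markup.

```python
from html.parser import HTMLParser


class PreBlockCounter(HTMLParser):
    """Counts <pre> blocks and the characters of text nested inside them."""

    def __init__(self):
        super().__init__()
        self.num_blocks = 0
        self.code_length = 0
        self._depth = 0  # > 0 while the parser is inside a <pre> element

    def handle_starttag(self, tag, attrs):
        if tag == "pre":
            self.num_blocks += 1
            self._depth += 1

    def handle_endtag(self, tag):
        if tag == "pre" and self._depth > 0:
            self._depth -= 1

    def handle_data(self, data):
        if self._depth > 0:
            self.code_length += len(data)


def code_features(body_html):
    """Return (num_block_code, code_length) for the HTML body of a post."""
    parser = PreBlockCounter()
    parser.feed(body_html)
    parser.close()
    return parser.num_blocks, parser.code_length


# Hypothetical post body with two code blocks.
body = "<p>Hi</p><pre><code>x = 1</code></pre><pre><code>print(x)</code></pre>"
print(code_features(body))  # (2, 13)
```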

Additionally, some of the features taken directly from the dataset are kept:
* **post_comment_count**: number of comments.
* **post_favorite_count**: number of favorite marks.
* **post_score**: score of the post, based on its votes.
* **post_view_count**: number of views.

**3. Tag features**

Each post in the dataset stores its tags in a single column, separated by the ‘|’ symbol. From this column, the number of tags, **num_tags**, is obtained, and the popularity of the post's tags, **tags_popularity**, is also evaluated.

Tag popularity is obtained by selecting the top 100 tags and counting how many of each post's tags are among those 100.
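
The two tag features reduce to a split and a membership count, as the following minimal sketch shows. The function name `tag_features` and the three-element `top_tags` set are hypothetical stand-ins; in the notebook the popular tags are read from `top_tags.csv`.

```python
def tag_features(post_tags, top_tags):
    """Return (num_tags, tags_popularity) for a '|'-separated tag string."""
    tags = post_tags.split("|")
    popularity = sum(1 for tag in tags if tag in top_tags)
    return len(tags), popularity


# Hypothetical stand-in for the 100 most popular tags.
top_tags = {"php", "mysql", "java"}

print(tag_features("php|mysql", top_tags))            # (2, 2)
print(tag_features("java|java-ee|wildfly", top_tags))  # (3, 1)
```

Keeping the popular tags in a `set` makes each membership check constant time, which matters when the same lookup is repeated for every tag of every post.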


The dataset contains 5000 posts, from which the features for the analysis are extracted.

In [1]:
from bs4 import BeautifulSoup
import pandas as pd

df=pd.read_csv("stackoverflow_data.csv", quotechar='"',
               usecols=['post_id','accepted_ans','post_title','post_body','post_creation_date',
                            'post_answer_count','post_comment_count',
                            'post_favorite_count', 'post_score','post_tags',
                            'post_view_count','users_creation_date',
                            'users_reputation','users_up_votes','users_down_votes',
                            'score_prev_acceptans' ,'score_prev_ans',
                            'score_prev_comment','score_prev_question',
                            'score_prev_favquestion'])
# File with the most popular tags
df_tags=pd.read_csv("top_tags.csv", quotechar='"')
# Drop records with null values (dropna returns a new DataFrame, so reassign it)
df = df.dropna()
df['post_creation_date']=pd.to_datetime(df['post_creation_date'])
df['users_creation_date']=pd.to_datetime(df['users_creation_date'])
df['class']=df['post_answer_count'].apply(lambda x: 1 if x>0 else 0)

In [2]:
df.head()

| | post_id | accepted_ans | post_title | post_body | post_creation_date | post_answer_count | post_comment_count | post_favorite_count | post_score | post_tags | ... | users_creation_date | users_reputation | users_up_votes | users_down_votes | score_prev_acceptans | score_prev_ans | score_prev_comment | score_prev_question | score_prev_favquestion | class |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 40287199 | 0 | Define the correct NSTableView behavior with d... | `<p>I have a view based NSTableView in my appli...` | 2016-10-27 14:24:53.537 | 0 | 1 | 0 | 0 | osx\|cocoa\|tableview\|nstableview\|nswindow | ... | 2011-08-08 21:57:35.103 | 1848 | 142 | 30 | 40 | 64 | 28 | 4 | 0 | 0 |
| 1 | 43396867 | 0 | Undefined variable but already declared | `<p>I'm trying to make a simple form that can c...` | 2017-04-13 15:59:39.453 | 0 | 6 | 0 | 0 | php\|mysql | ... | 2017-04-13 15:43:24.877 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 39747783 | 0 | How to ignore folder in just post_process whil... | `<p>Is there any way to ignore a folder only fo...` | 2016-09-28 12:38:00.257 | 0 | 2 | 0 | 0 | django\|deployment | ... | 2014-03-31 16:03:04.167 | 62 | 2 | 0 | 0 | 3 | 0 | 5 | 0 | 0 |
| 3 | 44417709 | 0 | Java wildfly java.lang.NoClassDefFoundError | `<p>I am developing javaee maven web project us...` | 2017-06-07 16:03:59.080 | 0 | 5 | 0 | 0 | java\|java-ee\|wildfly | ... | 2013-01-31 19:55:55.543 | 66 | 9 | 0 | 0 | 0 | 1 | 4 | 0 | 0 |
| 4 | 41607780 | 0 | Execution failed for task :app:transformClasse... | `<p>I am trying to release my Android applicati...` | 2017-01-12 07:54:29.743 | 0 | 5 | 0 | 0 | android\|android-studio\|gradle | ... | 2017-01-12 07:42:58.243 | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |


In [6]:
from nltk.corpus import stopwords

# Select the features that will be used by the classifier.
df2 = pd.DataFrame(data=df, index=df.index, columns=['class',
                            'post_comment_count',
                            'post_favorite_count', 'post_score',
                            'post_view_count',
                            'users_reputation', 'users_up_votes','users_down_votes',
                            'score_prev_acceptans' ,'score_prev_ans',
                            'score_prev_comment','score_prev_question',
                            'score_prev_favquestion']
                   )
# Age of the account relative to the post, in whole days.
# .dt.days is used because newer pandas releases may reject the astype('timedelta64[D]') cast.
df2['age_user'] = (df['post_creation_date'] - df['users_creation_date']).dt.days.fillna(0).astype(float)
df2['title_length'] = df['post_title'].apply(len)
df2['num_block_code'] = 0
df2["num_i_sentences"]=0
df2["num_wh_words"]=0
df2["num_y_sentences"]=0
df2["tags_popularity"]=0
df2["num_tags"]=0

whwords=['what','how', 'which', 'when', 'why', 'where']
# Build the stop-word set once instead of reloading the list for every word
stop_words = set(stopwords.words('english'))
for index, row in df.iterrows():
    sbody=row["post_body"]
    soup = BeautifulSoup(sbody, "html5lib")
    sentences =  soup.find_all(name="p")
    # Question words
    count_wh=0
    # Sentences containing the pronoun I
    count_is=0
    # Sentences containing the pronoun You
    count_ys = 0
    palabras=[]
    filtered_words=[]
    for sentence in sentences:
        try:
            palabras=sentence.contents[0].split()
        except Exception:
            # The first child is not plain text (or is missing): fall back to the string form
            palabras=str(sentence.contents).split()
        if(len(palabras)==0):
            children = soup.find_all("li", { "class" : "expandable" }, recursive=False)
            for child in children:
                palabras.extend(child.getText().split())
        filtered_words.extend([word for word in palabras if word not in stop_words])
        count_is=count_is+len([x for x in palabras if x == "I"])
        count_is=count_is+len([x for x in palabras if len(x.split("I'"))>1])    
        count_ys=count_ys+len([x for x in palabras if x == "You"])
        count_ys=count_ys+len([x for x in palabras if len(x.split("You'"))>1])
        for word in whwords:
            count_wh=count_wh+len([x for x in palabras if x == word])
    df2.loc[index, "num_i_sentences"] = count_is
    df2.loc[index, "num_wh_words"] = count_wh
    df2.loc[index, "num_y_sentences"] = count_ys    
    df2.loc[index, "num_words"] = len(filtered_words)
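
The counting rules in the loop above can be isolated in a small pure function, which makes them easier to check on a known input. This is a sketch rather than a drop-in replacement: the name `word_features` and the three-word stop list are hypothetical, while the matching rules mirror the loop as written, that is, they count matching words rather than sentence starts, and the comparison against the question words is case-sensitive.

```python
WH_WORDS = ("what", "how", "which", "when", "why", "where")


def word_features(words, stop_words, wh_words=WH_WORDS):
    """Word-level counts following the same matching rules as the notebook loop."""
    return {
        # "I" on its own, or a contraction such as "I'm" or "I've"
        "num_i_sentences": sum(1 for w in words if w == "I" or "I'" in w),
        # "You" on its own, or a contraction such as "You're"
        "num_y_sentences": sum(1 for w in words if w == "You" or "You'" in w),
        # Exact, case-sensitive match against the lowercase question words
        "num_wh_words": sum(1 for w in words if w in wh_words),
        # Words that remain after removing stop words
        "num_words": sum(1 for w in words if w not in stop_words),
    }


# Hypothetical sentence and a tiny stand-in for the NLTK stop-word list.
words = "I think I'm stuck. How do you fix what I broke?".split()
print(word_features(words, stop_words={"do", "you", "what"}))
# {'num_i_sentences': 3, 'num_y_sentences': 0, 'num_wh_words': 1, 'num_words': 8}
```

In this example the capitalized "How" is not counted as a question word, because only the lowercase forms are in the list; lowercasing each word before the comparison would be one way to count it, at the cost of changing the feature values reported here.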

In [7]:
# Code and tag features
for index, row in df.iterrows():
    body=row["post_body"]
    tags_column=row["post_tags"]
    tags=tags_column.split("|")
    soup = BeautifulSoup(body, "html5lib")
    precode = soup.find_all("pre")
    df2.loc[index, "num_block_code"]=len(precode)
    content=""
    countError = 0
    for codeline in precode:
        contentPre = codeline.contents
        for contentCode in contentPre:
            try:
                content=content+contentCode.contents[0]
            except :
                try:
                    content = content + str(contentCode)
                except:
                    print(contentCode)
    wordCodeCount =len(content)
    df2.loc[index,"code_length"]=wordCodeCount    
    counttag=len(tags)
    pop_tag=0
    # Number of tags and tag popularity
    for tag in tags:
        if tag in df_tags['tag_name'].values:
            pop_tag+=1
    df2.loc[index, "num_tags"] = counttag
    df2.loc[index, "tags_popularity"] = pop_tag
df2.head()

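| | class | post_comment_count | post_favorite_count | post_score | post_view_count | users_reputation | users_up_votes | users_down_votes | score_prev_acceptans | score_prev_ans | ... | age_user | title_length | num_block_code | num_i_sentences | num_wh_words | num_y_sentences | tags_popularity | num_tags | num_words | code_length |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 0 | 0 | 55 | 1848 | 142 | 30 | 40 | 64 | ... | 1906.0 | 60 | 0 | 4 | 1 | 0 | 1 | 5 | 103.0 | 0.0 |
| 1 | 0 | 6 | 0 | 0 | 19 | 1 | 0 | 0 | 0 | 0 | ... | 0.0 | 39 | 1 | 3 | 1 | 0 | 2 | 2 | 43.0 | 2262.0 |
| 2 | 0 | 2 | 0 | 0 | 27 | 62 | 2 | 0 | 0 | 3 | ... | 911.0 | 69 | 1 | 1 | 1 | 0 | 1 | 2 | 30.0 | 49.0 |
| 3 | 0 | 5 | 0 | 0 | 76 | 66 | 9 | 0 | 0 | 0 | ... | 1587.0 | 43 | 2 | 5 | 2 | 0 | 1 | 3 | 45.0 | 20340.0 |
| 4 | 0 | 5 | 0 | 0 | 287 | 3 | 0 | 0 | 0 | 0 | ... | 0.0 | 81 | 1 | 1 | 0 | 0 | 1 | 3 | 9.0 | 869.0 |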


A total of 22 features, excluding the **class** label, were obtained for the first analysis.

## STATISTICS

In [15]:
# class 0: questions without an answer
# class 1: questions with an answer

class0= df2.loc[df2['class'] ==0,:]
class1= df2.loc[df2['class'] ==1,:]
df2.describe()


| | class | post_comment_count | post_favorite_count | post_score | post_view_count | users_reputation | users_up_votes | users_down_votes | score_prev_acceptans | score_prev_ans | ... | age_user | title_length | num_block_code | num_i_sentences | num_wh_words | num_y_sentences | tags_popularity | num_tags | num_words | code_length |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 5000.0 | 5000.0 | 5000.0 | 5000.0 | 5000.0 | 5000.0 | 5000.0 | 5000.0 | 5000.0 | 5000.0 | ... | 5000.0 | 5000.0 | 5000.0 | 5000.0 | 5000.0 | 5000.0 | 5000.0 | 5000.0 | 5000.0 | 5000.0 |
| mean | 0.7728 | 2.1324 | 0.14 | 0.5542 | 241.6304 | 969.8716 | 159.6094 | 14.7166 | 27.0564 | 49.0104 | ... | 739.9308 | 53.928 | 1.4706 | 3.507 | 1.0422 | 0.0106 | 1.3968 | 3.0214 | 45.5788 | 959.363 |
| std | 0.419065 | 2.839733 | 0.558267 | 2.016302 | 930.855619 | 5746.156694 | 672.061197 | 109.408935 | 280.743409 | 456.01589 | ... | 725.846256 | 20.38462 | 1.404112 | 2.895864 | 1.270253 | 0.111759 | 0.979153 | 1.218868 | 32.930695 | 1972.114842 |
| min | 0.0 | 0.0 | 0.0 | -8.0 | 5.0 | 0.0 | 0.0 | 0.0 | -7.0 | -7.0 | ... | -1.0 | 7.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 25% | 1.0 | 0.0 | 0.0 | 0.0 | 39.0 | 13.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 76.75 | 39.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 2.0 | 24.0 | 0.0 |
| 50% | 1.0 | 1.0 | 0.0 | 0.0 | 69.0 | 73.0 | 10.0 | 0.0 | 0.0 | 0.0 | ... | 528.0 | 51.0 | 1.0 | 3.0 | 1.0 | 0.0 | 1.0 | 3.0 | 38.0 | 353.0 |
| 75% | 1.0 | 3.0 | 0.0 | 1.0 | 173.0 | 476.0 | 75.0 | 2.0 | 3.0 | 8.25 | ... | 1239.25 | 65.0 | 2.0 | 5.0 | 2.0 | 0.0 | 2.0 | 4.0 | 58.0 | 997.25 |
| max | 1.0 | 39.0 | 19.0 | 59.0 | 38320.0 | 229974.0 | 24787.0 | 3665.0 | 11804.0 | 19751.0 | ... | 3203.0 | 149.0 | 17.0 | 25.0 | 10.0 | 2.0 | 5.0 | 5.0 | 502.0 | 25051.0 |


In [17]:
class0.describe()

| | class | post_comment_count | post_favorite_count | post_score | post_view_count | users_reputation | users_up_votes | users_down_votes | score_prev_acceptans | score_prev_ans | ... | age_user | title_length | num_block_code | num_i_sentences | num_wh_words | num_y_sentences | tags_popularity | num_tags | num_words | code_length |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 1136.0 | 1136.0 | 1136.0 | 1136.0 | 1136.0 | 1136.0 | 1136.0 | 1136.0 | 1136.0 | 1136.0 | ... | 1136.0 | 1136.0 | 1136.0 | 1136.0 | 1136.0 | 1136.0 | 1136.0 | 1136.0 | 1136.0 | 1136.0 |
| mean | 0.0 | 2.269366 | 0.101232 | 0.372359 | 94.213028 | 741.03081 | 138.831866 | 10.389965 | 18.926937 | 36.462148 | ... | 746.642606 | 54.65757 | 1.322183 | 3.523768 | 1.044014 | 0.009683 | 1.300176 | 3.074824 | 48.576585 | 1108.462148 |
| std | 0.0 | 2.621762 | 0.367583 | 0.807177 | 186.908657 | 2981.198762 | 854.015101 | 83.923713 | 120.629111 | 205.943237 | ... | 730.176308 | 20.403675 | 1.421695 | 2.993879 | 1.259956 | 0.114551 | 0.972141 | 1.239099 | 37.464525 | 2361.641236 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 5.0 | 0.0 | 0.0 | 0.0 | -1.0 | -3.0 | ... | 0.0 | 7.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 25% | 0.0 | 0.0 | 0.0 | 0.0 | 29.0 | 6.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 76.75 | 40.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 2.0 | 26.0 | 0.0 |
| 50% | 0.0 | 2.0 | 0.0 | 0.0 | 47.0 | 49.0 | 5.0 | 0.0 | 0.0 | 0.0 | ... | 527.5 | 52.0 | 1.0 | 3.0 | 1.0 | 0.0 | 1.0 | 3.0 | 39.5 | 324.0 |
| 75% | 0.0 | 3.0 | 0.0 | 1.0 | 101.0 | 339.0 | 47.25 | 1.0 | 2.0 | 5.0 | ... | 1281.0 | 67.0 | 2.0 | 5.0 | 2.0 | 0.0 | 2.0 | 4.0 | 60.0 | 1125.0 |
| max | 0.0 | 25.0 | 3.0 | 11.0 | 4543.0 | 58334.0 | 24787.0 | 1440.0 | 2219.0 | 4661.0 | ... | 3203.0 | 138.0 | 11.0 | 22.0 | 8.0 | 2.0 | 5.0 | 5.0 | 502.0 | 21845.0 |


In [18]:
class1.describe()

| | class | post_comment_count | post_favorite_count | post_score | post_view_count | users_reputation | users_up_votes | users_down_votes | score_prev_acceptans | score_prev_ans | ... | age_user | title_length | num_block_code | num_i_sentences | num_wh_words | num_y_sentences | tags_popularity | num_tags | num_words | code_length |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 3864.0 | 3864.0 | 3864.0 | 3864.0 | 3864.0 | 3864.0 | 3864.0 | 3864.0 | 3864.0 | 3864.0 | ... | 3864.0 | 3864.0 | 3864.0 | 3864.0 | 3864.0 | 3864.0 | 3864.0 | 3864.0 | 3864.0 | 3864.0 |
| mean | 1.0 | 2.092133 | 0.151398 | 0.60766 | 284.970497 | 1037.149845 | 165.717909 | 15.988613 | 29.446429 | 52.699534 | ... | 737.957557 | 53.713509 | 1.514234 | 3.50207 | 1.041667 | 0.01087 | 1.425207 | 3.005694 | 44.697464 | 915.528468 |
| std | 0.0 | 2.899793 | 0.60253 | 2.248779 | 1050.128249 | 6332.209177 | 608.304353 | 115.818584 | 312.560644 | 506.538845 | ... | 724.651392 | 20.376687 | 1.396087 | 2.866789 | 1.273426 | 0.110939 | 0.979519 | 1.212571 | 31.424804 | 1840.052951 |
| min | 1.0 | 0.0 | 0.0 | -8.0 | 7.0 | 0.0 | 0.0 | 0.0 | -7.0 | -7.0 | ... | -1.0 | 15.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 25% | 1.0 | 0.0 | 0.0 | 0.0 | 44.0 | 16.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 76.75 | 39.0 | 1.0 | 2.0 | 0.0 | 0.0 | 1.0 | 2.0 | 24.0 | 38.0 |
| 50% | 1.0 | 1.0 | 0.0 | 0.0 | 77.0 | 83.0 | 12.0 | 0.0 | 0.0 | 0.0 | ... | 528.0 | 51.0 | 1.0 | 3.0 | 1.0 | 0.0 | 1.0 | 3.0 | 38.0 | 359.0 |
| 75% | 1.0 | 3.0 | 0.0 | 1.0 | 208.0 | 529.25 | 84.0 | 2.0 | 4.0 | 10.0 | ... | 1219.25 | 64.0 | 2.0 | 5.0 | 2.0 | 0.0 | 2.0 | 4.0 | 57.0 | 978.25 |
| max | 1.0 | 39.0 | 19.0 | 59.0 | 38320.0 | 229974.0 | 17298.0 | 3665.0 | 11804.0 | 19751.0 | ... | 3176.0 | 149.0 | 17.0 | 25.0 | 10.0 | 2.0 | 5.0 | 5.0 | 438.0 | 25051.0 |


The resulting dataset has 5000 records, of which:
* Class 0: 1136 records
* Class 1: 3864 records
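
These counts show that the two classes are not balanced. As a quick arithmetic check, the sketch below turns the counts into class shares; the helper name `class_shares` is hypothetical.

```python
def class_shares(counts):
    """Convert per-class record counts into fractions of the whole dataset."""
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


shares = class_shares({0: 1136, 1: 3864})
for label, share in shares.items():
    print(f"class {label}: {share:.2%}")
# class 0: 22.72%
# class 1: 77.28%
```

The share of class 1 agrees with the mean of the `class` column reported by `df2.describe()` (0.7728), so any classifier trained on these features should be compared against a baseline that always predicts class 1.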