# Data analysis

## Importing dependencies

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

## Reading the data

In [2]:
df = pd.read_csv(r'C:\Users\jpfca\Downloads\bq-results-20220725-121025-1658751889618.csv')
df.head()

Unnamed: 0,title,body,tags
0,Using Components folder instead of Pages,<p>With Blazor being component based and compo...,directory|architecture|components|blazor
1,Select data from sqlite3 before or after a cer...,<p>I wan to <strong>select</strong> the data b...,javascript|database|typescript|sqlite|typeorm
2,Listen to Firebase Firestore data changes for ...,<p>Let's say I have a Firebase firestore datab...,javascript|reactjs|firebase|react-native|googl...
3,How to decode a base64 image and getting It's ...,<p>newbie here. I've been working on an image ...,python|tensorflow|machine-learning|base64|fastapi
4,Pods not found while using kubectl port-forward,<p>I want to forward the ports</p>\n<pre><code...,kubernetes|kubectl


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 500000 entries, 0 to 499999
Data columns (total 3 columns):
 #   Column  Non-Null Count   Dtype 
---  ------  --------------   ----- 
 0   title   500000 non-null  object
 1   body    500000 non-null  object
 2   tags    499994 non-null  object
dtypes: object(3)
memory usage: 11.4+ MB


The dataset has three attributes, all of object dtype (the *tags* attribute is the class label).

However, some samples have empty tags. Let's check which ones.

In [4]:
tags_vazias = df[df['tags'].isna()]
tags_vazias

Unnamed: 0,title,body,tags
45182,How to avoid null pointer exception from fires...,<p>I have a firebase application which loads p...,
222104,ruby function is returning nil when it should not,<p>I have a written a ruby code that take two ...,
327727,Map Interface Methods. first things first< I w...,<p>// --------------Map Interface Methods-----...,
375230,Nan loss value after few epochs with Contrasti...,<p>I used a Siamese network with contrastive l...,
387970,Why is the Button Null?,<p>I'm receiving a NullPointerException. It sa...,
394739,Why are NaNs produced for pchisq?,<p>i was using serial.test to check for autoco...,


In [5]:
round((len(tags_vazias)/len(df))*100, 5)

0.0012

All of them are questions about specific technologies or languages, so tags would be expected in these cases. Since their tags are empty and they account for only 0.0012% of the samples in the dataset, we can remove them.

In [6]:
df_sem_vazios = df.dropna()
df_sem_vazios.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 499994 entries, 0 to 499999
Data columns (total 3 columns):
 #   Column  Non-Null Count   Dtype 
---  ------  --------------   ----- 
 0   title   499994 non-null  object
 1   body    499994 non-null  object
 2   tags    499994 non-null  object
dtypes: object(3)
memory usage: 15.3+ MB
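
Note that `dropna()` keeps the original index labels, which is why the index above still runs from 0 to 499999 with only 499994 entries. A minimal sketch on a toy frame (the data and the names `toy` and `cleaned` are illustrative, not taken from the dataset) that restricts the drop to the `tags` column and renumbers the rows:

```python
import pandas as pd

# Toy data standing in for the real dataset (hypothetical values).
toy = pd.DataFrame({
    "title": ["a", "b", "c"],
    "body": ["x", "y", "z"],
    "tags": ["python", None, "r|ggplot2"],
})

# Drop only the rows whose tags are missing, then renumber the index from 0.
cleaned = toy.dropna(subset=["tags"]).reset_index(drop=True)

print(cleaned.index.tolist())
print(cleaned["tags"].tolist())
```

Whether to reset the index is a design choice: keeping the original labels preserves traceability to the source file, while a fresh index makes positional loops safe.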


Although all the data is textual, there may still be duplicate samples.

In [7]:
df_sem_duplicados = df_sem_vazios.drop_duplicates()
df_sem_duplicados.count()

title    499994
body     499994
tags     499994
dtype: int64
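
The counts match the previous step, so no fully duplicated rows were removed. One way to make that explicit is to count the duplicates directly. The sketch below uses a toy frame (illustrative data, not the dataset):

```python
import pandas as pd

# Toy data: the third row repeats the first one exactly (hypothetical values).
toy = pd.DataFrame({
    "title": ["q1", "q2", "q1"],
    "body": ["b1", "b2", "b1"],
    "tags": ["python", "r", "python"],
})

# duplicated() marks every repeat of an earlier, fully identical row.
n_duplicates = int(toy.duplicated().sum())
deduplicated = toy.drop_duplicates()

print(n_duplicates, len(deduplicated))
```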

Because the values in the body column were originally written in *Markdown*, they contain special characters (escape sequences such as `\n`) and HTML tags. It is worth removing them to clean up the question bodies, since these characters and tags affect only the formatting, not the content of the question.

In [8]:
df_sem_duplicados['body'] = df_sem_duplicados['body'].str.replace(r'<[^<>]*>|\\', '', regex=True)
df_sem_duplicados['body'] = df_sem_duplicados['body'].str.replace(r'\n', ' ', regex=True)

In [9]:
df_sem_duplicados.head()

Unnamed: 0,title,body,tags
0,Using Components folder instead of Pages,With Blazor being component based and componen...,directory|architecture|components|blazor
1,Select data from sqlite3 before or after a cer...,I wan to select the data before or after a cer...,javascript|database|typescript|sqlite|typeorm
2,Listen to Firebase Firestore data changes for ...,Let's say I have a Firebase firestore database...,javascript|reactjs|firebase|react-native|googl...
3,How to decode a base64 image and getting It's ...,newbie here. I've been working on an image cla...,python|tensorflow|machine-learning|base64|fastapi
4,Pods not found while using kubectl port-forward,I want to forward the ports kubectl port-forwa...,kubernetes|kubectl
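
To see what the two substitutions do, here is a small sketch on a made-up string (the sample text and the names `sample`, `no_tags` and `flat` are illustrative). It applies the same two patterns used above:

```python
import pandas as pd

# Hypothetical question body with tags, an HTML entity and a newline.
sample = pd.Series(["<p>if a &lt; b</p>\n<pre><code>x = 1</code></pre>"])

# Remove anything that looks like an HTML tag, plus literal backslashes.
no_tags = sample.str.replace(r"<[^<>]*>|\\", "", regex=True)
# Replace newline characters with a space.
flat = no_tags.str.replace(r"\n", " ", regex=True)

print(flat[0])
```

HTML entities such as `&lt;` are not tags, so these patterns leave them untouched. If that matters for later steps, they would need separate handling.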


## Class analysis

In [10]:
df_tags = df_sem_duplicados.copy()

Overall, the number of distinct tag combinations (the full `tags` string of each question) corresponds to 57.9% of the number of questions.

In [11]:
df_tags.tags.nunique()/len(df_tags)*100

57.89889478673744
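
A toy illustration of what this ratio measures (hypothetical data): `nunique()` treats each full `tags` string as one value, so `python` and `python|pandas` count as two different values.

```python
import pandas as pd

# Four questions, three distinct tag strings (hypothetical values).
toy_tags = pd.Series(["python", "python|pandas", "python", "r"])

# Distinct tag strings divided by the number of questions, as a percentage.
ratio = toy_tags.nunique() / len(toy_tags) * 100
print(ratio)
```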

We can check how often each tag combination appears.

In [12]:
df_tags.tags.value_counts()

python                                                             2786
r                                                                  2742
python|pandas                                                      2621
javascript                                                         2079
reactjs                                                            2050
                                                                   ... 
laravel|eloquent|phpdoc|documentation-generation|phpdocumentor2       1
bash|apt|apt-get                                                      1
access-token|insomnia                                                 1
android|gradle|android-gradle-plugin|gradle-kotlin-dsl                1
c|pointers|linked-list|mergesort                                      1
Name: tags, Length: 289491, dtype: int64

In [39]:
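from collections import Counter

all_tags = []

# NOTE: df_tags.tags[i] looks rows up by index label, not by position.
# Labels removed by dropna() raise KeyError and are skipped, and labels
# at or beyond len(df_tags) are never visited by this loop.
for i in range(len(df_tags)):
    try:
        all_tags.extend(df_tags.tags[i].split('|'))
    except KeyError:
        continue

all_tags_count = Counter(all_tags)

print('Total number of tags: ', len(all_tags))
print('Number of unique tags: ', len(all_tags_count.keys()))

Total number of tags:  1439906
Number of unique tags:  35730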
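
The loop above looks rows up by index label, so after rows have been dropped its result depends on which labels remain. A label-independent alternative, sketched here on toy data (the names and values are illustrative), is to split each string and flatten the result with `explode`:

```python
from collections import Counter

import pandas as pd

# Toy tags with gaps in the index, as left behind by dropna() (hypothetical).
toy_tags = pd.Series(["python|pandas", "r", "python"], index=[0, 2, 5])

# Split each string into a list of tags, then give every tag its own row.
exploded = toy_tags.str.split("|").explode()
tag_counts = Counter(exploded)

print(len(exploded), len(tag_counts))
print(tag_counts.most_common(1))
```

Because this never indexes by label, every row is counted regardless of gaps in the index. Applied to the real data, its totals could differ slightly from the figures printed above.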
