# Importing and initially organizing the data

In [1]:
import pandas as pd
# Show up to 300 characters per cell so long tweets are not truncated
pd.set_option('display.max_colwidth', 300)

# 1 - Importing the file
> [Dataset source](https://www.crowdflower.com/wp-content/uploads/2016/03/twitter-hate-speech-classifier-DFE-a845520.csv)

Select the relevant columns and organize the DataFrame

In [2]:
data = pd.read_csv(
    "./arquivos/twitter-hate-speech-classifier-DFE-a845520.csv", 
    encoding='ISO-8859-1', 
    usecols=[
        'tweet_text', 
        'does_this_tweet_contain_hate_speechconfidence', 
        'does_this_tweet_contain_hate_speech:confidence',
        'does_this_tweet_contain_hate_speech'])

In [3]:
data = data.rename(columns={
    'does_this_tweet_contain_hate_speech:confidence':'confidence',
    'does_this_tweet_contain_hate_speechconfidence':'hate_speech:confidence'
})

In [4]:
data = data.reindex(columns=[
    'tweet_text', 
    'hate_speech:confidence',
    'does_this_tweet_contain_hate_speech',
    'confidence'])
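
The select, rename and reorder steps can be tried on a small in-memory sample before running them on the real file. The sketch below is only an illustration: the four dataset column names come from the cells above, while `extra_column` and both rows are invented.

```python
import io
import pandas as pd

# Hypothetical two-row sample; `extra_column` and the row values are made up.
sample_csv = io.StringIO(
    "extra_column,tweet_text,does_this_tweet_contain_hate_speech,"
    "does_this_tweet_contain_hate_speech:confidence,"
    "does_this_tweet_contain_hate_speechconfidence\n"
    "1,first example,The tweet is not offensive,0.9,\n"
    "2,second example,The tweet contains hate speech,0.6,1.0\n"
)

# usecols drops every column that is not listed (here: extra_column)
sample = pd.read_csv(
    sample_csv,
    usecols=[
        'tweet_text',
        'does_this_tweet_contain_hate_speechconfidence',
        'does_this_tweet_contain_hate_speech:confidence',
        'does_this_tweet_contain_hate_speech',
    ],
)

# Shorten the two confidence column names
sample = sample.rename(columns={
    'does_this_tweet_contain_hate_speech:confidence': 'confidence',
    'does_this_tweet_contain_hate_speechconfidence': 'hate_speech:confidence',
})

# Put the columns in the order used in the rest of the notebook
sample = sample.reindex(columns=[
    'tweet_text',
    'hate_speech:confidence',
    'does_this_tweet_contain_hate_speech',
    'confidence',
])

print(sample.columns.tolist())
```

Note that `reindex` silently creates an all-`NaN` column for any name that does not exist, so a typo in the rename step shows up as an empty column rather than as an error.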

In [5]:
data.head()

Unnamed: 0,tweet_text,hate_speech:confidence,does_this_tweet_contain_hate_speech,confidence
0,Warning: penny boards will make you a faggot,1.0,The tweet uses offensive language but not hate speech,0.6013
1,Fuck dykes,1.0,The tweet contains hate speech,0.7227
2,@sizzurp__ @ILIKECATS74 @yoPapi_chulo @brandonernandez @bootyacid at least i dont look like jefree starr faggot,1.0,The tweet contains hate speech,0.5229
3,"""@jayswaggkillah: ""@JacklynAnnn: @jayswaggkillah Is a fag"" jackie jealous"" Neeeee",1.0,The tweet contains hate speech,0.5184
4,@Zhugstubble You heard me bitch but any way I'm back th texas so wtf u talking about bitch ass nigga,1.0,The tweet uses offensive language but not hate speech,0.5185


In [6]:
data.describe()

Unnamed: 0,hate_speech:confidence,confidence
count,67.0,14509.0
mean,1.0,0.865844
std,0.0,0.178734
min,1.0,0.3333
25%,1.0,0.6684
50%,1.0,1.0
75%,1.0,1.0
max,1.0,1.0
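
The `count` row reports 67 values for `hate_speech:confidence` against 14509 for `confidence`, so the first column is empty for most rows. Counting missing values per column makes this explicit. A minimal sketch on an invented four-row frame (the values are not from the dataset):

```python
import numpy as np
import pandas as pd

# Invented values, used only to show the two counting calls
toy = pd.DataFrame({
    'hate_speech:confidence': [1.0, np.nan, np.nan, np.nan],
    'confidence': [0.60, 0.72, 1.0, 1.0],
})

missing = toy.isna().sum()   # number of NaN entries per column
present = toy.count()        # number of non-NaN entries per column

print(missing)
print(present)
```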


# 2 - Creating new columns

## Labels

Column that summarizes each tweet's classification as a single letter

In [7]:
labels = {
    'The tweet uses offensive language but not hate speech': 'O',
    'The tweet contains hate speech': 'H',
    'The tweet is not offensive': 'N'
}
# Look the category up by column name rather than by position in the row
data['labels'] = data['does_this_tweet_contain_hate_speech'].map(labels)
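
`Series.map` with a dictionary returns `NaN` for any category that is missing from the dictionary, which makes unexpected values easy to spot. A minimal sketch, assuming an invented three-row frame in which the third category is deliberately unknown:

```python
import pandas as pd

labels = {
    'The tweet uses offensive language but not hate speech': 'O',
    'The tweet contains hate speech': 'H',
    'The tweet is not offensive': 'N',
}

# Invented rows; the last category is intentionally absent from `labels`
toy = pd.DataFrame({
    'does_this_tweet_contain_hate_speech': [
        'The tweet is not offensive',
        'The tweet contains hate speech',
        'Some unexpected category',
    ]
})

toy['labels'] = toy['does_this_tweet_contain_hate_speech'].map(labels)

# Rows whose category had no entry in the dictionary
unmapped = toy[toy['labels'].isna()]
print(unmapped)
```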

Remove the columns that no longer provide relevant information

In [8]:
data = data.drop(columns=['does_this_tweet_contain_hate_speech', 'hate_speech:confidence'])

## Length
Column that gives the number of characters in each tweet

In [9]:
# Count the characters in each tweet by column name rather than by position
data['length'] = data['tweet_text'].str.len()
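
`Series.str.len` counts characters per row and propagates missing values as `NaN` instead of raising an error. A minimal sketch on an invented series (the strings are not from the dataset):

```python
import pandas as pd

# Invented strings; the last entry is missing on purpose
texts = pd.Series(['hello', 'ok', None])

lengths = texts.str.len()
print(lengths)
```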

In [10]:
data.describe()

Unnamed: 0,confidence,length
count,14509.0,14509.0
mean,0.865844,88.857606
std,0.178734,38.724537
min,0.3333,4.0
25%,0.6684,55.0
50%,1.0,89.0
75%,1.0,125.0
max,1.0,469.0


# 3 - Example of filtering the data

In [11]:
# Keep only the rows labelled as hate speech and renumber them from 0
hate_data = data[data['labels'] == 'H'].reset_index(drop=True)

In [12]:
hate_data

Unnamed: 0,tweet_text,confidence,labels,length
0,Fuck dykes,0.7227,H,10
1,@sizzurp__ @ILIKECATS74 @yoPapi_chulo @brandonernandez @bootyacid at least i dont look like jefree starr faggot,0.5229,H,111
2,"""@jayswaggkillah: ""@JacklynAnnn: @jayswaggkillah Is a fag"" jackie jealous"" Neeeee",0.5184,H,81
3,"@elaynay your a dirty terrorist and your religion is a fucking joke, you go around screaming Allah akbar doing terrorist shit. Dirty faggot.",0.8816,H,140
4,RT @ivanrabago_: @_WhitePonyJr_ looking like faggots?,0.5207,H,53
...,...,...,...,...
2394,@PoliticsPeach @ArmAndProtect @AbolishWelfare @NIX1331_ @Likaveli The coon preachers that met w Trump also still have followers. Travesty _Ì«ÌÐå©,0.6718,H,145
2395,@Huscoon Coon~,0.6598,H,14
2396,@AS_Waffloid *u dam coon,0.6717,H,24
2397,Ben Carson thinks it's okay to have A confederate flag hanging on your property this nigglet is the biggest coon ever lol where his mother?,0.6684,H,139
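
Filters like this one can be combined with `&` and `|` on boolean masks. The sketch below is an illustration only: the rows are invented and `min_confidence` is a hypothetical threshold, not a value used in this notebook.

```python
import pandas as pd

# Invented rows with the same column names as `data`
toy = pd.DataFrame({
    'tweet_text': ['a', 'bb', 'ccc', 'dddd'],
    'confidence': [0.52, 0.88, 1.0, 0.67],
    'labels': ['H', 'H', 'N', 'O'],
})

min_confidence = 0.6  # hypothetical cut-off chosen for the example

# Each comparison yields a boolean Series; parentheses are required around them
mask = (toy['labels'] == 'H') & (toy['confidence'] >= min_confidence)
confident_hate = toy[mask].reset_index(drop=True)

print(confident_hate)
```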


Relate the confidence score to the number of characters in each tweet

In [13]:
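# Build a confidence -> length lookup; tweets that share a confidence value
# collapse into a single entry, so this is a rough view, not a correlation
graph = dict(zip(hate_data['confidence'], hate_data.length))
graph = pd.DataFrame.from_dict(graph, orient='index')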

In [14]:
graph.sort_index()

Unnamed: 0,0
0.3365,50
0.3367,37
0.3368,72
0.3382,94
0.3383,58
...,...
0.8435,23
0.8816,140
0.8826,105
0.9221,57
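
A dictionary keyed by confidence keeps only one length per distinct confidence value, so the table above has fewer rows than `hate_data`. If the goal is a correlation coefficient, `Series.corr` (Pearson by default) uses every row. A minimal sketch on invented values, not on the dataset:

```python
import pandas as pd

# Invented values; two rows share the same confidence on purpose
toy = pd.DataFrame({
    'confidence': [0.52, 0.52, 0.67, 0.88],
    'length': [10, 111, 53, 140],
})

# The dict approach keeps only the last length seen for each confidence value
collapsed = dict(zip(toy['confidence'], toy['length']))
print(len(toy), len(collapsed))

# Pearson correlation computed over all rows
pearson = toy['confidence'].corr(toy['length'])
print(pearson)
```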


# 4 - Saving the data

In [15]:
data.to_csv('./arquivos/data.csv', encoding='ISO-8859-1', index=False)
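
One way to check the saved file is to read it back with the same encoding and compare it with the frame in memory. The sketch below assumes an invented two-row frame and a temporary directory instead of `./arquivos/`.

```python
import os
import tempfile
import pandas as pd

# Invented rows with the same column names as `data`
toy = pd.DataFrame({
    'tweet_text': ['café', 'plain text'],
    'confidence': [0.7227, 1.0],
    'labels': ['H', 'N'],
    'length': [4, 10],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'data.csv')
    # index=False keeps the row index out of the file
    toy.to_csv(path, encoding='ISO-8859-1', index=False)
    restored = pd.read_csv(path, encoding='ISO-8859-1')

# Raises AssertionError if the frames differ beyond floating-point tolerance
pd.testing.assert_frame_equal(restored, toy)
```

Characters that ISO-8859-1 cannot represent make `to_csv` raise a `UnicodeEncodeError`, so the encoding used for writing has to cover everything in `tweet_text`.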