<a href="https://colab.research.google.com/github/airpods69/DeepLearningGrind/blob/main/HateSpeech.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Hate Speech

### Imports 

In [None]:
# Imports for reading and cleaning data
import pandas as pd
import re # for regex commands

from sklearn.utils import resample # handling imbalanced data

# Used to create a pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import SGDClassifier

from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

In [None]:
# from google.colab import files
# files.upload()
# files

<module 'google.colab.files' from '/usr/local/lib/python3.7/dist-packages/google/colab/files.py'>

### Data Reading And Cleaning

In [None]:
train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")
# Print each frame's shape and row count
print("Training Set:", train.shape, len(train))
print("Test Set:", test.shape, len(test))

Training Set: (31962, 3) 31962
Test Set: (17197, 2) 17197


In [None]:
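def clean_text(df, text_field):
    """Lowercase a text column and strip mentions, URLs, a leading "rt", and punctuation (modifies df in place)."""
    df[text_field] = df[text_field].str.lower()
    # Alternatives, in order: @mentions | any character that is not alphanumeric, space, or tab
    # | scheme://... URLs | a leading "rt" | any leftover "http" plus one following character
    df[text_field] = df[text_field].apply(lambda elem: re.sub(r"(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)|^rt|http.+?", "", elem))
    return df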

test_clean = clean_text(test, "tweet")
train_clean = clean_text(train, "tweet")
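
To make the effect of the cleaning step concrete, here is a minimal sketch that applies the same lowercasing and regular expression to a single string. The sample tweet and the helper name `clean_one` are made up for illustration and are not part of the dataset or the notebook.

```python
import re

# Same pattern as in clean_text above.
PATTERN = re.compile(r"(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)|^rt|http.+?")

def clean_one(text):
    """Hypothetical single-string version of clean_text."""
    # Lowercase first: the "^rt" alternative only matches a lowercase prefix.
    return PATTERN.sub("", text.lower())

sample = "RT @user123: Check this out!! https://t.co/abc #happy"
cleaned = clean_one(sample)
# Removed pieces leave their surrounding spaces behind, so split() gives the clean tokens.
print(cleaned.split())
```

Note that the pattern deletes matches without collapsing whitespace, so cleaned tweets can contain runs of spaces. CountVectorizer's default tokenizer ignores them.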

### Handling the Imbalanced data
A look at the label distribution shows that hate-speech tweets (label 1) are far rarer than other tweets (label 0): going by the counts in this notebook, 2,242 of the 31,962 training tweets. The dataset is therefore imbalanced.

To deal with this problem we can either oversample the minority class or downsample (undersample) the majority class.

Here we oversample: scikit-learn's resample function draws rows from the minority class with replacement until that class is the same size as the majority class.

In [None]:
train_majority = train_clean[train_clean.label == 0]
train_minority = train_clean[train_clean.label == 1]

train_minority_upsampled = resample(train_minority,
                                    replace = True,
                                    n_samples = len(train_majority),
                                    random_state = 123)

train_upsampled = pd.concat([train_minority_upsampled, train_majority])
train_upsampled['label'].value_counts()

1    29720
0    29720
Name: label, dtype: int64
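
For comparison, the downsampling option mentioned above can be sketched the same way: draw from the majority class without replacement until it matches the minority class. This is only an illustration on a made-up toy frame (the `toy` names are hypothetical). The notebook itself continues with the upsampled data.

```python
import pandas as pd
from sklearn.utils import resample

# Toy stand-in for the training frame: 6 majority rows (label 0), 2 minority rows (label 1).
toy = pd.DataFrame({
    "tweet": [f"tweet {i}" for i in range(8)],
    "label": [0, 0, 0, 0, 0, 0, 1, 1],
})
toy_majority = toy[toy.label == 0]
toy_minority = toy[toy.label == 1]

# Sample the majority class WITHOUT replacement, down to the minority size.
toy_majority_downsampled = resample(toy_majority,
                                    replace=False,
                                    n_samples=len(toy_minority),
                                    random_state=123)

toy_downsampled = pd.concat([toy_majority_downsampled, toy_minority])
counts = toy_downsampled["label"].value_counts()
print(counts)
```

Downsampling discards majority rows, so it trades training data for balance. Oversampling keeps every row but repeats minority rows.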

### Creating a pipeline
To keep the hate speech detection model simple, we use a scikit-learn Pipeline that feeds TF-IDF features into an SGDClassifier.

In [None]:
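pipeline_sdg = Pipeline([
    ('vect', CountVectorizer()),    # raw text -> token counts
    ('tfidf', TfidfTransformer()),  # token counts -> TF-IDF weights
    ('clf', SGDClassifier()),       # linear model trained with SGD (default hinge loss, i.e. a linear SVM)
])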
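
The first two pipeline steps together do what TfidfVectorizer (imported above but not used) does in one step. The sketch below checks this on a made-up three-document corpus. The corpus and variable names are illustrative only.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer, TfidfVectorizer
from sklearn.pipeline import make_pipeline

docs = ["happy sunday with family", "happy happy day", "rain or shine"]

# Two-step featurization, as in the pipeline above.
two_step = make_pipeline(CountVectorizer(), TfidfTransformer())
X_two = two_step.fit_transform(docs)

# One-step equivalent with default settings.
X_one = TfidfVectorizer().fit_transform(docs)

# Both produce the same sparse matrix of TF-IDF weights.
same = np.allclose(X_two.toarray(), X_one.toarray())
print(X_two.shape[0], same)
```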

### Training the hate speech detection model

#### Split data

In [None]:
X_train, X_test, y_train, y_test = train_test_split(train_upsampled['tweet'],
                                                    train_upsampled['label'],
                                                    random_state = 0)
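
One thing to keep in mind with this split: train_upsampled already contains repeated copies of minority tweets, so a random split can put copies of the same tweet on both sides, and a score measured on X_test may then be optimistic. A minimal sketch of the alternative order, splitting first and upsampling only the training part, is shown here on a made-up toy frame (all `toy` and `tr_` names are hypothetical, and this is not what the notebook does).

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Toy stand-in for train_clean: 32 majority rows (label 0) and 8 minority rows (label 1).
toy = pd.DataFrame({
    "tweet": [f"tweet {i}" for i in range(40)],
    "label": [0] * 32 + [1] * 8,
})

# 1. Split first, keeping the class ratio the same in both parts.
toy_train, toy_test = train_test_split(
    toy, test_size=0.25, stratify=toy["label"], random_state=0
)

# 2. Upsample the minority class inside the training part only.
tr_majority = toy_train[toy_train.label == 0]
tr_minority = toy_train[toy_train.label == 1]
tr_minority_upsampled = resample(
    tr_minority, replace=True, n_samples=len(tr_majority), random_state=123
)
toy_train_balanced = pd.concat([tr_minority_upsampled, tr_majority])

# The test part keeps its original imbalance and shares no rows with training.
shared_rows = set(toy_train_balanced.index) & set(toy_test.index)
balanced_counts = toy_train_balanced["label"].value_counts()
print(balanced_counts.to_dict(), len(shared_rows))
```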

#### Training, predicting, and scoring with F1

In [None]:
model = pipeline_sdg.fit(X_train, y_train)
y_predict = model.predict(X_test)

f1_score(y_test, y_predict)

0.9696605987864239
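
For reference, f1_score with its default settings reports the harmonic mean of precision and recall for the positive class (label 1). The sketch below recomputes it by hand on made-up labels to show what the number means. The label lists are illustrative, not taken from the model above.

```python
from sklearn.metrics import f1_score

# Made-up ground truth and predictions (1 = hate speech, 0 = other).
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]

# Count true positives, false positives, and false negatives for label 1.
tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))

precision = tp / (tp + fp)
recall = tp / (tp + fn)
manual_f1 = 2 * precision * recall / (precision + recall)

library_f1 = f1_score(y_true, y_pred)
print(manual_f1, library_f1)
```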

In [None]:
print(X_test)

1095                     bihday brother cool thebomb      
9766     gf asked me to make an account i told her i di...
26909         sunday   withyou happiness family  vinallop 
1386     been feeling low for ages and when the one per...
4616     chaplin  the dictator speech  via   theresista...
                               ...                        
17825    im looking forward to a few titles being annou...
30707    lol these pasty cakes telling the rockette to ...
15560    gorgeous evening family   lakesimcoe  willow b...
30566    rain or shine this treasure hunt is gonna go d...
9278     woman catcalled walking in newyork shuts down ...
Name: tweet, Length: 14860, dtype: object


In [None]:
# predict expects an iterable of strings, so pass a list of sample texts
samples = ['fuck off', 'hey', 'I hate you']

In [None]:
print(model.predict(samples))

[1 0 1]
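
Since the model was trained on cleaned tweets, new text is probably best passed through the same cleaning before prediction. The sketch below shows that flow end to end on a tiny made-up training set. The texts, labels, and the helper `clean_one` are all hypothetical, and a model fitted on four sentences is only a demonstration of the calls, not a usable classifier.

```python
import re
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline

# Same pattern as in clean_text above.
PATTERN = re.compile(r"(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)|^rt|http.+?")

def clean_one(text):
    """Hypothetical single-string version of clean_text."""
    return PATTERN.sub("", text.lower())

# Tiny made-up training set (1 = hate speech, 0 = other).
toy_texts = ["i love this sunny day", "what a lovely family evening",
             "i hate you so much", "you are awful and i hate this"]
toy_labels = [0, 0, 1, 1]

toy_model = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', SGDClassifier(random_state=0)),
]).fit(toy_texts, toy_labels)

# Clean new text the same way as the training data, then predict on a list.
new_texts = ["Hey @friend, lovely day!", "I HATE you!!!"]
cleaned = [clean_one(t) for t in new_texts]
preds = toy_model.predict(cleaned)
print(cleaned, preds)
```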
