#### Loading data

We start by importing the packages we are going to use:

In [1]:
import pandas as pd

import string
from nltk import download
from nltk.corpus import stopwords
download('stopwords')

from sklearn.feature_extraction.text import CountVectorizer

from sklearn.neural_network import MLPClassifier

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/athena/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


We load the datasets:

In [2]:
train_data = pd.read_csv("../data/train.csv")
test_data = pd.read_csv("../data/test.csv")

# For now, drop the keyword and location columns (plus id for the training set;
# the test set keeps id because the submission file needs it)
train_data = train_data.drop(['id', 'keyword', 'location'], axis=1)
test_data = test_data.drop(['keyword', 'location'], axis=1)

train_data

Unnamed: 0,text,target
0,Our Deeds are the Reason of this #earthquake M...,1
1,Forest fire near La Ronge Sask. Canada,1
2,All residents asked to 'shelter in place' are ...,1
3,"13,000 people receive #wildfires evacuation or...",1
4,Just got sent this photo from Ruby #Alaska as ...,1
...,...,...
7608,Two giant cranes holding a bridge collapse int...,1
7609,@aria_ahrary @TheTawniest The out of control w...,1
7610,M1.94 [01:04 UTC]?5km S of Volcano Hawaii. htt...,1
7611,Police investigating after an e-bike collided ...,1


We clean the text by removing punctuation characters and stopwords:

In [4]:
# Build the stopword set once; set membership is faster than scanning a list per word
ALL_STOPWORDS = set(stopwords.words('english')) | {'u', 'ü', 'ur', '4', '2', 'im', 'dont', 'doin', 'ure'}

def process_text(raw):
    # Remove punctuation characters
    no_punct = ''.join(char for char in raw if char not in string.punctuation)

    # Remove stopwords (matched case-insensitively; kept words retain their casing)
    no_stopwords = [word for word in no_punct.split() if word.lower() not in ALL_STOPWORDS]

    return ' '.join(no_stopwords)

train_data['clean_text'] = train_data['text'].apply(process_text)
test_data['clean_text'] = test_data['text'].apply(process_text)

train_data

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  del sys.path[0]


Unnamed: 0,text,target,clean_text
0,Our Deeds are the Reason of this #earthquake M...,1,Deeds Reason earthquake May ALLAH Forgive us
1,Forest fire near La Ronge Sask. Canada,1,Forest fire near La Ronge Sask Canada
2,All residents asked to 'shelter in place' are ...,1,residents asked shelter place notified officer...
3,"13,000 people receive #wildfires evacuation or...",1,13000 people receive wildfires evacuation orde...
4,Just got sent this photo from Ruby #Alaska as ...,1,got sent photo Ruby Alaska smoke wildfires pou...
...,...,...,...
95,9 Mile backup on I-77 South...accident blockin...,1,9 Mile backup I77 Southaccident blocking Right...
96,Has an accident changed your life? We will hel...,0,accident changed life help determine options f...
97,#BREAKING: there was a deadly motorcycle car a...,1,BREAKING deadly motorcycle car accident happen...
98,@flowri were you marinading it or was it an ac...,0,flowri marinading accident
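
To make the effect of the cleaning step concrete, here is a minimal sketch of the same two-stage logic on single strings. The names `DEMO_STOPWORDS` and `process_text_demo` are hypothetical, and the tiny hand-picked stopword set is an assumption that stands in for NLTK's English list so the example runs without downloading anything.

```python
import string

# Assumption: a small hand-picked stopword set replaces NLTK's English list
# so that this example runs without downloading any corpus.
DEMO_STOPWORDS = {'our', 'are', 'the', 'of', 'this', 'u', 'im', 'dont'}

def process_text_demo(raw, stopword_set=DEMO_STOPWORDS):
    # Drop every character in string.punctuation (this includes '#', '@', '-' and '.')
    no_punct = ''.join(ch for ch in raw if ch not in string.punctuation)
    # Keep words whose lowercase form is not a stopword; original casing is preserved
    kept = [w for w in no_punct.split() if w.lower() not in stopword_set]
    return ' '.join(kept)

cleaned = process_text_demo("Our Deeds are the Reason of this #earthquake!")
fused = process_text_demo("I-77 South...accident")

print(cleaned)  # Deeds Reason earthquake
print(fused)    # I77 Southaccident
```

The second call shows a side effect that is also visible in row 95 of the table above: because punctuation is deleted rather than replaced with a space, words joined only by punctuation are fused into a single token.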


We build count vectors for both the train and test data using scikit-learn's **CountVectorizer**. Note that we fit the vectorizer on the train text only and then use it to transform both the train and test data, so tokens that appear only in the test set are ignored. If the fitted vocabulary contains N unique tokens, each tweet becomes a vector of length N whose values are the token counts. Passing `stop_words='english'` also applies scikit-learn's built-in English stopword list on top of the cleaning above:

In [5]:
cvec = CountVectorizer(stop_words='english')
cvec.fit(train_data['clean_text'])
X_train = cvec.transform(train_data['clean_text'])
X_test = cvec.transform(test_data['clean_text'])
y_train = train_data['target']

X_train.shape, X_test.shape, y_train.shape

((100, 553), (3263, 553), (100,))
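
The fit-on-train, transform-both pattern can be seen on a toy corpus. This is only an illustrative sketch: the documents and the names `train_docs`, `test_docs` and `toy_vec` are made up, and the vectorizer uses default settings rather than the `stop_words='english'` option from the cell above.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpora, not taken from the competition data
train_docs = ["forest fire near town", "fire fire evacuation"]
test_docs = ["flood near town", "fire"]

toy_vec = CountVectorizer()
toy_vec.fit(train_docs)  # the vocabulary is learned from the train documents only

vocab = sorted(toy_vec.vocabulary_)
X_toy_train = toy_vec.transform(train_docs)
X_toy_test = toy_vec.transform(test_docs)

print(vocab)                  # ['evacuation', 'fire', 'forest', 'near', 'town']
print(X_toy_train.toarray())  # [[0 1 1 1 1], [1 2 0 0 0]]
print(X_toy_test.toarray())   # [[0 0 0 1 1], [0 1 0 0 0]]
```

Both matrices have five columns, one per train token. The word "flood" occurs only in the test documents, so it has no column and is silently dropped.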

#### Neural network

We will train a multilayer perceptron using the **MLPClassifier** available in scikit-learn, with two hidden layers of 1000 and 100 units:

In [6]:
nn = MLPClassifier(hidden_layer_sizes=(1000, 100), max_iter=10000, verbose=1)

We fit it on all of the loaded training data, without holding out a validation set:

In [7]:
nn.fit(X_train, y_train)

Iteration 1, loss = 0.69664965
Iteration 2, loss = 0.62496169
Iteration 3, loss = 0.56261060
Iteration 4, loss = 0.50296276
Iteration 5, loss = 0.44449089
Iteration 6, loss = 0.38694254
Iteration 7, loss = 0.33130532
Iteration 8, loss = 0.27858193
Iteration 9, loss = 0.22998522
Iteration 10, loss = 0.18669301
Iteration 11, loss = 0.14938047
Iteration 12, loss = 0.11811356
Iteration 13, loss = 0.09257406
Iteration 14, loss = 0.07213433
Iteration 15, loss = 0.05600717
Iteration 16, loss = 0.04343694
Iteration 17, loss = 0.03373616
Iteration 18, loss = 0.02628152
Iteration 19, loss = 0.02056079
Iteration 20, loss = 0.01617013
Iteration 21, loss = 0.01280048
Iteration 22, loss = 0.01020577
Iteration 23, loss = 0.00820645
Iteration 24, loss = 0.00665982
Iteration 25, loss = 0.00546005
Iteration 26, loss = 0.00452414
Iteration 27, loss = 0.00379103
Iteration 28, loss = 0.00321322
Iteration 29, loss = 0.00275552
Iteration 30, loss = 0.00239032
Iteration 31, loss = 0.00209712
Iteration 32, los

MLPClassifier(hidden_layer_sizes=(1000, 100), max_iter=10000, verbose=1)
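
The per-iteration losses printed by `verbose=1` are also stored on the fitted estimator, which can be handy for plotting or inspection. The sketch below uses a hypothetical toy count matrix and a much smaller network than the one above, purely to keep it fast; `X_demo`, `y_demo` and `demo_nn` are illustrative names.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.neural_network import MLPClassifier

# Hypothetical toy count matrix: 6 "documents" x 4 "tokens", stored sparse
# like the output of CountVectorizer.transform
X_demo = csr_matrix(np.array([
    [2, 0, 1, 0],
    [1, 0, 2, 0],
    [3, 0, 0, 0],
    [0, 1, 0, 2],
    [0, 2, 0, 1],
    [0, 0, 0, 3],
]))
y_demo = np.array([1, 1, 1, 0, 0, 0])

# Assumption: small hidden layers and a fixed seed, chosen only for this demo
demo_nn = MLPClassifier(hidden_layer_sizes=(8, 4), max_iter=500, random_state=0)
demo_nn.fit(X_demo, y_demo)

print(demo_nn.n_iter_)           # number of training iterations actually run
print(len(demo_nn.loss_curve_))  # one recorded loss value per iteration
demo_pred = demo_nn.predict(X_demo)
```

A training loss that approaches zero, as in the log above, only shows that the network fits the training tweets; it says nothing by itself about accuracy on unseen data.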

Finally, we generate the predictions and save them for submission:

In [8]:
y_pred = nn.predict(X_test)

output = pd.DataFrame({'id': test_data['id'], 'target': y_pred})
output.to_csv('predictions/nnets.csv', index=False)
print("Submission successfully saved!")

Submission successfully saved!
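
Note that `to_csv` does not create missing directories, so the cell above relies on a `predictions/` folder already existing. A minimal sketch of creating it first is shown below; it writes to a temporary directory, and `demo_ids` and `demo_pred` are hypothetical stand-ins for `test_data['id']` and `y_pred`.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical stand-ins for test_data['id'] and y_pred
demo_ids = [0, 2, 3]
demo_pred = [1, 0, 1]

with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / 'predictions' / 'nnets.csv'
    # Create the output directory if it is missing; do nothing if it exists
    out_path.parent.mkdir(parents=True, exist_ok=True)
    pd.DataFrame({'id': demo_ids, 'target': demo_pred}).to_csv(out_path, index=False)
    # Read the file back to confirm the layout: two columns, no index column
    written = pd.read_csv(out_path)

print(written.columns.tolist())  # ['id', 'target']
```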
