<a href="https://www.kaggle.com/code/appleturnovers/nlp-intro?scriptVersionId=98269688" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

In [1]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from sklearn import feature_extraction, linear_model, model_selection, preprocessing
from xgboost import XGBClassifier

In [2]:
train_df = pd.read_csv("/kaggle/input/nlp-getting-started/train.csv")
test_df = pd.read_csv("/kaggle/input/nlp-getting-started/test.csv")

#### **Taking a quick look at our data**  
Let's first take a look at an example of what is NOT a disaster tweet.

In [3]:
train_df[train_df["target"] == 0]["text"].values[1]

'I love fruits'

Next, one that is:

In [4]:
train_df[train_df["target"] == 1]["text"].values[1]

'Forest fire near La Ronge Sask. Canada'

#### **Word Vectorization**  
This model works on the assumption that the words contained in each tweet are a good indicator of whether it is about a real disaster or not. While this assumption certainly isn't entirely correct, it is a fantastic place for us to start.  
  
  
We will begin by counting the words in each tweet and converting them into data that the machine learning algorithm can understand.
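
To make the counting step concrete, here is a minimal pure-Python sketch of a bag-of-words encoding. It assumes the same defaults that CountVectorizer uses (lowercasing, and tokens of two or more word characters); the helper names and the toy tweets are hypothetical and are not part of the competition data.

```python
import re
from collections import Counter

# Same default token pattern as CountVectorizer: runs of 2+ word characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def tokenize(text):
    """Lowercase the text and split it into tokens of two or more characters."""
    return TOKEN_PATTERN.findall(text.lower())


def bag_of_words(texts):
    """Return (sorted vocabulary, list of count rows), one row per text."""
    vocabulary = sorted({token for text in texts for token in tokenize(text)})
    column = {token: i for i, token in enumerate(vocabulary)}
    rows = []
    for text in texts:
        row = [0] * len(vocabulary)
        for token, n in Counter(tokenize(text)).items():
            row[column[token]] = n
        rows.append(row)
    return vocabulary, rows


# Hypothetical toy tweets, used only for illustration.
toy_tweets = ["Fire near the forest", "I love the forest", "Fire fire everywhere"]
vocabulary, rows = bag_of_words(toy_tweets)
print(vocabulary)  # ['everywhere', 'fire', 'forest', 'love', 'near', 'the']
for row in rows:
    print(row)
```

Each row has one column per vocabulary entry, so every tweet becomes a vector of the same length no matter how many words it contains.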

In [5]:
count_vectorizer = feature_extraction.text.CountVectorizer()

example_train_vectors = count_vectorizer.fit_transform(train_df["text"][0:5])

In [6]:
print(example_train_vectors[0].todense().shape)
print(example_train_vectors[0].todense())

(1, 54)
[[0 0 0 1 1 1 0 0 0 0 0 0 1 1 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 1 0
  0 0 0 1 0 0 0 0 0 0 0 0 0 1 1 0 1 0]]


This tells us two things:

1. There are 54 unique tokens (distinct lowercased words of two or more characters) within the first five tweets.
2. The first of these tweets contains only some of those tokens; each nonzero value above marks one that it contains.

Now we will vectorize the entirety of the train set.

In [7]:
train_vectors = count_vectorizer.fit_transform(train_df["text"])
test_vectors = count_vectorizer.transform(test_df["text"])
# Note: we call .transform() (not .fit_transform()) on the test set so that it is
# encoded with the same vocabulary, and the same columns, learned from the train set.

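#### **Building the model**  
We are building the model on the idea that the words contained in a tweet are a good indicator of whether or not it is about a real disaster.  
A linear model would be the simplest way to express that assumption. Here we use XGBoost's XGBClassifier instead: a gradient-boosted tree ensemble, which is not linear and can also pick up interactions between words.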

In [8]:
clf = XGBClassifier()

To test our model, we can use cross-validation: train the model on one portion of the labelled data and use the remaining portion to validate it. Repeating this several times with different portions held out should give us a good idea of how a particular model or method performs on the data.
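
The splitting idea can be sketched in a few lines of plain Python. This is a simplified illustration with a hypothetical helper, k_fold_indices, not what scikit-learn does internally: for a classifier and an integer cv, cross_val_score uses stratified folds, which also keep the class balance similar in every fold.

```python
def k_fold_indices(n_samples, n_folds):
    """Yield (train_indices, validation_indices) for each fold in turn."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most one.
    base, extra = divmod(n_samples, n_folds)
    start = 0
    for fold in range(n_folds):
        size = base + (1 if fold < extra else 0)
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size


# Every sample is used for validation exactly once across the folds.
for train, validation in k_fold_indices(n_samples=7, n_folds=3):
    print("train:", train, "validation:", validation)
```

With cv = 3, the model is therefore trained three times, each time on about two thirds of the data, and scored on the remaining third.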

In [9]:
scores = model_selection.cross_val_score(clf, train_vectors, train_df["target"], cv = 3, scoring = "f1")
scores

array([0.59326868, 0.53601695, 0.64097363])
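
Each value above is the F1 score for one fold, as requested with scoring = "f1". As a rough sketch of what that metric measures, the hypothetical helper below computes the binary F1 score (the harmonic mean of precision and recall) by hand on made-up labels.

```python
def f1_score_binary(y_true, y_pred):
    """F1 for the positive class (label 1): harmonic mean of precision and recall."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        # No true positives: precision and recall are both zero (or undefined).
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Made-up labels: 2 true positives, 1 false positive, 1 false negative.
y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 1, 0]
print(f1_score_binary(y_true, y_pred))  # precision = recall = 2/3, so F1 = 2/3
```

Unlike plain accuracy, F1 ignores true negatives, so a model cannot score well just by predicting "not a disaster" for everything.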

These F1 scores (roughly 0.54 to 0.64 across the three folds) aren't too bad for a first attempt!  
We will now create predictions using the testing set and build a submission for the competition.

In [10]:
clf.fit(train_vectors, train_df["target"])

XGBClassifier(base_score=0.5, booster='gbtree', callbacks=None,
              colsample_bylevel=1, colsample_bynode=1, colsample_bytree=1,
              early_stopping_rounds=None, enable_categorical=False,
              eval_metric=None, gamma=0, gpu_id=-1, grow_policy='depthwise',
              importance_type=None, interaction_constraints='',
              learning_rate=0.300000012, max_bin=256, max_cat_to_onehot=4,
              max_delta_step=0, max_depth=6, max_leaves=0, min_child_weight=1,
              missing=nan, monotone_constraints='()', n_estimators=100,
              n_jobs=0, num_parallel_tree=1, predictor='auto', random_state=0,
              reg_alpha=0, reg_lambda=1, ...)

In [11]:
sample_submission = pd.read_csv("/kaggle/input/nlp-getting-started/sample_submission.csv")

In [12]:
sample_submission["target"] = clf.predict(test_vectors)

In [13]:
sample_submission.head()

|   | id | target |
|---|----|--------|
| 0 | 0  | 1      |
| 1 | 2  | 1      |
| 2 | 3  | 1      |
| 3 | 9  | 0      |
| 4 | 11 | 1      |


In [14]:
sample_submission.to_csv("submission.csv", index = False)

We can now submit the above file to the competition!