# Kaggle NLP with Disaster Tweets

Competition: https://www.kaggle.com/competitions/nlp-getting-started/overview

Download the data with the [Kaggle CLI](https://www.kaggle.com/docs/api):

```bash
kaggle competitions download -c nlp-getting-started
```

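Alternatively, download it manually from the competition website. The cells below expect the extracted CSV files under `data/nlp-getting-started/`.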

### Load the data

In [None]:
import pandas as pd

In [None]:
tweets = pd.read_csv('data/nlp-getting-started/train.csv')
tweets.head()

In [None]:
X = tweets['text']
y = tweets['target']

X.shape, y.shape

In [None]:
y.value_counts()
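The raw counts above are easier to compare as proportions, which also help when judging whether F1 is a sensible metric for the class balance. A minimal sketch, using a small made-up label series (`y_demo` is hypothetical, not the competition data):

```python
import pandas as pd

# Toy labels standing in for tweets['target'] (hypothetical values)
y_demo = pd.Series([0, 0, 0, 1, 1], name='target')

# Absolute counts per class, as in the cell above
counts = y_demo.value_counts()

# Relative frequencies per class (they sum to 1)
proportions = y_demo.value_counts(normalize=True)

print(counts)
print(proportions)
```

On the real data the same call would be `y.value_counts(normalize=True)`.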

### CountVectorizer

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

In [None]:
vect = CountVectorizer(stop_words='english', strip_accents='unicode', min_df=5)

In [None]:
vect.fit(X)

In [None]:
len(vect.vocabulary_)

In [None]:
X_vect = vect.transform(X)
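To make the effect of the vectorizer settings concrete, here is a small sketch on a made-up corpus (the documents and the names `docs`, `demo_vect`, `X_demo` are illustrative assumptions). It uses `min_df=2` instead of `min_df=5` only because the toy corpus is tiny:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical mini-corpus; not taken from the competition data
docs = [
    "Earthquake damage reported downtown",
    "Massive earthquake causes damage",
    "Lovely café this morning",
    "Cafe closed",
]

# Same options as above, but min_df=2 so that some terms survive
demo_vect = CountVectorizer(stop_words='english', strip_accents='unicode', min_df=2)
X_demo = demo_vect.fit_transform(docs)

# Terms that appear in at least two documents after lowercasing,
# accent stripping and stop-word removal
vocab = sorted(demo_vect.vocabulary_)
print(vocab)
print(X_demo.toarray())
```

Terms that occur in only one document are dropped, and `café` and `Cafe` are counted as the same token because accents are stripped and text is lowercased.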

### Train the model

In [None]:
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=287)

param_grid = {'C': [0.01, 0.1, 1.0, 10.0]}

grid = GridSearchCV(LogisticRegression(max_iter=500),
                    param_grid=param_grid,
                    cv=cv,
                    scoring='f1',
                    return_train_score=True)

In [None]:
grid.fit(X_vect, y)
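Note that `X_vect` was built from a vocabulary fitted on all training tweets, so every validation fold has already influenced which words pass the `min_df` filter. One possible alternative, sketched here on made-up data, is to wrap the vectorizer and the classifier in a `Pipeline` so the vocabulary is refitted inside each fold. The texts, labels and the names `pipe`, `grid_demo` are assumptions for illustration, and the sketch uses two folds and no `min_df` only because the toy data is so small:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline

# Hypothetical toy data; not the competition tweets
texts = [
    "earthquake damage downtown", "flood warning issued tonight",
    "wildfire evacuation ordered", "earthquake flood emergency",
    "lovely sunny morning", "great coffee today",
    "watching a movie tonight", "sunny weekend coffee",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Vectorizer and classifier are fitted together on each training fold
pipe = Pipeline([
    ('vect', CountVectorizer(stop_words='english', strip_accents='unicode')),
    ('clf', LogisticRegression(max_iter=500)),
])

# Pipeline parameters are addressed as <step name>__<parameter>
grid_demo = GridSearchCV(
    pipe,
    param_grid={'clf__C': [0.1, 1.0]},
    cv=StratifiedKFold(n_splits=2, shuffle=True, random_state=287),
    scoring='f1',
)
grid_demo.fit(texts, labels)
print(grid_demo.best_params_, grid_demo.best_score_)
```

With this layout, `grid_demo.predict` accepts raw text directly, and vectorizer options such as `vect__min_df` could be added to the parameter grid as well.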

In [None]:
grid.best_score_

In [None]:
grid.best_params_

In [None]:
grid.cv_results_['mean_train_score']

In [None]:
grid.cv_results_['mean_test_score']
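The two arrays above are easier to read side by side. A sketch of one way to tabulate them per value of `C`, run here on synthetic data from `make_classification` rather than the tweets (`X_demo`, `y_demo`, `grid_demo` and `summary` are illustrative names):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic stand-in for (X_vect, y); not the competition data
X_demo, y_demo = make_classification(n_samples=200, n_features=20, random_state=0)

grid_demo = GridSearchCV(
    LogisticRegression(max_iter=500),
    param_grid={'C': [0.01, 1.0, 10.0]},
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=287),
    scoring='f1',
    return_train_score=True,  # needed for the mean_train_score column
)
grid_demo.fit(X_demo, y_demo)

# One row per candidate C, with train and validation F1 next to each other
summary = pd.DataFrame(grid_demo.cv_results_)[
    ['param_C', 'mean_train_score', 'mean_test_score']
].copy()
summary['gap'] = summary['mean_train_score'] - summary['mean_test_score']
print(summary)
```

A large positive `gap` for the selected `C` is a common sign of overfitting. On the real search the same table can be built from `grid.cv_results_`.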

### Predict on the test set and prepare submission

In [None]:
tweets_test = pd.read_csv('data/nlp-getting-started/test.csv')
tweets_test.head()

In [None]:
test_pred = grid.predict(vect.transform(tweets_test['text']))

In [None]:
df_submission = pd.DataFrame()
df_submission['id'] = tweets_test['id']
df_submission['target'] = test_pred
df_submission.head()
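Before writing the file, it can be worth checking that the frame has the shape built above: one row per test tweet, the columns `id` and `target`, and only 0/1 predictions. A minimal sketch with a hypothetical helper `check_submission` and made-up frames:

```python
import pandas as pd

def check_submission(submission, test_frame):
    """Hypothetical helper: basic sanity checks on a submission frame."""
    assert list(submission.columns) == ['id', 'target']
    assert len(submission) == len(test_frame)
    assert submission['id'].equals(test_frame['id'])
    assert submission['target'].isin([0, 1]).all()
    return True

# Toy stand-ins for tweets_test and test_pred (hypothetical values)
test_demo = pd.DataFrame({'id': [0, 2, 3], 'text': ['a', 'b', 'c']})
pred_demo = [1, 0, 1]

submission_demo = pd.DataFrame()
submission_demo['id'] = test_demo['id']
submission_demo['target'] = pred_demo

ok = check_submission(submission_demo, test_demo)
print(ok)
```

On the real data the call would be `check_submission(df_submission, tweets_test)`.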

In [None]:
df_submission.to_csv('data/nlp-getting-started/2022-11-16-submission.csv', index=False)

In [None]:
!head data/nlp-getting-started/2022-11-16-submission.csv

### Submission to Kaggle

Submit with the Kaggle CLI:

```bash
kaggle competitions submit -c nlp-getting-started -f data/nlp-getting-started/2022-11-16-submission.csv -m 'sklearn CountVectorizer and LogisticRegression, cv_val f1=0.75'
```

Alternatively, upload the file manually on the competition website.

Submitted 2022-11-16

Training score (F1): 0.870  
Validation score (F1): 0.748  

Public score: 0.78945

Position: 492