## NLP Getting Started Tutorial:

My work following an NLP starter tutorial that is relevant to this project, done so that I can get a better understanding of how to approach NLP tasks.

Arun Agarwal
1/31/2023-2/1/2023

In [1]:
# Library Imports:
import numpy as np
import pandas as pd
from sklearn import feature_extraction, linear_model, model_selection, preprocessing



In [2]:
train_df = pd.read_csv("../Data/train.csv")
test_df = pd.read_csv("../Data/test.csv")

#### Building Vectors:
We use scikit-learn's CountVectorizer to count the words in each tweet and turn those counts into numeric vectors that our machine learning model can process.

In [6]:
count_vectorizer = feature_extraction.text.CountVectorizer()

train_vectors = count_vectorizer.fit_transform(train_df['text'])

# For test_vectors, we call only .transform() so that the train and test vectors use the same set of tokens (the vocabulary learned from the training data)
test_vectors = count_vectorizer.transform(test_df["text"])
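
To make the fit_transform/transform distinction concrete, here is a minimal sketch on a toy corpus. The sentences and the `toy_*` names are made up for illustration and are not part of the tutorial data.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy stand-ins for the tweet text (hypothetical, not from the dataset)
toy_train = ["forest fire near the town", "I love this sunny day"]
toy_test = ["huge fire downtown", "sunny town"]

toy_vectorizer = CountVectorizer()

# fit_transform learns the vocabulary from the training text and counts tokens
toy_train_vectors = toy_vectorizer.fit_transform(toy_train)

# transform reuses that vocabulary; words never seen in training
# ("huge", "downtown") are simply ignored
toy_test_vectors = toy_vectorizer.transform(toy_test)

print(sorted(toy_vectorizer.vocabulary_))
print(toy_train_vectors.toarray())
print(toy_test_vectors.toarray())
```

Both matrices end up with one column per training token, which is what lets a model fitted on the training vectors score the test vectors.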

#### Building Model:
We use the words contained in each tweet as the predictor variables for whether a tweet is a real (1) or fake (0) disaster tweet. This tutorial assumes a linear relationship between the word counts and the target.

In [8]:
# Since the vectors are big, we want to push the model's weights toward 0 without completely discounting different words
# Thus, we use ridge regression (here through RidgeClassifier)
clf = linear_model.RidgeClassifier()
# cv = 3 means we are using Three-Fold Cross Validation
scores = model_selection.cross_val_score(clf, train_vectors, train_df['target'], cv = 3, scoring = "f1")
print("Scores: ", scores)

Scores:  [0.59421842 0.56498283 0.64149093]
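
The shrinkage effect mentioned in the comments can be seen on a small example. This is a rough sketch on a made-up corpus (the `shrink_*` names and the `weight_norm` helper are hypothetical): it fits RidgeClassifier with a weak and a strong penalty and compares the size of the learned weights. The `alpha` parameter controls the penalty strength; the cell above leaves it at scikit-learn's default of 1.0.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import RidgeClassifier

# Made-up mini corpus, only for illustrating the penalty
shrink_texts = [
    "fire burning in the forest",
    "flood warning for the city",
    "earthquake damage reported downtown",
    "lovely weather for a picnic",
    "my new phone is on fire lol",
    "great movie last night",
]
shrink_labels = [1, 1, 1, 0, 0, 0]

# Dense array is fine here because the toy vocabulary is tiny
X_shrink = CountVectorizer().fit_transform(shrink_texts).toarray()


def weight_norm(alpha):
    """Fit a ridge classifier and return the L2 norm of its weights."""
    model = RidgeClassifier(alpha=alpha)
    model.fit(X_shrink, shrink_labels)
    return float(np.linalg.norm(model.coef_))


weak_penalty_norm = weight_norm(0.01)
strong_penalty_norm = weight_norm(100.0)
print(weak_penalty_norm, strong_penalty_norm)
```

A larger `alpha` pulls the weights closer to 0, but it does not force any of them to be exactly 0, so no word is dropped outright.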


Ways to improve this score: try TF-IDF, LSA, LSTM/RNNs, etc.
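
As a possible starting point for the first idea, here is a minimal sketch that swaps the raw counts for TF-IDF weights. It has not been run on the tutorial data, so no score is claimed for it; the `tfidf_cv_scores` helper and the toy corpus are hypothetical. Wrapping the vectorizer and classifier in a pipeline means the TF-IDF vocabulary is re-learned inside each cross-validation fold.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import RidgeClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline


def tfidf_cv_scores(texts, labels, cv=3):
    """Cross-validated F1 scores for a TF-IDF + ridge classifier pipeline."""
    pipeline = make_pipeline(TfidfVectorizer(), RidgeClassifier())
    return cross_val_score(pipeline, texts, labels, cv=cv, scoring="f1")


# Made-up corpus so the sketch runs on its own
tfidf_texts = [
    "forest fire spreading near the highway",
    "flood waters rising in the city",
    "earthquake shakes the coast tonight",
    "wildfire evacuation ordered for the town",
    "storm damage leaves homes without power",
    "tornado warning issued for the county",
    "this new album is fire",
    "flooded with emails at work today",
    "my heart quakes when she smiles",
    "great pizza in town last night",
    "storming through my homework",
    "lovely sunny day at the park",
]
tfidf_labels = [1] * 6 + [0] * 6

toy_scores = tfidf_cv_scores(tfidf_texts, tfidf_labels)
print(toy_scores)
```

On the notebook's data, the same helper could be called as `tfidf_cv_scores(train_df['text'], train_df['target'])` and compared against the scores above.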

#### Predictions and Submission File:

In [10]:
clf.fit(train_vectors, train_df['target'])
tutorial_submission = pd.read_csv("../Data/sample_submission.csv")
tutorial_submission['target'] = clf.predict(test_vectors)
tutorial_submission.to_csv("../Result Files/submission.csv", index = False)