## Discussion Week 7

Author: Eva Newby

https://maro406.github.io/eds-232-machine-learning/discussion/week7.html

## Introduction
In this week’s discussion section, we will use a dataset of tweets about different disasters. Each observation (tweet) has an outcome variable that classifies the disaster mentioned in the tweet as real (1) or not (0). Rather than having multiple predictors as our X, we have a single predictor: the tweet. However, each individual word can be thought of as its own predictor, each contributing to the prediction of our outcome variable.

## Data
The dataset this week is commonly used for NLP (Natural Language Processing). The dataset can be found here. Disasters.csv includes a text variable, which contains the tweet as a string. Our target variable, target, is a binary outcome: 1 indicates that the tweet discusses a real disaster, and 0 indicates that it does not.

In [7]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, f1_score, accuracy_score, confusion_matrix
from sklearn.metrics import roc_curve, auc, roc_auc_score
import re
import string
import matplotlib.pyplot as plt
import seaborn as sns
import os

In [14]:
fp = '/Users/ejnewby/MEDS/eds232-ml/EDS232-discussion/EDS232-dicsussion/data/disaster.csv - disaster.csv'

print(os.path.exists(fp))


True


In [15]:
# Read in the data
disaster = pd.read_csv(fp)

In [16]:
# Cleaning text data
def preprocess(text):
    text = text.lower()  # convert text to lower case
    text = text.strip()  # remove leading/trailing spaces
    text = re.sub(r'<.*?>', '', text)  # remove HTML tags
    text = re.sub(r'[^\w\s]', '', text)  # remove punctuation
    text = re.sub(r'\[[0-9]*\]', ' ', text)  # remove reference numbers
    text = re.sub(r'\d', ' ', text)  # replace digits with spaces
    text = re.sub(r'\s+', ' ', text)  # collapse multiple spaces into a single space
    return text
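
As a quick check of what the cleaning steps do, the sketch below restates the same function and runs it on one made-up string. The `sample` string is an assumption for illustration, not a row from the dataset.

```python
import re

def preprocess(text):
    # Same cleaning steps, in the same order, as the function above
    text = text.lower()
    text = text.strip()
    text = re.sub(r'<.*?>', '', text)        # HTML tags
    text = re.sub(r'[^\w\s]', '', text)      # punctuation
    text = re.sub(r'\[[0-9]*\]', ' ', text)  # reference numbers
    text = re.sub(r'\d', ' ', text)          # digits
    text = re.sub(r'\s+', ' ', text)         # repeated whitespace
    return text

# Made-up example containing HTML, punctuation, digits, and capitals
sample = "<b>BREAKING:</b> 13,000 people evacuated near La Ronge!"
cleaned = preprocess(sample)
print(cleaned)
```

Because punctuation is stripped before digits are replaced, "13,000" first becomes "13000" and then a run of spaces, which the final step collapses into one.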

In [17]:
# Apply string cleaning to text variable
disaster['clean_text'] = disaster['text'].apply(preprocess)
disaster.head()

| Unnamed: 0 | id | keyword | location | text | target | clean_text |
|---|---|---|---|---|---|---|
| 0 | 1 | | | Our Deeds are the Reason of this #earthquake M... | 1 | our deeds are the reason of this earthquake ma... |
| 1 | 4 | | | Forest fire near La Ronge Sask. Canada | 1 | forest fire near la ronge sask canada |
| 2 | 5 | | | All residents asked to 'shelter in place' are ... | 1 | all residents asked to shelter in place are be... |
| 3 | 6 | | | 13,000 people receive #wildfires evacuation or... | 1 | people receive wildfires evacuation orders in... |
| 4 | 7 | | | Just got sent this photo from Ruby #Alaska as ... | 1 | just got sent this photo from ruby alaska as s... |


### What about stop words?

In [22]:
# Show that TfidfVectorizer drops English stop words when stop_words = 'english' is set
example_sentence = ["On March 5th, I will crush my capstone presentation with my awesome team."]

vectorizer_english = TfidfVectorizer(stop_words = 'english')
X_english = vectorizer_english.fit_transform(example_sentence)

print("Remaining Words:")
print(vectorizer_english.get_feature_names_out())

Remaining Words:
['5th' 'awesome' 'capstone' 'crush' 'march' 'presentation' 'team']


### Logistic Regression

In [23]:
# Split into test and train
X_train, X_test, y_train, y_test = train_test_split(disaster['clean_text'], disaster['target'], test_size = ...)
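
The split can be sketched on toy data as follows. The `test_size = 0.25`, `stratify`, and `random_state` arguments here are assumptions chosen for illustration, not the values used in this notebook. Stratifying on the target keeps the share of real and non-real tweets similar in both splits.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the disaster data frame (illustrative only)
toy = pd.DataFrame({
    "clean_text": [f"tweet {i}" for i in range(8)],
    "target": [0, 1, 0, 1, 0, 1, 0, 1],
})

X_tr, X_te, y_tr, y_te = train_test_split(
    toy["clean_text"],
    toy["target"],
    test_size=0.25,          # hypothetical value, not from the notebook
    stratify=toy["target"],  # preserve class balance in both splits
    random_state=42,         # make the split reproducible
)

print(len(X_tr), len(X_te))
print(y_te.value_counts().to_dict())
```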

In [None]:
# Vectorize words
tfidf_vectorizer = TfidfVectorizer(stop_words = 'english')
X_train_tfdif = tfidf_vectorizer.fit_transform(X_train)
X_test_tfdif = tfidf_vectorizer.transform(X_test)
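
The vectorizer is fit on the training text only and then reused on the test text, so the vocabulary and IDF weights never see test data. A minimal sketch with made-up documents (`train_docs` and `test_docs` are illustrative):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up documents (illustrative only)
train_docs = ["flood warning downtown", "flood of memes today"]
test_docs = ["earthquake warning issued"]

vec = TfidfVectorizer()
X_tr = vec.fit_transform(train_docs)  # learns the vocabulary and IDF weights
X_te = vec.transform(test_docs)       # reuses them without refitting

# Both matrices share the same columns (the training vocabulary)
print(X_tr.shape, X_te.shape)

# Words seen only in the test set are not in the vocabulary and are ignored
print("earthquake" in vec.vocabulary_)
```

Calling `fit_transform` on the test set instead would build a different vocabulary, and the resulting columns would no longer line up with what the model was trained on.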

In [None]:
# Initialize a logistic regression model and fit to vectorized training data
lr_model = LogisticRegression(random_state = 42)
lr_model.fit(X_train_tfdif, y_train)
y_pred = lr_model.predict(X_test_tfdif)
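
Besides hard class labels from `predict`, a fitted logistic regression also exposes class probabilities through `predict_proba`, which is what ROC and AUC calculations need. A minimal sketch on made-up tweets (the texts, labels, and variable names are assumptions for illustration):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Made-up training tweets and labels (1 = real disaster, 0 = not)
texts = [
    "wildfire evacuation ordered",
    "earthquake damages homes",
    "this mixtape is fire",
    "my exam was a disaster lol",
]
labels = [1, 1, 0, 0]

vec = TfidfVectorizer()
X = vec.fit_transform(texts)
model = LogisticRegression(random_state=42).fit(X, labels)

# Score a new, unseen tweet
new = vec.transform(["evacuation after earthquake"])
pred = model.predict(new)         # hard label: 0 or 1
proba = model.predict_proba(new)  # columns follow model.classes_

print(pred, proba)
```

The predicted label is simply the class with the larger probability, so a different decision threshold can be applied to the second column of `proba` if false negatives and false positives carry different costs.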

## Logistic Regression Results
